Python Requests with JavaScript

Requests-HTML: HTML Parsing for Humans!¶

This library intends to make parsing HTML (e.g. scraping the web) as simple and intuitive as possible.

When using this library you automatically get:

  • Full JavaScript support!
  • CSS Selectors (a.k.a jQuery-style, thanks to PyQuery).
  • XPath Selectors, for the faint at heart.
  • Mocked user-agent (like a real web browser).
  • Automatic following of redirects.
  • Connection–pooling and cookie persistence.
  • The Requests experience you know and love, with magical parsing abilities.

Installation¶

$ pipenv install requests-html ✨🍰✨

Only Python 3.6 and later are supported.

Tutorial & Usage¶

Make a GET request to ‘python.org’, using Requests:

>>> from requests_html import HTMLSession
>>> session = HTMLSession()
>>> r = session.get('https://python.org/')

Grab a list of all links on the page, as-is (anchors excluded), with r.html.links.

Grab a list of all links on the page, in absolute form (anchors excluded), with r.html.absolute_links.
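The difference between as-is and absolute links can be illustrated with the standard library alone. The sketch below is not how requests-html is implemented; the LinkCollector class, the page markup, and the base URL are all assumptions made up for the illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags, skipping in-page anchors."""

    def __init__(self):
        super().__init__()
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Skip missing or empty hrefs and pure anchors such as "#top".
        if href and not href.startswith("#"):
            self.links.add(href)


# Hypothetical page content and base URL, for illustration only.
page = (
    '<a href="/about/">About</a> '
    '<a href="#top">Top</a> '
    '<a href="https://docs.python.org/3/">Docs</a>'
)
base_url = "https://www.python.org/"

collector = LinkCollector()
collector.feed(page)

# As-is links are exactly what the markup contains.
as_is_links = collector.links
# Absolute links resolve each href against the page's base URL.
absolute_links = {urljoin(base_url, link) for link in as_is_links}

print(sorted(as_is_links))
print(sorted(absolute_links))
```

Relative paths such as /about/ only become usable for a follow-up request once they are resolved against the base URL, which is why the absolute form is usually the one to crawl with.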

Select an Element with a CSS Selector (learn more):

>>> about = r.html.find('#about', first=True) 

Grab an Element’s text contents:

>>> print(about.text)
About Applications Quotes Getting Started Help Python Brochure

Introspect an Element’s attributes (learn more) with about.attrs.

Render out an Element’s HTML with about.html.

Select an Element list within an Element with about.find('a').

Search for links within an element with about.absolute_links.

Search for text on the page:

>>> r.html.search('Python is a {} language')[0]
programming
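Search templates of this kind use {} for a positional field and {name} for a named one. As a rough illustration of the idea, here is a standard-library approximation built on regular expressions. The template_search function is hypothetical and much simpler than the parse library that requests-html relies on (no type conversions, no format specifiers).

```python
import re


def template_search(template, text):
    """Find the first match of a '{}' / '{name}' template inside text.

    Returns (positional_values, named_values), or None if nothing matches.
    """
    pattern = ""
    position = 0
    # Walk over the placeholders, escaping the literal text between them.
    for match in re.finditer(r"\{(\w*)\}", template):
        pattern += re.escape(template[position:match.start()])
        name = match.group(1)
        pattern += f"(?P<{name}>.+?)" if name else "(.+?)"
        position = match.end()
    pattern += re.escape(template[position:])

    found = re.search(pattern, text)
    if found is None:
        return None
    named = found.groupdict()
    # Positional values are the groups that carry no name.
    named_indexes = set(found.re.groupindex.values())
    positional = [
        value
        for index, value in enumerate(found.groups(), start=1)
        if index not in named_indexes
    ]
    return positional, named


print(template_search("Python is a {} language", "Python is a programming language."))
print(template_search("only {months} months!", "Python 2 will retire in only 25 months!"))
```

Positional fields are read by index and named fields by key, which mirrors the [0] and ['months'] lookups used in this tutorial.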

More complex CSS Selector example (copied from Chrome dev tools):

>>> r = session.get('https://github.com/')
>>> sel = 'body > div.application-main > div.jumbotron.jumbotron-codelines > div > div > div.col-md-7.text-center.text-md-left > p'
>>> print(r.html.find(sel, first=True).text)
GitHub is a development platform inspired by the way you work. From open source to business, you can host and review code, manage projects, and build software alongside millions of other developers.

JavaScript Support¶

Let’s grab some text that’s rendered by JavaScript:

>>> r = session.get('http://python-requests.org/')
>>> r.html.render()
>>> r.html.search('Python 2 will retire in only {months} months!')['months']
'<time>25</time>'

Note: the first time you ever run the render() method, it will download Chromium into your home directory (e.g. ~/.pyppeteer/). This only happens once.

Using without Requests¶

You can also use this library without Requests:

>>> from requests_html import HTML
>>> doc = """<a href='https://httpbin.org'>"""
>>> html = HTML(html=doc)
>>> html.links
{'https://httpbin.org'}

API Documentation¶

Main Classes¶

These classes are the main interface to requests-html:

class requests_html. HTML ( *, url='https://example.org/', html, default_encoding='utf-8' ) → None¶

An HTML document, ready for parsing.

absolute_links ¶

All found links on page, in absolute form (learn more).

base_url ¶

The base URL for the page. Supports the <base> tag (learn more).

encoding ¶

The encoding string to be used, extracted from the HTML and HTMLResponse headers.

find ( selector: str, first: bool = False, _encoding: str = None ) ¶

Given a CSS Selector, returns a list of Element objects.

See W3School’s CSS Selectors Reference for more details.

If first is True , only returns the first Element found.

full_text ¶

The full text content (including links) of the Element or HTML.

html ¶

Unicode representation of the HTML content (learn more).

links ¶

All found links on page, in as-is form.

lxml ¶

lxml representation of the Element or HTML.

pq ¶

PyQuery representation of the Element or HTML.

raw_html ¶

Bytes representation of the HTML content (learn more).

render ( retries: int = 8, script: str = None, scrolldown=False, sleep: int = 0 ) ¶

Reloads the response in Chromium, and replaces HTML content with an updated version, with JavaScript executed.

If scrolldown is specified, the page will scroll down the specified number of times, sleeping for the specified number of seconds after each scroll (e.g. scrolldown=10, sleep=1).

If just sleep is provided, the rendering will wait n seconds, before returning.

If script is specified, it will execute the provided JavaScript at runtime. Example:

script = """
    () => {
        return {
            width: document.documentElement.clientWidth,
            height: document.documentElement.clientHeight,
            deviceScaleFactor: window.devicePixelRatio,
        }
    }
"""

Returns the return value of the executed script, if one is provided, e.g. from r.html.render(script=script).

Warning: the first time you run this method, it will download Chromium into your home directory (~/.pyppeteer).

search ( template: str ) → parse.Result¶

Searches the Element for the given parse template.

search_all ( template: str ) → parse.Result¶

Searches the Element (multiple times) for the given parse template.

text ¶

The text content of the Element or HTML.

xpath ( selector: str, first: bool = False, _encoding: str = None ) ¶

Given an XPath selector, returns a list of Element objects.

If a sub-selector is specified (e.g. //a/@href ), a simple list of results is returned.

See W3School’s XPath Examples for more details.

If first is True , only returns the first Element found.

class requests_html. Element ( *, element, url, default_encoding ) → None¶

absolute_links ¶

All found links on page, in absolute form (learn more).

attrs ¶

Returns a dictionary of the attributes of the Element (learn more).

base_url ¶

The base URL for the page. Supports the <base> tag (learn more).

encoding ¶

The encoding string to be used, extracted from the HTML and HTMLResponse headers.

find ( selector: str, first: bool = False, _encoding: str = None ) ¶

Given a CSS Selector, returns a list of Element objects.

See W3School’s CSS Selectors Reference for more details.

If first is True , only returns the first Element found.

full_text ¶

The full text content (including links) of the Element or HTML.

html ¶

Unicode representation of the HTML content (learn more).

links ¶

All found links on page, in as-is form.

lxml ¶

lxml representation of the Element or HTML.

pq ¶

PyQuery representation of the Element or HTML.

raw_html ¶

Bytes representation of the HTML content (learn more).

search ( template: str ) → parse.Result¶

Searches the Element for the given parse template.

search_all ( template: str ) → parse.Result¶

Searches the Element (multiple times) for the given parse template.

text ¶

The text content of the Element or HTML.

xpath ( selector: str, first: bool = False, _encoding: str = None ) ¶

Given an XPath selector, returns a list of Element objects.

If a sub-selector is specified (e.g. //a/@href ), a simple list of results is returned.

See W3School’s XPath Examples for more details.

If first is True , only returns the first Element found.

Utility Functions¶

Returns a random user-agent, if not requested one of a specific style. Defaults to a Chrome-style User-Agent.

HTML Sessions¶

These sessions are for making HTTP requests:

class requests_html. HTMLSession ( mock_browser=True, *args, **kwargs ) ¶

A consumable session, for cookie persistence and connection pooling, amongst other things.

close ( ) ¶

Closes all adapters and, as such, the session.

delete ( url, **kwargs ) ¶

Sends a DELETE request. Returns Response object.

get ( url, **kwargs ) ¶

Sends a GET request. Returns Response object.

get_adapter ( url ) ¶

Returns the appropriate connection adapter for the given URL.

Return type: requests.adapters.BaseAdapter

get_redirect_target ( resp ) ¶

Receives a Response. Returns a redirect URI or None.

head ( url, **kwargs ) ¶

Sends a HEAD request. Returns Response object.

merge_environment_settings ( url, proxies, stream, verify, cert ) ¶

Check the environment and merge it with some settings.

Return type: dict

mount ( prefix, adapter ) ¶

Registers a connection adapter to a prefix.

Adapters are sorted in descending order by prefix length.
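Sorting by descending prefix length means the most specific prefix wins when a URL is matched against the mounted adapters. A minimal sketch of that lookup, assuming a plain dict of prefixes; find_adapter and the adapter names are hypothetical stand-ins, not part of the Requests API:

```python
def find_adapter(adapters, url):
    """Return the adapter whose prefix is the longest match for url."""
    # Try longer prefixes first so the most specific one wins.
    for prefix in sorted(adapters, key=len, reverse=True):
        if url.lower().startswith(prefix.lower()):
            return adapters[prefix]
    raise ValueError(f"No connection adapter found for {url!r}")


# Hypothetical adapter names standing in for real adapter objects.
adapters = {
    "http://": "default-http-adapter",
    "https://": "default-https-adapter",
    "https://api.example.com/": "retrying-adapter",
}

print(find_adapter(adapters, "https://api.example.com/v1/items"))
print(find_adapter(adapters, "https://example.com/"))
```

With this ordering, an adapter mounted on a specific host takes precedence over the generic https:// adapter without any extra configuration.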

options ( url, **kwargs ) ¶

Sends an OPTIONS request. Returns Response object.

patch ( url, data=None, **kwargs ) ¶

Sends a PATCH request. Returns Response object.

  • url – URL for the new Request object.
  • data – (optional) Dictionary, bytes, or file-like object to send in the body of the Request .
  • **kwargs – Optional arguments that request takes.

post ( url, data=None, json=None, **kwargs ) ¶

Sends a POST request. Returns Response object.

  • url – URL for the new Request object.
  • data – (optional) Dictionary, bytes, or file-like object to send in the body of the Request .
  • json – (optional) json to send in the body of the Request .
  • **kwargs – Optional arguments that request takes.

prepare_request ( request ) ¶

Constructs a PreparedRequest for transmission and returns it. The PreparedRequest has settings merged from the Request instance and those of the Session.

Parameters: request – Request instance to prepare with this session’s settings.
Return type: requests.PreparedRequest

put ( url, data=None, **kwargs ) ¶

Sends a PUT request. Returns Response object.

  • url – URL for the new Request object.
  • data – (optional) Dictionary, bytes, or file-like object to send in the body of the Request .
  • **kwargs – Optional arguments that request takes.

rebuild_auth ( prepared_request, response ) ¶

When being redirected we may want to strip authentication from the request to avoid leaking credentials. This method intelligently removes and reapplies authentication where possible to avoid credential loss.
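The idea of stripping credentials on redirect can be sketched as follows. This is a deliberately simplified illustration, assuming that any change of hostname counts as a different site; the headers_for_redirect helper is hypothetical, and the real logic in Requests considers more than the hostname.

```python
from urllib.parse import urlparse


def headers_for_redirect(headers, original_url, redirect_url):
    """Return a copy of headers that is safe to send to redirect_url."""
    new_headers = dict(headers)
    old_host = urlparse(original_url).hostname
    new_host = urlparse(redirect_url).hostname
    # Simplifying assumption: any change of host counts as a new site,
    # so the credentials must not follow the redirect.
    if old_host != new_host:
        new_headers.pop("Authorization", None)
    return new_headers


# Hypothetical headers and URLs, for illustration only.
headers = {"Authorization": "Bearer secret-token", "Accept": "text/html"}

same_site = headers_for_redirect(
    headers, "https://example.com/a", "https://example.com/b"
)
other_site = headers_for_redirect(
    headers, "https://example.com/a", "https://other.example.net/b"
)

print(same_site)
print(other_site)
```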

rebuild_method ( prepared_request, response ) ¶

When being redirected we may want to change the method of the request based on certain specs or browser behavior.
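As an illustration of the kind of rules involved, the sketch below follows common browser behaviour for 301, 302 and 303 responses. The method_after_redirect function is a hypothetical helper, and the exact rules applied by Requests may differ between versions.

```python
def method_after_redirect(method, status_code):
    """Return the HTTP method to use when following a redirect."""
    # 303 See Other: everything except HEAD becomes GET.
    if status_code == 303 and method != "HEAD":
        return "GET"
    # 302 Found: browsers switch to GET, so do the same.
    if status_code == 302 and method != "HEAD":
        return "GET"
    # 301 Moved Permanently: browsers turn a POST into a GET.
    if status_code == 301 and method == "POST":
        return "GET"
    # 307, 308 and everything else keep the original method.
    return method


for method, status in [("POST", 303), ("HEAD", 302), ("POST", 301), ("PUT", 301), ("POST", 307)]:
    print(method, status, "->", method_after_redirect(method, status))
```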

rebuild_proxies ( prepared_request, proxies ) ¶

This method re-evaluates the proxy configuration by considering the environment variables. If we are redirected to a URL covered by NO_PROXY, we strip the proxy configuration. Otherwise, we set missing proxy keys for this URL (in case they were stripped by a previous redirect).

This method also replaces the Proxy-Authorization header where necessary.

Return type: dict
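The NO_PROXY check described above can be sketched with a simple suffix match. This is a minimal approximation under stated assumptions: select_proxies is a hypothetical helper, no_proxy is passed in as a string rather than read from the environment, and details such as ports, IP ranges and the Proxy-Authorization header are ignored.

```python
from urllib.parse import urlparse


def select_proxies(url, proxies, no_proxy):
    """Return the proxy mapping to use for url.

    no_proxy is a comma-separated list of host suffixes, in the style of
    the NO_PROXY environment variable.
    """
    host = (urlparse(url).hostname or "").lower()
    for entry in no_proxy.split(","):
        suffix = entry.strip().lstrip(".").lower()
        if suffix and (host == suffix or host.endswith("." + suffix)):
            # The host is covered by NO_PROXY: strip the proxy configuration.
            return {}
    # Otherwise keep (or restore) the configured proxies.
    return dict(proxies)


# Hypothetical proxy configuration, for illustration only.
proxies = {"https": "http://proxy.example.com:3128"}

print(select_proxies("https://internal.example.org/x", proxies, "example.org,localhost"))
print(select_proxies("https://www.python.org/", proxies, "example.org,localhost"))
```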

resolve_redirects ( resp, req, stream=False, timeout=None, verify=True, cert=None, proxies=None, yield_requests=False, **adapter_kwargs ) ¶

Receives a Response. Returns a generator of Responses or Requests.

send ( request, **kwargs ) ¶

Send a given PreparedRequest.

Return type: requests.Response


How to use Python Requests with JavaScript pages?


Sometimes, we want to use Python Requests with JavaScript pages.

In this article, we’ll look at how to use Python Requests with JavaScript pages.

How to use Python Requests with JavaScript pages?

To use Python Requests with JavaScript pages, we can create an HTMLSession with requests_html.

For instance, we write:

from requests_html import HTMLSession

session = HTMLSession()
r = session.get('http://www.example.com')
r.html.render()

We create an HTMLSession and then call get with the URL we want to make a request to.

Then we call r.html.render to render the page with JavaScript.

Conclusion

To use Python Requests with JavaScript pages, we can create an HTMLSession with requests_html and call r.html.render on the response.


By John Au-Yeung

Web developer specializing in React, Vue, and front end development.

One reply on “How to use Python Requests with JavaScript pages?”

Great article but I’d mention this is only available for Python 3.6
