- How to Use Requests-HTML Library in Python
- How to install Requests-HTML?
- Making a get Request with Requests-HTML?
- Getting the contents of the response from the get method?
- Find the mode of the request from the response?
- Find the status code of the response by using requests-html?
- Find the header information by using requests-HTML?
- Getting all links of a webpage by using requests-HTML?
- How to Parse HTML from a Local File with Requests-Html?
- Find the title tag in Html Page by using Requests-Html?
- Get the Javascript generated contents with requests-html
- Python Save Html File From Url Example
- 1. Steps To Use Python Requests Module To Get A Web Page Content By URL.
How to Use Requests-HTML Library in Python
Requests-HTML is a Python library created to make HTML parsing as easy as possible. When scraping a webpage, we often need to parse its HTML to extract the data we want, and Requests-HTML is a good candidate for this task.
How to install Requests-HTML?
In order to use Requests-HTML, we first have to install it. We can do this with pip by running the following command:

    pip install requests-html

Note: Python 3.6 or later is required to install this library.
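To confirm that the installation worked, one option is to ask Python for the installed version. The sketch below is a minimal illustration using only the standard library (`importlib.metadata`, available from Python 3.8); the helper name `installed_version` is hypothetical and not part of Requests-HTML.

```python
from importlib import metadata
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


# "requests-html" is the name the package is published under on PyPI.
version = installed_version("requests-html")
if version is None:
    print("requests-html is not installed")
else:
    print(f"requests-html {version} is installed")
```

The same helper works for any other distribution name, such as `requests`.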
Making a get Request with Requests-HTML?
The following code makes a GET request to a website; in our case, blogger.com. The call returns a response object. Printing that object shows <Response [200]>, which means the GET request was successful.
    # We have to import HTMLSession
    from requests_html import HTMLSession

    # Create the HTMLSession object
    session = HTMLSession()

    # Call the get method of the HTMLSession object
    request = session.get('https://blogger.com')
Getting the contents of the response from the get method?
The body of the response can be read from the content attribute of the response object returned by the get method. The value is a bytes object, so it is printed as a byte string (b'...'). The code below prints the content of the response object.
    from requests_html import HTMLSession

    session = HTMLSession()
    request = session.get('https://blogger.com')

    data = request.content
    print(data)
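Because `content` holds raw bytes, it usually has to be decoded before it can be handled as text. The sketch below shows the idea without a network call; the `raw` value is made-up sample data standing in for `request.content`, and UTF-8 is assumed as the encoding. On a real response, the `text` attribute performs this decoding for you.

```python
# Made-up sample bytes standing in for request.content (assumption).
raw = b"<html><head><title>Caf\xc3\xa9</title></head></html>"

# Decode the bytes to a str; UTF-8 is assumed here.
text = raw.decode("utf-8")

print(type(raw))   # the undecoded payload is a bytes object
print(type(text))  # the decoded payload is a str
print(text)
```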
Find the mode of the request from the response?
Sometimes we need to know which HTTP method (for example GET or POST) was used for the request that produced a response. The request attribute of the response object holds the request that was sent, and printing it shows the method, for example <PreparedRequest [GET]>. We can do it as follows.
    data = request.request
    print(data)
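Printing `request.request` shows a prepared request object rather than a plain string. If only the HTTP verb is needed, it is available on the `method` attribute. The sketch below builds a prepared request offline with the underlying `requests` library (which Requests-HTML is built on), so nothing is sent over the network; the URL is only an example.

```python
import requests

# Build a request and prepare it without sending it anywhere.
prepared = requests.Request("GET", "https://blogger.com").prepare()

print(prepared)         # the repr includes the HTTP method
print(prepared.method)  # just the method name as a string
```

With a real response, the equivalent lookup would be `request.request.method`.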
Find the status code of the response by using requests-html?
The status code tells us the outcome of the request. A status code of 200 means the request succeeded. We can read it from the status_code attribute of the response object.
    data = request.status_code
    print(data)
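A numeric status code is easier to interpret alongside its standard reason phrase. As a minimal sketch, the hypothetical helper `describe_status` below uses the standard library's `http.HTTPStatus` to label a code and to say whether it falls in the 2xx success range. It works on any integer, such as the value of `request.status_code`.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a short description such as '200 OK (success)'."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Codes that are not registered in HTTPStatus.
        phrase = "Unknown"
    outcome = "success" if 200 <= code < 300 else "not a success"
    return f"{code} {phrase} ({outcome})"


for code in (200, 301, 404, 500):
    print(describe_status(code))
```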
Find the header information by using requests-HTML?
The response headers carry metadata about the response, such as the content type and the server. We can read all of them from the headers attribute of the response object.
    data = request.headers
    print(data)
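The `headers` attribute behaves like a dictionary whose keys are matched case-insensitively. The sketch below demonstrates this offline with `requests.structures.CaseInsensitiveDict`, the class that `requests` uses for response headers; the header values are invented sample data, not output from blogger.com.

```python
from requests.structures import CaseInsensitiveDict

# Invented sample headers standing in for request.headers (assumption).
headers = CaseInsensitiveDict({
    "Content-Type": "text/html; charset=UTF-8",
    "Server": "ExampleServer",
})

# Lookups ignore the case of the header name.
print(headers["content-type"])
print(headers["CONTENT-TYPE"])

# .get() returns None (or a supplied default) when a header is absent.
print(headers.get("Content-Length"))
```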
Getting all links of a webpage by using requests-HTML?
To get all the links (anchor tags) on a page, we can use the html.links attribute. It returns a set of all the links found in the response from the requested site. The code below prints the type of the result, which is a set.
    from requests_html import HTMLSession

    session = HTMLSession()
    request = session.get('https://google.com/')

    data = request.html.links
    print(type(data))
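The links collected from a page are returned as written in the HTML, so some of them may be relative paths. One way to resolve them is `urllib.parse.urljoin` from the standard library, sketched below. The `page_url` and `links` values are made-up sample data standing in for the requested URL and `request.html.links`. Requests-HTML also exposes an `absolute_links` attribute for the same purpose.

```python
from urllib.parse import urljoin

# Made-up sample data standing in for the page URL and request.html.links.
page_url = "https://example.com/docs/index.html"
links = {"/about", "guide.html", "https://www.python.org/"}

# Resolve every link against the page URL; absolute links are left unchanged.
absolute = {urljoin(page_url, link) for link in links}

for link in sorted(absolute):
    print(link)
```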
How to Parse HTML from a Local File with Requests-Html?
We can also read HTML from a local file and parse it with Requests-HTML, as follows.
    from requests_html import HTML

    with open("htmlfile.html") as htmlfile:
        sourcecode = htmlfile.read()

    parsedHtml = HTML(html=sourcecode)
    print(parsedHtml)
Find the title tag in Html Page by using Requests-Html?
We can find any tag by using the find method. For example, to find the title tag, we can do it as follows.
    from requests_html import HTML

    with open("htmlfile.html") as htmlfile:
        sourcecode = htmlfile.read()

    parsedHtml = HTML(html=sourcecode)
    print(parsedHtml.find("title"))
This prints a list of all the title tags in the HTML. Each item in the list is an Element object, so we can select any element by its index in the list.
Get the Javascript generated contents with requests-html
To get JavaScript-generated content, we have to wait until the page's scripts have run. Requests-HTML gives us the ability to do this: calling the render method loads the page in a headless browser and executes its JavaScript. It takes some time, especially on the first call, when a copy of Chromium is downloaded, but it does the job. JavaScript rendering is a feature that sets Requests-HTML apart from most other Python HTML parsing libraries.
    from requests_html import HTML

    with open("htmlfile.html") as htmlfile:
        sourcecode = htmlfile.read()

    parsedHtml = HTML(html=sourcecode)
    parsedHtml.render()
Python Save Html File From Url Example
This article explains, step by step, how to use the Python requests module to retrieve the content of a web page by its URL and then save that content to a local file.
1. Steps To Use Python Requests Module To Get A Web Page Content By URL.
- Open a terminal and run the command pip show requests to make sure the Python requests module has been installed.
    $ pip show requests
    Name: requests
    Version: 2.22.0
    Summary: Python HTTP for Humans.
    Home-page: http://python-requests.org
    Author: Kenneth Reitz
    Author-email: [email protected]
    License: Apache 2.0
    Location: /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages
    Requires: chardet, idna, urllib3, certifi
    Required-by:
    $ python
    Python 3.7.3 (v3.7.3:ef4ec6ed12, Mar 25 2019, 16:52:21)
    [Clang 6.0 (clang-600.0.57)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import requests
    >>> web_page_url = "http://www.google.com"
    >>> # The original headers definition is not shown; this value is an assumed placeholder.
    >>> headers = {"User-Agent": "Mozilla/5.0"}
    >>> response = requests.get(url=web_page_url, headers=headers)
    >>> page_content = response.text
    >>> with open('./google.html', 'w', encoding='utf8') as fp:
    ...     fp.write(page_content)
    ...
    131502
    >>> print('Save web page content ' + web_page_url + ' successfully.')
    Save web page content http://www.google.com successfully.
    >>> # Open the local file with read permission.
    >>> with open('./google.html', 'r', encoding='utf8') as fp:
    ...     line = fp.readline()  # read one line of text.
    ...     # Quit the loop only when the text that was read has length 0.
    ...     while len(line) > 0:
    ...         print(line)
    ...         # read the next line.
    ...         line = fp.readline()
    ...
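The save-and-read-back steps above can also be written as a short script. The sketch below is an offline illustration: `page_content` is an invented string standing in for `response.text`, the file goes into a temporary directory rather than `./google.html`, and the hypothetical helpers `save_text` and `read_lines` are not part of `requests`. Iterating over the file object yields one line at a time, which replaces the explicit `readline()` loop.

```python
import tempfile
from pathlib import Path
from typing import List


def save_text(path: Path, text: str) -> int:
    """Write text to path as UTF-8 and return the number of characters written."""
    with open(path, "w", encoding="utf8") as fp:
        return fp.write(text)


def read_lines(path: Path) -> List[str]:
    """Read a UTF-8 text file line by line, without trailing newlines."""
    with open(path, "r", encoding="utf8") as fp:
        # Iterating over the file object reads one line per step.
        return [line.rstrip("\n") for line in fp]


# Invented sample content standing in for response.text (assumption).
page_content = "<html>\n<body>Hello</body>\n</html>\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / "page.html"
    written = save_text(target, page_content)
    lines = read_lines(target)

print(written)
for line in lines:
    print(line)
```

To save a real page, `page_content` would be replaced by the `response.text` value obtained in the session above.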