Get HTML from URL in Python
Webpages are built with HTML (HyperText Markup Language). It is the markup language that defines the structure and contents of a webpage, and it is at the core of nearly every website on the internet.
We can access and retrieve content from web pages using Python. Python allows us to access different types of data from URLs like JSON, HTML, XML, and more. We can use different libraries for working with HTML in Python.
We will now discuss how to get HTML from URL in Python.
Using the urllib library to get HTML from URL in Python
The urllib package in Python's standard library handles fetching and working with URLs. We can use its urllib.request module to get HTML from a URL in Python.
First, we need to access the URL. For this, we use the urllib.request module. The urllib.request.urlopen() function opens a connection to the URL passed to it and returns a response object (an http.client.HTTPResponse object for HTTP and HTTPS URLs).
Then, to get HTML from the URL, we call the read() method of this response object. In Python 3, read() returns a bytes object, so we need to decode it to obtain a string.
We use the decode() method to retrieve the HTML as a string and display it. We should also close the response object with its close() method once we are done with it.
We will now use this in the code below.
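The steps above can be sketched as follows. This is a minimal sketch rather than a definitive implementation: the helper name fetch_html and the UTF-8 fallback encoding are assumptions made for illustration, and a with block stands in for the explicit close() call because it closes the connection automatically.

```python
from urllib.request import urlopen


def fetch_html(url, fallback_encoding="utf-8"):
    """Return the content at `url` as a string (hypothetical helper)."""
    # urlopen() opens the connection and returns a response object.
    # The with block closes it on exit, which replaces a manual close().
    with urlopen(url) as response:
        raw = response.read()  # bytes in Python 3
        # Prefer the charset declared in the Content-Type header, if any;
        # otherwise fall back to the assumed default encoding.
        charset = response.headers.get_content_charset() or fallback_encoding
    # Decode the bytes into a str; undecodable bytes are replaced.
    return raw.decode(charset, errors="replace")
```

Calling print(fetch_html("https://www.example.com")) would display the HTML of that page. The URL is only a placeholder and not part of the original article.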
Get Data From a URL in Python
A URL or a Uniform Resource Locator is a valid and unique web address that points to some resource over the internet. This resource can be a simple text file, a zip file, an exe file, a video, an image, or a webpage.
In the case of a webpage, the HTML (Hypertext Markup Language) content is fetched. This article shows how to get this HTML data from a URL using Python.
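To make the idea of a web address concrete, the sketch below splits a URL into its parts with the standard library's urllib.parse module. The helper name looks_fetchable is hypothetical, and the check reflects only a rough assumption about what makes a URL usable for an HTTP request (an http or https scheme plus a host). It does not prove that the resource exists.

```python
from urllib.parse import urlparse


def looks_fetchable(url):
    """Rough check (hypothetical helper): http(s) scheme and a host name."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


# Split a URL into its components.
parts = urlparse("https://www.lipsum.com/feed/html")
print(parts.scheme)  # https
print(parts.netloc)  # www.lipsum.com
print(parts.path)    # /feed/html
```

A string such as "www.lipsum.com/feed/html" has no scheme, so looks_fetchable() returns False for it.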
Get Data From a URL Using the requests Module in Python
The requests module is a third-party Python package (installable with pip install requests) that makes it easy to send HTTP (Hypertext Transfer Protocol) requests. This module can be used to fetch the HTML content or any other content from a valid URL.
The requests module has a get() method that we can use to fetch data from a URL. This method accepts a url as an argument and returns a requests.Response object.
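This requests.Response object contains details about the server's response to the sent HTTP request. If the request cannot be made, the get() method raises an exception: for example, a ConnectionError when the host cannot be reached, or a MissingSchema error when the URL lacks a scheme such as https://. All of these exceptions derive from requests.exceptions.RequestException.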
If you are unsure about the URL’s validity, it is highly recommended to use the try and except blocks. Just enclose the get() method call inside a try and except block. This will be depicted in the upcoming example.
Now, let us understand how to use this function to fetch HTML content or any data from a valid URL. Refer to the following code for the same.
To learn more about the requests.Response object, refer to the official requests documentation.
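import requests

try:
    url = "https://www.lipsum.com/feed/html"
    r = requests.get(url)
    print("HTML:\n", r.text)
except requests.exceptions.RequestException:
    # Base class of the errors raised by requests (bad URL, no connection, ...)
    print("Invalid URL or some error occurred while making the GET request to the specified URL")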
The output is the label HTML: followed by the HTML content that was fetched from the URL. The HTML content is not reproduced here because it is too long.
If the URL is faulty, the above code will run the code inside the except block. The following code depicts how it works.
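import requests

try:
    url = "https://www.thisisafaultyurl.com/faulty/url/"
    r = requests.get(url)
    print("HTML:\n", r.text)
except requests.exceptions.RequestException:
    # Base class of the errors raised by requests (bad URL, no connection, ...)
    print("Invalid URL or some error occurred while making the GET request to the specified URL")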
Invalid URL or some error occurred while making the GET request to the specified URL
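The examples above report every failure with the same message. As a hedged sketch of a more specific approach, the code below catches several exception classes from requests.exceptions separately. The helper name fetch_html_safely, the 10-second timeout, and the messages are assumptions chosen for illustration and are not part of the original examples.

```python
import requests


def fetch_html_safely(url, timeout=10):
    """Return the HTML at `url`, or None on failure (hypothetical helper)."""
    try:
        response = requests.get(url, timeout=timeout)
        # Turn 4xx/5xx status codes into an HTTPError exception.
        response.raise_for_status()
    except requests.exceptions.MissingSchema:
        print("The URL has no scheme such as https://")
    except requests.exceptions.Timeout:
        print("The request timed out")
    except requests.exceptions.ConnectionError:
        print("Could not connect to the server")
    except requests.exceptions.HTTPError as error:
        print("The server returned an error status:", error)
    except requests.exceptions.RequestException as error:
        # Base class of the exceptions above; catches anything else.
        print("The request failed:", error)
    else:
        return response.text
    return None
```

Returning None on failure lets the caller decide what to do next, for instance retrying or skipping the URL.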
Some URLs do not serve their content in response to GET requests, for example because they expect data such as credentials or tokens to be submitted with the request. In such cases, we can use the post() method from the requests module.
As the name suggests, this method sends POST requests to a URL. Its two most commonly used arguments are url and data.
The url is the target URL, and data accepts a dictionary of key-value pairs that is sent as the form-encoded body of the request. These values could be an API (Application Programming Interface) key, a CSRF (Cross-Site Request Forgery) token, etc. HTTP headers, by contrast, are passed through the separate headers argument.
The Python code for such a case would be as follows.
import requests

try:
    url = "https://www.thisisaurl.com/that/accepts/post/requests/"
    payload = {
        "api-key": "my-api-key",
        # more key-value pairs
    }
    r = requests.post(url, data=payload)
    print("HTML:\n", r.text)
except requests.exceptions.RequestException:
    print("Invalid URL or some error occurred while making the POST request to the specified URL")
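To see what a POST request like the one above would actually carry, the following sketch builds the request without sending it and inspects the result. It assumes the same placeholder URL and payload. The header name X-Example-Token and its value are hypothetical and only illustrate that headers travel separately from the data.

```python
import requests

url = "https://www.thisisaurl.com/that/accepts/post/requests/"
payload = {"api-key": "my-api-key"}        # sent in the request body
headers = {"X-Example-Token": "my-token"}  # hypothetical header name

# Build the request without sending it, to inspect what would go out.
prepared = requests.Request("POST", url, data=payload, headers=headers).prepare()

print(prepared.method)                      # POST
print(prepared.body)                        # api-key=my-api-key
print(prepared.headers["Content-Type"])     # application/x-www-form-urlencoded
print(prepared.headers["X-Example-Token"])  # my-token
```

The dictionary passed as data ends up form-encoded in the body, while the headers dictionary becomes HTTP headers. To actually send the same request, one could call requests.post(url, data=payload, headers=headers).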
Vaibhav is an artificial intelligence and cloud computing stan. He likes to build end-to-end full-stack web and mobile applications. Besides computer science and technology, he loves playing cricket and badminton, going on bike rides, and doodling.