Python get web pages

Python Get Webpage Html Examples

Python provides several modules for getting the HTML source code of a web page from a URL: the standard-library urllib package (the Python 2 module urllib2 does not exist in Python 3) and the third-party packages urllib3 and requests. This article shows how to use these modules to get a web page's HTML source code, with examples.

1. Python Get Webpage Html Use urllib Module Example.

1.1 Python urllib Module Introduction.

  1. Python’s built-in urllib library is used to obtain the HTML source code of web pages.
  2. The urllib library is a standard library module of Python and does not need to be installed separately.

1.2 Python urllib Library request Module Introduction.

  1. Before you can use the urllib library request module, you need to import it into your source code.
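    # import the urllib.request module
    import urllib.request

    # or import the request module from the urllib package
    from urllib import request
  2. urllib.request.urlopen(url, timeout=...): sends a request to the URL and returns the response object.
     url: the requested web page URL.
     timeout: the response timeout in seconds. If no response is received within the specified time, a timeout exception is raised.
  3. urllib.request.Request(url, headers=...): builds a request object that can be passed to urlopen().
     url: the requested web page URL.
     headers: a dictionary of request headers.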
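The url, timeout, and headers parameters described above can be combined in a single request. The following is a minimal sketch rather than a definitive implementation. So that it runs without internet access, it talks to a throwaway local server; the EchoAgentHandler class, the fetch_with_headers() helper, and the my-crawler/0.1 User-Agent string are hypothetical names introduced only for illustration.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoAgentHandler(BaseHTTPRequestHandler):
    """Hypothetical test handler: replies with the User-Agent it received."""

    def do_GET(self):
        body = self.headers.get("User-Agent", "").encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example output quiet


def fetch_with_headers(url, headers, timeout=5):
    """Send a GET request with custom headers and return the decoded body."""
    req = urllib.request.Request(url, headers=headers)
    # timeout is in seconds; an exception is raised if the server is too slow.
    with urllib.request.urlopen(req, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Start a throwaway server on a free local port (port 0 lets the OS choose).
server = HTTPServer(("127.0.0.1", 0), EchoAgentHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

local_url = f"http://127.0.0.1:{server.server_port}/"
echoed_agent = fetch_with_headers(local_url, {"User-Agent": "my-crawler/0.1"})
print(echoed_agent)

server.shutdown()
server.server_close()
```

To fetch a real site, the same helper can be called with that site's URL in place of local_url.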

1.3 The http.client.HTTPResponse Class.

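  1. When the URL is an HTTP or HTTPS address, urllib.request.urlopen() returns an http.client.HTTPResponse object. Its most commonly used methods are listed below, together with the two string methods needed to convert the data.
  2. read(): read the response body as bytes.
  3. bytes.decode("utf-8"): convert the bytes data to a string (a method of the bytes object, not of the response).
  4. str.encode("utf-8"): convert a string to bytes.
  5. geturl(): return the URL of the response (deprecated since Python 3.9 in favor of the url attribute).
  6. getcode(): return the HTTP status code (deprecated since Python 3.9 in favor of the status attribute).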
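The methods listed above can be tried out as follows. This is only a sketch under stated assumptions: it serves a sample page from a local server so that it runs offline, and the PageHandler class and the PAGE string are hypothetical names made up for the example. It reads the body as bytes, picks the charset from the Content-Type header (falling back to UTF-8), and decodes the bytes to a string.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

PAGE = "<html><body><h1>héllo</h1></body></html>"  # hypothetical sample page


class PageHandler(BaseHTTPRequestHandler):
    """Hypothetical local handler that always serves PAGE."""

    def do_GET(self):
        body = PAGE.encode("utf-8")  # str -> bytes
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example output quiet


server = HTTPServer(("127.0.0.1", 0), PageHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/"

with urllib.request.urlopen(url, timeout=5) as response:
    status = response.status      # same value as response.getcode()
    final_url = response.url      # same value as response.geturl()
    raw = response.read()         # response body as bytes
    # Use the charset announced by the server, or UTF-8 if none is given.
    charset = response.headers.get_content_charset() or "utf-8"
    html = raw.decode(charset)    # bytes -> str

server.shutdown()
server.server_close()

print(status, final_url)
print(type(raw).__name__, type(html).__name__)
```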

1.4 Use Python urllib.request To Crawl Web Page Examples.

  1. This example shows how to use the Python urllib.request module to request a web page by URL and get its HTML content and response headers.
    import urllib.request
    # or
    # from urllib import request


    # this function will request the url page and get the response object.
    def urllib_request_web_page(url):
        # send request to the url web page and get the response object.
        response = urllib.request.urlopen(url)
        # print out the response object.
        print(response)

        # get the response url.
        resp_url = response.geturl()
        print('Response url : ', resp_url)

        # get the response code.
        resp_code = response.getcode()
        print('Response code : ', resp_code)

        # get all the response headers in a list object.
        resp_headers_list = response.getheaders()
        # loop in the response headers.
        for resp_headers in resp_headers_list:
            # get the response header name
            header_name = resp_headers[0]
            # get the response header value.
            header_value = resp_headers[1]
            print(resp_headers)
            print(header_name, ' = ', header_value)

        # read the response content in bytes object
        # (named resp_bytes to avoid shadowing the built-in bytes type).
        resp_bytes = response.read()
        print(resp_bytes)

        # convert the bytes object to string.
        html_content = resp_bytes.decode('utf-8')
        print(html_content)


    if __name__ == '__main__':
        url = "https://www.bing.com"
        urllib_request_web_page(url)
  2. Running the example prints output similar to the following (truncated).
    Response url : https://www.bing.com
    Response code : 200
    ('Cache-Control', 'private')
    Cache-Control = private
    ('Transfer-Encoding', 'chunked')
    Transfer-Encoding = chunked
    .
    .
    b'. '

2. Python Get Webpage Html Use urllib3 Module Example.

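  1. The Python 2 module urllib2 does not exist in Python 3, where its functionality was merged into urllib.request and urllib.error. Despite the similar name, urllib3 is a separate third-party HTTP client library that can be used for the same task as urllib.
  2. The urllib3 module is not built into Python, so you need to install it in your Python environment first.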

2.1 How To Install Python urllib3 Module.

  1. Open a terminal and run the command pip install urllib3 to install the python module urllib3.
    > pip install urllib3
    Defaulting to user installation because normal site-packages is not writeable
    Collecting urllib3
      Downloading urllib3-1.26.12-py2.py3-none-any.whl (140 kB)
         ---------------------------------------- 140.4/140.4 kB 106.8 kB/s eta 0:00:00
    Installing collected packages: urllib3
    Successfully installed urllib3-1.26.12
  2. Run the command pip show urllib3 to confirm that the module is installed.
    > pip show urllib3
    Name: urllib3
    Version: 1.26.12
    Summary: HTTP library with thread-safe connection pooling, file post, and more.
    Home-page: https://urllib3.readthedocs.io/
    Author: Andrey Petrov
    Author-email: [email protected]
    License: MIT
    Location: c:\users\zhao song\appdata\roaming\python\python39\site-packages
    Requires:
    Required-by:
Once urllib3 is installed, the example below uses it to request a web page by URL and print the response status, headers, and HTML source code.
    # import the urllib3 module first.
    import urllib3


    # define the function to get Html web page source code by URL.
    def get_webpage_html_use_urllib3(url):
        # Get the HTTP pool manager object in urllib3.
        http_pool_manager = urllib3.PoolManager()

        # Send the request to the url using the http pool manager.
        response = http_pool_manager.request('GET', url)

        # Print out the response status code.
        print(response.status)

        # Print out the response headers.
        print(response.headers)

        # Print out the response webpage Html source code (a bytes object).
        print(response.data)


    if __name__ == '__main__':
        url = "https://www.bing.com"
        get_webpage_html_use_urllib3(url)
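The response.data attribute holds the page body as bytes, so it usually needs to be decoded before it can be handled as text. The sketch below is one hedged way to do that, and it also passes a timeout so the call cannot hang forever. It assumes urllib3 is installed and uses a throwaway local server so that it runs offline; the PageHandler class and the PAGE string are hypothetical names made up for the example.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import urllib3  # third-party: pip install urllib3

PAGE = "<html><body><p>Hello from urllib3</p></body></html>"  # sample page


class PageHandler(BaseHTTPRequestHandler):
    """Hypothetical local handler that always serves PAGE."""

    def do_GET(self):
        body = PAGE.encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example output quiet


server = HTTPServer(("127.0.0.1", 0), PageHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/"

http_pool_manager = urllib3.PoolManager()
# timeout is in seconds; retries=False disables automatic retries.
response = http_pool_manager.request("GET", url, timeout=5.0, retries=False)

status = response.status
content_type = response.headers.get("Content-Type")
html = response.data.decode("utf-8")  # bytes -> str

server.shutdown()
server.server_close()

print(status, content_type)
print(html)
```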


Get Web Page in Python

  1. Use the urllib Package to Get a Web Page in Python
  2. Use the requests Package to Get a Webpage in Python

In Python, we can create connections and read data from the web. We can download files over the web and read whole web pages.

This tutorial shows how to get a webpage in Python.

Use the urllib Package to Get a Web Page in Python

The urllib package is used to fetch web pages and handle URL-related operations in Python. We can use the urllib.request.urlopen() function to retrieve a web page from its URL.

The urlopen() function opens the given URL and returns a response object. This object has attributes such as headers and status. Calling its read() method returns the full content of the web page as bytes.

See the following example.

    import urllib.request

    page = urllib.request.urlopen('http://www.python.org')
    print(page.read())

Several modules with similar names are related to urllib. In Python 2, the standard library also included urllib2, which extended urllib with additional features. It could accept a Request object, which lets you set request headers, but it lacked the urlencode() function, which remained in urllib. In Python 3, both modules were merged into the urllib package, and urlencode() now lives in urllib.parse.
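As a small illustration of urlencode(), the sketch below builds a query string with urllib.parse.urlencode() in Python 3 and parses it back. The example.com address and the parameter names are placeholders chosen for illustration and do not come from the original article.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical query parameters for a search page.
params = {"q": "python urllib", "page": 2}

# urlencode() escapes each value and joins the pairs with "&".
query = urlencode(params)
url = "https://www.example.com/search?" + query
print(url)

# parse_qs() reverses the process; every value comes back as a list of strings.
parsed = parse_qs(urlsplit(url).query)
print(parsed)
```

No request is sent here; the resulting URL can then be passed to urllib.request.urlopen().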

The urllib3 package is a separate third-party package, unlike urllib and urllib2, which belong to the standard library. The requests package discussed below uses urllib3 internally.

Use the requests Package to Get a Webpage in Python

The requests library is simple to use and provides a lot of HTTP-related functionalities. We can use the requests.get() function to retrieve a webpage and return a Response object.

This object has several attributes, such as status_code and content. The content attribute returns the web page's content as bytes.

    import requests

    response = requests.get('http://www.python.org')
    print(response.status_code)
    print(response.content)

The requests library aims to provide a simple-to-use API and a more convenient way to handle errors. It also decodes the response body into Unicode automatically; the decoded string is available through the text attribute.
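The difference between content and text, and the error handling mentioned above, can be sketched as follows. This is an illustration under stated assumptions, not the only way to use the library: it assumes requests is installed, it fetches from a throwaway local server so that it runs offline, and the PageHandler class, the PAGE string, and the /missing path are hypothetical names made up for the example.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests  # third-party: pip install requests

PAGE = "<html><body><p>Hello from requests</p></body></html>"  # sample page


class PageHandler(BaseHTTPRequestHandler):
    """Hypothetical local handler: serves PAGE at "/" and 404 elsewhere."""

    def do_GET(self):
        if self.path != "/":
            self.send_error(404)
            return
        body = PAGE.encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example output quiet


server = HTTPServer(("127.0.0.1", 0), PageHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base_url = f"http://127.0.0.1:{server.server_port}"

response = requests.get(base_url + "/", timeout=5)
response.raise_for_status()   # raises requests.HTTPError on a 4xx/5xx status
raw = response.content        # response body as bytes
html = response.text          # str, decoded using response.encoding

# A missing page makes raise_for_status() raise an HTTPError.
try:
    requests.get(base_url + "/missing", timeout=5).raise_for_status()
    error_status = None
except requests.HTTPError as exc:
    error_status = exc.response.status_code

server.shutdown()
server.server_close()

print(response.status_code, response.encoding, error_status)
```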

Manav is an IT professional with extensive experience as a core developer on many live projects. He is an avid learner who enjoys learning new things and sharing his findings whenever possible.
