Using cURL with Python

How to use cURL with Python?

cURL is one of the most popular command-line tools for transferring data across networks. It’s highly configurable and offers bindings for multiple programming languages, making it a good choice for automated web scraping. One of the languages it works well with is Python, widely used for its versatility and readability.

Together, cURL and Python can help you script API requests, debug complex issues, and retrieve any type of data from web pages. This article demonstrates how to use the two tools in conjunction, focusing on GET and POST requests with the PycURL package.

💡 If all you need to do is convert a cURL command to Python code, check out our cURL Python converter

What Is cURL?

cURL is an open-source command-line tool and library that’s used to transfer data in command lines or scripts with URL syntax. It supports around 26 protocols; among the many complex tasks it can handle are user authentication, FTP uploads, and testing REST APIs.

In Python, cURL transfers requests and data to and from servers through PycURL, which acts as a Python interface to the libcURL library.

Almost every programming language can use REST APIs to access an endpoint hosted on a web server. Instead of creating web-based calls using Java, Python, C++, JavaScript, or Ruby, you can demonstrate the calls using cURL, which offers a language-independent way to show HTTP requests and their responses. Then you can translate the requests into a format appropriate to your language.
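For instance, here is a rough sketch of how a simple cURL command line might translate into the equivalent PycURL calls. The endpoint (httpbin.org) and the header are arbitrary stand-ins, not anything prescribed by the article:

# The command line:
#   curl -H "Accept: application/json" https://httpbin.org/get
# expressed as the equivalent PycURL calls:
import pycurl
import certifi
from io import BytesIO

buffer = BytesIO()
c = pycurl.Curl()
c.setopt(c.URL, 'https://httpbin.org/get')            # the URL argument
c.setopt(c.HTTPHEADER, ['Accept: application/json'])  # the -H header
c.setopt(c.WRITEDATA, buffer)                         # capture the response body
c.setopt(c.CAINFO, certifi.where())                   # CA bundle for HTTPS
c.perform()
c.close()
print(buffer.getvalue().decode('utf-8'))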


Here are some of the benefits that cURL offers:

  • It’s versatile. It works with nearly all operating systems and devices and supports a wide variety of protocols, including HTTP, FILE, and FTP.
  • It helps you test endpoints and determine whether they’re working.
  • It’s a low-level command-line tool and offers great performance for transferring data and making HTTP requests.
  • It reports exactly what was sent and received, which is helpful for troubleshooting (see the sketch after this list).
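As a quick illustration of those last two points, here is a minimal sketch that checks whether an endpoint is up and turns on verbose output so you can see exactly what was sent and received. httpbin.org is used only as a stand-in test endpoint:

import pycurl
import certifi

c = pycurl.Curl()
c.setopt(c.URL, 'https://httpbin.org/status/200')
c.setopt(c.NOBODY, True)             # HEAD-style request: fetch headers only, no body
c.setopt(c.VERBOSE, True)            # print the request/response exchange to stderr
c.setopt(c.CAINFO, certifi.where())  # CA bundle for HTTPS verification
c.perform()
print('Status:', c.getinfo(c.RESPONSE_CODE))
c.close()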

Why Scrape the Web?

Web scraping can sometimes be the sole way to access information on the internet. A lot of data is not available in CSV exports or APIs. For instance, think about the types of analysis you can do when you can download every post on a web forum.

Web scraping is an automated technique used to extract huge quantities of unstructured data from websites and then store it in a structured format. There are various methods of scraping websites, from APIs to writing your own code.

Using cURL with Python

There are a few prerequisites before you begin. You’ll need a basic knowledge of Python’s syntax, or at least beginner-level programming experience with a different language. You should also understand basic networking concepts such as protocols and client-server communication.

You’ll need to install the following programs.

Python

  • Select the Python version to download, as well as the appropriate executable installer. This tutorial uses the Windows x86-64 executable installer; the download is roughly 25 MB.
  • Run the installer once the download has finished.

Pip

If you chose an older version of Python, it likely didn’t include pip, the package management system for Python. Make sure to install it, because pip is the recommended way to install most packages, especially when working in virtual environments.

To confirm that pip has been installed, follow these steps:

  • From the Start menu, select cmd.
  • Open the Command Prompt application and enter pip --version.
  • If pip has been installed, it will show the version number. If it hasn’t been installed, the following message will appear:
"pip" is not considered to be an external or internal command. A batch file is a program to operate. 

PycURL

PycURL requires that the SSL library it was compiled against be the same one that libcURL, and consequently PycURL, uses at runtime. PycURL’s setup.py uses curl-config to determine this.
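Once PycURL is installed (for example with pip install pycurl certifi), a quick way to see which libcURL and SSL backend your build is using is to print its version string:

import pycurl

# Prints something like: PycURL/7.45.x libcurl/8.x.x OpenSSL/3.x.x zlib/...
print(pycurl.version)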

certifi

certifi provides Mozilla’s curated collection of root certificates so that SSL/TLS connections can be verified. You can read more about certifi on the project description site.
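To see which certificate bundle certifi ships, you can print its path; this is the file we pass to cURL’s CAINFO option in the examples below:

import certifi

# Prints the path to the bundled cacert.pem file,
# e.g. .../site-packages/certifi/cacert.pem
print(certifi.where())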

Making GET Requests

You’ll use a GET request to fetch resources from HTTP servers. To make one, create a Curl handle, point it at a web page, and perform the transfer:

import pycurl
import certifi
from io import BytesIO

# Creating a buffer, as cURL does not allocate one for the network response
buffer = BytesIO()
c = pycurl.Curl()
# Initializing the request URL
c.setopt(c.URL, 'https://www.scrapingbee.com/')
# Setting options for the cURL transfer
c.setopt(c.WRITEDATA, buffer)
# Setting the file name holding the certificates
c.setopt(c.CAINFO, certifi.where())
# Perform the file transfer
c.perform()
# Ending the session and freeing the resources
c.close()

You need the buffer because cURL doesn’t allocate one for the network response; that’s what buffer = BytesIO() does. Once the transfer has finished, you can read and decode the response from it:

# Retrieve the content from the BytesIO buffer
body = buffer.getvalue()
# Decode and print the buffer contents
print(body.decode('iso-8859-1'))

The output should be something like this:

<html lang="en">
<head>
<meta name="generator" content="Hugo 0.60.1"/>
<meta charset="utf-8"/>
<meta http-equiv="x-ua-compatible" content="ie=edge"/>
<title>ScrapingBee - Web Scraping API</title>
<meta name="description" content="ScrapingBee is a Web Scraping API that handles proxies and Headless browser for you, so you can focus on extracting the data you want, and nothing else."/>
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no"/>
<meta name="twitter:title" content="ScrapingBee - Web Scraping API"/>
...
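In practice you’ll often want a few extra transfer options on a GET request. The sketch below follows redirects, sets connection and transfer timeouts, and sends a custom User-Agent string; the option values and the User-Agent name are arbitrary examples:

import pycurl
import certifi
from io import BytesIO

buffer = BytesIO()
c = pycurl.Curl()
c.setopt(c.URL, 'https://www.scrapingbee.com/')
c.setopt(c.WRITEDATA, buffer)
c.setopt(c.CAINFO, certifi.where())
c.setopt(c.FOLLOWLOCATION, True)         # follow 3xx redirects
c.setopt(c.CONNECTTIMEOUT, 10)           # seconds to wait for the connection
c.setopt(c.TIMEOUT, 30)                  # seconds allowed for the whole transfer
c.setopt(c.USERAGENT, 'my-scraper/0.1')  # hypothetical User-Agent string
c.perform()
c.close()
print(len(buffer.getvalue()), 'bytes received')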

Making POST Requests

The POST method is the standard way to send data to an HTTP server, typically to create or update a resource.

You’ll use c.setopt(c.POSTFIELDS, postfields) to set the request body for POST requests. The following program URL-encodes a dictionary of fields and sends it to the server as a form-encoded body; libcURL sets the Content-Type to application/x-www-form-urlencoded for you.

import pycurl
from urllib.parse import urlencode

c = pycurl.Curl()
# Initializing the request URL
c.setopt(c.URL, 'https://httpbin.org/post')
# The data that we need to POST
post_data = {'field': 'value'}
# Encoding the dictionary as a query string
postfields = urlencode(post_data)
# Setting cURL up for a POST operation
c.setopt(c.POSTFIELDS, postfields)
# Perform the file transfer
c.perform()
# Ending the session and freeing the resources
c.close()

Note that setting c.POSTFIELDS automatically switches the request to a POST operation. You can get more info about urlencode from the Python documentation.
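If you need to send actual JSON rather than form-encoded fields, you can serialize the payload yourself and override the Content-Type header. A minimal sketch, again using httpbin.org as a stand-in endpoint:

import json
import pycurl
import certifi
from io import BytesIO

buffer = BytesIO()
c = pycurl.Curl()
c.setopt(c.URL, 'https://httpbin.org/post')
c.setopt(c.HTTPHEADER, ['Content-Type: application/json'])  # tell the server the body is JSON
c.setopt(c.POSTFIELDS, json.dumps({'field': 'value'}))      # JSON string as the POST body
c.setopt(c.WRITEDATA, buffer)
c.setopt(c.CAINFO, certifi.where())
c.perform()
c.close()
print(buffer.getvalue().decode('utf-8'))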

Searching Responses

To inspect a response beyond its body, use the c.getinfo() API. Go back to the GET example above and add these two lines at the end:

# Page response code, e.g. 200 or 404
print('Response Code: %d' % c.getinfo(c.RESPONSE_CODE))

You must call c.getinfo(c.RESPONSE_CODE) before c.close(), or the code won’t work. You can get more details on getinfo() from the PycURL docs.
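getinfo() exposes a lot more than the status code. Here is a short, self-contained sketch that reads a few other common fields; the URL is the same example site used above:

import pycurl
import certifi
from io import BytesIO

buffer = BytesIO()
c = pycurl.Curl()
c.setopt(c.URL, 'https://www.scrapingbee.com/')
c.setopt(c.WRITEDATA, buffer)
c.setopt(c.CAINFO, certifi.where())
c.setopt(c.FOLLOWLOCATION, True)
c.perform()

# All getinfo() calls must happen before c.close()
print('Response code: %d' % c.getinfo(c.RESPONSE_CODE))
print('Effective URL: %s' % c.getinfo(c.EFFECTIVE_URL))   # final URL after redirects
print('Content type:  %s' % c.getinfo(c.CONTENT_TYPE))
print('Total time:    %f s' % c.getinfo(c.TOTAL_TIME))
c.close()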

Simple Scraping

When you execute the code to scrape the web, it requests the URL you’ve specified. The web server sends back the HTML or XML of the page in response. The code then parses that HTML or XML to locate and extract the information you’re after.

To collect data through web scraping and Python, follow these steps:

  • Look for the URL you want to scrape.
  • Examine the page to find the data you want.
  • Write your code and extract the data.
  • Keep the data in the required format.

As an example, here is a simple Python program that shows what HTML response parsing looks like. It builds on the GET example above and reuses its buffer variable:

from html.parser import HTMLParser

class Parser(HTMLParser):
    # Creating lists to collect the parsed data
    StartTags_list = list()
    EndTags_list = list()
    StartEndTags_list = list()
    Comments_list = list()

    def handle_starttag(self, startTag, attrs):
        self.StartTags_list.append(startTag)

    def handle_endtag(self, endTag):
        self.EndTags_list.append(endTag)

    def handle_startendtag(self, startendTag, attrs):
        self.StartEndTags_list.append(startendTag)

    def handle_comment(self, data):
        self.Comments_list.append(data)

s = Parser()
body = buffer.getvalue()
x = body.decode('iso-8859-1')
s.feed(x)
print(s.Comments_list)

The output should look something like [' navigation ', ' JS Plugins '] .

You can find more about the HTML parser Python lib in the docs.
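As a small variation on the parser above, you could collect the href attribute of every <a> tag instead of just recording tag names. This sketch assumes buffer still holds the GET response from earlier:

from html.parser import HTMLParser

class LinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples for the tag's attributes
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)

p = LinkParser()
p.feed(buffer.getvalue().decode('iso-8859-1'))
print(p.links[:10])  # first ten links found on the page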

Writing Response Data to a File

The easiest way to write the parsed data to a file is to use the built-in open() function with the mode you want:

file1 = open("MyParsed.txt", "a")
file1.writelines(s.StartEndTags_list)
file1.close()

This opens a file named MyParsed.txt, appends the parsed start-end tags to it, and closes the file.

You can find more on how to use the open() function in the docs.
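Alternatively, you can skip the in-memory buffer entirely and let PycURL stream the response body straight into an open file by passing the file object to WRITEDATA. A minimal sketch, where the output file name is just an example:

import pycurl
import certifi

with open('response.html', 'wb') as f:   # hypothetical output file name
    c = pycurl.Curl()
    c.setopt(c.URL, 'https://www.scrapingbee.com/')
    c.setopt(c.WRITEDATA, f)             # stream the body directly into the file
    c.setopt(c.CAINFO, certifi.where())
    c.perform()
    c.close()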

Conclusion

As you’ve seen, PycURL gives you a lot of flexibility in how you grab information from the web, as well as for other tasks like user authentication and SSL connections. It’s a powerful tool that works well with your Python programs. PycURL is a lower-level package than Requests and the other popular HTTP clients in Python; it’s not as easy to use, but it’s much faster when you need many concurrent connections.

If you’d like to maximize your web scraping capabilities, try ScrapingBee. Its API enables you to scrape websites and search engine results. It can manage multiple headless instances for you, and it renders JavaScript so you can scrape any type of site.

To see ScrapingBee in action, sign up for a free trial.


Staff Embedded Software R&D Engineer with 5+ years of experience in software development and machine learning.

