Using cURL with Python

How to use cURL with Python?

cURL is one of the most popular command-line tools for transferring data across networks. It’s highly configurable and offers bindings for multiple programming languages, making it a good choice for automated web scraping. One of the languages it works well with is Python, widely used for its versatility and readability.

Together, cURL and Python can help you script API requests, debug complex issues, and retrieve any type of data from web pages. This article demonstrates how to use the two tools in conjunction, focusing on GET and POST requests with the PycURL package.

💡 If all you need to do is convert a cURL command to Python code, check out our cURL Python converter

What Is cURL?

cURL is an open-source command-line tool and library that’s used to transfer data in command lines or scripts with URL syntax. It supports around 26 protocols; among the many complex tasks it can handle are user authentication, FTP uploads, and testing REST APIs.

In Python, cURL transfers requests and data to and from servers through PycURL, which acts as a Python interface to the libcURL library.

Almost every programming language can use REST APIs to access an endpoint hosted on a web server. Instead of creating web-based calls using Java, Python, C++, JavaScript, or Ruby, you can demonstrate the calls using cURL, which offers a language-independent way to show HTTP requests and their responses. Then you can translate the requests into a format appropriate to your language.
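For instance, here is a rough sketch of how a simple cURL command line might translate into the equivalent PycURL calls. The endpoint (httpbin.org) and the header are arbitrary stand-ins, not anything prescribed by the article:

# The command line:
#   curl -H "Accept: application/json" https://httpbin.org/get
# expressed as the equivalent PycURL calls:
import pycurl
import certifi
from io import BytesIO

buffer = BytesIO()
c = pycurl.Curl()
c.setopt(c.URL, 'https://httpbin.org/get')            # the URL argument
c.setopt(c.HTTPHEADER, ['Accept: application/json'])  # the -H header
c.setopt(c.WRITEDATA, buffer)                         # capture the response body
c.setopt(c.CAINFO, certifi.where())                   # CA bundle for HTTPS
c.perform()
c.close()
print(buffer.getvalue().decode('utf-8'))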


Here are some of the benefits that cURL offers:

  • It’s versatile. It works with nearly all operating systems and devices and supports a wide variety of protocols, including HTTP, FILE, and FTP.
  • It helps you test endpoints and determine whether they’re working.
  • It’s a low-level command-line tool and offers great performance for transferring data and making HTTP requests.
  • It reports exactly what was sent and received, which is helpful for troubleshooting (see the sketch after this list).
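As a quick illustration of those last two points, here is a minimal sketch that checks whether an endpoint is up and turns on verbose output so you can see exactly what was sent and received. httpbin.org is used only as a stand-in test endpoint:

import pycurl
import certifi

c = pycurl.Curl()
c.setopt(c.URL, 'https://httpbin.org/status/200')
c.setopt(c.NOBODY, True)             # HEAD-style request: fetch headers only, no body
c.setopt(c.VERBOSE, True)            # print the request/response exchange to stderr
c.setopt(c.CAINFO, certifi.where())  # CA bundle for HTTPS verification
c.perform()
print('Status:', c.getinfo(c.RESPONSE_CODE))
c.close()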

Why Scrape the Web?

Web scraping can sometimes be the sole way to access information on the internet. A lot of data is not available in CSV exports or APIs. For instance, think about the types of analysis you can do when you can download every post on a web forum.

Web scraping is an automated technique used to extract huge quantities of unstructured data from websites and then store it in a structured format. There are various methods of scraping websites, from APIs to writing your own code.

Using cURL with Python

There are a few prerequisites before you begin. You’ll need a basic knowledge of Python’s syntax, or at least beginner-level programming experience with a different language. You should also understand basic networking concepts such as protocols and client-server communication.

You’ll need to install the following programs.

Python

  • Select the Python version to download, as well as the appropriate executable installer. This tutorial uses the Windows x86-64 executable installer; the download is roughly 25 MB.
  • Run the installer once the download has finished.

Pip

If you chose an older version of Python, it likely didn’t include pip, the package management system for Python. Make sure to install it, because pip is the recommended way to install most packages, especially when working in virtual environments.

To confirm that pip has been installed, follow these steps:

  • From the Start menu, select cmd.
  • Open the Command Prompt application and enter pip --version.
  • If pip has been installed, it will show the version number. If it hasn’t been installed, the following message will appear:
"pip" is not considered to be an external or internal command. A batch file is a program to operate. 

PycURL

PycURL requires that the SSL library it was compiled against be the same one that libcURL, and consequently PycURL, uses at runtime. PycURL’s setup.py uses curl-config to determine this.
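Once PycURL is installed (for example with pip install pycurl certifi), a quick way to see which libcURL and SSL backend your build is using is to print its version string:

import pycurl

# Prints something like: PycURL/7.45.x libcurl/8.x.x OpenSSL/3.x.x zlib/...
print(pycurl.version)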

certifi

certifi provides Mozilla’s curated collection of root certificates so that SSL/TLS connections can be verified. You can read more about certifi on the project description site.
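To see which certificate bundle certifi ships, you can print its path; this is the file we pass to cURL’s CAINFO option in the examples below:

import certifi

# Prints the path to the bundled cacert.pem file,
# e.g. .../site-packages/certifi/cacert.pem
print(certifi.where())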

Making GET Requests

You’ll use a GET request to fetch resources from HTTP servers. To make one, create a Curl handle, point it at a web page, and perform the transfer:

import pycurl
import certifi
from io import BytesIO

# Creating a buffer, as cURL does not allocate one for the network response
buffer = BytesIO()
c = pycurl.Curl()
# Initializing the request URL
c.setopt(c.URL, 'https://www.scrapingbee.com/')
# Setting options for the cURL transfer
c.setopt(c.WRITEDATA, buffer)
# Setting the file name holding the certificates
c.setopt(c.CAINFO, certifi.where())
# Perform the file transfer
c.perform()
# Ending the session and freeing the resources
c.close()

You need the buffer because cURL doesn’t allocate one for the network response; that’s what buffer = BytesIO() does. Once the transfer has finished, you can read and decode the response from it:

# Retrieve the content from the BytesIO buffer
body = buffer.getvalue()
# Decode and print the buffer contents
print(body.decode('iso-8859-1'))

The output should be something like this:

<html lang="en">
<head>
<meta name="generator" content="Hugo 0.60.1"/>
<meta charset="utf-8"/>
<meta http-equiv="x-ua-compatible" content="ie=edge"/>
<title>ScrapingBee - Web Scraping API</title>
<meta name="description" content="ScrapingBee is a Web Scraping API that handles proxies and Headless browser for you, so you can focus on extracting the data you want, and nothing else."/>
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no"/>
<meta name="twitter:title" content="ScrapingBee - Web Scraping API"/>
...
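In practice you’ll often want a few extra transfer options on a GET request. The sketch below follows redirects, sets connection and transfer timeouts, and sends a custom User-Agent string; the option values and the User-Agent name are arbitrary examples:

import pycurl
import certifi
from io import BytesIO

buffer = BytesIO()
c = pycurl.Curl()
c.setopt(c.URL, 'https://www.scrapingbee.com/')
c.setopt(c.WRITEDATA, buffer)
c.setopt(c.CAINFO, certifi.where())
c.setopt(c.FOLLOWLOCATION, True)         # follow 3xx redirects
c.setopt(c.CONNECTTIMEOUT, 10)           # seconds to wait for the connection
c.setopt(c.TIMEOUT, 30)                  # seconds allowed for the whole transfer
c.setopt(c.USERAGENT, 'my-scraper/0.1')  # hypothetical User-Agent string
c.perform()
c.close()
print(len(buffer.getvalue()), 'bytes received')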

Making POST Requests

The POST method is the standard way to send data to an HTTP server, typically to create or update a resource.

You’ll use c.setopt(c.POSTFIELDS, postfields) to set the request body for POST requests. The following program URL-encodes a dictionary of fields and sends it to the server as a form-encoded body; libcURL sets the Content-Type to application/x-www-form-urlencoded for you.

import pycurl
from urllib.parse import urlencode

c = pycurl.Curl()
# Initializing the request URL
c.setopt(c.URL, 'https://httpbin.org/post')
# The data that we need to POST
post_data = {'field': 'value'}
# Encoding the dictionary as a query string
postfields = urlencode(post_data)
# Setting cURL up for a POST operation
c.setopt(c.POSTFIELDS, postfields)
# Perform the file transfer
c.perform()
# Ending the session and freeing the resources
c.close()

Note that setting c.POSTFIELDS automatically switches the request to a POST operation. You can get more info about urlencode from the Python documentation.
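If you need to send actual JSON rather than form-encoded fields, you can serialize the payload yourself and override the Content-Type header. A minimal sketch, again using httpbin.org as a stand-in endpoint:

import json
import pycurl
import certifi
from io import BytesIO

buffer = BytesIO()
c = pycurl.Curl()
c.setopt(c.URL, 'https://httpbin.org/post')
c.setopt(c.HTTPHEADER, ['Content-Type: application/json'])  # tell the server the body is JSON
c.setopt(c.POSTFIELDS, json.dumps({'field': 'value'}))      # JSON string as the POST body
c.setopt(c.WRITEDATA, buffer)
c.setopt(c.CAINFO, certifi.where())
c.perform()
c.close()
print(buffer.getvalue().decode('utf-8'))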

Searching Responses

To inspect a response beyond its body, use the c.getinfo() API. Go back to the GET example above and add these two lines at the end:

# Page response code, e.g. 200 or 404
print('Response Code: %d' % c.getinfo(c.RESPONSE_CODE))

You must call c.getinfo(c.RESPONSE_CODE) before c.close(), or the code won’t work. You can get more details on getinfo() from the PycURL docs.
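getinfo() exposes a lot more than the status code. Here is a short, self-contained sketch that reads a few other common fields; the URL is the same example site used above:

import pycurl
import certifi
from io import BytesIO

buffer = BytesIO()
c = pycurl.Curl()
c.setopt(c.URL, 'https://www.scrapingbee.com/')
c.setopt(c.WRITEDATA, buffer)
c.setopt(c.CAINFO, certifi.where())
c.setopt(c.FOLLOWLOCATION, True)
c.perform()

# All getinfo() calls must happen before c.close()
print('Response code: %d' % c.getinfo(c.RESPONSE_CODE))
print('Effective URL: %s' % c.getinfo(c.EFFECTIVE_URL))   # final URL after redirects
print('Content type:  %s' % c.getinfo(c.CONTENT_TYPE))
print('Total time:    %f s' % c.getinfo(c.TOTAL_TIME))
c.close()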

Simple Scraping

When you execute the code to scrape the web, it requests the URL you’ve specified. The web server sends back the HTML or XML of the page in response. The code then parses that HTML or XML to locate and extract the information you’re after.

To collect data through web scraping and Python, follow these steps:

  • Look for the URL you want to scrape.
  • Examine the page to find the data you want.
  • Write your code and extract the data.
  • Keep the data in the required format.

As an example, here is a simple Python program that shows what HTML response parsing looks like. It builds on the GET example above and reuses its buffer variable:

from html.parser import HTMLParser

class Parser(HTMLParser):
    # Creating lists to collect the parsed data
    StartTags_list = list()
    EndTags_list = list()
    StartEndTags_list = list()
    Comments_list = list()

    def handle_starttag(self, startTag, attrs):
        self.StartTags_list.append(startTag)

    def handle_endtag(self, endTag):
        self.EndTags_list.append(endTag)

    def handle_startendtag(self, startendTag, attrs):
        self.StartEndTags_list.append(startendTag)

    def handle_comment(self, data):
        self.Comments_list.append(data)

s = Parser()
body = buffer.getvalue()
x = body.decode('iso-8859-1')
s.feed(x)
print(s.Comments_list)

The output should look something like [' navigation ', ' JS Plugins '] .

You can find more about the HTML parser Python lib in the docs.
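As a small variation on the parser above, you could collect the href attribute of every <a> tag instead of just recording tag names. This sketch assumes buffer still holds the GET response from earlier:

from html.parser import HTMLParser

class LinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples for the tag's attributes
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)

p = LinkParser()
p.feed(buffer.getvalue().decode('iso-8859-1'))
print(p.links[:10])  # first ten links found on the page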

Writing Response Data to a File

The easiest way to write the parsed data to a file is to use the built-in open() function with the mode you want:

file1 = open("MyParsed.txt", "a")
file1.writelines(s.StartEndTags_list)
file1.close()

This opens a file named MyParsed.txt, appends the parsed start-end tags to it, and closes the file.

You can find more on how to use the open() function in the docs.
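Alternatively, you can skip the in-memory buffer entirely and let PycURL stream the response body straight into an open file by passing the file object to WRITEDATA. A minimal sketch, where the output file name is just an example:

import pycurl
import certifi

with open('response.html', 'wb') as f:   # hypothetical output file name
    c = pycurl.Curl()
    c.setopt(c.URL, 'https://www.scrapingbee.com/')
    c.setopt(c.WRITEDATA, f)             # stream the body directly into the file
    c.setopt(c.CAINFO, certifi.where())
    c.perform()
    c.close()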

Conclusion

As you’ve seen, PycURL gives you a lot of flexibility in how you grab information from the web, as well as for other tasks like user authentication and SSL connections. It’s a powerful tool that works well with your Python programs. PycURL is a lower-level package than Requests and the other popular HTTP clients in Python; it’s not as easy to use, but it’s much faster when you need many concurrent connections.

If you’d like to maximize your web scraping capabilities, try ScrapingBee. Its API enables you to scrape websites and search engine results. It can manage multiple headless instances for you, and it renders JavaScript so you can scrape any type of site.

To see ScrapingBee in action, sign up for a free trial.


Staff Embedded Software R&D Engineer with 5+ years of experience in software development and machine learning.

