- How to Download a File in Python
- Download a File in Python Over HTTP
- Download a File in Python From an API
- Closing Thoughts on How to Download a File in Python
- How to Download Files in Python
- Requests Library
- Making Requests
- Making a GET request
- Downloading files from web using Python?
- 1. Import module
- 2. Get the link or url
- 3. Save the content with name.
- Example
- Result
- Get filename from an URL
How to Download a File in Python
Did you know you can download a file programmatically in Python? I will show you how to fetch and save a file in Python. Downloading files this way is a common part of web scraping and a frequent first step in data-related projects.
Web scraping is the process of collecting data from a website. While it can be done manually by a user, it usually refers to an automated method of data collection with the help of a web crawler.
You can do all of this programmatically in Python. By the end of this article, you will know how to download any kind of file in Python, including PDFs, images, videos, and pages. The process is similar between different types of files.
To get the most out of this article, it is good to have a basic understanding of programming in Python. Also, to save time and accelerate your learning, I encourage you to check our Python programming track.
To download a file in Python, we need to fetch it and save it. This process can be done by calling an API or with just a regular web URL pointing to a GIF you like.
Before going further, let’s understand REST APIs. A REST API is a service that allows you to access and manipulate data such as text files, images, services, and collections of other resources on a server via REST mechanisms. An API helps improve the portability of client apps and eases the evolving process of the different components of a product. These APIs usually return UTF-8 encoded JSON objects as the resource.
There are two fundamental steps to making a request when working with REST APIs. First, the client accesses a specific location on a REST API and states the method to be executed. This is known as a request. Second, the server executes the method and returns the data to the client. This is known as a response.
Authentication is a critical component of internet security. Any REST API that lets clients access or modify sensitive or critical data must have an authentication system in place. Even if the API is free, the owner may introduce authentication to limit the number of requests per user.
For this tutorial, we will fetch and save files in Python from place.dog and randomfox.ca. No authentication is required, so you can reuse the code snippets to download a file in Python. You can find a list of public APIs here.
First, we will download a file in Python over HTTP. Later, we will download a file in Python from an API. Let’s get right to it!
Download a File in Python Over HTTP
In our first example, we will fetch and save a picture of a dog. This website offers random pictures of dogs you can use as placeholders for your next project. If you refresh the page, it generates another dog picture.
We will use the requests library, which makes HTTP requests simpler than using the built-in urllib library. You may have to install the requests library with the following command:
pip install requests
Then, we import requests, set the url variable with our target URL, send a GET request, and check its status. The following are the classes of response status codes you may encounter when sending a GET request:
- 1xx Informational. It indicates that the request has been received and the client should continue with it.
- 2xx Successful. It indicates a requested action has been received, understood, and accepted. It helps you verify the data exists before working on it.
- 3xx Redirection. It indicates the client must take additional action to complete the request, such as using a proxy or a different endpoint to access the resources.
- 4xx Client Error. It indicates problems with the client, for example, disallowed methods, authorization issues, forbidden access, or attempts to access resources that do not exist.
- 5xx Server Error. It indicates problems with the server providing the API.
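As a rough illustration of the list above, the class of a status code can be read off its first digit. The helper name status_class is hypothetical and not part of requests; treat this as a sketch only.

```python
def status_class(status_code):
    """Map an HTTP status code to the name of its class (1xx to 5xx)."""
    classes = {
        1: "Informational",
        2: "Successful",
        3: "Redirection",
        4: "Client Error",
        5: "Server Error",
    }
    # Integer division by 100 keeps only the leading digit of a 3-digit code
    return classes.get(status_code // 100, "Unknown")


print(status_class(200))
print(status_class(404))
```

In practice you would pass response.status_code from a requests response to a helper like this.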
Let’s write a request to fetch a file in Python.
>>> import requests
>>> url = 'https://place.dog/300/200'
>>> # fetch file
>>> response = requests.get(url, allow_redirects=True)
>>> # Get response status
>>> response.status_code
200
The 200 status code indicates the request is successful and the data exists. From there, we continue to the next step and save a file in Python with the help of the write() method.
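The saving step itself is short. Here is a minimal sketch, assuming response.content holds the raw bytes of the picture and that dog1.jpg is the filename we want. The helper name save_file is made up for illustration.

```python
def save_file(filename, content):
    """Write raw bytes to filename and return the number of bytes written."""
    # 'wb' opens the file for writing in binary mode, which images require
    with open(filename, "wb") as f:
        return f.write(content)
```

With the response from the request above, calling save_file('dog1.jpg', response.content) would write the picture to the current working directory.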
Now, the file has been saved as dog1.jpg and contains a picture of a dog.
For a good refresher on the write() method to save a file in Python, check my article on how to write to file in Python here.
Download a File in Python From an API
Now, let’s explore how to fetch and save a file in Python by calling an API and parsing the JSON response. In contrast to what we have done previously, we will save the file with pathlib.
Much of the data served by web APIs is formatted as JSON (JavaScript Object Notation). It is also used to store information in databases and is the most common data format you’ll find when working with modern REST APIs. A JSON document is built from two structures: unordered name-value pairs (known as dictionaries, hash tables, objects, or keyed lists, depending on the programming language) and ordered lists of values (arrays, lists, or vectors).
Raw JSON text is awkward to work with directly in code. Python has several libraries that parse JSON data fetched from the web. Among them is the built-in json module, which converts JSON components into native Python objects. The following table shows the conversion mapping between JSON and Python:
JSON | Python |
---|---|
object | dictionary |
array | list or tuple |
string | string |
number | integer or float |
true | True |
false | False |
null | None |
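To see the mapping in action, here is a small sketch using the standard json module. The payload is made up for illustration; only the "image" key mirrors the fox API used later in this article, and the other keys are hypothetical.

```python
import json

# Made-up JSON text; only "image" corresponds to a key seen in this article
raw = (
    '{"image": "https://randomfox.ca/images/2.jpg", "tags": ["fox", "red"], '
    '"width": 300, "ratio": 1.5, "cached": true, "note": null}'
)

# json.loads parses JSON text into native Python objects
data = json.loads(raw)

# Print each key together with the Python type its value was converted to
for key, value in data.items():
    print(key, type(value).__name__)
```

When decoding, a JSON array always becomes a Python list. Tuples only come into play in the other direction, when Python data is encoded as JSON.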
You have to deal with JSON data often when working with REST APIs. You can find more information about JSON in our course on How to Read and Write JSON Files in Python.
The requests library has many features, but we only need the GET request and the json() method for the following example. As we have done previously, the first step is to import the requests library. Then, we send a GET request to the API endpoint we want to access. The API returns a response object that includes the JSON data. We are only interested in the JSON data, which we get by calling the response's json() method.
>>> import requests
>>> url = "https://randomfox.ca/floof"
>>> # fetch file
>>> response = requests.get(url, allow_redirects=True)
>>> # get json data
>>> json = response.json()
>>> print(json)
The parsed JSON is a Python dictionary. We extract the URL of the image as follows:
>>> img = json['image']
>>> print(img)
https://randomfox.ca/images/2.jpg
Next, we want to save the image. As mentioned previously, we use pathlib, an object-oriented module for handling filesystem paths. One of its advantages is better portability between operating systems. You can find more information about pathlib in my article on how to rename files.
To save the picture of our fox, we will use the Path.write_bytes(data) method to open the path in binary/bytes mode and write data to it.
>>> # import Path class from pathlib
>>> from pathlib import Path
>>> # define filename
>>> filename = Path('fox.jpg')
>>> # fetch file
>>> response = requests.get(img)
>>> # save file
>>> filename.write_bytes(response.content)
Our file has now been saved as fox.jpg. We just saw how to extract the URL in the API response by inspecting the JSON data.
Closing Thoughts on How to Download a File in Python
We have now learned how to download a file in Python over HTTP and from an API. I encourage you to play with the code and fetch files from different APIs.
There is a lot more to learn about JSON, which is a widespread and handy format to store data. You can find more about it and Python programming with our Python programming track.
Last but not least, it is always a good idea to reflect on your Python programming skills. To help you with this process, check out my article on Things That Can Help You Write Better Python Code and browse our content on LearnPython.com. Keep learning every day!
How to Download Files in Python
Esther Vaati Last updated Dec 29, 2022
Python provides several ways to download files from the internet. This can be done over HTTP using the urllib package or the requests library. This tutorial will discuss how to use these libraries to download files from URLs using Python.
Requests Library
The requests library is one of the most popular libraries in Python. Requests allows you to send HTTP/1.1 requests without the need to manually add query strings to your URLs, or form-encode your POST data.
With the requests library, you can perform a lot of functions including:
- adding form data
- adding multipart files
- accessing the response data in Python
Making Requests
The first thing you need to do is install the library, and it’s as simple as:
pip install requests
To test if the installation has been successful, you can do a very easy test in your Python interpreter by simply typing:
import requests
If the installation has been successful, there will be no errors.
Making a GET request
Making requests is very easy, as illustrated below.
req = requests.get("https://www.google.com")
The above command will get the Google web page and store the information in the req variable. We can then go on to get other attributes as well.
For instance, to know if fetching the Google web page was successful, we will query the status_code.
req = requests.get("https://www.google.com")
req.status_code
Downloading files from web using Python?
Python provides different modules, such as urllib and requests, to download files from the web. I am going to use the requests library to download files from URLs efficiently.
Let’s look at the step-by-step procedure for downloading files from URLs with the requests library:
1. Import module
import requests
2. Get the link or url
url = 'https://www.facebook.com/favicon.ico'
r = requests.get(url, allow_redirects=True)
3. Save the content with name.
# 'wb' writes the raw bytes; the with block closes the file afterwards
with open('facebook.ico', 'wb') as f:
    f.write(r.content)
This saves the file as facebook.ico.
Example
import requests

url = 'https://www.facebook.com/favicon.ico'
r = requests.get(url, allow_redirects=True)
# 'wb' writes the raw bytes; the with block closes the file afterwards
with open('facebook.ico', 'wb') as f:
    f.write(r.content)
Result
We can see the downloaded file (the icon) in our current working directory.
But we may need to download different kinds of files from the web, such as images, text, and video. So let’s first get the type of data the URL links to:
>>> r = requests.get(url, allow_redirects=True)
>>> print(r.headers.get('content-type'))
image/png
However, there is a smarter way, which involves fetching just the headers of a URL before actually downloading it. This allows us to skip downloading files which weren’t meant to be downloaded.
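Here is a minimal sketch of what a helper like is_downloadable could look like. It sends a HEAD request, which returns only the headers, and inspects the content-type. The split into a separate looks_downloadable function and the choice to treat a missing header as "not downloadable" are assumptions made for this illustration.

```python
def looks_downloadable(content_type):
    """Heuristic: treat text and HTML responses as pages, not files."""
    if not content_type:
        return False  # assumption: no content-type header means "skip it"
    content_type = content_type.lower()
    return "text" not in content_type and "html" not in content_type


def is_downloadable(url):
    """Fetch only the headers of url and apply the heuristic above."""
    # Imported here so the pure helper above has no third-party dependency
    import requests

    response = requests.head(url, allow_redirects=True)
    return looks_downloadable(response.headers.get("content-type"))
```

Keeping the header check separate from the network call makes the rule easy to try out on plain strings such as 'image/png' or 'text/html'.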
>>> print(is_downloadable('https://www.youtube.com/watch?v=xCglV_dqFGI'))
False
>>> print(is_downloadable('https://www.facebook.com/favicon.ico'))
True
To restrict the download by file size, we can read the file size from the content-length header and act on it as required.
contentLength = header.get('content-length', None)
# header values are strings, so convert to int before comparing
if contentLength and int(contentLength) > 2e8:  # 200 mb approx
    return False
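The size check can also be wrapped in a standalone helper. The sketch below is one way to do it. The name within_size_limit, the default limit, and the decision to allow downloads whose size is unknown or malformed are all assumptions for illustration.

```python
def within_size_limit(headers, max_bytes=200_000_000):
    """Return False only when content-length is present and exceeds max_bytes."""
    content_length = headers.get("content-length")
    if content_length is None:
        return True  # assumption: allow downloads of unknown size
    try:
        # Header values arrive as strings, so convert before comparing
        return int(content_length) <= max_bytes
    except ValueError:
        return True  # assumption: ignore a malformed header value


print(within_size_limit({"content-length": "1024"}))
print(within_size_limit({"content-length": "300000000"}))
```

The headers object of a requests response looks keys up case-insensitively. A plain dict, as used in the two calls above, needs the exact lowercase key.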
Get filename from an URL
To get the filename, we can parse the URL. Below is a sample routine which fetches the last string after the final slash (/).
url = "http://www.computersolution.tech/wp-content/uploads/2016/05/tutorialspoint-logo.png"
# split once from the right and keep the part after the last slash
if '/' in url:
    print(url.rsplit('/', 1)[1])
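Splitting on the last slash keeps any query string or fragment in the result. As an alternative sketch, the standard library's urllib.parse can isolate the path first. This is a different technique from the rsplit routine above, and the function name filename_from_url is hypothetical.

```python
import posixpath
from urllib.parse import unquote, urlsplit


def filename_from_url(url):
    """Return the last path segment of url, ignoring query string and fragment."""
    # urlsplit separates the path from the scheme, host, query, and fragment
    path = urlsplit(url).path
    # basename keeps what follows the final slash; unquote decodes %XX escapes
    return unquote(posixpath.basename(path))


print(filename_from_url(
    "http://www.computersolution.tech/wp-content/uploads/2016/05/tutorialspoint-logo.png"
))
```

For a URL that ends in a slash, the function returns an empty string, which signals that the name must come from somewhere else.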
Above will give the filename of the url. However, there are many cases where filename information is not present in the url for example – http://url.com/download. In such a case, we need to get the Content-Disposition header, which contains the filename information.
import re
import requests

def getFilename_fromCd(cd):
    """Get filename from content-disposition"""
    if not cd:
        return None
    fname = re.findall(r'filename=(.+)', cd)
    if len(fname) == 0:
        return None
    return fname[0]

url = 'http://google.com/favicon.ico'
r = requests.get(url, allow_redirects=True)
filename = getFilename_fromCd(r.headers.get('content-disposition'))
if filename is None:
    # no Content-Disposition filename: fall back to the last URL segment
    filename = url.rsplit('/', 1)[1]
with open(filename, 'wb') as f:
    f.write(r.content)
Combining the URL-parsing code above with this program gives you a usable filename in most cases: from the Content-Disposition header when it is present, and from the URL otherwise.