- Downloading files from web using Python?
- 1. Import module
- 2. Get the link or url
- 3. Save the content with name.
- Example
- Result
- Get filename from an URL
- Downloading Files In Python Using Requests Module
- Introduction
- Not all URLs pointing to downloadable resources
- Define a function to verify a downloadable resource
- Checking Content-Type of the request header
- Restricting the file size of the downloading resource
- Getting the file name from the URL
- Python library to get filename of content downloaded with HTTP
- Requirements
- Non-requirements
Downloading files from web using Python?
Python provides several modules, such as urllib and requests, for downloading files from the web. This tutorial uses the requests library to download files from URLs.
Let’s walk through the procedure for downloading a file from a URL with the requests library, step by step:
1. Import module
import requests
2. Get the link or url
url = 'https://www.facebook.com/favicon.ico'
r = requests.get(url, allow_redirects=True)
3. Save the content with name.
open('facebook.ico', 'wb').write(r.content)
This saves the file as facebook.ico.
Example
import requests

url = 'https://www.facebook.com/favicon.ico'
r = requests.get(url, allow_redirects=True)
open('facebook.ico', 'wb').write(r.content)
Result
The downloaded icon file now appears in the current working directory.
But we may need to download different kinds of files from the web, such as images, text, or video. So let’s first check the type of data the URL points to:
>>> r = requests.get(url, allow_redirects=True)
>>> print(r.headers.get('content-type'))
image/png
However, there is a smarter way, which involves fetching only the headers of a URL before actually downloading it. This lets us skip files that weren’t meant to be downloaded. The check below uses an is_downloadable helper, which issues a HEAD request and inspects the Content-Type header; its definition appears later, in the section on checking the Content-Type.
>>> print(is_downloadable('https://www.youtube.com/watch?v=xCglV_dqFGI'))
False
>>> print(is_downloadable('https://www.facebook.com/favicon.ico'))
True
To restrict downloads by file size, we can read the file size from the Content-Length header and act on it as required. Header values are strings, so the value must be converted to a number before comparing.
contentLength = header.get('content-length', None)
if contentLength and int(contentLength) > 2e8:  # 200 MB approx
    return False
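The size check can also be written as a standalone helper that tolerates a missing or malformed header. This is a minimal sketch, not part of the original tutorial: the name `within_size_limit` and its `max_bytes` parameter are assumptions made for illustration, and `headers` can be any mapping, such as the `headers` attribute of a requests response.

```python
def within_size_limit(headers, max_bytes=200_000_000):
    """Return False only when Content-Length is present and exceeds max_bytes."""
    raw = headers.get("Content-Length")
    if raw is None:
        # The server did not announce a size, so there is nothing to compare.
        return True
    try:
        size = int(raw)  # header values are strings
    except ValueError:
        # Malformed value: treat the size as unknown rather than failing.
        return True
    return size <= max_bytes


print(within_size_limit({"Content-Length": "1024"}))
print(within_size_limit({"Content-Length": "300000000"}))
```

A plain dict is case-sensitive, so the key must match exactly here; the header mapping that requests returns is case-insensitive.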
Get filename from an URL
To get the filename, we can parse the URL. Below is a sample routine that fetches the last segment after the final forward slash (/).
url = "http://www.computersolution.tech/wp-content/uploads/2016/05/tutorialspoint-logo.png"
if '/' in url:
    print(url.rsplit('/', 1)[1])
The code above prints the filename from the URL. However, in many cases the URL carries no filename information, for example http://url.com/download. In such cases, we need the Content-Disposition header, which can carry the filename information.
import re
import requests

def getFilename_fromCd(cd):
    """Get filename from content-disposition."""
    if not cd:
        return None
    fname = re.findall('filename=(.+)', cd)
    if len(fname) == 0:
        return None
    return fname[0]

url = 'http://google.com/favicon.ico'
r = requests.get(url, allow_redirects=True)
filename = getFilename_fromCd(r.headers.get('content-disposition'))
if filename is None:
    # No Content-Disposition filename: fall back to the last URL segment.
    filename = url.rsplit('/', 1)[1]
open(filename, 'wb').write(r.content)
Combining the URL-parsing code above with this program will give you a filename in most cases: from the Content-Disposition header when it is present, and from the URL otherwise.
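The two approaches can be folded into one helper that prefers the header and falls back to the URL path. This is a rough sketch under stated assumptions: the function name `choose_filename` and the `default` value are invented for illustration, and the regular expression only handles the simple `filename=` form, not the extended `filename*=` form.

```python
import posixpath
import re
from urllib.parse import unquote, urlsplit


def choose_filename(url, content_disposition=None, default="download"):
    """Prefer the Content-Disposition filename, then the URL path, then default."""
    if content_disposition:
        match = re.search(r'filename=\s*"?([^";]+)"?', content_disposition,
                          re.IGNORECASE)
        if match:
            # basename() drops any directory part the server may have sent.
            return posixpath.basename(match.group(1).strip()) or default
    # urlsplit() separates the path from the query string and fragment.
    path = urlsplit(url).path
    return posixpath.basename(unquote(path)) or default


print(choose_filename("http://url.com/download", 'attachment; filename="report.pdf"'))
print(choose_filename("http://url.com/a/b.txt?x=1/2"))
```

Parsing the URL with `urlsplit` rather than splitting on the last slash keeps a query string such as `?x=1/2` out of the filename.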
Downloading Files In Python Using Requests Module
This post shows how to download a resource from a given URL using the requests module. Other modules can do the same job, but I focus on requests here and leave the other methods for you to discover. Let’s get started.
Introduction
Below is a simple snippet that downloads the Google logo shown on the Google search page from the link https://www.google.co.uk/images/branding/googlelogo/1x/googlelogo_color_272x92dp.png
import requests

url = "https://www.google.co.uk/images/branding/googlelogo/1x/googlelogo_color_272x92dp.png"
r = requests.get(url, allow_redirects=True)
open("google.ico", "wb").write(r.content)
The file named google.ico is saved into the current working directory. A piece of cake, right? In practice, we face more difficult situations, which I am going to show you now.
Not all URLs pointing to downloadable resources
In the real world, you will almost certainly run into resources that are protected and that users are not allowed to download. For example, YouTube videos are secured to prevent bulk downloading. People develop browser extensions or standalone applications to download YouTube videos; however, Google detects such activity and has increasingly protected its data. Therefore, it is important to check whether the resource of interest can be downloaded before requesting its content. The snippet below shows how to check this based on the Content-Type header of the response for the requested URL.
import requests

def extract_content_type(_url):
    # Fetch the resource and return its Content-Type response header.
    r = requests.get(_url, allow_redirects=True)
    return r.headers.get("Content-Type")

url = "https://www.google.co.uk/images/branding/googlelogo/1x/googlelogo_color_272x92dp.png"
print(extract_content_type(url))
url = "https://www.youtube.com/watch?v=ylk5AYyOcGI"
print(extract_content_type(url))
The output of the script above looks like this:
image/png
text/html; charset=utf-8
The extract_content_type function returns the MIME type of the remote resource as a string. In the example above, we might expect a video type from the YouTube URL, but the server returns text/html, while the first URL returns the expected image/png. In other words, when the content type of a response is text/html, we would download only an HTML document rather than a media file with a type such as image/png or video/mp4.
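As the second output line shows, a Content-Type value may carry parameters such as `charset` after the MIME type. A small sketch of separating the two follows; the helper name `media_type` is an assumption introduced here, not something from the requests library.

```python
def media_type(content_type):
    """Return the lowercase MIME type without parameters, or None if absent."""
    if not content_type:
        return None
    # Everything after the first ';' is a parameter such as charset.
    return content_type.split(";", 1)[0].strip().lower()


print(media_type("text/html; charset=utf-8"))
print(media_type("image/png"))
```

Comparing the normalised value (for example against `text/html`) avoids accidental matches on parameter text.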
Define a function to verify a downloadable resource
As explained in the previous section, it is necessary to check whether a resource can be downloaded before requesting its content.
Checking Content-Type of the request header
The function below can do what we need by checking the content type from the header.
def is_downloadable(_url):
    """Does the url contain a downloadable resource?"""
    h = requests.head(_url, allow_redirects=True)
    header = h.headers
    # Default to '' so a missing Content-Type header does not raise.
    content_type = header.get('content-type', '')
    if 'text' in content_type.lower():
        return False
    if 'html' in content_type.lower():
        return False
    return True
Applied to the two URLs from the previous examples, this function returns False for the YouTube URL and True for Google’s icon link.
Restricting the file size of the downloading resource
We might have another restriction on the resource, for example downloading only files no larger than 100 MB. The code below does this by inspecting the Content-Length header of the response; the header value is a string, so it is converted to an integer before the comparison.
content_length = header.get('content-length', None)
if content_length and int(content_length) > 1e8:  # 100 MB approx
    return False
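A server may omit Content-Length altogether, in which case the header check cannot enforce the limit. One possible complement, sketched here under assumptions, is to count bytes while writing the file and stop once the limit is passed. The function name `save_with_limit` and its parameters are hypothetical; `chunks` is any iterable of bytes objects.

```python
import os


def save_with_limit(chunks, path, max_bytes=100_000_000):
    """Write chunks to path; return False and remove the file if too large."""
    written = 0
    too_large = False
    with open(path, "wb") as fh:
        for chunk in chunks:
            written += len(chunk)
            if written > max_bytes:
                too_large = True
                break
            fh.write(chunk)
    if too_large:
        os.remove(path)  # discard the partial download
        return False
    return True
```

With requests, the chunks could come from a streamed response, for example `requests.get(url, stream=True)` followed by `r.iter_content(chunk_size=8192)`, so that the body is never held in memory as a whole.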
Getting the file name from the URL
Again, to obtain the file name of the resource, we can use the Content-Disposition header of the response.
import re
import requests

def get_filename_from_url(_url):
    """Get filename from content-disposition."""
    r = requests.get(_url, allow_redirects=True)
    cd = r.headers.get('content-disposition')
    if not cd:
        return None
    filename = re.findall('filename=(.+)', cd)
    if len(filename) == 0:
        return None
    return filename[0]
Combined with URL parsing, the method above for getting the filename from the Content-Disposition header works in most cases.
Voilà! If you have any feedback, please don’t hesitate to leave a comment.
Python library to get filename of content downloaded with HTTP
I download a file using the get function of the Python requests library. For storing the file, I’d like to determine the filename the way a web browser would for its ‘save’ or ‘save as…’ dialog. Easy, right? I can just get it from the Content-Disposition HTTP header, accessible on the response object:
import re
d = r.headers['content-disposition']
fname = re.findall("filename=(.+)", d)
But looking more closely at this topic, it isn’t that easy: According to RFC 6266 section 4.3, and the grammar in section 4.1, the value can be an unquoted token (e.g. the_report.pdf) or a quoted string that can also contain whitespace (e.g. "the report.pdf") and escape sequences (the latter are discouraged, though, thus their handling isn’t a hard requirement for me). Further,
when both "filename" and "filename*" are present in a single header field value, [we] SHOULD pick "filename*" and ignore "filename".
The value of filename*, though, is yet a bit more complicated than the one of filename. Also, the RFC seems to allow for additional whitespace around the =. Thus, for the examples listed in the RFC, I’d want the following results:
Content-Disposition: Attachment; filename=example.html
filename: example.html
Content-Disposition: INLINE; FILENAME= "an example.html"
filename: an example.html
Content-Disposition: attachment; filename*= UTF-8''%e2%82%ac%20rates
filename: € rates
Content-Disposition: attachment; filename="EURO rates"; filename*=utf-8''%e2%82%ac%20rates
filename: € rates here, too (not EURO rates, as filename* takes precedence)
I could implement the parsing of the Content-Disposition header I get from requests accordingly myself, but if I can avoid it and use an existing proven implementation instead, I’d prefer that. Is there a Python library that can do this?
Requirements
- provide a function that extracts and returns the proper filename (if there is one) from a passed requests response or
- provide a function that extracts and returns the proper filename (if there is one) from a passed Content-Disposition header field value (a string) or
- provide a function accepting all the same parameters as requests.get that performs the request, and returns the response as well as the filename (if there is one) or
- provide something similarly practical
Non-requirements
What it doesn’t have to handle (but if it does, even better) as I can do that myself:
- sanitize values so that they don’t contain directory names or other path elements except for a single filename, so storing with that name won’t cause files to be created or overwritten at arbitrary locations
- produce "save" filename extensions "optimally matching the media type of the received payload" (see section 4.3)
- sanitize filenames to prevent user confusion (section 4.3 mentions replacing "control characters and leading and trailing whitespace")
- provide a fall-back
- for when neither the filename nor the filename* disposition parameter are present or
- for when the ones that are present cannot be parsed or
- for when the complete Content-Disposition header is missing
Though it should report that consistently (be it by raising or by returning None or ''), so that I can let my own fall-back kick in.
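For reference, the four expected results listed earlier can be reproduced with a short hand-rolled parser. This is only a sketch and not the proven library the question asks for: the function name `filename_from_content_disposition` is made up for illustration, it takes the header field value (without the `Content-Disposition:` prefix), it accepts only the UTF-8 and ISO-8859-1 charsets for `filename*`, and it returns None whenever no filename can be extracted.

```python
import re
from urllib.parse import unquote

# One ";name=value" parameter; the value is a quoted string or a bare token.
_PARAM = re.compile(r';\s*([A-Za-z0-9_\-]+\*?)\s*=\s*("(?:[^"\\]|\\.)*"|[^;]*)')


def filename_from_content_disposition(value):
    """Return the filename from a Content-Disposition value, or None."""
    if not value:
        return None
    params = {}
    for name, raw in _PARAM.findall(value):
        params.setdefault(name.lower(), raw.strip())  # first occurrence wins

    # filename* (charset'language'percent-encoded) takes precedence.
    ext = params.get("filename*")
    if ext is not None:
        charset, _, rest = ext.partition("'")
        _language, sep, encoded = rest.partition("'")
        if sep and charset.lower() in ("utf-8", "iso-8859-1"):
            try:
                return unquote(encoded, encoding=charset, errors="strict")
            except UnicodeDecodeError:
                pass  # fall through to the plain filename parameter

    plain = params.get("filename")
    if plain is None:
        return None
    if len(plain) >= 2 and plain[0] == plain[-1] == '"':
        # Strip the quotes and undo backslash escapes inside the quoted string.
        plain = re.sub(r"\\(.)", r"\1", plain[1:-1])
    return plain or None


print(filename_from_content_disposition('INLINE; FILENAME= "an example.html"'))
```

The returned name is not sanitized, so path elements and control characters would still need the separate handling described under the non-requirements.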