urllib.parse – Split URLs into Components
The urllib.parse module provides functions for manipulating URLs and their component parts, to either break them down or build them up.
Parsing
The return value from the urlparse() function is a ParseResult object that acts like a tuple with six elements.
    from urllib.parse import urlparse

    url = 'http://netloc/path;param?query=arg#frag'
    parsed = urlparse(url)
    print(parsed)
The parts of the URL available through the tuple interface are the scheme, network location, path, path segment parameters (separated from the path by a semicolon), query, and fragment.
    $ python3 urllib_parse_urlparse.py

    ParseResult(scheme='http', netloc='netloc', path='/path', params='param', query='query=arg', fragment='frag')
Although the return value acts like a tuple, it is really based on a namedtuple, a subclass of tuple that supports accessing the parts of the URL via named attributes as well as indexes. In addition to being easier to use, the attribute API also offers access to several values not available in the tuple API.
    from urllib.parse import urlparse

    url = 'http://user:pwd@NetLoc:80/path;param?query=arg#frag'
    parsed = urlparse(url)
    print('scheme :', parsed.scheme)
    print('netloc :', parsed.netloc)
    print('path :', parsed.path)
    print('params :', parsed.params)
    print('query :', parsed.query)
    print('fragment:', parsed.fragment)
    print('username:', parsed.username)
    print('password:', parsed.password)
    print('hostname:', parsed.hostname)
    print('port :', parsed.port)
The username and password are available when present in the input URL, and set to None when not. The hostname is taken from netloc, converted to lower case, with the user information and port stripped. The port is converted to an integer when present and is None when not.
    $ python3 urllib_parse_urlparseattrs.py

    scheme : http
    netloc : user:pwd@NetLoc:80
    path : /path
    params : param
    query : query=arg
    fragment: frag
    username: user
    password: pwd
    hostname: netloc
    port : 80
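As a small sketch of the None behavior described above, the URL below (a made-up example, not taken from the listings here) has no user information and no explicit port, so three of the extra attributes come back as None:

```python
from urllib.parse import urlparse

# Hypothetical URL: mixed-case host, no credentials, no port.
bare = urlparse('http://Example.COM/path')

print('netloc  :', bare.netloc)    # kept exactly as written
print('username:', bare.username)  # None, no user info in the URL
print('password:', bare.password)  # None
print('hostname:', bare.hostname)  # lower-cased host
print('port    :', bare.port)      # None, no explicit port
```

Code that reads these attributes should be prepared for None rather than assuming a string or an integer.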
The urlsplit() function is an alternative to urlparse(). It behaves a little differently, because it does not split the parameters from the URL. This is useful for URLs following RFC 2396, which supports parameters for each segment of the path.
    from urllib.parse import urlsplit

    url = 'http://user:pwd@NetLoc:80/p1;para/p2;para?query=arg#frag'
    parsed = urlsplit(url)
    print(parsed)
    print('scheme :', parsed.scheme)
    print('netloc :', parsed.netloc)
    print('path :', parsed.path)
    print('query :', parsed.query)
    print('fragment:', parsed.fragment)
    print('username:', parsed.username)
    print('password:', parsed.password)
    print('hostname:', parsed.hostname)
    print('port :', parsed.port)
Since the parameters are not split out, the tuple API will show five elements instead of six, and there is no params attribute.
    $ python3 urllib_parse_urlsplit.py

    SplitResult(scheme='http', netloc='user:pwd@NetLoc:80', path='/p1;para/p2;para', query='query=arg', fragment='frag')
    scheme : http
    netloc : user:pwd@NetLoc:80
    path : /p1;para/p2;para
    query : query=arg
    fragment: frag
    username: user
    password: pwd
    hostname: netloc
    port : 80
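To make the difference between the two functions concrete, the following sketch runs both on the same URL. The URL is an illustrative one with parameters on two path segments. With urlparse(), only the parameters of the last segment are split out.

```python
from urllib.parse import urlparse, urlsplit

# Illustrative URL with parameters on two path segments.
url = 'http://netloc/p1;para/p2;para?query=arg#frag'

six = urlparse(url)   # 6-element ParseResult
five = urlsplit(url)  # 5-element SplitResult

print(len(six), len(five))
print('urlparse path  :', six.path)
print('urlparse params:', six.params)
print('urlsplit path  :', five.path)
print('has params attr:', hasattr(five, 'params'))
```

When the path may carry parameters on more than one segment, urlsplit() keeps the path intact and leaves any further splitting to the caller.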
To simply strip the fragment identifier from a URL, such as when finding a base page name from a URL, use urldefrag().
    from urllib.parse import urldefrag

    original = 'http://netloc/path;param?query=arg#frag'
    print('original:', original)
    d = urldefrag(original)
    print('url :', d.url)
    print('fragment:', d.fragment)
The return value is a DefragResult, based on namedtuple, containing the base URL and the fragment.
    $ python3 urllib_parse_urldefrag.py

    original: http://netloc/path;param?query=arg#frag
    url : http://netloc/path;param?query=arg
    fragment: frag
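Because the result behaves like a tuple, it can also be unpacked directly. This short sketch uses a made-up URL with no fragment to show that the fragment then comes back as an empty string:

```python
from urllib.parse import urldefrag

# Hypothetical URL with no '#fragment' part.
url, fragment = urldefrag('http://netloc/path?query=arg')

print('url     :', url)
print('fragment:', repr(fragment))  # empty string, not None
```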
Unparsing
There are several ways to assemble the parts of a split URL back together into a single string. The parsed URL object has a geturl() method.
    from urllib.parse import urlparse

    original = 'http://netloc/path;param?query=arg#frag'
    print('ORIG :', original)
    parsed = urlparse(original)
    print('PARSED:', parsed.geturl())
geturl() only works on the object returned by urlparse() or urlsplit().
    $ python3 urllib_parse_geturl.py

    ORIG : http://netloc/path;param?query=arg#frag
    PARSED: http://netloc/path;param?query=arg#frag
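Since the parsed result is a named tuple, one way to change a component before rebuilding the URL is the standard namedtuple _replace() method, which returns a new object of the same type. The sketch below is only an illustration, and the replacement query value is made up:

```python
from urllib.parse import urlparse

parsed = urlparse('http://netloc/path;param?query=arg#frag')

# _replace() returns a new ParseResult; the original is unchanged.
updated = parsed._replace(query='page=2', fragment='')

print('ORIG   :', parsed.geturl())
print('UPDATED:', updated.geturl())
```

The result of _replace() is still a ParseResult, so geturl() remains available on it.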
A regular tuple containing strings can be combined into a URL with urlunparse().
    from urllib.parse import urlparse, urlunparse

    original = 'http://netloc/path;param?query=arg#frag'
    print('ORIG :', original)
    parsed = urlparse(original)
    print('PARSED:', type(parsed), parsed)
    t = parsed[:]
    print('TUPLE :', type(t), t)
    print('NEW :', urlunparse(t))
While the ParseResult returned by urlparse() can be used as a tuple, this example explicitly creates a new tuple to show that urlunparse() works with normal tuples, too.
    $ python3 urllib_parse_urlunparse.py

    ORIG : http://netloc/path;param?query=arg#frag
    PARSED: <class 'urllib.parse.ParseResult'> ParseResult(scheme='http', netloc='netloc', path='/path', params='param', query='query=arg', fragment='frag')
    TUPLE : <class 'tuple'> ('http', 'netloc', '/path', 'param', 'query=arg', 'frag')
    NEW : http://netloc/path;param?query=arg#frag
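The tuple does not have to come from a parsed URL at all. As a sketch, the component values below are invented for illustration and assembled by hand, using urlunparse() for six parts and its five-part counterpart urlunsplit():

```python
from urllib.parse import urlunparse, urlunsplit

# Six parts: scheme, netloc, path, params, query, fragment.
six_parts = ('https', 'www.example.com', '/a/b', '', 'x=1', 'top')
from_six = urlunparse(six_parts)

# Five parts: the same fields without params.
five_parts = ('https', 'www.example.com', '/a/b', 'x=1', '')
from_five = urlunsplit(five_parts)

print(from_six)
print(from_five)
```

Empty strings stand in for components that should be left out of the result.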
If the input URL included superfluous parts, those may be dropped from the reconstructed URL.
    from urllib.parse import urlparse, urlunparse

    original = 'http://netloc/path;?#'
    print('ORIG :', original)
    parsed = urlparse(original)
    print('PARSED:', type(parsed), parsed)
    t = parsed[:]
    print('TUPLE :', type(t), t)
    print('NEW :', urlunparse(t))
In this case, the delimiters for parameters, query, and fragment are present in the original URL, but the values themselves are all empty. The new URL does not look the same as the original, but is equivalent according to the standard.
    $ python3 urllib_parse_urlunparseextra.py

    ORIG : http://netloc/path;?#
    PARSED: <class 'urllib.parse.ParseResult'> ParseResult(scheme='http', netloc='netloc', path='/path', params='', query='', fragment='')
    TUPLE : <class 'tuple'> ('http', 'netloc', '/path', '', '', '')
    NEW : http://netloc/path
Joining
In addition to parsing URLs, urllib.parse includes urljoin() for constructing absolute URLs from relative URLs.
    from urllib.parse import urljoin

    print(urljoin('http://www.example.com/path/file.html',
                  'anotherfile.html'))
    print(urljoin('http://www.example.com/path/file.html',
                  '../anotherfile.html'))
In the example, the relative portion of the path ("../") is taken into account when the second URL is computed.
    $ python3 urllib_parse_urljoin.py

    http://www.example.com/path/anotherfile.html
    http://www.example.com/anotherfile.html
Non-relative paths are handled in the same way as by os.path.join().
    from urllib.parse import urljoin

    print(urljoin('http://www.example.com/path/',
                  '/subpath/file.html'))
    print(urljoin('http://www.example.com/path/',
                  'subpath/file.html'))
If the path being joined to the URL starts with a slash (/), it resets the URL's path to the top level. If it does not start with a slash, it is appended to the end of the path for the URL.
    $ python3 urllib_parse_urljoin_with_path.py

    http://www.example.com/subpath/file.html
    http://www.example.com/path/subpath/file.html
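Two further cases are easy to trip over, shown here as a sketch with made-up example URLs. A base URL without a trailing slash has its last path segment replaced rather than extended. A second argument that is already an absolute URL replaces the base entirely.

```python
from urllib.parse import urljoin

base = 'http://www.example.com/path'  # note: no trailing slash

# The last segment of the base ('path') is treated like a file name
# and is replaced by the relative path.
no_slash = urljoin(base, 'subpath/file.html')

# With a trailing slash, the relative path is appended instead.
with_slash = urljoin(base + '/', 'subpath/file.html')

# An absolute second argument wins over the base.
absolute = urljoin(base, 'http://other.example.org/file.html')

print(no_slash)
print(with_slash)
print(absolute)
```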
Encoding Query Arguments
Before arguments can be added to a URL, they need to be encoded.
    from urllib.parse import urlencode

    query_args = {
        'q': 'query string',
        'foo': 'bar',
    }
    encoded_args = urlencode(query_args)
    print('Encoded:', encoded_args)
Encoding replaces special characters like spaces to ensure they are passed to the server using a format that complies with the standard.
    $ python3 urllib_parse_urlencode.py

    Encoded: q=query+string&foo=bar
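One simple way to attach the encoded arguments to a URL is to append them after a question mark. The sketch below assumes a made-up endpoint, http://www.example.com/search, that does not already carry a query string:

```python
from urllib.parse import urlencode

# Hypothetical endpoint with no existing query string.
base_url = 'http://www.example.com/search'

query_args = {
    'q': 'query string',
    'foo': 'bar',
}

# Encode the arguments, then append them after the '?' separator.
full_url = base_url + '?' + urlencode(query_args)
print(full_url)
```

If the base URL might already include a query, it is safer to parse it first and rebuild the URL than to concatenate strings.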
To pass a sequence of values using separate occurrences of the variable in the query string, set doseq to True when calling urlencode().
    from urllib.parse import urlencode

    query_args = {
        'foo': ['foo1', 'foo2'],
    }
    print('Single :', urlencode(query_args))
    print('Sequence:', urlencode(query_args, doseq=True))
The result is a query string with several values associated with the same name.
    $ python3 urllib_parse_urlencode_doseq.py

    Single : foo=%5B%27foo1%27%2C+%27foo2%27%5D
    Sequence: foo=foo1&foo=foo2
To decode the query string, use parse_qs() or parse_qsl().
    from urllib.parse import parse_qs, parse_qsl

    encoded = 'foo=foo1&foo=foo2'
    print('parse_qs :', parse_qs(encoded))
    print('parse_qsl:', parse_qsl(encoded))
The return value from parse_qs() is a dictionary mapping names to values, while parse_qsl() returns a list of tuples containing a name and a value.
    $ python3 urllib_parse_parse_qs.py

    parse_qs : {'foo': ['foo1', 'foo2']}
    parse_qsl: [('foo', 'foo1'), ('foo', 'foo2')]
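Each value in the dictionary from parse_qs() is a list, so the result can be passed back to urlencode() with doseq=True. The sketch below also uses the keep_blank_values argument, which is not covered above, to show that names with empty values are dropped unless it is set. The query string is made up for the example.

```python
from urllib.parse import parse_qs, urlencode

# Hypothetical query string with a repeated name and a blank value.
encoded = 'foo=foo1&foo=foo2&empty='

default = parse_qs(encoded)
with_blanks = parse_qs(encoded, keep_blank_values=True)

print('default    :', default)
print('with blanks:', with_blanks)

# Round trip: the lists are expanded back into repeated names.
round_trip = urlencode(with_blanks, doseq=True)
print('round trip :', round_trip)
```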
Special characters within the query arguments that might cause parse problems with the URL on the server side are "quoted" when passed to urlencode(). To quote them locally to make safe versions of the strings, use the quote() or quote_plus() functions directly.
    from urllib.parse import quote, quote_plus, urlencode

    url = 'http://localhost:8080/~hellmann/'
    print('urlencode() :', urlencode({'url': url}))
    print('quote() :', quote(url))
    print('quote_plus():', quote_plus(url))
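The quoting implementation in quote_plus() is more aggressive about the characters it replaces: quote() leaves the slash (/) unescaped by default, while quote_plus() escapes it as well and replaces spaces with plus signs.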
    $ python3 urllib_parse_quote.py

    urlencode() : url=http%3A%2F%2Flocalhost%3A8080%2F~hellmann%2F
    quote() : http%3A//localhost%3A8080/~hellmann/
    quote_plus(): http%3A%2F%2Flocalhost%3A8080%2F~hellmann%2F
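The difference comes down to which characters are considered safe and how spaces are written. As a sketch with a made-up input string, the safe argument of quote() controls which characters are left alone:

```python
from urllib.parse import quote, quote_plus

text = '/a b/'  # illustrative value containing slashes and a space

default_quote = quote(text)         # '/' is safe by default
strict_quote = quote(text, safe='')  # nothing extra is safe
plus_quote = quote_plus(text)       # space becomes '+'

print('quote()          :', default_quote)
print("quote(safe='')   :", strict_quote)
print('quote_plus()     :', plus_quote)
```

quote() suits path components, where slashes are meaningful, while quote_plus() matches the form-style encoding that urlencode() produces for query values.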
To reverse the quote operations, use unquote() or unquote_plus(), as appropriate.
    from urllib.parse import unquote, unquote_plus

    print(unquote('http%3A//localhost%3A8080/%7Ehellmann/'))
    print(unquote_plus(
        'http%3A%2F%2Flocalhost%3A8080%2F%7Ehellmann%2F'
    ))
The encoded value is converted back to a normal string URL.
    $ python3 urllib_parse_unquote.py

    http://localhost:8080/~hellmann/
    http://localhost:8080/~hellmann/
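Which function is appropriate depends on how the plus sign should be treated. In this sketch, using a made-up encoded value, unquote() leaves a plus sign alone, while unquote_plus() turns it back into a space:

```python
from urllib.parse import unquote, unquote_plus

encoded = 'query+string%21'  # illustrative form-style value

plain = unquote(encoded)        # '+' is kept as a literal plus
spaced = unquote_plus(encoded)  # '+' is decoded to a space

print('unquote()     :', plain)
print('unquote_plus():', spaced)
```

Values produced by quote_plus() or urlencode() should generally be decoded with unquote_plus(), so that spaces are restored.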
- Standard library documentation for urllib.parse
- urllib.request – Retrieve the contents of a resource identified by a URL.
- RFC 1738 – Uniform Resource Locator (URL) syntax
- RFC 1808 – Relative URLs
- RFC 2396 – Uniform Resource Identifier (URI) generic syntax
- RFC 3986 – Uniform Resource Identifier (URI) syntax