Python: get the domain from a URL

Extract the domain from a URL in Python 2 and 3

I have a URL like http://xxx.abcdef.com/fdfdf/ and I want to get xxx.abcdef.com. Which module can I use to accomplish this? I want to use the same module and method in both Python 2 and Python 3, and I don't like the try/except approach to Python 2/3 compatibility. Thank you so much!

2 Answers

# Python 2
from urlparse import urlparse
o = urlparse("http://xxx.abcdef.com/fdfdf/")
print o
print o.netloc

In Python 3, you import urlparse like so:

from urllib.parse import urlparse 
url = "http://xxx.abcdef.com/fdfdf/"
print(url.split('/')[2])  # xxx.abcdef.com
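To see why index 2 is the hostname, look at what split('/') produces (a quick sketch using the same URL):

```python
url = "http://xxx.abcdef.com/fdfdf/"

# Splitting on '/' puts the scheme at index 0, an empty string at
# index 1 (from the '//'), and the host at index 2.
parts = url.split('/')
print(parts)     # ['http:', '', 'xxx.abcdef.com', 'fdfdf', '']
print(parts[2])  # xxx.abcdef.com
```

Note that this hack breaks for URLs without a scheme (e.g. "xxx.abcdef.com/fdfdf/"), where index 2 is no longer the host; urlparse is safer.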

Side note: here's how to write an import of urlparse that works in either version:

import sys

if sys.version_info >= (3, 0):
    from urllib.parse import urlparse
elif sys.version_info >= (2, 5):
    from urlparse import urlparse

You can use the third-party library six, which takes care of compatibility issues between Python versions, together with the standard-library function urlparse to extract the hostname.

So all you need to do is install six and import urlparse:

from six.moves.urllib.parse import urlparse

u = urlparse("http://xxx.abcdef.com/fdfdf/")
print(u.hostname)



I have a link such as http://www.techcrunch.com/ and I would like to get just the techcrunch.com part of the link. How do I go about this in Python?

9 Answers

Getting the hostname is easy enough using urlparse:

import urlparse  # Python 2; in Python 3 use urllib.parse

hostname = urlparse.urlparse("http://www.techcrunch.com/").hostname

Getting the "root domain", however, is going to be more problematic, because it isn't defined in a syntactic sense. What's the root domain of "www.theregister.co.uk"? How about networks using default domains? "devbox12" could be a valid hostname.

One way to handle this would be to use the Public Suffix List, which attempts to catalogue both real top-level domains (e.g. ".com", ".net", ".org") as well as private domains which are used like TLDs (e.g. ".co.uk" or even ".github.io"). You can access the PSL from Python using the publicsuffix2 library:

import publicsuffix
import urlparse  # Python 2; in Python 3 use urllib.parse

def get_base_domain(url):
    # This causes an HTTP request; if your script is running more than,
    # say, once a day, you'd want to cache it yourself. Make sure you
    # update frequently, though!
    psl = publicsuffix.fetch()
    hostname = urlparse.urlparse(url).hostname
    return publicsuffix.get_public_suffix(hostname, psl)

Can you please explain how this code works: hostname = ".".join(len(hostname[-2]) < 4 and hostname[-3:] or hostname[-2:])? Thanks

@Joozty: Negative indices start from the end, so hostname[-2] means the next-to-last entry (in this case, of the hostname split by dots). foo and bar or baz works much like a ternary: if foo is true, return bar; otherwise, return baz. Finally, hostname[-3:] means the last three parts. All together, this means: if the next-to-last part of the hostname is shorter than four characters, take the last three parts and join them with dots; otherwise, take only the last two parts and join them.
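That one-liner can be unpacked into a small function. This is only a sketch of the heuristic being discussed, not a reliable rule: it knows nothing about the actual Public Suffix List and will misfire on hosts whose second-to-last label just happens to be short.

```python
def base_domain_heuristic(hostname):
    # If the next-to-last label is shorter than 4 characters
    # (e.g. 'co' in 'www.theregister.co.uk'), keep the last three
    # labels; otherwise keep only the last two.
    parts = hostname.split('.')
    return '.'.join(parts[-3:] if len(parts[-2]) < 4 else parts[-2:])

print(base_domain_heuristic('www.theregister.co.uk'))  # theregister.co.uk
print(base_domain_heuristic('www.techcrunch.com'))     # techcrunch.com
```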

For some reason, even after installing the module, on Python 3 I get ImportError: cannot import name 'get_public_suffix'. I couldn't find any answer online or in the documentation, so I just used tldextract instead, which just works! Of course, I had to sudo pip3 install tldextract first.

In the spirit of TIMTOWTDI (there is more than one way to do it):

>>> from urllib.parse import urlparse  # Python 3.x
>>> parsed_uri = urlparse('http://www.stackoverflow.com/questions/41899120/whatever')  # returns six components
>>> domain = '{uri.netloc}/'.format(uri=parsed_uri)
>>> result = domain.replace('www.', '')  # as per your case
>>> print(result)
stackoverflow.com/
>>> import tldextract  # looks up TLDs in the Public Suffix List, maintained by Mozilla volunteers
>>> tldextract.extract('http://forums.news.cnn.com/')
ExtractResult(subdomain='forums.news', domain='cnn', suffix='com')
>>> extracted = tldextract.extract('http://www.techcrunch.com/')
>>> '{}.{}'.format(extracted.domain, extracted.suffix)
'techcrunch.com'

tldextract, on the other hand, knows what all gTLDs (generic top-level domains) and ccTLDs (country-code top-level domains) look like by looking up the currently live ones in the Public Suffix List. So, given a URL, it knows its subdomain from its domain, and its domain from its country code.

The following script is not perfect, but can be used for display/shortening purposes. If you really want or need to avoid any third-party dependencies (especially remotely fetching and caching TLD data), I can suggest this script, which I use in my projects. It uses the last two parts of the domain for the most common domain extensions, and the last three parts for the rest of the less-known extensions. In the worst case, the domain will have three parts instead of two:

from urlparse import urlparse  # Python 2; in Python 3 use urllib.parse

def extract_domain(url):
    parsed_domain = urlparse(url)
    domain = parsed_domain.netloc or parsed_domain.path  # just in case, for URLs without a scheme
    domain_parts = domain.split('.')
    if len(domain_parts) > 2:
        return '.'.join(
            domain_parts[-(2 if domain_parts[-1] in {
                'com', 'net', 'org', 'io', 'ly', 'me', 'sh', 'fm', 'us'
            } else 3):])
    return domain

extract_domain('google.com')           # google.com
extract_domain('www.google.com')       # google.com
extract_domain('sub.sub2.google.com')  # google.com
extract_domain('google.co.uk')         # google.co.uk
extract_domain('sub.google.co.uk')     # google.co.uk
extract_domain('sub.sub2.voila.fr')    # sub2.voila.fr


Extract domain from URL in Python [duplicate]

I have a URL like:
http://abc.hostname.com/somethings/anything/
I want to get:
hostname.com
What module can I use to accomplish this? I want to use the same module and method in Python 2.

5 Answers

For parsing the domain of a URL in Python 3, you can use:

from urllib.parse import urlparse

domain = urlparse('http://www.example.test/foo/bar').netloc
print(domain)  # --> www.example.test
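Note that netloc is the whole authority component, while hostname is normalized. Here is a sketch of the difference for a URL that carries credentials and a port (example.test is a made-up host):

```python
from urllib.parse import urlparse

p = urlparse('http://user:pw@www.example.test:8080/foo')
print(p.netloc)    # user:pw@www.example.test:8080
print(p.hostname)  # www.example.test  (lowercased, credentials and port stripped)
print(p.port)      # 8080
```

For "give me the domain" questions, hostname is usually what you want.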

However, to reliably parse the registered domain (example.test in this example), you need to install a specialized library (e.g. tldextract).

Instead of regex or hand-written solutions, you can use Python's urlparse:

from urllib.parse import urlparse

print(urlparse('http://abc.hostname.com/somethings/anything/'))
# ParseResult(scheme='http', netloc='abc.hostname.com', path='/somethings/anything/', params='', query='', fragment='')

print(urlparse('http://abc.hostname.com/somethings/anything/').netloc)
# abc.hostname.com

To get it without the subdomain:

t = urlparse('http://abc.hostname.com/somethings/anything/').netloc
print('.'.join(t.split('.')[-2:]))
# hostname.com

t.split('.')[-2:] literally keeps only the last two substrings, so I am afraid it will just return co.uk and ac.uk, whether you prepend a subdomain or not.
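To make that caveat concrete, here is a quick sketch of the last-two-labels trick failing on a .co.uk host:

```python
from urllib.parse import urlparse  # Python 3; on Python 2: from urlparse import urlparse

t = urlparse('http://www.google.co.uk/search').netloc
# Keeping only the last two labels drops the registered name 'google'.
print('.'.join(t.split('.')[-2:]))  # co.uk
```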

from tldextract import extract

tsd, td, tsu = extract("http://abc.hostname.com/somethings/anything/")  # abc, hostname, com
url = td + '.' + tsu
print(url)  # hostname.com

tldextract is not a standard library (at least not in Python 2.7); I think you should mention that. Still +1.

Assuming you have the URL in an accessible string, and assuming we want to be generic in handling multiple levels in the top domain, you could:

token = my_string.split('http://')[1].split('/')[0]
top_level = token.split('.')[-2] + '.' + token.split('.')[-1]

We split first on http:// to remove it from the string, then split on / to remove all directory or subdirectory parts of the string. The [-2] then takes the second-to-last dot-separated token, and we append the last token to it, giving us the top-level domain.
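As written, that snippet only handles http:// and raises an IndexError for https:// URLs. A slightly more general sketch (the helper name naive_domain is illustrative, and the co.uk caveat from the other answers still applies):

```python
def naive_domain(url):
    # Strip any scheme, cut at the first '/', then keep the
    # last two dot-separated labels.
    token = url.split('://', 1)[-1].split('/', 1)[0]
    parts = token.split('.')
    return parts[-2] + '.' + parts[-1]

print(naive_domain('http://www.techcrunch.com/'))           # techcrunch.com
print(naive_domain('https://www.techcrunch.com/startups/')) # techcrunch.com
```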


Extract domain name from URL using Python's re regex

I want to input a URL and extract the domain name, which is the string that comes after http:// or https:// and contains letters, numbers, dots, underscores, or dashes. I wrote the regex and used Python's re module as follows:

import re

m = re.search('https?://([A-Za-z_0-9.-]+).*', 'https://google.co.uk?link=something')
m.group(1)
print(m)

My understanding is that m.group(1) will extract the part between () in the re.search. The output that I expect is google.co.uk, but I am getting this:

4 Answers

You need to print m.group(1) rather than the match object itself. Better yet, check that the match succeeded first:

m = re.search('https?://([A-Za-z_0-9.-]+).*', 'https://google.co.uk?link=something')
if m:
    print(m.group(1))
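A variant sketch with the pattern anchored at the start and the capture group named (the input is the same as above; the group name is illustrative):

```python
import re

# Anchoring with ^ avoids matching a scheme that appears mid-string,
# and a named group makes the extraction self-documenting.
pattern = re.compile(r'^https?://(?P<domain>[A-Za-z0-9._-]+)')
m = pattern.search('https://google.co.uk?link=something')
if m:
    print(m.group('domain'))  # google.co.uk
```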

Jan has already provided a solution for this. But just to note, we can implement the same thing without using re. All it needs is the punctuation characters !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ for validation purposes, which can be obtained from the string package as string.punctuation.

import string

def domain_finder(link):
    dot_splitter = link.split('.')
    separator_first = 0
    if '//' in dot_splitter[0]:
        separator_first = dot_splitter[0].find('//') + 2
    separator_end = ''
    for i in dot_splitter[2]:
        if i in string.punctuation:
            separator_end = i
            break
    if separator_end:
        end_ = dot_splitter[2].split(separator_end)[0]
    else:
        end_ = dot_splitter[2]
    domain = [dot_splitter[0][separator_first:], dot_splitter[1], end_]
    return '.'.join(domain)

link = 'https://google.co.uk?link=something'
domain = domain_finder(link=link)
print(domain)  # ==> google.co.uk

This is just another way of solving the same problem without re.
