Regular expression for finding links in Python

Contents

URL regex Python
URL regex that starts with HTTP or HTTPS
Notes on URL validation
How to Extract URL from a string in Python?
This regular expression is going to help us to judge the presence of a URL.
Find URL in string of HTML format
Regular expressions. Parsing HTML

URL regex Python

URL regular expressions can be used to verify that a string has a valid URL format, as well as to extract a URL from a string.

URL regex that starts with HTTP or HTTPS

HTTP and HTTPS URLs that start with the protocol can be validated with a regular expression that requires the http:// or https:// prefix.
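A minimal sketch of such a check, assuming a simplified pattern of our own. The names `HTTP_URL_RE` and `is_http_url` are illustrative, and the pattern is an assumption, not the expression from the original page:

```python
import re

# Assumed, simplified pattern: protocol, dot-separated host labels,
# a top-level domain of 2+ letters, optional port, optional path/query/fragment.
HTTP_URL_RE = re.compile(
    r"https?://"                    # required protocol
    r"(?:[a-z0-9-]+\.)+[a-z]{2,}"   # domain name such as www.example.com
    r"(?::\d{1,5})?"                # optional port
    r"(?:[/?#]\S*)?",               # optional path, query or fragment
    re.IGNORECASE,
)

def is_http_url(text: str) -> bool:
    """Return True if the whole string looks like an HTTP(S) URL."""
    return HTTP_URL_RE.fullmatch(text) is not None

print(is_http_url("https://www.example.com/path?x=1"))
print(is_http_url("www.example.com"))
```

Using `fullmatch` rather than `search` means the entire string has to be a URL, which is what validation usually needs.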


URL regex that doesn’t start with HTTP or HTTPS
The regular expression for validating a URL without the protocol is very similar; it omits the leading protocol part.
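As a hedged illustration, a protocol-less check could look like the sketch below. `BARE_URL_RE` and `is_bare_url` are hypothetical names, and the pattern is a simplified assumption, not the expression from the original page:

```python
import re

# Assumed, simplified pattern: a domain name with no leading protocol,
# followed by an optional port and an optional path/query/fragment.
BARE_URL_RE = re.compile(
    r"(?:[a-z0-9-]+\.)+[a-z]{2,}"   # domain name such as www.example.com
    r"(?::\d{1,5})?"                # optional port
    r"(?:[/?#]\S*)?",               # optional path, query or fragment
    re.IGNORECASE,
)

def is_bare_url(text: str) -> bool:
    """Return True if the whole string looks like a URL without a protocol."""
    return BARE_URL_RE.fullmatch(text) is not None

print(is_bare_url("www.example.com/docs"))
print(is_bare_url("https://www.example.com"))
```

Because the host labels cannot contain `:` or `/`, a string that starts with a protocol is rejected by this sketch.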
Notes on URL validation
The regular expressions above cover only the most commonly used types of URLs with domain names. For more complex cases you may need a different solution.
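One possible different solution, offered here as an assumption rather than something the original page describes, is a structural check with the standard library's `urllib.parse.urlparse`. The function name `looks_like_web_url` is illustrative:

```python
from urllib.parse import urlparse

def looks_like_web_url(text: str) -> bool:
    """Structural check: http/https scheme plus a non-empty host part."""
    try:
        parts = urlparse(text)
    except ValueError:
        # urlparse raises ValueError for some malformed inputs
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)

print(looks_like_web_url("http://192.168.0.1:8080/status"))
print(looks_like_web_url("mailto:user@example.com"))
```

This accepts hosts that a domain-name regex would miss, such as IP addresses with ports, but it does not verify that the host actually exists.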


How to Extract URL from a string in Python?
Today we are going to learn how to find and extract a website URL from a string in Python, using Python's regular expression module, re. If a string contains a URL, we can extract it and print it.
First, we need to know how to detect the presence of a URL. For that we will use a regular expression that covers the combinations of symbols that can make up a URL.
This regular expression is going to help us to judge the presence of a URL.
# regular expression to find URLs in a string in Python
r"(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))"
Then we scan the string with this regular expression to check for URLs, using the findall() function from Python's re module.
Code Example
# How to extract URLs from a string in Python
import re

def URLsearch(stringinput):
    # regular expression
    regularex = r"(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))"
    # find the URLs in the passed string
    urlsrc = re.findall(regularex, stringinput)
    # findall returns one tuple of groups per match; the first group is the full URL
    return [url[0] for url in urlsrc]

textcontent = 'text :a software website find contents related to technology https://devenum.com https://google.com,http://devenum.com'
# use the function defined above
print("Urls found:", URLsearch(textcontent))

Output:
Urls found: ['https://devenum.com', 'https://google.com,http://devenum.com']

The last two URLs come back as a single match because they are separated only by a comma, which the pattern accepts as part of a URL.
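The `url[0]` indexing in the function above is needed because of how `re.findall` treats capturing groups. The short sketch below uses a deliberately tiny pattern of our own (an assumption, not the article's expression) to show the difference:

```python
import re

text = "see https://devenum.com and http://example.org"

# With capturing groups, findall returns one tuple of groups per match.
with_groups = re.findall(r"((https?)://(\S+))", text)
print(with_groups)

# With non-capturing groups (?:...), findall returns the matched strings.
without_groups = re.findall(r"(?:https?)://(?:\S+)", text)
print(without_groups)

# finditer with group(0) gives the full match regardless of grouping.
full_matches = [m.group(0) for m in re.finditer(r"((https?)://(\S+))", text)]
print(full_matches)
```

Either of the last two forms avoids having to remember which group index holds the whole URL.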
 Find URL in string of HTML format
In this example we search for URLs inside HTML tags, using the same regular expression as above.
import re

def URLsearch(stringinput):
    # regular expression
    regularex = r"(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))"
    # find the URLs in the passed string
    urlsrc = re.findall(regularex, stringinput)
    # findall returns one tuple of groups per match; the first group is the full URL
    return [url[0] for url in urlsrc]

# sample HTML; the <a> tags are an assumption, the original markup did not survive
textcontent = '<a href="https://www.google.com">Contents</a> <a href="https://devenum.com">Python Examples</a><a href="http://www.devenum.com">Even More Examples</a>'
# use the function defined above
print("Urls found:", URLsearch(textcontent))

Output:
Urls found: ['https://www.google.com', 'https://devenum.com', 'http://www.devenum.com']
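As an alternative to a regular expression (a swapped-in technique, not the method used in this article), the standard library's `html.parser` can collect `href` values directly from the tags. The class name `LinkCollector` and the sample markup are illustrative assumptions:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

html = '<a href="https://www.google.com">Contents</a> <a href="https://devenum.com">Python Examples</a>'
collector = LinkCollector()
collector.feed(html)
print(collector.links)
```

A parser reads the attribute value itself, so quotes and angle brackets around the link never end up in the result.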
Regular expressions. Parsing HTML
The task is to pull links out of an HTML file. My first thought: "Regular expressions, your time has come." I had not worked with them before, so I decided to figure them out, and I am experimenting based on articles found through Google. Here is where I am with the task:
- there is HTML code (in my case, this one):
<html>
<head>
    <title>Example Domain</title>
    <meta charset="utf-8" />
    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
    <style type="text/css">
    body {
        background-color: #f0f0f2;
        margin: 0;
        padding: 0;
        font-family: "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
    }
    div {
        width: 600px;
        margin: 5em auto;
        padding: 50px;
        background-color: #fff;
        border-radius: 1em;
    }
    a:link, a:visited {
        color: #38488f;
        text-decoration: none;
    }
    @media (max-width: 700px) {
        body {
            background-color: #fff;
        }
        div {
            width: auto;
            margin: 0 auto;
            border-radius: 0;
            padding: 1em;
        }
    }
    </style>
</head>
<body>
<div>
    <h1>Example Domain</h1>
    <p>This domain is established to be used for illustrative examples in documents. You may use this domain in examples without prior coordination or asking for permission.</p>
    <p><a href="http://www.iana.org/domains/example">More information.</a></p>
</div>
</body>
</html>
import re

# regex
pattern = r'\S*">'
result = re.findall(pattern, string)  # string is the HTML code read from the file
print(result)
Everything seems fine, but the result contains the link (there is only one, for those who did not look at the HTML) together with "service" characters such as ">, while the task requires the links themselves. This could of course be done with ordinary string functions (cutting off the pieces before and after the quotes), but I would like to know how to get this result with regular expressions.
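One way to get only the link with a regular expression is a capturing group: when a pattern contains exactly one group, `re.findall` returns just the text that group matched. The sketch below assumes the link sits in a double-quoted `href` attribute, as in the HTML above; the variable names are illustrative:

```python
import re

html = '<p><a href="http://www.iana.org/domains/example">More information.</a></p>'

# The parentheses capture only the attribute value, so findall returns
# the URL without the surrounding href=" and closing quote.
links = re.findall(r'href="([^"]*)"', html)
print(links)
```

The `[^"]*` part stops at the first closing quote, so the match cannot run on into the rest of the tag.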
 