Regular expression for finding links in Python

URL regex Python

URL regular expressions can be used to verify that a string has a valid URL format, as well as to extract a URL from a string.

URL regex that starts with HTTP or HTTPS

URLs that begin with an explicit http:// or https:// protocol can be validated with a regular expression that requires the protocol at the start of the string.
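A minimal sketch of such a check in Python. The pattern below is a commonly circulated one and is an assumption, not necessarily the exact expression this section refers to; the function name is_http_url is hypothetical.

```python
import re

# Assumed pattern: scheme, optional "www.", a host containing at least one dot,
# then an optional path/query made of common URL characters.
HTTP_URL = re.compile(
    r"https?://(?:www\.)?[-a-zA-Z0-9@:%._+~#=]{1,256}"
    r"\.[a-zA-Z0-9()]{1,6}\b(?:[-a-zA-Z0-9()@:%_+.~#?&/=]*)"
)

def is_http_url(text):
    """Return True if the whole string looks like an http(s) URL."""
    return HTTP_URL.fullmatch(text) is not None

print(is_http_url("https://www.example.com/path?x=1"))
print(is_http_url("www.example.com"))
```

re.fullmatch is used so that the whole string must be a URL; re.search with the same pattern would instead find a URL anywhere inside a longer text.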


URL regex that doesn’t start with HTTP or HTTPS

The regular expression for validating a URL without the protocol is very similar; it simply omits the leading protocol part.
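A sketch of the protocol-less variant, under the same assumption as before: the pattern is a commonly circulated one rather than a confirmed original, and the function name is_bare_url is hypothetical.

```python
import re

# Assumed pattern: a host containing at least one dot, then an optional
# path/query. There is no scheme part, so "https://..." is rejected.
BARE_URL = re.compile(
    r"[-a-zA-Z0-9@:%._+~#=]{1,256}"
    r"\.[a-zA-Z0-9()]{1,6}\b(?:[-a-zA-Z0-9()@:%_+.~#?&/=]*)"
)

def is_bare_url(text):
    """Return True if the whole string looks like a URL written without a scheme."""
    return BARE_URL.fullmatch(text) is not None

print(is_bare_url("www.example.com/path"))
print(is_bare_url("https://example.com"))
```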


Notes on URL validation

The above-mentioned regular expressions only cover the most commonly used types of URLs with domain names. If you have some more complex cases you might need a different solution.
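One possible "different solution" for such cases is to parse the string instead of pattern-matching it. The sketch below uses urlsplit from Python's standard urllib.parse module; the function name looks_like_http_url and the acceptance rule (an http or https scheme plus a non-empty host) are assumptions chosen for illustration.

```python
from urllib.parse import urlsplit

def looks_like_http_url(text):
    """Accept strings that parse to an http(s) scheme with a non-empty host."""
    try:
        parts = urlsplit(text)
    except ValueError:
        # urlsplit raises ValueError for some malformed inputs,
        # for example an unbalanced IPv6 bracket.
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)

print(looks_like_http_url("https://example.com/a?b=1"))
print(looks_like_http_url("example.com"))
```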


How to Extract URL from a string in Python?

This section shows how to find and extract website URLs from a string in Python using the re (regular expression) module. Given a string, we check whether it contains a URL and, if it does, extract and print it.

First, we need a way to detect a URL. For that we use a regular expression that covers the common combinations of characters that can make up a URL:

# regular expression to find a URL in a string in Python
r"(?i)\b((?:https?://|www\d[.]|[a-z0-9.-]+[.][a-z]/)(?:[^\s()<>]+|(([^\s()<>]+|(([^\s()<>]+)))))+(?:(([^\s()<>]+|(([^\s()<>]+))))|[^\s`!()[]<>;:'\".,<>?«»“”‘’]))"

Then we search the string with this regular expression. To do that we use the findall() function from Python's re module, which returns every match it finds in the string.
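A detail of findall() worth seeing in isolation: when the pattern contains capture groups, each result is a tuple of the groups rather than a plain string. The short pattern and sample text here are illustrative assumptions, much simpler than the expression above.

```python
import re

text = "see http://a.example and https://b.example"

# No capture groups: findall returns the matched strings themselves.
print(re.findall(r"https?://\S+", text))

# With capture groups: findall returns one tuple per match, one item per group.
pairs = re.findall(r"((https?)://\S+)", text)
print(pairs)

# The outermost group comes first, so index 0 is the full URL.
print([p[0] for p in pairs])
```

This is why code built on a grouped pattern takes the first element of every tuple to get the URL.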

Code Example

# How to extract a URL from a string in Python
import re

def URLsearch(stringinput):
    # regular expression
    regularex = r"(?i)\b((?:https?://|www\d[.]|[a-z0-9.-]+[.][a-z]/)(?:[^\s()<>]+|(([^\s()<>]+|(([^\s()<>]+)))))+(?:(([^\s()<>]+|(([^\s()<>]+))))|[^\s`!()[]<>;:'\".,<>?«»“”‘’]))"
    # find the URLs in the passed string
    urlsrc = re.findall(regularex, stringinput)
    # findall returns a tuple of groups per match; the first group is the full URL
    return [url[0] for url in urlsrc]

textcontent = 'text :a software website find contents related to technology https://devenum.com https://google.com,http://devenum.com'
# using the function defined above
print("Urls found:", URLsearch(textcontent))
Urls found: ['https://devenum.com', 'https://google.com,http://devenum.com']
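Note that the second item in this output is really two URLs joined by a comma: a comma is a legal character inside a URL, so the pattern keeps it. One possible post-processing step is sketched here, under the assumption that a comma directly followed by http:// or https:// always separates two URLs. The helper name split_joined is hypothetical.

```python
import re

def split_joined(urls):
    """Split items like 'https://a.com,http://b.com' at a comma that precedes a scheme."""
    result = []
    for url in urls:
        # The lookahead leaves the scheme attached to the URL after the comma.
        result.extend(re.split(r",(?=https?://)", url))
    return result

found = ['https://devenum.com', 'https://google.com,http://devenum.com']
print(split_joined(found))
```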

Find URLs in a string of HTML

In this example we search for URLs inside HTML tags, using the same regular expression as above.

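import re

def URLsearch(stringinput):
    # regular expression
    regularex = r"(?i)\b((?:https?://|www\d[.]|[a-z0-9.-]+[.][a-z]/)(?:[^\s()<>]+|(([^\s()<>]+|(([^\s()<>]+)))))+(?:(([^\s()<>]+|(([^\s()<>]+))))|[^\s`!()[]<>;:'\".,<>?«»“”‘’]))"
    # find the URLs in the passed string
    urlsrc = re.findall(regularex, stringinput)
    # findall returns a tuple of groups per match; the first group is the full URL
    return [url[0] for url in urlsrc]

# Sample HTML. The exact markup is assumed; only the link texts and the
# three href values (taken from the output below) are known.
textcontent = '''
<a href="https://www.google.com">Contents</a> <a href="https://devenum.com">Python Examples</a><a href="http://www.devenum.com">Even More Examples</a>
'''
# using the function defined above
print("Urls found:", URLsearch(textcontent))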
Urls found: ['https://www.google.com"', 'https://devenum.com"', 'http://www.devenum.com"']
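Each item in this output keeps the closing double quote of its href attribute. The sketch below avoids that, assuming the expression is meant to be the widely circulated "liberal" URL pattern attributed to John Gruber, written out here with its escapes and quantifiers. The name GRUBER_URL and the sample markup are hypothetical.

```python
import re

# Assumed reference pattern. Its last alternative refuses to end a match on
# punctuation such as a quote, comma or full stop.
GRUBER_URL = re.compile(
    r"(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)"
    r"(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+"
    r"(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))"
)

html = (
    '<a href="https://www.google.com">Contents</a> '
    '<a href="https://devenum.com">Python Examples</a> '
    '<a href="http://www.devenum.com">Even More Examples</a>'
)

# finditer yields match objects; group(0) is the whole matched URL.
urls = [m.group(0) for m in GRUBER_URL.finditer(html)]
print(urls)
```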


Regular expressions. Parsing HTML

The task is to pull the links out of an HTML file. My first thought: "Regular expressions, your time has come." I had never worked with them before, so I decided to figure them out, and I am trying things from articles I found on Google. Here is where I am with the task:

- there is HTML code (in my case, this):

<html>
<head>
    <title>Example Domain</title>
    <meta charset="utf-8" />
    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
    <style type="text/css">
    body {
        background-color: #f0f0f2;
        margin: 0;
        padding: 0;
        font-family: "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
    }
    div {
        width: 600px;
        margin: 5em auto;
        padding: 50px;
        background-color: #fff;
        border-radius: 1em;
    }
    a:link, a:visited {
        color: #38488f;
        text-decoration: none;
    }
    @media (max-width: 700px) {
        body {
            background-color: #fff;
        }
        div {
            width: auto;
            margin: 0 auto;
            border-radius: 0;
            padding: 1em;
        }
    }
    </style>
</head>
<body>
<div>
    <h1>Example Domain</h1>
    <p>This domain is established to be used for illustrative examples in documents. You may use this domain in examples without prior coordination or asking for permission.</p>
    <p><a href="http://www.iana.org/domains/example">More information.</a></p>
</div>
</body>
</html>
import re

# regex
pattern = r'\S*">'
result = re.findall(pattern, string)  # string holds the HTML code that was read
print(result)

Everything seems fine, but the result contains the link (there is only one, for those who did not look at the HTML) together with "service information" such as the trailing ">, whereas the task requires just the links. This can of course be done with ordinary string functions (cutting off the pieces before and after the quotes), but I would like to know how to get this result with regular expressions alone.
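One way to get only the link with a regular expression is a capture group: when the pattern contains one group, re.findall returns just the text inside the parentheses. The sketch below assumes that every link is written as href="..." with double quotes; the one-line sample string stands in for the HTML that was read from the file.

```python
import re

html = '<p><a href="http://www.iana.org/domains/example">More information.</a></p>'

# href=" and the closing quote must be present for a match, but only the
# parenthesised part, everything up to the next double quote, is returned.
links = re.findall(r'href="([^"]*)"', html)
print(links)
```

For HTML that is not this regular (single quotes, extra whitespace, unquoted attributes), a real parser such as the standard html.parser module is usually a safer choice than a regular expression.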
