Get links from an HTML file

Parsing links with BeautifulSoup

This article is a simple one, and to some readers it may feel like a "how to draw an owl" tutorial, but that does not bother me: the material will come in handy for someone anyway.

We will use the BeautifulSoup library to extract URLs from links. In HTML a link is marked up with the a tag, so we will find every a tag and read the value of its href attribute.
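The idea of visiting every a tag and reading its href can be tried without any third-party packages. The sketch below is not BeautifulSoup: it assumes the standard library's html.parser as a stand-in, and the LinkCollector class, the collect_links helper, and the sample markup are hypothetical names made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> start tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names for us.
        if tag != "a":
            return
        for name, value in attrs:
            # Skip anchors whose href has no value, e.g. <a href>.
            if name == "href" and value is not None:
                self.links.append(value)


def collect_links(html):
    """Return the href values found in the given HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Made-up sample markup: two links and one anchor without href.
sample = (
    '<p><a href="/about">About</a> <a name="top">Top</a> '
    '<a href="https://example.com/">Ext</a></p>'
)
print(collect_links(sample))  # prints ['/about', 'https://example.com/']
```

BeautifulSoup does the same walk for you and adds a tolerant parser and a richer search API, which is why the rest of the article uses it.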

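First we import the requests library:

import requests

and the bs4 library, from which we take the BeautifulSoup object. We then fetch the page and turn the parsed document back into a string:

from bs4 import BeautifulSoup

url = 'https://yandex.ru/'
r = requests.get(url)
soup_ing = str(BeautifulSoup(r.content, 'lxml'))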

Before saving, we encode the soup_ing variable to bytes:

soup_ing = soup_ing.encode()

and save the content to the file test.html:

with open("test.html", "wb") as file:
    file.write(soup_ing)
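The encoding step matters because the file is opened in binary mode ("wb"), which accepts bytes and rejects str. A minimal sketch of that round trip, assuming UTF-8 and using the hypothetical helper names save_html and load_html with a temporary directory instead of test.html:

```python
import os
import tempfile


def save_html(text, path):
    """Encode text as UTF-8 and write it in binary mode; returns bytes written."""
    data = text.encode("utf-8")
    with open(path, "wb") as fh:
        return fh.write(data)


def load_html(path):
    """Read the file back as text, decoding it as UTF-8."""
    with open(path, encoding="utf-8") as fh:
        return fh.read()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.html")
    page = "<a href='/search'>Поиск</a>"  # made-up sample content
    save_html(page, path)
    assert load_html(path) == page

    # Writing a str to a binary-mode file fails, which is why we encode first.
    try:
        with open(path, "wb") as fh:
            fh.write(page)
    except TypeError as exc:
        print("TypeError:", exc)
```

Opening the file in text mode with open(path, "w", encoding="utf-8") and writing the str directly would be an equivalent alternative.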

Next we create the function fromSoup, which will look for the links, and open the saved file:

def fromSoup():
    with open("test.html", encoding='UTF-8') as f:
        html_file = f.read()

create the soup object and pass it the contents of the file:

    soup = BeautifulSoup(html_file, 'lxml')

state that the search will go through all a tags:

    for link in soup.find_all('a'):

and print the href value of each one. The complete script:

import requests
from bs4 import BeautifulSoup

url = 'https://yandex.ru/'
r = requests.get(url)

# Parse the response and serialize the tree back to a string.
soup_ing = str(BeautifulSoup(r.content, 'lxml'))
# Encode to bytes because the file is opened in binary mode.
soup_ing = soup_ing.encode()

with open("test.html", "wb") as file:
    file.write(soup_ing)


def fromSoup():
    # Read the saved page back from disk.
    with open("test.html", encoding='UTF-8') as f:
        html_file = f.read()
    soup = BeautifulSoup(html_file, 'lxml')  # name of our soup
    # Print the href of every a tag (None if the attribute is missing).
    for link in soup.find_all('a'):
        print(link.get('href'))


fromSoup()
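One detail of the script above: link.get('href') returns None for an a tag that has no href attribute, so the printed list can contain None as well as repeated URLs. A possible clean-up step, sketched with a hypothetical helper clean_hrefs that is not part of the original script:

```python
def clean_hrefs(hrefs):
    """Drop missing/empty values and duplicates, keeping first-seen order."""
    seen = set()
    result = []
    for href in hrefs:
        if href is None:
            continue  # a tag without an href attribute
        href = href.strip()
        if not href or href in seen:
            continue  # empty value or already collected
        seen.add(href)
        result.append(href)
    return result


# Made-up values of the kind link.get('href') may return.
raw = ["/maps", None, "https://example.com/", "/maps", "  ", "#top"]
print(clean_hrefs(raw))  # prints ['/maps', 'https://example.com/', '#top']
```

Inside fromSoup this could be applied as clean_hrefs(link.get('href') for link in soup.find_all('a')), since the helper accepts any iterable.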


This article shows how to find and extract all links from an HTML string in PHP.

The tutorial covers two cases: extracting all links, together with each anchor's title attribute, from an HTML string, and fetching the HTML content of a web page by URL and extracting the links from it. To do this, we will use PHP's DOMDocument class.

PHP's DOMDocument is also referred to as the PHP DOM parser. We will go step by step through finding and extracting all links from HTML with it.

In this example we start from an HTML string value and extract all links from it.

Create file index.php inside your application.

Open index.php and write this complete code into it.

<?php

// Sample HTML string with three anchors. The href and title values are placeholders.
$htmlString = "
<a href='...' title='...'>Google</a>
<a href='...' title='...'>Youtube</a>
<a href='...' title='...'>Online Web Tutor</a>
";

//Create a new DOMDocument object.
$htmlDom = new DOMDocument;

//Load the HTML string into our DOMDocument object.
@$htmlDom->loadHTML($htmlString);

//Extract all anchor elements / tags from the HTML.
$anchorTags = $htmlDom->getElementsByTagName('a');

//Create an array to add extracted anchors to.
$extractedAnchors = array();

//Loop through the anchor tags that DOMDocument found.
foreach($anchorTags as $anchorTag){
    //Get the href attribute of the anchor.
    $aHref = $anchorTag->getAttribute('href');
    //Get the title text of the anchor, if it exists.
    $aTitle = $anchorTag->getAttribute('title');
    //Add the anchor details to $extractedAnchors array.
    $extractedAnchors[] = array(
        'href' => $aHref,
        'title' => $aTitle
    );
}

echo "<pre>";

//print_r our array of anchors.
print_r($extractedAnchors);

Concept

Running index.php prints the $extractedAnchors array: one entry per anchor, each holding its href and title.
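For readers who want to check the expected shape of the result without a PHP setup, the same href/title extraction can be sketched in Python. This is an analogue, not the DOMDocument API: it assumes the standard library's html.parser, and the AnchorCollector class, the extract_anchors helper, and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects href and title of each a tag, mirroring $extractedAnchors."""

    def __init__(self):
        super().__init__()
        self.anchors = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            attr_map = dict(attrs)
            # Like getAttribute() in PHP, fall back to "" when missing.
            self.anchors.append({
                "href": attr_map.get("href") or "",
                "title": attr_map.get("title") or "",
            })


def extract_anchors(html):
    """Return a list of {'href', 'title'} dicts, one per anchor."""
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    return parser.anchors


# Made-up sample markup; the second anchor has no title attribute.
sample = (
    '<a href="https://example.com/" title="Example">Example</a>'
    '<a href="/docs">Docs</a>'
)
for anchor in extract_anchors(sample):
    print(anchor)
# prints {'href': 'https://example.com/', 'title': 'Example'}
# prints {'href': '/docs', 'title': ''}
```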

In this example we fetch a web page by its URL and extract all of its links.

Create file index.php inside your application.

Open index.php and write this complete code into it.

<?php

// Fetch the HTML content of the page. file_get_contents is assumed here,
// and the URL is a placeholder.
$htmlString = file_get_contents('https://example.com/');

//Create a new DOMDocument object.
$htmlDom = new DOMDocument;

//Load the HTML string into our DOMDocument object.
@$htmlDom->loadHTML($htmlString);

//Extract all anchor elements / tags from the HTML.
$anchorTags = $htmlDom->getElementsByTagName('a');

//Create an array to add extracted anchors to.
$extractedAnchors = array();

//Loop through the anchor tags that DOMDocument found.
foreach($anchorTags as $anchorTag){
    //Get the href attribute of the anchor.
    $aHref = $anchorTag->getAttribute('href');
    //Get the title text of the anchor, if it exists.
    $aTitle = $anchorTag->getAttribute('title');
    //Add the anchor details to $extractedAnchors array.
    $extractedAnchors[] = array(
        'href' => $aHref,
        'title' => $aTitle
    );
}

echo "<pre>";

//print_r our array of anchors.
print_r($extractedAnchors);

Running index.php prints the array of anchors found on the fetched page.

We hope this article helped you learn how to find and extract all links from an HTML string in PHP.


In this post, we show how to extract all links from a web page using the jsoup Java library.

Add the jsoup library to your Java project

To use the jsoup Java library in a Gradle build project, add the following dependency to the build.gradle file.

implementation 'org.jsoup:jsoup:1.13.1'

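To use the jsoup Java library in a Maven build project, add the following dependency to the pom.xml file.

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.13.1</version>
</dependency>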

To download the jsoup-1.13.1.jar file you can visit jsoup download page at jsoup.org/download

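In jsoup, the value of an anchor tag's href attribute is read with the Element.attr() method.

  • Element.attr("href") returns the attribute value as written in the page, which is often a relative URL
  • Element.attr("abs:href") returns the absolute URL, resolved against the document's base URI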
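The difference between the two values can be illustrated outside jsoup. The sketch below is in Python and assumes the standard library's urllib.parse.urljoin as a stand-in for the resolution that the abs: prefix performs; it is not jsoup's implementation, and the to_absolute helper is a hypothetical name.

```python
from urllib.parse import urljoin


def to_absolute(base_url, href):
    """Resolve href against base_url; an already-absolute href keeps its own host."""
    return urljoin(base_url, href)


base = "https://simplesolution.dev"
print(to_absolute(base, "/page/2/"))
# prints https://simplesolution.dev/page/2/
print(to_absolute(base, "https://jsoup.org/download"))
# prints https://jsoup.org/download
```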

Example 1: using the Document.getElementsByTag() method to get link elements

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;

public class GetAllLinkExample1 {
    public static void main(String[] args) {
        try {
            String url = "https://simplesolution.dev";
            // Fetch and parse the page.
            Document document = Jsoup.connect(url).get();
            // Collect every a element in the document.
            Elements allLinks = document.getElementsByTag("a");
            for (Element link : allLinks) {
                String relativeUrl = link.attr("href");
                String absoluteUrl = link.attr("abs:href");
                System.out.println("Relative URL: " + relativeUrl);
                System.out.println("Absolute URL: " + absoluteUrl);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Output:

Relative URL: /page/2/
Absolute URL: https://simplesolution.dev/page/2/
Relative URL: /page/3/
Absolute URL: https://simplesolution.dev/page/3/

Example 2: using the Document.select() method to get link elements

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;

public class GetAllLinkExample2 {
    public static void main(String[] args) {
        try {
            String url = "https://simplesolution.dev";
            // Fetch and parse the page.
            Document document = Jsoup.connect(url).get();
            // Select only the a elements that have an href attribute.
            Elements allLinks = document.select("a[href]");
            for (Element link : allLinks) {
                String relativeUrl = link.attr("href");
                String absoluteUrl = link.attr("abs:href");
                System.out.println("Relative URL: " + relativeUrl);
                System.out.println("Absolute URL: " + absoluteUrl);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Output:

Relative URL: /page/2/
Absolute URL: https://simplesolution.dev/page/2/
Relative URL: /page/3/
Absolute URL: https://simplesolution.dev/page/3/
