How to get all page URLs from a website
Web scraping is the technique to extract data from a website.
The BeautifulSoup module is designed for web scraping. It can handle HTML and XML, and it provides simple methods for searching, navigating, and modifying the parse tree.
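As a minimal sketch of those three operations, the snippet below parses a small inline HTML string (made up for illustration), searches for a tag, navigates to its parent, and modifies an attribute. It assumes the `bs4` package (BeautifulSoup 4) is installed and uses Python's built-in `html.parser`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for illustration.
html = '<div id="menu"><a href="/home">Home</a><a href="/about">About</a></div>'

soup = BeautifulSoup(html, "html.parser")

# Searching: find the first <a> tag in the document.
first_link = soup.find("a")

# Navigating: move from the tag to its parent element.
parent_id = first_link.parent["id"]

# Modifying: change an attribute in the parse tree.
first_link["href"] = "/index"

print(parent_id)           # menu
print(first_link["href"])  # /index
```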
Get links from website
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re
# Download the page and parse it
html_page = urlopen("https://arstechnica.com")
soup = BeautifulSoup(html_page, "html.parser")
# Print every link whose href starts with http:// or https://
for link in soup.find_all("a", attrs={"href": re.compile("^https?://")}):
    print(link.get("href"))
It downloads the raw HTML code with the line:
html_page = urlopen("https://arstechnica.com")
A BeautifulSoup object is created and we use this object to find all links:
soup = BeautifulSoup(html_page, "html.parser")
for link in soup.find_all("a", attrs={"href": re.compile("^https?://")}):
    print(link.get("href"))
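The href filter can also be tried without a network request. The sketch below runs it against a made-up HTML string that mixes absolute, relative, and mailto links. The markup and the example.com URLs are assumptions for illustration. The pattern used here, `^https?://`, accepts both http:// and https:// links.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup mixing absolute, relative, and mailto links.
html = """
<a href="https://example.com/news">News</a>
<a href="http://example.com/old">Old</a>
<a href="/about">About</a>
<a href="mailto:someone@example.com">Mail</a>
<a name="top">No href</a>
"""

soup = BeautifulSoup(html, "html.parser")

# Match only href values that start with http:// or https://.
absolute = re.compile(r"^https?://")
found = [a.get("href") for a in soup.find_all("a", attrs={"href": absolute})]

print(found)  # ['https://example.com/news', 'http://example.com/old']
```

Relative links such as /about are skipped by this filter, so it does not return every link on a page.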
Extract links from website into a list
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re
html_page = urlopen("https://arstechnica.com")
soup = BeautifulSoup(html_page, "html.parser")
# Collect every absolute link in a list instead of printing it
links = []
for link in soup.find_all("a", attrs={"href": re.compile("^https?://")}):
    links.append(link.get("href"))
print(links)
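A page often links to the same URL more than once, so the collected list may contain repeats. One possible way to drop them while keeping the original order is sketched below. The `links` values are made up for illustration.

```python
# Hypothetical list of collected links, with one repeated entry.
links = [
    "https://example.com/a",
    "https://example.com/b",
    "https://example.com/a",
]

# dict.fromkeys keeps the first occurrence of each key, in insertion order.
unique_links = list(dict.fromkeys(links))

print(unique_links)  # ['https://example.com/a', 'https://example.com/b']
```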
Function to extract links from webpage
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re
def getLinks(url):
    # Download and parse the page at the given URL
    html_page = urlopen(url)
    soup = BeautifulSoup(html_page, "html.parser")
    links = []
    # Keep only links whose href starts with http:// or https://
    for link in soup.find_all("a", attrs={"href": re.compile("^https?://")}):
        links.append(link.get("href"))
    return links
print(getLinks("https://arstechnica.com"))
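Filtering on an http prefix skips relative links such as /gadgets/. The sketch below is one way to include them: it resolves each href against a base URL with `urllib.parse.urljoin`. The function name `extract_links`, the sample markup, and the base URL are all hypothetical, and the function takes HTML text instead of downloading it so that it can be run offline.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup


def extract_links(html, base_url):
    """Return absolute http(s) URLs for every <a href> found in html."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for tag in soup.find_all("a", href=True):
        # urljoin leaves absolute URLs alone and resolves relative ones.
        absolute = urljoin(base_url, tag["href"])
        if absolute.startswith(("http://", "https://")):
            links.append(absolute)
    return links


# Hypothetical page content and base URL.
sample = (
    '<a href="/gadgets/">Gadgets</a> '
    '<a href="https://example.org/x">X</a> '
    '<a href="mailto:a@b.c">Mail</a>'
)
result = extract_links(sample, "https://example.com/index.html")
print(result)  # ['https://example.com/gadgets/', 'https://example.org/x']
```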
How to Get href of Element using BeautifulSoup [Easily]
To get the href attribute of a tag, index the tag with the attribute name: tag['href']
Get the href attribute of a tag
In the following example, we'll use the find() function to find a tag and ['href'] to print its href attribute.
from bs4 import BeautifulSoup
# Sample markup; the href value is a placeholder
html = '''
<a href="https://example.com/python-string">Python string</a>
'''
soup = BeautifulSoup(html, 'html.parser') # 👉️ Parsing
a_tag = soup.find('a', href=True) # 👉️ Find the first tag that has an href attr
print(a_tag['href']) # 👉️ Print href
href=True: match only tags that have an href attribute.
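To illustrate what href=True changes, the sketch below compares a search with and without it on made-up markup in which one tag has no href. It also shows that indexing a missing attribute raises KeyError, while `.get()` returns None.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second <a> is an anchor without an href.
html = '<a href="/docs">Docs</a><a name="top">Top</a>'
soup = BeautifulSoup(html, "html.parser")

all_tags = soup.find_all("a")
with_href = soup.find_all("a", href=True)

print(len(all_tags), len(with_href))  # 2 1

anchor = all_tags[1]
print(anchor.get("href"))  # None: .get() returns None for a missing attribute

try:
    anchor["href"]
except KeyError:
    # Indexing a missing attribute raises KeyError.
    print("no href attribute")
```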
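Get the href attribute of multiple tags
To get the href of multiple tags, we use the find_all() function to find all matching tags and ['href'] to print each href attribute. Let's see an example.
from bs4 import BeautifulSoup
# Sample markup; the href values are placeholders
html = '''
<a href="https://example.com/python-string">Python string</a>
<a href="https://example.com/python-variable">Python variable</a>
<a href="https://example.com/python-list">Python list</a>
<a href="https://example.com/python-set">Python set</a>
'''
soup = BeautifulSoup(html, 'html.parser') # 👉️ Parsing
a_tags = soup.find_all('a', href=True) # 👉️ Find all tags that have an href attr
# 👇 Loop over the results
for tag in a_tags:
    print(tag['href']) # 👉️ Print href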
Remember, when you want to get any attribute of a tag, use the following syntax: tag['attribute_name']
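The same indexing works for attributes other than href. The sketch below reads several attributes from a made-up img tag. It assumes BeautifulSoup 4 with the built-in `html.parser`, where class is treated as a multi-valued attribute and comes back as a list.

```python
from bs4 import BeautifulSoup

# Hypothetical tag with several attributes.
html = '<img id="logo" class="big round" src="/logo.png" alt="Logo">'
tag = BeautifulSoup(html, "html.parser").find("img")

print(tag["src"])                    # /logo.png
print(tag["class"])                  # ['big', 'round']
print(tag.get("title"))              # None, because the attribute is absent
print(tag.get("title", "untitled"))  # untitled
print(sorted(tag.attrs))             # ['alt', 'class', 'id', 'src']
```

Using `.get()` with a default is a reasonable choice when an attribute may be missing, since plain indexing raises KeyError in that case.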