- Parsing XML with BeautifulSoup in Python
- Setting up LXML and BeautifulSoup
- Parsing XML with lxml and BeautifulSoup
- Display Parsed Data in a Table
- Parsing an RSS Feed and Storing the Data to a CSV
- XML Parsing In Python Using Element Tree Module
- XML Parsing In Python – What Is XML And XML Parsing
- What Is XML?
- What XML Looks Like
- What Is XML Parsing?
- XML Parsing In Python – Parsing XML Documents
- ElementTree Module
- Parsing XML
- Using parse() method
Parsing XML with BeautifulSoup in Python
Extensible Markup Language (XML) is a markup language that’s popular because of the way it structures data. It is used for data transmission (representing serialized objects) and for configuration files.
Despite JSON’s rising popularity, you can still find XML in Android manifest files, Java build tools such as Maven, and SOAP APIs on the web. Parsing XML is therefore still a common task for developers.
In Python, we can read and parse XML by leveraging two libraries: BeautifulSoup and LXML.
In this guide, we’ll take a look at extracting and parsing data from XML files with BeautifulSoup and LXML, and store the results using Pandas.
Setting up LXML and BeautifulSoup
We first need to install both libraries. We’ll create a new folder in our workspace, set up a virtual environment, and install the libraries:

    $ mkdir xml_parsing_tutorial
    $ cd xml_parsing_tutorial
    $ python3 -m venv env                # Create a virtual environment for this project
    $ . env/bin/activate                 # Activate the virtual environment
    $ pip install lxml beautifulsoup4    # Install both Python packages
Now that we have everything set up, let’s do some parsing!
Parsing XML with lxml and BeautifulSoup
Parsing always depends on the underlying file and the structure it uses, so there’s no silver bullet for all files. BeautifulSoup builds the parse tree automatically, but which elements you extract from it depends on the task.
Thus, it’s best to learn parsing with a hands-on approach. Save the following XML into a file named teachers.xml in your working directory:
    <?xml version="1.0" encoding="UTF-8"?>
    <teachers>
        <teacher>
            <name>Sam Davies</name>
            <age>35</age>
            <subject>Maths</subject>
        </teacher>
        <teacher>
            <name>Cassie Stone</name>
            <age>24</age>
            <subject>Science</subject>
        </teacher>
        <teacher>
            <name>Derek Brandon</name>
            <age>32</age>
            <subject>History</subject>
        </teacher>
    </teachers>
The <teachers> tag indicates the root of the XML document. The <teacher> tag is a child, or sub-element, of <teachers> and holds information about a single person. The <name>, <age>, and <subject> tags are children of the <teacher> tag and grandchildren of the <teachers> tag.
The first line, <?xml version="1.0" encoding="UTF-8"?>, in the sample document above is called an XML Prolog. Including a prolog is optional, but when it is present it must come at the beginning of the XML file.
The XML Prolog shown above indicates the version of XML used and the type of character encoding. In this case, the characters in the XML document are encoded in UTF-8.
Now that we understand the structure of the XML file, we can parse it. Create a new file called teachers.py in your working directory, and import the BeautifulSoup library:
from bs4 import BeautifulSoup
Note: As you may have noticed, we didn’t import lxml! BeautifulSoup uses lxml under the hood when we ask for the 'xml' parser, so importing it separately isn’t necessary. It is not installed as part of BeautifulSoup, though, which is why we installed it ourselves.
Now let’s read the contents of the XML file we created and store it in a variable called soup so we can begin parsing:
    with open('teachers.xml', 'r') as f:
        file = f.read()

    # 'xml' is the parser used. For html files, which BeautifulSoup is
    # typically used for, it would be 'html.parser'.
    soup = BeautifulSoup(file, 'xml')
The soup variable now has the parsed contents of our XML file. We can use this variable and the methods attached to it to retrieve the XML information with Python code.
Let’s say we want to view only the names of the teachers from the XML document. We can get that information with a few lines of code:
    names = soup.find_all('name')
    for name in names:
        print(name.text)
Running python teachers.py would give us:
    Sam Davies
    Cassie Stone
    Derek Brandon

The find_all() method returns a list of all the tags that match the name passed to it as an argument. As shown in the code above, soup.find_all('name') returns all the <name> tags in the XML file. We then iterate over these tags and print their text property, which contains the tags’ values.
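When the fields need to stay grouped per teacher, one option is to search inside each <teacher> element rather than across the whole document. The following is a minimal sketch, assuming the same structure as teachers.xml; the inline xml_text string and the records list are illustrative names, not part of the original script.

```python
from bs4 import BeautifulSoup

# Inline copy of part of the sample document so the sketch runs on its own
xml_text = """<teachers>
    <teacher>
        <name>Sam Davies</name>
        <age>35</age>
        <subject>Maths</subject>
    </teacher>
    <teacher>
        <name>Cassie Stone</name>
        <age>24</age>
        <subject>Science</subject>
    </teacher>
</teachers>"""

soup = BeautifulSoup(xml_text, 'xml')

records = []
for teacher in soup.find_all('teacher'):
    # find() returns only the first match inside this <teacher> element
    records.append({
        'name': teacher.find('name').text,
        'age': int(teacher.find('age').text),
        'subject': teacher.find('subject').text,
    })

for record in records:
    print(record)
```

Searching within each <teacher> keeps a name paired with its own age and subject, even if some other teacher is missing one of those tags.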
Display Parsed Data in a Table
Let’s take things one step further: we’ll parse all the contents of the XML file and display them in a tabular format.
Let’s rewrite the teachers.py file with:
    from bs4 import BeautifulSoup

    # Opens and reads the xml file we saved earlier
    with open('teachers.xml', 'r') as f:
        file = f.read()

    # Initializing soup variable
    soup = BeautifulSoup(file, 'xml')

    # Storing <name> tags and elements in 'names' variable
    names = soup.find_all('name')

    # Storing <age> tags and elements in 'ages' variable
    ages = soup.find_all('age')

    # Storing <subject> tags and elements in 'subjects' variable
    subjects = soup.find_all('subject')

    # Displaying data in tabular format
    print('-'.center(35, '-'))
    print('|' + 'Name'.center(15) + '|' + ' Age ' + '|' + 'Subject'.center(11) + '|')
    for i in range(0, len(names)):
        print('-'.center(35, '-'))
        print(
            f'|{names[i].get_text().center(15)}|{ages[i].get_text().center(5)}|{subjects[i].get_text().center(11)}|')
    print('-'.center(35, '-'))
The output of the code above would look like this:
    -----------------------------------
    |      Name     | Age |  Subject  |
    -----------------------------------
    |   Sam Davies  |  35 |   Maths   |
    -----------------------------------
    |  Cassie Stone |  24 |  Science  |
    -----------------------------------
    | Derek Brandon |  32 |  History  |
    -----------------------------------
Congrats! You just parsed your first XML file with BeautifulSoup and LXML! Now that you’re more comfortable with the theory and the process, let’s try a more real-world example.
We’ve formatted the data as a table as a precursor to storing it in a more versatile data structure: in the upcoming mini-project, we’ll store the data in a Pandas DataFrame.
Parsing an RSS Feed and Storing the Data to a CSV
In this section, we’ll parse an RSS feed of The New York Times News, and store that data in a CSV file.
RSS is short for Really Simple Syndication. An RSS feed is a file that contains a summary of updates from a website and is written in XML. In this case, the RSS feed of The New York Times contains a summary of daily news updates on their website. This summary contains links to news releases, links to article images, descriptions of news items, and more. Website owners also offer RSS feeds as a courtesy, so that people can get data without scraping the site.
[Image: snapshot of an RSS feed from The New York Times]
The New York Times publishes separate RSS feeds for different continents, countries, regions, topics, and other criteria.
It’s important to see and understand the structure of the data before you can begin parsing it. The data we would like to extract from the RSS feed about each news article is:
- Globally unique identifier (guid)
- Title (title)
- Publication date (pubDate)
- Description (description)
Now that we’re familiar with the structure and have clear goals, let’s kick off our program! We’ll need the requests library and the pandas library to retrieve the data and easily convert it to a CSV file.
With requests, we can make HTTP requests to websites and read the responses. In this case, we can use it to retrieve the RSS feed (in XML) so BeautifulSoup can parse it. With pandas, we will be able to format the parsed data in a table, and finally store the table’s contents in a CSV file.
In the same working directory, install requests and pandas (your virtual environment should still be active):
$ pip install requests pandas
In a new file, nyt_rss_feed.py , let’s import our libraries:
    import requests
    from bs4 import BeautifulSoup
    import pandas as pd
Then, let’s make an HTTP request to The New York Times’ server to get their RSS feed and retrieve its contents:
    url = 'https://rss.nytimes.com/services/xml/rss/nyt/US.xml'
    xml_data = requests.get(url).content

With the code above, we get a response from the HTTP request and store its contents in the xml_data variable. The content attribute of a requests response holds the response body as bytes.
Now, create the following function to parse the XML data into a table in Pandas, with the help of BeautifulSoup:
    def parse_xml(xml_data):
        # Initializing soup variable
        soup = BeautifulSoup(xml_data, 'xml')

        # Iterating through item tags and extracting elements
        all_items = soup.find_all('item')
        items_length = len(all_items)
        rows = []

        for index, item in enumerate(all_items):
            guid = item.find('guid').text
            title = item.find('title').text
            pub_date = item.find('pubDate').text
            description = item.find('description').text

            # Adding extracted elements to the rows of the table
            rows.append({
                'guid': guid,
                'title': title,
                'pubDate': pub_date,
                'description': description
            })
            print(f'Appending row {index + 1} of {items_length}')

        # Building the table once from all rows
        # (DataFrame.append was removed in pandas 2.0)
        df = pd.DataFrame(rows, columns=['guid', 'title', 'pubDate', 'description'])
        return df
The function above parses XML data from an HTTP request with BeautifulSoup, storing its contents in a soup variable. The Pandas DataFrame with rows and columns for the data we would like to parse is referenced via the df variable.
We then search the XML data for all <item> tags. By iterating through each <item> tag, we are able to extract its child tags: <guid>, <title>, <pubDate>, and <description>. Note how we use the find() method to get only one object. The values of each item’s child tags become one row of the Pandas table.
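Feed items do not always contain every tag, and find() returns None when nothing matches, so calling .text on the result would raise an AttributeError. The following is a minimal sketch of one way to guard against that. The child_text helper and the two-item sample_feed are hypothetical and made up for illustration; they are not taken from the actual New York Times feed.

```python
from bs4 import BeautifulSoup

# Hypothetical two-item feed; the second item has no <description>
sample_feed = """<rss version="2.0">
  <channel>
    <item>
      <guid>https://example.com/a</guid>
      <title>First story</title>
      <pubDate>Mon, 01 Jan 2024 10:00:00 +0000</pubDate>
      <description>Summary of the first story</description>
    </item>
    <item>
      <guid>https://example.com/b</guid>
      <title>Second story</title>
      <pubDate>Mon, 01 Jan 2024 11:00:00 +0000</pubDate>
    </item>
  </channel>
</rss>"""


def child_text(item, tag_name, default=''):
    """Return the text of the first matching child tag, or a default if it is absent."""
    tag = item.find(tag_name)
    return tag.text if tag is not None else default


soup = BeautifulSoup(sample_feed, 'xml')
fields = ('guid', 'title', 'pubDate', 'description')
rows = [
    {field: child_text(item, field) for field in fields}
    for item in soup.find_all('item')
]

for row in rows:
    print(row)
```

A list of dictionaries like rows can be passed straight to pd.DataFrame, so a missing tag becomes an empty cell instead of stopping the script.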
Now, at the end of the file after the function, add these two lines of code to call the function and create a CSV file:
    df = parse_xml(xml_data)
    df.to_csv('news.csv')
Run python nyt_rss_feed.py to create a new CSV file in your present working directory:
    Appending row 1 of 24
    Appending row 2 of 24
    ...
    Appending row 24 of 24
XML Parsing In Python Using Element Tree Module
If you are looking for a tutorial on parsing XML in Python with the ElementTree module, you have landed in the right place. In this tutorial, you will learn to parse an XML file in Python, so keep reading to the end for a solid grounding in XML parsing.
XML Parsing In Python – What Is XML And XML Parsing
In this section, you will get an overview of XML and XML parsing so that you can follow the rest of this tutorial. So let’s get started.
What Is XML?
- XML is short for Extensible Markup Language.
- It is similar to HTML (HyperText Markup Language), but the big difference is that HTML describes how data is presented, whereas XML describes what the data is.
- It is designed to store and transport data.
- XML can easily be read by both humans and machines.
What XML Looks Like
Now let’s see what XML actually looks like. The basic structure of an XML file is a tree.
An XML tree starts at a root element and branches from the root to child elements. All elements can have sub-elements (child elements).
Now let’s take an example(products.xml) –
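As a minimal sketch of such a tree, assume a hypothetical product catalog (the element names and values below are illustrative, not the tutorial’s own file). It uses Python’s standard-library xml.etree.ElementTree module, which is introduced later in this tutorial, to show the root element and its children.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog: <products> is the root, each <product> is a child
xml_text = """<products>
    <product>
        <name>Keyboard</name>
        <price>25</price>
    </product>
    <product>
        <name>Mouse</name>
        <price>10</price>
    </product>
</products>"""

root = ET.fromstring(xml_text)
print(root.tag)

# Iterating over an element yields its direct children
for product in root:
    fields = {child.tag: child.text for child in product}
    print(product.tag, fields)
```

Here <products> is the root, each <product> is a child of the root, and <name> and <price> are sub-elements of a <product>.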
What Is XML Parsing?
Parsing means reading information from a file and splitting it into pieces by identifying the parts of that particular XML file.
XML Parsing In Python – Parsing XML Documents
In this section, we will see how to read from and write to an XML file. To parse XML data in Python, we need a parsing module, so let’s discuss one.
ElementTree Module
The ElementTree module represents XML data as a tree structure, which is the most natural representation of hierarchical data. The Element data type stores the hierarchical data structure in memory.
The ElementTree class can be used to wrap an element structure and convert it from and to XML. Each Element has the following properties:
- Tag: a string that represents the type of data being stored.
- Attributes: a number of attributes, stored in a Python dictionary.
- Text String: a text string holding the information to be displayed.
- Tail String: an optional tail string.
- Child Elements: a number of child elements, stored in a Python sequence.
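The properties listed above map onto attributes of an Element object. The snippet below is a small sketch using a hypothetical one-line fragment, chosen only because it contains an attribute, text, a child, and a tail.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: the text after </b> is the tail of the <b> element
root = ET.fromstring('<item id="1">Hello <b>bold</b> world</item>')

print(root.tag)       # tag name as a string
print(root.attrib)    # attributes as a dictionary
print(root.text)      # text before the first child
bold = root[0]        # children are accessed like a sequence
print(bold.text)      # text inside <b>
print(bold.tail)      # text between </b> and the next tag
print(len(root))      # number of direct children
```

The tail belongs to the child element, not to its parent: it is the text that follows the child’s closing tag.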
Parsing XML
We can parse XML documents using two methods: parse(), which reads XML from a file, and fromstring(), which reads XML from a string.
Using parse() method
This method takes the path to an XML file (or an open file object) and parses its contents.
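A minimal sketch of parse() follows. It assumes a small hypothetical document, which it first writes to a temporary file so that there is something on disk to read; in practice you would pass the path of your own XML file.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Write a small hypothetical document to a temporary file so there is something to parse
xml_text = '<products><product><name>Keyboard</name></product></products>'
with tempfile.NamedTemporaryFile('w', suffix='.xml', delete=False, encoding='utf-8') as f:
    f.write(xml_text)
    path = f.name

tree = ET.parse(path)      # parse() accepts a filename or a file object
root = tree.getroot()      # the top-level Element of the document

# iter() walks the whole tree and yields every element with the given tag
names = [name.text for name in root.iter('name')]
print(root.tag, names)

os.remove(path)            # clean up the temporary file
```

parse() returns an ElementTree object, and getroot() gives access to the root Element, from which the rest of the tree can be reached.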