- How to Read XML File with Python and Pandas
- Setup
- Step 1: Read XML File with read_xml()
- Step 2: Read XML File with read_xml() — remote
- Step 3: Read XML File as Python list or dict
- Step 4: Read multiple XML Files in Python
- Step 5: Read XML File — xmltodict
- Conclusion
- How to read XML file in Python
- Introduction to XML
- Read XML File Using MiniDOM
- Syntax
- Example Read XML File in Python
- Read XML File Using BeautifulSoup alongside the lxml parser
- Example Read XML File in Python
- Read XML File Using Element Tree
- Example Read XML File in Python
- Read XML File Using Simple API for XML (SAX)
- Python Code Example
- Conclusion
How to Read XML File with Python and Pandas
In this quick tutorial, we’ll cover how to read an XML file into a Pandas DataFrame or a Python data structure.
Since version 1.3, Pandas has offered an elegant solution for reading XML files: pd.read_xml().
With a single call such as pd.read_xml('sitemap.xml') we can read an XML file into a Pandas DataFrame, which can then be converted to other Python structures.
Below we will cover multiple examples in greater detail, using two libraries: Pandas and xmltodict.
Setup
Suppose we have a simple XML file with the following records (one per line):
https://example.com/item-1 2022-06-02T00:00:00Z weekly
https://example.com/item-2 2022-06-02T11:34:37Z weekly
https://example.com/item-3 2022-06-03T19:24:47Z weekly
We would like to read it as a Pandas DataFrame like the one shown below:
| | loc | lastmod | changefreq |
|---|---|---|---|
| 0 | https://example.com/item-1 | 2022-06-02T00:00:00Z | weekly |
| 1 | https://example.com/item-2 | 2022-06-02T11:34:37Z | weekly |
| 2 | https://example.com/item-3 | 2022-06-03T19:24:47Z | weekly |
or to get the links as a Python list:
['https://example.com/item-1', 'https://example.com/item-2', 'https://example.com/item-3']
Step 1: Read XML File with read_xml()
The method is described in the official Pandas documentation for pandas.read_xml().
To read a local XML file in Python we can pass the path of the file (absolute or relative):
```python
import pandas as pd

df = pd.read_xml('sitemap.xml')
```
| | loc | lastmod | changefreq |
|---|---|---|---|
| 0 | https://example.com/item-1 | 2022-06-02T00:00:00Z | weekly |
| 1 | https://example.com/item-2 | 2022-06-02T11:34:37Z | weekly |
| 2 | https://example.com/item-3 | 2022-06-03T19:24:47Z | weekly |
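To try this step without a file on disk, here is a minimal sketch that feeds an in-memory document to read_xml(). The `<urlset>` and `<url>` wrapper tags are an assumption (only loc, lastmod and changefreq are known from the table above), and parser="etree" is chosen so that the optional lxml dependency is not required.

```python
from io import StringIO

import pandas as pd

# Hypothetical sitemap-like document: the <urlset>/<url> wrappers are assumed,
# only the loc, lastmod and changefreq fields come from the table above.
xml = """<urlset>
  <url>
    <loc>https://example.com/item-1</loc>
    <lastmod>2022-06-02T00:00:00Z</lastmod>
    <changefreq>weekly</changefreq>
  </url>
  <url>
    <loc>https://example.com/item-2</loc>
    <lastmod>2022-06-02T11:34:37Z</lastmod>
    <changefreq>weekly</changefreq>
  </url>
  <url>
    <loc>https://example.com/item-3</loc>
    <lastmod>2022-06-03T19:24:47Z</lastmod>
    <changefreq>weekly</changefreq>
  </url>
</urlset>"""

# parser="etree" relies on the standard library instead of lxml
df = pd.read_xml(StringIO(xml), parser="etree")
print(df)
```

Each child of the root element becomes one row, and the tags inside it become the columns.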
The method has several useful parameters:
- xpath — The XPath to parse the required set of nodes for migration to DataFrame.
- elems_only — Parse only the child elements at the specified xpath. By default, all child elements and non-empty text nodes are returned.
- names — Column names for DataFrame of parsed XML data.
- encoding — Encoding of XML document.
- namespaces — The namespaces defined in XML document as dicts with key being namespace prefix and value the URI.
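As an illustration of two of these parameters, the sketch below uses xpath to select only the `<url>` nodes and names to rename the resulting columns. The document layout (including the extra `<meta>` element) and the new column names are hypothetical and only serve the example.

```python
from io import StringIO

import pandas as pd

# Hypothetical document: <meta> is an extra node that we do not want as a row
xml = """<urlset>
  <meta><title>Example sitemap</title></meta>
  <url>
    <loc>https://example.com/item-1</loc>
    <lastmod>2022-06-02T00:00:00Z</lastmod>
    <changefreq>weekly</changefreq>
  </url>
  <url>
    <loc>https://example.com/item-2</loc>
    <lastmod>2022-06-02T11:34:37Z</lastmod>
    <changefreq>weekly</changefreq>
  </url>
</urlset>"""

df_urls = pd.read_xml(
    StringIO(xml),
    xpath=".//url",                            # keep only the <url> nodes
    names=["link", "modified", "frequency"],   # rename the child elements
    parser="etree",                            # standard library parser, no lxml needed
)
print(df_urls)
```

With the default xpath, the `<meta>` node would also be parsed as a row, so narrowing the selection keeps the DataFrame clean.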
Step 2: Read XML File with read_xml() — remote
Now let’s use Pandas to read XML from a remote location.
The first parameter of read_xml() is path_or_buffer, which the documentation describes as:
String, path object (implementing os.PathLike[str]), or file-like object implementing a read() function. The string can be any valid XML string or a path. The string can further be a URL. Valid URL schemes include http, ftp, s3, and file.
So we can read remote files the same way:
```python
import pandas as pd

# the compression is inferred from the .gz extension
df = pd.read_xml('https://s3.example.com/sitemap.xml.gz')
```
The final output will be exactly the same as before: a DataFrame that holds all values from the XML data.
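The URL above ends in .gz, and read_xml() infers the compression from the file extension by default. Since a remote URL cannot be tried offline, the following sketch writes a gzip-compressed file to a temporary directory and reads it back; the file name and the XML layout are assumptions made for the example.

```python
import gzip
import os
import tempfile

import pandas as pd

# Hypothetical sitemap content, written as bytes so it can be gzip-compressed
xml_bytes = b"""<urlset>
  <url><loc>https://example.com/item-1</loc><changefreq>weekly</changefreq></url>
  <url><loc>https://example.com/item-2</loc><changefreq>weekly</changefreq></url>
</urlset>"""

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sitemap.xml.gz")
    with gzip.open(path, "wb") as fh:
        fh.write(xml_bytes)
    # compression="infer" is the default, so the .gz suffix is enough
    df_gz = pd.read_xml(path, parser="etree")

print(df_gz)
```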
Step 3: Read XML File as Python list or dict
Now suppose you need to convert an XML file to a Python list or dictionary.
We first read the XML file into a DataFrame and then extract the values from that DataFrame:
Example 1: List
['https://example.com/item-1', 'https://example.com/item-2', 'https://example.com/item-3']
Example 2: Dictionary
Example 3: Dictionary — orient index
df[['loc', 'changefreq']].to_dict(orient='index')
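The three conversions can be sketched as follows. The DataFrame is built by hand as a stand-in for the result of pd.read_xml(), and the use of to_list() and to_dict(orient="records") for the first two examples is an assumption, since only the orient="index" call is shown above.

```python
import pandas as pd

# Stand-in for the DataFrame returned by pd.read_xml() in the previous steps
df = pd.DataFrame(
    {
        "loc": [
            "https://example.com/item-1",
            "https://example.com/item-2",
            "https://example.com/item-3",
        ],
        "lastmod": [
            "2022-06-02T00:00:00Z",
            "2022-06-02T11:34:37Z",
            "2022-06-03T19:24:47Z",
        ],
        "changefreq": ["weekly", "weekly", "weekly"],
    }
)

# Example 1: one column as a plain Python list
links = df["loc"].to_list()

# Example 2: one dictionary per row
records = df.to_dict(orient="records")

# Example 3: dictionary keyed by the row index
by_index = df[["loc", "changefreq"]].to_dict(orient="index")

print(links)
print(records[0])
print(by_index[0])
```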
Step 4: Read multiple XML Files in Python
Finally, let’s see how to read multiple XML files that share the same structure with Python and Pandas.
Suppose that all files have the same format.
We can use the following code to read all files in a given range and collect them in a list:
```python
import pandas as pd

df_temp = []
for i in range(1, 10):
    # hypothetical URL pattern: the real file naming is not shown, adjust it to your files
    s = f'https://s3.example.com/sitemap-{i}.xml.gz'
    df_site = pd.read_xml(s)
    df_temp.append(df_site)
```
The result is a list of DataFrames, which can be concatenated into a single one with pd.concat(), for example df_all = pd.concat(df_temp, ignore_index=True).
Now we have the information from all XML files in df_all.
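The same read-and-concatenate pattern can be tried offline. This is a minimal sketch that generates three small files in a temporary directory; the file names and the XML layout are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Hypothetical one-record sitemap; {n} is filled in per file
template = (
    "<urlset><url>"
    "<loc>https://example.com/item-{n}</loc>"
    "<changefreq>weekly</changefreq>"
    "</url></urlset>"
)

frames = []
with tempfile.TemporaryDirectory() as tmp:
    for i in range(1, 4):
        path = os.path.join(tmp, f"sitemap-{i}.xml")
        with open(path, "w", encoding="utf-8") as fh:
            fh.write(template.format(n=i))
        # read each file into its own DataFrame
        frames.append(pd.read_xml(path, parser="etree"))

# ignore_index=True renumbers the rows 0..n-1 in the combined DataFrame
df_all = pd.concat(frames, ignore_index=True)
print(df_all)
```

Without ignore_index=True every file would contribute its own index starting at 0, which leads to duplicate index labels.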
Step 5: Read XML File — xmltodict
There is an alternative solution for reading an XML file in Python: the xmltodict library.
To read an XML file we can do:
```python
import xmltodict

with open('sitemap.xml') as fd:
    doc = xmltodict.parse(fd.read())
```
The result behaves like a nested dictionary, so elements are accessed by their tag names.
Conclusion
In this article, we covered several ways to read an XML file with Python and Pandas. Now we know how to read local or remote XML files using two Python libraries.
The different options and parameters make XML conversion with Python easy and flexible.
How to read XML file in Python
In this article, we will learn various ways to read XML files in Python. We will use some built-in modules and libraries available in Python, along with related custom examples. Let’s first look at what XML stands for and what it is used for, and then go through the parsing modules used to read XML documents in Python.
Introduction to XML
XML stands for Extensible Markup Language. It is used to store and exchange small to medium amounts of data. It allows programmers to develop their own applications that read data from other applications. The process of reading the information from an XML file and analyzing its logical structure is known as parsing. Therefore, reading an XML file is the same as parsing the XML document.
In this article, we will take a look at four different ways to read XML documents using different XML modules. They are:
1. MiniDOM (Minimal Document Object Model)
2. BeautifulSoup alongside the lxml parser
3. ElementTree
4. Simple API for XML (SAX)
XML File: We are using this XML file to read in our examples.
Read XML File Using MiniDOM
MiniDOM is a Python module used to read XML files. It provides the parse() function to read an XML file. We must import minidom before using its functions in the application. The syntax of this function is given below.
Syntax
xml.dom.minidom.parse(filename_or_file[, parser[, bufsize]])
This function returns a Document object that represents the XML content.
Example Read XML File in Python
Since each node is treated as an object, we can access the attributes and text of an element through the properties of that object. In the example below, we access the attributes and text of selected nodes.
```python
from xml.dom import minidom

# parse an XML file by name
file = minidom.parse('models.xml')

# use getElementsByTagName() to get all model tags
models = file.getElementsByTagName('model')

# one specific item attribute
print('model #2 attribute:')
print(models[1].attributes['name'].value)

# all item attributes
print('\nAll attributes:')
for elem in models:
    print(elem.attributes['name'].value)

# one specific item's data
print('\nmodel #2 data:')
print(models[1].firstChild.data)
print(models[1].childNodes[0].data)

# all items data
print('\nAll model data:')
for elem in models:
    print(elem.firstChild.data)
```
model #2 attribute:
model2
All attributes:
model1
model2
model #2 data:
model2abc
model2abc
All model data:
model1abc
model2abc
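The sample file itself is not shown, so here is a self-contained sketch that parses an in-memory document with minidom.parseString(). The XML layout (including the `<data>` and `<models>` wrappers) is an assumption, reconstructed only from the attribute values and text that appear in the output above.

```python
from xml.dom import minidom

# Assumed layout, reconstructed from the printed attributes and text
xml = """<data>
  <models>
    <model name="model1">model1abc</model>
    <model name="model2">model2abc</model>
  </models>
</data>"""

doc = minidom.parseString(xml)
models = doc.getElementsByTagName("model")

# getAttribute() is a shorthand for attributes['name'].value
names = [m.getAttribute("name") for m in models]

# the first child of each <model> is its text node
texts = [m.firstChild.data for m in models]

print(names)
print(texts)
```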
Read XML File Using BeautifulSoup alongside the lxml parser
In this example, we will use a Python library named BeautifulSoup. Beautiful Soup works with several parsers: Python’s standard library includes html.parser, while lxml is a third-party package that also provides the XML parser used here. Use the following commands to install Beautiful Soup and the lxml parser if they are not already installed.
```
# for Beautiful Soup
pip install beautifulsoup4
# for the lxml parser
pip install lxml
```
After successful installation, use these libraries in python code.
We are using this XML file to read with Python code.
Acer is a laptop Add model number here Onida is an oven Exclusive Add price here Add content here Add company name here Add number of employees here
Example Read XML File in Python
Let’s read the above file using the Beautiful Soup library in a Python script.
```python
from bs4 import BeautifulSoup

# Read the contents of the XML file into a variable named data
with open('models.xml', 'r') as f:
    data = f.read()

# Pass the stored data to the BeautifulSoup parser (the 'xml' feature requires lxml)
bs_data = BeautifulSoup(data, 'xml')

# Find all instances of the `unique` tag
b_unique = bs_data.find_all('unique')
print(b_unique)

# Use find() to get the first instance of the `child` tag
b_name = bs_data.find('child')
print(b_name)

# Extract the value stored in the `qty` attribute of that `child` tag
value = b_name.get('qty')
print(value)
```
Read XML File Using Element Tree
The ElementTree module provides multiple tools for manipulating XML files. No installation is required. Because XML is a hierarchical data format, it is natural to represent it as a tree. ElementTree represents the whole XML document as a single tree.
Example Read XML File in Python
To read an XML file, we first import the ElementTree module from the xml library. Then we pass the filename of the XML file to the ElementTree.parse() function to start parsing, and get the root (parent) tag of the XML file using getroot(). Next we display the root tag. To get the attributes of the first sub-tag of the root we use root[0].attrib. Finally, we display the text enclosed within the first sub-tag of the sixth sub-tag of the root, root[5][0].text (indexes start at 0).
```python
# import ElementTree
import xml.etree.ElementTree as ET

# pass the path of the XML document
tree = ET.parse('models.xml')

# get the root (parent) tag
root = tree.getroot()

# print the root tag along with its memory location
print(root)

# print the attributes of the first sub-tag
print(root[0].attrib)

# print the text of the first sub-tag of the sixth sub-tag (index 5) of the root
print(root[5][0].text)
```
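Index-based access such as root[5][0] depends on the exact layout of the file. As a minimal sketch of name-based access, the example below parses an in-memory document with ET.fromstring() and looks elements up with findall() and findtext(); the XML content is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical document used only for this illustration
xml = """<models>
  <model number="ST001"><price>35000</price><company>Samsung</company></model>
  <model number="RW345"><price>46500</price><company>Onida</company></model>
</models>"""

root = ET.fromstring(xml)

print(root.tag)         # name of the root tag
print(root[0].attrib)   # attributes of the first sub-tag
print(root[1][0].text)  # text of the first sub-tag of the second sub-tag

# name-based lookup: map each model number to its price
prices = {
    model.get("number"): int(model.findtext("price"))
    for model in root.findall("model")
}
print(prices)
```

Looking elements up by tag name keeps the code working if sub-tags are added or reordered.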
Read XML File Using Simple API for XML (SAX)
In this method, we first register callbacks for the events that occur, and then the parser proceeds through the document. This can be useful when documents are large or memory is limited. The parser processes the file as it reads it from disk, so the entire file is never stored in memory. Reading XML with this method requires creating a handler by subclassing xml.sax.ContentHandler.
Note: xml.sax ships with the Python standard library, including Python 3, so no extra installation is needed for this method.
- ContentHandler: handles the tags and attributes of the XML. Its methods are called at the beginning and at the end of every element.
- startDocument and endDocument: called at the start and the end of the XML file.
- If the parser isn’t in namespace mode, the methods startElement(tag, attributes) and endElement(tag) are called; otherwise, the corresponding methods startElementNS and endElementNS are called instead.
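The callbacks listed above can be observed with a small handler that records each event. This is a sketch with namespace mode switched on, so startElementNS is called instead of startElement; the namespace URI and the document are made up for the example.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, feature_namespaces


class EventRecorder(ContentHandler):
    """Records document and element events in the order they occur."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def startDocument(self):
        self.seen.append("start-document")

    def startElementNS(self, name, qname, attrs):
        # in namespace mode, name is a (uri, localname) tuple
        uri, localname = name
        self.seen.append((uri, localname))

    def endDocument(self):
        self.seen.append("end-document")


# Hypothetical namespaced document
xml_bytes = b'<m:models xmlns:m="urn:example:models"><m:model/></m:models>'

parser = xml.sax.make_parser()
parser.setFeature(feature_namespaces, True)  # turn namespace mode on
handler = EventRecorder()
parser.setContentHandler(handler)
parser.parse(io.BytesIO(xml_bytes))

print(handler.seen)
```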
35000 12 Samsung
46500 14 Onida
30000 8 Lenovo
45000 12 Acer
Python Code Example
```python
import xml.sax


class XMLHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.CurrentData = ""
        self.price = ""
        self.qty = ""
        self.company = ""

    # Called when an element starts
    def startElement(self, tag, attributes):
        self.CurrentData = tag
        if tag == "model":
            print("*****Model*****")
            title = attributes["number"]
            print("Model number:", title)

    # Called when an element ends
    def endElement(self, tag):
        if self.CurrentData == "price":
            print("Price:", self.price)
        elif self.CurrentData == "qty":
            print("Quantity:", self.qty)
        elif self.CurrentData == "company":
            print("Company:", self.company)
        self.CurrentData = ""

    # Called when character data is read
    def characters(self, content):
        if self.CurrentData == "price":
            self.price = content
        elif self.CurrentData == "qty":
            self.qty = content
        elif self.CurrentData == "company":
            self.company = content


# create an XMLReader
parser = xml.sax.make_parser()

# turn off namespaces
parser.setFeature(xml.sax.handler.feature_namespaces, 0)

# override the default ContentHandler
Handler = XMLHandler()
parser.setContentHandler(Handler)
parser.parse("models.xml")
```
*****Model*****
Model number: ST001
Price: 35000
Quantity: 12
Company: Samsung
*****Model*****
Model number: RW345
Price: 46500
Quantity: 14
Company: Onida
*****Model*****
Model number: EX366
Price: 30000
Quantity: 8
Company: Lenovo
*****Model*****
Model number: FU699
Price: 45000
Quantity: 12
Company: Acer
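For a version that runs without models.xml, the sketch below parses an in-memory document with xml.sax.parseString() and collects the values into a list of dictionaries instead of printing them. The XML layout is reconstructed from the handler code and the output above, and the `<models>` root tag is an assumption. Because a SAX parser may deliver the text of one element in several characters() calls, this variant buffers the pieces and joins them when the element ends.

```python
import xml.sax


class ModelCollector(xml.sax.ContentHandler):
    """Collects every <model> element as a dictionary."""

    def __init__(self):
        super().__init__()
        self.models = []
        self._current = None
        self._text = []

    def startElement(self, tag, attributes):
        if tag == "model":
            self._current = {"number": attributes["number"]}
        self._text = []  # start a fresh text buffer for this element

    def characters(self, content):
        # may be called more than once per text node, so buffer the chunks
        self._text.append(content)

    def endElement(self, tag):
        if tag in ("price", "qty", "company") and self._current is not None:
            self._current[tag] = "".join(self._text).strip()
        elif tag == "model":
            self.models.append(self._current)
            self._current = None
        self._text = []


# Assumed layout; the <models> root tag is hypothetical
xml_bytes = b"""<models>
  <model number="ST001"><price>35000</price><qty>12</qty><company>Samsung</company></model>
  <model number="RW345"><price>46500</price><qty>14</qty><company>Onida</company></model>
</models>"""

handler = ModelCollector()
xml.sax.parseString(xml_bytes, handler)

for model in handler.models:
    print(model)
```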
Conclusion
In this article, we learned about XML files and different ways to read an XML file using several modules and APIs: MiniDOM, Beautiful Soup, ElementTree, and the Simple API for XML (SAX). We also used some custom parsing code to parse the XML file.