Contents
- Python XML (lxml)
- XML vocabulary
- Parsing
- Parse XML text
- Parse XML from file
- Parse HTML from URL (keeping the doctype declaration)
- Parse HTML from URL (losing the doctype declaration)
- Setting encoding when parsing
- Iterating
- Iterate every element and all child elements in the tree:
- Iterate only specific elements in the tree:
- Iterate only immediate child elements in the tree:
- Processing XML
- Get an element that’s the child of root
- Find an element anywhere in the tree
- Get the tag
- Get the text
- Get the attributes
- Get a specific attribute
- Get a child element
- Get a specific element by ID
- Printing
- Printing XML
- Printing HTML
- Generating XML
Python XML (lxml)
This page has been archived and will receive no further updates.
XML vocabulary
- Each node is an element
- Each element has a tag
- Elements can have attributes within the tags
- Elements can have text between opening and closing tags
    <element-tag>
      <subElement-tag attribute="value" />
      <subElement-tag>element text</subElement-tag>
    </element-tag>
Parsing
Parse XML text
    import lxml.etree

    root = lxml.etree.fromstring(xml_text)
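As a small illustration of the call above, the sketch below parses a made-up document. The sample markup and the `rejected` flag are assumptions for demonstration only. It also shows that lxml refuses a `str` that carries an encoding declaration, so bytes are usually the safer input.

```python
import lxml.etree

# Hypothetical sample document, passed as bytes so the declaration is allowed.
xml_text = b"""<?xml version="1.0" encoding="utf-8"?>
<catalog>
  <book id="b1">Dune</book>
</catalog>"""

root = lxml.etree.fromstring(xml_text)
print(root.tag)      # tag of the root element
print(root[0].text)  # text of the first child

# A str that includes an encoding declaration raises ValueError.
try:
    lxml.etree.fromstring(xml_text.decode("utf-8"))
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```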
Parse XML from file
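    tree = lxml.etree.parse(infile_name)
    root = tree.getroot()
Note: infile_name can be either the full path to the file (as a string) or an open file object. parse() returns an ElementTree, so call getroot() to get the root element.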
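The two input forms mentioned in the note can be sketched as follows. The temporary file and its contents are assumptions made only so the example is self-contained.

```python
import os
import tempfile

import lxml.etree

# Write a throwaway file (hypothetical content) to parse.
with tempfile.NamedTemporaryFile("wb", suffix=".xml", delete=False) as tmp:
    tmp.write(b"<root><child/></root>")
    infile_name = tmp.name

try:
    # From a path: the result is an ElementTree, not an element.
    tree = lxml.etree.parse(infile_name)
    root_from_path = tree.getroot()

    # From an open file object; binary mode lets the parser detect the encoding.
    with open(infile_name, "rb") as infile:
        root_from_file = lxml.etree.parse(infile).getroot()
finally:
    os.remove(infile_name)

print(root_from_path.tag, root_from_file.tag)
```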
Parse HTML from URL (keeping the doctype declaration)
    import urllib.request
    import lxml.etree
    import lxml.html

    parser = lxml.etree.HTMLParser()
    with urllib.request.urlopen('https://pypi.python.org/simple') as f:
        page = lxml.html.parse(f, parser)

    import lxml.html

    # parse the page into an ElementTree (this keeps the doctype in page.docinfo)
    page = lxml.html.parse(source_url)
    # must refer to page.getroot() to get the lxml root object
    page.getroot().find('ELEMENT-TAG')
Parse HTML from URL (losing the doctype declaration)
    import lxml.html

    # keep only the root Element; the ElementTree (and its doctype) is discarded
    page = lxml.html.parse(source_url).getroot()
Setting encoding when parsing
    parser = lxml.etree.XMLParser(encoding='utf-8')
    root = lxml.etree.parse(infile, parser)

    parser = lxml.html.HTMLParser(encoding='utf-8')
    page = lxml.html.parse(infile, parser)
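One case where the parser's encoding argument matters is input that is not UTF-8 and has no declaration. A minimal sketch, assuming hypothetical Latin-1 bytes:

```python
import lxml.etree

# Hypothetical Latin-1 input with no encoding declaration; 0xE9 is "é".
raw = b"<name>Caf\xe9</name>"

# Tell the parser which encoding to assume instead of the UTF-8 default.
parser = lxml.etree.XMLParser(encoding="iso-8859-1")
root = lxml.etree.fromstring(raw, parser)
print(root.text)
```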
Iterating
Iterate every element and all child elements in the tree:
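A minimal sketch of a full-tree walk, assuming a made-up document: calling `iter()` with no argument visits the element itself and then every descendant in document order.

```python
import lxml.etree

# Hypothetical sample tree.
root = lxml.etree.fromstring("<root><a><b/></a><c/></root>")

# iter() with no tag filter yields root first, then all descendants.
tags = [element.tag for element in root.iter()]
print(tags)
```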
Iterate only specific elements in the tree:
    for element in root.iter('child'):
        print(element.tag, element.text)
Iterate only immediate child elements in the tree:
    for element in root:  # elements act like lists
        print(element.tag)
Processing XML
Get an element that’s the child of root
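One way to do this, sketched with a made-up document, is `find()` with a bare tag name, which only looks at the direct children of the element it is called on.

```python
import lxml.etree

# Hypothetical sample tree.
root = lxml.etree.fromstring("<root><head/><body><p>hi</p></body></root>")

# A bare tag name matches direct children of root only.
body = root.find("body")
print(body.tag)

# A grandchild is not matched this way; find() returns None.
grandchild = root.find("p")
print(grandchild)
```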
Find an element anywhere in the tree
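A possible approach, assuming a made-up document, is the `.//` path prefix, which searches all descendants rather than only direct children.

```python
import lxml.etree

# Hypothetical sample tree with <p> at two different depths.
root = lxml.etree.fromstring(
    "<root><body><p>one</p><div><p>two</p></div></body></root>"
)

# find() returns the first match in document order.
first = root.find(".//p")
print(first.text)

# findall() returns every match as a list.
every = root.findall(".//p")
print([p.text for p in every])
```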
Get the tag
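The tag is available as the `tag` attribute of an element. The sketch below uses made-up markup; the namespaced case is included because the tag then carries the namespace in braces.

```python
import lxml.etree

# Plain element: tag is just the name.
plain = lxml.etree.fromstring("<item/>")
print(plain.tag)

# Namespaced element (hypothetical feed): tag is "{namespace}name".
feed = lxml.etree.fromstring('<feed xmlns="http://www.w3.org/2005/Atom"/>')
print(feed.tag)

# QName splits off the local part of a namespaced tag.
local_name = lxml.etree.QName(feed).localname
print(local_name)
```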
Get the text
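Text lives in three places in lxml, which the following sketch (with made-up markup) tries to show: `text` is the text before the first child, `tail` is the text after an element's closing tag, and `itertext()` yields all text in document order.

```python
import lxml.etree

# Hypothetical mixed-content element.
p = lxml.etree.fromstring("<p>Hello <b>big</b> world</p>")

print(repr(p.text))     # text before the first child
print(repr(p[0].text))  # text inside <b>
print(repr(p[0].tail))  # text following </b>

# Join every text fragment under <p>.
full_text = "".join(p.itertext())
print(full_text)
```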
Get the attributes
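Attributes are exposed through the dict-like `attrib` property. A minimal sketch, assuming a made-up link element named `a1`:

```python
import lxml.etree

# Hypothetical element with two attributes.
a1 = lxml.etree.fromstring(
    '<a href="http://www.google.com" target="_blank">Google</a>'
)

# attrib behaves like a dictionary; copy it to get a plain dict.
attrs = dict(a1.attrib)
print(attrs)

# Attribute names are available via keys().
names = sorted(a1.keys())
print(names)
```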
Get a specific attribute
    >>> a1.get('href')
    'http://www.google.com'
Get a child element
    element.find('child-element-tag')     # immediate child only
    element.find('.//child-element-tag')  # any descendant, not necessarily immediate
Or: (elements behave like lists)
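The list-like behaviour can be sketched as follows, assuming a made-up tree: indexing, negative indexing, `len()` and slicing all operate on an element's immediate children.

```python
import lxml.etree

# Hypothetical sample tree.
root = lxml.etree.fromstring("<root><a/><b/><c/></root>")

first = root[0]    # first child
last = root[-1]    # last child
count = len(root)  # number of immediate children
rest = [child.tag for child in root[1:]]  # slicing returns a list of children

print(first.tag, last.tag, count, rest)
```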
Get a specific element by ID
page.getroot().get_element_by_id('desired_id')
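A self-contained sketch of the lookup, using made-up HTML parsed from a string instead of a URL. Here `page` is already the root element, so no `getroot()` call is needed. A missing id raises `KeyError` unless a default is passed as the second argument.

```python
import lxml.html

# Hypothetical page; fromstring() returns the root element directly.
page = lxml.html.fromstring(
    '<html><body><div id="desired_id">found</div></body></html>'
)

target = page.get_element_by_id("desired_id")
print(target.text)

# Supplying a default avoids KeyError for ids that do not exist.
missing = page.get_element_by_id("no_such_id", None)
print(missing)
```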
Printing
Printing XML
    lxml.etree.tostring(
        # the element you want to print
        root,
        # (recommended) set encoding
        encoding='utf-8',
        # (recommended) include XML declaration
        xml_declaration=True,
        # (recommended) add whitespace to output
        pretty_print=True,
        # (optional) add a standalone attribute to the declaration (boolean)
        standalone=True,
    )
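Note that with a byte encoding such as 'utf-8', `tostring()` returns bytes rather than a `str`. A small sketch with a made-up tree, decoding the result before printing:

```python
import lxml.etree

# Hypothetical sample tree.
root = lxml.etree.fromstring("<root><child>text</child></root>")

output = lxml.etree.tostring(
    root,
    encoding="utf-8",
    xml_declaration=True,
    pretty_print=True,
)

# Decode so print() shows the markup instead of a b'...' literal.
print(output.decode("utf-8"))
```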
Printing HTML
    lxml.html.tostring(
        # don't forget to add .getroot() if you didn't do it when parsing
        page.getroot(),
        # (recommended) set encoding
        encoding='utf-8',
        # (recommended) include the doctype declaration
        doctype=page.docinfo.doctype,
        pretty_print=True,
        # (optional) add for XHTML output (HTML is default)
        method='xml'
    )
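To try the call above without fetching a URL, the sketch below parses made-up HTML from an in-memory file object. The `io.BytesIO` source is an assumption for illustration; `method='xml'` is left out, so the default HTML serialization is used.

```python
import io

import lxml.html

# Hypothetical page held in memory instead of being fetched from a URL.
source = io.BytesIO(
    b"<!DOCTYPE html><html><head><title>t</title></head>"
    b"<body><p>hi</p></body></html>"
)

# parse() returns an ElementTree, which keeps the doctype in docinfo.
page = lxml.html.parse(source)

output = lxml.html.tostring(
    page.getroot(),
    encoding="utf-8",
    doctype=page.docinfo.doctype,
    pretty_print=True,
)
print(output.decode("utf-8"))
```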
Generating XML
    import lxml.etree

    # create an element
    root = lxml.etree.Element('root')
    # create a child element
    child = lxml.etree.SubElement(root, 'child')
    # set an attribute
    child.set('attribute-name', 'attribute-value')
    # tostring() returns bytes, so decode before printing
    print(lxml.etree.tostring(root, pretty_print=True).decode())