Xml parser python encoding

Python XML (lxml)

This page has been archived and will receive no further updates.

XML vocabulary

  • Each node is an element
  • Each element has a tag
  • Elements can have attributes within the tags
  • Elements can have text between opening and closing tags
<element-tag> <subElement-tag attribute="value" /> <subElement-tag>>element textsubElement-tag> element-tag>

Parsing

Parse XML text

import lxml.etree root = lxml.etree.fromstring(xml_text) 

Parse XML from file

root = lxml.etree.parse(infile_name) 

Note: infile_name can be the full path to the file as a string or a file object

Parse HTML from URL (keeping the doctype declaration)

import urllib.request import lxml.etree import lxml.html parser = lxml.etree.HTMLParser() with urllib.request.urlopen('https://pypi.python.org/simple') as f: page = lxml.html.parse(f, parser) 
import lxml.html # put the page into an lxml Element type page = lxml.html.parse(source_url) # must refer to page.getroot() to get the lxml root object page.getroot().find('ELEMENT-TAG') 

Parse HTML from URL (losing the doctype declaration)

import lxml.html # put the page into an lxml Element type page = lxml.html.parse(source_url).getroot() 

Setting encoding when parsing

parser = lxml.etree.XMLParser(encoding='utf-8') root = lxml.etree.parse(infile, parser) parser = lxml.html.HTMLParser(encoding='utf-8') page = lxml.html.parse(infile, parser) 

Iterating

Iterate every element and all child elements in the tree:

Iterate only specific elements in the tree:

for element in root.iter('child'): 

Iterate only immediate child elements in the tree:

for element in root: # elements act like lists 

Processing XML

Get an element that’s the child of root

Find an element anywhere in the tree

Get the tag

Get the text

Get the attributes

Get a specific attribute

>>> a1.get('href') 'http://www.google.com' 

Get a child element

element .find(‘ child-element-tag ‘) # this is if it’s an immediate child

Читайте также:  Spider men 3 java

element .find(‘.// child-element-tag ‘) # this is if it’s a child but not necessarily immediate

Or: (elements behave like lists)

Get a specific element by ID

page.getroot().get_element_by_id('desired_id') 

Printing

Printing XML

lxml.etree.tostring( # the element you want to print root, # (recommended) set encoding encoding='utf-8', # (recommended) include XML declaration xml_declaration=True, # (recommended) add whitespace to output pretty_print=True, # (optional) add a standalone attribute standalone='yes' ) 

Printing HTML

lxml.html.tostring( # don't forget to add .getroot() if you didn't do it when parsing page.getroot(), # (recommended) set encoding encoding='utf-8', # (recommended) include the doctype declaration doctype=page.docinfo.doctype, pretty_print=True, # (optional) add for XHTML output (HTML is default) method='xml' ) 

Generating XML

import lxml.etree # create an element root = lxml.etree.Element('root') # create a child element child = lxml.etree.SubElement(root, 'child') # set an attribute child.set('attribute-name', 'attribute-value') >>> print(lxml.etree.tostring(root, pretty_print=True))   

Источник

Оцените статью