XML parser Python encoding

Contents

Python XML (lxml)
XML vocabulary
Parsing
Parse XML text
Parse XML from file
Parse HTML from URL (keeping the doctype declaration)
Parse HTML from URL (losing the doctype declaration)
Setting encoding when parsing
Iterating
Iterate every element and all child elements in the tree:
Iterate only specific elements in the tree:
Iterate only immediate child elements in the tree:
Processing XML
Get an element that’s the child of root
Find an element anywhere in the tree
Get the tag
Get the text
Get the attributes
Get a specific attribute
Get a child element
Get a specific element by ID
Printing
Printing XML
Printing HTML
Generating XML

Python XML (lxml)

This page has been archived and will receive no further updates.

XML vocabulary

Each node is an element
Each element has a tag
Elements can have attributes within the tags
Elements can have text between opening and closing tags

<element-tag>
    <subElement-tag attribute="value" />
    <subElement-tag>element text</subElement-tag>
</element-tag>

Parsing

Parse XML text

import lxml.etree

root = lxml.etree.fromstring(xml_text)
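Since this page is about encodings, it may help to see how fromstring treats bytes versus str input. The sketch below uses a made-up sample document (the xml_bytes name and its content are assumptions for illustration): bytes input lets lxml read the encoding from the XML declaration, while a str that still carries an encoding declaration is rejected.

```python
import lxml.etree

# Hypothetical sample document; its declaration names its own encoding.
xml_bytes = (
    "<?xml version='1.0' encoding='utf-8'?>"
    "<root><child>café</child></root>"
).encode("utf-8")

# Bytes input: lxml reads the encoding from the declaration.
root = lxml.etree.fromstring(xml_bytes)
child_text = root[0].text

# The same document as a str is rejected because of the declaration.
try:
    lxml.etree.fromstring(xml_bytes.decode("utf-8"))
    str_rejected = False
except ValueError:
    str_rejected = True
```

If the text arrives as a str, either encode it back to bytes or strip the declaration before parsing.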

Parse XML from file

# parse() returns an ElementTree; call getroot() to get the root element
tree = lxml.etree.parse(infile_name)
root = tree.getroot()

Note: infile_name can be either the path to the file as a string or an open file object.
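As a small illustration of the file-object case, the sketch below uses io.BytesIO as a stand-in for a file opened in binary mode (the sample content is made up):

```python
import io
import lxml.etree

# An in-memory stand-in for a file opened with open(path, 'rb').
infile = io.BytesIO(b"<root><child/></root>")

tree = lxml.etree.parse(infile)   # an ElementTree wrapping the document
root = tree.getroot()             # the root Element
```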

Parse HTML from URL (keeping the doctype declaration)

import urllib.request
import lxml.etree
import lxml.html

parser = lxml.etree.HTMLParser()
with urllib.request.urlopen('https://pypi.python.org/simple') as f:
    page = lxml.html.parse(f, parser)

import lxml.html

# parse the page into an lxml ElementTree (not an Element)
page = lxml.html.parse(source_url)
# call page.getroot() to get the root element before searching
page.getroot().find('ELEMENT-TAG')

Parse HTML from URL (losing the doctype declaration)

import lxml.html

# keep only the root element; the ElementTree and its doctype info are discarded
page = lxml.html.parse(source_url).getroot()

Setting encoding when parsing

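# XML: force the input encoding instead of relying on the XML declaration
parser = lxml.etree.XMLParser(encoding='utf-8')
tree = lxml.etree.parse(infile, parser)
root = tree.getroot()

# HTML: same idea with the HTML parser
parser = lxml.html.HTMLParser(encoding='utf-8')
page = lxml.html.parse(infile, parser)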
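To see why the parser's encoding argument matters, here is a minimal sketch with made-up input: Latin-1 bytes that carry no XML declaration. Without an override the parser assumes UTF-8 and is expected to fail on the byte 0xE9; with the override the text is decoded correctly. The variable names are assumptions for illustration.

```python
import lxml.etree

# Hypothetical input: Latin-1 bytes with no XML declaration.
latin1_bytes = "<root>café</root>".encode("iso-8859-1")

# Default parser: assumes UTF-8, so this is expected to raise.
try:
    lxml.etree.fromstring(latin1_bytes)
    default_failed = False
except lxml.etree.XMLSyntaxError:
    default_failed = True

# Parser with an explicit encoding override.
parser = lxml.etree.XMLParser(encoding="iso-8859-1")
root = lxml.etree.fromstring(latin1_bytes, parser)
text = root.text
```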

Iterating

Iterate every element and all child elements in the tree:
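A minimal sketch, assuming a small made-up tree: calling iter() with no arguments walks the element it is called on and all of its descendants in document order.

```python
import lxml.etree

root = lxml.etree.fromstring(
    "<root><child><grandchild/></child><child/></root>"
)

# iter() with no tag filter yields root itself, then every descendant.
tags = [element.tag for element in root.iter()]
```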

Iterate only specific elements in the tree:

for element in root.iter('child'):
    print(element.tag)

Iterate only immediate child elements in the tree:

for element in root:  # elements act like lists
    print(element.tag)

Processing XML

Get an element that’s the child of root
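One way to do this, sketched with a made-up document: find() with a plain tag name returns the first immediate child with that tag, and findall() returns all of them.

```python
import lxml.etree

root = lxml.etree.fromstring(
    "<root><child name='first'/><child name='second'/></root>"
)

first_child = root.find("child")        # first immediate child, or None
all_children = root.findall("child")    # list of all immediate children
```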

Find an element anywhere in the tree
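A minimal sketch, assuming a made-up document with a nested link: a plain tag name only matches immediate children, while the './/' prefix searches all descendants.

```python
import lxml.etree

root = lxml.etree.fromstring(
    "<root><section>"
    "<a href='http://www.google.com'>Google</a>"
    "</section></root>"
)

direct = root.find("a")     # None: <a> is not an immediate child of root
a1 = root.find(".//a")      # found anywhere below root
```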

Get the tag

Get the text

Get the attributes
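The tag, text, and attributes are all plain properties of an element. The sketch below builds a hypothetical a1 element (the markup is made up for illustration) and reads each in turn.

```python
import lxml.etree

a1 = lxml.etree.fromstring(
    "<a href='http://www.google.com' target='_blank'>Google</a>"
)

tag = a1.tag                    # the element's tag name
text = a1.text                  # text between the opening and closing tags
attributes = dict(a1.attrib)    # attrib behaves like a dict
```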

Get a specific attribute

>>> a1.get('href')
'http://www.google.com'

Get a child element

element.find('child-element-tag')  # only matches an immediate child


element.find('.//child-element-tag')  # matches a descendant, not necessarily an immediate child

Or index into the element (elements behave like lists):
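A short sketch of the list-like access, using a made-up parent element:

```python
import lxml.etree

element = lxml.etree.fromstring("<parent><first/><second/></parent>")

first = element[0]      # first child
last = element[-1]      # last child
count = len(element)    # number of immediate children
```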

Get a specific element by ID

page.getroot().get_element_by_id('desired_id')
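get_element_by_id is available on elements produced by lxml.html. The sketch below parses a made-up HTML string instead of a URL; passing a second argument supplies a default instead of raising KeyError when the ID is missing.

```python
import lxml.html

doc = lxml.html.fromstring(
    "<html><body><div id='desired_id'>Hello</div></body></html>"
)

found = doc.get_element_by_id("desired_id")
missing = doc.get_element_by_id("no_such_id", None)  # default instead of KeyError
```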

Printing

Printing XML

lxml.etree.tostring(
    # the element you want to print
    root,
    # (recommended) set encoding
    encoding='utf-8',
    # (recommended) include XML declaration
    xml_declaration=True,
    # (recommended) add whitespace to output
    pretty_print=True,
    # (optional) add a standalone declaration; lxml treats this value as a boolean
    standalone=True
)
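Note that tostring returns bytes when given a byte encoding such as 'utf-8', and a str when given encoding='unicode' (in which case no XML declaration is written). A minimal sketch with a made-up tree:

```python
import lxml.etree

root = lxml.etree.Element("root")
lxml.etree.SubElement(root, "child").text = "café"

# Byte encoding: the result is bytes and can carry an XML declaration.
as_bytes = lxml.etree.tostring(root, encoding="utf-8", xml_declaration=True)

# encoding='unicode': the result is a str with no declaration.
as_text = lxml.etree.tostring(root, encoding="unicode")
```

Write the bytes form to files opened in binary mode; use the str form for printing or logging.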

Printing HTML

lxml.html.tostring(
    # don't forget to add .getroot() if you didn't do it when parsing
    page.getroot(),
    # (recommended) set encoding
    encoding='utf-8',
    # (recommended) include the doctype declaration
    doctype=page.docinfo.doctype,
    pretty_print=True,
    # (optional) add for XHTML output (HTML is default)
    method='xml'
)
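For a self-contained illustration, the sketch below parses a made-up HTML document from memory rather than from a URL, keeps the ElementTree so that docinfo is available, and passes the stored doctype back to tostring.

```python
import io
import lxml.html

# Hypothetical page content; io.BytesIO stands in for a file or HTTP response.
source = io.BytesIO(
    b"<!DOCTYPE html><html><body><p>Hello</p></body></html>"
)

page = lxml.html.parse(source)   # ElementTree, so page.docinfo exists
html_bytes = lxml.html.tostring(
    page.getroot(),
    encoding="utf-8",
    doctype=page.docinfo.doctype,
)
```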

Generating XML

import lxml.etree

# create an element
root = lxml.etree.Element('root')
# create a child element
child = lxml.etree.SubElement(root, 'child')
# set an attribute
child.set('attribute-name', 'attribute-value')

# tostring() returns bytes, so decode before printing
>>> print(lxml.etree.tostring(root, pretty_print=True).decode())
<root>
  <child attribute-name="attribute-value"/>
</root>
