How to correctly parse utf-8 xml with ElementTree?
I need help understanding why parsing my XML file* with xml.etree.ElementTree produces the following errors. *My test XML file contains Arabic characters. Task: open and parse the file utf8_file.xml. My first try:
    import codecs
    import xml.etree.ElementTree as etree

    with codecs.open('utf8_file.xml', 'r', encoding='utf-8') as utf8_file:
        xml_tree = etree.parse(utf8_file)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 236-238: ordinal not in range(128)
My second try:

    import codecs
    import xml.etree.ElementTree as etree

    with codecs.open('utf8_file.xml', 'r', encoding='utf-8') as utf8_file:
        xml_string = etree.tostring(utf8_file, encoding='utf-8', method='xml')
        xml_tree = etree.fromstring(xml_string)
AttributeError: 'file' object has no attribute 'getiterator'
1 Answer
Leave decoding the bytes to the parser; do not decode first:
    import xml.etree.ElementTree as etree

    # Open in binary mode so the parser receives undecoded bytes.
    with open('utf8_file.xml', 'rb') as xml_file:
        xml_tree = etree.parse(xml_file)
An XML file must contain enough information in its first line (the XML declaration) for the parser to handle decoding. If the declaration is missing, the parser must assume that UTF-8 is used.
Because it is the XML header that holds this information, it is the responsibility of the parser to do all decoding.
Your first attempt failed because Python was trying to encode the Unicode values again so that the parser could handle byte strings as it expected. The second attempt failed because etree.tostring() expects a parsed tree (an Element) as its first argument, not a file object.
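As a minimal sketch of the advice above, the following hands raw UTF-8 bytes to the parser and lets it pick the encoding from the declaration. The sample document and its greeting element are made up for illustration, and an in-memory BytesIO stands in for a file opened in binary mode.

```python
import io
import xml.etree.ElementTree as etree

# Hypothetical sample text: an Arabic word written with escapes for portability.
arabic_text = u'\u0645\u0631\u062d\u0628\u0627'

# Build the document as UTF-8 bytes, as it would be stored on disk.
xml_bytes = (
    u'<?xml version="1.0" encoding="UTF-8"?>'
    u'<greeting>' + arabic_text + u'</greeting>'
).encode('utf-8')

# The parser receives undecoded bytes and decodes them itself.
xml_tree = etree.parse(io.BytesIO(xml_bytes))
root = xml_tree.getroot()

print(root.tag)
print(root.text == arabic_text)
```

With a real file, the same pattern applies: open it with mode 'rb' and pass the file object to etree.parse().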
How to write XML declaration using xml.etree.ElementTree
I am generating an XML document in Python using ElementTree, but the tostring function doesn't include an XML declaration when converting to plain text.
    from xml.etree.ElementTree import Element, SubElement, tostring

    document = Element('outer')
    node = SubElement(document, 'inner')
    node.NewValue = 1  # a plain Python attribute; it is not serialized
    print(tostring(document))  # Outputs "<outer><inner /></outer>"
I need the output to include an XML declaration (including a 'standalone' attribute). However, there does not seem to be any documented way of doing this. Is there a proper method for rendering the XML declaration in an ElementTree?
11 Answers
I am surprised to find that there doesn't seem to be a way with ElementTree.tostring(). You can, however, use ElementTree.ElementTree.write() to write your XML document to a fake file:
    from io import BytesIO
    from xml.etree import ElementTree as ET

    document = ET.Element('outer')
    node = ET.SubElement(document, 'inner')
    et = ET.ElementTree(document)

    f = BytesIO()
    et.write(f, encoding='utf-8', xml_declaration=True)
    print(f.getvalue())  # your XML file, encoded as UTF-8
See this question. Even then, I don't think you can get your 'standalone' attribute without prepending it yourself.
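Prepending the declaration by hand could look like the sketch below. It assumes UTF-8 output, and the declaration string is written out manually because, as far as I know, xml.etree.ElementTree has no option for 'standalone'. The element names follow the question.

```python
import xml.etree.ElementTree as ET

document = ET.Element('outer')
ET.SubElement(document, 'inner')

# Serialize the body only; with encoding='utf-8' no declaration is emitted by default.
body = ET.tostring(document, encoding='utf-8')

# Hand-written declaration carrying the 'standalone' attribute.
declaration = b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\n'
xml_bytes = declaration + body

print(xml_bytes.decode('utf-8'))
```

The encoding named in the hand-written declaration has to match the encoding passed to tostring(), since nothing checks that for you.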
Is there a pretty-print parameter for et.write()? Or any other way to generate XML with line breaks?
    from lxml import etree

    document = etree.Element('outer')
    node = etree.SubElement(document, 'inner')
    print(etree.tostring(document, xml_declaration=True))
xml.etree.ElementTree.tostring writes an XML encoding declaration with encoding='utf8'
Sample Python code (works with Python 2 and 3):
    import xml.etree.ElementTree as ElementTree

    tree = ElementTree.ElementTree(
        ElementTree.fromstring('<xml><test>123</test></xml>')
    )
    root = tree.getroot()

    print('without:')
    print(ElementTree.tostring(root, method='xml'))
    print('')
    print('with:')
    print(ElementTree.tostring(root, encoding='utf8', method='xml'))

    $ python2 example.py
    without:
    <xml><test>123</test></xml>

    with:
    <?xml version='1.0' encoding='utf8'?>
    <xml><test>123</test></xml>

With Python 3 you will note the b prefix, indicating that bytes are returned (just like with Python 2):

    $ python3 example.py
    without:
    b'<xml><test>123</test></xml>'

    with:
    b"<?xml version='1.0' encoding='utf8'?>\n<xml><test>123</test></xml>"
What helped in this answer is wondering why you were doing so much of this ElementTree.ElementTree(ElementTree.fromstring(...)), and I now realize fromstring returns an Element, not an ElementTree, whereas the parse method does return an ElementTree. This makes trying to mock an XML file in a test suite by using a string very confusing! If you take that element and run tostring, it allows those encoding and method parameters, but the output is missing the
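Note that utf8 is not the registered name of the encoding (that is UTF-8), although Python accepts it as an alias. That is why the declaration appears: ElementTree omits the declaration by default only when the encoding is utf-8, us-ascii or unicode. In Python 3 the result is bytes rather than a string for any encoding other than 'unicode'.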
xml_declaration Argument
Is there a proper method for rendering the XML declaration in an ElementTree?
Yes, and there is no need to use the .tostring function. According to the ElementTree documentation, you should create the Element and its SubElements, wrap the root element in an ElementTree object, and finally pass the xml_declaration argument to the .write function so that the declaration line is included in the output file.

    import xml.etree.ElementTree as ET

    document = ET.Element("outer")
    node1 = ET.SubElement(document, "inner")
    node1.text = "text"

    # Wrap the root element in a tree so that write() is available.
    tree = ET.ElementTree(document)
    tree.write("./output.xml", encoding="UTF-8", xml_declaration=True)
I encountered this issue recently. After some digging through the code, I found that the following snippet is the definition of the function ElementTree.write:
    def write(self, file, encoding="us-ascii"):
        assert self._root is not None
        if not hasattr(file, "write"):
            file = open(file, "wb")
        if not encoding:
            encoding = "us-ascii"
        elif encoding != "utf-8" and encoding != "us-ascii":
            file.write("<?xml version='1.0' encoding='%s'?>\n" % encoding)
        self._write(file, self._root, encoding, {})
So the answer is: if you need to write the XML header to your file, set the encoding argument to something other than utf-8 or us-ascii, e.g. UTF-8 (the comparison in this snippet is case-sensitive).
It would be a nice albeit brittle hack, but it doesn't seem to work (the encoding is probably lower-cased before that). Also, ElementTree.ElementTree.write() is documented to have an xml_declaration parameter (see the accepted answer). But ElementTree.tostring() doesn't have that parameter, and that was the method asked about in the original question.
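The comment's observation can be checked with a small sketch. It assumes Python 3, where the encoding name is lower-cased before the comparison; serialize is a hypothetical helper introduced only for this example.

```python
import io
import xml.etree.ElementTree as ET

tree = ET.ElementTree(ET.Element('outer'))

def serialize(**write_kwargs):
    """Write the tree to an in-memory buffer and return the resulting bytes."""
    buffer = io.BytesIO()
    tree.write(buffer, **write_kwargs)
    return buffer.getvalue()

# Upper-case 'UTF-8' alone does not trigger the declaration here.
upper_case = serialize(encoding='UTF-8')

# Asking for the declaration explicitly does.
explicit = serialize(encoding='UTF-8', xml_declaration=True)

print(upper_case)
print(explicit)
```

The explicit xml_declaration argument does not depend on how the encoding name is spelled, which makes it the more robust choice of the two.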
Sample for both Python 2 and 3 (encoding parameter must be utf8):
    import xml.etree.ElementTree as ElementTree

    tree = ElementTree.ElementTree(ElementTree.fromstring('<xml><test>123</test></xml>'))
    root = tree.getroot()
    print(ElementTree.tostring(root, encoding='utf8', method='xml'))
Since Python 3.8, tostring has an xml_declaration parameter for this:
New in version 3.8: The xml_declaration and default_namespace parameters.
xml.etree.ElementTree.tostring(element, encoding="us-ascii", method="xml", *, xml_declaration=None, default_namespace=None, short_empty_elements=True) Generates a string representation of an XML element, including all subelements. element is an Element instance. encoding is the output encoding (default is US-ASCII). Use encoding="unicode" to generate a Unicode string (otherwise, a bytestring is generated). method is either "xml", "html" or "text" (default is "xml"). xml_declaration, default_namespace and short_empty_elements have the same meaning as in ElementTree.write(). Returns an (optionally) encoded string containing the XML data.
Sample for Python 3.8 and higher:
    import xml.etree.ElementTree as ElementTree

    tree = ElementTree.ElementTree(ElementTree.fromstring('<xml><test>123</test></xml>'))
    root = tree.getroot()
    print(ElementTree.tostring(root, encoding='unicode', method='xml', xml_declaration=True))