Python: opening UTF-8 files with a BOM

Why do Python unicode strings require special treatment for the UTF-8 BOM?

For some reason, Python seems to have issues with the BOM when reading Unicode strings from a UTF-8 file. Consider the following:

with open('test.py') as f:
    for line in f:
        print unicode(line, 'utf-8')

UnicodeEncodeError: 'charmap' codec can't encode character u'\ufeff' in position 0: character maps to <undefined>

import codecs

with open('test.py') as f:
    for line in f:
        print unicode(line.replace(codecs.BOM_UTF8, ''), 'utf-8')

This one runs fine. However, I'm struggling to see any merit in this. Is there a rationale behind the behavior described above? In contrast, UTF-16 works seamlessly.

It cannot encode it because the 'charmap' codec used by your console has no mapping for U+FEFF. The deeper problem is that UTF-8 files aren't supposed to contain a BOM in them! A BOM is neither required nor recommended for UTF-8, because endianness makes no sense with 8-bit code units. BOMs screw things up, too, because you can no longer just do cat a b c > abc if those files have extraneous (read: any) BOMs in them. UTF-8 streams should not contain a BOM. If you need to specify the contents of the file, you are supposed to use a higher-level protocol. This is just a Windows bug.

@tchrist: You know, this explanation combined with Josh Lee's suggestion would make a perfect answer.

2 Answers

The 'utf-8-sig' encoding will consume the BOM signature on your behalf.
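For example (a minimal sketch; io.open exists in Python 2.6+ and 3.X, and the 'utf-8-sig' codec consumes a leading BOM if one is present, and works the same if none is):

import io

# 'utf-8-sig' decodes UTF-8 whether or not the file starts with
# a BOM; a leading U+FEFF is consumed instead of returned as data.
with io.open('test.py', encoding='utf-8-sig') as f:
    for line in f:
        print(line)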

@Gringo Suave: The funny thing is that the Unicode Standard does allow a BOM in UTF-8. See unicode.org/versions/Unicode5.0.0/ch02.pdf page 36, table 2-4.

UnicodeEncodeError: 'charmap' codec can't encode character u'\ufeff' in position 0: character maps to <undefined>

When you specify the "utf-8" encoding in Python, it takes you at your word. UTF-8 files aren't supposed to contain a BOM in them. They are neither required nor recommended. Endianness makes no sense with 8-bit code units.


BOMs screw things up, too, because you can no longer just do:

cat a b c > abc

if those UTF-8 files have extraneous (read: any) BOMs in them. See now why BOMs are so stupid/bad/harmful in UTF-8? They actually break things.
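To make that concrete, here is a minimal sketch (the two byte strings stand in for the files a and b from the cat example):

import codecs

# Two UTF-8 "files", each saved with its own BOM
a = codecs.BOM_UTF8 + b'spam\n'
b = codecs.BOM_UTF8 + b'SPAM\n'

# What `cat a b > ab` produces: the second BOM is now embedded
combined = a + b
print(repr(combined.decode('utf-8')))
# -> '\ufeffspam\n\ufeffSPAM\n': the stray U+FEFF in the middle is
#    data now, and even 'utf-8-sig' strips only the leading one.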

A BOM is metadata, not data, and the UTF-8 encoding spec makes no allowance for them the way the UTF-16 and UTF-32 specs do. So Python took you at your word and followed the spec. Hard to blame it for that.

If you are trying to use the BOM as a filetype magic number to specify the contents of the file, you really should not be doing that. You are supposed to use a higher-level protocol for these metadata purposes, just as you would with a MIME type.

This is just another lame Windows bug, the workaround for which is to use the alternate encoding "utf-8-sig" when opening the file in Python.


Handling the BOM in Python 3.0

As described earlier in this chapter, some encoding schemes store a special byte order marker (BOM) sequence at the start of files, to specify data endianness or declare the encoding type. Python both skips this marker on input and writes it on output if the encoding name implies it, but we sometimes must use a specific encoding name to force BOM processing explicitly.

For example, when you save a text file in Windows Notepad, you can specify its encoding type in a drop-down list: simple ASCII text, UTF-8, or little- or big-endian UTF-16. If a one-line text file named spam.txt is saved in Notepad as the encoding type "ANSI," for instance, it's written as simple ASCII text without a BOM. When this file is read in binary mode in Python, we can see the actual bytes stored in the file. When it's read as text, Python performs end-of-line translation by default; we can decode it as explicit UTF-8 text since ASCII is a subset of this scheme (and UTF-8 is Python 3.0's default encoding):

c:\misc> C:\Python30\python # File saved in Notepad

>>> open('spam.txt', 'rb').read() # ASCII (UTF-8) text file
b'spam\r\nSPAM\r\n'

>>> open('spam.txt', 'r').read() # Text mode translates line-end
'spam\nSPAM\n'

>>> open('spam.txt', 'r', encoding='utf-8').read()
'spam\nSPAM\n'

If this file is instead saved as "UTF-8" in Notepad, it is prepended with a three-byte UTF-8 BOM sequence, and we need to give a more specific encoding name ("utf-8-sig") to force Python to skip the marker:

>>> open('spam.txt', 'rb').read() # UTF-8 with 3-byte BOM
b'\xef\xbb\xbfspam\r\nSPAM\r\n'

>>> open('spam.txt', 'r', encoding='utf-8').read()
'\ufeffspam\nSPAM\n'

>>> open('spam.txt', 'r', encoding='utf-8-sig').read()
'spam\nSPAM\n'

If the file is stored as "Unicode big endian" in Notepad, we get UTF-16-format data in the file, prepended with a two-byte BOM sequence. The encoding name "utf-16" in Python skips the BOM because it is implied (all UTF-16 files have a BOM), and "utf-16-be" handles the big-endian format but does not skip the BOM:

>>> open('spam.txt', 'rb').read()
b'\xfe\xff\x00s\x00p\x00a\x00m\x00\r\x00\n\x00S\x00P\x00A\x00M\x00\r\x00\n'

>>> open('spam.txt', 'r').read()
UnicodeEncodeError: 'charmap' codec can't encode character '\xfe' in position 1: character maps to <undefined>

>>> open('spam.txt', 'r', encoding='utf-16').read()
'spam\nSPAM\n'

>>> open('spam.txt', 'r', encoding='utf-16-be').read()
'\ufeffspam\nSPAM\n'

The same is generally true for output. When writing a Unicode file in Python code, we need a more explicit encoding name to force the BOM in UTF-8: "utf-8" does not write (or skip) the BOM, but "utf-8-sig" does:

>>> open('temp.txt', 'w', encoding='utf-8').write('spam\nSPAM\n')
10

>>> open('temp.txt', 'rb').read() # No BOM
b'spam\r\nSPAM\r\n'

>>> open('temp.txt', 'w', encoding='utf-8-sig').write('spam\nSPAM\n')
10

>>> open('temp.txt', 'rb').read() # Wrote BOM
b'\xef\xbb\xbfspam\r\nSPAM\r\n'

>>> open('temp.txt', 'r', encoding='utf-8').read() # Keeps BOM
'\ufeffspam\nSPAM\n'

>>> open('temp.txt', 'r', encoding='utf-8-sig').read() # Skips BOM
'spam\nSPAM\n'

Notice that although "utf-8" does not drop the BOM, data without a BOM can be read with both "utf-8" and "utf-8-sig". Use the latter for input if you're not sure whether a BOM is present in a file (and don't read this paragraph out loud in an airport security line!):

>>> open('temp.txt', 'w', encoding='utf-8').write('spam\nSPAM\n')
10

>>> open('temp.txt', 'rb').read() # Data without BOM
b'spam\r\nSPAM\r\n'

>>> open('temp.txt', 'r').read() # Any utf-8 works
'spam\nSPAM\n'

>>> open('temp.txt', 'r', encoding='utf-8').read()
'spam\nSPAM\n'

>>> open('temp.txt', 'r', encoding='utf-8-sig').read()
'spam\nSPAM\n'

Finally, for the encoding name "utf-16," the BOM is handled automatically: on output, data is written in the platform's native endianness, and the BOM is always written; on input, data is decoded per the BOM, and the BOM is always stripped. More specific UTF-16 encoding names can specify different endianness, though you may have to manually write and skip the BOM yourself in some scenarios if it is required or present:

>>> open('temp.txt', 'w', encoding='utf-16').write('spam\nSPAM\n')
10

>>> open('temp.txt', 'r', encoding='utf-16').read()
'spam\nSPAM\n'

>>> open('temp.txt', 'w', encoding='utf-16-be').write('\ufeffspam\nSPAM\n')
11

>>> open('temp.txt', 'r', encoding='utf-16').read()
'spam\nSPAM\n'

>>> open('temp.txt', 'r', encoding='utf-16-be').read()
'\ufeffspam\nSPAM\n'

The more specific UTF-16 encoding names work fine with BOM-less files, though "utf-16" requires one on input in order to determine byte order:

>>> open('temp.txt', 'w', encoding='utf-16-le').write('SPAM')
4

>>> open('temp.txt', 'rb').read() # OK if BOM not present or expected
b'S\x00P\x00A\x00M\x00'

>>> open('temp.txt', 'r', encoding='utf-16-le').read()
'SPAM'

>>> open('temp.txt', 'r', encoding='utf-16').read()
UnicodeError: UTF-16 stream does not start with BOM

Experiment with these encodings yourself or see Python’s library manuals for more details on the BOM.
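As a starting point for such experiments, here is a hypothetical helper (a sketch, not part of any standard library) that sniffs a file's leading bytes against the BOM constants in the codecs module and returns an encoding name whose decoder will strip the marker:

import codecs

def detect_bom_encoding(path, default='utf-8'):
    # Hypothetical helper: map a leading BOM to an encoding name
    # whose decoder strips that BOM; fall back to a default.
    with open(path, 'rb') as f:
        head = f.read(4)
    # Check UTF-32 before UTF-16: the UTF-32-LE BOM begins with
    # the same bytes (ff fe) as the UTF-16-LE BOM.
    if head.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return 'utf-32'
    if head.startswith(codecs.BOM_UTF8):
        return 'utf-8-sig'
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return 'utf-16'
    return default

# Usage:
# text = open('spam.txt', 'r', encoding=detect_bom_encoding('spam.txt')).read()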

