Change file encoding python

Contents

Convert file encoding using python
2. Environments
3. The requirement
4. The code
5. The io.open
10.9. File Encoding
10.9.1. Str vs Bytes
10.9.2. UTF-8
10.9.3. Unicode Encode Error
10.9.4. Unicode Decode Error
10.9.5. Escape Characters

Convert file encoding using python

In this post, I will demonstrate how to convert a text file's encoding using Python.

2. Environments

3. The requirement

There is a text file named a.dat whose encoding is not UTF-8.
a.dat contains lines of text.
You want to convert the whole file to UTF-8.

4. The code

# coding: utf-8
import io

fname = 'a.dat'


def process_file():
    # Read the whole file, decoding its bytes as latin_1.
    with io.open(fname, 'r', encoding='latin_1', newline='\n') as fin:
        text = fin.read()
    # Write the same text to a new file, encoding it as UTF-8.
    with io.open(fname + '_out', 'w', encoding='utf-8', newline='\n') as fout:
        fout.write(text)


if __name__ == '__main__':
    process_file()

We read the file with the latin_1 encoding into the variable text.
We write the content of text to a new file, a.dat_out, with the utf-8 encoding.
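The script above reads the whole file into memory. For larger files, the same idea can be applied one line at a time. The following is a minimal sketch, not the post's own code: the function name `convert_encoding`, its parameters, and the sample content are hypothetical, and the demo runs on a temporary file rather than the real a.dat.

```python
import io
import os
import tempfile


def convert_encoding(src_path, dst_path, src_encoding, dst_encoding="utf-8"):
    """Re-encode src_path into dst_path one line at a time (hypothetical helper)."""
    # newline="" turns off newline translation, so line endings are copied unchanged.
    with io.open(src_path, "r", encoding=src_encoding, newline="") as fin, \
            io.open(dst_path, "w", encoding=dst_encoding, newline="") as fout:
        for line in fin:
            fout.write(line)


# Demo on a throwaway file with made-up content.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "a.dat")
dst = src + "_out"
with io.open(src, "wb") as f:
    f.write("café\r\nnaïve\n".encode("latin_1"))

convert_encoding(src, dst, "latin_1")

with io.open(dst, "rb") as f:
    converted = f.read()
print(converted)
```

The source encoding still has to be known in advance; reading with the wrong one will either fail or silently produce wrong characters.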

5. The io.open

Here we used io.open to read and write file content with explicit encodings. Its signature is:

io.open(file, mode='r', buffering=-1, encoding=None, errors=None, newline=None, closefd=True)

Open file and return a corresponding stream. If the file cannot be opened, an IOError is raised. In Python 3, io.open is an alias for the built-in open, and IOError is an alias of OSError.

encoding is the name of the encoding used to decode or encode the file. This should only be used in text mode. The default encoding is platform dependent (whatever locale.getpreferredencoding() returns), but any encoding supported by Python can be used. See the codecs module for the list of supported encodings.
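As a rough illustration of the encoding and errors parameters, the sketch below prints the platform default and then reads a UTF-8 file with the deliberately wrong ascii encoding under two error handlers. The file name sample.txt and the sample word are assumptions chosen for the demo.

```python
import io
import locale
import os
import tempfile

# The platform-dependent default that is used when encoding= is omitted.
default_encoding = locale.getpreferredencoding()
print(default_encoding)

# Write a small UTF-8 file to a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with io.open(path, "w", encoding="utf-8") as f:
    f.write("Księżyc")

# errors="replace" substitutes U+FFFD for every byte that cannot be decoded.
with io.open(path, "r", encoding="ascii", errors="replace") as f:
    replaced = f.read()

# errors="ignore" silently drops the undecodable bytes.
with io.open(path, "r", encoding="ascii", errors="ignore") as f:
    ignored = f.read()

print(repr(replaced))
print(repr(ignored))
```

Both handlers lose information, so they are better suited to diagnostics than to real conversions.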

It's that easy, don't you think?

10.9. File Encoding

utf-8 - an encoding of Unicode, the international standard character set (should always be used!)
iso-8859-1 - ISO standard for Western Europe and USA
iso-8859-2 - ISO standard for Central Europe (including Poland)
cp1250 or windows-1250 - Central European encoding on Windows
cp1251 or windows-1251 - Cyrillic (Eastern European) encoding on Windows
cp1252 or windows-1252 - Western European encoding on Windows
ASCII - ASCII characters only
Since Windows 10 version 1903, UTF-8 is the default encoding for Notepad!

10.9.1. Str vs Bytes

This was a big change in Python 3
In Python 2, str was bytes
In Python 3, str is a sequence of Unicode characters, and UTF-8 is the default encoding

>>> text = 'Księżyc'
>>> text
'Księżyc'

>>> text = b'Księżyc'
Traceback (most recent call last):
SyntaxError: bytes can only contain ASCII literal characters
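Since a bytes literal accepts only ASCII characters, non-ASCII data has to be written with `\x` escapes or produced by encoding a str. A small sketch of both routes (the variable names are illustrative only):

```python
# Non-ASCII bytes spelled out with \x escapes inside a bytes literal.
moon_bytes = b"Ksi\xc4\x99\xc5\xbcyc"

# The same bytes obtained by encoding a str.
same_bytes = "Księżyc".encode("utf-8")
also_same = bytes("Księżyc", "utf-8")

# Decoding goes the other way, from bytes back to str.
decoded = moon_bytes.decode("utf-8")

print(moon_bytes == same_bytes == also_same)
print(decoded)
```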

The default encoding is UTF-8. Encoding names are case insensitive. cp1250 and windows-1250 are aliases of the same codec:

>>> text = 'Księżyc'
>>>
>>> text.encode()
b'Ksi\xc4\x99\xc5\xbcyc'
>>> text.encode('utf-8')
b'Ksi\xc4\x99\xc5\xbcyc'
>>> text.encode('iso-8859-2')
b'Ksi\xea\xbfyc'
>>> text.encode('cp1250')
b'Ksi\xea\xbfyc'
>>> text.encode('windows-1250')
b'Ksi\xea\xbfyc'
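One way to check how an encoding name is resolved is `codecs.lookup`, which returns codec information whose `name` attribute is the canonical name. The spellings tried below are just sample inputs for this sketch.

```python
import codecs

# Different spellings that should resolve to the same codec.
spellings = ["cp1250", "CP1250", "windows-1250", "Windows_1250"]
canonical = {name: codecs.lookup(name).name for name in spellings}
print(canonical)

# UTF-8 also has several accepted spellings.
utf8_names = {codecs.lookup(name).name for name in ["utf-8", "UTF8", "utf_8"]}
print(utf8_names)

# An unknown name raises LookupError.
try:
    codecs.lookup("no-such-encoding")
    unknown_found = True
except LookupError:
    unknown_found = False
print(unknown_found)
```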

Note the length change while encoding:

>>> text = 'Księżyc'
>>> text
'Księżyc'
>>> len(text)
7

>>> text = 'Księżyc'.encode()
>>> text
b'Ksi\xc4\x99\xc5\xbcyc'
>>> len(text)
9

Note also that those characters produce longer output:

But despite being several "characters" long, the length is different:

Here’s the output of all Polish diacritics (accented characters) with their encoding:

>>> 'ą'.encode()
b'\xc4\x85'
>>> 'ć'.encode()
b'\xc4\x87'
>>> 'ę'.encode()
b'\xc4\x99'
>>> 'ł'.encode()
b'\xc5\x82'
>>> 'ń'.encode()
b'\xc5\x84'
>>> 'ó'.encode()
b'\xc3\xb3'
>>> 'ś'.encode()
b'\xc5\x9b'
>>> 'ż'.encode()
b'\xc5\xbc'
>>> 'ź'.encode()
b'\xc5\xba'

Note also that iterating over bytes yields integers (byte values), not characters:

>>> text = 'Księżyc'
>>>
>>> for character in text:
...     print(character)
K
s
i
ę
ż
y
c
>>>
>>> for character in text.encode():
...     print(character)
75
115
105
196
153
197
188
121
99
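The same integer behaviour shows up when indexing bytes, while slicing keeps the bytes type. A short sketch, with variable names chosen only for illustration:

```python
data = "Księżyc".encode("utf-8")

# Indexing a bytes object returns an int.
first_value = data[0]

# Slicing returns a bytes object of length one.
first_slice = data[0:1]

# Rebuild one-byte bytes objects from the integer values.
single_bytes = [bytes([value]) for value in data]

# A hexadecimal view is often easier to read than a list of integers.
hex_view = data.hex()

print(first_value, first_slice, hex_view)
```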

10.9.2. UTF-8

>>> FILE = r'/tmp/myfile.txt'
>>>
>>> with open(FILE, mode='w', encoding='utf-8') as file:
...     file.write('José Jiménez')
12
>>>
>>> with open(FILE, encoding='utf-8') as file:
...     print(file.read())
José Jiménez
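The 12 returned by file.write is the number of characters written, not the number of bytes on disk. The sketch below compares the two; it uses a temporary directory instead of /tmp/myfile.txt, which is an assumption made so the example runs anywhere.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "myfile.txt")
text = "José Jiménez"

with open(path, mode="w", encoding="utf-8") as file:
    written_chars = file.write(text)  # number of characters written

# Each é takes two bytes in UTF-8, so the file is larger than the text length.
size_in_bytes = os.path.getsize(path)

print(written_chars, size_in_bytes)
```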

10.9.3. Unicode Encode Error

>>> FILE = r'/tmp/myfile.txt'
>>>
>>> with open(FILE, mode='w', encoding='cp1250') as file:
...     file.write('José Jiménez')
12
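The write above succeeds because cp1250 happens to contain the letter é. A UnicodeEncodeError appears only when the target encoding has no byte for a character. The following sketch uses the ascii codec and the word Księżyc as sample inputs to trigger one, then shows the errors="replace" fallback.

```python
text = "Księżyc"

# ascii has no representation for ę and ż, so strict encoding fails.
try:
    text.encode("ascii")
    encode_failed = False
    error_position = None
except UnicodeEncodeError as exc:
    encode_failed = True
    error_position = exc.start  # index of the first character that failed

# errors="replace" substitutes a question mark for each unencodable character.
replaced = text.encode("ascii", errors="replace")

print(encode_failed, error_position, replaced)
```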

10.9.4. Unicode Decode Error

>>> FILE = r'/tmp/myfile.txt'
>>>
>>> with open(FILE, mode='w', encoding='utf-8') as file:
...     file.write('José Jiménez')
12
>>>
>>> with open(FILE, encoding='cp1250') as file:
...     print(file.read())
JosĂ© JimĂ©nez
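Reading UTF-8 bytes as cp1250 does not raise here, because every byte involved maps to some cp1250 character; the result is simply garbled. The reverse direction is where a UnicodeDecodeError typically shows up. A small sketch of both cases, working on bytes in memory rather than on a file:

```python
# Case 1: wrong decoder, no exception, garbled text.
utf8_bytes = "José Jiménez".encode("utf-8")
garbled = utf8_bytes.decode("cp1250")

# If no information was lost, reversing the wrong step repairs the text.
repaired = garbled.encode("cp1250").decode("utf-8")

# Case 2: cp1250 bytes are not valid UTF-8, so strict decoding fails.
cp1250_bytes = "Księżyc".encode("cp1250")
try:
    cp1250_bytes.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

print(garbled, repaired, decode_failed)
```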

10.9.5. Escape Characters

\r\n - line ending used on Windows
\n - line ending used everywhere else
More information in Builtin Printing
Learn more at https://en.wikipedia.org/wiki/List_of_Unicode_characters
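When a file is opened in text mode, Python can translate between these line endings through the newline parameter of open. The sketch below forces Windows-style endings on write and then reads the file back with and without translation; the temporary file name is an assumption for the demo.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "lines.txt")

# newline="\r\n" translates every "\n" written into "\r\n".
with open(path, "w", encoding="utf-8", newline="\r\n") as file:
    file.write("first\nsecond\n")

# Binary mode shows the bytes actually stored on disk.
with open(path, "rb") as file:
    raw = file.read()

# The default (newline=None) translates "\r\n" back to "\n" on read.
with open(path, "r", encoding="utf-8") as file:
    translated = file.read()

# newline="" leaves the line endings untouched.
with open(path, "r", encoding="utf-8", newline="") as file:
    untranslated = file.read()

print(raw, repr(translated), repr(untranslated))
```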

Frequently used escape characters:

\n - New line (ENTER)
\t - Horizontal Tab (TAB)
\' - Single quote ' (escape in single quoted strings)
\" - Double quote " (escape in double quoted strings)
\\ - Backslash \ (to indicate that this is not an escape character)

Less frequently used escape characters:

\a - Bell (BEL)
\b - Backspace (BS)
\f - New page (FF, Form Feed)
\v - Vertical Tab (VT)
\uF680 - Character with 16-bit (2 bytes) hex value F680
\U0001F680 - Character with 32-bit (4 bytes) hex value 0001F680
\101 - Character with octal value 101, the letter A (an octal escape takes up to three octal digits, with no letter o)
\x1F - Character with hex value 1F (a \x escape takes exactly two hex digits)
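To see how the different escape forms relate, the sketch below writes the letter A in several ways and compares a raw string with its escaped equivalent. The Windows-style path is a made-up value used only for illustration.

```python
# Five spellings of the same character, the letter A (code point 0x41).
same_letter = [
    "\x41",                        # two hex digits
    "\101",                        # up to three octal digits
    "\u0041",                      # four hex digits
    "\U00000041",                  # eight hex digits
    "\N{LATIN CAPITAL LETTER A}",  # Unicode character name
]

# A raw string keeps backslashes as they are, so no doubling is needed.
raw_path = r"C:\temp\new"
escaped_path = "C:\\temp\\new"

print(same_letter)
print(raw_path == escaped_path, len(raw_path))
```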

>>> a = '\U0001F9D1'  # 🧑
>>> b = '\U0000200D'  # zero-width joiner (invisible)
>>> c = '\U0001F680'  # 🚀
>>>
>>> astronaut = a + b + c
>>> print(astronaut)
🧑‍🚀
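Although the astronaut is displayed as one symbol, it is still three code points joined together. A brief sketch that inspects it with the standard unicodedata module (the fallback string "unknown" is an assumption for Python builds with an older Unicode database):

```python
import unicodedata

astronaut = "\U0001F9D1\u200D\U0001F680"

# One visible symbol, but three code points.
code_points = len(astronaut)

# In UTF-8 the two emoji take 4 bytes each and the joiner takes 3.
utf8_length = len(astronaut.encode("utf-8"))

# Official Unicode names of the individual code points.
names = [unicodedata.name(ch, "unknown") for ch in astronaut]

print(code_points, utf8_length, names)
```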