Change file encoding python

Convert file encoding using Python

In this post, I will demonstrate how to convert a text file's encoding using Python.

2. Environments

3. The requirement

  • There is a text file named a.dat whose encoding is not utf-8
  • a.dat contains lines of text
  • You want to convert the whole file to utf-8

4. The code

# coding: utf-8
import io

fname = 'a.dat'


def process_file():
    text = None
    with io.open(fname, 'r', encoding='latin_1', newline='\n') as fin:
        text = fin.read()
    with io.open(fname + '_out', 'w', encoding='utf-8', newline='\n') as fout:
        fout.write(text)


if __name__ == '__main__':
    process_file()

  • We read the file with the latin_1 encoding into the variable text
  • We write text to a new file, a.dat_out, with the utf-8 encoding
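
The two steps above can be wrapped in a reusable function. This is only a minimal sketch: the helper name convert_encoding, its parameters, and the sample file content are assumptions made for illustration and do not come from the original post.

```python
import io
import os
import tempfile


def convert_encoding(src, dst, src_encoding, dst_encoding='utf-8'):
    """Re-encode the text file src into dst (hypothetical helper)."""
    # As in the code above, newline='\n' switches off newline translation.
    with io.open(src, 'r', encoding=src_encoding, newline='\n') as fin:
        text = fin.read()
    with io.open(dst, 'w', encoding=dst_encoding, newline='\n') as fout:
        fout.write(text)


# Demonstration on a throwaway file so that no real data is touched.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, 'a.dat')
dst = src + '_out'

with open(src, 'wb') as f:
    f.write(b'caf\xe9\n')  # "café" stored as Latin-1 bytes

convert_encoding(src, dst, 'latin_1')

with open(dst, 'rb') as f:
    converted = f.read()

print(converted)  # the same text, now stored as UTF-8 bytes
```

Reading the whole file into memory is fine for small files; for very large files it would be better to copy it in chunks or line by line.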

5. The io.open

Here we used io.open to read and write the file content with explicit encodings. Its signature is:

io.open(file, mode='r', buffering=-1, encoding=None, errors=None, newline=None, closefd=True)

Open file and return a corresponding stream. If the file cannot be opened, an IOError is raised (in Python 3, IOError is an alias of OSError).

encoding is the name of the encoding used to decode or encode the file. This should only be used in text mode. The default encoding is platform dependent (whatever locale.getpreferredencoding() returns), but any encoding supported by Python can be used. See the codecs module for the list of supported encodings.
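
As a small illustration of the parameters described above, the sketch below prints the platform default and shows what the errors argument from the signature does when the chosen encoding does not match the file. The file name and its content are made up for the example.

```python
import io
import locale
import os
import tempfile

# The platform-dependent default that is used when encoding=None.
default_encoding = locale.getpreferredencoding()
print(default_encoding)

# A throwaway file holding one Latin-1 byte that is not valid UTF-8.
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(path, 'wb') as f:
    f.write(b'caf\xe9!')

# errors='replace' substitutes U+FFFD for undecodable bytes
# instead of raising UnicodeDecodeError.
with io.open(path, 'r', encoding='utf-8', errors='replace') as fin:
    text = fin.read()

print(repr(text))
```

Passing the encoding explicitly, as the conversion script does, avoids depending on whatever the platform default happens to be.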

It's that easy, don't you think?


10.9. File Encoding

  • utf-8 — the most common encoding of Unicode, the international standard (should always be used!)
  • iso-8859-1 — ISO standard for Western Europe and USA
  • iso-8859-2 — ISO standard for Central Europe (including Poland)
  • cp1250 or windows-1250 — Central European encoding on Windows
  • cp1251 or windows-1251 — Cyrillic (Eastern European) encoding on Windows
  • cp1252 or windows-1252 — Western European encoding on Windows
  • ASCII — ASCII characters only
  • Since Windows 10 version 1903, UTF-8 is the default encoding for Notepad!


10.9.1. Str vs Bytes

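  • The split between str and bytes was a big change in Python 3
  • In Python 2, str was bytes
  • In Python 3, str is Unicode text (a sequence of code points); UTF-8 is only the default encoding used when converting it to bytes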
>>> text = 'Księżyc'
>>> text
'Księżyc'

>>> text = b'Księżyc'
Traceback (most recent call last):
SyntaxError: bytes can only contain ASCII literal characters
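
To get from one type to the other, str.encode() and bytes.decode() are used. The following sketch (variable names are illustrative only) shows the round trip, and also that non-ASCII bytes can still be written in a bytes literal by using escapes:

```python
text = 'Księżyc'

# str -> bytes
data = text.encode('utf-8')
print(type(text).__name__, type(data).__name__)

# bytes -> str
restored = data.decode('utf-8')

# A bytes literal may contain non-ASCII values only as \x escapes.
literal = b'Ksi\xc4\x99\xc5\xbcyc'

print(restored == text, literal == data)
```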

The default encoding is UTF-8. Encoding names are case insensitive. cp1250 and windows-1250 are aliases for the same codec:

>>> text = 'Księżyc'
>>>
>>> text.encode()
b'Ksi\xc4\x99\xc5\xbcyc'
>>> text.encode('utf-8')
b'Ksi\xc4\x99\xc5\xbcyc'
>>> text.encode('iso-8859-2')
b'Ksi\xea\xbfyc'
>>> text.encode('cp1250')
b'Ksi\xea\xbfyc'
>>> text.encode('windows-1250')
b'Ksi\xea\xbfyc'
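
One way to check that two names refer to the same codec is codecs.lookup() from the standard library, which resolves aliases and ignores case. A short sketch (the variable names are only for illustration):

```python
import codecs

cp = codecs.lookup('cp1250')
win = codecs.lookup('windows-1250')
upper = codecs.lookup('WINDOWS-1250')

# All three spellings resolve to one canonical codec name.
same_codec = cp.name == win.name == upper.name
print(cp.name, same_codec)
```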

Note the length change while encoding:

>>> text = 'Księżyc'
>>> text
'Księżyc'
>>> len(text)
7

>>> text = 'Księżyc'.encode()
>>> text
b'Ksi\xc4\x99\xc5\xbcyc'
>>> len(text)
9

Note also that non-ASCII characters produce longer output when encoded:

But although the printed output is several "characters" long, the length of the bytes object is different:
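
The examples for the two notes above did not survive in this copy. The following is a sketch of what they presumably illustrate, using the letter ó from the table below as an assumed example:

```python
encoded = 'ó'.encode()

# What the interpreter prints for the bytes object.
shown = repr(encoded)
print(shown)

# The printed form takes many characters on screen...
print(len(shown))

# ...but the bytes object itself holds only two bytes,
# and the original string is a single character.
print(len(encoded))
print(len('ó'))
```

The \xc3 in the printed form is just a way of displaying one byte; it is not four separate characters in the data.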

Here’s the output of all Polish diacritics (accented characters) with their encoding:

>>> 'ą'.encode()
b'\xc4\x85'
>>> 'ć'.encode()
b'\xc4\x87'
>>> 'ę'.encode()
b'\xc4\x99'
>>> 'ł'.encode()
b'\xc5\x82'
>>> 'ń'.encode()
b'\xc5\x84'
>>> 'ó'.encode()
b'\xc3\xb3'
>>> 'ś'.encode()
b'\xc5\x9b'
>>> 'ż'.encode()
b'\xc5\xbc'
>>> 'ź'.encode()
b'\xc5\xba'

Note also that iterating over bytes behaves differently: it yields integers, not characters:

>>> text = 'Księżyc'
>>>
>>> for character in text:
...     print(character)
K
s
i
ę
ż
y
c
>>>
>>> for character in text.encode():
...     print(character)
75
115
105
196
153
197
188
121
99
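
The same distinction shows up when indexing. The sketch below (names are illustrative) shows that indexing bytes gives an integer while slicing gives bytes, and how the integers can be displayed as hex or turned back into a bytes object:

```python
data = 'Księżyc'.encode()

first = data[0]    # indexing returns an int
head = data[0:1]   # slicing returns bytes

# Show every byte as two hex digits, matching the \x.. escapes above.
hex_values = [format(byte, '02x') for byte in data]

# A list of ints in range(256) can be converted back to bytes.
rebuilt = bytes(list(data))

print(first, head, hex_values)
```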

10.9.2. UTF-8

>>> FILE = r'/tmp/myfile.txt'
>>>
>>> with open(FILE, mode='w', encoding='utf-8') as file:
...     file.write('José Jiménez')
12
>>>
>>> with open(FILE, encoding='utf-8') as file:
...     print(file.read())
José Jiménez


10.9.3. Unicode Encode Error

>>> FILE = r'/tmp/myfile.txt'
>>>
>>> with open(FILE, mode='w', encoding='cp1250') as file:
...     file.write('José Jiménez')
12
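
The write above succeeds because é exists in cp1250. A UnicodeEncodeError is raised only when the text contains a character the target encoding cannot represent. A minimal sketch, assuming the Polish word from the earlier examples and the Western European iso-8859-1 codec:

```python
text = 'Księżyc'
failed = None

try:
    # iso-8859-1 has no ę or ż, so this raises.
    text.encode('iso-8859-1')
except UnicodeEncodeError as error:
    # The exception records which part of the string could not be encoded.
    failed = error.object[error.start:error.end]
    print(error.encoding, repr(failed))

# errors='replace' trades the exception for question marks.
replaced = text.encode('iso-8859-1', errors='replace')
print(replaced)
```

The same exception would come from file.write() when the file is opened with an encoding that lacks the character.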

10.9.4. Unicode Decode Error

>>> FILE = r'/tmp/myfile.txt'
>>>
>>> with open(FILE, mode='w', encoding='utf-8') as file:
...     file.write('José Jiménez')
12
>>>
>>> with open(FILE, encoding='cp1250') as file:
...     print(file.read())
JosĂ© JimĂ©nez
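
Reading UTF-8 data as cp1250 does not raise, because every byte happens to map to some cp1250 character; the result is silently garbled text. The opposite direction typically fails loudly. A sketch of that case, reusing the word from the earlier examples:

```python
# Bytes written with a legacy Central European encoding.
data = 'Księżyc'.encode('cp1250')

try:
    data.decode('utf-8')
    raised = False
except UnicodeDecodeError as error:
    # The byte sequence is not valid UTF-8.
    raised = True
    print(error.reason)

# Decoding with the encoding that was actually used recovers the text.
correct = data.decode('cp1250')

# errors='replace' keeps going and marks the bad bytes with U+FFFD.
lenient = data.decode('utf-8', errors='replace')
print(correct, repr(lenient))
```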

10.9.5. Escape Characters

  • \r\n — line ending used on Windows
  • \n — line ending used everywhere else
  • More information in Builtin Printing
  • Learn more at https://en.wikipedia.org/wiki/List_of_Unicode_characters
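
How Python treats the two line endings when reading depends on the newline argument of open(). The following sketch uses a temporary file with made-up content to show the default translation and how to turn it off:

```python
import os
import tempfile

# A throwaway file with Windows line endings.
path = os.path.join(tempfile.mkdtemp(), 'lines.txt')
with open(path, 'wb') as f:
    f.write(b'first\r\nsecond\r\n')

# Default (newline=None): \r\n is translated to \n while reading.
with open(path, encoding='utf-8') as f:
    translated = f.read()

# newline='': line endings are returned exactly as stored.
with open(path, encoding='utf-8', newline='') as f:
    raw = f.read()

print(repr(translated))
print(repr(raw))
```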


Frequently used escape characters:

  • \n — New line (ENTER)
  • \t — Horizontal Tab (TAB)
  • \' — Single quote ' (escape in single quoted strings)
  • \" — Double quote " (escape in double quoted strings)
  • \\ — Backslash \ (to indicate, that this is not escape char)

Less frequently used escape characters:

  • \a — Bell (BEL)
  • \b — Backspace (BS)
  • \f — New page (FF — Form Feed)
  • \v — Vertical Tab (VT)
  • \uF680 — Character with 16-bit (2 bytes) hex value F680
  • \U0001F680 — Character with 32-bit (4 bytes) hex value 0001F680
  • \ooo — Character with octal value ooo (up to three octal digits, for example \101 is A)
  • \xhh — Character with hex value hh (exactly two hex digits, for example \x41 is A)
>>> a = '\U0001F9D1'  # 🧑
>>> b = '\U0000200D'  # ''
>>> c = '\U0001F680'  # 🚀
>>>
>>> astronaut = a + b + c
>>> print(astronaut)
🧑‍🚀
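
The numeric escapes are easy to get wrong, so here is a short sketch that checks how Python actually reads them. It uses the standard unicodedata module to name the characters; the variable names are only for illustration.

```python
import unicodedata

# \x takes exactly two hex digits and octal escapes take up to three digits.
letter_hex = '\x41'
letter_octal = '\101'

# Extra digits after \x are ordinary characters: this is \x1F followed by '680'.
not_a_rocket = '\x1F680'

# Code points above FFFF need the eight-digit \U form.
rocket = '\U0001F680'

# The emoji sequence is three code points joined by an invisible character.
astronaut = '\U0001F9D1\u200D\U0001F680'
joiner_name = unicodedata.name('\u200D')

print(letter_hex, letter_octal, len(not_a_rocket), len(rocket))
print(joiner_name, len(astronaut), len(astronaut.encode('utf-8')))
```

Although the astronaut is displayed as a single symbol, len() counts its three code points, and the UTF-8 form takes eleven bytes.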
