UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 2: ordinal not in range(128)
I am attempting to work with a very large dataset that has some non-standard characters in it. I need to use unicode, as per the job specs, but I am baffled. (And quite possibly doing it all wrong.) I open the CSV using:
ncesReader = csv.reader(open('geocoded_output.csv', 'rb'), delimiter='\t', quotechar='"')
name=school_name.encode('utf-8'), street=row[9].encode('utf-8'), city=row[10].encode('utf-8'), state=row[11].encode('utf-8'), zip5=row[12], zip4=row[13],county=row[25].encode('utf-8'), lat=row[22], lng=row[23])
I'm encoding everything except the lat and lng because those need to be sent out to an API. When I run the program to parse the dataset into what I can use, I get the following traceback:
Traceback (most recent call last):
  File "push_into_db.py", line 80, in <module>
    main()
  File "push_into_db.py", line 74, in main
    district_map = buildDistrictSchoolMap()
  File "push_into_db.py", line 32, in buildDistrictSchoolMap
    county=row[25].encode('utf-8'), lat=row[22], lng=row[23])
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 2: ordinal not in range(128)
I think I should tell you that I'm using Python 2.7.2, and this is part of an app built on Django 1.4. I've read several posts on this topic, but none of them seem to directly apply. Any help will be greatly appreciated. You might also want to know that some of the non-standard characters causing the issue are Ñ and possibly É.
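(A sketch of what's likely going wrong here, in Python 2: calling .encode('utf-8') on a byte str makes Python decode it with the default ASCII codec first. The byte 0xd1 is 'Ñ' in Latin-1, so Latin-1 is one plausible encoding for this file, but that's an assumption to verify against the actual dataset:)

raw = 'Ni\xd1o'                # a byte str from the CSV; 0xd1 is not ASCII
# raw.encode('utf-8')          # implicitly does raw.decode('ascii') first -> UnicodeDecodeError

name = raw.decode('latin-1')   # decode from the file's real encoding first (Latin-1 assumed)
utf8 = name.encode('utf-8')    # now succeeds: 'Ni\xc3\x91o'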
UnicodeDecodeError: 'ascii' codec can't decode byte in Python
I’ve got a very peculiar encoding problem. I’ve looked at plenty of questions about this error with no actual answers. I am aware of Unicode issues in Python, so I start every file with:
# -*- coding: utf-8 -*-
g = " "
s = "2 000€"
if g in s:
    print s
if gap not in tokenString:
The tokenString string contains Unicode. The funny thing is that if I try to print it just before that line, it prints without an error. What could be the cause of that? I feel like I'm missing something and I don't understand what. EDIT: gap is of type unicode and tokenString of type str.
Please include the full traceback. Are you printing Unicode data to the Windows console or a Unix terminal? Then see wiki.python.org/moin/PrintFails.
1 Answer
You haven’t given us enough information to solve your problem for sure, but I can make a guess:
If gap is a str, and tokenString is a unicode, this line:
if gap not in tokenString:
… will try to convert gap to unicode to do the search. But if gap has any non-ASCII characters—e.g., because it’s a Unicode string encoded into UTF-8—this conversion will fail.
>>> if 'é' in u'a':
...     print 'Yes'
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
You will get the same problem if gap is a unicode and tokenString is a str holding non-ASCII:
>>> if u'a' in 'é':
...     print 'Yes'
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
And you'll also get the same problem, or similar ones, with various other mixed-type operator and method calls (e.g., u'a'.find('é')).
The solution is to use the same type on both sides of the in . For example:
>>> if 'é'.decode('utf-8') in u'a':
...     print 'Yes'
The larger solution is to always use one type or the other everywhere within your code. Of course at the boundaries you can't do that (e.g., if you're using unicode everywhere, but then you want to write to an 8-bit file), so you need to explicitly call decode and encode at those boundaries. But even then, you can usually wrap that up (e.g., with codecs.open, or with a custom file-writing function, or whatever), so that all of your visible code is Unicode, full stop.
Or, of course, you can use Python 3, which will immediately catch you trying to compare byte strings and Unicode strings and raise a TypeError , instead of trying to decode the bytes from ASCII and either misleadingly working or giving you a more confusing error…
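For example, on Python 3 the same mixed-type test fails immediately (error text as produced by recent CPython 3.x):

>>> 'é' in b'caf\xc3\xa9'
TypeError: a bytes-like object is required, not 'str'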
How to fix: "UnicodeDecodeError: 'ascii' codec can't decode byte"
How do I fix it? In some other Python-based static blog apps, Chinese posts can be published successfully, such as this app: http://github.com/vrypan/bucket3. On my site http://bc3.brite.biz/, Chinese posts can be published successfully.
20 Answers
tl;dr / quick fix
- Don’t decode/encode willy nilly
- Don’t assume your strings are UTF-8 encoded
- Try to convert strings to Unicode strings as soon as possible in your code
- Fix your locale: How to solve UnicodeDecodeError in Python 3.6?
- Don’t be tempted to use quick reload hacks
Unicode Zen in Python 2.x — The Long Version
Without seeing the source it’s difficult to know the root cause, so I’ll have to speak generally.
UnicodeDecodeError: 'ascii' codec can't decode byte generally happens when you try to convert a Python 2.x str that contains non-ASCII to a Unicode string without specifying the encoding of the original string.
In brief, Unicode strings are an entirely separate type of Python string that does not carry any encoding. They hold only Unicode code points and can therefore represent any character from across the entire Unicode spectrum. Strings contain encoded text, be it UTF-8, UTF-16, ISO-8859-1, GBK, Big5, etc. Strings are decoded to Unicode, and Unicode is encoded to strings. Files and text data are always transferred in encoded strings.
The Markdown module authors probably use unicode() (where the exception is thrown) as a quality gate to the rest of the code — it converts ASCII or re-wraps existing Unicode strings into a new Unicode string. The Markdown authors can't know the encoding of the incoming string, so they rely on you to decode strings to Unicode before passing them to Markdown.
Unicode strings can be declared in your code using the u prefix to strings. E.g.
>>> my_u = u'my ünicôdé strįng'
>>> type(my_u)
<type 'unicode'>
Unicode strings may also come from files, databases and network modules. When this happens, you don't need to worry about the encoding.
Gotchas
Conversion from str to Unicode can happen even when you don’t explicitly call unicode() .
The following scenarios cause UnicodeDecodeError exceptions:
# Explicit conversion without encoding
unicode('€')

# New style format string into Unicode string
# Python will try to convert value string to Unicode first
u"The currency is: {}".format('€')

# Old style format string into Unicode string
# Python will try to convert value string to Unicode first
u'The currency is: %s' % '€'

# Append string to Unicode
# Python will try to convert string to Unicode first
u'The currency is: ' + '€'
Examples
Consider how the word café can be encoded as either "UTF-8" or "Cp1252", depending on the terminal type. In both cases, caf is just regular ASCII. In UTF-8, é is encoded using two bytes. In "Cp1252", é is the single byte 0xE9 (which also happens to be the Unicode code point value; that's no coincidence). When the correct decode() is invoked, conversion to a Python Unicode string is successful:
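A minimal sketch of that in Python 2 (byte values written as escapes so the snippet doesn't depend on the source file's own encoding):

# the same word, encoded two different ways
utf8_bytes = 'caf\xc3\xa9'     # "café" as UTF-8: é is 0xC3 0xA9
cp1252_bytes = 'caf\xe9'       # "café" as Cp1252: é is 0xE9

# decoding each with its correct encoding yields the same unicode object
assert utf8_bytes.decode('utf-8') == cp1252_bytes.decode('cp1252') == u'caf\xe9'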
If decode() is instead called with ascii (which is the same as calling unicode() without an encoding given), then because ASCII can't contain bytes greater than 0x7F, a UnicodeDecodeError exception is thrown. For example:
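>>> 'caf\xc3\xa9'.decode('ascii')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)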
The Unicode Sandwich
It’s good practice to form a Unicode sandwich in your code, where you decode all incoming data to Unicode strings, work with Unicodes, then encode to str s on the way out. This saves you from worrying about the encoding of strings in the middle of your code.
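A minimal sketch of the whole sandwich, assuming UTF-8 files (the file names here are hypothetical):

import io

# bread: decode bytes to Unicode at the input boundary
with io.open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()              # unicode

# filling: work purely in Unicode in the middle
shouted = text.upper()

# bread: encode back to bytes at the output boundary
with io.open('output.txt', 'w', encoding='utf-8') as f:
    f.write(shouted)             # encoded to UTF-8 on the fly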
Input / Decode
Source code
If you need to bake non-ASCII into your source code, just create Unicode strings by prefixing the string with a u . E.g.
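my_u = u'Déjà vu'    # the u prefix makes this a unicode object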
To allow Python to decode your source code, you will need to add an encoding header to match the actual encoding of your file. For example, if your file was encoded as ‘UTF-8’, you would use:
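# -*- coding: utf-8 -*-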
This is only necessary when you have non-ASCII in your source code.
Files
Usually non-ASCII data is received from a file. The io module provides a TextIOWrapper (via io.open()) that decodes your file on the fly, using a given encoding. You must use the correct encoding for the file — it can't be easily guessed. For example, for a UTF-8 file:
import io

with io.open("my_utf8_file.txt", "r", encoding="utf-8") as my_file:
    my_unicode_string = my_file.read()
my_unicode_string would then be suitable for passing to Markdown. If you get a UnicodeDecodeError from the read() line, then you've probably used the wrong encoding value.
CSV Files
The Python 2.7 CSV module does not support non-ASCII characters 😩. Help is at hand, however, with https://pypi.python.org/pypi/backports.csv.
Use it like above but pass the opened file to it:
from backports import csv
import io

def read_rows():
    with io.open("my_utf8_file.txt", "r", encoding="utf-8") as my_file:
        for row in csv.reader(my_file):
            yield row
Databases
Most Python database drivers can return data in Unicode, but usually require a little configuration. Always use Unicode strings for SQL queries.
For MySQL (MySQLdb), add to the connection arguments:

charset='utf8', use_unicode=True

E.g.

>>> db = MySQLdb.connect(host="localhost", user='root', passwd='passwd', db='sandbox', use_unicode=True, charset="utf8")

For PostgreSQL (psycopg2), register the Unicode type handlers:

psycopg2.extensions.register_type(psycopg2.extensions.UNICODE)
psycopg2.extensions.register_type(psycopg2.extensions.UNICODEARRAY)
HTTP
Web pages can be encoded in just about any encoding. The Content-type header should contain a charset field to hint at the encoding. The content can then be decoded manually against this value. Alternatively, Python-Requests returns Unicodes in response.text .
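A short sketch with the Requests library (the URL is a placeholder):

import requests

r = requests.get('http://example.com')
print r.encoding      # the charset taken from the Content-Type header (or guessed if missing)
print type(r.text)    # <type 'unicode'> on Python 2: the body is already decoded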
Manually
If you must decode strings manually, you can simply do my_string.decode(encoding) , where encoding is the appropriate encoding. Python 2.x supported codecs are given here: Standard Encodings. Again, if you get UnicodeDecodeError then you’ve probably got the wrong encoding.
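For example (the second call fails because those bytes aren't valid UTF-8; error text as produced by CPython 2.7):

>>> 'caf\xc3\xa9'.decode('utf-8')
u'caf\xe9'
>>> 'caf\xe9'.decode('utf-8')
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 3: unexpected end of data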
The meat of the sandwich
Work with Unicodes as you would normal strs.
Output
stdout / printing
print writes through the stdout stream. Python tries to configure an encoder on stdout so that Unicode strings are encoded to the console's encoding. For example, if a Linux shell's locale is en_GB.UTF-8, the output will be encoded to UTF-8. On Windows, you will be limited to an 8-bit code page.

An incorrectly configured console, such as a corrupt locale, can lead to unexpected print errors. The PYTHONIOENCODING environment variable can force the encoding for stdout.
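You can check which encoding print will use:

import sys
print sys.stdout.encoding    # e.g. UTF-8 under an en_GB.UTF-8 locale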
Files
Just like input, io.open can be used to transparently convert Unicodes to encoded byte strings.
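Mirroring the input example:

import io

with io.open("my_utf8_file.txt", "w", encoding="utf-8") as my_file:
    my_file.write(u'caf\xe9')    # encoded to UTF-8 as it's written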
Database
The same configuration for reading will allow Unicodes to be written directly.
Python 3
Python 3 is no more Unicode capable than Python 2.x, but it is slightly less confused on the topic. E.g. the regular str is now a Unicode string and the old str is now bytes.
The default encoding is UTF-8, so if you .decode() a byte string without giving an encoding, Python 3 uses UTF-8 encoding. This probably fixes 50% of people’s Unicode problems.
Further, open() operates in text mode by default, so returns decoded str (Unicode ones). The encoding is derived from your locale, which tends to be UTF-8 on Un*x systems or an 8-bit code page, such as windows-1251, on Windows boxes.
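For example, on Python 3:

>>> b'caf\xc3\xa9'.decode()    # no encoding given, so UTF-8 is assumed
'café'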
Why you shouldn't use sys.setdefaultencoding('utf8')
It's a nasty hack (there's a reason you have to use reload(sys)) that will only mask problems and hinder your migration to Python 3.x. Understand the problem, fix the root cause, and enjoy Unicode zen. See Why should we NOT use sys.setdefaultencoding("utf-8") in a py script? for further details.