Python encode non ascii characters

gornostal / Unicode.md

I know I'm about five years late with this article, but people are still using Python 2.x, so I think the subject is still relevant.

  • Unicode is an international encoding standard for use with different languages and scripts
  • In Python 2.x, there are two types that deal with text.
    1. str is an 8-bit string.
    2. unicode is for strings of unicode code points.
      A code point is a number that maps to a particular abstract character. It is written using the notation U+12ca to mean the character with value 0x12ca (4810 decimal)
  • Encoding (noun) is a map of Unicode code points to a sequence of bytes. (Synonyms: character encoding, character set, codeset). Popular encodings: UTF-8, ASCII, Latin-1, etc.
  • Encoding (verb) is the process of converting unicode to the bytes of a str, and decoding is the reverse operation.
  • Python 2.x uses ASCII as a default encoding. (More about this later)
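The vocabulary above can be tried out in a short sketch. It is written for Python 3, where `str` plays the role of Python 2's `unicode` and `bytes` the role of Python 2's `str`; the variable names are illustrative only.

```python
# Python 3 sketch: str holds code points, bytes holds encoded data.
text = "caf\u00e9"                          # 'café', four code points

code_points = [ord(ch) for ch in text]      # numeric value of each code point
encoded = text.encode("utf-8")              # encode: text -> bytes
decoded = encoded.decode("utf-8")           # decode: bytes -> text

print(["U+%04X" % cp for cp in code_points])
print(encoded)
print(decoded == text)
print(ord("\u12ca"))                        # decimal value of code point U+12CA
```

Note that the four code points become five bytes in UTF-8, because U+00E9 needs two bytes.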

SyntaxError: Non-ASCII character

When you see something like this

SyntaxError: Non-ASCII character '\xd0' in file /tmp/p.py on line 2, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details 

you just need to declare the encoding in the first or second line of your file. All you need is the string coding=utf8 or coding: utf8 somewhere in a comment. Python doesn't care what goes before or after that string, so the Emacs-style form # -*- coding: utf-8 -*- works fine too.


Notice the dash in utf-8. Python has many aliases for UTF-8 encoding, so you should not worry about dashes or case sensitivity.
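As a rough illustration, the sketch below applies the regular expression given in PEP 263 to a few comment lines and then asks the `codecs` module for the canonical name of each declared encoding. The sample lines are made up for this example.

```python
import codecs
import re

# Pattern quoted from PEP 263 for recognising an encoding declaration.
CODING_RE = re.compile(r"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")

# Hypothetical first lines of three source files.
lines = [
    "# coding=utf8",
    "# -*- coding: utf-8 -*-",
    "# vim: set fileencoding=UTF-8 :",
]

declared = [CODING_RE.match(line).group(1) for line in lines]
normalized = [codecs.lookup(name).name for name in declared]

print(declared)      # the spelling used in each comment
print(normalized)    # the codec each spelling resolves to
```

All three spellings resolve to the same codec, which is why dashes and case do not matter.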

    >>> str(u'café')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 3: ordinal not in range(128)

The str() function encodes a string. We passed it a unicode string, and it tried to encode it using the default encoding, which is ASCII. Now the error makes sense, because ASCII is a 7-bit encoding that cannot represent characters outside the range 0..127.
Here we called str() explicitly, but something in your code may call it implicitly, and you will get the same UnicodeEncodeError.

How to fix: encode the unicode string manually using .encode('utf8') before passing it to str()
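The same failure and fix can be reproduced in Python 3 by encoding explicitly. This is only a sketch: Python 3 has no implicit ASCII step in `str()`, so the ASCII codec is requested by hand here to trigger the error.

```python
text = "caf\u00e9"                    # 'café'

try:
    text.encode("ascii")              # ASCII has no byte for U+00E9
    ascii_failed = False
except UnicodeEncodeError as exc:
    ascii_failed = True
    print(exc)

utf8_bytes = text.encode("utf8")      # an explicit UTF-8 encode succeeds
print(utf8_bytes)
```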

    >>> utf_string = u'café'
    >>> byte_string = utf_string.encode('utf8')
    >>> unicode(byte_string)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)

Let’s say we somehow obtained a byte string byte_string which contains encoded UTF-8 characters. We could get this by simply using a library that returns str type.
Then we passed the string to a function that converts it to unicode . In this example we explicitly call unicode() , but some functions may call it implicitly and you’ll get the same error.
Again, Python uses ASCII by default, so it tries to decode the bytes as ASCII. Since the byte 0xc3 (195 decimal) is outside the ASCII range, it fails with UnicodeDecodeError.

How to fix: decode the str manually using .decode('utf8') before passing it to your function.
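A Python 3 sketch of the same situation, with `bytes` standing in for the Python 2 `str`. The ASCII decode is requested explicitly to reproduce the error, and the variable names are illustrative.

```python
byte_string = b"caf\xc3\xa9"          # UTF-8 bytes for 'café'

try:
    byte_string.decode("ascii")       # 0xc3 is not a valid ASCII byte
    failed_at = None
except UnicodeDecodeError as exc:
    failed_at = exc.start             # index of the first offending byte
    print(exc)

text = byte_string.decode("utf8")     # decoding with the right codec works
print(text)
```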

Make sure your code works only with Unicode strings internally, decoding str on input and encoding to a particular encoding on output. Learn the libraries you are using, and find the places where they return str. Decode each str as soon as it is returned, before the value is passed further into your code.

I use this helper function in my code:

    def force_to_unicode(text):
        "If text is unicode, it is returned as is. If it's str, convert it to Unicode using UTF-8 encoding"
        return text if isinstance(text, unicode) else text.decode('utf8')
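For readers on Python 3, a comparable helper might look like the sketch below. The name `force_to_text` and its `encoding` parameter are hypothetical and not part of the original article. It decodes at the input boundary so the rest of the code only sees text.

```python
def force_to_text(value, encoding="utf-8"):
    """Return value unchanged if it is already str; decode it if it is bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Values as they might arrive from different libraries: some bytes, some str.
incoming = [b"caf\xc3\xa9", "caf\u00e9"]
normalized_text = [force_to_text(item) for item in incoming]
print(normalized_text)
```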


Python Encode Unicode and non-ASCII characters into JSON

This article explains how to work with Unicode and non-ASCII characters in Python when generating and parsing JSON data. It covers the following topics:

  1. How to encode Unicode and non-ASCII characters into JSON in Python.
  2. How to save non-ASCII or Unicode data as-is, without converting it to a \u escape sequence, in JSON.
  3. How to serialize Unicode data and write it into a file.
  4. How to serialize Unicode objects into UTF-8 JSON strings, instead of \u escape sequences.
  5. How to escape non-ASCII characters while encoding them into JSON in Python.

What is a UTF-8 Character?

Unicode is a standardized encoding system that represents most of the world’s written languages. It includes characters from many different scripts, such as Latin, Greek, and Chinese, and is capable of representing a wide range of characters and symbols. Non-ASCII characters are characters that are not part of the ASCII (American Standard Code for Information Interchange) character set, which consists of only 128 characters.

UTF-8 is a character encoding that represents each Unicode code point using one to four bytes. It is the most widely used character encoding for the Web and is supported by all modern web browsers and most other applications. UTF-8 is also backward-compatible with ASCII, so any ASCII text is also a valid UTF-8 text.
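The byte counts can be checked directly. The sketch below encodes four sample characters, chosen for this example, that need one, two, three and four bytes, and then confirms that plain ASCII text produces identical bytes under both encodings.

```python
# One sample character per UTF-8 length: A, ø, 明, and an emoji.
samples = ["A", "\u00f8", "\u660e", "\U0001F600"]

# Map each character to the number of bytes UTF-8 uses for it.
lengths = {ch: len(ch.encode("utf-8")) for ch in samples}
print(lengths)

# ASCII text encodes to the same bytes in ASCII and in UTF-8.
ascii_text = "plain ASCII"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")
print(same_bytes)
```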

What is JSON?

The JSON module is a built-in module in Python that provides support for working with JSON (JavaScript Object Notation) data. It provides methods for encoding and decoding JSON objects, as well as for working with the data structures that represent them. The json.dumps() method is a method of the JSON module that serializes an object (e.g. a Python dictionary or list) to a JSON-formatted string. This string can then be saved to a file, sent over a network connection, or used in any other way that requires the data to be represented as a string.

Here is how you could use the json.dumps() method to encode a Python dictionary as a JSON string.
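A minimal sketch of such a call follows. The dictionary contents are invented for illustration and are not taken from the original article.

```python
import json

# Illustrative data only.
person = {"name": "Alice", "languages": ["Python", "Go"], "active": True}

json_text = json.dumps(person)        # Python dict -> JSON string
print(json_text)

restored = json.loads(json_text)      # JSON string -> Python dict
print(restored == person)
```

Note that the Python value `True` is written as the JSON literal `true`.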


Python Encode Unicode and non-ASCII characters as-is into JSON

In this article, we will address the following frequently asked questions about working with Unicode JSON data in Python.

  • How to serialize Unicode or non-ASCII data into JSON as-is strings instead of \u escape sequence (Example, Store Unicode string ø as-is instead of \u00f8 in JSON)
  • Encode Unicode data in utf-8 format.
  • How to serialize all incoming non-ASCII characters escaped (Example, Store Unicode string ø as \u00f8 in JSON)


RFC 7159 requires that JSON text be encoded in UTF-8, UTF-16, or UTF-32, with UTF-8 being the recommended default for maximum interoperability.

The ensure_ascii parameter

Python's built-in json module provides the json.dump() and json.dumps() methods to encode Python objects into JSON data.

Both json.dump() and json.dumps() take an ensure_ascii parameter. It defaults to True, so all non-ASCII characters in the output are guaranteed to be escaped. With ensure_ascii=False, these characters are output as-is.

In Python 3, json.dumps() always returns a str, never bytes. With the default ensure_ascii=True that str contains only ASCII characters, which is possible because JSON allows any character to be written as a \uXXXX escape.

  • With ensure_ascii=True, Unicode characters are represented in a safe way: the resulting JSON contains only ASCII characters, and every non-ASCII character is written as a \u escape sequence.
  • With ensure_ascii=False, the resulting JSON stores Unicode characters as-is instead of as \u escape sequences.

Save non-ASCII or Unicode data as-is not as \u escape sequence in JSON

In this example, we will try to encode the Unicode Data into JSON. This solution is useful when you want to dump Unicode characters as characters instead of escape sequences.

Set ensure_ascii=False in json.dumps() to encode Unicode as-is into JSON

    import json

    unicodeData = {
        "string1": "明彦",
        "string2": u"\u00f8"
    }
    print("unicode Data is ", unicodeData)

    encodedUnicode = json.dumps(unicodeData, ensure_ascii=False)  # use dump() method to write it in file
    print("JSON character encoding by setting ensure_ascii=False", encodedUnicode)

    print("Decoding JSON", json.loads(encodedUnicode))

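Output:

    unicode Data is  {'string1': '明彦', 'string2': 'ø'}
    JSON character encoding by setting ensure_ascii=False {"string1": "明彦", "string2": "ø"}
    Decoding JSON {'string1': '明彦', 'string2': 'ø'}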

Note: This example is useful to store the Unicode string as-is in JSON.

JSON Serialize Unicode Data and Write it into a file.

In the above example, we saw how to save non-ASCII or Unicode data as-is, not as \u escape sequences, in JSON. Now let's see how to write JSON-serialized Unicode data as-is into a file.

    import json

    sampleDict = {
        "string1": "明彦",
        "string2": u"\u00f8"
    }

    # Write the data as UTF-8 without escaping non-ASCII characters
    with open("unicodeFile.json", "w", encoding='utf-8') as write_file:
        json.dump(sampleDict, write_file, ensure_ascii=False)
    print("Done writing JSON serialized Unicode Data as-is into file")

    # Read the file back with the same encoding
    with open("unicodeFile.json", "r", encoding='utf-8') as read_file:
        print("Reading JSON serialized Unicode data from file")
        sampleData = json.load(read_file)
    print("Decoded JSON serialized Unicode data")
    print(sampleData["string1"], sampleData["string2"])

Output:

    Done writing JSON serialized Unicode Data as-is into file
    Reading JSON serialized Unicode data from file
    Decoded JSON serialized Unicode data
    明彦 ø

JSON file after writing Unicode data as-is
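To see what actually lands on disk, the sketch below writes the same data twice to temporary files, once with each `ensure_ascii` setting, and reads the raw bytes back. The helper name `dump_and_read_bytes` is hypothetical, and temporary files are used here instead of `unicodeFile.json` so the example cleans up after itself.

```python
import json
import os
import tempfile

data = {"string1": "明彦", "string2": "\u00f8"}


def dump_and_read_bytes(obj, ensure_ascii):
    """Write obj as JSON to a temporary UTF-8 file and return the raw bytes."""
    fd, path = tempfile.mkstemp(suffix=".json")
    try:
        # open() takes ownership of the descriptor and closes it on exit.
        with open(fd, "w", encoding="utf-8") as handle:
            json.dump(obj, handle, ensure_ascii=ensure_ascii)
        with open(path, "rb") as handle:
            return handle.read()
    finally:
        os.remove(path)


as_is = dump_and_read_bytes(data, ensure_ascii=False)
escaped = dump_and_read_bytes(data, ensure_ascii=True)

print(as_is)      # contains the UTF-8 bytes of the characters themselves
print(escaped)    # contains only ASCII bytes, with \u escapes
```

Both files parse back to the same dictionary; only the on-disk representation differs.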

Serialize Unicode objects into UTF-8 JSON strings instead of \u escape sequence

You can also encode the JSON as UTF-8, which is the recommended default for maximum interoperability. Set ensure_ascii=False in json.dumps(), then encode the resulting string with .encode('utf-8').

    import json

    # encoding in UTF-8
    unicodeData = {
        "string1": "明彦",
        "string2": u"\u00f8"
    }
    print("unicode Data is ", unicodeData)

    print("Unicode JSON Data encoding using utf-8")
    encodedUnicode = json.dumps(unicodeData, ensure_ascii=False).encode('utf-8')
    print("JSON character encoding by setting ensure_ascii=False", encodedUnicode)

    print("Decoding JSON", json.loads(encodedUnicode))

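Output:

    unicode Data is  {'string1': '明彦', 'string2': 'ø'}
    Unicode JSON Data encoding using utf-8
    JSON character encoding by setting ensure_ascii=False b'{"string1": "\xe6\x98\x8e\xe5\xbd\xa6", "string2": "\xc3\xb8"}'
    Decoding JSON {'string1': '明彦', 'string2': 'ø'}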

Encode both Unicode and ASCII (Mix Data) into JSON using Python

In this example, we will see how to encode a Python dictionary that contains both Unicode and ASCII data into JSON.

    import json

    # Assumed sample values for illustration: one non-ASCII string, one plain number
    sampleDict = {"name": "明彦", "age": 25}
    print("unicode Data is ", sampleDict)

    # set ensure_ascii=True
    jsonDict = json.dumps(sampleDict, ensure_ascii=True)
    print("JSON character encoding by setting ensure_ascii=True")
    print(jsonDict)
    print("Decoding JSON", json.loads(jsonDict))

    # set ensure_ascii=False
    jsonDict = json.dumps(sampleDict, ensure_ascii=False)
    print("JSON character encoding by setting ensure_ascii=False")
    print(jsonDict)
    print("Decoding JSON", json.loads(jsonDict))

    # set ensure_ascii=False and encode using utf-8
    jsonDict = json.dumps(sampleDict, ensure_ascii=False).encode('utf-8')
    print("JSON character encoding by setting ensure_ascii=False and UTF-8")
    print(jsonDict)
    print("Decoding JSON", json.loads(jsonDict))

Output (for the assumed sample values above):

    unicode Data is  {'name': '明彦', 'age': 25}
    JSON character encoding by setting ensure_ascii=True
    {"name": "\u660e\u5f66", "age": 25}
    Decoding JSON {'name': '明彦', 'age': 25}
    JSON character encoding by setting ensure_ascii=False
    {"name": "明彦", "age": 25}
    Decoding JSON {'name': '明彦', 'age': 25}
    JSON character encoding by setting ensure_ascii=False and UTF-8
    b'{"name": "\xe6\x98\x8e\xe5\xbd\xa6", "age": 25}'
    Decoding JSON {'name': '明彦', 'age': 25}

Python Escape non-ASCII characters while encoding it into JSON

Let's see how to store all incoming non-ASCII characters escaped in JSON. This is a safe way of representing Unicode characters: with ensure_ascii=True, the resulting JSON contains only ASCII characters, and every non-ASCII character is written as a \u escape sequence.

    import json

    unicodeData = {
        "string1": "明彦",
        "string2": u"\u00f8"
    }
    print("unicode Data is ", unicodeData)

    # set ensure_ascii=True
    encodedUnicode = json.dumps(unicodeData, ensure_ascii=True)
    print("JSON character encoding by setting ensure_ascii=True")
    print(encodedUnicode)

    print("Decoding JSON")
    print(json.loads(encodedUnicode))

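Output:

    unicode Data is  {'string1': '明彦', 'string2': 'ø'}
    JSON character encoding by setting ensure_ascii=True
    {"string1": "\u660e\u5f66", "string2": "\u00f8"}
    Decoding JSON
    {'string1': '明彦', 'string2': 'ø'}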


About Vishal

I'm Vishal Hule, Founder of PYnative.com. I am a Python developer, and I love to write articles to help students, developers, and learners.
