Should I import unicode_literals?¶
The future package can be used with or without unicode_literals imports.
In general, it is more compelling to use unicode_literals when back-porting new or existing Python 3 code to Python 2/3 than when porting existing Python 2 code to 2/3. In the latter case, explicitly marking up all unicode string literals with u'' prefixes would help to avoid unintentionally changing the existing Python 2 API. However, if changing the existing Python 2 API is not a concern, using unicode_literals may speed up the porting process.
This section summarizes the benefits and drawbacks of using unicode_literals. To avoid confusion, we recommend using unicode_literals everywhere across a code-base or not at all, instead of turning it on for only some modules.
Benefits¶
- String literals are unicode on Python 3. Making them unicode on Python 2 leads to more consistency of your string types across the two runtimes. This can make it easier to understand and debug your code.
- Code without u'' prefixes is cleaner, one of the claimed advantages of Python 3. Even though some unicode strings would require a function call to convert them to native strings for some Python 2 APIs (see Standard library incompatibilities), the incidence of these function calls would usually be much lower than the incidence of u'' prefixes for text strings in the absence of unicode_literals.
- The diff when porting to a Python 2/3-compatible codebase may be smaller, less noisy, and easier to review with unicode_literals than if an explicit u'' prefix is added to every unadorned string literal.
- If support for Python 3.2 is required (e.g. for Ubuntu 12.04 LTS or Debian wheezy), u'' prefixes are a SyntaxError, making unicode_literals the only option for a Python 2/3 compatible codebase. [However, note that future doesn’t support Python 3.0-3.2.]
Drawbacks¶
- Adding unicode_literals to a module amounts to a “global flag day” for that module, changing the data types of all strings in the module at once. Cautious developers may prefer an incremental approach. (See here for an excellent article describing the superiority of an incremental patch-set in the case of the Linux kernel.)
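- Adding unicode_literals can change a module's API in ways that are not obvious. An example on Python 2:

      ### Module: mypaths.py

      ...
      def unix_style_path(path):
          return path.replace('\\', '/')
      ...

      ### User code:

      >>> path1 = '\\Users\\Ed'
      >>> unix_style_path(path1)
      '/Users/Ed'

  On Python 2, adding a unicode_literals import to mypaths.py makes the two literals passed to replace() unicode, so the return type of unix_style_path seen by the user code changes from str to unicode.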
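- With unicode_literals, some string literals must be wrapped in a function call to produce native strings for the Python 2 APIs that require them:

      >>> from __future__ import unicode_literals
      >>> ...
      >>> from future.utils import bytes_to_native_str as n
      >>> s = n(b'ABCD')
      >>> s
      'ABCD'  # on both Py2 and Py3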
- A docstring containing non-ASCII characters becomes a unicode docstring, which can make help() fail on Python 2:

      >>> def f():
      ...     u"Author: Martin von Löwis"

      >>> help(f)

      /Users/schofield/Install/anaconda/python.app/Contents/lib/python2.7/pydoc.pyc in pipepager(text, cmd)
         1376     pipe = os.popen(cmd, 'w')
         1377     try:
      -> 1378         pipe.write(text)
         1379         pipe.close()
         1380     except IOError:

      UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 71: ordinal not in range(128)
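The native-string conversion that some Python 2 APIs require can be sketched without the future package. The helper below, to_native_str, is a hypothetical name introduced only for illustration; it is not part of future, and the UTF-8 default is an assumption.

```python
import sys

PY2 = sys.version_info[0] == 2


def to_native_str(text, encoding="utf-8"):
    """Return `text` as the interpreter's native `str` type.

    Hypothetical helper for illustration; not part of the future package.
    """
    if PY2 and not isinstance(text, str):
        # Python 2: `text` is a unicode object, so encode it to a byte string.
        return text.encode(encoding)
    if not PY2 and isinstance(text, bytes):
        # Python 3: bytes must be decoded to reach the native str type.
        return text.decode(encoding)
    # Already the native str type on this interpreter.
    return text


# Both values end up as native str, whichever literal type they started as.
header_name = to_native_str(u"Content-Type")
header_value = to_native_str(b"text/plain")
```

A wrapper like this would only be needed at the few call sites that insist on native strings, such as WSGI header values, rather than on every literal in the module.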
Others’ perspectives¶
In favour of unicode_literals ¶
Django recommends importing unicode_literals as its top porting tip for migrating Django extension modules to Python 3. The following quote is from Aymeric Augustin on 23 August 2012 regarding why he chose unicode_literals for the port of Django to a Python 2/3-compatible codebase:
“… I’d like to explain why this PEP [PEP 414, which allows explicit u'' prefixes for unicode literals on Python 3.3+] is at odds with the porting philosophy I’ve applied to Django, and why I would have vetoed taking advantage of it.
“I believe that aiming for a Python 2 codebase with Python 3 compatibility hacks is a counter-productive way to port a project. You end up with all the drawbacks of Python 2 (including the legacy u prefixes) and none of the advantages [of] Python 3 (especially the sane string handling).
“Working to write Python 3 code, with legacy compatibility for Python 2, is much more rewarding. Of course it takes more effort, but the results are much cleaner and much more maintainable. It’s really about looking towards the future or towards the past.
“I understand the reasons why PEP 414 was proposed and why it was accepted. It makes sense for legacy software that is minimally maintained. I hope nobody puts Django in this category!”
Against unicode_literals ¶
“There are so many subtle problems that unicode_literals causes. For instance lots of people accidentally introduce unicode into filenames and that seems to work, until they are using it on a system where there are unicode characters in the filesystem path.”
—Armin Ronacher
“+1 from me for avoiding the unicode_literals future, as it can have very strange side effects in Python 2…. This is one of the key reasons I backed Armin’s PEP 414.”
—Nick Coghlan
“Yeah, one of the nuisances of the WSGI spec is that the header values IIRC are the str or StringType on both py2 and py3. With unicode_literals this causes hard-to-spot bugs, as some WSGI servers might be more tolerant than others, but usually using unicode in python 2 for WSGI headers will cause the response to fail.”
—Antti Haapala
© Copyright 2013-2019, Python Charmers Pty Ltd, Australia.
unicode_literals in Python
Unicode defines the same character repertoire as the Universal Character Set (UCS, ISO/IEC 10646). ASCII uses 7 bits per character and so defines only 128 (2^7) distinct characters; 8-bit extensions of ASCII raise this to at most 256 (2^8). ASCII covers little beyond English, so it cannot represent languages such as Hindi, Russian, or Chinese, nor symbols such as emojis. Unicode addresses this with a much larger table of code points that contains ASCII as its first 128 entries and leaves room for other languages, symbols, and emojis.
Text cannot be saved as Unicode directly, because Unicode is an abstract representation: it maps each character to a number (a code point). To store or transmit text, an encoding is needed that maps those code points to bytes. When a character needs more than one byte, those bytes are treated as a single unit (think of a box holding more than one item). UTF-8 and UTF-16 are two such encodings: in UTF-8 a character occupies at least 8 bits, and in UTF-16 at least 16 bits. A UTF (Unicode Transformation Format) is just an algorithm that turns Unicode text into bytes and reads it back.
Normally, in Python 2 all string literals are byte strings by default, whereas in Python 3 all string literals are Unicode strings by default. To make all string literals in a Python 2 module Unicode, use the following import:
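The difference between the two encodings can be seen by encoding a few characters and counting bytes. This is a minimal sketch for Python 3; the sample characters are chosen only for illustration.

```python
# Sample characters: 'A', 'ö', '€', and an emoji outside the 16-bit range.
chars = ["A", "\u00f6", "\u20ac", "\U0001F600"]

# Number of bytes each character occupies in each encoding.
utf8_lengths = [len(ch.encode("utf-8")) for ch in chars]
utf16_lengths = [len(ch.encode("utf-16-le")) for ch in chars]

for ch, n8, n16 in zip(chars, utf8_lengths, utf16_lengths):
    print("U+%04X  utf-8: %d byte(s)  utf-16: %d byte(s)" % (ord(ch), n8, n16))

# Encoding and then decoding with the same codec returns the original text.
text = "".join(chars)
round_trip = text.encode("utf-8").decode("utf-8")
```

UTF-8 uses between one and four bytes per character here, while UTF-16 uses two bytes for the first three characters and four for the emoji.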
from __future__ import unicode_literals
On Python 2 (2.6 or later), this import comes from the built-in __future__ module, not from the third-party future package. It makes string literals in the importing module behave as they do on Python 3, which helps make the code compatible with both versions.
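One way to observe what the import does is to compile the same literal with and without the feature's compiler flag. This is only a sketch: SOURCE and the other variable names are chosen for illustration, and real modules would simply use the import statement itself.

```python
import __future__

SOURCE = "'abc'"  # an unprefixed string literal, as source text

# Compile without any future features (flags=0, dont_inherit=True).
plain = eval(compile(SOURCE, "<plain>", "eval", 0, True))

# Compile the same source with the unicode_literals feature switched on.
flag = __future__.unicode_literals.compiler_flag
flagged = eval(compile(SOURCE, "<flagged>", "eval", flag, True))

# The text type is `unicode` on Python 2 and `str` on Python 3.
text_type = type(u"")

print(type(plain).__name__, type(flagged).__name__)
```

On Python 3 both results are str, because the feature is always on. On Python 2 the flagged result would be unicode while the plain one would remain a byte string.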