Python: get the UTF-8 code of a character

Contents

How to get the ASCII value of a character
Get unicode code point of a character using Python
5 Answers
The codepoints module
Python get character code in different encoding?
3 Answers

How to get the ASCII value of a character

The function ord() returns the integer value of a character. To convert back after working with the number, use chr().

>>> ord('a')
97
>>> chr(97)
'a'
>>> chr(ord('a') + 3)
'd'

In Python 2, there was also the unichr function, which returns the Unicode character whose ordinal is the given argument:

>>> unichr(97)
u'a'
>>> unichr(1234)
u'\u04d2'

In Python 3 you can use chr instead of unichr.

@njzk2: it doesn't use any character encoding; it returns a bytestring in Python 2. It is up to you to interpret it as a character, e.g. chr(ord(u'й'.encode('cp1251'))).decode('cp1251') == u'й'. In Python 3 (or with unichr in Python 2), the input number is interpreted as a Unicode code point: unichr(0x439) == u'\u0439'. The first 256 integers have the same mapping as latin-1: unichr(0xe9) == b'\xe9'.decode('latin-1'). The first 128 match ASCII: unichr(0x0a) == b'\x0a'.decode('ascii'). That is a Unicode thing, not a Python thing.

@eLymar: it's short for "ordinal", which has similar linguistic roots to "order", i.e. the numeric rather than symbolic representation of the character.

Note that ord() doesn't give you the ASCII value per se; it gives you the numeric value of the character in whatever encoding it's in. Therefore, in Python 2, the result of ord('ä') can be 228 if you're using Latin-1, or it can raise a TypeError if you're using UTF-8, because the byte string is then two bytes long. It can even return the Unicode code point instead if you pass it a unicode object.

Depends on the object type. Python 3 (str): Unicode by default. Python 3 (bytes): str(b'\xc3\x9c', 'ascii') raises UnicodeDecodeError. Python 3 (bytes): str(b'\xc3\x9c', 'utf-8') returns Ü. You can also look into the six package.
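A minimal Python 3 sketch of the distinction drawn in the comments above: ord() on a str returns a Unicode code point, while a bytes object already iterates as integers. The sample character and variable names are only illustrative.

```python
# Python 3: str holds code points, bytes holds raw byte values.
text = "Ü"
code_point = ord(text)          # Unicode code point of the character
raw = text.encode("utf-8")      # the UTF-8 encoded form, two bytes here
byte_values = list(raw)         # iterating bytes yields integers

print(code_point)               # 220
print(byte_values)              # [195, 156]

# Decoding the same bytes with the wrong codec fails.
try:
    raw.decode("ascii")
    decoded_as_ascii = True
except UnicodeDecodeError:
    decoded_as_ascii = False

print(decoded_as_ascii)         # False
print(raw.decode("utf-8"))      # Ü
```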

The accepted answer is correct, but there is a more clever/efficient way to do this if you need to convert a whole bunch of ASCII characters to their ASCII codes at once. Instead of doing:

for ch in mystr:
    code = ord(ch)

you convert to Python native types that iterate the codes directly. On Python 3, it’s trivial:

for code in mystr.encode('ascii'):

and on Python 2.6/2.7, it's only slightly more involved, because they don't have a Py3-style bytes object (bytes is an alias for str, which iterates by character), but they do have bytearray:

# If mystr is definitely str, not unicode
for code in bytearray(mystr):

# If mystr could be either str or unicode
for code in bytearray(mystr, 'ascii'):

Encoding as a type that natively iterates by ordinal means the conversion goes much faster. In local tests on both Py2.7 and Py3.5, iterating a str to get its ASCII codes using map(ord, mystr) starts off taking about twice as long for a len 10 str as using bytearray(mystr) on Py2 or mystr.encode('ascii') on Py3, and as the str gets longer, the multiplier paid for map(ord, mystr) rises to ~6.5x-7x.

The only downside is that the conversion is all at once, so your first result might take a little longer, and a truly enormous str would have a proportionately large temporary bytes/bytearray, but unless this forces you into page thrashing, this isn't likely to matter.
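The two approaches can be compared with a short Python 3 sketch. The sample string and the repeat count are arbitrary assumptions, and the timings will vary by machine and Python version, so no particular ratio is claimed here.

```python
import timeit

mystr = "hello world" * 100   # hypothetical ASCII-only sample data

# Both approaches produce the same list of integer codes.
codes_via_ord = [ord(ch) for ch in mystr]
codes_via_bytes = list(mystr.encode("ascii"))
same_result = codes_via_ord == codes_via_bytes

# Rough timing of each approach; treat the numbers as indicative only.
t_ord = timeit.timeit(lambda: list(map(ord, mystr)), number=1000)
t_bytes = timeit.timeit(lambda: list(mystr.encode("ascii")), number=1000)

print(same_result)
print("map(ord, ...): %.4fs, encode('ascii'): %.4fs" % (t_ord, t_bytes))
```

Note that encode('ascii') raises UnicodeEncodeError if the string contains any non-ASCII character, so this shortcut only applies when the input is known to be ASCII.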


Get unicode code point of a character using Python

In the Python API, is there a way to extract the Unicode code point of a single character? Edit: In case it matters, I'm using Python 2.7.

E.g. for '\u304f' I want '304f'. Is that what ord() will do?

Yes: docs.python.org/library/functions.html#ord

Yes, ord("\N{HIRAGANA LETTER KU}") is indeed 12367, aka 0x304F. I would never use numbers for characters the way you do, only named ones the way I do. Magic numbers are bad for your program. Just think of chr and ord as inverse functions of each other. It's really easy.

@tchrist it might be worth noting chr is the opposite of ord in python 3.x, but in python 2.x unichr is the inverse of ord as chr only works for ordinals up to 255 in python 2.x.

@David: Yes, but I consider that a legacy system, which doesn’t really work very well for Unicode — as you have yourself just demonstrated. chr and ord were always meant to be inverses, and it was a legacy Python 2 bug that they sometimes weren’t. That’s nuts.

@tchrist there are still lots of people using Python 2.x. Even in Python 3.x there are still narrow Unicode builds (for example, most Windows builds of Python 3.x are narrow). I wouldn't call most 2.x Unicode issues bugs so much as additions to maintain backwards compatibility with older scripts; Python 2.x usually works fine with Unicode. Python 3.0 does make things much more consistent though, eliminating the difference between str and unicode.

5 Answers

If I understand your question correctly, you can do this.

>>> s = '㈲'
>>> s.encode("unicode_escape")
b'\\u3232'

This shows the Unicode escape code as it would be written in a source string.

For me, this doesn't work with ASCII characters: 'a'.encode('unicode_escape') gives 'a' instead of a '\\u...' escape. (Same with u'a'.encode('unicode_escape').) Also, the format is different when you go outside the Basic Multilingual Plane: u'😱'.encode('unicode_escape') gives '\\U0001f631'.

@ShreevatsaR Try "a".encode("unicode_escape").hex() to get the hexadecimal representation as a str. Alternatively, hex(ord("a")) will also work.
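Since unicode_escape output differs between ASCII, BMP, and non-BMP characters, formatting the result of ord() gives a more uniform hex string. A minimal Python 3 sketch follows; code_point_hex is a hypothetical helper name, not something from the thread.

```python
def code_point_hex(ch: str) -> str:
    """Return the code point of one character as lowercase hex, at least 4 digits.

    Hypothetical helper for illustration. ord() raises TypeError if `ch`
    is not exactly one character long.
    """
    return format(ord(ch), "04x")


print(code_point_hex("\u304f"))       # 304f
print(code_point_hex("a"))            # 0061
print(code_point_hex("\U0001f631"))   # 1f631
```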

>>> ord(u"ć")
263
>>> u"café"[2]
u'f'
>>> u"café"[3]
u'\xe9'
>>> for c in u"café":
...     print repr(c), ord(c)
...
u'c' 99
u'a' 97
u'f' 102
u'\xe9' 233

If c is my character variable (say it's equal to 'あ') and I do ucp = ord(c) and then print ucp, I get three integers, not a single integer. How do I get a single integer?

How did you get あ into the variable? If it’s a literal in your source code, then make sure your source file has an appropriate encoding set. Otherwise, ask a new question and post more detailed code.

It turns out that getting this right is fairly tricky: Python 2 and Python 3 have some subtle issues with extracting Unicode code points from a string.

Up until Python 3.3, it was possible to compile Python in one of two modes:

In the first ("wide") mode, where sys.maxunicode is 0x10FFFF, Python's Unicode strings support the full range of Unicode code points from U+0000 to U+10FFFF. One code point is represented by one string element:

>>> import sys
>>> hex(sys.maxunicode)
'0x10ffff'
>>> len(u'\U0001F40D')
1
>>> [c for c in u'\U0001F40D']
[u'\U0001f40d']

This is the default for Python 2.7 on Linux, as well as universally on Python 3.3 and later across all operating systems.

In the second ("narrow") mode, where sys.maxunicode is 0xFFFF, Python's Unicode strings only support the range of Unicode code points from U+0000 to U+FFFF. Any code points from U+10000 through U+10FFFF are represented using a pair of string elements in the UTF-16 encoding:

>>> import sys
>>> hex(sys.maxunicode)
'0xffff'
>>> len(u'\U0001F40D')
2
>>> [c for c in u'\U0001F40D']
[u'\ud83d', u'\udc0d']

This is the default for Python 2.7 on macOS and Windows.

This runtime difference makes it quite inconvenient to write Python modules that manipulate Unicode strings as a series of code points.
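For reference, the pair of string elements seen on a narrow build is a UTF-16 surrogate pair, and it can be folded back into one code point with the standard UTF-16 arithmetic. The Python 3 sketch below is illustrative only; combine_surrogates is a hypothetical helper name and is not part of any module mentioned here.

```python
import sys


def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 surrogate pair into one code point (hypothetical helper)."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    # Each surrogate carries 10 bits of the value above 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


snake = combine_surrogates(0xD83D, 0xDC0D)
print(hex(snake))                      # 0x1f40d

# On Python 3.3 and later, the string already has one element per code point.
print(hex(sys.maxunicode))             # 0x10ffff
print(len("\U0001F40D"))               # 1
print(ord("\U0001F40D") == snake)      # True
```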

The codepoints module

To solve this, I contributed a new module, codepoints, to PyPI.

This module solves the problem by exposing APIs to convert Unicode strings to and from lists of code points, regardless of the underlying setting for sys.maxunicode:

>>> hex(sys.maxunicode)
'0xffff'
>>> snake = tuple(codepoints.from_unicode(u'\U0001F40D'))
>>> len(snake)
1
>>> snake[0]
128013
>>> hex(snake[0])
'0x1f40d'
>>> codepoints.to_unicode(snake)
u'\U0001f40d'


Python get character code in different encoding?

Given a character code as an integer in one encoding, how can you get the character code in, say, UTF-8, again as an integer?

3 Answers

UTF-8 is a variable-length encoding, so I'll assume you really meant "Unicode code point". Use chr() to convert the character code to a character, decode it, and use ord() to get the code point.

>>> ord(chr(145).decode('koi8-r')) 9618

In Python 2, chr returns a single-byte string, so it only accepts numbers in the range 0 to 255 (and only 0 to 127 are ASCII). Use unichr instead for Unicode support.

Hmm, UnicodeEncodeError: 'ascii' codec can't encode character u'\u8140' in position 0: ordinal not in range(128)

chr(145) is probably equivalent to unichr(145).encode('latin1') on Python 2 if the input is in range(256). There is no unichr on Python 3; it was renamed to chr. If you need to fix the input, this is usually a hack: reinterpreted = unistr.encode(one_encoding).decode(another_encoding)
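In Python 3 there is no decode method on str, so the same idea has to go through a bytes object. A minimal sketch, assuming the input is a single-byte code (0 to 255); recode_to_code_point is a hypothetical helper name.

```python
def recode_to_code_point(code: int, encoding: str) -> int:
    """Interpret `code` as one byte in `encoding` and return its Unicode code point.

    Hypothetical helper for illustration. bytes([code]) raises ValueError
    for values outside 0..255, so this only fits single-byte codes.
    """
    return ord(bytes([code]).decode(encoding))


print(recode_to_code_point(145, "koi8-r"))     # 9618
print(recode_to_code_point(0xE9, "latin-1"))   # 233
```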

You can only map an "integer number" from one encoding to another if they are both single-byte encodings.

Here's an example using "iso-8859-15" and "cp1252" (aka "ANSI"):

>>> s = u'€'
>>> s.encode('iso-8859-15')
'\xa4'
>>> s.encode('cp1252')
'\x80'
>>> ord(s.encode('cp1252'))
128
>>> ord(s.encode('iso-8859-15'))
164

Note that ord is used here to get the ordinal of the encoded byte. Using ord on the original unicode string would give its Unicode code point instead (8364 for u'€').
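The same comparison can be sketched in Python 3, where indexing a bytes object already yields an integer, so ord is only needed for the str itself. The variable names are illustrative.

```python
s = "€"

code_point = ord(s)                     # Unicode code point of the character
latin9 = s.encode("iso-8859-15")[0]     # indexing bytes yields an int
cp1252 = s.encode("cp1252")[0]

print(code_point, latin9, cp1252)       # 8364 164 128

# chr() reverses ord() for any code point in Python 3.
print(chr(code_point) == s)             # True
```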

The reverse operation to ord can be done using either chr (which accepts codes in the range 0 to 255, of which only 0 to 127 are ASCII) or unichr (for codes in the range 0 to sys.maxunicode):

>>> print chr(65)
A
>>> print unichr(8364)
€

For multi-byte encodings, a simple "integer number" mapping is usually not possible.

Here's the same example as above, but using "iso-8859-15" and "utf-8":

>>> s = u'€'
>>> s.encode('iso-8859-15')
'\xa4'
>>> s.encode('utf-8')
'\xe2\x82\xac'
>>> [ord(c) for c in s.encode('iso-8859-15')]
[164]
>>> [ord(c) for c in s.encode('utf-8')]
[226, 130, 172]

The "utf-8" encoding uses three bytes to encode the same character, so a one-to-one mapping is not possible. Having said that, many encodings (including "utf-8") are designed to be ASCII-compatible, so a mapping is usually possible for codes in the range 0-127 (but only trivially so, because the code will always be the same).
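If a single integer for the UTF-8 form is still wanted, one option is to read the encoded bytes as a big-endian number. This is only a sketch of one possible convention, not a standard "UTF-8 code"; the Python 3 example below uses illustrative variable names.

```python
s = "€"

utf8_bytes = s.encode("utf-8")
byte_values = list(utf8_bytes)                       # one integer per byte
as_single_int = int.from_bytes(utf8_bytes, "big")    # bytes read as one number

print(byte_values)           # [226, 130, 172]
print(hex(as_single_int))    # 0xe282ac

# For ASCII characters the UTF-8 byte equals the code point.
ascii_same = "A".encode("utf-8")[0] == ord("A")
print(ascii_same)            # True
```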
