Python unicode code to string

How To Convert Python Unicode Characters To String | Python

In this tutorial, we look at how to convert Python Unicode characters to a string. This article shows how to translate Unicode characters into an ASCII string. The goal is either to drop non-ASCII characters or to replace Unicode characters with their closest ASCII equivalents.

Unicode Characters in Python

Unicode is the universal character encoding standard, covering the characters of all languages. Unlike ASCII, which defines only 128 characters and fits each one in a single byte, Unicode encodings such as UTF-8 can use up to four bytes per character, which leaves room for the characters of any language.
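As a rough illustration of that difference, the sketch below looks up the code point of a few characters and checks whether each one falls inside the ASCII range. The sample characters are assumptions picked for this example, not taken from the article.

```python
# Sample characters chosen for illustration only: A, e-acute, euro sign, an emoji.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

# Record each code point and whether it falls inside ASCII's 0-127 range.
code_points = {ch: ord(ch) for ch in samples}
fits_ascii = {ch: cp < 128 for ch, cp in code_points.items()}

for ch in samples:
    print(f"U+{code_points[ch]:04X}  fits in ASCII: {fits_ascii[ch]}")
```

Only the first character fits in ASCII; the others need a Unicode encoding.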

Unicode strings can be encoded to bytes in any encoding you like. A Python Unicode character is an abstract object large enough to hold any character, analogous to Python's long integers. If the string contains only ASCII characters, you can convert it with str(). In Python 3, str is already a Unicode type, so this call returns the string unchanged.

encode() and decode() Methods in Python

If you have a Unicode string and need to write it to a file or another serialised form, you must first encode it into a form that can be saved. There are many popular Unicode encodings, such as UTF-16 (which uses two bytes for most Unicode characters) and UTF-8 (which uses one to four bytes per code point, depending on the character).

You can use the following code to convert that string to a certain encoding.
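The article's snippet is missing at this point. A minimal sketch, assuming a sample string named `text` that is not from the original, could look like this:

```python
# Hypothetical sample string ("café"); any str value works the same way.
text = "caf\u00e9"

# encode() turns the Unicode string into a bytes object.
encoded = text.encode("utf-8")
print(encoded)        # b'caf\xc3\xa9'
print(type(encoded))  # <class 'bytes'>
```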

The output above is of the bytes data type.

Use the decode() method to convert bytes back to a string in Python.
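As a sketch of that step, assuming the bytes came from encoding a sample string in UTF-8 (the value below is made up for illustration):

```python
# Sample bytes: "café" encoded in UTF-8 (illustrative value, not from the article).
encoded = b"caf\xc3\xa9"

# decode() must be given the same encoding that produced the bytes.
decoded = encoded.decode("utf-8")
print(decoded)
print(type(decoded))  # <class 'str'>
```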

Convert Python Unicode Characters To String

Use the unicodedata.normalize() method to convert Python Unicode to string. Based on canonical equivalence and compatibility equivalence, the Unicode standard offers multiple normalisation forms of a Unicode string.
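To make the difference between canonical and compatibility equivalence concrete, here is a small sketch. The two sample characters (the "fi" ligature and the circled digit one) are assumptions chosen for illustration.

```python
import unicodedata

# Sample characters chosen for illustration.
ligature = "\ufb01"      # LATIN SMALL LIGATURE FI
circled_one = "\u2460"   # CIRCLED DIGIT ONE

# Canonical forms (NFC/NFD) leave compatibility characters untouched...
nfc_ligature = unicodedata.normalize("NFC", ligature)
# ...while compatibility forms (NFKC/NFKD) map them to plain equivalents.
nfkd_ligature = unicodedata.normalize("NFKD", ligature)
nfkd_circled = unicodedata.normalize("NFKD", circled_one)

print(nfc_ligature == ligature)  # True
print(nfkd_ligature)             # fi
print(nfkd_circled)              # 1
```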

unicodedata.normalize() to Convert Unicode to ASCII String in Python

The Python module unicodedata provides access to the Unicode Character Database, along with utility functions that make accessing, filtering, and looking up these characters easier.

normalize() is a function in unicodedata that accepts two parameters: the name of the normalization form and the string to normalize. There are four normalization forms: NFC, NFKC, NFD, and NFKD. This article uses NFKD.
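The code for this step did not survive in the article text. The sketch below assumes an input string chosen so that it produces the output shown next; both the variable name `string_val` and the string itself are assumptions.

```python
import unicodedata

# Assumed sample input (the original input string is not preserved).
string_val = "Klüft inför på fédéral électoral große"

# NFKD splits accented letters into a base letter plus combining marks;
# encoding to ASCII with errors="ignore" then drops everything non-ASCII.
ascii_bytes = unicodedata.normalize("NFKD", string_val).encode("ascii", "ignore")
print(ascii_bytes)
```

Note that ß has no decomposition into ASCII letters, so it is dropped entirely, which is why the result ends in "groe".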

b'Kluft infor pa federal electoral groe'

Because encode() is called on the string, the result is a bytes object, which is why its printed form starts with b. To get rid of the prefix and the enclosing single quotes, call decode() after encode() to convert the result back to a string.

The result above is an encoded bytes object, which we can now decode into a Python string with the bytes decode() method.
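A sketch of the full chain, using the same assumed sample input as before (the string is not preserved in the article):

```python
import unicodedata

string_val = "Klüft inför på fédéral électoral große"  # assumed sample input

# normalize -> encode (dropping non-ASCII) -> decode back to str
ascii_text = (
    unicodedata.normalize("NFKD", string_val)
    .encode("ascii", "ignore")
    .decode("ascii")
)
print(ascii_text)
```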

Kluft infor pa federal electoral groe

Let's try another example in which replace is passed as the second argument to encode().

The replace argument substitutes a question mark (?) for every character that has no ASCII equivalent. With ignore on the same string, those characters are removed instead.
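The two error handlers can be compared side by side. This is a sketch under the assumption that the input is the sample string below, which is not preserved in the article.

```python
import unicodedata

string_val = "Klüft inför på fédéral électoral große"  # assumed sample input
normalized = unicodedata.normalize("NFKD", string_val)

# Run the same normalized string through both error handlers.
results = {
    handler: normalized.encode("ascii", handler).decode("ascii")
    for handler in ("replace", "ignore")
}

for handler, text in results.items():
    print(f"{handler:>7}: {text}")
```

With replace, every combining mark left behind by NFKD becomes its own question mark, so an accented letter turns into the base letter followed by "?".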

Conclusion

To convert Unicode characters to ASCII characters, use the unicodedata module’s normalize() function and the string’s built-in encode() function. Unicode characters that do not have ASCII counterparts can be ignored or replaced. The ignore option removes the character, while the replace option replaces it with question marks.
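The whole recipe fits in a small helper. The function name `to_ascii` and its signature are hypothetical, introduced here only to summarise the steps.

```python
import unicodedata


def to_ascii(text: str, errors: str = "ignore") -> str:
    """Return an ASCII-only version of text (hypothetical helper).

    errors is passed to str.encode(): "ignore" drops characters with no
    ASCII form, "replace" substitutes "?" for each of them.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", errors).decode("ascii")


print(to_ascii("Cr\u00e8me br\u00fbl\u00e9e"))  # Creme brulee
print(to_ascii("\u00df", "replace"))             # ?
```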


Convert a Unicode String to a String in Python

In this Python tutorial, you will learn how to convert a Unicode string to a string.


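A Unicode string represents each character by a number, its code point, which lets it hold characters from any writing system. To mark a string literal as Unicode, place the character "u" in front of it. In Python 3 the prefix is optional, because every str is already a Unicode string.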
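A quick check of the u prefix under Python 3, as a minimal sketch:

```python
# In Python 3 the u prefix is accepted but changes nothing.
with_prefix = u"Welcome to thisPointer"
without_prefix = "Welcome to thisPointer"

print(type(with_prefix))              # <class 'str'>
print(with_prefix == without_prefix)  # True
```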

Convert a Unicode string to a string using str()

Here, we will use str() to convert a Unicode string to a string. In this usage it takes only one parameter, str(inp_str), where inp_str is a Unicode string.
Example 1:

In this example, we will convert the Unicode string u"Welcome to thisPointer" to a string using str().

# Consider the unicode string
inp_str = u"Welcome to thisPointer"

# Convert to string
print("Converted String: ", str(inp_str))

Output:

Converted String: Welcome to thisPointer

Convert a Unicode string to UTF-8

Here, we will take a Unicode string and encode it to UTF-8 using the encode() method, called as inp_str.encode('UTF-8'), where inp_str is the Unicode string. UTF-8 converts each character in the Unicode string into 1 to 4 bytes, depending on the character.

In this example, we will convert the Unicode string u"Welcome to thisPointer" to UTF-8.

# Consider the unicode string
inp_str = u"Welcome to thisPointer"

# Convert unicode string to UTF-8 encoding
inp_str = inp_str.encode('UTF-8')
print("Converted String: ", inp_str)

Output:

Converted String: b'Welcome to thisPointer'

In the output above, each character takes 1 byte in UTF-8, because the string contains only ASCII characters. If you want to get the Unicode string back, you can use the decode() method.

Example:
In this example, we will convert the Unicode string u"Welcome to thisPointer" to UTF-8 and then decode it back to a Unicode string.

# Consider the unicode string
inp_str = u"Welcome to thisPointer"

# Convert unicode string to UTF-8 encoding
inp_str = inp_str.encode('UTF-8')
print("Converted String: ", inp_str)

# Convert back
inp_str = inp_str.decode('UTF-8')
print("Actual String: ", inp_str)

Output:

Converted String: b'Welcome to thisPointer'
Actual String: Welcome to thisPointer
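The 1 to 4 byte range of UTF-8 only shows up with non-ASCII input. The sketch below measures a few sample characters; the characters are assumptions chosen to hit each width and do not come from the article.

```python
# Sample characters chosen to hit each UTF-8 width (illustrative only).
samples = {
    "W": "Latin letter",
    "\u00e9": "e with acute accent",
    "\u20ac": "euro sign",
    "\U0001F600": "emoji",
}

# Number of bytes each character occupies once encoded as UTF-8.
utf8_sizes = {ch: len(ch.encode("UTF-8")) for ch in samples}

for ch, label in samples.items():
    print(f"{label}: {utf8_sizes[ch]} byte(s)")
```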

Convert a Unicode string to UTF-16

Here, we will take a Unicode string and encode it to UTF-16 using the encode() method, called as inp_str.encode('UTF-16'), where inp_str is the Unicode string. UTF-16 converts most characters in the Unicode string into 2 bytes each.
Example:

In this example, we will convert the Unicode string u"Welcome to thisPointer" to UTF-16.

# Consider the unicode string
inp_str = u"Welcome to thisPointer"

# Convert unicode string to UTF-16 encoding
inp_str = inp_str.encode('UTF-16')
print("Converted String: ", inp_str)

Output:

Converted String: b'\xff\xfeW\x00e\x00l\x00c\x00o\x00m\x00e\x00 \x00t\x00o\x00 \x00t\x00h\x00i\x00s\x00P\x00o\x00i\x00n\x00t\x00e\x00r\x00'

In the output above, each character takes 2 bytes, and the leading \xff\xfe is the byte order mark (BOM). If you want to get the Unicode string back, you can use the decode() method.

In this example, we will convert the Unicode string u"Welcome to thisPointer" to UTF-16 and then decode it back to a Unicode string.

# Consider the unicode string
inp_str = u"Welcome to thisPointer"

# Convert unicode string to UTF-16 encoding
inp_str = inp_str.encode('UTF-16')
print("Converted String: ", inp_str)

# Convert back
inp_str = inp_str.decode('UTF-16')
print("Actual String: ", inp_str)

Output:

Converted String: b'\xff\xfeW\x00e\x00l\x00c\x00o\x00m\x00e\x00 \x00t\x00o\x00 \x00t\x00h\x00i\x00s\x00P\x00o\x00i\x00n\x00t\x00e\x00r\x00'
Actual String: Welcome to thisPointer
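Two details of UTF-16 are easy to miss: the byte order mark that the plain 'UTF-16' codec prepends, and the "mostly" in "mostly 2 bytes". The sketch below checks both; the emoji used for the 4-byte case is an assumption chosen for illustration.

```python
import codecs

inp_str = "Welcome to thisPointer"

with_bom = inp_str.encode("UTF-16")        # BOM, then 2 bytes per character
without_bom = inp_str.encode("UTF-16-LE")  # explicit byte order, no BOM

print(with_bom.startswith(codecs.BOM_UTF16))  # True
print(len(with_bom) - len(without_bom))       # 2, the size of the BOM

# Characters outside the Basic Multilingual Plane need 4 bytes (a surrogate pair).
emoji_size = len("\U0001F600".encode("UTF-16-LE"))
print(emoji_size)  # 4
```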

Convert a Unicode string to UTF-32

Here, we will take a Unicode string and encode it to UTF-32 using the encode() method, called as inp_str.encode('UTF-32'), where inp_str is the Unicode string. UTF-32 converts each character in the Unicode string into 4 bytes.

In this example, we will convert the Unicode string u"Welcome to thisPointer" to UTF-32.

# Consider the unicode string
inp_str = u"Welcome to thisPointer"

# Convert unicode string to UTF-32 encoding
inp_str = inp_str.encode('UTF-32')
print("Converted String: ", inp_str)

Output:

Converted String: b'\xff\xfe\x00\x00W\x00\x00\x00e\x00\x00\x00l\x00\x00\x00c\x00\x00\x00o\x00\x00\x00m\x00\x00\x00e\x00\x00\x00 \x00\x00\x00t\x00\x00\x00o\x00\x00\x00 \x00\x00\x00t\x00\x00\x00h\x00\x00\x00i\x00\x00\x00s\x00\x00\x00P\x00\x00\x00o\x00\x00\x00i\x00\x00\x00n\x00\x00\x00t\x00\x00\x00e\x00\x00\x00r\x00\x00\x00'

In the output above, each character takes 4 bytes, and the leading \xff\xfe\x00\x00 is the byte order mark (BOM). If you want to get the Unicode string back, you can use the decode() method.

In this example, we will convert the Unicode string u"Welcome to thisPointer" to UTF-32 and then decode it back to a Unicode string.

# Consider the unicode string
inp_str = u"Welcome to thisPointer"

# Convert unicode string to UTF-32 encoding
inp_str = inp_str.encode('UTF-32')
print("Converted String: ", inp_str)

# Convert back
inp_str = inp_str.decode('UTF-32')
print("Actual String: ", inp_str)

Output:

Converted String: b'\xff\xfe\x00\x00W\x00\x00\x00e\x00\x00\x00l\x00\x00\x00c\x00\x00\x00o\x00\x00\x00m\x00\x00\x00e\x00\x00\x00 \x00\x00\x00t\x00\x00\x00o\x00\x00\x00 \x00\x00\x00t\x00\x00\x00h\x00\x00\x00i\x00\x00\x00s\x00\x00\x00P\x00\x00\x00o\x00\x00\x00i\x00\x00\x00n\x00\x00\x00t\x00\x00\x00e\x00\x00\x00r\x00\x00\x00'
Actual String: Welcome to thisPointer
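To put the three encodings side by side, the sketch below measures the encoded size of the same string in each. It uses the explicit little-endian codec names so that no byte order mark is added; that choice is an assumption made to keep the numbers easy to compare.

```python
inp_str = "Welcome to thisPointer"

# Explicit byte-order variants do not prepend a byte order mark.
encodings = ("UTF-8", "UTF-16-LE", "UTF-32-LE")
sizes = {enc: len(inp_str.encode(enc)) for enc in encodings}

for enc, size in sizes.items():
    print(f"{enc:<10} {size} bytes for {len(inp_str)} characters")
```

For this ASCII-only string the sizes are exactly 1, 2, and 4 times the character count.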

Summary

In this Python string article, we have seen how to convert a Unicode string to a string using str(). We also saw how to encode strings to UTF-8, UTF-16, and UTF-32 with encode() and how to decode them back to Unicode strings with decode(). Happy Learning.


Copyright © 2023 thisPointer


How to Convert Python Unicode to a String

Unicode strings can be encoded as plain strings in any encoding you choose. A Python Unicode character is an abstract object large enough to hold the character, analogous to Python's long integers. If the string contains only ASCII characters, use the str() function to convert it to a string.

If you have a Unicode string and need to write it to a file or another serialised form, you must first encode it into a specific representation that can be saved.

There are many common Unicode encodings, such as UTF-16 (which uses two bytes for most Unicode characters) or UTF-8 (which uses 1 to 4 bytes per code point, depending on the character), and others.

To convert the string to a specific encoding, you can use the following code.
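The snippet itself is missing from this copy of the article. A minimal sketch, assuming a made-up sample string, shows the same text encoded in two of the encodings mentioned:

```python
# Hypothetical sample string ("Grüße"); not taken from the article.
text = "Gr\u00fc\u00dfe"

# Encode the same string with two different codecs.
encoded = {name: text.encode(name) for name in ("utf-8", "utf-16-le")}

for name, data in encoded.items():
    print(name, type(data).__name__, len(data), "bytes")
```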

So the result is in bytes. To convert bytes to a string, use the decode() function.
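A sketch of the decoding step, again with made-up sample data. It also shows what happens when the wrong codec is used, a case the article does not cover:

```python
# Sample bytes: "Grüße" encoded as UTF-8 (illustrative value).
data = "Gr\u00fc\u00dfe".encode("utf-8")

# Decoding with the matching codec restores the original string.
restored = data.decode("utf-8")
print(restored == "Gr\u00fc\u00dfe")  # True

# Decoding with a codec that cannot interpret the bytes raises an error.
wrong_codec_error = None
try:
    data.decode("ascii")
except UnicodeDecodeError as exc:
    wrong_codec_error = exc
    print("ascii failed:", exc.reason)
```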

You can see that we got the original strings back.

How to Convert Python Unicode to a String

To convert Python Unicode to a string, use the unicodedata.normalize() function. The Unicode standard defines several normalization forms of a Unicode string, based on canonical equivalence and compatibility equivalence.

Each character has two normal forms:

Normal form D (NFD), also known as canonical decomposition, translates each character into its decomposed form. Normal form C (NFC) first applies canonical decomposition and then composes the precombined characters again.
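The two forms can be seen on a single accented letter. This is a minimal sketch; the sample character is an assumption chosen for illustration.

```python
import unicodedata

composed = "\u00e9"  # e with acute accent as one precomposed code point

# NFD splits it into a base letter plus a combining accent.
nfd = unicodedata.normalize("NFD", composed)
# NFC decomposes and then recomposes, giving back the single code point.
nfc = unicodedata.normalize("NFC", nfd)

print([hex(ord(c)) for c in nfd])  # ['0x65', '0x301']
print(nfc == composed)             # True
```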
