Python detect string encoding

How do I check if a string is unicode or ascii?

In Python 3, all strings are sequences of Unicode characters. There is a bytes type that holds raw bytes.

In Python 2, a string may be of type str or of type unicode. You can tell which using code like this:

def whatisthis(s):
    if isinstance(s, str):
        print "ordinary string"
    elif isinstance(s, unicode):
        print "unicode string"
    else:
        print "not a string"

This does not distinguish "Unicode or ASCII"; it only distinguishes Python types. A Unicode string may consist purely of characters in the ASCII range, and a bytestring may contain ASCII, encoded Unicode, or even non-textual data.

Note: this only applies if you're running Python 2. If your code is designed to run under either Python 2 or Python 3, you'll need to check the Python version first.
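A rough sketch of such a version-aware check; the PY2 flag and this variant of whatisthis are my own illustration, not something from the original answer:

import sys

PY2 = sys.version_info[0] == 2

def whatisthis(s):
    # The short-circuit keeps Python 3 from ever evaluating the name 'unicode',
    # which only exists on Python 2.
    if PY2 and isinstance(s, unicode):
        print("unicode string")
    elif isinstance(s, str):
        # Python 2: byte string; Python 3: unicode string
        print("ordinary string" if PY2 else "unicode string")
    else:
        print("not a string")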

How to tell if an object is a unicode string or a byte string

You can use type or isinstance.

>>> type(u'abc')   # Python 2 unicode string literal
<type 'unicode'>
>>> type('abc')    # Python 2 byte string literal
<type 'str'>

In Python 2, str is just a sequence of bytes. Python doesn’t know what its encoding is. The unicode type is the safer way to store text. If you want to understand this more, I recommend http://farmdev.com/talks/unicode/.

>>> type('abc')    # Python 3 unicode string literal
<class 'str'>
>>> type(b'abc')   # Python 3 byte string literal
<class 'bytes'>

In Python 3, str is like Python 2's unicode, and is used to store text. What was called str in Python 2 is called bytes in Python 3.


How to tell if a byte string is valid utf-8 or ascii

You can call decode. If it raises a UnicodeDecodeError exception, the byte string wasn't valid in that encoding.

>>> u_umlaut = b'\xc3\x9c'   # UTF-8 representation of the letter 'Ü'
>>> u_umlaut.decode('utf-8')
u'\xdc'
>>> u_umlaut.decode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

Just for other people's reference: str.decode doesn't exist in Python 3. Looks like you have to use unicode(s, "ascii") or something.

@ProsperousHeart Updated to cover Python 3. And to try to explain the difference between bytestrings and unicode strings.

The decode() method's default encoding is 'utf-8'. So if you call it on a bytes object without arguments, for example print("utf8 content:", html.decode()), it will decode as UTF-8.

In Python 3.x, all strings are sequences of Unicode characters, so an isinstance check against str (which means a Unicode string) should suffice.

With regards to Python 2.x, most people seem to use an if statement with two checks: one for str and one for unicode.

If you want to check whether you have a 'string-like' object with a single statement, though, you can do the following:
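Judging from the comment below, the statement in question is the basestring check; a minimal sketch (Python 2 only, since basestring was removed in Python 3):

# Python 2 only: basestring is the common ancestor of str and unicode
if isinstance(s, basestring):
    print("string-like")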

@PythonNut: I believe that was the point. The use of isinstance(x, basestring) suffices to replace the distinct dual tests above.

This is the answer to the question. All others misunderstood what OP said and gave generic answers about type checking in Python.

Doesn’t answer OP’s question. The title of the question (alone) COULD be interpreted such that this answer is correct. However, OP specifically says «figure out which» in the question’s description, and this answer does not address that.

Unicode is not an encoding — to quote Kumar McMillan:

If ASCII, UTF-8, and other byte strings are "text" ...

... then Unicode is "text-ness";

it is the abstract form of text

Have a read of McMillan's Unicode In Python, Completely Demystified talk from PyCon 2008; it explains things a lot better than most of the related answers on Stack Overflow.

If your code needs to be compatible with both Python 2 and Python 3, you can't directly use things like isinstance(s, bytes) or isinstance(s, unicode) without wrapping them in either try/except or a Python version test, because unicode is undefined in Python 3 and, in Python 2, bytes is just an alias for str.

There are some ugly workarounds. An extremely ugly one is to compare the name of the type, instead of comparing the type itself. Here’s an example:

# convert bytes (python 3) or unicode (python 2) to str
if str(type(s)) == "<class 'bytes'>":
    # only possible in Python 3
    s = s.decode('ascii')  # or s = str(s)[2:-1]
elif str(type(s)) == "<type 'unicode'>":
    # only possible in Python 2
    s = str(s)

An arguably slightly less ugly workaround is to check the Python version number, e.g.:

import sys

if sys.version_info >= (3, 0, 0):
    # for Python 3
    if isinstance(s, bytes):
        s = s.decode('ascii')  # or s = str(s)[2:-1]
else:
    # for Python 2
    if isinstance(s, unicode):
        s = str(s)

Those are both unpythonic, and most of the time there’s probably a better way.
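One commonly used better way, sketched here as an illustration rather than a canonical recipe (the names text_type, binary_type and ensure_text are mine; the six library ships similar helpers), is to define the aliases once and then use plain isinstance checks everywhere else. Note this sketch normalizes everything to text, which is the more usual direction, rather than reproducing the str(s) conversion shown above:

import sys

if sys.version_info[0] >= 3:
    text_type = str
    binary_type = bytes
else:
    text_type = unicode   # only defined on Python 2
    binary_type = str

def ensure_text(s, encoding='ascii'):
    # Decode byte strings; pass text through unchanged
    if isinstance(s, binary_type):
        return s.decode(encoding)
    return s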


How to detect string byte encoding?

I’ve got about 1000 filenames read by os.listdir() , some of them are encoded in UTF8 and some are CP1252. I want to decode all of them to Unicode for further processing in my script. Is there a way to get the source encoding to correctly decode into Unicode? Example:

for item in os.listdir(rootPath):
    # Convert to Unicode
    if isinstance(item, str):
        item = item.decode('cp1252')  # or item = item.decode('utf-8')
    print item


Use the chardet library. It is super easy:

import chardet

the_encoding = chardet.detect('your string')['encoding']

In Python 3 you need to provide bytes or a bytearray, so:

import chardet

the_encoding = chardet.detect(b'your string')['encoding']

Seems to me it doesn't work. I created a string variable and encoded it as UTF-8; chardet returned the TIS-620 encoding.

I found that cchardet appears to be the current name for this or a similar library; chardet was not findable.

A bit confused here. It seems like it isn't possible to provide a str object as an argument. Only b'your string' works for me, or directly providing a bytes variable.

The problem with this answer for me is that some cp1252/latin1 byte sequences can be interpreted as technically valid UTF-8, which leads to the wrong character being produced where an accented letter like 'ê' was intended. chardet seems to try utf8 first, which results in this. There may be a way to tell it which order to use, but lucemia's answer worked better for me.

If your files are all either in cp1252 or utf-8, then there is an easy way.

import logging

def force_decode(string, codecs=['utf8', 'cp1252']):
    for i in codecs:
        try:
            return string.decode(i)
        except UnicodeDecodeError:
            pass
    logging.warn("cannot decode url %s" % ([string]))

for item in os.listdir(rootPath):
    # Convert to Unicode
    if isinstance(item, str):
        item = force_decode(item)
    print item

Otherwise, there is a charset detection lib (chardet, mentioned above).

You can also use the json package to detect encoding; note that json.detect_encoding only distinguishes the UTF-8/UTF-16/UTF-32 family defined by the JSON spec.

import json

json.detect_encoding(b"Hello")

charset_normalizer is a drop-in replacement for chardet.

It works better on natural language and has a permissive MIT licence: https://github.com/Ousret/charset_normalizer/

from charset_normalizer import detect

encoding = detect(byte_string)['encoding']

PS: This is not strictly related to the original question, but this page comes up in Google a lot.

The encoding detected by chardet can be used to decode a bytearray without any exception, but the output string may not be correct.

The try ... except ... way works perfectly for known encodings, but it does not work for all scenarios.

We can use try ... except ... first and then chardet as plan B:

from typing import List

import chardet

def decode(byte_array: bytearray, preferred_encodings: List[str] = None):
    if preferred_encodings is None:
        preferred_encodings = [
            'utf8',    # Works for most cases
            'cp1252',  # Other encodings may appear in your project
        ]
    for encoding in preferred_encodings:
        # Try preferred encodings first
        try:
            return byte_array.decode(encoding)
        except UnicodeDecodeError:
            pass
    else:
        # Use detected encoding
        encoding = chardet.detect(byte_array)['encoding']
        return byte_array.decode(encoding)


Can I detect the text codec used in a string?

No, there is no such function, because files do not record what codec was used to write the text they contain.

If there is more context (like a more specific format such as HTML or XML) then you can determine the codec because the standard specifies a default or allows for annotating the data with the codec, but otherwise you are reduced to guessing based on the contents (which is what tools like chardet do).
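For the HTML/XML case, a crude sketch of what reading that annotation might look like; a real parser (lxml, html5lib, etc.) does this far more robustly, and the function name and regexes below are only illustrative:

import re

def sniff_declared_encoding(raw_bytes):
    # Look for an XML declaration or an HTML meta charset near the start of the data.
    head = raw_bytes[:1024].decode('ascii', errors='replace')
    match = re.search(r'''encoding=["']([-\w]+)''', head)   # <?xml ... encoding="utf-8"?>
    if not match:
        match = re.search(r'''charset=["']?([-\w]+)''', head)   # <meta charset="utf-8"> and friends
    return match.group(1) if match else None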

For a file that anyone can modify, you have no hope but to document clearly what codec should be used.

I also thought there was more context, but there isn't. Most of the strings (about 5k) are in UTF-8, but for some strings I get a UnicodeDecodeError. 🙁

@alabamajack: the best you can do then is to use an error mode that either ignores such errors or replaces undecodable bytes with replacement characters (? or �).
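For illustration, reusing the UTF-8 bytes for 'Ü' from earlier on this page (Python 2 style output, to match the rest of the thread; the errors argument is passed positionally so it also works on Python 2):

>>> b'\xc3\x9c'.decode('ascii', 'replace')
u'\ufffd\ufffd'
>>> b'\xc3\x9c'.decode('ascii', 'ignore')
u''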

Interestingly, there are systems that do record the encoding with every file (like IBM midrange). But of course, if they interact at all with the "outside world", they may receive files with no encoding information, or may send files to other systems that don't honor the encoding information provided.

You can use the 3rd-party chardet module.

>>> import chardet
>>> chardet.detect(b'\xed\x95\x9c\xea\xb8\x80')  # u'한글'.encode('utf-8')
>>> chardet.detect(b'\xc7\xd1\xb1\xdb')          # u'한글'.encode('euc-kr')

NOTE: chardet is not foolproof, and if a file is small enough it can easily guess wrong.

Nice idea, but I can't use this, because the Python installation would then have to be adapted manually on so many PCs.

If you cannot use chardet and have no chance of specifying the encoding in advance, I think your only remaining recourse is to simply guess at it. You could do something like this:

# Add whichever you want to the list, but only end it in a codec like latin1 that never fails
codecs = ["utf-8", "euc-kr", "shift-jis", "latin1"]

def try_decode(text):
    for codec in codecs:
        try:
            return text.decode(codec)
        except UnicodeError:
            continue
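As a usage illustration (my own call, not part of the original answer), feeding it the UTF-8 bytes for 'Ü' used earlier on this page makes the first codec in the list succeed:

>>> try_decode(b'\xc3\x9c')
u'\xdc'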

Certainly, but since you do, indeed, not know it, I hardly think there’s any other way to do it. I mean, this is basically what chardet does as well, only with a bit more sophistication.

I must say I was a little surprised that there is any codec that «never fails», but after a little experimentation, it does appear that you can always decode with latin1 (at least on my Windows PC with Python 2.7). I think it’s worth noting, particularly for those who are not 100% solid on character encoding issues (which frankly is most of us!) that while the latin1 codec always «succeeds» in creating a Unicode string, that string could be complete garbage. It can happen that we would have been better off using, say, shift-jis and just ignoring/replacing a few bytes here and there.

@JohnY: Just for the record, there are tons of codecs that can never fail, since they simply map every possible byte to one, single, unique Unicode character. Common examples include the ISO-8859 encodings (Latin1 is ISO-8859-1), KOI-8, CP437, CP850 and CP1252, but there are many others also in common use.
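A quick way to convince yourself of that (a throwaway Python 2 style check, not from the original comment):

>>> all_bytes = b''.join(chr(i) for i in range(256))   # every possible byte value
>>> len(all_bytes.decode('latin1'))                    # latin1 maps each byte to a character
256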

