Python chardet: what is it?

chardet package

We define three types of bytes:

alphabet: English alphabet characters [a-zA-Z]
international: international characters [€-ÿ]
marker: everything else [^a-zA-Z€-ÿ]

The input buffer can be thought of as containing a series of words delimited by markers. This function retains only the words that contain at least one international character, and replaces every contiguous sequence of markers with a single ASCII space character. This filter applies to all scripts which do not use English characters.

get_confidence() [source]

static remove_xml_tags(buf) [source]

Returns a copy of buf that retains only the sequences of English alphabet and high byte characters that are not between <> characters. This filter can be applied to all scripts which contain both English characters and extended ASCII characters, but is currently only used by Latin1Prober .

chardet.codingstatemachine module

A state machine to verify a byte sequence for a particular encoding. For each byte the detector receives, it will feed that byte to every active state machine available, one byte at a time. The state machine changes its state based on its previous state and the byte it receives. There are 3 states in a state machine that are of interest to an auto-detector:

START state: the starting state, also reached whenever a legal byte sequence (i.e. a valid code point) for a character has been identified.

ME state: the state machine has identified a byte sequence that is specific to the charset it is designed for, and no other possible encoding can contain this byte sequence. This leads to an immediate positive answer from the detector.

ERROR state: the state machine has identified an illegal byte sequence for that encoding. This leads to an immediate negative answer for this encoding; the detector will exclude this encoding from consideration from here on.

get_coding_state_machine() [source]

get_current_charlen() [source]

language

next_state(c) [source]

reset() [source]
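To make the three states concrete, here is a small sketch that drives the UTF-8 state machine by hand. It relies on chardet's internal modules (chardet.codingstatemachine, chardet.enums, chardet.mbcssm, as laid out in recent chardet releases); these are not part of the public API and may change:

```python
from chardet.codingstatemachine import CodingStateMachine
from chardet.enums import MachineState
from chardet.mbcssm import UTF8_SM_MODEL

# Feed the machine a legal two-byte UTF-8 sequence: "é" -> 0xC3 0xA9
sm = CodingStateMachine(UTF8_SM_MODEL)
for byte in "é".encode("utf-8"):
    state_ok = sm.next_state(byte)
# A complete, valid code point brings the machine back to START
print(state_ok == MachineState.START)

# Now feed an illegal sequence: a two-byte lead followed by plain ASCII
sm.reset()
sm.next_state(0xC3)               # lead byte announces one continuation byte
state_bad = sm.next_state(0x41)   # "A" is not a valid continuation byte
print(state_bad == MachineState.ERROR)
```

Both prints should show True: the first sequence is legal, the second is not.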



Character Encodings and Detection with Python, chardet, and cchardet

If your name is José, you are in good company. José is a very common name. Yet, when dealing with text files, sometimes José will appear as José, or as some other mangled array of symbols and letters. Or, in some cases, Python will fail to convert the file to text at all, complaining with a UnicodeDecodeError. Unless you are dealing only with numerical data, any data jockey or software developer needs to face the problem of encoding and decoding characters.

Why encodings?

Ever heard or asked the question, "why do we need character encodings?" Indeed, character encodings cause heaps of confusion for software developers and end users alike. But ponder for a moment, and we all have to admit that the "do we need character encoding?" question is nonsensical. If you are dealing with text and computers, then there has to be encoding. The letter "a", for instance, must be recorded and processed like everything else: as a byte (or multiple bytes). Most likely (but not necessarily), your text editor or terminal will encode "a" as the number 97. Without the encoding, you aren't dealing with text and strings. Just bytes.

Encoding and decoding

Think of character encoding like a top secret substitution cipher, in which every letter has a corresponding number when encoded. No one will ever figure it out!

a: 61 g: 67 m: 6d s: 73 y: 79
b: 62 h: 68 n: 6e t: 74 z: 7a
c: 63 i: 69 o: 6f u: 75
d: 64 j: 6a p: 70 v: 76
e: 65 k: 6b q: 71 w: 77
f: 66 l: 6c r: 72 x: 78

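The print statement referred to in the next paragraph appears to have been dropped from the page; it was presumably something along these lines:

```python
# Four bytes, written as hexadecimal escape codes
print(b"\x73\x70\x61\x6d")  # b'spam'
```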
The above 4 character codes are hexadecimal: 73, 70, 61, 6d (the escape code \x is Python's way of designating a hexadecimal literal character code). In decimal, that's 115, 112, 97, and 109. Try the above print statement in a Python console or script and you should see our beloved "spam". It was automatically decoded in the Python console, printing the corresponding letters (characters). But let's be more explicit, creating a byte string of the above numbers, and specifying the ASCII encoding:
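The code block seems to have gone missing here as well; an explicit version, assuming the original simply decoded the byte string as ASCII, would be:

```python
# Build the byte string from the four codes and decode it as ASCII
word = b"\x73\x70\x61\x6d".decode("ascii")
print(word)  # spam
```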

Again, "spam". A canned response, if I ever heard one. We are encoding and decoding! There you have it.

The complex and beautiful world beyond ASCII

What happens, however, with our dear friend José? In other words, what is the number corresponding to the letter «é»? Depends on the encoding. Let’s try number 233 (hexadecimal e9), as somebody told us that might work:

b"\x4a\x6f\x73\xe9".decode("ascii") UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 3: ordinal not in range(128) 

That didn’t go over well. The error complains that 233 is not in the 0-127 range that ASCII uses. No problem. We heard of this thing called Unicode, specifically UTF-8. One encoding to rule them all! We can just use that:

b"\x4a\x6f\x73\xe9".decode("utf-8") UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 3: unexpected end of data 

Still, no dice! After much experimentation, we find the ISO-8859-1 encoding. This is a Latin (i.e. European-derived) character set, but it works in this case, as the letters in "José" are all Latin.

b"\x4a\x6f\x73\xe9".decode("iso-8859-1") 'José' 

A Cute Pig

So nice to have our friend back in one piece. ISO-8859-1 works if all you speak is Latin. That is not José. It is a picture of another friend, who speaks Latin.

UTF-8 is our friend

Once upon a time, everyone spoke "American" and character encoding was a simple translation of 127 characters to codes and back again (the ASCII character encoding, a subset of which is demonstrated above). The problem is, of course, that if this situation ever did exist, it was the result of a then U.S.-dominated computer industry, or simple short-sightedness, to put it kindly (ethnocentric and complacent may be more descriptive and accurate, if less gracious). Reality is much more complex. And, thankfully, the world is full of a wide range of people and languages.

Good thing that Unicode has happened, and there are character encodings that can represent the wide range of characters used around the world. You can see non-ASCII names such as "Miloš" and "María", as well as 张伟. One of these encodings, UTF-8, is common. It is used on this web page, and is the default encoding since Python version 3.

With UTF-8, a character may be encoded as a 1, 2, 3, or 4-byte number. This covers a wealth of characters, including ♲, 水, Ж, and even 😀. UTF-8, being variable width, is even backwards compatible with ASCII. In other words, "a" is still encoded to a one-byte number 97.
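The variable-width claim is easy to verify in a Python console:

```python
# UTF-8 uses 1 to 4 bytes per character, and ASCII survives unchanged
print("a".encode("utf-8"))        # b'a' (one byte, value 97)
print(len("é".encode("utf-8")))   # 2
print(len("水".encode("utf-8")))  # 3
print(len("😀".encode("utf-8")))  # 4
```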

Character encoding detection

While ubiquitous, UTF-8 is not the only character encoding. As José so clearly discovered above. For instance, dear Microsoft Excel often saves CSV files in a Latin encoding (unless you have a newer version and explicitly select UTF-8 CSV). How do we know what to use?

The easiest way is to have someone decide, and communicate clearly. If you are the one doing the encoding, select an appropriate version of Unicode, UTF-8 if you can. Then always decode with UTF-8. This is usually the default in Python since version 3.

If you are saving a CSV file from Microsoft Excel, know that the "CSV UTF-8" format uses the character encoding "utf-8-sig" (a byte order mark, or BOM, is written at the start of the file to designate UTF-8). If using the more traditional and painful Microsoft Excel CSV format, the character encoding is likely "cp1252", which is a Latin encoding.

Don't know? Ask. But what happens if the answer is "I don't know"? Or, more commonly, "we don't use character encoding" (🤦). Or even "probably Unicode?" These all should be interpreted as "I don't know."
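You can see the BOM behavior of "utf-8-sig" directly:

```python
# Excel's "CSV UTF-8" format prepends a byte order mark (BOM) to the file
data = "José".encode("utf-8-sig")
print(data)  # b'\xef\xbb\xbfJos\xc3\xa9' -- the first three bytes are the BOM
# Decoding with utf-8-sig strips the BOM back off; decoding with plain
# utf-8 would leave an invisible \ufeff character at the front of the string
print(data.decode("utf-8-sig"))  # José
```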

If you do not know what the character encoding is for a file you need to handle in Python, then try chardet.
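The install command referenced in the next sentence appears to have been lost from the page; it was presumably something like:

```shell
# Install chardet into the active virtual environment
pip install chardet
```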

Use something like the above to install it in your Python virtual environment. Character detection with chardet works something like this:

import chardet

name = b"\x4a\x6f\x73\xe9"
detection = chardet.detect(name)
print(detection)
encoding = detection["encoding"]
print(name.decode(encoding))

That may have worked for you, especially if the name variable contains a lot of text with many non-ASCII characters. In this case, it works on my machine with just "José", but chardet cannot be very confident, and it might get it wrong in other similar situations. Summary: give it plenty of data, if you can. Even b'Jos\xe9 Gonz\xe1lez' will result in more accuracy. Did you see, in response to print(detection), that there is a confidence level? That can be helpful.

Two ways to use character detection

There are two ways I might use the chardet library.

First, I could use chardet.detect() in a one-off fashion on a text file, to determine once and for all what the character encoding will be on subsequent engagements. Let's say there is a source system that always exports a CSV file with the same character encoding. When I contact the ever-helpful support line, they kindly inform me that they have no clue what character encoding even is, so I know I am left to my own devices. Good thing the device I have is chardet. I use it on a large source file, and determine that the encoding is cp1252 (no big surprise). Then I write my code to always use with open("filename.csv", encoding="cp1252") as filehandle: and go on my merry way. I don't need character detection anymore.

The second scenario is more complex. What if I am creating a tool to handle arbitrary text files, and I will never know in advance what the character encoding is? In these cases, I will always want to import chardet and then use chardet.detect(). I may want to throw an error or warning, though, if the confidence level is below a certain threshold. If confident, I will use the suggested encoding when opening and reading the file.
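For the second scenario, the threshold check might look like the following sketch. The helper name detect_or_fail and the 0.6 cutoff are my own illustrative choices, not part of chardet:

```python
import chardet


def detect_or_fail(raw: bytes, min_confidence: float = 0.6) -> str:
    """Return the detected encoding, or raise if chardet is not confident enough.

    Hypothetical helper; the name and threshold are illustrative only.
    """
    detection = chardet.detect(raw)
    confidence = detection["confidence"] or 0.0
    if confidence < min_confidence:
        raise ValueError(f"Detection too uncertain to trust: {detection}")
    return detection["encoding"]


# Plenty of plain ASCII text: chardet reports 'ascii' with full confidence
print(detect_or_fail(b"Hello, world! " * 10))
```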

cchardet, the crazy-fast Python character detection library

In the second scenario above, I may appreciate a performance boost, especially if it is an operation that is repeated frequently. Enter cchardet, a faster chardet. It is a drop-in replacement: install it (for example, with pip install cchardet), then alias the import so the rest of your code is unchanged:

import cchardet as chardet 

A simple command line tool

"""A tool for reading text files with an unknown encoding.""" from pathlib import Path import sys import cchardet as chardet def read_confidently(filename): """Detect encoding and return decoded text, encoding, and confidence level.""" filepath = Path(filename) # We must read as binary (bytes) because we don't yet know encoding blob = filepath.read_bytes() detection = chardet.detect(blob) encoding = detection["encoding"] confidence = detection["confidence"] text = blob.decode(encoding) return text, encoding, confidence def main(): """Command runner.""" filename = sys.argv[1] # assume first command line argument is filename text, encoding, confidence = read_confidently(filename) print(text) print(f"Encoding was detected as encoding>.") if confidence  0.6: print(f"Warning: confidence was only confidence>!") print("Please double-check output for accuracy.") if __name__ == "__main__": main() 

You can also download this code from Github here. Place the above in an appropriate directory, along with a text file. Then, from the terminal, in that directory, something like the following (use python instead of python3 if necessary) should work:
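The command itself seems to have dropped out of the page; assuming the script above is saved as detect.py and there is a text file alongside it (the filename here is just an example), it would be something like:

```shell
python3 detect.py somefile.txt
```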

Do you see output and detected encoding? I welcome comments below. Feel free to suggest additional use cases, problems you encounter, or affirmation of the cute pig picture above. You are welcome to view and test the code along with some text file samples at the associated Github repo. Some variation of the following should get you up and running:

git clone https://github.com/bowmanjd/python-chardet-example.git cd python-chardet-example/ python3 -m venv .venv . .venv/bin/activate pip install cchardet python detect.py sample-latin1.csv 

