Python HTML character encoding

Encoding in BeautifulSoup

Character encoding plays a major role in how the content of an HTML or XML document is interpreted. A document may contain not only English characters but also characters from other scripts and languages, such as Hebrew, Greek, or accented Latin letters. To tell the parser which encoding to use, a document carries a dedicated tag or attribute that declares it. For example:

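In HTML documents, a meta tag inside the head element: <meta charset="UTF-8">

In XML documents, the encoding attribute of the XML declaration: <?xml version="1.0" encoding="UTF-8"?>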

These declarations tell the browser which encoding to use when parsing the document. If the correct encoding is not specified, the content may be rendered incorrectly, sometimes with the replacement character ‘�’ standing in for bytes that could not be decoded.
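The effect of a wrong encoding can be reproduced with nothing but Python's built-in str and bytes types. The following is a minimal sketch; the sample word and the variable names are illustrative and do not come from any particular document.

```python
# Sample text with one non-ASCII character (illustrative choice).
text = "café"

# Case 1: UTF-8 bytes read as Latin-1. Every byte is valid Latin-1,
# so decoding succeeds but produces garbled text (mojibake).
utf8_bytes = text.encode("utf-8")
mojibake = utf8_bytes.decode("latin-1")
print(mojibake)  # cafÃ©

# Case 2: Latin-1 bytes read as UTF-8. The byte 0xE9 is not a complete
# UTF-8 sequence, so with errors="replace" it becomes U+FFFD.
latin1_bytes = text.encode("latin-1")
replaced = latin1_bytes.decode("utf-8", errors="replace")
print(replaced)  # caf�
```

Both failure modes match what a browser shows when the declared encoding and the actual encoding disagree.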

XML encoding methods

XML documents can be encoded in several formats, including UTF-8, UTF-16, and Latin1.

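Among these methods, UTF-8 is the most common. UTF-16 uses two bytes for most characters (four for characters outside the Basic Multilingual Plane), and UTF-16 documents usually begin with a byte order mark (the bytes 0xFE 0xFF or 0xFF 0xFE). Latin1 (ISO-8859-1) covers Western European characters.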
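To make the differences between these encodings concrete, the sketch below encodes a few sample characters with Python's standard codecs and compares the byte counts. The sample characters are an arbitrary choice for illustration.

```python
import codecs

# Illustrative samples: ASCII, Western European, Hebrew, and an emoji
# that lies outside the Basic Multilingual Plane.
samples = ["A", "é", "ש", "😀"]

# Map each character to (UTF-8 byte count, UTF-16 byte count).
# "utf-16-le" is used so that no byte order mark is counted.
sizes = {ch: (len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))) for ch in samples}
for ch, (n8, n16) in sizes.items():
    print(f"{ch!r}: UTF-8 = {n8} bytes, UTF-16 = {n16} bytes")

# The plain "utf-16" codec prepends a byte order mark in native byte order.
has_bom = "A".encode("utf-16").startswith(codecs.BOM_UTF16)
print("BOM present:", has_bom)

# Latin-1 uses one byte per character but cannot represent Hebrew.
latin1_ok = "é".encode("latin-1")
try:
    "ש".encode("latin-1")
    hebrew_fits_latin1 = True
except UnicodeEncodeError:
    hebrew_fits_latin1 = False
print("Hebrew fits in Latin-1:", hebrew_fits_latin1)
```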

HTML encoding methods

HTML and HTML5 documents can be encoded with any of a number of methods, including UTF-8 and ISO-8859-1.

For HTML5 documents, UTF-8 is the recommended encoding. ISO-8859-1 is mostly used with XHTML documents. Some encodings, such as UTF-7, UTF-32, BOCU-1, and CESU-8, are explicitly discouraged, since content in these encodings can end up displayed mostly as the replacement character ‘�’.

BeautifulSoup and encoding

The BeautifulSoup library, imported from the bs4 package, makes HTML and XML parsing straightforward. It offers a rich set of methods: some select content by tag name or by an attribute present in the tag, some extract content based on the document hierarchy, some print content with the indentation expected for HTML, and so on. The bs4 module auto-detects the encoding used in a document and converts the content to Unicode. The returned BeautifulSoup object has various attributes that give more information about the document. However, the detection is sometimes wrong. So if the user already knows the encoding, it is good practice to pass it as an argument. This article covers the ways an encoding can be specified in the bs4 module.
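When the encoding is known in advance, it can be passed through the from_encoding argument of the BeautifulSoup constructor. The sketch below assumes bs4 is installed and uses the built-in html.parser; the Hebrew sample text and the ISO-8859-8 encoding are illustrative choices.

```python
from bs4 import BeautifulSoup

# Illustrative sample: a Hebrew word stored as ISO-8859-8 bytes.
text = "שלום"
markup = f"<h1>{text}</h1>".encode("iso-8859-8")

# Passing from_encoding tells bs4 which encoding to use instead of guessing.
soup = BeautifulSoup(markup, "html.parser", from_encoding="iso-8859-8")

heading = soup.h1.string
print(heading)
print(soup.original_encoding)
```

Note that from_encoding only makes sense for bytes input. If the markup is already a Python str, there is nothing left to decode.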

original_encoding

The bs4 module includes a sub-library called Unicode, Dammit that detects a document's encoding and uses it to convert the content to Unicode. The original_encoding attribute of the BeautifulSoup object returns the detected encoding.
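Unicode, Dammit can also be used on its own through the UnicodeDammit class, without building a full BeautifulSoup object. The following sketch assumes bs4 is installed; the byte string and the list of candidate encodings are illustrative.

```python
from bs4 import UnicodeDammit

# Illustrative bytes: "Sacré bleu!" encoded as Latin-1.
raw = b"Sacr\xe9 bleu!"

# The second argument lists encodings to try first, in order.
dammit = UnicodeDammit(raw, ["latin-1", "iso-8859-1"])

converted = dammit.unicode_markup
print(converted)
print(dammit.original_encoding)
```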

Given an HTML element, parse it and find the encoding method used.
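A minimal sketch of this task is shown below, assuming bs4 is installed and using the built-in html.parser. The sample markup is made up for illustration. The input must be bytes, because detection only happens when bs4 has to decode the data itself.

```python
from bs4 import BeautifulSoup

# Illustrative document that declares its encoding in a meta tag.
html_text = '<html><head><meta charset="utf-8"></head><body><p>café</p></body></html>'
html_bytes = html_text.encode("utf-8")

# Parsing bytes triggers encoding detection.
soup = BeautifulSoup(html_bytes, "html.parser")
detected = soup.original_encoding
print(detected)
print(soup.p.string)

# Parsing an already-decoded str skips detection entirely.
soup_from_str = BeautifulSoup(html_text, "html.parser")
print(soup_from_str.original_encoding)  # None
```

For this input the detected value is expected to match the declared utf-8. For documents without a declaration, the result depends on the detection heuristics and on whether an optional detection library is installed.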
