Encoding in BeautifulSoup
Character encoding plays a major role in how the content of an HTML or XML document is interpreted. A document may contain not only English characters but also non-English characters such as Hebrew, Latin, Greek, and many more. To tell the parser which encoding to use, a document carries a dedicated tag and attribute that specify it. For example:
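In HTML documents, typically a meta tag: <meta charset="UTF-8">
In XML documents, typically the XML declaration: <?xml version="1.0" encoding="UTF-8"?>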
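As a rough illustration of how such declarations are read, the sketch below asks bs4's EncodingDetector helper for the encoding each document declares. The two sample documents and the variable names are made up for this example, and the call only reads the declaration; it does not decode anything.

```python
from bs4.dammit import EncodingDetector

# Illustrative documents (not from the article): one HTML, one XML.
html_doc = b'<html><head><meta charset="ISO-8859-1"></head><body></body></html>'
xml_doc = b'<?xml version="1.0" encoding="UTF-8"?><root/>'

# is_html=True makes the detector also look for a <meta> charset declaration.
html_declared = EncodingDetector.find_declared_encoding(html_doc, is_html=True)

# Without is_html, only the XML declaration is checked.
xml_declared = EncodingDetector.find_declared_encoding(xml_doc)

print(html_declared)
print(xml_declared)
```

The declared name is only a claim made by the document; the bytes may still be in a different encoding.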
These tags tell the browser which encoding to use for parsing. If the proper encoding is not specified, the content is rendered incorrectly, sometimes with the replacement character ‘�’.
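The effect of decoding with the wrong encoding can be reproduced with plain Python strings. This is a minimal sketch using only the standard library; the sample word is an arbitrary choice.

```python
text = "café"

# UTF-8 bytes read as Latin-1: every byte maps to some character, so the
# result is silently wrong rather than an error.
misread = text.encode("utf-8").decode("latin-1")

# Latin-1 bytes read as UTF-8: the lone 0xE9 byte is not valid UTF-8, so
# errors="replace" substitutes U+FFFD, the replacement character.
replaced = text.encode("latin-1").decode("utf-8", errors="replace")

print(misread)   # cafÃ©
print(replaced)  # caf�
```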
XML encoding methods
The XML documents can be encoded in one of the formats listed below.
Among these, UTF-8 is the most common. UTF-16 uses two bytes for most characters, and documents with ‘0xx’ are encoded with it. Latin1 covers Western European characters.
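To make the size differences concrete, the sketch below encodes a few sample characters with Python's built-in codecs. The characters are arbitrary, and "utf-16-le" is used for the per-character sizes so that no byte order mark is counted.

```python
# Bytes needed per character in UTF-8 versus UTF-16.
for ch in ("A", "é", "€"):
    utf8 = ch.encode("utf-8")
    utf16 = ch.encode("utf-16-le")  # explicit byte order, so no BOM is added
    print(repr(ch), len(utf8), len(utf16))

# The plain "utf-16" codec prepends a two-byte byte order mark (BOM).
with_bom = "A".encode("utf-16")

# Latin-1 stores each character it covers in exactly one byte,
# but cannot represent characters outside its 256-character range.
latin1_len = len("é".encode("latin-1"))
try:
    "€".encode("latin-1")
    euro_fits_latin1 = True
except UnicodeEncodeError:
    euro_fits_latin1 = False

print(len(with_bom), latin1_len, euro_fits_latin1)
```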
HTML encoding methods
The HTML and HTML5 documents can be encoded by any one of the methods below.
For HTML5 documents, UTF-8 is the recommended encoding. ISO-8859-1 is mostly used with XHTML documents. Some encodings, such as UTF-7, UTF-32, BOCU-1, and CESU-8, are explicitly discouraged because they cause most characters to be replaced with the replacement character ‘�’.
BeautifulSoup and encoding
The BeautifulSoup module, usually imported from the bs4 package, makes HTML/XML parsing a cakewalk. It offers a rich set of methods: some select content by tag name or by an attribute present in the tag, some extract content based on the document hierarchy, some print content with the indentation HTML requires, and so on. The bs4 module auto-detects the encoding used in a document and converts the content to Unicode. The returned BeautifulSoup object has various attributes that give more information. However, the detection is sometimes wrong. So if the user knows the encoding, it is good practice to pass it as an argument. This article describes the various ways in which the encoding can be specified in the bs4 module.
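As a minimal sketch of passing a known encoding, the example below uses the from_encoding argument of the BeautifulSoup constructor. The markup is an illustrative Hebrew word stored as ISO-8859-8 bytes, and html.parser is chosen only because it ships with Python.

```python
from bs4 import BeautifulSoup

# Hebrew text encoded as ISO-8859-8, with no encoding declaration in the markup.
markup = b"<h1>\xed\xe5\xec\xf9</h1>"

# from_encoding tells Beautiful Soup which encoding to use instead of guessing.
soup = BeautifulSoup(markup, "html.parser", from_encoding="iso-8859-8")

heading = soup.h1.string
print(heading)
print(soup.original_encoding)
```

Without the hint, bytes like these give the detector little to go on, which is the kind of case where a wrong guess is likely.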
original_encoding
The bs4 module has a sub-library called Unicode, Dammit that detects a document's encoding and uses it to convert the content to Unicode. The original_encoding attribute returns the detected encoding.
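Unicode, Dammit can also be used on its own through the UnicodeDammit class. The sketch below is one possible usage, with a made-up byte string and a list of candidate encodings to try first; the list is an assumption for this example and is not required.

```python
from bs4.dammit import UnicodeDammit

# Latin-1 bytes: 0xE9 is "é" in that encoding.
raw = b"Sacr\xe9 bleu!"

# The second argument lists encodings to try before any guessing happens.
dammit = UnicodeDammit(raw, ["latin-1", "iso-8859-1"])

decoded = dammit.unicode_markup
print(decoded)
print(dammit.original_encoding)
```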
Given an HTML element, parse it and find the encoding used.
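One way to approach this task is sketched below. The markup is an illustrative UTF-8 document that declares its own encoding, and html.parser is used only because it needs no extra install.

```python
from bs4 import BeautifulSoup

# The markup must be passed as bytes for any detection to take place.
markup = (
    b'<html><head><meta charset="utf-8"></head>'
    b"<body><h1>Sacr\xc3\xa9 bleu!</h1></body></html>"
)

soup = BeautifulSoup(markup, "html.parser")

detected = soup.original_encoding
print(detected)
print(soup.h1.string)
```

If the markup is passed as an already decoded str instead of bytes, no detection takes place and original_encoding is None.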