Html to unicode python

Decoding HTML Entities to Text in Python

A while ago, I had to import some HTML into a Python script and found out that—while there is cgi.escape() for encoding to HTML—there did not seem to be an easy or well-documented way for decoding HTML entities in Python.

Turns out, there are at least three ways of doing it, and which one you use probably depends on your particular app’s needs.

1) Overkill: BeautifulSoup

BeautifulSoup is an HTML parser that will also decode entities for you, like this:

soup = BeautifulSoup(html, convertEntities=BeautifulSoup.HTML_ENTITIES)

The advantage is its fault-tolerance. If your input document is malformed, it will do its best to extract a meaningful DOM tree from it. The disadvantage is, if you just have a few short strings to convert, introducing the dependency on an entire HTML parsing library into your project seems overkill.

2) Duct Tape: htmlentitydefs

Python comes with a list of known HTML entity names and their corresponding unicode codepoints. You can use that together with a simple regex to replace entities with unicode characters:

import htmlentitydefs, re
mystring = re.sub('&([^;]+);', lambda m: unichr(htmlentitydefs.name2codepoint[m.group(1)]), mystring)
print mystring.encode('utf-8')

Of course, this works. But I hear you saying, how in the world is this not in the standard library? And the geeks among you have also noticed that this will not work with numerical entities. While &copy; will give you ©, &#169; will fail miserably. If you’re handling random, user-entered HTML, this is not a great option.


3) Standard library to the rescue: HTMLParser

After all this, I’ll give you the option I like best. The standard lib’s very own HTMLParser has an undocumented function unescape() which does exactly what you think it does:

>>> import HTMLParser
>>> h = HTMLParser.HTMLParser()
>>> s = h.unescape('&copy; 2010')
>>> s
u'\xa9 2010'
>>> print s
© 2010
>>> s = h.unescape('&#169; 2010')
>>> s
u'\xa9 2010'

So unless you need the advanced parsing capabilities of BeautifulSoup or want to show off your mad regex skills, this might be your best bet for squeezing unicode out of HTML snippets in Python.


Python — HTML to Unicode


I have a python script where I am getting some html and parsing it using beautiful soup. In the HTML sometimes there are no unicode characters and it causes errors with my script and the file I am creating.

Here is how I am getting the HTML

html = urllib2.urlopen(url).read().replace('&nbsp;', "")
xml = etree.HTML(html)

When I use this:

html = urllib2.urlopen(url).read().encode('ascii', 'xmlcharrefreplace')

I get a UnicodeDecodeError.

How could I change this into unicode. So if there are non unicode characters, my code won’t break.

html = urllib2.urlopen(url).read().encode('ascii', 'xmlcharrefreplace') 

I get an error UnicodeDecodeError. How could I change this into unicode.

You have bytes and you want unicode characters, so the method for that is decode . As you have used encode , Python thinks you want to go from characters to bytes, so tries to convert the bytes to characters so they can be turned back to bytes! It uses the default encoding for this, which in your case is ASCII, so it fails for non-ASCII bytes.
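The direction of the two methods can be illustrated with a minimal Python 3 sketch. The byte string below is a made-up stand-in for what a network read would return, and it assumes the page is UTF-8 encoded. Note that Python 3 has no implicit ASCII step: calling encode on bytes simply raises AttributeError there.

```python
# Hypothetical response body: the UTF-8 bytes for "café".
raw = b"caf\xc3\xa9"

# bytes -> Unicode characters: decode
text = raw.decode("utf-8")

# Unicode characters -> bytes: encode. With the 'xmlcharrefreplace' error
# handler, characters outside ASCII become numeric character references.
ascii_html = text.encode("ascii", "xmlcharrefreplace")

print(text)        # café
print(ascii_html)  # b'caf&#233;'
```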

However it is unclear why you want to do this. etree parses bytes as-is. If you want to remove character U+00A0 Non Breaking Space from your data you should do that with the extracted content you get after HTML parsing, rather than try to grapple with the HTML source version. HTML markup might include U+00A0 as raw bytes, incorrectly-unterminated entity references, numeric character references and so on. Let the HTML parser handle that for you, it’s what it’s good at.

If you feed HTML to BeautifulSoup, it will decode it to Unicode. If the charset declaration is wrong or missing, or parts of the document are encoded differently, this might fail; there is a special module which comes with BeautifulSoup, dammit , which might help you with these documents.

If you mention BeautifulSoup, why don’t you do it like this:

from bs4 import BeautifulSoup
soup = BeautifulSoup(urllib2.urlopen(url).read())

and work with the soup? BTW, all HTML entities will be resolved to unicode characters.

The ascii character set is very limited and might lack many characters in your document. I’d use utf-8 instead whenever possible.


Convert XML/HTML Entities into Unicode String in Python [duplicate]

I’m doing some web scraping and sites frequently use HTML entities to represent non ascii characters. Does Python have a utility that takes a string with HTML entities and returns a unicode type?

Here is an example: &#x01ce; which represents an «ǎ» with a tone mark. In binary, this is represented as the 16 bit 01ce. I want to convert the html entity into the value u'\u01ce'

The standard lib’s very own HTMLParser has an undocumented function unescape() which does exactly what you think it does:

In Python 2:

import HTMLParser
h = HTMLParser.HTMLParser()
h.unescape('&copy; 2010')  # u'\xa9 2010'
h.unescape('&#169; 2010')  # u'\xa9 2010'

In Python 3.4 and later:

import html
html.unescape('&copy; 2010')  # '\xa9 2010'
html.unescape('&#169; 2010')  # '\xa9 2010'

Python has the htmlentitydefs module, but this doesn’t include a function to unescape HTML entities.

Python developer Fredrik Lundh (author of elementtree, among other things) has such a function on his website, which works with decimal, hex and named entities:

import re, htmlentitydefs

##
# Removes HTML or XML character references and entities from a text string.
#
# @param text The HTML (or XML) source text.
# @return The plain text, as a Unicode string, if necessary.

def unescape(text):
    def fixup(m):
        text = m.group(0)
        if text[:2] == "&#":
            # character reference
            try:
                if text[:3] == "&#x":
                    return unichr(int(text[3:-1], 16))
                else:
                    return unichr(int(text[2:-1]))
            except ValueError:
                pass
        else:
            # named entity
            try:
                text = unichr(htmlentitydefs.name2codepoint[text[1:-1]])
            except KeyError:
                pass
        return text # leave as is
    return re.sub("&#?\w+;", fixup, text)
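The function above targets Python 2. A rough Python 3 port might look like the sketch below; it is an adaptation, not Fredrik Lundh's original. It assumes `html.entities.name2codepoint` as the replacement for `htmlentitydefs` and `chr` as the replacement for `unichr`. Unknown or malformed references are left untouched.

```python
import re
from html.entities import name2codepoint


def unescape(text):
    """Replace named, decimal and hex character references in text."""
    def fixup(m):
        ref = m.group(0)
        if ref.startswith("&#"):
            # numeric character reference, e.g. &#169; or &#xa9;
            try:
                if ref[2] in "xX":
                    return chr(int(ref[3:-1], 16))
                return chr(int(ref[2:-1]))
            except (ValueError, OverflowError):
                return ref  # not a valid code point: leave as is
        try:
            # named entity, e.g. &copy;
            return chr(name2codepoint[ref[1:-1]])
        except KeyError:
            return ref  # unknown name: leave as is
    return re.sub(r"&#?\w+;", fixup, text)


print(unescape("&copy; &#169; &#xa9; &bogus;"))  # © © © &bogus;
```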

Use the builtin unichr — BeautifulSoup isn’t necessary:

>>> entity = '&#x01ce'
>>> unichr(int(entity[3:],16))
u'\u01ce'

If you are on Python 3.4 or newer, you can simply use the html.unescape :

import html
s = html.unescape(s)
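As a quick check of what `html.unescape` covers, the sketch below feeds it a named entity, a decimal reference and a hex reference. The sample strings are made up for illustration. A name it does not recognise is passed through unchanged.

```python
import html

named = html.unescape("&copy; 2010")      # named entity
decimal = html.unescape("&#169; 2010")    # decimal character reference
hexadecimal = html.unescape("&#x01ce;")   # hex character reference
unknown = html.unescape("&bogus;")        # not a known entity name

print(named, decimal, hexadecimal, unknown)
```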


Python: Convert String with Unicode to HTML numeric code

Hi guys, I’m looking for a solution to convert all the Unicode code points contained in a string to the corresponding HTML entities.

input: «This is \u+0024. a string with \u+0024. random \u+0024. unicode»
output: «This is &#36; a string with &#36; random &#36; unicode»

My current solution to this problem looks like:

if "\\u+" in my_string:
    unicode_code = (label_content.split("\\u+"))[1].split('.')[0]
    unicode_to_replace = f"\\u+{unicode_code}."
    unicode_string = f"U+{unicode_code}"
    html_code = unicode_string.encode('ascii', 'xmlcharrefreplace')
    my_string = label_content.replace(unicode_to_replace, html_code)

But the Unicode string is not converted in the right way, any suggestion?

I’d prefer applying Regular expression operations ( re module). The pattern variable covers

  • all valid Unicode values (see e.g. U+042F instead of the middle U+0024 ),
  • all syntax versions of the input string: input variable in the original question was edited three times ( with/without leading backslash and/or trailing dot), and
  • my_string variable in the OQ’s self answer is incorrect: ‘\u+0024’ raises the truncated \uXXXX escape error.

import re

def UPlusHtml(matchobj):
    # Turn a match such as "\u+0024." into the hex reference "&#x0024;".
    code = re.sub(r'\.$', '', matchobj.group(0))
    return re.sub(r"^\\?[uU]\+", '&#x', code) + ';'

def UPlusRepl(matchobj):
    # Turn a match such as "\u+0024." into the character itself ("$").
    code = re.sub(r'\.$', '', matchobj.group(0))
    return chr(int(re.sub(r"^\\?[uU]\+", '', code), 16))

pattern = r"(\\?[uU]\+[0-9a-fA-F]+\.?)"

input = "This is U+0024. a string with U+042f random U+0024. unicode"
print(input)
print(re.sub(pattern, UPlusHtml, input))
print(re.sub(pattern, UPlusRepl, input))
print('--')
my_string = "This is \\u+0024. a string with \\u+042F random \\u+0024. unicodes"
print(my_string)
print(re.sub(pattern, UPlusHtml, my_string))
print(re.sub(pattern, UPlusRepl, my_string))

Output : \SO\67105976.py

This is U+0024. a string with U+042f random U+0024. unicode
This is &#x0024; a string with &#x042f; random &#x0024; unicode
This is $ a string with Я random $ unicode
--
This is \u+0024. a string with \u+042F random \u+0024. unicodes
This is &#x0024; a string with &#x042F; random &#x0024; unicodes
This is $ a string with Я random $ unicodes

Please note that I’m a regex beginner myself, so I believe that there must exist a more efficient regex-based solution, without any doubt…
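One possible simplification, offered as a sketch rather than a definitive answer, is to capture the hex digits in a group so that a single substitution pass suffices. The names `CODE_POINT`, `to_decimal_refs` and `to_chars` are hypothetical, and the pattern assumes code points are written with four to six hex digits. The first helper emits decimal references such as &#36;, which is the form the question asked for.

```python
import re

# Optional backslash, "u+" or "U+", 4-6 hex digits, optional trailing dot.
CODE_POINT = re.compile(r"\\?[uU]\+([0-9a-fA-F]{4,6})\.?")


def to_decimal_refs(text):
    """Replace each code point marker with a decimal character reference."""
    return CODE_POINT.sub(lambda m: "&#%d;" % int(m.group(1), 16), text)


def to_chars(text):
    """Replace each code point marker with the character itself.

    chr() raises ValueError for values above 0x10FFFF.
    """
    return CODE_POINT.sub(lambda m: chr(int(m.group(1), 16)), text)


sample = r"This is \u+0024. a string with \u+042F random \u+0024. unicode"
print(to_decimal_refs(sample))  # This is &#36; a string with &#1071; random &#36; unicode
print(to_chars(sample))         # This is $ a string with Я random $ unicode
```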

Found a solution by myself, for anybody who’s interested in this. It differs a bit from what I asked: the output does not turn the code points into HTML entities, but converts them to the corresponding characters, because in my case this is better.

So the final portion of code looks like this:

# Example of an input string containing code points, formatted as in my input file.
# The backslashes are doubled here: a bare "\u+0024" raises the truncated \uXXXX escape error.
my_string = "This is \\u+0024. a string with \\u+0024. random \\u+0024. unicodes"

if "\\u+" in my_string:
    unicode_code = (my_string.split("\\u+"))[1].split('.')[0]
    unicode_to_replace = f"\\u+{unicode_code}."
    unicode = f"\\u{unicode_code}"
    # Where the escape sequence is converted to the actual character
    html_entity = unicode.encode('utf-8').decode('raw-unicode-escape')
    my_string = my_string.replace(unicode_to_replace, html_entity)

print(my_string)
# my_string is now: "This is $ a string with $ random $ unicodes"


Decoding HTML Entities With Python

The following Python code uses BeautifulStoneSoup to fetch the LibraryThing API information for Tolkien’s «The Children of Húrin».

import urllib2
from BeautifulSoup import BeautifulStoneSoup

URL = ("http://www.librarything.com/services/rest/1.0/"
       "?method=librarything.ck.getwork&id=1907912"
       "&apikey=2a2e596b887f554db2bbbf3b07ff812a")

soup = BeautifulStoneSoup(urllib2.urlopen(URL),
                          convertEntities=BeautifulStoneSoup.ALL_ENTITIES)
title_field = soup.find('field', attrs=)
print title_field.find('fact').string

Unfortunately, instead of ‘Húrin’, it prints out ‘HÃºrin’. This is obviously an encoding issue, but I can’t work out what I need to do to get the expected output. Help would be greatly appreciated.

In the source of the web page it looks like this: The Children of H&Atilde;&ordm;rin. So the encoding is already broken somewhere on their side before it even gets converted to XML.

If it’s a general issue with all the books and you need to work around it, this seems to work:

unicode(title_field.find('fact').string).encode("latin1").decode("utf-8") 

The web page may be lying about its encoding. The output looks like UTF-8. If you got a str at the end then you’ll need to decode it as UTF-8. If you have a unicode instead then you’ll need to encode as Latin-1 first.
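The round trip described above can be reproduced in a few lines of Python 3. This is only an illustration under the assumption that the text was UTF-8 misread as Latin-1; the variable names are made up.

```python
original = "Húrin"

# UTF-8 bytes misread as Latin-1 give the garbled form seen in the output.
mojibake = original.encode("utf-8").decode("latin-1")

# Reversing the mistake: back to the raw bytes, then decode them as UTF-8.
repaired = mojibake.encode("latin-1").decode("utf-8")

print(mojibake)  # HÃºrin
print(repaired)  # Húrin
```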
