Python requests content decode

python requests.get() returns improperly decoded text instead of UTF-8?

When the server's content type is Content-Type: text/html, requests.get() returns improperly decoded data. However, if the content type is explicitly Content-Type: text/html; charset=utf-8, it returns properly decoded data. Also, when we use urllib.urlopen(), it returns properly decoded data. Has anyone noticed this before? Why does requests.get() behave like this?

4 Answers

The "educated guess" mentioned above is probably just a check of the Content-Type header as sent by the server (quite a misleading use of "educated", imho).

For the response header Content-Type: text/html the result is ISO-8859-1 (the default for HTML4), regardless of any content analysis (the default for HTML5, by contrast, is UTF-8).

For response header Content-Type: text/html; charset=utf-8 the result is UTF-8.
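You can see this header-only behaviour directly with requests' own helper (a minimal sketch; requests.utils.get_encoding_from_headers is an internal utility, so treat this as illustration rather than a stable API):

import requests

# requests derives the encoding from the Content-Type header alone;
# for text/* with no charset parameter it falls back to ISO-8859-1
print(requests.utils.get_encoding_from_headers(
    {"content-type": "text/html"}))                 # ISO-8859-1
print(requests.utils.get_encoding_from_headers(
    {"content-type": "text/html; charset=utf-8"}))  # utf-8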

Luckily for us, requests uses the chardet library, and that usually works quite well (it is exposed as the requests.Response.apparent_encoding attribute), so you usually want to do:

r = requests.get("https://martin.slouf.name/")
# override encoding by real educated guess as provided by chardet
r.encoding = r.apparent_encoding
# access the data
r.text

The approach with r.encoding = r.apparent_encoding didn't work (é showed up as Ã©) for one web page, even though line 13 of its 374 lines was a <meta> tag declaring charset=UTF-8. However, changing to r.encoding = 'UTF-8' worked fine. One could write code to search r.text for a "Content-Type" / charset= entry, then set r.encoding before accessing r.text further. This would be clunky but more general than just setting the encoding to UTF-8.
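A rough version of that clunky-but-general idea might look like this (a sketch; the URL is hypothetical and the regex only catches the common charset=... declarations):

import re
import requests

r = requests.get("https://example.com/")  # hypothetical page

# look for a charset=... declaration in the raw bytes of the document
m = re.search(rb'charset=["\']?([A-Za-z0-9_-]+)', r.content)
if m:
    r.encoding = m.group(1).decode("ascii")
print(r.text[:200])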

Well, it is a guess after all ;). I suppose you realize that the r.apparent_encoding value is set by the chardet library, and of course it can guess wrong. You should also be aware that you should not access r.text before setting r.encoding to the desired value (using r.apparent_encoding or any other method). I recommend reading the chardet library docs (chardet.readthedocs.io/en/latest) if you are attempting to guess it your own way; they may offer the solution you seek.
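If you do want to guess it your own way, the chardet call itself is one line (a minimal sketch with a hypothetical URL):

import chardet
import requests

r = requests.get("https://example.com/")  # hypothetical page
guess = chardet.detect(r.content)  # e.g. {'encoding': 'utf-8', 'confidence': 0.99, ...}
print(guess)
r.encoding = guess["encoding"]
print(r.text[:200])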


OK. Note, re "should not access r.text before setting r.encoding to the desired value": some doc I looked at (and now can't find) gave the impression it is fine to repeatedly set different encodings and then access .text if you want to see the different decodings. But a doc I looked at just now implies that's not so. Re chardet, I see it has methods that would be less ad hoc than searching for a charset= entry. Thanks!

This was a great solution for me. I was using requests and Beautiful Soup to do web scraping. At first I thought the issue was with Beautiful Soup, and I was ready to dive into its documentation to figure out what it does with respect to UTF-8. Before that, though, I checked the string returned by .text on my response object. It had the badly decoded characters: in my case it looked like 19% ± 3%â\x96¼ for text that should actually be 19% ± 3%▼. encoding was "ISO-8859-1" and apparent_encoding was "UTF-8". Setting encoding to apparent_encoding and then getting text worked.
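For that scraping setup, the whole fix is one assignment before building the soup (a sketch assuming a UTF-8 page served with a bare text/html content type; the URL is made up):

import requests
from bs4 import BeautifulSoup

r = requests.get("https://example.com/stats")  # hypothetical page
# the server sent text/html with no charset, so r.encoding is ISO-8859-1; fix it first
r.encoding = r.apparent_encoding
soup = BeautifulSoup(r.text, "html.parser")
print(soup.get_text()[:200])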


Request returns bytes and I’m failing to decode them

Essentially I made a request to a website and got a byte response back: b'[{"geonameId": 703448, … . I'm confused because although it is of type bytes, it is very human-readable and looks like a list of JSON objects. I do know the response is encoded in Latin-1, since running r.encoding returned ISO-8859-1, and I have tried to decode it, but it just returns an empty string. Here's what I have so far:

r = response.content
string = r.decode("ISO-8859-1")
print(string)

In Python 2.x the b prefix makes the enclosed string a plain str, so you may already have some encoded characters hidden somewhere within it. In Python 3.x you will receive a bytes literal. Why do you believe you need to perform any encoding/decoding?

Because I need to parse the JSON, and when I just tried looping over it with for i in range(len(content)): print(content[i]), it printed out lots of numbers.
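Those numbers are expected in Python 3: iterating over a bytes object yields one integer per byte, not characters. Decode first and then parse (a sketch with a made-up payload shaped like the one in the question):

import json

content = b'[{"geonameId": 703448}]'  # made-up byte payload

for i in range(3):
    print(content[i])  # 91, 123, 34 -- the byte values of '[', '{', '"'

data = json.loads(content.decode("utf-8"))  # decode, then parse
print(data[0]["geonameId"])                 # 703448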

4 Answers

Did you try to parse it with the json module?

import json
parsed = json.loads(response.content)

There should be a header in the response object telling you what encoding it has. You should decode the content with that codec, otherwise any unusual characters (emoji, accents, some quote characters, …) will end up garbled. See the answer from @salah.
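A sketch of that header-driven decoding (the charset parsing here is deliberately simplified, and the fallback to UTF-8 is my assumption, not what requests itself does):

import requests

r = requests.get("https://example.com/data")  # hypothetical endpoint
ctype = r.headers.get("Content-Type", "")
if "charset=" in ctype:
    codec = ctype.split("charset=")[-1].split(";")[0].strip()
else:
    codec = "utf-8"  # assumed fallback; requests would use ISO-8859-1 for text/*
text = r.content.decode(codec)
print(text[:200])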

@mzc, decode('latin1') doesn't always work; when the content type is text/html; charset=UTF-8, it fails.
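You can reproduce that mismatch in a few lines (a sketch reusing the garbled sample that appeared earlier on this page):

s = "19% ± 3%▼"
b = s.encode("utf-8")            # what a UTF-8 server actually sends
print(b.decode("latin1"))        # '19% Â± 3%â\x96¼' -- readable-looking but wrong
print(b.decode("utf-8"))         # '19% ± 3%▼'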

Another solution is to use response.text, which returns the content as unicode:

Type: property
Docstring:
Content of the response, in unicode.

If Response.encoding is None, encoding will be guessed using ``chardet``.

The encoding of the response content is determined based solely on HTTP headers, following RFC 2616 to the letter. If you can take advantage of non-HTTP knowledge to make a better guess at the encoding, you should set ``r.encoding`` appropriately before accessing this property.

There is r.text and r.content . The first one is a string, the second one is bytes.

import json
data = json.loads(r.text)

I’m requesting the source / HTML of a webpage (dockethound.com) and when I use r.content, it shows up.

I faced a similar issue using beautifulsoup4 and requests while scraping web pages; however, both response.text and response.content looked like garbled bytes.

The response headers included 'Content-Type': 'text/html; charset=UTF-8', but they also included 'Content-Encoding': 'br'. It turns out I hadn't installed brotlipy in the environment, and running pip install brotlipy fixed my issues. I thought chardet or cchardet would be enough, but the data first needed to be correctly decompressed.

A similar issue was solved the same way elsewhere; I'm linking to this answer since it didn't come up until I explicitly searched for Brotli compression.
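A quick way to check whether Brotli is your problem before installing anything (a sketch; the URL is hypothetical, and the manual decompression branch is only needed when the body really is still compressed):

import requests

r = requests.get("https://example.com/")  # hypothetical page
print(r.headers.get("Content-Encoding"))  # 'br' means the body was Brotli-compressed

if r.headers.get("Content-Encoding") == "br":
    import brotli  # provided by `pip install brotlipy` (or Brotli)
    try:
        html = brotli.decompress(r.content).decode("utf-8")
    except Exception:
        html = r.text  # urllib3 already decompressed it for us
    print(html[:200])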


Python Requests response decode

I send a request using Python requests and then print the response. What confuses me is that the Chinese characters in the response appear as something like \u6570\u636e\u8fd4\u56de\u6210\u529f.
Here is the code:

# -*- coding:utf-8 -*-
import requests

url = "http://www.biyou888.com/api/getdataapi.php?ac=3005&sess_token=&urlid=-1"
res = requests.get(url)
if res.status_code == 200:
    print res.text

Below is the response data. What should I do to convert the response? I have tried to use encode and decode but it doesn't work.

3 Answers

Use requests.Response.json and you will get the Chinese characters:

import requests
import json
url = "http://www.biyou888.com/api/getdataapi.phpac=3005&sess_token=&urlid=-1"
res = requests.get(url)
if res.status_code == 200:
    res_payload_dict = res.json()
    print(res_payload_dict)

What do you mean by "it doesn't work"? Do you get an error? Is the JSON you get not the right one? I just edited the 3rd line, which was broken.

There is no "?" between the host and the parameters in your code. I fixed it, but the result is still something like this.

Thanks, I have found the reason why the response text is not Chinese characters by default: the response returns "\u"-escaped sequences instead of characters encoded using UTF-8.

Testing the code with Python 2 and printing res_payload_dict['info'], I got 数据返回成功. Encoding is still puzzling to me, to say the least. Anyway, glad you found a solution.
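Those \uXXXX sequences are ordinary JSON escapes, and any JSON parser turns them back into the characters (a minimal sketch; the payload is made up to match the example above):

import json

raw = '{"info": "\\u6570\\u636e\\u8fd4\\u56de\\u6210\\u529f"}'  # what the wire text looks like
print(json.loads(raw)["info"])  # 数据返回成功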

How to use response.encoding using Python requests?

To illustrate, let's ping the GitHub API.

# import requests module
import requests

# Making a get request
response = requests.get('https://api.github.com')

# print response
print(response)

# print encoding of response
print(response.encoding)

So, in your example, try res.json() instead of res.text?

# encoding of your response
print(res.encoding)
res.encoding = 'utf-8-sig'
print(res.json())

The sig in utf-8-sig is short for signature (i.e., a UTF-8 file with a signature).

Using utf-8-sig to read a file treats the BOM as file metadata instead of part of the string.
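A minimal sketch of the difference (the byte string is made up; \ufeff is the BOM character):

data = b"\xef\xbb\xbfhello"            # UTF-8 bytes with a leading BOM

print(repr(data.decode("utf-8")))      # '\ufeffhello' -- the BOM leaks into the string
print(repr(data.decode("utf-8-sig")))  # 'hello' -- the BOM is treated as a signature and dropped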

import requests
import json

url = "http://www.biyou888.com/api/getdataapi.php?ac=3005&sess_token=&urlid=-1"
res = requests.get(url)
res_dict = {}
if res.status_code == 200:
    print type(res.text)
    print res.text
    res_dict = json.loads(res.text)  # get a dict from parameter string
    print "Dict:"
    print type(res_dict)
    print res_dict
    print res_dict['info']

Use the json module to parse that input. The u prefix just means it is a unicode string; when you actually use the strings, the u prefix has no effect, as the last several lines of the demo show.

The 'u' prefix won't bother you when you actually use the response text. And if you just want to print the Chinese characters out, like my print demo shows, use something like print res_dict['data'][0]['title'].
