Python convert strings of bytes to byte array
This seems so simple, but I could not find an answer anywhere. Of course, just typing b followed by the string literal will do. But I want to do this at runtime, or from a variable containing the string of bytes. If the given string were AAAA or some other known characters, I could simply do string.encode('utf-8'), but I am expecting the string of bytes to be random. Doing that to '\xf0\x9f\xa4\xb1' (random bytes) produces the unexpected result b'\xc3\xb0\xc2\x9f\xc2\xa4\xc2\xb1'. There must be a simpler way to do this? Edit: I want to convert the string to bytes without using an encoding.
Do you want to convert the string to bytes? It is not clear what the desired solution is. If you know it is a byte string without the b, you can do some string formatting. If you need it in bytes, you can call bytes(string). Does this help: stackoverflow.com/questions/606191/convert-bytes-to-a-string ?
The bytes function takes in a string and an encoding. Since the bytes I’m expecting are random, I don’t want to pick an encoding for them.
2 Answers
The Latin-1 character encoding trivially (and unlike every other encoding supported by Python) encodes every code point in the range 0x00-0xff to a byte with the same value.
byteobj = '\xf0\x9f\xa4\xb1'.encode('latin-1')
You say you don’t want to use an encoding, but the alternatives which avoid it seem far inferior.
The UTF-8 encoding is unsuitable because, as you already discovered, code points above 0x7f map to a sequence of multiple bytes (up to four bytes) none of which are exactly the input code point as a byte value.
Omitting the argument to .encode() (as in a now-deleted answer) does not avoid choosing an encoding either: in Python 3, str.encode() defaults to UTF-8, which has the problem just described. Relying on implicit defaults is worse still in APIs that consult the system locale, such as open() without an encoding argument, where the behavior is system-dependent (usually UTF-8 on most systems, but typically a legacy code page on Windows).
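To make the difference concrete, here is a minimal sketch comparing the two encodings on the bytes from the question. The variable names are illustrative only, and it assumes Python 3.

```python
# The string from the question: four code points, each in the range 0x00-0xff.
raw = '\xf0\x9f\xa4\xb1'

as_latin1 = raw.encode('latin-1')  # one byte per code point, same value
as_utf8 = raw.encode('utf-8')      # two bytes per code point above 0x7f

print(as_latin1)  # b'\xf0\x9f\xa4\xb1'
print(as_utf8)    # b'\xc3\xb0\xc2\x9f\xc2\xa4\xc2\xb1'

# Latin-1 round-trips every possible byte value unchanged.
all_bytes = bytes(range(256))
round_tripped = all_bytes.decode('latin-1').encode('latin-1')
print(round_tripped == all_bytes)  # True
```

Note that this only works while every character in the string is at most 0xff; a code point above that raises UnicodeEncodeError with 'latin-1'.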
How to convert a Python string representing bytes into actual bytes?
I have a string like "01030009" and I want to get another string (because in Python 2.x we use strings for bytes), newString, which will produce this result:
for a in newString:
    print ord(a)
0
1
0
3
0
0
0
9
3 Answers
''.join(chr(int(x)) for x in oldString)
chr is the inverse of ord .
chr, int and generator expressions: I think this is as built-in as you’ll get :) If you meant something shorter, or a single function, the answer would be "no" as well. It would be over-specializing, which Python tends to avoid as much as possible.
If you expect to be converting hex (or some higher base encode-able alphanumerically) as well, then pass in a base to int() . You can go up to 36: int(x, 36) .
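The thread above is about Python 2, where str doubles as a byte string. As a rough Python 3 equivalent, the same idea can be sketched with the bytes constructor; digits_to_bytes is a hypothetical helper name, not something from the answers.

```python
def digits_to_bytes(text, base=10):
    """Map each character of text to one byte holding its digit value.

    Hypothetical helper: int(ch, base) raises ValueError for characters
    that are not valid digits in the given base.
    """
    return bytes(int(ch, base) for ch in text)


new_bytes = digits_to_bytes("01030009")
print(list(new_bytes))  # [0, 1, 0, 3, 0, 0, 0, 9]

# Passing a base lets single hex digits through as well.
hex_bytes = digits_to_bytes("0a0f", base=16)
print(list(hex_bytes))  # [0, 10, 0, 15]
```

Iterating over a bytes object in Python 3 yields integers directly, so no ord() call is needed to inspect the result.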
All the "deeply builtin" ways interpret characters as bytes in a different way than the one you want, because the way you appear to desire seems limited to representing bytes worth less than 10 (or less than 16 if you meant to use hex and just completely forgot to mention it). In other words, your desired code can represent a truly minuscule fraction of byte strings, and therefore would be absurd to "officialize" in any way (such as supporting it in builtin ways)!
For example, considering strings of length 8 (your example’s short length), the total number of byte strings of that length which exist is 256 ** 8 , while the number your chosen notation can represent is 10 ** 8 . I.e.
>>> exist = 256 ** 8
>>> repre = 10 ** 8
>>> print exist, repre
18446744073709551616 100000000
>>> print (repre / float(exist))
5.42101086243e-12
So why would you expect any kind of "built-in" official support for a representation which, even for such really short strings, can only represent about five thousandths of one billionth of the possible byte strings?! The words "special case" were invented for things that happen far more frequently than this (if you got a random 8-byte string every second, it would be many centuries before you finally got one representable in your scheme), and longer byte strings keep exacerbating this effect exponentially, of course.
There are many "official" schemes for representation of byte strings, such as base64 and friends as specified in RFC 3548. Your desired scheme is very signally not among them ;-). Those are the schemes that get "official", built-in support in Python, of course.
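For comparison, here is a small sketch of two of the schemes that do have standard-library support, using the base64 and binascii modules. It assumes Python 3, and the sample data simply reuses the byte values from the question.

```python
import base64
import binascii

# The eight byte values from the question, as a real bytes object.
data = bytes([0, 1, 0, 3, 0, 0, 0, 9])

b64 = base64.b64encode(data)    # base64 text, as bytes
hexed = binascii.hexlify(data)  # two hex digits per byte

print(b64)
print(hexed)  # b'0001000300000009'

# Unlike the one-digit-per-byte notation, both schemes can represent
# every possible byte string and decode back to the original.
decoded_b64 = base64.b64decode(b64)
decoded_hex = binascii.unhexlify(hexed)
print(decoded_b64 == data and decoded_hex == data)  # True
```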
How to convert a byte like string to normal bytes?
I’ve a problem during exception-handling with the imapclient-library. I tried to handle the LoginError like this:
source = IMAPClient(host=args.source_server, port=args.source_port, ssl=not args.source_no_ssl)
try:
    print('Login source. '.format(args.source_user), end='', flush=False)
    source.login(args.source_user, args.source_pass)
    print('OK')
except exceptions.LoginError as e:
    print('ERROR: {}'.format(e))
    exit()
Login source. ERROR: b'Invalid login'
I think the problem is that format calls the __str__() method of the exception object and does not try to decode. So the main question is: how can I convert this string?
edit 1
try:
    print('Login source. '.format(args.source_user), end='', flush=False)
    source.login(args.source_user, args.source_pass)
    print('OK')
except exceptions.LoginError as e:
    print('ERROR: {}'.format(e.message.decode()))
    exit()
AttributeError: 'LoginError' object has no attribute 'message'
edit 2
try:
    print('Login source. '.format(args.source_user), end='', flush=False)
    source.login(args.source_user, args.source_pass)
    print('OK')
except exceptions.LoginError as e:
    print('ERROR: {}'.format(e.args[0].decode()))
    exit()
AttributeError: 'str' object has no attribute 'decode'
3 Answers
imapclient’s login method looks like this:
def login(self, username, password):
    """Login using *username* and *password*, returning the
    server response.
    """
    try:
        rv = self._command_and_check(
            'login',
            to_unicode(username),
            to_unicode(password),
            unpack=True,
        )
    except exceptions.IMAPClientError as e:
        raise exceptions.LoginError(str(e))

    logger.info('Logged in as %s', username)
    return rv
We can see that it calls str on IMAPClientError, so if IMAPClientError was created with a bytes instance as its argument, then we end up with stringified bytes in LoginError*.
There are two ways to deal with this:
Of the two approaches, I think (1) is better in this specific case, but (2) is more generally applicable when you have stringified bytes.
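The answer's two approaches are not reproduced above, so the following is only a sketch of one way to handle stringified bytes in general, not necessarily what the answer proposed. It uses ast.literal_eval from the standard library; unstringify_bytes is a hypothetical helper name, and the sample message is taken from the question's output.

```python
import ast

# Calling str() on a bytes object keeps the b'...' wrapper in the text.
original = b'Invalid login'
stringified = str(original)
print(stringified)  # b'Invalid login'


def unstringify_bytes(text, encoding='utf-8'):
    """Parse text such as "b'Invalid login'" and decode it to a str.

    Hypothetical helper: literal_eval safely evaluates the bytes literal
    and raises ValueError or SyntaxError on malformed input.
    """
    value = ast.literal_eval(text)
    if not isinstance(value, bytes):
        raise ValueError('not a stringified bytes literal: %r' % (text,))
    return value.decode(encoding)


message = unstringify_bytes(stringified)
print(message)  # Invalid login
```

In the exception handler from the question, this would be applied to str(e) or e.args[0], assuming that value really is a stringified bytes literal.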
* Looking at the history of the imaplib module on GitHub, it looks as if it changed to explicitly decode error messages before raising errors from the authenticate command in Python 3.5. So another solution might be to upgrade to Python 3.5+.
Have you tried to do this:
>>> a = b'invalid'
>>> a
b'invalid'
>>> a.decode()
'invalid'
>>> import imaplib
>>> dir(imaplib.IMAP4.error)
['__class__', '__delattr__', '__dict__', '__doc__', '__format__', '__getattribute__', '__getitem__', '__getslice__', '__hash__', '__init__', '__module__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__unicode__', '__weakref__', 'args', 'message']
>>> imaplib.IMAP4.error.message
It seems like there should be a message there, because LoginError seems to be a descendant of imaplib.IMAP4.error according to the source: https://imapclient.readthedocs.io/en/2.1.0/_modules/imapclient/exceptions.html#LoginError
You might want to print dir(e) where you catch the exception to see what it has: there should be something that gets converted by __str__() into a byte string.
Then again, there’s a conversation about IMAP4 and IMAPClient library and catching the exceptions here: Catching imaplib exception (using IMAPClient package) in Python
Best way to convert string to bytes in Python 3?
TypeError: 'str' does not support the buffer interface suggests two possible methods to convert a string to bytes:
b = bytes(mystring, 'utf-8')
b = mystring.encode('utf-8')
@LennartRegebro I disagree. Even if it’s more common, when I read "bytes()" I know what it is doing, while encode() doesn’t make me feel it is encoding to bytes.
@erm3nda Which is a good reason to use it until it does feel like that, then you are one step closer to Unicode zen.
@LennartRegebro I feel comfortable just using bytes(item, "utf8"), since explicit is better than implicit. str.encode() defaults silently to bytes, making you more Unicode-zen but less Explicit-Zen. Also, "common" is not a term that I like to follow, and bytes(item, "utf8") looks more like the str() and b"string" notations. My apologies if I am too much of a noob to understand your reasons. Thank you.
@erm3nda If you read the accepted answer, you can see that encode() doesn’t call bytes(); it’s the other way around. Of course that’s not immediately obvious, which is why I asked the question.
5 Answers
If you look at the docs for bytes , it points you to bytearray :
bytearray([source[, encoding[, errors]]])
Return a new array of bytes. The bytearray type is a mutable sequence of integers in the range 0 <= x < 256.
The optional source parameter can be used to initialize the array in a few different ways:
If it is a string, you must also give the encoding (and optionally, errors) parameters; bytearray() then converts the string to bytes using str.encode().
If it is an integer, the array will have that size and will be initialized with null bytes.
If it is an object conforming to the buffer interface, a read-only buffer of the object will be used to initialize the bytes array.
If it is an iterable, it must be an iterable of integers in the range 0 <= x < 256, which are used as the initial contents of the array.
Without an argument, an array of size 0 is created.
So bytes can do much more than just encode a string. It’s Pythonic that it would allow you to call the constructor with any type of source parameter that makes sense.
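As a quick illustration of the source types listed in the quoted documentation, here is a sketch using the bytes constructor in Python 3. The sample values are arbitrary.

```python
# String source: an encoding is required.
from_text = bytes('héllo', 'utf-8')
print(from_text)  # b'h\xc3\xa9llo'

# Integer source: that many null bytes.
from_size = bytes(3)
print(from_size)  # b'\x00\x00\x00'

# Iterable of integers in the range 0 <= x < 256.
from_ints = bytes([104, 105])
print(from_ints)  # b'hi'

# Object supporting the buffer interface (here, a bytearray).
from_buffer = bytes(bytearray(b'abc'))
print(from_buffer)  # b'abc'

# No argument: an empty bytes object.
empty = bytes()
print(empty)  # b''

# A string without an encoding is rejected rather than guessed.
try:
    bytes('héllo')
    rejected = False
except TypeError:
    rejected = True
print(rejected)  # True
```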
For encoding a string, I think that some_string.encode(encoding) is more Pythonic than using the constructor, because it is the most self-documenting: "take this string and encode it with this encoding" is clearer than bytes(some_string, encoding), where there is no explicit verb.
I checked the Python source. If you pass a unicode string to bytes using CPython, it calls PyUnicode_AsEncodedString, which is the implementation of encode ; so you’re just skipping a level of indirection if you call encode yourself.
Also, see Serdalis’ comment — unicode_string.encode(encoding) is also more Pythonic because its inverse is byte_string.decode(encoding) and symmetry is nice.
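A short sketch of the two points above, that both spellings give the same result and that encode and decode are inverses. It assumes Python 3 and an arbitrary sample string.

```python
text = 'naïve café'
encoding = 'utf-8'

via_method = text.encode(encoding)       # str -> bytes, explicit verb
via_constructor = bytes(text, encoding)  # same result through the constructor

print(via_method == via_constructor)  # True

# decode() is the inverse of encode() when the same encoding is used.
round_trip = via_method.decode(encoding)
print(round_trip == text)  # True
```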