- How do I get a size of an UTF-8 string in Bytes with Python
- How do I get a size of an UTF-8 string in Bytes with Python
- How to convert/decode a pandas.Series of mixed bytes/strings into string or utf-8
- Convert zero-padded bytes to UTF-8 string
- Python encode/decode hex string to utf-8 string
- 2 ways to get the size of a string in Python (in bytes)
- Using the encode() method and the len() function
- Using sys.getsizeof()
- Python get string size in bytes
- # Table of Contents
- # Get the length of a Bytes object in Python
- # Get the size of a String in Python
How do I get a size of an UTF-8 string in Bytes with Python
Byte strings: Unicode strings: It’s good practice to maintain all of your strings in Unicode, and only encode when communicating with the outside world. Where the strings in a are mixed bytes and `UTF-8/. Which I’d guess would have the same solution.
How do I get a size of an UTF-8 string in Bytes with Python
Having an UTF-8 string like this:
is it possible to get its (in memory) size in Bytes with Python (2.5)?
Assuming you mean the number of UTF-8 bytes (and not the extra bytes that Python requires to store the object), it’s the same as for the length of any other string. A string literal in Python 2.x is a string of encoded bytes, not Unicode characters .
>>> mystring = "işğüı" >>> print "length of is ".format(repr(mystring), ****(mystring)) length of 'i\xc5\x9f\xc4\x9f\xc3\xbc\xc4\xb1' is 9
>>> myunicode = u"işğüı" >>> print "length of is ".format(repr(myunicode), ****(myunicode)) length of u'i\u015f\u011f\xfc\u0131' is 5
It’s good practice to maintain all of your strings in Unicode, and only encode when communicating with the outside world. In this case, you could use ****(myunicode.encode(‘utf-8’)) to find the size it would be after encoding.
Python translate bytecode to utf-8 using a variable, UTF8 would translate ‘\xc3\x86′ to Æ. If I type the value in b’PR\xc3\x86KVAL’ , python recognizes it as bytecode and I can translate it to PRÆKVAL. See below: s = b’PR\xc3\x86KVAL’ print (s) bb = s.decode (‘utf-8’) print (bb) The problem is that I don’t know how I can turn ‘PR\xc3\x86KVAL’ to be …
How to convert/decode a pandas.Series of mixed bytes/strings into string or utf-8
I would like to solve the problem in two possible cases:
- Where I don’t know whether the Series of strings is going to be UTF-8 or bytes beforehand.
Which I’d guess would have the same solution.
b = pd.Series(['123', '434,', 'fgd', 'aas', b'442321']) b.str.decode('utf-8')
Gives NaNs where the strings were already in UTF-8. Or are they automatically ASCII? Can I give the error parameter in decode so that the string remains «undecoded» where it’s already in UTF-8 for example? The docstring doesn’t seem to provide much info.
Or is there a better way to accomplish this?
Alternatively, is there a string method in pandas like .str.decode which instead just returns a True/False when a string is bytes or UTF-8 ?
One option I can think of is:
b = pd.Series(['123', '434,', 'fgd', 'aas', b'442321']) converted = b.str.decode('utf-8') b.loc[~converted.isnull()] = converted
Is this the recommended way then? Seems a bit roundabout. I guess what would be more elegant is really just a way to check if an str is bytes on all the elements of a Series and return a boolean array where it’s the case.
This will definitely slow things down for a large Series, but you can pass a ternary expression with a callable:
>>> b.apply(lambda ****.decode('utf-8') if isinstance(x, bytes) else x) 0 123 1 434, 2 fgd 3 aas 4 442321 dtype: object
Looking at the source for .str.decode() is instructive — it just applies _na_map(f, arr) over the Series, where the function f is f = lambda ****.decode(encoding, errors) . Because str doesn’t have a «decode» method to begin with, that error will become NaN. This happens in str_decode() .
>>> from pandas.core.strings import str_decode >>> from pandas.core.strings import _cpython_optimized_encoders >>> "utf-8" in _cpython_optimized_encoders True >>> str_decode(b, "utf-8") array([nan, nan, nan, nan, '442321'], dtype=object) >>> from pandas.core.strings import _na_map >>> f = lambda ****.decode("utf-8") >>> _na_map(f, b) array([nan, nan, nan, nan, '442321'], dtype=object)
The problem still open in git
except (TypeError, AttributeError): return na_value
b.str.decode('utf-8').fillna(b) Out[237]: 0 123 1 434, 2 fgd 3 aas 4 442321 dtype: object
Python — Encode Bytearray into UTF-8, You cannot encode this to UTF-8 as it is not a unicode (text) object. Your UnicodeDecodeError is caused by Python trying to decode the data first; it is trying to be helpful because normally you can only encode from Unicode to bytes.
Convert zero-padded bytes to UTF-8 string
I’m unpacking several structs that contain ‘s’ type fields from C. The fields contain zero-padded UTF-8 strings handled by strncpy in the C code (note this function’s vestigial behaviour). If I decode the bytes I get a unicode string with lots of NUL characters on the end.
>>> b'hiya\0\0\0'.decode('utf8') 'hiya\x00\x00\x00'
I was under the impression that trailing zero bytes were part of UTF-8 and would be dropped automatically.
What’s the proper way to drop the zero bytes?
Use str.rstrip() to remove the trailing NULs:
Either rstrip or replace will only work if the string is padded out to the end of the buffer with nulls. In practice the buffer may not have been initialised to null to begin with so you might get something like b’hiya\0x\0′ .
If you know categorically 100% that the C code starts with a null initialised buffer and never never re-uses it, then you might find rstrip to be simpler, otherwise I’d go for the slightly messier but much safer:
which treats the first null as a terminator.
Unlike the split/partition-solution this does not copy several strings and might be faster for long bytearrays.
data = b'hiya\0\0\0' i = data.find(b'\x00') if i == -1: return data return data[:i]
Python — Converting to UTF-8 (again), The work-around is to set r.encoding = ‘utf-8’ before accessing r.json so that the contents are correctly decoded without guessing. However, as J.F. Sebastian correctly points out, if the response really is JSON, then the encoding has to be one of the UTF family of encodings.
Python encode/decode hex string to utf-8 string
I trying to decode a hex string in python.
value = "" for i in "54 C3 BC 72 20 6F 66 66 65 6E 20 4B 6C 69 6D 61".split(" "): value += chr(int(i, 16)) print(value)
Expected result should be «Tür offen Klima» How can i make this work properly ?
Your data is encoded as UTF-8, which means that you sometimes have to look at more than one byte to get one character. The easiest way to do this is probably to decode your string into a sequence of bytes, and then decode those bytes into a string. Python has built-in features for both:
value = bytes.fromhex(«54 C3 BC»).decode(«utf-8»)
The problem is the result of the string
«54 C3 BC 72 20 6F 66 66 65 6E 20 4B 6C 69 6D 61»
The proper hex string that result in «Tür offen Klima» is actually:
«54 FC 72 20 6F 66 66 65 6E 20 4B 6C 69 6D 61»
Therefore, the code below would generate the result you expected:
value = "" for i in "54 FC 72 20 6F 66 66 65 6E 20 4B 6C 69 6D 61".split(" "): value += chr(int(i, 16)) print(value)
How to convert a string of utf-8 bytes into a unicode, 1 Answer. Sorted by: 3. Yes, I encountered the same problem when trying to decode a Facebook message dump. Here’s how I solved it: string = «\u00f0\u009f\u0098\u0086».encode («latin-1»).decode («utf-8») # ‘😆’. Here’s why: This emoji takes 4 bytes to encode in UTF-8 ( F0 9F 98 86, check at the bottom …
2 ways to get the size of a string in Python (in bytes)
This concise, straight-to-the-point article will show you 2 different ways to get the size (in bytes) of a string in Python. There’s no time to waste; let’s get started!
Using the encode() method and the len() function
Here’re the steps you can follow:
- Encode your string into bytes using the encode() method. You can specify the encoding as an argument, such as encode(‘utf-8’) or encode(‘ascii’) . This will return a bytes object that represents your string in the given encoding.
- Use the len() function to get the number of bytes in the resulting bytes object.
string = "Welcome to Sling Academy!" encoded_bytes = string.encode("utf-8") size_in_bytes = len(encoded_bytes) print(f"Size of string: bytes")
This approach gives you the actual byte size of the string. It is the best option for you.
Using sys.getsizeof()
The sys.getsizeof() function provides the size of the input string object in memory, including overhead. Because the result you get may include additional overhead, it may not reflect the actual byte size of the string. Therefore, we have to subtract the result for the size of an empty string returned by sys.getsizeof() .
The main points of this approach are:
- Import the sys module.
- Use the sys.getsizeof() function to get the size of the string object.
- Subtract the size of an empty string to get the size of the string in bytes.
import sys string = "Welcome to Sling Academy!" size_in_bytes = sys.getsizeof(string) - sys.getsizeof("") print(f"Size of string: bytes")
Python get string size in bytes
Last updated: Feb 18, 2023
Reading time · 3 min
# Table of Contents
# Get the length of a Bytes object in Python
Use the len() function to get the length of a bytes object.
The len() function returns the length (the number of items) of an object and can be passed a sequence (a bytes, string, list, tuple or range) or a collection (a dictionary, set, or frozen set).
Copied!my_bytes = 'hello'.encode('utf-8') print(type(my_bytes)) # 👉️ print(len(my_bytes)) # 👉️ 5
The len() function returns the length (the number of items) of an object.
Here is another example with some special characters.
Copied!my_bytes = 'éé'.encode('utf-8') print(type(my_bytes)) # 👉️ print(len(my_bytes)) # 👉️ 4
The same approach can be used to get the length of a bytearray .
Copied!my_byte_array = bytearray('hello', encoding='utf-8') print(len(my_byte_array)) # 👉️ 5
If you need to get the size of an object, use the sys.getsizeof() method.
Copied!import sys my_bytes = 'hello'.encode('utf-8') print(sys.getsizeof(my_bytes)) # 👉️ 38 print(sys.getsizeof('hello')) # 👉️ 54
The sys.getsizeof method returns the size of an object in bytes.
The object can be any type of object and all built-in objects return correct results.
The getsizeof method only accounts for the direct memory consumption of the object, not the memory consumption of objects it refers to.
The getsizeof() method calls the __sizeof__ method of the object, so it doesn’t handle custom objects that don’t implement it.
# Get the size of a String in Python
If you need to get the size of a string:
- Use the len() function to get the number of characters in the string.
- Use the sys.getsizeof() method to get the size of the string in memory.
- Use the string.encode() method and len() to get the size of the string in bytes.
Copied!import sys string = 'bobby' # ✅ Get the length of the string (number of characters) print(len(string)) # 👉️ 5 # ------------------------------------------ # ✅ get the memory size of the object in bytes print(sys.getsizeof(string)) # 👉️ 54 # ------------------------------------------ # ✅ get size of string in bytes my_bytes = len(string.encode('utf-8')) print(my_bytes) # 👉️ 5 print(len('őŴœ'.encode('utf-8'))) # 👉️ 6
Use the len() function if you need to get the number of characters in a string.
Copied!print(len('ab')) # 👉️ 2 print(len('abc')) # 👉️ 3
The len() function returns the length (the number of items) of an object.
The argument the function takes may be a sequence (a string, tuple, list, range or bytes) or a collection (a dictionary, set, or frozen set).