- Working with Binary Data in Python
- Encoding
- Python and Bytes
- Working with bits and bytes in Python 2 and 3
- Introduction
- Common to Python 2 and 3
- Finding parts of the string
- Converting to binary
- Mutable and immutable types
- Python 2
- Using strings
- Using a bytearray
- Python 3
- Using bytes
- Using a bytearray
- Which one to choose?
- Read more
- Python Bits and Bytes
- Base Conversions
- Unicode Code Points
- bytes and bytearray
- Byte literals and ASCII Conversions
- Hex Stream
- Structures / Packets
Working with Binary Data in Python
Alright, let's get this out of the way! The basics are pretty standard:
- There are 8 bits in a byte
- A bit is either a 0 or a 1
- A byte can be written in different number systems, such as binary, octal or hexadecimal
Note: These are not character encodings; those come later. This is just a way to look at a set of 1's and 0's and see it in three different ways (or number systems).
    Input : 10011011
    Output:
    1001 1011 ---- 9B (in hex)
    1001 1011 ---- 155 (in decimal)
    1001 1011 ---- 233 (in octal)
This shows that the same string of bits can be written in several ways. We often use the hex representation of a byte instead of the binary one because it is shorter to write. This is just a representation, not an interpretation.
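The same conversions can be checked in Python. This is a small illustrative sketch; the variable name `value` is an assumption and does not come from the original text.

```python
# Parse the bit string from the example above as a base-2 integer.
value = int("10011011", 2)

print(value)                  # decimal: 155
print(format(value, "X"))     # hexadecimal: 9B
print(format(value, "o"))     # octal: 233
print(format(value, "08b"))   # back to an 8-digit binary string
```

All four lines print the same underlying number; only the notation changes.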
Encoding
Now that we know what a byte is and what it looks like, let us see how it is interpreted, mainly in strings. A character encoding assigns a byte, or a sequence of bytes, to each character it can represent. Some encodings are ASCII (probably the oldest), Latin-1, and UTF-8 (the most widely used as of today). In a sense, encodings are a way for computers to represent, send and interpret human-readable characters. This means that a sentence written in one encoding might become incomprehensible when it is read in another.
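To see how the same bytes read differently under two encodings, here is a minimal sketch. The sample word and variable names are assumptions chosen for illustration.

```python
text = "café"

# In UTF-8 the letter 'é' is stored as two bytes: 0xC3 0xA9.
utf8_bytes = text.encode("utf-8")
print(utf8_bytes)

# Reading those bytes as Latin-1 treats every byte as its own character,
# so the two bytes of 'é' turn into two unrelated characters.
misread = utf8_bytes.decode("latin-1")
print(misread)

# Decoding with the encoding that was used to encode recovers the text.
print(utf8_bytes.decode("utf-8"))
```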
Python and Bytes
From a developer's point of view, the largest change in Python 3 is the handling of strings. In Python 2, the str type was used for two different kinds of values, text and bytes, whereas in Python 3 these are separate and incompatible types. Before Python 3 we could treat a set of bytes as a string and work from there. This is no longer the case: there is now a separate data type called bytes. It can be briefly described as a string of bytes and, like str, it is immutable once it has been created.
Working with bits and bytes in Python 2 and 3
When performing a bit flip attack or working with XOR encryption, you want to change the bits and bytes in a string of bytes. How this is done differs between Python 2 and 3, and this article explains how.
Introduction
In a bit flip attack, you typically want to change a single bit in a predefined message. As an example we take the message “attack at dawn”. Imagine we want to change the least significant bit of a letter, resulting for example in “att`ck at dawn”.
To flip a bit we XOR it with 1. In Python the XOR operation is written with the ^ (caret) operator.
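As a quick illustration of the XOR step on a single character code (the variable names here are assumptions, not part of the original article):

```python
letter = "a"
code = ord(letter)             # 97, binary 01100001

# XOR with 1 flips only the least significant bit and leaves the rest alone.
flipped = code ^ 0b00000001

print(format(code, "08b"), format(flipped, "08b"))
print(chr(flipped))            # the backtick character, code 96
```

To flip a different bit, XOR with a mask such as `1 << n`, where `n` is the bit position counted from the least significant bit.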
Common to Python 2 and 3
Finding parts of the string
We can refer to a specific letter in the message by using string indexing. Similarly, we can refer to part of the payload by using the slice notation:
    >>> message = "attack at dawn"
    >>> message[3]
    'a'
    >>> message[0:3]
    'att'
This is useful when we want to change only a single letter, or copy part of the string unchanged.
Converting to binary
We can only flip bits on binary content. If we were passed a text string instead, we must first convert the text message to bytes. This is done using the encode method, which takes as its parameter the encoding to use when converting the text to bytes:
    >>> message = u"Thanks for the tête-à-tête about our coup d'état in Zaïre"
    >>> message.encode("utf-8")
    b"Thanks for the t\xc3\xaate-\xc3\xa0-t\xc3\xaate about our coup d'\xc3\xa9tat in Za\xc3\xafre"
As you can see our letters have been converted to bytes according to the UTF-8 encoding.
Alternatively, we can provide the message in bytes to begin with. By putting a b prefix before the string literal, we specify that it is a byte string as opposed to a text string, for example message = b"attack at dawn".
To easily get binary data in and out of Python you can use base64.b64encode to base64-encode it, or binascii.hexlify to convert it to hex.
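A short sketch of both helpers, using only the standard library. The variable names are assumptions for illustration.

```python
import base64
import binascii

payload = b"attack at dawn"

# Both functions take bytes and return bytes containing only ASCII characters.
b64 = base64.b64encode(payload)
hexed = binascii.hexlify(payload)
print(b64)
print(hexed)

# The matching decode functions turn the text form back into the raw bytes.
assert base64.b64decode(b64) == payload
assert binascii.unhexlify(hexed) == payload
```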
Mutable and immutable types
The string and bytes types are immutable.
    >>> message = "attack at dawn"
    >>> message[3] = "x"
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: 'str' object does not support item assignment
We can't simply assign a different letter to the message, since it is immutable. Immutable objects can't be changed. There are two ways to overcome this problem:
- Create a new string containing the value we want.
- Copy the string to another mutable object and work on that.
To create a new string, we simply copy the parts we want to keep and inject our changed letter into it:
    >>> message = "attack at dawn"
    >>> message[:3] + "x" + message[4:]
    'attxck at dawn'
Alternatively we can use bytearray as our mutable object. We copy our message into a bytearray and change that:
    >>> message = "attack at dawn"
    >>> message_array = bytearray(message)
    >>> message_array[3] = "x"
    >>> str(message_array)
    'attxck at dawn'
Now, this example will only work in Python 2 and not in Python 3. Let’s get into the differences.
Python 2
Using strings
The str type in Python 2 is a string of bytes. If you index it you get another str containing just one byte:
    >>> message = b"attack at dawn"
    >>> message[3]
    'a'
We can’t just flip bits in this single-byte string. We need to convert it to a number and back again using chr and ord. The ord (for “ordinal”) function converts the letter to a number. We can modify that number as we please and convert it back using chr :
    >>> ord(message[3])
    97
    >>> chr(97)
    'a'
    >>> chr(ord(message[3]) ^ 1)
    '`'
Now that we have changed the letter at index 3, we can concatenate the rest of the string to it:
    >>> message[0:3] + chr(ord(message[3]) ^ 1) + message[4:]
    'att`ck at dawn'
Using a bytearray
An alternative is to copy the string to a bytearray and directly change a letter:
    >>> message = b"attack at dawn"
    >>> message_array = bytearray(message)
    >>> message_array[3] = message_array[3] ^ 1
    >>> str(message_array)
    'att`ck at dawn'
Even in Python 2 the elements of bytearray are numbers. Indexing a bytearray will give a number. That said, it is still possible to assign single-letter strings to positions in a bytearray:
    >>> message_array[3]
    97
    >>> message_array[3] = 'x'
    >>> message_array
    bytearray(b'attxck at dawn')
Python 3
Using bytes
In Python 3, str is the type for a string of text and bytes is the type for a string of bytes. If you index a bytes you get a number:
    >>> message = b"attack at dawn"
    >>> message[3]
    97
After we modify the number we want to put it back in our message, so we have to convert it to bytes again. In Python 2 we used chr for this, but that won't work in Python 3: chr returns a text string (str) instead of bytes. We use the bytes constructor instead, passing it a list that contains the number, for example bytes([96]).
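A minimal Python 3 sketch of this approach, mirroring the Python 2 string example (the names `number`, `flipped` and `tampered` are assumptions):

```python
message = b"attack at dawn"

number = message[3]              # indexing bytes yields an int in Python 3

# Wrap the int in a list: bytes([96]) is one byte with value 96,
# whereas bytes(96) would create 96 zero bytes.
flipped = bytes([number ^ 1])

# Rebuild the message from the unchanged slices and the new byte.
tampered = message[:3] + flipped + message[4:]
print(tampered)
```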
Using a bytearray
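The bit-flipping example given for Python 2 still works in Python 3, with one caveat: str(message_array) now returns the representation "bytearray(b'...')" rather than the text, so use bytes(message_array) or message_array.decode() to get the result. It is also no longer possible to assign letters to bytearray indices. You can only assign numbers:
    >>> message_array[3] = "x"
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: an integer is required
    >>> message_array[3] = 96
    >>> message_array
    bytearray(b'att`ck at dawn')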
Which one to choose?
Use Python 3 with bytearray. Python 3 is stricter with data types than Python 2. It may seem easier to use Python 2 at first because you can get away with treating strings and bytes the same, but this can hide subtle bugs. Python 3 is more explicit about types and warns you when you try to do something that is probably not what you want.
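Putting that recommendation together, a bit flip in Python 3 could be sketched as below. The helper name `flip_bit` and its parameters are hypothetical, introduced only for this illustration.

```python
def flip_bit(data: bytes, byte_index: int, bit_index: int = 0) -> bytes:
    """Return a copy of data with one bit flipped (bit 0 is the least significant)."""
    buf = bytearray(data)             # mutable copy of the immutable input
    buf[byte_index] ^= 1 << bit_index
    return bytes(buf)


tampered = flip_bit(b"attack at dawn", 3)
print(tampered)

# Flipping the same bit again restores the original message.
print(flip_bit(tampered, 3))
```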
Read more
Python Bits and Bytes
In C, handling binary data such as network packets feels almost like a core part of the language. In Python, on the other hand, a lot of supporting library functions are required to facilitate this.
As I only occasionally use Python for this purpose, I’ve written up the below as a reference for myself. All of these examples target Python 3.
Base Conversions
Python has three built-in functions for base conversions: int(), hex() and bin(). Note that hex() and bin() both return strings.
Considering the example where x = 42 :
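A small sketch of what these calls return for that value (the result variable names are assumptions):

```python
x = 42

as_hex = hex(x)   # a str with a 0x prefix
as_bin = bin(x)   # a str with a 0b prefix

print(as_hex)     # 0x2a
print(as_bin)     # 0b101010
print(type(as_hex), type(as_bin))
```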
Alternatively, we can get slightly more control over the output by using the str.format() method and its format syntax.
For example, the following outputs zero-padded binary numbers to a width of 8:
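One way to write this, sketched with x = 42 as an assumed sample value:

```python
x = 42

# "08b": binary, zero-padded to a total width of 8 characters.
padded_bin = "{:08b}".format(x)
print(padded_bin)                 # 00101010

# The same mini-language handles hex; "#" adds the 0x prefix,
# and the prefix counts towards the width.
print("{:02x}".format(x))         # 2a
print("{:#06x}".format(x))        # 0x002a
```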
If the initial value you wish to convert is a string, the int() function can first convert it to an integer. This requires providing both the string and its base as arguments to the int() function.
In the case where x = "0x2a":
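A brief sketch of the string-to-integer step (the variable `n` is an assumption):

```python
x = "0x2a"

# With base 16 the optional "0x" prefix is accepted.
n = int(x, 16)
print(n)                    # 42

# Base 0 asks int() to infer the base from the prefix (0x, 0o or 0b).
print(int(x, 0))            # 42

# Once it is an integer, it can be rendered in any base.
print(bin(n))               # 0b101010
```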
Unicode Code Points
The ord() built-in function returns the integer value / code point of a specified character. For example, examining the "straight" ASCII apostrophe and the "curly" opening version:
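A minimal sketch of that comparison; the curly quote is written as an escape (U+2018) so the snippet does not depend on the file's encoding:

```python
straight = "'"
curly = "\u2018"        # LEFT SINGLE QUOTATION MARK

print(ord(straight))    # 39, within the ASCII range
print(ord(curly))       # 8216, well outside ASCII
print(hex(ord(curly)))  # 0x2018
```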
The chr() function performs the inverse of ord(). It returns the character (as a string) for an integer code point. If you wanted the rocket symbol you could issue:
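For instance, assuming the rocket in question is the Unicode character U+1F680 (ROCKET):

```python
rocket = chr(0x1F680)
print(rocket)

# chr() and ord() are inverses of each other.
print(ord(rocket) == 0x1F680)
print(chr(ord("a")))
```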
bytes and bytearray
Binary values can be stored within the bytes object. This object is immutable and can store raw binary values within the range 0 to 255. Its constructor is the aptly named bytes(). There are several different ways to initialise a bytes object:
    >>> bytes((1,2,3))
    b'\x01\x02\x03'
    >>> bytes("hello", "ascii")
    b'hello'
The bytearray object serves the same purpose as bytes but is mutable, allowing elements in the array to be modified. It has the constructor bytearray() .
    >>> x = bytearray("hello.", "ascii")
    >>> x
    bytearray(b'hello.')
    >>> x[5] = ord("!")
    >>> x
    bytearray(b'hello!')
Byte literals and ASCII Conversions
A bytes literal can be specified using the b or B prefix, e.g. b"bytes literal".
Comparing this with a standard string:
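A small sketch of the difference in Python 3 (the variable names are assumptions):

```python
b_literal = b"bytes literal"
s_literal = "string literal"

print(type(b_literal))    # <class 'bytes'>
print(type(s_literal))    # <class 'str'>

# Indexing bytes gives an int; indexing str gives a one-character str.
print(b_literal[0], s_literal[0])

# The two types never compare equal, even with the same ASCII content.
print(b"abc" == "abc")    # False
```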
Non-ASCII bytes can be inserted using the "\xHH" escape sequence. This places the byte with the hexadecimal value 0xHH into the string, e.g. b"The NULL terminator is \x00".
The str object has an encode() method to return the bytes representation of the string. Similarly, the bytes object has a decode() method to return the str representation of the data:
- "string to bytes".encode("ascii") gives b'string to bytes'
- b"bytes to string".decode("ascii") gives 'bytes to string'
Hex Stream
The hexadecimal string representation of a single byte requires two characters, so a hex representation of a bytes string will be twice the length.
To convert from bytes to a hex representation use binascii.hexlify() and from hex to bytes binascii.unhexlify() .
For example, where x = b"hello", binascii.hexlify(x) returns b'68656c6c6f'.
The reverse process, if y = "68656c6c6f": binascii.unhexlify(y) returns b'hello'.
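A round trip through both functions can be sketched as follows (the variable names beyond x and y are assumptions):

```python
import binascii

x = b"hello"
hex_stream = binascii.hexlify(x)        # bytes containing ASCII hex digits
print(hex_stream)

# Two hex characters per byte, so the stream is twice as long.
print(len(x), len(hex_stream))

y = "68656c6c6f"
raw = binascii.unhexlify(y)             # accepts an ASCII-only str or bytes
print(raw)
```

In Python 3.5 and later, the built-in methods `x.hex()` and `bytes.fromhex(y)` do the same job and work with str rather than bytes on the hex side.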
Structures / Packets
The struct module provides a way to convert data to/from C structs (or network data).
The key functions in this module are struct.pack() and struct.unpack() . In addition to the data, these functions require a format string to be provided to specify the byte order and the intended binary layout of the data.
Consider an IPv4 header. This structure contains some fields that are shorter than a byte (octet), e.g. the version field is 4 bits wide (a nibble). The smallest data unit struct can handle is a byte, so these fields must be packed together into larger data units and then extracted separately via bit shifting and masking.
IPv4 Field | Format Character |
---|---|
Version and IHL | B |
Type of Service | B |
Total Length | H |
Identification | H |
Flags and Fragmentation Offset | H |
Time to Live | B |
Protocol | B |
Header Checksum | H |
Source Address | L |
Destination Address | L |
As this data should be in network byte order, we need to specify this with an exclamation mark, ! . The format string which represents an IPv4 header is therefore: !BBHHHBBHLL .
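As a sanity check on the format string, a short sketch (the constant name `IPV4_FORMAT` is an assumption):

```python
import struct

IPV4_FORMAT = "!BBHHHBBHLL"   # format string from the text above

# With "!" struct uses standard sizes and no padding:
# B = 1 byte, H = 2 bytes, L = 4 bytes.
header_size = struct.calcsize(IPV4_FORMAT)
print(header_size)            # 20, the length of an IPv4 header without options

# The byte order prefix decides which byte goes first.
print(struct.pack("!H", 1))   # network (big-endian) order
print(struct.pack("<H", 1))   # little-endian order
```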
Below is an example of packing IPv4 fields into a bytes object and hex stream:
    import struct
    import binascii

    fmt_string = "!BBHHHBBHLL"

    version_ihl = 4 << 4 | 4  # version in the high nibble, IHL in the low nibble
    tos = 0
    total_length = 100
    identification = 42
    flags = 0
    ttl = 32
    protocol = 6
    checksum = 0xabcd
    s_addr = 0x0a0b0c0d
    d_addr = 0x01010101

    ip_header = struct.pack(fmt_string, version_ihl, tos, total_length,
                            identification, flags, ttl, protocol, checksum,
                            s_addr, d_addr)

    print(ip_header)
    print(binascii.hexlify(ip_header).decode())

    b'D\x00\x00d\x00*\x00\x00 \x06\xab\xcd\n\x0b\x0c\r\x01\x01\x01\x01'
    44000064002a00002006abcd0a0b0c0d01010101
The unpack() method can reverse this process:
    ip_header_fields = struct.unpack(fmt_string, ip_header)
    print(ip_header_fields)
The unpacked data is a tuple of the individual fields:
(68, 0, 100, 42, 0, 32, 6, 43981, 168496141, 16843009)
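The sub-byte fields still have to be separated by hand. The sketch below starts from the hex stream shown earlier and splits the combined fields with shifts and masks. The variable names are assumptions, and the 3-bit flags / 13-bit fragment offset split follows the standard IPv4 header layout rather than anything stated in the text above.

```python
import struct

ip_header = bytes.fromhex("44000064002a00002006abcd0a0b0c0d01010101")
fields = struct.unpack("!BBHHHBBHLL", ip_header)

# First byte: version in the high nibble, IHL in the low nibble.
version_ihl = fields[0]
version = version_ihl >> 4
ihl = version_ihl & 0x0F

# Fifth field: 3 bits of flags followed by a 13-bit fragment offset.
flags = fields[4] >> 13
frag_offset = fields[4] & 0x1FFF

# Render the 32-bit source address in dotted-decimal form.
src = ".".join(str(octet) for octet in struct.pack("!L", fields[8]))

print(version, ihl, flags, frag_offset, src)
```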