The Python Binary File Parsing Deep Dive
I'm an electrical engineer by trade, but my job has a lot of software engineering overlap. Something I've been working on recently is a Python parser for a binary format with a varying record structure. The format is STDF V4, a binary format created by Teradyne for storing semiconductor test data. My goal was to parse this data into a tabular format for data analysis and visualization (ideally CSV). There are existing Python libraries for parsing this format, but they are quite slow for any file larger than 100 MB; specifically, I was using pySTDF. Rather than continue to wrestle with this rather complicated format, I decided to do a deep dive on general binary file parsing methods in Python. This post is divided into what my research has told me so far, an experiment I conducted on binary file parsing, and the answers I'm hoping the community can provide.
Pre-Experiment Research
I found two awesome posts that seem to directly address the root of my issue. One post here on the Code Review Stack Exchange covers methods to speed up binary file reads in Python 3. When I dig into the source code for pySTDF, that post starts to highlight why the library is so slow: although it is very organized and "Pythonic" in nature, there are many, many nested classes and function calls that slow down parsing. Another good post here looks at file I/O performance in Python at a higher level and brings things into clearer view. Python isn't necessarily slow at reading binary data from disk; rather, when you're iterating on the order of millions of times, every little function call and operation starts to really hurt you. With all this in mind, I decided to try out a couple of general binary parsing libraries in Python and compare their performance.
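To make that overhead concrete, here is a rough, illustrative comparison (timings vary by machine): calling the module-level struct.unpack a million times versus hoisting a precompiled struct.Struct and its bound unpack method out of the loop. The gap per call is tiny, but it adds up over millions of records.

import struct
import timeit

record = struct.pack("<HBfi", 1, 2, 3.0, 4)

# module-level function: a format-cache lookup plus extra indirection on every call
t_module = timeit.timeit(lambda: struct.unpack("<HBfi", record), number=1_000_000)

# precompiled Struct with the bound unpack method hoisted out of the loop
unpack = struct.Struct("<HBfi").unpack
t_hoisted = timeit.timeit(lambda: unpack(record), number=1_000_000)

print(f"module-level: {t_module:.3f}s  precompiled: {t_hoisted:.3f}s")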
The Experiment
For brevity's sake I won't get into the nitty-gritty of how all these parsers work (including my own), but please see the GitHub link here for the full source code, and refer to the links above for information on those parsers.
File Generation
To generate some test files, I used the Construct library to generate binary files of my own format.
First I wanted a simple binary file with a constant record length and format:
# Simple tabular binary record with fixed fields
data_simple_fmt = Struct(
    "field1" / Int16ul,
    "field2" / Int8ul,
    "field3" / Float32l,
    "field4" / Int32sl
)
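As a quick sanity check on the definition (not part of the benchmarks), a single record round-trips through Construct like this; the 11-byte size follows from the field widths 2 + 1 + 4 + 4:

# build and re-parse one simple record (data_simple_fmt as defined above)
raw = data_simple_fmt.build(dict(field1=2018, field2=7, field3=1.5, field4=-42))
print(len(raw))                            # 11
rec = data_simple_fmt.parse(raw)
print(rec.field1, rec.field3, rec.field4)  # 2018 1.5 -42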
Next I wanted a format with a consistent number of fields, but where some fields varied in length:
# Simple tabular binary record with 2 varying string fields, a record length header is included
data_varying_fmt = Struct(
    "record_len" / Int16ul,
    "field1" / Int16ul,
    "field2" / Int8ul,
    "field3" / Int16ul,
    "field4" / PascalString(Int8ul, "utf8"),
    "field5" / PascalString(Int8ul, "utf8"),
    "field6" / Float32l
)
Diagram of format:
Finally, in the spirit of STDF, I wanted a format where there could be different record types depending on a record header:
# Complex binary record with multiple tabular record types that have varying fields indicated by record header
table1_fmt = Struct(
    "table1_field1" / Int16ul,
    "table1_field2" / Int8ul,
    "table1_field3" / PascalString(Int8ul, "utf8"),
    "table1_field4" / Float32l
)

table2_fmt = Struct(
    "table2_field1" / PascalString(Int8ul, "utf8"),
    "table2_field2" / PascalString(Int8ul, "utf8"),
    "table2_field3" / PascalString(Int8ul, "utf8")
)

table3_fmt = Struct(
    "table3_field1" / Int16ul,
    "table3_field2" / Int32ul,
    "table3_field3" / Int32ul,
    "table3_field4" / Int32ul,
    "table3_field5" / Int32ul,
    "table3_field6" / Int32ul,
    "table3_field7" / Int32ul,
    "table3_field8" / Int32ul,
    "table3_field9" / PascalString(Int8ul, "utf8"),
    "table3_field10" / Float32l,
    "table3_field11" / Float32l
)

data_complex_fmt = Struct(
    "record_len" / Int16ul,
    "record_typ" / Int8ul,
    Switch(this.record_typ, {
        1: Embedded(table1_fmt),
        2: Embedded(table2_fmt),
        3: Embedded(table3_fmt)
    })
)
Diagram of format:
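For reference, parsing an entire file of these records with Construct boils down to wrapping the record definition in GreedyRange (the same wrapper imported in the test suite later). This is a minimal sketch, assuming a Construct version where Embedded (used above) is still available and that the test files generated below exist:

from construct import GreedyRange

# keep consuming complex records until the buffer is exhausted
with open("./test_binary_files/data_complex_small.bin", "rb") as f:
    records = GreedyRange(data_complex_fmt).parse(f.read())
print(len(records), records[0].record_typ)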
From here I coded a rather clunky script for generating 100 KB, 100 MB, and 1 GB files of each format with randomized data:
import os
import random
import string
from construct_format_defs import data_simple_fmt, data_varying_fmt, data_complex_fmt

# generation switches
gen_simple = False
gen_varying = True
gen_complex = False

# output directory
output_dir = "./test_binary_files"

# file names
data_simple_fn = "data_simple"
data_varying_fn = "data_varying"
data_complex_fn = "data_complex"

# desired file sizes
file_size_small = 100*1024      # 100 KiB
file_size_medium = 100*1024**2  # 100 MiB
file_size_big = 1024**3         # 1 GiB

for fname in [data_simple_fn, data_varying_fn, data_complex_fn]:
    if (fname in data_simple_fn and gen_simple) or (fname in data_varying_fn and gen_varying) or \
            (fname in data_complex_fn and gen_complex):
        # build each file type
        with open(os.path.join(output_dir, fname + "_small.bin"), "wb") as f_small, \
                open(os.path.join(output_dir, fname + "_medium.bin"), "wb") as f_medium, \
                open(os.path.join(output_dir, fname + "_large.bin"), "wb") as f_large:
            # initialize file size in bytes
            file_size = 0
            while file_size < file_size_big:
                # Create randomized dictionary
                if fname in data_simple_fn:
                    d_write = {
                        "field1": random.randint(0, 2**16-1),
                        "field2": random.randint(0, 2**8-1),
                        "field3": random.randint(0, 25 * 10000) / 10000,
                        "field4": random.randint(-(2**31), 2**31-1)
                    }
                    # write byte stream to files
                    bytes_data = data_simple_fmt.build(d_write)
                elif fname in data_varying_fn:
                    d_write = {
                        "field1": random.randint(0, 2**16-1),
                        "field2": random.randint(0, 2**8-1),
                        "field3": random.randint(0, 2**16-1),
                        "field4": ''.join(str(chr(random.choice(range(0, 128)))) for x in range(random.randint(1, 20))),
                        "field5": ''.join(random.choice(string.ascii_letters) for x in range(random.randint(1, 20))),
                        "field6": random.randint(-(2**31), 2**31-1)
                    }
                    # calculate what record length is with varying fields
                    d_write["record_len"] = 9 + len(d_write["field4"]) + len(d_write["field5"]) + 2
                    # write byte stream to files
                    bytes_data = data_varying_fmt.build(d_write)
                else:
                    d_write = {
                        "record_typ": random.randint(1, 3)
                    }
                    d_sub_write = {}
                    if d_write["record_typ"] == 1:
                        d_sub_write["table1_field1"] = random.randint(0, 2**16-1)
                        d_sub_write["table1_field2"] = random.randint(0, 2**8-1)
                        d_sub_write["table1_field3"] = ''.join(
                            str(chr(random.choice(range(0, 128)))) for x in range(random.randint(1, 20)))
                        d_sub_write["table1_field4"] = random.randint(-(2**31), 2**31-1)
                        d_write["record_len"] = len(d_sub_write["table1_field3"]) + 7 + 1
                    elif d_write["record_typ"] == 2:
                        d_sub_write["table2_field1"] = ''.join(
                            random.choice(string.ascii_letters) for x in range(random.randint(1, 20)))
                        d_sub_write["table2_field2"] = ''.join(
                            str(chr(random.choice(range(0, 128)))) for x in range(random.randint(1, 20)))
                        d_sub_write["table2_field3"] = ''.join(
                            str(chr(random.choice(range(0, 128)))) for x in range(random.randint(1, 20)))
                        d_write["record_len"] = len(d_sub_write["table2_field1"]) + len(d_sub_write["table2_field2"]) + len(
                            d_sub_write["table2_field3"]) + 3
                    else:
                        d_sub_write["table3_field1"] = random.randint(0, 2**16-1)
                        d_sub_write["table3_field2"] = random.randint(0, 2**32-1)
                        d_sub_write["table3_field3"] = random.randint(0, 2**32-1)
                        d_sub_write["table3_field4"] = random.randint(0, 2**32-1)
                        d_sub_write["table3_field5"] = random.randint(0, 2**32-1)
                        d_sub_write["table3_field6"] = random.randint(0, 2**32-1)
                        d_sub_write["table3_field7"] = random.randint(0, 2**32-1)
                        d_sub_write["table3_field8"] = random.randint(0, 2**32-1)
                        d_sub_write["table3_field9"] = ''.join(
                            random.choice(string.ascii_letters) for x in range(random.randint(1, 20)))
                        d_sub_write["table3_field10"] = random.randint(-(2**31), 2**31-1)
                        d_sub_write["table3_field11"] = random.randint(-(2**31), 2**31-1)
                        d_write["record_len"] = 38 + len(d_sub_write["table3_field9"]) + 1
                    d_write["sub_table"] = d_sub_write
                    # write byte stream to files
                    bytes_data = data_complex_fmt.build(d_write)
                # gate writing if we have surpassed desired file size
                if file_size < file_size_small:
                    f_small.write(bytes_data)
                if file_size < file_size_medium:
                    f_medium.write(bytes_data)
                f_large.write(bytes_data)
                # add bytes data to determine file size
                file_size += len(bytes_data)
File Parsing
Next I coded a test suite for timing the 3 different parsers. To keep things simple, the goal was to parse each of the 3 file formats into memory. I decided to skip the 1 GB file for now due to how long it takes.
import os
import time
import mmap
import struct

# CONSTRUCT IMPORTS
from construct import GreedyRange
from construct_format_defs import data_simple_fmt, data_varying_fmt, data_complex_fmt

# KAITAI IMPORTS/PARSERS
from kaitai_export_simple import DataSimple
from kaitai_export_varying import DataVarying
from kaitai_export_complex import DataComplex

## CONSTANTS
# test directory
test_dir = "./test_binary_files"

# file names
data_simple_fn = "data_simple"
data_varying_fn = "data_varying"
data_complex_fn = "data_complex"

# sizes
small_f = "_small.bin"
med_f = "_medium.bin"
big_f = "_large.bin"

## MY PARSERS
def parse_simple(bin_fh):
    # Can also use: array.array or numpy fromfile read since this is a fixed format
    # precompile struct object
    structobj = struct.Struct("<HBfi")
    # ... (the rest of the homebrew, Construct, and Kaitai parsers, plus the timing
    # harness, are omitted here; see the GitHub repo for the full test suite)
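To give a flavor of the homebrew approach for the varying format, here is a minimal sketch (not the exact code from the repo, and parse_varying_sketch is just an illustrative name): precompile the struct objects once, walk a memoryview of the whole buffer, and use the record-length header to hop from record to record.

import struct

_len_hdr = struct.Struct("<H")       # record_len header
_fixed_head = struct.Struct("<HBH")  # field1, field2, field3
_tail = struct.Struct("<f")          # field6

def parse_varying_sketch(buf):
    # parse every record in a bytes buffer into a list of tuples
    records = []
    mv = memoryview(buf)
    pos = 0
    while pos < len(buf):
        (rec_len,) = _len_hdr.unpack_from(mv, pos)
        pos += 2
        f1, f2, f3 = _fixed_head.unpack_from(mv, pos)
        cur = pos + 5
        # two Pascal strings: a 1-byte length prefix followed by that many bytes
        l4 = mv[cur]
        f4 = bytes(mv[cur + 1:cur + 1 + l4]).decode("utf8")
        cur += 1 + l4
        l5 = mv[cur]
        f5 = bytes(mv[cur + 1:cur + 1 + l5]).decode("utf8")
        cur += 1 + l5
        (f6,) = _tail.unpack_from(mv, cur)
        records.append((f1, f2, f3, f4, f5, f6))
        # record_len counts the bytes after the 2-byte header (as generated above)
        pos += rec_len
    return records

# usage:
# records = parse_varying_sketch(open("./test_binary_files/data_varying_medium.bin", "rb").read())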
Performance Analysis
The results, in increasing order of total time (seconds):

- Homebrew:
  - small size (100 KB):
    - simple format: 0.004
    - varying format: 0.788
    - complex format: 0.3748
  - medium size (100 MB):
    - simple format: 3.706
    - varying format: 5.626
    - complex format: 5.965
- Kaitai Struct:
  - small size (100 KB):
    - simple format: 0.606
    - varying format: 0.028
    - complex format: 0.030
  - medium size (100 MB):
    - simple format: 40.580
    - varying format: 43.815
    - complex format: 46.839
- Construct:
  - small size (100 KB):
    - simple format: 0.252
    - varying format: 2.518
    - complex format: 1.343
  - medium size (100 MB):
    - simple format: 231.170
    - varying format: 107.381
    - complex format: 136.934
I realized too late that holding the file size constant across formats may not be the best way to test. You can see Construct really struggles with the most basic format: its records are the smallest, so a file of that format has by far the highest record count and thus the most iteration operations.
Conclusions and Questions
From my experiments I've concluded that there currently isn't a generic binary file parsing library that can beat pure Python coding from a speed perspective. The one exception I'll grant is the case of fixed-record binary files like the "simple format": numpy has a method documented here that can import a binary file directly into an array as long as the fields are of fixed length. This method is extremely fast, as I believe it is implemented in C.
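That method is, I believe, numpy.fromfile with a structured dtype; for the simple format above it would look roughly like this (the dtype codes mirror the Construct field definitions):

import numpy as np

# packed little-endian dtype matching the fixed "simple" record: uint16, uint8, float32, int32
simple_dtype = np.dtype([("field1", "<u2"), ("field2", "u1"),
                         ("field3", "<f4"), ("field4", "<i4")])

arr = np.fromfile("./test_binary_files/data_simple_medium.bin", dtype=simple_dtype)
print(arr.shape, arr["field3"].mean())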
This brings me to my questions for the software engineering community:
- Am I missing something obvious here or are my pure Python examples about the fastest I can get?
- Is there an obvious place to substitute in functions coded in C, callable from Python, for a massive speed-up?
- Is there a C-based Python library I just haven't found that can parse non-uniform binary files very quickly?
You might be saying "well, ~6 seconds isn't too bad for your complex example," however when I applied this same methodology to code as lean an STDF parser as possible, I was still looking at ~25 seconds for a 100 MB file, which wasn't acceptable for my use case. I'm extremely interested to hear feedback from the community on this. At the end of the day I may just need to switch to a different language; a coworker and I were able to accomplish the same task in ~1 second using Golang.
GitHub for Project
Please see my GitHub here for all the code described in this project.
Parsing binary records with struct
The struct module provides functions to parse fields of bytes into a tuple of Python objects, and to perform the opposite conversion, from a tuple into packed bytes. struct can be used with bytes, bytearray, and memoryview objects.
The struct module is powerful and convenient, but before using it you should seriously consider alternatives, so that’s the first short section in this post.
Should we use struct?
Proprietary binary records in the real world are brittle and can be corrupted easily. The super simple example in Struct 101 will expose one of many caveats: a string field may be limited only by its size in bytes, it may be padded by spaces, or it may contain a null-terminated string followed by random garbage up to a certain size. There is also the problem of endianness: the order of the bytes used to represent integers and floats, which depends on the CPU architecture.
If you need to read or write from an existing binary format, I recommend trying to find a library that is ready to use instead of rolling your own solution.
If you need to exchange binary data among in-company Python systems, the pickle module is the easiest way—but beware that different versions of Python use different binary formats by default, and reading a pickle may run arbitrary code, so it’s not safe for external use.
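If you take the pickle route, pinning the protocol explicitly sidesteps the cross-version default issue; a minimal sketch:

import pickle

record = {"year": 2018, "name": "Tokyo", "population": 43_868_228}
blob = pickle.dumps(record, protocol=4)   # protocol 4 is readable on Python 3.4 and later
assert pickle.loads(blob) == record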
If the exchange involves programs in other languages, use JSON or a multi-platform binary serialization format like MessagePack or Protocol Buffers.
Struct 101
Suppose you need to read a binary file containing data about metropolitan areas, produced by a program in C with a record defined as in Example 1.
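Example 1 is not reproduced here, but the layout implied by the format string used next ('i12s2sf', native alignment) is an int year, a 12-byte name buffer, a 2-byte country code, and a float population. A rough ctypes equivalent (field names guessed from the values unpacked below) would be:

import ctypes

class MetroArea(ctypes.Structure):   # hypothetical reconstruction of Example 1
    _fields_ = [
        ("year", ctypes.c_int),          # 'i'
        ("name", ctypes.c_char * 12),    # '12s'
        ("country", ctypes.c_char * 2),  # '2s'
        ("population", ctypes.c_float),  # 'f' (preceded by 2 alignment padding bytes)
    ]

print(ctypes.sizeof(MetroArea))  # 24, matching struct.calcsize('i12s2sf')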
Here is how to read one record in that format, using struct.unpack :
>>> from struct import unpack, calcsize
>>> FORMAT = 'i12s2sf'
>>> size = calcsize(FORMAT)
>>> data = open('metro_areas.bin', 'rb').read(size)
>>> data
b"\xe2\x07\x00\x00Tokyo\x00\xc5\x05\x01\x00\x00\x00JP\x00\x00\x11X'L"
>>> unpack(FORMAT, data)
(2018, b'Tokyo\x00\xc5\x05\x01\x00\x00\x00', b'JP', 43868228.0)
Note how unpack returns a tuple with four fields, as specified by the FORMAT string. The letters and numbers in FORMAT are Format Characters described in the struct module documentation.
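The name field still carries a trailing null byte and leftover garbage from the 12-byte buffer; one common way (not necessarily the only one) to clean it up:

>>> year, name, country, population = unpack(FORMAT, data)
>>> name.split(b'\0', 1)[0].decode('ascii')
'Tokyo'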
Table 1 explains the elements of the format string from Example 2.
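Table 1 itself is not reproduced here, but the format string can be decoded element by element with calcsize; note that in native mode two alignment padding bytes are inserted before the 'f' field, which is why the full format occupies 24 bytes rather than 22:

>>> [(code, calcsize(code)) for code in ('i', '12s', '2s', 'f')]
[('i', 4), ('12s', 12), ('2s', 2), ('f', 4)]
>>> calcsize(FORMAT)
24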