Fast XLSX parsing with Python
So you have tried all the tools and libraries, but extracting data from
giant Excel sheets is still too slow?
On the menu for this blog post we have:
- Quick dive into the structure of XLSX files
- The lxml library and the XSLT transformation
- How to use XSLT to parse XLSX sheets
- Full listing of the code
While the method presented below is fast, it cuts a lot of
corners: it is only meant to extract raw data.
XLSX file structure and its performance implications
Basically, an XLSX file is a zip archive containing a bunch of XML files. This is how a basic sheet looks in Excel:
And this is the content of the archive (seen inside emacs) corresponding
to the above screenshot:
```
 Filemode      Length  Date         Time      File
- ----------  -------- -----------  --------  -----------------------------------
 -rw-rw-rw-      1811   1-Jan-1980  00:00:00  [Content_Types].xml
 -rw-rw-rw-       588   1-Jan-1980  00:00:00  _rels/.rels
 -rw-rw-rw-      1266   1-Jan-1980  00:00:00  xl/_rels/workbook.xml.rels
 -rw-rw-rw-      1883   1-Jan-1980  00:00:00  xl/workbook.xml
 -rw-rw-rw-       199   1-Jan-1980  00:00:00  xl/sharedStrings.xml
 -rw-rw-rw-       304   1-Jan-1980  00:00:00  xl/worksheets/_rels/sheet1.xml.rels
 -rw-rw-rw-      8390   1-Jan-1980  00:00:00  xl/theme/theme1.xml
 -rw-rw-rw-      1618   1-Jan-1980  00:00:00  xl/styles.xml
 -rw-rw-rw-      1464   1-Jan-1980  00:00:00  xl/worksheets/sheet1.xml
 -rw-rw-rw-       235   1-Jan-1980  00:00:00  customXml/itemProps2.xml
 -rw-rw-rw-        72   1-Jan-1980  00:00:00  xl/customProperty1.bin
 -rw-rw-rw-       296   1-Jan-1980  00:00:00  customXml/_rels/item1.xml.rels
 -rw-rw-rw-       296   1-Jan-1980  00:00:00  customXml/_rels/item2.xml.rels
 -rw-rw-rw-       296   1-Jan-1980  00:00:00  customXml/_rels/item3.xml.rels
 -rw-rw-rw-       296   1-Jan-1980  00:00:00  customXml/_rels/item4.xml.rels
 -rw-rw-rw-        41   1-Jan-1980  00:00:00  customXml/item1.xml
 -rw-rw-rw-       235   1-Jan-1980  00:00:00  customXml/itemProps1.xml
 -rw-rw-rw-       991   1-Jan-1980  00:00:00  docProps/app.xml
 -rw-rw-rw-        49   1-Jan-1980  00:00:00  customXml/item3.xml
 -rw-rw-rw-       235   1-Jan-1980  00:00:00  customXml/itemProps3.xml
 -rw-rw-rw-        49   1-Jan-1980  00:00:00  customXml/item4.xml
 -rw-rw-rw-       235   1-Jan-1980  00:00:00  customXml/itemProps4.xml
 -rw-rw-rw-       623   1-Jan-1980  00:00:00  docProps/core.xml
 -rw-rw-rw-        41   1-Jan-1980  00:00:00  customXml/item2.xml
- ----------  -------- -----------  --------  -----------------------------------
                 21513                         24 files
```
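An XLSX archive can be inspected the same way with the standard library's zipfile module. A minimal, self-contained sketch (it builds a tiny in-memory archive with made-up member names instead of opening a real example.xlsx):

```python
import io
import zipfile

# Build a tiny zip in memory so the sketch runs without a real XLSX
# file; an actual file would be opened with zipfile.ZipFile(path).
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w') as zf:
    zf.writestr('xl/workbook.xml', '<workbook/>')
    zf.writestr('xl/sharedStrings.xml', '<sst/>')

# List the members, like the listing above
with zipfile.ZipFile(buf) as zf:
    for info in zf.infolist():
        print(info.filename, info.file_size)
# xl/workbook.xml 11
# xl/sharedStrings.xml 6
```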
Workbook
The usual entry point is the xl/workbook.xml file, which contains a
mapping between the sheet names and the corresponding files. In our
minimal example the exercise is pointless since we only have one sheet,
but let’s have a look anyway:
For the sake of brevity, most of the content has been replaced by dots.
So to extract this information we use XPath, but because this file
is usually rather small, any other method would fit the bill.
```python
import zipfile

from lxml import etree

ns = {'ns': 'http://schemas.openxmlformats.org/spreadsheetml/2006/main'}
fh = zipfile.ZipFile(file_path)
name = 'xl/workbook.xml'
root = etree.parse(fh.open(name))
workbook = {}
for el in etree.XPath("//ns:sheet", namespaces=ns)(root):
    workbook[el.attrib['name']] = el.attrib['sheetId']
```
So after running the above code, workbook is a dictionary mapping each
sheet name to the corresponding id, in our case {'Sheet1': '1'}.
Shared strings
Similarly, we also have to extract data out of
xl/sharedStrings.xml. It contains all the strings that appear
in the sheets: when a cell contains a string, Excel actually stores
an id that refers to the position of the corresponding string in the
sharedStrings.xml file. In our example, its content boils down to the
two strings “Ham” and “Spam”.
So parsing this XML is also done with XPath:

```python
name = 'xl/sharedStrings.xml'
root = etree.parse(fh.open(name))
res = etree.XPath("/ns:sst/ns:si/ns:t", namespaces=ns)(root)
shared = {str(pos): el.text for pos, el in enumerate(res)}
```
This shared string structure is one of the indirections (but not the
only one) you have to resolve when parsing XLSX files. There is no easy
way to extract the information as-is, so the raw representation has to
be post-processed; that is why Python suffers when parsing XLSX.
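To make the indirection concrete, here is a toy sketch (plain Python, with made-up cell tuples, not part of the parser itself) of how string cells are resolved through the shared table:

```python
# Shared table as built from sharedStrings.xml: position -> string
shared = {'0': 'Ham', '1': 'Spam'}

# (cell, type, raw value) triples; type 's' means "shared string",
# so the raw value is an index into the table, not the text itself
raw_cells = [('A1', 's', '0'), ('B1', 's', '1'), ('A2', None, '1')]

resolved = [
    (ref, shared[val] if kind == 's' else val)
    for ref, kind, val in raw_cells
]
print(resolved)  # [('A1', 'Ham'), ('B1', 'Spam'), ('A2', '1')]
```

Note that A1 and A2 both carry the raw value "0"-style indices or numbers; only the type attribute tells them apart.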
La pièce de résistance
Now that we have collected the workbook and shared string data, we can
turn our attention to the sheets themselves. This is where performance
matters, because those are the parts that grow when the sheet content
gets larger.
For our example, xl/worksheets/sheet1.xml looks like this (only the
sheetData element is shown):

```xml
<sheetData>
  <row r="1">
    <c r="A1" t="s"><v>0</v></c>
    <c r="B1" t="s"><v>1</v></c>
  </row>
  <row r="2">
    <c r="A2"><v>1</v></c>
    <c r="B2"><v>2</v></c>
  </row>
  <row r="3">
    <c r="A3"><v>3</v></c>
    <c r="B3"><v>4</v></c>
  </row>
  <row r="4">
    <c r="A4"><v>5</v></c>
    <c r="B4"><v>6</v></c>
  </row>
</sheetData>
```
To be able to extract the sheet content we have to collect the cell
values, their types and their positions:
- A cell is of type string when the cell element (c) carries t="s".
- The cell position is given by the r attribute (e.g. r="B3"). We
have to rely on this attribute because Excel sometimes likes to
shuffle those.
- The cell value is in the v element (which can be missing if the
actual cell is empty).
So you can see that cell A1 contains the shared string 0, which is
“Ham”, and B1 contains the shared string 1, “Spam”.
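The r attribute packs the column letters and the row number into one token. A small helper (hypothetical, not part of the post's code, which does the equivalent split with a vectorized str.replace) makes the convention explicit:

```python
import re

def split_ref(ref):
    # 'B3' -> ('B', 3), 'AA99' -> ('AA', 99): the letters are the
    # column label, the digits are the 1-based row number
    letters, digits = re.fullmatch(r'([A-Z]+)(\d+)', ref).groups()
    return letters, int(digits)

print(split_ref('B3'))    # ('B', 3)
print(split_ref('AA99'))  # ('AA', 99)
```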
Parsing this info is not doable with XPath, even in several passes (we
would lose the relations between the different bits of info). Another
way would be to use a DOM parser and then walk across the
structure to gather what we want; it works, but it tends to be slow and
memory-heavy on large files.
The ugly duckling saves the day
XSLT is an XML language to transform XML into XML (and it’s Turing-complete!), so
nobody wants to deal with it. But:
- It’s available in lxml and it’s fast, at least faster than a
pure-Python equivalent.
- It can spit out any text content (not only XML).
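The second point is what makes the trick possible: with xsl:output method="text", the transformation result is plain text. A minimal sketch, unrelated to XLSX, assuming lxml is installed:

```python
from lxml import etree

# A stylesheet that flattens <item> elements into comma-separated text
xslt = etree.XML('''
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <xsl:for-each select="//item">
      <xsl:value-of select="."/>
      <xsl:text>,</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
''')

doc = etree.XML('<root><item>a</item><item>b</item></root>')
result = etree.XSLT(xslt)(doc)
print(str(result))  # a,b,
```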
The strategy here is to use XSLT to generate a CSV (it’s fast) and
parse it again with pandas’ read_csv (it’s fast too). As long as
pandas does not provide a read_xml method, it’s the best way to plug
those two together.
```python
sheet_xslt = etree.XML('''
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:ns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <xsl:for-each select="//ns:c">
      <xsl:value-of select="../@r"/>
      <xsl:text>,</xsl:text>
      <xsl:value-of select="@r"/>
      <xsl:text>,</xsl:text>
      <xsl:value-of select="@t"/>
      <xsl:text>,</xsl:text>
      <xsl:value-of select="ns:v"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
''')
```
It’s ugly and it took a lot of trial and error to get right. But
the most difficult part was to figure out which combination of
namespaces would appease the gods of XML.
The result of this transformation is a CSV that contains one line per
Excel cell and whose columns are ‘row’, ‘cell’, ‘type’ and ‘value’.
Once we have the correct XSLT, it’s straightforward to plug the output
of the transformation into read_csv :
```python
import io

sheet_name = 'Sheet1'
sheet_id = workbook[sheet_name]
sheet_path = 'xl/worksheets/sheet%s.xml' % sheet_id
root = etree.parse(fh.open(sheet_path))
transform = etree.XSLT(sheet_xslt)
result = transform(root)
df = read_csv(io.StringIO(str(result)),
              header=None,
              dtype=str,
              names=['row', 'cell', 'type', 'value'],
              )
```
So at this point in the code, you have a DataFrame which reflects the
XML content of the sheet, something like this:
```
(Pdb) print(df)
  row cell type value
0   1   A1    s     0
1   1   B1    s     1
2   2   A2  NaN     1
3   2   B2  NaN     2
4   3   A3  NaN     3
5   3   B3  NaN     4
6   4   A4  NaN     5
7   4   B4  NaN     6
```
```python
# First row numbers are sometimes filled with nan
df['row'] = to_numeric(df['row'].fillna(0))
# Translate string contents
cond = (df.type == 's') & (~df.value.isnull())
df.loc[cond, 'value'] = df[cond]['value'].map(shared)
# Add column label ('AA99' becomes 'AA')
df['col'] = df.cell.str.replace(r'\d+', '', regex=True)
df = df.sort_values(by='row')
# Pivot everything
df = df.pivot(
    index='row', columns='col', values='value'
).reset_index(drop=True)
df.columns.name = None  # pivot adds a name to the "columns" array
# Sort columns (pivot will put AA before B)
cols = sorted(df.columns, key=lambda x: (len(x), x))
df = df[cols]
df = df.dropna(how='all')          # Ignore empty lines
df = df.dropna(how='all', axis=1)  # Ignore empty cols
```
As you can see, there is no for-loop in Python: every low-level
operation is done through vectorized pandas methods. This is the
resulting DataFrame (yay!):
```
     A     B
0  Ham  Spam
1    1     2
2    3     4
3    5     6
```
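The pivot step is the one doing the heavy lifting: it turns the one-line-per-cell layout into a row-by-column grid. A reduced sketch of the same call on hand-written data:

```python
from pandas import DataFrame

# One line per cell, as produced by the XSLT + read_csv stage
cells = DataFrame({
    'row':   [1, 1, 2, 2],
    'col':   ['A', 'B', 'A', 'B'],
    'value': ['Ham', 'Spam', '1', '2'],
})

# Rows become the index, column labels become the columns
grid = cells.pivot(index='row', columns='col', values='value')
print(grid)
```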
Benchmark & Closing words
The XSLT method is roughly twice as fast as pandas’ read_excel . The
benchmark was run on a large sheet of 537 lines and 341 columns:
```
$ python fast_xlsx_parsing.py
xslt: 3.311
pandas: 6.248
```
Incidentally, googling for “xslt” “pandas” “csv” returns this
unanswered StackOverflow question with the same approach;
as usual, it’s easier to find a solution when you already know the
answer. So I’m not the first to combine XSLT and read_csv , but I
probably am the first to do so in the context of XLSX file parsing.
Full code
```python
import io
import zipfile

from lxml import etree
from pandas import read_csv, to_numeric


class XLSX:

    sheet_xslt = etree.XML('''
    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:ns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
      <xsl:output method="text"/>
      <xsl:template match="/">
        <xsl:for-each select="//ns:c">
          <xsl:value-of select="../@r"/>
          <xsl:text>,</xsl:text>
          <xsl:value-of select="@r"/>
          <xsl:text>,</xsl:text>
          <xsl:value-of select="@t"/>
          <xsl:text>,</xsl:text>
          <xsl:value-of select="ns:v"/>
          <xsl:text>&#10;</xsl:text>
        </xsl:for-each>
      </xsl:template>
    </xsl:stylesheet>
    ''')

    def __init__(self, file_path):
        self.ns = {
            'ns': 'http://schemas.openxmlformats.org/spreadsheetml/2006/main',
        }
        self.fh = zipfile.ZipFile(file_path)
        self.shared = self.load_shared()
        self.workbook = self.load_workbook()

    def load_workbook(self):
        # Load the sheet name -> sheet id mapping
        name = 'xl/workbook.xml'
        root = etree.parse(self.fh.open(name))
        res = {}
        for el in etree.XPath("//ns:sheet", namespaces=self.ns)(root):
            res[el.attrib['name']] = el.attrib['sheetId']
        return res

    def load_shared(self):
        # Load shared strings
        name = 'xl/sharedStrings.xml'
        root = etree.parse(self.fh.open(name))
        res = etree.XPath("/ns:sst/ns:si/ns:t", namespaces=self.ns)(root)
        return {str(pos): el.text for pos, el in enumerate(res)}

    def _parse_sheet(self, root):
        transform = etree.XSLT(self.sheet_xslt)
        result = transform(root)
        df = read_csv(io.StringIO(str(result)),
                      header=None,
                      dtype=str,
                      names=['row', 'cell', 'type', 'value'],
                      )
        return df

    def read(self, sheet_name):
        sheet_id = self.workbook[sheet_name]
        sheet_path = 'xl/worksheets/sheet%s.xml' % sheet_id
        root = etree.parse(self.fh.open(sheet_path))
        df = self._parse_sheet(root)
        # First row numbers are sometimes filled with nan
        df['row'] = to_numeric(df['row'].fillna(0))
        # Translate string contents
        cond = (df.type == 's') & (~df.value.isnull())
        df.loc[cond, 'value'] = df[cond]['value'].map(self.shared)
        # Add column label ('AA99' becomes 'AA') and sort rows
        df['col'] = df.cell.str.replace(r'\d+', '', regex=True)
        df = df.sort_values(by='row')
        # Pivot everything
        df = df.pivot(
            index='row', columns='col', values='value'
        ).reset_index(drop=True)
        df.columns.name = None  # pivot adds a name to the "columns" array
        # Sort columns (pivot will put AA before B)
        cols = sorted(df.columns, key=lambda x: (len(x), x))
        df = df[cols]
        df = df.dropna(how='all')          # Ignore empty lines
        df = df.dropna(how='all', axis=1)  # Ignore empty cols
        return df


if __name__ == '__main__':
    xlsx = XLSX('example.xlsx')
    print(xlsx.read('Sheet1'))
```