Python pdf extract tables

Contents

How to extract tables from PDF using Python Pandas and tabula-py
A quick and ready script to extract repetitive tables from PDF
Extract Regions names
Extract Table from PDF using Python
Method 1:
Step 1: Import library and define file path
Step 2: Extract table from PDF file
Step 3: Write dataframe to CSV file
Method 2:
Extract multiple tables from a single page of PDF using Python
Extract all tables from PDF using Python
Conclusion
How to Extract Table from PDF with Python and Pandas
1: Extract tables from PDF with Python
2: Extract tables from PDF — keep format
2.1 Convert PDF to HTML
2.2 Extract tables with Pandas
2.3 HTMLTableParser
3. Python Libraries for extraction from PDF files
3.1 Python PDF parsing
3.2 Parse HTML tables
3.3 Example PDF files

How to extract tables from PDF using Python Pandas and tabula-py

A quick and ready script to extract repetitive tables from PDF

This tutorial improves on my previous post, where I extracted multiple tables without Python pandas. Here I use the same PDF file as in that post, but I manipulate the extracted tables with Python pandas.

The code for this tutorial can be downloaded from my GitHub repository.

Almost all the pages of the analysed PDF file have the following structure:

In the top-right part of the page, there is the name of the Italian region, while in the bottom-right part of the page there is a table.

I want to extract both the region name and the table from every page. To do so, I need to define a bounding box for each of them. The full procedure to measure margins is illustrated in my previous post, in the section Define margins.

This script implements the following steps:

define the bounding box, represented as a list of the form [top, left, bottom, right]. I measure these values in cm, so they must be converted to PDF points, the unit tabula-py expects. I set the conversion factor fc = 28.28 (the exact value is 72 / 2.54, about 28.35, since one inch is 72 points).
extract the data using the read_pdf() function
save the data to a pandas dataframe.

Extract Regions names

Firstly, I define the bounding box to extract the regions:

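# Bounding box of the region name in cm: [top, left, bottom, right]
box = [1.5, 22, 3.8, 26.741]
fc = 28.28  # conversion factor from cm to PDF points

# Convert every coordinate from cm to PDF points
for i in range(len(box)):
    box[i] *= fc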
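The listing stops at this point in this copy. As a minimal sketch of the remaining steps, the converted box could be passed to read_pdf() through its area argument. The helper names cm_box_to_points and read_region, and the read_pdf parameter (a hook for trying the helper without Java installed), are assumptions made for illustration and are not the author's code.

```python
CM_TO_POINTS = 28.28  # factor used in this tutorial (72 / 2.54 is about 28.35)


def cm_box_to_points(box_cm, factor=CM_TO_POINTS):
    """Scale a [top, left, bottom, right] box from centimetres to PDF points."""
    return [value * factor for value in box_cm]


def read_region(pdf_path, page, box_points, read_pdf=None):
    """Read only the given area of one page and return what read_pdf returns."""
    if read_pdf is None:
        # Imported lazily: tabula-py needs a Java runtime to be installed.
        from tabula import read_pdf
    # guess=False keeps tabula from guessing the table area on its own.
    return read_pdf(pdf_path, pages=page, area=box_points, guess=False)
```

With a placeholder file name, read_region("file.pdf", 1, cm_box_to_points([1.5, 22, 3.8, 26.741])) would read the region-name area of page 1.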


Extract Table from PDF using Python

In this tutorial we will discuss how to extract a table from a PDF file using Python.

The table we want is on the first page of the PDF file. We can extract it to a CSV file or a DataFrame using Python:

Method 1:

Step 1: Import library and define file path

Step 2: Extract table from PDF file
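The listings for Steps 1 and 2 are missing from this copy. A minimal sketch, assuming tabula-py is installed, might look like the following. The function name read_page_tables, the placeholder path sample.pdf and the read_pdf parameter (a hook for trying the helper without Java) are hypothetical and do not come from the original article.

```python
def read_page_tables(pdf_path, page="1", read_pdf=None):
    """Return the tables found on one page as a list of DataFrames."""
    if read_pdf is None:
        # Step 1: import the library (tabula-py needs a Java runtime).
        import tabula
        read_pdf = tabula.read_pdf
    # Step 2: read_pdf returns one DataFrame per table it finds on the page.
    return read_pdf(pdf_path, pages=page)


# Hypothetical usage; "sample.pdf" is a placeholder path:
# dfs = read_page_tables("sample.pdf", page="1")
```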

The above code reads the first page of the PDF file, searching for tables, and appends each table as a DataFrame into a list of DataFrames dfs.

Here we expect only a single table, so the length of the dfs list, len(dfs), should be 1.

You can also validate the result by displaying the contents of the first element in the list, dfs[0]:

   Number of Coils Number of Paperclips
0                5              3, 5, 4
1               10              7, 8, 6
2               15           11, 10, 12
3               20           15, 13, 14

Step 3: Write dataframe to CSV file

Simply write the DataFrame to a CSV file in the same directory by calling to_csv() on dfs[0].

Method 2:

This method will produce the same result, and rather than going step-by-step, the library provides a one-line solution:
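The one-liner itself is missing from this copy. tabula-py provides a convert_into() function that writes the extracted tables directly to a file, so a sketch under that assumption could be the following. The wrapper name pdf_page_to_csv and the convert_into parameter (a hook for trying the wrapper without Java) are hypothetical.

```python
def pdf_page_to_csv(pdf_path, csv_path, page="1", convert_into=None):
    """Write the tables found on one page straight to a CSV file."""
    if convert_into is None:
        # Imported lazily: tabula-py needs a Java runtime to be installed.
        import tabula
        convert_into = tabula.convert_into
    # The single call that replaces Steps 2 and 3 of Method 1.
    convert_into(pdf_path, csv_path, output_format="csv", pages=page)
    return csv_path
```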

Both of the above methods are easy to use when you are sure that there is only one table on a particular page.

In the next section we will explore how to adjust the code when working with multiple tables.

Extract multiple tables from a single page of PDF using Python

We want to extract the two tables on page 2 of the PDF file.

Using Method 1 from the previous section, we can extract each table as a DataFrame and create a list of DataFrames.

Notice that in this case we set pages='2', since we are extracting tables from page 2 of the PDF file.

Check that the list contains two DataFrames; len(dfs) should be 2.

Now that the list contains more than one DataFrame, each can be written to a separate CSV file using a for loop.
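The loop is missing from this copy. A minimal sketch, assuming dfs is the list of DataFrames from the previous step, is shown here. The function name write_tables_to_csv and the prefix parameter are hypothetical, and the file names follow the table_0.csv, table_1.csv pattern used in this section.

```python
def write_tables_to_csv(dfs, prefix="table"):
    """Write each DataFrame in dfs to its own CSV file and return the names."""
    file_names = []
    for i, df in enumerate(dfs):
        file_name = f"{prefix}_{i}.csv"  # table_0.csv, table_1.csv, ...
        df.to_csv(file_name)
        file_names.append(file_name)
    return file_names
```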

You should get two CSV files: table_0.csv and table_1.csv.

Note: if you try to use Method 2 described in the previous section, it will write both tables into a single CSV file, and you would need to split it into two files manually.

Extract all tables from PDF using Python

In the above sections we focused on extracting tables from a given single page (page 1 or page 2). Now what do we do if we simply want to get all of the tables from the PDF file into different CSV files?

This is easily solved with the tabula-py library. The code is almost identical to the previous part. The only change we need to make is to set pages='all', so that the code extracts all of the tables it finds as DataFrames and creates a list of them.

Check that the list contains all three DataFrames; len(dfs) should be 3.

Now that the list contains more than one DataFrame, each can be written to a separate CSV file using a for loop.
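Putting the two steps of this section together, a minimal sketch could read every table with pages="all" and write each one to its own file. The function name extract_all_tables, the out_dir parameter and the read_pdf parameter (a hook for trying the helper without Java) are assumptions made for illustration.

```python
from pathlib import Path


def extract_all_tables(pdf_path, out_dir=".", read_pdf=None):
    """Read every table in the PDF and write each to out_dir/table_<i>.csv."""
    if read_pdf is None:
        # Imported lazily: tabula-py needs a Java runtime to be installed.
        import tabula
        read_pdf = tabula.read_pdf
    dfs = read_pdf(pdf_path, pages="all")  # one DataFrame per detected table
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    csv_paths = []
    for i, df in enumerate(dfs):
        csv_path = out_dir / f"table_{i}.csv"
        df.to_csv(csv_path)
        csv_paths.append(csv_path)
    return csv_paths
```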

Conclusion

In this article we discussed how to extract tables from PDF files using the tabula-py library.


How to Extract Table from PDF with Python and Pandas

In this short tutorial, we’ll see how to extract tables from PDF files with Python and Pandas.

We will cover two cases of table extraction from PDF:

(1) Simple table with tabula-py

from tabula import read_pdf

df_temp = read_pdf('china.pdf')

(2) Table with merged cells

from io import StringIO
import pandas as pd

html_tables = pd.read_html(StringIO(page))  # wrap the HTML string in StringIO

Let’s cover both examples in more detail as context is important.

1: Extract tables from PDF with Python

In this example we will extract multiple tables from a remote PDF file: china.pdf.

We will use the tabula-py library, which can be installed by:

pip install tabula-py

The PDF file contains 2 tables:

from tabula import read_pdf

file = 'https://raw.githubusercontent.com/tabulapdf/tabula-java/master/src/test/resources/technology/tabula/china.pdf'
df_temp = read_pdf(file, stream=True)

After reading the data we can get a list of DataFrames which contain table data.

FLA Audit Profile	Unnamed: 0
0	Country	China
1	Factory name	01001523B
2	IEM	BVCPS (HK), Shen Zhen Office
3	Date of audit	May 20-22, 2003
4	PC(s)	adidas-Salomon
5	Number of workers	243
6	Product(s)	Scarf, cap, gloves, beanies and headbands
7	Production processes	Sewing, cutting, packing, embroidery, die-cutting

This exactly matches the first table in the PDF file.

The second one looks odd, because its merged cells are extracted as NaN values:

Unnamed: 0	Unnamed: 1	Unnamed: 2	Findings	Unnamed: 3
0	FLA Code/ Compliance issue	Legal Reference / Country Law	FLA Benchmark	Monitor’s Findings	NaN
1	1. Code Awareness	NaN	NaN	NaN	NaN
2	2. Forced Labor	NaN	NaN	NaN	NaN
3	3. Child Labor	NaN	NaN	NaN	NaN
4	4. Harassment or Abuse	NaN	NaN	NaN	NaN

Some cells are also split across multiple rows. We will see how to work around these problems in the next step.

2: Extract tables from PDF — keep format

Tables in PDF files often have merged cells or cells that span multiple rows. Most libraries and software cannot extract such tables reliably.

To extract a complex table from a PDF file with Python and Pandas, we will:

download the file (this step is optional)
convert the PDF file to HTML
extract the tables with Pandas

2.1 Convert PDF to HTML

First we will download the file from: china.pdf.
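The download step can be done with the standard library alone. The following is a minimal sketch: the helper name download_file is hypothetical, and the URL is the china.pdf address used in section 1.

```python
import shutil
from urllib.request import urlopen

CHINA_PDF_URL = (
    "https://raw.githubusercontent.com/tabulapdf/tabula-java/master/"
    "src/test/resources/technology/tabula/china.pdf"
)


def download_file(url, dest_path):
    """Stream the resource at url into dest_path and return dest_path."""
    with urlopen(url) as response, open(dest_path, "wb") as out_file:
        shutil.copyfileobj(response, out_file)
    return dest_path


# Hypothetical usage (needs network access):
# download_file(CHINA_PDF_URL, "china.pdf")
```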

Then we will convert it to HTML with the library: pdftotree.

import pdftotree

page = pdftotree.parse('china.pdf', html_path=None, model_type=None, model_path=None, visualize=False)

The library can be installed by:

pip install pdftotree

2.2 Extract tables with Pandas

Finally we can read all the tables from this page with Pandas:

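from io import StringIO
import pandas as pd

# Recent pandas versions deprecate passing a literal HTML string, so wrap it in StringIO
html_tables = pd.read_html(StringIO(page))
html_tables[1]

This gives better results than tabula-py for this table.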

2.3 HTMLTableParser

As an alternative to Pandas, we can use the html-table-parser-python3 library to parse the HTML tables into Python lists.

from html_table_parser.parser import HTMLTableParser

p = HTMLTableParser()
p.feed(page)
print(p.tables[0])

It converts the HTML table to a Python list:

[['', ''], ['Country', 'China'], ['Factory name', '01001523B'], ['IEM', 'BVCPS (HK), Shen Zhen Office'], ['Date of audit', 'May 20-22, 2003'], ['PC(s)', 'adidas-Salomon'], ['Number of workers', '243'], ['Product(s)', 'Scarf, cap, gloves, beanies and headbands']]

Now we can convert the list to a Pandas DataFrame:

import pandas as pd

pd.DataFrame(p.tables[1])
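For intuition about what turning an HTML table into a list of rows involves, here is a sketch built only on the standard library's html.parser. It is not the implementation of html-table-parser-python3: the class name SimpleTableParser is hypothetical, and the sketch ignores nested tables, rowspan and colspan. The sample HTML reuses two rows from the first table above.

```python
from html.parser import HTMLParser


class SimpleTableParser(HTMLParser):
    """Collect every <table> as a list of rows, each row a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of rows per table
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


html = (
    "<table>"
    "<tr><th>Country</th><td>China</td></tr>"
    "<tr><th>Number of workers</th><td>243</td></tr>"
    "</table>"
)
parser = SimpleTableParser()
parser.feed(html)
print(parser.tables[0])
# prints [['Country', 'China'], ['Number of workers', '243']]
```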

To install this library we can do:

pip install html-table-parser-python3

There are two differences to Pandas:

3. Python Libraries for extraction from PDF files

Finally, here is a list of useful Python libraries that can help with PDF parsing and extraction:

3.1 Python PDF parsing

3.2 Parse HTML tables

html-table-parser-python3 — parse HTML tables with Python 3 to list of values
tablextract — extracts the information represented in any HTML table
pdftotree — convert PDF into hOCR with text, tables, and figures being recognized and preserved.
pandas.read_html
html-table-extractor — A python library for extracting data from html table
py-html-table — Python library to extract data from HTML Tables with rowspan

3.3 Example PDF files

Finally you can find example PDF files where you can test table extraction with Python and Pandas:
