Объединение пдф файлов python

Содержание

How to merge PDF documents using Python
Prerequisites
Install PyPDF2 using Pip
Merging Pdf Documents inside a Subfolder
Create the Python script to merge PDF documents
Python’s code to extract all the files under the Parent_folder
Python’s code to Merge the PDF documents
Complete code to merge PDFs in Python
Conclusions
Разделение, объединение и поворот PDF-документов на Python с помощью borb
Установка borb
Разделение PDF с помощью borb
Разделение PDF-документов на Python
Объединение PDF-документов в Python
Поворот страниц в PDF-документах на Python
Merge PDF Files using Python
Merge two PDF files using Python
Merge many PDF files using Python
Conclusion

How to merge PDF documents using Python

It’s easy to merge PDF documents using Python code, if you use the right tools!

Nowadays PDF format is the most popular file format to share documents on the internet. Many software is available to manage PDF documents, but when you need to perform some easy task on them, it is not so easy to find the best one to use. One example is merging multiple PDF files into one file.

In this guide, we will show how you can merge many PDF documents into one file using a simple Python script. In particular, to merge two PDF documents in Python we need to:

Let’s now see in the following sections how to do it in more details.

Prerequisites

First, we need to make sure that our development environment meets all the minimum requirements we need. In particular, we should already have installed:

After we met all the minimum requirements, we can move on to the next steps.

Install PyPDF2 using Pip

In Python, there are different libraries that allow us to manipulate PDF documents, and some of them are pretty complex to use. In this guide, we use PyPDF2, which is a simple Python library that we can use also to merge multiple PDF documents.

We can use Pip, the Python’s package installer, to install PyPDF2. To do so, we simply need to run the following command:

python3 -m pip install PyPdf2==1.26.0

Merging Pdf Documents inside a Subfolder

When you have to merge lots of PDF documents, it is really annoying to pass all the absolute paths of the files. So, to save time in coding the paths we can just put all the PDF files in a parent directory, and then Python will handle automatically all the paths of the PDF files for us.

To do so, make sure your project structure looks like this:

\main.py \parent_folder \folder_1 file_1.pdf file_2.pdf \folder_2 file_3.pdf \folder_3 \folder_4 file_4.pdf file_5.pdf file_6.pdf

Note that all the PDF files should be inside the \parent_folder .

Create the Python script to merge PDF documents

After setting up the environment, it’s finally time to code our script.

Python’s code to extract all the files under the Parent_folder

To get all the paths of the PDF files under the parent folder we can just use the following code:

import os # pass the path of the parent_folder def fetch_all_files(parent_folder: str): target_files = [] for path, subdirs, files in os.walk(parent_folder): for name in files: target_files.append(os.path.join(path, name)) return target_files # get a list of all the paths of the pdf extracted_files = fetch_all_files('./parent_folder')

Basically, we use the fetch_all_files function to recursively find all PDF files located into the parent_folder and it’s subfolders.

Python’s code to Merge the PDF documents

Now that we have the list of all the PDF files paths, we can use the following code to merge them into a unique PDF file:

from PyPDF2 import PdfFileMerger # pass the path of the output final file.pdf and the list of paths def merge_pdf(out_path: str, extracted_files: list [str]): merger = PdfFileMerger() for pdf in extracted_files: merger.append(pdf) merger.write(out_path) merger.close() merge_pdf('./final.pdf', extracted_files)

Complete code to merge PDFs in Python

Simple as that! We can now put all the final code in a ./main.py file:

#main.py import os from PyPDF2 import PdfFileMerger # pass the path of the parent_folder def fetch_all_files(parent_folder: str): target_files = [] for path, subdirs, files in os.walk(parent_folder): for name in files: target_files.append(os.path.join(path, name)) return target_files # pass the path of the output final file.pdf and the list of paths def merge_pdf(out_path: str, extracted_files: list [str]): merger = PdfFileMerger() for pdf in extracted_files: merger.append(pdf) merger.write(out_path) merger.close() # get a list of all the paths of the pdf parent_folder_path = './parent_folder' outup_pdf_path = './final.pdf' extracted_files = fetch_all_files(parent_folder_path) merge_pdf(outup_pdf_path, extracted_files)

To run the script, you just need to type python3 main.py in your terminal!

Conclusions

In this article, we have learned to merge different PDF files into a unique PDF using Python code. We used the simple Python library PyPDF2 to manipulate the PDFs and we wrote some code to collect multiple PDFs paths under different subfolders.

For more details about what you can do with PyPDF2, you can check its official documentation:

Источник

Разделение, объединение и поворот PDF-документов на Python с помощью borb

Формат переносимых документов (PDF) не является форматом WYSIWYG (What You See is What You Get (То, Что Вы Видите, это То, Что Вы Получаете)). Он был разработан, чтобы быть независимым от платформы, независимым от базовой операционной системы и механизмов рендеринга.

Для достижения этой цели PDF был создан для взаимодействия с помощью чего-то более похожего на язык программирования, и для достижения результата полагается ряд инструкций и операций. Фактически, PDF основан на языке сценариев — PostScript, который был первым независимым от устройства языком описания страниц.

В этом руководстве мы будем использовать borb — библиотеку Python, предназначенную для чтения, манипулирования и генерации PDF-документов. Он предлагает как низкоуровневую модель (что позволяет получить доступ к точным координатам и макету), так и высокоуровневую модель (где вы можете делегировать точные расчеты полей, позиций и т. д.).

В этом руководстве мы рассмотрим, как разделить и объединить PDF-документы на Python с помощью borb, а также рассмотрим, как поворачивать страницы в PDF-документе.

Разделение и объединение PDF-документов являются основой для многих сценариев использования:

Обработка счета-фактуры (вам не нужны условия, чтобы вы могли удалить эти страницы)
Добавление сопроводительного письма к документам (отчет об испытаниях, счет-фактура, рекламные материалы)
Агрегирование тестовых результатов из гетерогенных источников
И т.д.

Установка borb

Borb можно загрузить из исходного кода на GitHub или установить через pip:

Разделение PDF с помощью borb

Чтобы продемонстрировать разделение, вам понадобится PDF-файл с несколькими страницами.

Мы начнем с создания такого PDF-файла с помощью borb. Этот шаг не является обязательным, вы, конечно, можете просто использовать PDF-файл, который у вас есть вместо этого:

from borb.pdf.canvas.color.color import HexColor from borb.pdf.canvas.layout.page_layout.multi_column_layout import SingleColumnLayout from borb.pdf.canvas.layout.page_layout.page_layout import PageLayout from borb.pdf.canvas.layout.text.paragraph import Paragraph from borb.pdf.document import Document from borb.pdf.page.page import Page from borb.pdf.pdf import PDF from decimal import Decimal def create_document(heading_color: HexColor = HexColor("0b3954"), text_color: HexColor = HexColor("de6449"), file_name: str = "output.pdf"): d: Document = Document() N: int = 10 for i in range(0, N): # Создайте новую страницу и добавьте ее в документ p: Page = Page() d.append_page(p) # Установите отображение страницы на новой странице l: PageLayout = SingleColumnLayout(p) # Добавьте абзац, чтобы идентифицировать страницу l.add(Paragraph("Page %d of %d" % (i+1, N), font_color=heading_color, font_size=Decimal(24))) # Добавьте абзац фиктивного текста l.add(Paragraph(""" Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum. """, font_color=text_color)) # Сохраните документ на диске with open(file_name, "wb") as pdf_out_handle: PDF.dumps(pdf_out_handle, d)

Этот пример кода генерирует PDF-документ, состоящий из 10 страниц:

Каждая страница начинается с «Страницы x из 10». Это облегчит идентификацию страниц позже.
Каждая страница содержит 1 абзац текста.

Разделение PDF-документов на Python

Теперь давайте разделим данный PDF. Начнем с его разделения на две части: первая половина содержит первые 5 страниц, а вторая половина содержит оставшиеся:

def split_half_half(): # Читать PDF with open("output.pdf", "rb") as pdf_file_handle: input_pdf = PDF.loads(pdf_file_handle) # Создайте два пустых PDF-файла для хранения каждой половины разделения output_pdf_001 = Document() output_pdf_002 = Document() # Разделение for i in range(0, 10): if i < 5: output_pdf_001.append_page(input_pdf.get_page(i)) else: output_pdf_002.append_page(input_pdf.get_page(i)) # Написать PDF with open("output_001.pdf", "wb") as pdf_out_handle: PDF.dumps(pdf_out_handle, output_pdf_001) # Написать PDF with open("output_002.pdf", "wb") as pdf_out_handle: PDF.dumps(pdf_out_handle, output_pdf_002)

Мы извлекли первые 5 страниц в новый Document, а следующие 5 страниц во второй новый Document, фактически разделив оригинальную на две меньшие сущности.

Может быть упрощен с помощью метода get_page(), так как его возвращаемый тип может быть непосредственно использован для withappendappend_page().

Вы можете проверить полученные PDF-файлы, чтобы убедиться, что код работает должным образом

Объединение PDF-документов в Python

Для работы со следующими примерами нам понадобятся два PDF-файла. Давайте использовать более ранний код для их генерации, если у вас его еще нет:

create_document(HexColor("247B7B"), HexColor("78CDD7"), "output_001.pdf") create_document(file_name="output_002.pdf")

Интуиция, используемая для разделения, очень похожа на слияние - хотя мы можем добавлять целые документы в другие документы, а не только страницы. Однако иногда вам может потребоваться разделить документ (отрезать последнюю страницу), прежде чем объединять его с другой.

Мы можем объединить их полностью (объединяя оба PDF-файла), но мы также можем просто добавить некоторые страницы первого PDF-файла во второй, если предпочтем это таким образом - используя функцию append_page(), как и раньше.

Давайте начнем с их полного объединения :

def concatenate_two_documents(): # Прочитайте первый PDF-файл with open("output_001.pdf", "rb") as pdf_file_handle: input_pdf_001 = PDF.loads(pdf_file_handle) # Прочитайте второй PDF-файл with open("output_002.pdf", "rb") as pdf_file_handle: input_pdf_002 = PDF.loads(pdf_file_handle) # Создайте новый PDF-файл, объединив два входных файла output_document = Document() output_document.append_document(input_pdf_001) output_document.append_document(input_pdf_002) # Написать PDF with open("output.pdf", "wb") as pdf_out_handle: PDF.dumps(pdf_out_handle, output_document)

Поворот страниц в PDF-документах на Python

Страница в PDF-документе может быть повернута на 90 градусов в любую сторону. Этот вид работы позволяет легко переключаться между альбомным и портретным режимами.

В следующем примере вы сможете повернуть страницу одного из входных PDF-файлов, которые мы создали ранее:

def rotate_first_page(): # Чтение PDF with open("output_001.pdf", "rb") as pdf_file_handle: input_pdf_001 = PDF.loads(pdf_file_handle) # Поворот страницы input_pdf_001.get_page(0).rotate_left() # Запись PDF на диск with open("output.pdf", "wb") as pdf_out_handle: PDF.dumps(pdf_out_handle, input_pdf_001)

Источник

Merge PDF Files using Python

In this tutorial we will explore how to merge PDF files using Python.

Table of Contents

Conclusion

Merge two PDF files using Python

In order to perform PDF merging in Python we will need to import the PdfFileMerger() class from the PyPDF2 library, and create an instance of this class.

In this example we will merge two files: sample_page1.pdf and sample_page2.pdf.

In this case, the two file paths can be placed into a list, which we will then iterate over and append one to another:

And you should see merged_2_pages.pdf created in the same directory as the main.py file with the code:

Merge many PDF files using Python

In this section we will explore how to merge many PDF files using Python.

One way of merging many PDF files would be to add the file names of every PDF files to a list manually and then perform the same operation as in the previous section.

And you should see merged_all_pages.pdf created in the same directory as the main.py file with the code:

Conclusion

In this article we explored how to merge multiple PDF files using Python.

Источник