Split PDF to Pages in Python

Splitting and Merging PDFs with Python

The PyPDF2 package allows you to do a lot of useful operations on existing PDFs. In this article, we will learn how to split a single PDF into multiple smaller ones. We will also learn how to take a series of PDFs and join them back together into a single PDF.

Getting Started

PyPDF2 is not part of the Python standard library, so you will need to install it yourself. The preferred way to do so is to use pip:

pip install pypdf2

Now that we have PyPDF2 installed, let’s learn how to split and merge PDFs!

Splitting PDFs

The PyPDF2 package gives you the ability to split up a single PDF into multiple ones. You just need to tell it how many pages you want. For this example, we will download a W9 form from the IRS and loop over all six of its pages. We will split off each page and turn it into its own standalone PDF.

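# pdf_splitter.py
# Written against the PyPDF2 1.x/2.x API; PyPDF2 3.0 renamed these classes and methods.

import os
from PyPDF2 import PdfFileReader, PdfFileWriter


def pdf_splitter(path):
    # Base name of the input file, without directory or extension
    fname = os.path.splitext(os.path.basename(path))[0]

    pdf = PdfFileReader(path)
    for page in range(pdf.getNumPages()):
        # One writer per page, so each page becomes its own PDF
        pdf_writer = PdfFileWriter()
        pdf_writer.addPage(pdf.getPage(page))

        # Page indexes are zero-based; add 1 for human-friendly names
        output_filename = '{}_page_{}.pdf'.format(fname, page + 1)

        with open(output_filename, 'wb') as out:
            pdf_writer.write(out)

        print('Created: {}'.format(output_filename))


if __name__ == '__main__':
    path = 'w9.pdf'
    pdf_splitter(path)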

For this example, we need to import both the PdfFileReader and the PdfFileWriter. Then we create a fun little function called pdf_splitter. It accepts the path of the input PDF. The first line of this function will grab the name of the input file, minus the extension. Next we open the PDF up and create a reader object. Then we loop over all the pages using the reader object’s getNumPages method.

Inside the for loop, we create an instance of PdfFileWriter. We then add a page to our writer object using its addPage method. This method accepts a page object, which we get by calling the reader object’s getPage method. Now we have added one page to our writer object. The next step is to create a unique file name from the original file name, the word “page”, and the page number plus one. We add one because PyPDF2’s page numbers are zero-based, so page 0 is actually page 1.

Finally we open the new file name in write-binary mode and use the PDF writer object’s write method to write the object’s contents to disk.
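The naming step can be tried on its own, without touching a PDF. The sketch below is a minimal illustration that uses only the standard library. The helper name page_filename is hypothetical and not part of PyPDF2.

```python
import os


def page_filename(path, page_index):
    """Build the output name for a zero-based page index (hypothetical helper)."""
    # 'forms/w9.pdf' -> 'w9': drop the directory, then the extension
    fname = os.path.splitext(os.path.basename(path))[0]
    # Zero-based index 0 becomes "page_1" in the file name
    return '{}_page_{}.pdf'.format(fname, page_index + 1)


print(page_filename('forms/w9.pdf', 0))  # w9_page_1.pdf
print(page_filename('w9.pdf', 5))        # w9_page_6.pdf
```

Because the directory part is dropped, the split pages land in the current working directory rather than next to the input file.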

Merging Multiple PDFs Together

Now that we have a bunch of PDFs, let’s learn how we might take them and merge them back together. One common use case is a business merging its dailies into a single PDF. I have needed to merge PDFs for work and for fun. One project that sticks out in my mind is scanning documents in. Depending on the scanner you have, you might end up scanning a document into multiple PDFs, so being able to join them together again can be wonderful.

When the original pyPdf came out, the only way to get it to merge multiple PDFs together was like this:

# pdf_merger.py

import glob
from PyPDF2 import PdfFileWriter, PdfFileReader


def merger(output_path, input_paths):
    pdf_writer = PdfFileWriter()

    for path in input_paths:
        pdf_reader = PdfFileReader(path)
        # Copy every page of this input into the single writer
        for page in range(pdf_reader.getNumPages()):
            pdf_writer.addPage(pdf_reader.getPage(page))

    with open(output_path, 'wb') as fh:
        pdf_writer.write(fh)


if __name__ == '__main__':
    paths = glob.glob('w9_*.pdf')
    paths.sort()
    merger('pdf_merger.pdf', paths)

Here we create a PdfFileWriter object and several PdfFileReader objects. For each PDF path, we create a PdfFileReader object and then loop over its pages, adding each and every page to our writer object. Then we write out the writer object’s contents to disk.

PyPDF2 made this a bit simpler by creating a PdfFileMerger object:

# pdf_merger2.py

import glob
from PyPDF2 import PdfFileMerger


def merger(output_path, input_paths):
    pdf_merger = PdfFileMerger()

    for path in input_paths:
        # append() adds every page of the file to the end of the output
        pdf_merger.append(path)

    with open(output_path, 'wb') as fileobj:
        pdf_merger.write(fileobj)


if __name__ == '__main__':
    paths = glob.glob('w9_*.pdf')
    paths.sort()
    merger('pdf_merger2.pdf', paths)

Here we just need to create the PdfFileMerger object and then loop through the PDF paths, appending them to our merging object. PyPDF2 will automatically append the entire document so you don’t need to loop through all the pages of each document yourself. Then we just write it out to disk.
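One caveat applies to both merger scripts: paths.sort() compares file names as strings, so once a document has ten or more pages, w9_page_10.pdf sorts before w9_page_2.pdf. The six-page W9 is unaffected. Here is a minimal sketch of a numeric sort key, assuming the file names follow the _page_N.pdf pattern produced by the splitter. The helper page_number is hypothetical.

```python
import re


def page_number(path):
    """Return the page number embedded in a '..._page_N.pdf' path."""
    match = re.search(r'_page_(\d+)\.pdf$', path)
    # Files that do not follow the pattern sort first
    return int(match.group(1)) if match else -1


paths = ['w9_page_10.pdf', 'w9_page_2.pdf', 'w9_page_1.pdf']

print(sorted(paths))
# ['w9_page_1.pdf', 'w9_page_10.pdf', 'w9_page_2.pdf']

print(sorted(paths, key=page_number))
# ['w9_page_1.pdf', 'w9_page_2.pdf', 'w9_page_10.pdf']
```

In the merger scripts, this would mean replacing paths.sort() with paths.sort(key=page_number).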

The PdfFileMerger class also has a merge method that you can use. Its code definition looks like this:

def merge(self, position, fileobj, bookmark=None, pages=None,
          import_bookmarks=True):
    """
    Merges the pages from the given file into the output file at the
    specified page number.

    :param int position: The *page number* to insert this file. File will
        be inserted after the given number.

    :param fileobj: A File Object or an object that supports the standard
        read and seek methods similar to a File Object. Could also be a
        string representing a path to a PDF file.

    :param str bookmark: Optionally, you may specify a bookmark to be
        applied at the beginning of the included file by supplying the
        text of the bookmark.

    :param pages: can be a :ref:`Page Range` or a
        ``(start, stop[, step])`` tuple to merge only the specified range
        of pages from the source document into the output document.

    :param bool import_bookmarks: You may prevent the source document's
        bookmarks from being imported by specifying this as ``False``.
    """

Basically the merge method allows you to tell PyPDF2 where to merge a document by page number. So if you have created a merging object with 3 pages in it, you can tell the merging object to merge the next document in at a specific position. This allows the developer to do some pretty complex merging operations. Give it a try and see what you can do!
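To get a feel for positional merging without any PDFs on hand, the sketch below models a document as a plain Python list of page labels. It is only an analogy for the position parameter described in the docstring, not PyPDF2 code. The helper merge_at is hypothetical.

```python
def merge_at(pages, position, new_pages):
    """Return a new page list with new_pages inserted after `position` pages."""
    return pages[:position] + list(new_pages) + pages[position:]


merged = ['A1', 'A2', 'A3']   # a merging object that already holds 3 pages
extra = ['B1', 'B2']          # the next document to merge in

print(merge_at(merged, 1, extra))  # ['A1', 'B1', 'B2', 'A2', 'A3']
print(merge_at(merged, 0, extra))  # ['B1', 'B2', 'A1', 'A2', 'A3']
print(merge_at(merged, 3, extra))  # ['A1', 'A2', 'A3', 'B1', 'B2']
```

In this model, position 0 puts the new document first, and a position equal to the current page count behaves like an append.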

Wrapping Up

PyPDF2 is a powerful and useful package. I have been using it off and on for years to work on various home and work projects. If you need to manipulate existing PDFs, then this package might be right up your alley!

Split a PDF into Multiple Files in Python

PDF files are everywhere, and sometimes a large one needs to be split into smaller ones, for example to send specific pages to someone or to upload them to a website. The same need can arise when processing PDF files in Python. In this article, we will see how to split a PDF file in Python, either page by page or by a collection of pages.

Python Library to Split PDF

To split PDF files, we will use Aspose.PDF for Python. It is a feature-rich PDF manipulation library that allows you to create, edit, and process PDF documents. Use the following pip command to install the library in your Python application.

pip install aspose-pdf

Split a PDF by Page in Python

You may need different PDF splitting criteria in each situation, for example, splitting each page in a PDF, selective pages only, even pages only, and so on. First, let’s have a look at how to split a PDF by each page in Python. Below are the steps to perform this operation.

  • Load the PDF file using Document class.
  • Iterate through the pages in the Document.pages collection.
  • In each iteration, perform the following steps:
    • Create a new Document object and add the page to the document using Document.pages.add(Page) method.
    • Save the PDF file using Document.save() method.

The following code sample shows how to split each page in a PDF using Python.
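A minimal sketch of the steps listed above. Document, Document.pages, Document.pages.add and Document.save are the names given in those steps. The import name aspose.pdf, the function names and the output file naming are assumptions made for illustration.

```python
def output_name(prefix, page_number):
    """File name for one split page, e.g. 'page_3.pdf' (hypothetical naming)."""
    return '{}_{}.pdf'.format(prefix, page_number)


def split_each_page(input_path, prefix='page'):
    """Save every page of input_path as its own PDF and return the file names."""
    # Assumption: Aspose.PDF for Python is importable as aspose.pdf
    import aspose.pdf as ap

    document = ap.Document(input_path)      # load the source PDF
    created = []
    for number, page in enumerate(document.pages, start=1):
        new_document = ap.Document()        # empty document for this page
        new_document.pages.add(page)        # Document.pages.add(Page)
        name = output_name(prefix, number)
        new_document.save(name)             # Document.save()
        created.append(name)
    return created
```

Calling split_each_page('input.pdf') would then be expected to write page_1.pdf, page_2.pdf and so on to the current directory.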

Split Specific Pages of PDF in Python

Let’s now see how to split more than one page from a PDF and save them in a separate file. The following are the steps to split multiple PDF pages in Python.

  • Load the PDF file using the Document class.
  • Create a new Document object for the new PDF file.
  • Iterate through the pages in the Document.pages collection.
  • In each iteration, check if the page should be split.
  • Add the page to the new PDF document using the Document.pages.add(Page) method.
  • Finally, save the PDF file using the Document.save() method.

The following code sample shows how to split a collection of pages in a PDF using Python.
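The "check if the page should be split" step is where the splitting criteria live, such as selected pages only or even pages only. Below is a library-independent sketch of that check, using 1-based page numbers. The names select_pages, is_even and in_range are hypothetical and not part of Aspose.PDF.

```python
def select_pages(page_count, wanted):
    """Return the 1-based page numbers in 1..page_count accepted by `wanted`."""
    return [n for n in range(1, page_count + 1) if wanted(n)]


def is_even(n):
    """Criterion: keep even-numbered pages only."""
    return n % 2 == 0


def in_range(first, last):
    """Build a criterion that accepts pages first..last, inclusive."""
    return lambda n: first <= n <= last


print(select_pages(6, is_even))         # [2, 4, 6]
print(select_pages(6, in_range(2, 4)))  # [2, 3, 4]
```

Inside the page loop, the same predicate would decide whether the current page gets passed to Document.pages.add.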

Split PDF Files Online

We also provide a free online tool to split PDF files, which is based on Aspose.PDF for Python.

Free Python PDF Library

You can get a free temporary license to split PDF files without any limitations. Also, you can visit the documentation to explore more about the Python PDF library.

Conclusion

In this article, you have learned how to split PDF files in Python. You have seen how to split every page, or a collection of pages, in a PDF into separate files. You can follow the provided steps and code samples to split PDF files in your Python application.
