Python docx save as pdf

Converting DOCX to PDF using Python

When you ask someone to send you a contract or a report, there is a high probability that you’ll get a DOCX file. Whether you like it or not, it makes sense considering that 1.2 billion people use Microsoft Office, although the definition of “use” is quite vague in this case. DOCX is a zipped package of XML files which is, unlike XLSX, not famous for being easy to integrate into your application. PDF is a much easier choice when you care more about how a document is displayed than about its ability to be modified further. Let’s focus on that.

Python has a few great libraries to work with DOCX (python-docx) and PDF files (PyPDF2, pdfrw). Those are good choices and a lot of fun for reading or writing files. That said, I know I’d fail miserably trying to achieve 1:1 conversion with them.

Looking further, I came across unoconv. Universal Office Converter is a tool that converts between any document formats supported by LibreOffice/OpenOffice. That sounds like a solid solution for my use case, where I care more about quality than anything else. As execution time isn’t my problem, my only concern was whether it’s possible to run LibreOffice without an X display. Apparently, LibreOffice can be run in headless mode and supports conversion between various formats, sweet!

I’m grateful to unoconv for the idea and for a great README explaining multiple problems I might come across. At the same time, I’m put off by the number of open issues and abandoned pull requests. If I get the versions right, how hard can it be? Not hard at all, with a few caveats though.

LibreOffice is available on all major platforms and has an active community. It’s not new-hot-JS-framework active, but there is still plenty of good reading material and support. You can get your copy from the download page. Be a good user and go with the up-to-date version. You can always downgrade in case of any problems, and feedback on the latest release is always appreciated.

The executable is called soffice on macOS and Windows, and libreoffice on Linux. I’m on macOS, where soffice isn’t available in my PATH after the installation, but you can find it inside LibreOffice.app. To test how LibreOffice deals with your files you can run:

$ /Applications/LibreOffice.app/Contents/MacOS/soffice --headless --convert-to pdf test.docx

In my case the results were more than satisfying. The only problem I saw was a misalignment in a file where the alignment was done with spaces, sad but true. This problem was caused by missing fonts and the different widths of the “replacement” fonts. No worries, we’ll address this problem later.

While reading unoconv issues I noticed that many problems are caused by version mismatches. I’m going with Docker so I can have a pretty stable setup and be sure that everything works.

Let’s start by defining a simple Dockerfile with just the dependencies, and ADD one DOCX file for testing:

FROM ubuntu:17.04

RUN apt-get update
RUN apt-get install -y python3 python3-pip
RUN apt-get install -y build-essential libssl-dev libffi-dev python-dev
RUN apt-get install -y libreoffice

ADD test.docx /app/

$ docker build -t my/docx2pdf .

After the image is created we can run the container and convert the file inside it:

$ docker run --rm --name docx2pdf-container my/docx2pdf \
    libreoffice --headless --convert-to pdf --outdir app /app/test.docx

Running LibreOffice as a subprocess

We want to run the LibreOffice converter as a subprocess and provide the same API on all platforms. Let’s define a module which can be run as a standalone script or imported later on our server.

import sys
import subprocess
import re


def convert_to(folder, source, timeout=None):
    args = [libreoffice_exec(), '--headless', '--convert-to', 'pdf', '--outdir', folder, source]

    process = subprocess.run(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE, timeout=timeout)
    # LibreOffice reports success as "... -> <output path> using filter ..."
    filename = re.search('-> (.*?) using filter', process.stdout.decode())

    if filename is None:
        raise LibreOfficeError(process.stdout.decode())
    else:
        return filename.group(1)


def libreoffice_exec():
    # TODO: Provide support for more platforms
    if sys.platform == 'darwin':
        return '/Applications/LibreOffice.app/Contents/MacOS/soffice'
    return 'libreoffice'


class LibreOfficeError(Exception):
    def __init__(self, output):
        self.output = output


if __name__ == '__main__':
    print('Converted to ' + convert_to(sys.argv[1], sys.argv[2]))

The required arguments which convert_to accepts are the folder to which we save the PDF and a path to the source file. Optionally, we specify a timeout in seconds. I’m saying optional, but consider it mandatory. We don’t want a process to hang too long in case of any problems, and we may simply want to limit the computation time we are willing to give to each conversion. The location and name of the LibreOffice executable depend on the platform, so edit libreoffice_exec to support the platform you’re using.
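To see what the timeout buys us, here is a minimal sketch that doesn’t need LibreOffice at all. It assumes a Python child process that sleeps is a fair stand-in for a hanging conversion; run_with_timeout is a name I made up for illustration and isn’t part of the module above.

```python
import subprocess
import sys


def run_with_timeout(args, timeout):
    """Run a command and return its stdout, or None if it exceeds the timeout."""
    try:
        process = subprocess.run(
            args,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before re-raising TimeoutExpired.
        return None
    return process.stdout.decode()


# Stand-ins for a hanging and a healthy conversion (not real LibreOffice calls).
slow_command = [sys.executable, '-c', 'import time; time.sleep(5)']
fast_command = [sys.executable, '-c', 'print("done")']

slow_result = run_with_timeout(slow_command, timeout=0.5)
fast_result = run_with_timeout(fast_command, timeout=30)
```

In convert_to the exception is deliberately left uncaught, so the caller decides what a timeout means for them.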

subprocess.run doesn’t capture stdout and stderr by default. We can easily change the default behavior by passing subprocess.PIPE. Unfortunately, LibreOffice exits with return code 0 even when the conversion fails, and nothing is written to stderr. I decided to look for the success message, assuming that it won’t be there in case of an error, and to raise LibreOfficeError when it’s missing. This approach hasn’t failed me so far.
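The success check can be tried in isolation. This is a small sketch using the same regular expression as convert_to; both sample strings are illustrative assumptions about what LibreOffice prints, not output captured from a real run, and parse_converted_path is a hypothetical helper name.

```python
import re

# Same pattern as in convert_to: the output path sits between "->" and "using filter".
SUCCESS_PATTERN = re.compile(r'-> (.*?) using filter')


def parse_converted_path(stdout_text):
    """Return the output path from the success message, or None if it is absent."""
    match = SUCCESS_PATTERN.search(stdout_text)
    return match.group(1) if match else None


# Illustrative samples only (assumed shape of the LibreOffice output).
sample_success = 'convert /app/test.docx -> /app/test.pdf using filter : writer_pdf_Export\n'
sample_failure = 'Error: source file could not be loaded\n'

converted = parse_converted_path(sample_success)
failed = parse_converted_path(sample_failure)
```

If a future LibreOffice release changes the wording of that message, the pattern has to change with it, which is one more reason to pin the version in the Docker image.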

Uploading files with Flask

Converting using the command line is ok for testing and development but won’t take us far. Let’s build a simple server in Flask.

# common/files.py

import os
from config import config
from werkzeug.utils import secure_filename


def uploads_url(path):
    # Map a path on disk to the public /uploads URL
    return path.replace(config['uploads_dir'], '/uploads')


def save_to(folder, file):
    os.makedirs(folder, exist_ok=True)
    save_path = os.path.join(folder, secure_filename(file.filename))
    file.save(save_path)
    return save_path

# common/errors.py

from flask import jsonify


class RestAPIError(Exception):
    def __init__(self, status_code=500, payload=None):
        self.status_code = status_code
        self.payload = payload

    def to_response(self):
        return jsonify({'error': self.payload}), self.status_code


class BadRequestError(RestAPIError):
    def __init__(self, payload=None):
        super().__init__(400, payload)


class InternalServerErrorError(RestAPIError):
    def __init__(self, payload=None):
        super().__init__(500, payload)

We’ll need a few helper functions to work with files and a few custom errors for handling error messages. The upload directory path is defined in config.py. You can also consider using flask-restplus or flask-restful, which make handling errors a little easier.
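config.py itself isn’t shown here. A minimal sketch of what it could look like follows; the UPLOADS_DIR environment variable, the /app/uploads default and the load_config helper are all my assumptions for illustration, only the config['uploads_dir'] key comes from the code above.

```python
# config.py (hypothetical sketch)
import os


def load_config(environ=os.environ):
    """Build the config dict from environment variables, with defaults."""
    return {
        # Directory where uploaded sources and generated PDFs are stored.
        # UPLOADS_DIR and the default path are assumptions, not from the article.
        'uploads_dir': environ.get('UPLOADS_DIR', '/app/uploads'),
    }


# The rest of the app imports this name: `from config import config`.
config = load_config()
```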

import os
from uuid import uuid4
from flask import Flask, render_template, request, jsonify, send_from_directory
from subprocess import TimeoutExpired
from config import config
from common.docx2pdf import LibreOfficeError, convert_to
from common.errors import RestAPIError, InternalServerErrorError
from common.files import uploads_url, save_to

app = Flask(__name__, static_url_path='')


@app.route('/')
def hello():
    return render_template('home.html')


@app.route('/upload', methods=['POST'])
def upload_file():
    # Every upload gets its own directory so file names never collide
    upload_id = str(uuid4())
    source = save_to(os.path.join(config['uploads_dir'], 'source', upload_id), request.files['file'])

    try:
        result = convert_to(os.path.join(config['uploads_dir'], 'pdf', upload_id), source, timeout=15)
    except LibreOfficeError:
        raise InternalServerErrorError({'message': 'Error when converting file to PDF'})
    except TimeoutExpired:
        raise InternalServerErrorError({'message': 'Timeout when converting file to PDF'})

    return jsonify({'result': {'source': uploads_url(source), 'pdf': uploads_url(result)}})


@app.route('/uploads/<path:path>', methods=['GET'])
def serve_uploads(path):
    return send_from_directory(config['uploads_dir'], path)


@app.errorhandler(500)
def handle_500_error(error):
    # Flask passes the error to the handler, so it has to accept an argument
    return InternalServerErrorError().to_response()


@app.errorhandler(RestAPIError)
def handle_rest_api_error(error):
    return error.to_response()


if __name__ == '__main__':
    app.run(host='0.0.0.0', threaded=True)

The server is pretty straightforward. In production, you would probably want to use some kind of authentication to limit access to the uploads directory. If not, give up on serving static files with Flask and go for Nginx.

An important takeaway from this example is that you want to tell your app to be threaded so one request won’t prevent other routes from being served. However, the WSGI server included with Flask is meant for development and is not production ready. In production, you want to use a proper server with automatic worker process management, like gunicorn. Check the docs for an example of how to integrate gunicorn into your app. We are going to run the application inside a container, so host has to be set to the publicly visible 0.0.0.0.

Now that we have a server, we can update the Dockerfile. We need to copy our application source code to the image filesystem and install the required dependencies.

FROM ubuntu:17.04

RUN apt-get update
RUN apt-get install -y python3 python3-pip
RUN apt-get install -y build-essential libssl-dev libffi-dev python-dev
RUN apt-get install -y libreoffice

ADD app /app
WORKDIR /app
RUN pip3 install -r requirements.txt

ENV LC_ALL=C.UTF-8
ENV LANG=C.UTF-8

CMD python3 application.py

In docker-compose.yml we want to specify the port mapping and mount a volume. If you followed the code and tried running the examples, you have probably noticed that we were missing a way to tell Flask to run in debugging mode. Defining an environment variable without a value means that the variable is passed to the container from the host system. Alternatively, you can provide different config files for different environments.

version: '3'
services:
  web:
    build: .
    ports:
      - '5000:5000'
    volumes:
      - ./app:/app
    environment:
      - FLASK_DEBUG

I’ve mentioned a problem with missing fonts earlier. LibreOffice can, of course, make use of custom fonts. If you can predict which fonts your users might be using, there’s a simple remedy. Add the following line to your Dockerfile.

Now put your custom font files in the font directory in your project and rebuild the image. From now on you support custom fonts!

This should give you an idea of how you can provide quality conversion of different documents to PDF. Although the main goal was to convert a DOCX file, you should be fine with presentations, spreadsheets or images.

A further improvement could be support for multiple files; the converter can be given more than one file at a time as well.
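As a rough sketch of that improvement, assuming LibreOffice prints one success line per converted file (I haven’t verified this across versions), the module could grow a batch variant. The names build_convert_args, parse_converted_paths and convert_many are hypothetical, and the sample output is illustrative only.

```python
import re
import subprocess


def build_convert_args(executable, folder, sources):
    """Build the command line for converting several files in one run."""
    return [executable, '--headless', '--convert-to', 'pdf', '--outdir', folder, *sources]


def parse_converted_paths(stdout_text):
    """Collect every output path reported in the success messages."""
    return re.findall(r'-> (.*?) using filter', stdout_text)


def convert_many(executable, folder, sources, timeout=None):
    """Convert all sources; raise if fewer PDFs are reported than files given."""
    args = build_convert_args(executable, folder, sources)
    process = subprocess.run(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE, timeout=timeout)
    output = process.stdout.decode()
    converted = parse_converted_paths(output)
    if len(converted) != len(sources):
        raise RuntimeError(output)
    return converted


# Illustrative output for two files (assumed format, not a captured run).
sample_output = (
    'convert /app/a.docx -> /app/pdf/a.pdf using filter : writer_pdf_Export\n'
    'convert /app/b.docx -> /app/pdf/b.pdf using filter : writer_pdf_Export\n'
)
sample_args = build_convert_args('libreoffice', '/app/pdf', ['/app/a.docx', '/app/b.docx'])
sample_paths = parse_converted_paths(sample_output)
```

Keep in mind that a single timeout now covers the whole batch, so it should scale with the number of files.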

Did you enjoy it? Follow me @MichalZalecki on Twitter, where I share some interesting, bite-sized content.

docx2pdf 0.1.8

Convert docx to pdf on Windows or macOS directly using Microsoft Word (must be installed).

License: MIT License (MIT)

Requires: Python >=3.5

On Windows, this is implemented via win32com, while on macOS it is implemented via JXA (JavaScript for Automation, aka AppleScript in JS).

Install

brew install aljohri/-/docx2pdf 

CLI

usage: docx2pdf [-h] [--keep-active] [--version] input [output]

Example Usage:

Convert single docx file in-place from myfile.docx to myfile.pdf:
    docx2pdf myfile.docx

Batch convert docx folder in-place. Output PDFs will go in the same folder:
    docx2pdf myfolder/

Convert single docx file with explicit output filepath:
    docx2pdf input.docx output.pdf

Convert single docx file and output to a different explicit folder:
    docx2pdf input.docx output_dir/

Batch convert docx folder. Output PDFs will go to a different explicit folder:
    docx2pdf input_dir/ output_dir/

positional arguments:
  input          input file or folder. batch converts entire folder or convert
                 single file
  output         output file or folder

optional arguments:
  -h, --help     show this help message and exit
  --keep-active  prevent closing word after conversion
  --version      display version and exit

Library

See CLI docs above (or in docx2pdf --help ) for all the different invocations. It is the same for the CLI and python library.
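A minimal sketch of library usage, assuming the package exposes a convert function that takes an input path and an optional output path, mirroring the CLI arguments above. default_pdf_path and convert_with_word are hypothetical wrapper names, and the actual conversion only works where Microsoft Word is installed.

```python
from pathlib import Path


def default_pdf_path(docx_path):
    """Hypothetical helper: the in-place target next to the source file."""
    return Path(docx_path).with_suffix('.pdf')


def convert_with_word(docx_path, pdf_path=None):
    """Convert one file through docx2pdf and return the path of the PDF."""
    # Imported lazily so this module still loads where docx2pdf is not installed.
    from docx2pdf import convert  # assumed entry point, mirrors `docx2pdf input [output]`

    target = Path(pdf_path) if pdf_path else default_pdf_path(docx_path)
    convert(str(docx_path), str(target))
    return target
```

Calling convert_with_word('input.docx') would then correspond to the first CLI example, and passing a second argument to the explicit output filepath variant.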

Jupyter Notebook

If you are using this in the context of jupyter notebook, you will need ipywidgets for the tqdm progress bar to render properly.

pip install ipywidgets
jupyter nbextension enable --py widgetsnbextension
