Docx to xml python

Contents

docx2python
Installation
Use
Return Value
Arguments
Return Format
Working with output
New in docx2python Version 2
merge consecutive runs with identical formatting
merge consecutive links with identical hrefs
correctly handle nested paragraphs
paragraph styles
export xml
expose some intermediate functionality
How do I convert docx to xml and back?
docx-utils 0.1.3
Overview
Features
Installation
Using the library
Command Line Interface (CLI)
Documentation
Development
Changelog
v0.1.3 (2020-07-15)
Fixed
Other
v0.1.2 (2018-07-26)
Fixed
Other
v0.1.1 (2018-07-25)
v0.1.1 (2018-07-24)

docx2python

Extract docx headers, footers, text, footnotes, endnotes, properties, and images to a Python object.

README_DOCX_FILE_STRUCTURE.md may help if you’d like to extend docx2python.

For a summary of what’s new in docx2python 2, scroll down to New in docx2python Version 2

shared features:

extracts footnotes and endnotes
converts bullets and numbered lists to ascii with indentation
converts hyperlinks to link text
retains some structure of the original file (more below)
extracts document properties (creator, lastModifiedBy, etc.)
inserts image placeholders in text ('----image1.jpg----')
inserts plain text footnote and endnote references in text ('----footnote1----')
(optionally) retains font size, font color, bold, italics, and underline as html
extracts math equations
extracts user selections from checkboxes and dropdown menus

subtractions:

Installation

Use

docx2python opens a zipfile object and (lazily) reads it. Use context management (with ... as) to close this zipfile object, or close it explicitly with docx_content.close().
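The open-then-close pattern described above can be seen with the standard library alone. This is a minimal sketch, not docx2python's implementation: it builds a tiny stand-in archive in memory (a real .docx contains many more parts) and reads word/document.xml inside a with block so the zip handle is closed on exit. The names make_stub_docx and read_document_xml are made up for this example.

```python
import io
import zipfile

# Hypothetical minimal document part; a real .docx holds many more files.
DOCUMENT_XML = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    "<w:body><w:p><w:r><w:t>Hello docx</w:t></w:r></w:p></w:body>"
    "</w:document>"
)


def make_stub_docx() -> io.BytesIO:
    """Return an in-memory zip that mimics the layout of a .docx file."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", DOCUMENT_XML)
    buffer.seek(0)
    return buffer


def read_document_xml(docx_file) -> str:
    """Open the archive, read one part, and close the archive on exit."""
    with zipfile.ZipFile(docx_file) as archive:
        return archive.read("word/document.xml").decode("utf-8")


xml_text = read_document_xml(make_stub_docx())
print(xml_text)
```

Passing a path to a real .docx file instead of the in-memory stub should work the same way, since a .docx file is a zip archive.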

With html=True, docx2python supports italic, bold, underline, strike, superscript, subscript, small caps, all caps, highlighted, font size, and colored text.

Hyperlinks will always be exported as html (<a href="...">link text</a>), even if html=False, because I couldn't think of a more canonical representation.

Every tag opened in a paragraph will be closed in that paragraph (and, where appropriate, reopened in the next paragraph). If two subsequent paragraphs are bold, they will be returned as <b>paragraph a</b>, <b>paragraph b</b>. This is intentional, to make each paragraph its own entity.

If you specify html=True, the characters &, > and < in your docx text will be encoded as &amp;, &gt; and &lt;.
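For a feel of what this encoding looks like, here is a small sketch using html.escape from the standard library. It is an assumption that the result matches docx2python's own escaping character for character; the source only states which three characters are encoded.

```python
from html import escape

raw_text = "R&D: 3 < 5 > 2"

# quote=False leaves quotation marks alone and escapes only &, < and >.
encoded = escape(raw_text, quote=False)
print(encoded)
```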

Return Value

Function docx2python returns a DocxContent instance with several attributes.

header — contents of the docx headers in the return format described herein

footer — contents of the docx footers in the return format described herein

body — contents of the docx body in the return format described herein

footnotes — contents of the docx footnotes in the return format described herein

endnotes — contents of the docx endnotes in the return format described herein

document — header + body + footer (read only)

text — all docx text as one string, similar to what you’d get from python-docx2txt

properties — docx property names (such as creator and lastModifiedBy) mapped to values

images — image names mapped to images in binary format. Write to filesystem with

    from docx2python import docx2python

    # result is a DocxContent instance returned by docx2python
    for name, image in result.images.items():
        with open(name, 'wb') as image_destination:
            image_destination.write(image)

    # or

    with docx2python('path/to/file.docx', 'path/to/image/directory') as docx_content:
        ...

    # or

    with docx2python('path/to/file.docx') as docx_content:
        docx_content.save_images('path/to/image/directory')

docx_reader — a DocxReader (see docx_reader.py ) instance with several methods for extracting xml portions.

Arguments

    def docx2python(
        docx_filename: str | Path | BytesIO,
        image_folder: str | None = None,
        html: bool = False,
        paragraph_styles: bool = False,
        extract_image: bool | None = None,
        duplicate_merged_cells: bool = False,
    ) -> DocxContent:
        """
        Unzip a docx file and extract contents.

        :param docx_filename: path to a docx file
        :param image_folder: optionally specify an image folder
            (images in docx will be copied to this folder)
        :param html: bool, extract some formatting as html
        :param paragraph_styles: prepend the paragraphs style (if any, else "")
            to each paragraph. This will only be useful with ``*_runs`` attributes.
        :param duplicate_merged_cells: bool, duplicate merged cells to return
            a mxn nested list for each table (default False)
        :return: DocxContent object
        """

Return Format

Some structure will be maintained. Text will be returned in a nested list, with paragraphs always at depth 4 (i.e., output.body[i][j][k][l] will be a paragraph).

If your docx has no tables, output.body will appear as one table with all content in one cell.

Table cells will appear as table cells. Text outside tables will appear as table cells.

A docx document can be tables within tables within tables. Docx2Python flattens most of this to more easily navigate within the content.
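To make the depth-4 layout concrete, here is a hand-written nested list shaped the way the text above describes (table, row, cell, paragraph). It is a stand-in written for illustration, not output captured from docx2python, and iter_paragraphs is a helper invented for this example.

```python
# Hand-written stand-in for output.body; indices are body[table][row][cell][paragraph].
body = [
    # Text outside any table: one "table" with one row and one cell.
    [[["Intro paragraph", "Second paragraph"]]],
    # A 2x2 table: two rows, two cells each.
    [
        [["r0c0"], ["r0c1"]],
        [["r1c0"], ["r1c1 line 1", "r1c1 line 2"]],
    ],
]


def iter_paragraphs(tables):
    """Yield ((table, row, cell, paragraph) indices, text) for every paragraph."""
    for i, table in enumerate(tables):
        for j, row in enumerate(table):
            for k, cell in enumerate(row):
                for l, paragraph in enumerate(cell):
                    yield (i, j, k, l), paragraph


paragraphs = list(iter_paragraphs(body))
for index, text in paragraphs:
    print(index, text)
```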

Working with output

This package provides several documented helper functions in the docx2python.iterators module. Here are a few recipes possible with these functions:

    >>> remove_empty_paragraphs(tables)
    [[[['a', 'b'], ['a', 'd']]]]

    >>> html_map(tables)
    '(0, 0, 0, 0) a'
    '(0, 0, 0, 1) b'
    '(0, 0, 1, 0) a'
    '(0, 0, 1, 1) d'
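The two recipes can be approximated without the library. The sketch below uses a hypothetical helper, enum_at_depth_sketch, written for this example; it is not the function shipped in docx2python.iterators, and label_paragraphs only reproduces the index-prefixing step of html_map, not the html wrapping.

```python
def enum_at_depth_sketch(nested, depth):
    """Yield (index_tuple, item) for every item `depth` levels down.

    Hypothetical stand-in for this example, not the docx2python helper.
    """
    if depth == 1:
        for i, item in enumerate(nested):
            yield (i,), item
        return
    for i, child in enumerate(nested):
        for index, item in enum_at_depth_sketch(child, depth - 1):
            yield (i,) + index, item


def remove_empty_paragraphs(tables):
    """Drop empty-string paragraphs from every cell, in place."""
    for (i, j, k), cell in enum_at_depth_sketch(tables, 3):
        tables[i][j][k] = [paragraph for paragraph in cell if paragraph]
    return tables


def label_paragraphs(tables):
    """Prefix each paragraph with its (table, row, cell, paragraph) index."""
    return [
        f"{index} {paragraph}"
        for index, paragraph in enum_at_depth_sketch(tables, 4)
    ]


tables = [[[["a", "b"], ["a", "", "d", ""]]]]
cleaned = remove_empty_paragraphs(tables)
labels = label_paragraphs(cleaned)
print(cleaned)
print(labels)
```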

Some fine print about checkboxes:

MS Word has checkboxes that can be checked any time, and others that can only be checked when the form is locked. The former print as \u2610 (open checkbox) or \u2612 (crossed checkbox). With this module, the latter will too. I gave checkboxes a bailout value of ----checkbox failed---- if the xml doesn't look like I expect it to, because I don't have several thousand test files with checkboxes (as I did with most of the other form elements). Checkboxes should work, but please let me know if you encounter any that do not.

New in docx2python Version 2

merge consecutive runs with identical formatting

MS Word will break up text runs arbitrarily, often in the middle of a word.

For example, the text "work to improve docx2python" may be stored as two separate runs: "work to im" and "prove docx2python".

This makes things like algorithmic search-and-replace problematic. Docx2python does not currently write docx files, but I often use docx templates with placeholders (e.g., #CATEGORY_NAME#), then replace those placeholders with data. This won't work if your placeholders are broken up (e.g., #CAT, E, GORY_NAME#).

Docx2python v1 merges such runs together when exporting text. Docx2python v2 will merge such runs in the XML as a pre-processing step. This will allow saving such "repaired" XML later on.
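A rough idea of what such a merge involves, sketched with xml.etree.ElementTree rather than docx2python's own code. The assumptions are stated in the comments: runs are direct, adjacent children of the paragraph, each run holds one w:t element, and "identical formatting" is judged by comparing the serialized w:rPr element. The function names are invented for this example.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Hypothetical paragraph: a bold placeholder split across three runs.
PARAGRAPH_XML = (
    f'<w:p xmlns:w="{W}">'
    "<w:r><w:rPr><w:b/></w:rPr><w:t>#CAT</w:t></w:r>"
    "<w:r><w:rPr><w:b/></w:rPr><w:t>E</w:t></w:r>"
    "<w:r><w:rPr><w:b/></w:rPr><w:t>GORY_NAME#</w:t></w:r>"
    "<w:r><w:t> plain</w:t></w:r>"
    "</w:p>"
)


def run_format_key(run):
    """Serialize a run's properties so identical formatting compares equal."""
    properties = run.find("w:rPr", NS)
    return b"" if properties is None else ET.tostring(properties)


def merge_identical_runs(paragraph):
    """Fold each run into the previous one when their formatting matches.

    Assumes runs are adjacent children and each carries a single w:t.
    """
    previous = None
    for run in paragraph.findall("w:r", NS):
        current_text = run.find("w:t", NS)
        previous_text = None if previous is None else previous.find("w:t", NS)
        mergeable = (
            previous_text is not None
            and current_text is not None
            and run_format_key(run) == run_format_key(previous)
        )
        if mergeable:
            previous_text.text = (previous_text.text or "") + (current_text.text or "")
            paragraph.remove(run)
        else:
            previous = run
    return paragraph


paragraph = merge_identical_runs(ET.fromstring(PARAGRAPH_XML))
run_texts = [node.text for node in paragraph.iter(f"{{{W}}}t")]
print(run_texts)
```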

merge consecutive links with identical hrefs

MS Word will break up links, giving each link a different rId , even when these rIds point to the same address.

    # rID13 points to https://github.com/ShayHill/docx2python
    docx2py
    # rID14 ALSO points to https://github.com/ShayHill/docx2python
    thon

This is similar to the broken-up runs, but the cause is a little deeper in. Docx2python v1 makes a mess of these.

Docx2python v2 will merge such links together in the XML as a pre-processing step. As above, this will allow saving such "repaired" XML later on.
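The idea of merging link fragments can be sketched on plain data. This is an illustration only, assuming the links have already been pulled out as (relationship id, text) pairs; the relationship table and the merge_links function are hypothetical and do not mirror docx2python's XML pre-processing.

```python
from itertools import groupby

# Hypothetical relationship table: two ids, one target address.
RELATIONSHIPS = {
    "rId13": "https://github.com/ShayHill/docx2python",
    "rId14": "https://github.com/ShayHill/docx2python",
}

# Link fragments in document order, as (relationship id, link text).
LINK_FRAGMENTS = [("rId13", "docx2py"), ("rId14", "thon")]


def merge_links(fragments, relationships):
    """Join consecutive fragments whose ids resolve to the same href."""
    resolved = [(relationships[rid], text) for rid, text in fragments]
    return [
        (href, "".join(text for _, text in group))
        for href, group in groupby(resolved, key=lambda pair: pair[0])
    ]


merged = merge_links(LINK_FRAGMENTS, RELATIONSHIPS)
print(merged)
```

groupby only groups neighbours, so two links to the same address separated by a different link stay separate, which is the behaviour the heading describes.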

correctly handle nested paragraphs

MS Word will nest paragraphs:

    text
        # paragraph inside a paragraph
        text
    text

I haven't been able to create such a paragraph, but I've found a few files that have them. Docx2python v1 will omit closing html tags when a new paragraph is opened before the old paragraph is closed.

    <b>outer par bold text
    This text is in nested par (not bold)
    outer par bold text</b>

Docx2python v2 will correctly handle such cases, but this will require substantial internal changes to the way docx2python opens and closes paragraphs.

    <b>outer par bold text</b>
    This text is in nested par (not bold)
    <b>outer par bold text</b>

paragraph styles

The internal changes allow for easy access to paragraph styles (e.g., Heading 1 ). Docx2python v1 ignores these, even with html=True . Docx2python v2 will capture paragraph styles.

    <h1>h1 is a paragraph style</h1>
    <b>bold is a run style</b>

export xml

To allow the light editing described above (e.g., search and replace), docx2python v2 will give the user access to

1. extracted xml files
2. the functions used to write these files to a docx

The user can only go so far with this. A docx file is built from folders full of xml files. None of these xml files are self contained. But search and replace is enough to make document templates (documents with placeholders for data), and that’s pretty useful in itself.
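The template idea can be tried with the standard library alone. The sketch below is not docx2python's export API: it builds a stand-in archive in memory, replaces a placeholder in word/document.xml as plain text, and copies every other part unchanged into a new archive. It assumes the placeholder sits inside a single run; make_template and replace_in_docx are names invented for this example.

```python
import io
import zipfile
from xml.sax.saxutils import escape

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def make_template() -> bytes:
    """Build a stand-in .docx with one placeholder (real files have more parts)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(
            "word/document.xml",
            f'<w:document xmlns:w="{W}"><w:body><w:p><w:r>'
            "<w:t>Category: #CATEGORY_NAME#</w:t>"
            "</w:r></w:p></w:body></w:document>",
        )
        archive.writestr("word/other.xml", "<other/>")  # placeholder extra part
    return buffer.getvalue()


def replace_in_docx(docx_bytes: bytes, old: str, new: str) -> bytes:
    """Return a new archive with `old` replaced in word/document.xml."""
    target = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as source, zipfile.ZipFile(
        target, "w", zipfile.ZIP_DEFLATED
    ) as output:
        for item in source.infolist():
            data = source.read(item.filename)
            if item.filename == "word/document.xml":
                text = data.decode("utf-8")
                # Escape the new value so characters like & stay valid XML.
                data = text.replace(old, escape(new)).encode("utf-8")
            output.writestr(item, data)
    return target.getvalue()


filled = replace_in_docx(make_template(), "#CATEGORY_NAME#", "Fruit & Veg")
with zipfile.ZipFile(io.BytesIO(filled)) as archive:
    filled_xml = archive.read("word/document.xml").decode("utf-8")
    part_names = sorted(archive.namelist())
print(filled_xml)
```

If the placeholder is split across several runs, a plain text replace like this will not find it, which is why the run merging described earlier matters.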

expose some intermediate functionality

Navigating through XML is straightforward with lxml. It is a separate step to take whatever you find and bring it out of the XML. For instance, you may want to iterate over a document, looking for paragraphs with a particular format, then pull the text out of those paragraphs. Docx2python v1 did not separate or expose "iter the document" and "pull the content". Docx2python v2 separates and exposes these steps. This will allow easier extension.

See the docx_reader.py module and simple examples in the utilities.py module.
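The two steps can be illustrated separately in a few lines. This sketch uses xml.etree.ElementTree from the standard library instead of lxml, works on a hand-written document fragment, and uses function names invented for the example; it does not show the docx_reader.py API.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Hypothetical document body: two headings and one plain paragraph.
DOCUMENT_XML = (
    f'<w:document xmlns:w="{W}"><w:body>'
    '<w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr>'
    "<w:r><w:t>Chapter One</w:t></w:r></w:p>"
    "<w:p><w:r><w:t>Body text</w:t></w:r></w:p>"
    '<w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr>'
    "<w:r><w:t>Chapter </w:t></w:r><w:r><w:t>Two</w:t></w:r></w:p>"
    "</w:body></w:document>"
)


def iter_paragraphs_with_style(root, style_id):
    """Step 1: iterate the document and select paragraphs by style id."""
    for paragraph in root.iter(f"{{{W}}}p"):
        style = paragraph.find("w:pPr/w:pStyle", NS)
        if style is not None and style.get(f"{{{W}}}val") == style_id:
            yield paragraph


def paragraph_text(paragraph):
    """Step 2: pull the text content out of one paragraph element."""
    return "".join(node.text or "" for node in paragraph.iter(f"{{{W}}}t"))


root = ET.fromstring(DOCUMENT_XML)
headings = [paragraph_text(p) for p in iter_paragraphs_with_style(root, "Heading1")]
print(headings)
```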


How do I convert docx to xml and back?

Good afternoon. Are there any libraries for converting docx files into files of xml code? I need to get the document's code, process it, and convert it back to docx or pdf using python.
I have searched the whole Russian-language internet and even dug into English-language pages, but could not find a satisfactory option.

Александр Карабанов, then I have another question. Is it possible, through docx, to edit only the text of table cells, leaving their style and all the objects in the cell in place?

Александр Карабанов, my apologies, but I had found that solution too. With it, for some reason, the picture in the cell disappears (which is why I am looking for a way to replace values without changing styles).
The only way I found is to edit the xml files. But editing them requires converting docx to xml, which is hard to implement. If tables can be edited directly through py docx, I would be very glad.
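One possible direction for the question above, sketched with the standard library only: change the text nodes of a table cell and leave every other element (cell properties, run formatting, drawings) untouched. The cell XML is a simplified, hand-written fragment and set_cell_text is a name invented for this example; whether this covers a given real document is not something the thread confirms.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Simplified table cell: shading, an (empty) drawing run, and a bold text run.
CELL_XML = (
    f'<w:tc xmlns:w="{W}">'
    '<w:tcPr><w:shd w:fill="FFFF00"/></w:tcPr>'
    "<w:p>"
    "<w:r><w:drawing/></w:r>"
    "<w:r><w:rPr><w:b/></w:rPr><w:t>old value</w:t></w:r>"
    "</w:p>"
    "</w:tc>"
)


def set_cell_text(cell, new_text):
    """Put new_text in the first w:t of the cell and blank any others.

    Only text nodes are modified; formatting and drawings stay in place.
    """
    text_nodes = list(cell.iter(f"{{{W}}}t"))
    if not text_nodes:
        raise ValueError("cell contains no text run to update")
    text_nodes[0].text = new_text
    for extra in text_nodes[1:]:
        extra.text = ""
    return cell


cell = set_cell_text(ET.fromstring(CELL_XML), "new value")
cell_text = "".join(node.text or "" for node in cell.iter(f"{{{W}}}t"))
print(cell_text)
```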


docx-utils 0.1.3

Creation and manipulation of Open XML documents (mainly docx).

License: MIT License

Tags: Microsoft, Office, Word, Excel, PowerPoint, docx, xlsx, pptx, XML

Requires: Python >= 2.7, != 3.0.*, != 3.1.*, != 3.2.*, != 3.3.*, != 3.4.*, < 4

Overview

Creation and manipulation of Open XML documents (mainly docx).

Features

This library allows you to:

Installation

Using the library

Using the library to convert an Open XML document into flat OPC format:

Command Line Interface (CLI)

    $ docx_utils --help
    docx_utils COMMAND

      Docx utilities

      --version  Show the version and exit.
      --help     Show this message and exit.

      flatten    Convert an Open XML document into flat OPC format.

Converting an Open XML document into flat OPC format:

    $ docx_utils flatten sample.docx sample.xml
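To give a feel for what "flat OPC" means, here is a simplified sketch that folds the parts of a zip package into one XML document with pkg:part elements. It uses only the standard library and a stand-in archive built in memory; it is not the docx_utils implementation, it skips details a full converter would handle, and the output has not been checked against Word. All function names are invented for this example.

```python
import base64
import io
import xml.etree.ElementTree as ET
import zipfile

PKG = "http://schemas.microsoft.com/office/2006/xmlPackage"
CT = "http://schemas.openxmlformats.org/package/2006/content-types"
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def make_stub_docx() -> io.BytesIO:
    """Build a stand-in package: content types, one XML part, one binary part."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr(
            "[Content_Types].xml",
            f'<Types xmlns="{CT}">'
            '<Default Extension="xml" ContentType="application/xml"/>'
            "</Types>",
        )
        archive.writestr(
            "word/document.xml",
            f'<w:document xmlns:w="{W}"><w:body><w:p><w:r>'
            "<w:t>Hi</w:t></w:r></w:p></w:body></w:document>",
        )
        archive.writestr("word/media/image1.png", b"\x89PNG stub bytes")
    buffer.seek(0)
    return buffer


def content_type_lookup(archive):
    """Map part names to content types using [Content_Types].xml."""
    types = ET.fromstring(archive.read("[Content_Types].xml"))
    defaults = {
        d.get("Extension").lower(): d.get("ContentType")
        for d in types.findall(f"{{{CT}}}Default")
    }
    overrides = {
        o.get("PartName"): o.get("ContentType")
        for o in types.findall(f"{{{CT}}}Override")
    }

    def lookup(part_name):
        if part_name in overrides:
            return overrides[part_name]
        extension = part_name.rsplit(".", 1)[-1].lower()
        return defaults.get(extension, "application/octet-stream")

    return lookup


def to_flat_opc(docx_file) -> str:
    """Return a single XML string holding every part of the package."""
    ET.register_namespace("pkg", PKG)
    package = ET.Element(f"{{{PKG}}}package")
    with zipfile.ZipFile(docx_file) as archive:
        lookup = content_type_lookup(archive)
        for name in archive.namelist():
            if name == "[Content_Types].xml" or name.endswith("/"):
                continue
            part_name = "/" + name
            content_type = lookup(part_name)
            part = ET.SubElement(
                package,
                f"{{{PKG}}}part",
                {f"{{{PKG}}}name": part_name, f"{{{PKG}}}contentType": content_type},
            )
            data = archive.read(name)
            if content_type.endswith("xml"):
                # XML parts are embedded as elements.
                ET.SubElement(part, f"{{{PKG}}}xmlData").append(ET.fromstring(data))
            else:
                # Binary parts are embedded as base64 text.
                part.set(f"{{{PKG}}}compression", "store")
                binary = ET.SubElement(part, f"{{{PKG}}}binaryData")
                binary.text = base64.b64encode(data).decode("ascii")
    body = ET.tostring(package, encoding="unicode")
    return '<?mso-application progid="Word.Document"?>' + body


flat_xml = to_flat_opc(make_stub_docx())
print(flat_xml[:80])
```

ElementTree rewrites namespace prefixes it has not been told about (w: becomes ns0: or similar), which is harmless for XML consumers but makes the output look different from what Word itself saves.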

Documentation

Development

Changelog

v0.1.3 (2020-07-15)

Fixed

- Correct the project's dependencies: Enum34 is only required for Python versions < 3.4.
- Add the docx_utils.exceptions module: Exception hierarchy for the docx-utils package.
- Fix #1:
  - Add the on_error option in the docx_utils.flatten.opc_to_flat_opc function in order to ignore (or raise an exception) when a part URI cannot be resolved during the Microsoft Office document parsing.
  - Change the command line interface: add the --on-error option to handle parsing errors.
Other

v0.1.2 (2018-07-26)

Fixed
- Drop support for PyPy: it seems that lxml is not available for this Python implementation.
- Drop support for Python 3.7: this Python version is not yet available on all platforms. However, it is known to work on Ubuntu with the python-3.7-dev release.
Other
- Use the pseudo-tags start-exclude / end-exclude in CHANGELOG.rst and README.rst to exclude text from the generated PKG-INFO during setup.
v0.1.1 (2018-07-25)

v0.1.1 (2018-07-24)
