- suqingdong/docx_parser
- README.md
- Parsing Word documents with Python
- Step 1: Import your packages
- Step 2: Parse the document XML
- Step 3: Explore the XML for the sections and text you want
- Step 4: Find all the paragraphs
- Step 5: Find all the “Heading 2” sections
- Step 6: Finally, extract the Heading 2 headers and subsequent text
- What is docparser?
- Installation
- Usage
- Source Distributions
- Built Distribution
- Hashes for python_docparser-1.1.0-py3-none-any.whl
Parse all contents of a docx file with python-docx
suqingdong/docx_parser
README.md
Parse all contents of a docx file with python-docx
python3 -m pip install docx-parser
- paragraph : text paragraph, with style_id
- multipart : paragraph with image or hyperlink
- table : table data with merged_cells
docx_parser --help

# parse image as file
docx_parser tests/demo.docx -D tests/media -o tests/out.file.jl

# parse image as base64 string
docx_parser tests/demo.docx -A base64 -o tests/out.base64.jl
from docx_parser import DocumentParser

infile = 'tests/demo.docx'
doc = DocumentParser(infile)
for _type, item in doc.parse():
    print(_type, item)
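As a minimal sketch of consuming that stream of `(type, item)` pairs, the helper below groups items by their type. `group_by_type` is a hypothetical helper, and the sample items are stand-ins rather than real `docx_parser` output, so the snippet runs without the package installed.

```python
from collections import defaultdict


def group_by_type(pairs):
    """Group (type, item) pairs into a dict mapping type -> list of items."""
    grouped = defaultdict(list)
    for item_type, item in pairs:
        grouped[item_type].append(item)
    return dict(grouped)


# Stand-in data: these item dictionaries are hypothetical, not real parser output.
sample_pairs = [
    ("paragraph", {"text": "Intro"}),
    ("table", {"data": [["a", "b"]]}),
    ("paragraph", {"text": "Outro"}),
]

grouped = group_by_type(sample_pairs)
print({item_type: len(items) for item_type, items in grouped.items()})
```

With the real package, `doc.parse()` could presumably be passed in place of `sample_pairs`.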
Parsing Word documents with Python
If you have ever needed to programmatically examine the text in a Microsoft Word document, you know that getting the text out in the first place can be challenging. Sure, you can manually save your document to a plain text file that’s much easier to process, but if you have multiple documents to examine, that can be painful.
Recently I had such a need and found this Towards Data Science article quite helpful. But let’s take the challenge a little further: suppose you had a document with multiple sections and needed to pull the text from specific sections.
Let’s suppose I need to pull just the text from the “sub-sections”. In my example, I have three sub-sections: Sub-Section 1, Sub-Section 2, and Sub-Section 3. In my Word document, I’ve styled these headers as “Heading 2” text. Here’s how I went about pulling out the text for each of these sections.
Step 1: Import your packages
For my needs, I only need to import zipfile and ElementTree, which is nice as I didn’t need to install any third-party packages:
import zipfile
import xml.etree.ElementTree as ET
Step 2: Parse the document XML
doc = zipfile.ZipFile('./data/test.docx').read('word/document.xml')
root = ET.fromstring(doc)
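This works because a .docx file is a ZIP archive, with the main text stored in the `word/document.xml` member. A minimal sketch for listing what an archive contains follows; `list_docx_parts` is a hypothetical helper, and the tiny in-memory archive is a stand-in so the snippet runs without a real document.

```python
import io
import zipfile


def list_docx_parts(docx_file):
    """Return the names of the files stored inside a .docx (ZIP) archive."""
    with zipfile.ZipFile(docx_file) as archive:
        return archive.namelist()


# Build a tiny stand-in archive in memory; a real .docx holds many more parts.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("[Content_Types].xml", "<Types/>")
    archive.writestr("word/document.xml", "<document/>")
    archive.writestr("word/styles.xml", "<styles/>")

parts = list_docx_parts(buffer)
print(parts)
```

Passing a path such as `'./data/test.docx'` instead of `buffer` should list the parts of a real document.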
Step 3: Explore the XML for the sections and text you want
You’ll spend most of your time here, trying to figure out what elements hold the contents in which you are interested. The XML of Microsoft documents follows the WordprocessingML standard, which can be quite complicated. I spent a lot of time manually reviewing my XML looking for the elements I needed. You can write out the XML like so:
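One way to do that with only the standard library is sketched below. Pretty-printing through `xml.dom.minidom` is my assumption here, not necessarily the approach originally used, and `pretty_xml` and the sample bytes are hypothetical stand-ins for the `doc` value read in Step 2.

```python
import xml.dom.minidom


def pretty_xml(xml_bytes):
    """Return an indented, human-readable rendering of raw XML bytes."""
    return xml.dom.minidom.parseString(xml_bytes).toprettyxml(indent="  ")


# Stand-in for the bytes read from word/document.xml.
sample = (
    b'<w:document xmlns:w="http://schemas.openxmlformats.org/'
    b'wordprocessingml/2006/main"><w:body><w:p><w:r><w:t>Hello</w:t>'
    b"</w:r></w:p></w:body></w:document>"
)

pretty = pretty_xml(sample)
print(pretty)
```

For a large document it is usually easier to write the result to a file and browse it in an editor than to print it.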
Step 4: Find all the paragraphs
To solve my problem, I first decided to pull together a collection of all the paragraphs in the document so that I could later iterate across them and make decisions. To make that work a little easier, I also declared a namespace object used by Microsoft’s WordprocessingML standard:
# Microsoft's XML makes heavy use of XML namespaces; thus, we'll need to reference that in our code
ns = {'w': 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'}
body = root.find('w:body', ns)  # find the XML "body" tag
p_sections = body.findall('w:p', ns)  # under the body tag, find all the paragraph sections
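To see what the namespace mapping buys you, here is a small sketch comparing a prefixed query with the fully qualified form that ElementTree uses internally (`{uri}tag`). The sample XML is a made-up stand-in for a real document body.

```python
import xml.etree.ElementTree as ET

W_URI = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ns = {"w": W_URI}

# Stand-in document with two empty paragraphs.
sample = f'<w:document xmlns:w="{W_URI}"><w:body><w:p/><w:p/></w:body></w:document>'
root = ET.fromstring(sample)

# Query using the "w" prefix, resolved through the ns mapping.
with_prefix = root.findall("w:body/w:p", ns)

# The same query with the namespace URI spelled out in braces.
with_full_uri = root.findall(f"{{{W_URI}}}body/{{{W_URI}}}p")

print(len(with_prefix), len(with_full_uri))
print(root.tag)
```

Both queries find the same elements; the prefix form is simply easier to read and type.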
It can be helpful to actually see the text in each of these sections. Through researching Microsoft’s XML standard, I know that document text is usually contained in “t” elements. So, if I write an XPath query to find all the “t” elements within a given section, I can join the text of all those elements together to get the full text of the paragraph. This code does that:
for p in p_sections:
    text_elems = p.findall('.//w:t', ns)
    print(''.join([t.text or '' for t in text_elems]))  # t.text is None for empty elements
    print()
Step 5: Find all the “Heading 2” sections
Now, let’s iterate through each paragraph section and see if we can figure out which sections have been styled with “Heading 2”. If we can find those Heading 2 sections, we’ll then know that the subsequent text is the text we need.
Through further research into the XML standard, I found that if I search for pStyle elements that contain the value “Heading2”, these will be the sections I’m after. To make my code a little cleaner, I wrote functions to both evaluate each section for the Heading 2 style and extract the full text of the section:
def is_heading2_section(p):
    """Returns True if the given paragraph section has been styled as a Heading2"""
    heading_style_elem = p.find(".//w:pStyle[@w:val='Heading2']", ns)
    return heading_style_elem is not None

def get_section_text(p):
    """Returns the joined text of the text elements under the given paragraph tag"""
    text_elems = p.findall('.//w:t', ns)
    # findall returns a (possibly empty) list; t.text is None for empty elements
    return ''.join([t.text or '' for t in text_elems])

section_labels = [get_section_text(s) if is_heading2_section(s) else '' for s in p_sections]
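If you are not sure which style value your own headings carry (it may not be “Heading2” in every document), a sketch like the following lists the style of each paragraph. `get_style_id` is a hypothetical helper of mine, and the sample XML is a stand-in for a real document.

```python
import xml.etree.ElementTree as ET

W_URI = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ns = {"w": W_URI}


def get_style_id(p):
    """Return the paragraph's pStyle value (e.g. 'Heading2'), or None if unstyled."""
    style_elem = p.find("w:pPr/w:pStyle", ns)
    if style_elem is None:
        return None
    # Attribute names are namespace-qualified too, hence the {uri}val form.
    return style_elem.get(f"{{{W_URI}}}val")


# Stand-in document: a Heading2 paragraph, a plain one, and a Heading1 one.
sample = (
    f'<w:document xmlns:w="{W_URI}"><w:body>'
    '<w:p><w:pPr><w:pStyle w:val="Heading2"/></w:pPr><w:r><w:t>Sub-Section 1</w:t></w:r></w:p>'
    "<w:p><w:r><w:t>Some text.</w:t></w:r></w:p>"
    '<w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr><w:r><w:t>Title</w:t></w:r></w:p>'
    "</w:body></w:document>"
)
body = ET.fromstring(sample).find("w:body", ns)
styles = [get_style_id(p) for p in body.findall("w:p", ns)]
print(styles)
```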
Now, if I print out my section_labels list, I see this:
Step 6: Finally, extract the Heading 2 headers and subsequent text
Now, I can use a simple list comprehension to glue together both the section headers and associated text of the three sub-sections I’m after:
section_text = [ for i, t in enumerate(section_labels) if len(t) > 0]
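As a sketch of how such a comprehension can pair each heading with the paragraph that follows it: the dictionary keys `title` and `text`, the one-paragraph-per-section assumption, and the stand-in lists below are all illustrative choices of mine, not taken from the original code.

```python
# Stand-ins for the values built in the earlier steps: the text of every
# paragraph, and the label list (non-empty only where a Heading 2 was found).
paragraph_texts = ["Intro", "Sub-Section 1", "Text one.", "Sub-Section 2", "Text two."]
section_labels = ["", "Sub-Section 1", "", "Sub-Section 2", ""]

# Pair each heading with the paragraph right after it, skipping a heading
# that happens to be the last paragraph in the document.
section_text = [
    {"title": label, "text": paragraph_texts[i + 1]}
    for i, label in enumerate(section_labels)
    if len(label) > 0 and i + 1 < len(paragraph_texts)
]
print(section_text)
```

If a section can span several paragraphs, you would instead need to keep collecting text until the next heading appears.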
And that list looks like this:
What is docparser?
docparser is a Python package that extracts text from a DOCX document.
Installation
pip install python-docparser
Usage
License: MIT License
Requires: Python >=3.7
Source Distributions
No source distribution files available for this release. See tutorial on generating distribution archives.
Built Distribution
Uploaded Jan 11, 2023 py3
Hashes for python_docparser-1.1.0-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 263ec1b7bc9454d0d03feef4b68bda89f468c913801faf37ce044c20ddcdcff0
MD5 | ab5ee443c4ad0704b893afe7c7147fa8
BLAKE2b-256 | 9e1fed6638ff382a6abaee3460e765c8f52d7036c52ffb9bf581d9194f8b34f6