html-docx-js

How to convert HTML to DOCX in the browser

DOCX is a file format commonly associated with Microsoft Word. Not everyone is aware, however, that it is a standardized open format, not restricted by any license. Implemented according to the Office Open XML (OOXML) specification, DOCX shares a similar structure with presentation files (PPTX) and spreadsheets (XLSX). Interestingly, a file in the OOXML format is actually a ZIP archive of related XML files. We can easily verify this by unpacking any DOCX file:

t3rmian@wasp:~/$ unzip "test.docx"
Archive:  test.docx
   creating: word/
   creating: word/media/
 extracting: word/media/image-WngyOTnaQ.png
 extracting: word/media/image-Xom8iU2nqh.png
 extracting: word/media/image-2MiVrdV3Lg.png
 extracting: word/media/image-u5K-49bkCE.png
 extracting: word/media/image-AM5Ve0JASj.png
 extracting: word/media/image-L85HC3HelY.png
 extracting: word/media/image-TGo0ZXsleV.png
 extracting: word/media/image-YOBg89XJk0.png
   creating: _rels/
 extracting: _rels/.rels
   creating: docProps/
 extracting: docProps/core.xml
   creating: word/theme/
 extracting: word/theme/theme1.xml
 extracting: word/document.xml
 extracting: word/fontTable.xml
 extracting: word/styles.xml
 extracting: word/numbering.xml
 extracting: word/settings.xml
 extracting: word/webSettings.xml
   creating: word/_rels/
 extracting: word/_rels/document.xml.rels
 extracting: [Content_Types].xml

After unpacking, in the word folder we will find XML files, among others responsible for styles (styles.xml), document content (document.xml) with references (_rels/document.xml.rels) to various resources (media/*), e.g. images.
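The same check can be done from JavaScript without an unzip tool. The sketch below is only an illustration: it reads the entry names from the central directory of a ZIP archive held in a Uint8Array. The function name listZipEntries is hypothetical, and the code assumes a plain (non-ZIP64, unencrypted) archive, which is what typical DOCX files are.

```javascript
// Hypothetical helper: lists the file names stored in a ZIP (and thus DOCX) archive.
// Assumes a regular archive without ZIP64 extensions.
function listZipEntries(bytes) {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);

  // The "end of central directory" record sits at the end of the file,
  // so scan backwards for its signature (0x06054b50, little-endian).
  let eocd = -1;
  for (let i = bytes.length - 22; i >= 0; i--) {
    if (view.getUint32(i, true) === 0x06054b50) {
      eocd = i;
      break;
    }
  }
  if (eocd < 0) throw new Error("Not a ZIP archive");

  const entryCount = view.getUint16(eocd + 10, true); // total number of entries
  let offset = view.getUint32(eocd + 16, true);       // start of the central directory

  const decoder = new TextDecoder();
  const names = [];
  for (let n = 0; n < entryCount; n++) {
    if (view.getUint32(offset, true) !== 0x02014b50) {
      throw new Error("Corrupt central directory");
    }
    const nameLength = view.getUint16(offset + 28, true);
    const extraLength = view.getUint16(offset + 30, true);
    const commentLength = view.getUint16(offset + 32, true);
    // The file name follows the 46-byte fixed part of the entry header.
    names.push(decoder.decode(bytes.subarray(offset + 46, offset + 46 + nameLength)));
    offset += 46 + nameLength + extraLength + commentLength;
  }
  return names;
}
```

In a browser, the bytes could come from a file input, e.g. listZipEntries(new Uint8Array(await file.arrayBuffer())). For a DOCX file the result should contain entries such as word/document.xml and [Content_Types].xml.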

HTML to DOCX conversion – document building or altChunks?

There are two approaches to converting an HTML document to DOCX. We can build the document by converting individual HTML tags and styles to their DOCX equivalents, or we can use the altChunk feature.

An example of an HTML document displayed in Google Docs exported to DOCX using html-to-docx

The first approach is self-explanatory, but what is altChunk? The altChunk element is simply a pointer to a file whose contents are processed and imported into the document by an application (e.g. Microsoft Word) that supports the indicated format. This option doesn’t give much control over the resulting document.

Among the most popular applications that can display the DOCX format, only Microsoft Word correctly displays a document built using altChunk. In LibreOffice Writer, Apache OpenOffice Writer, and Google Docs we will see a blank document. Keep this in mind when choosing or implementing a conversion from HTML to OOXML.
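To make the altChunk idea more concrete, here is a minimal sketch of the two XML parts involved: a document.xml whose body holds only the altChunk pointer, and the relationships file that resolves the pointer to the imported file. The function names and the relationship id are illustrative assumptions; the namespaces and the aFChunk relationship type come from the OOXML specification. A real document would need more parts (content types, section properties) to open correctly.

```javascript
// Namespaces defined by the OOXML specification.
const W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
const R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships";
const PKG_RELS_NS = "http://schemas.openxmlformats.org/package/2006/relationships";

// Hypothetical helper: word/document.xml containing nothing but a pointer
// to external content identified by a relationship id.
function buildAltChunkDocument(relationshipId) {
  return `<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document xmlns:w="${W_NS}" xmlns:r="${R_NS}">
  <w:body>
    <w:altChunk r:id="${relationshipId}"/>
  </w:body>
</w:document>`;
}

// Hypothetical helper: word/_rels/document.xml.rels mapping the id
// to the file that the word processor is expected to import.
function buildAltChunkRelationships(relationshipId, target) {
  return `<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Relationships xmlns="${PKG_RELS_NS}">
  <Relationship Id="${relationshipId}" Type="${R_NS}/aFChunk" Target="${target}"/>
</Relationships>`;
}

console.log(buildAltChunkDocument("htmlChunk"));
console.log(buildAltChunkRelationships("htmlChunk", "/word/afchunk.mht"));
```

Since the body carries no text of its own, an application that ignores altChunk has nothing to render, which is consistent with the blank document seen outside Microsoft Word.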

Client side conversion

For web applications, an undoubted advantage is the ability to implement the feature on the client (browser) side. This reduces server-side processing and delegates the work to the client, making the application more scalable and closer to a distributed system. Converting an HTML file to DOCX is not an easy task, even for someone familiar with the structure of the OOXML format.

Among the available solutions, however, we have a choice of two JavaScript libraries that implement this complicated process. A solution based on the altChunk feature can be found in the slightly older html-docx-js project, while tag and style conversion is used in the more recent html-to-docx library.

html-docx-js

Using the html-docx-js library is really simple. All we need to do is add its script to our website. If you use the npm package manager, you can find the library under the same name and install it with the npm i html-docx-js command. It is worth mentioning that html-docx-js also works on the server side. But let’s see how we can use it in the browser:

Example page: the text "Hello HTML" followed by a "Download" link.
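The script part of such a page can be sketched as follows. This is a minimal illustration, not the author's exact code: it assumes the library's browser build exposes a global htmlDocx object with the asBlob function described in its README, and that the page contains a link with the id "download". The function name attachDocxDownload and its parameters are hypothetical; the converter and the URL object are passed in so that the sketch also runs outside a browser.

```javascript
// `converter` stands for the global `htmlDocx` object of html-docx-js,
// `urlApi` for the browser's `URL` object (both are assumptions of this sketch).
function attachDocxDownload(link, html, converter, urlApi) {
  const blob = converter.asBlob(html);      // HTML string -> DOCX Blob
  link.href = urlApi.createObjectURL(blob); // Blob -> blob: URL usable as href
  link.download = "test.docx";              // file name suggested to the browser
  return link;
}

// In a page that has loaded the library script and contains
// <a id="download">Download</a>, the call could look like this:
//
//   attachDocxDownload(
//     document.getElementById("download"),
//     document.documentElement.outerHTML,
//     htmlDocx,
//     URL
//   );
```

Clicking the link then downloads the converted document under the suggested file name.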

Once the page has loaded, it is converted to DOCX format and made available as a blob URL in the href of the download link. The unpacked DOCX archive will contain, among others, the word/document.xml file.

The actual content of the document can be found in the file it references, word/afchunk.mht:

MIME-Version: 1.0
Content-Type: multipart/related; type="text/html"; boundary="----=mhtDocumentPart"

------=mhtDocumentPart
Content-Type: text/html; charset="utf-8"
Content-Transfer-Encoding: quoted-printable
Content-Location: file:///C:/fake/document.html

Hello HTML

Download

------=mhtDocumentPart--
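The wrapping of an HTML string into such an MHT part can be sketched as below. The function name wrapHtmlAsMht is hypothetical, and the encoding is deliberately simplified: it only escapes the equals sign, whereas full quoted-printable encoding also handles non-ASCII bytes and long lines. Treat it as an illustration of the structure rather than as the library's implementation.

```javascript
// Hypothetical helper: wraps an HTML string in a single-part MHT document
// with the same headers as the listing above.
function wrapHtmlAsMht(html) {
  const boundary = "----=mhtDocumentPart";
  // Simplified quoted-printable: "=" is the escape character, so it must
  // itself be written as "=3D". Assumes ASCII input with short lines.
  const encoded = html.replace(/=/g, "=3D");
  return [
    "MIME-Version: 1.0",
    `Content-Type: multipart/related; type="text/html"; boundary="${boundary}"`,
    "",
    `--${boundary}`,
    'Content-Type: text/html; charset="utf-8"',
    "Content-Transfer-Encoding: quoted-printable",
    "Content-Location: file:///C:/fake/document.html",
    "",
    encoded,
    "",
    `--${boundary}--`,
  ].join("\r\n");
}

console.log(wrapHtmlAsMht('<p class="greeting">Hello HTML</p>'));
```

Note that each delimiter line is the boundary value prefixed with two extra dashes, and the final one is also suffixed with two dashes.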

Moreover, images must first be converted to base64 form.
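A minimal sketch of that conversion, assuming the image bytes are already available as a Uint8Array: the hypothetical helper toDataUri builds a base64 data URI that can be placed in the src attribute of an img tag before the HTML is passed to the converter. It relies only on the standard btoa function.

```javascript
// Hypothetical helper: turns raw image bytes into a base64 data URI.
function toDataUri(bytes, mimeType) {
  // btoa expects a "binary string" in which each character is one byte.
  let binary = "";
  for (const byte of bytes) {
    binary += String.fromCharCode(byte);
  }
  return `data:${mimeType};base64,${btoa(binary)}`;
}

// In a browser the bytes could be fetched from the image URL, for example:
//
//   const response = await fetch(img.src);
//   const bytes = new Uint8Array(await response.arrayBuffer());
//   img.src = toDataUri(bytes, "image/png");
```

The MIME type has to match the actual image format; "image/png" above is only an example.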

html-to-docx

Converting HTML by building a document is a complicated process, and the html-to-docx library handles it best. Version 1.2.2, the latest at the time of writing, also offers some support for generating documents in the browser. Using npm, install the module with the npm i html-to-docx command. The harder step comes next: importing the library on a website is not as straightforward as in the previous case.

In the ./node_modules/html-to-docx/dist/ folder we find two files, html-to-docx.esm.js and html-to-docx.umd.js, which we can load on the website. The source of the problem turns out to be the dependency on other CommonJS (CJS) modules. Modules of this type must be bundled or transpiled before they can be loaded in the browser, and unfortunately they are not bundled with the library. Often this is not a problem: if we already use a bundler, loading the library usually requires no additional steps.

To get familiar with the topic, let’s see how to build a browser script from scratch. One of the quickest solutions is to install the webpack bundler: npm i webpack webpack-cli --save-dev. Its latest version does not require any additional configuration. Then, in the src/index.js file, add the code referencing the installed library:

import HTMLtoDOCX from "html-to-docx/dist/html-to-docx.umd"

// Convert the whole page and expose the result through the download link
const link = document.getElementById("download")
HTMLtoDOCX(document.documentElement.outerHTML)
  .then(blob => {
    link.href = URL.createObjectURL(blob)
  })

Next, add the polyfills for the necessary features that are not implemented in browsers: npm i util url buffer. The configuration in webpack.config.js and the number of polyfills can differ depending on the dependency and webpack versions (see the demo for Webpack 5). Finally, build the code bundle with the npx webpack command, checking for any missing dependencies. The HTML file will look like this:

      

Example page: the text "Hello HTML" followed by a "Download" link.
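For the polyfill step mentioned above, a webpack.config.js for Webpack 5 might look roughly like this. It is a sketch under assumptions, not the configuration from the demo: it assumes the util, url and buffer packages were installed as described, and the exact list of fallbacks depends on which modules your build reports as missing.

```javascript
// webpack.config.js: a sketch for Webpack 5; adjust the fallbacks
// to the modules that the build actually reports as missing.
const webpack = require("webpack");

module.exports = {
  mode: "production",
  entry: "./src/index.js",
  resolve: {
    // Map Node.js core modules to the browser polyfills installed from npm.
    fallback: {
      util: require.resolve("util/"),
      url: require.resolve("url/"),
      buffer: require.resolve("buffer/"),
    },
  },
  plugins: [
    // Provide the Buffer global to dependencies that expect it to exist.
    new webpack.ProvidePlugin({ Buffer: ["buffer", "Buffer"] }),
  ],
};
```

With webpack's default output settings, npx webpack writes the bundle to dist/main.js, which the HTML page can then load with a script tag.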

Meanwhile, the word/document.xml, after running the conversion in the browser, will contain specific word processing tags and styles:

Text runs in the generated document: "Hello HTML", "Download".
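To give an idea of what such word processing tags look like, the sketch below builds a single WordprocessingML paragraph (w:p) with one run (w:r) and its text (w:t). The helper names are hypothetical, and the output is a simplified illustration rather than the exact markup that html-to-docx generates, which also includes style and property elements.

```javascript
// Escapes characters that are not allowed in XML text content.
function escapeXml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Hypothetical helper: an HTML paragraph roughly corresponds to a w:p element
// that contains a run (w:r) with optional run properties (w:rPr) and text (w:t).
function buildParagraph(text, { bold = false } = {}) {
  const runProperties = bold ? "<w:rPr><w:b/></w:rPr>" : "";
  return (
    "<w:p><w:r>" +
    runProperties +
    `<w:t xml:space="preserve">${escapeXml(text)}</w:t>` +
    "</w:r></w:p>"
  );
}

console.log(buildParagraph("Hello HTML"));
console.log(buildParagraph("Download", { bold: true }));
```

This is the kind of mapping the document-building approach has to perform for every supported HTML tag and style.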

As with html-docx-js, images have to be converted to base64 format. Additionally, in the current version you have to keep them out of some specific tags; otherwise they might not be displayed.

Summary

The JavaScript libraries html-docx-js and html-to-docx allow you to convert HTML documents to DOCX in two different ways. The DOCX format itself is not that complicated, and you can view or create your own document as an archive of XML files. In production, it is worth remembering that you will not always get the same result in every application that displays the OOXML format, due to implementation differences (e.g. image anchoring in LibreOffice and Microsoft Word).

Do not forget to convert relevant images to base64 format. In case of problems with referencing external resources, e.g. complex SVG references, you can consider the canvg library. For other elements that may be unsupported, an interesting approach is to render them as an image using html2canvas. Also consider contributing to the above-mentioned projects if you find a fix for any of the problems you encounter.

2023/04/18: Added a source link to a minimal working example.

HTML to DOCX Converter

CloudConvert is an online document converter. Amongst many others, we support PDF, DOCX, PPTX, XLSX. Thanks to our advanced conversion technology the quality of the output will be as good as if the file was saved through the latest Microsoft Office 2021 suite.

HTML

HTML is a markup language used to create web pages, and web browsers can parse HTML files. The format uses tags to build web content and can embed text, images, headings, tables, and so on. Other languages such as CSS and PHP can be used together with HTML tags.

DOCX

DOCX is an XML-based word processing file format developed by Microsoft. DOCX files differ from DOC files in that they store data in separate compressed files and folders. Versions of Microsoft Office earlier than Office 2007 do not support DOCX files, because DOCX is XML-based whereas those versions saved a DOC file as a single binary file.

+200 Formats Supported

CloudConvert is your universal app for file conversions. We support nearly all audio, video, document, ebook, archive, image, spreadsheet, and presentation formats. Plus, you can use our online tool without downloading any software.

Data Security

CloudConvert has been trusted by our users and customers since its founding in 2012. No one except you will ever have access to your files. We earn money by selling access to our API, not by selling your data. Read more about that in our Privacy Policy.

High-Quality Conversions

Besides using open source software under the hood, we’ve partnered with various software vendors to provide the best possible results. Most conversion types can be adjusted to your needs such as setting the quality and many other options.

Powerful API

Our API allows custom integrations with your app. You pay only for what you actually use, and there are huge discounts for high-volume customers. We provide a lot of handy features such as full Amazon S3 integration. Check out the CloudConvert API.

HTML to DOCX converter

This online document converter allows you to convert your files from HTML to DOCX in high quality.

We support a lot of different file formats like PDF, DOCX, PPTX, XLSX and many more. By using the online-convert.com conversion technology, you will get very accurate conversion results.

How to convert an HTML file to a DOCX file?

  1. Choose the HTML file you want to convert
  2. Change quality or size (optional)
  3. Click on "Start conversion" to convert your file from HTML to DOCX
  4. Download your DOCX file

File Format

HTML (Hypertext Markup Language with a client-side image map)

HTML (HyperText Markup Language) is the standard for creating websites. The idea was proposed in 1989 by physicist Tim Berners-Lee at CERN. Web browsers can read this language to interpret the coding into different texts, colors, formats (headings, p.

DOCX (Microsoft Word Open XML Document)

DOCX is an advanced version of the DOC file format and is much more usable and accessible than the latter at any given time. Unlike the DOC file, the DOCX file is not an extensive file format. Instead, it appears as being a single file while actuall.

Convert HTML to DOCX (WORD) / URL to DOCX (WORD) online

An advanced online service for converting HTML files to DOCX. For Mac & Windows

Hypertext Markup Language

HTML is a web file format. HTML source code can be edited in a text editor. HTML files are developed for use in users' web browsers and make it possible to format sites with text, images, and other required materials. Files in this format use tags to build web pages. The HTML code is interpreted by the web browser and is usually not shown to the user.

Microsoft Office Open XML

Since 2007, Microsoft has used the docx file format, which is built on the Office Open XML format. This format is a compressed file containing text in XML form, graphics, and other data that can be converted into bit sequences using patent-protected binary formats. Initially it was assumed that this format would replace the doc format, but both formats are still in use today.
