Codeuctivity/OpenXmlPowerTools
Known missing features — Conversion of DOCX to HTML/CSS
Example — Convert DOCX to HTML
    var sourceDocxFileContent = File.ReadAllBytes("./source.docx");
    using var memoryStream = new MemoryStream();
    await memoryStream.WriteAsync(sourceDocxFileContent, 0, sourceDocxFileContent.Length);
    using var wordProcessingDocument = WordprocessingDocument.Open(memoryStream, true);
    var settings = new WmlToHtmlConverterSettings("htmlPageTile");
    var html = WmlToHtmlConverter.ConvertToHtml(wordProcessingDocument, settings);
    var htmlString = html.ToString(SaveOptions.DisableFormatting);
    File.WriteAllText("./target.html", htmlString, Encoding.UTF8);
- Splitting DOCX/PPTX files into multiple files.
- Combining multiple DOCX files into a single file.
- Populating content in template DOCX files with data from XML.
- Conversion of HTML/CSS to DOCX.
- Searching and replacing content in DOCX/PPTX using regular expressions.
- Managing tracked-revisions, including detecting tracked revisions, and accepting tracked revisions.
- Updating Charts in DOCX/PPTX files, including updating cached data, as well as the embedded XLSX.
- Retrieving metrics from DOCX files, including the hierarchy of styles used, the languages used, and the fonts used.
- Writing XLSX files using far simpler code than directly writing the markup, including a streaming approach that enables writing XLSX files with millions of rows.
- Extracting data (along with formatting) from spreadsheets.
Openxml word to html
In the last article, I showed you how to extract images from a Word document. In this article, I’m going to reuse some of that code and expand it to detect text formatting. I’ll then turn the formatting into HTML. I recently did a project for Envato called WordPress Auto Publisher, in which a Windows service picks up stored Word documents and uploads them to WordPress. To make that process work, I had to turn each Word document into HTML.
This article will describe how to get the following formatted text from Word:
- Bold
- Underline
- Italics
- Highlighted
- Strike through
- Colored text (if it’s something other than standard black)
Prerequisites
The prerequisites are the same as in the last article, so please take a look at the first article in this series before you read this one if you’re unsure where to start. You need the same using statements, and Microsoft’s Open XML SDK (the DocumentFormat.OpenXml package) must be installed from NuGet for your project.
I’m also using the same button event with the same code that retrieves the file and sends it to a function that does the actual translation from Word to HTML. The button is placed on a WPF window. Here is the button’s event function code again.
    private void button_Click(object sender, RoutedEventArgs e)
    {
        // The using blocks close the stream and the document even if an exception is thrown.
        using (FileStream fs = new FileStream(System.IO.Path.GetDirectoryName(Process.GetCurrentProcess().MainModule.FileName) + @"\TestFiles\testfilewithformatting.docx", FileMode.Open))
        using (WordprocessingDocument wdDoc = WordprocessingDocument.Open(fs, false))
        {
            MainDocumentPart mainPart = wdDoc.MainDocumentPart;
            Body body = mainPart.Document.Body;
            if (body != null)
            {
                ConvertWordToHTML(body, mainPart);
            }
        }
    }
Notice that I call a “ConvertWordToHTML” method in this article, and this will be used to loop through our Word document.
Creating the Main Loop for Each Paragraph
If you recall from the last article, Word documents from 2007 to current versions are made up of XML. This XML is how you can parse them without even having Microsoft Office installed on the computer that runs this code.
Word documents are made up of paragraphs that are made up of runs. A paragraph could have 20 runs embedded in it. You need a loop that goes through each paragraph and then you need an embedded second loop that goes through each run.
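To make the paragraph structure concrete, here is a minimal sketch that reads a .docx as a plain ZIP archive and prints the text of each paragraph, using only the .NET base class library. The class name DocxPeek and the method ReadParagraphs are hypothetical names for illustration, and the sketch assumes the main document part is stored at word/document.xml, which is the usual location but is not guaranteed by the format.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper for illustration only; not part of the article's code.
public static class DocxPeek
{
    // Namespace of WordprocessingML elements such as w:p, w:r and w:t.
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the concatenated text of every paragraph (w:p) element.
    // Assumes the main part lives at word/document.xml.
    public static string[] ReadParagraphs(Stream docx)
    {
        using var zip = new ZipArchive(docx, ZipArchiveMode.Read, leaveOpen: true);
        var entry = zip.GetEntry("word/document.xml")
            ?? throw new InvalidDataException("word/document.xml not found");
        using var part = entry.Open();
        var xml = XDocument.Load(part);

        // Each paragraph holds runs (w:r), and each run holds text (w:t).
        return xml.Descendants(W + "p")
                  .Select(p => string.Concat(p.Descendants(W + "t").Select(t => t.Value)))
                  .ToArray();
    }

    public static void Main(string[] args)
    {
        if (args.Length != 1)
        {
            Console.Error.WriteLine("usage: DocxPeek <file.docx>");
            return;
        }
        using var file = File.OpenRead(args[0]);
        foreach (var text in ReadParagraphs(file))
        {
            Console.WriteLine(text);
        }
    }
}
```

The Open XML SDK used in the rest of this article wraps this same XML in typed classes such as Paragraph and Run, so you rarely need to touch the archive directly.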
The code is very similar to the last article’s code except in this one we just need to know the run properties.
    private string ConvertWordToHTML(Body content, MainDocumentPart wDoc)
    {
        string htmlConvertedString = string.Empty;
        // The generic arguments restrict the traversal to paragraphs and runs.
        foreach (Paragraph par in content.Descendants<Paragraph>())
        {
            foreach (Run run in par.Descendants<Run>())
            {
                // RunProperties is null for runs without direct formatting.
                RunProperties props = run.RunProperties;
                htmlConvertedString += ApplyTextFormatting(run.InnerText, props);
            }
        }
        return htmlConvertedString;
    }
Compared to the last article, this method is much smaller but we call a third method “ApplyTextFormatting.” We’ll get to that one in a bit. The important call in this method is retrieving the run properties. This is then assigned to a “props” variable. This has all the properties for the run, including if there are any formatting options. We send the actual text (contained in the InnerText property) and the run’s properties to the ApplyTextFormatting method.
Converting Word Formatting to HTML
Now for the method that does the actual conversion.
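    private string ApplyTextFormatting(string content, RunProperties property)
    {
        // A run without direct formatting has no properties; return the text unchanged.
        if (property == null)
        {
            return content;
        }

        StringBuilder buildString = new StringBuilder(content);
        if (property.Bold != null)
        {
            buildString.Insert(0, "<b>");
            buildString.Append("</b>");
        }
        if (property.Italic != null)
        {
            buildString.Insert(0, "<i>");
            buildString.Append("</i>");
        }
        if (property.Underline != null)
        {
            buildString.Insert(0, "<u>");
            buildString.Append("</u>");
        }
        if (property.Color != null && property.Color.Val != null)
        {
            // Color.Val holds a hex RGB value such as FF0000.
            buildString.Insert(0, "<span style=\"color:#" + property.Color.Val.Value + "\">");
            buildString.Append("</span>");
        }
        if (property.Highlight != null && property.Highlight.Val != null)
        {
            // Highlight.Val is a named color such as yellow.
            buildString.Insert(0, "<span style=\"background-color:" + property.Highlight.Val.InnerText + "\">");
            buildString.Append("</span>");
        }
        if (property.Strike != null)
        {
            buildString.Insert(0, "<s>");
            buildString.Append("</s>");
        }
        return buildString.ToString();
    }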
When a user formats text, the corresponding property object on the run is populated. For instance, if text is bold, the Bold property isn’t null. That’s all we need to know to convert it to HTML. I used a StringBuilder to insert the opening HTML tag in front of the text and append the closing tag at the end of the run. When you format text in Word, the formatted text makes up the entire run, so the content you pass to this method is either formatted as a whole or not at all. If it has no formatting, the text is returned unchanged. The great thing about this method is that if several formats are applied at once (for instance, bold and underline), the method adds both sets of HTML tags to the content.
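The nesting that results from repeated Insert and Append calls can be sketched without the SDK. The example below is an illustration under stated assumptions, not the article's code: FormattingSketch, Wrap and FormatRun are hypothetical names, and the HTML encoding step is an addition on my part, included because run text containing characters such as < or & would otherwise be read as markup.

```csharp
using System;
using System.Net;
using System.Text;

// Hypothetical helper for illustration only; it mirrors the Insert/Append
// wrapping described above without depending on the Open XML SDK.
public static class FormattingSketch
{
    // Wraps the builder's current content in <tag>...</tag>.
    private static void Wrap(StringBuilder sb, string tag)
    {
        sb.Insert(0, "<" + tag + ">");
        sb.Append("</" + tag + ">");
    }

    // Encodes the run text first (an assumption, not in the article),
    // then applies each requested tag in turn.
    public static string FormatRun(string text, bool bold, bool underline)
    {
        var sb = new StringBuilder(WebUtility.HtmlEncode(text));
        if (bold) Wrap(sb, "b");
        if (underline) Wrap(sb, "u");
        return sb.ToString();
    }

    public static void Main()
    {
        // Bold is applied first, so it ends up innermost.
        Console.WriteLine(FormatRun("a < b", bold: true, underline: true));
        // prints: <u><b>a &lt; b</b></u>
    }
}
```

Because every tag wraps whatever is already in the builder, the tags always close in the reverse of the order in which they were opened, so the output stays well formed however many formats are combined.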
Here is the entire page of code including the button event.
    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;
    using System.Text;
    using System.Threading.Tasks;
    using System.Windows;
    using System.Windows.Controls;
    using System.Windows.Data;
    using System.Windows.Input;
    using System.Windows.Media;
    using System.Windows.Media.Imaging;
    using System.Windows.Navigation;
    using System.Windows.Shapes;
    using DocumentFormat.OpenXml.Packaging;
    using DocumentFormat.OpenXml.Wordprocessing;
    using System.Diagnostics;

    namespace DetectWordFormatting
    {
        /// <summary>
        /// Interaction logic for MainWindow.xaml
        /// </summary>
        public partial class MainWindow : Window
        {
            public MainWindow()
            {
                InitializeComponent();
            }

            private string ConvertWordToHTML(Body content, MainDocumentPart wDoc)
            {
                string htmlConvertedString = string.Empty;
                foreach (Paragraph par in content.Descendants<Paragraph>())
                {
                    ParagraphProperties paragraphProperties = par.ParagraphProperties;
                    foreach (Run run in par.Descendants<Run>())
                    {
                        // RunProperties is null for runs without direct formatting.
                        RunProperties props = run.RunProperties;
                        htmlConvertedString += ApplyTextFormatting(run.InnerText, props);
                    }
                }
                return htmlConvertedString;
            }

            /// <summary>
            /// Apply Word style in HTML and return a string with the HTML tags
            /// </summary>
            /// <param name="content"></param>
            /// <param name="property"></param>
            /// <returns>string</returns>
            private string ApplyTextFormatting(string content, RunProperties property)
            {
                // A run without direct formatting has no properties; return the text unchanged.
                if (property == null)
                {
                    return content;
                }

                StringBuilder buildString = new StringBuilder(content);
                if (property.Bold != null)
                {
                    buildString.Insert(0, "<b>");
                    buildString.Append("</b>");
                }
                if (property.Italic != null)
                {
                    buildString.Insert(0, "<i>");
                    buildString.Append("</i>");
                }
                if (property.Underline != null)
                {
                    buildString.Insert(0, "<u>");
                    buildString.Append("</u>");
                }
                if (property.Color != null && property.Color.Val != null)
                {
                    // Color.Val holds a hex RGB value such as FF0000.
                    buildString.Insert(0, "<span style=\"color:#" + property.Color.Val.Value + "\">");
                    buildString.Append("</span>");
                }
                if (property.Highlight != null && property.Highlight.Val != null)
                {
                    // Highlight.Val is a named color such as yellow.
                    buildString.Insert(0, "<span style=\"background-color:" + property.Highlight.Val.InnerText + "\">");
                    buildString.Append("</span>");
                }
                if (property.Strike != null)
                {
                    buildString.Insert(0, "<s>");
                    buildString.Append("</s>");
                }
                return buildString.ToString();
            }

            private void button_Click(object sender, RoutedEventArgs e)
            {
                // The using blocks close the stream and the document even if an exception is thrown.
                using (FileStream fs = new FileStream(System.IO.Path.GetDirectoryName(Process.GetCurrentProcess().MainModule.FileName) + @"\TestFiles\testfilewithformatting.docx", FileMode.Open))
                using (WordprocessingDocument wdDoc = WordprocessingDocument.Open(fs, false))
                {
                    MainDocumentPart mainPart = wdDoc.MainDocumentPart;
                    Body body = mainPart.Document.Body;
                    if (body != null)
                    {
                        ConvertWordToHTML(body, mainPart);
                    }
                }
            }
        }
    }
johnwiichang/OpenXMLWordExtension
Office OpenXML WordProcessing Extension
This library provides a series of methods to easily convert docx documents to html format and access Office OpenXML Word Processing files.
- Convert paragraphs to HTML.
- Extract the pictures in the file (Base64 encoded).
- Convert tables to HTML.
These features currently support only a subset of styling effects, so the library is recommended mainly for extracting text content. Encrypted Office documents are not supported.
    using DocumentFormat.OpenXml.Packaging;
    using OpenXMLWordExtension;

    class Program
    {
        static void Main(string[] args)
        {
            // Open the document read-only and dispose it once the HTML has been written.
            using (WordprocessingDocument docx = WordprocessingDocument.Open("testDocx.docx", false))
            {
                System.IO.File.WriteAllText(@"/Users/johnwii/Desktop/out.html", docx.MainDocumentPart.Document.ToHtml());
            }
        }
    }
Because .NET Core does not currently support the System.Xml.Xsl namespace, graphics (as opposed to images) are not rendered for now. If you need this feature, you can try integrating the VectorConvertor project.
Openxml word to html
Hi, I am trying to convert a document file (.docx) to an HTML file. The conversion currently works, but the HTML file does not retain the formatting. I am using Open XML SDK 2.0. For example, if a paragraph in the .docx file contains red text along with some bold and underlined text, the converted HTML shows every line as plain text and all the formatting is lost. Here is my current code:
    public string ConvertDocxToHtml(string docxFileEncodedData)
    {
        string inputFileName = DateTime.Now.ToString("ddMMyyyyhhmmss") + ".docx";
        string imageDirectoryName = inputFileName.Split('.')[0] + "_files";
        DirectoryInfo imgDirInfo = new DirectoryInfo(HttpContext.Current.Server.MapPath("~/Documents/" + imageDirectoryName));
        int imageCounter = 0;
        byte[] byteArray = Convert.FromBase64String(docxFileEncodedData); // File.ReadAllBytes(docxFile);
        using (MemoryStream memoryStream = new MemoryStream())
        {
            memoryStream.Write(byteArray, 0, byteArray.Length);
            using (WordprocessingDocument doc = WordprocessingDocument.Open(memoryStream, true))
            {
                HtmlConverterSettings settings = new HtmlConverterSettings()
                {
                    PageTitle = inputFileName,
                    ConvertFormatting = false,
                };
                XElement html = HtmlConverter.ConvertToHtml(doc, settings, imageInfo =>
                {
                    DirectoryInfo localDirInfo = imgDirInfo;
                    if (!localDirInfo.Exists)
                        localDirInfo.Create();
                    ++imageCounter;
                    string extension = imageInfo.ContentType.Split('/')[1].ToLower();
                    ImageFormat imageFormat = null;
                    if (extension == "png")
                    {
                        // Convert the .png file to a .jpeg file.
                        extension = "jpeg";
                        imageFormat = ImageFormat.Jpeg;
                    }
                    else if (extension == "bmp")
                        imageFormat = ImageFormat.Bmp;
                    else if (extension == "jpeg")
                        imageFormat = ImageFormat.Jpeg;
                    else if (extension == "tiff")
                        imageFormat = ImageFormat.Tiff;

                    // If the image format is not one that you expect, ignore it,
                    // and do not return markup for the link.
                    if (imageFormat == null)
                        return null;

                    string imageFileName = "image" + imageCounter.ToString() + "." + extension;
                    try
                    {
                        imageInfo.Bitmap.Save(imgDirInfo.FullName + "/" + imageFileName, imageFormat);
                    }
                    catch (System.Runtime.InteropServices.ExternalException)
                    {
                        return null;
                    }
                    XElement img = new XElement(Xhtml.img,
                        new XAttribute(NoNamespace.src, imageDirectoryName + "/" + imageFileName),
                        imageInfo.ImgStyleAttribute,
                        imageInfo.AltText != null ? new XAttribute(NoNamespace.alt, imageInfo.AltText) : null);
                    return img;
                });
                string htmlFilePath = HttpContext.Current.Server.MapPath("~/Documents/" + inputFileName.Split('.')[0] + ".html");
                File.WriteAllText(htmlFilePath, html.ToStringNewLineOnAttributes());
                return ConfigurationManager.AppSettings["ServerUri"].ToString() + "/Documents/" + inputFileName.Split('.')[0] + ".html";
            }
        }
    }