- How to convert HTML page to plain text in node.js?
- 5 Answers
- Method 1: Using a Package
- Step 1: Install the html-to-text package
- Step 2: Require the package in your Node.js file
- Step 3: Convert the HTML page to plain text using the htmlToText method
- Method 2: Regular Expression
- Method 3: DOM Parsing with Cheerio
- Method 4: DOM Parsing with JSDOM
How to convert HTML page to plain text in node.js?
I know this has been asked before, but I can't find a good answer for Node.js. I need to extract the plain text (no tags, scripts, etc.) server-side from an HTML page that has been fetched. I know how to do it client-side with jQuery (get the .text() contents of the body tag), but I do not know how to do this on the server side. I've tried https://npmjs.org/package/html-to-text, but it doesn't handle scripts.

    var htmlToText = require('html-to-text');
    var request = require('request');

    request.get(url, function (error, result) {
      var text = htmlToText.fromString(result.body, { wordwrap: 130 });
    });
5 Answers
Use jsdom and jQuery (server-side).
With jQuery you can delete all scripts, styles, templates and the like and then you can extract the text.
(This was tested only in Chrome, not with jsdom and Node.)

    jQuery('script').remove();
    jQuery('noscript').remove();
    jQuery('body').text().replace(/\s/g, ' ');
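A note on the last step: `.replace(/\s/g, ' ')` turns each whitespace character into one space, so runs of newlines and indentation become runs of spaces. If collapsed output is what you want, something like the following sketch may be closer. `normalizeWhitespace` is a hypothetical helper name, not part of jQuery or jsdom.

```javascript
// Hypothetical helper (not part of jQuery or jsdom): collapse every run of
// whitespace into a single space and trim both ends of the string.
function normalizeWhitespace(text) {
  return text.replace(/\s+/g, ' ').trim();
}

const raw = 'Hello\n\n   world\t!';

// Per-character replacement keeps one space for every whitespace character.
console.log(JSON.stringify(raw.replace(/\s/g, ' ')));

// Collapsing runs prints "Hello world !".
console.log(JSON.stringify(normalizeWhitespace(raw)));
```

The `+` quantifier is the only real difference: it matches a whole run of whitespace at once instead of one character at a time.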
How do I delete the scripts? `$.find("script").delete()` generates a no-such-method error. `jsdom.env({ url: url, scripts: ["code.jquery.com/jquery.js"], done: function (errors, window) { var $ = window.$; $.find("script").delete();`
Sorry, .delete is not the right method; it's remove(). But generally you should first test the script in your browser (Chrome, Firefox, or Safari, not MSIE!). In Chrome you can simply press Shift+Ctrl+I to open the Developer Tools. Load the page and test your script in the Script tab. Be aware that $ might not be jQuery. To be safe, use jQuery instead of $. And be careful not to delete the jQuery script too soon!
For those searching for a regex solution, here is mine:
const HTMLPartToTextPart = (HTMLPart) => ( HTMLPart .replace(/\n/ig, '') .replace(/