Html to text nodejs

How to convert HTML page to plain text in node.js?

I know this has been asked before, but I can’t find a good answer for node.js. I need to extract the plain text (no tags, scripts, etc.) from a fetched HTML page on the server side. I know how to do it client-side with jQuery (get the .text() contents of the body tag), but I don’t know how to do this on the server. I’ve tried https://npmjs.org/package/html-to-text, but it doesn’t handle scripts.

    var htmlToText = require('html-to-text');
    var request = require('request');

    request.get(url, function (error, result) {
        var text = htmlToText.fromString(result.body, { wordwrap: 130 });
    });

5 Answers

Use jsdom and jQuery (server-side).

With jQuery you can delete all scripts, styles, templates and the like and then you can extract the text.

(This is not tested with jsdom and node, only in Chrome)

    jQuery('script').remove();
    jQuery('noscript').remove();
    jQuery('body').text().replace(/\s/g, ' ');

How do I delete the scripts? $.find("script").delete() generates a no-such-method error.

    jsdom.env({
        url: url,
        scripts: ["code.jquery.com/jquery.js"],
        done: function (errors, window) {
            var $ = window.$;
            $.find("script").delete();

Sorry, .delete is not the right method; it’s remove(). But generally you should first test the script in your browser (Chrome, Firefox, or Safari, not MSIE!). In Chrome you can simply press Shift+Ctrl+I to open the Developer Tools. Load the page and test your script in the Script tab. Be aware that $ might not be jQuery. To be safe, use jQuery instead of $. And be careful not to delete the jQuery script too soon!


For those searching for a regex solution, here is mine:

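    const HTMLPartToTextPart = (HTMLPart) => (
      HTMLPart
        // Remove existing line breaks; structure is rebuilt from the tags below.
        .replace(/\n/ig, '')
        // Drop style, head and script blocks together with their contents.
        .replace(/<style[^>]*>[\s\S]*?<\/style[^>]*>/ig, '')
        .replace(/<head[^>]*>[\s\S]*?<\/head[^>]*>/ig, '')
        .replace(/<script[^>]*>[\s\S]*?<\/script[^>]*>/ig, '')
        // Turn closing block tags and <br> into line breaks.
        .replace(/<\/\s*(?:p|div)>/ig, '\n')
        .replace(/<br[^>]*\/?>/ig, '\n')
        // Strip all remaining tags.
        .replace(/<[^>]*>/ig, '')
        // Replace every non-breaking space entity, not only the first one.
        .replace(/&nbsp;/ig, ' ')
        // Collapse runs of spaces and tabs, keeping line breaks.
        .replace(/[^\S\r\n][^\S\r\n]+/ig, ' ')
    );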
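A regex chain like the one above leaves character references such as &amp; or &#169; undecoded. Here is a minimal sketch of a follow-up step, assuming only a handful of named entities matter for your pages. The helper name decodeBasicEntities and the entity table are illustrative and not part of the answer above; a complete decoder would need the full HTML entity list.

```javascript
// Hypothetical helper: decodes a small, hand-picked set of HTML character
// references. Anything it does not recognize is left untouched.
const NAMED_ENTITIES = new Map([
  ['nbsp', ' '],
  ['amp', '&'],
  ['lt', '<'],
  ['gt', '>'],
  ['quot', '"'],
  ['apos', "'"],
]);

function decodeBasicEntities(text) {
  // One pass over the text, so "&amp;lt;" becomes "&lt;" and is not decoded twice.
  return text.replace(/&(#x[0-9a-f]+|#[0-9]+|[a-z]+);/gi, (match, body) => {
    if (body[0] === '#') {
      const isHex = body[1] === 'x' || body[1] === 'X';
      const codePoint = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      // Leave out-of-range numeric references as they are.
      return codePoint <= 0x10ffff ? String.fromCodePoint(codePoint) : match;
    }
    const replacement = NAMED_ENTITIES.get(body.toLowerCase());
    // Unknown names are kept rather than guessed.
    return replacement === undefined ? match : replacement;
  });
}

console.log(decodeBasicEntities('Tom&nbsp;&amp;&nbsp;Jerry &#169; &#x263A;'));
```

Run it after the tags have been stripped; decoding first could turn an escaped &lt;b&gt; into something that looks like a real tag and gets removed.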

As another answer suggested, use JSDOM, but you don’t need jQuery. Try this:

JSDOM.fragment(sourceHtml).textContent 

You can use TextVersionJS (http://textversionjs.com) to generate the plain text version of an HTML string. It’s pure JavaScript (with tons of RegExps), so you can use it in the browser and in node.js as well.

This library may work for your needs, but it’s NOT the same as getting the text of an element in the browser. Its purpose is to create a text version of an HTML email. This means that things like images are included. For example, given the following HTML and code snippet:

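    var textVersion = require("textversionjs");

    var htmlText = "<html>" +
        "<body>" +
        "Lorem ipsum <a href=\"http://foo.foo\">dolor</a> sic <strong>amet</strong><br />" +
        "Lorem ipsum <img src=\"http://foo.jpg\" alt=\"foo\" /> sic <pre>amet</pre>" +
        "<p>Lorem ipsum dolor <br /> sic amet</p>" +
        "<script>alert(\"nothing\");</script>" +
        "</body>" +
        "</html>";

    var plainText = textVersion.htmlToPlainText(htmlText);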

The variable plainText will contain this string:

Lorem ipsum [dolor] (http://foo.foo) sic amet Lorem ipsum ![foo] (http://foo.jpg) sic amet Lorem ipsum dolor sic amet 

Note that it does properly ignore script tags. You’ll find the latest version of the source code on GitHub.
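If the link and image markers are unwanted, the output can be post-processed. The following is a minimal sketch, assuming the markers look like the sample output above ("[text] (url)" for links and "![alt] (url)" for images, with optional whitespace before the parenthesis). The function stripLinkMarkers is hypothetical and not part of TextVersionJS.

```javascript
// Hypothetical post-processing step; not part of TextVersionJS.
// Assumes markers shaped like the sample output: "[text] (url)" for links
// and "![alt] (url)" for images.
function stripLinkMarkers(plainText) {
  return plainText
    // Drop image markers entirely, including their alt text.
    .replace(/!\[[^\]]*\]\s*\([^)]*\)/g, '')
    // Keep only the visible text of links.
    .replace(/\[([^\]]*)\]\s*\([^)]*\)/g, '$1')
    // Collapse the double spaces left behind by removed markers.
    .replace(/[ \t]{2,}/g, ' ');
}

const sample =
  'Lorem ipsum [dolor] (http://foo.foo) sic amet\n' +
  'Lorem ipsum ![foo] (http://foo.jpg) sic amet';

console.log(stripLinkMarkers(sample));
```

Ordinary text that happens to contain square brackets followed by parentheses would be rewritten too, so this only suits input where that pattern always denotes a marker.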


How to convert html page to plain text in node.js?

If you need to extract the plain text content of an HTML page in a Node.js environment, there are several methods you can use. Here are a few options for converting HTML to plain text in Node.js:

Method 1: Using a Package

There are several packages available in Node.js that can help you convert HTML to plain text. One such package is html-to-text . Here’s how you can use it to convert an HTML page to plain text:

Step 1: Install the html-to-text package

npm install html-to-text

Step 2: Require the package in your Node.js file

const htmlToText = require('html-to-text');

Step 3: Convert the HTML page to plain text using the htmlToText method

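// fromString() is the older API name; recent html-to-text releases
// expose the same conversion as convert().
const html = '<p>Hello, World!</p><p>This is an example HTML page.</p>';
const plainText = htmlToText.fromString(html);
console.log(plainText);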

This will output the following plain text:

Hello, World! This is an example HTML page.

You can also customize the output by passing options to the htmlToText method. For example, you can disable word wrapping and preserve the newlines that appear in the source text:

const options = {
  wordwrap: null,         // do not wrap long lines
  preserveNewlines: true, // keep newlines found in the input text
  linkHrefBaseUrl: ''
};
const plainText = htmlToText.fromString(html, options);
console.log(plainText);

This will output the following plain text:

Hello, World! This is an example HTML page.

In summary, to convert an HTML page to plain text in Node.js using the html-to-text package, you need to install the package, require it in your file, and use the htmlToText method to convert the HTML to plain text. You can also customize the output by passing options to the method.

Method 2: Regular Expression

To convert an HTML page to plain text in Node.js using Regular Expression, you can follow these steps:

  1. First, install the cheerio, html-to-text and request npm packages. Cheerio is a jQuery-like library for parsing and manipulating HTML, while html-to-text is a module that converts HTML to plain text.
npm install cheerio html-to-text request
  2. Next, use cheerio to load the HTML page, convert it with html-to-text, and clean up the result with a regular expression. Here is an example:
const cheerio = require('cheerio');
const htmlToText = require('html-to-text');
const request = require('request'); // note: the request package is deprecated

request('https://www.example.com', (error, response, html) => {
  if (!error && response.statusCode === 200) {
    // Parse the page, then serialize it back to HTML for html-to-text.
    const $ = cheerio.load(html);
    const text = htmlToText.fromString($.html(), {
      wordwrap: 130,
      ignoreImage: true,
      ignoreHref: true
    });
    // Replace every line break with a space.
    const plainText = text.replace(/(\r\n|\n|\r)/gm, " ");
    console.log(plainText);
  }
});

In this example, we are using the request module to fetch the HTML page from the URL. Then, we are using the cheerio library to load the HTML and extract the text content using htmlToText.fromString() method with options to ignore images and links. Finally, we are using Regular Expression to replace line breaks with spaces and print the plain text output.

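Alternatively, you can skip html-to-text and use cheerio’s own text() method:

const cheerio = require('cheerio');
const request = require('request');

request('https://www.example.com', (error, response, html) => {
  if (!error && response.statusCode === 200) {
    const $ = cheerio.load(html);
    // Extract the text, then replace every line break with a space.
    const plainText = $.text().replace(/(\r\n|\n|\r)/gm, " ");
    console.log(plainText);
  }
});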

In this example, we are using the text() method of cheerio to extract the text content from HTML and then using Regular Expression to replace line breaks with spaces and print the plain text output.

These are just some examples of how to convert an HTML page to plain text in Node.js using Regular Expression. There are many other ways to achieve this, depending on your specific use case and requirements.
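The line-break replacement used above turns each break into a space, but it leaves runs of consecutive spaces and leading or trailing whitespace in place. Here is a minimal sketch of a stricter cleanup step; the helper name normalizeWhitespace is illustrative.

```javascript
// Hypothetical helper: collapses every run of whitespace (spaces, tabs,
// line breaks, non-breaking spaces) into a single space and trims the ends.
function normalizeWhitespace(text) {
  return text.replace(/\s+/g, ' ').trim();
}

const extracted = '\n  Hello,\r\n\tWorld!  \n\n  This is   an example.\n';
console.log(JSON.stringify(normalizeWhitespace(extracted)));
```

This flattens paragraphs into one line. If paragraph breaks matter, collapse only spaces and tabs and handle line breaks separately.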

Method 3: DOM Parsing with Cheerio

To convert an HTML page to plain text in Node.js using DOM Parsing with Cheerio, follow these steps:

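  1. Require the cheerio and fs modules:
const cheerio = require('cheerio');
const fs = require('fs');
  2. Read the HTML file:
const html = fs.readFileSync('path/to/html/file.html', 'utf8');
  3. Load the HTML into Cheerio and extract the text of the body element:
const $ = cheerio.load(html);
const plainText = $('body').text();
  4. Write the plain text to a file:
fs.writeFileSync('path/to/plain/text/file.txt', plainText);

Here’s the complete code example:

const cheerio = require('cheerio');
const fs = require('fs');

const html = fs.readFileSync('path/to/html/file.html', 'utf8');
const $ = cheerio.load(html);
const plainText = $('body').text();
fs.writeFileSync('path/to/plain/text/file.txt', plainText);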

This code will read the HTML file, convert it to plain text using Cheerio, and save the plain text to a file. You can modify the code to suit your specific needs, such as changing the file paths or selecting a different element to get the plain text from.
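The read, convert, write flow can also be wrapped in a small reusable function. This is a sketch under stated assumptions: htmlFileToTextFile and its toText parameter are hypothetical names, and toText stands in for whichever converter you choose (for example, a function built on Cheerio). To stay free of third-party dependencies, the demo passes a deliberately naive tag-stripping function that is not robust enough for real pages.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper: reads an HTML file, converts it with the supplied
// function, and writes the result. `toText` is a placeholder for any
// HTML-to-text converter.
function htmlFileToTextFile(inputPath, outputPath, toText) {
  // Passing an encoding makes readFileSync return a string, not a Buffer.
  const html = fs.readFileSync(inputPath, 'utf8');
  const plainText = toText(html);
  fs.writeFileSync(outputPath, plainText, 'utf8');
  return plainText;
}

// Demo with temporary files and a naive stand-in converter.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'html-to-text-'));
const inputPath = path.join(dir, 'page.html');
const outputPath = path.join(dir, 'page.txt');
fs.writeFileSync(inputPath, '<body><p>Hello</p> <p>World</p></body>', 'utf8');

const naiveToText = (html) => html.replace(/<[^>]*>/g, '').trim();
const written = htmlFileToTextFile(inputPath, outputPath, naiveToText);
console.log(written);
```

Keeping the converter as a parameter lets the file handling stay the same when you switch between the methods described in this article.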

Method 4: DOM Parsing with JSDOM

To convert an HTML page to plain text in Node.js using DOM Parsing with JSDOM, you can follow these steps:

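  1. Install the jsdom and dom-to-text packages:
npm install jsdom
npm install dom-to-text
  2. Create a JSDOM instance from the HTML string:
const jsdom = require("jsdom");
const { JSDOM } = jsdom;
const dom = new JSDOM(html);
  3. Get the document and extract its plain text:
const document = dom.window.document;
const plainText = require('dom-to-text').getPlainText(document);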

Here’s the complete code example:

const jsdom = require("jsdom");
const { JSDOM } = jsdom;

const html = "<html><body><p>Hello World!</p></body></html>";
const dom = new JSDOM(html);
const document = dom.window.document;
const plainText = require('dom-to-text').getPlainText(document);
console.log(plainText);

Note that this approach only converts the text content of the HTML elements. If you want to include other information such as attributes or tags, you may need to modify the code accordingly.
