Parsing an HTML string in Node.js


A very fast HTML parser, generating a simplified DOM, with basic element query support.


taoqf/node-html-parser



Fast HTML Parser

Fast HTML Parser is a very fast HTML parser that generates a simplified DOM tree with element query support.

Install

npm install --save node-html-parser

Note: when using Fast HTML Parser in a TypeScript project, the minimum supported TypeScript version is ^4.1.2.

Performance

    html-parser                     : 24.1595 ms/file ± 18.7667
    htmljs-parser                   : 4.72064 ms/file ± 5.67689
    html-dom-parser                 : 2.18055 ms/file ± 2.96136
    html5parser                     : 1.69639 ms/file ± 2.17111
    cheerio                         : 12.2122 ms/file ± 8.10916
    parse5                          : 6.50626 ms/file ± 4.02352
    htmlparser2                     : 2.38179 ms/file ± 3.42389
    htmlparser                      : 17.4820 ms/file ± 128.041
    high5                           : 3.95188 ms/file ± 2.52313
    node-html-parser                : 2.04288 ms/file ± 1.25203
    node-html-parser (last release) : 2.00527 ms/file ± 1.21317

Usage

Global Methods

parse(data[, options])

Parse the data provided, and return the root of the generated DOM.

    {
      lowerCaseTagName: false,  // convert tag name to lower case (hurts performance heavily)
      comment: false,           // retrieve comments (hurts performance slightly)
      voidTag: {
        tags: ['area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input', 'link', 'meta', 'param', 'source', 'track', 'wbr'], // optional and case insensitive, default value is ['area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input', 'link', 'meta', 'param', 'source', 'track', 'wbr']
        addClosingSlash: true   // optional, default false. void tag serialisation, add a final slash <br/>
      },
      blockTextElements: {
        script: true,    // keep text content when parsing
        noscript: true,  // keep text content when parsing
        style: true,     // keep text content when parsing
        pre: true        // keep text content when parsing
      }
    }

valid(data[, options])

Parse the data provided, return true if the given data is valid, and return false if not.

Class

    classDiagram
    direction TB
    class HTMLElement{
        this trimRight()
        this removeWhitespace()
        Node[] querySelectorAll(string selector)
        Node querySelector(string selector)
        HTMLElement[] getElementsByTagName(string tagName)
        Node closest(string selector)
        Node appendChild(Node node)
        this insertAdjacentHTML('beforebegin' | 'afterbegin' | 'beforeend' | 'afterend' where, string html)
        this setAttribute(string key, string value)
        this setAttributes(Record~string, string~ attrs)
        this removeAttribute(string key)
        string getAttribute(string key)
        this exchangeChild(Node oldNode, Node newNode)
        this removeChild(Node node)
        string toString()
        this set_content(string content)
        this set_content(Node content)
        this set_content(Node[] content)
        this remove()
        this replaceWith((string | Node)[] ...nodes)
        ClassList classList
        HTMLElement clone()
        HTMLElement getElementById(string id)
        string text
        string rawText
        string tagName
        string structuredText
        string structure
        Node firstChild
        Node lastChild
        Node nextSibling
        HTMLElement nextElementSibling
        Node previousSibling
        HTMLElement previousElementSibling
        string innerHTML
        string outerHTML
        string textContent
        Record~string, string~ attributes
        [number, number] range
    }
    class Node{
        <<abstract>>
        string toString()
        Node clone()
        this remove()
        number nodeType
        string innerText
        string textContent
    }
    class ClassList{
        add(string c)
        replace(string c1, string c2)
        remove(string c)
        toggle(string c)
        boolean contains(string c)
        number length
        string[] value
        string toString()
    }
    class CommentNode{
        CommentNode clone()
        string toString()
    }
    class TextNode{
        TextNode clone()
        string toString()
        string rawText
        string trimmedRawText
        string trimmedText
        string text
        boolean isWhitespace
    }
    Node --|> HTMLElement
    Node --|> CommentNode
    Node --|> TextNode
    Node ..> ClassList

HTMLElement Methods

trimRight()

Trim element from right (in block) after seeing pattern in a TextNode.

removeWhitespace()

Remove whitespace in this subtree.

querySelectorAll(selector)

Query CSS selector to find matching nodes.

Note: Full range of CSS3 selectors supported since v3.0.0.

querySelector(selector)

Query CSS Selector to find matching node.

getElementsByTagName(tagName)

Get all elements with the specified tagName.

Note: Use * for all elements.

closest(selector)

Find the closest matching element by CSS selector.

appendChild(node)

Append a child node to childNodes.

insertAdjacentHTML(where, html)

Parses the specified text as HTML and inserts the resulting nodes into the DOM tree at a specified position.

setAttribute(key: string, value: string)

Set the attribute named key to value.

setAttributes(attrs: Record<string, string>)

Set attributes of the element.

removeAttribute(key: string)

getAttribute(key: string)

exchangeChild(oldNode: Node, newNode: Node)

Exchanges given child with new child.

removeChild(node: Node)

toString()

set_content(content: string | Node | Node[])

Set content. Notice: Do not set content of the root node.

remove()

replaceWith(...nodes: (string | Node)[])

Replace current element with other node(s).

classList

classList.add

classList.replace(old: string, new: string)

Replace class name with another one.

classList.remove()

classList.toggle(className: string):void

Toggle class. Remove it if it is already included, otherwise add.

classList.contains(className: string): boolean

Returns true if the classname is already in the classList.

classList.value

clone()

getElementById(id: string): HTMLElement;

HTMLElement Properties

text

Get unescaped text value of current node and its children. Like innerText. (slow for the first time)

rawText

Get escaped (as-is) text value of current node and its children. May contain entities such as &amp;. (fast)

tagName

Get or Set tag name of HTMLElement. Notice: the returned value would be an uppercase string.

structuredText

structure

firstChild

lastChild

innerHTML

outerHTML

nextSibling

Returns a reference to the next child node of the current element’s parent.

nextElementSibling

Returns a reference to the next child element of the current element’s parent.

previousSibling

Returns a reference to the previous child node of the current element’s parent.

previousElementSibling

Returns a reference to the previous child element of the current element’s parent.

textContent

Get or Set textContent of current element, more efficient than set_content.

attributes

Get all attributes of current element. Notice: do not try to change the returned value.

range

Start and end indexes of the element in the source string (e.g. [0, 40]).


Parsing HTML with Node

The perils of using regular expressions to parse HTML are well documented; many articles explain why parsing HTML with regex is such a bad idea.

Using regex

That said, if you are at least a moderate regex user, it’s hard to resist the pull of a non-greedy regular expression for parsing HTML.

I needed to parse the <link> tags where rel="stylesheet" out of an HTML document with Node the other day. In a hurry I threw this code at the problem (fileContents is the contents of an HTML file):

and it spit out this result:

which at first seemed correct, but one of those link tags was embedded in an HTML comment. Notice that the result returned for the first line above ends with -->. While that would mostly do what I wanted, it just didn’t feel right.
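The original snippet and its output are not reproduced here. As a stand-in, the following sketch shows the same kind of problem with a non-greedy regex; the pattern, the sample HTML and the file names are assumptions for illustration, not the code from the article.

```javascript
// Hypothetical HTML standing in for fileContents.
const fileContents = [
  '<head>',
  '<!-- <link rel="stylesheet" href="old.css"> -->',
  '<link rel="stylesheet" href="main.css">',
  '<link rel="icon" href="favicon.ico">',
  '</head>',
].join('\n');

// Non-greedy match for <link ...> tags that carry rel="stylesheet".
const linkPattern = /<link\b[^>]*?rel=["']stylesheet["'][^>]*?>/gi;

const matches = fileContents.match(linkPattern) || [];

// The regex knows nothing about comments, so the commented-out tag is matched too.
console.log(matches);
```

The pattern has no notion of document structure, so the commented-out old.css link is returned alongside the live one.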

Using JSSoup

Having used Python’s Beautiful Soup HTML-parsing library, I then looked at the JSSoup NPM package.

This produced the correct results:

JSSoup has a Beautiful Soup-like syntax, and I was almost ready to use it when I noticed this comment at the bottom of its NPM package home page (sic): "There’s a lot of work need to be done." I agree. JSSoup is not nearly as effective as Python’s Beautiful Soup.

Yikes. No thanks. While I’m not sure if that comment reflects incompleteness (which may be OK) or some tests not passing (which is definitely not OK) I moved on.

Using node-html-parser

Next up was the node-html-parser NPM package. This package has a very JavaScript-like API, is mature, and has many users.
