- About
- Fast HTML Parser
- Install
- Performance
- Usage
- Global Methods
- parse(data[, options])
- valid(data[, options])
- Class
- HTMLElement Methods
- trimRight()
- removeWhitespace()
- querySelectorAll(selector)
- querySelector(selector)
- getElementsByTagName(tagName)
- closest(selector)
- appendChild(node)
- insertAdjacentHTML(where, html)
- setAttribute(key: string, value: string)
- setAttributes(attrs: Record)
- removeAttribute(key: string)
- getAttribute(key: string)
- exchangeChild(oldNode: Node, newNode: Node)
- removeChild(node: Node)
- toString()
- set_content(content: string | Node | Node[])
- remove()
- replaceWith(...nodes: (string | Node)[])
- classList
- classList.add
- classList.replace(old: string, new: string)
- classList.remove()
- classList.toggle(className: string):void
- classList.contains(className: string): boolean
- classList.value
- clone()
- getElementById(id: string): HTMLElement;
- HTMLElement Properties
- text
- rawText
- tagName
- structuredText
- structure
- firstChild
- lastChild
- innerHTML
- outerHTML
- nextSibling
- nextElementSibling
- previousSibling
- previousElementSibling
- textContent
- attributes
- range
- Parsing HTML with Node
- Using regex
- Using JSSoup
- Using node-html-parser
About
A very fast HTML parser, generating a simplified DOM, with basic element query support.
Fast HTML Parser
Fast HTML Parser is a very fast HTML parser that generates a simplified DOM tree with element query support.
Install
npm install --save node-html-parser
Note: when using Fast HTML Parser in a TypeScript project, the minimum supported TypeScript version is ^4.1.2.
Performance
html-parser                     : 24.1595 ms/file ± 18.7667
htmljs-parser                   : 4.72064 ms/file ± 5.67689
html-dom-parser                 : 2.18055 ms/file ± 2.96136
html5parser                     : 1.69639 ms/file ± 2.17111
cheerio                         : 12.2122 ms/file ± 8.10916
parse5                          : 6.50626 ms/file ± 4.02352
htmlparser2                     : 2.38179 ms/file ± 3.42389
htmlparser                      : 17.4820 ms/file ± 128.041
high5                           : 3.95188 ms/file ± 2.52313
node-html-parser                : 2.04288 ms/file ± 1.25203
node-html-parser (last release) : 2.00527 ms/file ± 1.21317
Usage
Global Methods
parse(data[, options])
Parse the data provided, and return the root of the generated DOM.
{
  lowerCaseTagName: false,  // convert tag name to lower case (hurts performance heavily)
  comment: false,           // retrieve comments (hurts performance slightly)
  voidTag: {
    // optional and case insensitive; the default value is the list shown here
    tags: ['area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input', 'link', 'meta', 'param', 'source', 'track', 'wbr'],
    addClosingSlash: true   // optional, default false. void tag serialisation, add a final slash
  },
  blockTextElements: {
    script: true,    // keep text content when parsing
    noscript: true,  // keep text content when parsing
    style: true,     // keep text content when parsing
    pre: true        // keep text content when parsing
  }
}
valid(data[, options])
Parse the data provided, return true if the given data is valid, and return false if not.
Class
classDiagram
direction TB
class HTMLElement {
	this trimRight()
	this removeWhitespace()
	Node[] querySelectorAll(string selector)
	Node querySelector(string selector)
	HTMLElement[] getElementsByTagName(string tagName)
	Node closest(string selector)
	Node appendChild(Node node)
	this insertAdjacentHTML('beforebegin' | 'afterbegin' | 'beforeend' | 'afterend' where, string html)
	this setAttribute(string key, string value)
	this setAttributes(Record~string, string~ attrs)
	this removeAttribute(string key)
	string getAttribute(string key)
	this exchangeChild(Node oldNode, Node newNode)
	this removeChild(Node node)
	string toString()
	this set_content(string content)
	this set_content(Node content)
	this set_content(Node[] content)
	this remove()
	this replaceWith((string | Node)[] ...nodes)
	ClassList classList
	HTMLElement clone()
	HTMLElement getElementById(string id)
	string text
	string rawText
	string tagName
	string structuredText
	string structure
	Node firstChild
	Node lastChild
	Node nextSibling
	HTMLElement nextElementSibling
	Node previousSibling
	HTMLElement previousElementSibling
	string innerHTML
	string outerHTML
	string textContent
	Record~string, string~ attributes
	[number, number] range
}
class Node {
	<<abstract>>
	string toString()
	Node clone()
	this remove()
	number nodeType
	string innerText
	string textContent
}
class ClassList {
	add(string c)
	replace(string c1, string c2)
	remove(string c)
	toggle(string c)
	boolean contains(string c)
	number length
	string[] value
	string toString()
}
class CommentNode {
	CommentNode clone()
	string toString()
}
class TextNode {
	TextNode clone()
	string toString()
	string rawText
	string trimmedRawText
	string trimmedText
	string text
	boolean isWhitespace
}
Node --|> HTMLElement
Node --|> CommentNode
Node --|> TextNode
Node ..> ClassList
HTMLElement Methods
trimRight()
Trim element from right (in block) after seeing pattern in a TextNode.
removeWhitespace()
Remove whitespace in this subtree.
querySelectorAll(selector)
Query CSS selector to find matching nodes.
Note: Full range of CSS3 selectors supported since v3.0.0.
querySelector(selector)
Query a CSS selector to find the first matching node.
getElementsByTagName(tagName)
Get all elements with the specified tagName.
Note: Use * for all elements.
closest(selector)
Query the closest element by CSS selector.
appendChild(node)
Append a child node to childNodes.
insertAdjacentHTML(where, html)
Parses the specified text as HTML and inserts the resulting nodes into the DOM tree at a specified position.
setAttribute(key: string, value: string)
Set value to key attribute.
setAttributes(attrs: Record)
Set attributes of the element.
removeAttribute(key: string)
getAttribute(key: string)
exchangeChild(oldNode: Node, newNode: Node)
Exchanges given child with new child.
removeChild(node: Node)
toString()
set_content(content: string | Node | Node[])
Set content. Notice: Do not set content of the root node.
remove()
replaceWith(...nodes: (string | Node)[])
Replace current element with other node(s).
classList
classList.add
classList.replace(old: string, new: string)
Replace class name with another one.
classList.remove()
classList.toggle(className: string):void
Toggle class. Remove it if it is already included, otherwise add.
classList.contains(className: string): boolean
Returns true if the classname is already in the classList.
classList.value
clone()
getElementById(id: string): HTMLElement;
HTMLElement Properties
text
Get unescaped text value of current node and its children. Like innerText. (slow for the first time)
rawText
Get escaped (as-is) text value of current node and its children. May contain entities such as &amp; in it. (fast)
tagName
Get or Set tag name of HTMLElement. Notice: the returned value would be an uppercase string.
structuredText
structure
firstChild
lastChild
innerHTML
outerHTML
nextSibling
Returns a reference to the next child node of the current element’s parent.
nextElementSibling
Returns a reference to the next child element of the current element’s parent.
previousSibling
Returns a reference to the previous child node of the current element’s parent.
previousElementSibling
Returns a reference to the previous child element of the current element’s parent.
textContent
Get or Set textContent of current element, more efficient than set_content.
attributes
Get all attributes of current element. Notice: do not try to change the returned value.
range
Corresponding source code start and end indexes (e.g. [0, 40]).
Parsing HTML with Node
The perils of using regular expressions to parse HTML are well documented. Take a look at the articles here, here, here, and here for why parsing HTML with regex is such a bad idea.
Using regex
That said, if you are at least a moderate regex user, it’s hard to resist the pull of a non-greedy regular expression for parsing HTML.
I needed to parse the link tags where rel="stylesheet" out of an HTML document with Node the other day. In a hurry, I threw this code at the problem (fileContents is the contents of an HTML file):
and it spit out this result:
which at first seemed correct, but one of those link tags was embedded in an HTML comment. Notice that the result returned for the first line above ends with -->. While that would mostly do what I wanted, it just didn’t feel right.
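To make the pitfall concrete, here is a rough sketch of the same kind of mistake. It is not the code I originally ran: the sample document, the pattern, and the names `linkPattern` and `regexMatches` are assumptions made up for illustration.

```javascript
// Hypothetical sample document; the commented-out link is the trap.
const fileContents = `
<head>
  <!-- <link rel="stylesheet" href="old.css"> -->
  <link rel="stylesheet" href="main.css">
  <link rel="icon" href="favicon.ico">
</head>`;

// Non-greedy match of any <link ...> tag that mentions rel="stylesheet".
// The pattern has no notion of HTML comments, so it cannot skip them.
const linkPattern = /<link\b[^>]*?rel="stylesheet"[^>]*?>/gi;

// String.prototype.match returns null when nothing matches, hence the fallback.
const regexMatches = fileContents.match(linkPattern) ?? [];

console.log(regexMatches);
```

The pattern picks up two tags here, including the one for old.css that sits inside the comment, because a regular expression sees only characters and not document structure.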
Using JSSoup
Having used Python’s Beautiful Soup HTML-parsing library, I then looked at the JSSoup NPM package.
This produced the correct results:
JSSoup has a Beautiful Soup-like syntax, and I had almost settled on it when I noticed this comment at the bottom of its NPM package home page (sic): "There’s a lot of work need to be done." I agree. JSSoup is not nearly as effective as Python’s Beautiful Soup is.
Yikes. No thanks. While I’m not sure if that comment reflects incompleteness (which may be OK) or some tests not passing (which is definitely not OK), I moved on.
Using node-html-parser
Next up was the node-html-parser NPM package. This package has a very JavaScript-like API, is mature, and has many users.