- PHP Simple HTML DOM Parser Manual
- How to create HTML DOM object?
- How to find HTML elements?
- How to access the HTML element’s attributes?
- How to traverse the DOM tree?
- Parsing documents
- DOM methods & properties
- Element methods & properties
- DOM traversing
- Camel naming conventions
- simple_html_dom
- Public Properties
- Protected Properties
- Quick Start
- Read plain text from HTML document
- Read plain text from HTML string
- Read specific elements from HTML document
- Modify HTML documents
- Collect information from Slashdot
PHP Simple HTML DOM Parser Manual
// Find all article blocks
foreach ($html->find('div.article') as $article) {
    $item['title'] = $article->find('div.title', 0)->plaintext;
    $item['intro'] = $article->find('div.intro', 0)->plaintext;
    $item['details'] = $article->find('div.details', 0)->plaintext;
    $articles[] = $item;
}
How to create HTML DOM object?
// Create a DOM object from a string
$html = str_get_html('<html><body>Hello!</body></html>');
// Create a DOM object from a URL
$html = file_get_html('http://www.google.com/');
// Create a DOM object from an HTML file
$html = file_get_html('test.htm');
// Create a DOM object
$html = new simple_html_dom();
// Load HTML from a string
$html->load('<html><body>Hello!</body></html>');
// Load HTML from a URL
$html->load_file('http://www.google.com/');
// Load HTML from an HTML file
$html->load_file('test.htm');
How to find HTML elements?
// Find all anchors, returns an array of element objects
$ret = $html->find('a');
// Find the (N)th anchor, returns an element object or null if not found (zero based)
$ret = $html->find('a', 0);
// Find the last anchor, returns an element object or null if not found
$ret = $html->find('a', -1);
// Find all <div> elements that have an id attribute
$ret = $html->find('div[id]');
// Find all <div> elements whose id attribute is foo
$ret = $html->find('div[id=foo]');
// Find all elements with id=foo
$ret = $html->find('#foo');
// Find all elements with class=foo
$ret = $html->find('.foo');
// Find all elements that have an id attribute
$ret = $html->find('*[id]');
// Find all anchors and images
$ret = $html->find('a, img');
// Find all anchors and images with the "title" attribute
$ret = $html->find('a[title], img[title]');
Supports these operators in attribute selectors:
Filter | Description |
---|---|
[attribute] | Matches elements that have the specified attribute. |
[!attribute] | Matches elements that don’t have the specified attribute. |
[attribute=value] | Matches elements that have the specified attribute with a certain value. |
[attribute!=value] | Matches elements that don’t have the specified attribute with a certain value. |
[attribute^=value] | Matches elements that have the specified attribute and it starts with a certain value. |
[attribute$=value] | Matches elements that have the specified attribute and it ends with a certain value. |
[attribute*=value] | Matches elements that have the specified attribute and it contains a certain value. |
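The operators in the table can be tried out on a small document. The following is a minimal sketch, not part of the original manual: it assumes the library file is saved as simple_html_dom.php next to the script, and the sample markup and variable names are made up for illustration.

```php
<?php
// Assumption: simple_html_dom.php is located next to this script.
require_once 'simple_html_dom.php';

// Hypothetical sample markup, used only for illustration.
$html = str_get_html(
    '<a href="https://example.com/report.pdf" title="Report">Report</a>' .
    '<a href="/about.html">About</a>' .
    '<img src="logo-small.png">'
);

$secure   = $html->find('a[href^=https]');  // href starts with "https"
$pdfs     = $html->find('a[href$=.pdf]');   // href ends with ".pdf"
$logos    = $html->find('img[src*=logo]');  // src contains "logo"
$untitled = $html->find('a[!title]');       // anchors without a title attribute

foreach ($pdfs as $a) {
    echo $a->href, PHP_EOL;
}
```

Each call returns an array of element objects, so the results can be counted or iterated like any other find() result.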
// Find all <li> elements inside <ul> elements
$es = $html->find('ul li');
// Find nested <div> tags
$es = $html->find('div div div');
// Find all <td> tags with attribute align=center inside <table> tags
$es = $html->find('table td[align=center]');
// Find all text blocks
$es = $html->find('text');
// Find all comment (<!--...-->) blocks
$es = $html->find('comment');
foreach ($html->find('ul') as $ul) {
    foreach ($ul->find('li') as $li) {
        // do something...
    }
}
How to access the HTML element’s attributes?
// Get an attribute (for a non-value attribute such as checked or selected, this returns true or false)
$value = $e->href;
// Set an attribute (for a non-value attribute such as checked or selected, set its value to true or false)
$e->href = 'my link';
// Remove an attribute by setting its value to null
$e->href = null;
// Determine whether an attribute exists
if (isset($e->href))
    echo 'href exists!';
// Example
$html = str_get_html("<div>foo <b>bar</b></div>");
$e = $html->find("div", 0);
echo $e->tag; // Returns: "div"
echo $e->outertext; // Returns: "<div>foo <b>bar</b></div>"
echo $e->innertext; // Returns: "foo <b>bar</b>"
echo $e->plaintext; // Returns: "foo bar"
Attribute Name | Usage |
---|---|
$e->tag | Read or write the tag name of element. |
$e->outertext | Read or write the outer HTML text of element. |
$e->innertext | Read or write the inner HTML text of element. |
$e->plaintext | Read or write the plain text of element. |
// Extract contents from HTML
echo $html->plaintext;
// Wrap an element
$e->outertext = '<div class="wrap">' . $e->outertext . '</div>';
// Remove an element by setting its outertext to an empty string
$e->outertext = '';
// Append an element
$e->outertext = $e->outertext . '<div>foo</div>';
// Insert an element
$e->outertext = '<div>foo</div>' . $e->outertext;
How to traverse the DOM tree?
// If you are not familiar with the HTML DOM, read an introduction to it first.
// Example
echo $html->find("#div1", 0)->children(1)->children(1)->children(2)->id;
// or
echo $html->getElementById("div1")->childNodes(1)->childNodes(1)->childNodes(2)->getAttribute('id');
Parsing documents
The parser accepts documents in the form of URLs, files and strings. The document must be accessible for reading and cannot exceed MAX_FILE_SIZE.
Name | Description |
---|---|
str_get_html( string $content ) : object | Creates a DOM object from string. |
file_get_html( string $filename ) : object | Creates a DOM object from file or URL. |
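Because a document can be rejected (for example when it is empty or larger than MAX_FILE_SIZE), it is worth checking the result before using it. The sketch below assumes that the helper functions return false in that case, which matches the library releases known to the editor but should be verified against the version in use. The file location and sample markup are also assumptions.

```php
<?php
// Assumption: simple_html_dom.php is located next to this script.
require_once 'simple_html_dom.php';

// Assumption: str_get_html() returns false when the input is rejected.
$empty = str_get_html('');
if ($empty === false) {
    echo 'Nothing to parse', PHP_EOL;
}

// Hypothetical sample markup, used only for illustration.
$html = str_get_html('<p>Hello</p>');
if ($html !== false) {
    $text = $html->find('p', 0)->plaintext;
    echo $text, PHP_EOL;
    $html->clear(); // release memory once the document is no longer needed
}
```

The same check applies to file_get_html(), which can additionally fail when the file or URL cannot be read.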
DOM methods & properties
Name | Description |
---|---|
__construct( [string $filename] ) : void | Constructor. If the filename parameter is set, the contents are loaded automatically, whether text or file/URL. |
plaintext : string | Returns the contents extracted from HTML. |
clear() : void | Clean up memory. |
load( string $content ) : void | Load contents from string. |
save( [string $filename] ) : string | Dumps the internal DOM tree back into a string. If $filename is set, the resulting string is also saved to that file. |
load_file( string $filename ) : void | Load contents from a file or a URL. |
set_callback( string $function_name ) : void | Set a callback function. |
find( string $selector [, int $index] ) : mixed | Finds elements by CSS selector. Returns the Nth element object if index is set, otherwise returns an array of objects. |
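The methods in this table are typically used together: load a document, change it, dump it back to a string and free the memory. A minimal sketch follows, assuming the library file is saved as simple_html_dom.php next to the script; the markup and variable names are hypothetical.

```php
<?php
// Assumption: simple_html_dom.php is located next to this script.
require_once 'simple_html_dom.php';

$dom = new simple_html_dom();
$dom->load('<p id="msg">Hello</p>'); // hypothetical sample markup

// find() with an index returns a single element, or null if nothing matches.
$p = $dom->find('#msg', 0);
if ($p !== null) {
    $p->innertext = 'Goodbye';
}

$result = $dom->save(); // serialize the modified tree back into a string
$dom->clear();          // clean up memory held by the parser

echo $result, PHP_EOL;
```

Calling save() without a filename only returns the string; passing a filename would also write it to disk.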
Element methods & properties
Name | Description |
---|---|
[attribute] : string | Read or write element’s attribute value. |
tag : string | Read or write the tag name of element. |
outertext : string | Read or write the outer HTML text of element. |
innertext : string | Read or write the inner HTML text of element. |
plaintext : string | Read or write the plain text of element. |
find( string $selector [, int $index] ) : mixed | Finds children by CSS selector. Returns the Nth element object if index is set, otherwise returns an array of objects. |
DOM traversing
Name | Description |
---|---|
$e->children( [int $index] ) : mixed | Returns the Nth child object if index is set, otherwise returns an array of children. |
$e->parent() : element | Returns the parent of element. |
$e->first_child() : element | Returns the first child of element, or null if not found. |
$e->last_child() : element | Returns the last child of element, or null if not found. |
$e->next_sibling() : element | Returns the next sibling of element, or null if not found. |
$e->prev_sibling() : element | Returns the previous sibling of element, or null if not found. |
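The traversal methods can be combined to walk a small tree. The following sketch is an illustration only: the file location, the markup and the variable names are assumptions, not part of the original manual.

```php
<?php
// Assumption: simple_html_dom.php is located next to this script.
require_once 'simple_html_dom.php';

// Hypothetical sample markup, used only for illustration.
$html = str_get_html('<ul id="menu"><li>One</li><li>Two</li><li>Three</li></ul>');

$menu   = $html->find('#menu', 0);
$first  = $menu->first_child();
$second = $first->next_sibling();
$last   = $menu->last_child();

echo count($menu->children()), PHP_EOL;        // number of <li> children
echo $second->plaintext, PHP_EOL;              // text of the second item
echo $last->prev_sibling()->plaintext, PHP_EOL; // same item, reached from the end
echo $second->parent()->id, PHP_EOL;           // id of the enclosing <ul>

// Past the last child there is no sibling, so null is returned.
var_dump($last->next_sibling() === null);
```

Since the sibling methods return null at the edges, check the return value before chaining further calls.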
Camel naming conventions
Method | Mapping |
---|---|
$e->getAllAttributes() | $e->attr |
$e->getAttribute( $name ) | $e->attribute |
$e->setAttribute( $name, $value) | $e->attribute = $value |
$e->hasAttribute( $name ) | isset($e->attribute) |
$e->removeAttribute ( $name ) | $e->attribute = null |
$e->getElementById ( $id ) | $e->find ( "#$id", 0 ) |
$e->getElementsById ( $id [,$index] ) | $e->find ( "#$id" [, int $index] ) |
$e->getElementByTagName ($name ) | $e->find ( $name, 0 ) |
$e->getElementsByTagName ( $name [, $index] ) | $e->find ( $name [, int $index] ) |
$e->parentNode () | $e->parent () |
$e->childNodes ( [$index] ) | $e->children ( [int $index] ) |
$e->firstChild () | $e->first_child () |
$e->lastChild () | $e->last_child () |
$e->nextSibling () | $e->next_sibling () |
$e->previousSibling () | $e->prev_sibling () |
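The camel-case methods and their mapped counterparts are interchangeable, so either style can be used in the same script. A short sketch under the usual assumptions (library file next to the script, hypothetical markup and variable names):

```php
<?php
// Assumption: simple_html_dom.php is located next to this script.
require_once 'simple_html_dom.php';

// Hypothetical sample markup, used only for illustration.
$html = str_get_html('<div id="hello" class="greeting">Hello</div>');

// getElementById( $id ) corresponds to find( "#$id", 0 ).
$e = $html->getElementById('hello');

// getAttribute( $name ) corresponds to reading $e->attribute.
$sameClass = ($e->getAttribute('class') === $e->class);

// setAttribute( $name, $value ) corresponds to $e->attribute = $value.
$e->setAttribute('title', 'Greeting');
$title = $e->title;

// hasAttribute( $name ) corresponds to isset( $e->attribute ).
$hasTitle = ($e->hasAttribute('title') && isset($e->title));

// getElementsByTagName( $name ) corresponds to find( $name ).
$divCount = count($html->getElementsByTagName('div'));

var_dump($sameClass, $title, $hasTitle, $divCount);
```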
simple_html_dom
Represents the DOM in memory. Provides functions to parse documents and access individual elements (see simple_html_dom_node).
Public Properties
Property | Description |
---|---|
root | Root node of the document. |
nodes | List of top-level nodes in the document. |
callback | Callback function that is called for each element in the DOM when generating outertext. |
lowercase | If enabled, all tag names are converted to lowercase when parsing documents. |
original_size | Original document size in bytes. |
size | Current document size in bytes. |
_charset | Charset of the original document. |
_target_charset | Target charset for the current document. |
default_span_text | Text to return for <span> elements. |
Protected Properties
Property | Description |
---|---|
pos | Current parsing position within doc. |
doc | The original document. |
char | Character at position pos in doc. |
cursor | Current element cursor in the document. |
parent | Parent element node. |
noise | Noise from the original document (e.g. scripts, comments). |
token_blank | Tokens that are considered whitespace in HTML. |
token_equal | Tokens to identify the equal sign for attributes, stopping either at the closing tag ("/") or the end of an opening tag (">"). |
token_slash | Tokens to identify the end of a tag name. A tag name ends either on the ending slash ("/") or on whitespace ("\s\r\n\t"). |
token_attr | Tokens to identify the end of an attribute. |
default_br_text | Text to return for <br> elements. |
self_closing_tags | A list of tag names where the closing tag is omitted. |
block_tags | A list of tag names where remaining unclosed tags are forcibly closed. |
optional_closing_tags | A list of tag names where the closing tag can be omitted. |
Quick Start
The sample code below demonstrates the fundamental features of PHP Simple HTML DOM Parser.
Read plain text from HTML document
echo file_get_html('https://www.google.com/')->plaintext;
Loads the specified HTML document into memory, parses it and returns the plain text. Note that file_get_html supports local files as well as remote files!
Read plain text from HTML string
Parses the provided HTML string and returns the plain text. Note that the parser handles partial documents as well as full documents.
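A minimal sketch of this case, assuming the library file is saved as simple_html_dom.php next to the script. The HTML fragment is a made-up example of a partial document.

```php
<?php
// Assumption: simple_html_dom.php is located next to this script.
require_once 'simple_html_dom.php';

// A partial document (no <html> or <body>) is enough for the parser.
$text = str_get_html('<ul><li>Hello, World!</li></ul>')->plaintext;

echo $text, PHP_EOL;
```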
Read specific elements from HTML document
$html = file_get_html('https://www.google.com/');
foreach ($html->find('img') as $element)
    echo $element->src . '<br>';
foreach ($html->find('a') as $element)
    echo $element->href . '<br>';
Loads the specified document into memory and returns a list of image sources as well as anchor links. Note that find supports CSS selectors to find elements in the DOM.
Modify HTML documents
$doc = '<div id="hello">Hello, </div><div id="world">World!</div>';
$html = str_get_html($doc);
$html->find('div', 1)->class = 'bar';
$html->find('div[id=hello]', 0)->innertext = 'foo';
echo $html; // prints the updated HTML
Parses the provided HTML string and replaces elements in the DOM before returning the updated HTML string. In this example, the class for the second div element is set to bar and the inner text for the first div element to foo.
Note that find supports a second parameter to return a single element from the array of matches.
Note that attributes can be accessed directly by the means of magic methods ( ->class and ->innertext in the example above).
Collect information from Slashdot
$html = file_get_html('https://slashdot.org/');
$articles = $html->find('article[data-fhtype="story"]');
foreach ($articles as $article) {
    $item['title'] = $article->find('.story-title', 0)->plaintext;
    $item['intro'] = $article->find('.p', 0)->plaintext;
    $item['details'] = $article->find('.details', 0)->plaintext;
    $items[] = $item;
}
print_r($items);
Collects information from Slashdot for further processing.
Note that the combination of CSS selectors and magic methods makes parsing HTML documents a simple task that is easy to understand.