Introduction
The DOM extension allows you to operate on XML documents through the DOM API with PHP.
Note:
The DOM extension uses UTF-8 encoding. Use mb_convert_encoding(), UConverter::transcode(), or iconv() to handle other encodings.
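A minimal sketch of the conversion step mentioned in the note, assuming the input bytes are known to be ISO-8859-1. The sample string is hypothetical, and iconv() is used here only because it is one of the three functions the note lists.

```php
<?php
// Hypothetical input: an XML fragment whose bytes are ISO-8859-1, not UTF-8.
$latin1 = "<name>Caf\xE9</name>";

// Convert to UTF-8 before handing the string to the DOM extension.
$utf8 = iconv('ISO-8859-1', 'UTF-8', $latin1);

$doc = new DOMDocument();
$doc->loadXML($utf8);

// Node values are returned as UTF-8 strings.
$text = $doc->documentElement->textContent;
echo $text, "\n";
```

If the source encoding is not known in advance, it has to be detected or declared by the caller first; the DOM extension does not guess it.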
User Contributed Notes 2 notes
Be careful when using this for partial HTML. It only accepts complete HTML documents with at least an HTML element and a BODY element. If you are working with partial HTML and wrap it in the missing elements without declaring the character encoding in a META element, the input is treated as ISO-8859-1 and UTF-8 strings are mangled. Example:
$body = getHtmlBody();
$doc = new DOMDocument();
$doc->loadHTML("<html><body>" . $body . "</body></html>");
// $doc will treat your HTML as ISO-8859-1.
// This is correct but may not be what you want if your source is UTF-8.

$body = getHtmlBody();
$doc = new DOMDocument();
$doc->loadHTML("<html><head><meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\"></head><body>" . $body . "</body></html>");
// $doc will treat your HTML correctly as UTF-8.
You can use the code below to load HTML as UTF-8. It is not enough that your input is encoded in UTF-8.
Document Object Model
// same as pq('anything')->htmlOuter()
// but on document root (returns doctype etc)
print phpQuery::getDocument();
It uses the DOM extension and XPath, so it works only in PHP 5.
If you want to use DOMDocument in your PHPUnit tests of a Symfony controller (testing a form), use it like this:
use Symfony\Bundle\FrameworkBundle\Test\WebTestCase;
use YourBundle\Controller\TextController;

class DefaultControllerTest extends WebTestCase
{
    public function testIndex()
    {
        $client = static::createClient(array(), array());
        $crawler = $client->request('GET', '/text/add');
        $this->assertTrue($crawler->filter("form")->count() > 0, "Text form exist !");

        // Assumption: $form was never defined in the note as posted; take it from the crawler.
        $form = $crawler->filter("form")->form();

        // Build an input element with DOMDocument and wrap it as a form field.
        $domDocument = new \DOMDocument;
        $domInput = $domDocument->createElement('input');
        $dom = $domDocument->appendChild($domInput);
        $dom->setAttribute('slug', 'bloc');

        $formInput = new \Symfony\Component\DomCrawler\Field\InputFormField($domInput);
        $form->set($formInput);

        if ($client->getResponse()->isRedirect()) {
            $crawler = $client->followRedirect(false);
        }
        // $this->assertTrue($client->getResponse()->isSuccessful());
        // $this->assertEquals(200, $client->getResponse()->getStatusCode(),
        //     "Unexpected HTTP status code for GET /backoffice/login");
    }
}
When I tried to parse my XHTML Strict files with the DOM extension, it couldn't understand XHTML entities (like &copy;). I found a post about it here (14-Jul-2005 09:05) which advised setting resolveExternals = true, but that was very slow. There was a small note about XML catalogs, but without any clue about how to use them. Here it is:
XML catalogs work something like a cache. Download all the needed DTDs to /etc/xml, edit the file /etc/xml/catalog and add this line:
I hate the DOM model! So I wrote a simple dom2array function (simple to use):
function dom2array($node)
{
    $res = array();
    print $node->nodeType . '<br/>'; // debug output
    if ($node->nodeType == XML_TEXT_NODE) {
        $res = $node->nodeValue;
    } else {
        if ($node->hasAttributes()) {
            $attributes = $node->attributes;
            if (!is_null($attributes)) {
                $res['@attributes'] = array();
                foreach ($attributes as $index => $attr) {
                    $res['@attributes'][$attr->name] = $attr->value;
                }
            }
        }
        if ($node->hasChildNodes()) {
            $children = $node->childNodes;
            for ($i = 0; $i < $children->length; $i++) {
                $child = $children->item($i);
                // Siblings with the same name overwrite each other.
                $res[$child->nodeName] = dom2array($child);
            }
        }
    }
    return $res;
}
The project I'm currently working on uses XPaths to dynamically navigate through chunks of an XML file. I couldn't find any PHP code on the net that would build the XPath to a node for me, so I wrote my own function. It turned out not to be as hard as I thought it might be (yay recursion), though it does entail some PHP shenanigans.
Hopefully it'll save someone else the trouble of reinventing this wheel.
function getNodeXPath($node)
{
    // REMEMBER THAT XPATH INDEXES ARE 1-BASED, NOT 0-BASED.
    // Get the index for the current node by looping through the siblings.
    $parentNode = $node->parentNode;
    if ($parentNode != null) {
        $nodeIndex = 0;
        do {
            $testNode = $parentNode->childNodes->item($nodeIndex);
            $nodeName = $testNode->nodeName;
            $nodeIndex++;

            // PHP trickery! Here we create a counter (a variable variable)
            // based on the node name of the test node to use in the XPath.
            if (!isset($$nodeName)) $$nodeName = 1;
            else $$nodeName++;

            // Failsafe return value.
            if ($nodeIndex > $parentNode->childNodes->length) return "/";
        } while (!$node->isSameNode($testNode));

        // Recursively get the XPath for the parent.
        return getNodeXPath($parentNode) . "/{$node->nodeName}[{$$nodeName}]";
    } else {
        // Hit the root node! Note that the slash is added when
        // building the XPath, so we return just an empty string.
        return "";
    }
}
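For comparison, a short sketch using the built-in DOMNode::getNodePath() method, which returns an XPath location path for a node. This is not the note author's function, and the sample document is hypothetical.

```php
<?php
// Hypothetical sample document with two sibling elements of the same name.
$doc = new DOMDocument();
$doc->loadXML('<root><item>a</item><item>b</item></root>');

// Pick the second <item>; NodeList indexes are 0-based.
$second = $doc->getElementsByTagName('item')->item(1);

// XPath positions are 1-based, so the second item is item[2].
$path = $second->getNodePath();
echo $path, "\n"; // /root/item[2]

// The path can be fed back into DOMXPath to find the same node again.
$xpath = new DOMXPath($doc);
$found = $xpath->query($path)->item(0);
var_dump($found->isSameNode($second)); // bool(true)
```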
If you want to print the content of a DOM XML file, you can use the following code:
$doc = new DOMDocument();
$doc->load($xmlFileName);
echo "<br>" . $doc->documentURI;
$x = $doc->documentElement;
getNodeContent($x->childNodes, 0);

function getNodeContent($nodes, $level)
{
    foreach ($nodes as $item) {
        // print "<br>TIPO: " . $item->nodeType;
        printValues($item, $level);
        if ($item->nodeType == 1) { // DOMElement
            foreach ($item->attributes as $itemAtt) {
                printValues($itemAtt, $level + 3);
            }
            // The note tested a misspelled "lenth" property here; hasChildNodes() is the safe check.
            if ($item->hasChildNodes()) {
                getNodeContent($item->childNodes, $level + 5);
            }
        }
    }
}

function printValues($item, $level)
{
    if ($item->nodeType == 1) { // DOMElement
        printLevel($level);
        print $item->nodeName;
    }
}

// The header of this helper was lost from the note; its name is taken from the call above.
function printLevel($level)
{
    print "<br>";
    if ($level == 0) {
        print "<br>";
    }
    // Indent with one dash per level.
    for ($i = 0; $i < $level; $i++) {
        print "-";
    }
}
As of PHP 5.1, libxml options may be set using constants rather than the proprietary DomDocument properties.
DomDocument->resolveExternals is equivalent to setting
LIBXML_DTDLOAD | LIBXML_DTDATTR
DomDocument->validateOnParse is equivalent to setting
LIBXML_DTDLOAD | LIBXML_DTDVALID
PHP 5.1 users are encouraged to use the new constants.
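A brief sketch of passing libxml constants at load time instead of setting properties on the object. The sample XML, with an internal DTD subset, is hypothetical; the constants and the options argument of loadXML() are part of the extension.

```php
<?php
// Hypothetical document with an internal DTD subset that declares a default attribute.
$xml = <<<'XML'
<?xml version="1.0"?>
<!DOCTYPE note [
  <!ELEMENT note (#PCDATA)>
  <!ATTLIST note lang CDATA "en">
]>
<note>hello</note>
XML;

$doc = new DOMDocument();

// Options are combined with bitwise OR and passed as the second argument.
$loaded = $doc->loadXML($xml, LIBXML_DTDLOAD | LIBXML_DTDATTR);

var_dump($loaded);
echo $doc->documentElement->nodeName, "\n";
```

The same options argument is accepted by load(), loadHTML() and loadHTMLFile().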
DOMDocument::loadHTML
Parses the HTML contained in the string source. Unlike loading XML, HTML does not have to be well-formed to load. This function may also be called statically to load and create a DOMDocument object. The static invocation may be used when no DOMDocument properties need to be set before loading.
Parameters
Since Libxml 2.6.0, you may also use the options parameter to specify additional Libxml parameters.
Return Values
Returns true on success or false on failure. If called statically, returns a DOMDocument object or false on failure.
Errors/Exceptions
If an empty string is passed as source, a warning is generated. This warning is not generated by libxml, so it cannot be handled using libxml's error handling functions.
Prior to PHP 8.0.0 this method could be called statically, but doing so raised an E_DEPRECATED error. As of PHP 8.0.0, calling this method statically throws an Error exception.
Although malformed HTML usually loads successfully, this function may generate E_WARNING errors when it encounters bad markup. libxml's error handling functions may be used to handle these errors.
Examples
Example #1 Creating a Document
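The example body did not survive on this page. The following is a minimal sketch of what creating a document from an HTML string can look like; the markup is hypothetical, and the exact serialized output (doctype, implied elements) depends on the libxml version.

```php
<?php
$doc = new DOMDocument();

// loadHTML() returns true on success and false on failure.
$ok = $doc->loadHTML('<html><body>Test<br></body></html>');

// Serialize the parsed document back to an HTML string.
$html = $doc->saveHTML();
echo $html;
```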
See Also
- DOMDocument::loadHTMLFile() - Load HTML from a file
- DOMDocument::saveHTML() - Dumps the internal document into a string using HTML formatting
- DOMDocument::saveHTMLFile() - Dumps the internal document into a file using HTML formatting
User Contributed Notes 19 notes
You can also load HTML as UTF-8 using this simple hack:
$doc = new DOMDocument();
$doc->loadHTML('<?xml encoding="UTF-8">' . $html);
// dirty fix
foreach ($doc->childNodes as $item) {
    if ($item->nodeType == XML_PI_NODE) {
        $doc->removeChild($item); // remove hack
    }
}
$doc->encoding = 'UTF-8'; // insert proper
DOMDocument is very good at dealing with imperfect markup, but it throws warnings all over the place when it does.
This isn't well documented here. The solution is to use a separate apparatus for dealing with just these errors.
Call libxml_use_internal_errors(true) before calling loadHTML. This prevents errors from bubbling up to your default error handler, and you can then get at them (if you wish) using the other libxml error functions.
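A small sketch of that approach, assuming hypothetical markup. Whether any particular tag produces an error depends on the libxml version, so the loop may print nothing.

```php
<?php
// Collect libxml errors internally instead of emitting PHP warnings.
// The previous setting is returned so it can be restored afterwards.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML('<html><body><section>Hi</section></body></html>');

// Inspect whatever the parser complained about, if anything.
foreach (libxml_get_errors() as $error) {
    echo trim($error->message), ' (line ', $error->line, ")\n";
}

// Empty the error buffer and restore the earlier error handling mode.
libxml_clear_errors();
libxml_use_internal_errors($previous);

$sections = $doc->getElementsByTagName('section')->length;
echo $sections, "\n";
```

Clearing the buffer matters in long-running scripts, because collected errors otherwise accumulate across parses.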
When using loadHTML() to process UTF-8 pages, you may find that the output of the DOM functions does not match the input. For example, if you want to get "Cạnh tranh", you will receive "Cáº¡nh tranh". I suggest using mb_convert_encoding() before loading a UTF-8 page:
$pageDom = new DomDocument();
// Note: the 'HTML-ENTITIES' encoding is deprecated as of PHP 8.2.
$searchPage = mb_convert_encoding($htmlUTF8Page, 'HTML-ENTITIES', "UTF-8");
@$pageDom->loadHTML($searchPage);
Pay attention when loading HTML that has a charset other than ISO-8859-1. Since this method does not actively try to figure out what encoding the HTML you are loading uses (as most browsers do), you have to specify it in the HTML head. If, for instance, your HTML is in UTF-8, make sure you have a meta tag in the HTML's head section:
<meta http-equiv="content-type" content="text/html; charset=utf-8">
If you do not specify the charset like this, all high-ASCII bytes will be HTML-encoded. It is not enough to set the DOM document you are loading the HTML into to UTF-8.
Warning: This does not work well with HTML5 elements such as SVG. Most of the advice on the Web is to turn off errors in order to make it work with HTML5.
If we load HTML5 tags such as <section>, the following error occurs: DOMDocument::loadHTML(): Tag section invalid in Entity
We can disable standard libxml errors (and enable user error handling) by calling libxml_use_internal_errors(true); before loadHTML();
This is quite useful in PHPUnit custom assertions, as in the following example (if using PHPUnit test cases):
// Create a DOMDocument
$dom = new DOMDocument();
// fix html5/svg errors
libxml_use_internal_errors(true);
// Load html
$dom->loadHTML("<section></section>");
$htmlNodes = $dom->getElementsByTagName('section');
if ($htmlNodes->length == 0) {
    $this->assertFalse(TRUE);
} else {
    $this->assertTrue(TRUE);
}
Remember: If you use an HTML5 doctype and a meta element like so
<meta charset="utf-8">
your HTML code will be interpreted as ISO-8859-something and non-ASCII characters will be converted into HTML entities. However, the HTML4-like version will work (as was pointed out 10 years ago by "bigtree at 29a"):
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
It should be noted that when any text is provided within the body tag outside of a containing element, DOMDocument will wrap that text in a paragraph tag (<p>).
For those of you who want to get elements by class from an external URL, I have two useful functions. In this example we get the elements with class 'r' (search result headers) back from a Google search:
1. Check the URL (whether it exists and is reachable)
# URL check
function url_check($url)
{
    $headers = @get_headers($url);
    // Accept any 2xx status line, including HTTP/2 style lines without a minor version.
    return is_array($headers) ? preg_match('/^HTTP\/\d+(\.\d+)?\s+2\d\d\b/', $headers[0]) : false;
}
2. Clean the element you want to get (remove all tags, tabs, new lines etc.)
# Function to clean a string
function clean($text)
{
    // Strip tags, collapse whitespace, replace ';' with '-', then decode entities.
    $clean = html_entity_decode(trim(str_replace(';', '-', preg_replace('/\s+/S', " ", strip_tags($text)))));
    return $clean;
}
After doing that, we can output the search result headers as follows:
$searchstring = 'djceejay';
$url = 'http://www.google.de/webhp#q=' . $searchstring;
if (url_check($url)) {
    $doc = new DomDocument;
    $doc->validateOnParse = true;
    @$doc->loadHtml(file_get_contents($url));
    // DOMDocument has no getElementByClass() method, so query by class with XPath instead.
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query("//*[contains(concat(' ', normalize-space(@class), ' '), ' r ')]") as $node) {
        echo clean($node->textContent) . '<br>';
    }
} else {
    echo 'URL not reachable!'; // Show a message when the URL cannot be reached
}
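Since DOMDocument has no class-based lookup of its own, here is a self-contained sketch of selecting elements by class with DOMXPath. The helper name textByClass and the sample markup are hypothetical, and the class name is assumed to contain no quote characters.

```php
<?php
// Hypothetical helper: return the trimmed text of every element carrying $class.
function textByClass(string $html, string $class): array
{
    $previous = libxml_use_internal_errors(true); // keep parser warnings quiet
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Match whole class tokens, so "r" does not match "row".
    $query = sprintf(
        "//*[contains(concat(' ', normalize-space(@class), ' '), ' %s ')]",
        $class
    );

    $texts = [];
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query($query) as $node) {
        $texts[] = trim($node->textContent);
    }
    return $texts;
}

$html = '<html><body><h3 class="r first">One</h3><p class="row">skip</p><h3 class="r">Two</h3></body></html>';
print_r(textByClass($html, 'r'));
```

The concat/normalize-space idiom is used because XPath 1.0 has no built-in notion of space-separated class tokens.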
Be aware that this function doesn't actually understand HTML: it fixes tag-soup input using the general rules of SGML, so it creates well-formed markup but has no idea which element contexts are allowed.
For example, with input like this where the first element isn’t closed:
loadHTML will change it to this, which is well-formed but invalid:
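The markup from this note did not survive on this page. The sketch below uses a hypothetical input with an unclosed span to show the kind of repair described; the exact output depends on the libxml version, so none is quoted here.

```php
<?php
// Hypothetical tag-soup input: the <span> is never closed.
$input = '<span>hello <div>world</div>';

$previous = libxml_use_internal_errors(true); // suppress parser warnings
$doc = new DOMDocument();
$doc->loadHTML($input);
libxml_clear_errors();
libxml_use_internal_errors($previous);

// The serialized tree is well-formed: every opened element is closed,
// whether or not the resulting nesting is valid HTML.
$fixed = $doc->saveHTML();
echo $fixed;
```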