PHP DOMDocument: get all text

Contents

Document Object Model
Processing HTML using PHP’s DomDocument class
# Case 1: How to automagically make responsive images from all the images in the page
# Case 2: How to make responsive videos from all the Youtube videos in the page
# Case 3: How to remove the style tags from all the HTML elements in the page
# Case 4: How to automatically add rel = nofollow to all the links
Conclusion

Document Object Model

// same as pq('anything')->htmlOuter()
// but on document root (returns doctype etc)
print phpQuery::getDocument();

phpQuery uses the DOM extension and XPath, so it requires PHP 5.

To use DOMDocument in a PHPUnit functional test for a Symfony controller (here, to test a form), you can do something like this:

use Symfony\Bundle\FrameworkBundle\Test\WebTestCase;
use YourBundle\Controller\TextController;

class DefaultControllerTest extends WebTestCase
{
    public function testIndex()
    {
        $client = static::createClient(array(), array());

        $crawler = $client->request('GET', '/text/add');
        $this->assertTrue($crawler->filter('form')->count() > 0, 'Text form exists!');

        // Assumption: the original snippet never defines $form; here it is
        // taken from the first <form> element on the page.
        $form = $crawler->filter('form')->form();

        // Build an extra <input> with DOMDocument and add it to the form.
        $domDocument = new \DOMDocument;

        $domInput = $domDocument->createElement('input');
        $dom = $domDocument->appendChild($domInput);
        $dom->setAttribute('slug', 'bloc');

        $formInput = new \Symfony\Component\DomCrawler\Field\InputFormField($domInput);
        $form->set($formInput);

        if ($client->getResponse()->isRedirect()) {
            $crawler = $client->followRedirect(false);
        }

        // $this->assertTrue($client->getResponse()->isSuccessful());
        // $this->assertEquals(200, $client->getResponse()->getStatusCode(),
        //     'Unexpected HTTP status code for GET /backoffice/login');
    }
}

When I tried to parse my XHTML Strict files with the DOM extension, it couldn't understand XHTML entities (like &copy;). I found a post about it here (14-Jul-2005 09:05) that advised setting resolveExternals = true, but that was very slow. There was also a small note about XML catalogs, but without any clue about how to use them. Here it is:

An XML catalog is something like a cache. Download all the needed DTDs to /etc/xml, edit the file /etc/xml/catalog and add this line:

I hate the DOM model!
So I wrote a simple (simple to use) dom2array function:

function dom2array($node)
{
    $res = array();
    print $node->nodeType . "\n"; // debug output: type of each visited node
    if ($node->nodeType == XML_TEXT_NODE) {
        $res = $node->nodeValue;
    } else {
        if ($node->hasAttributes()) {
            $attributes = $node->attributes;
            if (!is_null($attributes)) {
                $res['@attributes'] = array();
                foreach ($attributes as $index => $attr) {
                    $res['@attributes'][$attr->name] = $attr->value;
                }
            }
        }
        if ($node->hasChildNodes()) {
            $children = $node->childNodes;
            for ($i = 0; $i < $children->length; $i++) {
                $child = $children->item($i);
                // Note: children that share a node name overwrite each other.
                $res[$child->nodeName] = dom2array($child);
            }
        }
    }
    return $res;
}
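Because dom2array keys children by node name, repeated siblings (several item elements, say) end up as a single entry. One possible way around that, sketched here as an illustration and not as the note author's code, is to collect every child into a list under its name. The function name domToArrayKeepingSiblings and the sample XML are hypothetical.

```php
<?php
// Hypothetical variant of dom2array: children are collected into lists,
// so siblings with the same node name are all kept.
function domToArrayKeepingSiblings(DOMNode $node)
{
    // Text nodes are returned as plain strings.
    if ($node->nodeType === XML_TEXT_NODE) {
        return $node->nodeValue;
    }

    $res = [];
    if ($node->hasAttributes()) {
        foreach ($node->attributes as $attr) {
            $res['@attributes'][$attr->name] = $attr->value;
        }
    }

    // Append each child to the list stored under its node name.
    foreach ($node->childNodes as $child) {
        $res[$child->nodeName][] = domToArrayKeepingSiblings($child);
    }

    return $res;
}

$doc = new DOMDocument();
$doc->loadXML('<list id="a"><item>one</item><item>two</item></list>');
$result = domToArrayKeepingSiblings($doc->documentElement);
print_r($result);
```

The trade-off is that every child name now maps to a list, even when there is only one child, which makes the array more verbose but predictable.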

The project I’m currently working on uses XPaths to dynamically navigate through chunks of an XML file. I couldn’t find any PHP code on the net that would build the XPath to a node for me, so I wrote my own function. Turns out it wasn’t as hard as I thought it might be (yay recursion), though it does entail using some PHP shenanigans.

Hopefully it’ll save someone else the trouble of reinventing this wheel.

function getNodeXPath($node)
{
    // REMEMBER THAT XPATHS USE BASE-1 INSTEAD OF BASE-0.

    // Get the index for the current node by looping through the siblings.
    $parentNode = $node->parentNode;
    if ($parentNode != null) {
        $nodeIndex = 0;
        do {
            $testNode = $parentNode->childNodes->item($nodeIndex);
            $nodeName = $testNode->nodeName;
            $nodeIndex++;

            // PHP trickery! Here we create a counter based on the node
            // name of the test node to use in the XPath.
            // Caveat: an element named like a local variable of this
            // function (e.g. "node") would collide with that variable.
            if (!isset($$nodeName)) $$nodeName = 1;
            else $$nodeName++;

            // Failsafe return value.
            if ($nodeIndex > $parentNode->childNodes->length) return "/";
        } while (!$node->isSameNode($testNode));

        // Recursively get the XPath for the parent.
        return getNodeXPath($parentNode) . "/" . $node->nodeName . "[" . $$nodeName . "]";
    } else {
        // Hit the root node! Note that the slash is added when
        // building the XPath, so we return just an empty string.
        return "";
    }
}
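For comparison, the DOM extension also offers DOMNode::getNodePath(), which returns an XPath location path for a node. The short sketch below uses a made-up sample document and checks that evaluating the returned path with DOMXPath leads back to the same node. It is meant as an illustration of the round trip, not as a replacement for the function above.

```php
<?php
// Sample document (hypothetical) with two sibling <book> elements.
$doc = new DOMDocument();
$doc->loadXML('<library><book/><book><title>DOM</title></book></library>');

// Pick the <title> element inside the second <book>.
$title = $doc->getElementsByTagName('title')->item(0);

// Ask the DOM extension for an XPath location path to that node.
$path = $title->getNodePath();
echo $path, "\n";

// Round trip: evaluating the path should give back the same node.
$xpath = new DOMXPath($doc);
$found = $xpath->query($path)->item(0);
var_dump($found !== null && $found->isSameNode($title));
```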

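If you want to print the content of an XML file through the DOM, you can use the following code:

// $xmlFileName is assumed to hold the path of the XML file to print.
$doc = new DOMDocument();
$doc->load($xmlFileName);
echo "\n" . $doc->documentURI;
$x = $doc->documentElement;
getNodeContent($x->childNodes, 0);

function getNodeContent($nodes, $level)
{
    foreach ($nodes as $item) {
        // print "\n\nTIPO: " . $item->nodeType;
        printValues($item, $level);
        if ($item->nodeType == 1) { // DOMElement
            foreach ($item->attributes as $itemAtt) {
                printValues($itemAtt, $level + 3);
            }
            if ($item->childNodes && $item->childNodes->length > 0) {
                getNodeContent($item->childNodes, $level + 5);
            }
        }
    }
}

function printValues($item, $level)
{
    if ($item->nodeType == 1) { // DOMElement
        printLevel($level);
        print $item->nodeName;
    }
    // The handling of attribute and text nodes below is an assumption;
    // the surviving listing only shows the element branch.
    if ($item->nodeType == 2) { // DOMAttr
        printLevel($level);
        print $item->name . " = " . $item->value;
    }
    if ($item->nodeType == 3 && trim($item->nodeValue) !== '') { // DOMText
        printLevel($level);
        print $item->nodeValue;
    }
}

// Starts a new line and indents it with one dash per level.
function printLevel($level)
{
    print "\n";
    if ($level == 0) {
        print "\n";
    }
    for ($i = 0; $i < $level; $i++) {
        print "-";
    }
}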

As of PHP 5.1, libxml options can be set using constants instead of the DomDocument-specific properties.

DomDocument->resolveExternals is equivalent to setting
LIBXML_DTDLOAD
LIBXML_DTDATTR

DomDocument->validateOnParse is equivalent to setting
LIBXML_DTDLOAD
LIBXML_DTDVALID

PHP 5.1 users are encouraged to use the new constants.
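As a rough illustration of the two styles, the sketch below loads the same document once with the resolveExternals property and once with the constants passed as the second argument of loadXML(). The sample document is made up and carries its DTD inline, so nothing has to be fetched from disk or the network.

```php
<?php
// Hypothetical sample: a tiny document with an inline DTD.
$xml = <<<'XML'
<?xml version="1.0"?>
<!DOCTYPE note [
  <!ELEMENT note (#PCDATA)>
  <!ATTLIST note lang CDATA "en">
]>
<note>Hello</note>
XML;

// Property style: set the flag on the object before parsing.
$viaProperty = new DOMDocument();
$viaProperty->resolveExternals = true;
$okProperty = $viaProperty->loadXML($xml);

// Constant style: pass the libxml options as the second argument.
$viaConstants = new DOMDocument();
$okConstants = $viaConstants->loadXML($xml, LIBXML_DTDLOAD | LIBXML_DTDATTR);

// Print the "lang" attribute as seen by each document.
echo $viaProperty->documentElement->getAttribute('lang'), "\n";
echo $viaConstants->documentElement->getAttribute('lang'), "\n";
```

Several options are combined with the bitwise OR operator, as in LIBXML_DTDLOAD | LIBXML_DTDATTR.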


Processing HTML using PHP’s DomDocument class

In the previous tutorial I briefly presented regular expressions in PHP. I received a number of comments that all said the same thing: parsing HTML with regular expressions is not good practice. As I answered one of the commenters, regular expressions are useful for parsing when the structure of the string is known in advance. In this tutorial I offer a complementary approach that uses DomDocument, a built-in PHP class that can parse HTML, find elements, and replace parts of the HTML without regular expressions.

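This tutorial consists of 4 case studies:

Case 1: How to automagically make responsive images from all the images in the page
Case 2: How to make responsive videos from all the Youtube videos in the page
Case 3: How to remove the style tags from all the HTML elements in the page
Case 4: How to automatically add rel = nofollow to all the links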

Joseph Benharosh is a full stack web developer and the author of the eBook The essentials of object oriented PHP.

# Case 1: How to automagically make responsive images from all the images in the page

To make the images responsive, we use DomDocument to wrap each image in a div that has the class 'responsive-img'.

function makeResposiveImages($html = '')
{
    // Create a DOMDocument
    $dom = new DOMDocument();

    // Load html including utf8, like Hebrew
    $dom->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));

    // Create the div wrapper
    $div = $dom->createElement('div');
    $div->setAttribute('class', 'responsive-img');

    // Get all the images
    $images = $dom->getElementsByTagName('img');

    // Loop the images
    foreach ($images as $image) {
        // Clone our created div
        $new_div_clone = $div->cloneNode();

        // Replace image with wrapper div
        $image->parentNode->replaceChild($new_div_clone, $image);

        // Append image to wrapper div
        $new_div_clone->appendChild($image);
    }

    // Save the HTML
    $html = $dom->saveHTML();

    return $html;
}

To use the DomDocument class, we first need to instantiate it:

$dom = new DOMDocument();

loadHTML() assumes ISO-8859-1 input unless the markup declares a charset, so for text in languages other than English (Hebrew, for example) we first convert the UTF-8 input to HTML entities:

$dom->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));
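One caveat: the 'HTML-ENTITIES' target of mb_convert_encoding() is deprecated as of PHP 8.2. A possible replacement, sketched here under the assumption that the input is valid UTF-8, is mb_encode_numericentity(). The helper name loadUtf8Html and the sample markup are hypothetical.

```php
<?php
// Hypothetical helper: load UTF-8 HTML without the 'HTML-ENTITIES' conversion.
function loadUtf8Html(string $html): DOMDocument
{
    // Turn every code point from U+0080 upwards into a numeric entity.
    // Conversion map: [first code point, last code point, offset, mask].
    $convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    $encoded = mb_encode_numericentity($html, $convmap, 'UTF-8');

    $dom = new DOMDocument();
    $dom->loadHTML($encoded);

    return $dom;
}

$dom = loadUtf8Html('<p>שלום עולם</p>');

// textContent returns the text of the element as UTF-8.
$text = $dom->getElementsByTagName('p')->item(0)->textContent;
echo $text, "\n";
```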

We create a wrapping div for the images that has the class ‘responsive-img’

$div = $dom->createElement('div');
$div->setAttribute('class', 'responsive-img');

In order to extract the images from the HTML:

$images = $dom->getElementsByTagName('img');

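Next, we loop through the images and wrap each one with a clone of the wrapping div:

foreach ($images as $image) {
    $new_div_clone = $div->cloneNode();
    $image->parentNode->replaceChild($new_div_clone, $image);
    $new_div_clone->appendChild($image);
}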

At the end, we save the changes with:

$html = $dom->saveHTML();

# Case 2: How to make responsive videos from all the Youtube videos in the page

The following code goes over the iframes embedded in a given HTML input (it assumes they are all Youtube embeds) and changes each one: it adds the autoplay parameter to the src URL, changes the dimensions of the video, and adds a class.

function betterEmbeddedYoutubeVideoes($html = '')
{
    $doc = new DOMDocument();
    $doc->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));

    $docVideos = $doc->getElementsByTagName('iframe');

    foreach ($docVideos as $docVideo) {
        // Get the 'src' attribute of the iframe
        $docVideoSrc = $docVideo->getAttribute('src');

        // Parse the src url
        $docVideoSrcParts = parse_url($docVideoSrc);

        // Add the autoplay parameter to the src. The parsed path already
        // starts with '/', so no extra slash is inserted. Any query string
        // that was already present is dropped.
        $newVideoSrc = $docVideoSrcParts['scheme'] . '://' . $docVideoSrcParts['host'] . $docVideoSrcParts['path'] . '?autoplay=1';

        // Set the source
        $docVideo->setAttribute('src', $newVideoSrc);

        // Set the dimensions
        $docVideo->setAttribute('height', '433');
        $docVideo->setAttribute('width', '719');

        // Set the class
        $docVideo->setAttribute('class', 'embed-responsive-item');
    }

    $html = $doc->saveHTML();

    return $html;
}
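The URL handling is the fragile part of this case: the src may already carry query parameters, and not every iframe is a Youtube embed. As a rough sketch of one way to deal with both, the hypothetical helper addAutoplay below merges autoplay=1 into the existing query string and leaves other URLs untouched. The host check (a simple substring match on "youtube") is an assumption made for illustration.

```php
<?php
// Hypothetical helper: add autoplay=1 to a Youtube embed URL while keeping
// any query parameters that are already there.
function addAutoplay(string $src): string
{
    $parts = parse_url($src);

    // Leave anything that is not an absolute Youtube URL untouched.
    if ($parts === false
        || !isset($parts['scheme'], $parts['host'])
        || stripos($parts['host'], 'youtube') === false) {
        return $src;
    }

    // Merge autoplay into the existing query parameters, if any.
    $query = [];
    parse_str($parts['query'] ?? '', $query);
    $query['autoplay'] = '1';

    // Rebuild the URL (port and fragment are ignored in this sketch).
    return $parts['scheme'] . '://' . $parts['host'] . ($parts['path'] ?? '')
        . '?' . http_build_query($query, '', '&');
}

echo addAutoplay('https://www.youtube.com/embed/abc123?rel=0'), "\n";
echo addAutoplay('https://example.com/player'), "\n";
```

Inside the loop of the function above, the result of such a helper could be passed to setAttribute('src', ...).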

# Case 3: How to remove the style tags from all the HTML elements in the page

My customers like to use WYSIWYG text-editing plugins that make data entry easier for them. Editing text this way inserts inline style attributes in all kinds of places, which makes the content of the page harder for search engines and screen readers to process. The problem is therefore twofold: it hurts the promotion of the site in search engines, and it hurts accessibility for people with disabilities who use screen readers to interact with the website.

I address the problem with the following function, which removes the style attributes using the DOMDocument class.

function stripStyleTags($html = '')
{
    $dom = new DOMDocument;
    $dom->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));

    $xpath = new DOMXPath($dom);

    // Find any element with the style attribute
    $nodes = $xpath->query('//*[@style]');

    // Loop the elements
    foreach ($nodes as $node) {
        // Remove style attribute
        $node->removeAttribute('style');
    }

    $html = $dom->saveHTML();

    return $html;
}
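Note that saveHTML() without arguments serializes a whole document, including the doctype and the html and body elements that loadHTML() adds around a fragment. For snippets coming out of a WYSIWYG editor it may be preferable to return only the content of the body. A minimal sketch, assuming ASCII input for brevity; the function name stripStylesFromFragment and the sample markup are hypothetical.

```php
<?php
// Hypothetical variant: remove style attributes and return only the
// fragment, without the surrounding doctype, <html> and <body>.
function stripStylesFromFragment(string $html): string
{
    $dom = new DOMDocument();
    $dom->loadHTML($html);

    // Remove the style attribute from every element that has one.
    $xpath = new DOMXPath($dom);
    foreach ($xpath->query('//*[@style]') as $node) {
        $node->removeAttribute('style');
    }

    // Serialize each child of <body> instead of the whole document.
    $body = $dom->getElementsByTagName('body')->item(0);
    $out = '';
    if ($body !== null) {
        foreach ($body->childNodes as $child) {
            $out .= $dom->saveHTML($child);
        }
    }

    return $out;
}

$clean = stripStylesFromFragment('<p style="color:red">Hi <b style="font-size:9px">there</b></p>');
echo $clean, "\n";
```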

# Case 4: How to automatically add rel = nofollow to all the links

The rel = nofollow attribute is added to the link tag and tells search engine spiders not to follow the link away from the page. It is customary to use this technique to avoid reducing the page's ranking in search results.

The code below automatically adds the attribute to all the links in a given HTML input.

function addRelNofollowToLinks($html)
{
    $dom = new DOMDocument;
    $dom->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));

    // Find any element which is a link
    $nodes = $dom->getElementsByTagName('a');

    // Loop the elements
    foreach ($nodes as $node) {
        // Add the rel attribute
        $node->setAttribute('rel', 'nofollow');
    }

    $html = $dom->saveHTML();

    return $html;
}
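setAttribute() overwrites whatever rel value a link already has, for example rel="noopener". If existing values should be kept, one option is to treat rel as a list of space-separated tokens and append nofollow only when it is missing. The sketch below illustrates that idea with ASCII input; the function name addNofollowKeepingRel and the sample links are hypothetical.

```php
<?php
// Hypothetical variant: add "nofollow" while keeping existing rel values.
function addNofollowKeepingRel(string $html): string
{
    $dom = new DOMDocument();
    $dom->loadHTML($html);

    foreach ($dom->getElementsByTagName('a') as $link) {
        // Split the current rel value into whitespace-separated tokens.
        $tokens = preg_split('/\s+/', $link->getAttribute('rel'), -1, PREG_SPLIT_NO_EMPTY);

        // Append nofollow only if it is not there yet.
        if (!in_array('nofollow', $tokens, true)) {
            $tokens[] = 'nofollow';
        }

        $link->setAttribute('rel', implode(' ', $tokens));
    }

    return $dom->saveHTML();
}

$out = addNofollowKeepingRel('<a href="/a" rel="noopener">A</a> <a href="/b">B</a>');
echo $out;
```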

Conclusion

Now that you know the basics of using the DomDocument class for parsing HTML, you can further invest in your professional skills and buy The essentials of object oriented PHP, the easiest eBook in the field to learn from.
