PHP: find all links in HTML

This page collects several approaches to extracting all links from an HTML page in PHP: DOMDocument, cURL combined with preg_match_all(), and the PHP Simple HTML DOM Parser for more advanced parsing.

This is an example of my code. It can scan only one web page and print all the links on that page.

I need it to recursively scan the entire site and print the links for every page of the website.

Here is an example of my class:

// NOTE: the opening of this snippet (class declaration and property list) was cut off; "LinkScanner" is a placeholder name
class LinkScanner
{
    private $sRootLink;
    private $iCountOfPages;
    private $iCounter = 0;
    private $cache = array();

    public function __construct($sRootLink, $iCountOfPages)
    {
        $this->sRootLink = $sRootLink;
        $this->iCountOfPages = $iCountOfPages;
    }

    public function getRootLink() { return $this->sRootLink; }
    public function getCountOfPages() { return $this->iCountOfPages; }
    public function setRootLink($sRootLink) { $this->sRootLink = $sRootLink; }
    public function setCountOfPages($iCountOfPages) { $this->iCountOfPages = $iCountOfPages; }

    public function getAllLinks()
    {
        $this->rec($this->sRootLink);
    }

    private function rec($link)
    {
        $this->cache[$link] = true;
        $html = file_get_contents($link);
        $DOM = new DOMDocument;
        @$DOM->loadHTML($html);
        $links = $DOM->getElementsByTagName('a');

        $sPatternURL = $this->sRootLink;
        foreach ($links as $element) {
            if ($this->iCounter == $this->iCountOfPages) {
                break;
            }
            if ($this->startsWith($element->getAttribute("href"), $sPatternURL)) {
                echo $element->getAttribute("href") . "<br>";
                $this->iCounter++;
                //$this->rec($element->getAttribute("href"));
            }
        }
    }

    private function startsWith($haystack, $needle)
    {
        // search backwards starting from haystack length characters from the end
        return $needle === "" || strrpos($haystack, $needle, -strlen($haystack)) !== false;
    }
}

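For context, the scanner above would be invoked roughly like this (a usage sketch mirroring the one shown for the ParseLinks class below; "LinkScanner" is just the placeholder name used in the reconstructed snippet, and the URL and page count are made up):

$scanner = new LinkScanner('https://yoursite.com/', 100);
$scanner->getAllLinks();
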
If anyone needs it, here is my version; it works correctly. Here is an example of my class: pass the site URL and the number of links to collect to the constructor.

// NOTE: the opening of this snippet (class declaration and property list) was cut off and is reconstructed here
class ParseLinks
{
    private $sRootLink;
    private $iCountOfPages;
    private $iDeep;
    private $sDomain;
    private $sScheme;
    private $linkArray = array();

    public function __construct($sRootLink, $iCountOfPages)
    {
        $this->sRootLink = $sRootLink;
        $this->iCountOfPages = $iCountOfPages;
        $this->iDeep = 0;
        $this->sDomain = "";
        $this->sScheme = "";
    }

    public function getAllLinks()
    {
        $this->recParseLinks($this->sRootLink);
        $this->printLinks();
        $this->saveToCSV();
    }

    private function printLinks()
    {
        echo "Web-site: www." . $this->sDomain . "<br>";
        echo "Count of links: " . count($this->linkArray) . "<br><br>";
        foreach ($this->linkArray as $element) {
            echo $element . "<br>";
        }
    }

    private function saveToCSV()
    {
        $fp = fopen("allLinksFromYourSite.csv", "w");
        fwrite($fp, "Web-site: $this->sDomain" . PHP_EOL);
        fwrite($fp, "Count of links: " . count($this->linkArray) . PHP_EOL . PHP_EOL);
        foreach ($this->linkArray as $element) {
            fwrite($fp, $element . PHP_EOL);
        }
        fclose($fp);
    }

    private function recParseLinks($link)
    {
        if (strlen($link) < 2) {
            return;
        }
        if ($this->iDeep == 0) {
            $d = parse_url($link);
            if ($d != false) {
                $this->sDomain = $d['host'];
                $this->sScheme = $d['scheme'];
            } else {
                return;
            }
        }
        $this->iDeep++;

        $doc = new DOMDocument();
        $doc->loadHTML(file_get_contents($link));
        $elements = $doc->getElementsByTagName('a');

        foreach ($elements as $element) {
            if (count($this->linkArray) >= $this->iCountOfPages) {
                return;
            }
            $links = $element->getAttribute('href');
            if ($links[0] == '/' || $links[0] == '?') {
                $links = $this->sScheme . "://" . $this->sDomain . $links;
            }
            $p_links = parse_url($links);
            if ($p_links == FALSE) {
                continue;
            }
            if ($p_links["host"] != $this->sDomain) {
                continue;
            }
            if (!$this->linkExists($links) && strlen($links) > 1) {
                $this->linkArray[] = $links;
                if ($this->iDeep < 4) {
                    $this->recParseLinks($links);
                }
            }
        }
        $this->iDeep--;
    }

    private function linkExists($link)
    {
        foreach ($this->linkArray as $element) {
            if ($element == $link) {
                return true;
            }
        }
        return false;
    }
}

$parseLinksObject = new ParseLinks('https://yoursite.com/', 3000);
$parseLinksObject->getAllLinks();



I’m trying to get all the link URLs of the news items inside a certain div on this web page.

When I view the page source to get the links, there is nothing there.

Yet the data is displayed on the page.

Could anyone who understands PHP, arrays and JS help me, please?

This is my code to get the content:

$html = file_get_contents("https://qc.yahoo.com/");
if ($html === FALSE) {
    die("?");
}
echo $html;
$html = new DOMDocument();
@$html->loadHtmlFile('https://qc.yahoo.com/');
$xpath = new DOMXPath($html);
$nodelist = $xpath->query("//div[@id='news_moreTopStories']//a/@href");
foreach ($nodelist as $n) {
    echo $n->nodeValue . "\n";
}

You can get all the links from the divs you specify. Make sure you put the div's id into the query (here id='news_moreTopStories'). You're using XPath to query the divs, so you don't need a ton of code, just this portion.

Assuming you want to extract all anchor tags with their hyperlinks from the given page.

Now there are certain problems with doing file_get_contents on that URL:

  1. The page is served gzip-compressed, so the raw response is not plain HTML.
  2. It is an HTTPS URL, so SSL certificate verification can fail depending on your PHP configuration.

So, to overcome the first problem of gzip content encoding, we'll use cURL as @gregn3 suggested in his answer. But he missed using cURL's ability to automatically decompress gzipped content.

For the second problem, you can either follow this guide or disable SSL verification via cURL's curl_setopt() options.
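
For illustration, here is a minimal sketch (not the answerer's exact code) of a cURL fetch that lets cURL decompress gzip automatically via CURLOPT_ENCODING and skips SSL verification; the URL is the one from the question and the variable names are placeholders:

$url = "https://qc.yahoo.com/";

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the body instead of printing it
curl_setopt($ch, CURLOPT_ENCODING, "");          // "" = accept all supported encodings; cURL decompresses gzip for us
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false); // quick workaround for SSL verification problems
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // follow redirects
$content = curl_exec($ch);
curl_close($ch);

// extract every href with the same regex used later in this answer
preg_match_all("/href=\"([^\"]+)\"/i", $content, $matches);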

Now the code that prints out all the links extracted from the given page is:

echo "url = " . htmlspecialchars($url) . "<br>";
echo "links found (" . count($matches[1]) . "):" . "<br>";
$n = 0;
foreach ($matches[1] as $link) {
    $n++;
    echo "$n: " . htmlspecialchars($link) . "<br>";
}

But if you want to do advanced HTML parsing, then you'll need to use the PHP Simple HTML DOM Parser. In PHP Simple HTML DOM you can select the div using jQuery-style selectors and fetch the anchor tags. Here are its documentation & API manual.
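
As a rough sketch of that approach (this assumes the library's simple_html_dom.php file is available to include; the div id is the one from the question above):

include 'simple_html_dom.php'; // PHP Simple HTML DOM Parser

// load the page and select the anchors inside the target div with a jQuery-style selector
$html = file_get_html('https://qc.yahoo.com/');
foreach ($html->find('div#news_moreTopStories a') as $anchor) {
    echo $anchor->href . "<br>";
}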

To find all links in HTML you could use preg_match_all().

$links = preg_match_all ("/href=\"([^\"]+)\"/i", $content, $matches); 

That URL https://qc.yahoo.com/ uses gzip compression, so you have to detect that and decompress it using the function gzdecode(). (It must be available in your PHP version.)

The gzip compression is indicated by the Content-Encoding: gzip HTTP header. You have to check that header, so you must use cURL or a similar method to retrieve the headers. (file_get_contents() will not give you the HTTP headers; it only downloads the gzip-compressed content. You need to detect that it is compressed, but for that you need to read the headers.)

Here is a complete example:

# fetch the page with cURL so that both the response headers and the body are available
# (this fetch portion is an assumed reconstruction; the original example used cURL as described above)
$url = "https://qc.yahoo.com/";

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
$response = curl_exec($ch);
$header_size = curl_getinfo($ch, CURLINFO_HEADER_SIZE);
curl_close($ch);

$headers = explode("\r\n", substr($response, 0, $header_size));
$content = substr($response, $header_size);

# check the headers for "Content-Encoding: gzip"
$gzip = 0;
foreach ($headers as $header) {
    $pieces = explode(":", $header);
    $pieces2 = (count($pieces) > 1);
    $enc = $pieces2 && (preg_match("/content-encoding/i", $pieces[0]));
    $gz = $pieces2 && (preg_match("/gzip/i", $pieces[1]));
    if ($enc && $gz) {
        $gzip = 1;
        break;
    }
}

# unzip content if gzipped
if ($gzip) {
    $content = gzdecode($content);
}

# find links
$links = preg_match_all("/href=\"([^\"]+)\"/i", $content, $matches);

# output results
echo "url = " . htmlspecialchars($url) . "<br>";
echo "links found (" . count($matches[1]) . "):" . "<br>";
$n = 0;
foreach ($matches[1] as $link) {
    $n++;
    echo "$n: " . htmlspecialchars($link) . "<br>";
}


I need to find the links in a part of some HTML code and replace each link with one of two different absolute/base domains followed by the link path from the page.

I have found a lot of ideas and tried a lot of different solutions, but luck isn't on my side on this one. Please help me out! Thank you!

// NOTE: the beginning of this snippet was cut off; $content is assumed to already hold the fetched page HTML
$start = strpos($content, '<table');
$end = strpos($content, '</table>', $start) + 8;
$table = substr($content, $start, $end - $start);
echo "";
$dom = new DOMDocument();
$dom->loadHTML($table);
$dom->strictErrorChecking = FALSE;
// Get all the links
$links = $dom->getElementsByTagName("a");
foreach ($links as $link) {
    $href = $link->getAttribute("href");
    echo "";
    if (strpos("http://oxfordreference.com", $href) == -1) {
        if (strpos("/views/", $href) == -1) {
            $ref = "http://oxfordreference.com/views/" + $href;
        } else {
            $ref = "http://oxfordreference.com" + $href;
        }
        $link->setAttribute("href", $ref);
        echo $link->getAttribute("href");
    }
}
$table12 = $dom->saveHTML;
// NOTE: the regex patterns below were partly lost; the <tr>/<td> patterns are assumed from the variable names
preg_match_all("|<tr(.*)</tr>|U", $table12, $rows);
echo "";
foreach ($rows[0] as $row) {
    if (strpos($row, '<th') === FALSE) {
        preg_match_all("|<td(.*)</td>|U", $row, $cells);
        echo "";
    }
}
?>

When I run this code I get an htmlParseEntityRef: expecting ';' warning on the line where I load the HTML.

var links = document.getElementsByTagName("a"); will get you all the links, and you can then loop through them.

You should use jQuery; it is excellent for link replacement. Rather than explaining it here, please look at this answer.

I recommend scrappedcola's answer, but if you don't want to do it on the client side, you can use a regex to replace the links:

ob_start();
// ... your HTML ...
// end of the page
$body = ob_get_clean();
$body = preg_replace("/<a[^>]*href=(\"[^\"]*\")/", "NewURL", $body);
echo $body;

You can use a backreference (\$1) or the callback version (preg_replace_callback) to modify the output however you like.
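
A minimal sketch of the callback variant, assuming you want to prefix every captured href with a base domain (the pattern mirrors the one above; the domain is a placeholder):

$body = preg_replace_callback(
    "/<a[^>]*href=\"([^\"]*)\"/",
    function ($m) {
        // $m[0] is the whole match, $m[1] is the captured href value
        $newHref = "http://example.com" . $m[1]; // placeholder base domain
        return str_replace($m[1], $newHref, $m[0]);
    },
    $body
);
echo $body;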



This is a PHP tutorial on how to extract all links and their anchor text from an HTML string. In this guide, I will show you how to fetch the HTML content of a web page and then extract the links from it. To do this, we will be using PHP’s DOMDocument class.

Let’s jump right in and take a look at a simple example:

//Get the page's HTML source using file_get_contents. $html = file_get_contents('https://en.wikipedia.org'); //Instantiate the DOMDocument class. $htmlDom = new DOMDocument; //Parse the HTML of the page using DOMDocument::loadHTML @$htmlDom->loadHTML($html); //Extract the links from the HTML. $links = $htmlDom->getElementsByTagName('a'); //Array that will contain our extracted links. $extractedLinks = array(); //Loop through the DOMNodeList. //We can do this because the DOMNodeList object is traversable. foreach($links as $link)< //Get the link text. $linkText = $link->nodeValue; //Get the link in the href attribute. $linkHref = $link->getAttribute('href'); //If the link is empty, skip it and don't //add it to our $extractedLinks array if(strlen(trim($linkHref)) == 0) < continue; >//Skip if it is a hashtag / anchor link. if($linkHref[0] == '#') < continue; >//Add the link to our $extractedLinks array. $extractedLinks[] = array( 'text' => $linkText, 'href' => $linkHref ); > //var_dump the array for example purposes var_dump($extractedLinks);
  1. We sent a GET request to a given web page using PHP’s file_get_contents function. This function will return the HTML source of the URL as a string.
  2. We instantiated the DOMDocument class.
  3. In order to load the HTML string into our newly-created DOMDocument object, we used the DOMDocument::loadHTML function.
  4. After that, we used the getElementsByTagName function to search our HTML for all “a” elements. As I’m sure you already know, the a tag is used to define a hyperlink. Note that this function will return a traversable DOMNodeList object.
  5. We created an empty array called $extractedLinks, which will be used to neatly package all our retrieved links.
  6. Because the DOMNodeList object is traversable, we are able to loop through each a element using a foreach loop.
  7. Inside our foreach loop, we retrieved the link text using the nodeValue property. To retrieve the actual link itself, we used the getAttribute function to retrieve the href HTML attribute.
  8. If the link is blank or starts with a hashtag / anchor link, we skip it by using the continue statement.
  9. Finally, we store the link’s details in our $extractedLinks array.

If you run the PHP above, the script will dump out an array of all links that were found on the Wikipedia homepage. Note that these links can be relative or absolute.
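
If you need absolute URLs, a rough sketch of the same scheme-plus-host trick used by the crawler class earlier in this article might look like this (makeAbsolute is a hypothetical helper, the base URL is a placeholder, and only root-relative links are handled properly):

// hypothetical helper: turn a relative href into an absolute URL
// by prepending the scheme and host, as the ParseLinks class above does
function makeAbsolute($href, $baseUrl = 'https://en.wikipedia.org')
{
    // already absolute? leave it alone
    if (preg_match('/^https?:\/\//i', $href)) {
        return $href;
    }
    $parts = parse_url($baseUrl);
    $prefix = $parts['scheme'] . '://' . $parts['host'];
    // root-relative links simply get the scheme and host prepended
    return $prefix . '/' . ltrim($href, '/');
}

echo makeAbsolute('/wiki/PHP'); // https://en.wikipedia.org/wiki/PHP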

