
How to download HTML using PHP?

How do I download an HTML file from a URL in PHP, and download all of the dependencies like CSS and Images and store these to my server as files? Am I asking for too much?

7 Answers

The easiest way to do this would be to use wget. It can recursively download HTML and its dependencies. Otherwise you will be parsing the HTML yourself. See Yacoby’s answer for details on doing it in pure PHP.

I would recommend using an HTML parsing library to simplify everything, such as Simple HTML DOM.

$html = file_get_html('http://www.google.com/');
foreach ($html->find('img') as $element) {
    // download image
}
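To make the "find the images" step concrete, here is a minimal sketch that collects image URLs from a page. It is an assumption-laden variant: it uses PHP's built-in DOMDocument rather than Simple HTML DOM, and `extract_image_urls()` is a hypothetical helper name, not part of any library.

```php
<?php
// Sketch only: built-in DOMDocument is used here in place of Simple HTML DOM.
// extract_image_urls() is a hypothetical helper name.
function extract_image_urls(string $html): array
{
    if (trim($html) === '') {
        return []; // nothing to parse
    }

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $urls = [];
    foreach ($doc->getElementsByTagName('img') as $img) {
        $src = trim($img->getAttribute('src'));
        if ($src !== '') {
            $urls[] = $src;
        }
    }

    // Remove duplicates while keeping the original order.
    return array_values(array_unique($urls));
}
```

The returned values are whatever the page put in `src`, so relative paths still have to be resolved against the page URL before each file can be downloaded and stored.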

For downloading files (and HTML) I would recommend using an HTTP wrapper such as cURL, as it allows far more control than file_get_contents. However, if you want to use file_get_contents, there are some good examples on the PHP site of how to fetch URLs.

The more complex method allows you to specify headers, which is useful if you want to set the User-Agent. (If you are scraping other sites a lot, it is good to have a custom user agent, as you can use it to tell the website admin your site or point of contact in case you are using too much bandwidth, which is better than the admin blocking your IP address.)

$opts = array(
    'http' => array(
        'method' => "GET",
        'header' => "Accept-language: en\r\n"
    )
);
$context = stream_context_create($opts);
$file = file_get_contents('http://www.example.com/', false, $context);

Although of course it can be done simply by:

$file = file_get_contents('http://www.example.com/'); 
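For comparison, a cURL version of the same request might look like the sketch below. The function name `fetch_url()`, the timeout, the redirect limit, and the example user agent string are all assumptions made for illustration; only the cURL calls themselves are standard.

```php
<?php
// Sketch only: fetch_url() is a hypothetical helper, and the defaults are assumptions.
function fetch_url(string $url, string $userAgent, int $timeoutSeconds = 15): ?string
{
    $ch = curl_init($url);
    if ($ch === false) {
        return null;
    }

    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,          // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,          // follow redirects
        CURLOPT_MAXREDIRS      => 5,
        CURLOPT_TIMEOUT        => $timeoutSeconds,
        CURLOPT_USERAGENT      => $userAgent,    // identify yourself to the site admin
        CURLOPT_HTTPHEADER     => ['Accept-Language: en'],
    ]);

    $body = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);

    // Treat transport errors and HTTP error statuses as failure.
    if ($body === false || $status >= 400) {
        return null;
    }
    return $body;
}

// Example call (the user agent value is a placeholder):
// $html = fetch_url('http://www.example.com/', 'MyScraper/1.0 (+http://www.example.com/contact)');
```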


Create and save a pure HTML file from a PHP ‘template’?

I am looking for a simple and effective way to create a pure HTML file based on a PHP file. For instance, in template.php below, PHP inserts various portions of the page. I want to save the page as HTML, removing all PHP code and leaving what was inserted by it. In other words, I want to save the output of template.php.

First, I do not know if something like this is possible. Second, is this the best way to go about it?

Before anyone starts screaming about security: there will be ZERO user-submitted or form-submitted variables in this page. My goal is to create a report from database values with the template, which the user can then view, print, or save off the server as pure HTML. There will be no images, only inline CSS.

EDIT: This HTML-only output of template.php needs to be saved on the server as its own file. The reason for the PHP ‘template’ is that I will be creating the vast majority of the page with PHP, but I only want to save its resulting output.

template.php:

      " name="description" />   . further html with php mixed in 

Current solution : I did some further research and this is acting exactly how I want it to. Comments/suggestions welcome for it.

Most browsers let you save a page to an html file. Could you just view the page in your browser, right click and do ‘save as’?

You will find everything you need in PHP.net’s filesystem reference: php.net/manual/en/ref.filesystem.php
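One way to get the output of a template onto disk is to buffer the include and write the buffer to a file. The following is a minimal sketch under stated assumptions: `render_to_file()` is a hypothetical helper name, the template is assumed to read its data from a `$vars` array, and both paths are chosen by the caller.

```php
<?php
// Sketch only: render_to_file() and the $vars convention are assumptions.
function render_to_file(string $templatePath, string $outputPath, array $vars = []): int
{
    ob_start();                       // capture everything the template prints
    try {
        include $templatePath;        // the template can read $vars
        $html = (string) ob_get_clean();
    } catch (Throwable $e) {
        ob_end_clean();               // discard partial output on failure
        throw $e;
    }

    $bytes = file_put_contents($outputPath, $html, LOCK_EX);
    if ($bytes === false) {
        throw new RuntimeException("Could not write $outputPath");
    }
    return $bytes;                    // number of bytes written
}

// Example call (both paths are placeholders):
// render_to_file(__DIR__ . '/template.php', __DIR__ . '/report.html', ['title' => 'Report']);
```

Because only the captured output is written, the saved file contains no PHP code, just the HTML the template produced.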


Save current page as HTML to server

What approach could someone suggest to save the current page as an HTML file to the server? In this case, also note that security is not an issue. I have spent endless hours searching around for this, and have not found a single thing. Your help is much appreciated, thank you!

Edit: Thank you all for your help, it was very much appreciated.

You asked the same question just 14 hours ago: [Saving Source of Self (PHP)] (stackoverflow.com/questions/3769504/saving-source-of-self-php). Why don’t you try to get an answer there instead of posting it again?

oezi — Well, it looks like I was right about creating this new question. I got a correct answer. New question, different people, different answers. You can go ahead and close this question now.

These questions aren’t exactly the same. They are very similar but differ in the target: here it’s the server, in the other question it’s the browser itself.

6 Answers

If you meant saving the output of a page in a file, you can use output buffering to do that. The functions you need are ob_start and ob_get_contents.

 Your page content bla bla bla bla . 

This will save the content of the page in the file yourpage.html .
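A minimal sketch of this buffering approach follows. The page content is a placeholder, and the temporary-directory location is an assumption made so the example does not depend on the current working directory.

```php
<?php
// Sketch only: the output location and the page content are placeholders.
$outputFile = sys_get_temp_dir() . '/yourpage.html';

ob_start();                           // start capturing output

echo "<html><body><p>Your page content</p></body></html>\n";

$html = ob_get_contents();            // copy the buffer; the buffer itself is kept
ob_end_flush();                       // still send the page to the browser as usual

file_put_contents($outputFile, $html);
```

Using an absolute path makes it explicit where the file ends up; a bare file name such as yourpage.html is resolved against the current working directory.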

I also had the same question. Thanks for the answer. BUT where is this yourpage.html being saved? I couldn’t find the file in my directory folder

@Walahh Unless chdir has been called, it should be in the same folder as the requested script. If you are unsure which folder that is, you can call getcwd.

I think we can use PHP’s Output Control functions: save the content to a variable first, then write it to a new file. Next time, test whether the HTML file exists; if it does, render it, otherwise regenerate the page.

 time()) ) {
    $content = file_get_contents($cacheFile);
    echo $content;
} else {
    ob_start();
    // write content
    echo 'Hello world to cache';
    $content = ob_get_contents();
    ob_end_clean();
    file_put_contents($cacheFile, $content);
    echo $content;
}
?>
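The opening condition of the snippet above is cut off, so the following is only a sketch of the same idea under assumptions: freshness is judged by the file's modification time, `cached_page()` is a hypothetical helper name, and the lifetime is passed in by the caller.

```php
<?php
// Sketch only: cached_page() and the mtime-based freshness rule are assumptions.
function cached_page(string $cacheFile, int $maxAgeSeconds, callable $render): string
{
    clearstatcache(true, $cacheFile); // avoid stale file metadata

    // Serve the cached copy while it is still fresh.
    if (is_file($cacheFile) && filemtime($cacheFile) + $maxAgeSeconds > time()) {
        return (string) file_get_contents($cacheFile);
    }

    // Otherwise regenerate the page, capture it, and store it for next time.
    ob_start();
    $render();
    $content = (string) ob_get_clean();
    file_put_contents($cacheFile, $content, LOCK_EX);

    return $content;
}

// Example call (the path and lifetime are placeholders):
// echo cached_page(__DIR__ . '/cache.html', 600, function () { echo 'Hello world to cache'; });
```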

Use JavaScript to send document.getElementsByTagName('html')[0].innerHTML to the server side, either as a hidden input value or via Ajax. This is more useful than output buffering if the content is afterwards traversed or modified by JavaScript, which the server side might not have any notion about.

Thanks, BalusC. So if I use var $s = document.getElements. (in php) I can then write the whole var to a file on the server?

JavaScript runs in the web browser, not on the web server. Do you know JS? Anyway, given your comment I think this answer is after all not what you need 🙂 You probably want to save the immediate PHP-generated HTML page, not the currently opened HTML page (in all its current client-side state). Check Holyvier’s answer.

In case you are looking to save complete html page along with css, images and scripts in a single html file, you can use this class I have written:

This class can save HTML pages complete with images, CSS and JavaScript.

It takes the URL of a given page and retrieves it to store in a given file.

The class can parse the HTML and determine which images, CSS and JavaScript files it needs, so those files are also downloaded and saved inside the HTML page saved to a local file.

Optionally it can skip the JavaScript code, keep only the page content, and compress the resulting page removing the whitespace.


DOMDocument::saveHTMLFile

Creates an HTML document from the DOM representation. This function is usually called after building a new DOM document from scratch, as in the example below.

Parameters

filename
The path to the saved HTML document.

Return Values

Returns the number of bytes written or false if an error occurred.

Examples

Example #1 Saving a HTML tree into a file

<?php
$doc = new DOMDocument('1.0');
// we want a nice output
$doc->formatOutput = true;

$root = $doc->createElement('html');
$root = $doc->appendChild($root);

$head = $doc->createElement('head');
$head = $root->appendChild($head);

$title = $doc->createElement('title');
$title = $head->appendChild($title);

$text = $doc->createTextNode('This is the title');
$text = $title->appendChild($text);

echo 'Wrote: ' . $doc->saveHTMLFile("/tmp/test.html") . ' bytes'; // Wrote: 129 bytes

See Also

  • DOMDocument::saveHTML() — Dumps the internal document into a string using HTML formatting
  • DOMDocument::loadHTML() — Load HTML from a string
  • DOMDocument::loadHTMLFile() — Load HTML from a file

User Contributed Notes

saveHTMLFile() always saves the file in UTF-8, even if DOMDocument->encoding explicitly specifies a different encoding. All non-Latin characters are converted to HTML entities. Tested in PHP 5.2.9-2 and PHP 5.2.17. Example:

<?php
$document = new DOMDocument('1.0', 'WINDOWS-1251');
$document->loadHTML('Русский язык');
$document->formatOutput = true;
$document->encoding = 'WINDOWS-1251';
echo "Записано байт. Recorded bytes: " . $document->saveHTMLFile('html.html');
?>

The method wrote the file in UTF-8 encoding. The contents of the file html.html:

Not mentioned in the documentation is the fact that using DOMDocument::saveHTMLFile() will automatically overwrite the contents if an existing file is used — with no notice, warning or error thrown.

Make sure you check the filename before using this function so that you don’t accidentally overwrite important files.

<?php
$file = fopen('test.html', 'w');
fwrite($file, 'this is some text');
fclose($file);

$doc = new DOMDocument();
$doc->formatOutput = true;
$doc->loadHTML(' ');
$doc->saveHTMLFile('test.html');

?>

If you’re dynamically generating a series of pages using DOMDocument objects, make sure you are also dynamically generating the file or directory names using something that can’t easily be confused for an existing file/folder, or check if the desired path already exists before saving so that you don’t accidentally delete previous files.
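The existence check suggested above could be wrapped in a small helper. This is a sketch, not part of the DOM extension: `save_html_without_overwrite()` is a hypothetical name, and the check is not race-free if several processes write to the same directory at once.

```php
<?php
// Sketch only: save_html_without_overwrite() is a hypothetical helper.
function save_html_without_overwrite(DOMDocument $doc, string $path): int
{
    clearstatcache(true, $path);      // avoid stale file metadata

    // saveHTMLFile() silently overwrites, so refuse up front.
    if (file_exists($path)) {
        throw new RuntimeException("Refusing to overwrite existing file: $path");
    }

    $bytes = $doc->saveHTMLFile($path);
    if ($bytes === false) {
        throw new RuntimeException("Could not write HTML to: $path");
    }
    return $bytes;                    // number of bytes written
}
```

A caller that generates many pages can catch the exception and pick a different file name instead of losing the earlier file.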

I foolishly assumed that this function was equivalent to

<?php
file_put_contents($filename, $document->saveHTML());
?>

but there are differences in the generated HTML:

<?php
$doc = new DOMDocument();
$doc->loadHTML(
    ' '
);
$doc->encoding = 'iso-8859-1';

?>
Note that saveHTMLFile() adds a UTF-8 meta tag despite the ISO-8859-1 document encoding.
