Get page contents javascript

Contents

Get HTML code using JavaScript with a URL
7 Answers
get web page text via javascript [closed]
4 Answers
How to get the entire document HTML as a string?
17 Answers
Is there a way to get all text from the rendered page with JS?
3 Answers
Update:

Get HTML code using JavaScript with a URL

I am trying to get the HTML source code of a page by using an XMLHttpRequest with a URL. How can I do that? I am new to programming and I am not too sure how I can do it without jQuery.

You may want to look into the problem of the same origin policy. Just search on SO and you will find tons of info.

But is there any other way of going about this? Like not using XMLHttpRequest, with just JavaScript?

No. XMLHttpRequest and iframes are the only way, and both are limited by the same-origin policy. If you want to get around this, the remote server needs to cooperate (by serving the data as JSONP, or by sending a special header such as Access-Control-Allow-Origin with the data it serves).

7 Answers

Without jQuery (just JavaScript):

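    function makeHttpObject() {
      try { return new XMLHttpRequest(); } catch (error) {}
      // Fallbacks for old versions of Internet Explorer
      try { return new ActiveXObject("Msxml2.XMLHTTP"); } catch (error) {}
      try { return new ActiveXObject("Microsoft.XMLHTTP"); } catch (error) {}
      throw new Error("Could not create HTTP request object.");
    }

    var request = makeHttpObject();
    request.open("GET", "your_url", true);
    // Attach the handler before sending so no state change is missed
    request.onreadystatechange = function () {
      if (request.readyState == 4) alert(request.responseText);
    };
    request.send(null);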

@Senad Meskin thanks for your answer, but is it possible to do it with jQuery? I was wondering if there are other methods to do it.

No, it's not possible. The only thing you can do is call your own URL and, in the server-side code, request www.google.com and write the content of google.com to the response.
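The server-side relay described in that comment could be sketched roughly as follows in Node. This is only an illustration under stated assumptions: createProxyHandler is a hypothetical helper name, the port and target URL in the usage note are placeholders, and a global fetch (Node 18 or newer) is assumed unless another implementation is passed in.

```javascript
// Hypothetical helper (not from the thread): builds a Node-style request
// handler that fetches targetUrl on the server and relays the body.
// fetchImpl defaults to the global fetch, assumed to exist (Node 18+).
function createProxyHandler(targetUrl, fetchImpl = fetch) {
  return async function handler(req, res) {
    try {
      const upstream = await fetchImpl(targetUrl);
      const body = await upstream.text();
      // The page and this endpoint share an origin, so the browser
      // will let the page's own script read this response.
      res.writeHead(upstream.status, { 'Content-Type': 'text/html; charset=utf-8' });
      res.end(body);
    } catch (err) {
      // The remote site could not be reached from the server.
      res.writeHead(502, { 'Content-Type': 'text/plain; charset=utf-8' });
      res.end('Upstream request failed');
    }
  };
}

// Usage with Node's built-in http module (port 8080 is arbitrary):
// const http = require('http');
// http.createServer(createProxyHandler('https://www.google.com')).listen(8080);
```

The browser then requests this endpoint on its own origin with XMLHttpRequest or fetch, and the same-origin restriction no longer applies to that request.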

    fetch('some_url')
      .then(function (response) {
        switch (response.status) {
          // status "OK"
          case 200:
            return response.text();
          // status "Not Found"
          case 404:
            throw response;
        }
      })
      .then(function (template) {
        console.log(template);
      })
      .catch(function (response) {
        // "Not Found"
        console.log(response.statusText);
      });

Asynchronous with arrow function version:
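A minimal sketch of what such a version might look like, using async/await inside an arrow function. The name fetchPageText and the optional fetchImpl parameter are assumptions made for illustration, and 'some_url' is the same placeholder as in the snippet above.

```javascript
// Hypothetical helper: resolves with the page text, or rejects with an
// Error describing the HTTP status when the response is not 2xx.
const fetchPageText = async (url, fetchImpl = fetch) => {
  const response = await fetchImpl(url);
  if (!response.ok) {
    throw new Error('Request failed: ' + response.status + ' ' + response.statusText);
  }
  return response.text();
};

// Usage:
// fetchPageText('some_url')
//   .then((html) => console.log(html))
//   .catch((err) => console.error(err.message));
```

Unlike the switch above, this treats every non-2xx status as an error rather than only 404. The same-origin restrictions discussed earlier still apply to the request itself.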

get web page text via javascript [closed]

It’s difficult to tell what is being asked here. This question is ambiguous, vague, incomplete, overly broad, or rhetorical and cannot be reasonably answered in its current form. For help clarifying this question so that it can be reopened, visit the help center.

4 Answers

You could do it with Ranges / TextRanges. This has the advantage of only getting the visible text on the page (unlike, for example, the textContent property of elements in non-IE browsers, which will also get you the contents of <script> and possibly other elements). The following will work in all mainstream browsers, although I can’t make any guarantees about the consistency of line breaks between different browsers.

UPDATE November 2012

I don’t think this is a good idea these days. While Selection is now specified, its toString() method is not, and for some time (including when Microsoft were implementing it for IE 9) it was specified to behave like textContent . For this particular method, browser consistency has got worse rather than better since 2009.

    function getBodyText(win) {
      var doc = win.document, body = doc.body, selection, range, bodyText;
      if (body.createTextRange) {
        return body.createTextRange().text;
      } else if (win.getSelection) {
        selection = win.getSelection();
        range = doc.createRange();
        range.selectNodeContents(body);
        selection.addRange(range);
        bodyText = selection.toString();
        selection.removeAllRanges();
        return bodyText;
      }
    }

    alert(getBodyText(window));


How to get the entire document HTML as a string?

Stop upvoting John’s bolded comment! The answer he links to replaces && with &amp;&amp; and so it breaks all your inline <script> tags! You should use document.documentElement.outerHTML instead, but note that it doesn’t grab the <!DOCTYPE html> declaration, so you’ll need to add that yourself.

17 Answers

Get the root element with document.documentElement, then get its .innerHTML:

const txt = document.documentElement.innerHTML; alert(txt);

or its .outerHTML to get the <html> tag as well:

const txt = document.documentElement.outerHTML; alert(txt);

worked like a charm! thank you! is there any way to get the size of any/all files linked to the document as well including js and css files?

@CMCDragonkai: You could get the doctype separately and prepend it to the markup string. Not ideal, I know, but possible.

Note that none of these answers necessarily gives you content that is the exact hash equivalent of saving the page to a file or of the file generated by view-source. It seems the DOM normalizes some fields from the literal response content, such as capitalising DOCTYPE headers.

new XMLSerializer().serializeToString(document)

in browsers newer than IE 9

This was the first correct answer according to date/time stamps. Parts of the page such as the XML declaration will not be included and browsers will manipulate the code when using the other "answers". This is the only post that should be up-voted (dos’s posted three days later). People need to pay attention!

This is not entirely correct, since serializeToString performs an HTML encode. For example, if your code contains styles defining fonts such as "Times New Roman", Times, serif, the quotes will get HTML-encoded. Perhaps that is not important to some of you, but to me it is.

@John well, the OP actually asks for "the entire HTML within the html tags". And the selected best answer by Colin Burnett does achieve this. This particular answer (Erik’s) will include the html tags and the doctype. That said, this was totally a diamond in the rough for me and exactly what I was looking for! Your comment helped too because it made me spend more time with this answer, so thanks 🙂

I think people should be careful with this one, specifically because it returns a value that is not the actual html that your browser receives. In my case, it added attributes to the html tag that the server never actually sent 🙁

I tried the various answers to see what is returned. I’m using the latest version of Chrome.

The suggestion document.documentElement.innerHTML; returned <head> ... </body>.

Gaby’s suggestion document.getElementsByTagName('html')[0].innerHTML; returned the same.

The suggestion document.documentElement.outerHTML; returned <html><head> ... </body></html>, which is everything apart from the ‘doctype’.

You can retrieve the doctype object with document.doctype; This returns an object, not a string, so if you need to extract the details as strings for all doctypes up to and including HTML5 it is described here: Get DocType of an HTML as string with Javascript

I only wanted HTML5, so the following was enough for me to create the whole document:

    alert('<!DOCTYPE html>' + '\n' + document.documentElement.outerHTML);
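For doctypes other than HTML5, the object returned by document.doctype can be turned into a string by hand from its name, publicId and systemId properties. The following is a rough sketch of that idea, not the code from the linked answer. The function names doctypeToString and getFullHtml are made up for illustration.

```javascript
// Builds the <!DOCTYPE ...> declaration from a DocumentType-like object
// (anything with name, publicId and systemId string properties).
function doctypeToString(doctype) {
  if (!doctype) return '';
  let result = '<!DOCTYPE ' + doctype.name;
  if (doctype.publicId) {
    result += ' PUBLIC "' + doctype.publicId + '"';
  } else if (doctype.systemId) {
    result += ' SYSTEM';
  }
  if (doctype.systemId) {
    result += ' "' + doctype.systemId + '"';
  }
  return result + '>';
}

// Prepends the doctype (when there is one) to the serialized <html> element.
function getFullHtml(doc) {
  const prefix = doctypeToString(doc.doctype);
  return (prefix ? prefix + '\n' : '') + doc.documentElement.outerHTML;
}

// In a browser: console.log(getFullHtml(document));
```

Passing the document in as a parameter keeps the helpers usable on an iframe's contentDocument as well as on the current page.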

This is the most complete answer and should be accepted. As of 2016, browser compatibility is complete, and mentioning it in detail (as in the currently accepted answer) is no longer necessary.

I believe document.documentElement.outerHTML should return that for you.

According to MDN, outerHTML is supported in Firefox 11, Chrome 0.2, Internet Explorer 4.0, Opera 7, Safari 1.3, Android, Firefox Mobile 11, IE Mobile, Opera Mobile, and Safari Mobile. outerHTML is in the DOM Parsing and Serialization specification.

The MSDN page on the outerHTML property notes that it is supported in IE 5+. Colin’s answer links to the W3C quirksmode page, which offers a good comparison of cross-browser compatibility (for other DOM features too).

@Colin: Yeah, good point. From experience, I seem to remember that both IE 6+ and Firefox support it, though the quirksmode page you linked suggests otherwise.

document.getElementsByTagName('html')[0].innerHTML

You will not get the Doctype or html tag, but everything else.

document.documentElement.outerHTML

Supported in Firefox 11, Chrome 0.2, Internet Explorer 4.0, Opera 7, Safari 1.3, Android, Firefox Mobile 11, IE Mobile, Opera Mobile, and Safari Mobile (MDN). outerHTML is in the DOM Parsing and Serialization specification.

    // Serialize the current DOM tree, including changes/edits, into ss
    var ns = new XMLSerializer();
    var ss = ns.serializeToString(document);
    alert(ss.substr(0, 300));

This may work in Firefox. (It shows the first 300 characters from the very beginning of the source text, mostly doctype definitions.)

But be aware that Firefox's normal "Save As" dialog might not save the current state of the page, but rather the originally loaded X/HTML source text. (POSTing ss to a temporary file and redirecting to it might deliver a saveable source text that includes the changes/edits made beforehand.)

That said, Firefox recovers state well on "Back" and does include the current states/values of input-like fields (input, textarea, etc.) on "Save (As)", though not for elements in contenteditable/designMode.

If the document is not an XHTML or XML file (judged by MIME type, not just the filename extension), one may use document.open/write/close to set the appropriate content in the source layer, which is what gets saved from Firefox's File/Save menu. See: http://www.w3.org/MarkUp/2004/xhtml-faq#docwrite

Regardless of X(HT)ML questions, try "view-source:http://. " as the value of the src attribute of a (possibly script-created) iframe. To access the iframe's document in Firefox, use:

.contentDocument (search for "mdn contentDocument" to find the appropriate members, such as textContent). I worked this out years ago and would rather not dig for it again; if it is still urgently needed, say so and I will look into it.


Is there a way to get all text from the rendered page with JS?

Is there an (unobtrusive, to the user) way to get all the text in a page with JavaScript? I could get the HTML, parse it, remove all tags, etc., but I’m wondering if there’s a way to get the text from the already rendered page. To clarify, I don’t want to grab text from a selection, I want the entire page. Thank you!

3 Answers

All credit to Greg W’s answer, as I based this answer on his code, but I found that for a website without inline style or script tags it was generally simpler to use:

as this grabs all text in all tags without one having to manually set every tag that might contain text.

Also, if you’re not careful, setting the tags manually has the propensity to create duplicated text in the output as the each function will often have to check tags contained within other tags which results in it grabbing the same text twice. Using one selector which contains all the tags we want to grab text from circumvents this issue.

The caveat is that if there are inline style or script tags within the body tag it will grab those too.
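As a rough stand-in that behaves the way this answer describes (all text from every tag in one pass, with inline script and style contents included), the body's textContent can be read directly. This is an assumption about the approach, written in plain JavaScript rather than the jQuery the answer was based on, and getAllText is a made-up name.

```javascript
// Hypothetical helper: returns all text inside <body> in document order.
// textContent visits each text node exactly once, so nested tags do not
// produce duplicated text, but inline <script> and <style> contents
// are included, matching the caveat above.
function getAllText(doc) {
  return doc.body.textContent;
}

// In a browser: console.log(getAllText(document));
```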

Update:

After reading this article about innerText I now think the absolute best way to get the text is plain ol vanilla js:

As is, this is not reliable cross-browser, but in controlled environments it returns the best results. Read the article for more details.

This method formats the text in a usually more readable manner and does not include style or script tag contents in the output.
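Given the reference to the innerText article, the vanilla approach presumably reads innerText on the body. A small sketch under that assumption follows. The fallback to textContent is an addition for environments where innerText is missing, and getRenderedText is a made-up name.

```javascript
// Hypothetical helper: prefers innerText, which follows what is actually
// rendered (hidden elements and <script>/<style> contents are left out).
function getRenderedText(doc) {
  const body = doc.body;
  if (typeof body.innerText === 'string') {
    return body.innerText;
  }
  // Fallback: textContent is available more widely but also returns
  // script/style contents and text from hidden elements.
  return body.textContent;
}

// In a browser: console.log(getRenderedText(document));
```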
