html — javascript page source code — Stack Overflow

viewing actual source code of a website

I'll explain my question with an example. Suppose I go to the url http://www.google.co.il/#q=university and then right click and choose "view source". I don't get the real html source; I'm sure of that, because if I search the code for unique words that appear in the document I get no results. I know that in Chrome I can select something and inspect the element, and then I can see the real source code, but I want to use a Java program for getting the code, so I want to understand why I don't see the real html source when I go to "view source".

8 Answers

View source usually does not show any JavaScript-generated content; to see that, you'll want to use a plugin such as Firebug.

The only way I know to see the actual source in Java, including JavaScript-made modifications, would be through a virtual browser framework like HtmlUnit.

HtmlUnit can execute JS scripts and apply all changes to the DOM tree. You would have to serialize it to get the actual page. Keep in mind there is no such thing as a "complete html source". You can only get the DOM tree and possibly serialize it.

Well, if you select "view source" you see the actual HTML source code of the page at the URL in your address bar. However, it might be that the page(s) you want to view are "obfuscated" by having embedded code which loads external content and puts it into your HTML.


If you still want to automatically parse such a page in a "nice" way, you need to run a whole HTML interpreter such as WebKit, which is a lot of work and in principle the same thing you are doing with "inspect element". The other way is to find the lines in the page html that load the external content and then load that content on your own. If you are lucky this is not obfuscated on purpose and is fairly easy to achieve for small tasks.

However, if you need the whole DOM structure, you should think about implementing one of the browser engines.

You could do something like document.documentElement which gives all the HTML content.

console.log(document.documentElement); 
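If you want the markup as a string rather than the element object, document.documentElement also exposes outerHTML (a minimal sketch, run in the browser console):

console.log(document.documentElement.outerHTML); // serialized markup of the current DOM, not the original server response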

The text you're looking for could have been rendered from JavaScript. If you're using Chrome (since you mentioned it), the web developer pane that comes up when you do "inspect element" has a "Resources" tab that lists JavaScript files, stylesheets, etc.

"View source" gives you the raw response generated by the server. As Joachim Isaksson has already mentioned, use Chrome's developer tools or Firebug for Firefox to see the live DOM.


Can Javascript read the source of any web page?

I am working on screen scraping and want to retrieve the source code of a particular page. How can I achieve this with JavaScript? Please help me.

Here is a similar page where you may get your answer, as it solved my problem of getting the source of an HTML page: stackoverflow.com/questions/1367587/javascript-page-source-code

@mikenvck Why did you even mention PHP when the question was about JavaScript? The answers below show how to do this with JavaScript.

To get the source of a link you may need to use $.ajax for external links. Here is the solution: stackoverflow.com/a/18447625/2657601

jQuery is native JavaScript. It’s just JavaScript you can copy from jquery.com instead of from stackoverflow.com.

17 Answers

$("#links").load("/Main_Page #jq-p-Getting-Started li"); 

Another way to do screen scraping in a much more structured way is to use YQL, or Yahoo Query Language. It will return the scraped data structured as JSON or XML.
e.g.
Let’s scrape stackoverflow.com

select * from html where url="http://stackoverflow.com" 

will give you the results as a JSON array (I chose that option)

The beauty of this is that you can do projections and where clauses, which ultimately gets you the scraped data structured and containing only the data you need (much less bandwidth over the wire, ultimately)
e.g.

select * from html where url="http://stackoverflow.com" and xpath='//div/h3/a' 

Now to get only the questions we do a

select title from html where url="http://stackoverflow.com" and xpath='//div/h3/a' 

Note the title in projections

Once you write your query it generates a url for you

So ultimately you end up doing something like this

$.getJSON(theAboveUrl, function (data) {
  var titleList = data.query.results; // the scraped items arrive in the callback, not as the return value
});

Beautiful, isn’t it?

Brilliant, especially for hinting at the poor man's solution at Yahoo that eliminates the need for a proxy to fetch the data. Thank you! I took the liberty of fixing the last demo link to query.yahooapis.com: it was missing a % sign in the url-encoding. Cool that this still works!

query.yahooapis has been retired as of Jan. 2019. Looks really neat, too bad we can’t use it now. See tweet here: twitter.com/ydn/status/1079785891558653952?ref_src=twsrc%5Etfw

Javascript can be used, as long as you grab whatever page you’re after via a proxy on your domain:

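The code block from this answer was lost, but the idea can be sketched roughly as follows. The /proxy endpoint and the #output element are assumptions for illustration; the proxy script on your own domain would fetch the remote page server-side and return its HTML:

// Minimal sketch, not the original answer's code: fetch a remote page through a
// hypothetical same-origin proxy endpoint and display its markup as text.
fetch('/proxy?url=' + encodeURIComponent('http://www.google.com/'))
  .then(function (response) { return response.text(); })
  .then(function (html) {
    document.getElementById('output').textContent = html; // assumes an element with id="output" exists on the page
  });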

That's really interesting. Presumably there is some code to install on the server to make that happen?

You will get a "blocked by CORS policy: No 'Access-Control-Allow-Origin' header is present on the requested resource" error if you are not on the same domain, though.

You could simply use XmlHttp (AJAX) to hit the required URL and the HTML response from the URL will be available in the responseText property. If it’s not the same domain, your users will receive a browser alert saying something like «This page is trying to access a different domain. Do you want to allow this?»
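As a rough illustration of that answer (not code from the original post; the URL is a placeholder), reading a same-origin page via XMLHttpRequest looks like this:

// Sketch only: read a same-origin page's HTML from responseText
var xhr = new XMLHttpRequest();
xhr.onreadystatechange = function () {
  if (xhr.readyState === 4 && xhr.status === 200) {
    console.log(xhr.responseText); // the raw HTML returned by the server
  }
};
xhr.open('GET', '/some-page.html', true);
xhr.send();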

const URL = 'https://www.sap.com/belgique/index.html';
fetch(URL)
  .then(res => res.text())
  .then(text => {
    console.log(text);
  })
  .catch(err => console.log(err));

As a security measure, Javascript can’t read files from different domains. Though there might be some strange workaround for it, I’d consider a different language for this task.

If you absolutely need to use javascript, you could load the page source with an ajax request.

Note that with JavaScript you can only retrieve pages that are located under the same domain as the requesting page.

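The snippet that originally accompanied this answer is missing; a minimal jQuery sketch of the same idea (the URL and the #source-output element are illustrative assumptions) would be:

// Sketch: load the source of a same-domain page and show it as plain text
$.get('/other-page.html', function (source) {
  $('#source-output').text(source);
});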

You can’t request a page outside of your domain in this way, you have to do it via proxy, e.g. $.get(‘mydomain.com/?url=www.google.com’)

I used ImportIO. They let you request the HTML from any website if you set up an account with them (which is free). They let you make up to 50k requests per year. I didn't take the time to find an alternative, but I'm sure there are some.

In your Javascript, you’ll basically just make a GET request like this:

var request = new XMLHttpRequest();
request.onreadystatechange = function() {
  if (request.readyState === 4 && request.status === 200) { // only act once the response has fully arrived
    var jsontext = request.responseText;
    alert(jsontext);
  }
};
request.open("GET", "https://extraction.import.io/query/extractor/THE_PUBLIC_LINK_THEY_GIVE_YOU?_apikey=YOUR_KEY&url=YOUR_URL", true);
request.send();

Sidenote: I found this question while researching what I felt like was the same question, so others might find my solution helpful.

UPDATE: I created a new one which they just allowed me to use for less than 48 hours before they said I had to pay for the service. It seems that they shut down your project pretty quick now if you aren’t paying. I made my own similar service with NodeJS and a library called NightmareJS. You can see their tutorial here and create your own web scraping tool. It’s relatively easy. I haven’t tried to set it up as an API that I could make requests to or anything.
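For reference, a minimal NightmareJS sketch of that kind of scraper (the URL is a placeholder and this is not the author's actual service code) might look like this:

// Tiny NightmareJS scraper (Node.js): load a page, run its JS, and dump the resulting HTML
const Nightmare = require('nightmare');
const nightmare = Nightmare({ show: false });

nightmare
  .goto('https://example.com/')                        // placeholder URL
  .evaluate(() => document.documentElement.outerHTML)  // runs inside the page, returns the live markup
  .end()
  .then(html => console.log(html))
  .catch(err => console.error(err));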


How do I get the HTML source from the page?

Is there a way to access the page HTML source code using javascript? I know that I can use document.body.innerHTML but it contains only the code inside the body. I want to get all the page source code including head and body tags with their content, and, if it’s possible, also the html tag and the doctype. Is it possible?

5 Answers

document.documentElement.outerHTML 
document.documentElement.innerHTML 

Be aware that the source you get with Firefox/most browsers is the «true» source you served up. In IE you will get the «live» HTML of the page including any changes the user has made to forms, any new DOM content etc. In IE it will also be the mixed case invalid tag soup that IE provides when requesting the .innerHTML of elements.
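Note that outerHTML does not include the doctype the question asked about; a small sketch of one way to prepend it:

// Rebuild the full source, including the doctype, from the live DOM
var doctype = document.doctype ? new XMLSerializer().serializeToString(document.doctype) : '';
console.log(doctype + '\n' + document.documentElement.outerHTML);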

In case anyone else is still looking into this, the situation has changed somewhat. @Crescent Fresh was correct 2 years ago; however, more recent versions of Chrome and Safari also implement HTMLElement.outerHTML, though at the time of writing, Firefox does not.

@LiamNewmarch 2 years after your comment, which was 2 years after the initial post, and it seems that now Firefox also implements outerHTML. 🙂

This can be done in a one-liner using XMLSerializer.

var generatedSource = new XMLSerializer().serializeToString(document); 

Unfortunately you will get garbage if the document content has any character that requires escaping in XML. Also you will not get the real original string but something slightly different (e.g. it will include an XML namespace declaration on the root element).

One way to do this would be to re-request the page using XMLHttpRequest, then you’ll get the entire page verbatim from the web server.

Assuming that:
  • the true html source code is wanted (not the current DOM serialization),
  • and that the page was loaded using the GET method,

the page source can be re-downloaded:

fetch(document.location.href)
  .then(response => response.text())
  .then(pageSource => /* . */) 

That is unreliable because there is no guarantee that the server will serve the same content next time.

@SzczepanHołyszewski Given that the REST protocol is defined as stateless, as long as you send the same headers in the ajax request as the browser did, then I would be confident the server would send the same response.

@dantechguy What are you talking about? There is nothing in the OP about REST. Whether an endpoint is a REST one depends on the server. The fetch API is typically used by client-side JS to talk to REST endpoints, but using the fetch API on a non-REST endpoint doesn’t magically turn it into a REST one. But even if we talk REST, statelessness is irrelevant. Two identical REST GET requests can return different data if the resource was actually modified between the requests, or your permission to access the resource was revoked, or for a number of other reasons.

You can make this a bit more reliable by at least adding an Accept header similar to that of the browser. But yeah, this approach is not generally reliable.
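As a rough sketch of that suggestion (the Accept value below is just a typical browser default, not something taken from the original comment):

// Re-request the page with a browser-like Accept header
fetch(document.location.href, {
  headers: { 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' }
})
  .then(response => response.text())
  .then(pageSource => console.log(pageSource));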

This worked for me! This YouTube url has timedtext (a transcription) in "view page source" and I could only retrieve it by fetching the url again: youtube.com/watch?v=LA-LMRFhzaw&ab_channel=jordifieke


Displaying Source Code of HTML

I'm creating a Feature Index for a website and I thought that it would be nice if the user could just click on a link to view the source code instead of using the browser developer tools. After some research I found that we can easily view the source code of the current page by linking to its URL with view-source: prepended.

Sadly, I'm still kind of a big JavaScript noob and was wondering if there is a way to hyperlink to the source code of other HTML pages on the site. Thanks.

4 Answers

Please note that the view-source: protocol is blocked in several browsers.

If you need to show the source of the current page and of other pages on the same site, you might do something like this, assuming the html is well formed. I am using jQuery to get the data, by the way (note that links 2-4 will not work in this demo):

$(function() {
  $(".codeLink").on("click", function(e) {
    e.preventDefault(); // cancel the link
    if (this.id == "thispage") {
      // show the source of the current page, escaping < so the markup displays as text
      $("#codeOutput").html(("<html>" + $("html").html() + "</html>").replace(/</g, "&lt;"));
    } else {
      // fetch the linked page and show its (escaped) source
      $.get($(this).prop("href"), function(data) {
        $("#codeOutput").html(data.replace(/</g, "&lt;"));
      });
    }
  });
});

Yeah, exactly the same as you did now: direct the user to the url with view-source: prepended to it. For example, view-source:http://www.stackoverflow.com will direct your visitor to the source of stackoverflow. But beware that this depends solely on the browser being used, meaning some users will see the source while others might not.

And a bonus jQuery snippet to convert all source code links (with class 'source') (not tested):

jQuery(document).ready(function ($) {
  $('a.source').each(function() {
    $(this).attr('href', 'view-source:' + $(this).attr('href'));
  });
});

Thank you, I will try this now. The users viewing this site will only be using Chrome, but yes, thank you for pointing that out.

Hmm, strange: view-source:stackoverflow.com works, but view-source:cisweb.ufv.ca/~300105626/assignment02 does not. Maybe something to do with my school server?

