Get URL Content using Java
This example shows how to download the content of a URL to your local machine. The program below takes a URL as input, fetches its content, and saves it to a file on the local system.
    package javabeat.net.core;

    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.File;
    import java.io.FileWriter;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.MalformedURLException;
    import java.net.URL;
    import java.net.URLConnection;

    public class URLDownloadExample {
        public static void main(String[] args) {
            try {
                // Download the URL content
                URL url = new URL("http://en.wikipedia.org/");
                URLConnection connection = url.openConnection();

                // Save the downloaded content to a local file
                File localFile = new File("D:/test.html");

                // try-with-resources closes both streams even if an exception is thrown
                try (BufferedReader bufferReader = new BufferedReader(
                             new InputStreamReader(connection.getInputStream()));
                     BufferedWriter bufferWriter = new BufferedWriter(new FileWriter(localFile))) {
                    String input;
                    while ((input = bufferReader.readLine()) != null) {
                        bufferWriter.write(input);
                        // readLine() strips the line terminator, so write it back
                        bufferWriter.newLine();
                    }
                }
                System.out.println("Saving the content is done");
            } catch (MalformedURLException exception) {
                exception.printStackTrace();
            } catch (IOException exception) {
                exception.printStackTrace();
            }
        }
    }
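As a more compact alternative, the download-and-save step can be sketched with java.nio.file.Files.copy, which copies the bytes unchanged and so avoids any charset or line-ending handling. This is only a minimal sketch: the class name SaveUrlToFile, the helper save, and the target file name downloaded.html are assumptions made for illustration and are not part of the original example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical class name, chosen for this sketch only.
final class SaveUrlToFile {

    // Copies every byte from the stream into the target file, replacing any
    // existing file, and returns the number of bytes written.
    static long save(InputStream in, Path target) throws IOException {
        return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("Usage: java SaveUrlToFile <url>");
            return;
        }
        // Assumed target file name; adjust as needed.
        Path target = Path.of("downloaded.html");
        try (InputStream in = URI.create(args[0]).toURL().openStream()) {
            long bytes = save(in, target);
            System.out.println("Saved " + bytes + " bytes to " + target.toAbsolutePath());
        }
    }
}
```

Because the bytes are written exactly as received, the saved file keeps the encoding and line endings of the server's response.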
About Krishna Srinivasan
He is the Founder and Chief Editor of JavaBeat. He has more than 8 years of experience in developing web applications. He writes about Spring, DOJO, JSF, Hibernate, and many other emerging technologies on this blog.
Java IO — Reading Content from URL
In this article, we present several ways to read content directly from a URL in Java. We will use classes available in plain Java, such as BufferedReader, Scanner, and InputStream, and external libraries such as Guava and Apache Commons IO.
This article is part of the Java I/O series.
2. Reading Directly from a URL using BufferedReader
Let's start with a simple solution in plain Java. In this example, we use InputStreamReader, which is a bridge from byte streams to character streams, to convert the InputStream obtained from the URL into a character-based stream. For better performance, we wrap the InputStreamReader in a BufferedReader, which buffers input for efficient reading of characters, arrays, and lines.
    package com.frontbackend.java.io.url;

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.URL;

    public class ReadURLUsingBufferedReader {

        public static void main(String[] args) throws IOException {
            String line;
            StringBuffer buff = new StringBuffer();
            URL url = new URL("http://www.example.com/");

            try (BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()))) {
                while ((line = in.readLine()) != null) {
                    buff.append(line)
                        .append(System.lineSeparator());
                }
            }

            System.out.println(buff.toString());
        }
    }
In this example, we read the content of the URL line by line and append each line to a StringBuffer, followed by the platform-dependent line separator returned by System.lineSeparator().
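The same line-by-line approach can also be written with an explicit charset and BufferedReader.lines(). The sketch below is only an illustration: the class name ReadLinesWithCharset and the helper readAll are hypothetical, UTF-8 is an assumed encoding, and an in-memory stream stands in for url.openStream() so that it runs without a network connection.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.stream.Collectors;

// Hypothetical class name, chosen for this sketch only.
final class ReadLinesWithCharset {

    // Reads the whole stream as text, decoding with the given charset and
    // joining the lines with the platform line separator.
    static String readAll(InputStream in, Charset charset) {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            return reader.lines().collect(Collectors.joining(System.lineSeparator()));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // In-memory stand-in for url.openStream(), so the sketch runs offline.
        InputStream in = new ByteArrayInputStream(
                "<html>\n<body>caf\u00e9</body>\n</html>".getBytes(StandardCharsets.UTF_8));
        System.out.println(readAll(in, StandardCharsets.UTF_8));
    }
}
```

Unlike the loop that appends a separator after every line, Collectors.joining places the separator only between lines, so the result has no trailing line break.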
3. Reading content from URL with Scanner
In the next example, we use the Scanner class, which parses primitive types and strings using regular expressions.
    package com.frontbackend.java.io.url;

    import java.io.IOException;
    import java.net.URL;
    import java.util.Scanner;

    public class ReadURLUsingScanner {

        public static void main(String[] args) throws IOException {
            URL url = new URL("http://www.example.com/");
            String content;

            try (Scanner scanner = new Scanner(url.openStream(), "UTF-8")) {
                content = scanner.useDelimiter("\\A")
                                 .next();
            }

            System.out.println(content);
        }
    }
In this example, we set the Scanner delimiter to "\\A" (the regular expression \A), which matches only the beginning of the input. Because that delimiter never occurs again, a single call to next() returns all characters from the beginning to the end of the stream.
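To see the \A delimiter at work without a network connection, the following sketch feeds Scanner an in-memory stream. The class name ScannerWholeInput and the helper readAll are hypothetical names for this illustration. The sketch also guards against an empty stream, where next() would otherwise throw NoSuchElementException.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

// Hypothetical class name, chosen for this sketch only.
final class ScannerWholeInput {

    // Returns the entire stream as one string, or "" if the stream is empty.
    static String readAll(InputStream in) {
        try (Scanner scanner = new Scanner(in, StandardCharsets.UTF_8.name())) {
            // \A matches only at the start of the input, so the first token
            // runs to the end of the stream, line breaks included.
            scanner.useDelimiter("\\A");
            // hasNext() is false for an empty stream, where next() would throw.
            return scanner.hasNext() ? scanner.next() : "";
        }
    }

    public static void main(String[] args) {
        // In-memory stand-in for url.openStream().
        InputStream in = new ByteArrayInputStream(
                "first line\nsecond line\n".getBytes(StandardCharsets.UTF_8));
        System.out.print(readAll(in));
    }
}
```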
4. Get URL content using Java 9 InputStream
Java 9 added the InputStream.readAllBytes() method, which reads all remaining bytes from an input stream. We can make use of it in the following example:
    package com.frontbackend.java.io.url;

    import java.io.IOException;
    import java.io.InputStream;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    public class ReadURLUsingInputStream {

        public static void main(String[] args) throws IOException {
            URL url = new URL("http://www.google.com/");

            try (InputStream inputStream = url.openStream()) {
                byte[] bytes = inputStream.readAllBytes();
                System.out.println(new String(bytes, StandardCharsets.UTF_8));
            }
        }
    }
Note that the encoding should always be specified explicitly when converting bytes to characters.
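A small sketch of why the encoding matters: the same bytes decoded with two different charsets give different strings. The class name CharsetMismatchDemo and the helper decode are hypothetical, and the sample word is chosen only for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical class name, chosen for this sketch only.
final class CharsetMismatchDemo {

    // Decodes the bytes with the given charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // "cafe" with an acute accent on the e, encoded as UTF-8.
        // The accented letter (U+00E9) becomes two bytes: 0xC3 0xA9.
        byte[] utf8Bytes = "caf\u00e9".getBytes(StandardCharsets.UTF_8);

        String right = decode(utf8Bytes, StandardCharsets.UTF_8);
        String wrong = decode(utf8Bytes, StandardCharsets.ISO_8859_1);

        System.out.println(right.length());      // 4 characters
        System.out.println(wrong.length());      // 5: each byte became its own character
        System.out.println(right.equals(wrong)); // false
    }
}
```

When the wrong charset is used, no exception is thrown. The text is silently corrupted, which is why relying on a default is risky.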
5. Read URL using Guava
The Guava library provides the Resources.toString(...) method, which reads the entire content of a URL into a String.
    package com.frontbackend.java.io.url;

    import java.io.IOException;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    import com.google.common.io.Resources;

    public class ReadURLUsingGuava {

        public static void main(String[] args) throws IOException {
            URL url = new URL("http://www.google.com/");
            String str = Resources.toString(url, StandardCharsets.UTF_8);
            System.out.println(str);
        }
    }
6. Reading URL content using Apache Commons IO library
The Apache Commons IO library comes with the IOUtils class, which can be used to convert the InputStream from a URL to a String.
    package com.frontbackend.java.io.url;

    import java.io.IOException;
    import java.io.InputStream;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    import org.apache.commons.io.IOUtils;

    public class ReadURLUsingApacheCommonsIO {

        public static void main(String[] args) throws IOException {
            URL url = new URL("http://www.google.com/");

            try (InputStream in = url.openStream()) {
                String str = IOUtils.toString(in, StandardCharsets.UTF_8);
                System.out.println(str);
            }
        }
    }
7. Conclusion
In this article, we presented several ways to read content from a URL in Java. We used classes available in plain Java and libraries such as Guava and Apache Commons IO. Conveniently, the URL object has an openStream() method that returns an InputStream, so reading a URL comes down to converting an InputStream to a String.
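For runtimes older than Java 9, where readAllBytes() is not available, that InputStream-to-String conversion can be sketched with a plain read loop. The class name StreamToString and the helper readToString are hypothetical, and the 8 KB buffer size is an arbitrary choice.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical class name, chosen for this sketch only.
final class StreamToString {

    // Collects all bytes from the stream, then decodes them in one step.
    static String readToString(InputStream in, Charset charset) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192]; // arbitrary buffer size
        int read;
        while ((read = in.read(buffer)) != -1) {
            out.write(buffer, 0, read);
        }
        // Decoding after all bytes are collected means a multi-byte character
        // is never split across two buffer reads.
        return new String(out.toByteArray(), charset);
    }

    public static void main(String[] args) throws IOException {
        // In-memory stand-in for url.openStream().
        InputStream in = new ByteArrayInputStream(
                "<p>hello</p>".getBytes(StandardCharsets.UTF_8));
        System.out.println(readToString(in, StandardCharsets.UTF_8));
    }
}
```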
Examples used in this tutorial are available under our GitHub repository.
Reading a web page in Java
Reading a web page in Java is a tutorial that presents several ways to read a web page in Java. It contains seven examples of downloading the HTML source of a small web page.
Java reading web page tools
Java has built-in tools and third-party libraries for reading/downloading web pages. In the examples, we use HttpClient, URL, JSoup, HtmlCleaner, Apache HttpClient, Jetty HttpClient, and HtmlUnit.
In the following examples, we download the HTML source of the tiny webcode.me web page.
Java read web page with HttpClient
Java 11 introduced the HttpClient API in the java.net.http package.

    package com.zetcode;

    import java.io.IOException;
    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class ReadWebPage {

        public static void main(String[] args) throws IOException, InterruptedException {

            HttpClient client = HttpClient.newHttpClient();
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://webcode.me"))
                    .GET() // GET is default
                    .build();

            HttpResponse<String> response = client.send(request,
                    HttpResponse.BodyHandlers.ofString());

            System.out.println(response.body());
        }
    }
We use the Java HttpClient to download the web page.
HttpClient client = HttpClient.newHttpClient();
A new HttpClient is created with the newHttpClient factory method.
    HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://webcode.me"))
            .build();

We build a request to the web page. The default method is GET.

    HttpResponse<String> response = client.send(request,
            HttpResponse.BodyHandlers.ofString());
    System.out.println(response.body());
We send the request synchronously with the send method, retrieve the content of the response, and print it to the console. We use HttpResponse.BodyHandlers.ofString since we expect the HTML response as a string.
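In practice it is often useful to set timeouts and to check the status code before using the body. The following is a minimal sketch of that idea, not part of the original example: the class name FetchWithTimeout, its helper methods, and the timeout values of 5 and 10 seconds are all assumptions for illustration.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Hypothetical class name, chosen for this sketch only.
final class FetchWithTimeout {

    // A client with a connect timeout that follows ordinary redirects.
    static HttpClient buildClient() {
        return HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(5))      // assumed value
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
    }

    // A GET request that fails if no response arrives within the timeout.
    static HttpRequest buildRequest(URI uri) {
        return HttpRequest.newBuilder(uri)
                .timeout(Duration.ofSeconds(10))            // assumed value
                .GET()
                .build();
    }

    // Treats any 2xx status as success.
    static boolean isSuccess(int statusCode) {
        return statusCode >= 200 && statusCode < 300;
    }

    // Sends the request and returns the body, or throws on a non-2xx status.
    static String fetch(URI uri) throws IOException, InterruptedException {
        HttpResponse<String> response =
                buildClient().send(buildRequest(uri), HttpResponse.BodyHandlers.ofString());
        if (!isSuccess(response.statusCode())) {
            throw new IOException("Unexpected HTTP status " + response.statusCode() + " for " + uri);
        }
        return response.body();
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        if (args.length == 0) {
            System.out.println("Usage: java FetchWithTimeout <url>");
            return;
        }
        System.out.println(fetch(URI.create(args[0])));
    }
}
```

Running it with http://webcode.me as the argument should print the same page as the example above, provided the site is reachable.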
Reading a web page with URL
URL represents a Uniform Resource Locator, a pointer to a resource on the World Wide Web.
    package com.zetcode;

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.URL;

    public class ReadWebPageEx {

        public static void main(String[] args) throws IOException {

            var url = new URL("http://webcode.me");

            try (var br = new BufferedReader(new InputStreamReader(url.openStream()))) {

                String line;
                var sb = new StringBuilder();

                while ((line = br.readLine()) != null) {
                    sb.append(line);
                    sb.append(System.lineSeparator());
                }

                System.out.println(sb);
            }
        }
    }
The code example reads the contents of a web page.
    try (var br = new BufferedReader(new InputStreamReader(url.openStream()))) {

The openStream method opens a connection to the specified URL and returns an InputStream for reading from that connection. The InputStreamReader is a bridge from byte streams to character streams: it reads bytes and decodes them into characters using a charset. Because no charset is passed to the constructor here, the platform default is used. In addition, BufferedReader is used for better performance.

    var sb = new StringBuilder();

    while ((line = br.readLine()) != null) {
        sb.append(line);
        sb.append(System.lineSeparator());
    }

The HTML data is read line by line with the readLine method. Each line is appended to the StringBuilder.
In the end, the contents of the StringBuilder are printed to the terminal.
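When InputStreamReader is created without a charset argument, it decodes with the platform default charset. One way to choose a better charset is to look at the Content-Type header, which URLConnection.getContentType() returns as a string such as "text/html; charset=UTF-8". The sketch below parses that value. The class name ContentTypeCharset, the helper charsetOf, and the regular expression are assumptions made for illustration, and the parsing is deliberately simple rather than a full implementation of the header grammar.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name, chosen for this sketch only.
final class ContentTypeCharset {

    // Matches charset=NAME or charset="NAME", ignoring case.
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

    // Returns the charset named in a Content-Type value, or the fallback when
    // the header is missing, has no charset parameter, or names an unknown charset.
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        Matcher matcher = CHARSET.matcher(contentType);
        if (!matcher.find()) {
            return fallback;
        }
        try {
            return Charset.forName(matcher.group(1));
        } catch (IllegalArgumentException e) {
            // Covers both illegal and unsupported charset names.
            return fallback;
        }
    }

    public static void main(String[] args) {
        // Sample header values; in real use, pass connection.getContentType().
        System.out.println(charsetOf("text/html; charset=ISO-8859-1", StandardCharsets.UTF_8));
        System.out.println(charsetOf("text/html", StandardCharsets.UTF_8));
    }
}
```

The returned charset can then be passed as the second argument to the InputStreamReader constructor.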
Reading a web page with JSoup
JSoup is a popular Java HTML parser.
We have used this Maven dependency.
    package com.zetcode;

    import org.jsoup.Jsoup;

    import java.io.IOException;

    public class ReadWebPageEx2 {

        public static void main(String[] args) throws IOException {

            String webPage = "http://webcode.me";

            String html = Jsoup.connect(webPage).get().html();

            System.out.println(html);
        }
    }

The code example uses JSoup to download and print a tiny web page.

    String html = Jsoup.connect(webPage).get().html();

The connect method connects to the specified web page. The get method issues a GET request. Finally, the html method retrieves the HTML source.
Reading a web page with HtmlCleaner
HtmlCleaner is an open source HTML parser written in Java.
    <dependency>
        <groupId>net.sourceforge.htmlcleaner</groupId>
        <artifactId>htmlcleaner</artifactId>
        <version>2.16</version>
    </dependency>

For this example, we use the htmlcleaner Maven dependency.

    package com.zetcode;

    import java.io.IOException;
    import java.net.URL;

    import org.htmlcleaner.CleanerProperties;
    import org.htmlcleaner.HtmlCleaner;
    import org.htmlcleaner.SimpleHtmlSerializer;
    import org.htmlcleaner.TagNode;

    public class ReadWebPageEx3 {

        public static void main(String[] args) throws IOException {

            var url = new URL("http://webcode.me");

            var props = new CleanerProperties();
            props.setOmitXmlDeclaration(true);

            var cleaner = new HtmlCleaner(props);
            TagNode node = cleaner.clean(url);

            var htmlSerializer = new SimpleHtmlSerializer(props);
            htmlSerializer.writeToStream(node, System.out);
        }
    }

The example uses HtmlCleaner to download a web page.

    var props = new CleanerProperties();
    props.setOmitXmlDeclaration(true);

In the properties, we set to omit the XML declaration.

    var htmlSerializer = new SimpleHtmlSerializer(props);
    htmlSerializer.writeToStream(node, System.out);

A SimpleHtmlSerializer creates the resulting HTML without any indenting and/or compacting.
Reading a web page with Apache HttpClient
Apache HttpClient is an HTTP/1.1-compliant HTTP agent implementation. It can scrape a web page using the request and response process. An HTTP client implements the client side of the HTTP and HTTPS protocols.

    <dependency>
        <groupId>org.apache.httpcomponents</groupId>
        <artifactId>httpclient</artifactId>
        <version>4.5.10</version>
    </dependency>

We use this Maven dependency for the Apache HTTP client.

    package com.zetcode;

    import java.io.IOException;

    import org.apache.http.HttpEntity;
    import org.apache.http.HttpResponse;
    import org.apache.http.client.HttpClient;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.HttpClientBuilder;
    import org.apache.http.util.EntityUtils;

    public class ReadWebPageEx4 {

        public static void main(String[] args) throws IOException {

            HttpGet request = null;

            try {
                String url = "http://webcode.me";

                HttpClient client = HttpClientBuilder.create().build();
                request = new HttpGet(url);

                request.addHeader("User-Agent", "Apache HTTPClient");
                HttpResponse response = client.execute(request);

                HttpEntity entity = response.getEntity();
                String content = EntityUtils.toString(entity);
                System.out.println(content);

            } finally {
                if (request != null) {
                    request.releaseConnection();
                }
            }
        }
    }

In the code example, we send a GET HTTP request to the specified web page and receive an HTTP response. From the response, we read the HTML source.

    HttpClient client = HttpClientBuilder.create().build();

An HttpClient is built with HttpClientBuilder. HttpGet, which is used to create the request, is a class for the HTTP GET method.

    request.addHeader("User-Agent", "Apache HTTPClient");
    HttpResponse response = client.execute(request);

A GET method is executed and an HttpResponse is received.

    HttpEntity entity = response.getEntity();
    String content = EntityUtils.toString(entity);
    System.out.println(content);

From the response, we get the content of the web page.
Reading a web page with Jetty HttpClient
The Jetty project has an HTTP client as well.

    <dependency>
        <groupId>org.eclipse.jetty</groupId>
        <artifactId>jetty-client</artifactId>
        <version>9.4.25.v20191220</version>
    </dependency>

This is a Maven dependency for the Jetty HTTP client.

    package com.zetcode;

    import org.eclipse.jetty.client.HttpClient;
    import org.eclipse.jetty.client.api.ContentResponse;

    public class ReadWebPageEx5 {

        public static void main(String[] args) throws Exception {

            HttpClient client = null;

            try {
                client = new HttpClient();
                client.start();

                String url = "http://webcode.me";

                ContentResponse res = client.GET(url);

                System.out.println(res.getContentAsString());

            } finally {
                if (client != null) {
                    client.stop();
                }
            }
        }
    }

In the example, we get the HTML source of a web page with the Jetty HTTP client.

    client = new HttpClient();
    client.start();

An HttpClient is created and started.

    ContentResponse res = client.GET(url);

A GET request is issued to the specified URL.

    System.out.println(res.getContentAsString());

The content is retrieved from the response with the getContentAsString method.
Reading a web page with HtmlUnit
HtmlUnit is a GUI-less browser for Java programs that is commonly used for testing web-based applications.

    <dependency>
        <groupId>net.sourceforge.htmlunit</groupId>
        <artifactId>htmlunit</artifactId>
        <version>2.36.0</version>
    </dependency>

We use this Maven dependency.

    package com.zetcode;

    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.WebResponse;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;

    import java.io.IOException;

    public class ReadWebPageEx6 {

        public static void main(String[] args) throws IOException {

            try (var webClient = new WebClient()) {

                String url = "http://webcode.me";
                HtmlPage page = webClient.getPage(url);
                WebResponse response = page.getWebResponse();
                String content = response.getContentAsString();

                System.out.println(content);
            }
        }
    }

The example downloads a web page and prints it using the HtmlUnit library.
In this article we have scraped a web page in Java using various tools, including HttpClient, URL, JSoup, HtmlCleaner, Apache HttpClient, Jetty HttpClient, and HtmlUnit.
Author
My name is Jan Bodnar and I am a passionate programmer with many years of programming experience. I have been writing programming articles since 2007. So far, I have written over 1400 articles and 8 e-books. I have over eight years of experience in teaching programming.