- Jsoup Tutorial with Examples
- Why should you use Jsoup instead of regular expressions for web scraping?
- How to download and use Jsoup in your project
- Reading a web page in Java
- Java reading web page tools
- Java read web page with HttpClient
- Reading a web page with URL
- Reading a web page with JSoup
- Reading a web page with HtmlCleaner
- Reading a web page with Apache HttpClient
- Reading a web page with Jetty HttpClient
- Reading a web page with HtmlUnit
- Author
Jsoup Tutorial with Examples
This Jsoup tutorial with examples will help you understand how to use Jsoup in an easy way. In this tutorial, I will show you how Jsoup makes web scraping easier than ever. Jsoup is an open-source library for parsing HTML content and web scraping, distributed under the MIT license. That means you are free to download, use, and distribute it.
Why should you use Jsoup instead of regular expressions for web scraping?
Real-world HTML content may not be well-formed; for example, some programmers choose to write `<br>` while others prefer `<br/>` for line breaks in HTML pages. In this situation, parsing the HTML with regular expressions will not yield the desired results or becomes too complicated. Plus, it is very error-prone and resource-intensive to write patterns for all such combinations when parsing HTML content.
All these problems can be easily avoided by using an HTML parser like Jsoup instead of trying to parse the content using regular expressions.
Below are some of the main capabilities of the Jsoup parser.
- Jsoup can parse HTML directly from a URL, a file, or even a String variable.
- Jsoup allows HTML element structure manipulation like adding, changing or removing elements. It also allows adding and removing attributes easily.
- Finding data in elements or attributes is very easy using Jsoup.
- Jsoup supports basic authentication using a user name and password.
- If you are behind a proxy, no problem! Jsoup works with proxies as well.
- Jsoup supports cleaning the HTML. You can specify what tags you want to retain in the parsed HTML using the whitelist.
- Jsoup can output tidy HTML from the parsed HTML.
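As a quick illustration of the cleaning capability, here is a minimal sketch that strips unsafe markup with Jsoup.clean and the basic whitelist (the input HTML is invented for the example):

```java
package com.zetcode;

import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

public class CleanHtmlSketch {

    public static void main(String[] args) {

        // Hypothetical, untrusted HTML input
        String dirty = "<div><p>Hello <b>world</b></p><script>steal()</script></div>";

        // Keep only the tags allowed by the basic whitelist;
        // the script element is dropped, the div tag is stripped
        String clean = Jsoup.clean(dirty, Whitelist.basic());

        System.out.println(clean);
    }
}
```

Note that in recent jsoup releases the Whitelist class has been renamed to Safelist; the method works the same way.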
These are some of the main features of Jsoup. It provides many other features that are useful in real-world scenarios. Plus, selecting elements from Jsoup-parsed HTML is very easy, because it supports jQuery-styled selectors. For example, to select all td elements from all the table rows of an HTML document, you can write a selector like document.select("table tr td"), which returns all the matching td elements.
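To make the selector syntax concrete, the sketch below parses an HTML string and selects all td cells (the markup is made up for the example):

```java
package com.zetcode;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class SelectorSketch {

    public static void main(String[] args) {

        // Hypothetical HTML with one table row
        String html = "<table><tr><td>Java</td><td>Jsoup</td></tr></table>";

        Document doc = Jsoup.parse(html);

        // jQuery-styled selector: all td cells inside table rows
        Elements tds = doc.select("table tr td");

        for (var td : tds) {
            System.out.println(td.text());
        }
    }
}
```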
How to download and use Jsoup in your project
You can download the binary distribution (the Jsoup jar file) directly from the download section of the Jsoup website. Once you have downloaded the library, put it on your build path to start using it. If you use Maven in your project, add the Jsoup dependency (groupId org.jsoup, artifactId jsoup) to your pom.xml.
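The dependency declaration in pom.xml looks like the following (the version number is illustrative; use the latest release):

```xml
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.13.1</version>
</dependency>
```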
Reading a web page in Java
Reading a web page in Java is a tutorial that presents several ways to read a web page in Java. It contains seven examples of downloading the HTML source of a small web page.
Java reading web page tools
Java has built-in tools and third-party libraries for reading/downloading web pages. In the examples, we use HttpClient, URL, JSoup, HtmlCleaner, Apache HttpClient, Jetty HttpClient, and HtmlUnit.
In the following examples, we download HTML source from the webcode.me tiny web page.
Java read web page with HttpClient
Java 11 introduced the HttpClient API.
```java
package com.zetcode;

import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ReadWebPage {

    public static void main(String[] args) throws IOException, InterruptedException {

        HttpClient client = HttpClient.newHttpClient();

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://webcode.me"))
                .GET() // GET is default
                .build();

        HttpResponse<String> response = client.send(request,
                HttpResponse.BodyHandlers.ofString());

        System.out.println(response.body());
    }
}
```
We use the Java HttpClient to download the web page.
```java
HttpClient client = HttpClient.newHttpClient();
```
A new HttpClient is created with the newHttpClient factory method.
```java
HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("http://webcode.me"))
        .build();
```
We build a synchronous request to the webpage. The default method is GET.
```java
HttpResponse<String> response = client.send(request,
        HttpResponse.BodyHandlers.ofString());

System.out.println(response.body());
```
We send the request and retrieve the content of the response and print it to the console. We use HttpResponse.BodyHandlers.ofString since we expect a string HTML response.
Reading a web page with URL
URL represents a Uniform Resource Locator, a pointer to a resource on the World Wide Web.
```java
package com.zetcode;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;

public class ReadWebPageEx {

    public static void main(String[] args) throws IOException {

        var url = new URL("http://webcode.me");

        try (var br = new BufferedReader(new InputStreamReader(url.openStream()))) {

            String line;
            var sb = new StringBuilder();

            while ((line = br.readLine()) != null) {
                sb.append(line);
                sb.append(System.lineSeparator());
            }

            System.out.println(sb);
        }
    }
}
```
The code example reads the contents of a web page.
```java
try (var br = new BufferedReader(new InputStreamReader(url.openStream()))) {
```

The openStream method opens a connection to the specified URL and returns an InputStream for reading from that connection. The InputStreamReader is a bridge from byte streams to character streams: it reads bytes and decodes them into characters using a specified charset. In addition, a BufferedReader is used for better performance.
```java
var sb = new StringBuilder();

while ((line = br.readLine()) != null) {
    sb.append(line);
    sb.append(System.lineSeparator());
}
```

The HTML data is read line by line with the readLine method. The source is appended to the StringBuilder.
In the end, the contents of the StringBuilder are printed to the terminal.
Reading a web page with JSoup
JSoup is a popular Java HTML parser.
We use the jsoup Maven dependency (groupId org.jsoup, artifactId jsoup).
```java
package com.zetcode;

import org.jsoup.Jsoup;
import java.io.IOException;

public class ReadWebPageEx2 {

    public static void main(String[] args) throws IOException {

        String webPage = "http://webcode.me";

        String html = Jsoup.connect(webPage).get().html();

        System.out.println(html);
    }
}
```

The code example uses JSoup to download and print a tiny web page.
```java
String html = Jsoup.connect(webPage).get().html();
```

The connect method connects to the specified web page. The get method issues a GET request. Finally, the html method retrieves the HTML source.
Reading a web page with HtmlCleaner
HtmlCleaner is an open source HTML parser written in Java.
```xml
<dependency>
    <groupId>net.sourceforge.htmlcleaner</groupId>
    <artifactId>htmlcleaner</artifactId>
    <version>2.16</version>
</dependency>
```

For this example, we use the htmlcleaner Maven dependency.
```java
package com.zetcode;

import java.io.IOException;
import java.net.URL;
import org.htmlcleaner.CleanerProperties;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.SimpleHtmlSerializer;
import org.htmlcleaner.TagNode;

public class ReadWebPageEx3 {

    public static void main(String[] args) throws IOException {

        var url = new URL("http://webcode.me");

        var props = new CleanerProperties();
        props.setOmitXmlDeclaration(true);

        var cleaner = new HtmlCleaner(props);
        TagNode node = cleaner.clean(url);

        var htmlSerializer = new SimpleHtmlSerializer(props);
        htmlSerializer.writeToStream(node, System.out);
    }
}
```

The example uses HtmlCleaner to download a web page.
```java
var props = new CleanerProperties();
props.setOmitXmlDeclaration(true);
```

In the cleaner properties, we specify that the XML declaration should be omitted.
```java
var htmlSerializer = new SimpleHtmlSerializer(props);
htmlSerializer.writeToStream(node, System.out);
```

A SimpleHtmlSerializer creates the resulting HTML without any indenting or compacting.
Reading a web page with Apache HttpClient
Apache HttpClient is an HTTP/1.1 compliant HTTP agent implementation. It can download a web page using the standard request/response process. An HTTP client implements the client side of the HTTP and HTTPS protocols.
```xml
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.10</version>
</dependency>
```

We use this Maven dependency for the Apache HTTP client.
```java
package com.zetcode;

import java.io.IOException;
import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.HttpClientBuilder;
import org.apache.http.util.EntityUtils;

public class ReadWebPageEx4 {

    public static void main(String[] args) throws IOException {

        HttpGet request = null;

        try {
            String url = "http://webcode.me";

            HttpClient client = HttpClientBuilder.create().build();
            request = new HttpGet(url);

            request.addHeader("User-Agent", "Apache HTTPClient");
            HttpResponse response = client.execute(request);

            HttpEntity entity = response.getEntity();
            String content = EntityUtils.toString(entity);

            System.out.println(content);

        } finally {

            if (request != null) {
                request.releaseConnection();
            }
        }
    }
}
```

In the code example, we send a GET HTTP request to the specified web page and receive an HTTP response. From the response, we read the HTML source.
```java
HttpClient client = HttpClientBuilder.create().build();
request = new HttpGet(url);
```

An HttpClient is built with the HttpClientBuilder. HttpGet is a class for the HTTP GET method.
```java
request.addHeader("User-Agent", "Apache HTTPClient");
HttpResponse response = client.execute(request);
```

A GET method is executed and an HttpResponse is received.
```java
HttpEntity entity = response.getEntity();
String content = EntityUtils.toString(entity);

System.out.println(content);
```

From the response, we get the content of the web page.
Reading a web page with Jetty HttpClient
The Jetty project has an HTTP client as well.
```xml
<dependency>
    <groupId>org.eclipse.jetty</groupId>
    <artifactId>jetty-client</artifactId>
    <version>9.4.25.v20191220</version>
</dependency>
```

This is a Maven dependency for the Jetty HTTP client.
```java
package com.zetcode;

import org.eclipse.jetty.client.HttpClient;
import org.eclipse.jetty.client.api.ContentResponse;

public class ReadWebPageEx5 {

    public static void main(String[] args) throws Exception {

        HttpClient client = null;

        try {
            client = new HttpClient();
            client.start();

            String url = "http://webcode.me";

            ContentResponse res = client.GET(url);
            System.out.println(res.getContentAsString());

        } finally {

            if (client != null) {
                client.stop();
            }
        }
    }
}
```

In the example, we get the HTML source of a web page with the Jetty HTTP client.
```java
client = new HttpClient();
client.start();
```

An HttpClient is created and started.
```java
ContentResponse res = client.GET(url);
```

A GET request is issued to the specified URL.
```java
System.out.println(res.getContentAsString());
```

The content is retrieved from the response with the getContentAsString method.
Reading a web page with HtmlUnit
HtmlUnit is a Java unit testing framework for testing web-based applications.
```xml
<dependency>
    <groupId>net.sourceforge.htmlunit</groupId>
    <artifactId>htmlunit</artifactId>
    <version>2.36.0</version>
</dependency>
```

We use this Maven dependency.
```java
package com.zetcode;

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.WebResponse;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import java.io.IOException;

public class ReadWebPageEx6 {

    public static void main(String[] args) throws IOException {

        try (var webClient = new WebClient()) {

            String url = "http://webcode.me";

            HtmlPage page = webClient.getPage(url);
            WebResponse response = page.getWebResponse();
            String content = response.getContentAsString();

            System.out.println(content);
        }
    }
}
```

The example downloads a web page and prints it using the HtmlUnit library.
In this article we have scraped a web page in Java using various tools, including HttpClient, URL, JSoup, HtmlCleaner, Apache HttpClient, Jetty HttpClient, and HtmlUnit.
Author
My name is Jan Bodnar and I am a passionate programmer with many years of programming experience. I have been writing programming articles since 2007. So far, I have written over 1400 articles and 8 e-books. I have over eight years of experience in teaching programming.