"jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jquery-like methods" – JSoup.org
As the description above from JSoup’s website says, the library is a great tool when you want to extract and manipulate HTML data on a webpage. This blog post covers the basics of the API and walks through a few real-life examples of how to use the library.
How to set up the library
Before you can get started using the JSoup library, you actually need to download and add it to your project. For this tutorial, I will be using Eclipse.
To download the library, go to the download page on JSoup’s website and download the core library jar file.
Once you’re done, create a new project in Eclipse as you usually would, then create a folder called “libs” inside your project folder and place the JSoup jar file inside it.
All that’s left to do now is add the JSoup jar to the build path. In the menu bar, click Project -> Properties -> Java Build Path -> Libraries -> Add JARs... and select the jar in your libs folder.
You’re all set!
Let’s start with the complete basics. When we want to fetch and parse the HTML content of a website, we need an object of the Document class, which JSoup creates for us when it loads the page. There are three ways to do this:
- Loading an HTML file from your computer
- Loading/fetching the HTML code from a URL
- Writing some HTML code inside a string and then parsing it
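The string option is the easiest one to try out, because it needs neither a file nor a network connection. Here is a minimal sketch, assuming the JSoup jar is on the build path; the class name ParseFromString and the sample markup are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ParseFromString {
    // Parses HTML held in a String; no file or network access is involved.
    static Document parse(String html) {
        return Jsoup.parse(html);
    }

    public static void main(String[] args) {
        // Sample markup invented for this sketch.
        String html = "<html><head><title>My page</title></head>"
                + "<body><p>Hello, JSoup!</p></body></html>";
        Document doc = parse(html);
        System.out.println(doc.title()); // prints "My page"
    }
}
```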
Loading HTML file from your computer
The first thing to do is define a String constant containing the path to your HTML file, along with an instance of the File class.
Inside the constructor, we initialize the File instance with that path. Afterwards, we simply ask JSoup to parse the content of the HTML file, which gives us an instance of the Document class.
We can then retrieve the title of our page by calling doc.title().
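The steps above could look roughly like the sketch below. It is simplified: it uses a static helper instead of a constructor, and it writes a small sample page to a temporary file so that it runs as-is. The class name LoadFromFile, the sample markup, and the UTF-8 charset are assumptions for this example; in your own project you would point the File at your own HTML file.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class LoadFromFile {
    // Parses the given HTML file, assuming it is encoded as UTF-8.
    static Document load(File file) throws IOException {
        return Jsoup.parse(file, "UTF-8");
    }

    public static void main(String[] args) throws IOException {
        // Create a throwaway HTML file so the sketch does not depend on a real path.
        File file = File.createTempFile("sample", ".html");
        file.deleteOnExit();
        String html = "<html><head><title>Sample page</title></head><body></body></html>";
        Files.write(file.toPath(), html.getBytes(StandardCharsets.UTF_8));

        Document doc = load(file);
        System.out.println(doc.title()); // prints "Sample page"
    }
}
```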
Fetching and parsing HTML content via URL
Fetching a page from a URL takes even fewer lines than loading a file. Unlike the previous example, you don’t use the parse method when fetching and parsing HTML content via a URL. Instead, you connect to the URL and then get() the HTML content of the webpage.
Only the loading part is different. Once you’ve loaded the HTML content, be it via URL or file, the rest of the code stays the same, because either way you’ll be working with an instance of the Document class.
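Here is a minimal sketch of the connect-then-get approach. The class name FetchFromUrl is made up, and https://example.com/ is only a placeholder URL; running it needs network access, and get() throws an IOException if the request fails.

```java
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class FetchFromUrl {
    // Connects to the URL, sends a GET request and parses the response body.
    static Document fetch(String url) throws IOException {
        return Jsoup.connect(url).get();
    }

    public static void main(String[] args) throws IOException {
        // Placeholder URL chosen for this sketch; replace it with the page you want.
        Document doc = fetch("https://example.com/");
        System.out.println(doc.title());
    }
}
```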
Now that you’ve learned how to load HTML content, it’s time to start working with it!
Working with the parsed HTML
Now that the parsing of the HTML is done, we can start doing the fun stuff! We’ve already worked a little bit with the parsed HTML. Remember in the previous examples where we retrieved the title of the webpage? Well, that’s only one of the many things you can do once you’ve parsed the HTML!
Retrieving values from elements
One of the most fundamental things in the JSoup API is the use of elements. If you didn’t already know, an element is everything from an opening tag to its matching closing tag, including the content in between.
An example of an element is: <p>This is a paragraph.</p>
Here’s an example of a program that retrieves all <p> elements and then prints out all the values:
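A minimal sketch of such a program follows. To keep it runnable without a network connection, it parses a small inline HTML string instead of a live website; the class name SelectParagraphs and the sample markup are assumptions for this example.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class SelectParagraphs {
    // Collects the text of every <p> element in the document, in document order.
    static List<String> paragraphTexts(Document doc) {
        Elements paragraphs = doc.select("p");
        List<String> texts = new ArrayList<>();
        for (Element paragraph : paragraphs) {
            texts.add(paragraph.text());
        }
        return texts;
    }

    public static void main(String[] args) {
        // Sample markup invented for this sketch.
        Document doc = Jsoup.parse("<html><body><p>First paragraph</p>"
                + "<div><p>Second <b>paragraph</b></p></div></body></html>");
        for (String text : paragraphTexts(doc)) {
            System.out.println(text);
        }
        // prints "First paragraph" and then "Second paragraph"
    }
}
```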
What I did is pretty straightforward. When you want to retrieve all elements of a certain type, such as <p> elements, you declare a variable of the Elements class and use the Document.select() method to “select” all the matching elements and retrieve them. After I selected the elements, I just looped through every one of them and printed out its text.
But what do you do if you want to retrieve a single individual element from the HTML code? For instance, what if you want to retrieve the first paragraph element in the HTML code? In that case, you’ll have to create an instance of the Element class. See a pattern here?
Here’s a code snippet for retrieving the first paragraph element of the same website as in the last example:
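A minimal sketch, again assuming an inline HTML string rather than a live website so that it runs offline. The class name FirstParagraph is made up. Note that Elements.first() returns null when nothing matches, so the result is checked before use.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class FirstParagraph {
    // Returns the first <p> element in the document, or null if there is none.
    static Element firstParagraph(Document doc) {
        return doc.select("p").first();
    }

    public static void main(String[] args) {
        // Sample markup invented for this sketch.
        Document doc = Jsoup.parse("<html><body><p>First paragraph</p>"
                + "<p>Second paragraph</p></body></html>");
        Element first = firstParagraph(doc);
        if (first != null) {
            System.out.println(first.text()); // prints "First paragraph"
        }
    }
}
```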
Here’s a list of the most useful methods in the Element class:
|Method|Return type|Description|Example|
|---|---|---|---|
|html()|String|Retrieves the inner HTML of the element|String inner = element.html();|
|hasText()|boolean|Returns true if the element has text in it that’s not whitespace|boolean hasText = element.hasText();|
|id()|String|Grabs the ID attribute of the element|String id = element.id();|
|tagName()|String|Gets the name of the tag for the element|String tagName = element.tagName();|
|getElementById(String id)|Element|Finds an element by its ID|Element found = element.getElementById("Example");|
|text(String text)|Element|Sets the text of the element|element.text("Example");|
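To see these methods working together, here is a small sketch that runs them against one element. The class name ElementMethods, the summary helper, and the sample markup are made up for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ElementMethods {
    // Builds a short description of an element from tagName(), id() and hasText().
    static String summary(Element element) {
        return element.tagName() + "#" + element.id() + " hasText=" + element.hasText();
    }

    public static void main(String[] args) {
        // Sample markup invented for this sketch.
        Document doc = Jsoup.parse("<div id=\"box\"><p id=\"intro\">Hello <b>world</b></p></div>");

        Element intro = doc.getElementById("intro");
        System.out.println(summary(intro)); // prints "p#intro hasText=true"
        System.out.println(intro.html());   // inner HTML, including the <b> tag

        // text(String) replaces the element's content and returns the same element.
        intro.text("Example");
        System.out.println(intro.text());   // prints "Example"
    }
}
```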
If you want an overview of all the methods, have a look at the documentation of the API.