Web scraping is the process of extracting data from websites automatically. This is useful for gathering large datasets that would be tedious to collect manually. Here we will go through Java code that scrapes all dog breed images from a Wikipedia page.
This is the page we are talking about: https://commons.wikimedia.org/wiki/List_of_dog_breeds
Prerequisites
To follow along, you'll need:
Jsoup library for Java
Java 8+
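If you build with Maven, the Jsoup dependency looks like this (the version number shown is illustrative; use the latest release):
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.17.2</version> <!-- illustrative; pick the latest release -->
</dependency>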
Let's dive in and see how the scraping is done!
Logic Overview
At a high level, the code:
Connects to the target Wikipedia page
Initializes variables to store the extracted data
Iterates through each row of the dog breed table
Downloads the images and saves them locally
Prints out the extracted information
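Put together, the flow looks roughly like this (a bare sketch; the full, runnable program is listed at the end of the article):
// 1. Connect to the target page
Document doc = Jsoup.connect(url).userAgent(userAgent).get();
// 2. Locate the breed table and initialize storage for the extracted data
Element table = doc.select("table.wikitable.sortable").first();
// 3. Iterate through each data row
for (Element row : table.select("tr:gt(0)")) {
    Elements columns = row.select("td, th");
    // 4. download the row's image, 5. collect the text for printing
}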
Now let's break this down step-by-step.
Understanding the Selectors
While the logic may sound simple, the key part is properly extracting data from the raw HTML of the page. This is done using CSS selectors.
CSS selectors allow targeting specific elements in the HTML document structure. For example, you can select all table rows, links, images etc.
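For instance, with Jsoup's select method (a sketch, assuming doc is an already-parsed Document):
Elements rows  = doc.select("tr");      // every table row in the document
Elements links = doc.select("a[href]"); // every link that has an href attribute
Elements imgs  = doc.select("img");     // every image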
Jsoup implements CSS selectors for parsing the document that is retrieved from the website. Let's see how they are used here:
Selecting the Table
Inspecting the page
If you inspect the page with the Chrome inspect tool, you can see that the data lives in a table element with the classes wikitable and sortable.
First, we need to select the appropriate table element that actually contains the dog breed data.
<table class="wikitable sortable">
<!-- dog breed rows here -->
</table>
The table can be uniquely identified by its class attributes wikitable and sortable. Jsoup allows us to use a CSS selector string to target elements with given classes:
Element table = doc.select("table.wikitable.sortable").first();
Breaking this down:
table - selects HTML <table> tags
.wikitable - class selector, targets elements with the wikitable class
.sortable - class selector, targets elements with the sortable class
first() - returns just the first matching element
So this selector finds the table element with BOTH matching classes, uniquely identifying the dog breed table.
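One caveat: first() returns null when nothing matches, so a guard is prudent in case the page layout ever changes. A sketch using Jsoup's equivalent selectFirst:
Element table = doc.selectFirst("table.wikitable.sortable");
if (table == null) {
    throw new IllegalStateException("Breed table not found - has the page layout changed?");
}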
Skipping Header Row
Now that we have selected the table, we can loop through its rows:
for (Element row : table.select("tr:gt(0)")) {
// extract data from rows
}
Details:
tr selects the table row (<tr>) elements
:gt(0) filters to only the rows at an index GREATER THAN 0, i.e. it skips the header row at index 0
We then iterate over these rows
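If the pseudo-selector feels opaque, an equivalent index-based loop makes the intent explicit (a sketch; :gt(0) does the same thing in a single selector):
Elements rows = table.select("tr");
for (int i = 1; i < rows.size(); i++) { // start at 1 to skip the header row
    Element row = rows.get(i);
    // extract data from row
}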
Getting Row Cells
Next we get the cells within each row:
Elements columns = row.select("td, th");
This selects both <td> and <th> cells in the row using a grouped (comma-separated) selector.
We assign them to an Elements object which acts like an array of elements.
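In the full program, a row is only processed when it has the expected four cells, which filters out spacer or malformed rows:
Elements columns = row.select("td, th");
if (columns.size() == 4) {
    // columns hold: name, FCI group, local name, photograph
}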
Extracting Text from Elements
Finally, having isolated elements, we can extract text or other attributes from them.
Get link text of first cell:
String name = columns.get(0).select("a").text().trim();
Use .attr() to read attributes, such as src on image elements.
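The full program does this for the photograph column. One wrinkle: Wikimedia pages typically serve protocol-relative src values (//upload.wikimedia.org/...), which java.net.URL cannot open directly, so Jsoup's absUrl("src") is used to resolve them against the page URL:
Element imgTag = columns.get(3).select("img").first();
String photograph = (imgTag != null) ? imgTag.absUrl("src") : "";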
Conclusion
That covers the key functionality of the provided web scraping example. As you can see, Jsoup's selector API makes it easy to drill into HTML and extract data at will!
The full code is provided below for reference:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.BufferedInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class DogBreedsScraper {

    public static void main(String[] args) {
        // URL of the Wikipedia page
        String url = "https://commons.wikimedia.org/wiki/List_of_dog_breeds";

        // Define a user-agent header to simulate a browser request
        String userAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36";

        try {
            // Send an HTTP GET request to the URL with the headers
            Document doc = Jsoup.connect(url).userAgent(userAgent).get();

            // Find the table with class 'wikitable sortable'
            Element table = doc.select("table.wikitable.sortable").first();

            // Initialize builders to store the extracted data
            StringBuilder names = new StringBuilder();
            StringBuilder groups = new StringBuilder();
            StringBuilder localNames = new StringBuilder();
            StringBuilder photographs = new StringBuilder();

            // Create a folder to save the images
            Path imagesFolder = Paths.get("dog_images");
            Files.createDirectories(imagesFolder);

            // Iterate through rows in the table (skip the header row)
            for (Element row : table.select("tr:gt(0)")) {
                Elements columns = row.select("td, th");
                if (columns.size() == 4) {
                    // Extract data from each column
                    String name = columns.get(0).select("a").text().trim();
                    String group = columns.get(1).text().trim();

                    // Check if the third column contains a span element
                    Element spanTag = columns.get(2).select("span").first();
                    String localName = (spanTag != null) ? spanTag.text().trim() : "";

                    // Check for the existence of an image tag within the fourth column.
                    // absUrl resolves protocol-relative src values (//upload.wikimedia.org/...)
                    // into absolute URLs that java.net.URL can actually open
                    Element imgTag = columns.get(3).select("img").first();
                    String photograph = (imgTag != null) ? imgTag.absUrl("src") : "";

                    // Download the image and save it to the folder
                    if (!photograph.isEmpty()) {
                        // Replace characters that are not safe in file names
                        String safeName = name.replaceAll("[^A-Za-z0-9 _.-]", "_");
                        String imageFilename = Paths.get("dog_images", safeName + ".jpg").toString();
                        downloadImage(photograph, imageFilename);
                    }

                    // Append data to the respective builders
                    names.append("Name: ").append(name).append("\n");
                    groups.append("FCI Group: ").append(group).append("\n");
                    localNames.append("Local Name: ").append(localName).append("\n");
                    photographs.append("Photograph: ").append(photograph).append("\n\n");
                }
            }

            // Print or process the extracted data as needed
            System.out.println(names.toString());
            System.out.println(groups.toString());
            System.out.println(localNames.toString());
            System.out.println(photographs.toString());
        } catch (IOException e) {
            System.err.println("Failed to retrieve the web page. Error: " + e.getMessage());
        }
    }

    private static void downloadImage(String imageUrl, String destinationPath) throws IOException {
        URL url = new URL(imageUrl);
        try (BufferedInputStream in = new BufferedInputStream(url.openStream());
             FileOutputStream fileOutputStream = new FileOutputStream(destinationPath)) {
            byte[] dataBuffer = new byte[1024];
            int bytesRead;
            while ((bytesRead = in.read(dataBuffer, 0, 1024)) != -1) {
                fileOutputStream.write(dataBuffer, 0, bytesRead);
            }
        }
    }
}
In more advanced implementations, you will even need to rotate the User-Agent string so the website can't tell the requests come from the same browser!
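A minimal sketch of User-Agent rotation (the pool shown is tiny and illustrative; real implementations keep a larger list of valid browser strings):
String[] userAgents = {
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.1 Safari/605.1.15"
};
// Pick a different browser identity on each request
String ua = userAgents[new java.util.Random().nextInt(userAgents.length)];
Document doc = Jsoup.connect(url).userAgent(ua).get();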
If we get a little more advanced, you will realize that the server can simply block your IP, ignoring all your other tricks. This is a bummer, and it is where most web crawling projects fail.
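For completeness, Jsoup can route a request through a single proxy (the host and port below are hypothetical), but one fixed IP gets blocked just as fast; what actually helps is rotating across many IPs:
// proxy.example.com:8080 is a hypothetical proxy; substitute your own
Document doc = Jsoup.connect(url)
        .proxy("proxy.example.com", 8080)
        .userAgent(userAgent)
        .get();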
Overcoming IP Blocks
Investing in a private rotating proxy service like Proxies API can, most of the time, make the difference between a successful, headache-free web scraping project that gets the job done consistently and one that never really works.
Plus, with our running offer of 1000 free API calls, you have almost nothing to lose by trying our rotating proxy and comparing notes. It takes only one line of integration, so it's hardly disruptive.
Our rotating proxy server Proxies API provides a simple API that can solve all IP Blocking problems instantly.
With millions of high-speed rotating proxies located all over the world
With automatic IP rotation
With automatic User-Agent-String rotation (which simulates requests from different, valid web browsers and browser versions)
With automatic CAPTCHA-solving technology
Hundreds of our customers have successfully solved the headache of IP blocks with a simple API.
The whole thing can be accessed with a simple API call, like the one below, from any programming language.
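In Java, for instance, the integration is a couple of lines (the endpoint and parameter name below are illustrative; consult the Proxies API documentation for the exact format):
// Illustrative endpoint and parameter name; check the Proxies API docs for the real format.
// Remember to URL-encode the target URL in real code.
String apiUrl = "http://api.proxiesapi.com/?auth_key=YOUR_API_KEY&url=" + url;
Document doc = Jsoup.connect(apiUrl).get();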
We have a running offer of 1000 API calls completely free. Register and get your free API Key here.