Web scraping is the process of extracting data from websites automatically. This is useful for gathering large datasets that would be tedious to collect manually. Here we will go through Java code that scrapes all dog breed images from a Wikipedia page.
This is the page we are talking about: https://commons.wikimedia.org/wiki/List_of_dog_breeds
Prerequisites
To follow along, you'll need:
Jsoup library for Java
Java 8+
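If you build with Maven, the Jsoup dependency looks like this (the version number shown is illustrative; use the latest release):
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.17.2</version> <!-- illustrative; pick the latest release -->
</dependency>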
Let's dive in and see how the scraping is done!
Logic Overview
At a high level, the code:
Connects to the target Wikipedia page
Initializes variables to store the extracted data
Iterates through each row of the dog breed table
Downloads the images and saves them locally
Prints out the extracted information
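Put together, the flow looks roughly like this (a bare sketch; the full, runnable program is listed at the end of the article):
// 1. Connect to the target page
Document doc = Jsoup.connect(url).userAgent(userAgent).get();
// 2. Locate the breed table and initialize storage for the extracted data
Element table = doc.select("table.wikitable.sortable").first();
// 3. Iterate through each data row
for (Element row : table.select("tr:gt(0)")) {
    Elements columns = row.select("td, th");
    // 4. download the row's image, 5. collect the text for printing
}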
Now let's break this down step-by-step.
Understanding the Selectors
While the logic may sound simple, the key part is properly extracting data from the raw HTML of the page. This is done using CSS selectors.
CSS selectors allow targeting specific elements in the HTML document structure. For example, you can select all table rows, links, images etc.
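For instance, with Jsoup's select method (a sketch, assuming doc is an already-parsed Document):
Elements rows  = doc.select("tr");      // every table row in the document
Elements links = doc.select("a[href]"); // every link that has an href attribute
Elements imgs  = doc.select("img");     // every image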
Jsoup implements CSS selectors for parsing the document that is retrieved from the website. Let's see how they are used here:
Selecting the Table
Inspecting the page
If you inspect the page with the Chrome inspect tool, you can see that the data lives in a table element with the classes wikitable and sortable.
First, we need to select the appropriate table element that actually contains the dog breed data.
<table class="wikitable sortable">
<!-- dog breed rows here -->
</table>
The table can be uniquely identified by its class attributes wikitable and sortable. Jsoup allows us to use a CSS selector string to target elements with given classes:
Element table = doc.select("table.wikitable.sortable").first();
Breaking this down:
table - selects HTML <table> tags
.wikitable - class selector, targets elements with the wikitable class
.sortable - class selector, targets elements with the sortable class
first() - returns just the first matching element
So this selector finds the table element with BOTH matching classes, uniquely identifying the dog breed table.
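One caveat: first() returns null when nothing matches, so a guard is prudent in case the page layout ever changes. A sketch using Jsoup's equivalent selectFirst:
Element table = doc.selectFirst("table.wikitable.sortable");
if (table == null) {
    throw new IllegalStateException("Breed table not found - has the page layout changed?");
}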
Skipping Header Row
Now that we have selected the table, we can loop through its rows:
for (Element row : table.select("tr:gt(0)")) {
// extract data from rows
}
Details:
tr selects the table row (<tr>) elements
:gt(0) filters to only the rows at an index GREATER THAN 0, i.e. it skips the header row at index 0
We then iterate over these rows
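If the pseudo-selector feels opaque, an equivalent index-based loop makes the intent explicit (a sketch; :gt(0) does the same thing in a single selector):
Elements rows = table.select("tr");
for (int i = 1; i < rows.size(); i++) { // start at 1 to skip the header row
    Element row = rows.get(i);
    // extract data from row
}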
Getting Row Cells
Next we get the cells within each row:
Elements columns = row.select("td, th");
This selects both <td> and <th> cells in the row using a grouped (comma-separated) selector.
We assign them to an Elements object which acts like an array of elements.
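In the full program, a row is only processed when it has the expected four cells, which filters out spacer or malformed rows:
Elements columns = row.select("td, th");
if (columns.size() == 4) {
    // columns hold: name, FCI group, local name, photograph
}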
Extracting Text from Elements
Finally, having isolated elements, we can extract text or other attributes from them.
Get link text of first cell:
String name = columns.get(0).select("a").text().trim();
Use .attr() to read attributes, such as src on image elements.
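The full program does this for the photograph column. One wrinkle: Wikimedia pages typically serve protocol-relative src values (//upload.wikimedia.org/...), which java.net.URL cannot open directly, so Jsoup's absUrl("src") is used to resolve them against the page URL:
Element imgTag = columns.get(3).select("img").first();
String photograph = (imgTag != null) ? imgTag.absUrl("src") : "";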
Conclusion
That covers the key functionality of the provided web scraping example. As you can see, Jsoup's selector API makes it easy to drill into HTML and extract data at will!
The full code is provided below for reference:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.BufferedInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class DogBreedsScraper {

    public static void main(String[] args) {
        // URL of the Wikipedia page
        String url = "https://commons.wikimedia.org/wiki/List_of_dog_breeds";

        // Define a user-agent header to simulate a browser request
        String userAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36";

        try {
            // Send an HTTP GET request to the URL with the headers
            Document doc = Jsoup.connect(url).userAgent(userAgent).get();

            // Find the table with class 'wikitable sortable'
            Element table = doc.select("table.wikitable.sortable").first();

            // Initialize builders to store the extracted data
            StringBuilder names = new StringBuilder();
            StringBuilder groups = new StringBuilder();
            StringBuilder localNames = new StringBuilder();
            StringBuilder photographs = new StringBuilder();

            // Create a folder to save the images
            Path imagesFolder = Paths.get("dog_images");
            Files.createDirectories(imagesFolder);

            // Iterate through rows in the table (skip the header row)
            for (Element row : table.select("tr:gt(0)")) {
                Elements columns = row.select("td, th");
                if (columns.size() == 4) {
                    // Extract data from each column
                    String name = columns.get(0).select("a").text().trim();
                    String group = columns.get(1).text().trim();

                    // Check if the third column contains a span element
                    Element spanTag = columns.get(2).select("span").first();
                    String localName = (spanTag != null) ? spanTag.text().trim() : "";

                    // Check for the existence of an image tag within the fourth column.
                    // absUrl resolves protocol-relative src values (//upload.wikimedia.org/...)
                    // into absolute URLs that java.net.URL can actually open
                    Element imgTag = columns.get(3).select("img").first();
                    String photograph = (imgTag != null) ? imgTag.absUrl("src") : "";

                    // Download the image and save it to the folder
                    if (!photograph.isEmpty()) {
                        // Replace characters that are not safe in file names
                        String safeName = name.replaceAll("[^A-Za-z0-9 _.-]", "_");
                        String imageFilename = Paths.get("dog_images", safeName + ".jpg").toString();
                        downloadImage(photograph, imageFilename);
                    }

                    // Append data to the respective builders
                    names.append("Name: ").append(name).append("\n");
                    groups.append("FCI Group: ").append(group).append("\n");
                    localNames.append("Local Name: ").append(localName).append("\n");
                    photographs.append("Photograph: ").append(photograph).append("\n\n");
                }
            }

            // Print or process the extracted data as needed
            System.out.println(names.toString());
            System.out.println(groups.toString());
            System.out.println(localNames.toString());
            System.out.println(photographs.toString());
        } catch (IOException e) {
            System.err.println("Failed to retrieve the web page. Error: " + e.getMessage());
        }
    }

    private static void downloadImage(String imageUrl, String destinationPath) throws IOException {
        URL url = new URL(imageUrl);
        try (BufferedInputStream in = new BufferedInputStream(url.openStream());
             FileOutputStream fileOutputStream = new FileOutputStream(destinationPath)) {
            byte[] dataBuffer = new byte[1024];
            int bytesRead;
            while ((bytesRead = in.read(dataBuffer, 0, 1024)) != -1) {
                fileOutputStream.write(dataBuffer, 0, bytesRead);
            }
        }
    }
}
In more advanced implementations, you will even need to rotate the User-Agent string so the website can't tell the requests come from the same browser!
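A minimal sketch of User-Agent rotation (the pool shown is tiny and illustrative; real implementations keep a larger list of valid browser strings):
String[] userAgents = {
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.1 Safari/605.1.15"
};
// Pick a different browser identity on each request
String ua = userAgents[new java.util.Random().nextInt(userAgents.length)];
Document doc = Jsoup.connect(url).userAgent(ua).get();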
If we get a little more advanced, you will realize that the server can simply block your IP, ignoring all your other tricks. This is a bummer, and it is where most web crawling projects fail.
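For completeness, Jsoup can route a request through a single proxy (the host and port below are hypothetical), but one fixed IP gets blocked just as fast; what actually helps is rotating across many IPs:
// proxy.example.com:8080 is a hypothetical proxy; substitute your own
Document doc = Jsoup.connect(url)
        .proxy("proxy.example.com", 8080)
        .userAgent(userAgent)
        .get();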
Overcoming IP Blocks
Investing in a private rotating proxy service like Proxies API can, most of the time, make the difference between a successful, headache-free web scraping project that gets the job done consistently and one that never really works.
Plus, with our running offer of 1000 free API calls, you have almost nothing to lose by trying our rotating proxy and comparing notes. It takes only one line of integration, so it's hardly disruptive.
Our rotating proxy server Proxies API provides a simple API that can solve all IP Blocking problems instantly.
With millions of high-speed rotating proxies located all over the world
With automatic IP rotation
With automatic User-Agent-String rotation (which simulates requests from different, valid web browsers and browser versions)
With automatic CAPTCHA-solving technology
Hundreds of our customers have successfully solved the headache of IP blocks with a simple API.
The whole thing can be accessed with a simple API call, like the one below, from any programming language.
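In Java, for instance, the integration is a couple of lines (the endpoint and parameter name below are illustrative; consult the Proxies API documentation for the exact format):
// Illustrative endpoint and parameter name; check the Proxies API docs for the real format.
// Remember to URL-encode the target URL in real code.
String apiUrl = "http://api.proxiesapi.com/?auth_key=YOUR_API_KEY&url=" + url;
Document doc = Jsoup.connect(apiUrl).get();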
We have a running offer of 1000 API calls completely free. Register and get your free API Key here.