In this beginner-friendly guide, we will go through how to use Scala to scrape all images from a website. We won't go into web scraping theory -- instead we'll focus on understanding and running real-world code that accomplishes this specific task of extracting images.
This is the page we will be working with: the list of dog breeds hosted on Wikimedia Commons, from which we will scrape an image of each breed.
Prerequisites
To follow along, you'll need Java and SBT installed on your system, plus the Jsoup HTML parsing library. Add the Jsoup dependency in your SBT build file:
libraryDependencies += "org.jsoup" % "jsoup" % "1.13.1"
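For context, a complete minimal build.sbt could look like the sketch below. The project name and Scala version are our placeholder choices; we pin Scala 2.12 because the JavaConverters import used later in this guide is deprecated in Scala 2.13, where scala.jdk.CollectionConverters._ is preferred.

name := "dog-breeds-scraper"

scalaVersion := "2.12.18"

libraryDependencies += "org.jsoup" % "jsoup" % "1.13.1"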
Now let's dive into the code!
Setup
We start by importing the required Scala and Java libraries:
import java.io.{File, FileOutputStream}
import org.jsoup.Jsoup
import scala.collection.JavaConverters._
The key one is Jsoup which we'll use for scraping the web page.
Next, we define the URL of the page we want to scrape. The list of dog breeds actually lives on Wikimedia Commons, Wikipedia's media repository:
val url = "https://commons.wikimedia.org/wiki/List_of_dog_breeds"
We also set a user agent header to simulate a real browser request:
val userAgent =
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
Some sites serve different content to, or outright block, clients that don't look like a real browser, which is why we send a browser-like user agent string.
Making the HTTP Request
With the imports and variables defined, we can now use Jsoup to connect to the URL and get the HTML document:
val doc = Jsoup.connect(url).userAgent(userAgent).get()
The userAgent() method sets the custom header we defined earlier.
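In practice, you may also want a request timeout and explicit error handling around the network call; here is a minimal sketch (the 10-second timeout is an arbitrary choice of ours):

import scala.util.{Try, Success, Failure}

val maybeDoc = Try(Jsoup.connect(url).userAgent(userAgent).timeout(10000).get())
maybeDoc match {
  case Success(d) => println(s"Fetched page: ${d.title()}")
  case Failure(e) => println(s"Request failed: ${e.getMessage}")
}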
Selecting the Table Element
Inspecting the page
Inspecting the page with Chrome's developer tools, you can see that the data lives in a table element with the classes wikitable and sortable.
We use a CSS selector to select the table with class names "wikitable" and "sortable":
val table = doc.select("table.wikitable.sortable").first()
The select() method finds all matching elements, and we take the first one using .first().
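Note that first() returns null when nothing matches (for example, if the page layout changes), so a defensive version might wrap the lookup in Option; a quick sketch:

val tableOpt = Option(doc.select("table.wikitable.sortable").first())
tableOpt match {
  case Some(t) => println(s"Found table with ${t.select("tr").size()} rows")
  case None    => println("No matching table found; the page layout may have changed")
}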
Initializing Storage Lists
With the target table element selected, we initialize some mutable Scala lists to store the extracted data:
val names = scala.collection.mutable.ListBuffer.empty[String]
val groups = scala.collection.mutable.ListBuffer.empty[String]
val localNames = scala.collection.mutable.ListBuffer.empty[String]
val photographs = scala.collection.mutable.ListBuffer.empty[String]
One list per data field we will scrape.
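As an aside, a more idiomatic Scala design would gather each row into a single case class instead of four parallel lists; a sketch (the DogBreed class is our own naming, not part of the original code):

case class DogBreed(name: String, group: String, localName: String, photograph: String)

val breeds = scala.collection.mutable.ListBuffer.empty[DogBreed]
// inside the row loop, one append would replace the four:
// breeds += DogBreed(name, group, localName, photograph)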
Creating Image Folder
Since we want to download all images from the page, we need a folder to save them:
val folder = new File("dog_images")
if (!folder.exists()) folder.mkdirs()
This creates a "dog_images" folder if it doesn't already exist.
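Equivalently, the java.nio API can create the directory, including any missing parents, in a single call that is also a no-op if the folder already exists:

import java.nio.file.{Files, Paths}

Files.createDirectories(Paths.get("dog_images"))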
Extracting Data from Table Rows
Now for the most complex part - extracting data within each row of the selected table element.
We loop through the rows, skipping the header:
for (row <- table.select("tr").asScala.drop(1)) {
// extract data from each row
}
Inside the loop, we select all "td" (normal cells) and "th" (header cells) within each row:
val columns = row.select("td, th").asScala.toSeq
And check that 4 columns were found - otherwise skip the row:
if (columns.size == 4) {
// extract data from each column
}
Extracting the Name
To extract the breed name, we select the anchor tag inside the first column:
val name = columns(0).select("a").text().trim
The select("a") finds the
We add the scraped name string to the respective list:
names += name
Extracting the Group
The group name exists directly as text in the second column. We extract it through:
val group = columns(1).text().trim
groups += group
Extracting the Local Name
Next, we check if there is a span element inside the third column:
val spanTag = columns(2).select("span").first()
val localName = if (spanTag != null) spanTag.text().trim else ""
If found, we get the text inside the span; otherwise we fall back to an empty string.
And add to the localNames list:
localNames += localName
Extracting the Image
Finally, we check if there is an img tag inside the fourth column:
val imgTag = columns(3).select("img").first()
val photograph = if (imgTag != null) imgTag.attr("abs:src") else ""
If an image exists, we extract its src attribute to get the image URL. Note the abs: prefix: Wikimedia serves protocol-relative src values (starting with //), and abs:src tells Jsoup to resolve them against the page URL into absolute URLs we can actually download.
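If you would rather not rely on the abs: prefix, the same normalization can be done by hand; a small sketch (the helper name is ours):

// Wikimedia img src values are often protocol-relative, e.g. "//upload.wikimedia.org/..."
def normalizeUrl(src: String): String =
  if (src.startsWith("//")) "https:" + src
  else src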
Downloading and Saving Images
For each found image, we download and save it to our folder:
if (!photograph.isEmpty) {
  val imageBytes = Jsoup.connect(photograph)
    .ignoreContentType(true)
    .execute()
    .bodyAsBytes()
  val imageFile = new File(folder, s"${name}.jpg")
  val fos = new FileOutputStream(imageFile)
  fos.write(imageBytes)
  fos.close()
}
We use the breed name in each image filename.
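One caveat worth adding here: breed names can contain characters that are awkward or invalid in filenames (slashes, quotes, and so on). A sanitizer like the sketch below, applied before building the path, avoids surprises (the helper name is our own):

// Replace anything outside a conservative whitelist with an underscore
def safeFileName(name: String): String =
  name.replaceAll("[^A-Za-z0-9._-]", "_")

// usage: new File(folder, s"${safeFileName(name)}.jpg")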
And add the photograph URL to its list:
photographs += photograph
This whole process repeats for every row, extracting and storing data from each column.
Printing the Extracted Data
Finally, we can print out or process the scraped data as needed:
for (i <- names.indices) {
  println(s"Name: ${names(i)}")
  println(s"FCI Group: ${groups(i)}")
  println(s"Local Name: ${localNames(i)}")
  println(s"Photograph: ${photographs(i)}")
  println()
}
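If you would rather keep the results than print them, writing a CSV file is a natural next step; here is a minimal sketch (the dog_breeds.csv filename is our choice, and the quoting is basic CSV escaping):

import java.io.PrintWriter

val out = new PrintWriter("dog_breeds.csv")
out.println("name,group,local_name,photograph")
for (i <- names.indices) {
  // Quote each field and double any embedded quotes (basic CSV escaping)
  val row = Seq(names(i), groups(i), localNames(i), photographs(i))
    .map(v => "\"" + v.replace("\"", "\"\"") + "\"")
  out.println(row.mkString(","))
}
out.close()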
So in summary, we:
- Made an HTTP request for the web page HTML
- Selected the data table element
- Initialized lists to store extracted data
- Looped through rows, extracting information
- Downloaded and saved images
- Printed/processed scraped data
Full Code
import java.io.{File, FileOutputStream}
import org.jsoup.Jsoup
import scala.collection.JavaConverters._
object DogBreedsScraper {
  def main(args: Array[String]): Unit = {
    // URL of the Wikimedia Commons page
    val url = "https://commons.wikimedia.org/wiki/List_of_dog_breeds"

    // Define a user-agent header to simulate a browser request
    val userAgent =
      "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"

    // Send an HTTP GET request to the URL with the headers
    val doc = Jsoup.connect(url).userAgent(userAgent).get()

    // Find the table with class 'wikitable sortable'
    val table = doc.select("table.wikitable.sortable").first()

    // Initialize lists to store the data
    val names = scala.collection.mutable.ListBuffer.empty[String]
    val groups = scala.collection.mutable.ListBuffer.empty[String]
    val localNames = scala.collection.mutable.ListBuffer.empty[String]
    val photographs = scala.collection.mutable.ListBuffer.empty[String]

    // Create a folder to save the images
    val folder = new File("dog_images")
    if (!folder.exists()) folder.mkdirs()

    // Iterate through rows in the table (skip the header row)
    for (row <- table.select("tr").asScala.drop(1)) {
      val columns = row.select("td, th").asScala.toSeq
      if (columns.size == 4) {
        // Extract data from each column
        val name = columns(0).select("a").text().trim
        val group = columns(1).text().trim

        // Check if the third column contains a span element
        val spanTag = columns(2).select("span").first()
        val localName = if (spanTag != null) spanTag.text().trim else ""

        // Check for an image tag within the fourth column;
        // abs:src resolves protocol-relative URLs into absolute ones
        val imgTag = columns(3).select("img").first()
        val photograph = if (imgTag != null) imgTag.attr("abs:src") else ""

        // Download the image and save it to the folder
        if (!photograph.isEmpty) {
          val imageBytes = Jsoup.connect(photograph).ignoreContentType(true).execute().bodyAsBytes()
          val imageFile = new File(folder, s"${name}.jpg")
          val fos = new FileOutputStream(imageFile)
          fos.write(imageBytes)
          fos.close()
        }

        // Append data to respective lists
        names += name
        groups += group
        localNames += localName
        photographs += photograph
      }
    }

    // Print or process the extracted data as needed
    for (i <- names.indices) {
      println(s"Name: ${names(i)}")
      println(s"FCI Group: ${groups(i)}")
      println(s"Local Name: ${localNames(i)}")
      println(s"Photograph: ${photographs(i)}")
      println()
    }
  }
}
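To run the scraper, save the file as src/main/scala/DogBreedsScraper.scala (the standard sbt layout) and launch it from the project root:

sbt run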
In more advanced implementations you may even need to rotate the User-Agent string so the website can't tell it's the same browser!
Get a little more advanced, though, and you will realize that the server can simply block your IP address, ignoring all your other tricks. This is a bummer, and it is where most web crawling projects fail.
Overcoming IP Blocks
Investing in a private rotating proxy service like Proxies API can often make the difference between a successful, headache-free web scraping project that gets the job done consistently and one that never really works.
Plus, with the 1000 free API calls we are currently offering, you have almost nothing to lose by trying our rotating proxy and comparing notes. It only takes one line of integration, so it's hardly disruptive.
Our rotating proxy server Proxies API provides a simple API that can solve all IP Blocking problems instantly.
Hundreds of our customers have successfully solved the headache of IP blocks with a simple API.
The whole thing can be accessed by a simple API like below in any programming language.
curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"
We have a running offer of 1000 API calls completely free. Register and get your free API Key here.