In this article, we will learn how to use Scala with the scalaj-http and rucola libraries to download all the images from a Wikipedia page.
---
Overview
The goal is to extract the names, breed groups, local names, and image URLs for all dog breeds listed on this Wikipedia page. We will store the image URLs, download the images and save them to a local folder.
Here are the key steps we will cover:
- Import required libraries
- Send HTTP request to fetch the Wikipedia page
- Parse the page HTML using rucola
- Find the table with dog breed data using a CSS selector
- Iterate through the table rows
- Extract data from each column
- Download images and save locally
- Print/process extracted data
Let's go through each of these steps in detail.
Imports
We need these libraries:
import scalaj.http._
import rucola.expr._
import scala.xml.XML
import scala.collection.mutable.ArrayBuffer
import java.io._
Send HTTP Request
To download the web page:
val url = "https://commons.wikimedia.org/wiki/List_of_dog_breeds"
val response = Http(url)
  .header("User-Agent", "ScalaScraper")
  .asString
response.code match {
  case 200 =>
    // Parse HTML
  case _ =>
    println(s"Error fetching $url")
}
We make a GET request and provide a custom user-agent header.
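For reference, the same GET request can be built with the JDK's own HTTP client (Java 11+), with no third-party dependency. This is a sketch, not the article's code; constructing the request does not touch the network, while the commented-out `send` call does.

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

// Build the same GET request with a custom User-Agent header.
val request = HttpRequest
  .newBuilder(URI.create("https://commons.wikimedia.org/wiki/List_of_dog_breeds"))
  .header("User-Agent", "ScalaScraper")
  .GET()
  .build()

// To actually fetch the page (requires network access):
// val body = HttpClient.newHttpClient()
//   .send(request, HttpResponse.BodyHandlers.ofString())
//   .body()
```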
Parse HTML
To parse the HTML:
val html = XML.loadString(response.body)
The parsed document can then be queried with selectors.
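Note that XML.loadString only accepts well-formed XML, while real Wikipedia pages are HTML and generally need an HTML-tolerant parser such as rucola. As an illustration, here is how a small, well-formed table fragment parses with plain scala-xml (assuming the scala-xml module is on the classpath):

```scala
import scala.xml.XML

// A tiny well-formed stand-in for the breed table.
val doc = XML.loadString(
  """<table class="wikitable sortable">
    |<tr><th>Name</th><th>Group</th></tr>
    |<tr><td><a href="/wiki/Affenpinscher">Affenpinscher</a></td><td>Toy</td></tr>
    |</table>""".stripMargin)

val rows = (doc \\ "tr").drop(1)         // skip the header row
val firstBreed = (rows.head \\ "a").text // text of the first breed link
```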
Find Breed Table
We use a CSS selector to find the table element:
val table = html >> element("table.wikitable.sortable")
This selects the <table> tag with the required CSS classes.
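Conceptually, a selector like "table.wikitable.sortable" matches an element whose class attribute contains every listed class. The helper below is purely illustrative (rucola presumably implements this internally; `hasClasses` is not part of any library):

```scala
// True when the element's class attribute, split on whitespace,
// contains every class named in the selector.
def hasClasses(classAttr: String, wanted: Set[String]): Boolean =
  wanted.subsetOf(classAttr.split("\\s+").toSet)
```

So an element with class="wikitable sortable" matches, while one with only class="wikitable" does not.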
Iterate Through Rows
We loop through the rows:
(table >> element("tr")).drop(1).foreach { row =>
  // Extract data
}
We drop the first row (the header) and iterate through the remaining <tr> elements.
Extract Column Data
Inside the loop, we get the column data:
val cells = row >> elements("td", "th")
val name = (cells(0) >> element("a")).text
val group = cells(1).text
val localName = (cells(2) >> element("span")).text.getOrElse("")
val img = cells(3) >> element("img")
val photograph = img.attr("src").getOrElse("")
We use the cell selectors to pull out the breed name, breed group, local name, and image URL.
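Not every row necessarily has all four cells (some breeds lack an image), and indexing past the end throws. A hypothetical safe accessor, simplified here to work over plain strings rather than parsed elements:

```scala
// Returns the cell at index i, or "" when the row is too short.
// Seq.lift turns out-of-range access into None instead of an exception.
def cellText(cells: Seq[String], i: Int): String =
  cells.lift(i).getOrElse("")
```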
Download Images
To download and save images:
if (photograph.nonEmpty) {
  val imageData = Http(photograph).asBytes
  val file = new File(s"dog_images/$name.jpg")
  val out = new FileOutputStream(file)
  out.write(imageData.body)
  out.close()
}
We reuse the HTTP client and save the image bytes to a file.
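One practical wrinkle: Wikipedia image src attributes are often protocol-relative ("//upload.wikimedia.org/..."), which an HTTP client cannot fetch as-is. A small normalising helper (an assumption on our part, not in the article's code), alongside a java.nio alternative for writing the bytes:

```scala
import java.nio.file.{Files, Paths}

// Prepend a scheme to protocol-relative URLs; leave absolute URLs untouched.
def absoluteUrl(src: String): String =
  if (src.startsWith("//")) "https:" + src else src

// Writing the downloaded bytes with java.nio instead of FileOutputStream:
// Files.createDirectories(Paths.get("dog_images"))
// Files.write(Paths.get(s"dog_images/$name.jpg"), imageData.body)
```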
Store Extracted Data
We store the extracted data:
names += name
groups += group
localNames += localName
photographs += photograph
The arrays can then be processed as needed.
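One way to consume the four parallel buffers afterwards is to zip them into one record per breed. This is a sketch with sample data; the Breed case class is our own addition, not part of the article's code:

```scala
import scala.collection.mutable.ArrayBuffer

case class Breed(name: String, group: String, localName: String, photo: String)

// Sample contents standing in for the scraped data.
val names      = ArrayBuffer("Affenpinscher")
val groups     = ArrayBuffer("Toy")
val localNames = ArrayBuffer("Affenpinscher")
val photos     = ArrayBuffer("//upload.wikimedia.org/a.jpg")

// Combine the parallel buffers index-by-index into records.
val breeds = names.indices.map { i =>
  Breed(names(i), groups(i), localNames(i), photos(i))
}
```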
And that's it! Here is the full code:
// Imports
import scalaj.http._
import rucola.expr._
import scala.xml.XML
import scala.collection.mutable.ArrayBuffer
import java.io._

// Arrays to store data
val names = ArrayBuffer[String]()
val groups = ArrayBuffer[String]()
val localNames = ArrayBuffer[String]()
val photographs = ArrayBuffer[String]()

// Send HTTP request
val url = "https://commons.wikimedia.org/wiki/List_of_dog_breeds"
val response = Http(url)
  .header("User-Agent", "ScalaScraper")
  .asString

response.code match {
  case 200 =>
    // Parse HTML
    val html = XML.loadString(response.body)

    // Find table
    val table = html >> element("table.wikitable.sortable")

    // Iterate rows
    (table >> element("tr")).drop(1).foreach { row =>
      // Get cells
      val cells = row >> elements("td", "th")

      // Extract data
      val name = (cells(0) >> element("a")).text
      val group = cells(1).text
      val localName = (cells(2) >> element("span")).text.getOrElse("")
      val img = cells(3) >> element("img")
      val photograph = img.attr("src").getOrElse("")

      // Download image
      if (photograph.nonEmpty) {
        val imageData = Http(photograph).asBytes
        val file = new File(s"dog_images/$name.jpg")
        val out = new FileOutputStream(file)
        out.write(imageData.body)
        out.close()
      }

      // Store data
      names += name
      groups += group
      localNames += localName
      photographs += photograph
    }
  case _ =>
    println(s"Error fetching $url")
}
This provides a complete Scala solution using scalaj-http and rucola to scrape data and images from an HTML table. The same approach applies to many other websites.