This article is a practical, step-by-step guide to scraping all images from a website using real Kotlin code. We focus on the image-extraction code itself rather than web scraping basics or the Kotlin language.
This is the page we will be working with: the Wikimedia Commons list of dog breeds, from which we will scrape the breed images.
Importing Libraries
We first import the libraries needed to send HTTP requests and parse HTML:
import org.jsoup.Jsoup
import org.jsoup.nodes.Document
import org.jsoup.nodes.Element
Jsoup will be used to connect to the web page, send a request, and parse the HTML document.
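Jsoup is a third-party library, so it needs to be declared as a build dependency first. A minimal sketch for a Gradle (Kotlin DSL) build; the version shown is only an example, so pin whatever recent release your project uses:

```kotlin
dependencies {
    // Jsoup handles both the HTTP request and the HTML parsing below.
    // 1.17.2 is an example version, not a requirement.
    implementation("org.jsoup:jsoup:1.17.2")
}
```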
Defining Key Variables
Next we define the URL of the Wikipedia page we want to scrape:
val url = "https://commons.wikimedia.org/wiki/List_of_dog_breeds"
We also define a user agent header to simulate a real browser request:
val userAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
Sending Request and Parsing HTML
We use Jsoup to connect to the URL and send a GET request with the user agent specified:
val doc = Jsoup.connect(url).userAgent(userAgent).get()
This parses and loads the full HTML document into the `doc` variable as a Jsoup `Document` object.
Selecting Target Table
Inspecting the page
Using Chrome's inspect tool, you can see that the data lives in a table element with the classes wikitable and sortable.
We next select the table element containing the dog breed data we want to scrape:
val table = doc.select("table.wikitable.sortable").first()
This uses a CSS selector to target the table uniquely identified by the `wikitable` and `sortable` classes. Since `first()` returns null when nothing matches, the result should be null-checked before use.
Initializing Storage Lists
We initialize empty lists to store the data extracted from the table:
val names = mutableListOf<String>()
val groups = mutableListOf<String>()
val localNames = mutableListOf<String>()
val photographs = mutableListOf<String>()
Creating Image Folder
Since we want to download the dog images, we create a folder to save them:
val imageFolder = File("dog_images")
imageFolder.mkdirs()
The `mkdirs()` call creates the `dog_images` folder, including any missing parent directories, if it does not already exist.
Extracting Data from Table Rows
This is where the main data extraction occurs from the HTML.
We loop through each row, skipping the header:
for (row: Element in table.select("tr").drop(1)) {
// extract data from each row
}
The key part is using selectors to extract elements from each column:
val columns = row.select("td, th")
val name = columns[0].select("a").text().trim()
val group = columns[1].text().trim()
val spanTag = columns[2].select("span").first()
val localName = spanTag?.text()?.trim() ?: ""
val imgTag = columns[3].select("img").first()
val photograph = imgTag?.attr("abs:src") ?: "" // abs: resolves relative links to absolute URLs
This shows how to index into the columns, select nested elements (a, span, img), and fall back to empty strings when an element is missing. Note that Wikimedia serves protocol-relative src values (starting with //), which is why Jsoup's abs: attribute prefix is useful for resolving an absolute URL before downloading. The extracted values are then added to the previously defined storage lists.
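If you prefer not to rely on Jsoup's attribute resolution, the same normalization can be done by hand. The helper below is a hypothetical sketch (the host used as a fallback for site-relative paths is an assumption based on the page we are scraping):

```kotlin
// Hypothetical helper: Wikimedia image src attributes are often
// protocol-relative ("//upload.wikimedia.org/..."), which
// java.net.URL rejects. This prefixes a scheme before downloading.
fun normalizeImageUrl(src: String): String = when {
    src.startsWith("//") -> "https:$src"                        // protocol-relative
    src.startsWith("/")  -> "https://commons.wikimedia.org$src" // site-relative (assumed host)
    else                 -> src                                 // already absolute
}

fun main() {
    println(normalizeImageUrl("//upload.wikimedia.org/a.jpg"))
    // prints https://upload.wikimedia.org/a.jpg
}
```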
Downloading and Saving Images
For each non-blank image link extracted, we download and save the image:
if (photograph.isNotBlank()) {
val imageFileName = File(imageFolder, "$name.jpg")
downloadImage(photograph, imageFileName)
}
The `downloadImage()` helper function, defined in the full code below, opens a stream to the image URL and copies its bytes to the destination file.
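One caveat: breed names can contain characters that are not valid in file names (slashes, colons, question marks). A small hypothetical helper, safeFileName, could be applied to name before building the File:

```kotlin
// Hypothetical helper: replace anything outside a conservative
// whitelist of filename-safe characters with an underscore.
fun safeFileName(name: String): String =
    name.trim().replace(Regex("[^A-Za-z0-9 ._-]"), "_")

fun main() {
    println(safeFileName("Shih Tzu / Lion Dog")) // prints Shih Tzu _ Lion Dog
}
```

In the snippet above, File(imageFolder, safeFileName(name) + ".jpg") would then always be a valid path.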
Printing Extracted Data
Finally, we can print out or process the extracted data now available in the lists:
for (i in names.indices) {
println("Name: ${names[i]}")
println("Group: ${groups[i]}")
// etc
}
This allows us to work with each piece of scraped data from the site.
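Four parallel lists work, but a single list of records is usually easier to pass around. A sketch using a data class (Breed and zipBreeds are illustrative names, not part of the original code):

```kotlin
// Bundle the four parallel lists into one record type per breed.
data class Breed(
    val name: String,
    val group: String,
    val localName: String,
    val photograph: String
)

// Zip the parallel lists, index by index, into a list of records
fun zipBreeds(
    names: List<String>,
    groups: List<String>,
    localNames: List<String>,
    photographs: List<String>
): List<Breed> =
    names.indices.map { i ->
        Breed(names[i], groups[i], localNames[i], photographs[i])
    }

fun main() {
    val breeds = zipBreeds(
        listOf("Affenpinscher", "Afghan Hound"),
        listOf("Pinschers and Schnauzers", "Sighthounds"),
        listOf("Affenpinscher", "Tāžī Spay"),
        listOf("a.jpg", "b.jpg")
    )
    breeds.forEach { println("${it.name} (${it.group})") }
}
```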
Full Code
Here is the full code for reference:
import org.jsoup.Jsoup
import org.jsoup.nodes.Element
import java.io.File
import java.io.IOException
import java.io.InputStream
import java.net.URL
import java.nio.file.Files
import java.nio.file.StandardCopyOption
fun main() {
// URL of the Wikipedia page
val url = "https://commons.wikimedia.org/wiki/List_of_dog_breeds"
// Define a user-agent header to simulate a browser request
val userAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
// Send an HTTP GET request to the URL with the headers
val doc = Jsoup.connect(url).userAgent(userAgent).get()
// Find the table with class 'wikitable sortable'
val table = doc.select("table.wikitable.sortable").first()
    ?: error("Could not find the breed table on the page")
// Initialize lists to store the data
val names = mutableListOf<String>()
val groups = mutableListOf<String>()
val localNames = mutableListOf<String>()
val photographs = mutableListOf<String>()
// Create a folder to save the images
val imageFolder = File("dog_images")
imageFolder.mkdirs()
// Iterate through rows in the table (skip the header row)
for (row: Element in table.select("tr").drop(1)) {
val columns = row.select("td, th")
if (columns.size == 4) {
// Extract data from each column
val name = columns[0].select("a").text().trim()
val group = columns[1].text().trim()
// Check if the third column contains a span element
val spanTag = columns[2].select("span").first()
val localName = spanTag?.text()?.trim() ?: ""
// Check for the existence of an image tag within the fourth column
val imgTag = columns[3].select("img").first()
val photograph = imgTag?.attr("abs:src") ?: "" // abs: resolves protocol-relative links
// Download the image and save it to the folder
if (photograph.isNotBlank()) {
val imageFileName = File(imageFolder, "$name.jpg")
downloadImage(photograph, imageFileName)
}
// Append data to respective lists
names.add(name)
groups.add(group)
localNames.add(localName)
photographs.add(photograph)
}
}
// Print or process the extracted data as needed
for (i in names.indices) {
println("Name: ${names[i]}")
println("FCI Group: ${groups[i]}")
println("Local Name: ${localNames[i]}")
println("Photograph: ${photographs[i]}")
println()
}
}
@Throws(IOException::class)
fun downloadImage(imageUrl: String, destination: File) {
    // use {} guarantees the stream is closed even if the copy fails
    URL(imageUrl).openStream().use { input ->
        Files.copy(input, destination.toPath(), StandardCopyOption.REPLACE_EXISTING)
    }
}
The key concepts covered: sending an HTTP GET request with a custom user-agent header, selecting the target table with CSS selectors, null-safe extraction of text and attributes, and downloading each image to a local folder.
In more advanced implementations you may even need to rotate the User-Agent string so the website can't tell it's the same browser making every request.
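A minimal sketch of user-agent rotation: keep a pool of strings and pick one per request. The entries below are illustrative examples of real browser identities, not a canonical list:

```kotlin
// Illustrative pool of user-agent strings; real projects maintain
// a larger, regularly refreshed list.
val userAgents = listOf(
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.1 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"
)

// Pick a different user agent for each request
fun randomUserAgent(): String = userAgents.random()
```

Each Jsoup.connect(url).userAgent(randomUserAgent()).get() call would then present a different browser identity.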
Go a little further, though, and you will find that the server can simply block your IP address, ignoring all your other tricks. This is where most web crawling projects fail.
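Before reaching for proxies, a retry helper with exponential backoff is worth having: it smooths over transient failures and rate limiting, though it cannot defeat a hard IP ban. A sketch (the signature and defaults here are our own, not from any library):

```kotlin
// Retry a block up to `times` attempts, doubling the wait between
// failures. The final attempt lets any exception propagate.
fun <T> retry(times: Int = 3, initialDelayMs: Long = 500, block: () -> T): T {
    var delay = initialDelayMs
    repeat(times - 1) {
        try {
            return block()
        } catch (e: Exception) {
            // transient failure: wait, then double the delay
            Thread.sleep(delay)
            delay *= 2
        }
    }
    return block() // last attempt; exceptions now reach the caller
}

fun main() {
    val page = retry(times = 3, initialDelayMs = 200) {
        // e.g. Jsoup.connect(url).userAgent(userAgent).get()
        "fetched"
    }
    println(page)
}
```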
Overcoming IP Blocks
Investing in a private rotating proxy service like Proxies API can often make the difference between a successful, headache-free web scraping project that gets the job done consistently and one that never really works.
Plus, with the current offer of 1000 free API calls, you have almost nothing to lose by trying our rotating proxy and comparing notes. It takes only one line of integration, so it's hardly disruptive.
Our rotating proxy server Proxies API provides a simple API that can solve all IP Blocking problems instantly.
Hundreds of our customers have successfully solved the headache of IP blocks with a simple API.
The whole thing can be accessed by a simple API like below in any programming language.
curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"
We have a running offer of 1000 API calls completely free. Register and get your free API Key here.