Wikipedia contains a wealth of valuable information spanning all topics imaginable. Much of this info is tucked away in tables, infoboxes, and other structured data on pages. As developers and data enthusiasts, having access to this data can enable all sorts of useful applications.
That's where web scraping comes into play. Web scraping refers to the technique of programmatically extracting data from websites using tools that parse and select information out of HTML. In this post, we'll walk through a hands-on example of scraping Wikipedia to pull together a dataset of US presidents.
Here is the table we're after: the large sortable list-of-presidents table on that Wikipedia page.
Setting the Stage with Jsoup
For this tutorial we'll use Jsoup, a Java library designed for working with real-world HTML and scraping data out of it.
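If you're following along with Gradle, the only dependency you need is the jsoup artifact itself. A minimal build.gradle.kts snippet might look like this (the version number is just an example; use whatever the latest release is):
dependencies {
    implementation("org.jsoup:jsoup:1.17.2")
}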
First we'll define some imports to pull in the Jsoup functionality:
import org.jsoup.Jsoup
import org.jsoup.nodes.Document
import org.jsoup.nodes.Element
Jsoup contains tools for fetching web pages, traversing/selecting DOM elements, and extracting text or other data.
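As a quick taste, here is a tiny self-contained example that parses an HTML snippet from a string and pulls some text back out of it:
// Parse an in-memory HTML snippet and select an element from it
val doc = Jsoup.parse("<html><body><p class=\"greeting\">Hello, Wikipedia!</p></body></html>")
println(doc.select("p.greeting").text()) // prints: Hello, Wikipedia!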
Making the Initial Request
Let's define the URL of the Wikipedia page we want to scrape:
val url = "<https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States>"
This page contains a large structured table with data on each American president.
To fetch the contents of that page, we use Jsoup's connect() method and call execute(), which returns a response object we can inspect before parsing:
val response = Jsoup.connect(url).execute()
One thing I always recommend is setting a user agent header. This mimics a real web browser so the request doesn't look like it came from an anonymous bot:
val headers = mapOf("User-Agent" to "Mozilla/5.0 ...")
val response = Jsoup.connect(url).headers(headers).execute()
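If the user agent is the only header you care about, Jsoup also has a dedicated userAgent() method that does the same thing:
val response = Jsoup.connect(url)
    .userAgent("Mozilla/5.0 ...")
    .execute()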
Checking for Success
We can then check the status code of the response to confirm it succeeded:
if (response.statusCode() == 200) {
    // scraping logic here
} else {
    println("Failed, status code: ${response.statusCode()}")
}
Server responses are full of helpful metadata. Get familiar with status codes to better handle errors.
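For example, besides the status code, Jsoup's response object lets you peek at a few other pieces of metadata before you commit to parsing:
println(response.statusMessage()) // e.g. "OK"
println(response.contentType())   // e.g. "text/html; charset=UTF-8"
println(response.charset())       // e.g. "UTF-8"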
Parsing the Page with Jsoup
Now that we have the raw HTML, we need to parse it into a structured Document object that we can query:
val soup: Document = response.parse()
This Document represents the entire parsed page, modeled after the DOM tree structure.
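To get a feel for that tree, you can ask the Document for high-level pieces of the page, for example:
println(soup.title())                // the text of the page's <title> tag
println(soup.body().children().size) // number of top-level elements inside <body>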
Selecting Elements
Inspecting the page
When we inspect the page, we can see that the table has the classes wikitable and sortable.
We can use CSS-style selectors to locate specific elements in that document:
val table: Element? = soup.select("table.wikitable.sortable").first()
This selects the first table element that has both the wikitable and sortable classes.
Jsoup selectors are very powerful - read more about selector syntax.
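A few more selector patterns you might find handy on a page like this (purely illustrative, not needed for this scraper):
soup.select("table.wikitable tr")      // every row inside any wikitable
soup.select("a[href^=/wiki/]")         // links whose href starts with /wiki/
soup.selectFirst("table.wikitable th") // the first header cell, or null if none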
Extracting the Data
Finally, we can iterate through the rows and cells of this table to extract our president data:
table?.select("tr")?.drop(1)?.forEach { row ->
val columns = row.select("td, th")
val row_data = columns.map { col ->
col.text()
}
println(row_data) // prints array of cell data
}
And there you have it - we now have the structured data scraped from Wikipedia! Jsoup handles all the heavy lifting.
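If you want to keep the rows instead of just printing them, a simple next step is to collect them into a list (as the full listing below does) and dump them to a CSV file. A minimal sketch, assuming the rows live in a List<List<String>> called data and using a hypothetical output file name:
import java.io.File

// Write each row as one comma-separated line, quoting every cell
File("presidents.csv").writeText(
    data.joinToString("\n") { row ->
        row.joinToString(",") { cell -> "\"" + cell.replace("\"", "\"\"") + "\"" }
    }
)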
Potential Challenges
There are always practical challenges with real-world scraping: rate limits, page markup that changes without notice, and outright IP blocks. The example here sidesteps some of those issues, but they come up frequently, and we have a whole separate guide on handling those scraping dilemmas.
Now that you have the foundations, you can scrape most semi-structured data off sites like Wikipedia. Always scrape responsibly: throttle your requests, cache pages you have already fetched, and obey robots.txt.
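Throttling can be as simple as sleeping between requests. A minimal sketch, assuming you were fetching several pages in a loop (the second URL and the two-second delay are just illustrative):
val pages = listOf(
    "https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States",
    "https://en.wikipedia.org/wiki/List_of_vice_presidents_of_the_United_States"
)
for (page in pages) {
    val doc = Jsoup.connect(page).userAgent("Mozilla/5.0 ...").get()
    println(doc.title())
    Thread.sleep(2000) // pause two seconds before the next request
}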
Check out the full code for this Wikipedia scraper below. Hopefully this gives you a template to start extracting your own datasets!
import org.jsoup.Jsoup
import org.jsoup.nodes.Document
import org.jsoup.nodes.Element

fun main() {
    // Define the URL of the Wikipedia page
    val url = "https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States"

    // Define a user-agent header to simulate a browser request
    val headers = mapOf("User-Agent" to "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36")

    // Send an HTTP GET request to the URL with the headers
    val response = Jsoup.connect(url).headers(headers).execute()

    // Check if the request was successful (status code 200)
    if (response.statusCode() == 200) {
        // Parse the HTML content of the page into a Document
        val soup: Document = response.parse()

        // Find the table with the wikitable and sortable classes
        val table: Element? = soup.select("table.wikitable.sortable").first()

        // Collect the table data, one list of cell texts per row
        val data: MutableList<List<String>> = mutableListOf()

        // Iterate through the rows of the table, skipping the header row
        table?.select("tr")?.drop(1)?.forEach { row ->
            val columns = row.select("td, th")
            // Extract the text from each cell and append it to the data list
            val rowData = columns.map { col -> col.text() }
            data.add(rowData)
        }

        // Print the scraped data for all presidents
        for (presidentData in data) {
            // Skip rows that don't have the expected number of cells
            if (presidentData.size < 8) continue
            println("President Data:")
            println("Number: ${presidentData[0]}")
            println("Name: ${presidentData[2]}")
            println("Term: ${presidentData[3]}")
            println("Party: ${presidentData[5]}")
            println("Election: ${presidentData[6]}")
            println("Vice President: ${presidentData[7]}")
            println()
        }
    } else {
        println("Failed to retrieve the web page. Status code: ${response.statusCode()}")
    }
}
In more advanced implementations you may even need to rotate the User-Agent string so the website can't tell it's the same browser making every request!
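A rough sketch of that idea is to keep a small pool of user-agent strings and pick one at random for each request (the strings below are truncated placeholders):
val userAgents = listOf(
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ...",
    "Mozilla/5.0 (X11; Linux x86_64) ..."
)
val response = Jsoup.connect(url)
    .userAgent(userAgents.random()) // a different browser string on every request
    .execute()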
If you get a little more advanced, you will realize that the server can simply block your IP address, ignoring all your other tricks. This is a bummer, and it is where most web crawling projects fail.
Overcoming IP Blocks
Investing in a private rotating proxy service like Proxies API is, more often than not, the difference between a headache-free scraping project that gets the job done consistently and one that never really works.
Plus, with our current offer of 1000 free API calls, you have almost nothing to lose by using our rotating proxy and comparing notes. It only takes one line of integration, so it's hardly disruptive.
Our rotating proxy server Proxies API provides a simple API that can solve all IP Blocking problems instantly.
Hundreds of our customers have successfully solved the headache of IP blocks with a simple API.
The whole thing can be accessed through a simple API, like the call below, from any programming language.
curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"
We have a running offer of 1000 API calls completely free. Register and get your free API Key here.