Scraping Multiple Pages in Scala with HTTP Client and XML Libraries

Web scraping is useful to programmatically extract data from websites. Often you need to scrape multiple pages from a site to gather complete information. In this article, we'll see how to scrape multiple pages in Scala using the HTTP client and XML libraries.

Prerequisites

To follow along, you'll need:

Basic Scala knowledge

SBT build tool

HTTP client and XML libraries:

libraryDependencies += "org.scala-lang.modules" %% "scala-xml" % "2.0.0"
libraryDependencies += "org.scalaj" %% "scalaj-http" % "2.4.2"

Import Libraries

We'll need the following imports:

import scalaj.http._
import scala.xml._

Define Base URL

—

We'll scrape a blog - https://copyblogger.com/blog/. The page URLs follow a pattern:

<https://copyblogger.com/blog/>
<https://copyblogger.com/blog/page/2/>
<https://copyblogger.com/blog/page/3/>

Let's define the base URL pattern:

val baseUrl = "<https://copyblogger.com/blog/page/%d/>"

The %d allows us to insert the page number.

Specify Number of Pages

Next, we'll specify how many pages to scrape. Let's scrape the first 5 pages:

val numPages = 5

Loop Through Pages

We can now loop from 1 to numPages and construct the URL for each page:

for (page <- 1 to numPages) {

  // Construct page URL
  val url = baseUrl.format(page)

  // Code to scrape each page

}

Send Request and Parse HTML

Inside the loop, we'll send a GET request and parse the HTML:

val response = Http(url).asString

val html = XML.loadString(response.body)

This gives us an XML node to extract data from.

Extract Data

Now within the loop we can use XPath expressions to extract data from each page:

val articles = html \\ "article"

for (article <- articles) {

  // Extract data from article

  val title = (article \\ "h2[@class='entry-title']") text
  val url = (article \\ "a[@class='entry-title-link']/@href") text
  val author = (article \\ "div[@class='post-author']/a") text

  // Print extracted data
  println(s"Title: $title")
  println(s"URL: $url")
  println(s"Author: $author")

}

Full Code

Our full code to scrape 5 pages is:

import scalaj.http._
import scala.xml._

object WebScraper extends App {

  val baseUrl = "https://copyblogger.com/blog/page/%d/"
  val numPages = 5

  for (page <- 1 to numPages) {

    val url = baseUrl.format(page)

    val response = Http(url).asString

    val html = XML.loadString(response.body)

    val articles = html \ "article"

    for (article <- articles) {
    
      val title = (article \ "h2[@class='entry-title']") text
      val url = (article \ "a[@class='entry-title-link']/@href") text
      val author = (article \ "div[@class='post-author']/a") text
      
      val categories = (article \ "div[@class='entry-categories']/a") map(_.text)
      
      println(s"Title: $title")
      println(s"URL: $url")  
      println(s"Author: $author")
      println(s"Categories: $categories")
      
    }

  }

}

This allows us to scrape and extract data from multiple pages sequentially. The code can be extended to scrape any number of pages.

Summary

Use a base URL pattern with %d placeholder

Loop through pages with for expression

Construct each page URL

Send request and parse HTML

Extract data using XPath expressions

Print or store scraped data

Web scraping enables collecting large datasets programmatically. With the techniques here, you can scrape and extract information from multiple pages of a website in Scala.

While these examples are great for learning, scraping production-level sites can pose challenges like CAPTCHAs, IP blocks, and bot detection. Rotating proxies and automated CAPTCHA solving can help.

Proxies API offers a simple API for rendering pages with built-in proxy rotation, CAPTCHA solving, and evasion of IP blocks. You can fetch rendered pages in any language without configuring browsers or proxies yourself.

This allows scraping at scale without headaches of IP blocks. Proxies API has a free tier to get started. Check out the API and sign up for an API key to supercharge your web scraping.

With the power of Proxies API combined with Python libraries like Beautiful Soup, you can scrape data at scale without getting blocked.

Scraping Multiple Pages in Scala with HTTP Client and XML Libraries

Prerequisites

Import Libraries

Define Base URL

Specify Number of Pages

Loop Through Pages

Send Request and Parse HTML

Extract Data

Full Code

Summary

Browse by tags:

Browse by language:

The easiest way to do Web Scraping

Scraping Multiple Pages in Scala with HTTP Client and XML Libraries

Prerequisites

Import Libraries

Define Base URL

Specify Number of Pages

Loop Through Pages

Send Request and Parse HTML

Extract Data

Full Code

Summary

The easiest way to do Web Scraping

Don't leave just yet!