Web scraping lets you programmatically extract data from websites, and you often need to scrape several pages of a site to gather complete information. In this article, we'll see how to scrape multiple pages in Rust using the reqwest and scraper crates.
Prerequisites
To follow along, you'll need a working Rust toolchain and the following dependencies in your Cargo.toml (tokio supplies the async runtime that reqwest's async API needs):
[dependencies]
reqwest = "0.11"
scraper = "0.13"
tokio = { version = "1", features = ["full"] }
Import Crates
We'll need the following crates:
use reqwest;
use scraper::{Html, Selector};
Define Base URL
We'll scrape the Copyblogger blog, whose paginated URLs follow a simple pattern:
https://copyblogger.com/blog/
https://copyblogger.com/blog/page/2/
https://copyblogger.com/blog/page/3/
Let's define the base URL pattern:
let base_url = "https://copyblogger.com/blog/page/{}/";
The {} placeholder will be replaced with the page number for each request.
Specify Number of Pages
Next, we'll specify how many pages to scrape. Let's scrape the first 5 pages:
let num_pages = 5;
Loop Through Pages
We can now loop from 1 to num_pages and build each page's URL:
for page_num in 1..=num_pages {
    // Construct page URL
    let url = base_url.replace("{}", &page_num.to_string());
    // Code to scrape each page
}
Send Request and Check Response
Inside the loop, we'll use reqwest to fetch the page and check the response status:
let resp = reqwest::get(&url).await?;
if resp.status().is_success() {
    // Page retrieved, can parse HTML
} else {
    println!("Error retrieving page {}", page_num);
}
We check for a successful status code to ensure the request succeeded.
Parse HTML
If successful, we can parse the HTML using scraper:
let body = resp.text().await?;
let document = Html::parse_document(&body);
This gives us a parsed DOM document to extract data from.
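To see the parse-then-select pipeline in isolation, here is a minimal, self-contained sketch; the HTML snippet is made up purely for illustration. Note that text() yields an iterator of text fragments, so we collect it into a String:

use scraper::{Html, Selector};

fn main() {
    // A made-up HTML snippet, just to demonstrate the pipeline
    let html = "<html><body><h1>Hello, scraper!</h1></body></html>";
    let document = Html::parse_document(html);
    let h1_sel = Selector::parse("h1").unwrap();
    if let Some(h1) = document.select(&h1_sel).next() {
        // text() yields text fragments; join them into one String
        println!("{}", h1.text().collect::<String>()); // prints "Hello, scraper!"
    }
}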
Extract Data
Now within the loop we can use CSS selectors to locate the elements we want.
For example, to get all article elements:
let article_sel = Selector::parse("article").unwrap();
let articles = document.select(&article_sel);
We can loop through articles and extract fields from each one; a short sketch follows, and the full version appears below.
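Here is a hedged sketch of that inner loop, reusing the document and article_sel from the previous steps; the h2.entry-title and a.entry-title-link selectors come from the full example below and are assumptions about Copyblogger's markup:

// Iterate each matched <article> and print its headline and link.
// The CSS classes here assume Copyblogger's markup.
let title_sel = Selector::parse("h2.entry-title").unwrap();
let link_sel = Selector::parse("a.entry-title-link").unwrap();
for article in document.select(&article_sel) {
    let title = article.select(&title_sel).next()
        .map(|t| t.text().collect::<String>()).unwrap_or_default();
    let link = article.select(&link_sel).next()
        .and_then(|a| a.value().attr("href")).unwrap_or("");
    println!("{} -> {}", title, link);
}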
Full Code
Our full code to scrape 5 pages is:
use scraper::{Html, Selector};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let base_url = "https://copyblogger.com/blog/page/{}/";
    let num_pages = 5;

    // Compile each CSS selector once, outside the loop
    let article_sel = Selector::parse("article").unwrap();
    let title_sel = Selector::parse("h2.entry-title").unwrap();
    let link_sel = Selector::parse("a.entry-title-link").unwrap();
    let author_sel = Selector::parse("div.post-author a").unwrap();
    let category_sel = Selector::parse("div.entry-categories a").unwrap();

    for page_num in 1..=num_pages {
        // Construct the URL for this page
        let url = base_url.replace("{}", &page_num.to_string());
        let resp = reqwest::get(&url).await?;
        if resp.status().is_success() {
            let body = resp.text().await?;
            let document = Html::parse_document(&body);
            for article in document.select(&article_sel) {
                // Extract data from the article
                let title = article.select(&title_sel).next()
                    .map(|t| t.text().collect::<String>()).unwrap_or_default();
                let url = article.select(&link_sel).next()
                    .and_then(|a| a.value().attr("href")).unwrap_or("");
                let author = article.select(&author_sel).next()
                    .map(|a| a.text().collect::<String>()).unwrap_or_default();
                let categories: Vec<String> = article.select(&category_sel)
                    .map(|c| c.text().collect::<String>()).collect();
                // Print extracted data
                println!("Title: {}", title);
                println!("URL: {}", url);
                println!("Author: {}", author);
                println!("Categories: {:?}", categories);
                println!();
            }
        } else {
            println!("Error retrieving page {}", page_num);
        }
    }
    Ok(())
}
This scrapes and extracts data from multiple pages sequentially, and it can be extended to any number of pages, for example by taking the page count from the command line, as sketched below.
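As one hedged sketch of that extension, the page count could be read from a command-line argument instead of being hard-coded; the argument handling and the default of 5 are illustrative choices, not part of the original code:

fn main() {
    // Read the number of pages from the first CLI argument, defaulting to 5.
    // Example: cargo run -- 10
    let num_pages: u32 = std::env::args()
        .nth(1)
        .and_then(|arg| arg.parse().ok())
        .unwrap_or(5);
    println!("Scraping {} pages", num_pages);
}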
Summary
Web scraping enables collecting large datasets programmatically. With the techniques here, you can scrape and extract information from multiple pages of a website in Rust.
While these examples are great for learning, scraping production-level sites can pose challenges like CAPTCHAs, IP blocks, and bot detection. Rotating proxies and automated CAPTCHA solving can help.
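As a hedged illustration of the proxy side, reqwest's Client can be configured to route requests through a proxy; the proxy address below is a placeholder, not a real endpoint or a Proxies API URL:

use reqwest::{Client, Proxy};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Placeholder proxy address; substitute your own rotating-proxy endpoint
    let proxy = Proxy::all("http://proxy.example.com:8080")?;
    let client = Client::builder().proxy(proxy).build()?;
    let body = client
        .get("https://copyblogger.com/blog/")
        .send()
        .await?
        .text()
        .await?;
    println!("Fetched {} bytes through the proxy", body.len());
    Ok(())
}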
Proxies API offers a simple API for rendering pages with built-in proxy rotation, CAPTCHA solving, and evasion of IP blocks. You can fetch rendered pages in any language without configuring browsers or proxies yourself.
This allows scraping at scale without the headaches of IP blocks. Proxies API has a free tier to get started. Check out the API and sign up for an API key to supercharge your web scraping.
With the power of Proxies API combined with Rust crates like reqwest and scraper, you can scrape data at scale without getting blocked.