In this article, we'll walk through Rust code that scrapes Reddit to extract information about posts. We'll go step by step to understand what's happening behind the scenes.
The page we'll be scraping is Reddit's front page, https://www.reddit.com.
Prerequisites
To follow along, you'll need a working Rust toolchain and two crates. To install Rust, follow the instructions at https://rustup.rs.
The reqwest and select crates handle making web requests and HTML parsing, respectively. Note that reqwest's blocking client, which we'll use below, is gated behind the blocking feature flag. To add both crates, run:
$ cargo add reqwest --features blocking
$ cargo add select
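Your Cargo.toml should end up with entries along these lines (the version numbers here are illustrative; cargo add picks whatever is current when you run it):
[dependencies]
reqwest = { version = "0.11", features = ["blocking"] }
select = "0.6"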
With those done, you're all set! Let's get scraping.
Walkthrough
We first import the types we need. (The extern crate declarations you may see in older tutorials are unnecessary since the 2018 edition.) Name and Attr are select's selector building blocks, and the Predicate trait must be in scope for .and() to work:
use select::document::Document;
use select::predicate::{Attr, Name, Predicate};
Then we define the target URL and User-Agent string:
let reddit_url = "<https://www.reddit.com>";
let user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36";
The user agent helps make requests look like they're coming from a real browser.
Next, we create a reqwest client with that user agent:
let client = reqwest::blocking::Client::builder()
.user_agent(user_agent)
.build()?;
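Setting the user agent on the builder bakes it into every request this client sends. If you'd rather attach it to a single request instead, reqwest also supports per-request headers:
let response = client
    .get(reddit_url)
    .header(reqwest::header::USER_AGENT, user_agent)
    .send()?;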
We use a blocking client for simplicity here.
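If your project is already async, the default non-blocking client works the same way, just awaited. Here's a rough equivalent, assuming tokio as the runtime (an extra dependency this walkthrough doesn't otherwise need):
#[tokio::main]
async fn main() -> Result<(), reqwest::Error> {
    let client = reqwest::Client::builder()
        .user_agent("...same UA string as above...")
        .build()?;
    // Same calls as the blocking version, but each one is awaited.
    let response = client.get("https://www.reddit.com").send().await?;
    let html_content = response.text().await?;
    println!("Fetched {} bytes", html_content.len());
    Ok(())
}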
Then we send a GET request to the Reddit URL:
let response = client.get(reddit_url).send()?;
We check whether it was successful (is_success matches any 2xx status code):
if response.status().is_success() {
// ...
}
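As an aside, if you'd rather turn a non-success status into an error and propagate it with ?, reqwest offers error_for_status as an alternative to the explicit check:
let response = client.get(reddit_url).send()?.error_for_status()?;
// From here on, response is guaranteed to have a success status.
We'll stick with the if so we can print a friendly message on failure.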
Inside the if statement is the scraping logic.
First we get the HTML content as a string:
let html_content = response.text()?;
We save it to a file:
let filename = "reddit_page.html";
std::fs::write(filename, &html_content)?;
Then we parse the HTML into a document using Select. from_read returns an io::Result, and the ? operator works here because our main returns Box<dyn std::error::Error> (see the full code at the end):
let document = Document::from_read(html_content.as_bytes())?;
Inspecting the elements
Upon inspecting the HTML in Chrome's developer tools, you'll see that each post lives in a custom shreddit-post element, with its data exposed as attributes and a long list of utility classes attached.
This is where things get interesting! One caveat before writing the selector: select's Class predicate matches a single class name, so passing the long space-separated class string copied from the inspector would never match anything. Since shreddit-post is a custom element Reddit uses only for post blocks, matching on the tag name alone is both simpler and more robust:
for node in document.find(Name("shreddit-post")) {
    // ... extraction logic goes here ...
}
This finds every Reddit post block on the page.
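To make the matching concrete, here's a tiny self-contained sketch that runs select against a simplified, made-up stand-in for one post element (the real markup has many more attributes and nested children):
use select::document::Document;
use select::predicate::Name;

fn main() {
    // A stripped-down, hypothetical stand-in for Reddit's real markup.
    let html = r#"<shreddit-post permalink="/r/rust/comments/abc/example/" author="someone" score="42"></shreddit-post>"#;
    // Document implements From<&str>, so we can parse straight from a string.
    let document = Document::from(html);
    for node in document.find(Name("shreddit-post")) {
        // Attributes come back as Option<&str>.
        println!("{:?}", node.attr("permalink")); // Some("/r/rust/comments/abc/example/")
    }
}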
Inside the loop, we extract the post data, mostly by reading attributes straight off the shreddit-post node:
Permalink
let permalink = node.attr("permalink").unwrap_or("");
Gets the post's relative URL from the node's permalink attribute.
Content URL
let content_href = node.attr("content-href").unwrap_or("");
Gets the URL of the post's content from the content-href attribute.
Comment Count
let comment_count = node.attr("comment-count").unwrap_or("");
Gets the number of comments from the comment-count attribute.
Post Title
let post_title = node
    .find(Name("div").and(Attr("slot", "title")))
    .next()
    .map(|n| n.text().trim().to_string())
    .unwrap_or_default();
Finds the title, which unlike the other fields lives in a child div with slot="title" rather than in an attribute. Using map plus unwrap_or_default instead of a bare unwrap avoids a panic if Reddit's markup changes.
Author
let author = node.attr("author").unwrap_or("");
Gets the poster's username from the author attribute.
Score
let score = node.attr("score").unwrap_or("");
Gets the post's score from the score attribute.
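Note that comment-count and score come back as strings, because that's all an HTML attribute can be. If you want to sort or do arithmetic on them, parse them, keeping in mind the attribute may be missing or malformed:
// Parse the numeric fields, falling back to 0 if missing or malformed.
let score_num: i64 = score.parse().unwrap_or(0);
let comments_num: u32 = comment_count.parse().unwrap_or(0);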
Finally, we print out all the extracted fields. The key thing to note is that nearly everything we need is exposed as attributes on the shreddit-post element itself; only the title requires descending into a child node.
And that's it! Full code:
use select::document::Document;
use select::predicate::{Attr, Name, Predicate};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Define the Reddit URL you want to download
    let reddit_url = "https://www.reddit.com";
    // Define a User-Agent header so the request looks like it comes from a browser
    let user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36";
    // Create a reqwest client with the User-Agent header
    let client = reqwest::blocking::Client::builder()
        .user_agent(user_agent)
        .build()?;
    // Send a GET request to the URL
    let response = client.get(reddit_url).send()?;
    // Check if the request was successful (any 2xx status code)
    if response.status().is_success() {
        // Get the HTML content of the page as a string
        let html_content = response.text()?;
        // Specify the filename to save the HTML content
        let filename = "reddit_page.html";
        // Save the HTML content to a file
        std::fs::write(filename, &html_content)?;
        println!("Reddit page saved to {}", filename);
        // Parse the HTML content
        let document = Document::from_read(html_content.as_bytes())?;
        // Find all post blocks; shreddit-post is a custom element Reddit uses only for posts
        for node in document.find(Name("shreddit-post")) {
            let permalink = node.attr("permalink").unwrap_or("");
            let content_href = node.attr("content-href").unwrap_or("");
            let comment_count = node.attr("comment-count").unwrap_or("");
            // The title lives in a child div with slot="title", not in an attribute
            let post_title = node
                .find(Name("div").and(Attr("slot", "title")))
                .next()
                .map(|n| n.text().trim().to_string())
                .unwrap_or_default();
            let author = node.attr("author").unwrap_or("");
            let score = node.attr("score").unwrap_or("");
            // Print the extracted information for each post
            println!("Permalink: {}", permalink);
            println!("Content Href: {}", content_href);
            println!("Comment Count: {}", comment_count);
            println!("Post Title: {}", post_title);
            println!("Author: {}", author);
            println!("Score: {}", score);
            println!();
        }
    } else {
        println!("Failed to download Reddit page (status code {})", response.status());
    }
    Ok(())
}
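Printing to stdout works for a demo, but if you plan to do more with the data, it's cleaner to collect it into a struct first. Here's a sketch of what that might look like; the Post type is our own invention, not something from reqwest or select:
// A hypothetical container for one scraped post.
struct Post {
    permalink: String,
    title: String,
    author: String,
    score: i64,
    comment_count: u32,
}

// Inside main, after parsing the document:
let posts: Vec<Post> = document
    .find(Name("shreddit-post"))
    .map(|node| Post {
        permalink: node.attr("permalink").unwrap_or("").to_string(),
        title: node
            .find(Name("div").and(Attr("slot", "title")))
            .next()
            .map(|n| n.text().trim().to_string())
            .unwrap_or_default(),
        author: node.attr("author").unwrap_or("").to_string(),
        score: node.attr("score").unwrap_or("0").parse().unwrap_or(0),
        comment_count: node.attr("comment-count").unwrap_or("0").parse().unwrap_or(0),
    })
    .collect();
println!("Scraped {} posts", posts.len());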