How to Build a Reddit Scraper in Java

Jan 9, 2024 · 7 min read

In this beginner-friendly guide, we'll walk through how to scrape Reddit posts with a simple Java program. We'll go through the code step by step, explaining how it works with detailed comments and examples.

Here is the page we'll be working with: Reddit's front page.

Introduction

Web scraping refers to automatically collecting information from websites. Here, we will scrape post data from Reddit by:

  1. Sending a request to the Reddit URL
  2. Downloading the HTML content of the page
  3. Parsing the HTML to extract post information we want

This process allows us to get structured data from websites to power applications, analysis, research and more.
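
To make these three steps concrete before we dive in, here is a minimal sketch of the whole flow using jsoup. The URL and the title lookup are just placeholders for what follows:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    public class ScrapeSketch {
        public static void main(String[] args) throws Exception {
            // Steps 1 and 2: send the request and download the HTML
            Document doc = Jsoup.connect("https://example.com").get();
            // Step 3: parse (jsoup does this on the fly) and extract what we want
            System.out.println(doc.title());
        }
    }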

Scraping activities can sometimes violate terms of service. Be sure to review and strictly follow all applicable laws, terms and policies related to scraping.

Now, let's jump into the code!

Imports

We import several Java packages that provide useful functionality. Note that jsoup is a third-party library, so it must be on your classpath (for example, org.jsoup:jsoup:1.17.2 from Maven Central):

    import java.io.BufferedWriter;
    import java.io.FileWriter;
    import java.io.IOException;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

Key highlights:

  • java.io - helpers for writing the downloaded HTML to a file
  • org.jsoup - the jsoup Java HTML parser library, used for fetching and parsing pages

We'll see how these are used throughout the program.

Main Method

The main() method is the entry point for our Java program:

    public static void main(String[] args) {
      // Code goes here
    }

All scraping logic will be enclosed within this method.

Define Target URL

We'll scrape posts from Reddit's front page:

    String redditUrl = "https://www.reddit.com";

This hardcoded URL is our target. We could also take it from user input, as shown below.
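
For example, a minimal way to accept the URL as an optional command-line argument might look like this:

    // Fall back to the front page when no argument is supplied
    String redditUrl = (args.length > 0) ? args[0] : "https://www.reddit.com";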

Set User-Agent Header

Websites identify clients via the User-Agent header. We'll mimic a Chrome browser:

    String userAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36";

This helps avoid the most basic bot-detection mechanisms sites may employ, though it is no guarantee against more sophisticated checks.
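
To look even more like a regular browser, jsoup lets you attach extra request headers and a timeout when building the request (the full request is shown in the next section). A sketch with illustrative header values:

    Document doc = Jsoup.connect(redditUrl)
        .userAgent(userAgent)
        .header("Accept-Language", "en-US,en;q=0.9")  // illustrative header value
        .timeout(10_000)                              // request timeout in milliseconds
        .get();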

Send GET Request

We use the jsoup library to send an HTTP GET request:

    Document doc = Jsoup.connect(redditUrl)
      .userAgent(userAgent)
      .get();

The userAgent() method sets our defined header, and get() sends the request. Note that get() throws an IOException if the request fails, which is why the full program wraps this call in a try/catch block.

The returned Document contains the entire HTML content of Reddit's front page.

Check for Success

Since get() throws an IOException on failure, a Document that comes back at all means the request succeeded. The null check below is therefore only a defensive safeguard rather than a real status check:

    if (doc != null) {
      // Success!
    } else {
      // Request failed
    }
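
If you actually need the HTTP status code, jsoup's execute() returns a Response object you can inspect before parsing (org.jsoup.Connection must be imported for the Response type). A minimal sketch:

    Connection.Response response = Jsoup.connect(redditUrl)
        .userAgent(userAgent)
        .ignoreHttpErrors(true)   // return non-2xx responses instead of throwing
        .execute();
    System.out.println("Status: " + response.statusCode());
    if (response.statusCode() == 200) {
        Document doc = response.parse();  // parse the body into a Document
    }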

Save HTML Content

Let's save the downloaded HTML to a file:

    String htmlContent = doc.html();
    String filename = "reddit_page.html";

    try (BufferedWriter writer = new BufferedWriter(new FileWriter(filename))) {
      writer.write(htmlContent);
      System.out.println("Saved to " + filename);
    } catch (IOException e) {
      System.err.println("Failed to save: " + e);
    }

We get the HTML using doc.html() and write it with a BufferedWriter inside a try-with-resources block, which closes the file automatically; any write failure is handled in the catch clause.
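
On Java 11 and newer you can do the same thing in one call with java.nio.file.Files, if you prefer (a sketch, assuming Java 11+):

    import java.nio.file.Files;
    import java.nio.file.Path;

    // Creates or truncates the file and writes the whole string (throws IOException)
    Files.writeString(Path.of("reddit_page.html"), htmlContent);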

Parse HTML

So far we have only downloaded the Reddit front page HTML. To extract post data, we need a parsed Document we can query:

    Document page = Jsoup.parse(htmlContent);

This creates a queryable Document object from the string content. Strictly speaking this step is redundant here, since doc is already parsed; we re-parse the saved string to mirror a workflow where the HTML is loaded from disk.
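
For instance, starting from the file we saved earlier instead of the live response, jsoup can parse straight from disk:

    import java.io.File;

    // Parse previously saved HTML directly from a file
    Document page = Jsoup.parse(new File("reddit_page.html"), "UTF-8");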

Extract Post Blocks

Inspecting the elements

Upon inspecting the HTML in Chrome, you will see that each post is rendered as a custom <shreddit-post> element carrying a long list of styling class names.

Each post is therefore contained in a <shreddit-post> element (a custom tag, not a div with a unique class). The Tailwind-style utility classes attached to it (block, relative, cursor-pointer, and so on) are purely presentational, and some contain characters such as : and [ that jsoup's selector parser rejects, so chaining all of them into one selector would fail. Selecting by the tag name alone is simpler and far more robust:

    Elements blocks = page.select("shreddit-post");

A few points about this selector:

  • shreddit-post - the custom element name Reddit uses for each post block
  • The utility classes visible in DevTools control appearance only and change frequently, so avoid anchoring selectors on them
  • select() returns an Elements collection that we iterate through next

Selectors are very powerful but do take practice to master.
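
Here are a few more jsoup selector patterns in the same spirit (the attribute values are illustrative):

    // All posts by a specific author, via an attribute selector
    Elements byAuthor = page.select("shreddit-post[author=spez]");

    // All anchor tags on the page that carry an href attribute
    Elements links = page.select("a[href]");

    // Tag plus a single class (works, but styling classes are fragile)
    Elements styled = page.select("shreddit-post.block");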

Iterate Post Blocks

We can now loop through the selected post blocks:

    for (Element block : blocks) {
      // Extract data from each block
    }

Extract Post Data

Inside the loop, we use attributes and selectors to extract the specific post data we want from each block:

    String permalink = block.attr("permalink");
    String contentHref = block.attr("content-href");
    String commentCount = block.attr("comment-count");
    String postTitle = block.selectFirst("div[slot=title]").text().trim();
    String author = block.attr("author");
    String score = block.attr("score");

Let's understand how we get each field:

Permalink

The permalink holds the post's relative URL, which includes the post ID, stored as an attribute on the block:

    block.attr("permalink")

We simply extract this attribute value.

Content URL

Similar to the permalink, the content URL is also stored as a block attribute:

    block.attr("content-href")

Comment Count

The comment count for the post is again extracted from an attribute:

    block.attr("comment-count")

The author and score fields work exactly the same way.
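
Note that attr() always returns a String, and it returns an empty string when the attribute is missing. To treat counts and scores as numbers, parse them defensively. A minimal sketch:

    // Parse a numeric attribute, falling back to 0 if absent or malformed
    int comments;
    try {
        comments = Integer.parseInt(block.attr("comment-count"));
    } catch (NumberFormatException e) {
        comments = 0;
    }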

Post Title

The title requires an additional selector within the block:

    block.selectFirst("div[slot=title]")

We then extract the text with .text() and trim the whitespace.

Selectors give us incredible flexibility to pinpoint specific data in HTML.
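
One caveat: selectFirst() returns null when nothing matches, so the chained call above would throw a NullPointerException if Reddit changes its markup. A more defensive version:

    // Guard against a missing title element instead of chaining directly
    Element titleEl = block.selectFirst("div[slot=title]");
    String postTitle = (titleEl != null) ? titleEl.text().trim() : "";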

Print Extracted Data

We can now print the extracted post data:

    // Print the extracted information for each block
    System.out.println("Permalink: " + permalink);
    System.out.println("Content Href: " + contentHref);
    System.out.println("Comment Count: " + commentCount);
    System.out.println("Post Title: " + postTitle);
    System.out.println("Author: " + author);
    System.out.println("Score: " + score);
    System.out.println();

This outputs the data for each post.
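
Printing is fine for a quick look, but you could just as easily write the fields to a CSV file for later analysis, reusing the BufferedWriter pattern from earlier. A sketch (the filename is illustrative, and the quoting is minimal):

    try (BufferedWriter csv = new BufferedWriter(new FileWriter("reddit_posts.csv"))) {
        csv.write("permalink,title,author,score,comments\n");  // header row
        for (Element block : blocks) {
            Element titleEl = block.selectFirst("div[slot=title]");
            String title = (titleEl != null) ? titleEl.text().trim() : "";
            // Quote the title and double any embedded quotes (minimal CSV escaping)
            csv.write(String.join(",",
                    block.attr("permalink"),
                    "\"" + title.replace("\"", "\"\"") + "\"",
                    block.attr("author"),
                    block.attr("score"),
                    block.attr("comment-count")) + "\n");
        }
    } catch (IOException e) {
        System.err.println("Failed to write CSV: " + e.getMessage());
    }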

Full Code

For reference, here is the complete code we just walked through:

    import java.io.BufferedWriter;
    import java.io.FileWriter;
    import java.io.IOException;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    public class RedditScraper {

        public static void main(String[] args) {
            // Define the Reddit URL you want to download
            String redditUrl = "https://www.reddit.com";

            // Define a User-Agent header to mimic a Chrome browser
            String userAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36";

            // Send a GET request to the URL with the User-Agent header
            try {
                Document doc = Jsoup.connect(redditUrl)
                        .userAgent(userAgent)
                        .get();

                // get() throws IOException on failure, so reaching this point means
                // success; the null check is just a defensive safeguard
                if (doc != null) {
                    // Get the HTML content of the page
                    String htmlContent = doc.html();

                    // Specify the filename to save the HTML content
                    String filename = "reddit_page.html";

                    // Save the HTML content to a file
                    try (BufferedWriter writer = new BufferedWriter(new FileWriter(filename))) {
                        writer.write(htmlContent);
                        System.out.println("Reddit page saved to " + filename);
                    } catch (IOException e) {
                        System.err.println("Failed to save Reddit page: " + e.getMessage());
                    }

                    // Parse the HTML content (redundant here, since doc is already
                    // parsed, but mirrors loading saved HTML from disk)
                    Document page = Jsoup.parse(htmlContent);

                    // Find all post blocks by their custom tag name
                    Elements blocks = page.select("shreddit-post");

                    // Iterate through the blocks and extract information from each one
                    for (Element block : blocks) {
                        String permalink = block.attr("permalink");
                        String contentHref = block.attr("content-href");
                        String commentCount = block.attr("comment-count");
                        // Guard against a missing title element (see Post Title section)
                        Element titleEl = block.selectFirst("div[slot=title]");
                        String postTitle = (titleEl != null) ? titleEl.text().trim() : "";
                        String author = block.attr("author");
                        String score = block.attr("score");

                        // Print the extracted information for each block
                        System.out.println("Permalink: " + permalink);
                        System.out.println("Content Href: " + contentHref);
                        System.out.println("Comment Count: " + commentCount);
                        System.out.println("Post Title: " + postTitle);
                        System.out.println("Author: " + author);
                        System.out.println("Score: " + score);
                        System.out.println();
                    }
                } else {
                    System.err.println("Failed to download Reddit page");
                }
            } catch (IOException e) {
                System.err.println("Failed to make a GET request: " + e.getMessage());
            }
        }
    }
    

We built a complete Reddit scraper in Java that extracts post titles, URLs, authors, comment counts and scores!

Selectors are extremely powerful for targeting elements on web pages. With some practice, you'll be scraping all sorts of data. Just remember that Reddit's markup changes over time, so expect to revisit your selectors as the site evolves.

