In this article, we'll walk through a Ruby script that scrapes various data from Reddit posts.
Some common use cases for web scraping Reddit include tracking trending topics, collecting post data for sentiment analysis, and aggregating content from communities you follow.
While scraping does have some ethical and terms-of-service considerations (which we won't get into here), it is a useful skill for programmers to acquire.
The page we'll be scraping is the Reddit homepage at https://www.reddit.com.
So let's jump right into the code!
Setting Up
We'll be using the open-uri and nokogiri libraries. open-uri ships with Ruby's standard library, so in most environments you only need to install Nokogiri:
gem install nokogiri
The open-uri library lets us fetch a URL with a single method call, while Nokogiri parses the returned HTML and lets us query it with CSS selectors.
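If you prefer managing dependencies with Bundler, a minimal Gemfile works too (the exact setup here is just one reasonable choice, not something the script requires):
source "https://rubygems.org"
gem "nokogiri"
Then run bundle install before running the script.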
Okay, with the dependencies handled, let's start going through the script:
Defining Constants
First we set some constant values that will be reused throughout the program:
reddit_url = "<https://www.reddit.com>"
user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"
Browser user agent strings identify what type of browser is making the request. Spoofing a common desktop user agent helps avoid bot detection.
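We'll make the actual request in the next section, but as a quick preview of the mechanism: open-uri accepts headers as plain key/value pairs, so any extra headers ride along the same way. The Accept-Language value below is purely illustrative and not required by the script:
extra_headers = {
  "User-Agent"      => user_agent,
  "Accept-Language" => "en-US,en;q=0.9" # example extra header, not part of the original script
}
html_content = URI.open(reddit_url, extra_headers).read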
Making the Initial Request
Next, we use the handy URI.open method from open-uri to request the page:
begin
  html_content = URI.open(reddit_url, "User-Agent" => user_agent).read
rescue OpenURI::HTTPError => e
  puts "Failed to download Reddit page (status code #{e.io.status[0]})"
end
Breaking this down: URI.open sends a GET request to reddit_url, passing our spoofed User-Agent header along with it, and .read returns the response body as a string. The rescue clause catches OpenURI::HTTPError so a failed request (for example, a 429 from Reddit's rate limiting) doesn't crash the script with an unhandled exception.
At this point, html_content holds the raw HTML of the Reddit homepage as one long string.
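Reddit rate-limits unauthenticated clients aggressively, so a retry with backoff can help in practice. Here is a minimal sketch; the attempt count and sleep times are arbitrary choices of mine, not part of the original script:
html_content = nil
3.times do |attempt|
  begin
    html_content = URI.open(reddit_url, "User-Agent" => user_agent).read
    break
  rescue OpenURI::HTTPError => e
    # Back off and retry on errors such as 429 Too Many Requests
    sleep(2 ** attempt)
  end
end
# html_content may still be nil here if all attempts failed, so check before parsing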
Saving the HTML
Next we write the HTML content to a local file:
filename = "reddit_page.html"
File.open(filename, "w:UTF-8") do |file|
  file.write(html_content)
end
puts "Reddit page saved to #{filename}"
Here we open reddit_page.html for writing with UTF-8 encoding, write the downloaded HTML into it, and print a confirmation message once the block closes the file.
Saving the HTML locally allows us to scrape the content multiple times without needing to re-download.
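For example, on later runs you could skip the download entirely and parse the saved file instead. A small sketch, assuming reddit_page.html already exists from a previous run:
require 'nokogiri'

# Read the previously saved page and parse it offline
html_content = File.read("reddit_page.html", encoding: "UTF-8")
doc = Nokogiri::HTML(html_content)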
Parsing the HTML with Nokogiri
Now we have the Reddit homepage HTML saved locally, and can parse it using the Nokogiri gem:
doc = Nokogiri::HTML(html_content)
Nokogiri gives us very powerful selectors to extract the exact pieces of information we want.
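To get a feel for the API before tackling Reddit's markup, here is a tiny self-contained example; the HTML snippet is made up for illustration:
require 'nokogiri'

snippet = Nokogiri::HTML('<div class="post"><a href="/r/ruby">Ruby</a></div>')
link = snippet.at_css("div.post a") # at_css returns the first matching element
puts link.text    # => "Ruby"
puts link["href"] # => "/r/ruby"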
Extracting Reddit Posts
Inspecting the elements
Upon inspecting the HTML in Chrome's DevTools, you will see that each post is wrapped in a custom shreddit-post element with a long list of CSS classes specific to it.
In the longest and perhaps trickiest part of the script, we use a complex CSS selector to extract Reddit post blocks from the parsed HTML document:
blocks = doc.css('shreddit-post.block.relative.cursor-pointer.bg-neutral-background.focus-within\:bg-neutral-background-hover.hover\:bg-neutral-background-hover.xs\:rounded-\[16px\].p-md.my-2xs.nd\:visible')
Let's break down what this selector is doing: shreddit-post matches the custom element by tag name, and each subsequent .class-name segment requires that class to be present on the element. Reddit uses Tailwind-style class names containing : and [ ] characters, which are special in CSS, so they must be escaped with a backslash inside the selector.
So in plain English, we're finding all Reddit post blocks on the page that carry these very specific styling and positioning classes. This took some trial and error to pinpoint.
The key things to remember are that the selector must match Reddit's markup exactly, and that the special characters in the class names need escaping.
This returns a Nokogiri::XML::NodeSet, an array-like collection of every element that matched.
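Since shreddit-post is a custom element that Reddit only uses for posts, you may find that selecting on the tag name alone works and is less brittle when Reddit reshuffles its styling classes. This is an assumption about the markup, so verify it against the page you actually downloaded:
blocks = doc.css("shreddit-post")
puts "Found #{blocks.length} posts" # quick sanity check on the NodeSet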
Looping Through the Posts
With the posts now stored in blocks, we iterate over them one at a time:
blocks.each do |block|
  # extract data from each block
end
Let's look at what information we're extracting from every block next.
Scraping Post Data
Inside the loop, we read most of the data straight off each shreddit-post element's attributes, and pull the title from a child element:
permalink = block['permalink']
content_href = block['content-href']
comment_count = block['comment-count']
post_title = block.css('div[slot="title"]').text.strip
author = block['author']
score = block['score']
Here's what we're grabbing and how: permalink (the relative URL of the post's comments page), content-href (the link the post points to), comment-count, author, and score are all plain attributes on the shreddit-post element, so we read each one with block['attribute-name']. The title lives in a child div with slot="title", so we select it with a CSS selector and take its stripped text.
The data points gathered cover the core aspects of a post: title, author, linked page, comment count, and score.
We print this data inside the loop to check what we scraped.
The key thing to emphasize again is that the attribute names come straight from Reddit's markup.
And most importantly, don't change literal strings like 'permalink', 'content-href', or 'comment-count': if they don't match the markup exactly, the lookups will quietly return nil.
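If you want the results somewhere more useful than stdout, Ruby's built-in csv library can capture the same fields. A sketch, where the output filename is an arbitrary choice:
require 'csv'

# Write one row per scraped post
CSV.open("reddit_posts.csv", "w") do |csv|
  csv << ["title", "author", "score", "comments", "permalink"]
  blocks.each do |block|
    csv << [
      block.css('div[slot="title"]').text.strip,
      block["author"],
      block["score"],
      block["comment-count"],
      block["permalink"]
    ]
  end
end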
Putting It All Together
Walkthrough complete! Here is the full code:
require 'open-uri'
require 'nokogiri'
# Define the Reddit URL you want to download
reddit_url = "https://www.reddit.com"
# Define a User-Agent header
user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"
# Send a GET request to the URL with the User-Agent header
begin
  html_content = URI.open(reddit_url, "User-Agent" => user_agent).read
  # Specify the filename to save the HTML content
  filename = "reddit_page.html"
  # Save the HTML content to a file
  File.open(filename, "w:UTF-8") do |file|
    file.write(html_content)
  end
  puts "Reddit page saved to #{filename}"
rescue OpenURI::HTTPError => e
  puts "Failed to download Reddit page (status code #{e.io.status[0]})"
  exit 1 # html_content is undefined past this point, so stop rather than crash below
end
# Parse the entire HTML content
doc = Nokogiri::HTML(html_content)
# Find all blocks with the specified tag and class
blocks = doc.css('shreddit-post.block.relative.cursor-pointer.bg-neutral-background.focus-within\:bg-neutral-background-hover.hover\:bg-neutral-background-hover.xs\:rounded-\[16px\].p-md.my-2xs.nd\:visible')
# Iterate through the blocks and extract information from each one
blocks.each do |block|
  permalink = block['permalink']
  content_href = block['content-href']
  comment_count = block['comment-count']
  post_title = block.css('div[slot="title"]').text.strip
  author = block['author']
  score = block['score']

  # Print the extracted information for each block
  puts "Permalink: #{permalink}"
  puts "Content Href: #{content_href}"
  puts "Comment Count: #{comment_count}"
  puts "Post Title: #{post_title}"
  puts "Author: #{author}"
  puts "Score: #{score}"
  puts "\n"
end
And that's it! We walked through the script from start to finish, explaining how each step of scraping Reddit works.
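To try it yourself, save the script as, say, scrape_reddit.rb (a filename of your choosing) and run:
ruby scrape_reddit.rb
Each post prints as a block of Permalink/Content Href/Comment Count/Post Title/Author/Score lines separated by a blank line. Note that Reddit frequently serves bot-detection pages or markup variants, so if blocks comes back empty, open the saved reddit_page.html to see what you actually received.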
Let me know if any part was confusing or needs more clarification! I'm happy to explain web scraping concepts in further detail.