Web scraping is the process of automatically extracting data from websites. This handy PHP script scrapes post data from Reddit by fetching the HTML content of a Reddit page, then using DOM parsing with CSS-style selectors to extract information such as titles, scores, and authors.
Let's walk through it step-by-step.
Prerequisites
To run this code, you'll need the PHP CLI and the simple_html_dom library.
First, make sure PHP is available on the command line:
php -v
Then install simple_html_dom, either by downloading it from SourceForge or via Composer:
composer require sunra/php-simple-html-dom-parser
Including the Library
We start by including the simple_html_dom library which will handle parsing and searching the HTML:
require('simple_html_dom.php');
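This assumes the standalone simple_html_dom.php file from the SourceForge download sits next to your script. If you installed via Composer instead, the parser is exposed through a namespaced wrapper class; here's a minimal sketch based on the sunra package's documented API:
require 'vendor/autoload.php';
use Sunra\PhpSimple\HtmlDomParser;
// str_get_html() parses an HTML string and returns a DOM object with the same find() API
$html = HtmlDomParser::str_get_html('<p class="greeting">hello</p>');
echo $html->find('p.greeting', 0)->plaintext; // prints "hello"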
Defining URLs and Headers
Next we define the Reddit URL we want to scrape, and a User-Agent header to send with the requests:
$reddit_url = "<https://www.reddit.com>";
$headers = array(
"User-Agent" => "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"
);
Note that CURLOPT_HTTPHEADER expects each header as a complete "Name: value" string, not as a key/value pair. As for the value itself: it's best practice to identify your scraper honestly, but some sites block obvious scraping bots, so sending a browser User-Agent helps get past that.
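If you'd rather identify yourself, the header works the same way; the name and contact address below are just placeholders:
$headers = array(
    "User-Agent: my-reddit-scraper/1.0 (+mailto:you@example.com)" // hypothetical identifier; use your own details
);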
Initializing cURL
We use cURL to make the HTTP requests in PHP. So we initialize a cURL session:
$ch = curl_init();
And configure the options:
curl_setopt($ch, CURLOPT_URL, $reddit_url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
Here we set the URL to fetch, enable return transfer to get the response directly, and add our custom User-Agent header.
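A few optional cURL settings can make the fetch more robust. None of these are required by the script, but they're standard options worth knowing:
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow 3xx redirects
curl_setopt($ch, CURLOPT_TIMEOUT, 30);          // abort if the whole request takes longer than 30 seconds
curl_setopt($ch, CURLOPT_ENCODING, "");         // accept and decode any encoding cURL supports (e.g. gzip)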
Making the Request
With cURL configured, we execute the request; thanks to CURLOPT_RETURNTRANSFER, curl_exec() returns the response body as a string:
$response = curl_exec($ch);
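Before checking the status code, it's worth knowing that curl_exec() returns false outright on transport-level failures (DNS errors, timeouts) before you ever get an HTTP status, so a guard like this is a sensible addition:
if ($response === false) {
    die("cURL error: " . curl_error($ch) . "\n");
}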
We also check that it was successful by verifying the response code is 200 OK:
if (curl_getinfo($ch, CURLINFO_HTTP_CODE) == 200) {
    // Request succeeded!
} else {
    // Request failed
}
And save the HTML content to a file:
$html_content = $response;
$filename = "reddit_page.html";
file_put_contents($filename, $html_content);
This saves the raw HTML we'll parse next.
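Having a local copy also lets you iterate on your selectors without re-fetching the page; simple_html_dom ships a file_get_html() helper for exactly this:
// Parse the saved copy instead of hitting Reddit again
$html = file_get_html("reddit_page.html");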
Parsing HTML
With simple_html_dom, parsing HTML is easy. We just initialize a new DOM object and load the HTML content:
$html = new simple_html_dom();
$html->load($response);
Now we have tons of helpful DOM traversal methods to extract data!
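For example, pulling every link out of the page takes just a few lines:
// find() takes a CSS-style selector; magic properties like ->href expose attributes
foreach ($html->find('a') as $link) {
    echo $link->href . "\n";
}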
Extracting Data with Selectors
Inspecting the elements
Inspecting the page in Chrome DevTools, you'll see that each post is wrapped in a custom shreddit-post element carrying a long string of utility classes specific to post blocks.
This is where most people struggle with web scraping: writing CSS selectors that actually match the content you want.
Let's break this down:
$blocks = $html->find('shreddit-post[class=block relative cursor-pointer bg-neutral-background focus-within:bg-neutral-background-hover hover:bg-neutral-background-hover xs:rounded-[16px] p-md my-2xs nd:visible]');
The key things to understand are:
- shreddit-post is the tag name of the custom element Reddit wraps around each post.
- The [class=...] filter restricts matches to elements whose class attribute is exactly that string of utility classes.
- find() returns an array of every element matching the selector.
This gives us all the post block elements.
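Matching the full utility-class string is brittle, since Reddit can change its styling at any time. Because shreddit-post appears to be a custom element used only for posts, matching on the tag name alone should be more robust (this is an assumption about the page's markup, so verify it in DevTools):
// Equivalent but less fragile: match on the custom tag alone
$blocks = $html->find('shreddit-post');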
Extracting Post Data
Inside the loop, we use other DOM methods to get attributes and values:
foreach ($blocks as $block) {
    $permalink = $block->getAttribute('permalink');
    $content_href = $block->getAttribute('content-href');
    $comment_count = $block->getAttribute('comment-count');
    $post_title = $block->find('div[slot=title]', 0)->plaintext;
    $author = $block->getAttribute('author');
    $score = $block->getAttribute('score');
    // Print post data
}
And we print out all the extracted fields!
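One caveat: find(..., 0) returns null when nothing matches, and dereferencing null is a fatal error in PHP. A defensive variant of the title lookup, for posts that might lack the title slot:
$title_el = $block->find('div[slot=title]', 0);
$post_title = ($title_el !== null) ? trim($title_el->plaintext) : "(no title)";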
Full Code
Here is the complete script for reference:
<?php
// Include the simple_html_dom library for HTML parsing
require('simple_html_dom.php');
// Define the Reddit URL you want to download
$reddit_url = "https://www.reddit.com";
// Define a User-Agent header
$headers = array(
    "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36" // Replace with your own User-Agent string if you prefer
);
// Initialize a cURL session
$ch = curl_init();
// Set the cURL options
curl_setopt($ch, CURLOPT_URL, $reddit_url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
// Send the GET request to the URL
$response = curl_exec($ch);
// Bail out on transport-level failures (DNS errors, timeouts, etc.)
if ($response === false) {
    die("cURL error: " . curl_error($ch) . "\n");
}
// Check if the request was successful (status code 200)
if (curl_getinfo($ch, CURLINFO_HTTP_CODE) == 200) {
    // Get the HTML content of the page
    $html_content = $response;
    // Specify the filename to save the HTML content
    $filename = "reddit_page.html";
    // Save the HTML content to a file
    file_put_contents($filename, $html_content);
    echo "Reddit page saved to $filename\n";
} else {
    echo "Failed to download Reddit page (status code " . curl_getinfo($ch, CURLINFO_HTTP_CODE) . ")\n";
    // Stop here; there is nothing useful to parse
    exit(1);
}
// Create a DOM object
$html = new simple_html_dom();
$html->load($response);
// Find all blocks with the specified tag and class
$blocks = $html->find('shreddit-post[class=block relative cursor-pointer bg-neutral-background focus-within:bg-neutral-background-hover hover:bg-neutral-background-hover xs:rounded-[16px] p-md my-2xs nd:visible]');
// Iterate through the blocks and extract information from each one
foreach ($blocks as $block) {
    $permalink = $block->getAttribute('permalink');
    $content_href = $block->getAttribute('content-href');
    $comment_count = $block->getAttribute('comment-count');
    $post_title = $block->find('div[slot=title]', 0)->plaintext;
    $author = $block->getAttribute('author');
    $score = $block->getAttribute('score');
    // Print the extracted information for each block
    echo "Permalink: $permalink\n";
    echo "Content Href: $content_href\n";
    echo "Comment Count: $comment_count\n";
    echo "Post Title: $post_title\n";
    echo "Author: $author\n";
    echo "Score: $score\n\n";
}
// Close the cURL session
curl_close($ch);
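Save the script under any name you like (scrape_reddit.php below is just a placeholder) and run it from the command line:
php scrape_reddit.php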