In this beginner-friendly guide, we'll walk through a Perl script that scrapes articles from the popular Hacker News site.
Prerequisites
To follow along, you'll need Perl installed, along with the LWP::Simple and HTML::TreeBuilder::XPath modules. You can install these from CPAN using the cpan command like so:
cpan LWP::Simple HTML::TreeBuilder::XPath
Step-by-step walkthrough
Importing modules
We start by importing the Perl modules we need:
use strict;
use warnings;
use LWP::Simple;
use HTML::TreeBuilder::XPath;
Defining the target URL
Next, we store the Hacker News homepage URL in a variable:
my $url = "https://news.ycombinator.com/";
This is the page we will scrape.
Sending HTTP request
We use LWP::Simple's get function to send an HTTP GET request:
my $content = get($url);
This downloads the raw HTML content of the Hacker News homepage.
Parsing the HTML
Next, we parse the HTML content using HTML::TreeBuilder::XPath:
my $tree = HTML::TreeBuilder::XPath->new_from_content($content);
This creates a "tree" representation of elements, on which we can perform DOM operations.
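To see what the parser gives us before pointing it at a live page, here is a tiny self-contained sketch that builds a tree from an inline HTML string and runs XPath queries against it. The sample markup is made up for illustration; it only assumes HTML::TreeBuilder::XPath is installed, as listed in the prerequisites.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# A miniature stand-in for the real page: one table row with a title span
my $html = '<table><tr class="athing"><td><span class="title">Hello</span></td></tr></table>';
my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

my @rows  = $tree->findnodes('//tr');                    # list of matching nodes
my $title = $tree->findvalue('//span[@class="title"]');  # text content of the match

print scalar(@rows), " row(s); title: $title\n";

$tree->delete;  # HTML::TreeBuilder trees should be freed explicitly
```

The same two calls, findnodes for node lists and findvalue for text, are all the script below needs.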
Finding all table rows
Inspecting the page, you can see that Hacker News lays out its stories in a single table: each article occupies one row (marked with the CSS class athing), immediately followed by a second row holding its details. We extract all the table rows with:
my @rows = $tree->findnodes('//tr');
This finds every table row on the page.
Processing the rows
Next, we loop through the rows to identify article rows vs. detail rows:
foreach my $row (@rows) {
# Identify article vs detail rows
# Extract data from detail rows
}
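The alternation between article and detail rows is really a small state machine. Here is a minimal sketch of that loop using plain hashes in place of real HTML::TreeBuilder::XPath nodes (the mock rows and their fields are invented for illustration):

```perl
use strict;
use warnings;

# Mock rows standing in for the <tr> nodes the real script iterates over
my @rows = (
    { class => "athing", title   => "Example story" },  # article row
    { class => undef,    subtext => "42 points" },      # detail row
    { class => undef,    style   => "height:5px" },     # spacer row
);

my ($current_article, $current_row_type);
my @scraped;
foreach my $row (@rows) {
    my $class = $row->{class};
    if ($class && $class eq "athing") {
        # Remember the article row; its details come next
        $current_article  = $row;
        $current_row_type = "article";
    } elsif ($current_row_type && $current_row_type eq "article") {
        # The row right after an article row carries its details
        push @scraped, { title   => $current_article->{title},
                         subtext => $row->{subtext} };
        ($current_article, $current_row_type) = (undef, undef);
    }
    # Anything else (e.g. the spacer row) is simply skipped
}

printf "%s | %s\n", $_->{title}, $_->{subtext} for @scraped;
# prints: Example story | 42 points
```

The spacer row falls through both branches because the state was reset after the detail row, which is exactly how the real script skips it.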
Identifying article rows
An article row has the CSS class athing. When we find one, we store it and note the row type:
my $class = $row->attr('class');
if ($class && $class eq "athing") {
# This is an article row
$current_article = $row;
$current_row_type = "article";
}
Identifying detail rows
The detail row comes immediately after its article row, so a row qualifies when the previous row was an article:
elsif ($current_row_type && $current_row_type eq "article") {
    # This is the details row
    # process this row
}
Extracting article data
Inside the detail row, we use XPath and regular expressions to extract each field:
my $article_title = $current_article->findvalue('.//span[@class="title"]');
my ($article_url_elem) = $current_article->findnodes('.//a[@class="storylink"]');
my $article_url = $article_url_elem ? $article_url_elem->attr('href') : "";
Note that findvalue already returns the node's text content, so no separate as_text call is needed; findnodes in list context gives us the first matching link element, if any.
my $subtext = $row->findvalue('.//td[@class="subtext"]');
my ($points, $author, $timestamp, $comments);
if ($subtext) {
($points) = $subtext =~ /(\d+)\s+points/;
($author) = $subtext =~ /by\s+(\S+)/;
($timestamp) = $subtext =~ /(\d+\s+\S+\s+ago)/;
($comments) = $subtext =~ /(\d+\s+comments?)/;
}
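To check what each of these patterns captures, we can run them against a representative subtext string. The sample line below is made up, mirroring the usual Hacker News format:

```perl
use strict;
use warnings;

# Invented sample mirroring a typical Hacker News subtext line
my $subtext = "123 points by pg 2 hours ago | hide | 45 comments";

my ($points)    = $subtext =~ /(\d+)\s+points/;       # digits before "points"
my ($author)    = $subtext =~ /by\s+(\S+)/;           # word after "by"
my ($timestamp) = $subtext =~ /(\d+\s+\S+\s+ago)/;    # e.g. "2 hours ago"
my ($comments)  = $subtext =~ /(\d+\s+comments?)/;    # e.g. "45 comments"

print "$points / $author / $timestamp / $comments\n";
# prints: 123 / pg / 2 hours ago / 45 comments
```

Each pattern can fail to match (a job posting has no points or comments), in which case the corresponding variable stays undef, which is why the script declares them up front and matches inside an if block.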
Finally, we print the extracted data:
print("Title: $article_title\n");
print("URL: $article_url\n");
# ...and so on for the remaining fields
And so on for each article!
Full code
Here is the complete code once more for reference:
use strict;
use warnings;
use LWP::Simple;
use HTML::TreeBuilder::XPath;
# Define the URL of the Hacker News homepage
my $url = "https://news.ycombinator.com/";
# Send a GET request to the URL
my $content = get($url);
# Check if the request was successful
if ($content) {
    my $tree = HTML::TreeBuilder::XPath->new_from_content($content);

    # Find all rows in the table
    my @rows = $tree->findnodes('//tr');

    # Initialize variables to keep track of the current article and row type
    my ($current_article, $current_row_type);

    # Iterate through the rows to scrape articles
    foreach my $row (@rows) {
        my $class = $row->attr('class');
        if ($class && $class eq "athing") {
            # This is an article row
            $current_article = $row;
            $current_row_type = "article";
        } elsif ($current_row_type && $current_row_type eq "article") {
            # This is the details row
            if ($current_article) {
                # findvalue returns the text content directly
                my $article_title = $current_article->findvalue('.//span[@class="title"]');
                my ($article_url_elem) = $current_article->findnodes('.//a[@class="storylink"]');
                my $article_url = $article_url_elem ? $article_url_elem->attr('href') : "";
                my $subtext = $row->findvalue('.//td[@class="subtext"]');
                my ($points, $author, $timestamp, $comments);
                if ($subtext) {
                    ($points)    = $subtext =~ /(\d+)\s+points/;
                    ($author)    = $subtext =~ /by\s+(\S+)/;
                    ($timestamp) = $subtext =~ /(\d+\s+\S+\s+ago)/;
                    ($comments)  = $subtext =~ /(\d+\s+comments?)/;
                }
                # Fall back to empty strings so the prints don't warn on missing fields
                $_ //= "" for ($points, $author, $timestamp, $comments);

                # Print the extracted information
                print("Title: $article_title\n");
                print("URL: $article_url\n");
                print("Points: $points\n");
                print("Author: $author\n");
                print("Timestamp: $timestamp\n");
                print("Comments: $comments\n");
                print("-" x 50 . "\n"); # Separator between articles
            }
            # Reset the current article and row type
            $current_article = undef;
            $current_row_type = undef;
        } elsif ($row->attr('style') && $row->attr('style') eq "height:5px") {
            # This is a spacer row; skip it
            next;
        }
    }
} else {
    print("Failed to retrieve the page.\n");
}
This works well as a learning exercise, but it is easy to see that even a dedicated proxy server is prone to getting blocked, since it uses a single IP. If you need to handle thousands of fetches every day, a professional rotating proxy service is almost a must; otherwise you tend to get IP-blocked by automatic location, usage, and bot-detection algorithms. Our rotating proxy service, Proxies API, provides a simple API that solves IP-blocking problems instantly. Hundreds of our customers have solved the headache of IP blocks this way, and because we render JavaScript behind the scenes, you can fetch and parse pages from any language (Node, PHP, and so on) or framework (Scrapy, Nutch, and so on). In all these cases you can just call the URL with render support like so:
curl "http://api.proxiesapi.com/?key=API_KEY&render=true&url=https://example.com"
We have a running offer of 1,000 API calls completely free. Register and get your free API key.