In this beginner web scraping tutorial, we'll walk through code that scrapes news articles from the popular Hacker News site using the rvest package in R.
Specifically, this code will extract the title, URL, points, author, timestamp, and comment count for each article on Hacker News' front page.
This is the page we are talking about…
Installation
Before running the web scraping script, you need to install rvest:
install.packages("rvest")
rvest depends on the xml2 package, so install that first if needed:
install.packages("xml2")
You may also need to install other dependencies like httr, curl, and stringr.
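If you prefer, you can install everything in one go with a single install.packages() call. This just bundles the packages mentioned above; only install what your setup is missing:
# Install rvest along with the helper packages it commonly uses
install.packages(c("xml2", "rvest", "httr", "curl", "stringr"))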
Once installed, let's load rvest:
library(rvest)
Okay, now we're ready to scrape!
Walkthrough
Let's break down what exactly this Hacker News web scraper is doing:
1. Define URL and Initialize Session
url <- "<https://news.ycombinator.com/>"
response <- read_html(url)
First, we store the Hacker News homepage URL in a variable called url.
Then, we use rvest's read_html() function to send a GET request to that URL and parse the returned HTML into a document object we can query.
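As a quick optional sanity check (not part of the original script), you can confirm the page parsed correctly by pulling out its <title> tag:
# The parsed document is a queryable HTML tree
html_text(html_node(response, "title"))
# Should print something like "Hacker News"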
2. Check that the Request Succeeded
if (length(response) > 0) {
...
} else {
cat("Failed to retrieve the page.\\n")
}
It's good practice to verify that the request succeeded before trying to parse the HTML.
Here we simply check if response contains data. If not, we print an error.
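Note that read_html() usually throws an error on a failed request rather than returning an empty object, so if you want the script to fail gracefully you could wrap the call in tryCatch(). Here is a minimal sketch of that alternative (not part of the original code):
# Attempt the request; return NULL instead of stopping the script on failure
response <- tryCatch(
  read_html(url),
  error = function(e) {
    cat("Failed to retrieve the page:", conditionMessage(e), "\n")
    NULL
  }
)
if (!is.null(response)) {
  # ... continue with parsing ...
}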
3. Find All Table Rows
Inspecting the page, you can notice that the items are housed inside <tr> tags with the class athing. Hacker News displays the articles in a table, so to extract each article's data we need to loop through the table rows:
rows <- html_nodes(response, "tr")
4. Set Up Tracker Variables
As we loop through the rows, we need to keep track of the current article row we're processing and what type of row it is. For example, the first row of each pair contains the article title and URL, and the next row contains additional details like points and author. By tracking state with these variables, we can pair the data together for each article.
current_article <- NULL
current_row_type <- NULL
5. Loop Through Each Row
Now let's walk through the for loop to understand how the data extraction works. We loop through each row in rows:
for (row in rows) {
...
}
6. Identify Article Row or Details Row
Here is the key logic that identifies whether the current row is an article title row or the additional-details row. The CSS class "athing" only occurs on article title rows, so we check for that class name with html_attr(). If it is found, we save the row to current_article and set the type. Otherwise, if we previously found an article row, we know the next row must contain the additional details for that article. This sets us up nicely to extract the two connected rows of data.
if ("athing" %in% html_attr(row, "class")) {
# This is an article row
current_article <- row
current_row_type <- "article"
} else if (identical(current_row_type, "article")) {
# This is the details row
}
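If the branching feels abstract, here is a tiny self-contained example (a made-up two-row table, not the real Hacker News markup) showing how html_attr() drives the two branches:
# A toy table: one "athing" row followed by a plain details row
toy <- read_html('<table>
  <tr class="athing"><td>Title row</td></tr>
  <tr><td class="subtext">Details row</td></tr>
</table>')
for (row in html_nodes(toy, "tr")) {
  if ("athing" %in% html_attr(row, "class")) {
    cat("article row:", html_text(row), "\n")
  } else {
    cat("details row:", html_text(row), "\n")
  }
}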
7. Extract Article Details
Now that we've identified an article row and details row pair, let's extract the data. Using html_node() with the span.titleline selector, we locate the title element inside the article row. From there we use html_text() and html_attr() on its <a> tag to pull out the link text and the href. This gives us the article's actual text title and URL! The extraction continues in the same way for the other fields, targeting the points, author, timestamp, and comment count elements in the details row. The key throughout is using CSS selectors and XPath to target elements, plus rvest functions to get text and attribute data.
title_elem <- html_node(current_article, "span.titleline")
if (!inherits(title_elem, "xml_missing")) {
article_title <- html_text(html_node(title_elem, "a"))
article_url <- html_attr(html_node(title_elem, "a"), "href")
...
}
cat("Title: ", article_title, "\\n")
cat("URL: ", article_url, "\\n")
...
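Printing with cat() is fine for seeing the scraper work, but you may prefer the results in a single table. One way (a sketch, assuming the same variable names as above) is to collect each article into a list inside the loop and bind everything into a data frame afterwards:
# Before the loop
articles <- list()
# Inside the loop, instead of (or in addition to) the cat() calls
articles[[length(articles) + 1]] <- data.frame(
  title = article_title,
  url = article_url,
  points = points,
  author = author,
  timestamp = timestamp,
  comments = comments,
  stringsAsFactors = FALSE
)
# After the loop
hn_articles <- do.call(rbind, articles)
head(hn_articles)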
Full Code
Now that we understand each part, here is the full web scraper code:
# Load the necessary libraries
library(rvest)
# Define the URL of the Hacker News homepage
url <- "https://news.ycombinator.com/"
# Send a GET request to the URL
response <- read_html(url)
# Check if the request was successful
if (length(response) > 0) {
# Find all rows in the table
rows <- html_nodes(response, "tr")
# Initialize variables to keep track of the current article and row type
current_article <- NULL
current_row_type <- NULL
# Iterate through the rows to scrape articles
for (row in rows) {
if ("athing" %in% html_attr(row, "class")) {
# This is an article row
current_article <- row
current_row_type <- "article"
} else if (identical(current_row_type, "article")) {
# This is the details row
if (!is.null(current_article)) {
# Extract information from the current article and details row
title_elem <- html_node(current_article, "span.titleline")
if (!inherits(title_elem, "xml_missing")) {
article_title <- html_text(html_node(title_elem, "a"))
article_url <- html_attr(html_node(title_elem, "a"), "href")
subtext <- html_node(row, "td.subtext")
points <- gsub("\\D", "", html_text(html_node(subtext, "span.score")))
author <- html_text(html_node(subtext, "a.hnuser"))
timestamp <- html_attr(html_node(subtext, "span.age"), "title")
comments_elem <- html_node(subtext, xpath = ".//a[contains(text(), 'comment')]")
comments <- ifelse(inherits(comments_elem, "xml_missing"), "0", html_text(comments_elem))
# Print the extracted information
cat("Title: ", article_title, "\n")
cat("URL: ", article_url, "\n")
cat("Points: ", points, "\n")
cat("Author: ", author, "\n")
cat("Timestamp: ", timestamp, "\n")
cat("Comments: ", comments, "\n")
cat("-" * 50, "\n") # Separating articles
}
}
# Reset the current article and row type
current_article <- NULL
current_row_type <- NULL
} else if ("height:5px" == html_attr(row, "style")) {
# This is the spacer row, skip it
next
}
}
} else {
cat("Failed to retrieve the page.\n")
}
curl "<http://api.proxiesapi.com/?key=API_KEY&render=true&url=https://example.com>"
<html>
<head>
<title>Example Domain</title>
<meta charset="utf-8" />
<meta http-equiv="Content-type" content="text/html; charset=utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
...

We have a running offer of 1000 API calls completely free. Register and get your free API Key.