Web scraping is the process of extracting data from websites automatically through code. It allows gathering information published openly on the web to analyze or use programmatically.
A common use case is scraping article headlines and links from news sites like The New York Times to perform text analysis or feed into machine learning models. Instead of laboriously copying content by hand, web scraping makes this fast and easy.
In this beginner R tutorial, we'll walk through a simple example of scraping the main New York Times page to extract article titles and links into R for further processing.
Prerequisites
Before running the code, some packages need to be installed:
install.packages(c("rvest", "httr"))
Load the libraries:
library(rvest)
library(httr)
Making HTTP Requests with R
We first need to download the New York Times HTML page content into R to search through it. This requires sending an HTTP GET request from R:
url <- 'https://www.nytimes.com/'
headers <- add_headers("User-Agent" = "Mozilla/5.0")
response <- GET(url, headers)
Here we define the target URL, attach a User-Agent header so the request looks like it comes from a browser, and send an HTTP GET request.
We'll check the status code to confirm success:
if (status_code(response) == 200) {
  # Continue scraping
} else {
  print("Request failed")
}
HTTP status 200 indicates the request and page load worked properly. Any other code means an error occurred that we need to handle.
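If the request fails intermittently, httr's RETRY() helper can resend it a few times with a pause between attempts. Here is a minimal sketch of that idea, reusing the url and headers objects defined above:

# Retry the GET request up to 3 times, doubling the pause between attempts
response <- RETRY("GET", url, headers, times = 3, pause_base = 2)

if (status_code(response) != 200) {
  stop("Request failed with status ", status_code(response))
}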
Parsing the Page Content in R
With the HTML content now stored in the response object, we leverage rvest to parse and search through it.
page_content <- content(response, "text", encoding = "UTF-8")
page <- read_html(page_content)
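As a quick sanity check that the HTML parsed correctly, we can pull out the page title:

# Print the page's <title> text to confirm the parse worked
print(html_text(html_node(page, "title")))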
Next we need to find which elements on the page contain the article titles and links to extract. Viewing the page source helps us see where the article content sits.
Inspecting the page
We now use Chrome's Inspect Element tool to see how the page's HTML is structured.
You can see that the articles are contained inside section tags with the class story-wrapper.
We grab all such sections:
article_sections <- html_nodes(page, "section.story-wrapper")
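It is worth checking how many sections the selector matched; if this returns zero, the class names have likely changed and the selector needs updating:

# Number of story sections matched by the CSS selector
length(article_sections)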
Extracting Article Data
To extract the title and link from each section, we loop through the results:
for (section in article_sections) {
  title <- html_node(section, "h3.indicate-hover")
  link <- html_node(section, "a.css-9mylee")

  if (!is.na(title) && !is.na(link)) {
    article_title <- html_text(title)
    article_url <- html_attr(link, "href")
    print(article_title)
    print(article_url)
  }
}
Here we first find the specific nodes for title and link using CSS selectors, then extract the text and attribute values if they exist.
Finally, we print the results; in a real system we would likely store and process them further.
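As an alternative to the loop, rvest's selector functions are vectorized, so the same extraction can be written without explicit iteration. This is a sketch assuming the same selectors as above:

# Vectorized extraction: html_node() returns one (possibly missing) match per section
titles <- html_text(html_node(article_sections, "h3.indicate-hover"))
links <- html_attr(html_node(article_sections, "a.css-9mylee"), "href")

# Combine into a data frame and drop rows where either piece is missing
articles <- data.frame(title = titles, link = links, stringsAsFactors = FALSE)
articles <- articles[!is.na(articles$title) & !is.na(articles$link), ]
head(articles)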
Potential Challenges
Some potential issues to be aware of: parts of the page may be rendered by JavaScript and therefore invisible to rvest, class names like css-9mylee can change whenever the site is redesigned, and the site may throttle or block requests that do not look like they come from a real browser.
If JavaScript rendering causes problems, R packages like RSelenium can help by simulating a real browser environment.
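For reference, here is a minimal sketch of how RSelenium could hand a JavaScript-rendered page back to rvest. It assumes a compatible browser and driver are installed, and the port number is arbitrary:

library(RSelenium)
library(rvest)

# Start a Selenium-driven browser session
rD <- rsDriver(browser = "firefox", port = 4545L, verbose = FALSE)
remDr <- rD$client

# Load the page and give the JavaScript a few seconds to render
remDr$navigate("https://www.nytimes.com/")
Sys.sleep(5)

# Hand the fully rendered HTML over to rvest
page <- read_html(remDr$getPageSource()[[1]])

# Shut down the browser and the Selenium server
remDr$close()
rD$server$stop()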
Next Steps
The rvest package provides a wide toolkit for scraping many types of sites. From here you could extract more fields such as summaries or timestamps, scrape additional pages, or store the results for further analysis.
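For instance, assuming the titles and links have been collected into vectors as in the full script below, they could be combined into a data frame and saved to CSV for later analysis:

# Combine the scraped vectors into a data frame and write it to disk
articles <- data.frame(title = article_titles, link = article_links, stringsAsFactors = FALSE)
write.csv(articles, "nyt_articles.csv", row.names = FALSE)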
Hopefully this gives a glimpse into the possibilities once you can easily acquire website data into R at scale!
Full R Code
Here is the complete runnable script:
# Load the required libraries
library(httr)
library(rvest)

# URL of The New York Times website
url <- 'https://www.nytimes.com/'

# Define a user-agent header to simulate a browser request
headers <- add_headers(
  "User-Agent" = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
)

# Send an HTTP GET request to the URL
response <- GET(url, headers)

# Check if the request was successful (status code 200)
if (status_code(response) == 200) {
  # Parse the HTML content of the page
  page_content <- content(response, "text", encoding = "UTF-8")
  page <- read_html(page_content)

  # Find all article sections with class 'story-wrapper'
  article_sections <- html_nodes(page, "section.story-wrapper")

  # Initialize character vectors to store the article titles and links
  article_titles <- character(0)
  article_links <- character(0)

  # Iterate through the article sections
  for (article_section in article_sections) {
    # Look for the article title element
    title_element <- html_node(article_section, "h3.indicate-hover")
    # Look for the article link element
    link_element <- html_node(article_section, "a.css-9mylee")

    # If both title and link are found, extract and append
    if (!is.na(title_element) && !is.na(link_element)) {
      article_title <- html_text(title_element)
      article_link <- html_attr(link_element, "href")
      article_titles <- c(article_titles, article_title)
      article_links <- c(article_links, article_link)
    }
  }

  # Print or process the extracted article titles and links
  for (i in seq_along(article_titles)) {
    cat("Title:", article_titles[i], "\n")
    cat("Link:", article_links[i], "\n\n")
  }
} else {
  cat("Failed to retrieve the web page. Status code:", status_code(response), "\n")
}
In more advanced implementations you may even need to rotate the User-Agent string so the website can't tell it's the same client making every request!
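One simple approach is to keep a small pool of User-Agent strings and pick one at random for each request; a rough sketch (the strings here are only illustrative):

# A small pool of User-Agent strings (illustrative values only)
user_agents <- c(
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.1 Safari/605.1.15",
  "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"
)

# Pick a random User-Agent for this request
response <- GET(url, add_headers("User-Agent" = sample(user_agents, 1)))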
Go a little further and you will find that the server can simply block your IP address, ignoring all your other tricks. This is a bummer, and it is where most web crawling projects fail.
Overcoming IP Blocks
Investing in a private rotating proxy service like Proxies API is often the difference between a headache-free web scraping project that gets the job done consistently and one that never really works.
Plus, with the current offer of 1000 free API calls, you have almost nothing to lose by trying our rotating proxy and comparing notes. It only takes one line of integration, so it's hardly disruptive.
Our rotating proxy server Proxies API provides a simple API that can solve all IP blocking problems instantly.
Hundreds of our customers have successfully solved the headache of IP blocks with a simple API.
The whole thing can be accessed by a simple API like below in any programming language.
curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"
We have a running offer of 1000 API calls completely free. Register and get your free API Key here.