Web scraping is the process of extracting data from websites automatically through code. It allows gathering information published openly on the web to analyze or use programmatically.
A common use case is scraping article headlines and links from news sites like The New York Times to perform text analysis or feed into machine learning models. Instead of laboriously copying content by hand, web scraping makes this fast and easy.
In this beginner R tutorial, we'll walk through a simple example of scraping the main New York Times page to extract article titles and links into R for further processing.
Prerequisites
Before running the code, some packages need to be installed:
install.packages(c("rvest", "httr"))
Load the libraries:
library(rvest)
library(httr)
Making HTTP Requests with R
We first need to download the New York Times HTML page content into R to search through it. This requires sending an HTTP GET request from R:
url <- 'https://www.nytimes.com/'
headers <- add_headers("User-Agent" = "Mozilla/5.0")
response <- GET(url, headers)
Here we define the target URL, attach a User-Agent header so the request looks like it comes from a browser, and send an HTTP GET request.
We'll check the status code to confirm success:
if (status_code(response) == 200) {
  # Continue scraping
} else {
  print("Request failed")
}
HTTP status 200 indicates the request and page load worked properly. Any other code means an error occurred that we need to handle.
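If the request fails intermittently, httr's RETRY() helper can resend it a few times with a pause between attempts. Here is a minimal sketch of that idea, reusing the url and headers objects defined above:

# Retry the GET request up to 3 times, doubling the pause between attempts
response <- RETRY("GET", url, headers, times = 3, pause_base = 2)

if (status_code(response) != 200) {
  stop("Request failed with status ", status_code(response))
}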
Parsing the Page Content in R
With the HTML content now stored in the response object, we leverage rvest to parse and search through it.
page_content <- content(response, "text", encoding = "UTF-8")
page <- read_html(page_content)
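As a quick sanity check that the HTML parsed correctly, we can pull out the page title:

# Print the page's <title> text to confirm the parse worked
print(html_text(html_node(page, "title")))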
Next we need to find which elements on the page contain the article titles and links to extract. Viewing the page source helps us see where the article content sits.
Inspecting the page
We now use Chrome's Inspect Element tool to see how the page's HTML is structured.
You can see that the articles are contained inside section tags with the class story-wrapper.
We grab all such sections:
article_sections <- html_nodes(page, "section.story-wrapper")
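It is worth checking how many sections the selector matched; if this returns zero, the class names have likely changed and the selector needs updating:

# Number of story sections matched by the CSS selector
length(article_sections)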
Extracting Article Data
To extract the title and link from each section, we loop through the results:
for (section in article_sections) {
  title <- html_node(section, "h3.indicate-hover")
  link <- html_node(section, "a.css-9mylee")

  if (!is.na(title) && !is.na(link)) {
    article_title <- html_text(title)
    article_url <- html_attr(link, "href")
    print(article_title)
    print(article_url)
  }
}
Here we first find the specific nodes for title and link using CSS selectors, then extract the text and attribute values if they exist.
Finally, we print the results; in a real system we would likely store and process them further.
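As an alternative to the loop, rvest's selector functions are vectorized, so the same extraction can be written without explicit iteration. This is a sketch assuming the same selectors as above:

# Vectorized extraction: html_node() returns one (possibly missing) match per section
titles <- html_text(html_node(article_sections, "h3.indicate-hover"))
links <- html_attr(html_node(article_sections, "a.css-9mylee"), "href")

# Combine into a data frame and drop rows where either piece is missing
articles <- data.frame(title = titles, link = links, stringsAsFactors = FALSE)
articles <- articles[!is.na(articles$title) & !is.na(articles$link), ]
head(articles)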
Potential Challenges
Some potential issues to be aware of: parts of the page may be rendered by JavaScript and therefore invisible to rvest, class names like css-9mylee can change whenever the site is redesigned, and the site may throttle or block requests that do not look like they come from a real browser.
If JavaScript rendering causes problems, R packages like RSelenium can help by simulating a real browser environment.
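For reference, here is a minimal sketch of how RSelenium could hand a JavaScript-rendered page back to rvest. It assumes a compatible browser and driver are installed, and the port number is arbitrary:

library(RSelenium)
library(rvest)

# Start a Selenium-driven browser session
rD <- rsDriver(browser = "firefox", port = 4545L, verbose = FALSE)
remDr <- rD$client

# Load the page and give the JavaScript a few seconds to render
remDr$navigate("https://www.nytimes.com/")
Sys.sleep(5)

# Hand the fully rendered HTML over to rvest
page <- read_html(remDr$getPageSource()[[1]])

# Shut down the browser and the Selenium server
remDr$close()
rD$server$stop()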
Next Steps
The rvest package provides a wide toolkit for scraping many types of sites. From here you could extract more fields such as summaries or timestamps, scrape additional pages, or store the results for further analysis.
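For instance, assuming the titles and links have been collected into vectors as in the full script below, they could be combined into a data frame and saved to CSV for later analysis:

# Combine the scraped vectors into a data frame and write it to disk
articles <- data.frame(title = article_titles, link = article_links, stringsAsFactors = FALSE)
write.csv(articles, "nyt_articles.csv", row.names = FALSE)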
Hopefully this gives a glimpse into the possibilities once you can easily acquire website data into R at scale!
Full R Code
Here is the complete runnable script:
# Load the required libraries
library(httr)
library(rvest)

# URL of The New York Times website
url <- 'https://www.nytimes.com/'

# Define a user-agent header to simulate a browser request
headers <- add_headers(
  "User-Agent" = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
)

# Send an HTTP GET request to the URL
response <- GET(url, headers)

# Check if the request was successful (status code 200)
if (status_code(response) == 200) {
  # Parse the HTML content of the page
  page_content <- content(response, "text", encoding = "UTF-8")
  page <- read_html(page_content)

  # Find all article sections with class 'story-wrapper'
  article_sections <- html_nodes(page, "section.story-wrapper")

  # Initialize character vectors to store the article titles and links
  article_titles <- character(0)
  article_links <- character(0)

  # Iterate through the article sections
  for (article_section in article_sections) {
    # Look for the article title element
    title_element <- html_node(article_section, "h3.indicate-hover")
    # Look for the article link element
    link_element <- html_node(article_section, "a.css-9mylee")

    # If both title and link are found, extract and append
    if (!is.na(title_element) && !is.na(link_element)) {
      article_title <- html_text(title_element)
      article_link <- html_attr(link_element, "href")
      article_titles <- c(article_titles, article_title)
      article_links <- c(article_links, article_link)
    }
  }

  # Print or process the extracted article titles and links
  for (i in seq_along(article_titles)) {
    cat("Title:", article_titles[i], "\n")
    cat("Link:", article_links[i], "\n\n")
  }
} else {
  cat("Failed to retrieve the web page. Status code:", status_code(response), "\n")
}
In more advanced implementations you may even need to rotate the User-Agent string so the website can't tell it's the same client making every request!
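One simple approach is to keep a small pool of User-Agent strings and pick one at random for each request; a rough sketch (the strings here are only illustrative):

# A small pool of User-Agent strings (illustrative values only)
user_agents <- c(
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.1 Safari/605.1.15",
  "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"
)

# Pick a random User-Agent for this request
response <- GET(url, add_headers("User-Agent" = sample(user_agents, 1)))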
Go a little further and you will find that the server can simply block your IP address, ignoring all your other tricks. This is a bummer, and it is where most web crawling projects fail.
Overcoming IP Blocks
Investing in a private rotating proxy service like Proxies API is often the difference between a headache-free web scraping project that gets the job done consistently and one that never really works.
Plus, with the current offer of 1000 free API calls, you have almost nothing to lose by trying our rotating proxy and comparing notes. It only takes one line of integration, so it's hardly disruptive.
Our rotating proxy server Proxies API provides a simple API that can solve all IP blocking problems instantly.
Hundreds of our customers have successfully solved the headache of IP blocks with a simple API.
The whole thing can be accessed by a simple API like below in any programming language.
curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"
We have a running offer of 1000 API calls completely free. Register and get your free API Key here.