Web scraping is the process of extracting data from websites automatically. In this article, we'll walk through code that scrapes article titles and links from the New York Times homepage using Node.js modules like request and cheerio.
Why Scrape the New York Times?
The New York Times publishes a large volume of high-quality content every day across news, opinion, arts, living, and more. Scraping the site allows you to extract and store this content to power other applications. For example, you could build a news aggregator, track how often certain topics appear over time, or trigger alerts when specific keywords show up in headlines.
The possibilities are vast once you have structured data from a site like The Times!
Step 1: Import Needed Modules
Let's walk through the code section-by-section. First we import the modules we'll need:
const request = require('request'); // for sending HTTP requests
const cheerio = require('cheerio'); // for selecting/parsing HTML
const fs = require('fs'); // for writing to the filesystem
We use the request module to send the HTTP request to fetch the Times homepage.
Cheerio allows us to select elements in the HTML of the page, kind of like jQuery.
The fs module is used at the end for writing the scraped data to a JSON file.
Step 2: Define the URL and Request Options
Next we set the URL to scrape and define some options for our HTTP request:
// NYTimes URL
const url = 'https://www.nytimes.com/';

// Request settings
const options = {
  url: url,
  headers: {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
  }
};
Here we are scraping the main nytimes.com homepage URL.
We also set a custom User-Agent header to mimic a real web browser, which helps avoid being flagged and blocked as a bot.
Step 3: Send the Request and Load HTML
With our URL and options defined, we use request to grab the page HTML:
// Send request
request(options, (err, res, html) => {
  // Check for errors
  if (err) {
    console.log('Error:', err);
    return;
  }
  if (res.statusCode !== 200) {
    console.log('Status:', res.statusCode);
    return;
  }

  // Load HTML
  let $;
  try {
    $ = cheerio.load(html);
  } catch (err) {
    console.log('Cheerio error:', err);
    return;
  }
We pass our options to request and provide a callback to handle the response. We first bail out on a network error or a non-200 status code.
Then we use cheerio's load method to parse the HTML string into a Cheerio object that we can query (stored in $).
This gives us jQuery-style selectors to extract data.
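To see what that looks like in isolation, here is a minimal, self-contained sketch. The sample HTML below is made up for illustration only; it is not the real Times markup:

const cheerio = require('cheerio');

// Made-up sample HTML for demonstration purposes only
const sampleHtml = `
  <section class="story-wrapper">
    <a href="/2023/example-story.html"><h3>Example Headline</h3></a>
  </section>`;

const $ = cheerio.load(sampleHtml);

// jQuery-style selection: tag plus class, then descend into children
console.log($('section.story-wrapper').length);         // 1
console.log($('section.story-wrapper h3').text());      // Example Headline
console.log($('section.story-wrapper a').attr('href')); // /2023/example-story.html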
Step 4: Define Variables and Select Elements
Now that we have the page loaded, we can start extracting the data we want - article titles and links:
Inspecting the Page
We inspect the page with Chrome DevTools to see how the HTML is structured.
You can see that the articles are contained inside section tags with the class story-wrapper.
// Initialize variables
let titles = [];
let links = [];

// Select articles
$('section.story-wrapper').each(function() {
  // Get data
  const title = $(this).find('h3').text().trim();
  const link = $(this).find('a').attr('href');

  // Validate
  if (!title || !link) {
    return;
  }

  // Save data
  titles.push(title);
  links.push(link);
});
First we define two arrays to store the info we scrape.
We then select all section.story-wrapper elements and, for each one, grab the h3 title text and the href of the first link inside it.
We do some validation to make sure we have both a title and a link before saving them to our arrays.
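One thing to watch out for: href attributes can be relative rather than absolute. Links on the Times homepage are typically absolute already, but if you adapt this scraper to another site, a small helper like the hypothetical sketch below (using Node's built-in URL class) keeps your links usable:

// Hypothetical helper: resolve a possibly-relative href against the page URL
function toAbsoluteUrl(href, base) {
  try {
    return new URL(href, base).href; // Node's built-in WHATWG URL class
  } catch (e) {
    return null; // href was not a usable URL
  }
}

// Usage inside the .each() loop:
// const link = toAbsoluteUrl($(this).find('a').attr('href'), url);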
Step 5: Log and Store the Scraped Data
With our titles and links arrays populated, we wrap up by logging and storing the data:
// Check for no articles
if (titles.length === 0) {
  console.log('No articles found');
  return;
}

// Log articles
titles.forEach((title, i) => {
  console.log('Title:', title);
  console.log('Link:', links[i]);
  console.log();
});

// Write to file
fs.writeFileSync('articles.json', JSON.stringify({ titles, links }));
First we make sure we actually captured articles by checking the length.
Then we log each title/link pair in the console so you can validate it worked.
Finally, we use fs to write the data to a JSON file for later use.
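Because the file is plain JSON, reading it back later is straightforward. A quick sketch, assuming the articles.json file written above exists in the working directory:

const fs = require('fs');

// Parse the saved data back into an object
const data = JSON.parse(fs.readFileSync('articles.json', 'utf8'));

data.titles.forEach((title, i) => {
  console.log(`${i + 1}. ${title} -> ${data.links[i]}`);
});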
And that's it! Here is the full code for reference:
// Import modules
const request = require('request');
const cheerio = require('cheerio');
const fs = require('fs');

// NYTimes URL
const url = 'https://www.nytimes.com/';

// Request settings
const options = {
  url: url,
  headers: {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
  }
};

// Send request
request(options, (err, res, html) => {
  // Check errors
  if (err) {
    console.log('Error:', err);
    return;
  }
  if (res.statusCode !== 200) {
    console.log('Status:', res.statusCode);
    return;
  }

  // Load HTML
  let $;
  try {
    $ = cheerio.load(html);
  } catch (err) {
    console.log('Cheerio error:', err);
    return;
  }

  // Initialize variables
  let titles = [];
  let links = [];

  // Select articles
  $('section.story-wrapper').each(function() {
    // Get data
    const title = $(this).find('h3').text().trim();
    const link = $(this).find('a').attr('href');

    // Validate
    if (!title || !link) {
      return;
    }

    // Save data
    titles.push(title);
    links.push(link);
  });

  // Check if no articles
  if (titles.length === 0) {
    console.log('No articles found');
    return;
  }

  // Log articles
  titles.forEach((title, i) => {
    console.log('Title:', title);
    console.log('Link:', links[i]);
    console.log();
  });

  // Write to file
  fs.writeFileSync('articles.json', JSON.stringify({ titles, links }));
});
Next Steps
With this foundation, you can now extend the scraper to other sections of the site, schedule it to run regularly and build a headline archive, or store the results in a database instead of a flat file.
The key is that you now understand how to programmatically grab data from a site using Node.js. The possibilities are endless :)
In more advanced implementations, you will even need to rotate the User-Agent string so the website can't tell it's the same browser making every request!
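A simple way to do that is to keep a small pool of User-Agent strings and pick one at random per request. A minimal sketch (the strings below are illustrative; maintain your own up-to-date list):

// Illustrative pool of User-Agent strings
const userAgents = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
  'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
];

const options = {
  url: 'https://www.nytimes.com/',
  headers: {
    // Pick a different User-Agent on every request
    'User-Agent': userAgents[Math.floor(Math.random() * userAgents.length)]
  }
};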
If we get a little more advanced, you will realize that the server can simply block your IP address, ignoring all your other tricks. This is a bummer, and it is where most web crawling projects fail.
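In practice, a block often shows up as an HTTP 403 (Forbidden) or 429 (Too Many Requests) response. A rough sketch of detecting that in our existing callback:

request(options, (err, res, html) => {
  if (err) {
    console.log('Request failed:', err);
    return;
  }
  // 403/429 usually means you have been blocked or rate-limited
  if (res.statusCode === 403 || res.statusCode === 429) {
    console.log('Likely blocked (status ' + res.statusCode + ')');
    return;
  }
  // ...continue scraping as before
});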
Overcoming IP Blocks
Investing in a private rotating proxy service like Proxies API can, most of the time, make the difference between a successful, headache-free web scraping project that gets the job done consistently and one that never really works.
Plus, with the running offer of 1000 free API calls, you have almost nothing to lose by trying our rotating proxy and comparing notes. It takes only one line of integration, so it's hardly disruptive.
Our rotating proxy service Proxies API provides a simple API that can solve all IP blocking problems instantly.
Hundreds of our customers have solved the headache of IP blocks with this simple API.
The whole thing can be accessed from any programming language with a call like the one below.
curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"
We have a running offer of 1000 API calls completely free. Register and get your free API Key here.