Web scraping is a technique for extracting data from websites automatically. It can be useful for collecting large volumes of data for analysis. In this guide, we'll walk through a program to scrape article titles and links from The New York Times using C++.
Key Concepts
To follow along, you'll need a basic understanding of:
- C++ (and how to compile a program that links external libraries)
- HTTP (what a GET request does)
- HTML (tags, attributes, and how elements nest)
Don't worry if you're unfamiliar with these! We'll explain each piece as we go.
Step 1: Send an HTTP Request and Get the HTML
We'll use the libcurl library to send an HTTP GET request to fetch the NYTimes HTML:
CURL* curl = curl_easy_init();
curl_easy_setopt(curl, CURLOPT_URL, "https://www.nytimes.com/");
// Store response in string
std::string html;
curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, WriteCallback);
curl_easy_setopt(curl, CURLOPT_WRITEDATA, &html);
curl_easy_perform(curl);
curl_easy_cleanup(curl);
This makes a request just like your browser does, except instead of rendering the HTML, we save it as a string to parse later.
Note: curl delivers the response in chunks, and we use a callback function to accumulate them into the string. I won't cover the mechanics in depth here, but the curl docs explain them well.
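For reference, here's the callback itself; it's the same one used in the full code at the bottom:

size_t WriteCallback(void* contents, size_t size, size_t nmemb, void* userp) {
    // curl hands us size * nmemb bytes per call; append them to our string
    static_cast<std::string*>(userp)->append(static_cast<char*>(contents), size * nmemb);
    return size * nmemb;
}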
Step 2: Parse the HTML
Next we'll use the Gumbo HTML parser to analyze the HTML content:
GumboOutput* output = gumbo_parse(html.c_str());
This converts the HTML string into a structured format we can traverse programmatically.
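Gumbo allocates a tree of GumboNode structures rooted at output->root. When we're done traversing it, we hand the tree back so it can be freed:

GumboOutput* output = gumbo_parse(html.c_str());
// ... traverse output->root ...
gumbo_destroy_output(&kGumboDefaultOptions, output);  // free the parse tree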
Step 3: Find Article Elements
Inspecting the page
Open Chrome DevTools (right-click a headline and choose Inspect) to see how the page is structured.
You can see that each article is contained inside a section tag with the class story-wrapper.
We want to extract articles specifically, so these section elements are what we'll search for.
We'll recursively walk the parsed HTML tree to find them:
// Recursively search the tree for <section class="story-wrapper"> elements
void FindArticles(GumboNode* node) {
    if (node->type != GUMBO_NODE_ELEMENT) {
        return;
    }
    if (node->v.element.tag == GUMBO_TAG_SECTION) {
        GumboAttribute* class_attr = gumbo_get_attribute(&node->v.element.attributes, "class");
        if (class_attr && strcmp(class_attr->value, "story-wrapper") == 0) {
            // Found an article element
        }
    }
    // Gumbo stores children in a plain GumboVector, so we index into it manually
    GumboVector* children = &node->v.element.children;
    for (unsigned int i = 0; i < children->length; ++i) {
        FindArticles(static_cast<GumboNode*>(children->data[i]));
    }
}
Here we:
- Skip anything that isn't an element node
- Check whether each element is a section tag
- Fetch its class attribute with gumbo_get_attribute
- Check whether the class equals "story-wrapper"
- Recurse into the element's children so we cover the whole tree
This filters the thousands of elements on the page down to just the articles.
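To kick the search off, call it on the parsed document's root element:

FindArticles(output->root);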
Step 4: Extract Titles and Links
Now we can dig into our found article elements to get the title and link. These are stored in specific child elements we can search for:
// Within a found article element
GumboVector* children = &node->v.element.children;
for (unsigned int i = 0; i < children->length; ++i) {
    GumboNode* inner = static_cast<GumboNode*>(children->data[i]);
    if (inner->type != GUMBO_NODE_ELEMENT) continue;
    if (inner->v.element.tag == GUMBO_TAG_H2 &&
        inner->v.element.children.length > 0) {
        // Title element: the text lives in the <h2>'s first child text node
        GumboNode* text = static_cast<GumboNode*>(inner->v.element.children.data[0]);
        if (text->type == GUMBO_NODE_TEXT) {
            std::string title = text->v.text.text;
        }
    } else if (inner->v.element.tag == GUMBO_TAG_A) {
        // Link element: read the href attribute
        GumboAttribute* href = gumbo_get_attribute(&inner->v.element.attributes, "href");
        if (href) {
            std::string url = href->value;
        }
    }
}
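One caveat: on real pages, headline text is often nested inside extra wrapper elements rather than sitting directly under the <h2>. A small recursive helper (a sketch, not part of the program above) collects all text beneath a node:

// Sketch: concatenate all text found anywhere beneath a node
std::string CollectText(GumboNode* node) {
    if (node->type == GUMBO_NODE_TEXT) {
        return node->v.text.text;
    }
    if (node->type != GUMBO_NODE_ELEMENT) {
        return "";
    }
    std::string result;
    GumboVector* children = &node->v.element.children;
    for (unsigned int i = 0; i < children->length; ++i) {
        result += CollectText(static_cast<GumboNode*>(children->data[i]));
    }
    return result;
}

You could then swap the direct text-node lookup above for CollectText(inner).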
And we have our data! The full code at the bottom puts this all together into a program that prints out titles and links.
Key Takeaways
The scraping process mainly involves:
- Getting HTML data
- Parsing into structured format
- Traversing parsed DOM to extract relevant data
There are lots of possible refinements, but this covers the core technique. You could build on it to fetch full article content, or add caching, for example (see the sketch below).
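As a taste of the caching idea, here's a minimal sketch of a file-based cache. FetchHtml is a hypothetical wrapper around the curl code from Step 1, and the cache path is whatever you choose:

#include <fstream>
#include <sstream>

// Sketch: reuse a previously saved response instead of re-fetching.
// FetchHtml is a hypothetical wrapper around the curl code in Step 1.
std::string FetchWithCache(const std::string& url, const std::string& cache_path) {
    std::ifstream cached(cache_path);
    if (cached) {
        std::ostringstream ss;
        ss << cached.rdbuf();
        return ss.str();  // cache hit: skip the network entirely
    }
    std::string html = FetchHtml(url);  // cache miss: fetch and save
    std::ofstream out(cache_path);
    out << html;
    return html;
}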
Hope this gives you a template for getting started with scraping in C++!
Full Code
#include <cstring>
#include <iostream>
#include <string>
#include <curl/curl.h>
#include <gumbo.h>

// Callback passed to curl to accumulate response data
size_t WriteCallback(void* contents, size_t size, size_t nmemb, void* userp) {
    static_cast<std::string*>(userp)->append(static_cast<char*>(contents), size * nmemb);
    return size * nmemb;
}

// Print the title and link found inside one article element
void ExtractArticle(GumboNode* node) {
    GumboVector* children = &node->v.element.children;
    for (unsigned int i = 0; i < children->length; ++i) {
        GumboNode* inner = static_cast<GumboNode*>(children->data[i]);
        if (inner->type != GUMBO_NODE_ELEMENT) {
            continue;
        }
        if (inner->v.element.tag == GUMBO_TAG_H2 &&
            inner->v.element.children.length > 0) {
            // Title: first text node inside the <h2>
            GumboNode* text = static_cast<GumboNode*>(inner->v.element.children.data[0]);
            if (text->type == GUMBO_NODE_TEXT) {
                std::cout << text->v.text.text << std::endl;
            }
        } else if (inner->v.element.tag == GUMBO_TAG_A) {
            // Link: read the href attribute
            GumboAttribute* href = gumbo_get_attribute(&inner->v.element.attributes, "href");
            if (href) {
                std::cout << href->value << std::endl << std::endl;
            }
        }
    }
}

// Recursively search the tree for <section class="story-wrapper"> elements
void FindArticles(GumboNode* node) {
    if (node->type != GUMBO_NODE_ELEMENT) {
        return;
    }
    if (node->v.element.tag == GUMBO_TAG_SECTION) {
        GumboAttribute* class_attr = gumbo_get_attribute(&node->v.element.attributes, "class");
        if (class_attr && strcmp(class_attr->value, "story-wrapper") == 0) {
            ExtractArticle(node);
        }
    }
    GumboVector* children = &node->v.element.children;
    for (unsigned int i = 0; i < children->length; ++i) {
        FindArticles(static_cast<GumboNode*>(children->data[i]));
    }
}

int main() {
    // Fetch HTML
    CURL* curl = curl_easy_init();
    if (!curl) {
        std::cerr << "Failed to initialize curl" << std::endl;
        return 1;
    }
    std::string html;
    curl_easy_setopt(curl, CURLOPT_URL, "https://www.nytimes.com/");
    curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, WriteCallback);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, &html);
    CURLcode res = curl_easy_perform(curl);
    if (res != CURLE_OK) {
        std::cerr << "curl_easy_perform() failed: " << curl_easy_strerror(res) << std::endl;
    }
    curl_easy_cleanup(curl);

    // Parse HTML and walk the tree for articles
    GumboOutput* output = gumbo_parse(html.c_str());
    FindArticles(output->root);
    gumbo_destroy_output(&kGumboDefaultOptions, output);
    return 0;
}
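To build the program, link against both libraries. On a typical Linux setup with libcurl and Gumbo installed in the default search paths, something like this should work:

g++ scraper.cpp -o scraper -lcurl -lgumbo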
In more advanced implementations, you may also need to rotate the User-Agent string so the website can't tell the requests come from the same client.
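Here's a minimal sketch of User-Agent rotation using libcurl's CURLOPT_USERAGENT option; the strings below are illustrative placeholders, not real current browser strings:

#include <cstdlib>
#include <vector>

// Sketch: pick a User-Agent string at random for each request.
// Replace these placeholders with real, current browser strings.
const std::vector<std::string> kUserAgents = {
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
};

void SetRandomUserAgent(CURL* curl) {
    const std::string& ua = kUserAgents[std::rand() % kUserAgents.size()];
    curl_easy_setopt(curl, CURLOPT_USERAGENT, ua.c_str());
}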
Go a little further, though, and you'll find the server can simply block your IP address, no matter what other tricks you use. This is a bummer, and it's where many web scraping projects fail.
Overcoming IP Blocks
Investing in a private rotating proxy service like Proxies API can often make the difference between a successful, headache-free web scraping project that gets the job done consistently and one that never really works.
Plus, with our current offer of 1000 free API calls, you have almost nothing to lose by trying our rotating proxy and comparing notes. Integration takes a single line of code, so it's hardly disruptive.
Our rotating proxy server, Proxies API, provides a simple API that solves IP blocking problems instantly; hundreds of our customers have used it to put the headache of IP blocks behind them.
You can call it from any programming language, like this:
curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"
We have a running offer of 1000 API calls completely free. Register and get your free API Key here.