Web scraping is a technique for extracting data from websites automatically. It can be useful for collecting large volumes of data for analysis. In this guide, we'll walk through a program to scrape article titles and links from The New York Times using C++.
Key Concepts
To follow along, you'll need a basic understanding of:
- C++ (and how to compile a program that links external libraries)
- HTTP (what a GET request does)
- HTML (tags, attributes, and how elements nest)
Don't worry if you're unfamiliar with these! We'll explain each piece as we go.
Step 1: Send an HTTP Request and Get the HTML
We'll use the libcurl library to send an HTTP GET request to fetch the NYTimes HTML:
CURL* curl = curl_easy_init();
curl_easy_setopt(curl, CURLOPT_URL, "https://www.nytimes.com/");
// Store response in string
std::string html;
curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, WriteCallback);
curl_easy_setopt(curl, CURLOPT_WRITEDATA, &html);
curl_easy_perform(curl);
curl_easy_cleanup(curl);
This makes a request just like your browser does, except instead of rendering the HTML, we save it as a string to parse later.
Note: curl delivers the response in chunks, and we use a callback function to accumulate them into the string. I won't cover the mechanics in depth here, but the curl docs explain them well.
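For reference, here's the callback itself; it's the same one used in the full code at the bottom:

size_t WriteCallback(void* contents, size_t size, size_t nmemb, void* userp) {
    // curl hands us size * nmemb bytes per call; append them to our string
    static_cast<std::string*>(userp)->append(static_cast<char*>(contents), size * nmemb);
    return size * nmemb;
}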
Step 2: Parse the HTML
Next we'll use the Gumbo HTML parser to analyze the HTML content:
GumboOutput* output = gumbo_parse(html.c_str());
This converts the HTML string into a structured format we can traverse programmatically.
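Gumbo allocates a tree of GumboNode structures rooted at output->root. When we're done traversing it, we hand the tree back so it can be freed:

GumboOutput* output = gumbo_parse(html.c_str());
// ... traverse output->root ...
gumbo_destroy_output(&kGumboDefaultOptions, output);  // free the parse tree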
Step 3: Find Article Elements
Inspecting the page
Open Chrome DevTools (right-click a headline and choose Inspect) to see how the page is structured.
You can see that each article is contained inside a section tag with the class story-wrapper.
We want to extract articles specifically, so these section elements are what we'll search for.
We'll recursively walk the parsed HTML tree to find them:
// Recursively search the tree for <section class="story-wrapper"> elements
void FindArticles(GumboNode* node) {
    if (node->type != GUMBO_NODE_ELEMENT) {
        return;
    }
    if (node->v.element.tag == GUMBO_TAG_SECTION) {
        GumboAttribute* class_attr = gumbo_get_attribute(&node->v.element.attributes, "class");
        if (class_attr && strcmp(class_attr->value, "story-wrapper") == 0) {
            // Found an article element
        }
    }
    // Gumbo stores children in a plain GumboVector, so we index into it manually
    GumboVector* children = &node->v.element.children;
    for (unsigned int i = 0; i < children->length; ++i) {
        FindArticles(static_cast<GumboNode*>(children->data[i]));
    }
}
Here we:
- Skip anything that isn't an element node
- Check whether each element is a section tag
- Fetch its class attribute with gumbo_get_attribute
- Check whether the class equals "story-wrapper"
- Recurse into the element's children so we cover the whole tree
This filters the thousands of elements on the page down to just the articles.
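To kick the search off, call it on the parsed document's root element:

FindArticles(output->root);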
Step 4: Extract Titles and Links
Now we can dig into our found article elements to get the title and link. These are stored in specific child elements we can search for:
// Within a found article element
GumboVector* children = &node->v.element.children;
for (unsigned int i = 0; i < children->length; ++i) {
    GumboNode* inner = static_cast<GumboNode*>(children->data[i]);
    if (inner->type != GUMBO_NODE_ELEMENT) continue;
    if (inner->v.element.tag == GUMBO_TAG_H2 &&
        inner->v.element.children.length > 0) {
        // Title element: the text lives in the <h2>'s first child text node
        GumboNode* text = static_cast<GumboNode*>(inner->v.element.children.data[0]);
        if (text->type == GUMBO_NODE_TEXT) {
            std::string title = text->v.text.text;
        }
    } else if (inner->v.element.tag == GUMBO_TAG_A) {
        // Link element: read the href attribute
        GumboAttribute* href = gumbo_get_attribute(&inner->v.element.attributes, "href");
        if (href) {
            std::string url = href->value;
        }
    }
}
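One caveat: on real pages, headline text is often nested inside extra wrapper elements rather than sitting directly under the <h2>. A small recursive helper (a sketch, not part of the program above) collects all text beneath a node:

// Sketch: concatenate all text found anywhere beneath a node
std::string CollectText(GumboNode* node) {
    if (node->type == GUMBO_NODE_TEXT) {
        return node->v.text.text;
    }
    if (node->type != GUMBO_NODE_ELEMENT) {
        return "";
    }
    std::string result;
    GumboVector* children = &node->v.element.children;
    for (unsigned int i = 0; i < children->length; ++i) {
        result += CollectText(static_cast<GumboNode*>(children->data[i]));
    }
    return result;
}

You could then swap the direct text-node lookup above for CollectText(inner).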
And we have our data! The full code at the bottom puts this all together into a program that prints out titles and links.
Key Takeaways
The scraping process mainly involves:
- Getting HTML data
- Parsing into structured format
- Traversing parsed DOM to extract relevant data
There are lots of possible refinements, but this covers the core technique. You could build on it to fetch full article content, or add caching, for example (see the sketch below).
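As a taste of the caching idea, here's a minimal sketch of a file-based cache. FetchHtml is a hypothetical wrapper around the curl code from Step 1, and the cache path is whatever you choose:

#include <fstream>
#include <sstream>

// Sketch: reuse a previously saved response instead of re-fetching.
// FetchHtml is a hypothetical wrapper around the curl code in Step 1.
std::string FetchWithCache(const std::string& url, const std::string& cache_path) {
    std::ifstream cached(cache_path);
    if (cached) {
        std::ostringstream ss;
        ss << cached.rdbuf();
        return ss.str();  // cache hit: skip the network entirely
    }
    std::string html = FetchHtml(url);  // cache miss: fetch and save
    std::ofstream out(cache_path);
    out << html;
    return html;
}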
Hope this gives you a template for getting started with scraping in C++!
Full Code
#include <cstring>
#include <iostream>
#include <string>
#include <curl/curl.h>
#include <gumbo.h>

// Callback passed to curl to accumulate response data
size_t WriteCallback(void* contents, size_t size, size_t nmemb, void* userp) {
    static_cast<std::string*>(userp)->append(static_cast<char*>(contents), size * nmemb);
    return size * nmemb;
}

// Print the title and link found inside one article element
void ExtractArticle(GumboNode* node) {
    GumboVector* children = &node->v.element.children;
    for (unsigned int i = 0; i < children->length; ++i) {
        GumboNode* inner = static_cast<GumboNode*>(children->data[i]);
        if (inner->type != GUMBO_NODE_ELEMENT) {
            continue;
        }
        if (inner->v.element.tag == GUMBO_TAG_H2 &&
            inner->v.element.children.length > 0) {
            // Title: first text node inside the <h2>
            GumboNode* text = static_cast<GumboNode*>(inner->v.element.children.data[0]);
            if (text->type == GUMBO_NODE_TEXT) {
                std::cout << text->v.text.text << std::endl;
            }
        } else if (inner->v.element.tag == GUMBO_TAG_A) {
            // Link: read the href attribute
            GumboAttribute* href = gumbo_get_attribute(&inner->v.element.attributes, "href");
            if (href) {
                std::cout << href->value << std::endl << std::endl;
            }
        }
    }
}

// Recursively search the tree for <section class="story-wrapper"> elements
void FindArticles(GumboNode* node) {
    if (node->type != GUMBO_NODE_ELEMENT) {
        return;
    }
    if (node->v.element.tag == GUMBO_TAG_SECTION) {
        GumboAttribute* class_attr = gumbo_get_attribute(&node->v.element.attributes, "class");
        if (class_attr && strcmp(class_attr->value, "story-wrapper") == 0) {
            ExtractArticle(node);
        }
    }
    GumboVector* children = &node->v.element.children;
    for (unsigned int i = 0; i < children->length; ++i) {
        FindArticles(static_cast<GumboNode*>(children->data[i]));
    }
}

int main() {
    // Fetch HTML
    CURL* curl = curl_easy_init();
    if (!curl) {
        std::cerr << "Failed to initialize curl" << std::endl;
        return 1;
    }
    std::string html;
    curl_easy_setopt(curl, CURLOPT_URL, "https://www.nytimes.com/");
    curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, WriteCallback);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, &html);
    CURLcode res = curl_easy_perform(curl);
    if (res != CURLE_OK) {
        std::cerr << "curl_easy_perform() failed: " << curl_easy_strerror(res) << std::endl;
    }
    curl_easy_cleanup(curl);

    // Parse HTML and walk the tree for articles
    GumboOutput* output = gumbo_parse(html.c_str());
    FindArticles(output->root);
    gumbo_destroy_output(&kGumboDefaultOptions, output);
    return 0;
}
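To build the program, link against both libraries. On a typical Linux setup with libcurl and Gumbo installed in the default search paths, something like this should work:

g++ scraper.cpp -o scraper -lcurl -lgumbo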
In more advanced implementations, you may also need to rotate the User-Agent string so the website can't tell the requests come from the same client.
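Here's a minimal sketch of User-Agent rotation using libcurl's CURLOPT_USERAGENT option; the strings below are illustrative placeholders, not real current browser strings:

#include <cstdlib>
#include <vector>

// Sketch: pick a User-Agent string at random for each request.
// Replace these placeholders with real, current browser strings.
const std::vector<std::string> kUserAgents = {
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
};

void SetRandomUserAgent(CURL* curl) {
    const std::string& ua = kUserAgents[std::rand() % kUserAgents.size()];
    curl_easy_setopt(curl, CURLOPT_USERAGENT, ua.c_str());
}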
Go a little further, though, and you'll find the server can simply block your IP address, no matter what other tricks you use. This is a bummer, and it's where many web scraping projects fail.
Overcoming IP Blocks
Investing in a private rotating proxy service like Proxies API can often make the difference between a successful, headache-free web scraping project that gets the job done consistently and one that never really works.
Plus, with our current offer of 1000 free API calls, you have almost nothing to lose by trying our rotating proxy and comparing notes. Integration takes a single line of code, so it's hardly disruptive.
Our rotating proxy server, Proxies API, provides a simple API that solves IP blocking problems instantly; hundreds of our customers have used it to put the headache of IP blocks behind them.
You can call it from any programming language, like this:
curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"
We have a running offer of 1000 API calls completely free. Register and get your free API Key here.