Web scraping is a cool way to gather a ton of data from websites by using a bit of code. This friendly guide dives into how you can use it with high-performance C++.
We'll touch on the basics like HTTP requests and HTML parsing, and even take a look at some important libraries.
There's even a fun tutorial where we'll scrape Wikipedia together! We'll also tackle real-world challenges you might face, such as anti-bot defenses and scraping at scale.
Is C++ a Good Language for Web Scraping?
C++ excels in web scraping due to its speed, efficiency, and integration with various libraries and tools. Its benefits include:
- Raw speed: compiled native code chews through large volumes of pages quickly
- Fine-grained control over memory, threads and network resources
- A mature ecosystem of networking and parsing libraries
While Python is simpler and faster to write, C++ may be preferable for large-scale scraping due to its performance.
Best C++ Web Scraping Libraries
Here are popular C++ scraping libraries:
- cpp-httplib: a lightweight, header-only HTTP client
- pugixml: a fast XML parser with XPath queries
- selector-lib: CSS-style element selection for HTML
Other options include libcurl and Poco.
Prerequisites
To follow along with the web scraping example, you will need:
C++ compiler
This scraping code uses C++17 features so you need a modern C++ compiler like GCC 8+, Clang 6+ or MSVC 2019+.
cpp-httplib
We will use this library for the HTTP client and networking. cpp-httplib is header-only, so you can simply copy httplib.h into your project; for HTTPS targets like Wikipedia, define CPPHTTPLIB_OPENSSL_SUPPORT and link OpenSSL when compiling. To build via CMake instead:
git clone https://github.com/yhirose/cpp-httplib.git
cd cpp-httplib
cmake -B build -S .
cmake --build build
pugixml
For fast parsing and XPath queries, we rely on pugixml, which you can set up as follows (or simply compile the bundled pugixml.cpp alongside your own code):
git clone https://github.com/zeux/pugixml
cd pugixml
cmake -B build -S .
cmake --build build
selector-lib
Optionally, to select elements with CSS-style syntax instead of raw XPath, you can add selector-lib:
git clone https://github.com/amiremohamadi/selector-lib.git
cd selector-lib
cmake -B build -S .
cmake --build build
That covers the external dependencies needed to run the scraper code shown later.
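With those in place, a typical compile command looks something like the following. The paths are illustrative, so adjust them to wherever you cloned the libraries; OpenSSL is linked because Wikipedia is served over HTTPS:

g++ -std=c++17 scraper.cpp pugixml/src/pugixml.cpp \
    -I cpp-httplib -I pugixml/src \
    -lssl -lcrypto -o scraper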
Let's pick a target website
For this web scraping tutorial, we will scrape Wikipedia's list of dog breeds to extract information like names, breed groups, alternative names and images for various breeds.
The reasons this page makes a good scraping target are:
- All the data sits in one well-structured, sortable HTML table
- The markup is stable and consistent from row to row
- The content is public and needs no login or JavaScript rendering
You could also try scraping other Wikipedia lists, news sites, blogs or really any site with data you want to collect.
For now, this is the page we are talking about: https://en.wikipedia.org/wiki/List_of_dog_breeds
Write the scraping code
Below is the full code, which we will then walk through piece by piece to scrape the dog breeds page.
// Includes
#define CPPHTTPLIB_OPENSSL_SUPPORT // enable HTTPS (requires linking OpenSSL)
#include <httplib.h>
#include <pugixml.hpp>
#include <filesystem>
#include <fstream>
#include <string>
#include <vector>

int main() {
    // Vectors to store data
    std::vector<std::string> names;
    std::vector<std::string> groups;
    std::vector<std::string> localNames;
    std::vector<std::string> photographs;

    // HTTP client
    httplib::Client cli("https://en.wikipedia.org");

    // Send request
    auto res = cli.Get("/wiki/List_of_dog_breeds",
        {{"User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36"}});

    if (res && res->status == 200) {
        // Parse HTML (pugixml is a strict XML parser; if the page isn't
        // well-formed, run it through an HTML-to-XML cleaner first)
        pugi::xml_document doc;
        if (!doc.load_string(res->body.c_str())) return 1;

        // Find table
        auto table = doc.select_node("//table[contains(@class, 'wikitable') and contains(@class, 'sortable')]").node();

        // Make sure the image directory exists
        std::filesystem::create_directory("dog_images");

        // Iterate rows
        for (auto row : table.select_nodes(".//tr")) {
            // Skip header rows, which contain no td cells
            if (!row.node().child("td")) continue;

            // Get cells (the breed name lives in a th, the rest in td)
            auto cells = row.node().select_nodes("th|td");
            if (cells.size() < 4) continue;

            // Extract data
            std::string name = cells[0].node().child("a").text().get();
            std::string group = cells[1].node().text().get();
            std::string localName = cells[2].node().select_node("span").node().text().get();
            std::string photograph = cells[3].node().select_node(".//img").node().attribute("src").value();

            // Download image (the src is protocol-relative, e.g. //upload.wikimedia.org/...)
            const std::string imgHost = "//upload.wikimedia.org";
            if (photograph.rfind(imgHost, 0) == 0) {
                httplib::Client imgCli("https://upload.wikimedia.org");
                auto imgRes = imgCli.Get(photograph.substr(imgHost.size()).c_str());
                if (imgRes && imgRes->status == 200) {
                    std::ofstream file("dog_images/" + name + ".jpg", std::ios::binary);
                    file << imgRes->body;
                }
            }

            // Store data
            names.push_back(name);
            groups.push_back(group);
            localNames.push_back(localName);
            photographs.push_back(photograph);
        }
    }
}
Let's break this down section by section to understand what it's doing behind the scenes.
The includes
We start by including the necessary libraries:
#define CPPHTTPLIB_OPENSSL_SUPPORT // HTTPS support (link OpenSSL)
#include <httplib.h>   // cpp-httplib HTTP client
#include <pugixml.hpp> // pugixml parsing and XPath queries
#include <filesystem>  // create the image directory
#include <fstream>     // file IO
#include <string>      // std::string
#include <vector>      // dynamic arrays
No tricky setup needed here; we just include what we need to scrape.
Downloading the page
Next we set up the HTTP client and make the request:
// HTTP client
httplib::Client cli("https://en.wikipedia.org");

// Send request
auto res = cli.Get("/wiki/List_of_dog_breeds",
    {{"User-Agent", "cpp-httplib"}});
Here we:
- Create an httplib::Client pointed at https://en.wikipedia.org (HTTPS needs the CPPHTTPLIB_OPENSSL_SUPPORT define from the includes)
- Send a GET request for the /wiki/List_of_dog_breeds path
- Attach a User-Agent header to the request
Many sites these days block vague requests that lack a browser-like user agent, so it's important to always send a realistic one to avoid access issues.
Setting User-Agent
We explicitly set a browser-like user agent rather than relying on the library's default:
auto res = cli.Get("/wiki/List_of_dog_breeds",
{{"User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36"}});
This helps avoid blocks from sites limiting anonymous traffic without user agents.
Some tips for setting realistic-looking user agents:
- Copy strings from real, current browsers rather than inventing your own
- Keep a pool covering different browsers and platforms
- Rotate through the pool across requests (full example code in the challenges section below)
Rotating user agents helps distribute requests across many identities, making your scraper seem more human rather than bot-like.
Inspecting the page
Viewing the page in Chrome or Firefox inspector, we can see it has an HTML table with dog breed data we want.
The key highlights in the inspector:
- All the data lives in a single table with the classes wikitable and sortable
- Each breed is one tr row
- The breed name is a link inside the row's first cell
- The breed group, local name (wrapped in a span) and photo (an img tag) fill the following cells
This structure makes selecting the data straightforward, as we'll see next.
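For orientation, here is a simplified sketch of what one row looks like in the markup (illustrative only; the live page has more cells and attributes):

<table class="wikitable sortable">
  <tr>
    <th><a href="/wiki/...">Breed name</a></th>
    <td>Breed group</td>
    <td><span>Local name</span></td>
    <td><img src="//upload.wikimedia.org/..."></td>
  </tr>
</table>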
Parsing the HTML
After downloading the page successfully, we can parse the HTML content:
if (res && res->status == 200) {
    // Parse HTML into a document tree
    pugi::xml_document doc;
    if (!doc.load_string(res->body.c_str())) return 1;
    // ...all the querying below happens inside this block...
}
Key points:
- We only proceed when the request succeeded (res is non-null and the status is 200)
- load_string parses the response body into an in-memory tree of nodes
- pugixml is a strict XML parser, so pages that aren't well-formed may need a cleanup pass (for example with HTML Tidy) before loading
At this point the entire page is parsed into a DOM-like tree we can query.
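If a page refuses to parse, pugixml can tell you why. Here is a small sketch of inspecting the parse result for diagnostics (res->body comes from the request above):

#include <iostream>

pugi::xml_parse_result result = doc.load_string(res->body.c_str());
if (!result) {
    // description() is a human-readable message; offset points near the bad byte
    std::cerr << "Parse error: " << result.description()
              << " at offset " << result.offset << "\n";
}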
The Magic of CSS Selectors
One of the most powerful tools for extracting data out of HTML documents is the selector query. CSS selectors are the best-known form of this simple yet flexible syntax; note that pugixml itself evaluates XPath rather than CSS, which is exactly the gap selector-lib fills. Either way, the idea is the same: declaratively describe the elements you want.
Some examples of CSS selectors:
// By element tag
div
// By id
#container
// By class
.item
// Descendants
div span
// Direct children
div > span
We can compose these together to target nearly any elements on an HTML page.
For example, here is sample HTML:
<table class="breed-table">
<tr>
<td>Labrador</td>
<td>Sporting</td>
</tr>
</table>
And C++ code with pugixml to extract the breed name, using the XPath equivalent of the CSS selector .breed-table tr td:nth-child(1):
// Parse document
pugi::xml_document doc;
doc.load_string(html);

// Get breed name ("Labrador") via XPath
auto breed = doc.select_node("//table[contains(@class, 'breed-table')]//tr/td[1]").node().text().get();
The selector combination lets us directly target the text element we want to extract!
Selectors provide a concise, flexible way to query HTML. Rather than complex parsing code, we declaratively describe elements to extract. This simplicity is part of what makes scraping so accessible.
While the syntax may seem magical at first, a little knowledge goes a long way in wielding these querying powers!
Finding the table
Using an XPath query with pugixml, we can easily locate the table element:
// Find table
auto table = doc.select_node("//table[contains(@class, 'wikitable') and contains(@class, 'sortable')]").node();
Breaking this down:
- //table searches the entire document for table elements
- The contains(@class, ...) predicates require both the wikitable and sortable classes, the XPath equivalent of the CSS selector table.wikitable.sortable
- .node() unwraps the matched result into a plain xml_node we can keep querying
So with one line we've zeroed in on the exact table to scrape from the entire document! This is the magic of selector queries in action.
Extracting all the fields
Now we can iterate the rows and use selectors to extract the data fields we want:
// Iterate rows
for (auto row : table.select_nodes(".//tr")) {
    // Skip header rows, which contain no td cells
    if (!row.node().child("td")) continue;

    // Get cells (the breed name lives in a th, the rest in td)
    auto cells = row.node().select_nodes("th|td");
    if (cells.size() < 4) continue;

    // Extract data
    std::string name = cells[0].node().child("a").text().get();
    std::string group = cells[1].node().text().get();
    std::string localName = cells[2].node().select_node("span").node().text().get();
    std::string photograph = cells[3].node().select_node(".//img").node().attribute("src").value();

    // Store data
    names.push_back(name);
    groups.push_back(group);
    localNames.push_back(localName);
    photographs.push_back(photograph);
}
The key steps are:
- Select every row of the table with .//tr, skipping header rows
- Grab each row's cells with the th|td union
- Pull the breed name from the link in the first cell
- Read the group text, the local-name span and the image's src attribute
- Push each field into its matching vector
Being able to concisely target elements and attributes is what makes selectors so useful for parsing HTML programmatically.
And that's it: by iterating the table rows and applying selector queries, we've scraped structured data from the entire page. Vectors give us typed arrays to hold and work with the scraped content (see the CSV sketch below).
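As one taste of working with that content, here is a minimal sketch that dumps the parallel vectors filled above to a CSV file. The quoting is naive and will misbehave on fields containing commas:

#include <fstream>
#include <string>
#include <vector>

// Write the scraped columns out as breeds.csv, one row per breed
void writeCsv(const std::vector<std::string>& names,
              const std::vector<std::string>& groups,
              const std::vector<std::string>& localNames) {
    std::ofstream csv("breeds.csv");
    csv << "name,group,local_name\n";
    for (size_t i = 0; i < names.size(); ++i) {
        csv << names[i] << ',' << groups[i] << ',' << localNames[i] << '\n';
    }
}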
Downloading and saving the images
After extracting image URLs, we can download and save the dog breed photos locally:
// Download image (the src is protocol-relative, e.g. //upload.wikimedia.org/...)
const std::string imgHost = "//upload.wikimedia.org";
if (photograph.rfind(imgHost, 0) == 0) {
    httplib::Client imgCli("https://upload.wikimedia.org");
    auto imgRes = imgCli.Get(photograph.substr(imgHost.size()).c_str());
    if (imgRes && imgRes->status == 200) {
        std::ofstream file("dog_images/" + name + ".jpg", std::ios::binary);
        file << imgRes->body;
    }
}
Here's what it's doing:
- Wikipedia serves images from a separate host, so we open a second client for upload.wikimedia.org
- The protocol-relative src value is trimmed to a path and fetched
- On a 200 response, the bytes are written to dog_images/<name>.jpg in binary mode
This allows scraping both HTML text content as well as media like images or documents from a site.
And with that we've walked through the entire scraper code flow - hope this gives you a great template for building your own C++ scraping scripts!
Alternative libraries and tools for web scraping
While we used cpp-httplib, there are a few other popular options for web scraping in C++:
libcurl: The classic C transfer library. Lower level than cpp-httplib, but highly tunable for scraping needs (see the sketch at the end of this section).
Poco: A C++ framework including HTTP clients, parsers and other network utilities.
Scrapy: A popular Python scraping framework. It has no official C++ bindings; C++ projects typically drive it as a separate process.
Selenium: An automated browser testing framework useful for scraping dynamic JS sites.
So while cpp-httplib covered our use case, these alternatives may serve other needs better. Evaluate options based on the type of site, data and workflow you need to scrape.
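To give a feel for the libcurl route, here is a minimal sketch that fetches a page into a string. It shows only the happy path; compile with -lcurl:

#include <curl/curl.h>
#include <iostream>
#include <string>

// libcurl write callback: append each chunk of the response to a std::string
static size_t writeToString(char* ptr, size_t size, size_t nmemb, void* userdata) {
    static_cast<std::string*>(userdata)->append(ptr, size * nmemb);
    return size * nmemb;
}

int main() {
    curl_global_init(CURL_GLOBAL_DEFAULT);
    std::string body;
    if (CURL* curl = curl_easy_init()) {
        curl_easy_setopt(curl, CURLOPT_URL, "https://en.wikipedia.org/wiki/List_of_dog_breeds");
        curl_easy_setopt(curl, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 10.0; Win64; x64)");
        curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L); // follow redirects
        curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, writeToString);
        curl_easy_setopt(curl, CURLOPT_WRITEDATA, &body);
        if (curl_easy_perform(curl) == CURLE_OK) {
            std::cout << "Fetched " << body.size() << " bytes\n";
        }
        curl_easy_cleanup(curl);
    }
    curl_global_cleanup();
}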
Challenges of Web Scraping in the Real World: Tips & Best Practices
When taking scrapers beyond simple tutorial sites to real-world scenarios at scale, some common challenges arise:
Getting blocked
Sites aim to prevent heavy automated scraping for bandwidth or usage-policy reasons. Some tips:
- Throttle your request rate and add randomized delays between requests (see the sketch below)
- Rotate user agents and, for bigger jobs, IP addresses via proxies
- Respect robots.txt and the site's terms of use
- Cache pages you've already fetched so you never request them twice
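As promised, a minimal sketch of randomized request pacing; the 1-3 second window is an arbitrary choice, so tune it to the site:

#include <chrono>
#include <random>
#include <thread>

// Sleep for a random 1-3 seconds between requests to mimic human pacing
void politeDelay() {
    static std::mt19937 rng{std::random_device{}()};
    std::uniform_int_distribution<int> ms(1000, 3000);
    std::this_thread::sleep_for(std::chrono::milliseconds(ms(rng)));
}

Call politeDelay() before each cli.Get(...) in the scraping loop.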
Rotating User Agents
When scraping sites, using the same static user agent for all requests can get your scraper blocked. Sites may think you are a bot and ban your IP or user agent signature.
To properly mimic a real browser, you need to rotate between a set of common user agents. Here is C++ example code to achieve this with each HTTP request:
#include <httplib.h>
#include <random>
#include <string>
#include <vector>

std::vector<std::string> userAgents{
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36",
    "Mozilla/5.0 (iPhone; CPU iPhone OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148",
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
};

// Choose a random user agent from the pool
std::string pickUserAgent() {
    static std::mt19937 rng{std::random_device{}()};
    std::uniform_int_distribution<size_t> index(0, userAgents.size() - 1);
    return userAgents[index(rng)];
}

// Use with the HTTP client: attach the chosen identity as a default header
httplib::Client cli("https://example.com");
cli.set_default_headers({{"User-Agent", pickUserAgent()}});
auto res = cli.Get("/page");
Here we store a vector of real user agent strings and randomly select one before each request. set_default_headers attaches the chosen identity to every request the client makes, helping the scraper blend in with normal browser traffic.
Make sure to refresh the user agent list periodically within long-running scraping jobs so the strings stay current with real browser releases.
Some higher-level tools, such as Scrapy middlewares or Puppeteer plugins, handle user agent rotation automatically so you don't need custom logic. But understanding how the process works is still useful.
Handling dynamic content
Modern sites rely heavily on JavaScript to render content, so the raw HTML you download may be missing the data you see in the browser. Some approaches:
- Find the JSON endpoints the page calls (visible in the browser's Network tab) and request them directly, as sketched below
- Drive a real browser with Selenium or headless Chrome and scrape the rendered DOM
- Use a rendering service that returns the final HTML
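For the first approach, once you've spotted the endpoint in the Network tab you can call it directly with the same HTTP client. The /api/v1/dogs path below is purely hypothetical, standing in for whatever endpoint the real site uses:

// Hypothetical JSON endpoint, discovered by watching the browser's Network tab
httplib::Client cli("https://example.com");
auto res = cli.Get("/api/v1/dogs?page=1");
if (res && res->status == 200) {
    // res->body holds JSON; parse it with a library such as nlohmann/json
}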
Conclusion
In this comprehensive guide we walked through web scraping end-to-end in C++, learning:
- Why C++'s speed and efficiency suit large-scale scraping
- The library landscape: cpp-httplib, pugixml and selector-lib, plus alternatives like libcurl, Poco and Selenium
- How to download pages with realistic headers, parse the HTML, and pull out fields with selector queries
- How to download media like images alongside text data
- Real-world tactics for avoiding blocks, rotating user agents and handling dynamic content
C++ provides performance benefits and scraping capabilities through various libraries. It offers speed for large-scale data collection and control for sophisticated workflows. With this guide, you can now build high-performance, resilient scrapers in C++ for your needs.
While these examples are great for learning, scraping production-level sites can pose challenges like CAPTCHAs, IP blocks, and bot detection. Rotating proxies and automated CAPTCHA solving can help.
Proxies API offers a simple API for rendering pages with built-in proxy rotation, CAPTCHA solving, and evasion of IP blocks. You can fetch rendered pages in any language without configuring browsers or proxies yourself.
This allows scraping at scale without headaches of IP blocks. Proxies API has a free tier to get started. Check out the API and sign up for an API key to supercharge your web scraping.
With the power of Proxies API combined with the C++ techniques covered in this guide, you can scrape data at scale without getting blocked.