In this article, we will learn how to scrape property listings from Booking.com using C++. We will use the libcurl and libxml2 libraries to fetch the HTML content and then extract key information such as the property name, location, rating, review count, and description.
Prerequisites
To follow along, you will need a C++ compiler plus the libcurl and libxml2 development libraries installed on your system.
Include Libraries
At the top of your C++ file, include the required library headers, plus the standard headers used later for strings and console output:
#include <curl/curl.h>
#include <libxml/HTMLparser.h>
#include <libxml/xpath.h>
#include <iostream>
#include <string>
Define URL
Let's define the URL we want to scrape:
const char* url = "https://www.booking.com/searchresults.html?ss=New+York&...";
We won't paste the full URL here; the complete URL with check-in dates and guest counts appears in the full code at the end.
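If you prefer to build the search URL dynamically, libcurl can percent-encode the query values for you. A rough sketch, assuming you already have a CURL* handle from curl_easy_init() (buildSearchUrl is our own helper name, and the extra query parameters are just examples):
// Build a Booking.com search URL for a given destination
std::string buildSearchUrl(CURL* curl, const std::string& destination) {
    char* encoded = curl_easy_escape(curl, destination.c_str(), (int)destination.length());
    std::string searchUrl = "https://www.booking.com/searchresults.html?ss=" + std::string(encoded) + "&group_adults=2";
    curl_free(encoded); // free the buffer allocated by curl_easy_escape
    return searchUrl;
}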
Set User Agent
We need to set a valid user agent header:
curl_easy_setopt(curl, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...");
This will make the request appear to come from a real browser.
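On top of the user agent, sending a few other browser-like headers can help requests look more natural. A small sketch using libcurl's header list API (the header values here are just examples):
struct curl_slist* headers = NULL;
headers = curl_slist_append(headers, "Accept-Language: en-US,en;q=0.9");
headers = curl_slist_append(headers, "Accept: text/html,application/xhtml+xml");
curl_easy_setopt(curl, CURLOPT_HTTPHEADER, headers);
// After curl_easy_cleanup(), release the list with curl_slist_free_all(headers);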
Fetch HTML Page
We can use libcurl to fetch the HTML content:
CURLcode result = curl_easy_perform(curl);
if(result == CURLE_OK) {
// Parse HTML
}
We make sure the request succeeded before parsing.
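Note that curl_easy_perform() discards the response body unless we install a write callback, so the parsing snippets below (and the full code at the end) assume the body has been collected into a std::string named response. A minimal sketch using libcurl's CURLOPT_WRITEFUNCTION and CURLOPT_WRITEDATA options (writeCallback is our own function name):
// Append each chunk libcurl receives to a std::string buffer
static size_t writeCallback(void* contents, size_t size, size_t nmemb, void* userp) {
    static_cast<std::string*>(userp)->append(static_cast<char*>(contents), size * nmemb);
    return size * nmemb; // tell libcurl we consumed everything
}

std::string response;
curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, writeCallback);
curl_easy_setopt(curl, CURLOPT_WRITEDATA, &response);
These options must be set before calling curl_easy_perform().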
Parse HTML
To parse the HTML, we use libxml2's htmlReadMemory() function:
htmlDocPtr doc = htmlReadMemory(response.c_str(), (int)response.size(), "", NULL, HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING);
This parses the HTML into a document tree (an htmlDocPtr). The two parse options suppress the error and warning noise libxml2 would otherwise print for the slightly malformed HTML real websites tend to serve.
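If the response was empty or could not be parsed at all, htmlReadMemory() returns NULL, so it is worth guarding against that before going any further:
if (doc == NULL) {
    std::cerr << "Failed to parse the HTML response" << std::endl;
    return 1;
}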
Extract Property Cards
Each property card is a div carrying a data-testid attribute with the value property-card, which we can select with an XPath expression:
xmlXPathContextPtr ctx = xmlXPathNewContext(doc);
xmlXPathObjectPtr cards = xmlXPathEvalExpression(BAD_CAST "//div[@data-testid='property-card']", ctx);
This evaluates an XPath expression to find the cards.
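It is also worth checking that the expression actually matched something; an empty node set usually means the page layout changed or a bot-check page was served instead of results:
if (cards == NULL || cards->nodesetval == NULL || cards->nodesetval->nodeNr == 0) {
    std::cerr << "No property cards found - the markup may have changed or the request was blocked" << std::endl;
} else {
    std::cout << "Found " << cards->nodesetval->nodeNr << " property cards" << std::endl;
}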
Loop Through Cards
We can iterate through the cards:
for(int i = 0; i < cards->nodesetval->nodeNr; i++) {
    xmlNodePtr card = cards->nodesetval->nodeTab[i];
    ctx->node = card; // evaluate the XPaths below relative to this card
    // Extract data from card node
}
Inside this loop we will extract information from each card node.
Extract Title
To get the title, we look inside the card for the element whose data-testid attribute is title (that is what Booking.com uses at the time of writing; confirm it in your browser's dev tools) and read its text content:
xmlXPathObjectPtr titleObj = xmlXPathEvalExpression(BAD_CAST ".//div[@data-testid='title']", ctx);
xmlChar* title = xmlNodeGetContent(titleObj->nodesetval->nodeTab[0]);
xmlNodeGetContent() returns the text content of the matched element; free it with xmlFree() (and titleObj with xmlXPathFreeObject()) when you are done.
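Since every field follows the same pattern of evaluating a card-relative XPath and reading the text of the first match, it is convenient to wrap that in a small helper. Note that getTextByXPath is a name we are introducing here, not part of libxml2, and this is a minimal sketch without error reporting:
static std::string getTextByXPath(xmlXPathContextPtr ctx, xmlNodePtr node, const char* xpath) {
    ctx->node = node; // evaluate the expression relative to this node
    xmlXPathObjectPtr obj = xmlXPathEvalExpression(BAD_CAST xpath, ctx);
    std::string text;
    if (obj && obj->nodesetval && obj->nodesetval->nodeNr > 0) {
        xmlChar* content = xmlNodeGetContent(obj->nodesetval->nodeTab[0]);
        if (content) {
            text = reinterpret_cast<char*>(content); // copy into a std::string
            xmlFree(content);
        }
    }
    xmlXPathFreeObject(obj);
    return text;
}
With it, the title extraction above collapses to a single line, and the remaining fields below follow the same pattern:
std::string title = getTextByXPath(ctx, card, ".//div[@data-testid='title']");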
Extract Location
Similarly, the address sits in an element with data-testid set to address (again, verify the exact value in dev tools):
std::string location = getTextByXPath(ctx, card, ".//span[@data-testid='address']");
The pattern is the same for other fields.
Extract Rating
The review score lives inside the block with data-testid set to review-score:
std::string rating = getTextByXPath(ctx, card, ".//div[@data-testid='review-score']/div[1]");
Here we get the numeric score (for example "8.4") as text. The nested div position can shift when Booking.com updates its markup, so treat this selector as a starting point.
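If you need the score as a number rather than text, a quick conversion could look like this (assuming the extracted string starts with the score):
double ratingValue = 0.0;
try {
    ratingValue = std::stod(rating); // e.g. "8.4" -> 8.4
} catch (const std::exception&) {
    // the text did not start with a number; keep the default of 0.0
}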
Extract Review Count
The review count text (for example "1,234 reviews") sits inside the same review-score block:
std::string reviewCount = getTextByXPath(ctx, card, ".//div[@data-testid='review-score']//div[contains(text(), 'reviews')]");
Extract Description
The description is in another element on the card; recent markup uses data-testid set to description (check dev tools if this comes back empty):
std::string description = getTextByXPath(ctx, card, ".//div[@data-testid='description']");
Print Extracted Data
We can print out the extracted data:
std::cout << "Name: " << title << std::endl;
std::cout << "Location: " << location << std::endl;
std::cout << "Rating: " << rating << std::endl;
// etc...
And that covers scraping Booking.com property listings in C++! Let me know if you have any other questions.
Full Code
Here is the complete C++ code:
#include <curl/curl.h>
#include <libxml/HTMLparser.h>
#include <libxml/xpath.h>
#include <iostream>
#include <string>

// Collect the response body into a std::string
static size_t writeCallback(void* contents, size_t size, size_t nmemb, void* userp) {
    static_cast<std::string*>(userp)->append(static_cast<char*>(contents), size * nmemb);
    return size * nmemb;
}

// Evaluate an XPath expression relative to `node` and return the first match's text
static std::string getTextByXPath(xmlXPathContextPtr ctx, xmlNodePtr node, const char* xpath) {
    ctx->node = node;
    xmlXPathObjectPtr obj = xmlXPathEvalExpression(BAD_CAST xpath, ctx);
    std::string text;
    if (obj && obj->nodesetval && obj->nodesetval->nodeNr > 0) {
        xmlChar* content = xmlNodeGetContent(obj->nodesetval->nodeTab[0]);
        if (content) {
            text = reinterpret_cast<char*>(content);
            xmlFree(content);
        }
    }
    xmlXPathFreeObject(obj);
    return text;
}

int main() {
    const char* url = "https://www.booking.com/searchresults.en-gb.html?ss=New+York&checkin=2023-03-01&checkout=2023-03-05&group_adults=2";
    std::string response;
    CURL* curl = curl_easy_init();
    curl_easy_setopt(curl, CURLOPT_URL, url);
    curl_easy_setopt(curl, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...");
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, writeCallback);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, &response);
    CURLcode result = curl_easy_perform(curl);
    if(result == CURLE_OK) {
        htmlDocPtr doc = htmlReadMemory(response.c_str(), (int)response.size(), "", NULL, HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING);
        xmlXPathContextPtr ctx = xmlXPathNewContext(doc);
        xmlXPathObjectPtr cards = xmlXPathEvalExpression(BAD_CAST "//div[@data-testid='property-card']", ctx);
        // The data-testid selectors below match Booking.com's markup at the time of writing; verify them in dev tools
        for(int i = 0; cards && cards->nodesetval && i < cards->nodesetval->nodeNr; i++) {
            xmlNodePtr card = cards->nodesetval->nodeTab[i];
            std::cout << "Name: " << getTextByXPath(ctx, card, ".//div[@data-testid='title']") << std::endl;
            std::cout << "Location: " << getTextByXPath(ctx, card, ".//span[@data-testid='address']") << std::endl;
            std::cout << "Rating: " << getTextByXPath(ctx, card, ".//div[@data-testid='review-score']/div[1]") << std::endl;
            std::cout << "Review Count: " << getTextByXPath(ctx, card, ".//div[@data-testid='review-score']//div[contains(text(), 'reviews')]") << std::endl;
            std::cout << "Description: " << getTextByXPath(ctx, card, ".//div[@data-testid='description']") << std::endl;
        }
        xmlXPathFreeObject(cards);
        xmlXPathFreeContext(ctx);
        xmlFreeDoc(doc);
    }
    curl_easy_cleanup(curl);
    return 0;
}
While these examples are great for learning, scraping production-level sites can pose challenges like CAPTCHAs, IP blocks, and bot detection. Rotating proxies and automated CAPTCHA solving can help.
Proxies API offers a simple API for rendering pages with built-in proxy rotation, CAPTCHA solving, and evasion of IP blocks. You can fetch rendered pages in any language without configuring browsers or proxies yourself.
This allows scraping at scale without the headaches of IP blocks. Proxies API has a free tier to get started. Check out the API and sign up for an API key to supercharge your web scraping.
With the power of Proxies API combined with C++ libraries like libcurl and libxml2, you can scrape data at scale without getting blocked.