Web scraping allows you to programmatically extract data from websites. In this beginner C++ tutorial, we'll walk through code that scrapes real estate listing data from Realtor.com.
Background
This program sends a request to the Realtor.com URL, downloads the HTML content of the page, and parses that HTML to extract specific listing details like price, address, beds, and baths.
It relies on two external libraries plus a query language to do this:
libcurl - makes the HTTP request and retrieves the raw HTML
libxml2 - parses the HTML into a document tree we can analyze
XPath - a query language (built into libxml2) for targeting specific parts of the parsed HTML
Let's look at how it's done!
Include the Needed Libraries
We start by including headers for the external libraries we'll use:
#include <iostream>
#include <string>
#include <curl/curl.h>
#include <libxml/HTMLparser.h>
#include <libxml/xpath.h>
Make sure these libraries are installed to compile and run the program later.
Define the Target URL
We'll scrape listing data from Realtor.com focused on San Francisco:
const std::string url = "https://www.realtor.com/realestateandhomes-search/San-Francisco_CA";
This URL will be passed to libcurl to download the page content.
Initialize CURL and Make Request
Next we initialize a curl "handle", set options, and make the HTTP request:
CURL *curl = curl_easy_init();
// Set options like URL and headers
curl_easy_setopt(curl, CURLOPT_URL, url.c_str());
// Send GET request
CURLcode res = curl_easy_perform(curl);
This downloads the HTML content from the provided URL.
Parse the HTML
With libcurl having fetched the raw HTML into a string (responseData below), libxml2 parses it so we can analyze the page content:
xmlDocPtr doc = htmlReadDoc((const xmlChar*)responseData.c_str(), NULL, NULL, HTML_PARSE_RECOVER);
Analyze Parsed HTML with XPath
Now we can use XPath queries to target specific parts of the parsed HTML.
XPath expressions let you navigate through XML/HTML structures. Some examples: //div selects every div in the document; //div[contains(@class, 'price')] selects only the divs whose class attribute contains 'price'; .//span selects spans relative to the current node.
Let's see this in action to extract real estate listings!
Set Up the XPath Context
We first create an XPath context - an object to hold the query state:
xmlXPathContextPtr xpathCtx = xmlXPathNewContext(doc);
Pass it the parsed HTML doc to query against.
Define XPath Expressions
Inspecting the element
Inspecting the page in Chrome DevTools shows that each listing block is wrapped in a div whose class value contains 'BasePropertyCard'.
Now we can define XPath queries to pull listing data.
First, an expression to get all listing container blocks:
const xmlChar* xpathExpr = (const xmlChar*)"//div[contains(@class, 'BasePropertyCard')]";
This targets the div wrapping each listing card. Later queries will be scoped to these listing blocks.
Evaluate Expressions & Extract Data
To run the query, we evaluate the expression against the XPath context:
xmlXPathObjectPtr results = xmlXPathEvalExpression(xpathExpr, xpathCtx);
This returns a node set of matching listing blocks, which we loop through:
for(int i = 0; i < results->nodesetval->nodeNr; i++) {
xmlNodePtr listing = results->nodesetval->nodeTab[i];
// Extract data from this listing node
}
Inside the loop we extract details from each listing, using additional XPath queries scoped to that listing's node. For example, the price:
xpathCtx->node = listing; // scope relative queries to this listing
xmlXPathObjectPtr priceObj = xmlXPathEvalExpression(
    (const xmlChar*)".//div[contains(@class, 'price')]",
    xpathCtx);
std::string price = reinterpret_cast<const char*>(
    priceObj->nodesetval->nodeTab[0]->children->content);
The same process extracts the address, beds, baths, and the rest.
Wrap Up
That's the key logic for scraping these listings from Realtor.com in C++! We walked through making the request with libcurl, parsing the HTML with libxml2, and extracting fields with XPath. The full runnable code is below so you can see it all put together. Hope this gives you a great starting point for your own web scraping projects!
#include <iostream>
#include <string>
#include <curl/curl.h>
#include <libxml/HTMLparser.h>
#include <libxml/xpath.h>

// libcurl write callback: append each chunk of the response body to a std::string
static size_t WriteCallback(char *contents, size_t size, size_t nmemb, void *userp) {
    static_cast<std::string *>(userp)->append(contents, size * nmemb);
    return size * nmemb;
}

// Evaluate a relative XPath expression scoped to `node` and return the text
// content of the first match, or "N/A" if there is no match
static std::string ExtractText(xmlXPathContextPtr ctx, xmlNodePtr node, const char *expr) {
    ctx->node = node; // scope relative queries to this node
    xmlXPathObjectPtr obj = xmlXPathEvalExpression((const xmlChar *)expr, ctx);
    std::string result = "N/A";
    if (obj && obj->nodesetval && obj->nodesetval->nodeNr > 0) {
        xmlChar *text = xmlNodeGetContent(obj->nodesetval->nodeTab[0]);
        if (text) {
            result = reinterpret_cast<const char *>(text);
            xmlFree(text);
        }
    }
    if (obj) xmlXPathFreeObject(obj);
    return result;
}

int main() {
    // Define the URL of the Realtor.com search page
    const std::string url = "https://www.realtor.com/realestateandhomes-search/San-Francisco_CA";
    // Initialize libcurl
    CURL *curl = curl_easy_init();
    if (!curl) {
        std::cerr << "Failed to initialize libcurl" << std::endl;
        return 1;
    }
    // Define a User-Agent header so the request looks like a normal browser
    struct curl_slist *headers = NULL;
    headers = curl_slist_append(headers, "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36");
    // Buffer that will receive the page HTML
    std::string responseData;
    // Set libcurl options
    curl_easy_setopt(curl, CURLOPT_URL, url.c_str());
    curl_easy_setopt(curl, CURLOPT_HTTPHEADER, headers);
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, WriteCallback);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, &responseData);
    // Send a GET request to the URL with the User-Agent header
    CURLcode res = curl_easy_perform(curl);
    if (res == CURLE_OK) {
        // Parse the downloaded HTML content with libxml2
        xmlDocPtr doc = htmlReadMemory(responseData.c_str(), (int)responseData.size(),
                                       url.c_str(), NULL,
                                       HTML_PARSE_RECOVER | HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING);
        if (doc) {
            // Create an XPath context
            xmlXPathContextPtr xpathCtx = xmlXPathNewContext(doc);
            if (xpathCtx) {
                // Define the XPath expression to find all the listing blocks
                const xmlChar *xpathExpr = (const xmlChar *)"//div[contains(@class, 'BasePropertyCard_propertyCardWrap__J0xUj')]";
                // Evaluate the XPath expression
                xmlXPathObjectPtr xpathObj = xmlXPathEvalExpression(xpathExpr, xpathCtx);
                if (xpathObj && xpathObj->nodesetval) {
                    // Loop through each listing block and extract information
                    for (int i = 0; i < xpathObj->nodesetval->nodeNr; i++) {
                        xmlNodePtr listing = xpathObj->nodesetval->nodeTab[i];
                        // Extract each field with a relative query scoped to this listing
                        std::string broker_name = ExtractText(xpathCtx, listing, ".//span[contains(@class, 'BrokerTitle_titleText__20u1P')]");
                        std::string status = ExtractText(xpathCtx, listing, ".//div[contains(@class, 'message')]");
                        std::string price = ExtractText(xpathCtx, listing, ".//div[contains(@class, 'card-price')]");
                        std::string beds = ExtractText(xpathCtx, listing, ".//li[@data-testid='property-meta-beds']");
                        std::string baths = ExtractText(xpathCtx, listing, ".//li[@data-testid='property-meta-baths']");
                        std::string sqft = ExtractText(xpathCtx, listing, ".//li[@data-testid='property-meta-sqft']");
                        std::string lot_size = ExtractText(xpathCtx, listing, ".//li[@data-testid='property-meta-lot-size']");
                        std::string address = ExtractText(xpathCtx, listing, ".//div[contains(@class, 'card-address')]");
                        // Print the extracted information
                        std::cout << "Broker: " << broker_name << std::endl;
                        std::cout << "Status: " << status << std::endl;
                        std::cout << "Price: " << price << std::endl;
                        std::cout << "Beds: " << beds << std::endl;
                        std::cout << "Baths: " << baths << std::endl;
                        std::cout << "Sqft: " << sqft << std::endl;
                        std::cout << "Lot Size: " << lot_size << std::endl;
                        std::cout << "Address: " << address << std::endl;
                        std::cout << std::string(50, '-') << std::endl; // Separating listings
                    }
                }
                // Free the XPath object and context
                if (xpathObj) xmlXPathFreeObject(xpathObj);
                xmlXPathFreeContext(xpathCtx);
            }
            // Free the parsed HTML document
            xmlFreeDoc(doc);
        }
    } else {
        std::cerr << "Failed to retrieve the page. Error: " << curl_easy_strerror(res) << std::endl;
    }
    // Cleanup libcurl headers and handle
    curl_slist_free_all(headers);
    curl_easy_cleanup(curl);
    return 0;
}