Web scraping is the technique of automatically extracting data from websites with code. In this article, we'll walk through a full code example of scraping business listing data from Yelp using C++ and two helpful libraries: libcurl for making HTTP requests and Gumbo for parsing HTML.
The goal is to extract key details like the business name, rating, number of reviews, price range, and location for each search result. This data could then be used for analysis, visualizations, lead generation, and more.
Why Scrape Yelp?
Yelp has rich data on local businesses across many categories like restaurants, shops, services, etc. Manually extracting this data would be tedious. Scraping lets us pull thousands of listings programmatically, and the structured data can then be used for all sorts of purposes: market research, lead generation, competitive analysis, visualizations, and more.
Of course, check Yelp's terms and conditions before scraping. Now let's dive into the code!
Imports and Initialization
We start by including the necessary header files for I/O streams, strings, vectors, CURL, and Gumbo:
#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include <cctype>
#include <curl/curl.h>
#include <gumbo.h>
Then we initialize libcurl, which handles making web requests:
CURL* curl = curl_easy_init();
if (!curl) {
std::cerr << "Failed to initialize libcurl.\n";
return 1;
}
And that's it for setup! (In a complete program, you should also call curl_global_init(CURL_GLOBAL_DEFAULT) once before any other libcurl function, and curl_global_cleanup() at exit.) We'll explain Gumbo when we get to parsing.
Constructing the URLs
We need two URLs: one for the Yelp search page, and one for the ProxiesAPI service:
// Yelp search page
// Yelp search page
std::string url = "https://www.yelp.com/search?find_desc=chinese&find_loc=San+Francisco%2C+CA";
// Encode the URL for the API (curl_easy_escape returns a malloc'd C string we must free)
char* escaped = curl_easy_escape(curl, url.c_str(), url.length());
std::string encoded_url = escaped;
curl_free(escaped);
// Proxy API URL
std::string api_url = "http://api.proxiesapi.com/?premium=true&auth_key=YOUR_AUTH_KEY&url=" + encoded_url;
The proxy API allows us to tunnel through proxies on each request, avoiding detection as scrapers by Yelp.
We URL-encode the Yelp URL before inserting it into the proxy API URL. This makes sure special characters pass safely.
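curl_easy_escape does the encoding for us, but it helps to see what it conceptually does: every byte outside the unreserved set (A-Z, a-z, 0-9, -, _, ., ~) is replaced by a %XX hex escape. Here is a minimal stand-alone sketch of the idea, not libcurl's actual implementation:

```cpp
#include <cctype>
#include <cstdio>
#include <string>

// Minimal percent-encoder: keep unreserved characters, escape everything else
std::string percent_encode(const std::string& in) {
    std::string out;
    for (unsigned char c : in) {
        if (std::isalnum(c) || c == '-' || c == '_' || c == '.' || c == '~') {
            out += static_cast<char>(c);
        } else {
            char buf[4];
            std::snprintf(buf, sizeof(buf), "%%%02X", c);  // e.g. ',' -> "%2C"
            out += buf;
        }
    }
    return out;
}
```

Run on our Yelp URL, this turns characters like `:`, `/`, and `&` into `%3A`, `%2F`, and `%26`, so the embedded URL can't be confused with the proxy API's own query parameters.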
Setting CURL Options
To mimic a real browser, we construct header strings and pass them to CURL:
struct curl_slist* headers = NULL;
headers = curl_slist_append(headers, "User-Agent: Mozilla/5.0 (...) Chrome/58.0");
headers = curl_slist_append(headers, "Accept-Language: en-US,en;q=0.5");
curl_easy_setopt(curl, CURLOPT_URL, api_url.c_str());
curl_easy_setopt(curl, CURLOPT_HTTPHEADER, headers);
The key things set here: CURLOPT_URL points the request at our proxy API URL (which wraps the encoded Yelp URL), and CURLOPT_HTTPHEADER attaches browser-like User-Agent and Accept-Language headers.
Without these options, Yelp may detect us as bots!
Making the Request
With CURL configured, we can make the GET request and store the response:
std::string response_data;
curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, WriteCallback);
curl_easy_setopt(curl, CURLOPT_WRITEDATA, &response_data);
CURLcode res = curl_easy_perform(curl);
if (res != CURLE_OK) {
std::cerr << "Request failed: " << curl_easy_strerror(res) << "\n";
return 1;
}
WriteCallback appends each chunk of the response body into our string buffer, and CURLOPT_WRITEDATA tells libcurl which buffer to pass to the callback. We check the return code and abort with an error message if the request failed.
At this point, response_data holds the raw HTML of the Yelp search results page, ready for parsing.
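For completeness, the callback itself is tiny. libcurl invokes it once per chunk of the response body, and it appends each chunk to the std::string we registered with CURLOPT_WRITEDATA:

```cpp
#include <cstddef>
#include <string>

// libcurl calls this once per chunk of response body; we append each
// chunk onto the std::string supplied via CURLOPT_WRITEDATA
size_t WriteCallback(void* contents, size_t size, size_t nmemb, std::string* output) {
    size_t total_size = size * nmemb;
    output->append(static_cast<char*>(contents), total_size);
    return total_size;  // returning any other value makes libcurl abort the transfer
}
```

Note that returning anything other than the number of bytes handled signals an error to libcurl, which will then fail the transfer with CURLE_WRITE_ERROR.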
Parsing the HTML
To extract structured data, we need to parse the HTML content and identify the key data elements. This is where the Gumbo parser comes in handy.
First we parse and get the root node:
GumboOutput* output = gumbo_parse(response_data.c_str());
if (!output) {
std::cerr << "Failed to parse response.";
return 1;
}
GumboNode* root = output->root;
Extracting Listing Data
The key part is identifying patterns in the HTML related to each business listing. We can then extract data for all listings:
Inspecting the page
When we inspect the page, we can see that each listing's container div has the classes arrange-unit__09f24__rqHTg arrange-unit-fill__09f24__CUubG css-1qn0b6x. We match on that exact class string:
GumboAttribute* class_attr = gumbo_get_attribute(&current->v.element.attributes, "class");
if (class_attr && std::string(class_attr->value) == "arrange-unit__09f24__rqHTg arrange-unit-fill__09f24__CUubG css-1qn0b6x") {
listings.push_back(current);
}
The key things to understand: we walk every element in the parsed tree, read its class attribute with gumbo_get_attribute, and compare it against the exact class string Yelp uses for listing containers. Each matching div is one search result, which we collect into a vector for further extraction.
With these patterns, we can extract any data we want from structured HTML.
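This find-by-class pattern generalizes: Gumbo exposes the page as a tree of nodes, so matching elements anywhere in the document is a depth-first walk. The shape of that recursion is sketched below on a simplified stand-in Node type so it compiles on its own; Gumbo's real structs differ (children live in a GumboVector of void pointers), but the logic is identical:

```cpp
#include <string>
#include <vector>

// Simplified stand-in for a DOM element node
struct Node {
    std::string cls;              // value of the class attribute
    std::vector<Node*> children;  // child elements
};

// Depth-first search: collect every node whose class matches exactly
void find_by_class(Node* node, const std::string& cls, std::vector<Node*>& out) {
    if (node->cls == cls) out.push_back(node);
    for (Node* child : node->children) {
        find_by_class(child, cls, out);
    }
}
```

The same helper, written against GumboNode, is what the full code at the end of this article uses to find listings and then to find fields inside each listing.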
The full code contains additional examples for other data points like number of reviews and price range.
Next Steps
The final code puts all the pieces together to scrape Yelp results and extract business listings.
Some ideas for building on this: follow pagination links to scrape beyond the first page of results, extract additional fields like categories or business URLs, export the listings to CSV or a database, and schedule the scraper to track changes over time.
Web scraping opens up many possibilities for harnessing structured public data! Check the laws in your area and respect sites' terms of service before scraping.
Let me know if any part of the explanation needs more detail! Here is the full code again for reference:
#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include <cctype>
#include <curl/curl.h>
#include <gumbo.h>
// Function to write HTTP response data to a string
size_t WriteCallback(void* contents, size_t size, size_t nmemb, std::string* output) {
    size_t total_size = size * nmemb;
    output->append(static_cast<char*>(contents), total_size);
    return total_size;
}
// Recursively collect every element whose class attribute exactly matches cls
void FindByClass(GumboNode* node, const std::string& cls, std::vector<GumboNode*>& out) {
    if (node->type != GUMBO_NODE_ELEMENT) {
        return;
    }
    GumboAttribute* class_attr = gumbo_get_attribute(&node->v.element.attributes, "class");
    if (class_attr && cls == class_attr->value) {
        out.push_back(node);
    }
    GumboVector* children = &node->v.element.children;
    for (unsigned int i = 0; i < children->length; ++i) {
        FindByClass(static_cast<GumboNode*>(children->data[i]), cls, out);
    }
}
// Concatenate all the text inside a node's subtree
std::string NodeText(GumboNode* node) {
    if (node->type == GUMBO_NODE_TEXT) {
        return node->v.text.text;
    }
    if (node->type != GUMBO_NODE_ELEMENT) {
        return "";
    }
    std::string text;
    GumboVector* children = &node->v.element.children;
    for (unsigned int i = 0; i < children->length; ++i) {
        text += NodeText(static_cast<GumboNode*>(children->data[i]));
    }
    return text;
}
int main() {
    // URL of the Yelp search page
    std::string url = "https://www.yelp.com/search?find_desc=chinese&find_loc=San+Francisco%2C+CA";
    // Initialize libcurl
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL* curl = curl_easy_init();
    if (!curl) {
        std::cerr << "Failed to initialize libcurl." << std::endl;
        return 1;
    }
    // URL-encode the URL (curl_easy_escape returns a malloc'd C string we must free)
    char* escaped = curl_easy_escape(curl, url.c_str(), url.length());
    std::string encoded_url = escaped;
    curl_free(escaped);
    // API URL with the encoded Yelp URL
    std::string api_url = "http://api.proxiesapi.com/?premium=true&auth_key=YOUR_AUTH_KEY&url=" + encoded_url;
    // Define browser-like headers to simulate a real browser request
    struct curl_slist* headers = NULL;
    headers = curl_slist_append(headers, "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36");
    headers = curl_slist_append(headers, "Accept-Language: en-US,en;q=0.5");
    headers = curl_slist_append(headers, "Referer: https://www.google.com/");
    // Configure libcurl
    curl_easy_setopt(curl, CURLOPT_URL, api_url.c_str());
    curl_easy_setopt(curl, CURLOPT_HTTPHEADER, headers);
    // Let libcurl negotiate and decompress gzip/deflate responses itself
    curl_easy_setopt(curl, CURLOPT_ACCEPT_ENCODING, "");
    std::string response_data;
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, WriteCallback);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, &response_data);
    // Send an HTTP GET request to the URL with the headers
    CURLcode res = curl_easy_perform(curl);
    // Check if the request succeeded
    if (res == CURLE_OK) {
        // Write the response data to a file
        std::ofstream file("yelp_html.html", std::ios::out | std::ios::binary);
        file.write(response_data.c_str(), response_data.length());
        file.close();
        // Use Gumbo to parse the HTML content
        GumboOutput* output = gumbo_parse(response_data.c_str());
        if (output) {
            GumboNode* root = output->root;
            // Find all the listing containers
            std::vector<GumboNode*> listings;
            FindByClass(root, "arrange-unit__09f24__rqHTg arrange-unit-fill__09f24__CUubG css-1qn0b6x", listings);
            std::cout << "Number of Listings: " << listings.size() << std::endl;
            // Loop through each listing and extract information
            for (GumboNode* listing : listings) {
                // Initialize variables to store extracted information
                std::string business_name = "N/A";
                std::string rating = "N/A";
                std::string num_reviews = "N/A";
                std::string price_range = "N/A";
                std::string location = "N/A";
                std::vector<GumboNode*> found;
                // Business name
                FindByClass(listing, "css-19v1rkv", found);
                if (!found.empty()) {
                    business_name = NodeText(found[0]);
                }
                // Rating
                found.clear();
                FindByClass(listing, "css-gutk1c", found);
                if (!found.empty()) {
                    rating = NodeText(found[0]);
                }
                // Number of reviews and location share the same span class;
                // the review count starts with a digit, the location does not
                found.clear();
                FindByClass(listing, "css-chan6m", found);
                for (GumboNode* span : found) {
                    std::string text = NodeText(span);
                    if (text.empty()) {
                        continue;
                    }
                    if (num_reviews == "N/A" && std::isdigit(static_cast<unsigned char>(text[0]))) {
                        num_reviews = text;
                    } else if (location == "N/A") {
                        location = text;
                    }
                }
                // Price range
                found.clear();
                FindByClass(listing, "priceRange__09f24__mmOuH", found);
                if (!found.empty()) {
                    price_range = NodeText(found[0]);
                }
                // Print the extracted information
                std::cout << "Business Name: " << business_name << std::endl;
                std::cout << "Rating: " << rating << std::endl;
                std::cout << "Number of Reviews: " << num_reviews << std::endl;
                std::cout << "Price Range: " << price_range << std::endl;
                std::cout << "Location: " << location << std::endl;
                std::cout << "=============================" << std::endl;
            }
            // Clean up Gumbo
            gumbo_destroy_output(&kGumboDefaultOptions, output);
        }
    } else {
        std::cerr << "Request failed: " << curl_easy_strerror(res) << std::endl;
    }
    // Clean up libcurl
    curl_slist_free_all(headers);
    curl_easy_cleanup(curl);
    curl_global_cleanup();
    return 0;
}