This article explains how to scrape Craigslist apartment listings using C++ and the libcurl and libxml2 libraries. We'll go through each part of the code.
First install libcurl and libxml2 if needed.
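On Debian or Ubuntu, for example (the package names below assume that platform, and scraper.cpp is whatever you name the source file), the install and build steps look like this:

sudo apt-get install libcurl4-openssl-dev libxml2-dev
g++ scraper.cpp -o scraper $(pkg-config --cflags --libs libxml-2.0) -lcurl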
Include the headers:
#include <curl/curl.h>
#include <libxml/HTMLparser.h>
#include <libxml/xpath.h>
#include <iostream>
#include <string>
libcurl handles HTTP requests. libxml2 parses HTML/XML.
Next set the URL to scrape:
const char* url = "https://sfbay.craigslist.org/search/apa";
Use libcurl to fetch the page content:
CURL* curl = curl_easy_init();
std::string html;
curl_easy_setopt(curl, CURLOPT_URL, url);
curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, writeCallback);
curl_easy_setopt(curl, CURLOPT_WRITEDATA, &html);  // give writeCallback somewhere to write
curl_easy_perform(curl);
curl_easy_cleanup(curl);
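The snippet above relies on a writeCallback helper that the code never defines; a minimal version simply appends each chunk libcurl delivers to the html string:

size_t writeCallback(char* contents, size_t size, size_t nmemb, void* userp) {
    // Append this chunk of the response body to the std::string
    // passed in via CURLOPT_WRITEDATA.
    static_cast<std::string*>(userp)->append(contents, size * nmemb);
    return size * nmemb;  // tell libcurl we consumed everything
}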
Parse the HTML with libxml2:
htmlDocPtr doc = htmlReadMemory(html.c_str(), (int)html.size(), "", NULL,
    HTML_PARSE_NOBLANKS | HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING | HTML_PARSE_NONET);
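htmlReadMemory returns NULL if the page cannot be parsed at all, so it is worth a quick guard before running any XPath (a minimal check, assuming this code lives in main):

if (doc == NULL) {
    std::cerr << "Failed to parse the page" << std::endl;
    return 1;
}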
If you inspect the source of a Craigslist search results page, you can see that each listing is generated by a block that looks something like this:
<li class="cl-static-search-result" title="Situated in Sunnyvale!, Recycling Center, 1/BD">
  <a href="https://sfbay.craigslist.org/sby/apa/d/santa-clara-situated-in-sunnyvale/7666802370.html">
    <div class="title">Situated in Sunnyvale!, Recycling Center, 1/BD</div>
    <div class="details">
      <div class="price">$2,150</div>
      <div class="location">sunnyvale</div>
    </div>
  </a>
</li>
Each listing is wrapped in an <li> with the cl-static-search-result class. The title is stored in the li's title attribute, while the price and location are text inside divs with the price and location classes, and the link is the href of the enclosing <a>. We need all four to get the full data.
Find all listings using XPath:
xmlXPathContextPtr xpathCtx = xmlXPathNewContext(doc);
xmlXPathObjectPtr xpathObj = xmlXPathEvalExpression(BAD_CAST "//li[@class='cl-static-search-result']", xpathCtx);
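If Craigslist changes its markup, the expression can come back empty (or NULL on a syntax error), so a quick guard is worth adding before the loop:

if (xpathObj == NULL || xmlXPathNodeSetIsEmpty(xpathObj->nodesetval)) {
    std::cerr << "No listings matched the XPath" << std::endl;
    return 1;
}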
Loop through listings and extract info:
for (int i = 0; i < xpathObj->nodesetval->nodeNr; i++) {
    xmlNodePtr node = xpathObj->nodesetval->nodeTab[i];
    // The title is an attribute on the <li> itself.
    xmlChar* title = xmlGetProp(node, BAD_CAST "title");
    // Price, location, and link are nested elements, not attributes,
    // so query them with XPath relative to this listing node.
    xpathCtx->node = node;
    xmlXPathObjectPtr price = xmlXPathEvalExpression(BAD_CAST "normalize-space(.//div[@class='price'])", xpathCtx);
    xmlXPathObjectPtr location = xmlXPathEvalExpression(BAD_CAST "normalize-space(.//div[@class='location'])", xpathCtx);
    xmlXPathObjectPtr link = xmlXPathEvalExpression(BAD_CAST "string(.//a/@href)", xpathCtx);
    std::cout << (title ? (const char*)title : "") << " " << price->stringval
              << " " << location->stringval << " " << link->stringval << std::endl;
    xmlFree(title);
    xmlXPathFreeObject(price);
    xmlXPathFreeObject(location);
    xmlXPathFreeObject(link);
}
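Once the loop finishes, release the XPath objects and the parsed document:

xmlXPathFreeObject(xpathObj);
xmlXPathFreeContext(xpathCtx);
xmlFreeDoc(doc);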
The full C++ code is:
#include <curl/curl.h>
#include <libxml/HTMLparser.h>
#include <libxml/xpath.h>
#include <iostream>
#include <string>

// Append each chunk libcurl receives to the std::string passed via CURLOPT_WRITEDATA.
size_t writeCallback(char* contents, size_t size, size_t nmemb, void* userp) {
    static_cast<std::string*>(userp)->append(contents, size * nmemb);
    return size * nmemb;
}

int main() {
    const char* url = "https://sfbay.craigslist.org/search/apa";

    // Fetch the search results page.
    CURL* curl = curl_easy_init();
    std::string html;
    curl_easy_setopt(curl, CURLOPT_URL, url);
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, writeCallback);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, &html);
    curl_easy_perform(curl);
    curl_easy_cleanup(curl);

    // Parse the HTML and select every listing <li>.
    htmlDocPtr doc = htmlReadMemory(html.c_str(), (int)html.size(), "", NULL,
        HTML_PARSE_NOBLANKS | HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING | HTML_PARSE_NONET);
    xmlXPathContextPtr xpathCtx = xmlXPathNewContext(doc);
    xmlXPathObjectPtr xpathObj = xmlXPathEvalExpression(BAD_CAST "//li[@class='cl-static-search-result']", xpathCtx);

    for (int i = 0; i < xpathObj->nodesetval->nodeNr; i++) {
        xmlNodePtr node = xpathObj->nodesetval->nodeTab[i];
        xmlChar* title = xmlGetProp(node, BAD_CAST "title");
        xpathCtx->node = node;  // evaluate the next expressions relative to this listing
        xmlXPathObjectPtr price = xmlXPathEvalExpression(BAD_CAST "normalize-space(.//div[@class='price'])", xpathCtx);
        xmlXPathObjectPtr location = xmlXPathEvalExpression(BAD_CAST "normalize-space(.//div[@class='location'])", xpathCtx);
        xmlXPathObjectPtr link = xmlXPathEvalExpression(BAD_CAST "string(.//a/@href)", xpathCtx);
        std::cout << (title ? (const char*)title : "") << " " << price->stringval
                  << " " << location->stringval << " " << link->stringval << std::endl;
        xmlFree(title);
        xmlXPathFreeObject(price);
        xmlXPathFreeObject(location);
        xmlXPathFreeObject(link);
    }
    xmlXPathFreeObject(xpathObj);
    xmlXPathFreeContext(xpathCtx);
    xmlFreeDoc(doc);
    return 0;
}
That completes the walkthrough of scraping Craigslist apartment listings in C++.
This is great as a learning exercise, but in practice even a single proxy server is prone to getting blocked because it uses one IP. If you need to handle thousands of fetches every day, using a professional rotating proxy service to rotate IPs is almost a must.
Otherwise, you tend to get IP blocked frequently by automatic location, usage, and bot detection algorithms.
Our rotating proxy server, Proxies API, provides a simple API that can solve all IP blocking problems instantly, and hundreds of our customers have used it to make the headache of IP blocks go away.
The whole thing can be accessed through that simple API from any programming language.
In fact, you don't even have to take on the pain of loading Puppeteer: we render JavaScript behind the scenes, so you can just fetch the data and parse it in any language or tool, such as Node, Puppeteer, or PHP, or with any framework like Scrapy or Nutch. In all these cases, you can call the URL with render support like so:
curl "http://api.proxiesapi.com/?key=API_KEY&render=true&url=https://example.com"
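If you would rather make the call from C++, it is just another libcurl request; here is a minimal sketch (API_KEY is a placeholder for your key, and the target URL must be URL-encoded before it is embedded in the API call):

#include <curl/curl.h>
#include <string>

int main() {
    CURL* curl = curl_easy_init();
    // URL-encode the page we actually want to scrape.
    char* target = curl_easy_escape(curl, "https://sfbay.craigslist.org/search/apa", 0);
    std::string api = std::string("http://api.proxiesapi.com/?key=API_KEY&render=true&url=") + target;
    curl_free(target);
    curl_easy_setopt(curl, CURLOPT_URL, api.c_str());
    curl_easy_perform(curl);  // without a write callback, the response body prints to stdout
    curl_easy_cleanup(curl);
    return 0;
}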
We have a running offer of 1000 API calls completely free. Register and get your free API Key.