In the early stages of a web crawling project, or when you only need to scale to a few hundred requests, a simple proxy rotator that populates itself from the free proxy pools available on the internet every now and then is often enough.
We can use a website like https://sslproxies.org/ to fetch public proxies every few minutes and use them in our C++ projects.
If you check the HTML using the browser's inspect tool, you will see that the full content is encapsulated in a table with the id proxylisttable. The IP and port are the first and second cells in each row. We can use the following code to select the table, iterate over its rows, and pull out the first and second td elements of each tr.
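In simplified form, the markup we are targeting looks something like this (a sketch, not the page's exact markup; the real table has more columns and many more rows):

<table id="proxylisttable">
  <tr><th>IP Address</th><th>Port</th>...</tr>
  <tr><td>203.0.113.45</td><td>8080</td>...</tr>
</table>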
Fetching Proxies with libcurl
We'll use the libcurl library to make the HTTP requests; the HTML parsing comes later.
First, include the libcurl headers:
#include <curl/curl.h>
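libcurl's global state should also be initialized once per program, before any handles are created, and released at exit:

// Call once near the start of main(), before any other libcurl use
curl_global_init(CURL_GLOBAL_DEFAULT);

// ... make requests ...

// Call once at program exit
curl_global_cleanup();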
To fetch the proxies, we'll make a GET request to the sslproxies.org URL. Here's a simple request function:
// libcurl write callback: appends each chunk of the response body
// to the std::string passed in via CURLOPT_WRITEDATA
static size_t writeCallback(char* contents, size_t size, size_t nmemb, void* userdata) {
    static_cast<std::string*>(userdata)->append(contents, size * nmemb);
    return size * nmemb;
}

std::string fetchProxies() {
    CURL* curl = curl_easy_init();
    if (!curl) return "";

    std::string response;
    curl_easy_setopt(curl, CURLOPT_URL, "https://sslproxies.org/");
    // Set a browser-like user agent so the site serves the normal page
    curl_easy_setopt(curl, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36");
    // Perform the request and collect the response body into `response`
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, writeCallback);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, &response);

    CURLcode result = curl_easy_perform(curl);
    if (result != CURLE_OK) {
        response.clear();  // treat transport errors as an empty result
    }
    curl_easy_cleanup(curl);
    return response;
}
This will fetch the HTML content from the website.
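In practice you may also want a timeout and redirect-following, since free proxy sites occasionally redirect or hang. Both are standard libcurl options that can be set inside fetchProxies() before curl_easy_perform():

curl_easy_setopt(curl, CURLOPT_TIMEOUT, 15L);        // abort if the request takes longer than 15 seconds
curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);  // follow HTTP redirects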
Parsing the Proxies
Next, we need to parse the HTML to extract the proxies. We'll use the RapidXML library for this.
First, include RapidXML:
#include <rapidxml.hpp>
Then we can parse the HTML:
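One caveat first: RapidXML is a strict XML parser, and a full HTML page is rarely well-formed XML. A pragmatic workaround (a sketch, assuming the table markup itself is well-formed) is to cut out just the table fragment before parsing, so the parseProxies function below sees <table> as its root node:

// Extract just the <table>...</table> fragment so RapidXML parses a small,
// (hopefully) well-formed snippet instead of the whole HTML page
std::string extractTable(const std::string& html) {
    size_t start = html.find("<table");
    size_t end = html.find("</table>");
    if (start == std::string::npos || end == std::string::npos) return "";
    return html.substr(start, end - start + 8);  // 8 == strlen("</table>")
}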
std::vector<Proxy> parseProxies(const std::string& html) {
    std::vector<Proxy> proxies;

    // RapidXML parses in place and mutates its input, so copy the HTML
    // into a mutable, null-terminated buffer first
    std::vector<char> buffer(html.begin(), html.end());
    buffer.push_back('\0');

    rapidxml::xml_document<> doc;
    doc.parse<0>(buffer.data());

    rapidxml::xml_node<>* table = doc.first_node("table");
    if (!table) return proxies;

    for (rapidxml::xml_node<>* row = table->first_node("tr"); row; row = row->next_sibling("tr")) {
        rapidxml::xml_node<>* ip = row->first_node("td");
        if (!ip) continue;  // header rows have <th> cells, not <td>
        rapidxml::xml_node<>* port = ip->next_sibling("td");
        if (!port) continue;

        Proxy proxy;
        proxy.ip = ip->value();
        proxy.port = std::stoi(port->value());
        proxies.push_back(proxy);
    }
    return proxies;
}
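One defensive refinement worth considering: std::stoi throws if the port cell isn't numeric, so the conversion inside the row loop can be guarded (a sketch of just the relevant lines):

// Replace the direct std::stoi call inside the row loop with a guarded version
try {
    proxy.port = std::stoi(port->value());
} catch (const std::exception&) {
    continue;  // malformed port cell; skip this row
}
proxies.push_back(proxy);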
This parses the HTML, finds the proxy table, loops through its rows, and extracts the IP and port from the first two td elements of each row.

Putting It Together

Now we can put it all together into a simple program:
#include <curl/curl.h>
#include <rapidxml.hpp>
#include <string>
#include <vector>
#include <iostream>
#include <cstdlib>  // srand, rand
#include <ctime>    // time

struct Proxy {
    std::string ip;
    int port;
};

// Defined in the sections above
std::string fetchProxies();
std::vector<Proxy> parseProxies(const std::string& html);

int main() {
    curl_global_init(CURL_GLOBAL_DEFAULT);

    // Fetch and parse the proxy list
    std::string html = fetchProxies();
    std::vector<Proxy> proxies = parseProxies(html);

    if (proxies.empty()) {
        std::cerr << "No proxies found" << std::endl;
        curl_global_cleanup();
        return 1;
    }

    // Print a random proxy from the pool
    srand(time(NULL));
    int index = rand() % proxies.size();
    std::cout << proxies[index].ip << ":" << proxies[index].port << std::endl;

    curl_global_cleanup();
    return 0;
}
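Note that main() only prints a random proxy; to actually route a scraping request through it, libcurl's CURLOPT_PROXY option does the work. A minimal sketch (fetchViaProxy is a hypothetical helper, and the timeout value is an arbitrary choice):

// Sketch: send a request through one of the fetched proxies
bool fetchViaProxy(const Proxy& proxy, const std::string& url) {
    CURL* curl = curl_easy_init();
    if (!curl) return false;

    std::string address = proxy.ip + ":" + std::to_string(proxy.port);
    curl_easy_setopt(curl, CURLOPT_URL, url.c_str());
    curl_easy_setopt(curl, CURLOPT_PROXY, address.c_str());  // route the request through the proxy
    curl_easy_setopt(curl, CURLOPT_TIMEOUT, 10L);            // free proxies are often slow; fail fast

    CURLcode result = curl_easy_perform(curl);  // response body goes to stdout by default
    curl_easy_cleanup(curl);
    return result == CURLE_OK;
}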
To build (RapidXML is header-only, so only libcurl needs to be linked):

g++ -o proxy main.cpp -lcurl

This provides a simple proxy rotator in C++ that can be called periodically to fetch new proxies. The full code is provided above so you can easily copy and run it; make sure to install libcurl and RapidXML first. Feel free to extend and customize this simple starter code!
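Since free proxy lists go stale quickly, one simple approach is to rebuild the pool on a timer. A minimal sketch (refreshForever is a hypothetical helper, and the five-minute interval is an arbitrary choice):

#include <chrono>
#include <thread>

// Hypothetical refresh loop: rebuild the proxy pool every five minutes
void refreshForever() {
    while (true) {
        std::vector<Proxy> proxies = parseProxies(fetchProxies());
        // ... hand `proxies` to the scraping code here ...
        std::this_thread::sleep_for(std::chrono::minutes(5));
    }
}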
curl "<http://api.proxiesapi.com/?key=API_KEY&url=https://example.com>"
<!doctype html>
<html>
<head>
    <title>Example Domain</title>
    <meta charset="utf-8" />
    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
...

We have a running offer of 1000 API calls completely free. Register and get your free API key here.