Google Scholar is an invaluable resource for researching academic papers and articles. However, the search interface limits you to manually looking through results. To do more advanced research, it helps to access the paper metadata directly: title, URL, authors, abstract, and so on.
This is the Google Scholar results page we are talking about.
The code in this article shows how to scrape a Google Scholar search URL and extract key metadata fields that you can then analyze programmatically or export elsewhere. We'll walk through the steps with readers new to web scraping in mind.
Installations & Imports
To get started, you'll need the following:
- libcurl
- the Tidy HTML parser (libtidy)
C++ headers to include:
#include <iostream>
#include <string>
#include <vector>
#include <algorithm>
#include <regex>
#include <curl/curl.h>
#include <tidy/tidy.h>
#include <tidy/buffio.h>
Make sure libcurl and Tidy are installed on your system and that the headers above are on your include path.
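For reference, on a Debian/Ubuntu system the development packages and a build command typically look something like the lines below (package names, compiler flags, and the example file name scraper.cpp may differ on your setup):
sudo apt-get install libcurl4-openssl-dev libtidy-dev
g++ -std=c++17 scraper.cpp -lcurl -ltidy -o scraper
Depending on how Tidy is packaged, its headers may be installed as <tidy.h> and <tidybuffio.h> rather than under a tidy/ directory, so adjust the #include lines to match your installation.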
Walkthrough
We first define two key constants:
// Define the URL of the Google Scholar search page
const std::string url = "<https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=transformers&btnG=>";
// Define a User-Agent header
const std::string userAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36";
The URL points to a Google Scholar search for the query "transformers", and the User-Agent string makes our request look like it is coming from a regular Chrome browser rather than a script.
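If you want to search for something other than "transformers", the query parameter must be URL-encoded. As a rough sketch using the headers already included above, libcurl's curl_easy_escape can do the encoding; the buildScholarUrl helper here is our own illustration, not part of the original code:
// Hypothetical helper: builds a Scholar search URL for an arbitrary query,
// URL-encoding it with curl_easy_escape (the escaped buffer must be freed).
std::string buildScholarUrl(CURL* curl, const std::string& query) {
char* escaped = curl_easy_escape(curl, query.c_str(), static_cast<int>(query.size()));
std::string result = "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=" + std::string(escaped) + "&btnG=";
curl_free(escaped);
return result;
}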
Next we define a callback function that libcurl calls with each chunk of response data; it simply appends the chunk to a std::string (the full body appears in the complete listing below):
// Callback function for libcurl to write response data into a string
size_t WriteCallback(void* contents, size_t size, size_t nmemb, std::string* output) {
// .. function body ..
}
In the main function, we initialize libcurl and point it at the search URL with our User-Agent:
// Initialize libcurl
CURL* curl = curl_easy_init();
curl_easy_setopt(curl, CURLOPT_URL, url.c_str());
curl_easy_setopt(curl, CURLOPT_USERAGENT, userAgent.c_str());
We define a response string to hold the HTML and register the write callback so libcurl stores the page content there:
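// Response string to store the HTML content
std::string response;
// Set the callback function to handle the response data
curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, WriteCallback);
curl_easy_setopt(curl, CURLOPT_WRITEDATA, &response);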
We then send a GET request and check that it succeeded:
CURLcode res = curl_easy_perform(curl);
if (res == CURLE_OK) {
// Request succeeded
} else {
// Request failed
}
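Note that CURLE_OK only means the transfer itself completed; it does not guarantee the server returned HTTP 200. Google Scholar often answers blocked requests with an error page or a CAPTCHA, so as an optional extra check (our addition, not in the original code) you can also ask libcurl for the response code:
// Optional extra check (not in the original code): verify the HTTP status
long httpCode = 0;
curl_easy_getinfo(curl, CURLINFO_RESPONSE_CODE, &httpCode);
if (httpCode != 200) {
std::cerr << "Unexpected HTTP status code: " << httpCode << std::endl;
}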
If successful, we have the HTML content. We use the Tidy parser to clean up the HTML:
// Parse HTML using Tidy
TidyDoc tidyDoc = tidyCreate();
// .. Tidy options & setup ..
tidyParseString(tidyDoc, response.c_str());
// Clean up HTML
tidyCleanAndRepair(tidyDoc);
tidyRunDiagnostics(tidyDoc);
// Save cleaned HTML
tidySaveBuffer(tidyDoc, &outputBuffer);
The formatted HTML is now stored in outputBuffer, which we convert to a std::string:
// Convert to string
std::string htmlContent = reinterpret_cast<char*>(outputBuffer.bp);
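One caveat (our addition, not something the original code handles): if Tidy fails to produce any output, outputBuffer.bp can be null, and constructing a std::string from a null pointer is undefined behavior. A slightly more defensive conversion might look like this:
// Defensive conversion: only build the string if Tidy actually produced output
std::string htmlContent;
if (outputBuffer.bp != nullptr) {
htmlContent.assign(reinterpret_cast<char*>(outputBuffer.bp), outputBuffer.size);
}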
Extracting Data with Regular Expressions
Inspecting the code
If you inspect the page source, you can see that each result's fields sit in elements with distinctive CSS classes: the title in an <h3 class="gs_rt"> heading, the result link in the <a> tag inside it, the authors in a <div class="gs_a">, and the abstract snippet in a <div class="gs_rs">.
Here is where the real scraping takes place. We define four regex patterns to match and extract the title, URL, authors, and abstract text from the cleaned HTML, and we iterate over every occurrence with std::sregex_iterator.
To extract the title, we use:
std::regex titleRegex("<h3 class=\"gs_rt\">(.*?)<\\/h3>");
std::string title = titleIterator->str(1);
This pulls just the captured group (the text between the tags) into the title string.
We extract just the link URL with:
std::regex urlRegex("<a href=\"(.*?)\"");
std::string url = urlIterator->str(1);
The authors are extracted via:
std::regex authorsRegex("<div class=\"gs_a\">(.*?)<\\/div>");
std::string authors = authorsIterator->str(1);
And we grab the captured abstract with:
std::regex abstractRegex("<div class=\"gs_rs\">(.*?)<\\/div>");
std::string abstract = abstractIterator->str(1);
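Note that the captured groups can still contain nested markup; the title inside gs_rt, for example, usually wraps an <a> tag. If you want plain text, a small helper can strip the leftover tags before you print or export the fields. The stripTags function below is our own addition, not part of the original article's code:
// Hypothetical helper (our addition): strips any remaining HTML tags
// from a captured group so only the visible text is left.
std::string stripTags(const std::string& html) {
static const std::regex tagRegex("<[^>]*>");
return std::regex_replace(html, tagRegex, "");
}
You would then call stripTags(title) and so on before printing.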
And that covers the key components for scraping the metadata!
Full Code
Here is the full code to bring the whole process together:
#include <iostream>
#include <string>
#include <vector>
#include <algorithm>
#include <regex>
#include <curl/curl.h>
#include <tidy/tidy.h>
#include <tidy/buffio.h>
// Define the URL of the Google Scholar search page
const std::string url = "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=transformers&btnG=";
// Define a User-Agent header
const std::string userAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36";
// Callback function for libcurl to write response data into a string
size_t WriteCallback(void* contents, size_t size, size_t nmemb, std::string* output) {
size_t totalSize = size * nmemb;
output->append(static_cast<char*>(contents), totalSize);
return totalSize;
}
int main() {
// Initialize libcurl
CURL* curl = curl_easy_init();
if (curl) {
// Set the URL and User-Agent header
curl_easy_setopt(curl, CURLOPT_URL, url.c_str());
curl_easy_setopt(curl, CURLOPT_USERAGENT, userAgent.c_str());
// Response string to store the HTML content
std::string response;
// Set the callback function to handle the response data
curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, WriteCallback);
curl_easy_setopt(curl, CURLOPT_WRITEDATA, &response);
// Send a GET request
CURLcode res = curl_easy_perform(curl);
// Check if the transfer completed successfully
if (res == CURLE_OK) {
// Parse the HTML content using Tidy
TidyDoc tidyDoc = tidyCreate();
TidyBuffer outputBuffer = {0};
TidyBuffer errBuffer = {0};
tidyOptSetBool(tidyDoc, TidyXhtmlOut, yes);
tidyOptSetInt(tidyDoc, TidyWrapLen, 4096);
tidySetErrorBuffer(tidyDoc, &errBuffer);
tidyParseString(tidyDoc, response.c_str());
tidyCleanAndRepair(tidyDoc);
tidyRunDiagnostics(tidyDoc);
tidySaveBuffer(tidyDoc, &outputBuffer);
// Convert the output to a string
std::string htmlContent = reinterpret_cast<char*>(outputBuffer.bp);
// Use regular expressions to extract information
std::regex titleRegex("<h3 class=\"gs_rt\">(.*?)<\\/h3>");
std::regex urlRegex("<a href=\"(.*?)\"");
std::regex authorsRegex("<div class=\"gs_a\">(.*?)<\\/div>");
std::regex abstractRegex("<div class=\"gs_rs\">(.*?)<\\/div>");
// Find all matches in the HTML content
std::sregex_iterator end;
std::sregex_iterator titleIterator(htmlContent.begin(), htmlContent.end(), titleRegex);
std::sregex_iterator urlIterator(htmlContent.begin(), htmlContent.end(), urlRegex);
std::sregex_iterator authorsIterator(htmlContent.begin(), htmlContent.end(), authorsRegex);
std::sregex_iterator abstractIterator(htmlContent.begin(), htmlContent.end(), abstractRegex);
// Loop through each match and extract information
// (std::sregex_iterator has no size() or operator[]; advance all four iterators in lockstep)
for (; titleIterator != end && urlIterator != end && authorsIterator != end && abstractIterator != end;
++titleIterator, ++urlIterator, ++authorsIterator, ++abstractIterator) {
std::string title = titleIterator->str(1);
std::string url = urlIterator->str(1);
std::string authors = authorsIterator->str(1);
std::string abstract = abstractIterator->str(1);
// Print the extracted information
std::cout << "Title: " << title << std::endl;
std::cout << "URL: " << url << std::endl;
std::cout << "Authors: " << authors << std::endl;
std::cout << "Abstract: " << abstract << std::endl;
std::cout << std::string(50, '-') << std::endl;
}
// Clean up
tidyBufFree(&outputBuffer);
tidyBufFree(&errBuffer);
tidyRelease(tidyDoc);
} else {
std::cerr << "Failed to retrieve the page. CURL error code: " << res << std::endl;
}
// Cleanup libcurl
curl_easy_cleanup(curl);
} else {
std::cerr << "Failed to initialize libcurl." << std::endl;
}
return 0;
}
curl "<http://api.proxiesapi.com/?key=API_KEY&render=true&url=https://example.com>"