In this beginner web scraping tutorial, we'll walk through code that scrapes search results data from Google Scholar.
This is the Google Scholar result page we are talking about…
Overview
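We'll be using Node.js for web scraping, with two key packages: request-promise, which sends the HTTP request and returns a promise, and cheerio, which parses the returned HTML so it can be queried with jQuery-style selectors.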
First we require these packages:
const rp = require('request-promise');
const cheerio = require('cheerio');
Then we set up the initial scraper configuration:
// Define the URL of the Google Scholar search page
const url = "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=transformers&btnG=";
// Define a User-Agent header
const headers = {
  "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"
};
// Configure the request options
const options = {
  uri: url,
  headers: headers,
  transform: function (body) {
    return cheerio.load(body);
  }
};
This sets up the Google Scholar URL we want to scrape, adds a browser User-Agent string, and configures the request to use cheerio for HTML parsing.
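To search for something other than "transformers", the query string can be assembled programmatically instead of edited by hand. The following is a minimal sketch, assuming the parameters seen in the URL above (hl, as_sdt, q, btnG) are all that is needed; buildScholarUrl is a hypothetical helper name, not part of any library.

```javascript
// Hypothetical helper: builds a Google Scholar search URL for a given query.
// The parameter names and values mirror the URL used in this tutorial.
function buildScholarUrl(query) {
  const url = new URL("https://scholar.google.com/scholar");
  url.searchParams.set("hl", "en");
  url.searchParams.set("as_sdt", "0,5"); // the comma is serialized as %2C
  url.searchParams.set("q", query);      // spaces are serialized as "+"
  url.searchParams.set("btnG", "");
  return url.toString();
}

console.log(buildScholarUrl("transformers"));
console.log(buildScholarUrl("graph neural networks"));
```

The returned string can be used as the uri value in the request options.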
Making the Request
With the configuration complete, we can now make the GET request:
// Send a GET request to the URL with the User-Agent header
rp(options)
  .then(($) => {
    // ... extract data here
  })
  .catch((error) => {
    console.error("Failed to retrieve the page:", error);
  });
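We pass the options to rp(), which sends the request and returns a promise.
The .then() callback receives the parsed page as $, because the transform option runs cheerio.load() on the response body. The .catch() callback logs the error if the request fails.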
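The catch handler above logs every failure the same way. As a rough sketch, the error can be inspected to tell an HTTP-level rejection from a network problem. This assumes request-promise's default behaviour of rejecting non-2xx responses with an error that carries a statusCode property; describeFailure is a hypothetical helper, and reading 429 as rate limiting is the general HTTP meaning of that code, not something documented specifically for Google Scholar.

```javascript
// Hypothetical helper: turns a rejected request into a readable message.
// Assumes HTTP errors expose `statusCode` and other errors expose `message`.
function describeFailure(error) {
  if (error && typeof error.statusCode === "number") {
    if (error.statusCode === 429) {
      return "HTTP 429: too many requests, slow down or try again later";
    }
    return `HTTP ${error.statusCode}: the server rejected the request`;
  }
  const reason = error && error.message ? error.message : String(error);
  return `Network or parsing problem: ${reason}`;
}

// Plain objects and errors stand in for real request failures here.
console.log(describeFailure({ statusCode: 429 }));
console.log(describeFailure(new Error("getaddrinfo ENOTFOUND")));
```

In the scraper, this would be called as console.error(describeFailure(error)) inside the .catch() callback.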
Extracting Search Result Data
Inspecting the code
If you inspect the HTML of the results page, you can see that each result item is enclosed in an element with the class gs_ri. Inside the .then() callback, we grab all the .gs_ri elements with a selector, then iterate through them with .each(). The snippets for each step appear under their own headings below.
For each result, we first get the title and URL. We use .find(".gs_rt") to locate the title element and take its text as the title. The linked URL is extracted directly from the href attribute of the anchor tag inside it.
Next up is author data: simply grab the inner text of the .gs_a element.
Finally, we extract the abstract text, which is the inner text of the .gs_rs element. Each value falls back to "N/A" when the element is missing or empty.
To finish, we log out all the information extracted from each search result. This prints the title, URL, authors and abstract for inspection. The separating line keeps each result organized in the terminal output.
And that covers scraping key data fields from Google Scholar search results! The full code is included under "Running the Scraper" below as a complete scraper. To run it, you need Node.js and the two packages required at the top of the script, request-promise and cheerio. The output will show extracted data from search results for the query "transformers". Feel free to customize the search URL for other queries.
This is great as a learning exercise, but it is easy to see that the scraper itself is prone to getting blocked, as it uses a single IP. If you need to handle thousands of fetches every day, using a professional rotating proxy service to rotate IPs is almost a must. Otherwise, you tend to get IP blocked a lot by automatic location, usage, and bot detection algorithms.
Our rotating proxy server, Proxies API, provides a simple API that can solve IP blocking problems instantly. Hundreds of our customers have successfully solved the headache of IP blocks with a simple API. The whole thing can be accessed by a simple API like below in any programming language.
In fact, you don't even have to take the pain of loading Puppeteer, as we render JavaScript behind the scenes. You can just get the data and parse it in any language, such as Node or PHP, or with any framework, such as Scrapy or Nutch. In all these cases you can just call the URL with render support by adding render=true to the request, as in the curl command at the end of this article.
We have a running offer of 1000 API calls completely free.
Register and get your free API Key.
Get HTML from any page with a simple API call. We handle proxy rotation, browser identities, automatic retries, CAPTCHAs, JavaScript rendering, etc. automatically for you.
curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"
Finding the Result Blocks
// Find all the search result blocks with class "gs_ri"
const search_results = $(".gs_ri");
// Loop through each search result block
search_results.each((index, element) => {
  // Extract data from each result...
});
Title and URL
// Extract the title and URL
const title_elem = $(element).find(".gs_rt");
const title = title_elem.text() || "N/A";
const url = title_elem.find("a").attr("href") || "N/A";
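Scholar titles sometimes carry a bracketed marker such as [PDF] or [BOOK] in front of the actual title, and .text() returns it as part of the string. The sketch below strips such prefixes. cleanTitle is a hypothetical helper, and the marker format is an assumption about how result pages commonly look, so check it against the HTML you actually receive.

```javascript
// Hypothetical helper: removes leading [TAG] markers and tidies whitespace.
function cleanTitle(rawTitle) {
  return rawTitle
    .replace(/^(\s*\[[^\]]+\])+/, "") // drop one or more leading [TAG] markers
    .replace(/\s+/g, " ")              // collapse runs of whitespace
    .trim();
}

// Made-up sample titles, not real Scholar output.
console.log(cleanTitle("[PDF] An Example Title"));
console.log(cleanTitle("[BOOK][B]  Another   Example"));
```

Titles without a marker, including the "N/A" fallback, pass through unchanged.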
Authors
// Extract the authors
const authors_elem = $(element).find(".gs_a");
const authors = authors_elem.text() || "N/A";
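The .gs_a text usually packs more than names into one line; the full script's own comment describes it as "authors and publication details". Assuming the parts are separated by " - " (authors, then venue and year, then source site), which is a guess about the page layout and not a documented format, a hypothetical splitByline helper could separate them:

```javascript
// Hypothetical helper: splits a byline such as
// "A Author, B Writer - Journal of Examples, 2020 - example.org".
// The " - " separator is an assumed layout, not a guaranteed format.
function splitByline(byline) {
  const parts = byline
    .replace(/\u00a0/g, " ") // non-breaking spaces become regular spaces
    .split(" - ")
    .map((part) => part.trim());
  const yearMatch = (parts[1] || "").match(/\b(19|20)\d{2}\b/);
  return {
    authors: parts[0] ? parts[0].split(",").map((name) => name.trim()) : [],
    venue: parts[1] || null,
    year: yearMatch ? Number(yearMatch[0]) : null,
    source: parts[2] || null,
  };
}

// Made-up sample input, not real Scholar output.
console.log(splitByline("A Author, B Writer - Journal of Examples, 2020 - example.org"));
```

When the line has fewer parts than expected, the missing fields come back as null, so the caller can decide how to handle them.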
Abstract
// Extract the abstract or description
const abstract_elem = $(element).find(".gs_rs");
const abstract = abstract_elem.text() || "N/A";
Printing the Results
console.log("Title:", title);
console.log("URL:", url);
console.log("Authors:", authors);
console.log("Abstract:", abstract);
console.log("-".repeat(50)); // Separating search results
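Logging is fine for inspection, but the same fields can also be collected for later use. Here is a minimal sketch, assuming each result has already been reduced to a plain object with title, url, authors and abstract properties; toJsonLines is a hypothetical helper that serializes one result per line.

```javascript
// Hypothetical helper: one JSON object per line, easy to append to a file.
function toJsonLines(records) {
  return records.map((record) => JSON.stringify(record)).join("\n");
}

// Placeholder records; in the scraper these would be pushed inside .each().
const results = [
  { title: "Example Title", url: "https://example.com/a", authors: "A Author", abstract: "N/A" },
  { title: "Second Example", url: "N/A", authors: "B Writer", abstract: "Short summary" },
];

console.log(toJsonLines(results));
```

Each output line can be parsed back with JSON.parse, which keeps the data usable by other tools.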
Running the Scraper
const rp = require('request-promise');
const cheerio = require('cheerio');
// Define the URL of the Google Scholar search page
const url = "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=transformers&btnG=";
// Define a User-Agent header
const headers = {
  "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36" // Replace with your User-Agent string
};
// Configure the request options
const options = {
  uri: url,
  headers: headers,
  transform: function (body) {
    return cheerio.load(body);
  }
};
// Send a GET request to the URL with the User-Agent header
rp(options)
  .then(($) => {
    // Find all the search result blocks with class "gs_ri"
    const search_results = $(".gs_ri");
    // Loop through each search result block and extract information
    search_results.each((index, element) => {
      // Extract the title and URL
      const title_elem = $(element).find(".gs_rt");
      const title = title_elem.text() || "N/A";
      const url = title_elem.find("a").attr("href") || "N/A";
      // Extract the authors and publication details
      const authors_elem = $(element).find(".gs_a");
      const authors = authors_elem.text() || "N/A";
      // Extract the abstract or description
      const abstract_elem = $(element).find(".gs_rs");
      const abstract = abstract_elem.text() || "N/A";
      // Print the extracted information
      console.log("Title:", title);
      console.log("URL:", url);
      console.log("Authors:", authors);
      console.log("Abstract:", abstract);
      console.log("-".repeat(50)); // Separating search results
    });
  })
  .catch((error) => {
    console.error("Failed to retrieve the page:", error);
  });
With render support, the ProxiesAPI call looks like this:
curl "http://api.proxiesapi.com/?key=API_KEY&render=true&url=https://example.com"
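The same call can be made from the Node.js scraper by pointing the request at the API instead of the target site. Here is a minimal sketch, assuming only the key, render and url parameters shown in the curl commands in this article. buildProxiesApiUrl is a hypothetical helper, and URL-encoding the target address is an assumption about what the service accepts (it is standard practice for query strings).

```javascript
// Hypothetical helper: builds the API request URL for a target page.
function buildProxiesApiUrl(apiKey, targetUrl, render = false) {
  const api = new URL("http://api.proxiesapi.com/");
  api.searchParams.set("key", apiKey);
  if (render) {
    api.searchParams.set("render", "true");
  }
  api.searchParams.set("url", targetUrl); // percent-encoded automatically
  return api.toString();
}

console.log(buildProxiesApiUrl("API_KEY", "https://example.com", true));
```

The returned string would replace the uri value in the request options, with API_KEY swapped for a real key.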