This article explains how to scrape Google Scholar search results in R by walking through a fully functional example script. We will extract the title, URL, authors, and abstract for each search result.
This is the Google Scholar results page we will be working with:
Prerequisites
To follow along, you'll need R installed, along with the rvest and httr packages, which handle HTML parsing and HTTP requests respectively.
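If you don't already have these packages, a minimal setup sketch looks like this:
# Install the scraping dependencies from CRAN (one-time setup)
install.packages(c("rvest", "httr"))
# Load them for the current session
library(rvest)
library(httr)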
Sending a Request
We begin by defining the URL of a Google Scholar search page:
url <- "<https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=transformers&btnG=>"
Next we set a User-Agent header to spoof a regular browser request:
user_agent <- "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"
We send a GET request to the URL with httr's GET(), attaching the User-Agent header via user_agent():
response <- GET(url, user_agent(user_agent))
Finally, we check that the request succeeded by verifying the status code is 200, and only then parse the HTML:
if (status_code(response) == 200) {
  # Parse the HTML and scrape the page
  page <- read_html(response)
} else {
  # Handle the error
}
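As an aside (not part of the original script), httr can also raise an error for you when a request fails. A minimal sketch using stop_for_status():
# Alternative to the manual status check: let httr signal failures
response <- GET(url, user_agent(user_agent))
stop_for_status(response)    # throws an error on 4xx/5xx responses
page <- read_html(response)  # safe to parse once we reach this line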
So in just a few lines, we've programmatically sent a request to Google Scholar posing as a real browser!
Extracting Search Results
Inspecting the code
If you inspect the results page with your browser's developer tools, you can see that each search result is enclosed in a div with the class gs_ri.
Now that we have the page contents, we can parse them to extract the data. Since Google Scholar conveniently wraps each search result in a gs_ri block, we first select all of those blocks and then loop over them:
search_results <- html_nodes(page, ".gs_ri")
for (result in search_results) {
  # Extract data from each result
}
Title and URL
Within each search result, the title is a link inside the element with class gs_rt. We take the link's text for the title and its href attribute for the URL:
title_elem <- html_node(result, ".gs_rt a")
title <- html_text(title_elem, trim = TRUE)
url <- html_attr(title_elem, "href")
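Some entries, such as citation-only results, may not contain a clickable title link at all. In that case html_node() returns a missing node and both the title and URL come back as NA, so a small defensive check (not in the original script) can help:
# Hypothetical guard for results without a clickable title link
if (is.na(url)) {
  url <- "(no link available)"
}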
Authors
The authors, venue, and year are stored similarly, under an element with class gs_a:
authors_elem <- html_node(result, ".gs_a")
authors <- html_text(authors_elem, trim = TRUE)
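The gs_a line mixes the authors, venue, and year into a single string. As an optional extra (not in the original script), a regular expression can pull the year out; a minimal sketch, assuming the year appears as a four-digit number somewhere in that string:
# Hypothetical helper: extract a four-digit year from the gs_a byline
year_match <- regmatches(authors, regexpr("\\b(19|20)[0-9]{2}\\b", authors))
year <- if (length(year_match) > 0) year_match else NA_character_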
Abstract
Finally, the abstract snippet lives under the class gs_rs:
abstract_elem <- html_node(result, ".gs_rs")
abstract <- html_text(abstract_elem, trim = TRUE)
Printing the Results
We wrap up by printing out the extracted information - title, URL, authors, and abstract - for diagnostics:
cat("Title:", title, "\\n")
cat("URL:", url, "\\n")
cat("Authors:", authors, "\\n")
cat("Abstract:", abstract, "\\n")
cat("-" * 50, "\\n") # Separator
By using the element classes as CSS selectors, we've cleanly extracted all the data we want, and the contents are now programmatically accessible for further analysis. The key to scraping is meticulously analyzing the HTML structure to locate the data you want; tools like the browser developer console are invaluable for this. Once you've identified the right selectors, parsing and extraction become straightforward.
Full Code
Here is the complete script for reference:
# Load the required libraries
library(rvest)
library(httr)
# Define the URL of the Google Scholar search page
url <- "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=transformers&btnG="
# Define a User-Agent header
user_agent <- "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"
# Send a GET request to the URL with the User-Agent header
response <- GET(url, user_agent(user_agent))

# Check if the request was successful (HTTP status code 200)
if (status_code(response) == 200) {
  # Parse the HTML content of the response
  page <- read_html(response)

  # Find all the search result blocks with class "gs_ri"
  search_results <- html_nodes(page, ".gs_ri")

  # Loop through each search result block and extract information
  for (result in search_results) {
    # Extract the title and URL from the link inside ".gs_rt"
    title_elem <- html_node(result, ".gs_rt a")
    title <- html_text(title_elem, trim = TRUE)
    url <- html_attr(title_elem, "href")

    # Extract the authors and publication details
    authors_elem <- html_node(result, ".gs_a")
    authors <- html_text(authors_elem, trim = TRUE)

    # Extract the abstract or description
    abstract_elem <- html_node(result, ".gs_rs")
    abstract <- html_text(abstract_elem, trim = TRUE)

    # Print the extracted information
    cat("Title:", title, "\n")
    cat("URL:", url, "\n")
    cat("Authors:", authors, "\n")
    cat("Abstract:", abstract, "\n")
    cat(strrep("-", 50), "\n") # Separating search results
  }
} else {
  cat("Failed to retrieve the page. Status code:", status_code(response), "\n")
}
curl "<http://api.proxiesapi.com/?key=API_KEY&render=true&url=https://example.com>"