In this post, we'll walk through a real-world Ruby script that scrapes search result data from Google Scholar. We'll go step-by-step to understand exactly how it works.
This is the Google Scholar result page we are talking about…
Overview
The goal of this script is straightforward: retrieve search result data from a Google Scholar query. This includes each result's title, URL, authors, and abstract.
Rather than using Google's API (which has usage limits), we'll request the HTML directly and parse it.
Let's dive into the code!
Setup
First, we require the two libraries the script depends on:
require 'nokogiri'
require 'open-uri'
Nokogiri lets us extract data from HTML and XML in Ruby. We'll use it to parse Google's response.
OpenURI makes sending HTTP requests easy from Ruby.
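If you haven't used Nokogiri before, here is a minimal, self-contained sketch (separate from the script itself) of how it parses an HTML string and pulls data out with CSS selectors:
require 'nokogiri'

# A tiny HTML fragment, purely for illustration
html = '<div class="item"><h3>Hello</h3><a href="/page">link</a></div>'

doc = Nokogiri::HTML(html)              # parse the string into a document
puts doc.css("div.item h3").first.text  # => "Hello"
puts doc.at("a")["href"]                # => "/page"
The same two ideas, selecting nodes with css/at and reading their text or attributes, are all we need for the Scholar page.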
Defining the Request
Next we set up the URL and headers for our request:
url = "<https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=transformers&btnG=>"
headers = {
"User-Agent" => "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"
}
The URL performs a Google Scholar search for "transformers".
The User-Agent header makes the request look like it comes from a regular desktop browser, which reduces the chance of Google blocking it.
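If you later want to search for something other than "transformers", one approach (my own addition, not part of the original script) is to URL-encode the query string and build the URL with a small helper:
require 'uri'

# Hypothetical helper: build a Scholar search URL for any query string
def scholar_url(query)
  "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=#{URI.encode_www_form_component(query)}&btnG="
end

url = scholar_url("attention is all you need")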
Making the Request
With the URL and headers ready, we use OpenURI to send a GET request:
response = URI.open(url, headers)
We pass the URL along with the headers hash; OpenURI treats string keys in that hash as HTTP request headers, so our User-Agent is sent with the request.
This gives us back a file-like response object with OpenURI metadata, such as the HTTP status, attached.
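The response behaves like an IO object, but it also exposes a few metadata methods from OpenURI that are handy for a quick sanity check. A small exploratory snippet, not part of the main script:
require 'open-uri'

url = "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=transformers&btnG="
headers = { "User-Agent" => "Mozilla/5.0" }  # shortened UA, just for this snippet

response = URI.open(url, headers)

puts response.status.inspect  # => ["200", "OK"] on success
puts response.content_type    # => "text/html" (plus charset)
puts response.base_uri        # final URI after any redirects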
Checking the Response
Before parsing, we check that the request succeeded:
if response.status == ["200", "OK"]
# Parse HTML
else
puts "Failed to retrieve the page. Status code: #{response.status[0]}"
end
A status code of 200 means success. Any other code likely means an error or blocked request.
We print a failure message in that case.
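One caveat worth knowing: OpenURI raises OpenURI::HTTPError for 4xx/5xx responses instead of returning them, so in practice the else branch may never run. A more defensive sketch (my own addition, not from the original script, reusing the url and headers defined above) wraps the request in a rescue:
require 'open-uri'

begin
  response = URI.open(url, headers)
  # We only reach this point for successful (2xx) responses
  html = response.read
rescue OpenURI::HTTPError => e
  # e.io.status holds the [code, message] pair, e.g. ["429", "Too Many Requests"]
  puts "Failed to retrieve the page. Status code: #{e.io.status[0]}"
end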
Parsing the HTML
Inspecting the Code
You can see that each search result item is enclosed in a div with the class gs_ri.
Now we can parse the HTML search results with Nokogiri:
doc = Nokogiri::HTML(response)
search_results = doc.css("div.gs_ri")
We initialize a Nokogiri document from the response, then use the css method to select every div with the class gs_ri, one element per search result.
Extracting Search Result Data
With the search result elements selected, we can extract the fields we want:
search_results.each do |result|
title_elem = result.css("h3.gs_rt").first
title = title_elem&.text || "N/A"
url = title_elem&.at("a")&.attr("href") || "N/A"
authors_elem = result.css("div.gs_a").first
authors = authors_elem&.text || "N/A"
abstract_elem = result.css("div.gs_rs").first
abstract = abstract_elem&.text || "N/A"
# Print output
end
We loop through each result block. The key part is using CSS selectors to grab elements, then reading text or attributes from them. For example, result.css("h3.gs_rt").first selects the title heading, &.text reads its text, and &.at("a")&.attr("href") pulls the link URL out of it (the safe-navigation operator &. keeps the script from crashing when an element is missing). This may look confusing at first, but when you break it down selector by selector, you can see exactly how each data field is extracted.
Printing Output
Finally, we print the extracted data:
puts "Title: #{title}"
puts "URL: #{url}"
puts "Authors: #{authors}"
puts "Abstract: #{abstract}"
puts "-" * 50 # Separator
This outputs each search result's data to the console.
Full Code
For easy reference, here is the complete script:
require 'nokogiri'
require 'open-uri'
# Define the URL of the Google Scholar search page
url = "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=transformers&btnG="
# Define a User-Agent header
headers = {
"User-Agent" => "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36" # Replace with your User-Agent string
}
# Send a GET request to the URL with the User-Agent header
response = URI.open(url, headers)
# Check if the request was successful (status code 200)
if response.status == ["200", "OK"]
# Parse the HTML content of the page using Nokogiri
doc = Nokogiri::HTML(response)
# Find all the search result blocks with class "gs_ri"
search_results = doc.css("div.gs_ri")
# Loop through each search result block and extract information
search_results.each do |result|
# Extract the title and URL
title_elem = result.css("h3.gs_rt").first
title = title_elem&.text || "N/A"
url = title_elem&.at("a")&.attr("href") || "N/A"
# Extract the authors and publication details
authors_elem = result.css("div.gs_a").first
authors = authors_elem&.text || "N/A"
# Extract the abstract or description
abstract_elem = result.css("div.gs_rs").first
abstract = abstract_elem&.text || "N/A"
# Print the extracted information
puts "Title: #{title}"
puts "URL: #{url}"
puts "Authors: #{authors}"
puts "Abstract: #{abstract}"
puts "-" * 50 # Separating search results
end
else
puts "Failed to retrieve the page. Status code: #{response.status[0]}"
end
curl "<http://api.proxiesapi.com/?key=API_KEY&render=true&url=https://example.com>"