Google Scholar provides access to extensive academic literature across disciplines. Fortunately, it provides a straightforward web interface that we can scrape to leverage its search capabilities from within our own Elixir applications. In this guide, we'll walk through a complete example of scraping Google Scholar search results.
This is the Google Scholar result page we are talking about…
Prerequisites
To follow along, you'll need:
- A working Elixir installation (with mix)
- The HTTPoison and Floki packages declared as dependencies in your project's mix.exs
Once the dependencies are declared, install them by running:
mix deps.get
This will fetch the HTTPoison and Floki libraries we'll use (the String module is part of Elixir's standard library, so there's nothing to install for it).
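For reference, the dependency section of mix.exs might look roughly like this (a minimal sketch; the version requirements are assumptions rather than taken from the article):

# In mix.exs; the version constraints below are illustrative assumptions
defp deps do
  [
    {:httpoison, "~> 2.0"},
    {:floki, "~> 0.34"}
  ]
end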
Scraping Overview
Here's a high-level overview of what our scraper will do:
- Send an HTTP request to the Google Scholar search URL
- Parse the HTML content of the result
- Find result item elements in the DOM
- Extract fields like title, URL, authors, abstract
- Print out extracted fields
So essentially we fetch the initial data, then parse and extract the specific pieces we want.
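In Elixir terms, the whole flow boils down to a short pipeline. The sketch below is purely conceptual; fetch_html/1, parse_document/1, find_results/1, and print_fields/1 are placeholder names for the pieces we actually implement as module functions further down:

# Conceptual pipeline only; all helper names are placeholders
@url
|> fetch_html()
|> parse_document()
|> find_results()
|> Enum.each(&print_fields/1)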
Setting Up the HTTP Request
Let's walk through the code one section at a time. We start with some standard Elixir module declarations:
defmodule ScholarScraper do
  use HTTPoison.Base

  @url "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=transformers&btnG="
We start by defining the ScholarScraper module, pulling in HTTPoison.Base (which gives the module HTTP helpers such as get/2), and storing the Google Scholar search URL for the query "transformers" in the @url module attribute.
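The query is hard-coded here. If you want to search for something else, you can build the query string instead of pasting a full URL. Here is a minimal sketch of that idea; the search_url/1 helper is not part of the scraper in this article, just an illustration:

# Hypothetical helper for building a Scholar search URL for any query
defp search_url(query) do
  "https://scholar.google.com/scholar?" <>
    URI.encode_query(%{"hl" => "en", "as_sdt" => "0,5", "q" => query})
end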
Next we'll define the HTTP headers to spoof a browser visit:
defp headers do
  %{
    "User-Agent" => "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"
  }
end
This User-Agent string makes our request look like it is coming from an ordinary desktop Chrome browser rather than from an HTTP library, which lowers the chance of Google rejecting the request outright.
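If you make many requests, you may also want to vary the User-Agent between them instead of always sending the same string. Here is one way that could look; the @user_agents list is illustrative and not part of the original code:

# Illustrative variation: pick a random User-Agent for each request
@user_agents [
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36",
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.1 Safari/605.1.15"
]

defp headers do
  %{"User-Agent" => Enum.random(@user_agents)}
end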
Making the Request and Parsing
Inspecting the code
You can see that each result item is enclosed in a div with the class gs_ri, so that is the element our scraper will look for.
Now let's implement the function that will fetch and parse the search results:
def fetch_search_results do
  # Request the page; we expect a 200 response with the HTML in the body
  {:ok, %HTTPoison.Response{status_code: 200, body: body}} = get(@url, headers())

  case Floki.parse_document(body) do
    {:ok, document} ->
      # Each search result sits inside a div with the class gs_ri
      search_results = Floki.find(document, "div.gs_ri")
      Enum.each(search_results, &extract_and_print(&1))

    _ ->
      IO.puts("Failed to parse HTML content.")
  end
end
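Breaking this down:
- get/2 comes from use HTTPoison.Base; we pass it the search URL and our browser-like headers, and pattern match on a 200 response to pull out the HTML body
- Floki.parse_document/1 turns that HTML into a document tree we can query
- Floki.find/2 selects every div.gs_ri node, one per search result
- each node is handed off to extract_and_print/1
So at this point, if all goes well, we have extracted each of the DOM nodes representing an individual search result item.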
Extracting Search Result Fields
Let's take a closer look at how the field extraction works in extract_and_print/1:
defp extract_and_print(result) do
  # Each field lives in a well-known element inside the result node
  title = result |> Floki.find("h3.gs_rt") |> Floki.text() |> String.trim()
  url = result |> Floki.find("h3.gs_rt a") |> Floki.attribute("href") |> List.first()
  authors = result |> Floki.find("div.gs_a") |> Floki.text() |> String.trim()
  abstract = result |> Floki.find("div.gs_rs") |> Floki.text() |> String.trim()

  IO.puts("Title: #{title}")
  IO.puts("URL: #{url}")
  IO.puts("Authors: #{authors}")
  IO.puts("Abstract: #{abstract}")
  IO.puts(String.duplicate("-", 50))
end
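The key things to understand here:
- h3.gs_rt holds the result title, and the a tag inside it carries the link to the paper
- div.gs_a holds the author/venue line, and div.gs_rs holds the abstract snippet
- Floki.find/2 returns a list of matching nodes, Floki.text/1 collapses that list to its text content, and Floki.attribute/2 returns a list of attribute values, which is why we take the first href with List.first/1
So what this does is turn each raw div.gs_ri node into a handful of clean strings. Printing this out gives us clean extracted fields for each search result item!
If the Floki calls feel unfamiliar, here is a tiny self-contained example of the same pattern on a made-up snippet of HTML (not real Scholar markup, just an illustration):

html = ~s(<div class="gs_ri"><h3 class="gs_rt"><a href="https://example.com/paper">Attention Is All You Need</a></h3></div>)
{:ok, doc} = Floki.parse_document(html)
doc |> Floki.find("h3.gs_rt") |> Floki.text()
#=> "Attention Is All You Need"
doc |> Floki.find("h3.gs_rt a") |> Floki.attribute("href")
#=> ["https://example.com/paper"]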
Running the Scraper
Finally, to execute everything, simply call the main function (for example from an iex -S mix session):
ScholarScraper.fetch_search_results()
The full code at this point:
defmodule ScholarScraper do
  use HTTPoison.Base

  # Google Scholar search results page for the query "transformers"
  @url "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=transformers&btnG="

  # Spoof a regular desktop Chrome browser
  defp headers do
    %{
      "User-Agent" => "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"
    }
  end

  def fetch_search_results do
    # Request the page; we expect a 200 response with the HTML in the body
    {:ok, %HTTPoison.Response{status_code: 200, body: body}} = get(@url, headers())

    case Floki.parse_document(body) do
      {:ok, document} ->
        # Each search result sits inside a div with the class gs_ri
        search_results = Floki.find(document, "div.gs_ri")
        Enum.each(search_results, &extract_and_print(&1))

      _ ->
        IO.puts("Failed to parse HTML content.")
    end
  end

  defp extract_and_print(result) do
    title = result |> Floki.find("h3.gs_rt") |> Floki.text() |> String.trim()
    url = result |> Floki.find("h3.gs_rt a") |> Floki.attribute("href") |> List.first()
    authors = result |> Floki.find("div.gs_a") |> Floki.text() |> String.trim()
    abstract = result |> Floki.find("div.gs_rs") |> Floki.text() |> String.trim()

    IO.puts("Title: #{title}")
    IO.puts("URL: #{url}")
    IO.puts("Authors: #{authors}")
    IO.puts("Abstract: #{abstract}")
    IO.puts(String.duplicate("-", 50))
  end
end

# To run the scraper:
ScholarScraper.fetch_search_results()
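One thing worth noting: the bare pattern match on {:ok, %HTTPoison.Response{status_code: 200, ...}} will crash if Google Scholar returns anything other than a 200, for example a CAPTCHA page or a 429. A more defensive version of the fetch could handle those cases explicitly. This is just a sketch, and parse_and_print/1 is a hypothetical helper wrapping the Floki parsing shown above:

# Illustrative variant of fetch_search_results/0 with explicit error handling
def fetch_search_results do
  case get(@url, headers()) do
    {:ok, %HTTPoison.Response{status_code: 200, body: body}} ->
      # parse_and_print/1 is a hypothetical helper containing the Floki logic above
      parse_and_print(body)

    {:ok, %HTTPoison.Response{status_code: status}} ->
      IO.puts("Unexpected status code: #{status}")

    {:error, %HTTPoison.Error{reason: reason}} ->
      IO.puts("Request failed: #{inspect(reason)}")
  end
end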
curl "<http://api.proxiesapi.com/?key=API_KEY&render=true&url=https://example.com>"