In this article, we'll walk through a full code example of scraping search results data from Google Scholar using Scala and the Jsoup library.
Even as a beginner, by the end of this article you'll understand how to connect to a search results page, locate the elements you care about with CSS selectors, and pull them out as structured data.
This will provide a foundation for building your own web scrapers to gather data for any purpose.
This is the Google Scholar result page we are talking about…
Setting up the Environment
Because we'll be using external libraries, there is some setup required before running the code:
Install Scala
If you don't already have Scala installed on your machine, you'll need to:
- Download Scala from https://www.scala-lang.org/download/
- Follow the installation instructions for your operating system
Get the Jsoup Dependency
We use the Jsoup Java library to connect to webpages and parse the HTML content.
You'll need to add this dependency to your Scala project. If using SBT, add this line to your build.sbt file:
libraryDependencies += "org.jsoup" % "jsoup" % "1.14.3"
If using another build tool, check its documentation for adding external libraries.
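For reference, a minimal build.sbt might look like the sketch below; the project name and Scala version are placeholders we've chosen here, so use whatever matches your setup:
name := "scholar-scraper"        // placeholder project name
scalaVersion := "2.13.12"        // any recent Scala 2.13 release should work
// Jsoup for fetching and parsing HTML
libraryDependencies += "org.jsoup" % "jsoup" % "1.14.3"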
Okay, we're ready to dive into the code!
Connecting to Google Scholar
We first need to connect to Google Scholar to get the raw HTML content of the search page:
// Define the URL of the Google Scholar search page
val url = "<https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=transformers&btnG=>"
// Send a GET request to the URL
val doc: Document = Jsoup.connect(url)
.userAgent("Mozilla/5.0...")
.get()
Here's what's happening in detail: we define the URL of the Google Scholar search page, use Jsoup.connect to send a GET request to it, set a browser-like user agent string so the request doesn't look like an obvious bot, and parse the returned HTML into a Document object.
So after those lines run, the doc variable holds the parsed HTML of the search results page, ready for us to query.
Note: In web scraping, using a timeout and retry logic is also important in case of errors. We omitted that here for simplicity.
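If you do want a timeout and basic retries, one way is to wrap the fetch in a small helper like the sketch below. The helper name fetchWithRetries, the attempt count, and the delay are our own illustration choices, not part of the original code; Jsoup's timeout() takes milliseconds.
import org.jsoup.Jsoup
import org.jsoup.nodes.Document
// Fetch a page, retrying a couple of times on network errors (sketch only)
def fetchWithRetries(url: String, attemptsLeft: Int = 3): Document = {
  try {
    Jsoup.connect(url)
      .userAgent("Mozilla/5.0...") // same browser-like user agent as above
      .timeout(10000)              // give up on the request after 10 seconds
      .get()
  } catch {
    case _: java.io.IOException if attemptsLeft > 1 =>
      Thread.sleep(2000)           // brief pause, then try again
      fetchWithRetries(url, attemptsLeft - 1)
  }
}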
Extracting Elements from the Page
Now that we have the page content, we can use CSS selectors to extract specific elements from the HTML.
Understanding CSS Selectors
CSS selectors allow locating elements in the DOM tree based on class names, IDs, hierarchy, attributes and more.
Some examples:
- div.gs_ri matches <div> elements with the class gs_ri
- h3 a matches <a> anchors nested inside <h3> headings
- #content matches the element whose ID is content
We can use these to extract specific pieces of data.
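With Jsoup, you pass these selectors to select (all matches) or selectFirst (first match, or null if nothing matches). A quick sketch, assuming doc is the Document we fetched above:
import org.jsoup.nodes.Element
import org.jsoup.select.Elements
// Every result block carrying the class "gs_ri"
val resultBlocks: Elements = doc.select("div.gs_ri")
// The first title heading on the page, or null if there is none
val firstTitle: Element = doc.selectFirst("h3.gs_rt")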
Selecting Google Scholar Results
Inspecting the code
You can see that each result item is enclosed in a <div> block with the class gs_ri. In our code, we select the main search results on Google Scholar using that class:
val searchResults: Elements = doc.select("div.gs_ri")
This gives us a collection of Element objects matching that selector. We can now loop through them to extract info from each result:
for (result: Element <- searchResults.toArray) {
// Extract data from this search result
...
}
Let's look at how each piece of data is selected.
Extracting the Title
// Select the <h3> tag under this result
val titleElem: Element = result.selectFirst("h3.gs_rt")
// Get the text contents of that h3 element
val title: String = if(titleElem != null) titleElem.text() else "N/A"
We first use the selector h3.gs_rt to find the title heading for this result, and from that element we get the text contents, falling back to "N/A" if it isn't present.
Extracting the URL
// Select the <a> tag under h3
val url: String = if(titleElem != null) titleElem.selectFirst("a").attr("href") else "N/A"
Here we get the anchor (<a>) tag under the h3 and read its href attribute to obtain the result's link. And so on for authors, publication details, abstract, etc - each field has a specific selector to extract it.
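One caveat worth noting: selectFirst returns null when nothing matches, so titleElem.selectFirst("a").attr("href") can throw a NullPointerException for a result whose title has no link. A slightly more defensive variation using Option - our own sketch, not part of the original code - looks like this:
import org.jsoup.nodes.Element
// Wrap the possibly-null lookups in Option so a missing heading or
// missing anchor falls back to "N/A" instead of throwing.
val titleOpt: Option[Element] = Option(result.selectFirst("h3.gs_rt"))
val safeTitle: String = titleOpt.map(_.text()).getOrElse("N/A")
val safeUrl: String = titleOpt
  .flatMap(t => Option(t.selectFirst("a")))
  .map(_.attr("href"))
  .getOrElse("N/A")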
Key Insight: Understanding which CSS selector corresponds to which data field is crucial for successful scraping. We spent time here understanding them since that's where beginners tend to struggle.
Putting it Together
We loop through the results, applying the above element selection logic to print out the key fields:
for (result: Element <- searchResults.toArray) {
// Select h3 tag
val titleElem = ...
// Extract title
val title = ...
// Extract URL
val url = ...
// Print output
println("Title: " + title)
println("URL: " + url)
}
And at the end we have a structured dataset with the information we wanted from each search result!
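If you'd rather keep the results in memory than print them, one option is to map each block into a small case class. ScholarResult and the field names below are our own illustration, not part of the original code; the asScala conversion on the searchResults collection comes from scala.jdk.CollectionConverters (Scala 2.13).
import org.jsoup.nodes.Element
import scala.jdk.CollectionConverters._
// A simple record type for one search result
case class ScholarResult(title: String, url: String, authors: String, abstractText: String)
// Turn the Jsoup Elements collection into a Scala List of records
val records: List[ScholarResult] = searchResults.asScala.toList.map { result =>
  val titleElem = Option(result.selectFirst("h3.gs_rt"))
  ScholarResult(
    title = titleElem.map(_.text()).getOrElse("N/A"),
    url = titleElem.flatMap(t => Option(t.selectFirst("a"))).map(_.attr("href")).getOrElse("N/A"),
    authors = Option(result.selectFirst("div.gs_a")).map(_.text()).getOrElse("N/A"),
    abstractText = Option(result.selectFirst("div.gs_rs")).map(_.text()).getOrElse("N/A")
  )
}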
The full code can be seen below:
import org.jsoup.Jsoup
import org.jsoup.nodes.Document
import org.jsoup.nodes.Element
import org.jsoup.select.Elements
object ScholarScraper {
def main(args: Array[String]): Unit = {
// Define the URL of the Google Scholar search page
val url = "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=transformers&btnG="
// Send a GET request to the URL
val doc: Document = Jsoup.connect(url)
.userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36")
.get()
// Find all the search result blocks with class "gs_ri"
val searchResults: Elements = doc.select("div.gs_ri")
// Loop through each search result block and extract information
for (result: Element <- searchResults.toArray) {
// Extract the title and URL
val titleElem: Element = result.selectFirst("h3.gs_rt")
val title: String = if (titleElem != null) titleElem.text() else "N/A"
val url: String = if (titleElem != null) titleElem.selectFirst("a").attr("href") else "N/A"
// Extract the authors and publication details
val authorsElem: Element = result.selectFirst("div.gs_a")
val authors: String = if (authorsElem != null) authorsElem.text() else "N/A"
// Extract the abstract or description
val abstractElem: Element = result.selectFirst("div.gs_rs")
val abstractText: String = if (abstractElem != null) abstractElem.text() else "N/A" // "abstract" is a reserved word in Scala, so we use abstractText
// Print the extracted information
println("Title: " + title)
println("URL: " + url)
println("Authors: " + authors)
println("Abstract: " + abstract)
println("-" * 50) // Separating search results
}
}
}
curl "<http://api.proxiesapi.com/?key=API_KEY&render=true&url=https://example.com>"