Google Scholar is an excellent resource for finding scholarly articles and studies on any topic. The search engine provides detailed information on publications, including the title, authors, abstract, citations, and more. This wealth of data also makes Google Scholar pages prime targets for web scraping.
This is the kind of Google Scholar results page we are talking about: a list of results, each showing a title, an author and publication line, and a short excerpt of the abstract.
In this beginner tutorial, we will walk through a full code example for scraping key details from Google Scholar search results using Jsoup in Kotlin.
Required Packages
To scrape web pages, we need a Java library that can retrieve and parse HTML content. Jsoup is a popular option that makes it easy to extract and manipulate data from HTML documents using a jQuery-style selector API.
We import the main Jsoup classes we will need:
import org.jsoup.Jsoup
import org.jsoup.nodes.Document
import org.jsoup.nodes.Element
import org.jsoup.select.Elements
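If you have not used Jsoup before, here is a tiny sketch of that selector style in action. The HTML snippet is made up for illustration and the statements are meant to be run inside a scratch main function; it is not part of the scraper itself.
// Quick demo of Jsoup's jQuery-style selectors on an inline HTML snippet (markup invented for illustration)
val sample: Document = Jsoup.parse("<div class='gs_ri'><h3 class='gs_rt'><a href='https://example.com/paper'>Sample paper title</a></h3></div>")
val link: Element? = sample.selectFirst("h3.gs_rt a")
println(link?.text())        // Sample paper title
println(link?.attr("href"))  // https://example.com/paper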
Walking Through the Code
Let's break down this full web scraping script step-by-step:
Define Target URL
We specify the root Google Scholar search URL that we want to scrape:
val url = "<https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=transformers&btnG=>"
This URL searches Google Scholar for the term "transformers".
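To search for a different term, you only need to change the q parameter. As a sketch (scholarSearchUrl is our own hypothetical helper name and the query string is just an example, not part of the original script), you could build the URL like this:
import java.net.URLEncoder

// Hypothetical helper: URL-encode a query and build the same style of Scholar search URL
fun scholarSearchUrl(query: String): String {
    val encoded = URLEncoder.encode(query, "UTF-8")
    return "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=$encoded&btnG="
}

val customUrl = scholarSearchUrl("attention is all you need")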
Set a User-Agent Header
Many sites block requests that are missing a valid User-Agent string in order to deter spam bots and scrapers, so we define a browser User-Agent:
val userAgent =
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"
This mimics a Chrome browser on Windows.
Send GET Request
We use Jsoup to connect to the target URL and pass the User-Agent header to avoid blocks:
val document: Document = Jsoup.connect(url).userAgent(userAgent).get()
The get() call sends the request, and Jsoup parses the response HTML into a Document object that we can query.
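Note that a plain get() throws an exception on network failures and non-200 responses. A more defensive variant might look like the sketch below; the 10-second timeout and the error handling are our own choices, not part of the original script.
// Defensive fetch: add a timeout and catch HTTP / network errors instead of crashing
val doc: Document? = try {
    Jsoup.connect(url)
        .userAgent(userAgent)
        .timeout(10_000)   // give up after 10 seconds
        .get()
} catch (e: org.jsoup.HttpStatusException) {
    println("HTTP error ${e.statusCode} while fetching ${e.url}")
    null
} catch (e: java.io.IOException) {
    println("Network error: ${e.message}")
    null
}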
Check Page Load Success
It's good practice to verify the page loaded properly before scraping:
if (document.title().contains("Google Scholar")) {
// scraping code here
} else {
println("Failed to retrieve the page.")
}
We simply check if the document title contains "Google Scholar" to confirm success.
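Checking the title is a simple heuristic. If you prefer to inspect the HTTP status code directly, Jsoup's execute() returns a Response object you can examine; the sketch below uses ignoreHttpErrors so a 4xx/5xx response is returned instead of thrown.
// Alternative check: inspect the HTTP status code before parsing
val response = Jsoup.connect(url)
    .userAgent(userAgent)
    .ignoreHttpErrors(true)   // return 4xx/5xx responses instead of throwing
    .execute()
if (response.statusCode() == 200) {
    val page: Document = response.parse()
    // scraping code here
} else {
    println("Failed to retrieve the page (HTTP ${response.statusCode()}).")
}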
Find Search Result Elements
Inspecting the page HTML, you can see that each result item is enclosed in a div tag with the class gs_ri. The key data we want (title, link, authors, and excerpt) lives inside these blocks.
We use Jsoup's selector syntax to find all of them:
curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com" <!doctype html> Enter your email below to claim your free API key:val searchResults: Elements = document.select("div.gs_ri")
Jsoup selectors work much like jQuery or CSS by using tag names, IDs, classes, attributes, and more to target elements.
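For reference, here are a few selector forms Jsoup understands, run against the same document. The id in the last line is purely illustrative and may not exist on the page.
document.select("a")             // every anchor tag
document.select("div.gs_ri")     // divs with the class gs_ri
document.select("a[href]")       // anchors that have an href attribute
document.select("h3.gs_rt > a")  // direct child a of an h3 with class gs_rt
document.selectFirst("#gs_top")  // element with a specific id (illustrative)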
Loop Through Results
With the search result elements found, we loop through each one to extract data:
for (result: Element in searchResults) {
// extract data from each result element
}
Inside the loop, we can now query within each individual result element.
Extract Title
To get the title, we select the h3 element with the class gs_rt inside the current result:
val titleElem: Element? = result.selectFirst("h3.gs_rt")
val title: String = titleElem?.text() ?: "N/A"
Extract URL
To get the linked URL within the h3 title element:
val url: String = titleElem?.selectFirst("a")?.attr("href") ?: "N/A"
Here we select the child a tag and grab its href attribute.
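Scholar title links usually point straight to the publisher with an absolute URL, but if you ever need to resolve a relative href against the page, Jsoup's absUrl can do that. This line is a sketch, not part of the original script:
// absUrl resolves the href against the document's base URI, yielding an absolute URL
val absoluteUrl: String = titleElem?.selectFirst("a")?.absUrl("href") ?: "N/A"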
Extract Authors & Details
For the author names and other metadata shown below the title:
val authorsElem: Element? = result.selectFirst("div.gs_a")
val authors: String = authorsElem?.text() ?: "N/A"
We grab the div with the class gs_a, which holds the author, venue, and year line.
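The gs_a line typically reads something like "Authors - Venue, Year - Domain". If you want the pieces separately, a simple (and admittedly fragile) heuristic is to split on the dash separator; this is our own assumption, not something the page guarantees.
// Heuristic split of the author line; breaks on entries that use a different layout
val parts = authors.split(" - ")
val authorNames = parts.getOrNull(0) ?: "N/A"
val venueAndYear = parts.getOrNull(1) ?: "N/A"
println("Authors only: $authorNames | Venue/year: $venueAndYear")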
Extract the Abstract
Finally, to obtain the paper's abstract or excerpt text from its containing element:
val abstractElem: Element? = result.selectFirst("div.gs_rs")
val abstract: String = abstractElem?.text() ?: "N/A"
Print Scraped Information
As the last step inside the loop, we print out the scraped info neatly:
println("Title: $title")
println("URL: $url")
println("Authors: $authors")
println("Abstract: $abstract")
println("-".repeat(50)) // Separator lines between results
Summary
That covers the key steps to scrape a Google Scholar search page with Kotlin and Jsoup: fetch the results page with a browser User-Agent, select the div.gs_ri blocks, and pull the title, URL, authors, and abstract out of each one.
Next we'll cover the basics of getting Jsoup installed and set up.
Installation
To use Jsoup, you need a JVM project (Java 8 or newer) with the Jsoup library added as a dependency. Here is a sample Gradle configuration:
dependencies {
implementation 'org.jsoup:jsoup:1.14.3'
}
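If your project uses the Gradle Kotlin DSL (build.gradle.kts) instead of the Groovy script above, the equivalent dependency declaration is:
dependencies {
    implementation("org.jsoup:jsoup:1.14.3")
}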
Now you can import Jsoup and start loading web pages.
Full Code Example
Here again is the full Google Scholar scraping script covered in this guide:
import org.jsoup.Jsoup
import org.jsoup.nodes.Document
import org.jsoup.nodes.Element
import org.jsoup.select.Elements
fun main() {
// Define the URL of the Google Scholar search page
val url = "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=transformers&btnG="
// Define a User-Agent header
val userAgent =
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"
// Send a GET request to the URL with the User-Agent header
val document: Document = Jsoup.connect(url).userAgent(userAgent).get()
// Check if the request was successful (status code 200)
if (document.title().contains("Google Scholar")) {
// Find all the search result blocks with class "gs_ri"
val searchResults: Elements = document.select("div.gs_ri")
// Loop through each search result block and extract information
for (result: Element in searchResults) {
// Extract the title and URL
val titleElem: Element? = result.selectFirst("h3.gs_rt")
val title: String = titleElem?.text() ?: "N/A"
val url: String = titleElem?.selectFirst("a")?.attr("href") ?: "N/A"
// Extract the authors and publication details
val authorsElem: Element? = result.selectFirst("div.gs_a")
val authors: String = authorsElem?.text() ?: "N/A"
// Extract the abstract or description
val abstractElem: Element? = result.selectFirst("div.gs_rs")
val abstract: String = abstractElem?.text() ?: "N/A"
// Print the extracted information
println("Title: $title")
println("URL: $url")
println("Authors: $authors")
println("Abstract: $abstract")
println("-".repeat(50)) // Separating search results
}
} else {
println("Failed to retrieve the page.")
}
}
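If you extend the script to run several searches back to back, it is wise to pause between requests. The sketch below shows one way to do that; the query terms and the 3-second delay are arbitrary examples, not values from the original script.
import org.jsoup.Jsoup
import java.net.URLEncoder

fun main() {
    val userAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"
    // Example query terms; adjust to taste
    val queries = listOf("transformers", "graph neural networks")
    for (q in queries) {
        val u = "https://scholar.google.com/scholar?hl=en&q=" + URLEncoder.encode(q, "UTF-8")
        val doc = Jsoup.connect(u).userAgent(userAgent).get()
        println("${doc.select("div.gs_ri").size} results for \"$q\"")
        Thread.sleep(3_000) // pause 3 seconds between requests to stay polite
    }
}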
curl "<http://api.proxiesapi.com/?key=API_KEY&render=true&url=https://example.com>"