Web scraping is a technique for automatically extracting information from websites. In this comprehensive tutorial, we'll walk through an example Java program that scrapes search results data from Google Scholar.
This is the Google Scholar result page we are talking about…
Specifically, we'll learn how to use the popular Jsoup Java library to connect to Google Scholar, send search queries, and scrape key bits of data - title, URL, authors, and abstract text - from the search results pages.
Prerequisites
To follow along with the code examples below, you'll need:
- A recent Java Development Kit (JDK) installed
- The Jsoup library (the org.jsoup:jsoup artifact) added to your project via Maven, Gradle, or a downloaded JAR
That's it! Jsoup handles most of the heavy lifting, so we can focus on the fun data extraction parts.
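Before writing any scraping code, it can be worth confirming that Jsoup is actually on your classpath. A throwaway sanity-check sketch (nothing Scholar-specific here):

public class JsoupCheck {
    public static void main(String[] args) throws ClassNotFoundException {
        // Throws ClassNotFoundException if the Jsoup JAR is missing from the classpath
        Class.forName("org.jsoup.Jsoup");
        System.out.println("Jsoup is on the classpath - ready to scrape");
    }
}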
Walkthrough of the Web Scraper Code
Let's break it down section by section.
Imports
We import Jsoup classes that allow connecting to web pages and selecting elements:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
Define URL and User-Agent
Next we define the Google Scholar URL we want to scrape along with a common User-Agent header:
// Define the URL of the Google Scholar search page
String url = "<https://scholar.google.com/scholar?hl=en&as\\_sdt=0%2C5&q=transformers&btnG=>";
// Define a User-Agent header
String userAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36";
Quick web scraping tip - impersonating a real browser's User-Agent helps avoid bot detection.
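The next section builds the basic connect call. As a preview, you can go a step beyond the User-Agent and send a few more browser-like headers with Jsoup's Connection API. A minimal sketch; the specific header values and the 10-second timeout are illustrative choices, not requirements of Google Scholar:

// Variant of the connect call with extra browser-like headers (values are examples)
Document document = Jsoup.connect(url)
        .userAgent(userAgent)
        .header("Accept-Language", "en-US,en;q=0.9") // typical English-locale browser header
        .referrer("https://www.google.com/")         // plausible navigation source
        .timeout(10_000)                             // fail fast after 10 seconds
        .get();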
Connect to URL and Select Elements
Inspecting the code

Inspecting the results page with your browser's developer tools, you can see that each search result item is enclosed in a div with the class gs_ri.

The magic happens in this section, where we connect to the URL and select those elements. Let's break this down: Jsoup's connect() method sends a GET request to the URL, with our User-Agent header attached, and downloads the page. This HTML is stored in a Document object. The select() method then finds every element matching the CSS selector div.gs_ri, and all matching elements get stored in an Elements collection:

// Send a GET request to the URL with the User-Agent header
Document document = Jsoup.connect(url).userAgent(userAgent).get();
// Find all the search result blocks with class "gs_ri"
Elements searchResults = document.select("div.gs_ri");
Pro tip: use your browser's developer tools to inspect elements and test selectors before writing any code.
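You can also test selectors without hitting the live site, since Jsoup can parse an HTML string directly. A minimal sketch; the HTML snippet here is made up, just mimicking the gs_ri structure we saw in the developer tools:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SelectorPlayground {
    public static void main(String[] args) {
        // Hypothetical HTML mimicking one Google Scholar result block
        String html = "<div class=\"gs_ri\">"
                + "<h3 class=\"gs_rt\"><a href=\"https://example.com/paper\">Attention is all you need</a></h3>"
                + "<div class=\"gs_a\">A Vaswani et al.</div>"
                + "</div>";

        Document doc = Jsoup.parse(html);

        // Try out the same selectors the scraper uses
        System.out.println(doc.select("h3.gs_rt a").text());       // the title text
        System.out.println(doc.select("h3.gs_rt a").attr("href")); // the link URL
        System.out.println(doc.select("div.gs_a").text());         // the authors line
    }
}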
Extract Data from Search Results

With the search result elements selected, we can traverse each one and extract the inner text and attributes:
// Loop through each search result block and extract information
for (Element result : searchResults) {
// Extract the title and URL
Element titleElement = result.selectFirst("h3.gs_rt");
String title = titleElement != null ? titleElement.text() : "N/A";
// Guard against title blocks that contain no link (some results may lack one)
String resultUrl = (titleElement != null && titleElement.selectFirst("a") != null) ? titleElement.selectFirst("a").attr("href") : "N/A";
// Extract the authors and publication details
Element authorsElement = result.selectFirst("div.gs_a");
String authors = authorsElement != null ? authorsElement.text() : "N/A";
// Extract the abstract or description
Element abstractElement = result.selectFirst("div.gs_rs");
String abstractText = abstractElement != null ? abstractElement.text() : "N/A";
// Print the extracted information
System.out.println("Title: " + title);
System.out.println("URL: " + resultUrl);
System.out.println("Authors: " + authors);
System.out.println("Abstract: " + abstractText);
System.out.println("-".repeat(50)); // Separating search results
}

We loop through each previously selected div.gs_ri Element and use selectFirst() to pull out the title, URL, authors, and abstract, falling back to "N/A" whenever a piece is missing. The scraped pieces of data are printed, with each search result separated by a line of dashes.

And that's it! The full code connects to Google Scholar, scrapes the results, and extracts key pieces of data from each one. Let's quickly summarize the key concepts: connect to the page with a browser-like User-Agent, select the result blocks with a CSS selector, and extract text and attributes from each match. This core scraper recipe can be adapted to pull data from almost any site.
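Printing is fine for a demo, but in a real scraper you'll usually want the data in a structure you can sort, filter, or export. A minimal sketch, assuming Java 16+ for records; the SearchResult record and the results list are our own additions, not part of Jsoup, and the snippet needs java.util.List and java.util.ArrayList imports:

// Hypothetical holder type for one scraped result (our own class, not a Jsoup API)
record SearchResult(String title, String url, String authors, String abstractText) {}

// Inside main, replacing the println calls:
List<SearchResult> results = new ArrayList<>();
for (Element result : searchResults) {
    Element titleElement = result.selectFirst("h3.gs_rt");
    Element linkElement = titleElement != null ? titleElement.selectFirst("a") : null;
    Element authorsElement = result.selectFirst("div.gs_a");
    Element abstractElement = result.selectFirst("div.gs_rs");

    // Collect each result instead of printing it
    results.add(new SearchResult(
            titleElement != null ? titleElement.text() : "N/A",
            linkElement != null ? linkElement.attr("href") : "N/A",
            authorsElement != null ? authorsElement.text() : "N/A",
            abstractElement != null ? abstractElement.text() : "N/A"));
}
System.out.println("Scraped " + results.size() + " results");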
Full Java Code for Scraping Google Scholar

Here is the complete code example for scraping search results data from Google Scholar:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.IOException;
public class GoogleScholarScraper {
public static void main(String[] args) {
// Define the URL of the Google Scholar search page
String url = "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=transformers&btnG=";
// Define a User-Agent header
String userAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36";
try {
// Send a GET request to the URL with the User-Agent header
Document document = Jsoup.connect(url).userAgent(userAgent).get();
// Find all the search result blocks with class "gs_ri"
Elements searchResults = document.select("div.gs_ri");
// Loop through each search result block and extract information
for (Element result : searchResults) {
// Extract the title and URL
Element titleElement = result.selectFirst("h3.gs_rt");
String title = titleElement != null ? titleElement.text() : "N/A";
// Guard against title blocks that contain no link (some results may lack one)
String resultUrl = (titleElement != null && titleElement.selectFirst("a") != null) ? titleElement.selectFirst("a").attr("href") : "N/A";
// Extract the authors and publication details
Element authorsElement = result.selectFirst("div.gs_a");
String authors = authorsElement != null ? authorsElement.text() : "N/A";
// Extract the abstract or description
Element abstractElement = result.selectFirst("div.gs_rs");
String abstractText = abstractElement != null ? abstractElement.text() : "N/A";
// Print the extracted information
System.out.println("Title: " + title);
System.out.println("URL: " + resultUrl);
System.out.println("Authors: " + authors);
System.out.println("Abstract: " + abstractText);
System.out.println("-".repeat(50)); // Separating search results
}
} catch (IOException e) {
System.err.println("Failed to retrieve the page. Error: " + e.getMessage());
}
}
}
curl "<http://api.proxiesapi.com/?key=API_KEY&render=true&url=https://example.com>"