Google Scholar is an excellent source of academic papers and research. In this article, we'll go through code to scrape Google Scholar search results using Go. The code searches for "transformers", then extracts key data like title, URL, authors, and abstract for each search result.
This is the kind of Google Scholar results page we will be scraping.
We'll dive into how the code works, explaining each step clearly for beginners.
Imports
Let's first look at the imports:
import (
    "fmt"
    "log"
    "net/http"
    "strings"

    "github.com/PuerkitoBio/goquery"
)
These provide the key functionality we need:
- fmt prints the extracted data to the console.
- log reports fatal errors if the request or parsing fails.
- net/http makes the HTTP request to Google Scholar.
- strings builds the separator line printed between results.
- github.com/PuerkitoBio/goquery parses the HTML and lets us query it with CSS selectors.
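Note that goquery is the only third-party dependency here; everything else is in the standard library. Assuming you are working inside a Go module, you can typically add it with:
go get github.com/PuerkitoBio/goquery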
Main Function
The main function contains the whole scraping flow. Here is a simplified outline of it (error handling is trimmed for readability; the full version appears at the end of the article):
func main() {
    // Define Google Scholar URL
    url := "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=transformers&btnG="

    // Set User-Agent header
    headers := map[string]string{
        "User-Agent": "Mozilla/5.0...",
    }

    // Build the GET request and attach the headers
    client := &http.Client{}
    req, _ := http.NewRequest("GET", url, nil)
    for key, value := range headers {
        req.Header.Add(key, value)
    }

    // Make GET request
    resp, _ := client.Do(req)
    defer resp.Body.Close()

    // Check if status code is 200 OK
    if resp.StatusCode == 200 {
        // Parse HTML using goquery
        doc, _ := goquery.NewDocumentFromReader(resp.Body)

        // Find search results
        doc.Find(".gs_ri").Each(func(i int, s *goquery.Selection) {
            // Extract data from each search result
            // And print
        })
    }
}
Let's break this down:
- Define URL: We define the Google Scholar URL to search for "transformers" (see the sketch after this list for building the URL from any query).
- Set User-Agent header: We set a browser User-Agent header to mimic a real user.
- Make GET request: We use net/http to make the GET request to the URL.
- Check status code: We check if the status code in the response is 200 OK.
- Parse HTML with goquery: If status is OK, we parse the HTML content using goquery.
- Find search results: We use a CSS selector to find all search result blocks.
- Extract data & print: We loop through each search result block to extract and print data.
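Since the search term is baked into the query string, a natural tweak is to build the URL from an arbitrary query instead of hard-coding it. Here is a minimal sketch using the standard library's net/url package; the buildScholarURL helper is just an illustration (not part of the original code), and the parameter names are simply the ones visible in the URL above:

package main

import (
    "fmt"
    "net/url"
)

// buildScholarURL assembles a Google Scholar search URL for an arbitrary query,
// reusing the same query parameters that appear in the hard-coded URL above.
func buildScholarURL(query string) string {
    params := url.Values{}
    params.Set("hl", "en")
    params.Set("as_sdt", "0,5")
    params.Set("q", query)
    params.Set("btnG", "")
    return "https://scholar.google.com/scholar?" + params.Encode()
}

func main() {
    fmt.Println(buildScholarURL("transformers"))
    fmt.Println(buildScholarURL("graph neural networks"))
}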
Next we'll dive deep into extracting data from each search result.
Scraping Each Search Result
Inspecting the Code
You can see that each result is enclosed in a div with the class gs_ri. The key part is using goquery to find these search result elements on the page and extract data from each one:
doc.Find(".gs_ri").Each(func(i int, s *goquery.Selection) {
    // Extract title, URL, authors, abstract for each result
})
Let's understand this:
CSS Selectors
We use the Find method with a CSS selector. A CSS selector is like an address that uniquely identifies elements on a web page. Here we use the selector .gs_ri, which matches every search result block on the page.
Looping Through Results
The Each method runs a callback for every element matched by the selector. Inside the callback we get an index i and a pointer s to a goquery.Selection representing the current result block, which we can then query further.
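To make the selector-and-loop pattern concrete, here is a small standalone sketch (the HTML, titles, and URLs are made up for illustration; only the class names match what the real scraper uses). It also shows the second return value of Attr, which reports whether the attribute was actually present:

package main

import (
    "fmt"
    "log"
    "strings"

    "github.com/PuerkitoBio/goquery"
)

func main() {
    // Made-up markup that mimics only the class names the scraper relies on
    html := `
<div class="gs_ri"><h3 class="gs_rt"><a href="https://example.com/paper-a">Paper A</a></h3></div>
<div class="gs_ri"><h3 class="gs_rt">Paper B (no link)</h3></div>`

    doc, err := goquery.NewDocumentFromReader(strings.NewReader(html))
    if err != nil {
        log.Fatal(err)
    }

    // .gs_ri matches both blocks; Each visits them in document order
    doc.Find(".gs_ri").Each(func(i int, s *goquery.Selection) {
        title := s.Find("h3.gs_rt").Text()
        href, ok := s.Find("h3.gs_rt a").Attr("href")
        if !ok {
            href = "(no link)"
        }
        fmt.Printf("result %d: %s -> %s\n", i, title, href)
    })
}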
Extracting Title & URL
Let's see how the title and URL are extracted:
titleElem := s.Find("h3.gs_rt")
title := titleElem.Text()
url, _ := titleElem.Find("a").Attr("href")
We first find the h3 element with the class gs_rt, which wraps the title. We use .Text() to get the title text. For the URL, we dig into the a tag inside it and read its href attribute.
So it selects:
<h3 class="gs_rt">
<a href="http://url-to-paper.com">
This is the paper title
</a>
</h3>
And extracts the title text and URL separately.
Authors & Abstract
Similarly, the authors and publication details are extracted using the div with class gs_a:
authorsElem := s.Find("div.gs_a")
authors := authorsElem.Text()
And the abstract using the div with class gs_rs:
abstractElem := s.Find("div.gs_rs")
abstract := abstractElem.Text()
So goquery makes it easy to drill down and extract any data from these elements.
Printing Output
Finally, we print out all the extracted info - title, URL, authors, and abstract - with fmt.Println, followed by a line of dashes to separate the results (you can see the exact calls in the full code below). This gives us nicely structured data for each search result. The process then repeats for every result on the page as the loop moves through each .gs_ri block.
Full Code
For easy reference, here is the complete code:
package main

import (
    "fmt"
    "log"
    "net/http"
    "strings"

    "github.com/PuerkitoBio/goquery"
)

func main() {
    // Define the URL of the Google Scholar search page
    url := "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=transformers&btnG="

    // Define a User-Agent header
    headers := map[string]string{
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36",
    }

    // Build a GET request to the URL and attach the User-Agent header
    client := &http.Client{}
    req, err := http.NewRequest("GET", url, nil)
    if err != nil {
        log.Fatal(err)
    }
    for key, value := range headers {
        req.Header.Add(key, value)
    }

    // Send the request
    resp, err := client.Do(req)
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    // Check if the request was successful (status code 200)
    if resp.StatusCode == 200 {
        // Parse the HTML content of the page using goquery
        doc, err := goquery.NewDocumentFromReader(resp.Body)
        if err != nil {
            log.Fatal(err)
        }

        // Find all the search result blocks with class "gs_ri"
        doc.Find(".gs_ri").Each(func(i int, s *goquery.Selection) {
            // Extract the title and URL
            titleElem := s.Find("h3.gs_rt")
            title := titleElem.Text()
            url, _ := titleElem.Find("a").Attr("href")

            // Extract the authors and publication details
            authorsElem := s.Find("div.gs_a")
            authors := authorsElem.Text()

            // Extract the abstract or description
            abstractElem := s.Find("div.gs_rs")
            abstract := abstractElem.Text()

            // Print the extracted information
            fmt.Println("Title:", title)
            fmt.Println("URL:", url)
            fmt.Println("Authors:", authors)
            fmt.Println("Abstract:", abstract)
            fmt.Println(strings.Repeat("-", 50)) // Separating search results
        })
    } else {
        fmt.Println("Failed to retrieve the page. Status code:", resp.StatusCode)
    }
}
curl "<http://api.proxiesapi.com/?key=API_KEY&render=true&url=https://example.com>"