This is the Google Scholar result page we are talking about…
Installing Required Perl Modules
To run this script, you need to have Perl installed along with the LWP::UserAgent and HTML::TreeBuilder modules.
To install these:
cpan LWP::UserAgent
cpan HTML::TreeBuilder
Understanding The Code
Below we will walk through what each section of code is doing to scrape Google Scholar.
First we load the required modules:
use LWP::UserAgent;
use HTML::TreeBuilder;
Next we define the URL of the Google Scholar search results page we want to scrape:
my $url = "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=transformers&btnG=";
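If you want to search for something other than "transformers", the query term needs to be URL-encoded before it goes into the query string. Here is a small hand-rolled sketch; the scholar_url helper is our own, not part of the original script, and in practice URI::Escape's uri_escape does the encoding for you:

```perl
use strict;
use warnings;

# Illustrative helper (not from the original script): build a Scholar
# search URL for an arbitrary query by percent-encoding unsafe characters.
sub scholar_url {
    my ($query) = @_;
    (my $encoded = $query) =~ s/([^A-Za-z0-9_.~-])/sprintf("%%%02X", ord($1))/ge;
    return "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=$encoded";
}

print scholar_url("attention is all you need"), "\n";
```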
Then we create an LWP::UserAgent object whose User-Agent string identifies us to Google as a Chrome browser:
my $ua = LWP::UserAgent->new(
agent => "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"
);
We send a GET request to fetch the Google Scholar page content:
my $response = $ua->get($url);
We check that the request succeeded (is_success returns true for any 2xx status code):
if ($response->is_success) {
# Parse page content
} else {
# Request failed
}
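Google Scholar rate-limits aggressively, so a failed request is often worth retrying after a pause. Below is a minimal retry sketch; the fetch_with_retries helper is hypothetical, and $fetch stands for any code ref returning an object with an is_success method (in the real script it would wrap $ua->get($url)):

```perl
use strict;
use warnings;

# Hypothetical retry helper (not in the original script). $fetch is any
# code ref that performs the request and returns an object with an
# is_success method -- in the real script it would wrap $ua->get($url).
sub fetch_with_retries {
    my ($fetch, $max_attempts, $delay) = @_;
    for my $attempt (1 .. $max_attempts) {
        my $response = $fetch->();
        return $response if $response->is_success;
        # Back off a little longer after each failed attempt
        sleep($delay * $attempt) if $attempt < $max_attempts;
    }
    return undef;
}
```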
If successful, we parse the HTML using HTML::TreeBuilder. Using decoded_content (rather than content) lets LWP handle the character encoding for us, and calling eof tells the parser we are done feeding it input:
my $tree = HTML::TreeBuilder->new;
$tree->parse($response->decoded_content);
$tree->eof;
Extracting Search Results
Inspecting the code
You can see that each result item is enclosed in a <div> tag with the class "gs_ri".

Now we get to the key part - extracting information from the search result items. We locate all the search result blocks by their "gs_ri" class and loop through each one. Inside that loop we extract the title, URL, authors, and abstract fields: the title text and link URL come from the "gs_rt" element, the authors and publication details from the "gs_a" element, and the abstract or description from the "gs_rs" element. The snippets below show how each piece of information is pulled out of the search result HTML, followed by the full code that puts it all together.

This is great as a learning exercise, but it is easy to see that even the proxy server itself is prone to getting blocked, as it uses a single IP. In a scenario where you need to handle thousands of fetches every day, using a professional rotating proxy service to rotate IPs is almost a must. Otherwise, you tend to get IP blocked a lot by automatic location, usage, and bot detection algorithms.

Our rotating proxy server Proxies API provides a simple API that can solve all IP blocking problems instantly. Hundreds of our customers have successfully solved the headache of IP blocks with a simple API, and the whole thing can be accessed from any programming language. In fact, you don't even have to take the pain of loading Puppeteer, as we render JavaScript behind the scenes: you can just fetch the data and parse it in any language like Node or PHP, or with any framework like Scrapy or Nutch. In all these cases, you can call the API URL with render support enabled. We have a running offer of 1,000 API calls completely free. Register and get your free API Key.
Get HTML from any page with a simple API call. We handle proxy rotation, browser identities, automatic retries, CAPTCHAs, JavaScript rendering, etc automatically for you
curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"
Finding The Search Result Blocks
First we locate all the search result blocks by their "gs_ri" class:
my @search_results = $tree->look_down(_tag => 'div', class => 'gs_ri');
Then we loop through each search result:
foreach my $result (@search_results) {
# Extract info from $result
}
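The fields referenced in the sections below live in markup shaped roughly like this. This is an illustrative sketch based on the class names the script targets, not Scholar's exact HTML:

```html
<div class="gs_ri">
  <h3 class="gs_rt"><a href="#">Paper title</a></h3>
  <div class="gs_a">A Author, B Author - Venue, 2017 - source.example.com</div>
  <div class="gs_rs">Abstract snippet shown by Scholar</div>
</div>
```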
Extracting The Title And URL
my $title_elem = $result->look_down(_tag => 'h3', class => 'gs_rt');
my $title = $title_elem ? $title_elem->as_text : "N/A";
my $link_elem = $title_elem ? $title_elem->look_down(_tag => 'a') : undef; # citation-only results have no link
my $url = $link_elem ? $link_elem->attr('href') : "N/A";
Extracting The Authors
my $authors_elem = $result->look_down(_tag => 'div', class => 'gs_a');
my $authors = $authors_elem ? $authors_elem->as_text : "N/A";
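The "gs_a" text usually bundles the authors, venue/year, and source into a single string separated by " - ". If you need those pieces separately, here is a rough post-processing sketch; parse_authors_line is our own helper, and the layout assumption may not hold for every result:

```perl
use strict;
use warnings;

# Rough post-processing of the "gs_a" line, which typically looks like:
#   "A Vaswani, N Shazeer - Advances in neural information processing systems, 2017 - proceedings.neurips.cc"
# Assumes the " - " separated layout; Scholar may vary it.
sub parse_authors_line {
    my ($line) = @_;
    my ($authors, $venue, $source) = split / - /, $line;
    my ($year) = ($venue // '') =~ /(\d{4})/;
    return {
        authors => [ split /,\s*/, ($authors // '') ],
        year    => $year,
        source  => $source,
    };
}
```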
Extracting The Abstract
my $abstract_elem = $result->look_down(_tag => 'div', class => 'gs_rs');
my $abstract = $abstract_elem ? $abstract_elem->as_text : "N/A";
use strict;
use warnings;
use LWP::UserAgent;
use HTML::TreeBuilder;
# Define the URL of the Google Scholar search page
my $url = "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=transformers&btnG=";
# Define a User-Agent header
my $ua = LWP::UserAgent->new(
agent => "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"
);
# Send a GET request to the URL with the User-Agent header
my $response = $ua->get($url);
# Check if the request was successful (status code 200)
if ($response->is_success) {
# Parse the HTML content of the page using HTML::TreeBuilder
my $tree = HTML::TreeBuilder->new;
$tree->parse($response->decoded_content);
$tree->eof;
# Find all the search result blocks with class "gs_ri"
my @search_results = $tree->look_down(_tag => 'div', class => 'gs_ri');
# Loop through each search result block and extract information
foreach my $result (@search_results) {
# Extract the title and URL
my $title_elem = $result->look_down(_tag => 'h3', class => 'gs_rt');
my $title = $title_elem ? $title_elem->as_text : "N/A";
my $link_elem = $title_elem ? $title_elem->look_down(_tag => 'a') : undef; # citation-only results have no link
my $url = $link_elem ? $link_elem->attr('href') : "N/A";
# Extract the authors and publication details
my $authors_elem = $result->look_down(_tag => 'div', class => 'gs_a');
my $authors = $authors_elem ? $authors_elem->as_text : "N/A";
# Extract the abstract or description
my $abstract_elem = $result->look_down(_tag => 'div', class => 'gs_rs');
my $abstract = $abstract_elem ? $abstract_elem->as_text : "N/A";
# Print the extracted information
print("Title: $title\n");
print("URL: $url\n");
print("Authors: $authors\n");
print("Abstract: $abstract\n");
print("-" x 50 . "\n"); # Separating search results
}
} else {
print("Failed to retrieve the page. Status code: " . $response->code . "\n");
}
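Instead of printing each field, you may want machine-readable output. Here is a sketch of collecting the results as hashrefs and emitting JSON with the core JSON::PP module; this is a hypothetical extension, with placeholder values standing in for the fields extracted inside the loop:

```perl
use strict;
use warnings;
use JSON::PP;   # core module since Perl 5.14

# Hypothetical extension: collect every result as a hashref while looping,
# then serialize the whole set as JSON at the end. The placeholder values
# below stand in for the fields extracted inside the real loop.
my @results;
push @results, {
    title    => "Attention is all you need",
    url      => "https://example.com/paper",
    authors  => "A Vaswani, N Shazeer",
    abstract => "N/A",
};
print JSON::PP->new->canonical->encode(\@results), "\n";
```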
curl "http://api.proxiesapi.com/?key=API_KEY&render=true&url=https://example.com"
The API responds with the rendered HTML of the target page:
<!doctype html>
<html>
<head>
<title>Example Domain</title>
<meta charset="utf-8" />
<meta http-equiv="Content-type" content="text/html; charset=utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
...