This guide walks through a Perl script that scrapes image URLs and other data from a Wikipedia page. We will extract the names, groups, local names, and image URLs for all dog breeds listed on the page.
This is the page we are talking about: https://commons.wikimedia.org/wiki/List_of_dog_breeds
Modules Used
The script uses the following modules which may need to be installed:
use LWP::UserAgent;
use HTML::TreeBuilder;
To install these, run:
cpan LWP::UserAgent HTML::TreeBuilder
Define URL and User Agent
First we define the URL of the Wikipedia page we want to scrape:
my $url = 'https://commons.wikimedia.org/wiki/List_of_dog_breeds';
Next we create a User Agent header to mimic a browser request:
my $ua = LWP::UserAgent->new(
    agent => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
);
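Beyond the agent string, the same constructor accepts other options. The following is a minimal sketch, assuming you also want a request timeout and an Accept-Language header; the values 15 and 'en-US,en;q=0.9' are illustrative choices, not something the original script sets.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Illustrative settings (assumptions); tune them for your own use.
my $ua = LWP::UserAgent->new(
    agent   => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
    timeout => 15,    # give up on a stalled connection after 15 seconds
);

# Send this header with every request made through $ua.
$ua->default_header('Accept-Language' => 'en-US,en;q=0.9');

print "User-Agent: ", $ua->agent, "\n";
print "Timeout: ", $ua->timeout, " seconds\n";
```

No request is sent here; the snippet only configures the client object and prints the settings back.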
Send Request and Parse HTML
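We send a GET request for the URL and check if it succeeded:
my $response = $ua->get($url);
if ($response->is_success) {
    # Parse HTML (parse_content also calls eof, which finalizes the tree)
    my $tree = HTML::TreeBuilder->new;
    $tree->parse_content($response->decoded_content);
    # ... rest of code
}
If successful, we use HTML::TreeBuilder to parse the response body into a tree of elements that we can search.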
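To see the parsing step in isolation, here is a small sketch that builds a tree from an inline HTML string instead of a live response. The markup is made up for the example; new_from_content is the HTML::TreeBuilder constructor that parses a string and finishes the tree in one call.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# A tiny stand-in for a downloaded page (made up for this example).
my $html = '<html><body><h1>Dog breeds</h1><p>A short list.</p></body></html>';

# Parse the whole string and finalize the tree in one call.
my $tree = HTML::TreeBuilder->new_from_content($html);

# In scalar context look_down returns the first matching element, or undef.
my $heading      = $tree->look_down(_tag => 'h1');
my $heading_text = $heading ? $heading->as_text : '';
print "Heading: $heading_text\n";

# Free the tree once the values we need have been copied out.
$tree->delete;
```

Checking the result of look_down before calling as_text avoids a fatal error when the element is missing.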
Extract Data from the Table
Inspecting the page
Using Chrome's Inspect tool, you can see that the data is in a table element with the classes wikitable and sortable.
We find this table element:
my $table = $tree->look_down(_tag => 'table', class => 'wikitable sortable');
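When look_down is given a plain string for class, it compares the whole attribute value, so the call above only matches a class attribute that is exactly 'wikitable sortable'. If the markup ever carried an extra class, a code-reference criterion is one way to match more loosely. This is a sketch on made-up markup; has_classes is a hypothetical helper, not part of HTML::TreeBuilder.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Made-up markup: the class attribute carries an extra, unexpected class.
my $html = <<'HTML';
<html><body>
<table class="wikitable sortable extra-class">
<tr><th>Name</th></tr>
<tr><td>Example Hound</td></tr>
</table>
</body></html>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# Hypothetical helper: true if the element has every wanted class.
sub has_classes {
    my ($element, @wanted) = @_;
    my %have = map { $_ => 1 } split ' ', ($element->attr('class') // '');
    my $missing = grep { !$have{$_} } @wanted;
    return $missing == 0;
}

# Exact string comparison: no match, because of the third class.
my $exact = $tree->look_down(_tag => 'table', class => 'wikitable sortable');

# Code-reference criterion: matches as long as both classes are present.
my $loose = $tree->look_down(
    _tag => 'table',
    sub { has_classes($_[0], 'wikitable', 'sortable') },
);

print "Exact match found: ", (defined $exact ? 'yes' : 'no'), "\n";
print "Loose match found: ", (defined $loose ? 'yes' : 'no'), "\n";
```

Either way, it is worth checking that the result is defined before using it, since look_down returns undef in scalar context when nothing matches.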
We define arrays to store the scraped data fields:
my @names;
my @groups;
my @local_names;
my @photographs;
And create a folder to save images:
mkdir('dog_images') unless -d 'dog_images';
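Four parallel arrays work, but the fields of one breed are only tied together by their index. As an alternative sketch (not what the script in this guide does), each breed could be stored as one hash in a single array. The breed names and values below are made up for illustration.

```perl
use strict;
use warnings;

# One hash per breed keeps the four fields together (sample values are made up).
my @breeds;
push @breeds, {
    name       => 'Example Hound',
    group      => 'Example group',
    local_name => '',
    photograph => '//example.org/hound.jpg',
};
push @breeds, {
    name       => 'Sample Terrier',
    group      => 'Example group',
    local_name => 'Sampler',
    photograph => '',
};

# Fields are read by name rather than by a shared index.
for my $breed (@breeds) {
    print "$breed->{name} ($breed->{group})\n";
}

# Filtering becomes a one-liner, for example breeds that have a photo.
my @with_photo = grep { $_->{photograph} } @breeds;
print scalar(@with_photo), " breed(s) with a photograph\n";
```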
Understanding the Selectors
The most complex part is extracting the data within each row of the table. This is done by the selector code:
my @rows = $table->look_down(_tag => 'tr');
shift @rows; # skip header row
for my $row (@rows) {
    my @columns = $row->look_down(_tag => qr/^(td|th)$/);
    if (@columns == 4) {
        # Extract data from each column
        my $link = $columns[0]->look_down(_tag => 'a');
        my $name = $link ? $link->as_text : $columns[0]->as_text;
        my $group = $columns[1]->as_text;
        my $span_tag = $columns[2]->look_down(_tag => 'span');
        my $local_name = $span_tag ? $span_tag->as_text : '';
        my $img_tag = $columns[3]->look_down(_tag => 'img');
        my $photograph = $img_tag ? $img_tag->attr('src') : '';
        # Download images
        if ($photograph) {
            # image download code
        }
        # Store data
        push @names, $name;
        push @groups, $group;
        push @local_names, $local_name;
        push @photographs, $photograph;
    }
}
This code loops through each row, gets the columns, and extracts data from the columns:
my $link = $columns[0]->look_down(_tag => 'a');
my $name = $link ? $link->as_text : $columns[0]->as_text;
my $group = $columns[1]->as_text;
my $span_tag = $columns[2]->look_down(_tag => 'span');
my $local_name = $span_tag ? $span_tag->as_text : '';
my $img_tag = $columns[3]->look_down(_tag => 'img');
my $photograph = $img_tag ? $img_tag->attr('src') : '';
Name Column
The name is within an a tag in the first column. We take the link text, and fall back to the text of the whole cell if the cell has no link.
Group Column
The group name is directly the text content of the second column.
Local Name Column
There may be a span tag in the third column. If there is, we take its text; otherwise we store an empty string.
Image Column
We check if there is an img tag in the fourth column. If found, we extract the src attribute; otherwise we store an empty string.
The key things to understand are that look_down finds elements by tag name, as_text returns the text of an element, and attr reads an attribute such as src. This allows us to traverse the HTML structure and extract precisely the data we want.
The rest of the code downloads images and stores the scraped data into the arrays.
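To try the row extraction without touching the network, the same selectors can be run against a small inline table. This is a sketch on made-up markup that imitates the four-column layout described above; the breed names and URLs are invented.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Made-up table that imitates the four-column layout of the real page.
my $html = <<'HTML';
<table class="wikitable sortable">
<tr><th>Name</th><th>Group</th><th>Local name</th><th>Photograph</th></tr>
<tr>
  <td><a href="/wiki/Example_Hound">Example Hound</a></td>
  <td>Hounds</td>
  <td><span lang="xx">Beispielhund</span></td>
  <td><img src="//example.org/hound.jpg"></td>
</tr>
<tr>
  <td><a href="/wiki/Sample_Terrier">Sample Terrier</a></td>
  <td>Terriers</td>
  <td></td>
  <td></td>
</tr>
</table>
HTML

my $tree  = HTML::TreeBuilder->new_from_content($html);
my $table = $tree->look_down(_tag => 'table', class => 'wikitable sortable');

my @records;
my @rows = $table->look_down(_tag => 'tr');
shift @rows;    # skip the header row

for my $row (@rows) {
    my @columns = $row->look_down(_tag => qr/^(td|th)$/);
    next unless @columns == 4;

    # Each lookup may return undef, so every field has a fallback.
    my $link = $columns[0]->look_down(_tag => 'a');
    my $span = $columns[2]->look_down(_tag => 'span');
    my $img  = $columns[3]->look_down(_tag => 'img');

    push @records, {
        name       => $link ? $link->as_text : $columns[0]->as_text,
        group      => $columns[1]->as_text,
        local_name => $span ? $span->as_text : '',
        photograph => $img ? $img->attr('src') : '',
    };
}

for my $record (@records) {
    print "$record->{name} | $record->{group} | $record->{local_name} | $record->{photograph}\n";
}
```

The second row has an empty local name and no image, which exercises the empty-string fallbacks.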
Output Data
Finally, the data can be printed out:
for my $i (0..$#names) {
    print "Name: $names[$i]\n";
    print "Group: $groups[$i]\n";
    print "Local Name: $local_names[$i]\n";
    print "Image: $photographs[$i]\n";
    print "\n";
}
Full Code
In summary, this script sends a request to the page, parses the HTML, finds the breeds table, extracts the four fields from each row, downloads each image, and prints the results. Here is the complete runnable script:
use strict;
use warnings;
use LWP::UserAgent;
use HTML::TreeBuilder;
# URL of the Wikipedia page
my $url = 'https://commons.wikimedia.org/wiki/List_of_dog_breeds';
# Define a user-agent header to simulate a browser request
my $ua = LWP::UserAgent->new(
    agent => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
);
# Send an HTTP GET request to the URL with the headers
my $response = $ua->get($url);
# Check if the request was successful (any 2xx status code)
if ($response->is_success) {
    # Parse the decoded HTML content of the page
    # (parse_content also calls eof, which finalizes the tree)
    my $tree = HTML::TreeBuilder->new;
    $tree->parse_content($response->decoded_content);
    # Find the table with class 'wikitable sortable'
    my $table = $tree->look_down(_tag => 'table', class => 'wikitable sortable')
        or die "Could not find the breeds table on the page";
    # Initialize arrays to store the data
    my @names;
    my @groups;
    my @local_names;
    my @photographs;
    # Create a directory to save the images
    mkdir('dog_images') unless -d 'dog_images';
    # Iterate through rows in the table (skip the header row)
    my @rows = $table->look_down(_tag => 'tr');
    shift @rows; # Skip the header row
    for my $row (@rows) {
        my @columns = $row->look_down(_tag => qr/^(td|th)$/);
        if (@columns == 4) {
            # Extract data from each column; fall back to the cell text if there is no link
            my $link = $columns[0]->look_down(_tag => 'a');
            my $name = $link ? $link->as_text : $columns[0]->as_text;
            my $group = $columns[1]->as_text;
            # Check if the third column contains a span element
            my $span_tag = $columns[2]->look_down(_tag => 'span');
            my $local_name = $span_tag ? $span_tag->as_text : '';
            # Check for the existence of an image tag within the fourth column
            my $img_tag = $columns[3]->look_down(_tag => 'img');
            my $photograph = $img_tag ? $img_tag->attr('src') : '';
            # Download the image and save it to the folder
            if ($photograph) {
                # Image URLs may be protocol-relative ("//host/path"), so add a scheme
                my $image_url = $photograph =~ m{^//} ? "https:$photograph" : $photograph;
                # Replace characters that are awkward in file names
                (my $safe_name = $name) =~ s/[^A-Za-z0-9_-]+/_/g;
                my $image_filename = "dog_images/$safe_name.jpg";
                my $img_response = $ua->get($image_url);
                if ($img_response->is_success) {
                    open(my $img_file, '>:raw', $image_filename) or die "Cannot open $image_filename: $!";
                    print $img_file $img_response->content;
                    close($img_file);
                }
            }
            # Push data into respective arrays
            push @names, $name;
            push @groups, $group;
            push @local_names, $local_name;
            push @photographs, $photograph;
        }
    }
    # Print or process the extracted data as needed
    for my $i (0..$#names) {
        print "Name: $names[$i]\n";
        print "FCI Group: $groups[$i]\n";
        print "Local Name: $local_names[$i]\n";
        print "Photograph: $photographs[$i]\n";
        print "\n";
    }
}
else {
    die "Failed to retrieve the web page. Status code: " . $response->code;
}
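The image step depends on two details that are easy to get wrong: src values on Wikimedia pages are typically protocol-relative (they start with //), and breed names can contain spaces, slashes, or dots that are awkward in file names. The following is a minimal sketch of two hypothetical helpers, absolute_image_url and image_filename, that isolate those concerns; the list of recognized extensions and the jpg fallback are assumptions.

```perl
use strict;
use warnings;

# Hypothetical helpers for illustration.

# Turn a protocol-relative src ("//host/path") into an absolute https URL.
sub absolute_image_url {
    my ($src) = @_;
    return $src =~ m{^//} ? "https:$src" : $src;
}

# Build a file name from the breed name, reusing the extension of the URL.
sub image_filename {
    my ($name, $url) = @_;
    (my $safe = $name) =~ s/[^A-Za-z0-9_-]+/_/g;    # collapse risky characters
    my ($ext) = $url =~ /\.(jpe?g|png|gif|svg|webp)(?:\?.*)?$/i;
    $ext = lc($ext // 'jpg');                       # assumed fallback extension
    return "dog_images/$safe.$ext";
}

my $url  = absolute_image_url('//example.org/thumb/120px-Dog.JPG');
my $file = image_filename('St. Bernard / Alpine', $url);
print "$url\n$file\n";
```

Keeping the logic in small functions makes it possible to check the naming rules without downloading anything.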
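In more advanced implementations you will even need to rotate the User-Agent string so the website can't tell it's the same browser.
If we get a little more advanced, you will realize that the server can simply block your IP, ignoring all your other tricks. This is a bummer, and this is where most web crawling projects fail.
Overcoming IP Blocks
Investing in a private rotating proxy service like Proxies API can, most of the time, make the difference between a successful and headache-free web scraping project that gets the job done consistently and one that never really works.
Our rotating proxy server Proxies API provides a simple API that can solve IP blocking problems. We handle proxy rotation, browser identities, automatic retries, CAPTCHAs, JavaScript rendering, etc. automatically for you. Hundreds of our customers have successfully solved the headache of IP blocks with a simple API. It only takes one line of integration, so it is hardly disruptive, and with the running offer of 1000 free API calls you have almost nothing to lose by using our rotating proxy and comparing notes.
The whole thing can be accessed by a simple API call like the one below, in any programming language:
curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"
<!doctype html>
<html>
<head>
<title>Example Domain</title>
<meta charset="utf-8" />
<meta http-equiv="Content-type" content="text/html; charset=utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
...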
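The User-Agent rotation mentioned above can be sketched with LWP::UserAgent alone. This is a minimal illustration, not a recommended production setup: the list of strings and the rotate_user_agent helper are assumptions, and the actual request is left out so the snippet runs without network access.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Example User-Agent strings (an assumed list; substitute your own).
my @user_agents = (
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Safari/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0',
);

my $ua = LWP::UserAgent->new(timeout => 15);

# Hypothetical helper: pick a random identity and apply it to the client.
sub rotate_user_agent {
    my ($client) = @_;
    my $choice = $user_agents[ int rand @user_agents ];
    $client->agent($choice);
    return $choice;
}

# Call this before each request so consecutive requests may differ.
my $current = rotate_user_agent($ua);
print "Next request will identify as: $current\n";
# my $response = $ua->get($url);    # network call left out of the sketch
```

Rotating the header changes only how the client identifies itself; it does nothing about IP-based blocking, which is the problem the proxy discussion above addresses.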