Scraping All Images from a Website with Perl

This guide walks through a Perl script that scrapes image URLs and other data from a Wikimedia Commons page. We will extract the names, groups, local names, and image URLs for all dog breeds listed on the page.

This is the page we are talking about: https://commons.wikimedia.org/wiki/List_of_dog_breeds

Modules Used

The script uses the following modules which may need to be installed:

use LWP::UserAgent;
use HTML::TreeBuilder;

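To install these, run the command below. LWP::Protocol::https is included because LWP needs it to fetch https:// URLs:

cpan LWP::UserAgent LWP::Protocol::https HTML::TreeBuilder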

Define URL and User Agent

First we define the URL of the Wikimedia Commons page we want to scrape:

my $url = 'https://commons.wikimedia.org/wiki/List_of_dog_breeds';

Next we create a User Agent header to mimic a browser request:

my $ua = LWP::UserAgent->new(
  agent => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
);

Send Request and Parse HTML

We send a GET request for the URL and check if it succeeded:

my $response = $ua->get($url);

if ($response->is_success) {

  # Parse HTML (parse_content parses the whole string and signals end of input)
  my $tree = HTML::TreeBuilder->new;
  $tree->parse_content($response->decoded_content);

  # ... rest of code
}

If successful, we use HTML::TreeBuilder to parse the HTML content into an object structure we can traverse.

Extract Data from the Table

Inspecting the page

Using Chrome's Inspect tool, you can see that the data is in a table element with the classes wikitable and sortable.

We find this table element:

my $table = $tree->look_down(_tag => 'table', class => 'wikitable sortable');
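
One caveat: a string value passed to look_down must equal the whole class attribute, so the lookup returns nothing if the table ever carries an extra class. A minimal sketch of a more tolerant lookup, using a regex criterion and a made-up HTML fragment (the extra class name is an assumption for illustration):

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Made-up sample markup: the class attribute carries an extra class name
my $html = '<table class="wikitable sortable extra"><tr><th>Breed</th></tr></table>';
my $tree = HTML::TreeBuilder->new_from_content($html);

# A string value must equal the whole attribute, so this finds nothing here
my $exact = $tree->look_down(_tag => 'table', class => 'wikitable sortable');

# A regex value only has to match somewhere inside the attribute
my $table = $tree->look_down(_tag => 'table', class => qr/\bwikitable\b/);

# Stop early with a clear message instead of failing later on an undefined value
die "No wikitable found; the page layout may have changed\n" unless $table;

print "exact match found: ", ($exact ? 'yes' : 'no'), "\n";
print "regex match class: ", $table->attr('class'), "\n";
```

Checking that the table was found before using it avoids a "Can't call method on an undefined value" error further down.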

We define arrays to store the scraped data fields:

my @names;
my @groups;
my @local_names;
my @photographs;

And create a folder to save images:

mkdir('dog_images') unless -d 'dog_images';

Understanding the Selectors

The most complex part is extracting the data within each row of the table. This is done by the selector code:

my @rows = $table->look_down(_tag => 'tr');
shift @rows; # skip header row

for my $row (@rows) {

  my @columns = $row->look_down(_tag => qr/^(td|th)$/);

  if (@columns == 4) {

    # Extract data from each column
    my $name = $columns[0]->look_down(_tag => 'a')->as_text;
    my $group = $columns[1]->as_text;

    my $span_tag = $columns[2]->look_down(_tag => 'span');
    my $local_name = $span_tag ? $span_tag->as_text : '';

    my $img_tag = $columns[3]->look_down(_tag => 'img');
    my $photograph = $img_tag ? $img_tag->attr('src') : '';

    # Download images
    if ($photograph) {
      # image download code (shown in the full script below)
    }

    # Store data
    push @names, $name;
    push @groups, $group;
    push @local_names, $local_name;
    push @photographs, $photograph;

  }
}

This code loops through each row, gets the columns, and extracts data from the columns:

Name Column

The name is within an <a> tag inside the first column:

my $name = $columns[0]->look_down(_tag => 'a')->as_text;

Group Column

The group name is directly the text content of the second column:

my $group = $columns[1]->as_text;

Local Name Column

There may be a <span> tag with the local name in the third column. We check if this exists:

my $span_tag = $columns[2]->look_down(_tag => 'span');
my $local_name = $span_tag ? $span_tag->as_text : '';

Image Column

We check if there is an <img> tag inside the fourth column:

my $img_tag = $columns[3]->look_down(_tag => 'img');
my $photograph = $img_tag ? $img_tag->attr('src') : '';

If found, we extract the src attribute which contains the image URL.
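
The src value is not always a full URL: pages often use protocol-relative (//host/path) or site-relative (/path) values, while LWP needs an absolute URL to fetch. A small sketch of resolving src against the page URL with the URI module, assuming made-up src values; the helper name absolute_url is hypothetical:

```perl
use strict;
use warnings;
use URI;

my $page_url = 'https://commons.wikimedia.org/wiki/List_of_dog_breeds';

# Hypothetical helper: resolve an img src against the URL of the page it came from
sub absolute_url {
    my ($src, $base) = @_;
    return URI->new_abs($src, $base)->as_string;
}

# Made-up src values covering the three common shapes
for my $src ('//upload.example.org/a.jpg', '/images/b.jpg', 'https://example.org/c.jpg') {
    print absolute_url($src, $page_url), "\n";
}
```

Values that are already absolute pass through unchanged, so the helper can be applied to every src without checking its shape first.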

The key things to understand are:

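look_down() searches an element and its descendants for elements matching the given criteria; it returns every match in list context and only the first match in scalar context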

as_text returns the text content of an element

attr() gets the attribute value from a tag

This allows us to traverse the HTML structure and extract precisely the data we want.
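
To try these three methods in isolation, here is a small sketch that runs them against a made-up one-row table shaped like the breeds table. The breed, group, and file names are invented for illustration:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Made-up table with a header row and one data row
my $html = <<'HTML';
<table class="wikitable sortable">
<tr><th>Breed</th><th>Group</th><th>Local name</th><th>Photo</th></tr>
<tr><td><a href="/wiki/Example">Example Hound</a></td><td>Hounds</td>
<td><span lang="de">Beispielhund</span></td><td><img src="/img/example.jpg"></td></tr>
</table>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# List context: look_down returns every matching element
my @rows  = $tree->look_down(_tag => 'tr');
my @cells = $rows[1]->look_down(_tag => 'td');

# Scalar context: look_down returns the first match, or undef if there is none
my $name  = $cells[0]->look_down(_tag => 'a')->as_text;
my $span  = $cells[2]->look_down(_tag => 'span');
my $local = $span ? $span->as_text : '';
my $src   = $cells[3]->look_down(_tag => 'img')->attr('src');

print scalar(@rows), " rows, name=$name, local=$local, src=$src\n";
```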

The rest of the code downloads images and stores the scraped data into the arrays.
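
Because the breed name becomes part of the file path (dog_images/$name.jpg in the full script), names containing slashes or other special characters can produce invalid paths. A minimal sketch of a sanitizing step; the helper name safe_filename and the sample names are hypothetical:

```perl
use strict;
use warnings;

# Hypothetical helper: turn a breed name into a file name that is safe on most file systems
sub safe_filename {
    my ($name, $ext) = @_;
    (my $safe = $name) =~ s/[^A-Za-z0-9._-]+/_/g;  # collapse each run of unusual characters to "_"
    $safe =~ s/^_+|_+$//g;                          # trim leading and trailing underscores
    $safe = 'unnamed' if $safe eq '';               # fall back when nothing usable is left
    return "$safe.$ext";
}

print safe_filename('St. Bernard', 'jpg'), "\n";
print safe_filename('Poodle / Caniche', 'jpg'), "\n";
```

The result can be used in place of the raw name when building the path, for example "dog_images/" . safe_filename($name, 'jpg').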

Output Data

Finally, the data can be printed out:

for my $i (0..$#names) {

  print "Name: $names[$i]\n";
  print "Group: $groups[$i]\n";
  print "Local Name: $local_names[$i]\n";
  print "Image: $photographs[$i]\n";

  print "\n";
}
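
The four parallel arrays work, but they must be kept in step by index. As an alternative sketch (not what the script above does), each row can be stored as one hash so its fields travel together. The records and the helper name format_breed are made up for illustration:

```perl
use strict;
use warnings;

# Made-up records standing in for the scraped rows
my @breeds = (
    { name => 'Example Hound',  group => 'Hounds',   local_name => 'Beispielhund', image => '/img/a.jpg' },
    { name => 'Sample Terrier', group => 'Terriers', local_name => '',             image => '' },
);

# Hypothetical helper: one tab-separated line per record
sub format_breed {
    my ($breed) = @_;
    return join("\t", map { $breed->{$_} } qw(name group local_name image));
}

print format_breed($_), "\n" for @breeds;
```

Inside the scraping loop, the four push statements would then become a single push of a hash reference.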

So in summary, this script:

Fetches the web page HTML
Parses it into a traversable structure
Uses selectors to extract specific data
Downloads images
Stores and prints the scraped data

Full Code

Here is the complete runnable script:

use strict;
use warnings;
use LWP::UserAgent;
use HTML::TreeBuilder;

# URL of the Wikimedia Commons page
my $url = 'https://commons.wikimedia.org/wiki/List_of_dog_breeds';

# Define a user-agent header to simulate a browser request
my $ua = LWP::UserAgent->new(
    agent => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
);

# Send an HTTP GET request to the URL with the headers
my $response = $ua->get($url);

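# Check if the request was successful (any 2xx status code)
if ($response->is_success) {
    # Decode the body to characters using the charset the server declared
    my $content = $response->decoded_content;

    # Print scraped text as UTF-8 so non-ASCII local names display correctly
    binmode(STDOUT, ':encoding(UTF-8)');

    # Parse the HTML content of the page (parse_content also signals end of input)
    my $tree = HTML::TreeBuilder->new;
    $tree->parse_content($content);

    # Find the table with class 'wikitable sortable'
    my $table = $tree->look_down(_tag => 'table', class => 'wikitable sortable')
        or die "Could not find the breeds table; the page layout may have changed\n";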

    # Initialize arrays to store the data
    my @names;
    my @groups;
    my @local_names;
    my @photographs;

    # Create a directory to save the images
    mkdir('dog_images') unless -d 'dog_images';

    # Iterate through rows in the table (skip the header row)
    my @rows = $table->look_down(_tag => 'tr');
    shift @rows;  # Skip the header row

    for my $row (@rows) {
        my @columns = $row->look_down(_tag => qr/^(td|th)$/);
        if (@columns == 4) {
            # Extract data from each column
            my $name = $columns[0]->look_down(_tag => 'a')->as_text;
            my $group = $columns[1]->as_text;

            # Check if the third column contains a span element
            my $span_tag = $columns[2]->look_down(_tag => 'span');
            my $local_name = $span_tag ? $span_tag->as_text : '';

            # Check for the existence of an image tag within the fourth column
            my $img_tag = $columns[3]->look_down(_tag => 'img');
            my $photograph = $img_tag ? $img_tag->attr('src') : '';

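            # Download the image and save it to the folder
            if ($photograph) {
                my $image_url = $photograph;
                # Add a scheme if the src is protocol-relative (starts with //)
                $image_url = "https:$image_url" if $image_url =~ m{^//};

                # Replace characters that are unsafe in file names
                (my $safe_name = $name) =~ s/[^A-Za-z0-9._-]+/_/g;
                my $image_filename = "dog_images/$safe_name.jpg";

                my $img_response = $ua->get($image_url);
                if ($img_response->is_success) {
                    open(my $img_file, '>:raw', $image_filename) or die "Cannot open $image_filename: $!";
                    print $img_file $img_response->content;
                    close($img_file);
                }
                else {
                    warn "Could not download $image_url: " . $img_response->status_line . "\n";
                }
            }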

            # Push data into respective arrays
            push @names, $name;
            push @groups, $group;
            push @local_names, $local_name;
            push @photographs, $photograph;
        }
    }

    # Print or process the extracted data as needed
    for my $i (0..$#names) {
        print "Name: $names[$i]\n";
        print "FCI Group: $groups[$i]\n";
        print "Local Name: $local_names[$i]\n";
        print "Photograph: $photographs[$i]\n";
        print "\n";
    }
}
else {
    die "Failed to retrieve the web page. Status code: " . $response->code;
}

In more advanced implementations, you will even need to rotate the User-Agent string so the website can't tell that the requests come from the same browser.
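
A minimal sketch of User-Agent rotation with LWP::UserAgent, assuming a hand-maintained list of strings. The second string and the helper name pick_agent are examples added for illustration, not part of the script above:

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Example strings only; a real list would need to be kept current
my @agents = (
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0',
);

# Hypothetical helper: choose one User-Agent string at random
sub pick_agent {
    return $agents[ rand @agents ];
}

my $ua = LWP::UserAgent->new(timeout => 15);

# Set a fresh User-Agent on the existing object before each request
$ua->agent(pick_agent());
print "Using agent: ", $ua->agent, "\n";
```

Calling $ua->agent(pick_agent()) immediately before each $ua->get(...) changes the header from one request to the next.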

Once you get a little more advanced, you will find that the server can simply block your IP address, regardless of your other tricks. This is a bummer, and it is where most web crawling projects fail.

Overcoming IP Blocks

Investing in a private rotating proxy service like Proxies API can often make the difference between a headache-free web scraping project that gets the job done consistently and one that never really works.

Plus, with our running offer of 1000 free API calls, you have almost nothing to lose by trying our rotating proxy and comparing notes. It takes only one line of integration, so it is hardly disruptive.

Our rotating proxy server, Proxies API, provides a simple API that can solve IP blocking problems instantly. With millions of high-speed rotating proxies located all over the world, automatic IP rotation, automatic User-Agent string rotation (which simulates requests from different, valid web browsers and browser versions), and automatic CAPTCHA solving technology, hundreds of our customers have successfully solved the headache of IP blocks with a simple API.

The whole service can be accessed from any programming language through a simple API call like the one below:

curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"
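
For the Perl script in this guide, the same call might look like the sketch below. The endpoint and the key and url parameters come from the curl example; the helper name proxied_url is hypothetical, and percent-encoding the target URL is an assumption made here so that a target with its own query string is not mixed into the API parameters:

```perl
use strict;
use warnings;
use URI;

# Hypothetical helper: build the proxied request URL for a target page
sub proxied_url {
    my ($api_key, $target) = @_;
    my $uri = URI->new('http://api.proxiesapi.com/');
    $uri->query_form(key => $api_key, url => $target);  # percent-encodes the values
    return $uri->as_string;
}

# API_KEY is a placeholder, as in the curl example
print proxied_url('API_KEY', 'https://example.com'), "\n";
```

The resulting string can be passed to $ua->get(...) in place of the direct page URL.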

We have a running offer of 1000 API calls completely free. Register and get your free API Key here.
