Wikipedia contains a vast amount of structured data across millions of articles. Often, it can be useful to extract or "scrape" data from Wikipedia pages for use in other applications. In this article, I'll walk through a simple example of scraping tabular data from Wikipedia using Perl.
When Would You Want to Scrape Wikipedia Data?
A few examples where scraping Wikipedia data may be helpful: building datasets for analysis, populating a local database with reference data, or feeding structured facts into another application. In short - any use case where you want to utilize the structured data within Wikipedia pages.
Scraping the Wikipedia Presidents Table
To make things concrete, we'll walk through a full code example of scraping the List of Presidents of the United States table from Wikipedia.
This table contains data like president number, name, term dates, political party, etc. Scraping it will allow us to extract and utilize this data in other applications.
Import Perl Modules
We'll use a couple of Perl modules to send HTTP requests and parse the returned HTML:
use LWP::UserAgent;
use HTML::TreeBuilder::XPath;
Make sure these modules are installed to follow along.
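If you are unsure whether they are available, a quick sanity check like the following (an optional helper, not part of the scraper itself) will fail loudly if either module is missing:
for my $mod ('LWP::UserAgent', 'HTML::TreeBuilder::XPath') {
    eval "use $mod; 1" or die "Module $mod is not installed - grab it from CPAN first\n";
}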
Define Wikipedia URL
We need to pass the URL of the Wikipedia page we want to scrape to the request. We'll define it as:
my $url = "https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States";
Create a User Agent String
We'll also set a user agent string that mimics a real browser. This helps avoid the blocks that some sites impose on obvious scrapers:
my $ua = LWP::UserAgent->new(
agent => "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
);
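Optionally, a request timeout can be set on the same constructor so a hanging connection fails fast; this is a small assumed tweak on top of the snippet above, not something the walkthrough requires:
# Same user agent, but give up on requests that take longer than 30 seconds
my $ua = LWP::UserAgent->new(
    agent   => "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
    timeout => 30,
);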
Send HTTP GET Request
We use the user agent to send a simple GET request to fetch the content of the Wikipedia URL:
my $response = $ua->get($url);
And we can check if the request succeeded with:
if ($response->is_success) {
# Request succeeded, scrape content
} else {
# Request failed, print error
print "Request failed with status: " . $response->status_line . "\\n";
}
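If the request turns out to be flaky (timeouts or transient server errors), a simple retry loop is a common hardening step; this sketch is an optional addition, not part of the original example:
# Try the GET up to 3 times, pausing briefly between attempts
my $response;
for my $attempt (1 .. 3) {
    $response = $ua->get($url);
    last if $response->is_success;
    sleep 2;
}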
Parse Returned HTML
If the request succeeds, the HTML content of the page is available in the response object. We decode it and build a parse tree:
my $content = $response->decoded_content;
my $tree = HTML::TreeBuilder::XPath->new_from_content($content);
This gives us a DOM tree we can now query with XPath to find elements.
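As a quick sanity check that the parse worked, the tree can be queried for the article heading. Wikipedia currently renders the page title in an h1 with the id firstHeading; if that ever changes, the XPath below would need adjusting:
# Print the article heading as plain text
my $heading = $tree->findvalue('//h1[@id="firstHeading"]');
print "Scraping: $heading\n";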
Locate Presidents Table
We want to extract the tabular data from the page. Inspecting the page in the browser's developer tools, we can see that the presidents table carries the classes wikitable and sortable.
We can use an XPath query to locate this table element:
my ($table) = $tree->findnodes('//table[@class="wikitable sortable"]');
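Note that this exact-match XPath only finds the table if its class attribute is literally "wikitable sortable". Wikipedia sometimes attaches extra classes to the same table, so a contains()-based query is a more tolerant fallback you could use instead:
# Match the table even if it carries additional classes beyond these two
my ($table) = $tree->findnodes(
    '//table[contains(@class, "wikitable") and contains(@class, "sortable")]'
);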
Initialize Data Storage
Now that we've located the presidents table, we can loop through it and extract each row. We'll store the extracted data in an array of arrays:
my @data;
Each inner array will store a single president's data.
Loop Through Table Rows
We first find all the <tr> rows within the table:
my @rows = $table->findnodes('.//tr[position()>1]');
The XPath query skips the header row. Then we iterate the rows:
for my $row (@rows) {
# Extract and store data for this row
}
Extract Row Data
Within the row loop, we grab all the table cells with:
my @columns = $row->findnodes('.//td | .//th');
We can clean up the text from the cells, trimming leading and trailing whitespace:
my @row_data = map { $_->as_text =~ s/^\s+|\s+$//gr } @columns;
And append this row's data to the array:
push @data, \@row_data;
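The substitution uses Perl's /r modifier, which returns the trimmed copy rather than modifying the string in place; here is a tiny standalone example of the same idiom:
# /r returns a modified copy, leaving the original string untouched
my $raw     = "  George Washington \n";
my $trimmed = $raw =~ s/^\s+|\s+$//gr;    # now "George Washington"
print "$trimmed\n";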
Print Scraped Data
So now @data holds one array of cell values per president. To confirm it worked, we can print out the data:
for my $president_data (@data) {
print "Number: " . $president_data->[0] . "\n";
print "Name: " . $president_data->[2] . "\n";
# Print more fields...
}
And we have successfully scraped the Wikipedia table! The full code is included again down below.
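One caveat before moving on: Wikipedia tables lean heavily on rowspan, so the occasional row can come back with fewer cells than expected, and the fixed indexes above may point past the end of the array. A defensive variant (a sketch, not part of the original walkthrough) simply skips short rows:
# Skip rows that do not have enough cells for the columns we index
for my $president_data (@data) {
    next if @$president_data < 3;
    print "Number: " . $president_data->[0] . "\n";
    print "Name: "   . $president_data->[2] . "\n";
}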
What's Next?
With the president data extracted, you could now load it into a database, analyze or visualize it, or combine it with other datasets. The possibilities are endless! What other interesting Wikipedia data would be useful for you to scrape? Let me know in the comments!
Full Wikipedia Scraping Code
Here is the complete code example again for reference:
use strict;
use warnings;
use LWP::UserAgent;
use HTML::TreeBuilder::XPath;
# Define the URL of the Wikipedia page
my $url = "https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States";
# Define a user-agent header to simulate a browser request
my $ua = LWP::UserAgent->new(
agent => "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
);
# Send an HTTP GET request to the URL with the headers
my $response = $ua->get($url);
# Check if the request was successful (status code 200)
if ($response->is_success) {
my $content = $response->decoded_content;
# Parse the HTML content of the page using HTML::TreeBuilder::XPath
my $tree = HTML::TreeBuilder::XPath->new_from_content($content);
# Find the table with the specified class name
my ($table) = $tree->findnodes('//table[@class="wikitable sortable"]');
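# Defensive addition (not in the original walkthrough): stop early if the
# table could not be found, e.g. because its class names changed
die "Could not locate the presidents table on the page\n" unless $table;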
# Initialize empty arrays to store the table data
my @data;
# Iterate through the rows of the table
my @rows = $table->findnodes('.//tr[position()>1]'); # Skip the header row
for my $row (@rows) {
# Extract data from each column and append it to the data array
my @columns = $row->findnodes('.//td | .//th');
my @row_data = map { $_->as_text =~ s/^\s+|\s+$//gr } @columns;
push @data, \@row_data;
}
# Print the scraped data for all presidents
for my $president_data (@data) {
print("President Data:\n");
print("Number:", $president_data->[0], "\n");
print("Name:", $president_data->[2], "\n");
print("Term:", $president_data->[3], "\n");
print("Party:", $president_data->[5], "\n");
print("Election:", $president_data->[6], "\n");
print("Vice President:", $president_data->[7], "\n");
print("\n");
}
} else {
print("Failed to retrieve the web page. Status code:", $response->status_line, "\n");
}
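As a follow-up, the scraped rows could be written out for use elsewhere. Here is a hypothetical sketch using the Text::CSV module (assumed to be installed); it would live inside the success branch above, where @data is in scope:
# Write every scraped row to presidents.csv, one line per president
use Text::CSV;
my $csv = Text::CSV->new({ binary => 1, eol => "\n" });
open my $fh, '>:encoding(UTF-8)', 'presidents.csv' or die "Cannot open presidents.csv: $!";
$csv->print($fh, $_) for @data;
close $fh;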
curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"