Web scraping is the process of extracting data from websites automatically. It can be useful for getting data off the web and into a format you can analyze or use programmatically.
In this article, we'll walk through an example of scraping Wikipedia to get data on all the Presidents of the United States.
Why Scrape Wikipedia?
Wikipedia contains structured data in tables that cover an incredibly wide range of topics. Scraping Wikipedia can be useful for research projects, data analysis, aggregating facts for quizzes or games, and more. The data is free to use and constantly updated by the Wikipedia community.
For our example, we'll scrape the main table on Wikipedia's List of presidents of the United States page to get data on each president: their name, term start and end dates, party, and so on.
Prerequisites
To follow along, you'll need PHP installed with the cURL extension enabled, plus basic familiarity with PHP syntax.
We'll also use PHP's built-in DOMDocument and DOMXPath classes to parse the HTML.
Note: If you don't already have a development environment set up, the easiest way is to use a package like XAMPP, which includes PHP, the Apache server, and everything needed to run PHP scripts on your local computer.
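To confirm your environment has what the script needs, you can run a quick check first (a minimal sketch; both extensions ship with most PHP builds, but some installs let you disable them):
<?php
// Quick environment check for the extensions this tutorial relies on
echo 'PHP version: ' . PHP_VERSION . "\n";
echo 'cURL loaded: ' . (extension_loaded('curl') ? 'yes' : 'no') . "\n";
echo 'DOM loaded: ' . (extension_loaded('dom') ? 'yes' : 'no') . "\n";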
Scraping the President Data
Let's walk through the script line-by-line to understand how it works:
Define the URL
We start by defining the URL of the Wikipedia page we want to scrape:
$url = "<https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States>";
Set a User-Agent Header
Many sites try to detect and block scraping bots, so we simulate a real browser request by setting a user-agent header:
$headers = [
    "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64)..."
];
This makes the request look to Wikipedia like a real person browsing the page with Chrome on Windows.
Pro Tip: You can get real user agent strings from your browser's developer tools network tab.
Initialize cURL
Next we initialize a cURL session, passing in the URL to fetch:
$ch = curl_init($url);
cURL will make the request and retrieve the content.
Set cURL Options
We configure options so cURL returns the page content as a string instead of printing it directly:
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
The user agent header is also passed here.
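Two more options are worth setting in practice (optional hardening, not part of the original script): a timeout so a hung request can't stall the script forever, and redirect-following in case the URL gets redirected:
// Optional: give up after 30 seconds instead of hanging indefinitely
curl_setopt($ch, CURLOPT_TIMEOUT, 30);
// Optional: follow HTTP redirects (e.g. http -> https)
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);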
Send Request & Get Response
To execute the request and get the response, we simply run:
$response = curl_exec($ch);
Then we can check that it was successful:
if (curl_getinfo($ch, CURLINFO_HTTP_CODE) == 200) {
    // parsing logic here...
}
A 200 status code means everything went well.
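Note that curl_exec() can also fail before any HTTP status exists (DNS failure, connection timeout), in which case it returns false. A defensive sketch:
if ($response === false) {
    // Transport-level failure: no HTTP response was received at all
    echo 'cURL error: ' . curl_error($ch) . "\n";
}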
Parse Response HTML
With the HTML content, we can now parse out the data we want.
First we load it into a DOMDocument object:
$dom = new DOMDocument();
@$dom->loadHTML($response);
The @ symbol suppresses warnings about invalid or unrecognized markup. Real-world HTML, Wikipedia's included, rarely parses without complaint, so the @ keeps the output clean.
Insider Tip: Loading the HTML into DOMDocument lets you access elements with DOM methods like getElementById and getElementsByTagName, and run XPath queries against the tree. It's very powerful for scraping!
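For instance, once the HTML is loaded you can pull elements out directly, no XPath required. A toy sketch that prints the page's first heading:
// Grab the page's first <h1> element by tag name
$h1 = $dom->getElementsByTagName('h1')->item(0);
if ($h1 !== null) {
    echo trim($h1->textContent) . "\n";
}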
We then use XPath to select the specific table we want - the one with the class wikitable sortable:
$xpath = new DOMXPath($dom);
$table = $xpath->query('//table[@class="wikitable sortable"]')->item(0);
This grabs the first matching table element.
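One caveat: @class="wikitable sortable" only matches when the class attribute is exactly that string. If Wikipedia adds another class to the table, the query returns nothing, so a contains() test is more forgiving (a sketch, not verified against the live page):
// Match any table whose class attribute contains "wikitable"
$table = $xpath->query('//table[contains(@class, "wikitable")]')->item(0);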
Extract Table Data
With the table node, we can loop through rows and cells to save the data:
$rows = $table->getElementsByTagName('tr');
foreach ($rows as $row) {
    $rowData = [];
    // Collect the text of each data cell in this row
    foreach ($row->getElementsByTagName('td') as $column) {
        $rowData[] = $column->textContent;
    }
    $data[] = $rowData;
}
We add each row's cell values to the $data array. (The full script below also collects the th header cells, since Wikipedia's tables often store values in them.)
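In practice, textContent keeps footnote markers like [1] or [a] plus stray whitespace from the markup. A small cleanup pass (a hedged sketch) makes the values easier to work with:
// Strip bracketed footnote markers and trim surrounding whitespace
$clean = trim(preg_replace('/\[[a-z0-9]+\]/i', '', $column->textContent));
$rowData[] = $clean;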
Output Scraped Data
Finally, we can print the scraped info or save it to CSV, JSON, etc. Note that the numeric indexes below map to specific table columns, so they will need adjusting if Wikipedia changes the layout:
// Print the scraped data for all presidents
foreach ($data as $presidentData) {
    echo "President Data:\n";
    echo "Number: " . $presidentData[0] . "\n";
    echo "Name: " . $presidentData[2] . "\n";
    echo "Term: " . $presidentData[3] . "\n";
    echo "Party: " . $presidentData[5] . "\n";
    echo "Election: " . $presidentData[6] . "\n";
    echo "Vice President: " . $presidentData[7] . "\n\n";
}
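If you'd rather save the data than print it, PHP's built-in fputcsv() handles the CSV case in a few lines (a minimal sketch; presidents.csv is an arbitrary filename):
// Write every scraped row out as a line of CSV
$fh = fopen('presidents.csv', 'w');
foreach ($data as $row) {
    fputcsv($fh, $row);
}
fclose($fh);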
And we've successfully scraped the Wikipedia table!
Full Script
Here is the full script putting all the pieces together:
<?php
// Define the URL of the Wikipedia page
$url = "https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States";

// Define a user-agent header to simulate a browser request
$headers = [
    "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
];

// Initialize cURL session
$ch = curl_init($url);

// Set cURL options
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);

// Send the HTTP GET request
$response = curl_exec($ch);

// Check if the request was successful (status code 200)
if (curl_getinfo($ch, CURLINFO_HTTP_CODE) == 200) {
    // Create a DOMDocument object and load the HTML content
    $dom = new DOMDocument();
    @$dom->loadHTML($response); // '@' to suppress HTML parsing warnings

    // Find the table with the specified class name
    $xpath = new DOMXPath($dom);
    $table = $xpath->query('//table[@class="wikitable sortable"]')->item(0);

    if ($table === null) {
        echo "Could not find the presidents table - the page markup may have changed.\n";
    } else {
        // Initialize an empty array to store the table data
        $data = [];

        // Iterate through the rows of the table
        $rows = $table->getElementsByTagName('tr');
        foreach ($rows as $row) {
            $columns = $row->getElementsByTagName('td');
            $headerColumns = $row->getElementsByTagName('th');
            $rowData = [];
            foreach ($headerColumns as $col) {
                $rowData[] = $col->textContent;
            }
            foreach ($columns as $col) {
                $rowData[] = $col->textContent;
            }
            if (!empty($rowData)) {
                $data[] = $rowData;
            }
        }

        // Print the scraped data for all presidents
        foreach ($data as $presidentData) {
            // Guard against short rows (e.g. spanning rows) so the
            // indexes below don't read past the end of the array
            if (count($presidentData) < 8) {
                continue;
            }
            echo "President Data:\n";
            echo "Number: " . $presidentData[0] . "\n";
            echo "Name: " . $presidentData[2] . "\n";
            echo "Term: " . $presidentData[3] . "\n";
            echo "Party: " . $presidentData[5] . "\n";
            echo "Election: " . $presidentData[6] . "\n";
            echo "Vice President: " . $presidentData[7] . "\n\n";
        }
    }
} else {
    echo "Failed to retrieve the web page. Status code: " . curl_getinfo($ch, CURLINFO_HTTP_CODE) . "\n";
}

// Close cURL session
curl_close($ch);
?>
Challenges & Next Steps
Some challenges you may run into: Wikipedia can change the table's markup at any time, which breaks hard-coded XPath queries and column indexes; cells contain footnote markers and nested markup that need cleaning; and sites actively try to detect and block automated traffic.
This example just scratches the surface of web scraping in PHP. Some ideas for next steps: save the scraped data to CSV or JSON, scrape other Wikipedia tables, or add retry and error-handling logic.
In more advanced implementations you may even need to rotate the User-Agent string so the website can't tell that the same browser is behind every request.
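For example, a minimal sketch that keeps a small pool of user-agent strings (the strings below are illustrative, not guaranteed current) and picks one at random for each request:
// Choose a random user agent from a small pool on every request
$userAgents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.5 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
];
$headers = ["User-Agent: " . $userAgents[array_rand($userAgents)]];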
Get a bit more advanced, though, and you'll find the server can simply block your IP address, ignoring all your other tricks. This is where most web crawling projects fail.
Overcoming IP Blocks
Investing in a private rotating proxy service like Proxies API can often make the difference between a consistent, headache-free web scraping project that gets the job done and one that never really works.
Plus, with our running offer of 1000 free API calls, you have almost nothing to lose by trying our rotating proxy and comparing notes. It takes only one line of code to integrate, so it's hardly disruptive.
Our rotating proxy server, Proxies API, provides a simple API that solves IP-blocking problems instantly. Hundreds of our customers have used it to eliminate the headache of IP blocks, and it can be called from any programming language with a request like the one below.
curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"
We have a running offer of 1000 API calls completely free. Register and get your free API Key here.