The goal of the PHP script we will be discussing is to scrape the list of dog breeds from a Wikipedia page, downloading each breed's image along the way. Specifically, it extracts the name, group, local name, and image URL for each breed listed on the page.
Here is the page we will be working with: https://commons.wikimedia.org/wiki/List_of_dog_breeds
Importing Required Libraries
We first need to include the PHP library that will handle HTML parsing (HTTP requests are handled by PHP's built-in file_get_contents()):
require 'simple_html_dom.php';
The simple_html_dom library makes it easy to parse and manipulate HTML and XML documents. We will use it later to parse the content of the Wikipedia page.
Defining the Target URL
Next, we store the URL of the Wikipedia page in a variable:
$url = 'https://commons.wikimedia.org/wiki/List_of_dog_breeds';
This is the page that contains the data we want to scrape.
Setting a User Agent
Websites can often distinguish requests coming from scripts from those coming from browsers. To mimic a browser request, we define a User-Agent header:
$options = [
    'http' => [
        'method' => 'GET',
        'header' => 'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
    ]
];
This makes the request appear like it's coming from a Chrome browser running on Windows.
Sending HTTP Request
Now we can send a GET request to fetch the content of the target URL:
$context = stream_context_create($options);
$response = file_get_contents($url, false, $context);
The stream context applies the headers we defined earlier.
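The same context can also carry other HTTP options. For example, a timeout keeps a slow or unresponsive server from hanging the script; a minimal sketch (the 10-second value is an assumption, not part of the original script):
$options = [
    'http' => [
        'method' => 'GET',
        'header' => 'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
        'timeout' => 10  // give up after 10 seconds (assumed value)
    ]
];
$context = stream_context_create($options);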
We can check whether the request succeeded. file_get_contents() returns false on failure:
if ($response !== false) {
    // Request succeeded
} else {
    // Request failed
}
If you need the actual HTTP status code (200 means the request was successful), PHP exposes the response headers in the $http_response_header variable after the call.
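As a sketch of how that lookup could work (the regular expression here is an assumption, not part of the original script):
// $http_response_header is populated by PHP after a file_get_contents()
// call that uses the http stream wrapper; its first entry is the status line.
$status_code = 0;
if (isset($http_response_header[0]) &&
    preg_match('{HTTP/\S+\s(\d{3})}', $http_response_header[0], $matches)) {
    $status_code = (int) $matches[1];
}
echo $status_code === 200 ? "Request OK\n" : "HTTP status: $status_code\n";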
Parsing the HTML
Since the request succeeded, we have the HTML content of the Wikipedia page saved in the $response variable.
We can parse this using simple_html_dom's str_get_html() function:
$html = str_get_html($response);
This converts the HTML into a DOM-like object that we can traverse using CSS-style selectors.
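For instance, a quick way to confirm the parse worked is to pull out something simple like the page title (a sanity check, not part of the original script):
// Grab the first <title> element; find() returns null if nothing matches.
$title = $html->find('title', 0);
echo $title ? trim($title->plaintext) : 'Parse failed or no title found';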
Identifying Key Elements
Inspecting the page
If you inspect the page with Chrome DevTools, you can see that the data lives in a table element with the classes wikitable and sortable.
Our goal is to extract the data from a specific table on the page. We need to first find this table element:
$table = $html->find('table.wikitable.sortable', 0);
This finds the first table on the page with the classes wikitable and sortable (the index 0 returns the first match).
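Note that find() returns null when nothing matches, so a defensive check before using $table is a good idea (a sketch, not part of the original script):
// Bail out early if the page layout changed and the table is gone.
if (!$table) {
    die("Could not find the breeds table; the page layout may have changed.\n");
}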
Initializing Data Arrays
Let's initialize some empty arrays to store the data we extract:
$names = [];
$groups = [];
$local_names = [];
$photographs = [];
Creating Image Directory
Since we want to download the images of each dog breed, let's create a folder called dog_images:
if (!is_dir('dog_images')) {
    mkdir('dog_images');
}
This will create the folder if it doesn't already exist.
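A slightly more defensive variant checks that the directory was actually created, since mkdir() can fail on permissions or a read-only filesystem (a sketch; the error message is an assumption):
// Create the directory (recursively) and stop if that fails.
if (!is_dir('dog_images') && !mkdir('dog_images', 0777, true)) {
    die("Could not create the dog_images directory.\n");
}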
Extracting Data from Table Rows
Now we can loop through each row inside the table we located earlier:
foreach ($table->find('tr') as $row) {
    // Extract data from each row
}
Traversing Row Cells
Inside the loop, we first need to grab the cells in each row:
$columns = $row->find('td, th');
We check that there are exactly 4 cells, since rows with fewer are not data rows:
if (count($columns) == 4) {
    // Extract data from cells
}
Understanding Selectors and Data Extraction
Now we can extract the data we need from the cells. The most complex part of web scraping is identifying the correct HTML elements to extract the data you need. This requires traversing the DOM structure and targeting elements using CSS selectors or other methods exposed by the HTML parsing library. Let's break down how data is extracted from each cell in this script.
Name Column
The name is wrapped in an anchor tag inside the first cell:
<td>
<a href="/dog/affenpinscher">Affenpinscher</a>
</td>
We use DOM traversal to find the anchor tag and get its plain text:
$name = trim($columns[0]->find('a', 0)->plaintext);
Breaking this down: $columns[0] is the first cell, find('a', 0) returns the first anchor tag inside it, plaintext gives that tag's text content, and trim() strips surrounding whitespace. This stores the value "Affenpinscher" in the $name variable.
Group Column
The group name is directly inside the second cell:
<td>FCI Group 2, Section 1</td>
So we can directly extract the cell's text content:
$group = trim($columns[1]->plaintext);
This stores "FCI Group 2, Section 1" in $group.
Local Name Column
Some rows contain a span tag with the breed's local name in the third cell:
<td><span>Mops</span></td>
We check if this span exists before getting its text:
$span_tag = $columns[2]->find('span', 0);
$local_name = $span_tag ? trim($span_tag->plaintext) : '';
If the span exists, we store its text in $local_name; otherwise we set it to an empty string.
Image URL Column
The last cell contains the image we want to download. We check if there is an img tag:
$img_tag = $columns[3]->find('img', 0);
$photograph = $img_tag ? $img_tag->src : '';
If found, we get the image source URL from the src attribute; otherwise we set it to an empty string.
As you can see, accurately locating the data relies heavily on analyzing the HTML structure and using the correct selectors and traversal methods. The literal strings inside the selectors are preserved because they correspond directly to elements on the page; changing them would break the data extraction.
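One caveat: find('a', 0) returns null when a cell has no anchor tag, so calling ->plaintext on it directly would crash on an unexpected row. A defensive sketch (the fallback to the cell's own text is an assumption, not part of the original script):
$a_tag = $columns[0]->find('a', 0);
// Fall back to the raw cell text if there is no anchor (assumption: the
// cell still contains the breed name as plain text in that case).
$name = $a_tag ? trim($a_tag->plaintext) : trim($columns[0]->plaintext);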
Downloading and Saving Images
If an image URL is found, we download and save it:
if ($photograph) {
    // Download the image
    $image_url = $photograph;
    $image_data = file_get_contents($image_url);
    // Save it to the folder
    $image_filename = 'dog_images/' . $name . '.jpg';
    file_put_contents($image_filename, $image_data);
}
We use unique filenames like "dog_images/affenpinscher.jpg" to prevent conflicts.
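Breed names can contain spaces or other characters that are awkward in file paths, so a small sanitizing step is worth adding; a sketch (the regular expression and lowercase convention are assumptions):
// Replace anything that is not a letter, digit, underscore, or hyphen
// with an underscore before building the path.
$safe_name = strtolower(preg_replace('/[^A-Za-z0-9_-]+/', '_', $name));
$image_filename = 'dog_images/' . $safe_name . '.jpg';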
Appending Extracted Data
After extraction, we append the data from each row to our arrays:
$names[] = $name;
$groups[] = $group;
$local_names[] = $local_name;
$photographs[] = $photograph;
This builds up the arrays containing all the scraped data.
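Parallel arrays work, but they are easy to get out of sync. An alternative worth considering (a sketch, not what the original script does) is a single array of associative rows, which keeps each breed's fields together:
$breeds[] = [
    'name'       => $name,
    'group'      => $group,
    'local_name' => $local_name,
    'photograph' => $photograph,
];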
Processing the Scraped Data
Finally, we can work with the data in the arrays:
for ($i = 0; $i < count($names); $i++) {
    echo "Name: " . $names[$i] . "\n";
    echo "FCI Group: " . $groups[$i] . "\n";
    echo "Local Name: " . $local_names[$i] . "\n";
    echo "Photograph: " . $photographs[$i] . "\n\n";
}
We may also write it to a file, database, etc. for future use.
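For example, writing the same arrays out as CSV takes only a few lines (a sketch; the dog_breeds.csv filename is an assumption):
// Write a header row, then one row per scraped breed.
$fp = fopen('dog_breeds.csv', 'w');
fputcsv($fp, ['Name', 'FCI Group', 'Local Name', 'Photograph']);
for ($i = 0; $i < count($names); $i++) {
    fputcsv($fp, [$names[$i], $groups[$i], $local_names[$i], $photographs[$i]]);
}
fclose($fp);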
Full Code
Here is the complete code again for reference:
<?php
// Include the required PHP library
require 'simple_html_dom.php';

// URL of the Wikipedia page
$url = 'https://commons.wikimedia.org/wiki/List_of_dog_breeds';

// Define a user-agent header to simulate a browser request
$options = [
    'http' => [
        'method' => 'GET',
        'header' => 'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
    ]
];
$context = stream_context_create($options);

// Send an HTTP GET request to the URL with the headers
$response = file_get_contents($url, false, $context);

// Check if the request was successful (file_get_contents returns false on failure)
if ($response !== false) {
    // Parse the HTML content of the page
    $html = str_get_html($response);

    // Find the table with classes 'wikitable' and 'sortable'
    $table = $html->find('table.wikitable.sortable', 0);

    // Initialize arrays to store the data
    $names = [];
    $groups = [];
    $local_names = [];
    $photographs = [];

    // Create a directory to save the images
    if (!is_dir('dog_images')) {
        mkdir('dog_images');
    }

    // Iterate through rows in the table
    foreach ($table->find('tr') as $row) {
        $columns = $row->find('td, th');
        if (count($columns) == 4) {
            // Extract data from each column
            $name = trim($columns[0]->find('a', 0)->plaintext);
            $group = trim($columns[1]->plaintext);

            // Check if the third column contains a span element
            $span_tag = $columns[2]->find('span', 0);
            $local_name = $span_tag ? trim($span_tag->plaintext) : '';

            // Check for the existence of an image tag within the fourth column
            $img_tag = $columns[3]->find('img', 0);
            $photograph = $img_tag ? $img_tag->src : '';

            // Download the image and save it to the folder
            if ($photograph) {
                $image_url = $photograph;
                $image_data = file_get_contents($image_url);
                if ($image_data !== false) {
                    $image_filename = 'dog_images/' . $name . '.jpg';
                    file_put_contents($image_filename, $image_data);
                }
            }

            // Append data to respective arrays
            $names[] = $name;
            $groups[] = $group;
            $local_names[] = $local_name;
            $photographs[] = $photograph;
        }
    }

    // Print or process the extracted data as needed
    for ($i = 0; $i < count($names); $i++) {
        echo "Name: " . $names[$i] . "\n";
        echo "FCI Group: " . $groups[$i] . "\n";
        echo "Local Name: " . $local_names[$i] . "\n";
        echo "Photograph: " . $photographs[$i] . "\n\n";
    }
} else {
    echo "Failed to retrieve the web page.\n";
}
?>
Tricks and Tips
Here are some handy tricks for web scraping:
In more advanced implementations you will need to rotate the User-Agent string so the website can't tell it's the same browser making repeated requests, as in the sketch below.
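A minimal sketch of User-Agent rotation, assuming you keep a small pool of browser strings and pick one at random per request (the strings below are illustrative examples, not a current or exhaustive list):
$user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.0 Safari/605.1.15',
];
// Pick a random User-Agent for this request.
$options = [
    'http' => [
        'method' => 'GET',
        'header' => 'User-Agent: ' . $user_agents[array_rand($user_agents)]
    ]
];
$context = stream_context_create($options);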
curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"
The call returns the HTML of the requested page:
<!doctype html>
<html>
<head>
<title>Example Domain</title>
<meta charset="utf-8" />
<meta http-equiv="Content-type" content="text/html; charset=utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
...
We have a running offer of 1000 API calls completely free. Register and get your free API Key here.