The first step is to load the R libraries that we will need to perform the web scraping:
library(rvest)
library(httr)
library(stringr)
The key libraries are rvest, which parses HTML and extracts data from it; httr, which sends the HTTP requests; and stringr, which helps with string handling and clean-up.
Defining the URL and Headers
Next we need to specify the URL of the web page that contains the images we want to scrape:
url <- 'https://commons.wikimedia.org/wiki/List_of_dog_breeds'
We are scraping images of dog breeds from a Wikimedia Commons page that lists breeds along with their photographs. This is the page we will be working with throughout the tutorial.
When scraping web pages, it is good practice to define a custom user agent header. This helps simulate a real browser request so the server will respond properly:
headers <- c(
`User-Agent` = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
)
Here we are setting a Chrome browser user agent.
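As an aside, httr also ships a user_agent() helper that sets the same header for you; a minimal sketch of the equivalent, reusing the string we just defined:
ua <- httr::user_agent(headers[['User-Agent']])
# ua can later be passed to httr::GET() in place of add_headers(), e.g. httr::GET(url, ua)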
Sending the HTTP Request
To download the web page content, we can send an HTTP GET request using the httr package:
response <- httr::GET(url, httr::add_headers(headers))
This will fetch the contents of the specified URL and store the full HTTP response in the response variable.
Checking the Response Status
It's good practice to check that the request succeeded before trying to parse the response. We can check the status code:
if (httr::status_code(response) == 200) {
  # Request succeeded logic
} else {
  # Failed request handling
}
A status code of 200 means the request was successful. Other codes indicate an error.
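If you would rather have the script fail loudly than branch on the status code, httr's stop_for_status() converts any 4xx/5xx response into an R error; a one-line sketch:
# Does nothing on success, raises an R error on an HTTP error status
httr::stop_for_status(response)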
Parsing the HTML
Since the request succeeded, we can parse the HTML content using rvest:
page <- read_html(httr::content(response, "text"))
The page variable now holds the parsed HTML document, which we can query with rvest selectors.
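As a quick sanity check that the parse worked, you can print the page title (purely illustrative, not needed for the scrape):
page %>%
  html_node('title') %>%
  html_text()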
Finding the Data Table
Inspecting the page with the Chrome DevTools shows that the data lives in a table element with the classes wikitable and sortable.
We can use XPath to find that table element and then collect its rows:
table_node <- page %>%
  html_node(xpath = '//*[@class="wikitable sortable"]')
rows <- table_node %>% html_nodes('tr')
Let's break this down: html_node() with the XPath expression selects the first element whose class attribute is wikitable sortable, and html_nodes('tr') then grabs every row of that table as an HTML node. Keeping the rows as nodes (rather than flattening them to plain text with html_table()) matters because later we need to look inside individual cells for <span> and <img> tags. Now the table rows are stored in the rows variable.
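If you prefer CSS selectors to XPath, the equivalent lookup is (just an alternative sketch, not required):
table_node <- page %>% html_node('table.wikitable.sortable')
rows <- table_node %>% html_nodes('tr')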
Initializing Data Storage
As we scrape data from the table, we need variables to accumulate the results:
names <- character()
groups <- character()
local_names <- character()
photographs <- character()
Empty vectors are created to store the dog name, breed group, local names, and image URLs as we extract them.
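Growing vectors with c() inside a loop is fine for a few hundred rows, but if you prefer, you can collect one small data frame per row in a list and combine them once at the end; a sketch of that alternative pattern:
results <- list()
# inside the loop, append one record per breed:
#   results[[length(results) + 1]] <- data.frame(name = name, group = group,
#     local_name = local_name, photograph = photograph, stringsAsFactors = FALSE)
# after the loop, combine everything into a single data frame:
#   breeds <- do.call(rbind, results)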
Iterating Through the Table Rows
To scrape the data from each row, we can iterate through the table:
for (i in 2:length(rows)) {
  cells <- rows[[i]] %>% html_nodes('td')
  # Extract data for each dog breed
}
This skips the header row and processes each data row, storing the current row's cell nodes in the cells variable.
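One wrinkle worth guarding against: some wikitable rows (section sub-headers, for example) contain fewer than four td cells, and indexing a missing cell would stop the loop with an error. A short check, which the full script at the end also includes, skips such rows:
cells <- rows[[i]] %>% html_nodes('td')
# Skip rows that don't have the expected four data cells
if (length(cells) < 4) next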
Extracting Data from Each Column
Now here is the most complex part - extracting each data field from the table columns:
# Column 1: Name
name <- html_text(cells[[1]], trim = TRUE)
# Column 2: Group
group <- html_text(cells[[2]], trim = TRUE)
# Check column 3 for a <span> tag
span_tag <- html_nodes(cells[[3]], 'span')
local_name <- ifelse(length(span_tag) > 0, html_text(span_tag[[1]]), '')
# Check column 4 for an <img> tag
img_tag <- html_nodes(cells[[4]], 'img')
photograph <- ifelse(length(img_tag) > 0, html_attr(img_tag[[1]], 'src'), '')
As you can see, each column requires different logic to extract the text or attributes. Let's break it down:
Name Column:
The name is just the text of column 1, so we grab it with html_text():
name <- html_text(cells[[1]], trim = TRUE)
Group Column:
The group is also plain text, extracted the same way:
group <- html_text(cells[[2]], trim = TRUE)
Local Name Column:
For local names, we first check whether the column contains a <span> tag:
span_tag <- html_nodes(cells[[3]], 'span')
If found, we extract its text:
local_name <- ifelse(length(span_tag) > 0, html_text(span_tag[[1]]), '')
Photograph Column:
Finally, for the photo we check whether an <img> tag is present:
img_tag <- html_nodes(cells[[4]], 'img')
If yes, we grab its source URL attribute:
photograph <- ifelse(length(img_tag) > 0, html_attr(img_tag[[1]], 'src'), '')
This column-by-column logic handles the irregularities that often appear when scraping semi-structured HTML.
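Since stringr is already loaded, it can also be used to tidy up the extracted text; for example, str_squish() trims leading and trailing whitespace and collapses the internal whitespace that often sneaks into wiki table cells (an optional clean-up step, not part of the walkthrough above):
name <- str_squish(name)
group <- str_squish(group)
local_name <- str_squish(local_name)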
Downloading and Saving Images
With the image URLs extracted, we can now download and save the photos:
if (photograph != '') {
  # Download image
  # Save to file
}
The code checks that we have a non-empty image URL before attempting to download and save the file.
We won't include all the image download code here for brevity.
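That said, the core of the download step (shown in full in the complete script below) is just a conditional GET followed by writeBin(). One detail to note: image src attributes on Wikimedia pages are usually protocol-relative (they start with //), so we prepend https: before requesting them:
if (photograph != '') {
  image_url <- photograph
  # Prepend the scheme to protocol-relative URLs such as //upload.wikimedia.org/...
  if (startsWith(image_url, '//')) {
    image_url <- paste0('https:', image_url)
  }
  image_response <- httr::GET(image_url, httr::add_headers(headers))
  if (httr::status_code(image_response) == 200) {
    image_filename <- file.path('dog_images', paste0(name, '.jpg'))
    writeBin(httr::content(image_response, "raw"), image_filename)
  }
}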
Printing the Extracted Data
Finally, to print out the scraped data:
for (i in 1:length(names)) {
cat("Name:", names[i], "\\n")
cat("FCI Group:", groups[i], "\\n")
cat("Local Name:", local_names[i], "\\n")
cat("Photograph:", photographs[i], "\\n")
cat("\\n")
}
This iterates through each record and prints the extracted fields.
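If you want to keep the results rather than just print them, the four vectors can be combined into a data frame and written to disk (a small sketch; the dog_breeds.csv filename is arbitrary):
breeds <- data.frame(
  name = names,
  group = groups,
  local_name = local_names,
  photograph = photographs,
  stringsAsFactors = FALSE
)
write.csv(breeds, 'dog_breeds.csv', row.names = FALSE)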
Handling Errors
The code also contains logic to handle errors:
} else {
cat("Failed to retrieve the web page. Status code:", httr::status_code(response), "\\n")
}
If the HTTP request failed, it prints an error message with the status code.
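For flaky connections, httr's RETRY() can stand in for the plain GET() and re-attempt the request a few times with increasing pauses before giving up; a sketch:
# Retries on errors or failing status codes, up to 3 attempts
response <- httr::RETRY("GET", url, httr::add_headers(headers), times = 3, pause_base = 2)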
Full Code
# Load the required libraries
library(rvest)
library(httr)
library(stringr)
# URL of the Wikimedia Commons page listing dog breeds
url <- 'https://commons.wikimedia.org/wiki/List_of_dog_breeds'
# Define a user-agent header to simulate a browser request
headers <- c(
`User-Agent` = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
)
# Send an HTTP GET request to the URL with the headers
response <- httr::GET(url, httr::add_headers(headers))
# Check if the request was successful (status code 200)
if (httr::status_code(response) == 200) {
  # Parse the HTML content of the page
  page <- read_html(httr::content(response, "text"))
  # Find the table with class 'wikitable sortable' and collect its rows
  table_node <- page %>%
    html_node(xpath = '//*[@class="wikitable sortable"]')
  rows <- table_node %>% html_nodes('tr')
  # Initialize vectors to store the data
  names <- character()
  groups <- character()
  local_names <- character()
  photographs <- character()
  # Create a folder to save the images
  dir.create('dog_images', showWarnings = FALSE)
  # Iterate through rows in the table (skip the header row)
  for (i in 2:length(rows)) {
    cells <- rows[[i]] %>% html_nodes('td')
    # Skip rows that don't have the expected four data cells
    if (length(cells) < 4) next
    # Extract data from each column
    name <- html_text(cells[[1]], trim = TRUE)
    group <- html_text(cells[[2]], trim = TRUE)
    # Check if the third column contains a span element
    span_tag <- html_nodes(cells[[3]], 'span')
    local_name <- ifelse(length(span_tag) > 0, html_text(span_tag[[1]]), '')
    # Check for the existence of an image tag within the fourth column
    img_tag <- html_nodes(cells[[4]], 'img')
    photograph <- ifelse(length(img_tag) > 0, html_attr(img_tag[[1]], 'src'), '')
    # Download the image and save it to the folder
    if (photograph != '') {
      image_url <- photograph
      # Image src attributes on Wikimedia pages are usually protocol-relative
      if (startsWith(image_url, '//')) {
        image_url <- paste0('https:', image_url)
      }
      image_response <- httr::GET(image_url, httr::add_headers(headers))
      if (httr::status_code(image_response) == 200) {
        image_filename <- file.path('dog_images', paste0(name, '.jpg'))
        writeBin(httr::content(image_response, "raw"), image_filename)
      }
    }
    # Append data to the respective vectors
    names <- c(names, name)
    groups <- c(groups, group)
    local_names <- c(local_names, local_name)
    photographs <- c(photographs, photograph)
  }
  # Print or process the extracted data as needed
  for (i in 1:length(names)) {
    cat("Name:", names[i], "\n")
    cat("FCI Group:", groups[i], "\n")
    cat("Local Name:", local_names[i], "\n")
    cat("Photograph:", photographs[i], "\n")
    cat("\n")
  }
} else {
  cat("Failed to retrieve the web page. Status code:", httr::status_code(response), "\n")
}
In more advanced implementations, you may even need to rotate the User-Agent string so the website can't tell it's the same browser making every request.
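A simple way to do that is to keep a small pool of User-Agent strings and pick one at random for each request (the strings below are just examples):
user_agents <- c(
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.1 Safari/605.1.15',
  'Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Gecko/20100101 Firefox/115.0'
)
headers <- c(`User-Agent` = sample(user_agents, 1))
response <- httr::GET(url, httr::add_headers(headers))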
Go a little further, though, and you will find that the server can simply block your IP address, ignoring all of your other tricks. This is a bummer, and it is where most web crawling projects fail.
Overcoming IP Blocks
Investing in a private rotating proxy service like Proxies API can often make the difference between a successful, headache-free web scraping project that gets the job done consistently and one that never really works.
Plus, with our current offer of 1000 free API calls, you have almost nothing to lose by trying our rotating proxy and comparing notes. It only takes one line of integration, so it's hardly disruptive.
Our rotating proxy server Proxies API provides a simple API that can solve all IP Blocking problems instantly.
Hundreds of our customers have successfully solved the headache of IP blocks with a simple API.
The whole thing can be accessed by a simple API like below in any programming language.
curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"
We have a running offer of 1000 API calls completely free. Register and get your free API Key here.