Introduction
In this article, we will be scraping the "List of dog breeds" Wikipedia page to extract information and images of different dog breeds. Our end goal is to save all dog breed photos locally along with metadata like the breed name, breed group, and local breed name.
This is the page we are talking about…
To achieve this, we will send an HTTP request to download the raw HTML content of the Wikipedia page. We will then use the Nokogiri library in Ruby to parse the HTML and use CSS/XPath selectors to extract the data we want from the structured content.
The full Ruby code to accomplish this web scraping is provided at the end for reference. We will walk through it section by section to understand the logic and mechanics behind each part.
Prerequisites
Before we dive into the code, let's outline the prerequisites needed to follow along:
Languages: Ruby

Libraries: nokogiri, plus the open-uri, fileutils, and uri modules that ship with Ruby's standard library

Installation: the only third-party library here, nokogiri, can be installed via gem, Ruby's package manager.

For example:

gem install nokogiri
We also want to be in an environment with Ruby set up properly to run code. This could be through tools like rvm or rbenv on your local machine, or an online IDE.
Sending the Request
We start by defining the URL of the page we want to scrape (this particular breeds table lives on Wikimedia Commons, Wikipedia's sister project for media):
url = 'https://commons.wikimedia.org/wiki/List_of_dog_breeds'
Next, we setup a user agent header string:
user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
This simulates a request coming from a Chrome browser on Windows. Websites tend to block scraping requests that are missing user agent headers, so this helps avoid access issues.
We then use Ruby's handy open-uri library to send a GET request to the URL. The user agent header is passed along so the website thinks this is coming from a real browser:
html_content = URI.open(url, 'User-Agent' => user_agent).read
The page HTML content is downloaded and saved into the html_content variable. In a production scraper you would also want to handle potential errors around connectivity, server issues, etc. and retry failed requests; the basic script here leaves that out for brevity.
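If you do want that robustness, a minimal sketch of a retry wrapper might look like the following (the fetch_with_retries helper and its max_retries default are illustrative, not part of the original script):

require 'open-uri'

# Hypothetical helper: retry the request a few times before giving up
def fetch_with_retries(url, user_agent, max_retries = 3)
  attempts = 0
  begin
    URI.open(url, 'User-Agent' => user_agent).read
  rescue OpenURI::HTTPError, SocketError, Errno::ECONNRESET => e
    attempts += 1
    raise if attempts >= max_retries
    sleep(2 * attempts)  # simple backoff between attempts
    retry
  end
end

html_content = fetch_with_retries(url, user_agent)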
Parsing the HTML
Now that we've fetched the raw HTML of the Wikipedia page, we want to parse it so we can extract the data we want.
This is where Nokogiri comes in. Nokogiri allows us to take the HTML and turn it into a parseable DOM structure.
doc = Nokogiri::HTML(html_content)
The doc variable now contains a structured Document Object Model (DOM) representation of the HTML.
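As a quick sanity check that the parse worked, you could print a couple of things from the document before going any further (a throwaway snippet, not part of the final script):

puts doc.title          # the page's <title> text
puts doc.errors.count   # how many parse errors Nokogiri recorded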
Inspecting the page
Using Chrome's inspect tool, you can see that the data lives in a table element with the classes wikitable and sortable.
We can use Nokogiri's methods combined with CSS or XPath selectors to query elements, just like we would in the browser console.
For example, to find the main table element:
table = doc.at('table.wikitable.sortable')
Here we are looking for a table element with the CSS classes wikitable and sortable.
Extracting the Data
Now that we've zoomed into the main table element, we can focus our attention on extracting the data from it. We loop through each table row, skipping the header:

table.search('tr')[1..-1].each do |row|
  columns = row.search('th, td')  # grab this row's cells
  # extraction logic goes here
end
Inside this loop, we dig into each of the row's columns for the data pieces we want.

Breed name:

name = columns[0].at('a').text.strip
Breed group:

group = columns[1].text.strip
Local breed name:

span_tag = columns[2].at('span')
local_name = span_tag ? span_tag.text.strip : ''
And most importantly, the image URL:

img_tag = columns[3].at('img')
photograph = img_tag ? img_tag['src'] : ''
We check if the image and span tags exist before extracting text, as some rows lack this data.

With the image URL, we can then download the photo and save it locally:

if !photograph.empty?
  image_url = URI.join(url, photograph).to_s
  image_filename = File.join('dog_images', "#{name}.jpg")
  File.open(image_filename, 'wb') do |img_file|
    img_file.write(URI.open(image_url, 'User-Agent' => user_agent).read)
  end
end

As we extract, all the data gets stored into arrays to process later. Note that the snippet above has no error handling around the image download, so a single broken link would stop the script; in practice you would want to guard against that.
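For example, wrapping the download in a begin/rescue lets the loop skip a broken image instead of crashing (a sketch; the warning message is just illustrative):

if !photograph.empty?
  begin
    image_url = URI.join(url, photograph).to_s
    image_filename = File.join('dog_images', "#{name}.jpg")
    File.open(image_filename, 'wb') do |img_file|
      img_file.write(URI.open(image_url, 'User-Agent' => user_agent).read)
    end
  rescue OpenURI::HTTPError, SocketError => e
    # Skip this breed's photo but keep scraping the rest of the table
    warn "Could not download image for #{name}: #{e.message}"
  end
end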
Processing Results
Now that we've parsed through the entire table and extracted the data, the arrays contain all the information we wanted about these dog breeds. We can iterate through and print it out:

names.each_index do |i|
  puts "Name: #{names[i]}"
  puts "FCI Group: #{groups[i]}"
  # etc...
end

The data could also be saved to a database, exported to CSV, analyzed further, etc.
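For instance, writing the same arrays out to a CSV file only needs Ruby's built-in csv library (the dog_breeds.csv filename is just an example):

require 'csv'

CSV.open('dog_breeds.csv', 'w') do |csv|
  csv << ['Name', 'FCI Group', 'Local Name', 'Photograph']  # header row
  names.each_index do |i|
    csv << [names[i], groups[i], local_names[i], photographs[i]]
  end
end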
Conclusion

In this article, we walked through a full web scraping script to extract images and information on dog breeds from Wikipedia. We learned how to:

- send an HTTP GET request with a browser-like user-agent header
- parse the raw HTML into a DOM with Nokogiri
- locate the breeds table and loop through its rows
- extract the breed name, group, local name, and image URL from each row
- download each breed photo and save it locally

Full code again here for reference:
require 'open-uri'
require 'nokogiri'
require 'fileutils'
require 'uri'
# URL of the Wikipedia page
url = 'https://commons.wikimedia.org/wiki/List_of_dog_breeds'
# Define a user-agent header to simulate a browser request
user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
# Send an HTTP GET request to the URL with the headers
html_content = URI.open(url, 'User-Agent' => user_agent).read
# Parse the HTML content of the page
doc = Nokogiri::HTML(html_content)
# Find the table with class 'wikitable sortable'
table = doc.at('table.wikitable.sortable')
# Initialize arrays to store the data
names = []
groups = []
local_names = []
photographs = []
# Create a folder to save the images
FileUtils.mkdir_p('dog_images')
# Iterate through rows in the table (skip the header row)
table.search('tr')[1..-1].each do |row|
  columns = row.search('th, td')
  if columns.length == 4
    # Extract data from each column
    name = columns[0].at('a').text.strip
    group = columns[1].text.strip

    # Check if the third column contains a span element
    span_tag = columns[2].at('span')
    local_name = span_tag ? span_tag.text.strip : ''

    # Check for the existence of an image tag within the fourth column
    img_tag = columns[3].at('img')
    photograph = img_tag ? img_tag['src'] : ''

    # Download the image and save it to the folder
    if !photograph.empty?
      image_url = URI.join(url, photograph).to_s
      image_filename = File.join('dog_images', "#{name}.jpg")
      File.open(image_filename, 'wb') do |img_file|
        img_file.write(URI.open(image_url, 'User-Agent' => user_agent).read)
      end
    end

    # Append data to respective arrays
    names << name
    groups << group
    local_names << local_name
    photographs << photograph
  end
end
# Print or process the extracted data as needed
names.each_index do |i|
  puts "Name: #{names[i]}"
  puts "FCI Group: #{groups[i]}"
  puts "Local Name: #{local_names[i]}"
  puts "Photograph: #{photographs[i]}"
  puts
end
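To try it out, save everything above as, say, scrape_dog_breeds.rb (the filename is arbitrary) and run it from a terminal:

ruby scrape_dog_breeds.rb

The breed photos end up in the dog_images folder next to the script, and the metadata prints to the terminal.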
curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"