Introduction
In this article, we will be scraping the "List of dog breeds" Wikipedia page to extract information and images of different dog breeds. Our end goal is to save all dog breed photos locally along with metadata like the breed name, breed group, and local breed name.
This is the page we are talking about…
To achieve this, we will send an HTTP request to download the raw HTML content of the Wikipedia page. We will then use the Nokogiri library in Ruby to parse the HTML and use CSS/XPath selectors to extract the data we want from the structured content.
The full Ruby code to accomplish this web scraping is provided at the end for reference. We will walk through it section by section to understand the logic and mechanics behind each part.
Prerequisites
Before we dive into the code, let's outline the prerequisites needed to follow along:
Languages: Ruby

Libraries: nokogiri, plus the open-uri, fileutils, and uri modules that ship with Ruby's standard library

Installation: the only third-party library here, nokogiri, can be installed via gem, Ruby's package manager.

For example:

gem install nokogiri
We also want to be in an environment with Ruby set up properly to run code. This could be through tools like rvm or rbenv on your local machine, or an online IDE.
Sending the Request
We start by defining the URL of the page we want to scrape (this particular breeds table lives on Wikimedia Commons, Wikipedia's sister project for media):
url = 'https://commons.wikimedia.org/wiki/List_of_dog_breeds'
Next, we setup a user agent header string:
user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
This simulates a request coming from a Chrome browser on Windows. Websites tend to block scraping requests that are missing user agent headers, so this helps avoid access issues.
We then use Ruby's handy open-uri library to send a GET request to the URL. The user agent header is passed along so the website thinks this is coming from a real browser:
html_content = URI.open(url, 'User-Agent' => user_agent).read
The page HTML content is downloaded and saved into the html_content variable. In a production scraper you would also want to handle potential errors around connectivity, server issues, etc. and retry failed requests; the basic script here leaves that out for brevity.
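If you do want that robustness, a minimal sketch of a retry wrapper might look like the following (the fetch_with_retries helper and its max_retries default are illustrative, not part of the original script):

require 'open-uri'

# Hypothetical helper: retry the request a few times before giving up
def fetch_with_retries(url, user_agent, max_retries = 3)
  attempts = 0
  begin
    URI.open(url, 'User-Agent' => user_agent).read
  rescue OpenURI::HTTPError, SocketError, Errno::ECONNRESET => e
    attempts += 1
    raise if attempts >= max_retries
    sleep(2 * attempts)  # simple backoff between attempts
    retry
  end
end

html_content = fetch_with_retries(url, user_agent)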
Parsing the HTML
Now that we've fetched the raw HTML of the Wikipedia page, we want to parse it so we can extract the data we want.
This is where Nokogiri comes in. Nokogiri allows us to take the HTML and turn it into a parseable DOM structure.
doc = Nokogiri::HTML(html_content)
The doc variable now contains a structured Document Object Model (DOM) representation of the HTML.
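As a quick sanity check that the parse worked, you could print a couple of things from the document before going any further (a throwaway snippet, not part of the final script):

puts doc.title          # the page's <title> text
puts doc.errors.count   # how many parse errors Nokogiri recorded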
Inspecting the page
Using Chrome's inspect tool, you can see that the data lives in a table element with the classes wikitable and sortable.
We can use Nokogiri's methods combined with CSS or XPath selectors to query elements, just like we would in the browser console.
For example, to find the main table element:
table = doc.at('table.wikitable.sortable')
Here we are looking for a table element with the CSS classes wikitable and sortable.
Extracting the Data
Now that we've zoomed into the main table element, we can focus our attention on extracting the data from it. We loop through each table row, skipping the header:

table.search('tr')[1..-1].each do |row|
  columns = row.search('th, td')  # grab this row's cells
  # extraction logic goes here
end
Inside this loop, we dig into each of the row's columns for the data pieces we want.

Breed name:

name = columns[0].at('a').text.strip
Breed group:

group = columns[1].text.strip
Local breed name:

span_tag = columns[2].at('span')
local_name = span_tag ? span_tag.text.strip : ''
And most importantly, the image URL:

img_tag = columns[3].at('img')
photograph = img_tag ? img_tag['src'] : ''
We check if the image and span tags exist before extracting text, as some rows lack this data.

With the image URL, we can then download the photo and save it locally:

if !photograph.empty?
  image_url = URI.join(url, photograph).to_s
  image_filename = File.join('dog_images', "#{name}.jpg")
  File.open(image_filename, 'wb') do |img_file|
    img_file.write(URI.open(image_url, 'User-Agent' => user_agent).read)
  end
end

As we extract, all the data gets stored into arrays to process later. Note that the snippet above has no error handling around the image download, so a single broken link would stop the script; in practice you would want to guard against that.
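For example, wrapping the download in a begin/rescue lets the loop skip a broken image instead of crashing (a sketch; the warning message is just illustrative):

if !photograph.empty?
  begin
    image_url = URI.join(url, photograph).to_s
    image_filename = File.join('dog_images', "#{name}.jpg")
    File.open(image_filename, 'wb') do |img_file|
      img_file.write(URI.open(image_url, 'User-Agent' => user_agent).read)
    end
  rescue OpenURI::HTTPError, SocketError => e
    # Skip this breed's photo but keep scraping the rest of the table
    warn "Could not download image for #{name}: #{e.message}"
  end
end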
Processing Results
Now that we've parsed through the entire table and extracted the data, the arrays contain all the information we wanted about these dog breeds. We can iterate through and print it out:

names.each_index do |i|
  puts "Name: #{names[i]}"
  puts "FCI Group: #{groups[i]}"
  # etc...
end

The data could also be saved to a database, exported to CSV, analyzed further, etc.
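For instance, writing the same arrays out to a CSV file only needs Ruby's built-in csv library (the dog_breeds.csv filename is just an example):

require 'csv'

CSV.open('dog_breeds.csv', 'w') do |csv|
  csv << ['Name', 'FCI Group', 'Local Name', 'Photograph']  # header row
  names.each_index do |i|
    csv << [names[i], groups[i], local_names[i], photographs[i]]
  end
end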
Conclusion

In this article, we walked through a full web scraping script to extract images and information on dog breeds from Wikipedia. We learned how to:

- send an HTTP GET request with a browser-like user-agent header
- parse the raw HTML into a DOM with Nokogiri
- locate the breeds table and loop through its rows
- extract the breed name, group, local name, and image URL from each row
- download each breed photo and save it locally

Full code again here for reference:
require 'open-uri'
require 'nokogiri'
require 'fileutils'
require 'uri'
# URL of the Wikipedia page
url = 'https://commons.wikimedia.org/wiki/List_of_dog_breeds'
# Define a user-agent header to simulate a browser request
user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
# Send an HTTP GET request to the URL with the headers
html_content = URI.open(url, 'User-Agent' => user_agent).read
# Parse the HTML content of the page
doc = Nokogiri::HTML(html_content)
# Find the table with class 'wikitable sortable'
table = doc.at('table.wikitable.sortable')
# Initialize arrays to store the data
names = []
groups = []
local_names = []
photographs = []
# Create a folder to save the images
FileUtils.mkdir_p('dog_images')
# Iterate through rows in the table (skip the header row)
table.search('tr')[1..-1].each do |row|
  columns = row.search('th, td')
  if columns.length == 4
    # Extract data from each column
    name = columns[0].at('a').text.strip
    group = columns[1].text.strip

    # Check if the third column contains a span element
    span_tag = columns[2].at('span')
    local_name = span_tag ? span_tag.text.strip : ''

    # Check for the existence of an image tag within the fourth column
    img_tag = columns[3].at('img')
    photograph = img_tag ? img_tag['src'] : ''

    # Download the image and save it to the folder
    if !photograph.empty?
      image_url = URI.join(url, photograph).to_s
      image_filename = File.join('dog_images', "#{name}.jpg")
      File.open(image_filename, 'wb') do |img_file|
        img_file.write(URI.open(image_url, 'User-Agent' => user_agent).read)
      end
    end

    # Append data to respective arrays
    names << name
    groups << group
    local_names << local_name
    photographs << photograph
  end
end
# Print or process the extracted data as needed
names.each_index do |i|
  puts "Name: #{names[i]}"
  puts "FCI Group: #{groups[i]}"
  puts "Local Name: #{local_names[i]}"
  puts "Photograph: #{photographs[i]}"
  puts
end
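To try it out, save everything above as, say, scrape_dog_breeds.rb (the filename is arbitrary) and run it from a terminal:

ruby scrape_dog_breeds.rb

The breed photos end up in the dog_images folder next to the script, and the metadata prints to the terminal.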
curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"