Here is a step-by-step guide to scraping a website for images using Elixir. This article will explain the code for scraping dog breed information and images from a Wikipedia page, to help beginners understand the key concepts.
This is the page we are talking about: the list of dog breeds at https://commons.wikimedia.org/wiki/List_of_dog_breeds
Overview
The goal of this scraper is to extract dog breed names, details like FCI groups and local names, and photographs from a Wikipedia page listing hundreds of breeds.
It will:
- Retrieve the web page content
- Parse the page to extract information
- Download all images of dog breeds
- Save images and print extracted data
The code uses the Elixir programming language along with two libraries: Erlang's built-in :httpc HTTP client and the Floki HTML parser.
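Floki is the only third-party dependency, since :httpc ships with Erlang/OTP. If you are working inside a Mix project, the dependency would be declared something like this (a sketch; the exact version constraint is an assumption):

defp deps do
  [
    # HTML parser used to query the page with CSS selectors
    {:floki, "~> 0.36"}
  ]
end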
Retrieving the Web Page
The first step is to retrieve the content of the web page that contains the data we want to scrape.
The get_page/2 function handles this:
defp get_page(url, headers) do
  case :httpc.request(:get, {String.to_charlist(url), headers}, [], body_format: :binary) do
    {:ok, {{_, 200, _}, _, body}} ->
      {:ok, body}

    {:ok, {{_, status_code, _}, _, _}} ->
      {:error, status_code}

    {:error, reason} ->
      {:error, reason}
  end
end
This makes the request, checks the status code, and if a 200 OK response is received, returns the page body.
The headers contain a user agent string to identify the scraper to the server.
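One gotcha worth noting: :httpc is an Erlang library, so the documented types for the URL and header fields are charlists rather than Elixir strings, and by default it returns the response body as a charlist too. That is why the snippet converts with String.to_charlist/1 and passes body_format: :binary. A minimal sketch of a correctly-typed request (the User-Agent value here is just a placeholder):

headers = [{~c"User-Agent", ~c"MyScraper/1.0"}]
:httpc.request(:get, {~c"https://example.com", headers}, [], body_format: :binary)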
The start function calls this getter, handling any errors:
case get_page(@url, headers) do
  {:ok, body} ->
    # hand the HTML off to the parser (see parse_page/1 below)
    parse_page(body)

  {:error, reason} ->
    IO.puts("Failed to retrieve the web page. Status code: #{reason}")
end
So at this point, if the request succeeded, the raw HTML of the page is in body, ready for parsing.
Parsing the Page
Inspecting the Page
Using Chrome's inspect tool, you can see that the data lives in a table element with the classes wikitable and sortable.
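After parsing, Floki represents the document as nested {tag, attributes, children} tuples, so the breeds table looks roughly like this (a simplified sketch, not the exact markup):

{"table", [{"class", "wikitable sortable"}],
 [
   {"tr", [], [{"th", [], ["Breed"]}, {"th", [], ["FCI Group"]}, ...]},
   {"tr", [], [{"td", [], [...]}, {"td", [], [...]}, ...]}
 ]}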
Selecting the Table
We use the Floki library's find/2 function with a CSS selector to locate it:
table = Floki.find(document, "table.wikitable.sortable")
The selector matches the table carrying both the wikitable and sortable classes, and Floki.find/2 returns the matching element(s) from the parsed document.
Iterating Through Rows
Inside the table, data is organized in rows, with each row containing information about a specific dog breed. We use a loop to iterate through these rows and extract the relevant data:
for row <- tl(Floki.find(table, "tr")) do
  # Extract data from the row
end
The tl/1 call drops the first row, which contains the table headers rather than breed data.
Extracting Data from Columns
Within each row, data is stored in columns. We use Floki.find/2 again to select the columns and pull out each field:
columns = Floki.find(row, "td,th")

name = columns |> hd() |> Floki.find("a") |> Floki.text() |> String.trim()
group = columns |> Enum.at(1) |> Floki.text() |> String.trim()

local_name =
  case Floki.find(Enum.at(columns, 2), "span") do
    [] -> ""
    spans -> spans |> Floki.text() |> String.trim()
  end

photograph =
  case Floki.find(Enum.at(columns, 3), "img") do
    [] -> ""
    imgs -> imgs |> Floki.attribute("src") |> List.first("")
  end
Here's what each extraction step does:
- name: the breed name, taken from the link text in the first column
- group: the FCI group text from the second column
- local_name: the breed's local name from the span in the third column, or an empty string if there is none
- photograph: the image's src attribute from the fourth column, or an empty string if there is no image
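For a typical row, the extracted values end up looking something like this (the values are illustrative, not taken from the live page):

name        #=> "Affenpinscher"
group       #=> "2"
local_name  #=> "Affenpinscher"
photograph  #=> "//upload.wikimedia.org/wikipedia/commons/thumb/..."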
Downloading Images
After extracting image sources, we can download the actual image data:
defp download_image(photograph, name) do
  # Wikipedia serves protocol-relative image URLs ("//upload.wikimedia.org/...")
  url = if String.starts_with?(photograph, "//"), do: "https:" <> photograph, else: photograph

  case get_image(url) do
    {:ok, image_data} ->
      image_filename = "dog_images/#{name}.jpg"
      File.write(image_filename, image_data)

    _ ->
      IO.puts("Failed to download image: #{photograph}")
  end
end

defp get_image(url) do
  case :httpc.request(:get, {String.to_charlist(url), []}, [], body_format: :binary) do
    {:ok, {{_, 200, _}, _, body}} ->
      {:ok, body}

    _ ->
      {:error, "Failed to download image"}
  end
end
We reuse Erlang's built-in :httpc client to fetch each image by URL.
If successful, we write the image binary data to a file using the breed's name and the File module.
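One caveat: breed names can contain spaces or slashes, which make awkward or even invalid file names. A small helper like the following (hypothetical, not part of the original code) could sanitize names before writing:

defp safe_filename(name) do
  # Replace anything that is not a letter, digit, dash, or underscore
  String.replace(name, ~r/[^A-Za-z0-9_-]+/, "_")
end

image_filename = "dog_images/#{safe_filename(name)}.jpg"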
The get_image/1 helper returns {:ok, body} on a 200 response and an error tuple otherwise, so a failed download is reported without crashing the scraper.
Saving and Printing Output
Finally, the scraper creates a dog_images directory, saves every downloaded image into it, and prints the extracted data for each breed.
The full code can be seen below, showing how these pieces fit together into a complete scraper:
defmodule DogBreedsScraper do
  @url "https://commons.wikimedia.org/wiki/List_of_dog_breeds"

  def start do
    # :httpc ships with OTP, but its applications must be started first
    :inets.start()
    :ssl.start()

    headers = [
      {~c"User-Agent",
       ~c"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"}
    ]
    case get_page(@url, headers) do
      {:ok, body} ->
        case parse_page(body) do
          {:ok, data} ->
            save_images(data)
            print_data(data)

          {:error, reason} ->
            IO.puts("Failed to parse the page: #{reason}")
        end

      {:error, reason} ->
        IO.puts("Failed to retrieve the web page. Status code: #{reason}")
    end
  end
  defp get_page(url, headers) do
    case :httpc.request(:get, {String.to_charlist(url), headers}, [], body_format: :binary) do
      {:ok, {{_, 200, _}, _, body}} ->
        {:ok, body}

      {:ok, {{_, status_code, _}, _, _}} ->
        {:error, status_code}

      {:error, reason} ->
        {:error, reason}
    end
  end
  defp parse_page(body) do
    case Floki.parse_document(body) do
      {:ok, document} ->
        table = Floki.find(document, "table.wikitable.sortable")

        # Build one {name, group, local_name, photograph} tuple per data row
        rows =
          for row <- tl(Floki.find(table, "tr")),
              columns = Floki.find(row, "td,th"),
              length(columns) == 4 do
            name = columns |> hd() |> Floki.find("a") |> Floki.text() |> String.trim()
            group = columns |> Enum.at(1) |> Floki.text() |> String.trim()

            local_name =
              case Floki.find(Enum.at(columns, 2), "span") do
                [] -> ""
                spans -> spans |> Floki.text() |> String.trim()
              end

            photograph =
              case Floki.find(Enum.at(columns, 3), "img") do
                [] -> ""
                imgs -> imgs |> Floki.attribute("src") |> List.first("")
              end

            {name, group, local_name, photograph}
          end

        names = Enum.map(rows, &elem(&1, 0))
        groups = Enum.map(rows, &elem(&1, 1))
        local_names = Enum.map(rows, &elem(&1, 2))
        photographs = Enum.map(rows, &elem(&1, 3))

        {:ok, {names, groups, local_names, photographs}}

      _ ->
        {:error, "Failed to parse the page"}
    end
  end
  defp download_image(photograph, name) do
    # Wikipedia serves protocol-relative image URLs ("//upload.wikimedia.org/...")
    url = if String.starts_with?(photograph, "//"), do: "https:" <> photograph, else: photograph

    case get_image(url) do
      {:ok, image_data} ->
        image_filename = "dog_images/#{name}.jpg"
        File.write(image_filename, image_data)

      _ ->
        IO.puts("Failed to download image: #{photograph}")
    end
  end
  defp get_image(url) do
    case :httpc.request(:get, {String.to_charlist(url), []}, [], body_format: :binary) do
      {:ok, {{_, 200, _}, _, body}} ->
        {:ok, body}

      _ ->
        {:error, "Failed to download image"}
    end
  end
  defp save_images({names, _groups, _local_names, photographs}) do
    File.mkdir_p("dog_images")

    Enum.zip(names, photographs)
    |> Enum.reject(fn {_name, photograph} -> photograph == "" end)
    |> Enum.each(fn {name, photograph} -> download_image(photograph, name) end)
  end
  defp print_data({names, groups, local_names, photographs}) do
    [names, groups, local_names, photographs]
    |> Enum.zip()
    |> Enum.each(fn {name, group, local_name, photograph} ->
      IO.puts("Name: #{name}")
      IO.puts("FCI Group: #{group}")
      IO.puts("Local Name: #{local_name}")
      IO.puts("Photograph: #{photograph}")
      IO.puts("")
    end)
  end
end
# Start the scraping process
DogBreedsScraper.start()
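To try the module as a standalone script without a Mix project, you could add Mix.install at the top of the file so Elixir fetches Floki on first run (a sketch; the version constraint is an assumption), then run it with elixir scraper.exs:

# scraper.exs
Mix.install([{:floki, "~> 0.36"}])

# ... DogBreedsScraper module from above ...

DogBreedsScraper.start()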
In more advanced implementations, you may even need to rotate the User-Agent string so the website can't tell it's the same client making repeated requests!
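A minimal sketch of that idea: keep a pool of User-Agent strings and pick one at random for each request (the strings below are truncated placeholders):

@user_agents [
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ...",
  "Mozilla/5.0 (X11; Linux x86_64) ..."
]

defp random_headers do
  [{~c"User-Agent", @user_agents |> Enum.random() |> String.to_charlist()}]
end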
Get a little further in, and you will realize that the server can simply block your IP address, ignoring all your other tricks. This is a bummer, and it is where most web crawling projects fail.
Overcoming IP Blocks
Investing in a private rotating proxy service like Proxies API is often the difference between a headache-free web scraping project that gets the job done consistently and one that never really works.
Plus, with our current offer of 1000 free API calls, you have almost nothing to lose by trying our rotating proxy and comparing notes. It takes only one line of integration, so it's hardly disruptive.
Our rotating proxy server, Proxies API, provides a simple API that can solve all IP-blocking problems instantly.
Hundreds of our customers have successfully solved the headache of IP blocks with a simple API.
The whole thing can be accessed through a simple API call, like the one below, from any programming language:
curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"
We have a running offer of 1000 API calls completely free. Register and get your free API Key here.