Web scraping is the process of programmatically extracting data from websites. In this tutorial, we'll learn how to use Rust to scrape all images from a website.
Our goal is to:
- Send a request to download a web page
- Parse through the HTML content
- Find and save all the images on that page locally
- Extract and store other data like image names and categories
We'll go through each step required to build a complete web scraper. No prior scraping experience needed!
Setup
Let's briefly go over the initial setup:
extern crate reqwest;
extern crate select;
We import two key Rust crates:
- reqwest: sends HTTP requests and downloads the page and images
- select: parses the HTML and lets us query it with selectors
These provide the essential web scraping capabilities we need.
We also bring in the select types we'll use for parsing, plus standard modules for file I/O:
use select::document::Document;
use select::predicate::{Name, Attr};
use std::fs;
use std::io::Write;
With the imports out of the way, let's get scraping!
Making an HTTP Request
Here is the page we'll be scraping: the list of dog breeds on Wikimedia Commons.
Let's send a simple GET request to download the raw HTML content:
let url = "https://commons.wikimedia.org/wiki/List_of_dog_breeds";
let client = reqwest::blocking::Client::new();
let response = client.get(url)
    .header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36")
    .send()?;
We provide the URL we wish to scrape and create a reqwest::blocking::Client, which we'll reuse for every request.
Key points:
- The blocking client keeps things simple: no async runtime is required
- The browser-like User-Agent header makes the request look like ordinary browser traffic, which some sites expect
- .send()? fires the request and propagates any network error to the caller
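If the target site is slow to respond, the blocking client can also be built with a timeout. A minimal sketch (the 10-second value is just an illustration):
use std::time::Duration;

// Build a blocking client with an overall request timeout (the value is arbitrary).
let client = reqwest::blocking::Client::builder()
    .timeout(Duration::from_secs(10))
    .build()?;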
Let's add a status check:
if response.status().is_success() {
    // Scrape the page
} else {
    eprintln!("Failed with status: {}", response.status());
}
This verifies that the page loaded before scraping.
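Alternatively, reqwest can turn a non-success status into an error for us, which combines nicely with the ? operator. A sketch (the User-Agent value is elided here):
let response = client.get(url)
    .header("User-Agent", "Mozilla/5.0 ...")
    .send()?
    .error_for_status()?;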
Parsing the HTML
Now that we've downloaded the raw HTML content, we can parse through it to extract data.
The select crate's Document type lets us load the HTML and query it with predicates:
let body = response.text()?;
let document = Document::from_read(body.as_bytes())?;
This parses the HTML body into a flexible Document structure that we can search with selectors.
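As a quick sanity check, we can query the parsed document straight away, for example by printing the page title (purely illustrative, the scraper doesn't need it):
if let Some(title) = document.find(Name("title")).next() {
    println!("Page title: {}", title.text());
}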
Time to analyze the page content and locate the target table of dog breeds.
Inspecting the Page
Using Chrome's inspect tool, you can see that the data lives in a table element with the classes wikitable and sortable.
Using the Attr predicate, we can grab the first element whose class attribute matches:
let table = document.find(Attr("class", "wikitable sortable"))
    .next()
    .ok_or("Table not found")?;
Note: similar approaches work for select's other predicates, such as Name and Class, and predicates can be combined, as shown below.
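For instance, here is a sketch that locates the same table by chaining the element name with one of its classes (note the extra Class and Predicate imports):
use select::predicate::{Class, Predicate};

let table = document.find(Name("table").and(Class("wikitable")))
    .next()
    .ok_or("Table not found")?;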
Our scraper now has access to the core content!
Saving Images
Let's build in functionality to save images locally as we scrape them from the table:
fs::create_dir_all("dog_images")?;
This prepares a folder to store images.
Initializing Storage
To store structured data from each row, we initialize some vectors:
let mut names = vec![];
let mut groups = vec![];
let mut local_names = vec![];
let mut photographs = vec![];
They will capture the:
- Breed names
- FCI groups
- Local names
- Photograph URLs
We iterate the table rows to populate them.
Extracting Data
Let's walk through the full data extraction logic:
for row in table.find(Name("tr")).skip(1) {
    let columns = row.find(Name("td")).collect::<Vec<_>>();

    if columns.len() == 4 {
        let name = columns[0].text();
        let group = columns[1].text();

        let span_tag = columns[2].find(Name("span")).next();
        let local_name = span_tag.map(|span| span.text()).unwrap_or_default();

        let img_tag = columns[3].find(Name("img")).next();
        let photograph = img_tag.map(|img| img.attr("src").unwrap_or_default()).unwrap_or_default();

        // ...
This demonstrates:
- Skipping the header row with .skip(1)
- Collecting each row's td cells into a vector
- Drilling into nested elements (span, img) with further find calls
- Falling back to empty strings via unwrap_or_default when a value is missing
We can pull data in various ways thanks to selectors.
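For example, if we also wanted a link to each breed's page, we could pull the href attribute from the anchor in the name column. This is a hypothetical extra field, not something the scraper above stores:
let breed_link = columns[0].find(Name("a"))
    .next()
    .and_then(|a| a.attr("href"))
    .unwrap_or_default();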
Downloading Images
Let's handle downloading and saving images:
if !photograph.is_empty() {
    // Wikimedia image URLs are often protocol-relative ("//upload.wikimedia.org/..."),
    // so add a scheme before requesting them.
    let image_url = if photograph.starts_with("//") {
        format!("https:{}", photograph)
    } else {
        photograph.to_string()
    };

    let image_response = client.get(&image_url).send()?;

    if image_response.status().is_success() {
        let image_filename = format!("dog_images/{}.jpg", name);
        let mut image_file = fs::File::create(image_filename)?;
        image_file.write_all(&image_response.bytes()?)?;
    }
}
For valid images, we:
- Check that the src isn't empty
- Make a separate request to download the image
- Verify the download succeeded
- Construct a file name from the breed name
- Write out the image bytes
And we have a scraper that saves images!
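One caveat: the breed name comes straight from the page text, so it can contain whitespace or characters that are awkward in file names. A small helper like this (a sketch, adjust as needed) makes the names safer:
// Replace path separators and whitespace so the breed name is safe as a file name.
fn safe_filename(name: &str) -> String {
    name.trim()
        .replace('/', "-")
        .replace(char::is_whitespace, "_")
}
It would then be used as format!("dog_images/{}.jpg", safe_filename(&name)).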
Storing Extracted Data
Finally, let's store the extracted data:
names.push(name);
groups.push(group);
local_names.push(local_name);
photographs.push(photograph);
We populate each vector accordingly to capture key details.
And can print or process the metadata further:
for i in 0..names.len() {
    println!("Name: {}", names[i]);
    // ...
}
There we have it - a complete web scraper!
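As a possible refinement, the four parallel vectors could be replaced by a single vector of structs so each breed's fields stay together. A sketch of what that might look like:
// One record per table row instead of four parallel vectors.
struct Breed {
    name: String,
    group: String,
    local_name: String,
    photograph: String,
}

let mut breeds: Vec<Breed> = Vec::new();
// Inside the row loop:
// breeds.push(Breed { name, group, local_name, photograph: photograph.to_string() });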
Handling Errors
We use Rust's ? operator throughout, so any failure is propagated up from main, which returns Result<(), Box<dyn std::error::Error>>:
let response = client.get(url)
    .send()?;
This surfaces errors clearly. Some common ones:
- Network failures or timeouts when sending a request
- A non-success HTTP status returned by the server
- The table not being found if the page layout changes
- File I/O errors when creating the folder or writing image files
Print descriptive errors before scraping to identify issues.
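If a request fails transiently (a timeout or dropped connection), a simple retry loop can help. A minimal sketch, with an arbitrary retry count and delay:
use std::thread;
use std::time::Duration;

let mut response = None;
for attempt in 1..=3 {
    match client.get(url).send() {
        Ok(resp) => {
            response = Some(resp);
            break;
        }
        Err(err) => {
            eprintln!("Attempt {} failed: {}", attempt, err);
            thread::sleep(Duration::from_secs(2));
        }
    }
}
let response = response.ok_or("All attempts failed")?;
Putting all of the pieces together, here is the complete program: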
extern crate reqwest;
extern crate select;
use std::fs;
use std::io::Write;
use select::document::Document;
use select::predicate::{Name, Attr};
fn main() -> Result<(), Box<dyn std::error::Error>> {
    // URL of the Wikimedia Commons page listing dog breeds
    let url = "https://commons.wikimedia.org/wiki/List_of_dog_breeds";

    // Send an HTTP GET request to the URL with a browser-like User-Agent
    let client = reqwest::blocking::Client::new();
    let response = client.get(url)
        .header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36")
        .send()?;

    // Check if the request was successful (2xx status code)
    if response.status().is_success() {
        // Parse the HTML content of the page
        let body = response.text()?;
        let document = Document::from_read(body.as_bytes())?;

        // Find the table with class 'wikitable sortable'
        let table = document.find(Attr("class", "wikitable sortable"))
            .next()
            .ok_or("Table not found")?;

        // Create a folder to save the images
        fs::create_dir_all("dog_images")?;

        // Initialize vectors to store the data
        let mut names = vec![];
        let mut groups = vec![];
        let mut local_names = vec![];
        let mut photographs = vec![];

        // Iterate through rows in the table (skip the header row)
        for row in table.find(Name("tr")).skip(1) {
            // Extract data from each column
            let columns = row.find(Name("td")).collect::<Vec<_>>();
            if columns.len() == 4 {
                let name = columns[0].text();
                let group = columns[1].text();

                // The third column holds the local name inside a span element
                let span_tag = columns[2].find(Name("span")).next();
                let local_name = span_tag.map(|span| span.text()).unwrap_or_default();

                // Check for the existence of an image tag within the fourth column
                let img_tag = columns[3].find(Name("img")).next();
                let photograph = img_tag.map(|img| img.attr("src").unwrap_or_default()).unwrap_or_default();

                // Download the image and save it to the folder
                if !photograph.is_empty() {
                    // Wikimedia image URLs are often protocol-relative, so add a scheme if needed
                    let image_url = if photograph.starts_with("//") {
                        format!("https:{}", photograph)
                    } else {
                        photograph.to_string()
                    };
                    let image_response = client.get(&image_url).send()?;
                    if image_response.status().is_success() {
                        let image_filename = format!("dog_images/{}.jpg", name);
                        let mut image_file = fs::File::create(image_filename)?;
                        image_file.write_all(&image_response.bytes()?)?;
                    }
                }

                // Append data to respective vectors
                names.push(name);
                groups.push(group);
                local_names.push(local_name);
                photographs.push(photograph);
            }
        }

        // Print or process the extracted data as needed
        for i in 0..names.len() {
            println!("Name: {}", names[i]);
            println!("FCI Group: {}", groups[i]);
            println!("Local Name: {}", local_names[i]);
            println!("Photograph: {}", photographs[i]);
            println!();
        }
    } else {
        eprintln!("Failed to retrieve the web page. Status code: {}", response.status());
    }

    Ok(())
}
Wrap Up
In this step-by-step tutorial, we:
- Learned how to send requests and parse HTML
- Extracted data like image URLs and metadata
- Downloaded and saved images locally
- Stored extracted content in structured vectors
We covered everything needed to build a robust web scraper with Rust!
Some ideas for next steps:
In more advanced implementations, you will even need to rotate the User-Agent string so the website can't tell it's the same browser making every request.
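A simple way to do that without extra crates is to cycle through a small list of User-Agent strings, one per request. A sketch (the page URLs and the shortened User-Agent strings are placeholders):
// Hypothetical list of pages to visit.
let pages = ["https://example.com/1", "https://example.com/2", "https://example.com/3"];

// A few browser-like User-Agent strings (shortened here for readability).
let user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ...",
    "Mozilla/5.0 (X11; Linux x86_64) ...",
];

for (i, page_url) in pages.iter().enumerate() {
    // Rotate through the list so consecutive requests present different strings.
    let ua = user_agents[i % user_agents.len()];
    let response = client.get(*page_url).header("User-Agent", ua).send()?;
    // ... scrape the page ...
}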
Get a little more advanced, though, and you will realize that the server can simply block your IP address, ignoring all your other tricks. This is a bummer, and it is where most web crawling projects fail.
Overcoming IP Blocks
Investing in a private rotating proxy service like Proxies API can, most of the time, make the difference between a successful, headache-free web scraping project that gets the job done consistently and one that never really works.
Plus, with the 1000 free API calls we are currently offering, you have almost nothing to lose by using our rotating proxy and comparing notes. It takes only one line of integration, so it's hardly disruptive.
Our rotating proxy server Proxies API provides a simple API that can solve all IP Blocking problems instantly.
Hundreds of our customers have successfully solved the headache of IP blocks with a simple API.
The whole thing can be accessed by a simple API like below in any programming language.
curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"
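In Rust, the same call is just another reqwest request through the client we already have (API_KEY and the target URL are placeholders):
let target = "https://example.com";
let api_url = format!("http://api.proxiesapi.com/?key={}&url={}", "API_KEY", target);
let html = client.get(&api_url).send()?.text()?;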
We have a running offer of 1000 API calls completely free. Register and get your free API Key here.