In this article, we will learn how to use C# and the HtmlAgilityPack library to download all the images from a Wikipedia page.
---
Overview
The goal is to extract the names, breed groups, local names, and image URLs for all dog breeds listed on this Wikipedia page. We will store the image URLs, download the images and save them to a local folder.
Here are the key steps we will cover:
- Add required namespaces
- Send HTTP request to fetch the Wikipedia page
- Parse the page HTML using HtmlAgilityPack
- Find the table with dog breed data
- Iterate through the table rows
- Extract data from each column
- Download images and save locally
- Print/process extracted data
Let's go through each of these steps in detail.
Namespaces
We need the following namespaces (after installing the HtmlAgilityPack package from NuGet, for example with dotnet add package HtmlAgilityPack):
using System.Net;
using System.IO;
using System.Linq;
using System.Collections.Generic;
using HtmlAgilityPack;
Send HTTP Request
To download the web page:
string url = "https://commons.wikimedia.org/wiki/List_of_dog_breeds";
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
request.UserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36";
using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
{
// Parse HTML
}
We set a browser-like user-agent string so Wikipedia serves the full page, and use the HttpWebRequest class to send the request.
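Note that on modern .NET, HttpWebRequest is marked obsolete. If you prefer, here is a minimal sketch of the same fetch using HttpClient; the parsing steps below are unchanged once you have the HTML:
using System.Net.Http;
using (HttpClient client = new HttpClient())
{
    // send a browser-like user-agent, as with HttpWebRequest above
    client.DefaultRequestHeaders.TryAddWithoutValidation("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64)");
    string html = client.GetStringAsync(url).Result; // blocking call, fine for a small console app
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);
}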
Parse HTML
To parse the HTML:
HtmlDocument doc = new HtmlDocument();
doc.Load(response.GetResponseStream());
We load the response stream into an HtmlDocument object. Note that Load accepts a stream, while LoadHtml expects a string of HTML.
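HtmlAgilityPack can also fetch and parse in one step with its HtmlWeb class. A short sketch, using the PreRequest callback to set the user-agent:
HtmlWeb web = new HtmlWeb();
web.PreRequest = request =>
{
    // runs before the request is sent; return true to proceed
    request.UserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)";
    return true;
};
HtmlDocument doc = web.Load(url);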
Find Breed Table
We can use an XPath query to find the table element:
var table = doc.DocumentNode.SelectSingleNode("//table[contains(@class, 'wikitable') and contains(@class, 'sortable')]");
This selects the table node by its CSS classes.
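If a page contains several wikitables, SelectSingleNode simply returns the first match. Here is a small sketch of two ways to be more specific; the caption text below is an assumption about the page, so adjust it to whatever the real table's caption says:
// grab all candidate tables and pick one by index
var tables = doc.DocumentNode.SelectNodes("//table[contains(@class, 'wikitable')]");
var breedTable = tables[0];
// or match on the table's caption text (assumed caption wording)
var byCaption = doc.DocumentNode.SelectSingleNode("//table[caption[contains(., 'dog breeds')]]");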
Iterate Through Rows
We loop through the rows:
foreach (var row in table.SelectNodes(".//tr").Skip(1))
{
    // Extract data
}
We select the tr elements with an XPath query and use Skip(1) to pass over the header row. (Indexing into ChildNodes directly would also pick up whitespace text nodes.)
Extract Column Data
Inside the loop, we get the data from each column:
var cells = row.SelectNodes("td");
if (cells == null || cells.Count < 4) continue; // skip rows without data cells
string name = cells[0].InnerText.Trim();
string group = cells[1].InnerText.Trim();
var localNameNode = cells[2].FirstChild;
string localName = localNameNode != null ? localNameNode.InnerText.Trim() : "";
var imgNode = cells[3].SelectSingleNode(".//img");
string photograph = imgNode != null ? imgNode.GetAttributeValue("src", "") : "";
We use SelectNodes("td") so that only the data cells are indexed, and read each cell's InnerText. The image URL comes from the img element's src attribute; both the local-name and image lookups are null-checked because some rows lack those values.
Download Images
To download and save images:
if (!string.IsNullOrEmpty(photograph))
{
    // Wikipedia src attributes are often protocol-relative ("//upload.wikimedia.org/...")
    if (photograph.StartsWith("//"))
        photograph = "https:" + photograph;
    Directory.CreateDirectory("dog_images");
    using (WebClient client = new WebClient())
    {
        byte[] imageBytes = client.DownloadData(photograph);
        string imagePath = $"dog_images/{name}.jpg";
        File.WriteAllBytes(imagePath, imageBytes);
    }
}
The WebClient downloads the raw image bytes, which we write to a file named after the breed. We prepend "https:" because Wikipedia's src attributes are protocol-relative, and we create the target folder so File.WriteAllBytes does not fail.
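Breed names can contain characters that are invalid in file names (slashes, for instance). Here is a minimal sketch of one way to guard against that; the SafeFileName helper is our own addition, not part of HtmlAgilityPack:
static string SafeFileName(string name)
{
    // replace characters that are invalid in file names with underscores
    foreach (char c in Path.GetInvalidFileNameChars())
        name = name.Replace(c, '_');
    return name;
}
// then build the path with the sanitized name
string imagePath = $"dog_images/{SafeFileName(name)}.jpg";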
Store Extracted Data
We store the extracted data:
names.Add(name);
groups.Add(group);
localNames.Add(localName);
photographs.Add(photograph);
The lists can then be processed as needed.
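For the final print/process step from our outline, a simple loop over the parallel lists does the job:
for (int i = 0; i < names.Count; i++)
{
    Console.WriteLine($"{names[i]} | {groups[i]} | {localNames[i]} | {photographs[i]}");
}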
And that's it! Here is the full code:
// Full code
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Net;
using HtmlAgilityPack;

// Lists to store data
List<string> names = new List<string>();
List<string> groups = new List<string>();
List<string> localNames = new List<string>();
List<string> photographs = new List<string>();

string url = "https://commons.wikimedia.org/wiki/List_of_dog_breeds";

HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
request.UserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36";

// Make sure the output folder exists before any images are written
Directory.CreateDirectory("dog_images");

using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
{
    HtmlDocument doc = new HtmlDocument();
    doc.Load(response.GetResponseStream());

    var table = doc.DocumentNode.SelectSingleNode("//table[contains(@class, 'wikitable') and contains(@class, 'sortable')]");

    // Select the tr elements and skip the header row
    foreach (var row in table.SelectNodes(".//tr").Skip(1))
    {
        var cells = row.SelectNodes("td");
        if (cells == null || cells.Count < 4) continue; // skip rows without data cells

        string name = cells[0].InnerText.Trim();
        string group = cells[1].InnerText.Trim();

        var localNameNode = cells[2].FirstChild;
        string localName = localNameNode != null ? localNameNode.InnerText.Trim() : "";

        var imgNode = cells[3].SelectSingleNode(".//img");
        string photograph = imgNode != null ? imgNode.GetAttributeValue("src", "") : "";

        if (!string.IsNullOrEmpty(photograph))
        {
            // Wikipedia src attributes are often protocol-relative
            if (photograph.StartsWith("//"))
                photograph = "https:" + photograph;

            using (WebClient client = new WebClient())
            {
                byte[] imageBytes = client.DownloadData(photograph);
                string imagePath = $"dog_images/{name}.jpg";
                File.WriteAllBytes(imagePath, imageBytes);
            }
        }

        names.Add(name);
        groups.Add(group);
        localNames.Add(localName);
        photographs.Add(photograph);
    }
}

// Print/process extracted data
for (int i = 0; i < names.Count; i++)
{
    Console.WriteLine($"{names[i]} | {groups[i]} | {localNames[i]} | {photographs[i]}");
}
This provides a complete C# solution using HtmlAgilityPack to scrape data and images from HTML tables. The same technique can be applied to many websites.
While these examples are great for learning, scraping production-level sites can pose challenges like CAPTCHAs, IP blocks, and bot detection. Rotating proxies and automated CAPTCHA solving can help.
Proxies API offers a simple API for rendering pages with built-in proxy rotation, CAPTCHA solving, and evasion of IP blocks. You can fetch rendered pages in any language without configuring browsers or proxies yourself.
This allows scraping at scale without the headaches of IP blocks. Proxies API has a free tier to get started. Check out the API and sign up for an API key to supercharge your web scraping.
With the power of Proxies API combined with C# libraries like HtmlAgilityPack, you can scrape data at scale without getting blocked.