Web scraping lets you extract data from websites programmatically. Often you need to scrape several pages of a site to gather complete information. In this article, we'll see how to scrape multiple pages in C# using the HtmlAgilityPack library.
Prerequisites
To follow along, you'll need the HtmlAgilityPack NuGet package, which you can install with the .NET CLI:
dotnet add package HtmlAgilityPack
Import Libraries
We'll need the following namespaces:
using System;
using System.Collections.Generic;
using System.Net.Http;
using HtmlAgilityPack;
Define Base URL
We'll scrape a blog whose listing pages follow a numbered URL pattern:
<https://copyblogger.com/blog/>
<https://copyblogger.com/blog/page/2/>
<https://copyblogger.com/blog/page/3/>
Let's define the base URL pattern:
string baseUrl = "https://copyblogger.com/blog/page/{0}/";
The {0} placeholder is replaced with the page number when we build each URL.
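The URL list above shows the first page at the bare /blog/ address, while later pages sit under /page/N/. The sketch below builds the URL for a given page with that in mind. BuildPageUrl is a hypothetical helper name, and whether /page/1/ also resolves on the real site is not something this example assumes.

```csharp
using System;

// Hypothetical helper: page 1 uses the bare listing URL, later pages use /page/N/.
static string BuildPageUrl(int pageNum)
{
    const string firstPageUrl = "https://copyblogger.com/blog/";
    const string pagePattern = "https://copyblogger.com/blog/page/{0}/";

    if (pageNum < 1)
    {
        throw new ArgumentOutOfRangeException(nameof(pageNum), "Page numbers start at 1.");
    }

    return pageNum == 1 ? firstPageUrl : string.Format(pagePattern, pageNum);
}

// Prints the three URLs listed above, one per line.
for (int pageNum = 1; pageNum <= 3; pageNum++)
{
    Console.WriteLine(BuildPageUrl(pageNum));
}
```

Keeping the URL logic in one small function makes it easy to adjust if the site's pagination scheme turns out to differ.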
Specify Number of Pages
Next, we'll specify how many pages to scrape. Let's scrape the first 5 pages:
int numPages = 5;
Loop Through Pages
We can now loop from 1 to numPages:
for (int pageNum = 1; pageNum <= numPages; pageNum++)
{
    // Construct page URL
    string url = string.Format(baseUrl, pageNum);
    // Code to scrape each page
}
Send Request and Check Response
Inside the loop, we'll use HttpClient to send a GET request to the page URL:
HttpClient client = new HttpClient();
HttpResponseMessage response = await client.GetAsync(url);
if (response.IsSuccessStatusCode)
{
    // Scrape page
}
else
{
    Console.WriteLine("Failed to retrieve page " + pageNum);
}
We check for a success status code to ensure the request succeeded.
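A status check only covers responses that actually arrive. Network failures and timeouts surface as exceptions instead. As a rough sketch of a more defensive fetch, the example below reuses one HttpClient for all requests (creating a new client per request can exhaust sockets under load) and returns null on any failure. FetchPageAsync, the 30-second timeout, and the User-Agent string are illustrative assumptions, not part of the original walkthrough.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

// One client for the whole run; the timeout and User-Agent values are arbitrary examples.
var client = new HttpClient { Timeout = TimeSpan.FromSeconds(30) };
client.DefaultRequestHeaders.UserAgent.ParseAdd("MultiPageScraperExample/1.0");

// Hypothetical helper: returns the page body, or null when the request fails.
async Task<string?> FetchPageAsync(string url)
{
    try
    {
        using HttpResponseMessage response = await client.GetAsync(url);
        if (!response.IsSuccessStatusCode)
        {
            Console.WriteLine($"Request to {url} returned status {(int)response.StatusCode}");
            return null;
        }
        return await response.Content.ReadAsStringAsync();
    }
    catch (HttpRequestException ex)
    {
        // DNS failures, refused connections, and similar network errors.
        Console.WriteLine($"Request to {url} failed: {ex.Message}");
        return null;
    }
    catch (TaskCanceledException)
    {
        // HttpClient signals a timeout by cancelling the request task.
        Console.WriteLine($"Request to {url} timed out");
        return null;
    }
}

string? html = await FetchPageAsync("https://copyblogger.com/blog/");
Console.WriteLine(html is null ? "No content" : $"Fetched {html.Length} characters");
```

Returning null rather than throwing lets the calling loop decide whether to skip the page, retry, or stop.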
Parse HTML
If successful, we can parse the HTML using HtmlAgilityPack:
HtmlDocument htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(await response.Content.ReadAsStringAsync());
This gives us a DOM document to extract data from.
Extract Data
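Now within the loop we can use XPath expressions to select the elements we want.
For example, to get all article elements:
var articles = htmlDoc.DocumentNode.SelectNodes("//article");
We can loop through the matched nodes and extract the data we need from each one. Note that SelectNodes returns null, not an empty collection, when nothing matches.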
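To make the extraction step concrete, here is a minimal sketch that runs against an inline HTML string rather than the live site. The sample markup, the ExtractPosts helper, and its class names are assumptions for illustration; the real page structure may differ, so inspect it before settling on selectors.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper: collects (title, link) pairs from every <article> that has a title link.
static List<(string Title, string Url)> ExtractPosts(string html)
{
    var posts = new List<(string Title, string Url)>();
    var htmlDoc = new HtmlDocument();
    htmlDoc.LoadHtml(html);

    // SelectNodes returns null when nothing matches, so guard before iterating.
    var articles = htmlDoc.DocumentNode.SelectNodes("//article");
    if (articles == null)
    {
        return posts;
    }

    foreach (HtmlNode article in articles)
    {
        // The leading ".//" keeps the search inside the current article node.
        var link = article.SelectSingleNode(".//a[@class='entry-title-link']");
        if (link == null)
        {
            continue; // Skip articles without a title link.
        }

        string title = HtmlEntity.DeEntitize(link.InnerText).Trim();
        string href = link.GetAttributeValue("href", "");
        posts.Add((title, href));
    }

    return posts;
}

// Made-up markup standing in for a listing page.
string sampleHtml = @"
<article>
  <h2 class='entry-title'><a class='entry-title-link' href='https://example.com/first-post/'>First &amp; Best Post</a></h2>
</article>
<article><p>No title link here</p></article>";

foreach (var (title, postUrl) in ExtractPosts(sampleHtml))
{
    Console.WriteLine($"{title} -> {postUrl}");
}
```

Testing selectors against a saved or hand-written snippet like this is a quick way to catch null results before running the scraper against real pages.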
Full Code
Our full code to scrape 5 pages is:
using System;
using System.Collections.Generic;
using System.Net.Http;
using HtmlAgilityPack;

string baseUrl = "https://copyblogger.com/blog/page/{0}/";
int numPages = 5;

// Create one HttpClient and reuse it for every request
HttpClient client = new HttpClient();

for (int pageNum = 1; pageNum <= numPages; pageNum++)
{
    string url = string.Format(baseUrl, pageNum);
    HttpResponseMessage response = await client.GetAsync(url);
    if (response.IsSuccessStatusCode)
    {
        HtmlDocument htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(await response.Content.ReadAsStringAsync());

        // SelectNodes returns null when nothing matches
        var articles = htmlDoc.DocumentNode.SelectNodes("//article");
        if (articles == null)
        {
            Console.WriteLine("No articles found on page " + pageNum);
            continue;
        }

        foreach (var article in articles)
        {
            // Extract data from article; each lookup may return null
            var titleNode = article.SelectSingleNode("./h2[@class='entry-title']");
            var linkNode = article.SelectSingleNode("./a[@class='entry-title-link']");
            var authorNode = article.SelectSingleNode("./div[@class='post-author']/a");
            var categoryNodes = article.SelectNodes("./div[@class='entry-categories']/a");

            string title = titleNode?.InnerText.Trim() ?? "";
            // Named articleUrl because url is already declared in the enclosing loop
            string articleUrl = linkNode?.GetAttributeValue("href", "") ?? "";
            string author = authorNode?.InnerText.Trim() ?? "";

            List<string> categories = new List<string>();
            if (categoryNodes != null)
            {
                foreach (var node in categoryNodes)
                {
                    categories.Add(node.InnerText.Trim());
                }
            }

            // Print extracted data
            Console.WriteLine("Title: " + title);
            Console.WriteLine("URL: " + articleUrl);
            Console.WriteLine("Author: " + author);
            Console.WriteLine("Categories: " + string.Join(", ", categories));
            Console.WriteLine();
        }
    }
    else
    {
        Console.WriteLine("Failed to retrieve page " + pageNum);
    }
}
This allows us to scrape and extract data from multiple pages sequentially. The full code can be extended to scrape any number of pages.
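One way to extend this to an unknown number of pages is to keep requesting until a page fails, with a pause between requests to avoid hammering the server. The sketch below shows that loop in isolation. CrawlPagesAsync, FakeFetch, and the delay value are hypothetical, and the fetcher is a stand-in so the example runs without network access; in a real scraper it would wrap the HttpClient call.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical helper: fetches pages 1, 2, 3, ... until fetchPage returns null
// or maxPages is reached, pausing between requests.
static async Task<List<string>> CrawlPagesAsync(
    Func<int, Task<string?>> fetchPage, int maxPages, TimeSpan delay)
{
    var pages = new List<string>();
    for (int pageNum = 1; pageNum <= maxPages; pageNum++)
    {
        string? html = await fetchPage(pageNum);
        if (html is null)
        {
            break; // Treat a failed page as the end of the listing.
        }

        pages.Add(html);

        if (pageNum < maxPages)
        {
            await Task.Delay(delay); // Be polite between requests.
        }
    }
    return pages;
}

// Stand-in fetcher that pretends the site has exactly three pages.
Task<string?> FakeFetch(int pageNum) =>
    Task.FromResult<string?>(pageNum <= 3 ? $"<html>page {pageNum}</html>" : null);

List<string> result = await CrawlPagesAsync(
    FakeFetch, maxPages: 10, delay: TimeSpan.FromMilliseconds(100));
Console.WriteLine($"Collected {result.Count} pages"); // prints "Collected 3 pages"
```

The maxPages cap is a safety net: if the site keeps returning success for out-of-range page numbers, the loop still terminates.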
Summary
Web scraping enables collecting large datasets programmatically. With the techniques here, you can scrape and extract information from multiple pages of a website in C#.
While these examples are great for learning, scraping production-level sites can pose challenges like CAPTCHAs, IP blocks, and bot detection. Rotating proxies and automated CAPTCHA solving can help.
Proxies API offers a simple API for rendering pages with built-in proxy rotation, CAPTCHA solving, and evasion of IP blocks. You can fetch rendered pages in any language without configuring browsers or proxies yourself.
This allows scraping at scale without the headaches of IP blocks. Proxies API has a free tier to get started. Check out the API and sign up for an API key to supercharge your web scraping.
With Proxies API combined with C# libraries like HtmlAgilityPack, you can scrape data at scale without getting blocked.