HTML Agility Pack is an HTML parser for .NET. It reads real-world, often malformed HTML into a node tree that you can query with XPath, modify, and extract data from.
Getting Started
Install the NuGet package:
Install-Package HtmlAgilityPack
Load HTML:
Imports HtmlAgilityPack ' goes at the top of the file
Dim html As String = "<html>...</html>"
Dim doc As HtmlDocument = New HtmlDocument()
doc.LoadHtml(html)
Select nodes:
Dim nodes As HtmlNodeCollection = doc.DocumentNode.SelectNodes("//div") ' Nothing if no node matches
Get text:
Dim text As String = doc.DocumentNode.InnerText
Selecting Nodes
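By CSS class (SelectNodes takes XPath only; for real CSS selectors add a package such as Fizzler.Systems.HtmlAgilityPack, which provides QuerySelectorAll):
doc.DocumentNode.SelectNodes("//*[contains(concat(' ', normalize-space(@class), ' '), ' header ')]")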
By XPath:
doc.DocumentNode.SelectNodes("//table")
By tag name:
doc.DocumentNode.Descendants("img")
By id:
doc.GetElementbyId("header")
Lazy enumeration (nodes are yielded as you iterate):
Dim lazyNodes As IEnumerable(Of HtmlNode) = doc.DocumentNode.Descendants("div")
Querying & Extracting
Get attribute:
Dim href As String = node.GetAttributeValue("href", "")
Get text:
Dim text As String = node.InnerText
Get HTML:
Dim html As String = node.OuterHtml
Get parent:
Dim parent As HtmlNode = node.ParentNode
Find ancestors:
Dim ancestors As IEnumerable(Of HtmlNode) = node.Ancestors()
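Evaluate an XPath expression (for values such as counts, go through an XPath navigator):
Dim linkCount As Double = CDbl(doc.DocumentNode.CreateNavigator().Evaluate("count(//a)"))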
Manipulation
Add node:
doc.DocumentNode.AppendChild(HtmlNode.CreateNode("<p>Hello</p>"))
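Update text:
node.InnerHtml = HtmlDocument.HtmlEncode("New text") ' InnerText is read-only
Update HTML:
node.ParentNode.ReplaceChild(HtmlNode.CreateNode("<div>New HTML</div>"), node) ' OuterHtml is read-only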
Remove node:
node.Remove()
Add class:
node.AddClass("blue") ' SetAttributeValue("class", ...) would overwrite any existing classes
Parsing HTML
From string:
doc.LoadHtml(htmlString)
From URL:
doc = New HtmlWeb().Load(url)
From file:
doc.Load(filename)
Auto detect encoding:
doc.Load(filename, True) ' second argument: detectEncodingFromByteOrderMarks
Tips
Example
Dim html As String =
    "<html>" &
    "<body>" &
    "<h1>Title</h1>" &
    "<p>Hello World!</p>" &
    "</body>" &
    "</html>"
Dim doc As HtmlDocument = New HtmlDocument()
doc.LoadHtml(html)
Dim title As String = doc.DocumentNode.SelectSingleNode("//h1").InnerText
' Title
Dim text As String = doc.DocumentNode.SelectSingleNode("//p").InnerText
' Hello World!
Advanced Querying
XPath Axes
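XPath axes select nodes by their relationship to a context node. The following is a minimal sketch, assuming the HtmlAgilityPack package is referenced; the sample markup, class name, and variable names are made up for illustration.

```csharp
using System;
using HtmlAgilityPack;

class AxesExample
{
    static void Main()
    {
        // Hypothetical sample markup, not taken from the original text.
        var doc = new HtmlDocument();
        doc.LoadHtml("<div id='box'><h2>Heading</h2><p>First</p><p>Second</p></div>");

        HtmlNode heading = doc.DocumentNode.SelectSingleNode("//h2");

        // ancestor:: walks up the tree from the context node.
        HtmlNode box = heading.SelectSingleNode("ancestor::div");

        // following-sibling:: selects later nodes that share the same parent.
        HtmlNodeCollection paragraphs = heading.SelectNodes("following-sibling::p");

        // Both calls return null when nothing matches, so guard before use.
        if (box != null && paragraphs != null)
        {
            Console.WriteLine(box.GetAttributeValue("id", ""));
            Console.WriteLine(paragraphs.Count);
        }
    }
}
```

Relative expressions like these are evaluated from the node they are called on, whereas an expression starting with // searches the whole document regardless of the context node.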
Query by Node Type
doc.DocumentNode.SelectNodes("//*[self::p or self::div]")
Predicates
//div[@class='header']
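A predicate like the one above can be passed straight to SelectNodes. This is a minimal sketch, assuming the HtmlAgilityPack package is referenced; the sample markup and names are hypothetical. It also guards against the null that SelectNodes returns when nothing matches.

```csharp
using System;
using HtmlAgilityPack;

class PredicateExample
{
    static void Main()
    {
        // Hypothetical sample markup, not taken from the original text.
        var doc = new HtmlDocument();
        doc.LoadHtml("<div class='header'>Top</div><div class='body'>Main</div>");

        // Attribute predicate: only <div> elements whose class attribute is exactly "header".
        HtmlNodeCollection headers = doc.DocumentNode.SelectNodes("//div[@class='header']");

        // SelectNodes returns null (not an empty collection) when no node matches.
        if (headers != null)
        {
            foreach (HtmlNode div in headers)
            {
                Console.WriteLine(div.InnerText);
            }
        }
    }
}
```

An exact match on @class misses elements that carry several classes; a contains()-based predicate is the usual workaround in that case.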
Advanced Manipulation
Insert Nodes
// refNode must be a direct child of the node you call these on
refNode.ParentNode.InsertBefore(newNode, refNode);
refNode.ParentNode.InsertAfter(newNode, refNode);
Clone Nodes
var clone = node.Clone();
Move & Remove Nodes
node.Remove();
refNode.ParentNode.InsertBefore(node, refNode);
Handling Documents
Loading
doc = new HtmlWeb().Load(url); // HtmlDocument.Load(string) expects a file path, not a URL
doc.LoadHtml(htmlString);
doc.Load(stream);
doc.Load(textReader);
Saving
doc.Save(filename);
Options
doc.OptionOutputAsXml = true;
Working with Fragments
doc.LoadHtml(htmlFragment);
doc.CreateElement("div");
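The two calls above can be combined to build a new element and attach it to a parsed fragment. This is a minimal sketch, assuming the HtmlAgilityPack package is referenced; the fragment contents are hypothetical.

```csharp
using System;
using HtmlAgilityPack;

class FragmentExample
{
    static void Main()
    {
        // LoadHtml accepts a fragment; no <html> or <body> wrapper is required.
        var doc = new HtmlDocument();
        doc.LoadHtml("<p>Existing</p>");

        // CreateElement returns a detached node owned by this document.
        HtmlNode div = doc.CreateElement("div");
        div.InnerHtml = "<span>New</span>";

        // The node only becomes part of the tree once it is attached.
        doc.DocumentNode.AppendChild(div);

        Console.WriteLine(doc.DocumentNode.OuterHtml);
    }
}
```

HtmlNode.CreateNode, used earlier in this sheet, is an alternative when the new content is already available as an HTML string.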
Best Practices
Additional Tips
doc.OptionFixNestedTags = true;
doc.DetectEncoding(stream);
// Consider AngleSharp when you need CSS selectors or a standards-compliant DOM
// HTML Agility Pack supports both .NET Framework and .NET Core