HtmlAgilityPack is a .NET library for parsing and manipulating HTML documents, including malformed ones. This cheat sheet collects the operations you are most likely to need when working with it.
Installation
PM> Install-Package HtmlAgilityPack
Loading HTML
From string:
var doc = new HtmlDocument();
doc.LoadHtml("<html>...</html>");
From file:
doc.Load("page.html");
From stream:
using (var fs = File.OpenRead("page.html")) {
    doc.Load(fs);
}
From web (HtmlDocument.Load takes a file path, not a URL; use HtmlWeb):
doc = new HtmlWeb().Load("http://example.com");
Custom options:
doc.OptionFixNestedTags = true;
Helper method:
private static HtmlDocument LoadHtml(string html) {
    var doc = new HtmlDocument();
    doc.LoadHtml(html);
    return doc;
}
Selecting Nodes
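By XPath, all paragraphs (HtmlAgilityPack has no built-in CSS selector support; that requires an add-on package such as Fizzler):
var paras = doc.DocumentNode
    .SelectNodes("//p");
By XPath, nested path:
// SelectNodes returns null, not an empty collection, when nothing matches
var items = doc.DocumentNode
    .SelectNodes("//ul/li");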
Get single element by ID:
var content = doc.GetElementbyId("content");
Get elements by tag name:
var divs = doc.DocumentNode.Descendants("div");
Evaluate an XPath expression that returns a value:
var xpath = "count(//div/p)";
var count = (double)doc.DocumentNode.CreateNavigator().Evaluate(xpath);
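One detail of the selection calls above is worth a small sketch: SelectNodes returns null rather than an empty collection when nothing matches, so looping over the result directly can throw. The helper name SelectOrEmpty below is hypothetical and not part of the library.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class SelectionHelpers
{
    // Hypothetical helper (not part of HtmlAgilityPack): wraps SelectNodes so
    // callers get an empty sequence instead of null when nothing matches.
    public static IEnumerable<HtmlNode> SelectOrEmpty(HtmlNode root, string xpath)
    {
        HtmlNodeCollection found = root.SelectNodes(xpath);
        if (found == null)
        {
            return Enumerable.Empty<HtmlNode>();
        }
        return found;
    }
}
```

With this in place, a foreach over SelectionHelpers.SelectOrEmpty(doc.DocumentNode, "//table") simply runs zero times on a page without tables.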
Looping Nodes
For each loop:
foreach (var item in items) {
    // ...
}
For loop:
for (int i = 0; i < items.Count; i++) {
    var item = items[i];
}
While loop:
int i = 0;
while (i < items.Count) {
    var item = items[i++];
    // ...
}
Modifying Nodes
Get attribute value:
var cls = el.GetAttributeValue("class", null);
Set attribute value:
el.SetAttributeValue("class", "blue");
Get inner text:
var text = el.InnerText;
Set inner text (InnerText is read-only, so assign encoded text through InnerHtml):
el.InnerHtml = HtmlDocument.HtmlEncode("Hello World");
Get inner HTML:
var html = el.InnerHtml;
Set inner HTML:
el.InnerHtml = "<strong>Hello</strong>";
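As a rough end-to-end sketch of the attribute calls above, the following method sets the same class on every paragraph of a document and returns the modified markup. The method name AddClassToParagraphs and the class value are illustrative assumptions.

```csharp
using HtmlAgilityPack;

public static class ModifyExample
{
    // Illustrative helper (hypothetical name): sets class="blue" on every <p>.
    public static string AddClassToParagraphs(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when the document has no <p> elements.
        var paragraphs = doc.DocumentNode.SelectNodes("//p");
        if (paragraphs != null)
        {
            foreach (var p in paragraphs)
            {
                // Overwrites any existing class attribute on the element.
                p.SetAttributeValue("class", "blue");
            }
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```

Note that SetAttributeValue replaces the whole attribute value; a paragraph that already had a class loses it in this sketch.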
Creating Nodes
Create element:
var el = doc.CreateElement("p");
Create text node:
var text = doc.CreateTextNode("Hello");
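Create a container for several nodes (there is no CreateDocumentFragment; a second HtmlDocument serves the purpose):
var frag = new HtmlDocument();
frag.LoadHtml("<b>Hi!</b> <i>there</i>");
Create a single node from HTML:
var node = HtmlNode.CreateNode("<b>Hi!</b>");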
Inserting Nodes
Append child element:
parent.AppendChild(el);
Insert before element:
parent.InsertBefore(newEl, el);
Insert after element:
parent.InsertAfter(newEl, el);
Prepend child element:
parent.PrependChild(el);
Insert HTML next to an element (there is no InsertAdjacentHtml; parse the markup, then insert the node):
el.ParentNode.InsertBefore(HtmlNode.CreateNode("<p>Hello</p>"), el);
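The creation and insertion calls combine naturally when building markup from data. The sketch below appends a list to a parent node; the names ListBuilder and AppendList are hypothetical, and the labels are encoded with the .NET WebUtility class because CreateTextNode takes raw HTML text.

```csharp
using System.Collections.Generic;
using System.Net;
using HtmlAgilityPack;

public static class ListBuilder
{
    // Hypothetical helper: builds <ul><li>label</li>...</ul> and appends it to parent.
    public static HtmlNode AppendList(HtmlDocument doc, HtmlNode parent, IEnumerable<string> labels)
    {
        var list = doc.CreateElement("ul");
        foreach (var label in labels)
        {
            var item = doc.CreateElement("li");
            // Encode the label so characters like < and & stay literal text.
            item.AppendChild(doc.CreateTextNode(WebUtility.HtmlEncode(label)));
            list.AppendChild(item);
        }
        parent.AppendChild(list);
        return list;
    }
}
```

A call might look like ListBuilder.AppendList(doc, doc.GetElementbyId("menu"), new[] { "Home", "About" }), assuming the document contains an element with that id.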
Removing Nodes
Remove single element:
parent.RemoveChild(el);
Remove all children:
parent.RemoveAllChildren();
Remove nodes by ID:
doc.DocumentNode.Descendants("p")
    .Where(p => p.Id == "intro")
    .ToList()
    .ForEach(p => p.Remove());
Remove all children and attributes of the root node:
doc.DocumentNode.RemoveAll();
Loading Sub-Documents
Parse HTML fragment (a single root node):
var frag = HtmlNode.CreateNode("<b>Hi!</b>");
Append parsed fragment:
doc.DocumentNode.AppendChild(frag);
Load partial document (copy a node's markup into a new document):
var newDoc = new HtmlDocument();
newDoc.LoadHtml(doc.DocumentNode.OuterHtml);
Namespaces
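HtmlAgilityPack has no namespace registration API; a prefixed tag such as <h:element> is kept as a literal name.
Get prefixed nodes by name:
var nodes = doc.DocumentNode
    .Descendants("h:element");
Get prefixed nodes by XPath:
var nodes = doc.DocumentNode
    .SelectNodes("//*[name()='h:element']");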
DOM Traversal
Parent node:
var parent = node.ParentNode;
Child nodes:
var children = parent.ChildNodes;
Next sibling:
var nextSibling = node.NextSibling;
Previous sibling:
var prevSibling = node.PreviousSibling;
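NextSibling and PreviousSibling return whatever node is adjacent, which is often a whitespace text node rather than an element. A minimal sketch of skipping to the next element follows; the helper name NextElement is hypothetical.

```csharp
using HtmlAgilityPack;

public static class TraversalHelpers
{
    // Hypothetical helper: returns the next sibling that is an element,
    // skipping text and comment nodes, or null if there is none.
    public static HtmlNode NextElement(HtmlNode node)
    {
        var current = node.NextSibling;
        while (current != null && current.NodeType != HtmlNodeType.Element)
        {
            current = current.NextSibling;
        }
        return current;
    }
}
```

The same loop with PreviousSibling gives the mirror-image helper.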
Caching XPath Queries
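Compile an expression once and reuse it (keeping the query in a string still reparses it on every call):
// Requires: using System.Xml.XPath;
private static readonly XPathExpression ParasXpath = XPathExpression.Compile("//p");
var nodes = doc.DocumentNode.SelectNodes(ParasXpath);
// Later...
var moreNodes = doc.DocumentNode.SelectNodes(ParasXpath);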
Validation
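HtmlAgilityPack does not validate against a DTD or XSD. It records the syntax problems it finds while parsing:
doc.LoadHtml(html); // Does not throw on malformed markup
foreach (var error in doc.ParseErrors) {
    Console.WriteLine($"Line {error.Line}: {error.Reason}");
}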
Encoding
Load as UTF-8:
doc.OptionDefaultStreamEncoding = Encoding.UTF8;
Special characters:
doc.DocumentNode.SelectNodes("//p/text()[contains(., 'en dash –')]");
LINQ Integration
LINQ query:
var paras = from p in doc.DocumentNode.Descendants("p")
            where !p.HasClass("intro")
            select p.InnerText;
Extension methods:
doc.DocumentNode.Descendants("p")
    .Where(p => !p.HasClass("intro"))
    .Select(p => p.InnerText);
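The same LINQ style works for pulling attribute values out of a page. The sketch below collects the href of every link; the names LinkExtractor and ExtractLinks are hypothetical.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class LinkExtractor
{
    // Hypothetical helper: returns the href of every <a> that has a non-empty one.
    public static List<string> ExtractLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        return doc.DocumentNode
            .Descendants("a")                              // never null, unlike SelectNodes
            .Select(a => a.GetAttributeValue("href", ""))  // "" when the attribute is missing
            .Where(href => href != "")
            .ToList();
    }
}
```

Descendants yields an empty sequence when nothing matches, so no null check is needed here.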
Real World Use Cases
These snippets cover the common tasks of parsing, traversing, and modifying HTML documents with HtmlAgilityPack in C#.