Goquery is a Go library that provides jQuery-style DOM manipulation. It makes it easy to parse and extract data from HTML documents using a syntax similar to jQuery.
Getting Started
Import the goquery package:
import "github.com/PuerkitoBio/goquery"
Load HTML from a string:
doc, err := goquery.NewDocumentFromReader(strings.NewReader(htmlString))
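Or load from a file (goquery has no file loader, so open the file, pass it as a reader, and close it when done):
f, err := os.Open(filename)
doc, err := goquery.NewDocumentFromReader(f)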
Check for errors when loading the document.
Selection
Select elements similar to jQuery:
doc.Find(".class")
doc.Find("#id")
doc.Find("div")
Chaining:
doc.Find(".outer").Find(".inner")
Select by index:
doc.Find("div").Eq(1) // zero-based, so this is the second div
Select parent/children:
doc.Find(".inner").Parent()
doc.Find("div").Children()
Filter selection:
doc.Find(".text").Filter(".inner-text")
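Find vs FindSelection:
Find takes a selector string, searches the descendants of every element in the current Selection (the whole document when called on doc), and returns a new Selection.
FindSelection takes an existing Selection instead of a selector string and returns those of its elements that are descendants of the current Selection.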
Matchers:
You can pass a predicate function to FilterFunction for custom filtering; it keeps the elements for which the function returns true:
hasText := func(i int, s *goquery.Selection) bool {
	return s.Text() == "Some text"
}
doc.Find("div").FilterFunction(hasText)
Traversing
Traverse to siblings:
doc.Find(".inner").Next() // next sibling
doc.Find(".inner").Prev() // previous sibling
Traverse up and down:
doc.Find(".inner").Parent() // parent
doc.Find(".outer").Children() // children
Contents:
Get all child nodes, including text and comment nodes:
doc.Find("div").Contents()
Slice:
Get a sub-range of the selection by index (start inclusive, end exclusive):
doc.Find("li").Slice(2, 5)
Manipulation
Get/set text:
doc.Find("h1").Text() // get
doc.Find("h1").SetText("New header") // set
Get/set HTML:
doc.Find("div").Html() // get; returns (string, error) with the inner HTML of the first element
doc.Find("div").SetHtml(`<span>New content</span>`) // set
Add/remove classes:
doc.Find(".outer").AddClass("container") // add class
doc.Find(".inner").RemoveClass("highlighted") // remove class
Empty:
Remove all child nodes:
doc.Find("ul").Empty()
Append/Prepend:
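Insert HTML inside each selected element, as its last or first child (Append and Prepend themselves take a selector, not an HTML string):
doc.Find("ul").AppendHtml("<li>New</li>")
doc.Find("ul").PrependHtml("<li>New</li>")
Wrap/Unwrap:
Wrap each selected element in a new parent element:
doc.Find("span").WrapHtml("<div></div>")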
Remove wrapper element:
doc.Find("span").Unwrap()
Attributes
Get an attribute value:
href, exists := doc.Find("a").Attr("href") // exists is false when the attribute is missing
Set an attribute value:
doc.Find("a").SetAttr("href", "new-url")
Remove an attribute:
doc.Find("table").RemoveAttr("width")
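Get all attributes (goquery has no method for this, so read them from the underlying *html.Node):
attrs := doc.Find("div").Get(0).Attr // []html.Attribute of the first div; Get panics on an empty selection
Data:
Get custom data attributes (goquery has no Data method, so read the data-* attribute with Attr):
doc.Find("div").Attr("data-myattr")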
Iteration
Iterate through selections:
doc.Find(".inner").Each(func(i int, s *goquery.Selection) {
	// do something with s
})
Helper iteration methods:
doc.Find(".inner").EachWithBreak(func(i int, s *goquery.Selection) bool {
	return false // break iteration
})
doc.Find(".inner").Map(func(i int, s *goquery.Selection) string {
	return s.Text() // return value
})
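Plain loop:
Iterate over a selection with a for loop (Slice always takes a start and end index, so index into the selection with Eq instead):
sel := doc.Find("li")
for i := range sel.Nodes {
	item := sel.Eq(i) // item is *goquery.Selection
}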
Utilities
Serialize the inner HTML of the first selected element (use goquery.OuterHtml(sel) to include the element itself):
html, _ := doc.Find(".outer").Html()
Check if selection contains element:
doc.Find(".container").Has(".button").Length() > 0
Get number of elements in selection:
doc.Find(".items").Length()
Clone:
Clone document:
newDoc := goquery.CloneDocument(doc) // doc.Clone() returns a *Selection, not a *Document
Parse:
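Build a document from an already parsed *html.Node (goquery has no Parse function; html is golang.org/x/net/html):
root, err := html.Parse(reader)
parsed := goquery.NewDocumentFromNode(root)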
Is/End:
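Check whether a selection matches a selector, and step back one level in a chain:
doc.Find("div").Is("div") // true if any selected element matches
doc.Find("ul").Find("li").End() // returns the previous selection, here the ul elements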
Common Use Cases
Web Scraping:
Extract data from pages:
doc.Find(".titles").Each(func(i int, s *goquery.Selection) {
	title := s.Text()
	fmt.Println(title)
})
Parse HTML:
Process HTML documents:
doc.Find("a[rel='nofollow']").Each(func(i int, s *goquery.Selection) {
	s.Remove() // clean up HTML
})
Make Changes:
Modify HTML pages:
doc.Find("img").Each(func(i int, s *goquery.Selection) {
	s.SetAttr("src", newSrc) // set new img src
})
Remote HTML:
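Use with HTTP requests (NewDocumentFromResponse is deprecated; pass the response body to NewDocumentFromReader, check both errors, and close the body in real code):
res, err := http.Get(url)
doc, err := goquery.NewDocumentFromReader(res.Body)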
Selection Strategies
Targeting Elements:
Unique IDs:
doc.Find("#header")
Known classes:
doc.Find(".product-listing")
Nested selections:
doc.Find("#container").Find(".row .product")
Dynamic Content:
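Re-parse after JavaScript (goquery does not run scripts; browser stands for a browser automation client that returns the rendered HTML as an io.Reader):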
doc, _ = goquery.NewDocumentFromReader(browser.HTML())
Wait for element to appear:
sel := doc.Find(".loaded")
for sel.Length() == 0 { // Length returns an int, so compare it with 0
	time.Sleep(1 * time.Second)
	doc = getNewDoc()
	sel = doc.Find(".loaded")
}
Ads and Popups:
Remove unwanted elements:
doc.Find(".ad-banner").Remove()
Blocking:
Throttle requests:
time.Sleep(2 * time.Second) // slow down
Rotate user agents:
uas := []string{
	"Mozilla/5.0...",
	"Chrome/87.0...", // Go requires the trailing comma here
}
// cycle through uas
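One way to cycle through the list is a small round-robin helper that sets the User-Agent header on each request. This is a sketch using only the standard library; the rotator type, newRequest, and the agent strings are placeholders invented for the example.

```go
package main

import (
	"fmt"
	"log"
	"net/http"
)

// rotator hands out user agent strings in round-robin order.
// agents must be non-empty, and the type is not safe for concurrent use
// without extra locking.
type rotator struct {
	agents []string
	next   int
}

// Next returns the next user agent and advances the position.
func (r *rotator) Next() string {
	ua := r.agents[r.next%len(r.agents)]
	r.next++
	return ua
}

// newRequest builds a GET request carrying the next user agent.
func newRequest(r *rotator, url string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("User-Agent", r.Next())
	return req, nil
}

func main() {
	// Placeholder values, not real browser user agent strings.
	r := &rotator{agents: []string{"example-agent-a/1.0", "example-agent-b/1.0"}}

	for i := 0; i < 3; i++ {
		req, err := newRequest(r, "https://example.com/")
		if err != nil {
			log.Fatal(err)
		}
		// The request is only built here, not sent.
		fmt.Println(req.Header.Get("User-Agent"))
	}
}
```

The requests built this way can be sent with an http.Client and the response body passed to goquery.NewDocumentFromReader.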
JavaScript Content:
Use browser automation:
doc, _ := goquery.NewDocumentFromReader(browser.HTML())
Use API if available:
data := getAPIJSON() // may provide HTML
doc, _ := goquery.NewDocumentFromReader(bytes.NewReader(data))
Tips and Tricks
FAQ
When to use goquery vs standard packages?
Use goquery to query and manipulate HTML with jQuery-style selectors. Use the standard library for XML parsing (encoding/xml) or for generating HTML output (html/template).
What are some alternatives to goquery?
colly for scraping; GopherJS with jQuery bindings for client-side DOM manipulation.
How to test and validate selections?
Use Is() to check whether a selection matches a selector, Length() to check how many elements matched, and Each() to iterate over them.
How to mock documents for testing?
Use goquery.NewDocumentFromReader() with strings.NewReader() to load test HTML.
Summary
Goquery brings the power of jQuery to Go for easy HTML manipulation and extraction. With Go's speed and concurrency, goquery is great for web scraping and building web apps.