jsoup is a Java library for working with real-world HTML. It provides a convenient API for extracting and manipulating data from HTML documents, using DOM traversal methods and CSS selectors.
Getting Started
Add dependency:
implementation("org.jsoup:jsoup:1.15.3")
Parse HTML:
val html = "<html>...</html>"
val doc = Jsoup.parse(html)
Select elements:
val elements = doc.select(".content")
Extract text:
val text = doc.body().text()
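Putting the steps above together, a minimal end-to-end sketch (the HTML snippet and the .content class are invented for illustration):

```kotlin
import org.jsoup.Jsoup

// Parse an HTML string, select by CSS class, and extract the combined text.
fun contentText(html: String): String {
    val doc = Jsoup.parse(html)
    return doc.select(".content").text()
}

fun main() {
    val html = """<div class="content"><p>Hello, jsoup!</p></div>"""
    println(contentText(html)) // prints: Hello, jsoup!
}
```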
Selecting Elements
By CSS query:
doc.select(".main")
By tag:
doc.getElementsByTag("img")
By id:
doc.getElementById("header")
By attribute:
doc.getElementsByAttribute("href")
Custom filters:
doc.select(".text").filter { it.text().length > 10 }
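All of these selection styles can be mixed on the same parsed document; a small sketch with invented markup:

```kotlin
import org.jsoup.Jsoup

// Count matches for each selection style on one document.
fun selectionCounts(html: String): List<Int> {
    val doc = Jsoup.parse(html)
    return listOf(
        doc.select(".main").size,                // by CSS query
        doc.getElementsByTag("img").size,        // by tag
        doc.getElementsByAttribute("href").size  // by attribute
    )
}

fun main() {
    val html = """<div id="header"><img src="a.png"><a class="main" href="/x">link</a></div>"""
    println(selectionCounts(html)) // prints: [1, 1, 1]
}
```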
Traversing
Navigate up:
element.parent()
Navigate down:
element.children()
Sideways:
element.nextElementSibling()
element.previousElementSibling()
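A sketch covering all three directions from a single element (the list markup is invented):

```kotlin
import org.jsoup.Jsoup

// From the middle <li>, look up at the parent and sideways at both element siblings.
fun neighbours(html: String): Triple<String?, String?, String?> {
    val second = Jsoup.parse(html).select("li")[1]
    return Triple(
        second.parent()?.tagName(),              // up
        second.previousElementSibling()?.text(), // sideways (back)
        second.nextElementSibling()?.text()      // sideways (forward)
    )
}

fun main() {
    println(neighbours("<ul><li>one</li><li>two</li><li>three</li></ul>"))
    // prints: (ul, one, three)
}
```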
Manipulation
Set text:
element.text("new text")
Set HTML:
element.html("<span>new html</span>")
Add class:
element.addClass("highlighted")
Remove class:
element.removeClass("highlighted")
Remove element:
element.remove()
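The mutators above edit the tree in place; a sketch that applies two of them and serializes the result:

```kotlin
import org.jsoup.Jsoup

// Replace each paragraph's text and tag it with a class.
fun highlight(html: String): String {
    val doc = Jsoup.parse(html)
    doc.select("p").forEach {
        it.text("new text")        // replaces the element's text content
        it.addClass("highlighted") // appended to the class attribute
    }
    return doc.body().html()
}

fun main() {
    println(highlight("<p>old</p>")) // prints: <p class="highlighted">new text</p>
}
```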
Attributes
Get attribute:
val href = element.attr("href")
Set attribute:
element.attr("href", "link")
Remove attribute:
element.removeAttr("class")
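When the document is parsed with a base URI, absUrl() resolves relative attribute values to absolute ones; a sketch (the URL is a placeholder):

```kotlin
import org.jsoup.Jsoup

// Read a raw attribute and its absolute form resolved against the base URI.
fun resolvedHref(html: String, baseUri: String): Pair<String, String> {
    val link = Jsoup.parse(html, baseUri).selectFirst("a")!!
    return link.attr("href") to link.absUrl("href")
}

fun main() {
    val (raw, abs) = resolvedHref("""<a href="/page">go</a>""", "https://example.com/")
    println(raw) // prints: /page
    println(abs) // prints: https://example.com/page
}
```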
Examples
Extract text from paragraphs:
doc.select("p").forEach {
println(it.text())
}
Extract links:
doc.select("a[href]").forEach {
println(it.attr("href"))
}
Change image src:
doc.select("img").forEach {
it.attr("src", "new.png")
}
Validation
Check that a body fragment contains only allowed tags (jsoup validates against a Safelist, not a DTD):
val valid = Jsoup.isValid("<b>bold</b>", Safelist.basic())
if (!valid) {
    // handle disallowed markup
}
Advanced Usage
Fetching and parsing (note: Jsoup.connect(url).get() is blocking, and jsoup has no callback-based async API, so run it on a background thread or coroutine):
val doc = Jsoup.connect(url).get()
// process doc off the main thread
Custom headers:
val headers = mapOf("Auth" to "token")
val doc = Jsoup.connect(url).headers(headers).get()
More Element Selection Examples
By element:
doc.getElementsByTag("div")
By ID and class:
doc.getElementById("header")
doc.getElementsByClass("article")
Combinators:
doc.select(".article td")
Attribute value:
doc.select("[width=500]")
Navigating the DOM
Go up:
element.parent()
element.closest(".content")
Sideways:
element.nextSibling()     // Node-level: may return a TextNode or Comment
element.previousSibling() // for elements only, use nextElementSibling()/previousElementSibling()
All ancestors:
element.parents()
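parents() walks the whole ancestor chain, nearest first; a quick sketch:

```kotlin
import org.jsoup.Jsoup

// Tag names of an element's ancestors, nearest first.
fun ancestorTags(html: String, query: String): List<String> =
    Jsoup.parse(html).selectFirst(query)!!.parents().map { it.tagName() }

fun main() {
    println(ancestorTags("<div><p><b>x</b></p></div>", "b"))
    // nearest ancestors first: p, div, then the body and html the parser adds
}
```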
Modifying the Document
Add element:
doc.body().appendChild(newElement)
Remove element:
element.remove()
Set attribute:
element.attr("href", "link")
Set inline style (jsoup has no css() method; styles are plain attributes):
element.attr("style", "color: red")
Cleaning
Sanitize untrusted HTML against a Safelist of allowed tags and attributes:
val clean = Jsoup.clean("<p><script>evil()</script>Hello</p>", Safelist.basic())
println(clean) // the script tag is stripped
Configure the safelist:
val safelist = Safelist.basic()
    .addTags("img")
    .addAttributes("img", "src", "alt")
Outputting HTML
Serialize the document back to an HTML string after making changes:
val outputHtml = doc.html()
// write to file, network, etc.
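doc.html() pretty-prints by default; serialization can be tuned through the document's OutputSettings. A sketch of disabling pretty-printing:

```kotlin
import org.jsoup.Jsoup

// Serialize without pretty-printing: no added indentation or newlines.
fun compactHtml(html: String): String {
    val doc = Jsoup.parse(html)
    doc.outputSettings().prettyPrint(false)
    return doc.body().html()
}

fun main() {
    println(compactHtml("<p>a</p><p>b</p>")) // prints: <p>a</p><p>b</p>
}
```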
Comments and CDATA
Comments and CDATA sections are nodes, not elements, so getAllElements() will not return them; filter child nodes by type instead:
val comments = doc.getAllElements().flatMap { it.childNodes() }.filterIsInstance<Comment>()
Get CDATA sections (jsoup models them as CDataNode):
val cdata = doc.getAllElements().flatMap { it.childNodes() }.filterIsInstance<CDataNode>()
Working with forms
Get form by ID:
val form = doc.getElementById("login-form")!! // getElementById may return null
Get input by name (an attribute selector; #username would select by id, not name):
val usernameInput = form.select("input[name=username]")
Set input value (val is a Kotlin keyword, so it needs backticks):
usernameInput.`val`("myuser")
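A sketch of filling a field and reading it back (the form markup is invented; note that Elements.val must be backticked in Kotlin):

```kotlin
import org.jsoup.Jsoup

// Set an input's value and read it back from the value attribute.
fun fillUsername(html: String, value: String): String {
    val input = Jsoup.parse(html).select("input[name=username]")
    input.`val`(value) // `val` is a Kotlin keyword, hence the backticks
    return input.attr("value")
}

fun main() {
    println(fillUsername("""<form id="login-form"><input name="username"></form>""", "myuser"))
    // prints: myuser
}
```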
Multi-threaded scraping
Scrape in multiple threads:
val urls = listOf("url1", "url2")
val exec = Executors.newFixedThreadPool(10)
urls.forEach {
exec.submit {
Jsoup.connect(it).get() // parse in parallel
}
}
exec.shutdown()
Efficient selection
Cache selections:
val headers = doc.select("#headers").first() // cache in variable
Avoid re-parsing:
doc.select(".item").remove() // doesn't re-parse entire doc
Parser configuration
Custom parser settings:
val parser = Parser.htmlParser()
    .setTrackErrors(10) // number of parse errors to record
Request timeout (a connection setting, not a parser setting):
Jsoup.connect(url).timeout(10 * 1000) // 10 second timeout
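Error tracking in action: parse a malformed fragment with tracking enabled, then inspect what the parser recorded (the bad markup is invented):

```kotlin
import org.jsoup.Jsoup
import org.jsoup.parser.Parser

// Enable error tracking, parse, then read the recorded parse errors.
fun parseErrorCount(html: String): Int {
    val parser = Parser.htmlParser().setTrackErrors(10)
    Jsoup.parse(html, "", parser)
    return parser.errors.size
}

fun main() {
    println(parseErrorCount("<p>one</i>")) // the stray </i> is recorded as a parse error
}
```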