JSoup is a Java library for working with real-world HTML. It provides an API for extracting and manipulating data from HTML documents using DOM traversal and CSS selectors.
Getting Started
Import JSoup:
import org.jsoup.Jsoup
import org.jsoup.nodes.{Document, Element}
import org.jsoup.select.Elements
Parse HTML:
val doc: Document = Jsoup.parse(html)
Select elements:
val elements: Elements = doc.select("div.class")
Extract text:
val text = doc.body().text()
Update text:
doc.body().text("New text")
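The parse, select, and text steps above can be sketched end to end in Java, the library's native language (the same calls are available from Scala). The class name, helper names, and sample markup are hypothetical and only serve the illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

import java.util.List;

public class GettingStartedSketch {
    // Parse an HTML string and return the text of each element matching the CSS query.
    static List<String> textOf(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        Elements matches = doc.select(cssQuery);
        return matches.eachText();
    }

    // Replace the body's content with plain text and return the resulting body text.
    static String replaceBodyText(String html, String newText) {
        Document doc = Jsoup.parse(html);
        doc.body().text(newText);
        return doc.body().text();
    }

    public static void main(String[] args) {
        // Hypothetical sample markup
        String html = "<div class='note'>One</div><div>Two</div><div class='note'>Three</div>";
        System.out.println(textOf(html, "div.note"));
        System.out.println(replaceBodyText(html, "New text"));
    }
}
```

Note that setting text on the body discards its existing child elements, so this is a destructive update.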
Selecting Elements
By CSS query:
doc.select("div.content")
By tag:
doc.getElementsByTag("img")
By id:
doc.getElementById("header")
By attribute:
doc.getElementsByAttribute("href")
Custom filters:
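import scala.jdk.CollectionConverters._ // Elements is a java.util.List; convert it before using Scala collection methods
doc.select(".txt").asScala.filter(el => el.text().length > 10)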
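As a rough Java sketch of the lookups above: since Elements is a java.util.ArrayList of Element, a custom filter can be written with the standard stream API. The sample markup, class name, and helper names are assumptions made for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.util.List;
import java.util.stream.Collectors;

public class SelectingSketch {
    // Hypothetical sample markup used only for this illustration
    static final String HTML =
        "<div id='header'><img src='a.png'><a href='/x' class='txt'>short</a>"
        + "<p class='txt'>a much longer paragraph</p></div>";

    // Text of every match of cssQuery whose text is longer than minLength characters.
    static List<String> longTexts(String html, String cssQuery, int minLength) {
        Document doc = Jsoup.parse(html);
        return doc.select(cssQuery).stream()
            .filter(el -> el.text().length() > minLength)
            .map(el -> el.text())
            .collect(Collectors.toList());
    }

    // Number of elements with the given tag name.
    static int countByTag(String html, String tag) {
        return Jsoup.parse(html).getElementsByTag(tag).size();
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);
        System.out.println(doc.getElementById("header").tagName());
        System.out.println(doc.getElementsByAttribute("href").size());
        System.out.println(countByTag(HTML, "img"));
        System.out.println(longTexts(HTML, ".txt", 10));
    }
}
```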
Traversing
Navigate to parent:
element.parent()
Navigate to children:
element.children()
Sideways to siblings:
element.nextElementSibling()
element.previousElementSibling()
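A small Java sketch of the navigation calls above, assuming a hypothetical three-item list. The sibling accessors return null at either end of the list, so the helper (an invented name) checks for that.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class TraversingSketch {
    // Summarise the parent, sibling count, and neighbouring siblings of the element with the given id.
    static String describeById(String html, String id) {
        Element el = Jsoup.parse(html).getElementById(id);
        if (el == null) {
            return "not found";
        }
        Element prev = el.previousElementSibling(); // null when el is the first child
        Element next = el.nextElementSibling();     // null when el is the last child
        return "parent=" + el.parent().tagName()
            + " children=" + el.parent().children().size()
            + " prev=" + (prev == null ? "none" : prev.text())
            + " next=" + (next == null ? "none" : next.text());
    }

    public static void main(String[] args) {
        // Hypothetical sample markup
        String html = "<ul><li id='a'>one</li><li id='mid'>two</li><li>three</li></ul>";
        System.out.println(describeById(html, "mid"));
        System.out.println(describeById(html, "a"));
    }
}
```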
Manipulation
Set text:
element.text("new text")
Set HTML:
element.html("<span>new html</span>")
Add class:
element.addClass("highlighted")
Remove class:
element.removeClass("highlighted")
Remove element:
element.remove()
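The manipulation calls above can be combined as in the following Java sketch. The markup, the "old" class, and the "gone" id are hypothetical. The null checks are there because selectFirst and getElementById return null when nothing matches.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ManipulationSketch {
    // Rewrite the first paragraph and drop the element with id "gone" (both hypothetical).
    static Document edit(String html) {
        Document doc = Jsoup.parse(html);
        Element first = doc.selectFirst("p");
        if (first != null) {
            first.text("new text");          // replaces the element's content with plain text
            first.addClass("highlighted");
            first.removeClass("old");
        }
        Element gone = doc.getElementById("gone");
        if (gone != null) {
            gone.remove();                   // detaches the element from the document
        }
        return doc;
    }

    public static void main(String[] args) {
        Document doc = edit("<div><p class='old'>Hello</p><p id='gone'>bye</p></div>");
        System.out.println(doc.body().html());
    }
}
```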
Attributes
Get attribute:
val href = element.attr("href")
Set attribute:
element.attr("href", "link.html")
Remove attribute:
element.removeAttr("class")
Get all attributes:
val attrs = element.attributes()
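To illustrate the attribute calls above, here is a minimal Java sketch that edits a link and copies its remaining attributes into a map. The helper names and the sample anchor are assumptions. Reading a missing attribute with attr returns an empty string rather than null.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Attribute;
import org.jsoup.nodes.Element;

import java.util.LinkedHashMap;
import java.util.Map;

public class AttributesSketch {
    // Copy an element's attributes into a map of key to value.
    static Map<String, String> attributeMap(Element el) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Attribute attr : el.attributes()) {
            out.put(attr.getKey(), attr.getValue());
        }
        return out;
    }

    // Point the first link at "link.html", drop its class, and return what is left.
    static Map<String, String> rewriteLink(String html) {
        Element a = Jsoup.parse(html).selectFirst("a");
        if (a == null) {
            return new LinkedHashMap<>();
        }
        a.attr("href", "link.html");
        a.removeAttr("class");
        return attributeMap(a);
    }

    public static void main(String[] args) {
        // Hypothetical sample markup
        System.out.println(rewriteLink("<a href='old.html' class='nav' title='Home'>Home</a>"));
    }
}
```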
Examples
Extract text:
doc.select("p").forEach(p => {
  println(p.text())
})
Extract links:
doc.select("a[href]").forEach(a => {
  val href = a.attr("href")
  println(href)
})
Change image src:
doc.select("img").forEach(img => {
  img.attr("src", "new-img.jpg")
})
Validation
Check valid HTML:
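// jsoup has no standalone validator class; the parser can record parse errors instead
import org.jsoup.parser.Parser
val parser = Parser.htmlParser().setTrackErrors(10) // keep at most 10 errors
val parsed = Jsoup.parse(html, "", parser)
if (!parser.getErrors.isEmpty) {
  // handle errors
}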
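One way to check markup with jsoup is to enable error tracking on the parser and read the recorded errors after parsing. The Java sketch below assumes that approach. The class and method names are hypothetical, and the stray closing tag in the sample is just one kind of input the parser reports.

```java
import org.jsoup.Jsoup;
import org.jsoup.parser.ParseError;
import org.jsoup.parser.Parser;

import java.util.ArrayList;
import java.util.List;

public class ParseErrorSketch {
    // Parse html with error tracking enabled and return up to maxErrors messages.
    static List<String> parseErrors(String html, int maxErrors) {
        Parser parser = Parser.htmlParser().setTrackErrors(maxErrors);
        Jsoup.parse(html, "", parser);
        List<String> messages = new ArrayList<>();
        for (ParseError error : parser.getErrors()) {
            messages.add(error.getPosition() + ": " + error.getErrorMessage());
        }
        return messages;
    }

    public static void main(String[] args) {
        // A closing tag with no matching open element is reported as a parse error.
        for (String message : parseErrors("<p>One</div>", 10)) {
            System.out.println(message);
        }
    }
}
```

Parse errors describe deviations from the HTML parsing rules; jsoup still builds a usable document in those cases.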
Connection Settings
Custom user-agent:
val connection = Jsoup.connect(url).userAgent("Bot")
Custom headers:
connection.header("Auth", "token") // headers(...) expects a java.util.Map, not a Scala Map
Timeout:
connection.timeout(10*1000) // 10 seconds
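The settings above only configure the request; nothing is sent until get(), post(), or execute() is called. The following Java sketch builds a connection and reads the settings back without touching the network. The URL, header name, and class name are placeholders.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;

public class ConnectionConfigSketch {
    // Build a configured connection; no request is sent until get()/post()/execute().
    static Connection configured(String url) {
        return Jsoup.connect(url)
            .userAgent("Bot")            // sets the User-Agent request header
            .header("Auth", "token")     // hypothetical custom header
            .timeout(10 * 1000);         // timeout in milliseconds
    }

    public static void main(String[] args) {
        Connection con = configured("https://example.com/");
        System.out.println(con.request().header("User-Agent"));
        System.out.println(con.request().header("Auth"));
        System.out.println(con.request().timeout());
    }
}
```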
Advanced Usage
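Async fetch (jsoup has no callback API; wrap the blocking call in a Future):
import scala.concurrent.{ExecutionContext, Future}
import scala.util.{Failure, Success}
implicit val ec: ExecutionContext = ExecutionContext.global
Future(Jsoup.connect(url).get()).onComplete {
  case Success(doc) => // handle result
  case Failure(e) => // handle error
}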
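Multi-threading:
// process pages concurrently; on Scala 2.13+ .par needs the scala-parallel-collections
// module and: import scala.collection.parallel.CollectionConverters._
docs.par.foreach(doc => {
  // extract data (each Document is touched by one thread only)
})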
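For Java callers, the same ideas can be sketched with CompletableFuture and parallel streams. To stay self-contained, the example parses in-memory strings; a blocking fetch such as Jsoup.connect(url).get() could be placed inside the supplier instead. The class and method names are hypothetical.

```java
import org.jsoup.Jsoup;

import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

public class ConcurrentParseSketch {
    // Parse on a background thread and complete with the page title.
    static CompletableFuture<String> titleAsync(String html) {
        return CompletableFuture.supplyAsync(() -> Jsoup.parse(html).title());
    }

    // Parse several pages in parallel; each Document is created and used by a single thread.
    static List<String> titles(List<String> pages) {
        return pages.parallelStream()
            .map(html -> Jsoup.parse(html).title())
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        titleAsync("<title>A</title>")
            .thenAccept(System.out::println)
            .join();
        System.out.println(titles(Arrays.asList("<title>A</title>", "<title>B</title>")));
    }
}
```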
Common Use Cases
Extract all links:
doc.select("a[href]").forEach(a -> {
    System.out.println(a.attr("href"));
});
Extract text from paragraphs:
doc.select("p").forEach(p -> {
    System.out.println(p.text());
});
Extract images:
doc.select("img").forEach(img -> {
    String src = img.attr("src");
    // download image from src
});
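Extracted href and src values are often relative. Assuming the document was parsed with a base URI, absUrl resolves them to absolute URLs, which is usually what a downloader needs. The base URI, markup, and helper name below are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class AbsoluteLinksSketch {
    // Resolve the given attribute of every match against baseUri.
    static List<String> absoluteUrls(String html, String baseUri, String cssQuery, String attribute) {
        Document doc = Jsoup.parse(html, baseUri);
        List<String> urls = new ArrayList<>();
        for (Element el : doc.select(cssQuery)) {
            String url = el.absUrl(attribute); // empty string when the URL cannot be resolved
            if (!url.isEmpty()) {
                urls.add(url);
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<a href='page.html'>p</a><a href='/root'>r</a><img src='img/logo.png'>";
        String base = "https://example.com/docs/";
        System.out.println(absoluteUrls(html, base, "a[href]", "href"));
        System.out.println(absoluteUrls(html, base, "img[src]", "src"));
    }
}
```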
Submit a form:
Connection.Response res = Jsoup.connect(url)
    .data("username", "example")
    .data("password", "secret")
    .method(Connection.Method.POST)
    .execute();
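Log in and maintain session:
// post the credentials to the login URL, then reuse the returned cookies
Connection.Response res = Jsoup.connect(url)
    .data("username", "example")
    .data("password", "secret")
    .method(Connection.Method.POST)
    .execute();
Map<String, String> cookies = res.cookies();
Document doc = Jsoup.connect(url2)
    .cookies(cookies)
    .get();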
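The two requests involved in a login flow can be assembled and inspected before anything is sent, which helps when debugging form fields. This Java sketch only builds the requests. The URLs, field names, and cookie name are placeholders, and in real use the cookie map would come from the login response's cookies().

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;

import java.util.Collections;
import java.util.Map;

public class FormRequestSketch {
    // Build (but do not send) a POST request carrying the form fields.
    static Connection loginRequest(String url, String user, String password) {
        return Jsoup.connect(url)
            .data("username", user)
            .data("password", password)
            .method(Connection.Method.POST);
    }

    // Build a follow-up request that reuses session cookies from an earlier response.
    static Connection withSession(String url, Map<String, String> cookies) {
        return Jsoup.connect(url).cookies(cookies);
    }

    public static void main(String[] args) {
        Connection login = loginRequest("https://example.com/login", "example", "secret");
        for (Connection.KeyVal field : login.request().data()) {
            System.out.println(field.key() + "=" + field.value());
        }
        // Placeholder cookie standing in for res.cookies() from a real login response
        Connection next = withSession("https://example.com/account",
            Collections.singletonMap("session", "abc"));
        System.out.println(next.request().cookie("session"));
    }
}
```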
Tips and Best Practices
Advanced Topics
Custom request handling:
Connection con = Jsoup.connect(url);
Connection.Response res = con.execute(); // throws IOException
// handle response (e.g. res.statusCode(), res.headers()) before parsing
Document doc = res.parse();
Control network settings:
con.timeout(5000);
con.proxy("webproxy", 8080);
Parse XML instead of HTML (the same call works on Android):
Document doc = Jsoup.parse(string, "", Parser.xmlParser());
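Integrate with JSON/JAXB:
// jsoup does not parse JSON; extract an embedded JSON string, then hand it to a JSON library
Element script = doc.selectFirst("script[type='application/ld+json']");
String json = script != null ? script.data() : "";
// bind to POJOs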
Output cleaned HTML:
String cleanHtml = Jsoup.clean(dirtyHtml, baseUri, Safelist.basic()); // Safelist replaced Whitelist in jsoup 1.14.2
Sanitize untrusted input:
String safe = Jsoup.clean(unsafe, Safelist.none());
Validate against a safelist (jsoup does not validate against DTDs or schemas):
boolean valid = Jsoup.isValid(bodyHtml, Safelist.basic());
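Cleaning and checking untrusted input can be sketched as follows, assuming jsoup 1.14.2 or later, where the safelist class is named Safelist. The class name, method names, and sample strings are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

public class SanitizeSketch {
    // Keep simple text formatting (b, i, a, p, ...) and drop everything else.
    static String keepBasicFormatting(String untrusted) {
        return Jsoup.clean(untrusted, Safelist.basic());
    }

    // Drop every tag and keep only the text.
    static String stripAllTags(String untrusted) {
        return Jsoup.clean(untrusted, Safelist.none());
    }

    // True when the body fragment already satisfies the basic safelist.
    static boolean isAlreadyClean(String bodyHtml) {
        return Jsoup.isValid(bodyHtml, Safelist.basic());
    }

    public static void main(String[] args) {
        String unsafe = "<p>Hi <script>alert(1)</script><b>there</b></p>";
        System.out.println(keepBasicFormatting(unsafe));
        System.out.println(stripAllTags("<b>Hi</b> <i>you</i>"));
        System.out.println(isAlreadyClean(unsafe));
    }
}
```

Jsoup.isValid answers whether clean would have removed anything, so it is a safelist check rather than a markup validator.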