jsoup is a Java library for working with real-world HTML. It provides a convenient API for extracting and manipulating data from HTML documents, using DOM traversal methods and CSS selectors.
Getting Started
Add dependency:
implementation("org.jsoup:jsoup:1.15.3")
Parse HTML:
val html = "<html>...</html>"
val doc = Jsoup.parse(html)
Select elements:
val elements = doc.select(".content")
Extract text:
val text = doc.body().text()
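Putting the steps above together, a minimal end-to-end sketch (the HTML snippet and the .content class are invented for illustration):

```kotlin
import org.jsoup.Jsoup

// Parse an HTML string, select by CSS class, and extract the combined text.
fun contentText(html: String): String {
    val doc = Jsoup.parse(html)
    return doc.select(".content").text()
}

fun main() {
    val html = """<div class="content"><p>Hello, jsoup!</p></div>"""
    println(contentText(html)) // prints: Hello, jsoup!
}
```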
Selecting Elements
By CSS query:
doc.select(".main")
By tag:
doc.getElementsByTag("img")
By id:
doc.getElementById("header")
By attribute:
doc.getElementsByAttribute("href")
Custom filters:
doc.select(".text").filter { it.text().length > 10 }
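All of these selection styles can be mixed on the same parsed document; a small sketch with invented markup:

```kotlin
import org.jsoup.Jsoup

// Count matches for each selection style on one document.
fun selectionCounts(html: String): List<Int> {
    val doc = Jsoup.parse(html)
    return listOf(
        doc.select(".main").size,                // by CSS query
        doc.getElementsByTag("img").size,        // by tag
        doc.getElementsByAttribute("href").size  // by attribute
    )
}

fun main() {
    val html = """<div id="header"><img src="a.png"><a class="main" href="/x">link</a></div>"""
    println(selectionCounts(html)) // prints: [1, 1, 1]
}
```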
Traversing
Navigate up:
element.parent()
Navigate down:
element.children()
Sideways:
element.nextElementSibling()
element.previousElementSibling()
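A sketch covering all three directions from a single element (the list markup is invented):

```kotlin
import org.jsoup.Jsoup

// From the middle <li>, look up at the parent and sideways at both element siblings.
fun neighbours(html: String): Triple<String?, String?, String?> {
    val second = Jsoup.parse(html).select("li")[1]
    return Triple(
        second.parent()?.tagName(),              // up
        second.previousElementSibling()?.text(), // sideways (back)
        second.nextElementSibling()?.text()      // sideways (forward)
    )
}

fun main() {
    println(neighbours("<ul><li>one</li><li>two</li><li>three</li></ul>"))
    // prints: (ul, one, three)
}
```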
Manipulation
Set text:
element.text("new text")
Set HTML:
element.html("<span>new html</span>")
Add class:
element.addClass("highlighted")
Remove class:
element.removeClass("highlighted")
Remove element:
element.remove()
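The mutators above edit the tree in place; a sketch that applies two of them and serializes the result:

```kotlin
import org.jsoup.Jsoup

// Replace each paragraph's text and tag it with a class.
fun highlight(html: String): String {
    val doc = Jsoup.parse(html)
    doc.select("p").forEach {
        it.text("new text")        // replaces the element's text content
        it.addClass("highlighted") // appended to the class attribute
    }
    return doc.body().html()
}

fun main() {
    println(highlight("<p>old</p>")) // prints: <p class="highlighted">new text</p>
}
```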
Attributes
Get attribute:
val href = element.attr("href")
Set attribute:
element.attr("href", "link")
Remove attribute:
element.removeAttr("class")
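When the document is parsed with a base URI, absUrl() resolves relative attribute values to absolute ones; a sketch (the URL is a placeholder):

```kotlin
import org.jsoup.Jsoup

// Read a raw attribute and its absolute form resolved against the base URI.
fun resolvedHref(html: String, baseUri: String): Pair<String, String> {
    val link = Jsoup.parse(html, baseUri).selectFirst("a")!!
    return link.attr("href") to link.absUrl("href")
}

fun main() {
    val (raw, abs) = resolvedHref("""<a href="/page">go</a>""", "https://example.com/")
    println(raw) // prints: /page
    println(abs) // prints: https://example.com/page
}
```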
Examples
Extract text from paragraphs:
doc.select("p").forEach {
println(it.text())
}
Extract links:
doc.select("a[href]").forEach {
println(it.attr("href"))
}
Change image src:
doc.select("img").forEach {
it.attr("src", "new.png")
}
Validation
Check that a body fragment contains only allowed tags (jsoup validates against a Safelist, not a DTD):
val valid = Jsoup.isValid("<b>bold</b>", Safelist.basic())
if (!valid) {
    // handle disallowed markup
}
Advanced Usage
Fetching and parsing (note: Jsoup.connect(url).get() is blocking, and jsoup has no callback-based async API, so run it on a background thread or coroutine):
val doc = Jsoup.connect(url).get()
// process doc off the main thread
Custom headers:
val headers = mapOf("Auth" to "token")
val doc = Jsoup.connect(url).headers(headers).get()
More Element Selection Examples
By element:
doc.getElementsByTag("div")
By ID and class:
doc.getElementById("header")
doc.getElementsByClass("article")
Combinators:
doc.select(".article td")
Attribute value:
doc.select("[width=500]")
Navigating the DOM
Go up:
element.parent()
element.closest(".content")
Sideways:
element.nextSibling()     // Node-level: may return a TextNode or Comment
element.previousSibling() // for elements only, use nextElementSibling()/previousElementSibling()
All ancestors:
element.parents()
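parents() walks the whole ancestor chain, nearest first; a quick sketch:

```kotlin
import org.jsoup.Jsoup

// Tag names of an element's ancestors, nearest first.
fun ancestorTags(html: String, query: String): List<String> =
    Jsoup.parse(html).selectFirst(query)!!.parents().map { it.tagName() }

fun main() {
    println(ancestorTags("<div><p><b>x</b></p></div>", "b"))
    // nearest ancestors first: p, div, then the body and html the parser adds
}
```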
Modifying the Document
Add element:
doc.body().appendChild(newElement)
Remove element:
element.remove()
Set attribute:
element.attr("href", "link")
Set inline style (jsoup has no css() method; styles are plain attributes):
element.attr("style", "color: red")
Cleaning
Sanitize untrusted HTML against a Safelist of allowed tags and attributes:
val clean = Jsoup.clean("<p><script>evil()</script>Hello</p>", Safelist.basic())
println(clean) // the script tag is stripped
Configure the safelist:
val safelist = Safelist.basic()
    .addTags("img")
    .addAttributes("img", "src", "alt")
Outputting HTML
Serialize the document back to an HTML string after making changes:
val outputHtml = doc.html()
// write to file, network, etc.
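doc.html() pretty-prints by default; serialization can be tuned through the document's OutputSettings. A sketch of disabling pretty-printing:

```kotlin
import org.jsoup.Jsoup

// Serialize without pretty-printing: no added indentation or newlines.
fun compactHtml(html: String): String {
    val doc = Jsoup.parse(html)
    doc.outputSettings().prettyPrint(false)
    return doc.body().html()
}

fun main() {
    println(compactHtml("<p>a</p><p>b</p>")) // prints: <p>a</p><p>b</p>
}
```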
Comments and CDATA
Comments and CDATA sections are nodes, not elements, so getAllElements() will not return them; filter child nodes by type instead:
val comments = doc.getAllElements().flatMap { it.childNodes() }.filterIsInstance<Comment>()
Get CDATA sections (jsoup models them as CDataNode):
val cdata = doc.getAllElements().flatMap { it.childNodes() }.filterIsInstance<CDataNode>()
Working with forms
Get form by ID:
val form = doc.getElementById("login-form")!! // getElementById may return null
Get input by name (an attribute selector; #username would select by id, not name):
val usernameInput = form.select("input[name=username]")
Set input value (val is a Kotlin keyword, so it needs backticks):
usernameInput.`val`("myuser")
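A sketch of filling a field and reading it back (the form markup is invented; note that Elements.val must be backticked in Kotlin):

```kotlin
import org.jsoup.Jsoup

// Set an input's value and read it back from the value attribute.
fun fillUsername(html: String, value: String): String {
    val input = Jsoup.parse(html).select("input[name=username]")
    input.`val`(value) // `val` is a Kotlin keyword, hence the backticks
    return input.attr("value")
}

fun main() {
    println(fillUsername("""<form id="login-form"><input name="username"></form>""", "myuser"))
    // prints: myuser
}
```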
Multi-threaded scraping
Scrape in multiple threads:
val urls = listOf("url1", "url2")
val exec = Executors.newFixedThreadPool(10)
urls.forEach {
exec.submit {
Jsoup.connect(it).get() // parse in parallel
}
}
exec.shutdown()
Efficient selection
Cache selections:
val headers = doc.select("#headers").first() // cache in variable
Avoid re-parsing:
doc.select(".item").remove() // doesn't re-parse entire doc
Parser configuration
Custom parser settings:
val parser = Parser.htmlParser()
    .setTrackErrors(10) // number of parse errors to record
Request timeout (a connection setting, not a parser setting):
Jsoup.connect(url).timeout(10 * 1000) // 10 second timeout
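Error tracking in action: parse a malformed fragment with tracking enabled, then inspect what the parser recorded (the bad markup is invented):

```kotlin
import org.jsoup.Jsoup
import org.jsoup.parser.Parser

// Enable error tracking, parse, then read the recorded parse errors.
fun parseErrorCount(html: String): Int {
    val parser = Parser.htmlParser().setTrackErrors(10)
    Jsoup.parse(html, "", parser)
    return parser.errors.size
}

fun main() {
    println(parseErrorCount("<p>one</i>")) // the stray </i> is recorded as a parse error
}
```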