The Ultimate Jsoup Cheatsheet in Java

Jsoup is a Java library for working with real-world HTML. It provides a convenient API for fetching, parsing, extracting, and manipulating data using DOM traversal, CSS selectors, and jQuery-like methods.

Parsing HTML

To parse HTML from a string:

String html = "<html><head><title>First parse</title></head>" +
             "<body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);

To parse HTML from a URL:

Document doc = Jsoup.connect("http://example.com").get();

To parse HTML from a file:

File input = new File("/path/to/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");

Selecting Elements

To select an element by ID:

Element content = doc.getElementById("content");

To select elements by tag name:

Elements paragraphs = doc.getElementsByTag("p");

To select elements by CSS query:

Elements links = doc.select("a[href]"); // a elements with href attrib

To select the first matching element:

Element masthead = doc.select("div.masthead").first();

Extracting Data

To get the text inside an element:

String title = doc.title(); // title text
String text = content.text(); // inner text

To get the HTML inside an element:

String html = content.html();

To get an attribute value:

String href = link.attr("href"); // href attribute

Manipulating Elements

To remove an element:

content.remove();

To add a class to an element:

div.addClass("highlight");

To set text:

content.text("This is the new text content");

To set HTML:

content.html("<p>This is new HTML content</p>");

To append HTML to an element:

content.append("<p>Appended paragraph</p>");

Outputting HTML

To get the HTML of the entire parsed document:

String html = doc.html();

To get the outer HTML of an element:

String html = div.outerHtml();

To get the inner HTML of an element (its children, without its own tag):

String html = content.html();

To update the document HTML:

doc.html("<html><head></head><body></body></html>");

To write the document HTML to a file:

File output = new File("output.html");
doc.outputSettings().prettyPrint(true);
// try-with-resources closes the writer even if the write fails
try (PrintWriter writer = new PrintWriter(output, "UTF-8")) {
  writer.write(doc.html());
}

Working with Forms

To get a form by ID:

FormElement form = (FormElement) doc.getElementById("loginForm");

To fill in a form field:

form.selectFirst("input[name=username]").val("john");

To submit a form:

// submit() builds a Connection from the form's action, method, and field values;
// the document needs a base URI so that a relative action can be resolved
Connection.Response res = form.submit().execute();

Cookies

To get cookies from a connection:

Map<String, String> cookies =
     Jsoup.connect("http://example.com").execute().cookies();

To send cookies with a request:

Connection con = Jsoup.connect("http://example.com");
con.cookie("name", "value"); // add cookie to request
Document doc = con.get();

Handling Exceptions

Jsoup methods can throw the following exceptions:

IOException: On network errors connecting to or reading from a URL.

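HttpStatusException: When the HTTP response status indicates an error. It is a subclass of IOException and can be suppressed with ignoreHttpErrors(true).

UnsupportedMimeTypeException: When the response content type is not one Jsoup can parse. It can be suppressed with ignoreContentType(true).

Selector.SelectorParseException: When a CSS query passed to select() is invalid. The HTML parser itself does not throw on malformed markup; it repairs it.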

IllegalArgumentException: If an invalid argument is passed.

To handle exceptions:

try {
  Document doc = Jsoup.connect("http://example.com").get();
} catch (HttpStatusException e) {
  // handle non-success status code
  // (must be caught before IOException, its superclass)
} catch (IOException e) {
  // handle network error
}

Best Practices

Some best practices when using Jsoup:

Always sanitize user-supplied HTML before rendering it, to prevent XSS attacks. Use Jsoup.clean() with a Safelist.

Specify a base URI when parsing documents to resolve relative URLs.

Use element IDs where available instead of CSS or element queries for performance.

Fetch and parse documents in a background thread to avoid locking the UI.

Limit which protocols can be fetched to prevent SSRF attacks.

Set request timeouts to prevent hanging on slow networks or bad responses.

Handle exceptions and invalid input gracefully.
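
As one illustration of the protocol advice above, here is a minimal sketch of a scheme allow-list that could run before a URL is handed to Jsoup.connect(). The class name UrlGuard and the method isAllowed are hypothetical helpers, not part of Jsoup, and the check uses only java.net.URI.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper (not part of Jsoup): accepts only absolute http(s) URLs. */
final class UrlGuard {
    private static final Set<String> ALLOWED_SCHEMES = Set.of("http", "https");

    private UrlGuard() {}

    static boolean isAllowed(String url) {
        if (url == null) {
            return false;
        }
        try {
            URI uri = new URI(url.trim());
            String scheme = uri.getScheme();
            // Require an allowed scheme and a host, which rejects file:, javascript:, and relative input
            return scheme != null
                    && uri.getHost() != null
                    && ALLOWED_SCHEMES.contains(scheme.toLowerCase(Locale.ROOT));
        } catch (URISyntaxException e) {
            return false; // malformed input is rejected rather than fetched
        }
    }
}
```

A caller would fetch only when UrlGuard.isAllowed(url) returns true. A scheme check alone does not stop requests to internal hosts, so it is best treated as one layer of an SSRF defence rather than the whole of it.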

Advanced Topics

Cleaning HTML

To sanitize untrusted HTML:

String unsafe =
  "<p><a href='javascript:sendSpam()'>Buy stuff</a></p>";

// Safelist was named Whitelist before jsoup 1.14.1
String safe = Jsoup.clean(unsafe, Safelist.basic()); // sanitized html

Document clean = Jsoup.parse(safe);
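
To make the effect of sanitizing concrete, here is a small sketch that wraps Jsoup.clean() with Safelist.basic(). It assumes jsoup 1.14.1 or later (earlier releases name the class Whitelist), and the class name SanitizeDemo is only illustrative.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

class SanitizeDemo {
    /** Returns the input with everything outside the basic safelist removed. */
    static String sanitize(String untrustedHtml) {
        return Jsoup.clean(untrustedHtml, Safelist.basic());
    }

    public static void main(String[] args) {
        String unsafe = "<p><a href='javascript:sendSpam()'>Buy stuff</a></p>";
        // The javascript: href is dropped; the link text is kept
        System.out.println(sanitize(unsafe));
    }
}
```

The basic safelist keeps simple text formatting and links with safe protocols; stricter or looser presets such as Safelist.none() and Safelist.relaxed() follow the same pattern.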

POST Requests

To make a POST request:

Connection con = Jsoup.connect("http://example.com");
con.data("name", "value"); // set POST data
Document doc = con.post();

Multi-part Forms

To upload files:

Connection con = Jsoup.connect("http://example.com");
File img = new File("/path/to/img.jpg");
con.data("name", "John");
try (InputStream in = new FileInputStream(img)) {
  con.data("photo", img.getName(), in); // upload file: key, filename, stream
  Document doc = con.post();
}

Connections

To configure a Connection before executing it:

Connection con = Jsoup.connect("http://example.com");
con.header("X-Custom", "value");
con.cookie("key", "value");
con.timeout(3000); // milliseconds
Connection.Response res = con.execute();

DOM Traversal

To get parent elements:

Element parent = element.parent();

To get sibling elements:

Element nextSibling = element.nextElementSibling();
Element prevSibling = element.previousElementSibling();

To get children:

Elements children = element.children();

Modifying the DOM

To create new elements:

Element img = new Element("img");
img.attr("src", "example.png");

To add elements:

parent.appendChild(newEl);  // add as the last child
parent.prependChild(newEl); // add as the first child (an element already in the tree is moved)

To insert elements at a position:

parent.insertChildren(0, newEl); // insert at index 0; returns parent for chaining

Working with XML

Jsoup can parse and manipulate XML documents with the XML parser:

String xml = "<doc xmlns:x='http://example.org'>" +
            "<x:el>Text</x:el></doc>";

Parser xmlParser = Parser.xmlParser();
Document doc = xmlParser.parseInput(xml, "");

Element el = doc.select("x|el").first();

The XML parser supports:

Namespaces - can select with | delimiter

XPath queries via selectXpath() (jsoup 1.14.3 and later)

Retrieving namespace-prefixed attributes

Troubleshooting

Handling bad markup

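Jsoup's HTML parser is lenient by default and repairs bad markup without any extra option. To see what it had to repair, enable error tracking:

Parser parser = Parser.htmlParser().setTrackErrors(500); // record up to 500 errors
Document doc = Jsoup.parse(html, "", parser);
for (ParseError error : parser.getErrors()) {
  System.out.println(error); // position and description of each problem
}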

Resolving relative URLs

Always provide a base URI when parsing:

Document doc = Jsoup.parse(html, "https://example.com/");
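
The base URI matters once links are read back out. As a brief sketch, absUrl("href") resolves a relative href against the base URI given at parse time; the class and method names here are only illustrative.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class BaseUriDemo {
    /** Returns the first link's href resolved against baseUri, or "" if it cannot be resolved. */
    static String firstAbsoluteLink(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        Element link = doc.selectFirst("a[href]");
        return link == null ? "" : link.absUrl("href");
    }

    public static void main(String[] args) {
        String html = "<a href='/about'>About</a>";
        System.out.println(firstAbsoluteLink(html, "https://example.com/docs/"));
    }
}
```

Without a base URI, absUrl() returns an empty string for a relative link, while attr("href") still returns the raw value.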

Encoding issues

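Encoding applies when parsing bytes, not a String (the second argument of Jsoup.parse(String, String) is the base URI). Pass the character set when parsing a file or stream:

Document doc = Jsoup.parse(new File("/path/to/input.html"), "UTF-8");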

Tips and Best Practices

Select vs find

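select searches all descendants. Jsoup has no find method; to match only direct children, start the query with the > combinator:

el.select(".item");   // all descendants matching .item
el.select("> .item"); // direct children matching .item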
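
A short sketch contrasting a descendant search with a child-only search, using a query that starts with the > combinator. The markup and the class name SelectScopeDemo are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

class SelectScopeDemo {
    // A list with two top-level items, one of which contains a nested item
    static final String HTML =
        "<ul id='menu'>"
      + "<li class='item'>One<ul><li class='item'>Nested</li></ul></li>"
      + "<li class='item'>Two</li>"
      + "</ul>";

    /** Counts every .item below #menu, at any depth. */
    static int descendantCount() {
        Element menu = Jsoup.parse(HTML).getElementById("menu");
        return menu.select(".item").size();
    }

    /** Counts only the .item elements that are direct children of #menu. */
    static int directChildCount() {
        Element menu = Jsoup.parse(HTML).getElementById("menu");
        return menu.select("> .item").size();
    }
}
```

With this markup the descendant query also finds the nested item, while the child-only query skips it.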

Avoid full re-parse

Reuse Documents and avoid re-parsing full HTML:

Document doc = ... // parse once

Elements links = doc.select("a"); // query many times

Parser callback

Jsoup has no ParseCallback type. The usual approach is to parse first and then modify the matching elements:

Document doc = Jsoup.parse(html);
for (Element el : doc.select("img")) {
  el.attr("src", "placeholder.jpg"); // rewrite
}

Integration

Spring MVC

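Jsoup does not ship a Spring ViewResolver. It is typically wrapped in an ordinary bean, for example a sanitizing service:

@Service
public class HtmlSanitizer {
    public String sanitize(String untrustedHtml) {
        return Jsoup.clean(untrustedHtml, Safelist.basic());
    }
}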

Async responses

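Jsoup's Connection is blocking and has no async method. Run the fetch on another thread and handle the parsed Document in a callback:

CompletableFuture
  .supplyAsync(() -> {
    try {
      return Jsoup.connect(url).get();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  })
  .thenAccept(doc -> {
    // process doc
  });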

Large documents

Avoid loading the entire document into memory:

Jsoup.connect(url).maxBodySize(1 * 1024 * 1024).execute().bodyStream();

Form Handling

JSON serialization

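Jsoup has no built-in JSON serializer. formData() returns the form's submittable fields as key/value pairs, which can be collected into a Map and passed to a JSON library:

Map<String, String> fields = new LinkedHashMap<>();
for (Connection.KeyVal kv : form.formData()) {
  fields.put(kv.key(), kv.value());
}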

File uploads

Upload files in multi-part forms:

Connection con = Jsoup.connect("http://example.com");
File img = new File("/path/img.jpg");
con.data("profile_pic", img.getName(), new FileInputStream(img)); // key, filename, stream

Request interceptor

Jsoup has no request interceptor hook. Set headers on the Connection directly, or inspect and modify the underlying request object before executing:

Connection.Request r = con.request();
r.header("Authorization", "token");

Thread Safety

A Document is not thread safe. Confine it to one thread, or guard every access (reads and writes) with the same lock:

Document doc = ... // parse

synchronized(doc) {
  Elements els = doc.select(".item"); // read
}

Use doc.clone() to create an independent deep copy for each thread that needs concurrent access.
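
A minimal sketch of the snapshot idea, assuming a deep copy made with Document.clone() is what gets handed to another thread. The class name SnapshotDemo is illustrative; the point is that edits to the copy do not touch the original.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class SnapshotDemo {
    /** Returns { original title, snapshot title } after editing only the snapshot. */
    static String[] titlesAfterCloneEdit() {
        Document original = Jsoup.parse("<title>Original</title><p>Body</p>");
        Document snapshot = original.clone();   // deep copy with its own node tree
        snapshot.title("Changed in snapshot");  // modifies the copy only
        return new String[] { original.title(), snapshot.title() };
    }
}
```

Each worker thread can then read or modify its own snapshot without locking, at the cost of copying the tree.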

Extensions

Custom Parser

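A simple way to customize parsing is to wrap a Parser and post-process its output:

public class MyParser {

  private final Parser delegate = Parser.htmlParser();

  public Document parseInput(String html, String baseUri) {
    Document doc = delegate.parseInput(html, baseUri);
    // custom post-processing of doc goes here
    return doc;
  }

}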

Jsoup + JFlex

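Jsoup has no built-in JFlex integration. A JFlex-generated lexer can still drive DOM construction by creating Element and TextNode objects through Jsoup's public constructors and appending them to a Document.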

Testing

Verify elements exist:

Assert.assertEquals(1, doc.select("h1").size());

Check attribute values:

Assert.assertEquals("text", doc.select(".title").attr("content"));

FAQ

OuterHtml vs Html?

outerHtml includes the element tag, html is just inner.
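
The difference is easiest to see side by side. In this small sketch pretty-printing is turned off so the two strings are directly comparable; the markup and class name are illustrative.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class HtmlVsOuterHtml {
    /** Returns { inner HTML, outer HTML } of the div with id "box". */
    static String[] innerAndOuter() {
        Document doc = Jsoup.parse("<div id=\"box\"><p>Hi</p></div>");
        doc.outputSettings().prettyPrint(false); // no added newlines or indentation
        Element div = doc.getElementById("box");
        return new String[] { div.html(), div.outerHtml() };
    }
}
```

Here html() yields only the paragraph, while outerHtml() also includes the enclosing div tag and its id attribute.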

Select vs find?

select searches all descendants. Jsoup has no find method; use a query starting with > to match only direct children.

Prevent script execution?

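Jsoup never executes scripts. To strip them from untrusted HTML, sanitize with Jsoup.clean() and a Safelist such as Safelist.basic(), or build a custom Safelist of allowed tags.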

Relative URL resolution?

Always provide a base URL to Jsoup.parse().

Modifying the DOM

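DOM edits can introduce vulnerabilities when they use untrusted input. For example, appending a script element that loads attacker-controlled code:

Element script = doc.body().appendElement("script");
script.attr("src", "malicious.js");

Jsoup does not evaluate scripts, but inline script content written into the document will run once a browser renders the output:

doc.outputSettings().syntax(Document.OutputSettings.Syntax.html);
doc.select("script").html("(() => alert('XSS'))()");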
