The Ultimate Jsoup Cheatsheet in Java

Oct 31, 2023 · 8 min read

Jsoup is a Java library for working with real-world HTML. It provides a convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods.

Parsing HTML

To parse HTML from a string:

String html = "<html><head><title>First parse</title></head>" +
             "<body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);

To parse HTML from a URL:

Document doc = Jsoup.connect("http://example.com").get();

To parse HTML from a file:

File input = new File("/path/to/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");

Selecting Elements

To select an element by ID:

Element content = doc.getElementById("content");

To select elements by tag name:

Elements paragraphs = doc.getElementsByTag("p");

To select elements by CSS query:

Elements links = doc.select("a[href]"); // a elements with href attrib

To select the first matching element:

Element masthead = doc.select("div.masthead").first();
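
The selection calls above can be tried without a network connection by parsing a string. The following is a minimal sketch; the sample markup and the class and method names (SelectExample, count, firstText) are made up for illustration. It also shows that a lookup with no match yields null rather than an exception, so a null check is needed before using the result.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

final class SelectExample {
    // Hypothetical sample markup, used only for this illustration.
    static final String HTML =
        "<html><body>"
        + "<div class=\"masthead\">Site name</div>"
        + "<div id=\"content\"><p>One</p><p>Two <a href=\"/next\">next</a></p><a>plain</a></div>"
        + "</body></html>";

    // Number of elements matching a CSS query.
    static int count(String html, String cssQuery) {
        return Jsoup.parse(html).select(cssQuery).size();
    }

    // Text of the first match, or null when nothing matches.
    static String firstText(String html, String cssQuery) {
        Element first = Jsoup.parse(html).selectFirst(cssQuery);
        return first == null ? null : first.text();
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);
        Element content = doc.getElementById("content"); // null if the ID is absent
        System.out.println("children of #content: " + content.children().size());
        System.out.println("paragraphs: " + count(HTML, "p"));
        System.out.println("links with href: " + count(HTML, "a[href]"));
        System.out.println("masthead: " + firstText(HTML, "div.masthead"));
    }
}
```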

Extracting Data

To get the text inside an element:

String title = doc.title(); // title text
String text = content.text(); // inner text

To get the HTML inside an element:

String html = content.html();

To get an attribute value:

String href = link.attr("href"); // href attribute
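
To make the difference between text(), html(), and attr() concrete, here is a small sketch that runs all three against the same fragment. The markup and the helper names are hypothetical; note that attr() returns an empty string, not null, when the attribute is missing.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

final class ExtractExample {
    // Hypothetical sample markup, used only for this illustration.
    static final String HTML =
        "<html><head><title>First parse</title></head><body>"
        + "<div id=\"content\"><p>Hello <b>world</b>, see <a href=\"/docs\">the docs</a>.</p></div>"
        + "</body></html>";

    static String title(String html) {
        return Jsoup.parse(html).title();
    }

    // Combined, whitespace-normalized text of the first match.
    static String text(String html, String cssQuery) {
        Element el = Jsoup.parse(html).selectFirst(cssQuery);
        return el == null ? "" : el.text();
    }

    // Markup inside the first match, without the element's own tag.
    static String innerHtml(String html, String cssQuery) {
        Element el = Jsoup.parse(html).selectFirst(cssQuery);
        return el == null ? "" : el.html();
    }

    // Attribute value of the first match; empty if the attribute is absent.
    static String attr(String html, String cssQuery, String name) {
        Element el = Jsoup.parse(html).selectFirst(cssQuery);
        return el == null ? "" : el.attr(name);
    }

    public static void main(String[] args) {
        System.out.println(title(HTML));
        System.out.println(text(HTML, "#content"));
        System.out.println(innerHtml(HTML, "p"));
        System.out.println(attr(HTML, "a", "href"));
    }
}
```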

Manipulating Elements

To remove an element:

content.remove();

To add a class to an element:

div.addClass("highlight");

To set text:

content.text("This is the new text content");

To set HTML:

content.html("<p>This is new HTML content</p>");

To append HTML to an element:

content.append("<p>Appended paragraph</p>");
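
The manipulation calls can be chained on one document. The sketch below applies them in sequence to a made-up fragment (the IDs content and ad are assumptions for this example) and shows that text() replaces all existing children, while append() parses its argument and adds it after them.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

final class ManipulateExample {
    static Document build() {
        // Hypothetical sample markup, used only for this illustration.
        Document doc = Jsoup.parse(
            "<div id=\"content\" class=\"box\"><p>Old</p></div><div id=\"ad\">Ad</div>");
        Element content = doc.getElementById("content");
        content.addClass("highlight");               // keeps the existing "box" class
        content.text("New text");                    // replaces all children with one text node
        content.append("<p>Appended paragraph</p>"); // parsed and added as the last child
        doc.getElementById("ad").remove();           // detaches the element from the tree
        return doc;
    }

    public static void main(String[] args) {
        System.out.println(build().body().html());
    }
}
```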

Outputting HTML

To get the HTML of the entire parsed document:

String html = doc.html();

To get the outer HTML of an element:

String html = div.outerHtml();

To get the inner HTML of an element:

String html = content.html();

To replace the HTML inside the body:

doc.body().html("<p>New body content</p>");

To write the document HTML to a file:

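File output = new File("output.html");
doc.outputSettings().prettyPrint(true);
// try-with-resources closes the writer even if write() throws;
// the explicit charset avoids depending on the platform default
try (PrintWriter writer = new PrintWriter(output, "UTF-8")) {
    writer.write(doc.html());
}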

Working with Forms

To get a form by ID:

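FormElement form = (FormElement) doc.getElementById("loginForm");

To fill in a form field:

form.select("input[name=username]").val("john");

To submit a form:

// submit() builds a Connection from the form's action, method, and current field values;
// the document needs a base URI so that a relative action resolves to an absolute URL
Connection.Response res = form.submit().execute();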
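
As an offline illustration of form handling, the sketch below parses a made-up login form, sets one field, and collects what would be submitted via formData(). The markup and the FormExample and filledFields names are hypothetical. Nothing is sent over the network; calling form.submit().execute() would be the next step against a real site.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.FormElement;

final class FormExample {
    // Hypothetical login form, used only for this illustration.
    static final String HTML =
        "<form id=\"loginForm\" action=\"/login\" method=\"post\">"
        + "<input type=\"text\" name=\"username\">"
        + "<input type=\"password\" name=\"password\" value=\"secret\">"
        + "<input type=\"submit\" value=\"Log in\">"
        + "</form>";

    // Fills in the username and returns the fields that would be submitted.
    static Map<String, String> filledFields(String html, String username) {
        Document doc = Jsoup.parse(html, "http://example.com/");
        FormElement form = (FormElement) doc.getElementById("loginForm");
        form.select("input[name=username]").val(username);

        Map<String, String> fields = new LinkedHashMap<>();
        for (Connection.KeyVal kv : form.formData()) { // controls without a name are skipped
            fields.put(kv.key(), kv.value());
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(filledFields(HTML, "john"));
    }
}
```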

Cookies

To get cookies from a connection:

Map<String, String> cookies =
     Jsoup.connect("http://example.com").execute().cookies();

To send cookies with a request:

Connection con = Jsoup.connect("http://example.com");
con.cookie("name", "value"); // add cookie to request
Document doc = con.get();
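
A common pattern is to carry the cookie map from one response into the next request. The sketch below only builds the request and inspects it, so it runs offline; the CookieExample and withCookies names and the cookie values are assumptions for illustration. In real use the map would come from a prior response's cookies().

```java
import java.util.Collections;
import java.util.Map;
import org.jsoup.Connection;
import org.jsoup.Jsoup;

final class CookieExample {
    // Builds a connection that will send the given cookies; no request is made here.
    static Connection withCookies(String url, Map<String, String> cookies) {
        return Jsoup.connect(url)
            .cookies(cookies)          // e.g. the map returned by a login response
            .cookie("theme", "dark");  // plus one extra name/value pair
    }

    public static void main(String[] args) {
        Map<String, String> session = Collections.singletonMap("session", "abc123");
        Connection con = withCookies("http://example.com", session);
        System.out.println(con.request().cookies()); // cookies queued for the request
    }
}
```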

Handling Exceptions

Jsoup methods can throw the following exceptions:

  • IOException: On network errors connecting to or reading from a URL.
  • HttpStatusException: When the response status is an HTTP error such as 404 or 500 (disable with ignoreHttpErrors(true)). It is a subclass of IOException.
  • Selector.SelectorParseException: For an invalid CSS query passed to select(). HTML parsing itself is lenient and does not throw on bad markup.
  • IllegalArgumentException: If an invalid argument is passed.

    To handle exceptions:

    try {
      Document doc = Jsoup.connect("http://example.com").get();
    } catch (HttpStatusException e) {
      // handle non-success status code; must come before IOException, its superclass
    } catch (IOException e) {
      // handle network error
    }
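
Because several of jsoup's exceptions share IOException as a superclass, it can help to classify them in one place. The following sketch is one possible way to do that; FetchErrors, describe, and fetch are hypothetical names, and the messages are only examples. Only describe is exercised offline; fetch shows where it would be called.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;
import org.jsoup.HttpStatusException;
import org.jsoup.Jsoup;
import org.jsoup.UnsupportedMimeTypeException;
import org.jsoup.nodes.Document;

final class FetchErrors {
    // Most specific types are checked first, since all of them are IOExceptions.
    static String describe(IOException e) {
        if (e instanceof HttpStatusException) {
            HttpStatusException h = (HttpStatusException) e;
            return "HTTP " + h.getStatusCode() + " for " + h.getUrl();
        }
        if (e instanceof UnsupportedMimeTypeException) {
            return "unsupported content type " + ((UnsupportedMimeTypeException) e).getMimeType();
        }
        if (e instanceof SocketTimeoutException) {
            return "timed out";
        }
        return "I/O error: " + e.getMessage();
    }

    // Returns the parsed page, or null after logging why the fetch failed.
    static Document fetch(String url) {
        try {
            return Jsoup.connect(url).get();
        } catch (IOException e) {
            System.err.println(describe(e));
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(describe(
            new HttpStatusException("HTTP error fetching URL", 404, "http://example.com/missing")));
    }
}
```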
    

    Best Practices

    Some best practices when using Jsoup:

  • Always validate and sanitize any user-supplied HTML to prevent XSS attacks. Use Jsoup.clean() with a Safelist.
  • Specify a base URI when parsing documents to resolve relative URLs.
  • Use element IDs where available instead of CSS or element queries for performance.
  • Fetch and parse documents in a background thread to avoid locking the UI.
  • Limit which protocols can be fetched to prevent SSRF attacks.
  • Set request timeouts to prevent hanging on slow networks or bad responses.
  • Handle exceptions and invalid input gracefully.
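
Two of the practices above, restricting protocols and setting timeouts, can be sketched as a small helper. The names SafeFetchExample, isAllowed, and configure and the specific limits are assumptions chosen for illustration. A scheme check alone does not stop requests to internal hosts, so treat this as a starting point, not a complete SSRF defense.

```java
import java.net.URI;
import java.net.URISyntaxException;
import org.jsoup.Connection;
import org.jsoup.Jsoup;

final class SafeFetchExample {
    // Accepts only absolute http/https URLs; file:, ftp:, jar: and others are rejected.
    static boolean isAllowed(String url) {
        try {
            URI uri = new URI(url);
            String scheme = uri.getScheme();
            return uri.getHost() != null
                && ("http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme));
        } catch (URISyntaxException e) {
            return false;
        }
    }

    // Builds a connection with conservative limits; no request is made here.
    static Connection configure(String url) {
        if (!isAllowed(url)) {
            throw new IllegalArgumentException("URL not allowed: " + url);
        }
        return Jsoup.connect(url)
            .timeout(5_000)               // milliseconds
            .maxBodySize(2 * 1024 * 1024) // bytes read from the response body
            .followRedirects(false);      // a redirect could point at a disallowed target
    }

    public static void main(String[] args) {
        System.out.println(isAllowed("https://example.com/"));
        System.out.println(isAllowed("file:///etc/passwd"));
        System.out.println(configure("https://example.com/").request().timeout());
    }
}
```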
    Advanced Topics

    Cleaning HTML

    To sanitize untrusted HTML:

    String unsafe =
      "<p><a href='javascript:sendSpam()'>Buy stuff</a></p>";
    
    // Safelist is named Whitelist in jsoup versions before 1.14.2
    String safe = Jsoup.clean(unsafe, Safelist.basic()); // sanitized html
    
    Document clean = Jsoup.parse(safe);
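
Here is a self-contained sketch of sanitizing with Jsoup.clean() and checking input with Jsoup.isValid(). It assumes jsoup 1.14.2 or later, where the class is called Safelist; the CleanExample, sanitize, and isSafe names are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

final class CleanExample {
    static final String UNSAFE =
        "<p><a href='javascript:sendSpam()'>Buy stuff</a></p><script>alert(1)</script>";

    // Drops every tag and attribute that Safelist.basic() does not allow.
    static String sanitize(String untrusted) {
        return Jsoup.clean(untrusted, Safelist.basic());
    }

    // True only if the input already conforms to the safelist.
    static boolean isSafe(String html) {
        return Jsoup.isValid(html, Safelist.basic());
    }

    public static void main(String[] args) {
        System.out.println(sanitize(UNSAFE));
        System.out.println(isSafe(UNSAFE));
    }
}
```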
    

    POST Requests

    To make a POST request:

    Connection con = Jsoup.connect("http://example.com");
    con.data("name", "value"); // set POST data
    Document doc = con.post();
    

    Multi-part Forms

    To upload files:

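    Connection con = Jsoup.connect("http://example.com");
    File img = new File("/path/to/img.jpg");
    con.data("name", "John");
    // a file field takes a filename and an InputStream, which makes the request multipart
    try (InputStream in = new FileInputStream(img)) {
      con.data("photo", img.getName(), in); // upload file
      Document doc = con.post();
    }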
    

    Connections

    To configure a Connection and execute it for the raw response:

    Connection con = Jsoup.connect("http://example.com");
    con.header("X-Custom", "value");
    con.cookie("key", "value");
    con.timeout(3000); // milliseconds
    Connection.Response res = con.execute();
    

    DOM Traversal

    To get parent elements:

    Element parent = element.parent();
    

    To get sibling elements:

    Element nextSibling = element.nextElementSibling();
    Element prevSibling = element.previousElementSibling();
    

    To get children:

    Elements children = element.children();
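
The sibling accessors return null at either end of a sibling list, which is easy to overlook. This sketch walks a made-up three-item list to show it; TraverseExample and neighbours are hypothetical names.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

final class TraverseExample {
    // Hypothetical sample markup, used only for this illustration.
    static final String HTML =
        "<ul id=\"menu\"><li>One</li><li id=\"mid\">Two</li><li>Three</li></ul>";

    // Returns "previous|current|next", with "-" where a sibling does not exist.
    static String neighbours(String html, String cssQuery) {
        Element el = Jsoup.parse(html).selectFirst(cssQuery);
        Element prev = el.previousElementSibling(); // null for the first child
        Element next = el.nextElementSibling();     // null for the last child
        return (prev == null ? "-" : prev.text())
            + "|" + el.text()
            + "|" + (next == null ? "-" : next.text());
    }

    public static void main(String[] args) {
        Element mid = Jsoup.parse(HTML).getElementById("mid");
        System.out.println(mid.parent().id());              // the enclosing list
        System.out.println(mid.parent().children().size()); // its element children
        System.out.println(neighbours(HTML, "#mid"));
        System.out.println(neighbours(HTML, "li"));         // first list item
    }
}
```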
    

    Modifying the DOM

    To create new elements:

    Element img = new Element("img");
    img.attr("src", "example.png");
    

    To add elements:

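    parent.appendChild(newEl);  // add as the last child
    parent.prependChild(newEl); // add as the first child (moves newEl if it is already in the tree)
    

    To insert elements at a position:

    parent.insertChildren(0, newEl); // returns parent, not the inserted element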
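
The effect of these calls on child order is easiest to see by listing the children afterwards. The sketch below is one way to check it; the gallery markup and the ModifyExample names are assumptions for illustration. The second method shows that appending an element already in the tree moves it instead of copying it.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

final class ModifyExample {
    static Element buildGallery() {
        Document doc = Jsoup.parse("<div id=\"gallery\"><p>Caption</p></div>");
        Element parent = doc.getElementById("gallery");
        Element img = new Element("img").attr("src", "example.png");
        parent.prependChild(img);                    // img, p
        parent.appendElement("span").text("end");    // img, p, span
        parent.insertChildren(1, new Element("hr")); // img, hr, p, span
        return parent;
    }

    // Comma-separated tag names of the element's children, in document order.
    static String childTags(Element parent) {
        List<String> names = new ArrayList<>();
        for (Element child : parent.children()) {
            names.add(child.tagName());
        }
        return String.join(",", names);
    }

    // Appending a node that is already attached moves it to the end.
    static String afterMovingFirstToEnd() {
        Element parent = buildGallery();
        parent.appendChild(parent.child(0));
        return childTags(parent);
    }

    public static void main(String[] args) {
        System.out.println(childTags(buildGallery()));
        System.out.println(afterMovingFirstToEnd());
    }
}
```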
    

    Working with XML

    Jsoup can parse and manipulate XML documents with the XML parser:

    String xml = "<doc xmlns:x='http://example.org'>" +
                "<x:el>Text</x:el></doc>";
    
    Parser xmlParser = Parser.xmlParser();
    Document doc = xmlParser.parseInput(xml, "");
    
    Element el = doc.select("x|el").first();
    

    The XML parser supports:

  • Namespace prefixes: select with the | delimiter, as in x|el
  • XPath queries through selectXpath() (jsoup 1.14.3 and later)
  • Retrieving namespace-prefixed attributes
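
A runnable version of the namespace selection might look like the sketch below. It uses the three-argument Jsoup.parse overload with Parser.xmlParser(); the XML sample and the XmlExample and firstText names are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

final class XmlExample {
    // Hypothetical sample XML, used only for this illustration.
    static final String XML =
        "<doc xmlns:x=\"http://example.org\"><x:el>Text</x:el><plain>Other</plain></doc>";

    // Text of the first element matching the query, or null if none matches.
    static String firstText(String xml, String cssQuery) {
        Document doc = Jsoup.parse(xml, "", Parser.xmlParser());
        Element el = doc.selectFirst(cssQuery);
        return el == null ? null : el.text();
    }

    public static void main(String[] args) {
        System.out.println(firstText(XML, "x|el")); // prefix and tag separated by |
        System.out.println(firstText(XML, "plain"));
    }
}
```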
    Troubleshooting

    Handling bad markup

    The HTML parser is lenient by default and repairs bad markup as it goes; there is no separate relaxed mode to enable. To see what it had to fix, turn on error tracking:

    Parser parser = Parser.htmlParser()
      .setTrackErrors(500); // keep up to 500 errors
    Document doc = Jsoup.parse(html, "", parser);
    List<ParseError> errors = parser.getErrors();
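
To see how jsoup treats malformed input in practice, the sketch below parses two deliberately broken fragments: one with a stray closing tag, one with unclosed paragraphs. ParseErrorExample and its methods are hypothetical names, and the exact error messages are not asserted since they may differ between versions.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.parser.ParseError;
import org.jsoup.parser.Parser;

final class ParseErrorExample {
    // Parses html with error tracking enabled and returns the recorded messages.
    static List<String> errorsIn(String html) {
        Parser parser = Parser.htmlParser().setTrackErrors(10); // keep at most 10 errors
        parser.parseInput(html, "");
        List<String> messages = new ArrayList<>();
        for (ParseError error : parser.getErrors()) {
            messages.add(error.getPosition() + ": " + error.getErrorMessage());
        }
        return messages;
    }

    // Number of <p> elements once the parser has repaired the markup.
    static int paragraphCount(String html) {
        return Jsoup.parse(html).select("p").size();
    }

    public static void main(String[] args) {
        for (String message : errorsIn("<p>One</p></div>")) { // stray closing tag
            System.out.println(message);
        }
        System.out.println(paragraphCount("<p>One<p>Two"));   // unclosed paragraphs
    }
}
```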
    

    Resolving relative URLs

    Always provide a base URI when parsing:

    Document doc = Jsoup.parse(html, "https://example.com/");
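
The base URI matters as soon as absUrl() is called. This small sketch compares the result with and without one; BaseUriExample and absoluteHref are hypothetical names, and the URLs are placeholders.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

final class BaseUriExample {
    // Absolute form of the first link's href, or "" if it cannot be resolved.
    static String absoluteHref(String html, String baseUri) {
        Element link = Jsoup.parse(html, baseUri).selectFirst("a[href]");
        return link == null ? "" : link.absUrl("href");
    }

    public static void main(String[] args) {
        String html = "<a href=\"/about\">About</a>";
        System.out.println(absoluteHref(html, "https://example.com/docs/"));
        System.out.println(absoluteHref(html, "").isEmpty()); // no base URI to resolve against
    }
}
```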
    

    Encoding issues

    A String is already decoded, so a charset only applies when parsing bytes from a file or stream:

    Document doc = Jsoup.parse(new File("/path/to/input.html"), "UTF-8");
    

    Tips and Best Practices

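    Select vs children

    select searches all descendants. Jsoup has no find method; to match only direct children, start the query with the > combinator:

    el.select(".item");   // every descendant with class "item"
    el.select("> .item"); // only direct children with class "item"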
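
The difference between a descendant search and a children-only search shows up with nested markup. The sketch below counts both on a made-up nested list, using a query that starts with the > combinator for the children-only case; ScopeExample and its method names are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

final class ScopeExample {
    // Hypothetical nested list: two top-level items, one of which contains a nested item.
    static final String HTML =
        "<ul id=\"list\">"
        + "<li class=\"item\">A<ul><li class=\"item\">A1</li></ul></li>"
        + "<li class=\"item\">B</li>"
        + "</ul>";

    // Matches at any depth below #list.
    static int descendantCount(String html) {
        Element list = Jsoup.parse(html).getElementById("list");
        return list.select(".item").size();
    }

    // Matches only elements whose parent is #list itself.
    static int directChildCount(String html) {
        Element list = Jsoup.parse(html).getElementById("list");
        return list.select("> .item").size();
    }

    public static void main(String[] args) {
        System.out.println(descendantCount(HTML));
        System.out.println(directChildCount(HTML));
    }
}
```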
    

    Avoid full re-parse

    Reuse Documents and avoid re-parsing full HTML:

    Document doc = ... // parse once
    
    Elements links = doc.select("a"); // query many times
    

    Parser callback

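    Jsoup does not provide a ParseCallback interface. To rewrite elements, parse first and then walk the tree with a NodeVisitor:

    Document doc = Jsoup.parse(html);
    doc.traverse(new NodeVisitor() {
    
      @Override
      public void head(Node node, int depth) {
        // called when a node is first reached, before its children
        if (node instanceof Element && ((Element) node).tagName().equals("img")) {
          ((Element) node).attr("src", "placeholder.jpg"); // rewrite
        }
      }
    
      @Override
      public void tail(Node node, int depth) {
        // called after the node's children have been visited; nothing to do here
      }
    
    });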
    

    Integration

    Spring MVC

    Jsoup does not ship a Spring ViewResolver. In a Spring MVC application it is used as a plain library, for example by exposing a shared Safelist bean for sanitizing submitted HTML with Jsoup.clean():

    @Bean
    Safelist htmlSafelist() {
        return Safelist.basic();
    }
    

    Async responses

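    Jsoup has no built-in async API. Run the blocking call on another thread, for example with CompletableFuture:

    CompletableFuture.supplyAsync(() -> {
      try {
        return Jsoup.connect(url).get();
      } catch (IOException e) {
        throw new UncheckedIOException(e);
      }
    }).thenAccept(doc -> {
      // process doc
    });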
    

    Large documents

    Cap how much of the response is read, and consume the body as a stream instead of building a Document:

    InputStream body = Jsoup.connect(url).maxBodySize(1 * 1024 * 1024).execute().bodyStream();
    

    Form Handling

    JSON serialization

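    FormElement has no serialize() method. formData() returns the submittable fields as key/value pairs, which can be collected into a Map and handed to a JSON library:

    Map<String, String> fields = form.formData().stream()
      .collect(Collectors.toMap(kv -> kv.key(), kv -> kv.value(), (a, b) -> b));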
    

    File uploads

    Upload files in multi-part forms:

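    Connection con = Jsoup.connect("http://example.com");
    File img = new File("/path/img.jpg");
    try (InputStream in = new FileInputStream(img)) {
      con.data("profile_pic", img.getName(), in); // filename plus stream
      con.post();
    }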
    

    Request interceptor

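    Connection has no requestInterceptor() method. Set headers on the Connection before executing it, and use con.request() to inspect what will be sent:

    Connection con = Jsoup.connect("http://example.com");
    con.header("Authorization", "token");       // add a header
    Connection.Request req = con.request();     // the pending request
    String auth = req.header("Authorization");  // inspect it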
    

    Thread Safety

    A Document is not thread safe. Confine it to one thread, or guard every access with the same lock:

    Document doc = ... // parse
    
    synchronized(doc) {
      Elements els = doc.select(".item"); // read
    }
    

    Use doc.clone() to give each thread its own independent deep copy.
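
One way to avoid sharing a Document across threads is to hand each worker a deep copy made with clone(). The sketch below shows that edits to the copy leave the original untouched; CloneExample and its method names are hypothetical, and the single-threaded demo stands in for what each worker would do.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

final class CloneExample {
    static Document original() {
        return Jsoup.parse("<p class=\"item\">Shared</p>");
    }

    // Edits a deep copy and returns the original's text, which stays unchanged.
    static String originalTextAfterEditingCopy() {
        Document original = original();
        Document copy = original.clone();                 // independent tree
        copy.select(".item").first().text("Changed in copy");
        return original.select(".item").text();
    }

    // The same edit, observed on the copy.
    static String copyTextAfterEdit() {
        Document copy = original().clone();
        copy.select(".item").first().text("Changed in copy");
        return copy.select(".item").text();
    }

    public static void main(String[] args) {
        System.out.println(originalTextAfterEditingCopy());
        System.out.println(copyTextAfterEdit());
    }
}
```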

    Extensions

    Custom Parser

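    Parser is not designed for subclassing, since its tree builders are internal. A wrapper that delegates to a built-in parser and post-processes the result is a safer pattern:

    public class MyParser {
    
      private final Parser delegate = Parser.htmlParser();
    
      public Document parseInput(String html, String baseUri) {
        Document doc = delegate.parseInput(html, baseUri);
        // custom post-processing goes here
        return doc;
      }
    
    }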
    

    Jsoup + JFlex

    Jsoup has no built-in JFlex integration. A JFlex-generated lexer would need to build the tree itself through the public DOM API, such as new Element(...) and appendChild(...).

    Testing

    Verify elements exist:

    Assert.assertEquals(1, doc.select("h1").size());
    

    Check attribute values:

    Assert.assertEquals("text", doc.select(".title").attr("content"));
    

    FAQ

    OuterHtml vs Html?

    outerHtml includes the element tag, html is just inner.

    Select vs children?

    select searches all descendants; children() or a query starting with > returns only direct children.

    Prevent script execution?

    Sanitize untrusted HTML with Jsoup.clean() and a Safelist that allows only the tags you need.

    Relative URL resolution?

    Always provide a base URL to Jsoup.parse().

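    Modifying the DOM

    DOM edits can introduce vulnerabilities. Jsoup never executes JavaScript itself, but any script you add runs in the browsers that load the output.

    Avoid appending script elements from untrusted sources:

    Element script = doc.body().appendElement("script");
    script.attr("src", "malicious.js"); // loads in the reader's browser
    

    Avoid writing untrusted text into script elements:

    doc.outputSettings().syntax(Document.OutputSettings.Syntax.html);
    doc.select("script").html("(() => alert('XSS'))()"); // runs when the page is rendered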
    

    Appendix

    CSS Selectors

    Jsoup supports most CSS selectors, plus extensions such as :contains(text) and :matches(regex).

    Parser Options

    Options to customize parsing - docs.

    Method Reference

    Complete Jsoup API reference.
