Jsoup is a Java library for working with real-world HTML. It provides a convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods.
Parsing HTML
To parse HTML from a string:
String html = "<html><head><title>First parse</title></head>" +
"<body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);
To parse HTML from a URL:
Document doc = Jsoup.connect("http://example.com").get();
To parse HTML from a file:
File input = new File("/path/to/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
Selecting Elements
To select an element by ID:
Element content = doc.getElementById("content");
To select elements by tag name:
Elements paragraphs = doc.getElementsByTag("p");
To select elements by CSS query:
Elements links = doc.select("a[href]"); // a elements with href attrib
To select the first matching element:
Element masthead = doc.select("div.masthead").first();
Extracting Data
To get the text inside an element:
String title = doc.title(); // title text
String text = content.text(); // inner text
To get the HTML inside an element:
String html = content.html();
To get an attribute value:
String href = link.attr("href"); // href attribute
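The selection and extraction calls above can be combined into one small program. This is a minimal sketch, assuming jsoup is on the classpath; the class name LinkExtractor, the helper describeLinks, and the sample markup are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of jsoup.
final class LinkExtractor {

    // Returns one "text -> href" line per <a> element that has an href attribute.
    static List<String> describeLinks(String html) {
        Document doc = Jsoup.parse(html);
        List<String> lines = new ArrayList<>();
        for (Element link : doc.select("a[href]")) {
            lines.add(link.text() + " -> " + link.attr("href"));
        }
        return lines;
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Links</title></head><body>"
                + "<p><a href='/docs'>Docs</a> and <a href='/faq'>FAQ</a></p>"
                + "<a>no href</a></body></html>";
        System.out.println(Jsoup.parse(html).title());
        describeLinks(html).forEach(System.out::println);
    }
}
```

The anchor without an href is skipped because the a[href] selector only matches elements that carry that attribute.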
Manipulating Elements
To remove an element:
content.remove();
To add a class to an element:
div.addClass("highlight");
To set text:
content.text("This is the new text content");
To set HTML:
content.html("<p>This is new HTML content</p>");
To append HTML to an element:
content.append("<p>Appended paragraph</p>");
Outputting HTML
To get the HTML of the entire parsed document:
String html = doc.html();
To get the outer HTML of an element:
String html = div.outerHtml();
To get the HTML representation of an element:
String html = content.html();
To update the document HTML:
doc.html("<html><head></head><body></body></html>");
To write the document HTML to a file:
File output = new File("output.html");
doc.outputSettings().prettyPrint(true);
PrintWriter writer = new PrintWriter(output);
writer.write(doc.html());
writer.close();
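A slightly more defensive variant of the file-writing step closes the writer even if the write fails and fixes the output encoding instead of relying on the platform default. The sketch below uses only the JDK; the class name HtmlFileWriter is hypothetical, and the html argument is assumed to be the string returned by doc.html().

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class, not part of jsoup.
final class HtmlFileWriter {

    // Writes the given HTML string to target as UTF-8.
    static void write(String html, Path target) throws IOException {
        // try-with-resources closes the writer on success and on failure
        try (Writer writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            writer.write(html);
        }
    }

    public static void main(String[] args) throws IOException {
        Path target = Files.createTempFile("output", ".html");
        write("<html><head></head><body><p>Hello</p></body></html>", target);
        System.out.println("Wrote " + Files.size(target) + " bytes to " + target);
    }
}
```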
Working with Forms
To get a form by ID:
FormElement form = (FormElement) doc.getElementById("loginForm"); // <form> elements parse as FormElement
To fill in a form field:
form.select("input[name=username]").val("john");
To submit a form:
Connection.Response res = form.submit() // Connection built from the form's action, method, and data
    .execute();
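Form filling can be tried without any network access by inspecting the data a form would send. In jsoup the form class is FormElement, and formData() returns the key/value pairs of its submittable controls. The following is a sketch under those assumptions; the class name FormFiller and the sample login form are invented for illustration.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.FormElement;

import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper class, not part of jsoup.
final class FormFiller {

    // Sets the "username" field of the form with id "loginForm" and
    // returns the data the form would submit, in document order.
    static Map<String, String> fill(String html, String user) {
        Document doc = Jsoup.parse(html, "http://example.com/");
        FormElement form = (FormElement) doc.getElementById("loginForm");
        form.select("input[name=username]").val(user);

        Map<String, String> data = new LinkedHashMap<>();
        for (Connection.KeyVal kv : form.formData()) {
            data.put(kv.key(), kv.value());
        }
        return data;
    }

    public static void main(String[] args) {
        String html = "<form id='loginForm' action='/login' method='post'>"
                + "<input name='username'>"
                + "<input type='password' name='password' value='secret'>"
                + "</form>";
        System.out.println(fill(html, "john"));
    }
}
```

Passing a base URI to Jsoup.parse matters here because form.submit() needs an absolute action URL.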
Cookies
To get cookies from a connection:
Map<String, String> cookies =
    Jsoup.connect("http://example.com").execute().cookies();
To send cookies with a request:
Connection con = Jsoup.connect("http://example.com");
con.cookie("name", "value"); // add cookie to request
Document doc = con.get();
Handling Exceptions
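Connection methods such as get() and execute() throw IOException; HttpStatusException, UnsupportedMimeTypeException, and SocketTimeoutException are subclasses that signal more specific failures. Parsing itself does not throw a checked exception.
To handle exceptions, catch the specific subclasses before IOException:
try {
    Document doc = Jsoup.connect("http://example.com").get();
} catch (HttpStatusException e) {
    // handle non-success status code (see e.getStatusCode())
} catch (SocketTimeoutException e) {
    // handle connect or read timeout
} catch (IOException e) {
    // handle any other network error
}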
Best Practices
Some best practices when using Jsoup:
Advanced Topics
Cleaning HTML
To sanitize untrusted HTML:
String unsafe =
"<p><a href='javascript:sendSpam()'>Buy stuff</a></p>";
// Safelist is named Whitelist in jsoup versions before 1.14.2
String safe = Jsoup.clean(unsafe, Safelist.basic()); // sanitized html
Document clean = Jsoup.parse(safe);
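As a runnable illustration of sanitizing, the sketch below wraps Jsoup.clean and Jsoup.isValid with the basic safelist. It assumes jsoup 1.14.2 or later (where the class is called Safelist); the class name HtmlSanitizer and its method names are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

// Hypothetical helper class, not part of jsoup.
final class HtmlSanitizer {

    // Returns body HTML with everything outside the basic safelist removed.
    static String sanitize(String untrusted) {
        return Jsoup.clean(untrusted, Safelist.basic());
    }

    // True only if the input already conforms to the basic safelist.
    static boolean isAlreadySafe(String bodyHtml) {
        return Jsoup.isValid(bodyHtml, Safelist.basic());
    }

    public static void main(String[] args) {
        String unsafe = "<p><a href='javascript:sendSpam()'>Buy stuff</a></p>";
        System.out.println(sanitize(unsafe));
        System.out.println(isAlreadySafe("<script>alert(1)</script>"));
    }
}
```

The basic safelist only permits a small set of text-formatting tags and restricts link protocols, so the javascript: URL does not survive cleaning while the link text does.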
POST Requests
To make a POST request:
Connection con = Jsoup.connect("http://example.com");
con.data("name", "value"); // set POST data
Document doc = con.post();
Multi-part Forms
To upload files:
Connection con = Jsoup.connect("http://example.com");
File img = new File("/path/to/img.jpg");
con.data("name", "John");
con.data("photo", img.getName(), new FileInputStream(img)); // upload file: key, filename, stream
Document doc = con.post();
Connections
To get a low-level Connection:
Connection con = Jsoup.connect("http://example.com");
con.header("X-Custom", "value");
con.cookie("key", "value");
con.timeout(3000); // milliseconds
Connection.Response res = con.execute();
DOM Traversal
To get parent elements:
Element parent = element.parent();
To get sibling elements:
Element nextSibling = element.nextElementSibling();
Element prevSibling = element.previousElementSibling();
To get children:
Elements children = element.children();
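The traversal calls above can be exercised on a small list. This is a minimal sketch assuming jsoup is on the classpath; TraversalDemo, describe, and the sample markup are illustrative names, not part of the library.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class, not part of jsoup.
final class TraversalDemo {

    // Summarizes the parent, neighbors, and sibling count of the element with the given id.
    static String describe(String html, String id) {
        Document doc = Jsoup.parse(html);
        Element el = doc.getElementById(id);
        Element prev = el.previousElementSibling(); // null for a first child
        Element next = el.nextElementSibling();     // null for a last child
        return "parent=" + el.parent().tagName()
                + " prev=" + (prev == null ? "none" : prev.id())
                + " next=" + (next == null ? "none" : next.id())
                + " children=" + el.parent().children().size();
    }

    public static void main(String[] args) {
        String html = "<ul><li id='a'>A</li><li id='b'>B</li><li id='c'>C</li></ul>";
        System.out.println(describe(html, "b")); // parent=ul prev=a next=c children=3
        System.out.println(describe(html, "a")); // parent=ul prev=none next=b children=3
    }
}
```

The sibling methods return null at either end of the child list, so callers should check before dereferencing.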
Modifying the DOM
To create new elements:
Element img = new Element("img");
img.attr("src", "example.png");
To add elements:
parent.appendChild(newEl);
parent.prependChild(newEl);
To insert elements:
parent.insertChildren(0, newEl); // insert at child index 0; returns parent for chaining
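To see how append, prepend, and indexed insertion affect child order, the sketch below builds up a short list. It assumes jsoup is on the classpath; ListBuilder and the sample markup are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class, not part of jsoup.
final class ListBuilder {

    // Starts from a list containing only "B" and adds three more items.
    static String build() {
        Document doc = Jsoup.parse("<ul id='menu'><li>B</li></ul>");
        Element menu = doc.getElementById("menu");

        menu.appendChild(new Element("li").text("D"));        // order: B, D
        menu.prependChild(new Element("li").text("A"));       // order: A, B, D
        menu.insertChildren(2, new Element("li").text("C"));  // order: A, B, C, D

        return String.join(",", menu.children().eachText());
    }

    public static void main(String[] args) {
        System.out.println(build()); // A,B,C,D
    }
}
```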
Working with XML
Jsoup can parse and manipulate XML documents with the XML parser:
String xml = "<doc xmlns:x='http://example.org'>" +
"<x:el>Text</x:el></doc>";
Parser xmlParser = Parser.xmlParser();
Document doc = xmlParser.parseInput(xml, "");
Element el = doc.select("x|el").first();
The XML parser supports:
Troubleshooting
Handling bad markup
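Jsoup's HTML parser is lenient by default and always builds a document from bad markup. To see what it had to repair, enable error tracking:
Parser parser = Parser.htmlParser().setTrackErrors(500); // record up to 500 errors
Document doc = Jsoup.parse(html, "", parser);
List<ParseError> errors = parser.getErrors(); // problems found while parsing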
Resolving relative URLs
Always provide a base URI when parsing:
Document doc = Jsoup.parse(html, "https://example.com/");
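Once a base URI is set, absUrl turns relative attribute values into absolute URLs (the abs:href attribute key does the same). A minimal sketch, assuming jsoup is on the classpath; AbsoluteLinks and the sample links are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of jsoup.
final class AbsoluteLinks {

    // Resolves every link's href against the given base URI.
    static List<String> resolve(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        List<String> urls = new ArrayList<>();
        for (Element link : doc.select("a[href]")) {
            urls.add(link.absUrl("href")); // empty string if it cannot be resolved
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<a href='page.html'>1</a>"
                + "<a href='../about.html'>2</a>"
                + "<a href='https://other.example/x'>3</a>";
        resolve(html, "https://example.com/docs/").forEach(System.out::println);
    }
}
```

Without a base URI, absUrl returns an empty string for relative links, which is an easy way to lose URLs silently.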
Encoding issues
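When parsing a file or stream, pass the character set explicitly (for a String, the second argument of Jsoup.parse is the base URI, not a charset):
Document doc = Jsoup.parse(input, "UTF-8"); // input is a File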
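Character sets only come into play when jsoup decodes bytes, so the effect is easiest to see with a stream. The sketch below decodes the same bytes with a caller-supplied charset; it assumes jsoup is on the classpath, and CharsetDemo and paragraphText are hypothetical names.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of jsoup.
final class CharsetDemo {

    // Decodes the bytes with the named charset and returns the text of all <p> elements.
    static String paragraphText(byte[] bytes, String charsetName) throws IOException {
        try (InputStream in = new ByteArrayInputStream(bytes)) {
            Document doc = Jsoup.parse(in, charsetName, "");
            return doc.select("p").text();
        }
    }

    public static void main(String[] args) throws IOException {
        // "café" encoded as ISO-8859-1; the é is a single byte that is invalid on its own in UTF-8
        byte[] latin1 = "<p>caf\u00e9</p>".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(paragraphText(latin1, "ISO-8859-1")); // decoded correctly
        System.out.println(paragraphText(latin1, "UTF-8"));      // accented letter is garbled
    }
}
```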
Tips and Best Practices
select vs selectFirst
el.select(".item"); // every match in el's subtree, as Elements
el.selectFirst(".item"); // first match only, or null if none (Element has no find() method)
Avoid full re-parse
Reuse Documents and avoid re-parsing full HTML:
Document doc = ... // parse once
Elements links = doc.select("a"); // query many times
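Node visitor
Jsoup has no parse-time callback. To rewrite nodes, parse first and then traverse the document with a NodeVisitor:
Document doc = Jsoup.parse(html);
doc.traverse(new NodeVisitor() {
    @Override
    public void head(Node node, int depth) {
        if (node instanceof Element && ((Element) node).tagName().equals("img")) {
            node.attr("src", "placeholder.jpg"); // rewrite
        }
    }
    @Override
    public void tail(Node node, int depth) { } // nothing to do when leaving a node
});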
Integration
Spring MVC
Jsoup itself ships no Spring classes such as a ViewResolver. Use it from an ordinary bean, for example a shared Safelist for sanitizing input in controllers:
@Bean
Safelist htmlSafelist() {
    return Safelist.basic(); // inject and pass to Jsoup.clean(input, safelist)
}
Async responses
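Jsoup's Connection is synchronous. To handle the parsed Document without blocking the caller, run the fetch on another thread, for example with CompletableFuture:
CompletableFuture<Document> future = CompletableFuture.supplyAsync(() -> {
    try {
        return Jsoup.connect(url).get();
    } catch (IOException e) {
        throw new UncheckedIOException(e);
    }
});
future.thenAccept(doc -> {
    // process doc
});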
Large documents
Avoid loading the entire document into memory:
Jsoup.connect(url).maxBodySize(1 * 1024 * 1024).execute().bodyStream();
Form Handling
JSON serialization
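Jsoup has no JSON serializer. Collect the form's key/value pairs with formData() and hand the map to a JSON library:
Map<String, String> fields = form.formData().stream()
    .collect(Collectors.toMap(Connection.KeyVal::key, Connection.KeyVal::value, (a, b) -> b)); // last value wins on duplicate keys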
File uploads
Upload files in multi-part forms:
Connection con = Jsoup.connect("http://example.com");
File img = new File("/path/img.jpg");
con.data("profile_pic", img.getName(), new FileInputStream(img)); // key, filename, stream
Request interceptor
Jsoup has no request interceptor hook. Inspect and modify the request through the Connection before executing it:
con.header("Authorization", "token"); // add or replace a header
Connection.Request req = con.request(); // inspect req.url(), req.headers(), req.data()
Thread Safety
A Document is not thread-safe. Confine it to one thread, or guard every access with the same lock:
Document doc = ... // parse
synchronized (doc) {
    Elements els = doc.select(".item"); // read
}
Use a
Extensions
Custom Parser
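Parser has no no-argument constructor, so wrapping a stock parser is simpler than subclassing it:
public class MyParser {
    private final Parser delegate = Parser.htmlParser();

    public Document parseInput(String html, String baseUri) {
        Document doc = delegate.parseInput(html, baseUri);
        // customize doc here
        return doc;
    }
}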
Jsoup + JFlex
Use JFlex to generate lexical parsers for HTML, integrated with Jsoup's DOM construction.
Testing
Verify elements exist:
Assert.assertEquals(1, doc.select("h1").size());
Check attribute values:
Assert.assertEquals("text", doc.select(".title").attr("content"));
FAQ
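OuterHtml vs Html?
html() returns an element's inner HTML; outerHtml() also includes the element's own tag.
Select vs find?
select() returns every match as Elements and selectFirst() returns only the first; Element has no find() method.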
Prevent script execution?
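Jsoup never executes scripts itself. To stop them from reaching a browser, sanitize untrusted HTML with Jsoup.clean and a Safelist such as Safelist.basic(), or build a custom Safelist of allowed tags.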
Relative URL resolution?
Always provide a base URL to
Modifying the DOM
DOM changes can introduce vulnerabilities when untrusted input reaches them, for example by adding a script element:
Element script = doc.appendElement("script");
script.attr("src", "malicious.js");
Jsoup does not evaluate scripts, but injected script content will run in any browser that renders the output:
doc.outputSettings().syntax(Document.OutputSettings.Syntax.html);
doc.select("script").html("(() => alert('XSS'))()");
Appendix
CSS Selectors
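Jsoup supports most CSS selectors plus its own extensions such as :contains(text) and :matches(regex); see the Selector API documentation for the full list.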
Parser Options
Options to customize parsing - docs.
Method Reference
Complete Jsoup API reference.