Gumbo is an HTML5 parsing library written in pure C99; the snippets here call it from C++. It parses HTML into a tree structure for traversal and extraction.
Getting Started
Include:
#include "gumbo.h"
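Parse:
GumboOutput* output = gumbo_parse(html); // html is a NUL-terminated UTF-8 const char*
gumbo_parse returns a tree even for malformed input; parse errors, if any, are collected in output->errors.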
Query:
Get document node:
GumboNode* doc = output->document; // output->root is the <html> element
Cleanup:
gumbo_destroy_output(&kGumboDefaultOptions, output);
DOM Types
GumboNode:
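Tagged struct used for every node. Its type field says which member of the union v is valid (v.document, v.element, or v.text).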
GumboElement:
Element node, contains tag, attributes, and children.
GumboText:
Text node, contains textual content.
GumboAttribute:
Attribute with name and value.
GumboVector:
Array-like container (data, length, capacity) of void* items, used for child nodes and attributes.
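Selecting Nodes
Note: upstream gumbo.h declares no lookup functions. The three helpers below are assumed to be wrappers that you write yourself, or take from a wrapper library, by walking the tree.
By tag: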
GumboNode* node = gumbo_get_element_by_tag(doc, GUMBO_TAG_DIV);
By id:
GumboNode* node = gumbo_get_element_by_id(doc, "someId");
Query selector:
GumboNode* node = gumbo_query_selector(doc, ".someClass");
Children (for element nodes, node->type == GUMBO_NODE_ELEMENT):
GumboVector* children = &node->v.element.children;
Iterate children:
for (unsigned int i = 0; i < children->length; ++i) {
    GumboNode* child = static_cast<GumboNode*>(children->data[i]);
    // do something with child
}
Traversing
Parent:
GumboNode* parent = node->parent;
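Next sibling:
GumboNode stores no sibling pointers; index into the parent's child vector instead (shown for an element parent; a document parent uses v.document.children).
const GumboVector* siblings = &node->parent->v.element.children;
size_t i = node->index_within_parent;
GumboNode* next = (i + 1 < siblings->length) ? static_cast<GumboNode*>(siblings->data[i + 1]) : NULL;
Previous sibling:
GumboNode* prev = (i > 0) ? static_cast<GumboNode*>(siblings->data[i - 1]) : NULL;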
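Where a tree keeps no sibling pointers (upstream GumboNode exposes parent and index_within_parent), siblings can be derived from the parent's child list. The following is a minimal sketch of that technique. It uses a simplified stand-in Node type so that it compiles without the library; Node, Append, and Sibling are hypothetical names, not part of Gumbo.

```cpp
#include <cstddef>
#include <vector>

// Simplified stand-in for a parse-tree node (NOT Gumbo's real struct).
// It keeps only the two fields the technique relies on, under the same
// names GumboNode uses: parent and index_within_parent.
struct Node {
    Node* parent = nullptr;
    std::size_t index_within_parent = 0;
    std::vector<Node*> children;
};

// Attach child as the last child of parent and record its position.
void Append(Node& parent, Node& child) {
    child.parent = &parent;
    child.index_within_parent = parent.children.size();
    parent.children.push_back(&child);
}

// Return the sibling `offset` positions away (+1 next, -1 previous),
// or nullptr when there is no parent or the position is out of range.
Node* Sibling(const Node& node, long offset) {
    if (node.parent == nullptr) return nullptr;
    const std::vector<Node*>& sibs = node.parent->children;
    const long target = static_cast<long>(node.index_within_parent) + offset;
    if (target < 0 || target >= static_cast<long>(sibs.size())) return nullptr;
    return sibs[static_cast<std::size_t>(target)];
}
```

With real Gumbo nodes the same bounds checks apply, with the parent's GumboVector taking the place of the std::vector used here.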
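Manipulating Nodes
Note: upstream Gumbo treats the parse tree as read-only (mutability is a stated non-goal) and gumbo.h declares none of the functions in this section; they are assumed to come from a fork or a wrapper layer.
Create element: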
GumboNode* div = gumbo_create_element(GUMBO_TAG_DIV);
Append child:
gumbo_append_child(doc, div);
Insert child:
GumboNode* p = gumbo_insert_before(parent, child, NULL); // after
Remove child:
gumbo_remove_from_parent(child);
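Inner HTML:
Upstream Gumbo has no inner-HTML getter or setter. An element's source markup is exposed through the GumboStringPiece fields v.element.original_tag and v.element.original_end_tag, which point into the input buffer.
GumboStringPiece name = node->v.element.original_tag;
gumbo_tag_from_original_text(&name); // name now spans only the tag name, without brackets or attributes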
Attributes
Get attribute (element nodes only; returns NULL when the attribute is absent):
const GumboAttribute* attr = gumbo_get_attribute(&node->v.element.attributes, "id");
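Set attribute (gumbo_add_attribute is not declared in upstream gumbo.h; assumed to be a wrapper):
GumboAttribute attr = {}; // zero-initialize the fields not set below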
attr.name = "href";
attr.value = "link.html";
gumbo_add_attribute(node, &attr);
Remove attribute:
gumbo_remove_attribute(node, "class");
Text Nodes
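Extract text:
std::string text = textNode->v.text.text; // valid when textNode->type == GUMBO_NODE_TEXT
Create text node (gumbo_create_text_node is not part of upstream's public gumbo.h; assumed to be a wrapper):
GumboNode* text = gumbo_create_text_node(parser, "Text");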
Outputting HTML
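To HTML (upstream gumbo.h ships no serializer; gumbo_normalize_html and gumbo_stringify are assumed to be helpers along the lines of the serialize example in the Gumbo source tree):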
std::string html = gumbo_normalize_html(output->root, &kGumboDefaultOptions);
To string:
std::string html = gumbo_stringify(output);
Check errors:
const GumboVector* errors = &output->errors; // one entry per parse error; length is 0 for clean input
Parsing Options
Fragment parsing:
GumboOutput* output = gumbo_parse_fragment(...)
Default options:
GumboOutput* output = gumbo_parse_with_options(...) // arguments: options pointer, buffer, buffer length
See GumboOptions for all options.
Memory Management
Ownership:
GumboNode* pointers are owned by GumboOutput.
Allocator:
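Provide custom allocator (in Gumbo 0.10.x these are plain functions that receive options.userdata; start from a copy of kGumboDefaultOptions):
options.allocator = &custom_allocator;
options.deallocator = &custom_deallocator;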
Cleanup:
gumbo_destroy_output(&options, output);
Frees all memory.
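As an illustration of the allocator hook, here is a minimal sketch of a counting allocator built from two free functions and a userdata pointer. The function-pointer signatures are modeled on Gumbo 0.10.x and should be checked against your gumbo.h; AllocStats, CountingAlloc, and CountingFree are hypothetical names introduced for this example.

```cpp
#include <cstddef>
#include <cstdlib>

// Signatures modeled on the allocator hooks of Gumbo 0.10.x
// (an assumption to verify against your copy of gumbo.h).
using AllocFn = void* (*)(void* userdata, std::size_t size);
using FreeFn = void (*)(void* userdata, void* ptr);

// Hypothetical bookkeeping, passed to both hooks through userdata.
struct AllocStats {
    std::size_t allocations = 0;
    std::size_t frees = 0;
    std::size_t bytes_requested = 0;
};

// Forward to malloc while recording the request.
void* CountingAlloc(void* userdata, std::size_t size) {
    AllocStats* stats = static_cast<AllocStats*>(userdata);
    ++stats->allocations;
    stats->bytes_requested += size;
    return std::malloc(size);
}

// Forward to free; a null pointer is ignored and not counted.
void CountingFree(void* userdata, void* ptr) {
    if (ptr == nullptr) return;
    AllocStats* stats = static_cast<AllocStats*>(userdata);
    ++stats->frees;
    std::free(ptr);
}
```

Under that assumption, the hooks would be installed by copying kGumboDefaultOptions, setting allocator, deallocator, and userdata on the copy, and passing the same options to both gumbo_parse_with_options and gumbo_destroy_output.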
Error Handling
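Error codes:
if (output->errors.length == 0) {
    // no parse errors
}
Upstream Gumbo reports parse errors as entries in the output->errors vector rather than through a status code. Each entry is a GumboError*, a struct defined in Gumbo's internal error.h.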
Error messages:
#define GUMBO_ENABLE_ERROR_MESSAGES
Prints debug error messages.
Tips
Examples
Parse and print HTML:
GumboOutput* output = gumbo_parse(html);
std::cout << gumbo_normalize_html(output->root);
gumbo_destroy_output(&kGumboDefaultOptions, output);
Extract text:
GumboNode* body = gumbo_get_element_by_tag(doc, GUMBO_TAG_BODY); // helper assumed, not in gumbo.h
const GumboVector* children = &body->v.element.children;
for (unsigned int i = 0; i < children->length; ++i) {
    GumboNode* child = static_cast<GumboNode*>(children->data[i]);
    if (child->type == GUMBO_NODE_TEXT) {
        std::cout << child->v.text.text;
    }
}
Change links:
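GumboNode* body = gumbo_get_element_by_tag(doc, GUMBO_TAG_BODY); // helper assumed, not in gumbo.h
const GumboVector* children = &body->v.element.children;
std::map<const GumboNode*, std::string> new_hrefs; // rewritten links, kept outside the tree
for (unsigned int i = 0; i < children->length; ++i) {
    GumboNode* child = static_cast<GumboNode*>(children->data[i]);
    if (child->type != GUMBO_NODE_ELEMENT) {
        continue;
    }
    GumboAttribute* href = gumbo_get_attribute(&child->v.element.attributes, "href");
    if (href) {
        // href->value is freed by gumbo_destroy_output, so do not point it at a literal
        new_hrefs[child] = "new_link.html";
    }
}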
Advanced Usage
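Custom memory allocator (Gumbo 0.10.x takes two plain functions plus a userdata pointer, not a class):
void* custom_allocate(void* userdata, size_t size) { ... }
void custom_free(void* userdata, void* ptr) { ... }
GumboOptions options = kGumboDefaultOptions;
options.allocator = &custom_allocate;
options.deallocator = &custom_free;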
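Custom tag callbacks (upstream GumboOptions has no tag_handler field and gumbo.h declares no GumboTagHandler; both are assumed to come from a wrapper layer):
class MyTagHandler : public GumboTagHandler {
public:
    void startElement(...) { ... }
    void endElement(...) { ... }
};
MyTagHandler myTagHandler;
options.tag_handler = &myTagHandler;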
Real-World Use Cases
Web scraping:
// Parse page
GumboOutput* output = gumbo_parse(html);
// Collect href values from the direct children of <body>;
// links nested deeper need a recursive walk
GumboNode* body = gumbo_get_element_by_tag(output->root, GUMBO_TAG_BODY); // helper assumed, not in gumbo.h
GumboVector* children = &body->v.element.children;
for (unsigned int i = 0; i < children->length; ++i) {
    GumboNode* child = static_cast<GumboNode*>(children->data[i]);
    if (child->type != GUMBO_NODE_ELEMENT) {
        continue;
    }
    GumboAttribute* href = gumbo_get_attribute(&child->v.element.attributes, "href");
    if (href) {
        // Save link for later scraping (scraped_links should hold std::string copies)
        scraped_links.push_back(href->value);
    }
}
// Cleanup
gumbo_destroy_output(&kGumboDefaultOptions, output);
Modifying HTML:
GumboOutput* output = gumbo_parse(html);
// Change tag from <div> to <section>
GumboNode* node = gumbo_get_element_by_id(output->root, "content");
node->v.element.tag = GUMBO_TAG_SECTION;
std::string modified_html = gumbo_normalize_html(output->root);
gumbo_destroy_output(&kGumboDefaultOptions, output);
Building search index:
// Parse document
GumboOutput* output = gumbo_parse(html);
// Extract text from nodes
std::string text = GetText(output->root);
// Save text to index
index.AddDocument(url, text);
// Cleanup
gumbo_destroy_output(&kGumboDefaultOptions, output);
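The GetText helper used above is not defined in this document. One plausible shape for it is a depth-first walk that concatenates text nodes and skips script and style content. The sketch below shows that walk on a simplified stand-in tree so that it compiles without the library; Node, NodeType, and CollectText are hypothetical, and a Gumbo version would read node->type, v.text.text, and v.element.children instead.

```cpp
#include <string>
#include <vector>

// Simplified stand-in for a parse tree (NOT Gumbo's real structs).
enum class NodeType { Element, Text };

struct Node {
    NodeType type;
    std::string tag;             // used when type == Element
    std::string text;            // used when type == Text
    std::vector<Node> children;  // used when type == Element
};

// Depth-first walk appending every text node's content to out,
// skipping everything inside <script> and <style>.
void CollectText(const Node& node, std::string& out) {
    if (node.type == NodeType::Text) {
        out += node.text;
        return;
    }
    if (node.tag == "script" || node.tag == "style") return;
    for (const Node& child : node.children) {
        CollectText(child, out);
    }
}

// Convenience wrapper returning the collected text.
std::string GetText(const Node& root) {
    std::string out;
    CollectText(root, out);
    return out;
}
```

Collecting into a single output string avoids building temporary strings at every level of the recursion.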
Performance and Memory Usage
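Reuse GumboOutput:
Note: gumbo_parse_with_reused_output is not declared in upstream gumbo.h, where every gumbo_parse call allocates a fresh GumboOutput that needs its own gumbo_destroy_output; the call below assumes a fork or wrapper that provides it.
GumboOutput* output = gumbo_parse(html);
// Modify DOM...
// Reparse instead of gumbo_destroy_output
gumbo_parse_with_reused_output(html, output);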
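Cache parsed documents:
// Cache mapping URLs to GumboOutput. Entries stay alive until the cache
// owner calls gumbo_destroy_output on each of them.
std::unordered_map<std::string, GumboOutput*> cache;
GumboOutput* Parse(const std::string& url) {
    auto it = cache.find(url);
    if (it != cache.end()) {
        return it->second;
    }
    // LoadHTML is assumed to return std::string. The original_text fields point
    // into the input buffer, so do not read them once this temporary is gone.
    GumboOutput* output = gumbo_parse(LoadHTML(url).c_str());
    cache[url] = output;
    return output;
}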
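A cache of raw pointers leaves open who eventually releases each parsed document. One way to settle that is to let the cache own its entries through std::unique_ptr with a custom deleter. The sketch below is generic over the document type so that it compiles without the library; DocumentCache and its members are hypothetical names.

```cpp
#include <functional>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>

// Owning cache: Doc is whatever the parser returns, and the destroyer is
// the function that releases it. Both are placeholders in this sketch.
template <typename Doc>
class DocumentCache {
public:
    using Loader = std::function<Doc*(const std::string& url)>;
    using Destroyer = std::function<void(Doc*)>;

    DocumentCache(Loader load, Destroyer destroy)
        : load_(std::move(load)), destroy_(std::move(destroy)) {}

    // Return the cached document for url, loading it on first use.
    // The pointer stays valid until the cache itself is destroyed.
    Doc* Get(const std::string& url) {
        auto it = entries_.find(url);
        if (it == entries_.end()) {
            it = entries_.emplace(url, Entry(load_(url), destroy_)).first;
        }
        return it->second.get();
    }

private:
    using Entry = std::unique_ptr<Doc, Destroyer>;
    Loader load_;
    Destroyer destroy_;
    std::unordered_map<std::string, Entry> entries_;
};
```

For Gumbo, Doc would be GumboOutput and the destroyer a lambda that calls gumbo_destroy_output with the same options used for parsing, so every cached tree is released exactly once when the cache goes away.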
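Custom allocator (Gumbo 0.10.x takes plain functions, not a class):
void* my_allocate(void* userdata, size_t size); // implement, e.g. on top of an arena
void my_free(void* userdata, void* ptr);
// Set custom allocator
options.allocator = &my_allocate;
options.deallocator = &my_free;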
Advanced Callbacks
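Tag callbacks (tag_handler and GumboTagHandler are not part of upstream Gumbo; assumed to come from a wrapper layer):
class LinkParser : public GumboTagHandler {
public:
    void startElement(GumboTag tag,...) {
        if (tag == GUMBO_TAG_A) {
            // Extract link
        }
    }
};
LinkParser linkParser;
// Set tag handler
options.tag_handler = &linkParser;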
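Attribute callbacks (attribute_handler is not an upstream GumboOptions field; assumed to come from a wrapper layer):
void ExtractImages(const GumboAttribute* attr) {
    // name and value are C strings: compare with strcmp/strstr (from <cstring>), not == or .find()
    if (std::strcmp(attr->name, "src") == 0 && std::strstr(attr->value, ".jpg") != NULL) {
        // Save image
    }
}
options.attribute_handler = ExtractImages;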
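Because GumboAttribute stores name and value as C strings, matching them takes explicit string comparison. The following is a small sketch of a stricter check than a substring search: it matches only values that end in .jpg, ignoring ASCII case. EndsWithIgnoreCase and IsJpegSource are hypothetical helpers, not part of Gumbo.

```cpp
#include <cctype>
#include <cstddef>
#include <cstring>

// True when text ends with suffix, ignoring ASCII case.
bool EndsWithIgnoreCase(const char* text, const char* suffix) {
    const std::size_t text_len = std::strlen(text);
    const std::size_t suffix_len = std::strlen(suffix);
    if (suffix_len > text_len) return false;
    const char* tail = text + (text_len - suffix_len);
    for (std::size_t i = 0; i < suffix_len; ++i) {
        const int a = std::tolower(static_cast<unsigned char>(tail[i]));
        const int b = std::tolower(static_cast<unsigned char>(suffix[i]));
        if (a != b) return false;
    }
    return true;
}

// True for a src attribute whose value names a .jpg file.
// Null pointers are treated as "no match".
bool IsJpegSource(const char* name, const char* value) {
    if (name == nullptr || value == nullptr) return false;
    return std::strcmp(name, "src") == 0 && EndsWithIgnoreCase(value, ".jpg");
}
```

A suffix test rejects values such as photo.jpg.bak that a substring search would accept. It also rejects URLs with a query string after the extension, so strip the query first if those should count.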
Common Pitfalls
Memory leaks:
Remember to call gumbo_destroy_output() after parsing.
Invalid HTML:
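gumbo_parse still returns a complete tree for malformed HTML, following the HTML5 error-recovery rules. Inspect output->errors if you need to know whether the input was repaired.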
Pointer errors:
Nodes are owned by GumboOutput. Do not free or delete them individually, and do not use them after gumbo_destroy_output.
Troubleshooting
Crashing:
Unexpected output:
Errors:
FAQ
Q: Why not just use libxml2?
A: Gumbo is focused just on HTML while libxml2 supports XML. Gumbo may be easier to use for some HTML tasks.
Q: Is Gumbo thread-safe?
A: No, you need to synchronize multi-threaded access to GumboOutput.
Q: What browsers does Gumbo support?
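A: Gumbo is a parsing library and does not run in a browser. It implements the HTML5 parsing algorithm, so the tree it builds should match the one a spec-conforming browser builds for the same markup. See docs for details.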