Libxml2 is an XML processing library written in C for use in C/C++ applications. It provides DOM, SAX, XMLReader, XPath and XPointer support.
Getting Started
Include:
#include <libxml/parser.h>
#include <libxml/tree.h>
Parse:
xmlDocPtr doc = xmlParseFile("file.xml");
or
xmlDocPtr doc = xmlParseMemory(xml, size);
Validate:
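xmlSchemaParserCtxtPtr pctxt = xmlSchemaNewParserCtxt("schema.xsd");
xmlSchemaPtr schema = xmlSchemaParse(pctxt);
xmlSchemaFreeParserCtxt(pctxt);
xmlSchemaValidCtxtPtr valid = xmlSchemaNewValidCtxt(schema);
int ret = xmlSchemaValidateDoc(valid, doc); // 0 = valid, > 0 = invalid, -1 = internal error
Requires #include <libxml/xmlschemas.h>. Check the return value, and register callbacks with xmlSchemaSetValidErrors() to receive the error messages.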
Cleanup:
xmlFreeDoc(doc);
xmlSchemaFreeValidCtxt(valid);
xmlSchemaFree(schema);
DOM Parsing
Get root element:
xmlNodePtr root = xmlDocGetRootElement(doc);
Iterate children:
for(xmlNodePtr cur = root->children; cur != NULL; cur = cur->next) {
// process cur node
}
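Get first child (may be a whitespace text node):
xmlNodePtr child = root->children;
xmlNodePtr elem = xmlFirstElementChild(root); // first child that is an element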
Node Types
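xmlNode: Generic tree node struct; its type field says what kind of node it is.
XML_ELEMENT_NODE: Element nodes (an xmlNode with this type).
XML_TEXT_NODE: Text nodes (an xmlNode with this type).
xmlAttr: Attribute nodes (type XML_ATTRIBUTE_NODE).
xmlNs: Namespace declarations.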
Node Operations
Add child:
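xmlNodePtr child = xmlNewChild(parent, NULL, BAD_CAST "node", NULL);
Set/get properties:
xmlSetProp(node, BAD_CAST "key", BAD_CAST "value");
xmlChar *value = xmlGetProp(node, BAD_CAST "key"); // caller frees with xmlFree()
Set/get content:
xmlNodeSetContent(node, BAD_CAST "text");
xmlChar *content = xmlNodeGetContent(node); // caller frees with xmlFree()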
Remove node:
xmlUnlinkNode(node);
xmlFreeNode(node);
XPath Usage
Evaluate xpath (requires #include <libxml/xpath.h>):
xmlXPathContextPtr ctxt = xmlXPathNewContext(doc);
xmlXPathObjectPtr result = xmlXPathEvalExpression(BAD_CAST "/root/node", ctxt);
Get node result:
if(result != NULL && !xmlXPathNodeSetIsEmpty(result->nodesetval)) {
xmlNodePtr node = result->nodesetval->nodeTab[0];
}
Get string result:
if(result->type == XPATH_STRING) {
xmlChar *str = result->stringval;
}
Cleanup:
xmlXPathFreeObject(result);
xmlXPathFreeContext(ctxt);
SAX Parsing
Create handler table:
xmlSAXHandler sax;
memset(&sax, 0, sizeof(sax));
Set handlers:
sax.startDocument = &startDocHandler;
sax.endElement = &endElementHandler;
Parse:
xmlSAXUserParseFile(&sax, NULL, "file.xml"); // second argument is user data passed to the handlers
Tips
Examples
Modify XML:
xmlDocPtr doc = xmlParseFile("data.xml");
xmlNodePtr root = xmlDocGetRootElement(doc);
xmlNodePtr node = xmlNewChild(root, NULL, BAD_CAST "newNode", NULL);
xmlSetProp(node, BAD_CAST "key", BAD_CAST "value");
xmlSaveFile("out.xml", doc);
xmlFreeDoc(doc);
Extract text:
xmlXPathContextPtr ctxt = xmlXPathNewContext(doc);
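xmlXPathObjectPtr result = xmlXPathEvalExpression(BAD_CAST "string(/root/node)", ctxt); // string() yields XPATH_STRING; a bare text() path yields a node set
if(result != NULL && result->type == XPATH_STRING) {
printf("%s\n", (const char *)result->stringval);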
}
xmlXPathFreeObject(result);
xmlXPathFreeContext(ctxt);
Namespaces
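Declare namespace on a node:
xmlNsPtr ns = xmlNewNs(node, BAD_CAST "http://ns", BAD_CAST "ns");
Add with namespace:
xmlNewChild(node, ns, BAD_CAST "child", NULL); // serialized as <ns:child>
Search with namespace:
xmlXPathRegisterNs(ctxt, BAD_CAST "ns", BAD_CAST "http://ns"); // declared in <libxml/xpathInternals.h>
xpath = "/ns:root/ns:node";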
HTML Parsing
Parse HTML:
htmlDocPtr doc = htmlReadFile("file.html", NULL, HTML_PARSE_NOERROR);
Print HTML:
htmlDocDump(stdout, doc);
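Tidy (libxml2 has no separate tidy mode; parse in recover mode to repair markup):
htmlDocPtr tidy = htmlReadDoc(BAD_CAST buffer, NULL, "UTF-8", HTML_PARSE_RECOVER | HTML_PARSE_NOERROR); // buffer holds the HTML text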
Advanced Usage
Custom streams:
xmlParserCtxtPtr ctxt = xmlCreatePushParserCtxt(&sax, NULL, NULL, 0, NULL);
while(moreData) {
xmlParseChunk(ctxt, data, size, 0);
}
xmlParseChunk(ctxt, NULL, 0, 1); // end
xmlFreeParserCtxt(ctxt);
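Custom memory:
xmlMemSetup(my_free, my_malloc, my_realloc, my_strdup); // your own allocator functions (hypothetical names); call before any other libxml2 function
Debug memory:
xmlMemUsed(); // bytes currently allocated; only tracked when libxml2's debug allocator is in use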
Memory Management
Proper memory management is critical when using libxml2 to avoid leaks.
Free document trees:
xmlFreeDoc(doc);
Frees the entire document tree.
Free nodes:
xmlFreeNode(node);
Frees a specific node and, recursively, all of its children. It does not detach the node from its parent, so call xmlUnlinkNode(node) first.
Avoid leaks:
Encoding Handling
Parse with encoding:
doc = xmlReadDoc(BAD_CAST buffer, NULL, "UTF-8", XML_PARSE_NOERROR); // arguments: text, base URL, encoding, options
Output encoding:
xmlSaveFormatFileEnc(file, doc, "UTF-8", 1);
Avoid encoding issues:
Advanced XPath
Predicates:
/book[author='James']
Axes:
//ancestor::chapter
Functions:
count(//book)
Namespaces
Register namespace:
xmlXPathRegisterNs(ctxt, BAD_CAST "ns", BAD_CAST "http://ns");
Use in XPath:
/ns:book/ns:title
Default namespace:
<root xmlns="http://ns">
Now unprefixed elements are in the http://ns namespace, and XPath still needs a registered prefix to match them.
Performance
Parser options:
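xmlReadDoc(BAD_CAST buffer, NULL, NULL, XML_PARSE_NONET);
Disables network access. Despite its name, XML_PARSE_NOENT turns entity substitution on, so leave it out unless the input is trusted.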
Reuse contexts:
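Avoid creating a new xmlXPathContext for each query; for expressions that run repeatedly, compile once with xmlXPathCompile() and evaluate with xmlXPathCompiledEval().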
Cache nodes/results:
Cache costly lookups or searches.
Troubleshooting
HTML parse errors:
Pass HTML_PARSE_RECOVER to the HTML parser to recover from common HTML errors; add HTML_PARSE_NOERROR and HTML_PARSE_NOWARNING to silence the reports.
XPath type errors:
Cast string results when needed.
string(//title)
Memory leaks:
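Use valgrind, instrumentation, or logging to detect unreleased memory. Call xmlCleanupParser() once at program exit so the library's global allocations are not reported as leaks.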