Nokogiri is an HTML/XML parsing and scraping library for Ruby. This cheat sheet collects the calls used most often for parsing, searching, traversing, modifying, and serializing documents.
Installation
Gemfile:
gem 'nokogiri', '~>1.6'
Bundler:
bundle install
Or install globally:
gem install nokogiri
Parsing Documents
From string:
doc = Nokogiri::HTML('<html>...</html>')
From file:
doc = Nokogiri::HTML(File.open('page.html'))
From URL:
require 'open-uri'
doc = Nokogiri::HTML(URI.open('https://example.com/'))
Namespaced XML:
builder = Nokogiri::XML::Builder.new do |xml|
  xml.root('xmlns:xsi' => 'http://www.w3.org/2001/XMLSchema-instance') {
    xml.cars {
      xml.car {
        xml.make "Honda"
        xml.model "Civic"
      }
    }
  }
end
doc = Nokogiri::XML(builder.to_xml)
Searching DOM
CSS selector:
articles = doc.css('div.article')
XPath expression:
articles = doc.xpath('//div[@class="article"]')
Namespaced XML:
xml.xpath('./xmlns:car', 'xmlns' => 'http://example.com')
Get by id:
doc.at('#intro')
Get by tag name:
doc.search('product-title') # Matches <product-title> elements
Traversing DOM
Children:
product.children # All
product.element_children # Elements only
Parents:
node.parent
node.ancestors
Siblings:
node.next_sibling
node.previous_sibling
node.next_element
node.previous_element
Filtering:
node.search('./ancestor::div[.//p]') # Ancestor div with p
Modifying DOM
Set id:
node['id'] = 'some-id'
Set class:
node['class'] = 'highlighted'
Add class:
node.add_class('blue')
Remove class:
node.remove_class('blue')
Set attribute:
node['data-type'] = 'article'
Remove attribute:
node.delete('class')
Set text content:
node.content = 'New text'
Set HTML:
node.inner_html = 'New <strong>HTML</strong>'
Creating Nodes
New element:
el = Nokogiri::XML::Node.new('p', doc)
New text node:
text = Nokogiri::XML::Text.new('Hello', doc)
From HTML:
fragment = Nokogiri::HTML.fragment('<div>Hi</div>')
DOCTYPE node:
dtd = doc.internal_subset || doc.create_internal_subset('html', nil, nil)
Outputting HTML/XML
ToString:
html = doc.to_html
xml = doc.to_xml
Save to file:
File.open('out.html', 'w') { |f| f.write(doc.to_html) }
Prettify output:
puts doc.to_html(indent: 2)
puts doc.to_xml(indent: 2)
Encode characters:
doc.to_html(encoding: 'UTF-8')
Encoding
Parse as UTF-8:
doc = Nokogiri::HTML(html_string, nil, 'UTF-8')
Detect encoding:
doc.encoding # Returns encoding
Convert encodings:
html_string.encode('UTF-8', 'ISO-8859-2') # Iconv left the standard library in Ruby 2.0; String#encode replaces it
Advanced Parsing
Turn off network access (already part of the default parse options):
Nokogiri::HTML(html) { |config| config.nonet }
Collect parse errors:
doc = Nokogiri::HTML(html)
errors = doc.errors # Array of Nokogiri::XML::SyntaxError
Strict parsing:
Nokogiri::XML(xml) { |config| config.strict } # Raises Nokogiri::XML::SyntaxError on malformed input
Performance
Reuse query results:
doc = Nokogiri::HTML(html)
items = doc.xpath('//li') # Run the query once and keep the NodeSet
items.each do |item|
  # ...
end
Reuse parsed documents:
DOCS = Hash.new { |cache, path| cache[path] = Nokogiri::HTML(File.read(path)) }
doc = DOCS['page.html'] # Parsed on first access, reused afterwards
Browser Integration
With Watir:
require 'watir'
browser = Watir::Browser.new
doc = Nokogiri::HTML(browser.html)
With Selenium:
require 'selenium-webdriver'
driver = Selenium::WebDriver.for :firefox
html = driver.page_source
doc = Nokogiri::HTML(html)
Scraping Data
Extract text:
doc.search('h2').map(&:text)
Extract attributes:
doc.search('.post').map { |post| post['id'] }
Build JSON:
require 'json'
posts = doc.css('.post').map { |post|
  {
    id: post['id'],
    title: post.at('h2').text,
    content: post.at('.content').text
  }
}
File.write('output.json', JSON.dump(posts))
Real World Use Cases
The snippets above cover Nokogiri's core workflow in Ruby: parsing, searching, traversing, modifying, and serializing HTML and XML.