The Ultimate Nokogiri Cheat Sheet for Ruby

Oct 31, 2023 · 3 min read

Nokogiri is an HTML/XML parsing and scraping library for Ruby. This cheat sheet collects its most common operations: parsing, searching, traversing, modifying, and serializing documents.

Installation

Gemfile:

gem 'nokogiri', '~> 1.15'

Bundler:

bundle install

Or install globally:

gem install nokogiri

Parsing Documents

From string:

doc = Nokogiri::HTML('<html>...</html>')

From file:

doc = Nokogiri::HTML(File.open('page.html'))

From URL:

require 'open-uri'
doc = Nokogiri::HTML(URI.open('https://example.com/'))

Build namespaced XML:

builder = Nokogiri::XML::Builder.new do |xml|
  xml.root('xmlns:xsi' => 'http://www.w3.org/2001/XMLSchema-instance') {
    xml.cars {
      xml.car {
        xml.make "Honda"
        xml.model "Civic"
      }
    }
  }
end
doc = Nokogiri::XML(builder.to_xml)

Searching DOM

CSS selector:

articles = doc.css('div.article')

XPath expression:

articles = doc.xpath('//div[@class="article"]')

Namespaced XML:

doc.xpath('//ns:car', 'ns' => 'http://example.com')

Get by id:

doc.at('#intro')

Get by tag name:

doc.search('product-title') # All <product-title> elements

Traversing DOM

Children:

product.children # All
product.element_children # Elements only

Parents:

node.parent
node.ancestors

Siblings:

node.next_sibling
node.previous_sibling
node.next_element
node.previous_element

Filtering:

node.search('./ancestor::div[.//p]') # Ancestor div with p

Modifying DOM

Set id:

node['id'] = 'some-id'

Set class:

node['class'] = 'highlighted'

Add class:

node.add_class('blue')

Remove class:

node.remove_class('blue')

Set attribute:

node['data-type'] = 'article'

Remove attribute:

node.delete('class')

Set text content:

node.content = 'New text'

Set HTML:

node.inner_html = 'New <strong>HTML</strong>'

Creating Nodes

New element:

el = Nokogiri::XML::Node.new('p', doc)

New text node:

text = Nokogiri::XML::Text.new('Hello', doc)

From HTML:

fragment = Nokogiri::HTML.fragment('<div>Hi</div>')

DOCTYPE node:

dtd = doc.internal_subset # Existing DOCTYPE (Nokogiri::XML::DTD) or nil
dtd ||= doc.create_internal_subset('html', nil, nil)

Outputting HTML/XML

To string:

html = doc.to_html
xml = doc.to_xml

Save to file:

File.open('out.html', 'w') { |f| f.write(doc.to_html) }

Prettify output:

puts doc.to_html(indent: 2)
puts doc.to_xml(indent: 2)

Encode characters:

doc.to_html(encoding: 'UTF-8')

Encoding

Parse as UTF-8:

doc = Nokogiri::HTML(html_string, nil, 'UTF-8')

Detect encoding:

doc.encoding # Returns encoding

Handle encodings:

# Iconv was removed from Ruby's standard library in 2.0; use String#encode
html_string.encode('UTF-8', 'ISO-8859-2') # Convert

Advanced Parsing

Turn off network access (already part of the default parse options):

Nokogiri::HTML(html) { |config| config.nonet }

Collect parse errors:

doc = Nokogiri::HTML(html)
errors = doc.errors # Array of Nokogiri::XML::SyntaxError

Strict parsing (XML):

Nokogiri::XML(xml) { |config| config.strict } # Raises Nokogiri::XML::SyntaxError on malformed input

Performance

Cache XPath queries:

DOCUMENT_NODE = Nokogiri::HTML(html)
XPATH_ITEMS = DOCUMENT_NODE.xpath('//li')

XPATH_ITEMS.each do |item|
  # ...
end

Reuse parsed documents:

# Parse each file once and keep the document in an in-memory cache
DOCS = Hash.new { |cache, path| cache[path] = Nokogiri::HTML(File.read(path)) }
doc = DOCS['page.html']

Browser Integration

With Watir:

browser = Watir::Browser.new
doc = Nokogiri::HTML(browser.html)

With Selenium:

driver = Selenium::WebDriver.for :firefox
html = driver.page_source
doc = Nokogiri::HTML(html)

Scraping Data

Extract text:

doc.search('h2').map(&:text)

Extract attributes:

doc.search('.post').map { |post| post['id'] }

Build JSON:

require 'json'

posts = doc.css('.post').map { |post|
  {
    id: post['id'],
    title: post.at('h2').text,
    content: post.at('.content').text
  }
}

File.write('output.json', JSON.dump(posts))

Real World Use Cases

  • Web scraping and automation
  • Testing browsers with Capybara
  • Screenshot generation
  • Consuming XML APIs
  • Parsing documents for NLP
  • Archiving sites
  • Structuring CMS templates
  • Building HTML editors
  • Data mining and dataset analysis
  • PDF generation
  • Markdown/HTML conversion
  • Parsing XLS spreadsheet exports

This cheat sheet aims to cover the Nokogiri features you are most likely to need in day-to-day Ruby work.
