Overview
Loofah is a Ruby library for parsing, manipulating, and sanitizing HTML/XML documents. Built on top of Nokogiri, it provides a simple API for traversing, manipulating, and extracting data from markup, plus a scrubbing framework for stripping unsafe content. Key features include document and fragment parsing, built-in scrubbers (strip, prune, escape, whitewash, and others), support for custom scrubbers, and Rails integration via the rails-html-sanitizer gem.
Installation
Install the gem:
gem install loofah
Or in a Gemfile:
gem 'loofah'
Require in Ruby:
require 'loofah'
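With the gem installed and required, a minimal end-to-end sketch of the typical workflow (parse, scrub, serialize), using the built-in :prune scrubber:
unsafe = "<script>alert('xss')</script><p>Hello</p>"
# Parse the markup as a fragment, remove unsafe elements in place, and serialize
Loofah.fragment(unsafe).scrub!(:prune).to_s
#=> "<p>Hello</p>"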
Parsing and Traversal
Parse HTML/XML into a Loofah document:
html = <<-HTML
<html>
<body>
<h1>Hello world!</h1>
<p>Welcome to my page.</p>
</body>
</html>
HTML
doc = Loofah.document(html)
Traverse elements with Nokogiri methods:
doc.css('h1')
#=> [#<Nokogiri::XML::Element:0x3fc96a44b618 name="h1">]
doc.at('h1').text
#=> "Hello world!"
Find text nodes:
doc.text
#=> "Hello world!Welcome to my page."
Manipulation
Modify the document:
doc.at('h1').content = "Welcome!"
puts doc.to_html
# <html>
# <body>
# <h1>Welcome!</h1>
# ...
Add new nodes:
new_para = Nokogiri::XML::Node.new("p", doc)
new_para.content = "New paragraph"
doc.at('body').add_child(new_para)
XSS Sanitization
Loofah provides XSS sanitization via the scrub! method, which accepts either a built-in scrubber name (such as :prune or :strip) or a Loofah::Scrubber instance.
Remove unwanted tags/attributes:
html = "<script>alert('xss')</script><div>Test</div>"
doc = Loofah.document(html)
doc.scrub!(Loofah::Scrubber.new)
puts doc.to_html
#=> "<div>Test</div>"
Customize scrubbing behavior:
class CustomScrubber < Loofah::Scrubber
  def scrub(node)
    node.remove if node.name == 'script'
  end
end

doc = Loofah.fragment(html)
doc.scrub!(CustomScrubber.new)
puts doc.to_html
#=> "<div>Test</div>"
Built-in scrubbers can be passed to scrub! as symbols:
:strip - removes unknown/unsafe tags but keeps their contents
:prune - removes unknown/unsafe tags along with their contents
:escape - escapes unknown/unsafe tags
:whitewash - removes comments, attributes, and styling, leaving only well-formed safe markup
:nofollow - adds rel="nofollow" to all hyperlinks
:unprintable - removes unprintable characters from text nodes
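A quick sketch contrasting :strip and :prune on the same fragment:
unsafe = "ohai! <div>div is safe</div> <foo>but foo is not</foo>"
Loofah.fragment(unsafe).scrub!(:strip).to_s
#=> "ohai! <div>div is safe</div> but foo is not"
Loofah.fragment(unsafe).scrub!(:prune).to_s
#=> "ohai! <div>div is safe</div> "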
Integration with Rails
Loofah integrates with Rails helpers:
# In a Rails view
@content = "<script>alert('xss')</script><b>Test</b>"

sanitize @content, scrubber: Loofah::Scrubbers::Prune.new
#=> "<b>Test</b>"

sanitize @content, tags: ['b', 'i']
# Keeps the allowed <b> element and removes the script element
Rails (4.2 and later) already uses Loofah for its sanitization helpers via the rails-html-sanitizer gem, so no sanitizer swap is needed. Configure the default allowlist if you want different rules:
# In config/application.rb or an initializer
Rails.application.config.action_view.sanitized_allowed_tags = ['strong']
Rails.application.config.action_view.sanitized_allowed_attributes = ['style']
The Rails helpers will then apply these settings through Loofah.
Performance
Avoid slow XPath expressions:
# Slow
doc.at_xpath('//body//div[2]//span')
# Faster
doc.at('body').at('div:nth-child(2)').at('span')
Benchmark scrubber performance:
require 'benchmark'

n = 1000
Benchmark.bm do |x|
  # Re-parse on each iteration, because scrub! modifies the document in place
  x.report('strip')     { n.times { Loofah.fragment(html).scrub!(:strip) } }
  x.report('whitewash') { n.times { Loofah.fragment(html).scrub!(:whitewash) } }
end
Web Scraping
Extract text from HTML:
doc.text # All text
doc.css('h1').map(&:text) # Headings
doc.search('//p').map(&:text).join(". ") # Paragraphs
Extract links:
doc.css('a').map { |a| {href: a['href'], text: a.text} }
Testing
Test scrubbers with RSpec:
RSpec.describe MyScrubber do
  it 'scrubs scripts' do
    html = '<script>alert(1)</script>'
    doc = Loofah.document(html)
    scrubber = MyScrubber.new
    doc.scrub!(scrubber)
    expect(doc.to_html).not_to include('<script>')
  end
end
Assert correct scrubbing with Minitest:
class LoofahTest < Minitest::Test
  def test_scrub_xss
    html = '<script>alert(1)</script><div>Test</div>'
    doc = Loofah.fragment(html)
    doc.scrub!(:prune)
    assert_equal '<div>Test</div>', doc.to_html
  end
end
JavaScript Frameworks
Loofah is a Ruby library; it does not ship a JavaScript or npm package, so it cannot run in the browser. When a JavaScript front end needs to display user-supplied HTML, sanitize it on the Ruby server and send the already-scrubbed markup to the client.
React
Scrub in the backend before rendering (a sketch; @post is assumed to be a model whose body holds HTML):
# Rails API controller
def show
  render json: { html: Loofah.scrub_fragment(@post.body, :prune).to_s }
end
Then render the pre-sanitized string in the component:
// html has already been scrubbed by Loofah on the server
function MyComponent({html}) {
  return <div dangerouslySetInnerHTML={{__html: html}} />
}
Vue
The same pattern applies to server-rendered or API-driven Vue apps: bind the server-scrubbed string with v-html instead of sanitizing in the browser:
<!-- html was sanitized by Loofah on the server -->
<div v-html="html"></div>
Angular
Likewise, bind HTML that was already cleaned by the Ruby backend:
<!-- template.html: someHtml was sanitized by Loofah on the server -->
<div [innerHtml]="someHtml"></div>
Debugging Issues
Handle encoding errors:
doc = Loofah.document(html.force_encoding('UTF-8'))
Gracefully parse malformed HTML:
doc = Loofah.document(bad_html) # Nokogiri repairs broken markup while parsing
doc.errors # Inspect the parse errors that were corrected
Clone before scrubbing to avoid side effects:
node = doc.at('p').dup
node.scrub!(scrubber) # Scrub the copy; the original document is untouched
Advanced Nokogiri
Namespaced XML:
doc.search('//x:node', {'x' => 'http://name.space'})
Prefer CSS attribute selectors over hand-rolled XPath where possible:
doc.at('div#content[class="text"]')
NodeSet manipulation:
nodes = doc.css('p.note')
nodes.each { |n| ... }
nodes.remove
Scraping Frameworks
Scrapy
Scrapy is a Python framework, so Loofah (a Ruby gem) cannot be used inside its pipelines; in a Python stack, reach for a Python sanitizer such as bleach instead. For Ruby scraping tools like Kimurai (below), Loofah can be used directly.
Kimurai
class MySpider < Kimurai::Base
  def parse(response, url:, data: {})
    # response is a Nokogiri document; re-parse its HTML through Loofah to scrub it
    doc = Loofah.document(response.to_html).scrub!(:prune)
    # ... scrape doc
  end
end
Immutable Documents
Clone before manipulating:
doc2 = doc.clone
doc2.at('img').remove
Loofah has no immutable document class, and scrub! always modifies the document in place, so scrub a duplicate when you need to keep the original intact:
scrubbed = doc.dup
scrubbed.scrub!(:prune) # doc itself is left unchanged
Controller Integration
Scrub incoming HTML parameters before they reach your actions (a sketch; :body is a hypothetical param expected to contain HTML):
# ApplicationController
before_action :scrub_html_params

def scrub_html_params
  params[:body] = Loofah.fragment(params[:body]).scrub!(:prune).to_s if params[:body]
end
Efficient Manipulation
Modify multiple nodes:
doc.search('//img').each do |img|
  img['src'] = '/placeholder.jpg'
end
Remove nodesets:
articles = doc.css('article')
articles.remove
Advanced Selectors
Grouped CSS selectors:
doc.css('div.note, span.alert')
Pseudo-class selectors:
doc.css('div:not(.ignore)') # Negation
doc.xpath('//li[contains(text(), "hello")]') # Text matching (jQuery-style :contains is not supported in CSS)
Namespaced XML:
doc.search('//x:node', 'x' => 'namespace')
Handling Complex HTML Structures
Loofah is versatile and can handle complex HTML structures. For example, you can easily navigate and manipulate deeply nested elements:
html = <<-HTML
<div>
<section>
<article>
<h1>Article Title</h1>
<p>Content goes here.</p>
</article>
</section>
</div>
HTML
doc = Loofah.document(html)
# Access deeply nested elements
article_title = doc.at('div > section > article > h1').text
Optimizing Performance
To optimize performance, avoid using slow XPath expressions and prefer CSS selectors when possible:
# Slow XPath expression
slow_node = doc.at_xpath('//body//div[2]//span')
# Faster CSS selector equivalent
fast_node = doc.at('body div:nth-child(2) span')
Security Best Practices
Handling Security Vulnerabilities
Loofah helps mitigate security vulnerabilities like Cross-Site Scripting (XSS) attacks. Here's an example of using a custom scrubber to remove potentially harmful script tags:
class CustomScrubber < Loofah::Scrubber
  def scrub(node)
    node.remove if node.name == 'script'
  end
end

html_with_xss = "<script>alert('xss')</script><div>Safe content</div>"
doc = Loofah.fragment(html_with_xss)
doc.scrub!(CustomScrubber.new)
cleaned_html = doc.to_html
# Result: "<div>Safe content</div>"
Integration with Other Libraries
Integrating Loofah with Nokogiri
Loofah is built on top of Nokogiri, so you can use Nokogiri methods for parsing and manipulation:
require 'nokogiri'
html = <<-HTML
<div>
<p>Hello, <strong>world!</strong></p>
</div>
HTML
nokogiri_doc = Nokogiri::HTML(html)
# Use Nokogiri methods to traverse and manipulate
strong_text = nokogiri_doc.at('strong').text
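Because Loofah's document classes subclass Nokogiri's, the same traversal calls work directly on a Loofah document; no conversion step is needed:
loofah_doc = Loofah.document(html)
loofah_doc.at('strong').text
#=> "world!"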
Comparisons
Loofah vs. Sanitize
Loofah and the Sanitize gem both offer HTML sanitization, but they have different approaches. Loofah allows fine-grained control with custom scrubbers, while Sanitize provides a simpler, rules-based approach:
# Using Sanitize for XSS sanitization
require 'sanitize'
html_with_xss = "<script>alert('xss')</script><div>Safe content</div>"
sanitized_html = Sanitize.fragment(html_with_xss)
# Result with Sanitize's default config (all tags stripped): "Safe content"
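For comparison, the closest one-liner in Loofah, which keeps known-safe markup such as the div:
Loofah.scrub_fragment(html_with_xss, :prune).to_s
# Result: "<div>Safe content</div>"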
FAQs
Q: How does Loofah handle different character encodings?
A: Loofah can handle various character encodings. Ensure the string carries the correct encoding (for example via String#force_encoding or String#encode) before parsing:
html_string = "HTML content"
doc = Loofah.document(html_string.force_encoding('UTF-8'))
Q: Does Loofah support HTML5?
A: Yes. Recent versions of Loofah (2.12 and later) provide Loofah.html5_document and Loofah.html5_fragment, which parse with Nokogiri's HTML5 parser; the original Loofah.document and Loofah.fragment use the HTML4 parser.
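A short sketch, assuming a Loofah version with HTML5 support:
doc = Loofah.html5_fragment('<main><p>Hello</p></main>')
doc.text
#=> "Hello"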
Q: How can I sanitize mixed content (safe and unsafe) with Loofah?
A: Customize Loofah's behavior by creating custom scrubbers. For instance, remove script tags while keeping safe content:
class CustomScrubber < Loofah::Scrubber
  def scrub(node)
    node.remove if node.name == 'script'
  end
end
html = "<script>alert('xss')</script><div>Safe content</div>"
doc = Loofah.document(html)
doc.scrub!(CustomScrubber.new)
cleaned_html = doc.to_html
Q: Can I customize Loofah's sanitization for specific tags or attributes?
A: Yes, create custom scrubbers to define the rules. For example, keep links but allow only their href attribute, while removing script tags entirely:
class CustomScrubber < Loofah::Scrubber
  def scrub(node)
    case node.name
    when 'a'
      # Keep the link, but drop every attribute except href
      node.keys.each { |attr| node.remove_attribute(attr) unless attr == 'href' }
    when 'script'
      node.remove
    end
  end
end

html = "<a href='https://example.com' target='_blank'>Visit Example</a><script>alert('xss')</script>"
doc = Loofah.fragment(html)
doc.scrub!(CustomScrubber.new)
cleaned_html = doc.to_html
# Result: <a href="https://example.com">Visit Example</a>
Q: Is Loofah vulnerable to security issues?
A: Loofah is designed to mitigate security vulnerabilities, including XSS attacks. Keep Loofah and its dependencies up-to-date to stay protected. Monitor the Loofah GitHub repository and RubyGems for updates and advisories.
Q: What are the benefits of using Nokogiri over Loofah, and vice versa?
A: Nokogiri is a general-purpose parser: it handles XML as well as HTML, supports XPath, CSS selectors, namespaces, and SAX parsing, and gives you full low-level control over the document tree. Loofah builds on Nokogiri and adds what it does not provide out of the box: a sanitization framework with built-in and custom scrubbers.
The choice between them depends on your project's needs: use Nokogiri alone for parsing and data extraction, and reach for Loofah when you need to sanitize untrusted HTML.
Troubleshooting
Handling Malformed HTML
If you have malformed HTML, Loofah can help: the underlying Nokogiri parser repairs broken markup (unclosed tags and the like) as it parses, and records what it had to fix:
malformed_html = <<-HTML
<div>
<p>Unclosed div
</div>
HTML
doc = Loofah.document(malformed_html)
doc.errors  # Parse errors Nokogiri recorded while repairing the markup
doc.to_html # Serializes the repaired document, which can now be used safely
Cross-Framework Integration
Using Loofah with Sinatra
Integrating Loofah with Sinatra is straightforward, similar to using it with Rails:
require 'sinatra'
require 'loofah'
before do
  @content = "<script>alert('xss')</script>Some content"
  @cleaned_content = Loofah.scrub_fragment(@content, :prune).to_s
end

get '/' do
  erb :index
end
References and Further Reading
Here are some references and further reading materials for in-depth knowledge: