Overview
Loofah is a Ruby library for parsing, manipulating, and sanitizing HTML/XML documents. Built on top of Nokogiri, it provides a simple API for traversing, manipulating, and extracting data from markup, plus a scrubbing framework for stripping unsafe content. Key features include document and fragment parsing, built-in scrubbers (strip, prune, escape, whitewash, and others), support for custom scrubbers, and Rails integration via the rails-html-sanitizer gem.
Installation
Install the gem:
gem install loofah
Or in a Gemfile:
gem 'loofah'
Require in Ruby:
require 'loofah'
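With the gem installed and required, a minimal end-to-end sketch of the typical workflow (parse, scrub, serialize), using the built-in :prune scrubber:
unsafe = "<script>alert('xss')</script><p>Hello</p>"
# Parse the markup as a fragment, remove unsafe elements in place, and serialize
Loofah.fragment(unsafe).scrub!(:prune).to_s
#=> "<p>Hello</p>"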
Parsing and Traversal
Parse HTML/XML into a Loofah document:
html = <<-HTML
<html>
<body>
<h1>Hello world!</h1>
<p>Welcome to my page.</p>
</body>
</html>
HTML
doc = Loofah.document(html)
Traverse elements with Nokogiri methods:
doc.css('h1')
#=> [#<Nokogiri::XML::Element:0x3fc96a44b618 name="h1">]
doc.at('h1').text
#=> "Hello world!"
Find text nodes:
doc.text
#=> "Hello world!Welcome to my page."
Manipulation
Modify the document:
doc.at('h1').content = "Welcome!"
puts doc.to_html
# <html>
# <body>
# <h1>Welcome!</h1>
# ...
Add new nodes:
new_para = Nokogiri::XML::Node.new("p", doc)
new_para.content = "New paragraph"
doc.at('body').add_child(new_para)
XSS Sanitization
Loofah provides XSS sanitization via the scrub! method, which accepts either a built-in scrubber name (such as :prune or :strip) or a Loofah::Scrubber instance.
Remove unwanted tags/attributes:
html = "<script>alert('xss')</script><div>Test</div>"
doc = Loofah.document(html)
doc.scrub!(Loofah::Scrubber.new)
puts doc.to_html
#=> "<div>Test</div>"
Customize scrubbing behavior:
class CustomScrubber < Loofah::Scrubber
  def scrub(node)
    node.remove if node.name == 'script'
  end
end

doc = Loofah.fragment(html)
doc.scrub!(CustomScrubber.new)
puts doc.to_html
#=> "<div>Test</div>"
Built-in scrubbers can be passed to scrub! as symbols:
:strip - removes unknown/unsafe tags but keeps their contents
:prune - removes unknown/unsafe tags along with their contents
:escape - escapes unknown/unsafe tags
:whitewash - removes comments, attributes, and styling, leaving only well-formed safe markup
:nofollow - adds rel="nofollow" to all hyperlinks
:unprintable - removes unprintable characters from text nodes
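A quick sketch contrasting :strip and :prune on the same fragment:
unsafe = "ohai! <div>div is safe</div> <foo>but foo is not</foo>"
Loofah.fragment(unsafe).scrub!(:strip).to_s
#=> "ohai! <div>div is safe</div> but foo is not"
Loofah.fragment(unsafe).scrub!(:prune).to_s
#=> "ohai! <div>div is safe</div> "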
Integration with Rails
Loofah integrates with Rails helpers:
# In a Rails view
@content = "<script>alert('xss')</script><b>Test</b>"

sanitize @content, scrubber: Loofah::Scrubbers::Prune.new
#=> "<b>Test</b>"

sanitize @content, tags: ['b', 'i']
# Keeps the allowed <b> element and removes the script element
Rails (4.2 and later) already uses Loofah for its sanitization helpers via the rails-html-sanitizer gem, so no sanitizer swap is needed. Configure the default allowlist if you want different rules:
# In config/application.rb or an initializer
Rails.application.config.action_view.sanitized_allowed_tags = ['strong']
Rails.application.config.action_view.sanitized_allowed_attributes = ['style']
The Rails helpers will then apply these settings through Loofah.
Performance
Avoid slow XPath expressions:
# Slow
doc.at_xpath('//body//div[2]//span')
# Faster
doc.at('body').at('div:nth-child(2)').at('span')
Benchmark scrubber performance:
require 'benchmark'

n = 1000
Benchmark.bm do |x|
  # Re-parse on each iteration, because scrub! modifies the document in place
  x.report('strip')     { n.times { Loofah.fragment(html).scrub!(:strip) } }
  x.report('whitewash') { n.times { Loofah.fragment(html).scrub!(:whitewash) } }
end
Web Scraping
Extract text from HTML:
doc.text # All text
doc.css('h1').map(&:text) # Headings
doc.search('//p').map(&:text).join(". ") # Paragraphs
Extract links:
doc.css('a').map { |a| {href: a['href'], text: a.text} }
Testing
Test scrubbers with RSpec:
RSpec.describe MyScrubber do
  it 'scrubs scripts' do
    html = '<script>alert(1)</script>'
    doc = Loofah.document(html)
    scrubber = MyScrubber.new
    doc.scrub!(scrubber)
    expect(doc.to_html).not_to include('<script>')
  end
end
Assert correct scrubbing with Minitest:
class LoofahTest < Minitest::Test
  def test_scrub_xss
    html = '<script>alert(1)</script><div>Test</div>'
    doc = Loofah.fragment(html)
    doc.scrub!(:prune)
    assert_equal '<div>Test</div>', doc.to_html
  end
end
JavaScript Frameworks
Loofah is a Ruby library; it does not ship a JavaScript or npm package, so it cannot run in the browser. When a JavaScript front end needs to display user-supplied HTML, sanitize it on the Ruby server and send the already-scrubbed markup to the client.
React
Scrub in the backend before rendering (a sketch; @post is assumed to be a model whose body holds HTML):
# Rails API controller
def show
  render json: { html: Loofah.scrub_fragment(@post.body, :prune).to_s }
end
Then render the pre-sanitized string in the component:
// html has already been scrubbed by Loofah on the server
function MyComponent({html}) {
  return <div dangerouslySetInnerHTML={{__html: html}} />
}
Vue
The same pattern applies to server-rendered or API-driven Vue apps: bind the server-scrubbed string with v-html instead of sanitizing in the browser:
<!-- html was sanitized by Loofah on the server -->
<div v-html="html"></div>
Angular
Likewise, bind HTML that was already cleaned by the Ruby backend:
<!-- template.html: someHtml was sanitized by Loofah on the server -->
<div [innerHtml]="someHtml"></div>
Debugging Issues
Handle encoding errors:
doc = Loofah.document(html.force_encoding('UTF-8'))
Gracefully parse malformed HTML:
doc = Loofah.document(bad_html) # Nokogiri repairs broken markup while parsing
doc.errors # Inspect the parse errors that were corrected
Clone before scrubbing to avoid side effects:
node = doc.at('p').dup
node.scrub!(scrubber) # Scrub the copy; the original document is untouched
Advanced Nokogiri
Namespaced XML:
doc.search('//x:node', {'x' => 'http://name.space'})
Prefer CSS attribute selectors over hand-rolled XPath where possible:
doc.at('div#content[class="text"]')
NodeSet manipulation:
nodes = doc.css('p.note')
nodes.each { |n| ... }
nodes.remove
Scraping Frameworks
Scrapy
Scrapy is a Python framework, so Loofah (a Ruby gem) cannot be used inside its pipelines; in a Python stack, reach for a Python sanitizer such as bleach instead. For Ruby scraping tools like Kimurai (below), Loofah can be used directly.
Kimurai
class MySpider < Kimurai::Base
  def parse(response, url:, data: {})
    # response is a Nokogiri document; re-parse its HTML through Loofah to scrub it
    doc = Loofah.document(response.to_html).scrub!(:prune)
    # ... scrape doc
  end
end
Immutable Documents
Clone before manipulating:
doc2 = doc.clone
doc2.at('img').remove
Loofah has no immutable document class, and scrub! always modifies the document in place, so scrub a duplicate when you need to keep the original intact:
scrubbed = doc.dup
scrubbed.scrub!(:prune) # doc itself is left unchanged
Controller Integration
Scrub incoming HTML parameters before they reach your actions (a sketch; :body is a hypothetical param expected to contain HTML):
# ApplicationController
before_action :scrub_html_params

def scrub_html_params
  params[:body] = Loofah.fragment(params[:body]).scrub!(:prune).to_s if params[:body]
end
Efficient Manipulation
Modify multiple nodes:
doc.search('//img').each do |img|
  img['src'] = '/placeholder.jpg'
end
Remove nodesets:
articles = doc.css('article')
articles.remove
Advanced Selectors
Grouped CSS selectors:
doc.css('div.note, span.alert')
Pseudo-class selectors:
doc.css('div:not(.ignore)') # Negation
doc.xpath('//li[contains(text(), "hello")]') # Text matching (jQuery-style :contains is not supported in CSS)
Namespaced XML:
doc.search('//x:node', 'x' => 'namespace')
Handling Complex HTML Structures
Loofah is versatile and can handle complex HTML structures. For example, you can easily navigate and manipulate deeply nested elements:
html = <<-HTML
<div>
<section>
<article>
<h1>Article Title</h1>
<p>Content goes here.</p>
</article>
</section>
</div>
HTML
doc = Loofah.document(html)
# Access deeply nested elements
article_title = doc.at('div > section > article > h1').text
Optimizing Performance
To optimize performance, avoid using slow XPath expressions and prefer CSS selectors when possible:
# Slow XPath expression
slow_node = doc.at_xpath('//body//div[2]//span')
# Faster CSS selector equivalent
fast_node = doc.at('body div:nth-child(2) span')
Security Best Practices
Handling Security Vulnerabilities
Loofah helps mitigate security vulnerabilities like Cross-Site Scripting (XSS) attacks. Here's an example of using a custom scrubber to remove potentially harmful script tags:
class CustomScrubber < Loofah::Scrubber
  def scrub(node)
    node.remove if node.name == 'script'
  end
end

html_with_xss = "<script>alert('xss')</script><div>Safe content</div>"
doc = Loofah.fragment(html_with_xss)
doc.scrub!(CustomScrubber.new)
cleaned_html = doc.to_html
# Result: "<div>Safe content</div>"
Integration with Other Libraries
Integrating Loofah with Nokogiri
Loofah is built on top of Nokogiri, so you can use Nokogiri methods for parsing and manipulation:
require 'nokogiri'
html = <<-HTML
<div>
<p>Hello, <strong>world!</strong></p>
</div>
HTML
nokogiri_doc = Nokogiri::HTML(html)
# Use Nokogiri methods to traverse and manipulate
strong_text = nokogiri_doc.at('strong').text
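Because Loofah's document classes subclass Nokogiri's, the same traversal calls work directly on a Loofah document; no conversion step is needed:
loofah_doc = Loofah.document(html)
loofah_doc.at('strong').text
#=> "world!"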
Comparisons
Loofah vs. Sanitize
Loofah and the Sanitize gem both offer HTML sanitization, but they have different approaches. Loofah allows fine-grained control with custom scrubbers, while Sanitize provides a simpler, rules-based approach:
# Using Sanitize for XSS sanitization
require 'sanitize'
html_with_xss = "<script>alert('xss')</script><div>Safe content</div>"
sanitized_html = Sanitize.fragment(html_with_xss)
# Result with Sanitize's default config (all tags stripped): "Safe content"
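For comparison, the closest one-liner in Loofah, which keeps known-safe markup such as the div:
Loofah.scrub_fragment(html_with_xss, :prune).to_s
# Result: "<div>Safe content</div>"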
FAQs
Q: How does Loofah handle different character encodings?
A: Loofah can handle various character encodings. Ensure the string carries the correct encoding (for example via String#force_encoding or String#encode) before parsing:
html_string = "HTML content"
doc = Loofah.document(html_string.force_encoding('UTF-8'))
Q: Does Loofah support HTML5?
A: Yes. Recent versions of Loofah (2.12 and later) provide Loofah.html5_document and Loofah.html5_fragment, which parse with Nokogiri's HTML5 parser; the original Loofah.document and Loofah.fragment use the HTML4 parser.
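A short sketch, assuming a Loofah version with HTML5 support:
doc = Loofah.html5_fragment('<main><p>Hello</p></main>')
doc.text
#=> "Hello"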
Q: How can I sanitize mixed content (safe and unsafe) with Loofah?
A: Customize Loofah's behavior by creating custom scrubbers. For instance, remove script tags while keeping safe content:
class CustomScrubber < Loofah::Scrubber
  def scrub(node)
    node.remove if node.name == 'script'
  end
end
html = "<script>alert('xss')</script><div>Safe content</div>"
doc = Loofah.document(html)
doc.scrub!(CustomScrubber.new)
cleaned_html = doc.to_html
Q: Can I customize Loofah's sanitization for specific tags or attributes?
A: Yes, create custom scrubbers to define the rules. For example, keep links but allow only their href attribute, while removing script tags entirely:
class CustomScrubber < Loofah::Scrubber
  def scrub(node)
    case node.name
    when 'a'
      # Keep the link, but drop every attribute except href
      node.keys.each { |attr| node.remove_attribute(attr) unless attr == 'href' }
    when 'script'
      node.remove
    end
  end
end

html = "<a href='https://example.com' target='_blank'>Visit Example</a><script>alert('xss')</script>"
doc = Loofah.fragment(html)
doc.scrub!(CustomScrubber.new)
cleaned_html = doc.to_html
# Result: <a href="https://example.com">Visit Example</a>
Q: Is Loofah vulnerable to security issues?
A: Loofah is designed to mitigate security vulnerabilities, including XSS attacks. Keep Loofah and its dependencies up-to-date to stay protected. Monitor the Loofah GitHub repository and RubyGems for updates and advisories.
Q: What are the benefits of using Nokogiri over Loofah, and vice versa?
A: Nokogiri is a general-purpose parser: it handles XML as well as HTML, supports XPath, CSS selectors, namespaces, and SAX parsing, and gives you full low-level control over the document tree. Loofah builds on Nokogiri and adds what it does not provide out of the box: a sanitization framework with built-in and custom scrubbers.
The choice between them depends on your project's needs: use Nokogiri alone for parsing and data extraction, and reach for Loofah when you need to sanitize untrusted HTML.
Troubleshooting
Handling Malformed HTML
If you have malformed HTML, Loofah can help: the underlying Nokogiri parser repairs broken markup (unclosed tags and the like) as it parses, and records what it had to fix:
malformed_html = <<-HTML
<div>
<p>Unclosed div
</div>
HTML
doc = Loofah.document(malformed_html)
doc.errors  # Parse errors Nokogiri recorded while repairing the markup
doc.to_html # Serializes the repaired document, which can now be used safely
Cross-Framework Integration
Using Loofah with Sinatra
Integrating Loofah with Sinatra is straightforward, similar to using it with Rails:
require 'sinatra'
require 'loofah'
before do
  @content = "<script>alert('xss')</script>Some content"
  @cleaned_content = Loofah.scrub_fragment(@content, :prune).to_s
end

get '/' do
  erb :index
end
References and Further Reading
Here are some references and further reading materials for in-depth knowledge: