Floki makes it easy to parse and query HTML documents in Elixir. It uses CSS selectors and tree traversal for HTML manipulation.
Getting Started
Add dependency:
def deps do
  [
    # 0.10 predates Floki.parse_document!/1, which the examples below rely on
    {:floki, "~> 0.36"}
  ]
end
Parse HTML:
html = File.read!("index.html")
doc = Floki.parse_document!(html)
Find elements:
Floki.find(doc, "div.content")
Get text:
Floki.text(doc)
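The steps above can be tried end to end as a standalone script. This is a minimal sketch rather than a reference setup: the inline HTML is made up for illustration, and Mix.install/1 (Elixir 1.12 or later) stands in for the deps entry so the file can be run directly with the elixir command.

```elixir
# Assumption: run as a script; Mix.install/1 fetches Floki on first run.
Mix.install([{:floki, "~> 0.36"}])

# Hypothetical sample document (not taken from a real page).
html = ~s(<html><body><div class="content"><p>Hello</p><p>World</p></div></body></html>)

doc = Floki.parse_document!(html)

# find/2 returns a list of matching nodes, each a {tag, attrs, children} tuple.
paragraphs = Floki.find(doc, "div.content p")

# text/1 accepts a single node as well as a whole tree.
texts = Enum.map(paragraphs, &Floki.text/1)

IO.inspect(texts)
```

Working on an inline string keeps the example independent of any index.html on disk.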
Selecting
By CSS selector:
Floki.find(doc, "div.main")
By tag name:
Floki.find(doc, "img")
By id:
Floki.find(doc, "#header")
By attribute:
Floki.find(doc, "[href]")
Traversing
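Get parent (nodes are plain {tag, attrs, children} tuples with no link back to their parent, so select the parent from the document):
[parent | _] = Floki.find(doc, "div.content")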
Get children:
Floki.children(element)
Get siblings (use the CSS sibling combinators, "~" for all later siblings and "+" for the next one):
Floki.find(doc, "h2 ~ p")
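Traversal in Floki goes downward through children or sideways through selectors. The sketch below shows both on a made-up fragment; the HTML and variable names are assumptions for illustration only.

```elixir
# Assumption: run as a script; Mix.install/1 fetches Floki on first run.
Mix.install([{:floki, "~> 0.36"}])

# Hypothetical fragment with one heading followed by two paragraphs.
html = ~s(<div><h2>Title</h2><p>One</p><p>Two</p></div>)
[section] = Floki.parse_fragment!(html)

# children/1 returns the direct child nodes; the pattern skips any text nodes.
child_tags = for {tag, _attrs, _children} <- Floki.children(section), do: tag

# "~" matches every later sibling, "+" only the immediately following one.
following = Floki.find([section], "h2 ~ p") |> Enum.map(&Floki.text/1)
adjacent = Floki.find([section], "h2 + p") |> Enum.map(&Floki.text/1)

IO.inspect({child_tags, following, adjacent})
```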
Manipulation
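Insert element (rebuild the tree, returning a list so new_el lands right after target_el):
Floki.traverse_and_update(doc, fn ^target_el -> [target_el, new_el]; node -> node end)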
Replace element (return new_el wherever target_el is found):
Floki.traverse_and_update(doc, fn ^target_el -> new_el; node -> node end)
Remove elements matching a selector:
Floki.filter_out(doc, "div.ad")
Update attribute (the function receives the current value):
Floki.attr(doc, "img", "src", fn _old -> "new.jpg" end)
Append HTML (to the top-level node list):
doc ++ Floki.parse_fragment!("<div>New div</div>")
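Because a parsed document is immutable data, every manipulation returns a new tree. The following sketch chains three such steps using Floki.filter_out/2, Floki.attr/4 and Floki.traverse_and_update/2; the sample markup and the choice to turn p into section are assumptions made for the example.

```elixir
# Assumption: run as a script; Mix.install/1 fetches Floki on first run.
Mix.install([{:floki, "~> 0.36"}])

# Hypothetical fragment containing an image, an ad and a body paragraph.
html = ~s(<div><img src="old.jpg"><p class="ad">Buy now</p><p>Body</p></div>)
doc = Floki.parse_fragment!(html)

# Remove: drop every node matching the selector.
without_ads = Floki.filter_out(doc, "p.ad")

# Update attribute: the function receives the current value and returns the new one.
updated = Floki.attr(without_ads, "img", "src", fn _old -> "new.jpg" end)

# Replace: return a different node for the ones to swap, and pass the rest through.
replaced =
  Floki.traverse_and_update(updated, fn
    {"p", attrs, children} -> {"section", attrs, children}
    other -> other
  end)

IO.puts(Floki.raw_html(replaced))
```

Each intermediate value (without_ads, updated, replaced) is a complete tree, so any step can be inspected or reused on its own.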
Parsing HTML
From string:
html = "<html>...</html>"
doc = Floki.parse_document!(html)
From file:
doc = Floki.parse_document!(File.read!("index.html"))
From URL:
doc = Floki.parse_document!(HTTPoison.get!(url).body)
Extracting Data
Extract text:
Floki.text(doc)
Find links:
Floki.find(doc, "a[href]") |> Floki.attribute("href")
Extract images:
Floki.find(doc, "img") |> Floki.attribute("src")
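A common follow-up is to keep each link's text together with its target. The sketch below does that on a made-up fragment; the markup and the tuple shape of the result are assumptions, not something Floki prescribes.

```elixir
# Assumption: run as a script; Mix.install/1 fetches Floki on first run.
Mix.install([{:floki, "~> 0.36"}])

# Hypothetical fragment: one real link, one anchor without href, one image.
html = ~s(<div><a href="/docs">Docs</a><a>Plain</a><img src="logo.png" alt="Logo"></div>)
doc = Floki.parse_fragment!(html)

# Pair each link's text with its href; "a[href]" skips anchors that have no href.
links =
  for a <- Floki.find(doc, "a[href]") do
    [href] = Floki.attribute([a], "href")
    {Floki.text(a), href}
  end

# attribute/3 combines the find and the attribute lookup in one call.
images = Floki.attribute(doc, "img", "src")

IO.inspect({links, images})
```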
Advanced Usage
Parse fragments:
{:ok, doc} = Floki.parse_fragment(html_fragment)
Encode special chars:
Floki.raw_html(doc) # serialize the tree; special characters in text are entity-encoded by default
Decode entities:
Floki.raw_html(doc, encode: false) # entities are decoded at parse time; encode: false keeps them decoded on output
Inspect HTML tree:
IO.inspect(doc) # print HTML tree
More Examples
Find by class name:
Floki.find(doc, ".article")
Nest selectors:
Floki.find(doc, "div.content ul li a")
Traverse tree:
[parent | _] = Floki.find(doc, "div.content") # no parent lookup exists, so select the parent from the document
children = Floki.children(element)
Manipulate HTML:
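Floki.traverse_and_update(doc, fn ^content_div -> [content_div, new_div]; node -> node end)
Floki.traverse_and_update(doc, fn ^old_img_el -> new_img_el; node -> node end)
Floki.filter_out(doc, "div.ad")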
Extract text, links, images:
text = Floki.text(doc)
links = Floki.find(doc, "a[href]") |> Floki.attribute("href")
imgs = Floki.find(doc, "img") |> Floki.attribute("src")
Advanced Usage
Parse fragments:
fragment = "<div>...</div>"
{:ok, doc} = Floki.parse_fragment(fragment)
Escape HTML:
html = "<div>10 > 5</div>"
escaped = html |> Floki.parse_fragment!() |> Floki.raw_html()
Unescape HTML:
html = "<div>Hello</div>"
unescaped = html |> Floki.parse_fragment!() |> Floki.raw_html(encode: false)
Inspect tree:
html
|> Floki.parse_document!()
|> IO.inspect()
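The relationship between parsing, decoding and encoding can be seen in one round trip. This is a small sketch on a made-up fragment, using only Floki.parse_fragment/1, Floki.text/1 and Floki.raw_html/2 with its encode option.

```elixir
# Assumption: run as a script; Mix.install/1 fetches Floki on first run.
Mix.install([{:floki, "~> 0.36"}])

# parse_fragment/1 returns a tagged tuple; the bang variant raises instead.
{:ok, doc} = Floki.parse_fragment("<div>10 &gt; 5</div>")

# Entities are decoded while parsing, so the text holds a literal ">".
text = Floki.text(doc)

# raw_html/1 encodes special characters again on output...
encoded = Floki.raw_html(doc)

# ...while encode: false writes the text exactly as stored in the tree.
unencoded = Floki.raw_html(doc, encode: false)

IO.inspect({text, encoded, unencoded})
```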
Large Documents
Floki does not load HTML lazily: Floki.parse_document!/1 builds the whole tree in memory. For large documents, parse once and reuse the tree for every query:
html = File.read!("large.html")
tree = Floki.parse_document!(html)
# Both queries reuse the already parsed tree
meta = Floki.find(tree, "meta")
head = Floki.find(tree, "head")
Parsing once and querying many times avoids re-parsing the same large input.
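Scoping a Search
Floki has no separate search function. Floki.find/2 always searches the whole tree it is given, so scope a query with a more specific selector:
Floki.find(tree, "meta") # every meta element in the document
Floki.find(tree, "head meta") # only meta elements inside head
Scope the selector whenever you know where the elements live, so unrelated matches are excluded.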
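Scoping can be done either in the selector or by querying a subtree that was selected earlier. The sketch below compares the two on a made-up fragment; the ids nav and main are assumptions for the example.

```elixir
# Assumption: run as a script; Mix.install/1 fetches Floki on first run.
Mix.install([{:floki, "~> 0.36"}])

# Hypothetical fragment with links in two separate regions.
html = ~s(<div id="nav"><a href="/a">A</a></div><div id="main"><a href="/b">B</a></div>)
doc = Floki.parse_fragment!(html)

# Unscoped: every link in the fragment.
all_links = Floki.find(doc, "a")

# Scoped with a descendant selector.
main_links = Floki.find(doc, "#main a")

# Scoped by running find/2 on a previously selected subtree.
main = Floki.find(doc, "#main")
same_links = Floki.find(main, "a")

IO.inspect({length(all_links), Floki.attribute(main_links, "href")})
```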
LiveView Integration
Floki can parse and rewrite HTML inside a Phoenix LiveView on the server before it is sent to the client:
def handle_info(%{topic: "new_html"}, socket) do
  html = ExternalApi.fetch_html()
  doc = Floki.parse_document!(html)
  # Manipulate doc
  html = Floki.raw_html(doc)
  # handle_info/2 cannot reply; store the result in assigns instead
  {:noreply, assign(socket, :html, html)}
end
HTML to CSV/JSON
Use Floki to extract data from HTML to other formats like CSV/JSON:
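# CSV: one row per <tr>, one cell per <td>
html
|> Floki.parse_document!()
|> Floki.find("table tr")
|> Enum.map(fn row -> row |> Floki.find("td") |> Enum.map(&Floki.text/1) end)
|> CSV.encode()
|> Enum.each(&IO.write/1)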
# JSON: post_to_map/1 turns one post node into a map
html
|> Floki.parse_document!()
|> Floki.find("div.post")
|> Enum.map(&post_to_map/1)
|> JSON.encode!()
|> IO.write()
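The extraction step is the same whatever the output format: turn nodes into plain lists or maps first, then hand them to an encoder. The sketch below stops at that intermediate data and builds a deliberately naive CSV string by hand; the sample markup and the post_to_map function are hypothetical, and a real CSV or JSON library would replace the manual join.

```elixir
# Assumption: run as a script; Mix.install/1 fetches Floki on first run.
Mix.install([{:floki, "~> 0.36"}])

# Hypothetical fragment: a two-row table followed by one post.
html =
  ~s(<table><tr><td>Ada</td><td>36</td></tr><tr><td>Linus</td><td>54</td></tr></table>) <>
    ~s(<div class="post"><h2>First</h2><p>Body one</p></div>)

doc = Floki.parse_fragment!(html)

# One list of cell texts per table row.
rows =
  for row <- Floki.find(doc, "table tr") do
    row |> Floki.find("td") |> Enum.map(&Floki.text/1)
  end

# Naive CSV for illustration only: no quoting or escaping of cell values.
csv = Enum.map_join(rows, "\n", &Enum.join(&1, ","))

# Hypothetical post_to_map: picks the title and body out of one post node.
post_to_map = fn post ->
  %{
    title: post |> Floki.find("h2") |> Floki.text(),
    body: post |> Floki.find("p") |> Floki.text()
  }
end

posts = doc |> Floki.find("div.post") |> Enum.map(post_to_map)

IO.puts(csv)
IO.inspect(posts)
```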
Invalid HTML
Floki can handle invalid/malformed HTML by passing
Idempotent HTML
Sort attributes so the serialized HTML comes out the same on every pass:
doc
|> Floki.find_and_update("div", fn {tag, attributes} ->
  {tag, Enum.sort(attributes)}
end)
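The normalization can be checked by applying it twice and comparing results. This is a minimal sketch using Floki.find_and_update/3 on a made-up one-element fragment; the normalize function name is an assumption for the example.

```elixir
# Assumption: run as a script; Mix.install/1 fetches Floki on first run.
Mix.install([{:floki, "~> 0.36"}])

# Hypothetical fragment whose attributes are not in sorted order.
doc = Floki.parse_fragment!(~s(<div id="a" class="b"></div>))

# Hypothetical helper: sorts the attributes of every div in the tree.
normalize = fn tree ->
  Floki.find_and_update(tree, "div", fn {tag, attributes} ->
    {tag, Enum.sort(attributes)}
  end)
end

once = normalize.(doc)
twice = normalize.(once)

[{"div", attrs, _children}] = once
IO.inspect(attrs)
```

Running the step a second time leaves the tree unchanged, which is the property the section title refers to.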