HTML::Parser is a Perl module for event-driven parsing of HTML documents: as the parser recognizes chunks of markup (start tags, end tags, text, comments, and so on), it invokes handler callbacks that you register, which makes it well suited to content extraction and filtering.
Getting Started
Installation:
cpan HTML::Parser
# or from package manager
HTML::Parser is distributed on CPAN and is also packaged by most operating-system distributions.
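To confirm the module is available, you can print its version from the command line:
perl -MHTML::Parser -e 'print "$HTML::Parser::VERSION\n"'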
Simple parsing:
use HTML::Parser;
my $text   = '';
my $parser = HTML::Parser->new(
    api_version => 3,
    text_h      => [ sub { $text .= shift }, 'dtext' ],
);
$parser->parse($html);
$parser->eof;
print $text;
This parses an HTML string, collects the decoded text content through a text handler, and prints it. HTML::Parser is event-driven: there is no method that returns the document text directly, so the handler accumulates it.
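Handlers can also be supplied to the constructor instead of being set afterwards; this sketch registers both a start and a text handler up front:
use HTML::Parser;
my $text = '';
my $p = HTML::Parser->new(
    api_version => 3,
    handlers    => {
        start => [ sub { print "open <$_[0]>\n" }, 'tagname' ],
        text  => [ sub { $text .= shift },         'dtext'   ],
    },
);
$p->parse($html);
$p->eof;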
Concepts:
Parsing HTML
From string:
$parser->parse($html_string);
$parser->eof;   # signal end of document after the last chunk
From file:
open(my $fh, '<', $file) or die "Cannot open $file: $!";
$parser->parse_file($fh);   # parse_file also accepts a filename directly
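Because parsing is incremental, large documents can also be fed in pieces; parse_file does this internally, but you can drive it yourself (a sketch assuming an already-open filehandle $fh):
while (read($fh, my $chunk, 8192)) {
    $parser->parse($chunk);   # feed the document piece by piece
}
$parser->eof;                 # flush any buffered text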
From URL:
use LWP::Simple;
my $html = get($url) or die "Could not fetch $url";
$parser->parse($html);
$parser->eof;
Parse options:
$parser->strict_comment(1);                  # require properly terminated comments
$parser->strict_names(1);                    # require well-formed tag and attribute names
$parser->ignore_elements(qw(script style));  # skip the content of these elements
$parser->unbroken_text(1);                   # deliver text as single chunks, not fragments
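For XHTML-style input, xml_mode changes a few behaviours; a small configuration sketch (the $xhtml variable is illustrative):
my $p = HTML::Parser->new(api_version => 3);
$p->xml_mode(1);   # case-sensitive names; <tag/> is treated as start + end
$p->handler(start => sub { print "$_[0]\n" }, 'tagname');
$p->parse($xhtml);
$p->eof;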
Accessing Elements
HTML::Parser itself is event-driven and keeps no document tree. For element access, build a tree with HTML::TreeBuilder (from the companion HTML-Tree distribution, which subclasses HTML::Parser):
use HTML::TreeBuilder;
my $tree = HTML::TreeBuilder->new_from_content($html);
By tag name:
my @divs = $tree->find_by_tag_name('div');
By attribute:
my @named = $tree->look_down(sub { defined $_[0]->attr('name') });
With XPath (build the tree with HTML::TreeBuilder::XPath instead):
use HTML::TreeBuilder::XPath;
my $xtree = HTML::TreeBuilder::XPath->new_from_content($html);
my @links = $xtree->findnodes('//a');
Extract links:
my $links = $tree->extract_links('a');   # arrayref of [$url, $element, $attr, $tag]
Element content:
my $as_html = $element->as_HTML;   # the element and its contents as HTML
my $as_text = $element->as_text;   # text content only
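A complete sketch tying these together (assumes the HTML-Tree distribution is installed and that $html holds the markup):
use HTML::TreeBuilder;
my $tree = HTML::TreeBuilder->new_from_content($html);
# print the target of every <a href="..."> in the document
for my $a ($tree->look_down(_tag => 'a')) {
    my $href = $a->attr('href');
    print "$href\n" if defined $href;
}
$tree->delete;   # free the tree when finished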
Manipulating HTML
HTML::Parser does not edit the document in place; handlers receive information about each event, and you produce the modified output yourself, for example by appending to a string.
Modify elements:
my $out = '';
$parser->handler(start => sub {
    my ($tagname, $attr, $text) = @_;
    $attr->{class} = 'newclass';                    # change the attribute hash
    $out .= qq(<$tagname class="$attr->{class}">);  # re-emit the tag (other attributes dropped for brevity)
}, 'tagname, attr, text');
Remove elements:
$parser->ignore_elements(qw(script style));   # drop these elements and their content
Modify text:
$parser->handler(text => sub {
    my $text = shift;
    $text =~ s/foo/bar/g;
    $out .= $text;   # handler return values are ignored; append to the output instead
}, 'text');
Insert elements:
$parser->handler(start => sub {
    my ($tagname, $text) = @_;
    $out .= $text;                                          # copy the original tag
    $out .= '<div>New elem</div>' if $tagname eq 'body';    # then emit extra markup after <body>
}, 'tagname, text');
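A minimal pass-through filter built on this pattern: the events you handle are rewritten, and everything else is copied out verbatim through the default handler (a sketch; the attribute rewriting is deliberately simplistic):
use HTML::Parser;
my $html = '<div class="old"><p>hello</p></div>';
my $out  = '';
my $p    = HTML::Parser->new(api_version => 3);
# rewrite the class attribute on <div>; re-emit every other start tag untouched
$p->handler(start => sub {
    my ($tagname, $attr, $text) = @_;
    if ($tagname eq 'div') {
        $attr->{class} = 'newclass';
        $out .= '<div ' . join(' ', map { qq($_="$attr->{$_}") } sort keys %$attr) . '>';
    } else {
        $out .= $text;
    }
}, 'tagname, attr, text');
# events without their own handler (end, text, comment, ...) fall through to default
$p->handler(default => sub { $out .= shift }, 'text');
$p->parse($html);
$p->eof;
print "$out\n";   # <div class="newclass"><p>hello</p></div>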
Handlers and Events
The third argument to handler() is an argspec: a string naming which values the parser passes to the callback (for example 'tagname, attr, text').
Start handler:
$parser->handler(start => \&start, 'tagname, attr, self');
sub start {
    my ($tagname, $attr, $self) = @_;
    print "Start $tagname\n";
}
End handler:
$parser->handler(end => \&end, 'tagname');
sub end {
    my $tagname = shift;
    print "End $tagname\n";
}
Other available events: text, comment, declaration, process, start_document, end_document, and default (fired for any event that has no handler of its own).
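A small example combining events with report_tags, which limits which tags generate start/end events (the file name is illustrative):
use HTML::Parser;
my %count;
my $p = HTML::Parser->new(
    api_version => 3,
    start_h     => [ sub { $count{ $_[0] }++ }, 'tagname' ],
);
$p->report_tags(qw(a img));    # only <a> and <img> trigger start/end events
$p->parse_file('page.html');   # hypothetical input file
printf "%-4s %d\n", $_, $count{$_} for sort keys %count;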
Tree Traversal
Tree traversal needs a document tree, which HTML::TreeBuilder (HTML-Tree) builds on top of HTML::Parser:
use HTML::TreeBuilder;
my $tree = HTML::TreeBuilder->new_from_content($html);
process($tree);
sub process {
    my $node = shift;
    return unless ref $node;   # text nodes are plain strings
    # ... process the element node ...
    process($_) for $node->content_list;
}
my $parent   = $node->parent;         # parent element (undef at the root)
my @children = $node->content_list;   # child nodes: elements and text strings
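In practice, look_down often replaces an explicit traversal; a sketch using the tree built above:
# find every <a> element that carries an href attribute
my @links = $tree->look_down(
    _tag => 'a',
    sub  { defined $_[0]->attr('href') },
);
print $_->attr('href'), "\n" for @links;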
Integration
Web scraping:
use Web::Scraper;
my $scraper = scraper {
    process 'div.results a', 'links[]' => '@href';
};
my $result = $scraper->scrape($html);   # Web::Scraper builds on HTML::TreeBuilder::XPath internally
print "$_\n" for @{ $result->{links} };
Mojolicious:
use Mojolicious::Lite;
get '/' => sub {
    my $c    = shift;
    my $text = '';
    my $parser = HTML::Parser->new(api_version => 3,
        text_h => [ sub { $text .= shift }, 'dtext' ]);
    $parser->parse($html);   # $html obtained elsewhere (e.g. with Mojo::UserAgent)
    $parser->eof;
    $c->render(text => $text);
};
app->start;
Feeds:
use HTML::Parser;
use XML::FeedPP;
my $feed   = XML::FeedPP->new($feed_url);        # XML::FeedPP parses the feed itself
my $parser = HTML::Parser->new(api_version => 3,
    text_h => [ sub { print shift }, 'dtext' ]);
for my $item ($feed->get_item) {
    $parser->parse($item->description // '');    # item descriptions often contain HTML
    $parser->eof;                                # the parser resets and can be reused
}
Parsing Edge Cases
Malformed HTML:
HTML::Parser is tolerant by default: invalid markup does not make it die, and events are still generated for everything it can recognize. The strict_comment and strict_names options (both off by default) control how forgiving it is:
$parser->strict_comment(0);   # the default: accept sloppily terminated comments
$parser->strict_names(0);     # the default: accept sloppy tag and attribute names
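A quick demonstration that unclosed tags still produce the expected events (the sample markup is invented):
use HTML::Parser;
my $broken = '<p>first<p>second <b>bold';   # unclosed <p> and <b>
my $p = HTML::Parser->new(
    api_version => 3,
    start_h     => [ sub { print "start: $_[0]\n" }, 'tagname' ],
    text_h      => [ sub { print "text:  $_[0]\n" }, 'dtext'   ],
);
$p->parse($broken);
$p->eof;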
Extract data from invalid markup:
my @dates;
$parser->handler(text => sub {
    my $text = shift;
    push @dates, $1 while $text =~ /(\d{4}-\d{2}-\d{2})/g;   # collect ISO dates; return values are ignored
}, 'dtext');
Parse fragments:
HTML::Parser does not require a complete document; feed a fragment to parse() as usual:
$parser->parse($html);
$parser->eof;
Embedded content:
Embedded CSS and JavaScript arrive as plain text events between their start and end tags, so a regex for "<style" inside a text handler will never match. The simplest fix is to skip those elements entirely:
$parser->ignore_elements(qw(script style));
Best Practices
Always call $parser->eof after the last chunk so buffered text is flushed.
Use api_version => 3 and request only the values you need in each argspec.
Reuse one parser object across documents; it resets itself after eof.
Prefer the dtext argspec over text when you want entities decoded.
For random access to elements, build a tree with HTML::TreeBuilder rather than reconstructing structure by hand.
Troubleshooting
Debugging:
A catch-all default handler is the easiest way to see every event the parser generates:
$parser->handler(default => sub { printf "%-12s %s\n", @_ }, 'event, text');
$parser->parse($html);
$parser->eof;
Common bugs:
Forgetting to call eof, so buffered text never reaches the text handler.
Expecting a handler's return value to change the output; return values are ignored.
Using the text argspec when you wanted dtext, leaving entities undecoded.
Assuming text arrives in one piece; it may be split unless unbroken_text(1) is set.
Error handling:
parse() itself tolerates bad HTML, so exceptions usually come from your own handlers; wrap the call in eval if a handler may die:
eval {
    $parser->parse($html);
    $parser->eof;
};
if ($@) {
    print "Parse failed: $@";
}
Customizing
Custom parsers:
package MyParser;
use parent 'HTML::Parser';
# with the default (version 2 style) interface, events call methods on the object
sub start {
    my ($self, $tagname, $attr, $attrseq, $origtext) = @_;
    # custom logic
}
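Using the subclass (the file name is illustrative):
package main;
my $mp = MyParser->new;          # no handlers registered, so start()/end()/text() methods are called
$mp->parse_file('page.html');    # hypothetical input file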
Extending:
HTML::Parser has no plugin mechanism; extension happens through handlers or by subclassing. Several reusable subclasses ship with the distribution:
use HTML::LinkExtor;    # link extraction
use HTML::HeadParser;   # <head> section parsing
use HTML::TokeParser;   # pull-style token interface
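For example, HTML::TokeParser offers a pull interface; a sketch that lists link text and targets:
use HTML::TokeParser;
my $stream = HTML::TokeParser->new(\$html) or die "Can't build token stream";
while (my $token = $stream->get_tag('a')) {
    my $href = $token->[1]{href} // '-';          # attribute hash is the token's second element
    my $text = $stream->get_trimmed_text('/a');   # text up to the closing </a>
    print "$text -> $href\n";
}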
A note on inheritance:
Inheritance runs from your class to HTML::Parser (as above); do not reassign @HTML::Parser::ISA at runtime, since reparenting the base class affects every other user of the module in the same process.