Parse Html Using Perl
I have the following HTML:
<div>
<strong>Date: </strong>
19 July 2011
</div>
I have been using HTML::TreeBuilder to parse out particular parts of ...
Solution 1:
The "dump" method is invaluable in finding your way around an HTML::TreeBuilder object.
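As a rough illustration of that, here is a minimal sketch that parses the fragment from the question and dumps the resulting tree. The one-line HTML string is just the question's markup squeezed together; nothing else is assumed beyond HTML::TreeBuilder being installed.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;
$tree->parse('<div><strong>Date: </strong>19 July 2011</div>');
$tree->eof;    # tell the parser there is no more input

# Print an indented outline of the tree to STDOUT. Elements are shown
# as tags with their position in the tree, text nodes as quoted strings.
$tree->dump;
```

In the output, the date should show up as a quoted string sitting directly under the <div>, next to the <strong> element rather than inside it, which is why asking the <strong> for its text never returns it.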
The solution here is to get the parent element of the element you're interested in (which is, in this case, the <div>) and iterate across its content list. The text you're interested in will be plain text nodes, i.e. elements in the list that are not references to HTML::Element objects.
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;
$tree->parse(<<END_OF_HTML);
<div>
<strong>Date: </strong>
19 July 2011
</div>
END_OF_HTML
$tree->eof;    # signal end of input so the tree is complete

my $date;
for my $div ($tree->look_down( _tag => 'div')) {
    for ($div->content_list) {
        # Child elements are HTML::Element references; plain text is not.
        $date = $_ unless ref;
    }
}
print "$date\n";
Solution 2:
It looks like HTML::Element::content_list() is the function you want. Descendant nodes will be objects while text will just be text, so you can filter with ref() to just get the text part(s).
for ($tree->find('div')) {
    my @content = grep { ! ref } $_->content_list;
    # @content now contains just the bare text portion of the tag
}
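For a self-contained version of the filter above, something like the following sketch might do. The @dates array and the whitespace trimming are additions for illustration, not part of the original answer, and the HTML is the fragment from the question.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;
$tree->parse(<<'END_OF_HTML');
<div>
<strong>Date: </strong>
19 July 2011
</div>
END_OF_HTML
$tree->eof;

my @dates;    # hypothetical name: one entry per <div> that has bare text
for my $div ($tree->find('div')) {
    # Keep only plain-text children; child elements are HTML::Element refs.
    my @text = grep { !ref } $div->content_list;

    # Join the pieces and strip leading and trailing whitespace.
    my $joined = join ' ', @text;
    $joined =~ s/^\s+|\s+$//g;

    push @dates, $joined if length $joined;
}

print "$_\n" for @dates;
```

Collecting into an array rather than a single scalar keeps every match if the page has more than one such <div>.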
Solution 3:
You could work around it by removing the text within <strong> from the text of <div>:
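my $div = $tree->look_down( '_tag' => 'div' );
my $div_text = $div->as_trimmed_text;
# Declared outside the if block so it is still visible afterwards.
my $date = $div_text;
if ( my $strong = $div->look_down( '_tag' => 'strong' ) ) {
    my $strong_text = $strong->as_trimmed_text;
    # \Q...\E stops characters in the label being read as regex syntax.
    $date =~ s/\Q$strong_text\E\s*//;
}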