
Parse HTML Using Perl

I have the following HTML:
Date: 19 July 2011
I have been using HTML::TreeBuilder to parse out particular parts of

Solution 1:

The "dump" method is invaluable in finding your way around an HTML::TreeBuilder object.
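As a minimal sketch of how "dump" can be used, the snippet below builds a tree from a small fragment and prints its structure. The HTML string is an assumption modelled on the markup used in this answer, not the asker's actual page.

```perl
use strict;
use warnings;

use HTML::TreeBuilder;

# Assumed sample markup, modelled on the fragment discussed in this answer.
my $tree = HTML::TreeBuilder->new_from_content(
    '<div><strong>Date: </strong> 19 July 2011</div>'
);

# Prints an indented outline of the tree to STDOUT, including the
# implicit <html>, <head> and <body> elements the parser adds.
$tree->dump;
```

The outline makes it easy to see which text sits directly under the <div> and which sits inside <strong>.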

The solution here is to get the parent element of the element you're interested in (which is, in this case, the <div>) and iterate across its content list. The text you're interested in will be plain text nodes, i.e. elements in the list that are not references to HTML::Element objects.

#!/usr/bin/perl

use strict;
use warnings;

use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;

$tree->parse(<<END_OF_HTML);
<div>
   <strong>Date: </strong>
       19 July 2011
</div>
END_OF_HTML
$tree->eof;  # tell the parser the input is complete

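my $date;

for my $div ($tree->look_down( _tag => 'div')) {
  for ($div->content_list) {
    # Child elements are HTML::Element references; plain text nodes are not.
    $date = $_ unless ref;
  }
}

if (defined $date) {
  $date =~ s/^\s+|\s+$//g;  # strip whitespace around the text node
  print "$date\n";
}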

Solution 2:

It looks like HTML::Element::content_list() is the function you want. Descendant nodes will be objects while text will just be text, so you can filter with ref() to just get the text part(s).

for ($tree->find('div')) {
  my @content = grep { ! ref } $_->content_list;
  # @content now contains just the bare text portion of the tag
}
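For a self-contained version of the same idea, here is a rough sketch that wraps the filter in a helper and trims the result. The helper name direct_text and the sample markup are assumptions for illustration, not part of HTML::Element.

```perl
use strict;
use warnings;

use HTML::TreeBuilder;

# Hypothetical helper: returns the trimmed text nodes that are direct
# children of $element, skipping child elements and whitespace-only text.
sub direct_text {
    my ($element) = @_;
    my @text;
    for my $node ($element->content_list) {
        next if ref $node;    # child element (an HTML::Element object)
        (my $trimmed = $node) =~ s/^\s+|\s+$//g;
        push @text, $trimmed if length $trimmed;
    }
    return @text;
}

# Assumed sample markup.
my $tree = HTML::TreeBuilder->new_from_content(<<'END_OF_HTML');
<div>
   <strong>Date: </strong>
       19 July 2011
</div>
END_OF_HTML

for my $div ($tree->find('div')) {
    print "$_\n" for direct_text($div);
}

$tree->delete;    # release the tree once finished with it
```

With this input the loop should print just the date, since "Date: " lives inside <strong> rather than directly under the <div>.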

Solution 3:

You could work around it by removing the text within <strong> from <div>:

my $div      = $tree->look_down( '_tag' => 'div' );
my $div_text = $div->as_trimmed_text;
my $date     = $div_text;    # declared here so it stays in scope after the if block
if ( my $strong = $div->look_down( '_tag' => 'strong' ) ) {
    my $strong_text = $strong->as_trimmed_text;
    $date =~ s/\Q$strong_text\E\s*//;    # \Q...\E matches the label text literally
}
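One caveat when substituting extracted text into a pattern: any regex metacharacters in the label are interpreted as pattern syntax unless they are quoted with \Q...\E. The sketch below illustrates the difference using made-up strings; the label and text values are assumptions, not taken from the question.

```perl
use strict;
use warnings;

# Hypothetical label and surrounding text containing regex metacharacters.
my $label = 'Price (USD):';
my $text  = 'Price (USD): 42';

# Unquoted: the parentheses act as a capture group, so the pattern looks
# for "Price USD:" and the substitution leaves the text unchanged.
(my $unquoted = $text) =~ s/$label\s*//;

# Quoted: \Q...\E escapes the metacharacters, so the label matches literally.
(my $quoted = $text) =~ s/\Q$label\E\s*//;

print "unquoted: $unquoted\n";
print "quoted:   $quoted\n";
```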
