I have had the exact same problem with HTML::TreeBuilder. Actually, this is a problem with the "as_text" method of HTML::Element.
As a workaround, I hacked together a new method called "as_newline_text()" by ugly copy-and-paste programming. It is a modified copy of the original "as_text" method:
# You may add this to .../HTML/Element.pm
sub as_newline_text {
    # STh: a special version of as_text that tries to keep the outline structure
    my ($this, %options) = @_;
    my $skip_dels = $options{'skip_dels'} || 0;
    #print "Skip dels: $skip_dels\n";
    my (@pile) = ($this);
    my $tag;
    my $text = '';
    while (@pile) {
        if (!defined($pile[0])) { # undef!
            shift @pile; # discard it, otherwise the loop would never end
        } elsif (!ref($pile[0])) { # text bit! save it!
            # $text .= "\n{$tag}" if $HTML::Element::canTighten{$tag};
            # $text .= "\n[$tag]" unless $HTML::Element::canTighten{$tag};
            $text .= shift @pile;
        } else { # it's a ref -- traverse under it
            # $nillio is the empty array ref that Element.pm defines for itself
            unshift @pile, @{$this->{'_content'} || $nillio}
                unless
                    ($tag = ($this = shift @pile)->{'_tag'}) eq 'style'
                    or $tag eq 'script'
                    or ($skip_dels and $tag eq 'del');
            # $text .= "\n{+$tag}" if $HTML::Element::canTighten{$tag};
            # $text .= "\n[+$tag]" unless $HTML::Element::canTighten{$tag};
            # start a new paragraph whenever a block-level element opens
            $text .= "\n\n" if $HTML::Element::canTighten{$tag};
        }
    }
    $text =~ s/^\n+//; # remove all leading \n
    $text =~ s/\n+$//; # remove all trailing \n
    $text =~ s/\n{3,}/\n\n/g; # collapse runs of three or more \n to a double \n
    return $text;
}
Uncommenting the other "$text .= ..." lines shows a bit more of how the traversal works. You can also replace the "\n\n" with a single space if you do not want newlines in your address.
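If you would rather not edit the installed Element.pm, here is a minimal sketch of the same idea defined from your own script. It is an adaptation, not the code above verbatim: since $nillio is private to Element.pm, an empty array ref stands in for it, and it reads %HTML::Tagset::canTighten directly (my assumption being that this is the same table HTML::Element uses). The sample HTML string is made up for illustration.

```perl
use strict;
use warnings;
use HTML::Tagset;
use HTML::TreeBuilder;

# Sketch: install the method into HTML::Element from this script
# instead of patching the module file.
sub HTML::Element::as_newline_text {
    my ($this, %options) = @_;
    my $skip_dels = $options{'skip_dels'} || 0;
    my @pile = ($this);
    my $text = '';
    while (@pile) {
        my $node = shift @pile;
        next unless defined $node;      # skip undef entries
        if (!ref $node) {               # plain text segment: keep it
            $text .= $node;
            next;
        }
        my $tag = $node->{'_tag'};
        # descend into the children, except for style/script (and del on request)
        unshift @pile, @{ $node->{'_content'} || [] }
            unless $tag eq 'style'
                or $tag eq 'script'
                or ($skip_dels and $tag eq 'del');
        # paragraph break whenever a block-level element opens
        $text .= "\n\n" if $HTML::Tagset::canTighten{$tag};
    }
    $text =~ s/^\n+//;                  # remove leading newlines
    $text =~ s/\n+$//;                  # remove trailing newlines
    $text =~ s/\n{3,}/\n\n/g;           # at most one blank line in a row
    return $text;
}

# Hypothetical input, just to compare the two methods
my $tree = HTML::TreeBuilder->new_from_content('<p>One</p><p>Two</p>');
my $text = $tree->as_newline_text;
print $text, "\n";
```

With this in place, as_text() still runs the two paragraphs together, while as_newline_text() should separate them with a blank line.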