Thread

Posted on Mon Jun 27 00:44:15 2005 by jerrycap
HTML::TreeBuilder
I am using an example from Sean Burke's "Perl & LWP" to build a tree with the TreeBuilder module. I want to extract the text from embedded HTML tables, and the as_text() method is used to extract the text. I am stumbling upon embedded
tags in the table text that are not being evaluated as I would expect. Example of text:
123 Main Street
Anytown, CA
123456
This is displayed as 123 Main StreetAnytown, CA123456 when using the as_text() method. Is there some way of translating the HTML/text so that it displays on separate lines, as in:
123 Main Street
Anytown, CA
123456
Direct Responses: 660
Posted on Mon Jun 27 15:26:51 2005 by jerrycap in response to 657
Re: HTML::TreeBuilder
I noticed that the CPAN forum rendered the HTML BR tags that were needed for my example, so I am re-posting for clarification. Sorry about posting to the XML::TreeBuilder module; I could not find one for HTML::TreeBuilder. In the text below, HTML-BR-TAG-HERE is used in place of the true BR tag.
=======================================================================================
I am using an example from Sean Burke's "Perl & LWP" to build a tree with the TreeBuilder module. I want to extract the text from embedded HTML tables, and the as_text() method is used to extract the text. I am stumbling upon embedded tags in the table text that are not being evaluated as I would expect.
Example of text: 123 Main Street HTML-BR-TAG-HERE Anytown, CA HTML-BR-TAG-HERE 123456
This is displayed as 123 Main StreetAnytown, CA123456 when using the as_text() method. Is there some way of translating the HTML/text so that it displays on separate lines, as in:
123 Main Street
Anytown, CA
123456
Direct Responses: 1717
Posted on Mon Jan 30 08:45:40 2006 by sth in response to 660
Re: HTML::TreeBuilder
I have had the exact same problem with HTML::TreeBuilder. Actually, this is a problem with the "as_text" method of HTML::Element.

As a workaround, I created a hack by adding a new method called "as_newline_text()", written as ugly copy-paste programming. It is a modification of the original "as_text" method:

# You may add this to .../HTML/Element.pm
# (it relies on $nillio, a lexical defined inside that file)
sub as_newline_text {
  # STh: a special version of as_text that tries to keep the outline structure
  my($this,%options) = @_;
  my $skip_dels = $options{'skip_dels'} || 0;
  #print "Skip dels: $skip_dels\n";
  my(@pile) = ($this);
  my $tag;
  my $text = '';
  while(@pile) {
    if(!defined($pile[0])) { # undef!
      shift @pile; # discard it, otherwise the loop would never advance
    } elsif(!ref($pile[0])) { # text bit! save it!
      # $text .= "\n{$tag}" if $HTML::Element::canTighten{$tag};
      # $text .= "\n[$tag]" unless $HTML::Element::canTighten{$tag};
      $text .= shift @pile;
    } else { # it's a ref -- traverse under it
      unshift @pile, @{$this->{'_content'} || $nillio}
        unless
          ($tag = ($this = shift @pile)->{'_tag'}) eq 'style'
          or $tag eq 'script'
          or ($skip_dels and $tag eq 'del');
      # $text .= "\n{+$tag}" if $HTML::Element::canTighten{$tag};
      # $text .= "\n[+$tag]" unless $HTML::Element::canTighten{$tag};
      $text .= "\n\n" if $HTML::Element::canTighten{$tag};
    }
  }
  $text =~ s/^\n+//;        # remove all leading \n
  $text =~ s/\n+$//;        # remove all trailing \n
  $text =~ s/\n{3,}/\n\n/g; # collapse multi \n to a maximum of double \n
  return $text;
}

By uncommenting the other "$text .= ..." lines, you can figure out a little bit more about how the procedure works. You may also convert the "\n\n" to a single space if you do not want newlines in your address.
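
For readers who would rather not edit the installed Element.pm, here is a minimal sketch of a different approach. It is not the method posted above: it replaces each BR element in the parsed tree with a newline text node before calling the stock as_text(). The sample markup and variable names are made up for illustration; look_down() and replace_with() are standard HTML::Element methods.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample markup standing in for the table cell from the question.
my $html = '<table><tr><td>123 Main Street<br>Anytown, CA<br>123456</td></tr></table>';

my $tree = HTML::TreeBuilder->new_from_content($html);

# Swap every <br> element for a literal newline text node,
# so the unmodified as_text() keeps the line breaks.
$_->replace_with("\n") for $tree->look_down(_tag => 'br');

# Extract the text of the first table cell.
my $cell = $tree->look_down(_tag => 'td');
my $text = $cell->as_text;
print "$text\n";

# Free the tree explicitly, as the HTML::TreeBuilder docs recommend.
$tree->delete;
```

This only handles BR tags. Other block-level elements such as P or TR would still run together, which is the case the as_newline_text() hack above tries to cover more broadly.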