Hi Joshua
The problem with HTML is that it is inherently free-form, and stumbling across misnested tags can throw the best algorithms for a loop (no pun intended). Assuming your tags are properly nested, though, the best way of dealing with this is either to switch to HTML::TreeBuilder (which would let you treat the spans as nodes in a tree) or to maintain a tag stack or a tag count. I've chosen the latter (a tag count) in the following Perl snippet:
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser::Simple 3.13;
my $parser = HTML::TokeParser::Simple->new(handle => \*DATA);
while (my $token = $parser->get_token) {
    next unless $token->is_start_tag('span');
    my $html = get_element($parser, 'span');
    print $html;
}
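# Pass this the parser and the name of the tag you're interested in.
# Call it right after the opening tag has been read; it returns everything
# up to (but not including) the matching end tag.
sub get_element {
    my ($parser, $tag) = @_;
    my $html      = '';
    my $more_tags = 0;    # how many nested <$tag> elements are still open
    while (my $token = $parser->get_token) {
        # An end tag with nothing nested still open closes our element.
        return $html if $token->is_end_tag($tag) && ! $more_tags;
        $more_tags++ if $token->is_start_tag($tag);
        $more_tags-- if $token->is_end_tag($tag);
        $html .= $token->as_is;
    }
    return $html;    # ran out of input before the element was closed
}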
__DATA__
<html>
<body>
<span>
<span foo="bar">
stuff
</span>
</span>
</body>
</html>
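
For comparison, here is a rough sketch of the tag stack alternative mentioned above. It is not part of the original script: get_element_with_stack and %VOID are names I made up for illustration, and the list of void elements is my own assumption about which tags never get an end tag. The idea is that by pushing every open tag (not just the one you care about) you can notice misnested HTML instead of silently returning the wrong chunk.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser::Simple 3.13;

# Assumed list of elements that never take an end tag; these must not be
# pushed on the stack or they would look like unclosed elements.
my %VOID = map { $_ => 1 }
    qw(area base br col embed hr img input link meta source track wbr);

# Hypothetical helper: call it right after reading a <$tag> start tag.
# Returns the inner HTML of that element, or dies on a misnested end tag.
sub get_element_with_stack {
    my ($parser, $tag) = @_;
    my $html  = '';
    my @stack = ($tag);    # the start tag the caller has already consumed

    while (my $token = $parser->get_token) {
        if ($token->is_start_tag) {
            my $name = $token->get_tag;
            push @stack, $name unless $VOID{$name};
        }
        elsif ($token->is_end_tag) {
            # Defensive: drop a leading slash in case the name carries one.
            (my $name = $token->get_tag) =~ s{^/}{};
            my $open = pop @stack;
            die "Misnested HTML: expected </$open>, found </$name>\n"
                unless $open eq $name;
            return $html unless @stack;    # closed the element we started in
        }
        $html .= $token->as_is;
    }
    die "Unclosed <$tag>: ran out of input\n";
}

my $parser = HTML::TokeParser::Simple->new(
    string => '<div><span>a<b>bold</b><br>c</span></div>',
);
while (my $token = $parser->get_token) {
    next unless $token->is_start_tag('span');
    print get_element_with_stack($parser, 'span'), "\n";
}
```

The tag count is simpler and is all you need if you trust the markup. The stack costs a few more lines, but it fails loudly on something like <b><i>x</b></i> rather than carrying on.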