String-Approx - Any plans to fix aslice?

Posted on Wed Apr 26 23:54:43 2006 by adamcarlson
This package would be of much greater value if you could find out what portion of the input matched your pattern. Currently, all you can do is confirm that there is a match and see what the distance is. There's no way to, for example, extract the approximately matching text. aslice is supposed to solve this problem, but it is broken: the index and size it returns do not reliably indicate where in the input the match occurred. According to the documentation, aindex can return slightly erroneous results (when there are duplicate letters, like "aa", in the pattern or input), and aslice can return strange sizes. My experience is that these results can actually be wildly off. For example, consider the following code:
#!/usr/bin/perl
use strict;
use String::Approx qw(aslice);

my $pattern = "the cask of amontillado";
my $input = <<EOF;
groundwork for this book. after brainstorming about overarching principles of companion construction and individual topics to be included, the editors crammed themselves companionably into a single elevator while continuing to converse about the project that had brought them together. although the elevator doors closed in the normal way, the passengers soon noticed that the elevator itself did not stop where they expected, but descended beyond their chosen floor. all conversation stopped as the doors opened upon a solid stone wall offering absolutely no egress. in unison came the cry "for the love of god, montresor!" this outcry, echoing the horrified plea of the victim who has been entombed behind a stonewall in edgar allan poe's story "the cask of amontillado", was proof positive that the editors experiencing the nightmarish elevator- ride shared a mental landscape distinguished by at least one landmark from the world of crime and mystery fiction. realizing this, the group engaged in nervous laughter, which only reached a stage of true levity as the elevator doors closed themselves upon the masonry, the machine ascended, and the doors finally opened into a clean, welllighted corridor. it is not insignificant that these scholars of crime and mystery writing shared quotations, images, and a sense of play about a genre that demands intelligence, imagination, and active involvement from its readers. with the exception of the riddle , there is no form o
EOF

my ($result) = aslice($pattern, $input);
my ($index, $size) = @$result;
print "index: $index\nsize: $size\n", substr($input, $index, $size), "\n";

This code produces:
index: 50
size: 725
t overarching principles of companion construction and individual topics to be included, the editors crammed themselves companionably into a single elevator while continuing to converse about the project that had brought them together. although the elevator doors closed in the normal way, the passengers soon noticed that the elevator itself did not stop where they expected, but descended beyond their chosen floor. all conversation stopped as the doors opened upon a solid stone wall offering absolutely no egress. in unison came the cry "for the love of god, montresor!" this outcry, echoing the horrified plea of the victim who has been entombed behind a stonewall in edgar allan poe's story "the cask of amontillado",

In this case, the pattern appears near the end of the input (though not at the very end), yet a significant portion of the preceding text is included in the "matching portion".

Actually, I've found that by using the 'minimal_distance' modifier, the accuracy of aslice can be significantly improved. I think that, at least, this should be the default behavior unless there are cases where it really messes up.
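
For anyone who wants to try this, here is a minimal sketch of the call, assuming the documented form in which modifiers are passed as an array reference before the inputs, and in which "minimal_distance" makes each result [index, size, distance] rather than [index, size]. The helper name approx_extract and the shortened input string are only for illustration; they are not part of String::Approx.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use String::Approx qw(aslice);

# Hypothetical helper (not part of String::Approx): returns
# (index, size, distance, matched text) for one input string,
# or an empty list if there is no match.
sub approx_extract {
    my ($pattern, $input) = @_;

    # Assumption: with the "minimal_distance" modifier, aslice returns
    # [index, size, distance] for each input instead of [index, size].
    my ($result) = aslice($pattern, ["minimal_distance"], $input);
    return unless $result && @$result;

    my ($index, $size, $distance) = @$result;
    return ($index, $size, $distance, substr($input, $index, $size));
}

# Shortened input for illustration; the full text from the post works the same way.
my $input = q{in edgar allan poe's story "the cask of amontillado", was proof positive};

my ($index, $size, $distance, $text)
    = approx_extract("the cask of amontillado", $input);

if (defined $index) {
    print "index: $index\nsize: $size\ndistance: $distance\n$text\n";
} else {
    print "no match\n";
}
```

Whether the reported slice is tight enough will still depend on the input, so it is worth checking the extracted text against a few known cases before relying on it.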

Adam