Encode - Unicode pattern matching

Posted on Tue Jun 5 21:09:26 2007 by rkellerjr
Unicode pattern matching
I have just delved into the world of Unicode versus Latin1. I've written a quick program (ignore some of the messy code, it's a work in progress) that traps strings of text containing Unicode or, more precisely, traps only those lines that Perl cannot translate to Latin1. My goal is to trap those lines and then substitute the Unicode character(s) in them with something of my choosing from the Latin1 character set. For example, when Perl tries to translate the Unicode right accent it says there is no direct translation, but I want it translated to Latin1 character 39, the apostrophe.

The subroutine (created for ease of reading) is where that action takes place. What I want to do is simply substitute the Unicode character, using the hex value Perl gives me in its error message, with another character. However, Perl does not find the hex value within the string, even though Perl itself reports it, so no substitution happens. If I use the commented-out line instead, Perl puts question marks in place of the characters that have no direct translation to Latin1. Thanks for your time.
#use Encode;
use Encode 'from_to';
#use Encode qw (:fallbacks);

$infile  = "\\hayter\\raw\\2007-00a\\Mod_extract.xml";
$outfile = "rich.txt";

print "Open IN ... ";
open (IN, "< $infile") || die "Could not open $infile: $!\n";
print "Done\n";

print "Open OUT ... ";
open (OUT, "> $outfile") || die "Could not open $outfile: $!\n";
print "Done\n";

print "Processing file ";
$data = <IN>;
@datalines = split (/\>/, $data);
close (IN);

open (XML, "> test.xml");
foreach $dataline (@datalines) {
    $line++;
    print XML "$dataline\>\n";
    eval { from_to ($dataline, "utf8", "iso-8859-1", 1); };
    if ($@) {
        $myerror = $@;
        $myerror =~ s/^.+\{(.+)\}.*$/$1/;
        $myerror =~ s/\s*$//;
        $myhex = $myerror;
        $myerror = "\\x\{" . hex($myerror) . "\}";
        $errors{$myerror}++;
        $errhex{$myerror} = $myhex;
        $errline{$myerror} = $line unless $errline{$myerror};
        &unicode_latin1_lax ($dataline); # Translate to something I want
    }
    print OUT "$dataline\>\n";
}
close (OUT);
close (XML);
print " Done\n\n";

if (%errors) {
    print "Printing errors ... ";
    open (ERR, "> unicode.err") || die "Could not open unicode.err: $!\n";
    foreach $error (sort keys %errors) {
        $decerr = $error;
        $decerr =~ s/\\//;
        $decerr =~ s/x//;
        $decerr =~ s/\{//;
        $decerr =~ s/\}//;
        print ERR "ERROR \($errors{$error}\): \"\\x\{$errhex{$error}\}\" does not map to iso-8859-1, DEC\. $decerr - EX\. line $errline{$error}\n";
    }
    close (ERR);
    print "Done\n\n";
}

sub unicode_latin1_lax {
    my ($dataline) = @_;
    $dataline =~ s/\x{2018}/\~/g; # Should translate unicode to what I want
    # from_to ($dataline, "utf8", "iso-8859-1", 0);
    return $dataline;
}
Here's what Perl reports when attempting to translate one of the lines from the text file:

"\x{2018}" does not map to iso-8859-1 at C:/Perl_58/site/lib/Encode.pm line 183.
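
For reference, here is a minimal sketch of the behaviour I'm after. It is not the program above, and it rests on a few assumptions: that the input really is UTF-8 encoded bytes, that the pattern fails to match because from_to works on raw bytes (so the string holds the three bytes E2 80 98 rather than the single character \x{2018}), and that decoding to characters first is an acceptable approach. The helper name utf8_to_latin1_lax is made up for illustration, and including \x{2019} (the right single quote) alongside \x{2018} is my own guess at what should map to the apostrophe.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper (name is an assumption, not part of Encode):
# takes UTF-8 encoded bytes and returns iso-8859-1 encoded bytes.
sub utf8_to_latin1_lax {
    my ($bytes) = @_;

    # Decode the raw bytes into Perl characters so that \x{2018}
    # can match as a single character in the regex below.
    my $text = decode('UTF-8', $bytes);

    # Replace the curly single quotes with the ASCII apostrophe (character 39).
    $text =~ s/[\x{2018}\x{2019}]/'/g;

    # Encode to Latin1; with the default CHECK value, any character that
    # still has no Latin1 mapping is replaced by a question mark.
    return encode('iso-8859-1', $text);
}

# UTF-8 bytes for: It(U+2019)s (U+2018)quoted(U+2019)
my $in  = "It\xE2\x80\x99s \xE2\x80\x98quoted\xE2\x80\x99";
my $out = utf8_to_latin1_lax($in);
print "$out\n";    # prints: It's 'quoted'
```

Note that the sketch uses the value the subroutine returns. In the program above, unicode_latin1_lax works on a copy made by my ($dataline) = @_ and its return value is discarded, so the caller's $dataline would stay unchanged even if the pattern did match.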