Thread

Posted on Thu Mar 2 21:01:49 2006 by gnurob
Garbled Text
Alfred, thanks for sharing the PDF::API2 module. It has enabled me to audit more than 10,000 PDF documents without breaking a sweat.

I did run into one problem that I can't lick. Approximately one out of ten PDF documents contain wide characters that $pdf-info() returns as "junk_text" rather than "(OHS).PDF" as seen in Acrobat 6.0, Document Properties, Description. This also happens with encrypted documents...

The CPAN::Forum would not accept the remainder of this post. Here's some PDF examples.

www.gs.gov.nl.ca/ohs/pdf/ann-rep-whsi.pdf (garbled) www.gs.gov.nl.ca/cca/cr/pdf/coop/coop21-art-dis.pdf (works) www.gs.gov.nl.ca/misc/data/gazette/wk/2006-01-13.pdf (garbled, encrypted file)

Can PDF-API2 process unicode characters in meta info?

Thanks, Rob
Direct Responses: 1893 | Write a response
Posted on Thu Mar 2 21:05:07 2006 by gnurob in response to 1892
Re: Garbled Text
# The script, in brief... use PDF::API2; use LWP; $ua = LWP::UserAgent->new; $ua->agent('PDFInspector/0.2 (email@host.com)'); binmode STDOUT, ":utf8"; @url_list = qw( http://www.gs.gov.nl.ca/ohs/pdf/ann-rep-whsi.pdf http://www.gs.gov.nl.ca/cca/cr/pdf/coop/coop21-art-dis.pdf http://www.gs.gov.nl.ca/misc/data/gazette/wk/2006-01-13.pdf ); # ann-rep-whsi.pdf contains wide characters # coop21-art-dis.pdf # 2006-01-13.pdf is encrypted foreach (@url_list) { ($pdf_doc, $pdf_status, undef, undef) = do_get ($pdf_url, "Accept-Language" => "en"); $pdf = PDF::API2->openScalar($pdf_doc); %pdf_info = $pdf->info(); print "URL: $pdf_url"; print "Title: $pdf_info{'Title'}"; print "Author: $pdf_info{'Author'}"; print "Subject: $pdf_info{'Subject'}"; print "Creator: $pdf_info{'Creator'}"; print "Producer: $pdf_info{'Producer'}"; } # Results look like... URL: http://offline.gs.gov.nl.ca/ohs/pdf/ann-rep-whsi.pdf Title: *Garbled* Author: *Garbled* Subject: Creator: *Garbled* Producer: Acrobat PDFWriter 4.0 for Windows URL: http://offline.gs.gov.nl.ca/cca/cr/pdf/coop/coop01-art-inc.pdf Title: The Co-Operatives Act (Form 1) Author: Commercial Registrations Subject: Articles of Incorporation (Section 8) Creator: Producer: Acrobat Distiller 4.05 for Windows URL: http://offline.gs.gov.nl.ca/misc/data/gazette/wk/2006-01-13.pdf Title: *Garbled* Author: *Garbled* Subject: Creator: *Garbled* Producer: *Garbled*
Direct Responses: 2204 | Write a response
Posted on Thu Apr 27 05:19:17 2006 by karl in response to 1893
Re: Garbled Text

Hi Rob -

It seems that PDF-API2 returns unicode characters in attribute values in byte-swapped order for some reason.

Here's how I changed your code to compensate for that.

#!perl use strict; use PDF::API2 1.19; # Revision 1.19 2004/03/20 08:38:38 fredo added isEncrypted determinator use Unicode::String qw(utf8); use Encode; use LWP::UserAgent; use HTTP::Request::Common; use HTTP::Response; binmode STDOUT, ":utf8"; use constant GET_FROM_HTTP => 1; my $ua = LWP::UserAgent->new; $ua->agent('PDFInspector/0.2 (email@host.com)'); my @url_list = qw( http://www.gs.gov.nl.ca/ohs/pdf/ann-rep-whsi.pdf http://www.gs.gov.nl.ca/cca/cr/pdf/coop/coop21-art-dis.pdf http://www.gs.gov.nl.ca/misc/data/gazette/wk/2006-01-13.pdf ); # ann-rep-whsi.pdf contains wide characters # coop21-art-dis.pdf # 2006-01-13.pdf is encrypted foreach my $pdf_url (@url_list) { my ($pdf_doc, $pdf_status, $pdf, %pdf_info); ($pdf_doc, $pdf_status, undef, undef) = do_get ($pdf_url, "Accept-Language" => "en"); $pdf = PDF::API2->openScalar($pdf_doc); %pdf_info = $pdf->info(); print sprintf("%-10s %s\n", "URL:",$pdf_url); foreach my $attribute (qw(Title Author Subject Creator Producer)) { my $value = $pdf_info{$attribute}; print sprintf("%-10s ", $attribute.":"); if ($pdf->isEncrypted) { print "[encrypted]"; } elsif (Encode::is_utf8($value)) { # Unicode bug in PDF-API2? Byte order is swapped. print utf8($value)->byteswap(); } else { print $value; } print "\n"; } print "\n"; } sub str2hex { my $line = shift; $line =~ s/(.)/sprintf("%02x ",ord($1))/eg; return $line; } sub do_get { my $url = shift; if (GET_FROM_HTTP) { my $response = $ua->request(GET $url); return ($response->content, $response->code, undef, undef); }else{ $url =~ s/^.*\///; my $content = read_file($url); return ($content, 200, undef, undef); } } sub read_file { # modified code from Perl Slurp-Eaze by Uri Guttman http://www.perl.com/pub/a/2003/11/21/slurp. +html my( $file_name ) = shift; my $buf; my $buf_ref = \$buf; open( FH, "<", $file_name ) or die "Can't open $file_name: $!"; binmode(FH); my $size_left = -s FH; while( $size_left > 0 ) { my $read_cnt = sysread( FH, ${$buf_ref}, $size_left, length ${$buf_ref} ); unless( $read_cnt ) { die "read error in file $file_name: $!"; last; } $size_left -= $read_cnt; } return ${$buf_ref}; } __END__ # Results look like... URL: http://www.gs.gov.nl.ca/ohs/pdf/ann-rep-whsi.pdf Title: (OHS).PDF Author: JDutton Subject: Creator: A:\annual report 2000 (OHS).wpd Producer: Acrobat PDFWriter 4.0 for Windows URL: http://www.gs.gov.nl.ca/cca/cr/pdf/coop/coop21-art-dis.pdf Title: The Co-Operatives Act (Form 21) Author: Commercial Registrations Subject: Articles of Dissolution (Section 116, 117, 118) Creator: Producer: Acrobat Distiller 4.05 for Windows URL: http://www.gs.gov.nl.ca/misc/data/gazette/wk/2006-01-13.pdf Title: [encrypted] Author: [encrypted] Subject: [encrypted] Creator: [encrypted] Producer: [encrypted]

Enjoy.

--Karl.

Write a response