|
Posted on Tue May 15 10:41:47 2007
by pdi
|
| Warning: Malformed UTF-8 character(s) |
|
This is possibly an odd case, but I would certainly appreciate any help.
I prepare an ARGFILE, in utf-8, with XMP tags like -xmp-dc:title=. Exiftool writes the data correctly to jpegs.
I then (1) change the tags to the corresponding IPTC ones, e.g. -xmp-dc:title= to -iptc:ObjectName=, and (2) use iconv to convert the ARGFILE encoding from utf-8 to iso-8859-7 (Greek), as most programs do not read utf-8 IPTC. The resulting file is read correctly by text editors.
However, when exiftool tries to write the data to jpegs it returns "Warning: Malformed UTF-8 character(s)". The cause seems to be the Greek characters.
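A quick way to see where the warning comes from is to look at the bytes themselves. The sketch below is plain Python, not anything taken from ExifTool, and the sample title is an assumption made up for illustration. It encodes a Greek string as iso-8859-7, as iconv would, and then checks whether those bytes can be read as UTF-8.

```python
# Illustrative only: the sample string is an assumption, not data from the ARGFILE.
title = "Τίτλος"  # Greek for "title"

utf8_bytes = title.encode("utf-8")        # encoding of the original ARGFILE
greek_bytes = title.encode("iso-8859-7")  # encoding after the iconv conversion


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(greek_bytes.hex(" "))       # one byte per Greek letter, all above 0x7F
print(is_valid_utf8(utf8_bytes))
print(is_valid_utf8(greek_bytes))
```

Every Greek letter becomes a single byte above 0x7F, and such bytes only form valid UTF-8 in specific lead/continuation patterns, so a reader expecting UTF-8 rejects them.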
Both exiftool and iconv are well established, so perhaps I am doing something wrong. But if not, is there a way to make exiftool accept the iconv output? Or is there another standard encoding conversion tool that exiftool is happy with?
Many thanks in advance,
pdi
|
|
|
Posted on Tue May 15 13:53:20 2007
by pdi
in response to 5133
|
| Re: Warning: Malformed UTF-8 character(s) |
|
After some trials I found to my surprise that, irrespective of iconv, the same error occurred both with entirely new iso-8859-7 text files and with old ARGFILEs in the same encoding, from about a year ago, which used to work perfectly.
Preliminary findings point to an important change in exiftool in ver. 6.70 in the treatment of encoded characters. I am still trying to understand its implications. From a first reading it seems to cater for either utf-8 or cp1252. What about cp1253 (iso-8859-7)?
Regards,
pdi
|
|
|
Posted on Tue May 15 14:31:15 2007
by exiftool
in response to 5135
|
| Re: Warning: Malformed UTF-8 character(s) |
Yes, ExifTool now translates coded characters for IPTC.
See FAQ #10
for details.
You can use the -L option when writing IPTC if you want to disable translation
of special characters.
- Phil
|
|
|
Posted on Tue May 15 15:04:31 2007
by pdi
in response to 5136
|
| Re: Warning: Malformed UTF-8 character(s) |
|
Phil,
Thank you for your reply. I was confused by the mention only of cp1252, but when I tried the -L option the result was correct. I'm not sure I understand it, but I'm glad it works.
Regards,
pdi
|
|
|
Posted on Tue May 15 15:16:10 2007
by exiftool
in response to 5138
|
| Re: Warning: Malformed UTF-8 character(s) |
|
This works because 1) ExifTool assumes IPTC in the file is coded in
Latin1 unless the recorded CodedCharacterSet is "ESC % G" (UTF8),
and 2) the -L option specifies the external character set as Latin1.
When the recorded character set is the same as the external character
set, no translation is performed.
I hope this makes a bit more sense now. :)
- Phil
|
|
|
Posted on Tue May 15 15:46:37 2007
by pdi
in response to 5140
|
| Re: Warning: Malformed UTF-8 character(s) |
|
Phil,
I'm afraid I was not very clear about what I don't understand. Encodings and translations are terrain only partly familiar to me. So I wonder how it all works when -L declares the text file's character set to be Latin1 (cp1252) while the file's actual character set is Greek (cp1253). To be more exact, various text editors recognize the file as ANSI, but the underlying Windows code page for Greek is cp1253. So exiftool is told to write cp1252 and writes in fluent cp1253 :-) It suits me fine, but I'd rather understand it than not :-)
Regards,
pdi
|
|
|
Posted on Tue May 15 16:06:24 2007
by exiftool
in response to 5141
|
| Re: Warning: Malformed UTF-8 character(s) |
|
I understood your confusion, but I guess you didn't understand my
explanation.
It is really fairly simple. You give ExifTool a string of bytes and tell it
what character encoding was used. As long as ExifTool thinks that
the internal and external character sets are the same, then no translation
is performed and the bytes are passed through unchanged. (This is the
behaviour of older ExifTool versions for IPTC information.)
As long as ExifTool is not translating the text, it is totally irrelevant
what character set is actually used since the bytes are passed
through unchanged. So as long as ExifTool believes there is no need
to translate the text, you are free to use whatever character set you
like.
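The pass-through rule described above can be modelled in a few lines of Python. This is only a sketch under stated assumptions: `write_text` is a hypothetical function, not ExifTool's code, and it assumes (as the warning message suggests) that input is treated as UTF-8 when -L is not given.

```python
def write_text(raw: bytes, external: str, internal: str) -> bytes:
    """Hypothetical model of the write path, not ExifTool's actual code.

    raw:      bytes exactly as supplied by the user
    external: encoding the tool is told the input uses
    internal: encoding the tool believes the file uses
    """
    if external == internal:
        return raw  # no translation: bytes pass through untouched
    # Translation: interpret the input, then re-encode it for storage.
    return raw.decode(external).encode(internal)


greek = "Τίτλος".encode("cp1253")  # made-up sample value

# -L case: both sides are believed to be Latin1, so nothing is touched.
stored = write_text(greek, external="latin-1", internal="latin-1")
print(stored == greek)
print(stored.decode("cp1253"))  # a Greek-aware reader sees the intended text
print(stored.decode("cp1252"))  # a Latin1 reader sees accented Latin letters

# Without -L (assumed UTF-8 input): the Greek bytes cannot be interpreted.
# ExifTool reports a warning at this point; this model simply raises.
try:
    write_text(greek, external="utf-8", internal="latin-1")
except UnicodeDecodeError as err:
    print("malformed UTF-8:", err.reason)
```

The label "Latin1" never changes the stored bytes in the first call; it only decides whether a translation step runs at all.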
I can see how this could be confusing.
If possible, it is best to use UTF8 to avoid this confusion.
- Phil
|
|
|
Posted on Tue May 15 19:45:16 2007
by pdi
in response to 5143
|
| Re: Warning: Malformed UTF-8 character(s) |
|
Phil,
I appreciate your patience with my dim wits :-) All is much clearer now.
"As long as ExifTool is not translating the text, it is totally irrelevant what character set is actually used since the bytes are passed through unchanged."
Perhaps you might include some similar note in FAQ #10, to make it clearer we are not limited only to cp1252.
I am writing IPTC data to a jpg which has no previous IPTC data, only XMP; so my guess is that ExifTool handles the case of no internal data the same way as if such data existed and were in the same character set as the external data.
Unfortunately, many IPTC tools cannot handle the notorious "ESC % G" sequence and fail to display utf-8 properly. I was very surprised to see the change in the default behaviour of ExifTool, but I am sure you had very sound reasons for it. It must be that the tide is turning :-)
Regards,
pdi
|
|
|
Posted on Tue May 15 20:43:00 2007
by exiftool
in response to 5146
|
| Re: Warning: Malformed UTF-8 character(s) |
|
I'm glad it makes a bit more sense now.
When writing information, ExifTool uses the value of CodedCharacterSet to
determine how to encode the text. If CodedCharacterSet is being written at
the same time as text, the new character set is used. If no CodedCharacterSet
exists and none is written, then Latin1 is assumed.
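As a rough illustration of the selection rule just described, the following Python sketch picks the encoding for the text being written. The function name `storage_encoding` and its parameters are hypothetical, and treating every marker other than "ESC % G" as Latin1 is a simplification based on the explanation given earlier in this thread.

```python
from typing import Optional

UTF8_MARKER = b"\x1b%G"  # the "ESC % G" sequence, which denotes UTF8


def storage_encoding(existing: Optional[bytes],
                     being_written: Optional[bytes]) -> str:
    """Hypothetical helper: choose the encoding for IPTC text being written.

    existing:      CodedCharacterSet already in the file, or None
    being_written: CodedCharacterSet written in the same operation, or None
    """
    # A CodedCharacterSet written together with the text takes precedence.
    marker = being_written if being_written is not None else existing
    if marker == UTF8_MARKER:
        return "utf-8"
    # No marker at all (or, in this simplified model, any other marker).
    return "latin-1"


print(storage_encoding(None, None))          # file with no IPTC, nothing written
print(storage_encoding(UTF8_MARKER, None))   # file already marked as UTF8
print(storage_encoding(None, UTF8_MARKER))   # marker written along with the text
```

For a jpg that has only XMP and no IPTC, both arguments are `None`, which is the Latin1 case.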
The special character handling in IPTC is a real mess. The way ExifTool
originally handled it (by never translating) was simplest, but it seems that
other applications most commonly assume Latin1 characters (contrary to
the actual IPTC specification) so ExifTool was displaying special characters
written by these applications incorrectly. This is the reason for the change.
If enough people have problems with this, I am open to changing it back
again.
It is a pity that not many applications support UTF8 in IPTC, because this
is the best solution. The original IPTC specification used ISO 2022, which
is a real can of worms and hence isn't well supported either, but UTF8
support was added as a revision to the IPTC specification (I believe),
and is a much better solution.
- Phil
|
|
|
Posted on Fri May 25 22:18:08 2007
by exiftool
in response to 5147
|
| Re: Warning: Malformed UTF-8 character(s) |
For reference, here
is the thread which prompted the change in handling of special
characters in IPTC.
- Phil
|
|