Thread

Posted on Wed May 17 13:57:07 2006 by xdim
Non UTF8 decoding XML
Hello!

Please help me: is it possible to have the tree created by XML::Simple stored in my national code page rather than in UTF-8?

The XML has the correct header: <?xml version='1.0' encoding='Windows-1251' standalone='yes'?>

It is parsed correctly by XML::Simple, but all the data ends up in UTF-8.

Is there any way to keep it in the original encoding, so that I don't have to convert it back to win1251 (for example)?

Thank you!
Direct Responses: 2320
Posted on Thu May 18 00:16:19 2006 by grantm in response to 2315
Re: Non UTF8 decoding XML

No, there is no way to do that, and even if there were, it would be a very bad idea.

See the Perl XML FAQ for help on converting from UTF-8 to a legacy encoding on output.
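
For readers who want a concrete picture of "converting from UTF-8 to a legacy encoding on output", here is a minimal sketch using Perl's core Encode module. It is not taken from the FAQ; the helper name to_cp1251 and the sample text are assumptions made for illustration only.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper (not part of XML::Simple): turn a Perl character
# string into Windows-1251 bytes, ready to be written to a file or socket.
sub to_cp1251 {
    my ($text) = @_;
    return encode('cp1251', $text);
}

# A six-letter Cyrillic word, written with escapes so this file stays ASCII.
my $text  = "\x{41F}\x{440}\x{438}\x{432}\x{435}\x{442}";
my $bytes = to_cp1251($text);

# Each Cyrillic letter becomes exactly one byte in Windows-1251.
printf "characters: %d, bytes: %d, hex: %s\n",
    length($text), length($bytes), unpack('H*', $bytes);

# Decoding the bytes again gives back the original character string.
print "round trip ok\n" if decode('cp1251', $bytes) eq $text;
```

The idea is to keep character strings inside the program and to call something like this only at the point where data leaves it.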

Direct Responses: 6312
Posted on Tue Oct 23 10:46:26 2007 by lschult2 in response to 2320
Re: Non UTF8 decoding XML
Why is this such a bad idea? I am trying to read and write in CP1252. Why should I have to deal with translating back and forth from UTF8? I'm curious what I'm missing here. I don't get what's so bad about this. I'm just now learning character sets, so I'm new to all this.
Direct Responses: 6316
Posted on Tue Oct 23 22:23:11 2007 by grantm in response to 6312
Re: Non UTF8 decoding XML

You're asking the wrong question. The right question to ask is why the original data you're trying to read isn't in Unicode already. And once you've fixed that by parsing the document why do you want to unfix it and convert it back to a legacy encoding? Is there some application that you're feeding the data to that doesn't understand Unicode?

ASCII, Latin-1, CP1252 and all the other legacy encodings each cater for a tiny subset of the characters supported by Unicode. This means that you can't combine data from source documents in different encodings, because a character that you read from one document might not be available in the encoding used by the other document. Whereas if you simply used Unicode you wouldn't need to worry.

Direct Responses: 6317
Posted on Wed Oct 24 06:47:54 2007 by lschult2 in response to 6316
Re: Non UTF8 decoding XML
To answer your question: the original data isn't Unicode because it isn't Unicode. It just isn't. The database and application I am working on are 7 years old, and decisions were made back then for whatever reasons they were made. I want to work in the legacy encoding for the same reason anyone wants legacy support: to support legacy code and databases.

It seems odd that I cannot use XML in Perl with a legacy character set. XML supports the legacy character set. So why doesn't XML::Simple support legacy encodings that XML itself supports? Honestly, I don't get it. Either it's some engineers on a high horse saying all apps should be rewritten for Unicode for the common good to keep the software gods happy, or there is a real practical reason. But XML::Simple is the only piece of software in my stack that is making it difficult to work with my legacy code, legacy database and legacy encodings. Perl works just fine with legacy encodings. XML works just fine. MySQL works just fine. HTML works just fine. Only XML::Simple doesn't.

I know the benefit of Unicode when combining source documents in different encodings. That is great for applications that need it. But my legacy application doesn't need to do this. It just needs CP1252 and that's all it needs. So is it the case that the XML::Simple authors demand I rewrite all my legacy code and database to support Unicode in order to use XML::Simple? Is that the only option? I would really rather not change all that legacy code and data.
Direct Responses: 6319
Posted on Wed Oct 24 11:19:50 2007 by grantm in response to 6317
Re: Non UTF8 decoding XML

Perl can quite happily deal with binary strings of bytes. If putting CP1252 data into binary strings makes your job easier then fine, do that. I'm certainly not suggesting that would be 'wrong'. But XML is not a binary format. XML is for strings of character data. Perl's native format for character data is UTF-8. If you want to use Perl's character functions like lc, uc, length, substr, tr, regular expressions, etc., then your data needs to be in UTF-8. ASCII data is a subset of UTF-8, so characters in the 0x00-0x7F range already are UTF-8. But for characters from 0x80 and beyond, your data must be UTF-8 for Perl's character functions to understand them. Other languages use different native encodings; Java, for instance, uses UTF-16.
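
The difference between bytes and characters can be made concrete with a small sketch, assuming the core Encode module and some invented sample data. It decodes CP1252 bytes once and then applies character functions to the decoded string; how uc and friends treat undecoded high bytes depends on the Perl version and enabled features, so that case is deliberately left out.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# CP1252 bytes for "caf", e-acute (0xE9) and the euro sign (0x80).
my $cp1252_bytes = "caf\xE9\x80";

# Decode once at the boundary; from here on we hold characters, not bytes.
my $chars = decode('cp1252', $cp1252_bytes);

# Character functions now see e-acute and the euro sign as single characters.
my $upper      = uc $chars;
my $utf8_bytes = encode('UTF-8', $chars);

printf "chars: %d, cp1252 bytes: %d, utf8 bytes: %d\n",
    length($chars), length($cp1252_bytes), length($utf8_bytes);
printf "last code point: U+%04X\n", ord(substr($chars, -1));
```

Note that byte 0x80 in CP1252 is not code point U+0080: it decodes to the euro sign, which is why treating legacy bytes as if they were characters goes wrong outside the ASCII range.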

Remember that an XML document which uses the CP1252 encoding can also contain characters which are not CP1252. For example if the document contained the numeric character entity &#256; (a capital A with a macron: Ā) then the Perl string returned from XML::Simple will contain that character. Similarly if the document used a DTD which defined named character entities (such as &alpha;) then those will also be translated into the appropriate characters.
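
To illustrate why such a character cannot simply be handed back as a CP1252 byte, here is a short sketch using the core Encode module. The helper name fits_in_cp1252 is hypothetical, and the sketch does not involve any XML parser; it only checks whether a decoded string could be written out in CP1252 at all.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: returns 1 if every character in $text has a CP1252
# byte, 0 otherwise. FB_CROAK makes encode() die on an unmappable character;
# LEAVE_SRC stops encode() from modifying its input.
sub fits_in_cp1252 {
    my ($text) = @_;
    my $ok = eval {
        encode('cp1252', $text, Encode::FB_CROAK() | Encode::LEAVE_SRC());
        1;
    };
    return $ok ? 1 : 0;
}

# e-acute exists in CP1252; A-with-macron (U+0100, i.e. &#256;) does not.
printf "cafe with e-acute fits: %d\n", fits_in_cp1252("caf\x{E9}");
printf "A with macron fits:     %d\n", fits_in_cp1252("\x{100}");
```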

You ask why XML::Simple doesn't support legacy encodings and yet it demonstrably does. When you read an XML file in a legacy encoding the parser module used by XML::Simple transparently converts the data from the specified legacy encoding into Perl's native character encoding. Note this is a function of all the parser modules (eg: SAX, XML::Parser, LibXML etc) and not actually a function of XML::Simple which is just a convenience layer over the parser modules. XML::Simple doesn't transparently support converting to another encoding when generating XML because I have never needed that functionality. You could probably implement it for the CP1252 case by overriding XML::Simple's escape_value method.
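
As a rough idea of what such an override might delegate to, here is a standalone sketch. It is an assumption-laden illustration, not XML::Simple's own code: the function name escape_for_cp1252 is hypothetical, the wiring into a subclass's escape_value method is not shown, and it relies only on the core Encode module. Characters that CP1252 cannot represent are written as numeric character references via Encode's FB_XMLCREF fallback.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: escape XML special characters, then encode to CP1252.
# Anything CP1252 cannot represent becomes a numeric character reference.
sub escape_for_cp1252 {
    my ($text) = @_;

    # Escape markup characters first, so the ampersands that Encode adds
    # for character references below are not escaped a second time.
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;

    return encode('cp1252', $text, Encode::FB_XMLCREF());
}

my $bytes = escape_for_cp1252("A\x{100} & caf\x{E9}");

# Print as hex so the terminal's own encoding does not matter.
print unpack('H*', $bytes), "\n";
```

The returned value is a byte string, so a document assembled from such values would also need an encoding declaration that matches CP1252.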

Finally, you make the silly assertion that as the author of XML::Simple I am demanding you rewrite your legacy code to use unicode in order to use XML::Simple. I am not demanding anything. I am providing a tool which you may or may not find useful. If it is not useful to you then of course you should use a different tool.

Direct Responses: 6320
Posted on Wed Oct 24 17:17:16 2007 by lschult2 in response to 6319
Re: Non UTF8 decoding XML
I think what I will do is take the output of XMLout and convert all the Unicode to latin1 using Unicode::String::latin1(). Then I won't need to change any other code in my application. I'll look at XML::Simple's escape_value method as an alternative. I apologize for my silly assertion. I did not mean to offend. You have been very helpful, thank you.