Thread

Posted on Sat Jun 30 22:17:18 2007 by jeunice
Coalescing adjacent #PCDATA elements?
In processing Open Document Text (ODT) files, I remove some elements, so that the (much simplified) structure
text:p #PCDATA text:s #PCDATA text:s #PCDATA
becomes
text:p #PCDATA #PCDATA #PCDATA
Unfortunately, subs_text doesn't seem to work across #PCDATA elements. What is the best way to coalesce them into a single element, within which subs_text should work? I can certainly concatenate the elements' pcdata() contents, but I'm wary of causing unintended consequences (say with character encodings, a topic on which I'm a little fuzzy), or missing some more elegant and robust solution already built into XML::Twig. Thoughts?
Posted on Sun Jul 1 00:02:13 2007 by mirod in response to 5591
Re: Coalescing adjacent #PCDATA elements?
Is what you are looking for the normalize method that's included in the current development version of the module (that can be found at http://xmltwig.com/xmltwig)?
Posted on Sun Jul 1 22:44:41 2007 by jeunice in response to 5592
Re: Coalescing adjacent #PCDATA elements?
mirod is, as usual, way ahead of me! normalize() does the trick.
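
For anyone finding this thread later, here is a minimal sketch of what that looks like. It assumes an XML::Twig version that provides normalize() (the 3.30 development version or later); the toy document and its element names (p, s) are made up for illustration and stand in for text:p and text:s.

```perl
use strict;
use warnings;
use XML::Twig;

# Toy stand-in for the ODT structure: text separated by empty <s/> elements.
my $twig = XML::Twig->new;
$twig->parse('<p>foo<s/>bar<s/>baz</p>');

my $p = $twig->root;

# Remove the <s/> elements, leaving only text children under <p>.
$_->delete for $p->children('s');

# Merge consecutive #PCDATA children into a single text node.
$p->normalize;

my @kids = $p->children;
print scalar(@kids), " child(ren): ", $p->text, "\n";

# With one text node, a pattern spanning the old boundaries can match.
$p->subs_text(qr/obarb/, '-');
print $p->text, "\n";
```

The pattern in the last step deliberately straddles what used to be three separate text nodes, which is the case that failed for me before normalizing.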

Unfortunately, even in the 3.30 development version, normalize() must be explicitly called. I can see a need for that in some situations, but at least in my document processing work, I generally want normalized trees. So rather than calling normalize() whenever I am unsure whether the tree has become un-normalized, I think I'll wrap the few editing methods that lead to un-normalized trees (cut and erase/unwrap, primarily) with a
    my $parent = $elt->parent;
    $elt->cut;                # or unwrap, or erase
    $parent->normalize();

treatment. I might even override XML::Twig::Elt methods with such a front-end to do this auto-magically. That would seem to deterministically keep the tree normalized. Does that seem reasonable? Any other edit methods I should similarly worry about?
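
Roughly what I have in mind, as a sketch only. cut_and_normalize is a hypothetical helper of my own, not part of XML::Twig, and it assumes a version that has normalize(); the sample document is invented.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical helper (not an XML::Twig method): cut an element, then
# normalize its former parent so adjacent text nodes are merged.
sub cut_and_normalize {
    my ($elt) = @_;
    my $parent = $elt->parent;        # remember the parent before cutting
    $elt->cut;
    $parent->normalize if $parent;    # the root element has no parent
    return $elt;
}

my $twig = XML::Twig->new;
$twig->parse('<p>one<s/>two<s/>three</p>');

# children() returns a list built up front, so cutting while looping is safe.
cut_and_normalize($_) for $twig->root->children('s');

my @kids = $twig->root->children;
print scalar(@kids), " child(ren): ", $twig->root->text, "\n";
```

Using a plain function rather than overriding XML::Twig::Elt::cut keeps the library's own behavior untouched, at the cost of having to remember to call the wrapper.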

Finally, mirod, feel free to add an auto-normalization option. ;-)
Posted on Wed Jul 4 13:19:56 2007 by mirod in response to 5601
Re: Coalescing adjacent #PCDATA elements?

Sorry for the delay, I missed your answer.

Indeed, it makes sense to try to keep the trees normalized. It should also be more efficient than doing it "after the fact". I have to go through the (long!) list of methods and see which ones would need to do this check. Beyond cut and unwrap (erase is just an alias for unwrap) and the methods that use them, I can think of set_tag (if you set the tag to '#PCDATA') and maybe set_content (I have to check). I'll let you know when I have an updated version.

Thanks for the idea
