Bug 67485

Summary: FILESAVE: Processing Instructions Stripped from XML
Product: LibreOffice
Reporter: Russell Harper <russell.s.harper>
Component: Writer
Assignee: Not Assigned <libreoffice-bugs>
Status: RESOLVED NOTABUG
Severity: normal
CC: dtardon, russell.s.harper
Priority: medium
Version: 4.1.0.4 release   
Hardware: Other   
OS: Linux (All)   
Whiteboard: BSA
Crash report or crash signature:
Regression By:

Description Russell Harper 2013-07-29 15:17:58 UTC
Problem description:

LibreOffice has limited automation capabilities. To get around that limitation, manipulating the internal XML is an option.

This is useful, for example, for complex search and replace with custom formatting, filling tables with a variable number of rows, or customized removal of sections. I am currently using processing instructions to delimit complex items like table rows, e.g. <?dec BEGIN SOMETHING?>LibreOffice XML here<?dec END SOMETHING?>.

Alternatives like inserting comments don't work, because the insertion points aren't predictable.

According to http://www.w3.org/TR/2008/REC-xml-20081126/#sec-pi, "processing instructions MUST be passed through to the application."

Steps to reproduce:
1. Extract content.xml from ODT
2. Add some custom processing instructions directly (e.g. <?abc LIKE THIS?>) into the XML
3. Update content.xml
4. Make changes to the document through the regular interface and save it
5. Extract content.xml again
6. The custom processing instructions have been removed
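
Steps 1 to 3 and the check in step 6 can be scripted. The following is only a minimal sketch, assuming Python 3 and its standard `zipfile` module; the helper names `inject_pi` and `find_pis` and the `marker` argument are hypothetical and are not part of LibreOffice. It rewrites the whole archive, because `zipfile` cannot replace a member in place, and keeps `mimetype` as the first, uncompressed entry.

```python
import io
import re
import zipfile

# Matches processing instructions but skips the leading XML declaration.
PI_PATTERN = re.compile(r"<\?(?!xml[\s?])(.*?)\?>", re.DOTALL)


def inject_pi(odt_bytes, marker, pi):
    """Return a copy of the ODT with `pi` inserted before the first `marker` in content.xml."""
    src = zipfile.ZipFile(io.BytesIO(odt_bytes))
    out = io.BytesIO()
    with src, zipfile.ZipFile(out, "w") as dst:
        # infolist() keeps the original member order, so mimetype stays first.
        for info in src.infolist():
            data = src.read(info.filename)
            if info.filename == "content.xml":
                text = data.decode("utf-8").replace(marker, pi + marker, 1)
                data = text.encode("utf-8")
            # The mimetype entry is stored uncompressed; everything else is deflated.
            if info.filename == "mimetype":
                method = zipfile.ZIP_STORED
            else:
                method = zipfile.ZIP_DEFLATED
            dst.writestr(info.filename, data, compress_type=method)
    return out.getvalue()


def find_pis(odt_bytes):
    """Return the body of every processing instruction found in content.xml."""
    with zipfile.ZipFile(io.BytesIO(odt_bytes)) as odt:
        text = odt.read("content.xml").decode("utf-8")
    return PI_PATTERN.findall(text)
```

Running `find_pis` on the file before step 4 and again after step 5 shows whether the instructions survived the save.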

Current behavior:
All processing instructions in content.xml are stripped on save, except the initial XML declaration.

Expected behavior:
Don't remove processing instructions!
             
Operating System: Ubuntu
Version: 4.1.0.4 release
Comment 1 Russell Harper 2013-07-29 17:12:14 UTC
(In reply to comment #0)
> Steps to reproduce:
> ...
> 3. Update content.xml

Meant:

3. Update content.xml in the ODT
Comment 2 David Tardon 2013-07-30 06:10:21 UTC
> According to http://www.w3.org/TR/2008/REC-xml-20081126/#sec-pi, "processing
> instructions MUST be passed through to the application."

I understand this as "a conforming XML processor must pass PIs to the application". The application is still free to ignore them.
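
The distinction can be illustrated outside LibreOffice. This is a sketch only, using Python's standard `xml.sax` module as a stand-in for "the XML processor"; the two handler classes are hypothetical examples of "the application". The parser reports every PI through the `processingInstruction` callback, and an application that does not override that callback simply never sees them.

```python
import xml.sax

DOC = b"<root><?dec BEGIN SOMETHING?><row/><?dec END SOMETHING?></root>"


class CollectsPIs(xml.sax.ContentHandler):
    """An application that chooses to act on processing instructions."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def processingInstruction(self, target, data):
        self.seen.append((target, data))


class IgnoresPIs(xml.sax.ContentHandler):
    """An application that only cares about elements; PIs fall through to the no-op default."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def startElement(self, name, attrs):
        self.elements.append(name)


collector = CollectsPIs()
xml.sax.parseString(DOC, collector)

ignorer = IgnoresPIs()
xml.sax.parseString(DOC, ignorer)
```

Both runs parse the same document without error; only the handler decides whether the PIs have any effect.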
Comment 3 Russell Harper 2013-07-30 09:44:14 UTC
Of note, per http://www.w3.org/TR/2008/REC-xml-20081126/#sec-starttags (see production [43]), PIs are defined as legal content.

While the application need not interpret PIs, they are defined as content, so it would seem that they should be preserved when the document is resaved?
Comment 4 David Tardon 2013-07-31 04:56:43 UTC
(In reply to comment #3)
> Of note, http://www.w3.org/TR/2008/REC-xml-20081126/#sec-starttags (see
> production [43]) PIs are defined as legal content.

This is a description of structure. PIs are element 'content' as they can appear inside an element. (Note that comments are 'content' too. And so are CDATA sections. I do not think you will argue that CDATA sections must be preserved exactly as they are.)

It has nothing to do with semantics.

> 
> While the application need not interpret PIs, they are defined as content,
> so it would seem that they should be preserved when the document is resaved?

No, it would not seem (see above). Anyway, it looks like the authors of XSLT agree with my interpretation, because PIs are discarded by the built-in rules (therefore, I guess 99.9% of existing stylesheets discard them).
Comment 5 Russell Harper 2013-07-31 16:44:46 UTC
Behaviour is as designed - CDATA sections, comments, & PIs are stripped from source XML prior to saving.
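
The same kind of loss shows up in other load-then-save pipelines. As an illustration only (this is Python's `xml.etree.ElementTree`, not LibreOffice's import/export code), a default parse and re-serialize drops comments and PIs and turns a CDATA section into ordinary escaped text. Keeping comments and PIs requires opting in explicitly, here through `TreeBuilder(insert_comments=True, insert_pis=True)`, which assumes Python 3.8 or later. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

SOURCE = (
    "<root><!-- note --><?dec BEGIN SOMETHING?>"
    "<row><![CDATA[a < b]]></row>"
    "<?dec END SOMETHING?></root>"
)


def roundtrip_default(text):
    """Parse and re-serialize with the default tree builder."""
    return ET.tostring(ET.fromstring(text), encoding="unicode")


def roundtrip_keeping_pis(text):
    """Parse with a tree builder that keeps comments and PIs in the tree."""
    builder = ET.TreeBuilder(insert_comments=True, insert_pis=True)
    parser = ET.XMLParser(target=builder)
    return ET.tostring(ET.fromstring(text, parser=parser), encoding="unicode")


stripped = roundtrip_default(SOURCE)
preserved = roundtrip_keeping_pis(SOURCE)
```

In both results the CDATA wrapper is gone and only its character data remains, which matches the point made in comment 4 that CDATA sections are not preserved exactly either.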