Bug 76106 - File Corruption: RT gets corrupt since the target for hyperlink is exported incorrectly.
Status: RESOLVED NOTABUG
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer
Version (earliest affected): unspecified
Hardware: All All
Importance: low normal
Assignee: Not Assigned
Duplicates: 79880
Reported: 2014-03-13 09:19 UTC by Rajashri
Modified: 2014-07-03 03:19 UTC (History)


Attachments
Optimized file. The exact area under '2094037651.docx' which highlights the corruption. (89.19 KB, application/vnd.openxmlformats-officedocument.wordprocessingml.document)
2014-03-13 09:19 UTC, Rajashri
Oroginal_file (2.71 MB, application/vnd.openxmlformats-officedocument.wordprocessingml.document)
2014-03-14 07:12 UTC, Rajashri

Description Rajashri 2014-03-13 09:19:22 UTC
Created attachment 95699
Optimized file. The exact area under '2094037651.docx' which highlights the corruption.

The file becomes corrupt on round trip (RT) because the hyperlink is exported incorrectly.
The target of the relationship with Id 2 in document.xml.rels is incorrect.
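
For anyone wanting to inspect the exported targets, here is a minimal Python sketch that lists the hyperlink relationships stored in a .docx package. It assumes the main document part is word/document.xml, so its relationships live in word/_rels/document.xml.rels; the function name `hyperlink_targets` is made up for illustration and is not part of LibreOffice.

```python
import zipfile
import xml.etree.ElementTree as ET

# Assumed location of the relationships part for word/document.xml.
RELS_PATH = "word/_rels/document.xml.rels"
# Namespace of the OPC relationships markup.
PKG_NS = "{http://schemas.openxmlformats.org/package/2006/relationships}"
# Relationship type used for hyperlinks.
HYPERLINK_TYPE = (
    "http://schemas.openxmlformats.org/officeDocument/2006/"
    "relationships/hyperlink"
)


def hyperlink_targets(docx):
    """Return (Id, Target) pairs for every hyperlink relationship.

    `docx` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(docx) as zf:
        root = ET.fromstring(zf.read(RELS_PATH))
    return [
        (rel.get("Id"), rel.get("Target"))
        for rel in root.iter(PKG_NS + "Relationship")
        if rel.get("Type") == HYPERLINK_TYPE
    ]
```

Running it on the file before and after a round trip and comparing the two lists should show which target changed.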
Comment 1 Chris Sherlock 2014-03-14 06:01:30 UTC
The issue here is that the URI contains a greater-than sign. In RFC 3986, Uniform Resource Identifier (URI): Generic Syntax, this character is neither reserved nor unreserved, and thus must be percent-encoded, which is what we are doing (correctly). However, it appears that Microsoft do not do this.
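
As a rough illustration of the percent-encoding described above, here is a Python sketch using the standard library's `urllib.parse.quote`. The URL is a made-up example, not the one from the attached file, and this is not the code path LibreOffice uses.

```python
from urllib.parse import quote

# Hypothetical hyperlink target containing a greater-than sign.
raw_target = "http://example.com/a>b"

# Leave ':' and '/' alone so the scheme and path separators survive;
# '>' is not in the safe set, so it is percent-encoded.
encoded_target = quote(raw_target, safe=":/")

print(encoded_target)  # http://example.com/a%3Eb
```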

Now I looked into this in the OOXML spec. Firstly, Microsoft file systems reserve the greater than and less than characters - these are not allowed in filenames. However, in the OOXML spec (ECMA-376 Office Open XML File Formats - Open Packaging Conventions, "10.2.1 Mapping Part Data") it reads:

"ZIP item names are case-sensitive ASCII strings. Package implementers shall create ZIP item names that conform to ZIP archive file name grammar."

I checked the normative references, and this points to:

http://www.pkware.com/documents/APPNOTE/APPNOTE_6.2.0.txt

which merely says:

      file name: (Variable)

          The name of the file, with optional relative path.
          The path stored should not contain a drive or
          device letter, or a leading slash.  All slashes
          should be forward slashes '/' as opposed to
          backwards slashes '\' for compatibility with Amiga
          and Unix file systems etc.  If input came from standard
          input, there is no file name field.  If encrypting
          the central directory and general purpose bit flag 13 is set 
          indicating masking, the file name stored in the Local Header 
          will not be the actual file name.  A masking value consisting 
          of a unique hexadecimal value will be stored.  This value will 
          be sequentially incremented for each file in the archive. See
          the section on the Strong Encryption Specification for details 
          on retrieving the encrypted file name. 

Nothing about > or <. 

However, it later says in 10.2.5 ZIP Package Limitations:

"Package implementers should restrict part naming to accommodate file system limitations when naming parts to be stored as ZIP items."

This is very... ambiguous. Which file system limitations? They seem to be referring to Microsoft file system limitations, which would explain why > and < were never an issue: they are illegal characters in NTFS and FAT filenames.

I would suggest that if we get such a file, we should either:

1. Warn about the illegal character, or
2. Rename the zip file to replace the illegal filesystem character with its URL-encoded equivalent.

If 2 is implemented, the illegal characters to encode are:

\ / : * ? " < > |
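
Option 2 could be sketched roughly as follows in Python. This is only an illustration under stated assumptions: the helper name `encode_illegal_chars` is hypothetical, and because "/" also serves as the path separator inside ZIP item names, the sketch is meant to be applied to a single path segment rather than to a whole item name.

```python
# Characters from the list above that NTFS and FAT do not allow in filenames.
ILLEGAL_FS_CHARS = '\\/:*?"<>|'


def encode_illegal_chars(segment):
    """Percent-encode each illegal filesystem character in one path segment."""
    return "".join(
        "%{:02X}".format(ord(ch)) if ch in ILLEGAL_FS_CHARS else ch
        for ch in segment
    )


print(encode_illegal_chars("a>b"))  # a%3Eb
```

A real implementation would also need to decide what to do with names that already contain "%", which this sketch leaves untouched.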
Comment 2 Chris Sherlock 2014-03-14 06:06:41 UTC
This is an edge case, btw. I'd not consider it to be a major issue!
Comment 3 Rajashri 2014-03-14 07:12:26 UTC
Created attachment 95775
Oroginal_file
Comment 4 Chris Sherlock 2014-03-14 07:19:28 UTC
I didn't read that correctly. There is no greater than symbol. This has a semi-colon in it. Sigh. My bad.
Comment 5 Chris Sherlock 2014-03-14 07:21:44 UTC
Semi-colon, of course, has the same issue. It should be escaped. I don't see that it's an issue.
Comment 6 Chris Sherlock 2014-03-14 07:31:43 UTC
I do not believe this is a problem. The issue here is not with us; in fact, Microsoft is not honouring the URI RFC. If a character is neither reserved nor unreserved, they should be percent-encoding it, full stop. I'm very surprised they have not done so for semi-colons.

But that's sort of their problem, not ours. A browser should automatically decode the URI when it sees the percent-encoding.

Not a bug!
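
To illustrate the decoding step a consumer is expected to perform, here is a small Python sketch with `urllib.parse.unquote`. The URL is a made-up example and stands in for whatever target the exported file actually carries.

```python
from urllib.parse import unquote

# Hypothetical exported target with a percent-encoded semi-colon.
exported_target = "http://example.com/a%3Bb"

# Decoding restores the original character.
decoded_target = unquote(exported_target)

print(decoded_target)  # http://example.com/a;b
```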
Comment 7 Yogesh Bharate 2014-07-03 03:19:34 UTC
*** Bug 79880 has been marked as a duplicate of this bug. ***