Bug 55143 - FILESAVE: DOCX - LibreOffice corrupting documents having html entities in links
Summary: FILESAVE: DOCX - LibreOffice corrupting documents having html entities in links
Status: RESOLVED WORKSFORME
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer
Version: 3.5.0 release (earliest affected)
Hardware: All All
Importance: high blocker
Assignee: Not Assigned
URL:
Whiteboard: BSA, target:4.0.0
Keywords: filter:docx
Depends on:
Blocks: DOCX-SAXParse DOCX Hyperlink
Reported: 2012-09-20 12:53 UTC by Doug
Modified: 2022-07-20 05:14 UTC
CC List: 4 users

See Also:
Crash report or crash signature:


Attachments
This is a re-enactment of the document contents that resulted in corruption of .docx. (3.90 KB, application/vnd.openxmlformats-officedocument.wordprocessingml.document)
2012-09-20 13:12 UTC, Doug
This is the same file, but saved by MS Word into .docx format (10.76 KB, application/vnd.openxmlformats-officedocument.wordprocessingml.document)
2012-09-20 19:37 UTC, Doug
steps for reproduce (1.76 MB, video/ogg)
2013-11-19 22:20 UTC, Boris
Faulty document.xml with orphaned w:hyperlink tags (25.51 KB, text/xml)
2014-03-07 21:42 UTC, Elmo
Repackaged with orphaned w:hyperlink tags removed (see above), works. (11.24 KB, application/vnd.openxmlformats-officedocument.wordprocessingml.document)
2014-03-07 21:43 UTC, Elmo
Problematic DOCX fixed and saved as ODT (11.81 KB, application/vnd.oasis.opendocument.text)
2022-07-20 05:14 UTC, Mike Kaganski

Description Doug 2012-09-20 12:53:37 UTC
Problem description: 
Occasionally corrupted documents, missing text


Steps to reproduce:
1. Edit document that includes html tags or bookmarks
2. Save
3. Return to document

   In one case, I added some hyperlinks to a .docx file.  One of the hyperlinks was malformed by LibreOffice (it was correct when I entered it), and LibreOffice did not tell me there was a problem.  I saved, closed the file, left it alone, and came back later.  When I tried to move it elsewhere (without opening it), OpenSuse told me that there was malformed html in the file and wouldn't do it.  When I opened it, the document was cut off in the middle of one of the new html tags.  LibreOffice returns no errors, it just cuts the document off.  It actually eliminated all of the tag except the final bookmark part of the html page, #Bookmark.  MS Word refuses to open the file and returns the message "The name in the end tag of the element must match the element type in the start tag" "Location: Part: /word/document.xml, Line: 2, Column:7625".  It's a small file now, so it looks like LibreOffice just chopped off the second half of it, which I will say is very frustrating.

   In the other case, in a complex .odt document using bookmarks, I pasted a block of encrypted text into the document and when I returned to the document after a save/reopen, the text was cut off halfway through (not good).  The rest of the document afterward continued after a page break.  So I did it again, pasted in a new block of encrypted text under the last and deleted the partial block of text and lo, the missing text reappeared after deleting the first half!  It was in the document but LibreOffice was failing to display it (and I had tried to manually move the cursor through it earlier to see if it was hidden but no luck).  Very bad.

Current behavior:

In complex documents, I see now that I can never tell whether LibreOffice saved the document correctly.  When I reopen it, there is a decent chance that something is corrupted or missing.

Expected behavior:

When I save a document, and the document does not return errors during the save, all of the features and text should be saved and not corrupted.  Even if they are not later available in Word (which I understand has some compatibility issues) the document should be readable in LibreOffice.

Platform (if different from the browser): 
OpenSuse 12.2
              
Browser: Mozilla/5.0 (Windows NT 5.1; rv:15.0) Gecko/20100101 Firefox/15.0.1
Comment 1 Doug 2012-09-20 13:12:14 UTC
Created attachment 67436
This is a re-enactment of the document contents that resulted in corruption of .docx.

This file was created on Windows/LibreOffice 3.6.0.4.  The original text was:


This is the beginning of the document.  Rule 8.4 of the Rules of Professional Conduct.  This is the rest of the document.


The "Rule 8.4 of the Rules of Professional Conduct" has the following link:

http://www.mass.gov/obcbbo/rpc8.htm#Rule 8.4

You can see that when saving to Office Open XML (.docx), LibreOffice just breaks the file without warning, rendering it unusable and corrupting the subsequent text.  The same sequence saved correctly in .doc and .odt.
Comment 2 Doug 2012-09-20 13:14:50 UTC
Here is the link with proper html formatting (the space in the bookmark is percent-encoded):

http://www.mass.gov/obcbbo/rpc8.htm#Rule%208.4
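
For reference, the difference between the two forms of the link is only the percent-encoding of the space in the bookmark part. A minimal Python sketch of that encoding, assuming the standard urllib.parse module; the function name encode_fragment is made up for illustration and is not part of LibreOffice:

```python
from urllib.parse import quote, unquote, urlsplit, urlunsplit


def encode_fragment(url: str) -> str:
    """Return url with the part after '#' percent-encoded."""
    parts = urlsplit(url)
    # Decode first so that an already encoded fragment is not encoded twice.
    fragment = quote(unquote(parts.fragment), safe="")
    return urlunsplit(parts._replace(fragment=fragment))


print(encode_fragment("http://www.mass.gov/obcbbo/rpc8.htm#Rule 8.4"))
```

Only the fragment is touched here; the rest of the URL is passed through unchanged.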
Comment 3 Doug 2012-09-20 17:47:47 UTC
The bug is also present in 3.6.1.2.
Comment 4 Doug 2012-09-20 19:33:31 UTC
The second half of the file is not missing, just malformed.  It was accessible by changing the extension to .zip and opening document.xml in a text editor.
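
The same inspection should also be possible without renaming the file. A minimal Python sketch, assuming only the standard zipfile module and the usual word/document.xml part name; the function name read_document_xml is hypothetical:

```python
import zipfile


def read_document_xml(docx) -> str:
    """Return the raw text of word/document.xml.

    docx may be a path to a .docx file or a binary file object.
    """
    with zipfile.ZipFile(docx) as archive:
        return archive.read("word/document.xml").decode("utf-8")
```

Usage would be something like `text = read_document_xml("broken.docx")`, where broken.docx stands for the affected file. Because the text is read as a plain string rather than parsed, this works even when the XML is malformed.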

The problem here is that LibreOffice saves html tags in an inartful way in .docx files.  LibreOffice tries to do everything in the document.xml file.  In the example I posted, the link was represented in the document as (forgive me if I mis-crop leading or trailing instructions):

HYPERLINK &quot;http://www.mass.gov/obcbbo/rpc8.htm&quot; \l &quot;Rule 8.4&quot;</w:instrText></w:r><w:r><w:fldChar w:fldCharType="separate"/></w:r><w:r><w:rPr><w:rStyle w:val="style15"/></w:rPr><w:t>Rule 8.4 of the Rules of Professional Conduct</w:t></w:r><w:r><w:fldChar w:fldCharType="end"/></w:r></w:hyperlink>

Something in that could not be parsed either by Word or LibreOffice on reopen.

Word itself does not try to do this in the document.xml file.  Instead, it inserts a hyperlink element that refers, by ID, to an entry in a different file in the compressed .docx structure:

Word /document.xml :

><w:hyperlink r:id="rId4" w:anchor="Rule 8.4" w:history="1"><w:r><w:rPr><w:rStyle w:val="Hyperlink"/></w:rPr><w:t>Rule 8.4 of the Rules of Professional Conduct</w:t></w:r></w:hyperlink>

Word /_rels/document.xml.rels :

Target="fontTable.xml"/><Relationship Id="rId4" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/hyperlink" Target="http://www.mass.gov/obcbbo/rpc8.htm" TargetMode="External"/></Relationships>

LibreOffice does not attempt to use the "rels" folder/functionality in the .docx structure in connection with the hyperlinks.  As a result, using html links and bookmarks in LibreOffice with .docx files is a problem waiting to happen.
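
To see which external links a Word-saved file stores this way, the rels part can be listed. A minimal Python sketch, assuming the standard xml.etree.ElementTree module and the package relationships namespace used by .docx files; the function name hyperlink_targets is made up for illustration:

```python
import xml.etree.ElementTree as ET

# Namespace of the elements inside *.rels files.
RELS_NS = "{http://schemas.openxmlformats.org/package/2006/relationships}"
# Relationship type used for hyperlinks, as seen in the Word output above.
HYPERLINK_TYPE = (
    "http://schemas.openxmlformats.org/officeDocument/2006/relationships/hyperlink"
)


def hyperlink_targets(rels_xml: str) -> dict[str, str]:
    """Map relationship IDs (e.g. 'rId4') to their hyperlink targets."""
    root = ET.fromstring(rels_xml)
    return {
        rel.get("Id"): rel.get("Target")
        for rel in root.iter(RELS_NS + "Relationship")
        if rel.get("Type") == HYPERLINK_TYPE
    }
```

The input would be the text of word/_rels/document.xml.rels; each returned ID corresponds to an r:id attribute on a w:hyperlink element in document.xml.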
Comment 5 Doug 2012-09-20 19:37:22 UTC
Created attachment 67463
This is the same file, but saved by MS Word into .docx format

Compare the treatment of the html tags in this file with the malformed file above.  Word put the html tag into a separate "rels" file inside the .docx structure, which avoids whatever problem LibreOffice encountered by putting the entire tag directly into the document.xml.
Comment 6 David Juran 2013-10-21 14:30:36 UTC
It happened to me on libreoffice-4.1.2.3-3.fc19
Comment 7 Boris 2013-11-19 22:20:21 UTC
Created attachment 89499
steps for reproduce
Comment 8 Boris 2013-11-19 22:21:11 UTC
I think I have the same issue. My steps are:
1. create several lines of text
2. in one of the lines add a hyperlink, e.g. “www.link.com ” (followed by a space so that the text becomes a hyperlink)
3. save the document with the .docx extension
4. open my document
Result: all lines after the hyperlink disappear.
Reproduced: always 
Video with steps attached.

LibreOffice Writer Version: 4.1.2.3 Build ID: 410m0(Build:3)
OS: Ubuntu 13.10
Comment 9 Bruce Kirkpatrick 2014-03-04 00:40:13 UTC
I just encountered this serious bug when re-opening a docx document I was working on. All text after the hyperlinks was mysteriously deleted. I noticed that the file size was still very large despite most of the text missing, so I tried to figure out a way to decode the docx format and recover the data inside the file. I learned that docx is just a renamed zip archive in openxml format. After renaming the .docx to .zip, I was able to open the word/document.xml file and see that all the missing text was still there as a plain xml document. I deleted the xml tags related to the hyperlinks, then I zipped the files again and renamed the .zip to .docx. It worked! I was able to restore the hours of lost work. Hope this helps someone fix the bug and prevent others from losing their work! Perhaps the software is writing invalid OpenXML syntax, or it fails to read it back correctly.
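
The unzip, edit, re-zip steps above can be sketched in Python. This is a minimal sketch under the assumption that the repaired XML has already been prepared by hand; it uses only the standard zipfile module, and the function name replace_document_xml is hypothetical:

```python
import zipfile


def replace_document_xml(src, dst, new_xml: str) -> None:
    """Copy the .docx archive src to dst, swapping in new_xml as word/document.xml.

    src and dst may be paths or binary file objects. All other parts
    (styles, rels, and so on) are copied over unchanged.
    """
    with zipfile.ZipFile(src) as old, zipfile.ZipFile(
        dst, "w", zipfile.ZIP_DEFLATED
    ) as new:
        for item in old.infolist():
            data = old.read(item.filename)
            if item.filename == "word/document.xml":
                data = new_xml.encode("utf-8")
            new.writestr(item, data)
```

Writing to a new file rather than modifying the original in place keeps the damaged document available in case the repair goes wrong.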
Comment 10 Elmo 2014-03-07 21:38:16 UTC
Just fixed a similarly corrupted .docx file saved by LibreOffice (Version: 4.1.3.2 Build ID: 410m0(Build:2)).

I'm not sure what happened during editing (I wasn't there at the time), but somehow the hyperlinking added extraneous <w:hyperlink ...> tags just after a </w:hyperlink>, without an actual URL and without the corresponding closing tag; see the quote:
"""
<w:hyperlink r:id="rId4"><w:r><w:rPr><w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman"/><w:sz w:val="22"/><w:szCs w:val="22"/></w:rPr><w:t xml:space="preserve"> Käyttäjälähtöiset innovaatiot toimivat arjessa</w:t></w:r></w:p><w:p><w:pPr><w:pStyle w:val="style31"/><w:tabs><w:tab w:leader="none" w:pos="0" w:val="left"/></w:tabs><w:ind w:hanging="0" w:left="0" w:right="0"/><w:rPr><w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman"/><w:sz w:val="22"/><w:szCs w:val="22"/></w:rPr></w:pPr><w:hyperlink r:id="rId5">
"""

Removing the orphan tags seems to make everything visible again in LibreOffice.
I'll attach the fixed file and the original faulty document.xml, for diffing.
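
Orphaned tags like these can be located with a plain text scan, since an XML parser will simply reject the file. A rough Python sketch, assuming that a w:hyperlink must open and close within a single w:p paragraph; the function name and the problem labels are made up for illustration, and the regular expression is a heuristic rather than a real XML tokenizer (it would be confused by a '>' inside an attribute value):

```python
import re

# Matches opening, closing and self-closing <w:hyperlink> and <w:p> tags.
# The \b keeps <w:p> from also matching <w:pPr>.
TAG = re.compile(r"<(/?)w:(hyperlink|p)\b[^>]*?(/?)>")


def find_hyperlink_problems(xml: str) -> list[tuple[int, str]]:
    """Return (character offset, description) for each unbalanced w:hyperlink tag."""
    problems = []
    open_at = None  # offset of the currently open <w:hyperlink>, if any
    for match in TAG.finditer(xml):
        closing, name, self_closing = match.groups()
        if self_closing:
            continue
        if name == "hyperlink" and closing:
            if open_at is None:
                problems.append((match.start(), "closing tag without opening tag"))
            open_at = None
        elif name == "hyperlink":
            if open_at is not None:
                problems.append((open_at, "opening tag never closed"))
            open_at = match.start()
        elif closing and open_at is not None:
            # </w:p> reached while a hyperlink is still open.
            problems.append((open_at, "opening tag never closed"))
            open_at = None
    if open_at is not None:
        problems.append((open_at, "opening tag never closed"))
    return problems
```

The offsets point into the raw document.xml text, so they can be used to jump to the suspect tag in an editor before removing it by hand.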
Comment 11 Elmo 2014-03-07 21:42:30 UTC
Created attachment 95316
Faulty document.xml with orphaned w:hyperlink tags
Comment 12 Elmo 2014-03-07 21:43:43 UTC
Created attachment 95317
Repackaged with orphaned w:hyperlink tags removed (see above), works.
Comment 13 robp 2014-03-14 18:47:44 UTC
(In reply to comment #11)
> Created attachment 95316
> Faulty document.xml with orphaned w:hyperlink tags

I've had this problem only for the last few months. I tried the solution as you have it and found that it worked. That is an amazing piece of detective work; I knew about the structure of docx and the existence of document.xml, but I don't think I would ever have been able to figure out what the issue was. Fantastic work; well done.
Comment 14 Alefa 2014-12-13 19:10:41 UTC
This bug is still there in LibreOffice 4.2.7.2. It was a huge shock to find that all my text had vanished. The workaround posted by Bruce Kirkpatrick and Elmo worked (thanks a lot for that!), but most non-technical users would not be able to follow the steps required to recover their data. Any chance that this severe bug will be fixed soon? If not, could LibreOffice at least issue a warning when the user adds hyperlinks to a docx file?
Comment 15 Doug 2015-03-25 02:56:12 UTC
Tested again in LO Version: 4.4.1.2 Build ID: 40m0(Build:2) (OpenSuse 13.2) with html links in original report, now it WORKSFORME.  I do not know the commit, but I'll say FIXED for now.
Comment 16 Andy Pillip 2015-04-14 15:23:41 UTC
I'm seeing the same symptoms, can someone help me verify if it's the same cause?

xmlstarlet val document.xml says it's well-formed, which would not be the case if there were orphan tags.
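
A similar well-formedness check can be sketched in Python as a cross-check. This is a minimal sketch assuming the standard xml.etree.ElementTree parser; the function name well_formedness_error is hypothetical:

```python
import xml.etree.ElementTree as ET


def well_formedness_error(xml_text: str):
    """Return (line, column) of the first parse error, or None if well-formed."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        return err.position
    return None
```

A result of None agrees with xmlstarlet's verdict; otherwise the position plays the same role as the line and column in the error message Word shows for a broken document.xml.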

I'm having trouble verifying the cause, since the document.xml is > 3 MB in one line, and no editor seems to work on this.

When opening the file, it ends in the middle of a sentence, where the XML doesn't even open a tag or anything.

New bug?
Comment 17 Andy Pillip 2015-04-14 15:30:59 UTC
I'm using LibreOffice Build-ID: 4.3.6.2-8.fc21.
Comment 18 ELind77 2017-02-22 20:04:22 UTC
I'm experiencing this bug in 5.3.0.3 on Ubuntu 16.04
Comment 19 Yousuf Philips (jay) (retired) 2017-05-09 19:40:58 UTC
(In reply to ELind77 from comment #18)
> I'm experiencing this bug in 5.3.0.3 on Ubuntu 16.04

I tested the link from comment 2, resaved attachment 67463 from comment 5, and repeated the video steps from comment 7, and found no issues with LibreOffice 5.3.2.2. Most likely you have a document that was saved while this bug was still present and that now needs fixing. Try using the steps in comment 9 and comment 10 to fix the issue. If you are unable to do so yourself, you can email me your document and I'll fix it for you.

(In reply to Andy Pillip from comment #16)
> I'm having trouble verifying the cause, since the document.xml is > 3 MB in
> one line, and no editor seems to work on this.

I use tidy < http://tidy.sourceforge.net/ > to convert the single line into multiple indented lines with this command: '$ tidy -i -xml -raw -w 0 document.xml > document1.xml'. If document1.xml is blank, then the XML isn't well-formed.
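
If tidy is not available, or the XML is too broken for it, a cruder option is to break the single line at tag boundaries so that an ordinary editor can open it. A minimal Python sketch, for inspection only; the function name split_tags_onto_lines is made up for illustration:

```python
def split_tags_onto_lines(xml_text: str) -> str:
    """Insert a newline between every pair of adjacent tags.

    This adds whitespace between elements, so the result is meant for
    reading and diffing, not for packing back into a .docx.
    """
    return xml_text.replace("><", ">\n<")
```

Unlike a real pretty-printer this does no parsing at all, which is why it also works on files that are not well-formed.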

> When opening the file, it ends in the mid of a sentence, where the XML
> doesn't even open a tag or anything.

Hopefully document.xml hasn't sustained any data loss; if so, fixing the issue shouldn't be too difficult.

So for anyone who has a corrupted file, attach it to the bug report and I'll attempt to fix it for you.
Comment 20 Mike Kaganski 2022-07-20 05:14:33 UTC
Created attachment 181339
Problematic DOCX fixed and saved as ODT

FTR:

This is attachment 67436, fixed (by simply removing one unmatched closing </hyperlink> tag) and saved as ODT.

It reproduces the problem if saved as DOCX in 3.6.0 and 3.5.0; it worked OK in 3.4.0, and also since 4.0.0.