Bug 147892 - Corrupt DOCX document after saving (track changes involved)
Summary: Corrupt DOCX document after saving (track changes involved)
Status: RESOLVED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer (show other bugs)
Version:
(earliest affected)
7.2.5.2 release
Hardware: All All
: medium normal
Assignee: Not Assigned
URL:
Whiteboard: target:7.6.0 target:7.5.2 target:7.4.7
Keywords: bibisected, bisected, implementationError
: 154070 (view as bug list)
Depends on:
Blocks: DOCX-SAXParse DOCX-Track-Changes
  Show dependency treegraph
 
Reported: 2022-03-10 09:02 UTC by Milan Bouchet-Valat
Modified: 2024-03-27 10:24 UTC (History)
6 users (show)

See Also:
Crash report or crash signature:


Attachments
diff -u output on document.xml from correct to corrupt DOCX (373.40 KB, patch)
2022-03-10 09:02 UTC, Milan Bouchet-Valat
Details
Minimal reproducer (1.13 KB, application/vnd.oasis.opendocument.text-flat-xml)
2023-01-26 14:37 UTC, Miklos Vajna
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Milan Bouchet-Valat 2022-03-10 09:02:46 UTC
Created attachment 178770 [details]
diff -u output on document.xml from correct to corrupt DOCX

Two days ago I saved a document containing many tracked changes, including a new table. When I tried to reopen it, LibreOffice complained that it was corrupt, and was unable to fix it:

> An error occurred during opening the file. This may be caused by incorrect file contents. The error details are:
> SAXException: [word/document.xml line 2]: Opening and ending tag mismatch: rPr line 2 and ins
> /buildir/build/BUILD/libreoffice-7.2.5.2/sax/source/fastparser/fastparser.cxx:619
> Proceeding with import may cause data loss or corruption, and application may become unstable or crash.
> Do you want to ignore the error and attempt to continue loading the file?

If I choose "Yes" then the file opens, but contents are truncated in the middle the table I added (with tracked changes) just before corruption happened.

Of course I don't know how to reproduce the bug, but I had a backup of the previous version of the file, and the diff between the XML code of that document (which is not corrupt) and the new corrupt one is very limited (apart from trivial one-line changes), so maybe it can be useful? I'm attaching it. I can send you both the correct and the corrupt document privately if you're interested.


Version: 7.2.5.2.0+
Build ID: 20(Build:2)
CPU threads: 8; OS: Linux 5.16; UI render: default; VCL: gtk3
Locale: fr-FR (fr_FR.UTF-8); UI: fr-FR
Calc: threaded
Comment 1 Gabor Kelemen (allotropia) 2022-03-10 21:39:23 UTC
Hi Milan

I'm interested in this one. Please send me the two files in private, I'll figure out what operation went wrong.
Comment 2 Timur 2022-03-11 12:20:28 UTC
In addition to sending to a volunteer, you can also try to replace all chars with x in the backup file and then try to reproduce a problem with changing it.
Comment 3 Milan Bouchet-Valat 2022-03-11 16:22:17 UTC
Thanks Gabor, I've just sent you the files.

I'll also try reproducing the bug, but since that involved creating a table and making various modifications to it in a relatively irregular order, I don't have high hopes to suceed.
Comment 4 QA Administrators 2022-03-12 03:39:18 UTC Comment hidden (obsolete)
Comment 5 Milan Bouchet-Valat 2022-10-17 18:13:40 UTC
This happened to me again. I can send you the new corrupt file if you're interested. Fixing it manually turned out to be relatively easy: a </w:ins> markup was placed before rather than after the <w:ins ...> block it was supposed to close.
Comment 6 Aron Budea 2022-10-17 23:28:48 UTC
I wonder if Gabor managed to find any specific steps after the previous sample, ultimately being able to reproduce the bug would be the most helpful.

You could send me the new file, and any recollection of what you did, in case Gabor is busy with other stuff.
Comment 7 Milan Bouchet-Valat 2022-10-18 07:46:59 UTC
I can send you the file, but there's no chance for me to remember what I did. The bug only appears when doing many modifications with lots of tracked changes, and I only realize it's there after closing and reloading the file -- i.e. after hours of work on the document.
Comment 8 Milan Bouchet-Valat 2022-10-24 13:18:33 UTC
Ah, now I have an ODT document which is reliably corrupt simply by saving it to DOCX. I'm sending it to you privately.
Comment 9 Aron Budea 2022-10-24 19:15:13 UTC
I was able to reproduce _a_ bug using the privately sent sample. It's a regression that could be bibisected to the 7.3 backport of the following commit using repo bibisect-linux-64-7.3. Adding CC: to Miklos Vajna.

It may not be the originally reported problem, as that was reported for 7.2.5.2.0+, and this change didn't get into 7.2. The description also mentions a table, and the sample doesn't have one (though the table might've been unrelated to the corruption). But it's definitely a bug.

https://cgit.freedesktop.org/libreoffice/core/commit/?id=71019ec15bd3fe15385443b68614fd2402e0040f
author		Miklos Vajna <vmiklos@collabora.com>	2022-05-06 15:21:45 +0200
committer	Miklos Vajna <vmiklos@collabora.com>	2022-05-06 16:38:53 +0200

sw: don't copy useless char escapement to next node on split
Comment 10 Miklos Vajna 2022-10-26 06:53:14 UTC
Some ideas:

- The output in document.xml looks not well-formed. This must be because we screw up the start/end calls in sw/source/filter/ww8/docxattributeoutput.cxx, but hard to guess, given that also marking / merging is involved as a fast serializer level. It feels like this is the root cause.

- You can hide the previously hidden problem again but limiting the effect of the above commit. It was meant to modify how editing works, so perhaps adding a condition to not do anything in the pDoc->IsInReading() case (where pDoc is an SwDoc*) would work around the immediate problem.
Comment 11 Milan Bouchet-Valat 2022-11-18 14:42:04 UTC
FWIW the bug happens really often now. I had to fix three documents manually in the last month. I'm surprised it has not been reported by others. I'm going to go back to LO 7.2 for now.
Comment 12 Miklos Vajna 2023-01-26 14:37:27 UTC
Created attachment 184934 [details]
Minimal reproducer

Saving this document as DOCX produces not-well-formed XML. But if I revert the above commit, the problem doesn't go away for me.
Comment 13 Miklos Vajna 2023-01-26 14:51:14 UTC
I've bisected the problem that is reproducible with the attachment, the first bad commit is:
 eeee19b3fcf8b0374c361c7f6c285fd5c89b5a2f (tdf#142387 DOCX track changes: export w:del in w:ins, 2021-06-11)

Adding Cc: to László Németh

László: please take a look. Thanks!
Comment 14 Miklos Vajna 2023-02-01 08:06:20 UTC
I now realize that previously these nested changes were simply lost, adjusting keywords accordingly.
Comment 15 László Németh 2023-02-10 15:04:31 UTC
Description of the proposed fix in https://gerrit.libreoffice.org/c/core/+/146782:

tdf#147892 DOCX: fix broken export at para marker change tracking history

OOXML export of tracked deletion of tracked paragraph insertion
resulted invalid document.xml, because change tracking history of
paragraph markers isn't supported by OOXML. Export the tracked deletion
of the last run of paragraphs without history (tracked insertion).
This way w:p/w:pPr/w:rPr contains only w:del or w:ins, not both of them
(with broken tag order).

Note: it's possible to optimize the fix to keep the change
tracking history of the characters of the last run of the paragraph,
except the paragraph marker.

Regression from commit eeee19b3fcf8b0374c361c7f6c285fd5c89b5a2f
"tdf#142387 DOCX track changes: export w:del in w:ins".

Minimal unit test document was created by Miklós Vajna.
Comment 16 Commit Notification 2023-02-11 21:06:50 UTC
László Németh committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/382892341a63e400f221e1a533fd5a5d6b4d9d70

tdf#147892 DOCX: fix corrupt export at para marker revision history

It will be available in 7.6.0.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 17 László Németh 2023-02-11 21:19:34 UTC
Fixed in master, backport is in progress.

@Milan: thanks for the bug report!

@Miklós: thanks for bibisecting & analysis!

@Gábor, Áron, Timur: thanks for handling the issue!
Comment 18 Milan Bouchet-Valat 2023-02-13 10:52:05 UTC
Thanks!
Comment 19 Commit Notification 2023-02-14 10:08:14 UTC
László Németh committed a patch related to this issue.
It has been pushed to "libreoffice-7-5":

https://git.libreoffice.org/core/commit/4eb34d281260d7c9b6f2268ca1fa439655238a55

tdf#147892 DOCX: fix corrupt export at para marker revision history

It will be available in 7.5.2.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 20 Aron Budea 2023-03-23 00:40:53 UTC
László, could the fix be backported to 7.4 as well?
Comment 21 Stéphane Guillou (stragu) 2023-03-23 07:51:07 UTC
*** Bug 154070 has been marked as a duplicate of this bug. ***
Comment 22 Roman 2023-03-23 09:00:45 UTC Comment hidden (noise)
Comment 23 Roman 2023-03-23 09:05:45 UTC Comment hidden (noise)
Comment 24 Commit Notification 2023-03-30 13:37:51 UTC
László Németh committed a patch related to this issue.
It has been pushed to "libreoffice-7-4":

https://git.libreoffice.org/core/commit/4c05f5b45ce8449adee81aeac355fae652b97786

tdf#147892 DOCX: fix corrupt export at para marker revision history

It will be available in 7.4.7.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 25 Roman 2023-03-30 20:35:38 UTC Comment hidden (noise)
Comment 26 Roman 2023-03-30 20:41:20 UTC Comment hidden (noise)
Comment 27 Roman 2023-03-30 21:44:38 UTC Comment hidden (noise)
Comment 28 Roman 2023-10-28 03:27:59 UTC Comment hidden (noise, off-topic)
Comment 29 Roman 2023-10-28 19:39:09 UTC Comment hidden (noise, off-topic)