Bug 134860 - Illegal table in generated DOCX not shown in writer, shown in Word correctly (and in LO if resaved in MSO)
Summary: Illegal table in generated DOCX not shown in writer, shown in Word correctly ...
Status: NEW
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer (show other bugs)
Version:
(earliest affected)
Inherited From OOo
Hardware: All All
: low normal
Assignee: Not Assigned
URL:
Whiteboard:
Keywords: filter:docx
Depends on:
Blocks: DOCX-Tables
  Show dependency treegraph
 
Reported: 2020-07-16 11:10 UTC by Radomir Terber
Modified: 2021-07-21 13:23 UTC (History)
5 users (show)

See Also:
Crash report or crash signature:
Regression By:


Attachments
The sample file that behaves as described (26.26 KB, application/vnd.openxmlformats-officedocument.wordprocessingml.document)
2020-07-16 11:12 UTC, Radomir Terber
Details
patch134860.diff: table loads with this change made (474 bytes, patch)
2020-07-17 11:53 UTC, Justin L
Details
characterTableJ.docx: unit test minimal reproducer document (1.49 KB, application/vnd.openxmlformats-officedocument.wordprocessingml.document)
2020-07-17 11:55 UTC, Justin L
Details
The sample file resaved in MSO opens OK (37.25 KB, application/vnd.openxmlformats-officedocument.wordprocessingml.document)
2021-03-06 16:07 UTC, Timur
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Radomir Terber 2020-07-16 11:10:52 UTC
Description:
I have a document that contains table. This table is shown in MS Word correctly, but in Writer there is nothing in the place where the table should exist.

Actual Results:
No table is shown in the document

Expected Results:
Table is shown like in the Word


Reproducible: Always


User Profile Reset: No



Additional Info:
It may be some inaccuracy in the document since it is generated, but I suppose the behaviour of LO should be as close as possible to the MS Word.
Comment 1 Radomir Terber 2020-07-16 11:12:41 UTC
Created attachment 163111 [details]
The sample file that behaves as described
Comment 2 Radomir Terber 2020-07-16 11:15:19 UTC
I tried to open this file also in LO 7.0.0 beta2 with the same result.
Comment 3 NISZ LibreOffice Team 2020-07-16 13:45:41 UTC
Confirming with:

Version: 7.1.0.0.alpha0+ (x64)
Build ID: 28f09062b2e9a3228ef5fe41312769ad566cc4b6
CPU szálak: 4; OS: Windows 6.3 Build 9600; Felületmegjelenítés: Skia/Raster; VCL: win
Locale: hu-HU (hu_HU); UI: hu-HU
Calc: CL

Table is missing from the document.
Comment 4 Telesto 2020-07-16 19:42:13 UTC
Also seen in
6.0

4.4.7.2

4.1

3.3.0
Comment 5 Telesto 2020-07-16 19:42:56 UTC
@Justin,
You might find this one interesting
Comment 6 Justin L 2020-07-17 06:44:29 UTC
(In reply to Radomir Terber from comment #0)
> It may be some inaccuracy in the document since it is generated.

If I first round-trip this with Word 2003, then LO can see the table just fine. 

The table is located inside of a character-run
<w:p>
  <w:r>
    <w:tbl>
    ...
    </w:tbl>
  <w:r>
</w:p>

Sounds identical to LO 6.0 commit 67a61e54531801645d51ad89aac30064b8c4b4e8
Author: Mike Kaganski on Thu Jul 13 09:08:56 2017 +0300
    tdf#111550: A workaround for out-of-order (in-paragraph) tbl on OOXML

    Word allows <w:tbl> to be direct child of <w:p>, which is illegal
    according to ECMA-376-1:2016.

    This allows for import the data in such tables (previously, this text
    was simply dropped, causing dataloss) - bug-to-bug compatibility with Word.
Comment 7 Mike Kaganski 2020-07-17 10:17:27 UTC
(In reply to Radomir Terber from comment #0)
> Additional Info:
> It may be some inaccuracy in the document since it is generated, but I
> suppose the behaviour of LO should be as close as possible to the MS Word.

Every such illegal case IMO should *only* be worked around if it's generated by MS Office itself (i.e., absolutely likely to be present in most documents, and impossible to expect some change). Otherwise it's better to fix the invalid generator by filing appropriate bug report to its maintainers.

Every workaround like this makes our code more fragile, prone to bugs or exploits; for the sake of pleasing some lazy developer of invalid generators, we risk introducing bugs into normal file handling.
Comment 8 Justin L 2020-07-17 11:53:55 UTC
Created attachment 163175 [details]
patch134860.diff: table loads with this change made

(In reply to Mike Kaganski from comment #7)
> Every such illegal case IMO should *only* be worked around if it's generated
> by MS Office itself (i.e., absolutely likely to be present in most
> documents, and impossible to expect some change).
OK. I attach the patch that worked for me, but will leave it unimplemented.
Comment 9 Justin L 2020-07-17 11:55:11 UTC
Created attachment 163176 [details]
characterTableJ.docx: unit test minimal reproducer document
Comment 10 Telesto 2020-07-17 20:32:35 UTC
(In reply to Mike Kaganski from comment #7)
> Every workaround like this makes our code more fragile, prone to bugs or
> exploits; for the sake of pleasing some lazy developer of invalid
> generators, we risk introducing bugs into normal file handling.

I understand the argument made here.. however out curiosity. 

1) Why is Word able to handle this? Show Microsoft decided to allow this to work? The had to implement this them-self for a reason (even if it's of spec)

2) And what about file compatibility. Not sure how many of those files are around. From technical point of view this might be correct answer. However from user point of view.. If the intended use case is to replace MS Office with LibreOffice, this might be a hurdle. And saying fix it by using MS Office, might be saying lets use MS Office in general. 

Should this not better be assessed based what the competitors do? 
-> Google Docs handles this as LibreOffice (didn't check the rest)
Comment 11 Mike Kaganski 2020-07-17 20:41:44 UTC
(In reply to Telesto from comment #10)
> 1) Why is Word able to handle this? Show Microsoft decided to allow this to
> work? The had to implement this them-self for a reason (even if it's of spec)

Lol. If you see that some program does something when it's fed invalid data, it doesn't necessarily mean it's intentional. In this case, I'm sure it's called a bug. To have a bug in your software, you don't have to make any extra effort. To the contrary. And then people start to generate your files without using its documentation, and simply test if some weird output they generate happens to render in your software as they expected. If your software is popular enough, and many people do that, each your bug sooner or later gets its happy user.

I myself made lots of patches to be "bug-to-bug compatible" with MS Office. So I know that it makes sense sometimes, but -

the *first* thing to do here is not to discuss if we need to support one of these bugs. The first thing should be to tell "this file is generated by software X. We contacted them, and filed the report. But they refused to fix it, telling ... . The software is used widely, and its impact is large, so that there are millions invalid documents generated all the time, without any hope for a fix." ... and only then it's the time to discuss if we need to break our software.
Comment 12 Radomir Terber 2020-07-18 08:20:48 UTC
Since I do not know OOXML specification, I posted this bug for curiosity that the behavoiur of both tools is different. I didn't know which kind of incompatibility is it. I only observed different behaviour of both - LO and Word. 

I mentioned that this document is generated and of course we can fix it in the generator. So from my perspective everything is OK when I know the cause :)

However I guess the discussion here is relevant. The best solution would be to perform deep validation of format of all files against specifications and to show message if any file does not follow specification - with option to open document with potential data loosing. Also I see space here for independent tool that would may correct such documents (out of he scope of the LO itself).

But I understand that the check are time consuming and a lot of work. So probably current "solution" (to ignore it) is acceptable. Especially in case when word itself can't create such kind of error.
Comment 13 Timur 2020-07-18 10:04:40 UTC
(In reply to Radomir Terber from comment #12)
> I mentioned that this document is generated and of course we can fix it in
> the generator. 
Are you saying it's some internal software of your company, not of the shelf?

This looks like NotOurBug do far.
Comment 14 Radomir Terber 2020-07-18 10:36:20 UTC
This is a template engine crated by one of our suppliers. Even if I looked onto generated files, it looked formally OK. I donť know OOXML (and have no time to study it), so I supposed that if Word opens the document correctly, it could be issue of LO. I had no idea that Word opens incorrect document without any message, warning or so
Comment 15 Mike Kaganski 2020-07-18 12:33:28 UTC
(In reply to Radomir Terber from comment #12)
> However I guess the discussion here is relevant. The best solution would be
> to perform deep validation of format of all files against specifications and
> to show message if any file does not follow specification - with option to
> open document with potential data loosing. Also I see space here for
> independent tool that would may correct such documents (out of he scope of
> the LO itself).

This is not a valid approach. There's no way to perform such a "deep validation" of format, because the format changes over time - current revision of ECMA-376 is 5th edition. We actually do perform the appropriate validation as we read the format; but apart from really invalid things like mismatching tags or duplicating attributes (that make invalid XML) which we actually must warn and reject or allow the risk of further processing, everything unexpected from the OOXML PoV we must silently ignore, not stop and warn. So that if anything is unexpected in the format, we accept that as possibly unknown extension, and properly *ignore*. Doing otherwise is wrong conceptually. And that's an obvious bug to not ignore that, but accept as a *different* element/placement/syntax (I look at you Word :-D).
Comment 16 Timur 2021-03-06 16:07:05 UTC
Created attachment 170273 [details]
The sample file resaved in MSO opens OK
Comment 17 NISZ LibreOffice Team 2021-07-21 13:23:17 UTC
attachment 163111 [details] was generated by...

<Application>LibreOffice/6.4.2.2$Windows_X86_64 LibreOffice_project/4e471d8c02c9c90f512f7f9ead8875b57fcb1ec3</Application>

But I'm unable to recreate the broken document with LO using the MSO-fixed attachment 170273 [details]
- In 7.3 master it roundtrips correctly
- in 6.4.0 release it also roundtrips correctly.