Bug 108849 - DOCX IMPORT: Extra pages and wrong page sizes in a specific document
Status: VERIFIED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer
Version (earliest affected): unspecified
Hardware: All All
Importance: medium normal
Assignee: Mike Kaganski
URL:
Whiteboard: target:6.0.0
Keywords: filter:docx
Reported: 2017-06-29 08:20 UTC by Mike Kaganski
Modified: 2017-08-17 10:22 UTC

Attachments
A sanitized DOCX that has only 2 pages in Word (1.33 KB, application/vnd.openxmlformats-officedocument.wordprocessingml.document)
2017-06-29 08:20 UTC, Mike Kaganski
Updated sanitized DOCX that has only 2 pages in Word (1.33 KB, application/vnd.openxmlformats-officedocument.wordprocessingml.document)
2017-07-18 20:15 UTC, Mike Kaganski

Description Mike Kaganski 2017-06-29 08:20:12 UTC
Created attachment 134374
A sanitized DOCX that has only 2 pages in Word

The attached test document has only 2 pages in Word: the first is 15x10 cm and the second is 25x20 cm (both landscape), and each contains one paragraph of short text.

When imported into LibreOffice, it has 4 pages: the first (empty) is 10x15 cm portrait, the second (empty) is 25x20 cm landscape, the third is 15x10 cm landscape (with text "Page 1"), and the fourth is Letter-sized (with text "Page 2").
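
For anyone who wants to compare these numbers with what the file itself declares, here is a minimal Python sketch that lists the page size of every section in a document.xml. It assumes the sizes are stored in the standard w:pgSz element (attributes w:w and w:h in twentieths of a point, plus w:orient). The SAMPLE string is a hypothetical stand-in written for illustration, not the content of the attachment.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
TWIPS_PER_CM = 1440 / 2.54  # pgSz values are in twentieths of a point (twips)


def page_sizes_cm(document_xml):
    """Return (width_cm, height_cm, orientation) for each w:pgSz, in document order."""
    root = ET.fromstring(document_xml)
    sizes = []
    for pg_sz in root.iter(f"{{{W_NS}}}pgSz"):
        width = int(pg_sz.get(f"{{{W_NS}}}w")) / TWIPS_PER_CM
        height = int(pg_sz.get(f"{{{W_NS}}}h")) / TWIPS_PER_CM
        # w:orient is optional; portrait is the default when it is absent
        orient = pg_sz.get(f"{{{W_NS}}}orient", "portrait")
        sizes.append((round(width, 1), round(height, 1), orient))
    return sizes


# Hypothetical sample (NOT the attachment): two landscape sections.
SAMPLE = f"""<w:document xmlns:w="{W_NS}">
  <w:body>
    <w:p>
      <w:pPr><w:sectPr><w:pgSz w:w="8504" w:h="5669" w:orient="landscape"/></w:sectPr></w:pPr>
      <w:r><w:t>Page 1</w:t></w:r>
    </w:p>
    <w:p><w:r><w:t>Page 2</w:t></w:r></w:p>
    <w:sectPr><w:pgSz w:w="14173" w:h="11339" w:orient="landscape"/></w:sectPr>
  </w:body>
</w:document>"""

for size in page_sizes_cm(SAMPLE):
    print(size)
```

Running this against the word/document.xml extracted from a DOCX shows which section sizes the importer had to work with, independent of how any application renders them.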

The document is a sanitized version of a real-life document generated by a third-party report generator. It is actually invalid OOXML, with the last section defined in the wrong place.

According to ISO/IEC 29500-1:2016(E) 17.6.17 sectPr (Document Final Section Properties), the final <w:sectPr> must be the last child element of the body element. This is also enforced in the schema for the CT_Body complex type (Annex A. (normative) Schemas – W3C XML Schema, A.1 WordprocessingML, page 3866), where sectPr is part of an <xsd:sequence> and thus *must* stay at a specific place in the sequence: it must be the last element, and there may be at most one instance.

However, the test document has two sectPr elements before the other body content. Unfortunately, MS Word seems to accept this standards-violating content, and thus encourages third-party generators to create non-standard documents.
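
The ordering rule described above can be checked mechanically. The following is a minimal Python sketch, not part of any LibreOffice code: it reports every body-level w:sectPr that is not the last child of w:body. The function name and the SAMPLE layout are hypothetical. The sample only imitates the "sectPr before other body content" situation and is not a copy of the attachment.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
BODY = f"{{{W_NS}}}body"
SECT_PR = f"{{{W_NS}}}sectPr"


def misplaced_body_sectpr(document_xml):
    """Return the child indexes of body-level sectPr elements that are not last in w:body.

    Only one element can occupy the last position, so this check also flags
    any second body-level sectPr.
    """
    root = ET.fromstring(document_xml)
    body = root.find(BODY)
    if body is None:
        return []
    children = list(body)
    last = len(children) - 1
    return [i for i, child in enumerate(children)
            if child.tag == SECT_PR and i != last]


# Hypothetical sample (NOT the attachment): two sectPr ahead of the paragraphs.
SAMPLE = f"""<w:document xmlns:w="{W_NS}">
  <w:body>
    <w:sectPr/>
    <w:sectPr/>
    <w:p><w:r><w:t>Page 1</w:t></w:r></w:p>
    <w:p><w:r><w:t>Page 2</w:t></w:r></w:p>
  </w:body>
</w:document>"""

print(misplaced_body_sectpr(SAMPLE))
```

An empty result means the body-level sectPr placement matches the CT_Body sequence. Section properties nested inside a paragraph's w:pPr are deliberately ignored, since those are allowed anywhere a paragraph is.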
Comment 1 Mike Kaganski 2017-06-29 08:45:50 UTC
A patch is here: https://gerrit.libreoffice.org/39382
Comment 2 Xisco Faulí 2017-06-29 08:58:13 UTC
Moving to ASSIGNED
Comment 3 Mike Kaganski 2017-07-18 20:15:33 UTC
Created attachment 134715
Updated sanitized DOCX that has only 2 pages in Word

The original test document was sanitized slightly incorrectly, which changed the order of paragraphs in the XML.

The updated document (with the same problems as the original one) follows the generating software's pattern more closely: the misplaced final sectPr does not come first, but rather follows the previous paragraph's sectPr.
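
As an illustration of that pattern, here is a hedged Python sketch of a preprocessing workaround that moves a single stray body-level sectPr to the end of w:body, which is where the schema expects it. This is NOT what the LibreOffice patch does. The function name and the SAMPLE layout are assumptions made for illustration, and documents with zero or several body-level sectPr elements are left untouched because the intended result would be ambiguous.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W_NS)  # keep the usual "w:" prefix when serializing


def move_body_sectpr_last(document_xml):
    """Move a single body-level sectPr to the end of w:body; otherwise return the input."""
    root = ET.fromstring(document_xml)
    body = root.find(f"{{{W_NS}}}body")
    if body is None:
        return document_xml
    stray = [child for child in body if child.tag == f"{{{W_NS}}}sectPr"]
    if len(stray) != 1:
        return document_xml  # nothing to do, or too ambiguous to repair
    body.remove(stray[0])
    body.append(stray[0])
    return ET.tostring(root, encoding="unicode")


# Hypothetical sample (NOT the attachment): the final sectPr sits right after
# the paragraph that carries the first section's own sectPr.
SAMPLE = f"""<w:document xmlns:w="{W_NS}">
  <w:body>
    <w:p><w:pPr><w:sectPr/></w:pPr><w:r><w:t>Page 1</w:t></w:r></w:p>
    <w:sectPr/>
    <w:p><w:r><w:t>Page 2</w:t></w:r></w:p>
  </w:body>
</w:document>"""

fixed = ET.fromstring(move_body_sectpr_last(SAMPLE))
body = fixed.find(f"{{{W_NS}}}body")
print([child.tag.split("}")[1] for child in body])
```

Whether such a rewrite matches what Word shows for a given real file has to be verified case by case; the sketch only demonstrates the reordering itself.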

While I cannot finish the patch mentioned in comment 1, which would also fix the first bugdoc (it's blocked by bug 108970), I prepared a patch that fixes this updated sample: https://gerrit.libreoffice.org/40161. Fixing this seems adequate ATM, at least until we find documents resembling attachment 134374 in the wild.
Comment 4 Commit Notification 2017-07-20 09:08:19 UTC
Mike Kaganski committed a patch related to this issue.
It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=4b4cd502806cfc9c9cc9754b8aae18a2c2632cdc

tdf#108849: allow out-of-order sectPr

It will be available in 6.0.0.

The patch should be included in the daily builds available at
http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
http://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 5 Mike Kaganski 2017-07-20 09:09:08 UTC
Let's call it fixed for now.
Comment 6 Cor Nouws 2017-08-17 07:59:22 UTC
checked in Version: 6.0.0.0.alpha0+
Build ID: 75933b220d48bceff25b07cfc4b55c70a2e24917
CPU threads: 4; OS: Linux 4.10; UI render: default; VCL: gtk2; 
TinderBox: Linux-rpm_deb-x86_64@70-TDF, Branch:master, Time: 2017-08-16_22:50:24
Locale: nl-NL (nl_NL.UTF-8); Calc: group

is OK. Thanks Mike :) !

Would backport to 5.4 be an option?
Comment 7 Mike Kaganski 2017-08-17 08:56:44 UTC
(In reply to Cor Nouws from comment #6)
> Would backport to 5.4 be an option?

I don't think so. This is a specific problem in documents created by an incorrectly programmed generator. I don't think they are widespread in the wild; this is not a LibreOffice bug strictly speaking, and unless there's evidence that it's required, I won't backport this. (Among other reasons, I naively hope that this half-year might push the developers in the right direction, if they decide to try their documents with LO.)
Comment 8 Cor Nouws 2017-08-17 10:22:40 UTC
Fine - thanks for outlining that.