Created attachment 134374: A sanitized DOCX that has only 2 pages in Word. The attached test document has only 2 pages in Word: the first 15x10 cm and the second 25x20 cm (both landscape), each containing one paragraph of short text. When imported into LibreOffice, it has 4 pages: first (empty) 10x15 cm portrait, second (empty) 25x20 cm landscape, third 15x10 cm landscape (with text "Page 1"), and fourth Letter-sized (with text "Page 2"). The document is a sanitized version of a real-life document generated by a third-party report generator. It is actually invalid OOXML, with the last section defined in the wrong place. According to ISO/IEC 29500-1:2016(E) 17.6.17 sectPr (Document Final Section Properties), the final <w:sectPr> must be the last child element of the body element. This is also enforced in the schema for the CT_Body complex type (Annex A. (normative) Schemas – W3C XML Schema, A.1 WordprocessingML, page 3866), where sectPr is part of an <xsd:sequence>, and thus *must* stay at a specific place in the sequence, namely as the last element, and may occur at most once. However, the test document has two sectPr elements before the other body contents. Unfortunately, MS Word seems to accept this standards-violating content, and thus encourages the creation of non-standard documents by third-party generators.
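For anyone who wants to check whether a given DOCX has the same problem, here is a minimal Python sketch of the rule described above (a body-level <w:sectPr> must be the last child of <w:body>). It is not part of the patch and not how LibreOffice's importer works; the function names and the SAMPLE fragment are made up for illustration, and only the WordprocessingML namespace URI and the word/document.xml part name come from the OOXML format itself.

```python
import zipfile
import xml.etree.ElementTree as ET
from typing import List

# WordprocessingML main namespace used by word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def misplaced_body_sectpr(document_xml: str) -> List[int]:
    """Return child indices of body-level w:sectPr elements that are not
    the last child of w:body (hypothetical helper, for illustration)."""
    root = ET.fromstring(document_xml)
    body = root.find(f"{{{W_NS}}}body")
    if body is None:
        return []
    children = list(body)  # direct children only; sectPr inside w:pPr is ignored
    sect_tag = f"{{{W_NS}}}sectPr"
    last = len(children) - 1
    return [i for i, child in enumerate(children)
            if child.tag == sect_tag and i != last]


def check_docx(path: str) -> List[int]:
    """Read word/document.xml from a DOCX package and run the check."""
    with zipfile.ZipFile(path) as package:
        xml_text = package.read("word/document.xml").decode("utf-8")
    return misplaced_body_sectpr(xml_text)


# Invented fragment: a body-level sectPr followed by more body content.
SAMPLE = (
    f'<w:document xmlns:w="{W_NS}"><w:body>'
    '<w:p><w:pPr><w:sectPr/></w:pPr></w:p>'
    '<w:sectPr/>'
    '<w:p><w:r><w:t>Page 2</w:t></w:r></w:p>'
    '</w:body></w:document>'
)

if __name__ == "__main__":
    print(misplaced_body_sectpr(SAMPLE))
```

An empty result means every body-level sectPr (if any) sits at the end of the body; a non-empty result lists the positions of the offending elements. Paragraph-level section breaks (sectPr inside w:pPr) are deliberately not counted, since those are valid anywhere.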
A patch is here: https://gerrit.libreoffice.org/39382
Moving to ASSIGNED
Created attachment 134715: Updated sanitized DOCX that has only 2 pages in Word. The original test document was sanitized slightly incorrectly, which changed the order of paragraphs in the XML. The updated document (with the same problems as the original one) follows the generating software's pattern more closely: the misplaced final sectPr does not come first, but rather after the previous paragraph's sectPr. While I cannot finish the patch that would also fix the first bugdoc (mentioned in comment 1; it is blocked by bug 108970), I have prepared a patch that fixes this updated sample: https://gerrit.libreoffice.org/40161. Fixing this seems adequate ATM, at least until we find documents resembling attachment 134374 in the wild.
Mike Kaganski committed a patch related to this issue. It has been pushed to "master": http://cgit.freedesktop.org/libreoffice/core/commit/?id=4b4cd502806cfc9c9cc9754b8aae18a2c2632cdc tdf#108849: allow out-of-order sectPr It will be available in 6.0.0. The patch should be included in the daily builds available at http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More information about daily builds can be found at: http://wiki.documentfoundation.org/Testing_Daily_Builds Affected users are encouraged to test the fix and report feedback.
Let's call it fixed for now.
checked in Version: 6.0.0.0.alpha0+ Build ID: 75933b220d48bceff25b07cfc4b55c70a2e24917 CPU threads: 4; OS: Linux 4.10; UI render: default; VCL: gtk2; TinderBox: Linux-rpm_deb-x86_64@70-TDF, Branch:master, Time: 2017-08-16_22:50:24 Locale: nl-NL (nl_NL.UTF-8); Calc: group is OK. Thanks Mike :) ! Would backport to 5.4 be an option?
(In reply to Cor Nouws from comment #6) > Would backport to 5.4 be an option? I don't think so. This is a specific problem in documents created by an incorrectly programmed generator. I don't think they are widespread in the wild; this is not a LibreOffice bug strictly speaking, and unless there's evidence that it's required, I won't backport this. (Among other reasons, I naively hope that this half-year might push the developers in the right direction, if they decide to try their documents with LO.)
Fine - thanks for outlining that.