Bug 113849 - FILESAVE: Saving a file as .docx format adds spurious extra page breaks
Status: RESOLVED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer
Version (earliest affected): 4.1.0.4 release
Hardware: All All
Importance: medium normal
Assignee: Justin L
URL:
Whiteboard: target:6.3.0 target:6.2.1
Keywords: bibisected, bisected, filter:docx, regression
Depends on:
Blocks: DOCX-Page
Reported: 2017-11-15 10:11 UTC by Luke Kendall
Modified: 2020-06-18 08:22 UTC

See Also:
Crash report or crash signature:


Attachments
The sample files that allow easy observation of the bug (161.16 KB, application/zip)
2017-11-15 10:11 UTC, Luke Kendall

Description Luke Kendall 2017-11-15 10:11:28 UTC
Created attachment 137776
The sample files that allow easy observation of the bug

I discovered this bug when using Calibre to convert LO-created .docx files to epub and .mobi formats for ebooks.  (I need to use the .docx format because the support for it in Calibre is better than for .odt files.)

If you open the example file provided (ShadowHunt-KDP-PageBreakProbs.odt) and save as .docx format, LO will insert three extra page breaks - one visible, two invisible.

The visible spurious page break occurs immediately after the phrase "make her new home perfect."

The two invisible spurious page breaks occur immediately after the phrase "to pans and pots." and after the phrase "had also been the worst night." Each of these phrases happens to end a paragraph at the bottom of a page, but without any manual page break. After saving as .docx, however, a manual page break has been inserted at that point in the .docx.

The invisible page breaks can be revealed by converting into another format where re-pagination can occur, or by converting the bulk of the document to a smaller font size.

The file ShadowHunt-KDP-PageBreakProbsOdt.odt is the sample document.
The file ShadowHunt-KDP-PageBreakProbsOdt.docx is that file saved as .docx.
The file ShadowHunt-KDP-PageBreakProbs-ManlFix1.docx is the .docx file with the visible spurious page break deleted by positioning the cursor at the end of the paragraph and hitting Delete and then Enter to break it back into the correct paragraphs.
The file ShadowHunt-KDP-PageBreakProbs-ManlFix2.docx is that file with the two invisible page breaks fixed the same way.

I have rated this as a major problem because it means if LO is used for ebook production, it introduces a significant quality problem to the ebooks produced.

In addition, any edit to the .odt file means all the manual corrections to the .docx file need to be done again. This is exacerbated by the inability to search for manual page breaks in LO. So while this bug exists, the .odt file is unsuitable for use in writing books.
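
Since LO has no search for manual page breaks, one workaround is to scan the exported .docx directly. The following is a minimal sketch, not a tested tool: it assumes the usual OOXML layout (main text in word/document.xml) and reports three kinds of explicit break: a `w:br` with `w:type="page"`, a `w:pageBreakBefore` paragraph property, and a `w:sectPr` inside a paragraph (a section break). The function name `find_hard_breaks` is made up for illustration.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used in word/document.xml
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def find_hard_breaks(docx_path):
    """Return (paragraph_index, kind, paragraph_text) for each explicit break.

    docx_path may be a file name or a file-like object.
    Note: a pageBreakBefore element with w:val="false" is still reported,
    so treat the result as a list of candidates to inspect.
    """
    with zipfile.ZipFile(docx_path) as z:
        root = ET.fromstring(z.read("word/document.xml"))

    hits = []
    for index, para in enumerate(root.iter(f"{{{W}}}p")):
        text = "".join(t.text or "" for t in para.iter(f"{{{W}}}t"))
        if para.find("w:pPr/w:pageBreakBefore", NS) is not None:
            hits.append((index, "pageBreakBefore", text))
        if para.find("w:pPr/w:sectPr", NS) is not None:
            hits.append((index, "sectPr", text))
        for br in para.iter(f"{{{W}}}br"):
            if br.get(f"{{{W}}}type") == "page":
                hits.append((index, "br", text))
    return hits
```

Calling `find_hard_breaks("ShadowHunt-KDP-PageBreakProbsOdt.docx")` and looking for paragraphs ending in "to pans and pots." or "had also been the worst night." should show whether a break was written at those points.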
Comment 1 Buovjaga 2017-11-16 11:22:58 UTC
They are not added, they already exist in the .odt.

The visible thing after "make her new home perfect." is a non-breaking space. Just remove it.

Looking inside the content.xml we see the invisible things are soft page breaks:

to pans and pots.</text:p><text:p text:style-name="Text_20_body"><text:soft-page-break/>

had also been the worst night.</text:p><text:p text:style-name="P6"><text:soft-page-break/>

I did not find a way to remove the soft page breaks in the LibreOffice GUI. So you would have to unzip the .odt and edit the content.xml to remove the <text:soft-page-break/> elements manually. Then zip everything back again and rename the file type to .odt.
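
The unzip/edit/re-zip steps can be scripted. This is only a sketch under a few assumptions: the soft page breaks appear as self-closing `<text:soft-page-break/>` elements in content.xml (as in the excerpts above), and copying the other package entries unchanged, in their original order and with their original compression, is enough for LO to accept the result. The function name `strip_soft_page_breaks` is made up for illustration.

```python
import re
import zipfile

# Matches the self-closing element shown in the content.xml excerpts.
SOFT_BREAK = re.compile(rb"<text:soft-page-break\s*/>")


def strip_soft_page_breaks(src_path, dst_path):
    """Copy an .odt package, removing soft page breaks from content.xml.

    Entries are written in their original order and with their original
    compression type, so "mimetype" stays first if it was first.
    Returns the number of elements removed.
    """
    removed = 0
    with zipfile.ZipFile(src_path) as src, zipfile.ZipFile(dst_path, "w") as dst:
        for info in src.infolist():
            data = src.read(info)
            if info.filename == "content.xml":
                data, removed = SOFT_BREAK.subn(b"", data)
            dst.writestr(info, data)
    return removed
```

Write to a new file name rather than overwriting the original, and keep a backup: LO may well regenerate the soft page breaks the next time it lays out and saves the document.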

I found a mention of soft page breaks related to Page styles using Next style, but when I changed it in your document it did not help:
https://help.libreoffice.org/Writer/Changing_Page_Orientation_Landscape_or_Portrait#One_Page_Long_Styles

Closing as NOTABUG.
Comment 2 Luke Kendall 2017-11-16 12:35:53 UTC
Sorry, I dispute this.  Since when is a non breaking space a page break?

I never inserted any of these page breaks, and the idea of literally inserting a "soft" page break just because the page style changes seems ludicrous to me.

The reason for changing page styles is that typically in a book, pages that start a new chapter are formatted differently (e.g., no header and/or footer). And if this ludicrous behaviour was correct, then LO would be causing this unwanted behaviour after every new-chapter page, not just on those where the end of a paragraph happens to fall at the end of the page.

Note that I tried the experiment, before submitting the bug, of deleting the end-of-line after "pans and pots." and then hitting Enter, and re-saving as a .docx: LO re-introduced the invisible page break. I certainly didn't insert any soft page breaks! Doing that would be most unwise, since any editing of the document would change the layout of many pages and ruin the formatting.

I also did not suffer this problem until recently.  So it seems to be new behaviour with LO 5.4.2 or thereabouts.  It was not a problem with my previous two books over the last two years, nor this third book until I tried to update it this month.

In my opinion this IS a bug, and a serious one.

Please re-check.

If it is a new feature, and it is the way LO will be from now on, then I'll know to finally give up on LO and switch over to an alternative, like WPS, or run MS Office under Wine. Or try the Apache fork.

Please investigate further.
Comment 3 Buovjaga 2017-11-16 14:52:08 UTC
I tested on 3.6 and the problems do not occur, so adding bibisect request.

Arch Linux 64-bit
Version: 6.0.0.0.alpha1+
Build ID: 121303615054568c204def97872343d2014af4a0
CPU threads: 8; OS: Linux 4.13; UI render: default; VCL: kde4; 
Locale: fi-FI (fi_FI.UTF-8); Calc: group
Built on November 16th 2017

Arch Linux 64-bit
Version 3.6.7.2 (Build ID: e183d5b)
Comment 4 Luke Kendall 2017-11-16 15:37:37 UTC
Thank you!
Comment 5 Telesto 2017-11-21 13:07:01 UTC
Repro with
Version: 4.1.0.4
Build ID: 89ea49ddacd9aa532507cbf852f2bb22b1ace28

No repro in
Version 4.0.0.3 (Build ID: 7545bee9c2a0782548772a21bc84a9dcc583b89)
Comment 6 Telesto 2017-11-23 12:58:30 UTC
@Luke
You might be interested in this: https://vmiklos.hu/blog/basic-epub3-export.html
Comment 7 Buovjaga 2018-07-05 13:44:40 UTC
Still repro, will bibisect later.

Arch Linux 64-bit
Version: 6.2.0.0.alpha0+
Build ID: ea39c41fdf63191579d25f327db81db14862251c
CPU threads: 8; OS: Linux 4.17; UI render: default; VCL: gtk3; 
Locale: fi-FI (fi_FI.UTF-8); Calc: group threaded
Built on July 4th 2018
Comment 8 Buovjaga 2018-07-06 18:55:21 UTC
Bisected on Linux with 41max to
commit 1aa664e781c50a322170070e7668cce173a23b4f
Author: Matthew Francis <mjay.francis@gmail.com>
Date:   Fri Sep 18 10:19:41 2015 +0800

    source-hash-ee9f23bb94b4c2c8c4db6466ecca272a092e9492
    
    commit ee9f23bb94b4c2c8c4db6466ecca272a092e9492
    Author:     Pierre-Eric Pelloux-Prayer <pierre-eric@lanedo.com>
    AuthorDate: Thu Jan 10 18:45:42 2013 +0100
    Commit:     Noel Power <noel.power@suse.com>
    CommitDate: Mon Jan 14 15:35:13 2013 +0000
    
        docx export: invalid sectPr added at the beginning of the doc
    
        This reverts commit 60fa5057039d2413d56813df4d45e5cfdfbb40ac,
        which was a revert of 723f772d (fix for ooo#106749) with an
        alternative fix to avoid a regression (fdo#56513).
    
        This commit contain a fix for the sectPr issue, and does not
        regress on the 2 previously fixed issue.
    
        Change-Id: Ibc551b38d25554c59b7c4ac5a447a0d60323f53f
        Reviewed-on: https://gerrit.libreoffice.org/1647
        Reviewed-by: Noel Power <noel.power@suse.com>
        Tested-by: Noel Power <noel.power@suse.com>
Comment 9 Xisco Faulí 2018-07-13 11:04:45 UTC Comment hidden (obsolete)
Comment 10 Xisco Faulí 2018-07-13 11:35:22 UTC
@Justin Luth, I thought you could be interested in this issue.
It was caused by ee9f23bb94b4c2c8c4db6466ecca272a092e9492, which was the same as for bug 93366, which was fixed by https://cgit.freedesktop.org/libreoffice/core/commit/?id=7e92a996d1588bdf2ff1e2df10220a0f57686cfb
Comment 11 Justin L 2018-07-14 04:37:46 UTC
dangerous stuff, as evidenced by the fact that the offending patch reverted a couple of other patches, and affects both doc and docx. As a reminder to myself when I eventually look at this - the referenced doc fixes have no relevance for this docx bug, so start debugging from scratch.
Comment 12 Justin L 2018-12-25 14:57:43 UTC
Lots of problems here.
1.) For some reason, title-page (First/Follow) isn't working. I assume that is because the inside/outside margins are too different between the two page styles. It automatically looks a lot better when they are the same. (I don't think that docx has the idea of a First/Follow style, just a Title page, which has all the same margins by definition.)
2.) The footer is missing from even pages in docx (but not in .doc). Probably because header is different even/odd, but footer is not?
Comment 13 Justin L 2018-12-29 09:26:38 UTC
(In reply to Justin L from comment #12)
> 1.) For some reason, title-page (Title/Follow) isn't working. I assume that
> is because the inside/outside margins are too different
Correct. This prevents IsPlausableSingleWordSection().

> (I don't think that docx has the idea of a Title/Follow page styles
Kinda correct - I guess a continuous section serves that role, but continuous sections are an anathema to LO.

So, we have to try to emulate title/follow when exporting to DOCX format, and frequently that can be done if the follow style doesn't already have a different "first page". In that case, if IsPlausableSingleWordSection, then we consolidate the two page styles into one, marking the initial headers/footers as "different first page".
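
As a rough illustration of that decision (not the actual exporter code): the `PageStyle` type, its fields, and both helper functions below are hypothetical, and the plausibility test is reduced to a margin comparison, which is only the one condition mentioned in this comment. The real IsPlausableSingleWordSection may check more than that.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageStyle:
    """Hypothetical, simplified stand-in for a Writer page style."""
    name: str
    margins: tuple          # (inside, outside, top, bottom), arbitrary units
    follow: str             # name of the "next style"
    different_first_page: bool = False


def plausible_single_section(title, follow):
    # Assumption: only margins are compared here; the real check is
    # likely stricter.
    return title.margins == follow.margins


def can_merge_title_follow(title, follow):
    """True if the title/follow pair could be written as one DOCX section
    whose title headers/footers are marked "different first page"."""
    return (
        title.follow == follow.name
        and not follow.different_first_page
        and plausible_single_section(title, follow)
    )
```

With differing inside/outside margins, as in the sample document, `can_merge_title_follow` returns False in this sketch, which corresponds to the case where the two page styles end up as separate sections.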

***Ideally, if you plan to export to MS formats, try to use "different first page" headers/footers instead of title/follow page styles.***

> 2.) The footer is missing from even pages in docx (but not in .doc).
> Probably because header is different even/odd, but footer is not?
Correct. Proposed fix at https://gerrit.libreoffice.org/65699 and https://gerrit.libreoffice.org/65700

One reason the page breaks were noticeable after the Chapter page was that the text wasn't fitting all on one page. Undefined even headers/footers being inherited from previous page styles took up some extra space. That was fixed in LO 6.1 by Tamas' commit 6aa1df5a627697e6adaee70adcef2c5b50cfcbf7.
Comment 14 Commit Notification 2018-12-29 10:38:55 UTC
Justin Luth committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/+/7815b4d2a2c89261ad6424c7fe3ce0c453e4d02c%5E%21

tdf#113849 ooxmlexport: even headers/footers for both or none.

It will be available in 6.3.0.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 15 Commit Notification 2019-01-28 14:03:02 UTC
Justin Luth committed a patch related to this issue.
It has been pushed to "libreoffice-6-2":

https://git.libreoffice.org/core/+/303e03115e59c8e8c7e7727569012453e643c2a9%5E%21

tdf#113849 ooxmlexport: even headers/footers for both or none.

It will be available in 6.2.1.

Comment 16 Justin L 2019-01-28 16:38:54 UTC
This will be fully fixed in 6.2 once comment 13's gerrit patch https://gerrit.libreoffice.org/67020 is accepted into 6.2.