Bug 132496 - Converting word document loses page numbers
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Printing and PDF export
Version: (earliest affected) rc
Hardware: All All
Importance: low minor
Assignee: Not Assigned
Keywords: bibisected, bisected, filter:docx
Depends on:
Reported: 2020-04-28 18:16 UTC by Rhys Young
Modified: 2020-05-08 15:24 UTC

See Also:
Crash report or crash signature:

Word doc - working page numbers (37.01 KB, application/vnd.openxmlformats-officedocument.wordprocessingml.document)
2020-04-28 18:17 UTC, Rhys Young
Converted pdf - bad page numbers (39.32 KB, application/pdf)
2020-04-28 18:19 UTC, Rhys Young

Description Rhys Young 2020-04-28 18:16:51 UTC
When converting a Word document to a PDF, the page numbers in the PDF do not match those of the Word doc.

Steps to Reproduce:
1. Convert file to PDF using 'soffice --headless --nolockcheck --nodefault --nofirststartwizard --nologo --norestore --convert-to pdf --outdir /tmp /tmp/test.docx'
2. Open PDF using a viewer
3. Observe the pdf does not have the correct page numbers
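
For anyone scripting this reproduction, the command from step 1 could be wrapped roughly as follows. This is only a sketch: the helper names build_convert_command and convert, and the default paths, are illustrative assumptions. Only the flags already shown in step 1 are used.

```python
import subprocess


def build_convert_command(docx_path, outdir="/tmp", soffice="soffice"):
    """Return the argv list for the headless DOCX-to-PDF conversion in step 1.

    The helper name and the defaults are assumptions made for illustration.
    """
    return [
        soffice,
        "--headless",
        "--nolockcheck",
        "--nodefault",
        "--nofirststartwizard",
        "--nologo",
        "--norestore",
        "--convert-to", "pdf",
        "--outdir", outdir,
        docx_path,
    ]


def convert(docx_path, outdir="/tmp"):
    """Run the conversion; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_convert_command(docx_path, outdir), check=True)
```

Calling convert("/tmp/test.docx") should write the PDF into /tmp, provided soffice is on the PATH.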

Actual Results:
The PDF doesn't have page numbers matching the Word doc.

Expected Results:
The PDF should have page numbers matching the Word doc.

Reproducible: Always

User Profile Reset: No

Additional Info:
Comment 1 Rhys Young 2020-04-28 18:17:42 UTC
Created attachment 160036
Word doc - working page numbers

All text in the Word doc has been replaced with 'a' to protect the original document.
Comment 2 Rhys Young 2020-04-28 18:19:30 UTC
Created attachment 160037
Converted pdf - bad page numbers
Comment 3 Timur 2020-05-03 09:39:47 UTC
Report is neither correct nor precise.
In Actual / Expected you need to write exactly which page numbers you expect, where you expect them, and what you get instead.

Note: headless exports without blank pages so PDF is 10 pages, although DOCX shows 13 with blanks.
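
To illustrate why the two page counts differ, here is a small sketch of how skipping blank pages shifts the physical page positions. The blank-page positions below are hypothetical (the report does not say which pages are blank), and pdf_page_labels is an illustrative helper, not part of LibreOffice.

```python
def pdf_page_labels(total_pages, blank_pages):
    """Return the document page numbers left once blank pages are skipped.

    Index i of the result corresponds to physical page i + 1 of the PDF.
    """
    blanks = set(blank_pages)
    return [n for n in range(1, total_pages + 1) if n not in blanks]


# Hypothetical blank positions, chosen only to match the 13 -> 10 count.
labels = pdf_page_labels(13, blank_pages=[2, 6, 10])
print(len(labels))  # 10 physical pages in the PDF
print(labels[1])    # 3: the second PDF page is document page 3
```

So a printed number "3" on the second physical PDF page would be expected in that case, not a bug.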
Comment 4 Timur 2020-05-03 09:41:18 UTC
Also, even if you are using headless, you need to also test GUI export with/without blank pages and compare.
Comment 5 Timur 2020-05-03 09:44:14 UTC
It's also wrong to attach a sample with all "a", since that makes it harder to compare. You should instead write "page 3" where you expect page 3, etc.
Comment 6 Rhys Young 2020-05-04 13:28:18 UTC

Just wondering if you compared both of the attached files? The reason one page is blank is simply because I wanted to preserve the integrity of the original file that had this issue. I changed all the lettering in the word document to A to preserve client confidentiality while still giving you guys everything you need, everything else is constant. 

If you open up the Word document in LibreOffice Writer, you will see that the pages are numbered. The Word document is what gets passed in, and the PDF that gets spit out is MISSING those page numbers and contains only one page number.

The ordering of the pages is the same between the two files. If there is a page number in file 1 (word document) it should appear in the pdf file 2.
Comment 7 Timur 2020-05-04 14:20:04 UTC
Report is misleading. It's not about headless conversion, it's about LO 6.3 not importing the page numbers when opening the DOCX.
Fine in master 7.0+ so I close as WFM. This is a duplicate of some fixed bug.
Comment 8 Timur 2020-05-08 13:24:47 UTC
Actually, the fix had no bug report of its own, so I'm changing this one to Fixed.

c462ed55e03da0e74d40eb2f0a22949c04fe6b08 is the first good commit
Author: Jenkins Build User <tdf@pollux.tdf>
Date:   Tue Jan 21 13:47:21 2020 +0100

    source sha:8d58d0ef72162bbfb92cd3a894387f57c62ee8ae

Previous commit: 
Previous source: 8f84922be15d37cb54fa592e1445fa5ab2c37f15


commit     8d58d0ef72162bbfb92cd3a894387f57c62ee8ae
author     Miklos Vajna <vmiklos@collabora.com>  Fri Jan 10 16:03:43 2020 +0100
committer  Michael Stahl <michael.stahl@cib.de>  Tue Jan 21 12:12:01 2020 +0100
tree       7f5eb36368fab37198f4562d5625f1606576b28f
parent     8f84922be15d37cb54fa592e1445fa5ab2c37f15

DOCX import: fix lost objects anchored to an empty linked header

This is really similar to commit
04b2310aaa094794ceedaa1bb6ff1823a2d29d3e (DOCX import: fix lost objects
anchored to the single para of a linked header, 2020-01-10), except here
the header is not just a single-paragraph one, but has no text portions.

Update text-copy.docx to have a header which is not only a single
paragraph, but also has no character content. This keeps testing the
original case, but now also tests the more strict case (single paragraph
-> single empty paragraph).