Bug 156303 - Saving a PDF Import to doc/docx/rtf moves all text boxes to the first page
Status: RESOLVED WORKSFORME
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: filters and storage
Version (earliest affected): 6.0.0.3 release
Hardware: All All
Importance: medium normal
Assignee: Not Assigned
URL:
Whiteboard:
Keywords: filter:doc, filter:docx
Depends on:
Blocks: Format-Filters PDF-Import-Writer
Reported: 2023-07-15 17:05 UTC by Martin Minchev
Modified: 2024-02-28 04:09 UTC (History)

See Also:
Crash report or crash signature:


Attachments
multiple-pages.pdf - PDF test file with multiple pages (84.26 KB, application/pdf)
2023-07-15 17:05 UTC, Martin Minchev
sample ODT (11.21 KB, application/vnd.oasis.opendocument.text)
2023-07-17 21:09 UTC, Stéphane Guillou (stragu)

Description Martin Minchev 2023-07-15 17:05:06 UTC
Description:
When saving or converting a PDF to a text document, all text boxes now end up on the first page. They are stacked on top of each other, making the text unreadable.

This happens both with the Writer GUI and when using the command-line to `--convert-to`.

This is due to some change made after LibreOffice 7.4.3.2 - the last correctly working version.

Steps to Reproduce:
Using the CLI:
1. `cd` to the directory containing the test file attached to this bug report: `multiple-pages.pdf`
2. Execute:
soffice --infilter="writer_pdf_import" --convert-to doc multiple-pages.pdf
3. View the `doc` file in Writer or in MS Office.
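
The CLI steps above can also be scripted, which may help when repeating the conversion for doc, docx and rtf. This is only a sketch: the helper names `build_convert_command` and `convert` are hypothetical, and the `--headless` and `--outdir` options are additions that do not appear in the original command.

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(pdf_path, target_format="doc", outdir=".", soffice="soffice"):
    """Build the soffice argument list for the conversion in the steps above."""
    return [
        soffice,
        "--headless",                    # assumption: run without opening a window
        "--infilter=writer_pdf_import",  # same import filter as in the report
        "--convert-to", target_format,
        "--outdir", str(outdir),         # assumption: explicit output directory
        str(pdf_path),
    ]


def convert(pdf_path, target_format="doc", outdir=".", soffice="soffice"):
    """Run the conversion and return the path where the output is expected."""
    if shutil.which(soffice) is None:
        raise FileNotFoundError(f"{soffice} not found on PATH")
    subprocess.run(
        build_convert_command(pdf_path, target_format, outdir, soffice),
        check=True,
    )
    return Path(outdir) / f"{Path(pdf_path).stem}.{target_format}"
```

Calling `convert("multiple-pages.pdf", fmt)` for each of `"doc"`, `"docx"` and `"rtf"` would produce the three files to inspect in step 3.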

Using the GUI:
1. Launch Writer.
2. File > Open > select file type "PDF - Portable Document Format (Writer) (*.pdf)" and open `multiple-pages.pdf`
3. File > Save as > doc
4. Close Writer and open the saved `doc` file in Writer or MS Office.

Actual Results:
All text boxes are on the first page.

Expected Results:
The text boxes should be on their corresponding pages as in the original PDF.


Reproducible: Always


User Profile Reset: Yes

Additional Info:
Version: 7.4.4.1 (x64) / LibreOffice Community
Build ID: 988f4a351a6fa8cf4bdf2bdc873ca12cf8cbe625
CPU threads: 12; OS: Windows 10.0 Build 19045; UI render: Skia/Vulkan; VCL: win
Locale: bg-BG (bg_BG); UI: en-US
Calc: CL
Comment 1 Martin Minchev 2023-07-15 17:05:57 UTC
Created attachment 188383 [details]
multiple-pages.pdf - PDF test file with multiple pages
Comment 2 Stéphane Guillou (stragu) 2023-07-17 21:08:26 UTC
Reproduced when saved as DOCX and DOC with:

Version: 7.4.7.2 / LibreOffice Community
Build ID: 723314e595e8007d3cf785c16538505a1c878ca5
CPU threads: 8; OS: Linux 5.15; UI render: default; VCL: gtk3
Locale: en-AU (en_AU.UTF-8); UI: en-US
Calc: threaded

And in recent master build:

Version: 24.2.0.0.alpha0+ (X86_64) / LibreOffice Community
Build ID: 77fca616e0bd79e0b405fd0b3543cf8e94e15df3
CPU threads: 8; OS: Linux 5.15; UI render: default; VCL: gtk3
Locale: en-AU (en_AU.UTF-8); UI: en-US
Calc: threaded

Already the case in 6.0.0.3.

In 5.4, text boxes would disappear and LO would hang.
Comment 3 Stéphane Guillou (stragu) 2023-07-17 21:09:53 UTC
Created attachment 188418 [details]
sample ODT

The bug can be tested directly from this ODT, created in LO 7.5.4 after importing the sample PDF with Writer.
Comment 4 V Stuart Foote 2023-07-20 06:19:04 UTC
Confirmed. Do we have a faulty "text:anchor-page-number" during pdfio import?

Be sure to use the correct PDF import filter. I do reproduce the problem on "save-as" export to OOXML, when opening the result with Word 2021 or Writer 7.6.0.

PDF filter import (pdfio) into Writer should be done with:

"PDF - Portable Document Format (Writer) (*.pdf)"

The sample PDF is parsed into a four-page Writer document, and each text run of the PDF ends up on its correct page on the Writer canvas.

But writing out to ODF also seems incorrect, in addition to the issues noted for the doc/docx (MS Binary and OOXML) formats.

Opening the ODF archive and examining content.xml for the text-box spans, each of the T2 spans holding text is written with a "page" anchor, but the associated "text:anchor-page-number" is set to "1".

I'm not sure, but I assume that would be OK for a relative page reference. I suspect, though, that this page number is then being used when the document is saved as OOXML or MS Binary, or when those formats are opened back in LibreOffice.

It seems the import filter parses the PDF text runs correctly, but we then do the wrong thing when referencing the text span anchors. Is the issue with the filter import of the PDF elements, or with the filter export from ODF to MS Binary or OOXML? Or both?
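
For anyone who wants to repeat the content.xml check on the attached ODT, here is a minimal sketch. It lists every element in content.xml that carries a "text:anchor-type" attribute, together with its "text:anchor-page-number". The function name `list_anchors` is hypothetical, and the sketch assumes the anchors of interest are stored in content.xml rather than styles.xml.

```python
import zipfile
import xml.etree.ElementTree as ET

# ODF text namespace, used for the anchor attributes.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def list_anchors(odt):
    """Return (element name, anchor type, anchor page number) tuples.

    `odt` may be a file path or a binary file-like object holding an ODF
    package. The page number is None when the attribute is absent.
    """
    with zipfile.ZipFile(odt) as package:
        root = ET.fromstring(package.read("content.xml"))

    anchors = []
    for element in root.iter():
        anchor_type = element.get(f"{{{TEXT_NS}}}anchor-type")
        if anchor_type is None:
            continue  # not an anchored object
        local_name = element.tag.split("}")[-1]  # drop the namespace prefix
        page_number = element.get(f"{{{TEXT_NS}}}anchor-page-number")
        anchors.append((local_name, anchor_type, page_number))
    return anchors
```

If every tuple returned for the sample ODT shows the same page number, that would match the observation above; different page numbers would point toward the export side instead.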
Comment 5 Mike Kaganski 2024-02-27 12:53:21 UTC
(In reply to Martin Minchev from comment #0)
> LibreOffice 7.4.3.2 - the last correctly working version.

(In reply to Stéphane Guillou (stragu) from comment #2)
> Already the case in 6.0.0.3.

Did I miss something?
It feels like there was a change in the topic, when one issue (conversion from PDF to DOC) was replaced with another (saving a specific ODT, created from PDF, to DOC(X)). The latter could be longstanding; but it seems that something changed in 7.4.4 (7.5)?
Comment 6 Mike Kaganski 2024-02-27 13:12:32 UTC
I reproduced the problem with a lorem ipsum document (the text pasted several times to fill 5 pages), created in Writer and exported to PDF. Then, I used different versions to convert from that PDF to DOC.

I can reproduce the reported problem with v.7.5.0.3 and 7.6.0.3. In 7.4.0.3, it worked fine. And it also works fine with v.24.2.0.3.

Closing WORKSFORME.
Comment 7 Mike Kaganski 2024-02-27 13:15:54 UTC
Note that the *other* issue - i.e., conversion from that "multiple-pages.odt" attachment 188418 [details] to DOC - is still producing a single-page document. By the way, filing that separately, linking to this one, and reverse-bisecting the fix to this one, could provide a good code pointer for fixing that other one.
Comment 8 Stéphane Guillou (stragu) 2024-02-28 03:13:05 UTC
Thanks Mike!

(In reply to Mike Kaganski from comment #7)
> Note that the *other* issue - i.e., conversion from that
> "multiple-pages.odt" attachment 188418 [details] to DOC - is still producing
> a single-page document.
I tested with a minimal new document: the issue is that text boxes anchored to different pages stay put in ODT, but saving as DOC/DOCX anchors them to character, so they jump to the closest paragraph, i.e. the first page in our case. This is already tracked in bug 100580.
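
A rough way to see this in an exported DOCX is to count the anchored drawings under each top-level paragraph of word/document.xml. The sketch below is only illustrative: the function name `anchors_per_paragraph` is hypothetical, and it assumes the text boxes are written as DrawingML "wp:anchor" elements. It ignores any VML fallback and does not cover the binary DOC format.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML and DrawingML (wordprocessing drawing) namespaces.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
WP_NS = "http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"


def anchors_per_paragraph(docx):
    """Return one count per top-level paragraph: how many wp:anchor
    elements it contains (including those nested inside text boxes).

    `docx` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(docx) as package:
        root = ET.fromstring(package.read("word/document.xml"))

    body = root.find(f"{{{W_NS}}}body")
    if body is None:
        return []

    counts = []
    for paragraph in body.findall(f"{{{W_NS}}}p"):
        counts.append(sum(1 for _ in paragraph.iter(f"{{{WP_NS}}}anchor")))
    return counts
```

If all anchors are counted under the first paragraph, that would be consistent with every text box landing on the first page.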
Comment 9 Mike Kaganski 2024-02-28 04:09:25 UTC
(In reply to Stéphane Guillou (stragu) from comment #8)

Good. Note, though, that my idea still stands: the fix for this bug might help to fix that one. Reverse-bisection might hint at how to change the export code in a different place in a similar way.