Description:
When saving or converting a PDF to a text document, all text boxes now end up on the first page. They are stacked on top of each other, making the text unreadable. This happens both in the Writer GUI and when using `--convert-to` on the command line. It is due to some change made after LibreOffice 7.4.3.2 - the last correctly working version.

Steps to Reproduce:
Using the CLI:
1. `cd` to the directory containing the test file attached to this bug report: `multiple-pages.pdf`
2. Execute: soffice --infilter="writer_pdf_import" --convert-to doc multiple-pages.pdf
3. View the `doc` file in Writer or in MS Office.

Using the GUI:
1. Launch Writer.
2. File > Open > Select file type "PDF - Portable Document Format (Writer) (*.pdf)" and open `multiple-pages.pdf`
3. File > Save as > doc
4. Close Writer and open the saved `doc` file in Writer or MS Office.

Actual Results:
All text boxes are on the first page.

Expected Results:
The text boxes should be on their corresponding pages, as in the original PDF.

Reproducible: Always

User Profile Reset: Yes

Additional Info:
Version: 7.4.4.1 (x64) / LibreOffice Community
Build ID: 988f4a351a6fa8cf4bdf2bdc873ca12cf8cbe625
CPU threads: 12; OS: Windows 10.0 Build 19045; UI render: Skia/Vulkan; VCL: win
Locale: bg-BG (bg_BG); UI: en-US
Calc: CL
Created attachment 188383 [details] multiple-pages.pdf - PDF test file with multiple pages
Reproduced when saved as DOCX and DOC with:

Version: 7.4.7.2 / LibreOffice Community
Build ID: 723314e595e8007d3cf785c16538505a1c878ca5
CPU threads: 8; OS: Linux 5.15; UI render: default; VCL: gtk3
Locale: en-AU (en_AU.UTF-8); UI: en-US
Calc: threaded

And in a recent master build:

Version: 24.2.0.0.alpha0+ (X86_64) / LibreOffice Community
Build ID: 77fca616e0bd79e0b405fd0b3543cf8e94e15df3
CPU threads: 8; OS: Linux 5.15; UI render: default; VCL: gtk3
Locale: en-AU (en_AU.UTF-8); UI: en-US
Calc: threaded

Already the case in 6.0.0.3.
In 5.4, text boxes would disappear and LO would hang.
Created attachment 188418 [details]
sample ODT

The bug can be tested directly from this ODT, created in LO 7.5.4 after importing the sample PDF with Writer.
Confirmed. Do we have a faulty "text:anchor-page-number" during pdfio import?

Be sure to use the correct PDF import filter. PDF filter import (pdfio) into Writer should be done with: "PDF - Portable Document Format (Writer) (*.pdf)". With that filter, I do reproduce on "save-as" export to OOXML and opening the result with Word 2021 or Writer 7.6.0.

The sample PDF is parsed into a four-page Writer document, and each text run of the PDF ends up on its correct page on the Writer canvas.

But writing out to ODF also seems incorrect, in addition to the issues noted for the doc/docx (MS Binary and OOXML) formats.

Opening the ODF archive and examining content.xml for the text-box spans, each of the T2 spans holding text is written with a "page" anchor, but the associated "text:anchor-page-number" is set to "1". Not too sure, but I assume that would be OK for a relative page reference. I suspect, though, that this page number is then taken literally when the document is exported to OOXML or MS Binary, or when those formats are opened back in LibreOffice.

It seems that the import filter parses the PDF text runs correctly, but that we then do the wrong thing when referencing the text span anchors.

Is the issue with the filter import of the PDF elements, or with the filter export from ODF to MS Binary or OOXML? Or both?
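For anyone who wants to repeat the content.xml check without unzipping by hand, here is a minimal sketch in Python. It assumes the anchors are stored as the standard ODF attributes `text:anchor-type` and `text:anchor-page-number` on the frame elements; the helper names `list_anchors` and `pages_used` are made up for this example and are not part of LibreOffice.

```python
import zipfile
import xml.etree.ElementTree as ET
from collections import Counter

# Standard ODF namespaces for the text: and draw: prefixes.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
DRAW_NS = "urn:oasis:names:tc:opendocument:xmlns:drawing:1.0"

ANCHOR_TYPE = f"{{{TEXT_NS}}}anchor-type"
ANCHOR_PAGE = f"{{{TEXT_NS}}}anchor-page-number"
DRAW_NAME = f"{{{DRAW_NS}}}name"


def list_anchors(odt):
    """Return (name, anchor_type, page_number) for every anchored element.

    `odt` is a path or a binary file-like object holding an ODT archive.
    page_number is None when text:anchor-page-number is absent.
    """
    with zipfile.ZipFile(odt) as archive:
        root = ET.fromstring(archive.read("content.xml"))

    anchors = []
    for elem in root.iter():
        anchor_type = elem.get(ANCHOR_TYPE)
        if anchor_type is None:
            continue  # element carries no anchor information
        page = elem.get(ANCHOR_PAGE)
        anchors.append(
            (elem.get(DRAW_NAME), anchor_type, int(page) if page is not None else None)
        )
    return anchors


def pages_used(anchors):
    """Count page-anchored elements per anchor page number."""
    return Counter(page for _, kind, page in anchors if kind == "page")
```

Running `pages_used(list_anchors("multiple-pages.odt"))` on the attached ODT should show whether every page-anchored frame really points at page 1 or whether the page numbers differ.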
(In reply to Martin Minchev from comment #0)
> LibreOffice 7.4.3.2 - the last correctly working version.

(In reply to Stéphane Guillou (stragu) from comment #2)
> Already the case in 6.0.0.3.

Did I miss something? It feels like there was a change in the topic, when one issue (conversion from PDF to DOC) was replaced with another (saving a specific ODT, created from PDF, to DOC(X)). The latter could be longstanding; but it seems that something changed in 7.4.4 (7.5)?
I reproduced the problem with a lorem ipsum document (with several copies of the text pasted to make it 5 pages), created in Writer and exported to PDF. Then I used different versions to convert that PDF to DOC.

I reproduce the reported problem with v.7.5.0.3 and 7.6.0.3. In 7.4.0.3, it worked fine. And it also works fine with v.24.2.0.3.

Closing WORKSFORME.
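A rough sketch of how the same conversion could be scripted across several installed versions, in case someone wants to repeat the comparison. The command mirrors the one from comment #0, with `--headless` and `--outdir` added so each version writes into its own folder; the function names and the install paths in the usage note are placeholders, not real locations.

```python
import subprocess
from pathlib import Path


def build_convert_command(soffice, pdf, outdir, target="doc"):
    """Build the soffice argument list for a PDF -> `target` conversion."""
    return [
        str(soffice),
        "--headless",
        "--infilter=writer_pdf_import",  # force the Writer PDF import filter
        "--convert-to", target,
        "--outdir", str(outdir),
        str(pdf),
    ]


def convert_with_each(installs, pdf, outdir):
    """Convert `pdf` once per soffice binary in `installs` (label -> path).

    Returns a dict mapping each label to the expected output file.
    """
    results = {}
    for label, soffice in installs.items():
        target_dir = Path(outdir) / label
        target_dir.mkdir(parents=True, exist_ok=True)
        subprocess.run(build_convert_command(soffice, pdf, target_dir), check=True)
        results[label] = target_dir / (Path(pdf).stem + ".doc")
    return results
```

For example, `convert_with_each({"7.4.0.3": "/path/to/7.4/soffice", "7.6.0.3": "/path/to/7.6/soffice"}, "multiple-pages.pdf", "out")` would leave one DOC per version under `out/`, ready to be opened and compared.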
Note that the *other* issue - i.e., conversion from that "multiple-pages.odt" attachment 188418 [details] to DOC - is still producing a single-page document. By the way, filing that separately, linking to this one, and reverse-bisecting the fix to this one, could provide a good code pointer for fixing that other one.
Thanks Mike!

(In reply to Mike Kaganski from comment #7)
> Note that the *other* issue - i.e., conversion from that
> "multiple-pages.odt" attachment 188418 [details] to DOC - is still producing
> a single-page document.

I tested with a minimal new document: the issue is that text boxes anchored to different pages stay put in ODT, but saving as DOC/DOCX anchors them to character, so they jump to the closest paragraph, i.e. the first page in our case. This is already tracked in bug 100580.
(In reply to Stéphane Guillou (stragu) from comment #8)

Good. Note, though, that my idea still stands: the fix for this bug might help to fix that one. Reverse-bisection might hint at how to change the export code in a different place in a similar way.