Bug 157589 - PDF: Conversion of pdf to docx or doc collapses all content onto one page
Summary: PDF: Conversion of pdf to docx or doc collapses all content onto one page
Status: VERIFIED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: filters and storage
Version (earliest affected): 7.4.4.2 release
Hardware: All All
Importance: medium normal
Assignee: Kevin Suo
URL: https://ask.libreoffice.org/t/convers...
Whiteboard: target:24.2.0 target:7.6.4
Keywords: bibisected, bisected, filter:pdf, regression
Depends on:
Blocks:
 
Reported: 2023-10-04 09:08 UTC by ruslanik55
Modified: 2023-12-01 12:59 UTC

See Also:
Crash report or crash signature:


Attachments
pdf for testing purposes (1.25 MB, application/pdf)
2023-10-04 09:08 UTC, ruslanik55

Description ruslanik55 2023-10-04 09:08:46 UTC
Created attachment 189995 [details]
pdf for testing purposes

--convert-to “docx:MS Word 2007 XML” test.pdf --infilter=“writer_pdf_import” --headless

This command works reasonably well in version 6.4 but not in 7.6, where the DOCX content collapses onto one page.

Version 6.4, however, seems to ignore tables.
Comment 1 ruslanik55 2023-10-04 10:29:10 UTC
libreoffice --convert-to 'docx:MS Word 2007 XML' test.pdf --infilter='writer_pdf_import' --headless

Changed the quotes.
Comment 2 Mike Kaganski 2023-10-04 10:57:49 UTC
Repro using Version: 7.6.2.1 (X86_64) / LibreOffice Community
Build ID: 56f7684011345957bbf33a7ee678afaf4d2ba333
CPU threads: 12; OS: Windows 10.0 Build 19045; UI render: Skia/Raster; VCL: win
Locale: ru-RU (ru_RU); UI: en-US
Calc: CL threaded

and also using current master, using e.g. command:

> soffice --convert-to docx test.pdf --infilter=writer_pdf_import

It is not specific to DOCX; the result is the same when "docx" is replaced with e.g. "doc".

The interesting thing is, that the single-page result is visible *in MS Word*, but not in Writer.

Version 7.4.0.3 generated a file that opened normally in Word.
The problem started already in version 7.5.0.3.
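The comparison between the two target formats can be scripted. The following is only a minimal sketch: the function name `convert_pdf`, the `SOFFICE` override variable, and the `out` directory are illustrative assumptions, while the `soffice` options are the ones used in the commands above plus `--headless` and `--outdir`.

```shell
#!/bin/sh
# Hypothetical helper: convert one PDF to a Word format through
# Writer's PDF import filter.
# SOFFICE is an assumed override, e.g. the path to a specific install,
# or `echo` to print the arguments instead of running a conversion.
convert_pdf() {
    pdf=$1      # input PDF, e.g. test.pdf
    format=$2   # target format: docx or doc
    outdir=$3   # directory that receives the converted file
    "${SOFFICE:-soffice}" --headless --infilter=writer_pdf_import \
        --convert-to "$format" --outdir "$outdir" "$pdf"
}

# Usage (requires a LibreOffice install on PATH):
#   convert_pdf test.pdf docx out
#   convert_pdf test.pdf doc out
```

Pointing `SOFFICE` at two different installs (for example a 7.4 and a 7.5 build) would produce files from both versions for a side-by-side check in MS Word.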
Comment 3 ruslanik55 2023-10-06 14:22:41 UTC Comment hidden (off-topic)
Comment 4 Mike Kaganski 2023-10-06 14:48:42 UTC Comment hidden (off-topic)
Comment 5 Stéphane Guillou (stragu) 2023-11-18 21:46:23 UTC
The resulting DOCX is impossibly slow to open in LO and hangs for me. But even opening the original PDF with Writer's PDF filter results in LO hanging (document displayed but impossible to work on it).

Tested recent trunk build and 6.0.0.3.

In any case, even with the long loading times in both LO and MSO, I can see the collapsed contents in MSO. I bibisected this with linux-64-7.4 to the first bad build commit [b77a5408177cf0db37ca5aa3d9cf106c0157ab9b], which points to core commit 588e59cc36475ded243ce4fd9062473cddd2c016, a cherry-pick of:

commit fc2fb95fdb4262792e94afe61b784c8ae71d171e
author	Kevin Suo Sun Oct 23 19:10:29 2022 +0800
committer	Kevin Suo Sun Oct 23 20:10:18 2022 +0800
sdext.pdfimport Writer: Do not visit DrawElement twice in WriterXmlEmitter
https://gerrit.libreoffice.org/c/core/+/142313

Kevin, can you please have a look?
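For readers unfamiliar with bibisecting, the session behind a result like this can be sketched roughly as follows. This is an illustration under assumptions, not the exact steps used here: the function names and the `GIT` override variable are hypothetical, and the refs passed to `start_bibisect` depend on the bibisect repository in use. Only standard `git bisect` subcommands are called.

```shell
#!/bin/sh
# Sketch of a bibisect session; meant to be run inside a clone of a
# bibisect repository. GIT is an assumed override, e.g. `echo` for a dry run.

# Begin bisecting between a known-bad and a known-good build.
start_bibisect() {
    bad=$1    # ref of a build that shows the single-page result
    good=$2   # ref of a build that still produces several pages
    "${GIT:-git}" bisect start "$bad" "$good"
}

# Record the verdict for the build that is currently checked out.
mark_build() {
    verdict=$1   # "good" or "bad"
    "${GIT:-git}" bisect "$verdict"
}

# Save the session log so it can be attached to the bug report.
save_bibisect_log() {
    "${GIT:-git}" bisect log > "$1"
}
```

After each `mark_build` call, git checks out the next build to test; the PDF is converted with that build, the result is inspected in MS Word, and the verdict is recorded until git reports the first bad commit.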
Comment 6 Kevin Suo 2023-11-22 04:03:28 UTC
(In reply to Stéphane Guillou (stragu) from comment #5)
Stéphane: There may be more than one issue here.

Would you please clarify:
Does the source commit fc2fb95fdb4262792e94afe61b784c8ae71d171e you identified cause the .docx file content to be on one page (when you open it in MSO or some other office software), or does it cause the slow loading time when you open the docx in LibreOffice Writer?
Also, would you please attach your bibisect log?

I tried with a bibisect version of:
    2021-11-25 20:07:24 source-hash-bd0fb2d95
    
    bump product version to 7.4.0.0.alpha0+
It does not cause the content to be on one page (when opened with MSO), but when the generated docx is opened in LibreOffice Writer, loading is already slow and it is not possible to work on that docx in Writer.
Comment 7 Kevin Suo 2023-11-22 08:58:54 UTC
I am reverting that commit:
https://gerrit.libreoffice.org/c/core/+/159811
Comment 8 Stéphane Guillou (stragu) 2023-11-22 12:46:16 UTC
(In reply to Kevin Suo from comment #6)
> (In reply to Stéphane Guillou (stragu) from comment #5)
> Would you please clarify:
> Does the source commit fc2fb95fdb4262792e94afe61b784c8ae71d171e you identified
> cause the .docx file content to be on one page (when you open it in MSO or
> some other office software), or does it cause the slow loading time when you
> open the docx in LibreOffice Writer?
I tested opening the resulting DOCX with online MS Office 365. While neither ever finishes loading, these are the differences:
* Before commit: a few seconds to show canvas, objects like images actually rendered, more than one page (at least 5). 794 kb.
* Since commit: more than 30 seconds to show canvas; all elements overlapped on one single page, loads forever. Slightly bigger file. (801 kb)

Definitely already problematic before your commit, but I understood we were focusing on the 1-page issue here.

On the other hand, converting to DOC makes the regression more obvious as there is no loading problem:
* Before commit: correct number of pages (20)
* After commit: all content on single page
Comment 9 Commit Notification 2023-11-28 02:09:43 UTC
Kevin Suo committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/5589659829f8a1cef8ca1c8a468732105bbe231b

tdf#157589 tdf#153969: Revert "sdext.pdfimport Writer: Do not visit...

It will be available in 24.2.0.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 10 Commit Notification 2023-11-28 08:47:21 UTC
Kevin Suo committed a patch related to this issue.
It has been pushed to "libreoffice-7-6":

https://git.libreoffice.org/core/commit/f52d8f004f7d70f89ee805c6f71f1791cac70c0f

tdf#157589 tdf#153969: Revert "sdext.pdfimport Writer: Do not visit...

It will be available in 7.6.4.
Comment 11 Stéphane Guillou (stragu) 2023-12-01 12:59:27 UTC
Verified by opening on MS office.com a DOC file exported with:

Version: 24.2.0.0.alpha1+ (X86_64) / LibreOffice Community
Build ID: 619500d6919c227e734b119481a4b334972e0b7b
CPU threads: 8; OS: Linux 5.15; UI render: default; VCL: gtk3
Locale: en-AU (en_AU.UTF-8); UI: en-US
Calc: threaded

Thank you!