Bug 88914 - PDF Import deadlock for an advertising presentation PDF with complex fill patterns
Status: RESOLVED WORKSFORME
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: filters and storage
Version (earliest affected): 4.2.7.2 release
Hardware: Other All
Importance: medium normal
Assignee: Not Assigned
URL:
Whiteboard:
Keywords: filter:pdf, perf
Depends on:
Blocks: PDF-Import-Draw
Reported: 2015-01-29 21:12 UTC by Philip
Modified: 2024-07-18 23:04 UTC (History)

See Also:
Crash report or crash signature:


Attachments
sample_document (1.37 MB, application/pdf)
2015-01-29 21:12 UTC, Philip
MS Stacktrace of mini-dump prior to abort (8.62 KB, text/plain)
2015-01-30 05:45 UTC, V Stuart Foote
pg14, pg22 extracted from problem PDF and then opened AND inserted to Draw ODG (1.69 MB, application/x-zip-compressed)
2024-07-02 10:51 UTC, V Stuart Foote

Description Philip 2015-01-29 21:12:11 UTC
Created attachment 112930 [details]
sample_document

Hi,

Calling 

libreoffice --headless --convert-to odt msa_bug.pdf

results in a deadlock. CPU usage jumps to 100% and the software no longer responds.

Best Regards
Philip
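
A conversion that never returns can be bounded with a timeout so that a batch job does not stall on it. The sketch below wraps an arbitrary command in Python's `subprocess.run` with its `timeout` argument. The helper name `run_with_timeout` and the 300-second limit in the usage comment are illustrative assumptions, not anything LibreOffice provides.

```python
import subprocess
import time


def run_with_timeout(cmd, seconds):
    """Run cmd and return (finished, elapsed_seconds).

    finished is False when the process was still running at the
    timeout; subprocess.run kills the child in that case.
    """
    start = time.monotonic()
    try:
        subprocess.run(cmd, timeout=seconds, check=False)
        finished = True
    except subprocess.TimeoutExpired:
        finished = False
    return finished, time.monotonic() - start


# Usage with the command from the report (300 s is an arbitrary limit):
#   cmd = ["libreoffice", "--headless", "--convert-to", "odt", "msa_bug.pdf"]
#   finished, elapsed = run_with_timeout(cmd, 300)
```

The launcher may start a separate worker process on some platforms, and this simple sketch does not attempt to clean that up.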
Comment 1 V Stuart Foote 2015-01-30 05:45:32 UTC
Created attachment 112938 [details]
MS Stacktrace of mini-dump prior to abort

Rather than running headless, I attempted to import the PDF into Draw.

Windows 7 sp1, 64-bit en-US
Version: 4.4.0.3
Build ID: de093506bcdc5fafd9023ee680b8c60e3e0645d7
Locale: en_US

The i7 920 CPU holds at ~13%, consuming 186,956K RAM, ~430 file handles, ~14 threads, ~87 user objects and ~161 GDI objects. I/O read is 104,518,857 bytes; I/O write grows to ~61,570,000 bytes in 45 minutes. Captured a mini-dump and aborted.

svtlo!GraphicManager::ImplCheckSizeOfSwappedInGraphics+b0 [c:\cygwin64\home\buildslave\source\libo-core\svtools\source\graphic\grfmgr2.cxx @ 223]

Attaching the Stacktrace.
Comment 2 V Stuart Foote 2015-01-30 05:51:11 UTC Comment hidden (obsolete)
Comment 3 Philip 2015-01-30 09:12:20 UTC
Hi Stuart,

I've seen the issue on the following versions:

LibreOffice 4.3.5.2 430m0(Build:2)
LibreOffice 4.2.7.2 420m0(Build:2)

Best Regards
Philip
Comment 4 vvort 2015-02-01 06:08:34 UTC
There are too many small images on page #14.
I have not investigated it in detail yet.
Comment 5 QA Administrators 2016-02-21 08:37:44 UTC Comment hidden (obsolete)
Comment 6 QA Administrators 2017-03-06 15:59:46 UTC Comment hidden (obsolete)
Comment 7 Timur 2020-05-20 12:32:23 UTC Comment hidden (obsolete)
Comment 8 Timur 2022-02-28 15:40:30 UTC
Reproduced in 7.4+. Very slow to open: 2:54 for me.
Comment 9 QA Administrators 2024-02-29 03:16:24 UTC Comment hidden (obsolete)
Comment 10 Dave Gilbert 2024-07-01 00:35:09 UTC
A little slow for me on a modern machine with 24.2.4.2-2, *as long as I have the Navigator closed*; with it open it's much, much slower.

The command-line conversion is no longer awful:

dg@dalek:~/bugs/libreoffice-88914-pdfhang$ time libreoffice --headless --convert-to odg msa_bug.pdf 
convert /home/dg/bugs/libreoffice-88914-pdfhang/msa_bug.pdf as a Draw document -> /home/dg/bugs/libreoffice-88914-pdfhang/msa_bug.odg using filter : draw8

real	0m55.382s
user	0m53.156s
sys	0m2.213s
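
These timings can be read as a rough CPU-utilization figure: user plus sys time divided by real time. A value close to 1 suggests the conversion is busy on roughly one core for the whole run rather than waiting on I/O or a lock. That interpretation is the usual reading of `time` output and is an assumption here, not something measured separately; the function name is hypothetical.

```python
def cpu_utilization(real, user, sys_time):
    """Fraction of wall-clock time spent on the CPU: (user + sys) / real."""
    return (user + sys_time) / real


# Figures copied from the `time` output above, in seconds.
ratio = cpu_utilization(real=55.382, user=53.156, sys_time=2.213)
print(f"{ratio:.4f}")  # 0.9998
```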

As well as page 14 mentioned in comment 4, page 22 also has a large number of them.
(Curiously, command-line image extraction doesn't show them, so they must be getting created by a fill or something similar, although it doesn't look like a tiling fill.)
Comment 11 V Stuart Foote 2024-07-01 12:11:19 UTC
Also, no issues opening via the filter into Draw with a 24.2.4.2 build on Win10.

=> WFM


Version: 24.2.4.2 (X86_64) / LibreOffice Community
Build ID: 51a6219feb6075d9a4c46691dcfe0cd9c4fff3c2
CPU threads: 8; OS: Windows 10.0 Build 19045; UI render: default; VCL: win
Locale: en-US (en_US); UI: en-US
Calc: CL threaded
Comment 12 Dave Gilbert 2024-07-02 00:08:55 UTC
Whatever encoded this PDF is just depressing; while some of the problems we have come from hyper-clever fill patterns in PDFs, this one is just silly.
Page 14 has 3025 copies of each of 6 different stipple patterns, all individually embedded in the PDF rather than referencing a single instance or using a tiled fill.
Page 22 has 1715 copies of each of the stipple patterns; depressingly, they seem to be used to make a stippled white-on-white background, so they are triply pointless.
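
To put numbers on this: 3025 copies of 6 patterns is 18,150 embedded images on page 14 and, assuming the same 6 patterns, 1715 × 6 = 10,290 on page 22. Below is a minimal sketch of how such duplication could be measured once the image payloads have been extracted by some other tool. The synthetic byte strings stand in for real image data, and `duplicate_stats` is a hypothetical helper.

```python
import hashlib
from collections import Counter


def duplicate_stats(payloads):
    """Return (total, unique) for an iterable of image byte strings."""
    digests = Counter(hashlib.sha256(p).hexdigest() for p in payloads)
    return sum(digests.values()), len(digests)


# Synthetic stand-in for page 14: 6 distinct patterns, 3025 copies each.
patterns = [bytes([i]) * 64 for i in range(6)]
page14 = [p for p in patterns for _ in range(3025)]
total, unique = duplicate_stats(page14)
print(total, unique)  # 18150 6
```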
Comment 13 V Stuart Foote 2024-07-02 10:51:50 UTC
Created attachment 195082 [details]
pg14, pg22 extracted from problem PDF and then opened AND inserted to Draw ODG

The PDF was generated 2012-11-06 with the Ghostscript-based "PDF Creator 1.2.3".

When page 14 and page 22 are extracted with PDFtk, the individual PDF pages are both slow to "Open" into the Draw canvas, but they do load with the poppler/cairo-based filter.

Additionally, if the individual PDF pages are "Inserted" into a document page, and so use the pdfium-based filter path, they open reasonably fast with good fidelity to the original layout.

As they are inserted as bitmaps, the resolution of the filter action can be adjusted by setting the environment variable 'PDFIMPORT_RESOLUTION_DPI'; otherwise it defaults to a value appropriate to the display device (so ~96-120 dpi for non-HiDPI).
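
A minimal sketch of setting that variable for a single launch rather than system-wide. The helper name `pdf_import_env` and the value 300 are illustrative assumptions; only the variable name comes from this comment.

```python
import os


def pdf_import_env(dpi, base_env=None):
    """Return a copy of the environment with PDFIMPORT_RESOLUTION_DPI set."""
    env = dict(os.environ if base_env is None else base_env)
    env["PDFIMPORT_RESOLUTION_DPI"] = str(int(dpi))
    return env


env = pdf_import_env(300)  # 300 is an arbitrary example value
print(env["PDFIMPORT_RESOLUTION_DPI"])  # 300

# To launch Draw with this setting (assumes `soffice` is on PATH):
#   import subprocess
#   subprocess.run(["soffice", "--draw"], env=env)
```

Building a copy keeps the parent process environment unchanged, so other imports in the same session still get the display-based default.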

And of course, performing a "break" of an inserted image will have performance and fidelity issues.
Comment 14 V Stuart Foote 2024-07-02 10:59:35 UTC
@Miklos, Tomaž -- anything further to be said or done about the pdfio filter handling for this and similar PDFs? The pdfium-based filter does a good job with it, while poppler/cairo chokes just a bit.

Any movement on convenience bug 114234 to not have to split out PDF pages?
Comment 15 Dave Gilbert 2024-07-18 17:29:36 UTC
I think it might be possible to combine the duplicated images during the import; on the poppler import path it looks fairly easy to me (everything goes through tree/imagecontainer.cxx, which is a std::vector; I'm thinking of trying to turn it into a hash of some type).
However, it does mean we have to figure out how to represent that shared image.
(I'm about to post a question to the list about that).
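
The idea of replacing a plain list with a hash-keyed container can be sketched as follows. This is a hypothetical Python analogue, not the actual C++ code in tree/imagecontainer.cxx; the class and method names are invented for illustration.

```python
import hashlib


class DedupImageContainer:
    """Stores each distinct image payload once and hands out stable IDs."""

    def __init__(self):
        self._images = []      # ID -> payload, kept as a plain list
        self._by_digest = {}   # SHA-256 digest -> ID of first occurrence

    def add_image(self, payload: bytes) -> int:
        """Return the ID for payload, reusing the ID of an identical image."""
        digest = hashlib.sha256(payload).digest()
        image_id = self._by_digest.get(digest)
        if image_id is None:
            image_id = len(self._images)
            self._images.append(payload)
            self._by_digest[digest] = image_id
        return image_id

    def get_image(self, image_id: int) -> bytes:
        return self._images[image_id]

    def __len__(self):
        return len(self._images)


container = DedupImageContainer()
ids = [container.add_image(p) for p in (b"stipple-a", b"stipple-b", b"stipple-a")]
print(ids, len(container))  # [0, 1, 0] 2
```

Callers that previously received a fresh ID per image would now see repeated IDs, which is the "how to represent that shared image" question raised above.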

But there is a second problem: if the Navigator is open, the current code will still apparently hang. The Navigator really doesn't handle huge flat documents well.
Comment 16 Dave Gilbert 2024-07-18 23:04:50 UTC
(In reply to Dave Gilbert from comment #15)
> I think it might be possible to combine the duplicated images during the
> import; on the poppler import path it looks fairly easy to me (everything
> goes through tree/imagecontainer.cxx which is a std:vector - I'm thinking of
> trying to turn it into a Hash of some type.
> However, it does mean we have to figure out how to represent that shared
> image.
> (I'm about to post a question to the list about that).

Oh, Regina explained that's actually only in the flat format - there's already dedupe going on in the un-flat versions.

> But there is a 2nd problem; if the Navigator is open the current code will
> still apparently hang - the Navigator really doesn't handle huge flat
> documents well.

Actually, for this one, the Navigator is kind of surviving OK.
So yeah, this one seems OK now.