Bug 106057 - General input/output error loading pdf file (because of multiple trailers which is valid per PDF specification)
Status: RESOLVED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: filters and storage
Version (earliest affected): Inherited From OOo
Hardware: All All
Importance: medium normal
Assignee: Mike Kaganski
URL:
Whiteboard: target:24.2.0 target:7.6.3
Keywords: filter:pdf, implementationError
Duplicates: 137648
Depends on:
Blocks: PDF-Import-Draw File-Opening
 
Reported: 2017-02-17 05:33 UTC by Jim Avera
Modified: 2023-11-07 01:17 UTC (History)
CC List: 8 users

See Also:
Crash report or crash signature:


Attachments
document (139.81 KB, application/pdf)
2017-02-17 20:19 UTC, Xisco Faulí

Description Jim Avera 2017-02-17 05:33:23 UTC
Description:
Attempting to load the indicated pdf results in a pop-up saying

    General Error.
    General input/output error


Steps to Reproduce:
1. wget -Otest.pdf http://www.firsttuesday.us/course/Downloads/315.pdf
2. libreoffice test.pdf

Actual Results:  
General input-output error

Expected Results:
A Draw document should be opened containing content from the PDF.


Reproducible: Always

User Profile Reset: No

Additional Info:


User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:51.0) Gecko/20100101 Firefox/51.0
Comment 1 Xisco Faulí 2017-02-17 20:18:21 UTC
Confirmed in

- Version: 5.4.0.0.alpha0+
Build ID: 880033edde516fc30225005245253293a6a58ba4
CPU Threads: 4; OS Version: Linux 4.8; UI Render: default; VCL: gtk3; 
Locale: ca-ES (ca_ES.UTF-8); Calc: group

- Version: 4.4.0.0.alpha0+
Build ID: a5e137eb1d37361c60175e8fba780fc46b377a23


- LibreOffice 3.3.0 
OOO330m19 (Build:6)
tag libreoffice-3.3.0.4
Comment 2 Xisco Faulí 2017-02-17 20:19:53 UTC
Created attachment 131304
document
Comment 3 QA Administrators 2018-03-01 03:41:19 UTC Comment hidden (obsolete)
Comment 4 Roman Kuznetsov 2019-03-16 20:44:32 UTC
still repro in

Version: 6.3.0.0.alpha0+ (x64)
Build ID: 13a260f59e421f3e67845f8f2eb22b8f0f8fcaf0
CPU threads: 4; OS: Windows 10.0; UI render: GL; VCL: win; 
TinderBox: Win-x86_64@42, Branch:master, Time: 2019-03-11_02:46:09
Locale: ru-RU (ru_RU); UI-Language: en-US
Calc: threaded
Comment 5 Hans Deragon 2020-01-10 20:30:11 UTC
Still reproducible with:

Version: 6.3.4.2
Build ID: 1:6.3.4-0ubuntu0.19.10.1
CPU threads: 4; OS: Linux 5.3; UI render: default; VCL: gtk3; 
Locale: en-CA (en_CA.UTF-8); UI-Language: en-US
Comment 6 sfbarbee@gmail.com 2020-07-14 16:15:07 UTC Comment hidden (obsolete)
Comment 7 esnible 2021-09-06 20:36:18 UTC
I was able to open the test.pdf file originally reported (using LibreOffice 7.2.0.4 on Mac).  There is an error message "This document has an invalid signature." and Show Signatures shows four signatures that can't be found.

I am here because I got the same message with one of my own PDF files.  I don't want to upload it here because it is 44MB and I need to redact it before sharing.  I am less concerned that LibreOffice can't open it than I am that the message is "General Error.  General input/output error." and I cannot figure out how to get more details.  Nothing is written to stderr/stdout.  There is nothing in the system log.
Comment 8 Kevin Suo 2021-11-04 08:19:25 UTC
The FILEOPEN Input/Output error is still reproducible on master.

Terminal output:
warn:sdext.pdfimport.pdfparse:61348:61348:sdext/source/pdfimport/pdfparse/pdfparse.cxx:681: error got 2 stack objects in parse
warn:sdext.pdfimport.pdfparse:61348:61348:sdext/source/pdfimport/pdfparse/pdfparse.cxx:684: N8pdfparse7PDFFileE
warn:sdext.pdfimport.pdfparse:61348:61348:sdext/source/pdfimport/pdfparse/pdfparse.cxx:689: (type N8pdfparse7PDFFileE)
warn:sdext.pdfimport.pdfparse:61348:61348:sdext/source/pdfimport/pdfparse/pdfparse.cxx:684: N8pdfparse10PDFTrailerE
warn:sdext.pdfimport.pdfparse:61348:61348:sdext/source/pdfimport/pdfparse/pdfparse.cxx:689: (type N8pdfparse10PDFTrailerE)
warn:sdext.pdfimport.pdfparse:61348:61348:sdext/source/pdfimport/pdfparse/pdfparse.cxx:681: error got 2 stack objects in parse
warn:sdext.pdfimport.pdfparse:61348:61348:sdext/source/pdfimport/pdfparse/pdfparse.cxx:684: N8pdfparse7PDFFileE
warn:sdext.pdfimport.pdfparse:61348:61348:sdext/source/pdfimport/pdfparse/pdfparse.cxx:689: (type N8pdfparse7PDFFileE)
warn:sdext.pdfimport.pdfparse:61348:61348:sdext/source/pdfimport/pdfparse/pdfparse.cxx:684: N8pdfparse10PDFTrailerE
warn:sdext.pdfimport.pdfparse:61348:61348:sdext/source/pdfimport/pdfparse/pdfparse.cxx:689: (type N8pdfparse10PDFTrailerE)

./instdir/program/xpdfimport has generated valid output.
Comment 9 Kevin Suo 2021-11-04 08:20:44 UTC
(In reply to sfbarbee@gmail.com from comment #6)
The export issue should be reported in a separate bug report, and you should provide a test document for the export.
Comment 10 himajin100000 2021-11-04 23:49:54 UTC
Can this information be of any use?
Unlike my comments on other bug reports,
I don't have even a bit of a clue here.

https://opengrok.libreoffice.org/xref/core/sdext/source/pdfimport/pdfparse/pdfparse.cxx?r=776a1b9b#575
==================
parseinfo: stop = 
xref

0 229

(snip)

ýýýý (buff=%PDF-1.6
%âãÏÓ

226 0 obj

(snip)

, offset = 138357), hit = true, full = false, length = 124498
Comment 11 himajin100000 2021-11-05 00:26:06 UTC
Can I have two trailers in a PDF?
Comment 12 Kevin Suo 2021-11-05 01:49:54 UTC
There can be many xref sections in a PDF file.

The question is, why is the sdext.pdfimport.pdfparse code called at all? We use poppler to parse the PDF, via the xpdfimport executable, which then generates the tokens used to assemble a Flat ODF for rendering. Poppler may well parse this PDF correctly. Why do we parse the PDF structure on our own?
Comment 13 Kevin Suo 2021-11-05 03:12:00 UTC Comment hidden (obsolete)
Comment 14 Kevin Suo 2021-11-05 03:16:06 UTC
sdext/source/pdfimport/pdfiadaptor.cxx PDFIRawAdaptor::importer (in line 291)
calls
sdext/source/pdfimport/pdfiadaptor.cxx PDFIRawAdaptor::parse (in line 217)

which calls (in line 231):
sdext/source/pdfimport/wrapper/wrapper.cxx xpdf_ImportFromStream (in line 1182)

xpdf_ImportFromStream copies the PDF content to a temp file, because the caller has passed in a file stream and thus xInput.is() is true. I don't think it is necessary to make such a temp file: why not pass the URL of the original PDF file and use xpdf_ImportFromFile directly? Anyway, this is a separate issue.

xpdf_ImportFromStream then calls:
sdext/source/pdfimport/wrapper/wrapper.cxx xpdf_ImportFromFile (in line 998)
which uses the temp file as the data source.

xpdf_ImportFromFile then calls (in line 1020):
sdext/source/pdfimport/wrapper/wrapper.cxx checkEncryption (in line 891)
(Poppler has encryption-checking functionality, so why do we use our own encryption check here? I think it is because we need to show a dialog asking for the password if the file is encrypted. But what if we ask poppler to check for encryption and, if poppler reports that the file is encrypted, provide the password through stdin?)

checkEncryption then calls (in line 901):
sdext/source/pdfimport/pdfparse/pdfparse.cxx pdfparse::PDFReader::read (there are two such functions, one for win32 and another for the "else" case. I am confused by the #ifdef _WIN32 stuff, but for me on Linux it is in line 608. Interestingly, there is another #ifdef _WIN32 in this block, and my program jumps to line 637 directly)

Take note of the aGrammar:
PDFGrammar< file_iterator<> > aGrammar( file_start );

pdfparse::PDFReader::read then calls boost::spirit::classic::parse, which took several seconds (maybe a performance issue here?). There is no exception at this point yet:

            boost::spirit::classic::parse( file_start,
                                  file_end,
                                  aGrammar,
                                  boost::spirit::classic::space_p );

Then, finally, in line 672 we get the nEntries:

    unsigned int nEntries = aGrammar.m_aObjectStack.size();

Its value is 2 for this PDF. As a result, pRet is not set in the block at line 679, and thus xpdf_ImportFromFile returns false.

I am not familiar with the boost::spirit::classic::parse stuff, so I am not sure why aGrammar.m_aObjectStack.size() is 2.
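
For illustration only: the check described in this comment can be modelled outside LibreOffice. The following is a minimal sketch, assuming the reader hands back a result only when exactly one entry is left on the stack. Entry, Stack, strictResult and lenientResult are hypothetical names, not the ones in pdfparse.cxx, and the lenient policy is just one conceivable way to tolerate leftovers, not a description of any actual patch.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for a parsed PDF entry; not the LibreOffice class.
struct Entry
{
    std::string kind;  // e.g. "PDFFile" or "PDFTrailer"
};

using Stack = std::vector<std::shared_ptr<Entry>>;

// Strict policy: a result exists only if exactly one entry is left
// and that entry is the file container.
std::shared_ptr<Entry> strictResult(const Stack& stack)
{
    if (stack.size() == 1 && stack.front()->kind == "PDFFile")
        return stack.front();
    return nullptr;  // two or more leftovers: the caller sees a failure
}

// Lenient policy (an assumption for illustration): use the bottom entry
// whenever it is the file container, ignoring anything left above it.
std::shared_ptr<Entry> lenientResult(const Stack& stack)
{
    if (!stack.empty() && stack.front()->kind == "PDFFile")
        return stack.front();
    return nullptr;
}
```

With a stack holding a file entry and one leftover trailer entry, strictResult returns a null pointer while lenientResult still returns the file entry.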
Comment 15 Kevin Suo 2021-11-05 03:53:40 UTC
Below is the portion related to trailer in this pdf:

<contents above omitted>
endstream
endobj
xref
0 4644
0000000004 65535 f
0000056752 00000 n
<omitted multiple xref entries>
0000004642 65535 f
trailer
<</Size 4644/Root 1 0 R>>
xref
0 0
trailer
<</Size 4644/Prev 4950910/XRefStm 55777/Root 1 0 R/Info 373 0 R/ID[<23394E591A08E64B8237236C314F97F2><64F67A6686A94D0B8F325336D36A35E8>]>>
startxref
5043836
%%EOF

So yes, there are two trailers in this PDF. The first xref section and the first trailer were presumably written before the file was "incrementally updated". The second (i.e. the last) trailer is the one that should be used for PDF parsing.
I note that there is a problem here: the first trailer is not terminated by its own end-of-file ( %%EOF ) marker; see the citation below.

-----------

Citing the Adobe PDF Reference (third edition):

3.4.5 Incremental Updates

In an incremental update, any new or changed objects are appended to the file, a cross-reference section is added, and a new trailer is inserted. ...

The cross-reference section added when a file is updated contains entries only for objects that have been changed, replaced, or deleted, plus the entry for object 0. Deleted objects are left unchanged in the file, but are marked as deleted via their cross-reference entries. The added trailer contains all the entries (perhaps modified) from the previous trailer, as well as a Prev entry giving the location of the previous cross-reference section (see Table 3.12 on page 68). As shown in Figure 3.3, a file that has been updated several times contains several trailers; note that each trailer is terminated by its own end-of-file ( %%EOF ) marker.
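
The layout described in the citation can be checked crudely by counting keywords. This is a rough sketch, assuming a plain byte scan is good enough for a first look; countOccurrences and trailersAreTerminated are hypothetical helpers, and a real parser would have to skip stream data, where these byte sequences may also occur.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: count non-overlapping occurrences of needle in data.
std::size_t countOccurrences(const std::string& data, const std::string& needle)
{
    assert(!needle.empty());  // an empty needle would never advance the search
    std::size_t count = 0;
    for (std::size_t pos = data.find(needle); pos != std::string::npos;
         pos = data.find(needle, pos + needle.size()))
    {
        ++count;
    }
    return count;
}

// Hypothetical check: true when the number of "trailer" keywords equals
// the number of "%%EOF" markers in the scanned bytes.
bool trailersAreTerminated(const std::string& pdfBytes)
{
    return countOccurrences(pdfBytes, "trailer") == countOccurrences(pdfBytes, "%%EOF");
}
```

For a tail shaped like the one quoted in this comment (two trailer keywords, one %%EOF), trailersAreTerminated returns false.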
Comment 16 Kevin Suo 2021-11-05 04:04:24 UTC
The cause may be that the PDF file has two trailers while the first trailer is not terminated by a %%EOF. Thus endTrailer is called only once in "PDFGrammar", which means one entry is never removed from m_aObjectStack with pop_back(), which finally results in 2 entries in m_aObjectStack.
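
The hypothesis above can be reproduced with a toy model. This is only a sketch, under the assumption that each trailer keyword pushes one entry and each %%EOF pops one; leftoverStack is a hypothetical function and does not reflect the actual PDFGrammar code.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Toy model, not the LibreOffice grammar: "trailer" pushes an entry and
// "%%EOF" pops one, mirroring the begin/end pairing described above.
std::vector<std::string> leftoverStack(const std::string& pdfText)
{
    std::vector<std::string> stack{ "PDFFile" };  // the file container stays at the bottom
    std::istringstream tokens(pdfText);
    std::string token;
    while (tokens >> token)  // whitespace-separated tokens
    {
        if (token == "trailer")
            stack.push_back("PDFTrailer");
        else if (token == "%%EOF" && stack.size() > 1)
            stack.pop_back();  // closes the most recent trailer only
    }
    return stack;
}
```

Fed a token stream with two trailer keywords and a single %%EOF, the model ends with the file entry plus one trailer entry, which matches the two types printed in the terminal output in comment 8.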
Comment 17 Kevin Suo 2021-11-22 11:22:52 UTC Comment hidden (obsolete)
Comment 18 Buovjaga 2021-11-22 11:55:46 UTC
(In reply to Kevin Suo from comment #17)
> back to new as the patch was abandoned due to license issue.

Comment for the curious: https://gerrit.libreoffice.org/c/core/+/124909/comment/8036ec31_600312f1/
Comment 19 Mike Kaganski 2023-10-31 17:38:45 UTC
*** Bug 137648 has been marked as a duplicate of this bug. ***
Comment 20 Mike Kaganski 2023-10-31 17:59:41 UTC
https://gerrit.libreoffice.org/c/core/+/158737
Comment 21 Commit Notification 2023-10-31 20:12:07 UTC
Mike Kaganski committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/ba26d5f5e0529d7accf6f268559b8d659ba7c6c2

tdf#106057: Don't fail PDFReader::read, when several entries in stack

It will be available in 24.2.0.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 22 Commit Notification 2023-11-01 23:54:17 UTC
Mike Kaganski committed a patch related to this issue.
It has been pushed to "libreoffice-7-6":

https://git.libreoffice.org/core/commit/1f6eb154d859f28f9523961e7b3901603d69d445

tdf#106057: Don't fail PDFReader::read, when several entries in stack

It will be available in 7.6.3.

Comment 23 Jim Avera 2023-11-07 01:17:13 UTC
I confirm it is fixed in master.  Thanks for fixing this!!

Version: 24.2.0.0.alpha0+ (X86_64) / LibreOffice Community
Build ID: d7a5e7643f3540b1490c1e2f1a91ff86c721d7b6
CPU threads: 12; OS: Linux 6.2; UI render: default; VCL: gtk3
Locale: en-US (en_US.UTF-8); UI: en-US
Calc: threaded