Bug 66580 - exported PDF is invalid because of forbidden custom keys in the trailer
Summary: exported PDF is invalid because of forbidden custom keys in the trailer
Status: NEW
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Printing and PDF export (show other bugs)
Version:
(earliest affected)
Inherited From OOo
Hardware: All All
: medium normal
Assignee: Not Assigned
URL:
Whiteboard: target:7.6.0
Keywords: filter:pdf
: 142051 (view as bug list)
Depends on:
Blocks: PDF-Export-Invalid
  Show dependency treegraph
 
Reported: 2013-07-04 09:39 UTC by Jos van den Oever
Modified: 2025-03-29 05:20 UTC (History)
14 users (show)

See Also:
Crash report or crash signature:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Jos van den Oever 2013-07-04 09:39:34 UTC
PDF exported by LibreOffice contains the key /DocChecksum in the trailer dictionary. When an ODF document is embedded, it also contains the key /AdditionalStreams.

These keys are not defined in the PDF 1.4 specification. The specification forbids use of custom keys in the trailer:

===
A PDF producer or Acrobat plug-in extension may also add keys to any PDF
object that is implemented as a dictionary, except the file trailer dictionary (see Section 3.4.4, “File Trailer”). In addition, a PDF producer or Acrobat plug-in may create tags that indicate the role of marked-content operators (PDF 1.2), as described in Section 9.5, “Marked Content.”
===

A strict PDF validator would declare PDF documents saved by LibreOffice invalid.
Comment 1 Cor Nouws 2013-07-04 09:55:09 UTC
Hi Jos,
Thanks for the report.
I set this to New, trusting your expertise in this ;)
Is this an issue new with the 410beta2 or already in older versions too?
Best,
Cor
Comment 2 Jos van den Oever 2013-07-04 10:06:15 UTC
The /DocChecksum and /AdditionalStreams were added to OpenOffice on 2007-03-26.

In the LibreOffice git repository this is commit d217c079d7b3ca7b5039428594e7cdfdf9a0c4a9
Comment 3 Cor Nouws 2013-07-04 10:34:02 UTC
thanks! I change the version conform your info.
Comment 4 kurt.pfeifle 2014-05-04 17:14:39 UTC
What will happen on this?

Do you need suggestions about how to implement these features in a spec conforming way?
Comment 5 kurt.pfeifle 2014-05-04 17:18:29 UTC
To give you a link to the relevant PDF-1.4 specification:

     http://acroeng.adobe.com/PDFReference/PDF_1.4/PDF%20Reference%201.4.pdf

The quote given by Jos in his bug report is from named page 723 (as printed on page), page 743 (as counted from first), in appendix E, "PDF Name Registry". 

Here is a recently created website holding *all* PDF specifications ever published by Adobe:

     http://acroeng.adobe.com/wp/?page_id=321
Comment 6 kurt.pfeifle 2014-05-09 21:12:44 UTC
According to these test results:

    https://docs.google.com/spreadsheets/d/1Ok37dvlRSpzKpdKJ6gYycM5QzM7sv4_YCybHbiFMVFI

none of 36 different PDF viewers or applications did have a problem to display or process the tested hybrid PDF created by LibreOffice.

Hence I took the liberty to set importance of this bug to much lower for now. I won't protest if someone even closed it as WONTFIX unless there appears other evidence of real life problems...
Comment 7 QA Administrators 2015-06-08 14:41:36 UTC Comment hidden (obsolete)
Comment 8 QA Administrators 2016-09-20 10:00:33 UTC Comment hidden (obsolete)
Comment 9 Jos van den Oever 2016-09-20 10:59:11 UTC
PDF documents created with version 5.1.2.2.0 on Linux 4.4 still add the key /DocChecksum and /AdditionalStreams to the PDF files.
Comment 10 kurt.pfeifle 2017-10-22 20:01:35 UTC
(In reply to Jos van den Oever from comment #9)
> PDF documents created with version 5.1.2.2.0 on Linux 4.4 still add the key
> /DocChecksum and /AdditionalStreams to the PDF files.

Jos, the additional (proprietary) keys used by OpenOffice/LibreOffice to embed
the original OpenDocument file into the Hybrid PDF are not doing any real
h a r m:

  * As I showed in comment #6 none of the 36 tested PDF viewers has any problem
    opening and displaying a Hybrid PDF!

There are other reasons which would may  M E  want to modify the way LO creates
a Hybrid PDF:

  * N O N E  of the other PDF readers do have a way to detect that there is an
    embedded OpenDocument file in the PDF!

The reason for this is that the way OO/LO implemented this feature was that they did it in a non-standard, "proprietary" way -- while they could have utilized the standards-defined "embed another file into the PDF"-feature. (See for example bug95328 and comments).

And there  A R E  good use cases to be able to detect the embeddedness of the
original OpenDocument file in a PDF even by non-OO/LO applications:

  - User(s) may not be aware of this when they open the PDF in a PDF reader.
    However, the reader may draw their attention to the fact of the original
    ODT/ODS document being embedded. After all, whoever embedded the original
    document into the Hybrid PDF most likely  W A N T E D  it to be editable.

  - Users may need/want to extract the embedded ODT/ODS file without switching 
    to LibreOffice first (which may not even be installed on their currently
    used computer system).

  - I could easily think of more use cases, why it would be good to be able to
    D E T E C T  the fact of the embedded original and editable file and also
    to  E X T R A C T  it from the PDF via a software other than OO/LO.
Comment 11 QA Administrators 2018-10-23 02:50:06 UTC Comment hidden (obsolete)
Comment 12 Alexis de Lattre 2020-01-02 12:20:36 UTC
It is very very strange that a project such as LibreOffice that promote the OpenDocument standard and interoperability in general doesn't respect the PDF standard ! Adding proprietary keys in the PDF trailer that only LibreOffice can read is certainly a bad practice. Using the "Embedded Files" feature of the PDF standard is clearly the way to go !

It could be the occasion to add support for Embedded Files in the LibreOffice PDF export. Embedded Files in PDF is starting to be a widely used feature with electronic invoicing standards such as Factur-X/ZUGFeRD that use the Embedded Files feature of the PDF standard to add an XML file in a PDF invoice (to allow automatic processing of the invoice), and the possibility to add other document as attachments of the PDF (documents that justify the invoice, for example a signed acceptance form).

For instance, I recently developed a LibreOffice extension to be able to generate Factur-X invoices from LibreOffice Calc (cf https://github.com/akretion/factur-x-libreoffice-extension). This extension contains a Python macro that post-processes the PDF file generated by LibreOffice to add the XML file as attachment to the PDF. The code of this macro would be much simpler if the PDF export feature of LibreOffice had native support for Embedded Files. And generating structured electronic invoices (with Factur-X, ZUGFeRD or other standards) is starting to be compulsory in some countries (for example, it is now compulsory in France when you invoice the public sector).
Comment 13 Julien Nabet 2020-01-02 19:37:54 UTC
Michael/Miklos/Tomaž: I don't know who's PDF expert so thought one of you might have some idea.

The problem here is "AdditionalStreams" keyword doesn't exist in PDF standard.
Taking a look at git history of d217c079d7b3ca7b5039428594e7cdfdf9a0c4a9, it's been added with:
commit d217c079d7b3ca7b5039428594e7cdfdf9a0c4a9
Author: Ivo Hinkelmann <ihi@openoffice.org>
Date:   Mon Mar 26 10:21:15 2007 +0000
    INTEGRATION: CWS ipdf (1.92.80); FILE MERGED
    2007/01/19 16:08:58 pl 1.92.80.8: #137143# ecnrypt add streams
    2007/01/19 11:48:56 pl 1.92.80.7: RESYNC: (1.99-1.102); FILE MERGED
    2006/10/04 18:52:04 pl 1.92.80.6: RESYNC: (1.96-1.99); FILE MERGED
    2006/07/25 09:31:00 pl 1.92.80.5: RESYNC: (1.93-1.96); FILE MERGED
    2006/07/04 16:34:49 pl 1.92.80.4: removed a warning
    2006/07/04 13:48:22 pl 1.92.80.3: RESYNC: (1.92-1.93); FILE MERGED
    2006/06/26 15:00:09 pl 1.92.80.2: #137143# emit document checksum
    2006/06/12 16:53:42 pl 1.92.80.1: #137143# add AddStream interface

Shouldn't it be removed, put in readonly (I mean LO may read this on old files but should replace the keyword when modifying) or at minimum make this deprecated?
Instead there's "EmbeddedFiles" in specs (see https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdf_reference_archive/pdf_reference_1-7.pdf)

LO should respect PDF standard, I put this one to normal importance but it should be even higher than this.
Comment 14 Jean-Baptiste Faure 2020-01-02 20:27:41 UTC
According to comment #13 I changed version to inherited from OOo.

Best regards. JBF
Comment 15 Michael Meeks 2020-01-02 20:28:34 UTC
The hybrid PDF functionality was a great innovation, and the standard didn't cover it then of course. It would be great to find some resources / and/or interested people to implement the new standard using EmbeddedFiles. Alexis - are you interested in some code pointers there ? hacking the core to rename a few attributes and re-structuring the stream is likely to be a good start. I imagine a Collaboran would be happy to mentor someone that wanted to work on this themselves, but we can't resource a fix absent a customer ourselves today.
Comment 16 Julien Nabet 2020-01-02 21:30:53 UTC
Thank you Michael for your very quick feedback! :-)

I took a look at https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/pdf_reference_archives/PDFReference.pdf which 1.4 version (released in 2001 according to https://fr.wikipedia.org/wiki/Portable_Document_Format#Versions). Version 1.4 had already "EmbeddedFiles" keyword (see 3.3 part).
So wondered why adding the non standard "AdditionalStreams" whereas this keyword was existing. 
Or perhaps I wrongly understood this?
Comment 17 kurt.pfeifle 2020-01-02 22:26:04 UTC
(In reply to Michael Meeks from comment #15)
> The hybrid PDF functionality was a great innovation, and the standard didn't
> cover it then of course.

Indeed the hybrid PDF functionality  i s   a great innovation.
However it could even then have been (and still can be implemented) by using
the standard conforming method of embedding the source file.

If done, this would have the advantage that every standard compliant PDF
viewer or PDF processing software could auto-discover the embedded source
file and let the user "do something" with it even in the absence of a
LibreOffice installation on his system.
Comment 18 Julien Nabet 2020-01-03 10:12:55 UTC
Here are some code pointers:
https://opengrok.libreoffice.org/search?project=core&full=AdditionalStreams&defs=&refs=&path=&hist=&type=&si=full

To create the pdf:
emitTrailer() method from vcl/source/gdi/pdfwriter_impl.cxx

The rest seems related to PDF import
Comment 19 V Stuart Foote 2021-05-03 19:21:22 UTC
*** Bug 142051 has been marked as a duplicate of this bug. ***
Comment 20 Commit Notification 2023-01-24 10:50:55 UTC
Tomaž Vajngerl committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/e052f6e1d49a5289411b31561d6e310bf414d896

tdf#66580 write ODF document as an attachment in hybrid mode

It will be available in 7.6.0.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 21 Commit Notification 2023-01-25 14:57:08 UTC
Tomaž Vajngerl committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/8640e24b12c7df3170c9f3e7ff3edced81fd0838

tdf#66580 added hybrid PDF test cases

It will be available in 7.6.0.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 22 Commit Notification 2023-02-01 02:11:39 UTC
Tomaž Vajngerl committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/9740331d8bc56a9b6fbe3e4c1b26fb97f6639cc6

tdf#66580 write more metadata to embedded and attached files

It will be available in 7.6.0.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 23 Stéphane Guillou (stragu) 2023-12-07 10:14:11 UTC
Tomaž mentioned in e052f6e1d49a5289411b31561d6e310bf414d896 that /AdditionalStreams was kept for backward-compatibility - "for now".
What do you think should be done with this ticket? Close as "fixed" even though some of it is "won't fix"? Or are we consider breaking backward compatibility in the future with some warning to users? Or third option: yet another option for PDF export?
Comment 24 peter.wyatt 2024-03-14 06:38:36 UTC
The latest ISO standard for PDF (ISO 32000-2:2020) no longer prohibits custom keys in trailers - in fact, it explicitly permits keys that are 2nd-class names: "The PDF file trailer dictionary may also contain any second-class name as described in Annex E, "Extending PDF"."

Historically the issue was that a lot of SW did not maintain all custom trailer entries when applying incremental updates so software that relied on their presence got broken if such files were later edited. Also, conventional trailer dictionaries don't exist when using (non-hybrid) cross-reference streams and software that converts to conventional PDFs to cross-reference stream-based PDFs also need to remember to copy custom keys across when converting. Basically its high risk putting stuff there...

Depending on what the data is (just DocChecksum??), there are more appropriate and reliable places to save the data in the PDF DOM: Document Catalog PieceInfo, XMP Metadata, use a proper DigSig if the checksum is somehow important for detecting tampering, ...
Comment 25 Dave Gilbert 2025-03-21 02:21:37 UTC
Two thoughts about the use of trailers:
  a) Without a trailer, I don't understand how a hybrid loader is supposed to understand that a PDF embedded file is actually a hybrid PDF rather than just a PDF that happens to have an attached; A PDF could have loads of embedded files and none of them actually represent the same contents as the PDF (e.g. maybe include raw data spreadsheets or something with an explanatory PDF).

  b) caolanm's performance hack  - 'detectHasAdditionalStreams' from ~2023 - sniffs the trailer at the end of the file to make a quick detection about whether it might be a hybrid.  So if the use of the trailer disappears you lose that performance trick.

(Oh, and does anyone understand why we have the /DocChecksum ?  Was that intended to detect if the pdf and hybrid diverge due to an external editor?)
Comment 26 Miklos Vajna 2025-03-21 07:34:17 UTC
> Oh, and does anyone understand why we have the /DocChecksum

My guess is that some PDF readers have this feature to remember e.g. what page were you looking last and this keeps working even if you rename the file. One way to implement this may be to work with the DocChecksum of the file, rather than the file path. But I may be wrong.
Comment 27 Tomaz Vajngerl 2025-03-23 15:04:20 UTC
 (In reply to Dave Gilbert from comment #25)
> Two thoughts about the use of trailers:
>   a) Without a trailer, I don't understand how a hybrid loader is supposed
> to understand that a PDF embedded file is actually a hybrid PDF rather than
> just a PDF that happens to have an attached; A PDF could have loads of
> embedded files and none of them actually represent the same contents as the
> PDF (e.g. maybe include raw data spreadsheets or something with an
> explanatory PDF).

Using a filename convention "Original.od*" and making sure it's a compatible document. Or in PDF 2.0 - the embedded file that is  /AFRelationship is /Source.   

>   b) caolanm's performance hack  - 'detectHasAdditionalStreams' from ~2023 -
> sniffs the trailer at the end of the file to make a quick detection about
> whether it might be a hybrid.  So if the use of the trailer disappears you
> lose that performance trick.

Well, that's unfortunate but reading the xref entry and checking the objects for an embedded file is also fast and can spare you from loading the whole PDF.
Comment 28 Tomaz Vajngerl 2025-03-23 15:10:54 UTC
(In reply to Miklos Vajna from comment #26)
> > Oh, and does anyone understand why we have the /DocChecksum
> 
> My guess is that some PDF readers have this feature to remember e.g. what
> page were you looking last and this keeps working even if you rename the
> file. One way to implement this may be to work with the DocChecksum of the
> file, rather than the file path. But I may be wrong.

/DocChecksum is not part of any PDF specs and only LibreOffice sets that. Why would any PDF reader implement something that is non-standard and only LO sets?.. 

Also /ID which is also in the trailer is part of the standard and can serve this purpose - it's (or should be) a random generated number so "unique" for each document.
Comment 29 Tomaz Vajngerl 2025-03-23 15:14:54 UTC
Also BTW, we don't write those to the trailer anymore, if we export to PDF/A variant or PDF/UA is enabled.
Comment 30 Dave Gilbert 2025-03-23 16:08:54 UTC
(Copying Caolán in due to the hybrid performance hack)
Hi Tomaz,
  Thanks for replying.
For context, one reason I'm asking abotu this stuff is that I just fixed the PDF import filter to handle PDF2 encryption, but not with hybrid yet, and I was looking at how to fix that, and know a fairly easy way to do it - but understanding the details of hybrid seems to make sense first!

>>   a) Without a trailer, I don't understand how a hybrid loader is supposed
>> to understand that a PDF embedded file is actually a hybrid PDF rather than
>> just a PDF that happens to have an attached; A PDF could have loads of
>> embedded files and none of them actually represent the same contents as the
>> PDF (e.g. maybe include raw data spreadsheets or something with an
>> explanatory PDF).

> Using a filename convention "Original.od*" and making sure it's a compatible document. 
> Or in PDF 2.0 - the embedded file that is  /AFRelationship is /Source.   

OK, I don't know the details of PDF 2.0 yet, so I'll take your word for that; but is that something the writer already does?

>>   b) caolanm's performance hack  - 'detectHasAdditionalStreams' from ~2023 -
>> sniffs the trailer at the end of the file to make a quick detection about
>> whether it might be a hybrid.  So if the use of the trailer disappears you
>> lose that performance trick.

>Well, that's unfortunate but reading the xref entry and checking the objects for an 
> embedded file is also fast and can spare you from loading the whole PDF.

Before I break it, I'd like to give Caolán (copied in) a chance to say stuff.
My feeling is that Caolán hack is unreasonably fast - it was driven by this ticket:
https://github.com/CollaboraOnline/online/issues/7307

> Also BTW, we don't write those to the trailer anymore, if we export to PDF/A variant or > PDF/UA is enabled.

Hmm so we don't - we really should have a test for each of the hybrid types we support to make sure we can read it!  We've already got a couple in there, but not any of the new types (which we can't read!)

Also:
  - Do we have a 'spec' of hybrid anywhere?
  - If we change it what's the rules on compatibility for old docs?
  - My guess for DocumentChecksum had been that it was intended to stop the pdf and the original content getting out of sync - but that would want a checksum of the embedded document, not the whole file wouldn't it?
  - If we needed to check the embedded file name and/or the  AFRelationship you mentioned, is that doable on a file prior to decryption?
  - My relatively easy way of fixing the PDF2 encrypted hybrids is to do the same thing I did for PDF2 encrypted none-hybrid, and use Poppler to do the hybrid extraction; then we can lose our extra PDF parser code.
Comment 31 Miklos Vajna 2025-03-24 09:38:23 UTC
(In reply to Tomaz Vajngerl from comment #28)
> > My guess is that some PDF readers have this feature to remember e.g. what
> > page were you looking last and this keeps working even if you rename the
> > file. One way to implement this may be to work with the DocChecksum of the
> > file, rather than the file path. But I may be wrong.
> 
> /DocChecksum is not part of any PDF specs and only LibreOffice sets that.
> Why would any PDF reader implement something that is non-standard and only
> LO sets?.. 

Ah OK, I'll stop guessing. :-) Thanks for pointing that out.
Comment 32 Tomaz Vajngerl 2025-03-25 09:55:08 UTC
(In reply to Dave Gilbert from comment #30)
> (Copying Caolán in due to the hybrid performance hack)
> Hi Tomaz,
>   Thanks for replying.
> For context, one reason I'm asking abotu this stuff is that I just fixed the
> PDF import filter to handle PDF2 encryption, but not with hybrid yet, and I
> was looking at how to fix that, and know a fairly easy way to do it - but
> understanding the details of hybrid seems to make sense first!
> 
> >>   a) Without a trailer, I don't understand how a hybrid loader is supposed
> >> to understand that a PDF embedded file is actually a hybrid PDF rather than
> >> just a PDF that happens to have an attached; A PDF could have loads of
> >> embedded files and none of them actually represent the same contents as the
> >> PDF (e.g. maybe include raw data spreadsheets or something with an
> >> explanatory PDF).
> 
> > Using a filename convention "Original.od*" and making sure it's a compatible document. 
> > Or in PDF 2.0 - the embedded file that is  /AFRelationship is /Source.   
> 
> OK, I don't know the details of PDF 2.0 yet, so I'll take your word for
> that; but is that something the writer already does?

It does in master when the output format is PDF 2.0. 

Also a fairly good type of detection is to check if there is just one embedded file that is ODF. That's what will be the case in case of hybrid files. If there are more leave it up to the user to open the correct source in the PDF viewer - if it indeed is a hybrid file.  

> Before I break it, I'd like to give Caolán (copied in) a chance to say stuff.
> My feeling is that Caolán hack is unreasonably fast - it was driven by this
> ticket:
> https://github.com/CollaboraOnline/online/issues/7307

Well, sure, but parsing the PDF document structure only is not very complex either. The problem is that we do a full PDF parse (I guess) in getAdditionalStream with calling the pdfparse::PDFReader::read method. 

Instead we could've used a modified PDFDocument::Read just for detection that doesn't do any unnecessary things. So to detect an embedded file you can read the xref table, then check the trailer for /Root which should point you to the object with the /Catalog dictionary and from there you need to get the /Names entry and check the /EmbeddedFiles... Those things should eb fast as you just jump around in the file from one location to the other and do minimal reads.

Example of PDF Catalog dictionary with embedded files:
16 0 obj
<<
  /Type/Catalog
  /Pages 6 0 R
  /Names 
  <<
    /EmbeddedFiles 
    <<
      /Names [<FEFF004F0072006900670069006E0061006C002E006F00640074> 2 0 R]
    >>
  >>
  /OpenAction[3 0 R /XYZ null null 0]
  /Lang(en-GB)
>>
endobj


> Hmm so we don't - we really should have a test for each of the hybrid types
> we support to make sure we can read it!  We've already got a couple in
> there, but not any of the new types (which we can't read!)

As I said you can just go into the PDF viewer and open the embedded document there. I think this should be the preferred way of doing it anyway, so I don't think it's such a big problem, but we need to make users aware that they can access the original document in this way as well, and this is probably the biggest issue.

> Also:
>   - Do we have a 'spec' of hybrid anywhere?

No.

>   - If we change it what's the rules on compatibility for old docs?

Not breaking the compatibility, so reading /AdditionalStreams should still be supported. 

Writing /AdditionalStreams is probably something we can stop adding at some point (for plain PDFs).

>   - My guess for DocumentChecksum had been that it was intended to stop the
> pdf and the original content getting out of sync - but that would want a
> checksum of the embedded document, not the whole file wouldn't it?

All I can see is that we check the PDF file is not corrupted by checking the hashes - I don't think it has any purpose except maybe to not prevent opening the source documnet if the PDF was modified (for example - annotation was added), which may not make sense actually. I would remove that completely. 

>   - If we needed to check the embedded file name and/or the  AFRelationship
> you mentioned, is that doable on a file prior to decryption?

PDF document structure is not completely encrypted - you can still see the most of the structure, but not the content of streams, so you see that there is an embedded file and the type of the file. 

>   - My relatively easy way of fixing the PDF2 encrypted hybrids is to do the
> same thing I did for PDF2 encrypted none-hybrid, and use Poppler to do the
> hybrid extraction; then we can lose our extra PDF parser code.

The problem is only that Poppler is optional, so it might not be available, which means this functionallity will also not be available. But then the issue in general is that this is implemented inside PDF import code and not as a independent filter, which it could be done easily.

Also we use PDFium mostly in the core nowadays for various tasks and I think that will become available by default much more likely than Poppler, which is only used for PDF import.
Comment 33 Caolán McNamara 2025-03-25 12:31:43 UTC
You can probably disregard detectHasAdditionalStreams as just an optimization for the state of things at the time it was added.
Comment 34 Dave Gilbert 2025-03-26 01:46:13 UTC
(In reply to Tomaz Vajngerl from comment #32)
> (In reply to Dave Gilbert from comment #30)
> > (Copying Caolán in due to the hybrid performance hack)
> > Hi Tomaz,
> >   Thanks for replying.
> > For context, one reason I'm asking abotu this stuff is that I just fixed the
> > PDF import filter to handle PDF2 encryption, but not with hybrid yet, and I
> > was looking at how to fix that, and know a fairly easy way to do it - but
> > understanding the details of hybrid seems to make sense first!
> > 
> > >>   a) Without a trailer, I don't understand how a hybrid loader is supposed
> > >> to understand that a PDF embedded file is actually a hybrid PDF rather than
> > >> just a PDF that happens to have an attached; A PDF could have loads of
> > >> embedded files and none of them actually represent the same contents as the
> > >> PDF (e.g. maybe include raw data spreadsheets or something with an
> > >> explanatory PDF).
> > 
> > > Using a filename convention "Original.od*" and making sure it's a compatible document. 
> > > Or in PDF 2.0 - the embedded file that is  /AFRelationship is /Source.   
> > 
> > OK, I don't know the details of PDF 2.0 yet, so I'll take your word for
> > that; but is that something the writer already does?
> 
> It does in master when the output format is PDF 2.0. 
> 
> Also a fairly good type of detection is to check if there is just one
> embedded file that is ODF. That's what will be the case in case of hybrid
> files. If there are more leave it up to the user to open the correct source
> in the PDF viewer - if it indeed is a hybrid file.  

OK, that feels fairly simple; does that cover for all old Hybrid files?

> > Before I break it, I'd like to give Caolán (copied in) a chance to say stuff.
> > My feeling is that Caolán hack is unreasonably fast - it was driven by this
> > ticket:
> > https://github.com/CollaboraOnline/online/issues/7307
> 
> Well, sure, but parsing the PDF document structure only is not very complex
> either. The problem is that we do a full PDF parse (I guess) in
> getAdditionalStream with calling the pdfparse::PDFReader::read method. 
> 
> Instead we could've used a modified PDFDocument::Read just for detection
> that doesn't do any unnecessary things. So to detect an embedded file you
> can read the xref table, then check the trailer for /Root which should point
> you to the object with the /Catalog dictionary and from there you need to
> get the /Names entry and check the /EmbeddedFiles... Those things should eb
> fast as you just jump around in the file from one location to the other and
> do minimal reads.
> 
> Example of PDF Catalog dictionary with embedded files:
> 16 0 obj
> <<
>   /Type/Catalog
>   /Pages 6 0 R
>   /Names 
>   <<
>     /EmbeddedFiles 
>     <<
>       /Names [<FEFF004F0072006900670069006E0061006C002E006F00640074> 2 0 R]
>     >>
>   >>
>   /OpenAction[3 0 R /XYZ null null 0]
>   /Lang(en-GB)
> >>
> endobj
> 

<See below>

> > Hmm so we don't - we really should have a test for each of the hybrid types
> > we support to make sure we can read it!  We've already got a couple in
> > there, but not any of the new types (which we can't read!)
> 
> As I said you can just go into the PDF viewer and open the embedded document
> there. I think this should be the preferred way of doing it anyway, so I
> don't think it's such a big problem, but we need to make users aware that
> they can access the original document in this way as well, and this is
> probably the biggest issue.
> 
> > Also:
> >   - Do we have a 'spec' of hybrid anywhere?
> 
> No.

Feels like we should.

> >   - If we change it what's the rules on compatibility for old docs?
> 
> Not breaking the compatibility, so reading /AdditionalStreams should still
> be supported. 

Gah OK, that was a bit I hadn't realised - but then we do have 
    ./sw/qa/extras/pdf/data/Hybrid_AdditionalStreamsOnly.pdf 

as a test for that case, where there is no PDF EmbeddedFile

So we're stuck in having to do both of them for some long time:
  a) Check for an AdditionalStreams (required for old old files)
  b) Check for EmbedeedFile named Original.o* with correct MIME (required for current PDF/UA, etc)   - not currently done

> Writing /AdditionalStreams is probably something we can stop adding at some
> point (for plain PDFs).


> 
> >   - My guess for DocumentChecksum had been that it was intended to stop the
> > pdf and the original content getting out of sync - but that would want a
> > checksum of the embedded document, not the whole file wouldn't it?
> 
> All I can see is that we check the PDF file is not corrupted by checking the
> hashes - I don't think it has any purpose except maybe to not prevent
> opening the source documnet if the PDF was modified (for example -
> annotation was added), which may not make sense actually. I would remove
> that completely. 

OK, tht's not hard.
 
> >   - If we needed to check the embedded file name and/or the  AFRelationship
> > you mentioned, is that doable on a file prior to decryption?
> 
> PDF document structure is not completely encrypted - you can still see the
> most of the structure, but not the content of streams, so you see that there
> is an embedded file and the type of the file. 

OK.
 
> >   - My relatively easy way of fixing the PDF2 encrypted hybrids is to do the
> > same thing I did for PDF2 encrypted none-hybrid, and use Poppler to do the
> > hybrid extraction; then we can lose our extra PDF parser code.
> 
> The problem is only that Poppler is optional, so it might not be available,
> which means this functionallity will also not be available. But then the
> issue in general is that this is implemented inside PDF import code and not
> as a independent filter, which it could be done easily.
> 
> Also we use PDFium mostly in the core nowadays for various tasks and I think
> that will become available by default much more likely than Poppler, which
> is only used for PDF import.

I'm up for rewriting the hybrid Detect/extract using PDFium rather than the other inbuilt
parser in  sdext/source/pdfimport/pdfparse,  but the need to keep parsing the trailer for AdditionalStreams might make that a lot more tricky;  Poppler parses the trailer and it's easy just to ask it for it:

  pDocUnique->getXRef()->getTrailerDict()->dictLookup("AdditionalStreams")
(Which seems to work - add error checking to taste)
I'm not seeing a similar interface in PDFium - the nearest seems to be 'getTrailerEnds' - which I think is just start and end bytes of (each??) trailer.
Do we have anything that parses a trailer using PDFium?
Comment 35 Miklos Vajna 2025-03-26 08:51:36 UTC
The pdfium API is a bit higher level, see if FPDFDoc_GetAttachmentCount(), FPDFDoc_GetAttachment(), FPDFAttachment_GetName() and so in fpdf_attachment.h provides what you need.

We also have a C++ wrapper around these in vcl, so you don't need to put ifdefs everywhere in the code for the --disable-pdfium case.

getTrailerEnds() is just for signatures that need to know about previous trailers; for type detect you can ignore all the non-last trailers, I would expect.
Comment 36 Dave Gilbert 2025-03-26 13:05:45 UTC
(In reply to Miklos Vajna from comment #35)
> The pdfium API is a bit higher level, see if FPDFDoc_GetAttachmentCount(),
> FPDFDoc_GetAttachment(), FPDFAttachment_GetName() and so in
> fpdf_attachment.h provides what you need.

Those I found, they get me the embeddedfiles which is great, but I still need to get
the trailer for the older files.

> We also have a C++ wrapper around these in vcl, so you don't need to put
> ifdefs everywhere in the code for the --disable-pdfium case.

Nice.

> getTrailerEnds() is just for signatures that need to know about previous
> trailers; for type detect you can ignore all the non-last trailers, I would
> expect.

But that seems the only API it's got for getting any trailer info isn't it?


Dave
Comment 37 Miklos Vajna 2025-03-27 07:32:59 UTC
> But that seems the only API it's got for getting any trailer info isn't it?

Ah, I see. The pdfium API is intentionally high-level, so most probably it won't give you access to raw dictionaries, and high-level API just exposes standard properties. So one way to get around this is to ask on the pdfium mailing list if it would be accepted to add API to get the ODF content for old, non-standard PDFs -- if so, this should be reasonably straightforward to implement (at least the detection part).
Comment 38 Dave Gilbert 2025-03-28 01:26:08 UTC
(In reply to Miklos Vajna from comment #37)
> > But that seems the only API it's got for getting any trailer info isn't it?
> 
> Ah, I see. The pdfium API is intentionally high-level, so most probably it
> won't give you access to raw dictionaries, and high-level API just exposes
> standard properties. So one way to get around this is to ask on the pdfium
> mailing list if it would be accepted to add API to get the ODF content for
> old, non-standard PDFs -- if so, this should be reasonably straightforward
> to implement (at least the detection part).

I think given that, I'll stick to my current intention of modifying the existing poppler code to do it for now; we can come back to pdfium-ising that when it can do it.
So, I'd look to:
  a) First fix encrypted PDF2 hybrid loading (which is what got me here in the first place)
  b) Lose the checksum check
  c) Make sure to handle hybrid with no trailer data, just the single embedded file named Original.o* with the correct mime type/extension
Comment 39 Miklos Vajna 2025-03-28 07:30:36 UTC
Sure, I'm not trying to tell you what to do. :-)

One aspect to keep in mind is the --enable-mpl-subset case, which lacks poppler. So if you switch the hybrid PDF detection to poppler, you may want to keep the old code for the --disable-poppler case, so your changes are not seen as a regression from that point of view. Thanks.
Comment 40 Dave Gilbert 2025-03-28 22:44:14 UTC
Tomaz:
  Actually, I might have a simplification - am I right in saying that there aren't any valid Hybrids that are:

    a) AES encrypted (PDF2 ????)
and
    b) Only have the old trailer syntax (not an embedded file)

If those don't exist, then it simplifies things a lot; in theory then I can use PDFium to deal with modern files (which might be AES encrypted), and the existing path to deal with old trailer syntax files ?
Comment 41 Tomaz Vajngerl 2025-03-29 05:20:33 UTC
C(In reply to Dave Gilbert from comment #40)
> Tomaz:
>   Actually, I might have a simplification - am I right in saying that there
> aren't any valid Hybrids that are:
> 
>     a) AES encrypted (PDF2 ????)
> and
>     b) Only have the old trailer syntax (not an embedded file)

Correct, the only way to get PDF 2.0 encryption is in master selecting PDF 2.0 as the output. This will always write the document as embedded file. 

> If those don't exist, then it simplifies things a lot; in theory then I can
> use PDFium to deal with modern files (which might be AES encrypted), and the
> existing path to deal with old trailer syntax files ?

Yes, that's a possible solution.