66580 – exported PDF is invalid because of forbidden custom keys in the trailer

Bug 66580 - exported PDF is invalid because of forbidden custom keys in the trailer

Summary: exported PDF is invalid because of forbidden custom keys in the trailer

Status:	NEW

Alias:	None

Product:	LibreOffice
Classification:	Unclassified
Component:	Printing and PDF export (show other bugs)
Version: (earliest affected)	Inherited From OOo
Hardware:	All All

Importance:	medium normal
Assignee:	Not Assigned

URL:
Whiteboard:	target:7.6.0 target:25.8.0
Keywords:	filter:pdf

Duplicates (1):	142051 (view as bug list)
Depends on:
Blocks:	PDF-Export-Invalid
	Show dependency tree / graph

Reported:	2013-07-04 09:39 UTC by Jos van den Oever
Modified:	2025-05-17 12:13 UTC (History)
CC List:	15 users (show)

See Also:	138793 158537 95328
Crash report or crash signature:

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Jos van den Oever 2013-07-04 09:39:34 UTC

PDF exported by LibreOffice contains the key /DocChecksum in the trailer dictionary. When an ODF document is embedded, it also contains the key /AdditionalStreams.

These keys are not defined in the PDF 1.4 specification. The specification forbids use of custom keys in the trailer:

===
A PDF producer or Acrobat plug-in extension may also add keys to any PDF
object that is implemented as a dictionary, except the file trailer dictionary (see Section 3.4.4, “File Trailer”). In addition, a PDF producer or Acrobat plug-in may create tags that indicate the role of marked-content operators (PDF 1.2), as described in Section 9.5, “Marked Content.”
===

A strict PDF validator would declare PDF documents saved by LibreOffice invalid.

Comment 1 Cor Nouws 2013-07-04 09:55:09 UTC

Hi Jos,
Thanks for the report.
I set this to New, trusting your expertise in this ;)
Is this an issue new with the 410beta2 or already in older versions too?
Best,
Cor

Comment 2 Jos van den Oever 2013-07-04 10:06:15 UTC

The /DocChecksum and /AdditionalStreams were added to OpenOffice on 2007-03-26.

In the LibreOffice git repository this is commit d217c079d7b3ca7b5039428594e7cdfdf9a0c4a9

Comment 3 Cor Nouws 2013-07-04 10:34:02 UTC

thanks! I change the version conform your info.

Comment 4 kurt.pfeifle 2014-05-04 17:14:39 UTC

What will happen on this?

Do you need suggestions about how to implement these features in a spec conforming way?

Comment 5 kurt.pfeifle 2014-05-04 17:18:29 UTC

To give you a link to the relevant PDF-1.4 specification:

     http://acroeng.adobe.com/PDFReference/PDF_1.4/PDF%20Reference%201.4.pdf

The quote given by Jos in his bug report is from named page 723 (as printed on page), page 743 (as counted from first), in appendix E, "PDF Name Registry". 

Here is a recently created website holding *all* PDF specifications ever published by Adobe:

     http://acroeng.adobe.com/wp/?page_id=321

Comment 6 kurt.pfeifle 2014-05-09 21:12:44 UTC

According to these test results:

    https://docs.google.com/spreadsheets/d/1Ok37dvlRSpzKpdKJ6gYycM5QzM7sv4_YCybHbiFMVFI

none of 36 different PDF viewers or applications did have a problem to display or process the tested hybrid PDF created by LibreOffice.

Hence I took the liberty to set importance of this bug to much lower for now. I won't protest if someone even closed it as WONTFIX unless there appears other evidence of real life problems...

Comment 7 QA Administrators 2015-06-08 14:41:36 UTC Comment hidden (obsolete)

** Please read this message in its entirety before responding **

To make sure we're focusing on the bugs that affect our users today, LibreOffice QA is asking bug reporters and confirmers to retest open, confirmed bugs which have not been touched for over a year.

There have been thousands of bug fixes and commits since anyone checked on this bug report. During that time, it's possible that the bug has been fixed, or the details of the problem have changed. We'd really appreciate your help in getting confirmation that the bug is still present.

If you have time, please do the following:

   Test to see if the bug is still present on a currently supported version of LibreOffice (4.4.3 or later)
   https://www.libreoffice.org/download/

   If the bug is present, please leave a comment that includes the version of LibreOffice and your operating system, and any changes you see in the bug behavior
 
 If the bug is NOT present, please set the bug's Status field to RESOLVED-WORKSFORME and leave a short comment that includes your version of LibreOffice and Operating System

Please DO NOT

   Update the version field
   Reply via email (please reply directly on the bug tracker)
   Set the bug's Status field to RESOLVED - FIXED (this status has a particular meaning that is not appropriate in this case)


If you want to do more to help you can test to see if your issue is a REGRESSION. To do so: 1. Download and install oldest version of LibreOffice (usually 3.3 unless your bug pertains to a feature added after 3.3)

http://downloadarchive.documentfoundation.org/libreoffice/old/

2. Test your bug 
3. Leave a comment with your results;
4a. If the bug was present with 3.3 - set version to "inherited from OOo"; 
4b. If the bug was not present in 3.3 - add "regression" to keyword


Feel free to come ask questions or to say hello in our QA chat: http://webchat.freenode.net/?channels=libreoffice-qa

Thank you for your help!

-- The LibreOffice QA Team This NEW Message was generated on: 2015-06-08

Comment 8 QA Administrators 2016-09-20 10:00:33 UTC Comment hidden (obsolete)

** Please read this message in its entirety before responding **

To make sure we're focusing on the bugs that affect our users today, LibreOffice QA is asking bug reporters and confirmers to retest open, confirmed bugs which have not been touched for over a year.

There have been thousands of bug fixes and commits since anyone checked on this bug report. During that time, it's possible that the bug has been fixed, or the details of the problem have changed. We'd really appreciate your help in getting confirmation that the bug is still present.

If you have time, please do the following:

Test to see if the bug is still present on a currently supported version of LibreOffice
(5.1.5 or 5.2.1 https://www.libreoffice.org/download/

If the bug is present, please leave a comment that includes the version of LibreOffice and
your operating system, and any changes you see in the bug behavior

If the bug is NOT present, please set the bug's Status field to RESOLVED-WORKSFORME and leave
a short comment that includes your version of LibreOffice and Operating System

Please DO NOT

Update the version field
Reply via email (please reply directly on the bug tracker)
Set the bug's Status field to RESOLVED - FIXED (this status has a particular meaning that is not
appropriate in this case)

If you want to do more to help you can test to see if your issue is a REGRESSION. To do so:
1. Download and install oldest version of LibreOffice (usually 3.3 unless your bug pertains to a feature added after 3.3)

http://downloadarchive.documentfoundation.org/libreoffice/old/

2. Test your bug
3. Leave a comment with your results.
4a. If the bug was present with 3.3 - set version to "inherited from OOo";
4b. If the bug was not present in 3.3 - add "regression" to keyword

Feel free to come ask questions or to say hello in our QA chat: http://webchat.freenode.net/?channels=libreoffice-qa

Thank you for helping us make LibreOffice even better for everyone!

Warm Regards,
QA Team

MassPing-UntouchedBug-20160920

Comment 9 Jos van den Oever 2016-09-20 10:59:11 UTC

PDF documents created with version 5.1.2.2.0 on Linux 4.4 still add the key /DocChecksum and /AdditionalStreams to the PDF files.

Comment 10 kurt.pfeifle 2017-10-22 20:01:35 UTC

(In reply to Jos van den Oever from comment #9)
> PDF documents created with version 5.1.2.2.0 on Linux 4.4 still add the key
> /DocChecksum and /AdditionalStreams to the PDF files.

Jos, the additional (proprietary) keys used by OpenOffice/LibreOffice to embed
the original OpenDocument file into the Hybrid PDF are not doing any real
h a r m:

  * As I showed in comment #6 none of the 36 tested PDF viewers has any problem
    opening and displaying a Hybrid PDF!

There are other reasons which would may  M E  want to modify the way LO creates
a Hybrid PDF:

  * N O N E  of the other PDF readers do have a way to detect that there is an
    embedded OpenDocument file in the PDF!

The reason for this is that the way OO/LO implemented this feature was that they did it in a non-standard, "proprietary" way -- while they could have utilized the standards-defined "embed another file into the PDF"-feature. (See for example bug95328 and comments).

And there  A R E  good use cases to be able to detect the embeddedness of the
original OpenDocument file in a PDF even by non-OO/LO applications:

  - User(s) may not be aware of this when they open the PDF in a PDF reader.
    However, the reader may draw their attention to the fact of the original
    ODT/ODS document being embedded. After all, whoever embedded the original
    document into the Hybrid PDF most likely  W A N T E D  it to be editable.

  - Users may need/want to extract the embedded ODT/ODS file without switching 
    to LibreOffice first (which may not even be installed on their currently
    used computer system).

  - I could easily think of more use cases, why it would be good to be able to
    D E T E C T  the fact of the embedded original and editable file and also
    to  E X T R A C T  it from the PDF via a software other than OO/LO.

Comment 11 QA Administrators 2018-10-23 02:50:06 UTC Comment hidden (obsolete)

** Please read this message in its entirety before responding **

To make sure we're focusing on the bugs that affect our users today, LibreOffice QA is asking bug reporters and confirmers to retest open, confirmed bugs which have not been touched for over a year.

If you have time, please do the following:

Test to see if the bug is still present with the latest version of LibreOffice from https://www.libreoffice.org/download/

If the bug is present, please leave a comment that includes the information from Help - About LibreOffice.

If the bug is NOT present, please set the bug's Status field to RESOLVED-WORKSFORME and leave a comment that includes the information from Help - About LibreOffice.

Please DO NOT

2. Test your bug
3. Leave a comment with your results.
4a. If the bug was present with 3.3 - set version to 'inherited from OOo';
4b. If the bug was not present in 3.3 - add 'regression' to keyword

Feel free to come ask questions or to say hello in our QA chat: https://kiwiirc.com/nextclient/irc.freenode.net/#libreoffice-qa

Thank you for helping us make LibreOffice even better for everyone!

Warm Regards,
QA Team

MassPing-UntouchedBug

Comment 12 Alexis de Lattre 2020-01-02 12:20:36 UTC

It is very very strange that a project such as LibreOffice that promote the OpenDocument standard and interoperability in general doesn't respect the PDF standard ! Adding proprietary keys in the PDF trailer that only LibreOffice can read is certainly a bad practice. Using the "Embedded Files" feature of the PDF standard is clearly the way to go !

It could be the occasion to add support for Embedded Files in the LibreOffice PDF export. Embedded Files in PDF is starting to be a widely used feature with electronic invoicing standards such as Factur-X/ZUGFeRD that use the Embedded Files feature of the PDF standard to add an XML file in a PDF invoice (to allow automatic processing of the invoice), and the possibility to add other document as attachments of the PDF (documents that justify the invoice, for example a signed acceptance form).

For instance, I recently developed a LibreOffice extension to be able to generate Factur-X invoices from LibreOffice Calc (cf https://github.com/akretion/factur-x-libreoffice-extension). This extension contains a Python macro that post-processes the PDF file generated by LibreOffice to add the XML file as attachment to the PDF. The code of this macro would be much simpler if the PDF export feature of LibreOffice had native support for Embedded Files. And generating structured electronic invoices (with Factur-X, ZUGFeRD or other standards) is starting to be compulsory in some countries (for example, it is now compulsory in France when you invoice the public sector).

Comment 13 Julien Nabet 2020-01-02 19:37:54 UTC

Michael/Miklos/Tomaž: I don't know who's PDF expert so thought one of you might have some idea.

The problem here is "AdditionalStreams" keyword doesn't exist in PDF standard.
Taking a look at git history of d217c079d7b3ca7b5039428594e7cdfdf9a0c4a9, it's been added with:
commit d217c079d7b3ca7b5039428594e7cdfdf9a0c4a9
Author: Ivo Hinkelmann <ihi@openoffice.org>
Date:   Mon Mar 26 10:21:15 2007 +0000
    INTEGRATION: CWS ipdf (1.92.80); FILE MERGED
    2007/01/19 16:08:58 pl 1.92.80.8: #137143# ecnrypt add streams
    2007/01/19 11:48:56 pl 1.92.80.7: RESYNC: (1.99-1.102); FILE MERGED
    2006/10/04 18:52:04 pl 1.92.80.6: RESYNC: (1.96-1.99); FILE MERGED
    2006/07/25 09:31:00 pl 1.92.80.5: RESYNC: (1.93-1.96); FILE MERGED
    2006/07/04 16:34:49 pl 1.92.80.4: removed a warning
    2006/07/04 13:48:22 pl 1.92.80.3: RESYNC: (1.92-1.93); FILE MERGED
    2006/06/26 15:00:09 pl 1.92.80.2: #137143# emit document checksum
    2006/06/12 16:53:42 pl 1.92.80.1: #137143# add AddStream interface

Shouldn't it be removed, put in readonly (I mean LO may read this on old files but should replace the keyword when modifying) or at minimum make this deprecated?
Instead there's "EmbeddedFiles" in specs (see https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdf_reference_archive/pdf_reference_1-7.pdf)

LO should respect PDF standard, I put this one to normal importance but it should be even higher than this.

Comment 14 Jean-Baptiste Faure 2020-01-02 20:27:41 UTC

According to comment #13 I changed version to inherited from OOo.

Best regards. JBF

Comment 15 Michael Meeks 2020-01-02 20:28:34 UTC

The hybrid PDF functionality was a great innovation, and the standard didn't cover it then of course. It would be great to find some resources / and/or interested people to implement the new standard using EmbeddedFiles. Alexis - are you interested in some code pointers there ? hacking the core to rename a few attributes and re-structuring the stream is likely to be a good start. I imagine a Collaboran would be happy to mentor someone that wanted to work on this themselves, but we can't resource a fix absent a customer ourselves today.

Comment 16 Julien Nabet 2020-01-02 21:30:53 UTC

Thank you Michael for your very quick feedback! :-)

I took a look at https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/pdf_reference_archives/PDFReference.pdf which 1.4 version (released in 2001 according to https://fr.wikipedia.org/wiki/Portable_Document_Format#Versions). Version 1.4 had already "EmbeddedFiles" keyword (see 3.3 part).
So wondered why adding the non standard "AdditionalStreams" whereas this keyword was existing. 
Or perhaps I wrongly understood this?

Comment 17 kurt.pfeifle 2020-01-02 22:26:04 UTC

(In reply to Michael Meeks from comment #15)
> The hybrid PDF functionality was a great innovation, and the standard didn't
> cover it then of course.

Indeed the hybrid PDF functionality  i s   a great innovation.
However it could even then have been (and still can be implemented) by using
the standard conforming method of embedding the source file.

If done, this would have the advantage that every standard compliant PDF
viewer or PDF processing software could auto-discover the embedded source
file and let the user "do something" with it even in the absence of a
LibreOffice installation on his system.

Comment 18 Julien Nabet 2020-01-03 10:12:55 UTC

Here are some code pointers:
https://opengrok.libreoffice.org/search?project=core&full=AdditionalStreams&defs=&refs=&path=&hist=&type=&si=full

To create the pdf:
emitTrailer() method from vcl/source/gdi/pdfwriter_impl.cxx

The rest seems related to PDF import

Comment 19 V Stuart Foote 2021-05-03 19:21:22 UTC

*** Bug 142051 has been marked as a duplicate of this bug. ***

Comment 20 Commit Notification 2023-01-24 10:50:55 UTC

Tomaž Vajngerl committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/e052f6e1d49a5289411b31561d6e310bf414d896

tdf#66580 write ODF document as an attachment in hybrid mode

It will be available in 7.6.0.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.

Comment 21 Commit Notification 2023-01-25 14:57:08 UTC

Tomaž Vajngerl committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/8640e24b12c7df3170c9f3e7ff3edced81fd0838

tdf#66580 added hybrid PDF test cases

It will be available in 7.6.0.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.

Comment 22 Commit Notification 2023-02-01 02:11:39 UTC

Tomaž Vajngerl committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/9740331d8bc56a9b6fbe3e4c1b26fb97f6639cc6

tdf#66580 write more metadata to embedded and attached files

It will be available in 7.6.0.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.

Comment 23 Stéphane Guillou (stragu) 2023-12-07 10:14:11 UTC

Tomaž mentioned in e052f6e1d49a5289411b31561d6e310bf414d896 that /AdditionalStreams was kept for backward-compatibility - "for now".
What do you think should be done with this ticket? Close as "fixed" even though some of it is "won't fix"? Or are we consider breaking backward compatibility in the future with some warning to users? Or third option: yet another option for PDF export?

Comment 24 peter.wyatt 2024-03-14 06:38:36 UTC

The latest ISO standard for PDF (ISO 32000-2:2020) no longer prohibits custom keys in trailers - in fact, it explicitly permits keys that are 2nd-class names: "The PDF file trailer dictionary may also contain any second-class name as described in Annex E, "Extending PDF"."

Historically the issue was that a lot of SW did not maintain all custom trailer entries when applying incremental updates so software that relied on their presence got broken if such files were later edited. Also, conventional trailer dictionaries don't exist when using (non-hybrid) cross-reference streams and software that converts to conventional PDFs to cross-reference stream-based PDFs also need to remember to copy custom keys across when converting. Basically its high risk putting stuff there...

Depending on what the data is (just DocChecksum??), there are more appropriate and reliable places to save the data in the PDF DOM: Document Catalog PieceInfo, XMP Metadata, use a proper DigSig if the checksum is somehow important for detecting tampering, ...

Comment 25 Dave Gilbert 2025-03-21 02:21:37 UTC

Two thoughts about the use of trailers:
  a) Without a trailer, I don't understand how a hybrid loader is supposed to understand that a PDF embedded file is actually a hybrid PDF rather than just a PDF that happens to have an attached; A PDF could have loads of embedded files and none of them actually represent the same contents as the PDF (e.g. maybe include raw data spreadsheets or something with an explanatory PDF).

  b) caolanm's performance hack  - 'detectHasAdditionalStreams' from ~2023 - sniffs the trailer at the end of the file to make a quick detection about whether it might be a hybrid.  So if the use of the trailer disappears you lose that performance trick.

(Oh, and does anyone understand why we have the /DocChecksum ?  Was that intended to detect if the pdf and hybrid diverge due to an external editor?)

Comment 26 Miklos Vajna 2025-03-21 07:34:17 UTC

> Oh, and does anyone understand why we have the /DocChecksum

My guess is that some PDF readers have this feature to remember e.g. what page were you looking last and this keeps working even if you rename the file. One way to implement this may be to work with the DocChecksum of the file, rather than the file path. But I may be wrong.

Comment 27 Tomaz Vajngerl 2025-03-23 15:04:20 UTC

 (In reply to Dave Gilbert from comment #25)
> Two thoughts about the use of trailers:
>   a) Without a trailer, I don't understand how a hybrid loader is supposed
> to understand that a PDF embedded file is actually a hybrid PDF rather than
> just a PDF that happens to have an attached; A PDF could have loads of
> embedded files and none of them actually represent the same contents as the
> PDF (e.g. maybe include raw data spreadsheets or something with an
> explanatory PDF).

Using a filename convention "Original.od*" and making sure it's a compatible document. Or in PDF 2.0 - the embedded file that is  /AFRelationship is /Source.   

>   b) caolanm's performance hack  - 'detectHasAdditionalStreams' from ~2023 -
> sniffs the trailer at the end of the file to make a quick detection about
> whether it might be a hybrid.  So if the use of the trailer disappears you
> lose that performance trick.

Well, that's unfortunate but reading the xref entry and checking the objects for an embedded file is also fast and can spare you from loading the whole PDF.

Comment 28 Tomaz Vajngerl 2025-03-23 15:10:54 UTC

(In reply to Miklos Vajna from comment #26)
> > Oh, and does anyone understand why we have the /DocChecksum
> 
> My guess is that some PDF readers have this feature to remember e.g. what
> page were you looking last and this keeps working even if you rename the
> file. One way to implement this may be to work with the DocChecksum of the
> file, rather than the file path. But I may be wrong.

/DocChecksum is not part of any PDF specs and only LibreOffice sets that. Why would any PDF reader implement something that is non-standard and only LO sets?.. 

Also /ID which is also in the trailer is part of the standard and can serve this purpose - it's (or should be) a random generated number so "unique" for each document.

Comment 29 Tomaz Vajngerl 2025-03-23 15:14:54 UTC

Also BTW, we don't write those to the trailer anymore, if we export to PDF/A variant or PDF/UA is enabled.

Comment 30 Dave Gilbert 2025-03-23 16:08:54 UTC

(Copying Caolán in due to the hybrid performance hack)
Hi Tomaz,
  Thanks for replying.
For context, one reason I'm asking abotu this stuff is that I just fixed the PDF import filter to handle PDF2 encryption, but not with hybrid yet, and I was looking at how to fix that, and know a fairly easy way to do it - but understanding the details of hybrid seems to make sense first!

>>   a) Without a trailer, I don't understand how a hybrid loader is supposed
>> to understand that a PDF embedded file is actually a hybrid PDF rather than
>> just a PDF that happens to have an attached; A PDF could have loads of
>> embedded files and none of them actually represent the same contents as the
>> PDF (e.g. maybe include raw data spreadsheets or something with an
>> explanatory PDF).

> Using a filename convention "Original.od*" and making sure it's a compatible document. 
> Or in PDF 2.0 - the embedded file that is  /AFRelationship is /Source.   

OK, I don't know the details of PDF 2.0 yet, so I'll take your word for that; but is that something the writer already does?

>>   b) caolanm's performance hack  - 'detectHasAdditionalStreams' from ~2023 -
>> sniffs the trailer at the end of the file to make a quick detection about
>> whether it might be a hybrid.  So if the use of the trailer disappears you
>> lose that performance trick.

>Well, that's unfortunate but reading the xref entry and checking the objects for an 
> embedded file is also fast and can spare you from loading the whole PDF.

Before I break it, I'd like to give Caolán (copied in) a chance to say stuff.
My feeling is that Caolán hack is unreasonably fast - it was driven by this ticket:
https://github.com/CollaboraOnline/online/issues/7307

> Also BTW, we don't write those to the trailer anymore, if we export to PDF/A variant or > PDF/UA is enabled.

Hmm so we don't - we really should have a test for each of the hybrid types we support to make sure we can read it!  We've already got a couple in there, but not any of the new types (which we can't read!)

Also:
  - Do we have a 'spec' of hybrid anywhere?
  - If we change it what's the rules on compatibility for old docs?
  - My guess for DocumentChecksum had been that it was intended to stop the pdf and the original content getting out of sync - but that would want a checksum of the embedded document, not the whole file wouldn't it?
  - If we needed to check the embedded file name and/or the  AFRelationship you mentioned, is that doable on a file prior to decryption?
  - My relatively easy way of fixing the PDF2 encrypted hybrids is to do the same thing I did for PDF2 encrypted none-hybrid, and use Poppler to do the hybrid extraction; then we can lose our extra PDF parser code.

Comment 31 Miklos Vajna 2025-03-24 09:38:23 UTC

(In reply to Tomaz Vajngerl from comment #28)
> > My guess is that some PDF readers have this feature to remember e.g. what
> > page were you looking last and this keeps working even if you rename the
> > file. One way to implement this may be to work with the DocChecksum of the
> > file, rather than the file path. But I may be wrong.
> 
> /DocChecksum is not part of any PDF specs and only LibreOffice sets that.
> Why would any PDF reader implement something that is non-standard and only
> LO sets?.. 

Ah OK, I'll stop guessing. :-) Thanks for pointing that out.

Comment 32 Tomaz Vajngerl 2025-03-25 09:55:08 UTC

(In reply to Dave Gilbert from comment #30)
> (Copying Caolán in due to the hybrid performance hack)
> Hi Tomaz,
>   Thanks for replying.
> For context, one reason I'm asking abotu this stuff is that I just fixed the
> PDF import filter to handle PDF2 encryption, but not with hybrid yet, and I
> was looking at how to fix that, and know a fairly easy way to do it - but
> understanding the details of hybrid seems to make sense first!
> 
> >>   a) Without a trailer, I don't understand how a hybrid loader is supposed
> >> to understand that a PDF embedded file is actually a hybrid PDF rather than
> >> just a PDF that happens to have an attached; A PDF could have loads of
> >> embedded files and none of them actually represent the same contents as the
> >> PDF (e.g. maybe include raw data spreadsheets or something with an
> >> explanatory PDF).
> 
> > Using a filename convention "Original.od*" and making sure it's a compatible document. 
> > Or in PDF 2.0 - the embedded file that is  /AFRelationship is /Source.   
> 
> OK, I don't know the details of PDF 2.0 yet, so I'll take your word for
> that; but is that something the writer already does?

It does in master when the output format is PDF 2.0. 

Also a fairly good type of detection is to check if there is just one embedded file that is ODF. That's what will be the case in case of hybrid files. If there are more leave it up to the user to open the correct source in the PDF viewer - if it indeed is a hybrid file.  

> Before I break it, I'd like to give Caolán (copied in) a chance to say stuff.
> My feeling is that Caolán hack is unreasonably fast - it was driven by this
> ticket:
> https://github.com/CollaboraOnline/online/issues/7307

Well, sure, but parsing the PDF document structure only is not very complex either. The problem is that we do a full PDF parse (I guess) in getAdditionalStream with calling the pdfparse::PDFReader::read method. 

Instead we could've used a modified PDFDocument::Read just for detection that doesn't do any unnecessary things. So to detect an embedded file you can read the xref table, then check the trailer for /Root which should point you to the object with the /Catalog dictionary and from there you need to get the /Names entry and check the /EmbeddedFiles... Those things should eb fast as you just jump around in the file from one location to the other and do minimal reads.

Example of PDF Catalog dictionary with embedded files:
16 0 obj
<<
  /Type/Catalog
  /Pages 6 0 R
  /Names 
  <<
    /EmbeddedFiles 
    <<
      /Names [<FEFF004F0072006900670069006E0061006C002E006F00640074> 2 0 R]
    >>
  >>
  /OpenAction[3 0 R /XYZ null null 0]
  /Lang(en-GB)
>>
endobj


> Hmm so we don't - we really should have a test for each of the hybrid types
> we support to make sure we can read it!  We've already got a couple in
> there, but not any of the new types (which we can't read!)

As I said you can just go into the PDF viewer and open the embedded document there. I think this should be the preferred way of doing it anyway, so I don't think it's such a big problem, but we need to make users aware that they can access the original document in this way as well, and this is probably the biggest issue.

> Also:
>   - Do we have a 'spec' of hybrid anywhere?

No.

>   - If we change it what's the rules on compatibility for old docs?

Not breaking the compatibility, so reading /AdditionalStreams should still be supported. 

Writing /AdditionalStreams is probably something we can stop adding at some point (for plain PDFs).

>   - My guess for DocumentChecksum had been that it was intended to stop the
> pdf and the original content getting out of sync - but that would want a
> checksum of the embedded document, not the whole file wouldn't it?

All I can see is that we check the PDF file is not corrupted by checking the hashes - I don't think it has any purpose except maybe to not prevent opening the source documnet if the PDF was modified (for example - annotation was added), which may not make sense actually. I would remove that completely. 

>   - If we needed to check the embedded file name and/or the  AFRelationship
> you mentioned, is that doable on a file prior to decryption?

PDF document structure is not completely encrypted - you can still see the most of the structure, but not the content of streams, so you see that there is an embedded file and the type of the file. 

>   - My relatively easy way of fixing the PDF2 encrypted hybrids is to do the
> same thing I did for PDF2 encrypted none-hybrid, and use Poppler to do the
> hybrid extraction; then we can lose our extra PDF parser code.

The problem is only that Poppler is optional, so it might not be available, which means this functionallity will also not be available. But then the issue in general is that this is implemented inside PDF import code and not as a independent filter, which it could be done easily.

Also we use PDFium mostly in the core nowadays for various tasks and I think that will become available by default much more likely than Poppler, which is only used for PDF import.

Comment 33 Caolán McNamara 2025-03-25 12:31:43 UTC

You can probably disregard detectHasAdditionalStreams as just an optimization for the state of things at the time it was added.

Comment 34 Dave Gilbert 2025-03-26 01:46:13 UTC

(In reply to Tomaz Vajngerl from comment #32)
> (In reply to Dave Gilbert from comment #30)
> > (Copying Caolán in due to the hybrid performance hack)
> > Hi Tomaz,
> >   Thanks for replying.
> > For context, one reason I'm asking abotu this stuff is that I just fixed the
> > PDF import filter to handle PDF2 encryption, but not with hybrid yet, and I
> > was looking at how to fix that, and know a fairly easy way to do it - but
> > understanding the details of hybrid seems to make sense first!
> > 
> > >>   a) Without a trailer, I don't understand how a hybrid loader is supposed
> > >> to understand that a PDF embedded file is actually a hybrid PDF rather than
> > >> just a PDF that happens to have an attached; A PDF could have loads of
> > >> embedded files and none of them actually represent the same contents as the
> > >> PDF (e.g. maybe include raw data spreadsheets or something with an
> > >> explanatory PDF).
> > 
> > > Using a filename convention "Original.od*" and making sure it's a compatible document. 
> > > Or in PDF 2.0 - the embedded file that is  /AFRelationship is /Source.   
> > 
> > OK, I don't know the details of PDF 2.0 yet, so I'll take your word for
> > that; but is that something the writer already does?
> 
> It does in master when the output format is PDF 2.0. 
> 
> Also a fairly good type of detection is to check if there is just one
> embedded file that is ODF. That's what will be the case in case of hybrid
> files. If there are more leave it up to the user to open the correct source
> in the PDF viewer - if it indeed is a hybrid file.  

OK, that feels fairly simple; does that cover for all old Hybrid files?

> > Before I break it, I'd like to give Caolán (copied in) a chance to say stuff.
> > My feeling is that Caolán hack is unreasonably fast - it was driven by this
> > ticket:
> > https://github.com/CollaboraOnline/online/issues/7307
> 
> Well, sure, but parsing the PDF document structure only is not very complex
> either. The problem is that we do a full PDF parse (I guess) in
> getAdditionalStream with calling the pdfparse::PDFReader::read method. 
> 
> Instead we could've used a modified PDFDocument::Read just for detection
> that doesn't do any unnecessary things. So to detect an embedded file you
> can read the xref table, then check the trailer for /Root which should point
> you to the object with the /Catalog dictionary and from there you need to
> get the /Names entry and check the /EmbeddedFiles... Those things should eb
> fast as you just jump around in the file from one location to the other and
> do minimal reads.
> 
> Example of PDF Catalog dictionary with embedded files:
> 16 0 obj
> <<
>   /Type/Catalog
>   /Pages 6 0 R
>   /Names 
>   <<
>     /EmbeddedFiles 
>     <<
>       /Names [<FEFF004F0072006900670069006E0061006C002E006F00640074> 2 0 R]
>     >>
>   >>
>   /OpenAction[3 0 R /XYZ null null 0]
>   /Lang(en-GB)
> >>
> endobj
> 

<See below>

> > Hmm so we don't - we really should have a test for each of the hybrid types
> > we support to make sure we can read it!  We've already got a couple in
> > there, but not any of the new types (which we can't read!)
> 
> As I said you can just go into the PDF viewer and open the embedded document
> there. I think this should be the preferred way of doing it anyway, so I
> don't think it's such a big problem, but we need to make users aware that
> they can access the original document in this way as well, and this is
> probably the biggest issue.
> 
> > Also:
> >   - Do we have a 'spec' of hybrid anywhere?
> 
> No.

Feels like we should.

> >   - If we change it what's the rules on compatibility for old docs?
> 
> Not breaking the compatibility, so reading /AdditionalStreams should still
> be supported. 

Gah OK, that was a bit I hadn't realised - but then we do have 
    ./sw/qa/extras/pdf/data/Hybrid_AdditionalStreamsOnly.pdf 

as a test for that case, where there is no PDF EmbeddedFile

So we're stuck in having to do both of them for some long time:
  a) Check for an AdditionalStreams (required for old old files)
  b) Check for EmbedeedFile named Original.o* with correct MIME (required for current PDF/UA, etc)   - not currently done

> Writing /AdditionalStreams is probably something we can stop adding at some
> point (for plain PDFs).


> 
> >   - My guess for DocumentChecksum had been that it was intended to stop the
> > pdf and the original content getting out of sync - but that would want a
> > checksum of the embedded document, not the whole file wouldn't it?
> 
> All I can see is that we check the PDF file is not corrupted by checking the
> hashes - I don't think it has any purpose except maybe to not prevent
> opening the source documnet if the PDF was modified (for example -
> annotation was added), which may not make sense actually. I would remove
> that completely. 

OK, tht's not hard.
 
> >   - If we needed to check the embedded file name and/or the  AFRelationship
> > you mentioned, is that doable on a file prior to decryption?
> 
> PDF document structure is not completely encrypted - you can still see the
> most of the structure, but not the content of streams, so you see that there
> is an embedded file and the type of the file. 

OK.
 
> >   - My relatively easy way of fixing the PDF2 encrypted hybrids is to do the
> > same thing I did for PDF2 encrypted none-hybrid, and use Poppler to do the
> > hybrid extraction; then we can lose our extra PDF parser code.
> 
> The problem is only that Poppler is optional, so it might not be available,
> which means this functionallity will also not be available. But then the
> issue in general is that this is implemented inside PDF import code and not
> as a independent filter, which it could be done easily.
> 
> Also we use PDFium mostly in the core nowadays for various tasks and I think
> that will become available by default much more likely than Poppler, which
> is only used for PDF import.

I'm up for rewriting the hybrid Detect/extract using PDFium rather than the other inbuilt
parser in  sdext/source/pdfimport/pdfparse,  but the need to keep parsing the trailer for AdditionalStreams might make that a lot more tricky;  Poppler parses the trailer and it's easy just to ask it for it:

  pDocUnique->getXRef()->getTrailerDict()->dictLookup("AdditionalStreams")
(Which seems to work - add error checking to taste)
I'm not seeing a similar interface in PDFium - the nearest seems to be 'getTrailerEnds' - which I think is just start and end bytes of (each??) trailer.
Do we have anything that parses a trailer using PDFium?

Comment 35 Miklos Vajna 2025-03-26 08:51:36 UTC

The pdfium API is a bit higher level, see if FPDFDoc_GetAttachmentCount(), FPDFDoc_GetAttachment(), FPDFAttachment_GetName() and so in fpdf_attachment.h provides what you need.

We also have a C++ wrapper around these in vcl, so you don't need to put ifdefs everywhere in the code for the --disable-pdfium case.

getTrailerEnds() is just for signatures that need to know about previous trailers; for type detect you can ignore all the non-last trailers, I would expect.

Comment 36 Dave Gilbert 2025-03-26 13:05:45 UTC

(In reply to Miklos Vajna from comment #35)
> The pdfium API is a bit higher level, see if FPDFDoc_GetAttachmentCount(),
> FPDFDoc_GetAttachment(), FPDFAttachment_GetName() and so in
> fpdf_attachment.h provides what you need.

Those I found, they get me the embeddedfiles which is great, but I still need to get
the trailer for the older files.

> We also have a C++ wrapper around these in vcl, so you don't need to put
> ifdefs everywhere in the code for the --disable-pdfium case.

Nice.

> getTrailerEnds() is just for signatures that need to know about previous
> trailers; for type detect you can ignore all the non-last trailers, I would
> expect.

But that seems the only API it's got for getting any trailer info isn't it?


Dave

Comment 37 Miklos Vajna 2025-03-27 07:32:59 UTC

> But that seems the only API it's got for getting any trailer info isn't it?

Ah, I see. The pdfium API is intentionally high-level, so most probably it won't give you access to raw dictionaries, and high-level API just exposes standard properties. So one way to get around this is to ask on the pdfium mailing list if it would be accepted to add API to get the ODF content for old, non-standard PDFs -- if so, this should be reasonably straightforward to implement (at least the detection part).

Comment 38 Dave Gilbert 2025-03-28 01:26:08 UTC

(In reply to Miklos Vajna from comment #37)
> > But that seems the only API it's got for getting any trailer info isn't it?
> 
> Ah, I see. The pdfium API is intentionally high-level, so most probably it
> won't give you access to raw dictionaries, and high-level API just exposes
> standard properties. So one way to get around this is to ask on the pdfium
> mailing list if it would be accepted to add API to get the ODF content for
> old, non-standard PDFs -- if so, this should be reasonably straightforward
> to implement (at least the detection part).

I think given that, I'll stick to my current intention of modifying the existing poppler code to do it for now; we can come back to pdfium-ising that when it can do it.
So, I'd look to:
  a) First fix encrypted PDF2 hybrid loading (which is what got me here in the first place)
  b) Lose the checksum check
  c) Make sure to handle hybrid with no trailer data, just the single embedded file named Original.o* with the correct mime type/extension

Comment 39 Miklos Vajna 2025-03-28 07:30:36 UTC

Sure, I'm not trying to tell you what to do. :-)

One aspect to keep in mind is the --enable-mpl-subset case, which lacks poppler. So if you switch the hybrid PDF detection to poppler, you may want to keep the old code for the --disable-poppler case, so your changes are not seen as a regression from that point of view. Thanks.

Comment 40 Dave Gilbert 2025-03-28 22:44:14 UTC

Tomaz:
  Actually, I might have a simplification - am I right in saying that there aren't any valid Hybrids that are:

    a) AES encrypted (PDF2 ????)
and
    b) Only have the old trailer syntax (not an embedded file)

If those don't exist, then it simplifies things a lot; in theory then I can use PDFium to deal with modern files (which might be AES encrypted), and the existing path to deal with old trailer syntax files ?

Comment 41 Tomaz Vajngerl 2025-03-29 05:20:33 UTC

C(In reply to Dave Gilbert from comment #40)
> Tomaz:
>   Actually, I might have a simplification - am I right in saying that there
> aren't any valid Hybrids that are:
> 
>     a) AES encrypted (PDF2 ????)
> and
>     b) Only have the old trailer syntax (not an embedded file)

Correct, the only way to get PDF 2.0 encryption is in master selecting PDF 2.0 as the output. This will always write the document as embedded file. 

> If those don't exist, then it simplifies things a lot; in theory then I can
> use PDFium to deal with modern files (which might be AES encrypted), and the
> existing path to deal with old trailer syntax files ?

Yes, that's a possible solution.

Comment 42 Dave Gilbert 2025-03-29 12:32:21 UTC

Excellent, right that gives me something to get on with.

Comment 43 Dave Gilbert 2025-04-03 00:58:04 UTC

Hmm this was looking promising, until I've hit another thing PDFium doesn't seem to implement.
I can't see a way of getting the MIME type of the attachment being loaded to check it matches:

            auto pAttachment = pPdfiumDoc->getAttachment(0);
            auto aName = pAttachment->getName();

which is great, gets me the "Original.odt"

I was trying to use FPDFAttachment_GetStringValue on that attachment,
but that gets an entry out of the 'embedded file *parameter* dictionary' where as the mime type is the Subtype key in the 'embedded file *stream* dictionary' (table 45/46 of the PDF spec I have).
I guess I could just go off the filename, but that feels delicate - anyone any ideas how to get the mimetype out of PDFium?

Comment 44 Miklos Vajna 2025-04-03 07:19:22 UTC

I would again suspect that simply there is no public pdfium API to provide that info, it's something that could be added.

In pdfium.git, fpdfsdk/fpdf_attachment.cpp seems to be the code for this, if you read the FPDFAttachment_SetFile() implementation, you can see how it only sets the Params -> Size key on the stream, and there is nothing similar in FPDFAttachment_Get*().

Comment 45 Michael Meeks 2025-04-03 08:44:04 UTC

I guess we are well up for patching PDFium to add more API that we need here :-)

Comment 46 Dave Gilbert 2025-04-03 11:55:18 UTC

Yeh I had a dig about in fpdfsdk/fpdf_attachment.cpp before posting.
I'll start by posting to the PDFium list.

How do people feel about just going with Original.oXX (with oXX being checked for sanity) until PDFium is improved?

Comment 47 Dave Gilbert 2025-04-03 23:47:58 UTC

Question posted to:
https://groups.google.com/g/pdfium/c/W9XS-2GIECA

Comment 48 Dave Gilbert 2025-04-04 01:09:37 UTC

(In reply to Dave Gilbert from comment #47)
> Question posted to:
> https://groups.google.com/g/pdfium/c/W9XS-2GIECA

I was pointed to create a bug entry, so we now have:

https://issues.chromium.org/issues/408241034

Comment 49 Tomaz Vajngerl 2025-04-04 06:13:54 UTC

(In reply to Dave Gilbert from comment #46)
> How do people feel about just going with Original.oXX (with oXX being
> checked for sanity) until PDFium is improved?

Sure, it should be safe to rely on the filename only.

BTW you can still cheaply check if the byte stream is ODF too by checking the stream's magic bytes. You will need to change PDFAttachment::getFile to only get a limited number of bytes of the attachment, not the whole attachment.

Comment 50 Dave Gilbert 2025-04-04 12:59:39 UTC

(In reply to Tomaz Vajngerl from comment #49)
> (In reply to Dave Gilbert from comment #46)
> > How do people feel about just going with Original.oXX (with oXX being
> > checked for sanity) until PDFium is improved?
> 
> Sure, it should be safe to rely on the filename only.

Good, let me get this set out of the way then I can swing back and attack PDFium.

> BTW you can still cheaply check if the byte stream is ODF too by checking
> the stream's magic bytes. You will need to change PDFAttachment::getFile to
> only get a limited number of bytes of the attachment, not the whole
> attachment.

I'll probably leave that for now; probably best to spend the cycles on the PDFium fix.

Comment 51 Dave Gilbert 2025-04-07 00:23:52 UTC

OK, I have a WIP at:
https://gerrit.libreoffice.org/c/core/+/183765
(Not added the password check yet, so doesn't work for encrypted files,
and I've not tidied any of the style up yet, and it's one big patch; danger: Do not read on a full stomach yet;   I'll tidy it up now and add some tests etc)

Comment 52 Dave Gilbert 2025-04-09 01:06:26 UTC

A problem with the 'EmbeddedFile' approach is that it's not readable until after decryption - and as far as I can tell the filename isn't either.
That means we need to be able to decrypt in the ::detect method of the filter - the existing code wasn't trying to password prompt in the ::detect method, but it seems to work if I enable it.

Does anyone know if that's legal to be able to do a password entry there?

Comment 53 Dave Gilbert 2025-04-12 15:46:02 UTC

OK, the series at:

https://gerrit.libreoffice.org/c/core/+/183765/4

allows loading of Hybrids looking only at the embeddedFile.
If the embedded file is not found it checks the old scheme as well; so it should cover
all old files.
It incldues a test of loading a PDF/UA 2.0 hybrid with a password.

Reviews welcome.

Comment 54 Commit Notification 2025-04-21 06:43:14 UTC

Dr. David Alan Gilbert committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/9b401a09c5cde0250457f45d9540ee15ab8e43f9

tdf#55425, tdf#66580: sdext,pdfimport: Start detecting newer hybrids

It will be available in 25.8.0.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.

Comment 55 Commit Notification 2025-04-21 13:16:54 UTC

Dr. David Alan Gilbert committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/bc680499b5fa782c7396f7ab688c8f5498b58b46

tdf#55425, tdf#66580: sdext,pdfimport: embeddedFile detection

It will be available in 25.8.0.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.

Comment 56 Commit Notification 2025-04-22 02:37:13 UTC

Dr. David Alan Gilbert committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/b7a8ce65b47bd85e31ae3ff8e81f23f79380de87

tdf#55425, tdf#66580: sdext,pdfimport: Filename/MIME lookup

It will be available in 25.8.0.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.

Comment 57 Commit Notification 2025-04-22 02:38:18 UTC

Dr. David Alan Gilbert committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/0d4d2563091eccb125a3d0f11b2df858d059d6a7

tdf#55425, tdf#66580: sdext,pdfimport: Extract the embedded file

It will be available in 25.8.0.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.

Comment 58 Commit Notification 2025-04-22 02:38:22 UTC

Dr. David Alan Gilbert committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/f3fd83098f3020c9f69032911754a5c1d2fc4fd5

tdf#55425,tdf#66580: sdext,pdfimport: Always stash password

It will be available in 25.8.0.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.

Comment 59 Commit Notification 2025-04-22 02:38:27 UTC

Dr. David Alan Gilbert committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/7c46138b61e1dfa9b5d649d6a797737304d225fa

tdf#55425, tdf#66580: sdext,pdfimport: Tame password prompts

It will be available in 25.8.0.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.

Comment 60 Commit Notification 2025-04-22 02:39:31 UTC

Dr. David Alan Gilbert committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/d629416be0a3742afda0f5c50c2f80a15553143c

tdf#55425, tdf#66580: sdext,pdfimport: embeddedFile password entry

It will be available in 25.8.0.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.

Comment 61 Dave Gilbert 2025-04-22 10:53:29 UTC

Great; so that means we can now read PDFs without the custom keys in, with the old stuff still supported.
I'm not sure what else could be done for this bug.
The only thing I've got outstanding is to dive into the guts of pdfium and add something to read the mime type at some point and use it.

Comment 62 Buovjaga 2025-05-17 11:17:27 UTC

(In reply to Dave Gilbert from comment #61)
> Great; so that means we can now read PDFs without the custom keys in, with
> the old stuff still supported.
> I'm not sure what else could be done for this bug.
> The only thing I've got outstanding is to dive into the guts of pdfium and
> add something to read the mime type at some point and use it.

Looks like Helmut Januschka did it a few weeks ago: https://pdfium-review.googlesource.com/c/pdfium/+/130970

Comment 63 Dave Gilbert 2025-05-17 12:13:53 UTC

Yeh I noticed that; got to dig myself out of some other bugs first.
The first thing to figure out is what ifdef etc you'd use to test for the version of pdfium new enough to use it with.