Bug 66580 - exported PDF is invalid because of forbidden custom keys in the trailer
Status: NEW
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Printing and PDF export
Version (earliest affected): Inherited From OOo
Hardware: All All
Importance: medium normal
Assignee: Not Assigned
URL:
Whiteboard: target:7.6.0
Keywords: filter:pdf
Duplicates: 142051
Depends on:
Blocks: PDF-Export-Invalid
Reported: 2013-07-04 09:39 UTC by Jos van den Oever
Modified: 2024-03-14 06:38 UTC (History)
Description Jos van den Oever 2013-07-04 09:39:34 UTC
PDFs exported by LibreOffice contain the key /DocChecksum in the trailer dictionary. When an ODF document is embedded, they also contain the key /AdditionalStreams.

These keys are not defined in the PDF 1.4 specification. The specification forbids use of custom keys in the trailer:

===
A PDF producer or Acrobat plug-in extension may also add keys to any PDF
object that is implemented as a dictionary, except the file trailer dictionary (see Section 3.4.4, “File Trailer”). In addition, a PDF producer or Acrobat plug-in may create tags that indicate the role of marked-content operators (PDF 1.2), as described in Section 9.5, “Marked Content.”
===

A strict PDF validator would declare PDF documents saved by LibreOffice invalid.
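
For anyone who wants to check a file for this, here is a minimal Python sketch, not a validator. It assumes the file ends with a classic "trailer << ... >>" dictionary (files that use a cross-reference stream have none), and it uses a simplified tokenizer. The function name, the key set constant and the sample trailer are made up for illustration; the standard key list is the one from the File Trailer table of the PDF Reference.

```python
import re

# Keys defined for the file trailer dictionary in the PDF Reference
# (XRefStm only exists since PDF 1.5).
STANDARD_TRAILER_KEYS = {"Size", "Prev", "Root", "Encrypt", "Info", "ID", "XRefStm"}

# Simplified PDF tokenizer: dictionary/array delimiters, hex strings,
# literal strings without nested parentheses, names, everything else.
TOKEN = re.compile(
    rb"<<|>>|\[|\]|<[0-9A-Fa-f\s]*>|\((?:\\.|[^\\()])*\)"
    rb"|/[^\s/<>\[\]()]*|[^\s/<>\[\]()]+"
)

def custom_trailer_keys(pdf_bytes: bytes) -> list[str]:
    """Return top-level keys of the last trailer that are not standard keys."""
    pos = pdf_bytes.rfind(b"trailer")
    if pos < 0:
        return []  # no classic trailer (assumption: cross-reference stream)
    keys = []
    depth = 0          # nesting level of dictionaries and arrays
    expect_key = True  # at depth 1, names alternate between key and value
    for match in TOKEN.finditer(pdf_bytes, pos + len(b"trailer")):
        tok = match.group()
        if tok in (b"<<", b"["):
            depth += 1
        elif tok in (b">>", b"]"):
            depth -= 1
            if depth == 0:
                break              # end of the trailer dictionary
            if depth == 1:
                expect_key = True  # a nested value just ended
        elif depth == 1:
            if expect_key:
                if tok.startswith(b"/"):
                    keys.append(tok[1:].decode("latin-1"))
                    expect_key = False
                # other tokens here are the tail of "n g R" references
            else:
                expect_key = True  # first token of a simple value
    return [k for k in keys if k not in STANDARD_TRAILER_KEYS]

# Hypothetical trailer in the shape described in this report.
SAMPLE_TRAILER = b"""trailer
<< /Size 12 /Root 1 0 R /Info 2 0 R
/ID [ <AB12> <AB12> ]
/DocChecksum /0123ABCD
/AdditionalStreams [ /application#2Fvnd.oasis.opendocument.text 11 0 R ]
>>
startxref
1234
%%EOF
"""

print(custom_trailer_keys(SAMPLE_TRAILER))
```

Only the last trailer is inspected, so earlier revisions of an incrementally updated file are ignored.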
Comment 1 Cor Nouws 2013-07-04 09:55:09 UTC
Hi Jos,
Thanks for the report.
I set this to New, trusting your expertise in this ;)
Is this issue new in 4.1.0 beta2, or is it already present in older versions too?
Best,
Cor
Comment 2 Jos van den Oever 2013-07-04 10:06:15 UTC
The /DocChecksum and /AdditionalStreams were added to OpenOffice on 2007-03-26.

In the LibreOffice git repository this is commit d217c079d7b3ca7b5039428594e7cdfdf9a0c4a9
Comment 3 Cor Nouws 2013-07-04 10:34:02 UTC
Thanks! I changed the version to match your info.
Comment 4 kurt.pfeifle 2014-05-04 17:14:39 UTC
What will happen on this?

Do you need suggestions about how to implement these features in a spec conforming way?
Comment 5 kurt.pfeifle 2014-05-04 17:18:29 UTC
To give you a link to the relevant PDF-1.4 specification:

     http://acroeng.adobe.com/PDFReference/PDF_1.4/PDF%20Reference%201.4.pdf

The quote given by Jos in his bug report is from the page numbered 723 as printed (page 743 when counting from the first page of the file), in Appendix E, "PDF Name Registry".

Here is a recently created website holding *all* PDF specifications ever published by Adobe:

     http://acroeng.adobe.com/wp/?page_id=321
Comment 6 kurt.pfeifle 2014-05-09 21:12:44 UTC
According to these test results:

    https://docs.google.com/spreadsheets/d/1Ok37dvlRSpzKpdKJ6gYycM5QzM7sv4_YCybHbiFMVFI

none of the 36 different PDF viewers or applications had a problem displaying or processing the tested hybrid PDF created by LibreOffice.

Hence I took the liberty of lowering the importance of this bug considerably for now. I won't protest if someone even closes it as WONTFIX, unless other evidence of real-life problems appears...
Comment 7 QA Administrators 2015-06-08 14:41:36 UTC Comment hidden (obsolete)
Comment 8 QA Administrators 2016-09-20 10:00:33 UTC Comment hidden (obsolete)
Comment 9 Jos van den Oever 2016-09-20 10:59:11 UTC
PDF documents created with version 5.1.2.2.0 on Linux 4.4 still add the keys /DocChecksum and /AdditionalStreams to the PDF files.
Comment 10 kurt.pfeifle 2017-10-22 20:01:35 UTC
(In reply to Jos van den Oever from comment #9)
> PDF documents created with version 5.1.2.2.0 on Linux 4.4 still add the keys
> /DocChecksum and /AdditionalStreams to the PDF files.

Jos, the additional (proprietary) keys used by OpenOffice/LibreOffice to embed
the original OpenDocument file into the Hybrid PDF are not doing any real
h a r m:

  * As I showed in comment #6 none of the 36 tested PDF viewers has any problem
    opening and displaying a Hybrid PDF!

There are other reasons which would make  M E  want to modify the way LO creates
a Hybrid PDF:

  * N O N E  of the other PDF readers has a way to detect that there is an
    embedded OpenDocument file in the PDF!

The reason for this is that OO/LO implemented this feature in a non-standard, "proprietary" way, while they could have used the standards-defined "embed another file into the PDF" feature. (See for example bug 95328 and its comments.)

And there  A R E  good use cases to be able to detect the embeddedness of the
original OpenDocument file in a PDF even by non-OO/LO applications:

  - User(s) may not be aware of this when they open the PDF in a PDF reader.
    However, the reader may draw their attention to the fact of the original
    ODT/ODS document being embedded. After all, whoever embedded the original
    document into the Hybrid PDF most likely  W A N T E D  it to be editable.

  - Users may need/want to extract the embedded ODT/ODS file without switching 
    to LibreOffice first (which may not even be installed on their currently
    used computer system).

  - I could easily think of more use cases, why it would be good to be able to
    D E T E C T  the fact of the embedded original and editable file and also
    to  E X T R A C T  it from the PDF via a software other than OO/LO.
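
To illustrate the detection point, here is a rough Python sketch of what a generic tool could do when the standard mechanism is used. It only scans uncompressed objects for file specification dictionaries (/Type /Filespec) with a literal-string /F entry, so it will miss objects stored in compressed object streams and can be confused by "endobj" appearing inside stream data. The function name and both samples are hypothetical.

```python
import re

# One indirect object: "<num> <gen> obj ... endobj" (uncompressed only).
OBJ = re.compile(rb"\d+\s+\d+\s+obj\b(.*?)\bendobj", re.S)
FILESPEC = re.compile(rb"/Type\s*/Filespec\b")
# /F with a literal string value; "/EF << /F 3 0 R >>" does not match.
F_NAME = re.compile(rb"/F\s*\(((?:\\.|[^\\()])*)\)")

def attached_file_names(pdf_bytes: bytes) -> list[str]:
    """List the /F names of file specification dictionaries found in the file."""
    names = []
    for obj in OBJ.finditer(pdf_bytes):
        body = obj.group(1)
        if FILESPEC.search(body):
            name = F_NAME.search(body)
            if name:
                names.append(name.group(1).decode("latin-1"))
    return names

# Hypothetical fragment using the standard file specification mechanism.
STANDARD_SAMPLE = (
    b"3 0 obj\n<< /Type /EmbeddedFile /Length 4 >>\nstream\nPK..\nendstream\nendobj\n"
    b"4 0 obj\n<< /Type /Filespec /F (Original.odt) /UF (Original.odt) "
    b"/EF << /F 3 0 R >> >>\nendobj\n"
)

# Hypothetical fragment using only the custom trailer key.
LEGACY_SAMPLE = (
    b"11 0 obj\n<< /Length 4 >>\nstream\nPK..\nendstream\nendobj\n"
    b"trailer\n<< /Size 12 /Root 1 0 R /AdditionalStreams "
    b"[ /application#2Fvnd.oasis.opendocument.text 11 0 R ] >>\n"
)

print(attached_file_names(STANDARD_SAMPLE))
print(attached_file_names(LEGACY_SAMPLE))
```

The second sample yields nothing: a tool that only knows the standard structures has no way to notice a stream that is referenced solely from /AdditionalStreams.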
Comment 11 QA Administrators 2018-10-23 02:50:06 UTC Comment hidden (obsolete)
Comment 12 Alexis de Lattre 2020-01-02 12:20:36 UTC
It is very strange that a project such as LibreOffice, which promotes the OpenDocument standard and interoperability in general, doesn't respect the PDF standard! Adding proprietary keys to the PDF trailer that only LibreOffice can read is certainly bad practice. Using the "Embedded Files" feature of the PDF standard is clearly the way to go!

This could be the occasion to add support for Embedded Files to the LibreOffice PDF export. Embedded Files in PDF is becoming a widely used feature thanks to electronic invoicing standards such as Factur-X/ZUGFeRD, which use it to add an XML file to a PDF invoice (to allow automatic processing of the invoice) and to add other documents as attachments to the PDF (documents that justify the invoice, for example a signed acceptance form).

For instance, I recently developed a LibreOffice extension to generate Factur-X invoices from LibreOffice Calc (see https://github.com/akretion/factur-x-libreoffice-extension). This extension contains a Python macro that post-processes the PDF file generated by LibreOffice to add the XML file as an attachment to the PDF. The code of this macro would be much simpler if the PDF export feature of LibreOffice had native support for Embedded Files. And generating structured electronic invoices (with Factur-X, ZUGFeRD or other standards) is becoming compulsory in some countries (for example, it is now compulsory in France when you invoice the public sector).
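
To make the "Embedded Files" structure concrete, here is a small Python sketch of the objects involved: an embedded file stream, a file specification dictionary pointing to it, and the /Names entry that the document catalog needs. This is not the code of the extension mentioned above, and it is far from a complete post-processor: the function names and object numbers are made up, the file name is assumed to be plain ASCII without parentheses or backslashes, the stream is left uncompressed, and writing the cross-reference table and trailer is left out entirely.

```python
def embedded_file_objects(filename: str, data: bytes, mime: str,
                          stream_num: int, spec_num: int) -> bytes:
    """Serialize an embedded file stream and its file specification."""
    # The MIME type is written as a PDF name, so "/" becomes "#2F".
    subtype = "/" + mime.replace("/", "#2F")
    stream_obj = (
        f"{stream_num} 0 obj\n"
        f"<< /Type /EmbeddedFile /Subtype {subtype} /Length {len(data)} "
        f"/Params << /Size {len(data)} >> >>\n"
        "stream\n"
    ).encode("ascii") + data + b"\nendstream\nendobj\n"
    spec_obj = (
        f"{spec_num} 0 obj\n"
        f"<< /Type /Filespec /F ({filename}) /UF ({filename}) "
        f"/EF << /F {stream_num} 0 R >> >>\n"
        "endobj\n"
    ).encode("ascii")
    return stream_obj + spec_obj

def names_entry(filename: str, spec_num: int) -> bytes:
    """Catalog entry that makes the attachment discoverable by name."""
    return (
        f"/Names << /EmbeddedFiles << /Names [({filename}) {spec_num} 0 R] >> >>"
    ).encode("ascii")

# Hypothetical usage: object numbers 10 and 11 are assumed to be free.
objs = embedded_file_objects("factur-x.xml", b"<x/>", "text/xml", 10, 11)
print(objs.decode("ascii"))
print(names_entry("factur-x.xml", 11).decode("ascii"))
```

Because the attachment is reachable from the catalog's /Names tree rather than from the trailer, any viewer that supports attachments can list and extract it.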
Comment 13 Julien Nabet 2020-01-02 19:37:54 UTC
Michael/Miklos/Tomaž: I don't know who the PDF expert is, so I thought one of you might have some idea.

The problem here is that the "AdditionalStreams" keyword doesn't exist in the PDF standard.
Taking a look at git history of d217c079d7b3ca7b5039428594e7cdfdf9a0c4a9, it's been added with:
commit d217c079d7b3ca7b5039428594e7cdfdf9a0c4a9
Author: Ivo Hinkelmann <ihi@openoffice.org>
Date:   Mon Mar 26 10:21:15 2007 +0000
    INTEGRATION: CWS ipdf (1.92.80); FILE MERGED
    2007/01/19 16:08:58 pl 1.92.80.8: #137143# ecnrypt add streams
    2007/01/19 11:48:56 pl 1.92.80.7: RESYNC: (1.99-1.102); FILE MERGED
    2006/10/04 18:52:04 pl 1.92.80.6: RESYNC: (1.96-1.99); FILE MERGED
    2006/07/25 09:31:00 pl 1.92.80.5: RESYNC: (1.93-1.96); FILE MERGED
    2006/07/04 16:34:49 pl 1.92.80.4: removed a warning
    2006/07/04 13:48:22 pl 1.92.80.3: RESYNC: (1.92-1.93); FILE MERGED
    2006/06/26 15:00:09 pl 1.92.80.2: #137143# emit document checksum
    2006/06/12 16:53:42 pl 1.92.80.1: #137143# add AddStream interface

Shouldn't it be removed, made read-only (I mean LO may read this in old files but should replace the keyword when modifying), or at minimum be deprecated?
Instead there's "EmbeddedFiles" in the spec (see https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdf_reference_archive/pdf_reference_1-7.pdf)

LO should respect the PDF standard. I set this one to normal importance, but it should be even higher than this.
Comment 14 Jean-Baptiste Faure 2020-01-02 20:27:41 UTC
According to comment #13 I changed version to inherited from OOo.

Best regards. JBF
Comment 15 Michael Meeks 2020-01-02 20:28:34 UTC
The hybrid PDF functionality was a great innovation, and the standard didn't cover it then of course. It would be great to find some resources and/or interested people to implement the new standard using EmbeddedFiles. Alexis, are you interested in some code pointers there? Hacking the core to rename a few attributes and restructure the stream is likely to be a good start. I imagine a Collaboran would be happy to mentor someone who wanted to work on this themselves, but we can't resource a fix ourselves today without a customer.
Comment 16 Julien Nabet 2020-01-02 21:30:53 UTC
Thank you Michael for your very quick feedback! :-)

I took a look at https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/pdf_reference_archives/PDFReference.pdf which is the 1.4 version (released in 2001 according to https://fr.wikipedia.org/wiki/Portable_Document_Format#Versions). Version 1.4 already had the "EmbeddedFiles" keyword (see part 3.3).
So I wondered why the non-standard "AdditionalStreams" was added when this keyword already existed.
Or perhaps I misunderstood this?
Comment 17 kurt.pfeifle 2020-01-02 22:26:04 UTC
(In reply to Michael Meeks from comment #15)
> The hybrid PDF functionality was a great innovation, and the standard didn't
> cover it then of course.

Indeed the hybrid PDF functionality  i s   a great innovation.
However, it could even then have been (and still can be) implemented by using
the standard-conforming method of embedding the source file.

If done, this would have the advantage that every standard compliant PDF
viewer or PDF processing software could auto-discover the embedded source
file and let the user "do something" with it even in the absence of a
LibreOffice installation on his system.
Comment 18 Julien Nabet 2020-01-03 10:12:55 UTC
Here are some code pointers:
https://opengrok.libreoffice.org/search?project=core&full=AdditionalStreams&defs=&refs=&path=&hist=&type=&si=full

To create the pdf:
emitTrailer() method from vcl/source/gdi/pdfwriter_impl.cxx

The rest seems related to PDF import
Comment 19 V Stuart Foote 2021-05-03 19:21:22 UTC
*** Bug 142051 has been marked as a duplicate of this bug. ***
Comment 20 Commit Notification 2023-01-24 10:50:55 UTC
Tomaž Vajngerl committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/e052f6e1d49a5289411b31561d6e310bf414d896

tdf#66580 write ODF document as an attachment in hybrid mode

It will be available in 7.6.0.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 21 Commit Notification 2023-01-25 14:57:08 UTC
Tomaž Vajngerl committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/8640e24b12c7df3170c9f3e7ff3edced81fd0838

tdf#66580 added hybrid PDF test cases

It will be available in 7.6.0.

Comment 22 Commit Notification 2023-02-01 02:11:39 UTC
Tomaž Vajngerl committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/9740331d8bc56a9b6fbe3e4c1b26fb97f6639cc6

tdf#66580 write more metadata to embedded and attached files

It will be available in 7.6.0.

Comment 23 Stéphane Guillou (stragu) 2023-12-07 10:14:11 UTC
Tomaž mentioned in e052f6e1d49a5289411b31561d6e310bf414d896 that /AdditionalStreams was kept for backward-compatibility - "for now".
What do you think should be done with this ticket? Close it as "fixed" even though some of it is "won't fix"? Or should we consider breaking backward compatibility in the future, with some warning to users? Or a third option: yet another option for PDF export?
Comment 24 peter.wyatt 2024-03-14 06:38:36 UTC
The latest ISO standard for PDF (ISO 32000-2:2020) no longer prohibits custom keys in trailers - in fact, it explicitly permits keys that are 2nd-class names: "The PDF file trailer dictionary may also contain any second-class name as described in Annex E, "Extending PDF"."

Historically the issue was that a lot of software did not preserve all custom trailer entries when applying incremental updates, so software that relied on their presence broke if such files were later edited. Also, conventional trailer dictionaries don't exist when using (non-hybrid) cross-reference streams, and software that converts conventional PDFs to cross-reference stream-based PDFs also needs to remember to copy custom keys across when converting. Basically, it's high risk putting stuff there...

Depending on what the data is (just DocChecksum??), there are more appropriate and reliable places to save the data in the PDF DOM: Document Catalog PieceInfo, XMP Metadata, use a proper DigSig if the checksum is somehow important for detecting tampering, ...