Bug 160033 - soffice builds of pdf files are unreproducible
Summary: soffice builds of pdf files are unreproducible
Status: NEW
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Printing and PDF export
Version: 24.2.0.3 release (earliest affected)
Hardware: All All
Importance: medium normal
Assignee: Not Assigned
URL:
Whiteboard:
Keywords: filter:pdf
Depends on:
Blocks: PDF-Export
Reported: 2024-03-04 21:05 UTC by Rene Engelhard
Modified: 2024-04-03 07:55 UTC (History)

See Also:
Crash report or crash signature:


Attachments
ODP test file used for illustration (13.34 KB, application/vnd.oasis.opendocument.presentation)
2024-03-28 18:08 UTC, tovrstra
HTML diff of PDFs with decompressed streams (30.93 KB, text/html)
2024-03-28 18:08 UTC, tovrstra

Description Rene Engelhard 2024-03-04 21:05:44 UTC
From https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1065448:

--- snip ---
Dear Maintainer,

When creating pdf files from odt files, soffice writes a CreationDate field
which contains the actual build date/time. This varies with every build.
For an example, see the bottom of
https://tests.reproducible-builds.org/debian/rb-pkg/trixie/amd64/diffoscope-results/winff.html

soffice could use the creation date of the input odt file,
or even SOURCE_DATE_EPOCH instead of the current system date.

Regards,
Peter
--- snip ---
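
To illustrate the suggestion quoted above, here is a minimal Python sketch of how an exporter could pick the timestamp for /CreationDate. It is not LibreOffice code: the function name `pdf_creation_date` and the `environ` parameter are hypothetical, and the sketch assumes SOURCE_DATE_EPOCH holds an integer number of seconds since the Unix epoch, as the reproducible-builds convention defines it.

```python
import os
from datetime import datetime, timezone


def pdf_creation_date(environ=None):
    """Return a PDF date string, preferring SOURCE_DATE_EPOCH over the wall clock.

    `environ` is a hypothetical injection point for testing; it defaults to
    the process environment.
    """
    env = os.environ if environ is None else environ
    epoch = env.get("SOURCE_DATE_EPOCH")
    if epoch is not None:
        # Reproducible path: the build system pins the timestamp (UTC seconds).
        moment = datetime.fromtimestamp(int(epoch), tz=timezone.utc)
    else:
        # Non-reproducible fallback: the current time, as happens today.
        moment = datetime.now(tz=timezone.utc)
    # PDF date syntax: D:YYYYMMDDHHmmSS plus a UTC offset; "Z" denotes UTC.
    return moment.strftime("D:%Y%m%d%H%M%SZ")


if __name__ == "__main__":
    print(pdf_creation_date({"SOURCE_DATE_EPOCH": "0"}))
```

With SOURCE_DATE_EPOCH=0 the function returns D:19700101000000Z on every run, which is the property a package build needs. Using the input file's own creation date, the other option mentioned in the report, would slot into the same branch.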

I think there was an option to omit the field entirely? But it is probably not exposed to command-line conversion unless you configure it separately?
Comment 1 Stéphane Guillou (stragu) 2024-03-19 07:37:37 UTC
Thorsten, what do you think?
Don't e.g. ODT files also store a timestamp at each save?
Comment 2 Rene Engelhard 2024-03-19 17:40:57 UTC
But odt files stay the same (unless changed and re-saved, of course), so they are by definition reproducible.

pdf files that are rebuilt from a .doc/.od? on every package build (as in this and other cases in Debian) differ each time.

(Or, if one wants to go that route, the "source file" (od?) stays the same anyway and the "binary" (pdf) changes. That's a possible analogy)
Comment 3 tovrstra 2024-03-28 18:08:07 UTC
Created attachment 193373
ODP test file used for illustration
Comment 4 tovrstra 2024-03-28 18:08:59 UTC
Created attachment 193374
HTML diff of PDFs with decompressed streams
Comment 5 tovrstra 2024-03-28 18:19:03 UTC
Reproducibility indeed does not seem possible at this stage.

I've attached an example to show that there is more going on than just different time stamps.

Steps to reproduce the example:

1. Export slide.odp to PDF twice.
2. Decompress streams in the two PDFs with mutool:

mutool clean -d slide1.pdf tmp1.pdf
mutool clean -d slide2.pdf tmp2.pdf

3. Generate diff html with vim:

vimdiff tmp1.pdf tmp2.pdf -c TOhtml -c 'w! diff.html' -c 'qa!'

There are four points where the PDFs differ:

- A binary stream (length is also different).
- xmp:CreateDate tag.
- /CreationDate field.
- PDF Trailer ID, which is just a random blob.
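
To check whether two exports differ in anything other than the three textual fields listed above, one could mask those fields before comparing. The following Python sketch does that on the decompressed files (tmp1.pdf and tmp2.pdf from the steps above). The helper names and the regular expressions are illustrative assumptions: they expect /CreationDate as a literal string in parentheses and the trailer /ID as two hex strings, and they are not a general PDF parser.

```python
import re

# One (pattern, placeholder) pair per volatile field; placeholders are arbitrary.
_VOLATILE = [
    (re.compile(rb"/CreationDate\s*\([^)]*\)"), b"/CreationDate(MASKED)"),
    (re.compile(rb"<xmp:CreateDate>[^<]*</xmp:CreateDate>"),
     b"<xmp:CreateDate>MASKED</xmp:CreateDate>"),
    (re.compile(rb"/ID\s*\[\s*<[0-9A-Fa-f]*>\s*<[0-9A-Fa-f]*>\s*\]"),
     b"/ID[<MASKED><MASKED>]"),
]


def mask_volatile(pdf_bytes: bytes) -> bytes:
    """Replace the creation dates and trailer ID with fixed placeholders."""
    for pattern, placeholder in _VOLATILE:
        pdf_bytes = pattern.sub(placeholder, pdf_bytes)
    return pdf_bytes


def differs_beyond_metadata(first: bytes, second: bytes) -> bool:
    """True if the files still differ once the volatile fields are masked."""
    return mask_volatile(first) != mask_volatile(second)


if __name__ == "__main__":
    import sys

    with open(sys.argv[1], "rb") as f1, open(sys.argv[2], "rb") as f2:
        print(differs_beyond_metadata(f1.read(), f2.read()))
```

For the attached example this would be expected to still report a difference, because the binary stream is not covered by any of the masks.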

As far as I understand, random trailer IDs are sometimes useful for document tracking, but they are not critical.

It would be helpful to have an option to create reproducible PDFs, e.g. with a command-line option, or to disable all variable parts when SOURCE_DATE_EPOCH is set.
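
As a rough illustration of what "disabling the variable parts" could mean for the trailer ID, the Python sketch below derives the ID from the document content and SOURCE_DATE_EPOCH when that variable is set, and falls back to random bytes otherwise. This is only an assumed design, not how LibreOffice computes the ID; the function name `trailer_id`, the `environ` parameter, and the choice of a truncated SHA-256 digest are all hypothetical.

```python
import hashlib
import os
import secrets


def trailer_id(document_bytes: bytes, environ=None) -> str:
    """Return a 16-byte trailer ID as an uppercase hex string.

    `environ` is a hypothetical injection point for testing; it defaults to
    the process environment.
    """
    env = os.environ if environ is None else environ
    epoch = env.get("SOURCE_DATE_EPOCH")
    if epoch is not None:
        # Deterministic: same content and same pinned timestamp give the same ID.
        seed = document_bytes + b"\x00" + epoch.encode("ascii")
        raw = hashlib.sha256(seed).digest()[:16]
    else:
        # Current behaviour as described in this report: a random blob.
        raw = secrets.token_bytes(16)
    return raw.hex().upper()


if __name__ == "__main__":
    print(trailer_id(b"example content", {"SOURCE_DATE_EPOCH": "0"}))
```

A content-derived ID still changes whenever the document changes, so it remains usable for telling documents apart while staying stable across rebuilds of the same input.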
Comment 6 Stéphane Guillou (stragu) 2024-04-03 07:55:02 UTC
OK, let's set as new.