Bug 95328 - Hybrid PDF -- implement PDF 1.4 level support for Embedded File Streams to expose ODF object stream as Attachment to the PDF document
Status: NEW
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Printing and PDF export
Version: unspecified (earliest affected)
Hardware: All All
Importance: low enhancement
Assignee: Not Assigned
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: PDF-Export
Reported: 2015-10-26 11:33 UTC by azrdev
Modified: 2017-06-02 21:39 UTC (History)
5 users (show)

See Also:
Crash report or crash signature:


Description azrdev 2015-10-26 11:33:24 UTC
Currently "embedded OpenDocument files" are just appended "outside" the PDF file's scope, so PDF readers ignore them. However, PDF has a standard means of including attachments in the document, which makes them show up in viewers (such as Acrobat and Okular) and exportable from there.
Comment 1 Buovjaga 2015-11-03 09:09:44 UTC
All right -> NEW
Comment 2 V Stuart Foote 2015-11-03 14:09:02 UTC
First thought was to close as wontfix, and still leaning that direction.

Our current Poppler driven Hybrid PDF is generated as PDF 1.4 and is sufficient, and yes it appends our source ODF to PDF. It is not attached to the document, but that feature of PDF comes in at PDF 1.7 with ISO 32000-1 compliance.

Believe our Poppler integration is capable of generating that level of PDF, but suspect it would require a lot of work.

However, other than this issue of reworking our Hybrid PDF, more pressing compliance issues of producing accessible PDF/A-1a and PDF/UA (ISO 14289-1)[ bug 45636 ] probably already require a more complete implementation of ISO 32000-1 and production of PDF 1.7 level documents. While tackling those, that would probably be the time to change our handling of Hybrid PDF (both export and import) to use the ISO 32000-1 provided attachment(s) to PDF document.

@David T. -- worth pursuing?
Comment 3 kurt.pfeifle 2015-11-03 17:14:49 UTC
(In reply to V Stuart Foote from comment #2)
> First thought was to close as wontfix, and still leaning that direction.
> 
> Our current Poppler driven Hybrid PDF is generated as PDF 1.4 and is
> sufficient, and yes it appends our source ODF to PDF. It is not attached to
> the document, but that feature of PDF comes in at PDF 1.7 with ISO 32000-1
> compliance.

Last half sentence is not factually correct. The feature to attach files to a PDF was specified in PDF-1.3 for the first time. See table 169, item "FileAttachments" on page 390 of the Adobe version of the PDF-1.7 spec (filename "PDF32000_2008.pdf").
 
> Believe our Poppler integration is capable of generating that level of PDF,
> but suspect it would require a lot of work.

I do not think so. It should be no more difficult than the current ("proprietary") implementation that is used by OpenOffice and LibreOffice.

(There may be other arguments which could be made in favor of keeping it as it is, and make sure no other application could extract the embedded ODT source file. But I haven't heard any so far...)
 
> However, other than this issue of reworking our Hybrid PDF, more pressing
> compliance issues of producing accessible PDF/A-1a and PDF/UA (ISO 14289-1)[
> bug 45636 ] probably already require a more complete implementation of ISO
> 32000-1 and production of PDF 1.7 level documents.

* PDF/UA: Yes, that requires support for PDF-1.7.

* PDF/A-1a: No. PDF/A-1b (basic) and PDF/A-1a (accessible) both even *forbid* the use of any newer features introduced after PDF-1.4 and up to PDF-1.7. They *require* limitation to PDF-1.4 features (and have a few other feature requirements and prohibitions).

However, producing a PDF/UA file does not require supporting *ALL* features introduced between PDF-1.4 and PDF-1.7. It is just that if you want to make a compliant PDF/UA, you'll have to support *some* of the new features, which in turn means you have to declare the file as PDF-1.7.
Comment 4 kurt.pfeifle 2015-11-03 17:24:32 UTC
@V Stuart Foote: 

I'd like to ask you to roll back the change in headline for this bug. The new version reads:

  "Hybrid PDF change and use PDF 1.7 attachment to PDF,
   rather than current append to PDF 1.4."

The old version did read:

  "PDF export: embed opendocument file as PDF attachment"

The new headline extends the scope of the original request in a very significant way.

If you indeed want PDF-1.7 features implemented (I myself would like to see such things too!), then you should create a new bugzilla entry.

The original request was to modify the "hybrid PDF" creation feature in a way that embeds the ODT with "PDF standard" methods (and also to modify accordingly the way LibreOffice reads these files), instead of doing it in the "LibreOffice/OpenOffice" proprietary way.
Comment 5 V Stuart Foote 2015-11-03 17:40:29 UTC
(In reply to kurt.pfeifle from comment #4)
> I'd like to ask you to roll back the change in headline for this bug. 

> ...

> The original request was to modify the "hybrid PDF" creation feature in a
> way that embeds the ODT with "PDF standard" methods (and also to modify
> accordingly the way LibreOffice reads these files), instead of doing it in
> the "LibreOffice/OpenOffice" proprietary way.

Kurt, thanks. But, I firmly believe it is correct to ask devs to move the bar--that means implementing ISO 32000-1 and declaring PDF 1.7

For Hybrid PDF that is notably an internal to LibreOffice usage--why would we *ONLY* rework the Hybrid PDF to gain functions of "attachment to document" rather than continuing to append as we do now?  We'd have to do the work on both ends anyway -- our filter export to PDF, changing from appending to embedding in some Adobe defined method pre-ISO 32000-1--coupled with having to filter the opening of said Hybrid PDF to get at the ODF "attachment".

IMHO much better if this is to be done, to do it in context of ISO 32000-1 and PDF 1.7, or even future ISO 32000-2.

You seem pretty up to date on the standards, and the history, are you willing to pitch in and give the devs a hand in scoping the requirement?

Stuart
Comment 6 kurt.pfeifle 2015-11-03 18:04:46 UTC
(In reply to V Stuart Foote from comment #5)
> (In reply to kurt.pfeifle from comment #4)
> > I'd like to ask you to roll back the change in headline for this bug. 
> 
> > ...
> 
> > The original request was to modify the "hybrid PDF" creation feature in a
> > way that embeds the ODT with "PDF standard" methods (and also to modify
> > accordingly the way LibreOffice reads these files), instead of doing it in
> > the "LibreOffice/OpenOffice" proprietary way.
> 
> Kurt, thanks. But, I firmly believe it is correct to ask devs to move the
> bar--that means implementing ISO 32000-1 and declaring PDF 1.7

Ok, I agree.

But for *that*, please create a new bugzilla item. Don't hijack an existing item. Or don't artificially extend the scope of the current item and thusly create an excuse to close it.

> For Hybrid PDF that is notably an internal to LibreOffice usage--why would
> we *ONLY* rework the Hybrid PDF to gain functions of "attachment to
> document" rather than continuing to append as we do now? 

Because we could then detach the embedded ODT on systems where there is not even LibreOffice installed?

In the interest of transparency we could recognize the fact that there is an embedded ODT file in the PDF by other means than opening it in LO?

In the interest of efficiency, we could remove the embedded file, should we (for whatever reason) want to create a smaller PDF (and keep all the original PDF objects) when we do not have LO around to open it and re-create the PDF again (which may change a lot of the objects, when your LO is a newer version)?

There are some other features to gain from such a rework too...

> We'd have to do
> the work on both ends anyway -- our filter export to PDF, changing from
> appending to embedding in some Adobe defined method pre-ISO 32000-1

Wrong. The method was originally "some Adobe defined method pre-ISO 32000-1", yes. But this exactly identical method was kept and preserved in ISO 32000-1.

> --coupled
> with having to filter the opening of said Hybrid PDF to get at the ODF
> "attachment".

Oh! And how do you currently open a Hybrid PDF with LO?!??

(You look at the key special entry in the trailer section. You can still keep that, or a slightly modified key entry for LO's benefit to make it easier to recognize its own source document format as being embedded. But add the other modifications for the benefit of other PDF processing applications to recognize the embedded file.)
 
> IMHO much better if this is to be done, to do it in context of ISO 32000-1
> and PDF 1.7, or even future ISO 32000-2.

Create a new, your own, bugzilla entry. Don't hijack other people's entries for a different purpose.

> You seem pretty up to date on the standards, and the history, are you
> willing to pitch in and give the devs a hand in scoping the requirement?

Yes, but I'm unsure how to do it. 

My guess is that if the devs are willing to read up a few spots in the ISO 32000-1 document, they'll very quickly see what they need to change. These devs, after all, were very creative in coming up with a non-standard, "proprietary" way to create that feature without making the PDF invalid and without openly violating any of the PDF specs :-)    They may even have had some valid reasons and very sane considerations for choosing that path -- however, I haven't seen any yet; if they exist(ed), I'm not aware of them.
Comment 7 V Stuart Foote 2015-11-03 20:07:28 UTC
(In reply to kurt.pfeifle from comment #6)

> But for *that*, please create a new bugzilla item. Don't hijack an existing
> item. Or don't artificially extend the scope of the current item and thusly
> create an excuse to close it.
> 

As I said, my first inclination was to Close as Wontfix the original suggestion; Hybrid PDF is a corner case use that functions well for LibreOffice needs as implemented. Very comfortable to have done that ;-)

Beyond that--it is already "hijacked" and clarifies what was a poorly composed enhancement request--that would have been closed otherwise. 

It has now been given an appropriate extended scope & summary, the what and why of implementing ISO 32000-1 attachment rather than our existing appending--and in passing fulfilling the OP's more limited request.


(In reply to kurt.pfeifle from comment #6)

> Oh! And how do you currently open a Hybrid PDF with LO?!??
> 
> (You look at the key special entry in the trailer section. You can still
> keep that, or a slightly modified key entry for LO's benefit to make it
> easier to recognize its own source document format as being embedded. But
> add the other modifications for the benefit of other PDF processing
> applications to recognize the embedded file.)

Not sure... Reading the code, for import it looks like we parse everything beyond the PDF ending--and extract our content streams of interest matched only against MIME type. When exporting, we end the PDF stream, and then append a stream holding the source ODF archive. So we are not now inside the PDF structure at all.

=-refs-=

Import
http://opengrok.libreoffice.org/xref/core/sdext/source/pdfimport/filterdet.hxx#94
http://opengrok.libreoffice.org/xref/core/sdext/source/pdfimport/filterdet.cxx#305

Export
http://opengrok.libreoffice.org/xref/core/officecfg/registry/schema/org/openoffice/Office/Common.xcs#5195
http://opengrok.libreoffice.org/xref/core/filter/source/pdf/pdfexport.cxx#261
Comment 8 kurt.pfeifle 2015-11-03 21:00:25 UTC
(In reply to V Stuart Foote from comment #7)

> It has now been given an appropriate extended scope & summary, the what and
> why of implementing ISO 32000-1 attachment 

As I explained already: it is *not* the (scary sounding) PDF-1.7/ISO 32000-1
compliant file attachment that should be implemented. It is the PDF-1.3 one....

[....]

> > Oh! And how do you currently open a Hybrid PDF with LO?!??
> > 
> > (You look at the key special entry in the trailer section. You can still
> > keep that, or a slightly modified key entry for LO's benefit to make it
> > easier to recognize its own source document format as being embedded. But
> > add the other modifications for the benefit of other PDF processing
> > applications to recognize the embedded file.)
> 
> Not sure... Reading the code, for import it looks like we parse everything
> beyond the PDF ending--and extract our content streams of interest matched
> only against MIME type.

I can't read the code... but I can read+analyze the PDF's source code.

For *importing* a PDF, there are two possibilities:

1. You have an OO-/LO-generated "hybrid" PDF: discover+extract the ODT stream
2. You have a non-hybrid PDF: open it with LO-Draw.

For case (1), the most efficient procedure would be:

1. Read the PDF trailer. It is at the end of the file. Each and every PDF 
   reader has to do that and has to start reading there. The trailer contains
   an entry pointing to the byte offset to the start of the xref table.

2. For OO-/LO-generated PDFs, the trailer also contains the "proprietary"
   key:

     /AdditionalStreams [/application#2Fvnd#2Eoasis#2Eopendocument#2Etext 6 0 R]

   This key names the PDF object number (here: object number 6) which has a
   stream that contains the ODT document.

3. Jump to the xref table and read it. The xref table contains a list of 
   all used PDF objects and their respective file offsets.

4. Jump to the byte offset named for the object no. 6 and extract the stream
   content. The stream content is a 1:1 copy of the original ODT file.

5. Only if you skip step (2) above would you have to "parse
   everything beyond the PDF ending and extract our content streams of 
   interest".
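
Steps (1) and (2) above can be sketched in Python roughly as follows. This is an illustration only, not LibreOffice's import code: the function names decode_pdf_name and find_additional_stream are hypothetical, the sample trailer is hand-written (its /Size, /Root and startxref values are made up), and a regular expression stands in for a real PDF parser.

```python
import re


def decode_pdf_name(name: bytes) -> str:
    """Decode #xx hex escapes in a PDF name object (e.g. #2F -> '/')."""
    decoded = re.sub(
        rb"#([0-9A-Fa-f]{2})",
        lambda m: bytes([int(m.group(1), 16)]),
        name,
    )
    return decoded.decode("latin-1")


def find_additional_stream(pdf: bytes):
    """Return (mime_type, object_number, generation), or None if absent.

    Assumption: the file has a classic 'trailer' keyword and the
    /AdditionalStreams array holds exactly one name/reference pair.
    """
    # The trailer sits near the end of the file, so search backwards.
    pos = pdf.rfind(b"trailer")
    if pos == -1:
        return None
    match = re.search(
        rb"/AdditionalStreams\s*\[\s*/(\S+)\s+(\d+)\s+(\d+)\s+R\s*\]",
        pdf[pos:],
    )
    if match is None:
        return None
    return (
        decode_pdf_name(match.group(1)),
        int(match.group(2)),
        int(match.group(3)),
    )


# Hand-written sample trailer; only the /AdditionalStreams entry is
# taken from the comment above, everything else is made up.
sample = (
    b"%PDF-1.4\n"
    b"trailer\n"
    b"<< /Size 18 /Root 1 0 R\n"
    b"/AdditionalStreams "
    b"[/application#2Fvnd#2Eoasis#2Eopendocument#2Etext 6 0 R]\n"
    b">>\n"
    b"startxref\n"
    b"5023000\n"
    b"%%EOF\n"
)

print(find_additional_stream(sample))
```

A non-hybrid PDF simply yields None here, which corresponds to case (2) of the import (open with LO-Draw).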


> When exporting, we end the PDF stream, and then
> append a stream holding the source ODF archive. So we are not now inside the
> PDF structure at all.

Not correct. The actual "hybrid" PDFs generated by LO do contain the ODT
archive right *inside* the PDF structure. In my above example, it was object
no. 6 (out of a total of 17 objects in the PDF file), and that object was at
byte offset 479 (out of a total file size of 5,023,779 bytes).
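
A claim like this can be checked by looking up the object's byte offset in the xref table. The following Python sketch shows one way to do that; the function name xref_offset is hypothetical, the tiny sample file is made up, and the code assumes a single classic xref table (no xref streams, no incremental updates chained via /Prev).

```python
import re


def xref_offset(pdf: bytes, obj_num: int):
    """Return the byte offset of an in-use object, or None if not found."""
    starts = re.findall(rb"startxref\s+(\d+)", pdf)
    if not starts:
        return None
    # The last startxref value points at the most recent xref table.
    lines = pdf[int(starts[-1]):].splitlines()
    if not lines or lines[0].strip() != b"xref":
        return None
    i = 1
    # Each subsection starts with "<first object number> <entry count>".
    while i < len(lines) and re.fullmatch(rb"\d+ \d+\s*", lines[i]):
        first, count = map(int, lines[i].split())
        if first <= obj_num < first + count:
            offset, _generation, kind = lines[i + 1 + obj_num - first].split()
            # 'n' marks an in-use entry, 'f' a free one.
            return int(offset) if kind == b"n" else None
        i += 1 + count
    return None


# Made-up minimal file: one object, placed right after the header line.
body = b"%PDF-1.4\n1 0 obj\n<<>>\nendobj\n"
xref = b"xref\n0 2\n0000000000 65535 f \n0000000009 00000 n \n"
tail = b"trailer\n<< /Size 2 >>\nstartxref\n%d\n%%%%EOF\n" % len(body)
pdf = body + xref + tail

print(xref_offset(pdf, 1))
```

If the offset returned for the ODF stream object is smaller than the startxref value, the stream lies inside the PDF body rather than after the end of the file.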
Comment 9 V Stuart Foote 2015-11-03 22:28:05 UTC
OK, I've been schooled. Our ODF is embedded into the PDF with a MIME entry in the trailer, and the stream object is interspersed with the other objects of the PDF "document".

We don't need the full ISO 32000-1/PDF 1.7 handling for the PDF applications to recognize them as attachments and be of use. 

Simply implementing the Embedded File Streams/Embedded Files & Name Dictionary of ISO 32000-1:2008 [7.11.4] (or PDF 1.7 [3.10.3]) mechanism from PDF 1.4 looks like it would add the ability for a compliant PDF application to identify and manipulate the embedded ODF as attachments.
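
As a rough illustration of that mechanism, the Python sketch below prints the three pieces a writer would have to add: a file specification dictionary whose /EF entry points at the existing stream, an /EmbeddedFiles name tree entry for the catalog's /Names dictionary, and the extra keys for the stream dictionary. The function names, the object number 19 and the filename original.odt are made up for the example; object 6 is taken from comment 8. It is not a proposal for the actual export code.

```python
PDF_DELIMITERS = set("()<>[]{}/%#")


def encode_pdf_name(text: str) -> str:
    """Escape delimiters, '#' and non-printable characters as #xx."""
    return "".join(
        f"#{ord(c):02X}" if c in PDF_DELIMITERS or not 33 <= ord(c) <= 126 else c
        for c in text
    )


def embedded_file_objects(stream_obj: int, filespec_obj: int,
                          filename: str, mime: str):
    """Return (filespec object, catalog /Names entry, extra stream keys).

    Assumption: filename contains no parentheses or backslashes, so it
    can be written as a PDF literal string without escaping.
    """
    filespec = (
        f"{filespec_obj} 0 obj\n"
        f"<< /Type /Filespec /F ({filename}) "
        f"/EF << /F {stream_obj} 0 R >> >>\n"
        "endobj\n"
    )
    # Goes into the document catalog so viewers list the attachment.
    names_entry = (
        f"/Names << /EmbeddedFiles << /Names "
        f"[({filename}) {filespec_obj} 0 R] >> >>"
    )
    # Added to the dictionary of the stream that already holds the ODF.
    # '.' needs no escaping in a PDF name, so only '/' becomes #2F here.
    stream_keys = f"/Type /EmbeddedFile /Subtype /{encode_pdf_name(mime)}"
    return filespec, names_entry, stream_keys


filespec, names_entry, stream_keys = embedded_file_objects(
    6, 19, "original.odt", "application/vnd.oasis.opendocument.text"
)
print(filespec)
print(names_entry)
print(stream_keys)
```

The existing /AdditionalStreams trailer key could stay alongside these entries, as suggested in comment 6, so that LibreOffice keeps recognizing its own files.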

Adjusting summary... 

Hybrid PDF -- implement PDF 1.4 level support for Embedded File Streams to expose ODF object stream as Attachment to the PDF document
Comment 10 David Tardon 2015-11-04 17:31:49 UTC
Note that the old import will have to be kept in any case. And that this will break the import for older LibreOffice versions.
Comment 11 V Stuart Foote 2015-11-04 17:52:24 UTC
(In reply to David Tardon from comment #10)
> Note that the old import will have to be kept in any case. And that this
> will break the import for older LibreOffice versions.

So, not a simple issue of adding the EF tag and using the Name Dictionary structure to get our existing ODF stream picked up as Attachment? Darn... I was hoping it might be benign enough to even support a back port to 5.0 as a late feature.

But, on the UX side with 4.4 approaching EOL, and 5.1 in the gate--it might be the best time to do it? Especially if it is a simple adjustment.

Yes it would require a one time "migration" for users to rebuild their Hybrid PDFs, but the source ODF would not be affected. 

After that, we would be good going forward, with the Embedded File attachment structure also being useful to other PDF readers.

Not too egregious from a UX perspective.