152661 – "Hybrid PDF" must share embedded (non-font) media between the ODT and the proper PDF


Status:	UNCONFIRMED

Alias:	None

Product:	LibreOffice
Classification:	Unclassified
Component:	Printing and PDF export
Version: (earliest affected)	unspecified
Hardware:	All All

Importance:	medium enhancement
Assignee:	Not Assigned

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:	PDF-Export

Reported:	2022-12-23 19:37 UTC by Eyal Rozenberg
Modified:	2023-01-22 17:15 UTC
CC List:	2 users

See Also:	95328
Crash report or crash signature:


Description Eyal Rozenberg 2022-12-23 19:37:57 UTC

If we take the LibreOffice English User's Guide (from https://documentation.libreoffice.org/en/english-documentation/ ), and export it to a PDF with the default options, the result takes up about 20 MB of space. If we export it as a Hybrid PDF, it takes up 45.3 MB of space.

That's totally excessive. It seems as though the full ODT, lock, stock and barrel, is attached to the PDF. Instead, LO should arrange it so that as much of the information as possible - at the very least entities like embedded images, fonts and such - is shared with the proper PDF.

With my example, the (compressed) ODT without the images takes up around 683 KB. I would therefore expect the Hybrid PDF to take up less than 21 MB.
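The arithmetic behind that expectation, as a rough sketch. The figures are the ones reported above ("about 20 MB" is approximate), and the KB-to-MB conversion assumes 1 MB = 1024 KB:

```python
# Sizes reported in the description, in MB.
plain_pdf_mb = 20.0                    # default PDF export ("about 20 MB")
hybrid_pdf_mb = 45.3                   # Hybrid PDF export
odt_without_images_mb = 683 / 1024     # 683 KB, assuming 1 MB = 1024 KB

# If media were shared, the hybrid file would only need the plain PDF
# plus the image-free ODT.
expected_hybrid_mb = plain_pdf_mb + odt_without_images_mb
overhead_mb = hybrid_pdf_mb - expected_hybrid_mb

print(f"expected hybrid size: {expected_hybrid_mb:.2f} MB")
print(f"duplicated payload:   {overhead_mb:.2f} MB")
```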

Comment 1 V Stuart Foote 2022-12-23 22:53:42 UTC

Why?

Not sure there is any means to do this--PDF and ODF are radically different document formats.  That PDF can "hold" a fully described ODF document in a LibreOffice "Hybrid PDF" is a nice means to deliver an editable PDF rendering of an ODF source document.

The cost is duplicated content and increased size--the PDF is fully rendered by the PDF viewer, and LibreOffice can parse out the ODF stream for the source document.

If that is a workflow a user needs, great! It is functional. Accept the cost (PDF size) and get on with it.

Otherwise don't use it.

Not a lot of reason to refactor or make any attempt at reducing the project's Hybrid PDF size.

IMHO => INVALID

Comment 2 Eyal Rozenberg 2022-12-23 23:21:29 UTC

(In reply to V Stuart Foote from comment #1)
> Why?
> 
> Not sure there is any means to do this--PDF and ODF are radically different
> document formats.

Of course there is - don't just concatenate the ODF. Entwine aspects of it with the PDF. Then, if you want a proper ODF document, you can extract it from the hybrid format.

>  That PDF can "hold" a fully described ODF document in a
> LibreOffice "Hybrid PDF" is a nice means to deliver an editable PDF
> rendering of an ODF source document.

It's not "nice" - it doubles the size.

> The cost is duplicated content and increased size--the PDF is fully rendered
> by the PDF viewer, and LibreOffice can parse out the ODF stream for the source
> document.

You can have a fully renderable PDF, and data which can be reconstituted into an ODF.

> Otherwise don't use it.

This workflow will be improved if it doesn't double the size.

Comment 3 Heiko Tietze 2023-01-12 14:44:02 UTC

We discussed the topic in the design meeting.

While reducing the footprint is desired in general, we have to deal with the standards. And either all PDF readers change their implementation or ODF relies on the way PDF handles content, if we read embedded data from the alien format. So this is unfortunately not going to fly.

Comment 4 Eyal Rozenberg 2023-01-13 14:03:32 UTC

(In reply to Heiko Tietze from comment #3)
> While reducing the footprint is desired in general, we have to deal with the
> standards. And either all PDF readers change their implementation or ODF
> relies on the way PDF handles content, if we read embedded data from the
> alien format. So this is unfortunately not going to fly.

I believe you've misunderstood my suggestion. I'm suggesting that the PDF not change at all, and remain perfectly valid regardless of the ODT tacked on to it. It is the ODT which should be altered, so that instead of referring to media within the ODT, it refers to media that's part of the PDF. If one then saves the opened file to an ODT, it will be saved the "regular" way.

At worst, this would require some tweaking of how one refers to media in the ODF format, to cover this use-case. At best - ODF already supports something like this, and it's just a matter of using this support.

Comment 5 V Stuart Foote 2023-01-13 15:28:56 UTC

You are asking for the PDF export filter to pack ODF-compliant elements into a PDF Object stream and individual xref entries. And to provide a reverse PDF import filter to parse those same Object streams back into a coherent ODF-ready XML.

The current approach is a single export stream embedding a single ODF compliant document as a PDF Object stream --the LibreOffice "Hybrid PDF". That is matched by a filter import stream that reads the PDF xref table, recognizes the entry for LO generated source ODF, and selectively parses that ODF stream rather than the full PDF.

The current two-way filters are efficient and functional--suited to our needs for Hybrid PDF. 

As noted in the See Also bug 95328, refactoring the export/import filters to make the ODF a PDF "attachment" might make sense, as it would allow other PDF viewers to recognize the attached ODF.

However, refactoring PDF export filter to reliably embed ODF canvas internals as PDF object streams would be non-performant--which elements go where?  While the likely necessary use of /ActualText (as for bug 117428) tagging for *all* text runs would negate any potential size reduction of embedding ODF elements as PDF object streams.

And then there would be filter requirements to be able to roundtrip--rather than a single fully ODF compliant source document, we would have to parse the entire PDF xref table, identify where each PDF Object needs to be placed (and on which page), individually extract and hold it, and then reassemble it all into some semblance of the original source ODF.

It could be done, obviously--but it is not advantageous to the project in any sense to do so! Not an imperative, certainly not worth the dev effort that refactoring both PDF filters would require.

So again, NO!

Comment 6 Eyal Rozenberg 2023-01-13 15:56:37 UTC

(In reply to V Stuart Foote from comment #5)
> The current two-way filters are efficient and functional--suited to our
> needs for Hybrid PDF. 

They're not efficient - they about-double the amount of space necessary, when the embedded media is significantly larger than the rest of the document. Hence this bug.


As for the rest of your comment...

Right now, the PDF import filter, upon noticing a PDF is a "hybrid PDF" - e.g. by some field/tag in the trailer or xref table, I guess - chucks all of the PDF and keeps the embedded ODF document. So, there's already some parsing going on which results in a coherent ODF - although, granted, it's limited. Also, the PDF export filter (whether it's a hybrid PDF or not) already packs elements into multiple PDF object streams and creates xref entries for them. The change I'm proposing is that, instead of media references in the ODF saying "the PNG file named foo.png packed into this ODF", we will have references saying, oh, maybe something like "the indirect object 12345 foopng within the PDF this ODF is in".

Indeed, this means there will need to be more parsing. But - that's nothing compared to the amount of work done when importing MSO files! It's basically at the level of complexity of a regexp application.
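A minimal sketch of what such a reference rewrite could look like, applied to the `xlink:href` attributes in the ODF's content.xml. The `pdfobj:` scheme is purely hypothetical (nothing like it is defined in ODF today), and the mapping from picture paths to PDF object numbers is assumed to be supplied by the export filter:

```python
import re

# Hypothetical URI scheme for "indirect object N of the enclosing PDF".
# This is an assumption for illustration, not part of ODF or LibreOffice.
PDF_REF_SCHEME = "pdfobj:"

PACKAGE_HREF = re.compile(r'xlink:href="(Pictures/[^"]+)"')
PDF_HREF = re.compile(r'xlink:href="' + re.escape(PDF_REF_SCHEME) + r'(\d+)"')


def to_pdf_refs(content_xml: str, object_ids: dict[str, int]) -> str:
    """On export: point picture hrefs at PDF objects instead of package files."""
    def repl(match: re.Match) -> str:
        path = match.group(1)
        if path not in object_ids:
            return match.group(0)  # no PDF counterpart, keep the package reference
        return f'xlink:href="{PDF_REF_SCHEME}{object_ids[path]}"'
    return PACKAGE_HREF.sub(repl, content_xml)


def to_package_refs(content_xml: str, object_ids: dict[str, int]) -> str:
    """On import: restore the regular package-internal hrefs."""
    paths = {obj: path for path, obj in object_ids.items()}

    def repl(match: re.Match) -> str:
        path = paths.get(int(match.group(1)))
        if path is None:
            return match.group(0)  # unknown object number, leave untouched
        return f'xlink:href="{path}"'
    return PDF_HREF.sub(repl, content_xml)
```

Round-tripping a document through `to_pdf_refs` and then `to_package_refs` with the same mapping gives back the original hrefs, which is what saving the opened file to a regular ODT would need.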

> However, refactoring PDF export filter to reliably embed ODF canvas
> internals as PDF object streams

I don't think I suggested doing that. I hope my last couple of paragraphs illustrate what I mean.

> would be non-performant--which elements go
> where?  While the likely necessary use of /ActualText (as for bug 117428)
> tagging for *all* text runs

I'm only talking about media such as images, sound, video and arbitrary binary files. I really think you've misunderstood my suggestions.

Comment 7 Heiko Tietze 2023-01-16 07:56:59 UTC

UX input given, removing the keyword.

Comment 8 V Stuart Foote 2023-01-16 15:44:11 UTC

Not a bug and in no sense agreed to. Closing.

Comment 9 Tomaz Vajngerl 2023-01-17 07:16:04 UTC

Thinking a bit about it I don't think this is that hard to implement. We don't really need to mess with the PDF structure - all we really need is to extract all the images from the PDF (easily done with PDFium I think), make sure to preserve the image name (various solutions) and reconstruct the ODF document before reading in the filter, then normally open the document. 

When saving the hybrid PDF, we save the ODF normally, but just skip saving the images.

There are some issues like making sure we don't re-compress the images when saving them to PDF (disable that option with hybrid PDF) and that the images are all compatible, if not we would duplicate them or something else. 
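A rough sketch of the two ODF-side steps, using only the ZIP container and leaving out the PDF part entirely. It assumes images live under `Pictures/` in the ODT package and that the image bytes recovered from the PDF are handed in as a name-to-bytes mapping; the function names are made up for illustration and are not LibreOffice APIs:

```python
import io
import zipfile

PICTURES_PREFIX = "Pictures/"


def strip_images(odt_bytes: bytes) -> tuple[bytes, dict[str, bytes]]:
    """On save: return the ODT without its pictures, plus the removed pictures."""
    images: dict[str, bytes] = {}
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(odt_bytes)) as src, \
            zipfile.ZipFile(out, "w") as dst:
        # Iterating in archive order keeps "mimetype" as the first entry.
        for info in src.infolist():
            data = src.read(info.filename)
            if info.filename.startswith(PICTURES_PREFIX) and not info.is_dir():
                images[info.filename] = data
            else:
                dst.writestr(info, data)  # keeps each entry's compression type
    return out.getvalue(), images


def rebuild_odt(stripped_odt: bytes, images: dict[str, bytes]) -> bytes:
    """On load: put the pictures (e.g. extracted from the PDF) back into the ODT."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(stripped_odt)) as src, \
            zipfile.ZipFile(out, "w") as dst:
        for info in src.infolist():
            dst.writestr(info, src.read(info.filename))
        for name, data in images.items():
            # Already-compressed image formats gain little from deflate.
            dst.writestr(name, data, compress_type=zipfile.ZIP_STORED)
    return out.getvalue()
```

The manifest is left untouched here, so it still lists the pictures while they are stripped; the package is only consistent again after `rebuild_odt`, which is why the reconstruction would have to happen before the regular ODF import runs.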

This probably wouldn't work for fonts as PDF subsets the fonts, and normally the fonts also aren't included into ODF. For a max compatibility option we could however embed the whole font into PDF and do a similar thing like with images also for fonts.

I like this idea, because the smaller the overhead of hybrid PDF, the more likely it is the user will use it.

Comment 10 V Stuart Foote 2023-01-17 14:52:33 UTC

Back to Unconfirmed then.

@quikee, are you offering to tackle it?

Comment 11 Eyal Rozenberg 2023-01-17 18:31:55 UTC

(In reply to Tomaz Vajngerl from comment #9)
> There are some issues like making sure we don't re-compress the images when
> saving them to PDF (disable that option with hybrid PDF) 

So, that might actually be relevant even when the images don't originally come from a PDF.  Or do you mean you want to avoid re-encoding the images as object streams, even when not recompressing?

> This probably wouldn't work for fonts as PDF subsets the fonts, and normally
> the fonts also aren't included into ODF. For a max compatibility option we
> could however embed the whole font into PDF and do a similar thing like with
> images also for fonts.

Do you feel this bug should focus just on images, leaving fonts for a separate bug report? Or is it close enough to keep them together in a single bug?

> I like this idea, because the smaller the overhead of hybrid PDF, the more
> likely it is the user will use it.

:-)

Comment 12 Tomaz Vajngerl 2023-01-22 14:27:44 UTC

(In reply to V Stuart Foote from comment #10)
> Back to Unconfirmed then.
> 
> @quikee, are you offering to tackle it?

I'll try, but first need to change the ODF document to be embedded as a compatible PDF embedded file.   

(In reply to Eyal Rozenberg from comment #11)
> So, that might actually be relevant even when the images don't originally
> come from a PDF.  Or do you mean you want to avoid re-encoding the images as
> object streams, even when not recompressing?

I mean the option in PDF export to re-compress JPEG images to reduce DPI resolution. Re-compressing would be problematic in this case as we don't want to mess with the original images in the ODF document.
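One way to express that constraint, as a sketch only: an image is shared between the two parts only when the bytes stored in the PDF are identical to the original in the ODF, and is duplicated otherwise. This assumes the exporter can keep the original image bytes in the PDF at all, and that both sides are available as name-to-bytes mappings (hypothetical inputs, not an existing API):

```python
import hashlib


def shareable_images(odf_images: dict[str, bytes],
                     pdf_images: dict[str, bytes]) -> set[str]:
    """Names of images whose PDF copy is byte-identical to the ODF original."""
    def digest(data: bytes) -> str:
        return hashlib.sha256(data).hexdigest()

    return {
        name
        for name, data in odf_images.items()
        if name in pdf_images and digest(pdf_images[name]) == digest(data)
    }
```

Anything not in the returned set (re-compressed, downsampled, or missing from the PDF) would still have to be stored in the embedded ODF.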
 
> Do you feel this bug should focus just on images, leaving fonts for a
> separate bug report? Or is it close enough to keep them together in a single
> bug?

I think fonts would be way more messy and probably not worth the effort to de-duplicate, so they are at least out of my scope. I would keep this one for images only, as the document you refer to also doesn't have fonts embedded into the ODT file, but it does contain 20+ MB of images that can be de-duplicated.

Comment 13 Eyal Rozenberg 2023-01-22 17:15:11 UTC

Ok, clarifying that fonts are out of scope. But of course there are non-font, non-image media objects which may be embedded in the document, so hopefully it could be "everything but fonts". But of course, whatever can be implemented without jumping through too many hoops.