Bug 157028 - FILESAVE PDF Tagged PDF export makes file size grow significantly
Status: NEW
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Printing and PDF export
Version (earliest affected): Inherited From OOo
Hardware: All All
Importance: low minor
Assignee: Not Assigned
URL:
Whiteboard: target:24.2.0 target:7.6.3
Keywords: filter:pdf
Duplicates: 157063 157267
Depends on:
Blocks: PDF-Accessibility
 
Reported: 2023-08-30 23:56 UTC by Gabor Kelemen (allotropia)
Modified: 2023-11-03 12:41 UTC (History)
CC List: 5 users

See Also:
Crash report or crash signature:


Description Gabor Kelemen (allotropia) 2023-08-30 23:56:17 UTC
This is a followup on https://bugs.documentfoundation.org/show_bug.cgi?id=39667#c12 

For example, when the document http://www.microsoft.com/investor/reports/ar13/docs/2013_Annual_Report.docx is exported to PDF without the Tagged PDF option enabled, the result is about 600 KB in size, but with Tagged PDF enabled it is about 3600 KB.
For comparison, the same file exported from Word 2016 as tagged PDF is about 1600 KB.
It would be nice to reduce the extensive tagging that causes this waste of disk space.

1. Download and open http://www.microsoft.com/investor/reports/ar13/docs/2013_Annual_Report.docx 
2. File - Export As - PDF, check the Tagged PDF (add document structure) option
3. Press the Export button
-> the resulting file size is a multiple of the original docx file size
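
The steps above can also be scripted for repeated measurements. The following is a minimal sketch, not a verified recipe: it assumes a LibreOffice build whose `--convert-to` option accepts JSON filter options after the filter name, and that the tagged PDF switch is exposed as the `UseTaggedPDF` filter option. The helper names `build_export_command` and `export_and_measure` are hypothetical.

```python
import json
import subprocess
from pathlib import Path


def build_export_command(docx_path, out_dir, tagged, soffice="soffice"):
    """Build a headless soffice command line that exports docx_path to PDF."""
    # Assumption: filter options are passed as JSON after the filter name.
    options = {
        "UseTaggedPDF": {"type": "boolean", "value": "true" if tagged else "false"}
    }
    target = "pdf:writer_pdf_Export:" + json.dumps(options, separators=(",", ":"))
    return [
        soffice, "--headless",
        "--convert-to", target,
        "--outdir", str(out_dir),
        str(docx_path),
    ]


def export_and_measure(docx_path, out_dir, tagged):
    """Run the export and return the size of the resulting PDF in bytes."""
    subprocess.run(build_export_command(docx_path, out_dir, tagged), check=True)
    pdf_path = Path(out_dir) / (Path(docx_path).stem + ".pdf")
    return pdf_path.stat().st_size
```

Both exports produce the same file name, so the tagged and untagged runs should write to separate output directories, for example `export_and_measure("2013_Annual_Report.docx", "out-tagged", True)` and `export_and_measure("2013_Annual_Report.docx", "out-untagged", False)`.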

Version: 24.2.0.0.alpha0+ (X86_64) / LibreOffice Community
Build ID: 33ae7c12bbdf19b76ced472ca8aed6cf66477bbe
CPU threads: 15; OS: Windows 10.0 Build 19045; UI render: Skia/Raster; VCL: win
Locale: en-US (hu_HU); UI: en-US
Calc: threaded
Comment 1 V Stuart Foote 2023-08-31 05:56:30 UTC
The question is how much content to include with tagging. IIRC, by default our Tagged PDF/A-3 archival and PDF/UA exports include what is appropriate for AT and a11y.

Size of "tagged" PDF is a secondary concern to PDF content flow from-to, locale/script tags, and including "Alternate Text" and "Actual Text" spans.

By nature "tagged" PDF will balloon in size.

IMHO => NAB
Comment 2 Commit Notification 2023-09-01 20:07:29 UTC
Michael Stahl committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/ee3c3fcf5c48964f7bc1d64484409f072c614866

tdf#157028 sw: PDF/UA export: reduce the number of Span ILSEs

It will be available in 24.2.0.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 3 Tiago do Amaral Rodrigues 2023-09-05 16:46:49 UTC
I have the same issue. I downloaded the file from this issue to check, and exported it with LO 7.6.0.3 (x86_64) using the same preferences each time, except for checking and unchecking the “tagged PDF” option. As Gabor Kelemen reports above, the sizes are the following:

original:     1 219 784 bytes
without tags:   629 441 bytes
with tags:    3 679 547 bytes

Then I ran the generated PDFs through an optimiser (https://github.com/pts/pdfsizeopt) and it reported the following:

> C:\pdfsizeopt>pdfsizeopt.exe 2013_Annual_Report-no-structure.pdf 2013_Annual_Report-no-structure-optimised.pdf
> info: This is pdfsizeopt ZIP rUNKNOWN size=69856.
> info: prepending to PATH: C:\pdfsizeopt\pdfsizeopt_win32exec
> info: loading PDF from: 2013_Annual_Report-no-structure.pdf
> info: loaded PDF of 629441 bytes
> info: separated to 330 objs + xref + trailer
> info: parsed 330 objs
> info: eliminated 103 unused objs, depth=8
> info: found 0 Type1 fonts loaded
> info: found 0 Type1C fonts loaded
> info: optimized 108 streams, kept 22 #orig, 1 uncompressed, 85 zip
> info: compressed 1 streams, kept 0 of them uncompressed
> info: saving PDF with 227 objs to: 2013_Annual_Report-no-structure-optimised.pdf
> info: generated object stream of 2362 bytes in 116 objects (8%)
> info: generated 606279 bytes (96%)

> C:\pdfsizeopt>pdfsizeopt.exe 2013_Annual_Report-with-structure.pdf 2013_Annual_Report-with-structure-optimised.pdf
> info: This is pdfsizeopt ZIP rUNKNOWN size=69856.
> info: prepending to PATH: C:\pdfsizeopt\pdfsizeopt_win32exec
> info: loading PDF from: 2013_Annual_Report-with-structure.pdf
> info: loaded PDF of 3679547 bytes
> info: separated to 26308 objs + xref + trailer
> info: parsed 26308 objs
> info: eliminated 103 unused objs, depth=9
> info: found 0 Type1 fonts loaded
> info: found 0 Type1C fonts loaded
> info: optimized 108 streams, kept 15 #orig, 1 uncompressed, 92 zip
> info: eliminated 12804 duplicate objs
> info: compressed 1 streams, kept 0 of them uncompressed
> info: saving PDF with 13401 objs to: 2013_Annual_Report-with-structure-optimised.pdf
> info: generated object stream of 190352 bytes in 13290 objects (7%)
> info: generated 835699 bytes (23%)

yielding:
without tags:   606 279 bytes (-3.68%)
with tags:      835 699 bytes (-77.29%)
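
The percentages above can be reproduced from the byte counts in the two logs. This is only an illustrative sketch; the function name `percent_change` is made up for the example.

```python
def percent_change(before, after):
    """Relative size change in percent (negative means the file shrank)."""
    return (after - before) / before * 100.0


# Byte counts taken from the pdfsizeopt logs above: (loaded, generated).
sizes = {
    "without tags": (629_441, 606_279),
    "with tags": (3_679_547, 835_699),
}

for label, (before, after) in sizes.items():
    print(f"{label}: {after} bytes ({percent_change(before, after):+.2f}%)")
```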


So the first PDF file had 330 PDF objects, of which 103 were considered unused and discarded. The second PDF file had 26 308 objects, of which 103 were considered unused and 12 804 were duplicates; both classes were discarded. These may be the objects that were fused together by Michael Stahl's commit; if not, it may be worth exploring what options the PDF-manipulating library that LO uses offers.
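
The object accounting in this paragraph can be checked directly against the log lines. The sketch below assumes only the message wording visible in the pdfsizeopt output quoted above; the names `object_counts` and `is_consistent` are hypothetical helpers, not part of pdfsizeopt.

```python
import re

# Patterns match the log messages exactly as they appear in the quoted output.
PATTERNS = {
    "parsed": re.compile(r"info: parsed (\d+) objs"),
    "unused": re.compile(r"info: eliminated (\d+) unused objs"),
    "duplicate": re.compile(r"info: eliminated (\d+) duplicate objs"),
    "saved": re.compile(r"info: saving PDF with (\d+) objs"),
}


def object_counts(log_text):
    """Extract object counts from a pdfsizeopt log; missing entries count as 0."""
    counts = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(log_text)
        counts[name] = int(match.group(1)) if match else 0
    return counts


def is_consistent(counts):
    """Parsed objects minus unused and duplicate ones should equal saved ones."""
    return counts["parsed"] - counts["unused"] - counts["duplicate"] == counts["saved"]


tagged_log = """\
> info: parsed 26308 objs
> info: eliminated 103 unused objs, depth=9
> info: eliminated 12804 duplicate objs
> info: saving PDF with 13401 objs to: 2013_Annual_Report-with-structure-optimised.pdf
"""

counts = object_counts(tagged_log)
print(counts, is_consistent(counts))
```

For the tagged file the numbers add up (26 308 - 103 - 12 804 = 13 401), so nearly half of the objects in the export were byte-identical duplicates.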

In any case, I will attempt overnight to download the daily version and install it to try the same exercise again, and then report back with the results.
Thanks again.
Comment 4 Tiago do Amaral Rodrigues 2023-09-06 18:05:11 UTC
Testing LibreOfficeDev 24.2.0.0.alpha0+ (X86_64) https://git.libreoffice.org/core/+log/2ae9eb8be8d7eb9c3a72953a295d128b45639ea3

> C:\pdfsizeopt>pdfsizeopt.exe 2013_Annual_Report-24-no-outline.pdf 2013_Annual_Repor-24t-no-outline-optimised.pdf
> info: This is pdfsizeopt ZIP rUNKNOWN size=69856.
> info: prepending to PATH: C:\pdfsizeopt\pdfsizeopt_win32exec
> info: loading PDF from: 2013_Annual_Report-24-no-outline.pdf
> info: loaded PDF of 642467 bytes
> info: separated to 329 objs + xref + trailer
> info: parsed 329 objs
> info: eliminated 103 unused objs, depth=8
> info: found 0 Type1 fonts loaded
> info: found 0 Type1C fonts loaded
> info: optimized 108 streams, kept 31 #orig, 1 uncompressed, 76 zip
> info: compressed 1 streams, kept 0 of them uncompressed
> info: saving PDF with 226 objs to: 2013_Annual_Repor-24t-no-outline-optimised.pdf
> info: generated object stream of 2290 bytes in 115 objects (8%)
> info: generated 619628 bytes (96%)

> C:\pdfsizeopt>pdfsizeopt.exe 2013_Annual_Report-24-with-outline.pdf 2013_Annual_Repor-24t-with-outline-optimised.pdf
> info: This is pdfsizeopt ZIP rUNKNOWN size=69856.
> info: prepending to PATH: C:\pdfsizeopt\pdfsizeopt_win32exec
> info: loading PDF from: 2013_Annual_Report-24-with-outline.pdf
> info: loaded PDF of 3692842 bytes
> info: separated to 26305 objs + xref + trailer
> info: parsed 26305 objs
> info: eliminated 103 unused objs, depth=9
> info: found 0 Type1 fonts loaded
> info: found 0 Type1C fonts loaded
> info: optimized 108 streams, kept 15 #orig, 1 uncompressed, 92 zip
> info: eliminated 12803 duplicate objs
> info: compressed 1 streams, kept 0 of them uncompressed
> info: saving PDF with 13399 objs to: 2013_Annual_Repor-24t-with-outline-optimised.pdf
> info: generated object stream of 190315 bytes in 13288 objects (7%)
> info: generated 849547 bytes (23%)

So it looks to be the same:

|---------------|-------------------------------|-------------------------------|
|               |      before optimization      |      after optimization       |
|   file name   | size (bytes) | size (objects) | size (bytes) | size (objects) |
|---------------|--------------|----------------|--------------|----------------|
| no outlines   |      642 467 |            329 |      619 628 |            226 |
| with outlines |    3 692 842 |         26 305 |      849 547 |         13 399 |
|---------------|--------------|----------------|--------------|----------------|
Comment 5 Gabor Kelemen (allotropia) 2023-09-08 07:52:44 UTC
*** Bug 157063 has been marked as a duplicate of this bug. ***
Comment 6 Murdo Maclachlan 2023-09-16 20:36:59 UTC
*** Bug 157267 has been marked as a duplicate of this bug. ***
Comment 7 Commit Notification 2023-09-26 11:37:32 UTC
Michael Stahl committed a patch related to this issue.
It has been pushed to "libreoffice-7-6":

https://git.libreoffice.org/core/commit/e8dda2f4c8f03c9fa0f9558b5d6ec4df81524682

tdf#157028 sw: PDF/UA export: reduce the number of Span ILSEs

It will be available in 7.6.3.

Comment 8 Stéphane Guillou (stragu) 2023-09-29 15:33:33 UTC
Overall, the increase in size for tagged PDF is not huge; see the analysis based on 213 files in attachment 174125 from bug 39667: a median factor of 1.17.

But as Tiago's example shows, there is potential for improvement, and of course, smaller is better. So let's set to "new" (with a lower importance).
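
To put the example file next to that median, the growth factor (tagged size divided by untagged size) can be computed from the byte counts in comment 3. A small sketch; `growth_factor` is a name chosen for illustration, and the only figures used are those already quoted in this report.

```python
def growth_factor(untagged_bytes, tagged_bytes):
    """How many times larger the tagged export is than the untagged one."""
    return tagged_bytes / untagged_bytes


REPORTED_MEDIAN = 1.17  # median factor over 213 files, as stated in comment 8

# (untagged, tagged) byte counts from comment 3.
cases = {
    "LO 7.6.0.3 as exported": (629_441, 3_679_547),
    "LO 7.6.0.3 after pdfsizeopt": (606_279, 835_699),
}

for label, (untagged, tagged) in cases.items():
    factor = growth_factor(untagged, tagged)
    print(f"{label}: x{factor:.2f} (reported median: x{REPORTED_MEDIAN})")
```

As exported, the example file grows by a factor of about 5.85, well above the median; after optimization the factor drops to about 1.38.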
Comment 9 Commit Notification 2023-10-24 16:47:21 UTC
Michael Stahl committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/049f458143cbd02ab915271418625cda1299f4b1

tdf#157028 vcl: PDF export: inline attribute dictionaries

It will be available in 24.2.0.

Comment 10 Commit Notification 2023-10-25 09:44:10 UTC
Michael Stahl committed a patch related to this issue.
It has been pushed to "libreoffice-7-6":

https://git.libreoffice.org/core/commit/35b5672e52676a92c3888c9066a754e4eebffe45

tdf#157028 vcl: PDF export: inline attribute dictionaries

It will be available in 7.6.3.

Comment 11 Commit Notification 2023-11-03 12:41:30 UTC
Michael Stahl committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/822a0c4fcd4607d5247b828c69728a510684a442

tdf#157028 vcl: PDF export: inline OBJR dictionaries

It will be available in 24.2.0.
