Bug 158732 - deb packages don't use the best compression
Summary: deb packages don't use the best compression
Status: ASSIGNED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Installation
Version (earliest affected): 24.8.0.0 alpha0+
Hardware: All Linux (All)
Importance: medium enhancement
Assignee: Jérôme
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: Installer-Linux
Reported: 2023-12-16 13:46 UTC by Jérôme
Modified: 2025-03-16 21:16 UTC (History)
CC List: 3 users

See Also:
Crash report or crash signature:


Attachments
the proposed fix for this bug (800 bytes, patch)
2025-01-02 13:34 UTC, Jérôme
comparison of deb archive sizes with or without the new variable definitions (9.93 KB, application/vnd.oasis.opendocument.spreadsheet)
2025-01-04 15:09 UTC, Jérôme
--with-lang=ALL comparison of deb archive sizes with 3 configurations (46.43 KB, application/vnd.oasis.opendocument.spreadsheet)
2025-01-11 17:50 UTC, Jérôme
another session --with-lang=ALL (90.80 KB, application/vnd.oasis.opendocument.spreadsheet)
2025-01-14 20:38 UTC, Jérôme
--with-lang=ALL comparison of deb archive sizes with 3 configurations (51.40 KB, application/vnd.oasis.opendocument.spreadsheet)
2025-01-17 19:05 UTC, Jérôme
--with-lang=ALL comparison of deb archive sizes across methods and levels (53.53 KB, application/vnd.oasis.opendocument.spreadsheet)
2025-01-24 22:01 UTC, Jérôme
--with-lang=ALL comparison of deb archive sizes with several XZ_OPT values (58.69 KB, application/vnd.oasis.opendocument.spreadsheet)
2025-01-26 22:54 UTC, Jérôme
total packages size decreases with patch on epm (20.18 KB, application/vnd.oasis.opendocument.spreadsheet)
2025-02-23 18:12 UTC, Jérôme
comparison of xz dictionary maximum size values (28.72 KB, application/vnd.oasis.opendocument.spreadsheet)
2025-02-28 21:33 UTC, Jérôme
patch comparison against master after unsetting PARALLELISM (20.39 KB, application/vnd.oasis.opendocument.spreadsheet)
2025-03-16 21:16 UTC, Jérôme

Description Jérôme 2023-12-16 13:46:08 UTC
In the daily build language pack LibreOfficeDev_24.8.0.0.alpha0_Linux_x86-64_deb_langpack_de.tar.gz, 7z reports that the deb package below uses the xz compression method:
------------------------------
$ 7z l libreofficedev24.8-dict-de_24.8.0.0.alpha0-1_amd64.deb
[...]
Type = xz
Physical Size = 7713188
Method = LZMA2:23 CRC64
Streams = 1
Blocks = 4
------------------------------

Next, I extracted the content of this deb package with the command below:
7z x libreofficedev24.8-dict-de_24.8.0.0.alpha0-1_amd64.deb

This extracts the following file:
data.tar

Next, I compressed this tar archive with xz at several compression levels:
------------------------------
$ xz -1 --threads=1 --stdout data.tar > d-1.xz
$ xz -4 --threads=1 --stdout data.tar > d-4.xz
$ xz -5 --threads=1 --stdout data.tar > d-5.xz
$ xz -9 --threads=1 --stdout data.tar > d-9.xz
------------------------------

Judging by the file sizes, the deb archive doesn't use the maximum xz compression level:
------------------------------
$ ls -lhS *
-rw-r--r-- 1 j j  82M déc.  16 03:40 data.tar
-rw-r--r-- 1 j j  16M déc.  16 14:21 d-1.xz
-rw-r--r-- 1 j j  11M déc.  16 14:22 d-4.xz
-rw-r--r-- 1 j j 7,4M déc.  16 03:40 libreofficedev24.8-dict-de_24.8.0.0.alpha0-1_amd64.deb
-rw-r--r-- 1 j j 6,7M déc.  16 14:24 d-5.xz
-rw-r--r-- 1 j j 3,8M déc.  16 14:25 d-9.xz
------------------------------

LibreOffice on GNU/Linux requires at least 256 MB of RAM, with 512 MB preferred:
https://www.libreoffice.org/get-help/system-requirements/#Linux

The xz man page says:
"decompressing a file created with xz -9 currently requires 65 MiB of memory".

So memory is no obstacle to using the maximum xz level ("-9") for LibreOffice deb packages.

Users would then download the deb files faster and save storage. The servers would save storage and bandwidth. Since each package is compressed once and downloaded many times, the maximum xz compression level would provide an overall benefit.
Comment 1 Stéphane Guillou (stragu) 2024-01-04 00:23:37 UTC
Thanks Jérôme.
Do you know what difference it makes in the time needed to decompress such packages, and overall install times? Could you test that too?
Comment 2 Jérôme 2024-01-04 14:44:25 UTC
On xz decompression speed, the unxz man page says:
"On the same hardware, the decompression speed is approximately a constant number of bytes of compressed data per second. In other words, the better the compression, the faster the decompression will usually be."

My xz/unxz version:
-------------
$ unxz --version
xz (XZ Utils) 5.2.2
liblzma 5.2.2
$
-------------

To measure part of the overall installation performance, we can pipe the xz decompression into the file extraction (tar in my test). I ran the test below on the core deb archive, which is the largest. I made sure only one terminal was running:
-------------
$ mkdir t
$ dpkg-deb --extract LibreOfficeDev_24.8.0.0.alpha0_Linux_x86-64_deb/DEBS/lodevbasis24.8-core_24.8.0.0.alpha0-1_amd64.deb t
$ tar cf sys-tree.tar t
$ xz -9 --threads=1 --stdout sys-tree.tar > sys-tree-9.tar.xz
$ xz -1 --threads=1 --stdout sys-tree.tar > sys-tree-1.tar.xz
$ rm -rf t && mkdir t
$ time ( unxz --to-stdout sys-tree-1.tar.xz | tar xf - --directory t )

real	0m5,969s
user	0m5,992s
sys	0m0,528s
$ rm -rf t && mkdir t
$ time ( unxz --to-stdout sys-tree-9.tar.xz | tar xf - --directory t )

real	0m5,930s
user	0m5,360s
sys	0m0,588s
$
$ rm -rf t && mkdir t
$ time ( unxz --to-stdout sys-tree-1.tar.xz | tar xf - --directory t )

real	0m6,093s
user	0m6,004s
sys	0m0,560s
$ rm -rf t && mkdir t
$ time ( unxz --to-stdout sys-tree-9.tar.xz | tar xf - --directory t )

real	0m5,905s
user	0m5,368s
sys	0m0,624s
$ 
-------------

On my hardware, the core archive compressed at level '-9' decompresses slightly faster than the same archive compressed at level '-1'.
Comment 3 Jamie Natali 2024-07-12 09:52:36 UTC Comment hidden (spam)
Comment 4 Stéphane Guillou (stragu) 2024-07-15 03:51:43 UTC
Thanks Jérôme. From what you said, I think it makes sense.
Cloph, is that an easy switch in packaging config?
Comment 5 Jérôme 2024-12-29 14:25:30 UTC
Most deb packages are created by the "epm" tool.
distro-configs/LibreOfficeLinux.conf selects the "--enable-epm" option of autogen.sh.

epm has no command-line option for the compression level.

The desktop integration package is built with dpkg-deb in sysui/CustomTarget_deb.mk.
Comment 6 Jérôme 2024-12-29 18:24:48 UTC
The epm source file workdir/UnpackedTarball/epm/deb.c shows:
---
  if (Verbosity)
    puts("Building Debian binary distribution...");

  if (run_command(directory, "dpkg --build %s", name))
---

Thus the epm tool itself calls the dpkg program.

According to the dpkg-deb man page, three environment variables control the compression used by dpkg-deb (and thus by dpkg):
DPKG_DEB_COMPRESSOR_TYPE=xz
DPKG_DEB_COMPRESSOR_LEVEL=9
DPKG_DEB_THREADS_MAX=1

Could we set those variables in the config_host.mk.in file?
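
As a rough sketch of the idea (not the patch itself), the three variables could be placed in the environment of the dpkg call. The helper name build_deb and the DRY_RUN switch are hypothetical and only serve the illustration; the variables only take effect with a dpkg release that supports them.

```shell
# Hypothetical helper: run "dpkg --build" with the three dpkg-deb variables set.
# DRY_RUN=1 prints the command line instead of executing it.
build_deb() {
    pkgdir=$1   # staged package tree (path supplied by the caller)
    set -- env DPKG_DEB_COMPRESSOR_TYPE=xz \
               DPKG_DEB_COMPRESSOR_LEVEL=9 \
               DPKG_DEB_THREADS_MAX=1 \
               dpkg --build "$pkgdir"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}
```

Calling it with DRY_RUN=1 shows the exact command without needing dpkg on the host, which makes it easy to check the variables before touching the real build.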
Comment 7 Jérôme 2025-01-02 13:34:16 UTC
Created attachment 198353
the proposed fix for this bug
Comment 8 Buovjaga 2025-01-02 14:21:59 UTC
(In reply to Jérôme from comment #7)
> Created attachment 198353
> the proposed fix for this bug

Please submit it to Gerrit: https://wiki.documentfoundation.org/Development/gerrit/setup

If you want to do it completely via web, after creating a Gerrit account you may visit https://git.libreoffice.org/core/+/refs/heads/master/config_host.mk.in and click the [edit] link to immediately create a new change for the file.

https://wiki.documentfoundation.org/Documentation/GerritEditing

Also: https://wiki.documentfoundation.org/Development/GetInvolved#License_statement
Comment 9 Jérôme 2025-01-02 15:42:05 UTC
I just submitted it to Gerrit.
Comment 10 Jérôme 2025-01-04 15:09:52 UTC
Created attachment 198378
comparison of deb archive sizes with or without the new variable definitions

Following Christian Lohmaier's suggestion, I moved those variable definitions into the packaging recipe in instsetoo_native/CustomTarget_install.mk instead.

I first built without defining those variables, with the configuration below:
./autogen.sh --with-distro=LibreOfficeLinux --with-package-format=deb --disable-online-update --disable-breakpad

Then I recorded the deb package sizes in workdir/installation.

Next, I deleted all those deb archives. Finally, I defined the new variables and built again.

The attached file compares the deb archives sizes.

The total saving for the 43 deb archives is 2.6% (10 MiB).
Comment 11 Jérôme 2025-01-08 21:31:03 UTC
As a reviewer, Christian Lohmaier notes that the TDF baseline currently has dpkg version 1.20.9.
He points out that the man page https://man7.org/linux/man-pages/man1/dpkg-deb.1.html lists the following versions as the first to support the environment variables:
- 1.21.10 for DPKG_DEB_COMPRESSOR_LEVEL
- 1.21.10 for DPKG_DEB_COMPRESSOR_TYPE
- 1.21.9 for DPKG_DEB_THREADS_MAX

For now, the only way would be to call "dpkg-deb -Zxz -z9" instead of "dpkg" (maybe as an additional patch on workdir/UnpackedTarball/epm).

However, the same man page says the command-line options are supported starting with the following versions:
- 1.16.2 for "-z"
- 1.15.6 for "-Z"
- 1.21.9 for "--threads-max"

Updating the epm source with a patch may break the build with older versions of dpkg (and we couldn't use the --threads-max yet).

Perhaps the environment variables method is gentler until the next TDF baseline update.
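
To illustrate the version constraint, here is a minimal sketch of a check a build script could run before relying on the environment variables. The function names are hypothetical, and the comparison assumes GNU "sort -V" and plain dotted version strings (no epoch or Debian revision).

```shell
# True (exit status 0) when version $1 is greater than or equal to version $2.
# Relies on "sort -V" (GNU coreutils); plain dotted versions only.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# 1.21.10 is the first release the man page lists for the two
# DPKG_DEB_COMPRESSOR_* variables; DPKG_DEB_THREADS_MAX needs 1.21.9.
supports_compressor_env() {
    version_ge "$1" 1.21.10
}
```

With the baseline version from this comment, "supports_compressor_env 1.20.9" fails, so on such a host the variables would simply be ignored by dpkg-deb.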
Comment 12 Jérôme 2025-01-11 15:24:03 UTC
The xz man page (https://manpages.debian.org/testing/xz-utils/xz.1.en.html)
shows that presets 6 to 9 share nearly the same settings affecting compression speed:
-----------
Preset   DictSize   CompCPU   CompMem   DecMem
-0       256 KiB    0         3 MiB     1 MiB
-1       1 MiB      1         9 MiB     2 MiB
-2       2 MiB      2         17 MiB    3 MiB
-3       4 MiB      3         32 MiB    5 MiB
-4       4 MiB      4         48 MiB    5 MiB
-5       8 MiB      5         94 MiB    9 MiB
-6       8 MiB      6         94 MiB    9 MiB
-7       16 MiB     6         186 MiB   17 MiB
-8       32 MiB     6         370 MiB   33 MiB
-9       64 MiB     6         674 MiB   65 MiB

    Column descriptions:
[...]
    CompCPU is a simplified representation of the LZMA2 settings that affect compression speed. The dictionary size affects speed too, so while CompCPU is the same for levels -6 ... -9, higher levels still tend to be a little slower.
------------
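
The CompMem column of the same table also gives a rough idea of the memory needed when several packages are compressed at once. The sketch below simply multiplies the per-preset value by a number of parallel jobs; the function names are hypothetical, and it assumes single-threaded xz (one compressor per job), so it is only an estimate.

```shell
# CompMem column (MiB) from the xz man page table quoted above.
xz_preset_compmem_mib() {
    case "$1" in
        0) echo 3 ;;
        1) echo 9 ;;
        2) echo 17 ;;
        3) echo 32 ;;
        4) echo 48 ;;
        5|6) echo 94 ;;
        7) echo 186 ;;
        8) echo 370 ;;
        9) echo 674 ;;
        *) echo "unknown preset: $1" >&2; return 1 ;;
    esac
}

# Estimated total compression memory (MiB) for $2 parallel jobs at preset $1.
total_compmem_mib() {
    per_job=$(xz_preset_compmem_mib "$1") || return 1
    echo $(( per_job * $2 ))
}
```

For example, "total_compmem_mib 9 4" estimates the memory for four simultaneous level 9 compressions.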
Comment 13 Jérôme 2025-01-11 17:50:03 UTC
Created attachment 198493
--with-lang=ALL comparison of deb archive sizes with 3 configurations

I used this configuration:
./autogen.sh --with-distro=LibreOfficeLinux --with-lang=ALL --with-package-format=deb --disable-online-update --disable-breakpad

Level 9 does not appear to be efficient on small files (the current language-dependent files, except dictionaries).

3 test cases:
- A: default (undefined environment variables)
- B: xz method, single thread
- C: all B variables plus level 9.
Comment 14 Jérôme 2025-01-14 20:38:56 UTC
Created attachment 198542
another session --with-lang=ALL

I don't understand the statistics.

Between tests I run:
1. make clean
2. ./autogen.sh --with-distro=LibreOfficeLinux --with-lang=ALL --with-package-format=deb --disable-online-update --disable-breakpad
3. make

I don't restart my computer between the tests. I noticed that /tmp (on tmpfs) contains many directories with names like "ooopackaging*" whose modification times match the previous tests. This may reduce the available memory from one test to the next.
Comment 15 Jérôme 2025-01-17 19:05:55 UTC
Created attachment 198599
--with-lang=ALL comparison of deb archive sizes with 3 configurations

I used this configuration:
./autogen.sh --with-distro=LibreOfficeLinux --with-lang=ALL --with-package-format=deb --disable-online-update --disable-breakpad

3 test cases:
- A: default (undefined environment variables)
- B: xz method, single thread
- C: all B variables plus level 9.

Compared to my previous tests:
- I put /tmp on the hard disk instead of in RAM,
- I set "export PARALLELISM=1".

My test host only has 12 gigabytes of physical RAM, which could lead to memory contention.

Level 9 does not appear to be efficient on small files (the current language-dependent files, except dictionaries).

However, the archives are slightly smaller in case B than in case A: the single-thread option improves compression. It should also reduce memory consumption.
Comment 16 Jérôme 2025-01-24 22:01:09 UTC
Created attachment 198748
--with-lang=ALL comparison of deb archive sizes across methods and levels

I used this configuration:
./autogen.sh --with-distro=LibreOfficeLinux --with-lang=ALL --with-package-format=deb --disable-online-update --disable-breakpad

For all cases, I set "export PARALLELISM=1".

My host has 12 GiB of physical RAM.

In instsetoo_native/CustomTarget_install.mk:
- always "DPKG_DEB_THREADS_MAX=1",
- DPKG_DEB_COMPRESSOR_TYPE takes "none", "zstd", "gzip" or "xz",
- DPKG_DEB_COMPRESSOR_LEVEL takes several values.

On my specific host, the best setting appears to be:
DPKG_DEB_COMPRESSOR_TYPE = xz
DPKG_DEB_COMPRESSOR_LEVEL = 7.

Maybe the archives are too small to benefit from compression.
Comment 17 Jérôme 2025-01-25 13:53:43 UTC
I will try to use the XZ_OPT environment variable for xz.
Comment 18 Jérôme 2025-01-26 22:54:27 UTC
Created attachment 198777
--with-lang=ALL comparison of deb archive sizes with several XZ_OPT values

I used this configuration:
./autogen.sh --with-distro=LibreOfficeLinux --with-lang=ALL --with-package-format=deb --disable-online-update --disable-breakpad

For all cases, I set "export PARALLELISM=1" because my host has 12 GiB of physical RAM.

In instsetoo_native/CustomTarget_install.mk:
- always "DPKG_DEB_THREADS_MAX=1" and "DPKG_DEB_COMPRESSOR_TYPE=xz",
- XZ_OPT always contains "--threads=1 --memlimit=max",
- XZ_OPT takes several compression levels (0 to 9, a few extreme variants, --x86).

The final compression appears to depend on the kind of archive (and maybe its size).

I will try to propose a patch with:
- level 2 for dictionaries,
- level 8 for help,
- level 5 by default.

Access to the "Installed-Size" field could help in choosing the compression level.

Disabling xz multi-threaded/multi-process compression builds smaller archives faster ("make -j 24" already provides parallel dpkg processes).
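
A minimal sketch of the per-package choice described above, for illustration only. The function names and the name patterns ("-dict-", "-help-") are assumptions and may not match the real package names or the patch eventually submitted.

```shell
# Pick an xz level from the package file name.
# Patterns are hypothetical; real package names may differ.
pick_xz_level() {
    case "$1" in
        *-dict-*) echo 2 ;;   # dictionaries
        *-help-*) echo 8 ;;   # help packages
        *)        echo 5 ;;   # everything else
    esac
}

# Build the XZ_OPT value: fixed options from the tests plus the chosen level.
xz_opt_for() {
    echo "--threads=1 --memlimit=max -$(pick_xz_level "$1")"
}
```

A packaging recipe could then export XZ_OPT="$(xz_opt_for "$package")" before calling dpkg, where "$package" stands for whatever name the recipe has at hand.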
Comment 19 Jérôme 2025-02-23 18:12:24 UTC
Created attachment 199411
total packages size decreases with patch on epm

I submitted a patch to Gerrit which saves 16 MiB (2.4%) of total size.
The attached file shows the comparison results.
Comment 20 Jérôme 2025-02-28 21:33:07 UTC
Created attachment 199536
comparison of xz dictionary maximum size values

With a 128 MiB maximum dictionary size, the epm patch saves 17 MiB (2.4%) and reduces CPU usage while compressing (saves ~4% of "user+sys" time).

A 128 MiB maximum xz dictionary size ensures that we respect the 256 MiB memory prerequisite for installing LO on Linux (xz decompression memory is roughly the dictionary size of the xz archive).

The patch enables --threads=1, which improves overall CPU efficiency (and archive compression). Without this option, on n processors you get n "make" processes each running n compression processes: n^2 concurrent processes competing for the same resources. This explains why the "real" time increases in this test with PARALLELISM=1 (this variable has no effect on the number of parallel xz processes).
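
The process count can be made concrete with a small sketch. The function name is hypothetical; it only multiplies the number of parallel make jobs by the number of xz threads per job.

```shell
# Number of compressor threads/processes competing for the CPU:
# $1 = parallel make jobs, $2 = xz threads per job.
concurrent_compressors() {
    echo $(( $1 * $2 ))
}
```

On a 4-core host, "concurrent_compressors 4 4" gives the count when xz uses one thread per core in each job, while "concurrent_compressors 4 1" gives the count with --threads=1.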
Comment 21 Jérôme 2025-03-08 16:39:47 UTC
I think the patch is ready for review here :
https://gerrit.libreoffice.org/c/core/+/179624
Comment 22 Jérôme 2025-03-15 15:42:53 UTC
The latest patch should provide an immediate benefit on build hosts without updating dpkg (see comment 11). This patch now uses the XZ_OPT environment variable, which seems to have been available in xz for years.
Comment 23 Jérôme 2025-03-16 21:16:36 UTC
Created attachment 199841
patch comparison against master after unsetting PARALLELISM

The patch limits the memory consumption of compression (< 1.3 GiB for xz with a maximum dictionary size of 128 MiB). Thus I can now unset PARALLELISM.
I tested it on my host with 4 cores and 12 GiB of RAM. Each test build ran with the default "make -j 4". The attached file shows the results.