155192 – Libreoffice takes long time to convert docx to pdf which has more than 500 pages.

Status:	VERIFIED FIXED

Alias:	None

Product:	LibreOffice
Classification:	Unclassified
Component:	Writer
Version: (earliest affected)	6.4.0.3 release
Hardware:	x86-64 (AMD64) All

Importance:	medium normal
Assignee:	Not Assigned

URL:
Whiteboard:
Keywords:	bibisected, bisected, perf

Depends on:
Blocks:	PDF-Export

Reported:	2023-05-08 14:45 UTC by Naresh
Modified:	2023-05-12 18:22 UTC
CC List:	3 users

See Also:	64222
Crash report or crash signature:

Attachments
sample document with 662 pages (5.23 MB, application/vnd.openxmlformats-officedocument.wordprocessingml.document) 2023-05-08 14:46 UTC, Naresh
Bibisect log (3.36 KB, text/plain) 2023-05-08 20:46 UTC, Telesto

Description Naresh 2023-05-08 14:45:31 UTC

Description:
Hello,

We have an application where we store Microsoft Office documents. As part of release management, we convert the Office documents to PDF with watermarks using LibreOffice.

Scenario:

LibreOffice takes a long time to convert DOCX files with many pages (e.g. around 600 - 1000 pages) to PDF.
I have tested both the CLI and the GUI. The result is the same.

Technical Stack:

Redhat Enterprise 7.9 (3.10.0-1160.83.1.el7.x86_64)

Tested multiple versions; the conversion is slow in all of them:
LibreOffice 7.1.5.2
LibreOffice 7.4.7
LibreOffice 7.5.2

CLI Command:
/opt/libreoffice75/program/soffice --headless --convert-to "pdf:writer_pdf_Export" --outdir /tmp timeout.docx

Size of docx : 5MB

Appreciate some pointers to solve this problem.

Steps to Reproduce:
CLI Command:
/opt/libreoffice75/program/soffice --headless --convert-to "pdf:writer_pdf_Export" --outdir /tmp timeout.docx


Actual Results:
Generating the PDF takes around 8 - 15 minutes.

Expected Results:
The PDF is generated faster.


Reproducible: Always


User Profile Reset: Yes

Additional Info:
sample document attached
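
For anyone reproducing the timing, the command from the description can be wrapped in a small Python sketch that measures wall-clock time and aborts on a timeout. The helper names `build_convert_command` and `time_conversion` and the 30-minute default timeout are illustrative assumptions; only the soffice flags come from the report.

```python
import subprocess
import time
from pathlib import Path


def build_convert_command(soffice, docx, outdir):
    """Return the argv list for a headless DOCX-to-PDF conversion."""
    return [
        str(soffice),
        "--headless",
        "--convert-to", "pdf:writer_pdf_Export",
        "--outdir", str(outdir),
        str(docx),
    ]


def time_conversion(soffice, docx, outdir, timeout_s=1800):
    """Run the conversion and return (elapsed_seconds, expected_pdf_path).

    Raises subprocess.TimeoutExpired if soffice runs longer than timeout_s
    (an assumed limit) and subprocess.CalledProcessError on a non-zero exit.
    """
    cmd = build_convert_command(soffice, docx, outdir)
    start = time.monotonic()
    subprocess.run(cmd, check=True, timeout=timeout_s)
    elapsed = time.monotonic() - start
    # soffice names the output after the input file, with a .pdf extension.
    return elapsed, Path(outdir) / (Path(docx).stem + ".pdf")
```

Calling `time_conversion("/opt/libreoffice75/program/soffice", "timeout.docx", "/tmp")` runs the same conversion as above. Passing the arguments as a list avoids shell quoting problems around the filter name.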

Comment 1 Naresh 2023-05-08 14:46:10 UTC

Created attachment 187146 [details]
sample document with 662 pages

Comment 2 m_a_riosv 2023-05-08 15:27:25 UTC

Tested by opening the file in Word: it also takes a long time to produce the PDF; I didn't wait for it to finish.

There are 663 tables and a nine-page index.
Opening takes a long time before the file is fully formatted with the right number of pages.
There also seems to be a lot of direct formatting.
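
Figures like the table count can be checked without opening the document in an office suite. A DOCX file is a ZIP archive whose main body is stored in `word/document.xml`, so a rough sketch in Python can count the relevant WordprocessingML elements directly. The function name `count_docx_elements` is hypothetical, nested tables are counted individually, and the run count is only a coarse hint at how fragmented the formatting is.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def count_docx_elements(docx_path):
    """Count tables (w:tbl), paragraphs (w:p) and runs (w:r) in a DOCX body."""
    counts = {"tbl": 0, "p": 0, "r": 0}
    with zipfile.ZipFile(docx_path) as archive:
        with archive.open("word/document.xml") as part:
            # Stream the XML so large documents are not held in memory at once.
            for _event, elem in ET.iterparse(part, events=("end",)):
                for name in counts:
                    if elem.tag == f"{{{W_NS}}}{name}":
                        counts[name] += 1
                elem.clear()
    return counts
```

For example, `count_docx_elements("timeout.docx")["tbl"]` gives the number of tables in the main document part; headers, footers and footnotes live in separate parts and are not included.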

Comment 3 Telesto 2023-05-08 20:03:42 UTC

With LibreOffice 4.4.7.2
File Opening -> 120 seconds
Saving PDF -> 90 seconds

With
Version: 7.6.0.0.alpha0+ (X86_64) / LibreOffice Community
Build ID: c4a58634753a84b09f20f7271d6525a6656522d3
CPU threads: 4; OS: Windows 6.3 Build 9600; UI render: Skia/Raster; VCL: win
Locale: nl-NL (nl_NL); UI: en-US
Calc: CL threaded

File opening -> 300 seconds until something appears on screen, but still processing in the background; after 720 seconds still not finished
Saving PDF -> unable to measure, because the background processing keeps going

Lots of time spent in SwFieldType::GetXObject (called by Python code)

PyType_Ready
PyEval_EvalFrameDefault
PyObject_Call
PyFunction_Vectorcall
PyCell_Set
PyMethod_Self
PyObject_CallMethodId_SizeT
PyObject_CallFunctionObjArgs
PyType_Ready
PyType_Ready
PyEval_EvalFrameDefault
PyObject_Call
PyFunction_Vectorcall
PyCell_Set
PyMethod_Self
PyVectorcall_Call
PyInit_pyuno
[00007FFF8F4D281C]
[00007FFF8F4D2CB7]
[00007FFF4C8B3E9A]
[00007FFF4C8B4311]
uno_ext_getMapping
uno_ext_getMapping
uno_ext_getMapping
linguistic_DicList_get_implementation
osl_getTempDirURL

Comment 4 Telesto 2023-05-08 20:21:17 UTC

Disk I/O shows a decent loading rate (in my case 1 MB/s), and the document is on screen after 120 seconds, with
Version: 6.1.6.3
Build ID: 5896ab1714085361c45cf540f76f60673dd96a72
CPU threads: 4; OS: Windows 6.3; UI render: default; 
Locale: nl-NL (nl_NL); Calc: CL

It's 100 kB/s for 7.6.0.0.

----
There is a lot of (or endless) background processing after opening, though, caused by grammar checking on a regular file open (probably not relevant for command-line export, however).

Loading speed is fine, both time until on screen based on disk I/O (120 seconds) and grammar checking (120 seconds), with
Version: 5.2.5.0.0+
Build ID: a4d4fbeb623013f6377b30711ceedb38ea4b49f8
CPU Threads: 4; OS Version: Windows 6.2; UI Render: GL; 
TinderBox: Win-x86@62-merge-TDF, Branch:libreoffice-5-2, Time: 2016-12-24_14:43:55
Locale: nl-NL (nl_NL); Calc: CL


So there are actually two perf issues, if you ask me.

Comment 5 Telesto 2023-05-08 20:46:31 UTC

Created attachment 187150 [details]
Bibisect log

Bibisected based on loading I/O speed to:
author	Michael Stahl <Michael.Stahl@cib.de>	2019-09-06 19:36:48 +0200
committer	Michael Stahl <Michael.Stahl@cib.de>	2019-09-17 10:45:40 +0200
commit 5ba30f588d6e41a13d68b1461345fca7a7ca61ac (patch)
tree 6f098ffd0fb2c75a2c1cbda4e7b82bd65fb8e7dd
parent 6e1cb2e9dd406fb2883460cefaa4660622996005 (diff)
tdf#64222 sw: better DOCX import/export of paragraph marker formatting

The problem here is that Word allows formatting the paragraph end
marker, and applies the same formatting to the generated numbering
string; Writer has no such marker thing.

This is currently represented by an empty AUTOFMT hint at the end of the
paragraph, which is created almost by accident in
SwXText::finishParagraph(), because the paragraph properties are set on
a SwPaM that doesn't select the whole paragraph but sits at the end.

This is a bit fragile and the hint may have unfortunate accidents such
as being merged into a preceding AUTOFMT hint if it happens to have the
same items in it.

It ought to work better to have an item in SwTextNode's SwAttrSet to
store these special items; has the advantage that the items will also be
copied when you split the paragraph, like in Word.

Add a RES_PARATR_LIST_AUTOFMT and UNO property "ListAutoFormat" (which
should be considered a first draft...) and use it in preference (where
possible) or in addition to (where necessary due to other missing
pieces) the empty hint.

Also revert the change in checkApplyParagraphMarkFormatToNumbering() to
consider hints that start before the end of the paragraph, as it has
unintended side effects as pointed out by Mike Kaganski.

Comment 6 Telesto 2023-05-08 20:59:39 UTC

@Noel,
You might be interested in this one, assuming the design of the commit "better DOCX import/export of paragraph marker formatting" is fine by itself and simply requires some optimizations to perform better.

Comment 7 Noel Grandin 2023-05-12 09:02:56 UTC

On my machine, using current master, this already takes less than 45 seconds, so I think we can consider this fixed (probably by various other patches I have made to Writer).

Comment 8 m_a_riosv 2023-05-12 14:42:00 UTC

About 50 seconds for me with
Version: 7.6.0.0.alpha1+ (X86_64) / LibreOffice Community
Build ID: 99a88c9e55872214ce01d89447d18708e47e956b
CPU threads: 16; OS: Windows 10.0 Build 22621; UI render: default; VCL: win
Locale: es-ES (es_ES); UI: en-US
Calc: CL threaded
with the accessibility option disabled. With it enabled I was having issues; I'll retest and, if needed, report it in a new bug.

Comment 9 Naresh 2023-05-12 14:47:06 UTC

Hello,

Is it possible to push the fix to 7.4 or 7.5 versions ?

Thanks,
Naresh

Comment 10 Noel Grandin 2023-05-12 18:22:33 UTC

(In reply to Naresh from comment #9)
> 
> Is it possible to push the fix to 7.4 or 7.5 versions ?

If you want that kind of service, I suggest you contract with a company like Collabora Productivity to do it for you.