Description:
Hello,

We have an application in which we store Microsoft Office documents. As part of release management, we convert the Office documents to PDF with watermarks using LibreOffice.

Scenario:
Almost all documents are converted to PDF without any problem. We observe that some conversions fail with heavy memory consumption. LibreOffice crashes when a document contains heavy images and many pages (e.g. around 600 - 1000 pages). I have tested both the CLI and the GUI; the result is the same.

Technical Stack:
Red Hat Enterprise Linux 7.9 (3.10.0-1160.83.1.el7.x86_64)

Tested multiple versions (the result is not OK in any of them):
LibreOffice 7.1.5.2
LibreOffice 7.4.6.2
LibreOffice 7.5.1.2

CLI Command:
/opt/libreoffice75/program/soffice --headless --convert-to "pdf:writer_pdf_Export" --outdir /tmp 7090190.docx

Size of docx: 100MB

I would appreciate some pointers to solve this problem.

Steps to Reproduce:
CLI Command:
/opt/libreoffice75/program/soffice --headless --convert-to "pdf:writer_pdf_Export" --outdir /tmp 7090190.docx

Actual Results:
Crashed with high memory consumption

Expected Results:
PDF generated.

Reproducible: Always
User Profile Reset: No
Additional Info: gdb trace attached
Created attachment 186638
gdb trace
I wasn't able to crash LO using that filter with:

Version: 7.5.2.2 (X86_64) / LibreOffice Community
Build ID: 53bb9681a964705cf672590721dbc85eb4d0c3a2
CPU threads: 8; OS: Linux 5.15; UI render: default; VCL: gtk3
Locale: en-AU (en_AU.UTF-8); UI: en-US
Calc: threaded

Please provide an example document to test, preferably smaller than 100 MB and sanitised if needed.
To go page by page and pinpoint the offending data, you could use a bash script with extra filter options, e.g. this inside a loop that changes the page number:
libreoffice7.5 --headless --convert-to 'pdf:writer_pdf_Export:{"PageRange":{"type":"string","value":"2"}}' large_file.docx
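The loop suggested above could look roughly like the following. This is only a sketch under a few assumptions: the names SOFFICE, TIMEOUT_SECS, page_filter and scan_pages are made up for illustration, the 300-second timeout is arbitrary, and a page is counted as failed whenever no PDF ends up in the output directory (whether because of a crash, a timeout, or anything else).

```shell
#!/usr/bin/env bash
# Sketch: convert one page at a time and report the pages that fail.
# SOFFICE, TIMEOUT_SECS, page_filter and scan_pages are illustrative names,
# not part of LibreOffice.
SOFFICE="${SOFFICE:-libreoffice7.5}"
TIMEOUT_SECS="${TIMEOUT_SECS:-300}"

# Build the --convert-to argument that restricts the export to one page.
page_filter() {
  printf 'pdf:writer_pdf_Export:{"PageRange":{"type":"string","value":"%s"}}' "$1"
}

# scan_pages DOC FIRST LAST
# Prints the number of every page whose conversion leaves no PDF behind.
scan_pages() {
  local doc=$1 page=$2 last=$3 outdir
  while [ "$page" -le "$last" ]; do
    outdir=$(mktemp -d)
    timeout "$TIMEOUT_SECS" "$SOFFICE" --headless \
      --convert-to "$(page_filter "$page")" --outdir "$outdir" "$doc" \
      >/dev/null 2>&1
    # Check for the output file instead of relying on the exit status alone.
    if ! ls "$outdir"/*.pdf >/dev/null 2>&1; then
      echo "$page"
    fi
    rm -rf "$outdir"
    page=$((page + 1))
  done
}
```

For a 700-page document this would be called as `scan_pages large_file.docx 1 700`. Each page costs a full document load, so narrowing the range first (halves, then quarters) is likely quicker than a single linear pass.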
Hi Stéphane,

Thanks for your reply. I was able to isolate a one-page document containing a single image that causes this problem. I am attaching it here for your reference and testing. Please let me know if you need more information.

Thank you.
Created attachment 186725
problematic document
The issue is about opening the file. When using a debug build, I see lots of EMF+ warnings.

I bibisected with the linux-64-5.4 repo and got a range of four commits. These stand out:

2e7c94f5054dec4ab19c44209136c886793f0acb tdf#107034 EMF+ Add support for import EmfPlusDrawPie record
9b693d896bf9a08cd8987e483f5269d6f2be1fd3 tdf#107019 EMF+ Add support for import EmfPlusRecordTypeDrawBeziers record
The EMF+ extracted from the media folder of attachment 186725 is a rather large 47MB image of some complexity.

With a WinDbg session attached, LO eventually opens it and renders the image to the canvas, but LO's memory use grows to about 6.5GB as the EMF is parsed. Once open, the Draw UI is rather sluggish.

Version: 7.5.2.2 (X86_64) / LibreOffice Community
Build ID: 53bb9681a964705cf672590721dbc85eb4d0c3a2
CPU threads: 8; OS: Windows 10.0 Build 19045; UI render: default; VCL: win
Locale: en-US (en_US); UI: en-US
Calc: threaded

Opening with Draw crashed with Skia output device rendering, which seems to be a different issue. With GDI-only default rendering, parsing slowly runs to completion and renders a rather pixelated image to the document canvas. GDI object counts remain low, so there is no leakage there.

Is it just a bad EMF?
Created attachment 186845
Extracted EMF+ image which is causing performance issues
Created attachment 186847
Screenshot of problematic docx after opening

The image contains a total of 597009 records, many of which are Bezier curves (EMF+ EmfPlusRecordTypeDrawBeziers (0x4019)). Each curve is defined by a few points but is then translated into hundreds of individual lines:
https://en.wikipedia.org/wiki/B%C3%A9zier_curve

As a workaround, we could try to disable EMF+ drawing (leaving only EMF). This can be done by setting an environment variable (unfortunately I don't remember its name).
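To make the "few points in, many lines out" effect concrete, here is a small illustration that samples one cubic Bezier curve at uniform steps. It is not LibreOffice's subdivision code, and the function name flatten_cubic is hypothetical. It only shows how four control points turn into N line segments, which is then multiplied by the number of curve records in the image.

```shell
#!/usr/bin/env bash
# flatten_cubic N x0 y0 x1 y1 x2 y2 x3 y3
# Prints N+1 points sampled uniformly along one cubic Bezier curve, i.e.
# the end points of the N straight segments that approximate it.
# Illustration only: uniform sampling, not LibreOffice's actual algorithm.
flatten_cubic() {
  awk -v n="$1" -v x0="$2" -v y0="$3" -v x1="$4" -v y1="$5" \
      -v x2="$6" -v y2="$7" -v x3="$8" -v y3="$9" 'BEGIN {
    for (i = 0; i <= n; i++) {
      t = i / n; u = 1 - t
      # Bernstein weights of the four control points
      b0 = u*u*u; b1 = 3*u*u*t; b2 = 3*u*t*t; b3 = t*t*t
      printf "%.3f %.3f\n", b0*x0 + b1*x1 + b2*x2 + b3*x3, b0*y0 + b1*y1 + b2*y2 + b3*y3
    }
  }'
}
```

`flatten_cubic 100 0 0 0 10 10 10 10 0 | wc -l` reports 101 points for a single curve, so a file with hundreds of thousands of such records can expand into tens of millions of line segments.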
To disable EMF+, you could try setting this environment variable:
export EMF_PLUS_DISABLE=true
Opening the document is much faster with it set.
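For the command-line conversion from the original report, the variable can also be set for a single invocation instead of being exported for the whole shell. The sketch below assumes the headless process honours EMF_PLUS_DISABLE the same way the UI does; the wrapper name convert_without_emfplus and the SOFFICE variable are made up for illustration.

```shell
#!/usr/bin/env bash
# Path taken from the original report; override SOFFICE for other installs.
SOFFICE="${SOFFICE:-/opt/libreoffice75/program/soffice}"

# convert_without_emfplus DOC [OUTDIR]
# Runs the PDF export with EMF+ parsing disabled for this one process only.
# convert_without_emfplus is an illustrative name.
convert_without_emfplus() {
  local doc=$1 outdir=${2:-/tmp}
  EMF_PLUS_DISABLE=true "$SOFFICE" --headless \
    --convert-to "pdf:writer_pdf_Export" --outdir "$outdir" "$doc"
}
```

Example: `convert_without_emfplus 7090190.docx /tmp`. Scoping the variable to one command keeps other LibreOffice sessions on the same machine rendering EMF+ normally.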
The main difference between EMF and EMF+ is that EMF operates on integers (sal_Int32), whereas EMF+ operates on floating-point numbers (double, float).
Confirming that setting the EMF_PLUS_DISABLE environment variable to true tames the behavior in the UI. Memory use drops from ~6.5GB to ~400MB and the document canvas can actually be worked with. I suppose the same can be set for command-line conversion, as the OP needs.

The question, though, is whether we can read the image metadata (get a count of curves before processing) and avoid EMF+ parsing when it is over some threshold. Or maybe some sort of timer-based release and fallback to simple EMF when counts grow too high?
A performance improvement for Bezier curves has already been submitted here:
https://gerrit.libreoffice.org/c/core/+/150821
Please take a look and check how it works for you.
Bartosz Kosiorek committed a patch related to this issue. It has been pushed to "master":

https://git.libreoffice.org/core/commit/ce008fa9d8f2752bdfeaeff763aafc774a4b4fb2

tdf#154789 EMF+ Performance boost of the EmfPlusRecordTypeDrawBeziers

It will be available in 7.6.0.

The patch should be included in the daily builds available at https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More information about daily builds can be found at: https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Bartosz Kosiorek committed a patch related to this issue. It has been pushed to "libreoffice-7-5":

https://git.libreoffice.org/core/commit/1328e2b7eb5251162834d7c0f953c6334686e95e

tdf#154789 EMF+ Performance boost of the EmfPlusRecordTypeDrawBeziers

It will be available in 7.5.4.
Thank you very much for your support and effort in fixing this problem quickly.

Is it possible to add this fix to the 7.4 version as well?
(In reply to Naresh from comment #16)
> Thank you very much for your support and effort in fixing this problem
> quickly.
>
> Is it possible to add this fix to 7.4 version also ?

The timeline is a bit tight:
https://wiki.documentfoundation.org/ReleasePlan/7.4#7.4.7_release
Bartosz Kosiorek committed a patch related to this issue. It has been pushed to "libreoffice-7-5-3":

https://git.libreoffice.org/core/commit/b1ed265975407aea9eda568049be4d68301276af

tdf#154789 EMF+ Performance boost of the EmfPlusRecordTypeDrawBeziers

It will be available in 7.5.3.
Bartosz Kosiorek committed a patch related to this issue. It has been pushed to "libreoffice-7-4":

https://git.libreoffice.org/core/commit/168dc9075d7be4d7da5f5e1ee602751f84dbd254

tdf#154789 EMF+ Performance boost of the EmfPlusRecordTypeDrawBeziers

It will be available in 7.4.8.
Bartosz Kosiorek committed a patch related to this issue. It has been pushed to "libreoffice-7-4-7":

https://git.libreoffice.org/core/commit/cd94594b24c48602a1eef6af8d98cbf5a6467e3a

tdf#154789 EMF+ Performance boost of the EmfPlusRecordTypeDrawBeziers

It will be available in 7.4.7.
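Taken together, the commit notifications in this thread put the fix in 7.4.7 and later 7.4.x releases, 7.5.3 and later 7.5.x releases, and 7.6.0 onwards. A small helper for checking an installed version against that list might look like this. The function name has_bezier_fix is made up, and the check assumes a plain dotted version string such as 7.5.1.2.

```shell
#!/usr/bin/env bash
# has_bezier_fix VERSION
# Succeeds if VERSION (e.g. 7.5.1.2) is at or past a release that this
# thread names as containing the tdf#154789 fix: 7.4.7, 7.5.3 or 7.6.0.
# has_bezier_fix is an illustrative name.
has_bezier_fix() {
  local major minor micro rest
  IFS=. read -r major minor micro rest <<EOF
$1
EOF
  if [ "$major" -gt 7 ]; then return 0; fi
  [ "$major" -eq 7 ] || return 1
  case $minor in
    4) [ "$micro" -ge 7 ] ;;   # 7.4.7 and later 7.4.x
    5) [ "$micro" -ge 3 ] ;;   # 7.5.3 and later 7.5.x
    *) [ "$minor" -ge 6 ] ;;   # 7.6.0 onwards; anything older lacks the fix
  esac
}
```

Example: `has_bezier_fix 7.5.1.2 || echo "upgrade needed"`. The version string would have to come from the installation itself, for instance from the output of `soffice --version`.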
Hello,

I have verified the fix in both the 7.4.7 and 7.5.3 dev builds. Both are working well.

Conversion still takes time (around 8 minutes) for a 700-page document, but that is better for us than a crash.

Thanks a lot for your support and time.
Thanks Naresh. Can you share some more documents so I could try to improve performance?
Created attachment 187133
to improve the performance

Hi,

I am attaching a new test document to help improve performance.

It has 663 pages and takes around 8-10 minutes to convert to PDF. It would be great if you could improve the performance.
(In reply to Naresh from comment #23)
> Created attachment 187133
> to improve the performance
>
> Hi,
>
> Attaching a new test document to improve the performance.
>
> This has 663 pages which takes around 8-10mins to convert to PDF. If you can
> improve the performance, it will be great.

That MS Word binary .doc document does not contain an EMF+ image, so it is not the issue resolved here.

Please submit a new BZ ticket and reattach the document (a ZIP archive is fine), but its source needs to be native ODF .odt or, at a minimum, OOXML .docx.