Description:
The time to open and convert HTML files with embedded base64 images is quadratic in the image size.

For my test I'm using the command
```
soffice --headless --norestore --convert-to odt:writerweb8_writer $file_name
```
to convert from HTML to ODT.

The execution time of the command depends on the size of the embedded image. For example, with a 100 kB image it takes about 0.4 seconds, with a 5 MB image it already takes 29 seconds on my machine, and an HTML file with a 16 MB image requires 4 minutes to convert.

The time to open such an HTML file depends on the image size in a similar way.

See the attachments for the statistics and the HTML files.

Steps to Reproduce:
1a. Convert an HTML file with an embedded base64-encoded image using the command
    soffice --headless --norestore --convert-to odt:writerweb8_writer $file_name
1b. Or try to open such a file with LibreOffice Writer.

Actual Results:
The time to open/convert the file is quadratic in the image file size.

Expected Results:
The time to open/convert the file should be linear in the image file size.

Reproducible: Always

User Profile Reset: No

Additional Info:
Version: 6.4.2.2
Build ID: 6.4.2-1
CPU threads: 12; OS: Linux 5.5; UI render: default; VCL: kf5;
Locale: fr-CH (ru_RU.UTF-8); UI-Language: en-US
Calc: threaded
Created attachment 159381 [details] Measurement of conversion time vs image file size
Created attachment 159382 [details] html files used for test
Created attachment 159383 [details] htm file with 8.5 MB image
Created attachment 159384 [details] html file with 16.9 MB image
Created attachment 159512 [details] valgrind trace of html parsing
Most likely, the problem is related not to the image size but to the size of the HTML file itself.

A valgrind trace shows clearly that most of the time is spent in HTMLParser::GetNextToken_(), which makes extensive use of OUString::operator+=.
(In reply to Pavel from comment #6)
> Most likely, the problem is related not to the image size, but html file
> itself
> After taking a trace with valgrind, it is clear, that most of the time is
> used by
> HTMLParser::GetNextToken_(),
> which extensively uses OutString::operator+=

So the problem is in the HTMLParser::ScanText method.
HTMLParser::ScanText (svtools/source/svhtml/parhtml.cxx) reads HTML token data into a temporary buffer of up to MAX_LEN (= 1024) characters and then concatenates (+=) that buffer onto the result string. Each concatenation allocates memory and copies both the existing data and the new data (memcpy). Because the number of chunks is substantial, almost the same data is copied over and over again.

A possible solution would be to increase the buffer size each time it fills up (1024, 2048, 4096, ...).
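The repeated copying described above can be illustrated with a small cost model. This is only a sketch under a stated assumption, namely that every append allocates a buffer of exactly the new length and copies the old contents plus the new chunk. It is not the LibreOffice code; the function names and the `maxChunk` cap are hypothetical and exist only for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Assumed cost model (not LibreOffice code): each append reallocates to the
// exact new length and copies the old contents plus the new chunk.

// Characters copied when `total` characters are appended in fixed-size chunks.
std::uint64_t copiedWithFixedChunk(std::uint64_t total, std::uint64_t chunk)
{
    assert(chunk > 0);
    std::uint64_t copied = 0;
    std::uint64_t length = 0;
    while (length < total)
    {
        length += std::min(chunk, total - length);
        copied += length; // whole string is copied on every append
    }
    return copied;
}

// Same model, but the chunk size doubles after every append
// (1024, 2048, 4096, ...) up to a hypothetical cap `maxChunk`.
std::uint64_t copiedWithGrowingChunk(std::uint64_t total, std::uint64_t chunk,
                                     std::uint64_t maxChunk)
{
    assert(chunk > 0 && chunk <= maxChunk);
    std::uint64_t copied = 0;
    std::uint64_t length = 0;
    while (length < total)
    {
        length += std::min(chunk, total - length);
        copied += length;
        if (chunk <= maxChunk / 2) // doubling can never exceed the cap or overflow
            chunk *= 2;
    }
    return copied;
}
```

Under this model, the fixed-chunk variant copies on the order of total²/(2·chunk) characters, while the doubling variant stays within a small constant factor of total. For a 16 MiB token and 1024-character chunks that is roughly 1.4·10^11 versus 5·10^7 characters copied.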
Created attachment 159538 [details] Possible solution

But we need to ensure that there is no integer overflow.
Created attachment 159546 [details] Better version of fix

Conversion time for the page with the 16.9 MB image dropped from 200+ seconds to 10 seconds.
Results with
Version: 7.0.0.0.alpha0+ (x64)
Build ID: 1c9ced04189c9d23ffea05d5570960b54b05ef28
CPU threads: 4; OS: Windows 10.0 Build 18363; UI render: Skia/Raster; VCL: win;
Locale: de-DE (de_DE); UI-Language: en-GB
Calc: CL

0.2 MB    2.24 sec
1.2 MB    3.95 sec
2.5 MB    9.71 sec
5.8 MB   45.14 sec

So I wouldn't say that the opening time is quadratic in the image size, but it does increase faster than the image size. I don't know what the expected behaviour would be, so I am leaving this as UNCONFIRMED.
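One way to check how close these measurements are to quadratic is to estimate a local scaling exponent between two data points, assuming time grows roughly as size^p over that interval. The helper below is a hypothetical sketch for that arithmetic, not part of any LibreOffice tooling.

```cpp
#include <cassert>
#include <cmath>

// Local scaling exponent p between two measurements, assuming time ~ size^p.
// p near 1 suggests linear growth, p near 2 suggests quadratic growth.
double scalingExponent(double size1, double time1, double size2, double time2)
{
    assert(size1 > 0 && size2 > 0 && time1 > 0 && time2 > 0 && size1 != size2);
    return std::log(time2 / time1) / std::log(size2 / size1);
}
```

Applied to the figures in this comment, the step from 2.5 MB to 5.8 MB gives an exponent of roughly 1.8, while the steps between the smaller files give lower values. That pattern would be consistent with a fixed startup cost dominating for small files and near-quadratic growth taking over for large ones, though four data points are too few to be conclusive.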
@Xisco and/or Julien: Pavel posted a patch, so maybe a new contributor?

Setting to NEW, assuming the patch works as expected.
Hi Pavel, Could you please submit the patch to gerrit so other developers can review it ? See https://wiki.documentfoundation.org/Development/gerrit/SubmitPatch Thanks in advance
In addition, please send your license statement (see https://wiki.documentfoundation.org/Development/GetInvolved)
(In reply to Xisco Faulí from comment #13)
> Hi Pavel,
> Could you please submit the patch to gerrit so other developers can review
> it ?
> See https://wiki.documentfoundation.org/Development/gerrit/SubmitPatch
> Thanks in advance

Hi Xisco, thanks for the advice:
https://gerrit.libreoffice.org/c/core/+/92456/
Pavel Klevakin committed a patch related to this issue. It has been pushed to "master": https://git.libreoffice.org/core/commit/9429dacc7ff93f99dd84532357020669df33a0c5 tdf#131951: automatically increase buffer size It will be available in 7.0.0. The patch should be included in the daily builds available at https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More information about daily builds can be found at: https://wiki.documentfoundation.org/Testing_Daily_Builds Affected users are encouraged to test the fix and report feedback.
Noel Grandin committed a patch related to this issue. It has been pushed to "master": https://git.libreoffice.org/core/commit/85a6aa5526c1e38865250e88ceb6bf02345248b2 tdf#131951 related, improve perf It will be available in 7.0.0. The patch should be included in the daily builds available at https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More information about daily builds can be found at: https://wiki.documentfoundation.org/Testing_Daily_Builds Affected users are encouraged to test the fix and report feedback.
Pavel, Noel, do you plan to backport it to 6.4?
Hi Roman. Yes, it would be useful for me and our company, but the decision is up to Noel.
The 16.9 MB file takes

real 0m6,576s
user 0m5,529s
sys 0m0,960s

in

Version: 7.0.0.0.alpha0+
Build ID: 850b8de31c5be5127eac16a4f5cc18c26a582e53
CPU threads: 4; OS: Linux 4.19; UI render: default; VCL: gtk3;
Locale: en-US (en_US.UTF-8); UI-Language: en-US
Calc: threaded

while in

Version: 7.0.0.0.alpha0+
Build ID: 32c5832dfccc2f40370c2795b44adaf3b357d603
CPU threads: 4; OS: Linux 4.19; UI render: default; VCL: gtk3;
Locale: en-US (en_US.UTF-8); UI-Language: en-US
Calc: threaded

it takes

real 5m6,247s
user 2m5,359s
sys 3m0,256s

Nice!! Backporting to the 6.4 branch.

@Pavel, @Noel, thanks for fixing this issue!!
Should we close this issue now?
Pavel Klevakin committed a patch related to this issue. It has been pushed to "libreoffice-6-4": https://git.libreoffice.org/core/commit/acd8105a825ee4d0efa25ddf512dbb373bd7b8f3 tdf#131951: automatically increase buffer size It will be available in 6.4.4. The patch should be included in the daily builds available at https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More information about daily builds can be found at: https://wiki.documentfoundation.org/Testing_Daily_Builds Affected users are encouraged to test the fix and report feedback.
Great! Thanks all! I think this bug can be closed
Thank you for your feedback Pavel
Let's put this one to VERIFIED since it's been confirmed.
Noel Grandin committed a patch related to this issue. It has been pushed to "libreoffice-6-4": https://git.libreoffice.org/core/commit/c6ae3a0610700a730d549c25dbff1748f02b8e3e tdf#131951 related, improve perf It will be available in 6.4.4. The patch should be included in the daily builds available at https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More information about daily builds can be found at: https://wiki.documentfoundation.org/Testing_Daily_Builds Affected users are encouraged to test the fix and report feedback.