Bug 131951 - FILEOPEN: Quadratic time on reading and converting html files with images
Summary: FILEOPEN: Quadratic time on reading and converting html files with images
Status: VERIFIED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer (show other bugs)
Version:
(earliest affected)
6.4.2.2 release
Hardware: x86-64 (AMD64) Windows (All)
: medium normal
Assignee: Not Assigned
URL:
Whiteboard: target:7.0.0 target:6.4.4
Keywords: perf
Depends on:
Blocks: HTML-Import
  Show dependency treegraph
 
Reported: 2020-04-07 08:19 UTC by Pavel
Modified: 2020-04-22 09:51 UTC (History)
4 users (show)

See Also:
Crash report or crash signature:


Attachments
Measurement of convertion time vs image file size (43.10 KB, application/vnd.oasis.opendocument.spreadsheet)
2020-04-07 08:20 UTC, Pavel
Details
html files used for test (25.71 MB, application/zip)
2020-04-07 08:22 UTC, Pavel
Details
htm file with 8.5 MB image (11.32 MB, text/html)
2020-04-07 08:23 UTC, Pavel
Details
html file with 16.9 MB image (22.52 MB, text/html)
2020-04-07 08:24 UTC, Pavel
Details
valgrind trace of html parsing (12.82 MB, application/zip)
2020-04-12 12:09 UTC, Pavel
Details
Possible solution (1.09 KB, patch)
2020-04-13 15:30 UTC, Pavel
Details
Better version of fix (1.30 KB, patch)
2020-04-14 05:05 UTC, Pavel
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Pavel 2020-04-07 08:19:21 UTC
Description:
Time of opening and converting html files with embedded base64 images is quadratic to image size


For my test I'm using command 
```
soffice --headless --norestore --convert-to odt:writerweb8_writer $file_name
```
to convert from html to odt

Time of execution of the command depends on size of the embedded image.
For example if image size is 100kB, then it takes about 0.4 seconds,
for 5MB image it is already 29 sec on my machine, 
html with 16 MB image requires 4 minutes to be converted

Also time to open such html has similar correlation to the images 

See attachments for the statistics and the html files



Steps to Reproduce:
1a. Convert html file with embedded base64 encoded image
using command soffice --headless --norestore --convert-to odt:writerweb8_writer $file_name

1b. Or try to open such files with LibreOfficeWriter


Actual Results:
Time to open/convert file is quadratic to image file size

Expected Results:
Time to open/convert file should be linear to image file size


Reproducible: Always


User Profile Reset: No



Additional Info:
Version: 6.4.2.2
Build ID: 6.4.2-1
CPU threads: 12; OS: Linux 5.5; UI render: default; VCL: kf5; 
Locale: fr-CH (ru_RU.UTF-8); UI-Language: en-US
Calc: threaded
Comment 1 Pavel 2020-04-07 08:20:30 UTC
Created attachment 159381 [details]
Measurement of convertion time vs image file size
Comment 2 Pavel 2020-04-07 08:22:27 UTC
Created attachment 159382 [details]
html files used for test
Comment 3 Pavel 2020-04-07 08:23:32 UTC
Created attachment 159383 [details]
htm file with 8.5 MB image
Comment 4 Pavel 2020-04-07 08:24:26 UTC
Created attachment 159384 [details]
html file with 16.9 MB image
Comment 5 Pavel 2020-04-12 12:09:48 UTC
Created attachment 159512 [details]
valgrind trace of html parsing
Comment 6 Pavel 2020-04-12 12:14:22 UTC
Most likely, the problem is related not to the image size, but html file itself
After taking a trace with valgrind, it is clear, that most of the time is used by
HTMLParser::GetNextToken_(), 
which extensively uses OutString::operator+=
Comment 7 Pavel 2020-04-12 12:44:14 UTC
(In reply to Pavel from comment #6)
> Most likely, the problem is related not to the image size, but html file
> itself
> After taking a trace with valgrind, it is clear, that most of the time is
> used by
> HTMLParser::GetNextToken_(), 
> which extensively uses OutString::operator+=

So the problem is in HTML::ScanText method
Comment 8 Pavel 2020-04-12 14:49:19 UTC
HTML::ScanText (svtools/source/svhtml/parthhtml) reads html token data up to MAX_LEN (=1024) symbols to temp buffer and then do concatenation (+=) of strings.
This causes allocation of memory and copying existing data and new data (memcpy)
And because number of chunks is substantial, copying of almost the same data is repeated multiple times

Possible solution could be increase buffer size each time it is filled (1024, 2048, 4096...)
Comment 9 Pavel 2020-04-13 15:30:48 UTC
Created attachment 159538 [details]
Possible solution

But need to ensure no integer overflow
Comment 10 Pavel 2020-04-14 05:05:29 UTC
Created attachment 159546 [details]
Better version of fix

Conversion time for page with 16.9 MB image changed from 200+ down to 10 seconds
Comment 11 Dieter 2020-04-17 11:03:33 UTC
Results with

Version: 7.0.0.0.alpha0+ (x64)
Build ID: 1c9ced04189c9d23ffea05d5570960b54b05ef28
CPU threads: 4; OS: Windows 10.0 Build 18363; UI render: Skia/Raster; VCL: win; 
Locale: de-DE (de_DE); UI-Language: en-GB
Calc: CL

0,2 MB   2,24 sec
1,2 MB   3,95 sec
2,5 MB   9,71 sec
5,8 MB  45,14 sec

So I won't say, that time of opening is quadratic to image size, but time of opening increases faster than image size.

But I don't know, what would be the expected behaviour. So I leave it as UNCONFIRMED
Comment 12 Telesto 2020-04-17 11:24:50 UTC
@Xisco and/or Julien
Pavel posted a patch; Some maybe a new contributor?

Setting to NEW -> Assuming the patch works as expect
Comment 13 Xisco Faulí 2020-04-17 11:27:54 UTC
Hi Pavel,
Could you please submit the patch to gerrit so other developers can review it ?
See https://wiki.documentfoundation.org/Development/gerrit/SubmitPatch
Thanks in advance
Comment 14 Julien Nabet 2020-04-17 11:55:00 UTC
In addition, please send your license statement (see https://wiki.documentfoundation.org/Development/GetInvolved)
Comment 15 Pavel 2020-04-17 15:07:52 UTC
(In reply to Xisco Faulí from comment #13)
> Hi Pavel,
> Could you please submit the patch to gerrit so other developers can review
> it ?
> See https://wiki.documentfoundation.org/Development/gerrit/SubmitPatch
> Thanks in advance

Hi Xisco,
thanks for advice

https://gerrit.libreoffice.org/c/core/+/92456/
Comment 16 Commit Notification 2020-04-19 12:48:47 UTC
Pavel Klevakin committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/9429dacc7ff93f99dd84532357020669df33a0c5

tdf#131951: automatically increase buffer size

It will be available in 7.0.0.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 17 Commit Notification 2020-04-19 17:43:12 UTC
Noel Grandin committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/85a6aa5526c1e38865250e88ceb6bf02345248b2

tdf#131951 related, improve perf

It will be available in 7.0.0.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 18 Roman Kuznetsov 2020-04-20 07:00:52 UTC
Pavel, Noel, will you plan backport it to 6.4?
Comment 19 Pavel 2020-04-20 08:28:57 UTC
Hi, Roman
Yes, it would be useful for me and our company, but the decision is upon Noel
Comment 20 Xisco Faulí 2020-04-21 15:23:33 UTC
File 16.9 mbs takes

real	0m6,576s
user	0m5,529s
sys	0m0,960s

in

Version: 7.0.0.0.alpha0+
Build ID: 850b8de31c5be5127eac16a4f5cc18c26a582e53
CPU threads: 4; OS: Linux 4.19; UI render: default; VCL: gtk3; 
Locale: en-US (en_US.UTF-8); UI-Language: en-US
Calc: threaded

while in

Version: 7.0.0.0.alpha0+
Build ID: 32c5832dfccc2f40370c2795b44adaf3b357d603
CPU threads: 4; OS: Linux 4.19; UI render: default; VCL: gtk3; 
Locale: en-US (en_US.UTF-8); UI-Language: en-US
Calc: threaded

it takes

real	5m6,247s
user	2m5,359s
sys	3m0,256s

nicee!! Backporting to 6.4 branch
@Pavel, @Noel, thanks for fixing this issue!! Should we close this issue now ?
Comment 21 Commit Notification 2020-04-21 19:57:51 UTC
Pavel Klevakin committed a patch related to this issue.
It has been pushed to "libreoffice-6-4":

https://git.libreoffice.org/core/commit/acd8105a825ee4d0efa25ddf512dbb373bd7b8f3

tdf#131951: automatically increase buffer size

It will be available in 6.4.4.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 22 Pavel 2020-04-22 05:58:10 UTC
Great! Thanks all!
I think this bug can be closed
Comment 23 Julien Nabet 2020-04-22 07:53:28 UTC
Thank you for your feedback Pavel
Comment 24 Julien Nabet 2020-04-22 07:53:57 UTC
Let's put this one to VERIFIED since it's been confirmed.
Comment 25 Commit Notification 2020-04-22 09:51:21 UTC
Noel Grandin committed a patch related to this issue.
It has been pushed to "libreoffice-6-4":

https://git.libreoffice.org/core/commit/c6ae3a0610700a730d549c25dbff1748f02b8e3e

tdf#131951 related, improve perf

It will be available in 6.4.4.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.