Download it now!
Bug 129166 - XCellRangeData::setDataArray is many times slower when data contains some empty strings
Summary: XCellRangeData::setDataArray is many times slower when data contains some emp...
Status: NEW
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: BASIC (show other bugs)
Version:
(earliest affected)
unspecified
Hardware: All All
: medium normal
Assignee: Not Assigned
URL:
Whiteboard:
Keywords: bibisectNotNeeded, perf, regression
Depends on:
Blocks: Macro
  Show dependency treegraph
 
Reported: 2019-12-03 20:57 UTC by Mike Kaganski
Modified: 2020-06-07 18:37 UTC (History)
5 users (show)

See Also:
Crash report or crash signature:


Attachments
perf flamegraph (51.29 KB, application/x-bzip)
2019-12-04 19:32 UTC, Julien Nabet
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Mike Kaganski 2019-12-03 20:57:00 UTC
Consider the code

> sub DataArray
>   const maxRow = 900000
>   dim data(0 to maxRow)
>   for i = 0 to maxRow
>     if (i mod 10 = 0) Then data(i) = array("0") else data(i) = array(CStr(i))
>   next i
>   range = ThisComponent.Sheets(0).getCellRangeByPosition(0,0,0,maxRow)
>   t = now()
>   range.setDataArray(data())
>   MsgBox Format(now() - t, "[s] "" s""")
> end sub

Running in a new spreadsheet, setDataArray takes ~7-8 s on my system.

But replacing `array("0")` with `array("")` makes it execute longer than ~800 s. So the empty strings in the data array slow down the function manyfold.

Tested with Version: 6.3.4.1 (x64)
Build ID: a21169d87339dfa44546f33d6d159e89881e9d92
CPU threads: 12; OS: Windows 10.0; UI render: GL; VCL: win; 
Locale: ru-RU (ru_RU); UI-Language: en-US
Calc: threaded
Comment 1 Mike Kaganski 2019-12-04 08:55:43 UTC
Additional detail: replacing `array("0")` in the code sample with `array(" ")` - i.e., keeping it a non-empty string, but with a space instead of a zero character, makes the setDataArray function complete in ~30 s, which is also unexpected slowdown.
Comment 2 Mike Kaganski 2019-12-04 09:35:20 UTC
Saving to an ODS and reloading each test result gives these results:

1. Saving and reloading results of original code using `if (i mod 10 = 0) Then data(i) = array("0") else data(i) = array(CStr(i))`, the reload takes ~13 s
2. Saving and reloading results of test in comment 1 using `if (i mod 10 = 0) Then data(i) = array(" ") else data(i) = array(CStr(i))`, the reload takes ~35 s
3. Saving and reloading results of test in comment 0 using `if (i mod 10 = 0) Then data(i) = array("") else data(i) = array(CStr(i))`, the reload takes ~80 s

So there is a good correlation between setDataArray performance and reload time for #1 and #2, with ~5 s constant overhead in reloading; but for #3 with empty strings, setDataArray performs about 10 times slower than reloading its resulting data from file.
Comment 3 Roman Kuznetsov 2019-12-04 10:04:31 UTC
confirm in

Версия: 6.5.0.0.alpha0+ (x64)
ID сборки: 89f0af144c18efafe2573801641689a1432c0cae
Потоков ЦП: 4; ОС:Windows 10.0 Build 18362; Отрисовка ИП: по умолчанию; VCL: win; 
Локаль: ru-RU (ru_RU); Язык интерфейса: ru-RU
Calc: threaded

It takes 19s vs 491s on my machine (when replacing `array("0")` with `array("")`)
Comment 4 Roman Kuznetsov 2019-12-04 10:52:26 UTC
Julien, can you make your nice perfgraph here?
Comment 5 Julien Nabet 2019-12-04 10:57:02 UTC
(In reply to Roman Kuznetsov from comment #4)
> Julien, can you make your nice perfgraph here?

No pb, I'll do it when get back home.
Comment 6 Oliver Brinzing 2019-12-04 18:19:08 UTC
i can confirm:

LO 3.6.7.2 (Build ID: e183d5b) and AOO 4.1.5: 
~ 5 sec for "0" and ""

with LO 4.4.7.2 and LO 6.1/6.3 branch:
~11 sec for "0" and ~350 sec for ""
Comment 7 Mike Kaganski 2019-12-04 19:32:11 UTC
Started between 4.0 and 4.1
Comment 8 Julien Nabet 2019-12-04 19:32:44 UTC
Created attachment 156310 [details]
perf flamegraph

Here's a Flamegraph with maxRow=90000 and array("")
(8seconds)
Comment 9 Noel Grandin 2019-12-05 12:51:14 UTC
Ha :-)


Our data structures are very heavily optimised for continuous runs of data of the same type.

In this case, the first macro hits the heavily optimised path and runs very fast.
The second macro introduces discontinuities (because every 10th element is a different type), which pushes us off the optimised path.

In the first case, we end up with one block of MDDS data, and in the second case, we get a new block of MDDS data every 10th block, which adds up really quickly.
Comment 10 Roman Kuznetsov 2019-12-05 13:33:39 UTC
(In reply to Noel Grandin from comment #9)
> Ha :-)
> 
> 
> Our data structures are very heavily optimised for continuous runs of data
> of the same type.
> 
> In this case, the first macro hits the heavily optimised path and runs very
> fast.
> The second macro introduces discontinuities (because every 10th element is a
> different type), which pushes us off the optimised path.
> 
> In the first case, we end up with one block of MDDS data, and in the second
> case, we get a new block of MDDS data every 10th block, which adds up really
> quickly.

Noel, do you mean that it's not a bug?
Comment 11 Noel Grandin 2019-12-05 13:36:37 UTC
(In reply to Roman Kuznetsov from comment #10)
> 
> Noel, do you mean that it's not a bug?

Well in was never a bug in the sense that LO is broken, we are doing the right thing, just not doing it very quickly.

It is possible that it can improved, but that would require substantial work, and it will never be as fast as the first macro.
Comment 12 Mike Kaganski 2019-12-05 14:56:27 UTC
Possibly empty cells could be treated as not having different type compared to their neighbours? Typical use of large tabular data (where the performance problem happens) is to have same type of data in column; and empty value could be thought as a special case of value of any data type.

And in case of strings, this is especially true, because while we could argue that empty is not a valid number (but still it would be useful to have special "empty" number for efficiency), empty string is perfectly valid thing.