Bug 94677 - Calc is slow opening large CSV
Summary: Calc is slow opening large CSV
Status: NEW
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Calc
Version (earliest affected): 5.0.1.2 release
Hardware: x86-64 (AMD64) All
Importance: medium major
Assignee: Not Assigned
URL:
Whiteboard: target:6.4.0
Keywords: haveBacktrace, perf
Depends on:
Blocks: CSV-Import
Reported: 2015-10-01 18:50 UTC by john cantin
Modified: 2019-08-07 20:11 UTC (History)
CC List: 6 users

See Also:
Crash report or crash signature:


Attachments
test file (4.31 MB, application/x-7z-compressed)
2015-10-06 16:51 UTC, raal
Callgrind output from master (3.74 MB, application/x-xz)
2018-09-22 10:46 UTC, Buovjaga

Description john cantin 2015-10-01 18:50:39 UTC
FILEOPEN

I'm dealing with large CSV files (200-500 MB) that were created by exporting from PostgreSQL. If there are no CRLFs in the cells, then Calc is a little slower than Excel (3:00 min vs 2:28 min for a 315 MB file).

But if there are CRLFs in the fields, then Calc can take 3-4 times as long as Excel to open the file (6:35 min vs 1:40 min for a 182 MB file).

Win 7 pro service pack 1
LO 5.0.1.2 (although I was experiencing the same issue with version 4)
Comment 1 MM 2015-10-02 14:33:34 UTC
Looks a bit like bug 84246 or even bug 82605
Comment 2 john cantin 2015-10-02 14:45:14 UTC
(In reply to MM from comment #1)
> Looks a bit like bug 84246 or even bug 82605

Not like bug 84246 because there is no crash, it just takes really long.

Not like bug 82605 because I'm using " as the delimiter and the default delimiter is already ".

The file eventually loads up just fine and as expected, it just takes a REALLY long time.  The CSV file in question actually has several cells per row that may have CRLFs.
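
For anyone without access to the original data, a file with the same shape (several quoted cells per row containing CRLFs) can be generated synthetically. The following is a minimal sketch, not the reporter's actual export: the column count, the cell text, and the helper names `write_sample_csv` and `sample_csv_text` are all assumptions made for illustration.

```python
import csv
import io


def write_sample_csv(out, rows, cols=6, multiline_cols=(2, 4)):
    """Write `rows` rows to the text stream `out`.

    Cells in the columns listed in `multiline_cols` contain embedded CRLFs,
    so the csv module wraps them in double quotes automatically.
    Column count and cell contents are illustrative assumptions.
    """
    writer = csv.writer(out, quoting=csv.QUOTE_MINIMAL, lineterminator="\r\n")
    for r in range(rows):
        row = []
        for c in range(cols):
            if c in multiline_cols:
                row.append(f"line one {r}\r\nline two {r}\r\nline three {r}")
            else:
                row.append(f"value {r}-{c}")
        writer.writerow(row)


def sample_csv_text(rows, **kwargs):
    """Return the generated CSV as a string (handy for small checks)."""
    buf = io.StringIO(newline="")  # newline="" keeps CRLFs untouched
    write_sample_csv(buf, rows, **kwargs)
    return buf.getvalue()


# Usage (writes a file; raise `rows` until the size is comparable):
# with open("multiline_sample.csv", "w", newline="", encoding="utf-8") as f:
#     write_sample_csv(f, rows=100_000)
```

Passing `multiline_cols=()` produces a control file with no embedded line breaks, which allows the two cases from the description to be compared on otherwise identical data.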
Comment 3 tommy27 2015-10-03 06:15:45 UTC
@john
do you have a test file?
since it's a big one you should upload it to some webhosting space
Comment 4 raal 2015-10-03 06:45:41 UTC
I have set the bug's status to 'NEEDINFO', so please do change it back to 'UNCONFIRMED' once you have attached a document.
Comment 5 john cantin 2015-10-06 13:10:12 UTC
I've uploaded a sample file here: https://www.dropbox.com/s/9mcnlf367yr1nns/02_MWR_SXx.csv?dl=0

This is a mailing list, typical of what I have to produce for my company.  I've overwritten the personal info in the file.
Comment 6 raal 2015-10-06 16:51:50 UTC
Created attachment 119361
test file
Comment 7 Buovjaga 2015-10-08 16:49:58 UTC
I waited for 17 minutes and then got bored and killed it.

Win 7 Pro 64-bit, Version: 5.0.2.2 (x64)
Build ID: 37b43f919e4de5eeaca9b9755ed688758a8251fe
Locale: fi-FI (fi_FI)
Comment 8 john cantin 2015-10-13 18:39:24 UTC
I've not tried reporting a bug to LibreOffice before, so I have no idea what to expect.  Is this officially recognized as a bug now?  Will someone be looking at fixing it?  Is there a timeline on it?

I have users who are complaining to me about it and I'd like to be able to tell them something.
Comment 9 tommy27 2015-10-14 03:37:47 UTC
(In reply to john cantin from comment #8)
> I've not tried reporting a bug to LibreOffice before, so I have no idea
> what to expect.  

welcome on board :-)

> Is this officially recognized as a bug now? 

yes. when you report a bug the status is UNCONFIRMED.
when another user is able to reproduce it the status is set to NEW which means that the bug is confirmed.

> Will someone be looking at fixing it?  Is there a timeline on it?

there's no timeline yet.
what you can do to speed up the fixing is test the bug with older releases in order to know if the issue has always been present or if it's a regression (it worked fine in a previous release and became a bug in a newer one).

so I suggest going to this page: 
http://sourceforge.net/projects/winpenpack/files/X-LibreOffice/releases/

and downloading some older LibO portable versions and retesting.
I suggest testing the last version of each branch (e.g. 4.4.5, 4.3.5, 4.2.6) until you find the first version that doesn't show the bug, then moving forward and testing the first release of each branch (e.g. 4.4.0, 4.3.0, 4.2.0) to find the first release that did show the bug.

if you find the regression point it will be easier to identify the root of the issue and get a fix
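
The search for a regression point can be sketched as follows. Note that this uses a plain binary search over an ordered release list rather than the branch-by-branch walk described above, and it assumes that once a release shows the bug, every later release does too. The `shows_bug` callback is hypothetical: in practice it would be a manual test of that release.

```python
def first_bad(versions, shows_bug):
    """Return the earliest version for which shows_bug(version) is True.

    `versions` must be ordered oldest to newest, and the bug is assumed to
    persist once introduced. Returns None if no version shows the bug.
    """
    lo, hi = 0, len(versions)
    while lo < hi:
        mid = (lo + hi) // 2
        if shows_bug(versions[mid]):
            hi = mid          # first bad release is at mid or earlier
        else:
            lo = mid + 1      # first bad release is after mid
    return versions[lo] if lo < len(versions) else None


# Usage with the releases named above; the regression point used here
# is made up purely to show the call.
releases = ["4.2.0", "4.2.6", "4.3.0", "4.3.5", "4.4.0", "4.4.5"]
result = first_bad(releases, lambda v: releases.index(v) >= 2)
```

With six releases this needs at most three tests, which matters when each test is a multi-minute file open.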
Comment 11 Daniel 2016-01-30 15:15:56 UTC
I confirm, this still happens (5.0.4.2 release) with a rather simple 80 MB CSV file (6 columns).

I tried to open it (selecting just 3 columns instead of all 6) and for 5 minutes nothing happened.

I had to kill soffice.bin as it was consuming all of the CPU.

I don't understand why most programs, including LibreOffice, are just plain stupid about opening large files. Instead of buffering just some lines (1 MB?), a principle known and used in the 1980s in most word processors, they try to load the whole thing and are not even able to do that: they crash or need to be killed.
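
As an illustration of the buffering idea only (this is not how Calc's importer works, and a spreadsheet generally needs the whole sheet in memory for formulas and sorting), a CSV can be consumed a fixed number of rows at a time. The function name and the chunk size below are assumptions.

```python
import csv
from itertools import islice


def iter_csv_chunks(lines, chunk_rows=10_000):
    """Yield lists of at most `chunk_rows` parsed rows.

    `lines` is any iterable of text lines, e.g. a file opened with
    newline="". Only one chunk is held in memory at a time.
    """
    reader = csv.reader(lines)
    while True:
        chunk = list(islice(reader, chunk_rows))
        if not chunk:
            break
        yield chunk


# Usage:
# with open("big.csv", newline="", encoding="utf-8") as f:
#     for chunk in iter_csv_chunks(f):
#         ...  # process or display this chunk
```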
Comment 12 Buovjaga 2016-01-30 15:21:15 UTC
No need to change version field.
Comment 13 Markus Mohrhard 2016-03-30 19:53:07 UTC
This needs to be retested in current master.
Comment 14 Buovjaga 2016-04-01 14:52:48 UTC
(In reply to raal from comment #6)
> Created attachment 119361 [details]
> test file

Both 5.1 and 5.2 take about 2 min 20 sec to open the file on my very fast computer.

Arch Linux 64-bit, KDE Plasma 5
Version: 5.2.0.0.alpha0+
Build ID: 96c1ae1d8e78ae8b9bd7d4001645cad24d62b720
CPU Threads: 8; OS Version: Linux 4.4; UI Render: default; 
Locale: fi-FI (fi_FI.UTF-8)
Built on April 1st 2016

64-bit, KDE Plasma 5
Build ID: 5.1.1.3 Arch Linux build-2
CPU Threads: 8; OS Version: Linux 4.4; UI Render: default; 
Locale: fi-FI (fi_FI.UTF-8)
Comment 15 Markus Mohrhard 2016-04-13 08:41:01 UTC
The slowdown in the multiline file is related to the row height calculation. We can't skip that so currently I see no way to get that to acceptable performance.
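
To get a rough idea of how much of a given file is exposed to this, one can count the rows that contain at least one cell with an embedded line break. This is only a sketch for sizing the problem: it says nothing about how Calc computes row heights internally, and the function name is an assumption.

```python
import csv


def count_multiline_rows(lines):
    """Return (total_rows, rows_with_a_multiline_cell).

    `lines` is any iterable of text lines, e.g. a file opened with
    newline="" so that CR/LF inside quoted cells is preserved.
    """
    total = 0
    multiline = 0
    for row in csv.reader(lines):
        total += 1
        if any("\n" in cell or "\r" in cell for cell in row):
            multiline += 1
    return total, multiline


# Usage:
# with open("02_MWR_SXx.csv", newline="", encoding="utf-8") as f:
#     print(count_multiline_rows(f))
```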
Comment 16 john cantin 2016-08-30 20:41:13 UTC
How about giving the user the option, as part of the CSV import dialog, to skip the automatic row height calculation?  The delay makes it completely impossible to use Calc on these files.
Comment 18 Buovjaga 2018-09-22 10:46:18 UTC
Created attachment 145104
Callgrind output from master

Took a callgrind in case it is of any help.

Arch Linux 64-bit
Version: 6.2.0.0.alpha0+
Build ID: 0ffa7a733d834647dfd59b864c52a015028822b6
CPU threads: 8; OS: Linux 4.18; UI render: default; VCL: gtk3_kde5; 
Locale: fi-FI (fi_FI.UTF-8); Calc: threaded
Built on September 21st 2018
Comment 19 Commit Notification 2019-06-25 06:29:33 UTC
Noel Grandin committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/+/c47d0174f2c6c3ebcb3b33276d0277e7aceac330%5E%21

tdf#94677 Calc is slow opening large CSV, avoid reset SetUpdateMode

It will be available in 6.4.0.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 20 Commit Notification 2019-06-25 06:29:42 UTC
Noel Grandin committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/+/2b58bb92b3d5da97290a0f273125ebc34fc5082b%5E%21

tdf#94677 Calc is slow opening large CSV, avoid std::shared_ptr

It will be available in 6.4.0.
Comment 21 Commit Notification 2019-06-25 11:21:34 UTC
Noel Grandin committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/+/31589bf0239679d73417902655045c48c4868016%5E%21

tdf#94677 Calc is slow opening large CSV, improve tools::Fraction

It will be available in 6.4.0.
Comment 22 Xisco Faulí 2019-06-28 11:48:50 UTC
it takes

real	6m28,995s
user	4m26,173s
sys	0m4,305s

in

Version: 6.4.0.0.alpha0+
Build ID: a294457eb95e60028539b6783abac78b56561fe2
CPU threads: 4; OS: Linux 4.15; UI render: default; VCL: gtk3; 
Locale: ca-ES (ca_ES.UTF-8); UI-Language: en-US
Calc: threaded

while in

Version: 6.3.0.0.beta2+
Build ID: e17e30dceb110e780a7e7e89c2ede854d4bc38a7
CPU threads: 4; OS: Linux 4.15; UI render: default; VCL: gtk3; 
Locale: ca-ES (ca_ES.UTF-8); UI-Language: en-US
Calc: threaded

it takes

real	10m24,053s
user	7m18,434s
sys	0m5,371s

Note: I was compiling LibreOffice when I measured it
Comment 23 Buovjaga 2019-06-29 05:49:43 UTC
Build from 23 June took 1min 15s (stopwatch time) to open, so we already had a ~1min improvement since my 2min 20s comment 14 in 2016 (probably thanks to other patches by Noel).
With a fresh build just now, it takes 43 seconds.

So the opening time is ~31% of the time in 2016.
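
The arithmetic behind that figure, using only the stopwatch times quoted in this comment and in comment 14:

```python
# Stopwatch times, in seconds, as reported in this thread.
t_2016 = 2 * 60 + 20      # comment 14: about 2 min 20 s
t_june_23 = 1 * 60 + 15   # build from 23 June: 1 min 15 s
t_fresh = 43              # fresh build with the patches

saved_before_patches = t_2016 - t_june_23   # 65 s, i.e. roughly 1 min
ratio = t_fresh / t_2016                    # about 0.307
percent_of_2016 = round(ratio * 100)        # 31
```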

A large file is a large file and I think these are pretty substantial improvements, so I would be happy to close this as fixed, unless Noel has other ideas.
Comment 24 Xisco Faulí 2019-07-01 13:17:57 UTC
I've just checked again without compiling at the same time and it takes

real	5m35,114s
user	5m25,324s
sys	0m5,297s

I don't know why it takes so long for me compared to Buovjaga's measurement...
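
Part of the difference may come from measuring different things (stopwatch time for an interactive open versus `time` on a process). A repeatable measurement could look like the sketch below. It assumes that a headless conversion to ODS is an acceptable proxy for opening the file, which is not guaranteed: conversion also includes export time and skips UI rendering. The helper names are assumptions, and `soffice` is assumed to be on the PATH.

```python
import subprocess
import sys
import time


def time_command(cmd):
    """Run `cmd` (a list of arguments) and return wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start


def soffice_convert_cmd(csv_path, outdir, soffice="soffice"):
    """Build a headless CSV -> ODS conversion command line."""
    return [soffice, "--headless", "--convert-to", "ods",
            "--outdir", outdir, csv_path]


# Usage:
# seconds = time_command(soffice_convert_cmd("02_MWR_SXx.csv", "/tmp/out"))
# print(f"{seconds:.1f} s")
```

Running it a few times on an otherwise idle machine and reporting the median would make numbers from different testers easier to compare.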