71582 – FILEOPEN: CSV input is very slow on large files (Empty fields are the source of the issue)


Status:	RESOLVED WORKSFORME

Alias:	None

Product:	LibreOffice
Classification:	Unclassified
Component:	Calc
Version: (earliest affected)	4.1.0.4 release
Hardware:	Other All

Importance:	medium normal
Assignee:	Not Assigned

URL:
Whiteboard:	BSA target:4.2.0
Keywords:	regression

Depends on:
Blocks:

Reported:	2013-11-13 16:12 UTC by pinus
Modified:	2014-05-18 14:30 UTC
CC List:	2 users

See Also:
Crash report or crash signature:

Attachments
The import configuration, dates and large numbers as text (94.05 KB, image/png) 2013-11-15 20:08 UTC, pinus
The 100T lines test file (1.69 MB, application/x-gzip) 2013-11-15 20:10 UTC, pinus
Sample file with empty fields replaced with zeroes (1.59 MB, application/zip) 2013-11-16 01:02 UTC, m_a_riosv


Description pinus 2013-11-13 16:12:30 UTC

Problem description: 
Importing a large 60MB file with about 500000 lines is very slow.
The import runs for hours. The status line currently shows "Adapt row height".

Steps to reproduce:
1. Create a large CSV file with some date (YYYY-MM-DD), some text and some number columns, 500000 lines in total, with a file size of about 60MB.
2. Open the file and set the right column types.
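
A test file along the lines of step 1 could be generated with a short script. This is only a sketch: the function name `write_test_csv`, the column layout, the comma delimiter and the share of empty numeric fields (`empty_ratio`, added because later comments point at empty fields) are assumptions, not details of the original file.

```python
import csv
import random
from datetime import date, timedelta


def write_test_csv(path, rows=500_000, empty_ratio=0.3, seed=0):
    """Write `rows` records with one date, one text and five number columns.

    The layout is hypothetical; adjust it to resemble the real data.
    """
    rng = random.Random(seed)
    start = date(2013, 1, 1)
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        for i in range(rows):
            day = start + timedelta(days=i % 365)
            text = "item-%06d" % i
            # Leave a share of the numeric fields empty (assumed ratio).
            numbers = [
                "" if rng.random() < empty_ratio else str(rng.randint(0, 10**12))
                for _ in range(5)
            ]
            # Dates are written as YYYY-MM-DD, as in the steps above.
            writer.writerow([day.isoformat(), text] + numbers)
```

Calling `write_test_csv("large_test.csv")` writes 500000 records; the resulting file size depends on the chosen columns and will not necessarily match the 60MB of the original file.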

Current behavior:
Reading takes hours on an Intel i5 with SSD. 4 cores used, high load.

Expected behavior:
Read the files in a couple of minutes.

              
Operating System: Ubuntu
Version: 4.1.0.4 release

Comment 1 m_a_riosv 2013-11-14 00:15:01 UTC

Hi pinus, thanks for reporting.

I can open a CSV file of 16500 records, each with eleven fields (two of them dates), in a couple of seconds with:
Win7x64Ultimate
Version: 4.1.4.0.0+ Build ID: d6ee64b75581cbeb92534271ee6f4e87f07aa5c

Is there some field with multi-line text?

Have you tried resetting the user profile? That sometimes solves strange issues.
http://wiki.documentfoundation.org/UserProfile

Comment 2 Ady 2013-11-14 01:26:15 UTC

I'm not sure whether the problem in your case is the number of lines (BTW, do you mean you have 500000 records that should end up in Calc as 500000 rows?), the size of the CSV (60MB) or its content.

I just tested a simple CSV with more than 10^6 records (date, text, number), about 24MB. Admittedly it is a very simple CSV, and it took just a couple of minutes to import it into Calc (but I tested on Windows, Version: 4.1.3.2 Build ID: 70feb7d99726f064edab4605a8ab840c50ec57a ).


I would suggest a couple of tests. First, make a copy of your original csv.

Test A:
1_ Trim the csv so it contains a few records (say, 10).
2_ Import into Calc.
3_ Success?

Test B:
1_ Trim the csv so it contains 65000 records.
2_ Import into Calc.
3_ Success?

Test C:
1_ Trim the csv so it contains 66000 records.
2_ Import into Calc.
3_ Success?

Calc is supposed to accept more rows than that, but if you already see significantly different behavior between those 3 tests, it might say something about the number of records; or it might indicate some problem in the CSV, or some unsuitable option in the import procedure.
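
The trimmed copies for tests A to C could be produced with a small helper like the following. It is a sketch under assumptions: `trim_csv` is a hypothetical name, and the file is taken to be comma-delimited UTF-8. It counts records rather than physical lines, so a quoted multi-line field stays in one piece.

```python
import csv


def trim_csv(source, target, max_records, delimiter=","):
    """Copy at most `max_records` records from `source` to `target`.

    Returns the number of records actually written.
    """
    written = 0
    with open(source, newline="", encoding="utf-8") as src, \
            open(target, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst, delimiter=delimiter)
        for record in csv.reader(src, delimiter=delimiter):
            if written >= max_records:
                break
            writer.writerow(record)
            written += 1
    return written
```

For example, `trim_csv("original.csv", "test_b.csv", 65000)` would give the file for test B. The writer may quote fields differently from the original file, which should not matter for these tests.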

Regards,
Ady.

Comment 3 pinus 2013-11-14 21:32:16 UTC

(In reply to comment #1)

Well, 16T (16000) lines load in acceptable time. 100T lines take about 4 minutes. A file with 260T lines took about 35 minutes. I stopped loading the big file with 650T lines after 25 minutes, with the status bar at about 30%.

This shows that the load time does not grow linearly with the number of lines!
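
As a rough check of that claim, the two measured points can be turned into a scaling exponent. The sketch below assumes "T" means thousand and that the time grows like lines^k between the two measurements; with only two data points this is an estimate, not a fit.

```python
import math

# (lines, minutes) pairs taken from the timings above; "T" read as thousand.
measurements = [(100_000, 4.0), (260_000, 35.0)]


def scaling_exponent(small, large):
    """Return k such that time grows roughly like lines**k between two points."""
    (n1, t1), (n2, t2) = small, large
    return math.log(t2 / t1) / math.log(n2 / n1)


exponent = scaling_exponent(*measurements)
print(round(exponent, 2))
```

A linear process would give an exponent close to 1. Here 2.6 times as many lines take 8.75 times as long, which gives an exponent of about 2.3, closer to quadratic growth.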

Comment 4 m_a_riosv 2013-11-15 02:14:17 UTC

I extended my CSV file with about 259000 records; it takes around 40 seconds to load.

Can you share a sample file to try? (Be aware that all comments posted, including attachments, are public.)

Comment 5 pinus 2013-11-15 20:08:14 UTC

Created attachment 89284 [details]
The import configuration, dates and large numbers as text

Comment 6 pinus 2013-11-15 20:10:36 UTC

Created attachment 89285 [details]
The 100T lines test file

Loads in about 4:30 with the attached config and 2:50 with the default config.

Comment 7 m_a_riosv 2013-11-16 01:02:30 UTC

Created attachment 89300 [details]
Sample file with empty fields replaced with zeroes

It seems that the options for detecting special numbers or the text delimiter don't make any difference.

Load times when eliminating columns:
0:20 10 first columns
1:01 11 first columns
1:30 12 first columns
2:02 13 first columns
3:19 from column 11 to 16
3:45 All columns
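
Column subsets like the ones timed above could be cut from the test file with something like this. It is a sketch only: `keep_columns` is a hypothetical helper, and a comma delimiter and UTF-8 encoding are assumed.

```python
import csv


def keep_columns(source, target, first, last, delimiter=","):
    """Copy columns `first`..`last` (1-based, inclusive) of every record."""
    with open(source, newline="", encoding="utf-8") as src, \
            open(target, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst, delimiter=delimiter)
        for record in csv.reader(src, delimiter=delimiter):
            # Slicing tolerates short records: missing columns are dropped.
            writer.writerow(record[first - 1:last])
```

`keep_columns("test.csv", "first10.csv", 1, 10)` would correspond to the "10 first columns" case, and `keep_columns("test.csv", "cols11to16.csv", 11, 16)` to the "from column 11 to 16" case.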

The issue is clearly related to empty fields. After replacing them with zeroes, the load time is reduced:
0:32 All columns
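
For anyone who wants to repeat this comparison, the replacement of empty fields could be scripted roughly as follows. This is not necessarily how the attached sample was produced; `fill_empty_fields` is a hypothetical helper that assumes a comma-delimited UTF-8 file and treats only completely empty fields (not whitespace-only ones) as empty.

```python
import csv


def fill_empty_fields(source, target, filler="0", delimiter=","):
    """Copy `source` to `target`, writing `filler` into every empty field.

    Returns the number of fields that were replaced.
    """
    replaced = 0
    with open(source, newline="", encoding="utf-8") as src, \
            open(target, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst, delimiter=delimiter)
        for record in csv.reader(src, delimiter=delimiter):
            filled = []
            for field in record:
                if field == "":
                    filled.append(filler)
                    replaced += 1
                else:
                    filled.append(field)
            writer.writerow(filled)
    return replaced
```

The return value gives a quick idea of how many empty fields the file contained, which may help when comparing load times between files.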

Win7x64Ultimate
Version: 4.1.4.0.0+ Build ID: d6ee64b75581cbeb92534271ee6f4e87f07aa5c

Comment 8 m_a_riosv 2013-11-16 01:11:59 UTC

Regression from:
Win7x64Ultimate
Version 4.0.6.2 (Build ID: 2e2573268451a50806fcd60ae2d9fe01dd0ce24)

Load time:
0:19 all columns with empty fields

NOT reproducible in master:
Win7x64Ultimate
Version: 4.2.0.0.alpha0+ Build ID: df21d317dacc4533ac999f3c3088765393842676
TinderBox: Win-x86@42, Branch:master, Time: 2013-11-05_00:13:53

Comment 9 pinus 2013-11-16 11:26:04 UTC

(In reply to comment #8)

Nice to know that it is fixed in 4.2. Depending on the size of the patch/change, it might be a good idea to backport the fix to 4.1. I'm not sure about your policies there.

Comment 10 m_a_riosv 2013-11-16 14:19:01 UTC

I don't know whether it was fixed, or whether the issue in 4.1 is a side effect of some change that was later reverted or never applied in master. But it is certainly good not to see it in master. In any case, it is a bug for 4.1.

Comment 11 Markus Mohrhard 2013-11-19 11:47:04 UTC

This should be fixed in master.

(In reply to comment #9)

I fixed it, but the fix requires some parts of Kohei's refactoring that are not available in 4.1. Therefore, there is no chance of backporting this patch to the -41 branch.

Comment 12 Kohei Yoshida 2014-05-18 14:30:35 UTC

I'll mark this resolved in 4.2.  4.1 is near EOL.