Problem description:
Importing a large 60MB file with about 500000 lines is very slow. It runs for hours. The status line currently shows "Adapt row height".
Steps to reproduce:
1. Create a large csv file with some date (YYYY-MM-DD), some text and some number columns: 500000 lines in total, about 60MB in size.
2. Open the file and set the right column types.
Current behavior: Reading takes hours on an Intel i5 with SSD. 4 cores used, high load.
Expected behavior: Read the file in a couple of minutes.
Operating System: Ubuntu
Version: 4.1.0.4 release
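For anyone who wants to reproduce this without the original data, a test file along the lines of the steps above could be generated with a small script. This is only a sketch: the function name `write_test_csv`, the column layout (one date, one text, two number columns), the comma separator and the start date are all assumptions, since the report does not describe the real file in that detail.

```python
import csv
import datetime
import io
import random


def write_test_csv(out, rows, seed=0):
    """Write `rows` records (date, text, integer, decimal) to the text stream `out`.

    The column layout is an assumption, not the reporter's real file.
    """
    rng = random.Random(seed)
    writer = csv.writer(out, lineterminator="\n")
    start = datetime.date(2013, 1, 1)  # arbitrary start date
    for i in range(rows):
        day = start + datetime.timedelta(days=i % 365)
        writer.writerow([
            day.isoformat(),            # YYYY-MM-DD, as in the report
            f"text {i}",                # some text column
            rng.randint(0, 10**6),      # some integer column
            f"{rng.random():.4f}",      # some decimal column
        ])


# Small in-memory demonstration with 5 records.
buf = io.StringIO()
write_test_csv(buf, 5)
sample = buf.getvalue()
```

To get a file of the reported size, the same function can be pointed at a real file opened with `newline=""` and called with `rows=500_000`; the resulting size depends on the columns chosen and will not necessarily be 60MB.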
Hi pinus, thanks for reporting.
I can open a csv file with 16500 records of eleven fields each (two are dates) in a couple of seconds with:
Win7x64Ultimate
Version: 4.1.4.0.0+ Build ID: d6ee64b75581cbeb92534271ee6f4e87f07aa5c
Is there some field with multi-line text?
Have you tried resetting the user profile? That sometimes solves strange issues.
http://wiki.documentfoundation.org/UserProfile
I'm not sure whether the problem in your case is the number of lines (BTW, do you mean you have 500000 records that should end up in Calc as 500000 rows?), the size of the csv (60MB), or its content.
I just tested a simple csv with more than 10^6 records (date, text, number), about 24MB. It is indeed a very simple csv, and it took just a couple of minutes to import it into Calc (but I tested on Windows, Version: 4.1.3.2 Build ID: 70feb7d99726f064edab4605a8ab840c50ec57a).
I would suggest a couple of tests. First, make a copy of your original csv.
Test A:
1_ Trim the csv so it contains a few records (say, 10).
2_ Import into Calc.
3_ Success?
Test B:
1_ Trim the csv so it contains 65000 records.
2_ Import into Calc.
3_ Success?
Test C:
1_ Trim the csv so it contains 66000 records.
2_ Import into Calc.
3_ Success?
Calc is supposed to accept more rows than that, but if you already see significantly different behavior between those 3 tests, then it might say something about the number of records; or it might indicate some problem in the csv, or some inadequate option in the import procedure.
Regards,
Ady.
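The trimming step in tests A to C could be scripted roughly as follows. This is a sketch under assumptions: the helper name `trim_csv` is made up, and a comma-separated file with standard double-quote quoting is assumed. It counts records with Python's `csv` module rather than counting raw lines, so a quoted multi-line text field (if the file has any) is not cut in half.

```python
import csv
import io


def trim_csv(src, dst, max_records):
    """Copy at most `max_records` records from text stream `src` to `dst`.

    Returns the number of records actually written.
    """
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    written = 0
    for record in reader:
        if written >= max_records:
            break
        writer.writerow(record)
        written += 1
    return written


# In-memory demonstration: the first record contains a multi-line field.
source = io.StringIO('a,"x\ny",1\nb,z,2\nc,w,3\n')
target = io.StringIO()
count = trim_csv(source, target, 2)
```

With real files, both would be opened with `newline=""` as the `csv` module documentation recommends, and `max_records` set to 10, 65000 and 66000 for the three tests.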
(In reply to comment #1)
> Hi pinus, thanks for reporting.
>
> I can open a csv file with 16500 records with eleven fields (two are dates)
> by record in a couple of seconds with:
> Win7x64Ultimate
> Version: 4.1.4.0.0+ Build ID: d6ee64b75581cbeb92534271ee6f4e87f07aa5c
>
> Is there some field with multi-line text?
>
> Have you tried resetting the user profile?, sometimes solve strange issues.
> http://wiki.documentfoundation.org/UserProfile
Well, 16000 lines load in acceptable time. 100000 lines take about 4 minutes. A file with 260000 lines took about 35 minutes. I stopped loading the big file with 650000 lines after 25 minutes, with the status bar at about 30%. This shows that the process is not linear!
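As a rough check of the "not linear" observation, one can fit a power law t = c * n^k through the two complete measurements above (about 100000 lines in 4 minutes and about 260000 lines in 35 minutes). The function name `scaling_exponent` is made up for this sketch, and two hand-timed data points give only a crude estimate.

```python
import math


def scaling_exponent(n1, t1, n2, t2):
    """Exponent k of t = c * n**k through two (size, time) measurements."""
    return math.log(t2 / t1) / math.log(n2 / n1)


# Sizes in lines, times in minutes, taken from the comment above.
k = scaling_exponent(100_000, 4.0, 260_000, 35.0)
```

A linear import would give k close to 1. Here k comes out a little above 2, which points towards roughly quadratic growth with the number of lines.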
I extended my CSV file to about 259000 records; it takes around 40 seconds to load. Can you share a sample file to try? (Be aware that all comments posted, including attachments, are public.)
Created attachment 89284 [details] The import configuration, dates and large numbers as text
Created attachment 89285 [details]
The 100T lines test file
Loads in about 4:30 with the attached import config and 2:50 with the default config.
Created attachment 89300 [details]
Sample file with empty fields replaced with zeroes
It seems that the options for detecting special numbers or the text delimiter don't make any difference.
Load times when eliminating columns:
0:20 first 10 columns
1:01 first 11 columns
1:30 first 12 columns
2:02 first 13 columns
3:19 columns 11 to 16
3:45 all columns
The issue is clearly related to empty fields; after replacing them with zeroes, the load time is reduced:
0:32 all columns
Win7x64Ultimate
Version: 4.1.4.0.0+ Build ID: d6ee64b75581cbeb92534271ee6f4e87f07aa5c
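As a possible workaround on 4.1, the replacement of empty fields with zeroes could be done by a script before importing. This is only a sketch: the helper name `fill_empty_fields` is made up, a comma separator is assumed (the attached file may use a different one), and a zero is of course only acceptable where it does not change the meaning of the data.

```python
import csv
import io


def fill_empty_fields(src, dst, filler="0"):
    """Copy csv records from `src` to `dst`, replacing empty fields with `filler`.

    Returns the number of fields that were replaced.
    """
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    replaced = 0
    for record in reader:
        out = []
        for field in record:
            if field == "":
                out.append(filler)
                replaced += 1
            else:
                out.append(field)
        writer.writerow(out)
    return replaced


# In-memory demonstration with three empty fields.
source = io.StringIO("a,,1\n,,2\n")
target = io.StringIO()
replaced = fill_empty_fields(source, target)
```

For real files, both streams would be opened with `newline=""`, and the separator can be passed to `csv.reader` and `csv.writer` through their `delimiter` argument.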
Regression from:
Win7x64Ultimate
Version 4.0.6.2 (Build ID: 2e2573268451a50806fcd60ae2d9fe01dd0ce24)
Time for load:
0:19 all columns with empty fields
NOT reproducible in master:
Win7x64Ultimate
Version: 4.2.0.0.alpha0+ Build ID: df21d317dacc4533ac999f3c3088765393842676
TinderBox: Win-x86@42, Branch:master, Time: 2013-11-05_00:13:53
(In reply to comment #8)
> Regression from:
> Win7x64Ultimate
> Version 4.0.6.2 (Build ID: 2e2573268451a50806fcd60ae2d9fe01dd0ce24)
>
> Time for load:
> 0:19 all columns with empty fields
>
> NOT reproducible in master:
> Win7x64Ultimate
> Version: 4.2.0.0.alpha0+ Build ID: df21d317dacc4533ac999f3c3088765393842676
> TinderBox: Win-x86@42, Branch:master, Time: 2013-11-05_00:13:53
Nice to know that it is fixed with 4.2. Depending on the size of the patch/change it might be a good idea to backport the fix to 4.1. I'm not sure about your policies there.
I don't know whether it was fixed, or whether the issue in 4.1 is a side effect of some change that was later reverted or never applied in master. But it sure is good not to see it in master. In any case, it is a bug for 4.1.
This should be fixed in master.
(In reply to comment #9)
> (In reply to comment #8)
> > Regression from:
> > Win7x64Ultimate
> > Version 4.0.6.2 (Build ID: 2e2573268451a50806fcd60ae2d9fe01dd0ce24)
> >
> > Time for load:
> > 0:19 all columns with empty fields
> >
> > NOT reproducible in master:
> > Win7x64Ultimate
> > Version: 4.2.0.0.alpha0+ Build ID: df21d317dacc4533ac999f3c3088765393842676
> > TinderBox: Win-x86@42, Branch:master, Time: 2013-11-05_00:13:53
>
> Nice to know that it is fixed with 4.2. Depending on the size of the
> patch/change it might be a good idea to backport the fix to 4.1. I'm not
> sure about your policies there.
I fixed it but it requires some parts of Kohei's refactoring that are not available in 4.1. Therefore there is no chance to backport this patch to the -41 branch.
I'll mark this resolved in 4.2. 4.1 is near EOL.