Bug 84246 - Calc crashes on saving large (~500MB) CSV files
Summary: Calc crashes on saving large (~500MB) CSV files
Status: RESOLVED WORKSFORME
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Calc
Version (earliest affected): 4.3.1.2 release
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assignee: Not Assigned
URL:
Whiteboard:
Keywords: filter:xlsx
Duplicates: 103267
Depends on:
Blocks: CSV-Import
Reported: 2014-09-23 14:15 UTC by Marcelo
Modified: 2021-01-26 09:24 UTC
CC List: 8 users

See Also:
Crash report or crash signature:


Attachments
simple bash script to generate a large CSV (212 bytes, application/x-shellscript)
2014-10-05 10:20 UTC, Owen Genat (retired)
Details
Random value CSV file ~237MB (unpacked) (3.93 MB, application/x-7z-compressed)
2016-11-24 23:28 UTC, Bartosz
Details

Description Marcelo 2014-09-23 14:15:28 UTC
When I open a big CSV file of about 500MB and try to save it as Excel 2007 (xlsx), Calc's RAM usage grows fast, the CPU stays at almost 100%, the progress bar of the save task does not advance, and the whole system starts to slow down until it freezes completely. It seems Calc tries to convert and compress the whole file in memory and only then writes the file. The process should instead be done in steps, on smaller chunks of the data.
Comment 1 tommy27 2014-09-26 17:46:54 UTC
We need a test file to reproduce this.
Comment 2 Owen Genat (retired) 2014-10-05 10:20:36 UTC
Created attachment 107355 [details]
simple bash script to generate a large CSV

This script writes out records like:

> 1,2014-10-05 19:15:28.174189280+11:00,7jStH8bW5iMk...

i.e., a sequence number, date-time stamp, and 4096 bytes of random base64 data.

The script takes a single number (iterations/records) as a parameter: 10000 generates a file of ~52MB, while 100000 generates a file of ~525MB. It is no doubt horribly inefficient, but effective.
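
For anyone without the attachment handy, a minimal sketch of a generator producing records of this shape (this is not the attached script itself; the use of /dev/urandom and the GNU date/base64 options are assumptions):

#!/bin/bash
# Sketch: write N CSV records, each with a sequence number, an RFC 3339
# timestamp, and 4096 characters of random base64 data.
# Usage: ./gencsv.sh 100000 > big.csv
n=${1:-10000}
for ((i = 1; i <= n; i++)); do
    # 3072 random bytes encode to 4096 base64 characters (no line wrapping)
    printf '%d,%s,%s\n' "$i" "$(date --rfc-3339=ns)" \
        "$(head -c 3072 /dev/urandom | base64 -w 0)"
done

Like the attachment, this is inefficient (one date/base64 subshell pair per record) but good enough to produce test files of arbitrary size.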

It is clear from testing with files created by this script that there are limits to Calc's ability to handle large CSV files. v4.3.2.2 crashes trying to load a CSV with 100000 records. The same version eventually loads a CSV with 90000 records (~470MB) but then crashes trying to save the loaded data as XLSX. Around 2.5GB of RAM is used by LO during this process.
Comment 3 Owen Genat (retired) 2014-10-05 10:24:36 UTC
I am not sure whether there is anything the developers can do about handling CSV files of this magnitude, but for now confirmed that there is an issue. Status set to NEW. Component set to Spreadsheet. Summary amended for clarity.
Comment 4 Jean-Baptiste Faure 2014-10-05 10:52:27 UTC
@Marcelo:

How much RAM do you have on your machine?

Best regards. JBF
Comment 5 Marcelo 2014-10-06 11:56:27 UTC
I have 4GB of RAM and an Intel Core i5-3230M 2.60GHz processor, running elementary OS 0.2.1 64-bit.
Comment 6 Owen Genat (retired) 2014-10-11 05:13:21 UTC
(In reply to Owen Genat from comment #2)
> Around 2.5GB of RAM is used by LO during this process.

This was a generalisation.

(In reply to Marcelo from comment #5)
> I have 4GB RAM, 

I think the machine is likely running out of RAM. Further test results using the provided script under GNU/Linux with v4.2.6.3:

On a system with 3708MB RAM, no swap.

CSV records/MB  XLSX MB  Peak RAM VIRT/RES[1]
--------------  -------  --------------------
30000/~157      ~119     2046/1.1
40000/~210      ~159     2378/1.6
50000/~262      ~200     2871/2.0
60000/~315      ~238     3268/2.3
70000/~367      ~278     3563/2.6
75000/~394      ~298     3750/2.8
80000/~420       N/A     3943/2.9[2]

On a system with 7941MB RAM, no swap.

CSV records/MB  XLSX MB  Peak RAM VIRT/RES[1]
--------------  -------  --------------------
100000/~525     ~397     4816/3.7

[1] Values of virtual (MB) and resident (GB) usage displayed by the top command.
[2] At this point Calc crashes.
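
For reference, peak figures like the VIRT/RES columns above can be logged from a second terminal while the export runs; the 5-second interval is an arbitrary choice:

# Append soffice.bin lines from top (batch mode) to a log during the export;
# the VIRT and RES columns correspond to the values in the tables above.
top -b -d 5 | grep --line-buffered soffice.bin >> soffice-mem.log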
Comment 7 QA Administrators 2015-10-14 19:56:40 UTC Comment hidden (obsolete)
Comment 8 Holli Bleßmann 2016-10-24 06:33:37 UTC
*** Bug 103267 has been marked as a duplicate of this bug. ***
Comment 9 Markus Mohrhard 2016-11-24 15:44:53 UTC
I'm not sure what people expect. Just storing the 500 MB of string data already requires 500 MB of memory, and that is before anything else we allocate for that data, such as per-cell information, ...

If you need to process 500 MB CSV files in a spreadsheet, you need quite a lot of memory. There is no way around it.
Comment 10 Holli Bleßmann 2016-11-24 17:17:25 UTC
This IS still a bug. I have the same issue with a 237MB CSV file on Xubuntu 16.04 LTS 64-bit, running on an Intel Core i7 with 8GB of RAM. Trying to save that file as MS Excel 2010 .xlsx ended up running out of memory (full use of RAM AND a 5GB SWAP PARTITION!) and, after 30 minutes, a crashed LibreOffice Calc (v5.1.2.x).

Btw: Xubuntu with LibreOffice Calc loaded and an empty sheet needs approx. 250MB of RAM and no use of the swap partition.

As Markus Mohrhard mentioned above, I expected a normal RAM consumption of (temporarily) 2x the document size plus the additional spreadsheet data needed. In my case that is something around 500MB, NOT 12.5GB!

So, this bug still exists.
Comment 11 Holli Bleßmann 2016-11-24 17:18:31 UTC Comment hidden (obsolete)
Comment 12 Bartosz 2016-11-24 23:28:50 UTC
Created attachment 128996 [details]
Random value CSV file ~237MB (unpacked)

I have tested exporting the attached CSV to .ODS and .XLSX.
For both formats, about 25GB of memory are needed to save to the desired format.

The difference is that during export to .ods the hard disc is used (you can see that free disc space decreases). The RAM usage of LibreOffice during export to .ods is constant (about 1200MB). Generally, exporting to .ods succeeds.

Unfortunately, during export to .xlsx only RAM is used; there is no disc usage at all. This means that during export to .xlsx the amount of free RAM drops dramatically until Calc crashes.

In my opinion the desired solution for exporting to .xlsx is to save data to disc, as is already done for .ods.
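
The difference is easy to observe from a terminal while an export is running; a rough sketch (the temp location is an assumption — LibreOffice normally uses the system temporary directory, typically /tmp on Linux):

# Watch free memory and temp-directory size every few seconds during the export.
# A .ods export should show the temp directory growing while RAM stays flat;
# a .xlsx export should show free RAM shrinking with no disc growth.
watch -n 5 'free -m; du -sh /tmp 2>/dev/null'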
Comment 13 Jean-Baptiste Faure 2017-08-07 06:57:07 UTC
I tried to export the attached csv file to .ods with LO 5.4.1.0.0+ built at home under Ubuntu 16.04 x86-64, but it failed with an I/O error message. Loading the csv file was no problem; it took a long time, but it worked.
The PC has 8 GB of RAM and a 512 GB SSD. The system monitor showed that the soffice.bin process used 1.1 GB of RAM.

Best regards. JBF
Comment 14 John Nagle 2017-08-19 20:30:02 UTC
I'm seeing a similar bug.  Read a 233MB CSV file into Calc without problems.
Attempted to write it out as CSV with slightly different options ("quote everything"), and it's half done after an hour.  About 2/3 done after an hour and a half. Progress continues. Calc is in the "greyed out" state, not processing mouse clicks. 

Since this is on Linux, it doesn't affect other processes much, and I'm letting this run to see if it finishes. It's not hung; the progress bar advances slowly. 

CPU utilization is 100% of one CPU. Memory not full, not thrashing. Very little disk I/O, maybe one write every 10 seconds.

This looks like there's something in writing out a CSV file that's worse than linear, so it chokes on large files. Somebody with a debug build should profile this while writing out a large CSV file and see where the CPU time is going.

Ubuntu 16.04LTS x64 8GB.  
Calc version: libreoffice-calc 1:5.1.6-rc2.0ubuntu1-xenial2
Comment 15 John Nagle 2017-08-19 20:50:06 UTC
Update: after over 2 hours, the writing of the CSV file finished. The output file is valid. Calc is back to its normal operating state.

Strongly suggest profiling and looking for something that's O(N^2) in file exporting.
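
Even without a debug build, a first-pass profile can be taken on Linux with perf while the slow save is in progress (symbol names will be limited without debug info; the 60-second sample window and a single soffice.bin process are assumptions):

# Sample the running Calc process for 60 seconds during the save, then inspect
# where the CPU time is being spent.
perf record -g -p "$(pidof soffice.bin)" -- sleep 60
perf report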
Comment 16 John Nagle 2017-08-19 20:57:42 UTC
File load time for 233MB file is only 30 seconds. Scrolling around in the file works fine.  It's only "Save" that's slow.

Save as .ods is also extremely slow. It's not just .csv. So it's not the CSV exporter.  About 5 minutes into that now. As before, 100% of 1 CPU in use, very little I/O, no thrashing, not out of memory.
Comment 17 John Nagle 2017-08-19 22:45:07 UTC
Update: saving as .ods took about 1.5 hours, but finished.
Opening the file thus saved took about 5 minutes.
Comment 18 Buovjaga 2019-06-29 17:06:23 UTC
(In reply to Bartosz from comment #12)
> Created attachment 128996 [details]
> Random value CSV file ~237MB (unpacked)

Still crashing when saving to xlsx. I have 32 GB of memory; it crashed at around 8 GB used.

Loading is no problem.

Arch Linux 64-bit
Version: 6.4.0.0.alpha0+
Build ID: c2cb467a1e5194c56bb65706b7965fb2c9241b8f
CPU threads: 8; OS: Linux 5.1; UI render: default; VCL: gtk3; 
Locale: fi-FI (fi_FI.UTF-8); UI-Language: en-US
Calc: threaded
Built on 29 June 2019
Comment 19 Kevin Suo 2021-01-24 13:31:16 UTC
It takes me no more than 20 seconds to save the csv file generated with Owen Genat's bash script (550MB) to ODS, and peak RAM usage is 2.6GB over the whole process (I am using an SSD drive). Could someone retest? It may have been fixed at some point.

Fedora 32 x64 with 8GB RAM, LibreOffice 7.1 branch.

I set this to NEEDINFO.
Comment 20 Buovjaga 2021-01-26 09:24:10 UTC
Indeed, a similar result to Kevin's when saving to XLSX.

Arch Linux 64-bit
Version: 7.0.4.2
Build ID: 00(Build:2)
CPU threads: 8; OS: Linux 5.9; UI render: default; VCL: kf5
Locale: fi-FI (fi_FI.UTF-8); UI: en-US
7.0.4-2
Calc: threaded