Bug 124511 - handle import of multiple gzip csv streams
Summary: handle import of multiple gzip csv streams
Status: NEW
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Calc
(earliest affected) release
Hardware: x86-64 (AMD64) All
Importance: medium enhancement
Assignee: Not Assigned
Depends on:
Blocks: CSV-Import
Reported: 2019-04-02 20:29 UTC by Jacek Pliszka
Modified: 2019-05-02 10:53 UTC

See Also:
Crash report or crash signature:

Simple example - 2 csv.gz (95 bytes, application/octet-stream)
2019-04-03 17:49 UTC, Jacek Pliszka
Simple example - 4 csv.gz (179 bytes, application/gzip)
2019-04-03 17:49 UTC, Jacek Pliszka
Simple example - 4 csv.gz - 2 empty (143 bytes, application/gzip)
2019-04-03 17:49 UTC, Jacek Pliszka

Description Jacek Pliszka 2019-04-02 20:29:09 UTC
Gzipped CSV files that pandas writes in several append steps do not open correctly in LibreOffice.
Python code:

import pandas as pd

a=pd.DataFrame({'a': [1,2], 'b':[3,4]})

b=pd.DataFrame({'a': [3,4], 'b':[5,6]})

a.to_csv('a.csv.gz', compression='gzip', header=True, mode='w')

b.to_csv('a.csv.gz', compression='gzip', header=False, mode='a')

The output file a.csv.gz gunzips correctly and opens correctly in less, but in LibreOffice only the header and the first 2 data rows appear when opening it:

libreoffice a.csv.gz
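
For reference, the same kind of file can be produced without pandas. This is a minimal sketch using only the Python standard library. It assumes, as the behaviour described above suggests, that appending with gzip compression adds a second, independent gzip member to the file. The temporary path and the literal CSV text are illustrative.

```python
import gzip
import os
import tempfile

# Illustrative location; any writable path works.
path = os.path.join(tempfile.mkdtemp(), "a.csv.gz")

# The first write creates one gzip member: header plus two data rows.
with gzip.open(path, "wt", newline="") as f:
    f.write(",a,b\n0,1,3\n1,2,4\n")

# Append mode starts a second gzip member after the first one.
with gzip.open(path, "at", newline="") as f:
    f.write("0,3,5\n1,4,6\n")

# Python's gzip module reads all members as one continuous stream.
with gzip.open(path, "rt", newline="") as f:
    text = f.read()

print(text)
```

Reading the file back this way yields the header and all four data rows, which is the result expected from the Calc import as well.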
Comment 1 Jacek Pliszka 2019-04-02 20:37:25 UTC
OK, it looks like gunzip and less can handle 2 concatenated gzip files and LibreOffice cannot. Another example:

$ gunzip -c z1.csv.gz 
$ gunzip -c z2.csv.gz 
$ cat z1.csv.gz z2.csv.gz  > z3.csv.gz
$ gunzip -c z3.csv.gz 

whereas LibreOffice, when it opens z3.csv.gz, shows only the contents of z1.
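
The symptom is consistent with a decoder that stops at the end of the first gzip member. The sketch below illustrates that behaviour with Python's zlib module. It is not a statement about how LibreOffice is implemented, and the sample payloads are made up for the demonstration.

```python
import gzip
import zlib

z1 = gzip.compress(b"a,b\n1,3\n2,4\n")
z2 = gzip.compress(b"3,5\n4,6\n")
z3 = z1 + z2  # equivalent to: cat z1.csv.gz z2.csv.gz > z3.csv.gz

# wbits=31 makes zlib expect a gzip header and trailer.
d = zlib.decompressobj(wbits=31)
first = d.decompress(z3)

# A single decompressor stops after the first member;
# the bytes of the second member are left untouched.
leftover = d.unused_data

# gzip.decompress, like gunzip, continues through every member.
everything = gzip.decompress(z3)
```

Here `first` holds only the z1 data and `leftover` is the whole of z2, while `everything` contains both.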
Comment 2 V Stuart Foote 2019-04-03 14:13:04 UTC
So the question is whether it works without the gzip/gunzip steps. Or is the issue in the concatenation?

Does a concatenated CSV get fully parsed into the Calc CSV input dialog? If not, is there an EOF or some other control character being inserted during the append?
Comment 3 Jacek Pliszka 2019-04-03 14:31:35 UTC
The issue occurs only when gzipped files are concatenated. Everything is OK when the files are not gzipped, or are gzipped after concatenation.

The only case that does not work is when the files are first gzipped and then concatenated.

pandas, gzip, less, etc. handle the case where a file consists of several gzip streams concatenated together, and it would be good if Calc could too.
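
For illustration, handling such a file amounts to restarting the decoder on whatever bytes remain after each member. This is a rough sketch in Python, not a proposal for the actual Calc code. The helper name `gunzip_members` is hypothetical.

```python
import gzip
import zlib
from typing import List


def gunzip_members(data: bytes) -> List[bytes]:
    """Decode every gzip member in data; return one payload per member."""
    payloads = []
    while data:
        d = zlib.decompressobj(wbits=31)  # 31 = gzip header and trailer
        payloads.append(d.decompress(data) + d.flush())
        if not d.eof:
            raise ValueError("truncated gzip member")
        data = d.unused_data  # bytes belonging to the following members
    return payloads


# Three members, the middle one empty.
stream = b"".join(
    gzip.compress(part) for part in (b"a,b\n1,3\n", b"", b"2,4\n")
)
payloads = gunzip_members(stream)
csv_bytes = b"".join(payloads)
```

Empty members simply contribute an empty payload, so joining the payloads gives the same bytes that `gunzip -c` would print.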
Comment 4 V Stuart Foote 2019-04-03 17:23:48 UTC
OK then, I guess we are doing the correct thing, and this would be an enhancement to handle a stream of multiple gzip'd documents. The onus would be on the user to ensure the format is correct: a header on the first document only, and just data in a matching layout in the subsequent ones.
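
A user could check the layout requirement on the decompressed text before importing. This is a minimal sketch, assuming comma-separated data with the header on the first line. The function name `rows_match_header` is hypothetical, and the check only compares field counts, so it cannot detect a repeated header row.

```python
import csv
import io


def rows_match_header(text: str) -> bool:
    """Return True if every non-empty row has as many fields as the header."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return True
    width = len(rows[0])
    # csv.reader yields [] for blank lines; those are skipped here.
    return all(len(row) == width for row in rows[1:] if row)


ok = rows_match_header("a,b\n1,3\n2,4\n3,5\n4,6\n")
bad = rows_match_header("a,b\n1,3\nx,y,z\n")
```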

Please post a couple of sample concatenated gzip .csv streams.

But this is kind of a specialized workflow, so I am not sure it belongs as a core feature of the Calc import dialog.

Comment 5 Jacek Pliszka 2019-04-03 17:49:02 UTC
Created attachment 150513
Simple example - 2 csv.gz
Comment 6 Jacek Pliszka 2019-04-03 17:49:29 UTC
Created attachment 150514
Simple example - 4 csv.gz
Comment 7 Jacek Pliszka 2019-04-03 17:49:57 UTC
Created attachment 150515
Simple example - 4 csv.gz - 2 empty
Comment 8 Xisco Faulí 2019-05-02 10:53:06 UTC
Makes sense. Moving to NEW...