Bug 125110 - CalcSpreadsheet: issues converting .CSV where there are more than 30K rows of data
Summary: CalcSpreadsheet: issues converting .CSV where there are more than 30K rows of...
Status: NEW
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Calc (show other bugs)
Version:
(earliest affected)
6.1.5.2 release
Hardware: x86-64 (AMD64) Windows (All)
: medium minor
Assignee: Not Assigned
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: CSV-Import
  Show dependency treegraph
 
Reported: 2019-05-04 09:56 UTC by OK Foods Barrydale
Modified: 2020-04-09 08:29 UTC (History)
4 users (show)

See Also:
Crash report or crash signature:


Attachments
Data file - CSV format to be imported (5.02 MB, text/plain)
2019-05-04 09:58 UTC, OK Foods Barrydale
Details

Note You need to log in before you can comment on or make changes to this bug.
Description OK Foods Barrydale 2019-05-04 09:56:13 UTC
Description:
We have a need to convert a .CSV file with more than 13,000 rows of data to and XLXS file.
The data is a mix of character and numerals with 9 columns - simple data
It is our product lines Master Stock List exported from our back office inventory and sales system - the data contains product ID, descriptions, pack sized, last sold dates etc.

What appears to happen at about 131K lines or so the data does not get unpacked into columns/rows and comes through as a "stack" into one cell containing thousands of lines of data.

However, if I convert the same file using MS Excel, and then open the XLS file as created by MS EXCEL, then LibreOffice has no issues using and formatting the data.

Now it may be that the source data has some special characters and MS Excel is a liottle more forgiving ... I am not sure.

What I find

Steps to Reproduce:
1.Select given .CSV to import/convert - character set "Western Europe windows-1252/winlatin1"
2.at data row 13137 conversion stops ... data not unpacked
3.

Actual Results:
Warning message: The data could not be loaded completely because the maximum number of characters per cell was exceeded.

Expected Results:
Unpack all rows of data correctly


Reproducible: Always


User Profile Reset: No



Additional Info:
We will load the data file on the ticket about 5mb
I cannot find a corrupt data record, there may well be though
Comment 1 OK Foods Barrydale 2019-05-04 09:58:34 UTC
Created attachment 151174 [details]
Data file - CSV format to be imported
Comment 2 Julien Nabet 2019-05-04 11:43:16 UTC
On pc Debian x86-64 with master sources updated today, I could reproduce this.
But when changing string delimiter to simple quote instead of double quote, everything was fine.

I noticed that some lines contained a field beginning with a double quote.
Applying this patch on your file and I could use by default double quote string delimiter:
3144c3144
< SHOP3699250,3699250,,"BOX OF 14 PCS 15 CM X 6M X 6 PLY " 1`S, 1`S,14246,Inactive,False,False
---
> SHOP3699250,3699250,,BOX OF 14 PCS 15 CM X 6M X 6 PLY  1`S, 1`S,14246,Inactive,False,False
41227,41232c41227,41232
< 6005347055939,5498,,"PUSH PINS" 1`S, 1`S,14205,Inactive,False,False
< 6005347055946,5499,,"THUMB TACKS" 1`S, 1`S,14205,Inactive,False,False
< 6005347055953,2240,,"PAPER CLIPS" 1`S, 1`S,14205,Inactive,False,False
< 6005347055960,5400,,"TINPLATE MAGNETS" 1`S, 1`S,14205,Inactive,False,False
< 6005347055977,5500,,"RUBBER BANDS" 1`S, 1`S,14205,Inactive,False,False
< 6005347055984,2241,,"3 IN 1 CLIP SET" 1`S, 1`S,14205,Inactive,False,False
---
> 6005347055939,5498,,PUSH PINS 1`S, 1`S,14205,Inactive,False,False
> 6005347055946,5499,,THUMB TACKS 1`S, 1`S,14205,Inactive,False,False
> 6005347055953,2240,,PAPER CLIPS 1`S, 1`S,14205,Inactive,False,False
> 6005347055960,5400,,TINPLATE MAGNETS 1`S, 1`S,14205,Inactive,False,False
> 6005347055977,5500,,RUBBER BANDS 1`S, 1`S,14205,Inactive,False,False
> 6005347055984,2241,,3 IN 1 CLIP SET 1`S, 1`S,14205,Inactive,False,False

I don't have Excel right now to check what it uses as string delimiter but I don't think we have a bug here.
Comment 3 Mike Kaganski 2019-05-04 13:49:46 UTC
(In reply to Julien Nabet from comment #2)
> don't think we have a bug here.

Not sure; RFC 4180 tells:

> 7.  If double-quotes are used to enclose fields, then a double-quote
>     appearing inside a field must be escaped by preceding it with
>     another double quote.

So a single quote in a field starting with a quote shouldn't be treated as a normal quote in a quote-enclosed field; rather, if it's not followed by a field separator, the string should be treated as not-quore-enclosed; and the quotes should be treated as normal characters, part of the field. At least that's my understanding of the situation.

But here, expert opinion is required.
Comment 4 Julien Nabet 2019-05-04 14:01:33 UTC
(In reply to Mike Kaganski from comment #3)
> ...
> But here, expert opinion is required.

Let's take the original line for example:
6005347055939,5498,,"PUSH PINS" 1`S, 1`S,14205,Inactive,False,False

The double quote before the "P" of "PUSH" isn't escaped by another double quote.

The only thing I'd change in LO is having the possibility to indicate empty string delimiter but I'm not an expert too.
Comment 5 Mike Kaganski 2019-05-04 14:36:43 UTC
(In reply to Julien Nabet from comment #4)
> The only thing I'd change in LO is having the possibility to indicate empty
> string delimiter but I'm not an expert too.

It's possible: just delete the separator - that's normal edit box :-)
Comment 6 Julien Nabet 2019-05-04 15:57:51 UTC
(In reply to Mike Kaganski from comment #5)
> (In reply to Julien Nabet from comment #4)
> > The only thing I'd change in LO is having the possibility to indicate empty
> > string delimiter but I'm not an expert too.
> 
> It's possible: just delete the separator - that's normal edit box :-)

Indeed! so nothing to change for me :-)
Comment 7 Mike Kaganski 2019-05-05 18:42:41 UTC
(In reply to Julien Nabet from comment #4)
> Let's take the original line for example:
> 6005347055939,5498,,"PUSH PINS" 1`S, 1`S,14205,Inactive,False,False

Let me describe my idea about what happens now, and what should happen instead.

IIUC, when LO sees string separator (double quote) in the beginning of a field (start of line, or right after field separator), it enters "quote-enclosed field" mode. It expects the field to end with another double quote *exactly in the end-of-field position* (i.e., followed by a newline or field separator). And it continues consuming everything, until it finds such a specially-positioned double quote. For the sample above, it means that the closing double quote after PINS will be consumed without terminating the "quote-enclosed field" mode (because it's followed by space, not by end-of-field); and it will not find a suitable double quote until EOF. But the non-escaped double quote in the middle of a field is not valid for proper quote-enclosed field!

But let's modify the sample a little:

> 6005347055939,5498,,PUSH "PINS" 1`S, 1`S,14205,Inactive,False,False

The first double quote now isn't in the beginning of the field. And LibreOffice treats it as a normal field character, properly finding the end of field, producing the field 'PUSH "PINS" 1`S'.

Now recall that CSV has been used without any formal description for decades before RFC 4180 came; and the latter standard is a best effort to organize "best practices", but it cannot simply undo all the pre-existing history - so what is true for initially-formally-defined standards ("be strict when generating; be forgiving when consuming") is tenfold true for CSV.

So I suppose that what should have been done here is:
1. Seeing the opening double quote in the beginning of the field, start "quote-enclosed field" mode.
2. If it encounters something *invalid* for such a mode, it should re-read the field again, this time without the "quote-enclosed field" mode (to properly re-consume possible field separators that could had been read in the first pass as the quoted field content).

This way, this sample would be read properly, without introducing any ambiguity.
Comment 8 Mike Kaganski 2019-05-06 08:06:18 UTC
Possible code pointers:
ScImportExport::ExtText2Doc
ScImportExport::ScanNextFieldFromString
ReadCsvLine
Comment 9 Xisco Faulí 2019-06-10 15:37:34 UTC
(In reply to Mike Kaganski from comment #8)
> Possible code pointers:
> ScImportExport::ExtText2Doc
> ScImportExport::ScanNextFieldFromString
> ReadCsvLine

@Mike, Could we turn this issue into an interesting easyhack?
Moving to NEW meanwhile...
Comment 10 Mike Kaganski 2019-06-11 03:13:44 UTC
(In reply to Xisco Faulí from comment #9)
> @Mike, Could we turn this issue into an interesting easyhack?

I agree, I think we can. I already provided my idea of what and how should be changed, and the code pointers. I only hoped for a confirmation from erAck if it is a sane idea.
Comment 11 Thibault Molleman 2020-04-09 08:23:42 UTC Comment hidden (me-too)
Comment 12 Thibault Molleman 2020-04-09 08:29:19 UTC Comment hidden (no-value)