We have a need to convert a .CSV file with more than 13,000 rows of data to and XLXS file.
The data is a mix of character and numerals with 9 columns - simple data
It is our product lines Master Stock List exported from our back office inventory and sales system - the data contains product ID, descriptions, pack sized, last sold dates etc.
What appears to happen at about 131K lines or so the data does not get unpacked into columns/rows and comes through as a "stack" into one cell containing thousands of lines of data.
However, if I convert the same file using MS Excel, and then open the XLS file as created by MS EXCEL, then LibreOffice has no issues using and formatting the data.
Now it may be that the source data has some special characters and MS Excel is a liottle more forgiving ... I am not sure.
What I find
Steps to Reproduce:
1.Select given .CSV to import/convert - character set "Western Europe windows-1252/winlatin1"
2.at data row 13137 conversion stops ... data not unpacked
Warning message: The data could not be loaded completely because the maximum number of characters per cell was exceeded.
Unpack all rows of data correctly
User Profile Reset: No
We will load the data file on the ticket about 5mb
I cannot find a corrupt data record, there may well be though
Created attachment 151174 [details]
Data file - CSV format to be imported
On pc Debian x86-64 with master sources updated today, I could reproduce this.
But when changing string delimiter to simple quote instead of double quote, everything was fine.
I noticed that some lines contained a field beginning with a double quote.
Applying this patch on your file and I could use by default double quote string delimiter:
< SHOP3699250,3699250,,"BOX OF 14 PCS 15 CM X 6M X 6 PLY " 1`S, 1`S,14246,Inactive,False,False
> SHOP3699250,3699250,,BOX OF 14 PCS 15 CM X 6M X 6 PLY 1`S, 1`S,14246,Inactive,False,False
< 6005347055939,5498,,"PUSH PINS" 1`S, 1`S,14205,Inactive,False,False
< 6005347055946,5499,,"THUMB TACKS" 1`S, 1`S,14205,Inactive,False,False
< 6005347055953,2240,,"PAPER CLIPS" 1`S, 1`S,14205,Inactive,False,False
< 6005347055960,5400,,"TINPLATE MAGNETS" 1`S, 1`S,14205,Inactive,False,False
< 6005347055977,5500,,"RUBBER BANDS" 1`S, 1`S,14205,Inactive,False,False
< 6005347055984,2241,,"3 IN 1 CLIP SET" 1`S, 1`S,14205,Inactive,False,False
> 6005347055939,5498,,PUSH PINS 1`S, 1`S,14205,Inactive,False,False
> 6005347055946,5499,,THUMB TACKS 1`S, 1`S,14205,Inactive,False,False
> 6005347055953,2240,,PAPER CLIPS 1`S, 1`S,14205,Inactive,False,False
> 6005347055960,5400,,TINPLATE MAGNETS 1`S, 1`S,14205,Inactive,False,False
> 6005347055977,5500,,RUBBER BANDS 1`S, 1`S,14205,Inactive,False,False
> 6005347055984,2241,,3 IN 1 CLIP SET 1`S, 1`S,14205,Inactive,False,False
I don't have Excel right now to check what it uses as string delimiter but I don't think we have a bug here.
(In reply to Julien Nabet from comment #2)
> don't think we have a bug here.
Not sure; RFC 4180 tells:
> 7. If double-quotes are used to enclose fields, then a double-quote
> appearing inside a field must be escaped by preceding it with
> another double quote.
So a single quote in a field starting with a quote shouldn't be treated as a normal quote in a quote-enclosed field; rather, if it's not followed by a field separator, the string should be treated as not-quore-enclosed; and the quotes should be treated as normal characters, part of the field. At least that's my understanding of the situation.
But here, expert opinion is required.
(In reply to Mike Kaganski from comment #3)
> But here, expert opinion is required.
Let's take the original line for example:
6005347055939,5498,,"PUSH PINS" 1`S, 1`S,14205,Inactive,False,False
The double quote before the "P" of "PUSH" isn't escaped by another double quote.
The only thing I'd change in LO is having the possibility to indicate empty string delimiter but I'm not an expert too.
(In reply to Julien Nabet from comment #4)
> The only thing I'd change in LO is having the possibility to indicate empty
> string delimiter but I'm not an expert too.
It's possible: just delete the separator - that's normal edit box :-)
(In reply to Mike Kaganski from comment #5)
> (In reply to Julien Nabet from comment #4)
> > The only thing I'd change in LO is having the possibility to indicate empty
> > string delimiter but I'm not an expert too.
> It's possible: just delete the separator - that's normal edit box :-)
Indeed! so nothing to change for me :-)
(In reply to Julien Nabet from comment #4)
> Let's take the original line for example:
> 6005347055939,5498,,"PUSH PINS" 1`S, 1`S,14205,Inactive,False,False
Let me describe my idea about what happens now, and what should happen instead.
IIUC, when LO sees string separator (double quote) in the beginning of a field (start of line, or right after field separator), it enters "quote-enclosed field" mode. It expects the field to end with another double quote *exactly in the end-of-field position* (i.e., followed by a newline or field separator). And it continues consuming everything, until it finds such a specially-positioned double quote. For the sample above, it means that the closing double quote after PINS will be consumed without terminating the "quote-enclosed field" mode (because it's followed by space, not by end-of-field); and it will not find a suitable double quote until EOF. But the non-escaped double quote in the middle of a field is not valid for proper quote-enclosed field!
But let's modify the sample a little:
> 6005347055939,5498,,PUSH "PINS" 1`S, 1`S,14205,Inactive,False,False
The first double quote now isn't in the beginning of the field. And LibreOffice treats it as a normal field character, properly finding the end of field, producing the field 'PUSH "PINS" 1`S'.
Now recall that CSV has been used without any formal description for decades before RFC 4180 came; and the latter standard is a best effort to organize "best practices", but it cannot simply undo all the pre-existing history - so what is true for initially-formally-defined standards ("be strict when generating; be forgiving when consuming") is tenfold true for CSV.
So I suppose that what should have been done here is:
1. Seeing the opening double quote in the beginning of the field, start "quote-enclosed field" mode.
2. If it encounters something *invalid* for such a mode, it should re-read the field again, this time without the "quote-enclosed field" mode (to properly re-consume possible field separators that could had been read in the first pass as the quoted field content).
This way, this sample would be read properly, without introducing any ambiguity.
Possible code pointers:
(In reply to Mike Kaganski from comment #8)
> Possible code pointers:
@Mike, Could we turn this issue into an interesting easyhack?
Moving to NEW meanwhile...
(In reply to Xisco Faulí from comment #9)
> @Mike, Could we turn this issue into an interesting easyhack?
I agree, I think we can. I already provided my idea of what and how should be changed, and the code pointers. I only hoped for a confirmation from erAck if it is a sane idea.
Any progress on this?
Just got a csv file as well that's too large for calc to handle which is very annoying
Turns out even Excel 2010 can do this in half a second without errors