125110 – CalcSpreadsheet: issues converting broken .CSV of a bad generator

Bug 125110 - CalcSpreadsheet: issues converting broken .CSV of a bad generator

Summary: CalcSpreadsheet: issues converting broken .CSV of a bad generator

Status:	RESOLVED FIXED

Alias:	None

Product:	LibreOffice
Classification:	Unclassified
Component:	Calc (show other bugs)
Version: (earliest affected)	6.1.5.2 release
Hardware:	All All

Importance:	medium minor
Assignee:	Eike Rathke

URL:
Whiteboard:	target:7.5.0 target:7.4.3
Keywords:

Depends on:
Blocks:	CSV-Import
	Show dependency tree / graph

Reported:	2019-05-04 09:56 UTC by OK Foods Barrydale
Modified:	2022-10-02 18:45 UTC (History)
CC List:	4 users (show)

See Also:	48621 151211
Crash report or crash signature:

Attachments
Data file - CSV format to be imported (5.02 MB, text/plain) 2019-05-04 09:58 UTC, OK Foods Barrydale	Details
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description OK Foods Barrydale 2019-05-04 09:56:13 UTC

Description:
We have a need to convert a .CSV file with more than 13,000 rows of data to and XLXS file.
The data is a mix of character and numerals with 9 columns - simple data
It is our product lines Master Stock List exported from our back office inventory and sales system - the data contains product ID, descriptions, pack sized, last sold dates etc.

What appears to happen at about 131K lines or so the data does not get unpacked into columns/rows and comes through as a "stack" into one cell containing thousands of lines of data.

However, if I convert the same file using MS Excel, and then open the XLS file as created by MS EXCEL, then LibreOffice has no issues using and formatting the data.

Now it may be that the source data has some special characters and MS Excel is a liottle more forgiving ... I am not sure.

What I find

Steps to Reproduce:
1.Select given .CSV to import/convert - character set "Western Europe windows-1252/winlatin1"
2.at data row 13137 conversion stops ... data not unpacked
3.

Actual Results:
Warning message: The data could not be loaded completely because the maximum number of characters per cell was exceeded.

Expected Results:
Unpack all rows of data correctly


Reproducible: Always


User Profile Reset: No



Additional Info:
We will load the data file on the ticket about 5mb
I cannot find a corrupt data record, there may well be though

Comment 1 OK Foods Barrydale 2019-05-04 09:58:34 UTC

Created attachment 151174 [details]
Data file - CSV format to be imported

Comment 2 Julien Nabet 2019-05-04 11:43:16 UTC

On pc Debian x86-64 with master sources updated today, I could reproduce this.
But when changing string delimiter to simple quote instead of double quote, everything was fine.

I noticed that some lines contained a field beginning with a double quote.
Applying this patch on your file and I could use by default double quote string delimiter:
3144c3144
< SHOP3699250,3699250,,"BOX OF 14 PCS 15 CM X 6M X 6 PLY " 1`S, 1`S,14246,Inactive,False,False
---
> SHOP3699250,3699250,,BOX OF 14 PCS 15 CM X 6M X 6 PLY  1`S, 1`S,14246,Inactive,False,False
41227,41232c41227,41232
< 6005347055939,5498,,"PUSH PINS" 1`S, 1`S,14205,Inactive,False,False
< 6005347055946,5499,,"THUMB TACKS" 1`S, 1`S,14205,Inactive,False,False
< 6005347055953,2240,,"PAPER CLIPS" 1`S, 1`S,14205,Inactive,False,False
< 6005347055960,5400,,"TINPLATE MAGNETS" 1`S, 1`S,14205,Inactive,False,False
< 6005347055977,5500,,"RUBBER BANDS" 1`S, 1`S,14205,Inactive,False,False
< 6005347055984,2241,,"3 IN 1 CLIP SET" 1`S, 1`S,14205,Inactive,False,False
---
> 6005347055939,5498,,PUSH PINS 1`S, 1`S,14205,Inactive,False,False
> 6005347055946,5499,,THUMB TACKS 1`S, 1`S,14205,Inactive,False,False
> 6005347055953,2240,,PAPER CLIPS 1`S, 1`S,14205,Inactive,False,False
> 6005347055960,5400,,TINPLATE MAGNETS 1`S, 1`S,14205,Inactive,False,False
> 6005347055977,5500,,RUBBER BANDS 1`S, 1`S,14205,Inactive,False,False
> 6005347055984,2241,,3 IN 1 CLIP SET 1`S, 1`S,14205,Inactive,False,False

I don't have Excel right now to check what it uses as string delimiter but I don't think we have a bug here.

Comment 3 Mike Kaganski 2019-05-04 13:49:46 UTC

(In reply to Julien Nabet from comment #2)
> don't think we have a bug here.

Not sure; RFC 4180 tells:

> 7.  If double-quotes are used to enclose fields, then a double-quote
>     appearing inside a field must be escaped by preceding it with
>     another double quote.

So a single quote in a field starting with a quote shouldn't be treated as a normal quote in a quote-enclosed field; rather, if it's not followed by a field separator, the string should be treated as not-quore-enclosed; and the quotes should be treated as normal characters, part of the field. At least that's my understanding of the situation.

But here, expert opinion is required.

Comment 4 Julien Nabet 2019-05-04 14:01:33 UTC

(In reply to Mike Kaganski from comment #3)
> ...
> But here, expert opinion is required.

Let's take the original line for example:
6005347055939,5498,,"PUSH PINS" 1`S, 1`S,14205,Inactive,False,False

The double quote before the "P" of "PUSH" isn't escaped by another double quote.

The only thing I'd change in LO is having the possibility to indicate empty string delimiter but I'm not an expert too.

Comment 5 Mike Kaganski 2019-05-04 14:36:43 UTC

(In reply to Julien Nabet from comment #4)
> The only thing I'd change in LO is having the possibility to indicate empty
> string delimiter but I'm not an expert too.

It's possible: just delete the separator - that's normal edit box :-)

Comment 6 Julien Nabet 2019-05-04 15:57:51 UTC

(In reply to Mike Kaganski from comment #5)
> (In reply to Julien Nabet from comment #4)
> > The only thing I'd change in LO is having the possibility to indicate empty
> > string delimiter but I'm not an expert too.
> 
> It's possible: just delete the separator - that's normal edit box :-)

Indeed! so nothing to change for me :-)

Comment 7 Mike Kaganski 2019-05-05 18:42:41 UTC

(In reply to Julien Nabet from comment #4)
> Let's take the original line for example:
> 6005347055939,5498,,"PUSH PINS" 1`S, 1`S,14205,Inactive,False,False

Let me describe my idea about what happens now, and what should happen instead.

IIUC, when LO sees string separator (double quote) in the beginning of a field (start of line, or right after field separator), it enters "quote-enclosed field" mode. It expects the field to end with another double quote *exactly in the end-of-field position* (i.e., followed by a newline or field separator). And it continues consuming everything, until it finds such a specially-positioned double quote. For the sample above, it means that the closing double quote after PINS will be consumed without terminating the "quote-enclosed field" mode (because it's followed by space, not by end-of-field); and it will not find a suitable double quote until EOF. But the non-escaped double quote in the middle of a field is not valid for proper quote-enclosed field!

But let's modify the sample a little:

> 6005347055939,5498,,PUSH "PINS" 1`S, 1`S,14205,Inactive,False,False

The first double quote now isn't in the beginning of the field. And LibreOffice treats it as a normal field character, properly finding the end of field, producing the field 'PUSH "PINS" 1`S'.

Now recall that CSV has been used without any formal description for decades before RFC 4180 came; and the latter standard is a best effort to organize "best practices", but it cannot simply undo all the pre-existing history - so what is true for initially-formally-defined standards ("be strict when generating; be forgiving when consuming") is tenfold true for CSV.

So I suppose that what should have been done here is:
1. Seeing the opening double quote in the beginning of the field, start "quote-enclosed field" mode.
2. If it encounters something *invalid* for such a mode, it should re-read the field again, this time without the "quote-enclosed field" mode (to properly re-consume possible field separators that could had been read in the first pass as the quoted field content).

This way, this sample would be read properly, without introducing any ambiguity.

Comment 8 Mike Kaganski 2019-05-06 08:06:18 UTC

Possible code pointers:
ScImportExport::ExtText2Doc
ScImportExport::ScanNextFieldFromString
ReadCsvLine

Comment 9 Xisco Faulí 2019-06-10 15:37:34 UTC

(In reply to Mike Kaganski from comment #8)
> Possible code pointers:
> ScImportExport::ExtText2Doc
> ScImportExport::ScanNextFieldFromString
> ReadCsvLine

@Mike, Could we turn this issue into an interesting easyhack?
Moving to NEW meanwhile...

Comment 10 Mike Kaganski 2019-06-11 03:13:44 UTC

(In reply to Xisco Faulí from comment #9)
> @Mike, Could we turn this issue into an interesting easyhack?

I agree, I think we can. I already provided my idea of what and how should be changed, and the code pointers. I only hoped for a confirmation from erAck if it is a sane idea.

Comment 11 Thibault Molleman 2020-04-09 08:23:42 UTC Comment hidden (me-too)

Any progress on this?
Just got a csv file as well that's too large for calc to handle which is very annoying

Comment 12 Thibault Molleman 2020-04-09 08:29:19 UTC Comment hidden (no-value)

Turns out even Excel 2010 can do this in half a second without errors

Comment 13 Eike Rathke 2021-08-16 17:18:48 UTC

(In reply to Mike Kaganski from comment #7)
> So I suppose that what should have been done here is:
> 1. Seeing the opening double quote in the beginning of the field, start
> "quote-enclosed field" mode.
> 2. If it encounters something *invalid* for such a mode, it should re-read
> the field again, this time without the "quote-enclosed field" mode (to
> properly re-consume possible field separators that could had been read in
> the first pass as the quoted field content).
> 
> This way, this sample would be read properly, without introducing any
> ambiguity.
It would fail in other constellations that now are handled well, like

"abc "def" ghi, jkl"

where
|abc "def" ghi, jkl|
is supposed to be *one* field content because the generator didn't escape quotes by doubling them. Your approach would result in
|"abc "def" ghi| jkl"|

Whatever we'll do, it will make things fail differently for other data of broken generators. You could throw more logic at it like thinking in "words" to be ignored re-triggering quotes have to have a space left (opening quote) or right (closing quote), which would fail for data that simply doesn't follow that assumption. Things get even worse if space was a field separator.

Take a look at what is done with the field start mode and quote state to fix known broken data cases and bug 48621 for test case sample files and related.

I tend to close this for a too broken generator, but if you can come up with some loose magic that doesn't break any of the already handled cases, then fine..

Comment 14 Mike Kaganski 2021-08-16 17:26:42 UTC

(In reply to Eike Rathke from comment #13)

Thanks - and you are absolutely correct! I also found some cases in the meanwhile, and I see how my proposal was just a wishful thinking :-)

Comment 15 Eike Rathke 2022-10-01 22:00:36 UTC

The solution for this will be to assume that if the generator is broken enough to start a field quoted followed by containing an unescaped embedded quote and there is no closing quote (i.e. immediately before a field delimiter) until the line end then the generator will not be clever enough to write embedded linefeeds either and the field starting quote wasn't one but to be taken literally as all other quotes until the now unquoted field end.

Comment 16 Commit Notification 2022-10-02 15:08:16 UTC

Eike Rathke committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/cf777cfcb22647b1f2d6ace307fbcc4f6d2cca30

Resolves: tdf#125110 tdf#151211 Disentangle the convoluted CSV/TSV-clip import

It will be available in 7.5.0.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.

Comment 17 Eike Rathke 2022-10-02 15:08:59 UTC

Pending review https://gerrit.libreoffice.org/c/core/+/140874 for 7-4

Comment 18 Commit Notification 2022-10-02 18:45:18 UTC

Eike Rathke committed a patch related to this issue.
It has been pushed to "libreoffice-7-4":

https://git.libreoffice.org/core/commit/9af7a8d60596fc59f366a0c3e94489ff8fc106aa

Resolves: tdf#125110 tdf#151211 Disentangle the convoluted CSV/TSV-clip import

It will be available in 7.4.3.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.