Bug 48621 - FILEOPEN: CSV: better handling of broken CSV files with unescaped embedded quote delimiters
Status: RESOLVED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Calc
Version (earliest affected): unspecified
Hardware: All All
Importance: medium major
Assignee: Eike Rathke
QA Contact:
URL:
Whiteboard: target:3.6.0
Keywords:
Depends on:
Blocks: 39868
 
Reported: 2012-04-12 12:58 UTC by Eike Rathke
Modified: 2013-01-28 20:50 UTC (History)
0 users

See Also:
Crash report or crash signature:


Attachments
conglomerate of testcases attached to the mentioned OOo issues (3.88 KB, text/csv)
2012-04-14 10:02 UTC, Eike Rathke
the testcase file exported again, fixing representation (5.71 KB, text/csv)
2012-04-14 10:03 UTC, Eike Rathke

Description Eike Rathke 2012-04-12 12:58:43 UTC
CSV files that do not strictly follow the CSV specification, which requires embedded quotes inside a quoted field to be doubled, easily trick the import into not distributing the following content the way the generator intended. Implement some magic to detect and correct at least some of those cases to prevent data loss.

Related:
https://issues.apache.org/ooo/show_bug.cgi?id=78926
and attachments.

Another test case mentioned there, originally from
https://issues.apache.org/ooo/show_bug.cgi?id=80385
with attachments:

,"abc" d "ef",
currently results in
'abc d "ef"'
To not lose data it should result in
'abc" d "ef'

Doing so would also lead to
,"a"b, "a",
resulting in _one_ field
'a"b, "a'
and not two, 'ab' and ' "a"', as is currently the case. This would then differ from how Excel treats it, but would be more consistent.
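
The two examples above are consistent with one simple rule: inside a quoted field, a quote closes the field only if it is immediately followed by the field separator or the end of the line; a doubled quote is an escaped quote, and any other quote is kept as literal data. The following is a minimal Python sketch of that rule, not the code that was committed to LibreOffice. The function name split_lenient_csv_line and its parameters are hypothetical, and it assumes a single-character separator and no line breaks inside fields.

```python
def split_lenient_csv_line(line, sep=",", quote='"'):
    """Split one CSV line, tolerating unescaped quotes inside quoted fields.

    Hypothetical helper for illustration only; assumes single-character
    separator and quote, and no embedded line breaks.
    """
    fields = []
    i, n = 0, len(line)
    while True:
        buf = []
        if i < n and line[i] == quote:
            i += 1  # skip the opening quote
            while i < n:
                ch = line[i]
                if ch == quote:
                    nxt = line[i + 1] if i + 1 < n else None
                    if nxt == quote:
                        # properly doubled quote: one literal quote
                        buf.append(quote)
                        i += 2
                        continue
                    if nxt is None or nxt == sep:
                        # quote followed by separator or end of line closes the field
                        i += 1
                        break
                    # unescaped embedded quote: keep it instead of dropping data
                    buf.append(quote)
                    i += 1
                    continue
                buf.append(ch)
                i += 1
        else:
            # unquoted field: read up to the next separator
            while i < n and line[i] != sep:
                buf.append(line[i])
                i += 1
        fields.append("".join(buf))
        if i < n and line[i] == sep:
            i += 1  # consume the separator and read the next field
        else:
            break
    return fields


print(split_lenient_csv_line(',"abc" d "ef",'))
print(split_lenient_csv_line(',"a"b, "a",'))
```

With this rule the second example yields a single field between the empty leading and trailing fields, as proposed above. An unescaped quote that happens to be directly followed by a separator remains ambiguous and would still end the field early.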
Comment 1 Eike Rathke 2012-04-14 10:02:43 UTC
Created attachment 59980
conglomerate of testcases attached to the mentioned OOo issues
Comment 2 Eike Rathke 2012-04-14 10:03:54 UTC
Created attachment 59981
the testcase file exported again, fixing representation
Comment 3 Not Assigned 2012-04-14 10:04:26 UTC
Eike Rathke committed a patch related to this issue.
It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=7928b651965f747b02593d2a9fc73fac7c86dbf5

resolved fdo#48621 better handling of broken CSV files
Comment 4 Christopher Schultz 2013-01-28 14:44:00 UTC
I was the author of some of the attachments to the OOo bug shown here: https://issues.apache.org/ooo/show_bug.cgi?id=78926

I was able to confirm that the 3 test cases I presented were fixed, but the original bug report included a sample input that I still could not open. I'm not sure if the error message shown below is expected for that input.

Thanks for addressing the cases I found, though!

Confirmed fixed in LibreOffice 3.6.4.2 on Mac OS X:
- att #75189 @ OOo BZ
- att #75191 @ OOo BZ
- att #75192 @ OOo BZ

I still get an error loading the original "input" (att #46282 @ OOo BZ): "The data could not be loaded completely because the maximum number of characters per cell was exceeded".
Comment 5 Eike Rathke 2013-01-28 20:50:07 UTC
(In reply to comment #4)
> I still get an error loading the original "input" (att #46282 @ OOo BZ):
> "The data could not be loaded completely because the maximum number of
> characters per cell was exceeded".

I don't get that error, tested in 3.6.4 and 4.0.0.rc2+ and master. All versions load 2371 rows without complaining, which matches the number of lines in the input file.