Bug 60468 - FILEOPEN: CSV cannot handle field using embedded tab and line breaks altogether
Summary: FILEOPEN: CSV cannot handle field using embedded tab and line breaks altogether
Status: VERIFIED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Calc (show other bugs)
Version:
(earliest affected)
3.6.3.2 release
Hardware: x86-64 (AMD64) Linux (All)
: highest normal
Assignee: Eike Rathke
URL:
Whiteboard: target:4.1.0 target:3.6.7 target:4.0.4
Keywords:
Depends on:
Blocks:
 
Reported: 2013-02-08 05:20 UTC by x.gylee+libreoffice
Modified: 2015-11-22 17:38 UTC (History)
3 users (show)

See Also:
Crash report or crash signature:


Attachments
Test file for embedded tab and line break altogether (75 bytes, text/plain)
2013-02-08 05:20 UTC, x.gylee+libreoffice
Details
Test file with comma as separator (75 bytes, text/csv)
2013-02-08 05:21 UTC, x.gylee+libreoffice
Details
Screenshot result and options. (133.17 KB, image/png)
2013-02-09 00:21 UTC, m.a.riosv
Details
LibreOffice Output CSV saved with TAB (83 bytes, text/csv)
2013-05-13 19:14 UTC, Matthias Wieser
Details

Note You need to log in before you can comment on or make changes to this bug.
Description x.gylee+libreoffice 2013-02-08 05:20:40 UTC
Created attachment 74403 [details]
Test file for embedded tab and line break altogether

Using the tab character as the separator (field delimiter), Calc will not recognize correctly a field that contains a line break after an embedded tab.

Example:
"value1.1"<tab>"value2.1<LB><tab>value2.2<LB>value2.3"<tab>"value3.1"

The above will yield:
value1.1 => correct
value2.1<LB><tab>value2.2 => correct
value2.3" => INCORRECT (see it also includes the closing quote)
value3.1 => correct

However, if I change the separator to comma, it all works out correctly.
So, I guess this bug is pertinent to using "tab" as separator.
Comment 1 x.gylee+libreoffice 2013-02-08 05:21:38 UTC
Created attachment 74404 [details]
Test file with comma as separator
Comment 2 m.a.riosv 2013-02-09 00:21:53 UTC
Created attachment 74458 [details]
Screenshot result and options.

Hi x.gylee, thanks for reporting.

Attached screenshoot of what I get and the options import with 3.6.5.

I think is right.
Comment 3 x.gylee+libreoffice 2013-02-12 08:51:54 UTC
I don't think that is right.

value2.3 should go ogether in cell B2 since the line break is embedded inside the text delimiter for value2.x.

value3 should go into cell C2 (under head 3).
Comment 4 Brenda Granados 2013-02-12 17:57:55 UTC
Hi x.gylee. I get the same results as you from those files. I made modifications, and deleted the line breaks inside the second set, changing the file to

"value1.1"<tab>"value2.1<tab>value2.2<tab>value2.3"<tab>"value3.1"

This gives the result you expect. So for some reason, it doesn't like line breaks inside the delimeter. I don't use calc a lot, so I would need to research if this is expected behavior. Unless someone else can confirm.
Comment 5 x.gylee+libreoffice 2013-02-13 03:30:46 UTC
Hi Brenda, thanks for responding.

The point of this bug is the embedded line breaks. It should NOT be treated as a separator if it is inside a field delimiter. If you try the 2nd attachemnt, you will see it's working fine if the separator is a comma. But, somehow it doesn't if the separator is a tab.

There is RFC 4180 if it would help.
Comment 6 Brenda Granados 2013-02-13 15:58:44 UTC
Thank you, x.gylee, for the document. According to RFC 4180,
"Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes." Calc should not treat line breaks as delimeters inside double quotes. I confirmed this behavior using LibreOffice 4.0 on Ubuntu 12.04.
Comment 7 Brenda Granados 2013-02-13 16:18:21 UTC
I am going to change the status to NEW, and mark as NORMAL, HIGH. The reason for this priority is that there is no loss of data, or broken core functions, such as printing or saving the document. Calc also doesn't have a problem opening a document including embedded line breaks inside the tab. The data is there, though not with the expected appearance. This does prevent high quality work, and most users would expect the line break inside the delimeter to be ignored per the documentation in RFC 4180.

I hope you agree. Thank you for reporting this!
Comment 8 Matthias Wieser 2013-05-13 19:14:51 UTC
Created attachment 79267 [details]
LibreOffice Output CSV saved with TAB

This shows, how libreoffice saves this case.
It is confused by having data mixed.
 * Some data is between "
 * Some data is not between "
If all data is enclosed within " it works (see attachment)
Comment 9 Matthias Wieser 2013-05-13 20:11:48 UTC
I guess the file sc/source/ui/docshell/impex.cxx has an error.

Ehm, somebody has programmed this error on intention.
Line 2330:
if (nQuotes)
                {
                    if (bTabSep && *p == '\t' && (nQuotes % 2) != 0)
                    {
                        // When tab-delimited, tab char ends quoted sequence
                        // even if we haven't reached the end quote.  Doing
                        // this helps keep mal-formed rows from damaging
                        // other, well-formed rows.
                        nQuotes = 0;
                        break;
                    }

Well, that heuristic will not work with a correct file.

The author was Caolán McNamara <caolanm@redhat.com> of 17fe34ec569f3e14f35f3958cc5885a00bd6cff9.

The correct way to have such a processor in place: parse file, if not correct(User decission!!!) enable workaround.

This error needs more steps:
  * Change in the UI (maybe "strict mode").
  * Change of the code there (if (bTabSep && *p == '\t' && (nQuotes % 2) != 0 && (!strictCSVImportMode))

That is my idea to solve this error.
Comment 10 x.gylee+libreoffice 2013-05-13 22:59:19 UTC
Matthias, thanks for pointing out the culprit.

Because of this error, my boss has been yelling at me to fix my CSV file since he thought it was not being properly constructed. If the developers could implement the fix, it would be a great help.
Comment 11 Caolán McNamara 2013-05-14 08:37:45 UTC
caolanm->Matthias: You misinterpreted 
In http://cgit.freedesktop.org/libreoffice/core/commit/?id=17fe34ec569f3e14f35f3958cc5885a00bd6cff9 a bit, that commit by me just *moves* that code from "tools" to "sc". The lines you highlight were added by Kohei as http://cgit.freedesktop.org/libreoffice/core/commit/?id=322cbc3818b0553254aab2dfb3c5b196fe814097

Whatever the final solution, this looks like an excellent low hanging place for a regression test.
Comment 12 Eike Rathke 2013-05-14 11:38:40 UTC
Taking. I believe that workaround isn't necessary anymore with my rework of the CSV import last year and actually just breaks things.
Comment 13 Eike Rathke 2013-05-14 11:58:56 UTC
Also without that workaround the import of the one failing document https://bugzilla.novell.com/attachment.cgi?id=294589 from https://bugzilla.novell.com/show_bug.cgi?id=507322 works fine, with one difference in row 1805 where data is broken and parts of two rows end up in one cell P1805. That row wasn't correct with the workaround either, see also cell P1805 in earlier versions.
Comment 14 Commit Notification 2013-05-14 12:21:27 UTC
Eike Rathke committed a patch related to this issue.
It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=a883783c2a1b0e7a8ee5b46387d1e1e3bc5e19bd

resolved fdo#60468 no special tab case workaround for CSV import



The patch should be included in the daily builds available at
http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
http://wiki.documentfoundation.org/Testing_Daily_Builds
Affected users are encouraged to test the fix and report feedback.
Comment 15 Eike Rathke 2013-05-14 12:22:56 UTC
Pending review
for 4-0 as https://gerrit.libreoffice.org/3905
for 3-6 as https://gerrit.libreoffice.org/3906
Comment 16 Commit Notification 2013-05-15 08:55:31 UTC
Eike Rathke committed a patch related to this issue.
It has been pushed to "libreoffice-3-6":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=b2b6753df528d62c7f0a2be902c822a029512523&h=libreoffice-3-6

resolved fdo#60468 no special tab case workaround for CSV import


It will be available in LibreOffice 3.6.7.

The patch should be included in the daily builds available at
http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
http://wiki.documentfoundation.org/Testing_Daily_Builds
Affected users are encouraged to test the fix and report feedback.
Comment 17 Commit Notification 2013-05-15 08:56:09 UTC
Eike Rathke committed a patch related to this issue.
It has been pushed to "libreoffice-4-0":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=09ff02c499579c1dd9eed500121875e586ea8716&h=libreoffice-4-0

resolved fdo#60468 no special tab case workaround for CSV import


It will be available in LibreOffice 4.0.4.

The patch should be included in the daily builds available at
http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
http://wiki.documentfoundation.org/Testing_Daily_Builds
Affected users are encouraged to test the fix and report feedback.
Comment 18 Ferdinand Noelscher 2013-06-08 11:36:17 UTC
As of 08.Jun.2013 I cannot reproduce this bug anymore.
Both sample files yield the same results