Bug 136013 - FILEOPEN Importing tsv/csv with no string delimiter causes whitespace only trailing column to corrupt
Status: CLOSED DUPLICATE of bug 142395
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Calc
Version: 4.0.0.3 release (earliest affected)
Hardware: All All
Importance: low normal
Assignee: Not Assigned
URL:
Whiteboard:
Keywords: bibisected, regression
Depends on:
Blocks: CSV-Import
Reported: 2020-08-22 12:33 UTC by Andrew Crowe
Modified: 2021-08-30 04:24 UTC (History)
CC List: 1 user

See Also:
Crash report or crash signature:
Regression By:


Attachments
CSV file that triggers issue (975 bytes, text/plain)
2020-08-22 12:36 UTC, Andrew Crowe
Screenshot of initial import dialog display (69.79 KB, image/png)
2020-08-22 12:38 UTC, Andrew Crowe
Screenshot of import dialog after changing settings (74.25 KB, image/png)
2020-08-22 12:38 UTC, Andrew Crowe
Screenshot after file opens in calc (173.93 KB, image/png)
2020-08-22 12:39 UTC, Andrew Crowe

Description Andrew Crowe 2020-08-22 12:33:21 UTC
Description:
When importing a TSV or CSV file without string delimiters, a final column that consists only of whitespace is filled with corrupt data.

If the final column is empty, the row loads correctly. The file also loads correctly if the string delimiter is set to anything, even a character that does not appear in the document.

One interesting behavior: the CSV import dialog initially shows no corruption in the preview, but the corruption appears as soon as you change any option.

Reproducible on versions 5.4, 6.4 and 7.0.

Steps to Reproduce:
1. Have a CSV/TSV file without string delimiters and with a trailing column consisting only of whitespace
2. Turn off string delimiters in import dialog box
3. Click OK

Actual Results:
Right hand column contains corrupt data

Expected Results:
Right hand column blank
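
The expected behaviour can be sketched outside Calc. The following is a minimal illustration using Python's standard csv module, not LibreOffice's import code; the sample rows are made up and are not the contents of the attached file. With quoting disabled, a whitespace-only trailing field should come back as that same whitespace (blank once trimmed), never as other data.

```python
import csv

def parse_unquoted(lines, delimiter="\t"):
    """Split rows with no string delimiter: quote characters are ordinary text."""
    reader = csv.reader(lines, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    return [row for row in reader]

# Hypothetical sample: the last column of the first row holds only spaces,
# the last column of the second row is empty.
sample = ["a\tb\t   ", "c\td\t"]
rows = parse_unquoted(sample)

for row in rows:
    # A whitespace-only trailing field is blank after trimming.
    status = "blank" if row[-1].strip() == "" else "NOT blank"
    print(row[:-1], repr(row[-1]), status)
```

Both rows report a blank last column here, which is the result the import is expected to produce.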


Reproducible: Always


User Profile Reset: Yes



Additional Info:
Version: 7.0.0.3 (x64)
Build ID: 8061b3e9204bef6b321a21033174034a5e2ea88e
CPU threads: 24; OS: Windows 10.0 Build 19041; UI render: Skia/Vulkan; VCL: win
Locale: en-GB (en_GB); UI: en-GB
Calc: CL
Comment 1 Andrew Crowe 2020-08-22 12:36:47 UTC
Created attachment 164559 [details]
CSV file that triggers issue
Comment 2 Andrew Crowe 2020-08-22 12:38:20 UTC
Created attachment 164560 [details]
Screenshot of initial import dialog display
Comment 3 Andrew Crowe 2020-08-22 12:38:55 UTC
Created attachment 164561 [details]
Screenshot of import dialog after changing settings
Comment 4 Andrew Crowe 2020-08-22 12:39:27 UTC
Created attachment 164562 [details]
Screenshot after file opens in calc
Comment 5 Justin L 2020-12-15 11:43:29 UTC
Confirmed. The key is to erase the double-quote in the string-delimiter box.

Seems to have worked in LO 3.6.
Bibisected with bibisect-linux-43all to get the range https://cgit.freedesktop.org/libreoffice/core/log/?qt=range&q=a1ac2538e9b287444500618ab4d2f0f06c25cf34..19f4ebd8a54da0ae03b9cc8481613e5cd20ee1e7

Nothing obvious in this range, but there are various suspicious commits involving ICU and libexttextcat.

Bad _bibisect 43all commit_ a67b874d60de1f1a44bef57a53a7b8a84db0ba58.
Comment 6 xpusostomos 2021-03-15 10:07:07 UTC
I think it's worth adding this comment here rather than opening a new bug...

If you choose tab as the field delimiter and double quote ( " ) as the string quote character, then the following line makes it choke:

f1\tf2\t"f3",xxx\tf4

What happens is that everything after f3, right to the very end of the file (no matter how many lines and fields that includes), gets dumped into one cell. One might argue that the above is badly formatted (should quotes end right at the field end?), but this is not the right way to handle it.
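
For comparison, here is how one lenient parser treats that line. This is a sketch using Python's standard csv module in its default non-strict mode, offered only as an example of a more contained failure mode, not as a description of what Calc does or should do. The second row is a made-up follow-on line, added to show that parsing recovers.

```python
import csv

line = 'f1\tf2\t"f3",xxx\tf4'
following = 'g1\tg2\tg3\tg4'  # hypothetical next row, not from the report

# Non-strict mode: text after the closing quote is appended to the same
# field, and parsing resumes at the next tab.
rows = list(csv.reader([line, following], delimiter="\t", quotechar='"'))

for row in rows:
    print(row)
```

The stray ",xxx" ends up inside the third field and the following row still parses into four separate fields, so the damage stays local to one cell.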

Another thing, it wasn't obvious to me in the gui that the string delimited dropdown list was editable. I think a dropdown list here is pointless and distracting. Everyone uses either double quote or nothing. I would argue that as soon as you select tab delimited, this field should default to blank, because as far as I can tell, the whole internet is agreed that TSV files don't have a string quote character.
Comment 7 Eike Rathke 2021-08-29 20:42:52 UTC
(In reply to xpusostomos from comment #6)
> Another thing, it wasn't obvious to me in the gui that the string delimited
> dropdown list was editable. I think a dropdown list here is pointless and
> distracting. Everyone uses either double quote or nothing.
You certainly know everyone and every usage and can be sure no one, absolutely no one, uses anything else.

> I would argue
> that as soon as you select tab delimited, this field should default to
> blank, because as far as I can tell, the whole internet is agreed that TSV
> files don't have a string quote character.
Oh yes? Is it? Could you point out such agreement? So you'd argue that embedded tabs and embedded line feeds are not possible at all in a TSV file?
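
To make the point about embedded tabs and line feeds concrete, here is a small sketch with Python's standard csv module (an illustration only, unrelated to Calc's implementation; the field values are made up). With a quote character, a field containing a tab and a line feed round-trips; without one, and with no escape character, the writer cannot represent it.

```python
import csv
import io

row = ["id", "text with\ttab and\nline feed"]

# With a quote character the field is enclosed and survives a round trip.
buf = io.StringIO(newline="")
csv.writer(buf, delimiter="\t", quotechar='"').writerow(row)
reader = csv.reader(io.StringIO(buf.getvalue(), newline=""),
                    delimiter="\t", quotechar='"')
round_trip = next(reader)

# Without quoting (and without an escape character) the writer refuses.
try:
    out = io.StringIO(newline="")
    csv.writer(out, delimiter="\t", quoting=csv.QUOTE_NONE).writerow(row)
    unquoted_ok = True
except csv.Error:
    unquoted_ok = False

print(round_trip == row, unquoted_ok)
```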
Comment 8 Eike Rathke 2021-08-29 20:57:12 UTC
Reproduced with 7.1.4
Appears to be fixed since 7.1.5, most likely with bug 142395.

*** This bug has been marked as a duplicate of bug 142395 ***
Comment 9 Mike Kaganski 2021-08-30 04:24:10 UTC
(In reply to Eike Rathke from comment #7)

I enjoyed comment 6 very much, made me recall playing with MySQL's "SELECT INTO OUTFILE" [1], where it puts even null bytes (and any other bytes that may appear in BLOBS), with configurable FIELDS ENCLOSED BY, LINES TERMINATED BY, and even absolutely inconsistent FIELDS ESCAPED BY, that needed a home-grown parser [2], because they obviously didn't know what xpusostomos knew ;)

[1] https://dev.mysql.com/doc/refman/8.0/en/select-into.html
[2] https://mikekaganski.wordpress.com/2021/02/18/reading-from-mysql-data-with-blobs-dumped-to-csv/