Bug 152337 - Show a warning infobar when imported text file used several of selected field separators
Status: NEW
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Calc
Version (earliest affected): unspecified
Hardware: All All
Importance: medium enhancement
Assignee: Not Assigned
Blocks: CSV
Reported: 2022-12-01 09:25 UTC by Mike Kaganski
Modified: 2022-12-08 10:20 UTC

Description Mike Kaganski 2022-12-01 09:25:05 UTC
When opening a text (CSV, TSV, ...) file in Calc, the import dialog allows selecting several field separators, and three of those (tab/comma/semicolon) are selected by default. This provides a sort of "autodetection" for simple cases. It also allows importing less structured data that actually uses different separators simultaneously.

In the latter case, seeing several of the selected separators in the file is expected and normal. However, there are more common cases where the data is actual CSV/TSV or the like and in fact uses only one separator, but some unquoted textual data in it also includes other characters that happen to be among the selected separators. In such cases, a user relying on the automagic might not notice that their large body of data was imported wrong, with some fields split on these false separators.

E.g., a CSV (only commas actually used for field separation) could have a semicolon in a field:

a,b,c
content of field a with semicolon ; - but still one field!,field b,field c

Such a CSV, when imported with the default settings in the import dialog (so tabs, commas, and semicolons are checked), would split the second row into *four* cells, which is not what the user would expect:

content of field a with semicolon 
 - but still one field!
field b
field c
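
The split above can be reproduced with a minimal Python sketch. It is not Calc's import code: `split_fields` is a hypothetical helper that splits on any of the given single-character separators and, like the unquoted example, ignores quoting.

```python
def split_fields(line, separators):
    """Split a line on any of the given one-character separators.

    Hypothetical helper for illustration only; it does not handle quoting.
    """
    fields, current = [], []
    for ch in line:
        if ch in separators:
            # A separator ends the current field.
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    return fields


row = "content of field a with semicolon ; - but still one field!,field b,field c"

# Default dialog selection: tab, comma, semicolon.
default_cells = split_fields(row, {"\t", ",", ";"})

# Only the separator the file actually uses.
comma_cells = split_fields(row, {","})

print(len(default_cells), len(comma_cells))
```

With the three default separators the row yields four cells; with only the comma selected it yields the intended three.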

If that happens somewhere in the middle of 100 000 rows of data, it may easily be overlooked. The user could work on the data, edit it, save, and not notice that some data got corrupted. After that, there is no easy way to find and undo the corruption.

The idea is to show an infobar in such a case, informing the user that Calc saw several of the selected field separators in the imported file, and hinting that *possibly* the result should be inspected and the file re-imported with only the actual separator selected. It would be a false detection in the "less structured data" case discussed in the first paragraph above, but likely a minor annoyance with relatively low impact, compared to the current potential for unnoticed data corruption in the more common scenario.
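
A rough sketch of the proposed check, in Python and under stated assumptions: `separators_seen` and `should_warn` are hypothetical names, quoting is handled by simply toggling on a double-quote character, and none of this reflects how Calc's importer is actually structured. The point is only that the condition is "more than one of the selected separators acted as a field separator".

```python
def separators_seen(text, selected, quote='"'):
    """Return the selected separators that occur outside quoted fields.

    Illustrative only: quote handling is a simple toggle on the quote
    character, which also copes with doubled quotes ("") inside a field.
    """
    seen = set()
    in_quotes = False
    for ch in text:
        if ch == quote:
            in_quotes = not in_quotes
        elif not in_quotes and ch in selected:
            seen.add(ch)
    return seen


def should_warn(text, selected):
    """True if more than one selected separator was actually used."""
    return len(separators_seen(text, selected)) > 1


data = (
    "a,b,c\n"
    "content of field a with semicolon ; - but still one field!,field b,field c\n"
)
defaults = {"\t", ",", ";"}

print(should_warn(data, defaults))  # several selected separators were met
print(should_warn(data, {","}))     # only the real separator is selected
```

In this sketch a semicolon inside a quoted field does not count, since it would not split the field in the first place.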
Comment 1 Mike Kaganski 2022-12-01 09:35:46 UTC
The same infobar could also be shown when pasting text into an existing spreadsheet document, if this text import dialog was used, in the same scenario.
Comment 2 Heiko Tietze 2022-12-08 09:22:16 UTC
I suspect the infobar would be annoying to many users. And if you load a lot of data in any tool you have to do sanity checks anyway.

And I wonder what threshold you have in mind. Something like more than 0.01% "common separator" in the document.

Why not do the opposite with the argument that only a few data points are separated by semicolon and you haven't checked it?

Last but not least what impact would such a check have on the performance?
Comment 3 Mike Kaganski 2022-12-08 10:03:33 UTC
(In reply to Heiko Tietze from comment #2)
> I suspect the infobar would be annoying to many users. And if you load a lot
> of data in any tool you have to do sanity checks anyway.

If you import a *CSV* (or TSV, or any of the *normal* text files of this kind, generated by the vast majority of software), which uses a *single* separator inside, then no matter how many separators you selected in the dialog, only one of them should actually be used by Calc. If Calc happened to meet two of the selected separators, it means that the import *went wrong*, and it is a *destructive import error*. This is the whole essence of the issue. If "many users" would see it, it means that many users import their data and break it, and *do not notice it*!

> And I wonder what threshold you have in mind. Something like more than 0.01%
> "common separator" in the document.

No threshold. A single extra separator means the error occurred. *Especially* when it's a single occurrence, it's most likely that the user will not notice it somewhere in the middle of a huge data set.

> Why not do the opposite with the argument that only a few data points are
> separated by semicolon and you haven't checked it?

I totally do not understand you.

> Last but not least what impact would such a check have on the performance?

None.
Comment 4 Heiko Tietze 2022-12-08 10:20:00 UTC
No impact on performance, simple warning at every single occurrence... no need for input from UX in this case.