Bug 152336 - Improve CSV field separator detection/behavior when several separators are selected on import
Summary: Improve CSV field separator detection/behavior when several separators are se...
Status: NEW
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Calc (show other bugs)
Version:
(earliest affected)
unspecified
Hardware: All All
: medium enhancement
Assignee: Not Assigned
URL:
Whiteboard: target:25.2.0 target:24.8.1
Keywords:
: 70111 127938 152332 160355 160742 (view as bug list)
Depends on:
Blocks: CSV-Dialog 160582
  Show dependency treegraph
 
Reported: 2022-12-01 08:42 UTC by Mike Kaganski
Modified: 2024-09-09 10:01 UTC (History)
12 users (show)

See Also:
Crash report or crash signature:


Attachments
Sample file csv (192 bytes, application/vnd.ms-excel)
2024-04-07 20:43 UTC, m_a_riosv
Details
CSV file with separators: no comma, single quote, ° character (61.98 KB, text/csv)
2024-04-10 08:52 UTC, Laurent Balland
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Mike Kaganski 2022-12-01 08:42:25 UTC
When opening a text (CSV, TSV, ...) file in Calc, the import dialog allows to select several field separators, and three of those (tab/comma/semicolon) are selected by default. This allows to have some sort of "autodetection" for simple cases. It also allows importing less structured data, actually having different separators simultaneously.

However, this has a drawback: this multiselection makes the program use "random" one of those upon save. So, when one opens a TSV, without unchecking the other options in the import dialog, and saving using File->Save, the result would be not TSV, but CSV.

This proposal is to implement two improvements over the current behavior:

1. When importing, Calc should keep an internal record, which of the selected separators were actually encountered. Upon the import end, it should filter out the unused separators from the filter settings stored in the document's media descriptor. For most important cases, this would already allow to detect the actually used separator, and avoid further user interactions.

2. When after #1, several separators were marked used (usually that would mean some complex data that had the characters selected as possible separators, so the import was actually wrong, and to have it right, the user needs to re-import with wrong separators unselected), then saving this file should ask user about the options, just as it does when user saves a new document to CSV (i.e., when there's no CSV filter settings in the media descriptor yet). Indeed, the dialog should gather the information from the existing media descriptor; and it may use some heuristics which field separator to suggest to user (e.g., it may count encountered separators on import, and choose the most used one in the export dialog) - but the forced dialog would not only prevent silent unexpected results, but also give a hint about some possible import problems. And in case of unstructured original data, having to manually select the separator out of several actually used is a reasonable requirement.

(The import problems in #2 could also show an infobar, so user would be notified immediately on opening, without the need to wait when one edits the data, then decides to save, and sees the dialog, at which point they may have no clue that the dialog means that the import was wrong, and also that they need to re-do all the work they done... but that is a topic for a separate request.)
Comment 1 Mike Kaganski 2022-12-01 08:43:16 UTC
*** Bug 152332 has been marked as a duplicate of this bug. ***
Comment 2 V Stuart Foote 2022-12-01 09:14:16 UTC
+1
Comment 3 Mike Kaganski 2022-12-01 09:50:22 UTC
*** Bug 70111 has been marked as a duplicate of this bug. ***
Comment 4 Mike Kaganski 2022-12-01 09:59:33 UTC
*** Bug 127938 has been marked as a duplicate of this bug. ***
Comment 5 BogdanB 2023-05-27 20:04:05 UTC
Changed to New based on +1 from Stuart, and duplicates.
Comment 6 Gabriel Masei 2024-03-25 14:44:41 UTC
Please take a look here: https://bugs.documentfoundation.org/show_bug.cgi?id=160355 . It implements some heuristics for separator suggestion.
Comment 7 V Stuart Foote 2024-03-25 15:21:16 UTC
*** Bug 160355 has been marked as a duplicate of this bug. ***
Comment 8 V Stuart Foote 2024-03-25 15:23:15 UTC
(In reply to Gabriel Masei from comment #6)
OP from bug 160355 dupe'd here:

<snip>
A CSV (comma-separated values) file is (in theory) a file that stores tabular data in plain text using commas to separate values and newlines to separate records. There is a standard (RFC 4180) for CSV files format. However, there is a lack of adherence to this standard with multiple formats used instead. The most common part of the format that differs is the separator. Instead of comma, multiple other separators are used: semicolon, pipe, tab, space, ...

Not knowing the structure of a CSV file makes difficult importing/converting it in LibreOffice.

1. In case of importing a CSV file an Import dialog is shown to the user where it can provide the right filter options for the format of the data. A default set of values for these options is provided when the dialog loads. This is a reasonable way of handling the issue.

2. Also, in case of conversions (performed without UI) Libreoffice provides the "infilter" parameter which is equivalent to the Import dialog from the above case. If the parameter is missing then some default values are used.

3. Although the above cases are handled reasonably, there is a third case which needs a better handling: automatic conversions where the format of the input file is not fixed, it can change from one file to another. In this case either a provided set of options through the "infilter" parameter is used or the default one. However, this will generate wrong conversions if the format differs from one file to another. A better approach is needed.

Taking into account the above considerations I consider that some kind of "detection/guess" mechanism can be implement so that it will cover automatically a greater number of formats. And I'm talking especially about the separator.

I already provided a patch for this here: https://gerrit.libreoffice.org/c/core/+/164936 . It first detects the character set and then the separator based on the detected character set. Also, it allows a small room for not well-formatted files. This detection applies to conversions as well as to the Import dialog as an initial suggestion.
</snip>


Hi Gabriel, thanks for the commit. You should probably tag your commit with the BZ issue, tdf#152336
Comment 9 Eyal Rozenberg 2024-03-25 19:37:36 UTC
I would assume that if I save a file in a certain format, the specifics would be independent of the input filtering, and only respect the format's spec (and whatever defaults I have chosen the app to abide by if the spec is flexible). Thus I would expect a TSV to be saved with tabs and a CSV with commas even if, say, the input was parsed so as to separate fields by spaces.

Now, I do remember there's a dialog before saving happens; but at that dialog, I would still expect LO to default to how the format is defined, e.g. commas for CSV, and ignore what the input had.

Also, I'm not sure I see how "Detect separator for CSV files" is a dupe of this bug.
Comment 10 Mike Kaganski 2024-03-26 04:11:36 UTC
(In reply to Eyal Rozenberg from comment #9)
> I would assume that if I save a file in a certain format, the specifics
> would be independent of the input filtering, and only respect the format's
> spec (and whatever defaults I have chosen the app to abide by if the spec is
> flexible). Thus I would expect a TSV to be saved with tabs and a CSV with
> commas even if, say, the input was parsed so as to separate fields by spaces.

This may be a preference of some; but I claim, that at least for CSV, this would be a minority's preference; and the majority-to-minority ratio here will be HUGE. Most people who use Calc to edit CSVs will need Calc to retain the original file's structure; especially taking into account, that all "specs" in CSV are application-specific (and there is a *informational memo* RFC, reflecting some suggestions, but not binding). "Application-specific" here relates to the workflow (generation/consuming applications), not to the editing application in the middle. And a workflow of "normalization" of heterogenously-generated CSVs will be a minority of use cases.
Comment 11 Commit Notification 2024-04-06 17:39:52 UTC
Gabriel Masei committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/2feda8ba21acdcf33a9b4ba94742f574c17839bd

tdf#152336 Detect charset and separators for csv files

It will be available in 24.8.0.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 12 m_a_riosv 2024-04-07 20:43:25 UTC
Created attachment 193559 [details]
Sample file csv

With
Version: 24.8.0.0.alpha0+ (X86_64) / LibreOffice Community
Build ID: 5ebdcc60701ca5f2d267f92fa69ac06202309c52
CPU threads: 16; OS: Windows 10.0 Build 22631; UI render: Skia/Raster; VCL: win
Locale: es-ES (es_ES); UI: en-US
Calc: CL threaded

The issue is that now, locale setting is not preserved in the import window. As was the case until 24.2.
Seems it is reset to the default locale in my case es-ES.
Comment 13 Mike Kaganski 2024-04-08 05:29:18 UTC
(In reply to m_a_riosv from comment #12)

This is a very important issue. Please file it as a regression bug.
Gabriel: please consider making this an opt-in (an option in filter option; a "detect me" kind of a button in the dialog...) What do you think?
Comment 14 ady 2024-04-08 15:14:42 UTC
(In reply to Mike Kaganski from comment #13)
> (In reply to m_a_riosv from comment #12)
> 
> This is a very important issue. Please file it as a regression bug.

That is now tdf#160582.
Comment 15 Gabriel Masei 2024-04-09 10:14:59 UTC
(In reply to Mike Kaganski from comment #13)
> (In reply to m_a_riosv from comment #12)
> 
> This is a very important issue. Please file it as a regression bug.
> Gabriel: please consider making this an opt-in (an option in filter option;
> a "detect me" kind of a button in the dialog...) What do you think?

I provided a quick fix only for locale issue here: https://gerrit.libreoffice.org/c/core/+/165905

I'll explore the possibility to preserve the compatibility with saved settings. This needs a bit of time to find a good solution to mitigate the conflict between the two features.
Comment 16 Gabriel Masei 2024-04-09 11:15:45 UTC
(In reply to Gabriel Masei from comment #15)
> (In reply to Mike Kaganski from comment #13)
> > (In reply to m_a_riosv from comment #12)
> > 
> > This is a very important issue. Please file it as a regression bug.
> > Gabriel: please consider making this an opt-in (an option in filter option;
> > a "detect me" kind of a button in the dialog...) What do you think?
> 
> I provided a quick fix only for locale issue here:
> https://gerrit.libreoffice.org/c/core/+/165905
> 
> I'll explore the possibility to preserve the compatibility with saved
> settings. This needs a bit of time to find a good solution to mitigate the
> conflict between the two features.

In fact it should work for all options except "Character set" and "Separator Options"
Comment 17 Laurent Balland 2024-04-10 08:42:26 UTC
There is a drawback with this enhancement: separators options are no more saved. Before this enhancement, when opening a CSV with some weird separators, you will retrieve your last choice of separators. Then, if you open several files with the same pattern, you will have the right default preselection of separators after the first opening.
With this enhancement, the separators options are always reset to default.
I'm going to add an example.
Comment 18 Laurent Balland 2024-04-10 08:52:32 UTC
Created attachment 193599 [details]
CSV file with separators: no comma, single quote, ° character

Before the change, when opening this CSV file for 1st time, change options:
- Character set=Western Europe (ISO-8859-15/EURO)
- Locale=French (France)
- Separated by: uncheck comma, check Other and insert
'°
in Other field
- check Detect special numbers
- select 2nd column and change Column type to Date (DMY)

For next times, change only these options:
- Character set=Western Europe (ISO-8859-15/EURO)
That's all :-)

With the change, you need to change all options like the 1st time :-(
Comment 19 Laurent Balland 2024-04-10 09:02:53 UTC
(In reply to Laurent Balland from comment #18)
> For next times, change only these options:
> - Character set=Western Europe (ISO-8859-15/EURO)
> That's all :-)
Ooops! It was missing
> - select 2nd column and change Column type to Date (DMY)

But Other options, like "Detect special numbers" or "Evaluate formulas", where preselected or not according to the last use of the dialog
Comment 20 Mike Kaganski 2024-04-20 09:21:18 UTC
*** Bug 160742 has been marked as a duplicate of this bug. ***
Comment 21 Eyal Rozenberg 2024-09-07 15:11:55 UTC
(In reply to Mike Kaganski from comment #10)
> Most people who use Calc to edit CSVs will need Calc to retain
> the original file's structure

Hmm. So, that's the specific use case of opening a file (rather than pasting / creating a new document), editing it, and saving the existing file. For that workflow, I agree that it makes more sense to maintain the file's existing format. I was thinking about the situation of creating a new file. For an existing file, the user needs to be offered UI which makes it clear which selection maintains the existing format and which doesn't.
Comment 22 Mike Kaganski 2024-09-07 15:23:17 UTC
(In reply to Eyal Rozenberg from comment #21)
> I was thinking about the situation of creating a new file.

But the new file creation case is not relevant for this "when several separators are selected on import" bug 152336?

> For an existing file, the user needs to be offered UI which makes it clear
> which selection maintains the existing format and which doesn't.

This phrase is unclear. What "selections" is meant, and what "existing format" is - specifically, in case of CSV with the import filter configured to accept *several* separators?
Comment 23 Commit Notification 2024-09-07 16:28:07 UTC
Xisco Fauli committed a patch related to this issue.
It has been pushed to "libreoffice-24-8":

https://git.libreoffice.org/core/commit/68da28d3c68a6c5800bb2163b4514e7fce83c71b

tdf#162822: Revert "tdf#152336 Detect charset... (24.8 only)

It will be available in 24.8.2.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 24 Eyal Rozenberg 2024-09-07 20:48:54 UTC
(In reply to Mike Kaganski from comment #22)
> This phrase is unclear. What "selections" is meant

I meant, in the dialog about the formatting choices for my chosen file format / for "Export Text File".

> and what "existing
> format" is - specifically, in case of CSV with the import filter configured
> to accept *several* separators?

That's a good point. Actually, maybe I need to partially take back my earlier comment #21. I mean, whether it's a CSV or a TSV, it's still supposed to be a one-separator format. Does it really make sense to try to recreate a weird non-format in the file you're editing?

You said:
> Most people who use Calc to edit CSVs will need Calc to retain the 
> original file's structure

With multiple separators, I don't think that's possible even with the best of intentions... I mean, suppose my input separators were:

comma tab comma tab comma tab

and in my editing, I delete the second and fourth columns. What separators would we now use: 

comma comma comma tab

or maybe

comma tab comma tab

? How can we even know?

I actually doubt that people would want us to recreate a file "structure" involving multiple separators.

> especially taking into account, that all "specs" in CSV are application-
> specific 

They're still specs with a single separator. And even if there are differences - what are they? string quoting?
Comment 25 Mike Kaganski 2024-09-07 22:13:08 UTC
(In reply to Eyal Rozenberg from comment #24)

At this point, please re=read comment 0.
Comment 26 Commit Notification 2024-09-09 10:01:35 UTC
Xisco Fauli committed a patch related to this issue.
It has been pushed to "libreoffice-24-8-1":

https://git.libreoffice.org/core/commit/724253c916db0517733181529e3f652aaf8f6890

tdf#162822: Revert "tdf#152336 Detect charset... (24.8 only)

It will be available in 24.8.1.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.