Bug 160355 - Detect separator for CSV files
Summary: Detect separator for CSV files
Status: RESOLVED DUPLICATE of bug 152336
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: LibreOffice
Version (earliest affected): unspecified
Hardware: All All
Importance: medium normal
Assignee: Not Assigned
Reported: 2024-03-25 14:36 UTC by Gabriel Masei
Modified: 2024-03-25 15:21 UTC



Description Gabriel Masei 2024-03-25 14:36:43 UTC
A CSV (comma-separated values) file is, in theory, a plain-text file that stores tabular data, using commas to separate values and newlines to separate records. There is a specification (RFC 4180) for the CSV format. In practice, however, many files do not adhere to it and use other conventions instead. The part of the format that differs most often is the separator: instead of a comma, files use a semicolon, pipe, tab, space, ...

Not knowing the structure of a CSV file makes it difficult to import or convert it in LibreOffice.

1. When a CSV file is imported, an Import dialog is shown where the user can provide the right filter options for the format of the data. A default set of values for these options is provided when the dialog loads. This is a reasonable way of handling the issue.

2. For conversions (performed without UI), LibreOffice provides the "infilter" parameter, which is equivalent to the Import dialog from the case above. If the parameter is missing, some default values are used.

3. Although the cases above are handled reasonably, there is a third case that needs better handling: automatic conversions where the format of the input file is not fixed and can change from one file to another. In this case, either the set of options provided through the "infilter" parameter is used, or the default one. Either way, a single fixed set of options produces wrong conversions whenever the format differs from one file to the next. A better approach is needed.
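
To illustrate cases 2 and 3, here is a minimal Python sketch of how a batch job might build the conversion command. It rests on several assumptions: the helper name build_convert_command is hypothetical, the token layout (field separator, text delimiter, character set, first line) follows my reading of the LibreOffice CSV filter options, and the character set id 76 is assumed to mean UTF-8. The point is only that the separator is fixed per invocation, so it has to be known before the file is converted.

```python
def build_convert_command(input_path, separator=",", quote='"',
                          charset_id=76, target="ods"):
    """Build an argument list for a headless conversion (hypothetical helper).

    Assumed token order: field separator, text delimiter, character set,
    number of the first line to import. Separator and delimiter are passed
    as decimal character codes.
    """
    options = f"{ord(separator)},{ord(quote)},{charset_id},1"
    return [
        "soffice",
        "--headless",
        f"--infilter=CSV:{options}",
        "--convert-to", target,
        input_path,
    ]


# The same options are applied to every file in the batch, which is
# exactly where a mixed set of inputs goes wrong.
command = build_convert_command("report.csv", separator=";")
print(" ".join(command))
```

The command is only built here, not executed. If the next file in the batch uses a tab or a pipe, the caller has to know that in advance and pass a different separator.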

Taking the above into account, I think some kind of "detection/guess" mechanism could be implemented so that a greater number of formats is covered automatically. I am talking especially about the separator.

I have already submitted a patch for this: https://gerrit.libreoffice.org/c/core/+/164936. It first detects the character set and then the separator, based on the detected character set. It also leaves a small margin for files that are not well formatted. The detection applies to conversions and also to the Import dialog, where it serves as an initial suggestion.
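
To make the idea concrete, here is a rough Python sketch of the general approach. It is not the code from the patch: the function names, the candidate list, the 50-line sample, and the 20% tolerance are all assumptions chosen for illustration, and the character set step is reduced to a BOM check with a UTF-8 attempt and a Latin-1 fallback. The separator is guessed by counting each candidate outside quoted fields and keeping the one whose per-line count is the same on most lines.

```python
from collections import Counter

# Candidate separators, an assumed list based on the ones named above.
CANDIDATES = [",", ";", "\t", "|", " "]


def decode_bytes(raw):
    """Very rough character set guess: UTF-8 BOM, then UTF-8, then Latin-1."""
    if raw.startswith(b"\xef\xbb\xbf"):
        return raw[3:].decode("utf-8"), "utf-8"
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        return raw.decode("latin-1"), "latin-1"


def count_outside_quotes(line, sep, quote='"'):
    """Count occurrences of sep that are not inside a quoted field."""
    count = 0
    in_quotes = False
    for ch in line:
        if ch == quote:
            in_quotes = not in_quotes
        elif ch == sep and not in_quotes:
            count += 1
    return count


def guess_separator(text, max_lines=50, tolerance=0.2):
    """Return the candidate with the most consistent per-line count, or None.

    Up to `tolerance` of the sampled lines may disagree with the most
    common count, which leaves some room for badly formatted lines.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()][:max_lines]
    if not lines:
        return None
    best, best_score = None, (0.0, 0)
    for sep in CANDIDATES:
        counts = [count_outside_quotes(ln, sep) for ln in lines]
        nonzero = [c for c in counts if c > 0]
        if not nonzero:
            continue
        mode, freq = Counter(nonzero).most_common(1)[0]
        agreement = freq / len(lines)
        if agreement < 1 - tolerance:
            continue
        # Prefer higher agreement, then more fields per line.
        score = (agreement, mode)
        if score > best_score:
            best, best_score = sep, score
    return best


text, charset = decode_bytes(b'name;price\n"Smith, J.";1,50\n"Doe, A.";2,75\n')
print(charset, repr(guess_separator(text)))
```

In this sample the comma appears inside quoted names and as a decimal mark, but only the semicolon shows up the same number of times on every line, so it is the one selected. A guess like this can feed both an unattended conversion and the initial values of the Import dialog.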
Comment 1 V Stuart Foote 2024-03-25 15:21:16 UTC

*** This bug has been marked as a duplicate of bug 152336 ***