Bug 157799 - Import of csv is unreliable (locale dependent on thousands separator)
Summary: Import of csv is unreliable (locale dependent on thousands separator)
Status: RESOLVED NOTABUG
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Calc
Version (earliest affected): 7.5.5.2 release
Hardware: All All
Importance: medium normal
Assignee: Not Assigned
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-10-16 16:02 UTC by Jens-D Doll
Modified: 2023-10-18 13:53 UTC (History)
1 user

See Also:
Crash report or crash signature:


Attachments
csv file from a sql table (3.00 KB, text/csv)
2023-10-16 16:02 UTC, Jens-D Doll
a view of the buggy import (242.17 KB, image/jpeg)
2023-10-17 22:14 UTC, Jens-D Doll

Description Jens-D Doll 2023-10-16 16:02:11 UTC
Created attachment 190242 [details]
csv file from a sql table

I imported a CSV generated from an SQL table. It looked rather good, but several entries had the wrong format after import. Subnet triples separated by two dots in column 2 looked wrong in lines 10 and 12-13, but looked good in the other lines: the dots in between had disappeared.

Why is it so?
Comment 1 Mike Kaganski 2023-10-16 16:27:22 UTC
Because your locale uses the dot as the thousands separator, and you didn't mark the IP column as text. The values that matched the pattern of three-digit groups separated by thousands separators were recognized as numbers, while those that didn't match that pattern were imported as text.

If a column contains data that is not numeric, it should be explicitly marked as such in the import dialog.
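The recognition behavior described above can be sketched in a few lines (Python here, purely illustrative; the regular expression is a simplification of Calc's locale-dependent number recognition, not its actual implementation):

```python
import re

# Simplified stand-in for number recognition in a locale where "."
# is the thousands separator: 1-3 digits, then groups of exactly 3.
THOUSANDS_PATTERN = re.compile(r"\d{1,3}(?:\.\d{3})+")

def import_field(field: str):
    """Return an int if the whole field looks like a grouped number,
    otherwise keep it as text (mimicking the 'Standard' column type)."""
    if THOUSANDS_PATTERN.fullmatch(field):
        return int(field.replace(".", ""))  # separators are dropped
    return field  # left as text unchanged

print(import_field("101.91.148"))   # middle group has 2 digits -> stays text
print(import_field("149.154.159"))  # matches the pattern -> number 149154159
```

Choosing the 'Text' column type in the import dialog corresponds to skipping the pattern check entirely and always keeping the field as text.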
Comment 2 Jens-D Doll 2023-10-17 06:38:03 UTC
Mike,

your answer is nonsense. The first 9 lines also contain the dots and are not recognized as numbers. Also, a conversion to a number would yield the wrong value. Are you a saboteur?
Comment 3 Mike Kaganski 2023-10-17 06:56:43 UTC
(In reply to Jens-D Doll from comment #2)
> your answer is nonsense.

Do you say that every time you do not understand something, instead of asking others to re-phrase the idea in simpler words that you can comprehend?

> The first 9 lines also contain the dots and are not recognized as numbers.

9? You never provided a screenshot of what you actually see; but I guess that the first "problematic" result would appear at row 10, with the previous *8* lines having the expected result (and line 1 would contain column names, so no "dots recognized as numbers").

Now try to look at the data, and re-read what I wrote.

In row 2, the text of the second column in the CSV is "101.91.148". It has three digits in the leftmost group, then a dot, then *two* digits in the middle group, then a dot, and then three digits in the rightmost group. I explained that when the data does not match the "numeric digits in groups of three, separated by dots" pattern, it would not get converted into a number. And here, the middle group does not consist of *three* digits. What is unclear here? The whole content of the cell does not match the number recognition pattern -> it does not get converted.

Same for rows 3 to 9.

And row 10 has "149.154.159". It is the first row that contains data that matches the rule. It is considered "one hundred forty-nine million one hundred fifty-four thousand one hundred fifty-nine". Indeed, the leftmost group could also be less than 3 digits (say, "9.154.159" would also match, as nine million one hundred fifty-four thousand one hundred fifty-nine).

> Also, a conversion to a number would yield the wrong value.

It would produce a number, which would then be displayed according to the *default* format rules, which don't include thousands separators.
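That is exactly why the dots seem to "disappear". A quick sketch of the two steps (assuming a dot-as-thousands locale; this is an illustration, not Calc's actual code):

```python
# Step 1: "149.154.159" is read as a grouped number, so the
# thousands separators are stripped during conversion.
value = int("149.154.159".replace(".", ""))

# Step 2: the default cell format has no thousands separator,
# so the number is displayed without any dots.
print(value)  # 149154159 -- the dots are gone
```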

> Are you a saboteur?

Did you like my words hinting that you are unable to comprehend simple things? If not, then why do you think it's OK to personally attack someone trying to help you by explaining things?
Comment 4 Jens-D Doll 2023-10-17 07:16:55 UTC
Regardless of your "arguments", a value must not be interpreted one way in lines 1-9 and the other way in lines 10, 12, and 13. That should be clear also to simple minds.
Comment 5 Mike Kaganski 2023-10-17 07:34:43 UTC
No, you are wrong. And your data shows exactly why: it has row 1 as headers (definitely text), while the other rows are data (so those rows would definitely need a different interpretation).

And CSV is used heavily for very different kinds of data, including intermediate results, sub-headings, etc.

If you use such an inherently loosely defined format as CSV, it's your duty to learn how to handle your specific flavor correctly. The program offers you very flexible means (the dialog gives you full control). And you can tell the program exactly that: instead of guessing every cell of a column (the 'Standard' type), just treat it as you say (make it text by choosing the column type).

But trying to make the program please one group of users would break things for countless other users.
Comment 6 Jens-D Doll 2023-10-17 07:40:37 UTC
Sorry for those hard words, Mike. I'll explain it now.

A CSV collection is similar to a table, where each column has the same type in all lines. So the first line determines the type of the values in the following lines. Here the first line is all text, and so all values in the following lines have to be interpreted as text.
Comment 7 Mike Kaganski 2023-10-17 07:42:28 UTC
Sigh.
Please read comment 5. I explained it there, telling that there is a HUGE VARIETY of CSV flavors. Yours is just one kind.
Comment 8 Jens-D Doll 2023-10-17 07:46:33 UTC
Have a look at Wikipedia, https://en.wikipedia.org/wiki/Comma-separated_values 

Comma-separated values (CSV) is a text file format that uses commas to separate values. A CSV file stores tabular data (numbers and text) in plain text, where each line of the file typically represents one data record. Each record consists of the same number of fields, and these are separated by commas in the CSV file. If the field delimiter itself may appear within a field, fields can be surrounded with quotation marks.[3]
Comment 9 Mike Kaganski 2023-10-17 07:59:01 UTC
Better look at the RFC [1] (which, by the way, is just a memo). And acknowledge the fact that CSV had evolved for more than 30 years before that attempt at "standardization", spawning countless flavors (including binary data, or absolutely odd variants like the one described at [2]).

No amount of wishful thinking can change that. New CSVs (and TSVs, and other "tabular" textual formats) are generated every day by thousands of different programs.

[1] https://www.rfc-editor.org/rfc/rfc4180
[2] https://mikekaganski.wordpress.com/2021/02/18/reading-from-mysql-data-with-blobs-dumped-to-csv/
Comment 10 Jens-D Doll 2023-10-17 09:14:25 UTC
We should look at CSV in today's world, not yesterday's. Before Copernicus, the Earth was flat and the center of the universe.
Comment 11 Mike Kaganski 2023-10-17 09:23:08 UTC
In comment 9, I provided you a link to my post from 2021. There I explained a non-standard CSV flavor generated by MySQL, right then, in 2021. If you ignore everything that contradicts your opinion, I have no intention of wasting time trying to explain. If you suppose that 2021 is "yesterday's world", you are welcome to test current implementations of e.g. MySQL, and countless other current applications.

I'm done here.
Comment 12 Jens-D Doll 2023-10-17 22:14:09 UTC
Created attachment 190273 [details]
a view of the buggy import
Comment 13 Jens-D Doll 2023-10-17 22:15:58 UTC
The CSV format has always been a database format with fixed types in each column. See

https://support.microsoft.com/en-us/office/create-or-edit-csv-files-to-import-into-outlook-4518d70d-8fe9-46ad-94fa-1494247193c7

for reference. So this is a bug.
Comment 14 V Stuart Foote 2023-10-18 13:52:04 UTC
As noted in comment 1, import of a column of data that can be either numbers or text (e.g. the subnet column 2 in a locale where . is the thousands separator) cannot be left to detection as 'Standard'.

You must understand your data and your number formats and make an appropriate selection. In this case, assign the column type 'Text', or you will get mixed numeric detection.

You could also force the "subnet" column to text by enclosing each field in "", or, during manual entry, by prepending a '

Either method => NAB
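For comparison, programmatic CSV readers make this choice explicit. A small Python sketch (the csv module and the sample data are illustrative only, unrelated to LibreOffice): Python's csv module leaves every field as text, and any numeric conversion is opt-in, which is the equivalent of selecting 'Text' for the column in Calc's import dialog.

```python
import csv
import io

# Hypothetical sample mirroring the shape of the attached data.
data = "host,subnet\nserver-a,101.91.148\nserver-b,149.154.159\n"

rows = list(csv.reader(io.StringIO(data)))

# Every field arrives as a string; nothing is guessed to be numeric.
for row in rows[1:]:
    assert isinstance(row[1], str)

print(rows[1][1], rows[2][1])  # 101.91.148 149.154.159
```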