Bug 138537 - Import CSV sometimes change delimeted char-values to numbers
Summary: Import CSV sometimes change delimeted char-values to numbers
Status: RESOLVED NOTABUG
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Calc (show other bugs)
Version:
(earliest affected)
unspecified
Hardware: All Windows (All)
: medium normal
Assignee: Not Assigned
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: CSV-Import
  Show dependency treegraph
 
Reported: 2020-11-28 08:47 UTC by Frank B.
Modified: 2021-02-06 16:59 UTC (History)
3 users (show)

See Also:
Crash report or crash signature:


Attachments
IP-Table (9.55 MB, application/vnd.ms-excel)
2020-11-28 08:47 UTC, Frank B.
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Frank B. 2020-11-28 08:47:52 UTC
Created attachment 167635 [details]
IP-Table

When importing this: (Line 575)
First occured at line 1, then 73
.....
"1.72.0.0","1.79.255.255","JP","Japan"
"1.80.0.0","1.95.255.255","CN","China"
"1.96.0.0","1.111.255.255","KR","Korea (Republic of)"
"1.112.0.0","1.115.255.255","JP","Japan"
"1.116.0.0","1.119.255.255","CN","China"
"1.120.0.0","1.159.255.255","AU","Australia"
"1.160.0.0","1.174.206.255","TW","Taiwan (Province of China)"
"1.174.207.0","1.174.207.255","NO","Norway"
"1.174.208.0","1.175.255.255","TW","Taiwan (Province of China)"
.....
"1.208.0.0","1.255.255.255","KR","Korea (Republic of)"
"2.0.0.0","2.16.0.255","FR","France"
....

the result is this (misformated tab 'cause of copy&paste, that's not the fault):
....
1.72.0.0  1.79.255.255   JP   Japan
1.80.0.0  1.95.255.255   CN   China
1.96.0.0  1111255255     KR   Korea (Republic of)
1.112.0.0 1115255255     JP   Japan
1.116.0.0 1119255255     CN   China
1.120.0.0 1159255255     AU   Australia
1.160.0.0 1174206255     TW   Taiwan (Province of China)
1.174.207.0    1174207255     NO   Norway
1.174.208.0    1175255255     TW   Taiwan (Province of China)
....
1.208.0.0	1255255255	KR	Korea (Republic of)
2.0.0.0	2.16.0.255	FR	France
.....

But not only at the beginning, there are serveral faults like this after import.

Anonther fault:
Line 288: 
if your list is ,-seperated, this makes trouble:
....
"34729984","34732031","IT","Italy"
"34732032","34732287","UG","Uganda"
"34732288","34732543","AM","Armenia"
"34732544","34732799","TZ","Tanzania, United Republic of"
"34732800","34733055","DE","Germany"
"34733056","34733823","TR","Turkey"
"34733824","34734079","US","United States of America"
....

result in:
2.17.240.0	2.17.247.255	IT	Italy			
2.17.248.0	2.17.248.255	UG	Uganda			
2.17.249.0	2.17.249.255	AM	Armenia			
2.17.250.0	2.17.250.255	TZ	Tanzania"2.17.251.0	2.17.251.255	DE	Germany
2.17.252.0	2.17.254.255	TR	Turkey			
2.17.255.0	2.17.255.255	US	United States of America			
2.18.0.0	2.18.3.255	IT	Italy			


At the preview, all looks fine


Reproducible: Always

Steps to Reproduce:
1. right click attached csv
2. select open with libreoffice
3. use standard setting (UTF-8, german, Seperator: ,   character delimeter: ")



Version: 6.4.7.2 (x64)
Build-ID: 639b8ac485750d5696d7590a72ef1b496725cfb5
CPU-Threads: 12; BS: Windows 10.0 Build 18363; UI-Render: GL; VCL: win; 
Gebietsschema: de-DE (de_DE); UI-Sprache: de-DE
Calc: threaded
Comment 1 Frank B. 2020-11-28 10:26:41 UTC
additional info to line 288:
a faulty input 'causes this problem, but maybe it would be better,
if a delimeted phrase (e.g. by ") is taken as ONE expression

So that the input
"foo","bar","foo,bar"

results to:

foo    bar   foo,bar


and not to: 

foo    bar   foo   bar"
Comment 2 Mike Kaganski 2020-11-28 11:22:21 UTC
Repro (German in import settings is important) with Version: 7.0.3.1 (x64)
Build ID: d7547858d014d4cf69878db179d326fc3483e082
CPU threads: 12; OS: Windows 10.0 Build 19042; UI render: Skia/Raster; VCL: win
Locale: ru-RU (ru_RU); UI: en-US
Calc: CL
Comment 3 Frank B. 2020-11-28 11:52:14 UTC
Repro (German in import settings is important) with Version: 6.4.7.2 (x64)

6.4.7.2 (x64) languages testet:
German                : fault
German (Austria)      : fault
German (Swiss)        : NO fault
German (Luxembourd)   : fault
German (Liechtenstein): NO fault
Frisian (german accent): fault

Spanish (Argentinia)  : fault
Spanish (Bolivia)     : fault
Spanish (Spain)       : fault

English: (USA)        : NO fault
English: (Kanada)     : NO fault

Icelandic             : fault

French (Belgium)      : fault
French (Niger)        : NO fault


OK, so far from me
Comment 4 Mike Kaganski 2020-11-28 17:03:48 UTC
I suppose it's not a bug actually. The "problematic" parts have groups of exactly three digits, separated by the locale's *thousand separator*. It's normal that these strings are recognized as valid whole numbers.

All cases where the groups were not three are recognized as not valid numbers, thus kept as is.

This works as expected. You should mark the columns as string explicitly on import, and that would disable the number recognition.
Comment 5 Frank B. 2020-11-29 08:30:25 UTC
I can't agree with your result.

As you see, it's not only depending on "exactly three digits, separated by the locale's *thousand separator*", but with languages settings before import.

AND: the values are bracketedby " with indicates, this IS TEXT

AND: in the preview, it is imported correctly

So, I take your solution as workaround.
Tahnks, but, sorry, for me, it's REOPEND"
Comment 6 Frank B. 2020-11-29 08:43:17 UTC
OK, 
depending on wikipedia:

https://en.wikipedia.org/wiki/Decimal_separator

it make more or less sense, but in my opinion, values in "-brackets should be conversion-proof.

If I want to have numbers, I could create a file like this:

1.72.0.0,1.79.255.255,"JP","Japan"

but I created this:

"1.72.0.0","1.79.255.255","JP","Japan"
Comment 7 Mike Kaganski 2020-11-29 09:45:53 UTC
It is not "FIXED", it's NOTABUG. Fixed is where something (known) had been changed in the software to remove the pre-existing actual problem in it.

(In reply to Frank B. from comment #5)
> I can't agree with your result.
> 
> As you see, it's not only depending on "exactly three digits, separated by
> the locale's *thousand separator*", but with languages settings before
> import.

1. Of course, as I wrote: "exactly three digits, separated by the locale's *thousand separator*". The phrase combined two factors: grouping of three digits, and *locale-specific* thousand separators.

> AND: the values are bracketedby " with indicates, this IS TEXT

No. In CSV, quotes are not for TEXT. CSV does not have notion of "data type" at all, ALL its content is text. It's up to application to try (or not) to interpret the textual elements of each record in CSV. Quotes in CSV are not to mark elements as "text", it's to mask some characters in elements that otherwise would have special meaning in CSV, like separators or newlines. That all is described in RFC 4180 [1].

> AND: in the preview, it is imported correctly

Preview does not "imports" things (in the sense that it does not interpret values), it just shows you the content of N first lines, *separated* by the rules that you define, so that you have the idea what gets into which cell. The conversions are performed only at actual import.

(In reply to Frank B. from comment #6)
> in my opinion, values in "-brackets should be conversion-proof.
> 
> If I want to have numbers, I could create a file like this:
> 
> 1.72.0.0,1.79.255.255,"JP","Japan"
> 
> but I created this:
> 
> "1.72.0.0","1.79.255.255","JP","Japan"

As described above, that's your misconception, and if you imply some special processing of quotes does not necessarily mean that it's what the format says about it.


[1] https://tools.ietf.org/html/rfc4180
Comment 8 Mike Kaganski 2020-11-29 09:51:33 UTC
(In reply to Mike Kaganski from comment #7)
> (In reply to Frank B. from comment #6)
> > in my opinion, values in "-brackets should be conversion-proof.
> > 
> > If I want to have numbers, I could create a file like this:
> > 
> > 1.72.0.0,1.79.255.255,"JP","Japan"
> > 
> > but I created this:
> > 
> > "1.72.0.0","1.79.255.255","JP","Japan"
> 
> As described above, that's your misconception, and if you imply some special
> processing of quotes does not necessarily mean that it's what the format
> says about it.

And by the way, the import dialog does include the dedicated checkbox for processing like you want in Other Options: [ ] Format quoted field as text.
Comment 9 Frank B. 2020-12-01 15:14:12 UTC
@Mike Kaganski: Thanks, after "studying" RFC 4180, I agree - your right, I'm wrong.

(then there is an error in Excel ;-)
Excel imports this parts as numbers, too - but it leaves the . between it...
so that it looks like an IP, but it is a number, too
Comment 10 Mike Kaganski 2020-12-01 15:36:05 UTC
(In reply to Frank B. from comment #9)
> (then there is an error in Excel ;-)
> Excel imports this parts as numbers, too - but it leaves the . between it...
> so that it looks like an IP, but it is a number, too

Well, I'd not call it a bug in Excel: it simply does more work, and attempts to create a number formatting that would represent the converted number as it was represented in CSV. So seeing that a value string had thousand separators, it applies a number format with thousand separators to the cell after the import... :-) That is likely to result in very inconsistently formatted cells throughout the file, but OTOH it keeps visual representation ... both ways have upsides and downsides.