158025 – Calc Open CSV Shows garbage characters / messy codes

Summary: Calc Open CSV Shows garbage characters / messy codes

Status:	RESOLVED FIXED

Alias:	None

Product:	LibreOffice
Classification:	Unclassified
Component:	Calc
Version: (earliest affected)	7.5.7.1 release
Hardware:	All Linux (All)

Importance:	medium normal
Assignee:	Not Assigned

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:	CSV-Import

Reported:	2023-11-01 12:23 UTC by Jonny Grant
Modified:	2024-05-19 14:55 UTC
CC List:	1 user

See Also:	152336
Crash report or crash signature:

Attachments
utf8.csv file (20 bytes, application/octet-stream) 2023-11-01 12:23 UTC, Jonny Grant
utf16.csv file (38 bytes, application/octet-stream) 2023-11-01 12:24 UTC, Jonny Grant
Screenshot of CSV import dialog (63.82 KB, image/png) 2023-11-01 12:24 UTC, Jonny Grant

Description Jonny Grant 2023-11-01 12:23:13 UTC

Description:
When opening a CSV file, Calc shows Chinese-looking characters for a UTF-8 file if it is opened after a UTF-16 file with a BOM.

Steps to Reproduce:
1. Open UTF-8 CSV file (no BOM) utf8.csv - click through the import dialog.
2. Open UTF-16 CSV file with BOM utf16.csv - click through the import dialog.
3. Open utf8.csv again and observe that the import dialog shows corrupted text that looks Chinese.
Screenshot attached.

I've seen this many times over the years but can't find another bug report. Someone else has reported it here:

https://ask.libreoffice.org/t/opening-a-csv-file-gives-chinese-signs/20425


Actual Results:
See screenshot

Expected Results:
Opening a UTF-16 file with a UTF-16 BOM should not cause the CSV import to default to UTF-16 for files opened afterwards. UNIX systems use UTF-8.


Reproducible: Always


User Profile Reset: No

Additional Info:
Version: 7.5.7.1 (X86_64) / LibreOffice Community
Build ID: 50(Build:1)
CPU threads: 8; OS: Linux 6.2; UI render: default; VCL: gtk3
Locale: en-GB (en_GB.UTF-8); UI: en-GB
Ubuntu package version: 4:7.5.7-0ubuntu0.23.04.1
Calc: threaded

Comment 1 Jonny Grant 2023-11-01 12:23:47 UTC

Created attachment 190581 [details]
utf8.csv file

Comment 2 Jonny Grant 2023-11-01 12:24:13 UTC

Created attachment 190582 [details]
utf16.csv file

Comment 3 Jonny Grant 2023-11-01 12:24:38 UTC

Created attachment 190583 [details]
Screenshot of CSV import dialog

Comment 4 Ming Hua 2023-11-01 12:40:53 UTC

(In reply to Jonny Grant from comment #0)
> Description:
> Calc Open CSV Chinese characters for UTF8 file after UTF16 file with BOM
> 
> Steps to Reproduce:
> 1. Open UTF-8 CSV file (no BOM) utf8.csv - click through the import dialog.
> 2. Open UTF-16 CSV file with BOM  utf16.csv - click through the import
> dialog.
> 3. Open utf8.csv again and observe that it shows corrupted chinese text in
> the import dialog.
> Screenshot attached.
The screenshot shows that "Character set" is set to "Unicode (UTF-16)", so of course the UTF-8 encoded file cannot be displayed properly.

I assume the import will be OK if the user manually sets "Character set" to UTF-8 in step 3?

If so, then the bug is that the CSV import dialog doesn't remember the choice of encoding from the last time a file was opened.  I'm positive I've seen such a bug before.

Comment 5 Jonny Grant 2023-11-01 13:18:05 UTC

(In reply to Ming Hua from comment #4)
> (In reply to Jonny Grant from comment #0)
> > Description:
> > Calc Open CSV Chinese characters for UTF8 file after UTF16 file with BOM
> > 
> > Steps to Reproduce:
> > 1. Open UTF-8 CSV file (no BOM) utf8.csv - click through the import dialog.
> > 2. Open UTF-16 CSV file with BOM  utf16.csv - click through the import
> > dialog.
> > 3. Open utf8.csv again and observe that it shows corrupted chinese text in
> > the import dialog.
> > Screenshot attached.
> The screenshot shows that "Character set" is set to "Unicode (UTF-16)", so
> of course the UTF-8 encoded file can not be displayed properly.
> 
> I assume the import will be OK if the user manually set "Character set" to
> UTF-8 manually in step 3?
> 
> If so, then the bug is that CSV import dialog doesn't remember the choice of
> encoding last time a file is opened.  I'm positive I've seen such a bug
> before.

Dear Ming Hua 

Thank you for your reply.

The bug appears to be that the UTF-16 encoding detected from the BOM of a genuine UTF-16 file is remembered by Calc. It then stays in UTF-16 mode and does not return to a default of UTF-8.

The test files are attached.

Note, my UTF8 file does not contain a BOM.

Comment 6 Jonny Grant 2023-11-01 13:22:29 UTC

Yes, manually changing to UTF-8 in step 3 makes it work fine again, and the problem doesn't occur again.

Comment 7 Ming Hua 2023-11-02 04:03:10 UTC

(In reply to Jonny Grant from comment #5)
> The bug appears to be that a UTF16 BOM in a genuine UTF16 file, is
> remembered by Calc. It then stays in UTF16 mode. It does not return to a
> default of UTF8.
> 
> The test files are attached.
> 
> Note, my UTF8 file does not contain a BOM.
I am far from an expert, but my understanding is that a BOM makes it much easier to detect a UTF-16 encoded file properly. On the other hand, the attached UTF-8 file contains only ASCII characters, and without a BOM its encoding is much harder to detect confidently and correctly.

Anyway, let's ping the CSV import meta bug and see if someone who knows better can look at this.

Comment 8 Mike Kaganski 2023-11-14 08:24:15 UTC

(In reply to Ming Hua from comment #7)

We have a way to detect encoding; it is easy enough to tell UTF-8 from UTF-16 (though the same can't be said of other pairs of encodings). See e.g. bug 60145. The situation here is that we don't try to detect it at all.
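
For illustration only, here is a minimal Python sketch of the kind of check that can separate UTF-8 from UTF-16. It is not LibreOffice's detection code (see bug 60145 for that); the function name guess_encoding and its return values are assumptions made for this example.

```python
import codecs


def guess_encoding(data: bytes) -> str:
    """Guess between UTF-8 and UTF-16 for raw file bytes (illustrative only)."""
    # A byte order mark, when present, settles the question.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # NUL bytes are valid UTF-8 but very unlikely in a CSV file; they hint
    # at BOM-less UTF-16, which this sketch does not try to resolve.
    if b"\x00" in data:
        return "unknown"
    # Strict decoding succeeds for pure ASCII and for well-formed UTF-8.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "unknown"
    return "utf-8"
```

With this sketch, an ASCII-only file without a BOM (like the attached utf8.csv is described to be) would be classified as UTF-8, and a file starting with a UTF-16 BOM as UTF-16.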

Comment 9 Jonny Grant 2023-11-14 10:39:23 UTC

My suggestion is to always default to UTF-8 on Linux, or at least to go by the locale when no BOM is found, and never to save UTF-16 as the "last opened charset".

It's easy enough to check the locale from the shell or C.
$ locale
LANG=en_GB.UTF-8
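
The same lookup can be sketched in Python using the standard locale module. The helper name locale_charset is an assumption for this example, and the value returned depends on the environment it runs in.

```python
import locale


def locale_charset() -> str:
    """Return the character encoding implied by the user's locale settings."""
    # Passing False reads the current locale without modifying it.
    return locale.getpreferredencoding(False)


if __name__ == "__main__":
    print(locale_charset())
```

Under a locale such as en_GB.UTF-8 this would be expected to report a UTF-8 encoding, which an importer could use as its fallback when a file carries no BOM.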

Comment 10 Mike Kaganski 2023-11-14 11:35:47 UTC

The *default* is already the system encoding. But the default is only used when no "last used configuration" has been stored.

1. Since we try to autodetect encoding (works with BOM), we might want to also autodetect files without BOM (see bug 60145 for an implementation of detection).
2. A possibility could be to avoid storing autodetected encoding (so if an encoding was autodetected, and not changed by user, then it should not go into the profile).
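
Option 2 could be sketched roughly as follows. This is a hypothetical Python model, not LibreOffice code: ImportSettings, charset_for_dialog and remember_choice are names invented for the example, and the preference order (detected, then last used, then system default) is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ImportSettings:
    """Hypothetical stand-in for the stored 'last used' CSV import options."""
    last_charset: Optional[str] = None


def charset_for_dialog(settings: ImportSettings,
                       detected: Optional[str],
                       system_default: str) -> str:
    """Pick the charset preselected in the import dialog."""
    # Assumed order: autodetected, then last used, then system default.
    if detected is not None:
        return detected
    if settings.last_charset is not None:
        return settings.last_charset
    return system_default


def remember_choice(settings: ImportSettings,
                    detected: Optional[str],
                    chosen: str) -> None:
    """Store the charset unless it was autodetected and left unchanged."""
    if detected is None or chosen != detected:
        settings.last_charset = chosen
```

In this model, opening utf16.csv and accepting the autodetected UTF-16 leaves the stored setting untouched, so a later utf8.csv without a BOM still falls back to the previously stored charset or the system default.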

Comment 11 Kevin Suo 2023-11-22 09:51:56 UTC

I revised the summary field - they are not Chinese characters. They are simply garbage characters / messy codes.

Comment 12 Eike Rathke 2024-05-19 14:55:51 UTC

Not reproducible anymore in 24.8, since the implementation for bug 152336 forces encoding detection on file import.