Bug 158025 - Calc Open CSV Shows garbage characters / messy codes
Summary: Calc Open CSV Shows garbage characters / messy codes
Status: NEW
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Calc
Version: 7.5.7.1 release (earliest affected)
Hardware: All Linux (All)
Importance: medium normal
Assignee: Not Assigned
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: CSV-Import
Reported: 2023-11-01 12:23 UTC by Jonny Grant
Modified: 2023-11-22 09:51 UTC
CC List: 1 user

See Also:
Crash report or crash signature:


Attachments
utf8.csv file (20 bytes, application/octet-stream)
2023-11-01 12:23 UTC, Jonny Grant
utf16.csv file (38 bytes, application/octet-stream)
2023-11-01 12:24 UTC, Jonny Grant
Screenshot of CSV import dialog (63.82 KB, image/png)
2023-11-01 12:24 UTC, Jonny Grant

Description Jonny Grant 2023-11-01 12:23:13 UTC
Description:
Calc shows Chinese-looking characters when opening a UTF-8 CSV file after a UTF-16 CSV file with a BOM has been opened.

Steps to Reproduce:
1. Open UTF-8 CSV file (no BOM) utf8.csv - click through the import dialog.
2. Open UTF-16 CSV file with BOM  utf16.csv - click through the import dialog.
3. Open utf8.csv again and observe that it shows corrupted Chinese text in the import dialog.
Screenshot attached.

I've seen this many times over the years but can't find another bug report. Someone else has reported it here:

https://ask.libreoffice.org/t/opening-a-csv-file-gives-chinese-signs/20425


Actual Results:
See screenshot

Expected Results:
Opening a UTF-16 file with a UTF-16 BOM should not cause the CSV import to default to UTF-16 for files opened afterwards. UNIX systems use UTF-8.


Reproducible: Always


User Profile Reset: No

Additional Info:
Version: 7.5.7.1 (X86_64) / LibreOffice Community
Build ID: 50(Build:1)
CPU threads: 8; OS: Linux 6.2; UI render: default; VCL: gtk3
Locale: en-GB (en_GB.UTF-8); UI: en-GB
Ubuntu package version: 4:7.5.7-0ubuntu0.23.04.1
Calc: threaded
Comment 1 Jonny Grant 2023-11-01 12:23:47 UTC
Created attachment 190581
utf8.csv file
Comment 2 Jonny Grant 2023-11-01 12:24:13 UTC
Created attachment 190582
utf16.csv file
Comment 3 Jonny Grant 2023-11-01 12:24:38 UTC
Created attachment 190583
Screenshot of CSV import dialog
Comment 4 Ming Hua 2023-11-01 12:40:53 UTC
(In reply to Jonny Grant from comment #0)
> Description:
> Calc Open CSV Chinese characters for UTF8 file after UTF16 file with BOM
> 
> Steps to Reproduce:
> 1. Open UTF-8 CSV file (no BOM) utf8.csv - click through the import dialog.
> 2. Open UTF-16 CSV file with BOM  utf16.csv - click through the import
> dialog.
> 3. Open utf8.csv again and observe that it shows corrupted chinese text in
> the import dialog.
> Screenshot attached.
The screenshot shows that "Character set" is set to "Unicode (UTF-16)", so of course the UTF-8 encoded file cannot be displayed properly.

I assume the import will be OK if the user manually sets "Character set" to UTF-8 in step 3?

If so, then the bug is that the CSV import dialog doesn't remember the encoding chosen the last time a file was opened.  I'm positive I've seen such a bug before.
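
To illustrate why a UTF-8 file read with "Unicode (UTF-16)" selected looks like Chinese text, here is a minimal Python sketch. The CSV content is hypothetical (it is not the attached utf8.csv), and this is not LibreOffice code; it only shows what happens when pairs of ASCII bytes are reinterpreted as 16-bit code units.

```python
# Hypothetical ASCII-only CSV content (20 bytes); not the real attachment.
data = "name,value\nabc,1234\n".encode("utf-8")

# Correct interpretation: one byte per character.
correct = data.decode("utf-8")

# Wrong interpretation: every two ASCII bytes are fused into one UTF-16
# code unit, e.g. b"na" (0x6E 0x61) becomes U+616E, a CJK ideograph.
garbled = data.decode("utf-16-le")

print(repr(correct))
print(garbled)
print([hex(ord(ch)) for ch in garbled])
```

Most of the resulting code points fall in CJK blocks, which is why the preview resembles Chinese even though the file contains no Chinese text.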
Comment 5 Jonny Grant 2023-11-01 13:18:05 UTC
(In reply to Ming Hua from comment #4)
> The screenshot shows that "Character set" is set to "Unicode (UTF-16)", so
> of course the UTF-8 encoded file can not be displayed properly.
> 
> I assume the import will be OK if the user manually set "Character set" to
> UTF-8 manually in step 3?
> 
> If so, then the bug is that CSV import dialog doesn't remember the choice of
> encoding last time a file is opened.  I'm positive I've seen such a bug
> before.

Dear Ming Hua 

Thank you for your reply.

The bug appears to be that a UTF-16 BOM in a genuine UTF-16 file is remembered by Calc. It then stays in UTF-16 mode and does not return to a default of UTF-8.

The test files are attached.

Note that my UTF-8 file does not contain a BOM.
Comment 6 Jonny Grant 2023-11-01 13:22:29 UTC
Yes, if I manually change to UTF-8 in step (3), it works fine again, and the problem doesn't occur again.
Comment 7 Ming Hua 2023-11-02 04:03:10 UTC
(In reply to Jonny Grant from comment #5)
> The bug appears to be that a UTF16 BOM in a genuine UTF16 file, is
> remembered by Calc. It then stays in UTF16 mode. It does not return to a
> default of UTF8.
> 
> The test files are attached.
> 
> Note, my UTF8 file does not contain a BOM.
I am far from an expert, but my understanding is that a BOM makes it much easier to detect a UTF-16 encoded file properly. On the other hand, the attached UTF-8 file contains only ASCII characters, and without a BOM its encoding is much harder to detect confidently and correctly.

Anyway, let's ping the CSV import meta bug and see if someone who knows better can look at this.
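
The two points above can be sketched in Python, as an illustration only: `sniff_bom` is a hypothetical helper (not a LibreOffice function) that recognises UTF-8 and UTF-16 BOMs, and the second part shows that ASCII-only bytes decode to the same text under several encodings, so the bytes alone cannot say which one was intended. UTF-32 BOMs are deliberately ignored here.

```python
import codecs

def sniff_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None if there is none.

    Hypothetical helper; only UTF-8 and UTF-16 BOMs are considered.
    """
    candidates = (
        (codecs.BOM_UTF8, "utf-8-sig"),
        (codecs.BOM_UTF16_LE, "utf-16"),
        (codecs.BOM_UTF16_BE, "utf-16"),
    )
    for bom, name in candidates:
        if data.startswith(bom):
            return name
    return None

# A file with a BOM identifies itself.
with_bom = "a,b,c\n".encode("utf-16")  # the "utf-16" codec writes a BOM
print(sniff_bom(with_bom))

# An ASCII-only file has no BOM and is valid in many encodings at once.
ascii_only = b"a,b,c\n1,2,3\n"
print(sniff_bom(ascii_only))
decoded = {enc: ascii_only.decode(enc)
           for enc in ("ascii", "utf-8", "latin-1", "cp1252")}
print(len(set(decoded.values())))  # number of distinct decodings
```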
Comment 8 Mike Kaganski 2023-11-14 08:24:15 UTC
(In reply to Ming Hua from comment #7)

We have a way to detect encoding; it is easy enough to tell UTF-8 from UTF-16 (though the same can't be said of other pairs of encodings). See e.g. bug 60145. The situation here is that we don't try to detect it at all.
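
As a rough illustration of why UTF-8 and UTF-16 are easy to tell apart even without a BOM, here is a minimal Python heuristic. It is an assumption-laden sketch, not the detection from bug 60145 or anything in LibreOffice: the function name, the NUL-byte test, and the 50% threshold are all made up for this example, and the NUL test only works for text that is mostly in the Latin range.

```python
def guess_utf8_or_utf16(data: bytes) -> str:
    """Guess between UTF-8 and UTF-16; hypothetical heuristic for illustration."""
    # A BOM, if present, settles the question.
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8"
    if data.startswith((b"\xff\xfe", b"\xfe\xff")):
        return "utf-16"

    # Mostly-Latin UTF-16 text has a NUL in every other byte;
    # ordinary UTF-8 text contains practically no NUL bytes.
    half = max(1, len(data) // 2)
    if data[1::2].count(0) > half * 0.5:
        return "utf-16-le"
    if data[0::2].count(0) > half * 0.5:
        return "utf-16-be"

    # Otherwise accept UTF-8 only if the bytes form valid UTF-8.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "unknown"
    return "utf-8"

sample = "a,b\n1,2\n"
print(guess_utf8_or_utf16(sample.encode("utf-8")))
print(guess_utf8_or_utf16(sample.encode("utf-16-le")))
print(guess_utf8_or_utf16("é".encode("latin-1")))
```

The last call shows the limit mentioned above: a legacy 8-bit encoding is rejected as UTF-8 but cannot be identified further by this check.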
Comment 9 Jonny Grant 2023-11-14 10:39:23 UTC
My suggestion is to always default to UTF-8 on Linux - or at least go by the locale where no BOM is found. So never save any UTF16 as "last opened charset".

It's easy enough to check the locale from the shell or C.
$ locale
LANG=en_GB.UTF-8
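
The same idea in Python, as a sketch only: `default_import_encoding` is a hypothetical helper that prefers an encoding found via a BOM and otherwise falls back to the locale's encoding. The fallback to UTF-8 when the locale name is not recognised is an assumption of this example.

```python
import codecs
import locale

def default_import_encoding(bom_encoding=None):
    """Pick a default: the BOM's encoding if known, else the locale's encoding."""
    if bom_encoding is not None:
        return bom_encoding
    # Ask the C library which encoding the current locale uses
    # (for example "UTF-8" under LANG=en_GB.UTF-8).
    name = locale.getpreferredencoding(False)
    try:
        # Normalise the name to Python's canonical codec name.
        return codecs.lookup(name).name
    except LookupError:
        return "utf-8"  # assumed fallback for this sketch

print(default_import_encoding())          # depends on the system locale
print(default_import_encoding("utf-16"))  # BOM wins when one was found
```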
Comment 10 Mike Kaganski 2023-11-14 11:35:47 UTC
The *default* is already the system encoding. But the default is only used when there was no "last used configuration" stored.

1. Since we try to autodetect encoding (works with BOM), we might want to also autodetect files without BOM (see bug 60145 for an implementation of detection).
2. A possibility could be to avoid storing autodetected encoding (so if an encoding was autodetected, and not changed by user, then it should not go into the profile).
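
Option 2 could look roughly like the following Python sketch. Everything here is hypothetical (the `ImportProfile` class, the function names, and the precedence of detection over the stored value); it models the idea, not LibreOffice's actual configuration code.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ImportProfile:
    # Hypothetical stand-in for the "last used configuration" in the profile.
    last_used_encoding: Optional[str] = None

def initial_encoding(profile, detected, system_default):
    """Encoding preselected in the dialog: detection, then last choice, then default."""
    return detected or profile.last_used_encoding or system_default

def remember_choice(profile, initial, chosen, was_detected):
    """Store the encoding unless it was autodetected and left unchanged by the user."""
    if was_detected and chosen == initial:
        return
    profile.last_used_encoding = chosen

# Replay the steps from the report, with the user accepting the dialog each time.
profile = ImportProfile()
shown = []
for detected in (None, "utf-16", None):  # utf8.csv, utf16.csv (BOM), utf8.csv
    initial = initial_encoding(profile, detected, system_default="utf-8")
    shown.append(initial)
    remember_choice(profile, initial, chosen=initial,
                    was_detected=detected is not None)

print(shown)
print(profile.last_used_encoding)
```

With this rule the BOM-detected UTF-16 never reaches the profile, so the third import starts from UTF-8 again, while an encoding the user picks by hand is still remembered.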
Comment 11 Kevin Suo 2023-11-22 09:51:56 UTC
I revised the summary field: they are not Chinese characters. They are simply garbage characters / messy codes.