Description:
Calc: Open CSV shows Chinese characters for UTF8 file after UTF16 file with BOM

Steps to Reproduce:
1. Open UTF-8 CSV file (no BOM) utf8.csv - click through the import dialog.
2. Open UTF-16 CSV file with BOM utf16.csv - click through the import dialog.
3. Open utf8.csv again and observe that it shows corrupted Chinese text in the import dialog. Screenshot attached.

I've seen this many times over the years but can't find another bug report. Someone else has reported it here: https://ask.libreoffice.org/t/opening-a-csv-file-gives-chinese-signs/20425

Actual Results:
See screenshot.

Expected Results:
A UTF-16 file with a UTF-16 BOM should not cause the CSV import to default to UTF-16. UNIX systems use UTF-8.

Reproducible: Always
User Profile Reset: No

Additional Info:
Version: 7.5.7.1 (X86_64) / LibreOffice Community
Build ID: 50(Build:1)
CPU threads: 8; OS: Linux 6.2; UI render: default; VCL: gtk3
Locale: en-GB (en_GB.UTF-8); UI: en-GB
Ubuntu package version: 4:7.5.7-0ubuntu0.23.04.1
Calc: threaded
Created attachment 190581 [details] utf8.csv file
Created attachment 190582 [details] utf16.csv file
Created attachment 190583 [details] Screenshot of CSV import dialog
(In reply to Jonny Grant from comment #0)
> Steps to Reproduce:
> 1. Open UTF-8 CSV file (no BOM) utf8.csv - click through the import dialog.
> 2. Open UTF-16 CSV file with BOM utf16.csv - click through the import dialog.
> 3. Open utf8.csv again and observe that it shows corrupted Chinese text in the import dialog.
> Screenshot attached.

The screenshot shows that "Character set" is set to "Unicode (UTF-16)", so of course the UTF-8 encoded file cannot be displayed properly.

I assume the import will be OK if the user manually sets "Character set" to UTF-8 in step 3?

If so, then the bug is in how the CSV import dialog remembers the choice of encoding from the last time a file was opened. I'm positive I've seen such a bug before.
(In reply to Ming Hua from comment #4)
> The screenshot shows that "Character set" is set to "Unicode (UTF-16)", so of course the UTF-8 encoded file cannot be displayed properly.
>
> I assume the import will be OK if the user manually sets "Character set" to UTF-8 in step 3?
>
> If so, then the bug is in how the CSV import dialog remembers the choice of encoding from the last time a file was opened. I'm positive I've seen such a bug before.

Dear Ming Hua,

Thank you for your reply. The bug appears to be that the UTF-16 BOM in a genuine UTF-16 file is remembered by Calc. It then stays in UTF-16 mode and does not return to a default of UTF-8.

The test files are attached. Note that my UTF-8 file does not contain a BOM.
Yes, if I manually change "Character set" to UTF-8 in step (3), it works fine again, and the problem doesn't recur.
(In reply to Jonny Grant from comment #5)
> The bug appears to be that the UTF-16 BOM in a genuine UTF-16 file is remembered by Calc. It then stays in UTF-16 mode and does not return to a default of UTF-8.
>
> The test files are attached.
>
> Note that my UTF-8 file does not contain a BOM.

I am far from an expert, but my understanding is that with a BOM it's much easier to detect a UTF-16 encoded file reliably. On the other hand, the attached UTF-8 file contains only ASCII characters, so without a BOM it is much harder to detect the encoding confidently and correctly.

Anyway, let's ping the CSV import meta bug and see if someone who knows better can look at this.
(In reply to Ming Hua from comment #7)
We have a way to detect encoding; it is easy enough to tell UTF-8 from UTF-16 (though the same can't be said for every pair of encodings). See e.g. bug 60145. The situation here is that we don't try to detect it at all.
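For illustration, the UTF-8/UTF-16 pair is easy to tell apart because of the BOM and because ASCII-heavy UTF-16 text is full of NUL bytes. A minimal Python sketch of such a heuristic (illustrative only; it is not the LibreOffice implementation, which bug 60145 discusses, and the fallback encoding is an arbitrary choice):

```python
def detect_encoding(data: bytes) -> str:
    """Guess the text encoding of raw CSV bytes (hypothetical heuristic)."""
    if data.startswith(b"\xff\xfe"):
        return "utf-16-le"           # UTF-16 little-endian BOM
    if data.startswith(b"\xfe\xff"):
        return "utf-16-be"           # UTF-16 big-endian BOM
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8-sig"           # UTF-8 BOM
    # No BOM: ASCII text encoded as UTF-16 contains NUL bytes,
    # while ordinary UTF-8 text never does.
    if b"\x00" in data[:1024]:
        return "utf-16"
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "latin-1"             # arbitrary single-byte fallback
```

With this sketch, the attached utf8.csv (ASCII, no BOM) would be classified as UTF-8 even immediately after opening a UTF-16 file, which is the behaviour the reporter expects.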
My suggestion is to always default to UTF-8 on Linux, or at least go by the locale where no BOM is found. So never save any UTF-16 as the "last opened charset". It's easy enough to check the locale from the shell or from C:

$ locale
LANG=en_GB.UTF-8
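The same locale query is available from other languages too; for comparison, a Python one-liner (the output depends on the environment, so "UTF-8" below is just what the reporter's en_GB.UTF-8 setting would yield):

```python
import locale

# Ask for the locale's preferred text encoding. On a Linux desktop
# with LANG=en_GB.UTF-8 this typically returns "UTF-8".
encoding = locale.getpreferredencoding()
print(encoding)
```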
The *default* is already the system encoding. But the default is only used when there is no "last used configuration" stored.

1. Since we try to autodetect encoding (works with BOM), we might want to also autodetect files without a BOM (see bug 60145 for an implementation of detection).
2. A possibility could be to avoid storing autodetected encodings (so if an encoding was autodetected and not changed by the user, it should not go into the profile).
I have revised the summary field - they are not Chinese characters, just garbage characters (mojibake).
Not reproducible anymore in 24.8, since the implementation for bug 152336 forces an encoding detection on file import.