Bug 135441 - LibreOffice needs settable charset/encoding defaults
Status: CLOSED WORKSFORME
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Calc
Version (earliest affected): unspecified
Hardware: All macOS (All)
Importance: medium enhancement
Assignee: Not Assigned
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: CSV-Dialog
 
Reported: 2020-08-04 14:47 UTC by 伟思礼
Modified: 2023-05-10 20:08 UTC

See Also:
Crash report or crash signature:


Attachments
Sample UTF-8 file containing pasteable text (86 bytes, application/octet-stream)
2020-08-06 18:26 UTC, 伟思礼
dialog showing default charset (100.28 KB, image/jpeg)
2020-08-06 18:28 UTC, 伟思礼
dialog after correcting charset (99.71 KB, image/jpeg)
2020-08-06 18:30 UTC, 伟思礼

Description 伟思礼 2020-08-04 14:47:02 UTC
Description:
When I import or paste text into Calc, I sometimes forget to change the "Character set:" to UTF-8.  This MIGHT be why I find a BOM or a non-printing zero-width character embedded (not at the start of the file) in text files copied and pasted out of Calc.  The "Character set:" menu should also be available in Preferences so the user can set a default.

Steps to Reproduce:
1. Use an app other than LibreOffice to open a text file encoded UTF-8
2. Select some text and copy to clipboard.
3. Select a cell in LibreOffice and try to paste.


Actual Results:
"Character set:" is UTF-16 (always)

Expected Results:
"Character set:" should default to something user has put in preferences.


Reproducible: Always


User Profile Reset: No



Additional Info:
Probably applies to all platforms.  I currently have
Version: 6.4.4.2 Build ID: 3d775be2011f3886db32dfd395a6a6d1ca2630ff
but this was noticed long ago.

Version: 6.4.4.2
Build ID: 3d775be2011f3886db32dfd395a6a6d1ca2630ff
CPU threads: 4; OS: Mac OS X 10.15.6; UI render: default; VCL: osx; 
Locale: en-US (en.UTF-8); UI-Language: en-US

Don't know whether OpenGL is enabled.

NOTE: the web form says "information from menu Help - About LibreOffice", but on macOS "About" is on the LibreOffice menu, at the opposite end of the menu bar.
Calc: threaded
Comment 1 Heiko Tietze 2020-08-05 12:46:23 UTC
What exactly do you mean by Character Set? (Tools) > Options > HTML compatibility?
Comment 2 Eike Rathke 2020-08-05 21:42:14 UTC
Character set in some dialogs is the text encoding, like in Text/CSV or HTML.
But IMHO the last-used text encoding is remembered unless UTF-16 is detected (in which case that is remembered instead), and there aren't many cases that trigger that detection, embedded null bytes being one. So yet another default is not actually needed. It would help to attach a small UTF-8 sample file that is detected as UTF-16 instead of the expected UTF-8.
Comment 3 伟思礼 2020-08-05 22:53:00 UTC
ALL files created or edited by me are UTF-8 without BOM.

That is the default for my editor.  My locale is en_US.UTF-8

What I do frequently, is create a temporary text file with new words/phrases I want to learn in another language.  I then copy them and paste into a spreadsheet which contains ALL the stuff I am learning.  Finally, I export that spreadsheet to overwrite a (non-temporary) tab-delimited file which I can then import into Anki (https://apps.ankiweb.net).

The file command confirms that both of those files are UTF-8.

But the import dialog that comes up when I paste ALWAYS says UTF-16.  Sometimes I forget to change it.  I do not know whether that is the cause of my Anki problems, but I have noticed that occasionally, there is a BOM in the permanent file, right before a recently pasted data item (not at the beginning of file).  And sometimes there is a zero-width non-printing character in the file.

Whenever either of these spurious characters has appeared, they are always on an item that Anki is having trouble with.
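
A minimal sketch of how the exported tab-delimited file could be checked for these characters before importing it into Anki. The helper names (`find_stray_characters`, `scan_file`) and the list of zero-width code points are assumptions made for illustration; none of this comes from LibreOffice or Anki. A U+FEFF at the very start of the text is treated as a legitimate BOM and skipped.

```python
# Hypothetical helper, for illustration only: report U+FEFF and common
# zero-width characters that appear anywhere except the start of the text.
SUSPECT = {
    "\ufeff": "BOM / ZERO WIDTH NO-BREAK SPACE",
    "\u200b": "ZERO WIDTH SPACE",
    "\u200c": "ZERO WIDTH NON-JOINER",
    "\u200d": "ZERO WIDTH JOINER",
    "\u2060": "WORD JOINER",
}


def find_stray_characters(text):
    """Return a list of (line_number, column, name) for each suspect character.

    Line and column numbers are 1-based. A U+FEFF in the first position of
    the first line is a regular BOM and is not reported.
    """
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch not in SUSPECT:
                continue
            if ch == "\ufeff" and line_no == 1 and col == 1:
                continue
            hits.append((line_no, col, SUSPECT[ch]))
    return hits


def scan_file(path):
    """Decode the file as strict UTF-8 and scan it.

    Plain "utf-8" (not "utf-8-sig") is used so that a BOM survives decoding
    as U+FEFF. A UnicodeDecodeError here means the file is not valid UTF-8.
    """
    with open(path, encoding="utf-8") as handle:
        return find_stray_characters(handle.read())
```

Calling `scan_file()` on the exported file (the path is whatever the export was saved as) returns an empty list when no stray characters are present; otherwise each tuple points at the line and column of the item to inspect.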
Comment 4 Eike Rathke 2020-08-06 10:31:50 UTC
(In reply to 伟思礼 from comment #3)
> ALL files created or edited by me are UTF-8 without BOM.
That's about normal these days when not on Windows.

> The file command confirms that both of those files are UTF-8.
> 
> But the import dialog that comes up when I paste ALWAYS says UTF-16. 
And that *never* happens for me. Hence my request to attach such a file here.

> Sometimes I forget to change it.  I do not know whether that is the cause of
> my Anki problems, but I have noticed that occasionally, there is a BOM in
> the permanent file, right before a recently pasted data item (not at the
> beginning of file).
That would be wrong. A BOM must not occur in the middle of data; it may only appear at the start of a text stream. What created that?

>  And sometimes there is a zero-width non-printing
> character in the file.
That shouldn't matter if it is properly encoded.

> Whenever either of these spurious characters has appeared, they are always
> on an item that Anki is having trouble with.
So Anki is the problem, and not LibreOffice?
Comment 5 Heiko Tietze 2020-08-06 10:36:45 UTC
No UX issue, apparently. Rather NOB. Let's wait for a test file.
Comment 6 伟思礼 2020-08-06 18:26:53 UTC
Created attachment 164009
Sample UTF-8 file containing pasteable text
Comment 7 伟思礼 2020-08-06 18:28:51 UTC
Created attachment 164010
dialog showing default charset
Comment 8 伟思礼 2020-08-06 18:30:48 UTC
Created attachment 164011
dialog after correcting charset
Comment 9 Heiko Tietze 2020-09-04 18:37:44 UTC
At least there is an issue => NEW.
Comment 10 Eike Rathke 2023-05-10 20:08:00 UTC
I don't see a problem. Yes, the encoding may be offered as UTF-16, but that is what _arrives_ at Calc from the clipboard, and as attachment 164010 of comment 7 shows, the text is _correct_ in UTF-16. Attachment 164011 of comment 8 does not show a corrected setting but a voluntary choice of UTF-8, and of course if the text is not encoded in UTF-8 then it is broken after that.

I also don't see the necessity for settable defaults here, or what they would even solve; even if they existed, forcing the encoding to UTF-8 in this case would import broken text.
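
A minimal sketch of the effect described above, assuming the clipboard payload is UTF-16 (little-endian is chosen here only for illustration). The sample string is made up and is not the content of attachment 164009.

```python
# Illustrative sample text (an assumption, not taken from the attachment).
text = "你好\tnǐ hǎo"

# Stand-in for what arrives from the clipboard: the text encoded as UTF-16.
utf16_bytes = text.encode("utf-16-le")

# Decoding with the matching encoding round-trips the text unchanged.
as_utf16 = utf16_bytes.decode("utf-16-le")

# Forcing UTF-8 onto the same bytes yields mangled text: every ASCII
# character is followed by a NUL, and other byte pairs turn into unrelated
# characters or U+FFFD replacement characters.
as_utf8 = utf16_bytes.decode("utf-8", errors="replace")

print(as_utf16 == text)
print(repr(as_utf8))
```

The bytes themselves are never "wrong"; only the pairing of bytes and declared encoding is, which is why a forced UTF-8 default would not help when the data is UTF-16.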

Closing WFM.