135441 – LibreOffice needs settable charset/encoding defaults

Status:	CLOSED WORKSFORME

Alias:	None

Product:	LibreOffice
Classification:	Unclassified
Component:	Calc
Version: (earliest affected)	unspecified
Hardware:	All macOS (All)

Importance:	medium enhancement
Assignee:	Not Assigned

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:	CSV-Dialog

Reported:	2020-08-04 14:47 UTC by 伟思礼
Modified:	2023-05-10 20:08 UTC
CC List:	5 users

See Also:
Crash report or crash signature:

Attachments
Sample UTF-8 file containing pasteable text (86 bytes, application/octet-stream) 2020-08-06 18:26 UTC, 伟思礼
dialog showing default charset (100.28 KB, image/jpeg) 2020-08-06 18:28 UTC, 伟思礼
dialog after correcting charset (99.71 KB, image/jpeg) 2020-08-06 18:30 UTC, 伟思礼

Description 伟思礼 2020-08-04 14:47:02 UTC

Description:
When I import or paste text into Calc, I sometimes forget to change the "Character set:" to UTF-8.  This MIGHT be why I find a BOM or an unknown non-printing zero-width character embedded (not at the start of the file) in text files copied and pasted out of Calc.  The "Character set:" menu should also be available in Preferences so that the user can set a default.

Steps to Reproduce:
1. Use an app other than LibreOffice to open a text file encoded as UTF-8.
2. Select some text and copy it to the clipboard.
3. Select a cell in LibreOffice and try to paste.


Actual Results:
"Character set:" is UTF-16 (always)

Expected Results:
"Character set:" should default to whatever the user has set in preferences.


Reproducible: Always


User Profile Reset: No



Additional Info:
Probably applies to all platforms.  I currently have
Version: 6.4.4.2 Build ID: 3d775be2011f3886db32dfd395a6a6d1ca2630ff
but this was noticed long ago.

Version: 6.4.4.2
Build ID: 3d775be2011f3886db32dfd395a6a6d1ca2630ff
CPU threads: 4; OS: Mac OS X 10.15.6; UI render: default; VCL: osx; 
Locale: en-US (en.UTF-8); UI-Language: en-US

Don't know whether OpenGL is enabled.

NOTE: the web form says "information from menu Help - About LibreOffice", but on macOS "About" is in the LibreOffice menu, at the opposite end of the menu bar.
Calc: threaded

Comment 1 Heiko Tietze 2020-08-05 12:46:23 UTC

What exactly do you mean by Character Set? (Tools) > Options > HTML compatibility?

Comment 2 Eike Rathke 2020-08-05 21:42:14 UTC

Character set in some dialogs is the text encoding, like in Text/CSV or HTML.
But IMHO the last used text encoding is remembered unless UTF-16 is detected (in which case that would be remembered), and there aren't many cases where that happens, embedded null bytes being one. So yet another default is actually not needed. Attaching a small UTF-8 sample file that is detected as UTF-16 instead of the expected UTF-8 would be helpful.
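
To illustrate why embedded null bytes point towards UTF-16, here is a minimal Python sketch of such a heuristic. It is not LibreOffice's detection code; the function name `guess_encoding` and the order of the checks are assumptions made for illustration only, and UTF-32 is ignored.

```python
import codecs


def guess_encoding(data: bytes) -> str:
    """Rough guess at a text encoding: BOM first, then null bytes, else UTF-8.

    Hypothetical helper for illustration; real detectors are more careful.
    """
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    if b"\x00" in data:
        # Every ASCII character carries a zero byte in UTF-16, whereas
        # ordinary UTF-8 text contains no zero bytes at all.
        return "utf-16"
    return "utf-8"
```

Under this sketch, a plain UTF-8 file without BOM and without null bytes would never be reported as UTF-16, which is why a sample file that behaves differently would be interesting.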

Comment 3 伟思礼 2020-08-05 22:53:00 UTC

ALL files created or edited by me are UTF-8 without BOM.

That is the default for my editor.  My locale is en_US.UTF-8

What I do frequently, is create a temporary text file with new words/phrases I want to learn in another language.  I then copy them and paste into a spreadsheet which contains ALL the stuff I am learning.  Finally, I export that spreadsheet to overwrite a (non-temporary) tab-delimited file which I can then import into Anki (https://apps.ankiweb.net).

The file command confirms that both of those files are UTF-8.

But the import dialog that comes up when I paste ALWAYS says UTF-16.  Sometimes I forget to change it.  I do not know whether that is the cause of my Anki problems, but I have noticed that occasionally, there is a BOM in the permanent file, right before a recently pasted data item (not at the beginning of file).  And sometimes there is a zero-width non-printing character in the file.

Whenever either of these spurious characters has appeared, they are always on an item that Anki is having trouble with.
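
One way to pin down where such characters sit in the exported file is to scan the decoded text for them. The following is a minimal Python sketch, not something taken from LibreOffice or Anki; the helper name `find_invisible` and the list of suspect code points are assumptions chosen for illustration.

```python
# Code points that render as nothing but can upset importers.
SUSPECTS = {
    "\ufeff": "BOM / ZERO WIDTH NO-BREAK SPACE",
    "\u200b": "ZERO WIDTH SPACE",
    "\u200c": "ZERO WIDTH NON-JOINER",
    "\u200d": "ZERO WIDTH JOINER",
    "\u2060": "WORD JOINER",
}


def find_invisible(text: str):
    """Return (line, column, name) for each suspect character in text.

    A U+FEFF at the very start of the text is skipped, since a BOM is
    permitted there.
    """
    hits = []
    for lineno, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ch not in SUSPECTS:
                continue
            if ch == "\ufeff" and lineno == 1 and col == 1:
                continue
            hits.append((lineno, col, SUSPECTS[ch]))
    return hits
```

The text could be obtained with `open(path, encoding="utf-8").read()`, where `path` stands for the exported tab-delimited file. Any reported line would be a candidate for the items Anki has trouble with.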

Comment 4 Eike Rathke 2020-08-06 10:31:50 UTC

(In reply to 伟思礼 from comment #3)
> ALL files created or edited by me are UTF-8 without BOM.
That's about normal these days when not on Windows.

> The file command confirms that both of those files are UTF-8.
> 
> But the import dialog that comes up when I paste ALWAYS says UTF-16. 
And that *never* happens for me. Hence my request to attach such a file here.

> Sometimes I forget to change it.  I do not know whether that is the cause of
> my Anki problems, but I have noticed that occasionally, there is a BOM in
> the permanent file, right before a recently pasted data item (not at the
> beginning of file).
That would be wrong. A BOM must not occur in the middle of data; it may only appear at the start of a text stream. What created that?

>  And sometimes there is a zero-width non-printing
> character in the file.
That shouldn't matter if it is properly encoded.

> Whenever either of these spurious characters has appeared, they are always
> on an item that Anki is having trouble with.
So Anki is the problem, and not LibreOffice?

Comment 5 Heiko Tietze 2020-08-06 10:36:45 UTC

No UX issue, apparently. Rather NOB. Let's wait for a test file.

Comment 6 伟思礼 2020-08-06 18:26:53 UTC

Created attachment 164009 [details]
Sample UTF-8 file containing pasteable text

Comment 7 伟思礼 2020-08-06 18:28:51 UTC

Created attachment 164010 [details]
dialog showing default charset

Comment 8 伟思礼 2020-08-06 18:30:48 UTC

Created attachment 164011 [details]
dialog after correcting charset

Comment 9 Heiko Tietze 2020-09-04 18:37:44 UTC

At least there is an issue => NEW.

Comment 10 Eike Rathke 2023-05-10 20:08:00 UTC

I don't see a problem. Yes, the encoding may be offered as UTF-16, but that is what _arrives_ at Calc from the clipboard, and as attachment 164010 [details] of comment 7 shows, the text is _correct_ in UTF-16. Attachment 164011 [details] of comment 8 is not correcting the setting but voluntarily picking UTF-8, and of course if the text is not encoded in UTF-8 then the text is broken after that.

I also don't see the necessity of settable defaults here, or what they would even solve. Even if there were such a setting, forcing the encoding to UTF-8 in this case would import broken text.
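
The effect can be reproduced outside LibreOffice with a few lines of Python. This is only a sketch under stated assumptions: the clipboard payload is modelled as UTF-16 little-endian without a BOM, and the function name `decode_both` and the sample string are made up for illustration.

```python
def decode_both(text: str):
    """Encode text as UTF-16 LE, then decode it as UTF-16 LE and as UTF-8."""
    clipboard_bytes = text.encode("utf-16-le")  # assumed clipboard payload
    as_utf16 = clipboard_bytes.decode("utf-16-le")
    # errors="replace" keeps the call from raising on invalid sequences.
    as_utf8 = clipboard_bytes.decode("utf-8", errors="replace")
    return as_utf16, as_utf8


as_utf16, as_utf8 = decode_both("伟思礼\tword")
print(repr(as_utf16))  # round-trips to the original text
print(repr(as_utf8))   # garbled, with NUL characters between the letters
```

Decoding the same bytes with the encoding they actually have round-trips cleanly; choosing UTF-8 for them yields different text with embedded NUL characters.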

Closing WFM.