Bug 138100 - UTF-8 text file on Windows seems to have a problem with umlauts (in DE äöüÄÖÜ) when loaded in Writer (at least until version 7.0)
Status: RESOLVED NOTABUG
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer
Version: 7.0.3.1 release (earliest affected)
Hardware: All All
Importance: medium normal
Assignee: Not Assigned
 
Reported: 2020-11-09 18:07 UTC by hastrondl
Modified: 2020-11-10 06:48 UTC (History)

Attachments
screenshot details 138100 (27.83 KB, image/jpeg)
2020-11-09 19:46 UTC, hastrondl
Description hastrondl 2020-11-09 18:07:40 UTC
Description:
When loading a UTF-8 text file with umlauts (äöüÄÖÜ) into Writer, the umlauts
are not displayed correctly.
Since version 7.1 alpha it works correctly.

Steps to Reproduce:
1. Use a text file containing ö, ä, ü, edited and saved in UTF-8 format (e.g. with npp).
2. Open it in 7.0.3.1 and have a look (bad!).
3. Then compare the same text sample with how it looks in 7.1 alpha (fine!).

Actual Results:
123456789ABCDEF
AE=Ä OE=Ö UE=Ü 
ae=ä oe=ö ue=ü

Expected Results:
123456789ABCDEF
AE=Ä OE=Ö UE=Ü 
ae=ä oe=ö ue=ü 


Reproducible: Always


User Profile Reset: No



Additional Info:
Just take the UTF-8 content and represent it correctly.
Comment 1 Mike Kaganski 2020-11-09 18:52:10 UTC
The problem you see is caused by your text file having no BOM (byte-order mark), which is used to mark a file as UTF-8. In the absence of that byte-order mark, previous versions of LibreOffice didn't detect the encoding and used the current system encoding, which on Windows is ~guaranteed to be non-UTF-8. So your text was imported using the wrong encoding.

That was *not* a bug, but a missing feature of recognition of such files. The correct way to open such files was using special "Text - Choose Encoding" filter in File Open dialog.

In v.7.1, tdf#60145 was implemented, as you see. So no, when you see something fixed in the next version, it doesn't mean that not having it in the previous version is a bug and should be fixed. Not having it in 7.0 is NOTABUG.
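
For illustration, the effect described in this comment can be reproduced outside LibreOffice. The following is a minimal Python sketch, not LibreOffice code. It assumes the Windows system encoding is cp1252 (common on German-locale Windows), and the function names are hypothetical.

```python
import codecs

SAMPLE = "AE=Ä OE=Ö UE=Ü ae=ä oe=ö ue=ü"

def has_utf8_bom(data: bytes) -> bool:
    """Return True if the bytes start with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)

def decode_like_legacy_import(data: bytes, system_encoding: str = "cp1252") -> str:
    """Hypothetical model of the pre-7.1 behaviour described above:
    honour a BOM if present, otherwise assume the system encoding."""
    if has_utf8_bom(data):
        return data[len(codecs.BOM_UTF8):].decode("utf-8")
    return data.decode(system_encoding)

without_bom = SAMPLE.encode("utf-8")       # what the editor saved (no BOM)
with_bom = codecs.BOM_UTF8 + without_bom   # same text, marked as UTF-8

print(decode_like_legacy_import(with_bom))     # umlauts intact
print(decode_like_legacy_import(without_bom))  # each umlaut becomes two wrong characters
```

Each umlaut occupies two bytes in UTF-8, so reading those bytes as a single-byte encoding yields two unrelated characters per umlaut.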
Comment 2 hastrondl 2020-11-09 19:27:13 UTC
> That was *not* a bug, but a missing feature of recognition of such files. The
> correct way to open such files was using special "Text - Choose Encoding"
> filter in File Open dialog.

UTF-8 without a BOM is a standard used worldwide, and it still is.

Therefore we can discuss whether missing standard behaviour is a bug or not.
It is not a good idea to prepare such answers just to avoid conceding a minor, major, or normal problem.

As for your hint about choosing the encoding:
there are only "Unicode UTF-xx" options, no standard UTF-xx.

Therefore my Bugzilla report was correct.
Comment 3 hastrondl 2020-11-09 19:33:05 UTC Comment hidden (obsolete)
Comment 4 Mike Kaganski 2020-11-09 19:38:42 UTC
Please don't play with the bug status. This is not a bug in the software. It is working as intended in 7.0. It is enhanced in 7.1. It will not be changed in 7.0 retroactively, since all new features, such as new detection code, are only introduced in master, not in release branches. This bug is closed. Period.
Comment 5 hastrondl 2020-11-09 19:46:06 UTC
Created attachment 167160
screenshot details 138100

Shows the selection box after selecting "Text - Choose Encoding",
e.g. in German (DE): "Text - Textcodierung wählen".

Unicode - (UTF-7) / Unicode - (UTF-8) / Unicode - (UTF-16)

What Option should be selected for the Standard UTF-8 (Not-UNICODE)
Comment 6 Mike Kaganski 2020-11-10 05:57:45 UTC
(In reply to hastrondl from comment #5)
> What Option should be selected for the Standard UTF-8 (Not-UNICODE)

There is *never* a text encoded in one of the UTF encodings that is not Unicode. The UTF (*Unicode* Transformation Format) encoding family was created to encode the UCS (Universal Coded Character Set) standardized in ISO 10646, and that ISO standard is deliberately synchronized with (identical to) The Unicode Standard (created and maintained by the Unicode Consortium). Any UTF-encoded file is "some sequence of UCS codepoints, each codepoint encoded using this specific UTF variant". So after decoding, you get a sequence of UCS/Unicode codepoints, never something else.

Please check RFC 3629 (UTF-8), and also RFC 2781 (UTF-16), RFC 2152 (UTF-7); ISO 10646; The Unicode Standard (current version [1] of which explicitly says "This version of the Unicode Standard is also synchronized with ISO/IEC 10646:2020, sixth edition", just like previous versions stated synchronization with then-respective ISO standard versions).

So the idea of a "Standard UTF-8 (Not-UNICODE)" is absurd.

[1] http://www.unicode.org/versions/Unicode13.0.0/
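
To make the point above concrete, here is a small Python sketch (an illustration only, using Python's built-in codecs; the helper names are hypothetical). It encodes the same umlauts with three UTF variants and checks that decoding always gives back the same Unicode codepoints.

```python
def codepoints(text: str) -> list:
    """Return the Unicode codepoints of a string as integers."""
    return [ord(ch) for ch in text]

def roundtrip(text: str, encoding: str) -> str:
    """Encode text with the given UTF variant, then decode it again."""
    return text.encode(encoding).decode(encoding)

text = "äöüÄÖÜ"
expected = codepoints(text)

for encoding in ("utf-7", "utf-8", "utf-16-le"):
    encoded = text.encode(encoding)
    # The byte sequences differ per variant, the decoded codepoints do not.
    assert codepoints(roundtrip(text, encoding)) == expected
    print(f"{encoding:10} {encoded.hex(' ')}")
```

The bytes on disk differ between the variants, but the decoded result is the same sequence of Unicode codepoints in every case.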
Comment 7 Ming Hua 2020-11-10 06:35:39 UTC
(In reply to hastrondl from comment #5)
> What Option should be selected for the Standard UTF-8 (Not-UNICODE)
In case Mike's reply wasn't clear enough -- to correctly display the utf8 format file created by npp like you described in comment #0, just choose "Unicode - (UTF-8)" option in LO 6.4.
Comment 8 Mike Kaganski 2020-11-10 06:48:22 UTC
(In reply to Ming Hua from comment #7)
> just choose "Unicode - (UTF-8)" option in LO 6.4.

Thanks - you are quite right; not only in 6.4, but in any version (including 7.1); that latter upcoming version 7.1 *also* can autodetect it, but the manual option with that "Text - Choose Encoding" filter is also there.
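
As a rough idea of what autodetection of BOM-less UTF-8 can look like, here is a hedged Python sketch. It is not the tdf#60145 implementation, whose details are not described in this thread. The function name and the cp1252 fallback are assumptions for illustration.

```python
import codecs

def guess_and_decode(data: bytes, fallback: str = "cp1252"):
    """Return (label, text) using a simple heuristic:
    1. a UTF-8 BOM means UTF-8;
    2. otherwise, bytes that are valid under strict UTF-8 decoding are treated as UTF-8;
    3. otherwise, fall back to the assumed system encoding."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8 (BOM)", data[len(codecs.BOM_UTF8):].decode("utf-8")
    try:
        return "utf-8 (no BOM)", data.decode("utf-8")  # strict by default
    except UnicodeDecodeError:
        return fallback, data.decode(fallback, errors="replace")

print(guess_and_decode("ae=ä".encode("utf-8")))
print(guess_and_decode("ae=ä".encode("cp1252")))
```

The heuristic works reasonably well because text with non-ASCII characters in a legacy single-byte encoding is rarely valid UTF-8 by accident. The manual "Text - Choose Encoding" filter remains the way to override a wrong guess.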