Bug 151830 - Save as text with coding utf-8 destroys all non-ascii characters, replacing with question mark (Norwegian Norsk bokmål UI)
Status: RESOLVED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer
Version: 7.2.0.4 release (earliest affected)
Hardware: All All
Importance: medium normal
Assignee: Not Assigned
URL:
Whiteboard: target:24.8.0 target:24.2.4
Keywords:
Depends on:
Blocks: Languages
Reported: 2022-10-30 19:47 UTC by Enrique Perez-Terron
Modified: 2024-05-02 19:49 UTC

See Also:
Crash report or crash signature:


Description Enrique Perez-Terron 2022-10-30 19:47:09 UTC
My version: 
Version: 7.3.5.2 (x64) / LibreOffice Community
Build ID: 184fe81b8c8c30d8b5082578aee2fed2ea847c01
CPU threads: 8; OS: Windows 10.0 Build 22621; UI render: Skia/Raster; VCL: win
Locale: nb-NO (nb_NO); UI: nb-NO
Calc: threaded

I save a file using File->Save as or File->Save a copy and set the file type to "Text - choose a coding". In the filter selection dialog, I choose the encoding "Unicode (UTF-8)" and the line ending "LF".

Then I inspect the resulting file using od (octal dump), with options to show byte values as ASCII characters and hex codes (od -c -t x1).

The file begins with these two lines:

Ferden til boplassen
Endelig stod jeg der. Langs den lille kanalen foran meg lå tre små sjøfly.

Notice the three non-ascii characters in the last four words. 

Inspecting the outcome, I find as follows:

$ od -c -t x1 Ren-tekst-versjon.txt | head -30
0000000                               1   .   F   e   r   d   e   n
         20  20  20  20  20  20  20  31  2e  46  65  72  64  65  6e  20
0000020   t   i   l       b   o   p   l   a   s   s   e   n  \n  \n   E
         74  69  6c  20  62  6f  70  6c  61  73  73  65  6e  0a  0a  45
0000040   n   d   e   l   i   g       s   t   o   d       j   e   g
         6e  64  65  6c  69  67  20  73  74  6f  64  20  6a  65  67  20
0000060   d   e   r   .       L   a   n   g   s       d   e   n       l
         64  65  72  2e  20  4c  61  6e  67  73  20  64  65  6e  20  6c
0000100   i   l   l   e       k   a   n   a   l   e   n       f   o   r
         69  6c  6c  65  20  6b  61  6e  61  6c  65  6e  20  66  6f  72
0000120   a   n       m   e   g       l   ?       t   r   e       s   m
         61  6e  20  6d  65  67  20  6c  3f  20  74  72  65  20  73  6d
0000140   ?       s   j   ?   f   l   y   .       T   e   r   m   i   n
         3f  20  73  6a  3f  66  6c  79  2e  20  54  65  72  6d  69  6e

(The first line is a heading, here indented by seven spaces, which I did not expect. In the original, it is not indented. The second line is part of a longer paragraph and is saved as a single long line - this is expected and OK.)

The issue in this report is that the characters å and ø are replaced with question marks. It seems like the file has not been converted to utf-8, but rather to ascii.
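
The symptom can be reproduced outside LibreOffice. The following is a minimal Python sketch, not LibreOffice code: it encodes the last four words of the sample line once as UTF-8 and once as ASCII with replacement. It only illustrates what "converted to ascii" would look like at the byte level; which encoder Writer actually ran is not established here.

```python
# Last four words of the sample line; escapes are used so the code points are unambiguous.
text = "l\u00e5 tre sm\u00e5 sj\u00f8fly"  # "lå tre små sjøfly"

# Correct UTF-8: each of å (U+00E5) and ø (U+00F8) becomes a two-byte sequence.
utf8_bytes = text.encode("utf-8")

# ASCII with replacement: every non-ASCII character collapses to "?" (0x3f).
ascii_bytes = text.encode("ascii", errors="replace")

print(utf8_bytes.hex(" "))
print(ascii_bytes.hex(" "))
```

The second line printed has 3f in the same positions as the od output above; the first line shows the c3 a5 and c3 b8 pairs that a correct UTF-8 export would contain.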
Comment 1 Enrique Perez-Terron 2022-11-04 19:01:49 UTC
A simple test case: 
(Since I have a Norwegian user interface, my English translations of the UI labels may be inexact.)

1. I opened a new writer document, and
2. selected the copyright symbol by clicking on the "Omega" button in the 'insert' toolbar.
3. Then I saved a copy of the document (File -> Save a copy)
4. navigating to a temporary folder "C:\cygwin64\tmp",
5. naming the file "Copyright" and
6. choosing the format "Text (choose encoding)". 

The coding dialog came up with "Unicode (UTF-8)" and "LF" selected, so I just clicked on "OK".

I have Cygwin tools installed on my computer.
7. In a Bash command window, I changed to the temporary directory and issued
8. $ od -c Copyright.txt
The output was:
0000000   ?  \n
0000002

Notice the question mark.

I will show the expected outcome in the next comment.
Comment 2 Enrique Perez-Terron 2022-11-04 19:12:00 UTC
(Continued from the previous comment)

In order to demonstrate what the correct outcome would be, I pasted the copyright symbol from the Writer document into the following Bash command line:

$ echo '©' | od -t x1
0000000 c2 a9 0a
0000003

The bytes C2 A9 are the correct UTF-8 encoding of the code point 0xA9, the copyright symbol.
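
For reference, here is a small Python sketch of why 0xA9 becomes C2 A9. The helper name utf8_two_byte is made up for this illustration, and it handles only the two-byte range of UTF-8 (U+0080 to U+07FF).

```python
def utf8_two_byte(code_point: int) -> bytes:
    """Encode a code point in the range U+0080..U+07FF as two UTF-8 bytes."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("only the two-byte UTF-8 range is handled here")
    lead = 0b11000000 | (code_point >> 6)            # 110xxxxx: the top 5 bits
    trail = 0b10000000 | (code_point & 0b00111111)   # 10xxxxxx: the low 6 bits
    return bytes([lead, trail])


encoded = utf8_two_byte(0xA9)
print(encoded.hex(" "))  # c2 a9

# Cross-check against Python's built-in encoder.
assert encoded == "\u00a9".encode("utf-8")
```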
Comment 3 Buovjaga 2023-03-03 14:49:20 UTC
I saved the file in Windows as suggested and tested it on Linux and my result is

$ od -c copyright.txt 
0000000 357 273 277 302 251  \n
0000006

I also got this with version 7.3.

Do you still see this with 7.5?

Set to NEEDINFO.
Change back to UNCONFIRMED, if the problem persists. Change to RESOLVED WORKSFORME, if the problem went away.

Version: 7.6.0.0.alpha0+ (X86_64) / LibreOffice Community
Build ID: 687b950702c49c90cff9a43655ea97a0343799a0
CPU threads: 2; OS: Windows 10.0 Build 22621; UI render: Skia/Raster; VCL: win
Locale: en-US (en_FI); UI: en-US
Calc: threaded
Comment 4 Enrique Perez-Terron 2023-03-06 12:08:10 UTC
I installed

Version: 7.5.1.2 (X86_64) / LibreOffice Community
Build ID: fcbaee479e84c6cd81291587d2ee68cba099e129
CPU threads: 8; OS: Windows 10.0 Build 19045; UI render: Skia/Raster; VCL: win
Locale: nb-NO (nb_NO); UI: nb-NO
Calc: CL threaded

yesterday, and tried again.

I created a new text document with just one character in it, the copyright mark, Unicode 0xa9. I saved it first as a regular 'odt' file (C:\cygwin64\tmp\Copyright.odt), then used the menu File->Save a copy, File type "Text - select coding (txt)". In the coding dialog, UTF8 and LF. The "Byte order mark" checkbox came with a check mark, but was greyed out and could not be deselected.

Then:
$ od -c Copyright.txt
0000000   ?  \n
0000002

So yes, I am still seeing the error in version 7.5.

What Buovjaga is getting is a file with the UTF-8-encoded byte order mark 0xFEFF followed by the UTF-8-encoded copyright symbol 0xA9.
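
The octal values in Buovjaga's dump (comment 3) can be checked with a few lines of Python. This is only a verification sketch: the octal string is copied from that od output, and the trailing newline is appended by hand.

```python
import codecs

# Octal byte values as printed by `od -c` in comment 3.
octal_dump = "357 273 277 302 251"
data = bytes(int(token, 8) for token in octal_dump.split()) + b"\n"

has_bom = data.startswith(codecs.BOM_UTF8)  # BOM_UTF8 is EF BB BF
text = data.decode("utf-8-sig")             # "utf-8-sig" drops a leading BOM

print(data.hex(" "))
print(has_bom, repr(text))
```

The bytes come out as ef bb bf (the byte order mark) followed by c2 a9 (the copyright symbol) and the line feed.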

Another test: I have an Ubuntu Linux with LibreOffice 7.4.4.2. With this version of Writer, the byte-order mark checkbox can be deselected, and the resulting file is:
$ od -c Copyright.txt
0000000 302 251  \n
0000003

This is the correct outcome. The bug is not present in LibreOffice Writer 7.4.4.2 on Ubuntu.
Comment 5 Enrique Perez-Terron 2023-03-06 12:24:55 UTC
I tried this on a different laptop running Windows 11, but the bug is absent.

Version: 7.5.1.2 (X86_64) / LibreOffice Community
Build ID: fcbaee479e84c6cd81291587d2ee68cba099e129
CPU threads: 8; OS: Windows 10.0 Build 22621; UI render: Skia/Raster; VCL: win
Locale: nb-NO (nb_NO); UI: en-GB
Calc: CL threaded

In this case, the user interface is English (en-GB). The Ubuntu case also has English UI (en-US).
Comment 6 Enrique Perez-Terron 2023-03-06 15:53:43 UTC
Yet another test, with another laptop.

Version: 7.3.5.2 (x64) / LibreOffice Community
Build ID: 184fe81b8c8c30d8b5082578aee2fed2ea847c01
CPU threads: 4; OS: Windows 10.0 Build 19045; UI render: Skia/Vulkan; VCL: win
Locale: nb-NO (nb_NO); UI: nb-NO
Calc: threaded

OS: Operativsystemnavn	Microsoft Windows 10 Home
Versjon	10.0.19045 Bygg 19045

(Norwegian "Bygg" = English "Build")

$ od -c Copyright.txt
0000000   ?  \n
0000002

In this case, the bug is present.

What do the laptops that manifest the bug have in common?
A: Operating system, B: Windows locale/language, C: LibreOffice user interface language
1. The one where I first experienced the bug:
A: Versjon Windows 10 Home
Versjon 22H2
Installert den 12.10.2020
Operativsystembygg 19045.2604
Opplevelse Windows Feature Experience Pack 120.2212.4190.0
B: "Norsk Bokmål" (Norwegian)
C: Standard Norsk bokmål

2. The one I am reporting about now:
A: Versjon Windows 10 Home
Versjon 22H2
Installert den 02.11.2020
Operativsystembygg 19045.2604
Opplevelse Windows Feature Experience Pack 120.2212.4190.0
B: "Norsk Bokmål" (Norwegian)
C: Standard Norsk bokmål

3. The laptop that did not manifest the error:
A: Versjon Windows 11 Home
Versjon 22H2
Installert den 04.10.2022
Operativsystembygg 22621.1265
Opplevelse Windows Feature Experience Pack 1000.22638.1000.0
B: Two "preferred" languages: Norsk bokmål; English (USA)
C: English (UK) - but Norwegian also available in the drop-down list
Comment 7 Enrique Perez-Terron 2023-03-06 16:22:20 UTC
I changed the UI language in LibreOffice to English (USA), and now the bug is not there. This was on the first laptop, where I first experienced the bug.

Then I changed the UI language in LibreOffice to Norwegian (Norsk bokmål) on the third laptop - the one which had English (UK) and which initially did not manifest the bug - and now the bug is there.

So it now seems like the bug manifests itself only with the UI language Norwegian. When I find some more time, I may test other UI languages. 

Another hint as to the origin of the bug may be that the Byte Order Mark check box in the filter settings dialog is greyed out when the UI language is not English.
Comment 8 Enrique Perez-Terron 2023-03-06 17:07:06 UTC
I have now tried two more user interface languages: Japanese and Spanish. In both cases, the file was saved correctly as UTF-8.

Version 7.5.1.2.
Comment 9 Stéphane Guillou (stragu) 2024-01-08 14:58:25 UTC
I reproduced with nb-NO UI on Linux:

Version: 7.6.4.1 (X86_64) / LibreOffice Community
Build ID: e19e193f88cd6c0525a17fb7a176ed8e6a3e2aa1
CPU threads: 8; OS: Linux 5.15; UI render: default; VCL: gtk3
Locale: en-AU (en_AU.UTF-8); UI: nb-NO
Calc: threaded

The Byte Order Mark setting was greyed out too.
Already reproduced in 7.2.0.4.

Not reproduced in en-US UI.
Comment 10 Mike Kaganski 2024-03-18 08:05:15 UTC
This is a translation problem: the translation was incorrectly updated in commit a0c08eb77f9fd9e3b53f5c40abb554e83195fa27 (update translations for 6.0 beta1, 2017-11-22).

The problem starts at https://opengrok.libreoffice.org/xref/translations/source/nb/svx/messages.po?r=c662aec6#12176 :

> 12178 msgid "Arabic (ISO-8859-6)"
> 12179 msgstr "Gresk (ISO-8859-7)"

... and continues through all the rest of RID_SVXSTR_TEXTENCODING_TABLE entries. This is the entry that the STR uses:

> 12460 msgid "Chinese simplified (EUC-CN)"
> 12461 msgstr "Unicode (UTF-8)"
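
To illustrate how a shifted msgstr could lead to question marks, here is a hypothetical Python sketch. It assumes the encoding is resolved by matching the label shown in the UI against a table of translated names; that mechanism is an assumption made for illustration and is not confirmed in this report. The names ENCODINGS, NB_TRANSLATION and codec_for_label are invented, and Python's "gb2312" codec stands in for EUC-CN.

```python
# Hypothetical miniature of the encoding table: (English msgid, Python codec name).
ENCODINGS = [
    ("Chinese simplified (EUC-CN)", "gb2312"),
    ("Unicode (UTF-8)", "utf-8"),
]

# Shifted nb translation as quoted above: the EUC-CN entry is labelled "Unicode (UTF-8)".
NB_TRANSLATION = {
    "Chinese simplified (EUC-CN)": "Unicode (UTF-8)",
}


def codec_for_label(label, translation):
    """Return the codec of the first entry whose displayed label matches."""
    for msgid, codec in ENCODINGS:
        # Untranslated entries fall back to the English msgid.
        if translation.get(msgid, msgid) == label:
            return codec
    raise KeyError(label)


picked_en = codec_for_label("Unicode (UTF-8)", {})
picked_nb = codec_for_label("Unicode (UTF-8)", NB_TRANSLATION)
print(picked_en, picked_nb)

# With errors="replace", characters the codec cannot represent become "?".
print("\u00a9".encode(picked_nb, errors="replace"))
```

Under these assumptions, choosing "Unicode (UTF-8)" in the English UI selects UTF-8, while the same visible label in the nb UI resolves to the EUC-CN entry.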
Comment 11 Mike Kaganski 2024-03-18 08:59:29 UTC
Sorry, the problem goes back as far as commit d9a4b60f9ae7e15c44675ea56fe6a06613c419ae (fix of damaged files from beta1, 2012-12-09).
Comment 13 Mike Kaganski 2024-03-18 11:47:47 UTC
https://gerrit.libreoffice.org/c/translations/+/164882 is an attempt to fix it. I don't know the language. The fix is done mainly by moving wrongly placed strings to their proper places; but for some, I just copied the missing strings from the respective nn file.
A review is really needed from someone who reads the language, to make sure this blind fix makes sense.
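
A mechanical sanity check can complement the human review. The Python sketch below is not part of the patch; it flags msgid/msgstr pairs whose trailing parenthesised charset name differs. It assumes the parenthesised part is meant to stay untranslated, which may not hold for every entry, and the snippet is a simplified excerpt built from the two entries quoted in comment 10 plus one made-up correct entry.

```python
import re

# Simplified .po excerpt: two entries quoted in comment 10 and one invented correct entry.
PO_SNIPPET = '''
msgid "Arabic (ISO-8859-6)"
msgstr "Gresk (ISO-8859-7)"

msgid "Chinese simplified (EUC-CN)"
msgstr "Unicode (UTF-8)"

msgid "Unicode (UTF-8)"
msgstr "Unicode (UTF-8)"
'''

ENTRY = re.compile(r'msgid "(.*)"\nmsgstr "(.*)"')
TAG = re.compile(r'\(([^()]*)\)\s*$')


def charset_tag(label):
    """Return the trailing parenthesised part of a label, or None if there is none."""
    match = TAG.search(label)
    return match.group(1) if match else None


def mismatched_entries(po_text):
    """List (msgid, msgstr) pairs whose charset tags disagree."""
    return [
        (msgid, msgstr)
        for msgid, msgstr in ENTRY.findall(po_text)
        if charset_tag(msgid) != charset_tag(msgstr)
    ]


for msgid, msgstr in mismatched_entries(PO_SNIPPET):
    print(f"{msgid!r} -> {msgstr!r}")
```

Such a check only catches shifted strings; it says nothing about wording, so a review by someone who reads the language is still needed.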
Comment 14 Enrique Perez-Terron 2024-05-02 19:49:30 UTC
The patch looks good to me, except:

Chinese Traditional : Tradisjonell kinesisk

not: Tradisjonelt kinesisk

in lines 12371, 12407, 12431, 12443, 12449, 12467