Bug 150206 - FILESAVE as RTF in Writer is given wrong fcharset in the standard font
Status: RESOLVED WORKSFORME
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer
Version (earliest affected): 7.3.5.2 release
Hardware: x86-64 (AMD64) All
Importance: medium normal
Assignee: Not Assigned
URL:
Whiteboard:
Keywords: filter:rtf
Depends on:
Blocks: RTF-Character
Reported: 2022-07-31 09:26 UTC by Bernard Moreton
Modified: 2023-10-12 09:33 UTC (History)

See Also:
Crash report or crash signature:


Attachments
A simple test case, with embedded 'Book Antiqua' font (9.21 KB, application/vnd.oasis.opendocument.text)
2022-08-01 11:20 UTC, Bernard Moreton
screenshot showing result of the change of 'o' to o-umlaut (62.94 KB, image/png)
2022-08-01 11:24 UTC, Bernard Moreton
My first attempt didn't seem to have embedded the font - this one does (171.79 KB, application/vnd.oasis.opendocument.text)
2022-08-01 11:47 UTC, Bernard Moreton

Description Bernard Moreton 2022-07-31 09:26:35 UTC
Description:
When a simple file is saved as RTF, the (Roman-style) font in use is written to the fonttbl with \fcharset128 (= Shift JIS).  If a character in the RTF is then changed to an accented form, it displays as a Far Eastern character instead of the (European) Unicode character.
The normal setting should be \fcharset1 (= Default).

Steps to Reproduce:
1. Create a new document in Writer
2. Save as RTF

Actual Results:
\fcharset128 set for the font in use (Book Antiqua)

Expected Results:
\fcharset1 should have been set for the font in use
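
To check which charset an export assigned to each font, the font table can be scanned for \fcharsetN values. The following is a minimal sketch and not part of LibreOffice: the function name `font_charsets` and the sample fonttbl string are illustrative assumptions, and the pattern only handles simple entries without nested groups.

```python
import re

# Matches simple font entries such as {\f3\froman\fcharset128 Book Antiqua;}
# Assumption: entries contain no nested groups (e.g. \falt or \panose).
FONT_ENTRY = re.compile(r"\{\\f(\d+)[^{}]*?\\fcharset(\d+)\s?([^;{}]*);\}")

def font_charsets(rtf_text):
    """Return {font name: charset number} for simple font table entries."""
    return {m.group(3).strip(): int(m.group(2))
            for m in FONT_ENTRY.finditer(rtf_text)}

# Hypothetical sample, modelled on the entry described in this report.
sample = (r"{\rtf1\ansi{\fonttbl"
          r"{\f0\froman\fprq2\fcharset0 Times New Roman;}"
          r"{\f3\froman\fcharset128 Book Antiqua;}}}")
charsets = font_charsets(sample)
print(charsets)
```

Running this against a real export would show at a glance whether any Roman font was written with charset 128.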


Reproducible: Always


User Profile Reset: Yes


OpenGL enabled: Yes

Additional Info:
Version: 7.3.5.2 / LibreOffice Community
Build ID: 30(Build:2)
CPU threads: 4; OS: Linux 5.4; UI render: default; VCL: gtk3
Locale: en-GB (en_GB.UTF-8); UI: en-GB
Ubuntu package version: 1:7.3.5~rc2-0ubuntu0.20.04.1~lo1
Calc: threaded
Comment 1 Mike Kaganski 2022-08-01 05:31:47 UTC
Please share a simple file (in ODT format) which, when exported to RTF, shows the problematic behavior. Also provide specific steps needed to repro the "If a character is then changed in the RTF ..." part, like what to edit using which tool (in Writer? in a plain text editor? which character to change to what), and the comparison of expected vs. actual results (preferably a screenshot with a marked difference). Thanks!
Comment 2 Bernard Moreton 2022-08-01 11:20:08 UTC
Created attachment 181530 [details]
A simple test case, with embedded 'Book Antiqua' font

Open the attachment and save it as Rich Text (RTF).
Using a text editor (I use vi), change the "o" to ö (o-umlaut), and save.
Open the changed RTF with LibreOffice.

NB: this may give different results on your system if Book Antiqua is not installed, but the result on my system is shown in the next attachment (notnormaltext.png).

The RTF save also writes \fcharset128 if I use another Palatino font, TeX-Gyre-Pagella  (the situation is then complicated, however, because Pagella is not recognized as \froman, but is assigned \fnil - but that's another issue ...)
Comment 3 Bernard Moreton 2022-08-01 11:24:18 UTC
Created attachment 181531 [details]
screenshot showing result of the change of 'o' to o-umlaut

Of course, if at the time of editing the RTF, \fcharset is also changed, to \fcharset1 (the proper default), then all is ok again.

But the issue remains, that \fcharset128 results in an unstable RTF file.
Comment 4 Mike Kaganski 2022-08-01 11:32:58 UTC
No repro using Version: 7.4.0.2 (x64) / LibreOffice Community
Build ID: 1512ce97d7ed39dce3121f7e15651fd8895f950e
CPU threads: 12; OS: Windows 10.0 Build 19044; UI render: default; VCL: win
Locale: en-US (ru_RU); UI: en-US
Calc: CL

Could be system- or locale-specific issue, or maybe system encoding-dependent...

Also wanted to mention that there is a large thread in a Russian forum dedicated to RTF corruption like that described here [1]. The fix is described there in answer #17, but it seems to be specific to AOO (the one report related to LO was about version 3.4.4).

[1] https://forumooo.ru/index.php/topic,6952.15.html
Comment 5 Bernard Moreton 2022-08-01 11:47:58 UTC
Created attachment 181532 [details]
My first attempt didn't seem to have embedded the font - this one does
Comment 6 Mike Kaganski 2022-08-01 12:01:41 UTC
(In reply to Bernard Moreton from comment #5)

No repro on my system, either - FTR, I have Book Antiqua installed locally.
Comment 7 Bernard Moreton 2022-08-01 14:12:14 UTC
Re Comment #4:  the reference to the Russian thread seems more related to RTF corruption than to LO/AOO in particular. 

I'm just concerned that LO save as RTF is generating an unstable file by mis-application of \fcharset128.

My own use is rather the other way around:  I write database reports to RTF, have LO display them, and 'Send As' PDF where wanted.  I was only reminded of this bad \fcharset when analysing LO's output in order to get a (for me) unusual section structure right in my own coding, and found this peculiar character change, which it took me some time to understand.

And I'm glad that someone else likes Palatino/Book Antiqua/T-G-Pagella!
Comment 8 QA Administrators 2022-08-02 03:31:47 UTC Comment hidden (obsolete)
Comment 9 Bernard Moreton 2022-08-02 14:44:56 UTC
Re comments 1,2,4:
editing the RTF to change an 'o' to an 'ö' (o-umlaut) should be done by copy-and-paste of the o-umlaut over the 'o', or by direct keying (on my Ubuntu, Magic+o," - my Magic key is AltGr), not by entering \u246\'3f.  Just thought I'd better make sure!
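
For comparison, the escaped form mentioned above (\u246\'3f) can be generated rather than typed by hand. This is only a sketch under stated assumptions: `rtf_escape` is a hypothetical helper, it assumes the RTF default of one fallback character per \uN (i.e. \uc1), it uses '?' (\'3f) as that fallback, and it does not handle control characters such as newlines.

```python
def rtf_escape(text):
    """Escape text for an RTF body so the output stays 7-bit ASCII."""
    out = []
    for ch in text:
        if ch in "\\{}":
            # Backslash and braces are RTF syntax and must be escaped.
            out.append("\\" + ch)
        elif ord(ch) < 128:
            out.append(ch)
        else:
            # \uN takes a signed 16-bit decimal, one per UTF-16 code unit,
            # followed by a fallback character for non-Unicode readers.
            units = ch.encode("utf-16-le")
            for i in range(0, len(units), 2):
                n = int.from_bytes(units[i:i + 2], "little", signed=True)
                out.append("\\u%d\\'3f" % n)
    return "".join(out)

print(rtf_escape("Hällo"))
```

A file edited this way contains no octets above 127, so its interpretation does not depend on the \fcharset of the font or on the system encoding.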

For the font Book Antiqua, to accord with the RTF Specification, the correct font entry should include \fcharset1, NOT \fcharset128.  The PRQ entry should probably be \fprq2, like the other Roman entries - but at least it hasn't been set to \fprq1, which would be positively wrong.

It looks (from LO Git) as though LO uses internal XML font tables, probably inherited unchanged from OOo, rather than interrogating the font on the system.  I have looked for such a table on my system, but haven't found one, so I assume that it's compiled-in?  If I'm right, then to make LO consistent with the RTF Specification, the entry for Book Antiqua should be updated.  

The current font entry in RTF export does not just produce an unstable file, but does so because it does not conform to the specification.

(The use of an internal table would also explain why TeX-Gyre-Pagella is given such a wrong entry in the fonttbl - I can't find any mention of Pagella on LO Git.)

If that (hypothesised) LO internal font table is on the compiled system and I simply haven't found it, then please point me to the right location, and I'll happily try editing it on my system.
Comment 10 Dieter 2023-10-04 07:29:49 UTC
Bernard, a new version is available. Could you please retest with LO 7.6? Is the bug still present?
=> NEEDINFO
Comment 11 Bernard Moreton 2023-10-04 11:55:59 UTC
There has been a change - the testcase file (normaltext.odt) now saves with \fcharset0 instead of \fcharset128 as before.

After changing the 'o' in the saved rtf to an umlauted form, the file then displays in Writer with 2 high-ascii(?) characters, not as 'ö'.

Changing \fcharset0 to \fcharset1 resolves the problem - the umlaut then shows correctly.

So the change has not resolved the problem.  Western UTF should save as \fcharset1.
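
One possible explanation for the two characters, assuming the editor saved the 'ö' as UTF-8 and the reader treated the \fcharset0 run as Windows-1252 (neither is confirmed in this report), can be reproduced in a few lines. The variable names are illustrative only.

```python
# Octets that a UTF-8 editor writes for a directly typed 'ö'.
raw = "ö".encode("utf-8")            # two octets: 0xC3 0xB6

# A reader that interprets the run as Windows-1252 sees two characters.
as_cp1252 = raw.decode("cp1252")

# Only a reader that happens to assume UTF-8 recovers the original.
as_utf8 = raw.decode("utf-8")

print(raw.hex(), repr(as_cp1252), repr(as_utf8))
```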
Comment 12 QA Administrators 2023-10-05 03:20:05 UTC Comment hidden (obsolete)
Comment 13 Dieter 2023-10-12 07:05:59 UTC
I confirm the problem:

Steps:
1. Open a new document and type "Hallo"
2. Save as rtf
3. Open rtf file in an editor (I've used Windows Editor)
4. change "Hallo" to "Hällo"
5. Save and reopen in Writer

Actual result:
"Hテ、llo"

Expected result:
"Hällo"


I can't follow your solution, but I also don't have deeper knowledge of editing RTF files.
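
The actual result above is consistent with UTF-8 octets being read as Shift JIS, although that is an assumption and not something established in this thread. The sketch below decodes the UTF-8 form of 'ä' with Python's `cp932` codec (used here as a stand-in for charset 128), which yields half-width Japanese characters; the NFKC step only maps them to the full-width forms shown in the comment, and whether Writer or the bug tracker performs any such normalisation is unknown.

```python
import unicodedata

raw = "ä".encode("utf-8")                  # octets 0xC3 0xA4
half_width = raw.decode("cp932")           # two half-width Japanese characters
full_width = unicodedata.normalize("NFKC", half_width)

print("H" + full_width + "llo")
```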
Comment 14 Mike Kaganski 2023-10-12 07:10:52 UTC
(In reply to Dieter from comment #13)
> 3. Open rtf file in an editor (I've used Windows Editor)
> 4. change "Hallo" to "Hällo"

Why do you think it's OK to edit the RTF markup like this? The resulting RTF would be an invalid one, because RTF is ASCII (7-bit), and all Unicode is encoded there in a special way, not by using the Unicode characters directly.
Comment 15 Bernard Moreton 2023-10-12 08:25:36 UTC
Strictly speaking, Mike (#14) is right, and RTF is designed as ANSI-compliant.
But the RTF manual lists \fcharset1 as "default", even though \fcharset0 is listed as "ansi".  
As to "why" - some of us use RTF as an intermediate tool, even though that be non-compliant, and I don't think any RTF-reader actually cares.
It can be convenient, for example, to use something like 'sed' to effect a quick global change - and the "official" \ucN ... \uN is just too cumbersome.

So:  \fcharset0 is unnecessarily restrictive.  The "default" \fcharset1 should be used for all western text, whether UTF or not.  It allows liberty, where the current setting enforces antiquated restriction.
Comment 16 Mike Kaganski 2023-10-12 08:44:30 UTC
(In reply to Bernard Moreton from comment #15)

The charset of such an RTF would be unknown. Do you encode your "ö" using UTF-8, or using Win-1252?

This "extension" is not a proper thing. If the *original* problem is not reproducible anymore (I would love to have reliable steps to repro, because I believe that was a really important problem; unfortunately, I couldn't repro myself), then this is WORKSFORME. The problem mentioned in comment 11 (if that is the same as described in comment 13 - I wasn't so sure; the "2 high-ascii(?) characters" imply so, meaning that the encoding was likely UTF-8, which is e.g. still completely uncommon as system encoding on Windows) would be WONTFIX.
Comment 17 Mike Kaganski 2023-10-12 09:33:04 UTC
Concluding.

Comment 11 means WORKSFORME. (Please do re-open, if the original problem is reproducible - please also provide a reproducible scenario in that case.)

Using \fcharset1 means (in the absence of \cpgN) that the respective text runs use the "system encoding". This means that *any* octet with a value >127 has a *system-specific* meaning. Thus, such characters would be imported differently on different systems; and, given that Bernard Moreton obviously uses Linux, with the usual (but not universal) UTF-8 system encoding, their "ö" would be UTF-8-encoded in the RTF. Such an RTF will open wrong on any Windows system (unless that system uses the *still experimental* UTF-8 system encoding support, i.e., ~0% of Windows systems use that).
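
The system dependence can be illustrated with a single octet. This sketch uses Python codecs as stand-ins for system encodings (cp1252 for a Western Windows system, cp1251 for a Russian one, UTF-8 for a typical Linux system); the choice of these three is an assumption made for illustration.

```python
# The single octet a Western Windows editor would write for 'ö'.
octet = "ö".encode("cp1252")      # b'\xf6'

decoded = {}
for codec in ("cp1252", "cp1251", "utf-8"):
    try:
        decoded[codec] = octet.decode(codec)
    except UnicodeDecodeError:
        decoded[codec] = None     # not a valid sequence in this encoding

print(decoded)
```

The same octet reads as 'ö' under one encoding, as a Cyrillic letter under another, and is not valid UTF-8 at all, which is why a charset meaning "system encoding" does not make raw non-ASCII octets portable.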

The idea that '"default" \fcharset1 should be used for all western text' is not only wrong (meaning that all non-ASCII "western" characters would break on most systems randomly), it is also a Western-centric way of thinking.