Bug 49645 - FILEOPEN particular MSWORD2008 .docx: misinterprets letters from Symbol font
Summary: FILEOPEN particular MSWORD2008 .docx: misinterprets letters from Symbol font
Status: RESOLVED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer
Version (earliest affected): 3.5.3 release
Hardware: Other All
Importance: medium major
Assignee: Not Assigned
URL:
Whiteboard:
Keywords: regression
Depends on:
Blocks:
 
Reported: 2012-05-08 08:39 UTC by Roman Eisele
Modified: 2013-04-03 10:10 UTC

See Also:
Crash report or crash signature:


Attachments
How the sample document looks in LibO 3.4.6 on Win XP: correct (89.78 KB, image/png)
2012-05-08 08:41 UTC, Roman Eisele
How the sample document looks in LibO 3.5.3 on Win XP: wrong (107.33 KB, image/png)
2012-05-08 08:41 UTC, Roman Eisele
How the sample document looks in LibO 3.4.4 on MacOS X: wrong (184.44 KB, image/png)
2012-05-08 08:42 UTC, Roman Eisele
How the sample document looks in LibO 3.4.6 on MacOS X: wrong (187.61 KB, image/png)
2012-05-08 08:42 UTC, Roman Eisele
How the sample document looks in LibO 3.5.3 on MacOS X: wrong (168.64 KB, image/png)
2012-05-08 08:43 UTC, Roman Eisele

Description Roman Eisele 2012-05-08 08:39:51 UTC
This bug is similar to bug 39670 - "FILEOPEN Writer mis-displays greek letters in symbol font importing docx". The sample document for this bug (https://bugs.freedesktop.org/attachment.cgi?id=49722) contains two Greek letters (lowercase alpha and beta) taken from the standard 'Symbol' font. If I try to open this document with different LibreOffice versions, I get the following results:

* LibreOffice 3.4.6 on WinXP:
correct, Greek alpha and beta letters visible, both taken from the 'Symbol' font.

* LibreOffice 3.5.3 on WinXP:
wrong, two ornaments visible in place of the Greek letters. The font association is also wrong: if you select either of the two letters, the font menu still shows the main text font (Cambria) instead of switching to 'Symbol'. It makes no difference whether the 'Cambria' + 'Cambria Math' fonts are installed locally on the machine or not.

* LibreOffice 3.4.4, 3.4.6, 3.5.3.2 on MacOS X all show the same behaviour:
wrong, two ornaments visible in place of the Greek letters. The font association is wrong just like with LibreOffice 3.5.3 on WinXP ('Cambria' instead of 'Symbol'). It makes no difference if the 'Cambria' + 'Cambria Math' fonts are installed locally on the machine or not.

If I compare these results with the bug report for bug 39670, I get the following impression (please correct me if I am wrong):

* On Windows, bug 39670 was actually fixed, at least since LibreOffice 3.4.2. Therefore this (the present) bug is a regression on Windows.

* On MacOS X, I doubt that bug 39670 was fixed at all (nobody ever confirmed that bug 39670 has been really fixed in LibreOffice 3.4.x for MacOS).

* While there was some difference in handling these Greek letters taken from the 'Symbol' font between LibreOffice 3.4.x for Windows and 3.4.x for MacOS, now both platforms show the same behaviour. This is an advantage, of course.

Convinced by the results on MacOS X, I first tried to re-open bug 39670. But Christopher M. Penalver advised me to open a new bug report instead. While his assumptions are not completely correct (this bug is a regression only on the Windows side, not on the MacOS X side), it is right to open a new bug report, because in LibreOffice 3.5.3 this is now a real cross-platform bug with identical results on Windows (at least Windows XP) and MacOS X. Could someone else please test on Linux and/or Windows 7? Thank you!

NB: There is a good chance that this bug affects not only the Greek letters taken from the 'Symbol' font but also other glyphs from this font. This is why I propose "Writer misinterprets letters from Symbol font in .docx file" for the Summary field of this report, not "misinterprets Greek letters" etc.

I will attach some screenshots showing the different results.
Comment 1 Roman Eisele 2012-05-08 08:41:10 UTC
Created attachment 61236
How the sample document looks in LibO 3.4.6 on Win XP: correct
Comment 2 Roman Eisele 2012-05-08 08:41:42 UTC
Created attachment 61237
How the sample document looks in LibO 3.5.3 on Win XP: wrong
Comment 3 Roman Eisele 2012-05-08 08:42:21 UTC
Created attachment 61238
How the sample document looks in LibO 3.4.4 on MacOS X: wrong
Comment 4 Roman Eisele 2012-05-08 08:42:44 UTC
Created attachment 61239
How the sample document looks in LibO 3.4.6 on MacOS X: wrong
Comment 5 Roman Eisele 2012-05-08 08:43:09 UTC
Created attachment 61240
How the sample document looks in LibO 3.5.3 on MacOS X: wrong
Comment 6 Roman Eisele 2012-05-08 08:44:28 UTC
As explained in the description and by Christopher M. Penalver in his comment #6 to bug 39670, this is a regression on Windows (and probably on Linux, too).
Comment 7 Roman Eisele 2012-05-08 08:59:09 UTC
I repeat here, but with some corrections, what I noted first in comment #5 to bug 39670:


If I don't miss anything, the problematic DOCX section (from
TestOffice2008.docx/word/document.xml) is:

<w:r><w:sym w:font="Symbol" w:char="F061"/></w:r>
<w:r><w:t xml:space="preserve">-alpha, </w:t></w:r>
<w:r><w:sym w:font="Symbol" w:char="F062"/></w:r>
<w:r><w:t>-beta.

I am no DOCX expert, but if I understand Microsoft's horrible file format correctly, I see two interesting points:

1) The w:font attribute is "Symbol", correct.

2) The w:char attributes have the values F061 and F062. If these are Unicode code points, it means that the two symbols alpha and beta are neither Greek Unicode letters (which would be U+03B1 and U+03B2) nor taken from some math symbols range, but glyphs from the Private Use Area.

This (2) is a bit strange. If Microsoft wants Greek letters from the 'Symbol' font, it should just use the correct Unicode indices for the Greek letters, which are U+03B1 and U+03B2. And at least the MacOS X version of the 'Symbol' font has alpha and beta at exactly these two correct Unicode values. Another possibility would be to take both glyphs from the main text font ('Cambria' + 'Cambria Math'). My copy of the Cambria Italic font contains both alpha and beta at the correct Unicode indices (I can't test Cambria Regular because it's a .ttc file, which FontLab does not open).

MS should not rely on PUA glyphs for important things like formula symbols. And there is just no U+F061 or U+F062 glyph in the Symbol font ... (why should there be one?!).

Therefore, I'm not surprised about the two ornaments visible in the MacOS X screenshots. They are just the glyphs associated with U+F061 and U+F062 in some font installed on my machine (Apple Chancery in my case). This is correct behaviour: if the font used for the text does not contain any glyph associated with this Unicode code point, another font is taken which contains glyphs for these code points. The same explanation may be true for the ornaments visible in LibreOffice 3.5.3.2 on Windows.
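
To make the point about the Private Use Area concrete, here is a minimal sketch that parses a w:char value and checks whether it falls into the BMP Private Use Area (U+E000 to U+F8FF). The names parseSymChar, isBmpPrivateUse and demoSampleValues are hypothetical and chosen for illustration only; this is not LibreOffice code.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>

// Hypothetical helper: parse a w:char attribute value such as "F061"
// (hexadecimal, no prefix). Returns std::nullopt when the text is empty,
// longer than six digits, or contains a non-hex character.
std::optional<std::uint32_t> parseSymChar(const std::string& hex)
{
    if (hex.empty() || hex.size() > 6)
        return std::nullopt;
    std::uint32_t value = 0;
    for (char c : hex)
    {
        std::uint32_t digit;
        if (c >= '0' && c <= '9')      digit = c - '0';
        else if (c >= 'a' && c <= 'f') digit = c - 'a' + 10;
        else if (c >= 'A' && c <= 'F') digit = c - 'A' + 10;
        else return std::nullopt;
        value = value * 16 + digit;
    }
    return value;
}

// True for code points in the BMP Private Use Area (U+E000..U+F8FF).
bool isBmpPrivateUse(std::uint32_t cp)
{
    return cp >= 0xE000 && cp <= 0xF8FF;
}

// Usage example with the two values from the sample document.
void demoSampleValues()
{
    assert(isBmpPrivateUse(*parseSymChar("F061"))); // w:char of the alpha
    assert(isBmpPrivateUse(*parseSymChar("F062"))); // w:char of the beta
    assert(!isBmpPrivateUse(0x03B1));               // real Greek alpha
}
```

Both values from the sample document land in the Private Use Area, while the regular Greek alpha does not, which is why a font without PUA glyphs cannot render them directly.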

But what is really important: even if we blame MS for doing strange things, there is still a problem in LibreOffice. If the sample file looks right in MS Office and in LibreOffice 3.4.x on Windows, there seems to be some mapping from the strange w:char="F061" to the right alpha glyph, and the same for beta and other letters. Therefore, we just need this mapping again in LibreOffice 3.5.x.

If I am completely wrong, please correct me. But we should fix this, don't you think?
Comment 8 Rainer Bielefeld Retired 2012-05-08 09:29:34 UTC
Opening the FILEOPEN sample from Bug 39670, I see:
 + an alpha and a beta with "LibreOffice 3.4.5 German UI
   [Build ID: OOO340m1 (Build:502)]", parallel Server installation on German WIN7
   Home Premium (64bit)

 - 2 strange hooks instead of alpha / beta with "LibreOffice 3.5.3.2 (RC2)
   German UI/Locale [Build-ID: 235ab8a-3802056-4a8fed3-2d66ea8-e241b80]" on
   German WIN7 Home Premium (64bit)

Whatever that might mean. I have no time for a long examination concerning DUP; my bad result differs from all the screenshots Roman contributed here and in Bug 39670.

If there is a doubt I believe it's better to create a new Bug than to reopen the old one.

At least I can confirm the effect "misinterprets letters".
Comment 9 Roman Eisele 2012-05-08 10:44:34 UTC
@Rainer Bielefeld:
thank you for testing! So I know that I am not the only one seeing this issue ;-)

It's no surprise that you see other wrong glyphs ("two strange hooks") than visible on my screenshot. I think that LibreOffice just takes the two glyphs for U+F061 and U+F062 from some font which contains glyphs for these Unicode codepoints (maybe from the 1st font in the list of installed fonts which contains such glyphs), and this font will vary from installation to installation, depending on the installed fonts.

Update:
Regarding the Unicode code points U+F061 and U+F062 and the corresponding XML fragments from the .docx file (w:char="F061" and w:char="F062"), I have found that Microsoft's TrueType version of the 'Symbol' font, at least version 1.60 (2005) delivered with Windows XP, actually contains alpha and beta at U+F061 and U+F062. Of course, *all* glyphs/letters in this font have Private Use Area indices (U+F020 to U+F0FE), even the space/blank letter has the Unicode value U+F020 instead of U+0020.

I don't know why Microsoft did not use the correct Unicode values instead (for most symbols in the 'Symbol' font there are corresponding Unicode code points), but this is not our problem. What matters here is just:

* MS Office and LibreOffice 3.4 on Windows take the alpha and beta from the Windows 'Symbol' font, using the glyphs from the Private Use Area. This worked fine, and that is no surprise.

* LibreOffice 3.4 and 3.5 on MacOS cannot display the alpha and the beta correctly because Apple's 'Symbol' font (at least version 6.1d7e3, dated 2009-05-12) does not contain those Private Use Area glyphs; it uses the correct Unicode code points for most symbols instead. This is no surprise, either.

* LibreOffice 3.5 on Windows no longer takes the alpha and the beta from the 'Symbol' font, but instead from the first installed font it can find which contains glyphs for the Private Use Area codepoints U+F061 and U+F062. This may be just a consequence of the fact that it no longer switches to the 'Symbol' font for the two symbols, but uses the main text font ('Cambria') instead, which does not contain alpha and beta at these codepoints.

So, as regards LibreOffice 3.5 for Windows, it may just be necessary to switch to the Symbol font again, like in LibO 3.4.x. For MacOS, the solution is a bit more complicated. When we encounter a <w:sym> tag like

<w:sym w:font="Symbol" w:char="F061"/>

and when w:font is 'Symbol' and w:char is >= F020, this index must be mapped to the correct Unicode value using a replacement table like

F020 -> U+0020 # Space character
...
F061 -> U+03B1 # alpha
F062 -> U+03B2 # beta
...
F0C2 -> U+211C # Real part (of a complex number)
...

etc. I don't know about Linux, of course ...
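
The replacement-table idea can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name remapSymbolChar is hypothetical, the upper bound 0xF0FF is an assumption (the text above only says ">= F020"), and only the four rows quoted above are filled in, so all other codes pass through unchanged.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical helper, not LibreOffice code: remap a <w:sym> character code
// when w:font is 'Symbol' and the code lies in the 0xF020..0xF0FF range.
// Only the four table rows quoted in the comment are present; the rows
// elided there ("...") are deliberately left out here as well.
std::uint32_t remapSymbolChar(const std::string& font, std::uint32_t code)
{
    static const std::map<std::uint32_t, std::uint32_t> table = {
        { 0xF020, 0x0020 }, // space character
        { 0xF061, 0x03B1 }, // alpha
        { 0xF062, 0x03B2 }, // beta
        { 0xF0C2, 0x211C }, // real part (of a complex number)
    };
    if (font != "Symbol" || code < 0xF020 || code > 0xF0FF)
        return code; // not a candidate for remapping
    const auto it = table.find(code);
    return it != table.end() ? it->second : code; // unknown entries pass through
}

// Usage example: the two characters from the sample document.
void demoRemap()
{
    assert(remapSymbolChar("Symbol", 0xF061) == 0x03B1);
    assert(remapSymbolChar("Symbol", 0xF062) == 0x03B2);
    assert(remapSymbolChar("Cambria", 0xF061) == 0xF061); // other fonts untouched
}
```

A sparse map keeps the sketch short; a complete table covering the whole F020 to F0FE range would more naturally be a dense array.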

Sorry for all these words, but I hope that they bring some light into this issue.
Comment 10 Fridrich Strba 2012-05-09 07:26:17 UTC
Here is a mapping for the old mac non-unicode symbol font that I have in libcdr and libwpd:
  static const unsigned short symbolmap [] =
  {
    0x0020, 0x0021, 0x2200, 0x0023, 0x2203, 0x0025, 0x0026, 0x220D, // 0x20 ..
    0x0028, 0x0029, 0x2217, 0x002B, 0x002C, 0x2212, 0x002E, 0x002F,
    0x0030, 0x0031, 0x0032, 0x0033, 0x0034, 0x0035, 0x0036, 0x0037,
    0x0038, 0x0039, 0x003A, 0x003B, 0x003C, 0x003D, 0x003E, 0x003F,
    0x2245, 0x0391, 0x0392, 0x03A7, 0x0394, 0x0395, 0x03A6, 0x0393,
    0x0397, 0x0399, 0x03D1, 0x039A, 0x039B, 0x039C, 0x039D, 0x039F,
    0x03A0, 0x0398, 0x03A1, 0x03A3, 0x03A4, 0x03A5, 0x03C2, 0x03A9,
    0x039E, 0x03A8, 0x0396, 0x005B, 0x2234, 0x005D, 0x22A5, 0x005F,
    0xF8E5, 0x03B1, 0x03B2, 0x03C7, 0x03B4, 0x03B5, 0x03C6, 0x03B3,
    0x03B7, 0x03B9, 0x03D5, 0x03BA, 0x03BB, 0x03BC, 0x03BD, 0x03BF,
    0x03C0, 0x03B8, 0x03C1, 0x03C3, 0x03C4, 0x03C5, 0x03D6, 0x03C9,
    0x03BE, 0x03C8, 0x03B6, 0x007B, 0x007C, 0x007D, 0x223C, 0x0020, // .. 0x7F
    0x0080, 0x0081, 0x0082, 0x0083, 0x0084, 0x0085, 0x0086, 0x0087,
    0x0088, 0x0089, 0x008a, 0x008b, 0x008c, 0x008d, 0x008e, 0x008f,
    0x0090, 0x0091, 0x0092, 0x0093, 0x0094, 0x0095, 0x0096, 0x0097,
    0x0098, 0x0099, 0x009a, 0x009b, 0x009c, 0x009d, 0x009E, 0x009f,
    0x20AC, 0x03D2, 0x2032, 0x2264, 0x2044, 0x221E, 0x0192, 0x2663, // 0xA0 ..
    0x2666, 0x2665, 0x2660, 0x2194, 0x2190, 0x2191, 0x2192, 0x2193,
    0x00B0, 0x00B1, 0x2033, 0x2265, 0x00D7, 0x221D, 0x2202, 0x2022,
    0x00F7, 0x2260, 0x2261, 0x2248, 0x2026, 0x23D0, 0x23AF, 0x21B5,
    0x2135, 0x2111, 0x211C, 0x2118, 0x2297, 0x2295, 0x2205, 0x2229,
    0x222A, 0x2283, 0x2287, 0x2284, 0x2282, 0x2286, 0x2208, 0x2209,
    0x2220, 0x2207, 0x00AE, 0x00A9, 0x2122, 0x220F, 0x221A, 0x22C5,
    0x00AC, 0x2227, 0x2228, 0x21D4, 0x21D0, 0x21D1, 0x21D2, 0x21D3,
    0x25CA, 0x3008, 0x00AE, 0x00A9, 0x2122, 0x2211, 0x239B, 0x239C,
    0x239D, 0x23A1, 0x23A2, 0x23A3, 0x23A7, 0x23A8, 0x23A9, 0x23AA,
    0xF8FF, 0x3009, 0x222B, 0x2320, 0x23AE, 0x2321, 0x239E, 0x239F,
    0x23A0, 0x23A4, 0x23A5, 0x23A6, 0x23AB, 0x23AC, 0x23AD, 0x0020  // .. 0xFE
  };
Comment 11 Fridrich Strba 2012-05-09 07:27:32 UTC
Normally, the symbols between 0x7f and 0xbf (inclusive) are not defined, but I have them there so that we don't get a surprise if they come.

So basically this table could be used for the case of the Symbol font and chars in the 0xf0xx zone.
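
A minimal sketch of how the table from comment 10 might be applied to codes in the 0xF0xx zone. The table is copied unchanged from that comment; the wrapper symbolToUnicode and the demo function are hypothetical, and the choice to return out-of-range codes unchanged is an assumption, not something stated in this report.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Table copied from comment 10; the first entry corresponds to 0x20.
static const unsigned short symbolmap[] =
{
    0x0020, 0x0021, 0x2200, 0x0023, 0x2203, 0x0025, 0x0026, 0x220D, // 0x20 ..
    0x0028, 0x0029, 0x2217, 0x002B, 0x002C, 0x2212, 0x002E, 0x002F,
    0x0030, 0x0031, 0x0032, 0x0033, 0x0034, 0x0035, 0x0036, 0x0037,
    0x0038, 0x0039, 0x003A, 0x003B, 0x003C, 0x003D, 0x003E, 0x003F,
    0x2245, 0x0391, 0x0392, 0x03A7, 0x0394, 0x0395, 0x03A6, 0x0393,
    0x0397, 0x0399, 0x03D1, 0x039A, 0x039B, 0x039C, 0x039D, 0x039F,
    0x03A0, 0x0398, 0x03A1, 0x03A3, 0x03A4, 0x03A5, 0x03C2, 0x03A9,
    0x039E, 0x03A8, 0x0396, 0x005B, 0x2234, 0x005D, 0x22A5, 0x005F,
    0xF8E5, 0x03B1, 0x03B2, 0x03C7, 0x03B4, 0x03B5, 0x03C6, 0x03B3,
    0x03B7, 0x03B9, 0x03D5, 0x03BA, 0x03BB, 0x03BC, 0x03BD, 0x03BF,
    0x03C0, 0x03B8, 0x03C1, 0x03C3, 0x03C4, 0x03C5, 0x03D6, 0x03C9,
    0x03BE, 0x03C8, 0x03B6, 0x007B, 0x007C, 0x007D, 0x223C, 0x0020, // .. 0x7F
    0x0080, 0x0081, 0x0082, 0x0083, 0x0084, 0x0085, 0x0086, 0x0087,
    0x0088, 0x0089, 0x008a, 0x008b, 0x008c, 0x008d, 0x008e, 0x008f,
    0x0090, 0x0091, 0x0092, 0x0093, 0x0094, 0x0095, 0x0096, 0x0097,
    0x0098, 0x0099, 0x009a, 0x009b, 0x009c, 0x009d, 0x009E, 0x009f,
    0x20AC, 0x03D2, 0x2032, 0x2264, 0x2044, 0x221E, 0x0192, 0x2663, // 0xA0 ..
    0x2666, 0x2665, 0x2660, 0x2194, 0x2190, 0x2191, 0x2192, 0x2193,
    0x00B0, 0x00B1, 0x2033, 0x2265, 0x00D7, 0x221D, 0x2202, 0x2022,
    0x00F7, 0x2260, 0x2261, 0x2248, 0x2026, 0x23D0, 0x23AF, 0x21B5,
    0x2135, 0x2111, 0x211C, 0x2118, 0x2297, 0x2295, 0x2205, 0x2229,
    0x222A, 0x2283, 0x2287, 0x2284, 0x2282, 0x2286, 0x2208, 0x2209,
    0x2220, 0x2207, 0x00AE, 0x00A9, 0x2122, 0x220F, 0x221A, 0x22C5,
    0x00AC, 0x2227, 0x2228, 0x21D4, 0x21D0, 0x21D1, 0x21D2, 0x21D3,
    0x25CA, 0x3008, 0x00AE, 0x00A9, 0x2122, 0x2211, 0x239B, 0x239C,
    0x239D, 0x23A1, 0x23A2, 0x23A3, 0x23A7, 0x23A8, 0x23A9, 0x23AA,
    0xF8FF, 0x3009, 0x222B, 0x2320, 0x23AE, 0x2321, 0x239E, 0x239F,
    0x23A0, 0x23A4, 0x23A5, 0x23A6, 0x23AB, 0x23AC, 0x23AD, 0x0020
};

// Hypothetical wrapper: translate a 0xF0xx code to Unicode via the table.
// Codes outside 0xF020..0xF0FF are returned unchanged (an assumption).
std::uint32_t symbolToUnicode(std::uint32_t code)
{
    if (code < 0xF020 || code > 0xF0FF)
        return code;
    const std::size_t index = (code & 0xFF) - 0x20; // table starts at 0x20
    assert(index < sizeof(symbolmap) / sizeof(symbolmap[0]));
    return symbolmap[index];
}

// Usage example: the two characters from the sample document.
void demoSymbolToUnicode()
{
    assert(symbolToUnicode(0xF061) == 0x03B1); // alpha
    assert(symbolToUnicode(0xF062) == 0x03B2); // beta
}
```

The low byte of the 0xF0xx code is used directly as the old Symbol font code, so the lookup is a single subtraction and array access.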
Comment 12 Michael Stahl (CIB) 2012-12-18 19:09:32 UTC
Caolan, do your recent symbol font handling changes improve anything here?
Comment 13 Roman Eisele 2012-12-19 13:40:17 UTC
(In reply to comment #12)
> Caolan, do your recent symbol font handling changes improve anything here?

Let me answer instead of Caolán (my time is cheaper than Caolán’s ;-):
I see good progress here: at least in our simple sample document, attachment 49722, the two Greek letters alpha and beta are now displayed correctly ... when I open the file with the newest master build on Mac OS X.

So I would say the chances are good that we can close this bug, too --
thank you very much, Caolán!

Before we close this bug, it would be nice if someone could confirm that the problem is fixed on Windows and Linux, too; in comment #1 and comment #7 I suggested that the problem (mis-interpreted letters from the Symbol font) has different reasons on Windows vs. Mac OS X ...
Comment 14 Michael Stahl (CIB) 2013-02-19 13:29:24 UTC
Testing this on Linux, it appears to work in current 3.6 but not in 4.0 or master.
Comment 15 Miklos Vajna 2013-04-03 10:10:32 UTC
I just checked with today's master and libreoffice-4-0 on Linux and it works fine for me.

I also checked 4.0 on Windows, that looks OK as well.