Bug 125596 - DOCX: Writer misidentify text language (and appropriate font) in MS Word file (MSO2019)
Status: NEW
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer
Version (earliest affected): 6.0.7.3 release
Hardware: All All
Importance: medium normal
Assignee: Not Assigned
URL:
Whiteboard:
Keywords: filter:docx
Depends on:
Blocks: DOCX
Reported: 2019-05-30 17:08 UTC by Ratchanan Srirattanamet
Modified: 2022-06-22 16:16 UTC (History)

See Also:
Crash report or crash signature:


Attachments
The DOCX file which has the problem (12.04 KB, application/vnd.openxmlformats-officedocument.wordprocessingml.document)
2019-05-30 17:08 UTC, Ratchanan Srirattanamet
All screenshots from MS Office 2019 and LibreOffice 6.4.2.4 (87.20 KB, application/pdf)
2019-05-30 17:09 UTC, Ratchanan Srirattanamet

Description Ratchanan Srirattanamet 2019-05-30 17:08:45 UTC
Created attachment 151787 [details]
The DOCX file which has the problem

Steps to reproduce:
1. Download the font "TH Sarabun New" from [1]. The font is licensed under GPL 2.0 with the font exception.
2. Open the attached DOCX document. The text is configured to use "TH Sarabun New" as the complex (Thai) font and "Liberation Sans" as the Western font, both at 16 pt.

Expected result: The Thai text (the word "ไทย") and most of the dots (".") are displayed in "TH Sarabun New", while the English text (the word "English"), the pipes ("|"), and the dots between the pipes are displayed in "Liberation Sans". The whole text fits on one line. MS Word 2019 shows this expected behavior. (See the screenshots.)

Actual result: The Thai text is displayed in "TH Sarabun New", while the English text, all dots, and the pipes are displayed in "Liberation Sans". The text does not fit on one line.

The problem is reproducible on:
- LO 6.0.7-0ubuntu0.18.04.6 from Ubuntu 18.04.
- LO 6.2.4.2 on Ubuntu 18.04, Snap and AppImage.
- LO 6.2.4.2 on Windows 10 version 1903 (build 18326.86)

This matters because most Thai fonts use different font metrics than Western fonts. For historical reasons [2], Thai fonts treat the point size as the line height. Because Thai script places marks above and below the base character, Thai glyphs are usually about 30% smaller than Western glyphs at the same point size. [3]

To make matters worse, MS Word determines the language of text from the keyboard layout that was active when it was typed, not from the characters themselves. For example, a dot (".") typed with a Thai keyboard layout is treated as Thai, while a dot typed with an English keyboard layout is treated as English. MS Word seems to record this information in the file, and LO seems unable to read it. So when LO opens the file, it displays that text in the wrong font, with different font metrics, which changes the document's layout.

[1] http://mdresearch.kku.ac.th/files/font/THSarabunNew.zip
[2] http://thep.blogspot.com/2016/02/thai-font-metrics.html (In Thai)
[3] However, some Thai fonts, mostly those from the Thai Linux Working Group (TLWG), now use the new metric, which treats the point size as the character size. This gives those fonts the same size as Western fonts. See [2].
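
To make the size difference concrete, here is a rough sketch of the arithmetic. The function name and the proportional-scaling model are assumptions for illustration only; the 30% figure is the approximate value quoted above, not a measured property of any particular font.

```cpp
#include <cassert>

// Hypothetical helper: apparent glyph size, in Western-equivalent points,
// of text set in a Thai font that uses the legacy "point size = line height"
// metric. `shrink` is the assumed fraction by which glyphs are smaller
// (about 0.30 according to the description above).
double apparentSizePt(double nominalPt, double shrink) {
    assert(shrink >= 0.0 && shrink < 1.0);  // sanity check on the assumed ratio
    return nominalPt * (1.0 - shrink);
}
```

Under this model, a dot set in the Thai font at 16 pt looks roughly like an 11.2 pt Western dot, so a run of dots that falls back to "Liberation Sans" at 16 pt takes up noticeably more horizontal space, which would explain why the line no longer fits.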
Comment 1 Ratchanan Srirattanamet 2019-05-30 17:09:49 UTC
Created attachment 151788 [details]
All screenshots from MS Office 2019 and LibreOffice 6.4.2.4
Comment 2 Usama 2019-06-16 03:26:31 UTC
Hello Ratchanan,

Thank you for reporting the bug. I can confirm that the bug is present in master.

Version: 6.3.0.0.alpha1+
Build ID: 77ae0abe21f672cf4b7d2e069f1d40d20edc49a7
CPU threads: 4; OS: Linux 4.9; UI render: default; VCL: gtk3; 
TinderBox: Linux-rpm_deb-x86_64@86-TDF, Branch:master, Time: 2019-05-31_15:33:33
Locale: en-GB (en_GB.utf8); UI-Language: en-US
Calc: threaded
Comment 3 Xisco Faulí 2019-06-28 15:27:42 UTC
I remember seeing the same document in another report.
Where did you get it from ?
Comment 4 Ratchanan Srirattanamet 2019-07-04 17:39:02 UTC
(In reply to Xisco Faulí from comment #3)
> I remember seeing the same document in another report.
> Where did you get it from ?

I didn't take it from anywhere. I created it by myself.
Comment 5 Aron Budea 2020-11-23 04:56:42 UTC
I'm assuming the keyword bibisectRequest was added by mistake; if not, please re-add it with an explanation.
Comment 6 Justin L 2022-06-22 16:16:11 UTC
repro 7.5+ and also true for DOC format.

writerfilter/source/dmapper/DomainMapper.cxx:
        case NS_ooxml::LN_CT_Fonts_hint :
            /*  assigns script type to ambiguous characters, values can be:
                NS_ooxml::LN_Value_ST_Hint_default
                NS_ooxml::LN_Value_ST_Hint_eastAsia
                NS_ooxml::LN_Value_ST_Hint_cs
             */
            //TODO: unsupported?
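
For illustration, a minimal standalone sketch of how a font hint could steer ambiguous characters to a font slot. This is not LibreOffice code: the enum names, `parseHint`, `isAmbiguous`, and `chooseSlot` are all hypothetical, the ambiguity test is a deliberately crude assumption (ASCII punctuation and space), and Word's real rules are more involved. It also has not been confirmed in this report that the attached file relies on the hint attribute rather than another run property.

```cpp
#include <string>

// Hypothetical types for this sketch; not LibreOffice or OOXML API names.
enum class FontHint { Default, EastAsia, ComplexScript };
enum class FontSlot { Western, EastAsia, ComplexScript };

// Map the hint attribute value ("default", "eastAsia", "cs") to an enum.
// Unknown values fall back to Default.
FontHint parseHint(const std::string& value) {
    if (value == "eastAsia") return FontHint::EastAsia;
    if (value == "cs") return FontHint::ComplexScript;
    return FontHint::Default;
}

// Thai Unicode block: U+0E00..U+0E7F.
bool isThai(char32_t c) { return c >= 0x0E00 && c <= 0x0E7F; }

// Assumption: treat printable ASCII that is neither a letter nor a digit
// (space, ".", "|", ...) as script-ambiguous.
bool isAmbiguous(char32_t c) {
    if (c < 0x20 || c > 0x7E) return false;
    bool alnum = (c >= U'0' && c <= U'9') || (c >= U'A' && c <= U'Z') ||
                 (c >= U'a' && c <= U'z');
    return !alnum;
}

// Pick a font slot: Thai characters always use the complex-script font,
// ambiguous characters follow the run's hint, everything else is Western.
FontSlot chooseSlot(char32_t c, FontHint hint) {
    if (isThai(c)) return FontSlot::ComplexScript;
    if (isAmbiguous(c)) {
        if (hint == FontHint::ComplexScript) return FontSlot::ComplexScript;
        if (hint == FontHint::EastAsia) return FontSlot::EastAsia;
    }
    return FontSlot::Western;
}
```

With this model, `chooseSlot(U'.', parseHint("cs"))` selects the complex-script font (here "TH Sarabun New"), while the same dot in a run with the default hint stays in the Western font, which matches the behavior the reporter expects from MS Word.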