Bug 74301 - IMPORT: non-English text is garbled in specific WMF
Summary: IMPORT: non-English text is garbled in specific WMF
Status: RESOLVED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Draw
Version (earliest affected): Inherited From OOo
Hardware: All All
Importance: medium normal
Assignee: Mike Kaganski
URL:
Whiteboard: target:5.0.0
Keywords:
Depends on:
Blocks:
 
Reported: 2014-02-01 04:02 UTC by Mike Kaganski
Modified: 2015-05-10 20:03 UTC
0 users

See Also:
Crash report or crash signature:


Attachments
Problem WMF, screenshot and PDFs (334.71 KB, application/zip)
2014-02-01 04:02 UTC, Mike Kaganski
WMF with CharSet set to DEFAULT_CHARSET (1.46 KB, image/x-wmf)
2015-02-03 22:32 UTC, Mike Kaganski

Description Mike Kaganski 2014-02-01 04:02:10 UTC
Created attachment 93156
Problem WMF, screenshot and PDFs

Importing the attached WMF using LO with Russian locale gives garbled Cyrillic letters.
The file was generated by AutoCAD 2014 Russian under Win7x64 Russian.
If the file is opened in a plain-text editor using the Win-1251 codepage, the text strings are visible in the file.
It seems that the codepage info is not specified in the file.
I think that if the encoding information is missing, then LO should honour the Default document language option, or else what is the point in it? Ignoring this setting when no other information is present is clearly a bug.
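The kind of garbling described here can be reproduced outside LO. The following is a minimal sketch, assuming the text is stored as Win-1251 bytes and the reader falls back to Win-1252; the sample word and the helper name `garble` are illustrative and not taken from the attached file.

```python
def garble(text: str, real: str = "cp1251", assumed: str = "cp1252") -> str:
    """Encode text in its real 8-bit codepage, then decode it with the wrong one."""
    raw = text.encode(real)      # bytes as they would sit in the file
    return raw.decode(assumed)   # what a reader sees if it guesses the wrong codepage


if __name__ == "__main__":
    word = "Текст"  # illustrative Russian sample ("Text")
    print(word.encode("cp1251").hex())  # d2e5eaf1f2
    print(garble(word))                 # Òåêñò
```

The bytes themselves are intact in both cases; only the codepage chosen for decoding differs.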

The attachment also includes PDFs that show the problem: the PDF generated by AutoCAD from the source (not from the WMF - just to show the desired output), and the PDF generated by LO from the imported WMF.
It also includes a screenshot of the WMF opened in Notepad, showing that the text is present in the file.

This problem is already present in OOo 3.3.0. It is still present in LO 4.2.0.4 under Win7x64, and in 4.1.4.2 under Ubuntu 13.10 x64.
Comment 1 Owen Genat (retired) 2014-07-22 00:47:43 UTC
Confirmed under GNU/Linux using:

- v4.3.0.3 Build ID: 08ebe52789a201dd7d38ef653ef7a48925e7f9f7
- v4.4.0.0.alpha0+ Build ID: 4aa9b041de3129f19b48e66d349f48657b73f33e (2014-07-19)

Status set to NEW.
Comment 2 Urmas 2014-07-22 15:39:25 UTC
I cannot get the supplied metafile to display properly; the text uses Latin letters.
Comment 3 Urmas 2015-01-23 03:47:14 UTC
I've checked it once again and there are no traces of either 'Arial Cyr' font or charset 204.
Comment 4 Mike Kaganski 2015-01-23 04:22:02 UTC
(In reply to Urmas from comment #3)
Please note that the issue is not that "LO doesn't use some charset information available in the file", but that "in ABSENCE of such charset information in the file LO doesn't honor its own locale setting".

This WMF file surely DOESN'T contain charset info. I noted it in comment 0:
> It seems that the codepage info is not specified in the file.
It contains 8-bit textual information in a charset that is unknown to LO. This is unfortunate, and the software that produced the file is to blame. But such files exist.

And I expect LO to follow the same logic that it uses when opening plain text files (single-byte, i.e. Win-1251) without language information available: it should use information that is set in "Options - Language settings - Languages".

The same problem exists for some other formats that don't store language information, e.g. Autodesk DXF (pre-2007), see Bug 74299.

By the way, there has been no "Arial Cyr" for quite a long time, IIRC since Win2K? Modern localized (Russian) MS OSes contain only "Arial".
Comment 5 Urmas 2015-01-23 06:43:35 UTC
There are LOGFONT structures in metafiles, so they provide charset info explicitly.
The Arial Cyr font is emulated for non-charset-aware applications in every Windows version.
Comment 6 Mike Kaganski 2015-02-03 22:32:13 UTC
Created attachment 113108
WMF with CharSet set to DEFAULT_CHARSET

(In reply to Urmas from comment #5)
> There are LOGFONT structures in metafiles, so they provide charset info
> explicitly.

I agree.
After exploring the contents of the file and referring to [MS-WMF] v.11/1 at
https://msdn.microsoft.com/en-us/library/cc250370.aspx
I see that the CharSet fields of the Font objects in the original file contain zero (ANSI_CHARSET = 0x00), i.e. "Specifies the English character set". This is clearly the fault of the generating software.
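For readers who want to check the CharSet field themselves, here is a rough sketch of reading a WMF Font object, assuming the 18-byte fixed layout given in [MS-WMF] (five 16-bit fields, then eight single-byte fields, then a null-terminated face name). The function name `parse_font_object` and the sample bytes are made up for illustration; locating the Font object inside a real file (it follows the header of a META_CREATEFONTINDIRECT record) is not shown.

```python
import struct


def parse_font_object(data: bytes) -> dict:
    """Decode the fixed head of a WMF Font object and its face name."""
    (height, width, escapement, orientation, weight,
     italic, underline, strikeout, charset,
     out_precision, clip_precision, quality,
     pitch_and_family) = struct.unpack_from("<5h8B", data, 0)
    # The face name starts at offset 18 and ends at the first NUL byte.
    # Decoding it as latin-1 is an assumption that is only safe for ASCII names.
    facename = data[18:].split(b"\x00", 1)[0].decode("latin-1")
    return {"height": height, "weight": weight,
            "charset": charset, "facename": facename}


if __name__ == "__main__":
    # Hand-built sample: height -12, weight 400, CharSet 0x01 (DEFAULT_CHARSET).
    sample = struct.pack("<5h8B", -12, 0, 0, 0, 400,
                         0, 0, 0, 0x01, 0, 0, 0, 0) + b"Arial\x00"
    print(parse_font_object(sample))
```

With this layout the CharSet byte sits at offset 13 of the Font object, which is the byte that was changed by hand for the second attachment.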

But when I manually set that field to DEFAULT_CHARSET = 0x01, I see that LO still ignores its locale setting, as if it were ANSI_CHARSET.
In the new attachment, there is a WMF containing a single Russian word, "Текст" ("Text"). It has CharSet set to DEFAULT_CHARSET.
According to the spec, it should be treated as "a character set based on the current system locale; for example, when the system locale is United States English, the default character set is ANSI_CHARSET" (page 31 of the above-mentioned doc). But LO imports it as abracadabra when its locale is set to Russian, which is inconsistent with the spec.
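The behaviour requested here can be sketched as follows. This is only an illustration of the idea, not LO's import code: the charset constants are the [MS-WMF] values, while the two lookup tables, the language tags and the function names are hypothetical, and OEM_CHARSET is simply treated the same way as DEFAULT_CHARSET.

```python
ANSI_CHARSET = 0x00
DEFAULT_CHARSET = 0x01
GREEK_CHARSET = 0xA1
RUSSIAN_CHARSET = 0xCC   # decimal 204
EASTEUROPE_CHARSET = 0xEE
OEM_CHARSET = 0xFF

# Charsets that name a codepage explicitly (partial, illustrative table).
EXPLICIT_CODEC = {
    ANSI_CHARSET: "cp1252",
    GREEK_CHARSET: "cp1253",
    RUSSIAN_CHARSET: "cp1251",
    EASTEUROPE_CHARSET: "cp1250",
}

# Hypothetical mapping from the configured language to an 8-bit codepage.
LOCALE_CODEC = {"en": "cp1252", "ru": "cp1251", "el": "cp1253", "pl": "cp1250"}


def codec_for(charset: int, language: str) -> str:
    """Pick a codec; locale-dependent charsets defer to the configured language."""
    if charset in (DEFAULT_CHARSET, OEM_CHARSET):
        return LOCALE_CODEC.get(language, "cp1252")
    return EXPLICIT_CODEC.get(charset, "cp1252")


def decode_text(raw: bytes, charset: int, language: str) -> str:
    """Decode 8-bit metafile text using the codec chosen by codec_for."""
    return raw.decode(codec_for(charset, language), errors="replace")


if __name__ == "__main__":
    raw = bytes.fromhex("d2e5eaf1f2")  # "Текст" in Win-1251
    print(decode_text(raw, DEFAULT_CHARSET, "ru"))  # Текст
    print(decode_text(raw, ANSI_CHARSET, "ru"))     # Òåêñò
```

Under this sketch a file that declares ANSI_CHARSET, like the original attachment, still decodes as Western European text; only the locale-dependent values pick up the configured language.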
Comment 7 Mike Kaganski 2015-05-06 09:32:08 UTC
Submitted patch to gerrit: https://gerrit.libreoffice.org/15641
Comment 8 Commit Notification 2015-05-10 20:03:34 UTC
Mike Kaganski committed a patch related to this issue.
It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=c6bc9b33d5cac1ea40a829754004fde6ae16d8b1

tdf#74301: WMF: use LibreOffice locale on OEM_CHARSET/DEFAULT_CHARSET

It will be available in 5.0.0.

The patch should be included in the daily builds available at
http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
http://wiki.documentfoundation.org/Testing_Daily_Builds
Affected users are encouraged to test the fix and report feedback.