Bug 59355 - Publisher ANSI files have no recognizable fonts
Summary: Publisher ANSI files have no recognizable fonts
Status: NEW
Alias: None
Product: Document Liberation Project
Classification: Unclassified
Component: libmspub
Version (earliest affected): unspecified
Hardware: Other All
Importance: medium normal
Assignee: Not Assigned
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2013-01-14 12:41 UTC by Urmas
Modified: 2017-10-30 11:35 UTC

See Also:
Crash report or crash signature:


Attachments
Example (10.50 KB, application/x-mspublisher)
2013-01-15 01:51 UTC, Urmas

Description Urmas 2013-01-14 12:41:35 UTC
Publisher files in ANSI format are rendered in the Arial font, and consequently with the wrong encoding, when opened.
The font index is (at least) the third byte of the character formatting data.
Of course, usual warnings about ANSI_CHARSET != CP1252 apply for this format too.
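To illustrate the symptom, here is a minimal Python sketch. It assumes the text was stored as Windows-1251 (Cyrillic) bytes; the sample word is made up and is not taken from the attachment. Decoding the same bytes as CP1252 produces the garbled result.

```python
# Hypothetical sample text; not taken from the attached file.
original = "Привет"

# What a pre-Unicode Cyrillic document would store: one byte per character.
raw = original.encode("cp1251")

# Treating those bytes as Western (CP1252) yields mojibake.
garbled = raw.decode("cp1252")

# Decoding with the code page the text was written in restores it.
restored = raw.decode("cp1251")

print(garbled)
print(restored)
```

The bytes are identical in both cases; only the assumed code page differs.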
Comment 1 Brennan Vincent 2013-01-14 20:59:14 UTC
(In reply to comment #0)
> Publisher files in ANSI format are formatted with Arial font and
> consequently wrong encoding upon opening.
> The font index is (at least) a third byte value in character formatting data.
> Of course, usual warnings about ANSI_CHARSET != CP1252 apply for this format
> too.

Thanks for the bug report. Do you have a sample file?
Comment 2 Joel Madero 2013-01-14 22:16:29 UTC
Marked as NEEDINFO until requested sample file is provided.

Urmas - once the document is provided you can set this bug to NEW, as Brennan has assigned it.
Comment 3 Urmas 2013-01-15 01:51:29 UTC
Created attachment 73033
Example
Comment 4 Brennan Vincent 2013-01-15 03:10:25 UTC
Thanks, I am looking into it.

Note that opening this file in non-Russian versions of Publisher appears to suffer from the same problems, because the encoding does not appear to be stored anywhere in the file.

However, we may be able to use some heuristic in order to open it correctly in LibreOffice.
Comment 5 Urmas 2013-01-15 07:22:57 UTC
There was no 'Russian' version of Publisher until Office 2000. Therefore, the data in the file are completely sufficient for a proper display.
Comment 6 Brennan Vincent 2013-01-15 08:04:27 UTC
Then perhaps Windows is the culprit, and not Publisher. Do you know if it opens properly for you on non-Russian-localized copies of Windows?
Comment 7 Brennan Vincent 2013-01-18 16:54:11 UTC
This bug has been *mostly* fixed in the libmspub git repository. I will close this bug when the fix makes it to LibreOffice core. If you are relying on this being fixed urgently, ping me for instructions on how to install a system libmspub from the latest git sources.

Actually we could not find any information on the encoding stored within the file itself. My current hypothesis is that the Windows locale is being used to determine what code page to use for pre-Unicode programs. This is consistent with the behavior we observe and with how many other old Windows programs work. So what we are doing now is using ICU to guess at the correct character set.

What this means is that files with very little text (like your example) will still be garbled. However, anything much longer than that should work.
Comment 8 Urmas 2013-01-19 02:14:59 UTC
No, it doesn't work this way. The suffix of the font name determines its encoding even if its character set (second byte in font table record) says otherwise.

The fonts with names without suffixes found in font table should be taken as being in corresponding Windows codepage.

It is easy to implement properly, but due to lack of multilingual understanding from developers we have to live with incompatible software or hacks intended for a few nationalities developed enough to be able to code.
Comment 9 Brennan Vincent 2013-01-19 05:06:20 UTC
Thanks for the information, Urmas. I will look into it when I have a free moment and perhaps make the fix even better.

Keep in mind that if you believe we libmspub developers are incompetent, the correct course of action is to submit a patch fixing the problem to gerrit. I'm an unpaid student committing what fixes I can in my free time; being insulted for a "lack of multilingual understanding" certainly makes the prospect of contributing less attractive.
Comment 10 Fridrich Strba 2013-01-19 14:29:14 UTC
(In reply to comment #9)
> Keep in mind that if you believe we libmspub developers are incompetent, the
> correct course of action is to submit a patch fixing the problem to gerrit.
> I'm an unpaid student committing what fixes I can in my free time; being
> insulted for a "lack of multilingual understanding" certainly makes the
> prospect of contributing less attractive.

Brennan, just ignore that guy. His principal objective is to piss off people. Check Wikipedia for "Troll" :). Brennan, we love you and appreciate your work.
Comment 11 David Tardon 2013-01-19 14:52:59 UTC
(In reply to comment #8)
> No, it doesn't work this way. The suffix of the font name determines its
> encoding even if its character set (second byte in font table record) says otherwise.
> 
> The fonts with names without suffixes found in font table should be taken as
> being in corresponding Windows codepage.
> 
> It is easy to realize properly, but due to lack of multilingual
> understanding from developers we have to live with incompatible software or
> hacks intended for a few nationalities developed enough to be able to code.

I am not sure that missing knowledge of some hack cooked up by Microsoft to pretend their software is internationalized can be called "lack of multilingual understanding". Even so, since you are obviously so emphatic about it, where is your contribution? I do not expect a patch, but a link to a webpage where you explain the matter in detail so developers can educate themselves would be welcome. (Of course, I do not really expect such a link, because you are a brainless nitwit whose only objective is to insult people, but life is full of surprises...)
Comment 12 Urmas 2013-01-21 02:26:23 UTC
> I am not sure if a missing knowledge about some hack cooked up by Microsoft to pretend their software is internationalized 

It is not a hack, and it has helped people create documents in non-CP1252 codepages since 1992.

>  because you are a brainless nitwit whose only objective is to insult people

And that is said by someone who most certainly never used any language outside of the default codepage.

These file formats predate Unicode. There were no Unicode fonts. Each font was limited to 256 characters. There were no Windows 95 charsets. Assuming that the font stored in the file is a Unicode one with a valid charset field that defines a codepage can only come from not understanding the realities of that time.

Look into HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\FontSubstitutes registry section for the list of virtual fonts provided by Windows 95 for old applications and their codepages.
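For readers without a Windows machine, the following Python sketch shows how entries from that registry section could be parsed. The sample entries are reproduced from memory of stock Windows installs and should be treated as illustrative, not as an authoritative dump; `parse_entry` and `virtual_fonts` are hypothetical helper names.

```python
# Entries as they typically appear under FontSubstitutes
# (value name -> value data). Illustrative, not a complete dump.
SAMPLE_SUBSTITUTES = {
    "Arial CE,238": "Arial,238",
    "Arial CYR,204": "Arial,204",
    "Arial Greek,161": "Arial,161",
    "Arial TUR,162": "Arial,162",
    "Arial Baltic,186": "Arial,186",
}


def parse_entry(text):
    """Split 'Face Name,charset' into (face, charset); charset may be None."""
    face, sep, charset = text.rpartition(",")
    if sep and charset.strip().isdigit():
        return face.strip(), int(charset)
    return text.strip(), None


def virtual_fonts(substitutes):
    """Map each virtual font name to (real face, Windows charset id)."""
    result = {}
    for name, data in substitutes.items():
        virtual, _ = parse_entry(name)
        real, charset = parse_entry(data)
        result[virtual] = (real, charset)
    return result


for font, target in virtual_fonts(SAMPLE_SUBSTITUTES).items():
    print(font, "->", target)
```

The number after the comma is the Windows charset identifier, which in turn implies a code page.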
Comment 13 David Tardon 2014-12-25 17:58:36 UTC
(In reply to Urmas from comment #12)
> > I am not sure if a missing knowledge about some hack cooked up by Microsoft to pretend their software is internationalized 
> 
> It is not a hack and it helps people to create documents in non-CP1252
> codepages since 1992.

So, if I understand this correctly (you still have not really explained anything), the code page of a piece of text is based on a predefined "magic" suffix of the name of the font assigned to it, instead of being encoded in the file format. So "Foo Bar Cyr" is a Cyrillic variant (or replacement) of "Foo Bar", not a completely different font.

If so, then this can rightfully be called a hack, because it was obviously done as an afterthought. But let's not dwell on that... Is there at least a list of these suffixes and associated code pages somewhere (not hidden in the internals of Windows)?

> 
> >  because you are a brainless nitwit whose only objective is to insult people
> 
> And that is told by a someone who most certainly never used any language
> outside of default codepage.

You mean like Czech (which happens to be my native language)? I am pretty sure that several accented letters are not available in CP1252.

> 
> These file formats predate Unicode. There was no Unicode fonts. Each font is
> limited by 256 characters. There was no Windows 95 charsets. Assuming that
> the font is stored in file is Unicode one with valid charset field, defining
> a codepage can only be done from not understanding realia of that time.

Right. That is why I asked _you_ for an explanation, because you apparently know all that. And I am still waiting for it.

> Look into HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\FontSubstitutes
> registry section for the list of virtual fonts provided by Windows 95 for
> old applications and their codepages.

I can't. I do not run Windows.
Comment 14 Urmas 2014-12-26 05:58:26 UTC
> the code page of a piece of text is based on a predefined "magic" suffix of the name of the font assigned to it

That is the convention used for the standard fonts. It's hardcoded in Word 95, for example. We can assume they are what they are regardless of the charset stored for them or the locale.

> Is there at least a list of these suffixes and associated code pages somewhere (not hidden in the internals of Windows)?

The CE, Cyr and Greek fonts were real; Baltic and Tur were virtual ones introduced in Windows 95, and they are not documented anywhere except in the Windows setup files.
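A minimal Python sketch of the suffix convention, for illustration only. The suffixes are the ones named in this thread; the code pages are the standard Windows ones for those scripts. The function name and the `default` fallback for unsuffixed fonts are assumptions, not libmspub code.

```python
# Font-name suffix -> Python codec name.
SUFFIX_CODECS = {
    "CE": "cp1250",      # Central European
    "Cyr": "cp1251",     # Cyrillic
    "Greek": "cp1253",
    "Tur": "cp1254",     # Turkish
    "Baltic": "cp1257",
}


def codec_from_font_name(font_name, default="cp1252"):
    """Return the codec implied by the font-name suffix, else `default`."""
    parts = font_name.strip().rsplit(" ", 1)
    if len(parts) == 2:
        last_word = parts[1].lower()
        for suffix, codec in SUFFIX_CODECS.items():
            # Compare case-insensitively: both "Cyr" and "CYR" occur.
            if last_word == suffix.lower():
                return codec
    return default


print(codec_from_font_name("Arial Cyr"))
print(codec_from_font_name("Times New Roman CE"))
print(codec_from_font_name("Arial"))
```

The suffix is checked before anything else, so a stored charset value or locale would only matter for fonts without a known suffix.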

The actual issue is with the interpretation of charset 0. There was Windows NT 3.x, which supported only CJK charsets, and all the others were rammed into 0, so we need to know the system locale which was used. Thai and Turkish Windows 3.1 also weren't using the latest charsets, so we need to know whether the font is Western or Turkish (so we need locale) and whether it is Western or Thai (so we need to hardcode all the popular Thai font names).
There were also many faux fonts: they pretended to be Western (with charset 0), but in reality they were Cyrillic, etc.
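The charset-0 ambiguity could be handled along these lines. This is a Python sketch under stated assumptions: the charset identifiers are the standard Windows ones, but `codec_from_charset`, its `locale_codec` parameter, and the Thai font list are hypothetical. The list entry is a placeholder; a real list would have to be collected.

```python
# Windows charset identifiers -> Python codec names.
CHARSET_CODECS = {
    128: "cp932",   # SHIFTJIS_CHARSET
    129: "cp949",   # HANGUL_CHARSET
    134: "cp936",   # GB2312_CHARSET
    136: "cp950",   # CHINESEBIG5_CHARSET
    161: "cp1253",  # GREEK_CHARSET
    162: "cp1254",  # TURKISH_CHARSET
    186: "cp1257",  # BALTIC_CHARSET
    204: "cp1251",  # RUSSIAN_CHARSET
    222: "cp874",   # THAI_CHARSET
    238: "cp1250",  # EASTEUROPE_CHARSET
}

# Placeholder only: not a real font name.
KNOWN_THAI_FONTS = frozenset({"ExampleThaiFont"})


def codec_from_charset(charset, font_name, locale_codec):
    """Pick a codec for a font record.

    `locale_codec` stands for the code page of the system the file was
    written on (or a choice made by the user); it is used when the
    stored charset is 0 or unknown.
    """
    if charset in CHARSET_CODECS:
        return CHARSET_CODECS[charset]
    if font_name in KNOWN_THAI_FONTS:
        return "cp874"
    return locale_codec


print(codec_from_charset(204, "Arial", "cp1252"))
print(codec_from_charset(0, "Arial", "cp1251"))
```

Faux fonts that claim charset 0 but contain Cyrillic glyphs cannot be distinguished this way; for them the `locale_codec` argument has to come from outside the file.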

> I do not run Windows.

I do respect your choice to be the 1%, but that isn't something to brag about as a software developer.
Comment 15 David Tardon 2014-12-27 13:12:19 UTC
(In reply to Urmas from comment #14)
> > the code page of a piece of text is based on a predefined "magic" suffix of the name of the font assigned to it
> 
> That is convention used for the standard fonts. It's hardcoded in Word 95,
> for example. We can assume they are what they are regardless of the charset
> stored for them or the locale.
> 
> > Is there at least a list of these suffixes and associated code pages somewhere (not hidden in the internals of Windows)?
> 
> The CE, Cyr and Greek fonts were real; Baltic and Tur were virtual
> introduced in Windows 95 and that is not documented anywhere except the
> Windows setup files.

All right. Any others you know of? This should be reasonably doable.

> 
> The actual issue is with the interpretation of charset 0. There was Windows
> NT 3.x which supported only CJK charsets, and all the others were rammed
> into 0, so we need to know the system locale which was used. Thai and
> Turkish Windows 3.1 also wasn't using the latest charsets, so we need to
> know whether the font is Western or Turkish (so we need locale) and whether
> it is Western or Thai

> (so we need to hardcode all the popular Thai font
> names).

Eww.

> There were also many faux fonts: they pretended to be Western (with charset
> 0), but in reality they were Cyrillic, etc.

This does not look good. I suppose we would have to ask the user in this case. Which requires an API change.

> 
> > I do not run Windows.
> 
> I do respect your choice to be the 1%, but that isn't something to brag
> about as a software developer.

I was not bragging, I was stating a fact. I am working on platform-independent code; the whole idea of that is that it does not matter what operating system I use. So what exactly is the advantage Windows gives me over my OS X and Linux?
Comment 16 Xisco Faulí 2017-07-13 12:42:14 UTC
Setting Assignee back to default. Please assign it back to yourself if you're
still working on this issue.