Publisher files in ANSI format are displayed in the Arial font, and consequently in the wrong encoding, upon opening. The font index is (at least) the third byte value in the character formatting data. Of course, the usual warnings about ANSI_CHARSET != CP1252 apply to this format too.
(In reply to comment #0)
> Publisher files in ANSI format are formatted with Arial font and
> consequently wrong encoding upon opening.
> The font index is (at least) a third byte value in character formatting data.
> Of course, usual warnings about ANSI_CHARSET != CP1252 apply for this format
> too.
Thanks for the bug report. Do you have a sample file?
Marked as NEEDINFO until requested sample file is provided. Urmas - once document is provided you can set this bug to NEW as Brennan has assigned it.
Created attachment 73033 [details] Example
Thanks, I am looking into it. Note that opening this file in non-Russian versions of Publisher appears to suffer from the same problems, because the encoding does not appear to be stored anywhere in the file. However, we may be able to use some heuristic in order to open it correctly in LibreOffice.
There was no 'Russian' version of Publisher until Office 2000. Therefore, the data in the file are completely sufficient for proper display.
Then perhaps Windows is the culprit, and not Publisher. Do you know if it opens properly for you on non-Russian-localized copies of Windows?
This bug has been *mostly* fixed in the libmspub git repository. I will close this bug when the fix makes it to LibreOffice core. If you are relying on this being fixed urgently, ping me for instructions on how to install a system libmspub from the latest git sources. Actually we could not find any information on the encoding stored within the file itself. My current hypothesis is that the Windows locale is being used to determine what code page to use for pre-Unicode programs. This is consistent with the behavior we observe and with how many other old Windows programs work. So what we are doing now is using ICU to guess at the correct character set. What this means is that files with very little text (like your example) will still be garbled. However, anything much longer than that should work.
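To illustrate why a statistical guess struggles with very short text, here is a minimal Python sketch (not the libmspub or ICU code; the helper name `decodings` is made up for this example). It shows that the same six bytes decode without error under both CP1251 and CP1252, so only the plausibility of the resulting text can separate the two, and six characters give a detector very little to work with.

```python
# Six bytes as they might appear in a pre-Unicode file.
# Under CP1251 they spell a Russian word; under CP1252 they are
# accented Latin letters. Both decodings succeed without error.
RAW = b"\xcf\xf0\xe8\xe2\xe5\xf2"


def decodings(raw, codecs):
    """Return {codec: text} for every codec that can decode raw."""
    results = {}
    for codec in codecs:
        try:
            results[codec] = raw.decode(codec)
        except UnicodeDecodeError:
            # This codec has no character for one of the bytes.
            continue
    return results


candidates = decodings(RAW, ["cp1251", "cp1252"])
for codec, text in candidates.items():
    print(codec, text)
```

Both candidates are well-formed strings, which is why the byte values alone cannot settle the question.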
No, it doesn't work this way. The suffix of the font name determines its encoding, even if its character set (the second byte in the font table record) says otherwise. Fonts in the font table whose names have no suffix should be taken as being in the corresponding Windows code page. It is easy to implement properly, but due to a lack of multilingual understanding among developers we have to live with incompatible software or hacks intended for the few nationalities developed enough to be able to code.
Thanks for the information, Urmas. I will look into it when I have a free moment and perhaps make the fix even better. Keep in mind that if you believe we libmspub developers are incompetent, the correct course of action is to submit a patch fixing the problem to gerrit. I'm an unpaid student committing what fixes I can in my free time; being insulted for a "lack of multilingual understanding" certainly makes the prospect of contributing less attractive.
(In reply to comment #9)
> Keep in mind that if you believe we libmspub developers are incompetent, the
> correct course of action is to submit a patch fixing the problem to gerrit.
> I'm an unpaid student committing what fixes I can in my free time; being
> insulted for a "lack of multilingual understanding" certainly makes the
> prospect of contributing less attractive.
Brennan, just ignore that guy. His principal objective is to piss off people. Check Wikipedia for "Troll" :). Brennan, we love you and appreciate your work.
(In reply to comment #8)
> No, it doesn't work this way. The suffix of the font name determines its
> encoding even if its character set (second byte in font table record) says.
>
> The fonts with names without suffixes found in font table should be taken as
> being in corresponding Windows codepage.
>
> It is easy to realize properly, but due to lack of multilingual
> understanding from developers we have to live with incompatible software or
> hacks intended for a few nationalities developed enough to be able to code.
I am not sure if a missing knowledge about some hack cooked up by Microsoft to pretend their software is internationalized can be called "lack of multilingual understanding". Even so, since you are obviously so emphatic about it, where is your contribution? I do not expect a patch, but a link to a webpage where you explain the matter in detail, so developers can educate themselves, would be welcome. (Of course, I do not really expect such a link, because you are a brainless nitwit whose only objective is to insult people, but life is full of surprises...)
> I am not sure if a missing knowledge about some hack cooked up by Microsoft to pretend their software is internationalized
It is not a hack, and it has helped people create documents in non-CP1252 code pages since 1992.
> because you are a brainless nitwit whose only objective is to insult people
And that is said by someone who most certainly has never used any language outside of the default code page.
These file formats predate Unicode. There were no Unicode fonts. Each font was limited to 256 characters. There were no Windows 95 charsets. Assuming that the font stored in the file is a Unicode one, with a valid charset field that defines a code page, can only come from not understanding the realities of that time.
Look into the HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\FontSubstitutes registry section for the list of virtual fonts provided by Windows 95 for old applications, and their code pages.
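For readers without a Windows machine, here is a rough Python sketch of how entries in that registry section could be interpreted. It assumes, from memory rather than from anything stated in this thread, that each entry pairs a value name such as "Arial CYR,204" with data such as "Arial,204", where the number is a Windows charset ID. The charset-to-code-page table uses the standard Windows charset constants; the function name `parse_substitute` is made up for this example.

```python
# Windows charset IDs (the LOGFONT lfCharSet values) and the ANSI code
# page each one corresponds to.
CHARSET_TO_CODEPAGE = {
    0: 1252,    # ANSI_CHARSET (Western)
    161: 1253,  # GREEK_CHARSET
    162: 1254,  # TURKISH_CHARSET
    186: 1257,  # BALTIC_CHARSET
    204: 1251,  # RUSSIAN_CHARSET
    222: 874,   # THAI_CHARSET
    238: 1250,  # EASTEUROPE_CHARSET
}


def parse_substitute(name, data):
    """Split one assumed FontSubstitutes entry into its parts.

    Returns (virtual font name, real font name, code page or None).
    The "Name,charset" layout is an assumption, not taken from the thread.
    """
    virtual_name = name.partition(",")[0].strip()
    real_name, _, charset_text = data.partition(",")
    charset_text = charset_text.strip()
    # Entries without a charset number carry no code page information.
    codepage = CHARSET_TO_CODEPAGE.get(int(charset_text)) if charset_text else None
    return virtual_name, real_name.strip(), codepage


print(parse_substitute("Arial CYR,204", "Arial,204"))
```

Under these assumptions the virtual font "Arial CYR" resolves to the real font "Arial" in code page 1251.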
(In reply to Urmas from comment #12)
> > I am not sure if a missing knowledge about some hack cooked up by Microsoft to pretend their software is internationalized
>
> It is not a hack and it helps people to create documents in non-CP1252
> codepages since 1992.
So, if I understand this correctly (you still have not really explained anything), the code page of a piece of text is based on a predefined "magic" suffix of the name of the font assigned to it, instead of being encoded in the file format. So "Foo Bar Cyr" is a Cyrillic variant (or replacement) of "Foo Bar", not a completely different font. If so, then this can rightfully be called a hack, because it was obviously done as an afterthought. But let's not dwell on that... Is there at least a list of these suffixes and associated code pages somewhere (not hidden in the internals of Windows)?
> > because you are a brainless nitwit whose only objective is to insult people
>
> And that is told by a someone who most certainly never used any language
> outside of default codepage.
You mean like Czech (which happens to be my native language)? I am pretty sure that several accented letters are not available in CP1252.
> These file formats predate Unicode. There was no Unicode fonts. Each font is
> limited by 256 characters. There was no Windows 95 charsets. Assuming that
> the font is stored in file is Unicode one with valid charset field, defining
> a codepage can only be done from not understanding realia of that time.
Right. That is why I asked _you_ for an explanation, because you apparently know all that. And I am still waiting for it.
> Look into HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\FontSubstitutes
> registry section for the list of virtual fonts provided by Windows 95 for
> old applications and their codepages.
I can't. I do not run Windows.
> the code page of a piece of text is based on a predefined "magic" suffix of the name of the font assigned to it
That is the convention used for the standard fonts. It's hardcoded in Word 95, for example. We can assume they are what they are, regardless of the charset stored for them or the locale.
> Is there at least a list of these suffixes and associated code pages somewhere (not hidden in the internals of Windows)?
The CE, Cyr and Greek fonts were real; Baltic and Tur were virtual ones introduced in Windows 95, and that is not documented anywhere except in the Windows setup files.
The actual issue is with the interpretation of charset 0. There was Windows NT 3.x, which supported only CJK charsets, and all the others were rammed into 0, so we need to know the system locale which was used. Thai and Turkish Windows 3.1 also weren't using the later charsets, so we need to know whether the font is Western or Turkish (so we need the locale) and whether it is Western or Thai (so we need to hardcode all the popular Thai font names). There were also many faux fonts: they pretended to be Western (with charset 0), but in reality they were Cyrillic, etc.
> I do not run Windows.
I do respect your choice to be the 1%, but that isn't something to brag about as a software developer.
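The lookup order described in this comment could be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not the libmspub implementation. The suffix list is the one named above (CE, Cyr, Greek, Baltic, Tur), and the charset IDs are the standard Windows constants. The function name `resolve_codec`, its parameters, and the `overrides` table (standing in for a hardcoded list of Thai or other known font names) are hypothetical.

```python
# Font-name suffixes named in the thread, mapped to Python codec names.
SUFFIX_CODECS = {
    "ce": "cp1250",
    "cyr": "cp1251",
    "greek": "cp1253",
    "tur": "cp1254",
    "baltic": "cp1257",
}

# Non-zero Windows charset IDs and their code pages.
CHARSET_CODECS = {
    161: "cp1253",  # GREEK_CHARSET
    162: "cp1254",  # TURKISH_CHARSET
    186: "cp1257",  # BALTIC_CHARSET
    204: "cp1251",  # RUSSIAN_CHARSET
    222: "cp874",   # THAI_CHARSET
    238: "cp1250",  # EASTEUROPE_CHARSET
}


def resolve_codec(font_name, charset, locale_codec="cp1252", overrides=None):
    """Pick a codec for text set in font_name (hypothetical helper).

    Order: explicit per-font override, then name suffix, then a non-zero
    charset, and finally the code page of the locale the file came from.
    """
    name = font_name.strip()
    if overrides and name in overrides:
        return overrides[name]
    words = name.split()
    # The suffix wins regardless of the stored charset.
    if len(words) > 1 and words[-1].lower() in SUFFIX_CODECS:
        return SUFFIX_CODECS[words[-1].lower()]
    if charset in CHARSET_CODECS:
        return CHARSET_CODECS[charset]
    # Charset 0 (or unknown): only the originating locale can decide.
    return locale_codec


text = b"\xcf\xf0\xe8\xe2\xe5\xf2".decode(resolve_codec("Times New Roman Cyr", 0))
print(text)
```

The faux fonts mentioned above (Cyrillic fonts that declare charset 0 and carry no suffix) are not detectable by this scheme; they would fall through to `locale_codec` unless listed in `overrides`.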
(In reply to Urmas from comment #14)
> > the code page of a piece of text is based on a predefined "magic" suffix of the name of the font assigned to it
>
> That is convention used for the standard fonts. It's hardcoded in Word 95,
> for example. We can assume they are what they are regardless of the charset
> stored for them or the locale.
>
> > Is there at least a list of these suffixes and associated code pages somewhere (not hidden in the internals of Windows)?
>
> The CE, Cyr and Greek fonts were real; Baltic and Tur were virtual
> introduced in Windows 95 and that is not documented anywhere except the
> Windows setup files.
All right. Any others you know of? This should be reasonably doable.
> The actual issue is with the interpretation of charset 0. There was Windows
> NT 3.x which supported only CJK charsets, and all the others were rammed
> into 0, so we need to know the system locale which was used. Thai and
> Turkish Windows 3.1 also wasn't using the latest charsets, so we need to
> know whether the font is Western or Turkish (so we need locale) and whether
> it is Western or Thai
> (so we need to hardcode all the popular Thai font
> names).
Eww.
> There were also many faux fonts: they pretended to be Western (with charset
> 0), but in reality they were Cyrillic, etc.
This does not look good. I suppose we would have to ask the user in this case. Which requires an API change.
> > I do not run Windows.
>
> I do respect your choice to be the 1%, but that isn't something to brag
> about as a software developer.
I was not bragging, I was stating a fact. I am working on platform-independent code; the whole idea of that is that it does not matter what operating system I use. So what exactly is the advantage Windows gives me over my OS X and Linux?
Setting Assignee back to default. Please assign it back to yourself if you're still working on this issue.