Bug 158225 - Incorrect encoding while opening cyrillic document created in MS Word 5.1 for Macintosh FILEOPEN
Status: NEW
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer
Version (earliest affected): unspecified
Hardware: All All
Importance: medium normal
Assignee: Not Assigned
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: Font-Rendering
Reported: 2023-11-15 14:56 UTC by Mikhail Kukharenko
Modified: 2023-12-21 12:19 UTC

See Also:
Crash report or crash signature:


Attachments
MS Word 5.1 for Macintosh document with cyrillic text. (2.50 KB, application/msword)
2023-11-15 14:56 UTC, Mikhail Kukharenko
conversion of this file by "forcing" the Cyrillic encoding (17.66 KB, application/vnd.oasis.opendocument.text)
2023-11-16 14:55 UTC, osnola
MS Excel 4.0 document with Cyrillic text (3.12 KB, application/octet-stream)
2023-11-19 14:32 UTC, Mikhail Kukharenko
MS Word 5.1 for Mac doc with cyrillic text and another fonts (3.50 KB, application/msword)
2023-11-27 15:02 UTC, Mikhail Kukharenko
The result (14.44 KB, application/vnd.oasis.opendocument.text)
2023-12-03 09:17 UTC, osnola

Description Mikhail Kukharenko 2023-11-15 14:56:20 UTC
Created attachment 190842 [details]
MS Word 5.1 for Macintosh document with cyrillic text.

When documents created in MS Word 5.1 for Macintosh or MS Excel 4.0 for Macintosh are opened in LibreOffice, the Cyrillic text becomes unreadable (incorrect encoding).
All formatting is OK and Latin text is OK. The problem is only with Cyrillic characters.
I tried all reasonable file types in the Open dialog; the result is the same.
The attachment provided is an MS Word 5.1 for Macintosh document. It has no extension because files on the Mac did not have one. Adding the extension (doc) doesn't help.
Comment 1 Julien Nabet 2023-11-16 13:21:12 UTC
On pc Debian x86-64 with master sources updated today, I could reproduce this.
I noticed this on console:
MWAWHeader::constructHeader: find a Word 5.0 file
PageSpan::get: can not find a master page name
PageSpan::get: can not find a master page name
MWAWHeader::constructHeader: find a Word 5.0 file
MWAWFontConverter::getValidName: fontName contains bad character


Laurent: thought you might be interested in this one since it uses libmwaw lib.
Comment 2 Mikhail Kukharenko 2023-11-16 14:42:18 UTC
(In reply to Julien Nabet from comment #1)
> MWAWFontConverter::getValidName: fontName contains bad character
If it's possible to fix something inside the problematic document so that LibreOffice can open it correctly, that would be a great workaround.
Comment 3 osnola 2023-11-16 14:55:50 UTC
Created attachment 190874 [details]
conversion of this file by "forcing" the Cyrillic encoding

In a Word 5.0 file (and in general in every Mac OS 9 file), a font is defined by a name and an identifier, so I've created a table of correspondence between certain font names and their encodings. Of course, this table is incomplete.

In this case, I'd add this font in my table (I need to see how to do it properly); but it would be nice to have other Cyrillic documents so that I can at least recognize their basic fonts.
Comment 4 osnola 2023-11-16 15:02:48 UTC
Another problem is that this file was created on a Mac OS Cyrillic system and I haven't found a method of detecting this in a Word 5 file. The font name is therefore encoded in Cyrillic...
Comment 5 Mikhail Kukharenko 2023-11-16 15:28:18 UTC
(In reply to osnola from comment #3)
> Created attachment 190874 [details]
> conversion of this file by "forcing" the Cyrillic encoding
It works! Great!
> ...

Could you please tell me how I can reproduce what you did? Did you fix something inside the document? I would write the python script to "fix" it in my Word 5.0 & Excel 4.0 documents as I have plenty of them.

> but it would be nice to have other Cyrillic documents so that I
> can at least recognize their basic fonts.
I will check and report all cyrillic fonts from that old Mac OS 9. 

Thank you !
Comment 6 osnola 2023-11-16 15:34:47 UTC
I've applied a hacked patch to the libmwaw sources, but I have to rewrite the code differently (like this, it's unreadable).

---------------- patch --------------------------
diff --git a/src/lib/MWAWFontConverter.cxx b/src/lib/MWAWFontConverter.cxx
index 6f9c7da..33ff37e 100644
--- a/src/lib/MWAWFontConverter.cxx
+++ b/src/lib/MWAWFontConverter.cxx
@@ -1276,7 +1276,8 @@ MWAWFontConverter::~MWAWFontConverter()
 // mac font name <-> id functions
 std::string MWAWFontConverter::getValidName(std::string const &name)
 {
-  std::string fName("");
+  if (name=="\x8b\xe0\xf2\xe8\xed\xf1\xea\xe8\xe9") return "Times CY";
+  std::string fName;
   static bool first = true;
   for (auto c : name) {
Comment 7 Julien Nabet 2023-11-16 17:18:44 UTC
(In reply to osnola from comment #4)
> Another problem is that this file was created on a Mac OS Cyrillic system
> and I haven't found a method of detecting this in a Word 5 file. The font
> name is therefore encoded in Cyrillic...

Taking a look at the MS-DOC specs here:
https://msopenspecs.azureedge.net/files/MS-DOC/%5bMS-DOC%5d.pdf
(from https://learn.microsoft.com/en-us/openspecs/office_file_formats/ms-doc/ccd7b486-7881-484c-a137-51170af7cc22)

in 2.5.2 FibBase, there's "lid" defined as:
"A LID that specifies the install language of the application that is producing the document. If nFib is 0x00D9 or greater, then any East Asian install lid or any install lid with a base
language of Spanish, German or French MUST be recorded as 0x0409. If the nFib is 0x0101 or greater, then any install lid with a base language of Vietnamese, Thai, or Hindi MUST be recorded as 0x0409."

perhaps it may help.

Anyway, thank you for the quick investigation! :-)
Comment 8 osnola 2023-11-17 08:21:13 UTC
Thank you. Yes, I had looked at those documents, and since Word for MS-DOS file conversions are done by libwps, I know that filter too.

In fact, the notion of language appeared only later in classic Mac OS. Initially, I'd say, the system was supplied with default routines, a list of fonts, and applications (with text strings translated into the desired language, of course); there was also a default font used for the system and another for each application.

The fonts were simply a set of 256 glyphs (given in several sizes) that the user could modify by taking a font and replacing one glyph with another using http://macintoshgarden.org/apps/fontastic-plus , ...
Comment 9 osnola 2023-11-17 08:33:58 UTC
(In reply to Mikhail Kukharenko from comment #5)
>  I would write the python script to "fix" it
> in my Word 5.0 & Excel 4.0 documents as I have plenty of them.
> 

libmwaw only converts Mac Word 1.0 to 5.0 files and older PowerPoint formats (Mac and PC): I didn't write any other filters because LibreOffice already had filters to open other Office formats. 

I'm not sure what Python can easily fix. It seems difficult; basically, the principle would be to look at the tables https://en.wikipedia.org/wiki/Mac_OS_Roman
and https://en.wikipedia.org/wiki/Mac_OS_Cyrillic_encoding.

For example, if the original file contains the character 0x80, it should normally be converted to А (but here we get Ä). So we'd have to go back through the badly transformed characters one by one and, whenever we find an Ä, transform it into А, and so on.
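
The character-by-character repair described here can be sketched in Python, whose standard library ships `mac_roman` and `mac_cyrillic` codecs. This is only an illustration: it assumes the garbled text is a faithful Mac Roman decoding of the original bytes, and the function name `remap_roman_to_cyrillic` is made up for this example (it is not part of libmwaw or LibreOffice).

```python
def remap_roman_to_cyrillic(garbled: str) -> str:
    """Undo a wrong Mac Roman decoding and redo it as Mac Cyrillic.

    Assumption: every character in `garbled` came from decoding one
    original byte as Mac Roman, so encoding it back recovers that byte.
    Characters that do not exist in Mac Roman raise UnicodeEncodeError.
    """
    original_bytes = garbled.encode("mac_roman")
    return original_bytes.decode("mac_cyrillic")


# Byte 0x80 is Latin A with diaeresis in Mac Roman but Cyrillic A in Mac Cyrillic.
sample = bytes([0x80]).decode("mac_roman")
fixed = remap_roman_to_cyrillic(sample)
# Print code points rather than glyphs so the output is terminal-independent.
print(f"U+{ord(sample):04X} -> U+{ord(fixed):04X}")
```

A blanket remap like this would also rewrite text that really was Mac Roman, so it only makes sense for runs known to come from a Cyrillic font.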
Comment 10 Mikhail Kukharenko 2023-11-19 14:29:41 UTC
(In reply to osnola from comment #6)
> -  std::string fName("");
> +  if (name=="\x8b\xe0\xf2\xe8\xed\xf1\xea\xe8\xe9") return "Times CY";
$ echo -e '\x8b\xe0\xf2\xe8\xed\xf1\xea\xe8\xe9' | iconv -f MACCYRILLIC 
Латинский

Using a hex editor, I changed this font name in the binary Word document to "Times CYR", and it opened perfectly with readable Cyrillic text.
It also worked when I changed it to "Times  CY" and to "Arial  CY" (with an extra space in the middle to preserve the string length).
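
The hex-editor step can be approximated with a small Python helper. This is a hedged sketch, not a tested conversion tool: the function name `patch_font_name` and the stand-in bytes are hypothetical, and the replacement is blind, so it would also alter any body text that happens to contain the same byte sequence.

```python
def patch_font_name(data: bytes, old_name: bytes, new_name: bytes) -> bytes:
    """Replace a font name in raw file bytes without changing the file size."""
    if len(old_name) != len(new_name):
        raise ValueError("replacement must have the same length as the original")
    if old_name not in data:
        raise ValueError("font name not found in data")
    # Replaces every occurrence, not just the one in the font table.
    return data.replace(old_name, new_name)


# The 9-byte Mac Cyrillic font name quoted in the thread.
LATINSKIJ = b"\x8b\xe0\xf2\xe8\xed\xf1\xea\xe8\xe9"

# Stand-in bytes for illustration; a real file would be read with open(path, "rb").
fake_document = b"\x00\x01" + LATINSKIJ + b"\x00rest of file"
patched = patch_font_name(fake_document, LATINSKIJ, b"Times CYR")
print(len(fake_document) == len(patched))
```

Keeping the length identical matters because the sketch does not know where the format stores string lengths or offsets.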

I tried to do the same with Excel 4.0 for Mac documents. Example is attached. All my Excel documents contain two fonts:
1. Латинский (\x8b\xe0\xf2\xe8\xed\xf1\xea\xe8\xe9)  and 
2. Прямой Проп (8FF0DFECEEE9208FF0EEEF)

1st font is Serif (like Times)
2nd font is Sans Serif (like Arial)
I changed the 1st font name to "Times CYR" again and tried to replace the 2nd font name with "Arial   CYR" or "Arial    CY" (to preserve the length), but it did not work (the file opened unreadable).

Can you please try to "force open" the attached document?
Comment 11 Mikhail Kukharenko 2023-11-19 14:32:54 UTC
Created attachment 190911 [details]
MS Excel 4.0 document with Cyrillic text
Comment 12 osnola 2023-11-19 14:44:17 UTC
As far as MS Word files are concerned, this is normal. libmwaw recognizes fonts ending in " CY" or " CYR" as Cyrillic fonts. So if you can change the font name in the original file, it will work.

As far as the Excel files are concerned, you need to find out which filter is called to open these files (it's not libmwaw: see https://sourceforge.net/p/libmwaw/wiki/Home/ for the list of formats recognised by libmwaw). So it's probably best to open another bug report.

Note: I'll also add a special case for the font "Прямой Проп"
Comment 13 osnola 2023-11-20 15:21:40 UTC
I will take inspiration from https://en.wikipedia.org/wiki/Talk:Fonts_on_Macintosh (I)
and add 
 m_nameTranslatedNameMap["\x8b\xe0\xf2\xe8\xed\xf1\xea\xe8\xe9"]="Latinskij";
 m_nameTranslatedNameMap["\x8f\xf0\xdf\xec\xee\xe9"]="Priamoj";
 m_nameTranslatedNameMap["\x8f\xf0\xdf\xec\xee\xe9\x20\x8f\xf0\xee\xef"]="Priamoj Prop";
 m_nameTranslatedNameMap["\x91\xe8\xf1\xf2\xe5\xec\xed\xfb\xe9"]="Sistemnyj";
 m_nameTranslatedNameMap["\x80\x90\x91\x8a\xf3\xf0\xfc\xe5\xf0"]="APC Courier";
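
To see what these byte-string keys stand for, they can be decoded with Python's `mac_cyrillic` codec. The dictionary below simply copies the five entries above; the names `RAW_FONT_NAMES` and `decoded_names` are made up for this illustration.

```python
# Keys and values copied from the m_nameTranslatedNameMap lines above.
RAW_FONT_NAMES = {
    b"\x8b\xe0\xf2\xe8\xed\xf1\xea\xe8\xe9": "Latinskij",
    b"\x8f\xf0\xdf\xec\xee\xe9": "Priamoj",
    b"\x8f\xf0\xdf\xec\xee\xe9\x20\x8f\xf0\xee\xef": "Priamoj Prop",
    b"\x91\xe8\xf1\xf2\xe5\xec\xed\xfb\xe9": "Sistemnyj",
    b"\x80\x90\x91\x8a\xf3\xf0\xfc\xe5\xf0": "APC Courier",
}


def decoded_names() -> dict:
    """Map each Cyrillic font name (decoded from Mac Cyrillic) to its Latin name."""
    return {raw.decode("mac_cyrillic"): latin for raw, latin in RAW_FONT_NAMES.items()}


for cyrillic, latin in decoded_names().items():
    # !a prints an ASCII-safe representation, independent of the terminal encoding.
    print(f"{cyrillic!a} -> {latin}")
```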
Comment 14 osnola 2023-11-21 09:56:17 UTC
Normally, https://sourceforge.net/p/libmwaw/libmwaw/ci/7e583cd8e526a58b4387b4bd4794c511479e3827/  will solve this problem when I will release a new version.
Comment 15 Mikhail Kukharenko 2023-11-21 11:39:57 UTC
(In reply to osnola from comment #14)
> Normally,
> https://sourceforge.net/p/libmwaw/libmwaw/ci/
> 7e583cd8e526a58b4387b4bd4794c511479e3827/  will solve this problem when I
> will release a new version.

Thank you!
As suggested, I created a separate ticket for Excel 4.0 for Macintosh:
https://bugs.documentfoundation.org/show_bug.cgi?id=158282
Comment 16 Mikhail Kukharenko 2023-11-24 07:50:05 UTC
(In reply to osnola from comment #9)
> (In reply to Mikhail Kukharenko from comment #5)
> libmwaw only converts Mac Word 1.0 to 5.0 files and older PowerPoint formats
> (Mac and PC): I didn't write any other filters because LibreOffice already
> had filters to open other Office formats. 
 
What filter (library) is used by LO to open Mac Excel 4 documents?
We will try to add support for Cyrillic Excel files there. Thank you!
Comment 17 osnola 2023-11-24 08:53:52 UTC
I assume this is one of LibreOffice's Excel filters that reside in core/sc, but I'm not sure.

In filter/qa/complex/filter/detection/typeDetection/files.csv, we have:

Excel2	        Calc/Excel2.XLS	calc_MS_Excel_40	calc_MS_Excel_40
Excel3          Calc/Excel3.XLS	calc_MS_Excel_40	calc_MS_Excel_40
Excel4_document Calc/Excel4.XLS	calc_MS_Excel_40	calc_MS_Excel_40
Excel4_template Calc/Excel4.XLT	calc_MS_Excel_40_VorlageTemplate	calc_MS_Excel_40
Exel95_document Calc/Excel5_95.XLS	calc_MS_Excel_5095	calc_MS_Excel_5095
Exel95_template Calc/Excel5_template.XLT	calc_MS_Excel_5095	calc_MS_Excel_5095
Comment 18 Mikhail Kukharenko 2023-11-27 14:59:44 UTC
(In reply to Mikhail Kukharenko from comment #15)
> (In reply to osnola from comment #14)
> > Normally,
> > https://sourceforge.net/p/libmwaw/libmwaw/ci/
> > 7e583cd8e526a58b4387b4bd4794c511479e3827/  will solve this problem when I
> > will release a new version.

Could you please check the fonts in the attached file "enduser"? It looks like it contains other Cyrillic fonts.
Comment 19 Mikhail Kukharenko 2023-11-27 15:02:17 UTC
Created attachment 191063 [details]
MS Word 5.1 for Mac doc with cyrillic text and another fonts

This is another kind of MS Word 5 for Mac document, with another set of Cyrillic fonts.
Could you please check the font names and add support for them?
Thank you !
Comment 20 osnola 2023-11-27 15:28:23 UTC
Hello, 
this file does not contain a font name. This probably means that the system default font (or perhaps the application default font) should be used.

I'm not sure how to fix this as I haven't found a way to detect that the file wasn't created on a Mac Roman system but Cyrillic :-~
Comment 21 Mikhail Kukharenko 2023-11-27 15:44:26 UTC
(In reply to osnola from comment #20)

> I'm not sure how to fix this as I haven't found a way to detect that the
> file wasn't created on a Mac Roman system but Cyrillic :-~

Hello! 

If we assume that it was created on a Cyrillic system (as it was), is it possible to "force convert" it?
Can we do the same trick as before (set the font name inside the binary document)?

Maybe this can help:
at offset 0000:0810 I see this:
...NFlRight.normtxt.Punkt.RusN.SuperZagla.Zagolovok.SPage.p1.p2.2-COL.1-COL-SHIFT RIGHT.RUS-JUST.p3.p4ÿ.0.S1.S2.S3.S4.p5.S5.rusC.RusCBoldÿ.qÿÿÿÿÿÿÿÿ............

rusC and RusCBold most probably mean "Russian Courier" and "Russian Courier Bold" respectively. They look like font names.
Zagolovok means "Heading" and looks like a style name.
SuperZagla means "SuperHeading" and again looks like a style name.
Comment 22 osnola 2023-11-27 15:48:46 UTC
This zone corresponds to the style names:

000802 [Styles(names):*_______******N1=NFlRight,N2=normtxt,N3=Punkt,N4=RusN,N5=SuperZagla,N6=Zagolovok,N7=SPage,N8=p1,N9=p2,N10=2-COL,N11=1-COL-SHIFT RIGHT,N12=RUS-JUST,N13=p3,N14=p4,_N16=0,N17=S1,N18=S2,N19=S3,N20=S4,N21=p5,N22=S5,N23=rusC,N24=RusCBold,_]009800ffffffffffffff000000000000084e466c5269676874076e6f726d7478740550756e6b74045275734e0a53757065725a61676c61095a61676f6c6f766f6b05535061676502703102703205322d434f4c11312d434f4c2d5348494654205249474854085255532d4a555354027033027034ff013002533102533202533302533402703502533504727573430852757343426f6c64ff
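
Judging only from the hex dump above, the zone appears to be a run of length-prefixed strings in which a 0xff length byte marks a missing name. The Python sketch below reads such a run; the layout is inferred from this one dump rather than taken from a format specification, and `read_style_names` is a hypothetical helper, not a libmwaw function.

```python
from typing import List, Optional


def read_style_names(zone: bytes) -> List[Optional[str]]:
    """Read length-prefixed names; a 0xff length byte means 'no name'."""
    names: List[Optional[str]] = []
    pos = 0
    while pos < len(zone):
        length = zone[pos]
        pos += 1
        if length == 0xFF:
            names.append(None)
            continue
        # Assumption: style names are stored in a single-byte Mac encoding.
        names.append(zone[pos:pos + length].decode("mac_roman"))
        pos += length
    return names


# The last 18 bytes of the dump above (S5, rusC, RusCBold, then 0xff).
excerpt = bytes.fromhex("02533504727573430852757343426f6c64ff")
print(read_style_names(excerpt))
```

The sketch skips the bytes at the start of the zone, whose meaning is not stated in the thread.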

Note: I will try to force conversion by hand at the end of the week.
Comment 23 Mikhail Kukharenko 2023-12-02 18:15:50 UTC
(In reply to osnola from comment #22)

> Note: I will try to force conversion by hand at the end of the week.

Thank you! I will be waiting for it. Maybe it's possible to set the font directly inside the binary Word document...
Comment 24 osnola 2023-12-03 09:17:35 UTC
Created attachment 191199 [details]
The result

Sorry for the delay, I obtained this result by forcing the conversion.
 
To force the conversion, you can use the following patch on libmwaw. I need to install a Russian system to retrieve the numerical identifier of the supplied Cyrillic fonts and check that other fonts don't use the same identifiers in my base of files. I'll try to find the time to do it in December, the days are a bit busy at the moment...
--- a/src/lib/MWAWFontConverter.cxx
+++ b/src/lib/MWAWFontConverter.cxx
@@ -1207,6 +1207,8 @@ void State::initMaps()
   m_idNameMap[64640] = "Hiragino MaruGo W3";
   m_idNameMap[64643] = "Hiragino MaruGo W6";
 
+  // cyrillic font (check me)
+  m_idNameMap[19540] = "Sistemnyj";
   // Windows
   m_idNameMap[101250] = "CP1250";
   m_idNameMap[101251] = "CP1251";
Comment 25 Mikhail Kukharenko 2023-12-03 11:08:10 UTC
(In reply to osnola from comment #24)
> Created attachment 191199 [details]
> The result
Thank you so much!  We will try to apply the patch.
> I need
> to install a Russian system to retrieve the numerical identifier of the
> supplied Cyrillic fonts 
I can install Russian system and check if it helps. You mean LO with Russian settings ? On Linux / Windows / Mac ?
Comment 26 osnola 2023-12-03 11:17:49 UTC
(In reply to Mikhail Kukharenko from comment #25)
> > I can install Russian system and check if it helps. You mean LO with Russian
> settings ? On Linux / Windows / Mac ?

No, I mean install Mac OS 7.5, 8 or 9 on an old computer or in an emulator. Then use Font Mover, ... to retrieve the identifier and name of each Cyrillic font.
Comment 27 Mikhail Kukharenko 2023-12-05 07:58:05 UTC
(In reply to osnola from comment #24)
> Created attachment 191199 [details]
> I obtained this result by forcing the conversion.

We applied this patch to libmwaw, but the conversion was not successful :(
We use LibreOffice 7.5.8.2 50(Build:2) on Linux.
As I understand it:
> +  m_idNameMap[19540] = "Sistemnyj";
- the font in my document has no name, but has the id 19540.
- with your patch, you specified that font id 19540 should be treated as Sistemnyj, and for this font name the Cyrillic conversion should be used:
  m_convertMap[std::string("Sistemnyj")] = & m_cyrillicConv; 

So it should not depend on LibreOffice itself.

What system and LO version do you use to open the file?
Or maybe you convert it with some tool other than LO?
Thank you !
Comment 28 osnola 2023-12-05 08:11:17 UTC
Hello,
if you want to recompile LibreOffice, you must also apply the https://sourceforge.net/p/libmwaw/libmwaw/ci/7e583c patch to the libmwaw source. (Otherwise, it will set the name of the unknown font to Sistemnyj, but it doesn't know this font and will therefore assume that the encoding is Mac Roman.)

Personally, I compile libmwaw outside LibreOffice: see https://www.documentliberation.org/projects/ and https://wiki.documentfoundation.org/DLP/Libraries. Basically, I compile librevenge, libmwaw, libodfgen and writerperfect...
Comment 29 osnola 2023-12-05 08:25:09 UTC
Oops, I see you've applied the previous patch. The most likely cause of your problem is that you have two versions of libmwaw.lib/dll/dylib on your system and LibreOffice is still using the old one.

Please note 
- on an old Mac OS, a font has a name and an identifier. Normally a document contains a font table that maps an index i to (id, fontname), and later in the document fonts are referenced either as (font i) or as (font with this id). In a Word file, the latter is used.
- Normally, it's safer to use the fontname, because the id can change if the user has used Font Mover, so I prefer to use it if it's available,
- and yes, you understand the logic: in this case, since nobody else defines m_idNameMap[19540], it will use the default value "Sistemnyj", which sets the encoding to Cyrillic.
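
The lookup chain discussed in this thread (font id, then font name, then encoding) can be sketched in Python as follows. This is a simplified illustration and not libmwaw's actual code: `ID_NAME_MAP`, `CYRILLIC_NAMES`, `encoding_for_font` and `decode_run` are hypothetical names, and treating every translated name from comment 13 as Cyrillic is an assumption.

```python
from typing import Optional

# Default id -> name mapping, taken from the "check me" patch in comment 24.
ID_NAME_MAP = {19540: "Sistemnyj"}

# Assumption: all translated names from comment 13 select the Cyrillic encoding.
CYRILLIC_NAMES = {"Latinskij", "Priamoj", "Priamoj Prop", "Sistemnyj", "APC Courier"}


def encoding_for_font(font_id: int, font_name: Optional[str] = None) -> str:
    """Prefer the font name stored in the file; fall back to the id table."""
    name = font_name or ID_NAME_MAP.get(font_id)
    if name is None:
        return "mac_roman"  # unknown font: assume Mac Roman
    # Names ending in " CY" or " CYR" are treated as Cyrillic (see comment 12).
    if name in CYRILLIC_NAMES or name.endswith((" CY", " CYR")):
        return "mac_cyrillic"
    return "mac_roman"


def decode_run(raw: bytes, font_id: int, font_name: Optional[str] = None) -> str:
    """Decode a run of text bytes using the encoding chosen for its font."""
    return raw.decode(encoding_for_font(font_id, font_name))


print(encoding_for_font(19540))          # id known only through the default table
print(encoding_for_font(3, "Times CY"))  # name suffix decides
print(encoding_for_font(3))              # nothing known about this font
```

The sketch also shows why both patches are needed: the id entry only supplies a name, and the name must in turn be known as Cyrillic for the encoding to change.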
Comment 30 Mikhail Kukharenko 2023-12-05 19:52:20 UTC
(In reply to osnola from comment #29)
> Oops, I see you've applied the previous patch. 
Thank you so much! We applied your patches and built libmwaw on our side.
Now LO 7.5.8.2 (x86_64 Linux) perfectly opens all these old mac documents.

As for the font ids and font names from the old Mac OS system:
I will have an old Mac with Mac OS 9 in January, and I will use Font Mover to find this information and provide it here.
I am changing the status to NEEDINFO, as I should provide this information.

As for converting Excel 4 Mac documents, that is a separate ticket.

Thank you!
Comment 31 QA Administrators 2023-12-06 03:18:13 UTC Comment hidden (obsolete)
Comment 32 Buovjaga 2023-12-21 12:19:53 UTC
Thanks, so this is pending a new libmwaw release and updating it for LibreOffice.