Created attachment 190842 [details]
MS Word 5.1 for Macintosh document with cyrillic text.
When documents created in MS Word 5.1 for Macintosh and in MS Excel 4.0 for Macintosh are opened in LibreOffice, the Cyrillic text becomes unreadable (incorrect encoding). All formatting is OK and Latin text is OK; the problem is only with Cyrillic characters. I tried all reasonable file types in the Open dialog and the result is the same. The attachment provided is an MS Word 5.1 for Macintosh document. It has no extension, as files on the Mac don't have one. Adding the extension (doc) doesn't help.
On a Debian x86-64 PC with master sources updated today, I could reproduce this. I noticed this on the console:
MWAWHeader::constructHeader: find a Word 5.0 file
PageSpan::get: can not find a master page name
PageSpan::get: can not find a master page name
MWAWHeader::constructHeader: find a Word 5.0 file
MWAWFontConverter::getValidName: fontName contains bad character
Laurent: thought you might be interested in this one since it uses the libmwaw lib.
(In reply to Julien Nabet from comment #1)
> MWAWFontConverter::getValidName: fontName contains bad character
If it's possible to fix something inside the problematic document so that LibreOffice can open it correctly, that would be a great workaround.
Created attachment 190874 [details]
conversion of this file by "forcing" the Cyrillic encoding
In a Word 5.0 file (and, in general, in every Mac OS 9 file), a font is defined by a name and an identifier, so I've created a table that maps certain font names to their corresponding encoding. Of course, this table is incomplete. In this case, I'd add this font to my table (I need to see how to do it properly), but it would be nice to have other Cyrillic documents so that I can at least recognize their basic fonts.
Another problem is that this file was created on a Mac OS Cyrillic system and I haven't found a method of detecting this in a Word 5 file. The font name is therefore encoded in Cyrillic...
(In reply to osnola from comment #3)
> Created attachment 190874 [details]
> conversion of this file by "forcing" the Cyrillic encoding
It works! Great!
> ...
Could you please tell me how I can reproduce what you did? Did you fix something inside the document? I would write a Python script to "fix" it in my Word 5.0 & Excel 4.0 documents, as I have plenty of them.
> but it would be nice to have other Cyrillic documents so that I
> can at least recognize their basic fonts.
I will check and report all Cyrillic fonts from that old Mac OS 9. Thank you!
I've applied a hacked patch to the libmwaw sources, but I have to rewrite the code differently (like this, it's unreadable).
---------------- patch --------------------------
diff --git a/src/lib/MWAWFontConverter.cxx b/src/lib/MWAWFontConverter.cxx
index 6f9c7da..33ff37e 100644
--- a/src/lib/MWAWFontConverter.cxx
+++ b/src/lib/MWAWFontConverter.cxx
@@ -1276,7 +1276,8 @@ MWAWFontConverter::~MWAWFontConverter()
 // mac font name <-> id functions
 std::string MWAWFontConverter::getValidName(std::string const &name)
 {
-  std::string fName("");
+  if (name=="\x8b\xe0\xf2\xe8\xed\xf1\xea\xe8\xe9") return "Times CY";
+  std::string fName;
   static bool first = true;
   for (auto c : name) {
(In reply to osnola from comment #4)
> Another problem is that this file was created on a Mac OS Cyrillic system
> and I haven't found a method of detecting this in a Word 5 file. The font
> name is therefore encoded in Cyrillic...
Taking a look at the MS-DOC specification here: https://msopenspecs.azureedge.net/files/MS-DOC/%5bMS-DOC%5d.pdf (from https://learn.microsoft.com/en-us/openspecs/office_file_formats/ms-doc/ccd7b486-7881-484c-a137-51170af7cc22), section 2.5.2 FibBase defines "lid" as:
"A LID that specifies the install language of the application that is producing the document. If nFib is 0x00D9 or greater, then any East Asian install lid or any install lid with a base language of Spanish, German or French MUST be recorded as 0x0409. If the nFib is 0x0101 or greater, then any install lid with a base language of Vietnamese, Thai, or Hindi MUST be recorded as 0x0409."
Perhaps it may help. Anyway, thank you for the quick investigation! :-)
Thank you. Yes, I had looked at those documents and as word msdos file conversions are done by libwps, I know that filter too. In fact, the notion of language appeared later on Mac Classic Os. Initially, I'd say that the system was supplied with default routines, a list of fonts and applications (of course with text strings translated in the desired language); there was also a default font used for the system and another for each application. The fonts were simply a set of 256 glyphs (given in several sizes) that the user could modify by taking a font and replacing one glyph with another using http://macintoshgarden.org/apps/fontastic-plus , ...
(In reply to Mikhail Kukharenko from comment #5)
> I would write the python script to "fix" it
> in my Word 5.0 & Excel 4.0 documents as I have plenty of them.
libmwaw only converts Mac Word 1.0 to 5.0 files and older PowerPoint formats (Mac and PC): I didn't write any other filters because LibreOffice already had filters to open other Office formats.
I'm not sure what Python can easily fix. It seems difficult; basically, the principle would be to compare the tables https://en.wikipedia.org/wiki/Mac_OS_Roman and https://en.wikipedia.org/wiki/Mac_OS_Cyrillic_encoding. For example, if the original file contains the character 0x80, it should normally be converted to А (Cyrillic), but here we get Ä instead. So we'd have to go through the badly converted characters one by one and, wherever we find an Ä, turn it into А, and so on.
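The character-by-character remapping described above can be sketched in Python with the standard `mac_roman` and `mac_cyrillic` codecs. This is only an illustration, not something libmwaw or LibreOffice does: the function name `recover_cyrillic` is hypothetical, and it assumes the wrongly converted text still contains exactly the characters a Mac Roman decoding would produce (any character that LibreOffice mapped differently is left untouched).

```python
def recover_cyrillic(text: str) -> str:
    """Undo a Mac Roman misreading of bytes that were really Mac Cyrillic."""
    out = []
    for ch in text:
        try:
            raw = ch.encode("mac_roman")  # recover the original byte
        except UnicodeEncodeError:
            out.append(ch)  # not a Mac Roman character: leave it unchanged
            continue
        # Bytes below 0x80 are ASCII in both tables; only the upper half differs.
        if raw[0] >= 0x80:
            out.append(raw.decode("mac_cyrillic", errors="replace"))
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    # U+00C4 (Ä) is byte 0x80 in Mac Roman, the example used in the comment.
    print(recover_cyrillic("\u00c4"))
```

Text that is genuinely Latin with accents would be damaged by this remapping, so it only makes sense for runs known to be Cyrillic.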
(In reply to osnola from comment #6)
> - std::string fName("");
> + if (name=="\x8b\xe0\xf2\xe8\xed\xf1\xea\xe8\xe9") return "Times CY";
$ echo -e '\x8b\xe0\xf2\xe8\xed\xf1\xea\xe8\xe9' | iconv -f MACCYRILLIC
Латинский
Using a hex editor, I changed this font name in the binary Word document to "Times CYR", and it opened perfectly with readable Cyrillic text. It also worked when I changed it to "Times CY" and to "Arial CY" (with an extra space in the middle to preserve the string length).
I tried to do the same with Excel 4.0 for Mac documents. An example is attached. All my Excel documents contain two fonts:
1. Латинский (\x8b\xe0\xf2\xe8\xed\xf1\xea\xe8\xe9), a serif font (like Times)
2. Прямой Проп (8FF0DFECEEE9208FF0EEEF), a sans-serif font (like Arial)
I changed the first font name to "Times CYR" again and tried to replace the second font name with "Arial CYR" or with "Arial CY" (to preserve the length), but it did not work (the text opened unreadable). Could you please try to "force open" the attached document?
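The hex-editor workaround above could be scripted roughly as follows. This is a hedged sketch: `replace_font_name` and `patch_file` are hypothetical helper names, and the script blindly replaces every occurrence of the byte sequence, so it would also rewrite the word "Латинский" if it happened to appear in the body text. Work on copies of the files.

```python
OLD_NAME = bytes.fromhex("8be0f2e8edf1eae8e9")  # "Латинский" in Mac Cyrillic
NEW_NAME = b"Times CYR"  # same byte length as OLD_NAME


def replace_font_name(data: bytes, old: bytes, new: bytes) -> bytes:
    """Replace old with new, refusing any change that would shift offsets."""
    if len(old) != len(new):
        raise ValueError("replacement must keep the byte length unchanged")
    return data.replace(old, new)


def patch_file(src: str, dst: str, old: bytes = OLD_NAME, new: bytes = NEW_NAME) -> int:
    """Write a patched copy of src to dst; return the number of replacements."""
    with open(src, "rb") as handle:
        data = handle.read()
    count = data.count(old)
    with open(dst, "wb") as handle:
        handle.write(replace_font_name(data, old, new))
    return count
```

Keeping the length identical matters because the binary format refers to its contents by offset; a shorter or longer name would shift everything after it.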
Created attachment 190911 [details] MS Excel 4.0 document with Cyrillic text
As far as MS Word files are concerned, this is normal. libmwaw recognizes fonts ending in " CY" or " CYR" as Cyrillic fonts. So if you can change the font name in the original file, it will work. As far as the Excel files are concerned, you need to find out which filter is called to open these files (it's not libmwaw: see https://sourceforge.net/p/libmwaw/wiki/Home/ for the list of formats recognised by libmwaw). So it's probably best to open another bug report. Note: I'll also add a special case for the font "Прямой Проп"
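The suffix rule mentioned above can be pictured with a small Python model. It is not libmwaw's code (which is C++); `guess_encoding` and `decode_run` are hypothetical names, and only the " CY" / " CYR" suffixes come from this thread.

```python
CYRILLIC_SUFFIXES = (" CY", " CYR")  # suffixes named in the comment above


def guess_encoding(font_name: str) -> str:
    """Pick a Python codec name from the font name alone."""
    if font_name.endswith(CYRILLIC_SUFFIXES):
        return "mac_cyrillic"
    return "mac_roman"  # assumed default when nothing else is known


def decode_run(raw: bytes, font_name: str) -> str:
    """Decode a run of text bytes according to the font it is set in."""
    return raw.decode(guess_encoding(font_name), errors="replace")


if __name__ == "__main__":
    print(decode_run(b"\x80", "Times CY"))  # byte read as Mac Cyrillic
    print(decode_run(b"\x80", "Times"))     # same byte read as Mac Roman
```

This is why renaming the font inside the file is enough: the text bytes are unchanged, only the table used to interpret them differs.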
I will take inspiration from https://en.wikipedia.org/wiki/Talk:Fonts_on_Macintosh (I) and add
m_nameTranslatedNameMap["\x8b\xe0\xf2\xe8\xed\xf1\xea\xe8\xe9"]="Latinskij";
m_nameTranslatedNameMap["\x8f\xf0\xdf\xec\xee\xe9"]="Priamoj";
m_nameTranslatedNameMap["\x8f\xf0\xdf\xec\xee\xe9\x20\x8f\xf0\xee\xef"]="Priamoj Prop";
m_nameTranslatedNameMap["\x91\xe8\xf1\xf2\xe5\xec\xed\xfb\xe9"]="Sistemnyj";
m_nameTranslatedNameMap["\x80\x90\x91\x8a\xf3\xf0\xfc\xe5\xf0"]="APC Courier";
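To see what the byte strings in this map stand for, they can be decoded with Python's `mac_cyrillic` codec. The dictionary below simply mirrors the five entries above; the names `NAME_TRANSLATIONS` and `translate_font_name` are hypothetical, and the fallback to Mac Roman for unknown names is an assumption made for the sketch.

```python
# Raw font-name bytes as stored in the file -> transliterated name (from the map above).
NAME_TRANSLATIONS = {
    bytes.fromhex("8be0f2e8edf1eae8e9"): "Latinskij",
    bytes.fromhex("8ff0dfeceee9"): "Priamoj",
    bytes.fromhex("8ff0dfeceee9208ff0eeef"): "Priamoj Prop",
    bytes.fromhex("91e8f1f2e5ecedfbe9"): "Sistemnyj",
    bytes.fromhex("8090918af3f0fce5f0"): "APC Courier",
}


def translate_font_name(raw: bytes) -> str:
    """Return the transliterated name, or a Mac Roman reading if unknown."""
    known = NAME_TRANSLATIONS.get(raw)
    return known if known is not None else raw.decode("mac_roman", errors="replace")


if __name__ == "__main__":
    for raw, latin in NAME_TRANSLATIONS.items():
        print(raw.decode("mac_cyrillic"), "->", latin)
```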
Normally, https://sourceforge.net/p/libmwaw/libmwaw/ci/7e583cd8e526a58b4387b4bd4794c511479e3827/ should solve this problem once I release a new version.
(In reply to osnola from comment #14)
> Normally,
> https://sourceforge.net/p/libmwaw/libmwaw/ci/
> 7e583cd8e526a58b4387b4bd4794c511479e3827/ will solve this problem when I
> will release a new version.
Thank you! As suggested, I created a separate ticket for Excel 4.0 for Macintosh: https://bugs.documentfoundation.org/show_bug.cgi?id=158282
(In reply to osnola from comment #9)
> (In reply to Mikhail Kukharenko from comment #5)
> libmwaw only converts Mac Word 1.0 to 5.0 files and older PowerPoint formats
> (Mac and PC): I didn't write any other filters because LibreOffice already
> had filters to open other Office formats.
Which filter (library) does LO use to open Mac Excel 4 documents? We will try to add support for Cyrillic Excel files there. Thank you!
I assume this is one of LibreOffice's Excel filters that reside in core/sc, but I'm not sure. In filter/qa/complex/filter/detection/typeDetection/files.csv, we have:
Excel2 Calc/Excel2.XLS calc_MS_Excel_40 calc_MS_Excel_40
Excel3 Calc/Excel3.XLS calc_MS_Excel_40 calc_MS_Excel_40
Excel4_document Calc/Excel4.XLS calc_MS_Excel_40 calc_MS_Excel_40
Excel4_template Calc/Excel4.XLT calc_MS_Excel_40_VorlageTemplate calc_MS_Excel_40
Exel95_document Calc/Excel5_95.XLS calc_MS_Excel_5095 calc_MS_Excel_5095
Exel95_template Calc/Excel5_template.XLT calc_MS_Excel_5095 calc_MS_Excel_5095
(In reply to Mikhail Kukharenko from comment #15)
> (In reply to osnola from comment #14)
> > Normally,
> > https://sourceforge.net/p/libmwaw/libmwaw/ci/
> > 7e583cd8e526a58b4387b4bd4794c511479e3827/ will solve this problem when I
> > will release a new version.
Could you please check the fonts in the attached file "enduser"? It looks like it contains other Cyrillic fonts.
Created attachment 191063 [details]
MS Word 5.1 for Mac doc with cyrillic text and another fonts
This is another kind of MS Word 5 for Mac document, with a different set of Cyrillic fonts. Could you please check the font names and add support for them? Thank you!
Hello, this file does not contain a font name. This probably means that the system default font (or perhaps the application default font) should be used. I'm not sure how to fix this as I haven't found a way to detect that the file wasn't created on a Mac Roman system but Cyrillic :-~
(In reply to osnola from comment #20)
> I'm not sure how to fix this as I haven't found a way to detect that the
> file wasn't created on a Mac Roman system but Cyrillic :-~
Hello! If we assume that it was created on a Cyrillic system (as it was), is it possible to "force convert" it? Can we do the same trick as before (set the font name inside the binary document)?
Maybe this can help: at offset 0000:0810 I see this:
...NFlRight.normtxt.Punkt.RusN.SuperZagla.Zagolovok.SPage.p1.p2.2-COL.1-COL-SHIFT RIGHT.RUS-JUST.p3.p4ÿ.0.S1.S2.S3.S4.p5.S5.rusC.RusCBoldÿ.qÿÿÿÿÿÿÿÿ............
rusC and RusCBold most probably mean "Russian Courier" and "Russian Courier Bold" respectively; they look like font names. Zagolovok means "Heading" and looks like a style name. SuperZagla means "SuperHeading" and again looks like a style name.
This zone corresponds to the style names:
000802 [Styles(names):*_______******N1=NFlRight,N2=normtxt,N3=Punkt,N4=RusN,N5=SuperZagla,N6=Zagolovok,N7=SPage,N8=p1,N9=p2,N10=2-COL,N11=1-COL-SHIFT RIGHT,N12=RUS-JUST,N13=p3,N14=p4,_N16=0,N17=S1,N18=S2,N19=S3,N20=S4,N21=p5,N22=S5,N23=rusC,N24=RusCBold,_]009800ffffffffffffff000000000000084e466c5269676874076e6f726d7478740550756e6b74045275734e0a53757065725a61676c61095a61676f6c6f766f6b05535061676502703102703205322d434f4c11312d434f4c2d5348494654205249474854085255532d4a555354027033027034ff013002533102533202533302533402703502533504727573430852757343426f6c64ff
Note: I will try to force the conversion by hand at the end of the week.
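The hex dump above reads naturally as a sequence of length-prefixed (Pascal-style) strings, with a 0xff byte where the decoded listing shows "_". The Python sketch below parses the last bytes of that dump under this reading; treating 0xff as "entry not defined" is inferred from the dump, not taken from a format specification, and `parse_pascal_names` is a hypothetical helper.

```python
def parse_pascal_names(blob: bytes, encoding: str = "mac_roman"):
    """Split a run of length-prefixed names; 0xff is treated as an unused slot."""
    names = []
    pos = 0
    while pos < len(blob):
        length = blob[pos]
        pos += 1
        if length == 0xFF:
            names.append(None)  # assumed meaning: no name stored for this slot
            continue
        names.append(blob[pos:pos + length].decode(encoding, errors="replace"))
        pos += length
    return names


if __name__ == "__main__":
    # Final 15 bytes of the dump quoted above.
    tail = bytes.fromhex("04727573430852757343426f6c64ff")
    print(parse_pascal_names(tail))
```

Because these names are plain ASCII, they read the same in Mac Roman and Mac Cyrillic, which is why they were legible in the hex editor even though the document text was not.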
(In reply to osnola from comment #22)
> Note: I will try to force conversion by hand at the end of the week.
Thank you! I will be waiting for it. Maybe it's possible to set the font directly inside the binary Word document...
Created attachment 191199 [details]
The result
Sorry for the delay. I obtained this result by forcing the conversion. To force the conversion, you can use the following patch on libmwaw. I need to install a Russian system to retrieve the numerical identifiers of the supplied Cyrillic fonts and check that other fonts in my collection of files don't use the same identifiers. I'll try to find the time to do it in December; the days are a bit busy at the moment...
--- a/src/lib/MWAWFontConverter.cxx
+++ b/src/lib/MWAWFontConverter.cxx
@@ -1207,6 +1207,8 @@ void State::initMaps()
   m_idNameMap[64640] = "Hiragino MaruGo W3";
   m_idNameMap[64643] = "Hiragino MaruGo W6";
+  // cyrillic font (check me)
+  m_idNameMap[19540] = "Sistemnyj";
   // Windows
   m_idNameMap[101250] = "CP1250";
   m_idNameMap[101251] = "CP1251";
(In reply to osnola from comment #24)
> Created attachment 191199 [details]
> The result
Thank you so much! We will try to apply the patch.
> I need
> to install a Russian system to retrieve the numerical identifier of the
> supplied Cyrillic fonts
I can install a Russian system and check whether it helps. Do you mean LO with Russian settings? On Linux / Windows / Mac?
(In reply to Mikhail Kukharenko from comment #25) > > I can install Russian system and check if it helps. You mean LO with Russian > settings ? On Linux / Windows / Mac ? No, I mean install Mac Os 7.5, 8 or 9 on an old computer or in an emulator. Then use Font Mover, ... to retrieve the identifier of each Cyrillic font and their name.
(In reply to osnola from comment #24)
> Created attachment 191199 [details]
> I obtained this result by forcing the conversion.
We applied this patch to libmwaw, but the conversion was not successful. (We use LibreOffice 7.5.8.2 50(Build:2) on Linux.)
As I understand it:
> + m_idNameMap[19540] = "Sistemnyj";
- the font in my document has no name, but has the id 19540;
- your patch specifies that font #19540 should be treated as Sistemnyj, and for this font name the Cyrillic conversion should be used:
m_convertMap[std::string("Sistemnyj")] = & m_cyrillicConv;
So it should not depend on LibreOffice itself. Which system and LO version do you use to open the file? Or do you perhaps convert it with some tool other than LO? Thank you!
Hello, if you want to recompile LibreOffice, you must also apply the https://sourceforge.net/p/libmwaw/libmwaw/ci/7e583c patch to the libmwaw source. (Otherwise, it will put the name of the unknown font to Sistemnyj but it doesn't know this font and therefore it will assume that the encoding is Mac Roman). Personally, I compile libmwaw outside LibreOffice: see https://www.documentliberation.org/projects/ and https://wiki.documentfoundation.org/DLP/Libraries. Basically, I compile librevenge, libmwaw, libodfgen and writerperfect...
Oops, I see you've applied the previous patch. The most likely cause of your problem is that you have two versions of libmwaw.lib/dll/dylib on your system and LibreOffice is still using the old one.
Please note:
- on an old Mac OS, a font has a name and an identifier. Normally a document contains a table of fonts i: (id, fontname), and later in the document fonts are referenced either as (font i) or as (font with a given id). In a Word file, the latter is used.
- Normally, it's safer to use the font name, because the id can change if the user has used Font Mover, so I prefer to use the name if it's available,
- and yes, you understand the logic: since in this case the file defines no name for font 19540, m_idNameMap[19540] supplies the default value "Sistemnyj", which sets the encoding to Cyrillic.
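The lookup order described in these notes can be modelled in a few lines of Python. This is a simplified illustration, not libmwaw's implementation: `resolve_encoding`, `ID_TO_NAME` and `NAME_TO_ENCODING` are hypothetical names, and only the 19540 -> "Sistemnyj" -> Cyrillic chain comes from this thread.

```python
# Built-in defaults, mirroring the two patches discussed in this thread.
ID_TO_NAME = {19540: "Sistemnyj"}  # id marked "check me" in the patch
NAME_TO_ENCODING = {"Sistemnyj": "mac_cyrillic"}
DEFAULT_ENCODING = "mac_roman"  # assumed fallback for unknown fonts


def resolve_encoding(font_id: int, names_in_file: dict) -> str:
    """Prefer the name stored in the file, then the built-in id table."""
    name = names_in_file.get(font_id) or ID_TO_NAME.get(font_id)
    if name is None:
        return DEFAULT_ENCODING
    return NAME_TO_ENCODING.get(name, DEFAULT_ENCODING)


if __name__ == "__main__":
    # The second sample file stores no font name, so the id table decides.
    print(resolve_encoding(19540, {}))
```

The sketch also shows why both patches are needed: with only the id-to-name entry, "Sistemnyj" would be an unknown name and the lookup would fall back to Mac Roman.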
(In reply to osnola from comment #29)
> Oops, I see you've applied the previous patch.
Thank you so much! We applied your patches and built libmwaw on our side. Now LO 7.5.8.2 (x86_64 Linux) opens all these old Mac documents perfectly.
As for the font ids and font names from the old Mac OS system: I will have an old Mac with Mac OS 9 in January, and I will use Font Mover to find this information and provide it here. I am changing the status to NEEDINFO, as I should provide this information.
As for converting Excel 4 Mac documents, that is covered by the separate ticket. Thank you!
[Automated Action] NeedInfo-To-Unconfirmed
Thanks, so this is pending a new libmwaw release and updating it for LibreOffice.