Bug 48023 - Russian characters in RTF file shown incorrectly in 3.5.1 (shown OK in 3.4.4)
Status: VERIFIED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer
Version: 3.5.1 release (earliest affected)
Hardware: Other All
Importance: medium normal
Assignee: Miklos Vajna
URL:
Whiteboard: target:3.6.0 target:3.5.4
Keywords: filter:rtf, regression
Duplicates: 50775
Depends on:
Blocks:
 
Reported: 2012-03-29 00:01 UTC by Vitaliy Lotorev
Modified: 2015-12-17 12:05 UTC



Attachments
opens correctly in writer 3.4.4 (77.31 KB, image/png)
2012-03-29 00:10 UTC, Vitaliy Lotorev
russian characters show in improper encoding in 3.5.1 (76.10 KB, image/png)
2012-03-29 00:11 UTC, Vitaliy Lotorev
The RTF-file (8.80 KB, application/rtf)
2012-03-29 00:12 UTC, Vitaliy Lotorev

Description Vitaliy Lotorev 2012-03-29 00:01:04 UTC

    
Comment 1 Vitaliy Lotorev 2012-03-29 00:10:53 UTC
Created attachment 59188 [details]
opens correctly in writer 3.4.4
Comment 2 Vitaliy Lotorev 2012-03-29 00:11:32 UTC
Created attachment 59189 [details]
russian characters show in improper encoding in 3.5.1
Comment 3 Vitaliy Lotorev 2012-03-29 00:12:08 UTC
Created attachment 59190 [details]
The RTF-file
Comment 4 Vitaliy Lotorev 2012-03-29 00:12:44 UTC
The RTF sample and example screenshots are attached.
Comment 5 s-joyemusequna 2012-03-29 05:40:01 UTC
Tested with Word 2007, WordPad, OOo 3.1, LibO 3.3.4, LibO 3.4.5, and LibO 3.5.2 RC1 on Windows XP and Vista 64.

All programs display the same error as your second screenshot (attachment 59189).

The RTF file itself is probably not OK.
Comment 6 Urmas 2012-03-30 00:32:05 UTC
The file is indeed very strange: it contains nothing that indicates the language, the ANSI code page, or the font charsets.

But Microsoft Office and WordPad open the file correctly because they default to the system ANSI code page, which is CP1251 in this case.
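
To illustrate why the choice of default code page matters, here is a minimal C++ sketch. It assumes the text is stored as single bytes (RTF writes them as hex escapes such as \'cf) and decodes the same byte once as Windows-1251 and once as Windows-1252. The function names are hypothetical, and only the ranges needed for the illustration are covered.

```cpp
#include <cstdint>

// Decode one byte as Windows-1251. Only ASCII, the basic Russian alphabet
// (0xC0..0xFF) and the two Yo letters (0xA8, 0xB8) are handled; every other
// high byte is reported as U+FFFD. This is an illustration, not a full table.
char32_t decodeCp1251(std::uint8_t b)
{
    if (b < 0x80) return b;                    // ASCII range
    if (b >= 0xC0) return 0x0410 + (b - 0xC0); // U+0410..U+044F
    if (b == 0xA8) return 0x0401;              // CYRILLIC CAPITAL LETTER IO
    if (b == 0xB8) return 0x0451;              // CYRILLIC SMALL LETTER IO
    return 0xFFFD;                             // not covered by this sketch
}

// Decode one byte as Windows-1252. Bytes below 0x80 and in 0xA0..0xFF
// coincide with Latin-1; 0x80..0x9F are left out of this sketch.
char32_t decodeCp1252(std::uint8_t b)
{
    if (b < 0x80 || b >= 0xA0) return b;
    return 0xFFFD;                             // not covered by this sketch
}
```

With these helpers, the byte 0xCF decodes to U+041F (a Cyrillic capital letter) under Windows-1251 but to U+00CF (a Latin letter with diaeresis) under Windows-1252, which is the kind of garbling visible in the second screenshot.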
Comment 7 Urmas 2012-03-30 02:59:47 UTC

*** This bug has been marked as a duplicate of bug 44211 ***
Comment 8 Vitaliy Lotorev 2012-03-30 08:39:43 UTC
I updated LibreOffice 3.4.4 to LibreOffice 3.4.6.
I've just downloaded the attached RTF file and opened it in LibreOffice 3.4.6: it opens OK (as in the first attached screenshot).
By the way, my OS is Fedora 14 x86_64.

s-joyemusequna, I think I see why it opens correctly on my PC but not on yours.

In Writer, under Options->Language Settings->Languages, I have 'Locale setting' set to Russian and 'Western' set to Russian; once I replace Russian with any other language and reopen the RTF file, it shows improper characters (checked in LibreOffice 3.4.6).

On my PC, both 3.4.6 and 3.5.1 have the locale set to Russian.

This looks like a regression:
* The LO 3.4.x RTF import filter honored the locale set in the options and used it when no encoding was specified in the RTF file.
* The LO 3.5.x RTF import filter ignores the locale settings entirely.
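
The 3.4.x behavior described above can be sketched roughly as follows. This is only an illustration in C++ and not the actual import filter code: both function names are hypothetical, the convention that 0 means "no code page declared in the file" is an assumption, and the language table lists just a few well-known Windows ANSI code pages.

```cpp
#include <string>

// Hypothetical helper: map a locale language code to a Windows ANSI code page.
// Only a few languages are listed; anything else falls back to 1252.
int defaultAnsiCodePage(const std::string& language)
{
    if (language == "ru") return 1251; // Cyrillic
    if (language == "el") return 1253; // Greek
    if (language == "tr") return 1254; // Turkish
    if (language == "he") return 1255; // Hebrew
    return 1252;                       // Western European
}

// Hypothetical helper: prefer the code page declared in the RTF file
// (for example through the \ansicpg control word). Assumption: a value of 0
// means the file declared nothing, so the locale decides instead.
int chooseCodePage(int declaredCodePage, const std::string& localeLanguage)
{
    if (declaredCodePage != 0)
        return declaredCodePage;
    return defaultAnsiCodePage(localeLanguage);
}
```

For the attached file nothing is declared, so chooseCodePage(0, "ru") would give 1251, while a non-Russian locale such as "en" would give 1252 and produce the garbled text.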
Comment 9 s-joyemusequna 2012-04-18 09:41:14 UTC
Works with LibO 3.4.5 if the locale is set to Russian. (The file must be reloaded after the locale is set to Russian for it to display correctly.)

Doesn't work with LibO 3.5.2 if the locale is set to Russian and the file is reloaded => REGRESSION.
Comment 10 Miklos Vajna 2012-04-20 08:07:02 UTC
OK, I can reproduce the regression, after doing an 'export LC_ALL=ru_RU.utf8'.
Comment 11 Not Assigned 2012-04-21 03:02:34 UTC
Miklos Vajna committed a patch related to this issue.
It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=f8bda240a09b4ece1c3401874b3cc8f325dbcedb

fdo#48023 fix RTF import of Russian characters without an encoding specified
Comment 12 Miklos Vajna 2012-04-21 03:05:35 UTC
Marking as resolved, fixed in master. Will request cherry-pick to -3-5 a bit later.
Comment 13 Aurimas Fišeras 2012-04-21 12:58:58 UTC
What about default text encodings for other languages?
Currently lcl_getDefaultTextEncoding() looks like a hack.
Isn't there some global function in LO that can return MS Code page for any language?
Comment 14 Not Assigned 2012-04-24 09:31:55 UTC
Miklos Vajna committed a patch related to this issue.
It has been pushed to "libreoffice-3-5":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=c5282a72180ab7747c923cdc2b1988d15d3956b1&g=libreoffice-3-5

fdo#48023 fix RTF import of Russian characters without an encoding specified


It will be available in LibreOffice 3.5.4.
Comment 15 Miklos Vajna 2012-04-25 04:00:06 UTC
Aurimas,

Sure, it's a hack. Text without encoding info is always a hack. ;-) I'm not aware of such a table; if you are, please feel free to share it. The WW8 (.doc) importer could benefit from it as well.
Comment 16 s-joyemusequna 2012-04-25 04:28:36 UTC
I don't know if this is of any use to you (probably Windows specific) - National Language Support (NLS) API Reference (MS Windows):
http://www.microsoft.com/resources/msdn/goglobal/default.mspx
Comment 17 Mike Kaganski 2012-05-06 05:15:11 UTC
Once I had to implement an RTF parser myself for Outlook mails import. There, I had to convert the MS codepage numbers to charset names. Maybe some Windows-specific function could use something similar here?

This is an excerpt of my patch to Thunderbird importer (http://mxr.mozilla.org/comm-central/source/mailnews/import/outlook/src/MapiMessage.cpp#410):

  struct CODEPAGE_TO_CHARSET {
    unsigned long cp;
    const char* charset;
  };

  // This table is based on http://msdn.microsoft.com/en-us/library/dd317756(v=VS.85).aspx#1;
  // Please extend as appropriate. The codepage values are sorted ascending.
  static const CODEPAGE_TO_CHARSET cptocharset[] =
    {
      {37, "IBM037"}, // IBM EBCDIC US-Canada
      {437, "IBM437"}, //OEM United States
      {500, "IBM500"}, //IBM EBCDIC International
      {708, "ASMO-708"}, //Arabic (ASMO 708)
      //709  Arabic (ASMO-449+, BCON V4)
      //710  Arabic - Transparent Arabic
      {720, "DOS-720"}, //Arabic (Transparent ASMO); Arabic (DOS)
      {737, "ibm737"}, // OEM Greek (formerly 437G); Greek (DOS)
      {775, "ibm775"}, // OEM Baltic; Baltic (DOS)
      {850, "ibm850"}, // OEM Multilingual Latin 1; Western European (DOS)
      {852, "ibm852"}, // OEM Latin 2; Central European (DOS)
      {855, "IBM855"}, // OEM Cyrillic (primarily Russian)
      {857, "ibm857"}, // OEM Turkish; Turkish (DOS)
      {858, "IBM00858"}, // OEM Multilingual Latin 1 + Euro symbol
      {860, "IBM860"}, // OEM Portuguese; Portuguese (DOS)
      {861, "ibm861"}, // OEM Icelandic; Icelandic (DOS)
      {862, "DOS-862"}, // OEM Hebrew; Hebrew (DOS)
      {863, "IBM863"}, // OEM French Canadian; French Canadian (DOS)
      {864, "IBM864"}, // OEM Arabic; Arabic (864)
      {865, "IBM865"}, // OEM Nordic; Nordic (DOS)
      {866, "cp866"}, // OEM Russian; Cyrillic (DOS)
      {869, "ibm869"}, // OEM Modern Greek; Greek, Modern (DOS)
      {870, "IBM870"}, // IBM EBCDIC Multilingual/ROECE (Latin 2); IBM EBCDIC Multilingual Latin 2
      {874, "windows-874"}, // ANSI/OEM Thai (same as 28605, ISO 8859-15); Thai (Windows)
      {875, "cp875"}, // IBM EBCDIC Greek Modern
      {932, "shift_jis"}, // ANSI/OEM Japanese; Japanese (Shift-JIS)
      {936, "gb2312"}, // ANSI/OEM Simplified Chinese (PRC, Singapore); Chinese Simplified (GB2312)
      {949, "ks_c_5601-1987"}, // ANSI/OEM Korean (Unified Hangul Code)
      {950, "big5"}, // ANSI/OEM Traditional Chinese (Taiwan; Hong Kong SAR, PRC); Chinese Traditional (Big5)
      {1026, "IBM1026"}, // IBM EBCDIC Turkish (Latin 5)
      {1047, "IBM01047"}, // IBM EBCDIC Latin 1/Open System
      {1140, "IBM01140"}, // IBM EBCDIC US-Canada (037 + Euro symbol); IBM EBCDIC (US-Canada-Euro)
      {1141, "IBM01141"}, // IBM EBCDIC Germany (20273 + Euro symbol); IBM EBCDIC (Germany-Euro)
      {1142, "IBM01142"}, // IBM EBCDIC Denmark-Norway (20277 + Euro symbol); IBM EBCDIC (Denmark-Norway-Euro)
      {1143, "IBM01143"}, // IBM EBCDIC Finland-Sweden (20278 + Euro symbol); IBM EBCDIC (Finland-Sweden-Euro)
      {1144, "IBM01144"}, // IBM EBCDIC Italy (20280 + Euro symbol); IBM EBCDIC (Italy-Euro)
      {1145, "IBM01145"}, // IBM EBCDIC Latin America-Spain (20284 + Euro symbol); IBM EBCDIC (Spain-Euro)
      {1146, "IBM01146"}, // IBM EBCDIC United Kingdom (20285 + Euro symbol); IBM EBCDIC (UK-Euro)
      {1147, "IBM01147"}, // IBM EBCDIC France (20297 + Euro symbol); IBM EBCDIC (France-Euro)
      {1148, "IBM01148"}, // IBM EBCDIC International (500 + Euro symbol); IBM EBCDIC (International-Euro)
      {1149, "IBM01149"}, // IBM EBCDIC Icelandic (20871 + Euro symbol); IBM EBCDIC (Icelandic-Euro)
      {1200, "utf-16"}, // Unicode UTF-16, little endian byte order (BMP of ISO 10646); available only to managed applications
      {1201, "unicodeFFFE"}, // Unicode UTF-16, big endian byte order; available only to managed applications
      {1250, "windows-1250"}, // ANSI Central European; Central European (Windows)
      {1251, "windows-1251"}, // ANSI Cyrillic; Cyrillic (Windows)
      {1252, "windows-1252"}, // ANSI Latin 1; Western European (Windows)
      {1253, "windows-1253"}, // ANSI Greek; Greek (Windows)
      {1254, "windows-1254"}, // ANSI Turkish; Turkish (Windows)
      {1255, "windows-1255"}, // ANSI Hebrew; Hebrew (Windows)
      {1256, "windows-1256"}, // ANSI Arabic; Arabic (Windows)
      {1257, "windows-1257"}, // ANSI Baltic; Baltic (Windows)
      {1258, "windows-1258"}, // ANSI/OEM Vietnamese; Vietnamese (Windows)
      {1361, "Johab"}, // Korean (Johab)
      {10000, "macintosh"}, // MAC Roman; Western European (Mac)
      {10001, "x-mac-japanese"}, // Japanese (Mac)
      {10002, "x-mac-chinesetrad"}, // MAC Traditional Chinese (Big5); Chinese Traditional (Mac)
      {10003, "x-mac-korean"}, // Korean (Mac)
      {10004, "x-mac-arabic"}, // Arabic (Mac)
      {10005, "x-mac-hebrew"}, // Hebrew (Mac)
      {10006, "x-mac-greek"}, // Greek (Mac)
      {10007, "x-mac-cyrillic"}, // Cyrillic (Mac)
      {10008, "x-mac-chinesesimp"}, // MAC Simplified Chinese (GB 2312); Chinese Simplified (Mac)
      {10010, "x-mac-romanian"}, // Romanian (Mac)
      {10017, "x-mac-ukrainian"}, // Ukrainian (Mac)
      {10021, "x-mac-thai"}, // Thai (Mac)
      {10029, "x-mac-ce"}, // MAC Latin 2; Central European (Mac)
      {10079, "x-mac-icelandic"}, // Icelandic (Mac)
      {10081, "x-mac-turkish"}, // Turkish (Mac)
      {10082, "x-mac-croatian"}, // Croatian (Mac)
      // Unicode UTF-32, little endian byte order; available only to managed applications
      // impossible in 8-bit mail
      {12000, "utf-32"},
       // Unicode UTF-32, big endian byte order; available only to managed applications
       // impossible in 8-bit mail
      {12001, "utf-32BE"},
      {20000, "x-Chinese_CNS"}, // CNS Taiwan; Chinese Traditional (CNS)
      {20001, "x-cp20001"}, // TCA Taiwan
      {20002, "x_Chinese-Eten"}, // Eten Taiwan; Chinese Traditional (Eten)
      {20003, "x-cp20003"}, // IBM5550 Taiwan
      {20004, "x-cp20004"}, // TeleText Taiwan
      {20005, "x-cp20005"}, // Wang Taiwan
      {20105, "x-IA5"}, // IA5 (IRV International Alphabet No. 5, 7-bit); Western European (IA5)
      {20106, "x-IA5-German"}, // IA5 German (7-bit)
      {20107, "x-IA5-Swedish"}, // IA5 Swedish (7-bit)
      {20108, "x-IA5-Norwegian"}, // IA5 Norwegian (7-bit)
      {20127, "us-ascii"}, // US-ASCII (7-bit)
      {20261, "x-cp20261"}, // T.61
      {20269, "x-cp20269"}, // ISO 6937 Non-Spacing Accent
      {20273, "IBM273"}, // IBM EBCDIC Germany
      {20277, "IBM277"}, // IBM EBCDIC Denmark-Norway
      {20278, "IBM278"}, // IBM EBCDIC Finland-Sweden
      {20280, "IBM280"}, // IBM EBCDIC Italy
      {20284, "IBM284"}, // IBM EBCDIC Latin America-Spain
      {20285, "IBM285"}, // IBM EBCDIC United Kingdom
      {20290, "IBM290"}, // IBM EBCDIC Japanese Katakana Extended
      {20297, "IBM297"}, // IBM EBCDIC France
      {20420, "IBM420"}, // IBM EBCDIC Arabic
      {20423, "IBM423"}, // IBM EBCDIC Greek
      {20424, "IBM424"}, // IBM EBCDIC Hebrew
      {20833, "x-EBCDIC-KoreanExtended"}, // IBM EBCDIC Korean Extended
      {20838, "IBM-Thai"}, // IBM EBCDIC Thai
      {20866, "koi8-r"}, // Russian (KOI8-R); Cyrillic (KOI8-R)
      {20871, "IBM871"}, // IBM EBCDIC Icelandic
      {20880, "IBM880"}, // IBM EBCDIC Cyrillic Russian
      {20905, "IBM905"}, // IBM EBCDIC Turkish
      {20924, "IBM00924"}, // IBM EBCDIC Latin 1/Open System (1047 + Euro symbol)
      {20932, "EUC-JP"}, // Japanese (JIS 0208-1990 and 0121-1990)
      {20936, "x-cp20936"}, // Simplified Chinese (GB2312); Chinese Simplified (GB2312-80)
      {20949, "x-cp20949"}, // Korean Wansung
      {21025, "cp1025"}, // IBM EBCDIC Cyrillic Serbian-Bulgarian
      //21027  (deprecated)
      {21866, "koi8-u"}, // Ukrainian (KOI8-U); Cyrillic (KOI8-U)
      {28591, "iso-8859-1"}, // ISO 8859-1 Latin 1; Western European (ISO)
      {28592, "iso-8859-2"}, // ISO 8859-2 Central European; Central European (ISO)
      {28593, "iso-8859-3"}, // ISO 8859-3 Latin 3
      {28594, "iso-8859-4"}, // ISO 8859-4 Baltic
      {28595, "iso-8859-5"}, // ISO 8859-5 Cyrillic
      {28596, "iso-8859-6"}, // ISO 8859-6 Arabic
      {28597, "iso-8859-7"}, // ISO 8859-7 Greek
      {28598, "iso-8859-8"}, // ISO 8859-8 Hebrew; Hebrew (ISO-Visual)
      {28599, "iso-8859-9"}, // ISO 8859-9 Turkish
      {28603, "iso-8859-13"}, // ISO 8859-13 Estonian
      {28605, "iso-8859-15"}, // ISO 8859-15 Latin 9
      {29001, "x-Europa"}, // Europa 3
      {38598, "iso-8859-8-i"}, // ISO 8859-8 Hebrew; Hebrew (ISO-Logical)
      {50220, "iso-2022-jp"}, // ISO 2022 Japanese with no halfwidth Katakana; Japanese (JIS)
      {50221, "csISO2022JP"}, // ISO 2022 Japanese with halfwidth Katakana; Japanese (JIS-Allow 1 byte Kana)
      {50222, "iso-2022-jp"}, // ISO 2022 Japanese JIS X 0201-1989; Japanese (JIS-Allow 1 byte Kana - SO/SI)
      {50225, "iso-2022-kr"}, // ISO 2022 Korean
      {50227, "x-cp50227"}, // ISO 2022 Simplified Chinese; Chinese Simplified (ISO 2022)
      //50229  ISO 2022 Traditional Chinese
      //50930  EBCDIC Japanese (Katakana) Extended
      //50931  EBCDIC US-Canada and Japanese
      //50933  EBCDIC Korean Extended and Korean
      //50935  EBCDIC Simplified Chinese Extended and Simplified Chinese
      //50936  EBCDIC Simplified Chinese
      //50937  EBCDIC US-Canada and Traditional Chinese
      //50939  EBCDIC Japanese (Latin) Extended and Japanese
      {51932, "euc-jp"}, // EUC Japanese
      {51936, "EUC-CN"}, // EUC Simplified Chinese; Chinese Simplified (EUC)
      {51949, "euc-kr"}, // EUC Korean
      //51950  EUC Traditional Chinese
      {52936, "hz-gb-2312"}, // HZ-GB2312 Simplified Chinese; Chinese Simplified (HZ)
      {54936, "GB18030"}, // Windows XP and later: GB18030 Simplified Chinese (4 byte); Chinese Simplified (GB18030)
      {57002, "x-iscii-de"}, // ISCII Devanagari
      {57003, "x-iscii-be"}, // ISCII Bengali
      {57004, "x-iscii-ta"}, // ISCII Tamil
      {57005, "x-iscii-te"}, // ISCII Telugu
      {57006, "x-iscii-as"}, // ISCII Assamese
      {57007, "x-iscii-or"}, // ISCII Oriya
      {57008, "x-iscii-ka"}, // ISCII Kannada
      {57009, "x-iscii-ma"}, // ISCII Malayalam
      {57010, "x-iscii-gu"}, // ISCII Gujarati
      {57011, "x-iscii-pa"}, // ISCII Punjabi
      {65000, "utf-7"}, // Unicode (UTF-7)
      {65001, "utf-8"}, // Unicode (UTF-8)
    };
Comment 18 s-joyemusequna 2012-05-18 23:53:44 UTC
Verified with LOdev 3.6 (master - 18-May-2012 02h44 x86@6-fast; Build ID: 8b1d29b) under Windows Vista 64.

Note: the locale must be set to Russian first; see Comment 8.
Comment 19 sasha.libreoffice 2012-08-07 10:42:57 UTC
It works. Thanks for fixing this bug.
Comment 20 Michael Meeks 2012-10-22 14:52:48 UTC
*** Bug 50775 has been marked as a duplicate of this bug. ***
Comment 21 Robinson Tryon (qubit) 2015-12-17 12:05:44 UTC Comment hidden (obsolete)