Bug 48446 - RTF Importer does not honor ansicpgN and cpgN control words -> fails to import some non-English documents properly
Status: RESOLVED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer
Version: 3.5.2 release (earliest affected)
Hardware: Other Windows (All)
Importance: medium normal
Assignee: Miklos Vajna
URL:
Whiteboard: target:3.7.0 target:3.6.1 target:3.5.7
Keywords: filter:rtf
Depends on:
Blocks:
 
Reported: 2012-04-08 18:15 UTC by Mike Kaganski
Modified: 2015-12-17 12:06 UTC (History)

See Also:
Crash report or crash signature:


Attachments
Test file showing this behaviour (8.83 KB, application/msword)
2012-04-08 18:15 UTC, Mike Kaganski
This is how it renders now - only one piece is shown correctly (48.65 KB, application/pdf)
2012-05-06 04:20 UTC, Mike Kaganski
This is how it should be. (52.90 KB, application/pdf)
2012-05-06 04:21 UTC, Mike Kaganski

Description Mike Kaganski 2012-04-08 18:15:18 UTC
Created attachment 59657 [details]
Test file showing this behaviour

When an RTF document contains an \ansicpgN control word in the header just after the \ansi control word, a reader should use this code page for ANSI-to-Unicode conversion wherever no other code page is specified for a text run and Unicode RTF isn't used [1]. When a font definition contains an \fcharsetN control word, it overrides the top-level setting, and when there is a \cpgN, it overrides both the top-level setting and \fcharsetN [2].
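
The precedence described above can be sketched roughly as follows. This is only an illustration, not LibreOffice's implementation: the function name resolve_codec, the small \fcharsetN table, and the assumption that Python ships a codec named "cpN" for the given code page are all mine.

```python
from typing import Optional

# Small, incomplete subset of \fcharsetN values mapped to Python codec names.
# (Assumption for illustration; a real importer needs the full table.)
FCHARSET_TO_CODEC = {
    0: "cp1252",    # ANSI_CHARSET
    161: "cp1253",  # GREEK_CHARSET
    162: "cp1254",  # TURKISH_CHARSET
    204: "cp1251",  # RUSSIAN_CHARSET
    238: "cp1250",  # EASTEUROPE_CHARSET
}


def resolve_codec(ansicpg: Optional[int],
                  fcharset: Optional[int],
                  cpg: Optional[int],
                  fallback: str = "cp1252") -> str:
    """Pick the codec for a text run: \\cpgN, then \\fcharsetN, then \\ansicpgN."""
    if cpg is not None:
        # Font-level \cpgN overrides everything else.
        return f"cp{cpg}"
    if fcharset is not None and fcharset in FCHARSET_TO_CODEC:
        # Font-level \fcharsetN overrides the document-level setting.
        return FCHARSET_TO_CODEC[fcharset]
    if ansicpg is not None:
        # Document-level \ansicpgN from the header.
        return f"cp{ansicpg}"
    # No code page information at all (hypothetical default).
    return fallback
```

For example, resolve_codec(1252, 0, 1251) picks the font-level \cpg1251 over both other settings, while resolve_codec(1251, None, None) falls back to the header's \ansicpg1251.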

Currently, when opening an RTF file that doesn't contain any codepage/charset data, LO defaults to Latin-1 (see Bug 48023). If such a document contains \ansicpgN, or its fonts have \cpgN, LO ignores this information and still uses Latin-1. Only \fcharsetN is taken into account.

The attachment is the test document from Bug 48023, with the missing language information added manually. There is now \ansicpg1251 in the header, as well as \fcharset204 in one font and \cpg1251 in another. As can be seen, only the text using the first of these fonts is displayed properly.
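
To illustrate why the wrong code page garbles the text, here is a small sketch that decodes a run of \'xx hex escapes. The sample run is a made-up example (not taken from the attached file), and decode_hex_escapes is a hypothetical helper that handles only runs consisting of such escapes.

```python
import re

# Matches RTF hex escapes of the form \'xx.
HEX_ESCAPE = re.compile(r"\\'([0-9a-fA-F]{2})")


def decode_hex_escapes(rtf_run: str, codec: str) -> str:
    """Collect the bytes from all \\'xx escapes and decode them with codec."""
    raw = bytes(int(pair, 16) for pair in HEX_ESCAPE.findall(rtf_run))
    return raw.decode(codec)


# Hypothetical sample: the Russian word for "hello" stored as code page 1251 bytes.
run = r"\'cf\'f0\'e8\'e2\'e5\'f2"

print(decode_hex_escapes(run, "cp1251"))   # Привет
print(decode_hex_escapes(run, "latin-1"))  # Ïðèâåò
```

The same bytes give readable Cyrillic with code page 1251 and accented Latin letters with Latin-1, which is the kind of difference visible between the two attached PDFs.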

As for documents that don't contain language information at all (and a great number of such documents, generated by various non-MS software, are out there), I believe that LO should use the user's language (and provide a means of specifying another one on opening, such as a checkbox in the Open dialog saying "Specify missing charset", doing something similar to the Text Encoded filter).

--
1. Word 2007: Rich Text Format (RTF) Specification, version 1.9.1 (http://www.microsoft.com/download/en/details.aspx?id=10725), page 12: Character Set
2. Ibid., pages 17-20.
Comment 1 Urmas 2012-04-09 10:04:16 UTC
Also note that some RTF software stores the font character set incorrectly as ANSI_CHARSET for some national fonts. At least the standard Windows fonts ({Times New Roman|Arial|Courier New}[ CE| Cyr], Japanese, Chinese (Simplified and Traditional), Korean and Thai) should have that field corrected to ensure proper import. The same problem may exist for Arabic/Hebrew documents, which may contain legacy charset values.

Ideally there should be a way for the user to provide a font mapping table that defines the proper charset for custom fonts people may have used.
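
A correction of this kind might look roughly like the sketch below. It covers only the " CE" and " Cyr" name variants mentioned above; the function corrected_charset and the user_map parameter are hypothetical names for illustration and do not describe any existing LibreOffice API.

```python
from typing import Dict, Optional

ANSI_CHARSET = 0
RUSSIAN_CHARSET = 204
EASTEUROPE_CHARSET = 238

# Standard Windows base fonts and the name suffixes that imply a charset.
BASE_FONTS = ("Times New Roman", "Arial", "Courier New")
SUFFIX_TO_CHARSET = {" CE": EASTEUROPE_CHARSET, " Cyr": RUSSIAN_CHARSET}


def corrected_charset(font_name: str,
                      declared: int,
                      user_map: Optional[Dict[str, int]] = None) -> int:
    """Return a corrected charset when a font is declared as ANSI_CHARSET."""
    if declared != ANSI_CHARSET:
        # Trust any explicit non-ANSI declaration.
        return declared
    if user_map and font_name in user_map:
        # A user-supplied mapping table takes priority for custom fonts.
        return user_map[font_name]
    for base in BASE_FONTS:
        for suffix, charset in SUFFIX_TO_CHARSET.items():
            if font_name == base + suffix:
                return charset
    return declared
```

With this sketch, corrected_charset("Arial Cyr", 0) yields 204, plain "Arial" stays at 0, and a custom font can be handled by passing something like {"MyFont": 204} as user_map.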
Comment 2 Jean-Baptiste Faure 2012-05-06 03:49:23 UTC
@Mike: could you please attach a PDF export of your test file that shows how it should look when opened in LibreOffice?

Best regards. JBF
Comment 3 Mike Kaganski 2012-05-06 04:20:18 UTC
Created attachment 61096 [details]
This is how it renders now - only one piece is shown correctly
Comment 4 Mike Kaganski 2012-05-06 04:21:30 UTC
Created attachment 61097 [details]
This is how it should be.

I'm not sure if assigning it to me is a right thing to do...
Comment 5 Jean-Baptiste Faure 2012-05-06 05:33:35 UTC
(In reply to comment #4)
> Created attachment 61097 [details]
> This is how it should be.

Thank you very much for the data.

Hi Miklos, another codepage problem in RTF import. Please, feel free to reassign if you can't handle this bug.

Best regards. JBF
Comment 6 Miklos Vajna 2012-08-10 15:48:16 UTC
Mike,

Thanks for the detailed report. Funny, your test document in Word matches your "how it renders now" PDF, at least here, with an English UI. ;-)

Since 3.5.2, we have implemented a locale-dependent default (so your test document already opens fine if the locale is set to Russian), and \ansicpg has been implemented as well.

And you're right: with the \cpg implementation, the LibreOffice result matches the "how it should be" PDF, even with an English UI. I'll push that patch in a bit.

Miklos
Comment 7 Not Assigned 2012-08-10 16:10:55 UTC
Miklos Vajna committed a patch related to this issue.
It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=f6a24ace5ad12e79f0cc90709a290a30e3758781

fdo#48446 implement RTF_CPG
Comment 8 Miklos Vajna 2012-08-10 16:44:41 UTC
Resolved in master; review requests for -3-6 and -3-5:

https://gerrit.libreoffice.org/386
https://gerrit.libreoffice.org/387
Comment 9 Not Assigned 2012-08-13 17:26:17 UTC
Miklos Vajna committed a patch related to this issue.
It has been pushed to "libreoffice-3-6":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=8054472f666c87d6437dcea064c3cef379916245&g=libreoffice-3-6

fdo#48446 implement RTF_CPG


It will be available in LibreOffice 3.6.1.
Comment 10 Not Assigned 2012-08-13 17:32:39 UTC
Miklos Vajna committed a patch related to this issue.
It has been pushed to "libreoffice-3-5":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=98e895db332446b3fe2fc901a6cf9cff64d2b1b8&g=libreoffice-3-5

fdo#48446 implement RTF_CPG


It will be available in LibreOffice 3.5.7.
Comment 11 Robinson Tryon (qubit) 2015-12-17 12:06:08 UTC
Migrating Whiteboard tags to Keywords: (filter:rtf)
Replace rtf_filter -> filter:rtf.
[NinjaEdit]