Bug 34319 - Japanese text is incorrectly exported to Word document.
Summary: Japanese text is incorrectly exported to Word document.
Status: RESOLVED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer
Version: unspecified (earliest affected)
Hardware: Other All
Importance: medium normal
Assignee: Caolán McNamara
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2011-02-15 20:45 UTC by Kohei Yoshida
Modified: 2011-02-24 05:32 UTC
0 users

See Also:
Crash report or crash signature:


Attachments
rough and ready hack at the other part (14.71 KB, patch)
2011-02-23 08:57 UTC, Caolán McNamara
as good as we can do (21.50 KB, application/msword)
2011-02-24 05:28 UTC, Caolán McNamara

Description Kohei Yoshida 2011-02-15 20:45:52 UTC
So, this is the other issue specific to Japanese language that we talked about at FOSDEM.

http://qa.openoffice.org/issues/show_bug.cgi?id=110526
http://qa.openoffice.org/issues/show_bug.cgi?id=110532

Some characters get corrupted when exporting to the Word document.
Comment 1 Caolán McNamara 2011-02-18 07:24:37 UTC
The problem with ooo#110526 will almost certainly be that the characters involved are Latin script but have a script-override property set on them that makes Word treat them as CJK, something we don't have. So the original import would have to do something like examine that property and, if it's set and runs counter to the "natural" script, hard-code the script-override font onto the text. That would likely let it round-trip properly.
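
The import-side idea in the paragraph above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual Writer filter code: `Run`, `natural_script`, `import_run`, and the code point ranges are all hypothetical stand-ins, and the real script classification is far more detailed.

```python
from dataclasses import dataclass, replace
from typing import Optional

# Hypothetical script labels; a real importer distinguishes more classes.
LATIN, CJK = "latin", "cjk"


def natural_script(ch: str) -> str:
    """Rough script guess from the code point (illustrative ranges only)."""
    cp = ord(ch)
    if 0x3040 <= cp <= 0x30FF or 0x4E00 <= cp <= 0x9FFF or 0xFF00 <= cp <= 0xFFEF:
        return CJK
    return LATIN


@dataclass(frozen=True)
class Run:
    """Hypothetical stand-in for an imported text run and its font slots."""
    text: str
    latin_font: str
    cjk_font: str
    script_override: Optional[str] = None  # stand-in for Word's override hint


def import_run(run: Run) -> Run:
    """Hard-code the CJK font when the override contradicts the natural script."""
    if run.script_override != CJK:
        return run  # no override set, nothing to do
    if all(natural_script(ch) == CJK for ch in run.text):
        return run  # the override agrees with the natural script
    # The override runs counter to the natural script: pin the override's
    # font onto the slot that would otherwise format these characters.
    return replace(run, latin_font=run.cjk_font)


if __name__ == "__main__":
    quoted = Run("\u201cabc\u201d", "Times New Roman", "MS Mincho", script_override=CJK)
    print(import_run(quoted).latin_font)
```

The sketch overwrites the Latin font slot rather than storing the hint, which mirrors the "hard-code" approach described above: the hint itself is lost, but the visible font should survive a round trip.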

The problem with ooo#110532 is probably a set of mix-ups between LN_CRgLid0 and LN_CRgLid0_80, with potentially some need to set idcthint here and there.
Comment 2 Caolán McNamara 2011-02-18 08:13:37 UTC
Stage 1 of 2, ooo#110532: mix-ups between LN_CRgLid0/LN_CRgLid0_80 and LN_CRgLid1/LN_CRgLid1_80 and, probably most crucially, incorrect additional export of LN_CRgLid1 alongside LN_CLidBi.

1e141a54c372bf98b7064a3c583dc44bd895c71b filters
233be021cabab6baadfe0eeb686b0f7b6f0a9d36 filters
a632f2a929439a02586540932eee97a90e164a75 writer
63267e756c7db34afab55568c2af4c5b85d57cae writer
Comment 3 Caolán McNamara 2011-02-23 08:57:36 UTC
Created attachment 43719
rough and ready hack at the other part
Comment 4 Caolán McNamara 2011-02-24 04:52:12 UTC
Man, what a mess, 4 font settings in MSWord

0x4A4F: sprmCRgFtc0 ascii text
0x4A50: sprmCRgFtc1 east asian text
0x4A51: sprmCRgFtc2 non-east asian text
0x4A5E: sprmCFtcBi complex text

which correspond to the OfficeOpen equivalents, where...

"The ASCII font formats all characters in the ASCII range (character values 0–127). This font is specified using the ascii attribute on the rFonts element.

The East Asian font formats all characters that belong to Unicode sub ranges for East Asian languages. This font is specified using the eastAsia attribute on the rFonts element.

The complex script font formats all characters that belong to Unicode sub ranges for complex script languages. This font is specified using the cs attribute on the rFonts element.

The high ANSI font formats all characters that belong to Unicode sub ranges other than those explicitly included by one of the groups above. This font is specified using the hAnsi attribute on the rFonts element.
"

But it's not really defined what the breakdown is between Complex and "high ANSI".

Awesome, we have three slots: CJK, Complex and "everything else". It's probable that their "EastAsian" always maps to our "CJK", and that their "Complex" always maps to our "Complex", but how the other two map to ours is vague.
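
To make the four-into-three mismatch concrete, here is a minimal sketch of collapsing Word's font slots into three. The opcodes and sprm names come from the list above; the attribute pairing (in particular sprmCRgFtc2 with `hAnsi`), the slot names `cjk`/`complex`/`other`, and the rule that the ASCII font wins over the high ANSI font are all assumptions made for illustration, since the thread itself says that part of the mapping is vague.

```python
# Word's four character font sprms, as listed in the comment above.
# The attribute names are the OfficeOpen rFonts equivalents; the pairing
# follows the order of the two lists and is an assumption.
WORD_FONT_SPRMS = {
    0x4A4F: ("sprmCRgFtc0", "ascii"),
    0x4A50: ("sprmCRgFtc1", "eastAsia"),
    0x4A51: ("sprmCRgFtc2", "hAnsi"),
    0x4A5E: ("sprmCFtcBi", "cs"),
}


def collapse_fonts(word_fonts):
    """Collapse Word's four font slots into three.

    Returns (fonts, lossy), where lossy is True when the ASCII and high ANSI
    fonts differ, so one of them cannot be represented in the result.
    """
    fonts = {}
    if "eastAsia" in word_fonts:
        fonts["cjk"] = word_fonts["eastAsia"]
    if "cs" in word_fonts:
        fonts["complex"] = word_fonts["cs"]

    ascii_font = word_fonts.get("ascii")
    hansi_font = word_fonts.get("hAnsi")
    # Assumption: both remaining slots land in "everything else", and the
    # ASCII font wins when both are present.
    other = ascii_font if ascii_font is not None else hansi_font
    if other is not None:
        fonts["other"] = other

    lossy = (
        ascii_font is not None
        and hansi_font is not None
        and ascii_font != hansi_font
    )
    return fonts, lossy


if __name__ == "__main__":
    run_fonts = {"ascii": "Arial", "hAnsi": "Century",
                 "eastAsia": "MS Mincho", "cs": "Times New Roman"}
    print(collapse_fonts(run_fonts))
```

The `lossy` flag marks exactly the case the comment complains about: a run whose ASCII and high ANSI fonts differ has no faithful three-slot representation.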

"http://www.eggheadcafe.com/software/aspnet/31531279/latin-fareast-complex-unicode-ranges.aspx"
Comment 5 Caolán McNamara 2011-02-24 05:28:10 UTC
Created attachment 43748
as good as we can do
Comment 6 Caolán McNamara 2011-02-24 05:32:16 UTC
The whole thing is really messy. So I've hacked in taking account of the idcthint script-bias MSWord hack on import, so importing the original .doc and exporting it back out gives the attached output. That's as good as we can do without an idcthint implementation of our own in Writer, which would need someplace to store it in our own documents, which in turn needs some additional file format support in ODF.

4e9fcdafe0bf3287529c253e3725748a5207e19c writer
f4f22dc6d8d50f8373f360bd06c3b9508bd4b07d writer
6a5dcedf3766e32ad798bc66ade617abbe91210b writer
2b7142fcdea35d0d4eb45d5aa2e09b37d5d8951a writer