So, this is the other issue specific to the Japanese language that we talked about at FOSDEM.
Some characters get corrupted when exporting to a Word document.
The problem with ooo#110526 is almost certainly that the chars involved are Latin script, but have a script-override property set on them to make Word treat them as CJK, something we don't have. So the original import would have to examine that prop and, if it's set and runs counter to the "natural" script, hard-code the script-override font onto the text. That would likely let it round-trip properly.
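A minimal sketch of that import-side idea, in Python rather than the actual filter code. Everything here is hypothetical: the `Run` type, the `script_override` field standing in for Word's override property, and the very crude `natural_script` check are illustration only, not Writer's or Word's real data model.

```python
from dataclasses import dataclass
from typing import Dict, Optional

OTHER, CJK = "other", "cjk"


@dataclass
class Run:
    """Hypothetical text run: one font per script slot plus an optional override."""
    text: str
    fonts: Dict[str, str]
    script_override: Optional[str] = None  # assumed stand-in for Word's property


def natural_script(text: str) -> str:
    """Crude guess: kana or CJK unified ideographs mean CJK, anything else 'other'."""
    for ch in text:
        cp = ord(ch)
        if 0x3040 <= cp <= 0x30FF or 0x4E00 <= cp <= 0x9FFF:
            return CJK
    return OTHER


def apply_override_on_import(run: Run) -> Run:
    """If the override contradicts the natural script, hard-code the override
    slot's font onto the natural slot, since the override itself cannot be kept."""
    natural = natural_script(run.text)
    override = run.script_override
    if override and override != natural and override in run.fonts:
        fonts = dict(run.fonts)
        fonts[natural] = run.fonts[override]
        return Run(run.text, fonts, None)
    return run
```

With a Latin run marked as CJK, the CJK font ends up in the slot that Writer would actually use for those characters, which is what should let the text survive a round trip with the same appearance.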
The problem with ooo#110532 is probably a set of mixups with LN_CRgLid0/LN_CRgLid0_80, with potentially some need to set idcthint here and there.
stage 1 of 2: ooo#110532# mixups between LN_CRgLid0/LN_CRgLid0_80, LN_CRgLid1/LN_CRgLid1_80 and, probably most crucially, incorrect additional export of LN_CRgLid1 alongside LN_CLidBi
Created attachment 43719 [details]
rough and ready hack at the other part
Man, what a mess, 4 font settings in MSWord
0x4A4F: sprmCRgFtc0 ascii text
0x4A50: sprmCRgFtc1 east asian text
0x4A51: sprmCRgFtc2 non-east asian text
0x4A5E: sprmCFtcBi complex text
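For reference, the four opcodes above as a lookup table, plus a small decoder for the bit fields packed into a sprm opcode. The field layout (9-bit `ispmd`, 1-bit `fSpec`, 3-bit `sgc`, 3-bit `spra`) is as I understand it from the published binary format documentation; treat the decoder as a sketch, and note that `decode_sprm` and `FONT_SPRMS` are names made up for this example.

```python
# The four font sprms listed above: opcode -> (name, text it applies to).
FONT_SPRMS = {
    0x4A4F: ("sprmCRgFtc0", "ascii text"),
    0x4A50: ("sprmCRgFtc1", "east asian text"),
    0x4A51: ("sprmCRgFtc2", "non-east asian text"),
    0x4A5E: ("sprmCFtcBi", "complex text"),
}


def decode_sprm(opcode):
    """Split a 16-bit sprm opcode into its bit fields (assumed layout)."""
    return {
        "ispmd": opcode & 0x01FF,       # property id within its group
        "fSpec": (opcode >> 9) & 0x1,   # special-handling flag
        "sgc": (opcode >> 10) & 0x7,    # property group; 2 = character
        "spra": (opcode >> 13) & 0x7,   # operand size code; 2 = 2-byte operand
    }


if __name__ == "__main__":
    for opcode, (name, target) in FONT_SPRMS.items():
        print(hex(opcode), name, target, decode_sprm(opcode))
```

All four decode to the character property group with a 2-byte operand, so they are the same kind of setting and differ only in which class of text they apply to.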
which correspond to the Office Open XML equivalents, where...
"The ASCII font formats all characters in the ASCII range (character values 0–127). This font is specified using the ascii attribute on the rFonts element.
The East Asian font formats all characters that belong to Unicode sub ranges for East Asian languages. This font is specified using the eastAsia attribute on the rFonts element.
The complex script font formats all characters that belong to Unicode sub ranges for complex script languages. This font is specified using the cs attribute on the rFonts element.
The high ANSI font formats all characters that belong to Unicode sub ranges other than those explicitly included by one of the groups above. This font is specified using the hAnsi attribute on the rFonts element."
But it's not really defined what the breakdown is between Complex and "high ANSI".
Awesome, we have three slots: CJK, Complex and "everything else". It's probable that their "EastAsian" always maps to our "CJK", and that their "Complex" always maps to our "Complex", but how the other two map to ours is vague.
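To make the four-into-three problem concrete, here is a rough sketch of one possible collapse. The choice to let the ascii font win the "everything else" slot, with hAnsi only as a fallback, is purely an assumption for illustration, since the real breakdown is not defined. `collapse_rfonts` and the slot keys are hypothetical names.

```python
def collapse_rfonts(ascii_font=None, east_asia=None, h_ansi=None, cs=None):
    """Map Word's four font settings onto three slots.

    Assumption: east_asia -> "cjk" and cs -> "complex" are the probable
    mappings; ascii vs. hAnsi competing for "other" is the vague part.
    """
    other = ascii_font or h_ansi
    # If both are set and differ, one of them cannot be represented.
    lost = h_ansi if (h_ansi and h_ansi != other) else None
    return {"other": other, "cjk": east_asia, "complex": cs, "lost": lost}


if __name__ == "__main__":
    print(collapse_rfonts("Arial", "MS Mincho", "Times New Roman", "Mangal"))
```

Whenever `lost` is not `None`, a document that used different ascii and hAnsi fonts cannot round-trip exactly through a three-slot model, whichever of the two is picked.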
Created attachment 43748 [details]
as good as we can do
The whole thing is really messy. So I've hacked in taking account of the idcthint script-bias MSWord hack on import, so importing from the original .doc and exporting back out gives the attached output. That's as good as we can do without an idcthint implementation of our own in Writer, which would need someplace to store it in our own documents, which in turn needs some additional file format support in ODF.