Created attachment 176963
test xml file GBK encoded

Steps to Reproduce:
1. Create a new Calc document.
2. Go to Data > XML Source. Click the "Source File" button and select the attached XML file.
   --> Some fields are listed in the box, as expected.
3. Click the field "Fp", then click the "Shrink" button and select a cell. Click "Import".
   --> Garbage content is imported.

Note that the XML file is GBK encoded, and the encoding is correctly declared in its first line: <?xml version="1.0" encoding="GBK"?>
A workaround is to change <?xml version="1.0" encoding="GBK"?> to <?xml version="1.0" encoding="UTF-8"?> and resave the file with UTF-8 encoding. I suspect that resaving as UTF-8 alone is enough, without changing the <?xml version="1.0" encoding="GBK"?> line, but I have not verified this.

I guess the XML Source import is handled by Orcus (?), and Orcus may not support other multi-byte encodings such as GBK, GB2312 etc. If so, we should either:
1. Convert the file content to UTF-8 before it is sent to Orcus for parsing, or
2. Improve Orcus to support encodings other than UTF-8.
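For anyone who needs the workaround on many files, here is a minimal sketch of the conversion in Python. It is only an illustration of the manual steps above, not anything LibreOffice or Orcus does internally. The function name convert_xml_to_utf8 and the default source encoding "gbk" are assumptions made for this example.

```python
import re

# Matches the encoding pseudo-attribute inside an XML declaration,
# e.g. <?xml version="1.0" encoding="GBK"?>
_DECL_ENCODING = re.compile(r'(<\?xml[^>]*?encoding=)(["\'])[^"\']*\2')


def convert_xml_to_utf8(data: bytes, source_encoding: str = "gbk") -> bytes:
    """Decode `data` from `source_encoding` and return UTF-8 bytes.

    The encoding named in the XML declaration (if any) is rewritten to
    UTF-8 so that the declaration matches the new byte content.
    """
    text = data.decode(source_encoding)
    # Keep the original quote style; only the encoding name changes.
    text = _DECL_ENCODING.sub(r"\g<1>\g<2>UTF-8\g<2>", text, count=1)
    return text.encode("utf-8")


if __name__ == "__main__":
    # Hypothetical sample content; the real attachment has more fields.
    sample = '<?xml version="1.0" encoding="GBK"?>\n<Fp>\u53d1\u7968</Fp>'.encode("gbk")
    print(convert_xml_to_utf8(sample).decode("utf-8"))
```

To convert a file on disk, read it in binary mode, pass the bytes to the function, and write the result back in binary mode.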
I am adding Kohei Yoshida to cc: would you please confirm whether this is an Orcus issue?
(In reply to Kevin Suo from comment #2)
> I am adding Kohei Yoshida to cc: would you please confirm whether this is an
> Orcus issue?

Yes, this is an orcus issue.

Orcus itself can theoretically process XML documents whose element contents or attribute values are encoded in something other than UTF-8, since the decoding part is delegated to the handler side (LibreOffice). But the XML element and attribute names themselves must be in UTF-8 (or UTF-16 if a proper byte order mark is given).

Orcus needs to properly pick up the declared encoding type and pass it to the handler during XML mapping, and that *should* in theory fix this issue. The Excel 2003 XML file format import code does support non-Unicode encodings by honoring the declared encoding type, so I just need to apply the same mechanism to the XML mapping code.
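To illustrate the idea of honoring the declared encoding on the handler side, here is a rough Python sketch. It is not the Orcus or LibreOffice code. The names declared_encoding and decode_content are hypothetical, and the sketch assumes the declaration itself is written in ASCII-compatible bytes (so it does not cover UTF-16 input).

```python
import codecs
import re

# The XML declaration, when present, must be at the very start of the document.
_DECL = re.compile(rb'^<\?xml[^>]*?encoding=(["\'])([A-Za-z][A-Za-z0-9._-]*)\1')


def declared_encoding(data: bytes, default: str = "utf-8") -> str:
    """Return the normalized codec name from the XML declaration.

    Falls back to `default` when there is no declaration or when the
    declared name is not a codec Python knows about.
    """
    match = _DECL.match(data)
    if match is None:
        return default
    name = match.group(2).decode("ascii")
    try:
        return codecs.lookup(name).name
    except LookupError:
        return default


def decode_content(raw: bytes, encoding: str) -> str:
    """Decode raw element content or an attribute value (the handler's job)."""
    return raw.decode(encoding)


if __name__ == "__main__":
    doc = b'<?xml version="1.0" encoding="GBK"?><Fp/>'
    enc = declared_encoding(doc)
    print(enc, decode_content("\u53d1\u7968".encode("gbk"), enc))
```

The point of the split is the one described above: the parser only has to report which encoding was declared, and the consumer decodes the content bytes with it.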
Marking as NEW per comment 3.
With the latest version of orcus (0.18.0) now on master and this change https://gerrit.libreoffice.org/c/core/+/146223, this issue should be fixed.
Kohei Yoshida committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/edcfb4632a514c5595540d69f7b217b4a12bac5c

tdf#146260: Add more mapping rules on character encoding

It will be available in 7.6.0.

The patch should be included in the daily builds available at https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More information about daily builds can be found at: https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
(In reply to Kohei Yoshida from comment #5)

Thank you. I am marking this as fixed for now, per your comment.
Xisco Fauli committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/db54b5e778828279394bbe310358e40dac27bf13

tdf#146260: sc: Add UItest

It will be available in 7.6.0.