| Summary: | XML Source: Wrong character set is used, thus the contents mapped to the cells are garbage/messy codes | ||
|---|---|---|---|
| Product: | LibreOffice | Reporter: | Kevin Suo <suokunlong> |
| Component: | Calc | Assignee: | Not Assigned <libreoffice-bugs> |
| Status: | RESOLVED FIXED | ||
| Severity: | normal | CC: | kohei, telesto |
| Priority: | medium | ||
| Version: | 7.4.0.0 alpha0+ | ||
| Hardware: | All | ||
| OS: | Linux (All) | ||
| Whiteboard: | target:7.6.0 | ||
| Crash report or crash signature: | Regression By: | ||
| Bug Depends on: | |||
| Bug Blocks: | 145509 | ||
| Attachments: | test xml file GBK encoded | ||
A workaround is to change <?xml version="1.0" encoding="GBK"?> to <?xml version="1.0" encoding="UTF-8"?> and resave the file with UTF-8 encoding. Actually, I suspect that resaving as UTF-8 alone is enough, with no need to change the <?xml version="1.0" encoding="GBK"?> line.

I guess the XML Source import is handled by Orcus (?), and Orcus may not support other multi-byte encodings such as GBK, GB2312, etc. As a result, we should either:
1. Convert the file content to UTF-8 before it is sent to orcus for parsing, or
2. Improve orcus to support encodings other than UTF-8.

I am adding Kohei Yoshida to cc: would you please confirm whether this is an Orcus issue?

(In reply to Kevin Suo from comment #2)
> I am adding Kohei Yoshida to cc: would you please confirm whether this is an
> Orcus issue?

Yes, this is an orcus issue. Orcus itself can theoretically process XML documents whose element contents or attribute values are encoded in something other than utf-8, since the decoding part is delegated to the handler side (libreoffice). But the XML element and attribute names themselves must be in utf-8 (or utf-16 if a proper byte order mark is given).

Orcus needs to properly pick up the declared encoding type and pass it to the handler during XML mapping, and that *should* in theory fix this issue.

The Excel 2003 XML file format import code does support non-unicode encoding by honoring the declared encoding type, so I just need to apply the same mechanism to the XML mapping code.

With the latest version of orcus (0.18.0) now on master and this change https://gerrit.libreoffice.org/c/core/+/146223, this issue should be fixed.

Kohei Yoshida committed a patch related to this issue. It has been pushed to "master":
https://git.libreoffice.org/core/commit/edcfb4632a514c5595540d69f7b217b4a12bac5c
tdf#146260: Add more mapping rules on character encoding
It will be available in 7.6.0.
The patch should be included in the daily builds available at https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More information about daily builds can be found at: https://wiki.documentfoundation.org/Testing_Daily_Builds
Affected users are encouraged to test the fix and report feedback.

(In reply to Kohei Yoshida from comment #5)
Thank you. I mark this as fixed for now per your comment.

Xisco Fauli committed a patch related to this issue. It has been pushed to "master":
https://git.libreoffice.org/core/commit/db54b5e778828279394bbe310358e40dac27bf13
tdf#146260: sc: Add UItest
It will be available in 7.6.0.
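
For anyone stuck on a build without the fix, the workaround from the first comment (resave as UTF-8 and update the declaration) can be scripted. The following is only a minimal sketch, assuming Python 3 and a codec name that Python recognizes (GBK and GB2312 are). The function name transcode_xml_to_utf8 is hypothetical and is not part of LibreOffice or orcus.

```python
import re

# Matches the encoding pseudo-attribute of an XML declaration in raw bytes.
_DECL_BYTES = re.compile(
    rb'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)
# Same idea on decoded text, with groups kept so the value can be rewritten.
_DECL_TEXT = re.compile(r'(<\?xml[^>]*?encoding\s*=\s*)(["\'])[^"\']*(["\'])')


def transcode_xml_to_utf8(data: bytes, default: str = "utf-8") -> bytes:
    """Hypothetical helper: re-encode an XML document as UTF-8.

    The source encoding is taken from the XML declaration; `default`
    is used when no encoding is declared.
    """
    match = _DECL_BYTES.search(data[:200])
    source = match.group(1).decode("ascii") if match else default
    text = data.decode(source)
    # Rewrite only the declared encoding; the rest of the text is untouched.
    text = _DECL_TEXT.sub(r"\g<1>\g<2>UTF-8\g<3>", text, count=1)
    return text.encode("utf-8")


# Example use with placeholder file names:
#   from pathlib import Path
#   raw = Path("input.xml").read_bytes()
#   Path("output.xml").write_bytes(transcode_xml_to_utf8(raw))
```

The converted file can then be used in Data > XML Source in place of the original.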
Created attachment 176963
test xml file GBK encoded

Steps to Reproduce:
1. New Calc.
2. Data > XML Source. Click the "Source File" button, and use the attached xml file.
--> Some fields are listed in the box, that's good.
3. Click on the field "Fp", then click the "Shrink" button and select a cell. Click "Import".
--> Garbage content imported.

Note that the xml file is GBK encoded, and the encoding is correctly declared in the first line:
<?xml version="1.0" encoding="GBK"?>
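
When triaging a similar report, it helps to confirm that the declared encoding really matches the bytes in the file before blaming the importer. Below is a rough sketch in Python 3. The helper names declared_encoding and decodes_cleanly are made up for illustration. A clean decode does not prove the encoding is right, but a failed one shows a mismatch.

```python
import re

# Encoding pseudo-attribute of an XML declaration at the start of the file.
_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def declared_encoding(data: bytes, default: str = "utf-8") -> str:
    """Hypothetical helper: return the encoding named in the XML declaration."""
    match = _DECL.match(data)
    return match.group(1).decode("ascii") if match else default


def decodes_cleanly(data: bytes, encoding: str) -> bool:
    """Hypothetical helper: True if `data` decodes under `encoding` without error."""
    try:
        data.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        # LookupError covers codec names Python does not know.
        return False
    return True


# Example use with a placeholder file name:
#   from pathlib import Path
#   raw = Path("test.xml").read_bytes()
#   enc = declared_encoding(raw)
#   print(enc, decodes_cleanly(raw, enc), decodes_cleanly(raw, "utf-8"))
```

A file that decodes under its declared encoding but not as UTF-8 is consistent with the situation in this report: the declaration is correct and the reader is not honoring it.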