Bug 146260 - XML Source: Wrong character set is used, thus the contents mapped to the cells are garbage/messy codes
Status: RESOLVED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Calc
Version: 7.4.0.0 alpha0+ (earliest affected)
Hardware: All Linux (All)
Importance: medium normal
Assignee: Not Assigned
URL:
Whiteboard: target:7.6.0
Keywords:
Depends on:
Blocks: orcus_bugs
Reported: 2021-12-16 11:30 UTC by Kevin Suo
Modified: 2023-02-15 14:09 UTC (History)
CC List: 2 users

See Also:
Crash report or crash signature:


Attachments
test xml file GBK encoded (3.36 KB, text/xml)
2021-12-16 11:30 UTC, Kevin Suo

Description Kevin Suo 2021-12-16 11:30:50 UTC
Created attachment 176963
test xml file GBK encoded

Steps to Reproduce:

1. New Calc.

2. Data > XML Source. Click the "Source File" button, and use the attached xml file.

--> Some fields are listed in the box, that's good.

3. Click on the field "Fp", then click the "Shrink" button and select a cell. Click "Import".

--> Garbage content imported.

Note that the XML file is GBK-encoded, and the encoding is correctly declared on the first line:
<?xml version="1.0" encoding="GBK"?>
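
The garbage described above is what one would expect if GBK bytes are read as if they were UTF-8. A minimal Python sketch of that effect follows. It is not the LibreOffice or orcus code path; only the element name "Fp" and the declaration line come from this report, and the text content is a made-up sample.

```python
# Hypothetical sample document: the attachment's real contents are not
# reproduced here, only the declaration line and the "Fp" element name.
SAMPLE_TEXT = "发票"
XML_DOC = f'<?xml version="1.0" encoding="GBK"?>\n<Fp>{SAMPLE_TEXT}</Fp>\n'


def decode_as(data: bytes, encoding: str) -> str:
    """Decode raw file bytes, substituting U+FFFD for undecodable bytes."""
    return data.decode(encoding, errors="replace")


# The bytes as they would be stored on disk in a GBK-encoded file.
gbk_bytes = XML_DOC.encode("gbk")

# Honoring the declared encoding recovers the original text.
correct = decode_as(gbk_bytes, "gbk")

# Ignoring the declaration and assuming UTF-8 garbles the CJK content,
# while the ASCII-only markup survives unchanged.
garbled = decode_as(gbk_bytes, "utf-8")
```

This would also be consistent with step 2 above: the field names are listed correctly because they are plain ASCII, and only the non-ASCII cell content is affected.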
Comment 1 Kevin Suo 2021-12-16 11:35:17 UTC
A workaround is to change the
<?xml version="1.0" encoding="GBK"?>
to
<?xml version="1.0" encoding="UTF-8"?>
and resave the file with UTF-8 encoding.

Actually, I suspect that resaving as UTF-8 alone is enough, without changing the <?xml version="1.0" encoding="GBK"?> line.
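
For anyone who needs to apply this workaround to many files, here is a rough Python sketch. The function names and file paths are hypothetical, and it assumes the source encoding is known in advance (GBK here). It rewrites the declaration as well, which matches the first variant of the workaround.

```python
import re
from pathlib import Path

# Matches the encoding pseudo-attribute inside the XML declaration.
_DECL_ENCODING = re.compile(r'(<\?xml[^>]*?encoding=)(["\'])[^"\']*\2')


def reencode_xml_as_utf8(data: bytes, source_encoding: str = "gbk") -> bytes:
    """Return the document as UTF-8 bytes with the declaration updated."""
    text = data.decode(source_encoding)
    # Replace only the declared encoding name; keep the original quote style.
    text = _DECL_ENCODING.sub(r"\g<1>\g<2>UTF-8\g<2>", text, count=1)
    return text.encode("utf-8")


def convert_file(src: Path, dst: Path, source_encoding: str = "gbk") -> None:
    """File-level wrapper; works on bytes so line endings are preserved."""
    dst.write_bytes(reencode_xml_as_utf8(src.read_bytes(), source_encoding))
```

A call such as `convert_file(Path("test.xml"), Path("test-utf8.xml"))` (placeholder file names) would produce a copy that can then be used as the source file in Data > XML Source.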

I guess the XML Source import is handled by Orcus (?), and Orcus may not support other multi-byte encodings such as GBK or GB2312. If so, we should either:
1. Convert the file content to UTF-8 before it is sent to Orcus for parsing, or
2. Improve Orcus to support encodings other than UTF-8.
Comment 2 Kevin Suo 2021-12-16 11:36:41 UTC
I am adding Kohei Yoshida to cc: would you please confirm whether this is an Orcus issue?
Comment 3 Kohei Yoshida 2021-12-18 15:48:57 UTC
(In reply to Kevin Suo from comment #2)
> I am adding Kohei Yoshida to cc: would you please confirm whether this is an
> Orcus issue?

Yes, this is an orcus issue.

Orcus itself can theoretically process XML documents containing element contents or attribute values encoded in something other than UTF-8, since the decoding part is delegated to the handler side (LibreOffice). But the XML element and attribute names themselves must be in UTF-8 (or UTF-16 if a proper byte order mark is given).

Orcus needs to properly pick up the declared encoding type and pass it to the handler during XML mapping, and that *should* in theory fix this issue. The Excel 2003 XML file format import code does support non-Unicode encodings by honoring the declared encoding type, so I just need to apply the same mechanism to the XML mapping code.
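
The general idea of honoring the declared encoding can be sketched in Python as below. This is only an illustration of the mechanism, not the orcus or LibreOffice implementation; both function names are hypothetical, and the sketch assumes an ASCII-compatible declaration at the very start of the file.

```python
import re

# Encoding name inside the XML declaration, as raw bytes. The declaration
# itself is ASCII-compatible in encodings such as GBK and UTF-8.
_DECL = re.compile(rb'<\?xml[^>]*?encoding=(["\'])([A-Za-z][A-Za-z0-9._-]*)\1')


def declared_encoding(data: bytes, default: str = "utf-8") -> str:
    """Return the encoding named in the XML declaration, or a default."""
    m = _DECL.match(data)
    return m.group(2).decode("ascii") if m else default


def decode_value(raw: bytes, encoding: str) -> str:
    """Handler-side step: decode raw element content with that encoding."""
    return raw.decode(encoding)
```

In this sketch the parser would only need to report the encoding name once, and the handler would then decode each element content or attribute value with it.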
Comment 4 Kevin Suo 2021-12-18 15:57:06 UTC
Marking as NEW per comment 3.
Comment 5 Kohei Yoshida 2023-01-27 03:00:24 UTC
With the latest version of orcus (0.18.0) now on master and this change https://gerrit.libreoffice.org/c/core/+/146223, this issue should be fixed.
Comment 6 Commit Notification 2023-01-27 07:30:47 UTC
Kohei Yoshida committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/edcfb4632a514c5595540d69f7b217b4a12bac5c

tdf#146260: Add more mapping rules on character encoding

It will be available in 7.6.0.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 7 Kevin Suo 2023-01-29 14:08:16 UTC
(In reply to Kohei Yoshida from comment #5)
Thank you. I am marking this as fixed for now, per your comment.
Comment 8 Commit Notification 2023-02-15 14:09:50 UTC
Xisco Fauli committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/db54b5e778828279394bbe310358e40dac27bf13

tdf#146260: sc: Add UItest

It will be available in 7.6.0.
