Bug 141672 - XML source import doesn't support French accented characters in XML tags
Summary: XML source import doesn't support French accented characters in XML tags
Status: RESOLVED NOTOURBUG
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Calc
Version (earliest affected): 7.1.2.2 release
Hardware: All All
Importance: medium normal
Assignee: Not Assigned
URL:
Whiteboard: target:7.2.0 target:7.1.4
Keywords:
Depends on:
Blocks:
 
Reported: 2021-04-13 09:13 UTC by Regis Perdreau
Modified: 2021-10-14 17:31 UTC (History)
CC List: 2 users

See Also:
Crash report or crash signature:


Attachments
original open data file (1.40 KB, text/xml)
2021-04-13 09:14 UTC, Regis Perdreau
XML file without accented characters in tag (1.38 KB, text/xml)
2021-04-13 09:15 UTC, Regis Perdreau

Description Regis Perdreau 2021-04-13 09:13:39 UTC
Description:
If an XML tag contains accented characters, the XML source import cannot open the file.

Steps to Reproduce:
1. Open adil-12202-01-1.xml in Calc: Data -> XML Source, then pick the file.
2. Look at the Map to Document window.

Actual Results:
Nothing is shown in the Map to Document window.

Expected Results:
We should see the XML tree


Reproducible: Always


User Profile Reset: No



Additional Info:
For comparison, open the file adil-12202-01-1-with-accent-in-data-only.xml:
its XML tree is displayed in the Map to Document window.
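
For anyone without the attachments at hand, here is a minimal sketch of a comparable test case. The element names are hypothetical and are not taken from the attached files. Python's standard xml.etree.ElementTree parser is used only to show that a document with accented element names is well-formed XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical element names with accents; NOT copied from the attachments.
SAMPLE = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<données>\n"
    "  <catégorie>Hébergement</catégorie>\n"
    "</données>\n"
)


def parse_sample(text=SAMPLE):
    """Parse the sample as UTF-8 bytes, the way a file on disk would be read."""
    return ET.fromstring(text.encode("utf-8"))


if __name__ == "__main__":
    root = parse_sample()
    # Show the root element name and the names of its children.
    print(root.tag, [child.tag for child in root])
```

Saving SAMPLE to a UTF-8 file and importing it through Data -> XML Source should exercise the same code path, assuming the failure depends only on the accented element names.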
Comment 1 Regis Perdreau 2021-04-13 09:14:41 UTC
Created attachment 171152
original open data file
Comment 2 Regis Perdreau 2021-04-13 09:15:12 UTC
Created attachment 171153
XML file without accented characters in tag
Comment 3 Regis Perdreau 2021-04-15 09:53:08 UTC
An error message in debug version :

sc/source/filter/orcus/xmlcontext.cxx:191: Malformed XML error: malformed_xml_error: name must begin with an alphabet, but got this instead '�' (offset=769)

This seems to be related to the orcus library.
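
One plausible reading of the unprintable character in that message, sketched here as an assumption rather than a confirmed diagnosis: if the file is UTF-8 and the name check looks at a single byte, the first byte of an accented letter is not a valid character on its own. The helper name below is hypothetical.

```python
def first_byte_view(ch):
    """Return the first UTF-8 byte of ch and how that byte decodes on its own."""
    encoded = ch.encode("utf-8")
    first = encoded[0]
    # A lone lead byte is an incomplete sequence; 'replace' maps it to U+FFFD.
    shown = encoded[:1].decode("utf-8", errors="replace")
    return first, shown


if __name__ == "__main__":
    first, shown = first_byte_view("é")
    print(hex(first), repr(shown))
```

For "é" the UTF-8 encoding is the two bytes C3 A9, so a byte-wise check sees 0xC3 first, which is outside the ASCII letters.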
Comment 4 Michael Warner 2021-04-16 02:00:26 UTC
Confirmed in:

Version: 7.1.2.2 / LibreOffice Community
Build ID: 8a45595d069ef5570103caea1b71cc9d82b2aae4
CPU threads: 4; OS: Linux 5.4; UI render: default; VCL: gtk3
Locale: en-US (en_US.UTF-8); UI: en-US
Calc: threaded
Comment 5 Michael Warner 2021-04-19 02:41:17 UTC
> seems orcus lib related.

That error message comes from liborcus sax_parser_base.cpp:338. It looks like that library requires all element and attribute names to be in the US ASCII range, which does not correspond to what the XML specification says. I have filed an issue with that project (see https://gitlab.com/orcus/orcus/-/issues/137).
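
To illustrate the gap between the two rules, here is a small sketch comparing an ASCII-letter-only check with the NameStartChar production of the XML 1.0 (Fifth Edition) grammar. The function names are hypothetical, and the ASCII check is only a stand-in for the behaviour the error message suggests; it is not liborcus code.

```python
# NameStartChar code point ranges from the XML 1.0 (Fifth Edition) grammar.
NAME_START_RANGES = [
    (0x3A, 0x3A), (0x41, 0x5A), (0x5F, 0x5F), (0x61, 0x7A),
    (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0x2FF), (0x370, 0x37D),
    (0x37F, 0x1FFF), (0x200C, 0x200D), (0x2070, 0x218F),
    (0x2C00, 0x2FEF), (0x3001, 0xD7FF), (0xF900, 0xFDCF),
    (0xFDF0, 0xFFFD), (0x10000, 0xEFFFF),
]


def is_xml_name_start(ch):
    """True if ch may start an XML name according to the spec's ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in NAME_START_RANGES)


def is_ascii_letter(ch):
    """Stand-in for an ASCII-only rule (an assumption, not liborcus code)."""
    return "a" <= ch <= "z" or "A" <= ch <= "Z"


if __name__ == "__main__":
    for ch in ("e", "é", "1"):
        print(repr(ch), is_ascii_letter(ch), is_xml_name_start(ch))
```

Under the spec's ranges "é" (U+00E9) is a valid first character of a name, while an ASCII-only rule rejects it; a digit is rejected by both.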
Comment 6 Commit Notification 2021-04-30 09:23:59 UTC
Luboš Luňák committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/6b7c2fa65eb68be520ed4135cc245e33fa22e8bf

allow utf-8 in xml names (liborcus) (tdf#141672)

It will be available in 7.2.0.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 7 Commit Notification 2021-05-02 19:21:24 UTC
Luboš Luňák committed a patch related to this issue.
It has been pushed to "libreoffice-7-1":

https://git.libreoffice.org/core/commit/be4e23da3fe1bcdc1e1ef6982c5f0b47b5efd257

allow utf-8 in xml names (liborcus) (tdf#141672)

It will be available in 7.1.4.
