Bug 145117 - XML Source not imported if tags is non-ASCII symbols
Status: RESOLVED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Calc
Version: unspecified (earliest affected)
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assignee: Not Assigned
Duplicates: 141672
Depends on:
Blocks: orcus_bugs
Reported: 2021-10-13 15:56 UTC by Anton Shevtsov
Modified: 2021-11-10 09:51 UTC (History)

Attachments
xml example file (86.77 KB, text/xml)
2021-10-13 15:57 UTC, Anton Shevtsov
Reproducible XML file. Names start with (non-(ascii-alpha)) (118 bytes, text/xml)
2021-10-17 14:52 UTC, himajin100000
NonReproducible XML file. Names start with ascii-alpha (123 bytes, text/xml)
2021-10-17 14:53 UTC, himajin100000

Description Anton Shevtsov 2021-10-13 15:56:42 UTC
Description:
Importing an XML file (main menu: Data -> XML Source) whose tags contain non-ASCII characters does not work.

The XML looks something like this:

<?xml version="1.0" encoding="UTF-8"?>
<Файл ИдФайл="VO_OTKRDAN3_9965_9965_20210729_ffe299da-7d35-4b33-a1f4-0b07f99825ab" ВерсФорм="4.01" ВерсПрог="1.0" ТипИнф="ОТКРДАННЫЕ3" КолДок="144">
  <ИдОтпр>
    <ФИООтв Фамилия="_" Имя="_"/>
  </ИдОтпр>
  <Документ ИдДок="82de4304-7b04-472b-a0f4-01f244e38858" ДатаДок="29.07.2021" ДатаСост="31.12.2020">
    <СведНП НаимОрг="ОБЩЕСТВО С ОГРАНИЧЕННОЙ ОТВЕТСТВЕННОСТЬЮ &quot;ТЕХНИЧЕСКИЙ ЦЕНТР &quot;ВЯТКАКАР&quot;" ИННЮЛ="4345205327"/>
    <СведССЧР КолРаб="4"/>
  </Документ>
  <Документ ИдДок="ea03189e-3855-4db5-a7cc-87ebb70b07a3" ДатаДок="29.07.2021" ДатаСост="31.12.2020">
    <СведНП НаимОрг="ОБЩЕСТВО С ОГРАНИЧЕННОЙ ОТВЕТСТВЕННОСТЬЮ СПЕЦИАЛИЗИРОВАННОЕ ПРОЕКТНОЕ БЮРО &quot;СФЕРА&quot;" ИННЮЛ="3443079196"/>
    <СведССЧР КолРаб="6"/>
  </Документ>
</Файл>

p.s. full file available here - https://pastebin.com/jJK3dttP 

MS Office opens these files perfectly!

Steps to Reproduce:
1. Open LibreOffice.
2. Download the example XML file from https://pastebin.com/jJK3dttP
3. Try to import it via the menu option Data -> XML Source.

Actual Results:
Nothing happens.

Expected Results:
The XML structure is opened.


Reproducible: Always


User Profile Reset: No



Additional Info:
Version: 6.2.8.2
Build ID: 20(Build:2)
CPU threads: 4; OS: Linux 4.19; UI render: default; VCL: gtk3; 
Locale: en-US (en_US.UTF-8); UI-Language: en-US
Calc: threaded

Also tried a 7.x version.
Comment 1 Anton Shevtsov 2021-10-13 15:57:18 UTC
Created attachment 175720 [details]
xml example file
Comment 2 Michael Warner 2021-10-13 21:57:45 UTC
I recall another bug about this being written about 5 months back. I can't remember enough details at the moment to find it, but it was traced to a dependent library and was fixed. 

You don't say what 7.x version you tried, but please try downloading the latest version from https://www.libreoffice.org/download/libreoffice-fresh/ and see if the problem is still there.
Comment 3 Anton Shevtsov 2021-10-14 04:07:18 UTC
Tried 7.0.6.2 from my distro repo and 7.2.1.2 from the official site.
The problem still exists. The XML is not imported (the Import button is disabled).

Version: 7.2.1.2 / LibreOffice Community
Build ID: 87b77fad49947c1441b67c559c339af8f3517e22
CPU threads: 12; OS: Linux 5.10; UI render: default; VCL: gtk3
Locale: ru-RU (ru_RU.UTF-8); UI: en-US
Calc: threaded

Version: 7.0.6.2
Build ID: 00(Build:2)
CPU threads: 12; OS: Linux 5.10; UI render: default; VCL: gtk3
Locale: ru-RU (ru_RU.UTF-8); UI: ru-RU
Calc: threaded
Comment 4 himajin100000 2021-10-14 09:25:58 UTC
https://opengrok.libreoffice.org/xref/core/sc/source/ui/xmlsource/xmlsourcedlg.cxx?r=3b8e53f6#191

possible cause:

https://gitlab.com/orcus/orcus/-/blob/master/CHANGELOG

>orcus <next version>
>* sax parser
>* utf-8 names are now allowed as element names.
Comment 5 himajin100000 2021-10-14 11:55:16 UTC
slightly different error message.

warn:sc.orcus:13376:15364:sc/source/filter/orcus/xmlcontext.cxx:191: Malformed XML error: malformed_xml_error: expected an alphabet. (offset=39)

Version: 7.3.0.0.alpha0+ (x64) / LibreOffice Community
Build ID: 5b2848413883565c48d312c96daf8fbca25405d8
CPU threads: 4; OS: Windows 10.0 Build 19042; UI render: default; VCL: win
Locale: ja-JP (ja_JP); UI: en-US
Calc: CL
Comment 6 Roman Kuznetsov 2021-10-14 22:31:31 UTC
Confirm in

Version: 7.3.0.0.alpha0+ (x64) / LibreOffice Community
Build ID: 17d3cacfb9675268e709cfc95771ad4ce8bde75a
CPU threads: 4; OS: Windows 6.1 Service Pack 1 Build 7601; UI render: Skia/Raster; VCL: win
Locale: ru-RU (ru_RU); UI: en-US
Calc: CL

So let's hope an orcus library update can solve it =)
Comment 7 himajin100000 2021-10-17 14:52:24 UTC
Created attachment 175792 [details]
Reproducible XML file. Names start with (non-(ascii-alpha))
Comment 8 himajin100000 2021-10-17 14:53:07 UTC
Created attachment 175793 [details]
NonReproducible XML file. Names start with ascii-alpha
Comment 10 Kevin Suo 2021-10-18 02:19:22 UTC
>orcus <next version>
>* sax parser
>* utf-8 names are now allowed as element names.

So this issue may be fixed after we upgrade orcus to the <next version>?
Comment 11 himajin100000 2021-10-18 02:51:44 UTC
>So this issue may be fixed after we upgrade orcus to the <next version>

When I posted comment 4, I thought so, but I was wrong.

LibreOffice master already has a patch, merged in April:
https://gerrit.libreoffice.org/c/core/+/114892
It is similar to the upstream so-called fix:
https://gitlab.com/orcus/orcus/-/commit/2c2215e94bd8fce4b9a93e986339aa6ae06d2cba

So I thought the bug was supposed to be fixed. I tested on the latest master, and unfortunately the bug was still reproducible.

I continued the investigation on my own and finally found the culprit, as indicated in comment 9.

--
sax_parser<_Handler,_Config>::element()
https://gitlab.com/orcus/orcus/-/blob/master/include/orcus/sax_parser.hpp#L225
calls
 
sax_parser<_Handler,_Config>::element_open(std::ptrdiff_t begin_pos)
at https://gitlab.com/orcus/orcus/-/blob/master/include/orcus/sax_parser.hpp#L245 IF NO EXCEPTION IS THROWN,

https://gitlab.com/orcus/orcus/-/blob/master/include/orcus/sax_parser.hpp#L255

and then calls 
parser_base::element_name(parser_element& elem, std::ptrdiff_t begin_pos)
https://gitlab.com/orcus/orcus/-/blob/master/include/orcus/sax_parser.hpp#L255
https://gitlab.com/orcus/orcus/-/blob/master/src/parser/sax_parser_base.cpp#L394
which calls

parser_base::name(std::string_view& str)
https://gitlab.com/orcus/orcus/-/blob/master/src/parser/sax_parser_base.cpp#L333
--

The patch mainly focuses on parser_base::name(std::string_view& str),
but the culprit comes even before that: THE EXCEPTION WAS THROWN.
Comment 12 Kevin Suo 2021-10-19 12:12:26 UTC
The problem may be in:
include/orcus/sax_parser.hpp

template<typename _Handler, typename _Config>
void sax_parser<_Handler,_Config>::element()
{
    assert(cur_char() == '<');
    std::ptrdiff_t pos = offset();
    char c = next_char_checked();
    switch (c)
    {
        case '/':
            element_close(pos);
        break;
        case '!':
            special_tag();
        break;
        case '?':
            declaration(nullptr);
        break;
        default:
            if (!is_alpha(c) && c != '_')
                throw sax::malformed_xml_error("expected an alphabet.", offset());
            element_open(pos);
    }
}

The default clause checks whether the current char is an ASCII alpha. However, for tags made of non-ASCII characters (e.g. CJK or Cyrillic) this is not true, because the char may be one byte of a multi-byte UTF-8 sequence. In my testing the value of such c is < 0. In such a case, it should continue reading until it finds the closing ">" of the tag.

See my patch for the other bug at
https://gerrit.libreoffice.org/c/core/+/123727
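
To make the point about negative char values concrete, here is a minimal sketch of the byte-level situation. It is not the orcus code or the actual fix: the helper names is_ascii_alpha, is_utf8_lead_byte and may_start_name are hypothetical, and accepting any UTF-8 lead byte is a simplifying assumption (the XML specification restricts name start characters further). Note also that a byte such as 0xD0 is only negative on platforms where plain char is signed; casting to unsigned char avoids depending on that.

```cpp
#include <cassert>

// Hypothetical stand-in for an ASCII-only alphabet check (not orcus's is_alpha).
inline bool is_ascii_alpha(char c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}

// True for the first byte of a multi-byte UTF-8 sequence (0xC2..0xF4).
// Continuation bytes (0x80..0xBF) are rejected on purpose.
inline bool is_utf8_lead_byte(char c)
{
    const unsigned char u = static_cast<unsigned char>(c);
    return u >= 0xC2 && u <= 0xF4;
}

// Assumed relaxed gate for the first byte of an element name:
// ASCII letter, underscore, or the start of a multi-byte UTF-8 character.
inline bool may_start_name(char c)
{
    return is_ascii_alpha(c) || c == '_' || is_utf8_lead_byte(c);
}
```

With these helpers, the first byte of "Файл" (0xD0, the lead byte of U+0424) fails is_ascii_alpha but passes may_start_name, while a digit or a stray continuation byte is still rejected.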
Comment 13 Kevin Suo 2021-10-19 12:29:35 UTC
And the old patch adding UTF-8 support seems to have addressed only names like the following:
<?xml version="1.0" encoding="UTF-8"?>
<Myšička jméno="Žužla">
   <Nožičky>4</Nožičky>
</Myšička>

in which the first char is still an ASCII [a-zA-Z] alpha (M, N). That is why the test in that patch can pass.
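
The following miniature illustrates why such a test can pass while this bug remains. It is a sketch under stated assumptions, not orcus's implementation: open_tag_name, scan_name and ascii_alpha are hypothetical names, and the name scanner simply tolerates every byte >= 0x80 as a stand-in for the earlier UTF-8 patch. The important part is the order: the ASCII-only gate on the first byte runs before the UTF-8-tolerant scanner is ever reached.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <string_view>

// Hypothetical ASCII-only alphabet check.
inline bool ascii_alpha(char c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}

// Stage 2 (assumed): scan a name starting at pos, tolerating any byte >= 0x80.
inline std::string scan_name(std::string_view s, std::size_t pos)
{
    std::size_t end = pos;
    while (end < s.size())
    {
        const unsigned char u = static_cast<unsigned char>(s[end]);
        const bool name_byte = u >= 0x80 || ascii_alpha(s[end])
            || (u >= '0' && u <= '9') || u == '_' || u == '-' || u == '.';
        if (!name_byte)
            break;
        ++end;
    }
    return std::string(s.substr(pos, end - pos));
}

// Stage 1: the ASCII-only gate on the first byte after '<' runs first,
// so a name starting with a non-ASCII character never reaches scan_name.
inline std::string open_tag_name(std::string_view tag)
{
    if (tag.size() < 2 || tag[0] != '<')
        throw std::invalid_argument("expected '<' followed by a name");
    const char c = tag[1];
    if (!ascii_alpha(c) && c != '_')
        throw std::runtime_error("expected an alphabet.");
    return scan_name(tag, 1);
}
```

In this sketch, open_tag_name accepts a tag such as <Myšička> because its first byte is the ASCII letter M, whereas <Файл> throws at stage 1 regardless of how tolerant the scanner is.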
Comment 14 Kevin Suo 2021-10-26 05:38:48 UTC
This issue has been fixed upstream in Orcus
https://gitlab.com/orcus/orcus/-/issues/143

However, the orcus version currently used by LibreOffice is still old. We need to either prepare a patch within LibreOffice or upgrade orcus to the pending 0.17.
Comment 16 Kevin Suo 2021-11-03 23:55:37 UTC
Resolved in LibreOffice master via Kohei's upgrade of the LO orcus version to 0.17.0 in commit eb07a0e76.

Mark as RESOLVED FIXED.
Comment 17 Kevin Suo 2021-11-04 11:02:01 UTC
*** Bug 141672 has been marked as a duplicate of this bug. ***