Description:
Importing an XML file (main menu - Data - XML Source) that has non-ASCII tag names does not work.

The XML looks something like this:

<?xml version="1.0" encoding="UTF-8"?>
<Файл ИдФайл="VO_OTKRDAN3_9965_9965_20210729_ffe299da-7d35-4b33-a1f4-0b07f99825ab" ВерсФорм="4.01" ВерсПрог="1.0" ТипИнф="ОТКРДАННЫЕ3" КолДок="144">
  <ИдОтпр>
    <ФИООтв Фамилия="_" Имя="_"/>
  </ИдОтпр>
  <Документ ИдДок="82de4304-7b04-472b-a0f4-01f244e38858" ДатаДок="29.07.2021" ДатаСост="31.12.2020">
    <СведНП НаимОрг="ОБЩЕСТВО С ОГРАНИЧЕННОЙ ОТВЕТСТВЕННОСТЬЮ "ТЕХНИЧЕСКИЙ ЦЕНТР "ВЯТКАКАР"" ИННЮЛ="4345205327"/>
    <СведССЧР КолРаб="4"/>
  </Документ>
  <Документ ИдДок="ea03189e-3855-4db5-a7cc-87ebb70b07a3" ДатаДок="29.07.2021" ДатаСост="31.12.2020">
    <СведНП НаимОрг="ОБЩЕСТВО С ОГРАНИЧЕННОЙ ОТВЕТСТВЕННОСТЬЮ СПЕЦИАЛИЗИРОВАННОЕ ПРОЕКТНОЕ БЮРО "СФЕРА"" ИННЮЛ="3443079196"/>
    <СведССЧР КолРаб="6"/>
  </Документ>
</Файл>

P.S. The full file is available here - https://pastebin.com/jJK3dttP

MS Office opens this file without problems.

Steps to Reproduce:
1. Open LO.
2. Download the example XML file from https://pastebin.com/jJK3dttP
3. Try to import it via the menu option Data -> XML Source.

Actual Results:
Nothing happens.

Expected Results:
The XML structure is opened.

Reproducible: Always

User Profile Reset: No

Additional Info:
Version: 6.2.8.2
Build ID: 20(Build:2)
CPU threads: 4; OS: Linux 4.19; UI render: default; VCL: gtk3;
Locale: en-US (en_US.UTF-8); UI-Language: en-US
Calc: threaded

Also tried a 7.x version.
Created attachment 175720 [details] xml example file
I recall another bug about this being written about 5 months back. I can't remember enough details at the moment to find it, but it was traced to a dependent library and was fixed. You don't say what 7.x version you tried, but please try downloading the latest version from https://www.libreoffice.org/download/libreoffice-fresh/ and see if the problem is still there.
Tried 7.0.6.2 from my distro repo and 7.2.1.2 from the official site. The problem still exists: the XML is not imported (the Import button is disabled).

Version: 7.2.1.2 / LibreOffice Community
Build ID: 87b77fad49947c1441b67c559c339af8f3517e22
CPU threads: 12; OS: Linux 5.10; UI render: default; VCL: gtk3
Locale: ru-RU (ru_RU.UTF-8); UI: en-US
Calc: threaded

Version: 7.0.6.2
Build ID: 00(Build:2)
CPU threads: 12; OS: Linux 5.10; UI render: default; VCL: gtk3
Locale: ru-RU (ru_RU.UTF-8); ИП: ru-RU
Calc: threaded
https://opengrok.libreoffice.org/xref/core/sc/source/ui/xmlsource/xmlsourcedlg.cxx?r=3b8e53f6#191

Possible cause: https://gitlab.com/orcus/orcus/-/blob/master/CHANGELOG
> orcus <next version>
> * sax parser
>   * utf-8 names are now allowed as element names.
Slightly different error message:

warn:sc.orcus:13376:15364:sc/source/filter/orcus/xmlcontext.cxx:191: Malformed XML error: malformed_xml_error: expected an alphabet. (offset=39)

Version: 7.3.0.0.alpha0+ (x64) / LibreOffice Community
Build ID: 5b2848413883565c48d312c96daf8fbca25405d8
CPU threads: 4; OS: Windows 10.0 Build 19042; UI render: default; VCL: win
Locale: ja-JP (ja_JP); UI: en-US
Calc: CL
Confirmed in

Version: 7.3.0.0.alpha0+ (x64) / LibreOffice Community
Build ID: 17d3cacfb9675268e709cfc95771ad4ce8bde75a
CPU threads: 4; OS: Windows 6.1 Service Pack 1 Build 7601; UI render: Skia/Raster; VCL: win
Locale: ru-RU (ru_RU); UI: en-US
Calc: CL

So let's hope that updating the orcus library solves it =)
Created attachment 175792 [details] Reproducible XML file. Names start with (non-(ascii-alpha))
Created attachment 175793 [details] NonReproducible XML file. Names start with ascii-alpha
https://gitlab.com/orcus/orcus/-/blob/master/include/orcus/sax_parser.hpp#L244
https://gitlab.com/orcus/orcus/-/blob/master/include/orcus/sax_parser.hpp#L252

There may be more is_alpha() checks like these in non-element-related code.
> orcus <next version>
> * sax parser
>   * utf-8 names are now allowed as element names.

So this issue may be fixed after we upgrade orcus to the <next version>?
> So this issue may be fixed after we upgrade orcus to the <next version>

When I posted comment 4, I thought so, but I was wrong.

LibreOffice master already has a patch, merged in April, https://gerrit.libreoffice.org/c/core/+/114892 , which is similar to the upstream so-called fix https://gitlab.com/orcus/orcus/-/commit/2c2215e94bd8fce4b9a93e986339aa6ae06d2cba , so I thought the bug was supposed to be fixed. I tested on the latest master, and unfortunately the bug was still reproducible. I continued investigating on my own and finally found the culprit, as indicated in comment 9.

--
sax_parser<_Handler,_Config>::element()
https://gitlab.com/orcus/orcus/-/blob/master/include/orcus/sax_parser.hpp#L225
calls sax_parser<_Handler,_Config>::element_open(std::ptrdiff_t begin_pos) at
https://gitlab.com/orcus/orcus/-/blob/master/include/orcus/sax_parser.hpp#L245
IF NO EXCEPTION IS THROWN,
and that then calls parser_base::element_name(parser_element& elem, std::ptrdiff_t begin_pos)
https://gitlab.com/orcus/orcus/-/blob/master/include/orcus/sax_parser.hpp#L255
https://gitlab.com/orcus/orcus/-/blob/master/src/parser/sax_parser_base.cpp#L394
, which calls parser_base::name(std::string_view& str)
https://gitlab.com/orcus/orcus/-/blob/master/src/parser/sax_parser_base.cpp#L333
--

The patch mainly focuses on parser_base::name(std::string_view& str), but the culprit comes even before that: THE EXCEPTION WAS THROWN.
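The call order described above can be modelled with a toy parser. This is only a sketch of the control flow, not orcus code: toy_parser, its members and the std::runtime_error type are hypothetical stand-ins, and name() here simply accepts any bytes up to a delimiter to mimic a name reader that already tolerates UTF-8.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string_view>

// Hypothetical stand-in for the parser; not the orcus class.
struct toy_parser
{
    std::string_view input;
    bool name_reached = false;

    // Mimics a UTF-8 tolerant name reader: takes every byte up to a
    // space, '/' or '>' and does not care whether the bytes are ASCII.
    std::string_view name(std::size_t pos)
    {
        name_reached = true;
        std::size_t end = input.find_first_of(" />", pos);
        if (end == std::string_view::npos)
            end = input.size();
        return input.substr(pos, end - pos);
    }

    // Mimics the dispatcher: the ASCII-only guard runs before name().
    std::string_view element()
    {
        if (input.size() < 2 || input[0] != '<')
            throw std::runtime_error("expected '<'");

        const char c = input[1];
        const bool ascii_alpha =
            (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');

        if (!ascii_alpha && c != '_')
            throw std::runtime_error("expected an alphabet.");

        return name(1);
    }
};
```

With an input such as "<Document>" the name reader is reached; with a tag whose first byte is non-ASCII the guard throws first, so whatever the name reader accepts never comes into play.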
The problem may be in include/orcus/sax_parser.hpp:

template<typename _Handler, typename _Config>
void sax_parser<_Handler,_Config>::element()
{
    assert(cur_char() == '<');
    std::ptrdiff_t pos = offset();
    char c = next_char_checked();
    switch (c)
    {
        case '/':
            element_close(pos);
            break;
        case '!':
            special_tag();
            break;
        case '?':
            declaration(nullptr);
            break;
        default:
            if (!is_alpha(c) && c != '_')
                throw sax::malformed_xml_error("expected an alphabet.", offset());
            element_open(pos);
    }
}

The default clause checks whether the current char is an ASCII letter. However, for tags written in other scripts, e.g. CJK, this is not true, because the char may be one byte of a multi-byte character sequence. In my testing the value of such a c is < 0. In such a case, the parser should continue reading until it finds the closing ">" of the tag.

See my patch for the other bug at https://gerrit.libreoffice.org/c/core/+/123727
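To make the byte-level point concrete, here is a minimal sketch. It is neither the orcus implementation nor the upstream fix: is_ascii_alpha, strict_guard, is_utf8_non_ascii and may_start_name are hypothetical names, and the relaxed check is deliberately looser than the XML NameStartChar rule (it lets any non-ASCII byte through and leaves validation to later stages).

```cpp
#include <string_view>

// ASCII letters only. An assumption about what the quoted guard
// effectively accepts; the real is_alpha() may be written differently.
constexpr bool is_ascii_alpha(char c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}

// The condition from the quoted default clause, restated as a predicate.
constexpr bool strict_guard(char c)
{
    return is_ascii_alpha(c) || c == '_';
}

// True for any byte with the high bit set, i.e. any byte that belongs to
// a multi-byte UTF-8 sequence. The cast avoids depending on whether
// plain char is signed on the platform.
constexpr bool is_utf8_non_ascii(char c)
{
    return (static_cast<unsigned char>(c) & 0x80) != 0;
}

// Hypothetical relaxed guard: also lets non-ASCII bytes start a name.
constexpr bool may_start_name(char c)
{
    return strict_guard(c) || is_utf8_non_ascii(c);
}
```

For example, the first byte of "Файл" in UTF-8 is 0xD0. Where plain char is signed that byte reads as a negative value, which matches the "< 0" observation; strict_guard rejects it, while may_start_name accepts it.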
And the old patch that added UTF-8 support seems to have addressed only names like the following:

<?xml version="1.0" encoding="UTF-8"?>
<Myšička jméno="Žužla">
  <Nožičky>4</Nožičky>
</Myšička>

in which the first character of each element name is still an ASCII [a-zA-Z] letter (M, N). That is why the test in that patch passes.
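The difference between the two kinds of names shows up in the first byte alone. A small sketch, with hypothetical helper names (utf8_sequence_length, starts_with_ascii) that are not part of orcus; it classifies lead bytes by bit pattern only and does not validate the rest of the sequence.

```cpp
#include <cstddef>
#include <string_view>

// Length in bytes of the UTF-8 sequence introduced by `lead`, judged by
// bit pattern only. Returns 0 for a continuation byte (10xxxxxx) or any
// other byte that cannot start a sequence.
constexpr std::size_t utf8_sequence_length(unsigned char lead)
{
    if (lead < 0x80)
        return 1; // 0xxxxxxx: plain ASCII
    if ((lead & 0xE0) == 0xC0)
        return 2; // 110xxxxx
    if ((lead & 0xF0) == 0xE0)
        return 3; // 1110xxxx
    if ((lead & 0xF8) == 0xF0)
        return 4; // 11110xxx
    return 0;
}

// True when the first character of the name is a single ASCII byte.
constexpr bool starts_with_ascii(std::string_view name)
{
    return !name.empty()
        && utf8_sequence_length(static_cast<unsigned char>(name.front())) == 1;
}
```

"Myšička" starts with the single byte 'M', so an ASCII-only check on the first byte is satisfied. "Файл" starts with a two-byte sequence (0xD0 0xA4), and a CJK name would typically start with a three-byte one, so the same check fails before the rest of the name is looked at.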
This issue has been fixed upstream in orcus: https://gitlab.com/orcus/orcus/-/issues/143

However, the orcus version currently used by LibreOffice is still old. We need to either prepare a patch within LibreOffice or upgrade orcus to the pending 0.17.
https://gerrit.libreoffice.org/c/core/+/124573
Resolved in LibreOffice master via Kohei's upgrade of the bundled orcus version to 0.17.0 in commit eb07a0e76. Marking as RESOLVED FIXED.
*** Bug 141672 has been marked as a duplicate of this bug. ***