Created attachment 181293 [details] A sample Excel 2003 XML with double BOM The attachment is from https://www.blackrock.com/il/intermediaries/en/products/264107/ishares-spi-ch-fund/1554770013246.ajax?fileType=xls&fileName=iShares-Core-SPI-ETF-CH_fund&dataType=fund. A user reported on IRC that it can't be opened in Calc, while Excel opens it fine. The UTF-8 XML file has a double BOM - and that makes it actually invalid, since after stripping the first BOM, the rest XML does not begin with XML declaration, but with a zero-width character. See also bug 68742, which was fixed with an improvement in orcus.
Confirm with Version: 7.5.0.0.alpha0+ / LibreOffice Community Build ID: 28daee8a3252e03a67484dc8d3dd26fd73af4826 CPU threads: 4; OS: Linux 5.13; UI render: default; VCL: gtk3 Locale: cs-CZ (cs_CZ.UTF-8); UI: en-US Calc: threaded
Hi Mike! Should this be fixed in LibreOffice or in the sax_parser.hpp of liborcus? template<typename _Handler, typename _Config> void sax_parser<_Handler,_Config>::header() // we don't handle multi byte encodings so we can just skip bom entry if exists. skip_bom(); If the skip_bom function will be adapted, then we could resolve this issue.
(In reply to Andreas Heinisch from comment #2) > Should this be fixed in LibreOffice or in the sax_parser.hpp of liborcus? IMO, the latter. > If the skip_bom function will be adapted, then we could resolve this issue. There's also a call to skip_space_and_control immediately following the skip_bom. IMO, we can't treat *any* character after BOM as BOM; so the "zero-width non-breaking space" (the deprecated meaning of U+FEFF) falls into the "space and control" realm nicely.
Just a side remark sax_parser::header starts with skip_bom(); skip_space_and_control(); if (!has_char() || cur_char() != '<') throw sax::malformed_xml_error("xml file must begin with '<'.", offset()); where skip_space_and_control is documented as "Skip all characters that are 0-32 in ASCII range", and one would assume that after an optional BOM, "space and control" characters are tolerated. before the opening '<'. But skip_bom is implemented as checking that if the first UTF-8 code unit is not ' ' or '<', then the BOM *must* be present, and the '<' *must* immediately follow it. So the skip_bom implementation prevents the correct skip_space_and_control execution, and basically that required the explicit check for space inside it, partially implementing the skip_space_and_control functionality. Just checking three BOM characters without any other checks would seem reasonable to me :)
Fixed in https://gitlab.com/orcus/orcus/-/issues/173 Should we add a test case for libreoffice too?
(In reply to Andreas Heinisch from comment #5) > Should we add a test case for libreoffice too? Yes. The behavior should be kept no matter which changes happen in the libraries we use :)