Bug 150023 - Double BOM (invalid XML) prevents importing some Excel 2003 XML (Excel reads them)
Summary: Double BOM (invalid XML) prevents importing some Excel 2003 XML (Excel reads ...
Status: RESOLVED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Calc (show other bugs)
Version:
(earliest affected)
unspecified
Hardware: All All
: medium normal
Assignee: Kohei Yoshida
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: MSO-XML2003
  Show dependency treegraph
 
Reported: 2022-07-17 12:58 UTC by Mike Kaganski
Modified: 2023-02-07 07:28 UTC (History)
4 users (show)

See Also:
Crash report or crash signature:


Attachments
A sample Excel 2003 XML with double BOM (1.49 MB, application/vnd.ms-excel)
2022-07-17 12:58 UTC, Mike Kaganski
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Mike Kaganski 2022-07-17 12:58:14 UTC
Created attachment 181293 [details]
A sample Excel 2003 XML with double BOM

The attachment is from https://www.blackrock.com/il/intermediaries/en/products/264107/ishares-spi-ch-fund/1554770013246.ajax?fileType=xls&fileName=iShares-Core-SPI-ETF-CH_fund&dataType=fund. A user reported on IRC that it can't be opened in Calc, while Excel opens it fine.

The UTF-8 XML file has a double BOM - and that makes it actually invalid, since after stripping the first BOM, the rest XML does not begin with XML declaration, but with a zero-width character.

See also bug 68742, which was fixed with an improvement in orcus.
Comment 1 raal 2022-07-18 20:36:48 UTC
Confirm with Version: 7.5.0.0.alpha0+ / LibreOffice Community
Build ID: 28daee8a3252e03a67484dc8d3dd26fd73af4826
CPU threads: 4; OS: Linux 5.13; UI render: default; VCL: gtk3
Locale: cs-CZ (cs_CZ.UTF-8); UI: en-US
Calc: threaded
Comment 2 Andreas Heinisch 2022-12-12 15:14:48 UTC
Hi Mike!

Should this be fixed in LibreOffice or in the sax_parser.hpp of liborcus?

template<typename _Handler, typename _Config>
void sax_parser<_Handler,_Config>::header()

// we don't handle multi byte encodings so we can just skip bom entry if exists.    skip_bom();

If the skip_bom function will be adapted, then we could resolve this issue.
Comment 3 Mike Kaganski 2022-12-12 15:27:26 UTC
(In reply to Andreas Heinisch from comment #2)
> Should this be fixed in LibreOffice or in the sax_parser.hpp of liborcus?

IMO, the latter.

> If the skip_bom function will be adapted, then we could resolve this issue.

There's also a call to skip_space_and_control immediately following the skip_bom. IMO, we can't treat *any* character after BOM as BOM; so the "zero-width non-breaking space" (the deprecated meaning of U+FEFF) falls into the "space and control" realm nicely.
Comment 4 Mike Kaganski 2022-12-12 15:42:21 UTC
Just a side remark

sax_parser::header starts with

    skip_bom();
    skip_space_and_control();
    if (!has_char() || cur_char() != '<')
        throw sax::malformed_xml_error("xml file must begin with '<'.", offset());

where skip_space_and_control is documented as "Skip all characters that are 0-32 in ASCII range", and one would assume that after an optional BOM, "space and control" characters are tolerated. before the opening '<'.

But skip_bom is implemented as checking that if the first UTF-8 code unit is not ' ' or '<', then the BOM *must* be present, and the '<' *must* immediately follow it. So the skip_bom implementation prevents the correct skip_space_and_control execution, and basically that required the explicit check for space inside it, partially implementing the skip_space_and_control functionality.

Just checking three BOM characters without any other checks would seem reasonable to me :)
Comment 5 Andreas Heinisch 2023-02-07 07:25:23 UTC
Fixed in https://gitlab.com/orcus/orcus/-/issues/173

Should we add a test case for libreoffice too?
Comment 6 Mike Kaganski 2023-02-07 07:28:15 UTC
(In reply to Andreas Heinisch from comment #5)
> Should we add a test case for libreoffice too?

Yes. The behavior should be kept no matter which changes happen in the libraries we use :)