150023 – Double BOM (invalid XML) prevents importing some Excel 2003 XML (Excel reads them)

Bug 150023 - Double BOM (invalid XML) prevents importing some Excel 2003 XML (Excel reads them)

Summary: Double BOM (invalid XML) prevents importing some Excel 2003 XML (Excel reads ...

Status:	RESOLVED FIXED

Alias:	None

Product:	LibreOffice
Classification:	Unclassified
Component:	Calc (show other bugs)
Version: (earliest affected)	unspecified
Hardware:	All All

Importance:	medium normal
Assignee:	Kohei Yoshida

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:	MSO-XML2003
	Show dependency tree / graph

Reported:	2022-07-17 12:58 UTC by Mike Kaganski
Modified:	2023-02-07 07:28 UTC (History)
CC List:	4 users (show)

See Also:	68742
Crash report or crash signature:

Attachments
A sample Excel 2003 XML with double BOM (1.49 MB, application/vnd.ms-excel) 2022-07-17 12:58 UTC, Mike Kaganski	Details
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Mike Kaganski 2022-07-17 12:58:14 UTC

Created attachment 181293 [details]
A sample Excel 2003 XML with double BOM

The attachment is from https://www.blackrock.com/il/intermediaries/en/products/264107/ishares-spi-ch-fund/1554770013246.ajax?fileType=xls&fileName=iShares-Core-SPI-ETF-CH_fund&dataType=fund. A user reported on IRC that it can't be opened in Calc, while Excel opens it fine.

The UTF-8 XML file has a double BOM - and that makes it actually invalid, since after stripping the first BOM, the rest XML does not begin with XML declaration, but with a zero-width character.

See also bug 68742, which was fixed with an improvement in orcus.

Comment 1 raal 2022-07-18 20:36:48 UTC

Confirm with Version: 7.5.0.0.alpha0+ / LibreOffice Community
Build ID: 28daee8a3252e03a67484dc8d3dd26fd73af4826
CPU threads: 4; OS: Linux 5.13; UI render: default; VCL: gtk3
Locale: cs-CZ (cs_CZ.UTF-8); UI: en-US
Calc: threaded

Comment 2 Andreas Heinisch 2022-12-12 15:14:48 UTC

Hi Mike!

Should this be fixed in LibreOffice or in the sax_parser.hpp of liborcus?

template<typename _Handler, typename _Config>
void sax_parser<_Handler,_Config>::header()

// we don't handle multi byte encodings so we can just skip bom entry if exists.    skip_bom();

If the skip_bom function will be adapted, then we could resolve this issue.

Comment 3 Mike Kaganski 2022-12-12 15:27:26 UTC

(In reply to Andreas Heinisch from comment #2)
> Should this be fixed in LibreOffice or in the sax_parser.hpp of liborcus?

IMO, the latter.

> If the skip_bom function will be adapted, then we could resolve this issue.

There's also a call to skip_space_and_control immediately following the skip_bom. IMO, we can't treat *any* character after BOM as BOM; so the "zero-width non-breaking space" (the deprecated meaning of U+FEFF) falls into the "space and control" realm nicely.

Comment 4 Mike Kaganski 2022-12-12 15:42:21 UTC

Just a side remark

sax_parser::header starts with

    skip_bom();
    skip_space_and_control();
    if (!has_char() || cur_char() != '<')
        throw sax::malformed_xml_error("xml file must begin with '<'.", offset());

where skip_space_and_control is documented as "Skip all characters that are 0-32 in ASCII range", and one would assume that after an optional BOM, "space and control" characters are tolerated. before the opening '<'.

But skip_bom is implemented as checking that if the first UTF-8 code unit is not ' ' or '<', then the BOM *must* be present, and the '<' *must* immediately follow it. So the skip_bom implementation prevents the correct skip_space_and_control execution, and basically that required the explicit check for space inside it, partially implementing the skip_space_and_control functionality.

Just checking three BOM characters without any other checks would seem reasonable to me :)

Comment 5 Andreas Heinisch 2023-02-07 07:25:23 UTC

Fixed in https://gitlab.com/orcus/orcus/-/issues/173

Should we add a test case for libreoffice too?

Comment 6 Mike Kaganski 2023-02-07 07:28:15 UTC

(In reply to Andreas Heinisch from comment #5)
> Should we add a test case for libreoffice too?

Yes. The behavior should be kept no matter which changes happen in the libraries we use :)