Bug 147088 - Export from Calc document with Unicode characters belonging to Unicode category Cn to xlsx produces corrupt file.
Status: VERIFIED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Calc
Version: 6.4.5.2 release (earliest affected)
Hardware: All All
Importance: medium normal
Assignee: Stephan Bergmann
URL:
Whiteboard: target:7.4.0 target:7.3.1
Keywords: bibisected, bisected, filter:xlsx, regression
Depends on:
Blocks:
 
Reported: 2022-01-31 13:23 UTC by Winfried Donkers
Modified: 2022-02-04 20:12 UTC
CC List: 4 users

Attachments
Calc document as described in comment #0. (38.02 KB, application/vnd.oasis.opendocument.spreadsheet-flat-xml)
2022-01-31 13:23 UTC, Winfried Donkers
xlsx document as described in comment #0 (4.37 KB, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet)
2022-01-31 13:24 UTC, Winfried Donkers

Description Winfried Donkers 2022-01-31 13:23:07 UTC
Description:
Exporting a Calc document containing one or more Unicode characters which are of category Cn to xlsx results in a file that Excel cannot open. Repairs by Excel delete the entire worksheet containing category Cn character(s). Opening with Calc likewise.

Steps to Reproduce:
1. Create a Calc document and enter =UNICHAR(65535) in cell A1.
2. Save it as an xlsx document and open that document.


Actual Results:
Cell A1 is empty.

Expected Results:
Cell A1 should contain =UNICHAR(65535).


Reproducible: Always


User Profile Reset: No



Additional Info:
See attachments.
The mentioned deletion of the entire worksheet can be reproduced by also entering values and/or formulas in other cells and saving as xlsx.
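
One rough way to check whether an exported file contains code points that XML 1.0 does not allow is to scan the XML parts inside the xlsx package. The following is a minimal Python sketch, not part of LibreOffice; the function name `illegal_chars_in_xlsx` is hypothetical, and the sketch assumes that every `.xml` part in the package is UTF-8 encoded.

```python
import re
import zipfile

# Code points outside the XML 1.0 "Char" production
# (https://www.w3.org/TR/2008/REC-xml-20081126/#charsets).
_ILLEGAL_XML10 = re.compile(
    r"[\x00-\x08\x0b\x0c\x0e-\x1f\ud800-\udfff\ufffe\uffff]"
)


def illegal_chars_in_xlsx(source):
    """Map each XML part name to the illegal code points found in it.

    `source` is a path or a binary file object, as accepted by ZipFile.
    Assumption: all .xml parts are UTF-8 encoded.
    """
    findings = {}
    with zipfile.ZipFile(source) as archive:
        for name in archive.namelist():
            if not name.endswith(".xml"):
                continue
            # surrogatepass keeps encoded surrogates visible instead of failing.
            text = archive.read(name).decode("utf-8", errors="surrogatepass")
            hits = sorted({ord(ch) for ch in _ILLEGAL_XML10.findall(text)})
            if hits:
                findings[name] = [f"U+{cp:04X}" for cp in hits]
    return findings
```

The function returns an empty dict when no part contains such a character, so a non-empty result points at the streams worth inspecting by hand.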
Comment 1 Winfried Donkers 2022-01-31 13:23:58 UTC
Created attachment 177937 [details]
Calc document as described in comment #0.
Comment 2 Winfried Donkers 2022-01-31 13:24:28 UTC
Created attachment 177938 [details]
xlsx document as described in comment #0
Comment 3 Winfried Donkers 2022-01-31 13:26:06 UTC
Note that the value of the formula in the fods document is empty (UNICHAR(65535) is not a character), whereas in the xlsx document it is 'something'.
Comment 4 Roman Kuznetsov 2022-01-31 13:54:02 UTC
Confirmed in

Version: 7.4.0.0.alpha0+ (x64) / LibreOffice Community
Build ID: e27a41a362bf25e12487b36f625985b35fb891e3
CPU threads: 4; OS: Windows 6.1 Service Pack 1 Build 7601; UI render: Skia/Raster; VCL: win
Locale: ru-RU (ru_RU); UI: ru-RU
Calc: CL
Comment 5 Roman Kuznetsov 2022-01-31 13:58:14 UTC
No repro in 6.3 but repro in 6.4.5 => regression
Comment 6 Roman Kuznetsov 2022-01-31 14:18:47 UTC
I bisected it in the win64-6.4 bisect repo and got this SHA:
cd563e7b807fe038ebefb228e70bc587c040d17d

https://gerrit.libreoffice.org/c/core/+/78598

https://git.libreoffice.org/core/commit/cd563e7b807fe038ebefb228e70bc587c040d17d

Added to CC:  Stephan Bergmann

Stephan, could you please look at it? Thank you.
Comment 7 Stephan Bergmann 2022-01-31 15:18:42 UTC
So the test.fods (attachment 177937 [details]) contains

> <table:table-cell table:formula="of:=UNICHAR(65535)" office:value-type="string" office:string-value="" calcext:value-type="string">

with an empty office:string-value="", while the test.xlsx (attachment 177938 [details]) xl/worksheets/sheet1.xml stream contains

> <c r="A1" s="0" t="str"><f aca="false">_xlfn.UNICHAR(65535)</f><v>�</v></c>

with a (UTF-8 encoded) U+FFFF.

Eike, do you know whether saving to .[f]ods has some code that explicitly filters out non-characters, whereas saving to .xlsx presumably implicitly relied on the sal/rtl/textenc code converting to UTF-8 to filter out non-characters (and which it no longer does since <https://git.libreoffice.org/core/+/cd563e7b807fe038ebefb228e70bc587c040d17d%5E%21> "Do not exclude Unicode noncharacters from rtl_convertUnicodeToText")?
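
As an analogy only, the same effect can be seen outside LibreOffice: a UTF-8 encoder that does not filter noncharacters emits U+FFFF as the three bytes EF BF BF, and it is then up to the XML layer to deal with them. The Python sketch below does not use `rtl_convertUnicodeToText`; the helper name `parses_as_xml` is hypothetical, and the parser behaviour it probes is that of the expat build bundled with Python, which is expected to reject U+FFFF in content.

```python
import xml.etree.ElementTree as ET

# Python's UTF-8 encoder passes the noncharacter U+FFFF through unchanged.
encoded = "\uffff".encode("utf-8")
print(encoded.hex())  # efbfbf


def parses_as_xml(text):
    """Return True if ElementTree (expat) accepts the text as XML."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# Compare a harmless cell value with one containing U+FFFF.
print(parses_as_xml("<v>ok</v>"))
print(parses_as_xml("<v>\uffff</v>"))
```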
Comment 8 Stephan Bergmann 2022-01-31 15:24:08 UTC
(In reply to Stephan Bergmann from comment #7)
> Eike, do you know whether saving to .[f]ods has some code that explicitly
> filters out non-characters, whereas saving to .xlsx presumably implicitly
> relied on the sal/rtl/textenc code converting to UTF-8 to filter out
> non-characters (and which it no longer does since
> <https://git.libreoffice.org/core/+/
> cd563e7b807fe038ebefb228e70bc587c040d17d%5E%21> "Do not exclude Unicode
> noncharacters from rtl_convertUnicodeToText")?

[I assume Bugzilla failed to send out emails for comment 7 due to the verbatim U+FFFF contained in that comment, which, it claimed in its web UI, it couldn't convert to UTF-8; phh]
Comment 9 Eike Rathke 2022-01-31 16:07:52 UTC
sax/source/expatwrap/saxwriter.cxx SaxWriterHelper::convertToXML() does such a thing with IsInvalidChar()
Comment 10 Eike Rathke 2022-01-31 16:25:47 UTC
Fwiw, when I open the attached .fods in Calc, cell A1 is not empty but contains the expected =UNICHAR(65535) formula expression.

(and yes, the literal 0xffff glyph in comment 7 kicks Bugzilla and its mailing into the abyss for every comment added).
Comment 11 Christian Lohmaier 2022-01-31 17:11:59 UTC
Note: the U+FFFF in the original line from comment #7 has been replaced by a U+FFFD to not trip up Bugzilla.
Comment 12 Commit Notification 2022-02-01 22:37:32 UTC
Stephan Bergmann committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/2f3a0bfbfe110c0837b3c7e04f9ad0969d6e56e4

tdf#147088: Also handle U+FFFE, U+FFFF invalid XML 1.0 characters

It will be available in 7.4.0.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 13 Stephan Bergmann 2022-02-02 07:28:12 UTC
(In reply to Winfried Donkers from comment #0)
> Exporting a Calc document containing one or more Unicode characters which
> are of category Cn to xlsx results in a file that Excel cannot open.

Unicode category Cn covers both noncharacter and reserved code points (<https://www.unicode.org/versions/Unicode14.0.0/ch02.pdf>).  Unicode noncharacter code points are U+FDD0..U+FDEF, U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, U+2FFFE, U+2FFFF, U+3FFFE, U+3FFFF, U+4FFFE, U+4FFFF, U+5FFFE, U+5FFFF, U+6FFFE, U+6FFFF, U+7FFFE, U+7FFFF, U+8FFFE, U+8FFFF, U+9FFFE, U+9FFFF, U+AFFFE, U+AFFFF, U+BFFFE, U+BFFFF, U+CFFFE, U+CFFFF, U+DFFFE, U+DFFFF, U+EFFFE, U+EFFFF, U+FFFFE, U+FFFFF, U+10FFFE, U+10FFFF (<https://www.unicode.org/versions/Unicode14.0.0/ch23.pdf>).
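
The distinction between the two kinds of Cn code points can be sketched in Python. This is an illustration, not LibreOffice code: `is_noncharacter` and `describe` are hypothetical helper names, and which code points count as reserved depends on the Unicode version of the `unicodedata` module in use (U+0378 is assumed to be still unassigned there).

```python
import unicodedata


def is_noncharacter(cp):
    """True for the 66 Unicode noncharacter code points."""
    # U+FDD0..U+FDEF, plus the last two code points of each of the 17 planes.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def describe(cp):
    """Label a code point as noncharacter, reserved, or outside category Cn."""
    category = unicodedata.category(chr(cp))
    if is_noncharacter(cp):
        kind = "noncharacter"
    elif category == "Cn":
        kind = "reserved (unassigned)"
    else:
        kind = "not Cn"
    return f"U+{cp:04X}: category {category}, {kind}"


for cp in (0xFFFF, 0xFDD0, 0x0378, 0x0041):
    print(describe(cp))
```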

Unicode code points outside the range of XML 1.0 legal characters are U+0000..U+0008, U+000B, U+000C, U+000E..U+001F, U+D800..U+DFFF, U+FFFE, U+FFFF (<https://www.w3.org/TR/2008/REC-xml-20081126/#charsets>).  This has nothing really to do with Unicode category Cn or Unicode noncharacters.
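
To make the mismatch concrete, the XML 1.0 `Char` production can be written as a predicate and compared against the noncharacter list. This is a sketch with hypothetical function names (`is_xml10_char`, `is_noncharacter`); it is not how LibreOffice implements the check.

```python
def is_xml10_char(cp):
    """True if the code point matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def is_noncharacter(cp):
    """True for the 66 Unicode noncharacter code points."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


# Noncharacters that are also illegal in XML 1.0.
illegal_noncharacters = [
    f"U+{cp:04X}"
    for cp in range(0x110000)
    if is_noncharacter(cp) and not is_xml10_char(cp)
]
print(illegal_noncharacters)  # ['U+FFFE', 'U+FFFF']
```

Of the 66 noncharacters only these two fall outside the XML 1.0 range, while the other code points that XML 1.0 forbids (C0 controls and surrogates) are not of category Cn at all.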

What the fix from comment 12 does is to make sure that it does not emit Unicode characters U+FFFE, U+FFFF verbatim into certain XML 1.0 documents.  (The two Unicode characters that the original code's handling of non-legal characters had missed.)

Winfried, can you please clarify whether the issue you describe in comment 0 is actually about Unicode characters of category Cn, or Unicode noncharacters, or Unicode characters that are not legal in XML 1.0?
Comment 14 Winfried Donkers 2022-02-02 08:38:49 UTC
(In reply to Stephan Bergmann from comment #13)

> Winfried, can you please clarify whether the issue you describe in comment 0
> is actually about Unicode characters of category Cn, or Unicode
> noncharacters, or Unicode characters that are not legal in XML 1.0?

The issue was about Unicode characters of category Cn (other - not assigned) and came to light whilst testing a patch for Calc function CLEAN.

I can confirm that Calc with the patch from comment #12 applied no longer has the issue. Thank you, I can continue testing CLEAN with Excel :)
Comment 15 Commit Notification 2022-02-02 20:54:37 UTC
Xisco Fauli committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/38d4f94788549f7f8118b28b03d9e056f994c841

tdf#147088: sc_subsequent_export_test2: Add unittest

It will be available in 7.4.0.

Comment 16 Commit Notification 2022-02-04 20:12:39 UTC
Stephan Bergmann committed a patch related to this issue.
It has been pushed to "libreoffice-7-3":

https://git.libreoffice.org/core/commit/33d70d68aa67d567e9b18fa5947b86df6e378f32

tdf#147088: Also handle U+FFFE, U+FFFF invalid XML 1.0 characters

It will be available in 7.3.1.
