Description: Exporting a Calc document containing one or more Unicode characters of category Cn to xlsx results in a file that Excel cannot open. Excel's repair deletes the entire worksheet containing the category Cn character(s). Opening the file with Calc gives the same result.
Steps to Reproduce:
1. Create a Calc document and enter =UNICHAR(65535) in cell A1.
2. Save as an xlsx document and open that document.
Actual Results: Cell A1 is empty.
Expected Results: Cell A1 should contain =UNICHAR(65535).
Reproducible: Always
User Profile Reset: No
Additional Info: See attachments. The deletion of the entire worksheet can be reproduced by also entering values and/or formulas in other cells and saving as xlsx.
Created attachment 177937 [details] Calc document as described in comment #0.
Created attachment 177938 [details] xlsx document as described in comment #0
Note that the value of the formula in the fods document is empty (UNICHAR(65535) is not a character), whereas in the xlsx document it is 'something'.
Confirm in Version: 7.4.0.0.alpha0+ (x64) / LibreOffice Community Build ID: e27a41a362bf25e12487b36f625985b35fb891e3 CPU threads: 4; OS: Windows 6.1 Service Pack 1 Build 7601; UI render: Skia/Raster; VCL: win Locale: ru-RU (ru_RU); UI: ru-RU Calc: CL
No repro in 6.3, but repro in 6.4.5 => regression
I bisected it in the win64-6.4 bisect repo and got the sha cd563e7b807fe038ebefb228e70bc587c040d17d https://gerrit.libreoffice.org/c/core/+/78598 https://git.libreoffice.org/core/commit/cd563e7b807fe038ebefb228e70bc587c040d17d Added to CC: Stephan Bergmann. Stephan, could you please look at it? Thank you.
So the test.fods (attachment 177937 [details]) contains
> <table:table-cell table:formula="of:=UNICHAR(65535)" office:value-type="string" office:string-value="" calcext:value-type="string">
with an empty office:string-value="", while the test.xlsx (attachment 177938 [details]) xl/worksheets/sheet1.xml stream contains
> <c r="A1" s="0" t="str"><f aca="false">_xlfn.UNICHAR(65535)</f><v>�</v></c>
with a (UTF-8 encoded) U+FFFF. Eike, do you know whether saving to .[f]ods has some code that explicitly filters out non-characters, whereas saving to .xlsx presumably implicitly relied on the sal/rtl/textenc code converting to UTF-8 to filter out non-characters (and which it no longer does since <https://git.libreoffice.org/core/+/cd563e7b807fe038ebefb228e70bc587c040d17d%5E%21> "Do not exclude Unicode noncharacters from rtl_convertUnicodeToText")?
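For anyone who wants to check an exported file without opening it in Excel, here is a minimal Python sketch that looks for the UTF-8 byte sequences of U+FFFE and U+FFFF in a worksheet stream. It also illustrates the point above: a plain UTF-8 encoder passes noncharacters through unchanged, so any filtering has to happen elsewhere. The names `find_ffff_in_stream` and `make_sample_xlsx` are hypothetical helpers made up for this illustration, and the sample archive is only a stand-in for the attached test.xlsx, not a complete xlsx package.

```python
import io
import zipfile

# UTF-8 encodes noncharacters like any other code point; it does not filter them.
NONCHAR_BYTES = {cp: chr(cp).encode("utf-8") for cp in (0xFFFE, 0xFFFF)}


def find_ffff_in_stream(xlsx_bytes, stream="xl/worksheets/sheet1.xml"):
    """Return the code points (U+FFFE, U+FFFF) whose UTF-8 bytes occur in the stream."""
    with zipfile.ZipFile(io.BytesIO(xlsx_bytes)) as zf:
        raw = zf.read(stream)
    return sorted(cp for cp, seq in NONCHAR_BYTES.items() if seq in raw)


def make_sample_xlsx(cell_value):
    """Build a one-stream archive standing in for test.xlsx (hypothetical helper)."""
    cell = ('<c r="A1" s="0" t="str"><f aca="false">_xlfn.UNICHAR(65535)</f>'
            "<v>" + cell_value + "</v></c>")
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("xl/worksheets/sheet1.xml", cell.encode("utf-8"))
    return buf.getvalue()


if __name__ == "__main__":
    print(NONCHAR_BYTES[0xFFFF].hex())                      # UTF-8 bytes of U+FFFF
    print(find_ffff_in_stream(make_sample_xlsx("\uffff")))  # stream with a verbatim U+FFFF
```

To check a real file, read its bytes from disk and pass them to `find_ffff_in_stream`; a non-empty result means the stream contains one of the two code points verbatim.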
(In reply to Stephan Bergmann from comment #7)
> Eike, do you know whether saving to .[f]ods has some code that explicitly
> filters out non-characters, whereas saving to .xlsx presumably implicitly
> relied on the sal/rtl/textenc code converting to UTF-8 to filter out
> non-characters (and which it no longer does since
> <https://git.libreoffice.org/core/+/cd563e7b807fe038ebefb228e70bc587c040d17d%5E%21>
> "Do not exclude Unicode noncharacters from rtl_convertUnicodeToText")?
[I assume Bugzilla failed to send out emails for comment 7 due to the verbatim U+FFFF contained in that comment, which, it claimed in its web UI, it couldn't convert to UTF-8; phh]
sax/source/expatwrap/saxwriter.cxx SaxWriterHelper::convertToXML() does such a thing with IsInvalidChar()
Fwiw, when I open the attached .fods in Calc, cell A1 is not empty but contains the expected =UNICHAR(65535) formula expression. (And yes, the literal 0xffff glyph in comment 7 kicks Bugzilla and its mailing into the abyss for every comment added.)
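As an illustration of what such a check amounts to, here is a rough Python sketch of a predicate built directly from the XML 1.0 Char production. It is not the actual IsInvalidChar() from saxwriter.cxx. The names `is_invalid_xml10_char` and `strip_invalid` are hypothetical, and silently dropping the offending characters is an assumption made for the example, not a statement about what convertToXML() does with them.

```python
def is_invalid_xml10_char(cp):
    """True if code point cp is outside the XML 1.0 Char production.

    Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    """
    if cp in (0x09, 0x0A, 0x0D):
        return False
    return not (0x20 <= cp <= 0xD7FF
                or 0xE000 <= cp <= 0xFFFD
                or 0x10000 <= cp <= 0x10FFFF)


def strip_invalid(text):
    """Drop characters that XML 1.0 does not allow (assumed policy for this sketch)."""
    return "".join(ch for ch in text if not is_invalid_xml10_char(ord(ch)))


if __name__ == "__main__":
    print(is_invalid_xml10_char(0xFFFF))    # the code point from comment 0
    print(repr(strip_invalid("a\uffffb")))  # text with the offending character removed
```

A check that stops at the control characters and surrogates would let U+FFFE and U+FFFF through, which is why the upper bound of the second range is U+FFFD.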
Note: the U+FFFF in the original line from comment #7 has been replaced by a U+FFFD so as not to trip up Bugzilla.
Stephan Bergmann committed a patch related to this issue. It has been pushed to "master": https://git.libreoffice.org/core/commit/2f3a0bfbfe110c0837b3c7e04f9ad0969d6e56e4 tdf#147088: Also handle U+FFFE, U+FFFF invalid XML 1.0 characters It will be available in 7.4.0. The patch should be included in the daily builds available at https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More information about daily builds can be found at: https://wiki.documentfoundation.org/Testing_Daily_Builds Affected users are encouraged to test the fix and report feedback.
(In reply to Winfried Donkers from comment #0)
> Exporting a Calc document containing one or more Unicode characters which
> are of category Cn to xlsx results in a file that Excel cannot open.
Unicode category Cn covers both noncharacter and reserved code points (<https://www.unicode.org/versions/Unicode14.0.0/ch02.pdf>).
Unicode noncharacter code points are U+FDD0..U+FDEF, U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, U+2FFFE, U+2FFFF, U+3FFFE, U+3FFFF, U+4FFFE, U+4FFFF, U+5FFFE, U+5FFFF, U+6FFFE, U+6FFFF, U+7FFFE, U+7FFFF, U+8FFFE, U+8FFFF, U+9FFFE, U+9FFFF, U+AFFFE, U+AFFFF, U+BFFFE, U+BFFFF, U+CFFFE, U+CFFFF, U+DFFFE, U+DFFFF, U+EFFFE, U+EFFFF, U+FFFFE, U+FFFFF, U+10FFFE, U+10FFFF (<https://www.unicode.org/versions/Unicode14.0.0/ch23.pdf>).
Unicode code points outside the range of XML 1.0 legal characters are U+0000..U+0008, U+000B, U+000C, U+000E..U+001F, U+D800..U+DFFF, U+FFFE, U+FFFF (<https://www.w3.org/TR/2008/REC-xml-20081126/#charsets>). This has nothing really to do with Unicode category Cn or Unicode noncharacters.
What the fix from comment 12 does is to make sure that it does not emit Unicode characters U+FFFE, U+FFFF verbatim into certain XML 1.0 documents. (The two Unicode characters that the original code's handling of non-legal characters had missed.)
Winfried, can you please clarify whether the issue you describe in comment 0 is actually about Unicode characters of category Cn, or Unicode noncharacters, or Unicode characters that are not legal in XML 1.0?
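The relationship between the sets listed above can be checked mechanically. The following Python sketch builds the noncharacter set and the XML 1.0 illegal set from the ranges given in that comment and intersects them; the variable names are made up for this illustration, and the general categories come from whatever Unicode version the local Python's `unicodedata` module ships.

```python
import unicodedata

# Unicode noncharacters: U+FDD0..U+FDEF plus the last two code points of each plane.
noncharacters = set(range(0xFDD0, 0xFDF0))
for plane in range(0x11):
    noncharacters.update(((plane << 16) | 0xFFFE, (plane << 16) | 0xFFFF))

# Code points outside the XML 1.0 Char production.
xml10_illegal = (set(range(0x00, 0x09)) | {0x0B, 0x0C} | set(range(0x0E, 0x20))
                 | set(range(0xD800, 0xE000)) | {0xFFFE, 0xFFFF})

# The only code points that are both noncharacters and illegal in XML 1.0.
overlap = sorted(noncharacters & xml10_illegal)

# General categories found among the XML 1.0 illegal code points.
illegal_categories = {unicodedata.category(chr(cp)) for cp in xml10_illegal}

if __name__ == "__main__":
    print(len(noncharacters))
    print([f"U+{cp:04X}" for cp in overlap])
    print(sorted(illegal_categories))
```

Most noncharacters are legal XML 1.0 characters, and most XML 1.0 illegal code points are controls or surrogates rather than category Cn, so only the two-element overlap is relevant to the export problem.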
(In reply to Stephan Bergmann from comment #13)
> Winfried, can you please clarify whether the issue you describe in comment 0
> is actually about Unicode characters of category Cn, or Unicode
> noncharacters, or Unicode characters that are not legal in XML 1.0?
The issue was about Unicode characters of category Cn (other - not assigned) and came to light whilst testing a patch for Calc function CLEAN. I can confirm that Calc with the patch from comment #12 applied no longer has the issue. Thank you, I can continue testing CLEAN with Excel :)
Xisco Fauli committed a patch related to this issue. It has been pushed to "master": https://git.libreoffice.org/core/commit/38d4f94788549f7f8118b28b03d9e056f994c841 tdf#147088: sc_subsequent_export_test2: Add unittest It will be available in 7.4.0.
Stephan Bergmann committed a patch related to this issue. It has been pushed to "libreoffice-7-3": https://git.libreoffice.org/core/commit/33d70d68aa67d567e9b18fa5947b86df6e378f32 tdf#147088: Also handle U+FFFE, U+FFFF invalid XML 1.0 characters It will be available in 7.3.1.