See gerrit patch https://gerrit.libreoffice.org/21745 which updates our documentation for Calc's CLEAN function.
Per http://docs.oasis-open.org/office/v1.2/os/OpenDocument-v1.2-os-part2.html#__RefHeading__1018842_715980110 this command should remove all "character belongs to [UNICODE] class Cc (Other - Control), or to Unicode class Cn (Other - Not Assigned)"
This misses many characters, see e.g.
One problem is, that the Excel or e.g. Apple's Numbers (not ODF software) aren't conform, too. We should for import / export reasons add a duplication and create a CLEAN_ADD or CLEAN_EXCEL (or whatever naming) or let all the code as it is and add a CLEAN_ODF function.
I believe you.
Please implement also an additional compatible CLEAN function for MSO Excel / OOXML (which only replaces 0x20 <= c (see lcl_ScInterpreter_IsPrintable and Page 2123 of Ecma Office Open XML Part 1 - Fundamentals And Markup Language Reference.pdf in http://www.ecma-international.org/publications/standards/Ecma-376.htm ECMA-376 4th edition Part 1)
including the mapping
Using http://www.unicode.org/Public/UNIDATA/UnicodeData.txt as reference, we can grep for ';(Cc|Cn);'.
We find that Cc and Cn chars are:
0x00 to 0x1F (inclusive)
0x7F to 0x9F (inclusive)
Using the above info, we can update the CLEAN isPrintable() function.
I also added CLEAN.OOXML using a new isPrintable_OOXML() function that returns true only if c > 0x1f.
Using the following fods to verify, I found that the CHAR function is not appropriate to use with CLEAN since CHAR returns 0xfffd for the range of 0x80 to 0xFF inclusive. 0x100 and above returns Err:502.
Changing CHAR to UNICHAR, we can then successfully test the range 0x80 to 0x9F.
I'll attempt to use gerrit to start the patch process.
A polite ping, still working on this bug?
I lost focus after not being able to figure out how to update 2 files.
(In reply to Taylor Lee from comment #5)
> I lost focus after not being able to figure out how to update 2 files.
It would be great to see you continue :) If you need some tips, you could visit the #libreoffice-dev IRC channel during EU office hours: https://wiki.documentfoundation.org/Website/IRC
Eike is in there, nickname: erAck
This bug has been in ASSIGNED status for more than 3 months without any
activity. Resetting it to NEW.
Please assigned it back to yourself if you're still working on this.
@Winfried: As you have done a lot on Calc functions, I ask, is this something you can fix and you are interested in?
Created attachment 177374 [details]
document to verify output in Excel
Interestingly, Excel 2019 has almost the same output as Open Document Format Formula (v1.3) states for CLEAN.
ODFF states that CLEAN cleans UNICODE characters 0x00...0x1F(incl.) and 0x7F...0x9F(incl.).
The only difference I found is UNICODE character 0x7F, which ODFF removes and Excel doesn't.
Another difference between ODFF and Excel is that Excel does not support/accept UNICODE 0x00 as character and Calc does.
I am very much tempted to not have a separate Excel and ODFF version of CLEAN in Calc, with all its confusion and problems wrt backward compatibility and interoperability with Excel for ods and xlsx formats.
Has anyone access to Excel 2016 or 2022 or 365 to confirm my findings?
This attachment can be used for this.
Created attachment 177375 [details]
Sheet2 of ODS exported with MSO_O365
(In reply to Winfried Donkers (retired) from comment #25)
> Has anyone access to Excel 2016 or 2022 or 365 to confirm my findings?
> This attachment can be used for this.
happy new year. ;-)
I exported Sheet 2 to PDF using
Microsoft® Excel® für Microsoft 365 MSO (Version 2111 Build 16.0.14701.20254) 64 Bit
The file itself was "rescued" by Excel (if I trust the author/source/file ^^).
I'm not sure if the outcome is excepted.
(In reply to Dennis Roczek from comment #26)
> happy new year. ;-)
> I'm not sure if the outcome is excepted.
The output is the same as I had; no result for line 2 (UNICODE 0x00) and cleaned output for UNICODE 0x7F, the rest conforming with ODFF (and not with the Microsoft support page about CLEAN).
I'll think some more about it, and probably only make the current CLEAN fully compliant with ODFF v1.3. The single difference (UNICODE 0x7f, as UNICODE 0x00 has nothing to do with CLEAN in Excel) seems much less problematic than the problem arising with two CLEAN versions (which Excel doesn't recognize either).