Bug 97706 - CLEAN is neither ODF nor OOXML conform
Summary: CLEAN is neither ODF nor OOXML conform
Status: RESOLVED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Calc (show other bugs)
Version:
(earliest affected)
Inherited From OOo
Hardware: All All
: medium normal
Assignee: Winfried Donkers
URL:
Whiteboard: target:7.4.0
Keywords: difficultyMedium, easyHack, filter:odf, filter:ooxml, skillCpp, topicCleanup
Depends on:
Blocks: ODF-import Function-iWorksNumbers
  Show dependency treegraph
 
Reported: 2016-02-10 11:17 UTC by Dennis Roczek
Modified: 2022-02-11 06:21 UTC (History)
9 users (show)

See Also:
Crash report or crash signature:
Regression By:


Attachments
document to verify output in Excel (41.47 KB, application/vnd.oasis.opendocument.spreadsheet)
2022-01-07 13:31 UTC, Winfried Donkers
Details
Sheet2 of ODS exported with MSO_O365 (107.95 KB, application/pdf)
2022-01-07 13:50 UTC, Dennis Roczek
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Dennis Roczek 2016-02-10 11:17:08 UTC
See gerrit patch https://gerrit.libreoffice.org/21745 which updates our documentation for Calc's CLEAN function.

Per http://docs.oasis-open.org/office/v1.2/os/OpenDocument-v1.2-os-part2.html#__RefHeading__1018842_715980110 this command should remove all "character belongs to [UNICODE] class Cc (Other - Control), or to Unicode class Cn (Other - Not Assigned)"

This misses many characters, see e.g.
http://www.fileformat.info/info/unicode/category/Cc/list.htm

One problem is, that the Excel or e.g. Apple's Numbers (not ODF software) aren't conform, too. We should for import / export reasons add a duplication and create a CLEAN_ADD or CLEAN_EXCEL (or whatever naming) or let all the code as it is and add a CLEAN_ODF function.
Comment 1 Buovjaga 2016-02-12 09:26:26 UTC
I believe you.
Comment 2 Dennis Roczek 2016-02-23 12:22:21 UTC
Codepointers:
http://opengrok.libreoffice.org/xref/core/sc/source/core/tool/interpr1.cxx#3190

Please implement also an additional compatible CLEAN function for MSO Excel / OOXML (which only replaces 0x20 <= c  (see lcl_ScInterpreter_IsPrintable and Page 2123 of Ecma Office Open XML Part 1 - Fundamentals And Markup Language Reference.pdf in http://www.ecma-international.org/publications/standards/Ecma-376.htm ECMA-376 4th edition Part 1)
including the mapping
Comment 3 Taylor Lee 2017-08-17 20:06:21 UTC
Using http://www.unicode.org/Public/UNIDATA/UnicodeData.txt as reference, we can grep for ';(Cc|Cn);'.

We find that Cc and Cn chars are:
0x00 to 0x1F (inclusive)
0x7F to 0x9F (inclusive)

Using the above info, we can update the CLEAN isPrintable() function.

I also added CLEAN.OOXML using a new isPrintable_OOXML() function that returns true only if c > 0x1f.

Using the following fods to verify, I found that the CHAR function is not appropriate to use with CLEAN since CHAR returns 0xfffd for the range of 0x80 to 0xFF inclusive.  0x100 and above returns Err:502.

core/sc/qa/unit/data/functions/text/fods/clean.fods

Changing CHAR to UNICHAR, we can then successfully test the range 0x80 to 0x9F.

I'll attempt to use gerrit to start the patch process.
Comment 4 Xisco Faulí 2017-09-17 02:04:48 UTC Comment hidden (obsolete)
Comment 5 Taylor Lee 2017-09-18 18:24:36 UTC
I lost focus after not being able to figure out how to update 2 files.

https://gerrit.libreoffice.org/#/c/41271/
Comment 6 Buovjaga 2017-09-18 18:29:54 UTC
(In reply to Taylor Lee from comment #5)
> I lost focus after not being able to figure out how to update 2 files.
> 
> https://gerrit.libreoffice.org/#/c/41271/

It would be great to see you continue :) If you need some tips, you could visit the #libreoffice-dev IRC channel during EU office hours: https://wiki.documentfoundation.org/Website/IRC
Eike is in there, nickname: erAck
Comment 7 Xisco Faulí 2017-10-19 02:35:50 UTC Comment hidden (obsolete)
Comment 8 Xisco Faulí 2017-11-19 03:25:56 UTC Comment hidden (obsolete)
Comment 9 Xisco Faulí 2017-12-20 03:33:14 UTC Comment hidden (obsolete)
Comment 10 Xisco Faulí 2018-01-20 03:31:55 UTC Comment hidden (obsolete)
Comment 11 Xisco Faulí 2018-02-20 03:35:43 UTC Comment hidden (obsolete)
Comment 12 Xisco Faulí 2018-03-23 03:33:30 UTC Comment hidden (obsolete)
Comment 13 Xisco Faulí 2018-04-23 02:30:57 UTC Comment hidden (obsolete)
Comment 14 Xisco Faulí 2018-05-24 02:43:20 UTC Comment hidden (obsolete)
Comment 15 Xisco Faulí 2018-06-24 02:39:08 UTC Comment hidden (obsolete)
Comment 16 Xisco Faulí 2018-07-25 02:36:49 UTC Comment hidden (obsolete)
Comment 17 Xisco Faulí 2018-08-25 02:41:02 UTC Comment hidden (obsolete)
Comment 18 Xisco Faulí 2018-09-25 02:49:06 UTC Comment hidden (obsolete)
Comment 19 Xisco Faulí 2018-10-26 02:56:19 UTC Comment hidden (obsolete)
Comment 20 Xisco Faulí 2018-11-26 03:34:49 UTC Comment hidden (obsolete)
Comment 21 Xisco Faulí 2018-12-27 03:40:56 UTC Comment hidden (obsolete)
Comment 22 Xisco Faulí 2019-01-27 03:44:58 UTC Comment hidden (obsolete)
Comment 23 Xisco Faulí 2019-01-28 10:19:13 UTC Comment hidden (obsolete)
Comment 24 Regina Henschel 2021-12-09 16:49:22 UTC
@Winfried: As you have done a lot on Calc functions, I ask, is this something you can fix and you are interested in?
Comment 25 Winfried Donkers 2022-01-07 13:31:29 UTC
Created attachment 177374 [details]
document to verify output in Excel

Interestingly, Excel 2019 has almost the same output as Open Document Format Formula (v1.3) states for CLEAN.
ODFF states that CLEAN cleans UNICODE characters 0x00...0x1F(incl.) and 0x7F...0x9F(incl.).
The only difference I found is UNICODE character 0x7F, which ODFF removes and Excel doesn't.

Another difference between ODFF and Excel is that Excel does not support/accept UNICODE 0x00 as character and Calc does.

I am very much tempted to not have a separate Excel and ODFF version of CLEAN in Calc, with all its confusion and problems wrt backward compatibility and interoperability with Excel for ods and xlsx formats.

Has anyone access to Excel 2016 or 2022 or 365 to confirm my findings?
This attachment can be used for this.
Comment 26 Dennis Roczek 2022-01-07 13:50:43 UTC
Created attachment 177375 [details]
Sheet2 of ODS exported with MSO_O365

(In reply to Winfried Donkers (retired) from comment #25)
> Has anyone access to Excel 2016 or 2022 or 365 to confirm my findings?
> This attachment can be used for this.

Hi Winfried,

happy new year. ;-)


I exported Sheet 2 to PDF using
Microsoft® Excel® für Microsoft 365 MSO (Version 2111 Build 16.0.14701.20254) 64 Bit 

The file itself was "rescued" by Excel (if I trust the author/source/file ^^). 

I'm not sure if the outcome is excepted.

Dennis
Comment 27 Winfried Donkers 2022-01-07 15:12:56 UTC
(In reply to Dennis Roczek from comment #26)
> happy new year. ;-)

Thanks, likewise.

> I'm not sure if the outcome is excepted.

The output is the same as I had; no result for line 2 (UNICODE 0x00) and cleaned output for UNICODE 0x7F, the rest conforming with ODFF (and not with the Microsoft support page about CLEAN).

I'll think some more about it, and probably only make the current CLEAN fully compliant with ODFF v1.3. The single difference (UNICODE 0x7f, as UNICODE 0x00 has nothing to do with CLEAN in Excel) seems much less problematic than the problem arising with two CLEAN versions (which Excel doesn't recognize either).
Comment 28 Commit Notification 2022-01-24 22:43:07 UTC
Winfried Donkers committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/34412339f863b35508842a3bf20fcc7341a006e7

tdf#97706 CLEAN is neither ODFF nor OOXML conform.

It will be available in 7.4.0.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 29 Winfried Donkers 2022-02-11 06:21:09 UTC
With commit 61f4250ee9f43902107e4d2e6322cbf54f52dd8e (which accidently lacks a reference to this bug report) CLEAN is now fully compliant with ODFF v1.3 and compliant with Excel 2016/2019 as far as could be tested (actual behaviour in Excel differs from the behaviour as described in the Microsoft documentation).