Bug 97706 - CLEAN is neither ODF nor OOXML conform
Summary: CLEAN is neither ODF nor OOXML conform
Status: NEW
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Calc (show other bugs)
Version:
(earliest affected)
Inherited From OOo
Hardware: All All
: medium normal
Assignee: Not Assigned
URL:
Whiteboard:
Keywords: difficultyBeginner, easyHack, filter:odf, filter:ooxml, skillCpp, topicCleanup
Depends on:
Blocks: ODF-import Function-iWorksNumbers
  Show dependency treegraph
 
Reported: 2016-02-10 11:17 UTC by Dennis Roczek
Modified: 2019-08-18 08:48 UTC (History)
9 users (show)

See Also:
Crash report or crash signature:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Dennis Roczek 2016-02-10 11:17:08 UTC
See gerrit patch https://gerrit.libreoffice.org/21745 which updates our documentation for Calc's CLEAN function.

Per http://docs.oasis-open.org/office/v1.2/os/OpenDocument-v1.2-os-part2.html#__RefHeading__1018842_715980110 this command should remove all "character belongs to [UNICODE] class Cc (Other - Control), or to Unicode class Cn (Other - Not Assigned)"

This misses many characters, see e.g.
http://www.fileformat.info/info/unicode/category/Cc/list.htm

One problem is, that the Excel or e.g. Apple's Numbers (not ODF software) aren't conform, too. We should for import / export reasons add a duplication and create a CLEAN_ADD or CLEAN_EXCEL (or whatever naming) or let all the code as it is and add a CLEAN_ODF function.
Comment 1 Buovjaga 2016-02-12 09:26:26 UTC
I believe you.
Comment 2 Dennis Roczek 2016-02-23 12:22:21 UTC
Codepointers:
http://opengrok.libreoffice.org/xref/core/sc/source/core/tool/interpr1.cxx#3190

Please implement also an additional compatible CLEAN function for MSO Excel / OOXML (which only replaces 0x20 <= c  (see lcl_ScInterpreter_IsPrintable and Page 2123 of Ecma Office Open XML Part 1 - Fundamentals And Markup Language Reference.pdf in http://www.ecma-international.org/publications/standards/Ecma-376.htm ECMA-376 4th edition Part 1)
including the mapping
Comment 3 Taylor Lee 2017-08-17 20:06:21 UTC
Using http://www.unicode.org/Public/UNIDATA/UnicodeData.txt as reference, we can grep for ';(Cc|Cn);'.

We find that Cc and Cn chars are:
0x00 to 0x1F (inclusive)
0x7F to 0x9F (inclusive)

Using the above info, we can update the CLEAN isPrintable() function.

I also added CLEAN.OOXML using a new isPrintable_OOXML() function that returns true only if c > 0x1f.

Using the following fods to verify, I found that the CHAR function is not appropriate to use with CLEAN since CHAR returns 0xfffd for the range of 0x80 to 0xFF inclusive.  0x100 and above returns Err:502.

core/sc/qa/unit/data/functions/text/fods/clean.fods

Changing CHAR to UNICHAR, we can then successfully test the range 0x80 to 0x9F.

I'll attempt to use gerrit to start the patch process.
Comment 4 Xisco Faulí 2017-09-17 02:04:48 UTC Comment hidden (obsolete)
Comment 5 Taylor Lee 2017-09-18 18:24:36 UTC
I lost focus after not being able to figure out how to update 2 files.

https://gerrit.libreoffice.org/#/c/41271/
Comment 6 Buovjaga 2017-09-18 18:29:54 UTC
(In reply to Taylor Lee from comment #5)
> I lost focus after not being able to figure out how to update 2 files.
> 
> https://gerrit.libreoffice.org/#/c/41271/

It would be great to see you continue :) If you need some tips, you could visit the #libreoffice-dev IRC channel during EU office hours: https://wiki.documentfoundation.org/Website/IRC
Eike is in there, nickname: erAck
Comment 7 Xisco Faulí 2017-10-19 02:35:50 UTC Comment hidden (obsolete)
Comment 8 Xisco Faulí 2017-11-19 03:25:56 UTC Comment hidden (obsolete)
Comment 9 Xisco Faulí 2017-12-20 03:33:14 UTC Comment hidden (obsolete)
Comment 10 Xisco Faulí 2018-01-20 03:31:55 UTC Comment hidden (obsolete)
Comment 11 Xisco Faulí 2018-02-20 03:35:43 UTC Comment hidden (obsolete)
Comment 12 Xisco Faulí 2018-03-23 03:33:30 UTC Comment hidden (obsolete)
Comment 13 Xisco Faulí 2018-04-23 02:30:57 UTC Comment hidden (obsolete)
Comment 14 Xisco Faulí 2018-05-24 02:43:20 UTC Comment hidden (obsolete)
Comment 15 Xisco Faulí 2018-06-24 02:39:08 UTC Comment hidden (obsolete)
Comment 16 Xisco Faulí 2018-07-25 02:36:49 UTC Comment hidden (obsolete)
Comment 17 Xisco Faulí 2018-08-25 02:41:02 UTC Comment hidden (obsolete)
Comment 18 Xisco Faulí 2018-09-25 02:49:06 UTC Comment hidden (obsolete)
Comment 19 Xisco Faulí 2018-10-26 02:56:19 UTC Comment hidden (obsolete)
Comment 20 Xisco Faulí 2018-11-26 03:34:49 UTC Comment hidden (obsolete)
Comment 21 Xisco Faulí 2018-12-27 03:40:56 UTC Comment hidden (obsolete)
Comment 22 Xisco Faulí 2019-01-27 03:44:58 UTC Comment hidden (obsolete)
Comment 23 Xisco Faulí 2019-01-28 10:19:13 UTC
Dear Taylor,
This bug has been in ASSIGNED status for more than 3 months without any
activity. Resetting it to NEW.
Please assigned it back to yourself if you're still working on this.