Bug 97198 - Calc: incorrect formula text function result from Unicode's non-BMP characters. Functions LEN(), LEFT(), RIGHT(), MID(), SEARCH(), FIND(), REPLACE()
Summary: Calc: incorrect formula text function result from Unicode's non-BMP character...
Status: RESOLVED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Calc (show other bugs)
Version:
(earliest affected)
Inherited From OOo
Hardware: All All
: medium normal
Assignee: Winfried Donkers
URL:
Whiteboard: target:6.0.0 target:6.1.0 target:6.0.0.1
Keywords:
Depends on:
Blocks: Calc-Function
  Show dependency treegraph
 
Reported: 2016-01-17 00:52 UTC by Mark Hung
Modified: 2017-12-01 12:50 UTC (History)
5 users (show)

See Also:
Crash report or crash signature:


Attachments
Non-BMP characters handling test case (10.61 KB, application/vnd.oasis.opendocument.spreadsheet)
2016-01-17 00:52 UTC, Mark Hung
Details
Non-BMP characters handling test case (with "üë" string) (9.80 KB, application/vnd.oasis.opendocument.spreadsheet)
2016-12-25 04:40 UTC, Aron Budea
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Mark Hung 2016-01-17 00:52:53 UTC
Created attachment 122013 [details]
Non-BMP characters handling test case

In general, text functions that has position or length as the parameter or return value have incorrect results. Check attachment.
Comment 1 m_a_riosv 2016-01-17 02:36:32 UTC
Hi Mark, thanks for reporting.

What font is used for TS?, i have not it installed.
Comment 2 Mark Hung 2016-01-17 08:31:34 UTC
You might need MingLiu-EXTB to see the characters.
Comment 3 m_a_riosv 2016-01-17 13:54:44 UTC
There are special functions for fonts with double byte characters.
like:
MIDB()
https://help.libreoffice.org/index.php?title=5.0/Calc/Text_Functions&Language=en-US&System=WIN&Version=5.0#bm_id2955392
LENB()
https://help.libreoffice.org/index.php?title=5.0/Calc/Text_Functions&Language=en-US&System=WIN&Version=5.0#bm_id2954050
UNICHAR()
https://help.libreoffice.org/index.php?title=5.0/Calc/Text_Functions&Language=en-US&System=WIN&Version=5.0#bm_id0907200904011330

And I guess you nee to set up properly Menu/Tools/Options/Language settings/Default language, for CTL documents. And there is the same option in Cell format.

I'm not used to work with CTL languages, maybe you can find help in the ML http://nabble.documentfoundation.org/

And please don't set up you own reports as new, the usually is someone else doing it after reproduce the issue.
Comment 4 Mark Hung 2016-01-18 00:09:35 UTC
m.a.rsiov:
It was a mistake to change it to NEW. Thanks.

I think I can explain more for this issue:

Test string is composed of 4 unicode characters: "𧌒𧀬", U+22 U+27312 U+2702C U+22.  I expect that LEN(TS) been expressed as number of unicode code points. 

However, LO internally store characters in UTF16, which means U+27312 and U+2702C has to be represented as "surrogate pairs", hence len(TS) would be different from the result of counting codepoints.
Comment 5 azurement@orange.fr 2016-12-19 14:23:12 UTC Comment hidden (off-topic)
Comment 6 m_a_riosv 2016-12-19 15:03:06 UTC Comment hidden (off-topic)
Comment 7 Aron Budea 2016-12-25 04:40:25 UTC
Created attachment 129931 [details]
Non-BMP characters handling test case (with "üë" string)

Based on Mark's example, and Dennis' sample from bug 101962, here's one that doesn't require special fonts.

I'm afraid that's all I can add to the discussion.
Comment 8 Carlos 2017-04-17 19:27:08 UTC
I can not reproduce the bug in
Versión: 5.2.3.3
Id. de compilación: d54a8868f08a7b39642414cf2c8ef2f228f780cf
Subprocesos de CPU: 2; Versión de SO: Windows 5.1; Repr. de IU: predeterminado; 
Configuración regional: es-AR (es_AR); Calc: group


The attachment 129931 [details] (after opening it in LO and saving as MS Excel 2007-2013 (.xlsx)) produced the same results in MS Excel 2011. 

Apparently LO works the same as MS Excel with the "üë" string.
Comment 9 Winfried Donkers 2017-11-08 06:45:53 UTC
The OpenFormula standard ODFF1.2 states for functions LEFT, LEN, MID, RIGHT that the result may be application-dependent, see http://docs.oasis-open.org/office/v1.2/os/OpenDocument-v1.2-os-part2.html#Text-String .

I will look into the code of all functions and see what causes the behaviour as in both attachments.
Comment 10 Winfried Donkers 2017-11-08 10:37:39 UTC
As already said in comment #4, LO uses 16bits for unicode characters, which is not enough for the UniCode 5.2.0 (referenced to in ODFF1.2). Changing this could have far-reaching effects in all sorts of places in the LO code. Anyway, IMHO such a change would require an ESC decision.

As Excel 2016 has the same results as LO, interoperability would be affected when LO is changed.

I suggest to set this bug report to RESOLVED-NOTABUG and create a new bug report/enhancement request to change the UniCode typedef in the LO code.

Adding ERack to cc as he is the Calc expert.
Comment 11 Eike Rathke 2017-11-09 12:41:31 UTC
(In reply to Winfried Donkers from comment #10)
> As already said in comment #4, LO uses 16bits for unicode characters, which
> is not enough for the UniCode 5.2.0 (referenced to in ODFF1.2).
That's not true. Strings are UTF-16, which is well capable to represent all Unicode planes.

However, the second example is not even about non-BMP characters, the "üë" in B1 contains the sequence

U+0075 LATIN SMALL LETTER U
U+0308 COMBINING DIAERESIS
U+0065 LATIN SMALL LETTER E
U+0308 COMBINING DIAERESIS

To regard a combined character as one character, some Unicode normalization would have to take place. Which, if at all, should be implemented in a separate function.

As for the first attachment, these are real non-BMP characters. For this to work correctly the mentioned spreadsheet functions should operate on the actual Unicode characters code points instead of the 16-bit units. There's OUString::iterateCodePoints() to do that properly, a change in LO's string handling is not necessary.

> As Excel 2016 has the same results as LO, interoperability would be affected
> when LO is changed.
Also the same result with the first attachment and real non-BMP characters?
Anyway, I couldn't find a definition for Excel how they'd treat non-BMP characters, but as LEN() is supposed to return the length in *characters*, not in 16-bit code units, I propose to actually handle it that way. I consider treating these string functions in code units wrong. Specifically with LEFT() or RIGHT() or MID() or REPLACE() that should operate on characters, otherwise with code units could cut non-BMP characters in the middle.
Comment 12 Winfried Donkers 2017-11-09 14:30:06 UTC
(In reply to Eike Rathke from comment #11)
> As for the first attachment, these are real non-BMP characters. For this to
> work correctly the mentioned spreadsheet functions should operate on the
> actual Unicode characters code points instead of the 16-bit units. There's
> OUString::iterateCodePoints() to do that properly, a change in LO's string
> handling is not necessary.
> 
> > As Excel 2016 has the same results as LO, interoperability would be affected
> > when LO is changed.
> Also the same result with the first attachment and real non-BMP characters?
> Anyway, I couldn't find a definition for Excel how they'd treat non-BMP
> characters, but as LEN() is supposed to return the length in *characters*,
> not in 16-bit code units, I propose to actually handle it that way. I
> consider treating these string functions in code units wrong. Specifically
> with LEFT() or RIGHT() or MID() or REPLACE() that should operate on
> characters, otherwise with code units could cut non-BMP characters in the
> middle.

Thanks for your clarification, I will see if I can fix this.
(And I will document the Excel results in the unit test documents.)
Comment 13 Commit Notification 2017-11-24 20:11:38 UTC
Winfried Donkers committed a patch related to this issue.
It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=e78f508997f9384518c2dd1a005c5306ccd24783

tdf#97198 Make calc function MID work with Unicode non-BMP characters.

It will be available in 6.0.0.

The patch should be included in the daily builds available at
http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
http://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 14 Commit Notification 2017-11-24 20:11:47 UTC
Winfried Donkers committed a patch related to this issue.
It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=09455734727daa3774158795febd910b33b1d784

tdf#97198 Make Calc function RIGHT work with Unicode non-BMP characters.

It will be available in 6.0.0.

The patch should be included in the daily builds available at
http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
http://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 15 Commit Notification 2017-11-24 20:13:19 UTC
Winfried Donkers committed a patch related to this issue.
It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=5e2a3eed69060d63070d72df60cd84247ca0bc9e

tdf#97198 Make Calc function FIND work with UniCode non-BMP characters.

It will be available in 6.0.0.

The patch should be included in the daily builds available at
http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
http://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 16 Commit Notification 2017-11-24 20:15:00 UTC
Winfried Donkers committed a patch related to this issue.
It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=47155ce8fb4302d377b374d39b501d0a0f7577c5

tdf#97198 Make Calc function LEN work with Unicode non-BMP characters.

It will be available in 6.0.0.

The patch should be included in the daily builds available at
http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
http://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 17 Commit Notification 2017-11-24 20:15:08 UTC
Winfried Donkers committed a patch related to this issue.
It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=78b7e5f76e72647e0cab662ad2c9342d4538704f

tdf#97198 Make Calc function SEARCH work with UniCode non-BMP characters.

It will be available in 6.0.0.

The patch should be included in the daily builds available at
http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
http://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 18 Commit Notification 2017-11-24 20:16:47 UTC
Winfried Donkers committed a patch related to this issue.
It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=4e3530b4f870e0470c23ae89cdb8b9a09af54d59

tdf#97198 Make Calc function LEFT work with Unicode non-BMP characters

It will be available in 6.0.0.

The patch should be included in the daily builds available at
http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
http://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 19 Commit Notification 2017-11-29 12:41:44 UTC
Winfried Donkers committed a patch related to this issue.
It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=aaba725b466950f47c4e93d790986d209e80ead3

tdf#97198 Make Calc function REPLACE work with UniCode non-BMP-characters.

It will be available in 6.1.0.

The patch should be included in the daily builds available at
http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
http://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 20 Commit Notification 2017-11-29 23:38:52 UTC
Winfried Donkers committed a patch related to this issue.
It has been pushed to "libreoffice-6-0":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=44487fee8646c0ded5c2b2ffaaa70e75a7af7e04&h=libreoffice-6-0

tdf#97198 Make Calc function REPLACE work with UniCode non-BMP-characters.

It will be available in 6.0.0.1.

The patch should be included in the daily builds available at
http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
http://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 21 Commit Notification 2017-11-30 15:30:54 UTC
Winfried Donkers committed a patch related to this issue.
It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=c5d0cffbf42f41349a8575ba52456d9cb07e91c7

tdf#97198 follow up: improve efficiency for Calc function RIGHT.

It will be available in 6.1.0.

The patch should be included in the daily builds available at
http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
http://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 22 Commit Notification 2017-12-01 10:26:12 UTC
Winfried Donkers committed a patch related to this issue.
It has been pushed to "libreoffice-6-0":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=664d40fb24bf500a22efd1f1aff58a96b3c3cf2c&h=libreoffice-6-0

tdf#97198 follow up: improve efficiency for Calc function RIGHT.

It will be available in 6.0.0.1.

The patch should be included in the daily builds available at
http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
http://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.