123752 – EDITING: Regression in handling of the group separator in formula context

Bug 123752 - EDITING: Regression in handling of the group separator in formula context

Summary: EDITING: Regression in handling of the group separator in formula context

Status:	VERIFIED FIXED

Alias:	None

Product:	LibreOffice
Classification:	Unclassified
Component:	Calc (show other bugs)
Version: (earliest affected)	6.2.0.1 rc
Hardware:	All All

Importance:	low minor
Assignee:	Eike Rathke (retired, only occasionally showing up)

URL:
Whiteboard:	target:6.4.0 target:6.3.0.1 target:7.3.0
Keywords:	bibisected, bisected, regression

Depends on:
Blocks:	Number-Format User-Locale
	Show dependency tree / graph

Reported:	2019-02-27 23:33 UTC by Jouni Kosonen
Modified:	2021-07-13 22:30 UTC (History)
CC List:	4 users (show)

See Also:	138432
Crash report or crash signature:

Attachments
Minimal patch to restore functionality (727 bytes, patch) 2019-04-07 23:26 UTC, Jouni Kosonen	Details
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Jouni Kosonen 2019-02-27 23:33:14 UTC

Calc version 6.2 introduced a regression with a possibility of data corruption. This affects locales where the grouping separator is not comma.

Steps to reproduce: 

1) use a locale like fi_FI for the document
2) enter a value like 12345,67 into a worksheet cell
3) pick a number format like "# ##0,00" for the cell
4) copy the cell to clipboard
5) enter another cell, type a starting = and paste

The result is Err:509 in the target cell and the pasted value has an extra zero before the decimal separator, i.e. =123450,67 

If the worksheet with such an error is closed and then reopened, the error status gets cleared but the value in the target cell is the one with an extra zero.

This happens with both the provided Windows binary (6.2.0.3) and a Linux version (6.2.1.1) compiled from source.

If the originating cell has a format with no group separator or if the group separator of the locale is comma, everything works like before.

The reason behind the regression appears to be an incomplete fix to bug 42518, since reverting commits 673bd16887c354981120f7ebea44b25b0ab2e41e and 9336286a7ea5385541344f444e6f8702c85bdacb is enough to build a binary without the regression.

Comment 1 Laurent Balland 2019-03-01 09:54:28 UTC

I reproduce the bug with locale fr-FR:
- Version: 6.2.0.3 (x64)
Build ID: 98c6a8a1c6c7b144ce3cc729e34964b47ce25d62
Threads CPU : 2; OS : Windows 6.1; UI Render : par défaut; VCL: win; 
Locale : fr-FR (fr_FR); Langue IHM : fr-FR
Calc: CL

I do NOT reproduce with:
- Version: 6.1.5.1 (x64)
Build ID: f18954c1ba9116b85c32b6bdbc0188d3e0fd24c7
Threads CPU : 2; OS : Windows 6.1; UI Render : par défaut; 
Locale : fr-FR (fr_FR); Calc: CL

Comment 2 Laurent Balland 2019-03-01 10:31:48 UTC

Same bug with new profile
Same bug with more or less decimals
Same bug with locale en-US
No bug without '='
No bug without thousand separator

So procedure can be simplified to:
1. Enter a value larger than 1000
2. Add thousand separator
3. Copy cell
4. Type a starting '=' and paste

Behavior will depend on value, if it contains '0' or not: some '0' are removed in pasted string
1,000 => 1,0
1,001 => 1,1
1,034,000 => 1,34,0
1,034,001 => 1,34,1
1,034,010 => 1,34,10
103,000 => 103,0
103,001 => 103,1
103,011 => 103,11

I was not able to add an extra zero.

Comment 3 Eike Rathke (retired, only occasionally showing up) 2019-03-01 13:13:25 UTC

Entering a formula it was never guaranteed to accept group separators in values (as in many locales the group separator also is the parameter separator ','). It may have worked by chance in some locales that use a NO-BREAK SPACE as group separator, but already using an ordinary SPACE character instead made that attempt fail; or if parameter separator was set to ';' a group separator of ',' may have worked.

I can investigate why an extra zero is added in
12345,67 => =123450,67
of the original description (fi-FI locale).

But in an en-US locale with default ',' parameter separator things like
1,000 => 1,0
are logical because it is parsed as
number 1  separator ,  number 0
Certainly the result of SUM(1,234) shall not be 1234 but 235 instead..
This was not different in earlier versions.

Comment 4 Jouni Kosonen 2019-03-01 13:37:01 UTC

(In reply to Eike Rathke from comment #3)
> Entering a formula it was never guaranteed to accept group separators in
> values (as in many locales the group separator also is the parameter
> separator ','). It may have worked by chance in some locales that use a
> NO-BREAK SPACE as group separator, but already using an ordinary SPACE
> character instead made that attempt fail; or if parameter separator was set
> to ';' a group separator of ',' may have worked.
There's still the problem that this happens also when what's copied on the clipboard is the entire cell, not just its displayed value. Changing how a number is formatted in one cell probably shouldn't cause Err:509 in another. Is there a reason to use the display formatted string in paste to a non-string context when the numeric value is also available?

> I can investigate why an extra zero is added in
> 12345,67 => =123450,67
> of the original description (fi-FI locale).
Thanks.

Comment 5 Buovjaga 2019-03-01 15:45:03 UTC

Jouni: thanks for your report and investigation. For bisecting, we have special repos that contain binaries for commits (or commit ranges in the older ones): https://wiki.documentfoundation.org/QA/Bibisect

Comment 6 Eike Rathke (retired, only occasionally showing up) 2019-03-04 20:53:18 UTC

What happens is that 12_345,67 (here with '_' for the no-break space) is lexically tokenized to

12
_345
,67

which then leads to three numbers 12 and 345 (because there in the number parser the no-break space is just a leading space) and 0,67 (from which stems the extra 0 because a leading decimal separator is allowed), which then with the formula tokens concatenated as strings produces 123450,67

Comment 7 Jouni Kosonen 2019-03-14 21:18:55 UTC

(In reply to Buovjaga from comment #5)
> Jouni: thanks for your report and investigation. For bisecting, we have
> special repos that contain binaries for commits (or commit ranges in the
> older ones): https://wiki.documentfoundation.org/QA/Bibisect

I see that this issue has already been marked as bisected, so I'll just drop this here for completeness' sake:

$ git bisect good 9ab19b74662842c303ef72c33f6045cb701ae299 is the first bad commit
commit 9ab19b74662842c303ef72c33f6045cb701ae299
Author: Jenkins Build User <tdf@pollux.tdf>
Date:   Sat Dec 1 03:52:19 2018 +0100

    source 9336286a7ea5385541344f444e6f8702c85bdacb
    
    source 9336286a7ea5385541344f444e6f8702c85bdacb

:040000 040000 1c2382d2d7f50dab3c61ab8d29a824698e3c41cc 9070851048d99c2944c6643546ccbc1ff445e357 M      instdir

As expected, that source sha is for the commit for the intended fix in 
bug 42518 comment 24 mentioned in the original report.

Comment 8 Jouni Kosonen 2019-03-30 19:20:07 UTC

(In reply to Eike Rathke from comment #6)
> What happens is that 12_345,67 (here with '_' for the no-break space) is
> lexically tokenized to
> 
> 12
> _345
> ,67
> 
> which then leads to three numbers 12 and 345 (because there in the number
> parser the no-break space is just a leading space) and 0,67 (from which
> stems the extra 0 because a leading decimal separator is allowed), which
> then with the formula tokens concatenated as strings produces 123450,67

That would explain the extra zero and also the Err:509 afterwards, since that ends up producing three numeric tokens with no operators in between.

It doesn't explain *why* this happens or why it *doesn't* happen when there is a decimal separator but no group separator.

Both 6.1 and 6.2 have now been released with this behavior and I can't find anything related in gerrit, is this being worked on or should I just try to figure out some kind of a fix myself?

Comment 9 Buovjaga 2019-03-30 19:33:01 UTC

(In reply to Jouni Kosonen from comment #8)
> Both 6.1 and 6.2 have now been released with this behavior and I can't find
> anything related in gerrit, is this being worked on or should I just try to
> figure out some kind of a fix myself?

If you could do that, it would be great and I think Eike would be happy to review your patch.

Comment 10 Jouni Kosonen 2019-03-30 21:09:00 UTC

(In reply to Buovjaga from comment #9)
> (In reply to Jouni Kosonen from comment #8)
> > Both 6.1 and 6.2 have now been released with this behavior and I can't find
> > anything related in gerrit, is this being worked on or should I just try to
> > figure out some kind of a fix myself?
> 
> If you could do that, it would be great and I think Eike would be happy to
> review your patch.

That could take a while, not least because Gentoo Linux just pulled the 6.2.1.2 ebuild I've been using to tinker with this and the 6.2.2.2 they replaced it with doesn't actually build.

As far as I can tell, the best way to fix this without breaking bug 42518 again would be to process the group separator so that the formula parser never sees it was there in the first place. I'll look into it but Ḯ'm not overly optimistic.

Comment 11 Jouni Kosonen 2019-04-01 17:44:38 UTC

I think I see some of what's happening now. 

In i18npool/source/characterclassification/cclass_unicode_parser.cxx 
cclass_Unicode::initParserTable, KParseTokens::GROUP_SEPARATOR_IN_NUMBER isn't on in nContTypes, but I don't know why. I can't seem to find where that bit is supposed to be set.

This means that the cGroupSep code point doesn't get the ParserFlags::VALUE bit set and is actually set to zero itself.

This then means that the code in cclass_Unicode::parseText handling cGroupSep not only doesn't look at the right codepoint, it doesn't even get run because it is conditioned on the ParserFlags::VALUE bit and the group separator is handled as nMask ParserFlags::(VALUE_SEP | WORD_SEP | CHAR_DONTCARE). This ssBounces the parser out of the ssGetValue state and the separator gets reevaluated in a new ssGetChar state as nMask ParserFlags::CHAR_WORD. 

That would be why the value gets emitted as sequence multiple tokens instead of a single parsed number. The numbers following the group separator get parsed in the same ssGetWord state that started from the group separator. If the value ends with a decimal part, that comma ends the word and gets evaluated outside the parseText state machine and appended to the sequence.

TLDR: I get a working binary if I simply set the ParserFlags::VALUE bit on for the group separator code point regardless of KParseTokens::GROUP_SEPARATOR_IN_NUMBER being off in nContTypes, but that would presumably have some other effect elsewhere.

Comment 12 Jouni Kosonen 2019-04-07 23:26:26 UTC

Created attachment 150588 [details]
Minimal patch to restore functionality

This appears to be the simplest patch. If the locale provides a decimal group separator, use the new KParseTokens::GROUP_SEPARATOR_IN_NUMBER to indicate this so that it doesn't get zeroed out.

The patch was made against 6.2.3.1 sources.

Comment 13 Buovjaga 2019-04-08 05:39:55 UTC

(In reply to Jouni Kosonen from comment #12)
> Created attachment 150588 [details]
> Minimal patch to restore functionality
> 
> This appears to be the simplest patch. If the locale provides a decimal
> group separator, use the new KParseTokens::GROUP_SEPARATOR_IN_NUMBER to
> indicate this so that it doesn't get zeroed out.
> 
> The patch was made against 6.2.3.1 sources.

Could you make a patch against master and submit it to gerrit? https://wiki.documentfoundation.org/Development/gerrit

Comment 14 Eike Rathke (retired, only occasionally showing up) 2019-04-08 12:17:36 UTC

(In reply to Jouni Kosonen from comment #12)
> Created attachment 150588 [details]
> Minimal patch to restore functionality
The patch is wrong. It unconditionally adds the flag for every locale no matter what parser flags the caller set and thus eliminates the purpose of the flag. If at all then the flag should be set at caller level. But as described earlier, doing so may also lead to unexpected behaviour if the group separator equals the function parameter separator. This is a rather twisted situation, and things interfere also with ScCompiler's lexical analysis. Actually the i18npool parser is called only because the ScCompiler parser decided it would not handle the character (left for investigation if and how it could be solved at that level).

Comment 15 Jouni Kosonen 2019-04-08 16:50:14 UTC

(In reply to Eike Rathke from comment #14)
> The patch is wrong. It unconditionally adds the flag for every locale no
> matter what parser flags the caller set and thus eliminates the purpose of
> the flag. If at all then the flag should be set at caller level. But as
> described earlier, doing so may also lead to unexpected behaviour if the
> group separator equals the function parameter separator. This is a rather
> twisted situation, and things interfere also with ScCompiler's lexical
> analysis. Actually the i18npool parser is called only because the ScCompiler
> parser decided it would not handle the character (left for investigation if
> and how it could be solved at that level).

Yes, but as I understand it there isn't a single line anywhere in the codebase that would set the flag – as things currently stand the group separator is unconditionally broken for all locales, some just don't show the symptoms. The ScCompiler parser may have decided it wouldn't handle the character but then the i18npool parser isn't allowed to recognize it.

Comment 16 Eike Rathke (retired, only occasionally showing up) 2019-05-09 12:10:59 UTC

(In reply to Jouni Kosonen from comment #15)
> Yes, but as I understand it there isn't a single line anywhere in the
> codebase that would set the flag
That is the point. It should only be set under very specific circumstances in a well known surrounding. Which so far does not exist.

Comment 17 Jouni Kosonen 2019-05-09 19:51:55 UTC

(In reply to Eike Rathke from comment #16)
> (In reply to Jouni Kosonen from comment #15)
> > Yes, but as I understand it there isn't a single line anywhere in the
> > codebase that would set the flag
> That is the point. It should only be set under very specific circumstances
> in a well known surrounding. Which so far does not exist.

Well, no. By your own comment about the flag, leaving it *unset* introduces new, unspecified behaviour and should only be done under very specific circumstances. Instead, that unspecified behaviour is now the default action and there's no way around it.

The way the code is now, KParseTokens::GROUP_SEPARATOR_IN_NUMBER would need to be on by default unless someone explicitly clears it for reasons best known to themselves.

Comment 18 Eike Rathke (retired, only occasionally showing up) 2019-06-03 19:05:24 UTC

Without the change it was unspecified behaviour, in that it sometimes by chance "worked" under some circumstances to include a group separator in a number, and sometimes parsed a group separator away even if it wasn't meant as a group separator. With the change and KParseTokens::GROUP_SEPARATOR_IN_NUMBER not set it is specified behaviour that group separators in formulas are not ignored and end a number.

Comment 19 Eike Rathke (retired, only occasionally showing up) 2019-06-03 19:15:46 UTC

Taking a stab at this.

Comment 20 Commit Notification 2019-06-04 10:16:25 UTC

Eike Rathke committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/+/4e3e0c6744b3fc33c7f4da7bd48136b861e9ed58%5E%21

Resolves: tdf#123752 allow group separator in formula values if possible

It will be available in 6.4.0.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.

Comment 21 Commit Notification 2019-06-04 10:59:02 UTC

Eike Rathke committed a patch related to this issue.
It has been pushed to "libreoffice-6-3":

https://git.libreoffice.org/core/+/c4eb7dddf047a613fee78188a65b36f5215b4c21%5E%21

Resolves: tdf#123752 allow group separator in formula values if possible

It will be available in 6.3.0.1.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.

Comment 22 Buovjaga 2019-06-05 17:28:44 UTC

Verified

Arch Linux 64-bit
Version: 6.4.0.0.alpha0+
Build ID: 6c31c2b01dd32cc7ba1230f2c4a98b8f7def219b
CPU threads: 8; OS: Linux 5.1; UI render: default; VCL: gtk3; 
Locale: fi-FI (fi_FI.UTF-8); UI-Language: en-US
Calc: threaded
Built on 5 June 2019

Comment 23 Jouni Kosonen 2019-06-08 08:01:56 UTC

Patching with c4eb7dddf047a613fee78188a65b36f5215b4c21 does restore the functionality in 6.2.4.2.

Comment 24 Commit Notification 2021-07-13 22:30:08 UTC

Xisco Fauli committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/213430e0bdac0786b30a76a68b43d35647e93912

tdf#123752, tdf#138432: sc_uicalc: Add unittest

It will be available in 7.3.0.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.