Bug 160278 - XTextRange.setString("𝜇") or any other 32-bit Unicode character breaks the range
Summary: XTextRange.setString("𝜇") or any other 32-bit Unicode character breaks the range
Status: RESOLVED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer
Version: 24.2.1.2 release (earliest affected)
Hardware: All All
Importance: medium normal
Assignee: Mike Kaganski
URL: https://github.com/zotero/zotero-libr...
Whiteboard: target:24.8.0 target:24.2.3 target:7.6.7
Keywords:
Depends on:
Blocks: Fields Fields-Cross-Reference
Reported: 2024-03-19 15:14 UTC by Adomas Venčkauskas
Modified: 2024-03-28 14:20 UTC

See Also:
Crash report or crash signature:


Description Adomas Venčkauskas 2024-03-19 15:14:00 UTC
Description:
We have encountered a bug in the Zotero plugin where setting the string on a range with 32-bit Unicode characters, such as "𝜇" or "🔒", breaks the range. If you call `range.getString()`, it will return a string that is extended beyond the one that was set, increasing the length by 1 at the start of the range for every 32-bit encoded character.

Steps to Reproduce:
1. Create a range with a Java plugin
2. Call `range.setString("🔒")`

Actual Results:
The range start and end, as well as the string, are incorrectly reported.

Expected Results:
The range should be correctly reported.


Reproducible: Always


User Profile Reset: No

Additional Info:
N/A
Comment 1 Mike Kaganski 2024-03-19 16:58:33 UTC
You never gave the actual code that you find problematic. You only showed how to set the text and how to get the string from the text; but what code gives the "wrong" length?

Now, you mentioned "Java". You should know that in Java, strings are sequences of *UTF-16* code units, so a string length there is a count of 16-bit code units, not of Unicode characters. That is defined by the programming language you chose; it is not a LibreOffice-defined thing.
Comment 2 Mike Kaganski 2024-03-19 17:03:42 UTC
https://docs.oracle.com/javase/8/docs/api/java/lang/String.html

> A String represents a string in the UTF-16 format in which supplementary
> characters are represented by surrogate pairs (see the section Unicode Character
> Representations in the Character class for more information). Index values refer
> to char code units, so a supplementary character uses two positions in a String.
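
To make the point about code units concrete, here is a minimal standalone sketch (not part of the plugin or of LibreOffice; the class and method names are made up for illustration). It compares `String.length()` with `String.codePointCount()` for "𝜇" (U+1D707, written as its surrogate pair so the source file encoding does not matter) and for a BMP letter.

```java
final class Utf16LengthDemo {
    /** Number of UTF-16 code units, which is what String.length() reports. */
    static int codeUnits(String s) {
        return s.length();
    }

    /** Number of Unicode code points; a surrogate pair counts once. */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String mu = "\uD835\uDF07";  // U+1D707, outside the Basic Multilingual Plane
        String aOgonek = "\u0105";   // U+0105, inside the Basic Multilingual Plane

        // Prints "2 1": two code units, one code point.
        System.out.println(codeUnits(mu) + " " + codePoints(mu));
        // Prints "1 1": one code unit, one code point.
        System.out.println(codeUnits(aOgonek) + " " + codePoints(aOgonek));
    }
}
```

Any code that mixes the two counts, for example measuring in code units but stepping in characters, will be off by one for every supplementary character.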
Comment 3 Adomas Venčkauskas 2024-03-19 17:17:17 UTC
Sorry if this wasn't entirely clear.

Say the contents of my Writer document are

"Text[range]" where the text in brackets is a range.

Then

```
// Replaces the range's content and reads it back.
public String test(XTextRange range) throws Exception {
  range.setString("𝜇");
  return range.getString();
}
```

Will produce a String with content "t𝜇".

We have a big plugin codebase and I was debugging this in-code with a debugger, so I have not actually tested this as an isolated piece of code, but it should work, although I'm not sure if Java accepts a Unicode string literal like that.

Our actual problem is that we replace a range with some text from the user's Zotero library and then set a HyperLinkURL on the range. Our user reported an issue where this operation sets the link on the range plus some characters before the replaced range, and I nailed it down to the "𝜇" character causing the issue.

Shorter Unicode characters such as Latin letters with diacritics (ą, č, ę, etc.) do not produce the issue. I assume the issue is caused precisely because Java reports this character as a String of length 2, while LibreOffice may treat it as length 1, and the range should be of length 1 too. I have not investigated further, though, and wouldn't know where in the interface between Java and LibreOffice things go wrong.

If you want me to look into this some more or produce a better test case I can, but fundamentally I think someone able to fix this will be better equipped to know what to look for and where.
Comment 4 Mike Kaganski 2024-03-19 17:21:34 UTC
(In reply to Adomas Venčkauskas from comment #3)
> If you want me to look into this some more or produce a better test case

Yes please - the whole code + a text document, if needed. Thanks.
Comment 5 Adomas Venčkauskas 2024-03-20 09:00:08 UTC
Here is the modified starter project which showcases the bug

https://github.com/adomasven/libreoffice-java-unicode-bug

Install the project (https://github.com/adomasven/libreoffice-java-unicode-bug/raw/master/dist/StarterProject.oxt), then from LibreOffice menu run Starter Project -> Action One.

The relevant code is here https://github.com/adomasven/libreoffice-java-unicode-bug/commit/26e8531287f9f97ad886f7aef30c31ce909bf44f#diff-901f25d906d0499701765e181379a0b21f46448afcc45467f12afaaa361cf10eR72

You should make sure your writer document contains some text (is not empty).

The code:
1. Creates a text range on the final character of the text
2. Replaces it with string "𝜇".
3. Replaces it with string "text".

The expected result:
The last character of the text is replaced with "text"

The actual result:
The last two characters of the text are replaced with "text".
Comment 6 Adomas Venčkauskas 2024-03-20 09:02:36 UTC
To add additional detail:

After running Starter Project -> Action One you can use Undo to step back through the two changes in the document: the insertion of "text" and the insertion of "𝜇". You will see that inserting "𝜇" replaces only the last character, not the last two.
Comment 7 Mike Kaganski 2024-03-20 09:44:28 UTC
Repro - thank you!

This Basic code shows this from scratch:


sub testSurrogates
  doc = StarDesktop.LoadComponentFromUrl("private:factory/swriter", "_blank", 0, array())
  doc.text.setString("123")
  cursor = doc.text.createTextCursor()
  cursor.gotoEnd(false)
  cursor.goLeft(1, true)
  cursor.setString("𝜇")
  cursor.setString("test")
end sub

Before step 'cursor.setString("𝜇")', the document has text "123".
After step 'cursor.setString("𝜇")', the document has text "12𝜇", as expected.
After step 'cursor.setString("test")', the document has text "1test", instead of expected text "12test".
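
The behaviour of the Basic macro above can be reproduced in a toy model in plain Java. This is only a sketch and not LibreOffice code: `CursorRestoreModel` and its members are hypothetical, and the mechanism it models (re-selecting the inserted text by stepping left once per UTF-16 code unit while each step actually crosses a whole character) is an assumption that happens to match the symptoms, not something stated in this report or taken from the patch.

```java
final class CursorRestoreModel {
    private final StringBuilder doc;
    private int start;  // selection start, in UTF-16 code units
    private int end;    // selection end (exclusive), in UTF-16 code units
    private final boolean stepByCodePoints;

    CursorRestoreModel(String text, int start, int end, boolean stepByCodePoints) {
        this.doc = new StringBuilder(text);
        this.start = start;
        this.end = end;
        this.stepByCodePoints = stepByCodePoints;
    }

    /** Replaces the selection with s, then re-selects the inserted text. */
    void setString(String s) {
        doc.replace(start, end, s);
        end = start + s.length();
        int steps = s.length();  // a count of UTF-16 code units
        if (stepByCodePoints) {
            // Assumed faulty behaviour: every step crosses a whole code point,
            // so a surrogate pair uses one step and one step is left over.
            int available = doc.codePointCount(0, end);
            start = doc.offsetByCodePoints(end, -Math.min(steps, available));
        } else {
            // Consistent behaviour: every step crosses exactly one code unit.
            start = end - steps;
        }
    }

    String getString() {
        return doc.substring(start, end);
    }

    String text() {
        return doc.toString();
    }

    public static void main(String[] args) {
        for (boolean faulty : new boolean[] {true, false}) {
            // Document "123" with the last character selected, as in the macro.
            CursorRestoreModel cursor = new CursorRestoreModel("123", 2, 3, faulty);
            cursor.setString("\uD835\uDF07");  // U+1D707
            cursor.setString("test");
            // Prints "1test" for the faulty variant and "12test" otherwise.
            System.out.println(cursor.text());
        }
    }
}
```

In the faulty variant the selection after the first `setString` already covers "2𝜇", which corresponds to the extra leading character seen in `getString()` in comment 3.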
Comment 8 Mike Kaganski 2024-03-20 11:28:46 UTC
https://gerrit.libreoffice.org/c/core/+/165056
Comment 9 Commit Notification 2024-03-21 16:52:04 UTC
Mike Kaganski committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/1c52e94b2d926875d1a108c54988dbcb2dc9d017

tdf#160278: restore cursor bounds properly

It will be available in 24.8.0.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 10 Commit Notification 2024-03-22 07:36:03 UTC
Mike Kaganski committed a patch related to this issue.
It has been pushed to "libreoffice-24-2":

https://git.libreoffice.org/core/commit/41586f2f417a2d55d6baa07d3885d2d117a16d1d

tdf#160278: restore cursor bounds properly

It will be available in 24.2.3.

Comment 11 Adomas Venčkauskas 2024-03-22 07:40:37 UTC
Thanks for the swift resolution!
Comment 12 Commit Notification 2024-03-22 17:10:35 UTC
Mike Kaganski committed a patch related to this issue.
It has been pushed to "libreoffice-7-6":

https://git.libreoffice.org/core/commit/7b1fc707b56463d64585b2437f8531f9c1d71f75

tdf#160278: restore cursor bounds properly

It will be available in 7.6.7.
