Bug 138591 - Using Unicode conversion on a combined emoji results in only partial conversion
Status: NEW
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer
Version: 5.1.0.3 release (earliest affected)
Hardware: All All
Importance: medium normal
Assignee: Not Assigned
URL:
Whiteboard:
Keywords: implementationError
Depends on:
Blocks: Font-Rendering
Reported: 2020-12-01 07:17 UTC by Mike Kaganski
Modified: 2023-07-28 20:11 UTC (History)
CC List: 4 users

See Also:
Crash report or crash signature:


Attachments
Combined emoji and combined character (8.24 KB, application/vnd.oasis.opendocument.text)
2020-12-01 07:18 UTC, Mike Kaganski
EmojiTest.odt: various emoji combinations - tried to be relatively comprehensive (13.75 KB, application/vnd.oasis.opendocument.text)
2020-12-04 12:56 UTC, Justin L

Description Mike Kaganski 2020-12-01 07:17:05 UTC
In the attached document, the first line contains a combined emoji (👨🏼, U+1f468 "MAN" + U+1f3fc "EMOJI MODIFIER FITZPATRICK TYPE-3"). Putting the cursor immediately after the emoji and pressing Alt+X results not in the expected "U+1f468U+1f3fc", which would represent both elements of the emoji, but in "👨U+1f3fc", i.e. "MAN" is not converted into the text representing its code point.

For comparison, the second line has a combined character á, U+0061 "LATIN SMALL LETTER A" + U+0301 "COMBINING ACUTE ACCENT". Pressing Ctrl+End to move after the character and then pressing Alt+X converts both parts of the combined character: "U+0061U+0301". (There's a strange *different* issue: when the mouse is used to put the cursor after the character, the result is as if the cursor had been put between the two parts. That is *unrelated* to the issue here.)
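
To make the expected and observed results concrete, here is a minimal Python sketch. It is not LibreOffice code; the function names `to_u_notation` and `convert_trailing_only` are made up for illustration. The first spells out every code point, which is the expected Alt+X result; the second models the reported behavior, where only the last code point is converted.

```python
def to_u_notation(text: str) -> str:
    """Spell out every code point as U+xxxx (lowercase hex, at least 4 digits)."""
    return "".join(f"U+{ord(ch):04x}" for ch in text)


def convert_trailing_only(text: str) -> str:
    """Hypothetical model of the reported behavior: convert only the last code point."""
    if not text:
        return text
    return text[:-1] + to_u_notation(text[-1])
```

With the emoji from the first line, `to_u_notation` yields the string the report expects, while `convert_trailing_only` leaves "MAN" in place and converts only the modifier.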

Tested with Version: 7.0.3.1 (x64)
Build ID: d7547858d014d4cf69878db179d326fc3483e082
CPU threads: 12; OS: Windows 10.0 Build 19042; UI render: Skia/Raster; VCL: win
Locale: ru-RU (ru_RU); UI: en-US
Calc: CL
Comment 1 Mike Kaganski 2020-12-01 07:18:17 UTC
Created attachment 167703
Combined emoji and combined character
Comment 2 V Stuart Foote 2020-12-01 19:36:34 UTC
Interesting, it will convert both from HEX, i.e. U+1f468U+1f3fc to combined.

But toggling the opposite way, starting from the combined glyphs, applies only to the trailing glyph.

It does not seem to have anything to do with the combining nature of the Emoji Modifiers or the SMP.

Two BMP symbols ☣♻ U+2623U+267b will toggle from HEX to glyph, but only the trailing glyph is converted back. 
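
The HEX-to-glyph direction described above can be sketched in Python as follows. This only illustrates the notation and is not how LibreOffice implements the toggle; the name `from_u_notation` and the accepted token shape (4 to 6 hex digits) are assumptions.

```python
import re

# A "U+" prefix followed by 4 to 6 hex digits; the accepted shape is an assumption.
_U_TOKEN = re.compile(r"[Uu]\+([0-9A-Fa-f]{4,6})")


def from_u_notation(text: str) -> str:
    """Replace each U+xxxx token with the character it names."""
    def repl(match: re.Match) -> str:
        code_point = int(match.group(1), 16)
        # Leave tokens outside the Unicode range untouched.
        if code_point > 0x10FFFF:
            return match.group(0)
        return chr(code_point)

    return _U_TOKEN.sub(repl, text)
```

Because every token is replaced in one pass, a run such as "U+2623U+267b" becomes both glyphs at once, which matches the direction that already works.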

 =-testing-=
2020-11-19
Version: 7.1.0.0.alpha1+ (x64)
Build ID: ccd0e5f445d4a7d0e7aca6c23c02c61bf14510b2
CPU threads: 8; OS: Windows 10.0 Build 18363; UI render: Skia/Vulkan; VCL: win
Locale: en-US (en_US); UI: en-US
Calc: CL
Comment 3 V Stuart Foote 2020-12-01 19:43:07 UTC
I believe this is as implemented in https://gerrit.libreoffice.org/17535
The behavior is present in Version: 5.1.6.2 (x64)
Build ID: 07ac168c60a517dba0f0d7bc7540f5afa45f0909
CPU Threads: 8; OS Version: Windows 6.19; UI Render: GL; 
Locale: en-US (en_US); Calc: CL
Comment 4 Justin L 2020-12-02 09:54:38 UTC
This might be VERY complex to do completely correctly, since there does not seem to be a single standard way of marking combined sequences of emoji.

The latest version of the "Unicode Emoji" spec can be found at http://www.unicode.org/reports/tr51/.

Having glanced through the spec, I imagine adding some kind of logic like:
// Pseudocode only; isEmoji() and isEmoji_modifier_base() do not exist yet.
if ( maInput.getLength() == 0 )
    bIsEmojiSequence = isEmoji();
if ( isEmoji_modifier_base() )
    bHaveEmoji_modifier_base = true;
const sal_uInt32 nZWJ = 0x200D; // ZERO WIDTH JOINER (U+FE0F is VARIATION SELECTOR-16, not ZWJ)

if ( bIsEmojiSequence )
{
    if ( next == nZWJ || (isEmoji(next) && !bHaveEmoji_modifier_base) )
    {
        // continue to accept new characters
    }
}


It looks like this will require some low-level identification of emoji, since there is no classification yet such as ::com::sun::star::i18n::UnicodeType::EMOJI
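
As a rough illustration of the idea, the following Python sketch scans backward from the cursor and extends the selection over skin-tone modifiers, VARIATION SELECTOR-16 and ZWJ joins. It is a simplification and not the logic proposed above: it deliberately avoids classifying emoji at all, so it would also extend over a ZWJ used in non-emoji text. The names `is_emoji_modifier` and `sequence_start` are hypothetical.

```python
ZWJ = "\u200d"   # ZERO WIDTH JOINER
VS16 = "\ufe0f"  # VARIATION SELECTOR-16


def is_emoji_modifier(ch: str) -> bool:
    """True for the Fitzpatrick skin-tone modifiers U+1F3FB..U+1F3FF."""
    return "\U0001F3FB" <= ch <= "\U0001F3FF"


def sequence_start(text: str, end: int) -> int:
    """Return the index where the sequence ending just before `end` begins.

    ASSUMPTION: `end` is a code point index; out-of-range values are clamped.
    """
    if not 0 < end <= len(text):
        return max(0, min(end, len(text)))
    start = end - 1
    while start > 0:
        ch = text[start]
        if ch == VS16 or is_emoji_modifier(ch):
            # Modifiers and VS16 attach to the preceding code point.
            start -= 1
        elif start >= 2 and text[start - 1] == ZWJ:
            # A ZWJ glues this code point to the one before the joiner.
            start -= 2
        else:
            break
    return start
```

For the emoji in the report, the scan returns 0, so both code points would be converted. For two independent symbols such as ☣♻ it returns 1 and still selects only the trailing one; whether that is the desired result for that case is a separate design question.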
Comment 5 Mike Kaganski 2020-12-02 09:57:44 UTC
(In reply to Justin L from comment #4)

Just a random idea: can't we use the same code that WrtShell uses when it does its "step left"/"step right" magic, to identify what constitutes a single "character cell"?
Comment 6 Mike Kaganski 2020-12-03 08:16:50 UTC
(In reply to Justin L from comment #4)

We might want to create a text cursor for the current view cursor [1], and use the text cursor to iterate over the positions, instead of iterating over the code points.

[1] https://wiki.openoffice.org/wiki/Writer/API/Text_cursor
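
The difference between iterating over cursor positions and iterating over code points can be sketched in Python. Here `step_left` is a hypothetical stand-in for the cursor's "go left by one cell" and is not the Writer API; it treats only combining marks (via `unicodedata.combining`) as part of the preceding cell, whereas a real cursor would also keep emoji sequences together.

```python
import unicodedata


def step_left(text: str, pos: int) -> int:
    """Hypothetical stand-in for a cursor stepping left by one character cell."""
    if pos <= 0:
        return 0
    pos -= 1
    # Skip back over combining marks so they stay with their base character.
    while pos > 0 and unicodedata.combining(text[pos]):
        pos -= 1
    return pos


def toggle_cell_before(text: str, pos: int) -> str:
    """Convert the whole cell before `pos` to U+xxxx notation."""
    start = step_left(text, pos)
    cell = text[start:pos]
    converted = "".join(f"U+{ord(ch):04x}" for ch in cell)
    return text[:start] + converted + text[pos:]
```

The conversion never decides for itself where a character begins; it converts whatever range the cursor movement reports, so improving the cursor's notion of a cell would improve the conversion too.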
Comment 7 Mike Kaganski 2020-12-04 08:31:09 UTC
https://gerrit.libreoffice.org/c/core/+/107187 is a proof of concept using XTextCursor for Writer. It doesn't work for Calc/Draw/Math in its current form, because the implementation needs XTextCursor for the edit engines there (which ought to be possible, but I can't work on that any longer).

Whoever wants to try to implement this, feel free to jump in and use the code as you like.
Comment 8 Justin L 2020-12-04 12:56:04 UTC
Created attachment 167829
EmojiTest.odt: various emoji combinations - tried to be relatively comprehensive
Comment 9 Mike Kaganski 2020-12-05 11:06:20 UTC
Or maybe it would be better to make a Writer-local change, using SwCursor::LeftRight and CRSR_SKIP_CELLS.
Comment 10 QA Administrators 2022-12-07 03:22:27 UTC
Dear Mike Kaganski,

To make sure we're focusing on the bugs that affect our users today, LibreOffice QA is asking bug reporters and confirmers to retest open, confirmed bugs which have not been touched for over a year.

There have been thousands of bug fixes and commits since anyone checked on this bug report. During that time, it's possible that the bug has been fixed, or the details of the problem have changed. We'd really appreciate your help in getting confirmation that the bug is still present.

If you have time, please do the following:

Test to see if the bug is still present with the latest version of LibreOffice from https://www.libreoffice.org/download/

If the bug is present, please leave a comment that includes the information from Help - About LibreOffice.
 
If the bug is NOT present, please set the bug's Status field to RESOLVED-WORKSFORME and leave a comment that includes the information from Help - About LibreOffice.

Please DO NOT

Update the version field
Reply via email (please reply directly on the bug tracker)
Set the bug's Status field to RESOLVED - FIXED (this status has a particular meaning that is not appropriate in this case)


If you want to do more to help you can test to see if your issue is a REGRESSION. To do so:
1. Download and install oldest version of LibreOffice (usually 3.3 unless your bug pertains to a feature added after 3.3) from https://downloadarchive.documentfoundation.org/libreoffice/old/

2. Test your bug
3. Leave a comment with your results.
4a. If the bug was present with 3.3 - set version to 'inherited from OOo';
4b. If the bug was not present in 3.3 - add 'regression' to keyword


Feel free to come ask questions or to say hello in our QA chat: https://web.libera.chat/?settings=#libreoffice-qa

Thank you for helping us make LibreOffice even better for everyone!

Warm Regards,
QA Team

MassPing-UntouchedBug