Bug 114160 - ZWJ shouldn't be treated as breaking character
Summary: ZWJ shouldn't be treated as breaking character
Status: NEW
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer
Version (earliest affected): 6.0.0.0.beta1
Hardware: All All
Importance: medium normal
Assignee: Not Assigned
URL:
Whiteboard:
Keywords: bibisected, regression
Depends on:
Blocks: RTL-CTL
Reported: 2017-11-30 08:23 UTC by Volga
Modified: 2019-04-28 13:26 UTC

See Also:
Crash report or crash signature:


Attachments
Sample ODT (10.24 KB, application/vnd.oasis.opendocument.text)
2017-11-30 08:24 UTC, Volga
Screen recording by LICEcap (66.95 KB, image/gif)
2017-11-30 08:26 UTC, Volga
Sample file containing malayalam characters to understand word breaking with Zero Width Joiner (287.22 KB, application/zip)
2018-04-29 05:46 UTC, Ramesh K

Description Volga 2017-11-30 08:23:16 UTC
Description:
In certain cases, a ZWJ (U+200D) causes an automatic line break at its position, even if an NBSP is used next to the ZWJ.

Steps to Reproduce:
1. Open the attached ODF file
2. Resize the frame slightly by dragging its bottom edge

Actual Results:  
When the frame is resized to a certain size, the Manchu suffix I (U+1873), which follows the ZWJ, is bumped to the start of the next line.

Expected Results:
A character that follows a ZWJ shouldn't be bumped to the start of the next line, even if the ZWJ itself follows a whitespace character.


Reproducible: Always


User Profile Reset: No



Additional Info:
Version: 6.0.0.0.beta1 (x64)
Build ID:97471ab4eb4db4c487195658631696bb3238656c
CPU threads: 4; OS: Windows 10.0; UI render: default; 
Locale: zh-CN (zh_CN); Calc: group threaded


User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0
Comment 1 Volga 2017-11-30 08:24:08 UTC
Created attachment 138103
Sample ODT
Comment 2 Volga 2017-11-30 08:26:13 UTC
Created attachment 138104
Screen recording by LICEcap
Comment 3 Mike Kaganski 2017-11-30 09:06:52 UTC
For the Zero-Width Joiner, UAX#14 (http://unicode.org/reports/tr14/) directly prohibits line breaks within joiner sequences (ZWJ), prohibits a break between a zero width joiner and an ideograph, emoji base or emoji modifier (LB8a), and otherwise prohibits a line break between the character and the preceding character (CM). Notice that there's no general prohibition of a line break *after* it, but I suspect that this should depend on the previous and next character classes ("The line breaking behavior of the sequence is that of the base character", as per CM).
Comment 4 Volga 2017-12-01 08:12:15 UTC
Is it possible to add an exception so that ZWJ prohibits a break next to NBSP?
Comment 5 Mike Kaganski 2017-12-01 09:51:29 UTC
First, Zero-Width Joiner character is supposed to act on special character sequences that produce connected forms [1]. In such sequences, it is not always used between the connected characters; sometimes it's the last character in the sequence. When it is used not adjacent to the characters that might create such sequences, it is just a combining character, which shouldn't allow breaking between it and the previous character, but wrapping behaviour after the ZWJ is that of the previous character (i.e., if normally it's permitted to break line after the previous character, then it would be possible to break after sequence of that character and ZWJ).

As NBSP prohibits breaks before it, it should not be possible to break between ZWJ and NBSP. Based on this, it looks like there is a problem here. I am not confirming the bug myself because I lack the expertise in this area.

Btw: do you possibly want to use ZWNBSP instead of ZWJ?

[1] https://en.wikipedia.org/wiki/Zero-width_joiner
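
The reasoning in this comment can be sketched in a few lines of Python. This is only a toy model, not the UAX #14 algorithm and not what LibreOffice or ICU actually run: it assumes that a ZWJ behaves like the character it follows and that NBSP and NNBSP act as glue on both sides. The names `governing_char` and `break_prohibited`, and the sample sequence, are made up for illustration.

```python
ZWJ = "\u200D"    # ZERO WIDTH JOINER
NBSP = "\u00A0"   # NO-BREAK SPACE
NNBSP = "\u202F"  # NARROW NO-BREAK SPACE
GLUE = {NBSP, NNBSP}


def governing_char(text, i):
    """Return the character whose behaviour applies at index i.

    Assumption: a run of ZWJs behaves like the character it follows,
    so we walk back over any ZWJs until we reach that character.
    """
    while i > 0 and text[i] == ZWJ:
        i -= 1
    return text[i]


def break_prohibited(text, i):
    """Return True if this toy model forbids a line break before text[i].

    False only means "not forbidden by the rules modelled here";
    a real implementation has many more rules to consult.
    """
    if not 0 < i < len(text):
        raise ValueError("i must be an interior position")
    if text[i] == ZWJ:
        return True  # never separate a ZWJ from what precedes it
    before = governing_char(text, i - 1)
    return before in GLUE or text[i] in GLUE


# Hypothetical sequence resembling the report: letter, NBSP, ZWJ, U+1873.
sample = "\u1820" + NBSP + ZWJ + "\u1873"
for pos in range(1, len(sample)):
    print(pos, break_prohibited(sample, pos))
```

In this model all three interior positions of `sample` come out as prohibited, which matches the expectation in the report that U+1873 stays on the same line as the preceding NBSP and ZWJ.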
Comment 6 Volga 2017-12-26 08:36:19 UTC
(In reply to Mike Kaganski from comment #5)
> Btw: do you possibly want to use ZWNBSP instead of ZWJ?
No, because in Mongolian/Manchu fonts ZWNBSP doesn't make the suffix letter join the way NNBSP and ZWJ do.
Comment 7 Phil Krylov 2018-03-21 15:13:38 UTC
I confirm this buggy behaviour still exists with LibreOffice 6.0.2.
Comment 8 Phil Krylov 2018-03-21 17:14:28 UTC
After reading http://www.unicode.org/reports/tr14/, I understood that more detail on the context is needed. I still think that the observed line break behaviour is buggy at least for the following codepoint combinations:

0064 200D 02DA (d ZWJ ˚) - no break should be allowed on any side of ZWJ
0067 200D 02F3 (g ZWJ ˳) - same
0077 200D 0237 (w ZWJ ȷ) - same
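
To make these cases easier to reproduce, the following Python sketch builds the three sequences and lists their code points. It uses only the standard `unicodedata` module; `describe` and `wrap_probe` are hypothetical helper names, and the repetition count is arbitrary.

```python
import unicodedata

ZWJ = "\u200D"  # ZERO WIDTH JOINER

# The three sequences listed above: base letter, ZWJ, following character.
SEQUENCES = ["d" + ZWJ + "\u02DA", "g" + ZWJ + "\u02F3", "w" + ZWJ + "\u0237"]


def describe(seq):
    """Return one 'U+XXXX NAME' entry per code point in seq."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in seq]


def wrap_probe(seq, repeat=40):
    """Repeat seq separated by spaces so wrap points move as a frame is resized."""
    return " ".join([seq] * repeat)


for seq in SEQUENCES:
    print(describe(seq))

# Paste the result into a narrow text frame and resize it to look for
# a line that ends with, or starts right after, the ZWJ.
probe = wrap_probe(SEQUENCES[0])
```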
Comment 9 Volga 2018-03-22 16:23:12 UTC
I agree with you. Firefox had already implemented this several months before I found this bug in LibreOffice, so LO should do it as well.
Comment 10 Ramesh K 2018-04-29 05:46:11 UTC
Created attachment 141753
Sample file containing malayalam characters to understand word breaking with Zero Width Joiner

Reproducible in Version: 6.1.0.0.alpha1+ (x64) Build ID: a6a38c6de9c18fd1269fc8cfc0e070ef429c8e2f CPU threads: 4; OS: Windows 10.0; UI render: default; TinderBox: Win-x86_64@42, Branch:master, Time: 2018-04-28_01:58:12 Locale: en-IN (en_IN); Calc: group
Comment 11 Aron Budea 2018-05-07 19:24:14 UTC
Ramesh popped by on IRC and mentioned this was working fine in LO 5.2.7. Thanks for that piece of information, and for the sample!

Based on that, the change could be bibisected to the following range of commits:
https://cgit.freedesktop.org/libreoffice/core/log/?qt=range&q=b68ed302830fd1c44212eeb6c23d5a08b7dc97ec..092261ffd497f752c342f1fbdca6e7267e312a21

Of which "upgrade to ICU 58" is the most likely culprit, especially since the document displays fine in LO 5.4.6 bundled with Ubuntu 17.10, which comes with ICU 57.1.
Comment 12 Xisco Faulí 2018-06-07 08:44:15 UTC
Moving to NEW based on comment 11
Comment 13 Eyal Rozenberg 2018-09-17 20:38:46 UTC
Even though I haven't managed to get the Sample ODT to display the intended glyphs, this bug still manifests - assuming that the gray lines between characters are ZWJ's. Tested with:

Version: 6.1.1.2
Build ID: 5d19a1bfa650b796764388cd8b33a5af1f5baa1b
CPU threads: 4; OS: Linux 4.15; UI render: default; VCL: gtk2; 
Locale: en-GB (en_GB.UTF-8); Calc: group threaded
Comment 14 ssmithg1 2019-02-08 10:10:10 UTC
In response to Comment 11, this worked fine in LO 5.2.7.2 but is broken in 5.3.2.2, so the culprit is somewhere between those two versions, in case that helps.

I should say that this is a serious issue for Indic scripts (eg, Devanagari etc). In these scripts, ZWJ is used following a halant (virama) between two consonants, to block the formation of a conjunct form. In Devanagari this generally results in a half-form of the 1st consonant. [See the section "Explicit Half-Consonants" in chapter 12 of the Unicode Standard.] Breaking the line there leaves a half-character at the end of the line, which is invalid. So until this is fixed we have to revert to LO 5.2.7.2 or use another editor.
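
As an illustration of the Devanagari case, here is a small Python sketch that checks a list of proposed break offsets against a string and reports those that fall directly before or after a ZWJ. The sequence KA, VIRAMA, ZWJ, SSA is just one example of a halant followed by ZWJ; `zwj_adjacent_breaks` is a hypothetical helper, and the offsets would have to come from whatever layout engine is being tested.

```python
ZWJ = "\u200D"  # ZERO WIDTH JOINER

# KA (U+0915), VIRAMA (U+094D), ZWJ, SSA (U+0937), then a space and a repeat.
HALF_FORM = "\u0915\u094D" + ZWJ + "\u0937"
TEXT = HALF_FORM + " " + HALF_FORM


def zwj_adjacent_breaks(text, offsets):
    """Return the offsets that would break the line right before or after a ZWJ.

    An offset i means "start a new line at text[i]". Only adjacency to ZWJ
    is checked; other invalid breaks (e.g. before the virama) are ignored.
    """
    bad = []
    for i in offsets:
        if 0 < i < len(text) and (text[i] == ZWJ or text[i - 1] == ZWJ):
            bad.append(i)
    return bad


# Offset 5 (after the space) is harmless; offset 3 strands the half-form.
print(zwj_adjacent_breaks(TEXT, [3, 5]))  # → [3]
```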
Comment 15 Volga 2019-02-17 17:05:43 UTC
(In reply to Eyal Rozenberg from comment #13)
> Even though I haven't managed to get the Sample ODT to display the intended
> glyphs, this bug still manifests - assuming that the gray lines between
> characters are ZWJ's. Tested with:
> 
> Version: 6.1.1.2
> Build ID: 5d19a1bfa650b796764388cd8b33a5af1f5baa1b
> CPU threads: 4; OS: Linux 4.15; UI render: default; VCL: gtk2; 
> Locale: en-GB (en_GB.UTF-8); Calc: group threaded
This file uses the Abkai Xanyan fonts to render the text; you can get the fonts from here before reproducing with the sample ODT:
http://abkai.net/core/en/manchu/manchu-fonts/

(In reply to ssmithg1 from comment #14)
> In response to Comment 11, this worked fine in LO 5.2.7.2 but is broken in
> 5.3.2.2, so the culprit is somewhere between those two versions, in case
> that helps.
> 
> I should say that this is a serious issue for Indic scripts (eg, Devanagari
> etc). In these scripts, ZWJ is used following a halant (virama) between two
> consonants, to block the formation of a conjunct form. In Devanagari this
> generally results in a half-form of the 1st consonant. [See the section
> "Explicit Half-Consonants" in chapter 12 of the Unicode Standard.] Breaking
> the line there leaves a half-character at the end of the line, which is
> invalid. So until this is fixed we have to revert to LO 5.2.7.2 or use
> another editor.
So it seems to me that LibreOffice is missing something, or that some specific Unicode properties have not been handled properly since the new text layout backend was introduced in 5.3.