In certain cases, ZWJ (U+200D) causes an automatic line break at its position, even if an NBSP is used near the ZWJ.
Steps to Reproduce:
1. Open the attached ODF file
2. Resize the frame a bit by dragging its bottom edge
When the frame is resized to a certain size, the Manchu suffix I (U+1873), which follows a ZWJ, is bumped to the start of the next line.
A character that follows a ZWJ should not be bumped to the start of the next line, even if the ZWJ itself follows a whitespace character.
User Profile Reset: No
Version: 184.108.40.206.beta1 (x64)
CPU threads: 4; OS: Windows 10.0; UI render: default;
Locale: zh-CN (zh_CN); Calc: group threaded
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0
Created attachment 138103
Created attachment 138104
Screen recording by LICEcap
UAX #14 (http://unicode.org/reports/tr14/) says the following about Zero Width Joiner: it directly prohibits line breaks within joiner sequences (ZWJ), prohibits a break between a zero width joiner and an ideograph, emoji base or emoji modifier (LB8a), and otherwise prohibits a line break between the character and the preceding character (CM). Notice that there is no general prohibition of a line break *after* it, but I suspect that this should depend on the previous and next character classes ("The line breaking behavior of the sequence is that of the base character", as per CM).
Is it possible to add an exception so that ZWJ prohibits a break when it is used with NBSP?
First, the Zero Width Joiner character is supposed to act on special character sequences that produce connected forms. In such sequences it is not always used between the connected characters; sometimes it is the last character in the sequence. When it is not adjacent to characters that might form such sequences, it is just a combining character: no break is allowed between it and the previous character, and the wrapping behaviour after the ZWJ is that of the previous character (i.e., if a line break is normally permitted after the previous character, then a break is also permitted after the sequence of that character and ZWJ).
As NBSP prohibits breaks before it, it should not be possible to break between ZWJ and NBSP. Based on this, it looks like there is a problem here. I am not confirming it because I lack the expertise in this area.
Btw: do you possibly want to use ZWNBSP instead of ZWJ?
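The reasoning in the comment above can be sketched as a toy model in Python. This is only an illustration under stated assumptions, not the UAX #14 algorithm and not what LibreOffice or ICU actually implement: it knows just three classes (SP, GL, AL), treats every non-space, non-glue character as AL, and the names resolve_classes, break_allowed and break_positions are hypothetical. A ZWJ inherits the class of the character before it, except after a space, where it is treated as ordinary text.

```python
# Toy model of the expected behaviour; NOT a full UAX #14 implementation.
ZWJ = "\u200D"    # ZERO WIDTH JOINER
NBSP = "\u00A0"   # NO-BREAK SPACE
NNBSP = "\u202F"  # NARROW NO-BREAK SPACE

GLUE = {NBSP, NNBSP}  # characters that forbid a break on either side


def resolve_classes(text):
    """Assign a simplified class to each character (assumption: SP, GL, AL only)."""
    classes = []
    for ch in text:
        if ch == ZWJ:
            if classes and classes[-1] != "SP":
                classes.append(classes[-1])  # ZWJ behaves like its base character
            else:
                classes.append("AL")  # after a space or at the start: ordinary text
        elif ch in GLUE:
            classes.append("GL")
        elif ch == " ":
            classes.append("SP")
        else:
            classes.append("AL")
    return classes


def break_allowed(text, pos):
    """True if a line break may occur between text[pos - 1] and text[pos]."""
    if pos <= 0 or pos >= len(text):
        return False
    classes = resolve_classes(text)
    before, after = classes[pos - 1], classes[pos]
    if text[pos] == ZWJ and before != "SP":
        return False  # never separate a ZWJ from its base character
    if before == "GL" or after == "GL":
        return False  # no break next to NBSP/NNBSP, even with a ZWJ attached
    return before == "SP" and after != "SP"  # toy rule: break only after spaces


def break_positions(text):
    """All offsets at which this toy model permits a line break."""
    return [p for p in range(1, len(text)) if break_allowed(text, p)]


print(break_positions("a" + NNBSP + ZWJ + "b"))
print(break_positions("a" + ZWJ + " b"))
```

In this model a sequence such as letter, NNBSP, ZWJ, letter has no break opportunity at all, while a ZWJ attached to a letter that is followed by a space still allows the usual break after the space.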
(In reply to Mike Kaganski from comment #5)
> Btw: do you possibly want to use ZWNBSP instead of ZWJ?
No, because in Mongolian/Manchu fonts ZWNBSP doesn't make the suffix letter join the way NNBSP and ZWJ do.
I confirm this buggy behaviour still exists with LibreOffice 6.0.2.
After reading http://www.unicode.org/reports/tr14/, I understood that more detail on the context is needed. I still think that the observed line break behaviour is buggy at least for the following codepoint combinations:
0064 200D 02DA (d ZWJ ˚) - no break should be allowed on any side of ZWJ
0067 200D 02F3 (g ZWJ ˳) - same
0077 200D 0237 (w ZWJ ȷ) - same
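For anyone who wants to reproduce this, the three sequences above can be generated with a few lines of Python. This is a minimal sketch: build_sequence, describe and make_test_line are hypothetical helper names, and the repeat count of 40 is an arbitrary assumption chosen only to make the text long enough to wrap.

```python
import unicodedata

ZWJ = "\u200D"  # ZERO WIDTH JOINER

# The three sequences listed above, as (first, last) code points around a ZWJ.
CASES = [(0x0064, 0x02DA), (0x0067, 0x02F3), (0x0077, 0x0237)]


def build_sequence(first, last):
    """Return the string <first, ZWJ, last> for the given code points."""
    return chr(first) + ZWJ + chr(last)


def describe(seq):
    """List each character of seq as 'U+XXXX NAME'."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in seq]


def make_test_line(seq, repeat=40):
    """Repeat seq separated by spaces so that line ends fall at many positions."""
    return " ".join([seq] * repeat)


for first, last in CASES:
    seq = build_sequence(first, last)
    print(" / ".join(describe(seq)))
    print(make_test_line(seq))
```

Pasting one of the printed lines into a narrow text frame and resizing it should show whether a line ever ends or begins at the ZWJ.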
I agree with you. I found that Firefox had already implemented this several months before I found this bug in LibreOffice, so LO should do it as well.
Created attachment 141753
Sample file containing Malayalam characters to understand word breaking with Zero Width Joiner
Reproducible in Version: 220.127.116.11.alpha1+ (x64) Build ID: a6a38c6de9c18fd1269fc8cfc0e070ef429c8e2f CPU threads: 4; OS: Windows 10.0; UI render: default; TinderBox: Win-x86_64@42, Branch:master, Time: 2018-04-28_01:58:12 Locale: en-IN (en_IN); Calc: group
Ramesh popped by on IRC, and mentioned this was working fine in LO 5.2.7, thanks for that piece of information, and for the sample!
Based on that, the change could be bibisected to the following range of commits:
Of which "upgrade to ICU 58" is the most likely culprit, especially since the document displays fine in LO 5.4.6 bundled with Ubuntu 17.10, which comes with ICU 57.1.
Moving to NEW based on comment 11
Even though I haven't managed to get the Sample ODT to display the intended glyphs, this bug still manifests - assuming that the gray lines between characters are ZWJ's. Tested with:
Build ID: 5d19a1bfa650b796764388cd8b33a5af1f5baa1b
CPU threads: 4; OS: Linux 4.15; UI render: default; VCL: gtk2;
Locale: en-GB (en_GB.UTF-8); Calc: group threaded
In response to Comment 11, this worked fine in LO 18.104.22.168 but is broken in 22.214.171.124, so the culprit is somewhere between those two versions, in case that helps.
I should say that this is a serious issue for Indic scripts (eg, Devanagari etc). In these scripts, ZWJ is used following a halant (virama) between two consonants, to block the formation of a conjunct form. In Devanagari this generally results in a half-form of the 1st consonant. [See the section "Explicit Half-Consonants" in chapter 12 of the Unicode Standard.] Breaking the line there leaves a half-character at the end of the line, which is invalid. So until this is fixed we have to revert to LO 126.96.36.199 or use another editor.
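The Devanagari case described above can be checked mechanically. The following Python sketch is only an illustration: splits_half_form is a hypothetical helper, and it assumes that a virama can be recognised by canonical combining class 9, which holds for DEVANAGARI SIGN VIRAMA (U+094D). The sample string is KA, VIRAMA, ZWJ, SSA, i.e. a half-form of KA followed by SSA.

```python
import unicodedata

ZWJ = "\u200D"  # ZERO WIDTH JOINER


def is_virama(ch):
    """Assumption: canonical combining class 9 identifies a virama (halant)."""
    return unicodedata.combining(ch) == 9


def splits_half_form(text, pos):
    """True if a break before text[pos] cuts into or right after <virama, ZWJ>."""
    if pos <= 0 or pos >= len(text):
        return False
    # A break between the virama and the ZWJ.
    if text[pos] == ZWJ and is_virama(text[pos - 1]):
        return True
    # A break right after <virama, ZWJ>, leaving the half-form at the line end.
    return pos >= 2 and text[pos - 1] == ZWJ and is_virama(text[pos - 2])


# KA + VIRAMA + ZWJ + SSA: explicit half-form of KA followed by SSA.
half_ka_ssa = "\u0915\u094D\u200D\u0937"
bad = [p for p in range(1, len(half_ka_ssa)) if splits_half_form(half_ka_ssa, p)]
print(bad)
```

Any break offset reported by a layout engine that this check flags would leave a half-character stranded at the end of a line.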
(In reply to Eyal Rozenberg from comment #13)
> Even though I haven't managed to get the Sample ODT to display the intended
> glyphs, this bug still manifests - assuming that the gray lines between
> characters are ZWJ's. Tested with:
> Version: 188.8.131.52
> Build ID: 5d19a1bfa650b796764388cd8b33a5af1f5baa1b
> CPU threads: 4; OS: Linux 4.15; UI render: default; VCL: gtk2;
> Locale: en-GB (en_GB.UTF-8); Calc: group threaded
This file uses the Abkai Xanyan fonts to render the text; you can get the fonts from here before reproducing with the sample ODT:
(In reply to ssmithg1 from comment #14)
> In response to Comment 11, this worked fine in LO 184.108.40.206 but is broken in
> 220.127.116.11, so the culprit is somewhere between those two versions, in case
> that helps.
> I should say that this is a serious issue for Indic scripts (eg, Devanagari
> etc). In these scripts, ZWJ is used following a halant (virama) between two
> consonants, to block the formation of a conjunct form. In Devanagari this
> generally results in a half-form of the 1st consonant. [See the section
> "Explicit Half-Consonants" in chapter 12 of the Unicode Standard.] Breaking
> the line there leaves a half-character at the end of the line, which is
> invalid. So until this is fixed we have to revert to LO 18.104.22.168 or use
> another editor.
So it seems to me that LibreOffice is missing something, or that some specific Unicode properties have not been handled properly since the new text layout backend was introduced in 5.3.