Bug 114160 - ZWJ shouldn't be treated as breaking character
Summary: ZWJ shouldn't be treated as breaking character
Status: NEW
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer
Version (earliest affected): 6.0.0.0.beta1
Hardware: All All
Importance: high minor
Assignee: Not Assigned
URL:
Whiteboard:
Keywords: bibisected, regression
Depends on:
Blocks: RTL-CTL ZWNJ-ZWJ
Reported: 2017-11-30 08:23 UTC by Volga
Modified: 2023-07-19 10:43 UTC
CC List: 8 users

See Also:
Crash report or crash signature:


Attachments
Sample ODT (10.24 KB, application/vnd.oasis.opendocument.text)
2017-11-30 08:24 UTC, Volga
Screen recording by LICEcap (66.95 KB, image/gif)
2017-11-30 08:26 UTC, Volga
Sample file containing malayalam characters to understand word breaking with Zero Width Joiner (287.22 KB, application/zip)
2018-04-29 05:46 UTC, Ramesh K
Latest ICU line break iterator rules (10.77 KB, text/plain)
2021-06-09 05:05 UTC, martin_hosken

Description Volga 2017-11-30 08:23:16 UTC
Description:
In certain cases ZWJ (U+200D) causes an automatic line break at its position, even if NBSP is used next to the ZWJ.

Steps to Reproduce:
1. Open the attached ODF file
2. Resize the frame at its bottom edge a bit

Actual Results:  
When you resize the frame to a certain size, the Manchu suffix I (U+1873), which follows the ZWJ, is bumped to the top of the next line.

Expected Results:
If a character follows ZWJ, it shouldn't be bumped to the top of the next line, even if the ZWJ follows a whitespace character.


Reproducible: Always


User Profile Reset: No



Additional Info:
Version: 6.0.0.0.beta1 (x64)
Build ID:97471ab4eb4db4c487195658631696bb3238656c
CPU threads: 4; OS: Windows 10.0; UI render: default; 
Locale: zh-CN (zh_CN); Calc: group threaded


User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0
Comment 1 Volga 2017-11-30 08:24:08 UTC
Created attachment 138103 [details]
Sample ODT
Comment 2 Volga 2017-11-30 08:26:13 UTC
Created attachment 138104 [details]
Screen recording by LICEcap
Comment 3 Mike Kaganski 2017-11-30 09:06:52 UTC
UAX#14 (http://unicode.org/reports/tr14/) for Zero-Width Joiner directly prohibits line breaks within joiner sequences (ZWJ), prohibits break between a zero width joiner and an ideograph, emoji base or emoji modifier (LB8a), and in other respects, prohibits a line break between the character and the preceding character (CM). Notice that there's no prohibition of word break *after*, but I suspect that that should be dependent on the previous and next character classes ("The line breaking behavior of the sequence is that of the base character", as per CM).
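
A minimal Python sketch of the CM behaviour quoted above (rule LB9 in UAX #14, with LB10 as the fallback). The table `SAMPLE_CLASSES` and the function `effective_classes` are hypothetical names, and treating every unlisted character as AL is an assumption made only to keep the example short; real classes come from the Unicode line break property data.

```python
# Toy model of UAX #14 LB9/LB10: a combining mark (CM) or ZWJ takes over the
# line-break class of the base character before it.
# SAMPLE_CLASSES is a tiny hand-written sample, not real Unicode data.
SAMPLE_CLASSES = {
    "\u200d": "ZWJ",  # ZERO WIDTH JOINER
    "\u0301": "CM",   # COMBINING ACUTE ACCENT
    "\u00a0": "GL",   # NO-BREAK SPACE
    " ": "SP",        # SPACE
}


def effective_classes(text):
    """Return one line-break class per code point after LB9 and LB10."""
    result = []
    for ch in text:
        cls = SAMPLE_CLASSES.get(ch, "AL")  # assumption: everything else is AL
        if cls in ("CM", "ZWJ"):
            if result and result[-1] != "SP":
                # LB9: behave like the preceding base character.
                # (The full rule also excludes BK, CR, LF, NL and ZW.)
                cls = result[-1]
            else:
                # LB10: a CM or ZWJ without a usable base behaves like AL.
                cls = "AL"
        result.append(cls)
    return result


print(effective_classes("a\u200db"))      # ZWJ after a letter
print(effective_classes("\u00a0\u200d"))  # ZWJ after NBSP
print(effective_classes(" \u200d"))       # ZWJ after a plain space
```

Under this model a ZWJ that follows NBSP is itself treated as glue, while a ZWJ that follows a plain space falls back to AL.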
Comment 4 Volga 2017-12-01 08:12:15 UTC
Is it possible to add an exception so that ZWJ prohibits a break next to NBSP?
Comment 5 Mike Kaganski 2017-12-01 09:51:29 UTC
First, Zero-Width Joiner character is supposed to act on special character sequences that produce connected forms [1]. In such sequences, it is not always used between the connected characters; sometimes it's the last character in the sequence. When it is used not adjacent to the characters that might create such sequences, it is just a combining character, which shouldn't allow breaking between it and the previous character, but wrapping behaviour after the ZWJ is that of the previous character (i.e., if normally it's permitted to break line after the previous character, then it would be possible to break after sequence of that character and ZWJ).

As NBSP prohibits breaks before it, it should not be possible to break between ZWJ and NBSP. Based on this, it looks like there is a problem here. I am not confirming it because I lack the expertise in this area.

Btw: do you possibly want to use ZWNBSP instead of ZWJ?

[1] https://en.wikipedia.org/wiki/Zero-width_joiner
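
The reasoning in this comment can be sketched as a toy pairwise check in Python. Everything here is an illustration under stated assumptions: `lb_class`, `resolve` and `can_break_before` are hypothetical helpers, only four classes (AL, SP, GL, ZWJ) are modelled, and every character other than space, NBSP and ZWJ is assumed to be AL. The sketch also applies the broader form of LB8a ("do not break after ZWJ"), which, as far as I know, later revisions of UAX #14 (Unicode 11 onwards) adopted.

```python
# Toy line-break check for a handful of UAX #14 rules.
# The class assignment below is a hand-written assumption, not Unicode data.
ZWJ, NBSP = "\u200d", "\u00a0"


def lb_class(ch):
    if ch == ZWJ:
        return "ZWJ"
    if ch == NBSP:
        return "GL"
    if ch == " ":
        return "SP"
    return "AL"  # assumption: every other character is an ordinary letter


def resolve(text):
    """Apply LB9/LB10: ZWJ takes the class of its base, or AL after a space."""
    out = []
    for ch in text:
        cls = lb_class(ch)
        if cls == "ZWJ":
            cls = out[-1] if out and out[-1] != "SP" else "AL"
        out.append(cls)
    return out


def can_break_before(text, i):
    """True if a line break is allowed between text[i-1] and text[i]."""
    if not 0 < i < len(text):
        raise ValueError("i must point inside the text")
    if text[i - 1] == ZWJ:
        return False  # LB8a (newer UAX #14 revisions): no break after ZWJ
    classes = resolve(text)
    before, after = classes[i - 1], classes[i]
    if text[i] == ZWJ and before != "SP":
        return False  # LB9: ZWJ is glued to its base character
    if after == "SP":
        return False  # LB7: no break before a space
    if before == "GL":
        return False  # LB12: no break after glue (NBSP)
    if after == "GL" and before != "SP":
        return False  # LB12a: no break before glue, except after a space
    if before == "SP":
        return True   # LB18: break after spaces
    return False      # LB28: no break between two AL characters


for sample in ("a\u200d\u00a0b", "a \u200db"):
    allowed = [i for i in range(1, len(sample)) if can_break_before(sample, i)]
    print([f"U+{ord(c):04X}" for c in sample], "break offsets:", allowed)
```

In this model the only break opportunity in "a, space, ZWJ, b" is directly after the space, so the character following the ZWJ is never pushed to the next line on its own.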
Comment 6 Volga 2017-12-26 08:36:19 UTC
(In reply to Mike Kaganski from comment #5)
> Btw: do you possibly want to use ZWNBSP instead of ZWJ?
No, because in Mongolian/Manchu fonts ZWNBSP doesn't make the suffix letter join the way NNBSP and ZWJ do.
Comment 7 Phil Krylov 2018-03-21 15:13:38 UTC
I confirm this buggy behaviour still exists with LibreOffice 6.0.2.
Comment 8 Phil Krylov 2018-03-21 17:14:28 UTC
After reading http://www.unicode.org/reports/tr14/, I understood that more detail on the context is needed. I still think that the observed line break behaviour is buggy at least for the following codepoint combinations:

0064 200D 02DA (d ZWJ ˚) - no break should be allowed on any side of ZWJ
0067 200D 02F3 (g ZWJ ˳) - same
0077 200D 0237 (w ZWJ ȷ) - same
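
To make these cases easy to paste into a test document, here is a small Python sketch that turns the hex notation above into strings and lists each code point with its Unicode name. `SEQUENCES`, `build` and `describe` are hypothetical names chosen for this example; only the standard `unicodedata` module is used.

```python
import unicodedata

# The three code point sequences listed above, in the same hex notation.
SEQUENCES = ["0064 200D 02DA", "0067 200D 02F3", "0077 200D 0237"]


def build(seq):
    """Turn a space-separated list of hex code points into a string."""
    return "".join(chr(int(cp, 16)) for cp in seq.split())


def describe(text):
    """List each code point of the text with its Unicode character name."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in text]


for seq in SEQUENCES:
    print(seq)
    for line in describe(build(seq)):
        print("   ", line)
```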
Comment 9 Volga 2018-03-22 16:23:12 UTC
I agree with you. Firefox had already implemented this several months before I found this bug in LibreOffice, so LO should do it as well.
Comment 10 Ramesh K 2018-04-29 05:46:11 UTC
Created attachment 141753 [details]
Sample file containing malayalam characters to understand word breaking with Zero Width Joiner

Reproducible in Version: 6.1.0.0.alpha1+ (x64) Build ID: a6a38c6de9c18fd1269fc8cfc0e070ef429c8e2f CPU threads: 4; OS: Windows 10.0; UI render: default; TinderBox: Win-x86_64@42, Branch:master, Time: 2018-04-28_01:58:12 Locale: en-IN (en_IN); Calc: group
Comment 11 Aron Budea 2018-05-07 19:24:14 UTC
Ramesh popped by on IRC, and mentioned this was working fine in LO 5.2.7, thanks for that piece of information, and for the sample!

Based on that, the change could be bibisected to the following range of commits:
https://cgit.freedesktop.org/libreoffice/core/log/?qt=range&q=b68ed302830fd1c44212eeb6c23d5a08b7dc97ec..092261ffd497f752c342f1fbdca6e7267e312a21

Of which "upgrade to ICU 58" is the most likely culprit, especially since the document displays fine in LO 5.4.6 bundled with Ubuntu 17.10, which comes with ICU 57.1.
Comment 12 Xisco Faulí 2018-06-07 08:44:15 UTC
Moving to NEW based on comment 11
Comment 13 Eyal Rozenberg 2018-09-17 20:38:46 UTC
Even though I haven't managed to get the Sample ODT to display the intended glyphs, this bug still manifests - assuming that the gray lines between characters are ZWJ's. Tested with:

Version: 6.1.1.2
Build ID: 5d19a1bfa650b796764388cd8b33a5af1f5baa1b
CPU threads: 4; OS: Linux 4.15; UI render: default; VCL: gtk2; 
Locale: en-GB (en_GB.UTF-8); Calc: group threaded
Comment 14 ssmithg1 2019-02-08 10:10:10 UTC
In response to Comment 11, this worked fine in LO 5.2.7.2 but is broken in 5.3.2.2, so the culprit is somewhere between those two versions, in case that helps.

I should say that this is a serious issue for Indic scripts (eg, Devanagari etc). In these scripts, ZWJ is used following a halant (virama) between two consonants, to block the formation of a conjunct form. In Devanagari this generally results in a half-form of the 1st consonant. [See the section "Explicit Half-Consonants" in chapter 12 of the Unicode Standard.] Breaking the line there leaves a half-character at the end of the line, which is invalid. So until this is fixed we have to revert to LO 5.2.7.2 or use another editor.
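
As a rough illustration of the half-form case, the following Python sketch marks the offsets inside a "consonant, virama, ZWJ, consonant" sequence where a line break would split the half-form, and checks a list of break offsets against them. `protected_offsets` and `bad_breaks` are hypothetical helpers, the offsets are Python string indices (code points), and only the Devanagari virama U+094D is handled; this is a test aid under those assumptions, not a line breaker.

```python
VIRAMA = "\u094d"  # DEVANAGARI SIGN VIRAMA (halant)
ZWJ = "\u200d"     # ZERO WIDTH JOINER


def protected_offsets(text):
    """Offsets where a break would split a virama + ZWJ half-form sequence."""
    offsets = set()
    for i, ch in enumerate(text):
        if ch == ZWJ and i > 0 and text[i - 1] == VIRAMA:
            offsets.add(i)          # between the virama and the ZWJ
            if i + 1 < len(text):
                offsets.add(i + 1)  # between the ZWJ and the next consonant
    return offsets


def bad_breaks(text, breaks):
    """Return the reported break offsets that fall inside a half-form."""
    return sorted(protected_offsets(text) & set(breaks))


sample = "\u0915\u094d\u200d\u0937"  # KA, VIRAMA, ZWJ, SSA
print(bad_breaks(sample, [3]))  # a break right after the ZWJ is flagged
print(bad_breaks(sample, [4]))  # a break after the whole cluster is fine
```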
Comment 15 Volga 2019-02-17 17:05:43 UTC
(In reply to Eyal Rozenberg from comment #13)
> Even though I haven't managed to get the Sample ODT to display the intended
> glyphs, this bug still manifests - assuming that the gray lines between
> characters are ZWJ's. Tested with:
> 
> Version: 6.1.1.2
> Build ID: 5d19a1bfa650b796764388cd8b33a5af1f5baa1b
> CPU threads: 4; OS: Linux 4.15; UI render: default; VCL: gtk2; 
> Locale: en-GB (en_GB.UTF-8); Calc: group threaded
This file uses the Abkai Xanyan fonts to render the text. You can get the fonts from here before reproducing with the sample ODT:
http://abkai.net/core/en/manchu/manchu-fonts/

(In reply to ssmithg1 from comment #14)
> In response to Comment 11, this worked fine in LO 5.2.7.2 but is broken in
> 5.3.2.2, so the culprit is somewhere between those two versions, in case
> that helps.
> 
> I should say that this is a serious issue for Indic scripts (eg, Devanagari
> etc). In these scripts, ZWJ is used following a halant (virama) between two
> consonants, to block the formation of a conjunct form. In Devanagari this
> generally results in a half-form of the 1st consonant. [See the section
> "Explicit Half-Consonants" in chapter 12 of the Unicode Standard.] Breaking
> the line there leaves a half-character at the end of the line, which is
> invalid. So until this is fixed we have to revert to LO 5.2.7.2 or use
> another editor.
So it seems to me that LibreOffice is missing something, or that some specific Unicode properties have not been handled properly since the new text layout backend was introduced in 5.3.
Comment 16 QA Administrators 2021-04-28 03:51:03 UTC Comment hidden (obsolete)
Comment 17 Volga 2021-05-04 05:59:48 UTC
This is still reproducible with

Version: 7.1.2.2 (x64) / LibreOffice Community
Build ID: 8a45595d069ef5570103caea1b71cc9d82b2aae4
CPU threads: 4; OS: Windows 10.0 Build 19042; UI render: Skia/Raster; VCL: win
Locale: ro-RO (ro_RO); UI: en-US
Comment 18 martin_hosken 2021-06-09 05:04:05 UTC
My understanding of how to fix this bug is that we just need to freshen the line.txt in i18npool/source/breakiterator/data to bring it up to date with ICU again. The one in libo is very old and makes no mention of ZWJ.

So I propose we just take the latest and greatest from icu4c/source/data/brkitr/rules/line.txt. I will add a copy from ICU as of this date.
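
A quick way to compare the two rule files is to check whether each one refers to ZWJ at all. The Python sketch below is only an approximation: `mentions_zwj` is a hypothetical helper, and the patterns it searches for (`Line_Break = ZWJ`, a `$ZWJ` variable, or an escaped `\u200d`) are my assumptions about how a rules file would spell the class, so a miss is a hint rather than proof.

```python
import re

# Assumed spellings of the ZWJ class in a break-iterator rules file.
ZWJ_PATTERN = re.compile(r"Line_Break\s*=\s*ZWJ|\$ZWJ\b|\\u200[dD]")


def mentions_zwj(rules_text):
    """True if any non-comment part of the rules text refers to ZWJ."""
    for line in rules_text.splitlines():
        code = line.split("#", 1)[0]  # ignore '#' comments
        if ZWJ_PATTERN.search(code):
            return True
    return False


# Hypothetical rule fragments for demonstration only.
old_style = "$CM = [:LineBreak = Combining_Mark:];\n# ZWJ is not handled here\n"
new_style = "$ZWJ = [\\p{Line_Break = ZWJ}];\n"
print(mentions_zwj(old_style), mentions_zwj(new_style))
```

To run it against real files, read each line.txt with `open(path, encoding="utf-8").read()` and pass the text to `mentions_zwj`.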
Comment 19 martin_hosken 2021-06-09 05:05:42 UTC
Created attachment 172720 [details]
Latest ICU line break iterator rules

I propose that this replace i18npool/source/breakiterator/data/line.txt
Comment 20 Volga 2021-06-10 07:10:55 UTC
That's nice. It would be better if LibreOffice could query the property directly from ICU, or if this text file could be updated directly from Unicode.org when compiling a new version.
Comment 21 martin_hosken 2021-07-14 04:10:45 UTC
True. But then one gets into the debate of whether libo should carry its own break iterator specifications or just use ICU. This way, it's a quick bug fix which fixes the bug and allows the refactoring to be pushed down the road. But if someone wants to remove the product specific break iterators and revert to ICU, I won't complain.
Comment 22 Julien Nabet 2022-11-13 09:46:01 UTC
Eike: following last Martin's comment, any idea why we can't just use ./source/test/testdata/break_rules/line.txt from icu instead of having our proper line.txt in i18npool/source/breakiterator/data/ ?

If it's just to be compatible with older ICU versions, perhaps we would need to be more restrictive about which older versions are accepted (or include ICU statically in LO, but I suppose that would increase LO's binary size?)
Comment 23 Volga 2022-12-03 08:53:36 UTC
So I think it's necessary to replace legacy codes by native calls to ICU to make extensive use of current dependency.
Comment 24 Eike Rathke 2022-12-05 13:51:36 UTC
(In reply to Julien Nabet from comment #22)
> Eike: following last Martin's comment, any idea why we can't just use
> ./source/test/testdata/break_rules/line.txt from icu instead of having our
> proper line.txt in i18npool/source/breakiterator/data/ ?
Our own break rules for some locales emerged because back then the ICU break rules weren't sufficient. I'm all for ditching our own in favour of going with default ICU data instead, if someone could judge whether doing so would actually be a good thing and not break breaks.

(In reply to Volga from comment #23)
> So I think it's necessary to replace legacy codes by native calls to ICU to
> make extensive use of current dependency.
? We do use ICU, just that some locales have defined break rules that override the ICU ones.
Comment 25 Volga 2022-12-05 17:49:52 UTC
(In reply to Eike Rathke from comment #24)
> ? We do use ICU, just that some locales have defined break rules that
> override the ICU ones.
Yes, I mean such break rules should be replaced by new code that calls ICU directly.