In certain cases, ZWJ (U+200D) causes an automatic line break at its position, even if an NBSP is used next to the ZWJ.
Steps to Reproduce:
1. Open the attached ODF file
2. Resize the frame slightly by dragging its bottom edge
When the frame is resized to a certain size, the Manchu suffix I (U+1873), which follows the ZWJ, is bumped to the start of the next line.
A character that follows ZWJ shouldn't be bumped to the start of the next line, even if the ZWJ itself follows a whitespace character.
User Profile Reset: No
Version: 188.8.131.52.beta1 (x64)
CPU threads: 4; OS: Windows 10.0; UI render: default;
Locale: zh-CN (zh_CN); Calc: group threaded
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0
Created attachment 138103 [details]
Created attachment 138104 [details]
Screen recording by LICEcap
UAX#14 (http://unicode.org/reports/tr14/) directly prohibits line breaks within joiner sequences for the Zero-Width Joiner (ZWJ), prohibits a break between a zero width joiner and an ideograph, emoji base or emoji modifier (LB8a), and otherwise prohibits a line break between the character and the preceding character (CM). Notice that there's no general prohibition of a line break *after* it, but I suspect that this should depend on the previous and next character classes ("The line breaking behavior of the sequence is that of the base character", as per CM).
Is it possible to add an exception so that ZWJ prohibits a break next to NBSP?
First, the Zero-Width Joiner is supposed to act on special character sequences that produce connected forms. In such sequences, it is not always placed between the connected characters; sometimes it's the last character in the sequence. When it is not adjacent to characters that might form such sequences, it is just a combining character: no break is allowed between it and the previous character, and the wrapping behaviour after the ZWJ is that of the previous character (i.e., if a line break is normally permitted after the previous character, then a break is also possible after the sequence of that character and ZWJ).
As NBSP prohibits breaks before it, it should not be possible to break between ZWJ and NBSP. Based on this, it looks like there is a problem here. I am not confirming it myself because I lack the expertise in this area.
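To make the reasoning above concrete, here is a rough Python sketch of a handful of UAX #14 rules (LB7, LB9, LB10, LB12, LB12a, LB18 and the "no break between letters" default). It is only a toy: the class table is hand-written, every character other than space, NBSP, NNBSP and ZWJ is assumed to behave like an ordinary letter, and the names `classify` and `break_opportunities` are made up for this illustration. It is not how LibreOffice or ICU implements line breaking.

```python
# Toy model of a handful of UAX #14 rules; the class table below is hand-written
# for this example and is not the real Unicode Line_Break property data.
ZWJ = "\u200d"
GLUE = {"\u00a0", "\u202f"}  # NBSP and NNBSP, both line break class GL


def classify(ch):
    if ch == " ":
        return "SP"
    if ch in GLUE:
        return "GL"
    if ch == ZWJ:
        return "ZWJ"
    return "AL"  # assumption: every other character acts like an ordinary letter


def break_opportunities(text):
    """Return each index i where a line break is allowed before text[i]."""
    classes = []   # resolved class of each character
    attached = []  # True if the character is glued to the one before it (LB9)
    for ch in text:
        cls = classify(ch)
        if cls == "ZWJ":
            if classes and classes[-1] != "SP":
                classes.append(classes[-1])  # LB9: behave like the base character
                attached.append(True)
            else:
                classes.append("AL")         # LB10: a stray joiner acts like AL
                attached.append(False)
        else:
            classes.append(cls)
            attached.append(False)

    allowed = []
    for i in range(1, len(text)):
        prev, cur = classes[i - 1], classes[i]
        if attached[i]:
            continue  # LB9: never break between a base and its ZWJ
        if cur == "SP":
            continue  # LB7: never break before a space
        if prev == "GL":
            continue  # LB12: never break after glue
        if cur == "GL" and prev != "SP":
            continue  # LB12a: never break before glue, except after a space
        if prev == "SP":
            allowed.append(i)  # LB18: break after spaces
        # otherwise letter next to letter: no break
    return allowed


# Two letters, NNBSP, ZWJ, Manchu I (U+1873), a space, two letters.
sample = "ab\u202f\u200d\u1873 cd"
print(break_opportunities(sample))
```

Under these simplified rules the only break opportunity in the sample is after the ordinary space; nothing is allowed around the NNBSP + ZWJ pair, which matches the expectation stated in the report.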
Btw: do you possibly want to use ZWNBSP instead of ZWJ?
(In reply to Mike Kaganski from comment #5)
> Btw: do you possibly want to use ZWNBSP instead of ZWJ?
No, because in Mongolian/Manchu fonts ZWNBSP doesn't make the suffix letter join the way NNBSP and ZWJ do.
I confirm this buggy behaviour still exists with LibreOffice 6.0.2.
After reading http://www.unicode.org/reports/tr14/, I understood that more detail on the context is needed. I still think that the observed line break behaviour is buggy at least for the following codepoint combinations:
0064 200D 02DA (d ZWJ ˚) - no break should be allowed on any side of ZWJ
0067 200D 02F3 (g ZWJ ˳) - same
0077 200D 0237 (w ZWJ ȷ) - same
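For anyone who wants to build test strings from the code point lists above, a small Python sketch follows. It uses only the standard `unicodedata` module; the helper names `decode` and `describe` are invented for this example.

```python
import unicodedata

# The three code point sequences listed in this comment.
SAMPLES = ["0064 200D 02DA", "0067 200D 02F3", "0077 200D 0237"]


def decode(hex_sequence):
    """Turn a space-separated list of hex code points into a string."""
    return "".join(chr(int(cp, 16)) for cp in hex_sequence.split())


def describe(text):
    """List each character as 'U+XXXX NAME' for easy inspection."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in text]


for hex_sequence in SAMPLES:
    text = decode(hex_sequence)
    print(hex_sequence, "->", describe(text))
```

Pasting the decoded strings into a narrow frame should make it easy to see whether a break appears on either side of the ZWJ.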
I agree with you. I found that Firefox had already implemented this several months before I found this bug in LibreOffice, so LO should do it as well.
Created attachment 141753 [details]
Sample file containing malayalam characters to understand word breaking with Zero Width Joiner
Reproducible in Version: 184.108.40.206.alpha1+ (x64) Build ID: a6a38c6de9c18fd1269fc8cfc0e070ef429c8e2f CPU threads: 4; OS: Windows 10.0; UI render: default; TinderBox: Win-x86_64@42, Branch:master, Time: 2018-04-28_01:58:12 Locale: en-IN (en_IN); Calc: group
Ramesh popped by on IRC, and mentioned this was working fine in LO 5.2.7, thanks for that piece of information, and for the sample!
Based on that, the change could be bibisected to the following range of commits:
Of which "upgrade to ICU 58" is the most likely culprit, especially since the document displays fine in LO 5.4.6 bundled with Ubuntu 17.10, which comes with ICU 57.1.
Moving to NEW based on comment 11
Even though I haven't managed to get the Sample ODT to display the intended glyphs, this bug still manifests - assuming that the gray lines between characters are ZWJ's. Tested with:
Build ID: 5d19a1bfa650b796764388cd8b33a5af1f5baa1b
CPU threads: 4; OS: Linux 4.15; UI render: default; VCL: gtk2;
Locale: en-GB (en_GB.UTF-8); Calc: group threaded
In response to Comment 11, this worked fine in LO 220.127.116.11 but is broken in 18.104.22.168, so the culprit is somewhere between those two versions, in case that helps.
I should say that this is a serious issue for Indic scripts (eg, Devanagari etc). In these scripts, ZWJ is used following a halant (virama) between two consonants, to block the formation of a conjunct form. In Devanagari this generally results in a half-form of the 1st consonant. [See the section "Explicit Half-Consonants" in chapter 12 of the Unicode Standard.] Breaking the line there leaves a half-character at the end of the line, which is invalid. So until this is fixed we have to revert to LO 22.214.171.124 or use another editor.
(In reply to Eyal Rozenberg from comment #13)
> Even though I haven't managed to get the Sample ODT to display the intended
> glyphs, this bug still manifests - assuming that the gray lines between
> characters are ZWJ's. Tested with:
> Version: 126.96.36.199
> Build ID: 5d19a1bfa650b796764388cd8b33a5af1f5baa1b
> CPU threads: 4; OS: Linux 4.15; UI render: default; VCL: gtk2;
> Locale: en-GB (en_GB.UTF-8); Calc: group threaded
This file uses the Abkai Xanyan fonts to render the text; you can get the fonts from here before reproducing with the sample ODT:
(In reply to ssmithg1 from comment #14)
> In response to Comment 11, this worked fine in LO 188.8.131.52 but is broken in
> 184.108.40.206, so the culprit is somewhere between those two versions, in case
> that helps.
> I should say that this is a serious issue for Indic scripts (eg, Devanagari
> etc). In these scripts, ZWJ is used following a halant (virama) between two
> consonants, to block the formation of a conjunct form. In Devanagari this
> generally results in a half-form of the 1st consonant. [See the section
> "Explicit Half-Consonants" in chapter 12 of the Unicode Standard.] Breaking
> the line there leaves a half-character at the end of the line, which is
> invalid. So until this is fixed we have to revert to LO 220.127.116.11 or use
> another editor.
So it seems to me that LibreOffice is missing something, or that some specific Unicode properties have not been handled properly since the new text layout backend was introduced in 5.3.
To make sure we're focusing on the bugs that affect our users today, LibreOffice QA is asking bug reporters and confirmers to retest open, confirmed bugs which have not been touched for over a year.
There have been thousands of bug fixes and commits since anyone checked on this bug report. During that time, it's possible that the bug has been fixed, or the details of the problem have changed. We'd really appreciate your help in getting confirmation that the bug is still present.
If you have time, please do the following:
Test to see if the bug is still present with the latest version of LibreOffice from https://www.libreoffice.org/download/
If the bug is present, please leave a comment that includes the information from Help - About LibreOffice.
If the bug is NOT present, please set the bug's Status field to RESOLVED-WORKSFORME and leave a comment that includes the information from Help - About LibreOffice.
Please DO NOT
Update the version field
Reply via email (please reply directly on the bug tracker)
Set the bug's Status field to RESOLVED - FIXED (this status has a particular meaning that is not appropriate in this case)
If you want to do more to help you can test to see if your issue is a REGRESSION. To do so:
1. Download and install oldest version of LibreOffice (usually 3.3 unless your bug pertains to a feature added after 3.3) from https://downloadarchive.documentfoundation.org/libreoffice/old/
2. Test your bug
3. Leave a comment with your results.
4a. If the bug was present with 3.3 - set version to 'inherited from OOo';
4b. If the bug was not present in 3.3 - add 'regression' to keyword
Feel free to come ask questions or to say hello in our QA chat: https://kiwiirc.com/nextclient/irc.freenode.net/#libreoffice-qa
Thank you for helping us make LibreOffice even better for everyone!
This is still reproduced with
Version: 18.104.22.168 (x64) / LibreOffice Community
Build ID: 8a45595d069ef5570103caea1b71cc9d82b2aae4
CPU threads: 4; OS: Windows 10.0 Build 19042; UI render: Skia/Raster; VCL: win
Locale: ro-RO (ro_RO); UI: en-US
My understanding of how to fix this bug is that we just need to freshen the line.txt in i18npool/source/breakiterator/data to bring it up to date with ICU again. The one in libo is very old and makes no mention of ZWJ.
So I propose we just take the latest and greatest from icu4c/source/data/brkitr/rules/line.txt. I will add a copy from ICU as of this date.
Created attachment 172720 [details]
Latest ICU line break iterator rules
Propose for this to replace i18npool/source/breakiterator/data/line.txt
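A quick way to check whether a given copy of a rules file refers to ZWJ at all could look like the Python sketch below. The two excerpts are invented for illustration and are not copied from either line.txt; the helper name `mentions_zwj` and the assumption that `#` starts a comment are mine as well.

```python
import re

# Matches the symbolic name or the code point written as 200D (any case).
ZWJ_PATTERN = re.compile(r"ZWJ|200D", re.IGNORECASE)


def mentions_zwj(rules_text):
    """Return True if any non-comment part of the rule text refers to ZWJ."""
    for line in rules_text.splitlines():
        code = line.split("#", 1)[0]  # assumption: '#' starts a comment
        if ZWJ_PATTERN.search(code):
            return True
    return False


# Hypothetical excerpts, written for this example only.
old_excerpt = "$AL = [:LineBreak = Alphabetic:];\n# no joiner handling here: ZWJ\n"
new_excerpt = "$ZWJ = [\\p{Line_break=ZWJ}];\n"

print(mentions_zwj(old_excerpt), mentions_zwj(new_excerpt))
```

To check real files, read each one with `open(path, encoding="utf-8").read()` and pass the text to `mentions_zwj`.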
That's nice. It would be better if LibreOffice had the ability to query the property directly from ICU, or could update this text file directly from Unicode.org when compiling a new version.
True. But then one gets into the debate of whether libo should carry its own break iterator specifications or just use ICU. This way, it's a quick bug fix which fixes the bug and allows the refactoring to be pushed down the road. But if someone wants to remove the product specific break iterators and revert to ICU, I won't complain.
Eike: following Martin's last comment, any idea why we can't just use ./source/test/testdata/break_rules/line.txt from icu instead of having our own line.txt in i18npool/source/breakiterator/data/ ?
If it's just to be compatible with older ICU versions, perhaps we would need to be more restrictive about which older versions are accepted (or include ICU statically in LO, but I suppose that would increase the LO binary size?)
So I think it's necessary to replace the legacy code with native calls to ICU, to make full use of the current dependency.
(In reply to Julien Nabet from comment #22)
> Eike: following last Martin's comment, any idea why we can't just use
> ./source/test/testdata/break_rules/line.txt from icu instead of having our
> proper line.txt in i18npool/source/breakiterator/data/ ?
Our own break rules for some locales emerged because back in that time the ICU break rules weren't sufficient. I'm all for ditching our own in favour of going with default ICU data instead, if someone could judge whether doing so would actually be a good thing and not break breaks..
(In reply to Volga from comment #23)
> So I think it's necessary to replace legacy codes by native calls to ICU to
> make extensive use of current dependency.
? We do use ICU, just that some locales have defined break rules that override the ICU ones.
(In reply to Eike Rathke from comment #24)
> ? We do use ICU, just that some locales have defined break rules that
> override the ICU ones.
Yes, I mean that such break rules should be replaced by new code that calls ICU directly.