Description:
In certain cases ZWJ (U+200D) causes an automatic line break at its position, even when an NBSP is used next to the ZWJ.
Steps to Reproduce:
1. Open the attached ODF file.
2. Resize the frame a bit by dragging its bottom edge.
Actual Results:
When the frame is resized to certain sizes, the Manchu suffix I (U+1873), which follows the ZWJ, is bumped to the start of the next line.
Expected Results:
A character that follows a ZWJ should not be bumped to the start of the next line, even if the ZWJ itself follows a whitespace character.
Reproducible: Always
User Profile Reset: No
Additional Info:
Version: 6.0.0.0.beta1 (x64)
Build ID: 97471ab4eb4db4c487195658631696bb3238656c
CPU threads: 4; OS: Windows 10.0; UI render: default;
Locale: zh-CN (zh_CN); Calc: group threaded
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0
Created attachment 138103: Sample ODT
Created attachment 138104: Screen recording by LICEcap
UAX #14 (http://unicode.org/reports/tr14/) treats the Zero-Width Joiner as follows: it directly prohibits line breaks within joiner sequences (ZWJ), prohibits a break between a zero width joiner and an ideograph, emoji base or emoji modifier (LB8a), and otherwise prohibits a line break between the ZWJ and the preceding character (CM). Notice that there is no general prohibition of a line break *after* the ZWJ, but I suspect that this should depend on the classes of the previous and next characters ("The line breaking behavior of the sequence is that of the base character", as per CM).
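To make the rules discussed above concrete, here is a minimal sketch of a pair check in Python. It is a toy model, not LibreOffice's or ICU's implementation: the function name, the character sets, and the sample string are all assumptions made for illustration. It also assumes the LB8a wording of more recent UAX #14 revisions, which (as far as I can tell) forbids any break after a ZWJ rather than only before ideographs and emoji.

```python
# Toy model of a few UAX #14 pair rules; NOT a complete line breaker.
# All names here are hypothetical and exist only for this illustration.
ZWJ = "\u200d"                          # ZERO WIDTH JOINER (class ZWJ)
GLUE = {"\u00a0", "\u202f"}             # NBSP and NNBSP (class GL)
LB9_EXCLUDED = {" ", "\n", "\r", "\u200b"}  # sample of SP, LF, CR, ZW
BREAKABLE_BEFORE_GLUE = {" ", "-"}      # simplified stand-in for SP, BA, HY


def break_prohibited(text: str, i: int) -> bool:
    """True if these rules forbid a break between text[i-1] and text[i].

    False only means that these few rules are silent; a real line
    breaker would go on to consult the remaining UAX #14 rules.
    """
    if not 0 < i < len(text):
        raise IndexError("i must be an interior offset")
    before, after = text[i - 1], text[i]
    if after == ZWJ and before not in LB9_EXCLUDED:
        return True   # LB9: the ZWJ attaches to the preceding character
    if before == ZWJ:
        return True   # LB8a, assuming the newer "no break after ZWJ" wording
    if before in GLUE:
        return True   # LB12: no break after glue such as NBSP
    if after in GLUE and before not in BREAKABLE_BEFORE_GLUE:
        return True   # LB12a: no break before glue, with a few exceptions
    return False


# Hypothetical sequence: Mongolian letter, NNBSP, ZWJ, Manchu I (U+1873).
sample = "\u1820\u202f\u200d\u1873"
print([break_prohibited(sample, i) for i in range(1, len(sample))])
```

Under these assumptions every interior offset of the sample is reported as non-breakable, which matches the expected result in the report.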
Is it possible to add an exception so that ZWJ prohibits a break next to NBSP?
First, the Zero-Width Joiner character is meant to act on special character sequences that produce connected forms [1]. In such sequences it is not always placed between the connected characters; sometimes it is the last character in the sequence. When it is not adjacent to characters that could form such a sequence, it is just a combining character: no break is allowed between it and the previous character, and the wrapping behaviour after the ZWJ is that of the previous character (i.e., if a line break is normally permitted after the previous character, then a break is also possible after the sequence of that character and ZWJ). As NBSP prohibits breaks before it, it should not be possible to break between ZWJ and NBSP. Based on this, it looks like there is a problem here. I am not confirming it myself because I lack the expertise in this area. Btw: do you possibly want to use ZWNBSP instead of ZWJ?
[1] https://en.wikipedia.org/wiki/Zero-width_joiner
(In reply to Mike Kaganski from comment #5)
> Btw: do you possibly want to use ZWNBSP instead of ZWJ?
No, because in Mongolian/Manchu fonts ZWNBSP does not make the suffix letter join the way NNBSP and ZWJ do.
I confirm this buggy behaviour still exists with LibreOffice 6.0.2.
After reading http://www.unicode.org/reports/tr14/, I understand that more detail on the context is needed. I still think the observed line break behaviour is buggy, at least for the following codepoint combinations:
0064 200D 02DA (d ZWJ ˚) - no break should be allowed on either side of the ZWJ
0067 200D 02F3 (g ZWJ ˳) - same
0077 200D 0237 (w ZWJ ȷ) - same
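For anyone who wants to paste these sequences into a test document, the following Python sketch turns the hex notation used above into actual strings and lists the Unicode names of the characters. The helper names `from_hex` and `describe` are made up for this example.

```python
import unicodedata


def from_hex(codepoints: str) -> str:
    """Build a string from space-separated hex code points."""
    return "".join(chr(int(cp, 16)) for cp in codepoints.split())


def describe(text: str) -> list:
    """Return one 'U+XXXX NAME' entry per character in text."""
    return [
        f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"
        for ch in text
    ]


# The three combinations listed in the comment above.
SEQUENCES = ["0064 200D 02DA", "0067 200D 02F3", "0077 200D 0237"]

for seq in SEQUENCES:
    print(describe(from_hex(seq)))
```

The resulting strings can be copied into a narrow text frame to check whether a break appears on either side of the ZWJ.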
I agree with you. Firefox had already implemented this several months before I found this bug in LibreOffice, so LO should do it too.
Created attachment 141753: Sample file containing Malayalam characters to understand word breaking with Zero Width Joiner
Reproducible in
Version: 6.1.0.0.alpha1+ (x64)
Build ID: a6a38c6de9c18fd1269fc8cfc0e070ef429c8e2f
CPU threads: 4; OS: Windows 10.0; UI render: default;
TinderBox: Win-x86_64@42, Branch:master, Time: 2018-04-28_01:58:12
Locale: en-IN (en_IN); Calc: group
Ramesh popped by on IRC and mentioned that this was working fine in LO 5.2.7; thanks for that piece of information, and for the sample! Based on that, the change could be bibisected to the following range of commits: https://cgit.freedesktop.org/libreoffice/core/log/?qt=range&q=b68ed302830fd1c44212eeb6c23d5a08b7dc97ec..092261ffd497f752c342f1fbdca6e7267e312a21
Of these, "upgrade to ICU 58" is the most likely culprit, especially since the document displays fine in LO 5.4.6 as bundled with Ubuntu 17.10, which comes with ICU 57.1.
Moving to NEW based on comment 11
Even though I haven't managed to get the Sample ODT to display the intended glyphs, this bug still manifests - assuming that the gray lines between characters are ZWJ's. Tested with: Version: 6.1.1.2 Build ID: 5d19a1bfa650b796764388cd8b33a5af1f5baa1b CPU threads: 4; OS: Linux 4.15; UI render: default; VCL: gtk2; Locale: en-GB (en_GB.UTF-8); Calc: group threaded
In response to Comment 11, this worked fine in LO 5.2.7.2 but is broken in 5.3.2.2, so the culprit is somewhere between those two versions, in case that helps. I should say that this is a serious issue for Indic scripts (eg, Devanagari etc). In these scripts, ZWJ is used following a halant (virama) between two consonants, to block the formation of a conjunct form. In Devanagari this generally results in a half-form of the 1st consonant. [See the section "Explicit Half-Consonants" in chapter 12 of the Unicode Standard.] Breaking the line there leaves a half-character at the end of the line, which is invalid. So until this is fixed we have to revert to LO 5.2.7.2 or use another editor.
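As a rough aid for spotting the affected positions in Devanagari text, the sketch below lists every offset that directly follows a "virama + ZWJ" pair, i.e. the places where a line break would strand a half-form at the end of a line. This is only an illustration: the function name is made up, and it checks the Devanagari virama (U+094D) only, not the viramas of other Indic scripts.

```python
VIRAMA = "\u094d"  # DEVANAGARI SIGN VIRAMA (halant)
ZWJ = "\u200d"     # ZERO WIDTH JOINER


def half_form_hazards(text: str) -> list:
    """Offsets that directly follow a 'virama ZWJ' pair inside text.

    A line break at any of these offsets would leave an explicit
    half-form consonant alone at the end of the line.
    """
    return [
        i + 2
        for i in range(len(text) - 2)
        if text[i] == VIRAMA and text[i + 1] == ZWJ
    ]


# KA, VIRAMA, ZWJ, SSA: an explicit half-form of KA followed by SSA.
sample = "\u0915\u094d\u200d\u0937"
print(half_form_hazards(sample))  # prints [3]
```

A correct line breaker should never choose any of the reported offsets as a break opportunity.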
(In reply to Eyal Rozenberg from comment #13)
> Even though I haven't managed to get the Sample ODT to display the intended
> glyphs, this bug still manifests - assuming that the gray lines between
> characters are ZWJ's. Tested with:
>
> Version: 6.1.1.2
> Build ID: 5d19a1bfa650b796764388cd8b33a5af1f5baa1b
> CPU threads: 4; OS: Linux 4.15; UI render: default; VCL: gtk2;
> Locale: en-GB (en_GB.UTF-8); Calc: group threaded
This file uses the Abkai Xanyan fonts to render the text. You can get the fonts from here before reproducing with the sample ODT: http://abkai.net/core/en/manchu/manchu-fonts/
(In reply to ssmithg1 from comment #14)
> In response to Comment 11, this worked fine in LO 5.2.7.2 but is broken in
> 5.3.2.2, so the culprit is somewhere between those two versions, in case
> that helps.
>
> I should say that this is a serious issue for Indic scripts (eg, Devanagari
> etc). In these scripts, ZWJ is used following a halant (virama) between two
> consonants, to block the formation of a conjunct form. In Devanagari this
> generally results in a half-form of the 1st consonant. [See the section
> "Explicit Half-Consonants" in chapter 12 of the Unicode Standard.] Breaking
> the line there leaves a half-character at the end of the line, which is
> invalid. So until this is fixed we have to revert to LO 5.2.7.2 or use
> another editor.
So it seems to me that LibreOffice is missing something, or that some specific Unicode properties have not been handled properly since the new text layout backend was introduced in 5.3.
Dear Volga,
To make sure we're focusing on the bugs that affect our users today, LibreOffice QA is asking bug reporters and confirmers to retest open, confirmed bugs which have not been touched for over a year. There have been thousands of bug fixes and commits since anyone checked on this bug report. During that time, it's possible that the bug has been fixed, or the details of the problem have changed. We'd really appreciate your help in getting confirmation that the bug is still present.
If you have time, please do the following:
Test to see if the bug is still present with the latest version of LibreOffice from https://www.libreoffice.org/download/
If the bug is present, please leave a comment that includes the information from Help - About LibreOffice.
If the bug is NOT present, please set the bug's Status field to RESOLVED-WORKSFORME and leave a comment that includes the information from Help - About LibreOffice.
Please DO NOT:
- Update the version field
- Reply via email (please reply directly on the bug tracker)
- Set the bug's Status field to RESOLVED - FIXED (this status has a particular meaning that is not appropriate in this case)
If you want to do more to help you can test to see if your issue is a REGRESSION. To do so:
1. Download and install oldest version of LibreOffice (usually 3.3 unless your bug pertains to a feature added after 3.3) from https://downloadarchive.documentfoundation.org/libreoffice/old/
2. Test your bug
3. Leave a comment with your results.
4a. If the bug was present with 3.3 - set version to 'inherited from OOo';
4b. If the bug was not present in 3.3 - add 'regression' to keyword
Feel free to come ask questions or to say hello in our QA chat: https://kiwiirc.com/nextclient/irc.freenode.net/#libreoffice-qa
Thank you for helping us make LibreOffice even better for everyone!
Warm Regards,
QA Team
MassPing-UntouchedBug
This is still reproduced with Version: 7.1.2.2 (x64) / LibreOffice Community Build ID: 8a45595d069ef5570103caea1b71cc9d82b2aae4 CPU threads: 4; OS: Windows 10.0 Build 19042; UI render: Skia/Raster; VCL: win Locale: ro-RO (ro_RO); UI: en-US
My understanding of how to fix this bug is that we just need to freshen the line.txt in i18npool/source/breakiterator/data to bring it up to date with ICU again. The one in libo is very old and makes no mention of ZWJ. So I propose we just take the latest and greatest from icu4c/source/data/brkitr/rules/line.txt. I will add a copy from ICU as of this date.
Created attachment 172720: Latest ICU line break iterator rules
Proposed as a replacement for i18npool/source/breakiterator/data/line.txt
That's nice. It would be better if LibreOffice could query the property directly from ICU, or if this text file could be updated directly from Unicode.org when building a new version.
True. But then one gets into the debate of whether libo should carry its own break iterator specifications or just use ICU. This way, it's a quick bug fix which fixes the bug and allows the refactoring to be pushed down the road. But if someone wants to remove the product specific break iterators and revert to ICU, I won't complain.
Eike: following Martin's last comment, any idea why we can't just use ./source/test/testdata/break_rules/line.txt from ICU instead of having our own line.txt in i18npool/source/breakiterator/data/? If it's just to stay compatible with older ICU versions, perhaps we would need to be more restrictive about which older versions are accepted (or include ICU statically in LO, but I suppose that would increase LO's binary size?)
So I think it's necessary to replace the legacy code with native calls to ICU, to make fuller use of the existing dependency.
(In reply to Julien Nabet from comment #22)
> Eike: following last Martin's comment, any idea why we can't just use
> ./source/test/testdata/break_rules/line.txt from icu instead of having our
> proper line.txt in i18npool/source/breakiterator/data/ ?
Our own break rules for some locales emerged because back in that time the ICU break rules weren't sufficient. I'm all for ditching our own in favour of going with default ICU data instead, if someone could judge whether doing so would actually be a good thing and not break breaks..
(In reply to Volga from comment #23)
> So I think it's necessary to replace legacy codes by native calls to ICU to
> make extensive use of current dependency.
? We do use ICU, just that some locales have defined break rules that override the ICU ones.
(In reply to Eike Rathke from comment #24)
> ? We do use ICU, just that some locales have defined break rules that
> override the ICU ones.
Yes, I mean that such break rules should be replaced by new code that calls ICU directly.
Jonathan Clark committed a patch related to this issue. It has been pushed to "master": https://git.libreoffice.org/core/commit/8c063d4e293e76982d4310ddc162b565a9a3c16e tdf#114160 Regression tests for non-breaking ZWJ It will be available in 25.2.0. The patch should be included in the daily builds available at https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More information about daily builds can be found at: https://wiki.documentfoundation.org/Testing_Daily_Builds Affected users are encouraged to test the fix and report feedback.
Is it possible to backport this to the 24.8 release channel?
Backport what? The commit in comment 26 is a build time test to verify the desired behaviour. The underlying changes of https://gerrit.libreoffice.org/c/core/+/166273 and many more commits related to bug 49885 were merged before the 24-8 branch-off.