Bug 57652 - Wrong treatment of Word Joiner (U+2060) in line breaking algorithm
Status: NEW
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer
Version (earliest affected): 3.3.0 release
Hardware: All All
Importance: medium normal
Assignee: Not Assigned
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: Font-Rendering Formatting-Mark
Reported: 2012-11-28 14:31 UTC by Jan_J
Modified: 2023-02-26 10:36 UTC (History)
CC List: 5 users

See Also:
Crash report or crash signature:


Attachments
Improper handling of u+2060 in line breaking algorithm (10.43 KB, application/vnd.oasis.opendocument.text)
2012-11-28 14:31 UTC, Jan_J
different placements of WJ ad SP; none good (12.24 KB, application/vnd.oasis.opendocument.text)
2012-11-28 19:43 UTC, Jan_J

Description Jan_J 2012-11-28 14:31:27 UTC
Created attachment 70734 [details]
Improper handling of u+2060 in line breaking algorithm

There is a discussion whether NBSP, u+00a0, should have elastic width in full justification mode in Writer. The most significant voices vote for keeping it rather fixed, for compatibility with some popular third-party software.

However, the lack of an elastic non-breaking space in justified running text remains a problem in many DTP tasks.

Instead of changing the current behaviour of the u+00a0 character, it may be worth considering u+2060 plus u+0020 (zero-width-word-joiner plus ordinary space). Unfortunately, the algorithm currently implemented for placing line breaks inside a paragraph does not handle this situation correctly.

If in the attached file one tries to add some spaces to the second line (no matter whether inside or outside a word, except in the first word), additional material moves from the third line back to the second. That is surprising, since a line-by-line breaking algorithm should fill each row as fully as possible.
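
The expectation behind this can be illustrated with a minimal sketch of first-fit line filling. This is not the Writer implementation; the function name `greedy_lines` is hypothetical and the width is simply counted in characters, as if the font were monospaced. Under these assumptions, adding material to a line can only push text forward to later lines, never pull text back from them.

```python
def greedy_lines(words, width):
    """First-fit line filling: put as many words on each line as will fit.

    Assumption for illustration only: every character is one unit wide.
    """
    lines, current = [], ""
    for word in words:
        candidate = word if not current else current + " " + word
        # An overlong word still gets a line of its own.
        if len(candidate) <= width or not current:
            current = candidate
        else:
            lines.append(current)
            current = word
    if current:
        lines.append(current)
    return lines


before = greedy_lines("aaa bbb ccc ddd eee fff".split(), 7)
# ['aaa bbb', 'ccc ddd', 'eee fff']

# Insert extra material ("x") into the second line.
after = greedy_lines("aaa bbb ccc x ddd eee fff".split(), 7)
# ['aaa bbb', 'ccc x', 'ddd eee', 'fff']
```

In the sketch, the second line holds less of the original text after the insertion ("ddd" moves down). The attached file shows the opposite: material moves up from the third line.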

That means that using ZWJs paradoxically disturbs the normal calculation of free space in a row. After removing the ZWJs from the text, the line-break points behave as usual.

This is probably because the u+2060 character is considered only in the final stage of the algorithm loop, which does not take into account the possibility of a following space.

See i18npool/source/breakiterator/breakiterator_unicode.cxx file in the source tree, method BreakIterator_Unicode::getLineBreak(), lines 405--420.

The problem has been well known to me for years and has been discussed many times on user forums.

It should also be mentioned that, if the line-break algorithm needs to be revised anyway, it may be worth improving it in other ways as well, e.g. by adding full paragraph-level justification; see https://bugs.freedesktop.org/show_bug.cgi?id=38159.
Comment 1 Roman Eisele 2012-11-28 16:26:12 UTC
Thank you for your bug report!

Just a hint (not for you, but for other readers ;-):
> There is a discussion whether NBSP, u+00a0, should have elastic width in
> full justification mode in Writer.
See bug 41652 - “‘NO-BREAK SPACE’ (U+00A0) interpreted as fixed-width space”
Comment 2 Roman Eisele 2012-11-28 17:16:02 UTC
Just to prevent misunderstandings: we are talking about U+2060, right? I need to ask this first because you call the character “zero-width word joiner”, but AFAIK Unicode calls U+2060 just “WORD JOINER” (usual abbreviation: “WJ”). So we must be careful not to confuse it with U+200D (“ZERO WIDTH JOINER”, “ZWJ”) ;-)


> If in the attached file one tries to add some spaces to the second line (no
> matter: inside or out of a word, except the first word), it will result
> in moving additional material from the third line back to the second. That's
> surprising, since line breaks (in line-by-line mode algorithm) should allow
> for maximal filling of a row.

REPRODUCIBLE with the attached sample file and any LibreOffice version since 3.3.0 (tested on Mac OS X 10.6.8 (Intel)). → Set Version to 3.3.0 (first known version which contains the problem).

> That means, that using ZWJ-s paradoxically disturbs the normal process of
> calculating free place inside a row.

It seems so.

> After removing ZWJs from the text, the line-break points behave as usual.

I can confirm this.

*

What we need now is some authoritative reference which confirms your fundamental assumption that U+2060 plus U+0020 (word joiner plus ordinary space) *must* really work as expected in your description, i.e. that U+2060 plus U+0020 together should act like an elastic NBSP. I have to confess that I am not yet completely sure about this assumption ;-)

I don’t have an up-to-date copy of the printed Unicode manual right here; and the short Unicode charts, available from
   http://www.unicode.org/charts/PDF/U2000.pdf
just say about U+2060:
  “2060 WORD JOINER
   * commonly abbreviated WJ
   * a zero width non-breaking space (only)
   * intended for disambiguation of functions
     for byte order mark
   → FEFF zero width no-break space”
The formulation “a zero width non-breaking space (*only*)” (emphasis of *only* by me) and the hint “intended for disambiguation of functions for byte order mark” make me doubt whether U+2060 is really intended for the usage you have in mind.

And http://decodeunicode.org/u+2060 says only:
   “A zero width non-breaking space (words should not break at linebreak)”

So, could you please cite some authoritative reference which makes clear that U+2060 is really supposed to work as expected by you, i.e. that U+2060 plus U+0020 together should act like an elastic NBSP?

No offence -- maybe I am completely wrong, and my doubts are ridiculous --
but we really need to clarify this before we can proceed with this issue!

Thank you very much for your answer ...
Comment 3 Jan_J 2012-11-28 19:14:27 UTC
Thank you for your interest.

Certainly, I had the WJ (U+2060) character in mind. My wrong abbreviation of the name might have caused misunderstandings.

In the Line Break Chart, http://www.unicode.org/Public/UNIDATA/auxiliary/LineBreakTest.html both the WJ + SP and the SP + WJ combinations are described as “no break”.

Although the reference contains informative, non-normative material, it refers to Unicode rules 11.01 and 11.02 that ultimately prohibit line breaks before and after WJ.

In Unicode Standard Annex #14: Unicode Line Breaking Algorithm, http://www.unicode.org/reports/tr14/ we read:
“WJ 	Word Joiner 	WJ 	Prohibit line breaks before and after ”
and 
“The word joiner character is the preferred choice for an invisible character to keep other characters together that would otherwise be split across the line at a direct break.”
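
A heavily simplified sketch of what these rules imply for SP and WJ, assuming every other character behaves like an ordinary letter. It covers only UAX #14 rules LB7 (no break before a space), LB11 (no break before or after WJ) and LB18 (break after spaces); the function name `break_opportunities` is hypothetical and this is not the LibreOffice or ICU code.

```python
WJ = "\u2060"  # WORD JOINER
SP = " "       # SPACE


def break_opportunities(text):
    """Return every index i where a line break is allowed before text[i].

    Simplification: only SP and WJ get special treatment; all other
    characters are treated as letters, with no break between them.
    """
    result = []
    for i in range(1, len(text)):
        before, after = text[i - 1], text[i]
        if after == SP:            # LB7: never break before a space
            continue
        if WJ in (before, after):  # LB11: never break before or after WJ
            continue
        if before == SP:           # LB18: break allowed after spaces
            result.append(i)
    return result


plain = break_opportunities("ab cd")                      # [3]
sp_wj = break_opportunities("ab" + SP + WJ + "cd")        # []
wj_sp = break_opportunities("ab" + WJ + SP + "cd")        # [4]
bracketed = break_opportunities("ab" + WJ + SP + WJ + "cd")  # []
```

Under these assumptions, SP + WJ and WJ + SP + WJ leave no break opportunity, while WJ + SP alone still allows a break after the space.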
Comment 4 Jan_J 2012-11-28 19:43:28 UTC
Created attachment 70743 [details]
different placements of WJ ad SP; none good

Although this case may be regarded as putting a line break AFTER a space, keep in mind that other combinations of adjacent WJ and SP do not prevent breaking at all...
Comment 5 Roman Eisele 2012-11-28 19:52:02 UTC
OK, thank you very much for the references -- they are exactly what was missing from this report before!

So we can set the Status of this bug report to NEW (= confirmed).
Adapting the Summary to the correct name for U+2060.

A final hint:
> Instead of changing the current behaviour of the u+00a0 character, it may
> be worth considering u+2060 plus u+0020 ([...] word-joiner plus
> ordinary space).
Certainly; but one solution does not necessarily invalidate the other. I.e., when the present bug gets fixed, and WJ + SP (+ WJ) act like an elastic NBSP, we can still discuss if the behaviour of U+00A0 should be changed, too (or better: if an option to choose the behaviour of U+00A0, elastic or fixed, should be added).
So bug 41652 is still a valid enhancement request and essentially independent of the present issue.
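
To make the “elastic or fixed” distinction concrete, here is a minimal sketch of how the extra width of a justified line could be shared out. It is not Writer's justification code; the function `extra_per_gap` and its parameters are hypothetical, and widths are plain numbers in arbitrary units.

```python
SP = " "
NBSP = "\u00a0"  # NO-BREAK SPACE


def extra_per_gap(line, natural_width, target_width, elastic=(SP,)):
    """Return the extra width each elastic character receives.

    `elastic` lists the characters allowed to stretch; by default only
    the ordinary space stretches, which models a fixed-width U+00A0.
    """
    gaps = sum(line.count(ch) for ch in elastic)
    if gaps == 0 or target_width <= natural_width:
        return 0.0
    return (target_width - natural_width) / gaps


line = "a b" + NBSP + "c d"  # two ordinary spaces, one NBSP

fixed_nbsp = extra_per_gap(line, 70, 80)                         # 5.0
elastic_nbsp = extra_per_gap(line, 70, 80, elastic=(SP, NBSP))   # 10 / 3
```

With a fixed U+00A0, the two ordinary spaces absorb all of the slack; with an elastic one, the slack is shared by all three gaps.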
Comment 6 stfhell 2012-11-30 00:06:42 UTC
Like Roman, I think that handling of U+00A0 and U+2060 are not really related, though U+2060 indeed provides a good way for someone who wishes to keep U+00A0 Word-compatible _and_ wishes to have a flexible-width non-breaking space. But getting a real, working WJ shouldn't prevent users from using a real Unicode nonbreaking space.

I was trying to find out how one could use WJ in a Unicode compliant way with spaces. Of course, strictly speaking, one shouldn't use it with SP: "The word joiner can be used to prevent line breaking with other characters that do not have nonbreaking variants, such as U+2009 thin space or U+2015 horizontal bar, by bracketing the character." (Unicode 6.2, p. 546) SP U+0020 actually has a nonbreaking variant (U+00A0)... (But U+2009 also has: U+202F - so the Unicode Consortium might have become a little confused about spaces, or they might have created U+202F more with Asian writing systems in mind.)

The quoted passage seems to indicate that WJ should "bracket" a normally breaking space to get a nonbreaking variant:

WJ + U+2009 + WJ

On the other hand, SP has a line-breaking behaviour that differs from other spaces: "In particular, when NO-BREAK SPACE follows SPACE, there is a break opportunity after the SPACE and the NO-BREAK SPACE will go as visible space onto the next line. [...] When SPACE follows NO-BREAK SPACE, there is no break, because there never is a break in front of SPACE." (http://www.unicode.org/reports/tr14/tr14-30.html#GL) So it seems that

SP + WJ

would be enough in the case of SP U+0020.

If compatibility with old-fashioned word processors like Word (or, of course, LibreOffice/OpenOffice) is what makes using U+00A0 in a Unicode compliant way difficult, WJ is no practical solution either. Word displays it as a box character, and treats it exactly like that (as a "break wherever you please" box). But of course, LO should correctly support WJ anyway, wherever people use it.
Comment 7 stfhell 2012-11-30 00:18:47 UTC
When I was playing around with your attached test files, I noticed that WJ is not completely ignored by LibreOffice. There are cases where it keeps WJ-glued words together. The behaviour reminds me of something similar that I noticed with hyphenation, see Bug 56392: LO can hyphenate words correctly, but sometimes it just doesn't. Inserting a character in a different place or using a different font can suddenly trigger the hyphenation. It looks quite like the strange behaviour in your files, where _adding_ another space on line 2 suddenly makes LO put words from line 3 into line 2 - _adding_ a character in line 2 makes LO fit _more_ instead of _fewer_ characters into the line...

Probably the 2 issues are related.
Comment 8 Jan_J 2012-11-30 08:48:58 UTC
As far as I understand, when the line is not maximally packed with text, the word joiner behaves correctly. The problems start when the line is nearly full; then the calculations go wrong.
In the breakiterator source code the u+2060 character is referred to only once, as a “special case” at the final stage of analysis. I suspect this piece was programmed as a kind of patch, but I am afraid it does not do its job in the general case.
Comment 9 stfhell 2012-11-30 11:56:38 UTC
(In reply to comment #8)
> As far as I understand, when the line is not maximally packed with text,
> word joiner behave correctly. The problems start near full density of text;
> then calculations go wrong.

I don't think so. The keep-together "Cra⁠ s" in your test file seems stable, even if you add more text to the end, so that it will be justified. It seems that the presence of some characters in the paragraph (here WJ, in Bug 56392 the quotation mark) prevents the line breaker from realizing some line break opportunities.

> In the breakiterator source code the u+2060 character is referred only once,
> as “special case” at the final stage of analysis. I suspect this piece was
> programmed as a kind of patch, but I am afraid it does not play its role in
> general case.

Are you sure that BreakIterator_Unicode::getLineBreak() is really used here? There are so many parallel solutions for the same things in LO, maybe getLineBreak() is only used with CTL, Asian writing systems or again something else. The function looks strange to me anyway, as it sets the Boolean "GlueSpace=sal_True" before a "while (GlueSpace)" loop without any "return", "break" or "GlueSpace=sal_False". But I haven't looked too closely.

WJ could also be covered in the database- or rule-driven approach (see file i18npool/source/breakiterator/data/line.txt).
Comment 10 Roman Eisele 2012-12-01 09:33:39 UTC
(In reply to comment #9)
> It seems that the presence of some characters in the paragraph (here WJ,
> in Bug 56392 the quotation mark) prevents the line breaker from realizing
> some line break opportunities.

I had the same impression when typesetting a big proceedings volume with LibO 3.4.x. When a line contained some special characters -- e.g., IIRC, U+2009 and U+200A, which I had to use quite often to fulfill some typographic requirements --, the line-breaking algorithm seemed often (not always?) confused and missed some good line-breaking opportunities. But I have never succeeded in tracking this down to an easy test case, therefore this is just a hint ...
Comment 11 Simo Kaupinmäki 2014-06-14 15:42:22 UTC
See bug 80027 about how a word joiner that directly follows a regular space fails to prevent a line break from happening. The examples there look rather different, but these issues may well be related.

Bug 80000 is about missing a valid line break opportunity within a word that is followed by a no-break space (U+00A0) or non-breaking hyphen (U+2011).
Comment 12 Alex Thurgood 2015-01-03 17:39:10 UTC Comment hidden (no-value)
Comment 13 QA Administrators 2016-01-17 20:04:16 UTC Comment hidden (obsolete)
Comment 14 QA Administrators 2017-03-06 14:21:52 UTC Comment hidden (obsolete)
Comment 15 QA Administrators 2019-12-03 14:08:15 UTC Comment hidden (obsolete)
Comment 16 QA Administrators 2021-12-03 04:27:59 UTC Comment hidden (obsolete)
Comment 17 Martin Sourada 2023-02-26 10:36:14 UTC
I can confirm this is still present in 
Version: 7.5.0.1 (X86_64) / LibreOffice Community
Build ID: 77cd3d7ad4445740a0c6cf977992dafd8ebad8df
CPU threads: 8; OS: Linux 6.1; UI render: default; VCL: gtk3
Locale: cs-CZ (cs_CZ.UTF-8); UI: cs-CZ
Calc: threaded