148257 – Need ability to explicitly set the language/language-group of a piece of text

Status:	NEW

Alias:	None

Product:	LibreOffice
Classification:	Unclassified
Component:	Writer
Version: (earliest affected)	Inherited From OOo
Hardware:	All All

Importance:	medium normal
Assignee:	Not Assigned

URL:
Whiteboard:
Keywords:	needsDevAdvice

Depends on:
Blocks:	RTL-UI 132000 Language-Grouping 162502 Script-Assignment

Reported:	2022-03-29 20:08 UTC by Eyal Rozenberg
Modified:	2025-04-02 20:19 UTC
CC List:	8 users

See Also:	146928 151215 92655 154795 106307 162331 129038 163082
Crash report or crash signature:

Attachments
Document illustrating the issue (11.33 KB, application/vnd.oasis.opendocument.text) 2022-03-29 20:14 UTC, Eyal Rozenberg

Description Eyal Rozenberg 2022-03-29 20:08:07 UTC

One of the problems which most authors of mixed RTL-LTR documents have likely faced is with LO's automatic choice of how to render numbers: You have a document with some RTL text in font f1, and LTR text in font f2. Now you type some number somewhere, or a 24H-format time (e.g. 12:35), or other such sequence of characters: The _font_ chosen for it will correspond to the "language group" (Latin, CTL, Asian - the three groups LO defines) it is determined to be in, which in turn will be chosen based on whether it is set in an RTL or LTR text run.

This is sometimes, and perhaps often, not what you want. Unfortunately - there is no way to force which language-group / which language / which language-group font the number or character sequence is assigned.

... ok, that's not exactly correct. If you right-click and choose "Character...", you get a dialog where you _can_ set the language; but - you can't set the language-group, only set the language _within_ the language group.

That in itself seems inconsistent to me. I wonder if the underlying functionality is missing or whether it's just the UI. At any rate, we _should_ be able to force the language group too.

As for implications, such as how that information is recorded, whether it's part of the character style or not - let it be the same as for the intra-group language.

Comment 1 Eyal Rozenberg 2022-03-29 20:14:14 UTC

Created attachment 179195
Document illustrating the issue

This document merely illustrates the direction -> language group (-> language) -> font automatic choice by LO.
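For readers without the attachment at hand, the chain described above can be modelled roughly as follows. This is an illustrative Python sketch of the behaviour as reported in the description, not LibreOffice code; the class, the function names and the font name "f3" are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class StyleFonts:
    """The three per-group fonts a style carries (names are made up)."""
    western: str
    asian: str
    ctl: str


def group_for_neutral_run(run_is_rtl: bool) -> str:
    """Digits and ':' carry no script of their own, so, per the report,
    the direction of the surrounding run decides the language group."""
    return "ctl" if run_is_rtl else "western"


def font_for_neutral_run(fonts: StyleFonts, run_is_rtl: bool) -> str:
    """Look up the font of whichever group the direction selected."""
    return getattr(fonts, group_for_neutral_run(run_is_rtl))


# f1 is the RTL (CTL) font and f2 the LTR (Western) font, as in the description.
fonts = StyleFonts(western="f2", asian="f3", ctl="f1")
print(font_for_neutral_run(fonts, run_is_rtl=True))   # "12:35" inside an RTL run
print(font_for_neutral_run(fonts, run_is_rtl=False))  # "12:35" inside an LTR run
```

The point of the sketch is that the text itself never enters the decision: the same characters get a different font purely because the direction of the run differs.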

Comment 2 Heiko Tietze 2022-03-30 07:35:26 UTC

The only issue for me is when I select the number/date - as it is RTL I cannot mark from left. But changing the language to None (we have a section in the status bar to quickly reach language options) does the trick.

The language group is maybe only a virtual thing, meaning it exists just in the UI; I haven't checked the ODF. Although the idea to get rid of it was rejected in bug 146910, it was at least worth discussing.

Possible solution to the number problem might be to add this to the AutoCorrect options as "[ ] Use 'None' for language in case of numbers".

Wonder how CJK people deal with the problem.
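To make the AutoCorrect idea concrete, the detection half could look something like the sketch below. It is only an assumption of how "numbers" might be recognised: the pattern, the function name and the literal "None" label are hypothetical, and nothing here shows that switching the language to None would actually change the font.

```python
import re

# Hypothetical pattern: digit groups joined by common separators,
# e.g. "42", "12:35", "2022-03-29", "3.14".
NUMBER_LIKE = re.compile(r"\d+(?:[.,:/-]\d+)*")


def language_for_token(token: str, current_language: str) -> str:
    """Return "None" for number-like tokens, else keep the current language."""
    if NUMBER_LIKE.fullmatch(token):
        return "None"
    return current_language


for token in ("12:35", "2022-03-29", "abc"):
    print(token, "->", language_for_token(token, "he-IL"))
```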

Comment 3 Eyal Rozenberg 2022-03-30 08:04:19 UTC

(In reply to Heiko Tietze from comment #2)
> The only issue for me is when I select the number/date - as it is RTL I
> cannot mark from left.

... but that would be a whole different bug page, about selection. That is annoying, actually; would you open a separate bug about it? Anyway, to be clear, this bug is only about language+font selection, and especially the font.

> But changing the language to None (we have a section
> in the status bar to quickly reach language options) does the trick.

I don't think so. When I did this, it set the language in all groups to "none", but the font didn't change. Which means it probably didn't change the language-group selection either.

> The language group is maybe only a virtual thing, meaning it exists just in
> the UI; I haven't checked the ODF. Although the idea to get rid of it was
> rejected in bug 146910, it was at least worth discussing.

Ok. But - I'm not taking a position on that matter here.

Comment 4 Eyal Rozenberg 2022-03-30 08:10:24 UTC

Moreover, even if manually changing the language to "none" helped, that wouldn't resolve the bug, because:

* People would not easily figure out that's what they need to do.
* No right-click menu UI for this.
* There are parity issues with MS Office for .doc and .docx document importation.
* Autocorrect cannot be assumed to be applied by default, and is anyway something optional, not to be relied on.

Comment 5 Heiko Tietze 2022-03-30 08:30:19 UTC

(In reply to Eyal Rozenberg from comment #3)
> ...this bug is only about language+font selection

What exactly is the use case / scenario then? Besides convenience.

Comment 6 Eyal Rozenberg 2022-03-30 09:07:05 UTC

(In reply to Heiko Tietze from comment #5)
> What exactly is the use case / scenario then? Besides convenience.

I thought the title of the bug made it clear...

It's more than about a use-case, it's a matter of principle: 

* It does not make sense that setting a direction also sets the language.
* It does not make sense, and is not tolerable, that changing the direction of a run of text changes its font.

In the attached document, the 12:35 should not appear in the CTL font. And at the very least, it should be easy to prevent that from happening, and easy to indicate it's in English rather than Hebrew (which would make it use the Western language group font).

At the moment, it just can't be done: You can't say it's in English, and you can't set its font to the Western languages group font. (You could change the CTL font to the Western language font but that's a hack, not a solution.)


(I'll also say that it's not obvious what the font selection logic for "None"-language text should be, but that also would be another bug.)

Comment 7 JO3EMC 2022-04-05 09:28:40 UTC

As you may know, in the current Japanese language, LTR is the basis for horizontal writing.
So, no matter how the automatic language group selection works, you usually don't have to worry about mixed character directions.

Of course, sentences with a mixture of Western characters (mainly English ASCII characters) and Japanese characters are common.
In such cases, it is natural that the Western characters are treated as Western instead of Japanese.
The current automatic recognition of LibreOffice language groups seems to work well in the Japanese environment.
So I don't think we'll often encounter cases where we have to manually change language groups individually.

ASCII numerical characters are also automatically recognized as Western.
In Japanese, that is OK.

As discussed in Bug 146910 etc., there is some need to apply the same font to both Western and CJK language groups, but that is not the same as wanting to treat them all as the same language group.
It is a need to recognize them as different language groups and to be able to easily apply the same font.
There are also many needs to apply different fonts to each.

So far, I've talked about the situation in Japanese that seems to be related to this issue, but I haven't fully understood this issue.
I'm unfamiliar with CTL and RTL and can't figure out what's wrong and how you want it to work.
Please pardon.

Comment 8 JO3EMC 2022-04-06 07:26:14 UTC

After that, I noticed ...
The Bug 144003 issue in Japanese may be related to this issue.
In vertical writing (RTL), Japanese characters are automatically recognized in the wrong language group when Western characters are followed by punctuation marks.

Comment 9 JO3EMC 2022-04-06 07:29:32 UTC

I'm sorry.
In Bug 144003, the automatic recognition of language groups does not seem to be wrong.
It seemed that the handling of punctuation marks was just strange.

Comment 10 Heiko Tietze 2022-04-07 12:57:14 UTC

(In reply to Eyal Rozenberg from comment #6)
> It's more than about a use-case, it's a matter of principle: 
> 
> * It does not make sense that setting a direction also sets the language.
> * It does not make sense, and is not tolerable, that changing the direction
> of a run of text changes its font.

What bothers me in general on the ticket is using the language group for layouting. 
But I cannot judge on this topic since it affects basic development aspects.

Comment 11 Eyal Rozenberg 2022-04-07 19:16:06 UTC

I'll just make a clarification w.r.t. my tone.

When I said "is not tolerable", I didn't mean to suggest I am accusing developers of having acted in an intolerable way. I meant to say that reason would not tolerate an affirmation of such a behavior as the appropriate one.


On another note: Heiko, you said:

> What bothers me in general on the ticket is using the language group for layouting.

Can you clarify what you meant by the word "layouting"?


The way things stand right now seems to be the result of a choice of convenience, which often, or usually, works: If your document has, say, content in both Arabic and English, then the LTR runs are assumed to be English, with the user wanting the font family they've chosen for 'Western' languages, as it likely has full coverage of the Latin-1 character set with the glyphs the user wants. And similarly, the RTL runs are assumed to be in Arabic, and the font to use for these would be the one covering the Unicode range for Arabic, which the user has likely chosen as their Complex scripts font.

We are seeing a "corner case" of these assumptions not holding: Characters which are common to text in different language groups, which already undermines the assumptions somewhat, and are direction neutral, which makes them susceptible to being switched back and forth.

But more generally - the assumptions don't hold:

* A user may want/need a more complex covering of the set of Unicode characters by different fonts - even if all text is in the same language (e.g. for characters like arrows, or numbers, or dingbats, or emojis). And the different fonts the user has may have complex intersections requiring more involved logic for preferences.
* There may be multiple languages used within the same language group, with the user needing different fonts for them. Obvious example: Hebrew and Arabic. I'm guessing that maybe even CJK authors may want a different font for Japanese and for Chinese, for example, even if many glyphs are shared between them.


So, the question is (or one of the questions is): Should this specific issue be resolved by some kind of localized action, retaining the assumption-of-convenience from above, or should it upend that assumption?

Comment 12 Heiko Tietze 2022-04-08 09:31:43 UTC

(In reply to Eyal Rozenberg from comment #11)
> Can you clarify what you meant by the word "layouting"?
Text runs either from left to right or the other way (and anything else that belongs to the visible layout of the text). The language group is a mechanism to bundle languages so that you don't have to assign RTL to both Hebrew and Arab, for example.

Comment 13 Eyal Rozenberg 2022-04-08 09:49:51 UTC

(In reply to Heiko Tietze from comment #12)

So now I'm confused about your opinion. You said that "using the language group for layouting" bothers you, but then you wrote that "The language group is a mechanism to bundle languages so that you don't have to assign RTL to both Hebrew and Arab" <- but setting text to be RTL is part of "layouting" as you described it. So does the very mechanism of language groups bother you?

Anyway, language grouping doesn't actually save you from marking anything as RTL. I think. I always thought it was mostly a simplifying mechanism for font combination.

Comment 14 Mike Kaganski 2022-04-15 09:50:51 UTC

There is no such thing as a "language group" property of a character or paragraph. The "language group" is just an artificial construct, very problematic in itself, which should be dropped completely at some point if possible, and by no means given a greater presence. No, there's no need to add it here; setting the language, or explicitly setting direction and font, should be enough. The problem that the "language group" tries to solve comes from not all fonts having all glyphs; and even though there already are *some* fonts with ~wide coverage (of questionable quality for different scripts), they are still not predominant, hence the styles include three fonts for the respective groups, allowing each style to choose the respective font with (supposedly) existing glyphs whenever the script/language matches that group (so you have a chimera three-piece metafont, with specific glyphs coming from one of the three fonts).

This one should be WF; and overall, we definitely need better *concept* for handling of this complex issue - but the current state is based on (1) state of the art - fonts have imperfect coverage; (2) legacy (we must support existing documents having those synthetic metafonts, both in our native formats, and in external formats using the same concept). IMO, this could only improve *much later*, when (1) is improved greatly (not something we can change).
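As a rough illustration of the "three-piece metafont" idea in this comment, here is a toy Python model. The script test is a deliberately crude assumption (it keys on Unicode character names via the standard `unicodedata` module) and is not how LibreOffice classifies text; the prefixes, function names and font names are all hypothetical.

```python
import unicodedata

# Hypothetical, heavily simplified script test based on Unicode character names.
CTL_PREFIXES = ("HEBREW", "ARABIC", "THAI", "DEVANAGARI")
ASIAN_PREFIXES = ("CJK", "HIRAGANA", "KATAKANA", "HANGUL")


def group_for_char(ch: str) -> str:
    """Map one character to 'western', 'asian' or 'ctl' by its Unicode name."""
    name = unicodedata.name(ch, "")
    if name.startswith(CTL_PREFIXES):
        return "ctl"
    if name.startswith(ASIAN_PREFIXES):
        return "asian"
    return "western"  # everything else, including digits, in this toy model


def fonts_for_text(text: str, style_fonts: dict) -> list:
    """Pair every character with the font of the group it falls into."""
    return [(ch, style_fonts[group_for_char(ch)]) for ch in text]


style_fonts = {"western": "FontW", "asian": "FontA", "ctl": "FontC"}
# Latin A, Hebrew alef (U+05D0), CJK ideograph U+6F22
plan = fonts_for_text("A\u05d0\u6f22", style_fonts)
print([font for _, font in plan])
```

Characters with a strong script are easy in such a model; digits and punctuation are exactly where a fixed answer like the 'western' fallback above stops being obviously right.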

Comment 15 Eyal Rozenberg 2022-04-15 12:31:46 UTC

(In reply to Mike Kaganski from comment #14)
> setting the language, ... should be enough.

I would be fine with the ability to set the language independently of the direction, but - what will map this setting to a change of font?

> or explicitly setting direction and font, 

Neither of these is relevant in itself. The entanglement with the direction setting is part of my problem here - I need to change the language regardless of the direction.

> The problem that the "language group" tries to solve comes from not
> all fonts having all glyphs etc.

Yes, I assumed as much in comment #11.

> This one should be WF

Do you mean you're suggesting this issue be marked WFM?

> and overall, we definitely need better *concept* for
> handling of this complex issue - but the current state is based on (1) state
> of the art - fonts have imperfect coverage;

I wouldn't characterize it as "state of the art", because it's not something that is expected to change, or progress, in a different direction. It's perfectly ok for fonts to have partial coverage. (More on that below)


>  IMO, this could
> only improve *much later*, when (1) is improved greatly

You are proposing a paradigm change for distribution and design of fonts in general, that is way beyond the scope of LO. Its merits can be debated - I personally don't agree with your paradigm change - but it cannot be a factor in short-to-medium-term engineering decisions. If there was 100% consensus that this is where the world of fonts was going, then maybe you could argue against addressing issues such as this one. But - with due respect - it's just your opinion, or the opinion of some people (I've not heard this from others). So I don't believe it should have bearing on this issue.

For now, LO should be able to let us force the use of any of the fonts in the three language-groups which contain our glyphs of interest (e.g. the digits of a number). I agree that the language group is an artificial construct, but it is what LO associates a font with right now, so either we allow setting a language group, or allow setting a language and have that auto-mapped to one of the groups (and thus also group fonts).


---------------

Sidenote:

My idea for a long term alternative to the "language group" is one of two, both involving multiple fonts:

1. A font preference list: To decide which font is used for a glyph, one searches a list of decreasing preference, and the first font with that glyph available is chosen.
2. A glyph set map: The Unicode plane / set of all characters is subdivided into sets (e.g. represented as a list of ranges), each mapped to a font. Simple division templates would be "all from font F1", "strongly-language-L glyphs from F1, the rest from F2" and a few others.

In both of these, advanced users would be offered a dialog (pane) for arbitrarily modifying the list/map.
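A minimal Python sketch of the two alternatives, under loud assumptions: font coverage is faked as plain sets of code points, and F1, F2, the coverage table and both function names are hypothetical. It is meant only to show the lookup logic, not a proposed implementation.

```python
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical coverage: F1 has Hebrew letters and ASCII digits, F2 has ASCII.
COVERAGE: Dict[str, Set[int]] = {
    "F1": set(range(0x0590, 0x0600)) | set(range(0x30, 0x3A)),
    "F2": set(range(0x20, 0x7F)),
}


def pick_by_preference(ch: str, preference: List[str]) -> Optional[str]:
    """Option 1: the first font in the preference list that has the character."""
    for font in preference:
        if ord(ch) in COVERAGE[font]:
            return font
    return None  # no listed font covers this character


RangeMap = List[Tuple[Tuple[int, int], str]]


def pick_by_range_map(ch: str, range_map: RangeMap, default: str) -> str:
    """Option 2: explicit inclusive code point ranges, each mapped to one font."""
    cp = ord(ch)
    for (lo, hi), font in range_map:
        if lo <= cp <= hi:
            return font
    return default


# "Strongly-Hebrew glyphs from F1, the rest from F2"
HEBREW_FROM_F1: RangeMap = [((0x0590, 0x05FF), "F1")]

print(pick_by_preference("7", ["F1", "F2"]))             # digits follow the list order
print(pick_by_preference("7", ["F2", "F1"]))
print(pick_by_range_map("\u05d0", HEBREW_FROM_F1, "F2"))  # Hebrew alef
print(pick_by_range_map("7", HEBREW_FROM_F1, "F2"))
```

With the preference list, a digit lands in whichever covering font is listed first; with the range map, it lands wherever its range (or the default) says, independent of what each font happens to cover.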

This does not cover the issue of skewing the preferences for characters which may be part of different Unicode text runs, and thus perceived as part of sequences in different languages - like numbers, or punctuation marks. But the specifics would take more thinking and are even farther beyond the scope of this bug.

---------------

Comment 16 Mike Kaganski 2022-04-15 12:52:26 UTC

(In reply to Eyal Rozenberg from comment #15)
> > This one should be WF
> 
> Do you mean you're suggesting this issue be marked WFM?

WFM stands for "WORKSFORME", a resolution that means "there was a reproducible problem that OP described, in some version; later, it was obviously fixed, so it is not reproducible anymore in newer versions, but we don't know which commit was that - so instead of marking FIXED (reserved for cases where we know exact commit), we use WORKSFORME".

WF stands for WONTFIX, which is for "there is an acknowledged problem around this issue, but this proposal should not be implemented" (there may be different reasons why).

Comment 17 Mike Kaganski 2022-04-16 10:09:25 UTC

(In reply to Eyal Rozenberg from comment #15)
> For now, LO should be able to let us force the use of any of the fonts in the
> three language-groups which contain our glyphs of interest (e.g. the digits
> of a number). I agree that the language group is an artificial construct,
> but it is what LO associates a font with right now, so either we allow
> setting a language group, or allow setting a language and have that
> auto-mapped to one of the groups (and thus also group fonts).

Assigning a *language* (not "language group"!) with any run of the text must be enough to associate that text with the "language group", and thus to force picking of glyphs from the respective font associated with that group.

What I would agree with is an "assigning a language does not pick the right font" bug.
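The mapping argued for here (language of the run decides the group, group decides the font) could be sketched like this. The language table is a tiny hypothetical sample, the font names are invented, and the tag handling only looks at the primary subtag; none of it is taken from LibreOffice.

```python
# Hypothetical sample table; a real one would cover every supported language.
GROUP_OF_LANGUAGE = {
    "en": "western", "fr": "western",
    "he": "ctl", "ar": "ctl",
    "ja": "asian", "zh": "asian",
}


def group_for_language(tag: str) -> str:
    """Map a tag such as 'en-US' or 'zh-CN' to a group via its primary subtag.
    Raises KeyError for languages missing from the sample table."""
    primary = tag.split("-")[0].lower()
    return GROUP_OF_LANGUAGE[primary]


def font_for_run(language_tag: str, style_fonts: dict) -> str:
    """The run's language alone selects the font; direction plays no part."""
    return style_fonts[group_for_language(language_tag)]


style_fonts = {"western": "FontW", "asian": "FontA", "ctl": "FontC"}
print(font_for_run("en-US", style_fonts))  # e.g. "12:35" explicitly marked English
print(font_for_run("he-IL", style_fonts))
```

Under this model, marking "12:35" as English inside a Hebrew paragraph would be enough to get the Western font, which is the behaviour the report asks for.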

Comment 18 Eyal Rozenberg 2022-04-16 11:54:00 UTC

(In reply to Mike Kaganski from comment #17)
> What I would agree with is an "assigning a language does not pick the right
> font" bug.

But we can't assign a language, so that bug can't exist yet...

I've rephrased the title to indicate that being able to assign a language is just as desirable (perhaps more so).

Comment 19 Heiko Tietze 2022-04-22 10:54:39 UTC

Let's resolve WF.

Comment 20 Eyal Rozenberg 2022-04-22 11:00:02 UTC

(In reply to Heiko Tietze from comment #19)
> Let's resolve WF.

What? No, let's not resolve as WONTFIX. We need to be able to set the language.

Comment 21 Mike Kaganski 2022-04-22 11:09:37 UTC

(In reply to Eyal Rozenberg from comment #18)
> But we can't assign a language

Why? You can assign any language to any run of text, either using character properties, on the Font tab; or using Tools->Language->For Selection (using the same dialog when More is used).

Comment 22 Eyal Rozenberg 2022-04-22 11:25:40 UTC

(In reply to Mike Kaganski from comment #21)
> (In reply to Eyal Rozenberg from comment #18)
> > But we can't assign a language
> 
> Why? You can assign any language to any run of text, either using character
> properties

No, I can't. That's the problem. I can choose a language within the language group LO has decided the text belongs to, but I can't choose a language outside of it.

> using Tools->Language->For Selection (using the same dialog when More is used).

I had actually never noticed that, believe it or not... but - again, I can't choose the language I want that way, only one from a limited set of languages.

Comment 23 Mike Kaganski 2022-04-22 11:39:32 UTC

Interesting. Indeed, with the Asian and CTL enabled, the dialog only allows to assign languages to groups ... and that doesn't allow one to define the language to the text, only to affect the results of application of the rules how the program assigns the language (by analyzing the characters and their script).

FTR: you may still assign any language to any group - just typing respective locale name (zh-CN)... but it's not something too user-friendly.

Comment 24 Eyal Rozenberg 2022-04-22 11:53:51 UTC

(In reply to Mike Kaganski from comment #23)
> FTR: you may still assign any language to any group - just typing respective
> locale name (zh-CN)...

Type it where? If you mean in the Western tab language box - you might be able to type something in there, but it doesn't help make text considered Hebrew-language into English-language.

Comment 25 Mike Kaganski 2022-04-22 12:01:58 UTC

(In reply to Eyal Rozenberg from comment #24)

Yes, I didn't check that there is inconsistency between the drop-downs. Which is another issue.

Comment 26 Eyal Rozenberg 2024-09-12 20:44:19 UTC

Marking this as a bug rather than an enhancement. It is necessary to be able to indicate this, both for its own sake and to enable the solution of 162502 for the bugs it blocks.