Bug 151290 - A language must be a feature of text content, not of character/paragraph styles
Status: UNCONFIRMED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: LibreOffice
Version: 7.4.1.2 release (earliest affected)
Hardware: All All
Importance: medium normal
Assignee: Not Assigned
Duplicates: 160248
Depends on: 160249
Blocks: ODF-spec Languages
Reported: 2022-10-03 03:51 UTC by Eyal Rozenberg
Modified: 2024-03-18 15:49 UTC (History)

Description Eyal Rozenberg 2022-10-03 03:51:20 UTC
When I write a document, I often use character styles such as "Emphasis", "Internet Link" and "Quotation". Naturally, I want to use these styles for text in different languages - and not define separate styles named "Arabic Emphasis", "Hebrew Emphasis", "N'Ko Emphasis" etc.

However, as Regina Henschel tells me, it is currently the case that the choice of language is a feature of a character style (or at least - the choice of a single language in each language group).

That also does not make sense semantically: The languages I use are part of the content, not the style. I can take Hebrew text and change its "style" - but it will not become Arabic text.

So, this should change. The language (and the language group) of a stretch of text must be _removed_ from the character style (explicit or default-style in a paragraph style).
Comment 1 Mike Kaganski 2022-10-03 09:13:02 UTC
I completely agree (modulo the fact that we must stay compatible, and so must support existing documents using styles exactly for the language definition). Also, the problem of easily marking runs as having a specific language must be solved, including for platforms not providing a system input language and for users not using that feature.

What styles could/should provide is a mapping from a language to a set of formatting, for multiple languages inside a single style - bug 151215. The language applied to the text run shouldn't be formatting itself, and thus having it as part of a style is conceptually wrong.
Comment 2 Regina Henschel 2022-10-03 16:24:32 UTC
If you assign the character style "Emphasis" to a portion of text in a paragraph, then this generates a <text:span> element (6.1.7) in file markup. In this <text:span> element you will find the text:style-name="Emphasis" attribute (19.880.33).
The style "Emphasis" is a <style:style> (16.2) element in file markup in styles.xml. This <style:style> element has a style:name="Emphasis" attribute (19.502) which identifies the style and a style:family="text" attribute (19.480), which determines the properties, which may be specified in this style.
In case of family "text", the properties are contained as attributes in a <style:text-properties> element (16.29.29). Up to 84 properties exist, but you may use only a subset of them. Section 16.2 of the standard specifies how the value of a property is determined when it is not contained in the <style:text-properties> element of the style referenced by the object being styled.
[The section numbers refer to ODF 1.3.]
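
To make the relationship between the <text:span> element and the <style:style> element concrete, here is a minimal Python sketch. It is only an illustration: the <doc> wrapper element, the sample sentence and the helper names q() and span_properties() are made up for this example, and a real document keeps content and styles in separate files (content.xml and styles.xml).

```python
import xml.etree.ElementTree as ET

# ODF namespace URIs for the prefixes used in the markup below.
NS = {
    "style": "urn:oasis:names:tc:opendocument:xmlns:style:1.0",
    "text": "urn:oasis:names:tc:opendocument:xmlns:text:1.0",
    "fo": "urn:oasis:names:tc:opendocument:xmlns:xsl-fo-compatible:1.0",
}

# Hypothetical fragment: <doc> is an invented wrapper so that the style
# definition and the text using it fit in one parseable string.
SAMPLE = """\
<doc xmlns:style="urn:oasis:names:tc:opendocument:xmlns:style:1.0"
     xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0"
     xmlns:fo="urn:oasis:names:tc:opendocument:xmlns:xsl-fo-compatible:1.0">
  <style:style style:name="Emphasis" style:family="text">
    <style:text-properties fo:font-style="italic"/>
  </style:style>
  <text:p>Plain and <text:span text:style-name="Emphasis">emphasized</text:span> text.</text:p>
</doc>
"""


def q(prefix, local):
    """Build the '{namespace}local' name that ElementTree uses internally."""
    return "{%s}%s" % (NS[prefix], local)


def span_properties(root):
    """Return (span text, style name, text-properties attributes) per span."""
    # Index all character styles (family "text") by their style:name.
    styles = {
        s.get(q("style", "name")): s
        for s in root.iter(q("style", "style"))
        if s.get(q("style", "family")) == "text"
    }
    result = []
    for span in root.iter(q("text", "span")):
        name = span.get(q("text", "style-name"))
        style = styles.get(name)
        props = None if style is None else style.find(q("style", "text-properties"))
        attrs = dict(props.attrib) if props is not None else {}
        result.append((span.text, name, attrs))
    return result


for text, name, attrs in span_properties(ET.fromstring(SAMPLE)):
    print(text, name, attrs)
```

The sketch only follows the direct reference from the span to its style; it does not implement the inheritance rules of section 16.2.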

These <text:span> elements may be nested; however, as the file format is XML, the elements cannot overlap. That means that ODF allows several character styles to be applied to the same portion of text. But that is currently not correctly implemented (bug 115311).

If you do not want your style "Emphasis" to include the language, then simply do not specify the language in the style. You must not touch the language field in the dialog, otherwise the language is set. If you are unsure whether the language is set or not, look at the Organizer tab of the style modify dialog. To remove a language setting from the style, you have to use the "Reset to Parent" button on the "Font" page and then set the desired other properties on that page again.

Many of the properties depend on the script type of a character. The script type of a character can be "latin", "asian" or "complex". The Unicode code point determines which of the three script types applies, not the language. Script type dependent properties have three variants of a property, e.g. fo:font-style, style:font-style-asian and style:font-style-complex. Only one of them is active for a character. So if you set e.g. "italic" for "Western text" and "bold" for "CTL text", the "Emphasis" character style should work for English, Hebrew and Farsi as well. If not, that is a bug.
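
The selection of the active variant can be sketched in Python as follows. This is a rough illustration under stated assumptions: the code point ranges in script_type() are a deliberately simplified guess, not the classification LibreOffice actually uses, and the function names and the EMPHASIS dictionary are invented for the example.

```python
def script_type(ch):
    """Classify a character as "latin", "asian" or "complex".

    The ranges are coarse, illustrative assumptions only.
    """
    cp = ord(ch)
    # Hebrew, Arabic, Indic scripts, Thai (assumed "complex" block range)
    if 0x0590 <= cp <= 0x0E7F:
        return "complex"
    # Kana, CJK unified ideographs, Hangul syllables (assumed "asian" ranges)
    if 0x3040 <= cp <= 0x30FF or 0x4E00 <= cp <= 0x9FFF or 0xAC00 <= cp <= 0xD7AF:
        return "asian"
    return "latin"


def attribute_name(prop, script):
    """Map a property such as "font-style" to its script-specific attribute."""
    if script == "latin":
        return "fo:" + prop
    return "style:%s-%s" % (prop, script)


def effective(style, prop, ch):
    """Return the value of prop that is active for character ch, or None."""
    return style.get(attribute_name(prop, script_type(ch)))


# "italic" for Western text and "bold" for CTL text, as in the example above.
EMPHASIS = {
    "fo:font-style": "italic",
    "style:font-weight-complex": "bold",
}

print(effective(EMPHASIS, "font-style", "e"))        # Latin letter
print(effective(EMPHASIS, "font-weight", "\u05e9"))  # Hebrew letter shin
print(effective(EMPHASIS, "font-weight", "\u0641"))  # Arabic letter feh
```

Note that the language never enters the lookup: Hebrew and Farsi text both end up with the "complex" variant.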

A language is set by the attributes fo:language and fo:country and their "asian" and "complex" variants. These are attributes of a <style:text-properties> element. This <style:text-properties> element can be a child element of a style of family "text", which corresponds to character styles. It can also be a child element of a style of family "paragraph". So removing the ability to set a language in a character style or a paragraph style is not possible. We can only try to make the UI reflect the relationships more clearly, for example by moving the language settings to their own tab, so that they cannot be changed by accident when working with other settings.
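
For anyone who wants to check in the file markup whether a style carries a language, a small Python sketch follows. The two sample styles and the function name language_settings() are invented for illustration; the attribute names are the ODF ones (fo:language, fo:country, and the style:language-asian, style:country-asian, style:language-complex, style:country-complex variants).

```python
import xml.etree.ElementTree as ET

STYLE_NS = "urn:oasis:names:tc:opendocument:xmlns:style:1.0"
FO_NS = "urn:oasis:names:tc:opendocument:xmlns:xsl-fo-compatible:1.0"

# Language-related attributes of <style:text-properties>, keyed by the
# namespaced name ElementTree uses, mapped to a readable prefixed name.
LANGUAGE_ATTRS = {
    "{%s}language" % FO_NS: "fo:language",
    "{%s}country" % FO_NS: "fo:country",
    "{%s}language-asian" % STYLE_NS: "style:language-asian",
    "{%s}country-asian" % STYLE_NS: "style:country-asian",
    "{%s}language-complex" % STYLE_NS: "style:language-complex",
    "{%s}country-complex" % STYLE_NS: "style:country-complex",
}

XMLNS = 'xmlns:style="%s" xmlns:fo="%s"' % (STYLE_NS, FO_NS)

# Hypothetical style without any language setting.
NO_LANGUAGE = (
    '<style:style %s style:name="Emphasis" style:family="text">'
    '<style:text-properties fo:font-style="italic"/>'
    "</style:style>" % XMLNS
)

# Hypothetical style that fixes a Western and a CTL language.
WITH_LANGUAGE = (
    '<style:style %s style:name="Quotation" style:family="text">'
    '<style:text-properties fo:font-style="italic" '
    'fo:language="en" fo:country="US" '
    'style:language-complex="he" style:country-complex="IL"/>'
    "</style:style>" % XMLNS
)


def language_settings(style_xml):
    """Return the language-related attributes set directly in a style."""
    style = ET.fromstring(style_xml)
    props = style.find("{%s}text-properties" % STYLE_NS)
    if props is None:
        return {}
    return {
        LANGUAGE_ATTRS[key]: value
        for key, value in props.attrib.items()
        if key in LANGUAGE_ATTRS
    }


print(language_settings(NO_LANGUAGE))
print(language_settings(WITH_LANGUAGE))
```

An empty result only means the style itself sets no language; the value may still be inherited from a parent or default style.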
Comment 3 Eyal Rozenberg 2022-10-03 20:26:52 UTC
(In reply to Regina Henschel from comment #2)
Noting the use of `text:span` I am reminded of HTML span, and HTML in general. In that standard, the language is an attribute separate from the style (e.g. `<p lang="de-DE" style="bunch of CSS here">`).

> In case of family "text", the properties are contained as attributes in a
> <style:text-properties> element (16.29.29).

Yes, I see:

https://docs.oasis-open.org/office/OpenDocument/v1.3/os/part3-schema/OpenDocument-v1.3-os-part3-schema.html#element-style_text-properties

So, fo:country and fo:language should be removed from style:text-properties. And they should be otherwise settable on text:span's, and probably some other text:XXXX elements. And maybe even other elements.

And - styles should be able to carry properties for multiple languages, in multiple language-groups.
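
To illustrate the model I have in mind, here is a minimal Python sketch. It is purely hypothetical: the Run and CharacterStyle classes, the field names and the property values are invented for this example and do not correspond to any existing ODF or LibreOffice structure.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    """A stretch of text; the language belongs to the content."""
    text: str
    lang: str        # language tag such as "he" or "ar"
    style: str = ""  # character style name, independent of the language


@dataclass
class CharacterStyle:
    """A style carrying formatting for several languages at once."""
    name: str
    default: dict = field(default_factory=dict)
    per_language: dict = field(default_factory=dict)  # lang tag -> overrides

    def resolve(self, lang):
        """Formatting for text in lang: defaults plus per-language overrides."""
        props = dict(self.default)
        props.update(self.per_language.get(lang, {}))
        return props


emphasis = CharacterStyle(
    name="Emphasis",
    default={"font-style": "italic"},
    per_language={
        "he": {"font-style": "normal", "font-weight": "bold"},
        "ar": {"font-size": "14pt"},
    },
)

run = Run(text="\u05e9\u05dc\u05d5\u05dd", lang="he", style="Emphasis")
print(emphasis.resolve(run.lang))

# Changing the style changes the formatting, never the language.
run.style = "Quotation"
print(run.lang)
```

With such a model a single "Emphasis" style serves Hebrew, Arabic and English text, and no style ever needs to name a language for the text it is applied to.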

> If you want, that your style "Emphasis" does not include the language, then
> simply do not specify the language in the style.

But then - how would the Emphasis style use different font properties for Arabic text and for Hebrew text?

Anyway, I believe it should not be _possible_ to specify a language as part of a style; I claim this is a design mistake in the ODF spec.
Comment 4 Panos Stokas 2022-12-13 07:11:42 UTC
I like having the ability to set language to None on certain styles.

For example, the style I'm using for programming code is set to Language=none because I want it to be exempt from spelling checks.

Some styles may have a decorative purpose (e.g. those based on Symbol or Wingdings characters) which again benefit from setting language=none.

However I can't think of a scenario where I would set a specific language to a style. As Eyal Rozenberg points out, language is really part of the content.
Comment 5 BogdanB 2022-12-17 17:00:23 UTC
(In reply to Panos Stokas from comment #4)
> I like having the ability to set language to None on certain styles.
> 
> For example, the style I'm using for programming code is set to
> Language=none because I want it to be exempt from spelling checks.
> 
> Some styles may have a decorative purpose (eg. those based on Symbol or
> Wingdings characters) which again benefit from setting language=none.
> 
> However I can't think of a scenario where I would set a specific language to
> a style. As Eyal Rozenberg points, language is really part of the content.

You CAN already set the Language to NONE on any style you need.
Comment 6 Panos Stokas 2022-12-20 11:01:32 UTC
(In reply to BogdanB from comment #5)
> (In reply to Panos Stokas from comment #4)

> > For example, the style I'm using for programming code is set to
> > Language=none because I want it to be exempt from spelling checks.

> You CAN already to set the Language to NONE to any style you need.

Indeed, and I like that.
Comment 7 Jan Vlug 2024-03-01 09:40:11 UTC
I also observed this issue in the context of cells in Calc.
I created this forum topic for that:
https://ask.libreoffice.org/t/improve-location-of-cell-language-setting-in-calc/102849

Shouldn't the status of this bug be NEW?
Comment 8 ajlittoz 2024-03-17 16:59:30 UTC
Let me bring my 2 cents to this debate.

As mentioned in several comments, _language_ is an inherent property of text. Presently, this can only be set through a character style. But styles in general are tools to **format** text, i.e. change its appearance and flow properties.

The language attribute in the Font tab mixes two layers: the abstract semantic layer associated with the meaning of the text and the "graphical" decoration layer.

As pointed out in another comment, language tagging should be separate from the formatting layer.

Comment #4 mentions a common usage of the Font language attribute to switch off spellchecking (e.g. for computer code). However, I think this is semantically wrong. Computer code is just another language (_None_ to avoid mistaking it for a human language) and this, too, is part of the data.

Presently, writing multi-lingual documents is a real pain because this means duplicating styles. I don't like the idea of retrieving the current language from the keyboard layout either. Keyboard, for me, is a language-neutral device to enter characters. I don't practice layout switching for the sake of switching languages because my keyboards have single engraving. I do switch layout, but only because I configured various layouts for access to infrequent characters, still continuing to type in the same language.

Keyboard layout (again in my workflow) is only a description of the physical keyboard (I have one intl-US in addition to my locale) without implication about the language I type.

Not using Font tab language attribute is a way to make styles universal. But this means language sequence is set with direct formatting, which is generally bad because there is no UI for it or visual feedback.

Auto-detecting the current language based on glyphs seems infeasible to me: too many languages share characters (e.g. all West-European languages share the Latin set, Japanese and Chinese share Kanji, …).

I don't grasp the present notion of "groups". What is the commonality between Arabic and Hindi in the "Complex" group? Layout rules are dramatically different.

What would make sense is language tagging. This should not be based on glyph. Many glyphs are "neutral", like punctuation and in some aspects "ordinary" digits. Consequently, only author's mark up can eliminate ambiguities.
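
A small Python sketch shows how little the characters alone reveal. It relies only on the standard unicodedata module; taking the first word of a character's Unicode name as a "script hint" is a crude shortcut chosen for this illustration, and the function name script_hints() is invented.

```python
import unicodedata


def script_hints(text):
    """Collect a rough script hint for every letter in text.

    The hint is the first word of the Unicode character name (e.g. "LATIN",
    "CJK", "HEBREW"). Non-letters such as digits and punctuation are
    skipped because they are shared by many languages.
    """
    hints = set()
    for ch in text:
        if unicodedata.category(ch).startswith("L"):
            hints.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return hints


# English, French and German words are indistinguishable by script.
print(script_hints("colour"), script_hints("couleur"), script_hints("Farbe"))

# So are a Japanese and a Chinese spelling of the same word.
print(script_hints("\u6f22\u5b57"), script_hints("\u6c49\u5b57"))

# Digits and punctuation give no hint at all.
print(script_hints("1, 2; 3!"))
```

The characters narrow the language down to a script at best, so the choice among the languages sharing that script has to come from the author's markup.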

I acknowledge that the matter is difficult and compatibility with existing documents must be preserved. Font tab language setting could be kept for that but documentation should discourage its use as obsoleted by a new feature (separate from the formatting layer).
Comment 9 Eyal Rozenberg 2024-03-17 21:25:04 UTC
(In reply to ajlittoz from comment #8)
> Comment #4 mentions a common usage of the Font language attribute to switch
> off spellchecking (e.g. for computer code). However, I think this is
> semantically wrong. Computer code is just another language (_None_ to avoid
> mistaking it for a human language) and this is too part of the data.

This is a good point, but there are actually three separate issues here:

* Text with no language

* Languages for programming and other specific domains rather than languages developed for general-purpose speech and writing.

* Text in arbitrary languages LibreOffice does not know about a priori.

> I don't like either the idea to retrieve current
> language from keyboard layout.

I don't believe that was suggested in the context of this bug. The effect of the chosen keyboard layout on the entered text's language is an interesting discussion to have, but let's not have it in this bug.

> Auto-detecting current language based on glyph seems to me infeasible:

It's indeed quite infeasible. However, in the context of "filling in" language tagging for a document we obtain with no language tagging - that might be a reasonable "limited-effort" heuristic. At any rate - doing so is also a matter for another, dependent, bug :-)

> I don't grasp the present notion of "groups". What is the commonality
> between Arabic and Hindi in the "Complex" group? Layout rules are
> dramatically different.

Well, there's some similarity in how typesetting is handled: A lot of glyph-joining. OTOH, there are ligatures in Latin/Western languages too... as for "Asian" languages - those are the ideographic ones, so again, similarity in handling. But it's basically historical reasons.

> What would make sense is language tagging. This should not be based on
> glyph. Many glyphs are "neutral", like punctuation and in some aspects
> "ordinary" digits. Consequently, only author's mark up can eliminate
> ambiguities.

Indeed. We need more of the LO community to realize the significance and necessity of this fundamental change, for it to gather enough momentum to be executed.

> I acknowledge that the matter is difficult and compatibility with existing
> documents must be preserved. Font tab language setting could be kept for
> that but documentation should discourage its use as obsoleted by a new
> feature (separate from the formatting layer).

We will have compatibility considerations for the UI, and compatibility considerations for the document markup, and both must be handled with some care.
Comment 10 Eyal Rozenberg 2024-03-18 09:22:49 UTC
*** Bug 160248 has been marked as a duplicate of this bug. ***