Bug 166011 - Implement style:script-type
Summary: Implement style:script-type
Status: ASSIGNED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer
Version (earliest affected): Inherited From OOo
Hardware: All All
Importance: medium enhancement
Assignee: Jonathan Clark
URL: https://docs.oasis-open.org/office/Op...
Whiteboard:
Keywords:
Depends on:
Blocks: 66791 Script-Assignment 166012 ODF-1.0-Support
Reported: 2025-04-02 21:00 UTC by Jonathan Clark
Modified: 2025-06-05 18:58 UTC
Description Jonathan Clark 2025-04-02 21:00:21 UTC
Writing systems in LibreOffice and ODF are divided into three disjoint script categories. During layout and rendering, LibreOffice must determine to which category each character belongs, in order to apply the correct formatting. We currently assign these characters to script types using a hard-coded algorithm.

A recurring issue is that, sometimes, our algorithm guesses wrong. This usually happens with characters like punctuation, which may be used for different purposes across languages within the same document. We currently don't have any way for users to override our algorithm in these cases.

While it wouldn't solve all problems with script assignment, a good start would be to support the style:script-type attribute:

Per 20.358 in the ODF 1.3 specification, "[c]onsumers that can determine script types of Unicode characters may also evaluate the attribute and overwrite the script type they determine for certain character with the value of the attribute".

Although this attribute was introduced in ODF 1.0, it was never implemented. Doing so would create a workaround for a significant number of our script assignment issues, and may also unblock some OOXML interop (e.g. w:hint).
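
For readers unfamiliar with the attribute, here is a minimal Python sketch of what reading it from a style could look like. It is only an illustration, not LibreOffice's import code: the sample style, the helper name read_script_type, and the assumption that the attribute sits on a style:text-properties element are mine.

```python
import xml.etree.ElementTree as ET

# ODF "style" namespace URI.
STYLE_NS = "urn:oasis:names:tc:opendocument:xmlns:style:1.0"

# Hypothetical text style that forces the "asian" script type.
SAMPLE = f'''<style:style xmlns:style="{STYLE_NS}"
    style:name="T1" style:family="text">
  <style:text-properties style:script-type="asian"/>
</style:style>'''


def read_script_type(style_xml):
    """Return the style:script-type value of a style, or None if unset."""
    root = ET.fromstring(style_xml)
    props = root.find(f"{{{STYLE_NS}}}text-properties")
    if props is None:
        return None
    return props.get(f"{{{STYLE_NS}}}script-type")


print(read_script_type(SAMPLE))
```

A None result would mean "no override", in which case a consumer keeps whatever script type its own detection produced.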
Comment 1 Eyal Rozenberg 2025-04-03 00:22:22 UTC
Isn't an unimplemented ODF feature a bug rather than an enhancement?
Comment 2 Eyal Rozenberg 2025-04-08 15:39:20 UTC
A couple of notes for the benefit of people who don't know what style:script-type is (like myself until very recently) and may be confused.

Note 1:
In the context of this bug, and related bugs (but not everywhere), here is some relevant term rewriting:

When people say    they mean the same as if they had said
---------------------------------------------------------------
language           script
                   (i.e. a distinctive writing system, based on 
                   a repertoire of specific elements or symbols, 
                   or that repertoire itself; brief Wikipedia 
                   definition).
written language   script
language group     script type
script group       script type
script category    script type

(so, we're not talking about languages in the usual sense of the word)


Note 2:

Several aspects of character styles and direct formatting (DF) are specific to one of the script groups mentioned by Jonathan in comment #0. An example would be the font family: there are three ODF attributes for that: fo:font-family, style:font-family-asian and style:font-family-complex, and we can in fact see all three of their values in the "Format > Character..." dialog (assuming full RTL-CTL and CJK support has been enabled).

But then, which of these font-families is actually to be used for a given character? The one for the script group which LO determines the character belongs to. And this is where style:script-type comes in. 

It has one possible value for each of the three script groups: latin, complex (the RTL-CTL group) and asian (plus a fourth value we won't get into here). If it is set, LO should treat the relevant text as being in that script group, applying the script-group-specific attributes to it; if style:script-type is not set, LO can fall back to the heuristic algorithm it uses now.
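
To make the selection rule concrete, here is a small Python sketch. It assumes the value names from the ODF spec (latin, asian, complex); the function names are hypothetical, and treating any other value (including the fourth one) as "no override" is my simplification, not something the spec or LO prescribes.

```python
# Which font-family attribute applies to each script group.
FONT_FAMILY_ATTR = {
    "latin": "fo:font-family",
    "asian": "style:font-family-asian",
    "complex": "style:font-family-complex",
}


def effective_script_type(detected, override=None):
    """Prefer an explicit style:script-type; otherwise keep the heuristic result."""
    if override in FONT_FAMILY_ATTR:
        return override
    return detected


def font_family_attr(detected, override=None):
    """Name of the font-family attribute to use for a run of text."""
    return FONT_FAMILY_ATTR[effective_script_type(detected, override)]


# A full stop the heuristic filed under "latin", overridden to "asian":
print(font_family_attr("latin", override="asian"))
```

Without the override, the same call falls back to the detected group and picks fo:font-family, which matches the current behavior described above.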

Except, as Jonathan says, this is not what we do. We simply ignore style:script-type and always go for the heuristic. This is not formally a bug, since the spec says that we _may_ use it if we want to; it's not a hard requirement. But it's quite problematic, as can be deduced by reading bug 148257.

Bug 148257 is about the user being able to set the script; and this attribute "clinches" it, since we can already set, albeit in a crooked fashion, the choice of script within each of the script groups; if we also set the script group - we've set the language.
Comment 3 Volga 2025-04-12 15:11:28 UTC
(In reply to Jonathan Clark from comment #0)
> Although this attribute was introduced in ODF 1.0, it was never implemented.
> Doing so would create a workaround for a significant number of our script
> assignment issues, and may also unblock some OOXML interop (e.g. w:hint).
This also needs some research into current implementations for such characters in LibreOffice, and some documents published by the W3C Internationalization (I18n) Activity (https://www.w3.org/International/) and the Unicode Consortium.
Comment 4 Eyal Rozenberg 2025-04-12 18:26:57 UTC
(In reply to Volga from comment #3)

I _think_ I disagree with your comment, but perhaps I'm just misunderstanding it.

> This also needs some research into current implementations for such
> characters in LibreOffice

What do you mean by "implementations of characters"?

> and some documents published by W3C
> Internationalization (I18n) Activity (https://www.w3.org/International/) and
> Unicode Consortium.

Which documents? And - why would we need to consult these documents before adding support for script-type?
Comment 5 Volga 2025-05-30 16:44:48 UTC Comment hidden (no-value)
Comment 6 Volga 2025-05-30 17:30:04 UTC
I mean to find out which characters are shared by various scripts, investigate their usage, then establish rules to assign the proper typeface, joining behavior, line breaking, etc. For example, U+0640 ARABIC TATWEEL is encoded in the Arabic block, but the Unicode Standard recommends using it in Adlam, Hanifi Rohingya, Mandaic, Manichaean, N'ko, Old Uighur, Psalter Pahlavi and Syriac as well, so when an Arabic Tatweel is inserted in them, it should be rendered with the respective font face and not break up contextual alternates.
Comment 7 Eyal Rozenberg 2025-06-05 17:58:29 UTC
(In reply to Volga from comment #6)
> I mean to find out which characters are shared by various scripts,
> investigate their usage, then establish rules to assign the proper typeface,
> joining behavior, line breaking, etc.

Ah, I see what you mean. Well, isn't this known? i.e. I would assume this data is part of what the Unicode consortium publishes. 

> For example, U+0640 ARABIC TATWEEL is
> encoded in the Arabic block, but the Unicode Standard recommends using it in
> Adlam, Hanifi Rohingya, Mandaic, Manichaean, N'ko, Old Uighur, Psalter
> Pahlavi and Syriac as well, so when an Arabic Tatweel is inserted in them, it
> should be rendered with the respective font face and not break up contextual
> alternates.

I'm not sure that's a good example, because all of these scripts are in the "RTL-CTL" group if I'm not mistaken, so for this one, we would not have any hesitation.

In the future, however, and outside of the focus of this bug, when we have per-script font setting, we might hesitate regarding which script the TATWEEL belongs to.

I believe the typical examples now would be multi-language punctuation marks like points, commas, colons, question marks and so on; quotes; spaces; symbols; geometric shapes; western-Arabic numerals; etc.
Comment 8 Eyal Rozenberg 2025-06-05 18:01:37 UTC
@Jonathan: Can you outline how far you plan to go with the implementation, within the scope of this bug? I.e., will you add UI for it to the paragraph style dialog, for example? Will you add it to output filters? Input filters?
Comment 9 Jonathan Clark 2025-06-05 18:35:33 UTC
(In reply to Eyal Rozenberg from comment #8)
> @Jonathan: Can you outline how far you plan to go with the implementation,
> within the scope of this bug? I.e., will you add UI for it to the paragraph
> style dialog, for example? Will you add it to output filters? Input filters?

For organizational purposes, I planned to stop at ODT I/O and layout. I planned to tackle the UI under bug 166012. Filters should be filed separately and only as applicable, but I haven't done that yet.
Comment 10 Eyal Rozenberg 2025-06-05 18:58:29 UTC
(In reply to Eyal Rozenberg from comment #7)
> Ah, I see what you mean. Well, isn't this known? i.e. I would assume this
> data is part of what the Unicode consortium publishes. 

In fact, just look at the table in the ODF spec, at the section Jonathan linked to:

Latin:
U+0003..U+001F, U+0021..U+009F, U+00A1..U+04FF, U+0530..U+058F, U+10A0..U+10FF, U+13A0..U+16FF, U+1E00..U+1FFF, U+2C60..U+2C7F, U+2C80..U+2CE3, U+A720..U+A7FF
	
Complex:
U+0590..U+074F, U+0780..U+07BF, U+0900..U+109F, U+1200..U+137F, U+1780..U+18AF, U+FB50..U+FDFF, U+FE70..U+FEFF
	
Asian:
U+1100..U+11FF, U+2E80..U+31BF, U+31C0..U+31EF, U+3200..U+4DBF, U+4E00..U+A4CF, U+AC00..U+D7AF, U+F900..U+FAFF, U+FE30..U+FE4F, U+FF00..U+FFEF, U+20000..U+2A6DF, U+2F800..U+2FA1F
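
As a rough illustration of how such a table can be used, here is a Python sketch that looks a code point up in the ranges quoted above. The ranges are copied from this comment rather than re-checked against the spec, the names are hypothetical, and this is not LO's actual algorithm. Characters outside every range (for example U+0020 SPACE) come back as None, which is exactly the kind of case where an explicit style:script-type would help.

```python
# Ranges as quoted in this comment; not re-verified against the ODF spec.
SCRIPT_RANGES = {
    "latin": [
        (0x0003, 0x001F), (0x0021, 0x009F), (0x00A1, 0x04FF),
        (0x0530, 0x058F), (0x10A0, 0x10FF), (0x13A0, 0x16FF),
        (0x1E00, 0x1FFF), (0x2C60, 0x2C7F), (0x2C80, 0x2CE3),
        (0xA720, 0xA7FF),
    ],
    "complex": [
        (0x0590, 0x074F), (0x0780, 0x07BF), (0x0900, 0x109F),
        (0x1200, 0x137F), (0x1780, 0x18AF), (0xFB50, 0xFDFF),
        (0xFE70, 0xFEFF),
    ],
    "asian": [
        (0x1100, 0x11FF), (0x2E80, 0x31BF), (0x31C0, 0x31EF),
        (0x3200, 0x4DBF), (0x4E00, 0xA4CF), (0xAC00, 0xD7AF),
        (0xF900, 0xFAFF), (0xFE30, 0xFE4F), (0xFF00, 0xFFEF),
        (0x20000, 0x2A6DF), (0x2F800, 0x2FA1F),
    ],
}


def classify(code_point):
    """Return 'latin', 'complex', 'asian', or None if no range matches."""
    for script_type, ranges in SCRIPT_RANGES.items():
        for first, last in ranges:
            if first <= code_point <= last:
                return script_type
    return None


for ch in ("A", "\u05D0", "\u4E2D", " "):
    print(f"U+{ord(ch):04X}", classify(ord(ch)))
```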