167301 – Extend style:script-type hinting to more character ranges


Status:	NEW

Alias:	None

Product:	LibreOffice
Classification:	Unclassified
Component:	Writer
Version: (earliest affected)	25.8.0.0 alpha0+
Hardware:	All All

Importance:	medium normal
Assignee:	Not Assigned

Depends on:	129038 132000
Blocks:	Script-Assignment

Reported:	2025-06-30 13:31 UTC by Jonathan Clark
Modified:	2025-06-30 17:56 UTC
CC List:	1 user

Description Jonathan Clark 2025-06-30 13:31:06 UTC

Currently, our implementation of style:script-type only affects the appearance of characters that the ODF specification does not map to a script type (see ODF 1.3 Table 22). Although this implementation technically complies with the standard, it ignores many characters for which users may reasonably want to override our default behavior, such as numerals and certain mathematical symbols.

We should investigate how to safely expand style:script-type to cover more characters. For the benefit of interoperability, we should also evaluate the algorithm described in the OOXML standard (ECMA-376-1:2016 17.3.2.26).
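A minimal sketch of the behavior described above, not the actual Writer code. The range table is a three-entry toy, not the ODF table; `OVERRIDABLE`, `resolve`, the `extended` flag, and the "latin" fallback for unhinted, unmapped characters are all hypothetical names and assumptions made for illustration.

```python
from typing import Optional

# Toy excerpt of a range-to-script-type table. The real table lives in the
# ODF specification; these three ranges are illustrative assumptions only.
TOY_RANGES = [
    (0x0020, 0x007E, "latin"),    # printable ASCII, incl. digits and punctuation
    (0x0590, 0x05FF, "complex"),  # Hebrew block
    (0x4E00, 0x9FFF, "asian"),    # CJK Unified Ideographs
]

# Hypothetical set of mapped characters that a hint may override when the
# extended behavior is enabled.
OVERRIDABLE = set("0123456789")


def mapped_type(ch: str) -> Optional[str]:
    """Return the script type the toy table assigns to ch, or None."""
    cp = ord(ch)
    for first, last, script_type in TOY_RANGES:
        if first <= cp <= last:
            return script_type
    return None


def resolve(ch: str, hint: Optional[str] = None, extended: bool = False) -> str:
    """Pick a script type for ch, given an optional style:script-type hint."""
    mapped = mapped_type(ch)
    if mapped is None:
        # Current behavior: the hint only applies to unmapped characters.
        # Falling back to "latin" without a hint is an assumption.
        return hint if hint is not None else "latin"
    if extended and hint is not None and ch in OVERRIDABLE:
        # Extended behavior: the hint also wins for selected mapped characters.
        return hint
    return mapped
```

With this toy table, `resolve("5", hint="complex")` stays "latin" because the digit is already mapped, while passing `extended=True` lets the hint take effect. Which characters belong in the overridable set is exactly the open question of this bug.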

Comment 1 Eyal Rozenberg 2025-06-30 17:20:42 UTC

I'd say the mapping by the standard is just plain wrong for many of those characters. Since when are numerals and punctuation marks "western"?

I would suggest we have a list of choices for the Unicode character mapping scheme in Tools > Options > Languages and Locales. It could have 3 items:

* LibreOffice heuristic
* ODF 1.4 §20.358 Table 23 mapping
* ECMA-376-1:2016 OOXML 17.3.2.26 mapping (or whatever MSO is using)

The LO heuristic would be whatever modifications we decide we want to make on the ODF table.

If the Unicode Consortium defines which characters are strongly associated with one language or a set of languages, we could compose that mapping with our language-to-language-group mapping (which we have, right?), so that if the languages aren't fully within a single language group, we apply the hint to choose between the groups. That could be a fourth option, "Unicode-consortium-based". Or we could just use that ourselves and drop the "LO Heuristic" option.

Comment 2 Eyal Rozenberg 2025-06-30 17:51:20 UTC Comment hidden (invalid, obsolete)

Is this

Comment 3 Jonathan Clark 2025-06-30 17:56:28 UTC

(In reply to Eyal Rozenberg from comment #1)
> I'd say the mapping by the standard is just plain wrong for many of those
> characters. Since when are numerals and punctuation marks "western"?
> 
> I would suggest we have a list of choices for the Unicode character mapping
> scheme in Tools > Options > Languages and Locales. It could have 3 items:
> 
> * LibreOffice heuristic
> * ODF 1.4 §20.358 Table 23 mapping
> * ECMA-376-1:2016 OOXML 17.3.2.26 mapping (or whatever MSO is using)

I'm worried it might be asking too much to expect users to have an opinion about this. I'm in favor of an OOXML compatibility flag once our implementation is robust enough that the differences start to matter, but only because we can make that choice for the user automatically when they open a DOCX file.

> If the Unicode Consortium defines which characters are strongly associated
> with one language or a set of languages, we could compose that mapping with
> our language-to-language-group mapping (which we have, right?), so that if
> the languages aren't fully within a single language group, we apply the hint
> to choose between the groups. That could be a fourth option,
> "Unicode-consortium-based". Or we could just use that ourselves and drop the
> "LO Heuristic" option.

Unicode doesn't have an opinion about language, but they do associate characters with scripts. The ODF/OOXML standards already obey Unicode for the most part, with the exception of mishandling the "common" script type.
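To illustrate the point about the "common" script type, here is a rough sketch that composes a Unicode script lookup with a script-to-group mapping and lets the hint decide only for Common characters. The Python standard library does not expose the Script property, so `SCRIPT_OF` is a small hand-written excerpt. `GROUP_OF`, `script_type`, the "latin" fallback, and treating unlisted characters as Common are assumptions for illustration, not what ODF, OOXML, or LibreOffice actually do.

```python
from typing import Optional

# Hand-written excerpt of the Unicode Script property for a few code points.
SCRIPT_OF = {
    "A": "Latin",
    "b": "Latin",
    "\u05d0": "Hebrew",  # HEBREW LETTER ALEF
    "\u4e2d": "Han",     # a CJK unified ideograph
    "1": "Common",
    "2": "Common",
    "!": "Common",
    " ": "Common",
}

# Assumed grouping of Unicode scripts into style:script-type values.
GROUP_OF = {"Latin": "latin", "Hebrew": "complex", "Han": "asian"}


def script_type(ch: str, hint: Optional[str] = None) -> str:
    """Return a script type for ch; the hint only applies to Common characters."""
    # Characters missing from the toy table are treated as Common here.
    script = SCRIPT_OF.get(ch, "Common")
    group = GROUP_OF.get(script)
    if group is not None:
        # Strongly scripted characters keep their own group.
        return group
    # Common characters follow the hint instead of being forced to one group.
    return hint if hint is not None else "latin"
```

In this sketch a digit such as "1" follows the hint, while a Hebrew letter stays "complex" regardless of the hint.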