Writing systems in LibreOffice and ODF are divided into three disjoint script categories. During layout and rendering, LibreOffice must determine which category each character belongs to, in order to apply the correct formatting. We currently assign characters to script types using a hard-coded algorithm. A recurring issue is that our algorithm sometimes guesses wrong. This usually happens with characters like punctuation, which may be used for different purposes across languages within the same document. We currently don't have any way for users to override our algorithm in these cases. While it wouldn't solve all problems with script assignment, a good start would be to support the style:script-type attribute: per 20.358 in the ODF 1.3 specification, "[c]onsumers that can determine script types of Unicode characters may also evaluate the attribute and overwrite the script type they determine for certain character with the value of the attribute". Although this attribute was introduced in ODF 1.0, it was never implemented. Doing so would create a workaround for a significant number of our script assignment issues, and may also unblock some OOXML interop (e.g. w:hint).
Isn't an unimplemented ODF feature a bug rather than an enhancement?
A couple of notes for the benefit of people who don't know what style:script-type is (like myself until very recently) and may be confused.

Note 1: In the context of this bug, and related bugs (but not everywhere), here is some relevant term rewriting:

When people say     they mean the same as if they had said
---------------------------------------------------------------
language            script (i.e. a distinctive writing system, based on a repertoire of specific elements or symbols, or that repertoire itself; brief Wikipedia definition)
written language    script
language group      script type
script group        script type
script category     script type

(so, we're not talking about languages in the usual sense of the word)

Note 2: Several aspects of character styles and direct formatting (DF) are specific to one of the script groups mentioned by Jonathan in comment #0. An example would be the font family: there are three ODF attributes for that, fo:font-family, style:font-family-asian and style:font-family-complex, and we can in fact see all three of their values in the "Format > Character..." dialog (assuming full RTL-CTL and CJK support has been enabled). But then, which of these font families is actually to be used for a given character? The one for the script group which LO determines the character belongs to.

And this is where style:script-type comes in. It has one possible value for each of the three script groups: latin, complex (i.e. RTL-CTL), asian (and a fourth value we won't get into here). If it is set, LO should treat the relevant text as being in that script group, applying the script-group-specific attributes to it; if style:script-type is not set, LO can fall back to the heuristic algorithm it uses now.

Except, like Jonathan says, this is not what we do. We simply ignore style:script-type and always go for the heuristic. This is not formally a bug, since the spec says that we _may_ use it if we want to; it's not a hard requirement. But it's quite problematic, as can be deduced by reading bug 148257.
Bug 148257 is about the user being able to set the script; and this attribute "clinches" it, since we can already set, albeit in a crooked fashion, the choice of script within each of the script groups; if we also set the script group - we've set the language.
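To make Note 2 concrete, here is a minimal sketch of the selection logic it describes. This is not LibreOffice code: the function names and the dictionary are hypothetical, and treating any absent or unrecognized style:script-type value as "fall back to the heuristic" is a simplifying assumption.

```python
from typing import Dict, Optional

# Script group -> ODF attribute holding the font family for that group.
# The attribute names come from the comment above; the mapping structure
# itself is only an illustration.
FONT_ATTR_BY_GROUP: Dict[str, str] = {
    "latin": "fo:font-family",
    "asian": "style:font-family-asian",
    "complex": "style:font-family-complex",
}


def effective_group(heuristic_group: str, script_type: Optional[str]) -> str:
    """Return the script group to format with.

    If style:script-type names one of the three groups, it wins;
    otherwise (assumption) we keep whatever the heuristic decided.
    """
    if script_type in FONT_ATTR_BY_GROUP:
        return script_type
    return heuristic_group


def font_family_for(attrs: Dict[str, str], heuristic_group: str,
                    script_type: Optional[str] = None) -> str:
    """Pick the font family attribute value for the effective script group."""
    group = effective_group(heuristic_group, script_type)
    return attrs[FONT_ATTR_BY_GROUP[group]]
```

For example, a full stop that the heuristic assigns to "latin" would pick up the fo:font-family value, whereas the same character with style:script-type set to "complex" would pick up the style:font-family-complex value instead.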
(In reply to Jonathan Clark from comment #0)
> Although this attribute was introduced in ODF 1.0, it was never implemented.
> Doing so would create a workaround for a significant number of our script
> assignment issues, and may also unblock some OOXML interop (e.g. w:hint).

This also needs some research into current implementations for such characters in LibreOffice, and into some documents published by the W3C Internationalization (I18n) Activity (https://www.w3.org/International/) and the Unicode Consortium.
(In reply to Volga from comment #3)

I _think_ I disagree with your comment, but perhaps I'm just misunderstanding it.

> This also needs some research into current implementations for such
> characters in LibreOffice

What do you mean by "implementations for characters"?

> and into some documents published by the W3C
> Internationalization (I18n) Activity (https://www.w3.org/International/) and
> the Unicode Consortium.

Which documents? And why would we need to consult these documents before adding support for script-type?
I mean to investigate characters that are shared by various scripts, then establish rules to assign the proper typeface, joining behavior, etc. For example, U+0640 ARABIC TATWEEL is encoded in the Arabic block, but the Unicode Standard recommends using it in Adlam, Hanifi Rohingya, Mandaic, Manichaean, N'ko, Old Uighur, Psalter Pahlavi and Syriac as well, so when this character is inserted in text other than Arabic, it should be rendered with the respective font face and should not break up contextual alternates.
I mean to find out which characters are shared by various scripts, investigate their usage, then establish rules to assign the proper typeface, joining behavior, line breaking, etc. For example, U+0640 ARABIC TATWEEL is encoded in the Arabic block, but the Unicode Standard recommends using it in Adlam, Hanifi Rohingya, Mandaic, Manichaean, N'ko, Old Uighur, Psalter Pahlavi and Syriac as well, so when Arabic Tatweel is inserted in them, it should be rendered with the respective font face and should not break up contextual alternates.
(In reply to Volga from comment #6)
> I mean to find out which characters are shared by various scripts,
> investigate their usage, then establish rules to assign the proper typeface,
> joining behavior, line breaking, etc.

Ah, I see what you mean. Well, isn't this known? i.e. I would assume this data is part of what the Unicode Consortium publishes.

> For example, U+0640 ARABIC TATWEEL is
> encoded in the Arabic block, but the Unicode Standard recommends using it in
> Adlam, Hanifi Rohingya, Mandaic, Manichaean, N'ko, Old Uighur, Psalter
> Pahlavi and Syriac as well, so when Arabic Tatweel is inserted in them, it
> should be rendered with the respective font face and should not break up
> contextual alternates.

I'm not sure that's a good example, because all of these scripts are in the "RTL-CTL" group if I'm not mistaken, so for this one, we would not have any hesitation. In the future, however, and outside of the focus of this bug, when we have per-script font setting, we might hesitate regarding which script the TATWEEL belongs to.

I believe the typical examples now would be multi-language punctuation marks like points, commas, colons, question marks and so on; quotes; spaces; symbols; geometric shapes; western-Arabic numerals; etc.
@Jonathan: Can you outline how far you plan to go with the implementation, within the scope of this bug? i.e. will you add UI for it to the paragraph style dialog, for example? Will you add it to output filters? Input filters?
(In reply to Eyal Rozenberg from comment #8)
> @Jonathan: Can you outline how far you plan to go with the implementation,
> within the scope of this bug? i.e. will you add UI for it to the paragraph
> style dialog, for example? Will you add it to output filters? Input filters?

For organization purposes, I planned to stop at ODT I/O and layout. I planned to tackle UI under bug 166012. Filters should be filed separately and only as applicable, but I haven't done it yet.
(In reply to Eyal Rozenberg from comment #7)
> Ah, I see what you mean. Well, isn't this known? i.e. I would assume this
> data is part of what the Unicode Consortium publishes.

In fact, just look at the table in the ODF spec, at the section Jonathan linked to:

Latin:   U+0003..U+001F, U+0021..U+009F, U+00A1..U+04FF, U+0530..U+058F, U+10A0..U+10FF, U+13A0..U+16FF, U+1E00..U+1FFF, U+2C60..U+2C7F, U+2C80..U+2CE3, U+A720..U+A7FF
Complex: U+0590..U+074F, U+0780..U+07BF, U+0900..U+109F, U+1200..U+137F, U+1780..U+18AF, U+FB50..U+FDFF, U+FE70..U+FEFF
Asian:   U+1100..U+11FF, U+2E80..U+31BF, U+31C0..U+31EF, U+3200..U+4DBF, U+4E00..U+A4CF, U+AC00..U+D7AF, U+F900..U+FAFF, U+FE30..U+FE4F, U+FF00..U+FFEF, U+20000..U+2A6DF, U+2F800..U+2FA1F
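As a rough illustration of how the ranges quoted above could be used, here is a small lookup sketch with an optional style:script-type override. The ranges are copied from this comment; the function names are hypothetical, this is not the algorithm LibreOffice currently uses, and returning None for code points outside every range is an assumption of the sketch rather than something the spec table says.

```python
from typing import Dict, List, Optional, Tuple

# Inclusive code point ranges, copied from the table quoted in this comment.
RANGES: Dict[str, List[Tuple[int, int]]] = {
    "latin": [
        (0x0003, 0x001F), (0x0021, 0x009F), (0x00A1, 0x04FF),
        (0x0530, 0x058F), (0x10A0, 0x10FF), (0x13A0, 0x16FF),
        (0x1E00, 0x1FFF), (0x2C60, 0x2C7F), (0x2C80, 0x2CE3),
        (0xA720, 0xA7FF),
    ],
    "complex": [
        (0x0590, 0x074F), (0x0780, 0x07BF), (0x0900, 0x109F),
        (0x1200, 0x137F), (0x1780, 0x18AF), (0xFB50, 0xFDFF),
        (0xFE70, 0xFEFF),
    ],
    "asian": [
        (0x1100, 0x11FF), (0x2E80, 0x31BF), (0x31C0, 0x31EF),
        (0x3200, 0x4DBF), (0x4E00, 0xA4CF), (0xAC00, 0xD7AF),
        (0xF900, 0xFAFF), (0xFE30, 0xFE4F), (0xFF00, 0xFFEF),
        (0x20000, 0x2A6DF), (0x2F800, 0x2FA1F),
    ],
}


def table_group(ch: str) -> Optional[str]:
    """Look up a single character in the table; None if no range matches."""
    cp = ord(ch)
    for group, ranges in RANGES.items():
        if any(lo <= cp <= hi for lo, hi in ranges):
            return group
    return None


def resolve_group(ch: str, script_type: Optional[str] = None) -> Optional[str]:
    """Let an explicit style:script-type value override the table lookup."""
    if script_type in RANGES:
        return script_type
    return table_group(ch)
```

Note that with this table alone, "." (U+002E) lands in the Latin group even when it ends a Hebrew sentence, and characters such as the space (U+0020) or the horizontal ellipsis (U+2026) fall in no range at all. Those are the cases where an explicit override would help.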