Writing systems in LibreOffice and ODF are divided into three disjoint script categories. During layout and rendering, LibreOffice must determine which category each character belongs to in order to apply the correct formatting. We currently assign characters to script types using a hard-coded algorithm.

A recurring issue is that our algorithm sometimes guesses wrong. This usually happens with characters such as punctuation, which may be used for different purposes across languages within the same document. We currently don't have any way for users to override the algorithm in these cases.

While it wouldn't solve all problems with script assignment, a good start would be to support the style:script-type attribute. Per 20.358 in the ODF 1.3 specification, "[c]onsumers that can determine script types of Unicode characters may also evaluate the attribute and overwrite the script type they determine for certain character with the value of the attribute".

Although this attribute was introduced in ODF 1.0, it was never implemented. Doing so would create a workaround for a significant number of our script assignment issues, and may also unblock some OOXML interop (e.g. w:hint).
Isn't an unimplemented ODF feature a bug rather than an enhancement?
A couple of notes for the benefit of people who don't know what style:script-type is (like myself until very recently) and may be confused.

Note 1: In the context of this bug and related bugs (but not everywhere), here is some relevant term rewriting:

When people say      they mean the same as if they had said
---------------------------------------------------------------
language             script (i.e. a distinctive writing system, based on a repertoire of specific elements or symbols, or that repertoire itself; brief Wikipedia definition)
written language     script
language group       script type
script group         script type
script category      script type

(so, we're not talking about languages in the usual sense of the word)

Note 2: Several aspects of character styles and direct formatting (DF) are specific to one of the script groups mentioned by Jonathan in comment #0. An example would be the font family. There are three ODF attributes for that: fo:font-family, style:font-family-asian and style:font-family-complex, and we can in fact see all three of their values in the "Format > Character..." dialog (assuming full RTL-CTL and CJK support has been enabled). But then, which of these font families is actually used for a given character? The one for the script group which LO determines the character belongs to.

And this is where style:script-type comes in. It has one possible value for each of the three script groups: latin, complex (i.e. the RTL-CTL group), asian (and a fourth value we won't get into here). If it is set, LO should treat the relevant text as being in that script group, applying the script-group-specific attributes to it; if style:script-type is not set, LO can fall back to the heuristic algorithm it uses now.

Except - like Jonathan says, this is not what we do. We simply ignore style:script-type and always go for the heuristic. This is not formally a bug, since the spec says that we _may_ use it if we want to; it's not a hard requirement. But it's quite problematic, as can be deduced by reading bug 148257.
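To make Note 2 concrete, here is a minimal Python sketch of the lookup it describes: once a character's script group is known, the matching font-family attribute is read. The three attribute names come from the comment above; the function name, the dictionary layout and the font names used in the usage note are assumptions made for illustration, not LibreOffice code.

```python
from typing import Dict, Optional

# ODF attribute that carries the font family for each script group.
FONT_FAMILY_ATTR = {
    "latin": "fo:font-family",
    "asian": "style:font-family-asian",
    "complex": "style:font-family-complex",
}


def font_family_for(script_group: str, text_props: Dict[str, str]) -> Optional[str]:
    """Return the font family that applies to a character of the given group.

    text_props is a hypothetical flat dict of text property attributes;
    None is returned when the group-specific attribute is not set.
    """
    return text_props.get(FONT_FAMILY_ATTR[script_group])
```

For example, with `{"fo:font-family": "Liberation Serif", "style:font-family-complex": "David CLM"}` (placeholder font names), a character assigned to the complex group would be looked up under style:font-family-complex, while one assigned to the asian group would find no font set.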
Bug 148257 is about the user being able to set the script; and this attribute "clinches" it, since we can already set, albeit in a crooked fashion, the choice of script within each of the script groups; if we also set the script group - we've set the language.
(In reply to Jonathan Clark from comment #0)
> Although this attribute was introduced in ODF 1.0, it was never implemented.
> Doing so would create a workaround for a significant number of our script
> assignment issues, and may also unblock some OOXML interop (e.g. w:hint).

This also needs some research into how such characters are currently handled in LibreOffice, and into some documents published by the W3C Internationalization (I18n) Activity (https://www.w3.org/International/) and the Unicode Consortium.
(In reply to Volga from comment #3)

I _think_ I disagree with your comment, but perhaps I'm just misunderstanding it.

> This also need some researches for current implementations for such
> characters in LibreOffice

What do you mean by "implementations of characters"?

> and some documents published by W3C
> Internationalization (I18n) Activity (https://www.w3.org/International/) and
> Unicode Consortium.

Which documents? And - why would we need to consult these documents before adding support for script-type?
I mean to investigate characters that are shared by various scripts, then establish rules to assign the proper typeface, joining behavior, etc. For example, U+0640 ARABIC TATWEEL is encoded in the Arabic block, but the Unicode Standard recommends using it in Adlam, Hanifi Rohingya, Mandaic, Manichaean, N'ko, Old Uighur, Psalter Pahlavi and Syriac as well, so when this character is inserted into text other than Arabic, it should be rendered with the respective font face and not break up contextual alternates.
I mean to find out which characters are shared by various scripts, investigate their usage, then establish rules to assign the proper typeface, joining behavior, line breaking, etc. For example, U+0640 ARABIC TATWEEL is encoded in the Arabic block, but the Unicode Standard recommends using it in Adlam, Hanifi Rohingya, Mandaic, Manichaean, N'ko, Old Uighur, Psalter Pahlavi and Syriac as well, so when Arabic Tatweel is inserted into them, it should be rendered with the respective font face and not break up contextual alternates.
(In reply to Volga from comment #6)
> I mean to found out characters that would be shared by various scripts,
> investigate their usage, then establish rules to assign proper type face,
> joining behavior, line break, etc.

Ah, I see what you mean. Well, isn't this known? i.e. I would assume this data is part of what the Unicode consortium publishes.

> For example, U+0640 ARABIC TATWEEL is
> encoded in Arabic block, but the Unicode Standard recommended to use it in
> Adlam, Hanifi Rohingya, Mandaic, Manichaean, N'ko, Old Uighur, Psalter
> Pahlavi, Syriac as well, so when Arabic Tatweel is injected in them, it
> should be rendered with respected font face and not break up contextual
> alternates.

I'm not sure that's a good example, because all of these scripts are in the "RTL-CTL" group if I'm not mistaken, so for this one, we would not have any hesitation. In the future, however, and outside of the focus of this bug, when we have per-script font setting, we might hesitate regarding which script the TATWEEL belongs to.

I believe the typical examples now would be multi-language punctuation marks like points, commas, colons, question marks and so on; quotes; spaces; symbols; geometric shapes; western-Arabic numerals; etc.
@Jonathan: Can you outline how far you plan to go with the implementation, within the scope of this bug? i.e. will you add UI for it to the paragraph style dialog, for example? Will you add it to output filters? Input filters?
(In reply to Eyal Rozenberg from comment #8)
> @Jonathan: Can you outline how far you plan to go with the implementation,
> within the scope this bug? i.e. will you add UI for it to the paragraph
> style dialog for example? Will you add it to output filters? input filters?

For organization purposes, I planned to stop at ODT I/O and layout. I planned to tackle UI under bug 166012. Filters should be filed separately and only as applicable, but I haven't done it yet.
(In reply to Eyal Rozenberg from comment #7)
> Ah, I see what you mean. Well, isn't this known? i.e. I would assume this
> data is part of what the Unicode consortium publishes.

In fact, just look at the table in the ODF spec, at the section Jonathan linked to:

Latin:   U+0003..U+001F, U+0021..U+009F, U+00A1..U+04FF, U+0530..U+058F, U+10A0..U+10FF, U+13A0..U+16FF, U+1E00..U+1FFF, U+2C60..U+2C7F, U+2C80..U+2CE3, U+A720..U+A7FF
Complex: U+0590..U+074F, U+0780..U+07BF, U+0900..U+109F, U+1200..U+137F, U+1780..U+18AF, U+FB50..U+FDFF, U+FE70..U+FEFF
Asian:   U+1100..U+11FF, U+2E80..U+31BF, U+31C0..U+31EF, U+3200..U+4DBF, U+4E00..U+A4CF, U+AC00..U+D7AF, U+F900..U+FAFF, U+FE30..U+FE4F, U+FF00..U+FFEF, U+20000..U+2A6DF, U+2F800..U+2FA1F
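As an illustration of how the table above can be read, here is a small Python sketch that classifies a character by looking its code point up in those ranges. The ranges are copied from the comment as quoted; the function name and the "weak" label for unmapped code points are assumptions made for this sketch, and it is not the LibreOffice implementation.

```python
# Ranges copied from the table quoted in this comment.
SCRIPT_RANGES = {
    "latin": [
        (0x0003, 0x001F), (0x0021, 0x009F), (0x00A1, 0x04FF), (0x0530, 0x058F),
        (0x10A0, 0x10FF), (0x13A0, 0x16FF), (0x1E00, 0x1FFF), (0x2C60, 0x2C7F),
        (0x2C80, 0x2CE3), (0xA720, 0xA7FF),
    ],
    "complex": [
        (0x0590, 0x074F), (0x0780, 0x07BF), (0x0900, 0x109F), (0x1200, 0x137F),
        (0x1780, 0x18AF), (0xFB50, 0xFDFF), (0xFE70, 0xFEFF),
    ],
    "asian": [
        (0x1100, 0x11FF), (0x2E80, 0x31BF), (0x31C0, 0x31EF), (0x3200, 0x4DBF),
        (0x4E00, 0xA4CF), (0xAC00, 0xD7AF), (0xF900, 0xFAFF), (0xFE30, 0xFE4F),
        (0xFF00, 0xFFEF), (0x20000, 0x2A6DF), (0x2F800, 0x2FA1F),
    ],
}


def classify(ch: str) -> str:
    """Return 'latin', 'complex' or 'asian' for a mapped character.

    Characters outside every listed range get the label 'weak', which is
    this sketch's name for "no explicit mapping in the table".
    """
    cp = ord(ch)
    for script, ranges in SCRIPT_RANGES.items():
        if any(first <= cp <= last for first, last in ranges):
            return script
    return "weak"
```

Read this way, U+0640 ARABIC TATWEEL falls in the Complex range U+0590..U+074F, ASCII digits and the semicolon fall in the Latin range U+0021..U+009F, and the plain space U+0020 is in none of the ranges.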
Jonathan Clark committed a patch related to this issue. It has been pushed to "master": https://git.libreoffice.org/core/commit/0b979c50b6fa72bae4344130e48f5503ac14a9c4 tdf#166011 Implemented style:script-type It will be available in 26.2.0. The patch should be included in the daily builds available at https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More information about daily builds can be found at: https://wiki.documentfoundation.org/Testing_Daily_Builds Affected users are encouraged to test the fix and report feedback.
(In reply to Commit Notification from comment #11) > Jonathan Clark committed a patch related to this issue. How can we see this in action? I realize for now that would mean some manual modification of files, but can you post some instructions involving that, which we could apply to a document and see the effects?
(In reply to Eyal Rozenberg from comment #12)
> (In reply to Commit Notification from comment #11)
> > Jonathan Clark committed a patch related to this issue.
>
> How can we see this in action? I realize for now that would mean some manual
> modification of files, but can you post some instructions involving that,
> which we could apply to a document and see the effects?

I'd rather avoid giving instructions that could lead people to corrupt their documents in the future.

If you're already comfortable hand-editing ODF files, the ODF standard has enough information to experiment with this feature. If not, it's best to wait for bug 166012.
(In reply to Jonathan Clark from comment #13)
> I'd rather avoid giving instructions that could lead people to corrupt their
> documents in the future.

The only people are me and Stuart... :-P

Anyway, how about posting an FODT using style:script-type, and a before-and-after screenshot?

> If you're already comfortable hand-editing ODF files,

Not that comfortable actually, otherwise I wouldn't have asked. Also, I wasn't sure where in documents this will now be recognized/supported (e.g. which style categories).
Jonathan Clark committed a patch related to this issue. It has been pushed to "libreoffice-25-8": https://git.libreoffice.org/core/commit/b37b2ed3efb4e3fc1417ce1008fe306211868bd0 tdf#166011 Implemented style:script-type It will be available in 25.8.0.0.beta2. The patch should be included in the daily builds available at https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More information about daily builds can be found at: https://wiki.documentfoundation.org/Testing_Daily_Builds Affected users are encouraged to test the fix and report feedback.
Created attachment 201214 [details] Sample document using style:script-type
Created attachment 201215 [details] Screenshot of sample document with style:script-type support
(In reply to Eyal Rozenberg from comment #14)
> (In reply to Jonathan Clark from comment #13)
> The only people are me and Stuart... :-P

I used the following macro to make the test document.

Sub setHintToLatin()
    ' 2 = Latin hint (see the list of possible values below)
    cursor = ThisComponent.getCurrentController.getViewCursor()
    cursor.CharScriptHint = 2
End Sub

This should be safe to use. Possible values are:

0 = Automatic (the default behavior)
1 = Ignore (special; treats all text as Latin)
2 = Latin hint
3 = Asian hint
4 = Complex hint
(In reply to Jonathan Clark from comment #17)
> Created attachment 201215 [details]
> Screenshot of sample document with style:script-type support

I don't see the differences between the two paragraphs with the quotes, the text looks the same.
Created attachment 201250 [details]
Document with some paragraphs having style:script-type set

I believe there's something wrong with the implementation.

In the attached document, there are 4 sections, each with several paragraphs:

  Directionality       style:script-type
  RTL/LTR         x    set / unset

In each section, there are several paragraphs with English and/or Hebrew text, before and/or after some neutral characters: spaces, semicolons, numerals. The RTL-CTL (Hebrew) font has been set to 18pt Bold, to better emphasize the differences.

The effect I observe style:script-type to have is on space characters, which seem to widen after it is set. But other than that - none of the neutrals is rendered in the font of the other RTL-CTL language group: Neither the numerals nor the semicolon.

Note that the FODT has been manually edited to reduce the number of auto-styles; but the phenomenon was the same before that as well.
(In reply to Eyal Rozenberg from comment #19)
> (In reply to Jonathan Clark from comment #17)
> > Created attachment 201215 [details]
> > Screenshot of sample document with style:script-type support
>
> I don't see the differences between the two paragraphs with the quotes, the
> text looks the same.

In the original text, the end quote character is higher and has a space after it.

(In reply to Eyal Rozenberg from comment #20)
> The effect I observe style:script-type to have is on space chracters, which
> seem to widen after it is set. But other than that - none of the neutrals is
> rendered in the font of the other RTL-CTL language group: Neither the
> numerals nor the semicolon.

It's working as specified. This only affects characters that aren't explicitly mapped to a script type by the ODF standard. (In the LO code, we call this the "weak" script type; it has nothing to do with directionality.)

Space is the only character in your test document that doesn't have an explicit mapping.
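The behavior described above can be sketched in a few lines of Python: a hint is consulted only when a character has no explicit mapping. The three ranges below are an abbreviated subset of the ODF table quoted earlier in this thread, the function names are hypothetical, and the `fallback` parameter is a stand-in for the existing heuristic; this is an illustration, not the actual LibreOffice code.

```python
from typing import Optional

# Abbreviated subset of the ODF table (one range per group), for illustration.
EXPLICIT_RANGES = [
    (0x0021, 0x009F, "latin"),
    (0x0590, 0x074F, "complex"),
    (0x4E00, 0xA4CF, "asian"),
]


def explicit_script(ch: str) -> Optional[str]:
    """Return the explicitly mapped script type, or None for a weak character."""
    cp = ord(ch)
    for first, last, script in EXPLICIT_RANGES:
        if first <= cp <= last:
            return script
    return None


def resolve_script(ch: str, hint: Optional[str], fallback: str = "latin") -> str:
    """Resolve the script type of one character.

    An explicit mapping always wins; the hint only decides weak characters.
    `fallback` stands in for whatever the heuristic would have chosen.
    """
    explicit = explicit_script(ch)
    if explicit is not None:
        return explicit
    return hint if hint is not None else fallback
```

Under these assumptions, `resolve_script(";", "complex")` stays `"latin"` because the semicolon is explicitly mapped, while `resolve_script(" ", "complex")` becomes `"complex"`, which matches the observation that only the spaces changed in the test document.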
(In reply to Jonathan Clark from comment #21)
> Space is the only character in your test document that doesn't have an
> explicit mapping.

Oh no! You're right! The mapping is wrong! What do we do? :-(

Do we need to file a bug report against ODF? Is it possible for us to countermand the spec in this case, for characters which are erroneously mapped? Punctuation, numerals, others?
(In reply to Eyal Rozenberg from comment #22)
> (In reply to Jonathan Clark from comment #21)
> > Space is the only character in your test document that doesn't have an
> > explicit mapping.
>
> Oh no! You're right! The mapping is wrong! What do we do? :-(
>
> Do we need to file a bug report against ODF? Is it possible for us to
> countermand the spec in this case, for characters which are erroneously
> mapped? Punctuation, numerals, others?

All of these problems (the language trichotomy, the table in the ODF standard) originally come from Microsoft Word. If we want to be compatible with Word, we need to be somewhat careful how we move here.

The current version of style:script-type works basically the same as OOXML w:hint, so we could use it to implement support for that tag. That becomes less convenient if our implementation diverges too much from the ODF standard as currently written.

We do have options, though. We could propose some sort of style:script-type-we-really-mean-it-this-time that has a modified table, or maybe something less silly based around breaking the language trichotomy.
(In reply to Jonathan Clark from comment #23)
> All of these problems (the language trichotomy, the table in the ODF
> standard) originally come from Microsoft Word.

I actually doubt that, because Word _definitely_ maps numerals and punctuation differently depending on whether you're writing English or Hebrew.

> If we want to be compatible
> with Word, we need to be somewhat careful how we move here.

This compatibility is part of my motivation actually.

> We do have options, though. We could propose some sort of
> style:script-type-we-really-mean-it-this-time that has a modified table, or
> maybe something less silly based around breaking the language trichotomy.

I realize that breaking the trichotomy is the way to go eventually, I was just hopeful that we could have a temporary fix with this change :-(
(In reply to Eyal Rozenberg from comment #24)
> (In reply to Jonathan Clark from comment #23)
> > All of these problems (the language trichotomy, the table in the ODF
> > standard) originally come from Microsoft Word.
>
> I actually doubt that, because Word _definitely_ maps numerals and
> punctuation different depending on whether you're writing English or Hebrew.

They do - but that doesn't mean I didn't overlook something in how Word implements w:hint.

Is there a bug for this? I'm aware of bug 163082, but that only refers to directionality.

The original intention here is to be compatible with Word, so I would feel better about adding more characters to be overridden if that's the reason. It's easy to add them. I would only suggest that we open a separate bug for it.