Bug Hunting Session
Bug 38095 - Character classification for Western or Asian text font differ since conventional version.
Status: RESOLVED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: LibreOffice
Version (earliest affected): 3.4.0 release
Hardware: x86 (IA32) Windows (All)
Importance: medium normal
Assignee: Caolán McNamara
Reported: 2011-06-08 21:41 UTC by sanada
Modified: 2011-10-29 07:38 UTC

Attachments
comparison 3.3.2 with 3.4.0 (167.68 KB, image/jpeg)
2011-06-08 21:41 UTC, sanada
test file (24.60 KB, application/vnd.oasis.opendocument.text)
2011-06-08 21:43 UTC, sanada
test file for calc (27.99 KB, application/vnd.oasis.opendocument.spreadsheet)
2011-06-08 21:43 UTC, sanada
comparison 3.3.2 with 3.4.0 for calc (193.92 KB, image/jpeg)
2011-06-08 21:47 UTC, sanada
Unicode subrange check 3.3.2 (474.64 KB, application/pdf)
2011-06-09 01:31 UTC, sanada
Unicode subrange check 3.4.0 (499.75 KB, application/pdf)
2011-06-09 01:32 UTC, sanada
Hal width font representation in writer in ver. 3.4 and 3.3 (133.85 KB, image/png)
2011-07-18 17:58 UTC, Tatsuro MATSUOKA
post proposed fix (16.78 KB, application/pdf)
2011-07-19 01:38 UTC, Caolán McNamara
post proposed fix (26.86 KB, application/pdf)
2011-07-19 01:38 UTC, Caolán McNamara
inspection result 3.4.2RC-2 (2.13 MB, application/x-zip-compressed)
2011-07-22 00:26 UTC, sanada
ODT for Unicode subrange check. (90.63 KB, application/vnd.oasis.opendocument.text)
2011-07-22 05:10 UTC, sanada

Description sanada 2011-06-08 21:41:32 UTC
Created attachment 47747 [details]
comparison 3.3.2 with 3.4.0

Steps to reproduce:
- Set different Western and Asian text font settings (for simplicity, only the font size).
- Enter characters: Western digits, letters, digits followed by letters, and letters followed by digits.
- View the result in both versions of LibreOffice (3.3.x and 3.4.0).

Please refer to the attached files.
Comment 1 sanada 2011-06-08 21:43:17 UTC
Created attachment 47748 [details]
test file
Comment 2 sanada 2011-06-08 21:43:59 UTC
Created attachment 47749 [details]
test file for calc
Comment 3 sanada 2011-06-08 21:47:53 UTC
Created attachment 47750 [details]
comparison 3.3.2 with 3.4.0 for calc
Comment 4 Don't use this account, use tml@iki.fi 2011-06-08 23:59:03 UTC
If you mean Latin vs. Chinese script, say so. No need to use euphemisms (or worse) like "Western" and "Asian". There are other scripts used in the "West" than Latin. From a historic point of view, what could be more "Western" than Greek, for instance? And what's non-"Asian" in the Indic scripts, for instance?
Comment 5 sanada 2011-06-09 01:03:46 UTC
I'm sorry. I was just quoting the terms (Western/Asian) displayed in LibreOffice's dialog, and I apologize if I caused a misunderstanding.

The phenomenon occurs when Western is set to a halfwidth font (Basic Latin characters) and Asian is set to a fullwidth font (Japanese/CJK characters).
Comment 6 sanada 2011-06-09 01:31:57 UTC
Created attachment 47759 [details]
Unicode subrange check 3.3.2

LibreOffice 3.3.2 Writer exported PDF.
Comment 7 sanada 2011-06-09 01:32:47 UTC
Created attachment 47760 [details]
Unicode subrange check 3.4.0

LibreOffice 3.4.0 exported PDF.
Comment 8 Don't use this account, use tml@iki.fi 2011-06-09 01:49:08 UTC
OK, I didn't remember that LibreOffice uses those terms itself;) I filed a bug about that. Thanks for the further information.
Comment 9 sanada 2011-06-09 02:12:51 UTC
I attached two PDFs exported by LibreOffice 3.3.2 and 3.4.0.

They are a simple check of the character classification for each Unicode subrange.

Each PDF contains two patterns: the Unicode subrange alone, and the subrange with a letter of the alphabet placed at the top of the paragraph.

On the latter, as is often the case with two type font (halfwidth-font and fullwidth-font)  applied to  halfwidth character  is issue since before 3.4.0 (LibreOffice 3.3.0, and before forked OpenOffice.org)..

*PDF is hybrid.
Comment 10 sanada 2011-06-09 02:20:42 UTC
Thanking you in advance!
Comment 11 Tatsuro MATSUOKA 2011-07-17 17:40:52 UTC
It seems that this bug has not been corrected in LibO 3.4.2-RC2.
The bug is really annoying, at least for me.
If it is not corrected, I cannot recommend LibO 3.4.2 for production use by Japanese users.
Comment 12 Tatsuro MATSUOKA 2011-07-17 17:42:20 UTC
LibO 3.4.2-RC2 is incorrect.
LibO 3.4.2-RC1 is right. 

Sorry for my carelessness.
Comment 13 Tatsuro MATSUOKA 2011-07-18 15:35:20 UTC
I forgot to mention that the same phenomenon also happens in Writer.
Comment 14 Tatsuro MATSUOKA 2011-07-18 17:58:14 UTC
Created attachment 49282 [details]
Hal width font representation in writer in ver. 3.4 and 3.3

I have attached a file that shows how the fonts are rendered in both versions 3.4 and 3.3.
Comment 15 Caolán McNamara 2011-07-19 01:36:39 UTC
So, the problematic chars are the characters in the half and full width forms range, unicode FF00–FFEF, which as an entire block used to be classified as ASIAN historically by LibreOffice/OpenOffice.org.

This got changed to try to use UAX 24 (http://unicode.org/reports/tr24/) to determine the best script, and thus the script-type, for each char in that range. The range contains "common" ASCII-like numerals in full-width form, which get classified as "WEAK", and ASCII-like letters in full-width form, which get classified as "LATIN"; hence the change. (icu/source/data/unidata/Scripts.txt is a list of the derived script classification, FWIW.)

I suggest that we are constrained to stick to the old classification system for "commonly-used" text ranges, at least for compatibility reasons, while free to use the new classification system for recently added, future added, and really obscure stuff. E.g. Old South Arabian can become COMPLEX and not just default to WEAK because it didn't exist in 1999, while the whole half- and full-width forms range sticks as ASIAN and the whole number-forms range sticks as WEAK.

http://cgit.freedesktop.org/libreoffice/libs-gui/commit/?id=e76c8d80009c8e29abf0447b7edc157eb42c9e56

done in master, will chase acks for 3-4/3-4-2
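
The approach described in this comment can be sketched roughly as follows. This is only an illustration, not the code from the commit linked above: the names `ScriptType`, `SCRIPT_BASED_RANGES`, `LEGACY_BLOCK_OVERRIDES`, `script_based_type` and `classify` are hypothetical, and the small hard-coded table merely stands in for a real per-code-point lookup of the Script property (which LibreOffice would get from ICU).

```python
from enum import Enum


class ScriptType(Enum):
    LATIN = "LATIN"      # shown as "Western" in the UI
    ASIAN = "ASIAN"
    COMPLEX = "COMPLEX"
    WEAK = "WEAK"


# Hypothetical stand-in for a per-code-point lookup based on the Unicode
# Script property. Only a few illustrative ranges are listed here.
SCRIPT_BASED_RANGES = [
    (0x0041, 0x005A, ScriptType.LATIN),      # A-Z
    (0x0061, 0x007A, ScriptType.LATIN),      # a-z
    (0xFF10, 0xFF19, ScriptType.WEAK),       # fullwidth digits (Script=Common)
    (0xFF21, 0xFF3A, ScriptType.LATIN),      # fullwidth A-Z (Script=Latin)
    (0xFF41, 0xFF5A, ScriptType.LATIN),      # fullwidth a-z (Script=Latin)
    (0x10A60, 0x10A7F, ScriptType.COMPLEX),  # Old South Arabian
]

# Whole blocks that keep their historical classification for compatibility.
LEGACY_BLOCK_OVERRIDES = [
    (0x2150, 0x218F, ScriptType.WEAK),   # Number Forms
    (0xFF00, 0xFFEF, ScriptType.ASIAN),  # Halfwidth and Fullwidth Forms
]


def _lookup(table, cp):
    """Return the type of the first range containing cp, or None."""
    for first, last, script_type in table:
        if first <= cp <= last:
            return script_type
    return None


def script_based_type(cp):
    """New-style classification; unknown code points default to WEAK here."""
    found = _lookup(SCRIPT_BASED_RANGES, cp)
    return found if found is not None else ScriptType.WEAK


def classify(cp):
    """Legacy block overrides win; everything else uses the new lookup."""
    legacy = _lookup(LEGACY_BLOCK_OVERRIDES, cp)
    return legacy if legacy is not None else script_based_type(cp)
```

With this sketch, a fullwidth "A" (U+FF21) is LATIN by the script-based lookup alone but ASIAN once the legacy override is applied, while Old South Arabian still benefits from the newer classification.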
Comment 16 Caolán McNamara 2011-07-19 01:38:19 UTC
Created attachment 49288 [details]
post proposed fix
Comment 17 Caolán McNamara 2011-07-19 01:38:39 UTC
Created attachment 49289 [details]
post proposed fix
Comment 18 Tatsuro MATSUOKA 2011-07-19 15:34:34 UTC
Hello Caolán McNamara

Thank you for your reply and proposal for fix. 
Judging from the attachment files, your modification fixed the issue.

Tatsuro
Comment 19 sanada 2011-07-22 00:23:05 UTC
Thank you for the fix!

However, some characters do not seem to be improved yet...   - case 1

In addition, about the potentiality malfunction when placed a letter of the alphabet just before any letter class, it is not improved either.     - case 2

Because of these, I feel that "RESOLVED FIXED" is premature. What do you think?

I am attaching an inspection result, so please check it.
Comment 20 sanada 2011-07-22 00:26:10 UTC
Created attachment 49414 [details]
inspection result 3.4.2RC-2

Case 1 and case 2 are contained in the ZIP archive.
Comment 21 Caolán McNamara 2011-07-22 02:47:45 UTC
You mention two issues.

For the second one:

"In addition, about the potentiality malfunction when placed a letter of the
alphabet just before any letter class, it is not improved either. - case 2"

is the same issue we're talking about as in...

"subrange with put alphabet in top of paragraph. ... as is often the case with two type font (halfwidth-font and fullwidth-font) applied to  halfwidth character  is issue since before 3.4.0 (LibreOffice 3.3.0, and before forked OpenOffice.org).."

right ? I mean there is no change from before 3.3.0 and 3.4.2RC2, right ?

Here (looking at the pdf because there is no .odt for the Unicode subrange check) we have the case that when a "Western" (by our categorization) character is entered at the start of a paragraph containing "Weak" (by our categorization) characters then the weak characters take on the attributes of the "Western" text.

If an "Asian" character is entered then the "Weak" characters should take on the attributes of the "Asian" text. This is by design[1]. When there are no preceding characters then it eventually defaults to the system locale if I recall correctly.

Which means that in the case where the paragraph consists of *only* weak characters and no preceding "Western" or "Asian" character has been entered, then they only appear in the Asian font-size for you because your locale is Japanese, they would appear in the Western font-size for me because my locale is a western one[2].
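
The behaviour described above can be sketched as a small resolution pass. This is a simplified illustration under stated assumptions, not LibreOffice's actual code: the function name `resolve_weak` and the string labels are hypothetical, and the sketch assumes that weak characters with no preceding strong character fall back to the locale default, as recalled in this comment.

```python
def resolve_weak(types, locale_default):
    """Resolve "WEAK" entries in a paragraph's per-character script types.

    types: list of "LATIN", "ASIAN", "COMPLEX" or "WEAK", one per character.
    locale_default: the type used while no strong character has been seen
    (assumption: this comes from the system locale).
    """
    resolved = []
    current = locale_default
    for script_type in types:
        if script_type != "WEAK":
            current = script_type   # a strong character sets the context
        resolved.append(current)    # weak characters inherit the context
    return resolved


# A "Western" letter at the start biases the following weak characters.
print(resolve_weak(["LATIN", "WEAK", "WEAK"], locale_default="ASIAN"))

# A paragraph of only weak characters depends on the locale default.
print(resolve_weak(["WEAK", "WEAK"], locale_default="ASIAN"))
print(resolve_weak(["WEAK", "WEAK"], locale_default="LATIN"))
```

The last two calls show why the same all-weak paragraph can pick up the Asian font attributes in a Japanese locale and the Western ones elsewhere.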

For the first one:

From the discussion above the remaining issues are what chars are "Latin", which are "Asian" and which are "Weak"

In 3.3.0 and earlier all the "Letter Like Symbols" were "Weak" and all the "Alphabetic Presentation Forms" were "Weak", while in 3.4.2RC2 the categorization is now done on a code-point-by-code-point basis using the "Script" property from http://unicode.org/reports/tr24

That means, that yes, e.g.

LetterLike Symbol: 0x212B
Alphabetic Presentation Forms: 0xFB00

went from "Weak" to "Western". So the letter like symbols and alphabetic presentation forms which are strongly associated with a Latin script now always take the "Western" font attributes.

This is a change, but is it now a bug that they are "Western", or was it a bug in the past for < 3.3.0 that they were "Weak" ?

I mean, 0xFB00 is a single glyph for a "ff" ligature, and two literal "f"s as "ff" would definitely be considered "Western". So it would seem sensible enough to consider the Latin Small Ligature ff as "Western" rather than "Weak".
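
The relationship between these code points and ordinary Latin letters can be inspected with Python's standard `unicodedata` module. Note that this module does not expose the Script property, so the sketch below only shows character names and compatibility normalization; it illustrates the argument and is not how LibreOffice classifies characters. The helper name `describe` is hypothetical.

```python
import unicodedata


def describe(ch):
    """Return (code point label, Unicode name, NFKC-normalized form)."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.normalize("NFKC", ch),
    )


# U+FB00 (ligature), U+212B and U+2103 (letterlike symbols)
for ch in ("\uFB00", "\u212B", "\u2103"):
    print(describe(ch))
```

The ligature U+FB00 normalizes to the two plain letters "ff", which is the sense in which it behaves like ordinary Latin text.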

--

[1] not my design. I would not have created different font attributes except for font name for Asian/Western/Weak, it was a bad idea IMO
[2] yes, this means that the same document opened in different locales with the same fonts installed can render differently when there are Weak chars with no surrounding text to bias them towards "Western" or "Asian". This is a horror which should be resolved in the ODF file format to at least retain in the document a default locale to bias towards. And if that happens, there should also be something like idctHint from OOXML added to ODF as well to override the "Weak" characterization of characters so they can be forced one way or the other.
Comment 22 sanada 2011-07-22 05:09:00 UTC
Thank you for the detailed commentary!

First, my reply regarding "For the second one:"...

I'm sorry that I did not attach the .odt; I am attaching it now, just in case.

I learned of the existence of the "Weak" categorization for the first time.

I accept that this is by design.
Comment 23 sanada 2011-07-22 05:10:21 UTC
Created attachment 49425 [details]
ODT for Unicode subrange check.

odt file attached.
Comment 24 sanada 2011-07-22 23:51:11 UTC
Next, regarding "For the first one:"...

-----
This is a change, but is it now a bug that they are "Western", 
or was it a bug in the past for < 3.3.0 that they were "Weak" ?
-----

Is it really true by recognition that remain to some bugs?

I'm sorry to keep asking, but it would help me if you could explain it.
Comment 25 Caolán McNamara 2011-07-25 04:04:54 UTC
-----
This is a change, but is it now a bug that they are "Western", 
or was it a bug in the past for < 3.3.0 that they were "Weak" ?
-----

"Is it really true by recognition that remain to some bugs?"

I'll read this as "is there still a bug or not". It's not a completely black or white issue unfortunately. Looking at the results, I think there *isn't* a bug. It makes sense to me that e.g. the "ff" ligature is classified the same as "f" and "f" side by side would be, i.e. "Western"

So, unless there is some specialized use in some CJK locales of the now "Western" classified "Letter Like Symbols" and "Alphabetic Presentation Forms" which were previously classified as "Weak" then I think we're done here, and we should leave it alone.

If there is some "real-world" use of these "now-Western-instead-of-Weak" characters where this causes a problem we can revisit it.
Comment 26 sanada 2011-07-27 18:05:01 UTC
Thank you for the detailed explanation.

I now understand the behavior of the ligature.
Of course, it feels more natural.


However, the classification within "Letter like symbols" does not appear to be uniform:
 ℃(U+2103) - Weak
 Å(U+212B) - Western

The difference between these two characters doesn't seem right.

How should I interpret this?
Comment 27 Caolán McNamara 2011-07-28 01:27:23 UTC
These then all come from the Unicode "Script" data about them, i.e.
http://www.unicode.org/Public/6.0.0/ucd/Scripts.txt

(to be exact, according to whatever version of Unicode that the version of ICU we are using is written against, so libreoffice/icu/source/data/unidata/Scripts.txt)

those define...

"2103..2106    ; Common"
"212A..212B    ; Latin"

I don't know what the exact logic is that defines degrees Kelvin as "Latin" and degrees Celsius as "Common" in that categorization system, but that's where the values are coming from, i.e. it's not *our* categorization, but the Unicode Script categorization that we're basing the decision on.
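
For anyone who wants to check such entries themselves, lines in the `code points ; Script` layout quoted above can be read with a few lines of Python. This is a minimal sketch: the function names `parse_scripts` and `script_of` are hypothetical, and `SAMPLE` contains only the two entries quoted in this comment rather than the full Scripts.txt.

```python
SAMPLE = """\
# Only the two entries quoted in this comment; the real file is much longer.
2103..2106    ; Common
212A..212B    ; Latin
"""


def parse_scripts(text):
    """Parse 'XXXX..YYYY ; Script' or 'XXXX ; Script' lines into ranges."""
    ranges = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and blank lines
        if not line:
            continue
        code_points, script = (part.strip() for part in line.split(";", 1))
        first, _, last = code_points.partition("..")
        ranges.append((int(first, 16), int(last or first, 16), script))
    return ranges


def script_of(cp, ranges):
    """Return the Script value for cp, or None if no listed range covers it."""
    for first, last, script in ranges:
        if first <= cp <= last:
            return script
    return None


ranges = parse_scripts(SAMPLE)
print(script_of(0x2103, ranges))  # the U+2103 entry discussed above
print(script_of(0x212B, ranges))  # the U+212B entry discussed above
```

Run against the two quoted lines, the lookup reports "Common" for U+2103 and "Latin" for U+212B, matching the Weak versus Western difference observed in the test documents.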
Comment 28 sanada 2011-07-28 17:55:10 UTC
Thank you once again for the detailed description.

I understand that even within the same character block, the categorization can differ because it depends on the Unicode Script property.

All of my questions have been answered. Thank you again.

Thanking you and best regards.