40665 – [zh-TW] Help search finds nothing

Summary: [zh-TW] Help search finds nothing

Status:	RESOLVED FIXED

Alias:	None

Product:	LibreOffice
Classification:	Unclassified
Component:	Localization
Version: (earliest affected)	3.5.0 Beta2
Hardware:	x86 (IA32) Windows (All)

Importance:	medium major
Assignee:	Andras Timar

URL:
Whiteboard:	target:3.6.0 target:3.5.1
Keywords:

Depends on:
Blocks:

Reported:	2011-09-06 17:57 UTC by wck317-wck317
Modified:	2012-04-05 07:59 UTC
CC List:	2 users

See Also:
Crash report or crash signature:

Description wck317-wck317 2011-09-06 17:57:14 UTC

Searching in Help never finds anything, i.e. F1 > Find > Search term: enter any word > Find returns no results.
Tested on Windows XP SP3 and Vista Home 32-bit with LO 3.4.2/3.4.3, JRE 1.6.0.26/27, and the zh-TW help pack.

With the en-US or de help pack, LO does find results. So the find function does not work at all with the Traditional Chinese UI and the corresponding help pack. Perhaps the problem also exists with other non-Indo-European UIs and help packs. I have tested on XP and Vista 32-bit only.

To confirm the bug, you can copy a title from the help contents and paste it into the search term field, or enter one of the following words, which are the Chinese words for "style" and "format" respectively.
style: 樣式
format: 格式

The Chinese words may not display correctly if your system lacks the corresponding Chinese fonts, but the underlying character codes should still copy and paste correctly.

(I do not know the subcomponents and hardware fields well, so I may have chosen the wrong items there.)

Comment 1 Rainer Bielefeld Retired 2011-09-06 22:53:02 UTC

@András:
Any idea who might be able to help here?

Comment 2 Andras Timar 2011-09-07 11:33:54 UTC

I tried the find function with the zh-TW and ja UIs. Some characters or character sequences can be found, but not all, so it is clearly a bug. Indexing and search in Help are implemented with Apache Lucene, a third-party library. First I would like to know whether this is a regression: did the find function in zh-TW help work in LibreOffice 3.3 or in OpenOffice.org?

Comment 3 wck317-wck317 2011-09-08 19:16:57 UTC

Tested with LO 3.3.2 with the zh-TW help pack and UI and JRE 1.6.0.27 on Windows 2000: it can find topics about 縮排 or 快鍵, but not about 樣式, 格式, 段落, 搜尋, 尋找, 公式, etc.

Many Chinese words (nouns, verbs, adjectives, adverbs, etc.) consist of two or more characters and should be treated as one unit when searching. If the search term is a one-character word, the find function works. If the search term is a two-character word, it usually finds nothing and only sometimes finds topics. For example, it can find topics about 縮排, but not about 段落, 搜尋, 尋找, 公式, etc.

The search results for "縮排" or "快鍵" show that the engine splits the two-character word into two separate characters and returns every topic in which one of the characters appears. This is incorrect, because the two-character word should be treated as one unit, and as a result many redundant topics are returned. For other two-character words the splitting does not seem to happen, but then the search engine, which should find something, finds nothing in most cases.

In summary: if the engine splits a two-character word into separate characters, it finds many redundant topics; if it does not or cannot split the word, it finds nothing. Most of the time it finds nothing.

After googling "Apache Lucene Chinese" I found some Chinese articles on the Internet about this problem and its solutions, for example http://www.reality.hk/articles/2005/03/16/382/ and http://tw.myblog.yahoo.com/ys-blog/article?mid=966&sc=1 (I wish I were a programmer. :))
In Lucene or not in Lucene, that is still the question.

Comment 4 Björn Michaelsen 2011-12-23 12:39:11 UTC

[This is an automated message.]
This bug was filed before the changes to Bugzilla on 2011-10-16. Thus it
started right out as NEW without ever being explicitly confirmed. The bug is
changed to state NEEDINFO for this reason. To move this bug from NEEDINFO back
to NEW please check if the bug still persists with the 3.5.0 beta1 or beta2 prereleases.
Details on how to test the 3.5.0 beta1 can be found at:
http://wiki.documentfoundation.org/QA/BugHunting_Session_3.5.0.-1

more detail on this bulk operation: http://nabble.documentfoundation.org/RFC-Operation-Spamzilla-tp3607474p3607474.html

Comment 5 wck317-wck317 2011-12-26 19:35:32 UTC

LO: 3.5 Beta2 (in Chinese or English UI) with zh-TW help installed
OS: Windows XP-sp3
I had only limited access, so I could not test all possible cases with beta2.

To help non-Chinese developers picture this, let "BC" and "DEF" stand for meaningful Chinese phrases of two and three characters respectively, as they might be entered for searching.

A. One meaningful phrase, or two or more meaningful phrases separated by space characters: "BC", "DEF", "BC DEF"
Real samples: "手動", "編號", "段落", "按一下"
Result: usually the same message: No topics found (message translated into English)

If "conditional text" is searched in the English help, both "conditional" and "text" are highlighted in the returned topic text. In contrast, when searching for "縮排", only "縮" is highlighted in the returned topic text, not "排". The same happens when searching for "快鍵". It seems that only the first character of the search term is recognised as the search term and highlighted. The result of searching "縮排", "表單", and "錨點" is the same as that of searching "縮" only.

Some words in the index list of the Help UI can be found; most cannot. (Were the words in the index list generated by the Lucene engine?)

Searching for one-character words causes no problem.

B. The same phrases as in A, but with each character separated from the others by space characters: "B C", "D E F", "B C D E F". Also the same characters in a different order: "C B", "F E D", "E F D", etc.
Real samples: "手 號", "手 號 段", "手 號 段 按", "按 一 下", "下 一 按", "動 手", "號 編"
Result: Any topics containing those characters are shown, including many redundant ones, of course.
(But this is not unimportant, since it is the only workaround for the moment.)

C. Adding one character to the phrases in A gives "A BC", "A DEF", "BC A", etc. to test.
Real samples: "手 編號"
Result: No topics found.

So the search engine handles two or more uncombined Chinese characters well, but often has difficulty in all other cases: it often cannot handle two or more combined Chinese characters. (In contrast, the engine handles a sequence of combined English characters, for example "style", correctly.)
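
The pattern in cases A, B and C can be reproduced with a toy model. The following Python sketch is only an illustration under stated assumptions, not LibreOffice's or Lucene's actual code: it assumes the index holds single characters only, that each space-separated query term is looked up verbatim, and that all terms must match. The topic texts and the names `TOPICS`, `build_unigram_index` and `search` are made up for the example.

```python
from collections import defaultdict

# Hypothetical mini help corpus; topic ids and texts are invented for illustration.
TOPICS = {
    "topic1": "手動編號",
    "topic2": "段落樣式",
    "topic3": "一段文字落下",
}

def build_unigram_index(topics):
    """Map every single character to the set of topics containing it."""
    index = defaultdict(set)
    for topic_id, text in topics.items():
        for ch in text:
            if not ch.isspace():
                index[ch].add(topic_id)
    return index

def search(index, query):
    """Look up each space-separated term verbatim; every term must match (AND)."""
    terms = query.split()
    if not terms:
        return set()
    result = None
    for term in terms:
        hits = index.get(term, set())
        result = set(hits) if result is None else result & hits
    return result

index = build_unigram_index(TOPICS)
print(sorted(search(index, "段落")))     # case A: [] because "段落" is not an index key
print(sorted(search(index, "段 落")))    # case B: ['topic2', 'topic3'], topic3 is redundant
print(sorted(search(index, "段 編號")))  # case C: [] because "編號" is not an index key
```

Under these assumptions a combined phrase never matches, while separated characters match every topic that contains them anywhere, which is why the workaround in B returns redundant topics.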

List of single keywords which may be entered for searching:
印
縮排
快鍵
錨點
摘要
大綱
手動
編號
段落
名片
標籤
顯示
還原
按一下
功能表
印表機
記憶體
定位點
控制項
資料庫
項目符號
編號類型
向左對齊
自動儲存
檔案特性
保護記錄
直接格式
同義詞詞典
合併列印精靈
頁數的有條件的文字

List of multiple keywords, separated by a space character, which may be entered for searching:
頁數 有條件文字
手動 編號

(What I say in the following may be wrong.) I surmise that Lucene has tried to solve the problem of more satisfactory segmentation. I am not a programmer, so please examine the following:
CJKTokenizer: http://lucene.apache.org/java/3_0_2/api/contrib-analyzers/org/apache/lucene/analysis/cjk/package-summary.html#package_description

Three kinds of analyzer are presented on the above web page. A (Simplified) Chinese sample sentence is segmented well by SmartChineseAnalyzer. In contrast, the ChineseAnalyzer "[i]ndex unigrams (individual Chinese characters) as a token". I am not sure whether only the latter analyzer is used in LibreOffice, so that it does not generate an index containing more satisfactorily tokenized units like "BC", "DEF", etc., but only "A", "B", "C", "D", etc.
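
To make the difference between the tokenization styles concrete, here is a minimal Python sketch of unigram versus overlapping-bigram tokenization. It is a simplified illustration of the idea, not Lucene's implementation: real analyzers also handle Latin text, punctuation and stop words, which this ignores. The function names `unigrams` and `bigrams` are made up for the example.

```python
def unigrams(text):
    """One token per character (whitespace is skipped)."""
    return [ch for ch in text if not ch.isspace()]

def bigrams(text):
    """Overlapping pairs of adjacent characters within each space-separated run."""
    tokens = []
    for run in text.split():
        if len(run) == 1:
            # A lone character has no neighbour to pair with.
            tokens.append(run)
        else:
            tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
    return tokens

print(unigrams("項目符號"))  # ['項', '目', '符', '號']
print(bigrams("項目符號"))   # ['項目', '目符', '符號']
```

If the same bigram tokenization is applied to both the indexed text and the search term, a query such as "符號" produces the token "符號", which is among the tokens generated for "項目符號". The price is some meaningless tokens such as "目符", but no dictionary is needed.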

There is indeed an index in the Help UI that contains tokens like "BC", "DEF", etc. I don't know whether it is the one generated by the Lucene engine.

Would the same problem occur in the Korean or Japanese help UI?

27.12.2011

Blogs in which CJKTokenizer was discussed (I am not sure whether they contributed to the current version of CJKTokenizer):
http://tw.myblog.yahoo.com/ys-blog/article?mid=966&sc=1
http://blog.csdn.net/liangjian103103103/article/details/6547611

Comment 6 Caolán McNamara 2012-02-14 08:24:56 UTC

*possibly* opengrok for "CJKAnalyzer" and see if running zh-* (and possibly ko) in addition to the existing "ja" through org.apache.lucene.analysis.cjk.CJKAnalyzer makes a difference.

Comment 7 Not Assigned 2012-02-17 12:51:24 UTC

Andras Timar committed a patch related to this issue to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=7636d37f8f9c53d694c4fe38581f3b495d53670e

fdo#40665 use CJKAnalyzer for ko, zh-CN, and zh-TW, too

Comment 8 Not Assigned 2012-02-17 13:16:43 UTC

Andras Timar committed a patch related to this issue to "libreoffice-3-5":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=c3f76c548c6131543e4ccfafe4e72e8d12804234&g=libreoffice-3-5

fdo#40665 use CJKAnalyzer for ko, zh-CN, and zh-TW, too


It will be available in LibreOffice 3.5.1.
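
As a rough picture of what the commit message describes, the sketch below selects an analyzer by help language. It is written in Python purely for illustration and is not the committed code: the function name `analyzer_for` and the `"default"` label for every other language are assumptions, and only the list of languages comes from the commit message and comment 6.

```python
# Languages routed to the CJK analyzer: "ja" already was (comment 6),
# and the commit message adds ko, zh-CN and zh-TW.
CJK_LANGUAGES = {"ja", "ko", "zh-CN", "zh-TW"}

def analyzer_for(lang):
    """Return an illustrative analyzer label for a help-pack language code."""
    # "default" is a placeholder for whatever analyzer other languages use.
    return "CJKAnalyzer" if lang in CJK_LANGUAGES else "default"

for lang in ("en-US", "ja", "zh-TW"):
    print(lang, analyzer_for(lang))
```

Whichever analyzer is chosen, indexing and searching need to use the same one for a given language, otherwise the query tokens cannot match the indexed tokens.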

Comment 9 Caolán McNamara 2012-02-23 02:57:10 UTC

don't "needinfo" anymore, problem fixed in master and for 3.5.1

Comment 10 wck317-wck317 2012-02-23 05:40:13 UTC

Wow, this fix will greatly help CK users find what they need in the help documents. Users in the CK area will thank you for your effort. I look forward to the test build of version 3.5.1.

Comment 11 Rainer Bielefeld Retired 2012-04-05 07:59:15 UTC

I added Fix submitter as assignee because this will ease queries and bug tracking.