Searching for a term in Help always finds nothing,
i.e. F1 > Find > Search Term: enter any word > Find,
on xp-sp3/Vista Home 32, LO 3.4.2/3.4.3, jre 18.104.22.168/27, and zh-TW help-pack.
With the en-US or de help-pack, LO can find something. This means that the find function does not work at all with the Traditional Chinese UI and the corresponding help-pack. Perhaps this problem also exists in other non-Indo-European UIs and help-packs. I have tested on XP and Vista 32 only.
To confirm this bug, you may copy a title from the help contents and paste it into the search term field, or enter one of the following words, for example. They are the Chinese words for "style" and "format" respectively.
The Chinese words may not display correctly on your screen if your system does not have the corresponding Chinese fonts, but the underlying character codes should still copy and paste correctly.
(I do not know subcomponents and hardware well, so I may have chosen the wrong items there.)
Any idea who might be able to help here?
I tried the find function with the zh-TW and ja UI. Some characters or character sequences can be found, but not all, and it is clearly a bug. Indexing and search in help are implemented with Apache Lucene, which is a third-party library. I would first like to know whether it is a regression. Did the find function in the zh-TW help work in LibreOffice 3.3 or in OpenOffice.org?
It can find topics about 縮排 or 快鍵, but not about 樣式, 格式, 段落, 搜尋, 尋找, 公式, etc. LO 3.3.2 with zh-TW help pack, UI and jre 22.214.171.124 on Windows 2000.
Many Chinese words (nouns, verbs, adjectives, adverbs, etc.) that consist of two or more characters should be treated as one unit when searching. If the search term is a one-character word, the find function is OK. If the search term is a two-character word, it often finds nothing and only sometimes finds topics. For example, it can find topics about 縮排, but not about 段落, 搜尋, 尋找, 公式, etc.
The search results for the Chinese word "縮排" or "快鍵" show that the engine splits this two-character word into two separate characters, and if either character shows up in a topic, it returns that topic. This treatment is incorrect because the two-character word should be treated as one unit. As a result it returns many redundant topics. The splitting does not seem to happen with other two-character words, but in most of those cases the search engine finds nothing where it should find something.
The test results show: if the engine splits a two-character word into two separate characters, it finds many redundant topics. If the engine does not or cannot split a two-character word, it finds nothing. Most of the time it finds nothing.
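The two symptoms above can be reproduced with a toy model. This is only a sketch in plain Python, not Lucene code: it assumes an index that stores single characters only, and the topic ids and topic texts are made up for illustration.

```python
from collections import defaultdict

def build_unigram_index(topics):
    """Map each single character to the set of topic ids that contain it."""
    index = defaultdict(set)
    for topic_id, text in topics.items():
        for ch in text:
            if not ch.isspace():
                index[ch].add(topic_id)
    return index

def search_split(index, term):
    """Split the term into characters; a topic matches if ANY character occurs."""
    hits = set()
    for ch in term:
        hits |= index.get(ch, set())
    return hits

def search_unsplit(index, term):
    """Look the term up as one token; a unigram index has no multi-character keys."""
    return set(index.get(term, set()))

# Hypothetical help topics (ids and texts are invented for this sketch).
TOPICS = {
    "indent": "段落縮排",
    "shrink": "縮小字型",
    "sort": "排序清單",
}
INDEX = build_unigram_index(TOPICS)

split_hits = search_split(INDEX, "縮排")      # redundant topics are returned
unsplit_hits = search_unsplit(INDEX, "縮排")  # nothing is found
```

In this model the split search returns every topic that contains either character, while the unsplit search returns nothing, which matches the two behaviours reported above.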
After googling "Apache Lucene Chinese" I found some Chinese articles on the Internet about this problem and its solutions. For example, http://www.reality.hk/articles/2005/03/16/382/, http://tw.myblog.yahoo.com/ys-blog/article?mid=966&sc=1, etc. (I wish I were a programmer. :))
In Lucene or not in Lucene, that is still the question.
[This is an automated message.]
This bug was filed before the changes to Bugzilla on 2011-10-16. Thus it
started right out as NEW without ever being explicitly confirmed. The bug is
changed to state NEEDINFO for this reason. To move this bug from NEEDINFO back
to NEW please check if the bug still persists with the 3.5.0 beta1 or beta2 prereleases.
Details on how to test the 3.5.0 beta1 can be found at:
more detail on this bulk operation: http://nabble.documentfoundation.org/RFC-Operation-Spamzilla-tp3607474p3607474.html
LO: 3.5 Beta2 (in Chinese or English UI) with zh-TW help installed
OS: Windows XP-sp3
I had only limited access to beta2, so I could not test every kind of case.
For non-Chinese developers, let "BC" and "DEF" stand for meaningful Chinese phrases of two and three characters respectively, which may be entered as search terms.
A. One meaningful phrase, or two or more meaningful phrases separated by space character(s): "BC", "DEF", "BC DEF"
Real samples: "手動", "編號", "段落", "按一下"
Result: often the same message: No topics found (this message is translated into English)
If "conditional text" is searched for in the English help, both "conditional" and "text" are marked in the returned topic text. In contrast, only "縮", but not "排", is marked in the returned topic text. The same occurs when searching for "快鍵". It seems that only the first character of the search term is marked and only the first character was recognised as the search term. The result of searching for "縮排", "表單", and "錨點" is the same as that for "縮" only.
Some words in the index list in the Help UI can be found. Most cannot. (Were the words in the index list generated by the Lucene engine?)
Searching one-character words causes no problem.
B. The same phrases as those in A, but with each character separated from the others by space characters: "B C", "D E F", "B C D E F". Also the same characters in a changed order: "C B", "F E D", "E F D", etc.
Real samples: "手 號", "手 號 段", "手 號 段 按", "按 一 下", "下 一 按", "動 手", "號 編"
Result: Any topics containing those characters are shown. With many redundant ones, of course.
(But this is not unimportant. For this is the only workaround for the moment.)
C. Adding one character into the phrases in A, we have "A BC", "A DEF", "BC A", etc. to test.
Real samples: "手 編號"
Result: No topics found.
So, the search engine can handle two or more uncombined Chinese characters well. It often has difficulties in all the other cases: it often cannot handle two or more combined Chinese characters. (In contrast, the engine can handle a set of combined English characters, for example "style", correctly.)
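Since space-separated characters are the only workaround for the moment (case B), a search term can be rewritten mechanically. A minimal sketch; the helper name `space_out` is made up for illustration and is not part of LibreOffice:

```python
def space_out(term):
    """Put a single space between all non-space characters of a search term."""
    return " ".join(ch for ch in term if not ch.isspace())

workaround_query = space_out("按一下")
```

Here `workaround_query` is "按 一 下", the form that returned topics in case B. As noted there, the result includes many redundant topics, because any topic containing one of the characters matches.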
List of single keywords which may be entered for searching:
List of multiple keywords which are separated by a space character and may be entered for searching:
(What I will say in the following may be wrong.) I surmise that Lucene has tried to solve the problem of more satisfactory segmentation. I am not a programmer. Please examine the following:
Three kinds of analyzer are presented on the above webpage. A (Simplified) Chinese sample sentence is segmented well with SmartChineseAnalyzer. In contrast, the ChineseAnalyzer "[i]ndex unigrams (individual Chinese characters) as a token". I am not sure whether only the latter analyzer is used in LibreOffice, so that it does not generate an index containing more satisfactorily tokenized units like "BC", "DEF", etc., but only "A", "B", "C", "D", etc.
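To make the difference between the tokenizations concrete, here is a small sketch in plain Python. It does not call Lucene. The unigram function mirrors the quoted description of ChineseAnalyzer; the bigram function follows the overlapping two-character scheme that, as far as I understand, Lucene's CJKAnalyzer uses. Both function names and the sample text are assumptions made for illustration.

```python
def unigrams(text):
    """One token per character, ignoring whitespace."""
    return [ch for ch in text if not ch.isspace()]

def bigrams(text):
    """Overlapping two-character tokens for each whitespace-separated run.

    A run of a single character is kept as a one-character token.
    """
    tokens = []
    for run in text.split():
        if len(run) == 1:
            tokens.append(run)
        else:
            tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
    return tokens

doc_unigrams = unigrams("段落縮排")   # ['段', '落', '縮', '排']
doc_bigrams = bigrams("段落縮排")     # ['段落', '落縮', '縮排']
query_bigrams = bigrams("縮排")       # ['縮排']

# With bigram tokens the two-character query matches as one unit.
found = all(token in doc_bigrams for token in query_bigrams)
```

With unigram tokens the index never contains "縮排" as a unit, whereas with bigram tokens the query token "縮排" is present in the document's token list.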
There is indeed an index in the Help UI that contains tokens like "BC", "DEF", etc. I don't know whether this is the one generated by the Lucene engine.
Would the same problem occur in the Korean or Japanese help UI?
Blogs in which cjktokenizer was discussed. But I am not sure if they contributed to the current version of CJKTokenizer.
*possibly* opengrok for "CJKAnalyzer" and see if running zh-* (and possibly ko) in addition to the existing "ja" through org.apache.lucene.analysis.cjk.CJKAnalyzer makes a difference.
Andras Timar committed a patch related to this issue to "master":
fdo#40665 use CJKAnalyzer for ko, zh-CN, and zh-TW, too
Andras Timar committed a patch related to this issue to "libreoffice-3-5":
fdo#40665 use CJKAnalyzer for ko, zh-CN, and zh-TW, too
It will be available in LibreOffice 3.5.1.
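The idea behind the commit message can be sketched as a per-language choice of analyzer. This is not the actual patch: the function name `analyzer_for` is hypothetical, the analyzers are represented by their names only, and the use of StandardAnalyzer for all other languages is an assumption.

```python
# Languages routed through CJKAnalyzer: "ja" was already handled this way,
# and ko, zh-CN and zh-TW are added per the commit message.
CJK_LANGS = frozenset({"ja", "ko", "zh-CN", "zh-TW"})

def analyzer_for(lang):
    """Return the name of the analyzer to use for a help-pack language.

    The fallback name is an assumption made for this sketch.
    """
    return "CJKAnalyzer" if lang in CJK_LANGS else "StandardAnalyzer"

chosen = {lang: analyzer_for(lang) for lang in ("en-US", "de", "ja", "zh-TW")}
```

Because the analyzer determines which tokens end up in the index, the help index for the affected languages presumably has to be rebuilt with the new analyzer before search results change.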
don't "needinfo" anymore, problem fixed in master and for 3.5.1
Wow, this fix will greatly help CK users find what they need in the help documents. Users in the CK area will thank you for your effort. I look forward to the test build of version 3.5.1.
I added Fix submitter as assignee because this will ease queries and bug tracking.