Bug 151148 - Finding KATAKANA which has voice consonant mark returns incorrect results.
Summary: Finding KATAKANA which has voice consonant mark returns incorrect results.
Status: VERIFIED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: LibreOffice
Version: 7.4.1.2 release (earliest affected)
Hardware: x86-64 (AMD64) All
Importance: medium normal
Assignee: Not Assigned
URL:
Whiteboard: target:7.5.0 target:7.4.2 target:7.3.7
Keywords: bibisected, regression
Duplicates: 151141 151396 151477
Depends on:
Blocks:
 
Reported: 2022-09-23 11:55 UTC by Kiyotaka Nishibori
Modified: 2022-10-16 08:09 UTC
CC List: 5 users

See Also:
Crash report or crash signature:
Regression By:


Attachments
example file (26.05 KB, application/vnd.oasis.opendocument.text)
2022-09-23 12:06 UTC, Kiyotaka Nishibori

Description Kiyotaka Nishibori 2022-09-23 11:55:30 UTC
Description:
The "voice consonant mark" means the two small dashes (dakuten) or the small circle (handakuten) placed on some KATAKANA characters: e.g. カ (KA) with the dashes becomes ガ (GA), and ハ (HA) with the circle becomes パ (PA).
In half-width KATAKANA, the mark is a separate character (U+FF9E for the dashes, U+FF9F for the circle). For example, "ｶﾞ" is a sequence of two characters (U+FF76, U+FF9E). In full-width KATAKANA, a character with a voice consonant mark counts as one character, e.g. "ガ" (U+30AC).
Japanese readers usually perceive KATAKANA with such a mark as one character, even when it is a combination of two half-width characters.
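
The relationship between the two encodings described above can be checked with a short Python sketch. It uses only the standard `unicodedata` module; the constant names and the `code_points` helper are made up for this illustration, and NFKC normalization is used here only as a convenient way to show the equivalence, not as a statement about how LibreOffice handles character width.

```python
import unicodedata

# Half-width form: base character plus a separate voiced-sound mark.
HALFWIDTH_GA = "\uff76\uff9e"   # U+FF76 (KA) + U+FF9E (dashes)
HALFWIDTH_PA = "\uff8a\uff9f"   # U+FF8A (HA) + U+FF9F (circle)

# Full-width form: a single precomposed character.
FULLWIDTH_GA = "\u30ac"
FULLWIDTH_PA = "\u30d1"


def code_points(s):
    """List the code points of a string in U+XXXX notation."""
    return " ".join(f"U+{ord(c):04X}" for c in s)


for half in (HALFWIDTH_GA, HALFWIDTH_PA):
    full = unicodedata.normalize("NFKC", half)
    print(f"{code_points(half)} ({len(half)} chars) -> "
          f"{code_points(full)} ({len(full)} char)")
```

The half-width strings have length 2 and their normalized forms have length 1, which is why a width-insensitive search has to keep track of how positions shift between the two forms.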

If the search string includes such KATAKANA with a voice consonant mark, the search result is incorrect. This problem occurs in at least Calc, Writer, Draw and Impress.
The issue reproduces starting with commit d6336e0b21eeece0e678a8768938c04fa120043f, and did not occur before that commit.

Steps to Reproduce:
1. Open the attachment with Writer.
2. Open the Find and Replace dialog and uncheck "Match Character Width".
3. Enter a KATAKANA string that contains a voice consonant mark:
  Examination 1: enter "ガギグゲゴ" (U+30AC + U+30AE + U+30B0 + U+30B2 + U+30B4) or "ｶﾞｷﾞｸﾞｹﾞｺﾞ" (U+FF76 + U+FF9E + U+FF77 + U+FF9E + U+FF78 + U+FF9E + U+FF79 + U+FF9E + U+FF7A + U+FF9E)
  Examination 2: enter "ギグゲ" (U+30AE + U+30B0 + U+30B2) or "ｷﾞｸﾞｹﾞ" (U+FF77 + U+FF9E + U+FF78 + U+FF9E + U+FF79 + U+FF9E)
4. Click Find Next.


Actual Results:
Examination 1: "ｶﾞｷﾞｸﾞｹﾞｺﾞ" (U+FF76 + U+FF9E + U+FF77 + U+FF9E + U+FF78 + U+FF9E + U+FF79 + U+FF9E + U+FF7A + U+FF9E) or "ガギグゲゴ01234" (U+30AC + U+30AE + U+30B0 + U+30B2 + U+30B4 + U+0030 + U+0031 + U+0032 + U+0033 + U+0034)
Examination 2: "ｷﾞｸﾞｹﾞ" (U+FF77 + U+FF9E + U+FF78 + U+FF9E + U+FF79 + U+FF9E), "グゲ0123" (U+30B0 + U+30B2 + U+0030 + U+0031 + U+0032 + U+0033) or "グゲゴ012" (U+30B0 + U+30B2 + U+30B4 + U+0030 + U+0031 + U+0032)


Expected Results:
Examination 1: "ガギグゲゴ" (U+30AC + U+30AE + U+30B0 + U+30B2 + U+30B4) or "ｶﾞｷﾞｸﾞｹﾞｺﾞ" (U+FF76 + U+FF9E + U+FF77 + U+FF9E + U+FF78 + U+FF9E + U+FF79 + U+FF9E + U+FF7A + U+FF9E)
Examination 2: "ギグゲ" (U+30AE + U+30B0 + U+30B2) or "ｷﾞｸﾞｹﾞ" (U+FF77 + U+FF9E + U+FF78 + U+FF9E + U+FF79 + U+FF9E)
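
To illustrate what the expected results require, here is a minimal sketch of a width-insensitive search that reports the match range in the original text. It is not LibreOffice's implementation: the function names `fold_width` and `find_width_insensitive` are hypothetical, and NFKC normalization is assumed as a stand-in for the width folding applied when "Match Character Width" is unchecked. The point is the offset table: because a half-width pair folds to one character, match positions found in the folded text must be mapped back to the original text before selecting.

```python
import unicodedata

HALFWIDTH_MARKS = "\uff9e\uff9f"  # half-width voiced / semi-voiced sound marks


def fold_width(text):
    """Fold text to a width-neutral form.

    Returns (folded, starts), where starts[i] is the index in `text` at
    which folded[i] begins. A final sentinel equal to len(text) is added.
    """
    folded, starts = [], []
    i = 0
    while i < len(text):
        end = i + 1
        # Take a following half-width mark together with its base
        # character when the pair composes into a single character.
        if end < len(text) and text[end] in HALFWIDTH_MARKS:
            if len(unicodedata.normalize("NFKC", text[i:end + 1])) == 1:
                end += 1
        for ch in unicodedata.normalize("NFKC", text[i:end]):
            folded.append(ch)
            starts.append(i)
        i = end
    starts.append(len(text))
    return "".join(folded), starts


def find_width_insensitive(haystack, needle):
    """Return (start, end) of the first match in `haystack`, or None."""
    folded_hay, starts = fold_width(haystack)
    folded_needle, _ = fold_width(needle)
    pos = folded_hay.find(folded_needle)
    if pos < 0:
        return None
    # Map folded positions back to positions in the original text.
    return starts[pos], starts[pos + len(folded_needle)]


# Half-width search term against full-width text followed by digits.
text = "\u30ac\u30ae\u30b0\u30b2\u30b401234"
term = "\uff76\uff9e\uff77\uff9e\uff78\uff9e\uff79\uff9e\uff7a\uff9e"
start, end = find_width_insensitive(text, term)
print(text[start:end])
```

With this mapping the selection covers only the five KATAKANA characters and none of the trailing digits, which matches Examination 1 in the expected results.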



Reproducible: Always


User Profile Reset: No



Additional Info:
Version: 7.4.1.2 / LibreOffice Community
Build ID: 40(Build:2)
CPU threads: 8; OS: Linux 5.19; UI render: default; VCL: gtk3
Locale: ja-JP (ja_JP.UTF-8); UI: en-US
7.4.1-2
Calc: threaded
Comment 1 Kiyotaka Nishibori 2022-09-23 12:06:05 UTC
Created attachment 182643 [details]
example file

This file does not embed a Japanese font. Please install a Japanese font before reviewing the file.
Comment 2 Kiyotaka Nishibori 2022-09-23 15:55:53 UTC
"The issue reproduces starting with commit d6336e0b21eeece0e678a8768938c04fa120043f, and did not occur before that commit."

I'm sorry, but that commit hash is wrong: it comes from the bibisect repository. The bibisect result was:
 d6336e0b21eeece0e678a8768938c04fa120043f is the first bad commit
commit d6336e0b21eeece0e678a8768938c04fa120043f
Author: Jenkins Build User <tdf@pollux.tdf>
Date:   Thu Sep 16 12:16:43 2021 +0200

    source sha:c7551e8a46e2f9f8142aa7921a0494221ae096e8
    
    source sha:c7551e8a46e2f9f8142aa7921a0494221ae096e8

 instdir/program/libi18npoollo.so | Bin 1617192 -> 1613016 bytes
 instdir/program/libi18nutil.so   | Bin 123104 -> 123104 bytes
 instdir/program/setuprc          |   2 +-
 instdir/program/versionrc        |   2 +-
 4 files changed, 2 insertions(+), 2 deletions(-)

As you can see, the issue reproduces starting with source commit c7551e8a46e2f9f8142aa7921a0494221ae096e8, and did not occur before that commit.
Comment 3 Ming Hua 2022-09-24 05:05:40 UTC
Reproduced with 7.3.6 and 7.4.1 on Windows:
Version: 7.3.6.2 (x64) / LibreOffice Community
Build ID: c28ca90fd6e1a19e189fc16c05f8f8924961e12e
CPU threads: 12; OS: Windows 10.0 Build 22000; UI render: Skia/Vulkan; VCL: win
Locale: zh-CN (zh_CN); UI: en-US
Calc: CL
and
Version: 7.4.1.2 (x64) / LibreOffice Community
Build ID: 3c58a8f3a960df8bc8fd77b461821e42c061c5f0
CPU threads: 12; OS: Windows 10.0 Build 22000; UI render: Skia/Raster; VCL: win
Locale: en-US (zh_CN); UI: zh-CN
Calc: CL

But it does not reproduce on 7.0.6:
Version: 7.0.6.2 (x64)
Build ID: 144abb84a525d8e30c9dbbefa69cbbf2d8d4ae3b
CPU threads: 12; OS: Windows 10.0 Build 22000; UI render: default; VCL: win
Locale: zh-CN (zh_CN); UI: en-US
Calc: CL
Comment 4 Ming Hua 2022-09-24 05:13:48 UTC
It seems the reporter was using machine translation.  I hope my description below is more concise and clear.

The issue is rather straightforward: when searching for a Japanese string, the result should not include the digits (0123...) after the search term. In 7.0 the behavior is correct; in 7.3 and 7.4 the highlighted search result includes the digits when it matches full-width characters. For example, searching for "ガギグゲゴ" (U+30AC...) selects "ガギグゲゴ01234".

Although I didn't do the bibisection myself, the result in comment #2 is consistent with the regression range I found in testing.  So setting the keyword and adding Noel to CC.

Noel: Would you please have a look?
Comment 5 Commit Notification 2022-09-24 11:40:02 UTC
Noel Grandin committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/222e56157c6317435088e09e52a0705bc6a1a83a

tdf#151148 Finding KATAKANA which has voice consonant mark wrong

It will be available in 7.5.0.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 6 Julien Nabet 2022-09-24 14:09:40 UTC
*** Bug 151141 has been marked as a duplicate of this bug. ***
Comment 7 Kiyotaka Nishibori 2022-09-25 11:03:14 UTC
To Noel:
Thank you for providing a patch. It works well for me.
I'm sorry, but I submitted your patch to the libreoffice-7-4 branch as a backport without asking your permission first.
That branch is currently being prepared for its next minor release. This regression is serious for Japanese users, and most of them want an immediate fix.

If you don't mind, please see https://gerrit.libreoffice.org/c/core/+/140566
Comment 8 Commit Notification 2022-09-26 10:21:20 UTC
Noel Grandin committed a patch related to this issue.
It has been pushed to "libreoffice-7-4":

https://git.libreoffice.org/core/commit/a5b6ddf3f0055cebe2713af34c304a647af6c76a

tdf#151148 Finding KATAKANA which has voice consonant mark wrong

It will be available in 7.4.3.

Comment 9 Commit Notification 2022-09-26 11:09:08 UTC
Noel Grandin committed a patch related to this issue.
It has been pushed to "libreoffice-7-3":

https://git.libreoffice.org/core/commit/a288453c50f49852c2a83cc4716ec44d6230d37c

tdf#151148 Finding KATAKANA which has voice consonant mark wrong

It will be available in 7.3.7.

Comment 10 Commit Notification 2022-09-26 13:09:45 UTC
Noel Grandin committed a patch related to this issue.
It has been pushed to "libreoffice-7-4-2":

https://git.libreoffice.org/core/commit/b46221a4817ca41776446d2a8d81272ce1022c29

tdf#151148 Finding KATAKANA which has voice consonant mark wrong

It will be available in 7.4.2.

Comment 11 Julien Nabet 2022-10-07 09:41:38 UTC
*** Bug 151396 has been marked as a duplicate of this bug. ***
Comment 12 Julien Nabet 2022-10-12 12:07:32 UTC
*** Bug 151477 has been marked as a duplicate of this bug. ***
Comment 13 Ming Hua 2022-10-16 07:55:53 UTC
I can confirm the issue described in comment #0 here is fixed in:
Version: 7.4.2.3 (x64) / LibreOffice Community
Build ID: 382eef1f22670f7f4118c8c2dd222ec7ad009daf
CPU threads: 12; OS: Windows 10.0 Build 22000; UI render: Skia/Raster; VCL: win
Locale: en-US (zh_CN); UI: zh-CN
Calc: CL

Reporters of other bugs that are marked as DUPLICATE:
Please test your problem with 7.4.2 (already released) or 7.3.7 (RC1 should be out this coming week).  If the new version doesn't resolve your issue, speak up here or set your bug's status back to UNCONFIRMED.
Comment 14 Ming Hua 2022-10-16 07:57:56 UTC
And of course, thanks Noel for the quick fix and Julien for testing!
Comment 15 Kiyotaka Nishibori 2022-10-16 08:09:57 UTC
I confirmed that the issue is fixed in the latest libreoffice-fresh package of Arch Linux:
Version: 7.4.2.3 / LibreOffice Community
Build ID: 40(Build:3)
CPU threads: 8; OS: Linux 6.0; UI render: default; VCL: gtk3
Locale: ja-JP (ja_JP.UTF-8); UI: ja-JP
7.4.2-1
Calc: threaded