Bug 91764 - RTL: Arabic, Hebrew diacritics can't be found using search dialog
Status: VERIFIED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer
Version: Inherited From OOo (earliest affected)
Hardware: Other All
Importance: high major
Assignee: ⁨خالد حسني⁩
URL:
Whiteboard: target:7.5.0 target:7.4.1 inReleaseNo...
Keywords:
Depends on:
Blocks: Find-Search RTL-Arabic-and-Farsi RTL-Hebrew
Reported: 2015-05-31 03:13 UTC by zahra
Modified: 2022-12-06 15:17 UTC (History)

See Also:
Crash report or crash signature:


Attachments
Plain-text sentence with diacritics to try out (UTF-8 charset) (70 bytes, text/plain)
2015-05-31 03:13 UTC, zahra
Document showing problem with arabic characters with diacritics (13.33 KB, application/vnd.openxmlformats-officedocument.wordprocessingml.document)
2017-06-14 15:39 UTC, Pedro

Description zahra 2015-05-31 03:13:16 UTC
Created attachment 116186 [details]
Plain-text sentence with diacritics to try out (UTF-8 charset)

It is not possible to navigate a Persian or Arabic document character by character when it contains diacritics.
I mean that OpenOffice and LibreOffice do not treat Arabic and Persian diacritics as characters in their own right that can be read and edited with the left and right arrow keys.
For example, when we move through a text with the arrow keys and want to edit it, this is not easy if the text has diacritics.
For example, suppose we want to remove ّ (tashdid) from a document completely. We press Ctrl+H, type ّ in the first field, and leave the replacement field empty so that every ّ is removed, but the search behaves as if the document contained no ّ at all.
The same problem applies to all of our diacritics: َ، ُ، ِ، ّ، ًْ، ٌ، ٍ، ْ.
I would appreciate it if you could solve these problems in LibreOffice 4.5, or at least in 5.
Comment 1 Joel Madero 2015-06-01 02:41:09 UTC Comment hidden (obsolete)
Comment 2 Adolfo Jayme Barrientos 2015-06-07 03:37:54 UTC Comment hidden (obsolete)
Comment 3 Robinson Tryon (qubit) 2015-09-03 10:29:06 UTC Comment hidden (obsolete)
Comment 4 Xisco Faulí 2016-09-11 13:52:55 UTC Comment hidden (noise)
Comment 5 zahra 2016-09-14 06:52:05 UTC
(In reply to Joel Madero from comment #1)
> Hey Zahra -
> 
> Can you explain the issue a bit clearer (preferably in numbered steps).
> 
> Example:
> 1. Open attached document;
> 2. Place cursor....
> 3. Do step Z
> 
> . . . etc...
> 
> Very hard to follow a paragraph of text - explain how to reproduce the issue
> with the document that you have attached.
> 
> Also just FYI - almost guaranteed this will not be resolved by 5.0 or even
> 5.1 - we are a team of volunteers, if a volunteer finds that bug to be
> interesting, they will fix it, else, it will not be fixed. If you or someone
> you know is a developer, we'd happily help you understand the code so that
> you can get this issue resolved :)
> 
> Setting to NEEDINFO - once you describe the issue clearly please set to
> UNCONFIRMED. Thanks!

Hi.
I really appreciate your efforts.
This bug is a duplicate of bug 100854:
https://bugs.documentfoundation.org/show_bug.cgi?id=100854
It is the same problem.
It is a very critical problem for me, because I use diacritics most of the time and cannot use LibreOffice easily.
LibreOffice and OpenOffice do not recognize diacritics independently.
They count a diacritic as part of the letter, so when I navigate letter by letter, NVDA reads only the letter and not the diacritic.

Steps to reproduce:
1. Copy the sentence below into LibreOffice Writer and save it as DOCX or a similar format.
2. Press Ctrl+H to open Find & Replace.
3. In the first field, type ّ with the keyboard layout set to Persian.
4. Leave the "Replace with" field empty.
5. Click Replace All.
There are three ّ in this sentence.

Actual result: OpenOffice and LibreOffice say the search key was not found.
Expected behaviour: OpenOffice and LibreOffice should recognize diacritics and treat them like normal letters.
They should report that three ّ were found and replace them as described.

Here is the text for testing.


ائمّه عليهم السّلام چراغهاي هدايت و كشتي نجات براي سعادت همه انسان‌ها و تقرّب آن‌ها به خداوند هستند.
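
For reference, the expected count can be checked outside LibreOffice. The following is a minimal sketch in Python, not LibreOffice code: it spells out the three words of the test sentence that carry a shadda (tashdid, U+0651) using escapes, then counts and removes the mark with a plain code-point search. The variable names are illustrative only.

```python
SHADDA = "\u0651"  # ARABIC SHADDA (tashdid), a combining mark

# The three words of the test sentence that carry a shadda, written
# with escapes so the code points are unambiguous.
sample = (
    "\u0627\u0626\u0645\u0651\u0647 "              # alef, yeh-hamza, meem, SHADDA, heh
    "\u0627\u0644\u0633\u0651\u0644\u0627\u0645 "  # alef, lam, seen, SHADDA, lam, alef, meem
    "\u062A\u0642\u0631\u0651\u0628"               # teh, qaf, reh, SHADDA, beh
)

# A plain code-point search treats the mark like any other character.
found = sample.count(SHADDA)

# "Replace all with nothing", as in steps 3 to 5 above.
cleaned = sample.replace(SHADDA, "")

print(found)
print(SHADDA in cleaned)
```

This only shows that the marks are ordinary code points in the stored text, which is why a report of three matches is the expected behaviour.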
Comment 6 Xisco Faulí 2016-09-14 15:44:49 UTC Comment hidden (noise)
Comment 7 Buovjaga 2016-10-07 17:41:16 UTC Comment hidden (obsolete)
Comment 8 Urmas 2016-10-08 06:16:58 UTC Comment hidden (obsolete)
Comment 9 zahra 2017-06-13 03:24:48 UTC Comment hidden (obsolete)
Comment 10 zahra 2017-06-13 03:33:15 UTC Comment hidden (obsolete)
Comment 11 vihsa 2017-06-13 09:16:34 UTC Comment hidden (obsolete)
Comment 12 Pedro 2017-06-14 15:39:17 UTC Comment hidden (obsolete)
Comment 13 Pedro 2017-06-26 21:43:50 UTC Comment hidden (obsolete)
Comment 14 Yousuf Philips (jay) (retired) 2017-10-13 19:44:28 UTC
So let's limit this bug to the find & replace issue with Arabic diacritics; bug 54494 will deal with the issue of keyboard navigation between letters.

Steps:
1. Open attachment 134028.
2. Open the Find toolbar or the Find & Replace dialog.
3. Type ُ (damma); it will appear as a dotted circle with a damma.
4. Find returns no results, though the document has 3 dammas.
5. Type كُ (kaaf + damma); it will find 2 results.

Version: 6.0.0.0.alpha0+
Build ID: 3672cdd35985201ea87463cf032fedd02c052f4d
CPU threads: 2; OS: Linux 4.4; UI render: default; VCL: gtk2; 
Locale: en-US (en_US.UTF-8); Calc: group
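
For reference, a small Python sketch (an illustration only, unrelated to the LibreOffice code base) shows how Unicode classifies the two characters from the steps above. The damma is a nonspacing combining mark, which is presumably why a renderer draws it on a dotted-circle placeholder when it is typed on its own, and kaaf + damma is stored as two separate code points.

```python
import unicodedata

KAF = "\u0643"    # ARABIC LETTER KAF
DAMMA = "\u064F"  # ARABIC DAMMA

# Print code point, Unicode name and general category for each character.
for ch in KAF + DAMMA:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))

# In the stored text the damma is its own code point after the kaaf,
# so a code-point search for the damma alone has something to match.
text = KAF + DAMMA + "\u062A\u0628"  # kaaf, damma, teh, beh
position = text.find(DAMMA)
print(position)
```

The category "Mn" (nonspacing mark) is what distinguishes the damma from the base letter, whose category is "Lo".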
Comment 15 Lior Kaplan 2017-10-13 21:12:14 UTC
Notice that the advanced options of the search & replace dialog have "ignore diacritics" and "ignore kashida" on by default (bug #52204 and bug #77123). 

Please verify if either of these options helps with this issue.
Comment 16 Yousuf Philips (jay) (retired) 2017-10-14 18:28:55 UTC
(In reply to Lior Kaplan from comment #15)
> Notice that the advanced options of the search & replace dialog have "ignore
> diacritics" and "ignore kashida" on by default (bug #52204 and bug #77123). 

Disabling these didn't solve the issue.
Comment 17 QA Administrators 2019-03-07 03:44:04 UTC Comment hidden (noise)
Comment 18 Safeer Pasha 2020-01-25 14:32:23 UTC Comment hidden (obsolete)
Comment 19 zahra 2020-06-21 11:09:45 UTC
(In reply to Lior Kaplan from comment #15)
> Notice that the advanced options of the search & replace dialog have "ignore
> diacritics" and "ignore kashida" on by default (bug #52204 and bug #77123). 
> 
> Please verify if either of these options helps with this issue.

Hello.
I tried unchecking "ignore diacritics" in the other options of the Find & Replace dialog.
For example, I tried to remove ّ from an HTML file.
The result was "not found"!
However, when I reported this bug, I did not know that ignoring diacritics is intentional.
Why do LibreOffice and even Microsoft Office ignore diacritics by default?
As for Microsoft Office, I only tested with Office 2007.
Comment 20 Eyal Rozenberg 2020-06-22 08:33:32 UTC
This bug occurs with Hebrew diacritics as well - seemingly, in exactly the same fashion, so I won't attach another example. To reproduce:

1. Create a new document
2. Type in אבא הלך לעבודה
3. Enter a Dagesh character into the first ב. (On Linux: Place the cursor after the ב; Enable Num Lock; Press Right-Alt + ד)
4. Search for ב without a Dagesh
5. Search for just Dagesh, i.e. for ּ
6. Search for ב with a Dagesh, i.e. for בּ

You'll get no matches for just Dagesh, and two matches for either ב.

This is quite problematic, since ב and בּ are different consonants (V and B in English).

Now, it's true that most people don't use diacritics when writing in Hebrew, but some do, e.g. users

* with impaired hearing or sight
* who are editing text to be read out loud automatically
* who are working on poetry or religious texts
* who are working on reading material for children

I assume it's the same for Arabic.
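
For reference, the three searches in the steps above can be told apart with a plain regular expression. This is a minimal Python sketch of the distinction, not of LibreOffice's search engine; the text is the sentence from step 2 with a dagesh added to the first bet, written with escapes. It assumes the dagesh directly follows the bet, which may not hold in text that also carries vowel points.

```python
import re

BET = "\u05D1"     # HEBREW LETTER BET
DAGESH = "\u05BC"  # HEBREW POINT DAGESH OR MAPIQ

# The sentence from step 2, with a dagesh after the first bet.
text = (
    "\u05D0\u05D1\u05BC\u05D0 "            # alef, bet, DAGESH, alef
    "\u05D4\u05DC\u05DA "                  # he, lamed, final kaf
    "\u05DC\u05E2\u05D1\u05D5\u05D3\u05D4" # lamed, ayin, bet, vav, dalet, he
)

# Any bet, with or without a dagesh.
any_bet = len(re.findall(BET, text))

# Only a bet immediately followed by a dagesh.
bet_with_dagesh = len(re.findall(BET + DAGESH, text))

# Only a bet NOT immediately followed by a dagesh (negative lookahead).
bet_without_dagesh = len(re.findall(BET + "(?!" + DAGESH + ")", text))

# The dagesh on its own.
dagesh_alone = text.count(DAGESH)

print(any_bet, bet_with_dagesh, bet_without_dagesh, dagesh_alone)
```

A diacritic-aware search would be expected to give the second and third counts for steps 6 and 4 respectively, instead of two matches for both.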
Comment 21 Safeer Pasha 2020-11-13 07:47:47 UTC Comment hidden (obsolete)
Comment 22 Safeer Pasha 2021-03-01 07:47:06 UTC
GOOD NEWS EVERYONE
I just tested this and I can say that it is FIXED
the diacritics can now be found and replaced. 

Version: 7.1.0.3 / LibreOffice Community
Build ID: 10(Build:3)
CPU threads: 4; OS: Linux 5.10; UI render: default; VCL: kf5
Locale: en-US (en_US.UTF-8); UI: en-US
Calc: threaded
Comment 23 Eyal Rozenberg 2021-03-01 09:36:32 UTC
(In reply to Safeer Pasha from comment #22)
> GOOD NEWS EVERYONE
> I just tested this and I can say that it is FIXED

Not quite. The behavior has changed, but it isn't what we would expect. With the same reproduction instructions I gave in my last comment:

* Searching for ב without DAGESH finds both ב's.
* Searching for ב with DAGESH _also_ finds both ב's.
* Searching for just DAGESH doesn't find anything.

So, the current behavior is that searching ignores diacritics (at least in Hebrew).

Is this intentional? I'm not sure. Is this the best behavior to have? I'd say it isn't, but then we should really have a "policy discussion" about this, preferably on the LO RTL channel.
Comment 24 Safeer Pasha 2021-03-02 14:54:31 UTC
(In reply to Eyal Rozenberg from comment #23)

It works for Arabic; I did not test with Hebrew.
Comment 25 Xisco Faulí 2021-03-02 15:13:25 UTC
(In reply to Safeer Pasha from comment #24)
> (In reply to Eyal Rozenberg from comment #23)
> 
> it works for Arabic, did not test with Hebrew.

Could you please explain which steps are fixed? I can investigate when it got fixed, and maybe that helps with the Hebrew problem.
Comment 26 Eyal Rozenberg 2021-03-02 21:03:53 UTC
(In reply to Xisco Faulí from comment #25)
> Could you please explain the steps that are fixed ? I can investigate when
> it got fixed and maybe it helps with the hebrew problem

Oh, I'm sorry, I misspoke. Current behavior is exactly the same as what I described in my reproduction instructions for Hebrew. There is no change.

However, if you make your search diacritics-sensitive, then searching for either BET or BET+DAGESH will find just one occurrence.


Now, on to Arabic.

First, if I follow Yousuf Philips's instructions, I get the exact same problematic behavior: finding a DAMMA fails, finding a KAF + DAMMA finds two results. This is true whether your search is marked "diacritics-sensitive" or not (which, by the way, can only be set in the dialog).

If, in that same document, you type a third KAF letter without a DAMMA, it will be found when searching for KAF + DAMMA in non-diacritic-sensitive mode, and will not be found when searching for KAF + DAMMA in diacritic-sensitive mode (which is a good thing). So, the behavior for Arabic and Hebrew seems to be exactly the same. Marking my earlier comment obsolete.
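
For reference, the difference between the two modes can be modelled in a few lines of Python. This is a toy model under a stated assumption, namely that the non-sensitive mode strips nonspacing marks from both the text and the search term; it is not a description of how LibreOffice implements the option, and the helper names `strip_marks` and `count_matches` are hypothetical.

```python
import unicodedata


def strip_marks(s: str) -> str:
    """Drop nonspacing marks (category Mn), e.g. Arabic harakat."""
    return "".join(ch for ch in s if unicodedata.category(ch) != "Mn")


def count_matches(text: str, pattern: str, diacritic_sensitive: bool) -> int:
    """Count non-overlapping matches under the toy model of the two modes."""
    if not diacritic_sensitive:
        # Assumption: insensitive mode compares mark-free forms.
        text, pattern = strip_marks(text), strip_marks(pattern)
    if not pattern:
        # A pattern made only of marks is empty once stripped.
        return 0
    return text.count(pattern)


KAF, DAMMA = "\u0643", "\u064F"
# Two kaf + damma and a third, bare kaf, as in the test described above.
text = f"{KAF}{DAMMA} {KAF}{DAMMA} {KAF}"

print(count_matches(text, KAF + DAMMA, diacritic_sensitive=True))
print(count_matches(text, KAF + DAMMA, diacritic_sensitive=False))
print(count_matches(text, DAMMA, diacritic_sensitive=True))
print(count_matches(text, DAMMA, diacritic_sensitive=False))
```

The model reproduces the KAF + DAMMA results described above. It does not reproduce the reported failure to find a standalone DAMMA in diacritic-sensitive mode, which is the actual defect in this bug.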
Comment 27 Safeer Pasha 2021-03-03 04:16:55 UTC
(In reply to Xisco Faulí from comment #25)

The testing was done on the document in the attachments.

Open the S&R dialog and type "کُ" (the letter KAF with DAMMA) in the "Search" box.
Type KAF with FATHA, "کَ", in the "Replace" box.
Press "Find", "Find All", "Replace" or "Replace All".

You can even search and replace FATHA or DAMMA alone, without it being attached to another letter.
Comment 28 ustrendingnews 2021-05-16 07:36:23 UTC Comment hidden (spam)
Comment 29 Commit Notification 2022-08-19 17:02:27 UTC
Khaled Hosny committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/2a78fbf4e4a49f2b52aa1352aac41ee024d0cf72

tdf#91764: Combining marks from “complex” scripts can’t be searched for

It will be available in 7.5.0.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 30 ⁨خالد حسني⁩ 2022-08-19 17:11:17 UTC
Searching for standalone combining marks works now, as long as the “Diacritic sensitive” option is checked.
Comment 31 Commit Notification 2022-08-22 14:36:06 UTC
Khaled Hosny committed a patch related to this issue.
It has been pushed to "libreoffice-7-4":

https://git.libreoffice.org/core/commit/a8ba4ed2966a6f39b9aea6ce9deaed63db2023f2

tdf#91764: Combining marks from “complex” scripts can’t be searched for

It will be available in 7.4.1.