Description:
Hebrew text in PDFs exported with LibO 7.4 can be found by searching in a PDF reader, but the same text in PDFs exported with LibO 7.5 and later cannot be found.
Originally posted on Ask: https://ask.libreoffice.org/t/writer-pdf/98051
It seems that the characters are stored separately and are not recognized as words.

Steps to Reproduce:
1. Export the sample file to PDF
2. Open that PDF in a reader
3. Search for וַיְהִ֥י

Actual Results:
Not found.

Expected Results:
The search term is found.

Reproducible: Always

User Profile Reset: No

Additional Info:
Version: 24.2.0.0.alpha0+ (X86_64) / LibreOffice Community
Build ID: ff3fb42b48c70ba5788507a6177bf0a9f3b50fdb
CPU threads: 12; OS: Windows 10.0 Build 22621; UI render: Skia/Raster; VCL: win
Locale: ja-JP (ja_JP); UI: ja-JP
Calc: CL threaded

Version: 7.4.7.2 (x64) / LibreOffice Community
Build ID: 723314e595e8007d3cf785c16538505a1c878ca5
CPU threads: 12; OS: Windows 10.0 Build 22621; UI render: Skia/Vulkan; VCL: win
Locale: ja-JP (ja_JP); UI: ja-JP
Calc: CL
Created attachment 190979 [details] sample file
Created attachment 190980 [details] exported747
Created attachment 190981 [details] exported242

Sample file exported to PDF using LibO 24.2.

The same thing happens with the [attached file](https://bugs.documentfoundation.org/attachment.cgi?id=134028) in [Bug 91764](https://bugs.documentfoundation.org/show_bug.cgi?id=91764).
The most important thing to note about this bug is that the search term of interest contains Niqqud marks (marks indicating vowels, emphasis or intonation), and even one cantillation mark. See:
https://en.wikipedia.org/wiki/Niqqud
https://en.wikipedia.org/wiki/Hebrew_cantillation

without marks: ויהי
with marks: וַיְהִ֥י

If we search for the no-Niqqud term, we find it on the second line in both attached PDFs. If we search for the with-Niqqud term, we find it in the older-version export but not in the newer-version export.

I can also confirm the newer-behavior part of this bug with:

Version: 24.2.0.0.alpha1+ (X86_64) / LibreOffice Community
Build ID: 516f800f84b533db0082b1f39c19d1af40ab29c8
CPU threads: 4; OS: Linux 6.5; UI render: default; VCL: gtk3
Locale: he-IL (en_IL); UI: en-US

Note that when searching within LO itself, LO ignores the Niqqud and cantillation marks and just searches for the letter sequence, so both terms will match each other and themselves in the original document.
Oh, and: The problem is there even if we drop the cantillation mark. So Niqqud is enough for it to manifest.
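To make the three variants of the search term concrete (full, without the cantillation mark, and letters only), here is a minimal Python sketch. The helper names `strip_marks` and `drop_cantillation` are made up for illustration, and the code points in `WITH_MARKS` are my reading of the term quoted in this report. This only shows how the terms relate to each other; it is not how LibreOffice or any PDF reader implements search.

```python
import unicodedata

# The search term from this report, spelled out as escapes so the code
# points are explicit (assumption: this matches the term as typed above).
WITH_MARKS = "\u05D5\u05B7\u05D9\u05B0\u05D4\u05B4\u05A5\u05D9"
WITHOUT_MARKS = "\u05D5\u05D9\u05D4\u05D9"


def strip_marks(text):
    """Drop all nonspacing combining marks (general category Mn).

    In Hebrew text this removes both Niqqud and cantillation marks,
    leaving only the base letters.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def drop_cantillation(text):
    """Drop only the Hebrew accents (U+0591..U+05AF) and keep the Niqqud."""
    return "".join(ch for ch in text if not ("\u0591" <= ch <= "\u05AF"))


if __name__ == "__main__":
    # List the marks attached to the term by their Unicode names.
    for ch in WITH_MARKS:
        if unicodedata.category(ch) == "Mn":
            print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
    # Check that the letters-only form equals the unmarked term.
    print(strip_marks(WITH_MARKS) == WITHOUT_MARKS)
```

Searching the exported PDFs for `drop_cantillation(WITH_MARKS)` corresponds to the Niqqud-only case mentioned above, and `strip_marks(WITH_MARKS)` corresponds to the term that is still found in both exports.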
This seems to have begun at the commit below, in bibisect repository/OS linux-64-7.5.
Adding Cc: to Khaled Hosny; could you possibly take a look at this one? Thanks

ba8787d89bb90aced203271dee7231163446d7e9 is the first bad commit
commit ba8787d89bb90aced203271dee7231163446d7e9
Author: Jenkins Build User <tdf@pollux.tdf>
Date: Wed Oct 5 22:14:28 2022 +0200

    source 09c076c3f29c28497f162d3a5b7baab040725d56

140994: tdf#151350: Fix extraneous gaps before marks | https://gerrit.libreoffice.org/c/core/+/140994
Text extraction from PDF is a lost cause. We are now generating /ActualText spans where we didn’t previously, and PDF readers are now confused by this. I blame Adobe for creating such a backwards file format and never fixing it. This probably can be fixed, but I don’t have the capacity to work on it right now.
Also in

Version: 24.8.0.3 (X86_64) / LibreOffice Community
Build ID: 0bdf1299c94fe897b119f97f3c613e9dca6be583
CPU threads: 4; OS: Linux 6.8; UI render: default; VCL: gtk3
Locale: ro-RO (ro_RO.UTF-8); UI: en-US
Calc: threaded