Bug 158329 - Can't find text with Niqqud in exported PDF
Summary: Can't find text with Niqqud in exported PDF
Status: NEW
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Printing and PDF export
Version (earliest affected): 7.5.0.3 release
Hardware: All All
Importance: medium normal
Assignee: Not Assigned
URL:
Whiteboard:
Keywords: bibisected, bisected, regression
Depends on:
Blocks: RTL-CTL PDF-Export
Reported: 2023-11-23 00:06 UTC by Saburo
Modified: 2023-11-30 20:17 UTC (History)

See Also:
Crash report or crash signature:


Attachments
sample file (9.82 KB, application/vnd.oasis.opendocument.text)
2023-11-23 00:07 UTC, Saburo
exported747 (10.30 KB, application/pdf)
2023-11-23 00:08 UTC, Saburo
exported242 (10.21 KB, application/pdf)
2023-11-23 00:26 UTC, Saburo

Description Saburo 2023-11-23 00:06:53 UTC
Description:
In PDFs exported with LibO 7.4, Hebrew text can be found with the search function of a PDF reader. In PDFs exported with LibO 7.5 and later, the same search finds nothing.
Also posted on Ask LibreOffice:
https://ask.libreoffice.org/t/writer-pdf/98051

It seems that the characters are stored separately and cannot be recognized as words.

Steps to Reproduce:
1. Export the sample file to PDF
2. Open that PDF in a reader
3. Search for וַיְהִ֥י

Actual Results:
Not found.

Expected Results:
The text is found.


Reproducible: Always


User Profile Reset: No

Additional Info:
Version: 24.2.0.0.alpha0+ (X86_64) / LibreOffice Community
Build ID: ff3fb42b48c70ba5788507a6177bf0a9f3b50fdb
CPU threads: 12; OS: Windows 10.0 Build 22621; UI render: Skia/Raster; VCL: win
Locale: ja-JP (ja_JP); UI: ja-JP
Calc: CL threaded

Version: 7.4.7.2 (x64) / LibreOffice Community
Build ID: 723314e595e8007d3cf785c16538505a1c878ca5
CPU threads: 12; OS: Windows 10.0 Build 22621; UI render: Skia/Vulkan; VCL: win
Locale: ja-JP (ja_JP); UI: ja-JP
Calc: CL
Comment 1 Saburo 2023-11-23 00:07:32 UTC
Created attachment 190979
sample file
Comment 2 Saburo 2023-11-23 00:08:01 UTC
Created attachment 190980
exported747
Comment 3 Saburo 2023-11-23 00:26:36 UTC
Created attachment 190981
exported242

Sample file exported to PDF using LibO24.2

The same thing happens with the [attached file](https://bugs.documentfoundation.org/attachment.cgi?id=134028) from [Bug 91764](https://bugs.documentfoundation.org/show_bug.cgi?id=91764).
Comment 4 Eyal Rozenberg 2023-11-27 10:33:34 UTC
The most important thing to note about this bug is that the search term of interest contains Niqqud marks - marks indicating vowels, emphasis or intonation; and even one cantillation mark. See:

https://en.wikipedia.org/wiki/Niqqud
https://en.wikipedia.org/wiki/Hebrew_cantillation

without marks: ויהי
with marks:    וַיְהִ֥י

If we search for the no-Niqqud term, we find it on the second line in both attached PDFs. If we search for the with-Niqqud term, we find it in the older-version export but not in the newer-version one.

I can also confirm the newer-behavior part of this bug with:

Version: 24.2.0.0.alpha1+ (X86_64) / LibreOffice Community
Build ID: 516f800f84b533db0082b1f39c19d1af40ab29c8
CPU threads: 4; OS: Linux 6.5; UI render: default; VCL: gtk3
Locale: he-IL (en_IL); UI: en-US

Note that when searching in LO itself, LO ignores the Niqqud and cantillation marks and just searches for the letter sequence, so both terms will match each other and themselves in the original document.
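
The mark-insensitive matching described above can be approximated outside LO. The following is a minimal Python sketch, assuming that dropping every Unicode nonspacing mark (category `Mn`) is a fair stand-in for "ignoring Niqqud and cantillation"; it is not LO's actual search code, and the name `strip_marks` is hypothetical. The escaped code points are intended to spell out the two terms from this comment (vav, patah, yod, sheva, he, hiriq, merkha, yod).

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Drop all nonspacing combining marks (Unicode category Mn)."""
    # NFD separates any precomposed characters into base letter + marks.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


# Assumed spelling of the with-marks and without-marks terms.
WITH_MARKS = "\u05D5\u05B7\u05D9\u05B0\u05D4\u05B4\u05A5\u05D9"
WITHOUT_MARKS = "\u05D5\u05D9\u05D4\u05D9"

# Both terms reduce to the same bare letter sequence.
print(strip_marks(WITH_MARKS) == strip_marks(WITHOUT_MARKS))
```

A PDF reader that normalized both the search term and the extracted text this way would match either form; whether a given reader does so is not something this sketch can tell.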
Comment 5 Eyal Rozenberg 2023-11-27 10:39:08 UTC
Oh, and: The problem is there even if we drop the cantillation mark. So Niqqud is enough for it to manifest.
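
To build the Niqqud-only search term for retesting, the cantillation mark can be removed programmatically. This is a small Python sketch, assuming that "cantillation" corresponds to the Hebrew accents range U+0591 to U+05AF of the Unicode Hebrew block; the function name `drop_cantillation` is hypothetical.

```python
def drop_cantillation(text: str) -> str:
    """Remove Hebrew accents (U+0591..U+05AF) while keeping Niqqud points."""
    return "".join(ch for ch in text if not ("\u0591" <= ch <= "\u05AF"))


# Assumed spelling of the search term: vav, patah, yod, sheva, he, hiriq,
# merkha (the cantillation mark, U+05A5), yod.
term = "\u05D5\u05B7\u05D9\u05B0\u05D4\u05B4\u05A5\u05D9"
niqqud_only = drop_cantillation(term)
print(len(term), len(niqqud_only))
```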
Comment 6 raal 2023-11-30 19:59:30 UTC
This seems to have begun at the below commit in bibisect repository/OS linux-64-7.5.
Adding Cc: to Khaled Hosny; could you possibly take a look at this one?
Thanks
 ba8787d89bb90aced203271dee7231163446d7e9 is the first bad commit
commit ba8787d89bb90aced203271dee7231163446d7e9
Author: Jenkins Build User <tdf@pollux.tdf>
Date:   Wed Oct 5 22:14:28 2022 +0200

    source 09c076c3f29c28497f162d3a5b7baab040725d56

140994: tdf#151350: Fix extraneous gaps before marks | https://gerrit.libreoffice.org/c/core/+/140994
Comment 7 ⁨خالد حسني⁩ 2023-11-30 20:17:17 UTC
Text extraction from PDF is a lost cause.

We are now generating /ActualText spans where we didn’t previously, and PDF readers are now confused by this. I blame Adobe for creating such a backwards file format and never fixing it.
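
For readers unfamiliar with the mechanism: in PDF, /ActualText is an entry in the property list of a marked-content sequence (`BDC` ... `EMC`) that supplies replacement text for the glyphs it encloses. The Python sketch below only illustrates that general syntax. The function name `actual_text_span` and the `<0001> Tj` glyph operator in the usage line are hypothetical placeholders, and the output is not a reproduction of what LibreOffice actually writes.

```python
def actual_text_span(text: str, show_ops: str) -> str:
    """Wrap glyph-showing operators in a /Span carrying /ActualText."""
    # PDF text strings may be written as hex: UTF-16BE with a FEFF byte order mark.
    hex_text = (b"\xfe\xff" + text.encode("utf-16-be")).hex().upper()
    return f"/Span <</ActualText <{hex_text}>>> BDC\n{show_ops}\nEMC"


# Hypothetical usage: one glyph stands for vav + patah.
print(actual_text_span("\u05D5\u05B7", "<0001> Tj"))
```

A reader that extracts text span by span, rather than joining adjacent spans into words, could plausibly fail to match a multi-letter search term; that is a guess at the failure mode, not a confirmed diagnosis.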

This probably can be fixed, but I don’t have the capacity to work on it right now.