Bug 158329 - Can't find text with Niqqud in exported PDF
Status: NEW
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Printing and PDF export
Version (earliest affected):
Hardware: All
OS: All
Importance: medium normal
Assignee: Not Assigned
Keywords: bibisected, bisected, regression
Depends on:
Blocks: RTL-CTL PDF-Export
Reported: 2023-11-23 00:06 UTC by Saburo
Modified: 2023-11-30 20:17 UTC
CC: 2 users


sample file (9.82 KB, application/vnd.oasis.opendocument.text)
2023-11-23 00:07 UTC, Saburo
exported747 (10.30 KB, application/pdf)
2023-11-23 00:08 UTC, Saburo
exported242 (10.21 KB, application/pdf)
2023-11-23 00:26 UTC, Saburo

Description Saburo 2023-11-23 00:06:53 UTC
Hebrew text in PDFs exported with LibO 7.4 can be found with a PDF reader's search, but the same text in PDFs exported with LibO 7.5 and later cannot.
Originally posted on Ask LibreOffice.

It seems that the characters are stored separately in the PDF, so they are not recognized as words.

Steps to Reproduce:
1. Export the sample file to PDF
2. Open that PDF in a reader
3. Search for וַיְהִ֥י

Actual Results:
Not found.

Expected Results:
The search term is found.

Reproducible: Always

User Profile Reset: No

Additional Info:
Version: (X86_64) / LibreOffice Community
Build ID: ff3fb42b48c70ba5788507a6177bf0a9f3b50fdb
CPU threads: 12; OS: Windows 10.0 Build 22621; UI render: Skia/Raster; VCL: win
Locale: ja-JP (ja_JP); UI: ja-JP
Calc: CL threaded

Version: (x64) / LibreOffice Community
Build ID: 723314e595e8007d3cf785c16538505a1c878ca5
CPU threads: 12; OS: Windows 10.0 Build 22621; UI render: Skia/Vulkan; VCL: win
Locale: ja-JP (ja_JP); UI: ja-JP
Calc: CL
Comment 1 Saburo 2023-11-23 00:07:32 UTC
Created attachment 190979 [details]
sample file
Comment 2 Saburo 2023-11-23 00:08:01 UTC
Created attachment 190980 [details]
exported747
Comment 3 Saburo 2023-11-23 00:26:36 UTC
Created attachment 190981 [details]

Sample file exported to PDF using LibO24.2

The same thing happens with [attachments file](https://bugs.documentfoundation.org/attachment.cgi?id=134028) in [Bug 91764](https://bugs.documentfoundation.org/show_bug.cgi?id=91764).
Comment 4 Eyal Rozenberg 2023-11-27 10:33:34 UTC
The most important thing to note about this bug is that the search term of interest contains Niqqud marks - marks indicating vowels, emphasis or intonation; and even one cantillation mark. See:


without marks: ויהי
with marks:    וַיְהִ֥י

If we search for the no-Niqqud term, we find it on the second line in both attached PDFs. If we search for the with-Niqqud term, we find it in the older-version export but not in the newer-version export.

I can also confirm the newer-version behavior of this bug with:

Version: (X86_64) / LibreOffice Community
Build ID: 516f800f84b533db0082b1f39c19d1af40ab29c8
CPU threads: 4; OS: Linux 6.5; UI render: default; VCL: gtk3
Locale: he-IL (en_IL); UI: en-US

Note that when searching in LO itself, LO ignores the Niqqud and cantillation and just matches on the letter sequence, so both terms will match each other and themselves in the original document.
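To make the mark distinction concrete, here is a small Python sketch (standard-library `unicodedata` only; this is an illustration, not LibreOffice code) that lists the codepoints of the marked term and shows that stripping combining marks recovers the bare letter sequence:

```python
import unicodedata

with_marks = "וַיְהִ֥י"   # letters interleaved with Niqqud points and one cantillation mark
without_marks = "ויהי"  # the bare letter sequence

def strip_marks(s):
    # Decompose, then drop every combining character. Hebrew points
    # and cantillation marks all have a nonzero combining class.
    return "".join(ch for ch in unicodedata.normalize("NFD", s)
                   if not unicodedata.combining(ch))

# Show each codepoint of the marked term; combining > 0 means a mark.
for ch in with_marks:
    print(f"U+{ord(ch):04X} combining={unicodedata.combining(ch)}")

print(strip_marks(with_marks) == without_marks)  # True
```

This mirrors what LO's own search effectively does (ignoring the marks), whereas searching the exported PDF with the marked term depends on how the reader reassembles the extracted text.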
Comment 5 Eyal Rozenberg 2023-11-27 10:39:08 UTC
Oh, and: the problem is present even if we drop the cantillation mark, so Niqqud alone is enough for it to manifest.
Comment 6 raal 2023-11-30 19:59:30 UTC
This seems to have begun at the commit below, per the linux-64-7.5 bibisect repository.
Adding Cc: to Khaled Hosny; could you possibly take a look at this one?
 ba8787d89bb90aced203271dee7231163446d7e9 is the first bad commit
commit ba8787d89bb90aced203271dee7231163446d7e9
Author: Jenkins Build User <tdf@pollux.tdf>
Date:   Wed Oct 5 22:14:28 2022 +0200

    source 09c076c3f29c28497f162d3a5b7baab040725d56

140994: tdf#151350: Fix extraneous gaps before marks | https://gerrit.libreoffice.org/c/core/+/140994
Comment 7 خالد حسني (Khaled Hosny) 2023-11-30 20:17:17 UTC
Text extraction from PDF is a lost cause.

We are now generating /ActualText spans where we didn’t previously, and PDF readers are now confused by this. I blame Adobe for creating such a backwards file format and never fixing it.
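For readers unfamiliar with the mechanism: an /ActualText span is a marked-content region in a PDF content stream that tells text extractors which Unicode text the enclosed glyph-drawing operators represent. A hand-written sketch of the shape of such a span (the hex bytes are illustrative UTF-16BE with a BOM, here vav U+05D5 plus patah U+05B7, not taken from the attached files):

```
/Span << /ActualText <FEFF05D505B7> >> BDC
  % ... Tj / TJ operators drawing the vav and patah glyphs ...
EMC
```

If a reader's search treats each such span as a separate run rather than joining adjacent spans into one string, a multi-character search term that crosses span boundaries will not match, which is consistent with the behavior reported here.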

This probably can be fixed, but I don’t have the capacity to work on it right now.