Bug 158329 - Can't find text with Niqqud in exported PDF
Summary: Can't find text with Niqqud in exported PDF
Status: NEW
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Printing and PDF export
Version (earliest affected): 7.5.0.3 release
Hardware: All
OS: All
Importance: medium normal
Assignee: Not Assigned
URL:
Whiteboard:
Keywords: bibisected, bisected, regression
Duplicates: 161514
Depends on:
Blocks: PDF-Export
Reported: 2023-11-23 00:06 UTC by Saburo
Modified: 2024-09-30 21:43 UTC
CC: 4 users

See Also:
Crash report or crash signature:


Attachments
sample file (9.82 KB, application/vnd.oasis.opendocument.text)
2023-11-23 00:07 UTC, Saburo
exported747 (10.30 KB, application/pdf)
2023-11-23 00:08 UTC, Saburo
exported242 (10.21 KB, application/pdf)
2023-11-23 00:26 UTC, Saburo

Description Saburo 2023-11-23 00:06:53 UTC
Description:
Hebrew text in PDFs exported with LibreOffice 7.4 can be found by searching in a PDF reader, but the same search finds nothing in PDFs exported with LibreOffice 7.5 and later.
Posted on Ask LibreOffice:
https://ask.libreoffice.org/t/writer-pdf/98051

It seems that the characters are stored separately and cannot be recognized as words.

Steps to Reproduce:
1. Export the sample file to PDF
2. Open that PDF in a reader
3. Search for וַיְהִ֥י

Actual Results:
Not found.

Expected Results:
The search term is found.


Reproducible: Always


User Profile Reset: No

Additional Info:
Version: 24.2.0.0.alpha0+ (X86_64) / LibreOffice Community
Build ID: ff3fb42b48c70ba5788507a6177bf0a9f3b50fdb
CPU threads: 12; OS: Windows 10.0 Build 22621; UI render: Skia/Raster; VCL: win
Locale: ja-JP (ja_JP); UI: ja-JP
Calc: CL threaded

Version: 7.4.7.2 (x64) / LibreOffice Community
Build ID: 723314e595e8007d3cf785c16538505a1c878ca5
CPU threads: 12; OS: Windows 10.0 Build 22621; UI render: Skia/Vulkan; VCL: win
Locale: ja-JP (ja_JP); UI: ja-JP
Calc: CL
Comment 1 Saburo 2023-11-23 00:07:32 UTC
Created attachment 190979 [details]
sample file
Comment 2 Saburo 2023-11-23 00:08:01 UTC
Created attachment 190980 [details]
exported747
Comment 3 Saburo 2023-11-23 00:26:36 UTC
Created attachment 190981 [details]
exported242

Sample file exported to PDF using LibreOffice 24.2.

The same thing happens with the attached file (https://bugs.documentfoundation.org/attachment.cgi?id=134028) from Bug 91764 (https://bugs.documentfoundation.org/show_bug.cgi?id=91764).
Comment 4 Eyal Rozenberg 2023-11-27 10:33:34 UTC
The most important thing to note about this bug is that the search term of interest contains Niqqud marks - marks indicating vowels, emphasis or intonation; and even one cantillation mark. See:

https://en.wikipedia.org/wiki/Niqqud
https://en.wikipedia.org/wiki/Hebrew_cantillation

without marks: ויהי
with marks:    וַיְהִ֥י
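
As a quick illustration that these marks are combining characters, here is a small Python sketch (standard library only) that dumps each code point of the with-marks term; the base letters come out with Unicode category Lo and the Niqqud and cantillation marks with category Mn:

import unicodedata

for ch in "וַיְהִ֥י":
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))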

If we search for the no-Niqqud term, we find it on the second line in both attached PDFs. If we search for the with-Niqqud term, we find it in the older-version export but not in the newer one.

I can also confirm the newer-behavior part of this bug with:

Version: 24.2.0.0.alpha1+ (X86_64) / LibreOffice Community
Build ID: 516f800f84b533db0082b1f39c19d1af40ab29c8
CPU threads: 4; OS: Linux 6.5; UI render: default; VCL: gtk3
Locale: he-IL (en_IL); UI: en-US

Note that when searching in LO itself, LO ignores the Niqqud and cantillation and just searches for the letter sequence, so both terms will match each other and themselves in the original document.
Comment 5 Eyal Rozenberg 2023-11-27 10:39:08 UTC
Oh, and: The problem is there even if we drop the cantillation mark. So Niqqud is enough for it to manifest.
Comment 6 raal 2023-11-30 19:59:30 UTC
This seems to have begun at the commit below, found in the bibisect repository linux-64-7.5.
Adding Cc: to Khaled Hosny; could you possibly take a look at this one?
Thanks
 ba8787d89bb90aced203271dee7231163446d7e9 is the first bad commit
commit ba8787d89bb90aced203271dee7231163446d7e9
Author: Jenkins Build User <tdf@pollux.tdf>
Date:   Wed Oct 5 22:14:28 2022 +0200

    source 09c076c3f29c28497f162d3a5b7baab040725d56

140994: tdf#151350: Fix extraneous gaps before marks | https://gerrit.libreoffice.org/c/core/+/140994
Comment 7 ⁨خالد حسني⁩ 2023-11-30 20:17:17 UTC
Text extraction from PDF is a lost cause.

We are now generating /ActualText spans where we didn’t previously, and PDF readers are now confused by this. I blame Adobe for creating such a backwards file format and never fixing it.

This probably can be fixed, but I don’t have the capacity to work on it right now.
Comment 8 BogdanB 2024-08-24 14:00:43 UTC
Also in
Version: 24.8.0.3 (X86_64) / LibreOffice Community
Build ID: 0bdf1299c94fe897b119f97f3c613e9dca6be583
CPU threads: 4; OS: Linux 6.8; UI render: default; VCL: gtk3
Locale: ro-RO (ro_RO.UTF-8); UI: en-US
Calc: threaded
Comment 9 ⁨خالد حسني⁩ 2024-09-15 01:33:55 UTC
*** Bug 161514 has been marked as a duplicate of this bug. ***
Comment 10 ⁨خالد حسني⁩ 2024-09-15 01:41:08 UTC
For anyone trying to debug this, it is caused by the removal of the line:
hb_buffer_set_cluster_level(pHbBuffer, HB_BUFFER_CLUSTER_LEVEL_MONOTONE_CHARACTERS);
The default cluster level in HarfBuzz gives the base character and combining mark the same cluster number, so when we try to map glyphs back to input characters while creating PDF data, we can no longer map the base and mark glyphs individually to their original characters. Instead we have 2+ glyphs mapped to 2+ characters, which requires /ActualText, which in turn is badly supported in PDF readers and leads to this and the duplicate bug.

One fix is to restore this line and try to figure out another way to fix bug 151350.
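
For illustration, a minimal sketch with the uharfbuzz Python bindings (an assumption on my part that they expose the cluster-level enum as in recent releases; the font path is a placeholder for any font with Hebrew coverage) showing how the cluster level changes what glyphs can be mapped back to:

import uharfbuzz as hb

def clusters(text, font_path, level):
    font = hb.Font(hb.Face(hb.Blob.from_file_path(font_path)))
    buf = hb.Buffer()
    buf.cluster_level = level  # must be set before shaping
    buf.add_str(text)
    buf.guess_segment_properties()
    hb.shape(font, buf)
    return [info.cluster for info in buf.glyph_infos]

text = "וַיְהִ֥י"
# Default level (MONOTONE_GRAPHEMES): a base letter and its marks share one
# cluster value, so glyphs map back only to whole character groups.
print(clusters(text, "HebrewFont.ttf", hb.BufferClusterLevel.MONOTONE_GRAPHEMES))
# MONOTONE_CHARACTERS: each input character keeps its own cluster value, so
# base and mark glyphs can be mapped back to their characters individually.
print(clusters(text, "HebrewFont.ttf", hb.BufferClusterLevel.MONOTONE_CHARACTERS))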
Comment 11 David Huggins-Daines 2024-09-16 13:52:01 UTC
(In reply to ⁨خالد حسني⁩ from comment #10)
> Instead we have 2+ glyphs mapped to 2+ characters, which requires
> /ActualText, which in turn is badly supported in PDF readers and leads
> to this and the duplicate bug.

Hi!  Thank you for tracking down this problem!

In the case of the duplicate bug (#161514) I am not convinced that, as you say, "The PDF has valid character data".  The problem there is that the character <02> is not mapped to anything in the ToUnicode CMap:

(content stream)
/Span<</ActualText<FEFF0078030C>>>
BDC
1 0 0 1 128.8 668.1 Tm
/F1 72 Tf[<01>243<02>]TJ
EMC

(ToUnicode CMap)
2 beginbfchar
<01> <0078030C>
<03> <0075>
endbfchar

While it's true that the PDF 1.7 spec doesn't specifically say that all character codes in a font have to be defined in the ToUnicode CMap, instead providing this extremely helpful suggestion:

> If these methods fail to produce a Unicode value, there is no way to determine what the character code
> represents in which case a conforming reader may choose a character code of their choosing.

...one would hope that we can do better, given that we do actually know what the Unicode characters are and *exactly* which characters in the text object they are mapped to.  I understand that it's necessary for rendering purposes to group them in grapheme clusters, but this isn't really the purpose of ToUnicode CMaps.

The problem with /ActualText (aside from not being supported by any PDF readers except Acrobat...) is that there's no way to tell which characters in the /ActualText correspond to which characters in the text object, which becomes an issue for layout analysis and low-level text extraction in libraries like pdfminer/pdfplumber.  I'm looking at implementing support for it there and this is a real stumbling block.
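
For reference, the failure is easy to check from the extraction side with pdfminer.six's high-level API (a sketch; file names are placeholders for the attachments above, and since pdfminer does not yet interpret /ActualText, the newer export is expected to lose the term):

from pdfminer.high_level import extract_text

term = "וַיְהִ֥י"
for path in ("exported747.pdf", "exported242.pdf"):
    # Extract all text and test whether the with-Niqqud term survived.
    print(path, "found" if term in extract_text(path) else "not found")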
Comment 12 ⁨خالد حسني⁩ 2024-09-30 18:52:23 UTC
(In reply to David Huggins-Daines from comment #11)
> (In reply to ⁨خالد حسني⁩ from comment #10)
> > Instead we have 2+ glyphs mapped to 2+ characters, which requires
> > /ActualText, which in turn is badly supported in PDF readers and leads
> > to this and the duplicate bug.
> 
> Hi!  Thank you for tracking down this problem!
> 
> In the case of the duplicate bug (#161514) I am not convinced that, as you
> say, "The PDF has valid character data".  The problem there is that the
> character <02> is not mapped to anything in the ToUnicode CMap:

That is still a fully compliant and valid PDF, and all the character data is valid. The use of ActualText is by design; the lack of support in PDF readers is an unfortunate limitation, but so is the state of text extraction from PDF in general.

Using ActualText is unavoidable. It can be avoided in the particular cases here, but not in general.


> The problem with /ActualText (aside from not being supported by any PDF
> readers except Acrobat...) is that there's no way to tell which characters
> in the /ActualText correspond to which characters in the text object, which
> becomes an issue for layout analysis and low-level text extraction in
> libraries like pdfminer/pdfplumber.  I'm looking at implementing support for
> it there and this is a real stumbling block.

We use ActualText for the smallest range of glyphs that we can map to a range of characters, so if an ActualText tag is used then we don’t have any information that can tell which glyphs in the sequence belong to which characters (this regression notwithstanding, of course).

When shaping text, there are 4 possible glyph-to-character relationships:
1. One glyph to one character: this is the common case, and it can be handled by ToUnicode.
2. One glyph to many characters, AKA ligatures: this can also be handled by ToUnicode.
3. Many glyphs to one character, AKA decomposition: this cannot be handled by ToUnicode, and ActualText tags must be used.
4. Many glyphs to many characters, which can happen in scripts that reorder input text. Again, this cannot be handled by ToUnicode, and ActualText tags must be used.

On top of that, the ToUnicode mapping must be unique: a glyph can appear there only once, but fonts might map different characters to the same glyph, in which case ToUnicode can be used for only one of these mappings, and all the others will need ActualText.
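
To make that decision concrete, here is a hypothetical sketch (illustrative data shapes, not LibreOffice’s actual structures) of how a shaped glyph run could be routed to ToUnicode or ActualText:

def tounicode_or_actualtext(glyphs, chars, tounicode):
    # glyphs: glyph IDs in the run; chars: the characters they came from;
    # tounicode: the glyph-ID -> character-string map built so far.
    if len(glyphs) == 1:
        gid, mapping = glyphs[0], "".join(chars)
        # Cases 1 and 2 (one glyph, one or many characters) fit ToUnicode,
        # but only if this glyph does not already carry a different mapping.
        if tounicode.setdefault(gid, mapping) == mapping:
            return "ToUnicode"
    # Cases 3 and 4 (many glyphs), or a conflicting duplicate mapping,
    # need an /ActualText span instead.
    return "ActualText"

A Hebrew base letter plus mark shaped at cluster level 0 arrives here as one two-glyph, two-character run and falls into the ActualText branch, which is exactly this regression.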

The case here can be fixed. Using HarfBuzz cluster level 0 is not required, but it was the quickest way to fix bug 151350 and I didn’t think about the implications this has on PDF text extraction.
Comment 13 David Huggins-Daines 2024-09-30 19:34:53 UTC
(In reply to ⁨خالد حسني⁩ from comment #12)
> On top of that, the ToUnicode mapping must be unique: a glyph can appear
> there only once, but fonts might map different characters to the same
> glyph, in which case ToUnicode can be used for only one of these
> mappings, and all the others will need ActualText.

Thank you for the really detailed explanation!  In this particular regression we have a sort of ligature, so ToUnicode should work, but I understand why it isn't sufficient in the more general case.

I'll try to do a best-effort implementation of ActualText for pdfminer/pdfplumber, since as you say it gets used for the smallest span of text necessary, and since text extraction is best-effort by definition anyway.

I haven't checked to see if poppler, qpdf, pdfium, and company are working on ActualText support...
Comment 14 ⁨خالد حسني⁩ 2024-09-30 21:31:20 UTC
(In reply to David Huggins-Daines from comment #13)
> (In reply to ⁨خالد حسني⁩ from comment #12)
> > On top of that, the ToUnicode mapping must be unique: a glyph can appear
> > there only once, but fonts might map different characters to the same
> > glyph, in which case ToUnicode can be used for only one of these
> > mappings, and all the others will need ActualText.
> 
> Thank you for the really detailed explanation!  In this particular
> regression we have a sort of ligature, so ToUnicode should work, but I
> understand why it isn't sufficient in the more general case.
> 
> I'll try to do a best-effort implementation of ActualText for
> pdfminer/pdfplumber, since as you say it gets used for the smallest span of
> text necessary, and since text extraction is best-effort by definition
> anyway.
> 
> I haven't checked to see if poppler, qpdf, pdfium, and company are working
> on ActualText support...

Poppler supports ActualText, pdfium does not (at least last I checked), I don’t know about qpdf.
Comment 15 David Huggins-Daines 2024-09-30 21:43:01 UTC
(In reply to ⁨خالد حسني⁩ from comment #14)

> > I haven't checked to see if poppler, qpdf, pdfium, and company are working
> > on ActualText support...
> 
> Poppler supports ActualText, pdfium does not (at least last I checked), I
> don’t know about qpdf.

Ah, thanks!  I can consult the Poppler source to see how they do it then.