Description:
Hebrew text in PDFs exported with LibO 7.4 can be found by searching in a PDF reader, but in PDFs exported with LibO 7.5 and later the same search finds nothing. Posted on Ask: https://ask.libreoffice.org/t/writer-pdf/98051
It seems that the characters are stored separately and cannot be recognized as words.

Steps to Reproduce:
1. Export the sample file to PDF
2. Open that PDF in a reader
3. Search for וַיְהִ֥י

Actual Results:
Not found.

Expected Results:
The term is found.

Reproducible: Always

User Profile Reset: No

Additional Info:
Version: 24.2.0.0.alpha0+ (X86_64) / LibreOffice Community
Build ID: ff3fb42b48c70ba5788507a6177bf0a9f3b50fdb
CPU threads: 12; OS: Windows 10.0 Build 22621; UI render: Skia/Raster; VCL: win
Locale: ja-JP (ja_JP); UI: ja-JP
Calc: CL threaded

Version: 7.4.7.2 (x64) / LibreOffice Community
Build ID: 723314e595e8007d3cf785c16538505a1c878ca5
CPU threads: 12; OS: Windows 10.0 Build 22621; UI render: Skia/Vulkan; VCL: win
Locale: ja-JP (ja_JP); UI: ja-JP
Calc: CL
Created attachment 190979 [details] sample file
Created attachment 190980 [details] exported747
Created attachment 190981 [details]
exported242

Sample file exported to PDF using LibO 24.2.

The same thing happens with the [attachment file](https://bugs.documentfoundation.org/attachment.cgi?id=134028) in [Bug 91764](https://bugs.documentfoundation.org/show_bug.cgi?id=91764).
The most important thing to note about this bug is that the search term contains Niqqud marks - marks indicating vowels, emphasis, or intonation - and even one cantillation mark. See:
https://en.wikipedia.org/wiki/Niqqud
https://en.wikipedia.org/wiki/Hebrew_cantillation

without marks: ויהי
with marks: וַיְהִ֥י

If we search for the no-Niqqud term, we find it on the second line in both attached PDFs. If we search for the with-Niqqud term, we find it in the older-version export but not in the newer-version one.

I can also confirm the newer-version part of this bug with:

Version: 24.2.0.0.alpha1+ (X86_64) / LibreOffice Community
Build ID: 516f800f84b533db0082b1f39c19d1af40ab29c8
CPU threads: 4; OS: Linux 6.5; UI render: default; VCL: gtk3
Locale: he-IL (en_IL); UI: en-US

Note that when searching within LO itself, LO ignores the Niqqud and cantillation and just matches the letter sequence, so both terms will match each other and themselves in the original document.
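For reference, the with-marks term is the same four letters with combining marks interleaved; its codepoints (in canonical order) are:

U+05D5 HEBREW LETTER VAV
U+05B7 HEBREW POINT PATAH (Niqqud)
U+05D9 HEBREW LETTER YOD
U+05B0 HEBREW POINT SHEVA (Niqqud)
U+05D4 HEBREW LETTER HE
U+05B4 HEBREW POINT HIRIQ (Niqqud)
U+05A5 HEBREW ACCENT MERKHA (cantillation)
U+05D9 HEBREW LETTER YOD

A reader that cannot match the with-marks term has lost the association between these combining marks and their base letters in the exported PDF.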
Oh, and: The problem is there even if we drop the cantillation mark. So Niqqud is enough for it to manifest.
This seems to have begun at the commit below, in the bibisect repository linux-64-7.5.

Adding Cc: to Khaled Hosny; could you possibly take a look at this one? Thanks.

ba8787d89bb90aced203271dee7231163446d7e9 is the first bad commit
commit ba8787d89bb90aced203271dee7231163446d7e9
Author: Jenkins Build User <tdf@pollux.tdf>
Date:   Wed Oct 5 22:14:28 2022 +0200

    source 09c076c3f29c28497f162d3a5b7baab040725d56

140994: tdf#151350: Fix extraneous gaps before marks | https://gerrit.libreoffice.org/c/core/+/140994
Text extraction from PDF is a lost cause. We are now generating /ActualText spans where we didn’t previously, and PDF readers are now confused by this. I blame Adobe for creating such a backwards file format and never fixing it. This probably can be fixed, but I don’t have the capacity to work on it right now.
Also in:

Version: 24.8.0.3 (X86_64) / LibreOffice Community
Build ID: 0bdf1299c94fe897b119f97f3c613e9dca6be583
CPU threads: 4; OS: Linux 6.8; UI render: default; VCL: gtk3
Locale: ro-RO (ro_RO.UTF-8); UI: en-US
Calc: threaded
*** Bug 161514 has been marked as a duplicate of this bug. ***
For anyone trying to debug this: it is caused by the removal of the line

hb_buffer_set_cluster_level(pHbBuffer, HB_BUFFER_CLUSTER_LEVEL_MONOTONE_CHARACTERS);

The default cluster level in HarfBuzz gives the base character and the combining mark the same cluster number, so when we try to map glyphs back to input characters while creating PDF data, we can no longer map the base and mark glyphs individually to their original characters. Instead we get 2+ glyphs mapped to 2+ characters, which requires /ActualText, which in turn is badly supported in PDF readers and leads to this bug and the duplicate.

One fix is to restore this line and try to figure out another way to fix bug 151350.
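To see the effect of the two cluster levels outside LibreOffice, here is a minimal standalone HarfBuzz sketch (an illustration, not the LibreOffice code; the font path is a placeholder) that shapes VAV + PATAH once with each level and prints every glyph's cluster value. With the default MONOTONE_GRAPHEMES level both glyphs report cluster 0; with MONOTONE_CHARACTERS the mark keeps cluster 1, which is what allows mapping each glyph back to its own character:

#include <hb.h>
#include <cstdint>
#include <cstdio>

int main() {
    // Placeholder path; any font with Hebrew coverage will do.
    hb_blob_t* blob = hb_blob_create_from_file("HebrewFont.ttf");
    hb_face_t* face = hb_face_create(blob, 0);
    hb_font_t* font = hb_font_create(face);

    // U+05D5 HEBREW LETTER VAV + U+05B7 HEBREW POINT PATAH
    const uint32_t text[] = { 0x05D5, 0x05B7 };

    const hb_buffer_cluster_level_t levels[] = {
        HB_BUFFER_CLUSTER_LEVEL_MONOTONE_GRAPHEMES,  // HarfBuzz default
        HB_BUFFER_CLUSTER_LEVEL_MONOTONE_CHARACTERS, // the removed setting
    };

    for (hb_buffer_cluster_level_t level : levels) {
        hb_buffer_t* buf = hb_buffer_create();
        hb_buffer_set_cluster_level(buf, level);  // must be set before shaping
        hb_buffer_add_utf32(buf, text, 2, 0, 2);  // clusters = character indices
        hb_buffer_guess_segment_properties(buf);  // Hebrew script, RTL
        hb_shape(font, buf, nullptr, 0);

        unsigned count = 0;
        hb_glyph_info_t* info = hb_buffer_get_glyph_infos(buf, &count);
        printf("cluster level %d:\n", (int)level);
        for (unsigned i = 0; i < count; ++i)
            printf("  glyph id %u <- cluster %u\n",
                   info[i].codepoint, info[i].cluster);
        hb_buffer_destroy(buf);
    }

    hb_font_destroy(font);
    hb_face_destroy(face);
    hb_blob_destroy(blob);
}

(Build with, e.g., c++ test.cpp $(pkg-config --cflags --libs harfbuzz).)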
(In reply to خالد حسني from comment #10)
> Instead we get 2+ glyphs mapped to 2+ characters, which requires
> /ActualText, which in turn is badly supported in PDF readers and leads to
> this bug and the duplicate.

Hi! Thank you for tracking down this problem!

In the case of the duplicate bug (#161514) I am not convinced that, as you say, "The PDF has valid character data". The problem there is that the character code <02> is not mapped to anything in the ToUnicode CMap:

(content stream)
/Span<</ActualText<FEFF0078030C>>> BDC
1 0 0 1 128.8 668.1 Tm
/F1 72 Tf
[<01>243<02>]TJ
EMC

(ToUnicode CMap)
2 beginbfchar
<01> <0078030C>
<03> <0075>
endbfchar

While it's true that the PDF 1.7 spec doesn't specifically say that all character codes in a font have to be defined in the ToUnicode CMap, instead providing this extremely helpful suggestion:

> If these methods fail to produce a Unicode value, there is no way to
> determine what the character code represents, in which case a conforming
> reader may choose a character code of their choosing.

...one would hope that we can do better, given that we do actually know what the Unicode characters are and *exactly* which characters in the text object they are mapped to. I understand that it's necessary for rendering purposes to group them in grapheme clusters, but this isn't really the purpose of ToUnicode CMaps.

The problem with /ActualText (aside from not being supported by any PDF readers except Acrobat...) is that there's no way to tell which characters in the /ActualText correspond to which characters in the text object, which becomes an issue for layout analysis and low-level text extraction in libraries like pdfminer/pdfplumber. I'm looking at implementing support for it there, and this is a real stumbling block.
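Since in this case the two glyph codes map one-to-one onto the two characters, a per-glyph ToUnicode would avoid /ActualText here entirely. A sketch of what that could look like (hypothetical fragment, not what any current version emits; it assumes <01> is the base "x" glyph and <02> the combining caron glyph):

3 beginbfchar
<01> <0078>
<02> <030C>
<03> <0075>
endbfchar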
(In reply to David Huggins-Daines from comment #11)
> (In reply to خالد حسني from comment #10)
> > Instead we get 2+ glyphs mapped to 2+ characters, which requires
> > /ActualText, which in turn is badly supported in PDF readers and leads to
> > this bug and the duplicate.
>
> Hi! Thank you for tracking down this problem!
>
> In the case of the duplicate bug (#161514) I am not convinced that, as you
> say, "The PDF has valid character data". The problem there is that the
> character code <02> is not mapped to anything in the ToUnicode CMap:

That is still a fully compliant and valid PDF, and all the character data is there. The use of ActualText is by design; the lack of support in PDF readers is an unfortunate limitation, but so is the state of text extraction from PDF in general. Using ActualText is unavoidable. It can be avoided in the particular cases here, but not in general.

> The problem with /ActualText (aside from not being supported by any PDF
> readers except Acrobat...) is that there's no way to tell which characters
> in the /ActualText correspond to which characters in the text object, which
> becomes an issue for layout analysis and low-level text extraction in
> libraries like pdfminer/pdfplumber. I'm looking at implementing support for
> it there, and this is a real stumbling block.

We use ActualText for the smallest range of glyphs that we can map to a range of characters, so if an ActualText tag is used then we don't have any information that can tell which glyphs in this sequence belong to which characters (this regression notwithstanding, of course).

When shaping text, there are 4 possible glyph-to-character relationships (see the sketch at the end of this comment):

1. One glyph to one character: this is the common case and it can be handled by ToUnicode.
2. One glyph to many characters, AKA ligatures: this can also be handled by ToUnicode.
3. Many glyphs to one character, AKA decomposition: this cannot be handled by ToUnicode, and ActualText tags must be used.
4. Many glyphs to many characters, which can happen in scripts that reorder input text. Again, this cannot be handled by ToUnicode, and ActualText tags must be used.

On top of that, the ToUnicode mapping must be unique; a glyph can appear there only once, but fonts might map different characters to the same glyph. In this case ToUnicode can be used for one of these mappings, and all the others will need ActualText.

The case here can be fixed. Using HarfBuzz cluster level 0 is not required, but it was the quickest way to fix bug 151350, and I didn't think about the implications this has on PDF text extraction.
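To make the four cases concrete, here is a rough sketch of the classification step (an illustration, not the actual LibreOffice exporter code): it groups glyphs that share a cluster value and counts the characters between cluster boundaries. It assumes a left-to-right buffer shaped from UTF-32 with a monotone cluster level, e.g. the buffer from the sketch in the earlier comment:

#include <hb.h>
#include <cstdio>

// Classify each cluster of a shaped buffer by how many glyphs render how many
// input characters. 1:1 and 1:N fit in ToUnicode; N:1 and N:M need /ActualText.
void classifyClusters(hb_buffer_t* buf, unsigned textLen) {
    unsigned count = 0;
    hb_glyph_info_t* info = hb_buffer_get_glyph_infos(buf, &count);
    for (unsigned i = 0; i < count; ) {
        unsigned j = i;
        while (j < count && info[j].cluster == info[i].cluster)
            ++j;                                  // glyphs sharing this cluster
        unsigned glyphs = j - i;
        unsigned next = (j < count) ? info[j].cluster : textLen;
        unsigned chars = next - info[i].cluster;  // characters this cluster consumed
        const char* verdict =
            (glyphs == 1 && chars == 1) ? "1:1 -> ToUnicode"
          : (glyphs == 1)               ? "1:N -> ToUnicode (ligature)"
          : (chars == 1)                ? "N:1 -> /ActualText (decomposition)"
          :                               "N:M -> /ActualText";
        printf("chars [%u,%u) rendered by %u glyph(s): %s\n",
               info[i].cluster, next, glyphs, verdict);
        i = j;
    }
}

On top of this per-cluster decision, the uniqueness constraint mentioned above is a separate check: even a 1:1 cluster has to fall back to /ActualText if its glyph already carries a different ToUnicode mapping.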
(In reply to خالد حسني from comment #12)
> On top of that, the ToUnicode mapping must be unique; a glyph can appear
> there only once, but fonts might map different characters to the same
> glyph. In this case ToUnicode can be used for one of these mappings, and
> all the others will need ActualText.

Thank you for the really detailed explanation! In this particular regression we have a sort of ligature, so ToUnicode should work, but I understand why it isn't sufficient in the more general case.

I'll try to do a best-effort implementation of ActualText for pdfminer/pdfplumber, since as you say it gets used for the smallest span of text necessary, and since text extraction is best-effort by definition anyway.

I haven't checked to see if poppler, qpdf, pdfium, and company are working on ActualText support...
(In reply to David Huggins-Daines from comment #13)
> (In reply to خالد حسني from comment #12)
> > On top of that, the ToUnicode mapping must be unique; a glyph can appear
> > there only once, but fonts might map different characters to the same
> > glyph. In this case ToUnicode can be used for one of these mappings, and
> > all the others will need ActualText.
>
> Thank you for the really detailed explanation! In this particular
> regression we have a sort of ligature, so ToUnicode should work, but I
> understand why it isn't sufficient in the more general case.
>
> I'll try to do a best-effort implementation of ActualText for
> pdfminer/pdfplumber, since as you say it gets used for the smallest span of
> text necessary, and since text extraction is best-effort by definition
> anyway.
>
> I haven't checked to see if poppler, qpdf, pdfium, and company are working
> on ActualText support...

Poppler supports ActualText, pdfium does not (at least last I checked), and I don't know about qpdf.
(In reply to خالد حسني from comment #14)
> > I haven't checked to see if poppler, qpdf, pdfium, and company are working
> > on ActualText support...
>
> Poppler supports ActualText, pdfium does not (at least last I checked), and
> I don't know about qpdf.

Ah, thanks! I can consult the Poppler source to see how they do it then.