Bug 104770 - Fileopen of OCR'ed PDF also shows otherwise hidden text so looks duplicated
Status: NEW
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: filters and storage
Version: 3.3.0 release (earliest affected)
Hardware: All All
Importance: medium normal
Assignee: Not Assigned
URL:
Whiteboard:
Keywords:
Duplicates: 116442 132493 138243 138810 146495 152448
Depends on:
Blocks: PDF-Import-Draw
Reported: 2016-12-19 04:09 UTC by Chris Sherlock
Modified: 2024-11-06 18:18 UTC (History)
CC List: 13 users

See Also:
Crash report or crash signature:


Attachments
Problematic PDF (2.50 MB, application/pdf)
2016-12-19 04:10 UTC, Chris Sherlock
Problematic PDF - test document (2.92 MB, application/pdf)
2016-12-19 04:17 UTC, Chris Sherlock
The PDF opened in LibreOffice Writer 5.2.2 (750.85 KB, image/png)
2016-12-19 04:18 UTC, Chris Sherlock
The PDF opened in Acrobat Reader (750.85 KB, image/png)
2016-12-19 04:19 UTC, Chris Sherlock
The PDF opened in LibreOffice Writer 5.2.2 (526.20 KB, image/png)
2016-12-19 04:20 UTC, Chris Sherlock

Description Chris Sherlock 2016-12-19 04:09:07 UTC
Description:
We have OCR'ed newspapers and tried to open one of them in LibreOffice. Unfortunately, LibreOffice exposes the underlying selectable text as green text, whereas in Adobe Reader the text is invisible until you select it.

Steps to Reproduce:
Open the attached PDF in LibreOffice
Open the attached PDF in Acrobat Reader

Now export the document from LibreOffice as a PDF, then open the exported file in Acrobat Reader


Actual Results:  
Note the difference: in Acrobat Reader no text shows until you select it, while in LibreOffice the text is immediately visible (and an odd shade of green).

If you then export the document as a PDF and open it in Acrobat Reader it now shows all the text as green over the top of the newspaper archive. 

Expected Results:
I'm expecting the text to remain invisible. 


Reproducible: Always

User Profile Reset: No - don't believe this is relevant.

Additional Info:


User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36
Comment 1 Chris Sherlock 2016-12-19 04:10:16 UTC Comment hidden (obsolete)
Comment 2 Chris Sherlock 2016-12-19 04:17:32 UTC
Created attachment 129756 [details]
Problematic PDF - test document
Comment 3 Chris Sherlock 2016-12-19 04:18:53 UTC Comment hidden (obsolete)
Comment 4 Chris Sherlock 2016-12-19 04:19:33 UTC
Created attachment 129758 [details]
The PDF opened in Acrobat Reader
Comment 5 Chris Sherlock 2016-12-19 04:20:52 UTC
Created attachment 129759 [details]
The PDF opened in LibreOffice Writer 5.2.2
Comment 6 Buovjaga 2016-12-19 17:32:29 UTC
Confirmed.

Note: LibreOffice is not a PDF reader. It converts the PDF to its own format.

Arch Linux 64-bit, KDE Plasma 5
Version: 5.4.0.0.alpha0+
Build ID: db9aec4520766c87a09d4cb0238ed06ebaeaaeeb
CPU Threads: 8; OS Version: Linux 4.8; UI Render: default; VCL: kde4; 
Locale: fi-FI (fi_FI.UTF-8); Calc: group
Built on December 18th 2016
Comment 7 Xisco Faulí 2017-04-13 08:44:32 UTC Comment hidden (obsolete)
Comment 8 V Stuart Foote 2018-03-17 05:23:25 UTC
*** Bug 116442 has been marked as a duplicate of this bug. ***
Comment 9 Chris Sherlock 2018-03-19 00:52:53 UTC
I've looked into this a bit more. The issue is that we don't (or perhaps can't yet!) implement text rendering mode 3 - neither fill nor stroke text (invisible).

https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf

See PDF 32000-1:2008 section 9.3.6 Text Rendering Mode.
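
To illustrate what mode 3 looks like inside a page, here is a minimal Python sketch that scans an already-decoded content stream for the `Tr` operator and collects the strings shown while the mode is 3. It is not LibreOffice code and not how the importer works; the function name and the sample stream are made up for illustration, and the scanner is deliberately simplified (it ignores q/Q graphics-state save and restore, TJ arrays, hex strings, and nested parentheses).

```python
import re

# Text rendering mode 3 means "neither fill nor stroke text (invisible)",
# per PDF 32000-1:2008 section 9.3.6.
INVISIBLE_MODE = 3

# A token is either a literal string "( ... )" or a run of non-space characters.
TOKEN_RE = re.compile(r"\((?:[^()\\]|\\.)*\)|[^\s()]+")
NUMBER_RE = re.compile(r"[+-]?(?:\d+\.?\d*|\.\d+)")


def find_invisible_text(content_stream: str) -> list[str]:
    """Return literal strings shown with Tj while rendering mode 3 is active.

    Hypothetical helper for illustration only; expects a decoded
    (uncompressed) content stream passed in as text.
    """
    mode = 0          # the default text rendering mode is 0 (fill)
    operands = []     # operands seen since the last operator
    invisible = []

    for token in TOKEN_RE.findall(content_stream):
        is_operand = (
            token.startswith("(")
            or token.startswith("/")
            or NUMBER_RE.fullmatch(token) is not None
        )
        if is_operand:
            operands.append(token)
            continue

        # Anything else is treated as an operator that consumes the operands.
        if token == "Tr" and operands and operands[-1].isdigit():
            mode = int(operands[-1])
        elif token == "Tj" and operands and operands[-1].startswith("("):
            if mode == INVISIBLE_MODE:
                invisible.append(operands[-1][1:-1])
        operands = []

    return invisible


# Made-up sample stream: one invisible OCR run, then one visible run.
sample = "BT /F1 12 Tf 3 Tr (hidden ocr text) Tj 0 Tr (visible) Tj ET"
hidden = find_invisible_text(sample)
print(hidden)
```

An importer that wanted to match Acrobat's behaviour would presumably need to carry this mode along with each text run and either skip the run or mark it as hidden; which of those is right for Draw is an open question here.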
Comment 10 QA Administrators 2019-03-20 03:50:23 UTC Comment hidden (obsolete)
Comment 11 V Stuart Foote 2019-03-20 13:37:52 UTC
Issue remains with current master/6.3.0alpha0+.

However, as the OCR'd PDF is a bitmap, the text spans are annotations on that image. Showing the annotations on import to Draw--where the PDF is broken out into its component Draw elements--actually seems correct.

Inserting the PDF (pdfium based, but just the first page of PDF for now) renders the PDF page as an image. 

The inserted "Image" can be selected and, with "Break", split into its component text and the scanned newspaper page. After the break, the scanned image can be selected and removed, leaving just the OCR text as Draw text frames. It is slow, and the utility of this is questionable--but then manipulating PDF content is questionable. The text from a PDF is not intended to be manipulated.
Comment 12 Timur 2020-11-16 09:36:46 UTC
*** Bug 138243 has been marked as a duplicate of this bug. ***
Comment 13 V Stuart Foote 2020-11-16 13:53:51 UTC
@quikee, something fixable?
Comment 14 Timur 2022-01-29 15:21:24 UTC
Another example is attachment 165948 [details], where the text box "LAYERSCAPE LX2160A BLOCK DIAGRAM" is duplicated on page 2.
Comment 15 Timur 2022-03-09 18:11:43 UTC
*** Bug 138810 has been marked as a duplicate of this bug. ***
Comment 16 Timur 2022-09-09 14:45:11 UTC
*** Bug 146495 has been marked as a duplicate of this bug. ***
Comment 17 V Stuart Foote 2022-11-08 23:15:16 UTC
*** Bug 132493 has been marked as a duplicate of this bug. ***
Comment 18 m_a_riosv 2022-12-09 23:57:46 UTC
*** Bug 152448 has been marked as a duplicate of this bug. ***
Comment 19 Justin L 2024-11-06 18:18:21 UTC
Reproduced in 25.2+.

I suggest using ocrmypdf-ocr.pdf (attachment 178754 [details]) from duplicate bug 138810 because the other documents take a long time to load. Look at the last word on the page...