We have OCR'ed newspapers and tried to open one of them in LibreOffice. Unfortunately, LibreOffice shows the underlying selectable text as green text, whereas in Adobe Reader the text is invisible until you select it.
Steps to Reproduce:
Open the attached PDF in LibreOffice
Open the attached PDF in Acrobat Reader
Now save the document to PDF, then open it in Acrobat Reader
Note the difference: in Acrobat Reader no text shows until you select it, whereas in LibreOffice the text is immediately visible (and an odd shade of green).
If you then export the document as a PDF and open it in Acrobat Reader it now shows all the text as green over the top of the newspaper archive.
I'm expecting the text to remain invisible.
User Profile Reset: No - don't believe this is relevant.
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36
Created attachment 129755 [details]
Created attachment 129756 [details]
Problematic PDF - test document
Created attachment 129757 [details]
The PDF opened in LibreOffice Writer 5.2.2
Created attachment 129758 [details]
The PDF opened in Acrobat Reader
Created attachment 129759 [details]
The PDF opened in LibreOffice Writer 5.2.2
Note: LibreOffice is not a PDF reader. It converts the PDF to its own format.
Arch Linux 64-bit, KDE Plasma 5
Build ID: db9aec4520766c87a09d4cb0238ed06ebaeaaeeb
CPU Threads: 8; OS Version: Linux 4.8; UI Render: default; VCL: kde4;
Locale: fi-FI (fi_FI.UTF-8); Calc: group
Built on December 18th 2016
Putting back to NEW as there's no assignee to this bug
*** Bug 116442 has been marked as a duplicate of this bug. ***
I've looked into this a bit more. The issue is that we don't (or perhaps can't yet!) implement text rendering mode 3 - neither fill nor stroke text (invisible).
See PDF 32000-1:2008 section 9.3.6 Text Rendering Mode.
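For illustration only, here is a minimal sketch of how the invisible mode shows up in a page content stream: the `Tr` operator sets the text rendering mode, and an operand of 3 means the text is neither filled nor stroked. This is not LibreOffice code. The function names `text_rendering_modes` and `has_invisible_text` are made up for the example, and it assumes the stream has already been decompressed. It skips literal strings but does not handle hex strings, comments, or inline images.

```python
def text_rendering_modes(content: bytes) -> list[int]:
    """Return the operand of every `Tr` operator in an uncompressed content stream."""
    tokens = []
    i, n = 0, len(content)
    while i < n:
        c = content[i:i + 1]
        if c == b"(":
            # Skip a literal string, honouring nested parentheses and backslash
            # escapes, so that text such as "(3 Tr)" is not mistaken for an operator.
            depth = 1
            i += 1
            while i < n and depth:
                ch = content[i:i + 1]
                if ch == b"\\":
                    i += 2
                    continue
                if ch == b"(":
                    depth += 1
                elif ch == b")":
                    depth -= 1
                i += 1
            tokens.append(b"(str)")
        elif c.isspace():
            i += 1
        else:
            # Read one whitespace-delimited token (operand or operator).
            j = i
            while j < n and not content[j:j + 1].isspace() and content[j:j + 1] != b"(":
                j += 1
            tokens.append(content[i:j])
            i = j

    # A `Tr` operator is preceded by its single integer operand.
    return [int(prev) for prev, tok in zip(tokens, tokens[1:])
            if tok == b"Tr" and prev.isdigit()]


def has_invisible_text(content: bytes) -> bool:
    """True if the stream ever switches to mode 3 (neither fill nor stroke)."""
    return 3 in text_rendering_modes(content)


# Hypothetical stream resembling an OCR text layer drawn over a scanned page.
sample = b"BT /F1 12 Tf 3 Tr 72 700 Td (3 Tr is only text here) Tj ET"
print(text_rendering_modes(sample), has_invisible_text(sample))
```

An importer that honours the mode would need to carry this state along with each text span, for example by making the resulting text fully transparent rather than giving it a visible fill colour.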
** Please read this message in its entirety before responding **
To make sure we're focusing on the bugs that affect our users today, LibreOffice QA is asking bug reporters and confirmers to retest open, confirmed bugs which have not been touched for over a year.
There have been thousands of bug fixes and commits since anyone checked on this bug report. During that time, it's possible that the bug has been fixed, or the details of the problem have changed. We'd really appreciate your help in getting confirmation that the bug is still present.
If you have time, please do the following:
Test to see if the bug is still present with the latest version of LibreOffice from https://www.libreoffice.org/download/
If the bug is present, please leave a comment that includes the information from Help - About LibreOffice.
If the bug is NOT present, please set the bug's Status field to RESOLVED-WORKSFORME and leave a comment that includes the information from Help - About LibreOffice.
Please DO NOT
Update the version field
Reply via email (please reply directly on the bug tracker)
Set the bug's Status field to RESOLVED - FIXED (this status has a particular meaning that is not appropriate in this case)
If you want to do more to help you can test to see if your issue is a REGRESSION. To do so:
1. Download and install the oldest version of LibreOffice (usually 3.3 unless your bug pertains to a feature added after 3.3) from http://downloadarchive.documentfoundation.org/libreoffice/old/
2. Test your bug
3. Leave a comment with your results.
4a. If the bug was present with 3.3 - set version to 'inherited from OOo';
4b. If the bug was not present in 3.3 - add 'regression' to keyword
Feel free to come ask questions or to say hello in our QA chat: https://kiwiirc.com/nextclient/irc.freenode.net/#libreoffice-qa
Thank you for helping us make LibreOffice even better for everyone!
Issue remains with current master/6.3.0alpha0+.
However, as the OCR'd PDF is a bitmap, the text spans are annotations on that image. Showing the annotations on import to Draw, where the PDF is broken out into its component Draw elements, actually seems correct.
Inserting the PDF (pdfium based, but just the first page of PDF for now) renders the PDF page as an image.
The inserted "Image" can be selected and, with "Break", split into its component text and the scanned newspaper page. After the break, the scanned image can be selected and removed, leaving just the OCR text as Draw text frames. It is slow, and the utility of this is questionable, but then manipulating PDF content is questionable. The text from a PDF is not intended to be manipulated.
*** Bug 138243 has been marked as a duplicate of this bug. ***
@quikee, something fixable?
Another example is attachment 165948 [details], where the text box "LAYERSCAPE LX2160A BLOCK DIAGRAM" is duplicated on page 2.
*** Bug 138810 has been marked as a duplicate of this bug. ***
*** Bug 146495 has been marked as a duplicate of this bug. ***
*** Bug 132493 has been marked as a duplicate of this bug. ***
*** Bug 152448 has been marked as a duplicate of this bug. ***