Bug 132493 - Offer means to handle import of PDF containing both raster image pages and OCR text
Status: RESOLVED DUPLICATE of bug 104770
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: LibreOffice
Version (earliest affected): Inherited From OOo
Hardware: All All
Importance: medium enhancement
Assignee: Not Assigned
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: PDF-Import-Draw
Reported: 2020-04-28 15:41 UTC by gerd.specht
Modified: 2022-11-08 23:15 UTC

See Also:
Crash report or crash signature:


Attachments
PDF from the bug description (10.72 MB, application/pdf)
2022-07-15 05:36 UTC, Aron Budea

Description gerd.specht 2020-04-28 15:41:22 UTC
Description:
I found a PDF that cannot be opened correctly; see the file linked below.

Steps to Reproduce:
1. See this file:
https://journals.ametsoc.org/doi/pdf/10.1175/1520-0477%281985%29066%3C0505%3APSASC%3E2.0.CO%3B2

Actual Results:
LibreOffice cannot display the first page correctly.
I cannot upload a picture on this page.

Expected Results:
LibreOffice shows a PDF with bad characters; please see the link:
https://journals.ametsoc.org/doi/pdf/10.1175/1520-0477%281985%29066%3C0505%3APSASC%3E2.0.CO%3B2 


Reproducible: Didn't try


User Profile Reset: No



Additional Info:
Version: 6.4.2.2 (x64)
Build-ID: 4e471d8c02c9c90f512f7f9ead8875b57fcb1ec3
CPU threads: 4; OS: Windows 6.1 Service Pack 1 Build 7601; UI render: default; VCL: win;
Locale: de-DE (de_DE); UI language: de-DE
Calc: threaded
Comment 1 V Stuart Foote 2020-04-28 22:26:06 UTC
The PDF opens fine. The issue is that it was prepared with OCR of the page images, so each page carries both a raster image and the OCR text runs.

You can remove the OCR by opening the file in your PDF viewer of choice and then printing the result back to PDF. Only the page images will be output, with none of the OCR text runs.

Alternatively, if you prefer or need the OCR results, you can get them with LibreOffice Draw. It is a manual process whereby, on each page of the imported PDF, you select the source page's image and delete it, leaving the OCR text runs behind.

But it would be convenient if the PDF import filter offered methods to strip out either the image or the OCR text when both are present.
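
To illustrate what such a stripping option would have to do, here is a minimal Python sketch that works on an already decoded PDF content stream. It is not LibreOffice code and not how the import filter is implemented; the names `TOKEN` and `strip_layer` are hypothetical. It assumes the OCR text sits in `BT ... ET` text objects (OCR layers often use text rendering mode `3 Tr`, i.e. invisible text) and that the page scan is painted with a `/Name Do` operation. Without the page's resource dictionary it cannot tell an image XObject from a form XObject, and it ignores inline images, comments, and nested parentheses in strings.

```python
import re

# Tokens of a *decoded* PDF content stream (simplified on purpose:
# no nested parentheses in strings, no inline images, no comments).
TOKEN = re.compile(
    r"""
    \((?:\\.|[^\\()])*\)      # literal string, e.g. (Hello)
  | <<|>>                     # dictionary delimiters
  | <[0-9A-Fa-f\s]*>          # hex string
  | [\[\]]                    # array delimiters
  | /[^\s/\[\]()<>]*          # name, e.g. /Im0
  | [^\s/\[\]()<>]+           # number or operator
    """,
    re.VERBOSE,
)


def strip_layer(stream: str, drop: str) -> str:
    """Hypothetical helper: return the stream without text objects
    (drop="text") or without XObject paint operations (drop="images")."""
    if drop not in ("text", "images"):
        raise ValueError("drop must be 'text' or 'images'")
    kept = []
    in_text = False
    for tok in TOKEN.findall(stream):
        if drop == "text":
            if tok == "BT":        # start of a text object
                in_text = True
            elif tok == "ET":      # end of a text object
                in_text = False
            elif not in_text:
                kept.append(tok)
        else:
            if tok == "Do" and kept and kept[-1].startswith("/"):
                kept.pop()         # discard the XObject name operand too
            else:
                kept.append(tok)
    return " ".join(kept)


# Made-up page content: one full-page scan plus one invisible OCR text run.
page = ("q 612 0 0 792 0 0 cm /Im0 Do Q "
        "BT 3 Tr /F1 10 Tf 72 700 Td (ET phone home) Tj ET")
print(strip_layer(page, "text"))    # image layer only
print(strip_layer(page, "images"))  # OCR text layer only
```

Dropping `"text"` corresponds to the print-to-PDF workaround, and dropping `"images"` corresponds to deleting the page image by hand in Draw.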
Comment 2 Aron Budea 2022-07-15 05:36:11 UTC
Created attachment 181270
PDF from the bug description

Let's add the linked PDF as attachment.
Comment 3 V Stuart Foote 2022-11-08 23:15:16 UTC

*** This bug has been marked as a duplicate of bug 104770 ***