152143 – Provide a mechanism to export PDF to text

Summary: Provide a mechanism to export PDF to text

Status:	RESOLVED DUPLICATE of bug 32249

Alias:	None

Product:	LibreOffice
Classification:	Unclassified
Component:	Draw
Version: (earliest affected)	7.5.0.0 alpha0+
Hardware:	All All

Importance:	medium normal
Assignee:	Not Assigned

URL:
Whiteboard:
Keywords:	needsUXEval

Depends on:
Blocks:	PDF

Reported:	2022-11-20 13:44 UTC by Hossein
Modified:	2022-11-21 09:39 UTC
CC List:	3 users

See Also:	104597 151598 117428
Crash report or crash signature:

Description Hossein 2022-11-20 13:44:35 UTC

Description:
Currently it is not possible to export PDF files loaded in LibreOffice (Draw) to text. This feature matters for Arabic, Persian, Hebrew and other RTL languages because very few applications convert PDF to text correctly for RTL text. Meanwhile, many problems with loading RTL PDF files in LibreOffice have either been fixed or are being fixed by Kevin. A working PDF->TXT converter for Arabic/Persian via LibreOffice Draw would therefore be very helpful.

Steps to Reproduce:
1. Invoke this command:
libreoffice7.4 --convert-to txt:Text test.pdf
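
For scripted reproduction, the invocation above could be wrapped roughly as follows. This is a minimal sketch, not part of the original report: it assumes a `soffice` binary on PATH (the versioned launcher name differs per install), and the helper names `build_convert_command` and `try_convert` are hypothetical. `--headless`, `--convert-to` and `--outdir` are standard LibreOffice command-line options.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List, Optional


def build_convert_command(pdf: Path, outdir: Path, binary: str = "soffice") -> List[str]:
    """Build the argv for a headless PDF -> text conversion attempt."""
    return [
        binary,                      # assumption: launcher name; adjust per install
        "--headless",
        "--convert-to", "txt:Text",  # same target filter as in the step above
        "--outdir", str(outdir),
        str(pdf),
    ]


def try_convert(pdf: Path, outdir: Path, binary: str = "soffice") -> Optional[Path]:
    """Run the conversion; return the output path, or None if no file was written."""
    if shutil.which(binary) is None:
        raise FileNotFoundError(f"{binary} not found on PATH")
    # check=False: the exit status is not relied on here, only the presence
    # of the output file is inspected afterwards.
    subprocess.run(build_convert_command(pdf, outdir, binary), check=False)
    out = outdir / (pdf.stem + ".txt")
    return out if out.exists() else None
```

`try_convert(Path("test.pdf"), Path("."))` returns `None` when no `test.txt` is produced, which is the failure this report describes.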

Actual Results:
The output is not created, and this error message is generated:

convert test.pdf -> test.txt using filter : Text
Error: Please verify input parameters... (SfxBaseModel::impl_store <file:///home/hossein/Projects/libreoffice/core/test.txt> failed: 0xc10(Error Area:Io Class:Write Code:16) at /home/buildslave/source/libo-core/sfx2/source/doc/sfxbasemodel.cxx:3207 at /home/buildslave/source/libo-core/sfx2/source/doc/sfxbasemodel.cxx:1783)

Expected Results:
The PDF should be converted to text.

Reproducible: Always


User Profile Reset: No


Additional Info:

Converting to text is not possible in LibreOffice 7.4:
Version: 7.4.0.3 / LibreOffice Community
Build ID: f85e47c08ddd19c015c0114a68350214f7066f5a
CPU threads: 8; OS: Linux 5.15; UI render: default; VCL: gtk3
Locale: en-US (en_US.UTF-8); UI: en-US
Calc: threaded

Also not in the latest LO 7.5 dev master:
Version: 7.5.0.0.alpha0+ (X86_64) / LibreOffice Community
Build ID: deb0bb9f2635a8dfec90b42e3727f4224548a8e9
CPU threads: 8; OS: Linux 5.15; UI render: default; VCL: gtk3
Locale: en-US (en_US.UTF-8); UI: en-US
Calc: threaded

Comment 1 V Stuart Foote 2022-11-20 19:10:33 UTC

(In reply to Hossein from comment #0)
> Description:
> Currently it is not possible to export PDF files loaded in LibreOffice
> (Draw) to text.

Not true. LO already has the 'Consolidate text' feature; see the work done for bug 118370 [1]. It is functional, just inconvenient, because the PDF imported text has to be moved to the Writer canvas for filter export. And this is a dupe of bug 32249, or at most of bug 151598 to implement 'Consolidate text' on the Writer canvas.

In a reasonable workflow, we now take an imported PDF (opened via Draw) onto the Draw VCL canvas. The textboxes representing the text streams read out of the PDF structures are placed discretely on that canvas.

So you can already select and consolidate entire pages of imported draw shape textboxes (by glyph index lookup in a ToUnicode CMAP) into a single draw shape textbox--a sentence or paragraph of text. And then select that text, copy it and paste it as needed. Then correct as lexically necessary.

Also, because PDF provides no lexical sense to the runs in a document (it is a published presentation format)--the discrete imported draw shape text boxes *must be selected in sequence* for a manual merge. That would remain the case working with draw shape textboxes on the Writer canvas and is a limitation of the published rendering encoded into PDF.
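
To illustrate why selection order matters, here is a toy sketch. It is not LibreOffice's 'Consolidate text' implementation: the `TextBox` type, the `consolidate` and `geometric_order` helpers, and the coordinates are all hypothetical, and y is assumed to grow downward. It shows that a naive top-to-bottom, left-to-right ordering interleaves a two-column page, while a deliberate selection sequence preserves the reading order.

```python
from typing import List, NamedTuple


class TextBox(NamedTuple):
    """Hypothetical stand-in for an imported draw shape textbox."""
    x: float
    y: float
    text: str


def consolidate(selection: List[TextBox]) -> str:
    """Merge boxes strictly in the order they were selected."""
    return " ".join(box.text for box in selection)


def geometric_order(boxes: List[TextBox]) -> List[TextBox]:
    """Naive ordering: by row (y) first, then by column (x)."""
    return sorted(boxes, key=lambda b: (b.y, b.x))


# A two-column layout: the left column should be read fully before the right one.
a = TextBox(0, 0, "left one")
b = TextBox(100, 0, "right one")
c = TextBox(0, 10, "left two")
d = TextBox(100, 10, "right two")

naive = consolidate(geometric_order([a, b, c, d]))   # columns get interleaved
manual = consolidate([a, c, b, d])                   # sequence chosen by the user
```

Here `naive` mixes the two columns line by line, whereas `manual` keeps each column intact. Nothing in the box positions alone says which reading is the intended one.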

PDF provides an /ActualText construct that could be used more effectively than index lookup on a Unicode CMAP. 
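
A toy sketch of the difference, under stated assumptions: the glyph IDs and the mapping below are invented for illustration, and `decode_run` is a hypothetical helper, not code from the LibreOffice or poppler import path. It only shows the idea that a per-glyph lookup returns whatever the font's map says (here a ligature code point), while /ActualText, when present, supplies the replacement text for the whole run.

```python
from typing import Dict, List, Optional


def decode_run(
    glyph_ids: List[int],
    to_unicode: Dict[int, str],
    actual_text: Optional[str] = None,
) -> str:
    """Decode one text run.

    Prefer the run's /ActualText when it exists; otherwise fall back to
    a per-glyph lookup, using U+FFFD for glyphs missing from the map.
    """
    if actual_text is not None:
        return actual_text
    return "".join(to_unicode.get(g, "\ufffd") for g in glyph_ids)


# Hypothetical map: glyph 2 is an "ffi" ligature mapped to the single
# code point U+FB03.
to_unicode = {1: "o", 2: "\ufb03", 3: "c", 4: "e"}

by_lookup = decode_run([1, 2, 3, 4], to_unicode)
by_actual_text = decode_run([1, 2, 3, 4], to_unicode, actual_text="office")
```

With these inputs `by_lookup` contains the ligature character, and `by_actual_text` is the plain word.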

For bug 66597, the LibreOffice PDF export filter already writes the /ActualText construct [2] on PDF creation, but only at the level of the grapheme cluster run. Bug 117428 is open to refactor PDF export to provide /ActualText at the word boundary.

What is unclear is how our poppler PDF import filter(s) would need to be refactored to use the lexical details to load draw shape textboxes with /ActualText--for roundtrip, or for import of PDF from other sources.

Doing more efficient and high fidelity text extraction from PDF into ODF paragraphs is the end goal of bug 32249. 

Export of lexically correct word, sentence or paragraph to other document formats then becomes routine export filtering that is already present. 

=-ref-=
[1] https://gerrit.libreoffice.org/c/core/+/75043/
[2] https://gerrit.libreoffice.org/c/core/+/53315/

*** This bug has been marked as a duplicate of bug 32249 ***

Comment 2 Hossein 2022-11-20 21:00:39 UTC

I don't think this is a duplicate of tdf#32249. The title of that one is:

  Bug 32249
  "When importing PDF with text in it , it will be better to have a easy
  and fluent option to edit the imported Text".

So, the above issue is basically about being able to edit the text. Here I am talking about being able to export the PDF as a text file. These are obviously different, even if their implementations have parts in common.

> So you can already select and consolidate entire pages of imported draw shape
> textboxes (by glyph index lookup in a ToUnicode CMAP) into a single draw
> shape textbox--a sentence or paragraph of text. And then select that text,
> copy it and paste it as needed. Then correct as lexically necessary.
I disagree. This is not what was intended in this feature request. I have specifically requested a means of exporting the whole PDF document as a text file, both via the UI and the command line. The above consolidation feature might help internally when implementing such a feature, but it is not what I have asked for.

> Also, because PDF provides no lexical sense to the runs in a document (it is a 
> published presentation format)--the discrete imported draw shape text boxes
> *must be selected in sequence* for a manual merge. That would remain the case
> working with draw shape textboxes on the Writer canvas and is a limitation of
> the published rendering encoded into PDF.
I disagree again. We have text boxes in LibreOffice, MS Office and elsewhere, and we can still export their contents to text files. I have not requested smart software that understands the meaning of the document. The goal is to export the contents to a text file.

> Doing more efficient and high fidelity text extraction from PDF into ODF
> paragraphs is the end goal of bug 32249.
>
> Export of lexically correct word, sentence or paragraph to other document
> formats then becomes routine export filtering that is already present. 
Even accepting this implementation path, this feature request depends on tdf#32249; it is not a duplicate of it.

Comment 3 V Stuart Foote 2022-11-20 21:36:47 UTC

Nope. That in effect is asking for LibreOffice to become a utility for parsing the content of PDF. We are not a PDF editor, nor do we provide utilities to manipulate PDF.

We deal in ODF and in external formats via appropriate filters--import and export.

There is no ODF or project requirement to convert PDF to external text formats. There is a requirement, as in bug 32249, for the import filter to fully parse PDF structure into usable text runs/sentences/paragraphs.
 
If you have a requirement--construct it within that context and it would be valid.

Anything else is OUT OF SCOPE.

Comment 4 Heiko Tietze 2022-11-21 09:39:18 UTC

(In reply to V Stuart Foote from comment #3)
> We are not a PDF editor, nor do we provide
> utilities to manipulate PDF.

+1