Bug 32249 - When importing a PDF with text in it, it would be better to have an easy and fluent option to edit the imported text
Status: NEW
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: LibreOffice
Version (earliest affected): 3.3.0 RC1
Hardware: All All
Importance: medium enhancement
Assignee: Not Assigned
URL:
Whiteboard:
Keywords:
Duplicates: 38084 84712 91896 93039 105274 125838
Depends on:
Blocks: PDF-Import-Draw
Reported: 2010-12-08 22:51 UTC by grigoreflorin1985
Modified: 2019-07-22 05:11 UTC (History)
CC List: 17 users

See Also:
Crash report or crash signature:


Attachments
PDF_import_testDoc.odg: exploring what combining textboxes could look like (21.61 KB, application/vnd.oasis.opendocument.graphics)
2019-06-27 18:38 UTC, Justin L
Draw-add-option-to-consolidate-multiple-textboxes.patch (15.71 KB, patch)
2019-07-03 13:02 UTC, Justin L

Description grigoreflorin1985 2010-12-08 22:51:29 UTC
When I import a PDF with text in it, I can only edit each paragraph separately, one at a time. What I need (and what most users would expect) is to edit all the paragraphs together, as in a normal office document. Could there be an option to merge the paragraphs at import time so they can be edited easily? That is, unify all paragraphs on a page so they are editable as one continuous text. It is time-consuming to click and edit every paragraph one at a time. I tried the union command from the right-click menu on them, but I got a plain graphic whose text cannot be edited with the text tool (the big T icon).

I hope this small option makes it into the final release.
Comment 1 Rainer Bielefeld Retired 2010-12-09 08:58:10 UTC
I have also sometimes wished for such a feature, but I'm afraid it would cost too much manpower. Compared to other needs, it definitely deserves no more than Importance "Medium", and I doubt that it will ever be integrated.
Comment 2 Samuele Kaplun 2011-05-06 07:30:12 UTC
Hi,

I am a developer of digital library software and, aiming to support digital preservation, I was thinking of exploiting the wonderful PDF import filter of LibreOffice to archive an .odt document next to the original .pdf (as the .odt document should provide more value for future retrieval and use of the document).

Indeed, I also find this a very nice feature to have, and it should be possible to implement it via some heuristic, such as merging consecutive lines that are not too far from each other (e.g. no further apart than the height of a character).
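
A minimal sketch of that heuristic follows. It is an illustration only, not code from the LibreOffice importer: the `Line` type, its fields, and `merge_lines` are hypothetical names, and coordinates are assumed to be measured downward from the top of the page.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Line:
    """One imported text line (hypothetical type, not a LibreOffice class)."""
    text: str
    top: float     # distance of the line's top edge from the top of the page
    height: float  # character height of the line


def merge_lines(lines: list[Line]) -> list[str]:
    """Group lines into paragraphs when the gap to the previous line is small."""
    groups: list[list[Line]] = []
    for line in sorted(lines, key=lambda item: item.top):
        if groups:
            prev = groups[-1][-1]
            gap = line.top - (prev.top + prev.height)
            # Merge when the vertical gap is no larger than the character height.
            if gap <= prev.height:
                groups[-1].append(line)
                continue
        groups.append([line])
    return [" ".join(item.text for item in group) for group in groups]


lines = [
    Line("When importing a PDF,", top=100.0, height=12.0),
    Line("each line is a separate box.", top=114.0, height=12.0),
    Line("A new paragraph starts here.", top=160.0, height=12.0),
]
print(merge_lines(lines))
```

The threshold of one character height is taken from the suggestion above; a real document would likely need it tuned, and multi-column layouts would need horizontal checks as well.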

If no one has time to work on it, I'd be glad to give it a try in my spare time, if someone could be so kind as to point me to the most appropriate source code files that would need to be touched.

Cheers!
Comment 3 Rainer Bielefeld Retired 2011-05-06 08:17:14 UTC
@Samuele Kaplun:
That would be great. 

I'm afraid that won't be easy. I do not know how it works on other operating systems, but on Windows I have to install the "Oracle PDF Import Extension" from 
<http://extensions.services.openoffice.org/en/search/node/pdf import>, which itself, AFAIK, uses XPDF <http://foolabs.com/xpdf/about.html> as its text extractor.

That's all I can contribute.

BTW: Version is for the first version where the problem has been observed!
Comment 4 Samuele Kaplun 2011-05-06 08:34:08 UTC
(In reply to comment #3)
> @Samuele Kaplun:
> I'm afraid that won't be easy. I do not know how that works for other OS, but
> for WIN I have to install the "Oracle PDF Import Extension" from 
> <http://extensions.services.openoffice.org/en/search/node/pdf import>, what
> itself afaik uses XPDF <http://foolabs.com/xpdf/about.html> as text extractor.

From <http://www.libreoffice.org/features/extensions/> I understand that finally this extension is part of the core LibreOffice source tree. Is this so?

> That's all I can contribute.
> 
> BTW: Version is for the first version where the problem has been observed!

Sorry for this!! That makes perfect sense!
Comment 5 Rainer Bielefeld Retired 2011-05-06 10:50:21 UTC
> From <http://www.libreoffice.org/features/extensions/> I understand that
> finally this extension is part of the core LibreOffice source tree. Is this so?

I thought so, too, but for my 3.4 I definitely had to download the extension. Please see 
<https://bugs.freedesktop.org/show_bug.cgi?id=35604#c6>
Comment 6 Björn Michaelsen 2011-12-23 11:33:26 UTC Comment hidden (obsolete)
Comment 7 Rainer Bielefeld Retired 2011-12-23 23:33:18 UTC
It was set to NEW for good reasons. But the question is whether there is a realistic chance of getting this enhancement.
Comment 8 vilpan 2013-05-01 18:41:14 UTC
*** Bug 38084 has been marked as a duplicate of this bug. ***
Comment 9 sophie 2014-10-09 11:21:36 UTC
*** Bug 84712 has been marked as a duplicate of this bug. ***
Comment 10 QA Administrators 2014-10-23 17:31:40 UTC Comment hidden (obsolete)
Comment 11 Gerry 2015-04-23 16:57:59 UTC
I can confirm this bug in LO 4.4.2.2 on Windows 7.
Comment 12 Jean-Baptiste Faure 2015-07-01 18:06:35 UTC
*** Bug 91896 has been marked as a duplicate of this bug. ***
Comment 13 Hendrik Maryns 2015-11-22 08:38:23 UTC
Is there no bug voting?  This is a major turndown!
Comment 14 m.a.riosv 2017-01-13 09:30:38 UTC
*** Bug 105274 has been marked as a duplicate of this bug. ***
Comment 15 m.a.riosv 2017-01-13 09:31:54 UTC
*** Bug 93039 has been marked as a duplicate of this bug. ***
Comment 16 V Stuart Foote 2017-01-13 14:05:25 UTC
LibreOffice provides functional filter import of PDF into Draw (the default Open action), and into Impress, Writer, or Draw by import filter selection.

With each filter selected, the rendering to the respective document canvas follows the structure of the document as recorded within the PDF, and text elements are rendered into styled Text boxes or Frames.

The PDF filter(s) do not "reflow" text into Paragraph objects. That would require a very complex treatment of the PDF structure to reliably extract syntax and layout--at the expense of fidelity rendering the PDF document.

Replacing or supplementing the PDF filters to provide "reflow" back into paragraphs is seen as out-of-scope for the project as we are not a PDF editor.

The core PDF filters and function are sufficient to our needs of high rendering fidelity.

This is fertile ground for an extension.
Comment 17 Justin L 2019-06-21 18:29:39 UTC
*** Bug 125838 has been marked as a duplicate of this bug. ***
Comment 18 Justin L 2019-06-27 18:38:15 UTC
Created attachment 152450
PDF_import_testDoc.odg: exploring what combining textboxes could look like

I agree with Stuart's conclusion that monkeying with import to make larger textboxes would be disastrous. So I only see one reasonable option and that is a function that allows a user to combine selected textboxes into one textbox.

However, the results won't be pretty. Each character attribute change (size, bold, font, etc.) becomes a separate textbox, and there is no way to identify whether that ends the paragraph or not, although some content analysis guesswork could approximate the majority of cases I guess. In any case, a LOT of cleanup would be needed to reformat the text, since each character run is treated as a separate paragraph and all paragraph spacing information is missing.
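
To make the "content analysis guesswork" concrete, here is a rough sketch of one possible guess. It is not the approach taken in any LibreOffice patch: `consolidate_runs` is a hypothetical helper, and the rule that a run ends a paragraph only when the accumulated text ends in sentence punctuation is an assumption chosen for illustration.

```python
from __future__ import annotations


def consolidate_runs(runs: list[str]) -> str:
    """Join character runs into paragraphs using a punctuation-based guess."""
    paragraphs: list[str] = []
    current = ""
    for run in runs:
        current += run
        # Assumption: text ending in sentence punctuation closes a paragraph.
        if current.rstrip().endswith((".", "!", "?")):
            paragraphs.append(current.strip())
            current = ""
    # Keep any trailing text that never reached sentence punctuation.
    if current.strip():
        paragraphs.append(current.strip())
    return "\n".join(paragraphs)


# "bold" stands for a run that was split off by a character attribute change.
runs = ["This is ", "bold", " text.", "Second paragraph"]
print(consolidate_runs(runs))
```

This guess still breaks every sentence into its own paragraph, so it only reduces the cleanup rather than removing it.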

The other option is to force the user to create their own textbox and copy/paste the text from the PDF itself, but in that case all the character properties are lost. So there does still seem to be an advantage of consolidating textboxes into one, even if many excess paragraph markers need to be deleted.
Comment 19 V Stuart Foote 2019-06-27 23:56:42 UTC
(In reply to Justin L from comment #18)
> ... So I only see one reasonable option and that
> is a function that allows a user to combine selected textboxes into one
> textbox.
> 

Yes, agree that would be an acceptable way to handle PDF source text runs extracted from BT/ET blocks, or where /ActualText annotation is present.
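
For readers unfamiliar with BT/ET blocks, the following sketch pulls literal strings shown with the `Tj` operator out of an already decompressed content stream. It is a deliberately naive illustration, not how the poppler or pdfium based filters work: `text_runs` is a hypothetical helper, and it assumes simple `(string) Tj` operands without escaped parentheses, ignoring `TJ` arrays, font encodings, and /ActualText.

```python
from __future__ import annotations

import re


def text_runs(content_stream: str) -> list[str]:
    """Collect the literal strings shown by Tj inside each BT ... ET block."""
    runs: list[str] = []
    # Each BT ... ET pair delimits one text object in the content stream.
    for block in re.findall(r"\bBT\b(.*?)\bET\b", content_stream, flags=re.S):
        # Assumption: operands are plain "(text) Tj" with no escaped parentheses.
        runs.extend(re.findall(r"\((.*?)\)\s*Tj", block))
    return runs


stream = (
    "BT /F1 12 Tf 72 720 Td (Hello) Tj ET\n"
    "BT /F1 12 Tf 72 700 Td (world) Tj ET"
)
print(text_runs(stream))
```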

But why first extract the text runs into Draw Text boxes, and then merge them into one or more non-formattable Draw Text boxes? Seems like a different filter import of the PDF text runs is needed.

Dumping the strings out to a Writer Paragraph object, either in bulk or interactively, would be more functional. And text runs dumped into a Paragraph object would allow assignment of direct formatting or style, with text validation and word and line break cleanup.

A more efficient UI could probably evolve if this were done as a pop-out dialog to pick the Draw Text box snippets, but one could also spin up a full Writer session and do the same.

More often than not, folks simply want to reflow the text strings back into their lexicographically correct sequence without too great a concern as to original formatting of the source document generating the PDF.
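
As a rough sketch of what putting the strings back into sequence could mean, the snippet below sorts text snippets top to bottom and then left to right. The `Snippet` type, `reading_order`, and the `line_tolerance` parameter are hypothetical, and a single-column page with coordinates measured from the top is assumed.

```python
from __future__ import annotations

from typing import NamedTuple


class Snippet(NamedTuple):
    """One text box from the import (hypothetical type)."""
    text: str
    x: float  # distance from the left edge of the page
    y: float  # distance from the top of the page


def reading_order(snippets: list[Snippet], line_tolerance: float = 3.0) -> list[str]:
    """Return one string per visual row, ordered top to bottom, left to right."""
    rows: list[list[Snippet]] = []
    for snippet in sorted(snippets, key=lambda s: s.y):
        # Snippets whose y positions are close together share a row.
        if rows and abs(snippet.y - rows[-1][0].y) <= line_tolerance:
            rows[-1].append(snippet)
        else:
            rows.append([snippet])
    return [" ".join(s.text for s in sorted(row, key=lambda s: s.x)) for row in rows]


snippets = [
    Snippet("world", x=60.0, y=10.5),
    Snippet("Second", x=10.0, y=30.0),
    Snippet("Hello", x=10.0, y=10.0),
    Snippet("line", x=70.0, y=29.0),
]
print(reading_order(snippets))
```

All formatting is discarded here, which matches the point that original formatting is usually not the main concern in this use case.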

We can't do that with much fidelity to the original source--so why bother?

Our other 'pdfium' based "insert" filter provides the text runs to document canvas with good fidelity to the original layout. Though the object "break" there has similar issues to the 'poppler' based import filter for text handling.
Comment 20 Justin L 2019-06-28 05:24:00 UTC
(In reply to V Stuart Foote from comment #19)
> But why first extract the text runs into Draw Text boxes? Seems like a
> different filter import of the PDF text runs is needed.
Yes, that sounds like it would be perfect, but 100x more complex to code.
Comment 21 Justin L 2019-07-03 13:02:54 UTC
Created attachment 152531
Draw-add-option-to-consolidate-multiple-textboxes.patch

(In reply to Justin L from comment #18)
> I only see one reasonable option and that is a function that
> allows a user to combine selected textboxes into one textbox.

Proposed patch: https://gerrit.libreoffice.org/75043
    Draw: add option to consolidate multiple textboxes into one
Comment 22 Justin L 2019-07-22 05:11:36 UTC
(In reply to Justin L from comment #21)
This patch has landed in LO 6.4. Please use bug 118370 to follow the implementation of a "Shapes - Consolidate Text" function that gives the user a tool to combine multiple textboxes into one.

For this bug report, let's keep the discussion to the bigger request to add a "text content focused" PDF import, rather than a layout focused import as discussed in comment 19.