When I import a PDF containing text, I can only edit each paragraph separately, one at a time. What I need (and what most users would expect) is to edit all the paragraphs together, as in a normal office document. Could there be an option to join the paragraphs so that they can be edited easily from the start, i.e. unify all paragraphs on a page into one editable block? It is time-consuming to click and edit every paragraph one at a time. I tried joining them from the right-click menu, but I got a plain graphic that cannot be edited with the text tool (the big T icon).
I hope this small option makes it into the final release.
I have also sometimes wished for such a feature, but I'm afraid it would cost too much manpower. Compared to other needs, this is definitely not more than Importance "Medium", and I doubt that it will ever be integrated.
I am a developer of digital library software and, with the aim of supporting digital preservation, I was thinking of exploiting the wonderful PDF import filter of LibreOffice to archive an .odt document next to the original .pdf (as the .odt document should provide more value for future retrieval and use of the document).
Indeed, I also find this a very nice feature to have, and it should be possible to implement it via some heuristic, such as merging subsequent lines that are not too far from each other (e.g. no further apart than the height of a character).
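A minimal sketch of that heuristic, in Python for illustration only. The `Line` type, its `top`/`height` fields, and the `merge_lines` function are hypothetical names, not part of the LibreOffice import filter; coordinates are assumed to grow downward, and the threshold is simply the character height of the previous line.

```python
from dataclasses import dataclass


@dataclass
class Line:
    text: str
    top: float      # y coordinate of the top of the line (assumed to grow downward)
    height: float   # character height of the line


def merge_lines(lines):
    """Group lines into paragraphs when the vertical gap to the
    previous line is at most one character height."""
    paragraphs = []
    current = []
    prev = None
    for line in sorted(lines, key=lambda l: l.top):
        if prev is not None:
            gap = line.top - (prev.top + prev.height)
            if gap > prev.height:
                # Gap is larger than a character: close the paragraph.
                paragraphs.append(" ".join(current))
                current = []
        current.append(line.text)
        prev = line
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


lines = [
    Line("First line", top=0, height=10),
    Line("continues here.", top=12, height=10),
    Line("New paragraph.", top=40, height=10),
]
paragraphs = merge_lines(lines)
```

A real implementation would probably also need to look at horizontal position (columns, indentation) and font changes, which this sketch ignores.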
If no one has time to work on it, I'd be glad to give it a try in my spare time, if someone could kindly point me to the most appropriate source code files that would need to be touched.
That would be great.
I'm afraid that won't be easy. I do not know how that works for other OSes, but on Windows I have to install the "Oracle PDF Import Extension" from
<http://extensions.services.openoffice.org/en/search/node/pdf import>, which itself, AFAIK, uses XPDF <http://foolabs.com/xpdf/about.html> as its text extractor.
That's all I can contribute.
BTW: the Version field is for the first version where the problem has been observed!
(In reply to comment #3)
> @Samuele Kaplun:
> I'm afraid that won't be easy. I do not know how that works for other OS, but
> for WIN I have to install the "Oracle PDF Import Extension" from
> <http://extensions.services.openoffice.org/en/search/node/pdf import>, what
> itself afaik uses XPDF <http://foolabs.com/xpdf/about.html> as text extractor.
From <http://www.libreoffice.org/features/extensions/> I understand that finally this extension is part of the core LibreOffice source tree. Is this so?
> That's all I can contribute.
> BTW: Version is for the first version where the problem has been observed!
Sorry for this!! That makes perfect sense!
> From <http://www.libreoffice.org/features/extensions/> I understand that
> finally this extension is part of the core LibreOffice source tree. Is this so?
I thought so, too, but for my 3.4 I definitely had to download the extension. Pls see
[This is an automated message.]
This bug was filed before the changes to Bugzilla on 2011-10-16. Thus it
started right out as NEW without ever being explicitly confirmed. The bug is
changed to state NEEDINFO for this reason. To move this bug from NEEDINFO back
to NEW please check if the bug still persists with the 3.5.0 beta1 or beta2 prereleases.
Details on how to test the 3.5.0 beta1 can be found at:
more detail on this bulk operation: http://nabble.documentfoundation.org/RFC-Operation-Spamzilla-tp3607474p3607474.html
It was NEW for good reasons. But the question is whether there is a realistic chance of getting this enhancement.
*** Bug 38084 has been marked as a duplicate of this bug. ***
*** Bug 84712 has been marked as a duplicate of this bug. ***
Please read this message in its entirety before responding.
Your bug was confirmed at least 1 year ago and has not had any activity on it for over a year. Your bug is still set to NEW which means that it is open and confirmed. It would be nice to have the bug confirmed on a newer version than the version reported in the original report to know that the bug is still present -- sometimes a bug is inadvertently fixed over time and just never closed.
If you have time please do the following:
1) Test to see if the bug is still present on a currently supported version of LibreOffice (preferably 4.2 or newer).
2) If it is present please leave a comment telling us what version of LibreOffice and your operating system.
3) If it is NOT present please set the bug to RESOLVED-WORKSFORME and leave a short comment telling us your version and Operating System
Please DO NOT
1) Update the version field
2) Reply via email (please reply directly on the bug tracker)
3) Set the bug to RESOLVED - FIXED (this status has a particular meaning that is not appropriate in this case)
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
LibreOffice is powered by a team of volunteers, every bug is confirmed (triaged) by human beings who mostly give their time for free. We invite you to join our triaging by checking out this link:
There are also other ways to get involved including with marketing, UX, documentation, and of course developing - http://www.libreoffice.org/get-help/mailing-lists/.
Lastly, good bug reports help tremendously in making the process go smoother, please always provide reproducible steps (even if it seems easy) and attach any and all relevant material
I can confirm this bug in LO 22.214.171.124. on Windows 7
*** Bug 91896 has been marked as a duplicate of this bug. ***
Is there no bug voting? This is a major letdown!
*** Bug 105274 has been marked as a duplicate of this bug. ***
*** Bug 93039 has been marked as a duplicate of this bug. ***
LibreOffice provides functional filter import of PDF into Draw (the default Open action), and into Impress, Writer, or Draw by import filter selection.
With each filter, the rendering to the respective document canvas follows the structure of the document as recorded within the PDF, and text elements are rendered into styled Text boxes or Frames.
The PDF filter(s) do not "reflow" text into Paragraph objects. That would require a very complex treatment of the PDF structure to reliably extract syntax and layout, at the expense of fidelity in rendering the PDF document.
Replacing or supplementing the PDF filters to provide "reflow" back into paragraphs is seen as out-of-scope for the project as we are not a PDF editor.
The core PDF filters and function are sufficient to our needs of high rendering fidelity.
This is fertile ground for an extension.
*** Bug 125838 has been marked as a duplicate of this bug. ***
Created attachment 152450 [details]
PDF_import_testDoc.odg: exploring what combining textboxes could look like
I agree with Stuart's conclusion that monkeying with import to make larger textboxes would be disastrous. So I only see one reasonable option and that is a function that allows a user to combine selected textboxes into one textbox.
However, the results won't be pretty. Each character attribute change (size, bold, font, etc.) becomes a separate textbox, and there is no way to identify whether that ends the paragraph or not, although some content analysis guesswork could approximate the majority of cases I guess. In any case, a LOT of cleanup would be needed to reformat the text, since each character run is treated as a separate paragraph and all paragraph spacing information is missing.
The other option is to force the user to create their own textbox and copy/paste the text from the PDF itself, but in that case all the character properties are lost. So there does still seem to be an advantage of consolidating textboxes into one, even if many excess paragraph markers need to be deleted.
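To make the trade-off concrete, here is a rough sketch of what such a consolidation with "content analysis guesswork" might look like. It is written in Python purely for illustration and is not the code of any patch; the `Run` type, its `line` field, and the `consolidate` function are hypothetical. The assumption is that a run only ends a paragraph when it ends in sentence punctuation and the next run starts on a different visual line; character properties (here just `bold`) are kept per run.

```python
from dataclasses import dataclass


@dataclass
class Run:
    text: str
    bold: bool   # stands in for any character attribute (size, font, ...)
    line: int    # hypothetical index of the visual line the run sits on


def consolidate(runs):
    """Combine runs into paragraphs, keeping each run's attributes.

    A new paragraph is started only if the previous run ends a
    sentence AND the next run begins on another visual line.
    """
    paragraphs = [[]]
    prev = None
    for run in runs:
        if prev is not None:
            ends_sentence = prev.text.rstrip().endswith((".", "!", "?", ":"))
            new_line = run.line != prev.line
            if ends_sentence and new_line:
                paragraphs.append([])
        paragraphs[-1].append(run)
        prev = run
    return [p for p in paragraphs if p]


runs = [
    Run("This is ", bold=False, line=0),
    Run("bold", bold=True, line=0),
    Run(" text that", bold=False, line=0),
    Run("wraps.", bold=False, line=1),
    Run("Next paragraph", bold=False, line=2),
]
result = consolidate(runs)
```

The guess will be wrong whenever a sentence happens to end exactly at a line end inside a paragraph, so some manual cleanup of paragraph markers would still be needed.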
(In reply to Justin L from comment #18)
> ... So I only see one reasonable option and that
> is a function that allows a user to combine selected textboxes into one
Yes, agree that would be an acceptable way to handle PDF source text runs extracted from BT/ET blocks, or where /ActualText annotation is present.
But why first extract the text runs into Draw Text boxes, and then merge them into one or more non-formattable Draw Text boxes? It seems that a different filter import of the PDF text runs is needed.
Dumping the strings out to a Writer Paragraph object, either in bulk or interactively, would be more functional. And text runs dumped into a Paragraph object would allow assignment of direct formatting or style, with text validation and word and line break cleanup.
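As an illustration of the kind of word and line break cleanup meant here, a small Python sketch follows. The `reflow` function is a hypothetical helper, not an existing LibreOffice API, and the de-hyphenation rule (a trailing hyphen followed by a lowercase letter is treated as a soft line-end hyphen) is an assumption that will misjoin genuine compounds such as "well-" / "known".

```python
def reflow(lines):
    """Join extracted text lines into one paragraph string.

    Line breaks become single spaces; a word hyphenated across a
    line end is rejoined when the next line starts in lowercase.
    """
    text = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip empty strings
        if text.endswith("-") and line[0].islower():
            # Assumed soft hyphen at line end: drop it and join the halves.
            text = text[:-1] + line
        elif text:
            text += " " + line
        else:
            text = line
    return text


paragraph = reflow(["The import fil-", "ter splits text", "into lines."])
```

In an interactive workflow the user would presumably review such joins, which is where the text validation step would come in.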
A more efficient UI could probably evolve if this were done as a pop-out dialog to pick the Draw Text box snippets, but one could also spin up a full Writer session and do the same.
More often than not, folks simply want to reflow the text strings back into their lexicographically correct sequence without too great a concern as to original formatting of the source document generating the PDF.
We can't do that with much fidelity to the original source--so why bother?
Our other 'pdfium' based "insert" filter provides the text runs to document canvas with good fidelity to the original layout. Though the object "break" there has similar issues to the 'poppler' based import filter for text handling.
(In reply to V Stuart Foote from comment #19)
> But why first extract the text runs into Draw Text boxes? Seems like a
> different filter import of the PDF text runs is needed.
Yes, that sounds like it would be perfect, but 100x more complex to code.
Created attachment 152531 [details]
(In reply to Justin L from comment #18)
> I only see one reasonable option and that is a function that
> allows a user to combine selected textboxes into one textbox.
Proposed patch: https://gerrit.libreoffice.org/75043
Draw: add option to consolidate multiple textboxes into one
(In reply to Justin L from comment #21)
This patch has landed in LO 6.4. Please use bug 118370 to follow the implementation of a "Shapes - Consolidate Text" function that gives the user a tool to combine multiple textboxes into one.
For this bug report, let's keep the discussion to the bigger request to add a "text content focused" PDF import, rather than a layout focused import as discussed in comment 19.
Created attachment 172095 [details]
pdf test file for too many textboxes also in LO 7132
LO 126.96.36.199 win64
Automatic detection of blocks should be improved.
It is better than before, but there is room for improvement.
The selection of fonts and their sizes could be improved.
Perhaps AI is the solution here for choosing the best fonts and sizes and for detecting blocks.
Shapes - Consolidate Text
changes the position of the text.
Mostly, the text needs more room in height and width.
also with LO 188.8.131.52
Too many text boxes are created, and characters in different fonts are positioned incorrectly, in LO 184.108.40.206.
Version: 220.127.116.11 (x64) / LibreOffice Community
Build ID: 728fec16bd5f605073805c3c9e7c4212a0120dc5
CPU threads: 8; OS: Windows 10.0 Build 19044; UI render: Skia/Raster; VCL: win
Locale: en-US (de_DE); UI: de-DE
One improvement is the detection of the language for spell checking.
In the PDF example, English is now detected immediately.
In 18.104.22.168 there are only red wavy underlines beneath the text, because the words are unknown when German is the primary language.
Version: 22.214.171.124 (x64) / LibreOffice Community
Build ID: 8d71d29d553c0f7dcbfa38fbfda25ee34cce99a2
CPU threads: 8; OS: Windows 10.0 Build 19044; UI render: Skia/Raster; VCL: win
Locale: de-DE (de_DE); UI: de-DE
*** Bug 151577 has been marked as a duplicate of this bug. ***
Nowadays it is possible to join the text of the boxes on the same page.
Select the boxes, right-click, and choose Consolidate Text.
I'm the author of bug 151577.
I want to bring up the question of whether there should be a single bug about this issue for importing into Writer and into Draw.
In Draw, we expect drawing objects which can be manipulated independently, although paragraph-level rather than line-level boxes would indeed be preferable whenever applicable. In Writer, however, we would like long continuous runs of text, across paragraphs and pages, which aren't drawing objects at all.
Also, the code for the two input filters, while similar, is different: They're two independent filters.
Finally, I would say that while in Draw this issue may be considered as an enhancement - in Writer it is a proper bug: The current Writer import filter produces what is essentially a Draw document - a bunch of disconnected drawing objects on separate pages - opened in Writer.
What say you? :-)
(In reply to m.a.riosv from comment #27)
> Nowadays there it's possible to join the text for the boxes on the same page.
> Select the boxes and righ-click Consolidate text.
... but only in Draw, it seems. Just filed bug 151598 about having it in Writer as well.
(In reply to V Stuart Foote from comment #16)
> Replacing of supplementing the PDF filters to provide "reflow" back into
> paragraphs is seen as out-of-scope for the project as we are not a PDF
A reflow is often not necessary. That is, the text in a contiguous paragraph without changes to the formatting is saved in a single object stream. So, what actually happens is that our import filters _artificially_ break up the text into lines.
Also, LibreOffice is actually a PDF editor: It satisfies the dictionary definition [1] of an editor for PDFs, and is used by many to edit PDFs. True, it does not directly manipulate the structure of PDFs - it imports-from and exports-to PDFs - but that is also the case for OOXML documents and many other formats - and we still consider LO an editor for those. Certainly, LO may not be the ideal software platform for editing PDFs, but there's no reason it couldn't be a half-decent editor for not-very-complex PDFs. I've recently had this discussion with Stuart on bug 151552.
For this reason, improving the editability of imported PDFs, e.g. by importing text as complete paragraphs, is entirely within the scope of the project.
[1]: https://www.dictionary.com/browse/editor
*** Bug 151607 has been marked as a duplicate of this bug. ***
*** Bug 152143 has been marked as a duplicate of this bug. ***