When I import a PDF that contains text, I can only edit the text one paragraph at a time. What I need (and what most users will expect) is to edit all the paragraphs together, as I normally would in an office document. Could there be an option to merge the paragraphs at import time so they can be edited easily? That is, unify all paragraphs on a page so they are editable as one continuous text. It is time-consuming to click and edit every paragraph one at a time. I tried Union from the right-click menu on them and got a plain graphic that cannot be edited with the text tool (the big T icon). I hope this small option makes it in before the final release.
I have also sometimes wished for such a feature, but I'm afraid it would cost too much manpower. Compared to other needs it is definitely no more than Importance "Medium", and I doubt it will ever be integrated.
Hi, I am a developer of digital library software and, with digital preservation in mind, I was thinking of using LibreOffice's wonderful PDF import filter to archive an .odt document next to the original .pdf (as the .odt document should provide more value for future retrieval and use of the document). I also think this would be a very nice feature to have, and it should be possible to implement it via a heuristic such as merging consecutive lines that are not too far from each other (e.g. no further apart than the height of a character). If no one has time to work on it, I'd be glad to give it a try in my spare time, if someone could be so kind as to point me to the most appropriate source code files that would need to be touched. Cheers!
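The heuristic proposed above can be sketched roughly as follows. This is only an illustration in Python, not LibreOffice code: the `Line` record, its fields, and the `merge_lines` function are hypothetical names, and the input is assumed to be a list of already-extracted text lines with a top coordinate (growing downward) and a character height.

```python
from dataclasses import dataclass


@dataclass
class Line:
    """Hypothetical record for one extracted text line."""
    text: str
    top: float     # y coordinate of the top of the line, growing downward
    height: float  # character height of the line


def merge_lines(lines):
    """Group lines into paragraphs.

    A line joins the current paragraph when the vertical gap to the
    previous line is at most one character height of that previous line;
    otherwise it starts a new paragraph.
    """
    paragraphs = []
    prev = None
    for line in sorted(lines, key=lambda l: l.top):
        if prev is not None and line.top - (prev.top + prev.height) <= prev.height:
            paragraphs[-1].append(line.text)
        else:
            paragraphs.append([line.text])
        prev = line
    # Join the lines of each paragraph with single spaces.
    return [" ".join(p) for p in paragraphs]
```

A real implementation would probably also have to look at indentation, font changes and columns; the sketch only covers the vertical-distance rule mentioned in the comment.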
@Samuele Kaplun: That would be great. I'm afraid it won't be easy. I do not know how this works on other operating systems, but on Windows I have to install the "Oracle PDF Import Extension" from <http://extensions.services.openoffice.org/en/search/node/pdf import>, which itself, as far as I know, uses XPDF <http://foolabs.com/xpdf/about.html> as its text extractor. That's all I can contribute. BTW: the Version field is for the first version in which the problem was observed!
(In reply to comment #3)
> @Samuele Kaplun:
> I'm afraid that won't be easy. I do not know how that works for other OS, but
> for WIN I have to install the "Oracle PDF Import Extension" from
> <http://extensions.services.openoffice.org/en/search/node/pdf import>, what
> itself afaik uses XPDF <http://foolabs.com/xpdf/about.html> as text extractor.
From <http://www.libreoffice.org/features/extensions/> I understand that finally this extension is part of the core LibreOffice source tree. Is this so?
> That's all I can contribute.
>
> BTW: Version is for the first version where the problem has been observed!
Sorry for this!! That makes perfect sense!
> From <http://www.libreoffice.org/features/extensions/> I understand that
> finally this extension is part of the core LibreOffice source tree. Is this so?
I thought so, too, but for my 3.4 I definitely had to download the extension. Please see <https://bugs.freedesktop.org/show_bug.cgi?id=35604#c6>
[This is an automated message.] This bug was filed before the changes to Bugzilla on 2011-10-16. Thus it started right out as NEW without ever being explicitly confirmed. The bug is changed to state NEEDINFO for this reason. To move this bug from NEEDINFO back to NEW please check if the bug still persists with the 3.5.0 beta1 or beta2 prereleases. Details on how to test the 3.5.0 beta1 can be found at: http://wiki.documentfoundation.org/QA/BugHunting_Session_3.5.0.-1 more detail on this bulk operation: http://nabble.documentfoundation.org/RFC-Operation-Spamzilla-tp3607474p3607474.html
It was NEW for good reasons. But the question is whether there is a realistic chance of getting this enhancement.
*** Bug 38084 has been marked as a duplicate of this bug. ***
*** Bug 84712 has been marked as a duplicate of this bug. ***
Please read this message in its entirety before responding.
Your bug was confirmed at least 1 year ago and has not had any activity on it for over a year. Your bug is still set to NEW which means that it is open and confirmed. It would be nice to have the bug confirmed on a newer version than the version reported in the original report to know that the bug is still present -- sometimes a bug is inadvertently fixed over time and just never closed.
If you have time please do the following:
1) Test to see if the bug is still present on a currently supported version of LibreOffice (preferably 4.2 or newer).
2) If it is present please leave a comment telling us what version of LibreOffice and your operating system.
3) If it is NOT present please set the bug to RESOLVED-WORKSFORME and leave a short comment telling us your version and Operating System
Please DO NOT
1) Update the version field
2) Reply via email (please reply directly on the bug tracker)
3) Set the bug to RESOLVED - FIXED (this status has a particular meaning that is not appropriate in this case)
LibreOffice is powered by a team of volunteers, every bug is confirmed (triaged) by human beings who mostly give their time for free. We invite you to join our triaging by checking out this link: https://wiki.documentfoundation.org/QA/BugTriage
There are also other ways to get involved including with marketing, UX, documentation, and of course developing - http://www.libreoffice.org/get-help/mailing-lists/.
Lastly, good bug reports help tremendously in making the process go smoother, please always provide reproducible steps (even if it seems easy) and attach any and all relevant material
I can confirm this bug in LO 4.4.2.2. on Windows 7
*** Bug 91896 has been marked as a duplicate of this bug. ***
Is there no bug voting? This is a major turndown!
*** Bug 105274 has been marked as a duplicate of this bug. ***
*** Bug 93039 has been marked as a duplicate of this bug. ***
LibreOffice provides functional filter import of PDF into Draw (the default Open action), and into Impress, Writer, or Draw by import filter selection. With each filter selected, the rendering to the respective document canvas follows the structure of the document as recorded within the PDF, and text elements are rendered into styled Text boxes or Frames. The PDF filter(s) do not "reflow" text into Paragraph objects. That would require a very complex treatment of the PDF structure to reliably extract syntax and layout--at the expense of fidelity in rendering the PDF document. Replacing or supplementing the PDF filters to provide "reflow" back into paragraphs is seen as out-of-scope for the project as we are not a PDF editor. The core PDF filters and function are sufficient for our needs of high rendering fidelity. This is fertile ground for an extension.
*** Bug 125838 has been marked as a duplicate of this bug. ***
Created attachment 152450 [details] PDF_import_testDoc.odg: exploring what combining textboxes could look like I agree with Stuart's conclusion that monkeying with import to make larger textboxes would be disastrous. So I only see one reasonable option and that is a function that allows a user to combine selected textboxes into one textbox. However, the results won't be pretty. Each character attribute change (size, bold, font, etc.) becomes a separate textbox, and there is no way to identify whether that ends the paragraph or not, although some content analysis guesswork could approximate the majority of cases I guess. In any case, a LOT of cleanup would be needed to reformat the text, since each character run is treated as a separate paragraph and all paragraph spacing information is missing. The other option is to force the user to create their own textbox and copy/paste the text from the PDF itself, but in that case all the character properties are lost. So there does still seem to be an advantage of consolidating textboxes into one, even if many excess paragraph markers need to be deleted.
(In reply to Justin L from comment #18)
> ... So I only see one reasonable option and that
> is a function that allows a user to combine selected textboxes into one
> textbox.
>
Yes, agreed, that would be an acceptable way to handle PDF source text runs extracted from BT/ET blocks, or where /ActualText annotation is present.
But why first extract the text runs into Draw Text boxes, and then merge them into one or more non-formattable Draw Text boxes? It seems a different filter import of the PDF text runs is needed. Dumping the strings out to a Writer Paragraph object, either in bulk or interactively, would be more functional. And text runs dumped into a Paragraph object would allow assignment of direct formatting or style, with text validation and word and line break cleanup.
A more efficient UI could probably evolve if this were done as a pop-out dialog to pick the Draw Text box snippets, but one could also spin up a full Writer session and do the same.
More often than not, folks simply want to reflow the text strings back into their lexicographically correct sequence without too great a concern as to the original formatting of the source document generating the PDF. We can't do that with much fidelity to the original source--so why bother?
Our other 'pdfium' based "insert" filter provides the text runs to the document canvas with good fidelity to the original layout, though the object "break" there has similar issues to the 'poppler' based import filter for text handling.
(In reply to V Stuart Foote from comment #19)
> But why first extract the text runs into Draw Text boxes? Seems like a
> different filter import of the PDF text runs is needed.
Yes, that sounds like it would be perfect, but 100x more complex to code.
Created attachment 152531 [details] Draw-add-option-to-consolidate-multiple-textboxes.patch
(In reply to Justin L from comment #18)
> I only see one reasonable option and that is a function that
> allows a user to combine selected textboxes into one textbox.
Proposed patch: https://gerrit.libreoffice.org/75043 Draw: add option to consolidate multiple textboxes into one
(In reply to Justin L from comment #21) This patch has landed in LO 6.4. Please use bug 118370 to follow the implementation of a "Shapes - Consolidate Text" function that gives the user a tool to combine multiple textboxes into one. For this bug report, let's keep the discussion to the bigger request to add a "text content focused" PDF import, rather than a layout focused import as discussed in comment 19.
Created attachment 172095 [details] pdf test file for too many textboxes also in LO 7132
LO 7.1.3.2 win64. Automatic detection of blocks should be improved. It is better than before, but there is room for improvement. The selection of fonts and their sizes could also be improved. Perhaps AI is the solution here for choosing the best fonts and sizes and for detecting blocks.
Shapes - Consolidate Text changes the position of the text. In most cases the text needs more space in height and width. Also seen with LO 7.2.0.2.
There are still too many text boxes in LO 7.3.4.2, and characters set in different fonts are mispositioned.
Version: 7.3.4.2 (x64) / LibreOffice Community
Build ID: 728fec16bd5f605073805c3c9e7c4212a0120dc5
CPU threads: 8; OS: Windows 10.0 Build 19044; UI render: Skia/Raster; VCL: win
Locale: en-US (de_DE); UI: de-DE
Calc: CL
One improvement is language detection for spell checking. In the PDF example, English is now detected immediately. In 7.2.7.2 the text only gets red wavy underlines, because it is checked against the primary language, German.
Version: 7.2.7.2 (x64) / LibreOffice Community
Build ID: 8d71d29d553c0f7dcbfa38fbfda25ee34cce99a2
CPU threads: 8; OS: Windows 10.0 Build 19044; UI render: Skia/Raster; VCL: win
Locale: de-DE (de_DE); UI: de-DE
Calc: CL
*** Bug 151577 has been marked as a duplicate of this bug. ***
Nowadays it is possible to join the text of boxes on the same page. Select the boxes, right-click, and choose Consolidate Text.
I'm the author of bug 151577. I want to bring up the question of whether there should be a single bug about this issue for importing into Writer and into Draw. In Draw, we expect drawing objects which can be manipulated independently - although paragraph-level rather than line-level boxes would indeed be preferable whenever applicable. In Writer, however, we would like long continuous runs of text, across paragraphs and pages, which aren't drawing objects at all. Also, the code for the two input filters, while similar, is different: They're two independent filters. Finally, I would say that while in Draw this issue may be considered an enhancement - in Writer it is a proper bug: The current Writer import filter produces what is essentially a Draw document - a bunch of disconnected drawing objects on separate pages - opened in Writer. What say you? :-)
(In reply to m.a.riosv from comment #27)
> Nowadays there it's possible to join the text for the boxes on the same page.
> Select the boxes and righ-click Consolidate text.
... but only in Draw, it seems. Just filed bug 151598 about having it in Writer as well.
(In reply to V Stuart Foote from comment #16)
> Replacing of supplementing the PDF filters to provide "reflow" back into
> paragraphs is seen as out-of-scope for the project as we are not a PDF
> editor.
A reflow is often not necessary. That is, the text in a contiguous paragraph without changes to the formatting is saved in a single object stream. So, what actually happens is that our import filters _artificially_ break up the text into lines.
Also, LibreOffice is actually a PDF editor: It satisfies the dictionary definition [1] of an editor for PDFs, and is used by many to edit PDFs. True, it does not directly manipulate the structure of PDFs - it imports-from and exports-to PDFs - but that is also the case for OOXML documents and many other formats - and we still consider LO an editor for those. Certainly, LO may not be the ideal software platform for editing PDFs, but there's no reason it couldn't be a half-decent editor for not-very-complex PDFs. I've recently had this discussion with Stuart on bug 151552.
For this reason improving the editability of imported PDFs, e.g. by importing text as complete paragraphs, is entirely within the scope of the project.
[1] : https://www.dictionary.com/browse/editor
*** Bug 151607 has been marked as a duplicate of this bug. ***
*** Bug 152143 has been marked as a duplicate of this bug. ***
Note this recent (and relatively popular) YouTube video by "The Linux Experiment", decrying several usability issues with modern Linux distributions. In the section on PDFs, it explains how, to edit a PDF, you use LibreOffice Draw. Then it goes on to complain about how bad it is as an editor, and particularly: How the text is broken up into separate lines. The (very common) use-case described in the video: Signing a PDF by adding a signature image to it. https://www.youtube.com/watch?v=0re63X2nY0s This illustrates that, at the bottom line, LO is a PDF editor, and in fact, is _the_ PDF editor for users who aren't experts in locating software. It also illustrates how addressing this issue will make both LO and FOSS desktop environments more attractive to users.
(In reply to Eyal Rozenberg from comment #33)
>...
> This illustrates that, at the bottom line, LO is a PDF editor, and in fact,
> is _the_ PDF editor for users who aren't experts in locating software.
>
> It also illustrates how addressing this issue will make both LO and FOSS
> desktop environments more attractive to users.
Nope! It again illustrates the bottom line that PDF (ISO 32000-1:2008, or 2:2020) is NOT an editable format, it is a presentation/publication format.
Also, it demonstrates the reality that LibreOffice is not a PDF editor, as we will only ever read the content of a PDF to filter import to an ODF XML compliant document canvas. Fidelity of the filter import varies (poppler to sd Drawing objects, or pdfium as image/meta streams as image to a vcl canvas), but in no sense do we do more than read from PDF. Export/print from an ODF module to PDF is then a completely different process with a different set of export filters.
And it highlights the project's need to scrupulously manage user expectations, reinforcing that PDF is not an editable format, and that LibreOffice is NOT a PDF "editor".
Improvements can be made to LO filter handling as a PDF reader to import content--witness the adoption of pdfium libs for the insert as image filter paths.
But simply put, the internals of the presentation optimized text runs within PDF do not support extraction with the lexical syntax of the original source document from which a PDF was generated.
We can provide tools to better organize the results of the import filters, reformatting them into either paragraph objects or sd draw objects--but there are very real limits to what the project can or should do.
(In reply to V Stuart Foote from comment #34)
> Nope! It again illustrates the bottom line that PDF (ISO 32000-1:2008, or
> 2:2020) is NOT an editable format, it is a presentation/publication format.
People need to edit PDFs all the time - hence its featuring prominently in a video describing common tasks which need catering to by desktop apps. You get a PDF of a form - typically scanned or printed from a word processor - and you need to put text and/or a signature on it. That's PDF editing, and millions of people do it every day. Ok, maybe not millions every day, let's say millions every week.
> Also, it demonstrates reality that LibreOffice is not a PDF editor as we
> will only ever read content of a PDF to filter import to an ODF XML
> compliant document canvas.
Nobody said LO needs to represent the PDF structure as-is and perform surgical edits. In that sense, LO isn't a .doc and .docx editor either: It only ever reads their contents via an import filter; and it is certainly a .doc and .docx editor. But - we've had this argument already. Why are you repeating a rebutted point?
> And it highlights the project's need to scrupulously manage user
> expectations reinforcing that PDF is not an editable format, and that
> LibreOffice is NOT a PDF "editor".
You keep saying that, despite it having been demonstrated to you both in principle and empirically that it is. What LO perhaps needs to manage is people's insistence on sticking their heads in the sand and ignoring an important use of our suite. I'll bet you there are more people using LO as a PDF editor than users of LO Base, for example. (No offense to the LO Base folks!) But anyway, let's focus on the practicality and the scope of this bug.
> Improvements can be made to LO filter handling as a PDF reader to import
> content--witness the adoption of pdfium libs for the insert as image filter
> paths.
That's a step in the right direction - as was the resolution of bug 104597. But there's a very long way to go.
> But simply put, the internals of the presentation optimized text runs within
> PDF do not support extraction with the lexical syntax of the original source
> document from which a PDF was generated.
That's true, and we can never hope to restore what's not saved in a PDF. But:
1. We can avoid losing the information and styling that _is_ represented in the PDF, so that importing-then-saving would result in a PDF with no noticeable distortions, or almost none. At least - for PDFs of typical documents which don't use the more esoteric features of PDFs. Of course the PDF's internal structure will likely show a lot more differences, but the observed result will be pleasing.
2. We can use reasonable assumptions to constitute paragraphs, define styles, have structural elements/features like columns, tables, annotations, comments, etc. Yes, each of these may be a lot of work and nobody expects this to happen overnight, but if we set this as an explicit goal and have some development resources assigned to working towards that goal then things will gradually improve. By the way, this is mostly, even if not entirely, orthogonal to making sure we don't mess up the PDF on import-then-export.
3. For the specific case of LO being the originator of the PDF, we could consider - and that is out-of-scope here I suppose - embedding auxiliary information into the PDF which allows for perfect or near-perfect reconstitution of the original LO document.
> but there are very real limits to what the project can or should do.
Certainly, but these limits depend in part on what the project defines as a goal or an important feature. Recognition of the use of LO as a PDF editor rather than its denial will allow for setting these limits farther.
(In reply to Eyal Rozenberg from comment #35)
> 3. For the specific case of LO being the originator of the PDF, we could
> consider - and that is out-of-scope here I suppose - embedding auxiliary
> information into the PDF which allows for perfect or near-perfect
> reconstitution of the original LO document.
>
Actually we already provide that with the source ODF inserted as a data stream in our PDF export--we call it a "Hybrid PDF". There is, though, some potential to improve the visibility of the ODF data stream beyond LibreOffice's handling, as for bug 95328 and making the ODF a proper PDF attachment.
cf https://bugs.documentfoundation.org/show_bug.cgi?id=153888 On our side, we are not interested in a PDF editor, so the question of whether LO is or should be a PDF editor is not relevant for me. That being said, Eyal is right to say that many users like this feature (importing/converting PDF into ODF), and I can testify that many surely install LO for that powerful feature (PDF import). Anyway, our need is just importing/converting a PDF into the ODF formats with decent preservation of the original formatting/fonts/visuals. We are not interested in UI editing, just in the ability to retouch some text in the ODF file by simply editing the content XML files inside it.
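For readers unfamiliar with this kind of retouching: an ODF file is a ZIP package whose main text lives in `content.xml`, so a simple edit can be scripted. The following Python sketch is an assumption about what such a workflow might look like, not the commenter's actual tooling; `retouch_odf_text` is a hypothetical helper name. It does a plain string replacement, so it only works when the target text is not split across several XML elements (which is exactly what tends to happen with PDF-imported text) and contains no characters that need XML escaping.

```python
import zipfile


def retouch_odf_text(src_path, dst_path, old, new):
    """Copy an ODF package, replacing `old` with `new` in content.xml.

    Hypothetical helper: a naive string replacement on the raw XML.
    """
    with zipfile.ZipFile(src_path) as src, zipfile.ZipFile(dst_path, "w") as dst:
        # infolist() preserves entry order, so "mimetype" stays first.
        for info in src.infolist():
            data = src.read(info.filename)
            if info.filename == "content.xml":
                data = data.decode("utf-8").replace(old, new).encode("utf-8")
            # Reuse each entry's original compression method; the ODF
            # "mimetype" entry is expected to be stored uncompressed.
            dst.writestr(info.filename, data, compress_type=info.compress_type)
```

Usage would be along the lines of `retouch_odf_text("in.odt", "out.odt", "Helo", "Hello")`. A more robust version would parse the XML and operate on text nodes instead of raw strings.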
(In reply to wtambellini from comment #37)
> Anyway, our need is just importing/converting a pdf into the odf formats
> with decent conservation of the original formatting/font/visual. We are not
> interested by UI editing but just the capability to retouch some texts in
> the odf file, by simply retouching the contents xml files in the odf file.
Ok, but - this particular bug is specifically about editing text. Naturally, better, more accurate import allows in turn for easier/better editing, but still. I've commented on the other bug.
Actually, this last sentence makes me think that maybe PDF-Import-Draw and PDF-Import-Writer should be blocking this bug rather than the other way around. What do you all think?
(In reply to Eyal Rozenberg from comment #38)
And another question to the CC list members: I believe this bug may need to be split up, because there are several possible "asks":
1. Constitute sub-line-level text runs into full-line text boxes
2. Constitute line-level/sub-line-level text runs into single-paragraph text boxes
3. Constitute multiple paragraphs into single text boxes - e.g. all paragraphs in a frame or contiguous in the body of a page
4. Put text into the document body, unless there is cause to place it in a separate box. (And this can still maintain the separation into pages, by having page breaks at the end of each page).
Right now the bug title is very vague, not even specific to text run consolidation, while the discussion is a bit all over the place.
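To make the first "ask" above concrete, here is a minimal Python sketch of joining sub-line text runs into full lines. It is an illustration only, not the import filter's code: the `Run` record, the `runs_to_lines` function and the `tolerance` parameter are hypothetical, and the runs are assumed to carry their own whitespace.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """Hypothetical record for one text run (e.g. one font/attribute span)."""
    text: str
    x: float         # left edge of the run
    baseline: float  # y coordinate of the text baseline


def runs_to_lines(runs, tolerance=1.0):
    """Join runs that share (nearly) the same baseline into one line each.

    Runs are grouped when their baseline is within `tolerance` of the
    first run of the group, then concatenated left to right.
    """
    groups = []  # list of (group baseline, [runs])
    for run in sorted(runs, key=lambda r: (r.baseline, r.x)):
        if groups and abs(run.baseline - groups[-1][0]) <= tolerance:
            groups[-1][1].append(run)
        else:
            groups.append((run.baseline, [run]))
    return [
        "".join(r.text for r in sorted(group, key=lambda r: r.x))
        for _, group in groups
    ]
```

Asks 2 to 4 would build on the lines this produces, for example by grouping the lines into paragraphs by vertical distance.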
Splitting ("divide and conquer") is a good idea, Eyal, and I see you did it, so congrats. Thanks
(In reply to Eyal Rozenberg from comment #39)
> (In reply to Eyal Rozenberg from comment #38)
> And another question to the CC list members: I believe this bug may need to
> be split up, because there are several possible "asks":
>
> 1. Constitute sub-line-level text runs into full-line text boxes
> 2. Constitute line-level/sub-line-level text runs into single-paragraph text
> boxes
> 3. Constitute multiple paragraphs into single text boxes - e.g. all
> paragraphs in a frame or contiguous in the body of a page
> 4. Put text into the document body, unless there is cause to place it in a
> separate box. (And this can still maintain the separation into pages, by
> having page breaks at the end of each page).
>
> Right now the bug title is very vague, not even specific to text run
> consolidation, while the discussion is a bit all over the place.
I do not fully understand what you write above, but I think you technically reword what I would want: import a pdf and be able to copy and edit the text. Keeping layout intact is of second importance.
Linked to an overview by "The Linux Experiment" (a Youtube channel/vlog with 254K subscribers) on working with PDFs on Linux: https://www.youtube.com/watch?v=ie7Jb1KiIBM
Notes about the video:
* LO is the first app presented for working with PDFs other than viewers
* ... followed by Inkscape.
* LO (and Inkscape) are said to "kind of suck" in editing PDFs. And still, they are what's suggested as FOSS PDF editors.
* Only LO Draw is recognized in the video, i.e. even a person who has spent some time researching the subject has not noticed that PDFs can be opened in Writer.
* Different utilities are suggested for different specific PDF-related tasks, like annotation, stamping an image, rearranging/removing pages etc.
* The bottom line is that if you _really_ want to edit your PDF, you'll need to go with a commercial app.
*** Bug 119070 has been marked as a duplicate of this bug. ***