When opening a (Writer-created) PDF document with justified text, the justification is lost, and the resulting draw objects are aligned on the left only.
Reproduction instructions:
1. Create a new Writer document.
2. Change the Default Paragraph Style to be justified.
3. Enter a few lines of text (which are not themselves perfectly justified and require some spacing increases for justification).
4. Save the document as a PDF.
5. Open the PDF in Writer (not in Draw! Use the Writer PDF import filter).
Expected result: The opened document is justified, at least visually.
Actual result: Text is aligned to the left only, _unlike_ what we see in the PDF itself.
Created attachment 183066: Writer-generated PDF document for observing the bug. Instead of creating the PDF yourself, you may just open the attached file. It contains two paragraphs of "lorem ipsum" text.
Created attachment 183068: Screenshot of attachment 183066 imported into Writer. Screenshot showing how the alignment after import is left-only.
Why is this a bug? And I will restate the obvious: LibreOffice is not a PDF editor! The import filter results in draw Shape textboxes, not LO paragraphs. The poppler based PDF import filter does not provide the spacings recorded into PDF. We use it to convert the text runs held in PDF into draw Shape objects--specifically textboxes. The lack of "justified" filter import is expected and by design. We ignore any spacing in parsing out the text. We have pretty much the same result to canvas with the poppler based Impress filter. The filters (there are two, separate but similar) only provide a workflow to extract text runs (with varying fidelity by script) and to merge them into one draw Shape textbox (the 'Consolidate text' action), and from there to select the text stream to use, if needed, in a new LibreOffice paragraph object. If you need layout fidelity to the original PDF page, use the Insert PDF filter! Otherwise accept the tool for what it provides--an ability to extract text runs from PDF.
(In reply to V Stuart Foote from comment #3)
> Why is this a bug?
I believe you're being facetious here, but: A PDF imported into Writer should (ignoring complex and esoteric PDF features) be rendered near-identically to its rendering in a PDF viewer. Or phrased otherwise: Printing the PDF and printing the Writer-imported PDF onto paper should result in almost-identical printed documents. If that seems too presumptuous, then at the very least that should hold for PDFs created by exporting Writer documents (ignoring complex and esoteric features). In particular, if words are spaced out within a line in the PDF, they should be spaced out exactly, or almost exactly, the same way in the PDF-imported Writer document.
> And I will restate the obvious LibreOffice is not a PDF editor!
That is obvious...ly wrong: LibreOffice is a PDF editor. Dictionary.com defines [1] editor as: "A program used for writing and revising code, data, or text". LO can open PDFs, make edits to the opened PDF, and save the result to a PDF. It's a poor PDF editor, but considering the lack of FOSS alternatives, and the fact that LO is installed so widely - it's the PDF editor of choice for many. If PDF import filters - in particular for Writer, but for Draw as well - were to improve, LO could become a mediocre PDF editor.
> The import filter results in draw Shape textboxes not LO paragraphs.
That's an implementation detail. One could argue whether it's a good idea in general for a Writer import filter, but regardless - implementation details are not an excuse to mess up the import.
> The poppler based PDF import filter does not provide the spacings recorded into PDF.
So here's your bug.
> We use it to convert the text runs held in PDF into draw Shape objects--specifically textbox. The lack of "justified" filter import is expected and by design.
It's not what users expect, and if it was by design - the bug is in the design of the import filter.
> We have pretty much the same result to canvas with the poppler based Impress filter.
Actually, that's not true, and I'll open a bug about the Draw filter separately; but the difference is not a good one...
> If you need layout fidelity to the original PDF page, use the Insert PDF filter!
I need both layout fidelity and editability, and that's what the import filter should provide.
[1] : https://www.dictionary.com/browse/editor
Spacing of the text runs is something that cannot be efficiently extracted from the PDF. IMHO NAB and => WF. LibreOffice is not a PDF editor. When a user chooses to filter import a source PDF to LO, they *must* understand that the content of the PDF is being extracted and its constituent elements rendered as drawing Shapes to the document canvas: Draw by default, or optionally Impress or Writer. It is time for UX and ESC to flatly state what the project will do regarding PDF source materials--up to and including *removal* of the PDF import filters to eliminate the misguided perception that LibreOffice is a PDF editor.
(In reply to V Stuart Foote from comment #5)
> LibreOffice is not a PDF editor.
I explained why it is, and you have not presented a counter-argument. Repeating your statement without a counter-argument is effectively conceding the point.
> Spacing of the text runs is something that cannot be efficiently extracted from the PDF
If that were true, the PDF format would be useless and PDF viewers would not work. Also, it would be good enough if LO simply realized that it's seeing a justified line, and formatted it accordingly (as after all, we can justify single lines). That might not result in exactly the same spacing as in the PDF file, but it would typically be pretty close, and could be identical if the file had originated in LO (and if the text box were sized appropriately).
> When a user chooses to filter import a source PDF to LO, they *must* understand the content of the PDF is being extracted and constituent elements rendered as drawing Shapes to document canvas.
Why does it matter whether users "understand" that there's a bug? I'm not following.
> It is time for UX and ESC to flatly state what project will do regards PDF source materials--up to and including *removal* of the PDF import filters to eliminate the misguided perception that LibreOffice is a PDF editor.
It's obvious you're trying to promote this agenda by pushing back against bug reports on PDF import filters. That's not appropriate. PDF import into Writer is an officially supported feature. If you want to remove it - open an issue about it (or actually, don't, since it's an important and useful feature); certainly don't try to suppress requests to fix the import filter.
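The "realize it's a justified line" idea in this comment could be sketched roughly as below. This is only an illustration in Python, not LibreOffice code: the `Line` record, the `guess_alignment` function and the 1-point tolerance are hypothetical names and values chosen for the example, and the sketch assumes the importer already knows the left and right edge of each text line in a paragraph.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Line:
    """Horizontal extent of one imported text line, in points (hypothetical)."""
    left: float
    right: float


def guess_alignment(lines: List[Line], tol: float = 1.0) -> str:
    """Guess paragraph alignment from the edges of its lines.

    The last line is ignored, since the final line of a justified
    paragraph is normally short. Returns 'justified', 'left', 'right',
    'center' or 'unknown'.
    """
    if len(lines) < 3:
        # Too few full lines to tell flush edges from coincidence.
        return "unknown"
    body = lines[:-1]
    lefts = [ln.left for ln in body]
    rights = [ln.right for ln in body]
    left_flush = max(lefts) - min(lefts) <= tol
    right_flush = max(rights) - min(rights) <= tol
    if left_flush and right_flush:
        return "justified"
    if left_flush:
        return "left"
    if right_flush:
        return "right"
    centers = [(ln.left + ln.right) / 2 for ln in body]
    if max(centers) - min(centers) <= tol:
        return "center"
    return "unknown"


paragraph = [Line(72, 523.2), Line(72, 523.0), Line(72, 523.3), Line(72, 300)]
print(guess_alignment(paragraph))
```

The sketch deliberately ignores first-line indents, hyphenation and multi-column layouts, all of which a real filter would have to cope with.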
(In reply to Eyal Rozenberg from comment #6)
No, please understand how our poppler based PDF import filtering functions. PDF is not an editable format. We do not Edit PDFs. A PDF viewer processor will open and parse PDF stream content onto fully described (in postscript) pages, and then manage display of those complete pages. Even for a document being "round-tripped", LibreOffice's import filter(s), using the external poppler and poppler-utils libraries, extract the content streams from the published presentation and convert each stream into a discrete draw Shape object. The text runs in the PDF are just one of the content streams. Those discrete text run content streams have no lexical details and are strictly glyph based snippets of text with font and character metrics, which are then used to create the draw Shape textboxes. The content stream includes a starting position on the published page, and that is used to coarsely position the draw textbox on the LO canvas. That is why the text runs are not rendered to the LO canvas as "justified" and can exceed the LO canvas margins. The mishandling of the RTL text was also a manifestation of the fact that the content stream records text in the order it is recorded to the postscript page. There are similar issues for complex text recorded to PDF with /ActualText flag support. PDF viewers don't need to do more with the content streams--they simply parse them and lay them out as described in the postscript pages. And LibreOffice actually includes a PDF viewer processor--that is the pdfium based ipdf filter used to insert a PDF page as an image. Improving the fidelity of filter imported draw Shapes to the content on the source PDF published page is out of scope for the project. Put another way, it is not justified to expend dev, QA and design resources working on the PDF import filters when we offer exceptional fidelity for PDF content using the pdfium based insert filters. Any "manipulation" of the source PDF (e.g. page extraction, clipping, etc.) to prepare it for insertion is best done external to LibreOffice. And that is why I make the suggestion that perhaps it would be best just to drop the functional poppler based PDF import filter from core LO deliverables. It could then be packaged more effectively as an extension (where it started in the Oracle OOo era). And again, LibreOffice is *not* a PDF editor.
(In reply to V Stuart Foote from comment #7)
> (In reply to Eyal Rozenberg from comment #6)
> No, please understand how our poppler based PDF import filtering functions.
I actually assumed everything you wrote in your post. I'm not that dense... :-) But - it is irrelevant how the current filter works. Or rather, it's relevant when evaluating whether or not a fix can be based on the current implementation - it is not relevant for evaluating what the desired behavior is.
> PDF is not an editable format.
First of all, of course it's an editable format. It's not _convenient_ to edit; it expresses many things implicitly, sure; and still, it's editable. ... but I won't fall for the moving-of-the-goalposts you seem to be setting up here. PDFs do not need to be editable to have editors. We've already described what an editor does - and that does not require directly working on the format it's an editor for. It is perfectly legitimate for an editor to import-edit-export. gimp and Photoshop do that for most image formats, because those are also not editing-friendly.
> We do not Edit PDFs.
I told you I wouldn't fall for that. You might as well say "We do not edit OOXMLs"... ok, sure, but LO is still a DOCX editor, and one of the better ones.
> Even for a document being "round-tripped" LibreOffice's import filter(s), using external poppler and poppler-utils libraries, extracts the content streams from the published presentation, and converts each stream into a discrete draw Shape object.
This, at most, may mean that fixing this bug requires a lot of effort due to the need for an alternative to the use of poppler (although - maybe not; I'm not familiar with poppler's capabilities). Fine! I do not claim that this issue should be the LO project's top priority.
> PDF Viewers don't need to do more with the content streams--they simply parse them and lay them out as described in the postscript pages.
Indeed, PDF viewers have it easier, and don't have to reach structural conclusions. A PDF import filter for a textual document editor needs to work much harder, reconstituting structure, deducing features and styles etc. I don't expect this to work perfectly for arbitrary PDFs. But I definitely expect it to work well for the most straightforward of PDFs for us to import: Paragraphs of text exported from LO Writer.
> Put another way it is not justified to expend dev, QA and design resources working on the PDF import filters when we offer exceptional fidelity for PDF content using the pdfium based insert filters.
But you know that's not what a PDF import filter is for. The PDF import filter for Writer is for editing PDFs in Writer, and that's not at all provided by pdfium. So, the existence of pdfium does not constitute an argument against investing effort in improving the Writer PDF import filter. In fact, I must say that you're taking a rather myopic view of the matter. Think about the promotion of LO as a product! Especially vis-a-vis MS Office. If you could tell the user "Someone sent you a document as a PDF? With LO, you can edit it! Either make it your own by modifying the text or use Track Changes to treat it as a draft for discussion." - that's very attractive functionality that Microsoft doesn't offer.
> And again, LibreOffice is *not* a PDF editor.
I commend your valiant (?) attempt to try to argue this point. Unfortunately, your argument was based on the false premise that an editor for a file format must be able to manipulate that format's internal structure directly.
To have the "justified" text be justified in Writer pdfimport as well, we need to know how the PDF specification expresses "justified alignment". If we can find the PDF token defining the justified alignment (there should be one, but we need to read the PDF specification carefully to identify it), then we can add a line of output to the (poppler based) xpdfimport binary, then handle that in the so-called "emitting" process during import. This is similar to how we handle bold, underline, etc.: we read the PDF tokens, and if we encounter the token specifying that the text should be justified, then we apply that in our import process. I think we really need to keep this ticket short and provide as much useful information as possible, otherwise devs will not finish reading this ticket and it will never be fixed. May the irrelevant comments be tagged as "obsolete"? I am marking this as NEW, as I see there is a bug here.
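As one possible starting point for that search, here is a sketch under stated assumptions. To my knowledge, PDF content streams have no dedicated "justify" operator; they do have the `Tw` operator, which sets word spacing, and a non-zero `Tw` value is one possible trace of justification. The Python below is illustrative only and is not part of xpdfimport: `word_spacings` and `looks_word_spaced` are hypothetical helper names, the input is assumed to be an already decompressed content stream, and the whitespace tokenizer is naive (it does not skip string literals). It is also an assumption that the exporter uses `Tw` at all; if it positions words or glyphs individually instead, this scan finds nothing.

```python
from typing import List


def word_spacings(content: bytes) -> List[float]:
    """Collect the operands of every `Tw` operator in a decoded content stream."""
    tokens = content.split()
    spacings = []
    for operand, operator in zip(tokens, tokens[1:]):
        if operator == b"Tw":
            try:
                spacings.append(float(operand.decode("latin-1")))
            except ValueError:
                # Operand is not a number; ignore this occurrence.
                pass
    return spacings


def looks_word_spaced(content: bytes, eps: float = 1e-3) -> bool:
    """True if any `Tw` operand differs noticeably from zero."""
    return any(abs(w) > eps for w in word_spacings(content))


# Hand-written sample stream, not taken from a real Writer export.
stream = b"BT /F1 12 Tf 72 700 Td 1.87 Tw (Lorem ipsum dolor) Tj 0 Tw (sit amet.) Tj ET"
print(word_spacings(stream), looks_word_spaced(stream))
```

If a signal like this turned out to be usable, the importer would still need to decide per paragraph, not per text run, whether to apply justified alignment.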
(In reply to Kevin Suo from comment #9)
> ...
> I think we really need to have this ticket short and provide as much useful information as possible, otherwise devs would not finish reading this ticket and this one will never be fixed.
A little history from the poppler side: https://bugs.freedesktop.org/show_bug.cgi?id=55977
*** This bug has been marked as a duplicate of bug 49705 ***