Bug 49705 - Heuristically determine the alignment for paragraphs/text blocks in imported PDFs
Status: NEW
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: LibreOffice
Version (earliest affected): Master old -3.6
Hardware: Other All
Importance: medium normal
Assignee: Not Assigned
URL:
Whiteboard:
Keywords:
Duplicates: 47730 149216 151552 151554 158173
Depends on:
Blocks: PDF-Import-Draw
Reported: 2012-05-09 13:31 UTC by Sergey
Modified: 2024-03-20 13:32 UTC
CC List: 8 users

See Also:
Crash report or crash signature:


Attachments
Source "justify" file (15.33 KB, application/vnd.oasis.opendocument.text)
2012-05-09 13:32 UTC, Sergey
Exported "justify" file (22.96 KB, application/pdf)
2012-05-09 13:33 UTC, Sergey
Source "right" file (15.18 KB, application/vnd.oasis.opendocument.text)
2012-05-09 13:33 UTC, Sergey
Exported "right" file (22.86 KB, application/pdf)
2012-05-09 13:33 UTC, Sergey
Justify.pdf in LO (105.85 KB, image/png)
2012-05-09 13:35 UTC, Sergey
Justify.pdf in evince (52.05 KB, image/png)
2012-05-09 13:36 UTC, Sergey
Source justify file (odt) (9.35 KB, application/vnd.oasis.opendocument.text)
2019-04-23 09:14 UTC, Eugene Saenko
Exported justify file (pdf) (17.82 KB, application/pdf)
2019-04-23 09:16 UTC, Eugene Saenko
Exported file in Okular (49.54 KB, image/png)
2019-04-23 09:17 UTC, Eugene Saenko
Screenshot of exported file, opened in LO (29.84 KB, image/png)
2019-04-23 09:19 UTC, Eugene Saenko


Description Sergey 2012-05-09 13:31:54 UTC
I'm using a build of the latest code from git.

When LibreOffice opens any PDF file, the "justify" alignment mode is ignored. This happens even with PDFs created by LibreOffice itself.

At the same time, "right" alignment works as expected.

Attached are the source ODT files, the resulting PDF files, and screenshots from LibreOffice and evince.
Comment 1 Sergey 2012-05-09 13:32:40 UTC
Created attachment 61306 [details]
Source "justify" file
Comment 2 Sergey 2012-05-09 13:33:00 UTC
Created attachment 61307 [details]
Exported "justify" file
Comment 3 Sergey 2012-05-09 13:33:22 UTC
Created attachment 61308 [details]
Source "right" file
Comment 4 Sergey 2012-05-09 13:33:40 UTC
Created attachment 61309 [details]
Exported "right" file
Comment 5 Sergey 2012-05-09 13:35:19 UTC
Created attachment 61310 [details]
Justify.pdf in LO
Comment 6 Sergey 2012-05-09 13:36:11 UTC
Created attachment 61311 [details]
Justify.pdf in evince
Comment 7 bfoman (inactive) 2012-11-02 14:16:12 UTC
Confirmed with:
LO 3.6.3.1 
Build ID: f8fce0b
Windows 7 Professional SP1 64 bit

Also, Draw imports some words in a sentence as separate objects; see the 8th sentence in attachment 61307 [details].
Comment 8 Joel Madero 2012-11-03 17:19:53 UTC
Marking as NEW as bfoman has confirmed.

Prioritizing:

Minor: Makes creating professional quality work harder in specific situations and/or makes users avoid certain features.

Low: Importing PDFs is uncommon in general, so not many people are affected; it only happens when justify is set.
Comment 9 QA Administrators 2015-01-05 17:52:09 UTC Comment hidden (obsolete)
Comment 10 Buovjaga 2015-01-28 13:03:38 UTC
Reproduced.

Win 7 Pro 64-bit Version: 4.5.0.0.alpha0+
Build ID: 784d069cc1d9f1d6e6a4e543a278376ab483d1eb
TinderBox: Win-x86@62-TDF, Branch:MASTER, Time: 2015-01-25_23:07:36
Comment 11 QA Administrators 2016-02-21 08:37:12 UTC Comment hidden (obsolete)
Comment 12 QA Administrators 2017-03-06 15:45:26 UTC Comment hidden (obsolete)
Comment 13 Thomas Lendo 2018-09-20 10:00:02 UTC
Still reproducible.

Version: 6.2.0.0.alpha0+ (x64)
Build ID: 18c5089df091bddeb8c2dc339776671964389040
CPU threads: 8; OS: Windows 10.0; UI render: GL; 
TinderBox: Win-x86_64@42, Branch:master, Time: 2018-09-12_23:24:12
Locale: de-AT (de_AT); Calc: CL

Both the 'Justify pdf' attached to this bug and a newly created PDF are imported without justified paragraphs.

A newly created 'Right pdf' from the source file attached to this bug is also imported incorrectly, with spacing before the text (which means on the right of the text).
Comment 14 Eugene Saenko 2019-04-23 09:06:06 UTC
Confirmed.
Version: 6.1.5.2
Build ID: 90f8dcf33c87b3705e78202e3df5142b201bd805
CPU threads: 6; OS: Linux 5.0; UI render: default; VCL: gtk2; 
Locale: ru-RU (ru_RU.utf8); Calc: group threaded

The "justify" mode isn't simply ignored. Sometimes lines of text become longer and exceed the right margin.
Comment 15 Eugene Saenko 2019-04-23 09:14:16 UTC
Created attachment 150933 [details]
Source justify file (odt)
Comment 16 Eugene Saenko 2019-04-23 09:16:06 UTC
Created attachment 150934 [details]
Exported justify file (pdf)
Comment 17 Eugene Saenko 2019-04-23 09:17:47 UTC
Created attachment 150935 [details]
Exported file in Okular
Comment 18 Eugene Saenko 2019-04-23 09:19:00 UTC
Created attachment 150936 [details]
Screenshot of exported file, opened in LO
Comment 19 Eugene Saenko 2019-04-23 09:36:56 UTC
(In reply to Joel Madero from comment #8)
> Marking as NEW as bfoman has confirmed.
> 
> Prioritizing:
> 
> Minor: Makes creating professional quality work harder under specific
> situation and/or makes user not use certain features.
> 
> Low: Importing pdf's are uncommon in general, not many people affected, only
> happens when justify is set.

I don't agree. Sometimes I need to change a few words in a PDF file with many pages, and afterwards I have to check every page and manually correct all the lines that exceed the right margin. I'm not talking about "professional quality work"; I'm simply talking about a readable document.
Comment 20 QA Administrators 2021-04-23 04:00:18 UTC Comment hidden (obsolete)
Comment 21 Sergey 2021-05-28 22:49:53 UTC Comment hidden (obsolete)
Comment 22 Heiko Tietze 2022-11-02 14:27:17 UTC
*** Bug 47730 has been marked as a duplicate of this bug. ***
Comment 23 Heiko Tietze 2022-11-02 14:28:06 UTC
*** Bug 151554 has been marked as a duplicate of this bug. ***
Comment 24 Heiko Tietze 2022-11-04 10:30:11 UTC
*** Bug 151552 has been marked as a duplicate of this bug. ***
Comment 25 ⁨خالد حسني⁩ 2022-12-25 02:55:11 UTC
PDF does not have paragraphs or lines let alone justification options. Text in PDF is just a stream of absolutely positioned glyphs, it can come in any order and formation as long as it gives the desired visual output. Short of using the exact font with the exact glyphs and positioning every glyph individually to match the PDF positioning (which essentially means putting each glyph in its own text box, which I don’t think people will be thrilled about), I don’t see LibreOffice ever being able to exactly replicate the PDF text spacing.
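
To illustrate the point, here is a minimal Python sketch of what an importer has to work with. The `Glyph` record, the `group_into_lines` helper and the tolerance values are all hypothetical and are not LibreOffice's data model or any PDF library's API. Glyphs arrive in arbitrary order, so even lines and word breaks have to be guessed from coordinates:

```python
from dataclasses import dataclass


@dataclass
class Glyph:  # hypothetical record, not a LibreOffice or PDF library type
    char: str
    x: float      # left edge, in points
    y: float      # baseline, in points (PDF y grows upwards)
    width: float  # advance width, in points


def group_into_lines(glyphs, y_tol=1.0, gap_factor=0.3):
    """Rebuild text lines from unordered glyphs by clustering baselines.

    y_tol and gap_factor are assumed thresholds chosen for illustration.
    """
    lines = []  # list of (baseline_y, [glyphs on that baseline])
    # Visit glyphs top to bottom, then left to right.
    for g in sorted(glyphs, key=lambda g: (-g.y, g.x)):
        if lines and abs(lines[-1][0] - g.y) <= y_tol:
            lines[-1][1].append(g)
        else:
            lines.append((g.y, [g]))

    result = []
    for _, row in lines:
        row.sort(key=lambda g: g.x)
        text = row[0].char
        for prev, cur in zip(row, row[1:]):
            gap = cur.x - (prev.x + prev.width)
            # A gap noticeably wider than normal kerning is taken as a word break.
            if gap > gap_factor * prev.width:
                text += " "
            text += cur.char
        result.append(text)
    return result


# Glyphs deliberately given out of reading order.
glyphs = [
    Glyph("b", 16, 100, 5),
    Glyph("d", 25, 88, 5),
    Glyph("a", 10, 100, 5),
    Glyph("c", 10, 88, 5),
]
print(group_into_lines(glyphs))  # ['ab', 'c d']
```

Note that the spaces in the output are inferred, not read from the file, and nothing in the input says whether a wide gap is a word break or justification stretch.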
Comment 26 Eugene Saenko 2022-12-25 07:55:38 UTC
> I don’t see LibreOffice ever being able to exactly replicate the PDF text spacing.

This does not mean that LibreOffice should distort the PDF document. There is an example of successful work with a formatted PDF document. Master PDF Editor does a great job with this task.
Comment 27 Eyal Rozenberg 2022-12-25 07:57:58 UTC
(In reply to خالد حسني from comment #25)
> PDF does not have paragraphs or lines let alone justification options. Text
> in PDF is just a stream of absolutely positioned glyphs, it can come in any
> order and formation as long as it gives the desired visual output.

While that is true in one sense, in another sense, it's false: When we look at a PDF, we see paragraphs and justification. So they are there, they're just not expressed explicitly.

> I don’t see LibreOffice ever being able to exactly replicate the PDF
> text spacing.

This bug is not about replicating the positioning exactly. It is about deciding which alignment the text in a line (or a paragraph, if/when we reconstitute paragraphs) has. We currently just choose "Left". Not "Right", not "Justified", not "Centered". It is quite doable to make a much better, and usually-correct, choice. To do so we need to:

1. Guesstimate where the text boundaries are for the page or part-of-the-page.
2. Determine how much extra space the paragraph has to its left and to its right.
3. Estimate whether the line is the last line of its paragraph.
4. Try to decide whether the spacing of the words on the line is the result of adding extra inter-word space (justification) or not. Also determine the inter-glyph spacing and perhaps the glyph stretch factor (which can immediately be set for the text in the paragraph/line, if it's uniform at least.)
5. Try to decide whether the paragraph in its entirety is indented and/or has a first-line indent.
6. Based on our determinations and guesstimates, set the indentation, alignment, and inter-word spacing for the paragraph.

... and that would typically come after deciding what the paragraph boundaries are; although some of it may need to be combined (e.g. distinguishing paragraphs by indents when there is no inter-paragraph spacing.)

And I'll emphasize again that this may not always result in the correct reconstitution of paragraph alignment - but it will certainly be correct for typical cases (think: the formal letter you exported as a PDF), and correct for the large majority of cases.
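
A rough Python sketch of steps 1, 2, 3 and 6 above, to show how little is needed for the typical case. Everything here is an assumption made for illustration: the `Line` record, the function names and the 2-point tolerance are hypothetical, and this is not the importer's actual code. It ignores indents (step 5) and inter-word spacing analysis (step 4), and it assumes paragraph boundaries have already been decided:

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Line:  # hypothetical: horizontal extent of one reconstructed text line
    left: float
    right: float


def estimate_column(lines: List[Line]) -> Tuple[float, float]:
    """Step 1: guess the text boundaries from all lines in a page region."""
    return min(l.left for l in lines), max(l.right for l in lines)


def guess_alignment(lines: List[Line], col_left: float, col_right: float,
                    tol: float = 2.0) -> str:
    """Steps 2, 3 and 6: classify one paragraph from the slack on each side."""
    if not lines:
        return "left"
    # Step 3: the last line of a justified paragraph is usually not stretched,
    # so leave it out when there are other lines to judge by.
    body = lines[:-1] if len(lines) > 1 else lines

    # Step 2: compare each line's edges with the column boundaries.
    flush_left = all(abs(l.left - col_left) <= tol for l in body)
    flush_right = all(abs(l.right - col_right) <= tol for l in body)
    centered = all(
        abs((l.left - col_left) - (col_right - l.right)) <= tol for l in body
    )

    if flush_left and flush_right:
        return "justified"
    if flush_right:
        return "right"
    if flush_left:
        return "left"
    if centered:
        return "center"
    return "left"  # fall back to the current behaviour when unsure


paragraph = [Line(72, 523), Line(72, 523), Line(72, 300)]
print(guess_alignment(paragraph, 72, 523))  # justified
```

The column boundaries should come from a whole page region (for example `estimate_column` over all its lines), not from the paragraph being classified; otherwise a short centered paragraph would define its own margins and look justified.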


---------------------------------

> Short of
> using the exact font with the exact glyphs and positioning every glyph
> individually to match the PDF positioning (which essentially means putting
> each glyph in its own text box, which I don’t think people will be thrilled
> about), 

That's outside the scope of this bug. However, that's not what it means. You can very well set the spacing and stretch factors of individual glyphs within the same text box or on the same paragraph. And - that should definitely be an option when editing a PDF: Sometimes, what the user would want is a document that's easy to read, unladen with a zillion formatting and positioning specifications; but sometimes, what the user wants is a perfect reproduction of what the PDF looks like, to the extent LO is capable of doing so. The second case is for when one wants to make an edit to a PDF in LO (e.g. adding some text or a signature), and save the result - passing through LO should cause minimal distortion to the rendering of existing PDF content.

(Also, in the Writer filter, we typically don't want textboxes anyway, and the text should just go in the body.)
Comment 28 ⁨خالد حسني⁩ 2022-12-25 14:15:35 UTC
(In reply to Eugene Saenko from comment #26)
> > I don’t see LibreOffice ever being able to exactly replicate the PDF text spacing.
> 
> This does not mean that LibreOffice should distort the PDF document. There
> is an example of successful work with a formatted PDF document. Master PDF
> Editor does a great job with this task.

Master PDF Editor is a PDF editor; it does not import PDF into another representation to edit it, it works on the PDF structure directly. This gives limited editability options but it keeps the file structure intact. If this is what you need, please use it; LibreOffice PDF import can’t achieve this.
Comment 29 Eyal Rozenberg 2022-12-25 18:55:15 UTC
(In reply to خالد حسني from comment #28)

I agree with Khaled in the sense that LO is not supposed to maintain a PDF's internal structure (and hence is unlikely to preserve PDFs perfectly). But this is not a dichotomy. There is a spectrum between maintaining nothing and complete preservation like a proper PDF editor.

We can, and should, try to preserve - indirectly, via ODF-expressible features and structures - much of what the PDF contains. We can reach a situation in which most people would not notice significant (or any) distortions when opening a PDF originating in an exported word-processor document, or of a similar source.
Comment 30 Mike Kaganski 2022-12-27 12:41:34 UTC
This request would likely be better changed to "improve text block and its alignment recognition" task (if it's not filed already somewhere). In its current wording (and in its original description), it is simply wrong, implying that there is some metadata in the PDF that is ignored by LibreOffice; while the real task is inventing some heuristics to recreate such metadata from fixed positioning of glyphs on the media. It is related to e.g. "Consolidate Text" convenience tool introduced in 6.4 for tdf#118370.
Comment 31 Eyal Rozenberg 2023-03-01 20:39:20 UTC
Note that this comes up in bug 153888:  There's a simple(ish) PDF with some centered text in/over a gray frame. And - LO Draw doesn't lay out that text as centered at the horizontal center of the frame, nor is it designated as centered.
Comment 32 Eyal Rozenberg 2023-09-19 08:24:33 UTC
(In reply to Mike Kaganski from comment #30)
> In its current wording (and in its original description), it is simply wrong,
> implying that there is some metadata in the PDF that is ignored by
> LibreOffice;

Rephrased title to clarify this point.
Comment 33 Stéphane Guillou (stragu) 2024-03-20 13:28:10 UTC
*** Bug 149216 has been marked as a duplicate of this bug. ***
Comment 34 Stéphane Guillou (stragu) 2024-03-20 13:32:24 UTC
*** Bug 158173 has been marked as a duplicate of this bug. ***