165396 – PDF Import: When font face is missing, respect box dimensions rather than font x-height

Status:	NEW

Alias:	None

Product:	LibreOffice
Classification:	Unclassified
Component:	filters and storage
Version: (earliest affected)	25.8.0.0 alpha0+
Hardware:	All All

Importance:	medium enhancement
Assignee:	Not Assigned

URL:
Whiteboard:
Keywords:	filter:pdf

Depends on:
Blocks:	PDF-Import-Draw PDF-Import-Writer

Reported:	2025-02-22 21:52 UTC by Eyal Rozenberg
Modified:	2025-02-25 23:05 UTC
CC List:	3 users

See Also:
Crash report or crash signature:

Description Eyal Rozenberg 2025-02-22 21:52:39 UTC

Consider the PDF document in attachment 199370 - page 1.

That document has a black bar across the page, text above it aligned to its left, and text below it aligned to its right. The text below ("Release 9.4") uses the Computer Modern font face (serif, roman, size=x-height 10.9 pt). When Computer Modern is unavailable, and the fallback for it is a wider font (say, DejaVu Sans) - the text starts at the same horizontal position, but extends much further, past the end of the black bar, ruining the alignment effect.

Now, I am guessing there is no explicit indication in the PDF of there being an alignment. But I am also guessing that the dimensions of the object with the text in it are given explicitly.

Assuming that is the case, I believe it is better for LO to respect the box dimensions, instead of respecting the nominal 10.9 pt size of the font.

The question of what to respect is of course a tradeoff - we can't do both. But since the font used is not the right one anyway, the merit in maintaining the exact size is very limited; while keeping the same dimensions has a clear benefit, not just in this example, but in principle as well: Maintaining the location, size and proportions of items on the page as intended.

How would the dimensions be respected? In one of several ways:

* Reduce or enlarge the font
* Scale the font horizontally (i.e. the character width setting we have in LO)
* Use narrower or wider inter-character and/or inter-word spacing
* Use a condensed variant of the fallback font.

... and perhaps there are other ways to do it. The first and second adjustments combine, i.e. one could set the "best" size to fit within the box, then stretch the dimension which is not an exact fit, to make it an exact fit. But this again represents a trade-off: Perfect fit to the box vs. natural proportions of the font. Other adjustments may also combine.
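
As a rough illustration of the first three adjustments listed above (this is not how LO's import filter works today), here is a minimal Python sketch. It assumes we already know the width the run was meant to occupy (target_w) and the width the fallback font would naturally give it at the nominal size (fallback_w), both in points. The Fit record and all function names are made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Fit:
    font_size: float     # points
    h_scale: float       # horizontal glyph scale, 1.0 means 100%
    char_spacing: float  # extra points inserted between adjacent glyphs


def fit_by_font_size(target_w, fallback_w, font_size):
    """Shrink or enlarge the font uniformly so the run has the target width."""
    return Fit(font_size * target_w / fallback_w, 1.0, 0.0)


def fit_by_h_scale(target_w, fallback_w, font_size):
    """Keep the nominal size, scale glyphs horizontally (character width)."""
    return Fit(font_size, target_w / fallback_w, 0.0)


def fit_by_spacing(target_w, fallback_w, font_size, n_glyphs):
    """Keep size and glyph shapes, spread the difference over the gaps."""
    gaps = max(n_glyphs - 1, 1)
    return Fit(font_size, 1.0, (target_w - fallback_w) / gaps)


# Hypothetical numbers: an 11-glyph run that should be 100 pt wide,
# but comes out 125 pt wide in the fallback font at 10.9 pt.
print(fit_by_font_size(100.0, 125.0, 10.9))
print(fit_by_h_scale(100.0, 125.0, 10.9))
print(fit_by_spacing(100.0, 125.0, 10.9, 11))
```

Each function preserves the run width and gives up something else (nominal size, glyph proportions, or natural spacing), which is the trade-off in question.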

Comment 1 V Stuart Foote 2025-02-22 22:04:00 UTC

Just FYI, note insert as image (with the pdfium filter) onto a Draw or a Writer page is pixel perfect.

This is an issue handling poppler parsing with cairo to create sd text shapes.

Comment 2 Eyal Rozenberg 2025-02-22 23:33:18 UTC

(In reply to V Stuart Foote from comment #1)

Indeed, I was talking about our default filter, producing an editable document.

> This is an issue handling poppler parsing with cairo to create sd text
> shapes.

... and can you confirm it?

Comment 3 V Stuart Foote 2025-02-23 17:31:12 UTC

Actually, is this even a valid request? The dimensions of text object streams do not exist within the PDF for extraction. Only a starting position and spacing.

The BT/ET for text object streams inside the PDF do not receive a statement of "dimensions of the object with the text in it"; rather, within the text object, between the starting BT and the ending ET, there is simply a Tf line - with font and glyph size, a Td line - with text start position in x and y offset from bottom left and any transformation, and the Tj line - the character string or lookups from the font dictionary.

And on LibreOffice import via poppler lib, alignment of the draw text object is observed, as is general glyph size of the font. But beyond the needed anchoring, the draw shape is not sized to match what had been held for the stream within PDF.

And if the font used in the PDF is not available to os/DE, the poppler -> cairo rendering to a text draw shape uses fall back font to render the assembled text span.

In other words, the resulting LibreOffice draw text shapes can differ considerably from how they were laid down in PDF because they are rendered with different font glyphs.

Comment 4 Eyal Rozenberg 2025-02-23 22:57:17 UTC

(In reply to V Stuart Foote from comment #3)
> The dimensions of text object
> streams do not exist within the PDF for extraction. Only a starting position
> and spacing.
>
> The BT/ET for text object streams inside the PDF do not receive a statement
> of  "dimensions of the object with the text in it", rather within the text
> object Starting BT and Ending ET there is simply a Tf line - with font and
> glyph size, a Td line - with text start position in x and y offset from
> bottom left and any transformation, and the Tj line - the character string
> or lookups from font dictionary.

I am not familiar with most of the specifics of PDFs' internal structure. I know that some objects have "boxes" specified and some don't. But - even if "text object streams" don't have them - their dimensions can be readily determined - just like PDF viewers determine them: by using the font metrics to place glyphs until reaching the end of the stream, or stretch of text, or what-not; that gets you the width, or right edge.

> And on LibreOffice import via poppler lib, alignment of the draw text object
> is observed, as is general glyph size of the font. But beyond the needed
> anchoring, the draw shape is not sized to match what had been held for the
> stream within PDF.
> 
> And if the font used in the PDF is not available to os/DE, the poppler ->
> cairo rendering to a text draw shape uses fall back font to render the
> assembled text span.
> 
> In other words, the resulting LibreOffice draw text shapes can differ
> considerably from how they were laid down in PDF because they are rendered
> with different font glyphs.

Yes, you're describing the problematic behavior, which I believe should be changed. I won't argue with you about marking this as an enhancement, since I did say what I'm suggesting is a different tradeoff of what to preserve.

Comment 5 Dave Gilbert 2025-02-24 01:00:29 UTC

Looking at the raw PDF (uncompressed with podofouncompress) the actual PDF code we have for that main line is:

BT                                                                                                                   
/F75 20.6585 Tf 90 561.788 Td [(Org)-375(Mo)-31(de)-375(Compact)-375(Guide)]TJ                                       
ET

I don't see anything in there showing any form of bounding box that it's part of - there's no hierarchy here except the outer 'MediaBox' which I think is most of the page?

The numbers in there (the 375 and 31) are spacing and kerning; so we're not getting the size of any character directly there.
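
To make the structure of that TJ array concrete, here is a small Python sketch that splits it into strings and numeric adjustments. It is a simplification for illustration only: it handles plain literal strings as in the line above, not hex strings or nested parentheses, and parse_tj_array is a made-up name, not something from poppler or LO.

```python
import re

# Either a literal string "(...)" (allowing backslash escapes) or a number.
TJ_TOKEN = re.compile(r"\((?P<text>(?:\\.|[^\\()])*)\)|(?P<num>[-+]?\d*\.?\d+)")


def parse_tj_array(src):
    """Return the TJ array elements in order: str for text, float for adjustments."""
    items = []
    for match in TJ_TOKEN.finditer(src):
        if match.group("text") is not None:
            items.append(match.group("text"))
        else:
            items.append(float(match.group("num")))
    return items


items = parse_tj_array("[(Org)-375(Mo)-31(de)-375(Compact)-375(Guide)]")
print(items)
```

In the PDF text model these numbers are in thousandths of a text-space unit and are subtracted from the current position, so a negative value such as -375 moves the next glyph further along the line (i.e. adds space).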

The line is an entirely separate rectangle/fill:

0 0 432 3.985 re f

So there's nothing we can do associating the text and the line or the box outside it.

So let's see what the /F75 font tells us:
/F75 7 0 R
...
7 0 obj
<<    
/Type /Font
/BaseFont /CXVGSH+CMBX12
/FirstChar 11
/FontDescriptor 592 0 R
/LastChar 121
/Subtype /Type1
/ToUnicode 1 0 R
/Widths 588 0 R
>>

huh OK so 588 is:
588 0 obj
[ 656.300000 625 625 937.500000 937.500000 312.500000 343.700000 562.500000 562.500000 562.500000
562.500000 562.500000 849.500000 500 574.100000 812.500000 875 562.500000 1018.500000 1143.500000
875 312.500000 342.600000 581 937.500000 562.500000 937.500000 875 312.500000 437.500000
437.500000 562.500000 875 312.500000 375 312.500000 562.500000 562.500000 562.500000 562.500000
....

so that *is* the expected width of each character, so in theory you should be able to get some idea of the total length.
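
A minimal Python sketch of that calculation, under stated assumptions: a simple single-byte font, no character spacing (Tc), word spacing (Tw) or horizontal scaling (Tz) in effect, and a /Widths array indexed from /FirstChar. The function name run_width and the toy numbers in the example are made up; they are not the real CMBX12 metrics.

```python
def run_width(items, widths, first_char, font_size, missing_width=0.0):
    """Width in points of one TJ array.

    items      -- bytes objects (character codes) and numbers (adjustments)
    widths     -- the font's /Widths array, in thousandths of a text-space unit
    first_char -- the font's /FirstChar
    font_size  -- the size operand given to Tf
    """
    total = 0.0  # accumulated in thousandths of a text-space unit
    for item in items:
        if isinstance(item, (int, float)):
            # TJ adjustments are subtracted from the current position.
            total -= item
        else:
            for code in item:
                index = code - first_char
                in_range = 0 <= index < len(widths)
                total += widths[index] if in_range else missing_width
    return total / 1000.0 * font_size


# Toy font: code 65 ('A') is 700 wide, code 66 ('B') is 600 wide.
toy_widths = [700, 600]
width_pt = run_width([b"AB", -100, b"A"], toy_widths, 65, 10.0)
print(width_pt)
```

With the real /Widths array and the font's encoding, the same sum would give the right edge of the run as the PDF's author saw it, without rendering anything.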

I don't know how much is already done; for me the substitute that's being picked looks about right, no big overrun.

Comment 6 Eyal Rozenberg 2025-02-24 09:19:46 UTC

(In reply to Dave Gilbert from comment #5)
> [(Org)-375(Mo)-31(de)-375(Compact)-375(Guide)]TJ                            

Actually, the line I am talking about says "Release 9.4". But your observations are probably valid for that one as well.

It's certainly true, that the relation of the two elements - the black bar and the text run - is not made explicit in the PDF. This is not because of the PDF format - it's just like in Draw: If we align shapes, there's nothing written to the ODG file saying that they are associated with each other. But if we were to maintain object dimensions as best we can - we would maintain the implicit alignments and authors' intended presentational effect, better.

And indeed, like you've shown us - there is information inside the PDF to let us do so. In fact, it seems from your description that it is easier than I had thought, and one might not even need to do full text rendering to a scratch buffer - just accounting for that sequence of numbers you quoted might be enough.

Finally, about the fallback font: The overrun depends on which font is used on your system. On mine, it's wider, and so is the discrepancy. But once it's "off", you lose the effect the author aimed for. 

Another point to consider is what happens within paragraphs of text, where the text isn't just a single stretch for the whole line. There, the choice of the tradeoff (size, horizontal scale of the font, inter-character spacing, unintended space at the end) is even more delicate.

Comment 7 Dave Gilbert 2025-02-24 14:43:55 UTC

Yeh I can imagine something like getting that data and looking at the width of 'i' and 'm' say and then some heuristic to choose the size of the fallback font rather than what we were told; or maybe even to pick the right fallback?

Anyway, this sounds kind of doable, but is very much a heuristicy type thing we'd have to play with.
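
One possible shape for such a heuristic, as a Python sketch only: compare the advance widths the PDF declares for a few probe glyphs (say 'i' and 'm') against the same glyphs in each candidate fallback, derive a size correction, and prefer the candidate needing the least correction. All names and all numbers below are hypothetical; real metrics would have to come from the PDF's /Widths array and from the installed fonts.

```python
def size_scale(expected, fallback):
    """Factor to apply to the nominal font size so that the probe glyphs
    take up the same total width in the fallback font.

    Both arguments map glyph -> advance width in thousandths of an em.
    """
    common = expected.keys() & fallback.keys()
    fallback_total = sum(fallback[g] for g in common)
    if not common or fallback_total == 0:
        return 1.0  # nothing to compare, leave the size alone
    return sum(expected[g] for g in common) / fallback_total


def pick_fallback(expected, candidates):
    """Name of the candidate whose size correction is closest to 1.0."""
    return min(candidates, key=lambda name: abs(size_scale(expected, candidates[name]) - 1.0))


# Hypothetical probe widths, not taken from any real font.
expected = {"i": 312.5, "m": 937.5}
candidates = {
    "WideSans": {"i": 400.0, "m": 1100.0},
    "NarrowSerif": {"i": 300.0, "m": 900.0},
}
best = pick_fallback(expected, candidates)
print(best, size_scale(expected, candidates[best]))
```

Two probe glyphs are a very coarse sample; weighting by the glyphs that actually occur in the run would likely behave better, but that is exactly the kind of thing that would need playing with.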

Comment 8 V Stuart Foote 2025-02-24 16:15:33 UTC

(In reply to Dave Gilbert from comment #7)
> Yeh I can imagine something like getting that data and looking at the width
> of 'i' and 'm' say and then some heuristic to choose the size of the
> fallback font rather than what we were told; or maybe even to pick the right
> fallback?
> 
> Anyway, this sounds kind of doable, but is very much a heuristicy type thing
> we'd have to play with.

But ideally it doesn't even come from a fallback; instead we should try harder to use the embedded subset font glyphs recorded in the font dictionaries in the PDF. Unfortunately they won't follow Unicode, and absent /ToUnicode handling inside the PDF, they normally get arbitrarily encoded code points to assemble the text spans.

But then that is how the pdfium Insert filter parses the PDF (to a BMP image, with high fidelity to the original PDF, though we've asked for an option to also deliver vector objects).

If we were able to better use the subset fonts when filter imported (poppler -> cairo object) then only additional edits/changes to the draw text shapes might need fallback font.

But suspect that would be a lot of dev effort for minimal returns--LibreOffice is not a PDF editor, and the current filter import of PDF text objects to runs within Draw text shapes is really not too bad and suitable to majority of uses.

Comment 9 Dave Gilbert 2025-02-24 17:58:53 UTC

The tricky part of doing that with poppler is that we would have to get our poppler built with font support, which it currently isn't, and that adds a bunch more dependencies (which I'm unsure how to satisfy).
But, I don't think it's actually what you want - if you want rendered pdf, then use the pdfium import path, I think the poppler path is only useful where it is to some degree editable.

Comment 10 Eyal Rozenberg 2025-02-24 19:17:36 UTC

(In reply to V Stuart Foote from comment #8)
> But ideally it doesn't even come from a fallback, instead we should try
> harder to use the embedded subset font glyphs recorded to font dictionaries
> in the PDF.

That is a separate issue. Naturally, if one uses glyphs with the same metrics, one gets the same dimensions. This bug is about when that does not happen, for any reason. The reason may be us misidentifying the font, but it might also be that it's a Type-1 font, which we don't support on principle (as I have recently been told).

> If we were able to better use the subset fonts when filter imported (poppler
> -> cairo object) then only additional edits/changes to the draw text shapes
> might need fallback font.
> 
> But suspect that would be a lot of dev effort for minimal returns
> ...
> the current filter import of
> PDF text objects to runs within Draw text shapes is really not too bad 

The returns would be _massive_, not minimal. It is very often the difference between imported PDFs looking like disarrayed junk and properly-laid-out text. Just think of the case of justified paragraphs of text; you can look at the same document as an example. Right now, we mess them up. Which is exactly "too bad".

> LibreOffice is not a PDF editor

I do believe that discussion is settled:

* https://www.youtube.com/watch?v=98yX0JRHFbQ
* https://events.documentfoundation.org/libreoffice-conference-2023/talk/JDSEVU/

plus, a "PDF editor" as you would define it could not even have this bug, but the only thing it would do is stick to the exact structure of the PDF.

Comment 11 V Stuart Foote 2025-02-24 22:39:10 UTC

(In reply to Eyal Rozenberg from comment #10)

> 
> > LibreOffice is not a PDF editor
> 
> I do believe that discussion is settled:
> 
> * https://www.youtube.com/watch?v=98yX0JRHFbQ
> * https://events.documentfoundation.org/libreoffice-conference-2023/talk/JDSEVU/
> 
> plus, a "PDF editor" as you would define it could not even have this bug,
> but the only thing it would do is stick to the exact structure of the PDF.

Only in your mind, but not for any active developer nor in the majority of users. Most know better. 

Calling it such, or LibreOffice treating PDF as an editable format, just raises unreasonable user expectations--and disappoints them with the marginal results.

More desirable for the majority of users and supporting reasonable workflows with PDF would be to implement improvements in page handling (bug 114234), and PDF text object stream aggregation and transformation into Paragraph objects (bug 32249).

There really is no advantage to import filter transformation from PDF with pixel perfect placement of draw object text runs without lexical context. The pdfium insert filters already excel at that. 

PDF is a publication/presentation format not an editable document--there is no such thing as a PDF editor, as PDF are not editable.

Comment 12 Eyal Rozenberg 2025-02-24 22:51:49 UTC

(In reply to V Stuart Foote from comment #11)
> Only in your mind, but not for any active developer nor in the majority of
> users. Most know better. 

You should really follow the links. It is the exact opposite. PDFs are editable, users need PDF editors, and LibreOffice, despite its flaws, is considered one of the top FOSS PDF editors. If we edited PDFs better, we could aim for the popularity of PDF editors like... yes, you guessed it, Microsoft Office Word.

Also, Stuart, the fact that one bug may (or may not) be of more significance than another, does not make the other invalid.

Oh, one final point is that the heuristic reconstitution of text into paragraphs will likely rely on determining the correct dimensions of text runs (text object streams) - at least for setting the alignment of the reconstituted paragraph, if not for the reconstitution itself.

Comment 13 Dave Gilbert 2025-02-24 23:01:05 UTC

Woah, stop arguing folk!

I'm somewhere between the two of you;  I don't think pixel-perfect is doable
without the right font;  but a heuristic to improve stuff might be worth a try but
it's far from the top problem.

I do agree that people do want to edit PDFs, and I've used it for that myself
(which is partly why I started fixing bugs in it).

Now while it's technically true PDF isn't an editable document, it's true in the same way that we can't prove that some programs will terminate; we do a reasonable job with most simple PDFs (if you have the fonts), but things like the text aggregation are probably more urgent in making it easier to edit.

Comment 14 Dave Gilbert 2025-02-24 23:02:20 UTC

Since I've been fixing poppler->PDF bugs recently, I kind of agree with the idea;
so will move it to 'new'.
I've no idea how well a heuristic would work for it.

Comment 15 V Stuart Foote 2025-02-24 23:52:33 UTC

(In reply to Eyal Rozenberg from comment #12)
> (In reply to V Stuart Foote from comment #11)
> > Only in your mind, but not for any active developer nor in the majority of
> > users. Most know better. 
> 
> You should really follow the links. It is the exact opposite. PDFs are
> editable, users need PDF editors, and LibreOffice, despite its flaws, is
> considered one of the top FOSS PDF editors. If we edited PDFs better, we
> could aim for the popularity of PDF editors like... yes, you guessed it,
> Microsoft Office Word.
> 
> Also, Stuart, the fact that one bug may (or may not) be of more significance
> than another, does not make the other invalid.
> 
> Oh, one final point is that the heuristic reconstitution of text into
> paragraphs  will likely rely on determining the correct dimensions of text
> runs (text object streams) - at least for setting the alignment of the
> reconstituted paragraph, if not for the reconstitution itself.

Did I close this bug, did I say it couldn't be done? I said there are other priorities for handling PDF.

Dave G. says it is feasible but requires adjustment to our poppler bundling (and that would have to be cross platform). That alone should put a chill to this lunacy.

What I maintain and will not budge on is that PDF is not intended to be an editable format--it is a published finished page oriented document. We can consume pages with high fidelity using pdfium.  The poppler lib parsing and cairo transformation into Draw text shape objects is functional, but can never be pixel perfect--no sense in attempting to make it so when offering PDF editing is way out of scope! 

ps @Dave thanks for poking around in the LibreOffice guts. Should you need folks to bounce patches off of (cc in gerrit) recommend quikee (Tomaž V.), Justin L., Khaled H. and of course Miklos V. all of whom are familiar with workings of our PDF filters import and export.  Eyal and I mostly do QA and UX-advise--don't take anything we say as gospel. And we do poke at each other :-)

Comment 16 V Stuart Foote 2025-02-25 15:21:32 UTC

(In reply to Dave Gilbert from comment #14)
> Since I've been fixing poppler->PDF bugs recently, I kind of agree with the
> idea;
> so will move it to 'new'.
> I've no idea how well a heuristic would work for it.

Thanks Dave! Excited to see how you get on with it. 
Justin L. got us the ability to 'Consolidate Text' (.uno:TextCombine) for a selected run of draw text shapes, but that only goes so far, is awkward to use (e.g. as multi-selection of the shape texts) and is limited to the Draw module.

Maybe have a look at suggestion in
https://bugs.documentfoundation.org/show_bug.cgi?id=32249#c19

Comment 17 Eyal Rozenberg 2025-02-25 23:05:50 UTC

(In reply to Dave Gilbert from comment #14)

Much obliged!

> I've no idea how well a heuristic would work for it.

Please feel very free to ask here, or on Telegram, for help in testing, or brainstorming, or feedback on partial work etc.