Bug 151577 - Writer PDF import filter should default to producing paragraphs of text, not drawing objects
Status: UNCONFIRMED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: filters and storage
Version: 6.4.0.3 release (earliest affected)
Hardware: All All
Importance: medium normal
Assignee: Not Assigned
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: PDF-Import-Writer
Reported: 2022-10-16 18:59 UTC by Eyal Rozenberg
Modified: 2025-07-01 20:19 UTC
CC List: 7 users

See Also:
Crash report or crash signature:


Attachments
A two-paragraph Writer document exported to PDF (13.09 KB, application/pdf)
2022-10-16 18:59 UTC, Eyal Rozenberg
The original Writer document (28.97 KB, application/vnd.oasis.opendocument.text)
2022-10-16 19:00 UTC, Eyal Rozenberg

Description Eyal Rozenberg 2022-10-16 18:59:32 UTC
Created attachment 183089 [details]
A two-paragraph Writer document exported to PDF

When opening a (Writer-created) PDF document with text in several paragraphs, the resulting document should consist of paragraphs of text, very similar or identical to those in the original document which produced the PDF. This holds provided that the PDF has not been manipulated in some complex and esoteric way which breaks up its paragraphs internally (i.e. when it is objectively difficult to decide whether such paragraphs exist and what their boundaries are).

We should not be getting a bunch of independently-positioned drawing objects (single words or single lines), except for content in the PDF document which necessitates it. The drawing-object-by-default approach may be fitting for use in LO Draw or Impress (although there as well one could consider paragraph-level boxes).

Reproduction instructions:

1. Create a new Writer document
2. Enter a couple of paragraphs of text; make each of them multi-line.
3. Save the document as a PDF.
4. Open the PDF in Writer (not in Draw! Use the Writer PDF import filter)

Expected result: You get paragraphs of text.

Actual result: You get many single-line textboxes.

The attachment will let you skip steps 1-3.
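
For anyone who prefers to script the reproduction, here is a minimal Python sketch that only builds the two soffice command lines. It assumes soffice is on the PATH and that the Writer PDF import filter is named writer_pdf_import; please verify the filter name against your installation. The helper function names are made up for illustration.

```python
import shlex
from pathlib import Path


def export_to_pdf_cmd(odt: Path, outdir: Path) -> list[str]:
    """Steps 1-3: convert an existing Writer document to PDF without a GUI."""
    return ["soffice", "--headless", "--convert-to", "pdf",
            "--outdir", str(outdir), str(odt)]


def open_pdf_in_writer_cmd(pdf: Path) -> list[str]:
    """Step 4: open the PDF with the Writer import filter rather than Draw.

    The filter name below is an assumption; check it for your LO version.
    """
    return ["soffice", "--infilter=writer_pdf_import", str(pdf)]


if __name__ == "__main__":
    # Only print the commands; run them yourself (e.g. with subprocess.run).
    print(shlex.join(export_to_pdf_cmd(Path("two-paragraphs.odt"), Path("."))))
    print(shlex.join(open_pdf_in_writer_cmd(Path("two-paragraphs.pdf"))))
```

The file names are placeholders; substitute the attached PDF to skip the export step.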
Comment 1 Eyal Rozenberg 2022-10-16 19:00:56 UTC
Created attachment 183090 [details]
The original Writer document

Opening the PDF should result in a document that is very similar to this one (the original document exported to PDF).
Comment 2 m_a_riosv 2022-10-17 02:27:50 UTC

*** This bug has been marked as a duplicate of bug 32249 ***
Comment 3 Eyal Rozenberg 2025-06-30 20:59:09 UTC
Considering this bug is about Writer, and bug 32249 is about Draw, I think this should not be a dupe. Also, (re)constructing paragraphs is not only for the purpose of ease-of-editing.
Comment 4 Heiko Tietze 2025-07-01 06:18:32 UTC
Assuming PDF is a document format that allows editing... but it isn't.
No UX aspect in file format questions.
Comment 5 Eyal Rozenberg 2025-07-01 06:31:27 UTC
(In reply to Heiko Tietze from comment #4)
> Assuming PDF is a document format that allows editing... but it isn't.

NO, there's no assuming that. We don't edit any format directly except for ODF. The rest - we import, edit, and export.

Moreover, and perhaps more importantly - whoever said we were writing back to the PDF? Assume we're going to be saving an ODF file.

> No UX aspect in file format questions.

Ok.
Comment 6 V Stuart Foote 2025-07-01 10:38:35 UTC
Sorry, it is a dupe of bug 33249, clear and simple. The filter functions needed to render PDF text spans back as Paragraph objects would be the same across all LO modules.

Comment 0 was opened against a Writer-originated ODF document, but no distinction is made in the export filter(s): PDF has no "paragraph" object keeping text spans together as sentences, and even words might be broken apart. And this *enhancement* is not about the LO Hybrid PDF, which attaches the ODF source document to the PDF so that LO can selectively open that attachment on import, bypassing the PDF facsimile. That already functions as an export option.

For bug 32249 and bug 118370 Justin L. completed *one* reasonable approach working with the poppler -> cairo extracted sd text box objects from the PDF BT/ET spans, of "consolidating" a selection of the generated text boxes into a single text box object.

An alternative was proposed at https://bugs.documentfoundation.org/show_bug.cgi?id=32249#c19 of a process taking the extracted strings (still poppler -> cairo based) and reflowing them into lexically correct full sentences or full paragraph objects, then assembling those into an ODF-ready object available to style, spell check, etc. The focus would be less on the layout of the PDF and more on extracting a lexicographically correct representation of a page.
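
As a rough illustration of what "reflowing" extracted strings could mean, here is a small Python sketch. It is not the filter's code and not the proposal from bug 32249; it only assumes that the lines of one paragraph have already been extracted as plain strings, and the function name reflow is hypothetical.

```python
def reflow(lines: list[str]) -> str:
    """Join the physical lines of one paragraph into a single flowing string."""
    out = ""
    for raw in lines:
        piece = raw.strip()
        if not piece:
            continue  # ignore empty extraction fragments
        if (out.endswith("-") and len(out) >= 2 and out[-2].isalpha()
                and piece[0].islower()):
            # Likely a word hyphenated at a line break: drop the hyphen and join.
            out = out[:-1] + piece
        elif out:
            out += " " + piece
        else:
            out = piece
    return out


if __name__ == "__main__":
    print(reflow(["The import fil-", "ter should produce", "paragraphs."]))
```

The de-hyphenation rule is deliberately naive: a genuine compound such as "well-known" that happens to break at the hyphen would be joined incorrectly, so a real implementation would likely need a dictionary or language-aware check.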

So, this bz issue could be that additional work, more fully scoped here. Or we could set it back to the dupe it is, as bug 33249 was left open after the work on bug 118370 but its scope was not expanded to all PDF import filters.

Added the devs with insight, for their opinions, but coin flip set it again as the dupe it is.
Comment 7 V Stuart Foote 2025-07-01 10:41:52 UTC
(In reply to V Stuart Foote from comment #6)
> Sorry, it is a dupe of bug 33249 clear an simple. Filter functions needed to
> ...

better make that bug 32249
Comment 8 Dave Gilbert 2025-07-01 11:11:23 UTC
(In reply to V Stuart Foote from comment #6)
> Sorry, it is a dupe of bug 33249 clear an simple. Filter functions needed to
> render PDF text spans back as Paragraph objects would be the same across all
> LO modules. 

The poppler import code does have an abstraction of which module it's targeting,
so it _could_ do something different for writer than draw; however...

> 
> Comment 0 was opened against a Writer originated ODF document, but there is
> no distinction made in the export filter(s) (PDF has no "paragraph" object
> keeping text spans together as sentences, even words might be broken apart).
> And this *enhancement* is not about the LO Hybrid PDF that attaches the ODF
> source document into the PDF and selectively LO will open that attachment on
> import--bypassing the PDF facsimile. But that already functions as an export
> option.
> 
> For bug 32249 and bug 118370 Justin L. completed *one* reasonable approach
> working with the poppler -> cairo extracted sd text box objects from the PDF
> BT/ET spans, of "consolidating" a selection of the generated text boxes into
> a single text box object.
> 
> An alternative was proposed at
> https://bugs.documentfoundation.org/show_bug.cgi?id=32249#c19 of an process
> taking the extracted strings (still poppler -> cairo based) and reflowing
> that into lexically correct full sentences or full paragraph objects. And
> assembling those into as an ODF ready object available to style, spell
> check, etc. Focus would be less on the layout of the PDF and more on
> extracting a lexicographic correct representation of a page.
> 
> So, this bz issue could be that additional work. More fully scoped here. Or,
> we  could set back to the dupe it is as bug 33249 was left open after the
> work on bug 118370 but scope was not expanded to all PDF import filters. 
> 
> Added the devs with insight, for their opinions, but coin flip set it again
> as the dupe it is.

Yeah, the hard part is deciding how to assemble the chunks of text; once you have those,
spitting them out as a paragraph object for Writer feels relatively easy.
There are some recent separate non-LO tools that try various heuristics for it which look pretty neat, so while it's never going to be perfect, something better should be doable.
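
To make "assembling the chunks" concrete, here is one very simple heuristic as a Python sketch: start a new paragraph whenever the vertical gap between consecutive lines is noticeably larger than normal line spacing. It assumes each extracted line comes with a top coordinate and a height in points; the Line type, the function name and the gap_factor threshold are all hypothetical, and this is not what the poppler-based import code or any of the tools mentioned actually does.

```python
from dataclasses import dataclass


@dataclass
class Line:
    text: str
    top: float     # distance of the line's top edge from the page top, in points
    height: float  # line height (roughly the font size), in points


def group_into_paragraphs(lines: list[Line],
                          gap_factor: float = 0.6) -> list[list[Line]]:
    """Split lines into paragraphs wherever the vertical gap is unusually large."""
    paragraphs: list[list[Line]] = []
    for line in sorted(lines, key=lambda ln: ln.top):
        if paragraphs:
            prev = paragraphs[-1][-1]
            gap = line.top - (prev.top + prev.height)
            if gap <= gap_factor * prev.height:
                paragraphs[-1].append(line)  # normal spacing: same paragraph
                continue
        paragraphs.append([line])  # first line, or a large gap: new paragraph
    return paragraphs


if __name__ == "__main__":
    page = [Line("First line", 100, 12), Line("second line", 114, 12),
            Line("third line", 128, 12), Line("New paragraph", 160, 12),
            Line("continues", 174, 12)]
    for para in group_into_paragraphs(page):
        print(" ".join(ln.text for ln in para))
```

A single threshold like this ignores indentation, columns, headings and justified text, which is exactly where the heuristics get hard.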

Duping as suggested.

If you want to repeatedly edit through a PDF you create from LO, tick the hybrid box - that's what it's for!

*** This bug has been marked as a duplicate of bug 32249 ***
Comment 9 Eyal Rozenberg 2025-07-01 18:43:01 UTC
Stuart, stop messing with my bugs. 

> Sorry, it is a dupe of bug 33249 clear an simple.

It clearly and simply isn't.
Comment 10 Eyal Rozenberg 2025-07-01 18:46:12 UTC
Half-an-apology Stuart, I realize now that you didn't actually mark this as a dupe, you merely made the baseless claim of this being a dupe, and Dave Gilbert obliged you.

Oh, and this bug has almost nothing to do with 118370.

Dave: Please don't do these kinds of things.
Comment 11 Eyal Rozenberg 2025-07-01 18:47:36 UTC
Oh, and of course this isn't an enhancement, it's just a bug. If we open a PDF in Writer, the filter should reconstitute a Writer document - as best it can - from the PDF. Failing to constitute paragraphs and filling each page with a bunch of drawing objects is simply a failure.
Comment 12 V Stuart Foote 2025-07-01 19:11:51 UTC
@Eyal,

It's been mentioned on multiple occasions that it serves no purpose to open multiple BZ issues for what are essentially identical issues.

And splitting a hair here and calling it a bug rather than an enhancement to the Writer PDF import filter is just petty--calling them "your bugs" just shows the extent of your ego.

When you do this petty sniping, I can't help but compare you to Paolo. Is that the reputation you're looking for?

But to the issue at hand: the poppler -> cairo based PDF filters are monolithic; what affects one module affects all.

The scope of effort remaining for bug 32249 is exactly what would still be required to work with text runs as Paragraph objects in swriter, or Text Boxes in sdraw. If anything, more so, given the distinction between text held inside sd text box objects and the likely need to place extracted paragraphs into object frames to be able to replicate the document layout from a PDF source on a swriter page.
Comment 13 Eyal Rozenberg 2025-07-01 19:25:16 UTC
(In reply to V Stuart Foote from comment #12)
> @Eyal,
> 
> Its been mentioned on multiple occasions it servs no purpose to open
> multiple BZ issues for what are essentially identical issues. 

And it has also been mentioned it serves no purpose, or negative purpose, to fold different issues into a single kitchen-sink issue.

> And splitting a hair here and calling it a bug rather than an enhancement to
> the Writer PDF import filter is just petty--calling them "your bugs" just
> shows the extent of your ego.
> 
> When you do this petty sniping, I can't help but compare you to Paolo. Is
> that the reputation you're looking for. 

A Bugzilla page about importing PDFs into Writer is definitely not where I would write anything regarding Paolo's virtues or faults. So absolutely no comment on that.

> But to issue at hand, the poppler -> cairo based PDF filters are monolithic,
> what affect one module affects all.

These are two filters, for two applications, which should have significantly different behavior. Perhaps a separate bug (or meta-bug) should be filed about drawing them apart from each other, as part of a wider re-organization of PDF-support-related bugs.

> The scope of effort remaining for bug 32249 is exactly what would still be
> required to work with text runs as Paragraph objects in swriter, or Text
> Boxes in sdraw.

Whatever we may think of bug 32249 (I am hoping to get enough support to split it up and either keep it as a meta-bug or replace it with one), it regards importing PDFs into Draw; this bug regards importing them into Writer.
Comment 14 Dave Gilbert 2025-07-01 20:19:45 UTC
(In reply to Eyal Rozenberg from comment #10)
> Half-an-apology Stuart, I realize now that you didn't actually mark this as
> a dupe, you merely made the baseless claim of this being a dupe, and Dave
> Gilbert obliged you.

Hey! I often disagree with Stuart - I don't 'oblige' people - I read it, consider
and do what I think is technically correct.

> Oh, and this bug has almost nothing to do with 118370.
> 
> Dave: Please don't do these kinds of things.

Please don't get into ranting matches.

My reason for agreeing here is that recombining text in PDFs is technically hard; it certainly needs doing - but it's not like we have to fix some existing code; we need to go and try a whole bunch of systems to see what would work and write a whole new thing.
It's not like we're missing some small corner/feature that we need to fix.

Having said that, I really don't care if this is a dupe; I just fix code.