Bug 154727 - pdf import: odd text layout (tabs?) (UK IHT 407)
Status: NEW
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Draw
Version: 7.6.0.0 alpha0+ (earliest affected)
Hardware: All All
Importance: medium normal
Assignee: Not Assigned
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: PDF-Import-Draw
Reported: 2023-04-09 12:59 UTC by Dave Gilbert
Modified: 2023-04-09 16:30 UTC (History)

See Also:
Crash report or crash signature:


Attachments
Original IHT407 form unfilled (96.95 KB, application/pdf)
2023-04-09 12:59 UTC, Dave Gilbert
Screenshot of LO's rendering of this document (168.02 KB, image/png)
2023-04-09 13:00 UTC, Dave Gilbert
Okular's nice rendering of the same file (119.11 KB, image/png)
2023-04-09 13:02 UTC, Dave Gilbert
attachment 186549 inserted to document (168.43 KB, image/png)
2023-04-09 15:42 UTC, V Stuart Foote

Description Dave Gilbert 2023-04-09 12:59:43 UTC
Created attachment 186549 [details]
Original IHT407 form unfilled

In the attached UK government form, the text is laid out on a grid in LibreOffice Draw, but not in any other PDF viewer.
It looks tab-related to me: the Navigator shows all of that text as containing tabs rather than spaces.
Comment 1 Dave Gilbert 2023-04-09 13:00:45 UTC
Created attachment 186550 [details]
Screenshot of LO's rendering of this document
Comment 2 Dave Gilbert 2023-04-09 13:02:17 UTC
Created attachment 186551 [details]
Okular's nice rendering of the same file
Comment 3 V Stuart Foote 2023-04-09 15:42:52 UTC
Created attachment 186553 [details]
attachment 186549 [details] inserted to document

Works for me when I "insert" the PDF a page at a time as an image. See attached. Once inserted, you can "break" the image and then "consolidate" the text spans back into lexically meaningful runs to reassemble sentences and paragraphs.

Otherwise LibreOffice is not a PDF "viewer". 

YMMV, but personally I would never attempt to fill a form using LibreOffice, as doing so is "out of scope". IIUC these UK forms are meant to be filled online, with newer forms also using the obligatory 'GDS Transport' font. This form embeds only subsets of IRModena-Regular and IRModena-Bold, and when those fonts are not installed locally LibreOffice will substitute others.

When "Opened" as a document (into Draw, Impress or Writer, depending on the filter selected), the LibreOffice filter imports the text runs of the PDF, creating a draw text box shape for each run. There can be multiple draw shapes per line of text, and the position and size of each shape frame depend on a combination of the PDF sequence and the font details from the PDF. In the Draw module you can "consolidate" the text boxes back into sentences and paragraphs, but the line heights can shift. Breaking the inserted PDF image offers a little more fidelity, but both approaches require font substitution. PDFs generated from source documents whose fonts have odd metrics are going to have issues, just as here.
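
To illustrate the idea of "consolidating" per-run text boxes back into lines, here is a minimal sketch. It is not LibreOffice's implementation: the TextRun class, the consolidate function, the tolerance values and the sample data are all hypothetical, chosen only to show how runs sharing a baseline could be merged, with a space inserted only where there is a visible horizontal gap.

```python
from dataclasses import dataclass


@dataclass
class TextRun:
    """One imported text fragment (hypothetical model, not a LibreOffice type)."""
    x: float      # left edge of the run
    y: float      # baseline position
    width: float  # advance width of the run
    text: str


def consolidate(runs, y_tolerance=1.0, gap_threshold=1.0):
    """Group runs that share a baseline and join each group into one line.

    A space is inserted only when the horizontal gap between neighbouring
    runs exceeds gap_threshold; smaller gaps are treated as the same word.
    Both thresholds are illustrative assumptions.
    """
    lines = []  # each entry: [baseline_y, [runs on that line]]
    for run in sorted(runs, key=lambda r: (r.y, r.x)):
        if lines and abs(run.y - lines[-1][0]) <= y_tolerance:
            lines[-1][1].append(run)
        else:
            lines.append([run.y, [run]])

    result = []
    for _, line_runs in lines:
        line_runs.sort(key=lambda r: r.x)
        parts = [line_runs[0].text]
        for prev, cur in zip(line_runs, line_runs[1:]):
            gap = cur.x - (prev.x + prev.width)
            parts.append((" " if gap > gap_threshold else "") + cur.text)
        result.append("".join(parts))
    return result


# Made-up sample runs: two on one baseline, two on another.
sample = [
    TextRun(10.0, 100.0, 30.0, "Inheritance"),
    TextRun(45.0, 100.3, 12.0, "Tax"),
    TextRun(10.0, 120.0, 20.0, "Hou"),
    TextRun(30.2, 120.0, 15.0, "sehold"),
]
print(consolidate(sample))
```

Because the gap is computed from each run's width, a substituted font with different metrics changes the gaps, which is one way a merge like this can go wrong on real imports.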

IMHO => NOB
Comment 4 Dave Gilbert 2023-04-09 15:55:33 UTC
Hi,
  Thanks - however, I'm not sure this is a simple font substitution screwup.
Each of the words seems to have been tab aligned - how did that happen?

(IHT411 shows simple font substitution issues, with some text just a bit longer than needed and overlapping other content, but I'd agree that's not a bug.)

I agree about not using LO for form filling, but what I was actually trying to do was use it for manual editing, because the other PDF viewers were breaking and corrupting form field values.

(We've just fixed two Okular bugs on this).
Comment 5 V Stuart Foote 2023-04-09 16:01:25 UTC
@Miklos, the Cairo-based ipdf import filter fails rather notably at parsing the placement and sizing of the draw text objects compared to the pdfium-based filter. Should we do better?
Comment 6 V Stuart Foote 2023-04-09 16:20:51 UTC
The ipdf import filter's seemingly grid/tabbed layout issue is noted in these builds:

Version: 7.6.0.0.alpha0+ (X86_64) / LibreOffice Community
Build ID: 1e9f4de320f67d1218c710bcee1969a2324c6888
CPU threads: 8; OS: Windows 10.0 Build 19045; UI render: Skia/Vulkan; VCL: win
Locale: en-US (en_US); UI: en-US
Calc: threaded

Version: 7.5.2.2 (X86_64) / LibreOffice Community
Build ID: 53bb9681a964705cf672590721dbc85eb4d0c3a2
CPU threads: 8; OS: Windows 10.0 Build 19045; UI render: Skia/Vulkan; VCL: win
Locale: en-US (en_US); UI: en-US
Calc: threaded

Version: 6.4.7.2 (x64)
Build ID: 639b8ac485750d5696d7590a72ef1b496725cfb5
CPU threads: 8; OS: Windows 10.0 Build 19045; UI render: GL; VCL: win; 
Locale: en-US (en_US); UI-Language: en-US
Calc: threaded
Comment 7 V Stuart Foote 2023-04-09 16:30:04 UTC
(In reply to V Stuart Foote from comment #6)
> The ipdf import filter's seemingly grid/tabbed layout issue is noted in these builds:
> 
likewise with 
Version: 5.4.7.2 (x64)
Build ID: c838ef25c16710f8838b1faec480ebba495259d0
CPU threads: 8; OS: Windows 6.19; UI render: GL; 
Locale: en-US (en_US); Calc: group

so the work on bug 50879 is not involved (that work is export-only, but just checking).