155640 – Draw fails to render character sequence correctly in pdf

Summary: Draw fails to render character sequence correctly in pdf

Status:	RESOLVED NOTOURBUG

Alias:	None

Product:	LibreOffice
Classification:	Unclassified
Component:	Draw
Version: (earliest affected)	7.3.7.2 release
Hardware:	All All

Importance:	medium normal
Assignee:	Not Assigned

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:	PDF-Import-Draw

Reported:	2023-06-01 19:37 UTC by Jon Ten
Modified:	2023-06-06 10:36 UTC
CC List:	2 users

See Also:
Crash report or crash signature:

Attachments
example document (177.90 KB, application/pdf) 2023-06-05 19:09 UTC, Jon Ten

Description Jon Ten 2023-06-01 19:37:38 UTC

Description:
Some character pairs, e.g. 'fl' and 'fi', are rendered on the Draw page as U+FFFD, the character whose UTF-8 encoding is ef bf bd (a black diamond with a white question mark).
Viewing the same document in a pdf viewer looks OK.
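
As a quick check of the byte sequence quoted above, a minimal Python sketch (standard library only) decodes it and looks up the character name:

```python
import unicodedata

# The three bytes quoted in the description.
raw = bytes.fromhex("efbfbd")
ch = raw.decode("utf-8")

# Prints the code point and its official Unicode name.
print(f"U+{ord(ch):04X}", unicodedata.name(ch))  # U+FFFD REPLACEMENT CHARACTER
```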


Steps to Reproduce:
1. Open a pdf in Draw (typically one produced by printing to pdf from Firefox)
2. Note the black diamond where 'fi', 'fl', etc. should be present in a word

Actual Results:
a black diamond with a white question mark

Expected Results:
fl or fi etc


Reproducible: Always


User Profile Reset: No

Additional Info:
na

Comment 1 Khaled Hosny 2023-06-04 19:48:21 UTC

Please attach a PDF file that can be used to reproduce this issue. The font, the text, and the tool used to generate the PDF can all lead to different results.

Comment 2 Jon Ten 2023-06-05 19:09:33 UTC

Created attachment 187741
example document

Comment 3 Jon Ten 2023-06-05 19:13:19 UTC

The pdf document, when opened in Draw, has the diamond characters on 
page 2: 
significant
benefits
financial
beneficiaries

page 3:
offline
benefits
financial

page 4:
finances

Comment 4 raal 2023-06-05 20:04:27 UTC

Confirm with Version: 7.6.0.0.alpha1+ (X86_64) / LibreOffice Community
Build ID: 845054aa25b7cba1daa1ff30b142d549027299bd
CPU threads: 4; OS: Linux 5.19; UI render: default; VCL: gtk3
Locale: cs-CZ (cs_CZ.UTF-8); UI: en-US
Calc: threaded

Version 4.1.0.0.alpha0+ (Build ID: efca6f15609322f62a35619619a6d5fe5c9bd5a)

Comment 5 Khaled Hosny 2023-06-06 09:01:26 UTC

This is how the PDF is structured: fi and fl are ligatures and the PDF maps their glyphs to U+FFFD REPLACEMENT CHARACTER (�), so when we try to import the PDF as editable text this is what we get.

PDF viewers just render the glyphs from the PDF and are not concerned with the textual representation, but if you try to search for any word containing fi or fl it will not be found, and if you copy such a word, e.g. significant, and paste it, you will get signi�cant.

This is a bug in the PDF creation side.
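
The effect on searching and copying can be illustrated with a toy model in Python. This is only a sketch: the glyph indices and the `to_unicode` table are hypothetical and are not taken from the attached file.

```python
REPLACEMENT = "\ufffd"

# Hypothetical glyph index -> text table, as a faulty PDF might declare it.
to_unicode = {
    1: "s", 2: "i", 3: "g", 4: "n", 5: "c", 6: "a", 7: "t",
    8: REPLACEMENT,  # the "fi" ligature glyph, mapped to U+FFFD
}

# "significant" as drawn on the page: s i g n i [fi] c a n t
glyph_run = [1, 2, 3, 4, 2, 8, 5, 6, 4, 7]


def extract_text(run, mapping):
    """Text extraction can only report what the mapping says."""
    return "".join(mapping.get(glyph, REPLACEMENT) for glyph in run)


text = extract_text(glyph_run, to_unicode)
print(text)
print("significant" in text)  # False: a search for the word fails
```

Rendering uses only `glyph_run` and the embedded font, which is why the page looks right in a viewer while the extracted text does not.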

Comment 6 Jon Ten 2023-06-06 09:51:06 UTC

thanks
So are you saying that the pdf writer, e.g. Firefox, is creating mappings to ligature characters (glyphs) and that a pdf reader will simply render them but Draw does not, as it wants to show single characters, so maps them to U+FFFD?

If this is so why not convert back from the glyph code to the 2 characters represented as presumably they are identifiable (see https://www.unicode.org/charts/PDF/UFB00.pdf)?

OR are you saying that the pdf just has U+FFFD for ligatures? If so, how does the pdf reader access the glyphs? And if it can, why can Draw not do this?
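
For reference, the difference between the two cases can be sketched in Python with the standard `unicodedata` module. The ligature characters in the linked chart (U+FB01, U+FB02) decompose to their letters under compatibility normalization (NFKC), whereas U+FFFD carries no information about what it replaced:

```python
import unicodedata

# U+FB01 LATIN SMALL LIGATURE FI, U+FB02 LATIN SMALL LIGATURE FL,
# and U+FFFD REPLACEMENT CHARACTER.
for ch in ("\ufb01", "\ufb02", "\ufffd"):
    folded = unicodedata.normalize("NFKC", ch)
    print(f"U+{ord(ch):04X} -> {[f'U+{ord(c):04X}' for c in folded]}")

# A word stored with the ligature character is recoverable...
recovered = unicodedata.normalize("NFKC", "signi\ufb01cant")
# ...but one stored with the replacement character is not.
lost = unicodedata.normalize("NFKC", "signi\ufffdcant")
```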

best wishes

Comment 7 Jon Ten 2023-06-06 10:13:55 UTC

BTW, generating the pdf with Brave and Chrome instead shows a space (0x20) in Draw where the ligature should be. But it would be helpful to know how pdf viewers resolve this.

Comment 8 Khaled Hosny 2023-06-06 10:15:24 UTC

(In reply to Jon Ten from comment #6)
> thanks
> So are you saying that the pdf writer eg Firefox is creating mappings to
> ligature characters (glyphs) and that a pdf reader will simply render them
> but Draw does not, as it wants to show single characters, so maps them to
> U+FFFD?

PDF has a mapping from glyphs to characters so that text extraction (searching, copying) works. When importing a PDF as editable text we use this mapping; we can’t use the glyphs themselves. The mapping is faulty in this PDF, which is the responsibility of the PDF producer.


> If this is so why not convert back from the glyph code to the 2 characters
> represented as presumably they are identifiable (see
> https://www.unicode.org/charts/PDF/UFB00.pdf)?

There is no such thing as a glyph code; fonts contain glyphs in arbitrary order and have a mapping from Unicode code points to glyph indices.
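
A toy Python model of this, assuming made-up glyph indices and a simplified pairwise ligature table (real fonts use more elaborate substitution tables), shows why a glyph index alone need not lead back to characters:

```python
# Hypothetical font tables; the indices are arbitrary, as in real fonts.
cmap = {"f": 10, "i": 11, "l": 12}        # Unicode character -> glyph index
ligatures = {(10, 11): 40, (10, 12): 41}  # glyph pair -> ligature glyph


def shape(text):
    """Map characters to glyphs, then substitute ligatures pairwise."""
    glyphs = [cmap[ch] for ch in text]
    out = []
    i = 0
    while i < len(glyphs):
        pair = tuple(glyphs[i:i + 2])
        if pair in ligatures:
            out.append(ligatures[pair])
            i += 2
        else:
            out.append(glyphs[i])
            i += 1
    return out


# Inverting the character map covers only glyphs that have a code point.
reverse_cmap = {glyph: ch for ch, glyph in cmap.items()}

print(shape("fi"))           # [40]
print(reverse_cmap.get(40))  # None: the ligature glyph has no entry
```

In this model the ligature glyph is reached only through substitution, so nothing in the character map points back from index 40 to "fi".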

> OR are you saying that the pdf just has U+FFFD for ligatures. If so how does
> the pdf reader access the glyphs? And if it can why can Draw not do this? 

PDF works with glyph indices, so to render the PDF a PDF viewer simply renders the specified glyph from the font embedded in the PDF.

PDF also provides a reverse map from glyph indices to Unicode code points, to be used for text extraction. If the mapping is faulty, there is no way to retrieve the textual content. You can try copying these words from any PDF reader and you will get the same replacement character, because this is what the PDF indicates as the textual representation of these glyphs.
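
In PDF this reverse map is typically stored as a ToUnicode CMap, whose `bfchar` entries pair a character code with UTF-16BE text. The Python sketch below parses two illustrative fragments; the codes `<0011>` and `<0012>` are hypothetical and not taken from the attached file, and the parser handles only simple `bfchar` lines, not `bfrange`.

```python
import re

# One bfchar entry: <source code> <UTF-16BE destination>, both in hex.
BFCHAR = re.compile(r"<([0-9A-Fa-f]+)>\s+<([0-9A-Fa-f]+)>")


def parse_bfchar(cmap_text):
    """Return {character code: text} for the bfchar entries in a fragment."""
    return {
        int(src, 16): bytes.fromhex(dst).decode("utf-16-be")
        for src, dst in BFCHAR.findall(cmap_text)
    }


# What a faulty producer might write for the two ligature glyphs.
faulty = """
2 beginbfchar
<0011> <FFFD>
<0012> <FFFD>
endbfchar
"""

# What a usable mapping could look like: one code, two characters.
usable = """
2 beginbfchar
<0011> <00660069>
<0012> <0066006C>
endbfchar
"""

print(parse_bfchar(faulty))
print(parse_bfchar(usable))
```

With the second fragment, extraction would yield "fi" and "fl"; with the first, every consumer of the map sees only U+FFFD.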

If you want a faithful rendering of the PDF, insert it as an image. If you want faithful editing of the PDF (not importing it as text), you should try a dedicated PDF editor.

Please do not re-open; if there is still a LibreOffice issue after discussion, we will re-open the issue.

Comment 9 Jon Ten 2023-06-06 10:22:50 UTC

Thanks but it's still confusing why browsers etc would want to use glyphs when proportional fonts are fine for rendering 'fl' etc

Comment 10 Khaled Hosny 2023-06-06 10:36:09 UTC

(In reply to Jon Ten from comment #9)
> Thanks but it's still confusing why browsers etc would want to use glyphs
> when proportional fonts are fine for rendering 'fl' etc

Because that is how PDF works. PDF is not HTML; it is a format concerned first and foremost with presentation, and textual data handling is an afterthought.