Bug 155640 - Draw fails to render character sequence correctly in pdf
Status: RESOLVED NOTOURBUG
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Draw
Version (earliest affected): 7.3.7.2 release
Hardware: All All
Importance: medium normal
Assignee: Not Assigned
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: PDF-Import-Draw
Reported: 2023-06-01 19:37 UTC by Jon Ten
Modified: 2023-06-06 10:36 UTC (History)
2 users

See Also:
Crash report or crash signature:


Attachments
example document (177.90 KB, application/pdf)
2023-06-05 19:09 UTC, Jon Ten

Description Jon Ten 2023-06-01 19:37:38 UTC
Description:
Some character pairs, e.g. 'fl' and 'fi', are rendered on the Draw page as U+FFFD REPLACEMENT CHARACTER (UTF-8 bytes ef bf bd, shown as a black diamond with a white question mark).
Viewing the same document in a PDF viewer looks OK.
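
For reference, the byte sequence quoted above can be checked with a few lines of Python. This is only a sanity check on the encoding, using the standard library; the variable names are illustrative.

```python
import unicodedata

# The UTF-8 byte sequence quoted in the description.
raw = bytes.fromhex("efbfbd")

# Decode it and look up the code point and its official Unicode name.
char = raw.decode("utf-8")
codepoint = f"U+{ord(char):04X}"
name = unicodedata.name(char)

print(codepoint, name)
```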


Steps to Reproduce:
1. Open a PDF in Draw (typically one produced by printing to PDF from Firefox)
2. Note the black diamond where 'fi', 'fl' etc. should be present in a word

Actual Results:
a black diamond with a white question mark

Expected Results:
fl or fi etc


Reproducible: Always


User Profile Reset: No

Additional Info:
na
Comment 1 ⁨خالد حسني⁩ 2023-06-04 19:48:21 UTC
Please attach PDF file that can be used to reproduce this issue. The font, the text, the tool used to generate PDF all can lead to different results.
Comment 2 Jon Ten 2023-06-05 19:09:33 UTC
Created attachment 187741
example document
Comment 3 Jon Ten 2023-06-05 19:13:19 UTC
The PDF document, when opened in Draw, has the diamond characters in the following words.

page 2:
significant
benefits
financial
beneficiaries

page 3:
offline
benefits
financial

page 4:
finances
Comment 4 raal 2023-06-05 20:04:27 UTC
Confirm with Version: 7.6.0.0.alpha1+ (X86_64) / LibreOffice Community
Build ID: 845054aa25b7cba1daa1ff30b142d549027299bd
CPU threads: 4; OS: Linux 5.19; UI render: default; VCL: gtk3
Locale: cs-CZ (cs_CZ.UTF-8); UI: en-US
Calc: threaded

Version 4.1.0.0.alpha0+ (Build ID: efca6f15609322f62a35619619a6d5fe5c9bd5a)
Comment 5 ⁨خالد حسني⁩ 2023-06-06 09:01:26 UTC
This is how the PDF is structured: fi and fl are ligatures, and the PDF maps their glyphs to U+FFFD REPLACEMENT CHARACTER (�), so when we try to import the PDF as editable text this is what we get.

PDF viewers just render the glyphs from the PDF and are not concerned with the textual representation, but if you search for any word containing fi or fl it will not be found, and if you copy such a word, e.g. significant, and paste it, you will get signi�cant.

This is a bug in the PDF creation side.
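
The difference between rendering and text extraction can be illustrated with a toy model in Python. This is a simplified sketch, not how LibreOffice or any PDF viewer is implemented: the glyph indices and both tables are invented for illustration and are not taken from the attached PDF.

```python
# Hypothetical embedded font: glyph index -> the shape that gets drawn.
# Glyph 5 is a single ligature glyph whose outline looks like "fi".
GLYPH_SHAPES = {1: "s", 2: "i", 3: "g", 4: "n", 5: "fi", 6: "c", 7: "a", 8: "t"}

# Hypothetical reverse map: glyph index -> text used for search/copy/import.
# The entry for the ligature glyph is faulty and points at U+FFFD.
TO_UNICODE = {1: "s", 2: "i", 3: "g", 4: "n", 5: "\ufffd", 6: "c", 7: "a", 8: "t"}

# The page content is a sequence of glyph indices, here the word "significant".
PAGE_GLYPHS = [1, 2, 3, 4, 2, 5, 6, 7, 4, 8]


def render(glyphs):
    """What a viewer shows: it only draws shapes from the font."""
    return "".join(GLYPH_SHAPES[g] for g in glyphs)


def extract_text(glyphs):
    """What copy, search, or an editable import sees: the reverse map."""
    return "".join(TO_UNICODE[g] for g in glyphs)


print(render(PAGE_GLYPHS))
print(extract_text(PAGE_GLYPHS))
```

In this model the drawn output is intact while the extracted text carries the replacement character, which matches the behaviour described in this comment.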
Comment 6 Jon Ten 2023-06-06 09:51:06 UTC
thanks
So are you saying that the pdf writer, e.g. Firefox, is creating mappings to ligature characters (glyphs) and that a pdf reader will simply render them, but Draw does not, as it wants to show single characters, so maps them to U+FFFD?

If this is so why not convert back from the glyph code to the 2 characters represented as presumably they are identifiable (see https://www.unicode.org/charts/PDF/UFB00.pdf)?

OR are you saying that the pdf just has U+FFFD for ligatures. If so how does the pdf reader access the glyphs? And if it can why can Draw not do this? 

best wishes
Comment 7 Jon Ten 2023-06-06 10:13:55 UTC
BTW, generating the PDF with Brave or Chrome instead shows a space (0x20) in Draw where the ligature should be. It would still be helpful to know how PDF viewers resolve this.
Comment 8 ⁨خالد حسني⁩ 2023-06-06 10:15:24 UTC
(In reply to Jon Ten from comment #6)
> thanks
> So are you saying that the pdf writer eg Firefox is creating mappings to
> ligature characters (glyphs) and that a pdf reader will simply render them
> but Draw does not, as it wants to show single characters, so maps them to
> U+FFFD?

PDF has a mapping from glyphs to characters so that text extraction (searching, copying) works. When importing a PDF as editable text we use this mapping; we can't use the glyphs. The mapping is faulty in this PDF, which is the responsibility of the PDF producer.


> If this is so why not convert back from the glyph code to the 2 characters
> represented as presumably they are identifiable (see
> https://www.unicode.org/charts/PDF/UFB00.pdf)?

There is no such thing as a glyph code; fonts contain glyphs in arbitrary order and have a mapping from Unicode code points to glyph indices.
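
As a side note on the question quoted above: if a producer had mapped the ligature glyph to one of the Unicode ligature characters (such as U+FB01), the letters could be recovered with compatibility normalization. With U+FFFD there is nothing to recover. A minimal Python sketch using only the standard library; the function name and sample strings are illustrative and are not extracted from the attached PDF.

```python
import unicodedata


def expand_ligatures(text):
    """Apply NFKC, which replaces compatibility characters such as
    U+FB01 (fi) and U+FB02 (fl) with their plain-letter equivalents.
    Note that NFKC also changes other compatibility characters."""
    return unicodedata.normalize("NFKC", text)


# A mapping to a real ligature character is recoverable.
print(expand_ligatures("signi\ufb01cant"))
print(expand_ligatures("of\ufb02ine"))

# U+FFFD carries no information about the original letters.
print(expand_ligatures("signi\ufffdcant"))
```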

> OR are you saying that the pdf just has U+FFFD for ligatures. If so how does
> the pdf reader access the glyphs? And if it can why can Draw not do this? 

PDF works with glyph indices, so to render the PDF a PDF viewer simply renders the specified glyph from the font embedded in the PDF.

PDF also provides a reverse map from glyph indices to Unicode code points, to be used for text extraction. If the mapping is faulty, there is no way to retrieve the textual content. You can try copying these words from any PDF reader and you will get the same replacement character, because this is what the PDF indicates as the textual representation of these glyphs.
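
In a PDF this reverse map is typically stored as a ToUnicode CMap attached to the font. The Python sketch below parses the `bfchar` entries of such a CMap and flags codes that map to U+FFFD. The sample fragment is hypothetical (it is not taken from the attached PDF), and the parser is deliberately minimal: it assumes uncompressed CMap text and ignores `bfrange` sections.

```python
import re

# Hypothetical ToUnicode CMap fragment: <character code> <UTF-16BE text>.
SAMPLE_CMAP = """\
3 beginbfchar
<0003> <0020>
<0049> <00660069>
<004A> <FFFD>
endbfchar
"""

BFCHAR_SECTION = re.compile(r"beginbfchar(.*?)endbfchar", re.DOTALL)
BFCHAR_ENTRY = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")


def parse_bfchar(cmap_text):
    """Return {character code: Unicode text} for all bfchar entries."""
    mapping = {}
    for section in BFCHAR_SECTION.findall(cmap_text):
        for src, dst in BFCHAR_ENTRY.findall(section):
            # Destination strings are UTF-16BE and may hold several
            # code points, e.g. <00660069> is the two letters "fi".
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping


def faulty_codes(mapping):
    """Codes whose text contains U+FFFD, i.e. nothing usable to extract."""
    return sorted(code for code, text in mapping.items() if "\ufffd" in text)


mapping = parse_bfchar(SAMPLE_CMAP)
print(mapping)
print([hex(code) for code in faulty_codes(mapping)])
```

In the sample, code 0x49 shows what a correct entry for an fi ligature could look like (one code mapped to two letters), while 0x4A shows the kind of faulty entry discussed here.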

If you want a faithful rendering of the PDF, insert it as an image. If you want faithful editing of the PDF (not importing it as text), you should try a dedicated PDF editor.

Please do not re-open. If there is still a LibreOffice issue after discussion, we will re-open the issue.
Comment 9 Jon Ten 2023-06-06 10:22:50 UTC
Thanks but it's still confusing why browsers etc would want to use glyphs when proportional fonts are fine for rendering 'fl' etc
Comment 10 ⁨خالد حسني⁩ 2023-06-06 10:36:09 UTC
(In reply to Jon Ten from comment #9)
> Thanks but it's still confusing why browsers etc would want to use glyphs
> when proportional fonts are fine for rendering 'fl' etc

Because that is how PDF works. PDF is not HTML; it is a format concerned first and foremost with presentation, and textual data handling is an afterthought.