Description:
Some character pairs, e.g. 'fl' and 'fi', are rendered in Draw as the character whose UTF-8 encoding is ef bf bd (a black diamond with a white question mark). The same document looks OK in a PDF viewer.

Steps to Reproduce:
1. View a PDF in Draw (typically one produced by printing to PDF from Firefox)
2. Note the black diamond where 'fi', 'fl' etc. should be present in a word
3.

Actual Results:
A black diamond with a white question mark

Expected Results:
fl or fi etc.

Reproducible: Always

User Profile Reset: No

Additional Info:
na
Please attach a PDF file that can be used to reproduce this issue. The font, the text, and the tool used to generate the PDF can all lead to different results.
Created attachment 187741
example document
The PDF document, when opened in Draw, has the diamond characters in the following words:
page 2: significant, benefits, financial, beneficiaries
page 3: offline, benefits, financial
page 4: finances
Confirm with Version: 7.6.0.0.alpha1+ (X86_64) / LibreOffice Community Build ID: 845054aa25b7cba1daa1ff30b142d549027299bd CPU threads: 4; OS: Linux 5.19; UI render: default; VCL: gtk3 Locale: cs-CZ (cs_CZ.UTF-8); UI: en-US Calc: threaded Version 4.1.0.0.alpha0+ (Build ID: efca6f15609322f62a35619619a6d5fe5c9bd5a)
This is how the PDF is structured: fi and fl are ligatures, and the PDF maps their glyphs to U+FFFD REPLACEMENT CHARACTER (�), so when we import the PDF as editable text this is what we get. PDF viewers just render the glyphs from the PDF and are not concerned with the textual representation, but if you search for any word containing fi or fl it will not be found, and if you copy such a word, e.g. significant, and paste it, you will get signi�cant. This is a bug on the PDF creation side.
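For reference, a minimal Python sketch (standard library only, not part of LibreOffice) confirming that the bytes ef bf bd quoted in the report are the UTF-8 encoding of U+FFFD, and showing what the pasted word then looks like. The variable names are only illustrative.

```python
import unicodedata

# The three bytes quoted in the bug description, decoded as UTF-8.
ch = bytes.fromhex("efbfbd").decode("utf-8")

# Code point and official Unicode name of the decoded character.
code_point = f"U+{ord(ch):04X}"
name = unicodedata.name(ch)
print(code_point, name)

# What the copied word looks like once the "fi" glyph is mapped to that character.
pasted = "signi" + ch + "cant"
print(pasted)
```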
Thanks. So are you saying that the PDF writer, e.g. Firefox, is creating mappings to ligature characters (glyphs), and that a PDF reader will simply render them but Draw does not, as it wants to show single characters, and so maps them to U+FFFD? If so, why not convert back from the glyph code to the 2 characters represented, as presumably they are identifiable (see https://www.unicode.org/charts/PDF/UFB00.pdf)? Or are you saying that the PDF just has U+FFFD for ligatures? If so, how does the PDF reader access the glyphs? And if it can, why can Draw not do this? Best wishes
BTW, when Brave or Chrome is used to generate the PDF, Draw shows a space (0x20) where the ligature should be. It would still be helpful to know how PDF viewers resolve this.
(In reply to Jon Ten from comment #6)
> thanks
> So are you are saying that the pdf writer eg Firefox is creating mappings to
> ligature characters (glyphs) and that a pdf reader will simply render them
> but Draw does not, as it wants to show single characters, so maps them to
> U+FFFD?

PDF has a mapping from glyphs to characters so that text extraction (searching, copying) works. When importing a PDF as editable text we use this mapping; we can’t use glyphs. The mapping is faulty in this PDF, which is the responsibility of the PDF producer.

> If this is so why not convert back from the glyph code to the 2 characters
> represented as presumably they are identifiable (see
> https://www.unicode.org/charts/PDF/UFB00.pdf)?

There is no such thing as a glyph code; fonts contain glyphs in arbitrary order and have a mapping from Unicode code points to glyph indices.

> OR are you saying that the pdf just has U+FFFD for ligatures. If so how does
> the pdf reader access the glyphs? And if it can why can Draw not do this?

PDF works with glyph indices, so to render the PDF a PDF viewer simply renders the specified glyph from the font embedded in the PDF. PDF also provides a reverse map from glyph indices to Unicode code points, to be used for text extraction. If the mapping is faulty, there is no way to retrieve the textual content.

You can try copying these words from any PDF reader and you will get the same replacement character, because this is what the PDF indicates as the textual representation of these glyphs.

If you want a faithful rendering of the PDF, insert it as an image. If you want faithful editing of the PDF (not importing it as text), you should try a dedicated PDF editor.

Please do not re-open. If there is still a LibreOffice issue after discussion, we will re-open the issue.
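To illustrate the reverse map described above, here is a toy Python model of text extraction. It is only a sketch: the glyph indices, the dictionaries and `extract_text` are made up for illustration and are not LibreOffice code or a real PDF parser. It contrasts a faulty map (ligature glyph mapped to U+FFFD, as in the attached PDF) with two maps that would keep the text recoverable: one mapping the glyph to the two letters, and one mapping it to U+FB01, which Unicode NFKC normalization decomposes to "fi".

```python
import unicodedata

# Hypothetical glyph indices for "significant" typeset with an "fi" ligature:
# s i g n i [fi] c a n t. Real fonts assign indices in arbitrary order.
GLYPHS = [10, 11, 12, 13, 11, 99, 14, 15, 13, 16]

# Reverse map (glyph index -> text) for the ordinary letters.
BASE = {10: "s", 11: "i", 12: "g", 13: "n", 14: "c", 15: "a", 16: "t"}

# Three ways a PDF producer could map the ligature glyph (index 99).
FAULTY = {**BASE, 99: "\ufffd"}       # what the attached PDF effectively does
TWO_LETTERS = {**BASE, 99: "fi"}      # glyph mapped to a two-character sequence
LIGATURE_CP = {**BASE, 99: "\ufb01"}  # glyph mapped to U+FB01 LATIN SMALL LIGATURE FI


def extract_text(glyph_ids, to_unicode):
    """Look up each glyph index in the reverse map and join the results."""
    return "".join(to_unicode.get(g, "\ufffd") for g in glyph_ids)


faulty_text = extract_text(GLYPHS, FAULTY)
good_text = extract_text(GLYPHS, TWO_LETTERS)
# NFKC replaces the ligature code point with its compatibility decomposition.
normalized_text = unicodedata.normalize("NFKC", extract_text(GLYPHS, LIGATURE_CP))

print(faulty_text)
print(good_text)
print(normalized_text)
```

With the faulty map nothing in the extracted text says which ligature was meant, which is why an importer cannot repair it afterwards; in the other two cases the letters survive extraction.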
Thanks, but it's still confusing why browsers etc. would want to use ligature glyphs when proportional fonts are fine for rendering 'fl' etc.
(In reply to Jon Ten from comment #9)
> Thanks but it's still confusing why browsers etc would want to use glyphs
> when proportional fonts are fine for rendering 'fl' etc

Because that is how PDF works. PDF is not HTML; it is a format concerned first and foremost with presentation, and textual data handling is an afterthought.