Bug 103217 - PDF export of Unicode characters does not work with hexadecimal code more than four digits
Status: RESOLVED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Printing and PDF export
Version: 4.3.0.4 release (earliest affected)
Hardware: x86-64 (AMD64) Windows (All)
Importance: medium normal
Assignee: Not Assigned
QA Contact:
URL:
Whiteboard: target:5.3.0
Keywords:
Duplicates: 103760
Depends on: HarfBuzz
Blocks:
Reported: 2016-10-14 13:12 UTC by Dirk W.
Modified: 2016-11-08 13:28 UTC (History)
CC List: 4 users

See Also:
Crash report or crash signature:


Attachments
PDF export with Writer 5.2.2.2 (74.02 KB, application/x-pdf)
2016-10-14 13:12 UTC, Dirk W.
PDF export with Word 2016 (376.25 KB, application/x-pdf)
2016-10-14 13:13 UTC, Dirk W.
“Characters in Unicode” in Writer (23.83 KB, application/vnd.oasis.opendocument.text)
2016-10-14 13:15 UTC, Dirk W.
“Characters in Unicode” in Word (21.57 KB, application/vnd.openxmlformats-officedocument.wordprocessingml.document)
2016-10-14 13:15 UTC, Dirk W.

Description Dirk W. 2016-10-14 13:12:24 UTC
Created attachment 128004
PDF export with Writer 5.2.2.2

On a positive note, hexadecimal code input with more than four digits has worked in LibreOffice since version 5.1. On a less positive note, PDF export does not work for characters whose hexadecimal code has more than four digits.

For example: if I type the Unicode character ‘Mathematical italic small e’ (U+1D452) into an LO Writer document (.odt), because I need the correct ‘e’ for Euler’s number, the correct character appears in my .odt (see “Characters in Unicode.odt”). If I then export this document to PDF, all characters with more than four digits in their hexadecimal code (e.g. the ‘Mathematical italic small e’, which has five) are shown as a square or some other odd glyph (see “Characters in Unicode - PDF export Writer 5.2.2.2.pdf”).

If I save my .odt as .docx and then open it with Word 2016, the PDF export there works fine (see “Characters in Unicode.docx” and “Characters in Unicode - PDF export Word 2016.pdf”).

I have attached the four files mentioned above. They illustrate the failure well, because I marked the affected characters.
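
For background on what "more than four hexadecimal digits" means: such code points lie above U+FFFF, outside the Basic Multilingual Plane, and need two 16-bit code units (a surrogate pair) in UTF-16. The Python sketch below only illustrates that property for U+1D452. The helper names are made up for this illustration, and nothing here is taken from LibreOffice code or claims to identify the cause of the export failure.

```python
def is_supplementary(ch):
    """True if the character's code point needs more than four hex digits."""
    return ord(ch) > 0xFFFF


def utf16_code_units(ch):
    """Return the UTF-16 code units of a single character as integers."""
    data = ch.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


italic_e = "\U0001D452"  # MATHEMATICAL ITALIC SMALL E
plain_e = "e"            # U+0065, fits in four hex digits

for ch in (plain_e, italic_e):
    units = " ".join(f"{u:04X}" for u in utf16_code_units(ch))
    print(f"U+{ord(ch):04X} supplementary={is_supplementary(ch)} UTF-16: {units}")
```

Running the same check over the text of a test document is a quick way to list which characters fall into the affected range.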
Comment 1 Dirk W. 2016-10-14 13:13:56 UTC
Created attachment 128006
PDF export with Word 2016
Comment 2 Dirk W. 2016-10-14 13:15:18 UTC
Created attachment 128007
“Characters in Unicode” in Writer
Comment 3 Dirk W. 2016-10-14 13:15:55 UTC
Created attachment 128008
“Characters in Unicode” in Word
Comment 4 Buovjaga 2016-10-31 12:55:15 UTC
I get the same result, except that Mathematical Italic Small E is already shown as a square in LibreOffice itself!

Note for testers: you need to have the Segoe UI font installed.

Win 7 Pro 64-bit Version: 5.3.0.0.alpha1+
Build ID: 4b4abb73fcd7f2802e73102b3e7c30face8d309c
CPU Threads: 4; OS Version: Windows 6.1; UI Render: default; Layout Engine: old; 
TinderBox: Win-x86@39, Branch:master, Time: 2016-10-31_02:54:50
Locale: fi-FI (fi_FI); Calc: group

4.3.0.1
Comment 5 Dirk W. 2016-11-01 19:42:33 UTC
A small correction to Buovjaga's comment:

Most of the text uses the font „Segoe UI“. But for the characters with more than four hexadecimal digits in this document, the font „Segoe UI Symbol“ is needed, because this font covers the Unicode block „Mathematical Alphanumeric Symbols“.
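
To make the block reference concrete: the Unicode block "Mathematical Alphanumeric Symbols" spans U+1D400 to U+1D7FF. The following Python sketch lists the characters of a string that fall into that range, so a tester can see which ones need a font covering the block. The function names are hypothetical helpers for this illustration only, and the sketch says nothing about which fonts actually contain the glyphs.

```python
import unicodedata

# Range of the Unicode block "Mathematical Alphanumeric Symbols".
MATH_ALNUM_FIRST = 0x1D400
MATH_ALNUM_LAST = 0x1D7FF


def in_math_alphanumeric_block(ch):
    """True if the character lies in the Mathematical Alphanumeric Symbols block."""
    return MATH_ALNUM_FIRST <= ord(ch) <= MATH_ALNUM_LAST


def describe_math_characters(text):
    """Return (code point, Unicode name) pairs for the block's characters in text."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
        if in_math_alphanumeric_block(ch)
    ]


for code_point, name in describe_math_characters("e = \U0001D452"):
    print(code_point, name)
```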
Comment 6 V Stuart Foote 2016-11-01 23:13:34 UTC
Printing to a GS based PDF generator retains the codepoints and glyphs in the PDF.
Comment 7 Aron Budea 2016-11-06 20:30:52 UTC
With the daily build below, the exported PDF looks fine with the new layout engine, but shows erroneous glyphs instead of those three characters with the old one.

Version: 5.3.0.0.alpha1+
Build ID: a6ce5d391476e4b6a2cb2d92ff45548c1d75684b
CPU Threads: 4; OS Version: Windows 6.1; UI Render: GL; Layout Engine: new; 
TinderBox: Win-x86@62-merge-TDF, Branch:MASTER, Time: 2016-11-04_00:03:22
Locale: hu-HU (hu_HU); Calc: CL
Comment 8 V Stuart Foote 2016-11-06 20:54:57 UTC
Confirming that the HarfBuzz common layout improves the export to PDF for both OpenGL and GDI+ rendering with the new layout engine, and that the old DirectWrite layout engine with the PDF export filter does not pass the SMP glyphs through to the PDF.

So, fixed with the new layout engine.

On Windows 10 Pro 64-bit (1607) en-US with
Version: 5.3.0.0.alpha1+
Build ID: 32bdc5097013e7efd9c85e1b8df697880e66e925
CPU Threads: 8; OS Version: Windows 6.2; UI Render: GL; Layout Engine: new; 
TinderBox: Win-x86@62-merge-TDF, Branch:MASTER, Time: 2016-11-04_23:30:30
Locale: en-US (en_US); Calc: CL

Closing as resolved fixed by the commits for bug 89870, and given the fact that any further work would have to be done on the "old" DirectWrite WinLayout code.

Please reopen if that work is perceived as necessary, or must be resolved for 5.2.
Comment 9 V Stuart Foote 2016-11-07 15:50:02 UTC
*** Bug 103760 has been marked as a duplicate of this bug. ***
Comment 10 Dirk W. 2016-11-07 18:56:24 UTC
I have installed LO 5.3.0.0.alpha1+ and tried the PDF export of my “Characters in Unicode.odt”.

Result: The bug persists. All characters which have more than four digits in their hexadecimal code are shown as a square or some other odd glyph.
Comment 11 Buovjaga 2016-11-07 19:15:41 UTC
(In reply to Dirk W. from comment #10)
> I have installed LO 5.3.0.0.alpha1+ and tried the PDF export of my
> “Characters in Unicode.odt”.
> 
> Result: The bug persists. All characters which have more than four digits in
> their hexadecimal code are shown as a square or some other odd glyph.

Please copy and paste here the contents of the Help - About box in your 5.3.
Comment 12 Dirk W. 2016-11-07 19:53:37 UTC
On Windows 10 Enterprise 64-bit (1607) en-US (VirtualBox) with
Version: 5.3.0.0.alpha1
Build ID: f4ca1573fcf445164c068c1046ab5d084e1b005f
CPU Threads: 2; OS Version: Windows 6.2; UI Render: default; 
Locale: en-US (en_US); Calc: group
Comment 13 V Stuart Foote 2016-11-07 19:58:11 UTC
(In reply to Dirk W. from comment #12)
> On Windows 10 Enterprise 64-bit (1607) en-US (VirtualBox) with
> Version: 5.3.0.0.alpha1
> Build ID: f4ca1573fcf445164c068c1046ab5d084e1b005f
> CPU Threads: 2; OS Version: Windows 6.2; UI Render: default; 
> Locale: en-US (en_US); Calc: group

That build does not have the new HarfBuzz-based layout enabled by default. You would need to set the environment variable "SAL_USE_COMMON_LAYOUT" to activate it.

But rather than the Alpha1 build, I suggest you install the current daily build of master from here: http://dev-builds.libreoffice.org/daily/master/

There have been a number of patches to the new common layout since Alpha1 was built, including enabling the new layout by default.

Please test with the new layout enabled, either with the Alpha1 build or with current master.
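
For testers unsure how to set that variable, here is a minimal Python sketch that starts LibreOffice with "SAL_USE_COMMON_LAYOUT" present in its environment. Several things are assumptions and not stated in this report: the value "1" (the comment above only says the variable must be set), the helper names, and the executable path shown in the usage comment, which has to be adapted to the local installation.

```python
import os
import subprocess


def common_layout_env(base_env=None, value="1"):
    """Return a copy of the environment with SAL_USE_COMMON_LAYOUT set.

    The value "1" is an assumption; the report only says the variable
    needs to be set.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["SAL_USE_COMMON_LAYOUT"] = value
    return env


def launch_with_common_layout(soffice_path, document):
    """Start LibreOffice on a document with the variable set for that process only."""
    return subprocess.Popen([soffice_path, document], env=common_layout_env())


# Example call (hypothetical install path, adjust before use):
# launch_with_common_layout(
#     r"C:\Program Files\LibreOfficeDev 5\program\soffice.exe",
#     "Characters in Unicode.odt",
# )
```

Setting the variable only for the launched process keeps the rest of the system unchanged, which makes it easy to compare the old and new layout engines side by side.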
Comment 14 Felix 2016-11-08 08:21:22 UTC
Many thanks for this additional information! I can confirm that this bug is fixed in the current master (2016-11-07).
Comment 15 Dirk W. 2016-11-08 12:40:12 UTC
Hello „V Stuart Foote“,

I did not understand what you meant by „You would need to set the variable "SAL_USE_COMMON_LAYOUT" to activate it.“, but I downloaded both current daily builds of master, „master~2016-11-08_06.11.45_LibreOfficeDev_5.3.0.0.alpha1_Win_x86.msi“ and „master~2016-11-07_13.03.37_LibreOfficeDev_5.3.0.0.alpha1_Win_x64_en-US_de_ar_ja_ru_qtz.msi“, and installed and uninstalled them.

Result: In both versions, the PDF export works.

Whether the similar bug 103468, „Hexadecimal code input with more than four digits sometimes works, sometimes not“, is also fixed, I cannot say. What I can say is that all characters which have more than four digits in their hexadecimal code are shown correctly, at the moment.
Comment 16 V Stuart Foote 2016-11-08 13:28:12 UTC
OK, then this is resolved fixed by the new HarfBuzz-based text layout for bug 89870, which is now active by default.