Bug 131213 - Pasting non-printable Unicode character into LibreOffice makes it unusable
Status: RESOLVED NOTABUG
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: LibreOffice
Version: 6.4.1.2 release (earliest affected)
Hardware: x86-64 (AMD64) macOS (All)
Importance: medium normal
Assignee: Not Assigned
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-03-07 23:15 UTC by github.ds
Modified: 2020-03-08 16:21 UTC

See Also:
Crash report or crash signature:


Description github.ds 2020-03-07 23:15:58 UTC
Description:
I got a PDF document that looks fine visually, but copying text out of it puts unprintable Unicode characters on the clipboard, for example U+10FC1F, U+10FC10 or U+10FC01. Pasting such a character into LibreOffice makes the application lock up for several seconds while the CPU goes up to 100%. This happens again on every interaction with the text view, so LibreOffice becomes unusable.

The larger the amount of text copied over, the longer LibreOffice locks up and doesn't react.

Here's a string, you should be able to test this with: 􏰟􏰐􏰁􏰄􏰌􏰂􏰅􏰐􏰂
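
To confirm which code points a pasted string actually contains, a short script like the following may help. This is only a sketch: the sample uses the three code points named in the description, written as escapes so it does not depend on how the characters survive copy and paste, and the helper name `code_points` is made up for illustration.

```python
# Sample built from the three code points named in the description.
sample = "\U0010FC1F\U0010FC10\U0010FC01"


def code_points(text):
    """Return a U+XXXX label for every character in text."""
    return [f"U+{ord(ch):04X}" for ch in text]


if __name__ == "__main__":
    # Prints one label per character so unexpected code points stand out.
    for label in code_points(sample):
        print(label)
```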

I could reproduce this with LibreOffice Writer, Calc and Impress. I assume this is due to a bug in some shared component.

Steps to Reproduce:
1. Copy this string into any text field of Writer, Calc or Impress: 􏰟􏰐􏰁􏰄􏰌􏰂􏰅􏰐􏰂

Actual Results:
The GUI locks up for several seconds and the CPU goes up to 100%. This happens on every interaction with the pasted string.

Expected Results:
The GUI does not lock up.


Reproducible: Always


User Profile Reset: Yes



Additional Info:
Version: 6.4.1.2
Build ID: 4d224e95b98b138af42a64d84056446d09082932
CPU threads: 4; OS: Mac OS X 10.14.6; UI render: default; VCL: osx; 
Locale: de-DE (de_DE.UTF-8); UI-Language: en-US
Calc: threaded
Comment 1 himajin100000 2020-03-08 01:21:19 UTC
What font should I use for those characters?
This often happens for me when the specified font does not have a glyph for the characters.
Comment 2 V Stuart Foote 2020-03-08 04:44:24 UTC
Unicode in that range is all Private Use Area (PUA)

Two usage factors:

1. for working with PUA defined glyphs, you must make a font assignment on the document canvas. Allowing default paragraph font, or a fall back font will likely display junk.

2. If the font is not installed to system LO will search for a fallback--checking every font on system for coverage of that PUA block--and still may have the wrong glyphs rendered. Avoid that by always making a font assignment--installing the font if necessary.

The PDF generator for some software will take valid fonts and cast them into PUA areas to obscure the text strings--looks fine in the PDF but no means to work with the PDF except with the source software.

Seems that is your situation: your PDF source program is working with PUA. You'll have to obtain the font/PUA mappings and either create a new font (e.g. with Fontforge) or extract the font from the PDF; there are numerous utilities to do that. But be mindful of licensing of the resulting font.
Comment 3 github.ds 2020-03-08 09:19:36 UTC
@himajin100000@gmail.com: I don't have to set a font to produce this bug. It is enough to open a new document in Writer and copy the string straight from the browser into the document. The font is set to Liberation Serif then.

@V Stuart Foote: I did not create the PDF. It was sent to me as is. Maybe the institution that created it is not familiar with the intricacies of creating a PDF, but honestly as an end user I expect LibreOffice to be robust enough to not make me force quit it when I copy text over into my document.
Comment 4 V Stuart Foote 2020-03-08 15:10:30 UTC
(In reply to github.ds from comment #3)
>... as an end user I expect LibreOffice to be
> robust enough to not make me force quit it when I copy text over into my
> document.

By the Unicode standard [1], use of the PUA (code points in the ranges U+E000..U+F8FF, U+F0000..U+FFFFD, and U+100000..U+10FFFD) is left to the conformant process. By the standard for handling PUA, "the abstract characters associated with them have no interpretation specified by this standard. They can be given any interpretation by conformant processes."
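
As an illustration of those three ranges, here is a minimal Python sketch that flags private-use characters in a string. It is not how LibreOffice handles PUA; the names `PUA_RANGES`, `is_private_use` and `private_use_chars` are hypothetical, and the sample reuses code points mentioned in the description.

```python
import unicodedata

# The three Private Use Area ranges quoted above (inclusive bounds).
PUA_RANGES = (
    (0xE000, 0xF8FF),
    (0xF0000, 0xFFFFD),
    (0x100000, 0x10FFFD),
)


def is_private_use(ch):
    """True if the single character ch lies in one of the PUA ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in PUA_RANGES)


def private_use_chars(text):
    """Return U+XXXX labels for every private-use character in text."""
    return [f"U+{ord(ch):04X}" for ch in text if is_private_use(ch)]


if __name__ == "__main__":
    pasted = "abc\U0010FC1F\U0010FC10"
    print(private_use_chars(pasted))
    # Cross-check: the Unicode general category of private-use characters is "Co".
    print(unicodedata.category("\U0010FC1F"))
```

A check like this only identifies the code points; what they are meant to look like still depends on the font that defined them.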

LibreOffice is Unicode "conformant", meaning it will do correct things with code point assignments for the 1.1 million defined Unicode values. If a known (to system) font with glyph coverage of the PUA glyphs is specified, a paste of the PUA will succeed, resulting in rendering the glyphs, and likewise they will be subset when filter exported or printed.

If no font is specified for the receiving paragraph, or a paste 'unformatted' is performed, paragraph default font will be assumed. If default font has no coverage of the PUA code points being copied/pasted the result will be either a "place holder" glyph from the font assigned or in some cases a font fallback search for a font defining a glyph for the PUA (controlled by OS).

LibreOffice is plenty "robust", but the user can not do unreasonable things. Expecting PUA mapping for pasting from unknown fonts is unreasonable.
 
=-ref-=
[1] https://www.unicode.org/versions/Unicode12.0.0/ch03.pdf#G43463
Comment 5 github.ds 2020-03-08 16:21:42 UTC
I appreciate the time you took to explain this to me and I understand your reasoning. I must object, however, that it is based on a false premise. I did not expect „PUA mapping for pasting from unknown fonts“. I clearly stated that I’d expect the GUI to „not lock up“, even to the point where I have to force quit LibreOffice. This is not the same.

I encourage you to look at it from an end user perspective. Some institution sends me a PDF. I copy the text. I paste it in LibreOffice. LibreOffice hangs and must be force quit.

If I follow you correctly, you argue that I – as an end user – should not only debug this problem to the point where I can identify the glyphs I actually pasted (though they look fine in the PDF), but also be knowledgeable enough about the Unicode standard and its relation to font and rendering implementations to conclude that I am doing something „unreasonable“ and should not expect LibreOffice to stay responsive before my eyes. I find that bold.

Please don’t misunderstand me. I do not feel entitled to demand that this is fixed. Fortunately I have enough technical understanding to work around such issues. I report them with the understanding that LibreOffice is not only aimed at a technically versed audience, but also at laymen, who have never heard of Unicode or PUA. For them, I think, this is just plain bad UX.