Bug 81129 - Unicode Chars with 5-hex-digit Codes are Filtered Away in regular Paste or Paste Special (HTML)
Status: RESOLVED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer
Version (earliest affected): Inherited From OOo
Hardware: Other All
Importance: medium normal
Assignee: Mark Hung
URL:
Whiteboard: target:5.2.0
Keywords:
Duplicates: 85315 85316
Depends on:
Blocks: Font-Rendering
Reported: 2014-07-09 22:58 UTC by jburrill
Modified: 2016-10-25 19:02 UTC (History)

Description jburrill 2014-07-09 22:58:58 UTC
I’ll be submitting this problem report to both OpenOffice and LibreOffice.

  To perform these steps, you will need to have Tahoma and Lucida Sans Unicode fonts (common) and EITHER Symbola font OR both FreeSerif font and Segoe UI Symbol font.  (While the latter are not as common, they are available for free).

1. The six test lines below all begin with a Unicode character followed by text which describes the character.  Copy the six lines into a WordPad document and change the font for the lines to Tahoma.  Note that only the Unicode character in the second line (sharp sign) is displayed.

2. Next change the font of all six lines to Lucida Sans Unicode.  Note that now the first two Unicode characters (flat and sharp) are displayed.

3. Now -either- change the last four lines to Symbola -or- change the middle two lines to FreeSerif and the last two lines to Segoe UI Symbol.  Note that now all the Unicode characters are displayed correctly.

  The first three steps have demonstrated the correct behavior.  I would expect the same behavior in OpenOffice Writer and LibreOffice Writer.

4. But now repeat the same three steps in either version of “Writer.”  Note that the Unicode characters in the last four lines are never displayed correctly.

5. Now change the last four lines to OpenSymbol font.  Note that this still doesn’t help...

It seems that this happens with any 5-hex-digit Unicode character that WordPad can display using a font like Symbola, FreeSerif or Segoe UI Symbol.  (Related LibreOffice Bug 71603 appears to be just one instance of this general problem with 5-hex-digit Unicode characters.)

I haven’t tried it in MS Word, but I’d assume that if it works in WordPad, it will work in Word.
_   _   _   _   _

the six test lines:

♭  266D music flat sign
♯  266F music sharp sign
𝄋  1D10B segno
𝄌  1D10C coda
🎶  1F3B6 multiple musical notes
🎷  1F3B7 saxophone
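
For reference, the difference between the first two test characters and the last four can be checked with a short script. This is only an illustrative sketch: the helper name `describe` is made up here, and the script says nothing about how Writer handles the text internally. It shows that the 4-hex-digit characters sit in plane 0 (the Basic Multilingual Plane) while the 5-hex-digit ones sit in plane 1.

```python
# The six test characters from this report, written as escapes.
TEST_CHARS = [
    "\u266D",      # music flat sign
    "\u266F",      # music sharp sign
    "\U0001D10B",  # segno
    "\U0001D10C",  # coda
    "\U0001F3B6",  # multiple musical notes
    "\U0001F3B7",  # saxophone
]


def describe(ch):
    """Return (hex code, plane number, is-in-BMP) for one character."""
    cp = ord(ch)
    plane = cp >> 16  # plane 0 is the BMP; planes 1..16 need 5 or 6 hex digits
    return f"{cp:X}", plane, plane == 0


for ch in TEST_CHARS:
    hex_code, plane, in_bmp = describe(ch)
    print(hex_code, "plane", plane, "BMP" if in_bmp else "non-BMP")
```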

I’ll switch to using and recommending whichever “Office” is either [a] quickest in showing me how to display 5-hex-digit Unicode characters in the Writer application the way it is -or- [b] quickest in fixing the problem.

jburrill@gmail.com
Comment 1 Adolfo Jayme Barrientos 2014-07-10 00:53:52 UTC
I tested this issue under Linux (the operating system I use).

So I copied the test lines from this bug report, and the pasting mechanism threw away the special characters (except ♭ and ♯). But then I tried pasting the test lines by using the “Text Without Formatting” option (from the Paste Special dialog, Ctrl+Shift+V) and all of the special characters were pasted correctly.

Please let me know if using Paste Special > Text Without Formatting works for you under Windows. I’m adjusting this bug’s title a bit.
Comment 2 Yousuf Philips (jay) (retired) 2014-07-11 12:41:17 UTC
(In reply to comment #1)
> I tested this issue under Linux (the operating system I use).
> 
> So I copied the test lines from this bug report, and the pasting mechanism
> threw away the special characters (except ♭ and ♯). But then I tried pasting
> the test lines by using the “Text Without Formatting” option (from the Paste
> Special dialog, Ctrl+Shift+V) and all of the special characters were pasted
> correctly.
> 
> Please let me know if using Paste Special > Text Without Formatting works
> for you under Windows. I’m adjusting this bug’s title a bit.

Adolfo,

Regular paste gave the same two characters you mentioned on Linux, but Paste Special only offers 'HTML format' and 'HTML format without comments'. With 'HTML format without comments' selected, the first entry sometimes showed as a blank box and sometimes as a "b", the second entry always showed correctly, and the remaining four showed as question marks. This was on 4.2.4 and 4.3.0 on Windows 7.
Comment 3 jburrill 2014-07-11 22:38:37 UTC
Changing title again since Paste Special does not work.  Depending on the font, it can look as though the characters were dropped in the paste when they weren't.  It's just that they aren't displayed.  This might even be correct behavior, depending on the font, so it's important to mention the font you're attempting to have the characters rendered in.
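
One way to tell "dropped in the paste" apart from "present but not displayed" is to look at the code points of the pasted text instead of at the glyphs. The sketch below assumes the pasted text has been copied back out of the document into a Python string; the helper names `codepoints` and `missing_chars` are hypothetical and not part of any Office API.

```python
def codepoints(text):
    """List the code points in text as U+XXXX strings."""
    return [f"U+{ord(ch):04X}" for ch in text]


def missing_chars(original, pasted):
    """Return the characters of original that do not occur anywhere in pasted."""
    return [ch for ch in original if ch not in pasted]


original = "\u266F\U0001D10B"  # sharp sign, segno
pasted = "\u266F?"             # example of a paste that substituted a question mark

print(codepoints(pasted))
print(codepoints("".join(missing_chars(original, pasted))))
```

If the second list is empty, the characters survived the paste and any problem is on the font or rendering side; if it is not, the paste itself altered the text.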

WordPad correctly renders all four of the 5-hex-digit Unicode characters (lines 3 - 6) in Symbola font.  It also correctly renders lines 3 and 4 in FreeSerif -- and lines 5 and 6 in Segoe UI Symbol.  Both LibreOffice Writer and OpenOffice Writer should be able to do as well.  And they should be exportable to PDF.

Since submitting this, I've discovered that KingSoft Writer does render these characters correctly, but they are lost when KingSoft tries to export them to PDF.  (But KingSoft also has some other bugs related to Unicode characters that LibreOffice/OpenOffice don't have).
Comment 4 Urmas 2014-07-12 02:39:53 UTC
The symbols do appear after application restart and reloading the document.
Comment 5 Matthew Francis 2014-09-27 15:41:11 UTC
Reproduced on OSX / LO 4.3.2.2 and 4.4 master:

"Paste Special" as "Unformatted text" does the right thing, and all characters are displayed.

Regular paste and paste as HTML appear to replace all the characters outside the Unicode basic multilingual plane (i.e. >0xFFFF) with "?" (a literal question mark, not a placeholder for a non-rendered character). Given that, whatever the font is then changed to makes no difference.

-> Platform: All
-> NEW
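
The observation in comment 5 matches what happens when text is handled as single 16-bit units: a code point above 0xFFFF does not fit in one unit and has to be stored as a UTF-16 surrogate pair. The following is a minimal sketch of that arithmetic. The function `bmp_only_reader` is a hypothetical stand-in that imitates the observed "?" substitution; it is an assumption for illustration, not the actual LibreOffice import code.

```python
def to_utf16_units(cp):
    """Encode one code point as a list of UTF-16 code units."""
    if cp <= 0xFFFF:
        return [cp]
    v = cp - 0x10000                 # 20-bit value to split across two units
    high = 0xD800 + (v >> 10)        # high (lead) surrogate carries the top 10 bits
    low = 0xDC00 + (v & 0x3FF)       # low (trail) surrogate carries the bottom 10 bits
    return [high, low]


def bmp_only_reader(text):
    """Hypothetical reader that keeps BMP characters and substitutes '?' for the rest."""
    return "".join(ch if ord(ch) <= 0xFFFF else "?" for ch in text)


print([f"{u:04X}" for u in to_utf16_units(0x1D10B)])
print(bmp_only_reader("\u266F \U0001D10B \U0001F3B6"))
```

Once a literal "?" has replaced the character, the original code point is gone, which is consistent with the remark that changing the font afterwards makes no difference.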
Comment 6 Matthew Francis 2015-01-17 00:45:38 UTC
*** Bug 85315 has been marked as a duplicate of this bug. ***
Comment 7 V Stuart Foote 2015-12-13 18:07:00 UTC
On Windows 10 Pro 64-bit en-US with
Version: 5.2.0.0.alpha0+
Build ID: 917d59a84124d1022bd1912874e7a53c674784f1
CPU Threads: 8; OS Version: Windows 6.2; UI Render: GL; 
TinderBox: Win-x86@62-merge-TDF, Branch:MASTER, Time: 2015-12-12_12:17:04
Locale: en-US (en_US)

Confirming the observations of comment 5, i.e. that Edit -> Paste Special: Unformatted text handles the 5-hex-digit characters correctly, and that regular Paste or Paste Special: HTML corrupts the pasted text and loses characters.

Adjusting the font with a combination of Bravura Text and Segoe UI Symbol correctly shows all glyphs on Paste Special: Unformatted text.
Comment 8 Mark Hung 2016-01-06 23:59:00 UTC
*** Bug 85316 has been marked as a duplicate of this bug. ***
Comment 9 Commit Notification 2016-02-13 08:06:49 UTC
Mark Hung committed a patch related to this issue.
It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=4647e778993250b8c9431e2890750916fb986ecc

tdf#81129 Support reading non-BMP characters in HTML documents.
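
As a rough illustration of what "reading non-BMP characters in HTML" involves, the sketch below decodes an HTML fragment that contains one non-BMP character as a numeric character reference and another as raw UTF-8 bytes. It uses Python's standard `html.unescape` and is not the committed patch; the function name `read_html_text` and the sample fragment are assumptions made for this example.

```python
import html


def read_html_text(raw):
    """Decode UTF-8 HTML bytes and resolve character references to full code points."""
    return html.unescape(raw.decode("utf-8"))


# Segno as a numeric reference, saxophone as a literal 4-byte UTF-8 sequence.
fragment = "<p>&#x1D10B; segno, \U0001F3B7 saxophone</p>".encode("utf-8")
text = read_html_text(fragment)

for ch in text:
    if ord(ch) > 0xFFFF:
        print(f"U+{ord(ch):X}")
```

A reader that preserves both forms ends up with the same code points that Paste Special: Unformatted text already delivered in the comments above.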

It will be available in 5.2.0.

The patch should be included in the daily builds available at
http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
http://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.