Bug 113694 - UTF-16 surrogate pairs are mishandled when text is set to Thai
Summary: UTF-16 surrogate pairs are mishandled when text is set to Thai
Status: RESOLVED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: LibreOffice
Version (earliest affected): Inherited From OOo
Hardware: All All
Importance: high critical
Assignee: Khaled Hosny
URL:
Whiteboard: target:6.1.0
Keywords: haveBacktrace
Depends on:
Blocks: Font-Rendering
Reported: 2017-11-07 14:53 UTC by Hiunn-hué
Modified: 2018-05-29 01:51 UTC (History)

Attachments
demo_file (14.19 KB, application/vnd.oasis.opendocument.text)
2017-11-07 14:54 UTC, Hiunn-hué
demo_screenshot (102.78 KB, image/png)
2017-11-07 14:54 UTC, Hiunn-hué
how I see it (60.41 KB, image/png)
2017-11-13 22:01 UTC, Xisco Faulí
gdbtrace.log (23.85 KB, text/x-log)
2017-11-15 04:07 UTC, Hiunn-hué

Description Hiunn-hué 2017-11-07 14:53:56 UTC
Description:
It seems that some special characters, which are from Unicode Plane 1 and classified by LibreOffice as CTL Text, are not handled well and can sometimes even cause a crash.

Steps to Reproduce:
 1. Open the attached ODT file, OR:
    (a) Open a new Writer document.
    (b) Copy and paste the character "𐍕", "𐊙", or "𐏁" into Writer. It doesn't matter whether your system can display these characters or not.
    (c) Set the font to "Linux Biolinum O" or "Linux Libertine O" to help observation.

 2. Use arrow keys to move cursor.

 3. When the cursor moves from the left side of the character to its right side, there appears to be an extra space that was never typed. (see the attached PNG file)

 4. With the cursor at the red line, press the "Enter" key or type anything; the character then splits into two question marks. (see the attached PNG file)

 5. Move the border to make the column narrower; the character splits in this case, too.

 6. The splitting sometimes causes a crash, especially in version 6.0.0.0.alpha1+.

Actual Results:  
 1. The characters split.
 2. Writer crashes, especially in version 6.0.0.0.alpha1+.

Expected Results:
 The character should not split, and it should certainly not cause a crash.
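
For context on why one visible character can turn into two question marks: each of the sample characters lies outside the Basic Multilingual Plane, so UTF-16 stores it as a surrogate pair (two 16-bit code units). The sketch below is only an illustration in Python, not LibreOffice code; the helper names `utf16_units` and `split_pair` are made up for this example. It shows that cutting the pair in the middle leaves two halves that no longer decode to the original character.

```python
# Illustration only: helper names are hypothetical, not LibreOffice APIs.
SAMPLES = ["𐍕", "𐊙", "𐏁"]  # the characters from the steps to reproduce


def utf16_units(text: str) -> list:
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def split_pair(ch: str) -> tuple:
    """Decode each 16-bit half of a surrogate pair on its own.

    A lone surrogate is not valid UTF-16, so with errors="replace"
    each half decodes to U+FFFD (the replacement character).
    """
    data = ch.encode("utf-16-be")
    first = data[:2].decode("utf-16-be", errors="replace")
    second = data[2:].decode("utf-16-be", errors="replace")
    return first, second


for ch in SAMPLES:
    units = utf16_units(ch)
    halves = split_pair(ch)
    print(f"U+{ord(ch):04X} -> {[hex(u) for u in units]} -> halves {halves!r}")
```

Any code that treats a cursor position between the two code units as a valid place to break, insert, or delete text would produce this kind of split.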


Reproducible: Always


User Profile Reset: No



Additional Info:
 Reproducible in the following versions:
    * 3.3.0 (linux)
    * 4.0.0.1 (linux)
    * 5.4.2.2 (win/linux)
    * 6.0.0.0.alpha1+ (linux)


 Please note that these characters **must be classified by LibO as CTL text** (see the status bar) to trigger the bug.

 In version 3.3.0, the classification of "𐍕", "𐊙", and "𐏁" depends on your locale setting: if your locale is Western (like en_US), they are classified as Western Text. In version 5.4.2.2, they are classified as CTL Text regardless of your locale setting.


 Calc and Impress are also affected. Just copy and paste "𐍕", "𐊙", or "𐏁" into a text field, and then press Backspace.


User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/62.0.3202.75 Chrome/62.0.3202.75 Safari/537.36
Comment 1 Hiunn-hué 2017-11-07 14:54:31 UTC
Created attachment 137596 [details]
demo_file
Comment 2 Hiunn-hué 2017-11-07 14:54:54 UTC
Created attachment 137597 [details]
demo_screenshot
Comment 3 Xisco Faulí 2017-11-13 22:01:03 UTC
Created attachment 137729 [details]
how I see it

I've installed Linux Biolinum O but the symbols look different than in your screenshot
Comment 4 Mohamed 2017-11-14 01:55:32 UTC
The character issue is reproduced. I got the same result as Hiunn-hué when I tested on Windows, and the same result as Xisco Faulí when I tested on Ubuntu. However, no crash occurred.
 
Tested on:
    • Operating system: Windows 8.1 Pro 64-bits.
    • LibreOffice: Version: 5.4.3.2 (x64)
      Build ID: 92a7159f7e4af62137622921e809f8546db437e5
      CPU threads: 4; OS: Windows 6.29; UI render: default; 
      Locale: en-US (en_US); Calc: group

And also on:
    • Operating system: Ubuntu 16.04.3 64-bits.
    • LibreOffice: Version: 5.4.3.2
      Build ID: 92a7159f7e4af62137622921e809f8546db437e5
      CPU threads: 8; OS: Linux 4.4; UI render: default; VCL: gtk2; 
      Locale: en-US (ja_JP.UTF-8); Calc: group
Comment 5 Hiunn-hué 2017-11-14 04:09:09 UTC
Thank you for helping to test and confirm.


*** I just found that the text language must be set to Thai to trigger this bug.  Other CTL languages are safe.  ( Format > Character > CTL Font )


---


@Xisco Faulí

It's OK that your system cannot show those symbols. 

The important part is Step 2 ~ 5.

You can still try Step 2 ~ 5 with those Tofu (square with X inside).



As described in Step 3, there's an extra space after the characters.

Using the "Linux Biolinum O" font is just a way to help us see it clearly.

It's not necessary.  Sorry for causing the misunderstanding.


---


@Mohamed

I can't make it crash every time, either. Maybe you can try the 6.0.0.0+ daily build?
Comment 6 Buovjaga 2017-11-14 09:37:45 UTC
Hiunn-hué: you could try getting a backtrace of the crash: https://wiki.documentfoundation.org/QA/BugReport/Debug_Information#GNU.2FLinux:_How_to_get_a_backtrace

For this, use a daily build from https://dev-builds.libreoffice.org/daily/master/ that has -dbg at the end of its name.
Comment 7 Hiunn-hué 2017-11-15 04:07:17 UTC
Created attachment 137766 [details]
gdbtrace.log


 == Message in Terminal ==

After ./soffice --backtrace:

> ** (soffice:8179): WARNING **: Couldn't connect to accessibility bus: Failed to connect to socket /tmp/dbus-GpItp358Lf: Connection refused
> warn:vcl:8179:8179:vcl/unx/generic/fontmanager/fontmanager.cxx:702: Could not OpenTTFont "/usr/share/fonts/woff/charis/CharisSIL-B.woff"
> warn:vcl:8179:8179:vcl/unx/generic/fontmanager/fontmanager.cxx:702: Could not OpenTTFont "/usr/share/fonts/woff/charis/CharisSIL-BI.woff"
> warn:vcl:8179:8179:vcl/unx/generic/fontmanager/fontmanager.cxx:702: Could not OpenTTFont "/usr/share/fonts/woff/charis/CharisSIL-I.woff"
> warn:vcl:8179:8179:vcl/unx/generic/fontmanager/fontmanager.cxx:702: Could not OpenTTFont "/usr/share/fonts/woff/charis/CharisSIL-R.woff"
> warn:i18nlangtag:8179:8179:i18nlangtag/source/languagetag/languagetag.cxx:1618: LanguageTag::getRegionFromLangtag: pRegionT==NULL for 'en-MED'
> warn:i18nlangtag:8179:8179:i18nlangtag/source/languagetag/languagetag.cxx:1618: LanguageTag::getRegionFromLangtag: pRegionT==NULL for 'en-MED'
> warn:i18nlangtag:8179:8179:i18nlangtag/source/languagetag/languagetag.cxx:1618: LanguageTag::getRegionFromLangtag: pRegionT==NULL for 'de-med'
> warn:i18nlangtag:8179:8179:i18nlangtag/source/languagetag/languagetag.cxx:1618: LanguageTag::getRegionFromLangtag: pRegionT==NULL for 'de-med'
> warn:i18nlangtag:8179:8179:i18nlangtag/source/languagetag/languagetag.cxx:1386: LanguageTagImpl::convertLocaleToLang: with bAllowOnTheFlyID invalid 'de-med'
> warn:vcl:8179:8179:vcl/unx/generic/fontmanager/fontconfig.cxx:852: In glyph fallback throwing away the language property of en because the detected script for '0x9f3' is Bengali and that language doesn't make sense. Autodetecting instead.




After doing steps 2 ~ 5:

> warn:ucb.ucp.gio:8179:8179:ucb/source/ucp/gio/gio_content.cxx:393: ignoring GError "Operation not supported" for <>
> warn:xmloff:8179:8179:xmloff/source/core/xmlerror.cxx:169: An error or a warning has occurred during XML import/export!
> Error-Id: 0x4002000d
>     Flags: 4 SEVERE
>     Class: 2 FORMAT
>     Number: d
> Parameters:
>     0: office:blue
> Exception-Message: Root element unknown
> Position:
>     Public Identifier: 
>     System Identifier: DocumentList.xml
>     Row, Column: 2,1
> 
> soffice.bin: /tinderbox/buildslave/source/libo-master/include/rtl/ustring.hxx:669: sal_Unicode rtl::OUString::operator[](sal_Int32) const: Assertion `index >= 0 && static_cast<sal_uInt32>(index) < static_cast<sal_uInt32>(getLength())' failed.
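
The assertion in the last line of that output is the bounds check in `rtl::OUString::operator[]`: some caller asked for a code unit at an index outside the valid range of the string. The terminal output alone does not show which caller passed the bad index, so the Python sketch below is only a guess at the general shape of such a bug. `CheckedUnits`, `peek_next_unchecked`, and `peek_next_checked` are hypothetical names invented for this illustration.

```python
# Illustration only: a toy model of bounds-checked UTF-16 indexing.
# None of these names exist in LibreOffice.


class CheckedUnits:
    """Toy stand-in for a string of UTF-16 code units with checked indexing."""

    def __init__(self, text: str):
        data = text.encode("utf-16-le")
        self.units = [
            int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)
        ]

    def __len__(self) -> int:
        return len(self.units)

    def __getitem__(self, index: int) -> int:
        # Same condition as the assertion message in the log:
        # index >= 0 && index < getLength()
        assert 0 <= index < len(self.units), "index out of range"
        return self.units[index]


def peek_next_unchecked(s: CheckedUnits, pos: int) -> int:
    """Hypothetical buggy lookup: reads the unit after pos with no length check."""
    return s[pos + 1]


def peek_next_checked(s: CheckedUnits, pos: int):
    """Same lookup, but returns None when pos is the last code unit."""
    return s[pos + 1] if pos + 1 < len(s) else None


s = CheckedUnits("𐍕")  # one character, two code units
try:
    peek_next_unchecked(s, 1)  # position of the low surrogate, last unit
except AssertionError as exc:
    print("assertion failed:", exc)
print(peek_next_checked(s, 1))
```

In this toy model the unchecked lookup fails as soon as the position points at the second half of the pair at the end of the string, which matches the kind of failure reported by the assertion.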
Comment 8 Khaled Hosny 2018-05-21 11:04:11 UTC
Looks like broken handling of UTF-16 surrogate pairs when the language is set to Thai. I suspect something is broken in the Thai break iterator.
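
To illustrate the kind of problem suspected here (this is not the actual LibreOffice break iterator code), the Python sketch below compares two ways of listing cursor positions over UTF-16 code units. The function names are hypothetical. The naive version allows a position after every code unit, including between the two halves of a surrogate pair; the surrogate-aware version steps over a valid pair as one unit.

```python
# Illustration only: hypothetical helpers, not the LibreOffice implementation.


def utf16_units(text: str) -> list:
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def boundaries_naive(units: list) -> list:
    """Allow a cursor position after every single code unit."""
    return list(range(len(units) + 1))


def boundaries_surrogate_aware(units: list) -> list:
    """Allow cursor positions only between whole code points."""
    result = [0]
    i = 0
    while i < len(units):
        step = 1
        is_high = 0xD800 <= units[i] <= 0xDBFF
        if is_high and i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
            step = 2  # keep the high and low surrogate together
        i += step
        result.append(i)
    return result


# A Thai letter, a Plane 1 character, and another Thai letter.
units = utf16_units("ก𐍕ข")
print(boundaries_naive(units))
print(boundaries_surrogate_aware(units))
```

A position that appears only in the naive list falls inside the surrogate pair, which would correspond to the phantom "space" and the split described in the steps to reproduce.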
Comment 9 Commit Notification 2018-05-22 13:40:12 UTC
Khaled Hosny committed a patch related to this issue.
It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=5dc52ee00102cbf4262805d6e8f338bf0a88f470

tdf#113694 Fix BreakIterator_CTL surrogate pairs

It will be available in 6.1.0.

The patch should be included in the daily builds available at
http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
http://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 10 Xisco Faulí 2018-05-28 12:16:12 UTC
Hi Khaled Hosny,
Thanks for fixing this.
Do you think we should backport it to 6.0 ?
Comment 11 Khaled Hosny 2018-05-29 01:51:27 UTC
(In reply to Xisco Faulí from comment #10)
> Hi Khaled Hosny,
> Thanks for fixing this.
> Do you think we should backport it to 6.0 ?

I tried to backport through Gerrit, but there is a merge conflict. Unfortunately, I can’t check out the 6.0 branch locally to try to fix it.