Bug 146317 - Multi-byte text characters on form control in xlsx are not shown due to failure detecting UTF-8
Status: VERIFIED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: filters and storage
Version (earliest affected): 6.0.0.3 release
Hardware: All All
Importance: medium normal
Assignee: Not Assigned
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-12-19 15:24 UTC by himajin100000
Modified: 2022-10-24 17:19 UTC

See Also:
Crash report or crash signature:


Attachments
the document to be used for STR (11.58 KB, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet)
2021-12-19 15:25 UTC, himajin100000
Excel screenshot (18.21 KB, image/png)
2021-12-19 15:26 UTC, himajin100000
Calc screenshot (8.72 KB, image/png)
2021-12-19 15:26 UTC, himajin100000

Description himajin100000 2021-12-19 15:24:51 UTC
Description:
See steps to reproduce.

Steps to Reproduce:
1. Open the attached file in Excel.
2. Open the attached file in LibreOffice Calc.
3. Compare the results.

Actual Results:
See the attached screenshots.
Excel shows the letter 'あ' (U+3042) on the form control, but Calc does not.

Expected Results:
Calc shows 'あ'


Reproducible: Always


User Profile Reset: No



Additional Info:
I personally applied the following patch on my local build to avoid this issue.

diff --git a/oox/source/vml/vmlinputstream.cxx b/oox/source/vml/vmlinputstream.cxx
index 93204ac50710..b41e697ab5c0 100644
--- a/oox/source/vml/vmlinputstream.cxx
+++ b/oox/source/vml/vmlinputstream.cxx
@@ -42,7 +42,7 @@ const char* lclFindCharacter( const char* pcBeg, const char* pcEnd, char cChar )

 bool lclIsWhiteSpace( char cChar )
 {
-    return cChar <= 32;
+    return 0 <= cChar && cChar <= 32;
 }

 const char* lclFindWhiteSpace( const char* pcBeg, const char* pcEnd )
@@ -268,7 +268,7 @@ constexpr OStringLiteral gaClosingCData( "]]>" );

 InputStream::InputStream( const Reference< XComponentContext >& rxContext, const Reference< XInputStream >& rxInStrm ) :
     // use single-byte ISO-8859-1 encoding which maps all byte characters to the first 256 Unicode characters
-    mxTextStrm( TextInputStream::createXTextInputStream( rxContext, rxInStrm, RTL_TEXTENCODING_ISO_8859_1 ) ),
+    mxTextStrm( TextInputStream::createXTextInputStream( rxContext, rxInStrm, RTL_TEXTENCODING_UTF8 ) ),
     maOpeningBracket{ '<' },
     maClosingBracket{ '>' },
     mnBufferPos( 0 )
@@ -378,12 +378,12 @@ void InputStream::updateBuffer()

 OString InputStream::readToElementBegin()
 {
-    return OUStringToOString( mxTextStrm->readString( maOpeningBracket, false ), RTL_TEXTENCODING_ISO_8859_1 );
+    return OUStringToOString( mxTextStrm->readString( maOpeningBracket, false ), RTL_TEXTENCODING_UTF8 );
 }

 OString InputStream::readToElementEnd()
 {
-    OString aText = OUStringToOString( mxTextStrm->readString( maClosingBracket, false ), RTL_TEXTENCODING_ISO_8859_1 );
+    OString aText = OUStringToOString( mxTextStrm->readString( maClosingBracket, false ), RTL_TEXTENCODING_UTF8 );
     OSL_ENSURE( aText.endsWith(">"), "InputStream::readToElementEnd - missing closing bracket of XML element" );
     return aText;
 }
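
To illustrate the first hunk: every byte of a UTF-8 multi-byte sequence has its high bit set, so on platforms where plain char is signed those bytes are negative and `cChar <= 32` treats them as whitespace. The following is a standalone sketch, not LibreOffice code. It uses signed char explicitly to mimic such a platform, and the names isWhiteSpaceOld, isWhiteSpaceNew, countWhiteSpace and checkExample are hypothetical.

```cpp
#include <cassert>
#include <string>

// Old predicate from the diff, applied to signed char to mimic platforms
// where plain char is signed (the signedness of char is implementation-defined).
constexpr bool isWhiteSpaceOld(signed char c) { return c <= 32; }

// Patched predicate: bytes with the high bit set (negative as signed char)
// are no longer classified as whitespace.
constexpr bool isWhiteSpaceNew(signed char c) { return 0 <= c && c <= 32; }

// Hypothetical helper: count the bytes of s that pred classifies as whitespace.
inline int countWhiteSpace(const std::string& s, bool (*pred)(signed char))
{
    int n = 0;
    for (char ch : s)
        if (pred(static_cast<signed char>(ch)))
            ++n;
    return n;
}

// Hypothetical self-check using the UTF-8 bytes of U+3042 (E3 81 82).
inline void checkExample()
{
    const std::string a = "\xE3\x81\x82";
    assert(countWhiteSpace(a, isWhiteSpaceOld) == 3); // all three bytes flagged
    assert(countWhiteSpace(a, isWhiteSpaceNew) == 0); // none flagged
    assert(countWhiteSpace(" \t", isWhiteSpaceNew) == 2); // real whitespace still detected
}
```

Under these assumptions the old check flags all three bytes of 'あ', while the patched check flags none and still recognizes ordinary ASCII whitespace.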
Comment 1 himajin100000 2021-12-19 15:25:44 UTC
Created attachment 177020
the document to be used for STR
Comment 2 himajin100000 2021-12-19 15:26:16 UTC
Created attachment 177021
Excel screenshot
Comment 3 himajin100000 2021-12-19 15:26:49 UTC
Created attachment 177022
Calc screenshot
Comment 4 Rainer Bielefeld Retired 2021-12-20 05:55:45 UTC
REPRODUCIBLE with the reporter's sample document on an installation of Version 7.2.4.1 (x64) / LibreOffice Build 27d75539669ac387bb498e35313b970b7fe9c4f9
CPU threads: 12; OS: Windows 10.0 Build 19042; UI render: Skia/Raster; VCL: win; Locale: de-DE (de_DE); UI: de-DE; Calc: threaded; Elementary Theme; my normal user profile.
The opened document does not show the character in the button.

Additional information:
a) SoftMaker PlanMaker does show the character
b) I can't tell whether there might be a duplicate.
Comment 5 Kevin Suo 2021-12-20 06:10:37 UTC
Already broken in:
Version: 6.0.0.0.alpha1+
Build ID: 6eeac3539ea4cac32d126c5e24141f262eb5a4d9
CPU threads: 8; OS: Linux 5.14; UI render: default; VCL: gtk3; 
Locale: zh-CN (zh_CN.UTF-8); Calc: group threaded
Comment 6 Andreas Heinisch 2022-02-25 13:44:58 UTC
Imho, we could add the fixes for the input streams, since the XML is saved using UTF-8 encoding (<?xml version="1.0" encoding="UTF-8" standalone="yes"?>).

So we should read it with a fixed UTF-8 encoding, as proposed in your patch.

However, if I change the text of the button from あ to ああ and save it, the XML changes (<a:t>あああああ</a:t>), but the button is missing after reopening the file.
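
For context on the encoding question: the unpatched code decodes the stream as ISO-8859-1 and converts back to bytes with the same encoding. As its source comment says, that encoding maps every byte to one of the first 256 Unicode characters, so the pairing should return the original bytes unchanged, including UTF-8 sequences. The sketch below shows this in plain C++ without the LibreOffice string classes. The helpers decodeLatin1 and encodeLatin1 are hypothetical stand-ins, not the real API.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for decoding bytes as ISO-8859-1:
// each byte becomes the UTF-16 code unit with the same numeric value.
inline std::u16string decodeLatin1(const std::string& bytes)
{
    std::u16string out;
    for (unsigned char b : bytes)
        out.push_back(static_cast<char16_t>(b));
    return out;
}

// Hypothetical stand-in for encoding back to ISO-8859-1.
// Assumption: the input only holds code units up to 0xFF, which is
// always true for text produced by decodeLatin1.
inline std::string encodeLatin1(const std::u16string& text)
{
    std::string out;
    for (char16_t u : text)
    {
        assert(u <= 0xFF);
        out.push_back(static_cast<char>(static_cast<unsigned char>(u)));
    }
    return out;
}
```

For example, `encodeLatin1(decodeLatin1("<a:t>\xE3\x81\x82</a:t>"))` yields the same 14 bytes that went in, so under this assumption the UTF-8 bytes of 'あ' survive the round trip even though the intermediate string is not meaningful text.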
Comment 7 Caolán McNamara 2022-10-24 16:38:15 UTC
The last time I encountered something like this I also assumed it was this encoding thing, but it wasn't, and the problem was fixed with https://cgit.freedesktop.org/libreoffice/core/commit/?id=b320ef30977144c52de9b39bc4db0db540727c79

So, does this problem persist after that fix?
Comment 8 Andreas Heinisch 2022-10-24 17:19:21 UTC
It opens correctly now.