Bug 96499 - FILEOPEN: HTML format .xls file shows NUMERIC cell value while TEXT type is expected (because orcus does not accept un-quoted non-ASCII characters as css property values) (see comment 17)
Summary: FILEOPEN: HTML format .xls file shows NUMERIC cell value while TEXT type is e...
Status: RESOLVED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Calc (show other bugs)
Version:
(earliest affected)
5.0.4.2 release
Hardware: All All
: medium normal
Assignee: Kevin Suo
URL:
Whiteboard: target:7.3.0 target:7.4.0
Keywords:
Depends on:
Blocks: HTML-Import orcus_bugs
  Show dependency treegraph
 
Reported: 2015-12-15 05:38 UTC by Kevin Suo
Modified: 2022-02-07 19:10 UTC (History)
5 users (show)

See Also:
Crash report or crash signature:


Attachments
test file (21.19 KB, application/vnd.ms-excel)
2015-12-15 05:38 UTC, Kevin Suo
Details
test file 2 with Chinese-char values quoted (20.85 KB, application/vnd.ms-excel)
2021-10-15 11:51 UTC, Kevin Suo
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Kevin Suo 2015-12-15 05:38:48 UTC
Created attachment 121310 [details]
test file

Attached is an .xls file in html format. The file contains two columns:Bank Account Number and ID Card Number. We expect these two fields to be "text" type. 
However, when open with LibreOffice Calc, the cells are showing float point numeric values. In Microsoft Office and WPS Office the cells are showing as "text" types as expected.

Steps to reproduce:
1. Open the attached xls file in Calc;
2. Observe the cell values. 
--> They are showing as numeric. We expect the cells to be "text" values.

Version: 5.0.4.2
Build ID: 2b9802c1994aa0b7dc6079e128979269cf95bc78
Locale: zh-CN (zh_CN)
Win10 x64

PS. This issue was initially reported by libreoffice_xf in the LibreOffice Chinese Forum:
http://www.libreofficechina.org/thread-1390-1-1.html
Comment 1 Buovjaga 2015-12-21 18:02:13 UTC
Confirmed.

Win 7 Pro 64-bit Version: 5.2.0.0.alpha0+
Build ID: 014633f83e44ae8ba33087b6f38e8e253e281969
CPU Threads: 4; OS Version: Windows 6.1; UI Render: default; 
TinderBox: Win-x86@62-merge-TDF, Branch:MASTER, Time: 2015-12-15_06:21:44
Locale: fi-FI (fi_FI)
Comment 2 QA Administrators 2017-03-06 14:21:29 UTC Comment hidden (obsolete)
Comment 3 Kevin Suo 2017-03-08 08:45:33 UTC
Bug still exists in the latest master.


---------------

MORE INFO:
With a debug run, I get the follow warning when open the attached test document in Calc:

warn:legacy.tools:19891:1:editeng/source/editeng/eehtml.cxx:54: EditHTMLParser::EditHTMLParser: Where does the encoding come from?
warn:svtools:19891:1:svtools/source/svhtml/parhtml.cxx:1427: GetOption: unknown HTML option
warn:svtools:19891:1:svtools/source/svhtml/parhtml.cxx:1427: GetOption: unknown HTML option
<same line repeated 20 times>
Comment 4 QA Administrators 2018-08-22 02:35:38 UTC Comment hidden (obsolete)
Comment 5 Kevin Suo 2019-05-15 10:46:43 UTC
(In reply to QA Administrators from comment #4)

Still reproducible in

Version: 6.3.0.0.alpha1+
Build ID:d2fa9c0d657877c967e41fdd0091f81d1b7ca048
CPU 线程:4; 操作系统:Linux 4.18; UI 渲染:默认; VCL: gtk3; 
Locale: zh-CN (zh_CN.UTF-8); UI-Language: zh-CN
Calc: threaded
Ubuntu 18.04 LTS X64
Comment 6 QA Administrators 2021-05-15 04:18:15 UTC Comment hidden (obsolete)
Comment 7 Kevin Suo 2021-05-17 01:48:19 UTC
(In reply to QA Administrators from comment #6)

The bug is still present in:
Version: 7.1.4.0.0+ / LibreOffice Community
Build ID: b2fc048cb2d5f5bd1095a8110fa4a16a305a8acc
CPU threads: 4; OS: Linux 5.11; UI render: default; VCL: gtk3
Locale: zh-CN (zh_CN.UTF-8); UI: zh-CN
Calc: threaded
Comment 8 Kevin Suo 2021-10-15 06:20:50 UTC
The (html) file contains the attribute "x:str" in its <td> tag:
   <tr height="19" style='height:14.25pt;'>
    <td class="xl67" height="19" style='height:14.25pt;' x:str>4100025601074122197</td>
    <td class="xl68" x:str>350627197809253585</td>
   </tr>
which indicates the cell should be of string value rather than a number.
Comment 9 Kevin Suo 2021-10-15 08:00:22 UTC
I have submitted a patch on gerrit for review:
https://gerrit.libreoffice.org/c/core/+/123620

Please help to test with this patch.
Comment 10 Kevin Suo 2021-10-15 08:52:03 UTC
I abandoned the patch because it was a misunderstanding. I did not found any specifications indicating the x:str meaning the cell should be in text format.

Rather, the test document has style class ".xl67" or ".x168" applied to each of the cells appearing as digits. Each of these two styles has the attribute 'mso-number-format:"\@"'. Thus LibreOffice should get this attribute to set the cell format as TEXT. However, it failed to do so because, it seems the stylesheet was never parsed! In "ScHTMLTable::DataOn", "case HtmlOptionId::CLASS:", "rStyles = mpParser->GetStyles()" returned an empty style all the time.
Comment 11 Kevin Suo 2021-10-15 09:10:32 UTC
The failing to detect the css stylesheet can be observed if you set a debug line like this:

--- a/sc/source/filter/html/htmlpars.cxx
+++ b/sc/source/filter/html/htmlpars.cxx
@@ -3105,6 +3105,7 @@ void ScHTMLQueryParser::ParseStyle(std::u16string_view rStrm)
     }
     catch (const orcus::css::parse_error&)
     {
+        SAL_WARN("sc.htmlimport", "FIXME: Failed to parse CSS stylesheet!");
         // TODO: Parsing of CSS failed.  Do nothing for now.
     }
 }
Comment 12 Kevin Suo 2021-10-15 11:49:11 UTC
I am adding Kohei Yoshida to cc: could you please take a look? 

This seems to be related to orcus. There are several issues:

1. If the stylesheet value contains Chinese characters which is not quoted, then the orcus::css_parser.parse() will raise an error. For instance, the html document in attachment 121310 [details] contains the following:

.style18
	{mso-pattern:auto none;
	background:#FFCC99;
	color:#3F3F76;
	font-size:11.0pt;
	font-weight:400;
	font-style:normal;
	text-decoration:none;
	font-family:宋体;
	mso-generic-font-family:auto;
	mso-font-charset:0;
	border:.5pt solid #7F7F7F;
	mso-style-name:"输入";}

in which the value for entry "font-family:宋体;" contains non-ascii chars and is not quoted, thus it raises an error in function css_parser<_Handler>::value() in css_parser.hpp:
css::parse_error::throw_with("value:: illegal first character of a value '", c, "'");

2. If I have all the above un-quoted Chinese char values quoted (see attachment attached in the next reply), there seems to be no errors raised during parsing, but the returned CSSHandler is empty in ScHTMLQueryParser::ParseStyle:
Comment 13 Kevin Suo 2021-10-15 11:51:36 UTC
Created attachment 175756 [details]
test file 2 with Chinese-char values quoted
Comment 14 Kevin Suo 2021-10-17 12:29:03 UTC
I submitted another patch:
https://gerrit.libreoffice.org/c/core/+/123715

With this patch, the "test file 2" (attachment 175756 [details]) is not opened correctly with the MSO-Number-Formats applied, the ID numbers are shown correctly as TEXT.

However, attachment 121310 [details] (i.e. the one which has Chinese char property-values unquoted. Anyone who has any idea how to add quotes to those chars? The css OSring looks like this:

.style0
	{mso-number-format:"General";
	text-align:general;
	...
	font-family:宋体;
	mso-font-charset:134;
        ...
	mso-style-id:0;}
.style16
	{mso-number-format:"_ \0022\00A5\0022* \#\,\#\#0_ \;_ \0022\00A5\0022* \\-\#\,\#\#0_ \;_ \0022\00A5\0022* \0022-\0022_ \;_ \@_ ";
	mso-style-name:"货币[0]";
	mso-style-id:7;}

The unquoted "font-family:宋体;" should be changed to "font-family:'宋体';" before passed to orcus::css_parser in ScHTMLQueryParser::ParseStyle
Comment 15 Kevin Suo 2021-10-19 03:43:38 UTC
A followup patch address the un_quoted property value is submitted as:
https://gerrit.libreoffice.org/c/core/+/123727/4
Comment 16 Commit Notification 2021-10-19 14:15:05 UTC
Kevin Suo committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/a0c23b40905d7b59caf46fc8887864ab35142522

tdf#96499 sc htmlimport: fix broken CSSHandler so that...

It will be available in 7.3.0.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 17 Kevin Suo 2021-10-19 14:34:09 UTC
As I stated in comment 12, there are two issues here.

1) RESOLVED: The first issue is that normal css styles are not parsed at all (due to regressions in our code base during orcus version updates). This is addressed in https://gerrit.libreoffice.org/c/core/+/123715/5 and is now merged as commit a0c23b40905d7b59caf46fc8887864ab35142522.

2) TODO: The 2nd issue is that, even with the fix in 1 above, if the css style contains un-quoted non [a-zA-Z] chars (e.g. CJK or other chars) as property names, orcus has failed to parse. I had submitted a patch in https://gerrit.libreoffice.org/c/core/+/123727/5 but I abandoned it myself because I am not sure whether it is the right approach. Anyone who is interested in this please do continue to fix this.

I mark this back to NEW then.
Comment 18 Commit Notification 2021-10-20 06:36:17 UTC
Kevin Suo committed a patch related to this issue.
It has been pushed to "libreoffice-7-2":

https://git.libreoffice.org/core/commit/a25f615a103f6ed3c2d4c35d2eacdd828b75854e

tdf#96499 sc htmlimport: fix broken CSSHandler so that...

It will be available in 7.2.3.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 19 Kevin Suo 2021-10-23 13:45:21 UTC
The todo issue as indicated in comment 17 is submitted to orcus issue tracker:
https://gitlab.com/orcus/orcus/-/issues/140
Comment 20 Kevin Suo 2021-10-28 01:20:00 UTC
(In reply to Kevin Suo from comment #19)
The todo issue is now fixed on Orcus master branch. This bug may be resolved when we upgrade the orcus used by lo to a (pending) new version.
Comment 22 Kevin Suo 2021-11-03 23:53:28 UTC
The TODO issue is now resolved in LibreOffice master via Kohei's upgrade of lo orcus version to 0.17.0 in commit eb07a0e76.

Mark as RESOLVED FIXED.
Comment 23 Commit Notification 2022-02-07 19:10:28 UTC
Xisco Fauli committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/9d02b1edafd44b75a8996a97c329fdd4967e8f54

tdf#96499: sc: Add UItest

It will be available in 7.4.0.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.