Bug 40218 - FILEOPEN: Calc confused by unclosed HTML tags
Summary: FILEOPEN: Calc confused by unclosed HTML tags
Status: NEW
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Calc
Version: 3.4.2 release (earliest affected)
Hardware: All All
Importance: medium normal
Assignee: Not Assigned
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: HTML-Import
Reported: 2011-08-19 03:11 UTC by Tristan Miller
Modified: 2019-09-09 09:55 UTC

See Also:
Crash report or crash signature:


Attachments
Sample HTML file as described in the first comment (347 bytes, text/html)
2012-01-22 06:43 UTC, famo
Proposed unit test for this bug (2.49 KB, patch)
2013-10-10 16:46 UTC, Thomas Arnhold

Description Tristan Miller 2011-08-19 03:11:22 UTC
The Calc HTML importer is completely confused by unclosed HTML tags.  If you try
to open an HTML file in Calc which contains unclosed HTML tags, it will import
only up until the unclosed tag.  The Writer HTML importer is much more
resilient, and will gracefully ignore unclosed tags.

Reproducibility: Always

Steps to reproduce:
1. Create an HTML file containing a table with unclosed tags.  Example:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
  <head>
    <title>spreadsheet</title>
  </head>
  <body>
    <table>
      <tr><td>a1</td><td>b1</td><td>c1</td></tr>
      <tr><td>a2</td><td><a href="foo">b2</td><td>c2</td></tr>
      <tr><td>a3</td><td>b3</td><td>c3</td></tr>
    </table>
  </body>
</html>

2. Open the file in Calc.  (If the file has an .htm or .html extension, you will
need to set the filter to "HTML Document (OpenOffice.org Calc) (*.html;*.htm)"
in the file selection dialog, or else OpenOffice.org will try to open it with
Writer.)

Observed behaviour:
3. Calc renders the spreadsheet as follows:
spreadsheet
a1 b1 c1
a2 

Expected behaviour:
3. Calc should have rendered the spreadsheet as follows:
spreadsheet
a1 b1 c1
a2 b2 c2
a3 b3 c3

This bug also affects OpenOffice.org; see Bug 115301 there: https://openoffice.org/bugzilla/show_bug.cgi?id=115301
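
For comparison, here is a minimal sketch of a tolerant reading of the sample above. It uses only Python's standard html.parser module and is not LibreOffice code; the names SAMPLE, TableGrid, and extract_grid are made up for this illustration. It collects cell text per row and ignores every tag other than tr, td, and th, so the unclosed anchor has no effect on the grid:

```python
from html.parser import HTMLParser

# The sample file from the steps to reproduce, unchanged.
SAMPLE = """<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
  <head>
    <title>spreadsheet</title>
  </head>
  <body>
    <table>
      <tr><td>a1</td><td>b1</td><td>c1</td></tr>
      <tr><td>a2</td><td><a href="foo">b2</td><td>c2</td></tr>
      <tr><td>a3</td><td>b3</td><td>c3</td></tr>
    </table>
  </body>
</html>
"""


class TableGrid(HTMLParser):
    """Hypothetical helper: collect cell text row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []        # one list of cell strings per <tr>
        self.in_cell = False  # True between <td>/<th> and its end tag

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            self.rows[-1].append("")
            self.in_cell = True
        # Any other tag, including <a>, is ignored on purpose.

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.rows[-1][-1] += data.strip()


def extract_grid(html):
    parser = TableGrid()
    parser.feed(html)
    parser.close()
    return parser.rows


for row in extract_grid(SAMPLE):
    print(" ".join(row))
```

The three printed rows match the table under "Expected behaviour" (the title line is not part of the grid in this sketch).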
Comment 1 Björn Michaelsen 2011-12-23 12:35:42 UTC Comment hidden (obsolete)
Comment 2 famo 2012-01-22 06:42:43 UTC
I checked this in LO 3.5 beta: Calc still does not import the whole table, only up to a2. Importing with Writer/Web works fine.

Setting status to NEW
Comment 3 famo 2012-01-22 06:43:54 UTC
Created attachment 55960
Sample HTML file as described in the first comment
Comment 4 Tristan Miller 2012-08-15 19:17:58 UTC
Confirming problem still exists with LibreOffice 3.6.0.4.
Comment 5 Thomas Arnhold 2013-10-10 16:43:21 UTC
The HTML importer is confused only by this unclosed anchor tag. I've tried other tags such as <div>, <span>, and <font>, and the import works fine. <a name="foo"> also works. The problem occurs only with <a href="eu">.

One solution would be to manually end the open anchor when the next </td> is found, but that's a bit of spaghetti code:

--- a/editeng/source/editeng/eehtml.cxx
+++ b/editeng/source/editeng/eehtml.cxx
@@ -319,6 +319,7 @@ void EditHTMLParser::NextToken( int nToken )
     case HTML_TABLEHEADER_OFF:
     case HTML_TABLEDATA_OFF:
     {
+        AnchorEnd();
         if ( nInCell )
             nInCell--;
     }
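
To illustrate the idea behind the patch outside the LibreOffice code base, here is a rough Python sketch using the standard html.parser module. It does not reproduce Calc's truncation and does not call any LibreOffice API. CellLinks, its end_anchor_at_cell_end flag, and ROW are hypothetical names for this example. The sketch only shows the state handling: if an anchor is still open when the cell ends, it is ended there, so the link does not leak into the following cells.

```python
from html.parser import HTMLParser


class CellLinks(HTMLParser):
    """Hypothetical model: record (text, href) for each table cell."""

    def __init__(self, end_anchor_at_cell_end=True):
        super().__init__()
        self.end_anchor_at_cell_end = end_anchor_at_cell_end
        self.cells = []        # finished cells as (text, href or None)
        self.current = None    # [text, href] of the cell being read
        self.open_href = None  # href of the anchor that is currently open

    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            self.current = ["", None]
        elif tag == "a":
            self.open_href = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if tag == "a":
            self.open_href = None
        elif tag in ("td", "th") and self.current is not None:
            if self.end_anchor_at_cell_end:
                # Same idea as calling AnchorEnd() before leaving the cell.
                self.open_href = None
            self.cells.append(tuple(self.current))
            self.current = None

    def handle_data(self, data):
        text = data.strip()
        if self.current is not None and text:
            self.current[0] += text
            if self.open_href is not None:
                self.current[1] = self.open_href


def cell_links(html, end_anchor_at_cell_end=True):
    parser = CellLinks(end_anchor_at_cell_end)
    parser.feed(html)
    parser.close()
    return parser.cells


# Second row of the sample file, with the unclosed anchor.
ROW = '<tr><td>a2</td><td><a href="foo">b2</td><td>c2</td></tr>'

print(cell_links(ROW))                                # anchor ended at </td>
print(cell_links(ROW, end_anchor_at_cell_end=False))  # anchor leaks into c2
```

With the flag on, only b2 carries the link; with it off, c2 inherits href "foo" because nothing ever closes the anchor.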


A far better solution for all malformed HTML documents would be to clean them up in a first step. This could be done as described at http://www.mostthingsweb.com/2013/02/parsing-html-with-c/

Do we want to include tidy in our project? In my opinion this could be a huge benefit.
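
As a rough picture of what such a cleanup pass does, here is a small Python sketch built on the standard html.parser module. It is not tidy and not a replacement for it; Rebalancer, rebalance, and VOID are names invented for this example. It keeps a stack of open elements and, when an end tag arrives, first closes everything that was opened after the matching start tag. It deliberately ignores HTML's optional-end-tag rules (for example for <p> or <li>), which a real cleanup tool would handle.

```python
from html.parser import HTMLParser

# Elements that never take an end tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class Rebalancer(HTMLParser):
    """Hypothetical cleanup pass: re-emit markup with balanced tags."""

    def __init__(self):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []    # pieces of the cleaned-up document
        self.stack = []  # names of elements that are currently open

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())
        if tag not in VOID:
            self.stack.append(tag)

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray end tag with no matching start tag: drop it
        while self.stack:
            open_tag = self.stack.pop()
            self.out.append(f"</{open_tag}>")
            if open_tag == tag:
                break

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def close(self):
        super().close()
        while self.stack:  # close whatever is still open at end of input
            self.out.append(f"</{self.stack.pop()}>")


def rebalance(html):
    parser = Rebalancer()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(rebalance('<td><a href="foo">b2</td><td>c2</td>'))
```

Applied to the problem cell, the pass inserts the missing </a> just before </td>, which is the shape of input the importer already handles.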
Comment 6 Thomas Arnhold 2013-10-10 16:46:05 UTC
Created attachment 87404
Proposed unit test for this bug
Comment 7 QA Administrators 2015-04-01 14:41:09 UTC Comment hidden (obsolete)
Comment 8 Tristan Miller 2015-04-01 15:21:43 UTC
Confirming bug still exists as originally described in LibreOffice 4.4.1.2.
Comment 9 tommy27 2016-04-16 07:23:51 UTC Comment hidden (obsolete)
Comment 10 Tristan Miller 2016-04-22 10:37:06 UTC
Confirming bug still exists as originally described in LibreOffice 5.1.2.2.0+ (on openSUSE 13.2).
Comment 11 QA Administrators 2017-05-22 13:24:11 UTC Comment hidden (obsolete)
Comment 12 Tristan Miller 2017-05-28 20:16:57 UTC
Confirming bug still exists as originally described in LibreOffice 5.3.2.2 (on openSUSE Tumbleweed).
Comment 13 QA Administrators 2018-08-22 02:38:21 UTC Comment hidden (obsolete)
Comment 14 Tristan Miller 2018-08-22 08:15:18 UTC
Confirming bug still exists as originally described in LibreOffice 6.1.0.3 (on openSUSE Tumbleweed).
Comment 15 QA Administrators 2019-09-02 09:20:27 UTC Comment hidden (obsolete)
Comment 16 Tristan Miller 2019-09-09 09:55:39 UTC
Confirming bug still exists as originally described in LibreOffice 6.3.1.1 (on openSUSE Tumbleweed).