Bug 35756 - Import large HTML table, data gets truncated
Status: RESOLVED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Calc
Version (earliest affected): 3.3.2 release
Hardware: All (OS: All)
Importance: medium normal
Assignee: Eike Rathke
QA Contact:
URL:
Whiteboard: target:4.1.0 target:4.0.5
Keywords:
Duplicates: 60354 64168 64572
Depends on:
Blocks:
 
Reported: 2011-03-28 18:14 UTC by Marco
Modified: 2013-06-28 12:05 UTC (History)
CC List: 6 users

See Also:
Crash report or crash signature:


Attachments
20000 rows as HTML table (239.16 KB, application/x-compressed)
2011-03-28 18:18 UTC, Marco
Description Marco 2011-03-28 18:14:21 UTC
By importing a document (a HTML file named as .XLS or .HTML) with lots (more
than 15000) of rows, OpenOffice Calc truncates data without showing any error
or warning.

The issue can be reproduced by importing attached file into calc. On my machine
it imports just 13635 of the 20000 rows.
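
For anyone who wants to reproduce this without the attachment, a comparable test file can be generated with a short script. This is only a minimal sketch: the function names, the default of five columns, and the cell contents (each cell holds its own address, such as "A1") are assumptions for illustration and are not taken from the attached file.

```python
def build_html_table(rows: int = 20000, columns: int = 5) -> str:
    """Return an HTML document holding one table with rows x columns cells.

    Assumption: each cell contains its own spreadsheet-style address
    (A1, B1, ...), which makes it easy to see where an import stops.
    Only up to 26 columns (A-Z) are supported by this sketch.
    """
    if not 1 <= columns <= 26:
        raise ValueError("this sketch supports 1 to 26 columns")
    lines = ["<html>", "<body>", "<table>"]
    for row in range(1, rows + 1):
        cells = "".join(
            f"<td>{chr(ord('A') + col)}{row}</td>" for col in range(columns)
        )
        lines.append(f"<tr>{cells}</tr>")
    lines += ["</table>", "</body>", "</html>"]
    return "\n".join(lines)


def write_test_file(path: str, rows: int = 20000, columns: int = 5) -> None:
    """Write the generated table to the given path (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(build_html_table(rows, columns))
```

Calling write_test_file("large_table.html") and opening the result in Calc should show whether the last imported row matches the last generated row.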
Comment 1 Marco 2011-03-28 18:18:17 UTC
Created attachment 44977
20000 rows as HTML table
Comment 2 Rainer Bielefeld Retired 2011-03-28 22:41:03 UTC
The effect is reproducible with reporter's sample document and "LibreOffice 3.3.2  – WIN7  Home Premium  (64bit) German UI [OOO330m19 (Build:202 / tag 3.3.2.2)]"

I have seen a lot of documents with the name extension .xls that have nothing to do with an EXCEL spreadsheet; the user or the application only used that name because the file holds some kind of table contents.

To be honest, I do not know much about EXCEL HTML documents, except that they are a mess to work with. Imho that's an EXCEL problem, EXCEL should create documents with correct syntax.

Reporter's sample is not correct HTML, although the source text pretends to be HTML. At least the HTML type information is missing.
It's also not an EXCEL type spreadsheet.

MS EXCEL viewer will not open that document.

Some other observations:
OOo 3.1.1 (from an open WRITER document) will by default open the document as a WRITER-HTML document in Writer, with a correct table view until "A12800"; then the table view stops and the strings from the table are shown as one endless plain-text line.
I can force OOo to open the document as html-calc; then it opens the document as a spreadsheet, and "E13105" is the last content shown correctly, after which the table formatting breaks.

Exactly the same with OOo-dev 3.4

My result:
My aversion to such documents has nothing to do with the reported problem: LibO should either reject the document or open it correctly (maybe with a warning message). Low priority; important data should be exported to a document with correct syntax, and that is a problem of the application creating such documents.

@Marco:
You get such documents from what application?
Comment 3 Rainer Bielefeld Retired 2011-03-28 22:54:31 UTC
Although the "html" code is completely different, I see something similar to the reported problem with the attachment of OOo bug
 Bug 111579 -  Opening large html excel document from SAS  
<http://openoffice.org/bugzilla/show_bug.cgi?id=111579>
Opening that document with LibO CALC (from WIN Explorer), the last correctly shown cell 'F6712' has the contents "PXXX09.001.AAAA.BBBB 1728". The next cell is broken and no further contents are shown; the table ends with the date 15/09/2009.

Renaming the document to .html and opening it with SeaMonkey shows that there is much more content after "15/09/2009".
Comment 4 Marco 2011-03-28 23:27:28 UTC
(In reply to comment #3)
> Although the "html" code is completely different, I see something similar to
> the reported problem with the attachment of OOo bug
>  Bug 111579 -  Opening large html excel document from SAS  
> <http://openoffice.org/bugzilla/show_bug.cgi?id=111579>
> Opening that document with LibO CALC (from WIN Explorer) the last correctly
> shown cell 'F6712' will have contents "PXXX09.001.AAAA.BBBB 1728". Next cell
> will be broken, no further contents will be shown, Table ends with date
> 15/09/2009
> 
> Renaming document to .html and opening with Seamonky shows: there is much
> ocntents behind "15/09/2009"

Yes I agree, it seems to be same issue.
Comment 5 Marco 2011-03-28 23:39:44 UTC
(In reply to comment #3)
> Although the "html" code is completely different, I see something similar to
> the reported problem with the attachment of OOo bug
>  Bug 111579 -  Opening large html excel document from SAS  
> <http://openoffice.org/bugzilla/show_bug.cgi?id=111579>
> Opening that document with LibO CALC (from WIN Explorer) the last correctly
> shown cell 'F6712' will have contents "PXXX09.001.AAAA.BBBB 1728". Next cell
> will be broken, no further contents will be shown, Table ends with date
> 15/09/2009
> 
> Renaming document to .html and opening with Seamonky shows: there is much
> ocntents behind "15/09/2009"

The .XLS extension is used for users' convenience, as those extensions are
associated with LibreOffice or MS Excel by default.

MS Excel 2010 imports the example file without a problem. It
just shows a warning that it's not an Excel file.

Such files are generated by applications which cannot create native .XLS (or
.XLSX). The example file is one I created manually to demonstrate the
issue.


However, the main issue I see here is that LibreOffice cannot import huge HTML
tables. It should either import all of the data or show a warning message.
Comment 6 Ctibor 2011-08-22 00:14:10 UTC
I can confirm this bug too in LibreOffice 3.4.2. It happens for me on slightly smaller tables with around 3000 rows. The interesting thing is that the borders of the table are rendered down to the last row, but the data is truncated randomly in each file, somewhere in the middle.
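
To check whether the truncation point is really random or tied to the amount of data, it may help to count the rows and cells in the source file and compare them with what Calc shows after import. The following is a rough sketch using Python's standard html.parser module; the class and function names are made up for illustration, and it counts every tr, td and th start tag in the file without distinguishing nested or multiple tables.

```python
from html.parser import HTMLParser


class TableCounter(HTMLParser):
    """Count table rows and cells by their start tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = 0
        self.cells = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case.
        if tag == "tr":
            self.rows += 1
        elif tag in ("td", "th"):
            self.cells += 1


def count_table(html_text):
    """Return (rows, cells) found in the given HTML text."""
    counter = TableCounter()
    counter.feed(html_text)
    counter.close()
    return counter.rows, counter.cells
```

If the cell count at the point where the import stops is similar across files, the cut-off is more likely a fixed limit than a random one.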
Comment 7 Björn Michaelsen 2011-12-23 11:45:48 UTC
[This is an automated message.]
This bug was filed before the changes to Bugzilla on 2011-10-16. Thus it
started right out as NEW without ever being explicitly confirmed. The bug is
changed to state NEEDINFO for this reason. To move this bug from NEEDINFO back
to NEW please check if the bug still persists with the 3.5.0 beta1 or beta2 prereleases.
Details on how to test the 3.5.0 beta1 can be found at:
http://wiki.documentfoundation.org/QA/BugHunting_Session_3.5.0.-1

more detail on this bulk operation: http://nabble.documentfoundation.org/RFC-Operation-Spamzilla-tp3607474p3607474.html
Comment 8 Marco 2011-12-23 23:04:19 UTC
The issue is still open and reproducible with "3.5.0 beta2".
Comment 9 Owen Genat (retired) 2013-04-02 10:47:13 UTC
Issue is still reproducible under v3.5.7.2 (Ubuntu v10.04 x86_64) and v4.0.1.2 (Win7).
Comment 10 Eike Rathke 2013-05-08 19:49:54 UTC
Working on this. The limit is around 64k data cells, imposed by some underlying structures used during import.
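
As a rough illustration of what a limit of about 64k cells means in rows, the arithmetic can be sketched as below. The exact value 65536 and the idea that every table cell counts once against the limit are assumptions made for this sketch; the comment above only says "around 64k" and does not name the structures involved.

```python
# Assumption: "64k" is read as 2**16; the real limit may differ slightly.
ASSUMED_CELL_LIMIT = 65536


def full_rows_before_limit(columns, cell_limit=ASSUMED_CELL_LIMIT):
    """Return how many complete rows fit below the assumed cell limit."""
    if columns < 1:
        raise ValueError("columns must be at least 1")
    return cell_limit // columns


def exceeds_limit(rows, columns, cell_limit=ASSUMED_CELL_LIMIT):
    """Return True if a rows x columns table has more cells than the limit."""
    return rows * columns > cell_limit


if __name__ == "__main__":
    for columns in (1, 5, 22):
        print(columns, "columns:", full_rows_before_limit(columns), "full rows")
```

Under these assumptions a five-column table would be cut off after roughly 13107 rows, which is in the same range as the "E13105" observation in comment 2 (assuming that table has five columns, A to E), and a table of 3000 rows would cross the limit once it has 22 or more columns.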
Comment 11 Commit Notification 2013-05-10 14:06:31 UTC
Eike Rathke committed a patch related to this issue.
It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=2af1f5691e8d64afd5246d245d7876b5a2cd5cd8

resolved fdo#35756 import more than 64k HTML table cells



The patch should be included in the daily builds available at
http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
http://wiki.documentfoundation.org/Testing_Daily_Builds
Affected users are encouraged to test the fix and report feedback.
Comment 12 kuko 2013-05-28 18:19:24 UTC
*** Bug 64168 has been marked as a duplicate of this bug. ***
Comment 13 ign_christian 2013-05-31 09:10:56 UTC
*** Bug 64572 has been marked as a duplicate of this bug. ***
Comment 14 ign_christian 2013-05-31 09:12:51 UTC
*** Bug 60354 has been marked as a duplicate of this bug. ***
Comment 15 Eike Rathke 2013-06-19 17:31:15 UTC
Backport pending review for 4-0 as https://gerrit.libreoffice.org/4368
Comment 16 Commit Notification 2013-06-28 12:05:26 UTC
Eike Rathke committed a patch related to this issue.
It has been pushed to "libreoffice-4-0":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=da11528150df545a31df3c9863bd4c3925ccdf21&h=libreoffice-4-0

resolved fdo#35756 import more than 64k HTML table cells


It will be available in LibreOffice 4.0.5.
