Bug 48692 - TABLES: Writer corrupts large tables
Summary: TABLES: Writer corrupts large tables
Status: VERIFIED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer
Version: 3.5.2 release (earliest affected)
Hardware: x86 (IA32) All
Importance: high critical
Assignee: Michael Stahl
QA Contact: Joel Madero
URL:
Whiteboard: target:3.7.0
Keywords:
Depends on:
Blocks:
 
Reported: 2012-04-14 10:13 UTC by mike.sykes
Modified: 2012-09-28 07:17 UTC

See Also:


Attachments
Tab-delimited text file used for experiments (490.87 KB, text/plain)
2012-04-19 05:01 UTC, mike.sykes
The 6000 line txt file after converting to table and back to text (20.98 KB, text/plain)
2012-04-19 05:05 UTC, mike.sykes
ODT file with corrupted table created by my test in comment #11 (16.29 KB, application/vnd.oasis.opendocument.text)
2012-04-19 08:08 UTC, Roman Eisele

Description mike.sykes 2012-04-14 10:13:53 UTC
Double-clicked a Word .doc file 4,550,144 bytes in size. LibreOffice Writer started and after some seconds (creating a temporary copy?) the progress bar started to show "Importing document ..." and very slow progress, viz one mark every 10 seconds or so. 

CPU usage was 90+% (1 GHz AMD Athlon). I continued with another task, whose window covered the Writer window. When I returned to see if Writer had finished, the window was blank, with no status bar, and showing only the hourglass pointer. Windows said it was "not responding".

The only solution was to end the task, after which it was necessary to delete the .~lock. file.

The bug is reproducible.

Microsoft Word for Windows 97, which created the file, opens it almost instantaneously.
Comment 1 David Tardon 2012-04-14 23:53:06 UTC
Could you attach the .doc file here?
Comment 2 mike.sykes 2012-04-15 10:11:43 UTC
I'd be happy to attach the file, but its content would need to be 
anonymized/redacted.

Perhaps later. The problem isn't urgent.

James Sykes

(I might add that I'm new to bugzilla, and getting familiar is turning 
out to be a struggle)
Comment 3 Roman Eisele 2012-04-17 13:13:12 UTC
It is hard to help until we know more about the DOC file in question. For now, to make sure that not every big DOC file causes Writer to fail, I downloaded the 'RTF 1.9.1 Specification', a 12.5 MB DOC file, from

http://www.microsoft.com/download/en/details.aspx?displaylang=en&id=10725

and tried to open it with LibreOffice 3.5.2.2 (Build-ID: 281b639-6baa1d3-ef66a77-d866f25-f36d45f) German running on MacOS X 10.6.8 German. This took some seconds and nearly 100% CPU usage, but finally the file was open and I can read and edit it without problems ... Switching to another application while the file is being opened does not change the situation.

@Mike:
could you try to download and open the 'RTF 1.9.1 Specification' as mentioned above? If it opens without problems for you, too, we at least know for sure that the problem depends on the contents of the DOC file in question, not only on its size. (And if your system has problems opening the 'RTF 1.9.1 Specification', even better: then you don't need to anonymize your file anymore ;-).)
Comment 4 Roman Eisele 2012-04-17 13:16:54 UTC
@James Sykes:
Sorry! I inferred the name 'Mike' from the e-mail address, but now I see that comment #2 gives your correct name (James). So, please, instead of '@Mike', please read 'Dear James Sykes' in my comment #3. (Obviously, I'm tired and should quit for today; it's 22:16 (10:16 p.m.) here in Germany.)
Comment 5 Roman Eisele 2012-04-19 00:52:00 UTC
Similar to bug 39883 - 'Crash on opening a "big" word file'. It's hard to say if this is the same bug because we don't have the .doc files and therefore can't see if there are any similarities.
Comment 6 Petr Mladek 2012-04-19 03:39:38 UTC
We can't move forward without a testing document.

The problem happens only with some special documents, so it can't block the release => lowering the severity a bit.
Comment 7 mike.sykes 2012-04-19 05:01:48 UTC
Created attachment 60308 [details]
Tab-delimited text file used for experiments

This file can easily be truncated to provide the 3000, 4000 and 5000 line size samples.
Comment 8 mike.sykes 2012-04-19 05:05:46 UTC
Created attachment 60309 [details]
The 6000 line txt file after converting to table and back to text

This is the 6000 line txt file after opening in writer, converting to table, converting the table to text and saving as txt.
Comment 9 mike.sykes 2012-04-19 05:09:39 UTC
Using LibreOffice 3.5.2.2 
Build ID: 281b639-6baa1d3-ef66a77-d866f25-f36d45f
under WinXP SP3

As suggested, I have downloaded the RTF spec and can confirm that it opens in ~30 secs without a problem, and can be edited quite satisfactorily. Interestingly, I see it contains quite a large table viz Appendix B: Index of RTF Control Words, which presents no problem.

Further investigation (on a now redacted file) has revealed that the problem almost certainly has nothing to do with converting from Word format, per se, but with soffice's handling of very large tables.

The original Word file consists almost entirely of a table exported from MySQL, having 11 columns x ~12,000 rows, with a page count of 300 (not that I would ever consider printing it).

I converted the table to (tab delimited) text and saved it as .txt.
I then:
    created smaller versions with 3000, 4000, 5000 and 6000 lines. 
    For each
        opened with soffice
        converted to table
        saved as odt 
        Converted back to text
        saved as txt
Up to 5000 lines, the result was correct, but increasingly slow. However at 6000, not only was everything very, very slow, but the saved odt file was seriously corrupted. There were 6000 lines, but only the last 40 were non-blank!
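
A quick cell count is consistent with these observations, assuming the truncated samples keep the 11 columns of the original table and that the relevant limit is the 2^16-cell capacity described in comment 18. The sketch below is illustrative arithmetic only; `kLegacyMaxCells`, `cellCount`, `fitsLegacyCellArray` and `maxRowsFor` are hypothetical names, not LibreOffice code.

```cpp
#include <cstddef>

// Assumed capacity of the legacy cell array (2^16, per comment 18).
constexpr std::size_t kLegacyMaxCells = 65536;

// Total number of cells in a rectangular table.
constexpr std::size_t cellCount(std::size_t rows, std::size_t cols)
{
    return rows * cols;
}

// True if every cell of the table fits into the assumed legacy capacity.
constexpr bool fitsLegacyCellArray(std::size_t rows, std::size_t cols)
{
    return cellCount(rows, cols) <= kLegacyMaxCells;
}

// Largest row count that still fits for a given (non-zero) column count.
constexpr std::size_t maxRowsFor(std::size_t cols)
{
    return kLegacyMaxCells / cols;
}
```

With 11 columns, 5000 rows give 55,000 cells and fit, while 6000 rows give 66,000 cells and do not. Under these assumptions the threshold would lie near 5957 rows.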
Comment 10 mike.sykes 2012-04-19 05:11:49 UTC
Forgot to mention:

My first name is James, but I'm Mike to family, friends & colleagues.
Comment 11 Roman Eisele 2012-04-19 08:05:50 UTC
@Mike: thank you very much for your investigations!
Now this bug is easily

[REPRODUCIBLE] with LibreOffice 3.5.2.2 (Build-ID: 281b639-6baa1d3-ef66a77-d866f25-f36d45f), German langpack installed, running on MacOS X 10.6.8.

I repeated Mike's steps given in comment #9, using his 'Tab-delimited text file used for experiments', and can confirm the strange behaviour he describes.

Additionally, I tried the following:
-- Open Mike's 6000 lines sample file with a text editor.
-- Copy the complete tab-delimited text.
-- Open LibreOffice.
-- Create a new Writer document.
-- Paste the complete table.
-- Save the file as ODT.
-- Close the ODT file; file size is 52,896 bytes (reasonable).
-- Open the ODT file again: still looks fine.
-- Select all the text.
-- From the menu, select Table > Convert > Text to Table ...
-- Wait for some seconds, CPU usage goes up to 100%.
-- Finally, a table appears and looks (at first glance) OK.
-- Save the document again. NB that the progress bar begins slowly to show the progress, but then stops prematurely.
-- Close it; file size is only 16,679 (!) bytes now.
-- Open the ODT file again: the table is corrupted, data is only present in the last 38 rows. Also the previous rows look strange: most rows have only 2 columns, but there are some rows in between that have 6 or 7 columns (!).

I will attach this corrupt ODT file.

Setting Status to 'New', as this bug is confirmed now.
Changing Platform from 'Windows' to 'All', as the bug is also reproducible on MacOS X.
Comment 12 Roman Eisele 2012-04-19 08:08:21 UTC
Created attachment 60316 [details]
ODT file with corrupted table created by my test in comment #11
Comment 13 mike.sykes 2012-04-19 09:51:59 UTC
Glad it's confirmed to be reproducible.

My apologies for the misleading title!
Comment 14 Roman Eisele 2012-04-19 10:11:52 UTC
(In reply to comment #13)
> My apologies for the misleading title!

We can change it (just click '(edit)' at the end of the title), and probably we should, because the title should be a short summary of the bug report.

What about "Writer corrupts large tables"? This would cover both your original report (reading a large table from a .doc file) and your report in comment #9 and my one in comment #11 (the table gets corrupted when saving to file).
Comment 15 mike.sykes 2012-04-20 01:03:32 UTC
Title changed to be less misleading.

I can confirm that Roman's experience matched mine very closely.

However, it should be borne in mind that at some size, a background task becomes uninterruptible besides taking all the CPU it can get. A lot of pagination has to be done on such a large table; could it be anything to do with that?
Comment 16 Roman Eisele 2012-04-22 01:23:26 UTC
I forgot that we should add the TABLES keyword to the Summary, in order to make this bug easier to find etc.
Comment 17 Michael Meeks 2012-09-20 14:50:30 UTC
Cedric - any thoughts :-)
Comment 18 Michael Stahl 2012-09-20 21:03:42 UTC
problem is that the table is simply too big:
it contains more than 2^16 cells, and in the Writer core
(designed for Windows 3.x) class SwTable there's at least
one array of all cells (GetTabSortBoxes()) that is a SvArray
with 2^16 max capacity so it overflows when pasting.

Writer often crashes while saving the document or closing it.

that is the case in LO 3.6 and earlier; on master it's now a
STL container with at least 2^32 capacity, which is progress.
unfortunately there are lots of iterations over that array
that still use 16 bit integers, so on master i get an
infinite loop in this function when closing the document:

void DelBoxNode( SwTableSortBoxes& rSortCntBoxes )
{
    for( sal_uInt16 n = 0; n < rSortCntBoxes.size(); ++n )
        rSortCntBoxes[ n ]->pSttNd = 0;
}

i'll try to fix that up...

Noel, if you aren't bored by STL containers yet, you can
look for some other cases where sal_uInt16 or shorts are used
in iterations or as index, and fix those :)
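
The failure mode in DelBoxNode can be illustrated outside Writer. Once the container holds more than 65535 elements, a 16-bit counter wraps back to 0 before it can reach size(), so the loop condition never becomes false. The sketch below is a simplified illustration, not the actual patch; `Box`, `nextIndex16`, `loop16Terminates` and `clearStartNodes` are hypothetical names, and `std::vector` stands in for the real container.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical stand-in for a table box; not the real Writer class.
struct Box {
    const void* pSttNd;
};

// What ++n does to a 16-bit counter: 65535 + 1 wraps around to 0.
std::uint16_t nextIndex16(std::uint16_t n)
{
    return static_cast<std::uint16_t>(n + 1);
}

// A loop "for (16-bit n = 0; n < count; ++n)" can only end if n is able to
// reach count, i.e. if count is at most 65535.
bool loop16Terminates(std::size_t count)
{
    return count <= std::numeric_limits<std::uint16_t>::max();
}

// The same loop with a counter as wide as the container's size type.
void clearStartNodes(std::vector<Box>& boxes)
{
    for (std::size_t n = 0; n < boxes.size(); ++n)
        boxes[n].pSttNd = nullptr;
}
```

With 66,000 boxes, `loop16Terminates` reports that the 16-bit version would never finish, while `clearStartNodes` visits every element once.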
Comment 19 Not Assigned 2012-09-21 16:12:14 UTC
Michael Stahl committed a patch related to this issue.
It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=5f91f8a368343d8921a01edb7359cd300892f09d

fdo#48692: fix problems with large number of table cells:



The patch should be included in the daily builds available at
http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
http://wiki.documentfoundation.org/Testing_Daily_Builds
Affected users are encouraged to test the fix and report feedback.
Comment 20 Not Assigned 2012-09-21 16:12:34 UTC
Michael Stahl committed a patch related to this issue.
It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=b844f06b36cf9a6c1411861a08701c8f9be2af0d

fdo#48692: fix problems with large number of table cells:



The patch should be included in the daily builds available at
http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
http://wiki.documentfoundation.org/Testing_Daily_Builds
Affected users are encouraged to test the fix and report feedback.
Comment 21 Michael Stahl 2012-09-21 16:24:17 UTC
after upgrading a lot of 16 bit integers to size_t, it's now possible
to convert the 6000 lines to table and store it as ODF.

the document is valid and contains all table cells.

it can even be loaded again.

there are however still significant performance problems here:
after converting the table, the layout uses about 10 minutes
of CPU time (in a debug build, which includes extra checks),
storing about 4 min., loading about 3 min.; and after loading,
the layout somehow doesn't finish, it is constantly running
in the idle handler... so it's not really possible to work with
such a big table.

but since no corruption occurs any more, i'm declaring this fixed.
if you care about the performance, please file a follow up bug,
but fixing that is likely to be hard because nobody understands
the Writer layout.
Comment 22 Michael Stahl 2012-09-21 16:26:14 UTC
oh, forgot: backporting to release branch sounds like a bad idea here,
because the various STL conversions depend on each other and would be
a mess to dis-entangle so the fix will be in 3.7 only.
Comment 23 Roman Eisele 2012-09-21 16:34:10 UTC
@ Michael Stahl:
Thank you very much for your fixing this issue! This is a big step forward.
Comment 24 Roman Eisele 2012-09-28 07:17:27 UTC
VERIFIED as FIXED with LOdev 3.7.0.0.alpha0+ (Build ID: 30d33b1; pull time: 2012-09-27 04:27:30) on Mac OS X 10.6.8 (Intel).

I have repeated my steps given in comment #11 with the current Master build.
This time, the table in the final .odt file contains all 6000 rows, and the table looks fine (no corruption anymore). I have verified that all 6000 rows are present by saving the .odt file as .fodt file (works fine!) and then grepping all <table:table-row> tags.

So, while I must second Michael Stahl’s hints about remaining performance problems (comment #21), I can confirm that the main issue of this report -- the corruption of tables with more than 2^16 cells -- is successfully fixed.

Thank you again!
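
The row check described above can also be scripted. The following is a rough sketch, assuming the .fodt content has already been read into a string and that every row is written as its own <table:table-row> element; it ignores attributes such as table:number-rows-repeated and does not distinguish nested tables. `countTableRows` is a hypothetical helper.

```cpp
#include <cstddef>
#include <string>

// Counts opening <table:table-row ...> tags in a flat ODF XML string.
// The character after the name must end the tag name, so that elements such
// as <table:table-rows> are not counted by mistake.
std::size_t countTableRows(const std::string& xml)
{
    const std::string tag = "<table:table-row";
    std::size_t count = 0;
    std::size_t pos = xml.find(tag);
    while (pos != std::string::npos) {
        const std::size_t next = pos + tag.size();
        if (next < xml.size()) {
            const char c = xml[next];
            if (c == '>' || c == '/' || c == ' ' || c == '\t' || c == '\n' || c == '\r')
                ++count;
        }
        pos = xml.find(tag, next);   // continue searching after this match
    }
    return count;
}
```

For the 6000-line sample, a result of 6000 would match the verification reported in this comment.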