Bug 95984 - FILEOPEN partial csv import corrupt, aborts when many lines are imported to a single cell until it fills
Status: RESOLVED NOTABUG
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Calc
Version (earliest affected): 3.6.7.2 release
Hardware: x86-64 (AMD64) All
Importance: medium normal
Assignee: Not Assigned
URL: https://wiki.documentfoundation.org/R...
Whiteboard:
Keywords: bibisected, regression
Depends on:
Blocks:
 
Reported: 2015-11-22 12:58 UTC by Milos Sramek
Modified: 2016-04-26 06:34 UTC
CC List: 4 users

See Also:
Crash report or crash signature:
Regression By:


Attachments
The attachment (3.03 MB, application/gzip)
2015-11-22 15:27 UTC, Milos Sramek

Description Milos Sramek 2015-11-22 12:58:15 UTC
When loading the attached CSV file with TAB as the delimiter (TAB is the correct one), all content after a certain line (line 30734 in this case) is imported into a single cell.

Bibisected using bibisect-releases: the first bad release is 3.6.7.1.

Loading takes much longer than in versions prior to 3.6.7.1, and the message

"The data could not be loaded completely because the maximum number of characters per cell was exceeded"

is displayed. However, these are perhaps just side effects.

Still present in Version: 5.1.0.0.alpha1+
Build ID: 966c1e94e8e2669bd623999661b95cdfefa8c6b7

--
milos
Comment 1 V Stuart Foote 2015-11-22 14:17:04 UTC
The attachment?
Comment 2 Milos Sramek 2015-11-22 15:27:49 UTC
Created attachment 120721
The attachment

Sorry, here it is
Comment 3 V Stuart Foote 2015-11-22 17:02:12 UTC
Confirming, including the bibisect.

On Windows 10 Pro 64-bit with

BAD in Version 3.6.7.2 (Build ID: e183d5b) -- incomplete load w/error, 7 min
BAD in Version 3.6.7.1 (Build ID: 9418c72) -- incomplete load w/error, 7 min

OK in Version 3.6.6.2 (Build ID: f969faf) -- fully loads in 11 sec
OK in Version 3.6.5.2 (Build ID: 5b93205) -- fully loads in 11 sec

and with current master
Version: 5.1.0.0.alpha1+ (x64)
Build ID: 966c1e94e8e2669bd623999661b95cdfefa8c6b7-GL
TinderBox: Win-x86_64@62-TDF, Branch:MASTER, Time: 2015-11-22_00:59:29
Locale: en-US (en_US)

For all Windows builds since 3.6.6.2 tested, after considerable delay, import of the <Tab> delimited CSV with date column type set stops at row 15111 -- the 2634985 Ventnor record. From there it fills the "D" column of row 15112, inserting newlines for the rest of the records.

I do get the same "maximum number of characters per cell was exceeded" error message. Of course that is not the issue; the issue is why at some point it stops parsing lines from the CSV into rows in the Calc sheet.
Comment 4 V Stuart Foote 2015-11-22 17:38:14 UTC
Looking at the release notes for 3.6.7.1, work on bug 60468 seems germane here.

@Eike, Fridrich -- something old and moldy that may need another look...
Comment 5 m.a.riosv 2015-11-22 22:43:04 UTC
If the double quote (") is not set as the Text Delimiter in the import dialog, the file opens in a few seconds.

I think this is necessary because there are lines, like 15112 already mentioned in comment 3, in which column D has an unpaired double quote. With the double quote as text delimiter, everything that follows is read into that cell until it reaches the maximum cell size.
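
The effect of an unpaired double quote can be reproduced outside of Calc. The following is only a minimal sketch using Python's csv module, not LibreOffice's importer; the sample data and the helper name parse are made up for illustration. With the double quote as text delimiter, the record with the unpaired quote swallows all following lines into one field; with no text delimiter, every line becomes its own record.

```python
import csv
import io

# Hypothetical sample data: the second record has an unpaired
# double quote at the start of its fourth column.
data = (
    "1\tA\tx\tfine\n"
    "2\tB\ty\t\"broken\n"
    "3\tC\tz\tfine\n"
    "4\tD\tw\tfine\n"
)


def parse(text, quotechar):
    """Parse TAB-separated text; quotechar=None disables quote handling."""
    quoting = csv.QUOTE_MINIMAL if quotechar else csv.QUOTE_NONE
    reader = csv.reader(
        io.StringIO(text),
        delimiter="\t",
        quotechar=quotechar,
        quoting=quoting,
    )
    return list(reader)


rows_quoted = parse(data, '"')   # double quote as text delimiter
rows_plain = parse(data, None)   # no text delimiter

print(len(rows_quoted), "records with the double quote as text delimiter")
print(len(rows_plain), "records with no text delimiter")
print(repr(rows_quoted[-1][-1]))  # the field that absorbed the rest of the input
```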

Not sure it can be considered a bug.
Comment 6 Milos Sramek 2015-11-23 09:05:20 UTC
Hi,

I see now that the file is not OK. I had not noticed that before. Thanks for pointing it out.

I think, however, that LO should not react by hanging for a long time; instead it should react in a meaningful way. Otherwise it looks like a bug in LO.

The best way to deal with that would be to warn the user with a message, e.g. "Unbalanced delimiter on line XXX", and exit, so that he or she knows that the data is corrupt.

A less good solution would be to ignore the problem, but still start a new record once the line end is reached. This is the approach used in LO prior to 3.6 and in MSO. This looks pretty, but the user does not know that the data is corrupt.

It would also be good to have a third option in text delimiter setting: None
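
A check of the kind suggested above could look roughly like the following. This is only a sketch of the idea, not anything LibreOffice does; the function name find_unclosed_quote is hypothetical, and it simply treats every double quote as opening or closing a quoted part. Note that the result is only known after the whole text has been scanned, because a quoted part may legitimately span several lines.

```python
def find_unclosed_quote(text, quotechar='"'):
    """Return the 1-based line number on which a quoted part was opened
    and never closed, or None if all quote characters are paired."""
    in_quotes = False
    opened_at = None
    line_no = 1
    for ch in text:
        if ch == quotechar:
            # A doubled quote ("") toggles twice, so it leaves the state unchanged.
            in_quotes = not in_quotes
            if in_quotes:
                opened_at = line_no
        elif ch == "\n":
            line_no += 1
    return opened_at if in_quotes else None


# Hypothetical sample inputs.
broken = 'a\tb\nc\t"d\ne\tf\n'      # quote opened on line 2, never closed
balanced = 'a\t"b\nc"\td\n'         # quoted part spans two lines but is closed

line = find_unclosed_quote(broken)
if line is not None:
    print(f"Unbalanced delimiter on line {line}")
print(find_unclosed_quote(balanced))
```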
Comment 7 Robinson Tryon (qubit) 2015-12-10 07:32:33 UTC
Migrating Whiteboard tags to Keywords: (regression, bibisected (fix typo))
Comment 8 Markus Mohrhard 2016-04-21 16:42:44 UTC
(In reply to Milos Sramek from comment #6)
> Hi,
> 
> I see now, the file is not OK. I have not noticed that before. Thanks for
> noticing that.
> 
> I think, however, that LO should not react by hanging for a long time, but
> instead it should react in a meaningful way. Otherwise it looks like a bug
> in LO

Well, it is a broken document. This is clearly not a bug in LibreOffice. We try to make sense of the content of the document as best we can.

> 
> The best way how to deal with that would be to warn the user by a message,
> e.g.  "Unbalanced delimiter on line XXX" and exit - so that he or she knows
> that the data is corrupt. 

This is not known until the whole document has been imported.

> 
> A not so good solution would be to ignore the problem, but still to start a
> new record once line end is reached. This is the approach used in LO prior
> to 3.6 and in MSO. This looks pretty, but the user does not know that the
> data is corrupt.

Obviously this is wrong. The newline inside of a quoted part is a normal newline that is transferred to the cell content.
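
To illustrate the point about newlines inside a quoted part: in the following sketch (Python's csv module with made-up sample data, not LibreOffice code) the quoted field contains a line break, and it is read as one record whose cell content keeps the newline.

```python
import csv
import io

# Hypothetical TAB-separated sample: the "note" field of record 1
# is quoted and contains a newline.
data = 'id\tnote\n1\t"first line\nsecond line"\n2\tplain\n'

rows = list(csv.reader(io.StringIO(data), delimiter="\t"))

for row in rows:
    print(row)
```

Starting a new record at every line end would split record 1 in two, which is why that approach cannot be right for well-formed files.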

> 
> It would also be good to have a third option in text delimiter setting: None
Comment 9 B. Nikola 2016-04-23 08:54:20 UTC
During CSV file import, Calc / LO crashes. And it is loading very slowly, too.