Bug 90653 - Opening .csv : Problem with rows
Summary: Opening .csv : Problem with rows
Status: RESOLVED INVALID
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Calc (show other bugs)
Version:
(earliest affected)
4.2.7.2 release
Hardware: Other All
: medium normal
Assignee: Not Assigned
URL:
Whiteboard:
Keywords: haveBacktrace
Depends on:
Blocks: CSV-Import
  Show dependency treegraph
 
Reported: 2015-04-16 13:48 UTC by Mamoth
Modified: 2020-04-20 16:57 UTC (History)
3 users (show)

See Also:
Crash report or crash signature:


Attachments
The .csv file to open (3.15 MB, text/csv)
2015-04-16 13:48 UTC, Mamoth
Details
bt with debug symbols (10.34 KB, text/plain)
2019-08-10 11:38 UTC, Julien Nabet
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Mamoth 2015-04-16 13:48:50 UTC
Created attachment 114824 [details]
The .csv file to open

Hello.

I am trying to import a .csv file (see attachment). The file has 16069 rows and is not especially large (3.3 Mb).
But when I open it with LO, it says "The data could not be loaded completely because the maximum number of characters per cell was exceeded."

First problem : Calc should not have any problem loading a file with this number of rows.

Second problem : 
After looking twice, I noticed that the last line of the Calc sheet does match the last line of the .csv file ( i.e. " La Taille de la Vente;37350;LA CELLE-GUENAND;INDRE-ET-LOIRE;CENTRE;491439.0;2217742.0;3811Z;Collecte des déchets non dangereux;;") but the row number does not match the one of the .csv file :
 - in the Calc sheet, the last row is number 15405 
 - in the .csv file, the last row is number 16069

So I suppose that some rows have been skipped during import.
Comment 1 Julien Nabet 2015-04-16 19:22:02 UTC
On pc Debian x86-64 with master sources updated today, I could reproduce this.

I noticed this on console:
warn:legacy.osl:23655:1:sc/source/ui/docshell/impex.cxx:602: lcl_appendLineData: data overflow
Comment 2 tommy27 2016-04-16 07:27:24 UTC Comment hidden (obsolete)
Comment 3 Mamoth 2016-04-18 07:36:32 UTC Comment hidden (obsolete)
Comment 4 Julien Nabet 2016-04-18 16:55:19 UTC Comment hidden (obsolete)
Comment 5 Mamoth 2016-04-18 18:59:04 UTC Comment hidden (obsolete)
Comment 6 libreoffice-bugs 2016-05-28 11:29:28 UTC
Problem still exists in 5.1.3.2
Comment 7 firfin 2017-07-17 13:20:29 UTC
Still present in LO Version: 5.1.6.2
Build ID: 1:5.1.6~rc2-0ubuntu1~xenial2 on Ubuntu 16.04

There are also similar bugs on the OOo bug tracker about this (see https://bz.apache.org/ooo/show_bug.cgi?id=123340 for instance) , so this might be inherited ( I am not runnig virtualbox anymore, so I can't easily check with the 3.3. version without wrecking my current install.)
Comment 8 Tristan Miller 2017-08-29 13:36:26 UTC
Using LibreOffice Calc 5.4.0.3, I can open Attachment 114824 [details].  However, that version of Calc still chokes on delimited text files that are a bit larger.
Comment 9 Andreas Säger 2017-10-30 13:27:25 UTC
The csv is invalid because it contains text values with line breaks that are indistinguishable from line breaks between 2 records. Any csv parser will fail with this.
Comment 10 Tristan Miller 2017-11-03 18:05:08 UTC
(In reply to Andreas Säger from comment #9)
> The csv is invalid because it contains text values with line breaks that are
> indistinguishable from line breaks between 2 records. Any csv parser will
> fail with this.

Could you elaborate on this?  True, the line breaks may be semantically wrong, but I don't see how they make the file syntactically incorrect or harder/impossible to parse.
Comment 11 QA Administrators 2018-11-04 03:58:24 UTC Comment hidden (obsolete)
Comment 12 Jonny Grant 2019-08-10 10:52:17 UTC
Hi
This bug is still present in Version: 6.0.7.3

"The data could not be loaded completely because the maximum number of characters per cell was exceeded."

The strange popup doesn't even say what line it is on. It should just continue after what ever issue it finds.


I can put a $100 bug-bounty on a fix roll out. can't believe this PR was filed 4 years ago, and still broken.
Comment 13 Julien Nabet 2019-08-10 11:38:45 UTC
Created attachment 153283 [details]
bt with debug symbols

On pc Debian x86-64 with master sources updated today, I could reproduce this.

I attached a bt just after:
warn:legacy.osl:21334:21334:sc/source/ui/docshell/impex.cxx:628: lcl_appendLineData: data overflow
Comment 14 Timur 2019-09-12 13:57:39 UTC
(In reply to Mamoth from comment #0)
>  - in the Calc sheet, the last row is number 15405 
>  - in the .csv file, the last row is number 16069
Last row number in LO depends on CSV open dialog options.
And in MSO Excel, which opens CSV without warning, the last row number is 16008.

On the other side, if open and saved with LO or MSO, csv remains at 16069. 

This file is not correct. I guess that's what Andreas referred to. 
Reporter should have given minimal example so that we know what a problem in LO is, not that we look for a problem in his file.
Comment 15 Tristan Miller 2019-09-13 07:50:18 UTC
Timur, you have closed this bug on the basis that the second problem reported by Mamoth is invalid.  But you are ignoring the fact that the first problem he reported (the unhelpful error message) is valid.  Even if there is a problem with the file, LibreOffice needs to inform the user in a way that allows them to understand and locate the problem.  I suggest this bug be reopened to deal with this first problem.  Incidentally, it is still reproducible for me with the following setup:

Version: 6.3.1.1
Build ID: 30(Build:1)
CPU threads: 12; OS: Linux 5.2; UI render: default; VCL: gtk3; 
Locale: en-GB (en_CA.UTF-8); UI-Language: en-US
Calc: threaded
Comment 16 Julien Nabet 2019-09-17 20:06:41 UTC
(In reply to Tristan Miller from comment #15)
> Timur, you have closed this bug on the basis that the second problem
> reported by Mamoth is invalid.  But you are ignoring the fact that the first
> problem he reported (the unhelpful error message) is valid.  Even if there
> is a problem with the file, LibreOffice needs to inform the user in a way
> that allows them to understand and locate the problem.  I suggest this bug
> be reopened to deal with this first problem.  Incidentally, it is still
> reproducible for me with the following setup:
> ...

LO gives some hint, a cell contains too much characters. So it means something prevented from the good split on the data.
LO does just csv file parsing to split at the right location depending on the entered parameters, it can't do a semantic analyze which would allow it to guess where a value should stop.
Comment 17 Tristan Miller 2019-09-18 08:25:11 UTC
(In reply to Julien Nabet from comment #16)
> LO gives some hint, a cell contains too much characters. So it means
> something prevented from the good split on the data.
> LO does just csv file parsing to split at the right location depending on
> the entered parameters, it can't do a semantic analyze which would allow it
> to guess where a value should stop.

The least it could do is indicate at which cell or which row the parsing failed.
Comment 18 Julien Nabet 2019-09-18 08:49:55 UTC
Tristan:the simplest thing would be to stop when there's a pb

Eike: do you think we may add some throw exception or at least an assert in https://opengrok.libreoffice.org/xref/core/sc/source/ui/docshell/impex.cxx?r=f805e0b9#636 in else part?

626  static bool lcl_appendLineData( OUString& rField, const sal_Unicode* p1, const sal_Unicode* p2 )
627  {
628      OSL_ENSURE( rField.getLength() + (p2 - p1) <= SAL_MAX_UINT16, "lcl_appendLineData: data overflow");
629      if (rField.getLength() + (p2 - p1) <= SAL_MAX_UINT16)
630      {
631          rField += OUString( p1, sal::static_int_cast<sal_Int32>( p2 - p1 ) );
632          return true;
633      }
634      else
635      {
636          rField += OUString( p1, SAL_MAX_UINT16 - rField.getLength() );
637          return false;
638      }
639  }

Indeed, putting more than SAL_MAX_UINT16 characters in a cell is more than suspect no ?
Comment 19 Richard Bruce Baxter 2020-04-20 08:05:35 UTC
Bug still present @Version: 6.4.2.2
Comment 20 Eike Rathke 2020-04-20 16:57:21 UTC
It's not a bug in Calc. It's a bug in the data generating application.

Meanwhile the cell's string limit is not 64k anymore so theoretically upping that limit could be possible. BUT.. big BUT:

The data is broken at several places, that's why it is tried to stuff so many characters into one cell resulting in less rows than the original. First broken field is in row 3385, after import go to D3385 and you'll see quite a few records crammed into that cell. Reason is that the record

052.01076;DECONS - Le Pian (Louens);40271311900012;"Louens" 1701 -  Route de Soulac;33290;LE PIAN-MEDOC;GIRONDE;AQUITAINE;360329.91;1999074.6;3831Z;Démantèlement d'épaves;;

is broken data in that the field content
;"Louens" 1701 -  Route de Soulac;
starts with a field opening " quote character but is not closed, so the CSV import correctly keeps running that field until a field closing " quote character is encountered in row 3415 of

052.01340;SITA SUD OUEST;70198020300148;Centre de Valorisation des Déchets "Graulin";33470;LE TEICH;GIRONDE;AQUITAINE;334690.0;1958026.0;3821Z;Traitement et élimination des déchets non dangereux;;

(the Graulin";).

The correct data in the wrong field instead should had been
;"""Louens"" 1701 -  Route de Soulac";

There are a few of such cases, visually detectable by a tall row and usually the bad data in column D. Rows of raw data:

3385:
052.01076;DECONS - Le Pian (Louens);40271311900012;"Louens" 1701 -  Route de Soulac;33290;LE PIAN-MEDOC;GIRONDE;AQUITAINE;360329.91;1999074.6;3831Z;Démantèlement d'épaves;;

3573:
052.02306;RIGHINI -Tonneins;72615004800023;"Rapetout" Route de Verteuil;47400;TONNEINS;LOT-ET-GARONNE;AQUITAINE;438289;1934226;1623Z;Fabrication de charpentes et d'autres menuiseries;;

3800:
052.06068;USSGETOM;25330430700015;"Lichon" - ZA de Coussères;33210;FARGUES;GIRONDE;AQUITAINE;392400.0;1951200.0;8412Z;Administration publique (tutelle) de la santé -  de la formation -  de la culture et des services sociaux -  autre que sécurité sociale;;

3843:
052.06793;LA GRANDE JAUGUE;50818282100019;"Touban" Pôle environnemental;33160;SAINT-MEDARD-EN-JALLES;GIRONDE;AQUITAINE;359012;1992919;3821Z;Traitement et élimination des déchets non dangereux;;

9783:
064.00202;Usine de traitement d'Hugueneuve;05781313100026;"500 Route du Gros Cerveau;83190;OLLIOULES;VAR;PROVENCE-ALPES-COTE-D'AZUR;886494;1799096;3600Z;Captage -  traitement et distribution d'eau;;

14296:
053.02584;SMIRTOM DE LA REGION DE L'AIGLE;25610217900015;"les champs rouges" déchètterie route de l'aigle à crulai BP 174;61300;L'AIGLE;ORNE;BASSE-NORMANDIE;457856.0;2416389.0;3811Z;Collecte des déchets non dangereux;;

16033:
100.02151;SODEC;42396801500010;"La Grande pièce" RD 90;18100;SAINT-HILAIRE-DE-COURT;CHER;CENTRE;574286.0;2245553.0;3821Z;Traitement et élimination des déchets non dangereux;5.(d);Décharges -  à l'exception des décharges de déchets inertes et des décharges qui ont été définitivement fermées avant le 16 juillet 2001 ou dont la phase de gestion après désaffection requise par les autorités compétentes conformément à l'article 13 de la directive 1999/31/CE du Conseil du 26 avril 1999 concernant la mise en décharge des déchets s'est achevée recevant 10 tonnes par jour ou d'une capacité totale de 25 000 tonnes