Bug 155384 - Fixed width paste incorrectly assumes an extra split point
Summary: Fixed width paste incorrectly assumes an extra split point
Status: NEW
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Calc (show other bugs)
Version:
(earliest affected)
7.6.0.0 alpha1+
Hardware: All All
: medium normal
Assignee: Not Assigned
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: CSV-Import
  Show dependency treegraph
 
Reported: 2023-05-18 03:16 UTC by Pierre Fortin
Modified: 2023-06-03 18:55 UTC (History)
3 users (show)

See Also:
Crash report or crash signature:


Attachments
sample file (deleted)
2023-05-18 15:48 UTC, Pierre Fortin
Details
Screenshot opening csv sample file (deleted)
2023-05-18 22:40 UTC, m_a_riosv
Details
sample file 2 (462.64 KB, text/csv)
2023-05-19 00:48 UTC, Pierre Fortin
Details
Screenshot opening sample2 (60.62 KB, image/png)
2023-05-19 21:59 UTC, m_a_riosv
Details
Screenshot how to do it. (60.29 KB, image/png)
2023-05-19 22:05 UTC, m_a_riosv
Details
Number of supported rows (OT) (23.96 KB, image/png)
2023-05-20 03:07 UTC, Pierre Fortin
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Pierre Fortin 2023-05-18 03:16:36 UTC
I have many sheets which have a column of ID numbers consisting of two letters and 4-7 digits e.g., XX2035-XX1876543
Wanting to split this column into two columns: XX and numeric parts:  

Paste Special > Unformatted Test
Fixed width

The dialog samples some number of rows and sees XXnnnn; then assumes all entries are 2+4. Pasting the data results in all XXnnnn rows correctly pasted; but ALL other rows are split into 3 columns:  XX nnnn nnn

I can "trick" the dialog into doing the right thing by inserting a sacrificial row at the top containing XX1234567; but that adds extra steps to correct the data.
Comment 1 m_a_riosv 2023-05-18 14:54:08 UTC
Please attach a sample file, so someone can test it.
Comment 2 Pierre Fortin 2023-05-18 15:48:27 UTC
Created attachment 187376 [details]
sample file

Column A is the data I copied, then pasted into B and C; yet 4 digits went into column C and the rest into column D.   I can provide sample files with up to 1.5M rows if necessary.
Comment 3 Pierre Fortin 2023-05-18 15:55:13 UTC
Forgot to mention:  the original file was 14,361,164 rows with 100 unique 2-character prefixes; so this issue caused a bit of grief... :)
Comment 4 m_a_riosv 2023-05-18 22:40:32 UTC
Created attachment 187386 [details]
Screenshot opening csv sample file

I think sample file, it's not the same you have reported.
Comment 5 Pierre Fortin 2023-05-19 00:48:47 UTC
Created attachment 187388 [details]
sample file 2

I have no idea how that file got selected...  Here's another try at the file I thought was attached...
Comment 6 QA Administrators 2023-05-19 03:17:28 UTC Comment hidden (obsolete)
Comment 7 m_a_riosv 2023-05-19 21:59:24 UTC
Created attachment 187407 [details]
Screenshot opening sample2

I think it is a different file from your screenshot.
Comment 8 m_a_riosv 2023-05-19 22:05:39 UTC
Created attachment 187408 [details]
Screenshot how to do it.

Seems your screenshot, it's for the first one.

But what you copy to do the paste special.

Maybe the fixed with gets what you need to do.
Comment 9 Pierre Fortin 2023-05-19 23:03:31 UTC
1. my first attachment was the wrong file -- no idea how that happened since that was not a file I had open.

2. sample file 2 is correct. 

3. I know how to use fixed width; the issue is that fixed width does not work correctly if there are >= N (unknown to me) rows that *appear* to imply that ALL data is like these initial N samples.

4. Try it yourself: 

  a) to see the problem as reported:
   - select cell D1
   - Ctrl+down-arrow
   - do you see digits in column D?  That is the issue!

  b) to reproduce the problem:
   - select column A
   - Copy it
   - right-click on E1 or any column to the right; NOT on column A-D as they already contain data at the bottom.
   - Paste Special > Unformatted Text
     - Fixed width
   - set the divider line between characters 2 and 3
   - Note that the sample display is only 6 characters wide
   - OK
   - check the bottom rows...  is your pasted data in 2 or 3 columns?  Unless this bug has been fixed, you should have "AH" in column E, 4 digits in column F, and 1 digit in column G.  With a huge sample of ~1.5M rows, you would see this in the bottom rows:  AH, 1234, 567 vs the expected AH, 1234567

HTH
Comment 10 ady 2023-05-20 02:15:24 UTC
While I can reproduce it when copying from the already-imported first column (by selecting the first column _only_), it can be easily avoided when initially importing the data (using either "Separated by" or "fixed width"), precisely because there is more data on the right-side (duplicated in this case).

OTOH, I think users shouldn't expect to look-up 7900 rows (or whatever) just in order to find the data format before suggesting the columns' width. That would probably be inefficient.

[OT] BTW, 1.5M rows are not supported anyway.
Comment 11 Pierre Fortin 2023-05-20 03:07:53 UTC
Created attachment 187409 [details]
Number of supported rows (OT)

I don't think I said, suggested or implied that many rows need to be inspected.

Questions: 
 - What is the logic behind adding another [hidden] split point based on the first row(s)?
 - The request is to split once between 2nd and 3rd characters in this case; why is an unspecified, unwanted additional split added?
 - Why is the fixed width UI limited to what is visible in the first few rows, vs allowing the user to specify how to split wider cells by adding a split bar farther the right, without having to scan the entire data?
 - Why even have an extra [hidden] split point?

>to find the data format before suggesting the columns' width.
What does the columns' width have to do with explicit split points?  
A suggested column's width is no reason for adding an unwanted/unexpected split point.

Just accept the split(s) as specified, without adding another hidden split at the end of what is presented as examples...
Comment 12 Pierre Fortin 2023-05-20 03:32:03 UTC
(In reply to ady from comment #10)

More specifically to your points...

> While I can reproduce it when copying from the already-imported first column
> (by selecting the first column _only_), it can be easily avoided when
> initially importing the data (using either "Separated by" or "fixed width"),
> precisely because there is more data on the right-side (duplicated in this
> case).

So...  rather than fixing what I consider a rookie coding error, the user is expected to abandon work on a HUGE spreadsheet (often 14-15M rows) that takes 4 minutes just to load on a very capable CPU:
  Processors: 20 × 12th Gen Intel® Core™ i7-12700K (4@5GHz, 12@4.9GHz, 4@3.8GHz)
  Memory: 125.5 GiB of RAM and 2TB + 4TB SSDs + 44TB on 6 platter drives
in order to save a few milliseconds finding the max width of the incoming data?

> OTOH, I think users shouldn't expect to look-up 7900 rows (or whatever) just
> in order to find the data format before suggesting the columns' width. That
> would probably be inefficient.

Compared to 4 minutes to load a large file; that's not even noticeable.
Comment 13 ady 2023-05-20 04:50:51 UTC
(In reply to Pierre Fortin from comment #12)
> (In reply to ady from comment #10)
> 
> So...  rather than fixing what I consider a rookie coding error, the user is
> expected to abandon work on a HUGE spreadsheet (often 14-15M rows)

Let me clarify my comment then. There is indeed a problem. The fixed-width selection should not trim the second/last column. Additionally, the width of the last column should not strictly depend on some kind of scanning or look-up of the first few rows. The current result acts as if the second column's width was somehow selected/delimited, but the user didn't do that. The problem is that the last column's width is limited in the dialog and it seems it cannot be marked in any way in order for it to be wider. (At least the data is not completely lost.)

Having said that, care should be taken if solving this issue, I could imagine some case in which someone complains in the opposite direction (e.g. the last column is too wide and contains unwanted characters, not seen in the dialog).

It seems as if somehow some lookup is performed in order to hint about some columns' widths; when solving this problem, care should be taken not to delay the dialog too much (for instance if there are many columns and many rows).

I am able to overcome the case by using the "Separated by" alternative when importing the first time the csv (no need to copy after the initial import), but, again, this is because the relevant column is not the _last_ one.

So, clearly, there is some problem when importing, with the "last" column's width.
Comment 14 Pierre Fortin 2023-05-20 13:48:08 UTC
(In reply to ady from comment #13)

> The problem is that the last column's width is limited in the dialog and it
> seems it cannot be marked in any way in order for it to be wider. (At least
> the data is not completely lost.)

Not lost; but certainly not useful... and destructive if I've only allotted 2 columns and the data is split into 3, thereby overwriting some of the data in the right-side adjacent column...  at least there's Ctrl+z 

While "natural sort" can be used on this data, arithmetic operations cannot unless the original column is split into prefix & gap'ed sequences.

> Having said that, care should be taken if solving this issue, I could
> imagine some case in which someone complains in the opposite direction (e.g.
> the last column is too wide and contains unwanted characters, not seen in
> the dialog).

Which "unwanted characters" would you foresee being added?  I've always held the position: "I can't insert data that doesn't exist"...  The last column should only contain what is present, without "padding".  Or did you mean something else?

> It seems as if somehow some lookup is performed in order to hint about some
> columns' widths; when solving this problem, care should be taken not to
> delay the dialog too much (for instance if there are many columns and many
> rows).

This is a "fixed width" issue; not "separated by". In fact, I often separate columns with "separated by [comma]"... Since the user has selected how to split the first N characters, Calc does not need to concern itself with anything but the last X characters to determine the max width.  
A Python example:
  maxXwidth = 0  # last expected column
  for row in rows:
     maxXwidth = max( maxXwidth, len( row[last_split_point:] ) )

As far as "not to delay", this is a column, not sheet issue; so doing an in-memory scan should be barely noticeable if at all. To wit, anyone who uses AutoFilter is subliminally aware of any likely delay dealing with a column.

> I am able to overcome the case by using the "Separated by" alternative when
> importing the first time the csv (no need to copy after the initial import),
> but, again, this is because the relevant column is not the _last_ one.

Sorry; but what I provided was a sample of the issue. The real "initial import" contains 90+ varying width columns, delimited by tab, comma, or other. There's no way to use fixed width on that initial data without providing a dialog which allows "fixed width" splitting on certain columns within a "separated by" scenario. e.g., initial data is imported with "separated by" along with certain columns further split via a second level "fixed width" or "separated by" (see below).

> So, clearly, there is some problem when importing, with the "last" column's
> width.

Thinking about the problem more logically...  aside from the incorrect final split; this may be more of a feature request:  the need is to split a single column. In this case "fixed width"; but I also need to split a single columns with "separated by", such as an address column into street, city, zip...   Or a phone number column into area_code, NNX, number (separated by '-').  This is further complicated for me because I have to deal with international data for which there is no standard when multiple countries' data is provided within a single column.  ;p
Comment 15 Stéphane Guillou (stragu) 2023-06-01 08:56:20 UTC
The content of attachment 187376 [details] has been deleted for the following reason:

looks like sensitive data
Comment 16 Stéphane Guillou (stragu) 2023-06-01 08:58:24 UTC
The content of attachment 187386 [details] has been deleted for the following reason:

looks like sensitive data
Comment 17 Stéphane Guillou (stragu) 2023-06-01 09:36:52 UTC
Trying to start from scratch here because I am confused by the steps.
Please tell me if the following steps are correct:

1. Open attachment 187388 [details] with Calc
2. In text import dialog, import as tab-delimited

Result 1: import dialog preview shows 3 columns, but resulting column D contains data starting from row 7895.
However: original file _does_ contain an extra tab at the end of the lines (and before the last digit in the bottom rows, starting from lin 7895)!You can see that with a text editor.

3. Copy column A, paste unformatted in column E (Edit > Paste > Paste unformatted)
4. Using the "fixed width" setting, place a delimiter after the first two letters "AH". No other delimiter.
5. Press OK

Result 2: works as expected for me, only two columns as a result, column G has 4 digits at the beginning, 5 digits at the end.

So in conclusion, according to my testing, there is no issue with how the data is imported. However, the importa dialo preview _should_ show that a fourth column will be created, because there are a total of three tab delimiters by line. That last column should show even if it's empty for the however many first rows.

Can you please follow the steps outlined above and confirm if you see the same thing?
Comment 18 Pierre Fortin 2023-06-01 12:51:07 UTC
>Trying to start from scratch here because I am confused by the steps.

Probably my bad: I included the source data and the resulting issue in the same file to reduce the data to a minimal set...

The real source files are HUGE. One column in these HUGE files contains text like this:    XXnnnn[nnn]  (2 letters and 4-7 digits)

When copying the column, there is no TAB in that data -- it's already been absorbed by the import (more below).

After copying/splitting this column, I use the steps in comment #1... The dialog shows only XX|1234 because the number of sampled rows are all simply 2 letters and 4 digits.  Apparently, there are enough 2+4 rows to convince the dialog that all rows must be 2+4. There are always corner cases; but sampling the last row in this case would result in one of: XX1234, XX12345, XX123456, XX1234567... Not a good/reliable solution to sample the last row...

The issue is that the dialog should not assume the data can be split after 4 digits.  There was no explicit fixed split-point specified after 2+4, so none should be assumed.  The flaw is that an unspecified split is assumed; only explicit split points should be used.

The solution seems this simple:  split ONLY where split-points are specified and leave the string remainder intact in the last resulting column so that the resultant data becomes:
xx 1234
...
xx 12345
...
xx 123456

etc... and not:
xx 1234
...
xx 1234 5
...
xx 1234 56
...
etc...
Comment 19 ady 2023-06-01 14:59:29 UTC
FWIW

(In reply to ady from comment #13)
> Let me clarify my comment then. There is indeed a problem. The fixed-width
> selection should not trim the second/last column. Additionally, the width of
> the last column should not strictly depend on some kind of scanning or
> look-up of the first few rows. The current result acts as if the second
> column's width was somehow selected/delimited, but the user didn't do that.
> The problem is that the last column's width is limited in the dialog and it
> seems it cannot be marked in any way in order for it to be wider. (At least
> the data is not completely lost.)