Bug 155946 - Guess separator for the text import dialog
Summary: Guess separator for the text import dialog
Status: NEW
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Calc
Version (earliest affected): unspecified
Hardware: All All
Importance: medium enhancement
Assignee: Not Assigned
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: CSV-Dialog
Reported: 2023-06-20 10:18 UTC by Eyal Rozenberg
Modified: 2023-06-21 12:10 UTC

See Also:
Crash report or crash signature:


Description Eyal Rozenberg 2023-06-20 10:18:58 UTC
When we paste multi-line text, the Text Import dialog springs up. At the moment, it offers us a default choice of text field separator - separate by Tab.

But if we are already parsing the text to look for newlines, why not look for common separators as well?

* If the line has no tabs, definitely don't offer tabs as the default
* Ditto for spaces and commas
* Out of the remaining possible separators - apply some simple heuristic for the choice, e.g. the one that appears most often, ignoring occurrences at the start and end of a line.

The specific heuristic is a matter for bikeshedding, but even "first separator encountered" is better than what we have now.
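
A minimal Python sketch of the heuristic described above, for illustration only. The candidate list, the function name `guess_separator`, and the tie-breaking rule are assumptions made for this sketch; none of it is taken from the LibreOffice code.

```python
from collections import Counter

# Hypothetical candidate set; the dialog's actual list may differ.
CANDIDATES = ["\t", ",", ";", " "]


def guess_separator(text, candidates=CANDIDATES):
    """Return the most frequent candidate separator, or None if none occurs."""
    counts = Counter()
    for line in text.splitlines():
        for sep in candidates:
            # Ignore occurrences at the start and end of the line.
            counts[sep] += line.strip(sep).count(sep)
    # Separators that never appear are not offered at all.
    present = [sep for sep in candidates if counts[sep] > 0]
    if not present:
        return None
    # On a tie, the candidate listed first wins (an arbitrary choice).
    return max(present, key=lambda sep: counts[sep])


print(repr(guess_separator("a;b;c\n1;2;3")))
```

With input that contains only semicolons as candidates, the sketch picks the semicolon; with no candidate present it returns `None`, leaving the caller to fall back on whatever default it likes.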
Comment 1 Eike Rathke 2023-06-20 11:14:22 UTC
Not necessarily. If standard text is pasted from the system clipboard then offering Tab is actually a good choice because any text can contain all other separators without them being actually separators, specifically comma.

Furthermore, if cells are copied to clipboard and pasted as text-only they will be separated by Tab, so at least in that case it's the only sensible choice.

Also note that the last choice is remembered, so whether you actually get Tab offered depends on your previous action. For the first time of dialog usage we even already try to determine a separator in the context of ending a quoted field.

"apply some simple heuristic" is wishful thinking, but what exactly should that "simple" be? The "if it has no [...separator...] then don't offer it" doesn't help either, because a checked separator that isn't used in data has no effect on the import, so not offering it is just cosmetic.
Comment 2 Eyal Rozenberg 2023-06-20 20:46:10 UTC
(In reply to Eike Rathke from comment #1)
> Not necessarily. 

Not necessarily what?

> If standard text is pasted from the system clipboard then
> offering Tab is actually a good choice because any text can contain all
> other separators without them being actually separators, specifically comma.

If there are no tabs, then offering a tab is obviously not a good choice. But other than that, and like I said - any reasonable heuristic is fine by me.

> Furthermore, if cells are copied to clipboard and pasted as text-only they
> will be separated by Tab, so at least in that case it's the only sensible
> choice. 

But that's a case where the pasted text already has tabs. The point of this issue is not to assume that this is always the case, which it isn't.

> Also note that the last choice is remembered, so whether you
> actually get Tab offered depends on your previous action.

Well, yes, but the memory becomes irrelevant if the pasted text doesn't use that separator.

> For the first time
> of dialog usage we even already try to determine a separator in the context
> of ending a quoted field. "apply some simple heuristic" is wishful thinking,
> but what exactly should that "simple" be? The "if it has no
> [...separator...] then don't offer it" doesn't help either, because a
> checked separator that isn't used in data has no effect on the import, so
> not offering it is just cosmetic.

I'm not sure I follow. Of course it helps if the default choice in the dialog is of a separator that actually appears in the text rather than one which doesn't. 

Given the memory we have of the user's last choice, the simple heuristic might be: "User's last choice, unless the text doesn't have that separator (or even: unless the first line doesn't have it), in which case the first separator which appears on the first line."

That's pretty simple. Feel free to suggest something else.
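
A rough Python sketch of that rule, using the stricter first-line variant. The function name `choose_default`, the candidate tuple, and the fallback when no candidate appears at all are assumptions made for illustration.

```python
def choose_default(text, last_choice, candidates=("\t", ",", ";", " ")):
    """Pick the separator to preselect in the dialog.

    Keeps the remembered choice if it occurs on the first line; otherwise
    takes whichever candidate appears earliest on the first line.
    """
    lines = text.splitlines()
    first_line = lines[0] if lines else ""
    if last_choice and last_choice in first_line:
        return last_choice
    # (position, separator) pairs for candidates present on the first line.
    positions = [(first_line.index(sep), sep)
                 for sep in candidates if sep in first_line]
    if not positions:
        # Assumption: with nothing to go on, keep the remembered choice.
        return last_choice
    return min(positions)[1]


print(repr(choose_default("a;b,c\nd;e,f", "\t")))
```

Here the remembered Tab does not occur on the first line, so the sketch falls through to the semicolon, which is the first candidate encountered.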
Comment 3 Eike Rathke 2023-06-21 12:10:42 UTC
(In reply to Eyal Rozenberg from comment #2)
> (In reply to Eike Rathke from comment #1)
> > Not necessarily. 
> 
> Not necessarily what?
Not necessarily this:
>> but even "first separator encountered" is better than what we have now.


> Given the memory we have of the user's last choice, the simple heuristic
> might be: "User's last choice, unless the text doesn't have that separator
> (or even - unless the first line doesn't have), in which case the first
> separator which appears on the first line."
With that we're back to "what is considered to be a separator".
a) the arbitrary comma encountered in a sentence?
b) or only if there's not a blank following?
c) if the first comma is at the end of line, does it constitute a separator?

I'd say no to a) and yes to b) and c).
Can that be generalized also for Tab and semicolon? Probably yes.
Can it for Space? No, because it would split a sentence into fields.
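
One possible reading of rules a) to c), sketched in Python. Here "blank" is taken to mean a single space character, and the names `SEPARATORS`, `count_plausible` and `plausible_separators` are hypothetical; this is an illustration, not the existing import code.

```python
import re

# Space is left out on purpose: it would split sentences into fields.
SEPARATORS = ("\t", ",", ";")


def count_plausible(line, sep):
    """Count occurrences of sep that are not followed by a space.

    An occurrence at the end of the line has nothing after it, so it counts.
    """
    pattern = re.escape(sep) + r"(?! )"
    return len(re.findall(pattern, line))


def plausible_separators(line):
    """Return the candidates with at least one plausible occurrence."""
    return [sep for sep in SEPARATORS if count_plausible(line, sep) > 0]


print(plausible_separators("Hello, world"))
print(plausible_separators("a,b;c"))
```

Under this reading the comma in "Hello, world" is rejected because a space follows it, while a trailing comma as in "a,b," is still counted.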