138472 – Calc enhancement: ignore rows when importing CSV

Bug 138472 - Calc enhancement: ignore rows when importing CSV

Summary: Calc enhancement: ignore rows when importing CSV

Status:	RESOLVED WONTFIX

Alias:	None

Product:	LibreOffice
Classification:	Unclassified
Component:	Calc (show other bugs)
Version: (earliest affected)	3.3.0 release
Hardware:	All All

Importance:	medium enhancement
Assignee:	Not Assigned

URL:
Whiteboard:
Keywords:	needsDevEval

Depends on:
Blocks:	CSV-Dialog Data-Provider
	Show dependency tree / graph

Reported:	2020-11-24 22:46 UTC by J22Gim
Modified:	2022-05-05 06:41 UTC (History)
CC List:	7 users (show)

See Also:
Crash report or crash signature:

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description J22Gim 2020-11-24 22:46:46 UTC

Description:
Many times we get data files as CVS from 3rd parties. These files often have comments and explanations which are not relevant to eg making plots.
It would be great to have an option in the CVS import dialog where the user can ignore some rows. I am aware that you can "ignore" the first x rows, however if you want to automate this task you would need to know X for each and all files to be imported.

As comments rows start with different character in different uses or data sources, it would be convenient to be able to chose that character.

So I propose to include an option in this dialog which let the user 
"ignore rows starting with [] character" where [] is a text box.


Steps to Reproduce:
Open CSV in Calc. 
Lets imagine the file has a number of rows (which you do not know in advance) that are comments and you don't need them to work with Calc. Each line of the comments starts with "#"


Actual Results:
You get a dialog where several options are presented, eg the character which separates the values (eg    ","   ";"    etc).
But you can not discard the comments rows. So you have to load the whole file, then manually delete the rows.


Expected Results:
It would be great if there was the option in the import dialog, something like 
a click box to activate the option "Ignore the rows starting with []" where [] is a text box where the user would place the "comment character" ("#" in this case)


Reproducible: Always


User Profile Reset: No



Additional Info:
Additionally, this could be useful to ignore empty lines, for example writing "" (empty string) into the "ignore box".

Comment 1 J22Gim 2020-11-24 22:47:38 UTC

Changed to "enhancement"

Comment 2 V Stuart Foote 2020-11-25 15:35:01 UTC

Reasonable enhancement. 

Suppress import of a row matching some string.

Imagine it could best be accomplished within the Import Text dialog using ICU regex. 

Perhaps in Other Options  area, a Field to enter a regex string and Check-box to enable--then exclude the "row" from text column parsing.

Comment 3 m_a_riosv 2020-11-26 09:24:39 UTC

With a few steps can be done after import, with more flexibility.

On the sheet, e.g. select the column with the text to find, [Ctrl+H]"Find & Replace", search for the string(s), Find All, click on the sheet window title (to retain selected founded cells) and [Ctrl-] to delete all rows with founded cells.

Comment 4 V Stuart Foote 2020-11-26 16:04:23 UTC

(In reply to m.a.riosv from comment #3)
> With a few steps can be done after import, with more flexibility.
> 
> On the sheet, e.g. select the column with the text to find, [Ctrl+H]"Find &
> Replace", search for the string(s), Find All, click on the sheet window
> title (to retain selected founded cells) and [Ctrl-] to delete all rows with
> founded cells.

Yes true except, pre-filtering the Text Import dialog could eliminate unwanted stanzas/rows from the source CSV before they are parsed against the field separators--both improving the accuracy of the data import, and avoiding need to clean up data after the import.

Comment 5 m_a_riosv 2020-11-26 16:28:19 UTC

(In reply to V Stuart Foote from comment #4)
> ...
> Yes true except, pre-filtering the Text Import dialog could eliminate
> unwanted stanzas/rows from the source CSV before they are parsed against the
> field separators--both improving the accuracy of the data import, and
> avoiding need to clean up data after the import.

Maybe, but I don't think pre-filtering is so easy to implement, with that kind of pseudo-csv, e.g. difficult to know on where column the searched string is, or not all the data to eliminate with searched string it's in the same row.

I'm not against, but IMO there are a lot of better things to get developers attention.

Comment 6 Heiko Tietze 2020-12-03 12:41:20 UTC

I fully agree with Miguel. But since we have a dropdown "Column Type" where "Hide" is an option, we could easily add "Delete" to the list. This wouldn't affect the usability of the import dialog. But still, it's hard to imagine that users click through all columns for fine-tuning the import. Maybe the better solution is a special extension or some kind of raw data preparation before the import.

Comment 7 V Stuart Foote 2020-12-03 14:42:33 UTC

(In reply to Heiko Tietze from comment #6)
> ... since we have a dropdown "Column Type" where
> "Hide" is an option, we could easily add "Delete" to the list. This wouldn't
> affect the usability of the import dialog. But still, it's hard to imagine
> that users click through all columns for fine-tuning the import.
> ...

No this would not be a columnar filtering. Here we are seeking to avoid parsing *entire* rows from the sourced CSV--think of comments/meta data surrounding the CSV. 

If the row or "record" matches the search the entire row would be excluded from the import, and not be parsed for columnar fielding.

Comment 8 Heiko Tietze 2020-12-03 15:11:25 UTC

(In reply to V Stuart Foote from comment #7)
> No this would not be a columnar filtering. Here we are seeking to avoid
> parsing entire _rows_ ...

Ack, my mistake. But still, filtering 1..10, 25, 735 or the like is very special and better done internally. Or per extension.

Comment 9 J22Gim 2020-12-03 19:57:40 UTC

May I mention again that the original point of this is to avoid loading rows with no data. This happens when the beginning (a given amount of rows) have information about how data was stored, collected, calculated, etc. but it is mostly metadata.
Here its an example of such file:
https://www.ncei.noaa.gov/pub/data/paleo/paleolimnology/northamerica/canada/ellesmere/sawtooth2020.txt

So the original idea was to define a character or string which would be encountered at the *beginning* of a row, and thus avoid loading that row. Additionally, if that string is "" (ie empty string), it would be a way to avoid loading empty files.

I am not sure (but also not against) about searching that string in anywhere in the row. I can't imagine the usefulness of that, but if it is the same work then why not.

Just wanted to reinforce the idea of the original post.
Thanks for this wonderful software!!

PD. CVS-related support will be increasingly important, as it is the preferred format to work with Python, R, and many others. And Python (and many others) already have libraries to load and save as XLSX, so LibreOffice should not stay behind!

Comment 10 J22Gim 2020-12-03 19:59:32 UTC

In the example file above, I would define the character "#" and thus all the rows starting with "#" would be ignored, and the file would contain only the data (eg for plotting, analysis, etc.)

Comment 11 Mike Kaganski 2022-05-05 04:04:19 UTC

IMO a clear WONTFIX. It would be something absolutely non-standard, implementing a very specific use case, but not helping at infinite number of alternative "filtering" scenarios. In case of "automation", you need a custom macro or the like.

Comment 12 Heiko Tietze 2022-05-05 06:10:25 UTC

Let's resolve it WF then. Meanwhile we have the Data Provider [1] that, while still in experimental state, should be already capable of filtering as requested. So ultimately WFM.

Meta ticket is bug 120219.

[1] https://libreoffice-dataproviders.blogspot.com/2021/08/data-providers.html
[2] https://help.libreoffice.org/6.1/en-US/text/scalc/01/data_provider.html