150714 – DATALOSS: saving a recovered CSV converts all non-Western characters to question marks


Status:	RESOLVED FIXED

Alias:	None

Product:	LibreOffice
Classification:	Unclassified
Component:	Calc
Version: (earliest affected)	unspecified
Hardware:	All Windows (All)

Importance:	medium normal
Assignee:	Mike Kaganski

URL:	https://forumooo.ru/index.php?topic=9330
Whiteboard:	target:7.5.0
Keywords:	dataLoss

Depends on:
Blocks:	AutoSave-AutoRecovery-Backup CSV

Reported:	2022-08-31 07:47 UTC by Mike Kaganski
Modified:	2022-09-01 05:54 UTC
CC List:	3 users

See Also:
Crash report or crash signature:

Attachments
An UTF-8-encoded CSV (37 bytes, text/csv) 2022-08-31 07:47 UTC, Mike Kaganski

Description Mike Kaganski 2022-08-31 07:47:34 UTC

Created attachment 182109
An UTF-8-encoded CSV

If saving of autorecovery information is enabled and a crash happens while editing a CSV, LibreOffice offers to recover the CSV on the next start. The recovery itself gives the correct data. However, saving the recovered document using File->Save overwrites the original CSV and loses all the non-Western characters. The loss only becomes apparent after a reload, at which point the non-Western data is unrecoverable: the autorecovery information has been deleted, and the original CSV has been overwritten.

Steps to reproduce:
0. Make sure that "Save Autorecovery information" is enabled under Options->Load/Save->General, and set to some small value (1 minute) for ease of reproduction.
1. Open the attached CSV, making sure to use UTF-8 encoding (it contains a string "テストabcабвÀ", which includes Japanese, English, Cyrillic, and extended Western characters).
2. Make some change to B2, e.g. replace "1" with "2".
3. Wait for automatic save.
4. Kill soffice.bin process.
5. Start LibreOffice, see it offers recovery for the CSV. Confirm the recovery. Note that no CSV import dialog is shown during the recovery.
6. See that the recovered document looks OK, having the correct string in A2, and "2" in B2.
7. Press Save toolbar button (or File->Save), confirm saving as CSV.
8. Close and reopen the file.

The text in A2 will be destroyed: both the Japanese and the Cyrillic characters turn into question marks. The Western character "À" is restored if ISO 8859-1 encoding is selected on import.
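
The observed result can be imitated outside LibreOffice. The following is only a minimal Python sketch, not LibreOffice code: it assumes the save step behaves like a write in a single-byte Western encoding that substitutes "?" for every character it cannot represent. The helper name save_and_reload and the one-row CSV layout are made up for illustration; the real attachment may be laid out differently.

```python
import csv
import os
import tempfile

# The string from the attached test file, as quoted in step 1.
ORIGINAL = "テストabcабвÀ"

def save_and_reload(text, encoding):
    """Write one CSV row in the given encoding, replacing unencodable
    characters with '?', then read the row back with the same encoding.
    (Hypothetical helper; it only imitates the symptom.)"""
    fd, path = tempfile.mkstemp(suffix=".csv")
    os.close(fd)
    try:
        with open(path, "w", encoding=encoding, errors="replace", newline="") as f:
            csv.writer(f).writerow([text, 2])
        with open(path, "r", encoding=encoding, newline="") as f:
            return next(csv.reader(f))
    finally:
        os.remove(path)

# Japanese and Cyrillic become '?', while "À" survives in ISO 8859-1.
print(save_and_reload(ORIGINAL, "iso-8859-1")[0])  # ???abc???À
# With UTF-8 nothing is lost.
print(save_and_reload(ORIGINAL, "utf-8")[0] == ORIGINAL)  # True
```

The question marks are written into the file as real "?" bytes, which is why no choice of import encoding can bring the original characters back.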

Note that if in step #7, you use Save As instead, and select "Edit filter settings" in the Save As dialog, the settings dialog would offer the last used encoding, not ISO 8859-1. When saving a new document to CSV, the filter settings dialog would appear even when "Edit filter settings" is unselected, so it looks like a problem specific to recovered files.

The origin of the problem seems to be that autorecovery stores files in the native ODS format (which is a good thing), but does not keep the original filter settings in the autorecovery data (see also bug 123877 comment 9). So opening the autorecovery ODS can't provide Calc with the original encoding, and some internal default is used instead. That default happens to be ISO 8859-1, not even the last-used encoding that is stored in the profile (silently using the last-used setting would also be incorrect anyway).

The proposal is to make UTF-8 the default encoding for recovered CSV files.
Another option could be to make autorecovered documents have some media flag set that would force filter settings dialog on save, the same way as when you save a new document to CSV.
Yet another option is to implement storing original CSV filter settings inside autorecovery ODS (most complex).
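
To illustrate why UTF-8 is a safe default for the first proposal, here is a small Python sketch (the helper name roundtrips is hypothetical and unrelated to the LibreOffice code base). It checks whether a string survives an encode/decode round trip in a given encoding.

```python
def roundtrips(text: str, encoding: str) -> bool:
    """Return True if text can be encoded and decoded without any loss."""
    try:
        return text.encode(encoding).decode(encoding) == text
    except UnicodeEncodeError:
        # At least one character has no representation in this encoding.
        return False

sample = "テストabcабвÀ"  # the string from the attached CSV
for enc in ("utf-8", "iso-8859-1", "cp1252", "cp1251"):
    print(enc, roundtrips(sample, enc))
```

Only UTF-8 passes for this sample, because UTF-8 can represent every Unicode character, while each single-byte code page covers only its own small repertoire.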

The issue appeared in https://forumooo.ru/index.php?topic=9330.

Comment 1 Eike Rathke 2022-08-31 10:06:51 UTC

Without looking at code I'd assume the system/thread encoding is used for recovery, which probably still happens to be some nasty ugly legacy code page on Windows and not even ISO 8859-1 but Windows-1252 or whatever the regional system settings offer. Or there's some default forcing ISO 8859-1 that might originate from legacy reasons.

Sounds sensible to default to UTF-8 instead for CSV recovery if both save/restore use it.

Comment 2 Mike Kaganski 2022-08-31 10:37:28 UTC

(In reply to Eike Rathke from comment #1)

I suspect hard-coded ISO 8859-1, because for me, the local system codepage is Windows-1251, which is indeed able to cover Cyrillic characters (but not extended Western ones).
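
This reasoning can be checked with a short Python sketch (the helper name survivors is made up for illustration, and "?" substitution is assumed as the fallback for unencodable characters). It shows what the test string would look like after a lossy save in each candidate code page.

```python
def survivors(text: str, encoding: str) -> str:
    """Encode with '?' substitution, then decode again, to show which
    characters a given single-byte code page can keep."""
    return text.encode(encoding, errors="replace").decode(encoding)

sample = "テストabcабвÀ"

# Windows-1251 would keep the Cyrillic letters and lose "À".
print(survivors(sample, "cp1251"))  # ???abcабв?
# Windows-1252 (like ISO 8859-1) loses the Cyrillic letters and keeps "À".
print(survivors(sample, "cp1252"))  # ???abc???À
```

The result seen in the bug (Cyrillic lost, "À" preserved) matches the Western code pages, not Windows-1251, so the system code page cannot be what was used here.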

Comment 3 Mike Kaganski 2022-08-31 10:43:07 UTC

ScDocShell::ConvertTo seems to use hard-coded RTL_TEXTENCODING_MS_1252

Comment 4 Eike Rathke 2022-08-31 11:05:01 UTC

That may be the culprit if it's actually hit for the recovery case. If we change that, ConvertFrom() also needs to be changed and an API change announced, as these are still (at least per the comments) the defaults for API use without any parameters.

Comment 5 Commit Notification 2022-08-31 21:14:38 UTC

Mike Kaganski committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/0fa5b4e03ed981cd79ac9af57e616714fc41b685

tdf#150714: [API-CHANGE] change CSV default encoding to UTF-8

It will be available in 7.5.0.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.