Bug 166208 - Upgrading to v.25.2.2+, the previous selection of encoding in text import dialog gets shifted to the previous entry
Status: RESOLVED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Calc
Version: 25.2.2.2 release (earliest affected)
Hardware: All All
Importance: medium normal
Assignee: Mike Kaganski
URL:
Whiteboard: target:25.8.0 target:25.2.5
Keywords:
Duplicates: 164659 166095 166211 166238 166373 167181
Depends on:
Blocks: CSV-Import
Reported: 2025-04-16 15:16 UTC by MartinP
Modified: 2025-06-27 11:11 UTC (History)

See Also:
Crash report or crash signature:


Attachments
Text File (20 bytes, text/csv)
2025-04-17 09:08 UTC, MartinP
TEXT INPUT screen w/ UTF-8 (65.17 KB, image/png)
2025-04-18 18:53 UTC, dpkesling
TEXT OUTPUT screen (29.94 KB, image/png)
2025-04-18 18:55 UTC, dpkesling

Description MartinP 2025-04-16 15:16:28 UTC
In Calc I open a csv file which has these values:

a,b,c,test,-10,11,23,-24

I open this file in calc, make no changes and save still as csv.
The file now has these values:

a,b,c,test,+AC0-10,11,23,+AC0-24
Comment 1 Mike Kaganski 2025-04-16 15:43:28 UTC
This means, that for export, you used UTF-7 (not UTF-8!). Check your export filter settings.
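The diagnosis can be checked with a short Python sketch. It assumes only Python's built-in "utf-7" codec; the variable names are illustrative. It decodes the saved bytes from the description as UTF-7 and compares the result with the original line.

```python
# The two versions of the line from the description, as raw bytes.
original = b"a,b,c,test,-10,11,23,-24"
saved = b"a,b,c,test,+AC0-10,11,23,+AC0-24"

# Interpreting the saved bytes as UTF-7 recovers the original text,
# which is consistent with the file having been exported as UTF-7.
decoded = saved.decode("utf-7")
print(decoded)
print(decoded == original.decode("ascii"))
```

A reader that treats the saved file as UTF-8 or plain ASCII sees the "+AC0-" sequences literally, which matches the symptom in the report.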
Comment 2 MartinP 2025-04-16 21:45:09 UTC
The file is downloaded from a mainstream bank website. 
The file displays as expected in notepad++.
The csv displays as expected in libre calc. 
When saved, the file has these added characters. 
Is this expected behaviour?
Not a great UX.
Comment 3 Mike Kaganski 2025-04-16 22:11:44 UTC
Well, I finally found the bug that discusses the real problem here. It is bug 150836, that is *correctly* called "CSV save-mode is different from the one used for opening" - even though it only shows one aspect of it, specifically the "save formula" mode. Generally, the encoding used on opening the file should be also pre-selected when saving - and *if* it's not the case, it's a bug.

This looks very similar to bug 120574. But first, we need to be able to reproduce the problem. Can you provide a sample csv? (Yes, I mean a CSV file, not its text in comment 0: the file also carries its *encoding*, so we would know that exactly this file opens on your system, but that its encoding changes when you save it.) Then please provide the full information from Help->About. Maybe we could reproduce this after that.
Comment 4 MartinP 2025-04-17 09:08:21 UTC
Created attachment 200370 [details]
Text File

Instructions to create the csv text file:
1. Open a notepad, Notepad++ or any text editor really.
2. type some text and numbers separated by a comma. Remember to include some negative numbers!
3. Save as csv.

Instructions to recreate the "bug"
1. Now open the csv file created above in libre calc.
2. Save the file in libre calc.
3. Open the text file with a text editor.
4. What do you see?
Comment 5 Mike Kaganski 2025-04-17 09:51:30 UTC
(In reply to MartinP from comment #4)
> Instructions to recreate the "bug"
> 1. Now open the csv file created above in libre calc.
> 2. Save the file in libre calc.
> 3. Open the text file with a text editor.
> 4. What do you see?

I see that a newline appeared after the line. But the encoding is not reported as changed, nor are there any new characters.

So the question is how you came to have UTF-7 set in your CSV export settings.

Now please:

1. Open your csv file in libre calc.
2. File->Save As.
3. In the File Picker dialog, check the "Edit filter settings" checkbox, and press OK.
4. Confirm file format (Use Text CSV Format), if asked.
5. What do you see in the "Character set" field of the Export Text File dialog?

I assume you will see UTF-7 there; I also assume that you saw this dialog at least once in the past (e.g. when you created a new spreadsheet and saved it as CSV), and that you mistakenly chose UTF-7 there instead of UTF-8.

If I'm right, then the incorrect encoding ("character set") there is a user error. But the real problem here is that it doesn't pick the encoding from the value that you used in the Import dialog, but uses the "last used" value from another time (which is, again, bug 150836).
Comment 6 Mike Kaganski 2025-04-18 05:09:40 UTC
*** Bug 166211 has been marked as a duplicate of this bug. ***
Comment 7 dpkesling 2025-04-18 17:37:51 UTC
I still wonder how this error occurred. You attribute this to user error by my somehow specifying UTF-7, but I have some difficulty with that explanation (not the UTF-7 part).
My problem was observed on TWO different machines. The second machine was prepped by removing the entire LibreOffice 24 package via Windows control panel. A fresh download of Libre 25 was performed and installed using the TYPICAL option. TWO fresh copies of the suspect data file were downloaded and testing was performed without any other intervening steps... ESPECIALLY NOT performing any changes in OPTIONS.
Again, the "virgin" file was read correctly by my R program whereas the other, "Calc saved" file had the +AC0 encoding error.
Again, this happened on BOTH my machines, independent of one another.
I do not dispute your UTF-7 assessment... I just wonder how the same error could appear on BOTH machines?
Possible answers are:
1) I really am an incompetent dolt who managed to make the same mistake on two different machines, despite the fact that changing a code page seems like a pretty involved, intentional process not prone to accidental revision.
2) Somehow Microsoft Control Panel uninstall process for v24 of Libre (which was working just fine for me) corrupted some residual file that went untouched during the uninstall process. When v25 installed, it picked-up on that changed value.
3) Assuming that v25 went through the entire cycle from Unit test -> System test -> Release testing (which I'm sure it did), there is the possibility that someone JUST PRIOR to the final build of the .exe image for distribution had changed the code page for whatever reason and THAT is what went out the door.

Perhaps there are other scenarios which haven't immediately come to mind, but these 3 are certainly candidates. The fact that the error occurred on two different machines diminishes (though doesn't eliminate) the likelihood of Theory 1.
The additional fact that my processing simply uses Calc as an intermediary step rather than as a self-contained solution might be a situation not considered in your Release Testing Script/Plan.
I shall attempt to manually change the code page to see if that effects a change and success. I will report back if it doesn't.
Comment 8 dpkesling 2025-04-18 17:43:46 UTC
I will check the encoded value of the "virgin" file as it came straight via download from another DB program. This file was not created in Calc originally, simply downloaded.
Comment 9 Mike Kaganski 2025-04-18 18:21:03 UTC
(In reply to dpkesling from comment #7)

First of all: note that I don't make strong claims. I reset it to UNCONFIRMED, because I have no 100% evidence that it was a user error. If we find steps to reproduce on a clean machine, it would be a definite bug.

> My problem was observed on TWO different machines. The second machine was
> prepped by removing the entire LibreOffice 24 package via Windows control
> panel. A fresh download of Libre 25 was performed and installed using the
> TYPICAL option.

Note that removing the program via the control panel and installing a new version do nothing to the *user settings*: these aren't part of the installation set, but are created by the program on first run, and are not removed by uninstallation. Thus, what you described says *nothing* about the "cleanness" of the settings on that system. Of course, that also doesn't prove that the problem necessarily pre-existed there; and I would even consider it strange if you had made the same mistake on two different systems (in the imagined scenario that you also set the settings to UTF-7 before uninstalling the program)... Unless you cloned the settings somehow?
Comment 10 dpkesling 2025-04-18 18:51:17 UTC
Latest info...
I went to the "virgin" file and started to load it into Calc...
The text IMPORT screen (grab attached) had UTF-7 auto selected. The image shows the screen AFTER I changed it to UTF-8. I made adjustments to the file, went to save it and when I looked at the File Options per your suggestion, it showed some Western-ISO thing (also attached). I left that as is.... and the file processed just fine.

SO

The problem you identified as UTF-7 was there, but as an artifact of the TEXT IMPORT process. Bottom line: I am back in business w/ v25.
Final questions: Has Calc always spec'd UTF-8 or was this a recent change from UTF-7? And, does Calc autodetect the code page of the incoming file and set itself accordingly... or just run with the last format specified by the user?

For my part, I am gonna go upstream and ask my DB Admin just what format e's spitting these files at me.

Weird.
Comment 11 dpkesling 2025-04-18 18:53:46 UTC
Created attachment 200396 [details]
TEXT INPUT screen w/ UTF-8
Comment 12 dpkesling 2025-04-18 18:55:04 UTC
Created attachment 200397 [details]
TEXT OUTPUT screen
Comment 13 Bruce H 2025-04-28 17:10:25 UTC
This bug is similar to one that I logged around the same time (https://bugs.documentfoundation.org/show_bug.cgi?id=166238).

That one is also apparently a case of incorrect choice of character set.  In 166238, Calc auto-selected "UTF-7" by default (without my awareness) on import, which resulted in Calc mangling the content of my CSV file.

It is my contention that (1) Calc 25.2 is making an inappropriate choice in deciding what character set to use in the import (or export) dialog;  (2) the character set choice was NOT based on my prior import/export history, because I have never used UTF-7 before;  (3) most users don't know the difference between various character set encodings, and they are not knowledgeable enough to know that they have to explicitly choose UTF-8 among dozens of other choices; and (4) Calc just started defaulting to UTF-7 with version 25.x.  After I installed 25.2, the default on the import dialog changed.  IMO it should not default to UTF-7.
Comment 14 petrelharp 2025-05-12 19:54:58 UTC
I've just hit this also. Here's my experience: writing out files I was getting +AC0- and the like. I use libreoffice to do this regularly, and have never encountered this before. 

Finding this thread I verified that changing the "Character set" field of the Export Text File dialog to UTF-8 (from UTF-7) fixed the problem.

Then I created a new CSV file, opened it with libreoffice, and noticed that in the import dialog UTF-7 was selected. Ah-ha! Changing this to UTF-8 seems to be persistent; then other opening and saving of CSV files seems to work as expected.

I absolutely did not change this value, at least on purpose. I cannot completely guarantee that, for instance, I somehow toggled it by hitting 'tab' when focus was not on the window I thought it was. But, this seems very unlikely, and others are experiencing the same thing.

I've had other issues recently with defaults on reading CSVs. For a few weeks (after an update), when opening a CSV the preview in the "Text Import" dialog would look good; but clicking "OK" would produce a spreadsheet with all content jammed into one cell; trying again and selecting "Separated by" "comma" got me the right result. This selection was not persistent. This bug recently went away (with another update; now at 25.2.3.2).

It seems to me that something was corrupting or improperly reading previous defaults in the read CSV dialog. I don't know how I'd go about reproducing that, but will report if it happens again.
Comment 15 Mike Kaganski 2025-05-14 11:55:59 UTC
Oh wow, our ctor of ScImportAsciiDlg does some really strange thing (or, rather, we store something very odd in /org.openoffice.Office.Calc/Dialogs/CSVImport/CharSet). How come that we store there not an encoding code (rtl_TextEncoding, like RTL_TEXTENCODING_UTF8 = 76), but a *position in the drop-down list*???

This means, that when we add an element to the list, like "-Automatic-", the user who updated their version, will have the index off. A user who had UTF-8 there, will now have UTF-7.

BTW, a user who changed UI language will have the index off, too (because the alphabetical sorting will change).
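The index shift described in this comment can be reproduced with a small Python sketch. Everything in it is an assumption made for illustration: the list is a shortened, hypothetical stand-in for the real drop-down, and only the code 76 (RTL_TEXTENCODING_UTF8) is taken from the comment; the other numeric codes are placeholders.

```python
# Hypothetical, shortened version of the alphabetically sorted encoding list.
# Only 76 (RTL_TEXTENCODING_UTF8) comes from the comment above;
# the other numeric codes are placeholders for illustration.
OLD_LIST = [
    (1001, "Unicode (UTF-16)"),
    (1002, "Unicode (UTF-7)"),
    (76, "Unicode (UTF-8)"),
    (1003, "Western Europe (ISO-8859-1)"),
]
AUTOMATIC = (0, "-Automatic-")  # placeholder code for the new top entry
NEW_LIST = [AUTOMATIC] + OLD_LIST


def label_at(entries, index):
    """Look up an entry by its position in the list (the fragile approach)."""
    return entries[index][1]


def label_for_code(entries, code):
    """Look up an entry by its encoding code (independent of list order)."""
    return next(label for c, label in entries if c == code)


# Old behaviour: persist the position of the selected entry.
stored_index = [label for _, label in OLD_LIST].index("Unicode (UTF-8)")
print(label_at(OLD_LIST, stored_index))  # Unicode (UTF-8)
print(label_at(NEW_LIST, stored_index))  # Unicode (UTF-7)

# Alternative: persist the encoding code itself.
stored_code = 76
print(label_for_code(NEW_LIST, stored_code))  # Unicode (UTF-8)
```

The same stored position also points at a different entry whenever the list is re-sorted, for example after a UI language change, while the stored code keeps resolving to the same encoding.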
Comment 16 Mike Kaganski 2025-05-14 12:50:52 UTC
Eike: what do you think about comment 15? Since the introduction of the "Automatic" element at the top, every user's existing selection index is now off by one. Should we just accept that, and tell them "just re-select a correct item now, and wait for another shift in the future", or should we change what we store there, and have another breakage of the selection because of that; and maybe even a change of "API" (if we consider what we store in config to be an API)?
Comment 17 Eike Rathke 2025-05-21 20:19:01 UTC
Storing a position of a drop down list is obviously ill-implemented, for the exact two reasons you stated, also if for any reason we decided to sort the list differently.

Storing the rtl_TextEncoding instead seems like a good solution to me. Or rather additionally because we need to differentiate the two. I.e. newer versions would ignore (and not write) /CharSet and store /Encoding which older versions don't know, in case people share the user profile between different versions (not much advisable anyway, but..).
Comment 18 Ophir LOJKINE 2025-05-23 12:59:33 UTC Comment hidden (me-too)
Comment 19 Mike Kaganski 2025-06-04 05:39:22 UTC
*** Bug 166373 has been marked as a duplicate of this bug. ***
Comment 20 Mike Kaganski 2025-06-04 15:06:59 UTC
*** Bug 166095 has been marked as a duplicate of this bug. ***
Comment 21 Mike Kaganski 2025-06-04 19:31:39 UTC
*** Bug 166238 has been marked as a duplicate of this bug. ***
Comment 22 Mike Kaganski 2025-06-04 20:11:37 UTC
*** Bug 164659 has been marked as a duplicate of this bug. ***
Comment 23 Eyal Rozenberg 2025-06-04 21:14:31 UTC
Coming from the dupe (?) bug 164659 - can someone please explain what "AC0" means?
Comment 24 Mike Kaganski 2025-06-04 21:19:57 UTC
(In reply to Eyal Rozenberg from comment #23)

It's a code used in UTF-7 encoding, which became selected for users who previously had UTF-8 (next in the list) selected (the same way as, for you, Western Europe (DOS/OS2-865/Nordic) became selected when Western Europe (ISO-8859-1) was selected previously; but while the two encodings in your bug 164659 make no difference for ASCII-only text, UTF-7 encodes many ASCII characters into such =ACx- sequences). The change of selection happened on upgrade, for exactly the same reason (an element was added to the top of the list).
Comment 25 Mike Kaganski 2025-06-04 21:22:41 UTC
(In reply to Mike Kaganski from comment #24)
> into such =ACx- sequences

I meant, "into such +ACx- sequences".

For non-ASCII, it may start with other characters, e.g. +AKM-...

Ref: https://en.wikipedia.org/wiki/UTF-7
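Where these sequences come from can be sketched in a few lines of Python. This is only an illustration of the shifted form described in the UTF-7 specification, not LibreOffice's encoder: the character's UTF-16BE bytes are Base64-encoded, the "=" padding is dropped, and the result is wrapped in "+" and "-". The helper name `utf7_shift_sequence` is made up for this example.

```python
import base64


def utf7_shift_sequence(ch: str) -> str:
    """Return the UTF-7 shifted form of one character:
    '+', Base64 of its UTF-16BE bytes without '=' padding, then '-'."""
    b64 = base64.b64encode(ch.encode("utf-16-be")).decode("ascii")
    return "+" + b64.rstrip("=") + "-"


print(utf7_shift_sequence("-"))  # +AC0-
print(utf7_shift_sequence("£"))  # +AKM-
```

Which ASCII characters an encoder chooses to shift rather than write directly is an implementation choice; the sketch only shows what the shifted form of a given character looks like.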
Comment 26 Mike Kaganski 2025-06-07 07:55:00 UTC
So in https://gerrit.libreoffice.org/c/core/+/186240, I decided to implement Gabriel's initial idea (which unfortunately didn't work initially, but which, had it succeeded back then, would have prevented this from happening): the first time you open the text import dialog after upgrading to a version that has the upcoming patch, you will have "-Automatic-" selected as the active encoding.

1. This helps everyone who is upgrading from versions earlier than 25.2.2, because they will not have a shifted, wrong element selected. Yes, the newly selected value is also not what they had; but it's more meaningful; likely to get the correct encoding by itself; and also improves the discoverability of the "automatic encoding detection" feature.
2. For people who have already upgraded to 25.2.2+, but who have not noticed yet the shift of the selection, this may also be helpful - this would fix exactly this bug.
3. For people who have already upgraded to 25.2.2+, and then fixed their encoding selection - well, these people would have to check if autodetection works for them / re-select the encoding. Sorry; I realize that this is an annoyance, but realistically, there are many more people who are yet to come across this bug than who have already updated to the (still early adopter) 25.2 - and big thanks to all early adopters, exactly for helping find problems early, and make others' lives easier.

This also implements the proposal of Eike (comment 17), introducing a new "Encoding" setting, which would store the actual encoding enum value, instead of the problematic selection index. The new versions will simply ignore the old CharSet there, allowing new and old versions to use the same config, and not break each other's choice.
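The behaviour described here can be modelled roughly as follows. This is a minimal Python sketch under stated assumptions, not the actual patch: the configuration is represented as a plain dict, `AUTOMATIC` is a hypothetical sentinel for the "-Automatic-" entry, and the function names are invented. Only the key names "CharSet" and "Encoding" and the value 76 for UTF-8 come from the thread.

```python
AUTOMATIC = "automatic"  # hypothetical sentinel for the "-Automatic-" entry
RTL_TEXTENCODING_UTF8 = 76  # value quoted in comment 15


def initial_encoding(config: dict):
    """New versions read only 'Encoding'. A legacy 'CharSet' index is
    ignored, so a profile without 'Encoding' starts on Automatic."""
    return config.get("Encoding", AUTOMATIC)


def save_encoding(config: dict, encoding_code) -> dict:
    """Store the encoding code under 'Encoding' and leave 'CharSet'
    untouched, so an older version sharing the profile keeps its value."""
    updated = dict(config)
    updated["Encoding"] = encoding_code
    return updated


# Profile written by an older version: only the (possibly shifted) index exists.
legacy_profile = {"CharSet": 2}
print(initial_encoding(legacy_profile))

# After the user picks UTF-8 once, the choice is stored by code.
new_profile = save_encoding(legacy_profile, RTL_TEXTENCODING_UTF8)
print(initial_encoding(new_profile))
print(new_profile["CharSet"])
```

Keeping the two keys separate is what lets old and new versions share one user profile without overwriting each other's selection.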
Comment 27 Eyal Rozenberg 2025-06-07 08:43:06 UTC
(In reply to Mike Kaganski from comment #26)
> 3. For people who have already upgraded to 25.2.2+, and then fixed their
> encoding selection - well, these people would have to check if autodetection
> works for them / re-select the encoding. Sorry; I realize that this is an
> annoyance,

This is an acceptable annoyance IMHO - considering that the "annoyed" people are already aware they need to be careful about the encoding. They may be slightly perplexed by the switch to "Automatic"; but they _have_ installed a new version, so it's not entirely surprising. Plus, they will only be annoyed once.

> This also implements the proposal of Eike (comment 17), introducing a new
> "Encoding" setting, which would store the actual encoding enum value,
> instead of the problematic selection index.

Always a good idea. Perhaps also an inspiration for thinking about other dialogs in which we may be persisting selection indices rather than the values the selection indicates.
Comment 28 Commit Notification 2025-06-07 10:05:09 UTC
Mike Kaganski committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/e8027d47d516e74b0631d99c584bc7bb301a3efb

Related: tdf#166208 Avoid C-style array and fixed index madness

It will be available in 25.8.0.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 29 Commit Notification 2025-06-07 10:05:11 UTC
Mike Kaganski committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/58dc72c6f29a3ad4cbea09803d6c568b96f80c08

Related: tdf#166208 Drop bBeforeDetection

It will be available in 25.8.0.

Comment 30 Commit Notification 2025-06-07 10:06:14 UTC
Mike Kaganski committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/d16535788721238a423408ba59805f4bcacc4e12

[API CHANGE] tdf#166208: New "Encoding" property for text import / clipboard paste

It will be available in 25.8.0.

Comment 31 Commit Notification 2025-06-09 09:15:43 UTC
Mike Kaganski committed a patch related to this issue.
It has been pushed to "libreoffice-25-2":

https://git.libreoffice.org/core/commit/70ad1d0c86cffc567b05b903ee83844a189c0d26

Related: tdf#166208 Avoid C-style array and fixed index madness

It will be available in 25.2.5.

Comment 32 Commit Notification 2025-06-09 09:15:46 UTC
Mike Kaganski committed a patch related to this issue.
It has been pushed to "libreoffice-25-2":

https://git.libreoffice.org/core/commit/72f2474e085924f852b9cf64ae805dfaa28e7f08

Related: tdf#166208 Drop bBeforeDetection

It will be available in 25.2.5.

Comment 33 Mike Kaganski 2025-06-24 06:23:58 UTC
*** Bug 167181 has been marked as a duplicate of this bug. ***
Comment 34 Commit Notification 2025-06-27 11:11:05 UTC
Mike Kaganski committed a patch related to this issue.
It has been pushed to "libreoffice-25-2":

https://git.libreoffice.org/core/commit/9556677d1aa726fafa79c4b0f34a68213dfd7fe0

[API CHANGE] tdf#166208: New "Encoding" property for text import / clipboard paste

It will be available in 25.2.5.
