Bug 120574 - TXT file encoding is lost when saving
Summary: TXT file encoding is lost when saving
Status: RESOLVED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer (show other bugs)
Version:
(earliest affected)
Inherited From OOo
Hardware: All All
: medium normal
Assignee: tobias
URL:
Whiteboard: target:7.2.0
Keywords: difficultyMedium, easyHack, skillCpp
: 132426 142956 (view as bug list)
Depends on:
Blocks: Save-Text
  Show dependency treegraph
 
Reported: 2018-10-13 20:41 UTC by Mike Kaganski
Modified: 2022-06-07 14:55 UTC (History)
8 users (show)

See Also:
Crash report or crash signature:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Mike Kaganski 2018-10-13 20:41:39 UTC
When a plain text file is opened (no matter if "Text" or "Text - choose encoding" was used), the filter settings (charset, Line End, is BOM present: see SwAsciiOptions) are forgot, and last used for export are shown if "Edit filter settings" is requested. In case, say, I open a UTF-8 with BOM, I'll get a Windows-1251 Cyrillic file upon save.

The settings should persist with the open file, to not require user to re-enter them on save (a user might have no idea what they were if auto-detection was used).
Comment 1 m_a_riosv 2018-10-13 22:53:30 UTC
You are right the same type should be used for save.
Comment 2 librebug 2018-10-14 08:16:49 UTC
And by default for "save as". But here the user should have the option to change settings.

As has been remarked elsewhere -- for example 82254, "FILESAVE: UTF-8 BOM removed from CSV when saving file" -- this is how LO should act for all documents it can read and save in plain text form.  For predictability, consistency, accuracy, and usability.
Comment 3 Ming Hua 2020-05-02 07:40:30 UTC
*** Bug 132426 has been marked as a duplicate of this bug. ***
Comment 4 Mike Kaganski 2021-01-22 08:17:59 UTC
The filter should store the settings in medium attached to the document. The proper place to store this is likely AsciiReader::Read (sw/source/filter/ascii/parasc.cxx), which has m_pMedium set in SwReader::Read, and allows to call its GetItemSet()->Put() to modify the data based on parser data.

Using this data likely should happen in SwASCWriter::SetupFilterOptions (which should be implemented, and which is called from Writer::Write, where the medium is available).

The easy hack implies that a unit test is also implemented, that tests that the detected non-default settings (e.g., UTF-16BE with BOM with CR line endings, etc.) are retained on save-and-reload. The unit test should be in sw/qa/extras/txtexport/txtexport.cxx, and should include reading the exported file (similar to what TxtExportTest::readExportedFile does), testing BOM and data bytes.
Comment 5 Timur 2021-05-13 07:43:20 UTC
Could bug 64603 be marked duplicate?
Comment 6 Mike Kaganski 2021-05-13 07:50:26 UTC
(In reply to Timur from comment #5)

No, that one is (in the absence of addition information) about a *new* document, where there's no information from *import* stage (which is what this one is about).
Comment 7 tobias 2021-05-29 10:39:05 UTC Comment hidden (obsolete)
Comment 8 Buovjaga 2021-05-29 10:42:17 UTC
(In reply to tobias from comment #7)
> I'm new to LibreOffice development. I already had a look at this issue. May
> I assign this bug to me?

Yes.
Comment 9 tobias 2021-06-04 14:33:16 UTC
(In reply to Mike Kaganski from comment #4)
> The filter should store the settings in medium attached to the document. The
> proper place to store this is likely AsciiReader::Read
> (sw/source/filter/ascii/parasc.cxx), which has m_pMedium set in
> SwReader::Read, and allows to call its GetItemSet()->Put() to modify the
> data based on parser data.
> 
> Using this data likely should happen in SwASCWriter::SetupFilterOptions
> (which should be implemented, and which is called from Writer::Write, where
> the medium is available).
> 
> The easy hack implies that a unit test is also implemented, that tests that
> the detected non-default settings (e.g., UTF-16BE with BOM with CR line
> endings, etc.) are retained on save-and-reload. The unit test should be in
> sw/qa/extras/txtexport/txtexport.cxx, and should include reading the
> exported file (similar to what TxtExportTest::readExportedFile does),
> testing BOM and data bytes.

When parsing UTF8 files it seems the auto detection always sets the BOM flag, even there is no BOM in the file. As a result opening a UTF8 file will produce  a UTF8BOM file  after saving. Is this maybe another bug?
Comment 10 Mike Kaganski 2021-06-04 14:57:37 UTC
(In reply to tobias from comment #9)
> When parsing UTF8 files it seems the auto detection always sets the BOM
> flag, even there is no BOM in the file. As a result opening a UTF8 file will
> produce  a UTF8BOM file  after saving. Is this maybe another bug?

See https://gerrit.libreoffice.org/c/core/+/50388, which had implemented BOM in the options. It might be a bug indeed, that BOM in UTF-8 files is not reflected - let's keep it to a separate issue, if you think it's appropriate (please create one then).
Comment 11 Commit Notification 2021-06-04 21:47:29 UTC
tobias committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/cb490979ac238011efa27e0fb18fe62c13329d1f

tdf#120574 Store Ascii Options for Later Saving

It will be available in 7.2.0.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 12 Mike Kaganski 2021-06-20 16:02:43 UTC
*** Bug 142956 has been marked as a duplicate of this bug. ***