Bug 149462 - Treat pure ASCII files (codes 0-127) as UTF-8 without BOM on import by default
Summary: Treat pure ASCII files (codes 0-127) as UTF-8 without BOM on import by default
Status: NEW
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer
Version (earliest affected): 7.3.0.3 release
Hardware: All Windows (All)
Importance: medium enhancement
Assignee: Not Assigned
URL:
Whiteboard:
Keywords: difficultyBeginner, easyHack, skillCpp
Depends on:
Blocks: Dev-related
Reported: 2022-06-05 16:15 UTC by Truss
Modified: 2023-07-23 18:21 UTC (History)
CC List: 3 users

See Also:
Crash report or crash signature:


Attachments
Screen Capture 01 (Correct) (4.80 MB, video/mp4)
2022-06-05 16:16 UTC, Truss
Screen Capture 02 (Incorrect) (5.13 MB, video/mp4)
2022-06-05 16:17 UTC, Truss

Description Truss 2022-06-05 16:15:25 UTC
Description:

This is a follow-up to the issue below (which was reported fixed in 7.3.0):
https://bugs.documentfoundation.org/show_bug.cgi?id=142956

All plain text files (*.txt) I use are encoded in UTF-8 (without BOM) format – therefore using the same file across multiple different applications is ordinarily seamless as utf8NoBOM is pretty standard now.  However, if I then use LibreOffice to check spelling of a plain text document for example, it creates an issue because when saving the file (CTRL + S), LibreOffice can change it to ANSI encoding, instead of keeping it as UTF-8 (without BOM).

If the plain text file only contains printable ASCII characters (character codes 32-126), LibreOffice 7.3 correctly saves the file as UTF-8 (without BOM).  However, if any extended ASCII characters (character codes 128-255, e.g. € … “ ” – —) are added and the file is saved, LibreOffice changes the document to ANSI.  I don't want this, as every other application expects a document encoded in UTF-8 (without BOM): some applications now show invalid characters, and the behaviour is inconsistent.
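
To illustrate why the other applications show invalid characters, here is a minimal sketch of how the six example characters are stored in each encoding. It assumes that "ANSI" on the reporter's en-GB Windows system means Windows-1252; the function names encodeUtf8 and cp1252Byte are hypothetical and are not part of LibreOffice.

```cpp
#include <string>

// Encode a single Unicode code point as UTF-8.
// Sketch only: surrogates and out-of-range values are not rejected.
std::string encodeUtf8(char32_t cp)
{
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Windows-1252 byte for the six example characters from this report.
// Returns 0 for anything outside this small illustrative table.
unsigned char cp1252Byte(char32_t cp)
{
    switch (cp) {
        case 0x20AC: return 0x80; // €
        case 0x2026: return 0x85; // …
        case 0x201C: return 0x93; // “
        case 0x201D: return 0x94; // ”
        case 0x2013: return 0x96; // –
        case 0x2014: return 0x97; // —
        default:     return 0;
    }
}
```

Under that assumption, € is the single byte 0x80 in the ANSI file but the three bytes E2 82 AC in UTF-8. A lone 0x80 is a continuation byte with no lead byte, so a UTF-8 reader rejects it, which matches the invalid characters seen in the other applications.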

I want to always save plain text files in UTF-8 (without BOM).  Saving the file by going to [File > Save as > Text - Choose Encoding] is slow, impractical and relies on remembering to do it.  I think LibreOffice needs a setting in [Tools > Options > Load/Save > General] that allows users to set the default encoding they want to use when saving plain text files – with the default setting being UTF-8 (without BOM).


Steps to Reproduce:

Working Correctly:

1) Create a new plain text document in a text editor, such as Windows Notepad or Visual Studio Code.

2) Add the below text.

Spell check:  Lores sump dolor sit meat, emus no duo, obit verger fed an. Fabulous porticoes core rum pit nu tied, in enc more commode mandamus. Eli tar principle complemented ea is.

3) Save the file with UTF-8 (without BOM) encoding – which is the default in Windows Notepad and Visual Studio Code.

4) Open the file in LibreOffice, modify the document using only printable ASCII characters (e.g. a-z A-Z 0-9), then save the file [CTRL + S] or [File > Save].

5) Open the file in Windows Notepad.  The encoding remains as UTF-8 (without BOM), which is correct.

6) Open the file in Visual Studio Code.  The text displays correctly.

7) Open the file in Notepad++.  The text displays correctly.


Not Working Correctly:

1) Create a new plain text document in a text editor, such as Windows Notepad or Visual Studio Code.

2) Add the below text.

Spell check:  Lores sump dolor sit meat, emus no duo, obit verger fed an. Fabulous porticoes core rum pit nu tied, in enc more commode mandamus. Eli tar principle complemented ea is.

3) Save the file with UTF-8 (without BOM) encoding, which is the default in Windows Notepad and Visual Studio Code.

4) Open the file in LibreOffice, modify the document using extended ASCII characters (e.g. € … “ ” – —), then save the file [CTRL + S] or [File > Save].

5) Open the file in Windows Notepad.  The encoding has been changed from UTF-8 (without BOM) to ANSI.

6) Open the file in Visual Studio Code.  The text shows invalid characters if VSCode is set to use UTF-8 (without BOM), as the file is now ANSI.

7) Open the file in Notepad++.  The encoding has been changed from UTF-8 (without BOM) to ANSI.
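
The result of steps 5 to 7 can also be checked without relying on an editor's encoding indicator. The following is a rough sketch of a UTF-8 well-formedness check and a BOM check over the raw bytes of the saved file; isValidUtf8 and hasUtf8Bom are hypothetical helper names, and reading the file into a std::string is left to the caller.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// True if the buffer starts with the UTF-8 byte order mark EF BB BF.
bool hasUtf8Bom(const std::string& bytes)
{
    return bytes.size() >= 3 && bytes.compare(0, 3, "\xEF\xBB\xBF") == 0;
}

// True if the whole buffer is well-formed UTF-8.
// Rejects stray continuation bytes, truncated sequences, overlong forms,
// surrogates, and code points above U+10FFFF.
bool isValidUtf8(const std::string& bytes)
{
    const std::size_t n = bytes.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(bytes[i]);
        std::size_t extra = 0;   // number of continuation bytes expected
        std::uint32_t cp = 0;    // decoded code point
        if (b0 < 0x80)                     { extra = 0; cp = b0; }
        else if (b0 >= 0xC2 && b0 <= 0xDF) { extra = 1; cp = b0 & 0x1F; }
        else if (b0 >= 0xE0 && b0 <= 0xEF) { extra = 2; cp = b0 & 0x0F; }
        else if (b0 >= 0xF0 && b0 <= 0xF4) { extra = 3; cp = b0 & 0x07; }
        else return false;                 // invalid lead byte (e.g. a lone 0x80)
        if (extra > n - i - 1) return false; // sequence cut off at end of buffer
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char b = static_cast<unsigned char>(bytes[i + k]);
            if ((b & 0xC0) != 0x80) return false; // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        if (extra == 2 && cp < 0x800) return false;                       // overlong
        if (extra == 3 && (cp < 0x10000 || cp > 0x10FFFF)) return false;  // overlong or too large
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;                   // surrogate
        i += extra + 1;
    }
    return true;
}
```

A file saved as in the "Working Correctly" steps should pass isValidUtf8 and fail hasUtf8Bom. A file saved as ANSI with any of the example characters should fail isValidUtf8, because bytes such as 0x80 or 0x97 cannot start a UTF-8 sequence.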


Actual Results:
LibreOffice 7.3 changes the encoding from UTF-8 (without BOM) to ANSI when saving, after extended ASCII characters have been added to the document.

Expected Results:
LibreOffice should leave the encoding as UTF-8 (without BOM).  Any new LibreOffice documents should also be saved as UTF-8 (without BOM) when saving as *.txt.


Reproducible: Always


User Profile Reset: Yes



Additional Info:
ASCII printable characters:
https://www.ascii-code.com/

Video:
See the two attached MP4 videos demonstrating the issue.

Info:
Version: 7.3.3.2 (x64) / LibreOffice Community
Build ID: d1d0ea68f081ee2800a922cac8f79445e4603348
CPU threads: 4; OS: Windows 10.0 Build 19044; UI render: Skia/Raster; VCL: win
Locale: en-GB (en_GB); UI: en-GB
Calc: threaded
Comment 1 Truss 2022-06-05 16:16:45 UTC
Created attachment 180581
Screen Capture 01 (Correct)
Comment 2 Truss 2022-06-05 16:17:37 UTC
Created attachment 180582
Screen Capture 02 (Incorrect)
Comment 3 Timur 2022-06-07 14:54:27 UTC
Repro 7.4+. New.
Comment 4 Mike Kaganski 2022-06-07 15:32:09 UTC
I do not quite see how this is a bug.

Any file without a BOM that contains only bytes 32-127 is *at the same time* a valid UTF-8 file *and* a valid ASCII file. There is nothing in such a file that would allow detecting that it is UTF-8. Hence, the "current Windows codepage" detection would indeed trigger, and the file would be opened using the 8-bit system encoding. Since version 7.2 (bug 120574), this detection is correctly remembered and correctly used when saving. Whether the original detection was what OP expected is a different story.

OTOH, if you opened it using the "Text - Choose Encoding" filter and selected UTF-8 on opening, it must preserve the extended characters on save.

So the possible enhancement would be to treat pure ASCII files (containing only the first 128 Unicode code points, 0-127) as UTF-8. That is reasonable, and in line with e.g. the resolution of tdf#148413.
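
The ambiguity described in this comment comes down to a single property of the byte stream. As a minimal sketch (isPureAscii is a hypothetical name, not an existing LibreOffice function), a buffer is pure ASCII when no byte has its high bit set:

```cpp
#include <algorithm>
#include <string>

// True if every byte is in the range 0-127.
// Such a buffer decodes to the same text as ASCII, as UTF-8, and as any
// ASCII-compatible 8-bit code page, so the bytes alone cannot say which
// encoding the author intended.
bool isPureAscii(const std::string& bytes)
{
    return std::all_of(bytes.begin(), bytes.end(), [](char c) {
        return static_cast<unsigned char>(c) < 0x80;
    });
}
```

Because the content gives no evidence either way, choosing UTF-8 for such files is a policy decision rather than a detection result, which is why it is framed here as an enhancement.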
Comment 5 Mike Kaganski 2022-06-08 05:53:27 UTC
Code pointer:
The fix could be implemented in the same place where bug 60145 was fixed: SwIoSystem::IsDetectableText in sw/source/filter/basflt/iodetect.cxx. The code checking the return value of ucsdet_getName should also treat pure ASCII (whatever specific string corresponds to that case) as UTF-8.

Unit tests should be created - again, see the fix for tdf#60145.
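
The shape of such a change might look roughly like the sketch below. It is not the code in iodetect.cxx: effectiveCharsetName is a hypothetical helper, the detector's result is passed in as a plain string instead of calling ICU, and "ISO-8859-1" in the usage note is only an example of a name a detector might report for an all-ASCII buffer.

```cpp
#include <cstddef>
#include <string>

// Given the charset name reported by a detector and the raw buffer it
// inspected, return the name to actually use on import.
// Assumption: a buffer with no byte >= 0x80 (including an empty one) is
// treated as UTF-8 regardless of what the detector reported.
std::string effectiveCharsetName(const std::string& detectedName,
                                 const unsigned char* buf, std::size_t len)
{
    for (std::size_t i = 0; i < len; ++i) {
        if (buf[i] >= 0x80)
            return detectedName; // non-ASCII content: keep the detector's answer
    }
    return "UTF-8"; // pure ASCII: prefer UTF-8
}
```

For an input such as the bytes of "hello", a reported name of "ISO-8859-1" would be replaced by "UTF-8", while a buffer containing the byte 0xE9 would keep the reported name. A unit test along the lines of the tdf#60145 one could exercise both cases.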
Comment 6 RezzyA 2022-10-14 02:15:48 UTC
I am starting work on this bug.
Comment 7 RezzyA 2022-10-22 13:45:56 UTC
I am currently reading through iodetect.cxx (in sw/source/filter/basflt/).
Comment 8 RezzyA 2022-11-01 12:33:18 UTC
I am still reading through iodetect.cxx (in sw/source/filter/basflt/) and the header files associated with it.