149462 – Treat pure ASCII files (codes 0-127) as UTF-8 without BOM on import by default

Status:	NEW

Alias:	None

Product:	LibreOffice
Classification:	Unclassified
Component:	Writer
Version: (earliest affected)	7.3.0.3 release
Hardware:	All Windows (All)

Importance:	medium enhancement
Assignee:	Not Assigned

URL:
Whiteboard:
Keywords:	difficultyBeginner, easyHack, skillCpp

Depends on:
Blocks:	Dev-related

Reported:	2022-06-05 16:15 UTC by Truss
Modified:	2025-02-26 05:27 UTC
CC List:	4 users

See Also:	120574 60145
Crash report or crash signature:

Attachments
Screen Capture 01 (Correct) (4.80 MB, video/mp4) 2022-06-05 16:16 UTC, Truss
Screen Capture 02 (Incorrect) (5.13 MB, video/mp4) 2022-06-05 16:17 UTC, Truss

Description Truss 2022-06-05 16:15:25 UTC

Description:

This is a follow up to the below issue (which was reported fixed in 7.3.0):
https://bugs.documentfoundation.org/show_bug.cgi?id=142956

All plain text files (*.txt) I use are encoded in UTF-8 (without BOM) format – therefore using the same file across multiple different applications is ordinarily seamless as utf8NoBOM is pretty standard now.  However, if I then use LibreOffice to check spelling of a plain text document for example, it creates an issue because when saving the file (CTRL + S), LibreOffice can change it to ANSI encoding, instead of keeping it as UTF-8 (without BOM).

If the plain text file contains only printable ASCII characters (character codes 32-126), LibreOffice 7.3 correctly saves the file as UTF-8 (without BOM).  However, if any extended ASCII characters (character codes 128-255 in the ANSI code page, e.g. € … “ ” – —) are added and the file is saved, LibreOffice changes the document to ANSI.  I don't want this, because every other application expects a document encoded in UTF-8 (without BOM); some of them now show invalid characters, and the behaviour is inconsistent between applications.
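To illustrate why the re-saved file looks broken elsewhere, here is a minimal sketch comparing the bytes the example characters get in UTF-8 and in an 8-bit code page. It assumes that "ANSI" on this en-GB Windows system means Windows-1252. `encodeUtf8` and `encodeCp1252` are hypothetical helper names, not LibreOffice functions, and the Windows-1252 table covers only ASCII plus the six characters listed above.

```cpp
#include <cstdint>
#include <vector>

// Encode one Unicode code point as UTF-8 bytes.
// Sketch only: surrogates and values above U+10FFFF are not rejected.
std::vector<std::uint8_t> encodeUtf8(char32_t cp)
{
    std::vector<std::uint8_t> out;
    auto put = [&out](char32_t v) { out.push_back(static_cast<std::uint8_t>(v)); };
    if (cp < 0x80)
    {
        put(cp); // ASCII: one byte, identical in UTF-8 and Windows-1252
    }
    else if (cp < 0x800)
    {
        put(0xC0 | (cp >> 6));
        put(0x80 | (cp & 0x3F));
    }
    else if (cp < 0x10000)
    {
        put(0xE0 | (cp >> 12));
        put(0x80 | ((cp >> 6) & 0x3F));
        put(0x80 | (cp & 0x3F));
    }
    else
    {
        put(0xF0 | (cp >> 18));
        put(0x80 | ((cp >> 12) & 0x3F));
        put(0x80 | ((cp >> 6) & 0x3F));
        put(0x80 | (cp & 0x3F));
    }
    return out;
}

// Partial Windows-1252 mapping (assumption: this is the "ANSI" code page here).
// Returns the single byte, or -1 if the character is not in this small table.
int encodeCp1252(char32_t cp)
{
    if (cp < 0x80)
        return static_cast<int>(cp);
    switch (cp)
    {
        case 0x20AC: return 0x80; // EURO SIGN
        case 0x2026: return 0x85; // HORIZONTAL ELLIPSIS
        case 0x201C: return 0x93; // LEFT DOUBLE QUOTATION MARK
        case 0x201D: return 0x94; // RIGHT DOUBLE QUOTATION MARK
        case 0x2013: return 0x96; // EN DASH
        case 0x2014: return 0x97; // EM DASH
        default:     return -1;
    }
}
```

For the euro sign (U+20AC), `encodeUtf8` yields the three bytes E2 82 AC, while `encodeCp1252` yields the single byte 80. An application that reads the re-saved file as UTF-8 meets a lone 0x80 byte, which is not a valid UTF-8 sequence.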

I want to always save plain text files in UTF-8 (without BOM).  Saving the file by going to [File > Save as > Text - Choose Encoding] is slow, impractical and relies on remembering to do it.  I think LibreOffice needs a setting in [Tools > Options > Load/Save > General] that allows users to set the default encoding they want to use when saving plain text files – with the default setting being UTF-8 (without BOM).


Steps to Reproduce:

Steps to Reproduce:

Working Correctly:

1) Create a new plain text document in a text editor, such as Windows Notepad or Visual Studio Code.

2) Add the below text.

Spell check:  Lores sump dolor sit meat, emus no duo, obit verger fed an. Fabulous porticoes core rum pit nu tied, in enc more commode mandamus. Eli tar principle complemented ea is.

3) Save the file with UTF-8 (without BOM) encoding – which is the default in Windows Notepad and Visual Studio Code.

4) Open the file in LibreOffice, modify the document using only printable ASCII characters (e.g. a-z A-Z 0-9), then save the file [CTRL + S] or [File > Save].

5) Open the file in Windows Notepad.  The encoding remains as UTF-8 (without BOM), which is correct.

6) Open the file in Visual Studio Code.  The text displays correctly.

7) Open the file in Notepad++.  The text displays correctly.


Not Working Correctly:

1) Create a new plain text document in a text editor, such as Windows Notepad or Visual Studio Code.

2) Add the below text.

Spell check:  Lores sump dolor sit meat, emus no duo, obit verger fed an. Fabulous porticoes core rum pit nu tied, in enc more commode mandamus. Eli tar principle complemented ea is.

3) Save the file with UTF-8 (without BOM) encoding, which is the default in Windows Notepad and Visual Studio Code.

4) Open the file in LibreOffice, modify the document by adding extended ASCII characters (e.g. € … “ ” – —), then save the file [CTRL + S] or [File > Save].

5) Open the file in Windows Notepad.  The encoding has been changed from UTF-8 (without BOM), to ANSI.

6) Open the file in Visual Studio Code.  The text shows invalid characters if VSCode is set to use UTF-8 (without BOM), as the file is now ANSI.

7) Open the file in Notepad++.  The encoding has been changed from UTF-8 (without BOM), to ANSI.


Actual Results:
LibreOffice 7.3 changes the encoding from UTF-8 (without BOM) to ANSI when saving, after extended ASCII characters have been added to the document.

Expected Results:
LibreOffice should leave the encoding as UTF-8 (without BOM).  Any new LibreOffice documents should also be saved as UTF-8 (without BOM) when saving as *.txt.


Reproducible: Always


User Profile Reset: Yes



Additional Info:
ASCII printable characters:
https://www.ascii-code.com/

Video:
See the two attached MP4 videos demonstrating the issue.

Info:
Version: 7.3.3.2 (x64) / LibreOffice Community
Build ID: d1d0ea68f081ee2800a922cac8f79445e4603348
CPU threads: 4; OS: Windows 10.0 Build 19044; UI render: Skia/Raster; VCL: win
Locale: en-GB (en_GB); UI: en-GB
Calc: threaded

Comment 1 Truss 2022-06-05 16:16:45 UTC

Created attachment 180581
Screen Capture 01 (Correct)

Comment 2 Truss 2022-06-05 16:17:37 UTC

Created attachment 180582
Screen Capture 02 (Incorrect)

Comment 3 Timur 2022-06-07 14:54:27 UTC

Repro 7.4+. New.

Comment 4 Mike Kaganski 2022-06-07 15:32:09 UTC

I do not quite see how this is a bug.

Any file without a BOM that contains only bytes in the range 0-127 is *at the same time* a valid UTF-8 file *and* a valid ASCII file. There is nothing in such a file that would allow detecting it as UTF-8. Hence, the "current Windows codepage" detection would indeed trigger, and the file would be opened using the 8-bit system encoding. Since version 7.2 (bug 120574), this detection is correctly remembered, and it is correctly used when saving. Whether the original detection was what the OP expected is a different story.
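The point about ambiguity can be checked with a small UTF-8 validator. This is a minimal sketch, not code from LibreOffice or ICU; `isValidUtf8` is a hypothetical name. A buffer of bytes below 0x80 always passes, so validity alone cannot tell UTF-8 apart from an 8-bit code page for such a file.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns true if the byte sequence is well-formed UTF-8.
// Rejects stray continuation bytes, truncated sequences, overlong forms,
// surrogates and code points above U+10FFFF.
bool isValidUtf8(const std::vector<std::uint8_t>& bytes)
{
    // Smallest code point that legitimately needs a sequence of each length.
    static const std::uint32_t minCodePoint[] = { 0, 0, 0x80, 0x800, 0x10000 };

    std::size_t i = 0;
    while (i < bytes.size())
    {
        const std::uint8_t lead = bytes[i];
        std::size_t len = 0;
        std::uint32_t cp = 0;

        if (lead < 0x80)                { len = 1; cp = lead; }        // ASCII
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
        else return false; // continuation byte or invalid lead byte

        if (i + len > bytes.size())
            return false; // sequence cut off at the end of the buffer

        for (std::size_t k = 1; k < len; ++k)
        {
            const std::uint8_t cont = bytes[i + k];
            if ((cont & 0xC0) != 0x80)
                return false; // not a continuation byte
            cp = (cp << 6) | (cont & 0x3F);
        }

        if (len > 1 && cp < minCodePoint[len])
            return false; // overlong encoding
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false; // outside Unicode range or a surrogate

        i += len;
    }
    return true;
}
```

The bytes 61 62 63 ("abc") are accepted, and they are also a valid ASCII file. A lone 0x80 byte, which is how an 8-bit Windows code page such as Windows-1252 stores the euro sign, is rejected.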

OTOH, if you opened it using "Text - choose encoding" filter, and defined UTF-8 on opening, it must save the extended characters on save.

So the possible enhancement would be to treat pure ASCII files (only the first 128 Unicode code points, U+0000 to U+007F) as UTF-8. This is reasonable, and in line with e.g. the resolution of tdf#148413.

Comment 5 Mike Kaganski 2022-06-08 05:53:27 UTC

Code pointer:
The fix could be implemented in the same place where bug 60145 was fixed: SwIoSystem::IsDetectableText in sw/source/filter/basflt/iodetect.cxx. The code checking return from ucsdet_getName should also treat pure ASCII (whatever is the specific string corresponding to that case) as UTF-8.

Unit tests should be created - again, see the fix for tdf#60145.

Comment 6 RezzyA 2022-10-14 02:15:48 UTC

I am starting work on this bug.

Comment 7 RezzyA 2022-10-22 13:45:56 UTC

I am currently reading through iodetect.cxx (in sw/source/filter/basflt/).

Comment 8 RezzyA 2022-11-01 12:33:18 UTC

I am still reading through iodetect.cxx (in sw/source/filter/basflt/) and the header files associated with it.

Comment 9 Gabriel Masei 2024-11-20 11:46:49 UTC

Is anyone still working on this? If not, can I change the ownership?

Comment 10 Mike Kaganski 2024-11-20 12:06:04 UTC

(In reply to Gabriel Masei from comment #9)

Please do!

Comment 11 Gabriel Masei 2024-11-20 12:23:03 UTC

(In reply to Mike Kaganski from comment #10)
> (In reply to Gabriel Masei from comment #9)
> 
> Please do!

Thanks! Done!

Comment 12 radutaalexandru 2024-11-29 16:57:34 UTC

(In reply to Mike Kaganski from comment #5)
> Code pointer:
> The fix could be implemented in the same place where bug 60145 was fixed:
> SwIoSystem::IsDetectableText in sw/source/filter/basflt/iodetect.cxx. The
> code checking return from ucsdet_getName should also treat pure ASCII
> (whatever is the specific string corresponding to that case) as UTF-8.
> 
> Unit tests should be created - again, see the fix for tdf#60145.

It seems that ucsdet_getName doesn't have a return value for *pure ASCII*. If we take a look here: https://unicode-org.github.io/icu/userguide/conversion/detection.html , at the bottom of the page we'll see the list of possible return values. The closest values to *pure ASCII* could be anything from ISO-8859-[1-9] or windows-125[0-6]. However, such a return value doesn't guarantee that the characters in the input buffer are in the 0-127 range. There could be characters from the 128-255 range, and in this case we should not return UTF-8.

I tested locally with a simple file that contains the following text: a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z. And it returned ISO-8859-2. Not even ISO-8859-1. My guess is that it takes into account information about my location/region. The conclusion is that we cannot rely on the detection library to detect *pure ASCII*.

I propose the following solution: parse the buffer and check that all characters are in 0-127 range. But we should do this only if ucsdet_getName returns a value which is not from UTF family. Or only if it returns a value from ISO-8859-[1-9] or windows-125[0-6] families.
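A minimal sketch of that idea, for discussion only: it is not the actual patch, and the names `isPureAscii`, `isUtfFamily` and `chooseCharset` are hypothetical. It takes the detector's charset name as a plain string instead of calling ICU, takes the buffer as a std::string for brevity, and assumes that UTF family names start with "UTF" (as in "UTF-8" or "UTF-16LE").

```cpp
#include <string>

// True if every byte in the buffer is in the 0-127 range.
bool isPureAscii(const std::string& buffer)
{
    for (char c : buffer)
    {
        if (static_cast<unsigned char>(c) > 0x7F)
            return false;
    }
    return true;
}

// Assumption: names of the UTF family begin with "UTF".
bool isUtfFamily(const std::string& charsetName)
{
    return charsetName.compare(0, 3, "UTF") == 0;
}

// Override the detected charset with UTF-8 when the detector reported a
// non-UTF charset but the buffer holds nothing outside the ASCII range.
std::string chooseCharset(const std::string& detectedName, const std::string& buffer)
{
    if (!isUtfFamily(detectedName) && isPureAscii(buffer))
        return "UTF-8";
    return detectedName;
}
```

With this sketch, a detected name of "ISO-8859-2" for the buffer "a,b,c" becomes "UTF-8", while the same detected name for a buffer containing the byte 0xE9 stays "ISO-8859-2". Skipping the scan when the name is already in the UTF family avoids touching UTF-16 or UTF-32 detections.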

What do you think?

Comment 13 radutaalexandru 2024-12-20 20:44:49 UTC

(In reply to radutaalexandru from comment #12)

I provided a patch using the solution above, including a unit test.
https://gerrit.libreoffice.org/c/core/+/178449