Bug 146429 - Fallback to other character encodings detected by ICU above a certain confidence threshold
Status: ASSIGNED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: LibreOffice
Version: unspecified (earliest affected)
Hardware: All All
Importance: medium enhancement
Assignee: Daniel Thomas
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: Font-Fallback
Reported: 2021-12-26 22:53 UTC by Daniel Thomas
Modified: 2023-11-29 19:43 UTC (History)
CC List: 6 users

See Also:
Crash report or crash signature:


Attachments
File encoded in EUC-KR (1.36 KB, text/plain)
2021-12-26 23:09 UTC, Daniel Thomas

Description Daniel Thomas 2021-12-26 22:53:17 UTC
Description:
As discussed in Bug 92161, we should modify SwIoSystem::IsDetectableText so that, if none of the encodings we explicitly check for match, it falls back to whatever encoding ucsdet_getName (from the ICU library) reports, provided LO supports it. We can use ucsdet_getConfidence to reject any match below a certain confidence threshold.
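
The fallback rule described above could look roughly like the following. This is only a sketch of the decision logic, not the code in the Gerrit change: CharsetGuess, pickFallbackCharset and the supported-name set are hypothetical names made up for illustration. In real code the name and confidence would come from ucsdet_getName and ucsdet_getConfidence (ICU reports confidence as an integer from 0 to 100), and "supported" would be whatever check LO uses for its own encodings.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for one ICU match. In real code the name would come
// from ucsdet_getName() and the confidence (0-100) from ucsdet_getConfidence().
struct CharsetGuess {
    std::string name;
    int32_t confidence;
};

// Assumed default, mirroring the arbitrary 90% discussed in this bug.
constexpr int32_t CONFIDENCE_THRESHOLD = 90;

// Returns the name of the most confident guess that is both supported and at
// or above the threshold, or an empty string if no guess qualifies.
std::string pickFallbackCharset(const std::vector<CharsetGuess>& guesses,
                                const std::set<std::string>& supported,
                                int32_t threshold = CONFIDENCE_THRESHOLD)
{
    assert(threshold >= 0 && threshold <= 100);
    const CharsetGuess* best = nullptr;
    for (const CharsetGuess& guess : guesses) {
        if (guess.confidence < threshold)
            continue; // too uncertain to act on
        if (supported.count(guess.name) == 0)
            continue; // detected, but not an encoding we can load
        if (best == nullptr || guess.confidence > best->confidence)
            best = &guess;
    }
    return best ? best->name : std::string();
}
```

With guesses of EUC-KR at 95 and Big5 at 99, and only EUC-KR in the supported set, the function returns "EUC-KR"; if every guess is below the threshold it returns an empty string and the caller keeps its existing default.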

Steps to Reproduce:
1. Save a text file in one of the encodings we don't detect, e.g. EUC-KR, with some Korean text pasted in.
2. Open the file in LO Writer.
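
If no editor that can save EUC-KR is at hand, a test file for step 1 can be produced by writing the raw bytes directly. The sketch below is an assumption-laden helper, not part of the patch: writeEucKrSample is a hypothetical name, and it writes only the single word "한글" (bytes C7 D1 B1 DB in EUC-KR). A sample this short is unlikely to give a detector much to work with, so a realistic test file needs considerably more text, like the attachment on this bug.

```cpp
#include <fstream>
#include <string>

// Writes the Korean word "한글" as raw EUC-KR bytes, followed by a newline.
// Returns true if the file was written successfully.
bool writeEucKrSample(const std::string& path)
{
    // EUC-KR encodes each of these two syllables as a two-byte sequence.
    static const unsigned char bytes[] = {0xC7, 0xD1, 0xB1, 0xDB, '\n'};
    std::ofstream out(path, std::ios::binary);
    if (!out)
        return false;
    out.write(reinterpret_cast<const char*>(bytes), sizeof(bytes));
    return static_cast<bool>(out);
}
```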

Actual Results:
The text displays incorrectly because the file is assumed to be Unicode or some other encoding.

Expected Results:
The file's encoding is correctly determined and the text displays correctly.


Reproducible: Always


User Profile Reset: No



Additional Info:
See the description above.
Comment 1 Daniel Thomas 2021-12-26 23:05:12 UTC
Draft PR: https://gerrit.libreoffice.org/c/core/+/127539
Tests to follow.
Comment 2 Daniel Thomas 2021-12-26 23:09:00 UTC
Created attachment 177149 [details]
File encoded in EUC-KR
Comment 3 Daniel Thomas 2022-01-20 00:03:41 UTC
Sorry, I forgot to update this. The current status is that I have a PR which works fine under Linux with make check, but the Windows build failed in CI. I'm currently setting up a Windows build environment to investigate.
Comment 4 Daniel Thomas 2022-01-20 00:04:04 UTC Comment hidden (obsolete)
Comment 5 Xisco Faulí 2022-05-02 14:55:04 UTC
Dear Daniel Thomas,
This bug has been in ASSIGNED status for more than 3 months without any
activity. Resetting it to NEW.
Please assign it back to yourself if you're still working on this.
Comment 6 Daniel Thomas 2022-09-29 18:02:08 UTC
I have rebased again: https://gerrit.libreoffice.org/c/core/+/127539. This is ready for testing, and the Windows test bug is fixed too; see the details in the Gerrit comments.

Note that CONFIDENCE_THRESHOLD is arbitrarily set to 90%; I didn't know what to set it to, and it could be changed. Even a threshold of 100% would be an improvement over the detection currently done.