Bug 146429

Summary: Fallback to other character encodings detected by ICU above a certain confidence threshold
Product: LibreOffice Reporter: Daniel Thomas <daniel>
Component: LibreOffice Assignee: Daniel Thomas <daniel>
Status: ASSIGNED ---    
Severity: enhancement CC: buzea.bogdan, daniel, himajin100000, mikekaganski, vsfoote, xiscofauli
Priority: medium    
Version: unspecified   
Hardware: All   
OS: All   
See Also: https://bugs.documentfoundation.org/show_bug.cgi?id=92161
https://bugs.documentfoundation.org/show_bug.cgi?id=141971
https://bugs.documentfoundation.org/show_bug.cgi?id=61703
Whiteboard:
Crash report or crash signature: Regression By:
Bug Depends on:    
Bug Blocks: 157526    
Attachments: File encoded in EUC-KR

Description Daniel Thomas 2021-12-26 22:53:17 UTC
Description:
As discussed in Bug 92161, we should modify SwIoSystem::IsDetectableText so that, if none of the encodings we explicitly check for match, we can consider falling back to whatever ucsdet_getName (from the ICU library) returns, provided LO supports that encoding. We can use ucsdet_getConfidence to filter out anything below a certain confidence threshold.
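
A minimal sketch of the fallback decision described above, not the code in the Gerrit change. The struct DetectedCharset, the function chooseFallbackCharset, and the set of supported encodings are hypothetical names introduced for illustration; in the real code the name and confidence would presumably come from ucsdet_getName and ucsdet_getConfidence, and the "supported" check from whatever LO uses to map a charset name to a text encoding.

```cpp
#include <cstdint>
#include <optional>
#include <set>
#include <string>

// Hypothetical holder for a detection result. In real code the name would
// come from ucsdet_getName() and the confidence (0-100) from
// ucsdet_getConfidence().
struct DetectedCharset
{
    std::string name;
    std::int32_t confidence;
};

// Returns the detected charset name if it is usable as a fallback: the
// confidence meets the threshold and the encoding is in the supported set.
// Returns std::nullopt otherwise, so the caller keeps its existing default.
std::optional<std::string> chooseFallbackCharset(
    const std::optional<DetectedCharset>& detected,
    const std::set<std::string>& supported,
    std::int32_t confidenceThreshold)
{
    if (!detected)
        return std::nullopt; // detector produced no match at all
    if (detected->confidence < confidenceThreshold)
        return std::nullopt; // too uncertain to override the default
    if (supported.count(detected->name) == 0)
        return std::nullopt; // detected, but not an encoding we can import
    return detected->name;
}
```

With a threshold of 90, a result of {"EUC-KR", 95} would be accepted when EUC-KR is in the supported set, while {"EUC-KR", 80} would be rejected and the existing behaviour kept.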

Steps to Reproduce:
1. Save a text file in one of the encodings we don't detect, e.g. EUC-KR with some Korean text pasted in.
2. Open the file in LO Writer.

Actual Results:
The text displays incorrectly because the file is assumed to be Unicode or some other encoding.

Expected Results:
The encoding is correctly determined and the file displays correctly.


Reproducible: Always


User Profile Reset: No



Additional Info:
see desc
Comment 1 Daniel Thomas 2021-12-26 23:05:12 UTC
Draft PR: https://gerrit.libreoffice.org/c/core/+/127539
Tests to follow.
Comment 2 Daniel Thomas 2021-12-26 23:09:00 UTC
Created attachment 177149 [details]
File encoded in EUC-KR
Comment 3 Daniel Thomas 2022-01-20 00:03:41 UTC
Sorry, I forgot to update this. The current status is that I have a PR which worked fine under Linux with make check, but the Windows build failed in CI. I'm currently setting up a Windows build environment to investigate this.
Comment 4 Daniel Thomas 2022-01-20 00:04:04 UTC Comment hidden (obsolete)
Comment 5 Xisco Faulí 2022-05-02 14:55:04 UTC
Dear Daniel Thomas,
This bug has been in ASSIGNED status for more than 3 months without any
activity. Resetting it to NEW.
Please assign it back to yourself if you're still working on this.
Comment 6 Daniel Thomas 2022-09-29 18:02:08 UTC
I have rebased again: https://gerrit.libreoffice.org/c/core/+/127539. This is ready for testing, and the Windows test bug is fixed too; see details in the Gerrit comments.

Note that CONFIDENCE_THRESHOLD is arbitrarily set to 90%. I didn't know what to set it to, so it could be changed. Even setting it to 100% would be helpful and an improvement over the detection currently done.