Description:
As discussed in Bug 92161, we should modify SwIoSystem::IsDetectableText so that, if none of the encodings we explicitly check for matches, it falls back to whatever ucsdet_getName (from the ICU library) returns, provided LO supports that encoding. We can use ucsdet_getConfidence to filter out anything below a certain confidence threshold.

Steps to Reproduce:
1. Save a text file in one of the encodings we don't detect, e.g. EUC-KR, with some Korean text pasted in.
2. Open the file in LO Writer.

Actual Results:
The text displays incorrectly, because the file is assumed to be Unicode or some other encoding.

Expected Results:
The file type is correctly determined and the text displays correctly.

Reproducible: Always
User Profile Reset: No
Additional Info: see description
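To make the proposed fallback rule concrete, here is a minimal sketch of the decision step only. It is not the code from the patch: DetectionResult, acceptFallback and the supported-name set are hypothetical names introduced for illustration, and the struct merely stands in for the values that ucsdet_getName and ucsdet_getConfidence would report (ICU gives the confidence as an integer from 0 to 100).

```cpp
#include <set>
#include <string>

// Hypothetical stand-in for one ICU detection result: the charset name
// as ucsdet_getName would return it and the 0..100 confidence as
// ucsdet_getConfidence would return it.
struct DetectionResult {
    std::string name;
    int confidence;
};

// Hypothetical helper: decide whether the fallback guess may be used.
// It is accepted only if the confidence reaches the threshold AND the
// charset is one the application can actually convert from.
bool acceptFallback(const DetectionResult& detected,
                    const std::set<std::string>& supported,
                    int threshold)
{
    if (detected.confidence < threshold)
        return false;  // guess is too weak, keep the existing default
    return supported.count(detected.name) != 0;
}
```

With an illustrative threshold of 90, a result of {"EUC-KR", 95} would be accepted if "EUC-KR" is in the supported set, while {"EUC-KR", 60} would be rejected and the existing default handling would apply.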
Draft PR: https://gerrit.libreoffice.org/c/core/+/127539 (tests to follow)
Created attachment 177149 [details] File encoded in EUC-KR
Sorry, I forgot to update this. Current status: I have a PR that passes make check on Linux, but the Windows build failed in CI. I'm currently setting up a Windows build environment to investigate.
Dear Daniel Thomas, This bug has been in ASSIGNED status for more than 3 months without any activity. Resetting it to NEW. Please assign it back to yourself if you're still working on this.
I have rebased again: https://gerrit.libreoffice.org/c/core/+/127539 and this is ready for testing. The Windows test bug is fixed too; see the Gerrit comments for details. Note that CONFIDENCE_THRESHOLD is set arbitrarily to 90%, as I didn't know what value to use, so it could be changed. Even setting it to 100% would be an improvement over the detection currently done.