Bug 92161 - GBK encoded Chinese text not auto-detected
Summary: GBK encoded Chinese text not auto-detected
Status: RESOLVED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: LibreOffice
Version (earliest affected): 4.2.8.2 release
Hardware: Other All
Importance: low enhancement
Assignee: Not Assigned
URL:
Whiteboard: target:7.4.0
Keywords: difficultyBeginner, easyHack, skillCpp
Depends on:
Blocks: CJK-Chinese-Simplified
Reported: 2015-06-18 16:00 UTC by ni shengyue
Modified: 2023-11-22 09:47 UTC (History)

See Also:
Crash report or crash signature:


Attachments
GBK encoded file (90.45 KB, text/plain)
2015-06-18 16:00 UTC, ni shengyue
Screenshot of LibreOffice and Kate (328.36 KB, image/png)
2015-06-18 16:01 UTC, ni shengyue
1040044624.DOC - test document (19.58 KB, application/msword)
2016-02-24 16:28 UTC, Stéphane Aulery
Test document rendered with LO 5.0.1.2 under Windows 7 x86 (134.03 KB, image/png)
2016-02-24 16:29 UTC, Stéphane Aulery
Test document rendered with MS Word 2010 under Win7 x86 (198.17 KB, image/png)
2016-02-24 16:30 UTC, Stéphane Aulery

Description ni shengyue 2015-06-18 16:00:37 UTC
Created attachment 116630
GBK encoded file

GBK-encoded Chinese text can't be read, while other applications such as Kate in KDE decode it without problems.
Comment 1 ni shengyue 2015-06-18 16:01:26 UTC
Created attachment 116631
Screenshot of LibreOffice and Kate
Comment 2 Julien Nabet 2015-06-20 07:08:50 UTC
On pc Debian x86-64 with master sources updated yesterday, I could reproduce this.

I noticed this on console:
warn:legacy.osl:3197:1:sw/source/filter/ascii/parasc.cxx:265: Autodetect of text import without nag dialog must have failed
warn:vcl:3197:1:vcl/generic/fontmanager/fontconfig.cxx:863: In glyph fallback throwing away the language property of hi because the detected script for '0xc7e' is Telugu and that language doesn't make sense. Autodetecting instead.

Caolan: one for you? (vcl + language/font detection)
Comment 3 Caolán McNamara 2015-07-08 09:05:48 UTC
It can be read; it just can't auto-detect the format. You need to use File->Open and select the "text - choose encoding" filter, then press OK, and then select "Chinese simplified (GB-18030)" as the encoding.
Comment 4 ni shengyue 2015-07-09 15:55:04 UTC
Yes, I can read it using Caolán McNamara's method, but ordinary users won't find the 'text - choose encoding' filter. We suggest that LibreOffice support automatic encoding detection, just as MS Office does, so I suggest REOPENing this bug to track the requirement.
Comment 5 Caolán McNamara 2015-07-10 09:12:14 UTC
IMO non utf-8 text is just archaic at this point
Comment 6 Buovjaga 2015-10-10 12:12:46 UTC
Set to NEW, lowered priority and adjusted summary.
Comment 7 Stéphane Aulery 2016-02-24 16:28:35 UTC Comment hidden (obsolete)
Comment 8 Stéphane Aulery 2016-02-24 16:29:28 UTC Comment hidden (obsolete)
Comment 9 Stéphane Aulery 2016-02-24 16:30:00 UTC Comment hidden (obsolete)
Comment 10 Maxim Monastirsky 2016-05-16 13:18:48 UTC
Removing unrelated debian bug from 'See Also', and changing to 'enhancement', as charset auto-detection isn't implemented.
Comment 11 Cosimo Cecchi 2016-10-07 00:43:22 UTC
(In reply to Caolán McNamara from comment #5)
> IMO non utf-8 text is just archaic at this point

Caolán, unfortunately while this may be true for Europe and the US, it's definitely not true for China. GB18030 is the standard in China and a requirement for software that is distributed there.
Comment 15 Mike Kaganski 2021-12-13 09:07:25 UTC
Note that after bug 60145 is fixed, it's actually easy to add auto-detection of any encoding recognized by ICU's charset detector, amending SwIoSystem::IsDetectableText.

Setting this as easyhack.
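
To illustrate what automatic detection has to do for the attached file, here is a minimal Python sketch. It is not ICU's charset detector and not the code in SwIoSystem::IsDetectableText; ICU scores candidates statistically, whereas this sketch only tries strict decoding against a short candidate list and returns the first encoding that accepts the bytes. The names `CANDIDATES` and `guess_encoding`, and the candidate order, are assumptions made for the example.

```python
from typing import Optional

# Candidate encodings, tried in order. The list and its order are
# assumptions of this sketch, not what LibreOffice or ICU actually use.
CANDIDATES = ("utf-8", "gb18030")


def guess_encoding(data: bytes) -> Optional[str]:
    """Return the first candidate that decodes `data` without error, else None."""
    for name in CANDIDATES:
        try:
            data.decode(name)  # strict mode: raises on invalid byte sequences
        except UnicodeDecodeError:
            continue
        return name
    return None


# GBK bytes for a short Chinese string, standing in for the attached file.
sample = "简体中文文本".encode("gbk")
print(guess_encoding(sample))
```

GBK byte sequences for Chinese text are normally not valid UTF-8, so the UTF-8 attempt fails and the sketch falls through to GB18030. Trial decoding cannot tell apart encodings that accept the same bytes, which is why a statistical detector is the better fit for a real implementation.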
Comment 16 Daniel Thomas 2021-12-23 02:47:36 UTC
Have created https://gerrit.libreoffice.org/c/core/+/127347 to fix this. Though now I'm wondering whether we could modify that code to support all of the encodings in LO?
Comment 17 Mike Kaganski 2021-12-23 06:23:26 UTC
(In reply to Daniel Thomas from comment #16)
> Have created https://gerrit.libreoffice.org/c/core/+/127347 to fix this.

Thanks - merged! :)

> Though now I'm wondering whether we could modify that code to support all of
> the encodings in LO?

It should be relatively easy. We already have rtl_getTextEncodingFromMimeCharset, which seems to accept the kind of charset name that ucsdet_getName returns. The only concern here would be false detections, and we could use ucsdet_getConfidence [1] to filter out unreliable detections.

Feel free to submit a new enhancement, and then fix it - that would be a nice hack!

[1] https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/ucsdet_8h.html#a30dd8812653be28766f1ee1bbc412c18
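
The confidence filtering suggested above could look roughly like the following Python sketch. It does not call ICU; it only models the decision step on a list of (charset name, confidence) pairs, with confidence on the 0-100 scale that ucsdet_getConfidence documents. `Match`, `pick_charset`, and the `MIN_CONFIDENCE` cut-off of 50 are hypothetical names and values chosen for illustration.

```python
from typing import NamedTuple, Optional, Sequence


class Match(NamedTuple):
    charset: str     # name as a detector might report it, e.g. "GB18030"
    confidence: int  # 0-100, the scale ucsdet_getConfidence documents


# Hypothetical cut-off; a real value would need tuning on real files.
MIN_CONFIDENCE = 50


def pick_charset(matches: Sequence[Match]) -> Optional[str]:
    """Return the most confident charset, or None if nothing is reliable enough."""
    if not matches:
        return None
    best = max(matches, key=lambda m: m.confidence)
    return best.charset if best.confidence >= MIN_CONFIDENCE else None


# Made-up detector output, for illustration only.
print(pick_charset([Match("GB18030", 85), Match("Big5", 30)]))
print(pick_charset([Match("ISO-8859-1", 12)]))
```

Returning None when no candidate clears the threshold leaves room for the existing behaviour (asking the user or using the default encoding) instead of committing to a doubtful guess.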
Comment 18 Commit Notification 2021-12-23 06:23:53 UTC
dtm committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/763c2a436baa1814d2bf95477b9d79fa3934d5e5

tdf#92161 add GB18030 encoding to iodetect

It will be available in 7.4.0.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 19 Daniel Thomas 2021-12-26 23:05:33 UTC
https://bugs.documentfoundation.org/show_bug.cgi?id=146429 created for the aforementioned addition
Comment 20 Kevin Suo 2022-09-14 16:02:33 UTC
I confirm the original bug behaviour is now fixed on 7.4 and trunk. The commit 763c2a436baa1814d2bf95477b9d79fa3934d5e5 added GB18030 which can still decode most of the characters encoded as GBK.

I think we should leave this open for now, so that anyone interested can still work on improvements (i.e. adding detection of other encodings).
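
The point that a GB18030 decoder also handles GBK-encoded text can be checked with a small Python sketch using the standard library codecs. The sample strings are arbitrary, and the check covers only these characters; it does not show that every GBK byte sequence maps identically.

```python
text = "简体中文"

# Bytes as a GBK-only tool would write them.
gbk_bytes = text.encode("gbk")

# A GB18030 decoder reads those bytes back to the same text.
decoded = gbk_bytes.decode("gb18030")
print(decoded == text)

# For these characters the two encodings produce identical bytes.
print(text.encode("gb18030") == gbk_bytes)

# GB18030 additionally covers characters GBK cannot encode,
# using a four-byte form.
emoji = "\U0001F600"
print(len(emoji.encode("gb18030")))
try:
    emoji.encode("gbk")
    gbk_can_encode = True
except UnicodeEncodeError:
    gbk_can_encode = False
print(gbk_can_encode)
```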
Comment 21 Kevin Suo 2023-11-22 09:47:08 UTC
Closing this. As Mike Kaganski has mentioned, "feel free to submit a new enhancement" for other encodings.