92161 – GBK encoded Chinese text not auto-detected

Status:	RESOLVED FIXED

Alias:	None

Product:	LibreOffice
Classification:	Unclassified
Component:	LibreOffice
Version: (earliest affected)	4.2.8.2 release
Hardware:	Other All

Importance:	low enhancement
Assignee:	Not Assigned

URL:
Whiteboard:	target:7.4.0
Keywords:	difficultyBeginner, easyHack, skillCpp

Depends on:
Blocks:	CJK-Chinese-Simplified

Reported:	2015-06-18 16:00 UTC by ni shengyue
Modified:	2023-11-22 09:47 UTC
CC List:	3 users

See Also:	http://bugs.debian.org/790903 60145 146429 61703
Crash report or crash signature:

Attachments
GBK encoded file (90.45 KB, text/plain) 2015-06-18 16:00 UTC, ni shengyue
screen shot of Libreoffice and kate (328.36 KB, image/png) 2015-06-18 16:01 UTC, ni shengyue
1040044624.DOC - document de test (19.58 KB, application/msword) 2016-02-24 16:28 UTC, Stéphane Aulery
Testing document rendered with LO 5.0.1.2 under Windows 7 x86 (134.03 KB, image/png) 2016-02-24 16:29 UTC, Stéphane Aulery
Testing document rendered with MS Word 2010 under Win7 x86 (198.17 KB, image/png) 2016-02-24 16:30 UTC, Stéphane Aulery

Description ni shengyue 2015-06-18 16:00:37 UTC

Created attachment 116630 [details]
GBK encoded file

GBK-encoded Chinese text can't be read in LibreOffice, while other applications such as Kate in KDE decode it correctly.

Comment 1 ni shengyue 2015-06-18 16:01:26 UTC

Created attachment 116631 [details]
screen shot of Libreoffice  and kate

Comment 2 Julien Nabet 2015-06-20 07:08:50 UTC

On pc Debian x86-64 with master sources updated yesterday, I could reproduce this.

I noticed this on console:
warn:legacy.osl:3197:1:sw/source/filter/ascii/parasc.cxx:265: Autodetect of text import without nag dialog must have failed
warn:vcl:3197:1:vcl/generic/fontmanager/fontconfig.cxx:863: In glyph fallback throwing away the language property of hi because the detected script for '0xc7e' is Telugu and that language doesn't make sense. Autodetecting instead.

Caolan: one for you? (vcl + language/font detection)

Comment 3 Caolán McNamara 2015-07-08 09:05:48 UTC

The file can be read; LibreOffice just can't auto-detect the encoding. You need to use File > Open, select the "text - choose encoding" filter, press OK, and then select "Chinese simplified (GB-18030)" as the encoding.

Comment 4 ni shengyue 2015-07-09 15:55:04 UTC

Yes, I can read the file using Caolán McNamara's method, but ordinary users won't find this 'text - choose encoding' filter. LibreOffice should support automatic encoding detection, as MS Office does, so I suggest REOPENing this bug to track the requirement.

Comment 5 Caolán McNamara 2015-07-10 09:12:14 UTC

IMO non utf-8 text is just archaic at this point

Comment 6 Buovjaga 2015-10-10 12:12:46 UTC

Set to NEW, lowered priority and adjusted summary.

Comment 7 Stéphane Aulery 2016-02-24 16:28:35 UTC Comment hidden (obsolete)

Created attachment 122949 [details]
1040044624.DOC - document de test

Comment 8 Stéphane Aulery 2016-02-24 16:29:28 UTC Comment hidden (obsolete)

Created attachment 122950 [details]
Testing document rendered with LO 5.0.1.2 under Windows 7 x86

Comment 9 Stéphane Aulery 2016-02-24 16:30:00 UTC Comment hidden (obsolete)

Created attachment 122951 [details]
Testing document rendered with MS Word 2010 under Win7 x86

Comment 10 Maxim Monastirsky 2016-05-16 13:18:48 UTC

Removing unrelated debian bug from 'See Also', and changing to 'enhancement', as charset auto-detection isn't implemented.

Comment 11 Cosimo Cecchi 2016-10-07 00:43:22 UTC

(In reply to Caolán McNamara from comment #5)
> IMO non utf-8 text is just archaic at this point

Caolán, unfortunately while this may be true for Europe and the US, it's definitely not true for China. GB18030 is the standard in China and a requirement for software that is distributed there.

Comment 15 Mike Kaganski 2021-12-13 09:07:25 UTC

Note that after bug 60145 is fixed, it's actually easy to add auto-detection of any encoding recognized by ICU's charset detector, amending SwIoSystem::IsDetectableText.

Setting this as easyhack.
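
For readers unfamiliar with the idea, the detection step suggested above can be illustrated with a rough Python sketch. This is not the code in SwIoSystem::IsDetectableText and it does not use ICU: it swaps ICU's statistical detector for a much cruder "first encoding that decodes cleanly" test. The function name `guess_encoding` and the candidate order are assumptions made purely for illustration.

```python
import codecs

# Byte-order marks checked before any content-based guess.
# UTF-32 is left out to keep the sketch short.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding(data: bytes, candidates=("utf-8", "gb18030")):
    """Return the first candidate that decodes `data` strictly, or None.

    Hypothetical helper: a strict trial decode stands in for ICU's
    statistical charset detector, so it is far less reliable.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    for name in candidates:
        try:
            data.decode(name)  # strict mode raises on invalid bytes
        except UnicodeDecodeError:
            continue
        return name
    return None


if __name__ == "__main__":
    print(guess_encoding("中文".encode("gbk")))
    print(guess_encoding("中文".encode("utf-8")))
```

UTF-8 is tried first because its byte structure is strict enough that legacy multi-byte text rarely passes as valid UTF-8. The reverse does not hold: plenty of non-Chinese byte sequences decode without error as GB18030, which is why a real detector weighs character statistics instead of relying on a trial decode alone.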

Comment 16 Daniel Thomas 2021-12-23 02:47:36 UTC

Have created https://gerrit.libreoffice.org/c/core/+/127347 to fix this. Though now I'm wondering whether we could modify that code to support all of the encodings in LO?

Comment 17 Mike Kaganski 2021-12-23 06:23:26 UTC

(In reply to Daniel Thomas from comment #16)
> Have created https://gerrit.libreoffice.org/c/core/+/127347 to fix this.

Thanks - merged! :)

> Though now I'm wondering whether we could modify that code to support all of
> the encodings in LO?

It should be relatively easy. We already have rtl_getTextEncodingFromMimeCharset, which seems to be what ucsdet_getName returns. The only concern here would be false detections, and we could use ucsdet_getConfidence [1] to filter out unreliable detections.

Feel free to submit a new enhancement, and then fix it - that would be a nice hack!

[1] https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/ucsdet_8h.html#a30dd8812653be28766f1ee1bbc412c18
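
The confidence filter described above could look roughly like the following Python sketch. It only models the selection logic: the `(name, confidence)` pairs stand in for what ucsdet_getName and ucsdet_getConfidence would return (ICU reports confidence on a 0 to 100 scale), the `SUPPORTED` table is a hypothetical stand-in for the lookup done by rtl_getTextEncodingFromMimeCharset, and the threshold of 50 is an arbitrary assumption, not a value taken from LibreOffice.

```python
from typing import Iterable, Optional, Tuple

# Hypothetical table: detector charset name -> encoding the importer supports.
SUPPORTED = {"UTF-8": "utf-8", "GB18030": "gb18030"}


def pick_encoding(
    matches: Iterable[Tuple[str, int]],
    min_confidence: int = 50,  # assumed cut-off, chosen only for illustration
) -> Optional[str]:
    """Pick the most confident supported match, or None if none is reliable."""
    best = None
    for name, confidence in matches:
        if confidence < min_confidence:
            continue  # too unreliable: treat as a possible false detection
        if name not in SUPPORTED:
            continue  # detector knows it, but the importer has no mapping
        if best is None or confidence > best[1]:
            best = (name, confidence)
    return SUPPORTED[best[0]] if best else None


if __name__ == "__main__":
    print(pick_encoding([("GB18030", 80), ("UTF-8", 10)]))
    print(pick_encoding([("GB18030", 20)]))
```

Returning None when nothing clears the threshold lets the caller fall back to its existing behaviour (for example, asking the user to choose an encoding) rather than importing text with a wrong guess.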

Comment 18 Commit Notification 2021-12-23 06:23:53 UTC

dtm committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/763c2a436baa1814d2bf95477b9d79fa3934d5e5

tdf#92161 add GB18030 encoding to iodetect

It will be available in 7.4.0.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.

Comment 19 Daniel Thomas 2021-12-26 23:05:33 UTC

https://bugs.documentfoundation.org/show_bug.cgi?id=146429 created for the aforementioned addition

Comment 20 Kevin Suo 2022-09-14 16:02:33 UTC

I confirm the original bug behaviour is now fixed on 7.4 and trunk. The commit 763c2a436baa1814d2bf95477b9d79fa3934d5e5 added GB18030 which can still decode most of the characters encoded as GBK.
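
The overlap between the two encodings can be checked with a small Python sketch using the standard `gbk` and `gb18030` codecs. The sample string and the helper name `gbk_bytes_decode_as_gb18030` are assumptions chosen for illustration; a single sample only demonstrates the common case and says nothing about the few code points where the two encodings differ.

```python
def gbk_bytes_decode_as_gb18030(text: str) -> bool:
    """Encode `text` as GBK and check that a GB18030 decoder recovers it."""
    gbk_bytes = text.encode("gbk")
    return gbk_bytes.decode("gb18030") == text


if __name__ == "__main__":
    sample = "简体中文编码测试"  # ordinary Simplified Chinese characters
    print(gbk_bytes_decode_as_gb18030(sample))
    # For characters like these, both codecs produce the same byte sequence.
    print(sample.encode("gbk") == sample.encode("gb18030"))
```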

I think we should leave this open for now, in case someone is interested in working on further improvements (e.g. adding detection of other encodings).

Comment 21 Kevin Suo 2023-11-22 09:47:08 UTC

Closing this. As Mike Kaganski has mentioned, "feel free to submit a new enhancement" for other encodings.