Description:
LibreOffice does not recognize DBF encoding 0x69. In the list of character sets, Mazovia (CP620) is not offered.

Steps to Reproduce:
Open Attached File

Actual Results:
Asks for encoding, none of the options are Mazovia

Expected Results:
File should open with the correct encoding. The correct rendering for the string in question is Ś╫êëτ⌡ś

Reproducible: Always

User Profile Reset: Yes

Additional Info:
Visual FoxPro has a few special encodings:
- 0x69 -> Mazovia (Polish) MS-DOS [CP620]
  Unicode table: https://github.com/SheetJS/js-codepage/blob/master/codepages/620.TBL
- 0x68 -> Kamenický (Czech) MS-DOS [CP895]
  Unicode table: https://github.com/SheetJS/js-codepage/blob/master/codepages/895.TBL

If VFP access is limited, Gnumeric recognizes the codepage mapping. On a machine with limited iconv support, the terminal will show a message like

```
Unable to open an iconv handle from codepage 620 -> UTF-8
File has unknown or missing code page information (69)
```

which indicates that Gnumeric detects the DBF encoding is 0x69 and that it corresponds to CP620.
Created attachment 182332 [details] specimen
On pc Debian x86-64 with master sources updated today, I could reproduce this.

Code pointer:
2019 //case 0x68: eEncoding = ; break; // Kamenicky (Czech) MS-DOS
2020 //case 0x69: eEncoding = ; break; // Mazovia (Polish) MS-DOS
see https://opengrok.libreoffice.org/xref/core/connectivity/source/commontools/dbtools.cxx?r=91ba9654&mo=81393&fi=2020#2020

Eike: thought you might be interested in this one since it concerns encoding and Calc/Base
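To illustrate what the two commented-out cases stand for, here is a minimal sketch of mapping the DBF encoding byte to a DOS code page number. It is not the dbtools.cxx code: the helper names are hypothetical, only the two IDs from this report are listed, and the header offset (byte 29) is an assumption taken from the commonly documented dBASE/FoxPro header layout rather than from this report.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <optional>

// Assumed offset of the language driver ID (code page mark) in a DBF header.
constexpr std::size_t kLanguageDriverOffset = 29;

// Hypothetical helper: map a language driver ID to a DOS code page number.
// Only the two IDs discussed in this report are handled.
std::optional<int> codePageForLanguageDriver(std::uint8_t id)
{
    switch (id)
    {
        case 0x68: return 895; // Kamenický (Czech) MS-DOS
        case 0x69: return 620; // Mazovia (Polish) MS-DOS
        default:   return std::nullopt; // unknown ID: the caller has to ask the user
    }
}

// Hypothetical helper: look up the code page from the first 32 header bytes.
std::optional<int> codePageFromHeader(const std::array<std::uint8_t, 32>& header)
{
    return codePageForLanguageDriver(header[kLanguageDriverOffset]);
}
```

Returning an empty optional for unknown IDs mirrors the behaviour described in the report, where the user is asked for an encoding when the byte is not recognized.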
These two cases are rightfully commented out. My short investigation shows we don't have conversions for them: there's no RTL_TEXTENCODING_IBM_620 define, and the only 620 is RTL_TEXTENCODING_TIS_620, which is Thai and doesn't fit Polish ;-) Same for CP895: there's no RTL_TEXTENCODING_IBM_895.

Code pointers:
include/rtl/textenc.h
sal/textenc/tencinfo.cxx
and look for all places in the code base that use, for example, RTL_TEXTENCODING_IBM_865 to see what would need to be added to support a new encoding.
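For context, a conversion for a single-byte DOS code page boils down to a 128-entry lookup for the upper half, with 0x00..0x7F passed through unchanged. The sketch below only illustrates that idea; it makes no claim about how sal/textenc is structured, and `HighHalfTable` and `decodeSingleByte` are hypothetical names. The table contents would have to come from a mapping such as the .TBL files linked in the description.

```cpp
#include <array>
#include <cstdint>
#include <string>
#include <string_view>

// Upper half (bytes 0x80..0xFF) of a single-byte code page, as UTF-16 units.
using HighHalfTable = std::array<char16_t, 128>;

// Hypothetical helper: decode bytes to UTF-16. Bytes below 0x80 are passed
// through unchanged; bytes 0x80..0xFF are looked up in the supplied table.
std::u16string decodeSingleByte(std::string_view bytes, const HighHalfTable& highHalf)
{
    std::u16string out;
    out.reserve(bytes.size());
    for (unsigned char b : bytes)
        out.push_back(b < 0x80 ? static_cast<char16_t>(b) : highHalf[b - 0x80]);
    return out;
}
```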
I've started a patch here: https://gerrit.libreoffice.org/c/core/+/139819 As you may have seen, I put some questions in it.
Just to be sure, can you confirm that you're the one who created https://github.com/SheetJS/js-codepage/blob/master/codepages/620.TBL ? If yes, perhaps you'd have some insight about https://opengrok.libreoffice.org/xref/core/sal/textenc/tcvtest1.tab?r=98492e9d (see https://gerrit.libreoffice.org/c/core/+/139819), and perhaps you may also be interested in contributing by following this link: https://wiki.documentfoundation.org/Development/GetInvolved ?
Both files use the same format as the unicode.org tables (e.g. http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/EBCDIC/CP037.TXT). The first column is a byte value and the second column is the equivalent Unicode value.

If https://opengrok.libreoffice.org/xref/core/sal/textenc/tencinfo.cxx?r=b480819d#823 is the master mapping from codepage to encodings, there are a number of issues. For example, the RTF spec [1] has two tables of supported codepages:
A) Pages 14-15 describe the ANSI codepages specified with \ansicpg# .
B) Pages 20-21 describe the \fcharset control word and associated codepages.

Example issues:
- CP720 is missing
- CP708 is described as "ASMO 708" but it should use the Windows version [2] [3] . Windows 708 fills a number of gaps that ISO-8859-6 leaves undefined, so a separate mapping should be created.
- CP10021 (Mac Thai) is missing (referenced as \fcharset87)

Sources:
[1] https://interoperability.blob.core.windows.net/files/Archive_References/[MSFT-RTF].pdf
[2] https://docs.microsoft.com/en-us/previous-versions/images/cc195061.h0018(en-us,msdn.10).gif
[3] https://github.com/SheetJS/js-codepage/blob/master/codepages/720.TBL according to our notes it was enumerated using .NET System.Text.Encoding from a Windows 7 machine
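The two-column table format described above (byte value, then Unicode value) can be read with a small parser. This is a rough sketch under a few assumptions: both columns are hexadecimal with an optional 0x prefix, fields are separated by whitespace, and lines that are blank, start with '#', or lack a usable second column are skipped. `parseMappingTable` is a hypothetical name.

```cpp
#include <cstdint>
#include <exception>
#include <map>
#include <sstream>
#include <string>

// Parse "<byte value> <Unicode value>" pairs, one per line, into a map.
// Anything after the second column (e.g. a trailing "#comment") is ignored.
std::map<std::uint8_t, char32_t> parseMappingTable(const std::string& text)
{
    std::map<std::uint8_t, char32_t> table;
    std::istringstream lines(text);
    std::string line;
    while (std::getline(lines, line))
    {
        if (line.empty() || line[0] == '#')
            continue; // blank line or full-line comment
        std::istringstream fields(line);
        std::string byteField, unicodeField;
        if (!(fields >> byteField >> unicodeField))
            continue; // fewer than two columns
        try
        {
            // Base 16 accepts an optional "0x" prefix.
            unsigned long byteValue = std::stoul(byteField, nullptr, 16);
            unsigned long codePoint = std::stoul(unicodeField, nullptr, 16);
            if (byteValue <= 0xFF && codePoint <= 0x10FFFF)
                table[static_cast<std::uint8_t>(byteValue)] = static_cast<char32_t>(codePoint);
        }
        catch (const std::exception&)
        {
            continue; // non-numeric column, e.g. a byte with no mapping
        }
    }
    return table;
}
```

Skipping unparsable lines rather than failing keeps the sketch tolerant of entries that have no Unicode equivalent.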
I abandoned the patch; it's too complicated for me. Unassigning myself.
Julien Nabet committed a patch related to this issue. It has been pushed to "master": https://git.libreoffice.org/core/commit/b943d28f04014a36ce51da386f0b9411b8dbfa01 tdf#150877: Add support for Kamenický and Mazovia encodings It will be available in 7.5.0. The patch should be included in the daily builds available at https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More information about daily builds can be found at: https://wiki.documentfoundation.org/Testing_Daily_Builds Affected users are encouraged to test the fix and report feedback.
Just to be clear: even though Stephan started from my abandoned patch, he did the most important and difficult part of the job (I'm thinking of the mapping). Thank you again Stephan! BTW, with master sources updated today, I can no longer reproduce the problem. I got the string Ś╫êëτ⌡ś (in B3), so let's set this one to VERIFIED.
I added it to the release notes: https://wiki.documentfoundation.org/ReleaseNotes/7.5#Calc