Bug 150877 - [FILEOPEN] DBF Mazovia Encoding (0x69)
Status: VERIFIED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Calc
Version (earliest affected): 7.3.6.2 release
Hardware: All All
Importance: medium normal
Assignee: Stephan Bergmann
URL:
Whiteboard: target:7.5.0 inReleaseNotes:7.5
Keywords:
Depends on:
Blocks:
 
Reported: 2022-09-09 07:32 UTC by SheetJS
Modified: 2022-12-05 16:17 UTC

See Also:
Crash report or crash signature:


Attachments
specimen (397 bytes, application/x-dbase)
2022-09-09 07:32 UTC, SheetJS

Description SheetJS 2022-09-09 07:32:08 UTC
Description:
LibreOffice does not recognize DBF encoding 0x69.  In the list of character sets, Mazovia (CP620) is not offered.

Steps to Reproduce:
Open Attached File

Actual Results:
LibreOffice asks for an encoding, and none of the options offered is Mazovia.

Expected Results:
File should open with the correct encoding. The correct rendering for the string in question is Ś╫êëτ⌡ś
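
As a rough illustration of why the encoding matters for this string, the sketch below decodes a hypothetical byte sequence twice: once as plain CP437 and once with two Mazovia positions overridden. The byte values, the two overrides (0x98 and 0x9E), and the use of CP437 for every other byte are assumptions made for this example; they should be checked against the 620.TBL table linked under Additional Info and against the attached specimen.

```python
# Assumed Mazovia positions that differ from CP437 (verify against 620.TBL).
MAZOVIA_OVERRIDES = {
    0x98: "\u015A",  # LATIN CAPITAL LETTER S WITH ACUTE
    0x9E: "\u015B",  # LATIN SMALL LETTER S WITH ACUTE
}


def decode_mazovia_sketch(data: bytes) -> str:
    """Decode bytes as CP437, except for the overridden positions above."""
    return "".join(
        MAZOVIA_OVERRIDES.get(b) or bytes([b]).decode("cp437") for b in data
    )


# Hypothetical field content; not read from the attached specimen.
raw = bytes([0x98, 0xD7, 0x88, 0x89, 0xE7, 0xF5, 0x9E])

print(decode_mazovia_sketch(raw))  # first and last characters are Polish letters
print(raw.decode("cp437"))         # same bytes, first and last characters differ
```

Only the first and last characters change between the two decodings, which is why a wrong code page choice can look almost right.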


Reproducible: Always


User Profile Reset: Yes



Additional Info:
Visual FoxPro has a few special encodings:


- 0x69 -> Mazovia (Polish) MS-DOS [CP620]

Unicode table: https://github.com/SheetJS/js-codepage/blob/master/codepages/620.TBL


- 0x68 -> Kamenický (Czech) MS-DOS [CP895]

Unicode table: https://github.com/SheetJS/js-codepage/blob/master/codepages/895.TBL


If access to VFP is limited, Gnumeric can serve as a cross-check, since it recognizes the codepage mapping. On a machine with limited iconv support, the terminal will show a message like

```
Unable to open an iconv handle from codepage 620 -> UTF-8
File has unknown or missing code page information (69)
```

which indicates that Gnumeric detects the DBF encoding is 0x69 and that it corresponds to CP620.
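
A minimal sketch of the detection step described above, assuming the usual DBF header layout in which the language driver ID is the byte at offset 29. The function and constant names are made up for this example, and the mapping contains only the two IDs listed in this report.

```python
# Offset of the language driver ID (code page mark) in a DBF header.
LANGUAGE_DRIVER_OFFSET = 29

# Only the two IDs discussed in this report; all other IDs are out of scope.
VFP_SPECIAL_CODEPAGES = {
    0x68: 895,  # Kamenický (Czech) MS-DOS
    0x69: 620,  # Mazovia (Polish) MS-DOS
}


def language_driver_id(header: bytes) -> int:
    """Return the language driver byte from the start of a DBF file."""
    if len(header) <= LANGUAGE_DRIVER_OFFSET:
        raise ValueError("DBF header is too short")
    return header[LANGUAGE_DRIVER_OFFSET]


def special_codepage(header: bytes):
    """Return the code page number for 0x68/0x69, or None for any other ID."""
    return VFP_SPECIAL_CODEPAGES.get(language_driver_id(header))


# Hypothetical 32-byte header: version byte 0x03, language driver 0x69.
example_header = bytearray(32)
example_header[0] = 0x03
example_header[LANGUAGE_DRIVER_OFFSET] = 0x69

print(special_codepage(bytes(example_header)))  # 620
```

Returning None for unknown IDs leaves the caller free to fall back to asking the user for an encoding, which is what LibreOffice currently does for 0x69.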
Comment 1 SheetJS 2022-09-09 07:32:34 UTC
Created attachment 182332
specimen
Comment 2 Julien Nabet 2022-09-10 08:04:56 UTC
On a Debian x86-64 PC with master sources updated today, I could reproduce this.

Code pointer:
2019     //case 0x68: eEncoding = ; break;     // Kamenicky (Czech) MS-DOS
2020     //case 0x69: eEncoding = ; break;     // Mazovia (Polish) MS-DOS

see https://opengrok.libreoffice.org/xref/core/connectivity/source/commontools/dbtools.cxx?r=91ba9654&mo=81393&fi=2020#2020

Eike: thought you might be interested in this one since it concerns encoding and Calc/Base
Comment 3 Eike Rathke 2022-09-12 11:00:30 UTC
These two cases are rightfully commented out. My short investigation shows we don't have conversions for them: there's no RTL_TEXTENCODING_IBM_620 define, and the only 620 is RTL_TEXTENCODING_TIS_620, which is Thai and doesn't fit Polish ;-) The same goes for CP895: there's no RTL_TEXTENCODING_IBM_895.

Code pointers:
include/rtl/textenc.h
sal/textenc/tencinfo.cxx
and look for all places over the code base that use for example RTL_TEXTENCODING_IBM_865 to see what would need to be added to support a new encoding.
Comment 4 Julien Nabet 2022-09-12 19:32:35 UTC
I've started a patch here: https://gerrit.libreoffice.org/c/core/+/139819

As you may have seen, I put some questions in it.
Comment 5 Julien Nabet 2022-09-13 18:51:56 UTC
Just to be sure, can you confirm that you're the one who created https://github.com/SheetJS/js-codepage/blob/master/codepages/620.TBL ?
If so, perhaps you have some insight into https://opengrok.libreoffice.org/xref/core/sal/textenc/tcvtest1.tab?r=98492e9d
(see https://gerrit.libreoffice.org/c/core/+/139819).
You may also be interested in contributing; see https://wiki.documentfoundation.org/Development/GetInvolved
Comment 6 SheetJS 2022-09-13 21:30:53 UTC
Both files use the same format as the unicode.org tables (e.g. http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/EBCDIC/CP037.TXT). The first column is a byte value and the second column is the equivalent Unicode value.
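
A small sketch of reading that two-column format, assuming whitespace-separated hex fields, optional `#` comments, and rows whose second column is left empty for undefined bytes. The function name is hypothetical, and the sample rows are illustrative values that should be checked against the real 620.TBL rather than taken from it.

```python
def parse_mapping_table(text: str) -> dict:
    """Map each byte value to its Unicode character, skipping undefined rows."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        fields = line.split()
        if len(fields) < 2:
            continue  # byte with no Unicode equivalent
        byte_value = int(fields[0], 16)
        table[byte_value] = chr(int(fields[1], 16))
    return table


# Illustrative rows only (assumed values, not copied from 620.TBL).
SAMPLE = (
    "0x41\t0x0041\t# LATIN CAPITAL LETTER A\n"
    "0x81\t      \t# UNDEFINED\n"
    "0x98\t0x015A\n"
    "0x9E\t0x015B\n"
)

table = parse_mapping_table(SAMPLE)
print(sorted(hex(b) for b in table))
```

A table parsed this way can be used to build or verify a single-byte conversion for an encoding that a library does not ship.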

If https://opengrok.libreoffice.org/xref/core/sal/textenc/tencinfo.cxx?r=b480819d#823 is the master mapping from codepage to encodings, there are a number of issues.

For example, RTF spec [1] has two tables of supported codepages:

A) Pages 14-15 describe the ANSI codepages specified with \ansicpg# .

B) Pages 20-21 describe the \fcharset control word and associated codepages.

Example issues:

- CP720 is missing

- CP708 is described as "ASMO 708" but it should use the Windows version [2] [3] .  Windows 708 fills a number of gaps that ISO-8859-6 leaves undefined, so a separate mapping should be created.

- CP10021 (Mac Thai) is missing (referenced as \fcharset87)

Sources:

[1] https://interoperability.blob.core.windows.net/files/Archive_References/[MSFT-RTF].pdf 

[2] https://docs.microsoft.com/en-us/previous-versions/images/cc195061.h0018(en-us,msdn.10).gif

[3] https://github.com/SheetJS/js-codepage/blob/master/codepages/720.TBL (according to our notes, it was enumerated using .NET System.Text.Encoding on a Windows 7 machine)
Comment 7 Julien Nabet 2022-09-14 15:59:43 UTC
I abandoned the patch; it's too complicated for me, so I'm unassigning myself.
Comment 8 Commit Notification 2022-09-15 13:57:22 UTC
Julien Nabet committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/b943d28f04014a36ce51da386f0b9411b8dbfa01

tdf#150877: Add support for Kamenický and Mazovia encodings

It will be available in 7.5.0.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 9 Julien Nabet 2022-09-16 05:58:38 UTC
Just to be clear: even though Stephan started from my abandoned patch, he did the most important and difficult part of the job (I'm thinking of the mapping).
Thank you again Stephan!

BTW, with master sources updated today, I can no longer reproduce the problem.
I got the string Ś╫êëτ⌡ś (in B3), so let's set this one to VERIFIED.
Comment 10 Xisco Faulí 2022-09-19 10:37:34 UTC
I added it to the release notes: https://wiki.documentfoundation.org/ReleaseNotes/7.5#Calc