Bug 69744 - Data in Visual FoxPro DBF is garbled
Summary: Data in Visual FoxPro DBF is garbled
Status: RESOLVED INVALID
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Base
Version: Inherited From OOo (earliest affected)
Hardware: Other All
Importance: medium normal
Assignee: Not Assigned
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2013-09-24 05:35 UTC by Urmas
Modified: 2017-06-17 16:18 UTC
CC List: 3 users

See Also:
Crash report or crash signature:


Attachments
Russian DBF (410 bytes, application/x-dbase)
2013-09-24 05:35 UTC, Urmas
Screenshot (4.55 KB, image/png)
2013-09-24 22:51 UTC, Urmas
Two charsets in the same *.dbf (17.29 KB, image/png)
2013-09-26 14:39 UTC, Robert Großkopf
Opening the file with MS Excel 2013 (46.42 KB, application/x-zip-compressed)
2013-09-28 10:48 UTC, Mike Kaganski

Description Urmas 2013-09-24 05:35:53 UTC
Created attachment 86430 [details]
Russian DBF

When using the attached file in LO Base, the data in the table is garbled.
VFP9 shows the table properly.
Comment 1 Robert Großkopf 2013-09-24 17:50:47 UTC
It would be better to show with screenshots how it should look. I have opened the file with Calc and tried different filters, without any idea what I should be looking for.
Which program was the file created with?
Which character-set was chosen?
It seems Calc doesn't know the right character-set for the file. But Calc (and also Base) can change the character-set in many ways.
Comment 2 Urmas 2013-09-24 22:51:55 UTC
Created attachment 86493 [details]
Screenshot
Comment 3 Robert Großkopf 2013-09-25 19:20:19 UTC
Seems to be a problem, because the field isn't created by the same character-set as the content.
When I try to get the right field-description, I have to choose "Kyrillisch DOS/OS2-866" in the German version of LO.
When I try to get the right content-description, I have to use "Kyrillisch PT154" or "Kyrillisch Windows-1251".
It's the same behavior in all LO-versions and in Base and Calc. I see the same behavior in AOO 4.0. So I don't know whether this is a bug of LO or a bug of the program the file was created with ...
Comment 4 Urmas 2013-09-26 13:38:49 UTC
I think VFP screenshot gives an unambiguous answer whether this is an LO bug or not.
Comment 5 Robert Großkopf 2013-09-26 14:39:07 UTC
Created attachment 86654 [details]
Two charsets in the same *.dbf

When I open the *.dbf with different charsets, I can see either the field-description or the field-content displayed correctly, but not both at once. If a program works with different charsets for the content and the header in the same *.dbf, the program itself should present the right content. But how should another program recognize it?
I have no way here to open the file the way you show it in the Visual FoxPro screenshot. So I can't confirm that this is wrong behavior of LO/AOO/OOo.
Let us hope somebody else will read this and test it, for example, with MS Excel ...
Comment 6 Mike Kaganski 2013-09-28 10:48:02 UTC
Created attachment 86767 [details]
Opening the file with MS Excel 2013

I think MS Excel screenshots give unambiguous answer whether this is an LO bug or not.
Comment 7 Owen Genat (retired) 2014-07-22 04:08:12 UTC
(In reply to comment #3)
> Seems to be a problem, because the field isn't created by the same
> character-set as the content.
> When I try to get the right field-description, I have to choose "Kyrillisch
> DOS/OS2-866" in the German version of LO.
> When I try to get the right content-description, I have to use "Kyrillisch
> PT154" or "Kyrillisch Windows-1251".

I can confirm that when opening the provided DBF in Calc under GNU/Linux using v4.3.0.3 (Build ID: 08ebe52789a201dd7d38ef653ef7a48925e7f9f7), this is displayed for these character sets:

Cyrillic (DOS/OS2-866/Russian):
A1: НАЗВАНИЕ,C,80
A2: ╨єёёъшщ ЄхъёЄ

Cyrillic (PT154):
A1: ҚҖҮӮҖҚҲ…,C,80
A2: Русский текст

i.e., using DOS/OS2-866/Russian, A1 displays the field-description as "NAME", and using PT154, A2 displays the content as "Russian text". This would seem consistent with what Robert has indicated under Base.
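
A minimal sketch of the same observation outside of Calc, assuming the byte values are the ones visible in the hexdump in comment 11 (byte order within each 16-bit word already swapped back). The codec names are Python's own; `FIELD_NAME` and `CONTENT` are illustrative names, not anything from LO.

```python
# Bytes of the field name and of the record content, taken from the hexdump
# in comment 11 (assumption: copied by hand, byte-swapped per 16-bit word).
FIELD_NAME = bytes.fromhex("8d 80 87 82 80 8d 88 85")
CONTENT = bytes.fromhex("d0 f3 f1 f1 ea e8 e9 20 f2 e5 ea f1 f2")

# Decode both parts with each candidate charset and compare.
for codec in ("cp866", "ptcp154", "cp1251"):
    name = FIELD_NAME.decode(codec, errors="replace")
    text = CONTENT.decode(codec, errors="replace")
    print(f"{codec:8} name={name!r} content={text!r}")
```

Only cp866 gives a readable field name, and only PT154 or Windows-1251 give readable content, so no single charset choice decodes both parts.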

(In reply to comment #5)
> If a program works with different charsets for the content and the header in
> the same *.dbf, the program itself should present the right content. But how
> should another program recognize it?

Agreed. At the very least this would be an enhancement request to expand the existing functionality of DBF import to offer field-by-field character set specification or to cater for a quirk with how VFP writes these files out.

(In reply to comment #6)
> I think MS Excel screenshots give unambiguous answer whether this is an LO
> bug or not.

Given that MS Excel experiences the same import issue I am tossing this report in the NEEDINFO bucket. It requires developer input as to what is feasible with handling DBF files with multiple character sets.
Comment 8 Alex Thurgood 2015-01-03 17:39:59 UTC
Adding self to CC if not already on
Comment 9 QA Administrators 2015-07-18 17:35:16 UTC Comment hidden (obsolete)
Comment 10 QA Administrators 2015-09-04 03:00:33 UTC Comment hidden (obsolete)
Comment 11 Julien Nabet 2017-06-17 16:18:37 UTC
Following recent dBase commits (see https://cgit.freedesktop.org/libreoffice/core/log/?qt=grep&q=dbase), the dbf file opens with RTL_TEXTENCODING_IBM_866 (Russian MS-DOS code page 866).
A hexdump of the file shows this:
0000000 0d30 1809 0001 0000 0148 0051 0000 0000
0000010 0000 0000 0000 0000 0000 0000 6500 0000
0000020 808d 8287 8d80 8588 0000 4300 0001 0000
0000030 0050 0004 0000 0000 0000 0000 0000 0000
0000040 000d 0000 0000 0000 0000 0000 0000 0000
0000050 0000 0000 0000 0000 0000 0000 0000 0000
*
0000140 0000 0000 0000 0000 d020 f1f3 eaf1 e9e8
0000150 f220 eae5 f2f1 2020 2020 2020 2020 2020
0000160 2020 2020 2020 2020 2020 2020 2020 2020
*
0000190 2020 2020 2020 2020 1a20               
000019a

Let's read it in a little-endian way, so the first byte is 30, not 0d.
30 is the version and corresponds here to a Visual FoxPro file (see http://opengrok.libreoffice.org/xref/core/connectivity/source/inc/dbase/DTable.hxx#40)
65 (in the second line) indicates RTL_TEXTENCODING_IBM_866.
The third line gives the field name and its field type, and the "50" at the beginning of the fourth line gives the field length (80 in decimal).
But then lines 7 and 8 give the content of the record but nothing about its encoding.
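
The reading above can be reproduced with a short sketch. It assumes the commonly documented DBF layout (version at offset 0, code page mark at offset 29, first field descriptor at offset 32 with an 11-byte name, the type at +11 and the length at +16); `words_to_bytes` and the variable names are illustrative, not LO code.

```python
import struct

# First four lines of the hexdump above, without the offset column.
WORDS = """
0d30 1809 0001 0000 0148 0051 0000 0000
0000 0000 0000 0000 0000 0000 6500 0000
808d 8287 8d80 8588 0000 4300 0001 0000
0050 0004 0000 0000 0000 0000 0000 0000
"""

def words_to_bytes(dump: str) -> bytes:
    """Undo hexdump's little-endian 16-bit grouping: '0d30' -> 30 0d."""
    return b"".join(bytes.fromhex(word)[::-1] for word in dump.split())

raw = words_to_bytes(WORDS)

# File header: version, date (YY MM DD), record count, header size, record size.
version, yy, mm, dd, n_records, header_len, record_len = struct.unpack_from(
    "<4BIHH", raw, 0
)
code_page_mark = raw[29]

# First field descriptor (starts at offset 32).
field_name = raw[32:43].rstrip(b"\x00")
field_type = chr(raw[43])
field_len = raw[48]

print(f"version=0x{version:02x} code_page_mark=0x{code_page_mark:02x}")
print(f"records={n_records} header_len={header_len} record_len={record_len}")
print(f"field={field_name.decode('cp866')} type={field_type} len={field_len}")
```

The code page mark covers the whole file, which is why nothing in this part of the header says how the record bytes themselves were encoded.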

So I don't know how LO could "guess" the encoding of the content except by testing the value ranges of charsets, e.g.:
d0 in https://www.ascii-codes.com/cp866.html gives "Box drawings up double and horizontal single"
d0 in http://www.iana.org/assignments/charset-reg/PTCP154 gives "CYRILLIC CAPITAL LETTER ER"
But even with this, a user could want some non-Cyrillic characters (box drawings) in the content, and the guessing would be wrong.
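
To make the idea and its weakness concrete, here is a rough sketch of such a range test. It is not what LO does; the scoring rule (share of characters that land in the basic Russian alphabet) and the names `russian_score` and `guess_codec` are assumptions made up for this illustration.

```python
# Bytes from the hexdump above (byte-swapped per 16-bit word).
FIELD_NAME = bytes.fromhex("8d 80 87 82 80 8d 88 85")
CONTENT = bytes.fromhex("d0 f3 f1 f1 ea e8 e9 20 f2 e5 ea f1 f2")

def russian_score(data: bytes, codec: str) -> float:
    """Share of non-space characters that decode to basic Russian letters."""
    text = data.decode(codec, errors="replace")
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    hits = sum(("\u0410" <= c <= "\u044f") or c in "Ёё" for c in chars)
    return hits / len(chars)

def guess_codec(data: bytes, candidates=("cp866", "cp1251")) -> str:
    """Pick the candidate with the highest score (first one wins a tie)."""
    return max(candidates, key=lambda codec: russian_score(data, codec))

print(guess_codec(FIELD_NAME))       # header bytes
print(guess_codec(CONTENT))          # record bytes
print(guess_codec(b"\xc4\xc4\xc4"))  # a cp866 box-drawing rule
```

For this file the test picks cp866 for the field name and cp1251 for the content. The last call shows the failure mode mentioned above: three cp866 box-drawing bytes score as perfect Cyrillic under cp1251, so the guess would be wrong for a user who really wanted the line.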

BTW, I would be interested in original dbf files in different versions (DB2, DB3, DB4... with memo, with sql, ...FoxPro, etc.) and encodings.