Bug 117115 - Firebird: migration from HSQLDB does not respect language-specific characters
Status: VERIFIED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Base
Version (earliest affected): 6.1.0.0.alpha0+
Hardware: All All
Importance: high major
Assignee: Not Assigned
URL:
Whiteboard: target:6.2.0
Keywords: dataLoss
Depends on:
Blocks: Database-Firebird-Migration
Reported: 2018-04-19 23:07 UTC by Gerhard Weydt
Modified: 2019-02-01 12:34 UTC

Attachments
test doc with HSQLDB (10.51 KB, application/vnd.sun.xml.base)
2018-04-19 23:10 UTC, Gerhard Weydt
test doc after migration to Firebird (11.23 KB, application/vnd.sun.xml.base)
2018-04-19 23:11 UTC, Gerhard Weydt

Description Gerhard Weydt 2018-04-19 23:07:32 UTC
When migrating an existing database document from HSQLDB to Firebird, language-specific characters in field names are not transferred correctly. For example, German "ß" is changed to "/u00df", its Unicode equivalent, in the _name_ of the field.
This is not a problem of Firebird, and not even of Firebird within LibreOffice: you can create fields with names using language-specific characters, and they are saved correctly and reproduced when the document is opened; I've tested it. So this is a deficiency in the migration program.
I will add test documents: a database document with one table whose field names (except for the first, which is simply an id) consist of one generally used Latin character and one language-specific character: "?", the three German umlauts, and three French (and partly Italian ...) characters, "a" with the three accents. I will also add the result after the migration to Firebird.
These are examples of language-specific characters easily available on my keyboard. For all of them, the Unicode representation is used for the name in Firebird, which is certainly a problem, if not a catastrophe, for all programs using that table, because all references to these fields stop working.
This is a grave problem, because non-European languages have lots of such non-Latin characters, which must be migrated correctly.
Comment 1 Gerhard Weydt 2018-04-19 23:10:28 UTC
Created attachment 141495
test doc with HSQLDB
Comment 2 Gerhard Weydt 2018-04-19 23:11:23 UTC
Created attachment 141496
test doc after migration to Firebird
Comment 3 Robert Großkopf 2018-04-20 05:49:29 UTC
I can confirm this buggy behavior:
Field names of a table aren't migrated correctly if special characters have been used.
Tested with
Version: 6.1.0.0.alpha0+
Build-ID: cc10b063235dcb25ad16f697ea0b1ff91a10bacb
CPU threads: 4; OS: Linux 4.4; UI render: default; VCL: kde4;
TinderBox: Linux-rpm_deb-x86_64@70-TDF, Branch:master, Time: 2018-04-18_13:21:28
Comment 4 Drew Jensen 2018-04-24 03:12:12 UTC
Once more, just to note: this works as expected with the existing code used for a simple drag-and-drop of the table from hsql->fb odb.

Is there some reason this is all being reimplemented, apparently from scratch?
Comment 5 Commit Notification 2018-05-25 13:28:11 UTC
Tamas Bunth committed a patch related to this issue.
It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=ded4dcbbce875efeffba7e894a6dea1f584e8e9b

tdf#117115 dbahsql: respect unicode in columns

It will be available in 6.1.0.

The patch should be included in the daily builds available at
http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
http://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 6 Stephan Bergmann 2018-05-25 15:02:03 UTC
(In reply to Commit Notification from comment #5)
> Tamas Bunth committed a patch related to this issue.
> It has been pushed to "master":
> 
> http://cgit.freedesktop.org/libreoffice/core/commit/?id=ded4dcbbce875efeffba7e894a6dea1f584e8e9b
> 
> tdf#117115 dbahsql: respect unicode in columns

For one, that only fixes names of columns, not names of tables.

For another, a name "f\u2345bar" is erroneously converted to "f⍅bar".
Comment 7 Stephan Bergmann 2018-05-25 15:26:24 UTC
...and a name using non-BMP chars like "💩" (U+1F4A9 PILE OF POO) is converted to something like "?" (the "in-transit" representation appears to be using an encoding of individual UTF-16 code units, "\ud83d\udca9", and the newly added lcl_ConvertToUTF8 tries to convert them back to UTF-8 individually with

  const OString sNewChar = OString(&cDec, 1, RTL_TEXTENCODING_UTF8);

which doesn't work).
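
The surrogate-pair arithmetic behind this can be sketched in Python. This only illustrates the general UTF-16 rule and is not the code in lcl_ConvertToUTF8; the function name `combine_surrogates` is hypothetical.

```python
def combine_surrogates(high: int, low: int) -> str:
    """Combine a UTF-16 high/low surrogate pair into one character."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    code_point = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
    return chr(code_point)

# A lone surrogate has no UTF-8 encoding, so converting the two code units
# one at a time cannot produce the intended character.
try:
    chr(0xD83D).encode("utf-8")
    lone_surrogate_encodes = True
except UnicodeEncodeError:
    lone_surrogate_encodes = False
print(lone_surrogate_encodes)   # False

# Combining the pair first yields U+1F4A9, which encodes to four UTF-8 bytes.
pile = combine_surrogates(0xD83D, 0xDCA9)
print(hex(ord(pile)))           # 0x1f4a9
print(pile.encode("utf-8"))     # b'\xf0\x9f\x92\xa9'
```

A decoder therefore has to look at both escapes of a pair together before converting to UTF-8.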
Comment 8 Tamas Bunth 2018-05-28 09:22:55 UTC
(In reply to Stephan Bergmann from comment #6)
> For another, a name "f\u2345bar" is erroneously converted to "f⍅bar".

I've found this:
http://graphemica.com/%E2%8D%85

According to this web page, \u2345 is "leftwards vane", so it seems to me the conversion is right. What would be the expected result?
Comment 9 Stephan Bergmann 2018-05-28 09:45:20 UTC
(In reply to Tamas Bunth from comment #8)
> (In reply to Stephan Bergmann from comment #6)
> > For another, a name "f\u2345bar" is erroneously converted to "f⍅bar".
> 
> I've found this:
> http://graphemica.com/%E2%8D%85
> 
> According to this web page, \u2345 is "leftwards vane", so it seems to me
> the conversion is right. What would be the expected result?

An entity that was named "f\u2345bar" in the original database should remain named like that in the converted database too, I'd assume.  (It is apparently encoded as something like "f\u005Cu2345bar" when it reaches lcl_ConvertToUTF8, and then erroneously converted to "f⍅bar" there.)
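
One way such a double conversion can arise is sketched below in Python. This is an assumption for illustration only: it does not claim to reproduce how lcl_ConvertToUTF8 works, and both function names are hypothetical. It contrasts a decoder that rescans its own output with one that processes the input in a single pass.

```python
import re

ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")

def decode_rescanning(text: str) -> str:
    """Faulty variant: after each replacement, search again from the start,
    so a decoded backslash can form a new escape with the text after it."""
    while True:
        match = ESCAPE.search(text)
        if match is None:
            return text
        decoded = chr(int(match.group(1), 16))
        text = text[:match.start()] + decoded + text[match.end():]

def decode_single_pass(text: str) -> str:
    """Each escape is decoded once; replaced text is never rescanned."""
    return ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)

# Assumed in-transit form of a name that literally contains a backslash
# followed by "u2345"; the backslash itself is escaped as \u005C.
in_transit = "f\\u005Cu2345bar"

print(decode_rescanning(in_transit))    # f⍅bar      (wrong: decoded twice)
print(decode_single_pass(in_transit))   # f\u2345bar (the original name)
```

The single-pass variant keeps the literal backslash because the text produced by a replacement is never interpreted as an escape again.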
Comment 10 Xisco Faulí 2018-06-06 17:21:39 UTC
ded4dcbbce875efeffba7e894a6dea1f584e8e9b is in master (6-2)
Comment 11 Commit Notification 2018-06-11 08:43:30 UTC
Tamas Bunth committed a patch related to this issue.
It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=647a9fec404ebce898a44de63fcf1b1d6f5036e6

tdf#117115 dbahsql: respect escaped '\'

It will be available in 6.2.0.

Comment 12 Drew Jensen 2018-06-27 17:42:55 UTC
Checked on Ubuntu 18.04 with build:
Version: 6.2.0.0.alpha0+
Build ID: aae64e0f9cd1582c0dc31992aa22b849d2527c80
CPU threads: 4; OS: Linux 4.15; UI render: default; VCL: gtk2; 
TinderBox: Linux-rpm_deb-x86_64@70-TDF, Branch:master, Time: 2018-06-23_02:31:34
Locale: en-US (en_US.UTF-8); Calc: group threaded

Works as expected.
Comment 13 Gerhard Weydt 2018-07-01 20:34:19 UTC
Verified on Windows 10:
Version: 6.2.0.0.alpha0+ (x64)
Build ID: d8733e2c59f120acf9feddff04964becc3358621
CPU threads: 4; OS: Windows 10.0; UI render: GL;
TinderBox: Win-x86_64@62-TDF, Branch:master, Time: 2018-06-26_11:09:03
Locale: de-DE (de_DE); Calc: CL
Comment 14 Stephan Bergmann 2019-01-30 14:31:52 UTC
(In reply to Stephan Bergmann from comment #6)
> (In reply to Commit Notification from comment #5)
> > Tamas Bunth committed a patch related to this issue.
> > It has been pushed to "master":
> > 
> > http://cgit.freedesktop.org/libreoffice/core/commit/?id=ded4dcbbce875efeffba7e894a6dea1f584e8e9b
> > 
> > tdf#117115 dbahsql: respect unicode in columns
> 
> For one, that only fixes names of columns, not names of tables.

I assume the issue with table names has not yet been addressed (cf. bug 121469)?
Comment 15 Stephan Bergmann 2019-02-01 12:34:55 UTC
(In reply to Stephan Bergmann from comment #7)
> ...and a name using non-BMP chars like "💩" (U+1F4A9 PILE OF POO) is
> converted to something like "?" (the "in-transit" representation appears to
> be using an encoding of individual UTF-16 code units, "\ud83d\udca9", and
> the newly added lcl_ConvertToUTF8 tries to convert them back to UTF-8
> individually with
> 
>   const OString sNewChar = OString(&cDec, 1, RTL_TEXTENCODING_UTF8);
> 
> which doesn't work).

addressed now with <https://gerrit.libreoffice.org/67245> "Fix conversion of non-BMP chars"


(In reply to Stephan Bergmann from comment #14)
> I assume the issue with table names has not yet been addressed (cf. bug
> 121469)?

apparently addressed with issue 121469 comment 11