Bug 108910 - Concordance file for indexes breaks UTF-8 and turns the characters into ASCII
Status: RESOLVED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer
Version (earliest affected): 4.4.7.2 release
Hardware: x86-64 (AMD64) Windows (All)
Importance: medium normal
Assignee: Andreas Heinisch
URL:
Whiteboard: target:7.4.0 target:7.3.1
Keywords: needUITest
Duplicates: 125496
Depends on:
Blocks: Concordance-File
Reported: 2017-07-02 20:31 UTC by George Acu
Modified: 2022-01-31 18:33 UTC (History)
CC List: 3 users

See Also:
Crash report or crash signature:


Attachments
Concordance UTF-8 file containing characters with diacritics (43 bytes, application/octet-stream)
2017-07-06 11:57 UTC, George Acu
Concordance UTF-8 file containing characters with diacritics (115 bytes, application/octet-stream)
2017-07-06 12:14 UTC, George Acu
Sample .fodt file that uses that .sdi attachment (619.09 KB, application/vnd.oasis.opendocument.text)
2017-07-06 12:14 UTC, George Acu

Description George Acu 2017-07-02 20:31:16 UTC
In LibreOffice Writer, when building an index for a Romanian-language document, I used a UTF-8 encoded concordance file; this encoding correctly preserves all the letters with diacritics (ș, ț, â, î, ă, etc.).
After the file is loaded into Writer, the characters with diacritics are converted from UTF-8 to ASCII, so ș, ț, â, î, ă, etc. are replaced by ?. As a result, the index cannot be built automatically for Romanian (and probably not for other languages that use letters with diacritics); it can only be built manually, which is very time-consuming.
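The "?" substitution described above is what a lossy conversion to a narrower character set typically produces. The following is a minimal Python sketch of that effect, not LibreOffice code; the helper name `lossy_roundtrip` is hypothetical, and `cp1252` is used only as one example of a Windows "ANSI" code page (the code page actually involved depends on the system locale).

```python
# Illustration only: shows how characters outside a target character set
# degrade to "?" when text is converted with replacement. Not LibreOffice code.

def lossy_roundtrip(text: str, codec: str) -> str:
    """Encode text to `codec`, replacing unmappable characters, then decode it back."""
    return text.encode(codec, errors="replace").decode(codec)

terms = ["ș", "ț", "â", "î", "ă", "Paweł", "Lukáš"]

for term in terms:
    print(
        term,
        "| ascii:", lossy_roundtrip(term, "ascii"),
        "| cp1252:", lossy_roundtrip(term, "cp1252"),
        "| utf-8:", lossy_roundtrip(term, "utf-8"),
    )
```

With ASCII every letter with a diacritic is lost; with a Windows code page only the letters missing from that particular code page are lost, which may explain why the damage looks different from one system to another. UTF-8 round-trips all of them.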
Comment 1 George Acu 2017-07-02 20:34:12 UTC
I noticed that this feature works correctly on Linux, but not on Windows (I use Windows 7 SP1).
Comment 2 Buovjaga 2017-07-06 11:45:38 UTC
Please attach an example concordance file and any other files needed to reproduce this.

Set to NEEDINFO.
Change back to UNCONFIRMED after you have provided the document(s).
Comment 3 George Acu 2017-07-06 11:57:13 UTC
Created attachment 134513
Concordance UTF-8 file containing characters with diacritics

Here it is
Comment 4 Buovjaga 2017-07-06 12:02:04 UTC
(In reply to George Acu from comment #3)
> Created attachment 134513
> Concordance UTF-8 file containing characters with diacritics
> 
> Here it is

Ok, now what to do with this?
Comment 5 George Acu 2017-07-06 12:14:11 UTC
Created attachment 134514
Concordance UTF-8 file containing characters with diacritics
Comment 6 George Acu 2017-07-06 12:14:54 UTC
Created attachment 134515
Sample .fodt file that uses that .sdi attachment
Comment 7 George Acu 2017-07-06 12:16:07 UTC
Please ignore comment #3.
I have attached a sample .fodt file that uses the second .sdi attachment (the one containing terms with diacritics). The language of the .fodt file is Romanian.
Comment 8 George Acu 2017-07-06 12:33:42 UTC
Simply open the .fodt document, go to the alphabetical index on page 12, right-click it and choose Edit Index, and find the concordance file on the first tab. Open the file and you will see some altered characters (LibreOffice Writer converts the concordance file from UTF-8 to ASCII on Windows 7).
Comment 9 Buovjaga 2017-07-06 12:57:59 UTC
Thanks. I found a duplicate.

I tested on Linux and this seems to be Win-only.

Curiously, on Windows it produces absolutely nothing in the index: no entries.

*** This bug has been marked as a duplicate of bug 81409 ***
Comment 10 Buovjaga 2017-07-06 13:01:15 UTC
Ah, sorry I made a mistake, it is not a duplicate.

Well, I guess this can be set to NEW, as I could not get the index to produce any entries.

Version: 6.0.0.0.alpha0+ (x64)
Build ID: e0f67add2ec56706ce06a03572535266f21c0303
CPU threads: 4; OS: Windows 6.19; UI render: default; 
TinderBox: Win-x86_64@42, Branch:master, Time: 2017-06-27_23:04:56
Locale: fi-FI (fi_FI); Calc: group

Version: 4.4.7.2
Build ID: f3153a8b245191196a4b6b9abd1d0da16eead600
Locale: fi_FI

Arch Linux 64-bit, KDE Plasma 5
Version: 6.0.0.0.alpha0+
Build ID: 9eed346b0b745f0598eefc572c789d58353b5e31
CPU threads: 8; OS: Linux 4.11; UI render: default; VCL: kde4; 
Locale: fi-FI (fi_FI.UTF-8); Calc: group
Built on July 5th 2017
Comment 11 QA Administrators 2018-07-07 02:39:09 UTC Comment hidden (obsolete)
Comment 12 Pietro 2020-03-06 10:43:21 UTC
Version: 6.4.0.3 (x64)
Build ID: b0a288ab3d2d4774cb44b62f04d5d28733ac6df8
CPU threads: 16; OS: Windows 10.0 Build 18363; UI render: GL; VCL: win; 
Locale: it-IT (it_IT); UI-Language: it-IT
Calc: threaded
I confirm the existence of this bug under Windows 10: all diacritics are lost because Writer imports the index.sdi file as ANSI-encoded instead of UTF-8.
Using the attached files gives the same result.

Version: 6.1.5.2
Build ID: 1:6.1.5-3+deb10u5
CPU threads: 16; OS: Linux 4.4; UI render: default; VCL: x11; 
Locale: en-US (en_US.UTF-8); Calc: group threaded
In the Debian Linux subsystem, at the very same time (I opened the same .odm document twice, once in Windows and once in Linux), the index works flawlessly.
Using the attached files gives the same result.

Notepad++ confirms that the file is UTF-8 encoded.

How to reproduce:
1) Create a new Writer document.
2) Type some words with diacritics (e.g. "Paweł", "Lukáš").
3) Insert an Alphabetical Index and select "use concordance file" - new file.
4) Paste your words into the first column of the table.
5) Confirm.
5.1) Click on edit file: your diacritics are now mere "?".
6) Save and enjoy your empty index.
7) Open the very same file in any Linux distro.
8) Copy your words with diacritics.
8.1) Edit index - edit concordance file - paste your words again - save - OK.
9) Enjoy your index.
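
For the steps above, a test concordance file can also be generated outside Writer so that its encoding is known for certain. The sketch below is only an illustration: it assumes the semicolon-separated line layout described in the LibreOffice help (search term; alternative entry; 1st key; 2nd key; match case; word only), and the file name `index.sdi` and all function names are hypothetical.

```python
import tempfile
from pathlib import Path


def concordance_line(term: str, match_case: bool = False, word_only: bool = False) -> str:
    # Assumed field order: term;alternative entry;1st key;2nd key;match case;word only
    return f"{term};;;;{int(match_case)};{int(word_only)}"


def write_concordance(path: Path, terms: list[str]) -> bytes:
    """Write one line per term as UTF-8 bytes and return the bytes written."""
    text = "\n".join(concordance_line(t) for t in terms) + "\n"
    data = text.encode("utf-8")
    path.write_bytes(data)
    return data


def is_valid_utf8(data: bytes) -> bool:
    """Strict check, similar in spirit to confirming the encoding in an editor."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# Demo in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "index.sdi"
    written = write_concordance(target, ["Paweł", "Lukáš"])
    print("valid UTF-8:", is_valid_utf8(target.read_bytes()))
    print("raw bytes:", written)
```

Selecting a file produced this way in the index dialog on Windows and on Linux makes it easier to tell whether Writer or the file itself is responsible for the lost characters.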

Could this be a problem related to Calc? It is involved in creating the table.
Comment 13 Pietro 2020-09-14 08:47:33 UTC
I'm still trying to update my index file, and the very same problem occurs with the latest Writer, 7.0.1.2, on Windows 10 version 2004 (build 19041.508).
Comment 14 Pietro 2021-02-10 10:47:34 UTC
Version: 7.1.0.3 (x64) / LibreOffice Community
Build ID: f6099ecf3d29644b5008cc8f48f42f4a40986e4c
CPU threads: 16; OS: Windows 10.0 Build 19042; UI render: Skia/Raster; VCL: win
Locale: it-IT (it_IT); UI: it-IT
Calc: CL

I tried again with no luck.
Comment 15 Andreas Heinisch 2022-01-24 19:53:01 UTC
*** Bug 125496 has been marked as a duplicate of this bug. ***
Comment 16 Commit Notification 2022-01-30 20:36:20 UTC
Andreas Heinisch committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/7e6e0fd63eac57de0f76ab1efdb1283c22ad6e6c

tdf#108910, tdf#125496 - read/write index entries using utf8

It will be available in 7.4.0.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
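
The commit subject above, "read/write index entries using utf8", names the general remedy: fix the encoding explicitly when reading and writing instead of relying on the platform default. The following Python sketch illustrates that idea only; it is not taken from the patch, and the function names and sample entries are hypothetical.

```python
import locale
import tempfile
from pathlib import Path


def save_entries(path: Path, lines: list[str]) -> None:
    # Explicit encoding: the result does not depend on the system locale.
    # newline="\n" keeps line endings identical on every platform.
    with open(path, "w", encoding="utf-8", newline="\n") as fh:
        for line in lines:
            fh.write(line + "\n")


def load_entries(path: Path) -> list[str]:
    # Read with the same explicit encoding that was used for writing.
    with open(path, "r", encoding="utf-8") as fh:
        return [line.rstrip("\n") for line in fh]


with tempfile.TemporaryDirectory() as tmp:
    sdi = Path(tmp) / "index.sdi"
    original = ["ș;;;;0;0", "Paweł;;;;0;0"]
    save_entries(sdi, original)
    print("round trip intact:", load_entries(sdi) == original)
    # For comparison: the encoding that would be used if none were specified.
    print("platform default encoding:", locale.getpreferredencoding(False))
```

On systems where the platform default is already UTF-8 the explicit parameter changes nothing, which would be consistent with the reports above that the problem appeared on Windows but not on Linux.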
Comment 17 Andreas Heinisch 2022-01-30 20:45:50 UTC
Unfortunately, the widget does not yet support UI testing, but the test files are in the previous patch sets.
Comment 18 Commit Notification 2022-01-31 18:33:31 UTC
Andreas Heinisch committed a patch related to this issue.
It has been pushed to "libreoffice-7-3":

https://git.libreoffice.org/core/commit/4dc4dfe0f249f454291a2d57e28f11342421bb00

tdf#108910, tdf#125496 - read/write index entries using utf8

It will be available in 7.3.1.
