Bug 125496 - SDI file with Greek letters does not work
Summary: SDI file with Greek letters does not work
Status: RESOLVED DUPLICATE of bug 108910
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer
Version (earliest affected): 3.3.0 release
Hardware: All Windows (All)
Importance: medium normal
Assignee: Andreas Heinisch
URL:
Whiteboard: target:7.4.0 target:7.3.1
Keywords:
Depends on:
Blocks: TableofContents-Indexes
Reported: 2019-05-25 22:02 UTC by Michael Herbst
Modified: 2022-01-31 18:33 UTC

See Also:
Crash report or crash signature:


Attachments
SDI File requested by Xisco Faulí (14.16 KB, text/plain)
2019-06-11 01:15 UTC, Michael Herbst

Description Michael Herbst 2019-05-25 22:02:01 UTC
Description:
I made an index of sentences in various languages for a book, using a different SDI file for each language. All worked as expected except the Greek one, and you can easily see why if you edit the attached file with the SDI editor in Writer, which also did not recognize a UTF-8 BOM that I added to the file. I regard this as a bug, because *any* simple text editor that is not completely outdated (including good old Notepad) will recognize the file without any problem.


Steps to Reproduce:
1. Edit the attached SDI file and you will see that it cannot work.

Actual Results:
Greek letters are scrambled

Expected Results:
Show the letters unscrambled and process the file


Reproducible: Always


User Profile Reset: No



Additional Info:
[Information automatically included from LibreOffice]
Locale: en-GB
Module: TextDocument
[Information guessed from browser]
OS: Windows (All)
OS is 64bit: no
Comment 1 Xisco Faulí 2019-06-10 15:50:44 UTC
Thank you for reporting the bug. Please attach a sample document, as this makes it easier for us to verify the bug. 
(Please note that the attachment will be public, remove any sensitive information before attaching it. 
See https://wiki.documentfoundation.org/QA/FAQ#How_can_I_eliminate_confidential_data_from_a_sample_document.3F for help on how to do so.)

I have set the bug's status to 'NEEDINFO'. Please change it back to 'UNCONFIRMED' once the requested document is provided.
Comment 2 Michael Herbst 2019-06-11 01:15:57 UTC
Created attachment 152086
SDI File requested by Xisco Faulí
Comment 3 Xisco Faulí 2019-06-11 07:38:27 UTC
It works fine for me in

Version: 6.1.4.2
Build ID: 1:6.1.4-0ubuntu0.16.04.1~lo2
CPU threads: 4; OS: Linux 4.15; UI render: default; VCL: gtk3; 
Locale: en-AU (ca_ES.UTF-8); Calc: group threaded

Could you please paste the info from Help - About LibreOffice?

I have set the bug's status to 'NEEDINFO'. Please change it back to
'UNCONFIRMED' once the information has been provided
Comment 4 Michael Herbst 2019-06-11 12:16:09 UTC
Help - About shows:

Version: 6.1.6.3
Build ID: 5896ab1714085361c45cf540f76f60673dd96a72
CPU threads: 4; OS: Windows 6.1; UI render: default; 
Locale: de-DE (de_DE); Calc: group threaded

It's obviously the encoding which scrambles the Greek letters.
Comment 5 Julien Nabet 2019-06-21 14:36:33 UTC
On Windows 10 with 6.2.4 or with master sources updated today, I could reproduce this.

I just opened Writer, then used File/Open and selected the SDI file.

Here are console logs which may be relevant:
Throwing InvalidHeaderException
Throwing InvalidHeaderException
warn:oox.storage:36748:26504:oox/source/helper/zipstorage.cxx:67: ZipStorage::ZipStorage exception opening input storage com.sun.star.io.IOException
Throwing InvalidHeaderException
Throwing InvalidHeaderException
AbiDocument::isFileFormatSupported
Found xml parser severity error Document is empty

Throwing InvalidHeaderException
warn:oox.storage:36748:26504:oox/source/helper/zipstorage.cxx:67: ZipStorage::ZipStorage exception opening input storage com.sun.star.io.IOException
...
VisioDocument: version 0
Found xml parser severity error Document is empty
Comment 6 Julien Nabet 2019-06-21 14:49:58 UTC
Sorry, please disregard my previous comment.
Looking for information about SDI files, I followed this link (in French) to open the SDI file correctly:
https://dutailly.net/un-fichier-de-concordance-pour-indexer-un-document

On Win10 with master sources updated today, I get scrambled letters but no specific console logs.
Comment 7 Julien Nabet 2019-06-21 15:02:23 UTC
The UI comes from "createautomarkdialog.ui".

This file is used by sw/source/ui/index/cnttab.cxx.
Searching for "encod" there gives 3 locations:
3815  void SwEntryBrowseBox::ReadEntries(SvStream& rInStr)
3816  {
3817      AutoMarkEntry* pToInsert = nullptr;
3818      rtl_TextEncoding  eTEnc = osl_getThreadTextEncoding();
3819      while (rInStr.good())

3866  void SwEntryBrowseBox::WriteEntries(SvStream& rOutStr)
3867  {
3868      //check if the current controller is modified
...
3878      rtl_TextEncoding  eTEnc = osl_getThreadTextEncoding();
3879      for(std::unique_ptr<AutoMarkEntry> & rpEntry : m_Entries)

3956  IMPL_LINK_NOARG(SwAutoMarkDlg_Impl, OkHdl, Button*, void)
3957  {
3958      bool bError = false;
3959      if(m_pEntriesBB->IsModified() || bCreateMode)
3960      {
3961          SfxMedium aMed( sAutoMarkURL,
3962                          bCreateMode ? StreamMode::WRITE
3963                                      : StreamMode::WRITE| StreamMode::TRUNC );
3964          SvStream* pStrm = aMed.GetOutStream();
3965          pStrm->SetStreamCharSet( RTL_TEXTENCODING_MS_1253 );

So it seems the code doesn't try to detect the encoding of the file.
Also, line 3965 seems odd to me: why the fixed encoding RTL_TEXTENCODING_MS_1253?
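
To illustrate what "detecting the encoding" could mean here, the following is a minimal standalone sketch, not LibreOffice code. It checks whether a byte buffer is well-formed UTF-8, which a reader could use to choose UTF-8 over the thread encoding. The function name looksLikeUtf8 is hypothetical, and the fix that was eventually committed reads and writes the entries as UTF-8 rather than guessing.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not a LibreOffice API): returns true when `bytes`
// is well-formed UTF-8. Pure ASCII input also returns true.
bool looksLikeUtf8(const std::string& bytes)
{
    const std::size_t n = bytes.size();
    std::size_t i = 0;
    while (i < n)
    {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        if (lead < 0x80) { ++i; continue; }          // single-byte ASCII

        std::size_t extra = 0;   // number of continuation bytes expected
        unsigned long cp = 0;    // decoded code point
        unsigned long min = 0;   // smallest code point valid for this length
        if ((lead & 0xE0) == 0xC0)      { extra = 1; cp = lead & 0x1F; min = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; min = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; min = 0x10000; }
        else return false;                           // stray continuation or invalid lead

        if (n - i <= extra) return false;            // sequence is truncated
        for (std::size_t k = 1; k <= extra; ++k)
        {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80) return false;    // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        // Reject overlong forms, surrogates and values beyond U+10FFFF.
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;
        i += extra + 1;
    }
    return true;
}
```

Text saved in a single-byte code page such as Windows-1252 or Windows-1253 usually fails this check as soon as it contains a non-ASCII character, which is why the heuristic works reasonably well in practice.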
Comment 8 Julien Nabet 2019-06-21 15:09:27 UTC
I checked the attached file with an online hex editor and it doesn't contain a BOM
(which would be the sequence 0xEF,0xBB,0xBF since it's UTF-8, see 
https://en.wikipedia.org/wiki/Byte_order_mark)
Anyway, I also tried a file with a BOM and it doesn't change anything: the letters are still scrambled, which is no surprise considering the LO code (see my previous comment).
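
For reference, a BOM check on the first three bytes can be sketched as below. This is a standalone illustration with hypothetical helper names (hasUtf8Bom, stripUtf8Bom), not code from cnttab.cxx.

```cpp
#include <string>

// Hypothetical helper: true when the buffer starts with the UTF-8
// byte order mark, i.e. the bytes 0xEF 0xBB 0xBF.
bool hasUtf8Bom(const std::string& bytes)
{
    return bytes.size() >= 3
        && static_cast<unsigned char>(bytes[0]) == 0xEF
        && static_cast<unsigned char>(bytes[1]) == 0xBB
        && static_cast<unsigned char>(bytes[2]) == 0xBF;
}

// Hypothetical helper: returns the content without a leading BOM,
// or the content unchanged when there is none.
std::string stripUtf8Bom(const std::string& bytes)
{
    return hasUtf8Bom(bytes) ? bytes.substr(3) : bytes;
}
```

A reader that honours the BOM would call stripUtf8Bom on the file content and then decode the remainder as UTF-8.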
Comment 9 Julien Nabet 2019-06-21 15:29:39 UTC
Continuing to debug, I put some traces on the 3 methods quoted in comment 7.
I confirm that opening the file goes into SwEntryBrowseBox::ReadEntries.

"osl_getThreadTextEncoding()" returns 1 (so "RTL_TEXTENCODING_MS_1252", see https://opengrok.libreoffice.org/xref/core/include/rtl/textenc.h?r=189abcf0#38)
to the "eTEnc" variable (type "rtl_TextEncoding").

Forcing "eTEnc" to "RTL_TEXTENCODING_UTF8" makes the Greek characters display correctly.
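
The scrambling itself can be reproduced outside LibreOffice. The sketch below is a standalone illustration with hypothetical function names, and it does not use the rtl conversion routines. It decodes each input byte as one Windows-1252 character and re-encodes the result as UTF-8. Fed the UTF-8 bytes of "α" (0xCE 0xB1), it produces the two characters "Î±", which is the kind of garbage seen in the dialog.

```cpp
#include <cstdint>
#include <string>

// Append a code point below U+10000 to `out` as UTF-8.
void appendUtf8(std::string& out, std::uint32_t cp)
{
    if (cp < 0x80)
    {
        out += static_cast<char>(cp);
    }
    else if (cp < 0x800)
    {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else
    {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: treat every byte as a Windows-1252 character.
// Bytes 0x80..0x9F need a table; all other bytes map to the code point
// with the same value. The five bytes that Windows-1252 leaves undefined
// are mapped to the C1 control with the same value (an assumption).
std::string decodeAsWindows1252(const std::string& bytes)
{
    static const std::uint16_t kHigh[32] = {
        0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
        0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
        0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
        0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178
    };
    std::string out;
    for (unsigned char b : bytes)
    {
        const std::uint32_t cp = (b >= 0x80 && b <= 0x9F) ? kHigh[b - 0x80] : b;
        appendUtf8(out, cp);
    }
    return out;
}
```

Every Greek letter occupies two bytes in UTF-8, so a single-byte decoder always turns one letter into two unrelated characters, whatever the thread encoding happens to be.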
Comment 10 Andreas Heinisch 2022-01-24 19:53:01 UTC

*** This bug has been marked as a duplicate of bug 108910 ***
Comment 11 Commit Notification 2022-01-30 20:36:28 UTC
Andreas Heinisch committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/7e6e0fd63eac57de0f76ab1efdb1283c22ad6e6c

tdf#108910, tdf#125496 - read/write index entries using utf8

It will be available in 7.4.0.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 12 Commit Notification 2022-01-31 18:33:39 UTC
Andreas Heinisch committed a patch related to this issue.
It has been pushed to "libreoffice-7-3":

https://git.libreoffice.org/core/commit/4dc4dfe0f249f454291a2d57e28f11342421bb00

tdf#108910, tdf#125496 - read/write index entries using utf8

It will be available in 7.3.1.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.