Bug 106110 - UTF-8 text import incorrectly detected as UTF-16 if linebreaks are CRLF
Status: RESOLVED NOTABUG
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Calc
Version (earliest affected): 5.1.4.2 release
Hardware: x86-64 (AMD64) All
Importance: medium minor
Assignee: Not Assigned
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-02-20 17:49 UTC by JMM
Modified: 2020-05-17 09:44 UTC

See Also:
Crash report or crash signature:


Attachments
Sample file containing UTF-8 text with CRLF newlines (716 bytes, application/octet-stream)
2017-02-20 18:43 UTC, JMM

Description JMM 2017-02-20 17:49:12 UTC
Description:
When pasting plain UTF-8 text from an external application into Calc, if the text uses Windows-style CR+LF line breaks (0x0D 0x0A), the text is incorrectly detected as UTF-16.

If the character encoding is manually set to UTF-8, then Calc inserts an empty line after every line.

Steps to Reproduce:
1. Generate some text in any text application where you can be sure that newlines are Windows-style CRLF, and that the encoding is UTF-8, e.g. Notepad++
2. Copy that text, ensuring it has several lines.
3. Paste it over any Calc cell, to open the text import window.
4. Notice the character encoding detected by Calc is UTF-16, not UTF-8.
5. Change the encoding to the correct UTF-8.
6. See how in the preview every even row is empty.
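
A possible explanation for step 6, offered as a sketch rather than a confirmed diagnosis: if the clipboard data is really UTF-16 and is decoded as UTF-8, every CR and LF ends up separated by a NUL character, so a reader that treats CR and LF as individual line breaks sees two breaks per line. The splitting rule in `split_lines` is an assumption made for illustration, not Calc's actual import code.

```python
import re

def split_lines(text):
    # Hypothetical splitter: CRLF, a lone CR, or a lone LF each end a line.
    return re.split(r"\r\n|\r|\n", text)

utf16_bytes = "a\r\nb\r\n".encode("utf-16-le")

# Correct decoding: CR and LF are adjacent, so CRLF counts as one break.
as_utf16 = utf16_bytes.decode("utf-16-le")

# Wrong decoding: every second byte becomes a NUL, which separates CR from LF.
as_utf8 = utf16_bytes.decode("utf-8")

rows_utf16 = split_lines(as_utf16)
rows_utf8 = [row.replace("\x00", "") for row in split_lines(as_utf8)]

print(rows_utf16)
print(rows_utf8)
```

With the wrong decoding, an empty row appears after each line of text, which matches what the preview shows.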

Actual Results:  
If you don't touch anything and the text doesn't include any UTF-8 character combinations which translate to an UTF-16 character, nothing happens. However until I noticed this problem, sometimes I had strange characters in my text imports, which might have been caused by this problem. I can't provide any specific text string to trigger that problem but I am pretty sure it exists.

Expected Results:
The text should correctly be detected as UTF-8 to avoid potential problems, and when doing so, the CRLF combination shouldn't be interpreted as two consecutive newlines.


Reproducible: Always

User Profile Reset: No

Additional Info:
This bug is not triggered when opening a file with the same text, UTF-8 with CRLF. 


User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0
Comment 1 Xisco Faulí 2017-02-20 18:14:29 UTC
Thank you for reporting the bug. Please attach a sample document, as this makes it easier for us to verify the bug.
On the other hand, it seems you're using an old version of LibreOffice.
Could you please try to reproduce it with the latest version of LibreOffice from https://www.libreoffice.org/download/libreoffice-fresh/ ?
I have set the bug's status to 'NEEDINFO'. Please change it back to 'UNCONFIRMED' if the bug is still present in the latest version.
Comment 2 JMM 2017-02-20 18:43:16 UTC
Created attachment 131367
Sample file containing UTF-8 text with CRLF newlines

To trigger the bug correctly, open this file with a text editor capable of maintaining both the UTF-8 encoding and the CRLF newline style. Copy the contents and paste them into any Calc cell to open the text import window. You can also see the difference in encoding detection by opening the file from Calc: the same text import window appears, but with the data coming from a file instead of from the clipboard.
Comment 3 JMM 2017-02-20 18:46:29 UTC
Yes, I noticed I was a bit behind in updating while filing the bug report. I updated to last stable 5.2.5.1 (I've had a few probs with fresh releases so I stick to stable ones), reproduced the bug again, and added a sample text.

I changed the version number and the status to unconfirmed until someone else can reproduce the bug.

Note that the bug only triggers when pasting the text from the clipboard, not when importing from a file, which I find weird.
Comment 4 Buovjaga 2017-03-04 17:24:23 UTC
Reproduced.

Tried with v. 3.6, but it does not separate the utfs, only says "Unicode".

Arch Linux 64-bit, KDE Plasma 5
Version: 5.4.0.0.alpha0+
Build ID: ed0e8f970ff552e75222dc92ed2879aa3b3e5851
CPU threads: 8; OS: Linux 4.9; UI render: default; VCL: kde4; 
Locale: fi-FI (fi_FI.UTF-8); Calc: group
Built on March 4th 2016
Comment 5 QA Administrators 2018-07-27 02:42:02 UTC Comment hidden (obsolete)
Comment 6 JMM 2018-07-27 09:03:33 UTC
I tried with 5.4.7_Win_x64 with the attachment I made a year ago, ANSI encoding with Windows CRLF endlines.

The results are the same; the bug is still there. The encoding is detected incorrectly even if there are no characters that could be interpreted as UTF-16.

Note that the extra added empty lines are shown only in the import panel; if applying the import, the text is imported correctly. It's only in the import panel where the problem lies. However as I said in my initial bug report, working with characters beyond the 7-bit ASCII gave me problems with importing, and rendered strange characters, so it's not only a cosmetic bug. There's something wrong in the detection of the encoding, and in how LibreOffice reencodes such text for internal usage.
Comment 7 himajin100000 2018-08-19 06:22:36 UTC
Are you really sure that the encoding is different from UTF-16 without CRLF and with CR or LF oly?

https://opengrok.libreoffice.org/xref/core/sc/source/ui/dbgui/scuiasciiopt.cxx?r=a5c04cbf#380

https://opengrok.libreoffice.org/xref/core/sc/source/ui/view/viewfun5.cxx?r=18a8cac5#346

It looks to me like this code suggests that, regardless of which encoding and line separator we use, pasting will open the dialog with UTF-16 set as the encoding.

It also looks to me like the encoding you use in your text editor doesn't affect the encoding in which Windows stores string data on its clipboard.
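
To illustrate the clipboard point, here is a minimal sketch. It assumes the editor decodes the file into text and hands that text to Windows as UTF-16 (the `CF_UNICODETEXT` clipboard format). `clipboard_unicode_text` is a hypothetical stand-in for that copy step, not a real API, and the sample string is made up.

```python
def clipboard_unicode_text(file_bytes, file_encoding):
    # Hypothetical model of "copy": decode the file, then store the text as
    # UTF-16LE with a terminating NUL, as the Unicode text format does.
    text = file_bytes.decode(file_encoding)
    return (text + "\x00").encode("utf-16-le")

text = "caf\u00e9\r\nna\u00efve\r\n"

utf8_file = text.encode("utf-8")
cp1252_file = text.encode("cp1252")

from_utf8_file = clipboard_unicode_text(utf8_file, "utf-8")
from_cp1252_file = clipboard_unicode_text(cp1252_file, "cp1252")

# The on-disk bytes differ, but the clipboard payload is identical.
print(utf8_file != cp1252_file)
print(from_utf8_file == from_cp1252_file)
```

Under that assumption, the receiving application only ever sees UTF-16, whatever encoding the file on disk had.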
Comment 8 himajin100000 2018-08-19 06:23:06 UTC
typo: oly => only
Comment 9 Regina Henschel 2018-08-19 11:08:17 UTC
I have tried the attached text with NotePad++, NotePad, PSPad and with Wordpad. All of them put more than one type into the clipboard. The type "Unicode Text Format" is provided by all of these apps. That is UTF-16. Then I have changed the line ends to LF. Again copying results in a type "Unicode Text Format" in the clipboard. If LibreOffice takes this clipboard flavor, the selection UTF-16 in the dialog is correct.

I have used "Free Clipboard Viewer 3.0" to examine the clipboard.

Do you have an application that does not put "Unicode Text Format" into the clipboard?
Comment 10 Buovjaga 2018-08-19 11:24:12 UTC
Hmm, it's true that there is something weird about the attachment. Linux command "file" says:
sample utf-8 crlf text.txt: ASCII text, with CRLF line terminators

However, Kate editor opens it as UTF-8.
Comment 11 Buovjaga 2018-08-19 11:33:56 UTC
If I use the program enca like this:
enca -L none sample.txt

It says:
7bit ASCII characters
  CRLF line terminators
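
The same two properties can be checked by hand. The sketch below is not how `file` or `enca` work internally; it only tests for 7-bit ASCII content and CRLF line endings, using a made-up sample in place of the attachment.

```python
def describe(data):
    # True if every byte is below 0x80, i.e. plain 7-bit ASCII.
    is_ascii = all(byte < 0x80 for byte in data)
    # True if there is at least one LF and every LF is preceded by a CR.
    lf_count = data.count(b"\n")
    crlf_only = lf_count > 0 and lf_count == data.count(b"\r\n")
    return is_ascii, crlf_only

sample = b"first line\r\nsecond line\r\n"  # stand-in for the attachment
non_ascii = "caf\u00e9\n".encode("utf-8")  # contains a two-byte UTF-8 sequence

print(describe(sample))
print(describe(non_ascii))
```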
Comment 12 himajin100000 2018-08-19 13:06:57 UTC
> sample utf-8 crlf text.txt: ASCII text, with CRLF line terminators
> However, Kate editor opens it as UTF-8.

If all the characters are in U+0000 to U+007F, US-ASCII and UTF-8 are byte-for-byte identical, so there is nothing strange in this behavior.
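
A quick check of that equivalence, as a sketch: for code points up to U+007F the UTF-8 bytes equal the ASCII bytes, while anything above that takes more than one byte in UTF-8 and cannot be encoded as ASCII at all.

```python
# Every code point from U+0000 to U+007F.
ascii_text = "".join(chr(cp) for cp in range(0x80))
same_bytes = ascii_text.encode("utf-8") == ascii_text.encode("ascii")

# U+00E9 is outside ASCII: UTF-8 needs two bytes for it.
e_acute_utf8 = "\u00e9".encode("utf-8")

try:
    "\u00e9".encode("ascii")
    ascii_can_encode = True
except UnicodeEncodeError:
    ascii_can_encode = False

print(same_bytes, e_acute_utf8, ascii_can_encode)
```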
Comment 13 QA Administrators 2019-09-02 09:29:34 UTC Comment hidden (obsolete)
Comment 14 Mike Kaganski 2020-05-17 09:41:52 UTC
I suggest closing this as NOTABUG. This is a matter of wrong expectations. OP expects pasted text to have the same encoding it had in the original program. But that's incorrect, as himajin100000 and Regina rightly note in comment 7 and comment 9. The expectation that changing UTF-16 to UTF-8 on paste (step 5 in comment 0) would result in "correct" behaviour is also wrong, and step 6 shows that. The bottom line in the description was:

> If you don't touch anything and the text doesn't include any UTF-8 character
> combinations which translate to an UTF-16 character, nothing happens.
> However until I noticed this problem, sometimes I had strange characters in
> my text imports, which might have been caused by this problem. I can't provide
> any specific text string to trigger that problem but I am pretty sure it exists.

... which mixes two completely unrelated things: OP has some unspecified problem, and suspects that it has something to do with the observed inconsistencies between OP's expectations and the correct behaviour. The real problem is completely unrelated.
Comment 15 Buovjaga 2020-05-17 09:44:11 UTC
Thanks, let's close