106110 – UTF-8 text import incorrectly detected as UTF-16 if linebreaks are CRLF

Status:	RESOLVED NOTABUG

Alias:	None

Product:	LibreOffice
Classification:	Unclassified
Component:	Calc
Version: (earliest affected)	5.1.4.2 release
Hardware:	x86-64 (AMD64) All

Importance:	medium minor
Assignee:	Not Assigned

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2017-02-20 17:49 UTC by JMM
Modified:	2020-05-17 09:44 UTC
CC List:	5 users

See Also:
Crash report or crash signature:

Attachments
Sample file containing UTF-8 text with CRLF newlines (716 bytes, application/octet-stream) 2017-02-20 18:43 UTC, JMM

Description JMM 2017-02-20 17:49:12 UTC

Description:
When pasting plain UTF-8 text from an external application into Calc, if the text uses Windows-style CR+LF line breaks (0x0D 0x0A), the text is incorrectly detected as UTF-16.

If the character encoding is manually set to UTF-8, then Calc inserts an empty line after every line.

Steps to Reproduce:
1. Generate some text in any text application where you can be sure that newlines are Windows-style CRLF, and that the encoding is UTF-8, e.g. Notepad++
2. Copy that text, ensuring it has several lines.
3. Paste it over any Calc cell, to open the text import window.
4. Notice the character encoding detected by Calc is UTF-16, not UTF-8.
5. Change the encoding to the correct UTF-8.
6. See how in the preview every even row is empty.

Actual Results:
If you don't touch anything and the text doesn't include any UTF-8 character combinations which translate to a UTF-16 character, nothing happens. However, until I noticed this problem, I sometimes had strange characters in my text imports, which might have been caused by it. I can't provide a specific text string that triggers this, but I am pretty sure one exists.

Expected Results:
The text should correctly be detected as UTF-8 to avoid potential problems, and when doing so, the CRLF combination shouldn't be interpreted as two consecutive newlines.
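The "empty line after every line" symptom is what you would get if CR and LF were each counted as a separate line break. A minimal sketch of that difference, assuming nothing about how Calc actually splits rows (the function names are hypothetical and only illustrate the two behaviours):

```python
import re

def split_crlf_aware(text):
    """Treat CRLF, a lone CR, and a lone LF each as ONE line break."""
    return re.split(r"\r\n|\r|\n", text)

def split_naive(text):
    """Treat every CR and every LF as its own line break."""
    return re.split(r"[\r\n]", text)

sample = "alpha\r\nbeta\r\ngamma"
print(split_crlf_aware(sample))  # ['alpha', 'beta', 'gamma']
print(split_naive(sample))       # ['alpha', '', 'beta', '', 'gamma']
```

The second result has an empty entry after every line but the last, which matches the preview described in step 6.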

Reproducible: Always

User Profile Reset: No

Additional Info:
This bug is not triggered when opening a file with the same text, UTF-8 with CRLF.

User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0

Comment 1 Xisco Faulí 2017-02-20 18:14:29 UTC

Thank you for reporting the bug. Please attach a sample document, as this makes it easier for us to verify the bug.
On the other hand, it seems you're using an old version of LibreOffice.
Could you please try to reproduce it with the latest version of LibreOffice from https://www.libreoffice.org/download/libreoffice-fresh/ ?
I have set the bug's status to 'NEEDINFO'. Please change it back to 'UNCONFIRMED' if the bug is still present in the latest version.

Comment 2 JMM 2017-02-20 18:43:16 UTC

Created attachment 131367
Sample file containing UTF-8 text with CRLF newlines

To trigger the bug correctly, open this file with a text editor capable of maintaining both the UTF-8 encoding and the CRLF newline style. Copy the contents and paste them into any Calc cell to open the text import window. You can also see the difference in encoding detection if you try to open the file from Calc, and you see the text import window but with the data coming from a file instead of from the clipboard.

Comment 3 JMM 2017-02-20 18:46:29 UTC

Yes, I noticed I was a bit behind in updating while filing the bug report. I updated to last stable 5.2.5.1 (I've had a few probs with fresh releases so I stick to stable ones), reproduced the bug again, and added a sample text.

I changed the version number and the status to unconfirmed until someone else can reproduce the bug.

Note that the bug only triggers when pasting the text from the clipboard, not when importing from a file, which I find weird.

Comment 4 Buovjaga 2017-03-04 17:24:23 UTC

Reproduced.

Tried with v. 3.6, but it does not distinguish between the UTF variants; it only says "Unicode".

Arch Linux 64-bit, KDE Plasma 5
Version: 5.4.0.0.alpha0+
Build ID: ed0e8f970ff552e75222dc92ed2879aa3b3e5851
CPU threads: 8; OS: Linux 4.9; UI render: default; VCL: kde4; 
Locale: fi-FI (fi_FI.UTF-8); Calc: group
Built on March 4th 2016

Comment 5 QA Administrators 2018-07-27 02:42:02 UTC Comment hidden (obsolete)

** Please read this message in its entirety before responding **

To make sure we're focusing on the bugs that affect our users today, LibreOffice QA is asking bug reporters and confirmers to retest open, confirmed bugs which have not been touched for over a year.

There have been thousands of bug fixes and commits since anyone checked on this bug report. During that time, it's possible that the bug has been fixed, or the details of the problem have changed. We'd really appreciate your help in getting confirmation that the bug is still present.

If you have time, please do the following:

Test to see if the bug is still present with the latest version of LibreOffice from https://www.libreoffice.org/download/

If the bug is present, please leave a comment that includes the information from Help - About LibreOffice.

If the bug is NOT present, please set the bug's Status field to RESOLVED-WORKSFORME and leave a comment that includes the information from Help - About LibreOffice.

Please DO NOT

Update the version field
Reply via email (please reply directly on the bug tracker)
Set the bug's Status field to RESOLVED - FIXED (this status has a particular meaning that is not appropriate in this case)

If you want to do more to help you can test to see if your issue is a REGRESSION. To do so:
1. Download and install oldest version of LibreOffice (usually 3.3 unless your bug pertains to a feature added after 3.3) from http://downloadarchive.documentfoundation.org/libreoffice/old/

2. Test your bug
3. Leave a comment with your results.
4a. If the bug was present with 3.3 - set version to 'inherited from OOo';
4b. If the bug was not present in 3.3 - add 'regression' to keyword

Feel free to come ask questions or to say hello in our QA chat: https://kiwiirc.com/nextclient/irc.freenode.net/#libreoffice-qa

Thank you for helping us make LibreOffice even better for everyone!

Warm Regards,
QA Team

MassPing-UntouchedBug

Comment 6 JMM 2018-07-27 09:03:33 UTC

I tried with 5.4.7_Win_x64 with the attachment I made a year ago, ANSI encoding with Windows CRLF endlines.

The results are the same; the bug is still there. The encoding is detected incorrectly, even if there are no characters that can be interpreted as UTF-16.

Note that the extra added empty lines are shown only in the import panel; if applying the import, the text is imported correctly. It's only in the import panel where the problem lies. However as I said in my initial bug report, working with characters beyond the 7-bit ASCII gave me problems with importing, and rendered strange characters, so it's not only a cosmetic bug. There's something wrong in the detection of the encoding, and in how LibreOffice reencodes such text for internal usage.

Comment 7 himajin100000 2018-08-19 06:22:36 UTC

Are you really sure that the encoding is different from UTF-16 without CRLF and with CR or LF oly?

https://opengrok.libreoffice.org/xref/core/sc/source/ui/dbgui/scuiasciiopt.cxx?r=a5c04cbf#380

https://opengrok.libreoffice.org/xref/core/sc/source/ui/view/viewfun5.cxx?r=18a8cac5#346

It looks to me like this code suggests that, regardless of the encoding and line separator we use, pasting will open the dialog with UTF-16 set as the encoding.

It also looks to me like the encoding you use in your text editor doesn't affect the encoding in which Windows stores string data on its clipboard.
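To illustrate the clipboard point: an editor decodes the file bytes into text when it loads the file, and what goes on the clipboard is a fresh encoding of that text, not the original bytes. A rough simulation, assuming the clipboard flavor is UTF-16LE (the sample string and variable names are made up for illustration, and no real clipboard API is called):

```python
# File on disk: UTF-8 bytes with CRLF line endings.
file_bytes = "caf\u00e9\r\nna\u00efve".encode("utf-8")

# An editor decodes the file into text when loading it.
text = file_bytes.decode("utf-8")

# Assumption: on copy, the text is offered to the clipboard as UTF-16LE,
# whatever encoding the file on disk used.
clipboard_bytes = text.encode("utf-16-le")

print(file_bytes.hex(" "))       # UTF-8: CRLF is the byte pair 0d 0a
print(clipboard_bytes.hex(" "))  # UTF-16LE: CRLF is 0d 00 0a 00

# The consumer gets the same text back only if it decodes as UTF-16.
print(clipboard_bytes.decode("utf-16-le") == text)  # True
```

Under that assumption the file's encoding never reaches the pasting application, so a dialog that shows UTF-16 for clipboard data would be describing the clipboard, not the file.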

Comment 8 himajin100000 2018-08-19 06:23:06 UTC

typo: oly => only

Comment 9 Regina Henschel 2018-08-19 11:08:17 UTC

I have tried the attached text with NotePad++, NotePad, PSPad and with Wordpad. All of them put more than one type into the clipboard. The type "Unicode Text Format" is provided by all of these apps. That is UTF-16. Then I have changed the line ends to LF. Again copying results in a type "Unicode Text Format" in the clipboard. If LibreOffice takes this clipboard flavor, the selection UTF-16 in the dialog is correct.

I have used "Free Clipboard Viewer 3.0" to examine the clipboard.

Do you have an application that does not put "Unicode Text Format" into the clipboard?

Comment 10 Buovjaga 2018-08-19 11:24:12 UTC

Hmm, it's true that there is something weird about the attachment. Linux command "file" says:
sample utf-8 crlf text.txt: ASCII text, with CRLF line terminators

However, Kate editor opens it as UTF-8.

Comment 11 Buovjaga 2018-08-19 11:33:56 UTC

If I use the program enca like this:
enca -L none sample.txt

It says:
7bit ASCII characters
  CRLF line terminators
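
For anyone without those tools, a similar check can be approximated in a few lines. This is only a rough sketch and not how `file` or `enca` work internally; the `describe` function is hypothetical:

```python
def describe(data):
    """Rough guess at the charset and line terminators of a byte string."""
    if all(b < 0x80 for b in data):
        charset = "7-bit ASCII"
    else:
        try:
            data.decode("utf-8")
            charset = "UTF-8"
        except UnicodeDecodeError:
            charset = "unknown 8-bit"

    crlf = data.count(b"\r\n")
    lone_lf = data.count(b"\n") - crlf  # LF bytes not preceded by CR
    if crlf and not lone_lf:
        eol = "CRLF"
    elif lone_lf and not crlf:
        eol = "LF"
    elif crlf and lone_lf:
        eol = "mixed"
    else:
        eol = "no"
    return f"{charset}, {eol} line terminators"

print(describe(b"a,b\r\nc,d\r\n"))  # 7-bit ASCII, CRLF line terminators
```

A file with no byte above 0x7F is reported as ASCII here even if it was saved "as UTF-8", because the bytes are the same either way.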

Comment 12 himajin100000 2018-08-19 13:06:57 UTC

> sample utf-8 crlf text.txt: ASCII text, with CRLF line terminators
> However, Kate editor opens it as UTF-8.

If all the characters are in U+0000 to U+007F, US-ASCII and UTF-8 are completely identical, so there is nothing strange in this behavior.
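
A quick way to check this: text restricted to 7-bit characters produces the same bytes in US-ASCII and UTF-8, and the two only diverge from U+0080 upward, where UTF-8 needs more than one byte. A small sketch (the sample string is made up):

```python
ascii_only = "id;name\r\n1;Alice\r\n"

# Same bytes either way, so a detector cannot tell the two encodings apart.
same_bytes = ascii_only.encode("ascii") == ascii_only.encode("utf-8")
print(same_bytes)  # True

# Every code point up to U+007F is a single identical byte in UTF-8.
single_byte_range = all(
    chr(cp).encode("utf-8") == bytes([cp]) for cp in range(0x80)
)
print(single_byte_range)  # True

# From U+0080 upward, UTF-8 uses two or more bytes per character.
e_acute_utf8 = "\u00e9".encode("utf-8")
print(e_acute_utf8)  # b'\xc3\xa9'
```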

Comment 13 QA Administrators 2019-09-02 09:29:34 UTC Comment hidden (obsolete)

Dear JMM,