Bug 93896 - Non-ASCII characters in comments corrupt when RTF is reopened
Status: RESOLVED WORKSFORME
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer
Version (earliest affected): 4.3.3.2 release
Hardware: x86 (IA32) Linux (All)
Importance: medium normal
Assignee: Not Assigned
URL:
Whiteboard:
Keywords: filter:rtf
Duplicates: 90128
Depends on:
Blocks: RTF
Reported: 2015-09-03 16:18 UTC by Simo Kaupinmäki
Modified: 2016-03-29 19:20 UTC (History)
CC List: 4 users

See Also:
Crash report or crash signature:


Attachments
ODT document with a comment (10.31 KB, application/vnd.oasis.opendocument.text)
2015-09-03 16:18 UTC, Simo Kaupinmäki
RTF with corrupt characters in the comment (4.24 KB, application/rtf)
2015-10-19 15:14 UTC, Simo Kaupinmäki
modified rtf (4.32 KB, application/rtf)
2015-10-19 20:27 UTC, Yousuf Philips (jay) (retired)
RTF created with MS Word (46.06 KB, application/rtf)
2015-10-20 16:07 UTC, Simo Kaupinmäki
Screenshots from Word (30.65 KB, image/png)
2015-10-20 17:38 UTC, Simo Kaupinmäki

Description Simo Kaupinmäki 2015-09-03 16:18:30 UTC
Created attachment 118397
ODT document with a comment

In a specific situation, non-ASCII characters in comments are corrupt when an RTF document is reopened. This only seems to occur on Linux after an RTF document has been modified and then saved. If there are non-ASCII characters in the actual content of the document, these are not affected.

Found with LibO 4.3.3.2 (from Debian Jessie repository), 4.4.5.2 (from Debian backports) and 5.0.1.2 (downloaded directly from the LibO site). The issue does not emerge with any version of LibO on Windows 7. Perhaps there is something that goes wrong with UTF-8 (UTF-16 is used internally in Windows).

(I actually found the bug a few months ago, but I didn't report it back then, because it already seemed fixed in LibO 5.0.0.0.beta1. However, it now affects 5.0.0.5 as well on Linux.)

Steps to reproduce:

1. Open the attached ODT document. There is some text (in Finnish) that is repeated in a comment.
2. Save the document as RTF. (You can at this point close the document and reopen it, and everything seems fine.)
3. Make a change in the document, save it again as RTF and close it.
4. When the RTF is now reopened, the non-ASCII characters in the comment are corrupt. They have mostly been turned into question marks, but typographical quotation marks have been turned into a combination of a square and a letter.
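When comparing the saved files in a text editor, it may help to know what the non-ASCII characters could look like on disk. RTF files are normally plain 7-bit ASCII, and one form the format defines for other characters is `\uN?`, where N is a signed 16-bit decimal code unit and `?` is a fallback character. The following is only a minimal sketch of that escaping; the helper name `rtf_escape` is hypothetical, and this report does not confirm which escape form Writer's export actually uses (`\'hh` is another form RTF allows).

```python
def rtf_escape(text: str) -> str:
    """Escape text for RTF: printable ASCII stays literal, the rest becomes \\uN?."""
    out = []
    for ch in text:
        code = ord(ch)
        if ch in "\\{}":
            # Backslash and braces are RTF syntax and must be escaped.
            out.append("\\" + ch)
        elif 0x20 <= code < 0x7F:
            out.append(ch)
        else:
            # \uN takes a signed 16-bit decimal, so go through UTF-16 code units.
            # (Control characters are also escaped numerically, for simplicity.)
            data = ch.encode("utf-16-le")
            for i in range(0, len(data), 2):
                unit = int.from_bytes(data[i:i + 2], "little")
                if unit > 0x7FFF:
                    unit -= 0x10000
                out.append(f"\\u{unit}?")
    return "".join(out)


print(rtf_escape("Kaupinmäki"))  # Kaupinm\u228?ki
```

If a re-saved file shows a bare `?` where such an escape would be expected, the character was probably lost on export rather than on import, but that is a guess and not something established in this report.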
Comment 1 Yousuf Philips (jay) (retired) 2015-09-20 12:27:22 UTC
Hi Simo,

Tried it in the 5.0 daily and master builds and it worked fine. Give 5.0.1 or 5.0.2 a try and see if it still shows up for you.

Version: 5.0.3.0.0+
Build ID: 4ae70fd6c93087ce66c76d3102ad678bcf01dbf5
TinderBox: Linux-rpm_deb-x86_64@46-TDF, Branch:libreoffice-5-0, Time: 2015-09-18_11:42:55
Locale: en-US (en_US.UTF-8)

Version: 5.1.0.0.alpha1+
Build ID: cbf3fac0a5a1be34b2e1a58da959debd24ebc017
TinderBox: Linux-rpm_deb-x86_64@70-TDF, Branch:master, Time: 2015-09-17_07:03:22
Locale: en-US (en_US.UTF-8)
Comment 2 Simo Kaupinmäki 2015-10-01 13:01:15 UTC
Hi Jay,

LibO 5.0.2.2 is also affected, but the bug does seem fixed in the development version. However, as I said above, it already seemed fixed in 5.0.0.0.beta1, but was back in 5.0.0.5. Therefore, I think I'll keep a close eye on this when 5.0.3.1 is made available.

AFFECTED:
Version: 5.0.2.2
Build ID: 37b43f919e4de5eeaca9b9755ed688758a8251fe
Locale: fi-FI (fi_FI.utf8)

NOT AFFECTED:
Version: 5.0.3.0.0+
Build ID: a9670e0735b77ecc40aa8af4106af7d32ec548a0
TinderBox: Linux-rpm_deb-x86@45-TDF, Branch:libreoffice-5-0, Time: 2015-09-24_23:24:38
Locale: fi-FI (fi_FI.utf8)
Comment 3 Simo Kaupinmäki 2015-10-16 15:31:23 UTC
Reopening. 

So, the issue seemed fixed in 5.0.3.0, but now it has re-emerged in 5.0.3.1. As I already saw the same regression occur between 5.0.0.0.beta1 and 5.0.0.5, it seems something goes wrong when the beta version is turned into a release candidate.

Version: 5.0.3.1
Build ID: fd8cfc22f7f58033351fcb8a83b92acbadb0749e
Locale: fi-FI (fi_FI.utf8)
Comment 4 Simo Kaupinmäki 2015-10-16 15:41:06 UTC
*** Bug 90128 has been marked as a duplicate of this bug. ***
Comment 5 tommy27 2015-10-19 00:53:47 UTC
Has anyone tried to bibisect these latest 5.0.x regressions?
Comment 6 Yousuf Philips (jay) (retired) 2015-10-19 14:39:53 UTC
Works fine for me.

Version: 5.1.0.0.alpha1+
Build ID: b684090d4f573eb339e93872d0cef07e69adc913
TinderBox: Linux-rpm_deb-x86_64@70-TDF, Branch:master, Time: 2015-10-16_01:50:06
Locale: en-US (en_US.UTF-8)

Version: 5.0.4.0.0+
Build ID: 9a75c72495ed6014d6c84fdead14bef68ea32858
TinderBox: Linux-rpm_deb-x86_64@46-TDF, Branch:libreoffice-5-0, Time: 2015-10-16_08:36:43
Locale: en-US (en_US.UTF-8)

(In reply to tommy27 from comment #5)
> Has anyone tried to bibisect these latest 5.0.x regressions?

I doubt there is any regression there.

(In reply to Simo Kaupinmäki from comment #3)
> So, the issue seemed fixed in 5.0.3.0, but now it has re-emerged in 5.0.3.1.
> As I already saw the same regression occur between 5.0.0.0.beta1 and
> 5.0.0.5, it seems something goes wrong when the beta version is turned into
> a release candidate.

Works fine for me. Could it be specific to Finnish, as I've been typing English words into the document before saving it to RTF? Can you please provide a sample corrupt RTF, so we can test whether it is corrupt when we open it?

Version: 5.0.3.1
Build ID: fd8cfc22f7f58033351fcb8a83b92acbadb0749e
Locale: en-US (en_US.UTF-8)
Comment 7 Simo Kaupinmäki 2015-10-19 15:14:20 UTC
Created attachment 119752
RTF with corrupt characters in the comment

There is another corrupt RTF in the duplicate bug 90128. As that file was probably not created in a Finnish environment, I don't think the issue is specific to Finnish.

I've just noticed that if I try to open either of the corrupt files with XFCE's Mousepad, it complains about incorrect encoding:
"The document was not UTF-8 valid"
"Invalid byte sequence in conversion input."

LibO opens the file without complaints, but the non-ASCII characters in the comment are corrupt.
Comment 8 Simo Kaupinmäki 2015-10-19 16:03:25 UTC
(In reply to Yousuf (Jay) Philips from comment #6)

> Works fine for me. Could it be specific to Finnish, as I've been typing
> English words into the document before saving it to RTF?

Please take into account that the issue does not emerge if you just save a document as RTF. It only emerges after you have re-saved an existing RTF and then closed and re-opened it.
Comment 9 Yousuf Philips (jay) (retired) 2015-10-19 20:27:25 UTC
Created attachment 119769
modified rtf
Comment 10 Yousuf Philips (jay) (retired) 2015-10-19 20:37:02 UTC
(In reply to Simo Kaupinmäki from comment #8)
> Please take into account that the issue does not emerge if you just save a
> document as RTF. It only emerges after you have re-saved an existing RTF and
> then closed and re-opened it.

In my last comment, I attached an RTF that I created by following the steps in the bug description, so yes, I am aware of the issue you have described in the bug report, but I still can't reproduce it.

Version: 5.1.0.0.alpha1+
Build ID: b684090d4f573eb339e93872d0cef07e69adc913
TinderBox: Linux-rpm_deb-x86_64@70-TDF, Branch:master, Time: 2015-10-16_01:50:06
Locale: en-US (en_US.UTF-8)

@Miklos: Any ideas on what may be causing this issue?
Comment 11 Simo Kaupinmäki 2015-10-19 20:52:51 UTC
(In reply to Yousuf (Jay) Philips from comment #10)

> Version: 5.1.0.0.alpha1+
> Build ID: b684090d4f573eb339e93872d0cef07e69adc913
> TinderBox: Linux-rpm_deb-x86_64@70-TDF, Branch:master, Time:
> 2015-10-16_01:50:06
> Locale: en-US (en_US.UTF-8)

I don't know if it matters, but it seems you have tested this on x86_64, whereas I have only tested it on IA32. Bug 90128 has also been filed against the IA32 architecture.
Comment 12 Simo Kaupinmäki 2015-10-20 15:44:03 UTC
(In reply to Simo Kaupinmäki from comment #7)
> I've just noticed that if I try to open either of the corrupt files with
> XFCE's Mousepad, it complains about incorrect encoding:
> "The document was not UTF-8 valid"
> "Invalid byte sequence in conversion input."

Actually, this seems to happen with any RTF file, so it doesn't prove anything.
Comment 13 Simo Kaupinmäki 2015-10-20 16:07:54 UTC
Created attachment 119794
RTF created with MS Word

For comparison, I have opened the original ODT file in MS Word 2013 and saved it as RTF. Notice that the file size is considerably larger than with an RTF created by Writer. Furthermore, if this file is opened with Mousepad, it does not complain about encoding. So there seems to be something not-quite-right even with the file posted by Jay, although it's not directly visible.
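The Mousepad observation could be checked more directly by looking at the raw bytes of each file. Below is a minimal sketch, assuming that Mousepad's complaint simply means the file does not decode as UTF-8 (the report does not confirm how Mousepad decides this). The function name `describe_bytes` is hypothetical. A file that is pure ASCII is always valid UTF-8, so any file that triggers the complaint should contain at least one byte of 0x80 or above.

```python
import sys
from pathlib import Path


def describe_bytes(data: bytes) -> dict:
    """Report whether data is pure ASCII, whether it is valid UTF-8,
    and the offsets of the first few bytes >= 0x80."""
    high = [i for i, b in enumerate(data) if b >= 0x80]
    try:
        data.decode("utf-8")
        valid_utf8 = True
    except UnicodeDecodeError:
        valid_utf8 = False
    return {
        "ascii_only": not high,
        "valid_utf8": valid_utf8,
        "high_byte_offsets": high[:10],
    }


if __name__ == "__main__":
    # Pass the RTF files to compare as command-line arguments.
    for name in sys.argv[1:]:
        print(name, describe_bytes(Path(name).read_bytes()))
```

Running this against the three RTF attachments would show whether the Writer-generated files contain raw 8-bit bytes that the Word-generated file lacks.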
Comment 14 Yousuf Philips (jay) (retired) 2015-10-20 16:31:38 UTC
Tried the 32-bit version and still couldn't confirm.

Version: 5.1.0.0.alpha1+
Build ID: 2b5a48da5969b1ed37f4480d843714d434feb5d9
TinderBox: Linux-rpm_deb-x86@71-TDF, Branch:master, Time: 2015-10-19_05:39:28
Locale: en-US (en_US.UTF-8)
Comment 15 Simo Kaupinmäki 2015-10-20 16:56:22 UTC
(In reply to Yousuf (Jay) Philips from comment #14)
> Tried the 32-bit version and still couldn't confirm.
> 
> Version: 5.1.0.0.alpha1+

Well, that's an alpha version. As I have said above, twice already I have been unable to reproduce this bug in a beta version, but then it has re-emerged in a release candidate.
Comment 16 Yousuf Philips (jay) (retired) 2015-10-20 17:13:48 UTC
(In reply to Simo Kaupinmäki from comment #15)
> Well, that's an alpha version. As I have said above, twice already I have
> been unable to reproduce this bug in a beta version, but then it has
> re-emerged in a release candidate.

Just tested with this 32-bit release candidate and still no luck.

Version: 5.0.3.1
Build ID: fd8cfc22f7f58033351fcb8a83b92acbadb0749e
Locale: en-US (en_US.UTF-8)

http://downloadarchive.documentfoundation.org/libreoffice/old/5.0.3.1/deb/x86/
Comment 17 Simo Kaupinmäki 2015-10-20 17:38:52 UTC
Created attachment 119796
Screenshots from Word

Well, isn't this getting frustrating or what?

I've done some further comparison with MS Word and found that although the RTF posted by Jay looks fine in Writer, it does not look quite right in Word. It's not as bad as the first RTF by me, but the umlauted letter "ä" in my name has been replaced with what looks like a Chinese character. The actual comment looks OK, except that the font has been changed from the original DejaVu Serif into Liberation Serif (I assume this isn't deliberate). Furthermore, the font of the non-ASCII characters in the actual document text has also been changed into Liberation Serif, whereas the font of the ASCII characters is still DejaVu Serif.

This attachment is a combination of three screenshots taken of various RTFs as they appear in Word. The first one is attachment 119794 (created by me in Word) and looks fine, the second one is attachment 119752 (created by me in Writer) and looks all wrong, and the third one is attachment 119769 (created by Jay in Writer) and looks, well, not quite right.
Comment 18 Simo Kaupinmäki 2015-10-21 10:15:41 UTC
I've only now realized that in order to reproduce this bug, you may actually need to close and re-open the RTF before re-saving it. Alternatively, you can select "File > Save As... RTF" twice. The issue does not emerge if you first select "File > Save As... RTF" and then just save changes by pressing Ctrl+S or clicking the save icon (at least if you don't have the "Ask when not saving in ODF or default format" option selected).
Comment 19 Robinson Tryon (qubit) 2015-12-14 11:27:36 UTC
Migrating Whiteboard tags to Keywords: (filter:rtf )
[NinjaEdit]
Comment 20 Simo Kaupinmäki 2016-03-29 18:25:21 UTC
In 5.1.1.3, the main issue seems to have been fixed both in the official LibO version and the Debian backports version. As I don't know what it is exactly that has made the problem go away, I'm closing this bug as WORKSFORME. Thank you for your help in trying to track it down.

All is not fine, though. The "ä" in my name is still being replaced by a question mark. It seems that there is a similar bug that specifically affects the non-ASCII characters included in the metadata of a comment. However, this appears to be a separate issue, and it's something I can live with.

Version: 5.1.1.3
Build ID: 89f508ef3ecebd2cfb8e1def0f0ba9a803b88a6d
CPU Threads: 1; OS Version: Linux 3.16; UI Render: default; 
Locale: fi-FI (fi_FI.utf8)

Version: 5.1.1.3
Build ID: 1:5.1.1-1~bpo8+1
CPU Threads: 1; OS Version: Linux 3.16; UI Render: default; 
Locale: fi-FI (fi_FI.utf8)
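The secondary issue with the comment author's name could be narrowed down by reading the author straight out of the RTF file. The sketch below assumes the author is stored in an `{\*\atnauthor ...}` group, which is the destination the RTF specification defines for annotation authors; this report does not show the file contents, so that is an assumption. The helper names `rtf_unescape` and `annotation_authors` are hypothetical, `\'hh` bytes are assumed to be Windows-1252, and each `\uN` is assumed to be followed by exactly one fallback character.

```python
import re
from typing import List

# \uN with an optional delimiter space and one fallback character, or a \'hh byte.
_TOKEN = re.compile(r"\\u(-?\d+) ?(\\'[0-9a-fA-F]{2}|[^\\{}])?|\\'([0-9a-fA-F]{2})")
# Assumed location of the comment author in the RTF source.
_AUTHOR = re.compile(r"\{\\\*\\atnauthor ?([^{}]*)\}")


def rtf_unescape(fragment: str, codepage: str = "cp1252") -> str:
    """Decode \\uN and \\'hh escapes in a fragment of RTF text."""
    def repl(m):
        if m.group(1) is not None:
            # Undo the signed 16-bit encoding of the UTF-16 code unit.
            return chr(int(m.group(1)) % 0x10000)
        return bytes.fromhex(m.group(3)).decode(codepage, errors="replace")

    text = _TOKEN.sub(repl, fragment)
    # Recombine surrogate pairs that arrived as two separate \uN escapes.
    return text.encode("utf-16-le", "surrogatepass").decode("utf-16-le", "replace")


def annotation_authors(rtf_text: str) -> List[str]:
    """Return the decoded contents of every \\atnauthor group."""
    return [rtf_unescape(m.group(1)) for m in _AUTHOR.finditer(rtf_text)]


sample = r"{\rtf1{\*\atnauthor Simo Kaupinm\u228?ki}text}"
print(annotation_authors(sample))  # ['Simo Kaupinmäki']
```

If the author group in a saved file already contains a literal `?` instead of an escape, the name was lost when the file was written; if the escape is intact, the loss happens when the file is read back.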
Comment 21 Simo Kaupinmäki 2016-03-29 19:20:42 UTC
For the record, 5.0.5.2 is still affected by the main issue (and also by the secondary issue regarding metadata):

Version: 5.0.5.2
Build ID: 55b006a02d247b5f7215fc6ea0fde844b30035b3
Locale: fi-FI (fi_FI.utf8)

I haven't tested with 5.1.0, so it's possible that the main issue has already been fixed there.