Bug 31555 - Pasting text copied from Adobe Acrobat Reader messes up non-ascii characters
Summary: Pasting text copied from Adobe Acrobat Reader messes up non-ascii characters
Status: RESOLVED WORKSFORME
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: LibreOffice
Version: Inherited From OOo (earliest affected)
Hardware: All All
Importance: medium major
Assignee: Not Assigned
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-11-11 10:46 UTC by matteo sisti sette
Modified: 2015-04-18 18:07 UTC
CC List: 5 users

See Also:
Crash report or crash signature:


Attachments

Description matteo sisti sette 2010-11-11 10:46:30 UTC
1) In Adobe Acrobat Reader, open a PDF file containing text with characters such as àáèéìíòùóúïüî etc.

2) Select some text containing such characters

3) Copy it (Ctrl+C or Edit/Copy)

4) Go to LibreOffice Writer and paste the text into the document

Expected result: the pasted text should be identical to the copied text

Observed result: SOMETIMES, non-ascii characters such as those mentioned above are replaced with other characters, OR an extra space gets inserted after each such character.

The SAME TEXT, pasted into gedit, into this web form, or into Mozilla Thunderbird, is pasted correctly: it is altered only in LibreOffice. That makes me think it's a bug in LibreOffice rather than Adobe Reader, though I cannot be sure.

---

EXAMPLE (1):

Copied from this document: http://creaciodigital.upf.edu/~etisselli/redes/09-10/clase02.pdf

"Conjunto de funciones y métodos informáticos que permiten
interactuar con una aplicación"

Pasted into LibreOffice Writer becomes:

"Conjunto de funciones y métodos informáticos que permiten
interactuar con una aplicación"

EXAMPLE (2):

Copied from this document: http://www.ine.es/daco/daco42/sociales/ciencia_tecno.pdf

"En los años 50 la ciencia económica comprobó
que la acumulación de capital no
podría generar"

Pasted into LibreOffice Writer becomes:

" En los añ os 50 la ciencia econó mica comprobó
 que la acumulació n de capital no
podrí a generar"

(note the extra spaces)


---

NEGATIVE EXAMPLE:

Just in case it helps to narrow down the issue, here's a document with which this does NOT happen: http://www.deugarte.com/gomi/el_poder_de_las_redes.pdf
Comment 1 Petr Mladek 2010-11-18 09:01:09 UTC
I have reproduced the problem.

The text is pasted correctly into evolution or pidgin => OOo really could do a better job here.

Cedric, any chance to look at it?
Comment 2 stfhell 2012-10-10 11:36:04 UTC
I am copying some information here from Bug 52229, where this issue was also discussed. In short: the bug shows up on Linux and Mac OS X. Characters outside 7-bit ASCII are converted badly because of the way the data is passed through the clipboard. LibreOffice reads the text/rtf format from the clipboard, whereas many other applications read the UTF8_STRING format. I believe that Adobe Reader writes bad RTF, but I would like this to be checked by someone familiar with the RTF specification.

If the bug is in Adobe Reader (which I believe), LO cannot correct it, of course. It is very annoying nonetheless, even though you can still paste text from Adobe Reader as unformatted UTF-8 text (via Paste Special, Ctrl+Shift+V). How much sense would it make to work around Adobe Reader's bugs from inside LO? If there is an easy way to detect badly encoded RTF, LO could possibly prefer UTF-8 as the default paste format for content coming from Adobe Reader.
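
One way such a detection might look, as a rough Python sketch and not anything LibreOffice actually implements: collect every consecutive run of \'xx hex escapes in the RTF and test whether the bytes form valid multi-byte UTF-8. The function name looks_like_utf8_in_hex_escapes is invented for illustration, and the heuristic itself is an assumption. Genuine ANSI text can occasionally parse as UTF-8 too (the cp1252 string "Ã¶" is the same two bytes as a UTF-8 "ö"), so the result could only ever serve as a hint.

```python
import re

# One or more consecutive \'xx escapes, e.g. \'C3\'B6
HEX_RUN = re.compile(r"(?:\\'[0-9A-Fa-f]{2})+")


def looks_like_utf8_in_hex_escapes(rtf: str) -> bool:
    r"""Heuristic (hypothetical helper): True if the RTF contains high-bit
    \'xx bytes and every run of them decodes as valid UTF-8."""
    found_multibyte = False
    for run in HEX_RUN.findall(rtf):
        # Strip the \' prefixes and turn the hex digits into raw bytes.
        data = bytes.fromhex(run.replace("\\'", ""))
        if all(b < 0x80 for b in data):
            continue  # plain ASCII written as hex: tells us nothing
        try:
            data.decode("utf-8")
        except UnicodeDecodeError:
            return False  # an ordinary single-byte ANSI character
        found_multibyte = True
    return found_multibyte


print(looks_like_utf8_in_hex_escapes(r"{\fs22 K\'C3\'B6pfen}"))  # True
print(looks_like_utf8_in_hex_escapes(r"{\fs22 K\'f6pfen}"))      # False
```

A check like this only looks at the escapes; it does not parse RTF groups or honour code page control words, which a real importer would have to do.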

Here are the quotes from Bug 52229:

Comment_6 (Linux): (1) Special characters (éÉÄÖÜß) are misread and converted into 2-character sequences

Comment_7 (MacOS): I can reproduce something similar to issue (1), but with a small difference: for me, all non-ASCII characters are replaced by simple dots (.). I can confirm that pasting the text as unformatted text avoids this: all special characters are treated correctly.

Comment_8 (Linux):
Concerning issue (1), the bad conversion of non-7-bit-ASCII characters:

I copied the string "Köpfen" from a PDF and this is what Adobe Reader put in the clipboard in format "text/rtf":

{\rtf1\ansi\uc1 {\fonttbl\f0\froman TVTPJB+CaslonBookBE-Regular;}\pard\plain\ql\f0\fs20 {\fs22 K\'C3\'B6pfen}}

Microsoft's RTF standard 1.9.1 says: "Text characters can be handled using the 16-bit Unicode character-encoding scheme defined in this section. Expressing this text in RTF required a new mechanism, because until Word 97, RTF handled only 7-bit characters directly and 8-bit characters encoded as hexadecimal using \'xx."

So, Adobe Reader writes the "ö" UTF-8-encoded, using the normal RTF notation for 8-bit ANSI characters. UTF-8 is not a defined encoding scheme in RTF. As far as I can see, the "ö" should be encoded either in a suitable ANSI character set (code page) or as a Unicode character using the "\u" control word. "\uc1" tells the RTF reader that each "\u" control word is followed by one fallback character, so the expected form is the decimal value of U+00F6 (246; C3 B6 in UTF-8) followed by the ANSI representation of "ö": in RTF, "\u246\'f6".
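
To make the mismatch concrete, here is a minimal Python sketch (not LibreOffice code; the helper names decode_hex_escapes and rtf_escape are invented for illustration). It takes the text run from the clipboard sample above and decodes its \'xx bytes twice: once as Windows-1252, which is assumed here to be what an \ansi reader falls back to when no code page is given, and once as UTF-8. It then writes the same word in the \u notation described above. The encoder is deliberately limited: it assumes a single-byte fallback code page, rejects characters outside the Basic Multilingual Plane, and does not escape backslashes or braces.

```python
import re

# Text run from the clipboard sample quoted above.
SAMPLE = r"K\'C3\'B6pfen"

HEX_ESCAPE = re.compile(r"\\'([0-9A-Fa-f]{2})")


def decode_hex_escapes(run: str, codepage: str) -> str:
    # Replace each \'xx escape by the character with that byte value ...
    raw = HEX_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), run)
    # ... then recover the raw bytes (latin-1 maps 0-255 one to one)
    # and decode them with the code page under test.
    return raw.encode("latin-1").decode(codepage)


def rtf_escape(text: str, codepage: str = "cp1252") -> str:
    # Hypothetical encoder for a document that declares \uc1:
    # each \uN is followed by exactly one fallback byte as \'xx.
    out = []
    for ch in text:
        code = ord(ch)
        if code < 0x80:
            out.append(ch)
            continue
        if code > 0xFFFF:
            raise ValueError("non-BMP characters are not handled in this sketch")
        if code > 0x7FFF:
            code -= 0x10000  # \uN takes a signed 16-bit decimal value
        fallback = ch.encode(codepage, errors="replace")  # b"?" if unmappable
        out.append("\\u%d\\'%02x" % (code, fallback[0]))
    return "".join(out)


print(decode_hex_escapes(SAMPLE, "cp1252"))  # KÃ¶pfen
print(decode_hex_escapes(SAMPLE, "utf-8"))   # Köpfen
print(rtf_escape("Köpfen"))                  # K\u246\'f6pfen
```

The first line of output shows the "2-character sequence" reported in Comment_6: one UTF-8 character read as two ANSI characters.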
Comment 3 Roman Eisele 2012-10-10 14:44:34 UTC
I can confirm this behaviour (see bug 52229, all important information from there collected here in comment #2 by stfhell -- thank you for that!).

Improving some fields:
* I can reproduce this behaviour since LibO 3.3.0 → adapting Version field.
* Lowering Importance a little bit -- this is a major problem, yes, but
  a) it is caused by a bug in Adobe Reader (see comment #2),
  b) there is no crash, loss of (native) LibO files, etc.,
  therefore not “critical”.
* Platform changed to “All”, because reproducible on Linux and Mac.
Comment 4 QA Administrators 2014-10-23 17:32:01 UTC
Please read this message in its entirety before responding.

Your bug was confirmed at least 1 year ago and has not had any activity on it for over a year. Your bug is still set to NEW which means that it is open and confirmed. It would be nice to have the bug confirmed on a newer version than the version reported in the original report to know that the bug is still present -- sometimes a bug is inadvertently fixed over time and just never closed.

If you have time please do the following:
1) Test to see if the bug is still present on a currently supported version of LibreOffice (preferably 4.2 or newer).
2) If it is present please leave a comment telling us your version of LibreOffice and your operating system.
3) If it is NOT present please set the bug to RESOLVED-WORKSFORME and leave a short comment telling us your version and Operating System

Please DO NOT
1) Update the version field
2) Reply via email (please reply directly on the bug tracker)
3) Set the bug to RESOLVED - FIXED (this status has a particular meaning that is not appropriate in this case)

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + 
LibreOffice is powered by a team of volunteers; every bug is confirmed (triaged) by human beings who mostly give their time for free. We invite you to join our triaging by checking out this link:
https://wiki.documentfoundation.org/QA/BugTriage

There are also other ways to get involved, including marketing, UX, documentation, and of course development: http://www.libreoffice.org/get-help/mailing-lists/

Lastly, good bug reports help tremendously in making the process go more smoothly. Please always provide reproducible steps (even if it seems easy) and attach any and all relevant material.
Comment 5 Matthew Francis 2015-04-07 03:20:52 UTC
Following on from comment 2:

The RTF standard 1.9.1 also states

"RTF files are usually 7-bit ASCII plain text, consisting of control words, control symbols, and groups. RTF files are easily transmitted between most PC based operating systems because of their 7-bit ASCII characters. However, converters that communicate with Microsoft Word for Windows or Microsoft Word for the Macintosh should expect data transfer as 8-bit characters and binary data (see \binN) can contain any 8-bit values."

This hints that such 8-bit data may be expected for interoperability with MS Office.

It should be checked whether Word accepts and/or produces RTF with the characteristics produced by Reader.
Comment 6 Buovjaga 2015-04-18 18:07:55 UTC
Example 2 pastes into LibO without extra spaces.
Example 1 will not open in Acrobat Reader XI; the file is broken.

Closing this as WFM.

Win 8.1 32-bit
LibO Version: 4.5.0.0.alpha0+
Build ID: 211c12b9c64facd1c12f637a5229bd6a6feb032a
TinderBox: Win-x86@39, Branch:master, Time: 2015-04-18_00:35:20
Locale: fi_FI