Bug 52229 - EDITING: FORMATTING destroyed after Copy/Paste Text from Adobe Reader
Status: RESOLVED WORKSFORME
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer
Version (earliest affected): 3.5.5.3 release
Hardware: Other Windows (All)
Importance: medium normal
Assignee: Not Assigned
Reported: 2012-07-18 11:07 UTC by ayl_ronnie
Modified: 2012-10-10 14:46 UTC

Attachments
Sample documents for inserting text from PDF via Adobe Reader (66.76 KB, application/zip)
2012-10-04 15:56 UTC, stfhell

Description ayl_ronnie 2012-07-18 11:07:16 UTC
Cannot copy and paste text from Adobe documents into LibreOffice Writer documents properly, especially where there are bullets. For example, if there are 5 bullets and I attempt to paste text into the 2nd or 3rd bullet, the copied text will ALWAYS appear after the last bullet. Occasionally the copied text will appear momentarily and then disappear.
Comment 1 Rainer Bielefeld Retired 2012-07-18 11:22:02 UTC
@ayl_ronnie@yahoo.com.sg
It's generally difficult to copy/paste from AR, often formatting can't be read. That's not specific to LibO, but to PDF.

Please:
- Write a meaningful Summary describing exactly what the problem is
- Attach a PDF sample document (not only a screenshot) or refer to an existing 
  sample document in another bug with a link; to attach a file to this 
  bug report, just click on "Add an attachment" right on this page.
- Attach screenshots with comments if you believe that that might explain the 
  problem better than a text comment. Best way is to insert your screenshots
  into a DRAW document and to add comments that explain what you want to show
- Contribute document-related step-by-step instructions containing every 
  key press and every mouse click needed to reproduce your problem 
  (similar to the example in Bug 43431)
- If possible, contribute instructions on how to create a sample document 
  (PDF - Writer export?) from scratch
- add information 
  -- what EXACTLY is unexpected
  -- and WHY do you believe it's unexpected (cite Help or Documentation!)
  -- concerning your OS (Version, Distribution, Language)
  -- concerning your LibO localization (UI language, Locale setting)
  -- LibO settings that might be related to your problems 
  -- how you launch LibO and how you opened the sample document
  -- If you can contribute an AOOo Issue that might be useful
  -- everything else crossing your mind after you read linked texts

Even if you cannot provide all of the requested information, every little bit of new information might bring the breakthrough.
Comment 2 Roman Eisele 2012-07-18 17:06:44 UTC
(Try to improve the Summary a bit -- I suppose the application in question is Adobe _Reader_, Adobe is just the company.)
Comment 3 Rainer Bielefeld Retired 2012-07-18 17:36:58 UTC
> Adobe _Reader_, Adobe is just the company.)

Yes, I wanted to write Adobe Reader (which IMHO is the correct name), but it seems I became hungry or something else kept me from finishing my work ;-)
Comment 4 stfhell 2012-10-04 15:56:53 UTC
Created attachment 68088
Sample documents for inserting text from PDF via Adobe Reader

ZIP file:
1_Sample_document_for_insertion.odt : insert text into this document
1_Sample_document_for_insertion.pdf : insert text from the PDF version of the document or another PDF using Adobe Reader
2_Sample_document_after_insertion.odt : This is the document after inserting 3 paragraphs from the PDF at the end of the 1st paragraph
Comment 5 Roman Eisele 2012-10-04 16:21:19 UTC
Comment on attachment 68088
Sample documents for inserting text from PDF via Adobe Reader


Thank you very much for your sample files!
(For now, I just corrected the MIME type.)
Comment 6 stfhell 2012-10-04 16:31:40 UTC
I attached some sample documents in a ZIP file.

I used Adobe Reader 9.4.2 on 64-bit-Linux (Ubuntu 12.04).

There are 2 issues here with inserting text from Adobe Reader. It may not concern other PDF viewers like "evince", but this doesn't necessarily have to do with a bug in Adobe Reader. I think a statement like "It's generally difficult to copy/paste from AR, often formatting can't be read. That's not specific to LibO, but to PDF" is no longer true. (There are some problems with hard line-breaks or hyphens, but generally copy-and-paste from PDF has become fairly easy.)

Adobe Reader copies text _with_ formatting information into the clipboard (which can be quite useful). If you paste it into LO via Ctrl-V there are 3 issues, really:

(1) Special characters (éÉÄÖÜß) are misread and converted into 2-character sequences (probably a conversion from the wrong code set?). This is a very old bug.

(2) The text from the clipboard is inserted at the wrong place (in the sample documents: at the end of the document), and the document's page settings are changed. This is a new bug, introduced with LO 3.5. LO 3.4 didn't have this.

(3) The formatting information from the PDF file is used; however LO uses a font name for which it will not find a font file. In the sample: "Times New Roman" text is formatted with "BAAAAA+TimesNewRomanPSMT" as font name. I suppose this is the font name mangled by the Adobe Reader, not a bug in LO.

You can avoid all these issues if you copy the text as unformatted text into LO via Ctrl-Shift-V (which is not always available, however): Special characters are treated correctly and the page attributes are not changed.

I assume that LO's interpretation of what Reader puts in the clipboard differs from Reader's, and there is of course the possibility that Reader's clipboard format is bad. But issue (2) cannot be a bug in Reader.
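
The 2-character sequences described in issue (1) are consistent with UTF-8 bytes being read in a single-byte code page. A minimal Python sketch of that effect follows. It assumes Windows-1252 as the wrong code page, which the report above does not confirm, and `misread_utf8` is a hypothetical helper name used only for illustration.

```python
# Assumption: the pasted bytes are UTF-8 but get interpreted as
# Windows-1252 (cp1252). The report does not state the actual code page.
def misread_utf8(text: str, wrong_codec: str = "cp1252") -> str:
    """Encode text as UTF-8, then decode the bytes with the wrong codec."""
    return text.encode("utf-8").decode(wrong_codec)


# Each special character from issue (1) turns into two characters.
for ch in "éÉÄÖÜß":
    print(ch, "->", misread_utf8(ch))
```

Pasting as unformatted text avoids this path entirely, which would match the observation that Ctrl-Shift-V treats the special characters correctly.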
Comment 7 Roman Eisele 2012-10-04 17:02:58 UTC
Thank you very much for your detailed report!

I tried to reproduce this with LibreOffice 3.5.7.1 and 3.6.2.2 (no difference!) and Adobe Reader 10.1.4 on Mac OS X 10.6.8 (Intel). Results:


ad 1) I can reproduce something similar to issue (1), but with a little difference: for me, all non-ASCII characters are replaced by simple dots (.). I can confirm that pasting the text as unformatted text avoids this: all special characters are treated correctly.

It is an interesting question if this is a bug in Adobe Reader or in LibO. When I paste the text copied from Adobe Reader into TextEdit (as formatted text), I see the same dots instead of non-ASCII characters; if I paste the text into a document of an application which does not support any formatting (BBEdit), all special characters are treated correctly.

So I think this is a bug in Adobe Reader, which copies the text two times to the clipboard, once formatted and once as raw text, and damages all special characters in the formatted copy.


ad 2) I agree, this cannot be a bug in Reader. But: I cannot reproduce this issue at all; for me, the text is inserted exactly at the right position, i.e., where the caret is.

So this issue is either specific to LibO for Windows, or it has been fixed in LibreOffice 3.5.7.1 and 3.6.2.2.


ad 3) I can reproduce this; I wonder why LibO uses Liberation Serif when I can’t find the font “BAAAAA+TimesNewRomanPSMT”.


@ stfhell:
Can you please check if you can still reproduce issue (2) in LibreOffice 3.5.7.1 and 3.6.2.2?
And have you checked the special option “Activate experimental (instable) functions” in Options > LibreOffice > General? This may very well cause such behaviour ...
Comment 8 stfhell 2012-10-05 14:22:49 UTC
I have tested LibreOffice 3.6.2.2 with Adobe Reader 9.5.1 (the most current version available for Linux) on Ubuntu Linux 12.04/x86. Issue (2) has indeed disappeared with the current version. There is no reformatting anymore.

Concerning issue (1), the bad conversion of non-ASCII-7-characters:

I copied the string "Köpfen" from a PDF and this is what Adobe Reader put in the clipboard in format "text/rtf":

{\rtf1\ansi\uc1 {\fonttbl\f0\froman TVTPJB+CaslonBookBE-Regular;}\pard\plain\ql\f0\fs20 {\fs22 K\'C3\'B6pfen}}

Microsoft's RTF standard 1.9.1 says: "Text characters can be handled using the 16-bit Unicode character-encoding scheme defined in this section. Expressing this text in RTF required a new mechanism, because until Word 97, RTF handled only 7-bit characters directly and 8-bit characters encoded as hexadecimal using \'xx."

So, Adobe Reader writes the "ö" UTF-8-encoded in the normal RTF notation for ANSI characters. UTF-8 is not a defined encoding scheme in RTF. As far as I can see, the "ö" should be encoded in a suitable ANSI character set (code page) or as a Unicode character using the "\u" command ("\uc1" tells the RTF reader that each "\u" value is followed by one ANSI fallback byte, so "ö" (U+00F6, UTF-8 C3 B6) is written as its decimal code point followed by its ANSI representation: in RTF, "\u246\'f6").

(By the way, LO 3.5 made the same mistake of using UTF-8 in RTF files; seems to be corrected in LO 3.6.)

So unless I misread the specification, Adobe Reader encodes bad RTF.
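
To make the expected notation concrete, here is a small Python sketch of an escaper that writes non-ASCII characters as "\uN" plus one fallback byte, as "\uc1" requires. It is only an illustration under assumptions: `rtf_escape` is a hypothetical helper, the fallback code page is assumed to be Windows-1252, and characters outside the Basic Multilingual Plane are not handled.

```python
def rtf_escape(text: str, codepage: str = "cp1252") -> str:
    """Escape text for an RTF body that declares \\ansi and \\uc1.

    Hypothetical helper for illustration; cp1252 is an assumed code page.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if ch in "\\{}":
            # RTF control characters must be backslash-escaped.
            out.append("\\" + ch)
        elif cp < 0x80:
            out.append(ch)
        elif cp > 0xFFFF:
            raise ValueError("characters outside the BMP are not handled here")
        else:
            # \uN takes a signed 16-bit decimal value.
            n = cp - 0x10000 if cp > 0x7FFF else cp
            # With \uc1 exactly one fallback byte follows; '?' is used when
            # the code page cannot represent the character.
            fallback = ch.encode(codepage, errors="replace")[:1]
            out.append("\\u%d\\'%02x" % (n, fallback[0]))
    return "".join(out)


print(rtf_escape("Köpfen"))  # prints K\u246\'f6pfen
```

Compared with the clipboard content quoted above, the only difference is the "ö": `\u246\'f6` here versus `\'C3\'B6` from Adobe Reader.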
Comment 9 Roman Eisele 2012-10-05 15:28:18 UTC
@ stfhell:
Wow, thank you very much for your investigation!


So, the question is: what to do now about this bug?

Issue (1) is, according to stfhell’s research (and my tests), a bug in Adobe Reader, not in LibO. Maybe we could file a new LibO bug report, namely an enhancement request, to make the RTF reader more tolerant about this kind of wrong text encoding, in order to “fix” Adobe Reader’s wrong Unicode encoding; but I don’t know if it is worth the work which would be necessary.

Issue (2) was a real bug in LibO, but seems fixed in the current LibO versions, so RESOLVED/WORKSFORME.

Issue (3) is not really a bug in LibO -- there is just no font “BAAAAA+TimesNewRomanPSMT”. Maybe we could file a new LibO bug report, again an enhancement request, to add some font name guessing (this would be not too difficult in cases like these); I say a new bug report, because this would be really a new feature, not a bug fix.


Therefore I would suggest closing the present bug as RESOLVED/WORKSFORME (according to issue (2), the most important one), and, if some of you want, filing the enhancement requests for (3) and maybe for (1). I can do that, if nobody else wants to do so.

Any objections? thoughts? questions? ;-)
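
The font name guessing suggested for issue (3) could be sketched roughly as follows in Python. The prefix "BAAAAA+" has the shape of a PDF font subset tag (six uppercase letters followed by "+"), so stripping it is the first step. Everything after that is a guess: `strip_subset_tag` and `guess_family` are hypothetical names, and the suffix list and CamelCase splitting are assumptions, not LibO behaviour.

```python
import re

# A PDF subset tag is six uppercase letters followed by '+'.
SUBSET_TAG = re.compile(r"^[A-Z]{6}\+")


def strip_subset_tag(font_name: str) -> str:
    """Remove a leading subset tag such as 'BAAAAA+' if present."""
    return SUBSET_TAG.sub("", font_name)


def guess_family(font_name: str) -> str:
    """Hypothetical heuristic: derive a family name from a PostScript name."""
    base = strip_subset_tag(font_name)
    # Assumption: drop common PostScript name suffixes at the end.
    base = re.sub(r"(PSMT|PS|MT)$", "", base)
    # Insert a space at each lowercase-to-uppercase boundary.
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", base)


print(guess_family("BAAAAA+TimesNewRomanPSMT"))  # prints Times New Roman
```

A heuristic like this will not work for every font name, so a real implementation would still need the existing fallback font when no installed family matches the guess.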
Comment 10 stfhell 2012-10-10 11:43:36 UTC
I think it's best to close this bug. The copy-and-paste issue with Adobe Reader has already been reported as Bug 31555. I copied the important information concerning issue (1) there.

I think LO has enough of its own bugs to take care of and shouldn't bother to adapt its RTF decoder to non-standard encoders (if this is the problem with Adobe Reader copy-and-paste). But it would be smoother if it didn't choose RTF as the default paste format in case of Adobe Reader.
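
For reference, the kind of tolerance discussed for issue (1) could look like the Python sketch below: decode each run of \'xx escapes as UTF-8 when the bytes form valid UTF-8, and fall back to the declared code page otherwise. This is a heuristic under assumptions, not how LO's RTF import works; `tolerant_unescape` is a hypothetical name and Windows-1252 is an assumed code page. It can also misread legitimate ANSI text whose bytes happen to be valid UTF-8, which is one argument against adopting it.

```python
import re

# One or more consecutive \'xx hex escapes.
HEX_RUN = re.compile(r"(?:\\'[0-9a-fA-F]{2})+")


def _decode_run(match: re.Match, codepage: str) -> str:
    """Decode one run of hex escapes, preferring UTF-8 when it is valid."""
    raw = bytes.fromhex(match.group(0).replace("\\'", ""))
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Standard-conforming input: bytes are in the declared code page.
        return raw.decode(codepage, errors="replace")


def tolerant_unescape(rtf_body: str, codepage: str = "cp1252") -> str:
    """Replace \\'xx escapes in an RTF text run with decoded characters."""
    return HEX_RUN.sub(lambda m: _decode_run(m, codepage), rtf_body)


print(tolerant_unescape(r"K\'C3\'B6pfen"))  # prints Köpfen (Adobe Reader style)
print(tolerant_unescape(r"K\'f6pfen"))      # prints Köpfen (standard ANSI escape)
```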
Comment 11 Roman Eisele 2012-10-10 14:46:49 UTC
@ stfhell:

Thank you very much for your decision, and especially for the nice summary of all important stuff from this present bug you have added to bug 31555!

(I just changed RESOLVED/FIXED to /WORKSFORME, because we and the developers prefer the latter (“WORKSFORME”) if there was no actual and specific bug fix.)