Bug 148025 - Leading spaces lost copying from pdf
Summary: Leading spaces lost copying from pdf
Status: RESOLVED NOTOURBUG
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Printing and PDF export
Version: Inherited From OOo (earliest affected)
Hardware: All All
Importance: medium enhancement
Assignee: Not Assigned
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2022-03-16 08:57 UTC by flywire
Modified: 2023-07-24 23:07 UTC

See Also:
Crash report or crash signature:


Attachments
XpdfReader screenshot showing selection (3.26 KB, image/png)
2022-03-23 22:06 UTC, flywire

Description flywire 2022-03-16 08:57:50 UTC
Copying and pasting source code from the LibreOffice PDF guides loses the leading space characters used for indentation.

Reproducing:

For example, the bottom of https://documentation.libreoffice.org/assets/Uploads/Documentation/en/GS7.3/GS73-GettingStarted.pdf#page=431 has sample Python code. Copying and pasting it as text loses the blank line, and any line with leading spaces is pasted with a single space before the code, losing the other spaces. Tabs are ignored. The code is formatted in Liberation Mono, a fixed-pitch font, with 4 spaces per indent.

Sample output:

import uno
def HelloWorld():
 doc = XSCRIPTCONTEXT.getDocument()
 cell = doc.Sheets[0]['A1']
 cell.setString('Hello World from Python')
 return

This is important for a language like Python because indentation is part of the syntax.

It's expected the space characters would be retained, at least with an option.
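
To illustrate why this matters, here is a minimal Python sketch. The function `collapse_indent` is a hypothetical stand-in for the paste behaviour described above (each run of leading spaces becomes a single space), and the nested sample function is made up for the illustration. With only one level of indentation the pasted code can still compile, but nested blocks no longer do.

```python
# Made-up sample with two nested blocks, indented with 4 spaces per level.
ORIGINAL = (
    "def count_positive(values):\n"
    "    total = 0\n"
    "    for v in values:\n"
    "        if v > 0:\n"
    "            total += 1\n"
    "    return total\n"
)


def collapse_indent(source):
    """Mimic the reported paste: any run of leading spaces becomes one space."""
    lines = []
    for line in source.splitlines():
        stripped = line.lstrip(' ')
        prefix = ' ' if len(stripped) < len(line) else ''
        lines.append(prefix + stripped)
    return '\n'.join(lines) + '\n'


def compiles(source):
    """Return True if the source is syntactically valid Python."""
    try:
        compile(source, '<pasted>', 'exec')
        return True
    except SyntaxError:  # IndentationError is a subclass of SyntaxError
        return False


print(compiles(ORIGINAL))                   # True
print(compiles(collapse_indent(ORIGINAL)))  # False
```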
Comment 1 flywire 2022-03-16 12:44:02 UTC
Here's a better example:

*Text.txt*
def hello_world():
    print("Hello World!")

hello_world()
*File ends line above with no CRLF*

The method of capturing the text is important. Exporting the file as PDF with default settings and then extracting the text retains the leading spaces but drops blank lines.
Extracted using: java -jar pdfbox-app-2.0.25.jar ExtractText test.pdf test1.txt

Version: 7.3.1.3 (x64) / LibreOffice Community
Build ID: a69ca51ded25f3eefd52d7bf9a5fad8c90b87951
CPU threads: 8; OS: Windows 10.0 Build 19043; UI render: Skia/Raster; VCL: win
Locale: en-AU (en_AU); UI: en-GB
Calc: threaded

============================================================

java -jar pdfbox-app-2.0.25.jar TextToPDF -standardFont Courier test.pdf test.txt
java -jar pdfbox-app-2.0.25.jar ExtractText test.pdf test1.txt

This retains blank lines (but appends a space to each line and a final CRLF).
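
The two observations above can be checked mechanically. The following is a minimal Python sketch, not part of PDFBox: `summarize_roundtrip` is a hypothetical helper, and the two "extracted" strings are reconstructed from the descriptions in this comment rather than copied from real tool output.

```python
def summarize_roundtrip(original, extracted):
    """Compare text before PDF export with the text extracted afterwards."""
    orig_lines = original.splitlines()
    extr_lines = extracted.splitlines()

    def blank_count(lines):
        # Whitespace-only lines count as blank.
        return sum(1 for line in lines if not line.strip())

    def indents(lines):
        # Leading-space width of every non-blank line, in order.
        return [len(line) - len(line.lstrip(' '))
                for line in lines if line.strip()]

    return {
        'blank_lines_lost': blank_count(orig_lines) - blank_count(extr_lines),
        'indentation_preserved': indents(orig_lines) == indents(extr_lines),
        'trailing_space_added': (
            any(line.endswith(' ') for line in extr_lines)
            and not any(line.endswith(' ') for line in orig_lines)
        ),
        'final_newline_added': (
            extracted.endswith('\n') and not original.endswith('\n')
        ),
    }


# Text.txt from this comment (no newline after the last line).
TEXT = 'def hello_world():\n    print("Hello World!")\n\nhello_world()'

# Assumed shape of the text extracted from the LibreOffice export.
FROM_LIBREOFFICE = 'def hello_world():\n    print("Hello World!")\nhello_world()\n'

# Assumed shape of the text extracted after the TextToPDF round trip.
FROM_TEXTTOPDF = ('def hello_world(): \r\n    print("Hello World!") \r\n'
                  ' \r\nhello_world() \r\n')

print(summarize_roundtrip(TEXT, FROM_LIBREOFFICE))
print(summarize_roundtrip(TEXT, FROM_TEXTTOPDF))
```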
Comment 2 flywire 2022-03-18 12:38:42 UTC
Blank lines might be a problem because there is no such thing as a newline character in a PDF; the next paragraph is just started further down the page.

A viable workaround might be to add a single space to empty lines. However, if any code was developed for the Python REPL, a space on a line intended to be blank would cause it to fail.
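
A minimal Python sketch of that workaround, with a matching clean-up step for the pasted text so that the REPL concern does not apply. Both function names are hypothetical, and whether a space-only line actually survives PDF export and copying has not been verified here.

```python
def pad_blank_lines(source):
    """Replace each empty line with a single space before export."""
    return '\n'.join(line if line else ' ' for line in source.split('\n'))


def unpad_blank_lines(pasted):
    """After pasting, turn whitespace-only lines back into empty lines."""
    return '\n'.join(line if line.strip() else ''
                     for line in pasted.split('\n'))


source = 'def hello_world():\n    print("Hello World!")\n\nhello_world()'
padded = pad_blank_lines(source)
print(repr(padded.split('\n')[2]))          # ' '
print(unpad_blank_lines(padded) == source)  # True
```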

https://pdf-xchange.eu/ will select/copy text including all white space for pasting but it can't pick up blank lines (because they don't exist).
Comment 3 flywire 2022-03-19 22:12:21 UTC
Pasting the selection from https://pdf-xchange.eu/ will contain the required blank lines *if* they are written in the pdf. LibreOffice does not write blank lines in the pdf.

I suggest an enhancement so that LibreOffice writes blank lines into the pdf explicitly.

e.g.:
stream
/F1 10 Tf
BT
40 763.07751 Td
0 -11.0775 Td
(Lorem ipsum dolor sit amet,) Tj
0 -11.0775 Td
(    consectetur adipiscing ) Tj
0 -11.0775 Td
() Tj
0 -11.0775 Td
(elit. sed do eiusmod) Tj
ET

endstream
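
The suggested form above can be generated mechanically from a list of text lines. A minimal Python sketch, assuming a simple font whose character codes are plain ASCII; this is not how LibreOffice writes text. `escape_pdf_string` and `lines_to_content_stream` are hypothetical helper names, and the default position, leading and font are copied from the example.

```python
def escape_pdf_string(text):
    """Escape the characters that are special inside a PDF literal string."""
    return (text.replace('\\', '\\\\')
                .replace('(', '\\(')
                .replace(')', '\\)'))


def lines_to_content_stream(lines, x=40, y=763.07751, leading=11.0775,
                            font='F1', size=10):
    """Emit one Td/Tj pair per line, including '() Tj' for blank lines."""
    ops = [f'/{font} {size} Tf', 'BT', f'{x} {y} Td']
    for line in lines:
        ops.append(f'0 -{leading} Td')                    # move down one line
        ops.append(f'({escape_pdf_string(line)}) Tj')     # show the line text
    ops.append('ET')
    return '\n'.join(ops)


stream = lines_to_content_stream([
    'Lorem ipsum dolor sit amet,',
    '    consectetur adipiscing ',
    '',                                # blank line written explicitly
    'elit. sed do eiusmod',
])
print(stream)
```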


Decoded LibreOffice pdf:
stream
0.1 w
q 0 0.028 595.275 841.861 re
W* n
q 0 0 0 rg
BT
56.8 776.789 Td /F1 10 Tf<0102030405020606070809070A06010B0C0D>Tj
ET
Q
q 0 0 0 rg
BT
56.8 765.489 Td /F1 10 Tf<040404040E0A0F10110B1213020606070414070A060115120C04>Tj
ET
Q
q 0 0 0 rg
BT
56.8 742.789 Td /F1 10 Tf<05020606070809070A06010B0C>Tj
ET
Q
Q 
endstream

There is no "() Tj" for the blank line (decoding the characters within the lines is not needed to demonstrate this).
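
Although the blank line is not written, it can be inferred from the baseline positions in the decoded stream: the first two lines are 11.3 units apart, while the gap before the third is 22.7. A minimal Python sketch of that inference follows, assuming every `Td` is the first positioning operator after its own `BT` (as in the stream above, so the y value is absolute) and that the smallest gap is the normal line pitch. The function names are hypothetical.

```python
import re

# The three text objects from the decoded LibreOffice stream above.
DECODED = """\
BT
56.8 776.789 Td /F1 10 Tf<0102030405020606070809070A06010B0C0D>Tj
ET
BT
56.8 765.489 Td /F1 10 Tf<040404040E0A0F10110B1213020606070414070A060115120C04>Tj
ET
BT
56.8 742.789 Td /F1 10 Tf<05020606070809070A06010B0C>Tj
ET
"""

TD_PATTERN = re.compile(r'(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s+Td')


def baseline_positions(stream):
    """Return the y operand of every Td operator, in stream order."""
    return [float(match.group(2)) for match in TD_PATTERN.finditer(stream)]


def inferred_blank_lines(stream):
    """Estimate how many blank lines sit between consecutive text lines."""
    ys = baseline_positions(stream)
    gaps = [upper - lower for upper, lower in zip(ys, ys[1:])]
    if not gaps:
        return []
    pitch = min(gaps)  # assumed to be the normal line spacing
    return [round(gap / pitch) - 1 for gap in gaps]


print(inferred_blank_lines(DECODED))  # [0, 1]
```

A viewer or extraction tool could use this kind of heuristic, but it is only a guess: extra paragraph spacing would look the same as a blank line.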

------------------------------------------------------------

Indented code is a potential complication that could be overcome by formatting it left aligned.
Comment 4 flywire 2022-03-20 09:17:18 UTC

*** This bug has been marked as a duplicate of bug 66181 ***
Comment 5 flywire 2022-03-23 22:06:29 UTC
Created attachment 179058
XpdfReader screenshot showing selection
Comment 6 flywire 2022-06-28 11:16:49 UTC
Two issues with pdf:
1. Leading spaces lost UNLESS caption in selection
2. Extra blank line above MsgBox

Demonstration
=============

The functionality depends on the pdf viewer used.

1a. Using XpdfReader Version 4.03 www.xpdfreader.com

* Open https://www.pitonyak.org/OOME_3_0.pdf#page=88
* Copy/Paste Listing to a text editor:

Listing 59. Modified bubble sort.
Sub ExampleForNextSort
   Dim iEntry(10) As Integer
   Dim iOuter As Integer, iInner As Integer, iTemp As Integer
   Dim bSomethingChanged As Boolean

   ' Fill the array with integers between -10 and 10
   For iOuter = LBound(iEntry()) To Ubound(iEntry())

       iEntry(iOuter) = Int((20 * Rnd) -10)
   Next iOuter

   ' iOuter runs from the highest item to the lowest
   For iOuter = UBound(iEntry()) To LBound(iEntry()) Step -1

       'Assume that the array is already sorted and see if this is incorrect
       bSomethingChanged = False
       For iInner = LBound(iEntry()) To iOuter-1

          If iEntry(iInner) > iEntry(iInner+1) Then
              iTemp = iEntry(iInner)
              iEntry(iInner) = iEntry(iInner+1)
              iEntry(iInner+1) = iTemp
              bSomethingChanged = True

          End If
       Next iInner
       'If the array is already sorted then stop looping!
       If Not bSomethingChanged Then Exit For
   Next iOuter
   Dim s$
   For iOuter = LBound(iEntry()) To Ubound(iEntry())
       s = s & iOuter & " : " & iEntry(iOuter) & CHR$(10)
   Next iOuter
   MsgBox s, 0, "Sorted Array"
End Sub

1b. Repeat in Acrobat, Brave, SumatraPDF etc:

Listing 59. Modified bubble sort.
Sub ExampleForNextSort
Dim iEntry(10) As Integer
Dim iOuter As Integer, iInner As Integer, iTemp As Integer
Dim bSomethingChanged As Boolean
' Fill the array with integers between -10 and 10
For iOuter = LBound(iEntry()) To Ubound(iEntry())
iEntry(iOuter) = Int((20 * Rnd) -10)
Next iOuter
' iOuter runs from the highest item to the lowest
For iOuter = UBound(iEntry()) To LBound(iEntry()) Step -1
'Assume that the array is already sorted and see if this is incorrect
bSomethingChanged = False
For iInner = LBound(iEntry()) To iOuter-1
If iEntry(iInner) > iEntry(iInner+1) Then
iTemp = iEntry(iInner)
iEntry(iInner) = iEntry(iInner+1)
iEntry(iInner+1) = iTemp
bSomethingChanged = True
End If
Next iInner
'If the array is already sorted then stop looping!
If Not bSomethingChanged Then Exit For
Next iOuter
Dim s$
For iOuter = LBound(iEntry()) To Ubound(iEntry())
s = s & iOuter & " : " & iEntry(iOuter) & CHR$(10)
Next iOuter
MsgBox s, 0, "Sorted Array"
End Sub

2a. Another example using XpdfReader to open https://documentation.libreoffice.org/assets/Uploads/Documentation/en/GS7.3/GS73-GettingStarted.pdf#page=425

Sub AppendHello
Dim oDoc
Dim sTextService$
Dim oCurs

REM ThisComponent refers to the currently active document.
oDoc = ThisComponent

REM Verify that this is a text document.
sTextService = "com.sun.star.text.TextDocument"
If NOT oDoc.supportsService(sTextService) Then

   MsgBox "This macro only works with a text document"
   Exit Sub
End If
REM Get the view cursor from the current controller.
oCurs = oDoc.currentController.getViewCursor()

REM Move the cursor to the end of the document.
oCurs.gotoEnd(False)

REM Insert text "Hello" at the end of the document.
oCurs.Text.insertString(oCurs, "Hello", False)
End Sub

2b. Let's try again including the Listing Caption:

Listing 5: Append the text “Hello” at the end of to the current document
Sub AppendHello
   Dim oDoc
   Dim sTextService$
   Dim oCurs

   REM ThisComponent refers to the currently active document.
   oDoc = ThisComponent

   REM Verify that this is a text document.
   sTextService = "com.sun.star.text.TextDocument"
   If NOT oDoc.supportsService(sTextService) Then

       MsgBox "This macro only works with a text document"
       Exit Sub
   End If
   REM Get the view cursor from the current controller.
   oCurs = oDoc.currentController.getViewCursor()

   REM Move the cursor to the end of the document.
   oCurs.gotoEnd(False)

   REM Insert text "Hello" at the end of the document.
   oCurs.Text.insertString(oCurs, "Hello", False)
End Sub
Comment 8 ⁨خالد حسني⁩ 2023-07-03 09:47:40 UTC
The leading spaces are in the PDF; viewers may choose not to copy them, but there is not much we can do about this.

PDF is not a good format for exchanging plain text data because it is a purely visual format and preservation of the underlying textual data is rather limited.
Comment 9 flywire 2023-07-24 22:20:30 UTC
(In reply to ⁨خالد حسني⁩ from comment #8)
> PDF is not a good format for exchanging plain text data because it is a
> purely visual format and preservation of the underlying textual data is
> rather limited.

Agree. Regardless, it IS being used to exchange plain text so we should do what we can to support it, especially with such a minimal request.

> The leading spaces are in the PDF, viewers may choose not to copy them, but
> there is not much we can do about this.

Did you look at the comment detail?

"There is no "() Tj" for blank line"

Explicitly writing blank lines really improves the functionality of code as demonstrated. Leading spaces are more problematic but a suitable browser could be recommended in the guides.

This is an enhancement request and it would be simple to implement. I'd like it to remain open.
Comment 10 ⁨خالد حسني⁩ 2023-07-24 23:07:33 UTC
(In reply to flywire from comment #9)
> (In reply to ⁨خالد حسني⁩ from comment #8)
> > PDF is not a good format for exchanging plain text data because it is a
> > purely visual format and preservation of the underlying textual data is
> > rather limited.
> 
> Agree. Regardless, it IS being used to exchange plain text so we should do
> what we can to support it, especially with such a minimal request.
> 
> > The leading spaces are in the PDF, viewers may choose not to copy them, but
> > there is not much we can do about this.
> 
> Did you look at the comment detail?
> 
> "There is no "() Tj" for blank line"
> 
> Explicitly writing blank lines really improves the functionality of code as
> demonstrated. Leading spaces are more problematic but a suitable browser
> could be recommended in the guides.

I don’t think we have any knowledge of blank lines by the time we are writing PDF output.

> This is an enhancement request and it would be simple to implement. I'd like
> it to remain open.

Feel free to re-open if you are planning to work on it; otherwise, I don’t think it is as simple to implement as it might seem.