Bug 131266 - Extra space added into RTF
Summary: Extra space added into RTF
Status: RESOLVED NOTABUG
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer
Version: 6.4.1.2 release (earliest affected)
Hardware: x86-64 (AMD64) Windows (All)
Importance: medium normal
Assignee: Not Assigned
Reported: 2020-03-10 21:34 UTC by Alistair Saywell
Modified: 2020-05-15 22:05 UTC
CC List: 3 users


Attachments
Sample of text with the extra spaces added (11.16 KB, application/vnd.oasis.opendocument.text)
2020-03-10 21:39 UTC, Alistair Saywell
Original .doc which opens with two space characters in 6.4.1.2 but just one in earlier versions (3.13 KB, application/octet-stream)
2020-03-18 21:45 UTC, Alistair Saywell

Description Alistair Saywell 2020-03-10 21:34:05 UTC
I have some documents that were scanned and OCRed using older software and saved as .doc. There are some double spaces due to interpretation of typewriter font. In previous versions I could remove these by CTRL+H and searching for "  " (two spaces) and replacing with " " (single space). 

In version 6.4 the same document now seems to have two characters for each space. Inspection with Alt+X shows a single space (U+0020) when the document is opened in Writer 6.3.4.2. The same document opened in Writer 6.4.1.2 has two characters at that position: U+2006 followed by U+0020.

It seems that the new version converts the space to a Unicode space but doesn't delete the original character.
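
For anyone who wants to check the characters outside Writer, here is a minimal Python sketch of the same kind of inspection that Alt+X gives. The helper name `describe` and the sample string are made up for illustration; the text would need to be copied out of the document first.

```python
import unicodedata


def describe(text: str) -> list[str]:
    """Return a 'U+XXXX NAME' label for each character in text."""
    return [
        f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"
        for ch in text
    ]


# Hypothetical sample: a word gap as reported for 6.4.1.2,
# i.e. U+2006 followed by an ordinary U+0020 space.
for label in describe("a\u2006 b"):
    print(label)
```

The two characters between the letters show up as separate entries, which makes it easy to tell an ordinary space from a look-alike.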
Comment 1 Alistair Saywell 2020-03-10 21:39:52 UTC
Created attachment 158576
Sample of text with the extra spaces added
Comment 2 Dieter 2020-03-11 12:34:07 UTC
I can't see a difference between

Version: 6.3.5.2 (x64)
Build-ID: dd0751754f11728f69b42ee2af66670068624673
CPU-Threads: 4; BS: Windows 10.0; UI-Render: Standard; VCL: win; 
Gebietsschema: de-DE (de_DE); UI-Sprache: de-DE
Calc: threaded

and 

Version: 7.0.0.0.alpha0+ (x64)
Build ID: eeb2d19e77d6dc47c68e8ba0920a02cf64a1247b
CPU threads: 4; OS: Windows 10.0 Build 18363; UI render: default; VCL: win; 
Locale: de-DE (de_DE); UI-Language: en-GB
Calc: threaded

Last two sentences begin with more than one space character.
Comment 3 Xisco Faulí 2020-03-18 17:34:18 UTC
Hello Alistair,
The summary talks about .doc files. The file attached is a .odt file. Please clarify...
Comment 4 Alistair Saywell 2020-03-18 21:45:37 UTC
Created attachment 158790
Original .doc which opens with two space characters in 6.4.1.2 but just one in earlier versions
Comment 5 Alistair Saywell 2020-03-18 21:52:28 UTC
Hi Xisco

Thanks for looking at this.

I attached the .odt because I wanted to crystallise what I was seeing in version 6.4.1.2. I realise now that this isn't very helpful without the original. I have used the original OCR software (ABBYY FineReader 9.0) to extract just a single page which has the U+0020 space characters, and have uploaded it as "Original .doc which opens with two space characters in 6.4.1.2 but just one in earlier versions". LO 6.4.1.2 adds the additional space characters (U+2006) on opening.

Cheers 
Alistair
Comment 6 Buovjaga 2020-05-15 18:44:41 UTC
If you open the .doc file in a text editor, you see it is in fact an RTF.

I bibisected the behaviour change with Linux 6.4 repo and the commit is this: https://git.libreoffice.org/core/+/24b04db5a63b57a74e58a7616091437ad68548ac%5E!/
tdf#123703 RTF import: fix length of space character sequence

So this is an improvement to the earlier situation.

For your new cleanup workflow, I propose:
1. Ctrl-H Find & replace
2. Other options - tick Regular Expressions
3. In the Find field, input \s\s
4. In the Replace field, input a single normal space
5. Replace all multiple times until satisfied

The \s pattern in a regular expression means any whitespace and will eat them all regardless of Unicode value.
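
As a rough illustration of why \s works here, the same idea can be sketched in Python. This is only an analogy: Writer uses its own regular-expression engine, and Python's `re` module is assumed here purely for demonstration. The function name `collapse_spaces` and the sample text are hypothetical. The pattern `\s{2,}` is a variation on `\s\s` that collapses a whole run in one pass instead of needing repeated replacements.

```python
import re


def collapse_spaces(text: str) -> str:
    """Replace every run of two or more whitespace characters with one U+0020."""
    # For str patterns, \s matches Unicode whitespace, which includes
    # U+2006 (SIX-PER-EM SPACE) as well as the ordinary U+0020 space.
    return re.sub(r"\s{2,}", " ", text)


# Hypothetical sample: one gap made of U+2006 + U+0020, one of two plain spaces.
sample = "one\u2006 two  three"
print(collapse_spaces(sample))
```

Note that in Python \s also matches tabs and line breaks, so this sketch is only suitable for text within a single paragraph.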
Comment 7 Alistair Saywell 2020-05-15 22:05:34 UTC
(In reply to Buovjaga from comment #6)
> If you open the .doc file in a text editor, you see it is in fact an RTF.
> 
> I bibisected the behaviour change with Linux 6.4 repo and the commit is this:
> https://git.libreoffice.org/core/+/24b04db5a63b57a74e58a7616091437ad68548ac%5E!/
> tdf#123703 RTF import: fix length of space character sequence
> 
> So this is an improvement to the earlier situation.
> 
> For your new cleanup workflow, I propose:
> 1. Ctrl-H Find & replace
> 2. Other options - tick Regular Expressions
> 3. In the Find field, input \s\s
> 4. In the Replace field, input a single normal space
> 5. Replace all multiple times until satisfied
> 
> The \s pattern in a regular expression means any whitespace and will eat
> them all regardless of Unicode value.

Thank you Buovjaga for your explanation and the link to the reason why it is not a bug. I will modify my workflow to include the \s\s; I already search for empty paragraphs, so Regular Expressions is normally on. Time for me to update to the latest LO version and to update my working list of common expressions.
Thanks again
Alistair