I have some documents that were scanned and OCRed using older software and saved as .doc. They contain some double spaces because of how the typewriter font was interpreted. In previous versions I could remove these with Ctrl+H, searching for "  " (two spaces) and replacing with " " (a single space). In version 6.4 the same document now seems to have two characters for each space. Inspection with Alt+X shows a single space (U+0020) when the document is opened in Writer 6.3.4.2. The same document opened in Writer 6.4.1.2 has two characters in that position: U+2006 followed by U+0020. It seems that the new version converts the space to a Unicode space character but doesn't delete the original one.
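For anyone who wants to check the same thing outside Writer, here is a minimal sketch that lists the code points in a piece of text copied from the document. The helper name `describe` is hypothetical, and the sample string simply reproduces the U+2006 U+0020 pair reported above.

```python
import unicodedata


def describe(text: str) -> list[str]:
    """Return 'U+XXXX NAME' for every character in text."""
    return [
        f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}"
        for ch in text
    ]


# The two-character sequence seen in Writer 6.4.1.2 (hypothetical sample).
for entry in describe("\u2006\u0020"):
    print(entry)
```

This only shows what characters are present; it does not say anything about how the importer produced them.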
Created attachment 158576: Sample of text with the extra spaces added
I can't see a difference between

Version: 6.3.5.2 (x64)
Build-ID: dd0751754f11728f69b42ee2af66670068624673
CPU-Threads: 4; BS: Windows 10.0; UI-Render: Standard; VCL: win;
Gebietsschema: de-DE (de_DE); UI-Sprache: de-DE
Calc: threaded

and

Version: 7.0.0.0.alpha0+ (x64)
Build ID: eeb2d19e77d6dc47c68e8ba0920a02cf64a1247b
CPU threads: 4; OS: Windows 10.0 Build 18363; UI render: default; VCL: win;
Locale: de-DE (de_DE); UI-Language: en-GB
Calc: threaded

Last two sentences begin with more than one space character.
Hello Alistair, The summary talks about .doc files. The file attached is a .odt file. Please clarify...
Created attachment 158790: Original .doc which opens with two space characters in 6.4.1.2 but just one in earlier versions
Hi Xisco,

Thanks for looking at this. I attached the .odt because I wanted to crystallise what I was seeing in version 6.4.1.2. I realise now that this isn't much help without the original. I have used the original OCR software (ABBYY FineReader 9.0) to extract just a single page which has the U+0020 space characters, and have uploaded it as "Original .doc which opens with two space characters in 6.4.1.2 but just one in earlier versions". LO 6.4.1.2 adds the additional space characters (U+2006) on opening.

Cheers
Alistair
If you open the .doc file in a text editor, you see it is in fact an RTF.

I bibisected the behaviour change with Linux 6.4 repo and the commit is this:
https://git.libreoffice.org/core/+/24b04db5a63b57a74e58a7616091437ad68548ac%5E!/
tdf#123703 RTF import: fix length of space character sequence

So this is an improvement to the earlier situation.

For your new cleanup workflow, I propose:
1. Ctrl-H Find & replace
2. Other options - tick Regular Expressions
3. In the Find field, input \s\s
4. In the Replace field, input a single normal space
5. Replace all multiple times until satisfied

The \s pattern in a regular expression means any whitespace and will eat them all regardless of Unicode value.
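The same idea can be sketched outside Writer, for example on text exported to a plain-text file. This is only an illustration in Python, not what Find & Replace does internally; the function name `collapse_spaces` and the sample string are hypothetical. It differs from the steps above in two ways: it matches a whole run of two or more whitespace characters at once, so a single pass is enough, and it deliberately excludes line breaks so paragraphs are not merged.

```python
import re

# Two or more consecutive whitespace characters, excluding CR and LF.
# For str patterns, Python's \s covers Unicode whitespace such as U+2006.
_MULTI_SPACE = re.compile(r"[^\S\r\n]{2,}")


def collapse_spaces(text: str) -> str:
    """Replace each run of 2+ whitespace characters with one U+0020."""
    return _MULTI_SPACE.sub(" ", text)


# Hypothetical sample: a U+2006 + U+0020 pair and a plain double space.
sample = "first\u2006 second  third"
print(collapse_spaces(sample))
```

A lone U+2006 that is not next to another whitespace character is left untouched by this pattern, just as it would be by \s\s.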
(In reply to Buovjaga from comment #6)
> If you open the .doc file in a text editor, you see it is in fact an RTF.
>
> I bibisected the behaviour change with Linux 6.4 repo and the commit is
> this:
> https://git.libreoffice.org/core/+/24b04db5a63b57a74e58a7616091437ad68548ac%5E!/
> tdf#123703 RTF import: fix length of space character sequence
>
> So this is an improvement to the earlier situation.
>
> For your new cleanup workflow, I propose:
> 1. Ctrl-H Find & replace
> 2. Other options - tick Regular Expressions
> 3. In the Find field, input \s\s
> 4. In the Replace field, input a single normal space
> 5. Replace all multiple times until satisfied
>
> The \s pattern in a regular expression means any whitespace and will eat
> them all regardless of Unicode value.

Thank you, Buovjaga, for your explanation and the link showing why this is not a bug. I will modify my workflow to include the \s\s search; I already search for empty paragraphs, so Regular Expressions is normally on. Time for me to update to the latest LO version and to update my working list of common expressions.

Thanks again
Alistair