Bug 53399 - Word count inconsistent and wrong with non-breaking space
Status: RESOLVED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer
Version (earliest affected): 3.6.0.4 release
Hardware: All All
Importance: medium normal
Assignee: Muhammad Haggag
URL:
Whiteboard: target:3.7.0 target:3.6.2
Keywords: regression
Depends on:
Blocks: Word-Count mab3.6
Reported: 2012-08-12 10:00 UTC by Stephan Hennig
Modified: 2016-10-24 21:20 UTC (History)

Attachments
word count test document (8.78 KB, application/vnd.oasis.opendocument.text)
2012-08-12 10:00 UTC, Stephan Hennig
first example - correct results (37.86 KB, image/png)
2012-08-12 10:02 UTC, Stephan Hennig
second example - inconsistent and incorrect results (37.82 KB, image/png)
2012-08-12 10:02 UTC, Stephan Hennig
third example - incorrect results (37.77 KB, image/png)
2012-08-12 10:03 UTC, Stephan Hennig

Description Stephan Hennig 2012-08-12 10:00:41 UTC
Created attachment 65456 [details]
word count test document

Non-breaking spaces are handled incorrectly by word count, giving wrong and inconsistent results.

Recipe:

1. Open the attached document wordcount-sample.odt, which contains a text consisting of 6 words.  Word count is correct in this case, both in the status line and in the word count window (file wordcount-sample-1.png).

2. Replace the space between the fourth and fifth word with a non-breaking space, but don't move the cursor (file wordcount-sample-2.png).

  Inconsistent behaviour:  Word count in the status line switches to 4, whereas word count in the word count window switches to 5.

  Wrong results:  Both word counts are wrong, as the number of words didn't change.

3. Move the cursor (file wordcount-sample-3.png).

  Wrong results:  Word count in the word count window now also switches to 4, and 'characters excluding spaces' decreases from 24 to 16.

It looks like non-breaking spaces are treated like end-of-file markers by the word count algorithm.  Still, that doesn't explain why multiple inconsistent word count statistics are displayed in status line and the explicit word count window.
Comment 1 Stephan Hennig 2012-08-12 10:02:01 UTC
Created attachment 65457 [details]
first example - correct results
Comment 2 Stephan Hennig 2012-08-12 10:02:44 UTC
Created attachment 65458 [details]
second example - inconsistent and incorrect results
Comment 3 Stephan Hennig 2012-08-12 10:03:26 UTC
Created attachment 65459 [details]
third example - incorrect results
Comment 4 Stephan Hennig 2012-08-12 12:36:36 UTC
I've checked with LibreOffice 3.5.2 on Windows XP.  Here are the results:

1. After step 2 in the given recipe word count in the word count window decreases to 5 as well.

2. After step 3 (moving the cursor) word count is corrected to 6 again.  That is, part of the bug is not present in LibO 3.5.2.


LibreOffice 3.5.2.2 
Build-ID: 281b639-6baa1d3-ef66a77-d866f25-f36d45f
Comment 5 Stephan Hennig 2012-08-12 13:16:33 UTC
(In reply to comment #4)
> I've checked with LibreOffice 3.5.2 on Windows XP.  Here are the results:
> 
> 1. After step 2 in the given recipe word count in the word count window
> decreases to 5 as well.
> 
> 2. After step 3 (moving the cursor) word count is corrected to 6 again.  That
> is, part of the bug is not present in LibO 3.5.2.
> 
> 
> LibreOffice 3.5.2.2 
> Build-ID: 281b639-6baa1d3-ef66a77-d866f25-f36d45f

The same applies to LibO 3.5.4 as shipped by Linux Mint 13.

LibreOffice 3.5.4.2 
Build-ID: 350m1(Build:2)
Comment 6 Urmas 2012-08-19 04:36:38 UTC
Also, words are not counted (neither in the status bar nor in the dialog) after the first ZWSP character.
Comment 7 Jean-Baptiste Faure 2012-08-21 12:01:50 UTC
Good catch ! Same problem in LO 3.6.1 rc1 under Linux (Ubuntu 11.10 x86) :-(

Best regards. JBF
Comment 8 Jean-Baptiste Faure 2012-08-21 12:04:54 UTC
Hi Muhammad,

Can you help here ? :-)

Best regards. JBF
Comment 9 Roman Eisele 2012-08-21 13:43:26 UTC
Added "regression" keyword -- LibreOffice 3.5.6.2 (Build-ID: e0fbe70-dcba98b-297ab39-994e618-0f858f0) shows the right word count, treating non-breaking spaces correctly just like ordinary spaces.
Comment 10 Muhammad Haggag 2012-08-21 15:55:21 UTC
(In reply to comment #8)
> Hi Muhammad,
> 
> Can you help here ? :-)
> 
> Best regards. JBF

Hello. I'm investigating :)
Comment 11 Muhammad Haggag 2012-08-21 20:43:01 UTC
The issue is that non-breaking space (as well as a bunch of other Unicode characters in the separator category) isn't handled as a separator/space character in lcl_IsSkippableWhitespace, defined in sw/source/core/txtnode/txtedit.cxx.

I'm working on a fix.
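
To illustrate the kind of check described above, here is a minimal Python sketch. It is not the LibreOffice code: the names `is_skippable_whitespace` and `count_words` are hypothetical, and the rule it applies (any character in a Unicode separator category Zs, Zl or Zp, plus ordinary whitespace, ends a word) is an assumption made for illustration only.

```python
import unicodedata

# Assumption for this sketch: every Unicode separator category ends a word.
SEPARATOR_CATEGORIES = {"Zs", "Zl", "Zp"}


def is_skippable_whitespace(ch: str) -> bool:
    """Return True if ch should be treated as a word separator."""
    return ch.isspace() or unicodedata.category(ch) in SEPARATOR_CATEGORIES


def count_words(text: str) -> int:
    """Count maximal runs of non-separator characters."""
    count = 0
    in_word = False
    for ch in text:
        if is_skippable_whitespace(ch):
            in_word = False
        elif not in_word:
            # First character of a new word.
            in_word = True
            count += 1
    return count
```

With this rule, U+00A0 (non-breaking space) behaves like an ordinary space, so replacing a space with it leaves the count unchanged. U+200B (ZWSP) is in category Cf rather than a separator category, so this sketch does not treat it as a separator.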
Comment 12 Muhammad Haggag 2012-08-22 15:00:04 UTC
Patch is up for review at: https://gerrit.libreoffice.org/453

The patch doesn't fix the inconsistency between the dialog and the status bar. The two update differently.

The dialog has some hooks in the editing code that update it when text is entered or the selection changes, but it appears to have no hook for replacing the currently selected character via "Insert special character" (and there may be other missing hooks; it's a tedious way to implement the functionality).

The status bar field is updated constantly whenever anything changes in the document/selection, so it's more up to date. I'll file a separate bug to track this issue and perhaps have the status bar updates drive the dialog as well.
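
The idea of having one update path drive both displays can be sketched as follows. This is only an illustration in Python under stated assumptions, not how Writer is implemented: `WordCountModel`, `subscribe` and `set_text` are hypothetical names, and the word count itself is simplified to `str.split()`.

```python
from typing import Callable, List


class WordCountModel:
    """Single source of word-count data that notifies every subscriber."""

    def __init__(self) -> None:
        self._listeners: List[Callable[[int], None]] = []

    def subscribe(self, listener: Callable[[int], None]) -> None:
        """Register a display (status bar, dialog, ...) for updates."""
        self._listeners.append(listener)

    def set_text(self, text: str) -> None:
        """Recount on any change and push the same value to all displays."""
        # str.split() with no arguments splits on Unicode whitespace,
        # which includes the non-breaking space U+00A0.
        words = len(text.split())
        for listener in self._listeners:
            listener(words)


# Usage: both displays receive identical values from the same update.
status_bar_values: List[int] = []
dialog_values: List[int] = []
model = WordCountModel()
model.subscribe(status_bar_values.append)
model.subscribe(dialog_values.append)
model.set_text("aaa bbb\u00a0ccc")
```

Because both listeners are fed from the same call, they cannot drift apart the way separately hooked displays can.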
Comment 13 Not Assigned 2012-08-22 15:27:00 UTC
Muhammad Haggag committed a patch related to this issue.
It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=3ba107606682b5e675127483a514f0e6580ecfd1

fdo#53399 Word count is inconsistent and wrong with non-breaking space
Comment 14 Roman Eisele 2012-08-22 15:36:22 UTC
(In reply to comment #13)
> Muhammad Haggag committed a patch related to this issue.
> It has been pushed to "master":

Thank you very much for fixing this issue so fast!
Comment 15 Michael Meeks 2012-08-22 19:30:04 UTC
Review is ongoing on the list to back-port this to 3.6.2. Thanks for the report.
Comment 16 Jean-Baptiste Faure 2012-08-24 04:14:55 UTC
Hi Muhammad,

it seems that there is a fatal side effect with the last fix: in French, double punctuation marks (; ? ! :) must be preceded by a non-breaking space. With the last fix this punctuation mark is counted as a word.
Steps to reproduce :
- open a new empty text doc
- type two dummy words like aaa bbb -> count = 2 words
- add a non-breaking space followed by a punctuation mark -> 3 words !!!
- add another word -> 4 words. So your fix works in the sense that non-breaking space does not interrupt the word count for the current paragraph.

Best regards. JBF
Comment 17 Muhammad Haggag 2012-08-24 15:41:36 UTC
(In reply to comment #16)
> Hi Muhammad,
> 
> it seems that there is a fatal side effect with the last fix: in French double
> punctuation marks (; ? ! :) must be preceded by a non-breaking space. With the
> last fix this punctuation mark is counted as a word.
> Steps to reproduce :
> - open a new empty text doc
> - type two dummy words like aaa bbb -> count = 2 words
> - add a non-breaking space followed by a punctuation mark -> 3 words !!!
> - add another word -> 4 words. So your fix works in the sense that non-breaking
> space does not interrupt the word count for the current paragraph.
> 
> Best regards. JBF

Hi Jean,

I haven't modified word-counting behavior regarding punctuation. Stand-alone punctuation marks are counted as separate words, even without my change. I can reproduce that on the official Ubuntu LO package (version 3.5.4.2).
Comment 18 Roman Eisele 2012-08-24 15:56:00 UTC
(In reply to comment #17)
> I haven't modified word-counting behavior regarding punctuation. Stand-alone
> punctuation marks are counted as separate words, even without my change. I can
> reproduce that on the official Ubuntu LO package (version 3.5.4.2).

For this, see bug 38983. French punctuation is a special case, of course, but IMHO a proper fix for bug 38983 would fix the issue about French punctuation, too ... therefore:

@Jean-Baptiste Faure:
Could you please add a short comment about the problem with French punctuation to bug 38983? Thank you!
Comment 19 Roman Eisele 2012-08-24 16:05:06 UTC
(In reply to comment #18)
> For this, see bug 38983. French punctuation is a special case, of course,
> but IMHO a proper fix for bug 38983 would fix the issue about French
> punctuation, too ... therefore:

Well, I was too fast. Bug 38983 is contaminated by an (IMHO somewhat over-sophisticated) discussion about the impossibility of an exact account of word counting. Therefore, while I still think that bug 38983 can be fixed, or at least improved, and that French spacing should be mentioned there, it may be reasonable to file a separate new bug report about word counting and French punctuation. That case is special in that IMHO it needs no discussion, so fixing it is much easier than fixing the general bug 38983 ... Sorry!

@Jean-Baptiste Faure:
So please file an additional, separate bug report about the problem with word counting and French punctuation, and mention in it that this is (unlike bug 38983) a matter which does not need much discussion, but just a fix ;-)
Thank you again!
Comment 20 Jean-Baptiste Faure 2012-08-24 21:01:07 UTC
Hi,

I am not sure that French punctuation is a special case here. If a space is a word separator and a non-breaking space is not, then
aaa bbb ccc ; ddd (assuming that the space before ; is a non-breaking one)
should be counted as 4 words instead of 3 in LO 3.6, no matter whether the third word is defined as "ccc" or "ccc ;".
In English you have aaa bbb ccc; ddd
which should be counted as 4 words too. Instead, in both cases, LO 3.6 stops the count at the ; even when there is a new separator after it. For me the question is: why does the space following the ; not play its role as a separator when the ; is preceded by a non-breaking space?

If a non-breaking space is used as a separator, then isolated punctuation marks (those really used as punctuation) should be separators too, and the counting algorithm should aggregate consecutive separators in the same way as is done in CSV import in Calc.

Best regards. JBF
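
The proposal in the comment above (treat isolated punctuation as a separator and merge consecutive separators) could look roughly like this. It is a minimal Python sketch under assumptions, not a description of what LibreOffice does: the function names are hypothetical, "space-like" means `str.isspace()` or any Unicode category starting with Z, and "punctuation" means any Unicode category starting with P.

```python
import unicodedata


def is_space_like(ch: str) -> bool:
    """Ordinary whitespace or any Unicode separator (includes U+00A0)."""
    return ch.isspace() or unicodedata.category(ch).startswith("Z")


def is_punctuation_only(token: str) -> bool:
    """True if every character of the token is Unicode punctuation."""
    return all(unicodedata.category(c).startswith("P") for c in token)


def count_words_merging_separators(text: str) -> int:
    """Count tokens between separator runs, ignoring isolated punctuation."""
    tokens = []
    current = []
    for ch in text:
        if is_space_like(ch):
            # Consecutive separators are merged: only flush a non-empty token.
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    # A token made only of punctuation acts as a separator, not as a word.
    return sum(1 for token in tokens if not is_punctuation_only(token))
```

Under these assumptions, both the French-style "ccc", non-breaking space, ";" and the English-style "ccc;" give the same count, because a stand-alone ";" is dropped while a ";" attached to a word stays part of that word.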
Comment 21 Caolán McNamara 2012-08-28 09:02:48 UTC
btw, re comment #12: I added some stuff to update the dialog, if it's open, whenever the status bar is updated with more recent word/char count data

http://cgit.freedesktop.org/libreoffice/core/commit/?id=5192468dd49f5e1d821239cd51cea42f8bac7a4b
Comment 22 Not Assigned 2012-08-28 09:15:31 UTC
Muhammad Haggag committed a patch related to this issue.
It has been pushed to "libreoffice-3-6":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=48d1979dc3fb4618e04f37e5090c66ddf2fdad3a&g=libreoffice-3-6

fdo#53399 Word count is inconsistent and wrong with non-breaking space


It will be available in LibreOffice 3.6.2.
Comment 23 Stephan Hennig 2012-09-14 12:45:10 UTC
I can confirm that word count doesn't stay permanently incorrect in presence of non-breaking spaces with LibO 3.6.2.1 (Build ID: ba822cc) anymore.

Still, word count is temporarily inconsistent in dialogue and status line when replacing a selected space between two words with a non-breaking space by pressing Shift+Ctrl+Space.  I have opened bug 54918 for tracking this remaining glitch.

Thanks!