38983 – Word Count counts incorrectly with dashes and other separators


Status:	VERIFIED FIXED

Alias:	None

Product:	LibreOffice
Classification:	Unclassified
Component:	Writer
Version: (earliest affected)	3.3.3 release
Hardware:	Other All

Importance:	medium normal
Assignee:	Caolán McNamara

URL:
Whiteboard:	target:3.7.0
Keywords:

Depends on:
Blocks:

Reported:	2011-07-05 11:17 UTC by Jamee Mikell
Modified:	2023-06-26 16:57 UTC
CC List:	3 users

See Also:	62799 126629
Crash report or crash signature:

Attachments
Sample document showing test text and results. (13.43 KB, application/vnd.oasis.opendocument.text) 2011-07-05 11:17 UTC, Jamee Mikell
Example of outline spreadsheet described in comment 9. (13.39 KB, application/vnd.oasis.opendocument.spreadsheet) 2012-04-20 09:20 UTC, Jamee Mikell

Description Jamee Mikell 2011-07-05 11:17:56 UTC

Created attachment 48782
Sample document showing test text and results.

For text separated by an em-dash with no spaces, word count counts the em-dashed words as one word. For text separated by dashes, em-dash, and other separators with spaces around the separator, word count counts the separator as a word. See the attached file for more examples.

This is a test.   -> wc = 4

This -- is a test. -> wc = 5 (the double-dash should not be counted as a word from a writer's perspective.)

This--is a test. (where -- is an em-dash) -> wc = 3. (There are four words here from a writer's perspective.)
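A minimal sketch, assuming a counter that treats every run of non-space characters as one word, reproduces the three counts listed above. This is only an illustration of the reported behaviour; the function name `naive_word_count` is hypothetical and this is not LibreOffice's actual algorithm.

```python
# Naive whitespace-delimited counting: every run of non-space characters
# is one "word". Illustration only, not LibreOffice's real implementation.

def naive_word_count(text: str) -> int:
    return len(text.split())


samples = [
    "This is a test.",
    "This -- is a test.",        # double hyphen surrounded by spaces
    "This\u2014is a test.",      # U+2014 EM DASH with no surrounding spaces
]

for sample in samples:
    print(f"{sample!r}: {naive_word_count(sample)}")
```

The spaced double hyphen becomes a token of its own, while the unspaced em-dash glues two words into a single token.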


Windows 7. Core i5. Love LibreOffice. :)

Comment 1 noname 2011-07-05 14:24:05 UTC

I don't think this is a bug. Words like 'up-to-date' or 're-enabled' have emdash or (minus signs) in them, and they count as 1 word each. This is the correct English spelling.
The -- are used quite a few times in the programming world, when using commands like --help or --preset and should also count as 1 word each.

Comment 2 Jamee Mikell 2011-07-06 05:19:23 UTC

(In reply to comment #1)
> I don't think this is a bug. Words like 'up-to-date' or 're-enabled' have
> emdash or (minus signs) in them, and they count as 1 word each. This is the
> correct english spelling.
> The -- are used quite a few times in the programming world, when using commands
> like --help or --preset and should also count as 1 word each.

I didn't think about hyphenation when trying to formulate the rule for how this should work; however, hyphen is not double hyphen is not emdash is not endash (http://en.wikipedia.org/wiki/Hyphen). Put another way, 'up-to-date' is a hyphenated word, but 'up--to--date', whether that's a double hyphen, emdash or endash between the words, is not a hyphenated word. FWIW, MS Word counts 'up-to-date' with hyphens as one word and 'up--to--date' as three words, whether that's double dash, emdash or endash between the words. Also FWIW, the Wikipedia article demonstrates how tricky hyphens can be.

If I'm writing a technical book that includes code fragments, consider:
    man --help    (more legible, executes)
versus
    man--help     (fails with "command not found")
If the word count rule says that emdash, --, etc. are whitespace, the first still counts as two words. The second would count as two words, but should not be used in the first place because it produces an error. Also, the explanatory text in such a book should greatly exceed the code, favoring a balance toward the text rules vs. the code rules. (If you want to give me 200 pages of code, I'd rather have it electronically so I don't have to type it. Saves trees too.) FWIW, MS Word counts 'man --help' as three words but the same with an emdash or endash as two words, which suggests a bug in their word count.

Also, regarding hyphenation, the rule would have to consider whitespaces around the hyphen. For example 'up-to-date' is one word, but is 'up - to - date' five words? I'll suggest not. If so, hyphens with whitespace on one or both sides, such as 're- ', ' -ing' as one might write prefixes and suffixes when talking about them, should count as whitespace too, which would mean only single hyphens surrounded by alphanumerics should be ignored. FWIW, MS Word counts 'up - to - date' as 5 words but the same with endashes as three words, suggesting they have bugs in their word count when it comes to hyphens vs other separators.
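The hyphen rule suggested above can be sketched as follows, assuming that "alphanumeric" means any Unicode letter or digit. The names `_LOOSE_HYPHEN` and `count_words_hyphen_rule` are hypothetical, and the sketch covers only the plain hyphen, not the other separators discussed in this report.

```python
import re

# A hyphen stays inside a word only when it has a letter or digit on BOTH
# sides. Any other hyphen (next to a space, another hyphen, or the start or
# end of the text) is replaced by a space before splitting.
# [^\W_] matches a Unicode letter or digit (word character minus underscore).
_LOOSE_HYPHEN = re.compile(r"(?<![^\W_])-|-(?![^\W_])")


def count_words_hyphen_rule(text: str) -> int:
    return len(_LOOSE_HYPHEN.sub(" ", text).split())


for sample in ["up-to-date", "up - to - date", "up--to--date",
               "the prefix re- and the suffix -ing"]:
    print(f"{sample!r}: {count_words_hyphen_rule(sample)}")
```

Under this rule 'up-to-date' is one word, while 'up - to - date' and 'up--to--date' are three words each.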

Comment 3 noname 2011-07-06 10:59:56 UTC

Well, maybe the case should be that a - or -- (or ++ or `) *surrounded* with spaces shouldn't be counted as a word.
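One possible reading of this suggestion, sketched under the assumption that a token "isn't a word" when it contains no letter or digit at all. The function name is hypothetical.

```python
# Split on whitespace, then drop tokens that contain no letter or digit,
# such as a free-standing "-", "--", "++" or "`".

def count_words_ignoring_symbols(text: str) -> int:
    return sum(
        1
        for token in text.split()
        if any(ch.isalnum() for ch in token)
    )


for sample in ["This -- is a test.", "a ++ b", "man --help"]:
    print(f"{sample!r}: {count_words_ignoring_symbols(sample)}")
```

Note that `--help` still counts as one word here, because the token contains letters.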

Comment 4 Don't use this account, use tml@iki.fi 2011-07-06 11:16:53 UTC

Come on, since when is "word count" an exactly defined term anyway? There are as many definitions as there are software implementations plus the number of people trying to come up with an exact definition. We will never be able to satisfy the whims of any word-counting writer/editor/teacher/student.

If the word count displayed by LO is "too small", write more text. If it is "too large", delete some unneeded verbiage. Don't blame the word-counting algorithm;)

In my humble opinion, LO should not even attempt to offer an exact "word count". It should round it in some suitably mysterious fashion to a nice round number. For instance with at least +/- 5 "tolerance" and at most 10% "precision", or something like that. The bikeshedding possibilities here are endless!

Comment 5 Jamee Mikell 2011-07-06 13:22:48 UTC

(In reply to comment #4)
> Come on, since when is "word count" an exactly defined term anyway? [etc]

I'll agree that there are a number of different word counting methods, but for writing (and this is "Writer" we're talking about) they usually come down to one of three options.

1) Count of all characters / n (where n is 6 or sometimes 5)

2) pages * 250 words/page (where "page" is defined by specific formatting requirements)

3) Count of words in the given language; whitespace, separator characters, parentheses, quotation marks, etc. don't count as words.

Occasionally, someone might use lines * 10 words/line subject to specific formatting requirements. (After all, the first page is half blank and the last page ends who knows where.) But this is really just a hybrid of 1 and 2. "Formatting requirements" in both cases usually give about 10 words/line and 25 lines per page.
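The estimating methods (1, 2, and the lines-based hybrid) are simple arithmetic. A small sketch, using the divisors quoted above as defaults; the function names are hypothetical.

```python
# Estimated word counts from characters, pages, or lines, using the
# conventions quoted in this comment: 6 characters per word,
# 250 words per page, 10 words per line.

def estimate_by_chars(char_count: int, chars_per_word: int = 6) -> float:
    return char_count / chars_per_word


def estimate_by_pages(pages: float, words_per_page: int = 250) -> float:
    return pages * words_per_page


def estimate_by_lines(lines: int, words_per_line: int = 10) -> int:
    return lines * words_per_line


# One fully formatted page: 25 lines of about 10 words each.
print(estimate_by_lines(25))     # 250
print(estimate_by_pages(1))      # 250
print(estimate_by_chars(1500))   # 250.0
```

With 25 lines per page and 10 words per line, the page-based and line-based estimates agree at 250 words per page.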

LO's algorithm isn't 1 or 2 (or any combination thereof) so LO's algorithm must be method 3 or an attempt to approximate method 3. In the interests of continuous improvement, when there's a relatively easy way to deal with an error in the approximation method it should be fixed. 

Also, it's elegant and simple. Define the rule for word counting as:
  hyphen, emdash, and endash are treated as whitespace
  except a hyphen surrounded by alphanumerics

(Where alphanumerics refers to "letters and numbers" characters and varies by language but is definable for any given language. For example: English uses a-z, 0-9; Spanish adds ch, enye, rr, ll, and accented vowels; Greek is... well, Greek, and so forth.)

This gives a more reasonable word count and resolves the issue xdmx raised about hyphens. 

Expand the rule to be:
  hyphen, emdash, endash, double quotes, and curly double quotes are treated as whitespace
  except a hyphen surrounded by alphanumerics

Now the rule addresses the problem described in bug 33774 too (resolved now, but the point is, the problems are similar and could be solved by the same code). 

Expand the rule again to be:
  hyphen, emdash, endash, double quotes, curly double quotes, single quotes, single curly quotes and apostrophe are treated as whitespace
  except a hyphen or single quote or single curly quote or apostrophe surrounded by alphanumerics

Now the rule addresses the problem of "eat at joe's" (LO wc = 3 words) vs. "eat at joe ' s" (LO wc = 5 words). 

And we should probably throw in # and * since those are used for section breaks within a chapter and " ran. (EOL) (centered) # (EOL) Joe " counts as three words in Writer. So:
  hyphen, emdash, endash, double quotes, curly double quotes, single quotes, single curly quotes, apostrophe, # and * are treated as whitespace
  except a hyphen or single quote or single curly quote or apostrophe or # or * surrounded by alphanumerics

I probably missed a few, but there are clearly a number of cases like these. To be really general...
1) Any non-alphanumeric character that is not surrounded by alphanumeric characters indicates a word break and is not counted as a word itself. 
2) Emdash, endash, double quotes and curly double quotes ALWAYS indicate a word break and are not counted as words themselves. 
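The two general rules can be sketched directly, assuming "alphanumeric" means any Unicode letter or digit and that whitespace is always a break. `ALWAYS_BREAK` and `count_words_general` are hypothetical names, and this is an illustration of the proposal rather than anything taken from LibreOffice.

```python
# Rule 2: en dash, em dash, straight and curly double quotes always break.
ALWAYS_BREAK = {"\u2013", "\u2014", '"', "\u201c", "\u201d"}


def count_words_general(text: str) -> int:
    out = []
    for i, ch in enumerate(text):
        if ch.isalnum():
            out.append(ch)
        elif ch.isspace() or ch in ALWAYS_BREAK:
            out.append(" ")
        else:
            # Rule 1: any other non-alphanumeric character survives only
            # when it has a letter or digit on both sides.
            before = i > 0 and text[i - 1].isalnum()
            after = i + 1 < len(text) and text[i + 1].isalnum()
            out.append(ch if before and after else " ")
    return len("".join(out).split())


for sample in ["eat at joe's", "eat at joe ' s", "ran.\n#\nJoe",
               "This\u2014is a test.", "up-to-date"]:
    print(f"{sample!r}: {count_words_general(sample)}")
```

With this sketch "eat at joe's" is 3 words, the free-standing apostrophe and the lone `#` line are no longer counted, and the em-dash example is 4 words. The stray "s" in "eat at joe ' s" is still a token of its own, so that sample comes out as 4.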

(So, my bike shed has become quite a large one. :) If you really need a nuclear reactor, though, we could start looking at word counts for East Asian languages. :O )

The fact that all these cases are resolved by a single rule and its exception clause makes the coding involved seem relatively small (maybe even fits into a regular expression or two). The pseudo-code is written. Just need someone who understands the LO word count code to take that and implement it, resolve a bunch of known issues in one swell foop and position the word count code to handle similar issues in the future with minor effort.

Comment 6 Jamee Mikell 2011-07-20 08:56:59 UTC

I did some research on this in the 3.4.2.1 code. It has been a while since I did C/C++, but here's what I think it boils down to. In txtedt.cxx line 1858
        while ( aScanner.NextWord() )
        {
            //  1 is len(CH_TXTATR_BREAKWORD) : match returns length of match
            if( 1 != aExpandText.match(aBreakWord, aScanner.GetBegin() ))
            {
                ++nTmpWords;
                nTmpCharsExcludingSpaces += aScanner.GetLen();
            }
        }

This says LO recognizes text as either a word or a space because, when it finds a word break, it counts only the text of the "word" as "excluding spaces" and counts all text inside the word break as a single word.

In reality, there is a class of characters that are not part of a word, or that are conditionally not part of a word (e.g., hyphen, etc. discussed previously), but are not whitespace either. IOW, they should be treated as breaks when counting words, but not treated as space when counting characters excluding spaces.

The code above needs to recognize this and separately identify whitespace characters vs. words.
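The distinction can be sketched as two independent tallies: break characters split words but still count as non-space characters. The set `WORD_BREAKS` (en dash and em dash only) and the function name are assumptions chosen for illustration.

```python
# Characters that end a word but are NOT whitespace. Hypothetical choice:
# en dash (U+2013) and em dash (U+2014).
WORD_BREAKS = {"\u2013", "\u2014"}


def count_words_and_chars(text: str) -> tuple[int, int]:
    # For word counting, break characters behave like spaces...
    spaced = "".join(" " if ch in WORD_BREAKS else ch for ch in text)
    words = len(spaced.split())
    # ...but for "characters excluding spaces" only real whitespace is skipped.
    chars_excluding_spaces = sum(1 for ch in text if not ch.isspace())
    return words, chars_excluding_spaces


print(count_words_and_chars("This\u2014is a test."))  # (4, 13)
```

The em-dash separates "This" and "is" for the word count, yet it is still included in the 13 non-space characters.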

I'm fairly certain it's possible to write a regular expression to find word breaks based on the regular expression language described for the Java split() function at:
http://download.oracle.com/javase/1.4.2/docs/api/java/util/regex/Pattern.html#sum
http://download.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html#sum
(Link to ancient Java and Java 6. Point: this ain't new and is still around.)

I've been trying to modify a copy of the beanshell wordcount macro to demonstrate. Unfortunately, while split() in beanshell works for a simple regex (e.g., " |,") it doesn't handle the full language defined at the link. For example, word breaks as discussed in my earlier comment could be defined by something like "[\s|\W{2,}]&&[^[\w[-|'|#|\*|<curlyquote>|<apostrophe>]]\w" where <curlyquote> and <apostrophe> are the character codes for the single curly quote and the apostrophe characters. May need another set of [] around that.

This could be an interestingly powerful method. (Maybe using Pattern and Matcher instead of split(). But I'm guessing that if beanshell won't handle the regex in split(), it won't handle it in Pattern either.) Count all words using the word-break regex above. Then let a = the count of all characters and b = the count of characters matched by the regex "\s"; characters excluding spaces = a - b. Accurately counted.
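A rough sketch of this approach using Python's `re` module rather than Java or BeanShell. The Java pattern quoted above is not reproduced (Python's `re` has no `&&` class intersection); instead a simpler, hypothetical pattern matches the words themselves: runs of letters or digits, optionally joined by a single hyphen, apostrophe, `#` or `*`.

```python
import re

# A word: letters/digits, optionally joined by ONE of - ' U+2019 # *
# between alphanumeric runs. Hypothetical pattern for illustration.
WORD = re.compile(r"[^\W_]+(?:[-'\u2019#*][^\W_]+)*")
SPACE = re.compile(r"\s")


def analyze(text: str) -> tuple[int, int]:
    words = len(WORD.findall(text))
    a = len(text)                  # all characters
    b = len(SPACE.findall(text))   # whitespace characters
    return words, a - b            # (word count, characters excluding spaces)


for sample in ["This -- is a test.", "up-to-date", "up--to--date",
               "eat at joe's"]:
    print(f"{sample!r}: {analyze(sample)}")
```

Matching words instead of splitting on breaks sidesteps the need for a single split() pattern, and the character tally a - b is independent of how words are delimited.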

Comment 7 Björn Michaelsen 2011-12-23 12:26:44 UTC

[This is an automated message.]
This bug was filed before the changes to Bugzilla on 2011-10-16. Thus it
started right out as NEW without ever being explicitly confirmed. The bug is
changed to state NEEDINFO for this reason. To move this bug from NEEDINFO back
to NEW please check if the bug still persists with the 3.5.0 beta1 or beta2 prereleases.
Details on how to test the 3.5.0 beta1 can be found at:
http://wiki.documentfoundation.org/QA/BugHunting_Session_3.5.0.-1

more detail on this bulk operation: http://nabble.documentfoundation.org/RFC-Operation-Spamzilla-tp3607474p3607474.html

Comment 8 sasha.libreoffice 2012-03-01 03:05:52 UTC

Thanks for all the groundwork on word counting.
IMHO, a correct version of word counting could be written as a Basic macro.
Please check: perhaps such a macro already exists among the OpenOffice extensions.

Comment 9 Jamee Mikell 2012-04-20 09:20:10 UTC

Created attachment 60399
Example of outline spreadsheet described in comment 9.

In case someone is interested in writing a macro to do more sophisticated word counting, this attachment illustrates the semi-manual solution I describe in comment 9.

Comment 10 Jamee Mikell 2012-04-20 09:21:49 UTC

Comment on attachment 60399
Example of outline spreadsheet described in comment 9.

re: comment 8
It's possible I could figure out LO's macro language and object structures. I have developed (and currently maintain and enhance) macros in VBA for Excel, and made a living coding C and learned a little Python for a couple of small side projects I did, and used various other languages off and on, so it isn't like I have no programming background. But I'd rather be writing fiction, and I haven't found much to help me understand LO's macro language or object structure.

I didn't find any extensions or such that offered more sophisticated word counting. The one or two examples I found rely on the underlying (flawed in my opinion) word count method or implement the same method in their own code.

To be honest, though, if I did write a macro, I'd write it to look for a specific section break identifier (e.g., <eol>#<eol>) that would be entered as part of a dialog and then report word count data for each section. Or use bookmarks and count between bookmarks. Probably other options for defining breaks too.
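The per-section idea can be sketched on plain text, assuming a section break is a line containing only the marker (here `#`) and that words are simply whitespace-delimited. This is not a LibreOffice macro; the function name and the marker default are hypothetical.

```python
# Split text into sections at lines that contain only the marker, then
# report a whitespace-delimited word count for each section.

def section_word_counts(text: str, marker: str = "#") -> list[int]:
    sections: list[list[str]] = []
    current: list[str] = []
    for line in text.splitlines():
        if line.strip() == marker:
            sections.append(current)
            current = []
        else:
            current.append(line)
    sections.append(current)
    return [len(" ".join(lines).split()) for lines in sections]


document = "Joe ran.\n#\nJoe walked home.\n"
print(section_word_counts(document))  # [2, 3]
```

The marker line itself is never counted, so the section totals are unaffected by the break symbol.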

In the end, I decided this issue wasn't going to get solved in a way that meets my needs any time soon so I modified my outline spreadsheet for this writing project. See attachment OutlineExample.ods. Each time I edit a section (part of a chapter), I run the default word count and record the LO "word" count and the count of characters including spaces on the section's row in the spreadsheet. I sum up the two sets of numbers--LO words and characters--at the end of each set of sections and report LO words and char/6 and char/6 and char/X, where X is a value I determined was a reasonable estimate of characters per page based on sampling several comparable books. I can also estimate char/6 given char for any single section if I need to. This helps me balance the size of books and, to a lesser extent, sections and chapters.

If someone would like to write a macro or extension or enhance LO to facilitate more sophisticated word counting methods that fiction writers and others who care about word count might find useful, I would be grateful. I occasionally make changes with find/replace across the whole document, which requires recounting each section to update the count data. But as I noted, I'd rather write fiction than spend a lot of time figuring out how LO's macro language works and how the Writer objects work and such. If I already knew LO's macro language and objects, it would probably be a different story.

Comment 11 Stefan Knorr 2012-06-28 01:37:33 UTC

Setting this to NEW.

Can reproduce & completely agree with Jamee's proposal.

Comment 12 Caolán McNamara 2012-08-28 09:29:55 UTC

re comment #6: the CH_TXTATR_BREAKWORD there is a special case to filter out "words" which consist of a special internal placeholder character. So forget about that; the meat is in "SwScanner::NextWord", which determines the bounds of the words, backed mostly by ICU's word break iterator.

Comment 13 Caolán McNamara 2012-08-28 16:14:08 UTC

I have a plausible solution for much of this while exploring msword-alike word counts wrt endash and emdash

Comment 14 Not Assigned 2012-08-29 08:12:51 UTC

Caolan McNamara committed a patch related to this issue.
It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=42a15f45ff4e02f98229de02efd0d8c19f10bcd5

Resolves: fdo#38983 allow extra word boundary characters

Comment 15 Caolán McNamara 2012-08-29 08:25:49 UTC

So, here's what we've got now.

By default, out of the box, master towards 3-7 should now give the same word and character counts as MSWord 2010 does for the 7 examples in the original .odt. That is, for the purposes of word counting, a hyphen, equals sign, underscore, etc. do not split a word, but endash and emdash do split a word. So out of the box, examples 1, 2 and 3 are now 4 words, examples 4, 5 and 6 remain as 5 words, and example 7 remains as 1 word.

Additional word separators can be configured with tools->options->writer->general to add additional characters to that list (or remove the emdash/endash) to customize word breaking behaviour for word counting. It may not exactly address every potential issue, but it does allow a certain amount of tweaking.
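The behaviour described here can be approximated with a sketch, assuming whitespace splitting plus a configurable list of extra separator characters. The real fix is backed by ICU's word break iterator, which this sketch does not reproduce; `DEFAULT_EXTRA_SEPARATORS` and `count_words` are hypothetical names.

```python
# Default extra word separators, as described above: en dash and em dash.
DEFAULT_EXTRA_SEPARATORS = "\u2013\u2014"


def count_words(text: str,
                extra_separators: str = DEFAULT_EXTRA_SEPARATORS) -> int:
    # Turn every configured separator into a space, then split on whitespace.
    table = str.maketrans({ch: " " for ch in extra_separators})
    return len(text.translate(table).split())


print(count_words("This\u2014is a test."))                       # 4
print(count_words("This -- is a test."))                         # 5
print(count_words("up-to-date"))                                 # 1
print(count_words("This\u2014is a test.", extra_separators=""))  # 3
```

Passing an empty separator list corresponds to removing the emdash and endash from the option, which brings back the old count of 3 for the unspaced em-dash example.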

Comment 16 Roman Eisele 2012-08-29 08:29:23 UTC

(In reply to comment #14)
> Caolan McNamara committed a patch related to this issue.
> It has been pushed to "master":

@Caolán:
Wow, thank you very much for this fix! This will help academic users of LibreOffice a lot (word count is very important in the academic world, and probably also in journalism, technical writing, etc.).

And it makes it much easier for me to recommend LibreOffice to students, if I can tell them that LibreOffice has got a reliable, “Word compatible”, and even configurable word count function. Especially the configurability is really cool!

So thank you again for this fastidious fix! I am eager to test it ...

Comment 17 Roman Eisele 2012-10-01 14:28:16 UTC

VERIFIED as FIXED with LOdev 3.7.0.0.alpha0+ (Build ID: 30d33b1, pull time: 2012-09-27 04:27:30) on Mac OS X 10.6.8 (Intel).

Of course, it is not possible to test all possible applications of the fix/new feature, but I have verified that:

1) The original samples (comment #0) are now handled by default as expected by academic style manuals and most users, i.e. that ‘—’ and ‘–’ (em-dash and en-dash) are not counted as words in samples like
   This – is a test. // en-dash with spaces, modern British (and German …) style
   This—is a test.   // em-dash w/o spaces, American and old British style

2) I can add additional characters like ‘§’ or ‘…’ (horizontal ellipsis, U+2026) to the new word separators list in “Tools → Options → Writer → General”, and they will be handled as expected; i.e., while by default
   This … is a test.
was counted as 5 words, it will then be counted as 4 words (like e.g. German academic users would expect), and the same is true for
   This § 17 is stupid.
which is, when I add ‘§’ to the exceptions list, correctly counted as 4 words.
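The effect of extending the separator list can be mimicked with a small sketch, assuming whitespace splitting plus a user-supplied separator string. It only imitates the option's observable effect on these two samples and is not the LibreOffice implementation; the names are hypothetical.

```python
# Count whitespace-delimited tokens after blanking out separator characters.

def count_words_with_separators(text: str,
                                separators: str = "\u2013\u2014") -> int:
    for ch in separators:
        text = text.replace(ch, " ")
    return len(text.split())


default = "\u2013\u2014"          # en dash, em dash
custom = default + "\u2026\u00a7"  # plus horizontal ellipsis and section sign

for sample in ["This \u2026 is a test.", "This \u00a7 17 is stupid."]:
    print(f"{sample!r}: default={count_words_with_separators(sample, default)}, "
          f"custom={count_words_with_separators(sample, custom)}")
```

Both samples drop from 5 words to 4 once the extra characters are in the list, matching the counts reported in this comment.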

While doing these tests I have not noticed any negative side-effects (regressions).

So thank you again for this little, but very useful new feature!