Bug 34779 - Counts are wrong when paragraph at/over 65535 characters
Status: RESOLVED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer
Version: 3.3.0 release (earliest affected)
Hardware: x86 (IA32) All
Importance: medium normal
Assignee: Caolán McNamara
Reported: 2011-02-26 13:16 UTC by chickenwingspan
Modified: 2011-07-08 16:20 UTC

Attachments
All the necessary information is already in the above description. (122.79 KB, application/rtf)
2011-02-26 13:16 UTC, chickenwingspan

Description chickenwingspan 2011-02-26 13:16:28 UTC
Created attachment 43860
All the necessary information is already in the above description.

I found a bug in LibreOffice Writer when opening a document that was originally created in Microsoft Word 2007. If you go to Tools/Word Count in LibreOffice Writer, the character count for the whole document is sometimes less than the word count. This doesn't always happen, so I've attached a sample document that does have the problem. The steps to reproduce this bug are quite simple:

1. Open Microsoft Word 2007
2. Type something. I found out that the bug only happens in "long" documents. I'm not sure exactly how long the document has to be for the bug to happen, but about 20 pages or more should be sufficient. If in doubt, just see my attachment.
3. Clean all metadata using the document inspector built into Microsoft Word 2007.
4. Save the document as an .rtf file.
5. Open the file in LibreOffice Writer and view the character count.

To use the attached file, just open it in LibreOffice 3.3.0 and view the character count. You will see that the character count is actually less than the word count, which is impossible because in any document, the character count must be greater than or equal to the word count; each word must have at least one character.
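
The invariant described above can be checked mechanically. The following is a minimal Python sketch, not anything taken from LibreOffice: `counts_are_consistent` is a hypothetical helper, and the numbers passed to it are illustrative only.

```python
def counts_are_consistent(words: int, chars: int, chars_no_spaces: int) -> bool:
    """Return True if the three counts could describe the same text.

    Every word contains at least one non-space character, and the
    non-space count can never exceed the total character count.
    """
    return words <= chars_no_spaces <= chars


# Illustrative values: a two-word text of 9 characters, 8 of them non-space.
print(counts_are_consistent(2, 9, 8))
# Illustrative values with more words than characters, which cannot be right.
print(counts_are_consistent(20, 9, 8))
```

Any result from the Word Count dialog that fails this check points at a counting bug rather than at an unusual document.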
Comment 1 tester8 2011-02-26 14:21:17 UTC
LO 3.3.1 RC2 Ubuntu 10.04 x86
Whole document
Words                       17496
Characters                  6392
Characters excluding spaces 0

With any selection
Current Selection
Words                       0
Characters                  Actual_number
Characters excluding spaces 0

When all text is selected

Current Selection
Words                       0
Characters                  65535
Characters excluding spaces 0

Whole document
Words                       17496
Characters                  6392
Characters excluding spaces 0

And finally, when saved as .txt and counted with wc:
wc -w Sample_document.txt | awk '{ print $1 }'
12822

wc -m Sample_document.txt | awk '{ print $1 }'
65537
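
For readers without wc at hand, roughly the same reference counts can be produced in Python. This is a sketch under assumptions: `reference_counts` is a hypothetical helper, words are taken to be whitespace-separated runs (as wc -w does), and characters are counted as Unicode code points, which may differ from what wc -m reports in some locales and from how Writer counts.

```python
def reference_counts(text: str) -> tuple:
    """Return (words, characters, characters excluding whitespace)."""
    words = len(text.split())                       # whitespace-separated runs
    chars = len(text)                               # includes spaces and newlines
    non_space = sum(1 for ch in text if not ch.isspace())
    return words, chars, non_space


# Small self-contained example; a real check would read the saved .txt file.
print(reference_counts("two words\n"))
```

Comparing these three numbers with the Word Count dialog shows which of Writer's counts has gone wrong.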
Comment 2 LeMoyne Castle 2011-05-31 10:55:46 UTC
Take the attached test document and delete a few of the repeated sentences. Run Tools->Word Count: the char count is just below 65535 and the word count is close to the values from tester8. Insert the repeated sentence until it won't take any more text. Run word count again: the char count is maxed out and all other counts are zero.

Test document is a 15+ page paragraph.  Problem clearly related to size of paragraph and happens on Linux so edited platform and bug title.
Comment 3 Caolán McNamara 2011-07-08 15:48:44 UTC
the "nice" 32bit len OUStrings are mangled into "nasty" 16bit len UniStrings in SwScanner to get the word and excluding space counts
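
To illustrate the narrowing described above, here is a minimal Python sketch of what a 16-bit length field can and cannot represent. It is a model only: `clamp_to_16bit_length` and `wrapped_16bit_length` are hypothetical names, and this report does not say whether the real conversion truncates, wraps, or fails in some other way, so both behaviours are shown as assumptions.

```python
MAX_LEN_16 = 0xFFFF  # 65535, the largest value an unsigned 16-bit length can hold


def clamp_to_16bit_length(text: str) -> str:
    """Assumed saturating behaviour: keep only the first 65535 characters."""
    return text[:MAX_LEN_16]


def wrapped_16bit_length(length: int) -> int:
    """Assumed wrapping behaviour: keep only the low 16 bits of the length."""
    return length & MAX_LEN_16


# One long paragraph, as in the attached test document (size is illustrative).
# Python counts code points; for this ASCII text that equals UTF-16 code units.
paragraph = "word " * 14000                   # 70000 characters
print(len(paragraph))                         # 70000
print(len(clamp_to_16bit_length(paragraph)))  # 65535
print(wrapped_16bit_length(len(paragraph)))   # 4464
```

Either way, any count derived from the narrowed string no longer describes the full paragraph once it passes 65535 characters.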
Comment 4 Caolán McNamara 2011-07-08 16:20:18 UTC
Converted afflicted SwScanner from UniString to rtl::OUString as http://cgit.freedesktop.org/libreoffice/writer/commit/?id=65d3573c5afd7fd132b30c41a240d1d2d04c8527

so that gives 12822 words now and 52714 non-space chars, which is a good bit better.