Bug 126629 - Writer reads some n-dashes as words - Editing
Summary: Writer reads some n-dashes as words - Editing
Status: NEW
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer
(earliest affected) release
Hardware: x86-64 (AMD64) Windows (All)
Importance: medium trivial
Assignee: Not Assigned
Depends on:
Blocks: Formatting-Mark
Reported: 2019-07-30 17:42 UTC by stephen.sottong
Modified: 2019-08-07 16:25 UTC
CC List: 2 users

See Also:
Crash report or crash signature:

Shows example of a dash that is not counted as a word and one that is. (8.11 KB, application/vnd.openxmlformats-officedocument.wordprocessingml.document)
2019-07-30 17:43 UTC, stephen.sottong

Description stephen.sottong 2019-07-30 17:42:01 UTC
While checking the word count in a long document, I found that Writer's count was always 10 words higher. I finally traced it to Writer counting some dashes as words. Neither MS Word nor SoftMaker TextMaker counts these as words. I can provide a document that demonstrates the difference, but it doesn't reproduce in an online form.

Steps to Reproduce:
1. Not sure how the dashes that get counted were created.

Actual Results:
Some dashes are counted as words

Expected Results:
The count should have ignored the dashes.

Reproducible: Always

User Profile Reset: No

Additional Info:
Comment 1 stephen.sottong 2019-07-30 17:43:41 UTC
Created attachment 153059
Shows example of a dash that is not counted as a word and one that is.
Comment 2 V Stuart Foote 2019-07-30 20:50:53 UTC
In OOXML the run is "<w:t xml:space="preserve">Earth </w:t><w:softHyphen/><w:t>– not</w:t></w:r>" 

Which on filter import to Writer gives a text run of U+0020 U+00AD U+2013 U+0020
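
For anyone who wants to check the code points themselves, here is a minimal Python sketch. It assumes the imported run can be represented as the literal string below (reconstructed from the OOXML above, not extracted from the attachment), and `describe` is a hypothetical helper name.

```python
import unicodedata

# Assumed text of the imported run: "Earth", space, SOFT HYPHEN, EN DASH, space, "not".
run_text = "Earth \u00ad\u2013 not"

def describe(text):
    """Return (code point, Unicode name) pairs for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]

# Characters 5..8 are the four code points between "Earth" and "not".
for code, name in describe(run_text[5:9]):
    print(code, name)
# U+0020 SPACE
# U+00AD SOFT HYPHEN
# U+2013 EN DASH
# U+0020 SPACE
```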

So it seems that the filter-assigned U+00AD (SOFT HYPHEN), in combination with U+2013 (EN DASH) and bounded by spaces, is treated as a word by the edit engine, increasing the word count.
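
To illustrate how a discrepancy like this could arise, here is a hypothetical Python sketch. It is not LibreOffice's word-count code; both counters and their names are assumptions made up for the example. The first skips tokens made only of punctuation, so a lone en dash is ignored, but the soft hyphen is a format character (Unicode category Cf), so the combined token is no longer "punctuation only" and gets counted. The second drops format characters before classifying the token.

```python
import unicodedata

def is_punct_only(token):
    """True if every character is in a Unicode punctuation category (P*)."""
    return all(unicodedata.category(ch).startswith("P") for ch in token)

def count_words_naive(text):
    """Hypothetical counter: skip whitespace-separated tokens that are pure punctuation."""
    return sum(1 for tok in text.split() if not is_punct_only(tok))

def count_words_ignoring_format(text):
    """Hypothetical variant: drop format characters (Cf) before classifying."""
    count = 0
    for tok in text.split():
        visible = "".join(ch for ch in tok if unicodedata.category(ch) != "Cf")
        if visible and not is_punct_only(visible):
            count += 1
    return count

plain_dash = "Earth \u2013 not"        # lone EN DASH
shy_dash = "Earth \u00ad\u2013 not"    # SOFT HYPHEN + EN DASH, as in the imported run

print(count_words_naive(plain_dash))            # 2
print(count_words_naive(shy_dash))              # 3
print(count_words_ignoring_format(shy_dash))    # 2
```

Under these assumptions, the soft hyphen alone accounts for the extra word; whether Writer's actual break iterator behaves this way would need to be confirmed in the source.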