Bug 126629

Summary: Writer counts dashes (soft hyphen, hyphen, and others) as words when en-dash and em-dash are ignored
Product: LibreOffice
Reporter: steve.sottong
Component: Writer
Assignee: Not Assigned <libreoffice-bugs>
Status: NEW ---
Severity: trivial
CC: Dianavides09, stephane.guillou, vsfoote, xiscofauli
Priority: medium    
Version: Inherited From OOo   
Hardware: x86-64 (AMD64)   
OS: All   
See Also: https://bugs.documentfoundation.org/show_bug.cgi?id=62799
https://bugs.documentfoundation.org/show_bug.cgi?id=38983
Whiteboard:
Crash report or crash signature:
Regression By:
Bug Depends on:    
Bug Blocks: 102345, 103479    
Attachments: Shows example of a dash that is not counted as a word and one that is.

Description steve.sottong 2019-07-30 17:42:01 UTC
Description:
While checking the word count of a long document, I found that Writer's count consistently came out 10 words higher. I eventually traced this to Writer counting some dashes as words. Neither MS Word nor Softmaker Textmaker counts these as words. I can provide a document that demonstrates the difference, but it doesn't reproduce in an online form.

Steps to Reproduce:
1. Not sure how the dashes that are counted were made.

Actual Results:
Some dashes are counted as words

Expected Results:
The count should have ignored the dashes.
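
The expected behaviour can be sketched as follows. This is a minimal illustration only, not Writer's word-count algorithm: the function names, the whitespace-based tokenisation, and the choice of dash characters in `DASHES` are all assumptions made for the example.

```python
# Hypothetical illustration; this is NOT how Writer counts words.
# Assumed set of dash characters: hyphen-minus, en dash, em dash.
DASHES = {"\u002d", "\u2013", "\u2014"}


def naive_count(text: str) -> int:
    """Count every whitespace-separated token as a word."""
    return len(text.split())


def count_ignoring_dashes(text: str) -> int:
    """Count tokens, skipping any token made up only of dash characters."""
    return sum(
        1 for tok in text.split()
        if not all(ch in DASHES for ch in tok)
    )


sample = "Earth - not Mars"
print(naive_count(sample))            # the stand-alone dash is counted
print(count_ignoring_dashes(sample))  # the stand-alone dash is skipped
```

A hyphen inside a word (for example "well-known") stays part of its token in this sketch, so only stand-alone dashes are dropped from the count.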


Reproducible: Always


User Profile Reset: No



Additional Info:
Comment 1 steve.sottong 2019-07-30 17:43:41 UTC
Created attachment 153059 [details]
Shows example of a dash that is not counted as a word and one that is.
Comment 2 V Stuart Foote 2019-07-30 20:50:53 UTC
In OOXML the run is "<w:t xml:space="preserve">Earth </w:t><w:softHyphen/><w:t>– not</w:t></w:r>" 

Which on filter import to Writer gives a text run of U+0020 U+00AD U+2013 U+0020

So it seems that the filter-assigned U+00AD (SOFT HYPHEN), in combination with the U+2013 (EN DASH) and bounded by spaces, is treated as an edit engine word, increasing the word count.
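
The run described above can be inspected with a short Python sketch. The surrounding words "Earth" and "not" come from the OOXML fragment; splitting on whitespace is only an assumption used to show that the soft hyphen plus en dash form a stand-alone token, and says nothing about how the edit engine itself finds word boundaries.

```python
import unicodedata

# Text run reported after import: space, soft hyphen, en dash, space.
run = "\u0020\u00ad\u2013\u0020"

# Print each code point with its Unicode name.
for ch in run:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# Assumption for illustration: split on whitespace only.
# U+00AD is not whitespace, so it stays attached to the en dash.
tokens = ("Earth" + run + "not").split()
print(len(tokens))
print([[f"U+{ord(c):04X}" for c in tok] for tok in tokens])
```

Under that assumption, the middle token consists of U+00AD followed by U+2013, which matches the extra "word" seen in the count.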
Comment 3 QA Administrators 2021-08-07 03:40:06 UTC Comment hidden (obsolete)
Comment 4 Diana Vides 2023-05-25 01:55:11 UTC
I was first able to reproduce this bug in version 6.4.7.2. A short dash is counted as a word, but a long (autocorrected) dash is not.
Steps to Reproduce:
1. Type a dash, add a space, type a word, and press Enter.
2. Type a word, add a space, type a dash, type a word, and add a space.


Actual Results:
The short dash in Step 1 is counted as a word, and the long (autocorrected) dash in Step 2 is not counted as a word.

Expected Results:
The short dash and the long dash should both be counted or both be ignored, depending on the specification. The user guide is ambiguous.
https://help.libreoffice.org/7.2/en-US/text/swriter/guide/words_count.html?&DbPAR=WRITER&System=WIN


Version: 6.4.7.2 (x64)
Build ID: 639b8ac485750d5696d7590a72ef1b496725cfb5
CPU threads: 6; OS: Windows 10.0 Build 19045; UI render: default; VCL: win; 
Locale: en-US (en_US); UI-Language: en-US
Calc: CL

I reproduced it in version 7.5.2.2, and it is still present.

Version: 7.5.2.2 (X86_64) / LibreOffice Community
Build ID: 53bb9681a964705cf672590721dbc85eb4d0c3a2
CPU threads: 6; OS: Windows 10.0 Build 19045; UI render: Skia/Raster; VCL: win
Locale: en-US (en_US); UI: en-US
Calc: threaded

I reproduced it in the master version 7.6.0.0, and it is still present.

Version: 7.6.0.0.alpha1+ (X86_64) / LibreOffice Community
Build ID: f4c24da1e7f11664e0d2f688d2531f068e4a3bc0
CPU threads: 6; OS: Windows 10.0 Build 19045; UI render: Skia/Raster; VCL: win
Locale: en-US (en_US); UI: en-US
Calc: CL threaded
Comment 5 Stéphane Guillou (stragu) 2023-06-26 16:57:54 UTC
I checked in OOo 3.3: a simple hyphen and a soft hyphen surrounded by spaces were already counted there (although the en-dash was also counted back then).

A related issue concerning the documentation is bug 62799.

Testing in 24.2 alpha0+:

Not counted

En – dash: not counted (U+2013)
Em — dash: not counted (U+2014)

Counted

Horizontal ― bar: counted (U+2015)
Figure ‒ dash: counted (U+2012)
Hyphen - minus: counted (U+002D)
Minus − sign: counted (U+2212)
Hyphen ‐ hyphen: counted (U+2010)
Soft ­ hyphen: counted (U+00AD)

Version: 24.2.0.0.alpha0+ (X86_64) / LibreOffice Community
Build ID: 9fc0b2b9b96d87eb642a3b29e9dcb5d6273265eb
CPU threads: 8; OS: Linux 5.15; UI render: default; VCL: gtk3
Locale: en-AU (en_AU.UTF-8); UI: en-US
Calc: threaded
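
For reference, the characters tested above can be listed with their Unicode names and general categories using Python's standard `unicodedata` module. The "counted" and "not counted" labels are copied from the test results in this comment; the sketch makes no claim about which property Writer actually uses to decide.

```python
import unicodedata

# (code point, result observed in the 24.2 alpha0+ test above)
observed = [
    (0x2013, "not counted"),
    (0x2014, "not counted"),
    (0x2015, "counted"),
    (0x2012, "counted"),
    (0x002D, "counted"),
    (0x2212, "counted"),
    (0x2010, "counted"),
    (0x00AD, "counted"),
]

# General category per code point, e.g. Pd = dash punctuation.
categories = {cp: unicodedata.category(chr(cp)) for cp, _ in observed}

for cp, result in observed:
    name = unicodedata.name(chr(cp))
    print(f"U+{cp:04X}  {name:<14}  {categories[cp]}  {result}")
```

The en dash (not counted) and the hyphen (counted) share the same general category, Pd, so the category alone does not separate the two groups.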