Bug 71877 - Word Count Wrong for ZWSP delimited text in SEA languages (Thai, Lao, Khmer, and Burmese)
Status: RESOLVED WORKSFORME
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Linguistic
Version (earliest affected): unspecified
Hardware: Other All
Importance: medium normal
Assignee: Not Assigned
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: Word-Count
Reported: 2013-11-21 14:11 UTC by Robert M Campbell
Modified: 2020-10-26 16:27 UTC (History)
CC List: 4 users

See Also:
Crash report or crash signature:


Attachments
Test document including ZWSP and non-ZWSP Thai, Lao, Khmer, and Burmese text (42.63 KB, application/vnd.oasis.opendocument.text)
2013-11-21 14:11 UTC, Robert M Campbell
Test document including ZWSP and non-ZWSP Thai, Lao, Khmer, and Burmese text (38.13 KB, application/vnd.oasis.opendocument.text)
2013-11-25 04:21 UTC, Robert M Campbell
Mittaphap (24.57 KB, application/x-font-ttf)
2013-11-25 04:42 UTC, Robert M Campbell
Mittaphap Book (24.63 KB, application/x-font-ttf)
2013-11-25 04:43 UTC, Robert M Campbell

Description Robert M Campbell 2013-11-21 14:11:17 UTC
Created attachment 89590 [details]
Test document including ZWSP and non-ZWSP Thai, Lao, Khmer, and Burmese text

When working with text that uses ZWSPs (zero width spaces) to delimit text, LibreOffice does not count each word. When the ZWSPs are removed, the word count acts fine.

But, word selection (double click) and line breaking work fine with or without ZWSPs.

Testing document attached.
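
To illustrate the kind of difference described above, here is a minimal Python sketch. It is not LibreOffice's word-count code; the two function names and the sample words are made up for illustration. It only shows that a counter which splits on whitespace alone sees ZWSP-delimited text as a single word, because U+200B is not classified as whitespace, while a counter that also treats U+200B as a delimiter sees each word.

```python
import re

ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE


def count_words_whitespace_only(text: str) -> int:
    """Count words by splitting on whitespace only (hypothetical naive counter).

    str.split() with no arguments does not split on U+200B.
    """
    return len(text.split())


def count_words_zwsp_aware(text: str) -> int:
    """Count words, treating runs of whitespace and/or ZWSP as delimiters."""
    tokens = re.split(r"[\s\u200b]+", text)
    return len([t for t in tokens if t])  # drop empty leading/trailing tokens


# Three arbitrary placeholder words joined only by ZWSPs, with no visible spaces.
sample = ZWSP.join(["word1", "word2", "word3"])

naive = count_words_whitespace_only(sample)
aware = count_words_zwsp_aware(sample)
print(naive, aware)
```

The placeholder words stand in for Thai, Lao, Khmer, or Burmese words; the delimiter handling is the same regardless of script.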
Comment 1 Robinson Tryon (qubit) 2013-11-24 22:54:58 UTC
CONFIRMED in LO Version: 4.2.0.0.beta1 + Ubuntu 12.04.3

(In reply to comment #0)
> When working with text that uses ZWSPs (zero width spaces) to delimit text,
> LibreOffice does not count each word. When the ZWSPs are removed, the word
> count acts fine.

Per instructions in Test document:

REPRO STEPS:
- Open test document in LibreOffice
- Highlight first 4 paragraphs

As noted in the document, the bottom bar shows "202 words"

- Highlight the next set of 4 paragraphs

As noted in the document, the bottom bar shows "2 words"

> But, word selection (double click) and line breaking work fine with or
> without ZWSPs.

Well, at least there's that!

> 
> Testing document attached.

Thanks for the test document. Some of the fonts are not present on my system -- would it be possible to change the test document to use fonts included in LO that exercise the same bug?  (if not, perhaps point to where the fonts might be downloaded)

Status -> NEW
Comment 2 Robinson Tryon (qubit) 2013-11-24 22:57:05 UTC
Andras - Is this behavior a bug?
Comment 3 Robert M Campbell 2013-11-25 04:17:08 UTC
Paragraphs 1 & 5 (Thai) - No LibreOffice fonts that I can tell
Droid Sans
https://www.google.com/fonts/specimen/Droid+Sans

Paragraphs 2 & 6 (Khmer) - No LibreOffice fonts that I can tell
Khmer OS
http://sourceforge.net/projects/khmer/files/Fonts%20-%20KhmerOS/KhmerOS%20Fonts%204.0-%20LGPL%20License/

Paragraphs 3 & 7 (Lao) - No LibreOffice fonts that I can tell
Mittaphap
http://hg.palaso.org/font-lao2/file/d0764b11848f

Padauk (included in LibreOffice) is the Burmese Font

I'll adjust the document to the fonts listed. Mittaphap in particular is fairly new and only available as source, not ttf yet, but I have generated some fonts and can attach them here if that would be helpful?
Comment 4 Robert M Campbell 2013-11-25 04:21:26 UTC
Created attachment 89726 [details]
Test document including ZWSP and non-ZWSP Thai, Lao, Khmer, and Burmese text
Comment 5 Robinson Tryon (qubit) 2013-11-25 04:38:03 UTC
(In reply to comment #3)
> [...various font things ..] 
> I'll adjust the document to the fonts listed.

thanks

> Mittaphap in particular is
> fairly new and only available as source, not ttf yet, but I have generated
> some fonts and can attach them here if that would be helpful?

As long as the links are stable and the fonts are under some FOSS license so that we may test against them, it's generally fine to link to external font files.
Comment 6 Robert M Campbell 2013-11-25 04:42:58 UTC
Created attachment 89727 [details]
Mittaphap
Comment 7 Robert M Campbell 2013-11-25 04:43:29 UTC
Created attachment 89728 [details]
Mittaphap Book
Comment 8 Robert M Campbell 2013-11-25 04:52:49 UTC
Mittaphap is licensed OFL
Comment 9 Robert M Campbell 2014-01-22 03:16:36 UTC
Any news on this bug? Anything I can do to help?
Comment 10 Robinson Tryon (qubit) 2015-01-14 08:14:19 UTC
(In reply to Robert M Campbell from comment #9)
> Any news on this bug? Anything I can do to help?

Hi Robert,
Good question -- sorry for the late reply here! As you can see, we have a large number of open bug reports filed against LibreOffice, so it's often a matter of finding the right resource to help address a particular bug or set of bugs.

This bug appears to affect a number of different languages including Thai, so I'd suggest that you check with the Thai mailing list and see if others are experiencing the same problem:
https://wiki.documentfoundation.org/Local_Mailing_Lists#Thai

If the problem is affecting many people, then we can try to identify someone who'd be interested in working on a fix. This could be a great opportunity for a university CS student or someone else familiar with programming to learn more about LibreOffice.
Comment 11 QA Administrators 2017-10-25 08:58:19 UTC Comment hidden (obsolete)
Comment 12 Robert M Campbell 2017-10-25 10:48:57 UTC
Sorry, life, travels, and ever-expanding projects seem to eat up time. I've just now reviewed this bug (tested with 5.4.2.2 (x64)) and...

It still behaves the same way as before, so it is still not providing correct word counts.

Basically, without any zero-width spaces (ZWSPs), the word counts seem spot on. The problem only shows up when working with text that contains ZWSPs.

I'm not exactly sure where this happens in the code. My programming skills outside the web sphere are not super strong, but I am willing to look into it if someone can guide me to where I should start looking.

What I don't know, and this may play a major factor in things, is whether all users use zero-width spaces to delimit words. In the case of Thai, Lao, and Khmer this seems to be the case, but I'm not a linguist or language expert, though I can read at varying levels in each listed language. It may be that users sometimes insert ZWSPs only where in English we'd use a hyphen for the same purpose (line breaking).
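
One rough way to look at that question for a given document is sketched below in Python. The function name `zwsp_profile` and the idea of using average segment length as a hint are assumptions made for illustration, not an established method and not anything LibreOffice does. The thought is that text with a ZWSP at every word boundary yields many short segments, while text with only occasional line-break hints yields few long ones.

```python
ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE


def zwsp_profile(paragraph: str) -> dict:
    """Summarise how ZWSPs are distributed in a paragraph (hypothetical helper).

    Lengths are measured in code points, so combining marks count
    as separate characters.
    """
    segments = [s for s in paragraph.split(ZWSP) if s]
    avg = sum(len(s) for s in segments) / len(segments) if segments else 0.0
    return {
        "zwsp_count": paragraph.count(ZWSP),
        "segments": len(segments),
        "avg_segment_chars": avg,
    }


# Placeholder text: a ZWSP after every two characters vs. none at all.
dense = zwsp_profile("ab" + ZWSP + "cd" + ZWSP + "ef")
sparse = zwsp_profile("abcdef")
print(dense)
print(sparse)
```

What counts as a "short" or "long" segment would differ by language and would need checking against real text by someone who reads it.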

Anyways, point me where I can help, and I'm glad to do what I can.

Thanks!
Comment 13 QA Administrators 2018-10-26 02:58:40 UTC Comment hidden (obsolete)
Comment 14 QA Administrators 2020-10-26 04:13:04 UTC Comment hidden (obsolete)
Comment 15 Robert M Campbell 2020-10-26 16:27:02 UTC
I believe this issue may be resolved. I'm working with Lao content on a regular basis in LibreOffice, and the OSs I use all seem to word count Lao correctly. I've not done extensive testing, but it seems to work well now.