Bug 49629 - Macros: GotoEndOfWord fails when footnote at word end
Summary: Macros: GotoEndOfWord fails when footnote at word end
Status: RESOLVED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: BASIC (show other bugs)
Version:
(earliest affected)
3.5.2 release
Hardware: All All
: medium normal
Assignee: Caolán McNamara
URL:
Whiteboard: BSA target:3.7.0
Keywords:
Depends on:
Blocks:
 
Reported: 2012-05-08 03:33 UTC by Nemo
Modified: 2013-12-05 14:59 UTC (History)
4 users (show)

See Also:
Crash report or crash signature:


Attachments
tentative fix (1.97 KB, patch)
2012-05-11 01:36 UTC, David Tardon
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Nemo 2012-05-08 03:33:03 UTC
Problem description: 
a GotoEndOfWord fails, when there is a footnote at the end of a word

Steps to reproduce:
- Type a word and a footnote directly at the end of the word

- write a macro kind of like this:
Dim sText As String
xDoc=ThisComponent
oText=xDoc.Text
xSelection = xDoc.CurrentController.getSelection
xRange=xSelection(0)
xCursor=xRange.getText.createTextCursorByRange(xRange)
xCursor.gotoStartOfWord(False)
xCursor.GotoEndOfWord(True)
sText=xCursor.GetString()
msgbox "sText:" + sText

- execute Macro with cursor inside the word with the attached footnote

Current behavior:
sText is empty

Expected behavior:
stext should contain text from beginning of word to the end of the word

Platform (if different from the browser): 
Win7 64bit
              
Browser: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20100101 Firefox/12.0
Comment 1 David Tardon 2012-05-11 01:28:14 UTC
(In reply to comment #0)
> msgbox "sText:" + sText

This should of course be

msgbox("sText:" + sText)

otherwise nothing is ever shown.
Comment 2 David Tardon 2012-05-11 01:36:30 UTC
Created attachment 61429 [details]
tentative fix

The problem is that footnote is represented as character \002 internally by Writer. getWordBoundary() includes it into the word, but isEndWord() does not, because of:

       sal_Int32 tmp = skipSpace(Text, nPos, len, rWordType, sal_False);

and returns false at the next line.

The attached patch fixes the problem, but I am not sure it is the best way to do that.
Comment 3 Nemo 2012-05-11 02:04:00 UTC
(In reply to comment #1)
> (In reply to comment #0)
> > msgbox "sText:" + sText
> 
> This should of course be
> 
> msgbox("sText:" + sText)
> 
> otherwise nothing is ever shown.

Not very important, but gets shown perfectly with the LO version 3.5.2 specified for this bug.

I'd guess getWordBoundary() is the function used with doubleclick marking (since this includes the footnote). Not sure if doubleclick *should* do that, though.

I'm not familiar with LibreOffice code, but I'd think those two functions (doubleclick marking and gotoStartOfWord/GotoEndOfWord) should arrive at the same boundaries.
Comment 4 Michael Stahl (allotropia) 2012-07-06 11:51:00 UTC
Caolan, does this break iterator patch make sense to you?

it doesn't look obviously wrong to me, but that doesn't mean much here...
Comment 5 Eike Rathke 2012-07-06 16:52:06 UTC
Huh, I was on Cc of this? Missed it..

Including footnote with getWordBoundary() may be on purpose because you don't want to leave the footnote on its own when breaking, I'm not sure. Check also
getCompatibilityScriptClassByBlock() that says
        //  0x02 - this can be inside a word
umm.. yes.

However, seeing with annotate that this few lines piece of code already did undergo its 3rd iteration with "fix word breakiterator issues" I'm a bit uncomfortable with such an invasive (in this case IMHO, maybe I'm overcautious) change without some tests that take also those other issues into account.

As there were (check with git show -w because of extensive whitespace reformatting)
 43c90c8ad4c33388bb0cfc9c966cd3e7d44c7b2a
#i11993# #i14904# fix word breakiterator issues
 f0939f43315a21f5134cd631773ddae7cfef4493
#112021# fix word boundary problem on begining and end of the string
#i21907# fix isBeginWord and isEndWord problem
 2887ecb5554eee699e1dce4ffbc2dfcf71a54a41
#i11993# fix getWordBoundary problem when position is on the end of the word.


Sorry to be a pest here, but breakiterator is known to be a very fragile area with corner cases in different locales or script types. AFAIK Writer has also some behavior specifically adapted to the breakiterator so it may be possible that when something is changed here something else may have to be adjusted in Writer as well where the Boundary returned is evaluated.

Maybe treating the 0x02 specially in isEndWord() might be an option.
Comment 6 Not Assigned 2012-07-13 11:58:02 UTC
Caolan McNamara committed a patch related to this issue.
It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=52280c29880c2d6bc533692d3f0cdff37f35790d

Related: fdo#49629 add test case for #i11993#
Comment 7 Not Assigned 2012-07-13 12:04:26 UTC
Caolan McNamara committed a patch related to this issue.
It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=2cf6778842d259d4e6e1a1f0e4c6c2ceab874eee

Related: fdo#49629 add test case for #i21907#
Comment 8 Not Assigned 2012-07-13 12:35:12 UTC
Caolan McNamara committed a patch related to this issue.
It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=413554a3bd3e7236bd60e4e36f063351f3d31293

Related: fdo#49629 add test case for #i14904#
Comment 9 Caolán McNamara 2012-07-13 14:45:32 UTC
Well, I munged all the available bug reports into test cases anyway. 

Much of the trouble here is that 0x0002 is stuffed in as a code point which can appear in a "word" in the source/breakiterator/data/*word*.txt rules. But 0x0002 doesn't appear as a code point that is non-whitespace in the skipSpace rules so its inconsistent.

Having 0x0002 as a code point than can appear in a word gives all sorts of odd results in writer for word counting, character counting word skipping etc.

I'm inclined to remove the special handing of 0x0002 down in the breakiterator rules and go back up to writer (which I presume is the only thing with the rather mad 0x0002 usage) and handle it specifically there when necessary, e.g. expanding it out to its display text for the purposes of character and word counting
Comment 10 Eike Rathke 2012-07-16 11:16:59 UTC
(In reply to comment #9)
> I'm inclined to remove the special handing of 0x0002 down in the breakiterator
> rules and go back up to writer (which I presume is the only thing with the
> rather mad 0x0002 usage)

Yes please, having that implemented down in i18n is the historical result of one module having personal influence and directive over another ;-)
Comment 11 Caolán McNamara 2012-07-17 13:42:49 UTC
http://cgit.freedesktop.org/libreoffice/core/commit/?id=f8f05d43de8728db58c8224c8aebf31ff570b6fc

a) remove special handling of 0x0002 in our custom icu rules. Which brings us a step closer to getting rid of at least some of them in favour of the defaults

b) expand the 0x02 in SwTxtNode::BuildConversionMap like we do for fields

Good side effect is our word count and character count now take into account the actual footnote indicator text, as does our cursor travelling. Both of which are more word-alike.