Problem description: a GotoEndOfWord fails, when there is a footnote at the end of a word Steps to reproduce: - Type a word and a footnote directly at the end of the word - write a macro kind of like this: Dim sText As String xDoc=ThisComponent oText=xDoc.Text xSelection = xDoc.CurrentController.getSelection xRange=xSelection(0) xCursor=xRange.getText.createTextCursorByRange(xRange) xCursor.gotoStartOfWord(False) xCursor.GotoEndOfWord(True) sText=xCursor.GetString() msgbox "sText:" + sText - execute Macro with cursor inside the word with the attached footnote Current behavior: sText is empty Expected behavior: stext should contain text from beginning of word to the end of the word Platform (if different from the browser): Win7 64bit Browser: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20100101 Firefox/12.0
(In reply to comment #0) > msgbox "sText:" + sText This should of course be msgbox("sText:" + sText) otherwise nothing is ever shown.
Created attachment 61429 [details] tentative fix The problem is that footnote is represented as character \002 internally by Writer. getWordBoundary() includes it into the word, but isEndWord() does not, because of: sal_Int32 tmp = skipSpace(Text, nPos, len, rWordType, sal_False); and returns false at the next line. The attached patch fixes the problem, but I am not sure it is the best way to do that.
(In reply to comment #1) > (In reply to comment #0) > > msgbox "sText:" + sText > > This should of course be > > msgbox("sText:" + sText) > > otherwise nothing is ever shown. Not very important, but gets shown perfectly with the LO version 3.5.2 specified for this bug. I'd guess getWordBoundary() is the function used with doubleclick marking (since this includes the footnote). Not sure if doubleclick *should* do that, though. I'm not familiar with LibreOffice code, but I'd think those two functions (doubleclick marking and gotoStartOfWord/GotoEndOfWord) should arrive at the same boundaries.
Caolan, does this break iterator patch make sense to you? it doesn't look obviously wrong to me, but that doesn't mean much here...
Huh, I was on Cc of this? Missed it.. Including footnote with getWordBoundary() may be on purpose because you don't want to leave the footnote on its own when breaking, I'm not sure. Check also getCompatibilityScriptClassByBlock() that says // 0x02 - this can be inside a word umm.. yes. However, seeing with annotate that this few lines piece of code already did undergo its 3rd iteration with "fix word breakiterator issues" I'm a bit uncomfortable with such an invasive (in this case IMHO, maybe I'm overcautious) change without some tests that take also those other issues into account. As there were (check with git show -w because of extensive whitespace reformatting) 43c90c8ad4c33388bb0cfc9c966cd3e7d44c7b2a #i11993# #i14904# fix word breakiterator issues f0939f43315a21f5134cd631773ddae7cfef4493 #112021# fix word boundary problem on begining and end of the string #i21907# fix isBeginWord and isEndWord problem 2887ecb5554eee699e1dce4ffbc2dfcf71a54a41 #i11993# fix getWordBoundary problem when position is on the end of the word. Sorry to be a pest here, but breakiterator is known to be a very fragile area with corner cases in different locales or script types. AFAIK Writer has also some behavior specifically adapted to the breakiterator so it may be possible that when something is changed here something else may have to be adjusted in Writer as well where the Boundary returned is evaluated. Maybe treating the 0x02 specially in isEndWord() might be an option.
Caolan McNamara committed a patch related to this issue. It has been pushed to "master": http://cgit.freedesktop.org/libreoffice/core/commit/?id=52280c29880c2d6bc533692d3f0cdff37f35790d Related: fdo#49629 add test case for #i11993#
Caolan McNamara committed a patch related to this issue. It has been pushed to "master": http://cgit.freedesktop.org/libreoffice/core/commit/?id=2cf6778842d259d4e6e1a1f0e4c6c2ceab874eee Related: fdo#49629 add test case for #i21907#
Caolan McNamara committed a patch related to this issue. It has been pushed to "master": http://cgit.freedesktop.org/libreoffice/core/commit/?id=413554a3bd3e7236bd60e4e36f063351f3d31293 Related: fdo#49629 add test case for #i14904#
Well, I munged all the available bug reports into test cases anyway. Much of the trouble here is that 0x0002 is stuffed in as a code point which can appear in a "word" in the source/breakiterator/data/*word*.txt rules. But 0x0002 doesn't appear as a code point that is non-whitespace in the skipSpace rules so its inconsistent. Having 0x0002 as a code point than can appear in a word gives all sorts of odd results in writer for word counting, character counting word skipping etc. I'm inclined to remove the special handing of 0x0002 down in the breakiterator rules and go back up to writer (which I presume is the only thing with the rather mad 0x0002 usage) and handle it specifically there when necessary, e.g. expanding it out to its display text for the purposes of character and word counting
(In reply to comment #9) > I'm inclined to remove the special handing of 0x0002 down in the breakiterator > rules and go back up to writer (which I presume is the only thing with the > rather mad 0x0002 usage) Yes please, having that implemented down in i18n is the historical result of one module having personal influence and directive over another ;-)
http://cgit.freedesktop.org/libreoffice/core/commit/?id=f8f05d43de8728db58c8224c8aebf31ff570b6fc a) remove special handling of 0x0002 in our custom icu rules. Which brings us a step closer to getting rid of at least some of them in favour of the defaults b) expand the 0x02 in SwTxtNode::BuildConversionMap like we do for fields Good side effect is our word count and character count now take into account the actual footnote indicator text, as does our cursor travelling. Both of which are more word-alike.