We implement the XProofreader interface in our LanguageTool LibreOffice add-on to get the text to be checked. It used to return a special character (\u0002) for footnote references, now it returns a simple character like "1" for the first reference. The problem is that this messes up e.g. our sentence boundary detection, because we "1" is now somewhere in the text, indistinguishable from real text. Note that fixing issue #36540 would be broader fix that would also cover this proble. version 3.5.7.2 -> we got the special character version 3.6.5.2 -> we get the "1" Our original bug report is at http://sourceforge.net/p/languagetool/bugs/191/
I wanted to say that the problem occurs starting from about 3.6.5.2, not only in that version.
(In reply to comment #1) > I wanted to say that the problem occurs starting from about 3.6.5.2, not > only in that version. The issue remains at 4.1.1.2. Essentially, it used to be that LanguageTool can tell the end of a sentence correctly all the time. After the change though, when there's a footnote after a sentence, LT can only see the last word, the full-stop and the footnote number as ONE word, e.g. important.12 This messes up the ability of LT to count the number of words in a sentence correctly, because the counting would run to the next sentence. I've changed the version to the first version this issue is noted (3.6.5.2). The change in LibO should have happened sometime between 3.5.7.2 and 3.6.5.2.
Move 'regression' tag from Whiteboard -> Keyword (as this appears to be a confirmed regression).
i think generally we don't want Writer's weird abuse of \u000x code points to leak out into client code... this commit looks related: commit b0f170d7df9cff12535d2ecfd146b32b745a8ef8 Author: Caolán McNamara <caolanm@redhat.com> AuthorDate: Wed Jul 18 17:00:52 2012 +0100 0xFFF9 is a better choice for CH_TXTATR_INWORD than 0x0002. a) the default properties for the code point make it not split a word it appears in into two different words in any break mode we have. Which is what we want from a CH_TXTATR_INWORD b) unicode TR#20 gives for the interlinear annotation anchor: "What to do if detected: In a proxy context or browser context, remove U+FFF9", so when we need to strip it from text to run that text through e.g. the spellchecker or word counting then there's a solid precedent for stripping it In addition I *do* want the footnote placeholder to break the word it appears in, that gives the desired wordcount and cursor travelling behaviour The BREAKWORD and other *random* selection of CH_TXTATR are still odd choices, and there's way too many of them. .... but apparently that is not in the libreoffice-3-6 branch so there must be another change here... commit 1c6f5a16d83a87d7f2233e5c7e60ab5f39546225 Author: Michael Stahl <mstahl@redhat.com> AuthorDate: Tue Feb 12 19:43:16 2013 +0100 fdo#60668: filter out fieldmark chars from Index entries: SwTxtNode::GetExpandTxt must filter out all dummy characters used to represent fields, footnotes, field marks, etc. ... hmm but that doesn't do anything with CH_TXTATR_INWORD.
ok now with a breakpoint in ModelToViewHelper::ModelToViewHelper() / SwDoc::Spell() it's obvious that footnotes get a CH_TXTATR_BREAKWORD and no longer CH_TXTATR_INWORD ... is that intentional? ok ModelToViewHelper::ModelToViewHelper() is what expands fields & footnotes too in this case (why do we have N different expand-node-text functions?). ah so this commit is probably what changed this: commit f8f05d43de8728db58c8224c8aebf31ff570b6fc Author: Caolán McNamara <caolanm@redhat.com> AuthorDate: Tue Jul 17 14:03:28 2012 +0100 Resolves: fdo49629 GotoEndOfWord fails with footnote at word end a) remove special handling of 0x0002 in our custom icu rules. Which brings us a step closer to getting rid of at least some of them in favour of the defaults b) expand the 0x02 in SwTxtNode::BuildConversionMap like we do for fields so Good side effect is our word count and character count now take into account the actual footnote indicator text, as does our cursor travelling. Both of which are more word-alike.
Let me explain why the proofreading tool needs to know that it gets a footnote/endnote number: there are strict rules regarding the position of the footnote number in different academic publishing style guides and national standards for typography. People make mistakes in putting these numbers and it could be useful to flag these mistakes as they are difficult to spot. So actually, a special character was really useful for us as additional information. Now it's gone, and no text attributes are available for the proofreader interface, which basically means that certain features cannot be implemented. Now com.sun.star.linguistic2.ProofreadingResult contains property values that could be used to pass formatting information to us, including footnote/endnote information.
forgot about this for a while, and now comment #6 says that XProofreader clients actually want to know where the footnotes are, which looks like a new requirement? it appears to me that there is no good use-case for providing the expanded footnote numbers or field contents to grammar checkers, that is all auto-generated anyway so should be filtered out. so what can we do here: a) replace footnotes with ZWSP + word/sentence breaking works + Writer internals stay internal - grammar checker cannot tell that there is a footnote there b) filter out \u0001 without replacement + better than silly numbers in the text + Writer internals stay internal - grammar checker cannot tell that there is a footnote there - grammar checker cannot break sentence/words properly c) let \u0001 remain in text + word/sentence breaking works (but requires special handling of 0x01) - leaks Writer internals - \u0001 is not guaranteed to be a footnote, could be frame or field... d) basically a) + add some out-of-band thing like "FootnotePositions" property + Writer internals stay internal + word/sentence breaking works + grammar checker can tell that there is a footnote there + grammar checkers that don't care about footnotes can just ignore it - requires more code in grammar checkers e) re-map footnotes to some funky custom code point (Unicode has some "private use" area?) - not sure what this would mean for ICU break iterators, would it require custom ones that treat the new char specially like they did \u0002 ? - requires special handling in grammar checkers guess d) meets requirements the best; i think it requires passing the new property as part of XProofreader::doProofreading parameter "aProperties". something like a sequence<long> to indicate the UTF-16 code unit positions. would require getting that property to the GrammarCheckingIterator, which only has a XFlatParagraph; that would require some property for it too, hope it's possible to add that. the grammar checkers don't accesss XFlatParagraph directly, right? does everybody agree this is the way to go?
oops, Marcin not on CC:, please read comment #7 :)
Well, I don't understand why you're saying that there is no use case for internal codes, as there are numerous things that a grammar checker could check. Note that grammar checkers are not really *grammar* checkers but also typography checkers, style checkers, in general, they are smart proofreading tools. So one could, for example, use a grammar checker to see whether References in a scientific paper are formatted according to a given style guide, but for that, access to general formatting attributes would be required. Let me give a real-life example. In Polish, official spelling rules require that names of journals, weeklies, newspapers etc. are written in quotation marks, and many people make a mistake of writing them in italics. Now, we can check whether there are quotation marks but we cannot see whether the title is still in italics. And it's a very mechanical thing we could easily check with LanguageTool if we knew what part is in italics. Also, we really miss a way to access the info whether a given piece of information is part of the bullet point (we try to work around this but the rule that capitalizes the start of the sentence may give false alarms). This is why I think d) is the way to go, as we'll need more info to get really nice proofreading.
Michael Stahl committed a patch related to this issue. It has been pushed to "master": http://cgit.freedesktop.org/libreoffice/core/commit/?id=d477ff4a81ecba8a77ead5ff1a33d3e3ceed622e fdo#69416: make footnote positions available to XProofreader The patch should be included in the daily builds available at http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More information about daily builds can be found at: http://wiki.documentfoundation.org/Testing_Daily_Builds Affected users are encouraged to test the fix and report feedback.
since there were no objections, implemented that now, you get "FootnotePositions" and "FieldPositions" (since it was trivial to add that too) properties in the doProofreading() call. please test this on master builds. http://dev-builds.libreoffice.org/daily/master/
Sorry for the late reply. I can confirm this works, it's now part of the latest LanguageTool snapshot (https://languagetool.org/download/snapshots/?C=M;O=D).
László Németh committed a patch related to this issue. It has been pushed to "master": http://cgit.freedesktop.org/libreoffice/core/commit/?id=aab440c768b64b9d2ffa72223b6d843a84c2d061 tdf#69416 don't remove field content in proofreading, It will be available in 6.1.0. The patch should be included in the daily builds available at http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More information about daily builds can be found at: http://wiki.documentfoundation.org/Testing_Daily_Builds Affected users are encouraged to test the fix and report feedback.