69416 – XProofreader: footnote no special character anymore

Bug 69416 - XProofreader: footnote no special character anymore

Summary: XProofreader: footnote no special character anymore

Status:	RESOLVED FIXED

Alias:	None

Product:	LibreOffice
Classification:	Unclassified
Component:	Writer (show other bugs)
Version: (earliest affected)	3.6.5.2 release
Hardware:	Other All

Importance:	medium normal
Assignee:	Michael Stahl

URL:
Whiteboard:	target:4.3.0 target:6.1.0
Keywords:	regression

Depends on:
Blocks:

Reported:	2013-09-16 13:06 UTC by Daniel Naber
Modified:	2018-03-14 14:27 UTC (History)
CC List:	7 users (show)

See Also:	49629 89794
Crash report or crash signature:

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Daniel Naber 2013-09-16 13:06:15 UTC

We implement the XProofreader interface in our LanguageTool LibreOffice add-on to get the text to be checked. It used to return a special character (\u0002) for footnote references, now it returns a simple character like "1" for the first reference. The problem is that this messes up e.g. our sentence boundary detection, because we "1" is now somewhere in the text, indistinguishable from real text.

Note that fixing issue #36540 would be broader fix that would also cover this proble.

version 3.5.7.2 -> we got the special character
version 3.6.5.2 -> we get the "1"

Our original bug report is at http://sourceforge.net/p/languagetool/bugs/191/

Comment 1 Daniel Naber 2013-09-16 13:07:51 UTC

I wanted to say that the problem occurs starting from about 3.6.5.2, not only in that version.

Comment 2 Kumāra 2013-09-17 04:43:50 UTC

(In reply to comment #1)
> I wanted to say that the problem occurs starting from about 3.6.5.2, not
> only in that version.

The issue remains at 4.1.1.2.

Essentially, it used to be that LanguageTool can tell the end of a sentence correctly all the time. After the change though, when there's a footnote after a sentence, LT can only see the last word, the full-stop and the footnote number as ONE word, e.g. important.12

This messes up the ability of LT to count the number of words in a sentence correctly, because the counting would run to the next sentence.

I've changed the version to the first version this issue is noted (3.6.5.2). The change in LibO should have happened sometime between 3.5.7.2 and 3.6.5.2.

Comment 3 Robinson Tryon (qubit) 2013-10-29 04:34:48 UTC

Move 'regression' tag from Whiteboard -> Keyword (as this appears to be a confirmed regression).

Comment 4 Michael Stahl 2013-12-05 13:26:44 UTC

i think generally we don't want Writer's weird abuse of
\u000x code points to leak out into client code...

this commit looks related:

commit b0f170d7df9cff12535d2ecfd146b32b745a8ef8
Author:     Caolán McNamara <caolanm@redhat.com>
AuthorDate: Wed Jul 18 17:00:52 2012 +0100

    0xFFF9 is a better choice for CH_TXTATR_INWORD than 0x0002.
    
    a) the default properties for the code point make it not split a word it
    appears in into two different words in any break mode we have. Which is what we
    want from a CH_TXTATR_INWORD
    
    b) unicode TR#20 gives for the interlinear annotation anchor: "What to do if
    detected: In a proxy context or browser context, remove U+FFF9", so when we
    need to strip it from text to run that text through e.g. the spellchecker or
    word counting then there's a solid precedent for stripping it
    
    In addition I *do* want the footnote placeholder to break the word it appears
    in, that gives the desired wordcount and cursor travelling behaviour
    
    The BREAKWORD and other *random* selection of CH_TXTATR are still odd choices,
    and there's way too many of them.


.... but apparently that is not in the libreoffice-3-6 branch
so there must be another change here...

commit 1c6f5a16d83a87d7f2233e5c7e60ab5f39546225
Author:     Michael Stahl <mstahl@redhat.com>
AuthorDate: Tue Feb 12 19:43:16 2013 +0100

    fdo#60668: filter out fieldmark chars from Index entries:
    
    SwTxtNode::GetExpandTxt must filter out all dummy characters used to
    represent fields, footnotes, field marks, etc.

... hmm but that doesn't do anything with CH_TXTATR_INWORD.

Comment 5 Michael Stahl 2013-12-05 14:57:25 UTC

ok now with a breakpoint in ModelToViewHelper::ModelToViewHelper()
/ SwDoc::Spell() it's obvious that footnotes get a
CH_TXTATR_BREAKWORD and no longer CH_TXTATR_INWORD ...
is that intentional?

ok ModelToViewHelper::ModelToViewHelper() is what expands
fields & footnotes too in this case (why do we have N different
expand-node-text functions?).

ah so this commit is probably what changed this:

commit f8f05d43de8728db58c8224c8aebf31ff570b6fc
Author:     Caolán McNamara <caolanm@redhat.com>
AuthorDate: Tue Jul 17 14:03:28 2012 +0100

    Resolves: fdo49629 GotoEndOfWord fails with footnote at word end
    
    a) remove special handling of 0x0002 in our custom icu rules.
       Which brings us a step closer to getting rid of at least
       some of them in favour of the defaults
    b) expand the 0x02 in SwTxtNode::BuildConversionMap like we
       do for fields so
    
    Good side effect is our word count and character count now take into account
    the actual footnote indicator text, as does our cursor travelling. Both of
    which are more word-alike.

Comment 6 Marcin Miłkowski 2014-01-09 11:18:49 UTC

Let me explain why the proofreading tool needs to know that it gets a footnote/endnote number: there are strict rules regarding the position of the footnote number in different academic publishing style guides and national standards for typography. People make mistakes in putting these numbers and it could be useful to flag these mistakes as they are difficult to spot. So actually, a special character was really useful for us as additional information. Now it's gone, and no text attributes are available for the proofreader interface, which basically means that certain features cannot be implemented.

Now com.sun.star.linguistic2.ProofreadingResult contains property values that could be used to pass formatting information to us, including footnote/endnote information.

Comment 7 Michael Stahl 2014-01-31 14:17:45 UTC

forgot about this for a while, and now comment #6 says that
XProofreader clients actually want to know where the footnotes are,
which looks like a new requirement?

it appears to me that there is no good use-case for providing
the expanded footnote numbers or field contents to grammar checkers,
that is all auto-generated anyway so should be filtered out.

so what can we do here:

a) replace footnotes with ZWSP
   + word/sentence breaking works
   + Writer internals stay internal
   - grammar checker cannot tell that there is a footnote there
b) filter out \u0001 without replacement
   + better than silly numbers in the text
   + Writer internals stay internal
   - grammar checker cannot tell that there is a footnote there
   - grammar checker cannot break sentence/words properly
c) let \u0001 remain in text
   + word/sentence breaking works (but requires special handling of 0x01)
   - leaks Writer internals
   - \u0001 is not guaranteed to be a footnote, could be frame or field...
d) basically a) + add some out-of-band thing like "FootnotePositions" property
   + Writer internals stay internal
   + word/sentence breaking works
   + grammar checker can tell that there is a footnote there
   + grammar checkers that don't care about footnotes can just ignore it
   - requires more code in grammar checkers
e) re-map footnotes to some funky custom code point (Unicode has
   some "private use" area?)
   - not sure what this would mean for ICU break iterators,
     would it require custom ones that treat the new char specially like
     they did \u0002 ?
   - requires special handling in grammar checkers

guess d) meets requirements the best; i think it requires passing the
new property as part of XProofreader::doProofreading parameter "aProperties".

something like a sequence<long> to indicate the UTF-16 code unit positions.

would require getting that property to the GrammarCheckingIterator,
which only has a XFlatParagraph; that would require some property for it too,
hope it's possible to add that. 

the grammar checkers don't accesss XFlatParagraph directly, right?

does everybody agree this is the way to go?

Comment 8 Michael Stahl 2014-01-31 14:20:19 UTC

oops, Marcin not on CC:, please read comment #7 :)

Comment 9 Marcin Miłkowski 2014-02-05 09:57:38 UTC

Well, I don't understand why you're saying that there is no use case for internal codes, as there are numerous things that a grammar checker could check. Note that grammar checkers are not really *grammar* checkers but also typography checkers, style checkers, in general, they are smart proofreading tools. So one could, for example, use a grammar checker to see whether References in a scientific paper are formatted according to a given style guide, but for that, access to general formatting attributes would be required. 

Let me give a real-life example. In Polish, official spelling rules require that names of journals, weeklies, newspapers etc. are written in quotation marks, and many people make a mistake of writing them in italics. Now, we can check whether there are quotation marks but we cannot see whether the title is still in italics. And it's a very mechanical thing we could easily check with LanguageTool if we knew what part is in italics.

Also, we really miss a way to access the info whether a given piece of information is part of the bullet point (we try to work around this but the rule that capitalizes the start of the sentence may give false alarms).

This is why I think d) is the way to go, as we'll need more info to get really nice proofreading.

Comment 10 Commit Notification 2014-04-08 10:16:36 UTC

Michael Stahl committed a patch related to this issue.
It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=d477ff4a81ecba8a77ead5ff1a33d3e3ceed622e

fdo#69416: make footnote positions available to XProofreader



The patch should be included in the daily builds available at
http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
http://wiki.documentfoundation.org/Testing_Daily_Builds
Affected users are encouraged to test the fix and report feedback.

Comment 11 Michael Stahl 2014-04-08 10:38:17 UTC

since there were no objections, implemented that now,
you get "FootnotePositions" and "FieldPositions"
(since it was trivial to add that too) properties in
the doProofreading() call.

please test this on master builds.
http://dev-builds.libreoffice.org/daily/master/

Comment 12 Daniel Naber 2014-09-19 10:48:46 UTC

Sorry for the late reply. I can confirm this works, it's now part of the latest LanguageTool snapshot (https://languagetool.org/download/snapshots/?C=M;O=D).

Comment 13 Commit Notification 2018-03-14 14:27:10 UTC

László Németh committed a patch related to this issue.
It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=aab440c768b64b9d2ffa72223b6d843a84c2d061

tdf#69416 don't remove field content in proofreading,

It will be available in 6.1.0.

The patch should be included in the daily builds available at
http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
http://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.