Bug 42893 - EDITING: Improve 'Capitalize first letter of sentence'
Summary: EDITING: Improve 'Capitalize first letter of sentence'
Status: RESOLVED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer
Version (earliest affected): 3.4.4 release
Hardware: All, OS: All
Importance: medium enhancement
Assignee: Anurag Kanungo
URL:
Whiteboard: target:4.1.0
Keywords: difficultyBeginner, easyHack
Depends on:
Blocks:
 
Reported: 2011-11-13 18:56 UTC by ryan.jendoubi@gmail.com
Modified: 2015-12-16 05:11 UTC
CC List: 6 users

See Also:
Crash report or crash signature:


Attachments
helpful patch (1.22 KB, patch)
2013-02-20 15:43 UTC, Caolán McNamara
It Fixes the F.o.o. bar auto capitalization (911 bytes, patch)
2013-04-20 06:16 UTC, Anurag Kanungo

Description ryan.jendoubi@gmail.com 2011-11-13 18:56:24 UTC
In addition to the issue identified in https://bugs.freedesktop.org/show_bug.cgi?id=35515, there are other instances where the Capitalize first letter of every sentence option is more trouble than it's worth.

The 'start of sentence detection' should be improved to recognise the following for what they are, and therefore not perform any capitalization:

1. Common contractions, e.g. "esp." for "especially", "incl." for "including", "temp." for "temporary", "e.g.", "i.e.", etc.

2. Things which are clearly acronyms, e.g. "U.S.", "Y.M.C.A.", etc. In regex terms I'd imagine the pattern to be /([a-zA-Z]\.){2,}/, i.e., any two or more occurrences of a letter followed by a period.

You could make a judgment call about whether you wanted to limit it to capital letters. On the plus side you're more likely to be looking at something really intended as an acronym, but on the negative side I often use acronyms like "w.r.t." for "with regard to", and suchlike. This might matter more if you thought "e.g." and "i.e." are more accurately classed as acronyms than contractions; I'm not sure the conceptual distinction would make a difference here in practice.

3. Did you spot the 'intentional mistake' in number 1. above? :-) The case where a contraction or acronym falls at the end of a sentence is tricky. Some cursory research ([1],[2]) confirms that in these situations the correct thing to do is to have only one period, which 'does double duty', both indicating the shortening and ending the sentence.

Therefore, in these situations LO would probably miss the new sentence and not be able to capitalize. However, both ending sentences with acronyms and (hopefully) the occurrence of people forgetting to capitalize are pretty rare, so I'd vote to suffer this possible intermittent inconvenience in order to have the benefit which 1. and 2. above would bring.

As a pie-in-the-sky concept, I guess it'd be possible to do some heuristics using the grammar engine to determine if the writer probably intended to finish the sentence at a certain point, but that seems like a disproportionate amount of effort.
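
To make point 2 concrete, here is a minimal Python sketch of the acronym pattern, assuming the word immediately before the candidate sentence start is tested on its own. The function name `looks_like_acronym` is hypothetical and only for illustration; it is not LibreOffice code.

```python
import re

# Pattern from point 2: two or more occurrences of a letter followed by a period.
ACRONYM = re.compile(r"(?:[a-zA-Z]\.){2,}")


def looks_like_acronym(word: str) -> bool:
    """Return True if the whole word is a run of letter+period pairs."""
    return ACRONYM.fullmatch(word) is not None
```

With this check, "U.S.", "Y.M.C.A." and "w.r.t." would all suppress capitalization of the next word, while an ordinary sentence end such as "Foo." would not. Single-period contractions like "etc." are not matched and would still need the exception list from point 1.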

Localization issues
-------------------
/[a-zA-Z]/ is Unicode-unfriendly for a start. I can't remember if LO's regex engine supports Unicode-aware character entities like [[:alpha:]]: if it does, we can use them; if it doesn't, that's another bug report :p

In addition, it's likely that all the rules above would have to be language-contingent. The possible scope of this might be taking us outside the realms of an EasyHack, but it should be possible to lay the groundwork easily enough.
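
As an illustration of the Unicode point only, the same letter-plus-period check can be written without an ASCII character class. This is a Python sketch that relies on `str.isalpha()`, which is Unicode-aware; it says nothing about what LO's regex engine supports, and the function name `looks_like_acronym_unicode` is hypothetical. Language-specific rules are not addressed here.

```python
def looks_like_acronym_unicode(word: str) -> bool:
    """Return True if word is two or more (letter, period) pairs, for any script."""
    # Need an even length of at least 4: letter, period, letter, period, ...
    if len(word) < 4 or len(word) % 2 != 0:
        return False
    letters = word[0::2]  # characters at even positions should be letters
    periods = word[1::2]  # characters at odd positions should be periods
    return all(ch.isalpha() for ch in letters) and all(ch == "." for ch in periods)
```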


[1] http://ethnicity.rutgers.edu/~jlynch/Writing/p.html#periods
[2] http://english.stackexchange.com/search?q=[punctuation]+etc
Comment 1 sasha.libreoffice 2012-02-29 02:32:30 UTC
Thanks for the idea.
IMHO it is much simpler to type two spaces after an abbreviation, and to add an option "Do not capitalize after 2 spaces" that does not capitalize the word after a period and two spaces, and just deletes the extra space.
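
A rough Python sketch of how such an option might behave, assuming it is applied to a finished string rather than while typing, and that only a period followed by exactly two spaces marks an abbreviation. The function name `apply_two_space_rule` is hypothetical.

```python
import re


def apply_two_space_rule(text: str) -> str:
    """Capitalize after '. ' but treat '.  ' (two spaces) as an abbreviation."""
    # Period + one space + word character: treat as a sentence start and capitalize.
    # The lookahead skips periods that are followed by a second space.
    text = re.sub(r"\. (?! )(\w)", lambda m: ". " + m.group(1).upper(), text)
    # Period + two spaces: drop the extra space and leave the next word's case alone.
    return re.sub(r"\.  ", ". ", text)
```

For example, "See esp.  the intro. then stop" would become "See esp. the intro. Then stop".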
Comment 2 Rainer Bielefeld Retired 2012-04-10 03:03:32 UTC
EasyHack tags unification: tags are only allowed in the Whiteboard field, to make queries easier and more reliable.
Comment 3 Roman Eisele 2012-05-10 00:18:03 UTC
This is a feature/enhancement request, therefore changed 'Importance' field to 'enhancement'.
Comment 4 ryan.jendoubi@gmail.com 2012-05-10 02:00:23 UTC
Sorry, can't help disagreeing with you there Roman!

"Capitalize first letter after a period '.'" is not a feature.

The actual feature of the software we're talking about is "Capitalize first letter of *a sentence*".

Therefore, to the extent that LO fails to identify properly what is a new sentence and what isn't, this is a *bug* in the existing feature, not a new feature/enhancement request.

I feel it might be a little rude to undo your Importance change, but I'd like to hear your reasons why you feel that this issue does not highlight that capitalization-of-new-sentences is partly broken *existing* functionality :-)
Comment 5 Caolán McNamara 2012-10-25 15:05:36 UTC
We have tools->autocorrect options, and that comes prefilled with a bunch of exceptions. So for en_US i.e. and e.g. are already there. If you have some more words you think should be in there then http://cgit.freedesktop.org/libreoffice/core/tree/extras/source/autotext/lang/en-US/acor/SentenceExceptList.xml is where the list lives and it should be straightforward to add some more in there and submit that to us. That should cover the vast vast majority of cases and is straightforward to do right now.
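
The exception-list idea can be sketched roughly as follows. This is a Python illustration, not LibreOffice code: the set `SENTENCE_EXCEPTIONS` and the function `ends_sentence` are hypothetical, the set is a hard-coded stand-in for whatever SentenceExceptList.xml contains, case-insensitive matching is an assumption, and only periods are treated as sentence enders.

```python
# Hypothetical stand-in for the exception list. "e.g." and "i.e." are mentioned
# above as already present for en_US; the others are candidates from the report.
SENTENCE_EXCEPTIONS = {"e.g.", "i.e.", "esp.", "incl.", "temp."}


def ends_sentence(prev_word: str) -> bool:
    """Return True if prev_word ends with a period and is not a listed exception."""
    if not prev_word.endswith("."):
        return False
    # Assumption: exceptions are matched without regard to case.
    return prev_word.lower() not in SENTENCE_EXCEPTIONS
```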
Comment 6 Janit Anjaria 2013-02-04 09:48:44 UTC
Can someone provide me with a code pointer for this enhancement? I would like to take it up!
Comment 7 Caolán McNamara 2013-02-05 13:19:47 UTC
Code entry point is SvxAutoCorrect::AutoCorrect in editeng/source/misc/svxacorr.cxx

Autocorrect exceptions are stored in extras/source/autotext/lang/*/acor/SentenceExceptList.xml which is where known contractions for a given language go.

so... 
a) for part 1. "common contraction", add any missing common US English contractions to extras/source/autotext/lang/en-US/acor/SentenceExceptList.xml
b) for part 2. I reckon it's likely sufficient to not auto-capitalize the start of a new sentence if there is a previous block of non-white-space-separated text and that previous block has a . as its second last character

e.g. 

  Foo. bar
       ^ capitalize this to Bar
   ^___ because this is not a period

F.o.o. bar
       ^ do not capitalize this
   ^___ because this is a period

i.e. the best goal is not to make sure to autocapitalize the right things, but to make sure not to autocapitalize the wrong things
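
A minimal Python sketch of rule b), reading "second last character" as counted after dropping the trailing period, which is how the carets in the examples line up. The function name `should_capitalize_next` is hypothetical, and this is not the code in svxacorr.cxx.

```python
def should_capitalize_next(prev_block: str) -> bool:
    """Decide whether the word after prev_block should be auto-capitalized.

    prev_block is the run of non-whitespace text before the candidate word.
    """
    if not prev_block.endswith("."):
        return False  # no period, so no sentence end to react to
    body = prev_block[:-1]  # drop the period that would end the sentence
    # If the character before the last one is itself a period (as in "F.o.o"),
    # assume an abbreviation and leave the next word alone.
    if len(body) >= 2 and body[-2] == ".":
        return False
    return True
```

This errs on the side of not capitalizing: "Foo." still triggers capitalization, while "F.o.o.", "U.S." and "e.g." do not.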
Comment 8 Janit Anjaria 2013-02-06 18:38:24 UTC
What does "second to last" mean here? As far as I can see, in both cases there is a " " before "bar" and a "." before that.
Can someone please elaborate?
Comment 9 Caolán McNamara 2013-02-06 23:05:34 UTC
F.o.o. bar
       ^ do not capitalize this, because...
    ^____ initially consider o as the last character of the sentence, now examine the second last character of that block of non-whitespace characters
   ^___ this second last character is a period, reject "o" as a sentence and do not capitalize the following word "bar" to "Bar"
Comment 10 Caolán McNamara 2013-02-20 15:43:45 UTC
Created attachment 75179
helpful patch
Comment 11 Anurag Kanungo 2013-04-19 10:51:09 UTC
The bug still exists. Should I start working on it?
Comment 12 Anurag Kanungo 2013-04-20 06:16:08 UTC
Created attachment 78265
It Fixes the F.o.o. bar auto capitalization

With this patch, in the case of "F.o.o. bar" the "b" is not capitalized. Likewise, after any other abbreviation such as "U.S.", the next word won't be capitalized even if the abbreviation ends the sentence. This leaves the user free to decide whether it is the end of the sentence, because "U.S.." is not valid in English. So the basic aim, that the next word should not be capitalized unnecessarily, is fulfilled.

Please review.
Comment 13 Commit Notification 2013-04-26 16:03:27 UTC
anuragkanungo committed a patch related to this issue.
It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=3a390f36e8931e50009438f992ed0e4cdb02cca4

Resolves: fdo#42893 improve Capitalize first letter of Sentence



The patch should be included in the daily builds available at
http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
http://wiki.documentfoundation.org/Testing_Daily_Builds
Affected users are encouraged to test the fix and report feedback.
Comment 14 Caolán McNamara 2013-04-26 16:16:57 UTC
I'm going to consider this closed now as the specific scenario is fixed. There's always the possibility to improve autotext in other ways but then you get a bug issue that bloats out of control and turns into a kitchen sink issue. So if there are further suggestions outside of the specific addressed "f.o.o. remain lowercase" scenario then don't reopen this bug, file a new one.
Comment 15 Robinson Tryon (qubit) 2015-12-16 05:11:47 UTC
Migrating Whiteboard tags to Keywords: (EasyHack, DifficultyBeginner)
[NinjaEdit]