Proposing an improvement to spellchecking: Legal and scholarly writing often includes phrases in which one word is usually wrong unless paired with a certain other word. The spell check should allow such phrases but not allow the usually-wrong word when it appears outside the phrase. Spaces could be hard or soft or represented by line breaks. This is consistent with linguistics, which recognizes that a word can have a space within it, because it is grammatically treated like a word that has no internal space. My hardware description is a guess.

Examples, some in U.S. law:

per se (but "se" alone would still be wrong)
voir dire (but "voir" alone would still be wrong)
stare decisis (but "decisis" alone would still be wrong)
inter alia (but "alia" alone would still be wrong)

Personal and two-word product names could be treated this way, too, especially personal names (such as some in German and Dutch cultures) that include parts with lower-case initials.

I think one approach to programming this would be to let spellchecking identify a wrong word, then retest that word by pairing it with the preceding word to see if the pair passes the two-word spell check, and then, if that fails, retest the wrong word by pairing it with the following word to see if that pair passes the two-word spell check. Execution would be faster if, when a two-word phrase is added to a dictionary, it's stored in a separate file; that can be done even though the two-word phrase would be added through the same user interface as single words.

Another approach would be to include the likely-wrong word in the standard dictionary but conditionally, by adding a flag or code to indicate that it is correct only if a certain preceding word is also present, and another flag to indicate the same thing for a subsequent word.
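The first approach (flag a word, then retest it paired with its neighbors) could be sketched roughly as below. This is only an illustration, not how any existing spell checker works: the word lists `SINGLE_WORDS` and `TWO_WORD_PHRASES` and the function `misspelled` are hypothetical names, and the tiny lists stand in for real dictionary files. Because the tokenizer only collects runs of letters, a hard space, soft space, or line break between the two words is treated the same way.

```python
import re

# Hypothetical single-word dictionary (illustration only).
SINGLE_WORDS = {"the", "contract", "is", "not", "void", "per", "dire",
                "stare", "inter", "was", "held", "alone"}

# Hypothetical two-word dictionary, kept separate from the single words.
TWO_WORD_PHRASES = {("per", "se"), ("voir", "dire"),
                    ("stare", "decisis"), ("inter", "alia")}


def misspelled(text):
    """Return (position, word) for each word that fails both checks."""
    # Collect runs of letters; any kind of space or line break separates words.
    words = [w.lower() for w in re.findall(r"[A-Za-z']+", text)]
    bad = []
    for i, word in enumerate(words):
        if word in SINGLE_WORDS:
            continue  # passes the ordinary one-word check
        # Retest: pair the failed word with the preceding word.
        with_previous = i > 0 and (words[i - 1], word) in TWO_WORD_PHRASES
        # Retest: pair the failed word with the following word.
        with_next = (i + 1 < len(words)
                     and (word, words[i + 1]) in TWO_WORD_PHRASES)
        if not (with_previous or with_next):
            bad.append((i, word))
    return bad
```

With these lists, "void per se" passes, while "se" on its own is still reported.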
Sounds reasonable -> NEW.
@Nick Levinson: thanks for the bug report! This problem was fixed under Bug 154499 at the UX level, so this bug report is now about extending the English dictionary with the proposed words (or about a mechanism to report the standalone phrase parts as a problem, e.g. in the English proofreader, similar to the Hungarian Lightproof module, which already checks for standalone phrase parts).

It seems the en_US dictionary doesn't support this very common legal language, so it's worth adding the phrases to it:

$ echo 'per se (but "se" alone would still be wrong)
voir dire (but "voir" alone would still be wrong)
stare decisis (but "decisis" alone would still be wrong)
inter alia (but "alia" alone would still be wrong)' | hunspell -d en_US
Hunspell 1.7.2
+ p
& se 2 4: SE, tie
*
& se 2 13: SE, tie
*
*
*
*
*

& voir 5 0: coir, avoir, vair, void, devoir
*
*
& voir 5 16: coir, avoir, vair, void, devoir
*
*
*
*
*

*
& decisis 4 6: decisive, decisions, decision, decides
*
& decisis 4 20: decisive, decisions, decision, decides
*
*
*
*
*

*
& alia 10 6: ala, alias, ilia, aria, alga, glia, alba, Elia, Ilia, pallia
*
& alia 10 17: ala, alias, ilia, aria, alga, glia, alba, Elia, Ilia, pallia
*
*
*
*
*
It appears that multi-word support currently works by treating each constituent one-word string as correctly spelled, even though on its own it is wrong or too rare for inclusion. Instead, such a string should be accepted only when adjacent to another word that is itemized for left-adjacency or right-adjacency (including for phrases that are 3 or more words long). This applies to legal and medical terminology, place names, business names, personal names, foreign phrases that have been accepted into English usage albeit italicized (recommending italicization to a user might be a separate feature request), and probably unlimited other categories.

So, now, "York", "Los", "Hampshire", "est", and "Francisco" are accepted, even though as standalone words in English they're probably very rare, so they should be marked as wrong by default unless the user wants to allow exceptions. Rarities are usually omitted from spell-check dictionaries because in a typical user's context the string is more likely to be a misspelling the user will want to correct.

Merriam-Webster's Third (approximate title) dictionary, unabridged, says in its front matter that if a word is formed in English as set solid, hyphenated, and spaced, it is entered into the dictionary in only one form. Usually, the senses, pronunciations, etymologies, etc. would be the same anyway, and that saves space, but it means that even that unabridged dictionary is not an authority for determining whether unlisted forms are uncommon in English. An introductory book on computers, I think on Linux, said that "file system" and "filesystem" do not have the same meaning. The only way that occurs to me to solve that problem in a spell-check would be a tooltip or similar display asking the user which meaning is intended.

Back to accepting "York", "Los", etc.: I disagree with that being the solution to recognizing "New York", "Los Angeles", "New Hampshire", "id est" (the expansion of "i.e."), and "San Francisco", respectively.
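As a rough picture of the alternative, here is a sketch in which "York", "Los", etc. are stored as conditional entries that name the neighbor they need, rather than as ordinary words. Everything here is an assumption for illustration: the tables `ALWAYS_OK` and `CONDITIONAL`, the keys `needs_preceding` and `needs_following`, and the function `flagged_words` are hypothetical and do not correspond to any existing dictionary format. It only looks at the immediately adjacent word, so phrases of 3 or more words would need chained entries or a different scheme.

```python
import re

# Hypothetical list of words accepted anywhere (illustration only).
ALWAYS_OK = {"she", "lives", "in", "new", "san", "angeles", "id", "a", "city"}

# Hypothetical conditional entries: the word is accepted only when one of
# the listed words stands immediately before it (needs_preceding) or
# immediately after it (needs_following).
CONDITIONAL = {
    "york": {"needs_preceding": {"new"}},
    "hampshire": {"needs_preceding": {"new"}},
    "francisco": {"needs_preceding": {"san"}},
    "est": {"needs_preceding": {"id"}},
    "los": {"needs_following": {"angeles"}},
}


def flagged_words(text):
    """Return the words that would be marked as wrong, in document order."""
    words = [w.lower() for w in re.findall(r"[A-Za-z']+", text)]
    flagged = []
    for i, word in enumerate(words):
        if word in ALWAYS_OK:
            continue
        entry = CONDITIONAL.get(word)
        if entry is not None:
            previous_word = words[i - 1] if i > 0 else None
            next_word = words[i + 1] if i + 1 < len(words) else None
            if (previous_word in entry.get("needs_preceding", set())
                    or next_word in entry.get("needs_following", set())):
                continue  # the required neighbor is present
        flagged.append(word)
    return flagged
```

Under these assumed tables, "New York" is accepted but a standalone "York" is flagged, which is the default behavior argued for above.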
But I also know that designing spell-check to recognize multi-word strings is harder. My guess is to do multiple passes, with a separate dictionary for each number of spaces in a string: first a pass through the whole document (or through recent edits) for the strings with the most spaces per string, then repeated passes for progressively fewer spaces, ending with a pass for spaceless strings. This also needs a way to route a string that the user accepts into the supplemental dictionary that matches the number of spaces within the string. It is possible to use one dictionary sorted first by number of spaces and then by today's sort order, but for user-editable dictionaries that would be confusing when a user is trying to find, edit, or add an entry. I'm not clear on how https://bugs.documentfoundation.org/show_bug.cgi?id=154499 relates to this, even indirectly, but I think it does.
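To make the multi-pass guess concrete, here is a minimal sketch, assuming one in-memory set per number of spaces in place of separate dictionary files. The class `PhraseDictionaries` and its methods `add` and `check` are hypothetical names for illustration. `add` routes an accepted string to the set for its number of spaces, and `check` runs one pass per set, starting with the strings that have the most spaces and ending with spaceless strings.

```python
import re
from collections import defaultdict


class PhraseDictionaries:
    """Hypothetical store: one set of entries per number of internal spaces."""

    def __init__(self):
        self.by_spaces = defaultdict(set)

    def add(self, entry):
        # Normalize case and any run of whitespace (including line breaks).
        parts = entry.lower().split()
        if not parts:
            return
        # Route the entry to the dictionary for its number of spaces.
        self.by_spaces[len(parts) - 1].add(" ".join(parts))

    def check(self, text):
        """Return the words not covered by any accepted string."""
        words = [w.lower() for w in re.findall(r"[A-Za-z']+", text)]
        accepted = [False] * len(words)
        # One pass per dictionary, most spaces first, spaceless strings last.
        # (In this simplified sketch the order does not change the result.)
        for n_spaces in sorted(self.by_spaces, reverse=True):
            size = n_spaces + 1
            entries = self.by_spaces[n_spaces]
            for start in range(len(words) - size + 1):
                if " ".join(words[start:start + size]) in entries:
                    for k in range(start, start + size):
                        accepted[k] = True
        return [w for w, ok in zip(words, accepted) if not ok]
```

For example, after adding "New York" plus the single words "she", "lives", and "in", the text "She lives in New York" checks clean, while "She lives in York" reports "york".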