Bug 128860 - Incorrect autocorrection of apostrophes in German text
Status: RESOLVED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: LibreOffice
Version (earliest affected): 6.3.3.2 release
Hardware: All All
Importance: medium enhancement
Assignee: László Németh
URL:
Whiteboard: target:7.1.0 target:7.0.0.1
Keywords: needsDevAdvice
Duplicates: 132985
Depends on:
Blocks: AutoCorrect-Complete
Reported: 2019-11-17 16:31 UTC by Rob Schroeder
Modified: 2020-06-09 09:41 UTC

See Also:
Crash report or crash signature:


Attachments
Image of correct and incorrect typographic apostrophe (5.10 KB, image/png)
2019-11-17 16:32 UTC, Rob Schroeder

Description Rob Schroeder 2019-11-17 16:31:37 UTC
Description:
In German text, words or phrases containing apostrophes with no adjacent spaces are incorrectly autocorrected.

Common examples are "ist's" (short for "ist es"), "Andrea's" (similar to English use), "D'dorf" (short for "Düsseldorf"). 

The correct typographic symbol would be the English *right* single quotation mark (U+2019).

When single quotes autocorrection is disabled, the typewriter-style 'apostrophe' (U+0027) is inserted (the ASCII character to which the apostrophe key is traditionally mapped).

When single quotes autocorrection is enabled, the English *left* single quotation mark (U+2018) is inserted, or whatever the user has configured for the single 'end quote'.

Steps to Reproduce:
1. Open LibreOffice Writer, create a German-language document
2. Make sure 'autocorrect while typing' is enabled
3. Type words or phrases containing apostrophes like ist's, Andrea's, D'dorf


Actual Results:
If single quotes autocorrection is disabled, the character inserted into the text is the typewriter-style, non-typographic apostrophe symbol (U+0027).

If single quotes autocorrection is enabled, the character inserted into the text is the Unicode character 'LEFT SINGLE QUOTATION MARK' (U+2018), or, if there is a user-defined replacement, the character the user has defined as the 'end quote' for single quotes autocorrection.

Only three specific phrases ("geht's", "gibt's" and "wird's") end up with the apostrophe (U+0027) replaced by the correct symbol (U+2019), and even these only if single quotes autocorrection is disabled. This happens because these three replacements are explicitly included in the DocumentList.xml file within the acor_de.dat archive.
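
To check which explicit replacements an autocorrect archive carries, something like the following Python sketch could be used. It is only a sketch under assumptions: that the archive is a ZIP file containing DocumentList.xml, and that each entry is an element with `abbreviated-name` and `name` attributes. The namespace URI in the stand-in data is likewise an assumption, and `build_sample_archive` and `list_replacements` are hypothetical helper names. The stand-in data contains only the three phrases named above.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

APOSTROPHE = "\u0027"    # typewriter-style apostrophe
TYPOGRAPHIC = "\u2019"   # RIGHT SINGLE QUOTATION MARK


def build_sample_archive() -> bytes:
    """Build an in-memory stand-in for acor_de.dat (assumed layout, not the real file)."""
    blocks = "".join(
        f'<bl:block bl:abbreviated-name="{word}{APOSTROPHE}s" '
        f'bl:name="{word}{TYPOGRAPHIC}s"/>'
        for word in ("geht", "gibt", "wird")
    )
    xml = (
        '<bl:block-list xmlns:bl="http://openoffice.org/2001/block-list">'
        f"{blocks}</bl:block-list>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("DocumentList.xml", xml.encode("utf-8"))
    return buffer.getvalue()


def list_replacements(archive_bytes: bytes) -> dict[str, str]:
    """Return {typed text: replacement} read from DocumentList.xml in the archive."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        root = ET.fromstring(archive.read("DocumentList.xml"))
    replacements = {}
    for element in root.iter():
        # Compare attributes by local name so the exact namespace URI does not matter.
        attrs = {key.rsplit("}", 1)[-1]: value for key, value in element.attrib.items()}
        if "abbreviated-name" in attrs and "name" in attrs:
            replacements[attrs["abbreviated-name"]] = attrs["name"]
    return replacements


if __name__ == "__main__":
    for typed, replaced in list_replacements(build_sample_archive()).items():
        print(f"{typed} -> {replaced}")
```

To inspect a real file, the bytes of acor_de.dat would be passed to `list_replacements` instead of the stand-in archive.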



Expected Results:
The apostrophe inserted is the Unicode character 'RIGHT SINGLE QUOTATION MARK' (U+2019).


Reproducible: Always


User Profile Reset: Yes


OpenGL enabled: Yes

Additional Info:
Three remarks:

1. Even when used as an apostrophe, it's surely sensible to not replace the U+0027 apostrophe character with a typographic character if such an autocorrection would not be wanted, but the enabled/disabled state of 'single quotes' autocorrection is not the proper place, semantically, for deciding what is to be done with a genuine apostrophe. For a user, changing something within 'single quotes' autocorrection settings should not have an effect on how genuine apostrophes are handled. The only real solution to this problem might be to include a third, separate settings option for 'apostrophes' beside 'single quotes' and 'double quotes'. 

2. Sometimes an apostrophe can be used at a word's end (for marking the genitive of nouns ending in s, ß, z, x: "Delacroix' Gemälde", the painting by Delacroix). If we get a correct typographic character there, it's only because by pure chance the currently used 'end quote' for 'single quotes' is the same symbol. Depending on the overall typography someone wants to use, this does not need to be the case. 

3. To the best of my knowledge, even the implementation for English text (I tried English-UK) solves the problem only partly. Autocorrection of the U+0027 character used as an apostrophe within a word is dependent on whether "single quotes" autocorrection is enabled or not, too. Only if this is enabled, and only if the user has not defined a custom "end quote" for "single quotes", autocorrection correctly inserts U+2019 as the typographic apostrophe symbol.

To get the same behaviour for German text would already improve things, even if it wouldn't solve the underlying problem, which is the behaviour of apostrophes being dependent on the settings for single quotes.

----------

Version: 6.3.3.2
Build-ID: 1:6.3.3-0ubuntu0.18.04.1~lo1
CPU-Threads: 4; BS: Linux 5.3; UI-Render: Standard; VCL: gtk3; 
Gebietsschema: de-DE (de_DE.UTF-8); UI-Sprache: de-DE
Calc: threaded
Comment 1 Rob Schroeder 2019-11-17 16:32:56 UTC
Created attachment 155897 [details]
Image of correct and incorrect typographic apostrophe
Comment 2 Regina Henschel 2019-11-17 17:47:51 UTC
Do you have a suggestion how to distinguish the situation, where an apostrophe (’ U+2019) has to be written, from the situation, where a single ending quotation mark (‘ U+2018) is needed? Especially, if you have started a single opening quotation and now want e.g "Hans’ Auto". I know no way to do it.

If you need the signs often, you should learn the code points and use the "Toggle Unicode" feature of LibreOffice, or define a macro and shortcut for it, or adjust the keyboard layout to get them directly.
Comment 3 Rob Schroeder 2019-11-17 18:12:33 UTC
I'm aware of the difficulty in distinguishing apostrophes at the end of a word from (single-quote) end quotes, which is why I included the case only as a remark, while the subject of this bug report is apostrophes *inside* a word or phrase, without an adjacent space or other delimiter character. An apostrophe character inside a word can and should, by default, always be interpreted as an apostrophe, not an end quote. Which is what the English implementation is already doing correctly - if only as long as default single quote autocorrection is used. 

(And once we had the notion of an 'apostrophe' vs. a 'quote', it would offer some options for at least recognizing some apostrophes at the end of words. A criterion, for example, could be the absence of any (single-quote) start-quote character before the apostrophe character in the document or paragraph.)
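
The criterion suggested in the parenthesis above could be sketched in Python roughly as follows. This is an illustration only, not LibreOffice code: `choose_mark_after_letter` is a hypothetical name, and the German defaults for the single start and end quotes (U+201A and U+2018) are assumed.

```python
START_QUOTE = "\u201a"   # assumed German single start quote
END_QUOTE = "\u2018"     # German single end quote (LEFT SINGLE QUOTATION MARK)
APOSTROPHE = "\u2019"    # typographic apostrophe (RIGHT SINGLE QUOTATION MARK)


def choose_mark_after_letter(paragraph_before: str) -> str:
    """Pick the character for a U+0027 typed at the end of a word.

    If no single start quote occurs earlier in the paragraph, the typed
    character cannot close a quotation, so it is treated as an apostrophe.
    """
    if START_QUOTE in paragraph_before:
        return END_QUOTE
    return APOSTROPHE


if __name__ == "__main__":
    print(repr(choose_mark_after_letter("Delacroix")))
    print(repr(choose_mark_after_letter("Er sagte \u201aHans")))
```

The second call shows the remaining ambiguity: once a start quote is present, this simple test cannot tell a genitive apostrophe from an end quote.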
Comment 4 V Stuart Foote 2019-11-17 18:31:14 UTC
Seems to me current handling of typographic 'single quotes' is correct to function--use of opening and closing quotation. An immediate <Ctrl>+z or <Esc> will revert any unwanted autocorrect while typing.

And, as noted, entries in the autocorrect Replacement strings data file by locale (of the Paragraph being edited) will take precedence over the option corrections of single and double quotes while typing or modifying content.

But, seems it could be a reasonable enhancement to the edit engine where it should be possible to set additional logic testing for entry of second "'" (0x0027). Maybe as simple as: "more than two words and it gets handled as a quotation, one or two words and it is an apostrophe"--and receiving a locale preferred typographic glyph for apostrophe. Though probably just the QuotationEnd of the localedata/data/[locale]
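
One possible reading of that word-count rule, as a rough Python sketch. The interpretation is an assumption: the words counted are those between the first "'" and the second one being typed. `classify_second_mark` is a hypothetical name.

```python
def classify_second_mark(text_since_first_mark: str) -> str:
    """Classify the second U+0027 by how many words follow the first one.

    More than two words: treat the pair as a quotation.
    One or two words: treat the mark as an apostrophe.
    """
    word_count = len(text_since_first_mark.split())
    return "quotation" if word_count > 2 else "apostrophe"


if __name__ == "__main__":
    print(classify_second_mark("Hans"))
    print(classify_second_mark("ein etwas längeres Zitat"))
```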

=-notes-=

Related issue for the fr-CH users in bug 116062, where in i18npool/source/localedata/data/fr_CH.xml we reverted the correct Swiss national "‹" (0x2039) & "›" (0x203a) to "‘" (0x2018) & "’" (0x2019) for QuotationStart & QuotationEnd to restore use of the apostrophe.

While for bug 1115382 László tweaked apostrophe usage for the -HU locales via the Autocorrect logic.

Likewise we have some issues with RTL scripts, eg. bug 114575, where we reverse the order of Starting and Ending. Or as in bug 114184 needing support in Hebrew for its Geresh and Mercha.
Comment 5 Rob Schroeder 2019-11-17 18:56:54 UTC
> Seems to me current handling of typographic 'single quotes' is correct to
> function--use of opening and closing quotation. An immediate <Ctrl>+z or
> <Esc> will revert any unwanted autocorrect while typing
This report is not about 'single quotes' handling being incorrect, it is about 'apostrophes' not being properly handled. Apostrophes are routinely recognised as a single quote (end quote), which is incorrect, and reverting the autocorrection still does the wrong thing - it reverts back to a typewriter-style U+0027 character, where a typographic apostrophe symbol would be correct.
Comment 6 V Stuart Foote 2019-11-17 19:23:19 UTC
(In reply to Rob Schroeder from comment #5)
> > Seems to me current handling of typographic 'single quotes' is correct to
> > function--use of opening and closing quotation. An immediate <Ctrl>+z or
> > <Esc> will revert any unwanted autocorrect while typing
> This report is not about 'single quotes' handling being incorrect, it is
> about 'apostrophes' not being properly handled. Apostrophes are routinely
> recognised as a single quote (end quote), which is incorrect, and reverting
> the autocorrection still does the wrong thing - it reverts back to a
> typewriter-style U+0027 character, where a typographic apostrophe symbol
> would be correct.

And exactly which Unicode glyph would be the "typographic apostrophe symbol" that would be "correct"--and for which locale?

Only the 0x0027 is an APOSTROPHE and has its own Unicode glyph, everything else requires a Unicode glyph be substituted--a single quote (opening or closing), or other locale appropriate glyph--depending on locale.

We handle it as quotes (single or double), or with Autocorrect disabled as an apostrophe (and use of 0x0027 as drawn from the font in use for the paragraph).

To be precise, what is required is additional editengine, or autocorrect, logic to handle keyboard input of the Apostrophe keysym (0x0027) with more options than as a single quotation (opening or closing). Likewise for the Quotation Mark keysym (0x0022).

Can't do it now, but it is Not a Bug, so => Enhancement requiring dev effort.
Comment 7 Rob Schroeder 2019-11-17 20:02:08 UTC
> And exactly which Unicode glyph would be the "typographic apostrophe symbol" 
> that would be "correct"--and for which locale?

This has already been answered by the Unicode consortium, independent of locale. Originally, U+02BC 'MODIFIER LETTER APOSTROPHE' was considered as the preferred character for a punctuation apostrophe, but since Unicode v.3.0.0 the U+2019 'RIGHT SINGLE QUOTATION MARK' is considered as the preferred character - see https://en.wikipedia.org/wiki/Modifier_letter_apostrophe.
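
For reference, the characters discussed in this thread can be looked up with Python's standard `unicodedata` module. This small sketch only prints each character's Unicode name and general category. It does not by itself show which character is preferred. `describe` is a hypothetical helper name.

```python
import unicodedata


def describe(char: str) -> str:
    """Return the code point, Unicode name and general category of a character."""
    return f"U+{ord(char):04X} {unicodedata.name(char)} ({unicodedata.category(char)})"


if __name__ == "__main__":
    # Typewriter apostrophe, both single quotation marks, and the modifier letter.
    for char in ("\u0027", "\u2018", "\u2019", "\u02bc"):
        print(describe(char))
```

The category shows one practical difference: U+02BC is classified as a letter (Lm), while U+2019 is punctuation (Pf).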

> Can't do it now

You already do, in en_?? locales. 

I repeat, the above is what LibreOffice already does in en_?? locales when an apostrophe is typed inside a word, i.e. with no delimiter character following it. It only does so if the user didn't specify and enable custom single quote characters. This limitation is actually a bug, too, though I guess developers wanted to keep it like that as long as there is no option to specify and enable a custom 'apostrophe' character, which would then be used in place of U+2019.

Now my report is about de_DE, where basically the same rules apply except for some differences in the default characters for 'single quotes', and here LibreOffice doesn't do it - it uses U+2018 'LEFT SINGLE QUOTATION MARK' when an apostrophe is typed inside a word, which is always wrong. Which is why this is a bug.
Comment 8 Regina Henschel 2019-11-17 20:45:24 UTC
German typographical apostrophe is U+2019, looks like an upper comma.

There are some ideas now:
A. Detect whether the straight apostrophe U+0027 to be replaced is outside a single quotation and does not follow a starting single quotation mark. I think that might work when entering new text at the end of a document, but it will be difficult when editing existing text.

B. If the U+0027 sign to be replaced is inside a word, replace it with a typographical apostrophe. That would not catch all cases that need a typographical apostrophe, but it would be better than the current situation.

C. Provide a shortcut for entering a typographical apostrophe, independent of keyboard layout and locale. The same goes for other characters needed for other languages; perhaps allow the user to customize it. I know that Linux has means to customize the keyboard layout, but those do not exist on Windows.

I hope Eike can tell which of these is a practical idea.
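
Idea B could look roughly like this in Python. It is a minimal sketch, not the LibreOffice implementation: it works on finished text rather than on keystrokes, and `fix_in_word_apostrophes` is a hypothetical name.

```python
import re

# A U+0027 with a letter directly before and after it.
# [^\W\d_] matches word characters that are neither digits nor underscores, i.e. letters.
IN_WORD_APOSTROPHE = re.compile(r"(?<=[^\W\d_])'(?=[^\W\d_])")


def fix_in_word_apostrophes(text: str) -> str:
    """Replace straight apostrophes inside words with U+2019."""
    return IN_WORD_APOSTROPHE.sub("\u2019", text)


if __name__ == "__main__":
    for sample in ("ist's", "Andrea's", "D'dorf", "Delacroix' Gemälde"):
        print(fix_in_word_apostrophes(sample))
```

As expected for idea B, the word-final apostrophe in "Delacroix' Gemälde" is left untouched, because no letter follows it.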
Comment 9 Rob Schroeder 2019-11-17 21:02:05 UTC
> There are some ideas now

Thanks, sounds like a good start.

Perhaps it might help to find out what the existing code already does for en_??, but doesn't do for de_DE? (I'm doing a git clone right now, but I fear it will take substantially longer for me even to find the relevant code than it will take for the pros here to fix it - the more so since I'm not fluent in C++...)
Comment 10 Xisco Faulí 2020-02-17 11:23:32 UTC
@Eike, do you have any opinion on this issue ?
Comment 11 László Németh 2020-05-11 08:39:48 UTC
Likely the best option is to replace the default single quote replacement with default curly apostrophe usage.
Comment 12 Rob Schroeder 2020-05-12 15:39:02 UTC
[László Németh:]
> Likely the best option is to replace the default single quote replacement with
> default curly apostrophe usage.

Just changing glyphs within the existing logic won't solve this, it will just lead to different errors in typography.

As I understand the logic behind what goes on in the code, there needs to be code to decide whether the apostrophe someone types on their keyboard is to be interpreted as 'single end quote' or as 'apostrophe'.

The code already exists. The implementation of autocorrection for English locales has such code. It is not perfect, but works for the vast majority of cases. 

It's just that the German locale is missing that piece of code (or possibly even just the call to that piece of code), which is why when someone types an apostrophe following a letter character, it's always the glyph for 'closing single quote' that will be inserted into the text, never the (curly) apostrophe. 

It would not really help if it was the other way round, though.
Comment 13 Ming Hua 2020-05-12 19:53:57 UTC
*** Bug 132985 has been marked as a duplicate of this bug. ***
Comment 14 Commit Notification 2020-05-31 21:15:31 UTC
László Németh committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/a0c90f1bccd9b5a349d3199746facab549f27dba

tdf#128860 AutoCorrect: fix apostrophe in Czech, German,

It will be available in 7.1.0.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 15 Commit Notification 2020-06-02 13:44:03 UTC
László Németh committed a patch related to this issue.
It has been pushed to "libreoffice-7-0":

https://git.libreoffice.org/core/commit/c3ef223ba5f893f8096d205ef09b5f5262ab6baa

tdf#128860 AutoCorrect: fix apostrophe in Czech, German,

It will be available in 7.0.0.1.
Comment 16 László Németh 2020-06-02 17:09:29 UTC
(In reply to Rob Schroeder from comment #12)

The recent solution – checking preceding quote marks – fixes most of the problems. A possible/remaining improvement could be to handle apostrophe usage within second level quotations, based on the direct continuation of the word after the apostrophe, i.e. fixing

„… ‚word‘s

as

„… ‚word’s...
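
A rough Python sketch of that remaining improvement, for illustration only. It is not the committed patch: it post-processes text instead of reacting to typing, and `repair_misread_apostrophes` is a hypothetical name. An end quote (U+2018) that sits between two letters is re-read as an apostrophe, because the word continues directly after it.

```python
import re

END_QUOTE = "\u2018"     # German single end quote
APOSTROPHE = "\u2019"    # typographic apostrophe

# An end quote with a letter directly before and directly after it.
MISREAD_APOSTROPHE = re.compile(rf"(?<=[^\W\d_]){END_QUOTE}(?=[^\W\d_])")


def repair_misread_apostrophes(text: str) -> str:
    """Turn an end quote between two letters into an apostrophe."""
    return MISREAD_APOSTROPHE.sub(APOSTROPHE, text)


if __name__ == "__main__":
    print(repair_misread_apostrophes("\u201e\u2026 \u201aword\u2018s"))
```

An end quote followed by a space or punctuation is left alone, so a regular second-level quotation still closes as before.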

Thanks for your bug report and help!
Comment 17 Rob Schroeder 2020-06-09 08:09:08 UTC
Works in libreoffice-7-0_2020-06-08_06.00.18_LibreOfficeDev_7.0.0.0.beta1 (.deb, Linux Mint). Excellent, thanks!
Comment 18 László Németh 2020-06-09 09:41:46 UTC
(In reply to Rob Schroeder from comment #17)
> Works in libreoffice-7-0_2020-06-08_06.00.18_LibreOfficeDev_7.0.0.0.beta1
> (.deb, Linux Mint). Excellent, thanks!

Thanks for verification!