Bug 168506 - Automatically detect language when pasting text
Status: NEW
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Linguistic
Version (earliest affected): unspecified
Hardware: All All
Importance: medium enhancement
Assignee: Not Assigned
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: Language-Detection
Reported: 2025-09-22 13:34 UTC by Hossein
Modified: 2025-10-01 01:43 UTC

See Also:
Crash report or crash signature:


Attachments
List of 353 languages on Wikipedia written in their own language (5.39 KB, text/plain)
2025-09-22 13:34 UTC, Hossein

Description Hossein 2025-09-22 13:34:59 UTC
Created attachment 202926
List of 353 languages on Wikipedia written in their own language

Description:
If you paste plain text into LibreOffice (Writer, Calc, Impress), text in many languages is not displayed correctly until you set the language manually. Until then, it is rendered incorrectly even with a font that contains the required glyphs. For example, try this text (Burmese):

မြန်မာဘာသာ

Steps to Reproduce:
1. Copy and paste the above text into LibreOffice Writer (or another LibreOffice application)

Actual Results:
Bad rendering, even with a font that contains the glyphs. If you set the CTL language to Burmese, the text is displayed correctly.

Expected Results:
Text should be displayed correctly, even if the language is not set automatically.


Reproducible: Always


User Profile Reset: No


Additional Info:
Wikipedia has a list of supported languages, currently 353. You can see the list by going to the main page and clicking on "353 languages":

Wikipedia
https://en.wikipedia.org/wiki/Main_Page

The entries are basic samples of text in these languages. I have attached a text version of the list, which is displayed fine in a Unicode-aware text editor such as Gedit (Linux) or Notepad (Windows). For many languages, however, it is rendered incorrectly in LibreOffice.

Tested with:
Version: 25.8.1.1 (X86_64)
Build ID: 54047653041915e595ad4e45cccea684809c77b5
CPU threads: 12; OS: Linux 6.2; UI render: default; VCL: gtk3
Locale: en-US (en_US.UTF-8); UI: en-US
Calc: threaded

Please note that on my machine the locale is en-US, and the default "Complex text language" is set to Persian in "Tools > Options > Languages and Locales > Default Languages for Documents".
Comment 1 Hossein 2025-09-22 20:10:47 UTC
Please note that the problem seems to be platform-specific. On Windows, there seem to be almost no issues. On Linux and macOS, on the other hand, loading the attachment leads to many of the languages in the list not being displayed correctly.

Issues visible with: (described above; happens even with "Noto Sans" set as the font for all the text)
Version: 25.8.1.1 (X86_64)
Build ID: 54047653041915e595ad4e45cccea684809c77b5
CPU threads: 12; OS: Linux 6.2; UI render: default; VCL: gtk3
Locale: en-US (en_US.UTF-8); UI: en-US
Calc: threaded

Issues visible with: (happens even with "Noto Sans" set as the font for all the text)
Version: 25.8.1.1 (AARCH64)
Build ID: 54047653041915e595ad4e45cccea684809c77b5
CPU threads: 10; OS: macOS 26.0; UI render: Skia/Metal; VCL: osx
Locale: en-US (en_US.UTF-8); UI: en-US
Calc: threaded

No visible issues: (does not happen, even with "Noto Sans" set as the font for all the text)
Version: 25.8.1.1 (X86_64)
Build ID: 54047653041915e595ad4e45cccea684809c77b5
CPU threads: 20; OS: Windows 11 X86_64 (build 26100); UI render: Skia/Vulkan; VCL: win
Locale: en-US (en_US.UTF-8); UI: en-US
Calc: CL threaded
Comment 2 Jonathan Clark 2025-09-22 21:27:17 UTC
This example shows a glyph positioning issue with ligatures involving font fallback, so I am marking this bug new and retargeting.

---

For the broader issue from the original report:

Noto Sans doesn't have glyphs for all languages. When you use the Noto fonts, almost all CTL and CJK languages are handled by fallback based on system configuration.

Font substitution is sensitive to document language. This is actually a feature, not a bug, but it can produce unexpected results. For a good example of this mechanism working correctly:

> $ fc-match "Noto Sans":lang=ja-JP:charset=6c34
> NotoSansCJK-Regular.ttc: "Noto Sans CJK JP" "Regular"
> $ fc-match "Noto Sans":lang=zh-CN:charset=6c34
> NotoSansCJK-Regular.ttc: "Noto Sans CJK SC" "Regular"

Simply by setting "Noto Sans" as the font and changing the language setting, my distro-provided font config can intelligently switch between a Japanese font and a Chinese font for codepoints that are shared across multiple languages.

However, if I try the same thing with Thai, I get:

> $ fc-match "Noto Sans":lang=en:charset=e22
> FreeSerif.ttf: "FreeSerif" "Regular"
> $ fc-match "Noto Sans":lang=th:charset=e22
> NotoSansThai-Regular.ttf: "Noto Sans Thai" "Regular"

...which is definitely a bug, but unfortunately it's in my distribution. It's not a LibreOffice bug that we could easily fix.
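
The charset= values in the fc-match queries above are hexadecimal Unicode codepoints of a sample character (6c34 is 水, e22 is ย). As a minimal sketch, assuming you want to generate such patterns for your own experiments, the following Python helper builds the pattern string. The function name fc_pattern is hypothetical and is not part of fontconfig or LibreOffice; the result is only a string that you would pass to fc-match yourself.

```python
def fc_pattern(family, lang, sample_char):
    """Build a fontconfig pattern string of the form used with fc-match.

    family      -- requested font family, e.g. "Noto Sans"
    lang        -- language tag, e.g. "ja-JP" or "th"
    sample_char -- a single character whose codepoint goes into charset=
    """
    # fontconfig expects the codepoint in hexadecimal without a prefix.
    return f"{family}:lang={lang}:charset={ord(sample_char):x}"


# Codepoints are written as escapes so the example does not depend on
# the editor's font: U+6C34 is the CJK character, U+0E22 the Thai one.
print(fc_pattern("Noto Sans", "ja-JP", "\u6c34"))  # Noto Sans:lang=ja-JP:charset=6c34
print(fc_pattern("Noto Sans", "th", "\u0e22"))     # Noto Sans:lang=th:charset=e22
```

Running fc-match with the same family and character but different lang values is a quick way to see how sensitive substitution is to the language setting on a given machine.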


On Linux, you can reproduce the clean-config, copy-and-paste Burmese fallback scenario from the terminal with the following command:

> fc-match -s "Noto Sans Devanagari":lang=hi-IN:charset=102c | head -n 5 -

On my machine, I get the following:

> NotoSansMyanmar-Regular.ttf: "Noto Sans Myanmar" "Regular"
> NotoSansDevanagari-Regular.ttf: "Noto Sans Devanagari" "Regular"
> NotoSans-Regular.ttf: "Noto Sans" "Regular"
> NotoColorEmoji.ttf: "Noto Color Emoji" "Regular"
> NotoSansCJK-Regular.ttc: "Noto Sans CJK SC" "Regular"

My computer is substituting the correct font, at least, and the glyphs show up instead of being rendered as tofu. However, the glyphs overlap due to a real bug.
Comment 3 Hossein 2025-09-23 09:29:28 UTC Comment hidden (obsolete)
Comment 4 Hossein 2025-09-23 09:42:31 UTC
In the past I have filed a bug report for glyph positioning when fallback font is used:

tdf#152196 - Visible gaps in Arabic/Persian text with fallback font
https://bugs.documentfoundation.org/show_bug.cgi?id=152196

But I think that with the title changed to "Overlapping ligature glyphs...", the bug report has drifted away from the original goal I intended: correctly detecting the language, and using an appropriate font.

> ...which is definitely a bug, but unfortunately it's in my distribution.
> It's not a LibreOffice bug that we could easily fix.
I am fine with RESOLVED/NOTOURBUG, although I think LibreOffice can (and in fact should) provide workarounds, like detecting the correct language by other means.

While testing, I found that even setting the language to "None" instead of "Default" can fix the issue I see here. This means no spell checking will be used, but at least the text is displayed correctly. Also, with a wrong language detected, meaningful spell checking will not be doable anyway.

In the end, I think if Gedit can display the plain text correctly, LibreOffice should be able to do that on the same machine with the same configuration.

I suggest reverting the title as it was, and then I can file another bug report for "Overlapping ligature glyphs involving font fallback in certain languages".
Comment 5 Jonathan Clark 2025-09-23 13:30:30 UTC
(In reply to Hossein from comment #4)
> I suggest reverting the title as it was, and then I can file another bug
> report for "Overlapping ligature glyphs involving font fallback in certain
> languages".
I support keeping this bug on the topic of language detection, but I think there are a few separate issues included in this one. Some of them are NOTOURBUG, some of them warrant their own bug(s).

The first issue is an ER to assign languages automatically on copy and paste. This is quite difficult to do in general, but we could at least try. We already have other open bugs related to this, so this may be a duplicate.

The second issue is that our font fallback mechanism is not rendering text correctly in certain cases. (Overlapping ligatures.)

The third issue is that font fallback may be producing surprising results, such as incorrect font selection or major differences in document appearance when switching between different languages. This is an OS problem.

The fourth issue is an ER to automatically assign a font with language coverage on paste, instead of relying on fallback working correctly. We don't ship fonts for many of these languages, so we would need to somehow scan installed fonts and make a good choice. Seems tricky, but might be worth having on file.
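
To make the fourth issue concrete, here is a rough Python sketch of choosing a font by coverage instead of relying on fallback. It is only an illustration, not how LibreOffice works: the function pick_font and the COVERAGE table are hypothetical, and the coverage ranges are rough stand-ins. Real coverage data would have to come from the system font database (for example fontconfig on Linux).

```python
def pick_font(text, coverage):
    """Return the font that covers the most distinct characters of text.

    coverage maps a font name to the set of codepoints it has glyphs for.
    Returns None when no font covers any character of the text.
    """
    needed = {ord(ch) for ch in text if not ch.isspace()}
    best_name, best_hits = None, 0
    for name, codepoints in coverage.items():
        hits = len(needed & codepoints)
        if hits > best_hits:
            best_name, best_hits = name, hits
    return best_name


# Hypothetical coverage data for illustration only.
COVERAGE = {
    "Noto Sans": set(range(0x0020, 0x0250)),          # Latin ranges
    "Noto Sans Thai": set(range(0x0E00, 0x0E80)),     # Thai block
    "Noto Sans Myanmar": set(range(0x1000, 0x10A0)),  # Myanmar block
}

# The Burmese sample from the description, written as escapes.
burmese = "\u1019\u103c\u1014\u103a\u1019\u102c\u1018\u102c\u101e\u102c"
print(pick_font(burmese, COVERAGE))
```

A real implementation would also need tie-breaking rules (style, weight, user preference) and would have to deal with codepoints shared across languages, where coverage alone cannot pick the right regional variant.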

> While testing, I found that even setting the language to "None" instead of
> "Default" can fix the issue I see here. This means no spell checking will be
> used, but at least the text is displayed correctly. Also, with a wrong
> language detected, meaningful spell checking will not be doable anyway.

Setting the language to "None" may also make the situation worse for some users. I encourage you to play around with the fc-match command a bit, to see how sensitive this process is.

> In the end, I think if Gedit can display the plain text correctly,
> LibreOffice should be able to do that on the same machine with the same
> configuration.

Gedit is likely doing the same thing we do with language "None", except without all of the same bugs we have. It's also probably not displaying all of the languages correctly on your machine. The codepoint 语 for instance is common in the endonyms of East Asian languages, but the appearance changes depending on whether it is written in Simplified Chinese, Traditional Chinese, or Japanese.
Comment 6 Jonathan Clark 2025-09-29 15:36:41 UTC
I have filed bug 168613 to track the fallback rendering issue. We should always render text correctly under fallback, even if the text language is set incorrectly.

This bug now tracks automatically detecting and setting the text language on paste.
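
As a starting point for discussion, the following Python sketch shows one very simple heuristic for guessing a language from pasted text: count which Unicode block the characters fall into. This is an assumption-laden illustration, not a proposed implementation. The SCRIPT_RANGES table and guess_language function are hypothetical, the table covers only a few scripts with one dominant language, and even those blocks are shared (the Myanmar block is also used for Shan, Mon, and others). Shared scripts such as Arabic or Han cannot be resolved by codepoint alone.

```python
from collections import Counter

# Hypothetical, deliberately tiny table: (first, last, language tag).
SCRIPT_RANGES = [
    (0x0E00, 0x0E7F, "th"),  # Thai block
    (0x1000, 0x109F, "my"),  # Myanmar block, guessed as Burmese
    (0x1780, 0x17FF, "km"),  # Khmer block
]


def guess_language(text):
    """Return the tag whose block contains the most characters, or None."""
    votes = Counter()
    for ch in text:
        cp = ord(ch)
        for first, last, tag in SCRIPT_RANGES:
            if first <= cp <= last:
                votes[tag] += 1
                break
    if not votes:
        return None  # nothing recognized; leave the language unchanged
    return votes.most_common(1)[0][0]


# The Burmese sample from the description, written as escapes.
print(guess_language("\u1019\u103c\u1014\u103a\u1019\u102c\u1018\u102c\u101e\u102c"))  # my
```

Returning None for unrecognized text keeps the heuristic conservative: the pasted text would keep whatever language it already has instead of receiving a wrong guess.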