117324 – Hungarian dictionary contains invalid UTF-8 sequences

Summary: Hungarian dictionary contains invalid UTF-8 sequences

Status:	RESOLVED INVALID

Alias:	None

Product:	LibreOffice
Classification:	Unclassified
Component:	Linguistic
Version: (earliest affected)	6.1.0.0.alpha1+
Hardware:	All All

Importance:	medium normal
Assignee:	Not Assigned

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2018-04-28 21:12 UTC by Pander
Modified:	2018-04-28 21:51 UTC
CC List:	1 user

See Also:
Crash report or crash signature:

Description Pander 2018-04-28 21:12:07 UTC

Description:
The Hungarian dictionary contains invalid UTF-8 sequences and cannot be used or converted. For exact details, see https://github.com/hunspell/hunspell/issues/559

Steps to Reproduce:
Open hu_HU_u8.aff in gedit

sudo apt install hunspell-hu
gedit /usr/share/hunspell/hu_HU.aff --encoding=UTF-8


Actual Results:  
gedit shows an error. If it instead tries to interpret the file as ISO-8859-15, force UTF-8 by opening the file with gedit's --encoding option.

Expected Results:
No error should be shown by the text editor. Valid UTF-8 is expected.
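
The same check can be made without a text editor by asking iconv to decode the file as UTF-8. This is only a sketch: the helper name is_valid_utf8 is made up for illustration, and it assumes an iconv that exits non-zero on undecodable input (GNU iconv does).

```shell
#!/bin/sh
# is_valid_utf8 FILE (hypothetical helper, not part of hunspell or magyarispell)
# Succeeds only if iconv can decode the whole file as UTF-8.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# Example call, using the path from the steps to reproduce:
# is_valid_utf8 /usr/share/hunspell/hu_HU.aff && echo "valid" || echo "invalid"
```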


Reproducible: Always


User Profile Reset: Yes



Additional Info:
Solution

Invalid UTF-8 appears only in comments and in flag vectors.
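
To see which lines are affected, each line can be fed to iconv separately and the failing line numbers printed. This is a rough sketch with a made-up function name (invalid_utf8_lines). It starts one iconv process per line, so it is slow on a large .aff file, and it assumes iconv exits non-zero on undecodable input.

```shell
#!/bin/sh
# invalid_utf8_lines FILE (hypothetical helper)
# Prints the number of every line that is not valid UTF-8.
invalid_utf8_lines() (
    # The subshell body keeps LC_ALL, n and line local to this call.
    # LC_ALL=C makes the shell treat the input as plain bytes.
    LC_ALL=C; export LC_ALL
    n=0
    while IFS= read -r line || [ -n "$line" ]; do
        n=$((n + 1))
        printf '%s\n' "$line" | iconv -f UTF-8 -t UTF-8 > /dev/null 2>&1 ||
            printf '%s\n' "$n"
    done < "$1"
)

# Example call:
# invalid_utf8_lines /usr/share/hunspell/hu_HU.aff | head
```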

Upstream is at https://sourceforge.net/projects/magyarispell/ ; open the source tarball.

The fix is in the file bin/u8myspell. The following script should fix it completely.

#!/bin/bash
set -x
export LANG=en_US
export LC_ALL=C

case $# in
0|1|2) echo "u8myspell - converts MySpell dictionaries to UTF-8
usage: u8myspell source_name output_name source_charset"; exit 1;;
esac

i=$1
o=$2
charset=$3
localdir="$(dirname "$0")"

iconv -f "$charset" -t UTF-8 "$i.dic" | sed -f "$localdir"/l1_u8.sed > "$o.dic"
iconv -f "$charset" -t UTF-8 "$i.aff" |
sed 's/^SET .*$/SET UTF-8\
FLAG UTF-8/' | sed -f "$localdir"/l1_u8.sed > "$o.aff"

In short, the Latin-2 (ISO-8859-2) source is converted to UTF-8, and a FLAG UTF-8 line is added to the .aff file.
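
The header rewrite can be tried in isolation. The sketch below wraps the sed expression from the script in a function; the name rewrite_header and the sample SET line in the usage comment are illustrative only, not taken from the real dictionary.

```shell
#!/bin/sh
# rewrite_header (hypothetical wrapper around the sed command in the script above)
# Replaces the SET line with "SET UTF-8" and adds "FLAG UTF-8" right after it.
# All other lines pass through unchanged.
rewrite_header() {
    sed 's/^SET .*$/SET UTF-8\
FLAG UTF-8/'
}

# Example with a made-up two-line affix header:
# printf 'SET ISO8859-2\nTRY abc\n' | rewrite_header
```

With FLAG UTF-8 in place, Hunspell reads the flag characters as UTF-8 too, which is why the script also converts the flag bytes and not just the words.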


User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:59.0) Gecko/20100101 Firefox/59.0

Comment 1 László Németh 2018-04-28 21:51:29 UTC

The hu_HU.dic and hu_HU.aff files are not UTF-8 encoded files.

They contain UTF-8 encoded dictionary items (words and morphemes) and the default 8-bit flags; see the hunspell(5) manual page for the dictionary format.

The suggested conversion doubles the memory footprint of the flag vectors, and decoding the UTF-8 encoded flags slows down dictionary loading by 70% (plain dictionary) or 50% (alias-compressed dictionary), resulting in noticeable differences in the user interface of LibreOffice.