Bug 148413 - Non-ASCII image names are encoded wrong when saved as HTML
Status: RESOLVED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer
Version: unspecified (earliest affected)
Hardware: All All
Importance: medium normal
Assignee: Mike Kaganski
URL: https://forumooo.ru/index.php/topic,9106
Whiteboard: target:7.4.0
Keywords:
Duplicates: 37615
Depends on:
Blocks:
 
Reported: 2022-04-06 06:35 UTC by Mike Kaganski
Modified: 2022-05-16 08:48 UTC
CC List: 3 users

See Also:
Crash report or crash signature:
Regression By:


Attachments
Screenshot of the related options page. (33.52 KB, image/png)
2022-04-06 07:55 UTC, Mike Kaganski
Description Mike Kaganski 2022-04-06 06:35:25 UTC
1. Insert an image into a new text document.
2. In the image properties, set image name to anything non-ASCII, e.g. to a single Cyrillic character ш.
3. Save the document as "HTML Document (Writer)".
4. Re-open the document, save, and close.

With each subsequent save, the length of the image's 'name' attribute doubles. The attribute fills with garbage that is not a valid UTF-8 encoding of the original Cyrillic character, so the name itself is corrupted. The image name soon dominates the file, and the HTML grows to tens of megabytes, then gigabytes.

The problem is likely specific to systems where system encoding is *not* UTF-8 (mainly Windows).
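
The growth described above can be illustrated with a minimal Python sketch. It assumes (as a simplification, not something taken from the LibreOffice code) that each save writes the name as UTF-8 bytes, while each reload decodes those bytes with a legacy system code page, here windows-1251 as on a ru-RU Windows system. The helper names `reload_after_save` and `simulate_saves` are hypothetical.

```python
def reload_after_save(name: str, system_encoding: str = "cp1251") -> str:
    """Model one save/reload cycle under the assumed encoding mismatch."""
    # Assumption: the attribute value is always written as UTF-8.
    written = name.encode("utf-8")
    # Assumption: the file is read back using the system code page.
    # Bytes that are undefined in that code page become U+FFFD instead of raising.
    return written.decode(system_encoding, errors="replace")


def simulate_saves(name: str, cycles: int) -> list[str]:
    """Return the name after 0, 1, ..., `cycles` save/reload cycles."""
    history = [name]
    for _ in range(cycles):
        history.append(reload_after_save(history[-1]))
    return history


if __name__ == "__main__":
    for cycle, value in enumerate(simulate_saves("ш", 3)):
        print(cycle, len(value), repr(value))
```

In this model the name never shrinks back: every non-ASCII character encodes to two or more UTF-8 bytes, and each byte is read back as a separate character. The exact growth factor per cycle depends on the characters involved, so the sketch shows the mechanism rather than the precise sizes seen in Writer.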

Tested with Version: 7.3.2.2 (x64) / LibreOffice Community
Build ID: 49f2b1bff42cfccbd8f788c8dc32c1c309559be0
CPU threads: 12; OS: Windows 10.0 Build 19044; UI render: default; VCL: win
Locale: ru-RU (ru_RU); UI: en-US
Calc:
Comment 1 Mike Kaganski 2022-04-06 07:41:51 UTC
Relevant:

1. SwHTMLWriter::WriteStream [1], which initializes m_eDestEnc using SvxHtmlOptions::GetTextEncoding [2];
2. officecfg.Office.Common/Filter/HTML/Export/Encoding configuration value, which is void by default, and was not editable until commit 7c5ca44c48b05ba73defd48057a82db7dc833e0c [3];
3. Thread encoding, used in SvtSysLocale::GetBestMimeEncoding [4], which is ~always non-UTF-8 on Windows;
4. HtmlWriter::attribute [5], using fixed UTF-8 encoding, and so unsynchronized with #1;
5. Other places that write strings (like HtmlWriterHelper::applyColor [6], which is used in Writer's HTML filter), which also use UTF-8 unconditionally.

It seems that the best option would be to use UTF-8 unconditionally in #1, and just drop (ignore) #2 and #3.
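
The mismatch between #1 and #4 can be modelled roughly as follows. This is a hedged Python sketch, not the actual C++ code path: `export_html` and `import_html` are hypothetical stand-ins, and the assumption is that body text and the declared charset follow the destination encoding while the attribute value is always written as UTF-8.

```python
def export_html(image_name: str, body_text: str, dest_encoding: str) -> bytes:
    """Hypothetical model of an export that mixes two encodings in one stream."""
    # Assumption: the charset declaration and body text use the destination encoding.
    head = f'<meta charset="{dest_encoding}">'.encode(dest_encoding)
    body = f"<p>{body_text}</p>".encode(dest_encoding)
    # Assumption: the attribute writer uses UTF-8 regardless of dest_encoding.
    img = b'<img name="' + image_name.encode("utf-8") + b'">'
    return head + img + body


def import_html(data: bytes, declared_encoding: str) -> str:
    """Decode the whole stream with the declared encoding, as a reader would."""
    return data.decode(declared_encoding, errors="replace")


if __name__ == "__main__":
    # Legacy destination encoding: body text survives, the attribute does not.
    print(import_html(export_html("ш", "текст", "windows-1251"), "windows-1251"))
    # UTF-8 destination encoding: both writers agree, so both survive.
    print(import_html(export_html("ш", "текст", "utf-8"), "utf-8"))
```

Under these assumptions, making the destination encoding UTF-8 unconditionally removes the disagreement without touching the attribute writer, which matches the option proposed above.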

[1] https://opengrok.libreoffice.org/xref/core/sw/source/filter/html/wrthtml.cxx?r=5de24375#342
[2] https://opengrok.libreoffice.org/xref/core/svtools/source/config/htmlcfg.cxx?r=96e3a641#93
[3] https://git.libreoffice.org/core/+/7c5ca44c48b05ba73defd48057a82db7dc833e0c
[4] https://opengrok.libreoffice.org/xref/core/unotools/source/misc/syslocale.cxx?r=55adeb1c#177
[5] https://opengrok.libreoffice.org/xref/core/svtools/source/svhtml/HtmlWriter.cxx?r=ad1557f5#153
[6] https://opengrok.libreoffice.org/xref/core/svtools/source/svhtml/htmlout.cxx?r=96e3a641#981
Comment 2 Mike Kaganski 2022-04-06 07:55:21 UTC
Created attachment 179340
Screenshot of the related options page.

In comment 1, #2 (officecfg.Office.Common/Filter/HTML/Export/Encoding configuration) is defined in Options->Load/Save->HTML Compatibility. The workaround is to set UTF-8 there manually.
Comment 3 Michael Stahl (allotropia) 2022-04-06 08:26:32 UTC
yeah let's just use UTF-8 to export, i can't imagine why anybody would want anything else these days
Comment 4 Commit Notification 2022-04-06 11:51:59 UTC
Mike Kaganski committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/e4f53484d255f844169957c411dc3e872af7d3bb

tdf#148413: Drop HTML export encoding configuration; use UTF-8

It will be available in 7.4.0.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 5 Andreas Heinisch 2022-05-12 15:24:22 UTC
*** Bug 37615 has been marked as a duplicate of this bug. ***
Comment 6 Commit Notification 2022-05-16 08:48:27 UTC
Adolfo Jayme Barrientos committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/help/commit/b62a15c05d585dddfa0206a478867feff1294df8

tdf#148413 Help: Remove mention of dropped Character Set setting