1. Insert an image into a new text document.
2. In the image properties, set the image name to anything non-ASCII, e.g. to a single Cyrillic character ш.
3. Save the document as "HTML Document (Writer)".
4. Re-open the document, save, and close.

At each new save, the length of the image's 'name' attribute will double, consisting of garbage that is not a proper UTF-8 encoding of the original Cyrillic character, and the name itself becomes broken. Very soon, the length of the image name starts dominating, and the size of the HTML will become tens of megabytes, then gigabytes.

The problem is likely specific to systems where the system encoding is *not* UTF-8 (mainly Windows).

Tested with
Version: 7.3.2.2 (x64) / LibreOffice Community
Build ID: 49f2b1bff42cfccbd8f788c8dc32c1c309559be0
CPU threads: 12; OS: Windows 10.0 Build 19044; UI render: default; VCL: win
Locale: ru-RU (ru_RU); UI: en-US
Calc:
Relevant:

1. SwHTMLWriter::WriteStream [1], which initializes m_eDestEnc using SvxHtmlOptions::GetTextEncoding [2];
2. officecfg.Office.Common/Filter/HTML/Export/Encoding configuration value, which is void by default, and is not editable until commit 7c5ca44c48b05ba73defd48057a82db7dc833e0c [3];
3. Thread encoding, used in SvtSysLocale::GetBestMimeEncoding [4], which is almost always non-UTF-8 on Windows;
4. HtmlWriter::attribute [5], which uses a fixed UTF-8 encoding, and so is unsynchronized with #1;
5. Other places that write strings (like HtmlWriterHelper::applyColor [6], which is used in Writer's HTML filter), which also use UTF-8 unconditionally.

It seems that the best option would be to use UTF-8 unconditionally in #1, and just drop (ignore) #2 and #3.

[1] https://opengrok.libreoffice.org/xref/core/sw/source/filter/html/wrthtml.cxx?r=5de24375#342
[2] https://opengrok.libreoffice.org/xref/core/svtools/source/config/htmlcfg.cxx?r=96e3a641#93
[3] https://git.libreoffice.org/core/+/7c5ca44c48b05ba73defd48057a82db7dc833e0c
[4] https://opengrok.libreoffice.org/xref/core/unotools/source/misc/syslocale.cxx?r=55adeb1c#177
[5] https://opengrok.libreoffice.org/xref/core/svtools/source/svhtml/HtmlWriter.cxx?r=ad1557f5#153
[6] https://opengrok.libreoffice.org/xref/core/svtools/source/svhtml/htmlout.cxx?r=96e3a641#981
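The growth is consistent with the mismatch between #1 and #4: the attribute bytes are written as UTF-8, but on re-open they are presumably decoded using the non-UTF-8 charset that the file declares. Below is a minimal Python sketch of that round trip, not LibreOffice code. It assumes the declared charset is windows-1251 (a guess based on the ru-RU locale of the test system), and `resave` and `simulate` are hypothetical helper names introduced only for this illustration.

```python
def resave(name: str, declared: str = "cp1251") -> str:
    """Model one save/re-open cycle of the attribute value.

    The value is written as UTF-8 bytes (as #4 does), then read back
    using the charset the document declares (assumed: windows-1251).
    Undefined bytes are replaced instead of raising an error.
    """
    return name.encode("utf-8").decode(declared, errors="replace")


def simulate(name: str, rounds: int, declared: str = "cp1251") -> list[str]:
    """Return the attribute value before and after each of `rounds` cycles."""
    history = [name]
    for _ in range(rounds):
        history.append(resave(history[-1], declared))
    return history


if __name__ == "__main__":
    # Mismatched encodings: the name grows on every cycle.
    for cycle, value in enumerate(simulate("ш", 6)):
        print(cycle, len(value), repr(value[:20]))

    # Matching encodings (UTF-8 on both sides): the name is stable.
    print(simulate("ш", 6, declared="utf-8")[-1])
```

Every character produced by the wrong decoding is non-ASCII, so it takes at least two bytes in UTF-8 on the next save. In this model the name therefore at least doubles per cycle, while writing and reading with the same encoding leaves it unchanged.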
Created attachment 179340
Screenshot of the related options page.

In comment 1, #2 (the officecfg.Office.Common/Filter/HTML/Export/Encoding configuration) is defined in Options->Load/Save->HTML Compatibility. The workaround is to set UTF-8 there manually.
yeah let's just use UTF-8 to export, i can't imagine why anybody would want anything else these days
Mike Kaganski committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/e4f53484d255f844169957c411dc3e872af7d3bb

tdf#148413: Drop HTML export encoding configuration; use UTF-8

It will be available in 7.4.0.

The patch should be included in the daily builds available at https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More information about daily builds can be found at: https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
*** Bug 37615 has been marked as a duplicate of this bug. ***
Adolfo Jayme Barrientos committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/help/commit/b62a15c05d585dddfa0206a478867feff1294df8

tdf#148413 Help: Remove mention of dropped Character Set setting