Description:
I created a ".doc" file with HTML content and a UTF-8 charset declaration. If the file contains 1 or 2 consecutive "é" characters, LibreOffice opens it correctly. But with 3 or 4 consecutive "é", the HTML parsing seems to fail and LibreOffice displays the raw HTML. With 5 consecutive "é", it is back to normal.

Steps to Reproduce:
1. With a text editor, create an HTML file with 3 consecutive "é" in the body
2. Save the document with a ".doc" extension
3. Open it with LibreOffice

Actual Results:
Displays the raw HTML

Expected Results:
Parses the HTML and converts it

Reproducible: Always

User Profile Reset: No

Additional Info:
Version: 7.3.7.2 / LibreOffice Community
Build ID: 30(Build:2)
CPU threads: 4; OS: Linux 5.19; UI render: default; VCL: gtk3
Locale: en-US (en_US.UTF-8); UI: en-US
Ubuntu package version: 1:7.3.7-0ubuntu0.22.04.3
Calc: threaded
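
For anyone who wants to script the reproduction steps, here is a minimal sketch, assuming Python 3. The helper names (`make_html`, `write_test_doc`), the file names, and the exact HTML skeleton are my own assumptions; the attached files may be structured differently. It writes one ".doc" file per count of consecutive "é" (1 to 5), encoded as UTF-8.

```python
import tempfile
from pathlib import Path

# U+00E9 (LATIN SMALL LETTER E WITH ACUTE); encodes to bytes C3 A9 in UTF-8.
E_ACUTE = "\u00e9"


def make_html(count: int) -> str:
    """Return a minimal HTML page whose body holds `count` consecutive 'é'.

    Hypothetical helper: the markup is an assumption, not a copy of the
    files attached to this report.
    """
    run = E_ACUTE * count
    return (
        "<html>\n"
        '<head><meta charset="UTF-8"></head>\n'
        f"<body><p>{run}</p></body>\n"
        "</html>\n"
    )


def write_test_doc(directory: Path, count: int) -> Path:
    """Write the page as UTF-8 bytes to a file with a .doc extension."""
    path = directory / f"e-acute-{count}.doc"
    path.write_bytes(make_html(count).encode("utf-8"))
    return path


if __name__ == "__main__":
    out_dir = Path(tempfile.mkdtemp(prefix="lo-html-doc-"))
    for count in range(1, 6):
        # Open each printed path with LibreOffice to compare the results.
        print(write_test_doc(out_dir, count))
```

Per the description above, the files for counts 3 and 4 are the ones expected to show raw HTML; I have not checked whether this particular HTML skeleton triggers the problem the same way the attachments do.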
Created attachment 195492
A doc file with html content that LibreOffice FAIL to open correctly
Created attachment 195493
A doc file with html content that LibreOffice can correctly open
This is a regression. In 4.0, the file opened correctly. In 4.1, it failed to open. In 4.2, it started importing the way it does now (as raw HTML). This is an import filter detection problem: selecting "HTML Document (Writer)" manually in the File Open dialog opens the file normally.
FTR: the file is detected as writer_WordPerfect_Document
This seems to have begun at the below commit in bibisect repository/OS bibisect-42max.

542c4ff1c90cd5e3920f5ab68f53e644c4018d44 is the first bad commit
commit 542c4ff1c90cd5e3920f5ab68f53e644c4018d44
Author: Matthew Francis <mjay.francis@gmail.com>
Date:   Sat Sep 5 20:10:12 2015 +0800

    source-hash-e69aa9572bb2206313cd2aa7edd13da91460f2c4

commit e69aa9572bb2206313cd2aa7edd13da91460f2c4
Author:     Kohei Yoshida <kohei.yoshida@gmail.com>
AuthorDate: Mon Aug 19 15:28:57 2013 -0400
Commit:     Kohei Yoshida <kohei.yoshida@gmail.com>
CommitDate: Mon Aug 19 15:41:02 2013 -0400

    fdo#67699: Remove a whole bunch of old hacks.

    The new format detection service is much simpler than the old one.
    In the new framework, each detection service receives the name of
    format that it is expected to check against, and it should either
    reject it by returning an empty string in case the file is not of
    that format, or if the file is indeed that format, set the appropriate
    filter name and return that type to the caller. We no longer need to
    deal with preselected filters (which is dealt with in the detection
    framework itself) or return an entirely different format that's different
    from the one being asked to verify.

    Change-Id: I3f36951b0ad821d836fb8a56b852e40d43095f09
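
To illustrate the contract described in that commit message, here is a toy model in Python. It is not LibreOffice's actual API: the function names, the `descriptor` dictionary, the filter names, the HTML type name, and both detection heuristics are assumptions made up for the sketch. Only the shape follows the commit message: a detector is handed a type name and either returns an empty string or sets a filter name and returns that type. The "overeager" detector accepts an artificial marker and does not model the real WordPerfect detection logic.

```python
from typing import Callable, Dict, List, Tuple

Descriptor = Dict[str, str]
# A detector receives (type name to verify, file bytes, descriptor) and
# returns the type name on a match or "" to reject.
Detector = Callable[[str, bytes, Descriptor], str]


def detect_html(type_name: str, data: bytes, descriptor: Descriptor) -> str:
    """Toy HTML check: look for an <html tag near the start of the file."""
    if b"<html" in data[:1024].lower():
        descriptor["FilterName"] = "hypothetical HTML filter"
        return type_name
    return ""


def detect_overeager(type_name: str, data: bytes, descriptor: Descriptor) -> str:
    """Artificial false positive: accepts any file containing a marker.

    This stands in for a detector that wrongly claims a file; it is NOT
    the real heuristic behind this bug.
    """
    if b"FALSE-POSITIVE" in data:
        descriptor["FilterName"] = "hypothetical WordPerfect filter"
        return type_name
    return ""


def detect_type(candidates: List[Tuple[str, Detector]], data: bytes) -> Tuple[str, str]:
    """Ask each detector in order; the first non-empty answer wins."""
    for type_name, detector in candidates:
        descriptor: Descriptor = {}
        if detector(type_name, data, descriptor):
            return type_name, descriptor["FilterName"]
    return "", ""


if __name__ == "__main__":
    sample = b"<html><body>FALSE-POSITIVE</body></html>"
    wp = ("writer_WordPerfect_Document", detect_overeager)
    html = ("hypothetical_HTML_type", detect_html)
    print(detect_type([wp, html], sample))  # overeager detector asked first
    print(detect_type([html, wp], sample))  # HTML detector asked first
```

In this toy model, a single detector that wrongly accepts a file decides the result whenever it is asked before the correct one, which is why the order of the checks matters.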
(In reply to raal from comment #5)
Aha, so the behavior changed because the WordPerfect filter started to check the file before the HTML filter. Interesting. I wonder whether there is a way to prioritize the filters relative to each other?
https://sourceforge.net/p/libwpd/code/ci/bc3050ad804b4f9f60ccfaf425f469fc095033dc/
I will be on vacation for some time starting this weekend, so please don't ask me for a release. If dtardon is still active, he can do it without me. Otherwise, vmiklos may be able to.
(In reply to Fridrich Strba from comment #8)
Thank you, Fridrich! I'm sure that a problem that has waited since v4.1 can wait a month or two more. Have a nice vacation!