162186 – FILEOPEN DOC html file with 3 or 4 consecutive "é" character

Status:	NEW

Alias:	None

Product:	Document Liberation Project
Classification:	Unclassified
Component:	General
Version: (earliest affected)	unspecified
Hardware:	All All

Importance:	medium normal
Assignee:	Not Assigned

URL:	https://sourceforge.net/p/libwpd/tick...
Whiteboard:	libwpd
Keywords:	bibisected, bisected, regression

Depends on:
Blocks:	HTML-Import

Reported:	2024-07-25 08:27 UTC by remi.thevenoux
Modified:	2025-10-06 05:17 UTC
CC List:	2 users

See Also:	158793
Crash report or crash signature:

Attachments
A doc file with html content that LibreOffice FAIL to open correctly (92 bytes, application/msword) 2024-07-25 08:29 UTC, remi.thevenoux
A doc file with html content that LibreOffice can correctly open (88 bytes, application/msword) 2024-07-25 08:30 UTC, remi.thevenoux

Description remi.thevenoux 2024-07-25 08:27:58 UTC

Description:
I created a ".doc" file with HTML content and a UTF-8 charset declaration.
If the file contains 1 or 2 consecutive "é" characters, LibreOffice opens it correctly.
But if there are 3 or 4 consecutive "é" characters, HTML parsing seems to fail and LibreOffice displays the raw HTML.
With 5 consecutive "é" characters, it opens correctly again.

Steps to Reproduce:
1. With a text editor, create an HTML file with 3 consecutive "é" characters in the body
2. Save the document with ".doc" extension
3. Open with LibreOffice
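
The steps above can also be scripted. This is a minimal sketch, assuming a bare HTML skeleton with a UTF-8 meta tag; the markup of the attached files may differ, and the function name `write_test_doc` and the output file names are hypothetical.

```python
from pathlib import Path
import tempfile


def write_test_doc(directory: Path, count: int) -> Path:
    """Write an HTML file containing `count` consecutive 'é' with a .doc extension."""
    # Assumed skeleton; the attachments in this report may use different markup.
    html = (
        '<html><head><meta charset="utf-8"></head>'
        f'<body>{"é" * count}</body></html>'
    )
    path = directory / f"e_acute_{count}.doc"
    # Encode explicitly so each 'é' is stored as the two bytes 0xC3 0xA9.
    path.write_bytes(html.encode("utf-8"))
    return path


if __name__ == "__main__":
    out_dir = Path(tempfile.mkdtemp())
    for n in range(1, 6):
        print(write_test_doc(out_dir, n))
```

Opening each generated file in LibreOffice shows which counts end up as raw HTML. Whether this particular skeleton triggers the misdetection has not been verified; the attachments remain the authoritative test cases.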

Actual Results:
LibreOffice displays the raw HTML

Expected Results:
LibreOffice parses the HTML and converts it


Reproducible: Always


User Profile Reset: No

Additional Info:
Version: 7.3.7.2 / LibreOffice Community
Build ID: 30(Build:2)
CPU threads: 4; OS: Linux 5.19; UI render: default; VCL: gtk3
Locale: en-US (en_US.UTF-8); UI: en-US
Ubuntu package version: 1:7.3.7-0ubuntu0.22.04.3
Calc: threaded

Comment 1 remi.thevenoux 2024-07-25 08:29:42 UTC

Created attachment 195492 [details]
A doc file with html content that LibreOffice FAIL to open correctly

Comment 2 remi.thevenoux 2024-07-25 08:30:21 UTC

Created attachment 195493 [details]
A doc file with html content that LibreOffice can correctly open

Comment 3 Mike Kaganski 2024-07-25 09:05:51 UTC

This is a regression. In 4.0, the file opened correctly. In 4.1, it failed to open. Since 4.2, it has been imported the way it is now.

It is an import filter detection problem. Selecting HTML document (Writer) manually in the File Open dialog opens the file normally.
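
The same workaround should be possible from the command line by forcing the input filter. This is a sketch under assumptions: that the `soffice` binary is on the PATH, and that `HTML (StarWriter)` is the internal name of the "HTML document (Writer)" filter. The helper `build_convert_command` is hypothetical and only assembles the argument list.

```python
import shlex
from pathlib import Path

# Assumption: internal filter name behind "HTML document (Writer)".
HTML_WRITER_FILTER = "HTML (StarWriter)"


def build_convert_command(doc_path: Path, out_dir: Path, soffice: str = "soffice") -> list:
    """Build a headless conversion command that bypasses filter detection."""
    return [
        soffice,
        "--headless",
        f"--infilter={HTML_WRITER_FILTER}",  # force the import filter
        "--convert-to", "odt",
        "--outdir", str(out_dir),
        str(doc_path),
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch has no side effects.
    print(shlex.join(build_convert_command(Path("test.doc"), Path("."))))
```

Passing the resulting list to `subprocess.run` would perform the conversion; if the output contains the rendered text rather than raw markup, that supports the detection explanation.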

Comment 4 Mike Kaganski 2024-07-26 06:02:33 UTC

FTR: the file is detected as writer_WordPerfect_Document

Comment 5 raal 2024-07-26 16:54:39 UTC

This seems to have begun at the below commit in bibisect repository/OS bibisect-42max.
 542c4ff1c90cd5e3920f5ab68f53e644c4018d44 is the first bad commit
commit 542c4ff1c90cd5e3920f5ab68f53e644c4018d44
Author: Matthew Francis <mjay.francis@gmail.com>
Date:   Sat Sep 5 20:10:12 2015 +0800

    source-hash-e69aa9572bb2206313cd2aa7edd13da91460f2c4
    
    commit e69aa9572bb2206313cd2aa7edd13da91460f2c4
    Author:     Kohei Yoshida <kohei.yoshida@gmail.com>
    AuthorDate: Mon Aug 19 15:28:57 2013 -0400
    Commit:     Kohei Yoshida <kohei.yoshida@gmail.com>
    CommitDate: Mon Aug 19 15:41:02 2013 -0400
    
        fdo#67699: Remove a whole bunch of old hacks.
    
        The new format detection service is much simpler than the old one.
    
        In the new framework, each detection service receives the name of format
        that it is expected to check against, and it should either reject it by
        returning an empty string in case the file is not of that format, or
        if the file is indeed that format, set the appropriate filter name and
        return that type to the caller.
    
        We no longer need to deal with preselected filters (which is dealt with
        in the detection framework itself) or return an entirely different format
        that's different from the one being asked to verify.
    
        Change-Id: I3f36951b0ad821d836fb8a56b852e40d43095f09
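
The contract described in that commit message can be sketched roughly as follows. This is a toy model, not LibreOffice code: the type name `toy_html`, the filter name `Toy HTML Filter`, the dictionary-based descriptor, and the content check are all invented for illustration.

```python
def detect_html(type_name: str, descriptor: dict) -> str:
    """Toy detector: verify only the format it was asked to check."""
    # The detector is asked about one specific type and never proposes another.
    if type_name != "toy_html":
        return ""
    data = descriptor.get("data", b"")
    head = data[:64].lstrip().lower()
    if head.startswith(b"<html") or head.startswith(b"<!doctype html"):
        # Accept: set the filter name and return the type that was asked about.
        descriptor["filter_name"] = "Toy HTML Filter"
        return type_name
    # Reject: an empty string means "not this format".
    return ""


if __name__ == "__main__":
    desc = {"data": b"<html><body>x</body></html>"}
    print(repr(detect_html("toy_html", desc)), desc.get("filter_name"))
    print(repr(detect_html("toy_html", {"data": b"plain text"})))
```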

Comment 6 Mike Kaganski 2024-07-26 18:44:34 UTC

(In reply to raal from comment #5)

Aha, so the behavior changed because the WordPerfect filter started to check the file before the HTML filter.

Interesting. Is there a way to prioritize the filters relative to each other?
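
To illustrate why the order of checks matters, here is a toy model. It assumes that the first detector to accept a file wins. Both detectors and their heuristics are invented: `overly_permissive` merely stands in for any detector with a false positive and does not reflect libwpd's actual logic.

```python
from typing import Callable, List, Tuple

Detector = Callable[[bytes], bool]


def looks_like_html(data: bytes) -> bool:
    """Toy HTML check: the content starts with an <html tag."""
    return data.lstrip().lower().startswith(b"<html")


def overly_permissive(data: bytes) -> bool:
    """Hypothetical stand-in for a false positive: accepts any non-empty input."""
    return len(data) > 0


def detect(data: bytes, ordered: List[Tuple[str, Detector]]) -> str:
    """Return the type name of the first detector that accepts, or ''."""
    for type_name, accepts in ordered:
        if accepts(data):
            return type_name
    return ""


if __name__ == "__main__":
    sample = b"<html><body>x</body></html>"
    wp_first = [("toy_wordperfect", overly_permissive), ("toy_html", looks_like_html)]
    html_first = list(reversed(wp_first))
    print(detect(sample, wp_first))
    print(detect(sample, html_first))
```

With the permissive detector first, the HTML check is never reached; with the order reversed, the same bytes are classified as HTML. Reordering only hides a false positive, though, so tightening the check itself is the more robust fix.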

Comment 7 Fridrich Strba 2024-07-26 19:19:18 UTC

https://sourceforge.net/p/libwpd/code/ci/bc3050ad804b4f9f60ccfaf425f469fc095033dc/

Comment 8 Fridrich Strba 2024-07-26 19:26:42 UTC

I will be on vacation for some time starting this weekend, so please don't ask for a release. If dtardon is still active, he can do it without me, or possibly vmiklos.

Comment 9 Mike Kaganski 2024-07-27 09:04:39 UTC

(In reply to Fridrich Strba from comment #8)

Thank you Fridrich!
I'm sure that a problem that has waited since v4.1 can wait another month or two. Have a nice vacation!
