Bug 162186 - FILEOPEN DOC html file with 3 or 4 consecutive "é" character
Summary: FILEOPEN DOC html file with 3 or 4 consecutive "é" character
Status: NEW
Alias: None
Product: Document Liberation Project
Classification: Unclassified
Component: General
Version: unspecified (earliest affected)
Hardware: All All
Importance: medium normal
Assignee: Not Assigned
URL: https://sourceforge.net/p/libwpd/tick...
Whiteboard: libwpd
Keywords: bibisected, bisected, regression
Depends on:
Blocks: HTML-Import
Reported: 2024-07-25 08:27 UTC by remi.thevenoux
Modified: 2024-07-27 09:04 UTC (History)
CC List: 2 users

See Also:
Crash report or crash signature:


Attachments
A doc file with html content that LibreOffice FAIL to open correctly (92 bytes, application/msword)
2024-07-25 08:29 UTC, remi.thevenoux
A doc file with html content that LibreOffice can correctly open (88 bytes, application/msword)
2024-07-25 08:30 UTC, remi.thevenoux

Description remi.thevenoux 2024-07-25 08:27:58 UTC
Description:
I created a ".doc" file with HTML content and the UTF-8 charset declared.
If the file contains 1 or 2 consecutive "é" characters, LibreOffice opens it correctly.
But if there are 3 or 4 consecutive "é" characters, the HTML parsing seems to fail and LibreOffice displays the raw HTML.
With 5 consecutive "é" characters, it is back to normal.

Steps to Reproduce:
1. With a text editor, create an HTML file with 3 consecutive "é" characters in the body
2. Save the document with a ".doc" extension
3. Open it with LibreOffice
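
For convenience, the steps above can be scripted. The following is a minimal sketch, not the exact content of the attachments: the HTML skeleton in TEMPLATE, the helper name write_html_as_doc and the output file names are assumptions made for illustration, so the resulting files may differ in size and markup from attachments 195492 and 195493.

```python
from pathlib import Path

# Hypothetical minimal HTML skeleton; the real attachments may use
# different markup and therefore have a different byte count.
TEMPLATE = (
    '<html><head><meta charset="utf-8"></head>'
    "<body>{body}</body></html>"
)


def write_html_as_doc(path, count):
    """Write UTF-8 HTML with `count` consecutive "é" to `path`.

    Returns the bytes that were written, so they can be inspected.
    """
    html = TEMPLATE.format(body="é" * count)
    data = html.encode("utf-8")  # each "é" becomes the two bytes C3 A9
    Path(path).write_bytes(data)
    return data


# Example usage (file names are arbitrary):
#     for n in range(1, 6):
#         write_html_as_doc(f"e-acute-{n}.doc", n)
```

Opening each generated file in LibreOffice should show whether a given count of "é" is affected, assuming the skeleton is close enough to the attached files to trigger the same detection result.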

Actual Results:
The raw HTML is displayed

Expected Results:
The HTML is parsed and converted


Reproducible: Always


User Profile Reset: No

Additional Info:
Version: 7.3.7.2 / LibreOffice Community
Build ID: 30(Build:2)
CPU threads: 4; OS: Linux 5.19; UI render: default; VCL: gtk3
Locale: en-US (en_US.UTF-8); UI: en-US
Ubuntu package version: 1:7.3.7-0ubuntu0.22.04.3
Calc: threaded
Comment 1 remi.thevenoux 2024-07-25 08:29:42 UTC
Created attachment 195492
A doc file with html content that LibreOffice FAIL to open correctly
Comment 2 remi.thevenoux 2024-07-25 08:30:21 UTC
Created attachment 195493
A doc file with html content that LibreOffice can correctly open
Comment 3 Mike Kaganski 2024-07-25 09:05:51 UTC
This is a regression. In 4.0, the file opened correctly. In 4.1, it failed to open. In 4.2, it started to be imported the way it is now.

It is an import filter detection problem. Selecting HTML document (Writer) manually in the File Open dialog opens the file normally.
Comment 4 Mike Kaganski 2024-07-26 06:02:33 UTC
FTR: the file is detected as writer_WordPerfect_Document
Comment 5 raal 2024-07-26 16:54:39 UTC
This seems to have begun at the below commit in bibisect repository/OS bibisect-42max.
 542c4ff1c90cd5e3920f5ab68f53e644c4018d44 is the first bad commit
commit 542c4ff1c90cd5e3920f5ab68f53e644c4018d44
Author: Matthew Francis <mjay.francis@gmail.com>
Date:   Sat Sep 5 20:10:12 2015 +0800

    source-hash-e69aa9572bb2206313cd2aa7edd13da91460f2c4
    
    commit e69aa9572bb2206313cd2aa7edd13da91460f2c4
    Author:     Kohei Yoshida <kohei.yoshida@gmail.com>
    AuthorDate: Mon Aug 19 15:28:57 2013 -0400
    Commit:     Kohei Yoshida <kohei.yoshida@gmail.com>
    CommitDate: Mon Aug 19 15:41:02 2013 -0400
    
        fdo#67699: Remove a whole bunch of old hacks.
    
        The new format detection service is much simpler than the old one.
    
        In the new framework, each detection service receives the name of format
        that it is expected to check against, and it should either reject it by
        returning an empty string in case the file is not of that format, or
        if the file is indeed that format, set the appropriate filter name and
        return that type to the caller.
    
        We no longer need to deal with preselected filters (which is dealt with
        in the detection framework itself) or return an entirely different format
        that's different from the one being asked to verify.
    
        Change-Id: I3f36951b0ad821d836fb8a56b852e40d43095f09
Comment 6 Mike Kaganski 2024-07-26 18:44:34 UTC
(In reply to raal from comment #5)

Aha, so the change of behavior was because the WordPerfect filter started to check the file before the HTML filter.

I wonder if there is a way to prioritize the filters relative to each other?
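
To illustrate why the order matters, here is a toy model of a first-match-wins detection chain. It follows only the contract quoted in comment 5 (a detector returns an empty string to reject, or a type name to accept). Everything else is hypothetical: the detector functions, their checks and the type names are invented for illustration and do not reflect the real LibreOffice or libwpd code.

```python
# Toy model only: none of these checks mirror real LibreOffice/libwpd logic.

def detect_greedy(data):
    """Hypothetical over-eager detector: accepts any non-empty file."""
    return "greedy_format" if data else ""


def detect_html(data):
    """Hypothetical HTML detector: accepts data that starts like HTML."""
    head = data.lstrip().lower()
    return "html" if head.startswith((b"<html", b"<!doctype html")) else ""


def detect(data, detectors):
    """Return the type from the first detector that accepts the data."""
    for detector in detectors:
        type_name = detector(data)
        if type_name:  # a non-empty string means "accepted"
            return type_name
    return ""  # nobody recognized the data


sample = "<html><body>ééé</body></html>".encode("utf-8")

# The same data gets a different type depending only on detector order.
greedy_first = detect(sample, [detect_greedy, detect_html])
html_first = detect(sample, [detect_html, detect_greedy])
```

In this sketch greedy_first is "greedy_format" and html_first is "html", which is the kind of outcome a relative priority between filters would control.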
Comment 8 Fridrich Strba 2024-07-26 19:26:42 UTC
I will be on vacation for some time starting this weekend, so please don't ask for a release. If dtardon is still active, he can do it without me. Or possibly vmiklos.
Comment 9 Mike Kaganski 2024-07-27 09:04:39 UTC
(In reply to Fridrich Strba from comment #8)

Thank you Fridrich!
I'm sure that a problem that has waited since v4.1 can wait another month or two. Have a nice vacation!