Bug 77656 - FILEOPEN: HTML with <!DOCTYPE html> and UTF-8 BOM isn't detected
Summary: FILEOPEN: HTML with <!DOCTYPE html> and UTF-8 BOM isn't detected
Status: RESOLVED WONTFIX
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer
Version: 4.2.4.1 rc (earliest affected)
Hardware: x86 (IA32) All
Importance: medium major
Assignee: Not Assigned
URL:
Whiteboard: target:4.4.0
Keywords:
Depends on:
Blocks:
 
Reported: 2014-04-18 20:12 UTC by Ivan
Modified: 2014-11-24 12:57 UTC
CC List: 3 users

See Also:
Crash report or crash signature:
Regression By:


Attachments
incorrectly being displayed file (6.42 KB, text/html)
2014-04-18 20:12 UTC, Ivan
Description Ivan 2014-04-18 20:12:53 UTC
Created attachment 97574
incorrectly being displayed file

Hi

The attached .doc source file is not displayed correctly in LibreOffice.
In MS Office 2003 it is displayed normally.
Comment 1 Julien Nabet 2014-04-18 21:45:47 UTC
What do you mean by "displayed incorrectly"? What did you get, and what did you expect?

BTW, it's a plain HTML file with a .doc extension. Was that on purpose?
Comment 2 Ivan 2014-04-19 08:17:21 UTC
I was expecting to read the text of the file, as in v4.2.3 (and v4.2.2).
If I open the file in MS Office 2003, I see normal text.

Sorry for my English.
Comment 3 Ivan 2014-04-19 08:25:21 UTC
The .doc file is displayed normally in Microsoft Word Viewer:
http://floomby.ru/s1/8Wr6S7
Comment 4 Maxim Monastirsky 2014-04-20 11:22:25 UTC
@Julien: Ivan probably expects the file to open as an HTML document, not as a Writer document showing the HTML code. In that case I can confirm this bug with the 4.2 branch. Fortunately, it's fixed in master by the changes I've made to the HTML detection there. Unfortunately, it's a big change and unlikely to be backported to 4.2.

If someone wants to work on a fix for 4.2: The problem here is that this file begins with a UTF-8 BOM (http://en.wikipedia.org/wiki/Byte_order_mark#UTF-8), but HTMLParser::IsHTMLFormat (in svtools/source/svhtml/parhtml.cxx) doesn't respect that kind of BOM, only the UTF-16 one. We need to simply skip it, the same way we do for UTF-16.
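
To illustrate the idea, here is a minimal standalone sketch of BOM-tolerant HTML sniffing. It is not the actual HTMLParser::IsHTMLFormat code: the names looksLikeHtml and startsWithNoCase are hypothetical, the buffer is assumed to be UTF-8 or ASCII bytes, and the UTF-16 case is left out.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper (not LibreOffice API): case-insensitive check that
// `data` contains `prefix` starting at offset `pos`. `prefix` must be lowercase.
static bool startsWithNoCase(const std::string& data, std::size_t pos,
                             const std::string& prefix)
{
    if (pos > data.size() || data.size() - pos < prefix.size())
        return false;
    for (std::size_t i = 0; i < prefix.size(); ++i)
    {
        const unsigned char c = static_cast<unsigned char>(data[pos + i]);
        if (static_cast<char>(std::tolower(c)) != prefix[i])
            return false;
    }
    return true;
}

// Hypothetical stand-in for the detection step: skip a leading UTF-8 BOM
// (bytes EF BB BF), skip whitespace, then look for an HTML marker.
bool looksLikeHtml(const std::string& data)
{
    static const std::string utf8Bom = "\xEF\xBB\xBF";

    std::size_t pos = 0;
    if (data.compare(0, utf8Bom.size(), utf8Bom) == 0)
        pos = utf8Bom.size(); // ignore the BOM instead of treating it as content

    while (pos < data.size()
           && std::isspace(static_cast<unsigned char>(data[pos])))
        ++pos;

    return startsWithNoCase(data, pos, "<!doctype html")
        || startsWithNoCase(data, pos, "<html");
}
```

With this sketch, a buffer that starts with the three BOM bytes followed by `<!DOCTYPE html>` is recognised as HTML, while the same check without the BOM skip would see the BOM bytes first and fall through to plain text.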
Comment 5 Maxim Monastirsky 2014-04-20 11:30:55 UTC
Comment on attachment 97574
incorrectly being displayed file

To avoid confusion, I'll change the file extension and MIME type to HTML. This bug has nothing to do with the .doc extension (but once fixed, it should work even with that extension).
Comment 6 Julien Nabet 2014-04-23 09:25:34 UTC
Thank you, Maxim, for your detailed feedback.
Comment 7 Julien Nabet 2014-08-04 20:10:40 UTC
In general, a more recent report would be marked as a duplicate of the older one, but since the later one has already been fixed, I'm marking this one as the duplicate.

I tested with master sources updated today, and it was OK.

About the code, here's the chain:
svtools/source/svhtml/parhtml.cxx includes
include/svtools/parhtml.hxx, which includes
include/svtools/svparser.hxx, which is implemented in
svtools/source/svrtf/svparser.cxx
This last file contains SvParser::GetNextChar(), which has been fixed by:
http://cgit.freedesktop.org/libreoffice/core/commit/?id=5eb408a3bb8df204452f0b931a254dad5f0cf35b

David: would it be OK to cherry-pick http://cgit.freedesktop.org/libreoffice/core/commit/?id=5eb408a3bb8df204452f0b931a254dad5f0cf35b to the 4.3 branch (and perhaps to 4.2)? (I can cherry-pick it for both and put them up for review.)

*** This bug has been marked as a duplicate of bug 81044 ***
Comment 8 David Tardon 2014-08-04 21:00:38 UTC
(In reply to comment #7)
> David: would it be ok to cherry-pick
> http://cgit.freedesktop.org/libreoffice/core/commit/
> ?id=5eb408a3bb8df204452f0b931a254dad5f0cf35b in 4.3 branch (and perhaps in
> 4.2)? (I can cherry-pick for both and put them to review)

Yes for 4.3, no for 4.2.
Comment 9 Julien Nabet 2014-08-04 21:06:52 UTC
David: thank you for your feedback. I've put https://gerrit.libreoffice.org/#/c/10742/ up for review (as you must have already seen :-) ).
Comment 10 Maxim Monastirsky 2014-08-05 18:11:45 UTC
@Julien: This is *not* a duplicate of bug 81044. Bug 81044 is about the filter, this one is about type detection.
Comment 11 Julien Nabet 2014-08-05 19:12:46 UTC
Maxim: I'm not sure I understand. When I opened the file, it was OK. Does it fail when you try to open it?
Comment 12 Maxim Monastirsky 2014-08-05 19:22:29 UTC
(In reply to comment #11)
Julien: This report was regarding 4.2, I know it's fine in 4.3/master.
Comment 13 Julien Nabet 2014-08-05 19:25:00 UTC
OK, sorry then. I'm reopening this report.
Comment 14 Maxim Monastirsky 2014-11-24 12:57:01 UTC
4.2 is EOL.