Bug 74595 - FILEOPEN: When opening an HTML file in LibreOffice whose DOCTYPE indicator is not on the first line, LibreOffice shows the HTML source code
Summary: FILEOPEN: When opening an HTML file in LibreOffice whose DOCTYPE indicator is...
Status: RESOLVED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer
Version (earliest affected): 4.2.0.4 release
Hardware: All All
Importance: high normal
Assignee: Maxim Monastirsky
URL:
Whiteboard: target:4.3.0
Keywords: regression
Duplicates: 79863 81865 82134
Depends on:
Blocks: HTML-Import
 
Reported: 2014-02-06 00:27 UTC by xmlhttprequest.open
Modified: 2017-10-13 20:21 UTC
CC List: 7 users

See Also:
Crash report or crash signature:


Attachments
Proof of the bug (testcase) (2.98 KB, application/zip)
2014-02-06 00:27 UTC, xmlhttprequest.open

Description xmlhttprequest.open 2014-02-06 00:27:08 UTC
Created attachment 93500
Proof of the bug (testcase)

LibreOffice Writer/Web has trouble opening certain HTML files.  The problem occurs when LibreOffice opens an HTML file that has a DOCTYPE indicator that is not on the first line of the file.  When LibreOffice attempts to open any file with <!DOCTYPE ...> not on the first line, it shows the HTML source code.  See the attached "Proof of the bug" in the attachments section.

Steps to reproduce:
1. Make sure that you are running LibreOffice 4.2.
2A. If you would like a quick way to reproduce the bug, download the attachment labeled "Proof of the bug."  It is a zip file with four HTML files in it:
 - 1) File that works.html - An HTML file that renders normally
 - 2) File that doesn't work.html - A file that causes the bug
 - 3) Moneydance-File that works.html - A corrected version of the Moneydance-generated file, revised by the user who first noticed this LibreOffice behavior, that renders normally
 - 4) Moneydance-File that doesn't work.html - The original file generated by Moneydance that has the bug in it
2B. If you want to make your own HTML files to further experiment with this bug, make one HTML file that has a DOCTYPE on the first line of the file, and one that has a DOCTYPE NOT on the first line.
3. Open the HTML files in LibreOffice Writer/Web (the ones that say "doesn't work" will make LibreOffice show the source, while the ones that say "works" render normally).
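
For step 2B, a pair of test files can be generated with a short script. This is only a minimal sketch: the file names, the `<!DOCTYPE html>` declaration and the page body are assumptions chosen for illustration and are not taken from the attachment.

```python
import tempfile
from pathlib import Path

# Assumed sample content; any DOCTYPE declaration and body should do.
DOCTYPE = "<!DOCTYPE html>"
BODY = "<html><head><title>fdo#74595 test</title></head><body><p>Hello</p></body></html>\n"


def write_test_files(directory):
    """Create one HTML file with the DOCTYPE on line 1 and one with it on line 2."""
    directory = Path(directory)
    works = directory / "doctype-first-line.html"    # hypothetical file name
    fails = directory / "doctype-second-line.html"   # hypothetical file name
    works.write_text(DOCTYPE + "\n" + BODY, encoding="utf-8")
    # A single leading blank line is enough to push the DOCTYPE off line 1.
    fails.write_text("\n" + DOCTYPE + "\n" + BODY, encoding="utf-8")
    return works, fails


if __name__ == "__main__":
    # Print the paths so the files can be opened in LibreOffice Writer/Web.
    for path in write_test_files(tempfile.mkdtemp()):
        print(path)
```

Opening the two printed paths in an affected build should show the difference described in step 3, provided the assumed content triggers the same detection path as the attached files.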
Comment 1 Joel Madero 2014-02-17 00:10:30 UTC
Please don't nominate your own bugs - we have a procedure that QA/Devs follow. Thanks
Comment 2 m_a_riosv 2014-02-17 00:49:25 UTC
Hi xmlhttprequest, thanks for reporting.

Reproducible with:
Win7x64Ult.
Version: 4.2.0.4 Build ID: 05dceb5d363845f2cf968344d7adab8dcfb2ba71
Version: 4.2.1.1 Build ID: d7dbbd7842e6a58b0f521599204e827654e1fb8b
Version: 4.3.0.0.alpha0+ Build ID: ecf22894f522374cbdb8196d3bdef88e2fba7af9
  TinderBox: Win-x86@39, Branch:master, Time: 2014-02-15_01:01:17

Last working:
Version: 4.1.6.0.0+ Build ID: 2e2040401d99fe116b65b9661c3d4755091a660

Explicitly selecting the file type HTML Document (Writer) (*html;*.htm) when opening makes the file open fine for me.

Importance is perhaps a little high, given the easy workaround.
Comment 3 Joel Madero 2014-02-17 00:54:06 UTC
Most definitely over-prioritized. Lowering severity to normal, since this is a normal bug; leaving priority as high because it's a regression.

Critical is meant for crashers, memory leaks, and similar bugs.
Comment 4 xmlhttprequest.open 2014-02-17 01:18:12 UTC
(In reply to comment #2)
> Explicitly selecting the file type HTML Document (Writer) (*html;*.htm)
> when opening makes the file open fine for me.
So, is it safe to say that LibreOffice's auto-detection of file types thinks that the HTML file is not an HTML file when its <!DOCTYPE ...> is not on the first line?  If so, I will update the bug.
Thank you for helping and for finding the workaround.
I never would have thought about trying that.

Regards,
xmlhttprequest.open@gmail.com
Comment 5 Maxim Monastirsky 2014-02-17 09:08:20 UTC
I'll take care of it.
Comment 6 Maxim Monastirsky 2014-02-17 09:51:47 UTC
I submitted a fix for master to gerrit: https://gerrit.libreoffice.org/8079/. Unfortunately 4.2 requires a different fix (which hopefully I'll do later).
Comment 7 Commit Notification 2014-02-18 12:59:36 UTC
Maxim Monastirsky committed a patch related to this issue.
It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=7a044d08572244931b16f24f3f8cc83111b039f9

fdo#74595 Make HTML detection to follow specs



The patch should be included in the daily builds available at
http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
http://wiki.documentfoundation.org/Testing_Daily_Builds
Affected users are encouraged to test the fix and report feedback.
Comment 8 m_a_riosv 2014-02-20 22:53:42 UTC
Thanks Maxim.
Verified in:
Version: 4.3.0.0.alpha0+ Build ID: 22b709e84a7b6d38cab2dd37f2f2b28e0fc9d062
  TinderBox: Win-x86@39, Branch:master, Time: 2014-02-20_00:01:31
Comment 9 Björn Michaelsen 2014-03-18 00:59:07 UTC
Closing this one as FIXED as per comment 7.
Comment 10 Björn Michaelsen 2014-03-18 00:59:44 UTC
And just for kicks: VERIFIED as per comment 8.
Comment 11 Maxim Monastirsky 2014-06-10 08:18:34 UTC
*** Bug 79863 has been marked as a duplicate of this bug. ***
Comment 12 Yousuf Philips (jay) (retired) 2014-06-10 09:40:12 UTC
My bug 79863 was classified as a duplicate, but the HTML I provided has the DOCTYPE on the first line, with a few blank spaces before it. Also, though this bug has been labelled verified-fixed, it still hasn't been fixed in 4.2.6.

Version: 4.2.6.0.0+
Build ID: 2b959fb871a68f08a06850909abd16f71033aa3a
TinderBox: Linux-rpm_deb-x86@45-TDF, Branch:libreoffice-4-2, Time: 2014-06-06_06:33:25
Comment 13 Maxim Monastirsky 2014-06-10 09:51:58 UTC
(In reply to comment #12)
> My bug 79863 was classified as a duplicate, but the HTML I provided has the
> DOCTYPE on the first line, with a few blank spaces before it.
Right, it was the same problem. LO required the DOCTYPE to be at the very beginning of the file. So it doesn't matter whether it has a space or a line break before.

> Also, though this
> bug has been labelled verified-fixed, it still hasn't been fixed in 4.2.6.
Right, it was fixed for 4.3 (see the whiteboard). That fix can't be applied to 4.2, because 4.2 uses different code for HTML detection.
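
The behaviour described in this comment can be illustrated with a small sketch. This is not LibreOffice's detection code, which is not shown in this report; both function names are hypothetical, and the "lenient" variant only demonstrates the idea of ignoring leading whitespace before looking for the DOCTYPE.

```python
def strict_is_html(head: bytes) -> bool:
    """Behaviour as described above: the DOCTYPE must be the very first bytes."""
    return head[:9].upper() == b"<!DOCTYPE"


def lenient_is_html(head: bytes) -> bool:
    """Skip leading ASCII whitespace (spaces, tabs, line breaks) before checking."""
    return head.lstrip()[:9].upper() == b"<!DOCTYPE"


if __name__ == "__main__":
    samples = [
        b"<!DOCTYPE html>\n<html></html>",     # DOCTYPE at the very beginning
        b"\n<!DOCTYPE html>\n<html></html>",   # preceded by a line break
        b"   <!DOCTYPE html>\n<html></html>",  # preceded by spaces (bug 79863)
    ]
    for sample in samples:
        print(sample[:20], strict_is_html(sample), lenient_is_html(sample))
```

With the strict check, a leading space and a leading line break fail in exactly the same way, which is why bug 79863 is a duplicate of this one.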
Comment 14 Yousuf Philips (jay) (retired) 2014-06-10 15:25:59 UTC
(In reply to comment #13)
> Right, it was fixed for 4.3 (see the whiteboard). That fix can't be
> applied to 4.2, because 4.2 uses different code for HTML detection.

Yes, I do understand that the fix is different for 4.3 and 4.2, as you stated in comment 6, but you also stated "Unfortunately 4.2 requires a different fix (which hopefully I'll do later).". So are you confirming here that you aren't going to be doing a 4.2 fix?
Comment 15 Maxim Monastirsky 2014-06-11 14:13:18 UTC
(In reply to comment #14)
> So are you confirming here that you aren't going to be doing a 4.2 fix?
Probably not, but maybe I'll find some time for it at some point.
Comment 16 Maxim Monastirsky 2014-07-03 08:21:02 UTC
Good news for 4.2 users. Caolán pushed a fix for this to the 4.2 branch:

http://cgit.freedesktop.org/libreoffice/core/commit/?h=libreoffice-4-2&id=32eddb3f48fcea0a052401a8a5dc075c7847f1c5

So this is also fixed for 4.2.6.
Comment 17 Maxim Monastirsky 2014-07-29 09:01:54 UTC
*** Bug 81865 has been marked as a duplicate of this bug. ***
Comment 18 Maxim Monastirsky 2014-08-05 18:35:06 UTC
*** Bug 82134 has been marked as a duplicate of this bug. ***