Bug 50931 - EasyHack: FILEOPEN: Cannot import HTML <br/> and <hr/> (while <br> and <br />, <hr> and <hr /> are imported correctly)
Summary: EasyHack: FILEOPEN: Cannot import HTML <br/> and <hr/> (while <br> and <br />...
Status: VERIFIED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer (show other bugs)
Version:
(earliest affected)
3.3.0 release
Hardware: Other All
: medium minor
Assignee: Not Assigned
URL:
Whiteboard: BSA target:3.7.0 target:3.6.4
Keywords: easyHack, skillCpp
Depends on:
Blocks:
 
Reported: 2012-06-10 04:25 UTC by matta2006
Modified: 2015-12-15 23:18 UTC (History)
6 users (show)

See Also:
Crash report or crash signature:


Attachments
Test file (4.50 KB, text/html)
2012-09-26 12:56 UTC, Roman Eisele
Details
Screenshot of the problem (602.15 KB, image/jpeg)
2012-09-26 14:02 UTC, matta2006
Details
Test file with additional '/' (4.51 KB, text/html)
2012-09-26 14:04 UTC, matta2006
Details

Note You need to log in before you can comment on or make changes to this bug.
Description matta2006 2012-06-10 04:25:45 UTC
Problem description: 
- Libreoffice cannot import <br/> tag when loading HTML files.

Steps to reproduce:
1. Create a basic HTML file with a <br/> tag.

Current behavior:
-I suppose tag is ignored and removed. Instead of a line break, there is nothing (last word attached to new one).

Expected behavior:
-A line break as in Firefox and Windows Word.

Platform (if different from the browser):
-I tested on Windows version.
              
Browser: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.16) Gecko/20120602 Iceweasel/3.5.16 (like Firefox/3.5.16)
Comment 1 matta2006 2012-06-10 04:28:13 UTC
Example:
is actually quite simple at its core.<br/>We used a few of the options

Current (wrong) result:
is actually quite simple at its core.We used a few of the options
Comment 2 matta2006 2012-07-07 06:22:25 UTC
Any news about this?
Comment 3 Roman Eisele 2012-09-26 12:53:27 UTC
Thank you for your bug report!

> Any news about this?

I checked, and I can NOT reproduce this (anymore). Using LibreOffice 3.5.7.1 and 3.6.2.1 on Mac OS X 10.6.8, I tried opening a simple, but valid (X)HTML file into which I had inserted all three variants:

   <br>     -- old pre-XHTML style
   <br/>    -- XHMTL-style, but bad style according to some recommendations
   <br />   -- XHMTL-style, good style according to some recommendations

All three BRs were imported correctly as ‘new line’ (not new paragraph), which is IMHO the very best thing Writer can do.


So can you please check yourself if a current version of LibreOffice fixes the problem?

And if not, please attach a sample HTML file which shows the wrong import of <br/> for you.

Thank you again!
Comment 4 Roman Eisele 2012-09-26 12:56:56 UTC
Created attachment 67724 [details]
Test file



Additional idea:

This is the (X)HTML test file which I used for my test, so please check if it opens correctly (i.e., inteprets the three <br/> variants correctly) for you, or if you still see the error with it. Thank you!
Comment 5 matta2006 2012-09-26 14:02:42 UTC
Created attachment 67731 [details]
Screenshot of the problem

Here you are a screenshot of the problem.
Comment 6 matta2006 2012-09-26 14:04:48 UTC
Created attachment 67732 [details]
Test file with additional '/'

Here you are my test file
Comment 7 Roman Eisele 2012-09-26 14:50:30 UTC
Thank you for correcting my test file! Wow, how silly by me -- I really made the mistake to test <br />, <br>, and <br> again -- a simple typo, but no wonder that I did not see the error with <br/>!

Thank you also for the helpful screenshot! I see the same things now.

So, while <br> and <br /> are imported correctly as ‘new line’, <br/> is just ignored, as you write. To prevent any further misunderstandings, I add this information to the Summary.


Reproducibility/Versions
------------------------

And this failure to import <br/> correctly is REPRODUCIBLE with
* LibreOffice 3.3.0
* LibreOffice 3.4.6
* LibreOffice 3.5.7.1
* LibreOffice 3.6.2.1
So this is an old bug, probably inherited from OOo. → Adapting Version field (it should always contain the 1st version in which a bug is known to exist).


A possible objection, and the aswer to it
-----------------------------------------

One could object that <br/> (without any whitespace before the closing />) is often considered as “bad style”, and that <br /> with some whitespace is the recommended version. That is true. But nevertheless many many web sites use the “bad” style, i.e. <br/>, and “bad style” does not mean “invalid”.

So, while LibreOffice, when *generating* (X)HTML, should use either <br> (for HTML 4.x) or <br /> (for XHTML), it should, when *importing* (X)HTML, definitely understand <br/>, just like <br> and <br />.
Comment 8 matta2006 2012-09-26 14:57:35 UTC
Yes, exactly. Unfortunately I have found plenty of files with this format.
For example, Adobe Acrobat exports HTML with this tag. :-(
And even if a bad style, it passes all validation tests I did (W3C for example).

I can (and I do) replace all <br/> manually with <br>... but it is really annoying.

Moreover, I discovered this issue by chance: I did not understand why some part of the texts importent into Writer had no spaces... so often.

Thanks for your help.
Comment 9 Roman Eisele 2012-09-26 15:22:08 UTC
@ Writer experts:

Hello Cédric, Michael, Miklós,

this is a minor Writer problem, but nevertheless it can be very annoying for users, because the <br/> tag is very common on web pages today, and often (ab)used for necessary formatting tasks, without which the HTML text arrives more or less unreadable in Writer. Maybe some of you wants to take a look at it ... ;-)

Thank you very much!
Comment 10 Urmas 2012-09-26 17:00:33 UTC
Is /core/svtools/source/svhtml/parhtml.cxx:1118 here to blame?
Comment 11 Roman Eisele 2012-09-27 06:35:28 UTC
This problem is not limited to <br/>, there is some general parser problem with 
empty = standalone (X)HTML tags which match the pattern <[A-Za-z]+/>, because:

  <hr>   is imported correctly
  <hr /> is imported correctly
  <hr/>  fails: is just ignored!

I can not remember any other empty (standalone) (X)HTML tag besides <br/> and <hr/> which can occur without additional attributes. In empty = standalone (X)HTML tags which *require* some attributes, like <img/>, there is no problem with missing whitespace before the final />:

  <img src="x.png">   is imported correctly
  <img src="x.png" /> is imported correctly
  <img src="x.png"/>  is imported correctly

And as soon as I add some attribute to the <br/> and <hr/> tags, the problem vanishes:

  <hr class="myhr">   is imported correctly
  <hr class="myhr" /> is imported correctly
  <hr class="myhr"/>  is imported correctly

So the problem is really limited to tags which match the pattern <[A-Za-z]+/>.

Therefore:

(In reply to comment #10)
> Is /core/svtools/source/svhtml/parhtml.cxx:1118 here to blame?

Seems reasonsable, but my code reading facilities are limited. Would it help if one changes line 1118-1119 from

    } while( '>' != nNextCh && !HTML_ISSPACE( nNextCh ) &&
             IsParserWorking() && !rInput.IsEof() );

to something like

    } while( '>' != nNextCh && '/' != nNextCh && !HTML_ISSPACE( nNextCh )
             && IsParserWorking() && !rInput.IsEof() );

?
Comment 12 Michael Stahl (allotropia) 2012-09-27 10:02:20 UTC
Urmas, that place looks interesting, there is already code to
handle "/>" around line 1217 in parhtml.cxx; would you like to
debug this a little, find out why it doesn't work and submit a patch :)
Comment 13 Louis Possoz 2012-10-26 14:46:05 UTC
In the original parhtml.cxx, line 1118 get the first token of an html tag (the word following '<' or '</'), stopping at the first white space or at a '>'.

In our case (<br/> or <hr/>), this resulted in the inclusion of the terminating slash in the token. It seems that the most simple and efficient way to solve this semi-bug is to also stop the first token on the slash by modifying line 1118. The slash will then be correctly interpreted later like in <br />.

For my very first contribution to libreoffice development (I am a newby to 'git' and all that stuff), I thus submitted a new parhtml.cxx with this correction : https://gerrit.libreoffice.org/#/c/915/

If I made any error or if I have to do anything more, please, tell me.
Comment 14 Roman Eisele 2012-11-03 10:29:04 UTC
It seems that the bot is sleeping or on vacation, therefore a manual hint:

This bug was fixed on Master by the commit
http://cgit.freedesktop.org/libreoffice/core/commit/?id=9fdf86df4eb65a0cd2a178998daf751afc34805e
by Louis Possoz, committed at 2012-11-02 20:20:07 (GMT) by Michael Stahl.

The fix will appear in LibreOffice 3.7.0.


Can we get a backport of this fix to the 3.6.x branch, please? ;-)
Comment 15 Not Assigned 2012-11-05 11:04:46 UTC
Louis Possoz committed a patch related to this issue.
It has been pushed to "libreoffice-3-6":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=f8d569846ad04ff80e2810be865cacffd6cc1527&g=libreoffice-3-6

fdo#50931 : Cannot import HTML <br/> and <hr/>


It will be available in LibreOffice 3.6.4.

The patch should be included in the daily builds available at
http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
http://wiki.documentfoundation.org/Testing_Daily_Builds
Affected users are encouraged to test the fix and report feedback.
Comment 16 Roman Eisele 2012-11-06 07:32:59 UTC
VERIFIED as FIXED:
with LOdev 3.7.0.0.alpha0+ (build ID: c1b407; pull time: 2012-11-02 23:47:02)
on Mac OS X 10.6.8, our test file is now processed without any problems --
all three variants <br>, <br/> and <br /> are now imported correctly;
the same is true for <hr>, <hr/> and <hr />.

Thank you all for collaborating on this fix!
Comment 17 Robinson Tryon (qubit) 2015-12-15 23:18:16 UTC Comment hidden (obsolete)