Problem description: - Libreoffice cannot import <br/> tag when loading HTML files. Steps to reproduce: 1. Create a basic HTML file with a <br/> tag. Current behavior: -I suppose tag is ignored and removed. Instead of a line break, there is nothing (last word attached to new one). Expected behavior: -A line break as in Firefox and Windows Word. Platform (if different from the browser): -I tested on Windows version. Browser: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.16) Gecko/20120602 Iceweasel/3.5.16 (like Firefox/3.5.16)
Example: is actually quite simple at its core.<br/>We used a few of the options Current (wrong) result: is actually quite simple at its core.We used a few of the options
Any news about this?
Thank you for your bug report! > Any news about this? I checked, and I can NOT reproduce this (anymore). Using LibreOffice 3.5.7.1 and 3.6.2.1 on Mac OS X 10.6.8, I tried opening a simple, but valid (X)HTML file into which I had inserted all three variants: <br> -- old pre-XHTML style <br/> -- XHMTL-style, but bad style according to some recommendations <br /> -- XHMTL-style, good style according to some recommendations All three BRs were imported correctly as ‘new line’ (not new paragraph), which is IMHO the very best thing Writer can do. So can you please check yourself if a current version of LibreOffice fixes the problem? And if not, please attach a sample HTML file which shows the wrong import of <br/> for you. Thank you again!
Created attachment 67724 [details] Test file Additional idea: This is the (X)HTML test file which I used for my test, so please check if it opens correctly (i.e., inteprets the three <br/> variants correctly) for you, or if you still see the error with it. Thank you!
Created attachment 67731 [details] Screenshot of the problem Here you are a screenshot of the problem.
Created attachment 67732 [details] Test file with additional '/' Here you are my test file
Thank you for correcting my test file! Wow, how silly by me -- I really made the mistake to test <br />, <br>, and <br> again -- a simple typo, but no wonder that I did not see the error with <br/>! Thank you also for the helpful screenshot! I see the same things now. So, while <br> and <br /> are imported correctly as ‘new line’, <br/> is just ignored, as you write. To prevent any further misunderstandings, I add this information to the Summary. Reproducibility/Versions ------------------------ And this failure to import <br/> correctly is REPRODUCIBLE with * LibreOffice 3.3.0 * LibreOffice 3.4.6 * LibreOffice 3.5.7.1 * LibreOffice 3.6.2.1 So this is an old bug, probably inherited from OOo. → Adapting Version field (it should always contain the 1st version in which a bug is known to exist). A possible objection, and the aswer to it ----------------------------------------- One could object that <br/> (without any whitespace before the closing />) is often considered as “bad style”, and that <br /> with some whitespace is the recommended version. That is true. But nevertheless many many web sites use the “bad” style, i.e. <br/>, and “bad style” does not mean “invalid”. So, while LibreOffice, when *generating* (X)HTML, should use either <br> (for HTML 4.x) or <br /> (for XHTML), it should, when *importing* (X)HTML, definitely understand <br/>, just like <br> and <br />.
Yes, exactly. Unfortunately I have found plenty of files with this format. For example, Adobe Acrobat exports HTML with this tag. :-( And even if a bad style, it passes all validation tests I did (W3C for example). I can (and I do) replace all <br/> manually with <br>... but it is really annoying. Moreover, I discovered this issue by chance: I did not understand why some part of the texts importent into Writer had no spaces... so often. Thanks for your help.
@ Writer experts: Hello Cédric, Michael, Miklós, this is a minor Writer problem, but nevertheless it can be very annoying for users, because the <br/> tag is very common on web pages today, and often (ab)used for necessary formatting tasks, without which the HTML text arrives more or less unreadable in Writer. Maybe some of you wants to take a look at it ... ;-) Thank you very much!
Is /core/svtools/source/svhtml/parhtml.cxx:1118 here to blame?
This problem is not limited to <br/>, there is some general parser problem with empty = standalone (X)HTML tags which match the pattern <[A-Za-z]+/>, because: <hr> is imported correctly <hr /> is imported correctly <hr/> fails: is just ignored! I can not remember any other empty (standalone) (X)HTML tag besides <br/> and <hr/> which can occur without additional attributes. In empty = standalone (X)HTML tags which *require* some attributes, like <img/>, there is no problem with missing whitespace before the final />: <img src="x.png"> is imported correctly <img src="x.png" /> is imported correctly <img src="x.png"/> is imported correctly And as soon as I add some attribute to the <br/> and <hr/> tags, the problem vanishes: <hr class="myhr"> is imported correctly <hr class="myhr" /> is imported correctly <hr class="myhr"/> is imported correctly So the problem is really limited to tags which match the pattern <[A-Za-z]+/>. Therefore: (In reply to comment #10) > Is /core/svtools/source/svhtml/parhtml.cxx:1118 here to blame? Seems reasonsable, but my code reading facilities are limited. Would it help if one changes line 1118-1119 from } while( '>' != nNextCh && !HTML_ISSPACE( nNextCh ) && IsParserWorking() && !rInput.IsEof() ); to something like } while( '>' != nNextCh && '/' != nNextCh && !HTML_ISSPACE( nNextCh ) && IsParserWorking() && !rInput.IsEof() ); ?
Urmas, that place looks interesting, there is already code to handle "/>" around line 1217 in parhtml.cxx; would you like to debug this a little, find out why it doesn't work and submit a patch :)
In the original parhtml.cxx, line 1118 get the first token of an html tag (the word following '<' or '</'), stopping at the first white space or at a '>'. In our case (<br/> or <hr/>), this resulted in the inclusion of the terminating slash in the token. It seems that the most simple and efficient way to solve this semi-bug is to also stop the first token on the slash by modifying line 1118. The slash will then be correctly interpreted later like in <br />. For my very first contribution to libreoffice development (I am a newby to 'git' and all that stuff), I thus submitted a new parhtml.cxx with this correction : https://gerrit.libreoffice.org/#/c/915/ If I made any error or if I have to do anything more, please, tell me.
It seems that the bot is sleeping or on vacation, therefore a manual hint: This bug was fixed on Master by the commit http://cgit.freedesktop.org/libreoffice/core/commit/?id=9fdf86df4eb65a0cd2a178998daf751afc34805e by Louis Possoz, committed at 2012-11-02 20:20:07 (GMT) by Michael Stahl. The fix will appear in LibreOffice 3.7.0. Can we get a backport of this fix to the 3.6.x branch, please? ;-)
Louis Possoz committed a patch related to this issue. It has been pushed to "libreoffice-3-6": http://cgit.freedesktop.org/libreoffice/core/commit/?id=f8d569846ad04ff80e2810be865cacffd6cc1527&g=libreoffice-3-6 fdo#50931 : Cannot import HTML <br/> and <hr/> It will be available in LibreOffice 3.6.4. The patch should be included in the daily builds available at http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More information about daily builds can be found at: http://wiki.documentfoundation.org/Testing_Daily_Builds Affected users are encouraged to test the fix and report feedback.
VERIFIED as FIXED: with LOdev 3.7.0.0.alpha0+ (build ID: c1b407; pull time: 2012-11-02 23:47:02) on Mac OS X 10.6.8, our test file is now processed without any problems -- all three variants <br>, <br/> and <br /> are now imported correctly; the same is true for <hr>, <hr/> and <hr />. Thank you all for collaborating on this fix!
Migrating Whiteboard tags to Keywords: (EasyHack SkillCpp ) [NinjaEdit]