Bug 38989 - PATCH/FIX FOR - Incorrect character written out when saved in HTML format
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer
Version: 3.5.0 RC1 (earliest affected)
Hardware: Other All
Importance: medium normal
Assignee: Caolán McNamara
Whiteboard: BSA target:3.6.0
Keywords: patch
Depends on:
Reported: 2011-07-05 17:01 UTC by gordon.lack
Modified: 2012-06-01 14:16 UTC

Patch against 3.4.3 (709 bytes, patch)
2011-10-06 16:27 UTC, gordon.lack

Description gordon.lack 2011-07-05 17:01:50 UTC
If you put a non-break hyphen into a document (Ctrl+Shift+-, U+2011), it is fine as long as you keep it as an ODF document.
However, if you write this out as HTML, the character is converted into a non-break *space*: &nbsp; is written instead of &#8209;

Note that this is not a new bug - it happens in OpenOffice 3.2.1 as well (which is where I first noticed it).
Comment 1 gordon.lack 2011-07-07 17:47:42 UTC
OK.  This is the cause:

    case 0xA0:          // is a hard blank
//!! the TextConverter has a problem with this character - so change it to
// a hard space - that's the same as our 5.2
    case 0x2011:        // is a hard hyphen
        pStr = OOO_STRING_SVTOOLS_HTML_S_nbsp;

No idea what the TextConverter is, but if it has a problem then surely that is the place that needs fixing, rather than breaking HTML export?
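
The effect of the stacked case labels quoted above can be reproduced in isolation. The following is a minimal sketch, not the LibreOffice source: the function name `entityBeforePatch` is hypothetical, and the value "nbsp" for `OOO_STRING_SVTOOLS_HTML_S_nbsp` is an assumption, since the macro's definition is not shown in this report.

```cpp
#include <string>

// Assumed stand-in for OOO_STRING_SVTOOLS_HTML_S_nbsp; the real macro
// definition is not shown in this bug report.
static const char* const kNbsp = "nbsp";

// Hypothetical, simplified model of the pre-patch switch. The two case
// labels are stacked, so U+00A0 and U+2011 reach the same assignment.
const char* entityBeforePatch(char32_t c)
{
    const char* pStr = nullptr;
    switch (c)
    {
    case 0xA0:      // hard blank
    case 0x2011:    // hard hyphen, shares the nbsp branch
        pStr = kNbsp;
        break;
    default:        // no entity for any other character in this sketch
        break;
    }
    return pStr;
}
```

In this model, `entityBeforePatch(0x2011)` returns the same name as `entityBeforePatch(0xA0)`, which matches the symptom in the description: a non-break hyphen is exported as a non-break space.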
Comment 2 gordon.lack 2011-07-10 15:59:03 UTC
I've built with the following patch, and that results in the *correct* HTML (&#8209;) being output when a document is saved in HTML format.
I've also tested that a cut&paste of the resulting HTML document (when viewed in LO) into a new odt document (in LO) results in a non-break hyphen.

===== htmlout.cxx.diff =====

--- htmlout.cxx-orig    2011-05-19 11:58:05.000000000 +0100
+++ htmlout.cxx 2011-07-10 23:07:15.612747262 +0100
@@ -418,10 +418,15 @@
     switch( c )
     {
     case 0xA0:         // is a hard blank
+        pStr = OOO_STRING_SVTOOLS_HTML_S_nbsp;
+        break;
+// This was labelled as:
 //!! the TextConverter has a problem with this character - so change it to
 // a hard space - that's the same as our 5.2
+//   but that just breaks html output.  Setting the numeric html entity
+//   seems fine.
     case 0x2011:       // is a hard hyphen
-        pStr = OOO_STRING_SVTOOLS_HTML_S_nbsp;
+        pStr = "#8209";
         break;
     case 0xAD:         // is a soft hyphen
         pStr = OOO_STRING_SVTOOLS_HTML_S_shy;
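
The intended behaviour of the patched switch can be sketched as a standalone function. This is only an illustration under stated assumptions: `entityAfterPatch` and `asReference` are hypothetical names, the values "nbsp" and "shy" for the two macros are assumed, and the idea that the caller wraps the returned name in '&' and ';' is inferred from the patch storing "#8209" without them.

```cpp
#include <string>

// Assumed values for the two macros named in the patch; their real
// definitions are not shown in this bug report.
static const char* const kNbsp = "nbsp"; // stands in for OOO_STRING_SVTOOLS_HTML_S_nbsp
static const char* const kShy  = "shy";  // stands in for OOO_STRING_SVTOOLS_HTML_S_shy

// Hypothetical, simplified model of the patched switch: every character
// has its own branch, each ending in a break.
const char* entityAfterPatch(char32_t c)
{
    const char* pStr = nullptr;
    switch (c)
    {
    case 0xA0:      // hard blank
        pStr = kNbsp;
        break;
    case 0x2011:    // non-break hyphen, written as a numeric reference
        pStr = "#8209";
        break;
    case 0xAD:      // soft hyphen
        pStr = kShy;
        break;
    default:
        break;
    }
    return pStr;
}

// Hypothetical caller: wraps the name in '&' and ';'. Returns an empty
// string when the character has no entity in this sketch.
std::string asReference(char32_t c)
{
    const char* pStr = entityAfterPatch(c);
    return pStr ? "&" + std::string(pStr) + ";" : std::string();
}
```

With this model, `asReference(0x2011)` yields `&#8209;` while `asReference(0xA0)` still yields `&nbsp;`, so the two characters no longer collapse into one entity.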
Comment 3 gordon.lack 2011-10-06 16:27:45 UTC
Created attachment 52063
Patch against 3.4.3
Comment 4 Björn Michaelsen 2011-12-23 12:27:11 UTC
[This is an automated message.]
This bug was filed before the changes to Bugzilla on 2011-10-16. Thus it
started right out as NEW without ever being explicitly confirmed. The bug is
changed to state NEEDINFO for this reason. To move this bug from NEEDINFO back
to NEW please check if the bug still persists with the 3.5.0 beta1 or beta2 prereleases.
Details on how to test the 3.5.0 beta1 can be found at:

more detail on this bulk operation: http://nabble.documentfoundation.org/RFC-Operation-Spamzilla-tp3607474p3607474.html
Comment 5 Julien Nabet 2012-01-21 14:59:34 UTC
According to http://www.robinlionheart.com/stds/html4/spchars or http://en.wikipedia.org/wiki/Hyphen, shouldn't it be "#8208" instead of "#8209"?

Quote from the first source:
In addition to the soft hyphen, there is also a hard hyphen (&#8208; or &#x2010;) which always renders, and a nonbreaking hyphen character (&#8209; or &#x2011;), for hyphens that do not break words across lines.
Comment 6 gordon.lack 2012-01-22 04:13:56 UTC
>> shouldn't it be "#8208" instead of "#8209"

No!  As your quote notes, &#8209; is the nonbreaking hyphen character, which *is* what is needed here!  That is a hyphen which is *always* displayed but *never* breaks.

A soft hyphen is one at an optional break, which is only displayed if the break occurs, while a hard hyphen is one which is always displayed even if there is no break, but where a break is allowed.  So both of these allow breaks.  The whole point of a non-break hyphen is to prevent breaking, while still displaying a hyphen.
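
The three kinds of hyphen can be put side by side with their decimal numeric character references. This is a small sketch for illustration only: the struct `HyphenKind`, the table `kHyphens` and the helper `numericReference` are hypothetical names, and the two boolean columns simply restate the behaviour described in this comment.

```cpp
#include <string>

// Hypothetical record restating the behaviour described above.
struct HyphenKind {
    const char* name;
    char32_t    codePoint;
    bool        alwaysDisplayed; // rendered even when no line break occurs
    bool        allowsBreak;     // a line break is permitted at this character
};

static const HyphenKind kHyphens[] = {
    {"soft hyphen",      0x00AD, false, true},
    {"hard hyphen",      0x2010, true,  true},
    {"non-break hyphen", 0x2011, true,  false},
};

// Builds the decimal numeric character reference for a code point,
// for example 0x2011 -> "&#8209;".
std::string numericReference(char32_t cp)
{
    return "&#" + std::to_string(static_cast<unsigned long>(cp)) + ";";
}
```

Here `numericReference(0x2010)` gives `&#8208;` and `numericReference(0x2011)` gives `&#8209;`; only the latter entry has `allowsBreak` set to false.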
Comment 7 gordon.lack 2012-01-22 04:38:38 UTC
I can confirm that this bug still exists in:

 LibreOffice 3.5.0rc1 
 Build ID: b6c8ba5-8c0b455-0b5e650-d7f0dd3-b100c87
Comment 8 Caolán McNamara 2012-05-09 05:56:00 UTC
Sorry for the insane delay, this fell through the cracks :-(. Pushed now; the bug had been in since the very initial commit in 2000.

Mailing to the libreoffice@lists.freedesktop.org list is the best route to get a patch looked at FWIW
Comment 9 gordon.lack 2012-05-09 13:47:49 UTC
>> caolanm->gordon: can you confirm your patch is under our preferred LGPLv3+/MPL+ license combination ?

Yes - that's fine.  I'm happy to transfer all ownership to you (or anyone) to do with as you wish, so that will do.
Comment 10 Caolán McNamara 2012-05-09 14:55:59 UTC
great, thanks, added you to...
if you want to review those details in case I got them wrong.

It's insanely wrong the patch lingered so long, apologies.