Bug 108806 - DOCX IMPORT: line break appears in a specific document that is absent in Word
Status: RESOLVED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer
Version (earliest affected): unspecified
Hardware: All All
Importance: medium normal
Assignee: Mike Kaganski
URL:
Whiteboard: target:6.0.0
Keywords:
Depends on:
Blocks:
 
Reported: 2017-06-27 05:40 UTC by Mike Kaganski
Modified: 2017-08-28 07:28 UTC

See Also:
Crash report or crash signature:


Attachments
A sanitized DOCX that has no line breaks in Word (1.28 KB, application/vnd.openxmlformats-officedocument.wordprocessingml.document)
2017-06-27 05:40 UTC, Mike Kaganski

Description Mike Kaganski 2017-06-27 05:40:22 UTC
Created attachment 134305
A sanitized DOCX that has no line breaks in Word

The attachment does not have a line break in Word. Its single paragraph reads
> First part of a line (before CRLF). Second part of the same line (after CRLF).

When opened in LO, it is split into two lines separated by a line break:
> First part of a line (before CRLF).
> Second part of the same line (after CRLF).

This happens because the markup (document.xml) contains a line break (CRLF), which Word converts into a space but LibreOffice treats as a line break.
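
To check whether a given document.xml is affected, one can look for raw line breaks inside w:t text runs. The following is a minimal Python sketch, not taken from the report or the patch: SAMPLE is a hypothetical stand-in for the attachment (its exact markup is not reproduced here), and runs_with_line_breaks is an illustrative helper name. XML parsers normalize CRLF to LF on input, so the check looks for either character.

```python
import xml.etree.ElementTree as ET

# Main WordprocessingML namespace used by document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical markup: one paragraph, one run, a CRLF inside the text.
SAMPLE = (
    '<w:document xmlns:w="' + W_NS + '"><w:body><w:p><w:r>'
    '<w:t xml:space="preserve">First part of a line (before CRLF).\r\n'
    'Second part of the same line (after CRLF).</w:t>'
    '</w:r></w:p></w:body></w:document>'
)


def runs_with_line_breaks(document_xml: str) -> list:
    """Return the text of every w:t element that contains a raw CR or LF."""
    root = ET.fromstring(document_xml)
    return [
        t.text
        for t in root.iter("{" + W_NS + "}t")
        if t.text and ("\n" in t.text or "\r" in t.text)
    ]


if __name__ == "__main__":
    for text in runs_with_line_breaks(SAMPLE):
        print(repr(text))
```

For a real file, the same function could be fed the contents of word/document.xml read from the DOCX archive with the standard zipfile module.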

Word's behavior should be considered correct, as described at https://msdn.microsoft.com/en-us/library/ms256097 and ECMA-376-1:2016 17.3.3.31 (although I found no documentation that specifically discusses CRLF in WordprocessingML). This is consistent with the xml:space="preserve" attribute used in the file, as documented at https://www.w3.org/TR/xml/#sec-white-space.
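
The expected result can be sketched as a normalization step applied to run text on import. This is only an illustration in Python, under the assumption that each CRLF, lone CR, or lone LF becomes exactly one space while all other whitespace is kept as-is; it is not the code of any actual fix, and normalize_run_text is a hypothetical name.

```python
import re

# Match CRLF first so that a CR+LF pair yields one space, not two.
_LINE_BREAK = re.compile(r"\r\n|\r|\n")


def normalize_run_text(text: str) -> str:
    """Replace each raw line break in run text with a single space.

    Assumption for this sketch: spaces and tabs are left untouched,
    in keeping with xml:space="preserve".
    """
    return _LINE_BREAK.sub(" ", text)


if __name__ == "__main__":
    raw = (
        "First part of a line (before CRLF).\r\n"
        "Second part of the same line (after CRLF)."
    )
    print(normalize_run_text(raw))
```

Applied to the text from the attachment, this yields the single line that Word shows.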
Comment 1 Mike Kaganski 2017-06-27 05:46:34 UTC
A patch has been submitted: https://gerrit.libreoffice.org/39286
Comment 2 Xisco Faulí 2017-06-27 09:02:34 UTC
Moving to ASSIGNED
Comment 3 Commit Notification 2017-06-27 09:30:39 UTC
Mike Kaganski committed a patch related to this issue.
It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=6124490c1e486d648d75cd1c3f7f4e793fb1d1c0

tdf#108806: convert CRLF into space in OOXML text

It will be available in 6.0.0.

The patch should be included in the daily builds available at
http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
http://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.