Bug 136246 - [RTF] Import: Txt-Table is messed up
Summary: [RTF] Import: Txt-Table is messed up
Status: NEW
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: filters and storage
Version (earliest affected): Inherited From OOo
Hardware: All All
Importance: medium normal
Assignee: Not Assigned
URL:
Whiteboard:
Keywords: filter:rtf
Depends on:
Blocks: RTF
Reported: 2020-08-28 21:36 UTC by Dennis Roczek
Modified: 2022-09-05 16:03 UTC

See Also:
Crash report or crash signature:


Attachments
Screenshot of the table in LO64 (86.91 KB, image/png), 2020-08-28 21:37 UTC, Dennis Roczek
Screenshot in LO5 (121.66 KB, image/png), 2020-08-28 21:37 UTC, Dennis Roczek
Abiword (186.69 KB, image/png), 2020-08-28 21:38 UTC, Dennis Roczek
Wordpad in Windows 10 (1909) (78.97 KB, image/png), 2020-08-28 21:38 UTC, Dennis Roczek
Correct rendering in MSO Word (83.73 KB, image/png), 2020-08-28 21:39 UTC, Dennis Roczek
Problematic File (21.64 KB, application/rtf), 2020-08-28 21:39 UTC, Dennis Roczek
somewhat fixed file using an old LibreOffice version (8.15 KB, application/rtf), 2020-08-28 21:40 UTC, Dennis Roczek

Description Dennis Roczek 2020-08-28 21:36:08 UTC
Description:
This Bug Report was reported to de-discuss and can be confirmed.

Version: 6.4.4.2 (x64)
Build-ID: 3d775be2011f3886db32dfd395a6a6d1ca2630ff
CPU-Threads: 4; OS: Windows 10.0 Build 18363; UI-Render: GL; VCL: win; 
Locale: de-DE (de_DE); UI-Language: de-DE
Calc: threaded

MMS Bila 5.0 is a small ERP system in Germany. You can grab a test version at http://www.mmsgmbh.de/finanzbuchhaltung.html .

According to the mailing list, this export was displayed correctly in LibreOffice versions before 6.4.0, although I cannot confirm that at the moment with 5.4.7 (PortableApps version).

For what it is worth: Wordpad in Windows 10 and Abiword 2.8.6 (see the respective screenshots) also have problems reading that file; only MSO Word (365 version) is able to read it correctly.

So basically: the content is completely messed up as the table columns are not in the correct order. 

Steps to Reproduce:
1. Open the attached RTF file.


Actual Results:
messed up table

Expected Results:
correct table


Reproducible: Always


User Profile Reset: Yes



Additional Info:
.
Comment 1 Dennis Roczek 2020-08-28 21:37:24 UTC
Created attachment 164829
Screenshot of the table in LO64
Comment 2 Dennis Roczek 2020-08-28 21:37:44 UTC
Created attachment 164830
Screenshot in LO5
Comment 3 Dennis Roczek 2020-08-28 21:38:00 UTC
Created attachment 164831
Abiword
Comment 4 Dennis Roczek 2020-08-28 21:38:33 UTC
Created attachment 164832
Wordpad in Windows 10 (1909)
Comment 5 Dennis Roczek 2020-08-28 21:39:03 UTC
Created attachment 164833
Correct rendering in MSO Word
Comment 6 Dennis Roczek 2020-08-28 21:39:29 UTC
Created attachment 164834
Problematic File
Comment 7 Dennis Roczek 2020-08-28 21:40:35 UTC
Created attachment 164835
somewhat fixed file using an old LibreOffice version
Comment 8 Telesto 2020-08-29 12:55:03 UTC
Confirmed with 7.1.

Also reproducible in:
LibreOffice 3.3.0
OOO330m19 (Build:6)
tag libreoffice-3.3.0.4
Comment 9 Telesto 2020-08-29 12:58:30 UTC
@Miklos
Is there some way to validate RTF files? The RTF document attached to this bug looks dubious quality-wise.
Comment 10 Miklos Vajna 2020-08-31 08:14:40 UTC
I'm not aware of anything like that. If Word opens the file, we're expected to do the same.
Comment 11 Telesto 2020-08-31 08:19:30 UTC
(In reply to Miklos Vajna from comment #10)
> I'm not aware of anything like that. If Word opens the file, we're expected
> to do the same.

For the record: the file can be opened. The point is rather how everything is presented on screen.
Comment 12 Dennis Roczek 2020-09-06 21:20:41 UTC
Oooh, I just realized: the problem is not the content itself, it is the tab character!

If it is replaced by a whitespace character using search and replace, the table is /mostly/ displayed correctly.

So basically it is in the file itself: \u8198\'20, which the latest RTF spec (1.9.1) describes as follows:

------------------------------------
\uN This keyword represents a single Unicode character that has no equivalent ANSI representation
based on the current ANSI code page. N represents the Unicode character value expressed as a
decimal number.
This keyword is followed immediately by equivalent character(s) in ANSI representation. In this
way, old readers will ignore the \uN keyword and pick up the ANSI representation properly.
When this keyword is encountered, the reader should ignore the next N' characters, where N'
corresponds to the last \ucN' value encountered.
As with all RTF keywords, a keyword-terminating space may be present (before the ANSI
characters) that is not counted in the characters to skip. While this is not likely to occur (or
recommended), a \binN keyword, its argument, and the binary data that follows are considered
one character for skipping purposes. If an RTF scope delimiter character (that is, an opening or
closing brace) is encountered while scanning skippable data, the skippable data is considered to
end before the delimiter. This makes it possible for a reader to perform some rudimentary error
recovery. To include an RTF delimiter in skippable data, it must be represented using the
appropriate control symbol (that is, escaped with a backslash,) as in plain text. Any RTF control
word or symbol is considered a single character for the purposes of counting skippable characters.

An RTF writer, when it encounters a Unicode character with no corresponding ANSI character,
should output \uN followed by the best ANSI representation it can manage. Often a question
mark is used if no reasonable ANSI character exists. In addition, if the Unicode character
translates into an ANSI character stream with a count of bytes differing from the current Unicode
Character Byte Count, it should emit the appropriate \ucN keyword prior to the \uN keyword to
notify the reader of the change.
Most RTF control words accept signed 16-bit numbers as arguments. For these control words,
Unicode values greater than 32767 are expressed as negative numbers. For example, the
character code U+F020 is given by \u-4064. To get -4064, convert F020₁₆ to decimal (61472)
and subtract 65536.
Occasionally Word writes SYMBOL_CHARSET (nonUnicode) characters in the range
U+F020..U+F0FF instead of U+0020..U+00FF. Internally Word uses the values U+F020..U+F0FF
for these characters so that plain-text searches don’t mistakenly match SYMBOL_CHARSET
characters when searching for Unicode characters in the range U+0020..U+00FF. To find out the
correct symbol font to use, e.g., Wingdings, Symbol, etc., find the last SYMBOL_CHARSET font
control word \fN used, look up font N in the font table and find the face name. The charset is
specified by the \fch

------------------------------------
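
The skipping rule quoted above can be sketched roughly as follows. This is only a minimal Python illustration, not LibreOffice's importer: decode_fragment is a hypothetical helper that handles just \ucN, \uN and \'hh in a flat fragment, treats every other character as literal text, does not stop skipping at braces, and assumes Windows-1252 for \'hh bytes.

```python
import re

# Hypothetical helper for illustration; not taken from LibreOffice.
# One token is: \ucN or \uN (each with an optional delimiting space),
# a \'hh hex escape, or a single literal character.
TOKEN = re.compile(r"\\uc(\d+) ?|\\u(-?\d+) ?|\\'([0-9a-fA-F]{2})|(.)", re.DOTALL)


def decode_fragment(fragment, codepage="cp1252"):
    """Decode a flat RTF fragment, honouring the \\uN / \\ucN skip rule."""
    out = []
    uc = 1    # default \uc value: one fallback character follows each \uN
    skip = 0  # fallback tokens that still have to be ignored
    for match in TOKEN.finditer(fragment):
        uc_arg, u_arg, hex_arg, literal = match.groups()
        if skip > 0:
            # Any token (character, \'hh or control word) counts as one.
            skip -= 1
        elif uc_arg is not None:
            uc = int(uc_arg)
        elif u_arg is not None:
            n = int(u_arg)
            if n < 0:
                n += 65536  # signed 16-bit argument, e.g. -4064 -> 0xF020
            out.append(chr(n))
            skip = uc
        elif hex_arg is not None:
            out.append(bytes([int(hex_arg, 16)]).decode(codepage))
        else:
            out.append(literal)
    return "".join(out)
```

With the sequence from the problematic file, decode_fragment(r"A\u8198\'20B") yields "A", U+2006, "B": the \'20 fallback is skipped because a Unicode-aware reader already consumed \u8198.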

So since LibreOffice /seems/ not to recognize \u8198, it should simply display the fallback whitespace (\'20).

I guess 8198 is that character https://www.codetable.net/decimal/8198 ("Six-Per-Em Space", but isn't this \u2006?!?).
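
On the decimal versus hexadecimal question: the RTF \uN argument is decimal, while the U+XXXX notation is hexadecimal, so both refer to the same code point. A small Python check (describe_rtf_unicode is a hypothetical name used only for this illustration):

```python
import unicodedata


def describe_rtf_unicode(n):
    """Map a decimal \\uN argument to its U+XXXX notation and Unicode name."""
    if n < 0:
        n += 65536  # RTF writes values above 32767 as negative numbers
    return f"U+{n:04X}", unicodedata.name(chr(n))


# 8198 decimal is 2006 hexadecimal, i.e. U+2006 SIX-PER-EM SPACE.
code_point, name = describe_rtf_unicode(8198)
```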

So why do we not recognize that character? *g*
Comment 13 Dennis Roczek 2020-09-06 21:35:59 UTC
I forgot to mention: after replacing the Unicode character plus whitespace with a plain whitespace using search and replace, it looks nearly as good as in MSO Word!

Next question that comes to my mind: why is that six-per-em space not recognized as a character separator? (Yeah, I know: another ticket!)
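
For anyone who wants to reproduce the workaround, a rough Python sketch follows. It assumes the search and replace is applied to the raw RTF source (the comment does not say where it was done), and it writes the replacement as the hex escape \'20 instead of a bare space so that it cannot be swallowed as a control-word delimiter. The function name is hypothetical.

```python
# Sequence quoted from the problematic file: six-per-em space plus its
# ANSI fallback (a hex-escaped space).
SIX_PER_EM = r"\u8198\'20"
# Assumption: a hex-escaped plain space is an acceptable replacement.
PLAIN_SPACE = r"\'20"


def replace_six_per_em(rtf_source):
    """Return the RTF source with every \\u8198\\'20 replaced by \\'20."""
    return rtf_source.replace(SIX_PER_EM, PLAIN_SPACE)
```

This only works around the rendering problem in a copy of the file; it does not answer why the importer mishandles the original sequence.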