Created attachment 61591
Illustration of problem with converting richtext to plaintext. Possible Solution
Is there a way to convert a document with rich text (.doc, .odx, .rtf) to plain text (.txt) AND keep the original paragraph alignment?
[document attached explaining this next paragraph:
It doesn't have to be perfect. Keeping only the alignment of each paragraph's first line by means of spaces or tabs would be OK. Preferably, however, there would be a way to tell the script the anticipated page dimensions so it can work out where the soft line breaks fall in the text (i.e., with 1" margins on 8.5x11 paper, the program would calculate where to insert the hard line breaks, which would be at 7.5 inches). The drawback is that if the user changes the font or the font size, this will throw off the formatting. So it might be best to stick with only inserting spaces for the first line, relative to the indentation in the rich text document.]
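To make the margin arithmetic concrete, here is a rough sketch of how a script might turn page dimensions into a hard-wrap column. It is not anything the LibreOffice filter does today. It assumes a monospaced font at 10 characters per inch, and the names `CHARS_PER_INCH`, `wrap_column` and `hard_wrap` are made up for illustration. With 1" margins on 8.5" wide paper, the right margin sits 7.5" from the left page edge, which leaves 6.5" of text width.

```python
import textwrap

# Assumption: monospaced output at 10 characters per inch.
CHARS_PER_INCH = 10


def wrap_column(page_width_in, left_margin_in, right_margin_in, cpi=CHARS_PER_INCH):
    """Return how many characters fit between the left and right margins."""
    text_width_in = page_width_in - left_margin_in - right_margin_in
    return round(text_width_in * cpi)


def hard_wrap(paragraph, page_width_in=8.5, margin_in=1.0):
    """Split a paragraph into lines no longer than the computed column width."""
    width = wrap_column(page_width_in, margin_in, margin_in)
    return textwrap.wrap(paragraph, width=width)


if __name__ == "__main__":
    sample = "sed ut perspiciatis unde omnis iste natus error sit " * 4
    for line in hard_wrap(sample):
        print(line)
```

As noted above, the result only matches the original layout if the viewer uses the assumed character width; a different font or size shifts the break positions.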
If using spaces/tabs to preserve some part of the rich text layout, would there be a method to distinguish between the tabs/spaces which are part of the original document and those which are merely used to preserve the original alignment?
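One conceivable way to keep the two kinds of whitespace apart, offered only as a sketch, is to use a different character for the layout padding than for the document's own spaces and tabs. The example below uses NO-BREAK SPACE (U+00A0) as the padding character and assumes the original text never starts a line with that character. `PAD`, `pad_line` and `strip_padding` are hypothetical names.

```python
# Assumption: U+00A0 (NO-BREAK SPACE) does not occur at the start of lines
# in the original text, so it can safely mark layout-only padding.
PAD = "\u00a0"


def pad_line(line, columns):
    """Prefix a line with layout padding that is distinguishable from real spaces."""
    return PAD * columns + line


def strip_padding(line):
    """Remove only the layout padding, leaving original leading tabs/spaces intact."""
    return line.lstrip(PAD)


if __name__ == "__main__":
    original = "\tA line that really starts with a tab."
    padded = pad_line(original, 5)
    print(repr(padded))
    print(strip_padding(padded) == original)
```

Most editors render U+00A0 like an ordinary space, so the padded text still looks indented, but a later pass can tell the padding from the author's own whitespace.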
For example, assume a rich text document has indentations created NOT using spaces or tabs, but only the indentation settings on the ruler. It would convert to this:
Line one with no indent. And some text that wraps: "sed ut perspiciatis unde omnis iste natus error sit..."
Line 2 with .5" indent. And some text that wraps: "sed ut perspiciatis unde
omnis iste natus error sit..."
Line 3 with 1" indent. And some text that wraps: "sed ut perspiciatis
unde omnis iste natus error sit..."
As alluded to above, one possible solution would be to write a script that identifies the indentation of each line (i.e., .5", 1", 1.5"...) and inserts the correct number of spaces to fill that length.
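A minimal sketch of such a script is shown below. It assumes the indentation of each paragraph is already known in inches (reading it out of the rich text document is not covered here) and that the output is viewed in a monospaced font at 10 characters per inch. The names `indent_to_spaces` and `render_paragraph` and the default width of 65 columns are illustrative assumptions, not part of any existing filter.

```python
import textwrap

# Assumption: monospaced output at 10 characters per inch.
CHARS_PER_INCH = 10


def indent_to_spaces(indent_in, cpi=CHARS_PER_INCH):
    """Convert a ruler indentation in inches to a run of spaces."""
    return " " * round(indent_in * cpi)


def render_paragraph(text, indent_in=0.0, width=65, first_line_only=False):
    """Wrap a paragraph and indent it with spaces.

    With first_line_only=True only the first line is indented, which is the
    simpler fallback described in the report.
    """
    pad = indent_to_spaces(indent_in)
    return textwrap.fill(
        text,
        width=width,
        initial_indent=pad,
        subsequent_indent="" if first_line_only else pad,
    )


if __name__ == "__main__":
    # Hypothetical (indent in inches, text) pairs standing in for a parsed document.
    paragraphs = [
        (0.0, 'Line one with no indent. And some text that wraps: "sed ut perspiciatis unde omnis iste natus error sit..."'),
        (0.5, 'Line 2 with .5" indent. And some text that wraps: "sed ut perspiciatis unde omnis iste natus error sit..."'),
        (1.0, 'Line 3 with 1" indent. And some text that wraps: "sed ut perspiciatis unde omnis iste natus error sit..."'),
    ]
    for indent_in, text in paragraphs:
        print(render_paragraph(text, indent_in))
```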
Hello Matthew, *,
I am not sure if I understand you correctly, but after opening your attachment in Writer, using "Save as..." to convert it to txt, and reopening the txt file in kate 3.8.4 (from KDE 4.8.4) under Debian Testing AMD64, the txt looks nearly identical in both editors.
Would you be so kind as to test it with a newer version of LO than 3.5.3, please? Which OS/architecture and which viewer/editor are you using for the text file? I have used LO Version: 18.104.22.168 Build ID: 1b3956717a60d6ac35b133d7b0a0f5eb55e9155 under Debian Testing AMD64 and kate 3.5.3, as written before ... ;) If you did the conversion in a different way, it would be nice if you could give us a clearer step-by-step description ... ;)
But consider what is cited on http://en.wikipedia.org/wiki/Plain_text#Plain_text.2C_the_Unicode_definition:
«Plain text represents character content only, not its appearance.»
«If the same plain text sequence is given to disparate rendering processes, there is no expectation that rendered text in each instance should have the same appearance.»
Given that, I am not really sure if our developers could do anything about it ... :(
Sorry for the inconvenience
Dear Bug Submitter,
Please read the entire message before proceeding.
This bug has been in NEEDINFO status with no change for at least 6 months. Please provide the requested information as soon as possible and mark the bug as UNCONFIRMED. Due to regular bug tracker maintenance, if the bug is still in NEEDINFO status with no change in 30 days the QA team will close the bug as INVALID due to lack of needed information.
For more information about our NEEDINFO policy please read the wiki located here:
If you have already provided the requested information, please mark the bug as UNCONFIRMED so that the QA team knows that the bug is ready to be confirmed.
Thank you for helping us make LibreOffice even better for everyone!
Dear Bug Submitter,
Please read this message in its entirety before proceeding.
Your bug report is being closed as INVALID due to inactivity and a lack of information which is needed in order to accurately reproduce and confirm the problem. We encourage you to retest your bug against the latest release. If the issue is still present in the latest stable release, we need the following information (please ignore any that you've already provided):
a) Provide details of your system, including your operating system and the latest version of LibreOffice in which you have confirmed the bug to be present
b) Provide easy to reproduce steps – the simpler the better
c) Provide any test case(s) which will help us confirm the problem
d) Provide screenshots of the problem if you think it might help
e) Read all comments and provide any requested information
Once all of this is done, please set the bug back to UNCONFIRMED and we will attempt to reproduce the issue.
Please do not:
a) respond via email
b) update the version field in the bug or any of the other details on the top section of FDO
An enhancement request.
Never confirmed so moving to UNCONFIRMED for QA team to evaluate. Thanks!
(In reply to Matthew B from comment #0)
> Is there a way to convert a document with rich text (.doc, .odx, .rtf) to
> plain text (.txt) AND keep the original paragraph alignment?
> For example, assume a rich text document has indentations created NOT using
> spaces or tabs, but only the indentation settings on the ruler. It would
> convert to this:
> Line one with no indent. And some text that wraps: "sed ut perspiciatis unde
> omnis iste natus error sit..."
> Line 2 with .5" indent. And some text that wraps: "sed ut perspiciatis unde
> omnis iste natus error sit..."
> Line 3 with 1" indent. And some text that wraps: "sed ut perspiciatis
> unde omnis iste natus error sit..."
> As alluded to above, one possible solution would be to write a script that
> identifies the indentation of each line (ie .5", 1", 1.5"...) and places the
> correct number of spaces to fill that length.
This seems like one plausible approach to producing plain text output that retains higher fidelity to the original document than the filter currently provides.
Status -> NEW