Bug 144892 - new-page blank space put in many pages when converting from Word 4 on macOS
Summary: new-page blank space put in many pages when converting from Word 4 on macOS
Status: NEW
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: LibreOffice
Version (earliest affected): 7.2.1.2 release
Hardware: All All
Importance: medium normal
Assignee: Not Assigned
URL:
Whiteboard:
Keywords: filter:doc
Depends on:
Blocks: DOC
Reported: 2021-10-03 10:59 UTC by David Sherman
Modified: 2023-10-12 04:43 UTC (History)
CC List: 3 users

See Also:
Crash report or crash signature:


Attachments
Sample file to convert from Word 4 format (318.00 KB, application/msword)
2021-10-03 11:03 UTC, David Sherman

Description David Sherman 2021-10-03 10:59:39 UTC
Description:
I write my files using Word 4.00E (Word 4) for the Mac, released around 1987. I use LibreOffice only to convert them to .docx format, since the latest version of MS Word no longer supports converting from Word 4 and 5. I then save the file and open it with Word. Consistently, somewhere near or after the end of each page, LibreOffice has inserted a character that forces a new page. But from Word I can't copy and search/replace this character. All I can do is go through my file page by page and delete each special character that is causing a page break. This happens every time, with every file I convert.

Steps to Reproduce:
1. Create a file in Word 4.0 (I can send you a sample).
2. Open it with LibreOffice and save it as a .docx Word file
3. Open it with any recent version of Word

Actual Results:
I get the converted file, but the page-break characters are in there on almost every page, just before or after the end of the page.

Expected Results:
The converted file should not include these weird page-break characters that I can't even search for.


Reproducible: Always


User Profile Reset: Yes



Additional Info:
I (ds@davidsherman.ca) can email you a sample file to convert.
Comment 1 David Sherman 2021-10-03 11:03:05 UTC
Created attachment 175473
Sample file to convert from Word 4 format

If you open this file with LibreOffice (which recognizes it as Word 4 format), then save it as .docx, you'll see all the page-break chars throughout.
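
For anyone reproducing this without the GUI, the open-and-save step can probably be scripted with LibreOffice's headless converter. The following is a minimal sketch, not a tested recipe for this particular file: the `soffice` path is the usual macOS install location and is an assumption, the helper names are made up for illustration, and it assumes the headless converter selects the same Word for Mac import filter that the GUI does.

```python
import subprocess
from pathlib import Path

# Assumed default install location on macOS; adjust if LibreOffice lives elsewhere.
MAC_SOFFICE = "/Applications/LibreOffice.app/Contents/MacOS/soffice"


def build_convert_command(source, outdir, soffice=MAC_SOFFICE):
    """Return the argument list for a headless conversion of `source` to .docx."""
    return [
        str(soffice),
        "--headless",
        "--convert-to", "docx",
        "--outdir", str(outdir),
        str(source),
    ]


def convert(source, outdir, soffice=MAC_SOFFICE):
    """Run the conversion and return the path where the .docx is expected to appear."""
    outdir = Path(outdir)
    outdir.mkdir(parents=True, exist_ok=True)
    subprocess.run(build_convert_command(source, outdir, soffice), check=True)
    return outdir / (Path(source).stem + ".docx")
```

Calling `convert("sample.doc", "converted")` should leave `converted/sample.docx` on disk if the conversion succeeds; whether the stray page breaks show up in that output the same way as in the GUI has not been checked here.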
Comment 2 Michael Warner 2021-10-04 13:17:31 UTC
Comment on attachment 175473
Sample file to convert from Word 4 format

Changed MIME type to application/msword
Comment 3 Michael Warner 2021-10-04 13:59:03 UTC
I see this issue in both the Word 4 file and the converted .docx.
Comment 4 Michael Warner 2021-10-04 14:08:17 UTC
As a workaround, https://ask.libreoffice.org/t/remove-all-manual-page-breaks/59715/8 lists some extensions and macros that can remove all manual page breaks.
Comment 5 Michael Warner 2021-10-04 14:10:24 UTC
(In reply to Michael Warner from comment #3)
> I see this issue in both the Word 4 file and the converted .docx.

Version: 7.2.0.4 (x64) / LibreOffice Community
Build ID: 9a9c6381e3f7a62afc1329bd359cc48accb6435b
CPU threads: 6; OS: Windows 6.1 Service Pack 1 Build 7601; UI render: Skia/Raster; VCL: win
Locale: en-US (en_US); UI: en-US
Calc: threaded
Comment 6 David Sherman 2021-10-04 17:00:25 UTC
Thanks. What do you mean you see the problem in the original Word file? What special character are you seeing?

I see no problem in the original Word file. I've been writing these Word files (using Word 4) for over 30 years and they're all fine, and I have tens of thousands of such files. I never had a problem when converting them to modern versions of Word (using a more modern version of Word itself for the conversion) until Word stopped supporting this and I started using LibreOffice to do the conversion. Every single file I convert with LibreOffice, from a 2-page file to a longer one like the sample I uploaded, has the same problem.
Comment 7 Michael Warner 2021-10-05 13:52:59 UTC
When I first read Comment 1, what I thought you were saying is that the hard page breaks only appeared in the docx that you opened in a recent version of Word, causing me to suspect the docx export filter. But when I tried it, I found that I saw them in LibreOffice as soon as I opened the .doc, which would instead implicate the import filter. This is what I meant in Comment 3. I have not done any analysis on the actual files.
Comment 8 David Sherman 2021-10-05 14:19:45 UTC
Thanks. Evidently the problem must be in the import filter then.
Comment 9 Alex Thurgood 2021-10-11 09:50:10 UTC
When I open your test file in

Version: 7.2.1.2 / LibreOffice Community
Build ID: 87b77fad49947c1441b67c559c339af8f3517e22
CPU threads: 8; OS: Mac OS X 10.16; UI render: default; VCL: osx
Locale: fr-FR (fr_FR.UTF-8); UI: fr-FR
Calc: threaded

I immediately see a number of manually inserted page breaks (dashed line) separating various pages, in addition to the natural page breaks determined by the page dimensions.

I don't know whether this is as a result of the original file using 2 particular page styles with a first page style separated by a page break followed by the second page style, or whether you have inserted the page breaks yourself in the original Word 4.0 document (or perhaps Word 4 has done it for you, unasked?).

If you want to delete those manually inserted page breaks before converting your document to DOCX, the only way I know to do this is to place the cursor on the paragraph after the page break and press the backspace key, which is hugely annoying and extremely tedious.

However, there is apparently a solution : use the Alternative Find/Replace dialog extension :

https://ask.libreoffice.org/t/search-and-replace-manual-page-break/20248
Comment 10 Alex Thurgood 2021-10-11 10:10:51 UTC
Inspecting the styles recognized and imported into LO doesn't seem to show anything of note.

Guess the only solution is to install the AltFindReplace extension.
Comment 11 David Sherman 2021-10-11 10:29:14 UTC
Thank you, Alex. I don't understand how this has happened. I have been writing in Word 4.0 since 1988 and using the same version I'm using now for 30 years. I have written tens of thousands of files (I'm a lawyer and professional author). Every single file I convert with LibreOffice has these page breaks all over the place. This never happened with other conversion tools (later versions of MS Word) over the years until MS Word stopped supporting this format. So I can't believe that Word 4 has inserted random page breaks. It's mystifying.
Comment 12 David Sherman 2021-10-11 10:36:12 UTC
Re the advice to use the ALT find/replace extension, I'm sorry, but I'm really unfamiliar with LibreOffice (all I've used it for is opening and reading in the Word 4 files and saving them in .docx format) and I don't understand what I'm supposed to do. I'm using LibreOffice on a Mac. The link you provided doesn't make clear what I'm supposed to be doing. Is this something I'm supposed to write and install into my LibreOffice, or a simple search/replace command I can run on the file after I convert it? If the latter, what specifically should I be doing?

I tried Edit > Find and Replace ...    and searching for   \m   and replacing with   \r   but nothing matches.

Thanks.
Comment 13 Alex Thurgood 2021-10-11 10:37:53 UTC
Hi David,

I'm a lawyer too, and I'm not saying it isn't the LO import filter, especially if previous iterations of LibreOffice didn't do this. Unfortunately, I don't have a Word 4 for Mac version with which to test, only the latest version of MSOffice for Mac (which tells me, as you have already indicated, that the opening of this format is blocked in current versions of MSWord).

Do you perchance have any results of tests with that file against earlier versions of LibreOffice, where those page breaks do not appear ? That at least would point to a possible regression in the import filter ?
Comment 14 Alex Thurgood 2021-10-11 10:45:18 UTC
(In reply to David Sherman from comment #12)

> search/replace command I can run on the file after I convert it? If the
> latter, what specifically should I be doing?
> 

You would need to install the AltFindReplace extension first, which is available here :

https://extensions.libreoffice.org/en/extensions/show/alternative-dialog-find-replace-for-writer

The usual way to install these extensions is to download the extension package via the download link (the file has an oxt file extension).

Drag and drop the oxt file onto a running instance of LibreOffice, for example, onto the LibreOffice app icon in the Dock.

The Extension Manager GUI should be displayed and should ask you to confirm that you wish to install the extension. If the extension installs successfully, you will be asked whether you want to restart LibreOffice in order for the extension to be recognised for future use. On restart, when you open a Writer document, you should see a pair of bright green binoculars in the main toolbar on the left-hand side. Clicking on those binoculars will bring up the Alternative Search & Replace tool. It is here that you should try entering the commands given on the Ask page I linked to earlier.
Comment 15 Alex Thurgood 2021-10-11 10:59:10 UTC
(In reply to Alex Thurgood from comment #14)
> (In reply to David Sherman from comment #12)
>
> It is here that you should
> try entering the commands given in Ask link page I referred to earlier.

In the AltSearchReplace dialog, instead of typing in \m as the search string (which didn't find any occurrences), use the dropdown menu Regular Expressions and select the appropriate command from the dropdown menu. This works for me with:

Version: 7.2.1.2 / LibreOffice Community
Build ID: 87b77fad49947c1441b67c559c339af8f3517e22
CPU threads: 8; OS: Mac OS X 10.16; UI render: default; VCL: osx
Locale: fr-FR (fr_FR.UTF-8); UI: fr-FR
Calc: threaded

Unfortunately, although it finds and replaces 138 occurrences, the changes are not displayed, even after saving as ODT or DOCX...
Comment 16 David Sherman 2021-10-11 11:17:16 UTC
Thanks very much, Alex. The Mac made the extension install even easier than you had indicated; once I downloaded it, it opened LibreOffice and invited me to install it, then invited me to restart LibreOffice.

And I was able to get the changes to run using the Regular Expressions menu entry as you indicated.

Much appreciated. I can live with running this workaround each time I convert a file. (Though I still don't know where the page breaks came from suddenly.)
Comment 17 QA Administrators 2023-10-12 03:17:12 UTC Comment hidden (obsolete)
Comment 18 David Sherman 2023-10-12 04:43:29 UTC
I just received a message from Bugzilla asking that I check whether this bug is still present. It is. I have downloaded the latest version of LibreOffice (7.6.2.1), running on a MacBook Pro with macOS 12.7, and the bug is the same.

I use LibreOffice several times a week to convert Word 4 files to .docx (to send to others), and the same bug still appears. A hard page break gets inserted at the end of a line, more or less near the end of each page but not in any consistent place, and definitely not at the page-break point.

I have a workaround in Word (after saving the .docx from LibreOffice), which is simply to search for every ^m and delete it, but a global delete won't work because, if the hard break was inserted in the middle of a paragraph, I have to reconnect the split paragraphs into one, whereas if it wasn't, I just delete the hard break. So I have to search for each instance of ^m manually and fix it. It's a bit annoying but manageable, and I have no alternative. I have tens of thousands of files I've written over 35 years in Word 4, and I still access many of them regularly and still write new documents in Word 4 for efficiency.
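
The "delete every ^m" half of that workaround could in principle be scripted against the saved .docx, which is a zip archive whose body text lives in `word/document.xml`. The sketch below is only an illustration under stated assumptions: it assumes the stray breaks are stored as `<w:br w:type="page"/>` elements (what Word's ^m normally matches) and that `document.xml` is UTF-8, and the function names are hypothetical. It does not rejoin paragraphs that were split by a break, so the manual reconnection step described above is still needed, and it leaves any `<w:pageBreakBefore/>` paragraph properties untouched.

```python
import re
import zipfile

# Matches a self-closing explicit page break such as <w:br w:type="page"/>.
# Plain line breaks (<w:br/>) are deliberately left alone.
PAGE_BREAK = re.compile(r'<w:br\b[^>]*\bw:type="page"[^>]*/>')


def strip_page_breaks(document_xml):
    """Return (cleaned_xml, number_of_page_breaks_removed)."""
    return PAGE_BREAK.subn("", document_xml)


def clean_docx(src, dst):
    """Copy the .docx at `src` to `dst`, removing page breaks from the body.

    `src` and `dst` may be paths or binary file objects.
    Returns the number of page breaks removed.
    """
    removed = 0
    with zipfile.ZipFile(src) as zin, \
         zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zout:
        for item in zin.infolist():
            data = zin.read(item.filename)
            if item.filename == "word/document.xml":
                # Assumption: the part is UTF-8 encoded.
                text, removed = strip_page_breaks(data.decode("utf-8"))
                data = text.encode("utf-8")
            zout.writestr(item, data)
    return removed
```

Running `clean_docx("converted.docx", "cleaned.docx")` writes a new file and leaves the original untouched, so the result can be compared against the LibreOffice output before anything is overwritten.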

So the bug is still present. Contrary to others on this trail who claimed that the hard breaks were in the original file, they aren't visible in Word 4, and they never showed up before (I used Word's own built-in conversion tools for decades until Microsoft stopped supporting conversion from Word 4 a few years ago).

Thanks to anyone who can help solve this.