Bug 158936 - Writer extremely sluggish on larger docx-files converted from .pdf. Appears to max out single cpu core
Summary: Writer extremely sluggish on larger docx-files converted from .pdf. Appears t...
Status: RESOLVED INVALID
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer
Version (earliest affected): 7.6.4.1 release
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assignee: Not Assigned
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: DOCX-Opening
Reported: 2023-12-30 19:34 UTC by Paul
Modified: 2024-01-26 14:03 UTC (History)
CC List: 4 users

See Also:
Crash report or crash signature:


Attachments
Original .pdf for reference purposes (deleted)
2024-01-15 02:05 UTC, Paul

Description Paul 2023-12-30 19:34:22 UTC
Description:
I have had this problem for many years. I thought my new Ryzen system would solve it, but it hasn't. 

I often work on files converted from .pdf, often fairly large. Opening them in LO takes up to 10 minutes at times, or fails altogether. If I can get a file open, the usual solution is to remove all ad hoc formatting and reduce the various paragraph and page styles to a few simple ones.

Today I tried working with a 16MB .pdf. I tried converting it to .docx via ilovepdf.com and then adobe.com, but LO, FreeOffice, and even Google Docs would not open the .docx.

Then I opened the .pdf in Okular (no problem doing so) and did a text export. This opens fine in geany, but not in LO. So I truncated it to the manageable first 500kb to work on it. I stripped all formatting and converted everything to Default Paragraph Style and saved as .odt, at 520KB. Usually that's enough to solve performance problems, but this one is still a bear to work on.

What happens is the cpu will go to 9-11% and hang there. That doesn't sound bad, but what I think is happening is that Writer is only working with one core, and that core is maxed out. The behavior is the same as if the whole cpu were maxed. LO freezes, usually alone, but sometimes the Dolphin file manager will freeze too.
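
The arithmetic behind that reading can be sketched as follows: with N logical CPUs, one fully busy thread shows up as roughly 100/N percent overall. The function name and the core counts below are illustrative assumptions, not the reporter's actual hardware.

```python
def aggregate_cpu_percent(busy_cores: float, logical_cpus: int) -> float:
    """Overall CPU percentage when `busy_cores` cores are fully busy
    and all other cores are idle."""
    if logical_cpus <= 0:
        raise ValueError("logical_cpus must be positive")
    return 100.0 * busy_cores / logical_cpus


# Hypothetical machines: one saturated core on 8, 12 and 16 logical CPUs.
for n in (8, 12, 16):
    print(n, round(aggregate_cpu_percent(1, n), 1))
```

Under these assumptions, a steady 9-11% overall is in the range expected for one saturated thread on a CPU with around ten logical cores; a per-core view in a system monitor would confirm or rule that out.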

Calc has a setting to enable multicore processing, but it does not help this Writer problem.

I'm on 7.6.4.1 on Linux. I also used to have this problem in my Windows days years ago.


Steps to Reproduce:
1. Open a large file with a lot of ad hoc formatting

Actual Results:
File may or may not open. If it does open, Writer will freeze up doing simple things like scrolling down.

Expected Results:
File should open and be workable.


Reproducible: Always


User Profile Reset: No

Additional Info:
...
Comment 1 m_a_riosv 2023-12-31 00:26:19 UTC
Could it be that there are line breaks instead of paragraph breaks?

I don't remember where, but I think I have seen an issue with that before.
Comment 2 Paul 2023-12-31 04:21:13 UTC
No, they are paragraph breaks.
Comment 3 Dieter 2024-01-14 11:44:21 UTC
Paul, could you please have a look at similar bugs: bug 155170, bug 152680 and bug 113050. Do you think your report is a duplicate of one of those?
=> NEEDINFO

BTW: I guess you get the same result when you try to open the file in Draw directly.
Comment 4 Dave Gilbert 2024-01-14 12:39:57 UTC
Paul: Any chance you can provide one of the files?
Comment 5 Paul 2024-01-14 21:09:50 UTC
(In reply to Dieter from comment #3)
> Paul, could you please have a look at similar bugs: bug 155170, bug 152680
> and bug 113050. Do you think your report is a duplicate of one of those?
> => NEEDINFO
> 
> BTW: I guess you get the same result when you try to open the file in Draw
> directly.

I don't think there's an overt direct connection to those bugs, because I'm not trying to open a .pdf with LO, but rather a .docx converted from a .pdf. I'm not sure if what's driving those bugs is the limitation to one cpu core, which I think is the problem here.

Dave:
I'm working on uploading a sample file. I have the abbreviated 520kb .odt file, and it works fine, no delays. I have the full .docx, at 7.4MB, and it's still not opened after 20 minutes. What I'd like to do is get it open and then save it as .odt to see how it works and for uploading here.
Comment 6 Dieter 2024-01-14 21:19:04 UTC
(In reply to Paul from comment #5)
> I don't think there's an overt direct connection to those bugs, because I'm
> not trying to open a .pdf with LO, but rather a .docx converted from a .pdf.
> I'm not sure if what's driving those bugs is the limitation to one cpu core,
> which I think is the problem here.

Thank you for the additional information. So perhaps the problem is not related to PDF at all.

> 
> Dave:
> I'm working on uploading a sample file. I have the abbreviated 520kb .odt
> file, and it works fine, no delays. I have the full .docx, at 7.4MB, and
> it's still not opened after 20 minutes. What I'd like to do is get it open
> and then save it as .odt to see how it works and for uploading here.

Perhaps it's sufficient to attach the .docx file.

Docx file-opening issues are collected in meta bug 104450.
Comment 7 Dave Gilbert 2024-01-14 21:28:00 UTC
(In reply to Paul from comment #5)
> (In reply to Dieter from comment #3)
> > Paul, could you please have a look at similar bugs: bug 155170, bug 152680
> > and bug 113050. Do you think your report is a duplicate of one of those?
> > => NEEDINFO
> > 
> > BTW: I guess you get the same result when you try to open the file in Draw
> > directly.
> 
> I don't think there's an overt direct connection to those bugs, because I'm
> not trying to open a .pdf with LO, but rather a .docx converted from a .pdf.
> I'm not sure if what's driving those bugs is the limitation to one cpu core,
> which I think is the problem here.
> 
> Dave:
> I'm working on uploading a sample file. I have the abbreviated 520kb .odt
> file, and it works fine, no delays. I have the full .docx, at 7.4MB, and
> it's still not opened after 20 minutes. What I'd like to do is get it open
> and then save it as .odt to see how it works and for uploading here.

You might find it's only one particular page in the file or similar.

There are quite a few heuristics in the PDF loader; on a bad day you can end up with zillions of objects rather than a simple curve or line of text in the resulting odt, so it could still be the PDF loading.
Comment 8 Paul 2024-01-14 21:32:08 UTC
(In reply to Dieter from comment #6)
> 
> Thank you for the additional information. So perhaps the problem is not
> related to PDF at all.
> 

I THINK the connection is that conversion from .pdf often involves a lot of ad hoc formatting, and this causes LO to hang. For instance, one recent file converted from .pdf had page styles "ConvertedXXX", with XXX incrementing for each page, from 1 to 240. There was a lot of in-place character formatting also. When I cleaned that up manually, the file became responsive. But I would guess that technically, the problem isn't the .pdf origin, but all the ad hoc formatting that often accompanies it. Just a guess.
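
A minimal sketch of how such a pile-up of page styles could be measured in a file saved as .odt, assuming the styles follow the "Converted<number>" naming described above. In ODF, page styles are stored as `style:master-page` elements in `styles.xml`; the function name is hypothetical.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# ODF style namespace, used for both the element and its name attribute.
STYLE_NS = "urn:oasis:names:tc:opendocument:xmlns:style:1.0"


def count_converted_page_styles(odt_path) -> int:
    """Count master pages (page styles) named 'Converted<number>' in an ODT."""
    with zipfile.ZipFile(odt_path) as zf:
        root = ET.fromstring(zf.read("styles.xml"))
    pattern = re.compile(r"^Converted\d+$")
    names = (
        el.get(f"{{{STYLE_NS}}}name", "")
        for el in root.iter(f"{{{STYLE_NS}}}master-page")
    )
    return sum(1 for name in names if pattern.match(name))
```

Running it before and after a manual clean-up would show whether the number of page styles actually dropped; it does not by itself prove that the styles cause the slowdown.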
Comment 9 Paul 2024-01-14 21:34:08 UTC
(In reply to Dave Gilbert from comment #7)
> 
> There are quite a few heuristics in the PDF loader; on a bad day you can end
> up with zillions of objects rather than a simple curve or line of text in
> the resulting odt, so it could still be the PDF loading.

Just to be clear, I'm not trying to load the .pdf file into LO, but rather a .docx file converted from a .pdf file.
Comment 10 Dave Gilbert 2024-01-14 21:37:12 UTC
What did the conversion from pdf->docx?
Comment 11 Paul 2024-01-14 21:43:01 UTC
(In reply to Dave Gilbert from comment #10)
> What did the conversion from pdf->docx?

"Today I tried working with a 16MB .pdf. I tried converting it to .docx via ilovepdf.com and then adobe.com, but LO, FreeOffice, and even Google Docs would not open the .docx."
Comment 12 Dave Gilbert 2024-01-14 21:45:07 UTC
(In reply to Paul from comment #11)
> (In reply to Dave Gilbert from comment #10)
> > What did the conversion from pdf->docx?
> 
> "Today I tried working with a 16MB .pdf. I tried converting it to .docx via
> ilovepdf.com and then adobe.com, but LO, FreeOffice, and even Google Docs
> would not open the .docx."

Oops, missed that. OK, so probably not our PDF wrangling. Still, if the docx is hanging, it's worth looking at.
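
One way to look at such a file without opening it in an office suite is to tally the XML elements in its main body part. This is only a sketch: the function name is hypothetical, and it assumes the file is a regular .docx zip package containing `word/document.xml`.

```python
import zipfile
import xml.etree.ElementTree as ET
from collections import Counter

# Main WordprocessingML namespace; shortened to "w:" in the output.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def tally_docx_elements(docx_path, top=10):
    """Return the `top` most common element names in word/document.xml."""
    counts = Counter()
    with zipfile.ZipFile(docx_path) as zf, zf.open("word/document.xml") as fh:
        # iterparse reads the XML incrementally instead of in one go.
        for _event, elem in ET.iterparse(fh, events=("end",)):
            counts[elem.tag.replace(f"{{{W_NS}}}", "w:")] += 1
            # Drop the element's text and children once it has been counted.
            elem.clear()
    return counts.most_common(top)
```

A very high count of `w:sectPr` (section properties) or `w:rPr` (direct run formatting) relative to `w:p` paragraphs would hint at what the converter produced. Elements from other namespaces keep their full namespace URI in the output.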
Comment 13 Paul 2024-01-14 22:22:21 UTC
After half an hour, the 7.4MB .docx manifested in LO. But after another 45 minutes, it is still completely unresponsive, and the cpu remains at 10%, with RAM at 6G / 16G.
Comment 14 Paul 2024-01-15 01:02:46 UTC
Four hours later, LO won't even manifest with the file, and the cpu is still at 9%. I'm giving up. I can upload the file if you think it might help, but it is 7.4MB.
Comment 15 Dave Gilbert 2024-01-15 01:12:54 UTC
I don't think anyone can debug without it.
It would be good to add the pdf as well - a separate task would be seeing how our PDF converter handles it.
(I wouldn't be surprised if some services use some of the same code)
Comment 16 Paul 2024-01-15 02:05:37 UTC
Created attachment 191939
Original .pdf for reference purposes

I did not try to open this with LO. Rather, I used ILovePDF.com and Adobe.com to convert it to a .docx file, which I did try to open in LO.
Comment 17 Paul 2024-01-15 02:08:38 UTC
Actually, I'm concerned about IP matters with the file in question, and am not sure I should upload it publicly.
Comment 18 Buovjaga 2024-01-25 16:52:17 UTC
(In reply to Paul from comment #16)
> Created attachment 191939
> Original .pdf for reference purposes
> 
> I did not try to open this with LO. Rather, I used ILovePDF.com and
> Adobe.com to convert it to a .docx file, which I did try to open in LO.

I converted this to docx via ILovePDF.com. I confirm that LibreOffice struggles with it (I killed the process after a couple of minutes), but even Microsoft's office.com is unable to open it: "Sorry, this document can't be opened for editing" pops up after a couple of minutes of struggling. So I'm not sure this report is useful to anyone besides those PDF conversion services. They are obviously doing something wrong.

Arch Linux 64-bit, X11
Version: 7.6.4.1 (X86_64) / LibreOffice Community
Build ID: 60(Build:1)
CPU threads: 8; OS: Linux 6.6; UI render: default; VCL: kf5 (cairo+xcb)
Locale: fi-FI (fi_FI.UTF-8); UI: en-US
7.6.4-2
Calc: threaded
Comment 19 Paul 2024-01-25 17:00:32 UTC
I probably chose a bad file to base this bug report on, because you're absolutely right about the pdf conversion problem. 

I've been having this problem for ages, even on large plain text files I download, which never saw .pdf status. Usually they are older theological works from sites such as ccel.org. They had been OCR'd and often have a lot of spelling errors, which, thinking about it now, might help bollix up LO. But aside from that their only sin seems to be large file size.

Maybe when this happens again I should check whether Google Docs will open it, and upload the new file to enhance this report.
Comment 20 Buovjaga 2024-01-26 11:14:24 UTC
The content of attachment 191939 has been deleted for the following reason:

Copyrighted content
Comment 21 Buovjaga 2024-01-26 11:16:09 UTC
(In reply to Paul from comment #19)
> Maybe when this happens again I should check whether Google Docs will open
> it, and upload the new file to enhance this report.

As this report already has quite a few comments, I think it would be best to open a new report if you can produce such a document some day.
Comment 22 Paul 2024-01-26 14:03:10 UTC
Thanks for deleting that. I realized the copyright problem after I had uploaded.

I will start a new report when I come across the problem again, and I will try to narrow the cause down to a traceable path.