Bug 148253 - LibreOffice 7.3 conversion can no longer convert .pptx to .html unless with filter or .htm
Status: RESOLVED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: filters and storage
Version (earliest affected): 7.3.0.3 release
Hardware: All All
Importance: medium minor
Assignee: Mike Kaganski
URL: https://ask.libreoffice.org/t/libreof...
Whiteboard: target:7.4.0 target:7.3.3
Keywords: bibisected, bisected, regression
Depends on:
Blocks: Commandline
Reported: 2022-03-29 14:43 UTC by Louis Coste
Modified: 2022-04-01 08:52 UTC

Description Louis Coste 2022-03-29 14:43:44 UTC
Description:
Hello,

I am working on a service that converts documents to PDF or HTML in order to display them in a browser. I am trying to update LibreOffice from 7.2 to 7.3, and I have noticed that it can no longer convert PowerPoint documents to HTML.

The command I have been running is:

soffice --headless --convert-to html test.pptx

I tried updating my command to specify a filter, but that did not work either:

soffice --headless --convert-to html:"impress_html_Export" test.pptx

When I run the first command, there is an error message saying that there is no export filter. When I run the second command, I get an error telling me to verify my input parameters.

Steps to Reproduce:
1. Try to convert .pptx document to .html

Actual Results:
Fails to convert

Expected Results:
Converts document to .html


Reproducible: Always

User Profile Reset: No

Additional Info:
No other information
Comment 1 Mike Kaganski 2022-03-29 15:12:58 UTC
Repro using 7.3.0.3, but not 7.2.0.4.

But 7.3.0.3 works using 'htm' instead of 'html'.
Comment 2 Timur 2022-03-30 11:49:09 UTC
--convert-to html   > no export filter   : repro, as reporter
--convert-to html:"impress_html_Export"  > OK for me, different from reporter
--convert-to htm   > OK, as Mike found, using filter : impress_html_Export

I'm not sure about "regression", because converting attachment 178918 to html would lose the footer with 7.2 and give different errors for XSL Vendor: 'libxslt'.
In all, I'd say minor.

Also, "no export filter, aborting" should be an error, but the exit status of the conversion is 0.
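
Because the exit status is reported as 0 even when no export filter is found, a calling service may want to check for the output file itself instead of trusting the return code. The following is only a minimal sketch under stated assumptions: the helper names `expected_output` and `convert` are hypothetical, and the output file is assumed to keep the input's base name with the target extension, written to the directory given by `--outdir`.

```python
import subprocess
from pathlib import Path


def expected_output(src: str, outdir: str, target_ext: str) -> Path:
    """Assumed output location: same base name as the input, new extension."""
    return Path(outdir) / (Path(src).stem + "." + target_ext)


def convert(src: str, outdir: str, convert_to: str = "htm", run=subprocess.run) -> Path:
    """Run soffice and raise if no output file appears (hypothetical helper)."""
    # The part before ':' is the target extension; anything after is a filter name.
    target_ext = convert_to.split(":", 1)[0]
    out = expected_output(src, outdir, target_ext)
    if out.exists():
        out.unlink()  # do not mistake a stale file for a fresh result
    # check=False: the thread reports exit status 0 even on failure,
    # so the return code is deliberately not used as the success signal.
    run(
        ["soffice", "--headless", "--convert-to", convert_to, "--outdir", outdir, src],
        check=False,
    )
    if not out.exists():
        raise RuntimeError(f"conversion produced no output: {out}")
    return out
```

For example, `convert("test.pptx", "out", "html:impress_html_Export")` would pass the filter explicitly and raise if `out/test.html` is not created.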
Comment 3 Mike Kaganski 2022-03-30 12:11:31 UTC
(In reply to Timur from comment #2)
> I'm not sure about "regression"

Marking it that way makes it possible to *bibisect* it to the changing commit and get an idea of what the change was about, which helps clarify whether it was an intended change or not. Hence I used the keyword, and consider it proper for that reason :-D - not claiming more than that.
Comment 4 Timur 2022-03-30 12:56:57 UTC
7.3 commit 36ce32072658c6ffca75b200f116ddfc11cab138
Date:   Tue Jun 22 08:50:00 2021 +0200
    source 990b2cb056788f7f412656a303456d90c003cf83
    pre 949658028e722e5d2657b503eb20e16e41dbd8cf

author	Noel Grandin <noel@peralex.com>	
committer	Noel Grandin <noel.grandin@collabora.co.uk>
commit 990b2cb056788f7f412656a303456d90c003cf83
simplify and improve Wildcard
it is faster to just process OUString data, rather than perform
expensive conversion to OString and back again.

Hi Noel, please see this. The report is that PPTX --convert-to html now gives "no export filter", which is true.
But my example shows that it wasn't reliable even before; html:"impress_html_Export" gives a better result both before and now.
Also, --convert-to htm behaves differently and works properly; how come?

There are other bugs with HTML conversion, and my general conclusion is that the app or filter should be specified explicitly; the short form is not reliable.
Comment 5 Commit Notification 2022-03-31 15:02:17 UTC
Mike Kaganski committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/50add7c97e75d604287218f49c9283aab052fdf0

tdf#148253: fix matching algorithm

It will be available in 7.4.0.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 6 Commit Notification 2022-03-31 17:43:13 UTC
Mike Kaganski committed a patch related to this issue.
It has been pushed to "libreoffice-7-3":

https://git.libreoffice.org/core/commit/2143fa31b9035c7c2cf302ccd3907d0853132e8f

tdf#148253: fix matching algorithm

It will be available in 7.3.3.

Comment 7 Timur 2022-04-01 08:38:59 UTC
Mike, until I'm able to test this, I have 3 things to ask you (I think you are following via Tags, as I am):

1. How come that html and htm are different?
2. Can you see bug 148275 and confirm or comment?
3. Can you see the mail I sent directly to your Hotmail address? 

Thanks.
Comment 8 Mike Kaganski 2022-04-01 08:52:51 UTC
(In reply to Timur from comment #7)
> 1. How come that html and htm are different?

We have two HTML-related export filters: "impress_html_Export" and "XHTML Impress File". The latter refers to the "XHTML_File" type, whose extension list is defined as "html,xhtml". The former refers to the "graphic_HTML" type, and its extensions are "html,htm". Note that both can handle "html", and both have the "html" extension first. It seems that *for some reason* (I don't know which, but that is irrelevant here - *one* of them must be picked anyway), in the *normal* case, the latter got picked when you didn't explicitly define the filter.

But the found commit introduced a regression: it could no longer find *any* non-last extension in the list, so for "html" it saw *neither* of the two filters that handle HTML. It could still find the *last* extension in the list, so for "htm" it found the graphic_HTML type, and hence the impress_html_Export filter.
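
The behaviour described above can be illustrated with a small sketch. This is not the LibreOffice matching code; the table and the `candidates` helper are hypothetical, and the `only_last` switch merely imitates the described symptom (only the last entry of each extension list being found). The extension lists are the ones quoted above.

```python
# Filter name -> extension list of the type it refers to (as quoted above).
FILTER_EXTENSIONS = {
    "XHTML Impress File": ["html", "xhtml"],   # type XHTML_File
    "impress_html_Export": ["html", "htm"],    # type graphic_HTML
}


def candidates(ext, only_last=False):
    """Return the filters whose extension list contains ext.

    only_last=True imitates the regressed behaviour, where only the
    last entry of each list could be matched.
    """
    found = []
    for name, exts in FILTER_EXTENSIONS.items():
        searchable = exts[-1:] if only_last else exts
        if ext in searchable:
            found.append(name)
    return found


print(candidates("html", only_last=True))  # no filter found
print(candidates("htm", only_last=True))   # only impress_html_Export
print(candidates("html"))                  # both filters are candidates
```

With `only_last=True`, "html" matches nothing because it is first in both lists, while "htm" still matches because it is last in the graphic_HTML list.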

> 2. Can you see bug 148275 and confirm or comment?

Yes.

> 3. Can you see the mail I sent directly to your Hotmail address?

Unfortunately, no (I also checked the spam folder)...