Bug 169077 - Data Provider dialog does not work with format HTML
Status: NEW
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Calc
Version (earliest affected): 26.2.0.0 alpha0+ master
Hardware: All All
Importance: medium normal
Assignee: Not Assigned
URL:
Whiteboard: target:26.2.0 target:25.8.4
Keywords:
Depends on:
Blocks: Data-Provider
Reported: 2025-10-26 13:12 UTC by Regina Henschel
Modified: 2025-12-16 09:15 UTC

See Also:
Crash report or crash signature:


Description Regina Henschel 2025-10-26 13:12:29 UTC
Unit tests exist for the "Data Provider" feature, but the same imports do not work when done manually through the dialog.

The test is
https://opengrok.libreoffice.org/xref/core/sc/qa/unit/dataproviders_test.cxx

The test files are
https://opengrok.libreoffice.org/xref/core/sc/qa/unit/data/dataprovider/html/test1.html
https://opengrok.libreoffice.org/xref/core/sc/qa/unit/data/dataprovider/xml/test1.xml
Download them; each OpenGrok page has a blue "Download" item.

For the case testHTMLImport:
1. Open a new spreadsheet.
2. Define a database range "testDB" for the range A1:K11 via menu Data > Define Range.
3. Open the Data Provider dialog via menu Data > Data Provider.
4. Select "testDB" from the drop-down list `Database Range`.
5. Select "HTML" from the drop-down list `Data Format`.
6. Click the `Browse` button and select the downloaded file "test1.html".
7. Click the `Apply` button. Error: No import in Preview.
8. Click the `OK` button. Error: No data imported.
BTW, the import via menu Sheet > External Links works. Use locale English (USA) and enable "detect special numbers".


For the case testXMLImport:
1.-4. See above.
5. Select "XML" from the drop-down list `Data Format`.
6. Click the `Browse` button and select the downloaded file "test1.xml".
7.-8. See above.
The test sets "maFieldPaths". Nothing in the dialog corresponds to it.
BTW, the import via menu Data > XML Source works. Use the recurring element //book and link it to cell A1, for example.


As the tests themselves do not fail, I guess that there is something wrong with the dialog.
Comment 1 raal 2025-10-28 20:15:57 UTC
I can confirm with Version: 26.2.0.0.alpha0+ (X86_64) / LibreOffice Community
Build ID: 2595f031fa93c1eb89fb4dce6f337de9be813e15
CPU threads: 4; OS: Linux 6.8; UI render: default; VCL: gtk3
Locale: cs-CZ (cs_CZ.UTF-8); UI: en-US
Calc: threaded
Comment 2 Regina Henschel 2025-11-20 14:50:53 UTC
Let us restrict this bug report to HTML. I have split the problems because the underlying code is different: HTML import is handled by its own code, whereas XML import is forwarded to the Orcus library. For the XML import, I have filed bug 169574.
Comment 3 Regina Henschel 2025-11-29 00:45:32 UTC
A suggested fix is in https://gerrit.libreoffice.org/c/core/+/194789.
Comment 4 Commit Notification 2025-12-01 11:11:10 UTC
Regina Henschel committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/9187f38b48956acc892fbf3e7fe0d1942fcfb6f2

tdf#169077 dataproviderdlg setID expects mxEditID

It will be available in 26.2.0.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 5 Regina Henschel 2025-12-01 11:17:43 UTC
The bug is now fixed, but a unit test is still missing. It has to be a Python UI test, and I cannot write one because I work on Windows. Someone else will have to step in here.
Comment 6 Commit Notification 2025-12-01 14:44:46 UTC
Regina Henschel committed a patch related to this issue.
It has been pushed to "libreoffice-25-8":

https://git.libreoffice.org/core/commit/56e1b57bd921c2dcaa4bb5932b38ea6c93eb49bd

tdf#169077 dataproviderdlg setID expects mxEditID

It will be available in 25.8.4.

Comment 7 Commit Notification 2025-12-03 18:23:13 UTC
Neil Roberts committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/655ba73e506d75ca9988a428620f543b213774b7

tdf#169077: Add a UITest

It will be available in 26.2.0.

Comment 8 Michael Otto 2025-12-15 15:09:18 UTC
(In reply to Commit Notification from comment #4)
(... and comment #7)
> Regina Henschel / Neil Roberts committed a patch related to this issue.
> It has been pushed to "master":
...
> It will be available in 26.2.0.
...
> Affected users are encouraged to test the fix and report feedback.

I tried to import HTML with
LibreOfficeDev_26.2.0.0.alpha1_Linux_x86-64_deb (2025-12-05_04.47.37)
but did not succeed.


Calc spreadsheet with Data > Define Range 
(see #169514 attachment DataRangeForDataProvider.ods)
https://bugs.documentfoundation.org/attachment.cgi?id=204066
Data > Data Provider
Database Range: DBrange
Data Format: HTML
URL: local copy of the test file https://opengrok.libreoffice.org/xref/core/sc/qa/unit/data/dataprovider/html/test1.html
Identifier: tried "content" and "src"
Click Preview: Apply --> no entries
Click OK: no changes to the spreadsheet (File > Reload runs without a confirmation query)


Version: 26.2.0.0.alpha1+ (X86_64) / LibreOffice Community
Build ID: 0686b1972806fe8b711de5ba64039fb38cd14889
CPU threads: 5; OS: Linux 6.14; UI render: default; VCL: gtk3
Locale: de-DE (de_DE.UTF-8); UI: en-US
Calc: threaded
Comment 9 Regina Henschel 2025-12-15 16:07:10 UTC
For HTML import, the `Identifier` field must contain the XPath to the desired <table> element. That could be, e.g.,
//table

or, if you want the second table of the source,
//table[2]

For example, try a target database range of 10 columns and 120 rows.
URL
https://de.wikipedia.org/wiki/Liste_der_erfolgreichsten_Filme_nach_Einspielergebnis

Identifier
//table[5]

(The proposed database range is larger than actually needed.)
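
To illustrate how such XPath expressions select tables, here is a minimal Python sketch. It is not the LibreOffice implementation; it uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath and needs a relative path (`.//table` instead of `//table`). The markup and the helper `first_cell` are hypothetical and exist only for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for an HTML page with two tables.
DOC = """<html><body>
<table><tr><td>first</td></tr></table>
<table><tr><td>second</td></tr></table>
</body></html>"""

root = ET.fromstring(DOC)


def first_cell(xpath):
    """Return the text of the first cell of the first table matching xpath."""
    table = root.find(xpath)  # find() returns only the first match, or None
    return None if table is None else table.find("tr/td").text


# './/table' matches every table, but taking the first match yields table 1.
all_tables = root.findall(".//table")
print(len(all_tables), first_cell(".//table"))

# './/table[2]' selects a table that is the second <table> child of its parent.
print(first_cell(".//table[2]"))

# A position beyond the last table matches nothing.
print(first_cell(".//table[3]"))
```

The position predicate counts among sibling <table> elements of the same parent, so `//table[2]` is not necessarily the second table of the whole document when tables are nested in different containers.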
Comment 10 Michael Otto 2025-12-15 20:07:28 UTC
(In reply to Regina Henschel from comment #9)

Yes, this works with //table[1] through //table[5] for your Wikipedia example.
However, with //table only the first table is included. Shouldn't all
tables be included then?

But I can't find a way to use the HTML test file. There, <table>,
<tr> and <td> appear only in the text, not as HTML elements.
How should the test file be used?

Isn't the implementation of XPath for the Identifier contrary to the documentation?
   "Identifier: The target ID for HTML provided data..."
I expected that the HTML attribute "id" should be used to address the items
(see also the id="content" and id="src" attributes in the test file).
Otherwise we should change the documentation and mention XPath there.
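
For comparison, XPath itself can address an element by its "id" attribute through an attribute predicate. The following Python sketch shows only the XPath idea with the standard library's `xml.etree.ElementTree`. Whether the Data Provider dialog accepts such an expression is not confirmed here, and the markup and the helper `cell_under_id` are hypothetical, not taken from test1.html.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: two containers with id attributes, each holding a table.
DOC = """<html><body>
<div id="content"><table><tr><td>a</td></tr></table></div>
<div id="src"><table><tr><td>b</td></tr></table></div>
</body></html>"""

root = ET.fromstring(DOC)


def cell_under_id(element_id):
    """Return the first cell of the table directly below the element with this id."""
    table = root.find(f".//*[@id='{element_id}']/table")
    return None if table is None else table.find("tr/td").text


print(cell_under_id("content"))
print(cell_under_id("src"))
print(cell_under_id("missing"))
```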


In the Wikipedia example, entries whose <td> element additionally contains
a <link ...> are skipped (e.g. 1st table, line 4 "Titanic", col. 2 "Deutscher
Titel"), and many other entries are missing where the <td> element
contains additional HTML elements.
Comment 11 Regina Henschel 2025-12-15 22:03:42 UTC
(In reply to Michael Otto from comment #10)
 
> But I can't find a way for the HTML test file. There can <table>, 
> <tr> and <td> only be found in the text, not as HTML elements. 
> How should the test file be used?

The entry in the Identifier field has to be
//table

<table> is an HTML element.
The implementation of this import is in
https://opengrok.libreoffice.org/xref/core/sc/source/ui/dataprovider/htmldataprovider.cxx
I have not touched that code. I have only fixed the dialog, which used the wrong field.

> 
> Isn't the implementation of XPath for the Identifier contrary to the
> documentation?
>    "Identifier: The target ID for HTML provided data..."
> I expected that the HTML attribute "id" should be used to address the items
> (see also the id="content" and id="src" attributes in the test file).
> Otherwise we should change the documentation and mention XPath there.

Yes, the documentation needs to be improved. I have already added a comment to the "WorkInProgress" version of the Calc Guide for version 26.2. A bug report for the Help might be needed as well.

> 
> 
> In the Wikipedia example the entries where a <link ...> is contained 
> in addition, are skipped (e.g. 1st table line 4 Titanic col. 2 Deutscher 
> Titel) and many other entries are missing where additional HTML elements 
> are contained in the <td> element.

Yes, the current HTML import is very simple. I hesitated about whether anything should be fixed at all. There was also bug 139409, where it was discussed whether the entire feature should be removed. But the feature has existed since LibreOffice version 6, which is more than 7 years now. It should therefore work at least to some extent.
Comment 12 Michael Otto 2025-12-16 09:15:37 UTC
(In reply to Regina Henschel from comment #11)
> The entry in the Identifier field has to be
> //table
... 
> Yes, the documentation needs to be improved. I have already added a comment
> to the "WorkInProgress" version of the Calc Guide for version 26.2. Might be
> a bugreport for the help is needed as well.

I raised LOCALHELP bug#169996 
(Proposal for Identifier with HTML: "//table, //table[2], ... following Xpath")

> Yes, the current HTML import is very simple. ...
> It should therefore work at least to some extent.

Data Provider with format HTML now works in a basic way,
so on that basis it is OK for me.