Created attachment 54084 [details]
Example spreadsheet and database to demonstrate bug
Any time LibreOffice Base is connected to a spreadsheet or .CSV file as its data source, if you perform any SQL query that has "DISTINCT" in it, it will crash the program.
The attached example database (SelectDistinctBug.odb) connects to the spreadsheet Sample.ods. In the database is a query called "Sample" that will consistently crash LibreOffice if it is run.
I can confirm it, by half.
The crash happens in LibO 3.4.4 only with the spreadsheet.
When connected to the csv file no crash for me.
I checked in OOo 3.4 beta, same crash, in OOo 3.3 the query runs without crash.
I agree, I cannot reproduce the issue with .CSV files. I must have made a mistake when doing my initial checks. I will edit the report to remove mention of .CSV files.
Created attachment 54900 [details]
backtrace with symbols
I reproduce this pb on master branch (pc Debian x86-64) so I attached a backtrace with symbols.
I tested again on 3.5 branch.
I reproduced the pb with csv too (same backtrace)
Here are some other results I had :
SELECT DISTINCT "ColumnA" FROM "SampleTable" works well
SELECT DISTINCT "ColumnB" FROM "SampleTable" crashes.
SELECT "ColumnB" FROM "SampleTable" works well.
also applies to dbase - same symptoms.
Tried it in dBase with 3.5.0 Beta2 OpenSUSE 11.4 32bit. DISTINCT on the first column works, on the second column there was no crash - but distinct is ignored. All rows were shown.
Created attachment 55561 [details]
Tests and some analysis
I made different tests to try to understand the cause of this bug.
It seems that the main point is the values contained in m_aOrderbyColumnNumber
When we use the column without omitting 1, it's ok, lines like this one : in connectivity/source/drivers/file/FResultSet.cxx is ok :
1284 switch ((*(m_aSelectRow->get().begin()+*aOrderByIter))->getValue().getTypeKind())
But if you only want the second column and you use "Distinct" for example, it's ko.
there's 1 value in m_aOrderbyColumnNumber but this one is equal "2"
So it seems we should replace *aOrderByIter by (i+1) in these kind of lines. (see proposed patch just after this add).
Created attachment 55562 [details]
Proposition of patch
Following the previous attachment, I propose this patch.
It seems to solve the distinct pb. Yet there's still "order by" pbs, but that's another story.
i think the problem here is that m_aOrderbyColumnNumber contains an invalid column number: the column seems to correspond to the source row, not the row after SELECT.
Julien, it seems your patch would change the sort order so it always sorts columns by their order in the SELECT clause, not as given by the m_aOrderbyColumnNumber, so it won't crash but should give wrong result?
better investigate why the m_aOrderbyColumnNumber contains invalid columns :)
setting regression keyword as it worked in OOo 3.3
Just to note the result of your discussion with Michael on IRC yesterday :
m_aColMapping is well initialized but because of of the setBoundedColumns function, it's messed up a bit.
Now I made the same type of request with Thunderbird/Icedove bookaddress (select distinct "E-mail" from "Collected Addresses", it works well.
So why setBoundedColumns for mozab has 1 parameter less ?
What about the other drivers which don't have setBoundedColumns ? Perhaps they've got the equivalent but with another name.
I must have a look to the git history to have more info about this function on both drivers.
probably have tracked it down now.
the crash was introduced with f89f2b8bf506de0cc547ad596c75cbe1a0cf1ef1,
which changed OResultSet::sortRows to work not on the whole row
but on the selected columns of the row (m_aSelectRow);
thus it looks like using m_aColMapping which maps from the column
in the select row to the columns in the entire row is wrong and
was not adapted to match the change.
wondering about whether SQL spec says anything about the order of rows that are equal in all columns specified in ORDER BY but not the columns not specified in ORDER BY...
OResultSet::OpenImpl() looks like it could be much simpler...
(In reply to comment #12)
> wondering about whether SQL spec says anything about the order of rows that are
> equal in all columns specified in ORDER BY but not the columns not specified in
> ORDER BY...
X/Open spec: Data Management: Structured Query Language (SQL), Version 2, section 4.4.1 page 89:
Rows that have duplicate values in the columns specified by all the sort-specifications appear in the result set together, but in an undefined order.
ISO spec (draft) IWD 9075-2:2011(E): Information technology -- Database languages -- SQL -- Part 2: Foundation (SQL/Foundation)
i) Two rows that are not distinct with respect to the <sort specification>s are said to be peers of each other. The relative ordering of peers is implementation-dependent.
(section 10.10, General Rules, subsection "1" clause "i"; page 625 of the PDF I have)
so i've now changed it to not rely on m_aColMapping, and thanks to
Lionel's info about SQL also got rid of the second sortRows call.
fixed on master: 35e8190b0317972abb376dd2ed55567c43bc8b4d
Lionel, please review the fix and push it to -3-5 and -3-4 branches;
since this is a recent regression that seems in order.
Some more info on SQL and ORDER by: In basic/core SQL, the column names in the ORDER BY clause are column names from the *result* of the SELECT, not column names from the *source* table(s).
So, if table "t" has columns "colA" and "colB", the following is an error:
SELECT t.colA FROM t ORDER BY colB
However, as an extension, many SQL engines accept that. Another common extension is sorting by an expression:
SELECT t.colA FROM t ORDER BY colA*colA
Also, this sorts according to t.colA (renamed to colB in the result), not according to t.colB:
SELECT t.colA AS colB FROM t ORDER BY colB
So it seems f89f2b8bf506de0cc547ad596c75cbe1a0cf1ef1 does the right thing (I had a doubt and had to go read the standard again).
(In reply to comment #14)
> so i've now changed it to not rely on m_aColMapping, and thanks to
> Lionel's info about SQL also got rid of the second sortRows call.
> fixed on master: 35e8190b0317972abb376dd2ed55567c43bc8b4d
> Lionel, please review the fix and push it to -3-5 and -3-4 branches;
OK, I've reviewed the patch and I have some questions:
for (sal_Int32 i = 1;
static_cast<size_t>(i) < m_aColMapping.size(); ++i)
? SQL_ASC // default for no ORDER BY
0) This does not add the first column (the bookmark column) to the sort specification, right? Things are sometimes 0-indexed and sometimes 1-indexed in LO database stuff, so I prefer to check. This warrants a comment, too. E.g.: // Add all columns that are not already in the sort specification at th end, except the bookmark column (column 0)
1) We could just add all columns unconditionally, but OK, not repeating a column we already sort on earlier is an optimisation worth having -> OK for that
2) What is the idea behind reusing the ASC/DESC choice of the last entry in the sort specification? It does not *hurt*, because it makes absolutely no difference whether we put ASC or DESC, neither in result, nor in performance. But I think it confuses the reader of the code, who tries to understand why this is done. Unless there is a benefit that I don't see, I suggest we hardcode SQL_ASC there.
> 0) This does not add the first column (the bookmark column) to the sort
> specification, right? Things are sometimes 0-indexed and sometimes 1-indexed in
> LO database stuff, so I prefer to check.
it does not; i don't know what exactly this bookmark column is,
but it doesn't sound like it contains actual data.
> This warrants a comment, too. E.g.: //
> Add all columns that are not already in the sort specification at th end,
except the bookmark column (column 0)
i've added a comment.
> 2) What is the idea behind reusing the ASC/DESC choice of the last entry in the
> sort specification? It does not *hurt*, because it makes absolutely no
> difference whether we put ASC or DESC, neither in result, nor in performance.
indeed; i think that at first i thought there can be only one mention of ASC or DESC in the ORDER BY clause, but then why would we store it once per column?
this seems to be valid:
ORDER BY "ColumnB" ASC, "ColumnA" DESC
so i've changed that now:
(In reply to comment #17)
>> 0) This does not add the first column (the bookmark column) to the sort
>> specification, right? Things are sometimes 0-indexed and sometimes 1-indexed in
>> LO database stuff, so I prefer to check.
> it does not; i don't know what exactly this bookmark column is,
> but it doesn't sound like it contains actual data.
A bookmark is an opaque value that a client application can give to a database driver, and that will position the resultset or cursor on that row. If the underlying driver does not provide bookmarks, LO uses sequential integers (= row number). As a consequence of its specification, it is different for each row, including for duplicates :-)
So yes, it does not contain "actual data", it is metadata.
> so i've changed that now:
OK, backported that to 3.5 in http://cgit.freedesktop.org/libreoffice/core/commit/?id=48c33e89fe7398c05f2af6c49c75fa2bedf36859
great; can we get this into libreoffice-3-4 as well?
(In reply to comment #19)
> great; can we get this into libreoffice-3-4 as well?
Yes, sorry, I forgot. Done in http://http://cgit.freedesktop.org/libreoffice/libs-core/commit/?h=libreoffice-3-4&id=9cd5a1b11f2f5f69d674f0b0e44352d96518013e