Third thoughts about the DISTINCT MAX() problem

Lists: pgsql-hackers
From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: Third thoughts about the DISTINCT MAX() problem
Date: 2008-03-28 21:51:43
Message-ID: 15346.1206741103@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I just realized that the patch I applied here
http://archives.postgresql.org/pgsql-committers/2008-03/msg00531.php
for Taiki Yamaguchi's bug report here
http://archives.postgresql.org/pgsql-bugs/2008-03/msg00275.php
really doesn't work. It assumes that an ungrouped aggregate
query can't return more than one row, which is true in straight
SQL ... but it's not true if you consider SRFs in the target list.
CVS HEAD gives the wrong answer for this example in the regression
database:

regression=# select max(unique1), generate_series(1,3) as g from tenk1 order by g desc;
max | g
------+---
9999 | 1
9999 | 2
9999 | 3
(3 rows)

because it wrongly supposes it can discard the ORDER BY.

So, back to the drawing board. I had thought of two approaches to
fixing the problem instead of just dodging it. Plan A was to
apply planagg.c's Aggref->Param substitution inside EquivalenceClasses,
as in the draft patch here:
http://archives.postgresql.org/pgsql-patches/2008-03/msg00388.php
which I didn't entirely like for reasons mentioned in that post.
Plan B was to try to revert to the way sort clause matching was
done pre-8.3, that is have make_sort_from_pathkeys check first
for a matching ressortgroupref tag before it goes looking for equal()
expressions. I had actually tried to do that first but got hung
up on the problem of identifying the correct sort operator ---
just taking the exposed type of the targetlist entry doesn't always
work, because of binary-compatible cases (eg, tlist entry may say
it yields varchar but we need to find the text opclass). Perhaps
thinking a bit harder would yield a solution though.

Does anyone have comments for or against either of these approaches,
or perhaps a Plan C to consider?

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Third thoughts about the DISTINCT MAX() problem
Date: 2008-03-29 01:28:53
Message-ID: 17874.1206754133@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I wrote:
> Plan B was to try to revert to the way sort clause matching was
> done pre-8.3, that is have make_sort_from_pathkeys check first
> for a matching ressortgroupref tag before it goes looking for equal()
> expressions. I had actually tried to do that first but got hung
> up on the problem of identifying the correct sort operator ---
> just taking the exposed type of the targetlist entry doesn't always
> work, because of binary-compatible cases (eg, tlist entry may say
> it yields varchar but we need to find the text opclass). Perhaps
> thinking a bit harder would yield a solution though.

Ah, got it. I had previously attached a sortref to EquivalenceClasses
(ec_sortref) to deal properly with multiple textually-identical but
volatile expressions. But that's really the wrong thing: sortrefs
should be attached to individual EquivalenceMembers, instead. If
we do that then the existing logic in make_sort_from_pathkeys can be
made to use the sortref as a preferred (and faster) way of identifying
which tlist member matches a pathkey.

This implies a change in the EquivalenceClass data structures,
but those are never dumped to disk so it's not a problem to back-patch.

Note: we still need to be able to match on equal() since we may have
to deal with sort keys that are not any of the explicit ORDER BY
expressions and hence don't have a sortref. But that's not a problem
for the MIN/MAX optimization since the only things left to do after
planagg.c are apply explicit ORDER BY or DISTINCT operations.

regards, tom lane


From: "Gurjeet Singh" <singh(dot)gurjeet(at)gmail(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Third thoughts about the DISTINCT MAX() problem
Date: 2008-03-29 01:37:10
Message-ID: 65937bea0803281837w3d964b5bw801adea55b9c4589@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Mar 29, 2008 at 3:21 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> I just realized that the patch I applied here
> http://archives.postgresql.org/pgsql-committers/2008-03/msg00531.php
> for Taiki Yamaguchi's bug report here
> http://archives.postgresql.org/pgsql-bugs/2008-03/msg00275.php
> really doesn't work. It assumes that an ungrouped aggregate
> query can't return more than one row, which is true in straight
> SQL ... but it's not true if you consider SRFs in the target list.
> CVS HEAD gives the wrong answer for this example in the regression
> database:
>
> regression=# select max(unique1), generate_series(1,3) as g from tenk1
> order by g desc;
> max | g
> ------+---
> 9999 | 1
> 9999 | 2
> 9999 | 3
> (3 rows)
>
> because it wrongly supposes it can discard the ORDER BY.
>
> So, back to the drawing board. I had thought of two approaches to
> fixing the problem instead of just dodging it. Plan A was to
> apply planagg.c's Aggref->Param substitution inside EquivalenceClasses,
> as in the draft patch here:
> http://archives.postgresql.org/pgsql-patches/2008-03/msg00388.php
> which I didn't entirely like for reasons mentioned in that post.
> Plan B was to try to revert to the way sort clause matching was
> done pre-8.3, that is have make_sort_from_pathkeys check first
> for a matching ressortgroupref tag before it goes looking for equal()
> expressions. I had actually tried to do that first but got hung
> up on the problem of identifying the correct sort operator ---
> just taking the exposed type of the targetlist entry doesn't always
> work, because of binary-compatible cases (eg, tlist entry may say
> it yields varchar but we need to find the text opclass). Perhaps
> thinking a bit harder would yield a solution though.
>
> Does anyone have comments for or against either of these approaches,
> or perhaps a Plan C to consider?
>
>
>

In the past I had seen suggestions (perhaps from you) that we should
disallow SRFs in target list... Although not for 8.3, but would this be a
good time for 8.4 to deprecate the usage of SRFs in targetlist?

Best regards,
--
gurjeet[(dot)singh](at)EnterpriseDB(dot)com
singh(dot)gurjeet(at){ gmail | hotmail | indiatimes | yahoo }.com

EnterpriseDB http://www.enterprisedb.com

Mail sent from my BlackLaptop device


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Gurjeet Singh <singh(dot)gurjeet(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Third thoughts about the DISTINCT MAX() problem
Date: 2008-03-29 08:54:13
Message-ID: 1206780853.4285.1793.camel@ebony.site
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, 2008-03-29 at 07:07 +0530, Gurjeet Singh wrote:
> On Sat, Mar 29, 2008 at 3:21 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> regression=# select max(unique1), generate_series(1,3) as g
> from tenk1 order by g desc;
> max | g
> ------+---
> 9999 | 1
> 9999 | 2
> 9999 | 3
> (3 rows)
>
>
> In the past I had seen suggestions (perhaps from you) that we should
> disallow SRFs in target list... Although not for 8.3, but would this
> be a good time for 8.4 to deprecate the usage of SRFs in targetlist?

I think it makes sense to throw an error if we have both ungrouped
aggregates and SRFs in the targetlist. The two concepts seem at odds
with each other anyhow.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

PostgreSQL UK 2008 Conference: http://www.postgresql.org.uk


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Third thoughts about the DISTINCT MAX() problem
Date: 2008-03-29 17:50:10
Message-ID: 10400.1206813010@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I wrote:
> Plan B was to try to revert to the way sort clause matching was
> done pre-8.3, that is have make_sort_from_pathkeys check first
> for a matching ressortgroupref tag before it goes looking for equal()
> expressions.

So I tried that, and after a whole bunch of regression test failures
I realize it won't work. ressortgroupref tags are only guaranteed
unique within a given targetlist --- the values applicable in the
final result tlist might mean something else within a plan node further
down in the tree. Since EquivalenceClasses are global to the whole
plan, trying to match them to tlist entries by ressortgroupref isn't
safe.

This actually calls into question the existing fix for making sure
we pick the right tlist entry when dealing with volatile ORDER BY
expressions. I think it's all right because sorting by such expressions
could only occur at top tree level --- we'd never consider such an
expression as a mergejoin key, for instance. But it seems fragile
and something we'll need to keep an eye on.

Anyway, it seems that the only remaining acceptable fix is the one
involving modifying the EquivalenceClasses when planagg.c makes
its substitutions. Unless someone has another idea.

regards, tom lane