Warts with SELECT DISTINCT

Lists: pgsql-hackers
From: Greg Stark <gsstark(at)mit(dot)edu>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Warts with SELECT DISTINCT
Date: 2006-05-03 21:58:07
Message-ID: 87irom22kw.fsf@stark.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Normally Postgres extends SQL to allow ORDER BY to include arbitrary
expressions not in the select list. However this doesn't seem to work with
SELECT DISTINCT.

stark=> \d test
Table "public.test"
Column | Type | Modifiers
--------+------+-----------
col1 | text |

stark=> select distinct col1 from test order by upper(col1);
ERROR: for SELECT DISTINCT, ORDER BY expressions must appear in select list

It seems like as long as the expressions involve only columns or expressions
present in the SELECT DISTINCT list and as long as those functions are stable
or immutable then this shouldn't be a problem. Just prepend those expressions
to the select list to use as the sort key.

In fact the equivalent GROUP BY query does work as expected:

stark=> select col1 from test group by col1 order by upper(col1);
col1
------
a
c
x
(3 rows)

Though it's optimized poorly and does a superfluous sort step:

stark=> explain select col1 from test group by col1 order by upper(col1);
QUERY PLAN
---------------------------------------------------------------------------
Sort (cost=99.72..100.22 rows=200 width=32)
Sort Key: upper(col1)
-> Group (cost=85.43..92.08 rows=200 width=32)
-> Sort (cost=85.43..88.50 rows=1230 width=32)
Sort Key: col1
-> Seq Scan on test (cost=0.00..22.30 rows=1230 width=32)
(6 rows)

Whereas it shouldn't be hard to prove that this is equivalent:

stark=> explain select col1 from test group by upper(col1),col1 order by upper(col1);
QUERY PLAN
---------------------------------------------------------------------
Group (cost=88.50..98.23 rows=200 width=32)
-> Sort (cost=88.50..91.58 rows=1230 width=32)
Sort Key: upper(col1), col1
-> Seq Scan on test (cost=0.00..25.38 rows=1230 width=32)
(4 rows)

My understanding is that the DISTINCT and DISTINCT ON code path is old and
grotty. Perhaps it's time to remove those code paths, and replace them with a
transformation that creates the equivalent GROUP BY query and then optimize
that path until it can produce plans as good as DISTINCT and DISTINCT ON ever
did.

--
greg


From: Bruno Wolff III <bruno(at)wolff(dot)to>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Warts with SELECT DISTINCT
Date: 2006-05-04 03:10:57
Message-ID: 20060504031057.GA30219@wolff.to
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, May 03, 2006 at 17:58:07 -0400,
Greg Stark <gsstark(at)mit(dot)edu> wrote:
>
> Though it's optimized poorly and does a superfluous sort step:
>
> stark=> explain select col1 from test group by col1 order by upper(col1);
> QUERY PLAN
> ---------------------------------------------------------------------------
> Sort (cost=99.72..100.22 rows=200 width=32)
> Sort Key: upper(col1)
> -> Group (cost=85.43..92.08 rows=200 width=32)
> -> Sort (cost=85.43..88.50 rows=1230 width=32)
> Sort Key: col1
> -> Seq Scan on test (cost=0.00..22.30 rows=1230 width=32)
> (6 rows)
>
>
> Whereas it shouldn't be hard to prove that this is equivalent:
>
> stark=> explain select col1 from test group by upper(col1),col1 order by upper(col1);
> QUERY PLAN
> ---------------------------------------------------------------------
> Group (cost=88.50..98.23 rows=200 width=32)
> -> Sort (cost=88.50..91.58 rows=1230 width=32)
> Sort Key: upper(col1), col1
> -> Seq Scan on test (cost=0.00..25.38 rows=1230 width=32)
> (4 rows)

I don't think you can assume that that will be true for any locale. If there
are two different characters that both have the same uppercase version, this
will break things.

And while you would expect that x = y => upper(x) = upper(y) I am not sure
that is guarenteed for locales. I can imagine having two different characters
that are treated the same for ordering purposes, but have uppercase versions
that are considered different for ordering purposes.


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Bruno Wolff III <bruno(at)wolff(dot)to>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Warts with SELECT DISTINCT
Date: 2006-05-04 04:05:16
Message-ID: 87d5eu1lkz.fsf@stark.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Bruno Wolff III <bruno(at)wolff(dot)to> writes:

> > Whereas it shouldn't be hard to prove that this is equivalent:
> >
> > stark=> explain select col1 from test group by upper(col1),col1 order by upper(col1);
> > QUERY PLAN
> > ---------------------------------------------------------------------
> > Group (cost=88.50..98.23 rows=200 width=32)
> > -> Sort (cost=88.50..91.58 rows=1230 width=32)
> > Sort Key: upper(col1), col1
> > -> Seq Scan on test (cost=0.00..25.38 rows=1230 width=32)
> > (4 rows)
>
> I don't think you can assume that that will be true for any locale. If there
> are two different characters that both have the same uppercase version, this
> will break things.

No it won't.

--
greg


From: Bruno Wolff III <bruno(at)wolff(dot)to>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Warts with SELECT DISTINCT
Date: 2006-05-04 04:50:42
Message-ID: 20060504045042.GA14214@wolff.to
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, May 04, 2006 at 00:05:16 -0400,
Greg Stark <gsstark(at)mit(dot)edu> wrote:
> Bruno Wolff III <bruno(at)wolff(dot)to> writes:
>
> > > Whereas it shouldn't be hard to prove that this is equivalent:
> > >
> > > stark=> explain select col1 from test group by upper(col1),col1 order by upper(col1);
> > > QUERY PLAN
> > > ---------------------------------------------------------------------
> > > Group (cost=88.50..98.23 rows=200 width=32)
> > > -> Sort (cost=88.50..91.58 rows=1230 width=32)
> > > Sort Key: upper(col1), col1
> > > -> Seq Scan on test (cost=0.00..25.38 rows=1230 width=32)
> > > (4 rows)
> >
> > I don't think you can assume that that will be true for any locale. If there
> > are two different characters that both have the same uppercase version, this
> > will break things.
>
> No it won't.

Sure it will, because when you do the group by you will get a different
number of groups. When grouping by the original characters you will get
separate groups for characters that have the same uppercase character, where
as when grouing by the uppercased characters you won't.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Bruno Wolff III <bruno(at)wolff(dot)to>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Warts with SELECT DISTINCT
Date: 2006-05-04 05:13:20
Message-ID: 18946.1146719600@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Bruno Wolff III <bruno(at)wolff(dot)to> writes:
> Greg Stark <gsstark(at)mit(dot)edu> wrote [ baldly summarized ]
>> [ x > y implies upper(x) > upper(y) ]

> I don't think you can assume that that will be true for any locale.

Whether or not that may actually be true for upper() (I share Bruno's
skepticism, but maybe it's so), it really does not matter because the
planner doesn't have enough knowledge about the behavior of upper() to
make such an inference.

I think it's a fair point that we could allow "SELECT DISTINCT x ORDER BY
foo(x)" if foo() is stable, but that does not imply that sorting by x is
interchangeable with sorting by foo(x). foo = abs is a trivial
counterexample.

As far as the original point goes: feel free to reimplement DISTINCT,
but don't break the documented behavior of DISTINCT ON + ORDER BY, or
you'll have a lot of unhappy villagers appear on your doorstep bearing
torches and pitchforks. It might be mostly an implementation artifact,
but it's an awfully useful one ...

regards, tom lane


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Bruno Wolff III <bruno(at)wolff(dot)to>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Warts with SELECT DISTINCT
Date: 2006-05-04 05:32:45
Message-ID: 877j521hj6.fsf@stark.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Bruno Wolff III <bruno(at)wolff(dot)to> writes:

> On Thu, May 04, 2006 at 00:05:16 -0400,
> Greg Stark <gsstark(at)mit(dot)edu> wrote:
> > Bruno Wolff III <bruno(at)wolff(dot)to> writes:
> >
> > > > Whereas it shouldn't be hard to prove that this is equivalent:
> > > >
> > > > stark=> explain select col1 from test group by upper(col1),col1 order by upper(col1);
> > > > QUERY PLAN
> > > > ---------------------------------------------------------------------
> > > > Group (cost=88.50..98.23 rows=200 width=32)
> > > > -> Sort (cost=88.50..91.58 rows=1230 width=32)
> > > > Sort Key: upper(col1), col1
> > > > -> Seq Scan on test (cost=0.00..25.38 rows=1230 width=32)
> > > > (4 rows)
> > >
> > > I don't think you can assume that that will be true for any locale. If there
> > > are two different characters that both have the same uppercase version, this
> > > will break things.
> >
> > No it won't.
>
> Sure it will, because when you do the group by you will get a different
> number of groups. When grouping by the original characters you will get
> separate groups for characters that have the same uppercase character, where
> as when grouing by the uppercased characters you won't.

But grouping on *both* will produce the same groups as grouping on the
original characters alone.

--
greg


From: Bruno Wolff III <bruno(at)wolff(dot)to>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Warts with SELECT DISTINCT
Date: 2006-05-04 06:05:23
Message-ID: 20060504060523.GA26841@wolff.to
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, May 04, 2006 at 01:32:45 -0400,
Greg Stark <gsstark(at)mit(dot)edu> wrote:
> Bruno Wolff III <bruno(at)wolff(dot)to> writes:
>
> > On Thu, May 04, 2006 at 00:05:16 -0400,
> > Greg Stark <gsstark(at)mit(dot)edu> wrote:
> > > Bruno Wolff III <bruno(at)wolff(dot)to> writes:
> > >
> > > > > Whereas it shouldn't be hard to prove that this is equivalent:
> > > > >
> > > > > stark=> explain select col1 from test group by upper(col1),col1 order by upper(col1);
> > > > > QUERY PLAN
> > > > > ---------------------------------------------------------------------
> > > > > Group (cost=88.50..98.23 rows=200 width=32)
> > > > > -> Sort (cost=88.50..91.58 rows=1230 width=32)
> > > > > Sort Key: upper(col1), col1
> > > > > -> Seq Scan on test (cost=0.00..25.38 rows=1230 width=32)
> > > > > (4 rows)
> > > >
> > > > I don't think you can assume that that will be true for any locale. If there
> > > > are two different characters that both have the same uppercase version, this
> > > > will break things.
> > >
> > > No it won't.
> >
> > Sure it will, because when you do the group by you will get a different
> > number of groups. When grouping by the original characters you will get
> > separate groups for characters that have the same uppercase character, where
> > as when grouing by the uppercased characters you won't.
>
> But grouping on *both* will produce the same groups as grouping on the
> original characters alone.

OK, I misssed that. My brain only saw upper(col) and not the immediately
following ,col1.
I aggree that grouping by col1 and upper(col1), col1 will give you the same
groups. And hence the queries should be equivalent.


From: Bruno Wolff III <bruno(at)wolff(dot)to>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Warts with SELECT DISTINCT
Date: 2006-05-04 06:21:53
Message-ID: 20060504062153.GB26841@wolff.to
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, May 04, 2006 at 01:13:20 -0400,
Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
> I think it's a fair point that we could allow "SELECT DISTINCT x ORDER BY
> foo(x)" if foo() is stable, but that does not imply that sorting by x is
> interchangeable with sorting by foo(x). foo = abs is a trivial
> counterexample.

I misunderstood Greg's example. Sorting by (foo(x), x) is a suitable
replacement for sorting by foo(x). So that it would be OK to rewrite
SELECT DISTINCT x ORDER BY foo(x)
as
SELECT DISTINCT ON (foo(x), x) x ORDER BY foo(x)

Whether or not this is worthwhile to automate, I am not in a good position
to judge.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Bruno Wolff III <bruno(at)wolff(dot)to>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Warts with SELECT DISTINCT
Date: 2006-05-04 06:39:33
Message-ID: 19612.1146724773@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Bruno Wolff III <bruno(at)wolff(dot)to> writes:
> ... it would be OK to rewrite
> SELECT DISTINCT x ORDER BY foo(x)
> as
> SELECT DISTINCT ON (foo(x), x) x ORDER BY foo(x)

This assumes that x = y implies foo(x) = foo(y), which is something
that's not necessarily the case, mainly because a datatype's "="
function need not have a lot to do with the behavior of arbitrary
functions foo(), especially if foo() yields a different datatype.
The citext datatype is an easy counterexample: it thinks "foo" = "Foo",
but md5() of those values will not yield the same answers.

The bottom line here is that this sort of deduction requires more
understanding of the properties of datatypes and functions than
our existing catalogs allow the planner to obtain.

regards, tom lane


From: Bruno Wolff III <bruno(at)wolff(dot)to>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Warts with SELECT DISTINCT
Date: 2006-05-04 14:06:11
Message-ID: 20060504140611.GA19321@wolff.to
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, May 04, 2006 at 02:39:33 -0400,
Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Bruno Wolff III <bruno(at)wolff(dot)to> writes:
> > ... it would be OK to rewrite
> > SELECT DISTINCT x ORDER BY foo(x)
> > as
> > SELECT DISTINCT ON (foo(x), x) x ORDER BY foo(x)
>
> This assumes that x = y implies foo(x) = foo(y), which is something
> that's not necessarily the case, mainly because a datatype's "="
> function need not have a lot to do with the behavior of arbitrary
> functions foo(), especially if foo() yields a different datatype.
> The citext datatype is an easy counterexample: it thinks "foo" = "Foo",
> but md5() of those values will not yield the same answers.
>
> The bottom line here is that this sort of deduction requires more
> understanding of the properties of datatypes and functions than
> our existing catalogs allow the planner to obtain.

Thanks for pointing that out. I should have realized that this was the same
(or at least close to) issue I was thinking would be a problem initially, but
then I started thinking that '=' promised more than it did and assumed that
x = y implies foo(x) = foo(y), which as you point out isn't always true.


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Bruno Wolff III <bruno(at)wolff(dot)to>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Warts with SELECT DISTINCT
Date: 2006-05-04 16:47:09
Message-ID: 871wv920vm.fsf@stark.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Bruno Wolff III <bruno(at)wolff(dot)to> writes:

> Thanks for pointing that out. I should have realized that this was the same
> (or at least close to) issue I was thinking would be a problem initially, but
> then I started thinking that '=' promised more than it did and assumed that
> x = y implies foo(x) = foo(y), which as you point out isn't always true.

Hm. This goes back to the earlier conversation about whether = should ever be
true for two objects that aren't, well, equal. I thought there was some
consensus at the time that sorting should impose a superficial ordering on
items that compare equal but aren't in fact the same.

--
greg


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Bruno Wolff III <bruno(at)wolff(dot)to>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Warts with SELECT DISTINCT
Date: 2006-05-04 17:18:30
Message-ID: 11287.1146763110@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Greg Stark <gsstark(at)mit(dot)edu> writes:
> Hm. This goes back to the earlier conversation about whether = should ever be
> true for two objects that aren't, well, equal. I thought there was some
> consensus at the time that sorting should impose a superficial ordering on
> items that compare equal but aren't in fact the same.

We forced that recently for text strings (overriding locale-specific
cases in strcoll()), but there certainly was not any intent to decree
that every datatype must do likewise.

regards, tom lane