Re: DISTINCT vs. GROUP BY

Lists: pgsql-hackers
From: Neil Conway <neilc(at)samurai(dot)com>
To: Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: DISTINCT vs. GROUP BY
Date: 2005-09-19 14:52:45
Message-ID: 1127141565.3770.5.camel@localhost.localdomain
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, 2005-19-09 at 16:27 +0200, Hans-Jürgen Schönig wrote:
> I was wondering whether it is possible to teach the planner to handle
> DISTINCT in a more efficient way:
[...]
> Isn't it possible to perform the same operation using a
> HashAggregate?

One problem is that DISTINCT ON is defined to return the first unique
row (according to the query's ORDER BY) for the set of DISTINCT ON
columns, which can't easily be done via hashing.

-Neil


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Neil Conway <neilc(at)samurai(dot)com>
Cc: Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: DISTINCT vs. GROUP BY
Date: 2005-09-19 15:45:10
Message-ID: 8764sxoyd5.fsf@stark.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


Neil Conway <neilc(at)samurai(dot)com> writes:

> On Mon, 2005-19-09 at 16:27 +0200, Hans-Jürgen Schönig wrote:
> > I was wondering whether it is possible to teach the planner to handle
> > DISTINCT in a more efficient way:
> [...]
> > Isn't it possible to perform the same operation using a
> > HashAggregate?
>
> One problem is that DISTINCT ON is defined to return the first unique
> row (according to the query's ORDER BY) for the set of DISTINCT ON
> columns, which can't easily be done via hashing.

Uhm. Sure it can.

DISTINCT is really just special a case of GROUP BY. Even DISTINCT ON is just
GROUP BY with a kind of "first()" aggregate function. What would be really
neat would be to teach GROUP BY about first() and last() and how it can skip
over some index entries and still satisfy the query. Then make DISTINCT and
DISTINCT ON be handled through the exact same code path.

For bonus points teach it that min() and max() can sometimes be treated the
same way if the path is presenting records sorted on that column.

--
greg


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Neil Conway <neilc(at)samurai(dot)com>, Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: DISTINCT vs. GROUP BY
Date: 2005-09-19 19:50:47
Message-ID: 29125.1127159447@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Greg Stark <gsstark(at)mit(dot)edu> writes:
> DISTINCT is really just special a case of GROUP BY. Even DISTINCT ON is just
> GROUP BY with a kind of "first()" aggregate function. What would be really
> neat would be to teach GROUP BY about first() and last() and how it can skip
> over some index entries and still satisfy the query. Then make DISTINCT and
> DISTINCT ON be handled through the exact same code path.

You've missed the point entirely.

first() is not a substitute for sorting the input; it is only useful
if the input comes pre-sorted. And if you are going to sort the input,
you might as well use the current implementation of DISTINCT ON and
skip the effort and memory-overflow-risk associated with a hashtable.

I do think hash aggregation is a plausible alternative implementation of
plain DISTINCT, but I don't see the case for using it for DISTINCT ON.

regards, tom lane


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Neil Conway <neilc(at)samurai(dot)com>, Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: DISTINCT vs. GROUP BY
Date: 2005-09-19 21:00:35
Message-ID: 87fys0ojrg.fsf@stark.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

> I do think hash aggregation is a plausible alternative implementation of
> plain DISTINCT, but I don't see the case for using it for DISTINCT ON.

It could be done without presorting the input though not with a simple
first()-like function. It would have be a sort of two-argument min() function
that kept a state variable for the smallest value found so far of the sort
key.

My main motivation here is that it's odd to have two code paths for
implementing the two language constructs when one is really just a special
case of the other. It's a source of cases like this where the code to
implement a query path exists but isn't accessible due to the way the query is
written.

--
greg


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Neil Conway <neilc(at)samurai(dot)com>, Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: DISTINCT vs. GROUP BY
Date: 2005-09-20 02:16:36
Message-ID: 200509200216.j8K2Gar15108@candle.pha.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


Added to TODO:

* Allow DISTINCT to use hashing like GROUP BY

---------------------------------------------------------------------------

Greg Stark wrote:
>
> Neil Conway <neilc(at)samurai(dot)com> writes:
>
> > On Mon, 2005-19-09 at 16:27 +0200, Hans-J?rgen Sch?nig wrote:
> > > I was wondering whether it is possible to teach the planner to handle
> > > DISTINCT in a more efficient way:
> > [...]
> > > Isn't it possible to perform the same operation using a
> > > HashAggregate?
> >
> > One problem is that DISTINCT ON is defined to return the first unique
> > row (according to the query's ORDER BY) for the set of DISTINCT ON
> > columns, which can't easily be done via hashing.
>
> Uhm. Sure it can.
>
>
> DISTINCT is really just special a case of GROUP BY. Even DISTINCT ON is just
> GROUP BY with a kind of "first()" aggregate function. What would be really
> neat would be to teach GROUP BY about first() and last() and how it can skip
> over some index entries and still satisfy the query. Then make DISTINCT and
> DISTINCT ON be handled through the exact same code path.
>
> For bonus points teach it that min() and max() can sometimes be treated the
> same way if the path is presenting records sorted on that column.
>
>
> --
> greg
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 3: Have you checked our extensive FAQ?
>
> http://www.postgresql.org/docs/faq
>

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Neil Conway <neilc(at)samurai(dot)com>, Hans-J?rgen Sch?nig <postgres(at)cybertec(dot)at>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: DISTINCT vs. GROUP BY
Date: 2005-09-20 19:40:41
Message-ID: 20050920194041.GR7630@pervasive.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Sep 19, 2005 at 10:16:36PM -0400, Bruce Momjian wrote:
>
> Added to TODO:
>
> * Allow DISTINCT to use hashing like GROUP BY

3 lines above we have...
Consider using hash buckets to do DISTINCT, rather than sorting
This would be beneficial when there are few distinct values.

Can you add
http://archives.postgresql.org/pgsql-hackers/2005-09/msg00810.php? All I
could find on the other TODO was
http://archives.postgresql.org/pgsql-committers/2004-09/msg00028.php,
which doesn't help much...
--
Jim C. Nasby, Sr. Engineering Consultant jnasby(at)pervasive(dot)com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Neil Conway <neilc(at)samurai(dot)com>, "Hans-J?rgen Sch?nig" <postgres(at)cybertec(dot)at>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: DISTINCT vs. GROUP BY
Date: 2005-09-20 21:05:05
Message-ID: 200509202105.j8KL55f24171@candle.pha.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Jim C. Nasby wrote:
> On Mon, Sep 19, 2005 at 10:16:36PM -0400, Bruce Momjian wrote:
> >
> > Added to TODO:
> >
> > * Allow DISTINCT to use hashing like GROUP BY
>
> 3 lines above we have...
> Consider using hash buckets to do DISTINCT, rather than sorting
> This would be beneficial when there are few distinct values.

OK, I have merged these items into one.

>
> Can you add
> http://archives.postgresql.org/pgsql-hackers/2005-09/msg00810.php? All I
> could find on the other TODO was
> http://archives.postgresql.org/pgsql-committers/2004-09/msg00028.php,
> which doesn't help much...

What do these URL's have that the current TODO does not?

* Consider using hash buckets to do DISTINCT, rather than sorting

This would be beneficial when there are few distinct values. This is
already used by GROUP BY.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Neil Conway <neilc(at)samurai(dot)com>, Hans-J?rgen Sch?nig <postgres(at)cybertec(dot)at>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: DISTINCT vs. GROUP BY
Date: 2005-09-20 22:07:50
Message-ID: 20050920220750.GT7630@pervasive.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Sep 20, 2005 at 05:05:05PM -0400, Bruce Momjian wrote:
> Jim C. Nasby wrote:
> > On Mon, Sep 19, 2005 at 10:16:36PM -0400, Bruce Momjian wrote:
> > >
> > > Added to TODO:
> > >
> > > * Allow DISTINCT to use hashing like GROUP BY
> >
> > 3 lines above we have...
> > Consider using hash buckets to do DISTINCT, rather than sorting
> > This would be beneficial when there are few distinct values.
>
> OK, I have merged these items into one.
>
> >
> > Can you add
> > http://archives.postgresql.org/pgsql-hackers/2005-09/msg00810.php? All I
> > could find on the other TODO was
> > http://archives.postgresql.org/pgsql-committers/2004-09/msg00028.php,
> > which doesn't help much...
>
> What do these URL's have that the current TODO does not?
>
> * Consider using hash buckets to do DISTINCT, rather than sorting
>
> This would be beneficial when there are few distinct values. This is
> already used by GROUP BY.

Maybe it's just me, but the recent run-through of the TODO list
indicated that there's a fair number of items that people look at and
don't really knowh what they mean. Providing the context (ie: email
thread) that spawned an idea seems extremely valuable in being able to
explain the idea behind a TODO item. They also usually contain valuable
tips about how a TODO could be implemented. In this example, having
quick reference to the discussion about hashagg and first()/last() would
probably prove useful.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby(at)pervasive(dot)com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Neil Conway <neilc(at)samurai(dot)com>, "Hans-J?rgen Sch?nig" <postgres(at)cybertec(dot)at>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: DISTINCT vs. GROUP BY
Date: 2005-09-20 22:45:07
Message-ID: 200509202245.j8KMj7Z19667@candle.pha.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Jim C. Nasby wrote:
> > What do these URL's have that the current TODO does not?
> >
> > * Consider using hash buckets to do DISTINCT, rather than sorting
> >
> > This would be beneficial when there are few distinct values. This is
> > already used by GROUP BY.
>
> Maybe it's just me, but the recent run-through of the TODO list
> indicated that there's a fair number of items that people look at and
> don't really knowh what they mean. Providing the context (ie: email
> thread) that spawned an idea seems extremely valuable in being able to
> explain the idea behind a TODO item. They also usually contain valuable
> tips about how a TODO could be implemented. In this example, having
> quick reference to the discussion about hashagg and first()/last() would
> probably prove useful.

True, but sometimes the thread winds all around and there isn't a
definative explaination of how to go at something. I woul rather digest
the information to improve it, rather than require people to wade around
in an email thread. Is there some detail the TODO is missing?

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Neil Conway <neilc(at)samurai(dot)com>, Hans-J?rgen Sch?nig <postgres(at)cybertec(dot)at>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: DISTINCT vs. GROUP BY
Date: 2005-09-20 23:13:54
Message-ID: 20050920231353.GC7630@pervasive.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Sep 20, 2005 at 06:45:07PM -0400, Bruce Momjian wrote:
> Jim C. Nasby wrote:
> > > What do these URL's have that the current TODO does not?
> > >
> > > * Consider using hash buckets to do DISTINCT, rather than sorting
> > >
> > > This would be beneficial when there are few distinct values. This is
> > > already used by GROUP BY.
> >
> > Maybe it's just me, but the recent run-through of the TODO list
> > indicated that there's a fair number of items that people look at and
> > don't really knowh what they mean. Providing the context (ie: email
> > thread) that spawned an idea seems extremely valuable in being able to
> > explain the idea behind a TODO item. They also usually contain valuable
> > tips about how a TODO could be implemented. In this example, having
> > quick reference to the discussion about hashagg and first()/last() would
> > probably prove useful.
>
> True, but sometimes the thread winds all around and there isn't a
> definative explaination of how to go at something. I woul rather digest
> the information to improve it, rather than require people to wade around
> in an email thread. Is there some detail the TODO is missing?

Sure, and I think the TODO items usually are digested fine. But if you
want to know any of the details behind the items (like what the
motivation is, or how it could be done), you have to manually go and
search the archives trying to find something.

In this example, it would have been useful to see if there had been any
discussion about if you could use hashagg to do this or not. But
searching the archive turned up nothing, so I guess we'll never know. If
there was a link to a thread in the TODO, we could see exactly what was
discussed.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby(at)pervasive(dot)com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461