Re: Optimizing DISTINCT with LIMIT

Lists: pgsql-hackers
From: tmp <skrald(at)amossen(dot)dk>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Optimizing DISTINCT with LIMIT
Date: 2008-12-04 13:32:51
Message-ID: gh8m5v$7oj$1@news.hub.org

As far as I understand, the following query
SELECT DISTINCT foo
FROM bar
LIMIT baz
is executed by first sorting the input and then traversing the sorted
data, ensuring uniqueness of the output and stopping when the LIMIT
threshold is reached. Furthermore, part of the sort procedure is to
traverse the input at least once.

Now, if the input is large but the LIMIT threshold is small, this
sorting step may increase the query time unnecessarily, so here is a
suggested optimization:
If the input is "sufficiently" large and the LIMIT threshold
"sufficiently" small, maintain the DISTINCT output by hashing while
traversing the input, and stop when the LIMIT threshold is reached. No
sorting required, and *at* *most* one read of the input.

Use case: Websites that need to present small samples of huge query results fast.
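The proposed hash-based early stopping can be sketched in a few lines of Python (an illustrative sketch only, not PostgreSQL executor code; the function name is invented for illustration):

```python
def distinct_with_limit(rows, limit):
    """Yield distinct values, stopping as soon as `limit` distinct
    values have been produced: no sort, at most one pass over input."""
    seen = set()
    for value in rows:
        if value not in seen:
            seen.add(value)
            yield value
            if len(seen) >= limit:
                return  # LIMIT reached: stop scanning the input early
```

For example, list(distinct_with_limit([7, 7, 3, 3, 1, 9], 2)) returns [7, 3] after reading only three of the six input rows.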


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: tmp <skrald(at)amossen(dot)dk>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Optimizing DISTINCT with LIMIT
Date: 2008-12-04 14:09:34
Message-ID: 878wqw117l.fsf@oxford.xeocode.com

tmp <skrald(at)amossen(dot)dk> writes:

> If the input is "sufficiently" large and the LIMIT threshold "sufficiently"
> small, maintain the DISTINCT output by hashing while traversing the input, and
> stop when the LIMIT threshold is reached. No sorting required, and *at* *most*
> one read of the input.

You mean like this?

postgres=# explain select distinct x from i limit 5;
                            QUERY PLAN
-------------------------------------------------------------------
 Limit  (cost=54.50..54.51 rows=1 width=304)
   ->  HashAggregate  (cost=54.50..54.51 rows=1 width=304)
         ->  Seq Scan on i  (cost=0.00..52.00 rows=1000 width=304)
(3 rows)

This will be in the upcoming 8.4 release.

Versions since about 7.4 or so have been capable of producing this plan but
not for DISTINCT, only for the equivalent GROUP BY query:

postgres=# explain select x from i group by x limit 5;

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's 24x7 Postgres support!


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: tmp <skrald(at)amossen(dot)dk>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Optimizing DISTINCT with LIMIT
Date: 2008-12-04 14:20:17
Message-ID: 4937E721.2050407@enterprisedb.com

Gregory Stark wrote:
> tmp <skrald(at)amossen(dot)dk> writes:
>
>> If the input is "sufficiently" large and the LIMIT threshold "sufficiently"
>> small, maintain the DISTINCT output by hashing while traversing the input, and
>> stop when the LIMIT threshold is reached. No sorting required, and *at* *most*
>> one read of the input.
>
> You mean like this?
>
> postgres=# explain select distinct x from i limit 5;
>                             QUERY PLAN
> -------------------------------------------------------------------
>  Limit  (cost=54.50..54.51 rows=1 width=304)
>    ->  HashAggregate  (cost=54.50..54.51 rows=1 width=304)
>          ->  Seq Scan on i  (cost=0.00..52.00 rows=1000 width=304)
> (3 rows)

Does that know to stop scanning as soon as it has seen 5 distinct values?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: tmp <skrald(at)amossen(dot)dk>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Optimizing DISTINCT with LIMIT
Date: 2008-12-04 14:32:07
Message-ID: 871vwo1060.fsf@oxford.xeocode.com


Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:

> Gregory Stark wrote:
> Does that know to stop scanning as soon as it has seen 5 distinct values?

Uhm, hm. Apparently not :(

postgres=# create or replace function v(integer) returns integer as $$begin raise notice 'called %', $1; return $1; end$$ language plpgsql volatile;
CREATE FUNCTION
postgres=# select distinct v(i) from generate_series(1,10) as a(i) limit 3;
NOTICE: 00000: called 1
LOCATION: exec_stmt_raise, pl_exec.c:2542
NOTICE: 00000: called 2
LOCATION: exec_stmt_raise, pl_exec.c:2542
NOTICE: 00000: called 3
LOCATION: exec_stmt_raise, pl_exec.c:2542
NOTICE: 00000: called 4
LOCATION: exec_stmt_raise, pl_exec.c:2542
NOTICE: 00000: called 5
LOCATION: exec_stmt_raise, pl_exec.c:2542
NOTICE: 00000: called 6
LOCATION: exec_stmt_raise, pl_exec.c:2542
NOTICE: 00000: called 7
LOCATION: exec_stmt_raise, pl_exec.c:2542
NOTICE: 00000: called 8
LOCATION: exec_stmt_raise, pl_exec.c:2542
NOTICE: 00000: called 9
LOCATION: exec_stmt_raise, pl_exec.c:2542
NOTICE: 00000: called 10
LOCATION: exec_stmt_raise, pl_exec.c:2542
 v
---
 5
 4
 6
(3 rows)

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's PostGIS support!


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Gregory Stark <stark(at)enterprisedb(dot)com>, tmp <skrald(at)amossen(dot)dk>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Optimizing DISTINCT with LIMIT
Date: 2008-12-04 14:35:26
Message-ID: 24128.1228401326@sss.pgh.pa.us

Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
> Gregory Stark wrote:
>> You mean like this?
>>
>> postgres=# explain select distinct x from i limit 5;
>>                             QUERY PLAN
>> -------------------------------------------------------------------
>>  Limit  (cost=54.50..54.51 rows=1 width=304)
>>    ->  HashAggregate  (cost=54.50..54.51 rows=1 width=304)
>>          ->  Seq Scan on i  (cost=0.00..52.00 rows=1000 width=304)
>> (3 rows)

> Does that know to stop scanning as soon as it has seen 5 distinct values?

In principle, if there are no aggregate functions, then nodeAgg could
return a row immediately upon making any new entry into the hash table.
Whether it's worth the code uglification is debatable ... I think it
would require a third major pathway through nodeAgg.

regards, tom lane


From: tmp <skrald(at)amossen(dot)dk>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Optimizing DISTINCT with LIMIT
Date: 2008-12-04 19:34:44
Message-ID: gh9bcj$2o2q$1@news.hub.org

> In principle, if there are no aggregate functions, then nodeAgg could
> return a row immediately upon making any new entry into the hash table.
> Whether it's worth the code uglification is debatable ... I think it
> would require a third major pathway through nodeAgg.

Regarding whether it's worth the effort: in each of my three past jobs
(all using PostgreSQL) I have encountered several queries that would
fetch a small subset of a large - even huge - input. I think these
types of queries are relatively common out there, but if they are
executed for, e.g., a web client, the current late LIMIT evaluation
makes them simply a no-go.

Also, it is my impression that many people use LIMIT to minimize the
evaluation time of subqueries from which the outer query only needs a
small subset of the output.


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: tmp <skrald(at)amossen(dot)dk>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Optimizing DISTINCT with LIMIT
Date: 2008-12-04 20:09:57
Message-ID: 87bpvrzoq2.fsf@oxford.xeocode.com


tmp <skrald(at)amossen(dot)dk> writes:

> Regarding whether it's worth the effort: in each of my three past jobs (all
> using PostgreSQL) I have encountered several queries that would fetch a small
> subset of a large - even huge - input. I think these types of queries are
> relatively common out there, but if they are executed for, e.g., a web client,
> the current late LIMIT evaluation makes them simply a no-go.
>
> Also, it is my impression that many people use LIMIT to minimize the
> evaluation time of subqueries from which the outer query only needs a small
> subset of the output.

I've seen lots of queries which only pull a subset of the results too -- but
it's always a specific subset. So that means using ORDER BY or a WHERE clause
to control it.

In this example the subset returned is completely arbitrary. That's a much
finer slice of queries.

I would tend to think it's worth it myself. I can see cases where the subset
selected doesn't really matter -- for instance, if you're only testing whether
there are at least a certain number of distinct values. Or if you're using up
some inventory and it's not important in what order you use it, only that you
fetch some candidate inventory and process it.

But I can also see Tom's reluctance. It's a fair increase in the amount of
code to maintain in that file for a pretty narrow use case. On the other hand
it looks like it would be all in that file. The planner wouldn't have to do
anything special to set it up which is nice.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's 24x7 Postgres support!


From: tmp <skrald(at)amossen(dot)dk>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Optimizing DISTINCT with LIMIT
Date: 2008-12-05 13:08:59
Message-ID: ghb957$28ts$1@news.hub.org

> I would tend to think it's worth it myself.

I am unfortunately not familiar enough with the PostgreSQL code base to
be comfortable providing a patch. Can I submit this optimization
request to some sort of issue tracker, or what should I do?


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: tmp <skrald(at)amossen(dot)dk>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Optimizing DISTINCT with LIMIT
Date: 2008-12-05 14:09:45
Message-ID: 8763lyww5y.fsf@oxford.xeocode.com

tmp <skrald(at)amossen(dot)dk> writes:

>> I would tend to think it's worth it myself.
>
> I am unfortunately not familiar enough with the PostgreSQL code base to be
> comfortable providing a patch. Can I submit this optimization request to some
> sort of issue tracker, or what should I do?

You could add it here -- note that if we decide it isn't worth it, it'll
just get removed.

http://wiki.postgresql.org/wiki/Todo

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's 24x7 Postgres support!


From: David Lee Lambert <davidl(at)lmert(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Optimizing DISTINCT with LIMIT
Date: 2008-12-06 11:29:05
Message-ID: 200812060629.08223.davidl@lmert.com

On Thursday 04 December 2008 15:09, Gregory Stark wrote:
> tmp <skrald(at)amossen(dot)dk> writes:

> > Also, it is my impression that many people use LIMIT to minimize the
> > evaluation time of sub queries from which the outer query only needs a
> > small subset of the sub query output.
>
> I've seen lots of queries which only pull a subset of the results too --
> but it's always a specific subset. So that means using ORDER BY or a WHERE
> clause to control it.

I use "ORDER BY random() LIMIT :some_small_number" frequently to get a "feel"
for data. That always builds the unrandomized relation and then sorts it. I
guess an alternate path for single-table queries would be to randomly choose
a block number and then a tuple number; but that would be biased toward long
rows (of which fewer can appear in a block).
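The bias toward long rows can be made concrete with a small back-of-the-envelope calculation (the page and row sizes below are hypothetical, chosen only for illustration):

```python
from fractions import Fraction

# Hypothetical layout: 100-byte pages holding "short" 10-byte rows
# or "long" 50-byte rows, with 10 pages of each kind.
PAGE_SIZE = 100
short_per_page = PAGE_SIZE // 10   # 10 short rows fit per page
long_per_page = PAGE_SIZE // 50    # only 2 long rows fit per page
n_pages = 20                       # 10 short pages + 10 long pages

# "Pick a random block, then a random tuple within it":
# P(one specific row) = P(picking its page) * P(picking it within the page)
p_short = Fraction(1, n_pages) * Fraction(1, short_per_page)
p_long = Fraction(1, n_pages) * Fraction(1, long_per_page)

print(p_long / p_short)  # 5: each long row is 5x as likely to be sampled
```

With these numbers, a row that is 5x as wide is exactly 5x as likely to be chosen, so the sample is weighted by row width rather than uniform over rows.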

--
David Lee Lambert ... Software Developer
Cell phone: +1 586-873-8813 ; alt. email <as4109(at)wayne(dot)edu> or
<lamber45(at)msu(dot)edu>
GPG key at http://www.lmert.com/keyring.txt


From: Grzegorz Jaskiewicz <gj(at)pointblue(dot)com(dot)pl>
To: David Lee Lambert <davidl(at)lmert(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Optimizing DISTINCT with LIMIT
Date: 2008-12-06 18:08:56
Message-ID: 1E3782D0-1A9D-4C28-A307-F06E767312AA@pointblue.com.pl


On 2008-12-06, at 11:29, David Lee Lambert wrote:
> I use "ORDER BY random() LIMIT :some_small_number" frequently to get a "feel"
> for data. That always builds the unrandomized relation and then sorts it. I
> guess an alternate path for single-table queries would be to randomly choose
> a block number and then a tuple number; but that would be biased toward long
> rows (of which fewer can appear in a block).

but that's going to be extremely slow, due to the speed of the random()
function.


From: Greg Stark <greg(dot)stark(at)enterprisedb(dot)com>
To: Grzegorz Jaskiewicz <gj(at)pointblue(dot)com(dot)pl>
Cc: David Lee Lambert <davidl(at)lmert(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Optimizing DISTINCT with LIMIT
Date: 2008-12-06 18:48:31
Message-ID: 10F17BC4-64B4-468A-B28B-F7361D62C479@enterprisedb.com

It's slow because there's no way around running through the entire
input. The optimization tmp is talking about wouldn't be relevant
because there is an ORDER BY clause, which is precisely why I said it
was a fairly narrow use case. Most people who use LIMIT want a
specific subset, even if that specific subset is random. Without the
ORDER BY the subset is entirely arbitrary, but not usefully random.

Incidentally "order by ... limit" is amenable to an optimization which
avoids having to *sort* the whole input even though it still has to
read the whole input. We implemented that in 8.3.
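The top-N idea behind that optimization is to keep only the best n rows seen so far in a bounded heap rather than sorting the whole input; a Python sketch of the idea (illustrative only, using the standard heapq module, not PostgreSQL's tuplesort code):

```python
import heapq

def order_by_limit(rows, n, key=None):
    """ORDER BY ... LIMIT n without a full sort: one pass over the
    input, keeping only the current best n rows in a bounded heap."""
    return heapq.nsmallest(n, rows, key=key)
```

For example, order_by_limit([5, 1, 4, 2, 3], 2) returns [1, 2] while never holding more than two candidate rows at a time.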

greg

On 6 Dec 2008, at 06:08 PM, Grzegorz Jaskiewicz <gj(at)pointblue(dot)com(dot)pl>
wrote:

>
> On 2008-12-06, at 11:29, David Lee Lambert wrote:
>
>> I use "ORDER BY random() LIMIT :some_small_number" frequently to get a "feel"
>> for data. That always builds the unrandomized relation and then sorts it. I
>> guess an alternate path for single-table queries would be to randomly choose
>> a block number and then a tuple number; but that would be biased toward long
>> rows (of which fewer can appear in a block).
>
> but that's going to be extremely slow, due to the speed of the random()
> function.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: tmp <skrald(at)amossen(dot)dk>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Optimizing DISTINCT with LIMIT
Date: 2008-12-08 16:26:15
Message-ID: 27036.1228753575@sss.pgh.pa.us

Gregory Stark <stark(at)enterprisedb(dot)com> writes:
> But I can also see Tom's reluctance. It's a fair increase in the amount of
> code to maintain in that file for a pretty narrow use case. On the other hand
> it looks like it would be all in that file. The planner wouldn't have to do
> anything special to set it up which is nice.

No, the planner would have to be changed to be aware of the behavioral
difference. Otherwise it might pick some other plan besides the one
that has the performance advantage.

regards, tom lane


From: tmp <skrald(at)amossen(dot)dk>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Optimizing DISTINCT with LIMIT
Date: 2008-12-16 20:53:29
Message-ID: gi94g7$be$1@news.hub.org

> You could add it here -- note that if we decide it isn't worth it, it'll
> just get removed.

Which category would you recommend? "Optimizer / Executor"?