Re: a JOIN on same table, but 'slided over'

From: "Gurjeet Singh" <singh(dot)gurjeet(at)gmail(dot)com>
To: "Rafal Pietrak" <rafal(at)zorro(dot)isa-geek(dot)com>
Cc: "hubert depesz lubaczewski" <depesz(at)gmail(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: a JOIN on same table, but 'slided over'
Date: 2007-06-26 14:26:00
Message-ID: 65937bea0706260726i4b904361y37a701418a52c992@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

It _is_ the optimised version.... as you can see from the explain plans
posted in the other mail, the planner shows that the cost is drastically
less than the 'distinct on' version.

For smaller data-sets 'distinct-on' version might seem faster, but for
reasonably larger datasets, it's performance deteriorates exponentially...
This is because of the Nested-loops involved in the plan...

I increased your data-set to 10240 rows by executing the following query
10 times:

insert into test select id+(select max(id) from test), thread, info from
test;

On such data-set (which is not very large by any means), the standard
SQL version executes in almost a second, and on the other hand, I had to
cancel the EXPLAIN ANALYZE of the 'distinct on' query after letting it run
for over three minutes!!!

postgres=# explain analyze
postgres-# select t1.id as id, t2.id as "id+1",
postgres-# t1.thread as thread, t2.thread as "thread+1",
postgres-# t1.info as info, t2.info as "info+1"
postgres-# from test as t1, test as t2
postgres-# where t2.id = ( select min(id) from test as t3 where t3.id >
t1.id )
postgres-# order by t1.id asc;

QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=2971.36..2996.96 rows=10240 width=24) (actual time=
1004.031..1030.116 rows=10239 loops=1)
Sort Key: t1.id
Sort Method: external sort Disk: 416kB
-> Merge Join (cost=840.48..2289.28 rows=10240 width=24) (actual time=
834.218..956.595 rows=10239 loops=1)
Merge Cond: (t2.id = ((subplan)))
-> Index Scan using test_id_key on test t2
(cost=0.00..332.85rows=10240 width=12) (actual time=
0.060..24.503 rows=10240 loops=1)
-> Sort (cost=840.48..866.08 rows=10240 width=12) (actual time=
834.129..854.776 rows=10240 loops=1)
Sort Key: ((subplan))
Sort Method: quicksort Memory: 928kB
-> Seq Scan on test t1 (cost=0.00..158.40 rows=10240
width=12)(actual time=0.196..797.752 rows=10240 loops=1)
SubPlan
-> Result (cost=0.04..0.05 rows=1 width=0) (actual
time=0.062..0.064 rows=1 loops=10240)
InitPlan
-> Limit (cost=0.00..0.04 rows=1 width=4)
(actual time=0.047..0.050 rows=1 loops=10240)
-> Index Scan using test_id_key on
test t3 (cost=0.00..121.98 rows=3413 width=4) (actual
time=0.038..0.038rows=1 loops=10240)
Index Cond: (id > $0)
Filter: (id IS NOT NULL)
Total runtime: 1052.802 ms
(18 rows)
Time: 1056.740 ms

postgres=# explain analyze
postgres-# select
postgres-# distinct on (t1.id)
postgres-# t1.*, t2.*
postgres-# from
postgres-# test t1
postgres-# join test t2 on t2.id > t1.id
postgres-# order by t1.id asc, t2.id asc;
Cancel request sent
ERROR: canceling statement due to user request
postgres=#

On 6/26/07, Rafal Pietrak <rafal(at)zorro(dot)isa-geek(dot)com> wrote:
>
> OK. Have tried this one.... looks like close to 6 times slower then the
> 'non-standard' phrase with 'distinct on'.
>
> On the small dataset that I've included in my original post (ten rows of
> data within TEST), I've run both queries through EXPLAIN ANALYSE, with
> the following result summary (for clearity, I've cut away the details
> from EXPLAIN output):
>
> -----------STANDARD
> Total runtime: 10.660 ms
> -----------DISTINCT-ON
> Total runtime: 1.479 ms
> -----------
>
> Would there be ways to optimise the standard query to get the
> performance closer to the none-standard one?
>
>
> -R
>
>
> On Tue, 2007-06-26 at 18:05 +0530, Gurjeet Singh wrote:
> > Hi Rafal,
> >
> > Just a note that this is not standard SQL... 'distinct on' is an
> > extension to SQL provided by postgres.
> >
> > Following query utilizes the standard SQL to get the same results:
> >
> > select t1.id as id, t2.id as "id+1",
> > t1.thread as thread, t2.thread as "thread+1",
> > t1.info as info, t2.info as "info+1"
> > from test as t1, test as t2
> > where t2.id = ( select min(id) from test as t3 where t3.id > t1.id);
> >
> > HTH
> > --
> > gurjeet[(dot)singh](at)EnterpriseDB(dot)com
> > singh(dot)gurjeet(at){ gmail | hotmail | indiatimes | yahoo }.com
> >
> > 17°29'34.37"N 78°30'59.76"E - Hyderabad *
> > 18°32'57.25"N 73°56'25.42 "E - Pune
> >
> > Sent from my BlackLaptop device
> >
> > On 6/26/07, Rafal Pietrak <rafal(at)zorro(dot)isa-geek(dot)com> wrote:
> > Marvelous! Thenx!
> >
> > -R
> >
> > On Tue, 2007-06-26 at 10:06 +0200, hubert depesz lubaczewski
> > wrote:
> > > On 6/26/07, Rafal Pietrak <rafal(at)zorro(dot)isa-geek(dot)com> wrote:
> > > Is there an SQL construct to get it?
> > >
> > > select
> > > distinct on (t1.id)
> > > t1.*, t2.*
> > > from
> > > test t1
> > > join test t2 on t2.id > t1.id
> > > order by t1.id asc, t2.id asc
> > >
> > > should do the trick.
> > >
> > > depesz
> > >
> > > --
> > > http://www.depesz.com/ - nowy, lepszy depesz
> >
> > ---------------------------(end of
> > broadcast)---------------------------
> > TIP 4: Have you searched our list archives?
> >
> > http://archives.postgresql.org/
> >
>

--
gurjeet[(dot)singh](at)EnterpriseDB(dot)com
singh(dot)gurjeet(at){ gmail | hotmail | indiatimes | yahoo }.com

17°29'34.37"N 78°30'59.76"E - Hyderabad *
18°32'57.25"N 73°56'25.42"E - Pune

Sent from my BlackLaptop device

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Bart Degryse 2007-06-26 14:36:59 Re: NO DATA FOUND Exception
Previous Message Fernando Hevia 2007-06-26 14:25:49 Re: NO DATA FOUND Exception