From: | Sameer Kumar <sameer(dot)kumar(at)ashnik(dot)com> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Using indexes for ORDER BY and PARTITION BY clause in windowing functions |
Date: | 2013-10-23 16:37:42 |
Message-ID: | CADp-Sm5npBTnnt20w84=00P2d8XsQQ0pogoPHAgq2okttm+vdw@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Wed, Oct 23, 2013 at 10:58 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Sameer Kumar <sameer(dot)kumar(at)ashnik(dot)com> writes:
> > I am not sure why but my PostgreSQL does not seem to be using indexes for
> > ORDER BY clause or PARTITION BY CLAUSE which I use with windowing
> function.
>
> When the entire contents of the table have to be read, a seqscan-and-sort
> will frequently be estimated as cheaper than an indexscan. If you think
> this is not true on your hardware, you might need to adjust
> random_page_cost.
>
> regards, tom lane
>
My mistake. I had understood the issue wrongly.
Actually when I use functions like max to find the maximum value grouped by
another column I get a better performance when I try to do the same
operation using max() over().
Take a look at below plan:
edb=# \x
Expanded display is on.
edb=# \dS= student_score;
Table "enterprisedb.student_score"
Column | Type | Modifiers
--------------+-------------------------+-----------
id | integer | not null
student_name | character varying(1000) |
score | integer |
course | character varying(100) |
Indexes:
"student_score_pkey" PRIMARY KEY, btree (id)
"idx_course" btree (course)
"idx_score" btree (score)
edb=# select count(*) from student_score ;
-[ RECORD 1 ]-
count | 122880
edb=# explain analyze select max(score) from student_score group by course;
-[ RECORD 1
]-------------------------------------------------------------------------------------------------------------------------
QUERY PLAN | HashAggregate (cost=3198.20..3198.26 rows=6 width=9) (actual
time=110.792..110.793 rows=6 loops=1)
-[ RECORD 2
]-------------------------------------------------------------------------------------------------------------------------
QUERY PLAN | -> Seq Scan on student_score (cost=0.00..2583.80
rows=122880 width=9) (actual time=0.011..23.055 rows=122880 loops=1)
-[ RECORD 3
]-------------------------------------------------------------------------------------------------------------------------
QUERY PLAN | Total runtime: 110.862 ms
edb=# explain analyze select max(score) over(partition by course) from
student_score ;
-[ RECORD 1
]--------------------------------------------------------------------------------------------------------------------------------------------
QUERY PLAN | WindowAgg (cost=0.00..10324.65 rows=122880 width=9) (actual
time=36.145..224.504 rows=122880 loops=1)
-[ RECORD 2
]--------------------------------------------------------------------------------------------------------------------------------------------
QUERY PLAN | -> Index Scan using idx_course on student_score
(cost=0.00..8481.45 rows=122880 width=9) (actual time=0.037..85.283
rows=122880 loops=1)
-[ RECORD 3
]--------------------------------------------------------------------------------------------------------------------------------------------
QUERY PLAN | Total runtime: 242.949 ms
AS you can see there is a difference of twice. On similar lines, when I
have to find students who "topped" (had highest score) per course, I will
fire something like below:
edb=# explain analyze select student_name from student_score where
(course,score)in (select course,max(score) from student_score group by
course);
-[ RECORD 1
]---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
QUERY PLAN | Hash Semi Join (cost=3198.41..6516.76 rows=7300 width=43)
(actual time=113.727..181.045 rows=555 loops=1)
-[ RECORD 2
]---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
QUERY PLAN | Hash Cond: (((enterprisedb.student_score.course)::text =
(enterprisedb.student_score.course)::text) AND
(enterprisedb.student_score.score =
(max(enterprisedb.student_score.score))))
-[ RECORD 3
]---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
QUERY PLAN | -> Seq Scan on student_score (cost=0.00..2583.80
rows=122880 width=52) (actual time=0.009..22.702 rows=122880 loops=1)
-[ RECORD 4
]---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
QUERY PLAN | -> Hash (cost=3198.32..3198.32 rows=6 width=9) (actual
time=111.521..111.521 rows=6 loops=1)
-[ RECORD 5
]---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
QUERY PLAN | Buckets: 1024 Batches: 1 Memory Usage: 1kB
-[ RECORD 6
]---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
QUERY PLAN | -> HashAggregate (cost=3198.20..3198.26 rows=6
width=9) (actual time=111.506..111.507 rows=6 loops=1)
-[ RECORD 7
]---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
QUERY PLAN | -> Seq Scan on student_score
(cost=0.00..2583.80 rows=122880 width=9) (actual time=0.002..23.303
rows=122880 loops=1)
-[ RECORD 8
]---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
QUERY PLAN | Total runtime: 181.284 ms
An alternative way of doing this could be:
edb=# explain analyze select student_name from (select student_name,
dense_rank() over(partition by course order by score)rn from student_score)
where rn=1 ;
-[ RECORD 1
]--------------------------------------------------------------------------------------------------------------------------------------
QUERY PLAN | Subquery Scan on __unnamed_subquery_0
(cost=12971.39..16964.99 rows=614 width=43) (actual
time=2606.075..2953.937 rows=558 loops=1)
-[ RECORD 2
]--------------------------------------------------------------------------------------------------------------------------------------
QUERY PLAN | Filter: (__unnamed_subquery_0.rn = 1)
-[ RECORD 3
]--------------------------------------------------------------------------------------------------------------------------------------
QUERY PLAN | -> WindowAgg (cost=12971.39..15428.99 rows=122880
width=52) (actual time=2606.063..2928.061 rows=122880 loops=1)
-[ RECORD 4
]--------------------------------------------------------------------------------------------------------------------------------------
QUERY PLAN | -> Sort (cost=12971.39..13278.59 rows=122880
width=52) (actual time=2606.020..2733.677 rows=122880 loops=1)
-[ RECORD 5
]--------------------------------------------------------------------------------------------------------------------------------------
QUERY PLAN | Sort Key: student_score.course,
student_score.score
-[ RECORD 6
]--------------------------------------------------------------------------------------------------------------------------------------
QUERY PLAN | Sort Method: external merge Disk: 7576kB
-[ RECORD 7
]--------------------------------------------------------------------------------------------------------------------------------------
QUERY PLAN | -> Seq Scan on student_score
(cost=0.00..2583.80 rows=122880 width=52) (actual time=0.009..49.026
rows=122880 loops=1)
-[ RECORD 8
]--------------------------------------------------------------------------------------------------------------------------------------
QUERY PLAN | Total runtime: 2958.653 ms
The second format of query could be more useful in DW and Data mining
operations where I might not be always looking highest scorer. I may have
to look for 2nd highest scorer. I can make this 2nd query parameterized and
filter on rn could be 2 or 3 based on my interest.
Another thing, (I may be stupid and naive here) does PostgreSQL re-uses the
hash which has been already created for sort. In this case the inner query
must have created a hash for windoing aggregate. Can't we use that same one
while applying the the filter "rn=1" ?
From | Date | Subject | |
---|---|---|---|
Next Message | Mike Blackwell | 2013-10-23 16:40:54 | RULE regression test fragility? |
Previous Message | Josh Berkus | 2013-10-23 16:24:27 | Re: Location for external scripts for Extensions? |