Quick Links

Re: SELECT DISTINCT Performance Issue

Lists:	pgsql-performance

From:	K C Lau <KCLau(at)attglobal(dot)net>
To:	pgsql-performance(at)postgresql(dot)org
Subject:	SELECT DISTINCT Performance Issue
Date:	2005-06-03 01:56:49
Message-ID:	6.2.1.2.0.20050602190756.05ab2c70@localhost
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-performance

Hi All,

We are testing PostgreSQL 8.0.3 on MS Windows for porting an OLTP system
from MS SqlServer.

We got a major performance issue which seems to boil down to the following
type of query:

select DISTINCT ON (PlayerID) PlayerID,AtDate from Player where
PlayerID='22220' order by PlayerID desc, AtDate desc;
The Player table has primary key (PlayerID, AtDate) representing data over
time and the query gets the latest data for a player.

With enable_seqscan forced off (which I'm not sure if that should be done
for a production system), the average query still takes a very long time to
return a record:

esdt=> explain analyze select DISTINCT ON (PlayerID) PlayerID,AtDate from
Player
where PlayerID='22220' order by PlayerID desc, AtDate desc;
Unique (cost=0.00..2507.66 rows=1 width=23) (actual time=0.000..187.000
rows=1 loops=1)
-> Index Scan Backward using pk_player on player (cost=0.00..2505.55
rows=8
43 width=23) (actual time=0.000..187.000 rows=1227 loops=1)
Index Cond: ((playerid)::text = '22220'::text)
Total runtime: 187.000 ms

It appears that all the 1227 data records for that player were searched,
even when doing a backward index scan. I would presume that, after locating
the index for the highest AtDate, only the first data record needs to be
retrieved.

The following summary of tests seems to confirm my observation. They were
done on a quiet system (MS Windows 2000 Server, P4 3.0GHz with
Hyperthreading, 1GB Memory, PostgreSQL shared_buffers = 50000), starting
with a test database before doing a vacuum:

set enable_seqscan = off;
select Total runtime: 187.000 ms
again: Total runtime: 78.000 ms
vacuum analyze verbose player;
select Total runtime: 47.000 ms
again: Total runtime: 47.000 ms
reindex table player;
select Total runtime: 78.000 ms
again: Total runtime: 63.000 ms
cluster pk_player on player;
select Total runtime: 16.000 ms
again: Total runtime: 0.000 ms
set enable_seqscan = on;
analyze verbose player;
select Total runtime: 62.000 ms
again: Total runtime: 78.000 ms

Previously, we have also tried to use LIMIT 1 instead of DISTINCT, but the
performance was no better:
select PlayerID,AtDate from Player where PlayerID='22220' order by PlayerID
desc, AtDate desc LIMIT 1

Any clue or suggestions would be most appreciated. If you need further info
or the full explain logs, please let me know.

Regards,
KC.

From:	PFC <lists(at)boutiquenumerique(dot)com>
To:	"K C Lau" <KCLau(at)attglobal(dot)net>, pgsql-performance(at)postgresql(dot)org
Subject:	Re: SELECT DISTINCT Performance Issue
Date:	2005-06-06 11:45:39
Message-ID:	op.srx5mdsmth1vuj@localhost
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-performance

> Previously, we have also tried to use LIMIT 1 instead of DISTINCT, but
> the performance was no better:
> select PlayerID,AtDate from Player where PlayerID='22220' order by
> PlayerID desc, AtDate desc LIMIT 1

The DISTINCT query will pull out all the rows and keep only one, so the
one with LIMIT should be faster. Can you post explain analyze of the LIMIT
query ?

From:	K C Lau <kclau60(at)netvigator(dot)com>
To:	PFC <lists(at)boutiquenumerique(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject:	Re: SELECT DISTINCT Performance Issue
Date:	2005-06-06 14:54:44
Message-ID:	6.2.1.2.0.20050606224010.05df3578@localhost
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-performance

At 19:45 05/06/06, PFC wrote:

>>Previously, we have also tried to use LIMIT 1 instead of DISTINCT, but
>>the performance was no better:
>>select PlayerID,AtDate from Player where PlayerID='22220' order by
>>PlayerID desc, AtDate desc LIMIT 1
>
> The DISTINCT query will pull out all the rows and keep only one,
> so the
>one with LIMIT should be faster. Can you post explain analyze of the LIMIT
>query ?

Actually the problem with LIMIT 1 query is when we use views with the LIMIT
1 construct. The direct SQL is ok:

esdt=> explain analyze select PlayerID,AtDate from Player where
PlayerID='22220'
order by PlayerID desc, AtDate desc LIMIT 1;

Limit (cost=0.00..1.37 rows=1 width=23) (actual time=0.000..0.000 rows=1
loops
=1)
-> Index Scan Backward using pk_player on player (cost=0.00..16074.23
rows=
11770 width=23) (actual time=0.000..0.000 rows=1 loops=1)
Index Cond: ((playerid)::text = '22220'::text)
Total runtime: 0.000 ms

esdt=> create or replace view VCurPlayer3 as select * from Player a
where AtDate = (select b.AtDate from Player b where a.PlayerID = b.PlayerID
order by b.PlayerID desc, b.AtDate desc LIMIT 1);

esdt=> explain analyze select PlayerID,AtDate,version from VCurPlayer3
where Pla
yerID='22220';
Index Scan using pk_player on player a (cost=0.00..33072.78 rows=59
width=27)
(actual time=235.000..235.000 rows=1 loops=1)
Index Cond: ((playerid)::text = '22220'::text)
Filter: ((atdate)::text = ((subplan))::text)
SubPlan
-> Limit (cost=0.00..1.44 rows=1 width=23) (actual
time=0.117..0.117 rows
=1 loops=1743)
-> Index Scan Backward using pk_player on player
b (cost=0.00..1402
3.67 rows=9727 width=23) (actual time=0.108..0.108 rows=1 loops=1743)
Index Cond: (($0)::text = (playerid)::text)
Total runtime: 235.000 ms

The problem appears to be in the loops=1743 scanning all 1743 data records
for that player.

Regards, KC.

From:	George Essig <george(dot)essig(at)gmail(dot)com>
To:	K C Lau <KCLau(at)attglobal(dot)net>
Cc:	pgsql-performance(at)postgresql(dot)org
Subject:	Re: SELECT DISTINCT Performance Issue
Date:	2005-06-08 13:34:29
Message-ID:	6744b38505060806342e6a07d@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-performance

On 6/2/05, K C Lau <KCLau(at)attglobal(dot)net> wrote:
...
>
> select DISTINCT ON (PlayerID) PlayerID,AtDate from Player where
> PlayerID='22220' order by PlayerID desc, AtDate desc;
> The Player table has primary key (PlayerID, AtDate) representing data over
> time and the query gets the latest data for a player.
>
>
...
> esdt=> explain analyze select DISTINCT ON (PlayerID) PlayerID,AtDate from
> Player
> where PlayerID='22220' order by PlayerID desc, AtDate desc;
> Unique (cost=0.00..2507.66 rows=1 width=23) (actual time=0.000..187.000
> rows=1 loops=1)
> -> Index Scan Backward using pk_player on player (cost=0.00..2505.55
> rows=8
> 43 width=23) (actual time=0.000..187.000 rows=1227 loops=1)
> Index Cond: ((playerid)::text = '22220'::text)
> Total runtime: 187.000 ms
>

Is PlayerID an integer datatype or a text datatype. It seems like
PlayerID should be an integer data type, but postgres treats PlayerID
as a text data type. This is because the value '22220' is quoted in
your query. Also, the explain analyze output shows "Index Cond:
((playerid)::text = '22220'::text".

George Essig

From:	K C Lau <kclau60(at)netvigator(dot)com>
To:	George Essig <george(dot)essig(at)gmail(dot)com>
Cc:	pgsql-performance(at)postgresql(dot)org
Subject:	Re: SELECT DISTINCT Performance Issue
Date:	2005-06-08 14:25:16
Message-ID:	6.2.1.2.0.20050608221510.02c4e368@localhost
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-performance

Both keys are text fields. Does it make any difference if PlayerID were
integer?

BTW, I think the real performance problem is when we use SELECT ... ORDER
BY PlayerID DESC, AtDate DESC LIMIT 1 in a VIEW. Please see my subsequent
email http://archives.postgresql.org/pgsql-performance/2005-06/msg00110.php
on this show-stopper problem for which I still have no clue how to get
around. Suggestions are much appreciated.

Thanks and regards, KC.

At 21:34 05/06/08, George Essig wrote:
>On 6/2/05, K C Lau <KCLau(at)attglobal(dot)net> wrote:
>...
> >
> > select DISTINCT ON (PlayerID) PlayerID,AtDate from Player where
> > PlayerID='22220' order by PlayerID desc, AtDate desc;
> > The Player table has primary key (PlayerID, AtDate) representing data over
> > time and the query gets the latest data for a player.
> >
> >
>...
> > esdt=> explain analyze select DISTINCT ON (PlayerID) PlayerID,AtDate from
> > Player
> > where PlayerID='22220' order by PlayerID desc, AtDate desc;
> > Unique (cost=0.00..2507.66 rows=1 width=23) (actual time=0.000..187.000
> > rows=1 loops=1)
> > -> Index Scan Backward using pk_player on player (cost=0.00..2505.55
> > rows=8
> > 43 width=23) (actual time=0.000..187.000 rows=1227 loops=1)
> > Index Cond: ((playerid)::text = '22220'::text)
> > Total runtime: 187.000 ms
> >
>
>Is PlayerID an integer datatype or a text datatype. It seems like
>PlayerID should be an integer data type, but postgres treats PlayerID
>as a text data type. This is because the value '22220' is quoted in
>your query. Also, the explain analyze output shows "Index Cond:
>((playerid)::text = '22220'::text".
>
>George Essig
>
>---------------------------(end of broadcast)---------------------------
>TIP 9: the planner will ignore your desire to choose an index scan if your
> joining column's datatypes do not match

From:	George Essig <george(dot)essig(at)gmail(dot)com>
To:	K C Lau <kclau60(at)netvigator(dot)com>
Cc:	pgsql-performance(at)postgresql(dot)org
Subject:	Re: SELECT DISTINCT Performance Issue
Date:	2005-06-08 14:51:11
Message-ID:	6744b385050608075132eeb8a6@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-performance

On 6/8/05, K C Lau <kclau60(at)netvigator(dot)com> wrote:
> Both keys are text fields. Does it make any difference if PlayerID were
> integer?
>

It can make a difference in speed and integrity. If the column is an
integer, the storage on disk could be smaller for the column and the
related indexes. If the the column is an integer, it would not be
possible to have a value like 'arbitrary value that looks nothing like
an integer'.

George Essig