How to optimize select count(*)..group by?

From: Bryan Field-Elliot <bryan_lists(at)netmeme(dot)org>
To: pgsql-general(at)postgresql(dot)org
Subject: How to optimize select count(*)..group by?
Date: 2005-07-28 16:19:49
Message-ID: 1122567589.3880.19.camel@localhost.localdomain
Lists: pgsql-general

We have this simple query:

select status, count(*) from customer group by status;

There is already a btree index on status, but the customer table is
huge and this query must be executed very frequently. An EXPLAIN on
this query shows that it is quite costly (and we notice it runs
slowly).
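
For reference, the plan and actual run time can be inspected with
EXPLAIN ANALYZE, which executes the query while timing it:

explain analyze select status, count(*) from customer group by status;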

Can someone recommend the best technique to optimize this? We can create
new indices or rewrite the query, but we'd rather not add new tables or
columns if possible (not just to solve this problem).

Thank you,

Bryan


From: David Fetter <david(at)fetter(dot)org>
To: Bryan Field-Elliot <bryan_lists(at)netmeme(dot)org>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: How to optimize select count(*)..group by?
Date: 2005-07-28 16:38:45
Message-ID: 20050728163845.GB22658@fetter.org
Lists: pgsql-general

On Thu, Jul 28, 2005 at 09:19:49AM -0700, Bryan Field-Elliot wrote:
> We have this simple query:
>
> select status, count(*) from customer group by status;
>
> There is already a btree index on status, but, the customer table is
> huge, and this query must be executed very frequently... an
> "explain" on this query shows that it is quite costly (and we notice
> it runs slowly)...
>
> Can someone recommend the best technique to optimize this? We can
> create new indices, we can re-write this query.. But we'd rather not
> add new tables or columns if possible (not just to solve this
> problem).

You're pretty much stuck with either writing triggers that modify a
cache table or having your performance the way it is now.
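
If you go the trigger route, a rough sketch might look like the
following. The customer_status_summary table, the function, and the
trigger names are all made up for illustration, and plpgsql must be
installed in the database (createlang plpgsql yourdb):

create table customer_status_summary (
    status text primary key,
    cnt    bigint not null
);

-- assumes customer_status_summary has been seeded with one row per
-- status value before the trigger goes live
create or replace function customer_status_summary_fn() returns trigger as '
begin
    if TG_OP = ''INSERT'' then
        update customer_status_summary set cnt = cnt + 1
         where status = NEW.status;
    elsif TG_OP = ''DELETE'' then
        update customer_status_summary set cnt = cnt - 1
         where status = OLD.status;
    elsif TG_OP = ''UPDATE'' and NEW.status <> OLD.status then
        update customer_status_summary set cnt = cnt - 1
         where status = OLD.status;
        update customer_status_summary set cnt = cnt + 1
         where status = NEW.status;
    end if;
    return null;
end;
' language plpgsql;

create trigger customer_status_summary_trg
    after insert or update or delete on customer
    for each row execute procedure customer_status_summary_fn();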

Cheers,
D
--
David Fetter david(at)fetter(dot)org http://fetter.org/
phone: +1 510 893 6100 mobile: +1 415 235 3778

Remember to vote!


From: Richard Huxton <dev(at)archonet(dot)com>
To: Bryan Field-Elliot <bryan_lists(at)netmeme(dot)org>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: How to optimize select count(*)..group by?
Date: 2005-07-28 16:42:42
Message-ID: 42E90B02.5030708@archonet.com
Lists: pgsql-general

Bryan Field-Elliot wrote:
> We have this simple query:
>
> select status, count(*) from customer group by status;
>
> There is already a btree index on status, but, the customer table is
> huge, and this query must be executed very frequently... an "explain" on
> this query shows that it is quite costly (and we notice it runs slowly)...
>
> Can someone recommend the best technique to optimize this? We can create
> new indices, we can re-write this query.. But we'd rather not add new
> tables or columns if possible (not just to solve this problem).

There's no other accurate solution. While PG's MVCC system means you
need less locking, it makes it tricky to determine whether a row is
visible without actually checking on disk.

Simplest solution: have a "customer_status_summary" table and add a
trigger to the "customer" table to keep it up to date. That way, you'll
have extremely cheap counts. Make sure you understand what locking
you'll need with your particular solution.
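
Assuming the hypothetical customer_status_summary table sketched in the
earlier reply, a one-time seeding step might look like this; the SHARE
lock blocks concurrent writes to customer while the starting counts are
taken:

begin;
lock table customer in share mode;
insert into customer_status_summary (status, cnt)
    select status, count(*) from customer group by status;
commit;

-- from then on, the counts are a trivial read:
select status, cnt from customer_status_summary;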

--
Richard Huxton
Archonet Ltd


From: Bricklen Anderson <BAnderson(at)PresiNET(dot)com>
To: David Fetter <david(at)fetter(dot)org>
Cc: Bryan Field-Elliot <bryan_lists(at)netmeme(dot)org>, pgsql-general(at)postgresql(dot)org
Subject: Re: How to optimize select count(*)..group by?
Date: 2005-07-28 17:02:43
Message-ID: 42E90FB3.8090201@PresiNET.com
Lists: pgsql-general

David Fetter wrote:
> On Thu, Jul 28, 2005 at 09:19:49AM -0700, Bryan Field-Elliot wrote:
>
>>We have this simple query:
>>
>>select status, count(*) from customer group by status;
>>
>>There is already a btree index on status, but, the customer table is
>>huge, and this query must be executed very frequently... an
>>"explain" on this query shows that it is quite costly (and we notice
>>it runs slowly)...
>>
>>Can someone recommend the best technique to optimize this? We can
>>create new indices, we can re-write this query.. But we'd rather not
>>add new tables or columns if possible (not just to solve this
>>problem).
>
>
> You're pretty much stuck with either writing triggers that modify a
> cache table or having your performance the way it is now.
>
> Cheers,
> D
How about the new bitmap index support? I wonder whether that would
result in better performance for this type of query.



From: Greg Stark <gsstark(at)mit(dot)edu>
To: Bryan Field-Elliot <bryan_lists(at)netmeme(dot)org>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: How to optimize select count(*)..group by?
Date: 2005-07-28 17:22:04
Message-ID: 87fyty977n.fsf@stark.xeocode.com
Lists: pgsql-general

Bryan Field-Elliot <bryan_lists(at)netmeme(dot)org> writes:

> We have this simple query:
>
> select status, count(*) from customer group by status;
>
> There is already a btree index on status, but, the customer table is
> huge, and this query must be executed very frequently... an "explain" on
> this query shows that it is quite costly (and we notice it runs
> slowly)...
>
> Can someone recommend the best technique to optimize this? We can create
> new indices, we can re-write this query.. But we'd rather not add new
> tables or columns if possible (not just to solve this problem).

Without creating a new table you have three options, none of which will
be very fast (a rough sketch of the settings involved follows the list):

1) If the entire table fits in RAM then you could try setting random_page_cost
close enough to 1 to cause the index to be used.

2) If it doesn't, but there are a reasonably small number of distinct values of
status, you would hope to see a sequential scan and a hash aggregate being
used. I would expect this to be what you would see with the default
configuration. If not, you might have to increase work_mem (or sort_mem,
depending on your version).

3) If you have a large table with a large set of status values then Postgres
may be compelled to sort the entire table, in which case you should
experiment with work_mem/sort_mem to find the largest value you can use
without inducing swapping. You could also experiment with putting pgsql_tmp on
separate spindles.
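
These settings can be tried per session; the values below are
illustrative only and need tuning against real data:

-- option 1: make index scans look cheaper when the table is cached in RAM
set random_page_cost = 1.5;

-- options 2 and 3: give the hash aggregate or sort more memory to work with
set work_mem = 65536;   -- measured in KB on 8.x; on 7.x use: set sort_mem = 65536;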

None of these will be fast enough for OLTP access like a web page. You would
be best off performing this query periodically and stuffing the results in a
table for quick access.
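
That periodic approach might look something like this (the
status_counts table and the schedule are hypothetical; run the refresh
from cron or similar):

create table status_counts (status text, cnt bigint);

-- refresh periodically, e.g. every few minutes:
begin;
delete from status_counts;
insert into status_counts
    select status, count(*) from customer group by status;
commit;

-- pages then read the cached result cheaply:
select status, cnt from status_counts;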

--
greg


From: Scott Marlowe <smarlowe(at)g2switchworks(dot)com>
To: Bryan Field-Elliot <bryan_lists(at)netmeme(dot)org>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: How to optimize select count(*)..group by?
Date: 2005-07-28 17:23:39
Message-ID: 1122571418.32465.17.camel@state.g2switchworks.com
Lists: pgsql-general

On Thu, 2005-07-28 at 11:19, Bryan Field-Elliot wrote:
> We have this simple query:
>
> select status, count(*) from customer group by status;
>
> There is already a btree index on status, but, the customer table is
> huge, and this query must be executed very frequently... an "explain"
> on this query shows that it is quite costly (and we notice it runs
> slowly)...
>
> Can someone recommend the best technique to optimize this? We can
> create new indices, we can re-write this query.. But we'd rather not
> add new tables or columns if possible (not just to solve this
> problem).

With a query of the form:

select field, count([field|*]) from table where somefield = somecondition
group by field;

the executor is going to have to scan every single row matched by
that where clause. There's no shortcut, because the visibility rules of
MVCC mean you have to look at every tuple IN THE TABLE, not in the
index (it's the way PostgreSQL is built, and it isn't likely to change
soon, because putting the visibility information in indexes is
expensive, and would result in VERY slow updates and very large
indexes).

So, the best optimization is to use a selective where clause.

If you run the query with a where clause of something like:

where processdate between '01 july 2005' and '07 july 2005'

then you should get better performance.
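
Applied to the original query, that might look like the following (the
processdate column is hypothetical; substitute whatever selective
predicate fits your data):

select status, count(*)
from customer
where processdate between '01 july 2005' and '07 july 2005'
group by status;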