From: Matt White <mattw922(at)gmail(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Slow-ish Query Needs Some Love
Date: 2010-02-02 00:53:56
Message-ID: 1d4feebf-ff4d-4463-b1c8-d1a12d27400c@z10g2000prh.googlegroups.com
Lists: pgsql-performance

I have a relatively straightforward query that by itself isn't that
slow, but we have to run it up to 40 times on one webpage load, so it
needs to run much faster than it does. Here it is:

SELECT COUNT(*) FROM users, user_groups
WHERE users.user_group_id = user_groups.id AND NOT users.deleted AND
user_groups.partner_id IN
(partner_id_1, partner_id_2);

The structure is partners have user groups which have users. In the
test data there are over 200,000 user groups and users but only ~3000
partners. Anyone have any bright ideas on how to speed this query up?
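
Roughly, the relevant tables look like this (trimmed to the columns
that matter; the column types and the name of the index on
users.deleted are illustrative, but user_groups_partner_id_idx is the
index that shows up in the plan below):

CREATE TABLE partners (
    id integer PRIMARY KEY
    -- ... partner attributes ...
);

CREATE TABLE user_groups (
    id         integer PRIMARY KEY,
    partner_id integer REFERENCES partners (id)
);
CREATE INDEX user_groups_partner_id_idx ON user_groups (partner_id);

CREATE TABLE users (
    id            integer PRIMARY KEY,
    user_group_id integer REFERENCES user_groups (id),
    deleted       boolean NOT NULL DEFAULT false
);
CREATE INDEX users_deleted_idx ON users (deleted);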

Here's the query plan:

Aggregate  (cost=12574.53..12574.54 rows=1 width=0) (actual time=2909.298..2909.299 rows=1 loops=1)
  ->  Hash Join  (cost=217.79..12566.08 rows=3378 width=0) (actual time=2909.284..2909.284 rows=0 loops=1)
        Hash Cond: (users.user_group_id = user_groups.id)
        ->  Seq Scan on users  (cost=0.00..11026.11 rows=206144 width=4) (actual time=0.054..517.811 rows=205350 loops=1)
              Filter: (NOT deleted)
        ->  Hash  (cost=175.97..175.97 rows=3346 width=4) (actual time=655.054..655.054 rows=200002 loops=1)
              ->  Nested Loop  (cost=0.27..175.97 rows=3346 width=4) (actual time=1.327..428.406 rows=200002 loops=1)
                    ->  HashAggregate  (cost=0.27..0.28 rows=1 width=4) (actual time=1.259..1.264 rows=2 loops=1)
                          ->  Result  (cost=0.00..0.26 rows=1 width=0) (actual time=1.181..1.240 rows=2 loops=1)
                    ->  Index Scan using user_groups_partner_id_idx on user_groups  (cost=0.00..133.86 rows=3346 width=8) (actual time=0.049..96.992 rows=100001 loops=2)
                          Index Cond: (user_groups.partner_id = (partner_all_subpartners(3494)))

The one obvious thing that everyone will point out is the sequential
scan on users, but there actually is an index on users.deleted. When I
forced sequential scanning off, it ran slower, so the planner wins
again.
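
In case it helps anyone reproduce that test, disabling sequential
scans for a single session is done with the standard planner setting,
roughly like this:

SET enable_seqscan = off;
EXPLAIN ANALYZE
SELECT COUNT(*) FROM users, user_groups
 WHERE users.user_group_id = user_groups.id AND NOT users.deleted AND
       user_groups.partner_id IN (partner_id_1, partner_id_2);
RESET enable_seqscan;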

Thanks for any help you can offer.


From: Edgardo Portal <egportal2002(at)yahoo(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Slow-ish Query Needs Some Love
Date: 2010-02-02 13:06:46
Message-ID: hk9816$qh5$1@news.eternal-september.org
Lists: pgsql-performance

On 2010-02-02, Matt White <mattw922(at)gmail(dot)com> wrote:
> I have a relatively straightforward query that by itself isn't that
> slow, but we have to run it up to 40 times on one webpage load, so it
> needs to run much faster than it does. Here it is:
>
> SELECT COUNT(*) FROM users, user_groups
> WHERE users.user_group_id = user_groups.id AND NOT users.deleted AND
> user_groups.partner_id IN
> (partner_id_1, partner_id_2);
>
> The structure is partners have user groups which have users. In the
> test data there are over 200,000 user groups and users but only ~3000
> partners. Anyone have any bright ideas on how to speed this query up?

Can you avoid running it 40 times, maybe by restructuring the
query (or making a view) along the lines of the following and
adding some logic to your page?

SELECT p.partner_id, ug.id AS user_group_id, u.id, count(*)
  FROM partners p
       LEFT JOIN user_groups ug
              ON ug.partner_id=p.partner_id
       LEFT JOIN users u
              ON u.user_group_id=ug.id
 WHERE NOT u.deleted
 GROUP BY 1,2,3
;
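
If you go the view route, here is a sketch (the names are made up;
also note that I've moved the NOT u.deleted test into the join
condition, because with it in the WHERE clause the LEFT JOIN behaves
like an inner join and groups containing only deleted users would
drop out):

CREATE VIEW partner_group_user_counts AS
SELECT ug.partner_id, ug.id AS user_group_id, count(u.id) AS user_count
  FROM user_groups ug
       LEFT JOIN users u
              ON u.user_group_id = ug.id AND NOT u.deleted
 GROUP BY ug.partner_id, ug.id;

Then the page needs one round trip instead of 40:

SELECT partner_id, sum(user_count) AS user_count
  FROM partner_group_user_counts
 WHERE partner_id IN (partner_id_1, partner_id_2)
 GROUP BY partner_id;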


From: Matt White <mattw922(at)gmail(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Slow-ish Query Needs Some Love
Date: 2010-02-02 19:03:42
Message-ID: 250b74b4-f38f-4dab-a8bc-91d6d152d46c@h9g2000prn.googlegroups.com
Lists: pgsql-performance

On Feb 2, 6:06 am, Edgardo Portal <egportal2(dot)(dot)(dot)(at)yahoo(dot)com> wrote:
> On 2010-02-02, Matt White <mattw(dot)(dot)(dot)(at)gmail(dot)com> wrote:
>
> > I have a relatively straightforward query that by itself isn't that
> > slow, but we have to run it up to 40 times on one webpage load, so it
> > needs to run much faster than it does. Here it is:
>
> > SELECT COUNT(*) FROM users, user_groups
> >  WHERE users.user_group_id = user_groups.id AND NOT users.deleted AND
> > user_groups.partner_id IN
> >  (partner_id_1, partner_id_2);
>
> > The structure is partners have user groups which have users. In the
> > test data there are over 200,000 user groups and users but only ~3000
> > partners. Anyone have any bright ideas on how to speed this query up?
>
> Can you avoid running it 40 times, maybe by restructuring the
> query (or making a view) along the lines of the following and
> adding some logic to your page?
>
> SELECT p.partner_id, ug.id AS user_group_id, u.id, count(*)
>   FROM partners p
>        LEFT JOIN user_groups ug
>               ON ug.partner_id=p.partner_id
>        LEFT JOIN users u
>               ON u.user_group_id=ug.id
>  WHERE NOT u.deleted
>  GROUP BY 1,2,3
> ;

Thanks for the suggestion. The view didn't seem to speed things up.
Perhaps we can reduce the number of times it's called; we'll see. Any
additional ideas would be helpful. Thanks.


From: Andy Colson <andy(at)squeakycode(dot)net>
To: Matt White <mattw922(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Slow-ish Query Needs Some Love
Date: 2010-02-02 20:11:34
Message-ID: 4B6886F6.4080402@squeakycode.net
Lists: pgsql-performance

On 2/2/2010 1:03 PM, Matt White wrote:
> On Feb 2, 6:06 am, Edgardo Portal<egportal2(dot)(dot)(dot)(at)yahoo(dot)com> wrote:
>> On 2010-02-02, Matt White<mattw(dot)(dot)(dot)(at)gmail(dot)com> wrote:
>>
>>> I have a relatively straightforward query that by itself isn't that
>>> slow, but we have to run it up to 40 times on one webpage load, so it
>>> needs to run much faster than it does. Here it is:
>>
>>> SELECT COUNT(*) FROM users, user_groups
>>> WHERE users.user_group_id = user_groups.id AND NOT users.deleted AND
>>> user_groups.partner_id IN
>>> (partner_id_1, partner_id_2);
>>
>>> The structure is partners have user groups which have users. In the
>>> test data there are over 200,000 user groups and users but only ~3000
>>> partners. Anyone have any bright ideas on how to speed this query up?
>>
>> Can you avoid running it 40 times, maybe by restructuring the
>> query (or making a view) along the lines of the following and
>> adding some logic to your page?
>>
>> SELECT p.partner_id, ug.id AS user_group_id, u.id, count(*)
>> FROM partners p
>> LEFT JOIN user_groups ug
>> ON ug.partner_id=p.partner_id
>> LEFT JOIN users u
>> ON u.user_group_id=ug.id
>> WHERE NOT u.deleted
>> GROUP BY 1,2,3
>> ;
>
> Thanks for the suggestion. The view didn't seem to speed things up.
> Perhaps we can reduce the number of times it's called, we'll see. Any
> additional ideas would be helpful. Thanks.

I agree with Edgardo: I think the biggest time saver will be reducing
trips to the database.

But... do you have an index on users.user_group_id?

Does rewriting it change the plan any?

SELECT COUNT(*) FROM users
inner join user_groups on (users.user_group_id = user_groups.id)
where NOT users.deleted
AND user_groups.partner_id IN (partner_id_1, partner_id_2);
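
If that index is missing, adding it would be my first try (the index
name here is just an example):

CREATE INDEX users_user_group_id_idx ON users (user_group_id);
-- gives the planner the option of probing users by group instead of
-- scanning the whole table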

And... it looks like the row guesstimate is off a little:

Index Scan using user_groups_partner_id_idx
on user_groups
(cost=0.00..133.86 rows=3346 width=8)
(actual time=0.049..96.992 rows=100001 loops=2)

It guessed 3,346 rows, but actually got 100,001. Have you run an
analyze on it? If so, bumping up the stats might help.

-Andy


From: Andy Colson <andy(at)squeakycode(dot)net>
To: Matt White <mattw922(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Slow-ish Query Needs Some Love
Date: 2010-02-03 18:50:56
Message-ID: 4B69C590.2070709@squeakycode.net
Lists: pgsql-performance

On 2/3/2010 11:17 AM, Matt White wrote:
> On Feb 2, 1:11 pm, a(dot)(dot)(dot)(at)squeakycode(dot)net (Andy Colson) wrote:
>> On 2/2/2010 1:03 PM, Matt White wrote:
>>
>>> On Feb 2, 6:06 am, Edgardo Portal<egportal2(dot)(dot)(dot)(at)yahoo(dot)com> wrote:
>>>> On 2010-02-02, Matt White<mattw(dot)(dot)(dot)(at)gmail(dot)com> wrote:
>>
>>>>> I have a relatively straightforward query that by itself isn't that
>>>>> slow, but we have to run it up to 40 times on one webpage load, so it
>>>>> needs to run much faster than it does. Here it is:
>>
>>>>> SELECT COUNT(*) FROM users, user_groups
>>>>> WHERE users.user_group_id = user_groups.id AND NOT users.deleted AND
>>>>> user_groups.partner_id IN
>>>>> (partner_id_1, partner_id_2);
>>
>>>>> The structure is partners have user groups which have users. In the
>>>>> test data there are over 200,000 user groups and users but only ~3000
>>>>> partners. Anyone have any bright ideas on how to speed this query up?
>>
>>>> Can you avoid running it 40 times, maybe by restructuring the
>>>> query (or making a view) along the lines of the following and
>>>> adding some logic to your page?
>>
>>>> SELECT p.partner_id, ug.id AS user_group_id, u.id, count(*)
>>>> FROM partners p
>>>> LEFT JOIN user_groups ug
>>>> ON ug.partner_id=p.partner_id
>>>> LEFT JOIN users u
>>>> ON u.user_group_id=ug.id
>>>> WHERE NOT u.deleted
>>>> GROUP BY 1,2,3
>>>> ;
>>
>>> Thanks for the suggestion. The view didn't seem to speed things up.
>>> Perhaps we can reduce the number of times it's called, we'll see. Any
>>> additional ideas would be helpful. Thanks.
>>
>> I agree with Edgardo: I think the biggest time saver will be reducing
>> trips to the database.
>>
>> But... do you have an index on users.user_group_id?
>>
>> Does rewriting it change the plan any?
>>
>> SELECT COUNT(*) FROM users
>> inner join user_groups on (users.user_group_id = user_groups.id)
>> where NOT users.deleted
>> AND user_groups.partner_id IN (partner_id_1, partner_id_2);
>>
>> And... it looks like the row guesstimate is off a little:
>>
>> Index Scan using user_groups_partner_id_idx
>> on user_groups
>> (cost=0.00..133.86 rows=3346 width=8)
>> (actual time=0.049..96.992 rows=100001 loops=2)
>>
>> It guessed 3,346 rows, but actually got 100,001. Have you run an
>> analyze on it? If so, bumping up the stats might help.
>>
>> -Andy
>>
>
> Andy,
>
> I have run analyze; see my query plan in my original post. You'll have
> to forgive me for being a bit of a Postgres noob, but what do you mean
> by "bumping up the stats"?

That's not what I meant. "explain analyze select..." is what you did,
and that's correct. What I meant was "analyze user_groups".

see:
http://www.postgresql.org/docs/8.4/interactive/sql-analyze.html

An ANALYZE makes PG look at a table and calc stats on it, so it can
make better guesses. By default ANALYZE only looks at a few rows (well,
a small percentage of the rows) and makes guesses about the entire
table based on those rows. If it guesses wrong, sometimes you need to
tell it to analyze more rows (i.e. a bigger percentage of the table).
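
Concretely:

ANALYZE user_groups;   -- re-sample just this table
ANALYZE;               -- or re-sample every table in the database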

By "bumping the stats" I was referring to this:

http://wiki.postgresql.org/wiki/Planner_Statistics

I have never had to do it, so I don't know much about it. It may or may
not help. Just thought it was something you could try.
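
If I'm reading that page right, it would look something like this
(untested; 500 is an arbitrary example, the default target in 8.4
is 100):

ALTER TABLE user_groups ALTER COLUMN partner_id SET STATISTICS 500;
ANALYZE user_groups;   -- re-sample with the larger target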

-Andy


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Matt White <mattw922(at)gmail(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Slow-ish Query Needs Some Love
Date: 2010-02-05 01:49:26
Message-ID: 603c8f071002041749x14ff4414g1ed2bb26be3a0bce@mail.gmail.com
Lists: pgsql-performance

On Mon, Feb 1, 2010 at 7:53 PM, Matt White <mattw922(at)gmail(dot)com> wrote:
> I have a relatively straightforward query that by itself isn't that
> slow, but we have to run it up to 40 times on one webpage load, so it
> needs to run much faster than it does. Here it is:
>
> SELECT COUNT(*) FROM users, user_groups
>  WHERE users.user_group_id = user_groups.id AND NOT users.deleted AND
> user_groups.partner_id IN
>  (partner_id_1, partner_id_2);
>
> The structure is partners have user groups which have users. In the
> test data there are over 200,000 user groups and users but only ~3000
> partners. Anyone have any bright ideas on how to speed this query up?
>
> Here's the query plan:
>
>  Aggregate  (cost=12574.53..12574.54 rows=1 width=0) (actual
> time=2909.298..2909.299 rows=1 loops=1)
>   ->  Hash Join  (cost=217.79..12566.08 rows=3378 width=0) (actual
> time=2909.284..2909.284 rows=0 loops=1)
>         Hash Cond: (users.user_group_id = user_groups.id)
>         ->  Seq Scan on users  (cost=0.00..11026.11 rows=206144
> width=4) (actual time=0.054..517.811 rows=205350 loops=1)
>               Filter: (NOT deleted)
>         ->  Hash  (cost=175.97..175.97 rows=3346 width=4) (actual
> time=655.054..655.054 rows=200002 loops=1)
>               ->  Nested Loop  (cost=0.27..175.97 rows=3346 width=4)
> (actual time=1.327..428.406 rows=200002 loops=1)
>                     ->  HashAggregate  (cost=0.27..0.28 rows=1
> width=4) (actual time=1.259..1.264 rows=2 loops=1)
>                           ->  Result  (cost=0.00..0.26 rows=1
> width=0) (actual time=1.181..1.240 rows=2 loops=1)
>                     ->  Index Scan using user_groups_partner_id_idx
> on user_groups  (cost=0.00..133.86 rows=3346 width=8) (actual
> time=0.049..96.992 rows=100001 loops=2)
>                           Index Cond: (user_groups.partner_id =
> (partner_all_subpartners(3494)))
>
>
> The one obvious thing that everyone will point out is the sequential
> scan on users, but there actually is an index on users.deleted. When I
> forced sequential scanning off, it ran slower, so the planner wins
> again.

Yeah, I don't think the sequential scan is hurting you. What is
bugging me is that it doesn't look like the plan you've posted is for
the query you've posted. The plan shows an index condition that
references partner_all_subpartners(3494), which doesn't appear in your
original query, and also has two aggregates in it, where your posted
query only has one.
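
For what it's worth, the plan shape (a HashAggregate over a Result
feeding the inner side of the join) is what I'd expect from something
along these lines, where partner_all_subpartners is a set-returning
function; this is just a guess at what actually ran:

SELECT COUNT(*) FROM users, user_groups
 WHERE users.user_group_id = user_groups.id AND NOT users.deleted AND
       user_groups.partner_id IN (SELECT partner_all_subpartners(3494));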

...Robert