Re: regression between 8.4.8 and 8.4.2?

From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Ben Chobot <bench(at)silentmedia(dot)com>
Cc: pgsql-general General <pgsql-general(at)postgresql(dot)org>
Subject: Re: regression between 8.4.8 and 8.4.2?
Date: 2011-08-31 07:53:24
Message-ID: 1314777204.27073.2.camel@fsopti579.F-Secure.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On ons, 2011-08-31 at 10:42 +0300, Peter Eisentraut wrote:
> I don't have an answer for you, but this report looks suspiciously
> similar to the one I posted the other day at
> <http://archives.postgresql.org/pgsql-hackers/2011-08/msg01224.php>,
> which, now that I think about it, also manifested itself after the
> upgrade to 8.4.8.

See this thread:
http://archives.postgresql.org/pgsql-performance/2011-08/msg00248.php

It looks like there are a number of users affected by this.

>
> On tis, 2011-08-30 at 15:24 -0700, Ben Chobot wrote:
> > We recently took a copy of our production data (running on 8.4.2),
> > scrubbed many data fields, and then loaded it onto a qa server
> > (running 8.4.8). We're seeing some odd planner performance that I
> > think might be a bug, though I'm hoping it's just idiocy on my part.
> > I've analyzed things and looked into pg_stats and it seems as if the
> > relevant columns have about the same statistics.
> >
> >
> > I've managed to simplify the query, but if I make it any simpler, then
> > the two servers end up with the same good plan. The query is down to:
> >
> >
> > SELECT machines.quota_purchased
> > FROM machines
> > WHERE NOT deleted AND machines.user_id IN (
> > SELECT id FROM users WHERE user_group_id IN (
> > select 607547 offset 0
> > ) OFFSET 0
> > );
> >
> >
> >
> >
> > (Those "offset 0" are in there to protect us from planner regressions
> > we saw when moving to 8.4. When we move to 9, they can hopefully go
> > away.)
> >
> >
> > On the production server, this returns a fairly accurate plan:
> >
> >
> >
> > QUERY PLAN
> >
> > ----------------------------------------------------------------------------------------------------------------------------------------------------------
> > Nested Loop (cost=843.59..1447.90 rows=243 width=8) (actual
> > time=0.044..0.045 rows=1 loops=1)
> > -> HashAggregate (cost=843.59..845.59 rows=200 width=4) (actual
> > time=0.027..0.027 rows=1 loops=1)
> > -> Limit (cost=0.02..823.90 rows=1575 width=4) (actual
> > time=0.024..0.025 rows=1 loops=1)
> > -> Nested Loop (cost=0.02..823.90 rows=1575 width=4)
> > (actual time=0.023..0.024 rows=1 loops=1)
> > -> HashAggregate (cost=0.02..0.03 rows=1
> > width=4) (actual time=0.005..0.005 rows=1 loops=1)
> > -> Limit (cost=0.00..0.01 rows=1 width=0)
> > (actual time=0.001..0.002 rows=1 loops=1)
> > -> Result (cost=0.00..0.01 rows=1
> > width=0) (actual time=0.000..0.000 rows=1 loops=1)
> > -> Index Scan using users_user_groups_idx on
> > users (cost=0.00..804.18 rows=1575 width=8) (actual time=0.017..0.018
> > rows=1 loops=1)
> > Index Cond: (users.user_group_id =
> > (607547))
> > -> Index Scan using machines_sid_un on machines (cost=0.00..3.00
> > rows=1 width=12) (actual time=0.015..0.015 rows=1 loops=1)
> > Index Cond: (machines.user_id = users.id)
> > Total runtime: 0.121 ms
> > (12 rows)
> >
> >
> >
> >
> > On the QA server, things are not so accurate. It doesn't hurt the
> > timing of this simplified query much, but when put into the actual
> > query, the row estimation being off by 6 orders of magnitude really
> > throws the planning in the wrong direction. The plan on the QA server
> > is:
> >
> >
> >
> > QUERY PLAN
> >
> > -----------------------------------------------------------------------------------------------------------------------------------------------------------
> > Nested Loop (cost=1887.16..3671.20 rows=1192462 width=8) (actual
> > time=0.049..0.051 rows=1 loops=1)
> > -> HashAggregate (cost=1887.16..1889.16 rows=200 width=4) (actual
> > time=0.032..0.033 rows=1 loops=1)
> > -> Limit (cost=0.02..1868.20 rows=1517 width=4) (actual
> > time=0.027..0.029 rows=1 loops=1)
> > -> Nested Loop (cost=0.02..1868.20 rows=1517 width=4)
> > (actual time=0.027..0.028 rows=1 loops=1)
> > -> HashAggregate (cost=0.02..0.03 rows=1
> > width=4) (actual time=0.008..0.008 rows=1 loops=1)
> > -> Limit (cost=0.00..0.01 rows=1 width=0)
> > (actual time=0.001..0.001 rows=1 loops=1)
> > -> Result (cost=0.00..0.01 rows=1
> > width=0) (actual time=0.001..0.001 rows=1 loops=1)
> > -> Index Scan using users_user_groups_idx on
> > users (cost=0.00..1849.20 rows=1517 width=8) (actual
> > time=0.015..0.016 rows=1 loops=1)
> > Index Cond: (users.user_group_id =
> > (607547))
> > -> Index Scan using machines_sid_un on machines (cost=0.00..8.90
> > rows=1 width=12) (actual time=0.013..0.013 rows=1 loops=1)
> > Index Cond: (machines.user_id = users.id)
> > Total runtime: 0.148 ms
> > (12 rows)
> >
> >
> >
> >
> >
> >
> > The problem here (I think) seems to be that the QA server believes
> > that running a nested loop over 200 users.id values and joining that
> > against machines.user_id will result in >1M rows. The production
> > servers sees this more accurately as the nearly 1:1 relationship that
> > it is.
> >
> >
> > The reason I wonder if this might be a bug is because if I change the
> > obtuse clause "WHERE user_group_id IN (select 607547 offset 0)" to
> > simply "where user_group_id in (607547)" then the plan collapses to
> > the same plan on both servers:
> >
> >
> > explain analyze SELECT machines.quota_purchased
> > FROM machines
> > WHERE NOT deleted AND machines.user_id IN (
> > SELECT id FROM users WHERE user_group_id IN (
> > 607547
> > ) OFFSET 0
> > );
> >
> >
> >
> >
> >
> > QUERY PLAN
> >
> > -------------------------------------------------------------------------------------------------------------------------------------------------
> > Nested Loop (cost=82.27..636.70 rows=62 width=8) (actual
> > time=0.042..0.043 rows=1 loops=1)
> > -> HashAggregate (cost=82.27..82.88 rows=61 width=4) (actual
> > time=0.024..0.025 rows=1 loops=1)
> > -> Limit (cost=0.00..81.51 rows=61 width=4) (actual
> > time=0.017..0.018 rows=1 loops=1)
> > -> Index Scan using users_user_groups_idx on users
> > (cost=0.00..81.51 rows=61 width=4) (actual time=0.015..0.016 rows=1
> > loops=1)
> > Index Cond: (user_group_id = 607547)
> > -> Index Scan using machines_sid_un on machines (cost=0.00..9.07
> > rows=1 width=12) (actual time=0.013..0.013 rows=1 loops=1)
> > Index Cond: (machines.user_id = users.id)
> > Total runtime: 0.106 ms
> > (8 rows)
> >
> >
> >
> >
> > But, as I understand it, that remaining OFFSET 0 should keep the
> > planner from caring that it's a single value or the result of a
> > subselect.
> >
> >
> >
> >
> > Anyway, anything I can check? The hardware is different so the
> > configuration files are different, but they're not that different in
> > anything other than effective_cache_size. Certainly not different in
> > any way I can think that would affect this kind of planning
> > mis-estimation.
>
>
>
>

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Adarsh Sharma 2011-08-31 11:54:05 ERD Tool
Previous Message Peter Eisentraut 2011-08-31 07:42:54 Re: regression between 8.4.8 and 8.4.2?