Re: BUG #5059: Planner ignores estimates when planning an IN () subquery

From: "Kenaniah Cerny" <kenaniah(at)gmail(dot)com>
To: pgsql-bugs(at)postgresql(dot)org
Subject: BUG #5059: Planner ignores estimates when planning an IN () subquery
Date: 2009-09-16 03:35:07
Message-ID: 200909160335.n8G3Z7Oo061129@wwwmaster.postgresql.org
Lists: pgsql-bugs


The following bug has been logged online:

Bug reference: 5059
Logged by: Kenaniah Cerny
Email address: kenaniah(at)gmail(dot)com
PostgreSQL version: 8.4.1
Operating system: Centos5.2
Description: Planner ignores estimates when planning an IN ()
subquery
Details:

Consider the following query:

http://pgsql.privatepaste.com/aa5DAtiwws
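
For reference, the query is roughly of this shape (a simplified sketch only;
the full query is in the paste, and the ORDER BY column, written here as
log_time, is a placeholder):

SELECT l.*
  FROM user_activity_log l
 WHERE l.user_account_id IN (SELECT f.friend_id
                               FROM user_friends f
                              WHERE f.user_account_id = 74650)
 ORDER BY l.log_time DESC
 LIMIT 10;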

When planning the subquery of the IN () statement, the planner chose to scan
the indexes of the outer and inner columns in parallel using a nested loop
semi join.

http://pgsql.privatepaste.com/4eXj3zRcy7

By not allowing the planner to sort via the index of the outer column in the
WHERE clause (query above), a nested loop version of the plan executes
in a fraction of the time.

http://pgsql.privatepaste.com/5c0bOcL3t6

As you can see from the above query, forcing the materialization of the
subquery produces a much superior plan.
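
The exact rewrite and its plan are in the paste above. As a sketch of the
general technique: in 8.4 a CTE is always evaluated separately (WITH acts as
an optimization fence), so pulling the subquery into one forces it to be
materialized before the IN () is applied, roughly:

WITH friends AS (
    SELECT friend_id FROM user_friends WHERE user_account_id = 74650
)
SELECT l.*
  FROM user_activity_log l
 WHERE l.user_account_id IN (SELECT friend_id FROM friends)
 ORDER BY l.log_time DESC
 LIMIT 10;

(As before, log_time stands in for the real ordering column.)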

http://pgsql.privatepaste.com/371nl6KFrI

For comparison, this query replaces the subquery with hard-coded values.

The planner appears to not be weighing the benefits of materializing the
subquery of the IN () statement properly when ordering is involved, and
still produces an inferior plan when ordering is not a factor.

Please feel free to contact me for additional test cases if needed.

Thanks,
Kenaniah


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Kenaniah Cerny <kenaniah(at)gmail(dot)com>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #5059: Planner ignores estimates when planning an IN () subquery
Date: 2009-09-16 21:32:39
Message-ID: 603c8f070909161432k3fe8c46i7b36ec3655767c5c@mail.gmail.com
Lists: pgsql-bugs

On Tue, Sep 15, 2009 at 11:35 PM, Kenaniah Cerny <kenaniah(at)gmail(dot)com> wrote:
>
> The following bug has been logged online:
>
> Bug reference:      5059
> Logged by:          Kenaniah Cerny
> Email address:      kenaniah(at)gmail(dot)com
> PostgreSQL version: 8.4.1
> Operating system:   Centos5.2
> Description:        Planner ignores estimates when planning an IN ()
> subquery
> Details:
>
> Consider the following query:
>
> http://pgsql.privatepaste.com/aa5DAtiwws
>
> When planning the subquery of the IN () statement, the planner chose to scan
> the indexes of the outer and inner columns in parallel using a nested loop
> semi join.
>
> http://pgsql.privatepaste.com/4eXj3zRcy7
>
> By not allowing the planner to sort via the index of the outer column in the
> WHERE clause (query above), a nested loop version of the plan executes
> in a fraction of the time.
>
> http://pgsql.privatepaste.com/5c0bOcL3t6
>
> As you can see from the above query, forcing the materialization of the
> subquery produces a much superior plan.
>
> http://pgsql.privatepaste.com/371nl6KFrI
>
> For comparison, this query replaces the subquery with hard-coded values.
>
> The planner appears to not be weighing the benefits of materializing the
> subquery of the IN () statement properly when ordering is involved, and
> still produces an inferior plan when ordering is not a factor.
>
> Please feel free to contact me for additional test cases if needed.

I've seen this kind of plan before. The planner knows that a
backwards index scan over user_activity_log is slow, but it thinks
that it won't have to scan very much of it because it will accumulate
enough rows to satisfy the LIMIT before it goes very far. When it
turns out that there are not as many matching rows in user_friends as
anticipated, things get very slow.
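
Schematically, the plan it's betting on looks something like this (a sketch,
not your exact output; the first index name is a placeholder):

 Limit
   ->  Nested Loop Semi Join
         ->  Index Scan Backward using <index on the ORDER BY column>
                 on user_activity_log
         ->  Index Scan using user_friends_user_account_id_friend_id_key
                 on user_friends

The Limit node is what lets the planner discount most of the cost of that
backwards scan.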

Unfortunately, there's not an easy solution to this problem. If it
were really true that most rows in user_activity_log had matches in
user_friends, then the planner would be correct to choose the plan it did
- and wrong to choose the plan you want it to use. In this case, your
table is small enough that a bitmap heap scan on user_activity_log is
reasonable, but for a larger table in a situation where the join
selectivity is higher, it might really be necessary to use the index.
The problem is just that the planner is doing a poor job figuring out
where the breakpoint is between those two strategies.
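
If you want to see how the planner prices the two strategies against each
other on your data, one thing you can try (purely for diagnosis, not as a
fix) is to disable plain index scans for a single transaction and compare
the plans:

BEGIN;
SET LOCAL enable_indexscan = off;   -- bitmap index/heap scans stay available
EXPLAIN ANALYZE <your original query>;
ROLLBACK;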

I'm not 100% sure exactly which selectivity estimate is wrong. It's
not clear to me whether things are already wrong here:

Index Scan using user_friends_user_account_id_friend_id_key on
user_friends  (cost=0.00..0.27 rows=1 width=4) (actual
time=0.004..0.004 rows=0 loops=350713)

But they're definitely wrong here:

Nested Loop Semi Join  (cost=0.00..144564.33 rows=62298 width=52)
(actual time=5138.961..5405.075 rows=10 loops=1)

I *think* the root cause of the problem is that we don't have
multi-column statistics. When you look at user_friends, we can tell
you the selectivity of user_friends.user_account_id = 74650 pretty
accurately, and we should also be able to spit out MCVs to compare
against the MCVs of user_activity_log, but if the selectivity of those
two expressions together doesn't resemble the product of their
individual selectivities, which is probably the case here, the planner
has no way to detect that.
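
If you're curious what the planner actually has to go on, the per-column
statistics are visible in pg_stats, e.g. (column names assumed from your
plan):

SELECT tablename, attname, null_frac, n_distinct,
       most_common_vals, most_common_freqs
  FROM pg_stats
 WHERE tablename IN ('user_friends', 'user_activity_log')
   AND attname IN ('user_account_id', 'friend_id');

You'll notice there is nothing there describing how user_account_id and
friend_id vary together, which is exactly the information that's missing.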

...Robert


From: Kenaniah Cerny <kenaniah(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #5059: Planner ignores estimates when planning an IN () subquery
Date: 2009-09-16 22:39:37
Message-ID: f647f4600909161539n1eb071d6k2f56f484432ff632@mail.gmail.com
Lists: pgsql-bugs

I can provide the output of statistics queries if you would like. Just let
me know which statistics you want and I'll pastebin them.

As far as selectivity goes, the selectivity estimate for the
user_anime_log.user_account_id was definitely miscalculated. The
user_anime_log contains up to 15 entries per user (70,000 users, but only
475,811 rows). The default statistics target on that relation is set to
1000. But even with poor statistics, guessing 62,000 rows when there's a
maximum of 15 per user account is still quite a miss.

An analysis of SELECT * FROM user_activity_log WHERE user_account_id = 17238
estimates 13 rows and returns 15, which is quite reasonable considering the
statistics targets.

Please forgive my ignorance, but in the case of my subquery, is the
estimated number of rows and cost being taken into account (or only the
selectivity)? Granted I don't understand much about the planner internals,
but it seems strange that a nested loop semi join would be chosen when the
inner table is estimated to return an extremely low number of rows with low
cost and low selectivity. Shouldn't the planner also estimate the cost of an
inner (er, left?) join in that scenario?

Kenaniah

On Wed, Sep 16, 2009 at 2:32 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> On Tue, Sep 15, 2009 at 11:35 PM, Kenaniah Cerny <kenaniah(at)gmail(dot)com>
> wrote:
> >
> > The following bug has been logged online:
> >
> > Bug reference: 5059
> > Logged by: Kenaniah Cerny
> > Email address: kenaniah(at)gmail(dot)com
> > PostgreSQL version: 8.4.1
> > Operating system: Centos5.2
> > Description: Planner ignores estimates when planning an IN ()
> > subquery
> > Details:
> >
> > Consider the following query:
> >
> > http://pgsql.privatepaste.com/aa5DAtiwws
> >
> > When planning the subquery of the IN () statement, the planner chose to
> scan
> > the indexes of the outer and inner columns in parallel using a nested
> loop
> > semi join.
> >
> > http://pgsql.privatepaste.com/4eXj3zRcy7
> >
> > By not allowing the planner to sort via the index of the outer column in the
> > WHERE clause (query above), a nested loop version of the plan executes
> > in a fraction of the time.
> >
> > http://pgsql.privatepaste.com/5c0bOcL3t6
> >
> > As you can see from the above query, forcing the materialization of the
> > subquery produces a much superior plan.
> >
> > http://pgsql.privatepaste.com/371nl6KFrI
> >
> > For comparison, this query replaces the subquery with hard-coded values.
> >
> > The planner appears to not be weighing the benefits of materializing the
> > subquery of the IN () statement properly when ordering is involved, and
> > still produces an inferior plan when ordering is not a factor.
> >
> > Please feel free to contact me for additional test cases if needed.
>
> I've seen this kind of plan before. The planner knows that a
> backwards index scan over user_activity_log is slow, but it thinks
> that it won't have to scan very much of it because it will accumulate
> enough rows to satisfy the LIMIT before it goes very far. When it
> turns out that there are not as many matching rows in user_friends as
> anticipated, things get very slow.
>
> Unfortunately, there's not an easy solution to this problem. If it
> were really true that most rows in user_activity_log had matches in
> user_friends, then the planner would be correct to choose the plan it did
> - and wrong to choose the plan you want it to use. In this case, your
> table is small enough that a bitmap heap scan on user_activity_log is
> reasonable, but for a larger table in a situation where the join
> selectivity is higher, it might really be necessary to use the index.
> The problem is just that the planner is doing a poor job figuring out
> where the breakpoint is between those two strategies.
>
> I'm not 100% sure exactly which selectivity estimate is wrong. It's
> not clear to me whether things are already wrong here:
>
> Index Scan using user_friends_user_account_id_friend_id_key on
> user_friends  (cost=0.00..0.27 rows=1 width=4) (actual
> time=0.004..0.004 rows=0 loops=350713)
>
> But they're definitely wrong here:
>
> Nested Loop Semi Join  (cost=0.00..144564.33 rows=62298 width=52)
> (actual time=5138.961..5405.075 rows=10 loops=1)
>
> I *think* the root cause of the problem is that we don't have
> multi-column statistics. When you look at user_friends, we can tell
> you the selectivity of user_friends.user_account_id = 74650 pretty
> accurately, and we should also be able to spit out MCVs to compare
> against the MCVs of user_activity_log, but if the selectivity of those
> two expressions together doesn't resemble the product of their
> individual selectivities, which is probably the case here, the planner
> has no way to detect that.
>
> ...Robert
>


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Kenaniah Cerny <kenaniah(at)gmail(dot)com>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #5059: Planner ignores estimates when planning an IN () subquery
Date: 2009-09-17 02:35:28
Message-ID: 603c8f070909161935t68266cdt2cf1c10803b8ccd7@mail.gmail.com
Lists: pgsql-bugs

On Wed, Sep 16, 2009 at 6:39 PM, Kenaniah Cerny <kenaniah(at)gmail(dot)com> wrote:
> I can provide the output of statistics queries if you would like. Just let
> me know which statistics you want and I'll pastebin them.
>
> As far as selectivity goes, the selectivity estimate for the
> user_anime_log.user_account_id was definitely miscalculated. The
> user_anime_log contains up to 15 entries per user (70,000 users, but only
> 475,811 rows). The default statistics target on that relation is set to
> 1000. But even with poor statistics, guessing 62,000 rows when there's a
> maximum of 15 per user account is still quite a miss.

You've changed the table name on me, vs. what you pasted in the query,
which had a user_activity_log but no user_anime_log...

> Analysis of SELECT * FROM user_activity_log WHERE user_account_id = 17238;
> estimates 13 rows and returns 15, which is quite reasonable considering the
> statistics targets.
>
> Please forgive my ignorance, but in the case of my subquery, is the
> estimated number of rows and cost being taken into account (or only the
> selectivity)?

Well, selectivity is just a term that refers to the fraction of rows
that match some condition (rows themselves do not have selectivity).
Usually the initial estimating is done in terms of selectivity, which
is then multiplied by the total number of rows to find the number of
rows that will remain after the condition is applied.
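
To make that concrete with the numbers from this thread (illustrative
arithmetic only): the semi join was estimated at 62,298 rows out of the
roughly 475,811 rows you mentioned, so the implied selectivity estimate is
about 62298 / 475811 = 0.13. In other words, the planner guessed that around
13% of the log rows would find a match in user_friends, when in fact only
10 rows did.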

So, yes, rows and cost are taken into account. The problem here is
that the planner is mis-estimating the selectivity, therefore it
computes the wrong number of rows (way too high), therefore it makes
the wrong decision.

> Granted I don't understand much about the planner internals,
> but it seems strange that a nested loop semi join would be chosen when the
> inner table is estimated to return an extremely low number of rows with low
> cost and low selectivity. Shouldn't the planner also estimate the cost of an
> inner (er, left?) join in that scenario?

Well... you can't replace a semi join with an inner or left join,
because it doesn't do the same thing. You could use a hash semi join
or a merge semi join, but that doesn't make sense if, as you say,
the inner table is estimated to return an extremely low number of
rows.

It might be a bit easier to analyze this if you stripped out all the
joins that aren't necessary to reproduce the problem. Also, I would
prefer EXPLAIN ANALYZE output posted in-line to the mailing list
rather than pasted to a separate web site - it screws up the
formatting.

But honestly I'm not sure how much time it's worth spending on this.
You have a way to rewrite the query that works... and fixing the
estimator is going to be hard... so I suggest doing it the way that
works, and moving on!

...Robert


From: Kenaniah Cerny <kenaniah(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #5059: Planner ignores estimates when planning an IN () subquery
Date: 2009-09-17 05:32:50
Message-ID: f647f4600909162232h27d115f4of2e299177a1d3a67@mail.gmail.com
Lists: pgsql-bugs

My apologies, I meant user_activity_log, and not user_anime_log in the
previous email.

This isn't a huge issue, as the query could easily be rewritten using a more
pragmatic join structure, but I thought it worth mentioning since occasionally
bugs involve more than meets the eye. Thank you for the help.

On Wed, Sep 16, 2009 at 7:35 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> On Wed, Sep 16, 2009 at 6:39 PM, Kenaniah Cerny <kenaniah(at)gmail(dot)com>
> wrote:
> > I can provide the output of statistics queries if you would like. Just
> let
> > me know which statistics you want and I'll pastebin them.
> >
> > As far as selectivity goes, the selectivity estimate for the
> > user_anime_log.user_account_id was definitely miscalculated. The
> > user_anime_log contains up to 15 entries per user (70,000 users, but only
> > 475,811 rows). The default statistics target on that relation is set to
> > 1000. But even with poor statistics, guessing 62,000 rows when there's a
> > maximum of 15 per user account is still quite a miss.
>
> You've changed the table name on me, vs. what you pasted in the query,
> which had a user_activity_log but no user_anime_log...
>
> > Analysis of SELECT * FROM user_activity_log WHERE user_account_id =
> 17238;
> > estimates 13 rows and returns 15, which is quite reasonable considering
> the
> > statistics targets.
> >
> > Please forgive my ignorance, but in the case of my subquery, is the
> > estimated number of rows and cost being taken into account (or only the
> > selectivity)?
>
> Well, selectivity is just a term that refers to the fraction of rows
> that match some condition (rows themselves do not have selectivity).
> Usually the initial estimating is done in terms of selectivity, which
> is then multiplied by the total number of rows to find the number of
> rows that will remain after the condition is applied.
>
> So, yes, rows and cost are taken into account. The problem here is
> that the planner is mis-estimating the selectivity, therefore it
> computes the wrong number of rows (way too high), therefore it makes
> the wrong decision.
>
> > Granted I don't understand much about the planner internals,
> > but it seems strange that a nested loop semi join would be chosen when
> the
> > inner table is estimated to return an extremely low number of rows with
> low
> > cost and low selectivity. Shouldn't the planner also estimate the cost of
> an
> > inner (er, left?) join in that scenario?
>
> Well... you can't replace a semi join with an inner or left join,
> because it doesn't do the same thing. You could use a hash semi join
> or a merge semi join, but that doesn't make sense if, as you say,
> the inner table is estimated to return an extremely low number of
> rows.
>
> It might be a bit easier to analyze this if you stripped out all the
> joins that aren't necessary to reproduce the problem. Also, I would
> prefer EXPLAIN ANALYZE output posted in-line to the mailing list
> rather than pasted to a separate web site - it screws up the
> formatting.
>
> But honestly I'm not sure how much time it's worth spending on this.
> You have a way to rewrite the query that works... and fixing the
> estimator is going to be hard... so I suggest doing it the way that
> works, and moving on!
>
> ...Robert
>