Re: Bogus nestloop join estimate, ignores WHERE clause

Lists: pgsql-hackers
From: Marti Raudsepp <marti(at)juffo(dot)org>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Bogus nestloop join estimate, ignores WHERE clause
Date: 2011-08-31 12:38:50
Message-ID: CABRT9RAWwn0KQ66K=7cvTuxyTzEHd6P_PxYDkT3gfwg2xqybHg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi list!

I'm getting really surprising planner estimates for a query that's
joining another table via a varchar field. All of this was tested on
PostgreSQL 8.4.8, 9.0.4 and 9.1rc1.

The original query is pretty huge, but I managed to shrink it down to this:
SELECT * FROM email_message where email_message.id_code IN (SELECT
id_code FROM client WHERE client_id='83509');

* id_code is an indexed varchar(20) NOT NULL column in both tables
* client_id is the primary key of client.
* There are 149152 rows in email_message and 140975 rows in client
* The most common value in both sides of the join is an empty string.
121970 in email_message and 10753 in client
(Turning the empty values into NULLs helps a little, but still gives
bad estimates)

This is the plan I get:
EXPLAIN SELECT * FROM email_message where email_message.id_code IN
(SELECT id_code FROM client WHERE client_id='83509');
Nested Loop (cost=8.28..36.86 rows=139542 width=101)
-> HashAggregate (cost=8.28..8.29 rows=1 width=11)
-> Index Scan using client_pkey on client (cost=0.00..8.28
rows=1 width=11)
Index Cond: (client_id = 83509)
-> Index Scan using email_message_id_code_idx on email_message
(cost=0.00..28.05 rows=41 width=101)
Index Cond: ((email_message.id_code)::text = (client.id_code)::text)
(6 rows)

This nestloop couldn't possibly generate 139542 rows since the inner
plan is expected to return 1 row and the outer plan 41

After a bit of digging, I figured out that it uses the same estimate
as a semi-join WITHOUT the client_id restriction.
EXPLAIN SELECT * FROM email_message m WHERE EXISTS(SELECT * FROM
client c WHERE m.id_code=c.id_code);
Nested Loop Semi Join (cost=0.00..7725.31 rows=139542 width=101)
-> Seq Scan on email_message m (cost=0.00..3966.52 rows=149152 width=101)
-> Index Scan using client_id_code_idx1 on client c
(cost=0.00..0.39 rows=1 width=11)
Index Cond: ((c.id_code)::text = (m.id_code)::text)

For whatever reason, the 1st query completely ignores the fact that
the client_id clause reduces the result count by a large factor.

So I turned this into a simple JOIN and I'm still seeing bad estimates:

EXPLAIN SELECT * FROM email_message JOIN client USING (id_code) WHERE
client_id='83509';
Nested Loop (cost=0.00..36.85 rows=9396 width=252)
-> Index Scan using client_pkey on client (cost=0.00..8.28 rows=1
width=162)
Index Cond: (client_id = 83509)
-> Index Scan using email_message_id_code_idx on email_message
(cost=0.00..28.05 rows=41 width=101)
Index Cond: ((email_message.id_code)::text = (client.id_code)::text)

This is better, but still overestimates massively.

When I change empty values to NULLs, then this JOIN query starts
estimating correctly. So this one is probably confused because the
empty values would result in a cartesian join.

Are there any reasons why nestloop can't use the known (1 * 41) as its estimate?

Regards,
Marti Raudsepp
voicecom.ee


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Marti Raudsepp <marti(at)juffo(dot)org>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bogus nestloop join estimate, ignores WHERE clause
Date: 2011-08-31 13:34:43
Message-ID: 1314797683.27073.12.camel@fsopti579.F-Secure.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On ons, 2011-08-31 at 15:38 +0300, Marti Raudsepp wrote:
> I'm getting really surprising planner estimates for a query that's
> joining another table via a varchar field. All of this was tested on
> PostgreSQL 8.4.8, 9.0.4 and 9.1rc1.

By any chance, did it work better in 8.4.7?


From: Marti Raudsepp <marti(at)juffo(dot)org>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bogus nestloop join estimate, ignores WHERE clause
Date: 2011-08-31 13:46:44
Message-ID: CABRT9RDfSZn++xWfvrpKJ4Du6L1rRs=GGY-O+o=2dwz4uovoeA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Aug 31, 2011 at 16:34, Peter Eisentraut <peter_e(at)gmx(dot)net> wrote:
> On ons, 2011-08-31 at 15:38 +0300, Marti Raudsepp wrote:
>> I'm getting really surprising planner estimates for a query that's
>> joining another table via a varchar field. All of this was tested on
>> PostgreSQL 8.4.8, 9.0.4 and 9.1rc1.
>
> By any chance, did it work better in 8.4.7?

No. Estimates on 8.4.7 are pretty much the same (139820, 139820 and 9455)

(I built and installed 8.4.7 with a clean database)

Regards,
Marti Raudsepp
voicecom.ee


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Marti Raudsepp <marti(at)juffo(dot)org>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bogus nestloop join estimate, ignores WHERE clause
Date: 2011-08-31 20:59:19
Message-ID: 3645.1314824359@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Marti Raudsepp <marti(at)juffo(dot)org> writes:
> After a bit of digging, I figured out that it uses the same estimate
> as a semi-join WITHOUT the client_id restriction.
> ...
> For whatever reason, the 1st query completely ignores the fact that
> the client_id clause reduces the result count by a large factor.

Could we see the pg_stats rows for the two join columns?

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Marti Raudsepp <marti(at)juffo(dot)org>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Bogus nestloop join estimate, ignores WHERE clause
Date: 2011-08-31 22:54:34
Message-ID: 5790.1314831274@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Marti Raudsepp <marti(at)juffo(dot)org> writes:
> On Wed, Aug 31, 2011 at 23:59, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Could we see the pg_stats rows for the two join columns?

> Sure, but I don't want to send this out to the public list since
> [ it's private data ]

Thanks for the data. I set up a comparable test case and duplicated
your problem. It looks like it is a variant of the same brain fade
being discussed over in pgsql-performance,
http://archives.postgresql.org/pgsql-performance/2011-08/msg00327.php

In your case, we are running through the branch of eqjoinsel_semi
that does have MCVs to play with, and that code path is effectively
not taking any account at all of restrictions applied to the inner
relation. We need to have it clamp nd2 (and not nd1) along the same
lines as should be happening in the no-MCV-list code path. This is
exactly the case I was thinking needed to be covered when I was
responding to Mark, and now I've got an example to prove it.

In this particular case, the estimate is probably still not going to be
that good, because you have so many empty-string keys that that one
value dominates the result. The only way for the planner to get a real
quality estimate would be for it to know whether or not the specific
value of client_id mapped to an empty-string id_code, which would
require cross-column stats that we haven't got. Things would get better
if you were willing to replace the empty strings with nulls, which the
planner would know couldn't match. But I'm not sure if that is the
semantics you need. In any case, the eqjoinsel_semi logic is broken;
will fix.

regards, tom lane