From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: limit in subquery causes poor selectivity estimation
Date: 2011-08-27 11:33:27
Message-ID: 1314444807.2349.10.camel@vanquo.pezone.net
Lists: pgsql-hackers

This is an artificial test case shrunk down from a much larger real
query.

CREATE TABLE test1 (
    sha1 bytea PRIMARY KEY,
    something text
);

CREATE TABLE test2 (
    sha1 bytea PRIMARY KEY,
    blah text
);

Fill those with 1000 random rows each, e.g.,

for i in $(seq 1 1000); do sha1=$(echo $i | sha1sum | cut -f1 -d' '); psql -d test -c "INSERT INTO test1 VALUES (decode('$sha1','hex'), 'blah$i$i')"; done
for i in $(seq 500 1500); do sha1=$(echo $i | sha1sum | cut -f1 -d' '); psql -d test -c "INSERT INTO test2 VALUES (decode('$sha1','hex'), 'foo$i')"; done

(It doesn't really matter whether the key sets are identical, overlapping,
or disjoint; the partial overlap here is just to make it interesting.)
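
For a quicker load, the same data can be generated server-side (assuming
the pgcrypto extension is installed so that digest() is available; the
trailing newline is included because echo appends one before sha1sum sees
the value):

CREATE EXTENSION pgcrypto;

INSERT INTO test1
    SELECT digest(i || E'\n', 'sha1'), 'blah' || i || i
    FROM generate_series(1, 1000) AS i;

INSERT INTO test2
    SELECT digest(i || E'\n', 'sha1'), 'foo' || i
    FROM generate_series(500, 1500) AS i;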

ANALYZE;

EXPLAIN SELECT * FROM test1 WHERE sha1 in (SELECT sha1 FROM test2);
                              QUERY PLAN
----------------------------------------------------------------------
 Hash Semi Join  (cost=30.52..61.27 rows=1000 width=27)
   Hash Cond: (test1.sha1 = test2.sha1)
   ->  Seq Scan on test1  (cost=0.00..17.00 rows=1000 width=27)
   ->  Hash  (cost=18.01..18.01 rows=1001 width=21)
         ->  Seq Scan on test2  (cost=0.00..18.01 rows=1001 width=21)

That's OK. Apparently it can tell that joining two tables on their
primary keys cannot result in more rows than the smaller table. (Or
can it?)

EXPLAIN SELECT * FROM test1 WHERE sha1 in (SELECT sha1 FROM test2 LIMIT 200);
                                    QUERY PLAN
----------------------------------------------------------------------------------
 Hash Join  (cost=10.60..33.35 rows=500 width=27)
   Hash Cond: (test1.sha1 = test2.sha1)
   ->  Seq Scan on test1  (cost=0.00..17.00 rows=1000 width=27)
   ->  Hash  (cost=8.10..8.10 rows=200 width=32)
         ->  HashAggregate  (cost=6.10..8.10 rows=200 width=32)
               ->  Limit  (cost=0.00..3.60 rows=200 width=21)
                     ->  Seq Scan on test2  (cost=0.00..18.01 rows=1001 width=21)

Here, however, it has apparently not passed this knowledge through the
LIMIT.

The problem is that in the real query, instead of the 500 up there it
estimates about 30 million (which might be a reasonable estimate for a
join between two unkeyed columns), when it should in fact be 200.
(And again, this is part of a larger query, which is then completely
messed up because of this misestimation.)

So what's up with that? Just a case of, we haven't thought about
covering this case yet, or are there larger problems?


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: limit in subquery causes poor selectivity estimation
Date: 2011-08-27 17:32:54
Message-ID: 2651.1314466374@sss.pgh.pa.us
Lists: pgsql-hackers

Peter Eisentraut <peter_e(at)gmx(dot)net> writes:
> EXPLAIN SELECT * FROM test1 WHERE sha1 in (SELECT sha1 FROM test2);
> QUERY PLAN
> ----------------------------------------------------------------------
> Hash Semi Join (cost=30.52..61.27 rows=1000 width=27)
> Hash Cond: (test1.sha1 = test2.sha1)
> -> Seq Scan on test1 (cost=0.00..17.00 rows=1000 width=27)
> -> Hash (cost=18.01..18.01 rows=1001 width=21)
> -> Seq Scan on test2 (cost=0.00..18.01 rows=1001 width=21)

> That's OK. Apparently it can tell that joining two tables on their
> primary keys cannot result in more rows than the smaller table. (Or
> can it?)

More like it knows that a semijoin can't produce more rows than the
lefthand input has. But I think it is actually applying stats for
both columns here.

> EXPLAIN SELECT * FROM test1 WHERE sha1 in (SELECT sha1 FROM test2 LIMIT 200);

> Here, however, it has apparently not passed this knowledge through the
> LIMIT.

The LIMIT prevents the subquery from being flattened entirely, ie we
don't have just "test1 SEMI JOIN test2" but "test1 SEMI JOIN (SELECT *
FROM test2 LIMIT 200)". If you look at examine_variable in selfuncs.c
you'll note that it punts for Vars coming from unflattened subqueries.
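
A quick way to see the same punt without any LIMIT at all --- assuming I'm
remembering the pull-up rules correctly --- is the classic no-op OFFSET 0,
which changes no rows but still keeps the subquery from being flattened,
so its output column has no usable stats either:

EXPLAIN SELECT * FROM test1 WHERE sha1 IN (SELECT sha1 FROM test2 OFFSET 0);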

> So what's up with that? Just a case of, we haven't thought about
> covering this case yet, or are there larger problems?

The larger problem is that if a subquery didn't get flattened, it's
often because it's got LIMIT, or GROUP BY, or some similar clause that
makes it highly suspect whether the statistics available for the table
column are reasonable to use for the subquery outputs. It wouldn't be
that hard to grab the stats for test2.sha1, but then how do you want
to adjust them to reflect the LIMIT?

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: limit in subquery causes poor selectivity estimation
Date: 2011-08-29 16:58:19
Message-ID: CA+TgmoYerLuxcmkHtrp-2f7aAuxFp_3vWUR73RTNy4HH=YbUtA@mail.gmail.com
Lists: pgsql-hackers

On Sat, Aug 27, 2011 at 1:32 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Peter Eisentraut <peter_e(at)gmx(dot)net> writes:
>> EXPLAIN SELECT * FROM test1  WHERE sha1 in (SELECT sha1 FROM test2);
>>                               QUERY PLAN
>> ----------------------------------------------------------------------
>>  Hash Semi Join  (cost=30.52..61.27 rows=1000 width=27)
>>    Hash Cond: (test1.sha1 = test2.sha1)
>>    ->  Seq Scan on test1  (cost=0.00..17.00 rows=1000 width=27)
>>    ->  Hash  (cost=18.01..18.01 rows=1001 width=21)
>>          ->  Seq Scan on test2  (cost=0.00..18.01 rows=1001 width=21)
>
>> That's OK.  Apparently it can tell that joining two tables on their
>> primary keys cannot result in more rows than the smaller table.  (Or
>> can it?)
>
> More like it knows that a semijoin can't produce more rows than the
> lefthand input has.  But I think it is actually applying stats for
> both columns here.
>
>> EXPLAIN SELECT * FROM test1  WHERE sha1 in (SELECT sha1 FROM test2 LIMIT 200);
>
>> Here, however, it has apparently not passed this knowledge through the
>> LIMIT.
>
> The LIMIT prevents the subquery from being flattened entirely, ie we
> don't have just "test1 SEMI JOIN test2" but "test1 SEMI JOIN (SELECT *
> FROM test2 LIMIT 200)".  If you look at examine_variable in selfuncs.c
> you'll note that it punts for Vars coming from unflattened subqueries.
>
>> So what's up with that?  Just a case of, we haven't thought about
>> covering this case yet, or are there larger problems?
>
> The larger problem is that if a subquery didn't get flattened, it's
> often because it's got LIMIT, or GROUP BY, or some similar clause that
> makes it highly suspect whether the statistics available for the table
> column are reasonable to use for the subquery outputs.  It wouldn't be
> that hard to grab the stats for test2.sha1, but then how do you want
> to adjust them to reflect the LIMIT?

Well, you can't. I think the question is, in the absence of perfect
information, is it better to use the stats you have, or just punt and
assume you know nothing? Like Peter, I've certainly seen cases where
pulling up the stats would be a huge win, but it's hard to say whether
there are other cases where it would be worse than what we do now,
because nobody spends any time staring at the queries where the
existing system works great. My gut feeling is that pulling up the
stats unchanged is likely to be better than punting, but my gut
feeling may not be worth much.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: limit in subquery causes poor selectivity estimation
Date: 2011-08-31 10:22:04
Message-ID: 1314786124.27073.11.camel@fsopti579.F-Secure.com
Lists: pgsql-hackers

On Sat, 2011-08-27 at 13:32 -0400, Tom Lane wrote:
> > EXPLAIN SELECT * FROM test1 WHERE sha1 in (SELECT sha1 FROM test2
> LIMIT 200);
>
> > Here, however, it has apparently not passed this knowledge through
> the
> > LIMIT.
>
> The LIMIT prevents the subquery from being flattened entirely, ie we
> don't have just "test1 SEMI JOIN test2" but "test1 SEMI JOIN (SELECT *
> FROM test2 LIMIT 200)". If you look at examine_variable in selfuncs.c
> you'll note that it punts for Vars coming from unflattened subqueries.
>
> > So what's up with that? Just a case of, we haven't thought about
> > covering this case yet, or are there larger problems?
>
> The larger problem is that if a subquery didn't get flattened, it's
> often because it's got LIMIT, or GROUP BY, or some similar clause that
> makes it highly suspect whether the statistics available for the table
> column are reasonable to use for the subquery outputs. It wouldn't be
> that hard to grab the stats for test2.sha1, but then how do you want
> to adjust them to reflect the LIMIT?

It turns out that this is a regression introduced in 8.4.8; the same
topic is also being discussed in

http://archives.postgresql.org/pgsql-performance/2011-08/msg00248.php

and

http://archives.postgresql.org/pgsql-general/2011-08/msg00995.php

This is the (previously posted) plan with 8.4.8:

                                    QUERY PLAN
----------------------------------------------------------------------------------
 Hash Join  (cost=10.60..34.35 rows=500 width=31)
   Hash Cond: (test1.sha1 = test2.sha1)
   ->  Seq Scan on test1  (cost=0.00..18.00 rows=1000 width=31)
   ->  Hash  (cost=8.10..8.10 rows=200 width=32)
         ->  HashAggregate  (cost=6.10..8.10 rows=200 width=32)
               ->  Limit  (cost=0.00..3.60 rows=200 width=21)
                     ->  Seq Scan on test2  (cost=0.00..18.01 rows=1001 width=21)

And this is the plan with 8.4.7:

                                    QUERY PLAN
----------------------------------------------------------------------------------
 Hash Join  (cost=10.80..34.55 rows=200 width=31)
   Hash Cond: (test1.sha1 = test2.sha1)
   ->  Seq Scan on test1  (cost=0.00..18.00 rows=1000 width=31)
   ->  Hash  (cost=8.30..8.30 rows=200 width=32)
         ->  HashAggregate  (cost=6.30..8.30 rows=200 width=32)
               ->  Limit  (cost=0.00..3.80 rows=200 width=21)
                     ->  Seq Scan on test2  (cost=0.00..19.01 rows=1001 width=21)

I liked the old one better. ;-)


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: limit in subquery causes poor selectivity estimation
Date: 2011-08-31 18:09:07
Message-ID: CA+TgmoYTSQP0F8SOP7FAL_6myT0GSYxH1WBxPO9iQ=n1KsTAqw@mail.gmail.com
Lists: pgsql-hackers

On Wed, Aug 31, 2011 at 6:22 AM, Peter Eisentraut <peter_e(at)gmx(dot)net> wrote:
> On Sat, 2011-08-27 at 13:32 -0400, Tom Lane wrote:
>> > EXPLAIN SELECT * FROM test1  WHERE sha1 in (SELECT sha1 FROM test2
>> LIMIT 200);
>>
>> > Here, however, it has apparently not passed this knowledge through
>> the
>> > LIMIT.
>>
>> The LIMIT prevents the subquery from being flattened entirely, ie we
>> don't have just "test1 SEMI JOIN test2" but "test1 SEMI JOIN (SELECT *
>> FROM test2 LIMIT 200)".  If you look at examine_variable in selfuncs.c
>> you'll note that it punts for Vars coming from unflattened subqueries.
>>
>> > So what's up with that?  Just a case of, we haven't thought about
>> > covering this case yet, or are there larger problems?
>>
>> The larger problem is that if a subquery didn't get flattened, it's
>> often because it's got LIMIT, or GROUP BY, or some similar clause that
>> makes it highly suspect whether the statistics available for the table
>> column are reasonable to use for the subquery outputs.  It wouldn't be
>> that hard to grab the stats for test2.sha1, but then how do you want
>> to adjust them to reflect the LIMIT?
>
> It turns out that this is a regression introduced in 8.4.8; the same
> topic is also being discussed in
>
> http://archives.postgresql.org/pgsql-performance/2011-08/msg00248.php
>
> and
>
> http://archives.postgresql.org/pgsql-general/2011-08/msg00995.php
>
> This is the (previously posted) plan with 8.4.8:
>
>                                    QUERY PLAN
> ----------------------------------------------------------------------------------
>  Hash Join  (cost=10.60..34.35 rows=500 width=31)
>   Hash Cond: (test1.sha1 = test2.sha1)
>   ->  Seq Scan on test1  (cost=0.00..18.00 rows=1000 width=31)
>   ->  Hash  (cost=8.10..8.10 rows=200 width=32)
>         ->  HashAggregate  (cost=6.10..8.10 rows=200 width=32)
>               ->  Limit  (cost=0.00..3.60 rows=200 width=21)
>                     ->  Seq Scan on test2  (cost=0.00..18.01 rows=1001 width=21)
>
> And this is the plan with 8.4.7:
>
>                                    QUERY PLAN
> ----------------------------------------------------------------------------------
>  Hash Join  (cost=10.80..34.55 rows=200 width=31)
>   Hash Cond: (test1.sha1 = test2.sha1)
>   ->  Seq Scan on test1  (cost=0.00..18.00 rows=1000 width=31)
>   ->  Hash  (cost=8.30..8.30 rows=200 width=32)
>         ->  HashAggregate  (cost=6.30..8.30 rows=200 width=32)
>               ->  Limit  (cost=0.00..3.80 rows=200 width=21)
>                     ->  Seq Scan on test2  (cost=0.00..19.01 rows=1001 width=21)
>
> I liked the old one better. ;-)

AFAICS, those plans are identical, except for a minor difference in
the cost of scanning test2.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: limit in subquery causes poor selectivity estimation
Date: 2011-08-31 18:12:58
Message-ID: 21748.1314814378@sss.pgh.pa.us
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Wed, Aug 31, 2011 at 6:22 AM, Peter Eisentraut <peter_e(at)gmx(dot)net> wrote:
>> I liked the old one better. ;-)

> AFAICS, those plans are identical, except for a minor difference in
> the cost of scanning test2.

The point is that the estimate of the result size is worse in 8.4.8.

I am not, however, convinced that 8.4.7 was actually smarter ... it
may have been getting the right answer for the wrong reason.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: limit in subquery causes poor selectivity estimation
Date: 2011-09-02 16:45:19
Message-ID: 19020.1314981919@sss.pgh.pa.us
Lists: pgsql-hackers

Peter Eisentraut <peter_e(at)gmx(dot)net> writes:
> On Sat, 2011-08-27 at 13:32 -0400, Tom Lane wrote:
>> The larger problem is that if a subquery didn't get flattened, it's
>> often because it's got LIMIT, or GROUP BY, or some similar clause that
>> makes it highly suspect whether the statistics available for the table
>> column are reasonable to use for the subquery outputs. It wouldn't be
>> that hard to grab the stats for test2.sha1, but then how do you want
>> to adjust them to reflect the LIMIT?

> It turns out that this is a regression introduced in 8.4.8;

Well, the fact that examine_variable punts on subqueries is certainly
not a "regression introduced in 8.4.8"; it's always been like that.

I think your observation that 8.4.8 is worse has to be blamed on
commit 0ae8b300388c2a3eaf90e6e6f13d6be1f4d4ac2d, which introduced a
fallback rule of assuming 0.5 selectivity for a semijoin if we didn't
have non-default ndistinct estimates on both sides. Before that, 8.4.x
would go ahead and apply its heuristic rule, essentially Min(nd2/nd1, 1),
even when one or both ndistinct values were completely made-up.
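
To spell out the arithmetic for Peter's plans (if I have the numbers
right): test1.sha1 is a primary key, so nd1 = 1000, while the subquery
output has no stats and gets the made-up default ndistinct of 200 ---
which only coincidentally matches the LIMIT. The old heuristic then gives
1000 * Min(200/1000, 1) = 200 rows, the 8.4.7 estimate; the new fallback
gives 1000 * 0.5 = 500 rows, the 8.4.8 estimate.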

I'm not sure what we could do instead. Perhaps you could argue that
we should just revert that commit on the grounds that it's doing more
harm than good, but I don't really believe that --- I think reverting
would just move the pain points around. It's pure luck that 8.4.8
is worse rather than better on the particular example you cite.

On a longer-term basis, I'm looking into what we could do with
extracting stats from subqueries, but that doesn't seem like material
for a backpatch. I have a draft patch that I've been playing with
(attached). The main thing I feel unsure about is whether it's
reasonable to extract stats in this way from a subquery that has GROUP
BY or DISTINCT. ISTM it's probably okay to ignore joining, sorting, or
limiting in the subquery: those might affect the stats of the subquery
output, but this is no worse than using the unmodified column statistics
for any other join-level selectivity estimate (where we already ignore
the effects of scan-level restriction clauses that will filter the
column values). But GROUP BY or DISTINCT would entirely invalidate the
column frequency statistics, which makes me think that ignoring the
pg_statistic entry might be the thing to do. Comments?
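
As a toy illustration of the GROUP BY problem (table and values invented
just for the example):

CREATE TEMP TABLE skewed AS
    SELECT CASE WHEN i <= 900 THEN 'common' ELSE 'rare' || i END AS val
    FROM generate_series(1, 1000) AS i;
ANALYZE skewed;

-- The raw column's MCV list says 'common' covers ~90% of the rows ...
SELECT count(*) FROM skewed WHERE val = 'common';                    -- 900
-- ... but after GROUP BY every surviving value appears exactly once, so
-- those per-value frequencies say nothing about the subquery output.
SELECT count(*) FROM (SELECT val FROM skewed GROUP BY val) s
    WHERE val = 'common';                                             -- 1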

regards, tom lane

Attachment: wip-subquery-stats.patch (text/x-patch, 6.1 KB)

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: limit in subquery causes poor selectivity estimation
Date: 2011-09-04 22:31:38
Message-ID: 9086.1315175498@sss.pgh.pa.us
Lists: pgsql-hackers

I wrote:
> On a longer-term basis, I'm looking into what we could do with
> extracting stats from subqueries, but that doesn't seem like material
> for a backpatch. I have a draft patch that I've been playing with
> (attached).

I've committed a heavily rewritten version of that patch. Git HEAD
seems to do reasonably well on the test case you gave at the start of
this thread. I'm not sure yet how well the logic will hold up in
real-world queries as opposed to simplified test cases, but give it
a try.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: limit in subquery causes poor selectivity estimation
Date: 2011-09-06 02:25:03
Message-ID: CA+Tgmoagrs=FQmdr04tiYt2-Kwu6MyFccx5cmcdR_kHL7fky5g@mail.gmail.com
Lists: pgsql-hackers

On Fri, Sep 2, 2011 at 12:45 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> column values).  But GROUP BY or DISTINCT would entirely invalidate the
> column frequency statistics, which makes me think that ignoring the
> pg_statistic entry might be the thing to do.  Comments?

There's a possible problem there in that you may have trouble getting
a good join selectivity estimate in cases like:

SELECT ... FROM foo LEFT JOIN (SELECT x, SUM(1) FROM bar GROUP BY 1) bar
ON foo.x = bar.x

My guess is that in practice, the number of rows in foo that find a
join partner here is going to be much higher than what a stats-less
join selectivity estimation is likely to come up with. You typically
don't write a query like this in the first place if you don't expect
to find matches, although I'm sure it's been done. In some cases you
might even have a foreign key relationship to work with.
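
For instance (a hypothetical sketch, names invented):

CREATE TABLE bar (x int PRIMARY KEY, payload text);
CREATE TABLE foo (x int NOT NULL REFERENCES bar(x), stuff text);

EXPLAIN SELECT *
FROM foo
LEFT JOIN (SELECT x, SUM(1) AS n FROM bar GROUP BY 1) bar ON foo.x = bar.x;

Here the foreign key means every row of foo finds exactly one group, so a
stats-less guess that only a small fraction of foo matches would be badly
off.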

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company