New hashed IN code ignores distinctiveness of subquery

From: Bradley Baetz <bbaetz(at)acm(dot)org>
To: pgsql-bugs(at)postgresql(dot)org
Subject: New hashed IN code ignores distinctiveness of subquery
Date: 2003-01-26 12:26:12
Message-ID: 20030126122612.GA3820@mango.home
Lists: pgsql-bugs pgsql-hackers

I've been trying out the new hashed subselect code from CVS. It appears
that the planner isn't taking the distinctiveness of the values from the
subselect into account:

bbaetz=# explain analyze select count(*) FROM bugs where product_id IN
(SELECT product_id FROM bugs);

                                  QUERY PLAN
----------------------------------------------------------------------------------------------------
 Aggregate  (cost=3485675.30..3485675.30 rows=1 width=8) (actual time=1430065.54..1430065.55 rows=1 loops=1)
   ->  Merge Join  (cost=0.00..3485661.38 rows=5570 width=8) (actual time=0.15..1429696.69 rows=50000 loops=1)
         Merge Cond: ("outer".product_id = "inner".product_id)
         ->  Index Scan using bugs_product_id_idx on bugs  (cost=0.00..2313.33 rows=50000 width=4) (actual time=0.12..358.43 rows=50000 loops=1)
         ->  Index Scan using bugs_product_id_idx on bugs  (cost=0.00..2313.33 rows=50000 width=4) (actual time=0.01..1152455.44 rows=277884160 loops=1)
 Total runtime: 1430102.08 msec
(6 rows)

bbaetz=# explain analyze select count(*) FROM bugs where product_id IN
(SELECT distinct product_id FROM bugs);

                                  QUERY PLAN
----------------------------------------------------------------------------------------------------
 Aggregate  (cost=3033.30..3033.30 rows=1 width=8) (actual time=505.17..505.17 rows=1 loops=1)
   ->  Hash Join  (cost=1959.54..3032.67 rows=251 width=8) (actual time=282.19..456.66 rows=50000 loops=1)
         Hash Cond: ("outer".product_id = "inner".product_id)
         ->  Seq Scan on bugs  (cost=0.00..795.00 rows=50000 width=4) (actual time=0.01..68.94 rows=50000 loops=1)
         ->  Hash  (cost=1959.52..1959.52 rows=9 width=4) (actual time=282.14..282.14 rows=0 loops=1)
               ->  Subquery Scan "IN_subquery"  (cost=0.00..1959.52 rows=9 width=4) (actual time=0.13..282.08 rows=9 loops=1)
                     ->  Unique  (cost=0.00..1959.52 rows=9 width=4) (actual time=0.13..282.03 rows=9 loops=1)
                           ->  Index Scan using bugs_product_id_idx on bugs  (cost=0.00..1834.52 rows=50000 width=4) (actual time=0.13..245.46 rows=50000 loops=1)
 Total runtime: 505.30 msec
(9 rows)

bbaetz=# set enable_mergejoin=false;
SET

bbaetz=# explain analyze select count(*) FROM bugs where product_id IN
(SELECT product_id FROM bugs);

                                  QUERY PLAN
----------------------------------------------------------------------------------------------------
 Aggregate  (cost=4281600.45..4281600.45 rows=1 width=8) (actual time=486.80..486.80 rows=1 loops=1)
   ->  Hash Join  (cost=1091.00..4281586.50 rows=5580 width=8) (actual time=146.58..425.92 rows=50000 loops=1)
         Hash Cond: ("outer".product_id = "inner".product_id)
         ->  Seq Scan on bugs  (cost=0.00..795.00 rows=50000 width=4) (actual time=0.04..75.73 rows=50000 loops=1)
         ->  Hash  (cost=795.00..795.00 rows=50000 width=4) (actual time=146.34..146.34 rows=0 loops=1)
               ->  Seq Scan on bugs  (cost=0.00..795.00 rows=50000 width=4) (actual time=0.06..64.98 rows=50000 loops=1)
 Total runtime: 486.91 msec
(7 rows)

bugs is a table with 50000 rows, and products has 10 rows. (Only 9 of
the products are actually used in bugs.product_id, due to an off-by-one
error in the script I used to generate the table.)

I still haven't tuned the various optimiser settings, which may explain
part of the enable_mergejoin=false result, although the DISTINCT
probably takes some time too. (Side note - is it possible to notice that
DISTINCT on a column with a unique index doesn't need a Unique pass?)
However, 23 minutes vs 0.5 seconds isn't due to that. This is a fairly
useless and silly query though - I was just playing around.

The tables have been analyzed, and there are separate unique indexes on
products.id, bugs.bug_id and bugs.product_id.

FWIW:

bbaetz=# select n_distinct, correlation FROM pg_stats WHERE
tablename='products' AND attname='product_id';
 n_distinct | correlation
------------+-------------
          9 |   0.0919474
(1 row)

Thanks,

Bradley


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Bradley Baetz <bbaetz(at)acm(dot)org>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: New hashed IN code ignores distinctiveness of subquery
Date: 2003-01-26 19:09:49
Message-ID: 29875.1043608189@sss.pgh.pa.us
Lists: pgsql-bugs pgsql-hackers

Bradley Baetz <bbaetz(at)acm(dot)org> writes:
> I've been trying out the new hashed subselect code from CVS. It appears
> that the planner isn't taking the distinctiveness of the values from the
> subselect into account:

This isn't really anything to do with the new IN code, but is a
long-standing problem: cost_mergejoin doesn't apply any penalty factor
for the case where there are lots of duplicates in both inner and outer
relation (causing rescans of big chunks of the inner relation). You can
see the rescanning happening in the EXPLAIN ANALYZE output:

>   ->  Merge Join  (cost=0.00..3485661.38 rows=5570 width=8) (actual time=0.15..1429696.69 rows=50000 loops=1)
>         Merge Cond: ("outer".product_id = "inner".product_id)
>         ->  Index Scan using bugs_product_id_idx on bugs  (cost=0.00..2313.33 rows=50000 width=4) (actual time=0.12..358.43 rows=50000 loops=1)
>         ->  Index Scan using bugs_product_id_idx on bugs  (cost=0.00..2313.33 rows=50000 width=4) (actual time=0.01..1152455.44
>             rows=277884160 loops=1)
              ^^^^^^^^^^^^^^

277884160 rows pulled from a 50000-row relation means a heck of a lot of
rescanning went on :-(
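
For what it's worth, that figure is almost exactly what a naive
duplicates model predicts. Here is a standalone toy calculation (not
the actual cost_mergejoin code) of the kind of estimate involved:

#include <stdio.h>

/*
 * Toy estimate of total inner-tuple fetches for a mergejoin whose
 * inputs are full of duplicate keys: each group of equal outer keys
 * rescans the matching inner run, so the expected fetch count is
 * roughly n_outer * (n_inner / n_distinct).
 */
static double
mergejoin_inner_fetches(double n_outer, double n_inner, double n_distinct)
{
    double avg_run = n_inner / n_distinct;  /* average equal-key run length */

    return n_outer * avg_run;
}

int
main(void)
{
    /* numbers from the plan above: 50000 x 50000 rows, 9 distinct keys */
    printf("%.0f\n", mergejoin_inner_fetches(50000.0, 50000.0, 9.0));
    /* prints 277777778 -- right in line with the observed 277884160 */
    return 0;
}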

The good news is that the system *is* aware of the small number of
distinct values in the table (note the dead-on estimate of the number of
distinct rows in your other query, which I think comes from new-for-7.4
code, though the required stats have been available since 7.2).

I think it'd probably be possible to make some reasonable estimate of
the amount of rescanning required, and then inflate the mergejoin cost
estimate proportionally. I have not gotten around to looking at the
problem though. Care to take a shot at it?

regards, tom lane


From: Bradley Baetz <bbaetz(at)acm(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: New hashed IN code ignores distinctiveness of subquery
Date: 2003-01-26 22:37:39
Message-ID: 20030126223739.GA1193@mango.home
Lists: pgsql-bugs pgsql-hackers

On Sun, Jan 26, 2003 at 02:09:49PM -0500, Tom Lane wrote:
>
> This isn't really anything to do with the new IN code, but is a
> long-standing problem: cost_mergejoin doesn't apply any penalty factor
> for the case where there are lots of duplicates in both inner and outer
> relation (causing rescans of big chunks of the inner relation). You can
> see the rescanning happening in the EXPLAIN ANALYZE output:
>
> >   ->  Merge Join  (cost=0.00..3485661.38 rows=5570 width=8) (actual time=0.15..1429696.69 rows=50000 loops=1)
> >         Merge Cond: ("outer".product_id = "inner".product_id)
> >         ->  Index Scan using bugs_product_id_idx on bugs  (cost=0.00..2313.33 rows=50000 width=4) (actual time=0.12..358.43 rows=50000 loops=1)
> >         ->  Index Scan using bugs_product_id_idx on bugs  (cost=0.00..2313.33 rows=50000 width=4) (actual time=0.01..1152455.44
> >             rows=277884160 loops=1)
>               ^^^^^^^^^^^^^^
>
> 277884160 rows pulled from a 50000-row relation means a heck of a lot of
> rescanning went on :-(

Hmm. I'm not sure that that is the entire story. For the non-DISTINCT
version, the top of the tree has an estimated cost of 3485675.30, whilst
for the DISTINCT case, the estimated cost is 3033.30. If the planner
tried to treat ... IN (SELECT foo ...) as ... IN (SELECT DISTINCT foo
...) then presumably the second plan would have been chosen. IOW, the
cost estimate is roughly right (although it's actually probably low,
because the number of rows is wrong - see below), but the
DISTINCTiveness of IN isn't being considered.

>
> The good news is that the system *is* aware of the small number of
> distinct values in the table (note the dead-on estimate of the number of
> distinct rows in your other query, which I think comes from new-for-7.4
> code, though the required stats have been available since 7.2).
>
> I think it'd probably be possible to make some reasonable estimate of
> the amount of rescanning required, and then inflate the mergejoin cost
> estimate proportionally. I have not gotten around to looking at the
> problem though. Care to take a shot at it?

See above - the estimated mergejoin cost appears to be inflated
already (to 3485661.38). That may be off by an order of magnitude, I
guess, but it's higher than the hashjoin version. What the optimiser
doesn't realise is that the real cost is lower because of:

    /*
     * If we're doing an IN join, we want to return at most one row per
     * outer tuple; so we can stop scanning the inner scan if we matched on
     * the previous try.
     */
    if (node->js.jointype == JOIN_IN &&
        node->hj_MatchedOuter)
        node->hj_NeedNewOuter = true;

Actually, from the last plan in my previous mail:

    ->  Hash  (cost=795.00..795.00 rows=50000 width=4) (actual time=146.34..146.34 rows=0 loops=1)

Why is the actual number of rows 0? Would it be easier to have the
hashing stage discard duplicates for an IN hash? I guess that's the same
as doing a hash-based DISTINCT, although the DISTINCT plan has:

-> Hash
-> Subquery Scan "IN_subquery"
-> Unique
-> Index scan using bugs_product_id_idx

so I guess the unique and the hash aren't combined (not that the unique
is needed - the indexscan should only be returning unique values anyway).

While I'm going through the stats, the number of rows for a JOIN_IN
merge is wrong - it's currently |outer_rel->rows * selec;|, but should
probably be |outer_rel->rows * selec * inner_rel->num_distinct_rows;| -
notice the estimated-vs-actual for the final join, of 5580 vs 50000.
5580 would be correct if the IN only returned one value, but that's not
the case. IOW, should the JOIN_IN and JOIN_REVERSE_IN be using the same
cost calcs as JOIN_UNIQUE_INNER and JOIN_UNIQUE_OUTER in
set_joinrel_size_estimates?
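
Plugging the numbers from the plans above into both formulas (a
throwaway standalone calculation, not planner code; I'm assuming the
eqjoinsel-style selectivity comes out near 1/9 here):

#include <stdio.h>

int
main(void)
{
    double outer_rows = 50000.0;
    double selec = 1.0 / 9.0;   /* assumed eqjoinsel-style ratio, ~0.111 */
    double n_distinct = 9.0;

    /* current formula: lands near the 5580 the planner printed */
    printf("outer * selec            = %.0f\n", outer_rows * selec);

    /* proposed formula: lands near the actual 50000 */
    printf("outer * selec * distinct = %.0f\n",
           outer_rows * selec * n_distinct);
    return 0;
}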

>
> regards, tom lane

Thanks,

Bradley


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Bradley Baetz <bbaetz(at)acm(dot)org>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: New hashed IN code ignores distinctiveness of subquery
Date: 2003-01-26 23:51:18
Message-ID: 14095.1043625078@sss.pgh.pa.us
Lists: pgsql-bugs pgsql-hackers

Bradley Baetz <bbaetz(at)acm(dot)org> writes:
> On Sun, Jan 26, 2003 at 02:09:49PM -0500, Tom Lane wrote:
>> This isn't really anything to do with the new IN code, but is a
>> long-standing problem: cost_mergejoin doesn't apply any penalty factor

> Hmm. I'm not sure that that is the entire story. For the non-DISTINCT
> version, the top of the tree has an estimated cost of 3485675.30, whilst
> for the DISTINCT case, the estimated cost is 3033.30.

But the DISTINCT case is a different query. The question that's really
at hand is why does the planner mistakenly prefer merge to hash join in
the non-DISTINCT case.

> If the planner
> tried to treat ... IN (SELECT foo ...) as ... IN (SELECT DISTINCT foo
> ...) then presumably the second plan would have been chosen.

You are confusing cost estimation with number-of-resulting-tuples
estimation. They are different things.

> What the optimiser doesn't realise is that the real cost is lower
> because of:

Because the real cost is NOT less because of that. See above.

> ->  Hash  (cost=795.00..795.00 rows=50000 width=4) (actual time=146.34..146.34 rows=0 loops=1)

> Why is the actual number of rows 0?

Because a Hash node doesn't return any tuples in the normal fashion;
it's only called to build a hash table, which is then probed by the
HashJoin node. (No, I don't know why the Berkeley boys designed it that
way, and not as one combined plan node. Maybe they had visions of using
Hash for other things.)

> While I'm going through the stats, the number of rows for a JOIN_IN
> merge is wrong - it's currently |outer_rel->rows * selec;|, but should
> probably be |outer_rel->rows * selec * inner_rel->num_distinct_rows;| -
> notice the estimated-vs-actual for the final join, of 5580 vs 50000.

Not sure about that. The selectivity would need to be calculated
differently if we based the calculation on that. It's probably bogus
as-is (note the comments) but I haven't had time to think about it.
The major reason this equation is stated the way it is is that, given
selectivity ranging from 0 to 1, multiplying by outer rows alone yields
the correct range of possible results (0 to number of outer rows).
Multiplying by num_distinct_rows could yield an impossible result.

> IOW, should the JOIN_IN and JOIN_REVERSE_IN be using the same
> cost calcs as JOIN_UNIQUE_INNER and JOIN_UNIQUE_OUTER in
> set_joinrel_size_estimates?

Maybe; from a mechanical point of view, the existing code does the right
thing given 0-to-1 selectivity estimates in each case.

regards, tom lane


From: Bradley Baetz <bbaetz(at)acm(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: New hashed IN code ignores distinctiveness of subquery
Date: 2003-01-27 01:07:58
Message-ID: 20030127010758.GA2806@mango.home
Lists: pgsql-bugs pgsql-hackers

On Sun, Jan 26, 2003 at 06:51:18PM -0500, Tom Lane wrote:
> Bradley Baetz <bbaetz(at)acm(dot)org> writes:
> > On Sun, Jan 26, 2003 at 02:09:49PM -0500, Tom Lane wrote:
> >> This isn't really anything to do with the new IN code, but is a
> >> long-standing problem: cost_mergejoin doesn't apply any penalty factor
>
> > Hmm. I'm not sure that that is the entire story. For the non-DISTINCT
> > version, the top of the tree has an estimated cost of 3485675.30, whilst
> > for the DISTINCT case, the estimated cost is 3033.30.
>
> But the DISTINCT case is a different query. The question that's really
> at hand is why does the planner mistakenly prefer merge to hash join in
> the non-DISTINCT case.

Well, it's a different input, but the output is the same, so it may as
well be the same query. IN (1,1,1) is the same as IN (1).

As I mentioned, I think the code prefers a merge join because the
calculations assume that all of the 50000 rows returned from the
subquery are unique. If that were the case, then a mergejoin probably
would be correct. Given that most of the values are the same, however,
it's quicker to just do a hash probe for each entry.

This probably ties into the 'long-standing problem' you mentioned
earlier: fixing that would possibly increase the cost to the point where
the hashjoin is picked instead, but it wouldn't fix the fact that the
hashjoin cost is way too high to start with.

>
> > If the planner
> > tried to treat ... IN (SELECT foo ...) as ... IN (SELECT DISTINCT foo
> > ...) then presumably the second plan would have been chosen.
>
> You are confusing cost estimation with number-of-resulting-tuples
> estimation. They are different things.

I mentioned them both because they are related, although not
necessarily for the bug(s) here. IN (SELECT foo) will return more
tuples, but it's semantically identical to the IN (SELECT DISTINCT foo)
form. For a subselect with a high selectivity, it's better to filter
later on, but where the result will have lots of repeated tuples, it may
be better to try to filter first, which reduces the number of tuples to
join, and possibly use an index scan to get the list instead of a
seqscan.

> Because a Hash node doesn't return any tuples in the normal fashion;
> it's only called to build a hash table, which is then probed by the
> HashJoin node. (No, I don't know why the Berkeley boys designed it that
> way, and not as one combined plan node. Maybe they had visions of using
> Hash for other things.)

Well, in this case the hash node could prune duplicates as they went
into the table :). Not sure if that would be a win, though.

> > While I'm going through the stats, the number of rows for a JOIN_IN
> > merge is wrong - it's currently |outer_rel->rows * selec;|, but should
> > probably be |outer_rel->rows * selec * inner_rel->num_distinct_rows;| -
> > notice the estimated-vs-actual for the final join, of 5580 vs 50000.
>
> Not sure about that. The selectivity would need to be calculated
> differently if we based the calculation on that. It's probably bogus
> as-is (note the comments) but I haven't had time to think about it.
> The major reason this equation is stated the way it is is that, given
> selectivity ranging from 0 to 1, multiplying by outer rows alone yields
> the correct range of possible results (0 to number of outer rows).
> Multiplying by num_distinct_rows could yield an impossible result.

I don't follow that. We have 50000 outer rows, with a selectivity of say
0.1 (to make the maths nicer). That gives roughly the 5000 rows printed
above. However, that's not the whole story. The result of the join will
be 5000 rows _for each (unique) tuple in the subselect result_. For the
10 distinct values I have, that gives 50000 total rows, which is
correct. (I'm not familiar enough with the postgres source to know if
that approach will work with histograms; I presume there's a helper func
somewhere to handle merging that stuff.)

>
> > IOW, should the JOIN_IN and JOIN_REVERSE_IN be using the same
> > cost calcs as JOIN_UNIQUE_INNER and JOIN_UNIQUE_OUTER in
> > set_joinrel_size_estimates?
>
> Maybe; from a mechanical point of view, the existing code does the right
> thing given 0-to-1 selectivity estimates in each case.
>

See above. What is the purpose of the _UNIQUE forms, anyway?

> regards, tom lane

Thanks,

Bradley


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Bradley Baetz <bbaetz(at)acm(dot)org>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: New hashed IN code ignores distinctiveness of subquery
Date: 2003-01-27 02:43:18
Message-ID: 19010.1043635398@sss.pgh.pa.us
Lists: pgsql-bugs pgsql-hackers

Bradley Baetz <bbaetz(at)acm(dot)org> writes:
> On Sun, Jan 26, 2003 at 06:51:18PM -0500, Tom Lane wrote:
>> But the DISTINCT case is a different query. The question that's really
>> at hand is why does the planner mistakenly prefer merge to hash join in
>> the non-DISTINCT case.

> Well, it's a different input, but the output is the same, so it may as
> well be the same query. IN (1,1,1) is the same as IN (1).

That's a good observation with respect to the problem of estimating the
number of output tuples; but it's irrelevant to the problem of
estimating execution cost. The plan that we're trying to cost here is
specifically a plan that runs a mergejoin *without* first uniq-ifying
the righthand relation. As such, the cost of the mergejoin is nearly
the same as the cost of an ordinary mergejoin of the two relations.
We save the costs of outputting join tuples that aren't wanted by the
IN, but we still have to run the mergejoin over those tuples.

> Well, in this case the hash node could prune duplicates as they went
> into the table :). Not sure if that would be a win, though.

We're already checking that as a separate plan alternative. The
implementation could be improved a little, though --- we could combine
the uniq-ification into the Hash node.
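
A minimal sketch of that combination (toy standalone C, not the actual
Hash node - the hash function, table size, and int keys are all
arbitrary choices for illustration): duplicates are dropped at build
time, so the probe side never sees more than one copy of a key.

#include <stdio.h>

#define NBUCKETS 64             /* toy-sized, power of two */

static int  keys[NBUCKETS];
static char used[NBUCKETS];

static void
hash_insert_unique(int key)
{
    unsigned int h = (unsigned int) key * 2654435761u % NBUCKETS;

    while (used[h])
    {
        if (keys[h] == key)
            return;             /* duplicate: discard instead of storing */
        h = (h + 1) % NBUCKETS; /* linear probing */
    }
    used[h] = 1;
    keys[h] = key;
}

int
main(void)
{
    int sample[] = {1, 2, 2, 3, 3, 3, 9, 9, 1};
    int i, n = 0;

    for (i = 0; i < 9; i++)
        hash_insert_unique(sample[i]);
    for (i = 0; i < NBUCKETS; i++)
        n += used[i];
    printf("stored %d distinct keys\n", n);  /* prints 4 */
    return 0;
}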

> I don't follow that. We have 50000 outer rows, with a selectivity of say
> 0.1 (to make the maths nicer). That gives roughly the 5000 rows printed
> above. However, that's not the whole story. The result of the join will
> be 5000 rows _for each (unique) tuple in the subselect result_. For the
> 10 distinct values I have, that gives 50000 total rows, which is
> correct.

AFAICS there are two or three different concepts of selectivity being
tossed about here. If we want to continue defining "selectivity"
as the ratio of the number of output tuples to the cross product size,
then you need different selectivity numbers depending on whether you
take the cross product size to be num_outer * num_inner or num_outer *
num_distinct_inner or just num_outer. Only in the last case does the
valid range of selectivity run from 0 to 1; if the selectivity
estimator returns something approaching 1 then the first two equations
will give very wrong answers. As an example, consider the case where
all the entries in both tables are the same value (say 42). The
present selectivity estimator *will* say 1.0, as it should for an
ordinary join. If you multiply by anything but num_outer you get the
wrong answer for the number of output tuples.
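
To put toy numbers on the overshoot (a standalone illustration, not
planner code - the 99%/1% split is invented): a skewed distribution can
push the eqjoinsel-style selectivity toward 1 while the inner side still
has more than one distinct value, and then multiplying by n_distinct
blows through the hard ceiling of n_outer output rows.

#include <stdio.h>

int
main(void)
{
    double n_outer = 50000.0;
    /* 99% of both sides hold one value (say 42), 1% something else */
    double selec = 0.99 * 0.99 + 0.01 * 0.01;   /* ~0.9802 */
    double n_distinct = 2.0;

    /* possible: can never exceed n_outer */
    printf("outer * selec            = %.0f\n", n_outer * selec);

    /* impossible: nearly twice the number of outer rows */
    printf("outer * selec * distinct = %.0f\n",
           n_outer * selec * n_distinct);
    return 0;
}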

It might be that we have to bite the bullet and set up a different
selectivity estimation procedure for IN. I'd prefer to avoid that
but haven't really figured out if it's necessary or not.

In the meantime, I think you are right that it's bogus that
set_joinrel_size_estimates uses different equations for JOIN_IN and
JOIN_UNIQUE_INNER. Whatever the decision is about how to do the
estimation, these should be done the same way.

regards, tom lane


From: Bradley Baetz <bbaetz(at)acm(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: New hashed IN code ignores distinctiveness of subquery
Date: 2003-01-27 03:22:57
Message-ID: 20030127032257.GA3391@mango.home
Lists: pgsql-bugs pgsql-hackers

On Sun, Jan 26, 2003 at 09:43:18PM -0500, Tom Lane wrote:

> We're already checking that as a separate plan alternative. The
> implementation could be improved a little, though --- we could combine
> the uniq-ification into the Hash node.

Right, or skip it entirely when selecting stuff with unique constraints.

> AFAICS there are two or three different concepts of selectivity being
> tossed about here.

<snip>

Ah, OK. Yeah, I think I was confusing myself there.

>
> It might be that we have to bite the bullet and set up a different
> selectivity estimation procedure for IN. I'd prefer to avoid that
> but haven't really figured out if it's necessary or not.

I don't think it is. The number of rows is correct if you do product_id
IN (1) vs product_id IN (1,2) vs product_id IN (1,2,3) and so on. The
only difference for a subselect is that the number of values generated
is only known approximately, rather than exactly, but I don't think
that's relevant.

> In the meantime, I think you are right that it's bogus that
> set_joinrel_size_estimates uses different equations for JOIN_IN and
> JOIN_UNIQUE_INNER. Whatever the decision is about how to do the
> estimation, these should be done the same way.

Hmm. So, I made that change, and now get an estimate of 50218 rows for
the join result, which is about right (although capping the result at
the number of input rows would be valid for an inner join), but the cost
for the hashjoin is still 4281586.50. cost_hashjoin probably needs to be
taught about the short circuiting done for _IN, but from a quick glance
through the code I'm not sure where to do that.

What is the point of JOIN_UNIQUE_{INNER,OUTER}, though? What does it do
that JOIN_IN doesn't? executor/* doesn't appear to use it.

> regards, tom lane

Bradley


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Bradley Baetz <bbaetz(at)acm(dot)org>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: New hashed IN code ignores distinctiveness of subquery
Date: 2003-01-27 04:18:31
Message-ID: 19407.1043641111@sss.pgh.pa.us
Lists: pgsql-bugs pgsql-hackers

Bradley Baetz <bbaetz(at)acm(dot)org> writes:
> On Sun, Jan 26, 2003 at 09:43:18PM -0500, Tom Lane wrote:
>> We're already checking that as a separate plan alternative. The
>> implementation could be improved a little, though --- we could combine
>> the uniq-ification into the Hash node.

> Right, or skip it entirely when selecting stuff with unique constraints.

I'm hesitant to do that until we have some scheme in place for
invalidating cached plans. Right now, if you drop an index that is used
by some cached plan, you'll hear about it next time the plan is used.
But if the plan doesn't directly use the index, yet depends on its
existence to be correct, you'll get no error and subtly(?) wrong
answers. I don't mind depending on such assumptions in estimating
costs, but I do mind depending on them for correct answers.

> I don't think it is. The number of rows is correct if you do product_id
> IN (1) vs product_id IN (1,2) vs product_id IN (1,2,3) and so on.

But that's a completely different code path; it doesn't even enter the
routines we're concerned about here.

> cost_hashjoin
> probably needs to be taught about the short circuiting done for _IN,

Yeah, I think so. Stepping through cost_hashjoin shows that the major
component of the inflated cost is coming from the per-tuple CPU costs,
which are inflated because we're not allowing for the IN shortcircuit.
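
Back-of-envelope, the gap looks like this (standalone C, not
cost_hashjoin itself; the uniform bucket sizes are an assumption):

#include <stdio.h>

int
main(void)
{
    double n_outer = 50000.0;
    double n_inner = 50000.0;
    double n_distinct = 9.0;

    /* matching inner tuples per probe, ~5556 with these numbers */
    double run = n_inner / n_distinct;

    /* plain join: every outer tuple is compared with its whole bucket run */
    printf("plain join comparisons: %.0f\n", n_outer * run);

    /* JOIN_IN: stop at the first match, so about one comparison each */
    printf("JOIN_IN comparisons:    %.0f\n", n_outer);
    return 0;
}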

> What is the point of JOIN_UNIQUE_{INNER,OUTER}, though? What does it do
> that JOIN_IN doesn't?

Uniqify the inner/outer path and then do a normal inner join. See
joinpath.c.

> executor/* doesn't appear to use it.

No; the executor never sees JOIN_REVERSE_IN, either.

regards, tom lane


From: Bradley Baetz <bbaetz(at)acm(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: New hashed IN code ignores distinctiveness of subquery
Date: 2003-01-27 04:50:41
Message-ID: 20030127045041.GA8600@mango.home
Lists: pgsql-bugs pgsql-hackers

On Sun, Jan 26, 2003 at 11:18:31PM -0500, Tom Lane wrote:
> Bradley Baetz <bbaetz(at)acm(dot)org> writes:
> > Right, or skip it entirely when selecting stuff with unique constraints.
>
> I'm hesitant to do that until we have some scheme in place for
> invalidating cached plans.

By cached, do you mean PREPARE stuff, or something else?

>
> > I don't think it is. The number of rows is correct if you do product_id
> > IN (1) vs product_id IN (1,2) vs product_id IN (1,2,3) and so on.
>
> But that's a completely different code path; it doesn't even enter the
> routines we're concerned about here.

Yes, but it's the same concept. Although we seem to be agreeing about
that now :)

> > What is the point of JOIN_UNIQUE_{INNER,OUTER}, though? What does it do
> > that JOIN_IN doesn't?
>
> Uniqify the inner/outer path and then do a normal inner join. See
> joinpath.c.

Ah, OK. If I comment out line 547 of joinrels.c (which adds JOIN_IN to
the set of join paths) so that the UNIQUE joins are all that are left to
try, then I get:

bbaetz=# explain analyze select count(*) FROM bugs where product_id IN
(SELECT product_id FROM bugs);

                                  QUERY PLAN
----------------------------------------------------------------------------------------------------
 Aggregate  (cost=3494816.98..3494816.98 rows=1 width=8) (actual time=579.71..579.71 rows=1 loops=1)
   ->  Merge Join  (cost=5169.41..3494691.43 rows=50218 width=8) (actual time=111.41..530.16 rows=50000 loops=1)
         Merge Cond: ("outer".product_id = "inner".product_id)
         ->  Index Scan using bugs_product_id_idx on bugs  (cost=0.00..1834.52 rows=50000 width=4) (actual time=0.13..249.57 rows=50000 loops=1)
         ->  Sort  (cost=920.14..920.17 rows=9 width=4) (actual time=111.25..143.42 rows=44476 loops=1)
               Sort Key: public.bugs.product_id
               ->  HashAggregate  (cost=920.00..920.00 rows=9 width=4) (actual time=111.17..111.18 rows=9 loops=1)
                     ->  Seq Scan on bugs  (cost=0.00..795.00 rows=50000 width=4) (actual time=0.00..67.41 rows=50000 loops=1)
 Total runtime: 579.84 msec
(9 rows)

(This isn't picked without my hack, because the cost is slightly higher
than the JOIN_IN version)

However, it's much faster (although not as fast as sticking the DISTINCT
in there myself), but the actual row count coming from the sort is really
odd - where is that number coming from? How can sorting 9 rows take 44476
anythings? The final mergejoin cost is still way off, too.

> regards, tom lane

Thanks,

Bradley


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Bradley Baetz <bbaetz(at)acm(dot)org>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: New hashed IN code ignores distinctiveness of subquery
Date: 2003-01-27 05:00:08
Message-ID: 19654.1043643608@sss.pgh.pa.us
Lists: pgsql-bugs pgsql-hackers

Bradley Baetz <bbaetz(at)acm(dot)org> writes:
> By cached, do you mean PREPARE stuff, or something else?

PREPARE, and cached plans in plpgsql, and cached plans in SPI, and
cached plans for foreign keys (though probably those never use IN)
and perhaps other places I'm not thinking of at the moment.

> However, it's much faster (although not as fast as sticking the DISTINCT
> in there myself), but the actual row count coming from the sort is really
> odd - where is that number coming from? How can sorting 9 rows take 44476
> anythings?

We're back full circle to my original comment about rescans in
mergejoin. The EXPLAIN ANALYZE instrumentation counts a re-fetch
as another returned row.

regards, tom lane


From: Bradley Baetz <bbaetz(at)acm(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [BUGS] New hashed IN code ignores distinctiveness of subquery
Date: 2003-01-27 07:29:19
Message-ID: 20030127072918.GA24215@mango.home
Lists: pgsql-bugs pgsql-hackers

[Moving -> -hackers]

On Mon, Jan 27, 2003 at 12:00:08AM -0500, Tom Lane wrote:
> Bradley Baetz <bbaetz(at)acm(dot)org> writes:

> > However, it's much faster (although not as fast as sticking the DISTINCT
> > in there myself), but the actual row count coming from the sort is really
> > odd - where is that number coming from? How can sorting 9 rows take 44476
> > anythings?
>
> We're back full circle to my original comment about rescans in
> mergejoin. The EXPLAIN ANALYZE instrumentation counts a re-fetch
> as another returned row.

Hmm. OK, I poked through the code a bit more, and I think I now realise
why we were talking across each other. I've attached a 'patch' which
gets the mergejoin counts down to something reasonable. (The patch also
includes a bit to fix the estimated row count for JOIN_IN, as discussed
on -bugs)

When calculating the cost for a join, if a path is a T_UniquePath, then
the reduction in the number of rows to be examined isn't taken into
account. This is why the mergejoin code was calculating the cost of
merging two 50000 tuple paths - the overestimation that you mentioned
earlier isn't as important here.

For reference, bugs is a 50000 row table, and there are 9 distinct
values for product_id.

Before (with the num_rows part of the patch, though):

bbaetz=# explain analyze select count(*) FROM bugs where product_id IN
(SELECT product_id FROM bugs);

                                  QUERY PLAN
----------------------------------------------------------------------------------------------------
 Aggregate  (cost=3494816.98..3494816.98 rows=1 width=8) (actual time=579.71..579.71 rows=1 loops=1)
   ->  Merge Join  (cost=5169.41..3494691.43 rows=50218 width=8) (actual time=111.41..530.16 rows=50000 loops=1)
         Merge Cond: ("outer".product_id = "inner".product_id)
         ->  Index Scan using bugs_product_id_idx on bugs  (cost=0.00..1834.52 rows=50000 width=4) (actual time=0.13..249.57 rows=50000 loops=1)
         ->  Sort  (cost=920.14..920.17 rows=9 width=4) (actual time=111.25..143.42 rows=44476 loops=1)
               Sort Key: public.bugs.product_id
               ->  HashAggregate  (cost=920.00..920.00 rows=9 width=4) (actual time=111.17..111.18 rows=9 loops=1)
                     ->  Seq Scan on bugs  (cost=0.00..795.00 rows=50000 width=4) (actual time=0.00..67.41 rows=50000 loops=1)
 Total runtime: 579.84 msec
(9 rows)

After:

bbaetz=# explain analyze select count(*) FROM bugs where product_id IN
(SELECT product_id FROM bugs);

                                  QUERY PLAN
----------------------------------------------------------------------------------------------------
 Aggregate  (cost=8007.21..8007.21 rows=1 width=8) (actual time=578.16..578.16 rows=1 loops=1)
   ->  Merge Join  (cost=5169.41..7881.67 rows=50218 width=8) (actual time=110.94..527.79 rows=50000 loops=1)
         Merge Cond: ("outer".product_id = "inner".product_id)
         ->  Index Scan using bugs_product_id_idx on bugs  (cost=0.00..1834.52 rows=50000 width=4) (actual time=0.13..250.74 rows=50000 loops=1)
         ->  Sort  (cost=920.14..920.17 rows=9 width=4) (actual time=110.78..142.80 rows=44476 loops=1)
               Sort Key: public.bugs.product_id
               ->  HashAggregate  (cost=920.00..920.00 rows=9 width=4) (actual time=110.70..110.71 rows=9 loops=1)
                     ->  Seq Scan on bugs  (cost=0.00..795.00 rows=50000 width=4) (actual time=0.00..67.14 rows=50000 loops=1)
 Total runtime: 578.30 msec
(9 rows)

The patch isn't correct as-is, because it only covers merge joins:

bbaetz=# set enable_mergejoin=false;
SET
bbaetz=# explain analyze select count(*) FROM bugs where product_id IN
(SELECT product_id FROM bugs);

                                  QUERY PLAN
----------------------------------------------------------------------------------------------------
 Aggregate  (cost=4281712.05..4281712.05 rows=1 width=8) (actual time=410.14..410.14 rows=1 loops=1)
   ->  Hash Join  (cost=1091.00..4281586.50 rows=50218 width=8) (actual time=126.32..362.30 rows=50000 loops=1)
         Hash Cond: ("outer".product_id = "inner".product_id)
         ->  Seq Scan on bugs  (cost=0.00..795.00 rows=50000 width=4) (actual time=0.04..66.81 rows=50000 loops=1)
         ->  Hash  (cost=795.00..795.00 rows=50000 width=4) (actual time=126.08..126.08 rows=0 loops=1)
               ->  Seq Scan on bugs  (cost=0.00..795.00 rows=50000 width=4) (actual time=0.02..68.23 rows=50000 loops=1)
 Total runtime: 410.25 msec
(7 rows)

I don't think that propagating my hack to everywhere that wants to know
how many rows are returned is a good idea though - is there a more
generic way to get the number of rows really returned by a path?

>
> regards, tom lane

Bradley

Attachment: unq (text/plain, 1.7 KB)

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Bradley Baetz <bbaetz(at)acm(dot)org>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [BUGS] New hashed IN code ignores distinctiveness of subquery
Date: 2003-01-27 21:29:52
Message-ID: 8193.1043702992@sss.pgh.pa.us
Lists: pgsql-bugs pgsql-hackers

Bradley Baetz <bbaetz(at)acm(dot)org> writes:
> Hmm. OK, I poked through the code a bit more, and I think I now realise
> why we were talking across each other. I've attached a 'patch' which
> gets the mergejoin counts down to something reasonable.

I've just committed a significant set of changes in the join cost
estimation routines. On looking closer, they hadn't been upgraded for
any of the recent changes --- they were still assuming that merge and
hash join clauses could only be simple var = var, for instance. I did
something about the mergejoin rescan issue, as well as modeling JOIN
short-circuiting. All of the estimates are a bit crude, but certainly
better than no model at all.

I think this covers your concerns, though I'm still worried about
whether it's okay to use the existing selectivity routines to compute
selectivities in the JOIN_IN/JOIN_UNIQUE case.

regards, tom lane