Re: Resjunk sort columns, Heikki's index-only quals patch, and bug #5000

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Resjunk sort columns, Heikki's index-only quals patch, and bug #5000
Date: 2009-09-21 00:43:45
Message-ID: 603c8f070909201743r1fcc3405w1980962d2bd30455@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Sep 15, 2009 at 7:53 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>>  Consider A IJ B, with
>>> the scan over A implemented as an index scan.  It seems to me that if
>>> the join selectivity is < 1, then assuming there's a choice, we
>>> probably want to join A to B and then do the heap fetches against A
>>> afterwards.  But if the join selectivity is > 1 (consider, for
>>> example, a cross join), we probably want to do the heap fetches first.
>>
>> Hmm, good point. I didn't consider that join selectivity can be > 1.
>>
>> A more common scenario is that there's an additional filter condition on
>> the HeapFetch, with a selectivity < 1. It can then cheaper to perform
>> the heap fetches first and only join the remaining rows that satisfy the
>> filter condition.
>
> Well, again, it seems to me that it entirely depends on whether the IJ
> increases or decreases the number of rows.  You want  to do the heap
> fetches at the point where there are the fewest of them to do, and you
> can't know that a priori.  When you start talking about "more common"
> scenarios, what you really mean is "more common in the queries I
> normally do", and that's not the same as what other people's queries
> do.  (See, for example, previous discussions on -performance, where it
> turns out that my suggested "fix" for a particular kind of planner
> problem is the exact opposite of Kevin Grittner's fix for a problem
> with the same code; the existing code bounds a certain value from
> below at 1 - I suggested raising it to 2, he suggested lowering it to
> 0.)

I've been mulling over this problem all week. I haven't gotten all
that far, but here are my thoughts.

Suppose we're planning some joinrel within a large join nest. We have
two partial paths A and B with identical pathkeys. Path A does not
involve an index scan, so we know its exact cost. Path B involves an
index scan, so a heap fetch will have to be done at some point, but
since we can't yet know where the best place to do that heap fetch is,
so we can't know the exact cost of B. Basically, the decision we face
here is whether to keep both plans A and B or to discard one of them
as clearly inferior to the other. As a preliminary observation, this
is the logic that is performed by add_path(), which gets called A LOT
in planning large join nests, so the performance impact of what
happens there needs to be measured carefully.

We can, however, put some bounds on the cost of B. We know what the
cost of B is disregarding the heap fetch (or heap fetches) that will
eventually need to be performed. This cost is also the minimum total
cost of B, in the case where some yet-to-be-performed join is
estimated to return no rows, and thus the heap fetch is estimated to
not actually need to fetch anything. We can also bound the maximum
cost of B: it certainly can't be any higher than the cost of doing all
the heap fetches immediately above the index fetches. We could
probably compute a tighter upper bound by considering various possible
positions for the heap fetches at or below the level of the joinrel,
but I doubt that it's worth the effort.

Instead, what we can do is compare the low-estimate for B to the
estimate for A. If it's higher, discard B. Otherwise, compare the
high-estimate for B to the estimate for A. If it's lower, discard A.
Otherwise, keep both paths. More generally, we can conceive of each
path as having low and high estimates, and we can say that for two
paths with identical pathkeys, P dominates P' if the high-estimate for
P is less than the low-estimate for P'.

In this view of the world, we don't actually need to represent the
heap-fetches in the path trees during joinrel planning. Instead, we
build up a set of possible paths for the whole joinrel, and then at
the end, we go back and look at any remaining paths that require heap
fetches to be inserted and figure out the best place to put them. The
major problem I see with this approach is that it may slow down
add_path() too much, but if that turns out to be the case I don't have
another idea short of attacking the problem using some kind of
heuristics that will probably be less accurate than an analysis of the
type described above.

Thoughts?

...Robert

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2009-09-21 01:25:17 Re: Resjunk sort columns, Heikki's index-only quals patch, and bug #5000
Previous Message Jeff Davis 2009-09-21 00:21:51 Re: operator exclusion constraints [was: generalized index constraints]