From: Tom Lane <tgl@sss.pgh.pa.us>
To: "Jeffrey W. Baker" <jwbaker@acm.org>
Cc: pgsql-hackers@postgresql.org
Subject: Re: proposal: be smarter about i/o patterns in index scan
Date: 2004-05-19 18:56:25
Message-ID: 19335.1084992985@sss.pgh.pa.us
Lists: pgsql-hackers

"Jeffrey W. Baker" <jwbaker(at)acm(dot)org> writes:
> Are you saying that index scan results are sorted by something other
> than the index key? Because in my scheme they would still be sorted by
> index key.

Not unless you add yet another sort step after the fetch step; that is,
the idea becomes
1. read index to get candidate TIDs
2. sort TIDs into physical order
3. read heap in physical order, check row visibility
4. sort selected rows into index order

That would start to look like an unpleasant amount of overhead, since
the sort would have to be over the entire output; you couldn't divide
the scan into reasonable-size batches, whereas steps 1-3 can easily
be run in batched style.
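
For concreteness, here is a minimal standalone sketch of the batched
steps 1-3 (not actual PostgreSQL code: the index and heap accessors are
hypothetical stand-ins, and TIDs are simplified to block/offset pairs):

    #include <stdlib.h>

    typedef struct { unsigned blkno; unsigned offset; } Tid;

    /* physical heap order: block number first, then line offset */
    static int
    tid_physical_cmp(const void *a, const void *b)
    {
        const Tid *ta = (const Tid *) a;
        const Tid *tb = (const Tid *) b;

        if (ta->blkno != tb->blkno)
            return (ta->blkno < tb->blkno) ? -1 : 1;
        if (ta->offset != tb->offset)
            return (ta->offset < tb->offset) ? -1 : 1;
        return 0;
    }

    /* hypothetical: step 1, fill tids[] from the index, return count */
    extern int  index_read_tids(Tid *tids, int max);
    /* hypothetical: step 3, fetch heap tuple, check row visibility */
    extern void heap_fetch_and_check(const Tid *tid);

    static void
    unsorted_indexscan(void)
    {
        Tid     tids[4096];     /* a batch in the thousands of TIDs */
        int     n, i;

        while ((n = index_read_tids(tids, 4096)) > 0)
        {
            qsort(tids, n, sizeof(Tid), tid_physical_cmp);  /* step 2 */
            for (i = 0; i < n; i++)
                heap_fetch_and_check(&tids[i]);             /* step 3 */
        }
    }

Each batch is independent of the next, which is exactly why steps 1-3
batch nicely; a step 4 would have to sort across all batches at once.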

Moreover, this'd not be any win compared to just dropping the assumption
of sorted output; the planner could stick in the same sort for itself if
it decided the cost was worth it.

What you probably want to do is support both our traditional indexscan
style and an "unsorted indexscan" plan type that delivers unsorted
output (implementing steps 1-3 above). With appropriate cost estimation
the planner could make an intelligent choice between the two.

I'm too lazy to look in the archives right now, but my private notes
about this say:

We could improve indexscan performance if we were to read a bunch of TIDs from
the index, sort in page number order, and access the heap tuples in page
number order. But we'd probably need a batch size in the thousands of TIDs
to get any really useful effect. This would not work very well given the present
assumption that we hold a pin on the index page until we have fetched the
tuple (cf lazy VACUUM). Pinning lots of index pages is no good for
concurrency or buffer usage. QUESTION: is that assumption really necessary?
Can't we just make the code ignore TIDs that lead to deleted tuples? Worst
case is that we fetch a newly-inserted tuple that does not actually have the
expected index value, but won't we ignore it because of MVCC tqual rules?
(This does NOT work for dirty read or SnapshotNow cases, but we could probably
set things up to only try this optimization for standard snapshot reads.)
An even nicer property is that we could loop inside the index AM to scoop up
tuples, saving locking overhead. This might be worth doing even when we want
to keep returning the tuples in index order.
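
A sketch of the TID-filtering rule those notes propose, under the stated
assumption that this is only attempted for standard MVCC snapshot reads
(helper names invented for illustration):

    typedef struct { unsigned blkno; unsigned offset; } Tid;
    typedef struct SnapshotData *Snapshot;

    /* hypothetical: does the heap slot still hold a tuple at all? */
    extern int  slot_is_used(const Tid *tid);
    /* hypothetical: does that tuple pass the snapshot's time qual? */
    extern int  tuple_satisfies_snapshot(const Tid *tid, Snapshot snap);

    /*
     * Decide whether a TID fetched without holding the index-page pin
     * should yield a row.  A TID leading to a removed tuple is silently
     * ignored, not treated as an error; a slot refilled by a later
     * insertion may not match the index key we followed, but per the
     * MVCC-tqual argument above it should fail the snapshot test too.
     */
    static int
    tid_still_interesting(const Tid *tid, Snapshot snap)
    {
        if (!slot_is_used(tid))
            return 0;
        return tuple_satisfies_snapshot(tid, snap);
    }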

and there is also something about a longer-range plan involving the same
ideas:

Idea: invent two new plan node types:

* IndexOnlyScan
reads an index, does NOT visit heap. Returns consed-up tuples
that contain the heap ctid and (some of) the index columns.
* TidExpand
Given an input tuplestream containing CTIDs, visit the heap.
Discard input tuple if heap tuple is missing or not visible to
current xact. Otherwise, generate output tuple consisting of
required fields from input tuple and heap tuple.
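
Roughly, TidExpand's per-tuple step might look like this (all names are
invented for illustration, not the real executor API):

    typedef struct { unsigned blkno; unsigned offset; } Ctid;

    typedef struct IndexOnlyTuple      /* what IndexOnlyScan emits */
    {
        Ctid    ctid;                  /* heap ctid from the index */
        /* ... (some of) the index columns ... */
    } IndexOnlyTuple;

    typedef struct HeapRow HeapRow;
    typedef struct OutputTuple OutputTuple;

    /* hypothetical: returns NULL if the heap tuple is missing */
    extern HeapRow *fetch_heap_row(Ctid ctid);
    /* hypothetical: visibility to the current xact */
    extern int      row_visible_to_xact(const HeapRow *row);
    /* hypothetical: required input fields plus required heap fields */
    extern OutputTuple *project_row(const IndexOnlyTuple *in,
                                    const HeapRow *row);

    /* NULL result means "discard this input tuple" */
    static OutputTuple *
    tid_expand_one(const IndexOnlyTuple *in)
    {
        HeapRow *row = fetch_heap_row(in->ctid);

        if (row == NULL || !row_visible_to_xact(row))
            return NULL;
        return project_row(in, row);
    }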

We'd put IndexOnlyScan below, and TidExpand above, a join node. The
transformation is legal if the join condition does not require a heap column
that's not available from the index. (If both join inputs are indexscanable
then we'd consider plans involving two IndexOnlyScans, a join, and two
TidExpand nodes above the join.) This is a win if the join eliminates many
source tuples from consideration. (Probably it never makes sense for the
outer side of an outer join...) Also, it's not legal for a rel that is on
the nullable side of an outer join, since we might have bogus matches that
would incorrectly prevent a null-extended row from being emitted.
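
For example, with both inputs indexscanable, the transformed plan shape
(invented here purely for illustration) would be something like

    TidExpand on a (fetch a's heap rows, recheck visibility)
     -> TidExpand on b (likewise for b)
         -> Join (a.x = b.y)
             -> IndexOnlyScan on a's index over x
             -> IndexOnlyScan on b's index over y

so the heap gets visited only for rows that survive the join.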

Note that any restriction quals on the original index node or the join node
would have to be moved up to the TidExpand node, if they refer to columns not
available from the index. (This would of course affect the row count and cost
estimates, possibly causing us to decide the transformation is a bad idea.
Note also that the Expand node would fetch the same heap tuple multiple times
if the join is not one-to-one; this could again cause the cost to come out
worse than the regular style.)

The only part of this that we can't cost-estimate very accurately is the
fraction of dead index entries that will fail the time qual; but really we
aren't estimating the costs for regular indexscans very well either, when the
fraction of dead entries is high. We could probably just make this a GUC
parameter with a default setting of 0.1 or so. (NOTE: ANALYZE without VACUUM
could compute info about the current dead-tuple fraction, I think; worth
doing?)
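
In costing terms that could be as simple as this sketch (GUC name
invented, default per the note above):

    /* imagined GUC: assumed fraction of dead index entries */
    double  dead_index_entry_fraction = 0.1;

    /*
     * If the statistics estimate live rows matching the qual, the
     * number of index entries we expect to chase (and hence the heap
     * fetches that yield nothing) is inflated by the dead fraction.
     */
    static double
    expected_index_entries(double est_live_rows)
    {
        return est_live_rows / (1.0 - dead_index_entry_fraction);
    }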

The work required would mostly be to figure out how to construct such plans
efficiently in the planner. Should we try to take already-built join
pathtrees and modify them, or should we stick to the existing bottom-up
approach? In multiple-join cases, there might be many places where the
required TidExpand could be inserted; should we try to cost them all, or
just use a heuristic about where to place TidExpand?

Issue: compatibility with concurrent VACUUM. We can't expect any protection
from index locking when the heap visit is delayed like this. I think this is
okay as long as TidExpand is careful not to assume the CTID is still valid
(it could be pointing at a removed tuple). The CTID could also be pointing at
a tuple slot that has been emptied by VACUUM and then refilled by another
process --- but in that case the new tuple will fail the time qual test and
will be ignored anyway. (Note that this *only* works for a snapshot time
qual, not for SnapshotNow; so we can get away with it in user queries even
though it'd be unsafe in the general case.)

regards, tom lane
