Quick Links

Re: proposal: be smarter about i/o patterns in index scan

Lists:	pgsql-hackers

From:	"Jeffrey W(dot) Baker" <jwbaker(at)acm(dot)org>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	proposal: be smarter about i/o patterns in index scan
Date:	2004-05-18 03:32:17
Message-ID:	1084851137.24232.3.camel@heat
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Greetings all,

We have noticed a way to make a major improvement in Pg's performance on
our workload, and I would like to get your thoughts before I go off to
work up a patch.

The basic problem is that Pg seeks far too much when performing an index
scan. I have saved an strace of a backend which is selecting 170,000
rows from a 240,000,000 row table via index scan. The strace shows
134,000 seeks and reads, or almost one seek for every tuple in the
result set. This would be fine normally except the seeks are in a
*very* suboptimal pattern, swinging back and forth across the device.
The query requires 16 minutes, 30 seconds on our test hardware.

The proposal is to sort the block requests before they are issued.
Because Pg only issues single seek/read pairs synchronously, the kernel
has no chance to apply its elevator and hence every seek/read Pg issues
results in actual movement of the disk head. Pg's random pattern of
seeking also makes the kernel's readahead fairly useless.

To prove the proposal I did a simulation, using the recorded seek
positions in the strace. We noted that Pg issued 403 seek/read pairs
for every 8192 bytes read from the index. So we performed four
simulations: a straight playback of Pg's seek pattern, a playback where
each list of 403 seeks is sorted by byte offset, a playback of all the
seeks sorted by offset, and a sequential scan of the 13GB data file.

PostgreSQL 4.2.1: 16m30s
Sorted in chunks: 10m6s
Sorted altogether: 4m19s
Sequential scan: 6m7s

As you can see, there's a lot to be gained here. If Pg was able to
issue its seeks in the optimal order, the query would return in 1/4th
the time. Even the chunk-sorted scheme is a big win.

So the proposal is this: the offsets stored in the index ought to be
sorted. Either A) they can be stored sorted in the first place (sorted
in VACUUM ANALYZE, or always sorted), or B) the backend can sort each
list of tuples it reads from the index, or C) the backend can read the
entire index result and sort that (this would be the best).

>From just paging through the code, it doesn't seem terribly hard. B
seems the easiest to implement but is also the least improvement. Even
seqscan is better than B. One improvement to B would be to read much
more than 8K from the index. Reading e.g. 256KB would improve things
dramatically. A might be easy but would either degrade over time or
cause slower inserts. C is the best and hardest to implement (I am
guessing, because the size of sorted index subset is unbounded).

Your thoughts are appreciated.

-jwb

From:	Sailesh Krishnamurthy <sailesh(at)cs(dot)berkeley(dot)edu>
To:	"Jeffrey W(dot) Baker" <jwbaker(at)acm(dot)org>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: proposal: be smarter about i/o patterns in index scan
Date:	2004-05-18 04:13:50
Message-ID:	mjqy8nqgyo1.fsf@cs.berkeley.edu
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Yes, fetching a RID list from an index scan, sorting 'em and then
fetching from the table would be a very appreciable speedup for many
queries. I would imagine that all the commercial engines do this (db2
calls it a sorted RID-list-fetch) .. and this has in fact been
discussed on -hackers before.

One issue for this is that there could be a slip between the cup and
the lip .. ie., between the fetch from the index, the sort, and the
fetch from the table a vaccuum could have taken place rendering TIDs
invalid. This should however not be a show-stopper .. surely all
that's needed is a lock against vaccuuming in the presence of a
tid-list-fetch.

--
Pip-pip
Sailesh
http://www.cs.berkeley.edu/~sailesh

From:	"Glen Parker" <glenebob(at)nwlink(dot)com>
To:	<pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: proposal: be smarter about i/o patterns in index scan
Date:	2004-05-18 19:34:06
Message-ID:	AJEKKAIECKNMBCEKADJPGENNCEAA.glenebob@nwlink.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

> The basic problem is that Pg seeks far too much when performing an index
> scan. I have saved an strace of a backend which is selecting 170,000
> rows from a 240,000,000 row table via index scan. The strace shows
> 134,000 seeks and reads, or almost one seek for every tuple in the
> result set. This would be fine normally except the seeks are in a
> *very* suboptimal pattern, swinging back and forth across the device.
> The query requires 16 minutes, 30 seconds on our test hardware.

Yes yes yes *please* fix this :-) This performance bottle neck in PG
effects us all the time.

> The proposal is to sort the block requests before they are issued.
> Because Pg only issues single seek/read pairs synchronously, the kernel
> has no chance to apply its elevator and hence every seek/read Pg issues
> results in actual movement of the disk head. Pg's random pattern of
> seeking also makes the kernel's readahead fairly useless.
>
> To prove the proposal I did a simulation, using the recorded seek
> positions in the strace. We noted that Pg issued 403 seek/read pairs
> for every 8192 bytes read from the index. So we performed four
> simulations: a straight playback of Pg's seek pattern, a playback where
> each list of 403 seeks is sorted by byte offset, a playback of all the
> seeks sorted by offset, and a sequential scan of the 13GB data file.
>
> PostgreSQL 4.2.1: 16m30s
> Sorted in chunks: 10m6s
> Sorted altogether: 4m19s
> Sequential scan: 6m7s
>
> As you can see, there's a lot to be gained here. If Pg was able to
> issue its seeks in the optimal order, the query would return in 1/4th
> the time. Even the chunk-sorted scheme is a big win.

I think your worst case may not be as bad as it gets. Nothing scientific,
but my experience tells me that it's common to get even worse performance
than that. I've had queries that would take seemingly forever, but then
after a fresh cluster and cache flush, the same query would take almost no
time at all.

I also think that with a multi-mirrored RAID setup and your proposed IO
sorting, the mirroring would be more likely to overcome seek overhead. With
the current implementation, it seems almost useless to throw more hardware
at the problem.

I think the improvement would be even 'huger' than your numbers show :-)

> So the proposal is this: the offsets stored in the index ought to be
> sorted. Either A) they can be stored sorted in the first place (sorted
> in VACUUM ANALYZE, or always sorted), or B) the backend can sort each
> list of tuples it reads from the index, or C) the backend can read the
> entire index result and sort that (this would be the best).
>
> >From just paging through the code, it doesn't seem terribly hard. B
> seems the easiest to implement but is also the least improvement. Even
> seqscan is better than B. One improvement to B would be to read much
> more than 8K from the index. Reading e.g. 256KB would improve things
> dramatically. A might be easy but would either degrade over time or
> cause slower inserts. C is the best and hardest to implement (I am
> guessing, because the size of sorted index subset is unbounded).

IMO if you're going to do it, do it right. That means A is pretty much out,
again, IMO. It would seem that B (improved) would be the way to go because,
as you say, C would produce unbounded subsets. I would propose to read very
large sections of the index subset though, more in the order of several MB
(configurable, of course). With that much space to play with, most queries
would require only one pass at the index and therefore show performance
equal to C. Larger queries would suffer a bit, but then we don't usually
have such speedy expectations of large expected result sets, and again, with
enough sort space to play with, the performance could approach that of C.

I would think that you could also sort the index on index order (during
vacuum) to improve index scan performance. This could be a completely
seperate implementation task. With clean sequential index access *and*
clean sequential heap access, it might just prove to be the single largest
performance boost PG has seen, well, ever.

My .02...

Glen Parker

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	"Jeffrey W(dot) Baker" <jwbaker(at)acm(dot)org>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: proposal: be smarter about i/o patterns in index scan
Date:	2004-05-19 14:58:47
Message-ID:	16894.1084978727@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

"Jeffrey W. Baker" <jwbaker(at)acm(dot)org> writes:
> We have noticed a way to make a major improvement in Pg's performance on
> our workload, and I would like to get your thoughts before I go off to
> work up a patch.

For starters, read the previous discussions of this in the archives.

Two questions you should have answers to before starting to implement,
rather than after, are how to deal with locking considerations and
what will be the implications of giving up the property that indexscans
deliver sorted output.

regards, tom lane

From:	Sailesh Krishnamurthy <sailesh(at)cs(dot)berkeley(dot)edu>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	"Jeffrey W(dot) Baker" <jwbaker(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: proposal: be smarter about i/o patterns in index scan
Date:	2004-05-19 16:23:33
Message-ID:	mjq7jv8jshm.fsf@cs.berkeley.edu
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

>>>>> "Tom" == Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

Tom> For starters, read the previous discussions of this in the
Tom> archives.

Tom> Two questions you should have answers to before starting to
Tom> implement, rather than after, are how to deal with locking
Tom> considerations and what will be the implications of giving up
Tom> the property that indexscans deliver sorted output.

I don't know about the former, but as to the latter, we should
certainly have the ability for both output sorted by key and
TID-List-Fetch ...

--
Pip-pip
Sailesh
http://www.cs.berkeley.edu/~sailesh

From:	"Jeffrey W(dot) Baker" <jwbaker(at)acm(dot)org>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: proposal: be smarter about i/o patterns in index scan
Date:	2004-05-19 18:02:29
Message-ID:	1084989749.19249.1.camel@heat
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, 2004-05-19 at 07:58, Tom Lane wrote:
> "Jeffrey W. Baker" <jwbaker(at)acm(dot)org> writes:
> > We have noticed a way to make a major improvement in Pg's performance on
> > our workload, and I would like to get your thoughts before I go off to
> > work up a patch.
>
> For starters, read the previous discussions of this in the archives.

If you are referring to archives.postgresql.org, all I have to say is:

"An error occured! Can not connect to search daemon"

And as far as I have seen, it's been like that for years.

> Two questions you should have answers to before starting to implement,
> rather than after, are how to deal with locking considerations and
> what will be the implications of giving up the property that indexscans
> deliver sorted output.

Are you saying that index scan results are sorted by something other
than the index key? Because in my scheme they would still be sorted by
index key.

-jwb

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	"Jeffrey W(dot) Baker" <jwbaker(at)acm(dot)org>
Cc:	pgsql-hackers(at)postgresql(dot)org, Marc Fournier <scrappy(at)hub(dot)org>
Subject:	search engine down (again)
Date:	2004-05-19 18:34:17
Message-ID:	19171.1084991657@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

"Jeffrey W. Baker" <jwbaker(at)acm(dot)org> writes:
> If you are referring to archives.postgresql.org, all I have to say is:
> "An error occured! Can not connect to search daemon"
> And as far as I have seen, it's been like that for years.

I do seem to be getting that today (Marc?) but it's hardly been "like
that for years".

See also http://www.pgsql.ru/db/pgsearch/ which provides an alternative
search engine.

regards, tom lane

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	"Jeffrey W(dot) Baker" <jwbaker(at)acm(dot)org>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: proposal: be smarter about i/o patterns in index scan
Date:	2004-05-19 18:56:25
Message-ID:	19335.1084992985@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

"Jeffrey W. Baker" <jwbaker(at)acm(dot)org> writes:
> Are you saying that index scan results are sorted by something other
> than the index key? Because in my scheme they would still be sorted by
> index key.

Not unless you add yet another sort step after the fetch step, that is
the idea becomes
1. read index to get candidate TIDs
2. sort TIDs into physical order
3. read heap in physical order, check row visibility
4. sort selected rows into index order

That would start to look like an unpleasant amount of overhead, since
the sort would have to be over the entire output; you couldn't divide
the scan into reasonable-size batches, whereas steps 1-3 can easily
be run in batched style.

Moreover, this'd not be any win compared to just dropping the assumption
of sorted output; the planner could stick in the same sort for itself if
it decided the cost was worth it.

What you probably want to do is support both our traditional indexscan
style and an "unsorted indexscan" plan type that delivers unsorted
output (implementing steps 1-3 above). With appropriate cost estimation
the planner could make an intelligent choice between the two.

I'm too lazy to look in the archives right now, but my private notes
about this say:

We could improve indexscan performance if we were to read a bunch of TIDs from
the index, sort in page number order, and access the heap tuples in page
number order. But probably need a batch size in the thousands of TIDs to get
any really useful effect. This would not work very well given the present
assumption that we hold a pin on the index page until we have fetched the
tuple (cf lazy VACUUM). Pinning lots of index pages is no good for
concurrency or buffer usage. QUESTION: is that assumption really necessary?
Can't we just make the code ignore TIDs that lead to deleted tuples? Worst
case is that we fetch a newly-inserted tuple that does not actually have the
expected index value, but won't we ignore it because of MVCC tqual rules?
(This does NOT work for dirty read or SnapshotNow cases, but we could probably
set things up to only try this optimization for standard snapshot reads.)
An even nicer property is that we could loop inside the index AM to scoop up
tuples, saving locking overhead. This might be worth doing even when we want
to keep returning the tuples in index order.

and there is also something about a longer-range plan involving the same
ideas:

Idea: invent two new plan node types:

* IndexOnlyScan
reads an index, does NOT visit heap. Returns consed-up tuples
that contain the heap ctid and (some of) the index columns.
* TidExpand
Given an input tuplestream containing CTIDs, visit the heap.
Discard input tuple if heap tuple is missing or not visible to
current xact. Otherwise, generate output tuple consisting of
required fields from input tuple and heap tuple.

We'd put IndexOnlyScan below, and TidExpand above, a join node. The
transformation is legal if the join condition does not require a heap column
that's not available from the index. (If both join inputs are indexscanable
then we'd consider plans involving two IndexOnlyScans, a join, and two
TidExpand nodes above the join.) This is a win if the join eliminates many
source tuples from consideration. (Probably it never makes sense for the
outer side of an outer join...) Also, it's not legal for a rel that is on
the nullable side of an outer join, since we might have bogus matches that
would incorrectly prevent a null-extended row from being emitted.

Note that any restriction quals on the original index node or the join node
would have to be moved up to the TidExpand node, if they refer to columns not
available from the index. (This would of course affect the row count and cost
estimates, possibly causing us to decide the transformation is a bad idea.
Note also that the Expand node would fetch the same heap tuple multiple times
if the join is not one-to-one; this could again cause the cost to come out
worse than the regular style.)

The only part of this that we can't cost-estimate very accurately is the
fraction of dead index entries that will fail the time qual; but really we
aren't estimating the costs for regular indexscans very well either, when the
fraction of dead entries is high. We could probably just make this a GUC
parameter with a default setting of 0.1 or so. (NOTE: ANALYZE without VACUUM
could compute info about the current dead-tuple fraction, I think; worth
doing?)

The work required would mostly be to figure out how to construct such plans
efficiently in the planner. Should we try to take already-built join
pathtrees and modify them, or should we stick to the existing bottom-up
approach? In multiple-join cases, there might be many places where the
required TidExpand could be inserted; should we try to cost them all, or
just use a heuristic about where to place TidExpand?

Issue: compatibility with concurrent VACUUM. We can't expect any protection
from index locking when the heap visit is delayed like this. I think this is
okay as long as TidExpand is careful not to assume the CTID is still valid
(it could be pointing at a removed tuple). The CTID could also be pointing at
a tuple slot that has been emptied by VACUUM and then refilled by another
process --- but in that case the new tuple will fail the time qual test and
will be ignored anyway. (Note that this *only* works for a snapshot time
qual, not for SnapshotNow; so we can get away with it in user queries even
though it'd be unsafe in the general case.)

regards, tom lane

From:	"Glen Parker" <glenebob(at)nwlink(dot)com>
To:	<pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: proposal: be smarter about i/o patterns in index scan
Date:	2004-05-19 18:58:50
Message-ID:	AJEKKAIECKNMBCEKADJPOEPKCEAA.glenebob@nwlink.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

> -----Original Message-----
> From: pgsql-hackers-owner(at)postgresql(dot)org
> [mailto:pgsql-hackers-owner(at)postgresql(dot)org]On Behalf Of Tom Lane

> For starters, read the previous discussions of this in the archives.

Search doesn't work. Even if it did, I'm not sure what I would search on.
Do you remember the time frame of the discussion? If I could narrow it down
to a few months, I could possibly find it the slow way. I'd be very
interested to read it.

Anyway, this subject seems to get precious little attention, and I don't
understand why. PG's index scan implementation is so bad at some types of
queries, I find it surprising that there aren't weekly discussions about it,
and developers lined up around the block to work on it. I would be one of
those developers, but frankly the learning curve is pretty steep and I don't
have any smaller complaints to use as learning tools.

What am I missing? Why is a performance bottle neck of this magnitude not
on the same list of priorities as PITR, replication, and Win32?

Here's one answer: If you had to sort every result set, even when an index
could have been used, overall performance would still improve by a very
large margin. I'd bet money on it.

Actually, the index would still have to be scanned in sort order. Only the
fetch order for heap tuples would be sorted differently.

Glen Parker

From:	"Glen Parker" <glenebob(at)nwlink(dot)com>
To:	<pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: proposal: be smarter about i/o patterns in index scan
Date:	2004-05-19 19:36:07
Message-ID:	AJEKKAIECKNMBCEKADJPKEPNCEAA.glenebob@nwlink.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

> -----Original Message-----
> From: pgsql-hackers-owner(at)postgresql(dot)org
> [mailto:pgsql-hackers-owner(at)postgresql(dot)org]On Behalf Of Tom Lane
> Sent: Wednesday, May 19, 2004 11:56 AM

> Not unless you add yet another sort step after the fetch step, that is
> the idea becomes
> 1. read index to get candidate TIDs
> 2. sort TIDs into physical order
> 3. read heap in physical order, check row visibility
> 4. sort selected rows into index order
>
> That would start to look like an unpleasant amount of overhead, since
> the sort would have to be over the entire output; you couldn't divide
> the scan into reasonable-size batches, whereas steps 1-3 can easily
> be run in batched style.

Or:

2. Sort AND COPY TID's into physical order.
3. Read heap in phy. order, match results to un-sorted TID list.

That sounds quite cheap.

Then you get to drop step 4 (which would *usually* be quite a bit more
expensive than a TID sort and copy?).

The cost of the proposed implementation would be higher *when everything is
in the cache*, granted. As a user, I will take that cost 10 times over in
return for such a large improvement in the no-cache situation.

-Glen

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	"Glen Parker" <glenebob(at)nwlink(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: proposal: be smarter about i/o patterns in index scan
Date:	2004-05-19 19:55:17
Message-ID:	19980.1084996517@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

"Glen Parker" <glenebob(at)nwlink(dot)com> writes:
> What am I missing? Why is a performance bottle neck of this magnitude not
> on the same list of priorities as PITR, replication, and Win32?

It's higher on my personal to-do list than most of those ;-). But those
things are getting done because there are other developers with other
priorities. I suspect also that the set of people competent to make
this change is much smaller than the set of people able to work on the
other points. In my mind, most of the issue is in the planner (figuring
out what to do with an unsorted-indexscan option) and relatively few
people have wanted to touch the planner.

> Here's one answer: If you had to sort every result set, even when an index
> could have been used, overall performance would still improve by a very
> large margin. I'd bet money on it.

For a counterexample I refer you to our standard solution for
MAX-using-an-index:

SELECT ... FROM table ORDER BY foo DESC LIMIT 1;

which would become truly spectacularly bad without an ordered index
scan. A more general point is that for any indexscan that returns only
a small number of index entries (eg, any unique-key search) worrying
about physical-order access will be wasted effort. The best you could
hope for is not to be significantly worse than the existing code in such
cases, and I'm unconvinced you could achieve even that.

I can assure you that any patch that completely removes the existing
behavior will be rejected, because there are plenty of cases where it's
the right thing.

The main thing that unordered indexscan could do for us is extend the
usefulness of indexscan plans into relatively-poor-selectivity cases
where we presently tend to drop back to seqscans. There would still be
a selectivity threshold above which you might as well use seqscan, but
it ought to be higher than the fraction-of-a-percent that we currently
see for indexscans. What is unknown, and will be unknown until someone
tries it, is just what range of selectivity this technique might win
for. I think such a range exists; I am not real certain that it is wide
enough to justify a lot of effort in making the idea a reality.

regards, tom lane

From:	"Jeffrey W(dot) Baker" <jwbaker(at)acm(dot)org>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: proposal: be smarter about i/o patterns in index scan
Date:	2004-05-19 19:56:14
Message-ID:	1084996574.19249.8.camel@heat
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, 2004-05-19 at 11:56, Tom Lane wrote:
> "Jeffrey W. Baker" <jwbaker(at)acm(dot)org> writes:
> > Are you saying that index scan results are sorted by something other
> > than the index key? Because in my scheme they would still be sorted by
> > index key.
>
> Not unless you add yet another sort step after the fetch step, that is
> the idea becomes
> 1. read index to get candidate TIDs
> 2. sort TIDs into physical order
> 3. read heap in physical order, check row visibility
> 4. sort selected rows into index order
>
> That would start to look like an unpleasant amount of overhead, since
> the sort would have to be over the entire output; you couldn't divide
> the scan into reasonable-size batches, whereas steps 1-3 can easily
> be run in batched style.
>
> Moreover, this'd not be any win compared to just dropping the assumption
> of sorted output; the planner could stick in the same sort for itself if
> it decided the cost was worth it.
>
> What you probably want to do is support both our traditional indexscan
> style and an "unsorted indexscan" plan type that delivers unsorted
> output (implementing steps 1-3 above). With appropriate cost estimation
> the planner could make an intelligent choice between the two.

> I'm too lazy to look in the archives right now, but my private notes
> about this say:

Thanks for the notes. Introducing a new query execution step sounds
like a better/easier idea than was I was going to do, which was to
rework some of the access methods to operate on vectors instead of
scalars. That idea looks increasingly difficult to implement.

> We could improve indexscan performance if we were to read a bunch of TIDs from
> the index, sort in page number order, and access the heap tuples in page
> number order. But probably need a batch size in the thousands of TIDs to get
> any really useful effect.

Depends on the query, but I got an improvement by half from batches of
400.

> This would not work very well given the present
> assumption that we hold a pin on the index page until we have fetched the
> tuple (cf lazy VACUUM). Pinning lots of index pages is no good for
> concurrency or buffer usage. QUESTION: is that assumption really necessary?
> Can't we just make the code ignore TIDs that lead to deleted tuples? Worst
> case is that we fetch a newly-inserted tuple that does not actually have the
> expected index value, but won't we ignore it because of MVCC tqual rules?
> (This does NOT work for dirty read or SnapshotNow cases, but we could probably
> set things up to only try this optimization for standard snapshot reads.)
> An even nicer property is that we could loop inside the index AM to scoop up
> tuples, saving locking overhead. This might be worth doing even when we want
> to keep returning the tuples in index order.

I think this is *definitely* worth doing. The current implementation in
the vicinity of index_getnext wastes valuable opportunities to optimize
the io and/or to let the kernel do a better job. Descending into
index_getnext -> heap_getnext for every tuple looks expensive in my
application.

Thanks again for your notes I'll revisit the source code and see if I
can come up with a plan.

-jwb

PS Regarding the archive, I've never seen it work. I use MARC instead.

From:	"Jeffrey W(dot) Baker" <jwb(at)gghcwest(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Glen Parker <glenebob(at)nwlink(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: proposal: be smarter about i/o patterns in index scan
Date:	2004-05-19 20:07:40
Message-ID:	1084997260.19249.16.camel@heat
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, 2004-05-19 at 12:55, Tom Lane wrote:
> "Glen Parker" <glenebob(at)nwlink(dot)com> writes:
> > What am I missing? Why is a performance bottle neck of this magnitude not
> > on the same list of priorities as PITR, replication, and Win32?
>
> It's higher on my personal to-do list than most of those ;-). But those
> things are getting done because there are other developers with other
> priorities. I suspect also that the set of people competent to make
> this change is much smaller than the set of people able to work on the
> other points. In my mind, most of the issue is in the planner (figuring
> out what to do with an unsorted-indexscan option) and relatively few
> people have wanted to touch the planner.
>
> > Here's one answer: If you had to sort every result set, even when an index
> > could have been used, overall performance would still improve by a very
> > large margin. I'd bet money on it.
>
> For a counterexample I refer you to our standard solution for
> MAX-using-an-index:
>
> SELECT ... FROM table ORDER BY foo DESC LIMIT 1;

Yes, that would stink with unordered index result.

> which would become truly spectacularly bad without an ordered index
> scan. A more general point is that for any indexscan that returns only
> a small number of index entries (eg, any unique-key search) worrying
> about physical-order access will be wasted effort. The best you could
> hope for is not to be significantly worse than the existing code in such
> cases, and I'm unconvinced you could achieve even that.

I don't think a TID-sorted index scan would be worse than the current
unsorted scan. You will produce a list of 1, sort that, and fetch the
tuple. In the current implementation, you only omit the sort, which is
basically free.

> I can assure you that any patch that completely removes the existing
> behavior will be rejected, because there are plenty of cases where it's
> the right thing.

Okay.

> The main thing that unordered indexscan could do for us is extend the
> usefulness of indexscan plans into relatively-poor-selectivity cases
> where we presently tend to drop back to seqscans. There would still be
> a selectivity threshold above which you might as well use seqscan, but
> it ought to be higher than the fraction-of-a-percent that we currently
> see for indexscans. What is unknown, and will be unknown until someone
> tries it, is just what range of selectivity this technique might win
> for. I think such a range exists; I am not real certain that it is wide
> enough to justify a lot of effort in making the idea a reality.

I don't think selectivity is the main issue. I think the absolute size
of the result is the issue. The current implementation fairly well
garauntees one seek for each tuple of the result. Which means that the
cost to access the result is something on the order of 10ms * rows. So
for any significant result that needs to be fetched from the heap, the
seek cost is going to overwhelm all other costs quickly.

Even though my index is selective, retrieving only 200k rows of 240m row
table, the cost of the sort is justified.

-jwb

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	"Glen Parker" <glenebob(at)nwlink(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: proposal: be smarter about i/o patterns in index scan
Date:	2004-05-19 20:10:05
Message-ID:	20179.1084997405@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

"Glen Parker" <glenebob(at)nwlink(dot)com> writes:
>> Not unless you add yet another sort step after the fetch step, that is
>> the idea becomes
>> 1. read index to get candidate TIDs
>> 2. sort TIDs into physical order
>> 3. read heap in physical order, check row visibility
>> 4. sort selected rows into index order

> Or:
> 2. Sort AND COPY TID's into physical order.
> 3. Read heap in phy. order, match results to un-sorted TID list.
> That sounds quite cheap.

No, it sounds like handwaving. What's your "match results" step, and
does it take less than O(N^2) time? How do you get to *return* the
tuples in index order, without having read them all before you return
the first one? (Which is really what the problem is with the sort
alternative, at least for fast-start cases such as the LIMIT 1 example.)
What happens when you run out of memory for the list of TIDs ... er,
make that two lists of TIDs?

regards, tom lane

From:	"Jeffrey W(dot) Baker" <jwbaker(at)acm(dot)org>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Glen Parker <glenebob(at)nwlink(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: proposal: be smarter about i/o patterns in index scan
Date:	2004-05-19 20:18:15
Message-ID:	1084997895.19249.22.camel@heat
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, 2004-05-19 at 13:10, Tom Lane wrote:
> "Glen Parker" <glenebob(at)nwlink(dot)com> writes:
> >> Not unless you add yet another sort step after the fetch step, that is
> >> the idea becomes
> >> 1. read index to get candidate TIDs
> >> 2. sort TIDs into physical order
> >> 3. read heap in physical order, check row visibility
> >> 4. sort selected rows into index order
>
> > Or:
> > 2. Sort AND COPY TID's into physical order.
> > 3. Read heap in phy. order, match results to un-sorted TID list.
> > That sounds quite cheap.
>
> No, it sounds like handwaving. What's your "match results" step, and
> does it take less than O(N^2) time? How do you get to *return* the
> tuples in index order, without having read them all before you return
> the first one? (Which is really what the problem is with the sort
> alternative, at least for fast-start cases such as the LIMIT 1 example.)
> What happens when you run out of memory for the list of TIDs ... er,
> make that two lists of TIDs?

No doubt, you can't just naively create giant vectors of TIDs and expect
to sort them. Is there any concept in Pg of an unrealized result? If
you scanned an index building up a result set that was totally
unrealized, except for the TID and the index columns, you could cheaply
join two such results without ever touching the heap. You could also
use the existing Sort execution step to sort such a result. Then don't
touch the heap something accesses a non-index column, or because you are
returning the result somewhere and need to satisfy MVCC visibility
limits.

This would lay down the foundation needed by your TIDExpand, and could
also be a useful concept in other areas.

I acknowledge my own significant handwaving on this subject :) From
your point of view everyone is going to be handwaving, because we don't
have your depth of experience with this code.

-jwb

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	"Jeffrey W(dot) Baker" <jwbaker(at)acm(dot)org>
Cc:	Glen Parker <glenebob(at)nwlink(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: proposal: be smarter about i/o patterns in index scan
Date:	2004-05-19 20:31:20
Message-ID:	20357.1084998680@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

"Jeffrey W. Baker" <jwbaker(at)acm(dot)org> writes:
> No doubt, you can't just naively create giant vectors of TIDs and expect
> to sort them. Is there any concept in Pg of an unrealized result?

Only for the case of a partially-read plan result. That is, we do this
for rowsets, but not for scalars or arrays. A lot of the point of the
LIMIT 1 example is that it is exploiting the fact that we won't ever
materialize the full output of the indexscan.

> If you scanned an index building up a result set that was totally
> unrealized, except for the TID and the index columns, you could
> cheaply join two such results without ever touching the heap. You
> could also use the existing Sort execution step to sort such a result.
> Then don't touch the heap something accesses a non-index column, or
> because you are returning the result somewhere and need to satisfy
> MVCC visibility limits.

This is basically what I was talking about with IndexOnlyScan/TidExpand.

regards, tom lane

From:	Greg Stark <gsstark(at)mit(dot)edu>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: proposal: be smarter about i/o patterns in index scan
Date:	2004-05-19 20:54:33
Message-ID:	87brkkrvcm.fsf@stark.xeocode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

> The main thing that unordered indexscan could do for us is extend the
> usefulness of indexscan plans into relatively-poor-selectivity cases
> where we presently tend to drop back to seqscans.

Well I'm sure Tom remembers this since he's mentioned it in the past. But for
the sake of completeness I'll remind others of the other big benefit a
heap-ordered indexscan could provide: Merging multiple index scans.

If you have something like

select * from a where x>? and (y=? or z=?)

and you have separate indexes on x, y, and z, then having a heap-ordered index
scan would allow the optimizer to do three independent scans of the three
indexes and only have to fetch the tuples that fit the entire boolean clause
from the heap.

In the case of expressions where any individual column is very non-selective
but the combination is very selective the result can be a big improvement.

(Actually in the special case of x=? and y=? and z=? you could do the same
without any special sorting step if the index tuples were kept in heap order
within each index key. I think that would be an interesting optimization to
try too)

--
greg

From:	"Glen Parker" <glenebob(at)nwlink(dot)com>
To:	<pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: proposal: be smarter about i/o patterns in index scan
Date:	2004-05-19 21:06:44
Message-ID:	AJEKKAIECKNMBCEKADJPKEACCFAA.glenebob@nwlink.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

> -----Original Message-----
> From: pgsql-hackers-owner(at)postgresql(dot)org
> [mailto:pgsql-hackers-owner(at)postgresql(dot)org]On Behalf Of Tom Lane

> > Or:
> > 2. Sort AND COPY TID's into physical order.
> > 3. Read heap in phy. order, match results to un-sorted TID list.
> > That sounds quite cheap.
>
> No, it sounds like handwaving. What's your "match results" step, and

You got me :-)

> does it take less than O(N^2) time? How do you get to *return* the

Less than O(N^2)??? Geez I think so!

> tuples in index order, without having read them all before you return
> the first one? (Which is really what the problem is with the sort
> alternative, at least for fast-start cases such as the LIMIT 1 example.)

(you'll have to pardon me for thinking as I type... and being to long winded
:-)

It takes O(1) time.

If you have a hard maximum of TID's you'll grab from the index (which sounds
right), you could store the TID list in a vector. The IO-order-sorted list
could be a singly-linked-list OR a vector, but either way, each entry would
have to point back to an offset in the index-ordered list.

The index-ordered list would be (sizeof(TID) * max_TID_count) bytes, and the
IO-ordered list would be (sizeof(TID) + sizeof(int) * max_TID_count) bytes.
If a TID is 8 bytes and we're going to fetch 2048 index entries per
iteration, that gets us (*counting on fingers*) 16K (for index-ordered list)
+ 24K for the IO-ordered list.

The index-ordered list can also be a singly-linked list, in which case the
mapping would be by pointer. If both lists are single-linked lists, then
the size overhead (for a fully realized 2048 entry scan) would be:
((sizeof(TID) + sizeof(void*)) * max_TID_count) + ((sizeof(TID) +
sizeof(void*) + sizeof(void*)) * max_TID_count) = (*more counting on
fingers*) ... 24K + 32K = 56K.

Hmm, the vector style would be crazy expensive for small queries. The
linked-list approach sounds pretty good though. I assume no one thinks
doing this would be completely free :-) And, if you did the IO-ordered list
as a vector, you'd save the linked-list overhead, and you would know exactly
how big to make it, so it wouldn't have to hurt you on small queries.

2048 entries seems pretty darn generous actually.

> How do you get to *return* the
> tuples in index order, without having read them all before you return

Hmm, worsed case scenario would be to scan the heap in exactly-reverse order
of index order. With a 2048 entry batch size, you'd have a fair amount of
overhead on a LIMIT 1 :-)

Unless the index scanner could be given a maximum tuple count, rather than
just 'knowing' it from a GUC value. Would this dirty up the plan tree
beyond all recognition?

> What happens when you run out of memory for the list of TIDs ... er,
> make that two lists of TIDs?

Now you really got me. Hand waving: the same thing you do when you run out
of memory anywhere else :-) By configuring the server to fetch index
entries in 1-entry batches, you'd be right back to the overhead (or very
nearly so) of the current implementation, so it would only hurt people who
chose to be hurt by it :-)

-Glen

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	"Jeffrey W(dot) Baker" <jwbaker(at)acm(dot)org>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: proposal: be smarter about i/o patterns in index scan
Date:	2004-05-19 23:09:16
Message-ID:	22096.1085008156@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

"Jeffrey W. Baker" <jwbaker(at)acm(dot)org> writes:
> ... Introducing a new query execution step sounds
> like a better/easier idea than was I was going to do, which was to
> rework some of the access methods to operate on vectors instead of
> scalars. That idea looks increasingly difficult to implement.

One thing that I think is worth doing in any case is to alter the API
for the index AMs' getnext functions so that they can return multiple
TIDs per call (probably into an array supplied by the caller). We could
for example return all matching tuples from a single index page in one
call, and then let index_getnext iterate through them without re-calling
the index AM. (Making it work per-index-page would allow us to keep the
current VACUUM interlocking conventions in place exactly, so that there
would be no risk of breaking anything. All tuples returned in a given
call would still be protected by an index-page lock.)

We'd have to make such an API change anyway to support unordered
indexscan, but it should be a benefit even for ordered scans, because
it should considerably reduce the locking overhead for indexscans that
fetch multiple tuples. In particular it might alleviate the BufMgrLock
contention problems that were under discussion last month. (The test
case we were using to illustrate that problem fetched several hundred
tuples per indexscan, so it clearly could benefit. Extent of benefit
unknown at this time, though.)

The tricky part of this is figuring out how to handle mark/restore and
scan direction switching in a way that doesn't complicate the code to
the point of unmaintainability. I think it may be possible to keep all
the extra complication within indexam.c, but haven't thought through the
details.

This seems like a relatively localized change, and might be a good place
for you to start.

regards, tom lane

From:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	"Jeffrey W(dot) Baker" <jwbaker(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: proposal: be smarter about i/o patterns in index scan
Date:	2004-06-09 21:04:16
Message-ID:	200406092104.i59L4GI15806@candle.pha.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Is there a TODO here?

---------------------------------------------------------------------------

Tom Lane wrote:
> "Jeffrey W. Baker" <jwbaker(at)acm(dot)org> writes:
> > ... Introducing a new query execution step sounds
> > like a better/easier idea than was I was going to do, which was to
> > rework some of the access methods to operate on vectors instead of
> > scalars. That idea looks increasingly difficult to implement.
>
> One thing that I think is worth doing in any case is to alter the API
> for the index AMs' getnext functions so that they can return multiple
> TIDs per call (probably into an array supplied by the caller). We could
> for example return all matching tuples from a single index page in one
> call, and then let index_getnext iterate through them without re-calling
> the index AM. (Making it work per-index-page would allow us to keep the
> current VACUUM interlocking conventions in place exactly, so that there
> would be no risk of breaking anything. All tuples returned in a given
> call would still be protected by an index-page lock.)
>
> We'd have to make such an API change anyway to support unordered
> indexscan, but it should be a benefit even for ordered scans, because
> it should considerably reduce the locking overhead for indexscans that
> fetch multiple tuples. In particular it might alleviate the BufMgrLock
> contention problems that were under discussion last month. (The test
> case we were using to illustrate that problem fetched several hundred
> tuples per indexscan, so it clearly could benefit. Extent of benefit
> unknown at this time, though.)
>
> The tricky part of this is figuring out how to handle mark/restore and
> scan direction switching in a way that doesn't complicate the code to
> the point of unmaintainability. I think it may be possible to keep all
> the extra complication within indexam.c, but haven't thought through the
> details.
>
> This seems like a relatively localized change, and might be a good place
> for you to start.
>
> regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 3: if posting/reading through Usenet, please send an appropriate
> subscribe-nomail command to majordomo(at)postgresql(dot)org so that your
> message can get through to the mailing list cleanly
>

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073