Page at a time index scan

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: pgsql-patches(at)postgresql(dot)org
Subject: Page at a time index scan
Date: 2006-05-01 19:00:11
Message-ID: Pine.OSF.4.61.0605012127550.183753@kosh.hut.fi
Lists: pgsql-patches

Here's a patch that implements page at a time index scans discussed at
pgsql-hackers earlier. See proposal 1 at:
http://archives.postgresql.org/pgsql-hackers/2006-03/msg01237.php

It passes the regression tests, and there are no known bugs. There are
some minor issues I'd like to point out, though:

1. An index scan now needs to allocate enough memory to hold potentially a
whole page worth of items. And if you use markpos/restrpos, twice that
much. I don't know if that's an issue, but I thought I'd bring that up.

2. Vacuum is now done in one phase, scanning the index in physical order.
That significantly speeds up index vacuums of large indexes that don't fit
into memory. However, btbulkdelete doesn't know if the vacuum is a full or
lazy one. The patch just assumes it's a lazy vacuum, but the API really
needs to be changed to pass that information earlier than at vacuum_cleanup.

3. Before the patch, a scan would keep the current page pinned to keep
vacuum from deleting the current item. The patch doesn't change that
behaviour, but it now seems to me that even a pin is no longer needed.

The patch needs testing and review, to ensure it doesn't break anything,
and to see the effect on performance. It doesn't change disk layout or
catalogs, so you can run it using the same data directory as with the
unpatched version.

- Heikki


From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: pgsql-patches(at)postgresql(dot)org
Subject: Re: Page at a time index scan
Date: 2006-05-01 19:02:54
Message-ID: Pine.OSF.4.61.0605012202050.183753@kosh.hut.fi
Lists: pgsql-patches

As usual, I forgot the attachment. Here you go.

On Mon, 1 May 2006, Heikki Linnakangas wrote:

> Here's a patch that implements page at a time index scans discussed at
> pgsql-hackers earlier. See proposal 1 at:
> http://archives.postgresql.org/pgsql-hackers/2006-03/msg01237.php
>
> It passes the regression tests, and there are no known bugs. There are some
> minor issues I'd like to point out, though:
>
> 1. An index scan now needs to allocate enough memory to hold potentially a
> whole page worth of items. And if you use markpos/restrpos, twice that much.
> I don't know if that's an issue, but I thought I'd bring that up.
>
> 2. Vacuum is now done in one phase, scanning the index in physical order.
> That significantly speeds up index vacuums of large indexes that don't fit
> into memory. However, btbulkdelete doesn't know if the vacuum is a full or
> lazy one. The patch just assumes it's a lazy vacuum, but the API really needs
> to be changed to pass that information earlier than at vacuum_cleanup.
>
> 3. Before the patch, a scan would keep the current page pinned to keep vacuum
> from deleting the current item. The patch doesn't change that behaviour, but
> it now seems to me that even a pin is no longer needed.
>
> The patch needs testing and review, to ensure it doesn't break anything, and
> to see the effect on performance. It doesn't change disk layout or catalogs,
> so you can run it using the same data directory as with the unpatched
> version.
>
> - Heikki
>

- Heikki

Attachment Content-Type Size
indvac-20060501.diff text/plain 62.6 KB

From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: pgsql-patches(at)postgresql(dot)org
Subject: Re: Page at a time index scan
Date: 2006-05-02 13:44:04
Message-ID: 1146577444.9599.450.camel@localhost.localdomain
Lists: pgsql-patches

On Mon, 2006-05-01 at 22:00 +0300, Heikki Linnakangas wrote:
> Here's a patch that implements page at a time index scans discussed at
> pgsql-hackers earlier. See proposal 1 at:
> http://archives.postgresql.org/pgsql-hackers/2006-03/msg01237.php
>
> It passes the regression tests, and there are no known bugs. There are
> some minor issues I'd like to point out, though:
>
> 1. An index scan now needs to allocate enough memory to hold potentially a
> whole page worth of items. And if you use markpos/restrpos, twice that
> much. I don't know if that's an issue, but I thought I'd bring that up.

AFAICS the code:

- allocates memory for the markpos whether or not it's ever needed? Most
index scans never call markpos and not all merge joins either, so that
seems wasteful. We could allocate when btmarkpos() is called for the
first time, if ever.

- allocates 1024 offsets in every case. If this were just a unique index
retrieval, that would be too much. When scan->is_multiscan == true, go
straight for 1024, otherwise start with just 32 offsets and double that
when/if required.
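
Something like this, perhaps (just a rough sketch; the struct and function
names are invented for illustration, they're not from the patch):

#define INITIAL_SCAN_ITEMS 32

typedef struct ScanItemArray
{
    OffsetNumber *items;    /* offsets remembered from the current page */
    int           nitems;   /* offsets currently stored */
    int           maxitems; /* allocated length of items[] */
} ScanItemArray;

static void
scan_item_array_add(ScanItemArray *a, OffsetNumber off)
{
    if (a->items == NULL)
    {
        /* first use: start small */
        a->maxitems = INITIAL_SCAN_ITEMS;
        a->items = palloc(a->maxitems * sizeof(OffsetNumber));
    }
    else if (a->nitems >= a->maxitems)
    {
        /* double on demand instead of allocating 1024 slots up front */
        a->maxitems *= 2;
        a->items = repalloc(a->items, a->maxitems * sizeof(OffsetNumber));
    }
    a->items[a->nitems++] = off;
}

The markpos copy could be handled the same lazy way: leave its array NULL
until btmarkpos() actually runs.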

> 2. Vacuum is now done in one phase, scanning the index in physical order.
> That significantly speeds up index vacuums of large indexes that don't fit
> into memory.

Also for those that *aren't in* memory. Should speed up medium-sized
VACUUMs also because of sequential disk access.

> However, btbulkdelete doesn't know if the vacuum is a full or
> lazy one. The patch just assumes it's a lazy vacuum, but the API really
> needs to be changed to pass that information earlier than at vacuum_cleanup.

Looks like it needs work. Do you have suggestions while you're there?

> 3. Before the patch, a scan would keep the current page pinned to keep
> vacuum from deleting the current item. The patch doesn't change that
> behaviour, but it now seems to me that even a pin is no longer needed.

Agreed. The pin has two functions:
- keep the page from being moved out of the bufmgr - no need anymore
- stop a vacuum from removing the page - no need anymore. We'll not stop
on a removable row anymore, so no need.

Some of the code doesn't use standard spacing e.g. "if(" should be "if
(", but other than that it looks very neat and well implemented.

Overall, I'm optimistic that this patch will help in a number of ways.
Speeding up a VACUUM index scan is a primary objective, and it looks like
that will work well. Also, this looks like it will reduce LWLocking
overhead and encourage sequential memory scans of blocks, both of which
will improve index scan performance. It should also reduce buffer
residency time, making shared_buffers more fluid. So, subject to
performance testing, I'm very interested in this patch.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com


From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: pgsql-patches(at)postgresql(dot)org
Subject: Re: Page at a time index scan
Date: 2006-05-02 15:40:45
Message-ID: Pine.OSF.4.61.0605021750180.463955@kosh.hut.fi
Lists: pgsql-patches

On Tue, 2 May 2006, Simon Riggs wrote:

> On Mon, 2006-05-01 at 22:00 +0300, Heikki Linnakangas wrote:
>>
>> 1. An index scan now needs to allocate enough memory to hold potentially a
>> whole page worth of items. And if you use markpos/restrpos, twice that
>> much. I don't know if that's an issue, but I thought I'd bring that up.
>
> AFAICS the code:
>
> - allocates memory for the markpos whether or not it's ever needed? Most
> index scans never call markpos and not all merge joins either, so that
> seems wasteful. We could allocate when btmarkpos() is called for the
> first time, if ever.

Right. I'll do that.

> - allocates 1024 offsets in every case. If this were just a unique index
> retrieval, that would be too much. When scan->is_multiscan == true, go
> straight for 1024, otherwise start with just 32 offsets and double that
> when/if required.

I wonder if that gets a bit too complicated for saving a small amount of
memory? I'll do it if people think it's worth it.

Also, could we calculate a better estimate of the maximum number of
offsets an index page can hold? There's no way a page can really hold
1024 items. Page headers and special space, line pointers, and the actual
keys need some space.

>> However, btbulkdelete doesn't know if the vacuum is a full or
>> lazy one. The patch just assumes it's a lazy vacuum, but the API really
>> needs to be changed to pass that information earlier than at vacuum_cleanup.
>
> Looks like it needs work. Do you have suggestions while you're there?

Now that I look at it: Why do we have a separate vacuum_cleanup function
at all? Calls to index_vacuum_cleanup go hand in hand with calls to
index_bulk_delete. I thought that index_vacuum_cleanup would only be
called after the last cycle of a multi-cycle vacuum, but that doesn't seem
to be the case.

>> 3. Before the patch, a scan would keep the current page pinned to keep
>> vacuum from deleting the current item. The patch doesn't change that
>> behaviour, but it now seems to me that even a pin is no longer needed.
>
> Agreed. The pin has two functions:
> - keep the page from being moved out of the bufmgr - no need anymore
> - stop a vacuum from removing the page - no need anymore. We'll not stop
> on a removable row anymore, so no need.

At the moment, backward scan returns to the page to walk left from there.
It could be changed to take a copy of the left sibling pointer
like forward scans do, eliminating the need to return to the original
page in most cases.

Also, if there are dead tuples on the page, the scan will need to return
to the page to kill them.

But you can always re-pin the page when needed.

> Some of the code doesn't use standard spacing e.g. "if(" should be "if
> (", but other than that it looks very neat and well implemented.

Thanks, I'll fix the spacings.

- Heikki


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-patches(at)postgresql(dot)org
Subject: Re: Page at a time index scan
Date: 2006-05-02 16:04:21
Message-ID: 8949.1146585861@sss.pgh.pa.us
Lists: pgsql-patches

Heikki Linnakangas <hlinnaka(at)iki(dot)fi> writes:
> Also, could we calculate a better estimate of the maximum number of
> offsets an index page can hold?

We could make something analogous to MaxHeapTuplesPerPage --- the
correct number ought to be approximately BLCKSZ/16 I should think.
(It's not possible for an entry to be *just* the header, there has
to be either a datum or a null bitmap. Hence, with maxalign padding,
at least 12 bytes for item, plus 4 for item pointer.)
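
For instance, something like this (a sketch; the exact name and placement
would be up for discussion):

/*
 * Rough analogue of MaxHeapTuplesPerPage for index pages: assume each
 * entry needs at least a maxaligned IndexTupleData header plus one byte
 * of data, and a line pointer.
 */
#define MaxIndexTuplesPerPage \
    ((int) ((BLCKSZ - SizeOfPageHeaderData) / \
            (MAXALIGN(sizeof(IndexTupleData) + 1) + sizeof(ItemIdData))))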

> Now that I look at it: Why do we have a separate vacuum_cleanup function
> at all? Calls to index_vacuum_cleanup go hand in hand with calls to
> index_bulk_delete.

Yeah, I was just thinking we ought to revisit that. The original idea
was that vacuumcleanup would be called just once at the end of vacuuming,
not once per bulkdelete. I don't recall why I changed it but it was
probably a bad idea to do so.

>> Agreed. The pin has two functions:
>> - keep the page from being moved out of the bufmgr - no need anymore
>> - stop a vacuum from removing the page - no need anymore. We'll not stop
>> on a removable row anymore, so no need.

> At the moment, backward scan returns to the page to walk left from there.

Backwards scan may break this whole concept; are you sure you've thought
it through?

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: pgsql-patches(at)postgresql(dot)org
Subject: Re: Page at a time index scan
Date: 2006-05-02 16:48:09
Message-ID: 15644.1146588489@sss.pgh.pa.us
Lists: pgsql-patches

Heikki Linnakangas <hlinnaka(at)iki(dot)fi> writes:
>> Here's a patch that implements page at a time index scans discussed at
>> pgsql-hackers earlier. See proposal 1 at:
>> http://archives.postgresql.org/pgsql-hackers/2006-03/msg01237.php

One potential performance lossage from this is that it partially defeats
the keys_are_unique optimization: bt_checkkeys will be run across all
the matching tuples on the index page even if the waiting caller is
going to stop after the first live one. (I don't see any way to avoid
that without breaking the entire concept, since we can't know which of
the index entries the caller will think is live.)

I suspect this is not a deal-breaker, but we have to test to make sure
that case isn't getting markedly worse. The thing to look at would be
unique indexes with expensive comparison functions (eg, text in a
non-C locale).

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-patches(at)postgresql(dot)org
Subject: Re: Page at a time index scan
Date: 2006-05-02 17:05:04
Message-ID: 18705.1146589504@sss.pgh.pa.us
Lists: pgsql-patches

I wrote:
> Heikki Linnakangas <hlinnaka(at)iki(dot)fi> writes:
>> Now that I look at it: Why do we have a separate vacuum_cleanup function
>> at all? Calls to index_vacuum_cleanup go hand in hand with calls to
>> index_bulk_delete.

> Yeah, I was just thinking we ought to revisit that. The original idea
> was that vacuumcleanup would be called just once at the end of vacuuming,
> not once per bulkdelete. I don't recall why I changed it but it was
> probably a bad idea to do so.

I remember why: the design involves passing a palloc'd struct from
bulkdelete to vacuumcleanup and there needed to be matching calls to
make that behave sanely.

We could fix this if the API is something like "the first bulkdelete
call palloc's the struct, and *it's passed in to* each subsequent
bulkdelete, as well as being passed to vacuumcleanup". So we're short
one argument to bulkdelete. This would provide a saner way of dealing
with the case of nothing to delete, too: if vacuumcleanup gets a NULL
stats pointer, that means bulkdelete wasn't ever called (or was, but
chose never to return a non-null pointer).
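
In other words, the vacuum side would call the index AM roughly like this
(a sketch of the intended pattern only; "info" is the input-arguments
struct and the loop condition is just a placeholder):

    IndexBulkDeleteResult *stats = NULL;

    /* one bulkdelete call per pass over the heap, passing the struct along */
    while (more_dead_tuples_to_collect())       /* placeholder */
        stats = index_bulk_delete(info, stats, callback, callback_state);

    /* always called once at the end; stats may still be NULL here */
    stats = index_vacuum_cleanup(info, stats);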

Also, as noted in other contexts, it'd be a good idea if vacuumcleanup
was told the total number of heap tuples (GIN needs this), and both
steps really ought to be able to find out if it's a full or lazy vacuum.

I'll work on making these API changes; the recent GIN patch has left
some other detritus that needs to be cleaned up in the same area.

regards, tom lane


From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-patches(at)postgresql(dot)org
Subject: Re: Page at a time index scan
Date: 2006-05-02 19:02:59
Message-ID: Pine.OSF.4.61.0605022150090.463955@kosh.hut.fi
Lists: pgsql-patches

On Tue, 2 May 2006, Tom Lane wrote:

> Also, as noted in other contexts, it'd be a good idea if vacuumcleanup
> was told the total number of heap tuples (GIN needs this), and both
> steps really ought to be able to find out if it's a full or lazy vacuum.

It's already in IndexVacuumCleanupInfo, isn't it?

/* Struct for additional arguments passed to vacuum-cleanup operation */
typedef struct IndexVacuumCleanupInfo
{
    bool    vacuum_full;        /* VACUUM FULL (we have exclusive lock) */
    int     message_level;      /* ereport level for progress messages */
--> double  num_heap_tuples;    /* tuples remaining in heap */
} IndexVacuumCleanupInfo;

gistvacuumcleanup uses num_heap_tuples to set num_index_tuples
when it doesn't need to scan the index otherwise.

BTW: Is it possible to have a partial gist index? If it is,
num_index_tuples = num_heap_tuples isn't right.

- Heikki


From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-patches(at)postgresql(dot)org
Subject: Re: Page at a time index scan
Date: 2006-05-02 19:21:09
Message-ID: Pine.OSF.4.61.0605021946560.463955@kosh.hut.fi
Lists: pgsql-patches

On Tue, 2 May 2006, Tom Lane wrote:

>>> Agreed. The pin has two functions:
>>> - keep the page from being moved out of the bufmgr - no need anymore
>>> - stop a vacuum from removing the page - no need anymore. We'll not stop
>>> on a removable row anymore, so no need.
>
>> At the moment, backward scan returns to the page to walk left from there.
>
> Backwards scan may break this whole concept; are you sure you've thought
> it through?

I think so. The patch doesn't change the walk-left code. Do you have
something specific in mind?

- Heikki


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-patches(at)postgresql(dot)org
Subject: Re: Page at a time index scan
Date: 2006-05-02 19:31:05
Message-ID: 26433.1146598265@sss.pgh.pa.us
Lists: pgsql-patches

Heikki Linnakangas <hlinnaka(at)iki(dot)fi> writes:
> On Tue, 2 May 2006, Tom Lane wrote:
>> Also, as noted in other contexts, it'd be a good idea if vacuumcleanup
>> was told the total number of heap tuples (GIN needs this), and both
>> steps really ought to be able to find out if it's a full or lazy vacuum.

> It's already in IndexVacuumCleanupInfo, isn't it?

Yeah, I had forgotten that, but just noticed it again now. The patch
I'm working on at the moment defines

/*
 * Struct for input arguments passed to ambulkdelete and amvacuumcleanup
 *
 * Note that num_heap_tuples will not be valid during ambulkdelete,
 * only amvacuumcleanup.
 */
typedef struct IndexVacuumInfo
{
    Relation    index;              /* the index being vacuumed */
    bool        vacuum_full;        /* VACUUM FULL (we have exclusive lock) */
    int         message_level;      /* ereport level for progress messages */
    double      num_heap_tuples;    /* tuples remaining in heap */
} IndexVacuumInfo;

with

IndexBulkDeleteResult *
ambulkdelete (IndexVacuumInfo *info,
              IndexBulkDeleteResult *stats,
              IndexBulkDeleteCallback callback,
              void *callback_state);

Because of limited <varname>maintenance_work_mem</>,
<function>ambulkdelete</> may need to be called more than once when many
tuples are to be deleted. The <literal>stats</> argument is the result
of the previous call for this index (it is NULL for the first call within a
<command>VACUUM</> operation). This allows the AM to accumulate statistics
across the whole operation. Typically, <function>ambulkdelete</> will
modify and return the same struct if the passed <literal>stats</> is not
null.

IndexBulkDeleteResult *
amvacuumcleanup (IndexVacuumInfo *info,
                 IndexBulkDeleteResult *stats);

Clean up after a <command>VACUUM</command> operation (zero or more
<function>ambulkdelete</> calls). This does not have to do anything
beyond returning index statistics, but it may perform bulk cleanup
such as reclaiming empty index pages. <literal>stats</> is whatever the
last <function>ambulkdelete</> call returned, or NULL if
<function>ambulkdelete</> was not called because no tuples needed to be
deleted. If the result is not NULL it must be a palloc'd struct.
The statistics it contains will be reported by <command>VACUUM</> if
<literal>VERBOSE</> is given.
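
So an AM's cleanup routine would do something like this (a sketch at the
level of the logical API above; the real entry points of course go through
the fmgr interface):

IndexBulkDeleteResult *
amvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
{
    /* if ambulkdelete was never called, we must create the stats struct */
    if (stats == NULL)
        stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));

    /* ... reclaim empty pages, fill in stats->num_pages and friends ... */

    return stats;               /* must be a palloc'd struct */
}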

> BTW: Is it possible to have a partial gist index? If it is,
> num_index_tuples = num_heap_tuples isn't right.

It is, and it isn't ;-). We'll need to see about fixing that.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-patches(at)postgresql(dot)org
Subject: Re: Page at a time index scan
Date: 2006-05-02 19:35:08
Message-ID: 26483.1146598508@sss.pgh.pa.us
Lists: pgsql-patches

Heikki Linnakangas <hlinnaka(at)iki(dot)fi> writes:
> On Tue, 2 May 2006, Tom Lane wrote:
>> Backwards scan may break this whole concept; are you sure you've thought
>> it through?

> I think so. The patch doesn't change the walk-left code. Do you have
> something specific in mind?

I'm worried about synchronization, particularly what happens if the page
gets deleted from under you while you don't have it pinned.

regards, tom lane


From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-patches(at)postgresql(dot)org
Subject: Re: Page at a time index scan
Date: 2006-05-03 07:28:24
Message-ID: Pine.OSF.4.61.0605031026060.244444@kosh.hut.fi
Lists: pgsql-patches

On Tue, 2 May 2006, Tom Lane wrote:

> Heikki Linnakangas <hlinnaka(at)iki(dot)fi> writes:
>> On Tue, 2 May 2006, Tom Lane wrote:
>>> Backwards scan may break this whole concept; are you sure you've thought
>>> it through?
>
>> I think so. The patch doesn't change the walk-left code. Do you have
>> something specific in mind?
>
> I'm worried about synchronization, particularly what happens if the page
> gets deleted from under you while you don't have it pinned.

AFAICS, shouldn't happen. The half-dead system makes sure that a page
won't get deleted while a scan might still be interested in it. It
doesn't depend on pins.

- Heikki


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-patches(at)postgresql(dot)org
Subject: Re: Page at a time index scan
Date: 2006-05-03 09:14:38
Message-ID: 1146647678.449.42.camel@localhost.localdomain
Lists: pgsql-patches

On Tue, 2006-05-02 at 15:35 -0400, Tom Lane wrote:
> Heikki Linnakangas <hlinnaka(at)iki(dot)fi> writes:
> > On Tue, 2 May 2006, Tom Lane wrote:
> >> Backwards scan may break this whole concept; are you sure you've thought
> >> it through?
>
> > I think so. The patch doesn't change the walk-left code. Do you have
> > something specific in mind?
>
> I'm worried about synchronization, particularly what happens if the page
> gets deleted from under you while you don't have it pinned.

Perhaps I should update my comments on "we don't need a pin at all"...

On a forward scan we need to pin while we are reading a whole page,
though we can release the pin afterwards. We don't need to keep the pin
while servicing btgetnext() requests from our private page buffer
though. (Which is what I meant to say.)

AFAICS we will need to return to the page for a backward scan, so we
could just keep the pin the whole way. It's not possible to cache the
left page pointer because block splits to our immediate left can update
it even after we read the page contents. (A forward scan need never
fear page splits in the same way because existing items can't move past
the existing page boundary).

We need never return to a page that *could* be deleted. While scanning
in either direction, if the complete page contains nothing but dead
items we can simply move straight on to the next page, having updated the
page status to half-dead. (The great thing about this patch is that we
should be able to report that somehow, so an asynchronous task handler can
come and clean that page (only), now that we don't have a restriction on
individual page vacuuming. We can think about how to do that later.)

If only some of the index tuples are deleted, we should only return to
the page to update the deleted index tuples *if*:
- the page is still in the buffer pool. If it's been evicted, it's because
space is tight, so we shouldn't call it back just to dirty the page.
- we have a minimum threshold of deleted tuples. Otherwise we might
re-dirty the page for just a single hint bit, so we end up writing the
page out hundreds of times. (Guess: that should be 2 or 3)

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-patches(at)postgresql(dot)org
Subject: Re: Page at a time index scan
Date: 2006-05-03 14:17:43
Message-ID: 16108.1146665863@sss.pgh.pa.us
Lists: pgsql-patches

Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> AFAICS we will need to return to the page for a backward scan, so we
> could just keep the pin the whole way. It's not possible to cache the
> left page pointer because block splits to our immediate left can update
> them even after we read the page contents.

Sure we can cache the left pointer: it's not any different from the
normal walk-left problem, because as soon as you drop pin on the page
you are leaving, all the same hazards apply. This algorithm just has a
slightly wider window between dropping one pin and acquiring the next.
Walk-left is painful and (rarely) expensive, but it's a solved problem.

> We need never return to a page that *could* be deleted. While scanning
> in either direction, if the complete page contains nothing but dead
> items we can simply move straight onto the next page, having updated the
> page status to half-dead.

This is unnecessary and probably wrong.

> - we have a minimum threshold of deleted tuples. Otherwise we might
> re-dirty the page for just a single hint bit, so we end up writing the
> page out hundreds of times. (Guess: that should be 2 or 3)

You are optimizing the wrong thing here. If we choose not to mark an
entry dead then we will pay for that omission on every future scan of
the same entry. I don't think that outweighs the (doubtless rare)
situation where we expend an extra page fetch to reload the page.

It's worth noting that all of this stuff is predicated on the assumption
that index items never move across pre-existing page boundaries, in
either direction. We are therefore going to be permanently giving up
any prospect of index space reclamation by merging partly-filled pages
(unless maybe in VACUUM FULL). We didn't know how to do that anyway,
so I don't feel too bad about it, but if indexscans don't make any
attempt to explicitly re-locate their positions then that certainly
goes out the window.

regards, tom lane


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-patches(at)postgresql(dot)org
Subject: Re: Page at a time index scan
Date: 2006-05-03 14:45:15
Message-ID: 1146667515.449.106.camel@localhost.localdomain
Lists: pgsql-patches

On Tue, 2006-05-02 at 15:35 -0400, Tom Lane wrote:
> > > I'm worried about synchronization, particularly what happens if the page
> > > gets deleted from under you while you don't have it pinned.

On Wed, 2006-05-03 at 10:17 -0400, Tom Lane wrote:
> Simon Riggs <simon(at)2ndquadrant(dot)com> writes:

> > We need never return to a page that *could* be deleted. While scanning
> > in either direction, if the complete page contains nothing but dead
> > items we can simply move straight onto the next page, having updated the
> > page status to half-dead.
>
> This is unnecessary and probably wrong.

You'll need to be more specific about what you mean. Heikki's concurrent
post says roughly the same thing as what I just said, AFAICS.

Do you see a problem with page deletion? If so, where?

> It's worth noting that all of this stuff is predicated on the assumption
> that index items never move across pre-existing page boundaries, in
> either direction. We are therefore going to be permanently giving up
> any prospect of index space reclamation by merging partly-filled pages
> (unless maybe in VACUUM FULL). We didn't know how to do that anyway,
> so I don't feel too bad about it, but if indexscans don't make any
> attempt to explicitly re-locate their positions then that certainly
> goes out the window.

Seems like a step forwards to me, even if there is still a wish to go
further; we've all been trying to improve this behaviour for some time,
so hats off to Heikki...

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-patches(at)postgresql(dot)org
Subject: Re: Page at a time index scan
Date: 2006-05-03 14:54:50
Message-ID: 1146668090.449.113.camel@localhost.localdomain
Lists: pgsql-patches

On Wed, 2006-05-03 at 10:17 -0400, Tom Lane wrote:
> Simon Riggs <simon(at)2ndquadrant(dot)com> writes:

> You are optimizing the wrong thing here. If we choose not to mark an
> entry dead then we will pay for that omission on every future scan of
> the same entry. I don't think that outweighs the (doubtless rare)
> situation where we expend an extra page fetch to reload the page.

Sounds like a familiar conversation, which I shouldn't have raised here.

This depends upon whether the pages being accessed are in cache or not,
and whether we have sufficient I/O to pay the cost of a write. Reads
don't always go to disk; writes always do. I see that it's difficult to
tell which is which, but that doesn't mean there aren't different cases.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-patches(at)postgresql(dot)org
Subject: Re: Page at a time index scan
Date: 2006-05-03 14:56:56
Message-ID: 16436.1146668216@sss.pgh.pa.us
Lists: pgsql-patches

Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> On Tue, 2006-05-02 at 15:35 -0400, Tom Lane wrote:
>> This is unnecessary and probably wrong.

> You'll need to be more specific about what you mean.

There is no point in marking a page half-dead, as that doesn't save
anyone else from visiting it, and it's probably wrong to mark a leaf
page as half-dead at all. That state is associated with upper pages.

Even if it were a legal tree configuration, marking the page half-dead
would make it impossible to insert any more keys in the page, which
doesn't strike me as an appropriate behavior; it's likely to force
excess I/O later due to unnecessary page splits during future inserts.

regards, tom lane


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-patches(at)postgresql(dot)org
Subject: Re: Page at a time index scan
Date: 2006-05-03 15:23:35
Message-ID: 1146669815.449.131.camel@localhost.localdomain
Lists: pgsql-patches

On Wed, 2006-05-03 at 10:56 -0400, Tom Lane wrote:
> Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> > On Tue, 2006-05-02 at 15:35 -0400, Tom Lane wrote:
> >> This is unnecessary and probably wrong.
>
> > You'll need to be more specific about what you mean.
>
> There is no point in marking a page half-dead, as that doesn't save
> anyone else from visiting it, and it's probably wrong to mark a leaf
> page as half-dead at all. That state is associated with upper pages.
>
> Even if it were a legal tree configuration, marking the page half-dead
> would make it impossible to insert any more keys in the page, which
> doesn't strike me as an appropriate behavior; it's likely to force
> excess I/O later due to unnecessary page splits during future inserts.

OK.

So do you see a problem scenario like this?

A, B and C separate backends:
A1 Reads page, some row versions are *not* marked LP_DELETE but will be
later when A2 happens
B1 VACUUM removes dead rows, just happens to be all of them
B2 Recycles page into FSM
C1 Inserts new data into old page
A2 Attempts to update old page to notify about dead rows (UGH!)

B2 can only be stopped by pinning...

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-patches(at)postgresql(dot)org
Subject: Re: Page at a time index scan
Date: 2006-05-03 17:39:29
Message-ID: 17559.1146677969@sss.pgh.pa.us
Lists: pgsql-patches

Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> So do you see a problem scenario like this?

> A, B and C separate backends:
> A1 Reads page, some row versions are *not* marked LP_DELETE but will be
> later when A2 happens
> B1 VACUUM removes dead rows, just happens to be all of them
> B2 Recycles page into FSM
> C1 Inserts new data into old page
> A2 Attempts to update old page to notify about dead rows (UGH!)

Can't happen; a page cannot be recycled until all concurrent
transactions are gone. In any case, the LP_DELETE marking code will
certainly take care to check that the entries it's trying to mark
are still the same ones it meant to mark.

regards, tom lane


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-patches(at)postgresql(dot)org
Subject: Re: Page at a time index scan
Date: 2006-05-03 17:52:02
Message-ID: 1146678722.449.196.camel@localhost.localdomain
Lists: pgsql-patches

On Wed, 2006-05-03 at 13:39 -0400, Tom Lane wrote:
> Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> > So do you see a problem scenario like this?
>
> > A, B and C separate backends:
> > A1 Reads page, some row versions are *not* marked LP_DELETE but will be
> > later when A2 happens
> > B1 VACUUM removes dead rows, just happens to be all of them
> > B2 Recycles page into FSM
> > C1 Inserts new data into old page
> > A2 Attempts to update old page to notify about dead rows (UGH!)
>
> Can't happen; a page cannot be recycled until all concurrent
> transactions are gone. In any case, the LP_DELETE marking code will
> certainly take care to check that the entries it's trying to mark
> are still the same ones it meant to mark.

!

So do you see a problem with page deletion, or not? If so, what is it?

This patch looks good to me, based upon everything said.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com


From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: pgsql-patches(at)postgresql(dot)org
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: Page at a time index scan
Date: 2006-05-03 18:08:22
Message-ID: Pine.OSF.4.61.0605032102530.509589@kosh.hut.fi
Lists: pgsql-patches

On Wed, 3 May 2006, Heikki Linnakangas wrote:

> On Tue, 2 May 2006, Tom Lane wrote:
>
>> Heikki Linnakangas <hlinnaka(at)iki(dot)fi> writes:
>>> On Tue, 2 May 2006, Tom Lane wrote:
>>>> Backwards scan may break this whole concept; are you sure you've thought
>>>> it through?
>>
>>> I think so. The patch doesn't change the walk-left code. Do you have
>>> something specific in mind?
>>
>> I'm worried about synchronization, particularly what happens if the page
>> gets deleted from under you while you don't have it pinned.
>
> AFAICS, shouldn't happen. The half-dead system makes sure that a page won't
> get deleted while a scan might still be interested in it. It doesn't depend
> on pins.

As Tom pointed out elsewhere in this thread, the above explanation is
wrong because half-dead state only applies to upper-level pages. I thought
that half-dead means that the page is dead and removed from the tree, but
not yet recycled because some transaction might still be interested in it.
Now I see that that state is actually called "deleted".

The point remains, however. A page won't get deleted while a scan
might still be interested in it, because deleted pages are not
immediately recycled (except on vacuum full), and the left and right
sibling pointers stay intact until no transaction can be interested in it.

- Heikki


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: pgsql-patches(at)postgresql(dot)org, Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: Page at a time index scan
Date: 2006-05-03 18:49:44
Message-ID: 18365.1146682184@sss.pgh.pa.us
Lists: pgsql-patches

Heikki Linnakangas <hlinnaka(at)iki(dot)fi> writes:
> The point remains, however. A page won't get deleted while a scan
> might still be interested in it, because deleted pages are not
> immediately recycled (except on vacuum full), and the left and right
> sibling pointers stay intact until no transaction can be interested in it.

Right. AFAICS there isn't any assumption there that isn't already made
by indexscans, since we drop pin before moving to the adjacent page
anyway. You still have to be prepared to deal with the same situations.
(The new assumption is that index items won't be moved onto a
pre-existing right sibling page; the old scan logic didn't assume that.)

Heikki, were you planning to make a round of revisions in the patch,
or is this as far as you wanted to take it?

regards, tom lane


From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-patches(at)postgresql(dot)org, Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: Page at a time index scan
Date: 2006-05-04 06:24:15
Message-ID: Pine.OSF.4.61.0605032314040.509589@kosh.hut.fi
Lists: pgsql-patches

On Wed, 3 May 2006, Tom Lane wrote:

> Heikki, were you planning to make a round of revisions in the patch,
> or is this as far as you wanted to take it?

Here's an updated patch. It's the same as the original, but merged with
the changes to the vacuum_cleanup API you committed, and some comment and
space fixes here and there.

To-do:
* release pin earlier after reading the page, as discussed
* allocate memory for markpos/restrpos only when needed
* calculate a more precise estimate for the maximum number of items on an
index page

Not big things, but I don't have the time to work on them at the moment.

- Heikki


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: pgsql-patches(at)postgresql(dot)org, Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: Page at a time index scan
Date: 2006-05-04 16:35:24
Message-ID: 7915.1146760524@sss.pgh.pa.us
Lists: pgsql-patches

Heikki Linnakangas <hlinnaka(at)iki(dot)fi> writes:
> Here's an updated patch.

Uh ... no patch actually attached?

> To-do:
> Not big things, but I don't have the time to work on them at the moment.

I can take it from here if you'll send me what you have.

regards, tom lane


From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-patches(at)postgresql(dot)org, Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: Page at a time index scan
Date: 2006-05-05 06:28:26
Message-ID: Pine.OSF.4.61.0605050927210.164080@kosh.hut.fi
Lists: pgsql-patches

On Thu, 4 May 2006, Tom Lane wrote:

> Heikki Linnakangas <hlinnaka(at)iki(dot)fi> writes:
>> Here's an updated patch.
>
> Uh ... no patch actually attached?

Doh. Here you go.

>> To-do:
>> Not big things, but I don't have the time to work on them at the moment.
>
> I can take it from here if you'll send me what you have.

Thanks!

- Heikki

Attachment Content-Type Size
indvac-20060503.diff text/plain 61.7 KB

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: pgsql-patches(at)postgresql(dot)org, Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: Page at a time index scan
Date: 2006-05-05 16:29:13
Message-ID: 25829.1146846553@sss.pgh.pa.us
Lists: pgsql-patches

I've just realized that the locking considerations in this patch are
considerably more complicated than we thought. The discussion so far
considered only half of what the deletion interlock needs to accomplish.
We have to ensure that indexscans don't lose their place, which is
solved in the patch by stopping between pages rather than on a specific
item. But we also have to ensure that indexscans don't return pointers
to heap tuples that have already been deleted by VACUUM --- at least in
the case where the caller is using a non-MVCC snapshot. Else, if the
tuple is replaced by a newer tuple before the caller is able to visit
it, the caller might mistakenly accept that tuple as the one it wanted.

The problematic scenario looks like this:

1. indexscan slurps a bunch of heap TIDs into local memory. It keeps
pin, but not lock, on the index page it got the TIDs from.

2. someone else splits the index page, which is allowed since the page
is only pinned not locked. Now, some of the index items are on a new
page that the indexscan does not have pinned.

3. vacuum comes along and removes some of the moved items. Since they
are on a page that no one has pinned, even the super-exclusive lock
that vacuum takes won't prevent this.

4. vacuum removes the corresponding heap tuples.

5. someone else installs new, unrelated, tuples into the freed heap
slots.

6. indexscan finally visits the heap. It finds tuples that are valid
per its snapshot, but aren't what it thought it was finding (they don't
necessarily match the index key).

While we could prevent this by rechecking whether fetched heap tuples
match the index search condition, that costs an awful lot of cycles
for such an obviously rare problem. We need a locking-based solution
instead.

I believe an appropriate fix looks like this:

1. Indexscan is required to keep pin on the page it read the items
from (even though we realize they may not still be on that page).
The pin can only be dropped when we are no longer going to return any
more items read from the page.

2. btbulkdelete is required to get super-exclusive lock on *every leaf
page in the index*. It must not try to optimize this down to just the
pages containing deletable items.
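
Concretely, the bulkdelete side would look about like this (a sketch only;
rel, num_pages and blkno come from the surrounding scan code, and error
handling is omitted):

    for (blkno = BTREE_METAPAGE + 1; blkno < num_pages; blkno++)
    {
        Buffer  buf = ReadBuffer(rel, blkno);

        /*
         * LockBufferForCleanup waits until no one else holds a pin, so an
         * indexscan still holding a pin on the page it read its items from
         * blocks us here (point 1 above).
         */
        LockBufferForCleanup(buf);

        /* ... if it's a leaf page, remove the index items we were told to ... */

        UnlockReleaseBuffer(buf);
    }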

With these rules, even though vacuum might have physically deleted the
index item that the indexscan is reporting to its caller, the
btbulkdelete call will not have been able to complete. (If bulkdelete
arrived at the original page before the indexscan did, then of course
it'd already have deleted the item and there's no problem. If it
arrives after, then it'll be blocked by not being able to get
super-exclusive lock.) Hence vacuum will not have removed the heap
tuple yet, even though the index item might be gone.

We could improve concurrency if we extend the API so that btgettuple
knows whether it is fetching for an MVCC-safe snapshot or not. When the
scan is using an MVCC-safe snapshot it's OK to release pin immediately.
However I'm not sure if this is worth the trouble --- it probably makes
sense to hold onto the pin anyway, in case we're told to mark some of
the tuples killed.

The patch as given gets point 2 right, but I think it doesn't
consistently do point 1, and in any case it's certainly not documenting
the issue properly.

BTW, I just realized another bug in the patch: btbulkdelete fails to
guarantee that it visits every page in the index. It was OK for
btvacuumcleanup to ignore pages added to the index after it starts,
but btbulkdelete has to deal with such pages.

One thing I'm kind of wondering is whether amgetmulti should still exist
as a separate API. Seems like if amgettuple is doing the equivalent
thing under-the-hood, then amgetmulti isn't saving us anything except a
few trips through FunctionCallN. It's not at all clear that it's worth
the API complexity of having two different fetch functions.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: pgsql-patches(at)postgresql(dot)org, Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: Page at a time index scan
Date: 2006-05-05 18:01:37
Message-ID: 26626.1146852097@sss.pgh.pa.us
Lists: pgsql-patches

I wrote:
> BTW, I just realized another bug in the patch: btbulkdelete fails to
> guarantee that it visits every page in the index. It was OK for
> btvacuumcleanup to ignore pages added to the index after it starts,
> but btbulkdelete has to deal with such pages.

Actually, as written this patch does not work. At all. btbulkdelete
has to guarantee that it removes every index entry it's told to, and
it cannot guarantee that in the presence of concurrent page splits.
A split could move index items from a page that btbulkdelete hasn't
visited to one it's already passed over. This is not possible with an
index-order traversal (because splits only move items to the right)
but it's definitely possible with a physical-order traversal.

I was toying with the idea of remembering deletable pages (which
btvacuumcleanup does anyway), which are the only ones that page splits
could move items to, and then rescanning those after the completion
of the primary pass. This has a couple of pretty unpleasant
consequences though:
* We have to remember *every* deletable page for correctness, compared
to the current situation where btvacuumcleanup can bound the number of
pages it tracks. This creates a situation where VACUUM may fail
outright if it doesn't have gobs of memory. Since one of the main
reasons for developing lazy VACUUM was to ensure we could vacuum
arbitrarily large tables in bounded memory, I'm not happy with this.
* The rescan could be far from cheap if there are many such pages.

Also, I'm not sure when you can stop: it certainly seems possible that
items could be moved down during the primary scan and then moved down
again while btbulkdelete is reconsidering the deletable pages. Without
locking out splits entirely, it's far from obvious that we can make it
work.

Thoughts?

regards, tom lane


From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-patches(at)postgresql(dot)org, Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: Page at a time index scan
Date: 2006-05-05 18:55:32
Message-ID: Pine.OSF.4.61.0605052149280.271304@kosh.hut.fi
Lists: pgsql-patches

On Fri, 5 May 2006, Tom Lane wrote:

> I wrote:
>> BTW, I just realized another bug in the patch: btbulkdelete fails to
>> guarantee that it visits every page in the index. It was OK for
>> btvacuumcleanup to ignore pages added to the index after it starts,
>> but btbulkdelete has to deal with such pages.
>
> Actually, as written this patch does not work. At all. btbulkdelete
> has to guarantee that it removes every index entry it's told to, and
> it cannot guarantee that in the presence of concurrent page splits.
> A split could move index items from a page that btbulkdelete hasn't
> visited to one it's already passed over. This is not possible with an
> index-order traversal (because splits only move items to the right)
> but it's definitely possible with a physical-order traversal.

True. :(

The first solution that occurs to me is to force page splits to choose the
target page so that its blkno > the original page's blkno during vacuum.
It would cause the index to become more fragmented more quickly, which is
bad but perhaps tolerable.

> I was toying with the idea of remembering deletable pages (which
> btvacuumcleanup does anyway), which are the only ones that page splits
> could move items to, and then rescanning those after the completion
> of the primary pass. This has a couple of pretty unpleasant
> consequences though:
> * We have to remember *every* deletable page for correctness, compared
> to the current situation where btvacuumcleanup can bound the number of
> pages it tracks. This creates a situation where VACUUM may fail
> outright if it doesn't have gobs of memory. Since one of the main
> reasons for developing lazy VACUUM was to ensure we could vacuum
> arbitrarily large tables in bounded memory, I'm not happy with this.
> * The rescan could be far from cheap if there are many such pages.

Yep, that's not good.

- Heikki


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: pgsql-patches(at)postgresql(dot)org, Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: Page at a time index scan
Date: 2006-05-05 19:13:04
Message-ID: 27275.1146856384@sss.pgh.pa.us
Lists: pgsql-patches

Heikki Linnakangas <hlinnaka(at)iki(dot)fi> writes:
> The first solution that occurs to me is to force page splits to choose the
> target page so that its blkno > the original page's blkno during vacuum.

I thought about that too, but don't like it for three reasons:

* it encourages index bloat, the more the longer the vacuum runs. Not
good, especially if you've got aggressive vacuum cost delay settings.

* there's a locking problem with respect to how you turn that behavior on.

* there's a failure mode where the behavior doesn't get turned off if
vacuum fails partway through.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: pgsql-patches(at)postgresql(dot)org, Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: Page at a time index scan
Date: 2006-05-05 22:04:04
Message-ID: 28521.1146866644@sss.pgh.pa.us
Lists: pgsql-patches

Heikki Linnakangas <hlinnaka(at)iki(dot)fi> writes:
> On Fri, 5 May 2006, Tom Lane wrote:
>>> BTW, I just realized another bug in the patch: btbulkdelete fails to
>>> guarantee that it visits every page in the index.

> The first solution that occurs to me is to force page splits to choose the
> target page so that its blkno > the original page's blkno during vacuum.
> It would cause the index to become more fragmented more quickly, which is
> bad but perhaps tolerable.

I have a sketch of a solution that doesn't require any change in page
allocation behavior. Can anyone see any holes in this:

Assume that we have some way to recognize whether a page has been split
since the current btbulkdelete scan started. (A split would mark both the
original page and the new right-hand page as newly split.) When
btbulkdelete arrives at a page, it need take no special action unless the
page is newly split *and* its right-link points to a lower physical
address. If that's true, then after vacuuming the page, follow its
right-link and vacuum that page; repeat until arriving at a page that is
either not newly split or is above the current location of the outer loop.
Then return to the outer, sequential-scan loop.
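
In rough pseudo-C, the per-page step would be (the helper names are
placeholders, not real functions; outer_blkno is the position of the
sequential outer loop):

    BlockNumber blkno = outer_blkno;
    BlockNumber next;

    for (;;)
    {
        vacuum_one_page(rel, blkno);                    /* placeholder */

        if (!page_split_since_scan_start(rel, blkno))   /* placeholder */
            break;              /* not newly split: nothing escaped us */
        next = right_link_of(rel, blkno);               /* placeholder */
        if (next >= outer_blkno)
            break;              /* the outer loop will reach it anyway */
        blkno = next;           /* chase items moved to a lower page */
    }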

We should also have btbulkdelete clear the newly-split marker whenever
it processes a page; this ensures that no page is processed more than
once, which is probably worth the possible extra I/O needed to clear the
marker. (The cycles to re-scan a page are maybe not that important, but
if we do reprocess pages we'll end up with a bogusly high tuple count.
OTOH I'm not sure we can guarantee an accurate tuple count anyway,
so maybe it doesn't matter.)

AFAICS, this works correctly even if the test for a newly-split page
sometimes yields false positives; thinking a page is newly split when
it is not might cost a little extra I/O but won't lead to wrong results.
This reduces the requirements for the newly-split marking mechanism.

So, how do we do that marking? Noting that we have 16 bits we could use
in BTPageOpaqueData without making that struct any bigger (because of
alignment considerations), I'm thinking about a 16-bit counter for each
index that is incremented at the start of each btbulkdelete operation and
copied into the opaque data whenever a page is split. If the value's
different from the current counter, the page definitely hasn't been split
during btbulkdelete. There's a 1-in-65536 chance of a false positive,
if the last split occurred some exact multiple of 65536 vacuums ago, but
that's probably low enough to be acceptable. (We could reduce the odds of
false positives in various ways, eg by saying that zero isn't a valid
counter value and having btbulkdelete reset a page's counter to zero
anytime it has to write the page anyway.)
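
For instance, something like this (a sketch; the new field's name is
invented):

typedef uint16 BTCycleId;       /* the 16-bit vacuum-cycle counter */

typedef struct BTPageOpaqueData
{
    BlockNumber btpo_prev;      /* left sibling, or P_NONE */
    BlockNumber btpo_next;      /* right sibling, or P_NONE */
    union
    {
        uint32          level;  /* tree level (zero for leaf pages) */
        TransactionId   xact;   /* next txid to check, if half-dead */
    }           btpo;
    uint16      btpo_flags;     /* flag bits */
    BTCycleId   btpo_cycleid;   /* proposed: cycle ID of latest split */
} BTPageOpaqueData;

As noted above, alignment means sizeof(BTPageOpaqueData) stays the same
with the extra uint16.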

This still has the same sort of locking issues I complained of in
regards to Heikki's idea, namely how do you make sure that everyone is
using the new counter value before you start scanning? That seems
soluble however. We'd just need to be willing to take one more lock
during every page split operation, which does not seem an intolerable
amount of overhead. Then we do something like take a sharelock while
fetching the counter value during split (and hold the lock until the
split's complete), and take a momentary exclusive lock while advancing
the counter during btbulkdelete startup.

Thoughts, better ideas?

regards, tom lane


From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-patches(at)postgresql(dot)org, Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: Page at a time index scan
Date: 2006-05-06 20:29:29
Message-ID: Pine.OSF.4.61.0605062303550.189201@kosh.hut.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-patches

On Fri, 5 May 2006, Tom Lane wrote:

> I have a sketch of a solution that doesn't require any change in page
> allocation behavior. Can anyone see any holes in this:

Looks good to me.

> Assume that we have some way to recognize whether a page has been split
> since the current btbulkdelete scan started. (A split would mark both the
> original page and the new right-hand page as newly split.) When
> btbulkdelete arrives at a page, it need take no special action unless the
> page is newly split *and* its right-link points to a lower physical
> address. If that's true, then after vacuuming the page, follow its
> right-link and vacuum that page; repeat until arriving at a page that is
> either not newly split or is above the current location of the outer loop.
> Then return to the outer, sequential-scan loop.

It'd be a bit more efficient to finish the sequential-scan first, and
memorize the newly-split pages' right-links as they're encountered. Then
scan those pages as a separate second pass, or earlier if we run out of
memory reserved for memorizing them.

> We should also have btbulkdelete clear the newly-split marker whenever
> it processes a page; this ensures that no page is processed more than
> once, which is probably worth the possible extra I/O needed to clear the
> marker. (The cycles to re-scan a page are maybe not that important, but
> if we do reprocess pages we'll end up with a bogusly high tuple count.
> OTOH I'm not sure we can guarantee an accurate tuple count anyway,
> so maybe it doesn't matter.)

Yeah, seems worth it to always clear the marker. Definitely when the page
is dirtied anyway.

> AFAICS, this works correctly even if the test for a newly-split page
> sometimes yields false positives; thinking a page is newly split when
> it is not might cost a little extra I/O but won't lead to wrong results.
> This reduces the requirements for the newly-split marking mechanism.
>
> So, how do we do that marking? Noting that we have 16 bits we could use
> in BTPageOpaqueData without making that struct any bigger (because of
> alignment considerations), I'm thinking about a 16-bit counter for each
> index that is incremented at the start of each btbulkdelete operation and
> copied into the opaque data whenever a page is split. If the value's
> different from the current counter, the page definitely hasn't been split
> during btbulkdelete. There's a 1-in-65536 chance of a false positive,
> if the last split occurred some exact multiple of 65536 vacuums ago, but
> that's probably low enough to be acceptable. (We could reduce the odds of
> false positives in various ways, eg by saying that zero isn't a valid
> counter value and having btbulkdelete reset a page's counter to zero
> anytime it has to write the page anyway.)

If btbulkdelete always clears the marker (assuming zero isn't a valid
value), 16 bits is plenty. Unless a vacuum is aborted, there should never
be a value older than current value - 1 in the index. We could live with a
2-bit counter.

> This still has the same sort of locking issues I complained of in
> regards to Heikki's idea, namely how do you make sure that everyone is
> using the new counter value before you start scanning? That seems
> soluble however. We'd just need to be willing to take one more lock
> during every page split operation, which does not seem an intolerable
> amount of overhead. Then we do something like take a sharelock while
> fetching the counter value during split (and hold the lock until the
> split's complete), and take a momentary exclusive lock while advancing
> the counter during btbulkdelete startup.

That's not too bad. Where exactly were you thinking of putting the
counter and the lock?

> Thoughts, better ideas?

Well, we could enhance my proposal a bit to make the fragmentation effect
less severe. Instead of a simple flag indicating that a vacuum is in
progress, vacuum could announce the current page it's processing in a
per-index shmem variable. A page split must read that variable, holding a
shared lock, and choose the target page so that the boundary is not crossed
in the wrong direction, i.e. never with the new page below the boundary
while the old page is above it.
Otherwise, it's free to choose what it wants. To make good target page
choices, it needs some help from FSM.
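
In code, the split-side check might look something like this (purely a
sketch of the idea; the lock, the shmem variable and the function are all
invented names, not anything from the patch):

    extern LWLockId VacuumBoundaryLock;         /* hypothetical lock */
    extern BlockNumber vacuum_boundary_blkno;   /* hypothetical shmem variable;
                                                 * InvalidBlockNumber when no
                                                 * vacuum is running */

    /* Reject split target pages that would move items across the vacuum
     * scan boundary in the wrong direction. */
    static bool
    split_target_ok(BlockNumber origblkno, BlockNumber targetblkno)
    {
        BlockNumber boundary;

        LWLockAcquire(VacuumBoundaryLock, LW_SHARED);
        boundary = vacuum_boundary_blkno;
        LWLockRelease(VacuumBoundaryLock);

        if (boundary == InvalidBlockNumber)
            return true;        /* no vacuum in progress, anything goes */

        /* Don't move items from the not-yet-scanned side (at or above the
         * boundary) to the already-scanned side (below it). */
        return !(origblkno >= boundary && targetblkno < boundary);
    }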

I think I like your proposal more, though. It seems better
concurrency-wise.

- Heikki


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: pgsql-patches(at)postgresql(dot)org, Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: Page at a time index scan
Date: 2006-05-06 21:00:24
Message-ID: 7565.1146949224@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-patches

Heikki Linnakangas <hlinnaka(at)iki(dot)fi> writes:
> That's not too bad. Where exactly were you thinking of putting the
> counter and the lock?

My original thought was to keep it in btree metapages, but that's kind
of annoying since we just went to some effort to cache metapage
contents, which means the metapage will likely not be hanging around in
buffer cache. Right now I'm thinking we could have a single system-wide
counter instead of one per index, and keep it in shared memory. There's
no strong reason why it needs to survive system crash/restart AFAICS.

Aside from the counter proper, shared memory would need to contain
enough info so we could tell whether btbulkdelete was active for any
particular index, and the marker value to use for that index if so.
But this requires only enough space for 1 entry per active vacuum,
which is bounded by MaxBackends and will usually be a lot smaller.
A benefit of doing it that way is that when btbulkdelete isn't active,
splits can reliably tell so and write zero into the page markers.
That reduces the odds of false positives quite a lot, assuming that
bulkdelete is only active a small fraction of the time.
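
To illustrate, very roughly, what I have in mind for the shared-memory side
(names here are just for the sketch; the array would be sized by MaxBackends
and protected by the same lock the splits take in shared mode):

    /* One entry per active btbulkdelete. */
    typedef struct BTOneVacInfo
    {
        Oid         relid;          /* which index */
        uint16      cycleid;        /* marker value in use for it */
    } BTOneVacInfo;

    typedef struct BTVacInfo
    {
        uint16      cycle_ctr;      /* system-wide counter, wraps at 2^16 */
        int         num_vacuums;    /* current number of active entries */
        BTOneVacInfo vacuums[1];    /* VARIABLE LENGTH ARRAY */
    } BTVacInfo;

    static BTVacInfo *btvacinfo;    /* points into shared memory */

    /*
     * Called by a page split (caller already holds the lock in shared
     * mode): returns the marker to stamp into both halves of the split,
     * or zero if no vacuum is currently active for this index.
     */
    static uint16
    split_cycleid(Oid indexrelid)
    {
        int         i;

        for (i = 0; i < btvacinfo->num_vacuums; i++)
        {
            if (btvacinfo->vacuums[i].relid == indexrelid)
                return btvacinfo->vacuums[i].cycleid;
        }
        return 0;
    }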

BTW, I'm currently working on cleaning up the page-at-a-time-scan
part of your patch, since we could apply that without changing
VACUUM at all. We need to do performance testing on that and make
sure it won't kill us for unique indexes, else this whole project
may have to be shelved :-(. I've found a few more problems, eg the
code doesn't handle changes in indexscan direction properly, but
no showstoppers.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: pgsql-patches(at)postgresql(dot)org, Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: Page at a time index scan
Date: 2006-05-08 00:11:17
Message-ID: 15871.1147047077@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-patches

I've committed a rewritten version of this patch.

Heikki Linnakangas <hlinnaka(at)iki(dot)fi> writes:
> On Fri, 5 May 2006, Tom Lane wrote:
>> btbulkdelete arrives at a page, it need take no special action unless the
>> page is newly split *and* its right-link points to a lower physical
>> address. If that's true, then after vacuuming the page, follow its
>> right-link and vacuum that page; repeat until arriving at a page that is
>> either not newly split or is above the current location of the outer loop.
>> Then return to the outer, sequential-scan loop.

> It'd be a bit more efficient to finish the sequential-scan first, and
> memorize the newly-split pages' right-links as they're encountered. Then
> scan those pages as a separate second pass, or earlier if we run out of
> memory reserved for memorizing them.

I didn't do this. Aside from the extra memory requirement, it's not
apparent to me that it'd make things faster. The disadvantage is that
it would require more page reads than the other way: if you visit the
split page immediately, and note that its right-link is above the
current outer loop location, then you can skip following the right-link
because you know you'll visit the page later. If you postpone then you
have to chase every chain until actually reading a page with an old
cycle ID. I think this extra I/O would likely outweigh any savings from
not interrupting the main scan.

> If btbulkdelete always clears the marker (assuming zero isn't a valid
> value), 16 bits is plenty. Unless a vacuum is aborted, there should never
> be a value older than current value - 1 in the index. We could live with a
> 2-bit counter.

For the moment, the code is only clearing the marker if it's equal to
the current cycle ID. This is sufficient to recognize
definitely-already-processed pages, but it doesn't prevent false
positives in general. If we ever need the space we could narrow the
counter, at the cost of having to expend more I/O to keep the values
cleared.
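
Paraphrasing the control flow for the archives (this is a simplification
with invented helper names, not the committed code verbatim; the 16-bit
marker is assumed to be called btpo_cycleid below):

    extern Page read_and_lock_page(BlockNumber blkno);     /* hypothetical */
    extern void write_and_unlock_page(Page page);          /* hypothetical */
    extern BTPageOpaque page_opaque(Page page);             /* hypothetical */
    extern void remove_dead_tuples(Page page);              /* hypothetical */

    /* Process one page of the outer sequential scan, chasing right-links
     * of pages split during this vacuum cycle. */
    static void
    vacuum_scan_page(BlockNumber scanblkno, uint16 cycleid)
    {
        BlockNumber blkno = scanblkno;

        for (;;)
        {
            Page        page = read_and_lock_page(blkno);
            BTPageOpaque opaque = page_opaque(page);
            BlockNumber rightlink = opaque->btpo_next;
            bool        chase;

            remove_dead_tuples(page);

            /*
             * Only a page split during this very cycle, whose right-link
             * points below the outer scan position, needs chasing; a
             * right-link at or above scanblkno will be reached by the
             * outer loop anyway.
             */
            chase = (opaque->btpo_cycleid == cycleid &&
                     rightlink < scanblkno);

            /* Clear the marker when it matches the current cycle, so the
             * same page is never chased twice. */
            if (opaque->btpo_cycleid == cycleid)
                opaque->btpo_cycleid = 0;

            write_and_unlock_page(page);

            if (!chase)
                break;
            blkno = rightlink;
        }
    }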

regards, tom lane


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-patches(at)postgresql(dot)org
Subject: Re: Page at a time index scan
Date: 2006-05-08 09:45:44
Message-ID: 1147081544.3468.180.camel@localhost.localdomain
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-patches

On Fri, 2006-05-05 at 18:04 -0400, Tom Lane wrote:
> Heikki Linnakangas <hlinnaka(at)iki(dot)fi> writes:
> > On Fri, 5 May 2006, Tom Lane wrote:
> >>> BTW, I just realized another bug in the patch: btbulkdelete fails to
> >>> guarantee that it visits every page in the index.
>
> > The first solution that occurs to me is to force page splits to choose the
> > target page so that it's blkno > the original page's blkno during vacuum.
> > It would cause the index to become more fragmented more quickly, which is
> > bad but perhaps tolerable.
>
> I have a sketch of a solution that doesn't require any change in page
> allocation behavior. Can anyone see any holes in this:
>
> Assume that we have some way to recognize whether a page has been split
> since the current btbulkdelete scan started. (A split would mark both the
> original page and the new right-hand page as newly split.) When
> btbulkdelete arrives at a page, it need take no special action unless the
> page is newly split *and* its right-link points to a lower physical
> address. If that's true, then after vacuuming the page, follow its
> right-link and vacuum that page; repeat until arriving at a page that is
> either not newly split or is above the current location of the outer loop.
> Then return to the outer, sequential-scan loop.
>
> We should also have btbulkdelete clear the newly-split marker whenever
> it processes a page; this ensures that no page is processed more than
> once, which is probably worth the possible extra I/O needed to clear the
> marker. (The cycles to re-scan a page are maybe not that important, but
> if we do reprocess pages we'll end up with a bogusly high tuple count.
> OTOH I'm not sure we can guarantee an accurate tuple count anyway,
> so maybe it doesn't matter.)
>
> AFAICS, this works correctly even if the test for a newly-split page
> sometimes yields false positives; thinking a page is newly split when
> it is not might cost a little extra I/O but won't lead to wrong results.
> This reduces the requirements for the newly-split marking mechanism.
>
> So, how do we do that marking? Noting that we have 16 bits we could use
> in BTPageOpaqueData without making that struct any bigger (because of
> alignment considerations), I'm thinking about a 16-bit counter for each
> index that is incremented at the start of each btbulkdelete operation and
> copied into the opaque data whenever a page is split. If the value's
> different from the current counter, the page definitely hasn't been split
> during btbulkdelete. There's a 1-in-65536 chance of a false positive,
> if the last split occurred some exact multiple of 65536 vacuums ago, but
> that's probably low enough to be acceptable. (We could reduce the odds of
> false positives in various ways, eg by saying that zero isn't a valid
> counter value and having btbulkdelete reset a page's counter to zero
> anytime it has to write the page anyway.)

I read your earlier post about needing to lock everything and spent some
time thinking about this. The issue of needing to lock everything means
that we would never be able to do a partial vacuum of an index i.e.
remove one page without a scan. I'm more concerned about making partial
vacuum work than I am about speeding up an all-block vacuum.

I'd been mulling over the locking problems and came up with something
that looks identical to your proposal *except* for the value that gets
written into the BTPageOpaqueData on the right-hand page.

My thinking was to write the blockid of the original left hand page, so
as to record the original ancestor since split. Thus if multiple splits
occurred, then the original ancestor blockid would remain on record
until VACUUM. In more detail: When we split a page if the ancestor
blockid is not set, then we set it to be the blockid of the old page
(new left hand page). If the ancestor blockid is already set then we
copy that to the new right hand page. Every split will write a value to
BTPageOpaqueData, though the values to use are already available without
extra work.

AFAICS this doesn't change the ability to scan the index in physical
order, so we still get the benefits for that action.

A *partial* vacuum of an index block can be achieved by visiting the
block to be vacuumed, then following the link directly to the pre-split
ancestor block (if there is one), then moving right again back to the
target block. btbulkdelete always clears the marker when it cleans. This
opens the door for the approach of notifying when a page is deletable
and then having a background agent remove just that page, as
patch-proposed recently by Junji TERAMOTO.
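
In rough code form, this is how I'm picturing it (a sketch only; all the
field and helper names below are invented):

    /* At split time: btpo_ancestor is the proposed 32-bit field, with
     * InvalidBlockNumber meaning "not split since the last vacuum". */
    static void
    mark_split_ancestor(BTPageOpaque lopaque, BTPageOpaque ropaque,
                        BlockNumber origblkno)
    {
        if (lopaque->btpo_ancestor == InvalidBlockNumber)
            lopaque->btpo_ancestor = origblkno;     /* first split since vacuum */
        ropaque->btpo_ancestor = lopaque->btpo_ancestor;    /* later splits inherit */
    }

    extern void vacuum_one_block(BlockNumber blkno);        /* hypothetical; also
                                                             * clears btpo_ancestor */
    extern BlockNumber ancestor_of(BlockNumber blkno);      /* reads btpo_ancestor */
    extern BlockNumber next_right_link(BlockNumber blkno);  /* InvalidBlockNumber
                                                             * at end of chain */

    /* Partial vacuum of a single target block. */
    static void
    partial_vacuum_block(BlockNumber targblkno)
    {
        BlockNumber ancestor = ancestor_of(targblkno);
        BlockNumber blkno;

        vacuum_one_block(targblkno);
        if (ancestor == InvalidBlockNumber)
            return;                     /* never split since last vacuum */

        /* Walk right from the pre-split ancestor back towards the target,
         * so tuples moved right by intervening splits are seen as well. */
        for (blkno = ancestor;
             blkno != targblkno && blkno != InvalidBlockNumber;
             blkno = next_right_link(blkno))
            vacuum_one_block(blkno);
    }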

I'm *not* presuming we have an agreed solution for partial vacuuming,
but I do want to keep that door open by implementing a locking protocol
that could support it.

> This still has the same sort of locking issues I complained of in
> regards to Heikki's idea, namely how do you make sure that everyone is
> using the new counter value before you start scanning? That seems
> soluble however. We'd just need to be willing to take one more lock
> during every page split operation, which does not seem an intolerable
> amount of overhead. Then we do something like take a sharelock while
> fetching the counter value during split (and hold the lock until the
> split's complete), and take a momentary exclusive lock while advancing
> the counter during btbulkdelete startup.

I'm not very happy about an extra lock during page splitting, which adds
a performance hit even for tables that never will need regular vacuuming
(apart from occasional wrap-around avoidance).

AFAICS if we just store the ancestor blockid we don't need to do the
business with shared memory and extra locks etc. The size of the field
we add to BTPageOpaqueData seems fairly unimportant, since we leave
lots of space in index pages anyway - a few extra bytes would only make
a noise-level change in performance.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-patches(at)postgresql(dot)org
Subject: Re: Page at a time index scan
Date: 2006-05-08 14:18:19
Message-ID: 26585.1147097899@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-patches

Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> I read your earlier post about needing to lock everything and spent some
> time thinking about this. The issue of needing to lock everything means
> that we would never be able to do a partial vacuum of an index i.e.
> remove one page without a scan. I'm more concerned about making partial
> vacuum work than I am about speeding up an all-block vacuum.

[ shrug... ] That's an illusory goal anyway. Retail tuple removal is
just too inefficient. (No, I don't believe in that proposed patch.)

> My thinking was to write the blockid of the original left hand page, so
> as to record the original ancestor since split. Thus if multiple splits
> occurred, then the original ancestor blockid would remain on record
> until VACUUM. In more detail: When we split a page if the ancestor
> blockid is not set, then we set it to be the blockid of the old page
> (new left hand page). If the ancestor blockid is already set then we
> copy that to the new right hand page. Every split will write a value to
> BTPageOpaqueData, though the values to use are already available without
> extra work.

Doesn't work, at least not for making it possible to vacuum part of the
index. The conflicting indexscan could have stopped on a page, and then
that page could have split, before your "partial vacuum" ever started.
So tracing back only as far as the data has split since vacuum started
is not enough to prevent conflict.

(The other little problem is that we'd have to enlarge the BTOpaque
overhead, because a block id doesn't fit in the available 16 bits.)

> I'm not very happy about an extra lock during page splitting, which adds
> a performance hit even for tables that never will need regular vacuuming
> (apart from occaisional wrap-around avoidance).

While I'd rather not have done that, I don't believe that it makes for
any material performance degradation. Normal splits all take the lock
in shared mode and hence suffer no contention. Your proposal wouldn't
make for less locking anyway, since it still assumes that there's a way
to tell whether vacuum is active for a given index, which is just about
the same amount of overhead as the code-as-committed.

regards, tom lane


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-patches(at)postgresql(dot)org
Subject: Re: Page at a time index scan
Date: 2006-05-08 15:06:35
Message-ID: 1147100795.3468.316.camel@localhost.localdomain
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-patches

On Mon, 2006-05-08 at 10:18 -0400, Tom Lane wrote:
> Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> > I read your earlier post about needing to lock everything and spent some
> > time thinking about this. The issue of needing to lock everything means
> > that we would never be able to do a partial vacuum of an index i.e.
> > remove one page without a scan. I'm more concerned about making partial
> > vacuum work than I am about speeding up an all-block vacuum.
>
> [ shrug... ] That's an illusory goal anyway. Retail tuple removal is
> just too inefficient. (No, I don't believe in that proposed patch.)

Current VACUUM optimizes for the case where random updates/deletes
occur. If there are hotspots then scanning the whole relation is O(N) or
even O(N^2) if we have to scan the indexes multiple times.

We think we have a way to improve heap VACUUMs (bitmaps...) but we are
still looking for an equivalent for indexes.

> > My thinking was to write the blockid of the original left hand page, so
> > as to record the original ancestor since split. Thus if multiple splits
> > occurred, then the original ancestor blockid would remain on record
> > until VACUUM. In more detail: When we split a page if the ancestor
> > blockid is not set, then we set it to be the blockid of the old page
> > (new left hand page). If the ancestor blockid is already set then we
> > copy that to the new right hand page. Every split will write a value to
> > BTPageOpaqueData, though the values to use are already available without
> > extra work.
>
> Doesn't work, at least not for making it possible to vacuum part of the
> index. The conflicting indexscan could have stopped on a page, and then
> that page could have split, before your "partial vacuum" ever started.
> So tracing back only as far as the data has split since vacuum started
> is not enough to prevent conflict.

That wasn't the proposal. Every split would be marked and stay marked
until those blocks were VACUUMed. The data used to mark is readily
available and doesn't rely on whether or not VACUUM is running.
IMHO this does work.

> (The other little problem is that we'd have to enlarge the BTOpaque
> overhead, because a block id doesn't fit in the available 16 bits.)

ISTM it is indeed a little problem. After CREATE INDEX, most index pages
don't stay completely full, so worrying about alignment wastage might
very occasionally save an I/O, but not enough for us to care.

> > I'm not very happy about an extra lock during page splitting, which adds
> > a performance hit even for tables that never will need regular vacuuming
> > (apart from occasional wrap-around avoidance).
>
> While I'd rather not have done that, I don't believe that it makes for
> any material performance degradation. Normal splits all take the lock
> in shared mode and hence suffer no contention. Your proposal wouldn't
> make for less locking anyway, since it still assumes that there's a way
> to tell whether vacuum is active for a given index, which is just about
> the same amount of overhead as the code-as-committed.

The proposed scheme doesn't rely on knowing whether a VACUUM is active
or not.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-patches(at)postgresql(dot)org
Subject: Re: Page at a time index scan
Date: 2006-05-08 15:26:33
Message-ID: 27173.1147101993@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-patches

Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> That wasn't the proposal. Every split would be marked and stay marked
> until those blocks were VACUUMed. The data used to mark is readily
> available and doesn't rely on whether or not VACUUM is running.
> IMHO this does work.

OK, I misunderstood what you had in mind, but now that I do understand
it doesn't seem terribly efficient. What you're suggesting is that we
take as a "vacuum group" all the pages that have been split off from a
single original page since that page was last vacuumed, and that this
group must be vacuumed as a whole. That works, but it seems that the
groups would get awfully large. In particular, this substantially
penalizes btbulkdelete in hopes of someday improving matters for what
remains an entirely fictional partial vacuum. As it stands today,
btbulkdelete only has to worry about page groups formed since it began
to run, not since the last vacuum. Changing the data representation
like this would force it to retrace much more often and over much larger
page groups.

>> (The other little problem is that we'd have to enlarge the BTOpaque
>> overhead, because a block id doesn't fit in the available 16 bits.)

> ISTM it is indeed a little problem. After CREATE INDEX, most index pages
> don't stay completely full, so worrying about alignment wastage might
> very occasionally save an I/O, but not enough for us to care.

I don't think an extra 4 or 8 bytes need be a show-stopper, but it does
force an initdb and thus make it harder to compare the two techniques.

regards, tom lane


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-patches(at)postgresql(dot)org
Subject: Re: Page at a time index scan
Date: 2006-05-08 18:02:54
Message-ID: 1147111374.3468.372.camel@localhost.localdomain
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-patches

On Mon, 2006-05-08 at 11:26 -0400, Tom Lane wrote:
> Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> > That wasn't the proposal. Every split would be marked and stay marked
> > until those blocks were VACUUMed. The data used to mark is readily
> > available and doesn't rely on whether or not VACUUM is running.
> > IMHO this does work.
>
> OK, I misunderstood what you had in mind, but now that I do understand
> it doesn't seem terribly efficient. What you're suggesting is that we
> take as a "vacuum group" all the pages that have been split off from a
> single original page since that page was last vacuumed, and that this
> group must be vacuumed as a whole. That works, but it seems that the
> groups would get awfully large. In particular, this substantially
> penalizes btbulkdelete in hopes of someday improving matters for what
> remains an entirely fictional partial vacuum.

OK, so we have the germ of a new mechanism - and I very much agree that
the idea of a partial vacuum is at present entirely fictional...but we
at least have a starting place.

> As it stands today,
> btbulkdelete only has to worry about page groups formed since it began
> to run, not since the last vacuum. Changing the data representation
> like this would force it to retrace much more often and over much larger
> page groups.

Yes, I saw the potential issue you mention - but for many cases the
index grows forwards and so we wouldn't care in either case. Page splits
that go to lower blockids are limited by available space, so would be
less of a problem. I'm balancing the additional cost of page splits
against the additional cost on the vacuum. I would prefer to keep
in-line ops faster and pay a little extra on the out-of-line operations,
if that's what it takes. I note your point that there is little
contention, but there is still a cost and in many cases this cost is
being paid on tables that never will be VACUUMed.

For insert-intensive apps, this adds cost, for little benefit.

For update-intensive apps, we're VACUUMing continually anyway so there's
no benefit from doing this only-during-VACUUM.

So we just optimised for slowly-but-continually churning tables (i.e.
DELETEs match INSERTs, or just UPDATEs). i.e. we just improved VACUUM
performance for those that don't need it that often. That might be the
common case, but it isn't the one that's hurting most.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-patches(at)postgresql(dot)org
Subject: Re: Page at a time index scan
Date: 2006-05-08 18:46:02
Message-ID: 28697.1147113962@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-patches

Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> So we just optimised for slowly-but-continually churning tables (i.e.
> DELETEs match INSERTs, or just UPDATEs). i.e. we just improved VACUUM
> performance for those that don't need it that often. That might be the
> common case, but it isn't the one that's hurting most.

No, we just improved VACUUM performance for those that need it most.
I was just doing some desultory experiments with today's code versus
yesterday's (it's easy to make a direct comparison because they're
initdb-compatible ... just stop one postmaster executable and start
another on the same database). I made a table of 16M rows with an
index over a random-data integer column. With a thoroughly disordered
index (built on-the-fly as the random data was inserted), the time to
VACUUM after deleting a small number of rows was 615 seconds with
yesterday's code, 31 seconds today. With a perfectly-ordered index
(identical table, but CREATE INDEX after all the data is in place), the
times were about 28 and 26 seconds respectively. (I wouldn't put a
*whole* lot of faith in these numbers, since they're from a Dell machine
with a toy ATA drive, but anyway they do show that sequential access to
the index makes a huge difference.) But perfectly-ordered indexes are
impossible to maintain unless your data is either static or insert-at-
the-end-only.

Anyway, while the extra LWLock and shared-memory access during a split
is annoying, I think it's really negligible (and so does pgbench).
A page split is going to involve many times that much work --- you've
got to acquire lock on at least four different buffers, visit the FSM,
possibly ask the kernel for another disk page, do XLogInsert twice, etc.
Any one of those things involves significantly more CPU effort and
contention than _bt_vacuum_cycleid(). So while I'd be happy to get rid
of it, not at the price of slowing down VACUUM again.

regards, tom lane


From: "Luke Lonergan" <llonergan(at)greenplum(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Simon Riggs" <simon(at)2ndquadrant(dot)com>
Cc: "Heikki Linnakangas" <hlinnaka(at)iki(dot)fi>, pgsql-patches(at)postgresql(dot)org, "Eng" <eng(at)intranet(dot)greenplum(dot)com>, "System Engineering" <system_engineering(at)intranet(dot)greenplum(dot)com>
Subject: Re: Page at a time index scan
Date: 2006-05-08 19:09:11
Message-ID: C084E567.22FA6%llonergan@greenplum.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-patches

Tom,

On 5/8/06 11:46 AM, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> I made a table of 16M rows with an
> index over a random-data integer column. With a thoroughly disordered
> index (built on-the-fly as the random data was inserted), the time to
> VACUUM after deleting a small number of rows was 615 seconds with
> yesterday's code, 31 seconds today. With a perfectly-ordered index
> (identical table, but CREATE INDEX after all the data is in place), the
> times were about 28 and 26 seconds respectively.

Very impressive! This corroborates findings we've had with index
maintenance in the field - thanks for finding/fixing this.

- Luke


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-patches(at)postgresql(dot)org
Subject: Re: Page at a time index scan
Date: 2006-05-08 21:21:33
Message-ID: 1147123293.4572.19.camel@localhost.localdomain
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-patches

On Mon, 2006-05-08 at 14:46 -0400, Tom Lane wrote:
> Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> > So we just optimised for slowly-but-continually churning tables (i.e.
> > DELETEs match INSERTs, or just UPDATEs). i.e. we just improved VACUUM
> > performance for those that don't need it that often. That might be the
> > common case, but it isn't the one that's hurting most.
>
> No, we just improved VACUUM performance for those that need it most.
> I was just doing some desultory experiments with today's code versus
> yesterday's (it's easy to make a direct comparison because they're
> initdb-compatible ... just stop one postmaster executable and start
> another on the same database). I made a table of 16M rows with an
> index over a random-data integer column. With a thoroughly disordered
> index (built on-the-fly as the random data was inserted), the time to
> VACUUM after deleting a small number of rows was 615 seconds with
> yesterday's code, 31 seconds today. With a perfectly-ordered index
> (identical table, but CREATE INDEX after all the data is in place), the
> times were about 28 and 26 seconds respectively. (I wouldn't put a
> *whole* lot of faith in these numbers, since they're from a Dell machine
> with a toy ATA drive, but anyway they do show that sequential access to
> the index makes a huge difference.) But perfectly-ordered indexes are
> impossible to maintain unless your data is either static or insert-at-
> the-end-only.

You and Heikki have achieved a marvelous thing, well done.

> Anyway, while the extra LWLock and shared-memory access during a split
> is annoying, I think it's really negligible (and so does pgbench).
> A page split is going to involve many times that much work --- you've
> got to acquire lock on at least four different buffers, visit the FSM,
> possibly ask the kernel for another disk page, do XLogInsert twice, etc.
> Any one of those things involves significantly more CPU effort and
> contention than _bt_vacuum_cycleid(). So while I'd be happy to get rid
> of it, not at the price of slowing down VACUUM again.

I'll raise the partial vacuum topic on -hackers.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com