Extended Prefetching using Asynchronous IO - proposal and patch

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Race condition within _bt_findinsertloc()? (new page split code)
Date: 2014-05-27 06:17:59
Message-ID: CAM3SWZR77w4Fq26H0t=3n4d=142MWgHrH0EL4+1GpTFZb3HXAg@mail.gmail.com
Lists: pgsql-hackers

While speccing out a new B-Tree verification tool, I had the
opportunity to revisit a thought I had during review concerning
_bt_findinsertloc(): that the new coding is unsafe in the event of
deferred split completion of a leaf page of a unique index. To recap,
we now do the following in _bt_findinsertloc():

    /*
     * If this page was incompletely split, finish the split now. We
     * do this while holding a lock on the left sibling, which is not
     * good because finishing the split could be a fairly lengthy
     * operation. But this should happen very seldom.
     */
    if (P_INCOMPLETE_SPLIT(lpageop))
    {
        _bt_finish_split(rel, rbuf, stack);
        rbuf = InvalidBuffer;
        continue;
    }

The "left sibling" referred to here is "the first page this key could
be on", an important concept for unique index enforcement. It's the
first sibling iff we're on our first iteration of the nested for(;;)
loop in _bt_findinsertloc(). So the buffer lock held on this left
sibling may constitute what in the past I've called a "value lock";
we've established the right to insert our value into the unique index
at this point, and the lock will only be released when we're done
(regardless of whether or not that buffer/page value lock is on the
buffer/page we'll ultimately insert into, or an earlier one).

Anyway, the concern is that there may be problems when we call
_bt_finish_split() with that left sibling locked throughout (i.e.
finish a split where the right sibling is BTP_INCOMPLETE_SPLIT, and
itself has a right sibling from the incomplete split (which is usually
the value lock page's right-right sibling)). I'm not concerned about
performance, since as the comment says it ought to be an infrequent
occurrence. I also believe that there are no deadlock hazards. But
consider this scenario:

* We insert the value 7 into an int4 unique index. We need to split
the leaf page. We run out of memory or something, and ours is an
incomplete page split. Our transaction is aborted. For the sake of
argument, suppose that there are also already a bunch of dead tuples
within the index with values of 7, 8 and 9.

* Another inserter of the value 7 comes along. It follows exactly the
same downlink as the first, now aborted inserter (suppose the
downlink's value is 9). It also locks the same leaf page to establish
a "value lock" in precisely the same manner. Finding no room on the
first page, it looks further right, maintaining its original "value
lock" throughout. It finishes the first inserter's split of the
non-value-lock page - a new downlink is inserted into the parent page,
with the value 8. It then releases all buffer locks except the first
one/original "value lock". A physical insertion has yet to occur.

* A third inserter of the value 7 comes along. It gets to a later page
than the one locked by the second inserter, preferring the newer
downlink with value 8 (the internal-page _bt_binsrch() logic ensures
this). It exclusive locks that later page/buffer before the second guy
gets a chance to lock it once again. It establishes the right to
insert with _bt_check_unique(), undeterred by the second inserter's
buffer lock/"value lock". The value lock is effectively skipped over.

* Both the second and third inserters have "established the right" to
insert the same value, 7, and both do so. The unique index has an
MVCC-snapshot-wise spurious duplicate, and so is corrupt.

Regardless of whether or not I have these details exactly right (that
is, regardless of whether or not this scenario is strictly possible) I
suggest as a code-hardening measure that _bt_findinsertloc() release
its "value lock", upon realizing it must complete splits, and then
complete the split or splits known to be required. It would finally
report that it "couldn't find" an insertion location to
_bt_doinsert(), which would then retry from the start, just like when
_bt_check_unique() finds an inconclusive conflict. The only difference
is that we don't have an xact to wait on. We haven't actually done
anything extra that makes this later "goto top;" any different to the
existing one.
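
To illustrate, the existing retry in _bt_doinsert() has roughly this
shape (condensed, details elided; not the verbatim code), and the
proposal would simply add a second path that reaches the same goto:

    top:
        /* find the first page the value could go on, and lock it */
        stack = _bt_search(rel, natts, itup_scankey, false, &buf, BT_WRITE);

        if (checkUnique != UNIQUE_CHECK_NO)
        {
            xwait = _bt_check_unique(rel, itup, heapRel, buf, offset,
                                     itup_scankey, checkUnique, &is_unique);
            if (TransactionIdIsValid(xwait))
            {
                /* inconclusive conflict: drop locks, wait, start over */
                _bt_relbuf(rel, buf);
                XactLockTableWait(xwait);
                _bt_freestack(stack);
                goto top;
            }
        }

        /*
         * Proposed addition: if _bt_findinsertloc() gave up its "value
         * lock" to finish an incomplete split, it would report failure
         * here and we would likewise goto top (with no xact to wait on).
         */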

This should occur so infrequently that it isn't worth trying harder,
or worth differentiating between the UNIQUE_CHECK_NO and
!UNIQUE_CHECK_NO cases when retrying. This also removes the more
general risk of holding an extra buffer lock during page split
completion.

It kind of looks like _bt_findinsertloc() doesn't have this bug,
because in my scenario _bt_finish_split() is called with both the
value lock and its right page locked (so the right page is the left
page for _bt_finish_split()'s purposes). But look again: _bt_finish_split()
releases its locks, so once again only the original value lock is held
when _bt_finish_split() returns, and the new downlink is in place to
skip over the value-locked page. So the bug does exist (assuming that
I haven't failed to consider some third factor and am not otherwise
mistaken).

Thoughts?
--
Peter Geoghegan


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Peter Geoghegan <pg(at)heroku(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Race condition within _bt_findinsertloc()? (new page split code)
Date: 2014-05-27 11:54:00
Message-ID: 53847CD8.6080401@vmware.com
Lists: pgsql-hackers

On 05/27/2014 09:17 AM, Peter Geoghegan wrote:
> While speccing out a new B-Tree verification tool, I had the
> opportunity to revisit a thought I had during review concerning
> _bt_findinsertloc(): that the new coding is unsafe in the event of
> deferred split completion of a leaf page of a unique index. To recap,
> we now do the following in _bt_findinsertloc():
>
>     /*
>      * If this page was incompletely split, finish the split now. We
>      * do this while holding a lock on the left sibling, which is not
>      * good because finishing the split could be a fairly lengthy
>      * operation. But this should happen very seldom.
>      */
>     if (P_INCOMPLETE_SPLIT(lpageop))
>     {
>         _bt_finish_split(rel, rbuf, stack);
>         rbuf = InvalidBuffer;
>         continue;
>     }
>
> The "left sibling" referred to here is "the first page this key could
> be on", an important concept for unique index enforcement.

No, it's not "the first page this key could be on".

_bt_findinsertloc() does *not* hold a lock on the first valid page the
key could go to. It merely ensures that when it steps to the next page,
it releases the lock on the previous page only after acquiring the lock
on the next page. Throughout the operation, it will hold a lock on
*some* page that could legally hold the inserted value, and it acquires
the locks in left-to-right order. This is sufficient for the uniqueness
checks, because _bt_check_unique() scans all the pages, and
_bt_check_unique() *does* hold the first page locked while it scans the
rest of the pages. But _bt_findinsertloc() does not.
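
To illustrate, the stepping-right logic has roughly this shape
(condensed from nbtinsert.c, with details elided):

    while (PageGetFreeSpace(page) < itemsz)
    {
        /* ... decide whether we may step right at all ... */

        /*
         * Write-lock the right sibling *before* releasing the lock on
         * the current page, so that at every moment some page that
         * could legally contain the key is locked.
         */
        rbuf = _bt_relandgetbuf(rel, rbuf, rblkno, BT_WRITE);

        /* ... finish any incomplete split, skip dead pages ... */

        _bt_relbuf(rel, buf);    /* only now release the previous page */
        buf = rbuf;
    }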

Also note that _bt_moveright() also finishes any incomplete splits it
encounters (when called for an insertion). So _bt_findinsertloc() never
gets called on a page with the incomplete-split flag set. It might
encounter one when it moves right, but never the first page.

> Anyway, the concern is that there may be problems when we call
> _bt_finish_split() with that left sibling locked throughout (i.e.
> finish a split where the right sibling is BTP_INCOMPLETE_SPLIT, and
> itself has a right sibling from the incomplete split (which is usually
> the value lock page's right-right sibling)). I'm not concerned about
> performance, since as the comment says it ought to be an infrequent
> occurrence. I also believe that there are no deadlock hazards. But
> consider this scenario:
>
> * We insert the value 7 into an int4 unique index. We need to split
> the leaf page. We run out of memory or something, and ours is an
> incomplete page split. Our transaction is aborted. For the sake of
> argument, suppose that there are also already a bunch of dead tuples
> within the index with values of 7, 8 and 9.

If I understood correctly, the tree looks like this before the insertion:

Parent page:
+-------------+
| |
| 9 -> A |
+-------------+

Leaf A
+-------------+
| HI-key: 9 |
| |
| 7 8 9 |
+-------------+

And after insertion and incomplete split:

Parent page
+-------------+
| |
| 9 -> A |
+-------------+

Leaf A Leaf B
+--------------+ +-------------+
| HI-key: 8 | | HI-key: 9 |
| (INCOMPLETE_ | | |
| SPLIT) | <-> | |
| | | |
| 7 7* 8 | | 9 |
+--------------+ +-------------+

where 7* is the newly inserted key, with value 7.

(you didn't mention at what point the split happens, but in the next
paragraph you said the new downlink has value 8, which implies the above
split)

> * Another inserter of the value 7 comes along. It follows exactly the
> same downlink as the first, now aborted inserter (suppose the
> downlink's value is 9). It also locks the same leaf page to establish
> a "value lock" in precisely the same manner. Finding no room on the
> first page, it looks further right, maintaining its original "value
> lock" throughout. It finishes the first inserter's split of the
> non-value-lock page - a new downlink is inserted into the parent page,
> with the value 8. It then releases all buffer locks except the first
> one/original "value lock". A physical insertion has yet to occur.

Hmm, I think you got confused at this step. When inserting a 7, you
cannot "look further right" to find a page with more space, because the
HI-key, 8, on the first page stipulates that 7 must go on that page, not
some later page.

> * A third inserter of the value 7 comes along. It gets to a later page
> than the one locked by the second inserter, preferring the newer
> downlink with value 8 (the internal-page _bt_binsrch() logic ensures
> this).

That's a contradiction: the downlink with value 8 points to the first
page, not some later page. After the split is finished, the tree looks
like this:

Parent page
+-------------+
| 8 -> A |
| 9 -> B |
+-------------+

Leaf A Leaf B
+-------------+ +-------------+
| HI-key: 8 | | HI-key: 9 |
| | <-> | |
| 7 7* 8 | | 9 |
+-------------+ +-------------+

> Regardless of whether or not I have these details exactly right (that
> is, regardless of whether or not this scenario is strictly possible) I
> suggest as a code-hardening measure that _bt_findinsertloc() release
> its "value lock", upon realizing it must complete splits, and then
> complete the split or splits known to be required. It would finally
> report that it "couldn't find" an insertion location to
> _bt_doinsert(), which would then retry from the start, just like when
> _bt_check_unique() finds an inconclusive conflict. The only difference
> is that we don't have an xact to wait on. We haven't actually done
> anything extra that makes this later "goto top;" any different to the
> existing one.
>
> This should occur so infrequently that it isn't worth trying harder,
> or worth differentiating between the UNIQUE_CHECK_NO and
> !UNIQUE_CHECK_NO cases when retrying. This also removes the more
> general risk of holding an extra buffer lock during page split
> completion.

Yeah, that would work too. It seems safe enough as it is, though, so I
don't see the point.

> It kind of looks like _bt_findinsertloc() doesn't have this bug,
> because in my scenario _bt_finish_split() is called with both the
> value lock and its right page locked (so the right page is the left
> page for _bt_finish_split()'s purposes). But look again: _bt_finish_split()
> releases its locks, so once again only the original value lock is held
> when _bt_finish_split() returns, and the new downlink is in place to
> skip over the value-locked page. So the bug does exist (assuming that
> I haven't failed to consider some third factor and am not otherwise
> mistaken).

When inserting, the scan for the right insert location always begins
from the first page where the key can legally go to. Inserting a missing
downlink doesn't change what that page is - it just makes it faster to
find, by reducing the number of right-links you need to follow.

PS. Thanks for looking into this again! These B-tree changes really need
thorough review.

- Heikki


From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Race condition within _bt_findinsertloc()? (new page split code)
Date: 2014-05-27 18:47:39
Message-ID: CAM3SWZT4bgJTVqb8XLvhxgap9zeEr2yjUTnSOb1MiZtL6uG8fg@mail.gmail.com
Lists: pgsql-hackers

On Tue, May 27, 2014 at 4:54 AM, Heikki Linnakangas
<hlinnakangas(at)vmware(dot)com> wrote:
>> The "left sibling" referred to here is "the first page this key could
>> be on", an important concept for unique index enforcement.
>
> No, it's not "the first page this key could be on".

Well, it may be initially. I could have been more cautious about the
terminology here.

> Also note that _bt_moveright() also finishes any incomplete splits it
> encounters (when called for an insertion). So _bt_findinsertloc() never gets
> called on a page with the incomplete-split flag set. It might encounter one
> when it moves right, but never the first page.

Fair enough, but I don't think that affects correctness either way (I
don't think that you meant to imply that this was a necessary
precaution that you'd taken - right?). It's a nice property, since it
makes the extra locking while completing a split within
_bt_findinsertloc() particularly infrequent. But, that might also be a
bad thing, when considered from a different perspective.

> If I understood correctly, the tree looks like this before the insertion:
>
> Parent page:
> +-------------+
> | |
> | 9 -> A |
> +-------------+
>
> Leaf A
> +-------------+
> | HI-key: 9 |
> | |
> | 7 8 9 |
> +-------------+
>
> And after insertion and incomplete split:
>
> Parent page
> +-------------+
> | |
> | 9 -> A |
> +-------------+
>
> Leaf A Leaf B
> +--------------+ +-------------+
> | HI-key: 8 | | HI-key: 9 |
> | (INCOMPLETE_ | | |
> | SPLIT) | <-> | |
> | | | |
> | 7 7* 8 | | 9 |
> +--------------+ +-------------+

> After the split is finished, the tree looks like this:
>
> Parent page
> +-------------+
> | 8 -> A |
> | 9 -> B |
> +-------------+
>
> Leaf A Leaf B
> +-------------+ +-------------+
> | HI-key: 8 | | HI-key: 9 |
> | | <-> | |
> | 7 7* 8 | | 9 |
> +-------------+ +-------------+

How did the parent page change between before and after the final
atomic operation (page split completion)? What happened to "9 -> A"?

>> Regardless of whether or not I have these details exactly right (that
>> is, regardless of whether or not this scenario is strictly possible) I
>> suggest as a code-hardening measure that _bt_findinsertloc() release
>> its "value lock", upon realizing it must complete splits, and then
>> complete the split or splits known to be required. It would finally
>> report that it "couldn't find" an insertion location to
>> _bt_doinsert(), which would then retry from the start, just like when
>> _bt_check_unique() finds an inconclusive conflict.
>
> Yeah, that would work too. It seems safe enough as it is, though, so I don't
> see the point.

Well, it would be nice to not have to finish the page split in what is
a particularly critical path, with that extra buffer lock. It's not
strictly necessary, but then it is theoretically safer, and certainly
much clearer. The fact that this code is so seldom executed is one
issue that made me revisit this. On the other hand, I can see why
you'd want to avoid cluttering up the relatively comprehensible
_bt_doinsert() function if it could be avoided. I defer to you.

> PS. Thanks for looking into this again! These B-tree changes really need
> thorough review.

You're welcome. Hopefully my questions will lead you in a useful
direction, even if my concerns turn out to be, in the main, unfounded.
:-)

It previously wasn't in evidence that you'd considered these
interactions, and I feel better knowing that you have.
--
Peter Geoghegan


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Peter Geoghegan <pg(at)heroku(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Race condition within _bt_findinsertloc()? (new page split code)
Date: 2014-05-27 19:19:26
Message-ID: 5384E53E.5030205@vmware.com
Lists: pgsql-hackers

On 05/27/2014 09:47 PM, Peter Geoghegan wrote:
> On Tue, May 27, 2014 at 4:54 AM, Heikki Linnakangas
> <hlinnakangas(at)vmware(dot)com> wrote:
>> Also note that _bt_moveright() also finishes any incomplete splits it
>> encounters (when called for an insertion). So _bt_findinsertloc() never gets
>> called on a page with the incomplete-split flag set. It might encounter one
>> when it moves right, but never the first page.
>
> Fair enough, but I don't think that affects correctness either way (I
> don't think that you meant to imply that this was a necessary
> precaution that you'd taken - right?).

Right.

>> If I understood correctly, the tree looks like this before the insertion:
>>
>> Parent page:
>> +-------------+
>> | |
>> | 9 -> A |
>> +-------------+
>>
>> Leaf A
>> +-------------+
>> | HI-key: 9 |
>> | |
>> | 7 8 9 |
>> +-------------+
>>
>> And after insertion and incomplete split:
>>
>> Parent page
>> +-------------+
>> | |
>> | 9 -> A |
>> +-------------+
>>
>> Leaf A Leaf B
>> +--------------+ +-------------+
>> | HI-key: 8 | | HI-key: 9 |
>> | (INCOMPLETE_ | | |
>> | SPLIT) | <-> | |
>> | | | |
>> | 7 7* 8 | | 9 |
>> +--------------+ +-------------+
>
>> After the split is finished, the tree looks like this:
>>
>> Parent page
>> +-------------+
>> | 8 -> A |
>> | 9 -> B |
>> +-------------+
>>
>> Leaf A Leaf B
>> +-------------+ +-------------+
>> | HI-key: 8 | | HI-key: 9 |
>> | | <-> | |
>> | 7 7* 8 | | 9 |
>> +-------------+ +-------------+
>
> How did the parent page change between before and after the final
> atomic operation (page split completion)? What happened to "9 -> A"?

Ah, sorry, I got that wrong. The downlinks store the *low* key of the
child page, not the high key as I depicted. Let me try again:

On 05/27/2014 09:17 AM, Peter Geoghegan wrote:
> Anyway, the concern is that there may be problems when we call
> _bt_finish_split() with that left sibling locked throughout (i.e.
> finish a split where the right sibling is BTP_INCOMPLETE_SPLIT, and
> itself has a right sibling from the incomplete split (which is usually
> the value lock page's right-right sibling)). I'm not concerned about
> performance, since as the comment says it ought to be an infrequent
> occurrence. I also believe that there are no deadlock hazards. But
> consider this scenario:
>
> * We insert the value 7 into an int4 unique index. We need to split
> the leaf page. We run out of memory or something, and ours is an
> incomplete page split. Our transaction is aborted. For the sake of
> argument, suppose that there are also already a bunch of dead tuples
> within the index with values of 7, 8 and 9.

So, initially the tree looks like this:

Parent page:
+-------------+
| |
| 7 -> A |
+-------------+

Leaf A
+-------------+
| HI-key: 9 |
| |
| 7 8 9 |
+-------------+

And after insertion and incomplete split:

Parent page
+-------------+
| |
| 7 -> A |
+-------------+

Leaf A Leaf B
+--------------+ +-------------+
| HI-key: 8 | | HI-key: 9 |
| (INCOMPLETE_ | | |
| SPLIT) | <-> | |
| | | |
| 7 7* 8 | | 9 |
+--------------+ +-------------+

where 7* is the newly inserted key, with value 7.

(you didn't mention at what point the split happens, but in the next
paragraph you said the new downlink has value 8, which implies the above
split)

> * Another inserter of the value 7 comes along. It follows exactly the
> same downlink as the first, now aborted inserter (suppose the
> downlink's value is 9). It also locks the same leaf page to establish
> a "value lock" in precisely the same manner. Finding no room on the
> first page, it looks further right, maintaining its original "value
> lock" throughout. It finishes the first inserter's split of the
> non-value-lock page - a new downlink is inserted into the parent page,
> with the value 8. It then releases all buffer locks except the first
> one/original "value lock". A physical insertion has yet to occur.

The downlink of the original page cannot contain 9. Because, as I now
remember ;-), the downlinks contain low-keys, not high keys.

- Heikki


From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Race condition within _bt_findinsertloc()? (new page split code)
Date: 2014-05-27 20:30:44
Message-ID: CAM3SWZSKistAiyTSh8Wentd8X5hDhhLQgaspBAALLzZK5XznfQ@mail.gmail.com
Lists: pgsql-hackers

On Tue, May 27, 2014 at 12:19 PM, Heikki Linnakangas
<hlinnakangas(at)vmware(dot)com> wrote:
> Ah, sorry, I got that wrong. The downlinks store the *low* key of the child
> page, not the high key as I depicted. Let me try again:

Would you mind humoring me, and including a corrected final
post-downlink-insert diagram, when the split is fully complete? You
omitted that.

--
Peter Geoghegan


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Peter Geoghegan <pg(at)heroku(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Race condition within _bt_findinsertloc()? (new page split code)
Date: 2014-05-27 20:45:50
Message-ID: 5384F97E.7070101@vmware.com
Lists: pgsql-hackers

On 05/27/2014 11:30 PM, Peter Geoghegan wrote:
> On Tue, May 27, 2014 at 12:19 PM, Heikki Linnakangas
> <hlinnakangas(at)vmware(dot)com> wrote:
>> Ah, sorry, I got that wrong. The downlinks store the *low* key of the child
>> page, not the high key as I depicted. Let me try again:
>
> Would you mind humoring me, and including a corrected final
> post-downlink-insert diagram, when the split is fully complete? You
> omitted that.

Sure:

Parent page
+-------------+
| 7 -> A |
| 8 -> B |
+-------------+

Leaf A Leaf B
+--------------+ +-------------+
| HI-key: 8 | | HI-key: 9 |
| | | |
| | <-> | |
| | | |
| 7 7* 8 | | 9 |
+--------------+ +-------------+

- Heikki


From: John Lumby <johnlumby(at)hotmail(dot)com>
To: pgsql hackers <pgsql-hackers(at)postgresql(dot)org>
Cc: Claudio Freire <klaussfreire(at)gmail(dot)com>
Subject: Extended Prefetching using Asynchronous IO - proposal and patch
Date: 2014-05-27 22:17:01
Message-ID: BAY175-W45086073075CA064EFE9A0A33A0@phx.gbl
Lists: pgsql-hackers

Claudio Freire and I are proposing new functionality for Postgresql
to extend the scope of prefetching and also exploit posix asynchronous IO
when doing prefetching, and have a patch based on 9.4dev
ready for consideration.

This topic has cropped up at irregular intervals over the years,
e.g. this thread back in 2012
www.postgresql.org/message-id/CAGTBQpbu2M=-M7NUr6DWr0K8gUVmXVhwKohB-Cnj7kYS1AhH4A@mail.gmail.com
and this thread more recently
http://www.postgresql.org/message-id/CAGTBQpaFC_z=zdWVAXD8wWss3v6jxZ5pNmrrYPsD23LbrqGvgQ@mail.gmail.com

We now have an implementation which gives useful performance improvement
as well as other advantages compared to what is currently available,
at least for certain environments.

Below I am pasting the README we have written for this new functionality
which mentions some of the measurements, advantages (and disadvantages)
and we welcome all and any comments on this.

I will send the patch to commitfest later, once this email is posted to hackers,
so that anyone who wishes can try it, or you can ask me for it directly.
The patch is currently based on 9.4dev but a version based on 9.3.4
will be available soon if anyone wants that. The patch is large (43 files)
so non-trivial to review, but any comments on it (when posted) will be
appreciated and acted on. Note that at present the only environment
in which it has been applied and tested is Linux.

John Lumby
__________________________________________________________________________________________________

Postgresql -- Extended Prefetching using Asynchronous IO
============================================================

Postgresql currently (9.3.4) provides a limited prefetching capability
using posix_fadvise to give hints to the Operating System kernel
about which pages it expects to read in the near future.
This capability is used only during the heap-scan phase of bitmap-index scans.
It is controlled via the effective_io_concurrency configuration parameter.
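
For reference, at the file level that posix_fadvise hinting amounts to
something like the following standalone sketch (for illustration only;
not the actual PostgreSQL code):

    #define _XOPEN_SOURCE 600
    #include <fcntl.h>

    /* hint the kernel that one block will be read soon; the call only
       queues readahead and returns immediately */
    static void
    hint_block(int fd, off_t blockno, off_t block_size)
    {
        (void) posix_fadvise(fd, blockno * block_size, block_size,
                             POSIX_FADV_WILLNEED);
    }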

This capability is now extended in two ways :
. use asynchronous IO into Postgresql shared buffers as an
  alternative to posix_fadvise
. implement prefetching in other types of scan :
  . non-bitmap (i.e. simple) index scans - index pages
    (currently only for B-tree indexes;
    developed by Claudio Freire <klaussfreire(at)gmail(dot)com>)
  . non-bitmap (i.e. simple) index scans - heap pages
    (currently only for B-tree indexes)
  . simple heap scans

Posix asynchronous IO is chosen as the function library for asynchronous IO,
since this is well supported and also fits very well with the model of
the prefetching process, particularly as regards checking for completion
of an asynchronous read. On linux, Posix asynchronous IO is provided
in the librt library. librt uses independently-schedulable threads to
achieve the asynchronicity, rather than kernel functionality.
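
The basic librt calls involved are sketched below (a standalone
illustration of the Posix AIO API, not the patch code itself):

    #include <aio.h>
    #include <errno.h>
    #include <string.h>

    /* start an asynchronous read of one block; aio_read() returns at
       once and a librt thread performs the actual read */
    static int
    start_prefetch_read(struct aiocb *cb, int fd, void *buf,
                        off_t offset, size_t nbytes)
    {
        memset(cb, 0, sizeof(*cb));
        cb->aio_fildes = fd;
        cb->aio_buf    = buf;
        cb->aio_nbytes = nbytes;
        cb->aio_offset = offset;
        return aio_read(cb);
    }

    /* completion check: aio_error() returns EINPROGRESS while the read
       is still running, 0 when it has completed successfully */
    static int
    prefetch_done(const struct aiocb *cb)
    {
        return aio_error(cb) != EINPROGRESS;
    }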

In this implementation, use of asynchronous IO is limited to prefetching
while performing one of the three types of scan
. B-tree bitmap index scan - heap pages (as already exists)
. B-tree non-bitmap (i.e. simple) index scans - index and heap pages
. simple heap scans
on permanent relations. It is not used on temporary tables nor for writes.

The advantages of Posix asynchronous IO into shared buffers
compared to posix_fadvise are :
. Beneficial for non-sequential access patterns as well as sequential
. No restriction on the kinds of IO which can be used
(other kinds of asynchronous IO impose restrictions such as
buffer alignment, use of non-buffered IO).
. Does not interfere with standard linux kernel read-ahead functionality.
(It has been stated in
www.postgresql.org/message-id/CAGTBQpbu2M=-M7NUr6DWr0K8gUVmXVhwKohB-Cnj7kYS1AhH4A@mail.gmail.com
that :
"the kernel stops doing read-ahead when a call to posix_fadvise comes.
I noticed the performance hit, and checked the kernel's code.
It effectively changes the prediction mode from sequential to fadvise,
negating the (assumed) kernel's prefetch logic")
. When the read request is issued after a prefetch has completed,
no delay associated with a kernel call to copy the page from
kernel page buffers into the Postgresql shared buffer,
since it is already there.
Also, in a memory-constrained environment, there is a greater
probability that the prefetched page will "stick" in memory
since the linux kernel victimizes the filesystem page cache in preference
to swapping out user process pages.
. Statistics on prefetch success can be gathered (see "Statistics" below)
which helps the administrator to tune the prefetching settings.

These benefits are most likely to be obtained in a system whose usage profile
(e.g. from iostat) shows:
. high IO wait from mostly-read activity
. disk access pattern is not entirely sequential
(so kernel readahead can't predict it but postgresql can)
. sufficient spare idle CPU to run the librt pthreads
or, stated another way, the CPU subsystem is relatively powerful
compared to the disk subsystem.
In such ideal conditions, and with a workload with plenty of index scans,
around 10% - 20% improvement in throughput has been achieved.
In an admittedly extreme environment measured by this author, the workload
consisted of 8 client applications each running similar complex queries
(same query structure but different predicates and constants),
including 2 Bitmap Index Scans and 17 non-bitmap index scans,
on a dual-core Intel laptop (4 hyperthreads) with the database on a single
USB3-attached 500GB disk drive, and with no part of the database in filesystem
buffers initially (filesystem freshly mounted). Comparing an unpatched build
using posix_fadvise with effective_io_concurrency 4 against the same build patched
with async IO, effective_io_concurrency 4 and max_async_io_prefetchers 32,
elapsed time repeatably improved from around 640-670 seconds to around 530-550 seconds,
a 17% - 18% improvement.

The disadvantages of Posix asynchronous IO compared to posix_fadvise are:
. probably higher CPU utilization:
Firstly, the extra work performed by the librt threads adds CPU
overhead, and secondly, if the asynchronous prefetching is effective,
then it will deliver better (greater) overlap of CPU with IO, which
will reduce elapsed times and hence increase CPU utilization percentage
still more (during that shorter elapsed time).
. more context switching, because of the additional threads.

Statistics:
___________

A number of additional statistics relating to effectiveness of asynchronous IO
are provided as an extension of the existing pg_stat_statements loadable module.
Refer to the appendix "Additional Supplied Modules" in the current
PostgreSQL Documentation for details of this module.

The following additional statistics are provided for asynchronous IO prefetching:

. aio_read_noneed : number of prefetches for which no need for prefetch as block already in buffer pool
. aio_read_discrd : number of prefetches for which buffer not subsequently read and therefore discarded
. aio_read_forgot : number of prefetches for which buffer not subsequently read and then forgotten about
. aio_read_noblok : number of prefetches for which no available BufferAiocb control block
. aio_read_failed : number of aio reads for which aio itself failed or the read failed with an errno
. aio_read_wasted : number of aio reads for which in-progress aio cancelled and disk block not used
. aio_read_waited : number of aio reads for which disk block used but had to wait for it
. aio_read_ontime : number of aio reads for which disk block used and ready on time when requested

Some of these are (hopefully) self-explanatory. Some additional notes:

. aio_read_discrd and aio_read_forgot :
prefetch was wasted work since the buffer was not subsequently read
The discrd case indicates that the scanner realized this and discarded the buffer,
whereas the forgot case indicates that the scanner did not realize it,
which should not normally occur.
A high number in either suggests lowering effective_io_concurrency.

. aio_read_noblok :
Any significant number in relation to all the other numbers indicates that
max_async_io_prefetchers should be increased.

. aio_read_waited :
The page was prefetched but the asynchronous read had not completed by the time the
scanner requested to read it. This causes extra overhead in waiting and indicates that
prefetching is not providing much if any benefit.
The disk subsystem may be underpowered/overloaded in relation to the available CPU power.

. aio_read_ontime :
The page was prefetched and the asynchronous read had completed by the time the
scanner requested to read it. Optimal behaviour. If this number is large
in relation to all the other numbers except (possibly) aio_read_noneed,
then prefetching is working well.

To create the extension with support for these additional statistics, use the following syntax:
CREATE EXTENSION pg_stat_statements VERSION '1.3'
or, if you run the new code against an existing database which already has the extension
( see installation and migration below ), you can
ALTER EXTENSION pg_stat_statements UPDATE TO '1.3'

A suggested set of commands for displaying these statistics might be :

/* OPTIONALLY */ DROP extension pg_stat_statements;
CREATE extension pg_stat_statements VERSION '1.3';
/* run your workload */
select userid , dbid , substring(query from 1 for 24) , calls , total_time , rows , shared_blks_read , blk_read_time , blk_write_time
, aio_read_noneed , aio_read_noblok , aio_read_failed , aio_read_wasted , aio_read_waited , aio_read_ontime , aio_read_forgot
from pg_stat_statements where shared_blks_read > 0;

Installation and Build Configuration:
_____________________________________

1. First - a prerequisite:
# as well as requiring all the usual package build tools such as gcc , make etc,
# as described in the instructions for building postgresql,
# the following is required :
gnu autoconf at version 2.69 :
# run the following command
autoconf -V
# it *must* return
autoconf (GNU Autoconf) 2.69

2. If you don't have it or it is a different version,
then you must obtain version 2.69 (which is the current version)
from your distribution provider or from the gnu software download site.

3. Also you must have the source tree for postgresql version 9.4 (development version).
# all the following commands assume your current working directory is the top of the source tree.

4. cd to top of source tree :
# check it appears to be a postgresql source tree
ls -ld configure.in src
# should show both the file and the directory
grep PostgreSQL COPYRIGHT
# should show PostgreSQL Database Management System

5. Apply the patch :
patch -b -p0 -i <patch_file_path>
# should report no errors, 43 files patched (see list at bottom of this README)
# and all hunks applied
# check the patch was applied to configure.in
ls -ld configure.in.orig configure.in
# should show both files

6. Rebuild the configure script with the patched configure.in :
mv configure configure.orig;
autoconf configure.in >configure;echo "rc= $? from autoconf"; chmod +x configure;
ls -lrt configure.orig configure;

7. run the new configure script :
# if you have run configure before,
# then you may first want to save existing config.status and config.log if they exist,
# and then specify same configure flags and options as you specified before.
# the patch does not alter or extend the set of configure options
# if unsure, run ./configure --help
# if still unsure, run ./configure
./configure <other configure options as desired>

8. now check that configure decided that this environment supports asynchronous IO :
grep USE_AIO_ATOMIC_BUILTIN_COMP_SWAP src/include/pg_config.h
# it should show
#define USE_AIO_ATOMIC_BUILTIN_COMP_SWAP 1
# if not, apparently your environment does not support asynch IO -
# the config.log will show how it came to that conclusion,
# also check for :
# . a librt.so somewhere in the loader's library path (probably under /lib , /lib64 , or /usr)
# . your gcc must support the atomic compare_and_swap __sync_bool_compare_and_swap built-in function
# do not proceed without this define being set.

9. do you want to use the new code on an existing cluster
that was created using the same code base but without the patch?
If so then run this nasty-looking command :
(cut-and-paste it into a terminal window or a shell-script file)
Otherwise continue to step 10.
see Migration note below for explanation.
###############################################################################################
fl=src/Makefile.global; typeset -i bkx=0; while (( bkx < 200 )); do {
bkfl="${fl}.bak${bkx}"; if [[ -a ${bkfl} ]]; then ((bkx=bkx+1)); else break; fi;
}; done;
if [[ -a ${bkfl} ]]; then echo "sorry cannot find a backup name for $fl";
elif [[ -a $fl ]]; then {
mv $fl $bkfl && {
sed -e "/^CFLAGS =/ s/\$/ -DAVOID_CATALOG_MIGRATION_FOR_ASYNCIO/" $bkfl > $fl;
str="diff -w $bkfl $fl";echo "$str"; eval "$str";
};
};
else echo "ooopppss $fl is missing";
fi;
###############################################################################################
# it should report something like
diff -w Makefile.global.bak0 Makefile.global
222c222
< CFLAGS = XXXX
---
> CFLAGS = XXXX -DAVOID_CATALOG_MIGRATION_FOR_ASYNCIO
# where XXXX is some set of flags

10. now run the rest of the build process as usual -
follow instructions in file INSTALL if that file exists,
else e.g. run
make && make install

If the build fails with the following error:
undefined reference to `aio_init'
Then edit the following file
src/include/pg_config_manual.h
and add the following line at the bottom:

#define DONT_HAVE_AIO_INIT

and then run
make clean && make && make install
See notes to section Runtime Configuration below for more information on this.

Migration , Runtime Configuration, and Use:
___________________________________________

Database Migration:
___________________

The new prefetching code for non-bitmap index scans introduces a new btree-index
function named btpeeknexttuple. The correct way to add such a function involves
also adding it to the catalog as an internal function in pg_proc.
However, this results in the newly built code considering an existing database to be
incompatible, i.e. requiring backup on the old code and restore on the new.
This is normal behaviour for migration to a new version of postgresql, and is
also a valid way of migrating a database for use with this asynchronous IO feature,
but in this case it may be inconvenient.

As an alternative, the new code may be compiled with the macro define
AVOID_CATALOG_MIGRATION_FOR_ASYNCIO
which does what it says by not altering the catalog. The patched build can then
be run against an existing database cluster initdb'd using the unpatched build.

There are no known ill-effects of so doing, but :
. in any case, it is strongly suggested to make a backup of any precious database
before accessing it with a patched build
. be aware that if this asynchronous IO feature is eventually released as part of postgresql,
migration will probably be required anyway.

This option to avoid catalog migration is intended as a convenience for a quick test,
and also makes it easier to obtain performance comparisons on the same database.

Runtime Configuration:
______________________

One new configuration parameter settable in postgresql.conf and
in any other way as described in the postgresql documentation :

max_async_io_prefetchers
Maximum number of background processes concurrently using asynchronous
librt threads to prefetch pages into shared memory buffers

This number can be thought of as the maximum number
of librt threads concurrently active, each working on a list of
from 1 to target_prefetch_pages pages ( see notes 1 and 2 ).

In practice, this number simply controls how many prefetch requests in total
may be active concurrently :
max_async_io_prefetchers * target_prefetch_pages ( see note 1)

default is max_connections/6
and recall that the default for max_connections is 100

note 1 : a number derived from effective_io_concurrency, approximately n * ln(n)
where n is effective_io_concurrency

note 2 Provided that the gnu extension to Posix AIO which provides the
aio_init() function is present, then aio_init() is called
to set the librt maximum number of threads to max_async_io_prefetchers,
and to set the maximum number of concurrent aio read requests to the product of
max_async_io_prefetchers * target_prefetch_pages
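
As an illustration of notes 1 and 2, and assuming target_prefetch_pages is
derived from effective_io_concurrency in the same way as for the existing
posix_fadvise support (the sum of n/i for i = 1..n, approximately n * ln(n)),
the aio_init() call amounts to roughly the following sketch:

    #define _GNU_SOURCE
    #include <aio.h>

    static void
    configure_librt(int max_async_io_prefetchers,
                    int effective_io_concurrency)
    {
        struct aioinit init = { 0 };
        double target_prefetch_pages = 0.0;
        int    i;

        /* sum of n/i for i = 1..n, approximately n * ln(n) */
        for (i = 1; i <= effective_io_concurrency; i++)
            target_prefetch_pages +=
                (double) effective_io_concurrency / (double) i;

        init.aio_threads = max_async_io_prefetchers;
        init.aio_num = max_async_io_prefetchers *
                       (int) target_prefetch_pages;
        aio_init(&init);    /* GNU extension; absent on non-glibc systems */
    }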

As well as this regular configuration parameter,
there are several other parameters that can be set via environment variable.
The reason why they are environment vars rather than regular configuration parameters
is that it is not expected that they should need to be set, but they may be useful :
variable name                  values                 default  meaning
PG_TRY_PREFETCHING_FOR_BITMAP  [Y|N]                  Y        whether to prefetch bitmap heap scans
PG_TRY_PREFETCHING_FOR_ISCAN   [Y|N|integer[,[N|Y]]]  256,N    whether to prefetch non-bitmap index scans;
                                                               also numeric size of list of prefetched blocks;
                                                               also whether to prefetch forward-sequential-pattern index pages
PG_TRY_PREFETCHING_FOR_BTREE   [Y|N]                  Y        whether to prefetch heap pages in non-bitmap index scans
PG_TRY_PREFETCHING_FOR_HEAP    [Y|N]                  N        whether to prefetch relation (un-indexed) heap scans

The setting for PG_TRY_PREFETCHING_FOR_ISCAN is a little complicated.
It can be set to Y or N to control prefetching of non-bitmap index scans;
but in addition it can be set to an integer, which both implies Y
and also sets the size of a list used to remember prefetched but unread heap pages.
This list is an optimization used to avoid re-prefetching and to maximise the potential
set of prefetchable blocks indexed by one index page.
If set to an integer, the integer may be followed by either ,Y or ,N
to specify whether to prefetch index pages which are being accessed forward-sequentially;
for example, PG_TRY_PREFETCHING_FOR_ISCAN=512,Y enables index-scan prefetching
with a 512-entry list and also prefetches forward-sequentially accessed index pages.
It has been found that prefetching is not of great benefit for this access pattern,
and so it is not the default, but also does no harm (provided sufficient CPU capacity).

Usage :
______

There are no changes in usage other than as noted under Configuration and Statistics.
However, in order to assess benefit from this feature, it will be useful to
understand the query access plans of your workload using EXPLAIN. Before doing that,
make sure that statistics are up to date using ANALYZE.

Internals:
__________

Internal changes span two areas and the interface between them :

. buffer manager layer
. programming interface for scanner to call buffer manager
. scanner layer

. buffer manager layer
____________________

changes comprise :
. allocating, pinning and unpinning buffers
  this is complex and discussed briefly below in "Buffer Management"
. acquiring and releasing a BufferAiocb, the control block
  associated with a single aio_read, and checking for its completion
  a new file, backend/storage/buffer/buf_async.c, provides three new functions,
  BufStartAsync, BufReleaseAsync and BufCheckAsync,
  which handle this.
. calling librt asynch io functions
  this follows the example of all other filesystem interfaces
  and is straightforward.
  two new functions are provided in fd.c:
  FileStartaio and FileCompleteaio
  and corresponding interfaces in smgr.c

. programming interface for scanner to call buffer manager
________________________________________________________
. calling interface for existing function PrefetchBuffer is modified :
  . one new argument, BufferAccessStrategy strategy
  . now returns an int return code which indicates :
    whether the pin count on the buffer has been increased by 1
    whether the block was already present in a buffer
. new function DiscardBuffer
  . discards the buffer used for a previously prefetched page
    which the scanner decides it does not want to read.
  . same arguments as for PrefetchBuffer except for omission of BufferAccessStrategy
  . note - this is different from the existing function ReleaseBuffer,
    which takes a buffer_descriptor as argument for a buffer which has
    been read, but has a similar purpose.

. scanner layer
_____________
common to all scanners is that a scanner which wishes to prefetch must do the following
(a condensed sketch of the resulting call pattern appears at the end of this section) :
. decide which pages to prefetch and call PrefetchBuffer to prefetch them
  (nodeBitmapHeapscan already does this, but note one extra argument on PrefetchBuffer)
. remember which pages it has prefetched in some list (actual or conceptual, e.g. a page range),
  removing each page from this list if and when it subsequently reads the page.
. at end of scan, call DiscardBuffer for every remembered (i.e. prefetched but unread) page
how this list of prefetched pages is implemented varies for each of the three scanners and four scan types:
. bitmap index scan - heap pages
. non-bitmap (i.e. simple) index scans - index pages
. non-bitmap (i.e. simple) index scans - heap pages
. simple heap scans
The consequences of forgetting to call DiscardBuffer on a prefetched but unread page are:
. counted in aio_read_forgot (see "Statistics" above)
. may incur an annoying but harmless warning in the pg_log "Buffer Leak ... "
(the buffer is released at commit)
This does sometimes happen ...
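
A condensed sketch of that call pattern from the scanner's side follows.
It is illustrative only : argument lists are abbreviated, and
remember_block/forget_block are hypothetical stand-ins for whatever
bookkeeping each scanner actually uses.

    /* some way ahead of the current scan position, for each
       interesting block: */
    rc = PrefetchBuffer(rel, forknum, blkno, strategy);
    remember_block(prefetched_list, blkno);    /* hypothetical bookkeeping */

    /* when the scan reaches the block: */
    buf = ReadBuffer(rel, blkno);              /* redeems the prefetch pin */
    forget_block(prefetched_list, blkno);

    /* at end of scan, for each remembered-but-unread block: */
    DiscardBuffer(rel, forknum, blkno);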

Buffer Management
_________________

With async io, PrefetchBuffer must allocate and pin a buffer, which is relatively straightforward,
but also every other part of the buffer manager must know about the possibility that a buffer may be
in async_io_in_progress state and be prepared to determine the possible completion.
That is, one backend BK1 may start the io but another BK2 may try to read it before BK1 does.
Posix Asynchronous IO provides a means for waiting on this or another task's read if in progress,
namely aio_suspend(), which this extension uses. Therefore, although StartBufferIO and TerminateBufferIO
are called as part of asynchronous prefetching, their role is limited to maintaining the buffer descriptor flags,
and they do not track the asynchronous IO itself. Instead, asynchronous IOs are tracked in
a separate set of shared control blocks, the BufferAiocb list -
refer to include/storage/buf_internals.h
Checking asynchronous io status is handled in backend/storage/buffer/buf_async.c BufCheckAsync function.
Read the commentary for this function for more details.
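
The waiting itself uses the standard aio_suspend() call pattern, roughly as
follows (a standalone illustration of the library calls, not the patch code):

    #include <aio.h>
    #include <errno.h>

    /* wait until an asynchronous read already in progress completes */
    static void
    wait_for_read(const struct aiocb *cb)
    {
        const struct aiocb *list[1] = { cb };

        while (aio_error(cb) == EINPROGRESS)
        {
            /* blocks until completion (or a signal; then re-check) */
            if (aio_suspend(list, 1, NULL) == -1 && errno != EINTR)
                break;
        }
    }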

Pinning and unpinning of buffers is the most complex aspect of asynch io prefetching,
and the logic is spread throughout BufStartAsync , BufCheckAsync , and many functions in bufmgr.c.
When a backend BK2 requests ReadBuffer of a page for which asynch read is in progress,
buffer manager has to determine which backend BK1 pinned this buffer during previous PrefetchBuffer,
and, for example, must not pin the buffer a second time if BK2 is BK1.
Information concerning which backend initiated the prefetch is held in the BufferAiocb.

The trickiest case concerns the scenario in which :
. BK1 initiates prefetch and acquires a pin
. BK2 possibly waits for completion and then reads the buffer, and perhaps later on
releases it by ReleaseBuffer.
. Since the asynchronous IO is no longer in progress, there is no longer any
BufferAiocb associated with it. Yet buffer manager must remember that BK1 holds a
"prefetch" pin, i.e. a pin which must not be repeated if and when BK1 finally issues ReadBuffer.
. The solution to this problem is to invent the concept of a "banked" pin,
which is a pin obtained when the prefetch was issued, identified as being in "banked" status only if and when
the associated asynchronous IO terminates, and redeemable by the next use by the same task,
either ReadBuffer or DiscardBuffer.
The pid of the backend which holds a banked pin on a buffer (there can be at most one such backend)
is stored in the buffer descriptor.
This is done without increasing size of the buffer descriptor, which is important since
there may be a very large number of these. This does overload the relevant field in the descriptor.
Refer to include/storage/buf_internals.h for more details
and search for BM_AIO_PREFETCH_PIN_BANKED in storage/buffer/bufmgr.c and backend/storage/buffer/buf_async.c

______________________________________________________________________________
The following 43 files are changed in this feature (output of the patch command) :

patching file configure.in
patching file contrib/pg_stat_statements/pg_stat_statements--1.3.sql
patching file contrib/pg_stat_statements/Makefile
patching file contrib/pg_stat_statements/pg_stat_statements.c
patching file contrib/pg_stat_statements/pg_stat_statements--1.2--1.3.sql
patching file config/c-library.m4
patching file src/backend/postmaster/postmaster.c
patching file src/backend/executor/nodeBitmapHeapscan.c
patching file src/backend/executor/nodeIndexscan.c
patching file src/backend/executor/instrument.c
patching file src/backend/storage/buffer/Makefile
patching file src/backend/storage/buffer/bufmgr.c
patching file src/backend/storage/buffer/buf_async.c
patching file src/backend/storage/buffer/buf_init.c
patching file src/backend/storage/smgr/md.c
patching file src/backend/storage/smgr/smgr.c
patching file src/backend/storage/file/fd.c
patching file src/backend/storage/lmgr/proc.c
patching file src/backend/access/heap/heapam.c
patching file src/backend/access/heap/syncscan.c
patching file src/backend/access/index/indexam.c
patching file src/backend/access/index/genam.c
patching file src/backend/access/nbtree/nbtsearch.c
patching file src/backend/access/nbtree/nbtinsert.c
patching file src/backend/access/nbtree/nbtpage.c
patching file src/backend/access/nbtree/nbtree.c
patching file src/backend/nodes/tidbitmap.c
patching file src/backend/utils/misc/guc.c
patching file src/backend/utils/mmgr/aset.c
patching file src/include/executor/instrument.h
patching file src/include/storage/bufmgr.h
patching file src/include/storage/smgr.h
patching file src/include/storage/fd.h
patching file src/include/storage/buf_internals.h
patching file src/include/catalog/pg_am.h
patching file src/include/catalog/pg_proc.h
patching file src/include/pg_config_manual.h
patching file src/include/access/nbtree.h
patching file src/include/access/heapam.h
patching file src/include/access/relscan.h
patching file src/include/nodes/tidbitmap.h
patching file src/include/utils/rel.h
patching file src/include/pg_config.h.in

Future Possibilities:
____________________

There are several possible extensions of this feature :
. Extend prefetching of index scans to types of index
other than B-tree.
This should be fairly straightforward, but requires some
good base of benchmarkable workloads to prove the value.
. Investigate why asynchronous IO prefetching does not greatly
improve sequential relation heap scans and possibly find how to
achieve a benefit.
. Build knowledge of asynchronous IO prefetching into the
Query Planner costing.
This is far from straightforward. The Postgresql Query Planner's
costing model is based on resource consumption rather than elapsed time.
Use of asynchronous IO prefetching is intended to improve elapsed time
at the expense of (probably) higher resource consumption.
Although Costing understands about the reduced cost of reading buffered
blocks, it does not take asynchronicity or overlap of CPU with disk
into account. A naive approach might be to try to tweak the Query
Planner's Cost Constant configuration parameters
such as seq_page_cost , random_page_cost
but this is hazardous as explained in the Documentation.

John Lumby, johnlumby(at)hotmail(dot)com


From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Race condition within _bt_findinsertloc()? (new page split code)
Date: 2014-05-27 23:15:03
Message-ID: CAM3SWZSCv7ogUpW125cpJXw37T7bjKpc6NgkK7SFoUNVeywbjg@mail.gmail.com
Lists: pgsql-hackers

On Tue, May 27, 2014 at 12:19 PM, Heikki Linnakangas
<hlinnakangas(at)vmware(dot)com> wrote:
>> Fair enough, but I don't think that affects correctness either way (I
>> don't think that you meant to imply that this was a necessary
>> precaution that you'd taken - right?).
>
>
> Right.

So, the comments within _bt_search() suggest that the _bt_moveright()
call will perform a _bt_finish_split() call opportunistically iff it's
called from _bt_doinsert() (i.e. access == BT_WRITE). There is no
reason to not do so in all circumstances though, assuming that it's
desirable to do so as soon as possible (which I *don't* actually
assume). If I'm not mistaken, it's also true that it would be strictly
speaking correct to never do it there. Do you think it would be a fair
stress-test if I was to hack Postgres so that this call never happens
(within _bt_moveright())? I'd also have an incomplete page split occur
at random with a probability of, say, 1% per split. The mechanism
would be much the same as your original test-case for the split patch
- I'd throw an error at the wrong place, although only 1% of the time,
and over many hours.

The concern with the _bt_moveright() call of _bt_finish_split() is
that it might conceal a problem without reliably fixing it,
potentially making isolating that problem much harder.

--
Peter Geoghegan


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Peter Geoghegan <pg(at)heroku(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Race condition within _bt_findinsertloc()? (new page split code)
Date: 2014-05-28 09:52:22
Message-ID: 5385B1D6.9070801@vmware.com
Lists: pgsql-hackers

On 05/28/2014 02:15 AM, Peter Geoghegan wrote:
> On Tue, May 27, 2014 at 12:19 PM, Heikki Linnakangas
> <hlinnakangas(at)vmware(dot)com> wrote:
>>> Fair enough, but I don't think that affects correctness either way (I
>>> don't think that you meant to imply that this was a necessary
>>> precaution that you'd taken - right?).
>>
>> Right.
>
> So, the comments within _bt_search() suggest that the _bt_moveright()
> call will perform a _bt_finish_split() call opportunistically iff it's
> called from _bt_doinsert() (i.e. access == BT_WRITE). There is no
> reason to not do so in all circumstances though, assuming that it's
> desirable to do so as soon as possible (which I *don't* actually
> assume).

You can't write in a hot standby, so that's one reason to only do it
when inserting, not when querying. Even when you're not in a hot
standby, it seems like a nice property that a read-only query doesn't do
writes. I know we do hint bit updates and other kinds of write-action
when reading anyway, but still.

> If I'm not mistaken, it's also true that it would be strictly
> speaking correct to never do it there.

Hmm. No, it wouldn't. It is not allowed to have two incompletely-split
pages adjacent to each other. If you move right to the right-half of an
incomplete split, i.e. a page that does not have a downlink in its
parent, and then try to split the page again, _bt_insert_parent() would
fail to find the location to insert the new downlink to. It assumes that
there is a downlink to the page being split in the parent, and uses that
to find the location for the new downlink.

- Heikki


From: Peter Geoghegan <pg(at)heroku(dot)com>
To: John Lumby <johnlumby(at)hotmail(dot)com>
Cc: pgsql hackers <pgsql-hackers(at)postgresql(dot)org>, Claudio Freire <klaussfreire(at)gmail(dot)com>
Subject: Re: Extended Prefetching using Asynchronous IO - proposal and patch
Date: 2014-05-28 21:51:50
Message-ID: CAM3SWZRby63v=rvvacjppghWhtgS8UROunw4ozSjg-DHZz9xCA@mail.gmail.com
Lists: pgsql-hackers

On Tue, May 27, 2014 at 3:17 PM, John Lumby <johnlumby(at)hotmail(dot)com> wrote:
> Below I am pasting the README we have written for this new functionality
> which mentions some of the measurements, advantages (and disadvantages)
> and we welcome all and any comments on this.

I think that this is likely to be a useful area to work on, but I
wonder: can you suggest a useful test-case or benchmark to show the
advantages of the patch you posted? You mention a testcase already,
but it's a little short on details. I think it's always a good idea to
start with that when pursuing a performance feature.

Have you thought about things like specialized prefetching for nested
loop joins?

--
Peter Geoghegan


From: Claudio Freire <klaussfreire(at)gmail(dot)com>
To: Peter Geoghegan <pg(at)heroku(dot)com>
Cc: John Lumby <johnlumby(at)hotmail(dot)com>, pgsql hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Extended Prefetching using Asynchronous IO - proposal and patch
Date: 2014-05-29 00:59:44
Message-ID: CAGTBQpZnSgxr2_RufpaSeQGWjVgjmTuaxTdooqxSicUNntS0dw@mail.gmail.com
Lists: pgsql-hackers

On Wed, May 28, 2014 at 6:51 PM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
> Have you thought about things like specialized prefetching for nested
> loop joins?

Currently, such a thing would need some non-trivial changes to the
execution nodes, I believe.

For nestloop, correct me if I'm wrong, but index scan nodes don't have
visibility of the next tuple to be searched for.


From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Claudio Freire <klaussfreire(at)gmail(dot)com>
Cc: John Lumby <johnlumby(at)hotmail(dot)com>, pgsql hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Extended Prefetching using Asynchronous IO - proposal and patch
Date: 2014-05-29 01:12:23
Message-ID: CAM3SWZQ5LwW-1kP52Faza6JSsXbw9FSMLoXqV+7nD9zVVistzQ@mail.gmail.com
Lists: pgsql-hackers

On Wed, May 28, 2014 at 5:59 PM, Claudio Freire <klaussfreire(at)gmail(dot)com> wrote:
> For nestloop, correct me if I'm wrong, but index scan nodes don't have
> visibility of the next tuple to be searched for.

Nested loop joins are considered a particularly compelling case for
prefetching, actually.

--
Peter Geoghegan


From: John Lumby <johnlumby(at)hotmail(dot)com>
To: Peter Geoghegan <pg(at)heroku(dot)com>, pgsql hackers <pgsql-hackers(at)postgresql(dot)org>
Cc: Claudio Freire <klaussfreire(at)gmail(dot)com>
Subject: Re: Extended Prefetching using Asynchronous IO - proposal and patch
Date: 2014-05-29 13:18:25
Message-ID: BAY175-W4026DD5ABB2BAF2A3E1940A3240@phx.gbl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I have pasted below the EXPLAIN of one of my benchmark queries
(the one I reference in the README).
Plenty of nested loop joins.
However, I think I understand your question as being about how effective
it would be if the outer is not sorted, and I will see if I can dig into
that if I get time (it sounds as though Claudio is on it too).

Exactly what the best prefetch strategy is for each particular type of
scan and context is a good area to work on.

John

> Date: Wed, 28 May 2014 18:12:23 -0700
> Subject: Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch
> From: pg(at)heroku(dot)com
> To: klaussfreire(at)gmail(dot)com
> CC: johnlumby(at)hotmail(dot)com; pgsql-hackers(at)postgresql(dot)org
>
> On Wed, May 28, 2014 at 5:59 PM, Claudio Freire <klaussfreire(at)gmail(dot)com> wrote:
> > For nestloop, correct me if I'm wrong, but index scan nodes don't have
> > visibility of the next tuple to be searched for.
>
> Nested loop joins are considered a particularly compelling case for
> prefetching, actually.
>
> --
> Peter Geoghegan

____________________________________________________________________________________
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=801294.81..801294.81 rows=2 width=532)
CTE deploy_zone_down
-> Recursive Union (cost=1061.25..2687.40 rows=11 width=573)
-> Nested Loop (cost=1061.25..1423.74 rows=1 width=41)
-> Nested Loop (cost=1061.11..1421.22 rows=14 width=49)
-> Bitmap Heap Scan on entity zone_tree (cost=1060.67..1175.80 rows=29 width=40)
Recheck Cond: ((name >= 'testZone-4375'::text) AND (name <= 'testZone-5499'::text) AND ((discriminator)::text = 'ZONE'::text))
-> BitmapAnd (cost=1060.67..1060.67 rows=29 width=0)
-> Bitmap Index Scan on entity_name (cost=0.00..139.71 rows=5927 width=0)
Index Cond: ((name >= 'testZone-4375'::text) AND (name <= 'testZone-5499'::text))
-> Bitmap Index Scan on entity_discriminatorx (cost=0.00..920.70 rows=49636 width=0)
Index Cond: ((discriminator)::text = 'ZONE'::text)
-> Index Scan using metadata_value_owner_id on metadata_value mv (cost=0.43..8.45 rows=1 width=17)
Index Cond: (owner_id = zone_tree.id)
-> Index Scan using metadata_field_pkey on metadata_field mf (cost=0.15..0.17 rows=1 width=8)
Index Cond: (id = mv.field_id)
Filter: ((name)::text = 'deployable'::text)
-> Nested Loop (cost=0.87..126.34 rows=1 width=573)
-> Nested Loop (cost=0.72..125.44 rows=5 width=581)
-> Nested Loop (cost=0.29..83.42 rows=10 width=572)
-> WorkTable Scan on deploy_zone_down dzd (cost=0.00..0.20 rows=10 width=540)
-> Index Scan using entity_discriminator_parent_zone on entity ch (cost=0.29..8.31 rows=1 width=40)
Index Cond: ((parent_id = dzd.zone_tree_id) AND ((discriminator)::text = 'ZONE'::text))
-> Index Scan using metadata_value_owner_id on metadata_value mv_1 (cost=0.43..4.19 rows=1 width=17)
Index Cond: (owner_id = ch.id)
-> Index Scan using metadata_field_pkey on metadata_field mf_1 (cost=0.15..0.17 rows=1 width=8)
Index Cond: (id = mv_1.field_id)
Filter: ((name)::text = 'deployable'::text)
CTE deploy_zone_tree
-> Recursive Union (cost=0.00..9336.89 rows=21 width=1105)
-> CTE Scan on deploy_zone_down dzd_1 (cost=0.00..0.22 rows=11 width=1105)
-> Nested Loop (cost=0.43..933.63 rows=1 width=594)
-> WorkTable Scan on deploy_zone_tree dzt (cost=0.00..2.20 rows=110 width=581)
-> Index Scan using entity_pkey on entity pt (cost=0.43..8.46 rows=1 width=21)
Index Cond: (id = dzt.zone_tree_ancestor_parent_id)
Filter: ((discriminator)::text = ANY ('{VIEW,ZONE}'::text[]))
CTE forward_host_ip
-> Nested Loop (cost=1.30..149.65 rows=24 width=88)
-> Nested Loop (cost=0.87..121.69 rows=48 width=56)
-> Nested Loop (cost=0.43..71.61 rows=99 width=48)
-> CTE Scan on deploy_zone_tree dzt_1 (cost=0.00..0.47 rows=1 width=16)
Filter: (zone_tree_deployable AND ((zone_tree_ancestor_discriminator)::text = 'VIEW'::text))
-> Index Scan using entity_parent_id on entity host (cost=0.43..70.14 rows=99 width=40)
Index Cond: (parent_id = dzt_1.zone_tree_id)
Filter: ((discriminator)::text = 'HOST'::text)
-> Index Scan using entity_link_owner_id on entity_link link (cost=0.43..0.50 rows=1 width=16)
Index Cond: (owner_id = host.id)
Filter: ((link_type)::text = ANY ('{IP,IP6}'::text[]))
-> Index Scan using entity_pkey on entity address (cost=0.43..0.57 rows=1 width=40)
Index Cond: (id = link.entity_id)
Filter: ((discriminator)::text = ANY ('{IP4A,IP6A}'::text[]))
CTE association_view
-> Nested Loop (cost=0.87..26.29 rows=1 width=75)
-> Nested Loop (cost=0.43..17.82 rows=1 width=56)
-> CTE Scan on deploy_zone_tree dzt_2 (cost=0.00..0.47 rows=1 width=16)
Filter: (zone_tree_deployable AND ((zone_tree_ancestor_discriminator)::text = 'VIEW'::text))
-> Index Scan using entity_discriminator_parent_rr on entity record (cost=0.43..17.34 rows=1 width=48)
Index Cond: ((parent_id = dzt_2.zone_tree_id) AND ((discriminator)::text = ANY ('{C,MX,SRV}'::text[])))
-> Index Scan using entity_pkey on entity assoc (cost=0.43..8.46 rows=1 width=27)
Index Cond: (id = record.association_id)
CTE simple_view
-> Nested Loop (cost=0.43..22.27 rows=1 width=48)
-> CTE Scan on deploy_zone_tree dzt_3 (cost=0.00..0.47 rows=1 width=16)
Filter: (zone_tree_deployable AND ((zone_tree_ancestor_discriminator)::text = 'VIEW'::text))
-> Index Scan using entity_discriminator_parent_rr on entity record_1 (cost=0.43..21.79 rows=1 width=40)
Index Cond: ((parent_id = dzt_3.zone_tree_id) AND ((discriminator)::text = ANY ('{TXT,HINFO,GENRR,NAPTR}'::text[])))
CTE max_hist_id
-> Result (cost=0.48..0.49 rows=1 width=0)
InitPlan 6 (returns $19)
-> Limit (cost=0.43..0.48 rows=1 width=8)
-> Index Only Scan Backward using entity_history_history_id on entity_history xh (cost=0.43..444052.51 rows=10386347 width=8)
Index Cond: (history_id IS NOT NULL)
CTE relevant_history
-> Nested Loop (cost=0.43..199661.39 rows=3438689 width=28)
-> CTE Scan on max_hist_id xh_1 (cost=0.00..0.02 rows=1 width=8)
-> Index Scan using entity_history_history_id on entity_history eh (cost=0.43..156677.76 rows=3438689 width=20)
Index Cond: (history_id > xh_1.history_id)
Filter: (transaction_type = 'I'::bpchar)
CTE resource_records
-> Unique (cost=580178.30..584992.46 rows=160472 width=1063)
-> Sort (cost=580178.30..580579.48 rows=160472 width=1063)
Sort Key: fip.host_id, fip.host_discriminator, fip.host_parent_id, fip.view_id, ((fip.address_id)::text), fip.host_name, ((fip.address_long1)::text), (((fip.host_long1 & 1::bigint))::text), ((fip.address_parent_id)::text), ((((''::text || (COALESCE(mv_2.longnumber, (-1)::bigint))::text) || ','::text) || (fip.address_discriminator)::text)), rh.long1
-> Append (cost=203.82..417112.92 rows=160472 width=1063)
-> Hash Join (cost=203.82..91844.90 rows=137548 width=1136)
Hash Cond: ((rh.hist_discrim)::text = (fip.address_discriminator)::text)
Join Filter: (rh.history_delta > fip.host_id)
-> CTE Scan on relevant_history rh (cost=0.00..68773.78 rows=3438689 width=532)
-> Hash (cost=203.52..203.52 rows=24 width=1128)
-> Nested Loop Left Join (cost=0.43..203.52 rows=24 width=1128)
-> CTE Scan on forward_host_ip fip (cost=0.00..0.48 rows=24 width=1120)
-> Index Scan using metadata_value_owner_id on metadata_value mv_2 (cost=0.43..8.45 rows=1 width=24)
Index Cond: (owner_id = fip.host_id)
-> Nested Loop (cost=0.43..77722.85 rows=5731 width=644)
Join Filter: (rh_1.history_delta > av.record_id)
-> Nested Loop Left Join (cost=0.43..8.48 rows=1 width=636)
-> CTE Scan on association_view av (cost=0.00..0.02 rows=1 width=628)
Filter: ((record_discriminator)::text = 'C'::text)
-> Index Scan using metadata_value_owner_id on metadata_value mv_3 (cost=0.43..8.45 rows=1 width=24)
Index Cond: (owner_id = av.record_id)
-> CTE Scan on relevant_history rh_1 (cost=0.00..77370.50 rows=17193 width=532)
Filter: ((hist_discrim)::text = 'C'::text)
-> Hash Join (cost=0.04..83402.53 rows=5731 width=636)
Hash Cond: ((rh_2.hist_discrim)::text = (av_1.record_discriminator)::text)
Join Filter: (rh_2.history_delta > av_1.record_id)
-> CTE Scan on relevant_history rh_2 (cost=0.00..68773.78 rows=3438689 width=532)
-> Hash (cost=0.02..0.02 rows=1 width=628)
-> CTE Scan on association_view av_1 (cost=0.00..0.02 rows=1 width=628)
Filter: ((record_discriminator)::text = ANY ('{MX,SRV}'::text[]))
-> Nested Loop (cost=0.86..79164.06 rows=5731 width=618)
Join Filter: (rh_3.history_delta > sv.record_id)
-> Nested Loop Left Join (cost=0.86..16.94 rows=1 width=610)
-> Nested Loop Left Join (cost=0.43..8.48 rows=1 width=588)
-> CTE Scan on simple_view sv (cost=0.00..0.02 rows=1 width=580)
Filter: ((record_discriminator)::text = 'TXT'::text)
-> Index Scan using metadata_value_owner_id on metadata_value mv_4 (cost=0.43..8.45 rows=1 width=24)
Index Cond: (owner_id = sv.record_id)
-> Index Scan using metadata_value_owner_id on metadata_value txtvalue (cost=0.43..8.45 rows=1 width=38)
Index Cond: (owner_id = sv.record_id)
-> CTE Scan on relevant_history rh_3 (cost=0.00..77370.50 rows=17193 width=532)
Filter: ((hist_discrim)::text = 'TXT'::text)
-> Hash Join (cost=0.04..83373.87 rows=5731 width=588)
Hash Cond: ((rh_4.hist_discrim)::text = (sv_1.record_discriminator)::text)
Join Filter: (rh_4.history_delta > sv_1.record_id)
-> CTE Scan on relevant_history rh_4 (cost=0.00..68773.78 rows=3438689 width=532)
-> Hash (cost=0.02..0.02 rows=1 width=580)
-> CTE Scan on simple_view sv_1 (cost=0.00..0.02 rows=1 width=580)
Filter: ((record_discriminator)::text = ANY ('{HINFO,GENRR,NAPTR}'::text[]))
-> Sort (cost=4417.98..4418.48 rows=200 width=532)
Sort Key: resource_records.discrim
-> HashAggregate (cost=4412.98..4415.98 rows=200 width=532)
Group Key: resource_records.discrim
-> CTE Scan on resource_records (cost=0.00..3209.44 rows=160472 width=532)
Planning time: 6.620 ms
(133 rows)


From: Claudio Freire <klaussfreire(at)gmail(dot)com>
To: Peter Geoghegan <pg(at)heroku(dot)com>
Cc: PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>, John Lumby <johnlumby(at)hotmail(dot)com>
Subject: Re: Extended Prefetching using Asynchronous IO - proposal and patch
Date: 2014-05-29 13:28:27
Message-ID: CAGTBQpbAKN6cnxi3VeqoW=zya3iXma74mnD1dG019b8ywsS9sA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 28/05/2014 22:12, "Peter Geoghegan" <pg(at)heroku(dot)com> wrote:
>
> On Wed, May 28, 2014 at 5:59 PM, Claudio Freire <klaussfreire(at)gmail(dot)com> wrote:
> > For nestloop, correct me if I'm wrong, but index scan nodes don't have
> > visibility of the next tuple to be searched for.
>
> Nested loop joins are considered a particularly compelling case for
> prefetching, actually.

Of course. I only doubt that it can be implemented without fairly
substantial changes to all the execution nodes.

I'll look into it.
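
To illustrate what I mean, a purely hypothetical sketch (none of these
names exist in PostgreSQL; I'm inventing them to show the shape of the
change):

/*
 * Hypothetical only: for the inner index scan to prefetch, something
 * upstream would have to push the upcoming join keys down to it.
 */
typedef struct PrefetchHint
{
    Datum  *keys;       /* upcoming join keys, in arrival order */
    int     nkeys;      /* how many keys are known ahead of time */
} PrefetchHint;

/*
 * And every node sitting between the outer scan and the inner index
 * scan would have to understand and forward such hints; that's the
 * part that touches all the execution nodes.
 */
extern void ExecPushPrefetchHint(PlanState *node, PrefetchHint *hint);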

>
> --
> Peter Geoghegan