Quick Links

Re: Extended Prefetching using Asynchronous IO - proposal and patch

Lists:	pgsql-generalpgsql-hackers

From:	John Lumby <johnlumby(at)hotmail(dot)com>
To:	pgsql hackers <pgsql-hackers(at)postgresql(dot)org>
Cc:	Claudio Freire <klaussfreire(at)gmail(dot)com>
Subject:	Re: Extended Prefetching using Asynchronous IO - proposal and patch
Date:	2014-05-28 20:52:47
Message-ID:	BAY175-W11F4BFD6485716B60C29BEA3250@phx.gbl
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

The patch is attached.
It is based on clone of today's 9.4dev source.
I have noticed that this source is
(not suprisingly) quite a moving target at present,
meaning that this patch becomes stale quite quickly.
So although this copy is fine for reviewing,
it may quite probably soon not be correct
for the current source tree.

As mentioned before, if anyone wishes to try this feature out
on 9.3.4, I will be making a patch for that soon
which I can supply on request.

John Lumby

Attachment	Content-Type	Size
postgresql-9.4.140528.async_io_prefetching.patch	application/octet-stream	305.9 KB

From:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To:	John Lumby <johnlumby(at)hotmail(dot)com>, pgsql hackers <pgsql-hackers(at)postgresql(dot)org>
Cc:	Claudio Freire <klaussfreire(at)gmail(dot)com>
Subject:	Re: Extended Prefetching using Asynchronous IO - proposal and patch
Date:	2014-05-28 21:19:33
Message-ID:	538652E5.70303@vmware.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

On 05/28/2014 11:52 PM, John Lumby wrote:
>
> The patch is attached.
> It is based on clone of today's 9.4dev source.
> I have noticed that this source is
> (not suprisingly) quite a moving target at present,
> meaning that this patch becomes stale quite quickly.
> So although this copy is fine for reviewing,
> it may quite probably soon not be correct
> for the current source tree.
>
> As mentioned before, if anyone wishes to try this feature out
> on 9.3.4, I will be making a patch for that soon
> which I can supply on request.

Wow, that's a huge patch. I took a very brief look, focusing on the
basic design. ignoring the style & other minor things for now:

The patch seems to assume that you can put the aiocb struct in shared
memory, initiate an asynchronous I/O request from one process, and wait
for its completion from another process. I'm pretty surprised if that
works on any platform.

How portable is POSIX aio nowadays? Googling around, it still seems that
on Linux, it's implemented using threads. Does the thread-emulation
implementation cause problems with the rest of the backend, which
assumes that there is only a single thread? In any case, I think we'll
want to encapsulate the AIO implementation behind some kind of an API,
to allow other implementations to co-exist.

Benchmarks?
- Heikki

From:	John Lumby <johnlumby(at)hotmail(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, pgsql hackers <pgsql-hackers(at)postgresql(dot)org>
Cc:	Claudio Freire <klaussfreire(at)gmail(dot)com>
Subject:	Re: Extended Prefetching using Asynchronous IO - proposal and patch
Date:	2014-05-29 13:12:07
Message-ID:	BAY175-W33421FA400E54FDE90B4F8A3240@phx.gbl
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

Thanks for looking at it!

> Date: Thu, 29 May 2014 00:19:33 +0300
> From: hlinnakangas(at)vmware(dot)com
> To: johnlumby(at)hotmail(dot)com; pgsql-hackers(at)postgresql(dot)org
> CC: klaussfreire(at)gmail(dot)com
> Subject: Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch
>
> On 05/28/2014 11:52 PM, John Lumby wrote:
> >
>
> The patch seems to assume that you can put the aiocb struct in shared
> memory, initiate an asynchronous I/O request from one process, and wait
> for its completion from another process. I'm pretty surprised if that
> works on any platform.

It works on linux. Actually this ability allows the asyncio implementation to
reduce complexity in one respect (yes I know it looks complex enough) :
it makes waiting for completion of an in-progress IO simpler than for
the existing synchronous IO case,. since librt takes care of the waiting.
specifically, no need for extra wait-for-io control blocks
such as in bufmgr's WaitIO()

This also plays very nicely with the syncscan where the cohorts run close together.

If anyone would like to advise whether this also works on MacOS/BSD , windows,
that would be good, as I can't verify it myself.

>
> How portable is POSIX aio nowadays? Googling around, it still seems that
> on Linux, it's implemented using threads. Does the thread-emulation
> implementation cause problems with the rest of the backend, which
> assumes that there is only a single thread? In any case, I think we'll

Good question, I am not aware of any dependency on a backend having only
a single thread. Can you please point me to such dependencies?

> want to encapsulate the AIO implementation behind some kind of an API,
> to allow other implementations to co-exist.

It is already - it follows the smgr(stg mgr) -> md (mag disk) -> fd (filesystem)
layering used for the existing filesystem ops including posix-fadvise.

>
> Benchmarks?

I will see if I can package mine up somehow.
Would be great if anyone else would like to benchmark it on theirs ...

> - Heikki

From:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To:	John Lumby <johnlumby(at)hotmail(dot)com>, pgsql hackers <pgsql-hackers(at)postgresql(dot)org>
Cc:	Claudio Freire <klaussfreire(at)gmail(dot)com>
Subject:	Re: Extended Prefetching using Asynchronous IO - proposal and patch
Date:	2014-05-29 20:23:19
Message-ID:	53879737.2060604@vmware.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

On 05/29/2014 04:12 PM, John Lumby wrote:
> Thanks for looking at it!
>
>> Date: Thu, 29 May 2014 00:19:33 +0300
>> From: hlinnakangas(at)vmware(dot)com
>> To: johnlumby(at)hotmail(dot)com; pgsql-hackers(at)postgresql(dot)org
>> CC: klaussfreire(at)gmail(dot)com
>> Subject: Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch
>>
>> On 05/28/2014 11:52 PM, John Lumby wrote:
>>>
>>
>> The patch seems to assume that you can put the aiocb struct in shared
>> memory, initiate an asynchronous I/O request from one process, and wait
>> for its completion from another process. I'm pretty surprised if that
>> works on any platform.
>
> It works on linux. Actually this ability allows the asyncio implementation to
> reduce complexity in one respect (yes I know it looks complex enough) :
> it makes waiting for completion of an in-progress IO simpler than for
> the existing synchronous IO case,. since librt takes care of the waiting.
> specifically, no need for extra wait-for-io control blocks
> such as in bufmgr's WaitIO()

[checks]. No, it doesn't work. See attached test program.

It kinda seems to work sometimes, because of the way it's implemented in
glibc. The aiocb struct has a field for the result value and errno, and
when the I/O is finished, the worker thread fills them in. aio_error()
and aio_return() just return the values of those fields, so calling
aio_error() or aio_return() do in fact happen to work from a different
process. aio_suspend(), however, is implemented by sleeping on a
process-local mutex, which does not work from a different process.

Even if it worked on Linux today, it would be a bad idea to rely on it
from a portability point of view. No, the only sane way to make this
work is that the process that initiates an I/O request is responsible
for completing it. If another process needs to wait for an async I/O to
complete, we must use some other means to do the waiting. Like the
io_in_progress_lock that we already have, for the same purpose.

- Heikki

Attachment	Content-Type	Size
aio-shmem-test.c	text/x-csrc	2.1 KB

From:	Claudio Freire <klaussfreire(at)gmail(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	John Lumby <johnlumby(at)hotmail(dot)com>, pgsql hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Extended Prefetching using Asynchronous IO - proposal and patch
Date:	2014-05-29 20:34:50
Message-ID:	CAGTBQpaU-=Oaj8+_QjzYEwNtzhq-swbs94eDz1dXcsVaMcqZ7g@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

On Thu, May 29, 2014 at 5:23 PM, Heikki Linnakangas
<hlinnakangas(at)vmware(dot)com> wrote:
> On 05/29/2014 04:12 PM, John Lumby wrote:
>>
>> Thanks for looking at it!
>>
>>> Date: Thu, 29 May 2014 00:19:33 +0300
>>> From: hlinnakangas(at)vmware(dot)com
>>> To: johnlumby(at)hotmail(dot)com; pgsql-hackers(at)postgresql(dot)org
>>> CC: klaussfreire(at)gmail(dot)com
>>> Subject: Re: [HACKERS] Extended Prefetching using Asynchronous IO -
>>> proposal and patch
>>>
>>> On 05/28/2014 11:52 PM, John Lumby wrote:
>>>>
>>>>
>>>
>>> The patch seems to assume that you can put the aiocb struct in shared
>>> memory, initiate an asynchronous I/O request from one process, and wait
>>> for its completion from another process. I'm pretty surprised if that
>>> works on any platform.
>>
>>
>> It works on linux. Actually this ability allows the asyncio
>> implementation to
>> reduce complexity in one respect (yes I know it looks complex enough) :
>> it makes waiting for completion of an in-progress IO simpler than for
>> the existing synchronous IO case,. since librt takes care of the
>> waiting.
>> specifically, no need for extra wait-for-io control blocks
>> such as in bufmgr's WaitIO()
>
>
> [checks]. No, it doesn't work. See attached test program.
>
> It kinda seems to work sometimes, because of the way it's implemented in
> glibc. The aiocb struct has a field for the result value and errno, and when
> the I/O is finished, the worker thread fills them in. aio_error() and
> aio_return() just return the values of those fields, so calling aio_error()
> or aio_return() do in fact happen to work from a different process.
> aio_suspend(), however, is implemented by sleeping on a process-local mutex,
> which does not work from a different process.
>
> Even if it worked on Linux today, it would be a bad idea to rely on it from
> a portability point of view. No, the only sane way to make this work is that
> the process that initiates an I/O request is responsible for completing it.
> If another process needs to wait for an async I/O to complete, we must use
> some other means to do the waiting. Like the io_in_progress_lock that we
> already have, for the same purpose.

But calls to it are timeouted by 10us, effectively turning the thing
into polling mode.

Which is odd now that I read carefully, is how come 256 waits of 10us
amounts to anything? That's just 2.5ms, not enough IIUC for any normal
I/O to complete.

From:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To:	Claudio Freire <klaussfreire(at)gmail(dot)com>
Cc:	John Lumby <johnlumby(at)hotmail(dot)com>, pgsql hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Extended Prefetching using Asynchronous IO - proposal and patch
Date:	2014-05-29 20:39:56
Message-ID:	53879B1C.4040508@vmware.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

On 05/29/2014 11:34 PM, Claudio Freire wrote:
> On Thu, May 29, 2014 at 5:23 PM, Heikki Linnakangas
> <hlinnakangas(at)vmware(dot)com> wrote:
>> On 05/29/2014 04:12 PM, John Lumby wrote:
>>>
>>>> On 05/28/2014 11:52 PM, John Lumby wrote:
>>>>
>>>> The patch seems to assume that you can put the aiocb struct in shared
>>>> memory, initiate an asynchronous I/O request from one process, and wait
>>>> for its completion from another process. I'm pretty surprised if that
>>>> works on any platform.
>>>
>>> It works on linux. Actually this ability allows the asyncio
>>> implementation to
>>> reduce complexity in one respect (yes I know it looks complex enough) :
>>> it makes waiting for completion of an in-progress IO simpler than for
>>> the existing synchronous IO case,. since librt takes care of the
>>> waiting.
>>> specifically, no need for extra wait-for-io control blocks
>>> such as in bufmgr's WaitIO()
>>
>> [checks]. No, it doesn't work. See attached test program.
>>
>> It kinda seems to work sometimes, because of the way it's implemented in
>> glibc. The aiocb struct has a field for the result value and errno, and when
>> the I/O is finished, the worker thread fills them in. aio_error() and
>> aio_return() just return the values of those fields, so calling aio_error()
>> or aio_return() do in fact happen to work from a different process.
>> aio_suspend(), however, is implemented by sleeping on a process-local mutex,
>> which does not work from a different process.
>>
>> Even if it worked on Linux today, it would be a bad idea to rely on it from
>> a portability point of view. No, the only sane way to make this work is that
>> the process that initiates an I/O request is responsible for completing it.
>> If another process needs to wait for an async I/O to complete, we must use
>> some other means to do the waiting. Like the io_in_progress_lock that we
>> already have, for the same purpose.
>
> But calls to it are timeouted by 10us, effectively turning the thing
> into polling mode.

We don't want polling... And even if we did, calling aio_suspend() in a
way that's known to be broken, in a loop, is a pretty crappy way of polling.

> Which is odd now that I read carefully, is how come 256 waits of 10us
> amounts to anything? That's just 2.5ms, not enough IIUC for any normal
> I/O to complete

Wild guess: the buffer you're reading is already in OS cache.

- Heikki

From:	Claudio Freire <klaussfreire(at)gmail(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	John Lumby <johnlumby(at)hotmail(dot)com>, pgsql hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Extended Prefetching using Asynchronous IO - proposal and patch
Date:	2014-05-29 20:44:06
Message-ID:	CAGTBQpastJcQQA_patSq4Gah1kV69dUJVQeNOi+wXPmoz0mWuA@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

On Thu, May 29, 2014 at 5:39 PM, Heikki Linnakangas
<hlinnakangas(at)vmware(dot)com> wrote:
> On 05/29/2014 11:34 PM, Claudio Freire wrote:
>>
>> On Thu, May 29, 2014 at 5:23 PM, Heikki Linnakangas
>> <hlinnakangas(at)vmware(dot)com> wrote:
>>>
>>> On 05/29/2014 04:12 PM, John Lumby wrote:
>>>>
>>>>
>>>>> On 05/28/2014 11:52 PM, John Lumby wrote:
>>>>>
>>>>> The patch seems to assume that you can put the aiocb struct in shared
>>>>> memory, initiate an asynchronous I/O request from one process, and wait
>>>>> for its completion from another process. I'm pretty surprised if that
>>>>> works on any platform.
>>>>
>>>>
>>>> It works on linux. Actually this ability allows the asyncio
>>>> implementation to
>>>> reduce complexity in one respect (yes I know it looks complex enough) :
>>>> it makes waiting for completion of an in-progress IO simpler than for
>>>> the existing synchronous IO case,. since librt takes care of the
>>>> waiting.
>>>> specifically, no need for extra wait-for-io control blocks
>>>> such as in bufmgr's WaitIO()
>>>
>>>
>>> [checks]. No, it doesn't work. See attached test program.
>>>
>>> It kinda seems to work sometimes, because of the way it's implemented in
>>> glibc. The aiocb struct has a field for the result value and errno, and
>>> when
>>> the I/O is finished, the worker thread fills them in. aio_error() and
>>> aio_return() just return the values of those fields, so calling
>>> aio_error()
>>> or aio_return() do in fact happen to work from a different process.
>>> aio_suspend(), however, is implemented by sleeping on a process-local
>>> mutex,
>>> which does not work from a different process.
>>>
>>> Even if it worked on Linux today, it would be a bad idea to rely on it
>>> from
>>> a portability point of view. No, the only sane way to make this work is
>>> that
>>> the process that initiates an I/O request is responsible for completing
>>> it.
>>> If another process needs to wait for an async I/O to complete, we must
>>> use
>>> some other means to do the waiting. Like the io_in_progress_lock that we
>>> already have, for the same purpose.
>>
>>
>> But calls to it are timeouted by 10us, effectively turning the thing
>> into polling mode.
>
>
> We don't want polling... And even if we did, calling aio_suspend() in a way
> that's known to be broken, in a loop, is a pretty crappy way of polling.

Agreed. Just saying that that's why it works.

The PID of the process responsible for the aio should be there
somewhere, and other processes should explicitly synchronize (or
initiate their own I/O and let the OS merge them - which also works).

>
>
>> Which is odd now that I read carefully, is how come 256 waits of 10us
>> amounts to anything? That's just 2.5ms, not enough IIUC for any normal
>> I/O to complete
>
>
> Wild guess: the buffer you're reading is already in OS cache.

My wild guess as well.

From:	Claudio Freire <klaussfreire(at)gmail(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	John Lumby <johnlumby(at)hotmail(dot)com>, pgsql hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Extended Prefetching using Asynchronous IO - proposal and patch
Date:	2014-05-29 21:00:28
Message-ID:	CAGTBQpaX_3_+wxVB4FhswrOCTbnbdXTeCGCpnmiA0dizvsiwYQ@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

Didn't fix that, but the attached patch does fix regression tests when
scanning over index types other than btree (was invoking elog when the
index am didn't have ampeeknexttuple)

Attachment	Content-Type	Size
postgresql-9.4.140529-async_io_prefetching.patch	text/x-patch	224.3 KB

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Claudio Freire <klaussfreire(at)gmail(dot)com>
Cc:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, John Lumby <johnlumby(at)hotmail(dot)com>, pgsql hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Extended Prefetching using Asynchronous IO - proposal and patch
Date:	2014-05-29 21:19:35
Message-ID:	17548.1401398375@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

Claudio Freire <klaussfreire(at)gmail(dot)com> writes:
> Didn't fix that, but the attached patch does fix regression tests when
> scanning over index types other than btree (was invoking elog when the
> index am didn't have ampeeknexttuple)

"ampeeknexttuple"? That's a bit scary. It would certainly be unsafe
for non-MVCC snapshots (read about vacuum vs indexscan interlocks in
nbtree/README).

regards, tom lane

From:	Claudio Freire <klaussfreire(at)gmail(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, John Lumby <johnlumby(at)hotmail(dot)com>, pgsql hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Extended Prefetching using Asynchronous IO - proposal and patch
Date:	2014-05-29 21:43:02
Message-ID:	CAGTBQpZrufzDMmo95ak72sb=CnEwYGCNcJ1V-U3KGQUQXqQH=A@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

On Thu, May 29, 2014 at 6:19 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Claudio Freire <klaussfreire(at)gmail(dot)com> writes:
>> Didn't fix that, but the attached patch does fix regression tests when
>> scanning over index types other than btree (was invoking elog when the
>> index am didn't have ampeeknexttuple)
>
> "ampeeknexttuple"? That's a bit scary. It would certainly be unsafe
> for non-MVCC snapshots (read about vacuum vs indexscan interlocks in
> nbtree/README).

It's not really the tuple, just the tid

From:	Claudio Freire <klaussfreire(at)gmail(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, John Lumby <johnlumby(at)hotmail(dot)com>, pgsql hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Extended Prefetching using Asynchronous IO - proposal and patch
Date:	2014-05-29 21:49:07
Message-ID:	CAGTBQpYEUVj28u4mkNJms59SBtiiisjjSL6akHdVeChvnt1T+w@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

On Thu, May 29, 2014 at 6:43 PM, Claudio Freire <klaussfreire(at)gmail(dot)com> wrote:
> On Thu, May 29, 2014 at 6:19 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Claudio Freire <klaussfreire(at)gmail(dot)com> writes:
>>> Didn't fix that, but the attached patch does fix regression tests when
>>> scanning over index types other than btree (was invoking elog when the
>>> index am didn't have ampeeknexttuple)
>>
>> "ampeeknexttuple"? That's a bit scary. It would certainly be unsafe
>> for non-MVCC snapshots (read about vacuum vs indexscan interlocks in
>> nbtree/README).
>
>
> It's not really the tuple, just the tid

And, furthermore, it's used only to do prefetching, so even if the tid
was invalid when the tuple needs to be accessed, it wouldn't matter,
because the indexam wouldn't use the result of ampeeknexttuple to do
anything at that time.

Though, your comment does illustrate the need to document that on
ampeeknexttuple, for future users.

From:	John Lumby <johnlumby(at)hotmail(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, pgsql hackers <pgsql-hackers(at)postgresql(dot)org>
Cc:	Claudio Freire <klaussfreire(at)gmail(dot)com>
Subject:	Re: Extended Prefetching using Asynchronous IO - proposal and patch
Date:	2014-05-29 21:53:51
Message-ID:	BAY175-W3241B07A89EC187C3D826A3240@phx.gbl
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

> Date: Thu, 29 May 2014 18:00:28 -0300
> Subject: Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch
> From: klaussfreire(at)gmail(dot)com
> To: hlinnakangas(at)vmware(dot)com
> CC: johnlumby(at)hotmail(dot)com; pgsql-hackers(at)postgresql(dot)org
>
> On Thu, May 29, 2014 at 5:39 PM, Heikki Linnakangas
> <hlinnakangas(at)vmware(dot)com> wrote:
> > On 05/29/2014 11:34 PM, Claudio Freire wrote:
> >>
> >> On Thu, May 29, 2014 at 5:23 PM, Heikki Linnakangas
> >> <hlinnakangas(at)vmware(dot)com> wrote:
> >>>
> >>> On 05/29/2014 04:12 PM, John Lumby wrote:
> >>>>
> >>>>
> >>>>> On 05/28/2014 11:52 PM, John Lumby wrote:
> >>>>>
> >>>>> The patch seems to assume that you can put the aiocb struct in shared
> >>>>> memory, initiate an asynchronous I/O request from one process, and wait
> >>>>> for its completion from another process. I'm pretty surprised if that
> >>>>> works on any platform.
> >>>>
> >>>>
> >>>> It works on linux. Actually this ability allows the asyncio
> >>>> implementation to
> >>>> reduce complexity in one respect (yes I know it looks complex enough) :
> >>>> it makes waiting for completion of an in-progress IO simpler than for
> >>>> the existing synchronous IO case,. since librt takes care of the
> >>>> waiting.
> >>>> specifically, no need for extra wait-for-io control blocks
> >>>> such as in bufmgr's WaitIO()
> >>>
> >>>
> >>> [checks]. No, it doesn't work. See attached test program.

Thanks for checking and thanks for coming up with that test program.
However, yes, it really does work -- always (on linux).
Your test program is doing things in the wrong order -
it calls aio_suspend *before* aio_error.
However, the rule is, call aio_suspend *after* aio_error
and *only* if aio_error returns EINPROGRESS.

See the code changes to fd.c function FileCompleteaio()
to see how we have done it. And I am attaching corrected version
of your test program which runs just fine.

> >>>
> >>> It kinda seems to work sometimes, because of the way it's implemented in
> >>> glibc. The aiocb struct has a field for the result value and errno, and
> >>> when
> >>> the I/O is finished, the worker thread fills them in. aio_error() and
> >>> aio_return() just return the values of those fields, so calling
> >>> aio_error()
> >>> or aio_return() do in fact happen to work from a different process.
> >>> aio_suspend(), however, is implemented by sleeping on a process-local
> >>> mutex,
> >>> which does not work from a different process.
> >>>
> >>> Even if it worked on Linux today, it would be a bad idea to rely on it
> >>> from
> >>> a portability point of view. No, the only sane way to make this work is
> >>> that
> >>> the process that initiates an I/O request is responsible for completing
> >>> it.
> >>> If another process needs to wait for an async I/O to complete, we must
> >>> use
> >>> some other means to do the waiting. Like the io_in_progress_lock that we
> >>> already have, for the same purpose.
> >>
> >>
> >> But calls to it are timeouted by 10us, effectively turning the thing
> >> into polling mode.
> >
> >
> > We don't want polling... And even if we did, calling aio_suspend() in a way
> > that's known to be broken, in a loop, is a pretty crappy way of polling.

Well, as mentioned earlier, it is not broken. Whether it is efficient I am not sure.
I have looked at the mutex in aio_suspend that you mentioned and I am not
quite convinced that, if caller is not the original aio_read process,
it renders the suspend() into an instant timeout. I will see if I can verify that.
Where are you (Claudio) seeing 10us?

>
>
> Didn't fix that, but the attached patch does fix regression tests when
> scanning over index types other than btree (was invoking elog when the
> index am didn't have ampeeknexttuple)

Attachment	Content-Type	Size
aio-shmem-test.c	text/plain	2.6 KB

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Claudio Freire <klaussfreire(at)gmail(dot)com>
Cc:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, John Lumby <johnlumby(at)hotmail(dot)com>, pgsql hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Extended Prefetching using Asynchronous IO - proposal and patch
Date:	2014-05-29 21:56:57
Message-ID:	18398.1401400617@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

Claudio Freire <klaussfreire(at)gmail(dot)com> writes:
> On Thu, May 29, 2014 at 6:43 PM, Claudio Freire <klaussfreire(at)gmail(dot)com> wrote:
>> On Thu, May 29, 2014 at 6:19 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> "ampeeknexttuple"? That's a bit scary. It would certainly be unsafe
>>> for non-MVCC snapshots (read about vacuum vs indexscan interlocks in
>>> nbtree/README).

>> It's not really the tuple, just the tid

> And, furthermore, it's used only to do prefetching, so even if the tid
> was invalid when the tuple needs to be accessed, it wouldn't matter,
> because the indexam wouldn't use the result of ampeeknexttuple to do
> anything at that time.

Nonetheless, getting the next tid out of the index may involve stepping
to the next index page, at which point you've lost your interlock
guaranteeing that the *previous* tid will still mean something by the time
you arrive at its heap page. I presume that the ampeeknexttuple call is
issued before trying to visit the heap (otherwise you're not actually
getting much I/O overlap), so I think there's a real risk here.

Having said that, it's probably OK as long as this mode is only invoked
for user queries (with MVCC snapshots) and not for system indexscans.

regards, tom lane

From:	John Lumby <johnlumby(at)hotmail(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Claudio Freire <klaussfreire(at)gmail(dot)com>, pgsql hackers <pgsql-hackers(at)postgresql(dot)org>
Cc:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Subject:	Re: Extended Prefetching using Asynchronous IO - proposal and patch
Date:	2014-05-29 22:11:26
Message-ID:	BAY175-W432649F7090F849C2871BAA3240@phx.gbl
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

> From: tgl(at)sss(dot)pgh(dot)pa(dot)us
> To: klaussfreire(at)gmail(dot)com
> CC: hlinnakangas(at)vmware(dot)com; johnlumby(at)hotmail(dot)com; pgsql-hackers(at)postgresql(dot)org
> Subject: Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch
> Date: Thu, 29 May 2014 17:56:57 -0400
>
> Claudio Freire <klaussfreire(at)gmail(dot)com> writes:
> > On Thu, May 29, 2014 at 6:43 PM, Claudio Freire <klaussfreire(at)gmail(dot)com> wrote:
> >> On Thu, May 29, 2014 at 6:19 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> >>> "ampeeknexttuple"? That's a bit scary. It would certainly be unsafe
> >>> for non-MVCC snapshots (read about vacuum vs indexscan interlocks in
> >>> nbtree/README).
>
> >> It's not really the tuple, just the tid
>
> > And, furthermore, it's used only to do prefetching, so even if the tid
> > was invalid when the tuple needs to be accessed, it wouldn't matter,
> > because the indexam wouldn't use the result of ampeeknexttuple to do
> > anything at that time.
>
> Nonetheless, getting the next tid out of the index may involve stepping
> to the next index page, at which point you've lost your interlock

I think we are ok as peeknexttuple (yes bad name, sorry, can change it ...

never advances to another page :

* btpeeknexttuple() -- peek at the next tuple different from any blocknum in pfch_list

* without reading a new index page

* and without causing any side-effects such as altering values in control blocks

* if found, store blocknum in next element of pfch_list

> guaranteeing that the *previous* tid will still mean something by the time
> you arrive at its heap page. I presume that the ampeeknexttuple call is
> issued before trying to visit the heap (otherwise you're not actually
> getting much I/O overlap), so I think there's a real risk here.
>
> Having said that, it's probably OK as long as this mode is only invoked
> for user queries (with MVCC snapshots) and not for system indexscans.
>
> regards, tom lane

From:	Claudio Freire <klaussfreire(at)gmail(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, John Lumby <johnlumby(at)hotmail(dot)com>, pgsql hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Extended Prefetching using Asynchronous IO - proposal and patch
Date:	2014-05-29 22:14:36
Message-ID:	CAGTBQpbtmZGo+vGNQcj+mqbDYSR==t6nNuQVkZO6cMC2BhM-=Q@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

On Thu, May 29, 2014 at 6:56 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Claudio Freire <klaussfreire(at)gmail(dot)com> writes:
>> On Thu, May 29, 2014 at 6:43 PM, Claudio Freire <klaussfreire(at)gmail(dot)com> wrote:
>>> On Thu, May 29, 2014 at 6:19 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>>> "ampeeknexttuple"? That's a bit scary. It would certainly be unsafe
>>>> for non-MVCC snapshots (read about vacuum vs indexscan interlocks in
>>>> nbtree/README).
>
>>> It's not really the tuple, just the tid
>
>> And, furthermore, it's used only to do prefetching, so even if the tid
>> was invalid when the tuple needs to be accessed, it wouldn't matter,
>> because the indexam wouldn't use the result of ampeeknexttuple to do
>> anything at that time.
>
> Nonetheless, getting the next tid out of the index may involve stepping
> to the next index page, at which point you've lost your interlock
> guaranteeing that the *previous* tid will still mean something by the time

No, no... that's exactly why a new regproc is needed, because for
prefetching, we need to get the next tid that satisfies some
conditions *without* walking the index.

This, in nbtree, only looks through the tid array to find the suitable
tid, or just return false if the array is exhausted.

> Having said that, it's probably OK as long as this mode is only invoked
> for user queries (with MVCC snapshots) and not for system indexscans.

I think system index scans will also invoke this. There's no rule
excluding that possibility.

From:	Claudio Freire <klaussfreire(at)gmail(dot)com>
To:	John Lumby <johnlumby(at)hotmail(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql hackers <pgsql-hackers(at)postgresql(dot)org>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Subject:	Re: Extended Prefetching using Asynchronous IO - proposal and patch
Date:	2014-05-29 22:16:06
Message-ID:	CAGTBQpYTp_QuWLQpxxNjoWC5Au9MKnxNHCcoOcNtVToPJtjzLg@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

On Thu, May 29, 2014 at 7:11 PM, John Lumby <johnlumby(at)hotmail(dot)com> wrote:
>> Nonetheless, getting the next tid out of the index may involve stepping
>> to the next index page, at which point you've lost your interlock
>
> I think we are ok as peeknexttuple (yes bad name, sorry, can change it
> ...
> never advances to another page :
>
> * btpeeknexttuple() -- peek at the next tuple different from any
> blocknum in pfch_list
> * without reading a new index page
> * and without causing any side-effects such as
> altering values in control blocks
> * if found, store blocknum in next element of pfch_list

Yeah, I was getting to that conclusion myself too.

We could call it amprefetchnextheap, since it does just prefetch, and
is good for nothing *but* prefetch.

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	John Lumby <johnlumby(at)hotmail(dot)com>
Cc:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, pgsql hackers <pgsql-hackers(at)postgresql(dot)org>, Claudio Freire <klaussfreire(at)gmail(dot)com>
Subject:	Re: Extended Prefetching using Asynchronous IO - proposal and patch
Date:	2014-05-29 22:18:10
Message-ID:	20140529221810.GR27914@awork2.anarazel.de
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

Hi,

On 2014-05-29 17:53:51 -0400, John Lumby wrote:
> to see how we have done it. And I am attaching corrected version
> of your test program which runs just fine.

It's perfectly fine to not be up to the coding style at this point, but
trying to adhere to it to some degree will make code review later less
painfull...
* comments with **
* line length
* tabs vs spaces
* ...

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	John Lumby <johnlumby(at)hotmail(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, pgsql hackers <pgsql-hackers(at)postgresql(dot)org>
Cc:	Claudio Freire <klaussfreire(at)gmail(dot)com>
Subject:	Re: Extended Prefetching using Asynchronous IO - proposal and patch
Date:	2014-05-29 22:18:11
Message-ID:	BAY175-W43B18F9E6D68BEEA6E7D19A3240@phx.gbl
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

> >>>
> >>> Even if it worked on Linux today, it would be a bad idea to rely on it
> >>> from
> >>> a portability point of view. No, the only sane way to make this work is
> >>> that
> >>> the process that initiates an I/O request is responsible for completing
> >>> it.

I meant to add - it is really a significant benefit that a bkend
can wait on the aio of a different bkend's original prefeetching aio_read.
Remember that we check completion only when the bkend decides it really
wants the block in a buffer, i.e ReadBuffer and friends,
which might be a very long time after it had issued the prefetch request,
or even never (see below). We don't want other bkends which want that
block to have to wait for the originator to get around to reading it.
*Especially* since the originator may *never* read it if it quits its scan early
leaving prefetched but unread blocks behind. (Which is also taken
care of in the patch).

> >>> If another process needs to wait for an async I/O to complete, we must
> >>> use
> >>> some other means to do the waiting. Like the io_in_progress_lock that we
> >>> already have, for the same purpose.

From:	Claudio Freire <klaussfreire(at)gmail(dot)com>
To:	John Lumby <johnlumby(at)hotmail(dot)com>
Cc:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, pgsql hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Extended Prefetching using Asynchronous IO - proposal and patch
Date:	2014-05-29 22:26:44
Message-ID:	CAGTBQpbGkepET4EnA7vLrS8DJFvh1yJKo=WiQKyHFC8CktzCkA@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

On Thu, May 29, 2014 at 6:53 PM, John Lumby <johnlumby(at)hotmail(dot)com> wrote:
> Well, as mentioned earlier, it is not broken. Whether it is efficient
> I am not sure.
> I have looked at the mutex in aio_suspend that you mentioned and I am not
> quite convinced that, if caller is not the original aio_read process,
> it renders the suspend() into an instant timeout. I will see if I can
> verify that.
> Where are you (Claudio) seeing 10us?

fd.c, in FileCompleteaio, sets timeout to:

my_timeout.tv_sec = 0; my_timeout.tv_nsec = 10000;

Which is 10k ns, which is 10 us.

It loops 256 times at most, so it's polling 256 times with a 10 us
timeout. Sounds wasteful.

I'd:

1) If it's the same process, wait for the full timeout (no looping).
If you have to loop (EAGAIN or EINTR - which I just noticed you don't
check for), that's ok.

2) If it's not, just fall through, don't wait, issue the I/O. The
kernel will merge the requests.

From:	Claudio Freire <klaussfreire(at)gmail(dot)com>
To:	John Lumby <johnlumby(at)hotmail(dot)com>
Cc:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, pgsql hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Extended Prefetching using Asynchronous IO - proposal and patch
Date:	2014-05-29 22:29:16
Message-ID:	CAGTBQpbPmoxO4OmBD04NRbr0cj4Wp1PwMY_nrB6ZGF=KyCzZyg@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

On Thu, May 29, 2014 at 7:26 PM, Claudio Freire <klaussfreire(at)gmail(dot)com> wrote:
> 1) If it's the same process, wait for the full timeout (no looping).
> If you have to loop (EAGAIN or EINTR - which I just noticed you don't
> check for), that's ok.

Sorry, meant to say just looping on EINTR.

About the style guidelines, no, I just copy the style of surrounding
code usually.

From:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To:	John Lumby <johnlumby(at)hotmail(dot)com>, pgsql hackers <pgsql-hackers(at)postgresql(dot)org>
Cc:	Claudio Freire <klaussfreire(at)gmail(dot)com>
Subject:	Re: Extended Prefetching using Asynchronous IO - proposal and patch
Date:	2014-05-30 07:15:44
Message-ID:	53883020.2020606@vmware.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

On 05/30/2014 12:53 AM, John Lumby wrote:
>> Date: Thu, 29 May 2014 18:00:28 -0300
>> Subject: Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch
>> From: klaussfreire(at)gmail(dot)com
>> To: hlinnakangas(at)vmware(dot)com
>> CC: johnlumby(at)hotmail(dot)com; pgsql-hackers(at)postgresql(dot)org
>>
>> On Thu, May 29, 2014 at 5:39 PM, Heikki Linnakangas
>> <hlinnakangas(at)vmware(dot)com> wrote:
>>> On 05/29/2014 11:34 PM, Claudio Freire wrote:
>>>>
>>>> On Thu, May 29, 2014 at 5:23 PM, Heikki Linnakangas
>>>> <hlinnakangas(at)vmware(dot)com> wrote:
>>>>>
>>>>> On 05/29/2014 04:12 PM, John Lumby wrote:
>>>>>>
>>>>>>
>>>>>>> On 05/28/2014 11:52 PM, John Lumby wrote:
>>>>>>>
>>>>>>> The patch seems to assume that you can put the aiocb struct in shared
>>>>>>> memory, initiate an asynchronous I/O request from one process, and wait
>>>>>>> for its completion from another process. I'm pretty surprised if that
>>>>>>> works on any platform.
>>>>>>
>>>>>>
>>>>>> It works on linux. Actually this ability allows the asyncio
>>>>>> implementation to
>>>>>> reduce complexity in one respect (yes I know it looks complex enough) :
>>>>>> it makes waiting for completion of an in-progress IO simpler than for
>>>>>> the existing synchronous IO case,. since librt takes care of the
>>>>>> waiting.
>>>>>> specifically, no need for extra wait-for-io control blocks
>>>>>> such as in bufmgr's WaitIO()
>>>>>
>>>>>
>>>>> [checks]. No, it doesn't work. See attached test program.
>
> Thanks for checking and thanks for coming up with that test program.
> However, yes, it really does work -- always (on linux).
> Your test program is doing things in the wrong order -
> it calls aio_suspend *before* aio_error.
> However, the rule is, call aio_suspend *after* aio_error
> and *only* if aio_error returns EINPROGRESS.

I see no such a rule in the man pages of any of the functions involved.
And it wouldn't matter anyway; the behavior is exactly the same if you
aio_error() first.

> See the code changes to fd.c function FileCompleteaio()
> to see how we have done it. And I am attaching corrected version
> of your test program which runs just fine.

As Claudio mentioned earlier, the way FileCompleteaio() uses aio_suspend
is just a complicated way of polling. You might as well replace the
aio_suspend() calls with pg_usleep().

>>>>> It kinda seems to work sometimes, because of the way it's implemented in
>>>>> glibc. The aiocb struct has a field for the result value and errno, and
>>>>> when
>>>>> the I/O is finished, the worker thread fills them in. aio_error() and
>>>>> aio_return() just return the values of those fields, so calling
>>>>> aio_error()
>>>>> or aio_return() do in fact happen to work from a different process.
>>>>> aio_suspend(), however, is implemented by sleeping on a process-local
>>>>> mutex,
>>>>> which does not work from a different process.
>>>>>
>>>>> Even if it worked on Linux today, it would be a bad idea to rely on it
>>>>> from
>>>>> a portability point of view. No, the only sane way to make this work is
>>>>> that
>>>>> the process that initiates an I/O request is responsible for completing
>>>>> it.
>>>>> If another process needs to wait for an async I/O to complete, we must
>>>>> use
>>>>> some other means to do the waiting. Like the io_in_progress_lock that we
>>>>> already have, for the same purpose.
>>>>
>>>> But calls to it are timeouted by 10us, effectively turning the thing
>>>> into polling mode.
>>>
>>> We don't want polling... And even if we did, calling aio_suspend() in a way
>>> that's known to be broken, in a loop, is a pretty crappy way of polling.
>
> Well, as mentioned earlier, it is not broken. Whether it is efficient I am not sure.
> I have looked at the mutex in aio_suspend that you mentioned and I am not
> quite convinced that, if caller is not the original aio_read process,
> it renders the suspend() into an instant timeout. I will see if I can verify that.

I don't see the point of pursuing this design further. Surely we don't
want to use polling here, and you're relying on undefined behavior
anyway. I'm pretty sure aio_return/aio_error won't work from a different
process on all platforms, even if it happens to work on Linux. Even on
Linux, it might stop working if the underlying implementation changes
from the glibc pthread emulation to something kernel-based.

- Heikki

From:	Claudio Freire <klaussfreire(at)gmail(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	John Lumby <johnlumby(at)hotmail(dot)com>, pgsql hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Extended Prefetching using Asynchronous IO - proposal and patch
Date:	2014-05-30 13:36:37
Message-ID:	CAGTBQpb9nissmi8xkcHkDFg+pG6jRy9zqGfrRfdDVP5ga8Q6qw@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

On Fri, May 30, 2014 at 4:15 AM, Heikki Linnakangas
<hlinnakangas(at)vmware(dot)com> wrote:
>>>> We don't want polling... And even if we did, calling aio_suspend() in a
>>>> way
>>>> that's known to be broken, in a loop, is a pretty crappy way of polling.
>>
>>
>> Well, as mentioned earlier, it is not broken. Whether it is
>> efficient I am not sure.
>> I have looked at the mutex in aio_suspend that you mentioned and I am not
>> quite convinced that, if caller is not the original aio_read process,
>> it renders the suspend() into an instant timeout. I will see if I can
>> verify that.
>
>
> I don't see the point of pursuing this design further. Surely we don't want
> to use polling here, and you're relying on undefined behavior anyway. I'm
> pretty sure aio_return/aio_error won't work from a different process on all
> platforms, even if it happens to work on Linux. Even on Linux, it might stop
> working if the underlying implementation changes from the glibc pthread
> emulation to something kernel-based.

I'll try to do some measuring of performance with:
a) git head
b) git head + patch as-is
c) git head + patch without aio_suspend in foreign processes (just re-read)
d) git head + patch with a lwlock (or whatever works) instead of aio_suspend

a-c will be the fastest, d might take some while.

I'll let you know of the results as I get them.

From:	johnlumby <johnlumby(at)hotmail(dot)com>
To:	Claudio Freire <klaussfreire(at)gmail(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, pgsql hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Extended Prefetching using Asynchronous IO - proposal and patch
Date:	2014-06-01 00:44:28
Message-ID:	BLU436-SMTP6AB1FF61C05A13A944357A3210@phx.gbl
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

On 05/30/14 09:36, Claudio Freire wrote:
> On Fri, May 30, 2014 at 4:15 AM, Heikki Linnakangas
> <hlinnakangas(at)vmware(dot)com> wrote:
>> I see no such a rule in the man pages of any of the functions
involved. And it wouldn't matter anyway; the behavior is exactly the
same if you aio_error() first.

You are completely correct and I was wrong.
For the case of different-process,
It is not the order of the aio calls that makes it work/not work,
it is whether it polls with a timeout.
I think I was confusing this with a rule concerning order for
calling aio_return.

>>>>> We don't want polling... And even if we did, calling aio_suspend() in a
>>>>> way
>>>>> that's known to be broken, in a loop, is a pretty crappy way of polling.

I have made a change to the complete_aio functions
to use polling *only* if the caller is not the originator of the aio_read.
If it is the originator, it now calls aio_suspend with no timeout
(i.e. wait till complete).
The loop also now checks for the EINTR case which someone
pointed out.

In my test runs, with a debug message in FileCompleteaio to tell me
whether caller is or is not the originator of the aio_read,
I see > 99.8% of calls are from originator and only < 0.2% are not.
e.g. (samples from two backends)

different : 10
same : 11726

different : 38
same : 12105

new patch based on today 140531 is attached,

This improves one of my benchmarks by about 10% throughput,
and now shows an overall 23% improvement relative to existing code with
posix_fadvise.

So hopefully this addresses your performance concern.

If you look at the new patch, you'll see that for the different-pid case,
I still call aio_suspend with a timeout.
As you or Claudio pointed out earlier, it could just as well sleep
for the same timeout,
but the small advantage of calling aio_suspend is if the io completed
just between
the aio_error returning EINPROGRESS and the aio_suspend call.
Also it makes the code simpler. In fact this change is quite small,
just a few lines
in backend/storage/buffer/buf_async.c and backend/storage/file/fd.c

Based on this, I think it is not necessary to get rid of the polling
altogether
(and in any case, as far as I can see, very difficult).

>>>
>>> Well, as mentioned earlier, it is not broken. Whether it is
>>> efficient I am not sure.
>>> I have looked at the mutex in aio_suspend that you mentioned and I am not
>>> quite convinced that, if caller is not the original aio_read process,
>>> it renders the suspend() into an instant timeout. I will see if I can
>>> verify that.
>>
>> I don't see the point of pursuing this design further. Surely we don't want
>> to use polling here, and you're relying on undefined behavior anyway. I'm
>> pretty sure aio_return/aio_error won't work from a different process on all
>> platforms, even if it happens to work on Linux. Even on Linux, it might stop
>> working if the underlying implementation changes from the glibc pthread
>> emulation to something kernel-based.

Good point. I have included the guts of your little test program
(modified to do polling) into the existing autoconf test program
that decides on the
#define USE_AIO_ATOMIC_BUILTIN_COMP_SWAP.
See config/c-library.m4.
I hope this goes some way to answer your concern about robustness,
as at least now if the implementation changes in some way that
renders the polling ineffective, it will be caught in configure.

> I'll try to do some measuring of performance with:
> a) git head
> b) git head + patch as-is
> c) git head + patch without aio_suspend in foreign processes (just re-read)
> d) git head + patch with a lwlock (or whatever works) instead of aio_suspend
>
> a-c will be the fastest, d might take some while.
>
> I'll let you know of the results as I get them.
>
>
Claudio, I am not quite sure if what I am submitting now is
quite the same as any of yours. As I promised before, but have
not yet done, I will package one or two of my benchmarks and
send them in.

Attachment	Content-Type	Size
postgresql-9.4.140531.async_io_prefetching.patch	text/x-patch	310.1 KB

From:	johnlumby <johnlumby(at)hotmail(dot)com>
To:	pgsql hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Extended Prefetching using Asynchronous IO - proposal and patch
Date:	2014-06-01 01:03:47
Message-ID:	BLU436-SMTP96D20E49382D359070213EA3210@phx.gbl
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

On 05/31/14 20:44, johnlumby wrote:
> On 05/30/14 09:36, Claudio Freire wrote:
>
> Good point. I have included the guts of your little test program
> (modified to do polling) into the existing autoconf test program
> that decides on the
> #define USE_AIO_ATOMIC_BUILTIN_COMP_SWAP.
> See config/c-library.m4.
> I hope this goes some way to answer your concern about robustness,
> as at least now if the implementation changes in some way that
> renders the polling ineffective, it will be caught in configure.
>
>
I meant to add that by including this test, which involves a fork(),
in the autoconf tester, on Windows
USE_AIO_ATOMIC_BUILTIN_COMP_SWAP would always by un-defined.
(But could then be defined manually if someone wanted to give it a try)

From:	Claudio Freire <klaussfreire(at)gmail(dot)com>
To:	johnlumby <johnlumby(at)hotmail(dot)com>
Cc:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, pgsql hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Extended Prefetching using Asynchronous IO - proposal and patch
Date:	2014-06-01 05:27:11
Message-ID:	CAGTBQpY0jCuMgXudn5_e=9o3OHezk3vzx_VhmSEnpUYiwOvsZg@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

On Sat, May 31, 2014 at 9:44 PM, johnlumby <johnlumby(at)hotmail(dot)com> wrote:
> I'll try to do some measuring of performance with:
> a) git head
> b) git head + patch as-is
> c) git head + patch without aio_suspend in foreign processes (just re-read)
> d) git head + patch with a lwlock (or whatever works) instead of aio_suspend
>
> a-c will be the fastest, d might take some while.
>
> I'll let you know of the results as I get them.
>
>
> Claudio, I am not quite sure if what I am submitting now is
> quite the same as any of yours. As I promised before, but have
> not yet done, I will package one or two of my benchmarks and
> send them in.

It's a tad different. c will not do polling on the foreign process, I
will just let PG do the read again. d will be like polling, but
without the associated CPU overhead.

From:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To:	johnlumby <johnlumby(at)hotmail(dot)com>, Claudio Freire <klaussfreire(at)gmail(dot)com>, pgsql hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Extended Prefetching using Asynchronous IO - proposal and patch
Date:	2014-06-01 22:33:37
Message-ID:	538BAA41.7090106@vmware.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

On 06/01/2014 03:44 AM, johnlumby wrote:
> If you look at the new patch, you'll see that for the different-pid case,
> I still call aio_suspend with a timeout.
> As you or Claudio pointed out earlier, it could just as well sleep
> for the same timeout,
> but the small advantage of calling aio_suspend is if the io completed
> just between
> the aio_error returning EINPROGRESS and the aio_suspend call.
> Also it makes the code simpler. In fact this change is quite small,
> just a few lines
> in backend/storage/buffer/buf_async.c and backend/storage/file/fd.c
>
> Based on this, I think it is not necessary to get rid of the polling
> altogether
> (and in any case, as far as I can see, very difficult).

That's still just as wrong as it always has been. Just get rid of it.
Don't put aiocb structs in shared memory at all. They don't belong there.

>>>> Well, as mentioned earlier, it is not broken. Whether it is
>>>> efficient I am not sure.
>>>> I have looked at the mutex in aio_suspend that you mentioned and I am not
>>>> quite convinced that, if caller is not the original aio_read process,
>>>> it renders the suspend() into an instant timeout. I will see if I can
>>>> verify that.
>>>
>>> I don't see the point of pursuing this design further. Surely we don't want
>>> to use polling here, and you're relying on undefined behavior anyway. I'm
>>> pretty sure aio_return/aio_error won't work from a different process on all
>>> platforms, even if it happens to work on Linux. Even on Linux, it might stop
>>> working if the underlying implementation changes from the glibc pthread
>>> emulation to something kernel-based.
>
> Good point. I have included the guts of your little test program
> (modified to do polling) into the existing autoconf test program
> that decides on the
> #define USE_AIO_ATOMIC_BUILTIN_COMP_SWAP.
> See config/c-library.m4.
> I hope this goes some way to answer your concern about robustness,
> as at least now if the implementation changes in some way that
> renders the polling ineffective, it will be caught in configure.

No, that does not make it robust enough.

- Heikki

From:	johnlumby <johnlumby(at)hotmail(dot)com>
To:	pgsql hackers <pgsql-hackers(at)postgresql(dot)org>
Cc:	Claudio Freire <klaussfreire(at)gmail(dot)com>
Subject:	Re: Extended Prefetching using Asynchronous IO - proposal and patch
Date:	2014-06-09 02:12:08
Message-ID:	BLU436-SMTP185C39CB361AAF0598C3165A3290@phx.gbl
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

updated version of patch compatible with git head of 140608,
(adjusted proc oid and a couple of minor fixes)

Attachment	Content-Type	Size
postgresql-9.4.140608.async_io_prefetching.patch	text/x-patch	310.9 KB

From:	Claudio Freire <klaussfreire(at)gmail(dot)com>
To:	johnlumby <johnlumby(at)hotmail(dot)com>
Cc:	pgsql hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Extended Prefetching using Asynchronous IO - proposal and patch
Date:	2014-06-09 16:58:53
Message-ID:	CAGTBQpatXmMEng14r0JMGeOKbzQoDVEae9YN_0d5ppKv0=BEWw@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

I'm having trouble setting max_async_io_prefetchers in postgresql.conf

It says it cannot be changed.

Is that fixed at initdb time? (sounds like a bad idea if it is)

On Sun, Jun 8, 2014 at 11:12 PM, johnlumby <johnlumby(at)hotmail(dot)com> wrote:
> updated version of patch compatible with git head of 140608,
> (adjusted proc oid and a couple of minor fixes)

From:	Greg Stark <stark(at)mit(dot)edu>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	John Lumby <johnlumby(at)hotmail(dot)com>, pgsql hackers <pgsql-hackers(at)postgresql(dot)org>, Claudio Freire <klaussfreire(at)gmail(dot)com>
Subject:	Re: Extended Prefetching using Asynchronous IO - proposal and patch
Date:	2014-06-19 17:49:44
Message-ID:	CAM-w4HNr0FOfYXa74tQ8=pdZLt7dP=S4tdGPhXu+=2vWWd23Ng@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

On Wed, May 28, 2014 at 2:19 PM, Heikki Linnakangas
<hlinnakangas(at)vmware(dot)com> wrote:
> How portable is POSIX aio nowadays? Googling around, it still seems that on
> Linux, it's implemented using threads. Does the thread-emulation
> implementation cause problems with the rest of the backend, which assumes
> that there is only a single thread? In any case, I think we'll want to
> encapsulate the AIO implementation behind some kind of an API, to allow
> other implementations to co-exist.

I think POSIX aio is pretty damn standard and it's a pretty fiddly
interface. If we abstract it behind an i/o interface we're going to
lose a lot of the power. Abstracting it behind a set of buffer manager
operations (initiate i/o on buffer, complete i/o on buffer, abort i/o
on buffer) should be fine but that's basically what we have, no?

I don't think the threaded implementation on Linux is the one to use
though. I find this *super* confusing but the kernel definitely
supports aio syscalls, glibc also has a threaded implementation it
uses if run on a kernel that doesn't implement the syscalls, and I
think there are existing libaio and librt libraries from outside glibc
that do one or the other. Which you build against seems to make a big
difference. My instinct is that anything but the kernel native
implementation will be worthless. The overhead of thread communication
will completely outweigh any advantage over posix_fadvise's partial
win.

The main advantage of posix aio is that we can actually receive the
data out of order. With posix_fadvise we can get the i/o and cpu
overlap but we will never process the later blocks until the earlier
requests are satisfied and processed in order. With aio you could do a
sequential scan, initiating i/o on 1,000 blocks and then processing
them as they arrive, initiating new requests as those blocks are
handled.

When I investigated this I found the buffer manager's I/O bits seemed
to already be able to represent the state we needed (i/o initiated on
this buffer but not completed). The problem was in ensuring that a
backend would process the i/o completion promptly when it might be in
the midst of handling other tasks and might even get an elog() stack
unwinding. The interface that actually fits Postgres best might be the
threaded interface (orthogonal to the threaded implementation
question) which is you give aio a callback which gets called on a
separate thread when the i/o completes. The alternative is you give
aio a list of operation control blocks and it tells you the state of
all the i/o operations. But it's not clear to me how you arrange to do
that regularly, promptly, and reliably.

The other gotcha here is that the kernel implementation only does
anything useful on DIRECT_IO files. That means you have to do *all*
the prefetching and i/o scheduling yourself. You would be doing that
anyways for sequential scans and bitmap scans -- and we already do it
with things like synchronised scans and posix_fadvise -- but index
scans would need to get some intelligence for when it makes sense to
read more than one page at a time. It might be possible to do
something fairly coarse like having our i/o operators keep track of
how often i/o on a relation falls within a certain number of blocks of
an earlier i/o and autotune number of blocks to read based on that. It
might not be hard to do better than the kernel with even basic info
like what level of the index we're reading or what type of pointer
we're following.

Finally, when I did the posix_fadvise work I wrote a synthetic
benchmark for testing the equivalent i/o pattern of a bitmap scan. It
let me simulate bitmap scans of varying densities with varying
parameters, notably how many i/o to keep in flight at once. It
supported posix_fadvise or aio. You should look it up in the archives,
it made for some nice looking graphs. IIRC I could not find any build
environment where aio offered any performance boost at all. I think
this means I just didn't know how to build it against the right
libraries or wasn't using the right kernel or there was some skew
between them at the time.

--
greg

From:	Claudio Freire <klaussfreire(at)gmail(dot)com>
To:	Greg Stark <stark(at)mit(dot)edu>
Cc:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, John Lumby <johnlumby(at)hotmail(dot)com>, pgsql hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Extended Prefetching using Asynchronous IO - proposal and patch
Date:	2014-06-19 18:43:44
Message-ID:	CAGTBQpbw56ygQbuZ0tWGbECWwxsmsEq-kZxNvfmvYsC_6u3wPg@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

On Thu, Jun 19, 2014 at 2:49 PM, Greg Stark <stark(at)mit(dot)edu> wrote:
> I don't think the threaded implementation on Linux is the one to use
> though. I find this *super* confusing but the kernel definitely
> supports aio syscalls, glibc also has a threaded implementation it
> uses if run on a kernel that doesn't implement the syscalls, and I
> think there are existing libaio and librt libraries from outside glibc
> that do one or the other. Which you build against seems to make a big
> difference. My instinct is that anything but the kernel native
> implementation will be worthless. The overhead of thread communication
> will completely outweigh any advantage over posix_fadvise's partial
> win.

What overhead?

The only communication is through a "done" bit and associated
synchronization structure when *and only when* you want to wait on it.

Furthermore, posix_fadvise is braindead on this use case, been there,
done that. What you win with threads is a better postgres-kernel
interaction, even if you loose some CPU performance it's gonna beat
posix_fadvise by a large margin.

> The main advantage of posix aio is that we can actually receive the
> data out of order. With posix_fadvise we can get the i/o and cpu
> overlap but we will never process the later blocks until the earlier
> requests are satisfied and processed in order. With aio you could do a
> sequential scan, initiating i/o on 1,000 blocks and then processing
> them as they arrive, initiating new requests as those blocks are
> handled.

And each and every I/O you issue with it counts as such on the kernel side.

It's not the case with posix_fadvise, mind you, and that's terribly
damaging for performance.

> When I investigated this I found the buffer manager's I/O bits seemed
> to already be able to represent the state we needed (i/o initiated on
> this buffer but not completed). The problem was in ensuring that a
> backend would process the i/o completion promptly when it might be in
> the midst of handling other tasks and might even get an elog() stack
> unwinding. The interface that actually fits Postgres best might be the
> threaded interface (orthogonal to the threaded implementation
> question) which is you give aio a callback which gets called on a
> separate thread when the i/o completes. The alternative is you give
> aio a list of operation control blocks and it tells you the state of
> all the i/o operations. But it's not clear to me how you arrange to do
> that regularly, promptly, and reliably.

Indeed we've been musing about using librt's support of completion
callbacks for that.

> The other gotcha here is that the kernel implementation only does
> anything useful on DIRECT_IO files. That means you have to do *all*
> the prefetching and i/o scheduling yourself.

If that's the case, we should discard kernel-based implementations and
stick to thread-based ones. Postgres cannot do scheduling as
efficiently as the kernel, and it shouldn't try.

> You would be doing that
> anyways for sequential scans and bitmap scans -- and we already do it
> with things like synchronised scans and posix_fadvise

That only works because the patterns are semi-sequential. If you try
to schedule random access, it becomes effectively impossible without
hardware info.

The kernel is the one with hardware info.

> Finally, when I did the posix_fadvise work I wrote a synthetic
> benchmark for testing the equivalent i/o pattern of a bitmap scan. It
> let me simulate bitmap scans of varying densities with varying
> parameters, notably how many i/o to keep in flight at once. It
> supported posix_fadvise or aio. You should look it up in the archives,
> it made for some nice looking graphs. IIRC I could not find any build
> environment where aio offered any performance boost at all. I think
> this means I just didn't know how to build it against the right
> libraries or wasn't using the right kernel or there was some skew
> between them at the time.

If it's old, it probable you didn't hit the kernel's braindeadness
since it was introduced somewhat recently (somewhate, ie, before 3.x I
believe).

Even if you did hit it, bitmap heap scans are blessed with sequential
ordering. The technique doesn't work nearly as well with random I/O
that might be sorted from time to time.

When traversing an index, you do a mostly sequential pattern due to
physical correlation, but not completely sequential. Not only that,
with the assumption of random I/O, and the uncertainty of when will
the scan be aborted too, you don't read ahead as much as you could if
you knew it was sequential or a full scan. That kills performance. You
don't fetch enough ahead of time to avoid stalls, and the kernel
doesn't do read-ahead either because posix_fadvise effectively
disables it, resulting in the equivalent of direct I/O with bad
scheduling.

Solving this for index scans isn't just a little more complex. It's
insanely more complex, because you need hardware information to do it
right. How many spindles, how many sectors per cylinder if it's
rotational, how big the segments if it's flash, etc, etc... all stuff
hidden away inside the kernel.

It's not a good idea to try to do the kernel's job. Aio, even
threaded, lets you avoid tha.

If you still have the benchmark around, I suggest you shuffle the
sectors a little bit (but not fully) and try them with semi-random
I/O.

From:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To:	johnlumby <johnlumby(at)hotmail(dot)com>
Cc:	pgsql hackers <pgsql-hackers(at)postgresql(dot)org>, Claudio Freire <klaussfreire(at)gmail(dot)com>
Subject:	Re: Extended Prefetching using Asynchronous IO - proposal and patch
Date:	2014-06-19 19:21:19
Message-ID:	CAHGQGwHbf6vBRLFDCsfnVDd1M94LdvXOZddL_=WPSpjfskGCyQ@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

On Mon, Jun 9, 2014 at 11:12 AM, johnlumby <johnlumby(at)hotmail(dot)com> wrote:
> updated version of patch compatible with git head of 140608,
> (adjusted proc oid and a couple of minor fixes)

Compilation of patched version on MacOS failed. The error messages were

gcc -O0 -Wall -Wmissing-prototypes -Wpointer-arith
-Wdeclaration-after-statement -Wendif-labels
-Wmissing-format-attribute -Wformat-security -fno-strict-aliasing
-fwrapv -g -I../../../../src/include -c -o heapam.o heapam.c
heapam.c: In function 'heap_unread_add':
heapam.c:362: error: 'struct HeapScanDescData' has no member named
'rs_Unread_Pfetched_next'
heapam.c:363: error: 'struct HeapScanDescData' has no member named
'rs_Unread_Pfetched_count'
heapam.c:369: error: 'struct HeapScanDescData' has no member named
'rs_Unread_Pfetched_count'
heapam.c:375: error: 'struct HeapScanDescData' has no member named
'rs_Unread_Pfetched_base'
heapam.c:381: error: 'struct HeapScanDescData' has no member named
'rs_Unread_Pfetched_base'
heapam.c:387: error: 'struct HeapScanDescData' has no member named
'rs_Unread_Pfetched_next'
heapam.c:405: error: 'struct HeapScanDescData' has no member named
'rs_Unread_Pfetched_count'
heapam.c: In function 'heap_unread_subtract':
heapam.c:419: error: 'struct HeapScanDescData' has no member named
'rs_Unread_Pfetched_next'
heapam.c:425: error: 'struct HeapScanDescData' has no member named
'rs_Unread_Pfetched_count'
heapam.c:434: error: 'struct HeapScanDescData' has no member named
'rs_Unread_Pfetched_base'
heapam.c:442: error: 'struct HeapScanDescData' has no member named
'rs_Unread_Pfetched_base'
heapam.c:452: error: 'struct HeapScanDescData' has no member named
'rs_Unread_Pfetched_next'
heapam.c:453: error: 'struct HeapScanDescData' has no member named
'rs_Unread_Pfetched_next'
heapam.c:454: error: 'struct HeapScanDescData' has no member named
'rs_Unread_Pfetched_next'
heapam.c:456: error: 'struct HeapScanDescData' has no member named
'rs_Unread_Pfetched_count'
heapam.c: In function 'heapgettup':
heapam.c:944: error: 'struct HeapScanDescData' has no member named
'rs_pfchblock'
heapam.c:716: warning: unused variable 'ix'
heapam.c: In function 'heapgettup_pagemode':
heapam.c:1243: error: 'struct HeapScanDescData' has no member named
'rs_pfchblock'
heapam.c:1029: warning: unused variable 'ix'
heapam.c: In function 'heap_endscan':
heapam.c:1808: error: 'struct HeapScanDescData' has no member named
'rs_Unread_Pfetched_base'
heapam.c:1809: error: 'struct HeapScanDescData' has no member named
'rs_Unread_Pfetched_base'
make[4]: *** [heapam.o] Error 1
make[3]: *** [heap-recursive] Error 2
make[2]: *** [access-recursive] Error 2
make[1]: *** [install-backend-recurse] Error 2
make: *** [install-src-recurse] Error 2

Huge patch is basically not easy to review. What about simplifying the patch
by excluding non-core parts like the change of pg_stat_statements, so that
reviewers can easily read the patch? We can add such non-core parts later.

Regards,

--
Fujii Masao

From:	John Lumby <johnlumby(at)hotmail(dot)com>
To:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, pgsql hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Extended Prefetching using Asynchronous IO - proposal and patch
Date:	2014-06-20 22:12:38
Message-ID:	BAY175-W13A3E7519085AC9B61D5AAA3120@phx.gbl
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

Thanks Fujii , that is a bug -- an #ifdef USE_PREFETCH is missing in heapam.c
(maybe several)
I will fix it in the next patch version.

I also appreciate it is not easy to review the patch.
There are really 4 (or maybe 5) parts :

     .   async io (librt functions)
     .   buffer management   (allocating, locking and pinning etc)
     .   scanners making prefetch calls
     .   statistics

and the autoconf input program

I will see what I can do. Maybe putting an indicator against each modified file?

I am currently working on two things :
.   alternative way for non-originator of an aio_read to wait on completion
             (LWlock instead of polling the aiocb)
      This was talked about in several earlier posts and Claudio is also working on something there
.   package up my benchmark

Cheers John

----------------------------------------
> Date: Fri, 20 Jun 2014 04:21:19 +0900
> Subject: Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch
> From: masao(dot)fujii(at)gmail(dot)com
> To: johnlumby(at)hotmail(dot)com
> CC: pgsql-hackers(at)postgresql(dot)org; klaussfreire(at)gmail(dot)com
>
> On Mon, Jun 9, 2014 at 11:12 AM, johnlumby <johnlumby(at)hotmail(dot)com> wrote:
>> updated version of patch compatible with git head of 140608,
>> (adjusted proc oid and a couple of minor fixes)
>
> Compilation of patched version on MacOS failed. The error messages were
>
> gcc -O0 -Wall -Wmissing-prototypes -Wpointer-arith
> -Wdeclaration-after-statement -Wendif-labels
> -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing
> -fwrapv -g -I../../../../src/include -c -o heapam.o heapam.c
> heapam.c: In function 'heap_unread_add':
> heapam.c:362: error: 'struct HeapScanDescData' has no member named
> 'rs_Unread_Pfetched_next'
> heapam.c:363: error: 'struct HeapScanDescData' has no member named
> 'rs_Unread_Pfetched_count'
> heapam.c:369: error: 'struct HeapScanDescData' has no member named
> 'rs_Unread_Pfetched_count'
> heapam.c:375: error: 'struct HeapScanDescData' has no member named
> 'rs_Unread_Pfetched_base'
> heapam.c:381: error: 'struct HeapScanDescData' has no member named
> 'rs_Unread_Pfetched_base'
> heapam.c:387: error: 'struct HeapScanDescData' has no member named
> 'rs_Unread_Pfetched_next'
> heapam.c:405: error: 'struct HeapScanDescData' has no member named
> 'rs_Unread_Pfetched_count'
> heapam.c: In function 'heap_unread_subtract':
> heapam.c:419: error: 'struct HeapScanDescData' has no member named
> 'rs_Unread_Pfetched_next'
> heapam.c:425: error: 'struct HeapScanDescData' has no member named
> 'rs_Unread_Pfetched_count'
> heapam.c:434: error: 'struct HeapScanDescData' has no member named
> 'rs_Unread_Pfetched_base'
> heapam.c:442: error: 'struct HeapScanDescData' has no member named
> 'rs_Unread_Pfetched_base'
> heapam.c:452: error: 'struct HeapScanDescData' has no member named
> 'rs_Unread_Pfetched_next'
> heapam.c:453: error: 'struct HeapScanDescData' has no member named
> 'rs_Unread_Pfetched_next'
> heapam.c:454: error: 'struct HeapScanDescData' has no member named
> 'rs_Unread_Pfetched_next'
> heapam.c:456: error: 'struct HeapScanDescData' has no member named
> 'rs_Unread_Pfetched_count'
> heapam.c: In function 'heapgettup':
> heapam.c:944: error: 'struct HeapScanDescData' has no member named
> 'rs_pfchblock'
> heapam.c:716: warning: unused variable 'ix'
> heapam.c: In function 'heapgettup_pagemode':
> heapam.c:1243: error: 'struct HeapScanDescData' has no member named
> 'rs_pfchblock'
> heapam.c:1029: warning: unused variable 'ix'
> heapam.c: In function 'heap_endscan':
> heapam.c:1808: error: 'struct HeapScanDescData' has no member named
> 'rs_Unread_Pfetched_base'
> heapam.c:1809: error: 'struct HeapScanDescData' has no member named
> 'rs_Unread_Pfetched_base'
> make[4]: *** [heapam.o] Error 1
> make[3]: *** [heap-recursive] Error 2
> make[2]: *** [access-recursive] Error 2
> make[1]: *** [install-backend-recurse] Error 2
> make: *** [install-src-recurse] Error 2
>
>
> Huge patch is basically not easy to review. What about simplifying the patch
> by excluding non-core parts like the change of pg_stat_statements, so that
> reviewers can easily read the patch? We can add such non-core parts later.
>
> Regards,
>
> --
> Fujii Masao

From:	Jim Nasby <jim(at)nasby(dot)net>
To:	John Lumby <johnlumby(at)hotmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, pgsql hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Extended Prefetching using Asynchronous IO - proposal and patch
Date:	2014-06-23 21:05:23
Message-ID:	53A89693.2030003@nasby.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

On 6/20/14, 5:12 PM, John Lumby wrote:
> I also appreciate it is not easy to review the patch.
> There are really 4 (or maybe 5) parts :
>
> . async io (librt functions)
> . buffer management (allocating, locking and pinning etc)
> . scanners making prefetch calls
> . statistics
>
> and the autoconf input program
>
> I will see what I can do. Maybe putting an indicator against each modified file?

Generally the best way to handle cases like this is to create multiple patches, each of which builds upon previous ones.
--
Jim C. Nasby, Data Architect jim(at)nasby(dot)net
512.569.9461 (cell) http://jim.nasby.net

From:	John Lumby <johnlumby(at)hotmail(dot)com>
To:	Claudio Freire <klaussfreire(at)gmail(dot)com>, Greg Stark <stark(at)mit(dot)edu>
Cc:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, pgsql hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Extended Prefetching using Asynchronous IO - proposal and patch
Date:	2014-06-23 21:43:06
Message-ID:	BAY175-W364D73BCD54C4E4351AA63A31F0@phx.gbl
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

----------------------------------------
> Date: Thu, 19 Jun 2014 15:43:44 -0300
> Subject: Re: Extended Prefetching using Asynchronous IO - proposal and patch
> From: klaussfreire(at)gmail(dot)com
> To: stark(at)mit(dot)edu
> CC: hlinnakangas(at)vmware(dot)com; johnlumby(at)hotmail(dot)com; pgsql-hackers(at)postgresql(dot)org
>
> On Thu, Jun 19, 2014 at 2:49 PM, Greg Stark <stark(at)mit(dot)edu> wrote:
>> I don't think the threaded implementation on Linux is the one to use
>> though. [... ] The overhead of thread communication
>> will completely outweigh any advantage over posix_fadvise's partial
>> win.
>
> What overhead?
>
> The only communication is through a "done" bit and associated
> synchronization structure when *and only when* you want to wait on it.
>

Threads do cost some extra CPU, but provided the system had some
spare CPU capacity, then performance improves because of reduced IO wait.
I quoted a measured improvement of 17% - 18% improvement in the README
along with some more explanation of when the asyc IO gives and improvement.

> Furthermore, posix_fadvise is braindead on this use case, been there,
> done that. What you win with threads is a better postgres-kernel
> interaction, even if you loose some CPU performance it's gonna beat
> posix_fadvise by a large margin.
>
> [...]
>
>> When I investigated this I found the buffer manager's I/O bits seemed
>> to already be able to represent the state we needed (i/o initiated on
>> this buffer but not completed). The problem was in ensuring that a
>> backend would process the i/o completion promptly when it might be in
>> the midst of handling other tasks and might even get an elog() stack
>> unwinding. The interface that actually fits Postgres best might be the
>> threaded interface (orthogonal to the threaded implementation
>> question) which is you give aio a callback which gets called on a
>> separate thread when the i/o completes. The alternative is you give
>> aio a list of operation control blocks and it tells you the state of
>> all the i/o operations. But it's not clear to me how you arrange to do
>> that regularly, promptly, and reliably.
>
> Indeed we've been musing about using librt's support of completion
> callbacks for that.

For the most common case of a backend issues a PrefetchBuffer
and then that *same* backend issues ReadBuffer, the posix aio works
ideally, since there is no need for any callback or completion signal,
we simply check "is it complete" during the ReadBuffer.

It is when some *other* backend gets there first with the ReadBuffer that
things are a bit trickier.    The current version of the patch did polling for that case
but that drew criticism,    and so an imminent new version of the patch
uses the sigevent mechanism.    And there are other ways still.

In an earlier posting I reported that , in my benchmark,
99.8% of [FileCompleteaio] calls are from originator and only < 0.2% are not.so, from a performance perspective, only the common case really matters.

From:	Greg Stark <stark(at)mit(dot)edu>
To:	John Lumby <johnlumby(at)hotmail(dot)com>
Cc:	Claudio Freire <klaussfreire(at)gmail(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, pgsql hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Extended Prefetching using Asynchronous IO - proposal and patch
Date:	2014-06-23 23:04:50
Message-ID:	CAM-w4HMGMkNL+WpDDEp+9sd+idZ1rxEJLaALLCFs0URNJGcyjA@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

On Mon, Jun 23, 2014 at 2:43 PM, John Lumby <johnlumby(at)hotmail(dot)com> wrote:
> It is when some *other* backend gets there first with the ReadBuffer that
> things are a bit trickier. The current version of the patch did polling for that case
> but that drew criticism, and so an imminent new version of the patch
> uses the sigevent mechanism. And there are other ways still.

I'm a bit puzzled by this though. Postgres *already* has code for this
case. When you call ReadBuffer you set the bits on the buffer
indicating I/O is in progress. If another backend does ReadBuffer for
the same block they'll get the same buffer and then wait until the
first backend's I/O completes. ReadBuffer goes through some hoops to
handle this (and all the corner cases such as the other backend's I/O
completing and the buffer being reused for another block before the
first backend reawakens). It would be a shame to reinvent the wheel.

The problem with using the Buffers I/O in progress bit is that the I/O
might complete while the other backend is busy doing stuff. As long as
you can handle the I/O completion promptly -- either in callback or
thread or signal handler then that wouldn't matter. But I'm not clear
that any of those will work reliably.

--
greg

From:	John Lumby <johnlumby(at)hotmail(dot)com>
To:	Greg Stark <stark(at)mit(dot)edu>
Cc:	Claudio Freire <klaussfreire(at)gmail(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, pgsql hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Extended Prefetching using Asynchronous IO - proposal and patch
Date:	2014-06-24 13:29:53
Message-ID:	BAY175-W27A559A507831A31D34CB6A31E0@phx.gbl
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

----------------------------------------
> From: stark(at)mit(dot)edu
> Date: Mon, 23 Jun 2014 16:04:50 -0700
> Subject: Re: Extended Prefetching using Asynchronous IO - proposal and patch
> To: johnlumby(at)hotmail(dot)com
> CC: klaussfreire(at)gmail(dot)com; hlinnakangas(at)vmware(dot)com; pgsql-hackers(at)postgresql(dot)org
>
> On Mon, Jun 23, 2014 at 2:43 PM, John Lumby <johnlumby(at)hotmail(dot)com> wrote:
>> It is when some *other* backend gets there first with the ReadBuffer that
>> things are a bit trickier. The current version of the patch did polling for that case
>> but that drew criticism, and so an imminent new version of the patch
>> uses the sigevent mechanism. And there are other ways still.
>
> I'm a bit puzzled by this though. Postgres *already* has code for this
> case. When you call ReadBuffer you set the bits on the buffer

Good question.     Let me explain.
Yes, postgresql has code for the case of a backend is inside a synchronous
read() or write(), performed from a ReadBuffer(), and some other backend
wants that buffer.    asynchronous aio is initiated not from ReadBuffer
but from PrefetchBuffer,    and performs its aio_read into an allocated, pinned,
postgresql buffer.    This is entirely different from the synchronous io case.
Why?      Because the issuer of the aio_read (the "originator") is unaware
of this buffer pinned on its behalf, and is then free to do any other
reading or writing it wishes,   such as more prefetching or any other operation.
And furthermore, it may *never* issue a ReadBuffer for the block which it
prefetched.

Therefore, asynchronous IO is different from synchronous IO, and
a new bit, BM_AIO_IN_PROGRESS, in the buf_header is required to
track this aio operation until completion.

I would encourage you to read the new
postgresql-prefetching-asyncio.README
in the patch file where this is explained in greater detail.

> indicating I/O is in progress. If another backend does ReadBuffer for
> the same block they'll get the same buffer and then wait until the
> first backend's I/O completes. ReadBuffer goes through some hoops to
> handle this (and all the corner cases such as the other backend's I/O
> completing and the buffer being reused for another block before the
> first backend reawakens). It would be a shame to reinvent the wheel.

No re-invention! Actually some effort has been made to use the
existing functions in bufmgr.c as much as possible rather than
rewriting them.

>
> The problem with using the Buffers I/O in progress bit is that the I/O
> might complete while the other backend is busy doing stuff. As long as
> you can handle the I/O completion promptly -- either in callback or
> thread or signal handler then that wouldn't matter. But I'm not clear
> that any of those will work reliably.

They both work reliably, but the criticism was that backend B polling
an aiocb of an aio issued by backend A is not documented as
being supported (although it happens to work), hence the proposed
change to use sigevent.

By the way, on the "will it actually work though?" question which several folks
have raised,    I should mention that this patch has been in semi-production
use for almost 2 years now in different stages of completion on all postgresql
releases from 9.1.4 to 9.5 devel.       I would guess it has had around
500 hours of operation by now.     I'm sure there are bugs still to be
found but I am confident it is fundamentally sound.

>
> --
> greg

From:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To:	John Lumby <johnlumby(at)hotmail(dot)com>, Greg Stark <stark(at)mit(dot)edu>
Cc:	Claudio Freire <klaussfreire(at)gmail(dot)com>, pgsql hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Extended Prefetching using Asynchronous IO - proposal and patch
Date:	2014-06-24 14:02:38
Message-ID:	53A984FE.1070309@vmware.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

On 06/24/2014 04:29 PM, John Lumby wrote:
>> On Mon, Jun 23, 2014 at 2:43 PM, John Lumby <johnlumby(at)hotmail(dot)com> wrote:
>>> It is when some *other* backend gets there first with the ReadBuffer that
>>> things are a bit trickier. The current version of the patch did polling for that case
>>> but that drew criticism, and so an imminent new version of the patch
>>> uses the sigevent mechanism. And there are other ways still.
>>
>> I'm a bit puzzled by this though. Postgres *already* has code for this
>> case. When you call ReadBuffer you set the bits on the buffer
>
> Good question. Let me explain.
> Yes, postgresql has code for the case of a backend is inside a synchronous
> read() or write(), performed from a ReadBuffer(), and some other backend
> wants that buffer. asynchronous aio is initiated not from ReadBuffer
> but from PrefetchBuffer, and performs its aio_read into an allocated, pinned,
> postgresql buffer. This is entirely different from the synchronous io case.
> Why? Because the issuer of the aio_read (the "originator") is unaware
> of this buffer pinned on its behalf, and is then free to do any other
> reading or writing it wishes, such as more prefetching or any other operation.
> And furthermore, it may *never* issue a ReadBuffer for the block which it
> prefetched.

I still don't see the difference. Once an asynchronous read is initiated
on the buffer, it can't be used for anything else until the read has
finished. This is exactly the same situation as with a synchronous read:
after read() is called, the buffer can't be used for anything else until
the call finishes.

In particular, consider the situation from another backend's point of
view. Looking from another backend (i.e. one that didn't initiate the
read), there's no difference between a synchronous and asynchronous
read. So why do we need a different IPC mechanism for the synchronous
and asynchronous cases? We don't.

I understand that *within the backend*, you need to somehow track the
I/O, and you'll need to treat synchronous and asynchronous I/Os
differently. But that's all within the same backend, and doesn't need to
involve the flags or locks in shared memory at all. The inter-process
communication doesn't need any changes.

>> The problem with using the Buffers I/O in progress bit is that the I/O
>> might complete while the other backend is busy doing stuff. As long as
>> you can handle the I/O completion promptly -- either in callback or
>> thread or signal handler then that wouldn't matter. But I'm not clear
>> that any of those will work reliably.
>
> They both work reliably, but the criticism was that backend B polling
> an aiocb of an aio issued by backend A is not documented as
> being supported (although it happens to work), hence the proposed
> change to use sigevent.

You didn't understand what Greg meant. You need to handle the completion
of the I/O in the same process that initiated it, by clearing the
in-progress bit of the buffer and releasing the I/O in-progress lwlock
on it. And you need to do that very quickly after the I/O has finished,
because there might be another backend waiting for the buffer and you
don't want him to wait longer than necessary.

The question is, if you receive the notification of the I/O completion
using a signal or a thread, is it safe to release the lwlock from the
signal handler or a separate thread?

> By the way, on the "will it actually work though?" question which several folks
> have raised, I should mention that this patch has been in semi-production
> use for almost 2 years now in different stages of completion on all postgresql
> releases from 9.1.4 to 9.5 devel. I would guess it has had around
> 500 hours of operation by now. I'm sure there are bugs still to be
> found but I am confident it is fundamentally sound.

Well, a committable version of this patch is going to look quite
different from the first version that you posted, so I don't put much
weight on how long you've tested the first version.

- Heikki

From:	John Lumby <johnlumby(at)hotmail(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Greg Stark <stark(at)mit(dot)edu>
Cc:	Claudio Freire <klaussfreire(at)gmail(dot)com>, pgsql hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Extended Prefetching using Asynchronous IO - proposal and patch
Date:	2014-06-24 15:08:55
Message-ID:	BAY175-W526880A96F1B36CD529D4BA31E0@phx.gbl
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

Thanks Heikki,

----------------------------------------
> Date: Tue, 24 Jun 2014 17:02:38 +0300
> From: hlinnakangas(at)vmware(dot)com
> To: johnlumby(at)hotmail(dot)com; stark(at)mit(dot)edu
> CC: klaussfreire(at)gmail(dot)com; pgsql-hackers(at)postgresql(dot)org
> Subject: Re: Extended Prefetching using Asynchronous IO - proposal and patch
>
> On 06/24/2014 04:29 PM, John Lumby wrote:
>>> On Mon, Jun 23, 2014 at 2:43 PM, John Lumby <johnlumby(at)hotmail(dot)com> wrote:
>>>> It is when some *other* backend gets there first with the ReadBuffer that
>>>> things are a bit trickier. The current version of the patch did polling for that case
>>>> but that drew criticism, and so an imminent new version of the patch
>>>> uses the sigevent mechanism. And there are other ways still.
>>>
>>> I'm a bit puzzled by this though. Postgres *already* has code for this
>>> case. When you call ReadBuffer you set the bits on the buffer
>>
>> Good question. Let me explain.
>> Yes, postgresql has code for the case of a backend is inside a synchronous
>> read() or write(), performed from a ReadBuffer(), and some other backend
>> wants that buffer. asynchronous aio is initiated not from ReadBuffer
>> but from PrefetchBuffer, and performs its aio_read into an allocated, pinned,
>> postgresql buffer. This is entirely different from the synchronous io case.
>> Why? Because the issuer of the aio_read (the "originator") is unaware
>> of this buffer pinned on its behalf, and is then free to do any other
>> reading or writing it wishes, such as more prefetching or any other operation.
>> And furthermore, it may *never* issue a ReadBuffer for the block which it
>> prefetched.
>
> I still don't see the difference. Once an asynchronous read is initiated
> on the buffer, it can't be used for anything else until the read has
> finished. This is exactly the same situation as with a synchronous read:
> after read() is called, the buffer can't be used for anything else until
> the call finishes.

Ah, now I see what you and Greg are asking. See my next imbed below.

>
> In particular, consider the situation from another backend's point of
> view. Looking from another backend (i.e. one that didn't initiate the
> read), there's no difference between a synchronous and asynchronous
> read. So why do we need a different IPC mechanism for the synchronous
> and asynchronous cases? We don't.
>
> I understand that *within the backend*, you need to somehow track the
> I/O, and you'll need to treat synchronous and asynchronous I/Os
> differently. But that's all within the same backend, and doesn't need to
> involve the flags or locks in shared memory at all. The inter-process
> communication doesn't need any changes.
>
>>> The problem with using the Buffers I/O in progress bit is that the I/O
>>> might complete while the other backend is busy doing stuff. As long as
>>> you can handle the I/O completion promptly -- either in callback or
>>> thread or signal handler then that wouldn't matter. But I'm not clear
>>> that any of those will work reliably.
>>
>> They both work reliably, but the criticism was that backend B polling
>> an aiocb of an aio issued by backend A is not documented as
>> being supported (although it happens to work), hence the proposed
>> change to use sigevent.
>
> You didn't understand what Greg meant. You need to handle the completion
> of the I/O in the same process that initiated it, by clearing the
> in-progress bit of the buffer and releasing the I/O in-progress lwlock
> on it. And you need to do that very quickly after the I/O has finished,
> because there might be another backend waiting for the buffer and you
> don't want him to wait longer than necessary.

I think I understand the question now.    I didn't spell out the details earlier.
Let me explain a little more.
With this patch, when read is issued, it is either a synchronous IO
(as before), or an asynchronous aio_read (new,   represented by
both BM_IO_IN_PROGRESS    *and* BM_AIO_IN_PROGRESS).
The way other backends wait on a synchronous IO in progress is unchanged.
But if BM_AIO_IN_PROGRESS,   then *any* backend which requests
ReadBuffer on this block (including originator) follows a new path
through BufCheckAsync() which, depending on various flags and context,
send the backend down to FileCompleteaio to check and maybe wait.
So *all* backends who are waiting for a BM_AIO_IN_PROGRESS buffer
will wait in that way.

>
> The question is, if you receive the notification of the I/O completion
> using a signal or a thread, is it safe to release the lwlock from the
> signal handler or a separate thread?

In the forthcoming new version of the patch that uses sigevent,
the originator locks a LWlock associated with that BAaiocb eXclusive,
and ,   when signalled, in the signal handler it places that LWlock
on a process-local queue of LWlocks awaiting release.
(No, It cannot be safely released inside the signal handler or in a
separate thread).     Whenever the mainline passes a CHECK_INTERRUPTS macro
and at a few additional points in bufmgr, the backend walks this process-local
queue and releases those LWlocks.    This is also done if the originator
itself issues a ReadBuffer, which is the most frequent case in which it
is released.

Meanwhile, any other backend will simply acquire Shared and release.

I think you are right that the existing io_in_progress_lock LWlock in the
buf_header could be used for this, because if there is a aio in progress,
then that lock cannot be in use for synchronous IO. I chose not to use it
because I preferred to keep the wait/post for asynch io separate,
but they could both use the same LWlock. However, the way the LWlock
is acquired and released would still be a bit different because of the need
to have the originator release it in its mainline.

>
>> By the way, on the "will it actually work though?" question which several folks
>> have raised, I should mention that this patch has been in semi-production
>> use for almost 2 years now in different stages of completion on all postgresql
>> releases from 9.1.4 to 9.5 devel. I would guess it has had around
>> 500 hours of operation by now. I'm sure there are bugs still to be
>> found but I am confident it is fundamentally sound.
>
> Well, a committable version of this patch is going to look quite
> different from the first version that you posted, so I don't put much
> weight on how long you've tested the first version.

Yes, I am quite willing to change it, time permitting.
I take the works "committable version" as a positive sign ...

>
> - Heikki

From:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To:	John Lumby <johnlumby(at)hotmail(dot)com>, Greg Stark <stark(at)mit(dot)edu>
Cc:	Claudio Freire <klaussfreire(at)gmail(dot)com>, pgsql hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Extended Prefetching using Asynchronous IO - proposal and patch
Date:	2014-06-24 17:49:05
Message-ID:	53A9BA11.6070904@vmware.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

On 06/24/2014 06:08 PM, John Lumby wrote:
>> The question is, if you receive the notification of the I/O completion
>> using a signal or a thread, is it safe to release the lwlock from the
>> signal handler or a separate thread?
>
> In the forthcoming new version of the patch that uses sigevent,
> the originator locks a LWlock associated with that BAaiocb eXclusive,
> and , when signalled, in the signal handler it places that LWlock
> on a process-local queue of LWlocks awaiting release.
> (No, It cannot be safely released inside the signal handler or in a
> separate thread). Whenever the mainline passes a CHECK_INTERRUPTS macro
> and at a few additional points in bufmgr, the backend walks this process-local
> queue and releases those LWlocks. This is also done if the originator
> itself issues a ReadBuffer, which is the most frequent case in which it
> is released.
>
> Meanwhile, any other backend will simply acquire Shared and release.

Ok, doing the work in CHECK_FOR_INTERRUPTS sounds safe. But is that fast
enough? We have never made any hard guarantees on how often
CHECK_FOR_INTERRUPTS() is called. In particular, if you're running 3rd
party C code or PL code, there might be no CHECK_FOR_INTERRUPTS() calls
for many seconds, or even more. That's a long time to hold onto a buffer
I/O lock. I don't think that's acceptable :-(.

> I think you are right that the existing io_in_progress_lock LWlock in the
> buf_header could be used for this, because if there is a aio in progress,
> then that lock cannot be in use for synchronous IO. I chose not to use it
> because I preferred to keep the wait/post for asynch io separate,
> but they could both use the same LWlock. However, the way the LWlock
> is acquired and released would still be a bit different because of the need
> to have the originator release it in its mainline.

It would be nice to use the same LWLock.

However, if releasing a regular LWLock in a signal handler is not safe,
and cannot be made safe, perhaps we should, after all, invent a whole
new mechanism. One that would make it safe to release the lock in a
signal handler.

>>> By the way, on the "will it actually work though?" question which several folks
>>> have raised, I should mention that this patch has been in semi-production
>>> use for almost 2 years now in different stages of completion on all postgresql
>>> releases from 9.1.4 to 9.5 devel. I would guess it has had around
>>> 500 hours of operation by now. I'm sure there are bugs still to be
>>> found but I am confident it is fundamentally sound.
>>
>> Well, a committable version of this patch is going to look quite
>> different from the first version that you posted, so I don't put much
>> weight on how long you've tested the first version.
>
> Yes, I am quite willing to change it, time permitting.
> I take the works "committable version" as a positive sign ...

BTW, sorry if I sound negative, I'm actually quite excited about this
feature. A patch like this take a lot of work, and usually several
rewrites, until it's ready ;-). But I'm looking forward for it.

- Heikki

From:	Claudio Freire <klaussfreire(at)gmail(dot)com>
To:	John Lumby <johnlumby(at)hotmail(dot)com>
Cc:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Greg Stark <stark(at)mit(dot)edu>, pgsql hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Extended Prefetching using Asynchronous IO - proposal and patch
Date:	2014-06-25 18:20:32
Message-ID:	CAGTBQpbAA_Re4rJHM=ePJPdRb1zmnw_uLmgqKziNV1W+LHHToQ@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

On Tue, Jun 24, 2014 at 12:08 PM, John Lumby <johnlumby(at)hotmail(dot)com> wrote:
>> The question is, if you receive the notification of the I/O completion
>> using a signal or a thread, is it safe to release the lwlock from the
>> signal handler or a separate thread?
>
> In the forthcoming new version of the patch that uses sigevent,
> the originator locks a LWlock associated with that BAaiocb eXclusive,
> and , when signalled, in the signal handler it places that LWlock
> on a process-local queue of LWlocks awaiting release.
> (No, It cannot be safely released inside the signal handler or in a
> separate thread). Whenever the mainline passes a CHECK_INTERRUPTS macro
> and at a few additional points in bufmgr, the backend walks this process-local
> queue and releases those LWlocks. This is also done if the originator
> itself issues a ReadBuffer, which is the most frequent case in which it
> is released.

I suggest using a semaphore instead.

Semaphores are supposed to be incremented/decremented from multiple
threads or processes already. So, in theory, the callback itself
should be able to do it.

The problem with the process-local queue is that it may take time to
be processed (the time it takes to get to a CHECK_INTERRUPTS macro,
which as it happened with regexes, it can be quite high).

From:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To:	Claudio Freire <klaussfreire(at)gmail(dot)com>, John Lumby <johnlumby(at)hotmail(dot)com>
Cc:	Greg Stark <stark(at)mit(dot)edu>, pgsql hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Extended Prefetching using Asynchronous IO - proposal and patch
Date:	2014-06-25 21:08:48
Message-ID:	53AB3A60.20708@vmware.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

On 06/25/2014 09:20 PM, Claudio Freire wrote:
> On Tue, Jun 24, 2014 at 12:08 PM, John Lumby <johnlumby(at)hotmail(dot)com> wrote:
>>> The question is, if you receive the notification of the I/O completion
>>> using a signal or a thread, is it safe to release the lwlock from the
>>> signal handler or a separate thread?
>>
>> In the forthcoming new version of the patch that uses sigevent,
>> the originator locks a LWlock associated with that BAaiocb eXclusive,
>> and , when signalled, in the signal handler it places that LWlock
>> on a process-local queue of LWlocks awaiting release.
>> (No, It cannot be safely released inside the signal handler or in a
>> separate thread). Whenever the mainline passes a CHECK_INTERRUPTS macro
>> and at a few additional points in bufmgr, the backend walks this process-local
>> queue and releases those LWlocks. This is also done if the originator
>> itself issues a ReadBuffer, which is the most frequent case in which it
>> is released.
>
> I suggest using a semaphore instead.
>
> Semaphores are supposed to be incremented/decremented from multiple
> threads or processes already. So, in theory, the callback itself
> should be able to do it.

LWLocks are implemented with semaphores, so if you can increment a
semaphore in the signal handler / callback thread, then in theory you
should be able to release a LWLock. You'll need some additional
synchronization within the same process, to make sure you don't release
a lock in signal handler while you're just doing the same thing in the
main thread. I'm not sure I want to buy into the notion that
LWLockRelease must be generally safe to call from a signal handler, but
maybe it's possible to have a variant of it that is.

On Linux at least we use System V semaphores, which are (unsurpisingly)
not listed as safe for using in a signal handler. See list
Async-signal-safe functions in signal(7) man page. The function used to
increment a POSIX semaphore, sem_post(), is in the list, however.

- Heikki

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	Claudio Freire <klaussfreire(at)gmail(dot)com>, John Lumby <johnlumby(at)hotmail(dot)com>, Greg Stark <stark(at)mit(dot)edu>, pgsql hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Extended Prefetching using Asynchronous IO - proposal and patch
Date:	2014-06-25 21:17:53
Message-ID:	20140625211753.GK24114@awork2.anarazel.de
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

On 2014-06-26 00:08:48 +0300, Heikki Linnakangas wrote:
> LWLocks are implemented with semaphores, so if you can increment a semaphore
> in the signal handler / callback thread, then in theory you should be able
> to release a LWLock.

I don't think that's a convincing argument even if semop et al were
signal safe. LWLocks also use spinlocks and it's a fairly bad idea to do
anything with the same spinlock both inside and outside a signal
handler.
I don't think we're going to get lwlock.c/LWLockRelease to work
reasonably from a signal handler without a lot of work.

> On Linux at least we use System V semaphores, which are (unsurpisingly) not
> listed as safe for using in a signal handler. See list Async-signal-safe
> functions in signal(7) man page. The function used to increment a POSIX
> semaphore, sem_post(), is in the list, however.

Heh, just wrote the same after reading the beginning of your email ;)

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	John Lumby <johnlumby(at)hotmail(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Claudio Freire <klaussfreire(at)gmail(dot)com>
Cc:	Greg Stark <stark(at)mit(dot)edu>, pgsql hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Extended Prefetching using Asynchronous IO - proposal and patch
Date:	2014-06-25 23:01:26
Message-ID:	BAY175-W412FF89303686022A9F16AA3190@phx.gbl
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

I am attaching the new version of the patch with support for use of sigevent
to report completion of asynch io and the new LWlock for waiters to
wait for originator to release it.

This patch is based on up-to-date git head but the asyncio design is
a bit behind the curve of recent discussions here. Specifically,
sigevent is still optional (see README for how to configure it)
and non-originators still refer to the originator's aiocb to check the
completion code (but not for waiting for it to complete).

All sigevent-related code is marked by
#ifdef USE_AIO_SIGEVENT
to make it easier to find.

Also although it "works" it is not functionally complete -
search for FIXME in
src/backend/storage/buffer/buf_async.c
and has not had nearly as much testing.

________________________________________________________________________

I will try to paste earlier points and imbed comments :

Heikki wrote :

> Ok, doing the work in CHECK_FOR_INTERRUPTS sounds safe. But is that fast
> enough? We have never made any hard guarantees on how often
> CHECK_FOR_INTERRUPTS() is called. In particular, if you're running 3rd
> party C code or PL code, there might be no CHECK_FOR_INTERRUPTS() calls
> for many seconds, or even more. That's a long time to hold onto a buffer
> I/O lock. I don't think that's acceptable :-(.

true but remember this case is the very infrequent one, less than 0.2 %
(in my benchmark).

>
>> I think you are right that the existing io_in_progress_lock LWlock in the
>> buf_header could be used for this, because if there is a aio in progress,
>> then that lock cannot be in use for synchronous IO. I chose not to use it
>> because I preferred to keep the wait/post for asynch io separate,
>> but they could both use the same LWlock. However, the way the LWlock
>> is acquired and released would still be a bit different because of the need
>> to have the originator release it in its mainline.
>
> It would be nice to use the same LWLock.

I think that will work.

>
> However, if releasing a regular LWLock in a signal handler is not safe,
> and cannot be made safe, perhaps we should, after all, invent a whole
> new mechanism. One that would make it safe to release the lock in a
> signal handler.

It would take rewriting lwlock.c and still be devillish hard to test -
I would prefer not.

>
>>
>>> Well, a committable version of this patch is going to look quite
>>> different from the first version that you posted, so I don't put much
>>> weight on how long you've tested the first version.
>>
>> Yes, I am quite willing to change it, time permitting.
>> I take the works "committable version" as a positive sign ...
>
> BTW, sorry if I sound negative, I'm actually quite excited about this
> feature. A patch like this take a lot of work, and usually several
> rewrites, until it's ready ;-). But I'm looking forward for it.

I am grateful you spent the time to study it.
knowing the code, I am not surprised it made you a bit grumpy ...

________________________________________________________

discussion about what is the difference between a synchronous read
versus an asynchronous read as far as non-originator waiting on it is concerned.

I thought a bit more about this. There are currently two differences,
one of which can easily be changed and one not so easy.

1)     The current code, even with sigevent, still makes the non-originator waiter
         call aio_error on the originator's aiocb to get the completion code.
         For sigevent variation, easily changed to have the originator always call aio_error
   (from its CHECK_INTERRUPTS or from FIleCompleteaio)
         and store that in the BAiocb.
         My idea of why not to do that was that, by having the non-originator check the aiocb,
this would allow the waiter to proceed sooner.   But for a different reason it actually
         doesn't.   (The non-originator must still wait for the LWlock release)

2)   Buffer pinning and returning the BufferAiocb to the free list
        With synchronous IO,    each backend that calls a ReadBuffer must pin the buffer
        early in the process.
        With asynchronous IO,    initially only the originator gets the pin
(and that is during PrefetchBuffer, not Readbuffer)
         When the aio completes and some backend checks that completion,
        then the backend has various responsibilities:

               . pin the buffer if it did not already have one (from prefetch)
   . if it was the last such backend to make that check
(amongst the cohort waiting on it)
   then pin the buffer if it did not already have one (from prefetch)

Those functions are different from the synchronous IO case.
I think code clarity alone may dictate keeping these separate.

___________________________________________________________________
various discussion of semaphores and LWlocks:

----------------------------------------
> Date: Wed, 25 Jun 2014 23:17:53 +0200
> From: andres(at)2ndquadrant(dot)com
> To: hlinnakangas(at)vmware(dot)com
> CC: klaussfreire(at)gmail(dot)com; johnlumby(at)hotmail(dot)com; stark(at)mit(dot)edu; pgsql-hackers(at)postgresql(dot)org
> Subject: Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch
>
> On 2014-06-26 00:08:48 +0300, Heikki Linnakangas wrote:
>> LWLocks are implemented with semaphores, so if you can increment a semaphore
>> in the signal handler / callback thread, then in theory you should be able
>> to release a LWLock.
>
> I don't think that's a convincing argument even if semop et al were
> signal safe. LWLocks also use spinlocks and it's a fairly bad idea to do
> anything with the same spinlock both inside and outside a signal
> handler.
> I don't think we're going to get lwlock.c/LWLockRelease to work
> reasonably from a signal handler without a lot of work.

I agree - see earlier.

Attachment	Content-Type	Size
postgresql-9.5.140625.140625-181050.noDEBUG.patch	application/octet-stream	340.0 KB

From:	John Lumby <johnlumby(at)hotmail(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Claudio Freire <klaussfreire(at)gmail(dot)com>
Cc:	Greg Stark <stark(at)mit(dot)edu>, pgsql hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Extended Prefetching using Asynchronous IO - proposal and patch
Date:	2014-06-25 23:05:53
Message-ID:	BAY175-W35C03DA8661D4FE3E1E57AA3190@phx.gbl
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

My cut'n'pasting failed me at one point corrected below.

> discussion about what is the difference between a synchronous read
> versus an asynchronous read as far as non-originator waiting on it is concerned.
>
> I thought a bit more about this. There are currently two differences,
> one of which can easily be changed and one not so easy.
>
> 1) The current code, even with sigevent, still makes the non-originator waiter
> call aio_error on the originator's aiocb to get the completion code.
> For sigevent variation, easily changed to have the originator always call aio_error
> (from its CHECK_INTERRUPTS or from FIleCompleteaio)
> and store that in the BAiocb.
> My idea of why not to do that was that, by having the non-originator check the aiocb,
> this would allow the waiter to proceed sooner. But for a different reason it actually
> doesn't. (The non-originator must still wait for the LWlock release)
>
> 2) Buffer pinning and returning the BufferAiocb to the free list
> With synchronous IO, each backend that calls a ReadBuffer must pin the buffer
> early in the process.
> With asynchronous IO, initially only the originator gets the pin
> (and that is during PrefetchBuffer, not Readbuffer)
> When the aio completes and some backend checks that completion,
> then the backend has various responsibilities:
>
> . pin the buffer if it did not already have one (from prefetch)
> . if it was the last such backend to make that check
> (amongst the cohort waiting on it)
> then XXXXXXpin the buffer if it did not already have one (from prefetch)XXXX

then return the BufferAiocb to the free list

From:	John Lumby <johnlumby(at)hotmail(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Claudio Freire <klaussfreire(at)gmail(dot)com>, pgsql hackers <pgsql-hackers(at)postgresql(dot)org>
Cc:	Greg Stark <stark(at)mit(dot)edu>, "andres(at)2ndquadrant(dot)com" <andres(at)2ndquadrant(dot)com>
Subject:	Re: Extended Prefetching using Asynchronous IO - proposal and patch
Date:	2014-08-19 21:17:09
Message-ID:	BAY175-W124C3149E15060B95EFA39A3D50@phx.gbl
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

I am attaching a new version of the patch for consideration in the current commit fest.

Relative to the one I submitted on 25 June in BAY175-W412FF89303686022A9F16AA3190(at)phx(dot)gbl
the method for handling aio completion using sigevent has been re-written to use
signals exclusively rather than a composite of signals and LWlocks,
and this has fixed the problem I mentioned before with the LWlock method.
More details are in postgresql-prefetching-asyncio.README
(search for
"Checking AIO Completion"
)

I also have worked my benchmark database and one application into a publishable state.
However,   the database is in the form of a compressed pg_dump and is around
218 MB in size.    I was hoping to come up with a generation program to load it
but have not had the time for that.     Is there some place on postgresql.org for
such a large file?   If not I will try to think of some place for it.

John Lumby

Attachment	Content-Type	Size
postgresql-9.5.140818.prefetching-asyncio.patch	application/octet-stream	348.5 KB

From:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To:	John Lumby <johnlumby(at)hotmail(dot)com>, Claudio Freire <klaussfreire(at)gmail(dot)com>, pgsql hackers <pgsql-hackers(at)postgresql(dot)org>
Cc:	Greg Stark <stark(at)mit(dot)edu>, "andres(at)2ndquadrant(dot)com" <andres(at)2ndquadrant(dot)com>
Subject:	Re: Extended Prefetching using Asynchronous IO - proposal and patch
Date:	2014-08-19 22:27:39
Message-ID:	53F3CF5B.4000607@vmware.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

On 08/20/2014 12:17 AM, John Lumby wrote:
> I am attaching a new version of the patch for consideration in the current commit fest.

Thanks for working on this!

> Relative to the one I submitted on 25 June in BAY175-W412FF89303686022A9F16AA3190(at)phx(dot)gbl
> the method for handling aio completion using sigevent has been re-written to use
> signals exclusively rather than a composite of signals and LWlocks,
> and this has fixed the problem I mentioned before with the LWlock method.

ISTM the patch is still allocating stuff in shared memory that really
doesn't belong there. Namely, the BufferAiocb structs. Or at least parts
of it; there's also a waiter queue there which probably needs to live in
shared memory, but the rest of it does not.

At least BufAWaitAioCompletion is still calling aio_error() on an AIO
request that might've been initiated by another backend. That's not OK.

Please write the patch without atomic CAS operation. Just use a
spinlock. There's a patch in the commitfest to add support for that, but
it's not committed yet, and all those USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
ifdefs make the patch more difficult to read. Same with all the other
#ifdefs; please just pick a method that works.

Also, please split prefetching of regular index scans into a separate
patch. It's orthogonal to doing async I/O; we could prefetch regular
index scans with posix_fadvise too, and AIO should be useful for
prefetching other stuff.

- Heikki

From:	Claudio Freire <klaussfreire(at)gmail(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	John Lumby <johnlumby(at)hotmail(dot)com>, pgsql hackers <pgsql-hackers(at)postgresql(dot)org>, Greg Stark <stark(at)mit(dot)edu>, "andres(at)2ndquadrant(dot)com" <andres(at)2ndquadrant(dot)com>
Subject:	Re: Extended Prefetching using Asynchronous IO - proposal and patch
Date:	2014-08-19 23:10:20
Message-ID:	CAGTBQpbrHCJvkCmXmA8zHmfKNLLtqbgagLaYA+Z9E-F0oY2EtQ@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

On Tue, Aug 19, 2014 at 7:27 PM, Heikki Linnakangas
<hlinnakangas(at)vmware(dot)com> wrote:
> Also, please split prefetching of regular index scans into a separate patch.
> It's orthogonal to doing async I/O; we could prefetch regular index scans
> with posix_fadvise too, and AIO should be useful for prefetching other
> stuff.

That patch already happened on the list, and it wasn't a win in many
cases. I'm not sure it should be proposed independently of this one.
Maybe a separate patch, but it should be considered dependent on this.

I don't have an archive link at hand atm, but I could produce one later.

From:	johnlumby <johnlumby(at)hotmail(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Claudio Freire <klaussfreire(at)gmail(dot)com>, pgsql hackers <pgsql-hackers(at)postgresql(dot)org>
Cc:	Greg Stark <stark(at)mit(dot)edu>, "andres(at)2ndquadrant(dot)com" <andres(at)2ndquadrant(dot)com>
Subject:	Re: Extended Prefetching using Asynchronous IO - proposal and patch
Date:	2014-08-24 21:49:31
Message-ID:	BLU436-SMTP167F40CBAC7CD274E669274A3DE0@phx.gbl
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

Thanks for the replies and thoughts.

On 08/19/14 18:27, Heikki Linnakangas wrote:
> On 08/20/2014 12:17 AM, John Lumby wrote:
>> I am attaching a new version of the patch for consideration in the
>> current commit fest.
>
> Thanks for working on this!
>
>> Relative to the one I submitted on 25 June in
>> BAY175-W412FF89303686022A9F16AA3190(at)phx(dot)gbl
>> the method for handling aio completion using sigevent has been
>> re-written to use
>> signals exclusively rather than a composite of signals and LWlocks,
>> and this has fixed the problem I mentioned before with the LWlock
>> method.
>
> ISTM the patch is still allocating stuff in shared memory that really
> doesn't belong there. Namely, the BufferAiocb structs. Or at least
> parts of it; there's also a waiter queue there which probably needs to
> live in shared memory, but the rest of it does not.

Actually the reason the BufferAiocb ( the postgresql block corresponding
to the aio's aiocb)
must be located in shared memory is that, as you said, it acts as
anchor for the waiter list.
See further comment below.

>
> At least BufAWaitAioCompletion is still calling aio_error() on an AIO
> request that might've been initiated by another backend. That's not OK.

Yes, you are right, and I agree with this one -
I will add a aio_error_return_code field in the BufferAiocb
and only the originator will set this from the real aiocb.

>
> Please write the patch without atomic CAS operation. Just use a spinlock.

Umm, this is a new criticism I think. I use CAS for things other
than locking,
such as add/remove from shared queue. I suppose maybe a spinlock on
the entire queue
can be used equivalently, but with more code (extra confusion) and
worse performance
(coarser serialization). What is your objection to using gcc's
atomic ops? Portability?

> There's a patch in the commitfest to add support for that,

sorry, support for what? There are already spinlocks in postgresql,
you mean some new kind? please point me at it with hacker msgid or
something.

> but it's not committed yet, and all those
> USE_AIO_ATOMIC_BUILTIN_COMP_SWAP ifdefs make the patch more difficult
> to read. Same with all the other #ifdefs; please just pick a method
> that works.

Ok, yes, the ifdefs are unpleasant. I will do something about that.
Ideally they would be entirely confined to header files and only macro
functions
in C files - maybe I can do that. And eventually when the dust has
settled
eliminate obsolete ifdef blocks altogether.

> Also, please split prefetching of regular index scans into a separate
> patch. It's orthogonal to doing async I/O;

actually not completely orthogonal, see next

> we could prefetch regular index scans with posix_fadvise too, and AIO
> should be useful for prefetching other stuff.

On 08/19/14 19:10, Claudio Freire wrote:
> On Tue, Aug 19, 2014 at 7:27 PM, Heikki Linnakangas
> <hlinnakangas(at)vmware(dot)com> wrote:
>> Also, please split prefetching of regular index scans into a separate patch. ...
> That patch already happened on the list, and it wasn't a win in many
> cases. I'm not sure it should be proposed independently of this one.
> Maybe a separate patch, but it should be considered dependent on this.
>
> I don't have an archive link at hand atm, but I could produce one later.
Several people have asked to split this patch into several smaller ones
and I
have thought about it. It would introduce some awkward dependencies.
E.g. splitting the scanner code (index, relation-heap) into separate
patch
from aio code would mean that the scanner patch becomes dependent
on the aio patch. They are not quite orthogonal.

The reason is that the scanners call a new function, DiscardBuffer(blockid)
when aio is in use. We can get around it by providing a stub for
that function
in the scanner patch, but then there is some risk of someone getting the
wrong version of that function in their build. It just adds yet more
complexity
and breakage opportunities.

>
> - Heikki
>
One further comment concerning these BufferAiocb and aiocb control blocks
being in shared memory :

I explained above that the BufferAiocb must be in shared memory for
wait/post.
The aiocb does not necessarily have to be in shared memory,
but since there is a one-to-one relationship between BufferAiocb and aiocb,
it makes the code much simpler , and the operation much more efficient,
if the aiocb is embedded in the BufferAiocb as I have it now.
E.g. if the aiocb is in private-process memory, then an additional
allocation
scheme is needed (fixed number? palloc()in'g extra ones as needed? ...)
which adds complexity, and probably some inefficiency since a shared
pool is usually
more efficient (allows higher maximum per process etc), and there
would have to be
some pointer de-referencing from BufferAiocb to aiocb adding some
(small) overhead.

I understood your objection to use of shared memory as being that you
don't want
a non-originator to access the originator's aiocb using aio_xxx calls,
and, with the
one change I've said I will do above (put the aio_error retcode in the
BufferAiocb)
I will have achieved that requirement. I am hoping this answers your
objections
concerning shared memory.

>
>

From:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To:	johnlumby <johnlumby(at)hotmail(dot)com>, Claudio Freire <klaussfreire(at)gmail(dot)com>, pgsql hackers <pgsql-hackers(at)postgresql(dot)org>
Cc:	Greg Stark <stark(at)mit(dot)edu>, "andres(at)2ndquadrant(dot)com" <andres(at)2ndquadrant(dot)com>
Subject:	Re: Extended Prefetching using Asynchronous IO - proposal and patch
Date:	2014-08-25 09:37:14
Message-ID:	53FB03CA.2070804@vmware.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

On 08/25/2014 12:49 AM, johnlumby wrote:
> On 08/19/14 18:27, Heikki Linnakangas wrote:
>> Please write the patch without atomic CAS operation. Just use a spinlock.
>
> Umm, this is a new criticism I think.

Yeah. Be prepared that new issues will crop up as the patch gets slimmer
and easier to review :-). Right now there's still so much chaff that
it's difficult to see the wheat.

> I use CAS for things other
> than locking,
> such as add/remove from shared queue. I suppose maybe a spinlock on
> the entire queue
> can be used equivalently, but with more code (extra confusion) and
> worse performance
> (coarser serialization). What is your objection to using gcc's
> atomic ops? Portability?

Yeah, portability.

Atomic ops might make sense, but at this stage it's important to slim
down the patch to the bare minimum, so that it's easier to review. Once
the initial patch is in, you can improve it with additional patches.

>> There's a patch in the commitfest to add support for that,
>
> sorry, support for what? There are already spinlocks in postgresql,
> you mean some new kind? please point me at it with hacker msgid or
> something.

Atomic ops: https://commitfest.postgresql.org/action/patch_view?id=1314

Once that's committed, you can use the new atomic ops in your patch. But
until then, stick to spinlocks.

> On 08/19/14 19:10, Claudio Freire wrote:
>> On Tue, Aug 19, 2014 at 7:27 PM, Heikki Linnakangas
>> <hlinnakangas(at)vmware(dot)com> wrote:
>>> Also, please split prefetching of regular index scans into a separate patch. ...
>> That patch already happened on the list, and it wasn't a win in many
>> cases. I'm not sure it should be proposed independently of this one.
>> Maybe a separate patch, but it should be considered dependent on this.
>>
>> I don't have an archive link at hand atm, but I could produce one later.
> Several people have asked to split this patch into several smaller ones
> and I
> have thought about it. It would introduce some awkward dependencies.
> E.g. splitting the scanner code (index, relation-heap) into separate
> patch
> from aio code would mean that the scanner patch becomes dependent
> on the aio patch. They are not quite orthogonal.

Right now, please focus on the main AIO patch. That should show a
benefit for bitmap heap scans too, so to demonstrate the benefits of
AIO, you don't need to prefetch regular index scans. The main AIO patch
can be written, performance tested, and reviewed without caring about
regular index scans at all.

> The reason is that the scanners call a new function, DiscardBuffer(blockid)
> when aio is in use. We can get around it by providing a stub for
> that function
> in the scanner patch, but then there is some risk of someone getting the
> wrong version of that function in their build. It just adds yet more
> complexity
> and breakage opportunities.

Regardless of the regular index scans, we'll need to discuss the new API
of PrefetchBuffer and DiscardBuffer.

It would be simpler for the callers if you didn't require the
DiscardBuffer calls. I think it would be totally feasible to write the
patch that way. Just drop the buffer pin after the asynchronous read
finishes. When the caller actually needs the page, it will call
ReadBuffer which will pin it again. You'll get a little bit more bufmgr
traffic that way, but I think it'll be fine in practice.

> One further comment concerning these BufferAiocb and aiocb control blocks
> being in shared memory :
>
> I explained above that the BufferAiocb must be in shared memory for
> wait/post.
> The aiocb does not necessarily have to be in shared memory,
> but since there is a one-to-one relationship between BufferAiocb and aiocb,
> it makes the code much simpler , and the operation much more efficient,
> if the aiocb is embedded in the BufferAiocb as I have it now.
> E.g. if the aiocb is in private-process memory, then an additional
> allocation
> scheme is needed (fixed number? palloc()in'g extra ones as needed? ...)
> which adds complexity,

Yep, use palloc or a fixed pool. There's nothing wrong with that.

> and probably some inefficiency since a shared
> pool is usually
> more efficient (allows higher maximum per process etc), and there
> would have to be
> some pointer de-referencing from BufferAiocb to aiocb adding some
> (small) overhead.

I think you're falling into the domain of premature optimization. A few
pointer dereferences are totally insignificant, and the amount of memory
you're saving pales in comparison to other similar non-shared pools and
caches we have (catalog caches, for example). And on the other side of
the coin, with a shared pool you'll waste memory when async I/O is not
used (e.g because everything fits in RAM), and when it is used, you'll
have more contention on locks and cache lines when multiple processes
use the same objects.

The general guideline in PostgreSQL is that everything is
backend-private, except structures used to communicate between backends.

- Heikki

From:	John Lumby <johnlumby(at)hotmail(dot)com>
To:	pgsql general <pgsql-general(at)postgresql(dot)org>
Subject:	RegisterBackgroundWorker does not actually start a bg worker process in 9.4.4
Date:	2015-06-15 18:58:51
Message-ID:	COL131-W797566963E38D7BFE52BFFA3B80@phx.gbl
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

I am new to bg_workers so this may be my user error,
but when I build and run the contrib/worker_spi
extension, I find that :

. starting postgres with the extension named in shared_preload_libraries :
its _PG_init is invoked as expected but no process is started -
it is as though RegisterBackgroundWorker did nothing

. creating the extension and then
psql ... "select worker_spi_launch(2);" :
I see
28409 28288 ? 463508 00:05 00:00:00 0.0 postgres: bgworker: worker 2
as expected.

Is there maybe some bug in postmaster's processing of
workers marked as start_at = BgWorkerStart_RecoveryFinished
in 9.4.4?

Cheers, John

From:	Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To:	John Lumby <johnlumby(at)hotmail(dot)com>
Cc:	pgsql general <pgsql-general(at)postgresql(dot)org>
Subject:	Re: RegisterBackgroundWorker does not actually start a bg worker process in 9.4.4
Date:	2015-06-15 23:33:09
Message-ID:	CAB7nPqQJt0UnSWkZEKFMHEc4-S0CjP-Aqvg7k-VYRvPRPQ+9dQ@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

On Tue, Jun 16, 2015 at 3:58 AM, John Lumby <johnlumby(at)hotmail(dot)com> wrote:
> I am new to bg_workers so this may be my user error,
> but when I build and run the contrib/worker_spi
> extension, I find that :
>
> . starting postgres with the extension named in shared_preload_libraries :
> its _PG_init is invoked as expected but no process is started -
> it is as though RegisterBackgroundWorker did nothing
>
> . creating the extension and then
> psql ... "select worker_spi_launch(2);" :
> I see
> 28409 28288 ? 463508 00:05 00:00:00 0.0 postgres: bgworker: worker 2
> as expected.
>
> Is there maybe some bug in postmaster's processing of
> workers marked as start_at = BgWorkerStart_RecoveryFinished
> in 9.4.4?

Not that I know of. I am using static background workers even with 9.4
clusters and they work as expected. Giving a try with worker_spi on
9.4, I see no problems as well:
$ ps ux | grep bgworker
michael 3906 0.0 0.1 2594780 7456 ?? Ss 8:21AM 0:00.02
postgres: bgworker: worker 1
michael 3905 0.0 0.1 2594780 6996 ?? Ss 8:21AM 0:00.02
postgres: bgworker: worker 2
$ psql -c 'show shared_preload_libraries'
shared_preload_libraries
--------------------------
worker_spi
(1 row)

Regards,
--
Michael

From:	John Lumby <johnlumby(at)hotmail(dot)com>
To:	Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc:	pgsql general <pgsql-general(at)postgresql(dot)org>
Subject:	Re: RegisterBackgroundWorker does not actually start a bg worker process in 9.4.4
Date:	2015-06-16 13:21:50
Message-ID:	COL131-W29B608F04B5F67787E148CA3A70@phx.gbl
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

Michael, thanks for checking,
I tried it again today and it is now working so I must have forgotten something.

John
----------------------------------------
> Date: Tue, 16 Jun 2015 08:33:09 +0900
> Subject: Re: [GENERAL] RegisterBackgroundWorker does not actually start a bg worker process in 9.4.4
> From: michael(dot)paquier(at)gmail(dot)com
> To: johnlumby(at)hotmail(dot)com
> CC: pgsql-general(at)postgresql(dot)org
>
> On Tue, Jun 16, 2015 at 3:58 AM, John Lumby <johnlumby(at)hotmail(dot)com> wrote:
>> I am new to bg_workers so this may be my user error,
>> but when I build and run the contrib/worker_spi
>> extension, I find that :
>>
>> . starting postgres with the extension named in shared_preload_libraries :
>> its _PG_init is invoked as expected but no process is started -
>> it is as though RegisterBackgroundWorker did nothing
>>
>> . creating the extension and then
>> psql ... "select worker_spi_launch(2);" :
>> I see
>> 28409 28288 ? 463508 00:05 00:00:00 0.0 postgres: bgworker: worker 2
>> as expected.
>>
>> Is there maybe some bug in postmaster's processing of
>> workers marked as start_at = BgWorkerStart_RecoveryFinished
>> in 9.4.4?
>
> Not that I know of. I am using static background workers even with 9.4
> clusters and they work as expected. Giving a try with worker_spi on
> 9.4, I see no problems as well:
> $ ps ux | grep bgworker
> michael 3906 0.0 0.1 2594780 7456 ?? Ss 8:21AM 0:00.02
> postgres: bgworker: worker 1
> michael 3905 0.0 0.1 2594780 6996 ?? Ss 8:21AM 0:00.02
> postgres: bgworker: worker 2
> $ psql -c 'show shared_preload_libraries'
> shared_preload_libraries
> --------------------------
> worker_spi
> (1 row)
>
> Regards,
> --
> Michael