Reworking WAL locking

From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Paul van den Bogaard <Paul(dot)Vandenbogaard(at)Sun(dot)COM>
Subject: Reworking WAL locking
Date: 2008-02-14 13:44:18
Message-ID: 1202996658.16770.664.camel@ebony.site
Lists: pgsql-hackers


Paul van den Bogaard (Sun) suggested to me that we could use more than
two WAL locks to improve concurrency. I think it's possible to introduce
such a scheme with some ease. All mods would be within xlog.c.

The scheme below requires an extra LWLock per WAL buffer.

Locking within XLogInsert() would look like this:

Calculate length of data to be inserted.
Calculate initial CRC

LWLockAcquire(WALInsertLock, LW_EXCLUSIVE)

Reserve space to write into.
LSN = current Insert pointer
Move pointer forward by length of data to be inserted, acquiring
WALWriteLock if required to ensure space is available.

LWLockAcquire(LSNGetWALPageLockId(LSN), LW_SHARED);

Note that we don't lock every page, just the first one of the set we
want, but we hold it until all page writes are complete.

LWLockRelease(WALInsertLock);

finish calculating CRC
write xlog into reserved space

LWLockRelease(LSNGetWALPageLockId(LSN));
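
In C, the insert side might look something like this. Just a sketch:
LSNGetWALPageLockId() and the other helper names here are illustrative
only, not existing code:

/*
 * Sketch only.  LSNGetWALPageLockId() is an illustrative mapping from
 * an LSN to the LWLock guarding the WAL buffer page that holds it,
 * e.g. indexed by the page's slot in the buffer array.
 */
LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);

StartPos = Insert->CurrPos;        /* LSN where our record will start */
ReserveXLogSpace(write_len);       /* illustrative: advance the insert
                                    * pointer, taking WALWriteLock if we
                                    * must write buffers out to make
                                    * room */

/* Pin the first page of our reservation before opening the gate. */
LWLockAcquire(LSNGetWALPageLockId(StartPos), LW_SHARED);

LWLockRelease(WALInsertLock);

/* Now outside the bottleneck: */
FinishRecordCRC(record, rdata);                  /* illustrative */
CopyRecordIntoBuffers(StartPos, record, rdata);  /* illustrative */

LWLockRelease(LSNGetWALPageLockId(StartPos));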

XLogWrite() will then try to get a conditional LW_EXCLUSIVE lock
sequentially on each page it plans to write. It keeps going until it
fails to get a lock, then writes the pages it has locked. Callers of
XLogWrite() will thus never overtake a backend that is currently
filling a WAL buffer.

We write a whole page at a time.

Next time, we do a regular (blocking) lock wait on the same page, so
that we always make progress eventually.
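
The write side might then look roughly like this. Again a sketch:
LWLockConditionalAcquire() is the existing no-wait variant, but the
other names are illustrative:

/*
 * In XLogWrite(): lock as many consecutive pages as possible without
 * blocking, then write just those pages in one go.
 */
npages = 0;
while (curridx != lastidx)
{
    if (!LWLockConditionalAcquire(LSNGetWALPageLockId(PageLSN(curridx)),
                                  LW_EXCLUSIVE))
        break;              /* an inserter is still filling this page */
    npages++;
    curridx = NextBufIdx(curridx);      /* illustrative */
}

WritePages(startidx, npages);           /* illustrative: the actual
                                         * write() of whole pages */

/*
 * Release the page locks here.  If we stopped short, the next call
 * does a regular blocking LWLockAcquire() on the page that stopped
 * us, so progress is guaranteed.
 */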

This requires us to get 2 locks for an XLogInsert rather than just one.
However the second lock is always acquired with zero-wait time when the
wal_buffers are sensibly sized. Overall this should reduce wait time for
the WALInsertLock, since it seems likely that each actual filling of
the WAL buffers will touch different cache lines and so the fills can
very likely be performed in parallel.

Sounds good to me.

Any objections/comments before this can be tried out?

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Paul van den Bogaard <Paul(dot)Vandenbogaard(at)Sun(dot)COM>
Subject: Re: Reworking WAL locking
Date: 2008-02-14 16:49:06
Message-ID: 9649.1203007746@sss.pgh.pa.us
Lists: pgsql-hackers

Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> LWLockAcquire(WALInsertLock, LW_EXCLUSIVE)

> Reserve space to write into.
> LSN = current Insert pointer
> Move pointer forward by length of data to be inserted, acquiring
> WALWriteLock if required to ensure space is available.

I think you've handwaved over how you'll avoid self-deadlock
in the case where the WAL record exceeds the size of wal_buffers.
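
Concretely, using the names from your sketch, the trouble case would
be something like:

/* A record spanning more pages than wal_buffers can hold: */
LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
LWLockAcquire(LSNGetWALPageLockId(StartPos), LW_SHARED);   /* first page */
LWLockRelease(WALInsertLock);

/*
 * While copying the later pages we run out of buffers, so we must
 * write out and recycle earlier ones -- eventually the first page
 * itself.  But writing it requires its lock in exclusive mode, which
 * conflicts with the shared lock we ourselves still hold.
 */
LWLockAcquire(LSNGetWALPageLockId(StartPos), LW_EXCLUSIVE); /* self-deadlock */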

Assuming that that can be fixed without too much ugliness,
it seems worth trying.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Paul van den Bogaard <Paul(dot)Vandenbogaard(at)Sun(dot)COM>
Subject: Re: Reworking WAL locking
Date: 2008-02-14 18:52:10
Message-ID: 16849.1203015130@sss.pgh.pa.us
Lists: pgsql-hackers

Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> Paul van den Bogaard (Sun) suggested to me that we could use more than
> two WAL locks to improve concurrency.

After looking over the code I'm unconvinced that there's much win to be
bought this way. I had been thinking you could push a material amount
of CRC-calculation work out of the locked code section, but that's not
so; only the WAL record header remains to be CRC'd when we enter the
locked section. The locked section is not doing much more than copying
the data into the WAL buffers, which for the normal case of short WAL
records (no page images) isn't going to take long. It's not clear that
the extra lock acquisition will be repaid by better concurrency (and
it's certain to be a dead loss if there are no concurrent writes
happening).
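
For reference, the current division of labor is roughly this
(simplified from xlog.c):

/* The bulk of the CRC covers the rdata chain, outside the lock ... */
INIT_CRC32(rdata_crc);
for (rdt = rdata; rdt != NULL; rdt = rdt->next)
    COMP_CRC32(rdata_crc, rdt->data, rdt->len);

LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);

/*
 * ... and only the record header, whose xl_prev cannot be known until
 * space is reserved, is folded in while the lock is held.
 */
COMP_CRC32(rdata_crc, (char *) record + sizeof(pg_crc32),
           SizeOfXLogRecord - sizeof(pg_crc32));
FIN_CRC32(rdata_crc);
record->xl_crc = rdata_crc;

/* ... copy the record into the WAL buffers, release the lock ... */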

In a situation where you're forced to write dirty buffers to make room,
there could be some win from fixing things to not hold WALInsertLock
while you do that, but AFAICT the proposal as it stands doesn't help
on that front. Also, now that we have the wal writer process, this
case really shouldn't occur often enough to be a material performance
problem (especially not if the number of wal buffers has been kicked
up a bit).

> This requires us to get 2 locks for an XLogInsert rather than just one.
> However the second lock is always acquired with zero-wait time when the
> wal_buffers are sensibly sized.

In my experience, the fact that you won't block is not a guarantee that
lock acquisition is cheap. Trading ownership of locked cache lines back
and forth between multiple CPUs is *expensive*.

So this may be worth trying but I'm unconvinced that it'll really
provide a win that's worth the added complexity.

[ still staring at the code ... ] Something that might be interesting
though is to try to move some of the buffer control logic overhead out
of WALInsertLock's domain and into WALWriteLock's domain. Right now we
prep the page header and so forth for a new page when we are ready to
start inserting into it, which means that that work is done inside the
concurrency bottleneck. What if, as soon as we have written a filled
WAL buffer for the last time, the *writer* process is responsible for
re-initializing that buffer as empty and ready to write into for
whatever page number it's ultimately going to have (which we can
certainly determine since the number of wal buffers is known)?
Getting that page-zeroing MemSet out of the WALInsertLock domain looks
worth doing...
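
Sketchily (the function and its name are invented), the writer side
could do something like:

/*
 * Hypothetical: called by the writer just after buffer slot "slot"
 * has been written for the last time.  Since the number of buffers
 * is fixed, the address of the page that will next occupy the slot
 * is already known.
 */
static void
InitBufferForFuturePage(int slot, XLogRecPtr futurePageAddr)
{
    char       *page = XLogCtl->pages + slot * (Size) XLOG_BLCKSZ;
    XLogPageHeader hdr = (XLogPageHeader) page;

    MemSet(page, 0, XLOG_BLCKSZ);   /* the MemSet we want out of
                                     * WALInsertLock's domain */
    hdr->xlp_magic = XLOG_PAGE_MAGIC;
    hdr->xlp_pageaddr = futurePageAddr;
}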

regards, tom lane


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org, Paul van den Bogaard <Paul(dot)Vandenbogaard(at)Sun(dot)COM>
Subject: Re: Reworking WAL locking
Date: 2008-02-14 19:21:40
Message-ID: 1203016900.16770.690.camel@ebony.site
Lists: pgsql-hackers

On Thu, 2008-02-14 at 13:52 -0500, Tom Lane wrote:
> Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> > Paul van den Bogaard (Sun) suggested to me that we could use more than
> > two WAL locks to improve concurrency.
>
> After looking over the code I'm unconvinced that there's much win to be
> bought this way. I had been thinking you could push a material amount
> of CRC-calculation work out of the locked code section, but that's not
> so; only the WAL record header remains to be CRC'd when we enter the
> locked section. The locked section is not doing much more than copying
> the data into the WAL buffers, which for the normal case of short WAL
> records (no page images) isn't going to take long. It's not clear that
> the extra lock acquisition will be repaid by better concurrency (and
> it's certain to be a dead loss if there are no concurrent writes
> happening).
>
> In a situation where you're forced to write dirty buffers to make room,
> there could be some win from fixing things to not hold WALInsertLock
> while you do that, but AFAICT the proposal as it stands doesn't help
> on that front. Also, now that we have the wal writer process, this
> case really shouldn't occur often enough to be a material performance
> problem (especially not if the number of wal buffers has been kicked
> up a bit).

Will think some more.

> > This requires us to get 2 locks for an XLogInsert rather than just one.
> > However the second lock is always acquired with zero-wait time when the
> > wal_buffers are sensibly sized.
>
> In my experience, the fact that you won't block is not a guarantee that
> lock acquisition is cheap. Trading ownership of locked cache lines back
> and forth between multiple CPUs is *expensive*.
>
> So this may be worth trying but I'm unconvinced that it'll really
> provide a win that's worth the added complexity.
>
> [ still staring at the code ... ] Something that might be interesting
> though is to try to move some of the buffer control logic overhead out
> of WALInsertLock's domain and into WALWriteLock's domain. Right now we
> prep the page header and so forth for a new page when we are ready to
> start inserting into it, which means that that work is done inside the
> concurrency bottleneck. What if, as soon as we have written a filled
> WAL buffer for the last time, the *writer* process is responsible for
> re-initializing that buffer as empty and ready to write into for
> whatever page number it's ultimately going to have (which we can
> certainly determine since the number of wal buffers is known)?
> Getting that page-zeroing MemSet out of the WALInsertLock domain looks
> worth doing...

That sounds like a cheaper win.

I like this one because it's another example that contention is not
static; it seems more often cyclical. The WALInsertLock is held longer
than normal when we cross page boundaries, creating a pulsing effect
through the lock queue.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Paul van den Bogaard <Paul(dot)Vandenbogaard(at)Sun(dot)COM>
Subject: Re: Reworking WAL locking
Date: 2008-02-14 21:50:42
Message-ID: 19399.1203025842@sss.pgh.pa.us
Lists: pgsql-hackers

Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> On Thu, 2008-02-14 at 13:52 -0500, Tom Lane wrote:
>> [ still staring at the code ... ] Something that might be interesting
>> though is to try to move some of the buffer control logic overhead out
>> of WALInsertLock's domain and into WALWriteLock's domain.

> I like this one because it's another example that contention is not
> static; it seems more often cyclical. The WALInsertLock is held longer
> than normal when we cross page boundaries, creating a pulsing effect
> through the lock queue.

Yeah, significantly longer than normal if you assume the normal case is
to write a WAL record that's much less than a full page. But has anyone
seen direct evidence that that's an important effect? I was just
proposing this on speculation --- if there's already evidence that that
behavior is a problem, it'd be interesting to see it.

BTW, we'd probably need to do something like this even if we then go
forward with your original idea. If we're going to allow multiple
backends to be inserting WAL records into the-same-or-different WAL
buffers concurrently, we can't have that same code responsible for
initializing empty buffers.

regards, tom lane


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org, Paul van den Bogaard <Paul(dot)Vandenbogaard(at)Sun(dot)COM>
Subject: Re: Reworking WAL locking
Date: 2008-02-15 08:16:05
Message-ID: 1203063365.16770.699.camel@ebony.site
Lists: pgsql-hackers

On Thu, 2008-02-14 at 16:50 -0500, Tom Lane wrote:
> Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> > On Thu, 2008-02-14 at 13:52 -0500, Tom Lane wrote:
> >> [ still staring at the code ... ] Something that might be interesting
> >> though is to try to move some of the buffer control logic overhead out
> >> of WALInsertLock's domain and into WALWriteLock's domain.
>
> > I like this one because it's another example that contention is not
> > static; it seems more often cyclical. The WALInsertLock is held
> > longer than normal when we cross page boundaries, creating a pulsing
> > effect through the lock queue.
>
> Yeah, significantly longer than normal if you assume the normal case is
> to write a WAL record that's much less than a full page. But has anyone
> seen direct evidence that that's an important effect? I was just
> proposing this on speculation --- if there's already evidence that that
> behavior is a problem, it'd be interesting to see it.

A tracepoint in AdvanceXLInsertBuffer() would help there.

I would expect lock time to vary with the number of cache lines
touched (approximated by record size) as well as with whether the
page boundary. Very large WAL records are more likely to cross
boundaries, so the effect will show itself most when we write small WAL
records.
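
Something like this, perhaps (probe names invented; PG_TRACE comes
from pg_trace.h and becomes a DTrace probe on Solaris, a no-op
elsewhere):

/*
 * In AdvanceXLInsertBuffer(), bracketing the work done while
 * WALInsertLock is held to cross onto a new page:
 */
PG_TRACE(wal__advance__start);
/* ... wait for or perform the buffer write, init the new page ... */
PG_TRACE(wal__advance__done);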

> BTW, we'd probably need to do something like this even if we then go
> forward with your original idea. If we're going to allow multiple
> backends to be inserting WAL records into the-same-or-different WAL
> buffers concurrently, we can't have that same code responsible for
> initializing empty buffers.

You've sold me already!

I will return to the other part of the idea, but I'm just too busy now
to think and reply in full. I agree with the issues you raised earlier.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Paul van den Bogaard <Paul(dot)Vandenbogaard(at)Sun(dot)COM>
Subject: Re: Reworking WAL locking
Date: 2008-03-23 00:05:16
Message-ID: 200803230005.m2N05Gq26940@momjian.us
Lists: pgsql-hackers


Added to TODO:

* Improve WAL concurrency by increasing lock granularity

http://archives.postgresql.org/pgsql-hackers/2008-02/msg00556.php

---------------------------------------------------------------------------

Simon Riggs wrote:
>
> Paul van den Bogaard (Sun) suggested to me that we could use more than
> two WAL locks to improve concurrency. I think it's possible to introduce
> such a scheme with some ease. All mods would be within xlog.c.
>
> The scheme below requires an extra LWLock per WAL buffer.
>
> Locking within XLogInsert() would look like this:
>
> Calculate length of data to be inserted.
> Calculate initial CRC
>
> LWLockAcquire(WALInsertLock, LW_EXCLUSIVE)
>
> Reserve space to write into.
> LSN = current Insert pointer
> Move pointer forward by length of data to be inserted, acquiring
> WALWriteLock if required to ensure space is available.
>
> LWLockAcquire(LSNGetWALPageLockId(LSN), LW_SHARED);
>
> Note that we don't lock every page, just the first one of the set we
> want, but we hold it until all page writes are complete.
>
> LWLockRelease(WALInsertLock);
>
> finish calculating CRC
> write xlog into reserved space
>
> LWLockRelease(LSNGetWALPageLockId(LSN));
>
> XLogWrite() will then try to get a conditional LW_EXCLUSIVE lock
> sequentially on each page it plans to write. It keeps going until it
> fails to get a lock, then writes the pages it has locked. Callers of
> XLogWrite() will thus never overtake a backend that is currently
> filling a WAL buffer.
>
> We write a whole page at a time.
>
> Next time, we do a regular (blocking) lock wait on the same page, so
> that we always make progress eventually.
>
> This requires us to get 2 locks for an XLogInsert rather than just one.
> However the second lock is always acquired with zero-wait time when the
> wal_buffers are sensibly sized. Overall this should reduce wait time for
> the WALInsertLock, since it seems likely that each actual filling of
> the WAL buffers will touch different cache lines and so the fills can
> very likely be performed in parallel.
>
> Sounds good to me.
>
> Any objections/comments before this can be tried out?
>
> --
> Simon Riggs
> 2ndQuadrant http://www.2ndQuadrant.com

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://postgres.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +