Btree runtime recovery. Stuck spins.

Lists: pgsql-hackers
From: "Mikheev, Vadim" <vmikheev(at)SECTORBASE(dot)COM>
To: "'pgsql-hackers(at)postgresql(dot)org'" <pgsql-hackers(at)postgresql(dot)org>
Subject: Btree runtime recovery. Stuck spins.
Date: 2001-02-08 20:03:42
Message-ID: 8F4C99C66D04D4118F580090272A7A234D32E3@sectorbase1.sectorbase.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi

1. Subj is implemented and is ON by default in current.
There is flag - fixbtree - to turn this feature OFF.
I've run some tests: 100 clients inserted big tuples (1700-1800
bytes) into single table with index. After splitting root page
elog(ERROR) was forced, as well as after each ~ 5th non-root page
split, so there was what to fix. After this I've run selects for
each key to check that index structure is correct.

2. During tests I've got stuck spin aborts couple of times.
So I've increased S_MAX_BUSY, placed elog(NOTICE, "WOULD BE STUCK")
for spins == 20001 in s_lock_sleep() and rerun tests.
I've got *many* "WOULD BE STUCK" notices but no one abort.

Does it explain why I tried to avoid spin stuck "detection" code
in WAL? -:)

Shouldn't we increase S_MAX_BUSY and use ERROR instead of FATAL?

Vadim


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Mikheev, Vadim" <vmikheev(at)SECTORBASE(dot)COM>
Cc: "'pgsql-hackers(at)postgresql(dot)org'" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Btree runtime recovery. Stuck spins.
Date: 2001-02-08 20:48:49
Message-ID: 12891.981665329@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"Mikheev, Vadim" <vmikheev(at)SECTORBASE(dot)COM> writes:
> 2. During tests I've got stuck spin aborts couple of times.
> So I've increased S_MAX_BUSY, placed elog(NOTICE, "WOULD BE STUCK")
> for spins == 20001 in s_lock_sleep() and rerun tests.
> I've got *many* "WOULD BE STUCK" notices but no one abort.
> Does it explain why I tried to avoid spin stuck "detection" code
> in WAL? -:)
> Shouldn't we increase S_MAX_BUSY and use ERROR instead of FATAL?

No. If you have delays exceeding a minute, or that are even a visible
fraction of a minute, then a spinlock is NOT the correct mechanism to be
using to wait ... because guess what, it's spinning, and consuming
processor time to no purpose. You should be using a lock instead for
anything that involves more than a trivial amount of delay.

regards, tom lane


From: "Vadim Mikheev" <vmikheev(at)sectorbase(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Btree runtime recovery. Stuck spins.
Date: 2001-02-09 02:58:38
Message-ID: 005f01c09244$341e3440$4a79583f@sectorbase.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> > Shouldn't we increase S_MAX_BUSY and use ERROR instead of FATAL?
>
> No. If you have delays exceeding a minute, or that are even a visible
> fraction of a minute, then a spinlock is NOT the correct mechanism to be
> using to wait ... because guess what, it's spinning, and consuming
> processor time to no purpose. You should be using a lock instead for
> anything that involves more than a trivial amount of delay.

"Amount of delay" depends on system load - something we can't control...

Btree uses spins to lock buffers (as all other access methods) and so
I could use only spins in new code. And though tree recovery locks buffers
for longer time than normal insert operations it's possible to get
"stuck" spins when using concurrent buffers locks *everywhere* under
heavy load (especially with WAL which requires holding buffer locks
for duration of logging).

So, probably we have to use some kind of light locks (without deadlock
detection) for buffers, in future.

Vadim


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Vadim Mikheev" <vmikheev(at)sectorbase(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Btree runtime recovery. Stuck spins.
Date: 2001-02-09 06:30:31
Message-ID: 17709.981700231@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"Vadim Mikheev" <vmikheev(at)sectorbase(dot)com> writes:
> Shouldn't we increase S_MAX_BUSY and use ERROR instead of FATAL?
>>
>> No. If you have delays exceeding a minute, or that are even a visible
>> fraction of a minute, then a spinlock is NOT the correct mechanism to be
>> using to wait ... because guess what, it's spinning, and consuming
>> processor time to no purpose. You should be using a lock instead for
>> anything that involves more than a trivial amount of delay.

> "Amount of delay" depends on system load - something we can't control...

> Btree uses spins to lock buffers (as all other access methods) and so
> I could use only spins in new code. And though tree recovery locks buffers
> for longer time than normal insert operations it's possible to get
> "stuck" spins when using concurrent buffers locks *everywhere* under
> heavy load (especially with WAL which requires holding buffer locks
> for duration of logging).

Hm. It was OK to use spinlocks to control buffer access when the max
delay was just the time to read or write one disk page. But it sounds
like we've pushed the code way past what it was designed to do. I think
this needs some careful thought, not just a quick hack like increasing
the timeout interval.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Vadim Mikheev" <vmikheev(at)sectorbase(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Btree runtime recovery. Stuck spins.
Date: 2001-02-09 18:05:18
Message-ID: 20428.981741918@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I wrote:
> "Vadim Mikheev" <vmikheev(at)sectorbase(dot)com> writes:
>> Btree uses spins to lock buffers (as all other access methods) and so
>> I could use only spins in new code. And though tree recovery locks buffers
>> for longer time than normal insert operations it's possible to get
>> "stuck" spins when using concurrent buffers locks *everywhere* under
>> heavy load (especially with WAL which requires holding buffer locks
>> for duration of logging).

> Hm. It was OK to use spinlocks to control buffer access when the max
> delay was just the time to read or write one disk page. But it sounds
> like we've pushed the code way past what it was designed to do. I think
> this needs some careful thought, not just a quick hack like increasing
> the timeout interval.

After thinking more about this, simply increasing S_MAX_BUSY is clearly
NOT a good answer. If you are under heavy load then processes that are
spinning are making things worse, not better, because they are sucking
CPU cycles that would be better spent on the processes that are holding
the locks.

It would not be very difficult to replace the per-disk-buffer spinlocks
with regular lockmanager locks. Advantages:
* Processes waiting for a buffer lock aren't sucking CPU cycles.
* Deadlocks will be detected and handled reasonably. (The more stuff
that WAL does while holding a buffer lock, the bigger the chances
of deadlock. I think this is a significant concern now.)
Of course the major disadvantage is:
* the lock setup/teardown overhead is much greater than for a
spinlock, and the overhead is just wasted when there's no contention.

A reasonable alternative would be to stick with the spinlock mechanism,
but use a different locking routine (maybe call it S_SLOW_LOCK) that is
designed to deal with locks that may be held for a long time. It would
use much longer delay intervals than the regular S_LOCK code, and would
have either a longer time till ultimate timeout, or no timeout at all.
The main problem with this idea is choosing an appropriate timeout
behavior. As I said, I am concerned about the possibility of deadlocks
in WAL-recovery scenarios, so I am not very happy with the thought of
no timeout at all. But it's hard to see what a reasonable timeout would
be if a minute or more isn't enough in your test cases; seems to me that
that suggests that for very large indexes, you might need a *long* time.

Comments, preferences, better ideas?

regards, tom lane


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Vadim Mikheev <vmikheev(at)sectorbase(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Btree runtime recovery. Stuck spins.
Date: 2001-02-09 18:12:56
Message-ID: 200102091812.NAA13636@candle.pha.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> > Hm. It was OK to use spinlocks to control buffer access when the max
> > delay was just the time to read or write one disk page. But it sounds
> > like we've pushed the code way past what it was designed to do. I think
> > this needs some careful thought, not just a quick hack like increasing
> > the timeout interval.
>
> After thinking more about this, simply increasing S_MAX_BUSY is clearly
> NOT a good answer. If you are under heavy load then processes that are
> spinning are making things worse, not better, because they are sucking
> CPU cycles that would be better spent on the processes that are holding
> the locks.

Our spinlocks don't go into an infinite test loop, right? They back off
and retest at random intervals.

I can't imagine we don't have similar btree lock needs other places in
the code were a solution already exists.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: Vadim Mikheev <vmikheev(at)sectorbase(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Btree runtime recovery. Stuck spins.
Date: 2001-02-09 18:23:35
Message-ID: 20532.981743015@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> Our spinlocks don't go into an infinite test loop, right? They back off
> and retest at random intervals.

Not very random --- either 0 or 10 milliseconds. (I think there was
some discussion of changing that, but it died off without agreeing on
anything.) The point though is that the code is tuned for use with
spinlocks protecting shared-memory data structures, where no one is
supposed to be holding the lock for long; so a short retry delay is
appropriate, and we don't need a very long time before declaring "stuck
spinlock" either. This is not optimal for cases where someone may be
holding the lock for a good while, but detuning the code so that it
works less well for the shmem-structure case isn't the answer.

We need two different mechanisms.

regards, tom lane


From: ncm(at)zembu(dot)com (Nathan Myers)
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Btree runtime recovery. Stuck spins.
Date: 2001-02-09 19:44:05
Message-ID: 20010209114405.D624@store.zembu.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Feb 09, 2001 at 01:23:35PM -0500, Tom Lane wrote:
> Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> > Our spinlocks don't go into an infinite test loop, right? They back off
> > and retest at random intervals.
>
> Not very random --- either 0 or 10 milliseconds. (I think there was
> some discussion of changing that, but it died off without agreeing on
> anything.) ...

I think we agreed that 0 was just wrong, but nobody changed it.
Changing it to 1 microsecond would be the smallest reasonable
change. As it is, it just does a bunch of no-op syscalls each time it
wakes up after a 10ms sleep, without yielding the CPU.

Nathan Myers
ncm(at)zembu(dot)com


From: "Vadim Mikheev" <vmikheev(at)sectorbase(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Btree runtime recovery. Stuck spins.
Date: 2001-02-09 20:48:46
Message-ID: 00cc01c092d9$b2f17120$4c79583f@sectorbase.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> > Hm. It was OK to use spinlocks to control buffer access when the max
> > delay was just the time to read or write one disk page. But it sounds

Actually, btree split requires 3 simult. buffers locks and after that
_bt_getstackbuf may read *many* parent buffers while holding locks on
2 buffers. AFAIR, the things are even worse in hash.

And anyway, there is always probability that someone else will get just
freed lock while you're waiting next 0.01 sec. The problem is that
there is no priority/ordering while waiting for spin lock.

> > like we've pushed the code way past what it was designed to do. I think
> > this needs some careful thought, not just a quick hack like increasing
> > the timeout interval.

I fear there is not enough time -:(

> After thinking more about this, simply increasing S_MAX_BUSY is clearly
> NOT a good answer. If you are under heavy load then processes that are
> spinning are making things worse, not better, because they are sucking
> CPU cycles that would be better spent on the processes that are holding
> the locks.
>
> It would not be very difficult to replace the per-disk-buffer spinlocks
> with regular lockmanager locks. Advantages:
> * Processes waiting for a buffer lock aren't sucking CPU cycles.
> * Deadlocks will be detected and handled reasonably. (The more stuff
> that WAL does while holding a buffer lock, the bigger the chances
> of deadlock. I think this is a significant concern now.)

I disagree. Lmgr needs in deadlock detection code because of deadlock
may be caused by *user application* design and we must not count on
*user application* correctness. But we must not use deadlock detection
code when we protected from deadlock by *our* design. Well, anyone can
make mistake and break order of lock acquiring - we should just fix
those bugs -:)
So, it doesn't matter *how much stuff that WAL does while holding buffer
locks* as long as WAL itself doesn't acquire buffer locks.

> Of course the major disadvantage is:
> * the lock setup/teardown overhead is much greater than for a
> spinlock, and the overhead is just wasted when there's no contention.

Exactly.

> A reasonable alternative would be to stick with the spinlock mechanism,
> but use a different locking routine (maybe call it S_SLOW_LOCK) that is
> designed to deal with locks that may be held for a long time. It would
> use much longer delay intervals than the regular S_LOCK code, and would
> have either a longer time till ultimate timeout, or no timeout at all.
> The main problem with this idea is choosing an appropriate timeout
> behavior. As I said, I am concerned about the possibility of deadlocks
> in WAL-recovery scenarios, so I am not very happy with the thought of
> no timeout at all. But it's hard to see what a reasonable timeout would

And I'm unhappy with timeouts -:) It's not solution at all. We should
do right design instead.

> be if a minute or more isn't enough in your test cases; seems to me that
> that suggests that for very large indexes, you might need a *long* time.
>
> Comments, preferences, better ideas?

For any spins which held while doing IO ops we should have queue of
waiting backend' PROCs. As I said - some kind of lightweight lock manager.
Just two kind of locks - shared & exclusive. No structures to find locked
objects. No deadlock detection code. Backends should wait on their
semaphores, without timeouts.

For "true" spins (held for really short time when accessing control
structures in shmem) we should not sleep 0.01 sec! tv_usec == 1 would be
reasonable - just to yield CPU. Actually, mutexes would be much better...

Vadim