Re: Strange Windows problem, lock_timeout test request

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Boszormenyi Zoltan <zb(at)cybertec(dot)at>
Cc: Stephen Frost <sfrost(at)snowman(dot)net>, Hari Babu <haribabu(dot)kommi(at)huawei(dot)com>, "'Craig Ringer'" <craig(at)2ndQuadrant(dot)com>, 'Hans-Jürgen Schönig' <hs(at)cybertec(dot)at>, "'Ants Aasma'" <ants(at)cybertec(dot)at>, "'PostgreSQL Hackers'" <pgsql-hackers(at)postgresql(dot)org>, "'Amit kapila'" <amit(dot)kapila(at)huawei(dot)com>
Subject: Re: Strange Windows problem, lock_timeout test request
Date: 2013-03-15 17:53:40
Message-ID: 23716.1363370020@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Boszormenyi Zoltan <zb(at)cybertec(dot)at> writes:
> [ 2-lock_timeout-v33.patch ]

I looked at this patch a bit. I don't understand why you've chosen to
alter the API of the enable_timeout variants to have a bool result that
says "I didn't bother to process the timeout because it would have fired
immediately". That is not an advantage for any call site that I can
see: it just means that the caller needs an extra, very difficult to
test code path to handle what would normally be handled by a timeout
interrupt. Even if it were a good API choice, you've broken a number of
existing call sites that the patch fails to touch (eg in postmaster.c
and postgres.c). It's not acceptable to define a failure return from
enable_timeout_after and then let callers assume that the failure can't
happen.

Please undo that.

Also, I'm not really enamored of the choice to use List* infrastructure
for enable_timeouts(). That doesn't appear to be especially convenient
for any caller, and what it does do is create memory-leak risks for most
of them (if they forget to free the list cells, perhaps as a result of
an error exit). I think a local-variable array of TimeoutParams[]
structs would serve better for most use-cases.

Another thought is that the PGSemaphoreTimedLock implementations all
share the same bug, which is that if the "condition" callback returns
true immediately after we acquire the semaphore, they will all wrongly
return false indicating that the semaphore wasn't acquired. (BTW,
I do not like either the variable name "condition" or the typedef name
"TimeoutCondition" for something that's actually a callback function
pointer.)

In the same vein, this comment change:

* NOTE: Think not to put any shared-state cleanup after the call to
* ProcSleep, in either the normal or failure path. The lock state must
! * be fully set by the lock grantor, or by CheckDeadLock if we give up
! * waiting for the lock. This is necessary because of the possibility
! * that a cancel/die interrupt will interrupt ProcSleep after someone else
! * grants us the lock, but before we've noticed it. Hence, after granting,
! * the locktable state must fully reflect the fact that we own the lock;
! * we can't do additional work on return.
*
* We can and do use a PG_TRY block to try to clean up after failure, but
* this still has a major limitation: elog(FATAL) can occur while waiting
--- 1594,1606 ----
/*
* NOTE: Think not to put any shared-state cleanup after the call to
* ProcSleep, in either the normal or failure path. The lock state must
! * be fully set by the lock grantor, or by CheckDeadLock or by ProcSleep
! * itself in case a timeout is detected or if we give up waiting for the lock.
! * This is necessary because of the possibility that a cancel/die interrupt
! * will interrupt ProcSleep after someone else grants us the lock, but
! * before we've noticed it. Hence, after granting, the locktable state must
! * fully reflect the fact that we own the lock; we can't do additional work
! * on return.
*

suggests that you've failed to grasp the fundamental race-condition
problem here. ProcSleep can't do cleanup after the fact any more than
its callers can, because otherwise it has a race condition with some
other process deciding to grant it the lock at about the same time its
timeout goes off. I think the changes in ProcSleep that alter the
state-cleanup behavior are just plain wrong, and what you need to do
is make that look more like the existing mechanisms that clean up when
statement timeout occurs.

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2013-03-15 19:05:54 pg_test_fsync crashes on systems with POSIX signal handling
Previous Message Alvaro Herrera 2013-03-15 17:01:37 Re: sql_drop Event Triggerg