Re: <IDLE> and waiting

Lists: pgsql-hackers
From: "Gurjeet Singh" <singh(dot)gurjeet(at)gmail(dot)com>
To: "PGSQL Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: <IDLE> and waiting
Date: 2008-01-31 22:30:49
Message-ID: 65937bea0801311430r365f1e6w6b9ad8f5322c3b34@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi guys,

I saw a strange behaviour on one of the production boxes. The
pg_stat_activity shows a process as <IDLE> and yet 'waiting' !!! On top of
it (understandably, since its IDLE), there are no entries for this pid in
pg_locks!

Following are the snapshots of the two system views.

procpid | current_query | waiting | duration |
backend_start
---------+-----------------------+---------+------------------+-------------------------------
20762 | <IDLE> | f | | 2008-01-31
13:38:30.848898-08
19776 | <IDLE> | t | 00:38:34.76833 | 2008-01-31
12:51:29.005744-08
20356 | <IDLE> | f | 00:38:29.971425 | 2008-01-31
13:17:37.617497-08
19775 | <IDLE> | f | 00:38:27.187201 | 2008-01-31
12:51:28.999242-08
19774 | <IDLE> | f | 00:38:27.187068 | 2008-01-31
12:51:28.90554-08
20728 | <IDLE> | f | 00:14:03.913027 | 2008-01-31
13:36:11.345822-08
9727 | <IDLE> | f | 00:03:07.444273 | 2008-01-24
22:25:00.289931-08
9684 | <IDLE> | f | 00:00:07.704656 | 2008-01-24
22:22:00.007377-08
19390 | <IDLE> in transaction | f | 00:00:00.027585 | 2008-01-31
12:30:07.999246-08
19389 | <IDLE> in transaction | t | -00:00:00.000255 | 2008-01-31
12:30:07.973868-08

select * from pg_locks where pid in ( 19776, 19389 );

locktype | database | relation | page | tuple | transactionid |
classid | objid | objsubid | transaction | pid | mode |
granted
---------------+----------+----------+------+-------+---------------+---------+-------+----------+-------------+-------+------------------+---------
relation | 16584 | 17070 | | |
| | | | 3700350056 | 19389 | RowExclusiveLock | t
relation | 16584 | 17106 | | |
| | | | 3700350056 | 19389 | RowExclusiveLock | t
relation | 16584 | 17068 | | |
| | | | 3700350056 | 19389 | RowExclusiveLock | t
transactionid | | | | | 3700350056
| | | | 3700350056 | 19389 | ExclusiveLock | t
relation | 16584 | 17108 | | |
| | | | 3700350056 | 19389 | RowExclusiveLock | t
(5 rows)

The 'duration' column above is just now()-query_start. These are not
just two instant snapshots, but we could see this output consistently for
quite long.

I tracked the 'waiting' column a little bit in the source code, and saw
that it is actually generated from PgBackendStatus.st_waiting . Is it
possible that, for some reason, postgres forgot to update this for a
backend?

select version();
version
--------------------------------------------------------------------------------------------
PostgreSQL 8.2.4 on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC)
4.1.0 (SUSE Linux)

This issue has been seen twice now.

--
gurjeet[(dot)singh](at)EnterpriseDB(dot)com
singh(dot)gurjeet(at){ gmail | hotmail | indiatimes | yahoo }.com

EnterpriseDB http://www.enterprisedb.com

17° 29' 34.37"N, 78° 30' 59.76"E - Hyderabad
18° 32' 57.25"N, 73° 56' 25.42"E - Pune
37° 47' 19.72"N, 122° 24' 1.69" W - San Francisco *

http://gurjeet.frihost.net

Mail sent from my BlackLaptop device


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Gurjeet Singh" <singh(dot)gurjeet(at)gmail(dot)com>
Cc: "PGSQL Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: <IDLE> and waiting
Date: 2008-02-01 16:16:20
Message-ID: 24863.1201882580@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"Gurjeet Singh" <singh(dot)gurjeet(at)gmail(dot)com> writes:
> I saw a strange behaviour on one of the production boxes. The
> pg_stat_activity shows a process as <IDLE> and yet 'waiting' !!! On top of
> it (understandably, since its IDLE), there are no entries for this pid in
> pg_locks!

Hmm, I can reproduce something like this by aborting a wait for lock.
It seems the problem is that WaitOnLock() is ignoring its own good
advice, assuming that it can do cleanup work after waiting.

regards, tom lane


From: "Gurjeet Singh" <singh(dot)gurjeet(at)gmail(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "PGSQL Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: <IDLE> and waiting
Date: 2008-02-01 22:26:08
Message-ID: 65937bea0802011426p4812b7f0k8e07be9530f5b12a@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

The situation seems pretty bad!!

Here are the steps to reproduce in 'PostgreSQL 8.3beta2 on
x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 3.3.3 (SuSE Linux)':

session 1: begin;
session 1: update test set a = 112 where a = 112;
session 2: update test set a = 113 where a = 112; --waits
session 1: select * from pg_stat_activity; -- executed this a few times
before executing 'select version()' and then following:
session 1: <stat query1> -- see end of mail for this query

procpid | current_query | waiting |
duration | backend_start
---------+----------------------------------------+---------+------------------+-------------------------------
12577 | update test set a = 113 where a = 112; | t | -00:01:
35.782881 | 2008-02-01 13:36:15.31027-08
11975 | select * from pg_stat_activity ; | f | -00:01:
52.554697 | 2008-02-01 13:30:40.396392-08
(2 rows)

session 1: select * from pg_locks

locktype | database | relation | page | tuple | virtualxid |
transactionid | classid | objid | objsubid | virtualtransaction | pid
| mode | granted
---------------+----------+----------+------+-------+------------+---------------+---------+-------+----------+---------------
-----+-------+------------------+---------
transactionid | | | | | |
390 | | | | 2/14 | 12577 | ShareLock | f
transactionid | | | | | |
390 | | | | 1/9 | 11975 | ExclusiveLock | t
<snip irrelevant>

Then,
session 2: ^C
Cancel request sent
ERROR: canceling statement due to user request

session1: <stat query1>

procpid | current_query | waiting |
duration | backend_start
---------+----------------------------------------+---------+------------------+-------------------------------
12577 | update test set a = 113 where a = 112; | t | -00:01:
35.782881 | 2008-02-01 13:36:15.31027-08
11975 | select * from pg_stat_activity ; | f | -00:01:
52.554697 | 2008-02-01 13:30:40.396392-08
(2 rows)

session 1: select * from pg_locks ;

<no traces of pid 12577>

session 1: select pg_backend_pid();

pg_backend_pid
----------------
11975

The last mentioned output of <stat query> shows session 1 executing 'select
* from p_s_a', whereas the <stat query> _is_ being executed in session 1!!!
This result is consistently returned for a while, and later...

session 2: select pg_backend_pid();

pg_backend_pid
----------------
12577

session 1: <stat query1>

procpid | current_query | waiting | duration |
backend_start
---------+-----------------------+---------+-----------------+-------------------------------
11975 | <IDLE> in transaction | f | 00:06:08.671029 | 2008-02-01
13:30:40.396392-08
(1 row)

After a while again:

session 1: <stat query2> -- notice 2 not 1; 'select *' comes back to
haunt!!!

procpid | current_query | waiting |
duration | backend_start
---------+----------------------------------------+---------+------------------+-------------------------------
12577 | update test set a = 113 where a = 112; | t | -00:01:
35.782881 | 2008-02-01 13:36:15.31027-08
11975 | select * from pg_stat_activity ; | f | -00:01:
52.554697 | 2008-02-01 13:30:40.396392-08
(2 rows)

session 1: <stat query1> -- 1 back in action

procpid | current_query | waiting |
duration | backend_start
---------+----------------------------------------+---------+------------------+-------------------------------
12577 | update test set a = 113 where a = 112; | t | -00:01:
35.782881 | 2008-02-01 13:36:15.31027-08
11975 | select * from pg_stat_activity ; | f | -00:01:
52.554697 | 2008-02-01 13:30:40.396392-08
(2 rows)

The <stat query1> is:
select
procpid, current_query::varchar(50), waiting, now() - query_start as
duration, backend_start
from pg_stat_activity
where current_query <> '<IDLE>'
and current_query not like '%DONT COUNT ME1 %'
order by duration desc
limit 10;

The <stat query2> is:
select
procpid, current_query::varchar(50), waiting, now() - query_start as
duration, backend_start
from pg_stat_activity
where current_query not like '%DONT COUNT ME1 %'
order by duration desc
limit 10;

Found more bugs than I was looking for, to reproduce!!!

The reporter also made an observation (on 8.2.4) that there were
deadlocks detected at around the same time. Looked at WaitOnLock(), and
clearly there's a problem, but is it at the same/only place we are
suspecting it to be?

Best regards,

PS: Ran the <stat query>ies 1 and 2 again, just before hitting 'send', and
the result is the same:
procpid | current_query | waiting |
duration | backend_start
---------+----------------------------------------+---------+------------------+-------------------------------
12577 | update test set a = 113 where a = 112; | t | -00:01:
35.782881 | 2008-02-01 13:36:15.31027-08
11975 | select * from pg_stat_activity ; | f | -00:01:
52.554697 | 2008-02-01 13:30:40.396392-08
(2 rows)

Clearly, there's something wrong

On Feb 1, 2008 8:16 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> "Gurjeet Singh" <singh(dot)gurjeet(at)gmail(dot)com> writes:
> > I saw a strange behaviour on one of the production boxes. The
> > pg_stat_activity shows a process as <IDLE> and yet 'waiting' !!! On top
> of
> > it (understandably, since its IDLE), there are no entries for this pid
> in
> > pg_locks!
>
> Hmm, I can reproduce something like this by aborting a wait for lock.
> It seems the problem is that WaitOnLock() is ignoring its own good
> advice, assuming that it can do cleanup work after waiting.
>
> regards, tom lane
>

--
gurjeet[(dot)singh](at)EnterpriseDB(dot)com
singh(dot)gurjeet(at){ gmail | hotmail | indiatimes | yahoo }.com

EnterpriseDB http://www.enterprisedb.com

17° 29' 34.37"N, 78° 30' 59.76"E - Hyderabad
18° 32' 57.25"N, 73° 56' 25.42"E - Pune
37° 47' 19.72"N, 122° 24' 1.69" W - San Francisco *

http://gurjeet.frihost.net

Mail sent from my BlackLaptop device


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Gurjeet Singh" <singh(dot)gurjeet(at)gmail(dot)com>
Cc: "PGSQL Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: <IDLE> and waiting
Date: 2008-02-01 23:56:56
Message-ID: 18059.1201910216@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"Gurjeet Singh" <singh(dot)gurjeet(at)gmail(dot)com> writes:
> The situation seems pretty bad!!

I think at least part of your problem is not understanding that a single
transaction sees a frozen snapshot of pg_stat_activity.

regards, tom lane


From: "Gurjeet Singh" <singh(dot)gurjeet(at)gmail(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "PGSQL Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: <IDLE> and waiting
Date: 2008-02-02 00:49:17
Message-ID: 65937bea0802011649v2a793b0w6367638d9edfff17@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Feb 1, 2008 3:56 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> "Gurjeet Singh" <singh(dot)gurjeet(at)gmail(dot)com> writes:
> > The situation seems pretty bad!!
>
> I think at least part of your problem is not understanding that a single
> transaction sees a frozen snapshot of pg_stat_activity.
>
>
It does! I assumed that pg_stat_activity produced the
transaction-independent snapshot of internal memory structures! Is that the
case with pg_locks too!? I hope not.

BTW, we cannot say that the pg_stat_activity behaves in a consistent manner
(transactions-wise). From what I could infer, this view's results are frozen
when you first query the view, not when the transaction started (which is
how other (normal) relations behave). It's a bit confusing, and should be
documented if this is the way it is intended to work; Something along the
lines of : "In a transaction, this view will repeatedly show the same
results that were returned by it's first invocation in the transaction." in
a less confusing way :)

So we are back to the original problem... Canceling a 'waiting' transaction
does not revert the session's 'waiting' state back to 'false' (consistently
reproducible).

--
gurjeet[(dot)singh](at)EnterpriseDB(dot)com
singh(dot)gurjeet(at){ gmail | hotmail | indiatimes | yahoo }.com

EnterpriseDB http://www.enterprisedb.com

17° 29' 34.37"N, 78° 30' 59.76"E - Hyderabad
18° 32' 57.25"N, 73° 56' 25.42"E - Pune
37° 47' 19.72"N, 122° 24' 1.69" W - San Francisco *

http://gurjeet.frihost.net

Mail sent from my BlackLaptop device


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Gurjeet Singh" <singh(dot)gurjeet(at)gmail(dot)com>, "PGSQL Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: <IDLE> and waiting
Date: 2008-02-02 22:28:50
Message-ID: 24901.1201991330@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I wrote:
> "Gurjeet Singh" <singh(dot)gurjeet(at)gmail(dot)com> writes:
>> I saw a strange behaviour on one of the production boxes. The
>> pg_stat_activity shows a process as <IDLE> and yet 'waiting' !!! On top of
>> it (understandably, since its IDLE), there are no entries for this pid in
>> pg_locks!

> Hmm, I can reproduce something like this by aborting a wait for lock.
> It seems the problem is that WaitOnLock() is ignoring its own good
> advice, assuming that it can do cleanup work after waiting.

I've committed a fix for this. (Too late for 8.3.0, unfortunately.)

regards, tom lane


From: "Gurjeet Singh" <singh(dot)gurjeet(at)gmail(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "PGSQL Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: <IDLE> and waiting
Date: 2008-02-02 23:13:57
Message-ID: 65937bea0802021513m9f45e5bufb4a8fce0730421a@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Feb 2, 2008 2:28 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> I wrote:
> > "Gurjeet Singh" <singh(dot)gurjeet(at)gmail(dot)com> writes:
> >> I saw a strange behaviour on one of the production boxes. The
> >> pg_stat_activity shows a process as <IDLE> and yet 'waiting' !!! On top
> of
> >> it (understandably, since its IDLE), there are no entries for this pid
> in
> >> pg_locks!
>
> > Hmm, I can reproduce something like this by aborting a wait for lock.
> > It seems the problem is that WaitOnLock() is ignoring its own good
> > advice, assuming that it can do cleanup work after waiting.
>
> I've committed a fix for this. (Too late for 8.3.0, unfortunately.)
>

Thanks. Like 8.2, can it not be back-patched on 8.3 too?

I just looked at the patch... Isn't PG_TRY() an expensive call to make in
the lock.c code? I was thinking of registering a Xact callback using
RegisterXactCallback() and performing 'waiting' reset in that callback if
the Xact event is XACT_EVENT_ABORT.

That would have been compliant with the previous comments ('if we fail, any
cleanup must happen in xact abort processing, not here').

Comments.

Best regards,
--
gurjeet[(dot)singh](at)EnterpriseDB(dot)com
singh(dot)gurjeet(at){ gmail | hotmail | indiatimes | yahoo }.com

EnterpriseDB http://www.enterprisedb.com

17° 29' 34.37"N, 78° 30' 59.76"E - Hyderabad
18° 32' 57.25"N, 73° 56' 25.42"E - Pune
37° 47' 19.72"N, 122° 24' 1.69" W - San Francisco *

http://gurjeet.frihost.net

Mail sent from my BlackLaptop device


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Gurjeet Singh <singh(dot)gurjeet(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: <IDLE> and waiting
Date: 2008-02-02 23:27:57
Message-ID: 20080202232757.GB5238@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Gurjeet Singh escribió:

> I just looked at the patch... Isn't PG_TRY() an expensive call to make in
> the lock.c code? I was thinking of registering a Xact callback using
> RegisterXactCallback() and performing 'waiting' reset in that callback if
> the Xact event is XACT_EVENT_ABORT.

PG_TRY is not expensive as all that -- it's just a sigsetjmp() call and
another stack frame.

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


From: "Gurjeet Singh" <singh(dot)gurjeet(at)gmail(dot)com>
To: "Alvaro Herrera" <alvherre(at)commandprompt(dot)com>
Cc: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "PGSQL Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: <IDLE> and waiting
Date: 2008-02-02 23:33:58
Message-ID: 65937bea0802021533o4bcc3483pfdf4882e6b52a58e@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Feb 2, 2008 3:27 PM, Alvaro Herrera <alvherre(at)commandprompt(dot)com> wrote:

> Gurjeet Singh escribió:
>
> > I just looked at the patch... Isn't PG_TRY() an expensive call to make
> in
> > the lock.c code? I was thinking of registering a Xact callback using
> > RegisterXactCallback() and performing 'waiting' reset in that callback
> if
> > the Xact event is XACT_EVENT_ABORT.
>
> PG_TRY is not expensive as all that -- it's just a sigsetjmp() call and
> another stack frame.
>
>
Thats why I asked. I assumed that creating stacks was expensive. Isn't this
the reason the compilers came up with the function inline capability; to
avoid stacks, because they can be expensive. Or am I confusing two different
types of stacks!

Moreover, calling a callback, once in a while (only upon XACT abort), may
prove to be much cheaper than setting up an additional stack on every
lock-acquire call.

Really, my 2 cents.

--
gurjeet[(dot)singh](at)EnterpriseDB(dot)com
singh(dot)gurjeet(at){ gmail | hotmail | indiatimes | yahoo }.com

EnterpriseDB http://www.enterprisedb.com

17° 29' 34.37"N, 78° 30' 59.76"E - Hyderabad
18° 32' 57.25"N, 73° 56' 25.42"E - Pune
37° 47' 19.72"N, 122° 24' 1.69" W - San Francisco *

http://gurjeet.frihost.net

Mail sent from my BlackLaptop device


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Gurjeet Singh <singh(dot)gurjeet(at)gmail(dot)com>, PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: <IDLE> and waiting
Date: 2008-02-02 23:39:39
Message-ID: 25828.1201995579@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> Gurjeet Singh escribi:
>> I just looked at the patch... Isn't PG_TRY() an expensive call to make in
>> the lock.c code? I was thinking of registering a Xact callback using
>> RegisterXactCallback() and performing 'waiting' reset in that callback if
>> the Xact event is XACT_EVENT_ABORT.

> PG_TRY is not expensive as all that -- it's just a sigsetjmp() call and
> another stack frame.

Also, since we're about to block here, shaving microseconds is not all
that important.

The reason I did it that way was to avoid having to export the saved
ps-display string out to someplace LockWaitCancel could find it.

regards, tom lane