Re: Catching resource leaks during WAL replay

Lists: pgsql-hackers
From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgreSQL(dot)org>
Subject: Catching resource leaks during WAL replay
Date: 2013-03-27 20:40:14
Message-ID: 5153592E.20102@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

While looking at bug #7969, it occurred to me that it would be nice if
we could catch resource leaks in WAL redo routines better. It would be
useful during development, to catch bugs earlier, and it could've turned
that replay-stopping error into a warning.

For regular transactions, we use ResourceOwners to track buffer pins
(like in #7969) and other resources. There's no fundamental reason we
couldn't use one during replay. After running a redo routine, there
should be no buffer pins held or other resources held.

Lwlocks are not tracked by resource owners, but we could still easily
warn if any are held after the redo routine exits.

- Heikki

Attachment Content-Type Size
xlog-redo-resource-leak-warning.patch text/x-diff 6.4 KB

From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Catching resource leaks during WAL replay
Date: 2013-03-27 21:42:08
Message-ID: CA+U5nMKuCxbxLHMbkgN8vyhZ18Y1RiwkByd7wz-2cvyfpdHZBA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 27 March 2013 20:40, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com> wrote:

> While looking at bug #7969, it occurred to me that it would be nice if we
> could catch resource leaks in WAL redo routines better. It would be useful
> during development, to catch bugs earlier, and it could've turned that
> replay-stopping error into a warning.
>
> For regular transactions, we use ResourceOwners to track buffer pins (like
> in #7969) and other resources. There's no fundamental reason we couldn't use
> one during replay. After running a redo routine, there should be no buffer
> pins held or other resources held.
>
> Lwlocks are not tracked by resource owners, but we could still easily warn
> if any are held after the redo routine exits.

I'm inclined to think that the overhead isn't worth the trouble. This
is the only bug of its type we had in recent years.

Perhaps we need another level of compile for checks that happen only in beta?

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Catching resource leaks during WAL replay
Date: 2013-03-27 23:01:25
Message-ID: 8316.1364425285@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> On 27 March 2013 20:40, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com> wrote:
>> While looking at bug #7969, it occurred to me that it would be nice if we
>> could catch resource leaks in WAL redo routines better. It would be useful
>> during development, to catch bugs earlier, and it could've turned that
>> replay-stopping error into a warning.

> I'm inclined to think that the overhead isn't worth the trouble. This
> is the only bug of its type we had in recent years.

I agree that checking for resource leaks after each WAL record seems
too expensive compared to what we'd get for it. But perhaps it's worth
making a check every so often, like at restartpoints?

regards, tom lane


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Catching resource leaks during WAL replay
Date: 2013-03-27 23:09:22
Message-ID: CA+U5nMKNEyrAVv0h9J=xeGo90ESHHk28=vHh=pWN8bvqXbRbEA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 27 March 2013 23:01, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
>> On 27 March 2013 20:40, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com> wrote:
>>> While looking at bug #7969, it occurred to me that it would be nice if we
>>> could catch resource leaks in WAL redo routines better. It would be useful
>>> during development, to catch bugs earlier, and it could've turned that
>>> replay-stopping error into a warning.
>
>> I'm inclined to think that the overhead isn't worth the trouble. This
>> is the only bug of its type we had in recent years.
>
> I agree that checking for resource leaks after each WAL record seems
> too expensive compared to what we'd get for it. But perhaps it's worth
> making a check every so often, like at restartpoints?

+1

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Simon Riggs <simon(at)2ndQuadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Catching resource leaks during WAL replay
Date: 2013-03-28 15:00:28
Message-ID: 51545B0C.5000600@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 28.03.2013 01:01, Tom Lane wrote:
> Simon Riggs<simon(at)2ndQuadrant(dot)com> writes:
>> On 27 March 2013 20:40, Heikki Linnakangas<hlinnakangas(at)vmware(dot)com> wrote:
>>> While looking at bug #7969, it occurred to me that it would be nice if we
>>> could catch resource leaks in WAL redo routines better. It would be useful
>>> during development, to catch bugs earlier, and it could've turned that
>>> replay-stopping error into a warning.
>
>> I'm inclined to think that the overhead isn't worth the trouble. This
>> is the only bug of its type we had in recent years.
>
> I agree that checking for resource leaks after each WAL record seems
> too expensive compared to what we'd get for it. But perhaps it's worth
> making a check every so often, like at restartpoints?

That sounds very seldom. How about making it an assertion to check after
every record? I guess I'll have to do some testing to see how expensive
it really is.

- Heikki


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Simon Riggs <simon(at)2ndQuadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Catching resource leaks during WAL replay
Date: 2013-03-28 15:09:36
Message-ID: 28353.1364483376@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Heikki Linnakangas <hlinnakangas(at)vmware(dot)com> writes:
> On 28.03.2013 01:01, Tom Lane wrote:
>> Simon Riggs<simon(at)2ndQuadrant(dot)com> writes:
>>> I'm inclined to think that the overhead isn't worth the trouble. This
>>> is the only bug of its type we had in recent years.

>> I agree that checking for resource leaks after each WAL record seems
>> too expensive compared to what we'd get for it. But perhaps it's worth
>> making a check every so often, like at restartpoints?

> That sounds very seldom. How about making it an assertion to check after
> every record? I guess I'll have to do some testing to see how expensive
> it really is.

Well, the actually productive part of this patch is to reduce such a
failure from ERROR to WARNING, which seems like it probably only
requires *one* resource cleanup after we exit the apply loop. Doing it
per restartpoint is probably reasonable to limit the resource owner's
memory consumption (if there were a leak) over a long replay sequence.
I am really not seeing much advantage to doing it per record.

I suppose you are thinking of being helpful during development, but if
anything I would argue that the current behavior of a hard failure is
best for development. It guarantees that the developer will notice the
failure, if it occurs at all in his test scenario; whereas a WARNING
that goes only to the postmaster log will be very very easily missed.

regards, tom lane