Re: We need to log aborted autovacuums

Lists: pgsql-hackers
From: Josh Berkus <josh(at)agliodbs(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgreSQL(dot)org>
Subject: We need to log aborted autovacuums
Date: 2010-11-17 21:46:21
Message-ID: 4CE44D2D.2000309@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hackers,

I've been trying to diagnose in a production database why certain tables
never get autovacuumed despite having a substantial % of updates. The
obvious reason is locks blocking autovacuum from vacuuming the table ...
but the trick is we don't log such blocking behavior, at all. This
means that there is no possible way for a user to figure out *why* his
tables aren't getting autovacuumed.

And yes, this is a common problem. A quick survey on IRC found 3 active
users on channel (out of 20 or so) who'd encountered it. The current
case I'm looking at is a table with over 70% bloat which hasn't been
autovacuumed since the database was upgraded a month ago.

What I'd like to do is add some logging code to autovacuum.c so that if
log_autovacuum is any value other than -1, failure to vacuum due to
locks gets logged. Does this make sense?

And does anyone have suggestions on how users can diagnose this on older
versions?

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


From: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgreSQL(dot)org>
Subject: Re: We need to log aborted autovacuums
Date: 2010-11-17 21:55:40
Message-ID: 1290030940.10258.7188.camel@jd-desktop
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, 2010-11-17 at 13:46 -0800, Josh Berkus wrote:
> Hackers,
>
> I've been trying to diagnose in a production database why certain tables
> never get autovacuumed despite having a substantial % of updates. The
> obvious reason is locks blocking autovacuum from vacuuming the table ...
> but the trick is we don't log such blocking behavior, at all. This
> means that there is no possible way for a user to figure out *why* his
> tables aren't getting autovacuumed.

Yes we do log this.

JD

--
PostgreSQL.org Major Contributor
Command Prompt, Inc: http://www.commandprompt.com/ - 509.416.6579
Consulting, Training, Support, Custom Development, Engineering
http://twitter.com/cmdpromptinc | http://identi.ca/commandprompt


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We need to log aborted autovacuums
Date: 2010-11-17 22:50:49
Message-ID: 25945.1290034249@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Josh Berkus <josh(at)agliodbs(dot)com> writes:
> I've been trying to diagnose in a production database why certain tables
> never get autovacuumed despite having a substantial % of updates. The
> obvious reason is locks blocking autovacuum from vacuuming the table ...
> but the trick is we don't log such blocking behavior, at all. This
> means that there is no possible way for a user to figure out *why* his
> tables aren't getting autovacuumed.

> And yes, this is a common problem. A quick survey on IRC found 3 active
> users on channel (out of 20 or so) who'd encountered it. The current
> case I'm looking at is a table with over 70% bloat which hasn't been
> autovacuumed since the database was upgraded a month ago.

> What I'd like to do is add some logging code to autovacuum.c so that if
> log_autovacuum is any value other than -1, failure to vacuum due to
> locks gets logged. Does this make sense?

It's hard to tell, because you're just handwaving about what it is you
think isn't being logged; nor is it clear whether you have any evidence
that locks are the problem. Offhand I'd think it at least as likely
that autovacuum thinks it doesn't need to do anything, perhaps because
of a statistics issue. There *is* an elog(DEBUG3) in autovacuum.c
that reports whether autovac thinks a table needs vacuumed/analyzed ...
maybe that needs to be a tad more user-accessible.

regards, tom lane


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We need to log aborted autovacuums
Date: 2010-11-17 23:31:44
Message-ID: 4CE465E0.8090102@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


> It's hard to tell, because you're just handwaving about what it is you
> think isn't being logged; nor is it clear whether you have any evidence
> that locks are the problem. Offhand I'd think it at least as likely
> that autovacuum thinks it doesn't need to do anything, perhaps because
> of a statistics issue. There *is* an elog(DEBUG3) in autovacuum.c
> that reports whether autovac thinks a table needs vacuumed/analyzed ...
> maybe that needs to be a tad more user-accessible.

Yeah, it would be really good to be able to log that without bumping the
log levels of the server in general to DEBUG3. On a busy production
server, using any of the DEBUG levels is pretty much out of the question
... they can produce up to 1GB/minute in output.

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We need to log aborted autovacuums
Date: 2010-11-17 23:35:45
Message-ID: 26954.1290036945@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Josh Berkus <josh(at)agliodbs(dot)com> writes:
>> There *is* an elog(DEBUG3) in autovacuum.c
>> that reports whether autovac thinks a table needs vacuumed/analyzed ...
>> maybe that needs to be a tad more user-accessible.

> Yeah, it would be really good to be able to log that without bumping the
> log levels of the server in general to DEBUG3. On a busy production
> server, using any of the DEBUG levels is pretty much out of the question
> ... they can produce up to 1GB/minute in output.

Well, the way to deal with that would be to add a GUC that enables
reporting of those messages at LOG level. But it's a bit hard to argue
that we need such a thing without more evidence. Maybe you could just
locally modify the DEBUG3 to LOG and see whether it teaches you
anything?

regards, tom lane


From: Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We need to log aborted autovacuums
Date: 2010-11-17 23:49:24
Message-ID: AANLkTinVC0e3OQGHc+BzfkSG8kg+WCebmkS_p=e7FwK_@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Nov 18, 2010 at 08:35, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Yeah, it would be really good to be able to log that without bumping the
>> log levels of the server in general to DEBUG3.
>
> Well, the way to deal with that would be to add a GUC that enables
> reporting of those messages at LOG level.  But it's a bit hard to argue
> that we need such a thing without more evidence.  Maybe you could just
> locally modify the DEBUG3 to LOG and see whether it teaches you
> anything?

How about adding a special role for autovacuum, and running autovacuum
by the role instead of the database owner? If we have "autovacuum" role
for autovacuum processes, we could set log_min_messages to DEBUG3 for
only the role with per-user configuration.

--
Itagaki Takahiro


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We need to log aborted autovacuums
Date: 2010-11-17 23:52:47
Message-ID: 27373.1290037967@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com> writes:
> On Thu, Nov 18, 2010 at 08:35, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Well, the way to deal with that would be to add a GUC that enables
>> reporting of those messages at LOG level. But it's a bit hard to argue
>> that we need such a thing without more evidence. Maybe you could just
>> locally modify the DEBUG3 to LOG and see whether it teaches you
>> anything?

> How about adding a special role for autovacuum, and running autovacuum
> by the role instead of the database owner? If we have "autovacuum" role
> for autovacuum processes, we could set log_min_messages to DEBUG3 for
> only the role with per-user configuration.

That seems like a major kluge ... and anyway it doesn't fix the problem,
because DEBUG3 is still going to result in a lot of unwanted log output,
even if it's confined to autovacuum.

regards, tom lane


From: Greg Smith <greg(at)2ndquadrant(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgreSQL(dot)org>
Subject: Re: We need to log aborted autovacuums
Date: 2011-01-05 06:55:02
Message-ID: 4D2415C6.6090407@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Josh Berkus wrote:
> I've been trying to diagnose in a production database why certain tables
> never get autovacuumed despite having a substantial % of updates. The
> obvious reason is locks blocking autovacuum from vacuuming the table ...
>

Missed this dicussion when it popped up but have plenty to say about it
now. What I do here is look for such anomolies using
pg_stat_user_tables, that the dead rows number has exceeded the vacuum
threshold. That comparison is logged in the code at DEBUG3:

elog(DEBUG3, "%s: vac: %.0f (threshold %.0f), anl: %.0f
(threshold %.0f)",
NameStr(classForm->relname),
vactuples, vacthresh, anltuples, anlthresh);

But a rough computation isn't too hard to derive in a report, so long as
you haven't customized per-table parameters. As you suggested in your
message, the really bad cases here stick out a whole lot. If you pay
the slightest amount of attention to the dead row percentages they jump
right out at you. This all works easily on any version back to 8.3.
Not having as much relevant data stored in pg_stat_user_tables makes the
problem cases less obvious to spot in older versions.

If I start seeing these badly maintained tables and suspect locking is
getting in the way, I then dump traces from pg_locks+pg_stat_activity
often enough that I can estimate how often someone has an interfering
lock and what they're doing.

Should the log level on this message go up from DEBUG3? I could see
rewriting it so that it logs at DEBUG1 instead when
Log_autovacuum_min_duration is set *and* when the trigger threshold is
crossed, and at DEBUG3 the rest of the time. Given you can derive this
with a bit of work in userland, I don't see this even being justified as
an INFO or LOG level message. Anytime I can script a SQL-level monitor
for something that's easy to tie into Nagios or something, I greatly
prefer that to log file scraping for it anyway.

> What I'd like to do is add some logging code to autovacuum.c so that if
> log_autovacuum is any value other than -1, failure to vacuum due to
> locks gets logged. Does this make sense?
>

The general idea is interesting and probably more productive for the
situation you theorize is happening then messing with the logging
discussed above. But that's not where the code needs to go--the lock
isn't opened until much further down the function call stack. Attached
quickie and only tested for compilation patch probably does what you
want here. Since this would eliminate the messy follow-up step I
sometimes have gone through, dumping pg_locks data to confirm or rule
out locking issues messing with AV processing, I can see some potential
that it may have simplified situations I've ran into in the past. And
it's not out of line with the logging verbosity of similar failure mode
tests that follow it. Right now failure to acquire a lock is just not
considered a log-worthy issue, and I agree that it's worth considering
whether it should be.

If you could gather more info on whether this logging catches the
problem cases you're seeing, that would really be the right test for the
patch's usefulness. I'd give you solid 50/50 odds that you've correctly
diagnosed the issue, and knowing for sure would make advocating for this
logging a pretty easy sell to me at least.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services and Support www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books

Attachment Content-Type Size
av-lock-failure-v1.diff text/x-patch 565 bytes

From: Magnus Hagander <magnus(at)hagander(dot)net>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We need to log aborted autovacuums
Date: 2011-01-05 12:29:39
Message-ID: AANLkTinY0ZjNjJrSn6gCekVrDLgdXwyB5PV=FGPv8+ET@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Jan 5, 2011 at 07:55, Greg Smith <greg(at)2ndquadrant(dot)com> wrote:

<snip>

> a bit of work in userland, I don't see this even being justified as an INFO
> or LOG level message.  Anytime I can script a SQL-level monitor for
> something that's easy to tie into Nagios or something, I greatly prefer that
> to log file scraping for it anyway.

+<INF-1>

Log files can be very useful for details, but they suck for noticing
the pattern in the first place :-)

<snip>

> verbosity of similar failure mode tests that follow it.  Right now failure
> to acquire a lock is just not considered a log-worthy issue, and I agree
> that it's worth considering whether it should be.

Or should it perhaps be a per-table counter in pg_stat_user_tables,
given your statement above?

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We need to log aborted autovacuums
Date: 2011-01-05 19:27:10
Message-ID: 4D24C60E.4030408@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


> If you could gather more info on whether this logging catches the
> problem cases you're seeing, that would really be the right test for the
> patch's usefulness. I'd give you solid 50/50 odds that you've correctly
> diagnosed the issue, and knowing for sure would make advocating for this
> logging a pretty easy sell to me at least.

Well, I already resolved the issue through polling pg_locks. The issue
was IIT sessions which lasted up to an hour, which we fixed in the
application, so I don't have the test case available anymore. I'd have
to generate a synthetic test case, and since I agree that a SQL-callable
diagnostic is superior to logging, I don't think we should pursue the
log levels further.

> Or should it perhaps be a per-table counter in pg_stat_user_tables,
> given your statement above?

Or even a timestamp: last_autovacuum_attempt, which would record the
last time autovacuum was tried. If that's fairly recent and you have a
large number of dead rows, you know what kind of problem you have and
can turn on debug.

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


From: Robert Treat <rob(at)xzilla(dot)net>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Greg Smith <greg(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We need to log aborted autovacuums
Date: 2011-01-05 22:56:51
Message-ID: AANLkTi=Zz=3vh5x=+t+PG5b9XcVeZkVQYHHHxUZ8tVJ4@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Jan 5, 2011 at 2:27 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>
>> If you could gather more info on whether this logging catches the
>> problem cases you're seeing, that would really be the right test for the
>> patch's usefulness.  I'd give you solid 50/50 odds that you've correctly
>> diagnosed the issue, and knowing for sure would make advocating for this
>> logging a pretty easy sell to me at least.
>
> Well, I already resolved the issue through polling pg_locks.  The issue
> was IIT sessions which lasted up to an hour, which we fixed in the
> application, so I don't have the test case available anymore.  I'd have
> to generate a synthetic test case, and since I agree that a SQL-callable
> diagnostic is superior to logging, I don't think we should pursue the
> log levels further.
>

This is a great use case for user level tracing support. Add a probe
around these bits, and you can capture the information when you need
it.

Robert Treat
http://www.xzilla.net


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Robert Treat <rob(at)xzilla(dot)net>
Cc: Greg Smith <greg(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We need to log aborted autovacuums
Date: 2011-01-05 23:06:11
Message-ID: 4D24F963.3030100@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


> This is a great use case for user level tracing support. Add a probe
> around these bits, and you can capture the information when you need
> it.

Yeah, would be lovely if user-level tracing existed on all platforms.

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


From: Greg Smith <greg(at)2ndquadrant(dot)com>
To: Robert Treat <rob(at)xzilla(dot)net>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We need to log aborted autovacuums
Date: 2011-01-07 04:11:36
Message-ID: 4D269278.3020901@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Treat wrote:
> This is a great use case for user level tracing support. Add a probe
> around these bits, and you can capture the information when you need
> it.
>

Sure. I would also like a pony.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services and Support www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books


From: Greg Smith <greg(at)2ndquadrant(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We need to log aborted autovacuums
Date: 2011-01-07 04:33:23
Message-ID: 4D269793.6030406@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Josh Berkus wrote:
>> Or should it perhaps be a per-table counter in pg_stat_user_tables,
>> given your statement above?
>>
>
> Or even a timestamp: last_autovacuum_attempt, which would record the
> last time autovacuum was tried. If that's fairly recent and you have a
> large number of dead rows, you know what kind of problem you have and
> can turn on debug.
>

These are both reasonable ideas. But there was just some kickback on
Tomas's "keeping timestamp of the lasts stats reset" patch recently,
from the perspective of trying to limit per-table stats bloat. I think
it's relatively easy to make a case that this situation is difficult
enough to diagnose that a little bit of extra low-level logging is
worthwhile. That Josh and I have both been bit by it enough to be
thinking about patches to make it easier to diagnost suggests it's
obviously too hard to nail down. But is this so common and difficult to
recognize that it's worth making all the table stats bigger? That's a
harder call.

It's already possible to detect the main symptom--dead row percentage is
much higher than the autovacuum threshold, but there's been no recent
autovacuum. That makes me less enthusiastic that there's such a genuine
need to justify the overhead of storing more table stats just to detect
the same thing a little more easily. I've been playing with the Munin
PG plug-in more recently, and I was just thinking of adding a dead row
trend graph/threshold to it to address this general area instead.

We could argue both sides of the trade-off of tracking this directly in
stats for some time, and I'd never expect there to be a clear victory
for either perspective. I've run into this vacuum problem a few times,
but certainly less than I've run into "why is the stats table so huge?"

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services and Support www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We need to log aborted autovacuums
Date: 2011-01-07 19:00:20
Message-ID: 4D2762C4.4050405@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Greg,

> It's already possible to detect the main symptom--dead row percentage is
> much higher than the autovacuum threshold, but there's been no recent
> autovacuum. That makes me less enthusiastic that there's such a genuine
> need to justify the overhead of storing more table stats just to detect
> the same thing a little more easily.

The problem is that while this gives you the symptoms, it doesn't give
you the cause. The lack of vacuum could be occurring for any of 4 reasons:

1) Locking
2) You have a lot of tables and not enough autovac_workers / too much
sleep time
3) You need to autovac this particular table more frequently, since it
gets dirtied really fast
4) The table has been set with special autovac settings which keep it
from being autovac'd

We can currently distinguish between cased 2, 3, 4 based on existing
available facts. However, distinguishing case 1 from 2 or 3, in
particular, isn't currently possible except by methods which require
collecting a lot of ad-hoc monitoring data over a period of time. This
makes the effort required for the diagnosis completely out of proportion
with the magnitude of the problem.

It occurs to me that another way of diagnosis would simply be a way to
cause the autovac daemon to spit out output we could camp on, *without*
requiring the huge volumes of output also required for DEBUG3. This
brings us back to the logging idea again.

> We could argue both sides of the trade-off of tracking this directly in
> stats for some time, and I'd never expect there to be a clear victory
> for either perspective. I've run into this vacuum problem a few times,
> but certainly less than I've run into "why is the stats table so huge?"

I really don't think that argument applies to either patch;
last_autovac_attempt *or* the last_stats_reset time, since neither event
is expected to occur frequently. If you have last_autovac_attempt (for
example) being updated frequently, you clearly had a db problem bigger
than the size of the stats table.

Given that our road ahead necessarily includes adding more and more
monitoring and admin data to PostgreSQL (because our DBA users demand
it), I think that we should give some thought to the issue of storing
DBA stats in general. By DBA stats I mean statistics which aren't used
in query planning, but are used in monitoring, trending, and
troubleshooting. I'm thinking these ought to have their own relation or
relations.

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


From: Greg Smith <greg(at)2ndquadrant(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We need to log aborted autovacuums
Date: 2011-01-08 01:15:12
Message-ID: 4D27BAA0.5090804@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Josh Berkus wrote:
> It occurs to me that another way of diagnosis would simply be a way to
> cause the autovac daemon to spit out output we could camp on, *without*
> requiring the huge volumes of output also required for DEBUG3. This
> brings us back to the logging idea again.
>

Right. I don't know how much you looked at the little patch I threw out
there yet, but I think it's a reasonable 90% solution to the problem of
"how do I distinguish between a locking block and other reasons for AV
not running?" without generating a huge amount of log data. Since
there's no real overhead or code complexity added by it, it's a
relatively easy thing to throw in the codebase without annoying anyone.

> I really don't think that argument applies to either patch;
> last_autovac_attempt *or* the last_stats_reset time, since neither event
> is expected to occur frequently. If you have last_autovac_attempt (for
> example) being updated frequently, you clearly had a db problem bigger
> than the size of the stats table.
>

It just returns to the usual "why make everyone pay overhead for
something that only happens to a small percentage of users?"
argument.[1] I personally have all sorts of additional data I'd like to
instrument in myriad obtrusive spots of code. All I'm saying is that
this one in particular I don't quite feel strongly enough about to make
it the spot I try to establish that foothold at. If a stats tracking
patch for this appeared that seemed reasonably well written (i.e. it
tries to keep the added value NULL whenever it's not relevant), I'd be
in favor of it going in. I'm just not so excited about the idea that
I'm going to write it. I'm lukewarm to per-table last_stats_reset time
just because there's so few places I'd actually use it in practice,
relative to the guaranteed overhead it adds.

[1] Silly aside: I was thinking today that I should draw a chart of all
the common objections to code that show up here, looking like those maps
you see when walking into a mall. Then we can give a copy to new
submitters with a big "you are here" X marking where they have
inadvertently wandered onto.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books


From: David Fetter <david(at)fetter(dot)org>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We need to log aborted autovacuums
Date: 2011-01-08 06:05:20
Message-ID: 20110108060520.GA4277@fetter.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Jan 07, 2011 at 08:15:12PM -0500, Greg Smith wrote:
> [1] Silly aside: I was thinking today that I should draw a chart of
> all the common objections to code that show up here, looking like
> those maps you see when walking into a mall. Then we can give a
> copy to new submitters with a big "you are here" X marking where
> they have inadvertently wandered onto.

Actually, that'd make a great flow chart on a T-shirt :)

Cheers,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david(dot)fetter(at)gmail(dot)com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


From: Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>
To: David Fetter <david(at)fetter(dot)org>
Cc: Greg Smith <greg(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We need to log aborted autovacuums
Date: 2011-01-08 12:17:51
Message-ID: m2sjx3r3m8.fsf@2ndQuadrant.fr
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

David Fetter <david(at)fetter(dot)org> writes:
> On Fri, Jan 07, 2011 at 08:15:12PM -0500, Greg Smith wrote:
>> [1] Silly aside: I was thinking today that I should draw a chart of
>> all the common objections to code that show up here, looking like
>> those maps you see when walking into a mall. Then we can give a
>> copy to new submitters with a big "you are here" X marking where
>> they have inadvertently wandered onto.
>
> Actually, that'd make a great flow chart on a T-shirt :)

Yeah, here's some more visual inspiration to get that:

http://xkcd.com/844/

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support


From: Greg Smith <greg(at)2ndquadrant(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We need to log aborted autovacuums
Date: 2011-01-15 09:47:56
Message-ID: 4D316D4C.4000406@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Josh Berkus wrote:
> The lack of vacuum could be occurring for any of 4 reasons:
>
> 1) Locking
> 2) You have a lot of tables and not enough autovac_workers / too much
> sleep time
> 3) You need to autovac this particular table more frequently, since it
> gets dirtied really fast
> 4) The table has been set with special autovac settings which keep it
> from being autovac'd
>
> We can currently distinguish between cased 2, 3, 4 based on existing
> available facts. However, distinguishing case 1 from 2 or 3, in
> particular, isn't currently possible except by methods which require
> collecting a lot of ad-hoc monitoring data over a period of time. This
> makes the effort required for the diagnosis completely out of proportion
> with the magnitude of the problem.
>

Since no one coded up something more exciting, I just did a round of
self-review of the little logging patch I sent upthread to see if it was
a worthy CF candidate. I agree that *something* should be done, and I'd
rather have a logging-based solution than nothing at all here.

So that patch is, to use a polite UK term I've adopted recently,
rubbish. But unless I'm confused (it is late), I think there's a
problem here that's much worse than what you described--in the case
where someone has grabbed a really large lock on a table that needs
cleanup at least. I started with the following test case:

SHOW log_autovacuum_min_duration;
DROP TABLE t;
CREATE TABLE t(s serial,i integer);
INSERT INTO t(i) SELECT generate_series(1,100000);
SELECT relname,last_autovacuum,autovacuum_count FROM pg_stat_user_tables
WHERE relname='t';
DELETE FROM t WHERE s<50000;
\q

Which, so long as the min duration is 0, does mostly what I expected. I
followed immediately by starting a new session, doing:

BEGIN;
LOCK t;

Before autovacuum got to the table; I left this session open, stuck
there idle in a transaction holding a lock. I threw some extra logging
in the code path leading up to where we'd both like the lack of lock
acquisition to be logged, and it shows the following (at DEBUG3):

DEBUG: t: vac: 49999 (threshold 50), anl: 149999 (threshold 50)
DEBUG: autovacuum: t: vac needed
INFO: vacuum_rel: processing 16528
INFO: vacuum_rel: trying to open relation 16528

All but the first one of those lines are not in the current code, and as
noted already that one existing line is at DEBUG3. This was trying to
simulate the case you were complaining about: autovacuum has been
triggered, it knows there is work to be done, and it can't do said work.

It hasn't triggered the error message I tried to add yet through,
because it's not hitting that spot. Look where it's actually blocked at:

$ ps -eaf | grep postgres
gsmith 5490 4880 0 04:09 ? 00:00:00 postgres: autovacuum
worker process gsmith waiting

$ psql -c "select procpid,now() - query_start as runtime,current_query
from pg_stat_activity where current_query like 'autovacuum%'"
procpid | runtime | current_query
---------+-----------------+-------------------------------------
5490 | 00:11:35.813727 | autovacuum: VACUUM ANALYZE public.t

I haven't just failed to vacuum the table that needs it without any
entry in the logs saying as much. I've actually tied down an autovacuum
worker and kept it from doing anything else until that lock is
released! When vacuum_rel inside of vacuum.c does this on a relation it
wants to acquire a lock on:

onerel = try_relation_open(relid, lmode);

It just hangs out there forever, if the only issue is not being able to
get the lock it wants. The abort code path I tried to insert logging
into only occurs if the relation was deleted. Watch what's happened
while I've been typing:

procpid | runtime | current_query
---------+-----------------+-------------------------------------
5490 | 00:36:24.426528 | autovacuum: VACUUM ANALYZE public.t

I've now hit the AV worker with a denial of service attack, essentially,
for over 35 minutes. I'd bet that if I was a malicious user, I could
make AV come to a halt altogether for the whole server with this
approach, even with relatively limited rights on the database.

Does try_relation_open need to have a lock acquisition timeout when AV
is calling it? I don't see any other way to get a log mesage noting
when this problem passed by in there, you can only see it live in these
process view. And the current behavior doesn't feel right to me.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We need to log aborted autovacuums
Date: 2011-01-15 16:14:12
Message-ID: 29022.1295108052@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Greg Smith <greg(at)2ndquadrant(dot)com> writes:
> Does try_relation_open need to have a lock acquisition timeout when AV
> is calling it?

Hmm. I think when looking at the AV code, I've always subconsciously
assumed that try_relation_open would fail immediately if it couldn't get
the lock. That certainly seems like it would be a more appropriate way
to behave than delaying indefinitely.

In practice, the scenario you're worried about seems unlikely to persist
indefinitely: as soon as someone else comes along and blocks behind
autovacuum's request, the deadlock checker will kick AV off the lock
queue. But it would probably be better not to depend on that.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Greg Smith <greg(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We need to log aborted autovacuums
Date: 2011-01-15 19:29:54
Message-ID: AANLkTi=t_ay3Rq3LQW=N8VH9Tuj_npSufe88zLLChaaY@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Jan 15, 2011 at 11:14 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Greg Smith <greg(at)2ndquadrant(dot)com> writes:
>> Does try_relation_open need to have a lock acquisition timeout when AV
>> is calling it?
>
> Hmm.  I think when looking at the AV code, I've always subconsciously
> assumed that try_relation_open would fail immediately if it couldn't get
> the lock.  That certainly seems like it would be a more appropriate way
> to behave than delaying indefinitely.

I'm confused how that's not happening already. What does "try" mean, otherwise?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Greg Smith <greg(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We need to log aborted autovacuums
Date: 2011-01-16 16:23:16
Message-ID: 4D331B74.6000004@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas wrote:
> On Sat, Jan 15, 2011 at 11:14 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
>> Greg Smith <greg(at)2ndquadrant(dot)com> writes:
>>
>>> Does try_relation_open need to have a lock acquisition timeout when AV
>>> is calling it?
>>>
>> Hmm. I think when looking at the AV code, I've always subconsciously
>> assumed that try_relation_open would fail immediately if it couldn't get
>> the lock. That certainly seems like it would be a more appropriate way
>> to behave than delaying indefinitely.
>>
>
> I'm confused how that's not happening already. What does "try" mean, otherwise?
>

Apparently "try" means acquire the requested lock on the oid of the
relation, waiting for any amount of time for that part, and then check
if the relation really exists or not once it's got it. In this context,
it means it will try to open the relation, but might fail if it doesn't
actually exist anymore. The relation not existing once it tries the
check done after the lock is acquired is the only way it will return a
failure.

try_relation_open calls LockRelationOid, which blocks. There is also a
ConditionalLockRelationOid, which does the same basic thing except it
exits immediately, with a false return code, if it can't acquire the
lock. I think we just need to nail down in what existing cases it's
acceptable to have try_relation_oid use ConditionalLockRelationOid
instead, which would make it do what all us reading the code thought it
did all along.

There are four callers of try_relation_open to be concerned about here:

src/backend/commands/vacuum.c vacuum_rel
onerel = try_relation_open(relid, lmode);

src/backend/commands/analyze.c analyze_rel
onerel = try_relation_open(relid, ShareUpdateExclusiveLock);

src/backend/commands/cluster.c cluster_rel
OldHeap = try_relation_open(tableOid, AccessExclusiveLock);

src/backend/commands/lockcmds.c LockTableRecurse
* Apply LOCK TABLE recursively over an inheritance tree
rel = try_relation_open(reloid, NoLock);

I think that both the vacuum_rel and analyze_rel cases are capable of
figuring out if they are the autovacuum process, and if so calling the
fast non-blocking version of this. I wouldn't want to mess with the
other two, which rely upon the current behavior as far as I can see.

Probably took me longer to write this e-mail than the patch will take.
Since I've got trivial patch fever this weekend and already have the
test case, I'll do this one next.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We need to log aborted autovacuums
Date: 2011-01-16 16:47:35
Message-ID: 8629.1295196455@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Greg Smith <greg(at)2ndquadrant(dot)com> writes:
> try_relation_open calls LockRelationOid, which blocks. There is also a
> ConditionalLockRelationOid, which does the same basic thing except it
> exits immediately, with a false return code, if it can't acquire the
> lock. I think we just need to nail down in what existing cases it's
> acceptable to have try_relation_oid use ConditionalLockRelationOid
> instead, which would make it do what all us reading the code thought it
> did all along.

No, I don't believe we should be messing with the semantics of
try_relation_open. It is what it is.

The right way to fix this is similar to what LockTableRecurse does,
ie call ConditionalLockRelationOid itself. I tried changing vacuum_rel
that way yesterday, but the idea crashed when I realized that vacuum_rel
doesn't have the name of the target relation, only its OID, so it can't
log any very useful message ... and according to the original point of
this thread, we're surely going to want an elog(LOG) when we can't get
the lock.

I think the best thing to do is probably to have autovacuum.c do the
ConditionalLockRelationOid call before entering vacuum.c (making the
later acquisition of the lock by try_relation_open redundant, but it
will be cheap enough to not matter). But I haven't chased the details.

regards, tom lane


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Greg Smith <greg(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We need to log aborted autovacuums
Date: 2011-01-16 17:18:22
Message-ID: 1295198302.3282.1308.camel@ebony
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, 2011-01-16 at 11:47 -0500, Tom Lane wrote:
> Greg Smith <greg(at)2ndquadrant(dot)com> writes:
> > try_relation_open calls LockRelationOid, which blocks. There is also a
> > ConditionalLockRelationOid, which does the same basic thing except it
> > exits immediately, with a false return code, if it can't acquire the
> > lock. I think we just need to nail down in what existing cases it's
> > acceptable to have try_relation_oid use ConditionalLockRelationOid
> > instead, which would make it do what all us reading the code thought it
> > did all along.
>
> No, I don't believe we should be messing with the semantics of
> try_relation_open. It is what it is.
>
> The right way to fix this is similar to what LockTableRecurse does,
> ie call ConditionalLockRelationOid itself. I tried changing vacuum_rel
> that way yesterday, but the idea crashed when I realized that vacuum_rel
> doesn't have the name of the target relation, only its OID, so it can't
> log any very useful message ... and according to the original point of
> this thread, we're surely going to want an elog(LOG) when we can't get
> the lock.
>
> I think the best thing to do is probably to have autovacuum.c do the
> ConditionalLockRelationOid call before entering vacuum.c (making the
> later acquisition of the lock by try_relation_open redundant, but it
> will be cheap enough to not matter). But I haven't chased the details.

I'm fairly confused by this thread.

We *do* emit a message when we cancel an autovacuum task. We went to a
lot of trouble to do that. The message is DEBUG2, and says
"sending cancel to blocking autovacuum pid =".

We just need to make that LOG, not DEBUG2.

The autovacuum itself then says "canceling autovacuum task" when
canceled. It doesn't say what table the autovacuum was running on when
cancelled, but that seems like an easy thing to add since the AV does
know that.

I can't see any reason to differentiate between manually canceled AVs
and automatically canceled AVs. It's all the same thing.

Not really sure what it is you're talking about above or how that
relates to log messages for AV.

--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Greg Smith <greg(at)2ndQuadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We need to log aborted autovacuums
Date: 2011-01-16 18:08:03
Message-ID: 9874.1295201283@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> I'm fairly confused by this thread.

> We *do* emit a message when we cancel an autovacuum task. We went to a
> lot of trouble to do that. The message is DEBUG2, and says
> "sending cancel to blocking autovacuum pid =".

That doesn't necessarily match one-to-one with actual cancellations,
nor does it cover the case Greg is on about at the moment of an AV
worker being blocked indefinitely because it can't get the table
lock in the first place.

It might be an adequate substitute, but on the whole I agree with
the idea that it'd be better to have autovacuum log when it actually
cancels an operation, not when someone tries to cancel one.

regards, tom lane


From: Greg Smith <greg(at)2ndquadrant(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We need to log aborted autovacuums
Date: 2011-01-16 18:08:36
Message-ID: 4D333424.1020104@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Simon Riggs wrote:
> I'm fairly confused by this thread.
>

That's becuase you think it has something to do with cancellation, which
it doesn't. The original report here noted a real problem but got the
theorized cause wrong. It turns out the code that acquires a lock when
autovacuum decides it is going to process something will wait forever
for that lock to be obtained. It cannot be the case that other locks on
the table are causing it to cancel, or as you say it would be visible in
the logs. Instead the AV worker will just queue up and wait for its
turn as long as it takes.

That does mean there's all sorts of ways that your AV workers can all
get stuck where they are waiting for a table, and there's no way to know
when that's happening either from the logs; you'll only see it in ps or
pg_stat_activity. Given that I think it's actually a mild denial of
service attack vector, this really needs an improvement.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We need to log aborted autovacuums
Date: 2011-01-16 18:20:54
Message-ID: 10073.1295202054@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Greg Smith <greg(at)2ndquadrant(dot)com> writes:
> Simon Riggs wrote:
>> I'm fairly confused by this thread.

> That's becuase you think it has something to do with cancellation, which
> it doesn't. The original report here noted a real problem but got the
> theorized cause wrong.

I think that cancellations are also a potentially important issue that
could do with being logged. The issue that I see is that if an
application has a use-pattern that involves obtaining exclusive lock on
a large table fairly frequently, then AV will never be able to complete
on that table, leading to bloat and eventual XID wraparound issues.
Once we get scared about XID wraparound, AV will refuse to let itself
be canceled, so it will prevent the wraparound ... at the cost of
denying service to the application code path that wants the lock.
So this is the sort of thing that it'd be good to have some bleating
about in the log, giving the DBA a chance to rethink the app's locking
behavior before the consequences get nasty.

But, having said that, false alarms in the log are not nice. So I'd
rather have the LOG messages coming out from actual transaction aborts
in AV, not from the remote end of the signal.

regards, tom lane


From: Greg Smith <greg(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We need to log aborted autovacuums
Date: 2011-01-16 18:22:38
Message-ID: 4D33376E.2060909@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
> No, I don't believe we should be messing with the semantics of
> try_relation_open. It is what it is.
>

With only four pretty simple callers to the thing, and two of them
needing the alternate behavior, it seemed a reasonable place to modify
to me. I thought the "nowait" boolean idea was in enough places that it
was reasonable to attach to try_relation_open.

Attached patch solves the "wait for lock forever" problem, and
introduces a new log message when AV or auto-analyze fail to obtain a
lock on something that needs to be cleaned up:

DEBUG: autovacuum: processing database "gsmith"
INFO: autovacuum skipping relation 65563 --- cannot open or obtain lock
INFO: autoanalyze skipping relation 65563 --- cannot open or obtain lock

My main concern is that this may cause AV to constantly fail to get
access to a busy table, where in the current code it would queue up and
eventually get the lock needed. A secondary issue is that while the
autovacuum messages only show up if you have log_autovacuum_min_duration
set to not -1, the autoanalyze ones can't be stopped.

If you don't like the way I structured the code, you can certainly do it
some other way instead. I thought this approach was really simple and
not unlike similar code elsewhere.

Here's the test case that worked for me here again:

psql
SHOW log_autovacuum_min_duration;
DROP TABLE t;
CREATE TABLE t(s serial,i integer);
INSERT INTO t(i) SELECT generate_series(1,100000);
SELECT relname,last_autovacuum,autovacuum_count FROM pg_stat_user_tables
WHERE relname='t';
DELETE FROM t WHERE s<50000;
\q
psql
BEGIN;
LOCK t;

Leave that open, then go to anther session with old "tail -f" on the
logs to wait for the errors to show up.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books

Attachment Content-Type Size
av-lock-failure-v2.diff text/x-patch 7.1 KB

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We need to log aborted autovacuums
Date: 2011-01-16 18:26:38
Message-ID: 10205.1295202398@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Greg Smith <greg(at)2ndquadrant(dot)com> writes:
> Tom Lane wrote:
>> No, I don't believe we should be messing with the semantics of
>> try_relation_open. It is what it is.

> With only four pretty simple callers to the thing, and two of them
> needing the alternate behavior, it seemed a reasonable place to modify
> to me. I thought the "nowait" boolean idea was in enough places that it
> was reasonable to attach to try_relation_open.

I would be willing to do that if it actually fixed the problem, but it
doesn't. Logging only the table OID and not the table name is entirely
inadequate IMO, so the fix has to be at a different place anyway.

regards, tom lane


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We need to log aborted autovacuums
Date: 2011-01-16 19:19:47
Message-ID: 1295205587.3282.1441.camel@ebony
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, 2011-01-16 at 13:08 -0500, Greg Smith wrote:
> Simon Riggs wrote:
> > I'm fairly confused by this thread.
> >
>
> That's becuase you think it has something to do with cancellation, which
> it doesn't. The original report here noted a real problem but got the
> theorized cause wrong. It turns out the code that acquires a lock when
> autovacuum decides it is going to process something will wait forever
> for that lock to be obtained. It cannot be the case that other locks on
> the table are causing it to cancel, or as you say it would be visible in
> the logs. Instead the AV worker will just queue up and wait for its
> turn as long as it takes.

OK, thanks for explaining.

So my proposed solution is to set a statement_timeout on autovacuum
tasks, so that the AV does eventually get canceled, if the above
mentioned case occurs. We can scale that timeout to the size of the
table.

Now every issue is a cancelation and we can handle it the way I
suggested in my last post.

I would prefer it if we had a settable lock timeout, as suggested many
moons ago. When that was discussed before it was said there was no
difference between a statement timeout and a lock timeout, but I think
there clearly is, this case being just one example.

--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Greg Smith <greg(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We need to log aborted autovacuums
Date: 2011-01-16 20:50:08
Message-ID: 4D335A00.8080302@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 1/16/11 11:19 AM, Simon Riggs wrote:
> I would prefer it if we had a settable lock timeout, as suggested many
> moons ago. When that was discussed before it was said there was no
> difference between a statement timeout and a lock timeout, but I think
> there clearly is, this case being just one example.

Whatever happend to lock timeouts, anyway? We even had some patches
floating around for 9.0 and they disappeared.

However, we'd want a separate lock timeout for autovac, of course. I'm
not at all keen on a *statement* timeout on autovacuum; as long as
autovacuum is doing work, I don't want to cancel it. Also, WTF would we
set it to?

Going the statement timeout route seems like a way to create a LOT of
extra work, troubleshooting, getting it wrong, and releasing patch
updates. Please let's just create a lock timeout.

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Greg Smith <greg(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We need to log aborted autovacuums
Date: 2011-01-17 01:36:32
Message-ID: 1295228192.3282.1891.camel@ebony
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, 2011-01-16 at 12:50 -0800, Josh Berkus wrote:
> On 1/16/11 11:19 AM, Simon Riggs wrote:
> > I would prefer it if we had a settable lock timeout, as suggested many
> > moons ago. When that was discussed before it was said there was no
> > difference between a statement timeout and a lock timeout, but I think
> > there clearly is, this case being just one example.
>
> Whatever happend to lock timeouts, anyway? We even had some patches
> floating around for 9.0 and they disappeared.
>
> However, we'd want a separate lock timeout for autovac, of course. I'm
> not at all keen on a *statement* timeout on autovacuum; as long as
> autovacuum is doing work, I don't want to cancel it.

> Also, WTF would we
> set it to?

> Going the statement timeout route seems like a way to create a LOT of
> extra work, troubleshooting, getting it wrong, and releasing patch
> updates. Please let's just create a lock timeout.

I agree with you, but if we want it *this* release, on top of all the
other features we have queued, then I suggest we compromise. If you hold
out for more feature, you may get less.

Statement timeout = 2 * (100ms + autovacuum_vacuum_cost_delay) *
tablesize/(autovacuum_vacuum_cost_limit / vacuum_cost_page_dirty)

--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, Greg Smith <greg(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We need to log aborted autovacuums
Date: 2011-01-17 01:50:49
Message-ID: AANLkTi=BQ_u+uMhAeGPgJ1Q-xXpLbgvjaOOv0SOAXStV@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Jan 16, 2011 at 8:36 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> I agree with you, but if we want it *this* release, on top of all the
> other features we have queued, then I suggest we compromise. If you hold
> out for more feature, you may get less.
>
> Statement timeout = 2 * (100ms + autovacuum_vacuum_cost_delay) *
> tablesize/(autovacuum_vacuum_cost_limit / vacuum_cost_page_dirty)

I'm inclined to think that would be a very dangerous compromise.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Simon Riggs <simon(at)2ndQuadrant(dot)com>, Greg Smith <greg(at)2ndQuadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We need to log aborted autovacuums
Date: 2011-01-17 19:46:33
Message-ID: 19414.1295293593@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Josh Berkus <josh(at)agliodbs(dot)com> writes:
> However, we'd want a separate lock timeout for autovac, of course. I'm
> not at all keen on a *statement* timeout on autovacuum; as long as
> autovacuum is doing work, I don't want to cancel it. Also, WTF would we
> set it to?

Yeah --- in the presence of vacuum cost delay, in particular, a
statement timeout seems about useless for AV.

> Going the statement timeout route seems like a way to create a LOT of
> extra work, troubleshooting, getting it wrong, and releasing patch
> updates. Please let's just create a lock timeout.

Do we actually need a lock timeout either? The patch that was being
discussed just involved failing if you couldn't get it immediately.
I suspect that's sufficient for AV. At least, nobody's made a
compelling argument why we need to expend a very substantially larger
amount of work to do something different.

regards, tom lane


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, Greg Smith <greg(at)2ndQuadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We need to log aborted autovacuums
Date: 2011-01-17 20:24:30
Message-ID: 1295295870.3282.2520.camel@ebony
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, 2011-01-17 at 14:46 -0500, Tom Lane wrote:
> Josh Berkus <josh(at)agliodbs(dot)com> writes:
> > However, we'd want a separate lock timeout for autovac, of course. I'm
> > not at all keen on a *statement* timeout on autovacuum; as long as
> > autovacuum is doing work, I don't want to cancel it. Also, WTF would we
> > set it to?
>
> Yeah --- in the presence of vacuum cost delay, in particular, a
> statement timeout seems about useless for AV.
>
> > Going the statement timeout route seems like a way to create a LOT of
> > extra work, troubleshooting, getting it wrong, and releasing patch
> > updates. Please let's just create a lock timeout.
>
> Do we actually need a lock timeout either? The patch that was being
> discussed just involved failing if you couldn't get it immediately.
> I suspect that's sufficient for AV. At least, nobody's made a
> compelling argument why we need to expend a very substantially larger
> amount of work to do something different.

I have a patch to do this BUT the log message is harder. Until we grab
the lock we can't confirm the table still exists, so we can't get the
name for it. The whole point of this is logging the names of tables for
which we have failed to AV. Suggestions?

--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, Greg Smith <greg(at)2ndQuadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We need to log aborted autovacuums
Date: 2011-01-17 21:08:12
Message-ID: 21002.1295298492@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> On Mon, 2011-01-17 at 14:46 -0500, Tom Lane wrote:
>> Do we actually need a lock timeout either? The patch that was being
>> discussed just involved failing if you couldn't get it immediately.
>> I suspect that's sufficient for AV. At least, nobody's made a
>> compelling argument why we need to expend a very substantially larger
>> amount of work to do something different.

> I have a patch to do this BUT the log message is harder. Until we grab
> the lock we can't confirm the table still exists, so we can't get the
> name for it. The whole point of this is logging the names of tables for
> which we have failed to AV. Suggestions?

As I said before, the correct place to fix this is in autovacuum.c,
which has the table name near at hand. If you insist on fixing it
somewhere else it's going to be very invasive.

regards, tom lane


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: We need to log aborted autovacuums
Date: 2011-01-18 01:26:09
Message-ID: 4D34EC31.50707@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 1/17/11 11:46 AM, Tom Lane wrote:
> Do we actually need a lock timeout either? The patch that was being
> discussed just involved failing if you couldn't get it immediately.
> I suspect that's sufficient for AV. At least, nobody's made a
> compelling argument why we need to expend a very substantially larger
> amount of work to do something different.

The argument is that a sufficiently busy table might never get
autovacuumed *at all*, whereas a small lock wait would allow autovacuum
to block incoming transactions and start work.

However, it's hard for me to imagine a real-world situation where a
table would be under repeated full-table-locks from multiple
connections. Can anyone else?

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: We need to log aborted autovacuums
Date: 2011-01-18 02:12:13
Message-ID: AANLkTi=7uuNgVWamzJpC_Ma9GYDwpAvcp5uA0AibL1=Y@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Jan 17, 2011 at 8:26 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> On 1/17/11 11:46 AM, Tom Lane wrote:
>> Do we actually need a lock timeout either?  The patch that was being
>> discussed just involved failing if you couldn't get it immediately.
>> I suspect that's sufficient for AV.  At least, nobody's made a
>> compelling argument why we need to expend a very substantially larger
>> amount of work to do something different.
>
> The argument is that a sufficiently busy table might never get
> autovacuumed *at all*, whereas a small lock wait would allow autovacuum
> to block incoming transactions and start work.
>
> However, it's hard for me to imagine a real-world situation where a
> table would be under repeated full-table-locks from multiple
> connections.  Can anyone else?

I'm not convinced we need a lock timeout for autovacuum. I think it'd
be useful to expose on a user-level, but that's a different can of
worms.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: We need to log aborted autovacuums
Date: 2011-01-18 03:04:08
Message-ID: 27160.1295319848@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Josh Berkus <josh(at)agliodbs(dot)com> writes:
> On 1/17/11 11:46 AM, Tom Lane wrote:
>> Do we actually need a lock timeout either? The patch that was being
>> discussed just involved failing if you couldn't get it immediately.
>> I suspect that's sufficient for AV. At least, nobody's made a
>> compelling argument why we need to expend a very substantially larger
>> amount of work to do something different.

> The argument is that a sufficiently busy table might never get
> autovacuumed *at all*, whereas a small lock wait would allow autovacuum
> to block incoming transactions and start work.

> However, it's hard for me to imagine a real-world situation where a
> table would be under repeated full-table-locks from multiple
> connections. Can anyone else?

If that is happening, autovacuum is screwed anyway. Even if it manages
to acquire the lock, it will get kicked off by the next lock request
before it can finish vacuuming the table. So I don't see an argument
here for adding all the mechanism required to support that.

regards, tom lane


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: We need to log aborted autovacuums
Date: 2011-01-18 06:41:29
Message-ID: 1295332889.14057.1.camel@fsopti579.F-Secure.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On mån, 2011-01-17 at 17:26 -0800, Josh Berkus wrote:
> However, it's hard for me to imagine a real-world situation where a
> table would be under repeated full-table-locks from multiple
> connections. Can anyone else?

If you want to do assertion-type checks at the end of transactions in
the current fake-serializable mode, you need to lock the table.


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Greg Smith <greg(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We need to log aborted autovacuums
Date: 2011-01-31 02:37:51
Message-ID: AANLkTik-e_ttzsXThWg5Nb0J-m=PXbjFK_peGuay-nHi@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Jan 17, 2011 at 4:08 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
>> On Mon, 2011-01-17 at 14:46 -0500, Tom Lane wrote:
>>> Do we actually need a lock timeout either?  The patch that was being
>>> discussed just involved failing if you couldn't get it immediately.
>>> I suspect that's sufficient for AV.  At least, nobody's made a
>>> compelling argument why we need to expend a very substantially larger
>>> amount of work to do something different.
>
>> I have a patch to do this BUT the log message is harder. Until we grab
>> the lock we can't confirm the table still exists, so we can't get the
>> name for it. The whole point of this is logging the names of tables for
>> which we have failed to AV. Suggestions?
>
> As I said before, the correct place to fix this is in autovacuum.c,
> which has the table name near at hand.  If you insist on fixing it
> somewhere else it's going to be very invasive.

Unless I'm missing something, making autovacuum.c call
ConditionalLockRelationOid() is not going to work, because the vacuum
transaction isn't started until we get all the way down to
vacuum_rel(). But it seems not too hard to make vacuum_rel() and
analyze_rel() have the proper relation name - we just need to
manufacture a suitable RangeVar to pass via the VacuumStmt's
"relation" element.

See what you think of the attached... I think logging this message at
LOG unconditionally is probably log spam of the first order, but if
the basic concept is sound we can argue over what to do about that.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
autovac-nowait.patch text/x-diff 5.7 KB

From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Greg Smith <greg(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We need to log aborted autovacuums
Date: 2011-01-31 03:03:19
Message-ID: 1296442905-sup-442@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Excerpts from Robert Haas's message of dom ene 30 23:37:51 -0300 2011:

> Unless I'm missing something, making autovacuum.c call
> ConditionalLockRelationOid() is not going to work, because the vacuum
> transaction isn't started until we get all the way down to
> vacuum_rel().

Maybe we need ConditionalLockRelationOidForSession or something like
that?

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Greg Smith <greg(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We need to log aborted autovacuums
Date: 2011-01-31 03:26:01
Message-ID: AANLkTinOHDdf-DS6p_NGOuUsPawNqGKQOhUwfth3qYTD@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Jan 30, 2011 at 10:03 PM, Alvaro Herrera
<alvherre(at)commandprompt(dot)com> wrote:
> Excerpts from Robert Haas's message of dom ene 30 23:37:51 -0300 2011:
>
>> Unless I'm missing something, making autovacuum.c call
>> ConditionalLockRelationOid() is not going to work, because the vacuum
>> transaction isn't started until we get all the way down to
>> vacuum_rel().
>
> Maybe we need ConditionalLockRelationOidForSession or something like
> that?

That'd be another way to go, if there are objections to what I've
implemented here.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Greg Smith <greg(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We need to log aborted autovacuums
Date: 2011-02-04 03:49:20
Message-ID: AANLkTi=qkZK3ueJTz=s_6p6zkpoQYn=Xd-bE=TvdS2zV@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Jan 30, 2011 at 10:26 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Sun, Jan 30, 2011 at 10:03 PM, Alvaro Herrera
> <alvherre(at)commandprompt(dot)com> wrote:
>> Excerpts from Robert Haas's message of dom ene 30 23:37:51 -0300 2011:
>>
>>> Unless I'm missing something, making autovacuum.c call
>>> ConditionalLockRelationOid() is not going to work, because the vacuum
>>> transaction isn't started until we get all the way down to
>>> vacuum_rel().
>>
>> Maybe we need ConditionalLockRelationOidForSession or something like
>> that?
>
> That'd be another way to go, if there are objections to what I've
> implemented here.

Seeing as how there seem to be neither objections nor endorsements,
I'm inclined to commit what I proposed more or less as-is. There
remains the issue of what do about the log spam. Josh Berkus
suggested logging it when log_autovacuum_min_duration != -1, which
seems like a bit of an abuse of that setting, but it's certainly not
worth adding another setting for, and the alternative of logging it
at, say, DEBUG2 seems unappealing because you'll then have to turn on
logging for a lot of unrelated crap to get this information. So on
balance I think that proposal is perhaps the least of evils.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Greg Smith <greg(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We need to log aborted autovacuums
Date: 2011-02-04 12:08:58
Message-ID: AANLkTin5eUjs+Mv-EV3ZBhHkQhK0CMxvhsxaKQPLj3Nv@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Feb 4, 2011 at 12:59 AM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> Robert,
>
>> Seeing as how there seem to be neither objections nor endorsements,
>> I'm inclined to commit what I proposed more or less as-is.  There
>> remains the issue of what do about the log spam.  Josh Berkus
>> suggested logging it when log_autovacuum_min_duration != -1, which
>> seems like a bit of an abuse of that setting, but it's certainly not
>> worth adding another setting for, and the alternative of logging it
>> at, say, DEBUG2 seems unappealing because you'll then have to turn on
>> logging for a lot of unrelated crap to get this information.  So on
>> balance I think that proposal is perhaps the least of evils.
>
> Wait, which proposal is?

I meant to say that I think the "least evil" proposal is to log at LOG
when log_autovacuum_min_duration != -1.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Cédric Villemain <cedric(dot)villemain(dot)debian(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Greg Smith <greg(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We need to log aborted autovacuums
Date: 2011-02-05 17:54:28
Message-ID: AANLkTimo7HHYA2yy+ATBJFH8ZzSLL24T_PiZ6sZ_jBA2@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

2011/2/4 Robert Haas <robertmhaas(at)gmail(dot)com>:
> On Sun, Jan 30, 2011 at 10:26 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> On Sun, Jan 30, 2011 at 10:03 PM, Alvaro Herrera
>> <alvherre(at)commandprompt(dot)com> wrote:
>>> Excerpts from Robert Haas's message of dom ene 30 23:37:51 -0300 2011:
>>>
>>>> Unless I'm missing something, making autovacuum.c call
>>>> ConditionalLockRelationOid() is not going to work, because the vacuum
>>>> transaction isn't started until we get all the way down to
>>>> vacuum_rel().
>>>
>>> Maybe we need ConditionalLockRelationOidForSession or something like
>>> that?
>>
>> That'd be another way to go, if there are objections to what I've
>> implemented here.
>
> Seeing as how there seem to be neither objections nor endorsements,
> I'm inclined to commit what I proposed

what do you implement exactly ?
* The original request from Josh to get LOG when autovac can not run
because of locks
* VACOPT_NOWAIT, what is it ?

>more or less as-is.  There
> remains the issue of what do about the log spam.  Josh Berkus
> suggested logging it when log_autovacuum_min_duration != -1, which
> seems like a bit of an abuse of that setting, but it's certainly not
> worth adding another setting for, and the alternative of logging it
> at, say, DEBUG2 seems unappealing because you'll then have to turn on
> logging for a lot of unrelated crap to get this information.  So on
> balance I think that proposal is perhaps the least of evils.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>

--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Cédric Villemain <cedric(dot)villemain(dot)debian(at)gmail(dot)com>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Greg Smith <greg(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We need to log aborted autovacuums
Date: 2011-02-05 18:40:08
Message-ID: AANLkTikt-diWY8+mSvxw-_2_PtYv=VY2MN7-mu_wxA8V@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Feb 5, 2011 at 12:54 PM, Cédric Villemain
<cedric(dot)villemain(dot)debian(at)gmail(dot)com> wrote:
> what do you implement exactly ?
> * The original request from Josh to get LOG when autovac can not run
> because of locks
> * VACOPT_NOWAIT, what is it ?

What the patch implements is:

If autovacuum can't get the table lock immediately, it skips the
table. This leaves the table still appearing to need autovacuuming,
so autovacuum will keep retrying (every 1 minute, or whatever you have
autovacuum_naptime set to) until it gets the lock. This seems clearly
better than the historical behavior of blocking on the lock, which can
potentially keep other tables in the database from getting vacuumed.

In the case where a table is skipped for this reason, we log a message
at log level LOG. The version of the patch I posted does that
unconditionally, but my intention was to change it before commit so
that it only logs the message if log_autovacuum_min_duration is
something other than -1.

Regular vacuum behaves as before - it waits for each table lock
individually. We could expose a NOWAIT option at the SQL level as
well, so that someone could do VACOPT (NOWAIT), if that's something
people want.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Cédric Villemain <cedric(dot)villemain(dot)debian(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Greg Smith <greg(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We need to log aborted autovacuums
Date: 2011-02-05 20:20:57
Message-ID: AANLkTikbvVr+2V7zVN=cUD0Zz_TX0govbTzGrUoFUPTu@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

2011/2/5 Robert Haas <robertmhaas(at)gmail(dot)com>:
> On Sat, Feb 5, 2011 at 12:54 PM, Cédric Villemain
> <cedric(dot)villemain(dot)debian(at)gmail(dot)com> wrote:
>> what do you implement exactly ?
>> * The original request from Josh to get LOG when autovac can not run
>> because of locks
>> * VACOPT_NOWAIT, what is it ?
>
> What the patch implements is:
>
> If autovacuum can't get the table lock immediately, it skips the
> table.  This leaves the table still appearing to need autovacuuming,
> so autovacuum will keep retrying (every 1 minute, or whatever you have
> autovacuum_naptime set to) until it gets the lock.  This seems clearly
> better than the historical behavior of blocking on the lock, which can
> potentially keep other tables in the database from getting vacuumed.

great :)

>
> In the case where a table is skipped for this reason, we log a message
> at log level LOG.  The version of the patch I posted does that
> unconditionally, but my intention was to change it before commit so
> that it only logs the message if log_autovacuum_min_duration is
> something other than -1.

yes looks better, I also note that someone suggest to not add a GUC for that.
I am unsure about that, and adding a parameter for that does not hurt me at all:
reducing number of parameters for memory/performance/... is fine
(well, it is very good when we can), but to improve the log output I
think it is ok to add more without much trouble.

>
> Regular vacuum behaves as before - it waits for each table lock
> individually.  We could expose a NOWAIT option at the SQL level as
> well, so that someone could do VACOPT (NOWAIT), if that's something
> people want.

Might be useful, yes.

--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Cédric Villemain <cedric(dot)villemain(dot)debian(at)gmail(dot)com>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Greg Smith <greg(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We need to log aborted autovacuums
Date: 2011-02-05 20:25:19
Message-ID: AANLkTinqqHd-iOvxyPFBZkHDG7yb_4aZ846hDxgcrEeP@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Feb 5, 2011 at 3:20 PM, Cédric Villemain
<cedric(dot)villemain(dot)debian(at)gmail(dot)com> wrote:
>> In the case where a table is skipped for this reason, we log a message
>> at log level LOG.  The version of the patch I posted does that
>> unconditionally, but my intention was to change it before commit so
>> that it only logs the message if log_autovacuum_min_duration is
>> something other than -1.
>
> yes looks better, I also note that someone suggest to not add a GUC for that.

I think a GUC just to control this one message is overkill. Chances
are that if you're logging some or all of your autovacuum runs you'll
want to have this too, or at least it won't be a major problem to get
it anyway. If for some reason you want ONLY this message, you can
effectively get that behavior too, but setting
log_autovacuum_min_duration to an enormous value. So I can't really
imagine why you'd need a separate knob just for this one thing.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Cédric Villemain <cedric(dot)villemain(dot)debian(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Greg Smith <greg(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We need to log aborted autovacuums
Date: 2011-02-05 20:34:15
Message-ID: AANLkTimiHq4QUZnVPS0AayeJ=RaDThoyFQA=w1R8pZaD@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

2011/2/5 Robert Haas <robertmhaas(at)gmail(dot)com>:
> On Sat, Feb 5, 2011 at 3:20 PM, Cédric Villemain
> <cedric(dot)villemain(dot)debian(at)gmail(dot)com> wrote:
>>> In the case where a table is skipped for this reason, we log a message
>>> at log level LOG.  The version of the patch I posted does that
>>> unconditionally, but my intention was to change it before commit so
>>> that it only logs the message if log_autovacuum_min_duration is
>>> something other than -1.
>>
>> yes looks better, I also note that someone suggest to not add a GUC for that.
>
> I think a GUC just to control this one message is overkill.  Chances
> are that if you're logging some or all of your autovacuum runs you'll
> want to have this too, or at least it won't be a major problem to get
> it anyway.  If for some reason you want ONLY this message, you can
> effectively get that behavior too, but setting
> log_autovacuum_min_duration to an enormous value.  So I can't really
> imagine why you'd need a separate knob just for this one thing.

to not output it when I want log_autovac_min_duration = 0 but I know
that some tables do have locks contention and may loose maintenance
for one hour or so.

Anyway, without GUC is fine too as it won't fill the /var/log itself !
I am just not opposed to have new GUC in those areas (log && debug).

--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Cédric Villemain <cedric(dot)villemain(dot)debian(at)gmail(dot)com>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Greg Smith <greg(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We need to log aborted autovacuums
Date: 2011-02-08 03:05:21
Message-ID: AANLkTint13fVe5Qmj7jDpVysX=VOtfzimqRpxXGc3wK9@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Feb 5, 2011 at 3:34 PM, Cédric Villemain
<cedric(dot)villemain(dot)debian(at)gmail(dot)com> wrote:
> Anyway, without GUC is fine too as it won't fill the /var/log itself !
> I am just not opposed to have new GUC in those areas (log && debug).

OK. Committed without a new GUC, at least for now.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company