Re: We need to log aborted autovacuums

From: Greg Smith <greg(at)2ndquadrant(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We need to log aborted autovacuums
Date: 2011-01-15 09:47:56
Message-ID: 4D316D4C.4000406@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Josh Berkus wrote:
> The lack of vacuum could be occurring for any of 4 reasons:
>
> 1) Locking
> 2) You have a lot of tables and not enough autovac_workers / too much
> sleep time
> 3) You need to autovac this particular table more frequently, since it
> gets dirtied really fast
> 4) The table has been set with special autovac settings which keep it
> from being autovac'd
>
> We can currently distinguish between cased 2, 3, 4 based on existing
> available facts. However, distinguishing case 1 from 2 or 3, in
> particular, isn't currently possible except by methods which require
> collecting a lot of ad-hoc monitoring data over a period of time. This
> makes the effort required for the diagnosis completely out of proportion
> with the magnitude of the problem.
>

Since no one coded up something more exciting, I just did a round of
self-review of the little logging patch I sent upthread to see if it was
a worthy CF candidate. I agree that *something* should be done, and I'd
rather have a logging-based solution than nothing at all here.

So that patch is, to use a polite UK term I've adopted recently,
rubbish. But unless I'm confused (it is late), I think there's a
problem here that's much worse than what you described--in the case
where someone has grabbed a really large lock on a table that needs
cleanup at least. I started with the following test case:

SHOW log_autovacuum_min_duration;
DROP TABLE t;
CREATE TABLE t(s serial,i integer);
INSERT INTO t(i) SELECT generate_series(1,100000);
SELECT relname,last_autovacuum,autovacuum_count FROM pg_stat_user_tables
WHERE relname='t';
DELETE FROM t WHERE s<50000;
\q

Which, so long as the min duration is 0, does mostly what I expected. I
followed immediately by starting a new session, doing:

BEGIN;
LOCK t;

Before autovacuum got to the table; I left this session open, stuck
there idle in a transaction holding a lock. I threw some extra logging
in the code path leading up to where we'd both like the lack of lock
acquisition to be logged, and it shows the following (at DEBUG3):

DEBUG: t: vac: 49999 (threshold 50), anl: 149999 (threshold 50)
DEBUG: autovacuum: t: vac needed
INFO: vacuum_rel: processing 16528
INFO: vacuum_rel: trying to open relation 16528

All but the first one of those lines are not in the current code, and as
noted already that one existing line is at DEBUG3. This was trying to
simulate the case you were complaining about: autovacuum has been
triggered, it knows there is work to be done, and it can't do said work.

It hasn't triggered the error message I tried to add yet through,
because it's not hitting that spot. Look where it's actually blocked at:

$ ps -eaf | grep postgres
gsmith 5490 4880 0 04:09 ? 00:00:00 postgres: autovacuum
worker process gsmith waiting

$ psql -c "select procpid,now() - query_start as runtime,current_query
from pg_stat_activity where current_query like 'autovacuum%'"
procpid | runtime | current_query
---------+-----------------+-------------------------------------
5490 | 00:11:35.813727 | autovacuum: VACUUM ANALYZE public.t

I haven't just failed to vacuum the table that needs it without any
entry in the logs saying as much. I've actually tied down an autovacuum
worker and kept it from doing anything else until that lock is
released! When vacuum_rel inside of vacuum.c does this on a relation it
wants to acquire a lock on:

onerel = try_relation_open(relid, lmode);

It just hangs out there forever, if the only issue is not being able to
get the lock it wants. The abort code path I tried to insert logging
into only occurs if the relation was deleted. Watch what's happened
while I've been typing:

procpid | runtime | current_query
---------+-----------------+-------------------------------------
5490 | 00:36:24.426528 | autovacuum: VACUUM ANALYZE public.t

I've now hit the AV worker with a denial of service attack, essentially,
for over 35 minutes. I'd bet that if I was a malicious user, I could
make AV come to a halt altogether for the whole server with this
approach, even with relatively limited rights on the database.

Does try_relation_open need to have a lock acquisition timeout when AV
is calling it? I don't see any other way to get a log mesage noting
when this problem passed by in there, you can only see it live in these
process view. And the current behavior doesn't feel right to me.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Greg Smith 2011-01-15 10:47:24 Re: Spread checkpoint sync
Previous Message Fujii Masao 2011-01-15 08:02:30 Re: auto-sizing wal_buffers