Re: autovacuum truncate exclusive lock round two

From: "Kevin Grittner" <kgrittn(at)mail(dot)com>
To: "Robert Haas" <robertmhaas(at)gmail(dot)com>,"Jan Wieck" <JanWieck(at)yahoo(dot)com>
Cc: "Alvaro Herrera" <alvherre(at)2ndquadrant(dot)com>,"Amit Kapila" <amit(dot)kapila(at)huawei(dot)com>, "Stephen Frost" <sfrost(at)snowman(dot)net>, "PostgreSQL Development" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: autovacuum truncate exclusive lock round two
Date: 2012-12-05 16:24:29
Message-ID: 20121205162430.142850@gmx.com
Lists: pgsql-hackers

Robert Haas wrote:

> Since people *already* raise deadlock_timeout to obscenely high
> values (a minute? an hour???) and then complain that things blow
> up in their face, I think there's a decent argument to be made
> that piggybacking anything else on that setting is unwise.

If people are really doing that, then I tend to agree. I wasn't
aware of that practice.

> Against that, FWICT, this problem only affects a small number of
> users: Jan is the only person I can ever remember reporting this
> issue. I'm not dumb enough to think he's the only person who it
> affects; but my current belief is that it's not an enormously
> common problem. So the main argument I can see against adding a
> GUC is that the problem is too marginal to justify a setting of
> its own. What I really see as the key issue is: suppose we
> hardcode this to say 2 seconds. Is that going to fix the problem
> effectively for 99% of the people who have this problem, or for
> 25% of the people who have this problem? In the former case, we
> probably don't need a GUC; in the latter case, we probably do.

Given the fact that autovacuum will keep throwing workers at it to
essentially loop indefinitely at an outer level, I don't think the
exact setting of this interval is all that critical either. My gut
feel is that anything in the 2 second to 5 second range would be
sane, so I won't argue over any explicit setting within that range.
Below that, I think the overhead of autovacuum coming back to the
table repeatedly would probably start to get too high; above that
we could be causing some small, heavily-updated table to be
neglected by autovacuum -- especially if you get multiple
autovacuum workers tied up in this delay on different tables at the
same time.

Regarding how many people are affected, I have seen several reports
of situations where users claim massive impact on performance when
autovacuum kicks in. The reports have not included enough detail to
quantify the impact or in most cases to establish a cause, but this
seems like it could have a noticeable impact, especially if the
deadlock timeout was set to more than a second.

-Kevin


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Kevin Grittner <kgrittn(at)mail(dot)com>
Cc: Jan Wieck <JanWieck(at)yahoo(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila(at)huawei(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: autovacuum truncate exclusive lock round two
Date: 2012-12-05 19:00:45
Message-ID: CA+TgmobD6XOw1nZSdbFYVGWa=NzAQMuXO-4qQ2yxodwFohE+bA@mail.gmail.com
Lists: pgsql-hackers

On Wed, Dec 5, 2012 at 11:24 AM, Kevin Grittner <kgrittn(at)mail(dot)com> wrote:
> Robert Haas wrote:
>> Since people *already* raise deadlock_timeout to obscenely high
>> values (a minute? an hour???) and then complain that things blow
>> up in their face, I think there's a decent argument to be made
>> that piggybacking anything else on that setting is unwise.
>
> If people are really doing that, then I tend to agree. I wasn't
> aware of that practice.

It's probably not quite common enough to be called a "practice", but I
have encountered it a number of times in support situations. Alas, I
no longer remember the details of exactly what misery it caused, but I
do remember it wasn't good. :-)

>> Against that, FWICT, this problem only affects a small number of
>> users: Jan is the only person I can ever remember reporting this
>> issue. I'm not dumb enough to think he's the only person who it
>> affects; but my current belief is that it's not an enormously
>> common problem. So the main argument I can see against adding a
>> GUC is that the problem is too marginal to justify a setting of
>> its own. What I really see as the key issue is: suppose we
>> hardcode this to say 2 seconds. Is that going to fix the problem
>> effectively for 99% of the people who have this problem, or for
>> 25% of the people who have this problem? In the former case, we
>> probably don't need a GUC; in the latter case, we probably do.
>
> Given the fact that autovacuum will keep throwing workers at it to
> essentially loop indefinitely at an outer level, I don't think the
> exact setting of this interval is all that critical either. My gut
> feel is that anything in the 2 second to 5 second range would be
> sane, so I won't argue over any explicit setting within that range.
> Below that, I think the overhead of autovacuum coming back to the
> table repeatedly would probably start to get too high; above that
> we could be causing some small, heavily-updated table to be
> neglected by autovacuum -- especially if you get multiple
> autovacuum workers tied up in this delay on different tables at the
> same time.

I think that part of what's tricky here is that the dynamics of this
problem depend heavily on table size. I handled one support case
where lowering autovacuum_naptime to 15s was an indispensable part of
the solution, so in that case having an autovacuum worker retry for
more than a few seconds sounds kind of insane. OTOH, that case
involved a small, rapidly changing table. If you've got an enormous
table where vacuum takes an hour to chug through all of it, abandoning
the effort to truncate the table after a handful of seconds might
sound equally insane.

Maybe it'd be sensible to relate the retry time to the time spent
vacuuming the table. Say, if the amount of time spent retrying
exceeds 10% of the time spent vacuuming the table, with a minimum of
1s and a maximum of 1min, give up. That way, big tables will get a
little more leeway than small tables, which is probably appropriate.
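
Just to illustrate what I mean, here is a minimal sketch of that
clamping rule (the helper name and shape are made up for illustration,
not anything in the patch):

#include <stdint.h>

/* Hypothetical helper: derive a truncate-retry budget from the time
 * already spent vacuuming -- 10% of it, clamped to 1s..60s. */
static int64_t
truncate_retry_budget_ms(int64_t vacuum_elapsed_ms)
{
    int64_t budget = vacuum_elapsed_ms / 10;    /* 10% of vacuum time */

    if (budget < 1000)
        budget = 1000;          /* retry for at least one second */
    if (budget > 60000)
        budget = 60000;         /* but never for more than a minute */
    return budget;
}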

> Regarding how many people are affected, I have seen several reports
> of situations where users claim massive impact on performance when
> autovacuum kicks in. The reports have not included enough detail to
> quantify the impact or in most cases to establish a cause, but this
> seems like it could have a noticeable impact, especially if the
> deadlock timeout was set to more than a second.

Yeah, I agree this could be a cause of those types of reports, but I
don't have any concrete evidence that any of the cases I've worked
were actually due to this specific issue. The most recent case of
this type I worked on was due to I/O saturation - which, since it
happened to be EC2, really meant network saturation.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Kevin Grittner <kgrittn(at)mail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila(at)huawei(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: autovacuum truncate exclusive lock round two
Date: 2012-12-06 03:16:23
Message-ID: 50C00E07.3070601@Yahoo.com
Lists: pgsql-hackers

On 12/5/2012 2:00 PM, Robert Haas wrote:
> Maybe it'd be sensible to relate the retry time to the time spent
> vacuuming the table. Say, if the amount of time spent retrying
> exceeds 10% of the time spent vacuuming the table, with a minimum of
> 1s and a maximum of 1min, give up. That way, big tables will get a
> little more leeway than small tables, which is probably appropriate.

That sort of "dynamic" approach would indeed be interesting. But I fear
that it is going to be complex at best. The amount of time spent in
scanning heavily depends on the visibility map. The initial vacuum scan
of a table can take hours or more, but it does update the visibility map
even if the vacuum itself is aborted later. The next vacuum may scan
that table in almost no time at all, because it skips all blocks that
are marked "all visible".

So the total time the "scan" takes is no yardstick I'd use.

Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Jan Wieck <JanWieck(at)yahoo(dot)com>
Cc: Kevin Grittner <kgrittn(at)mail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila(at)huawei(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: autovacuum truncate exclusive lock round two
Date: 2012-12-06 17:45:26
Message-ID: CA+TgmoZHXQGW6O-HZyaLj17rJazmRhv68WZ7QdEc7MnYwQCfpQ@mail.gmail.com
Lists: pgsql-hackers

On Wed, Dec 5, 2012 at 10:16 PM, Jan Wieck <JanWieck(at)yahoo(dot)com> wrote:
> On 12/5/2012 2:00 PM, Robert Haas wrote:
>>
>> Maybe it'd be sensible to relate the retry time to the time spent
>> vacuuming the table. Say, if the amount of time spent retrying
>> exceeds 10% of the time spent vacuuming the table, with a minimum of
>> 1s and a maximum of 1min, give up. That way, big tables will get a
>> little more leeway than small tables, which is probably appropriate.
>
> That sort of "dynamic" approach would indeed be interesting. But I fear that
> it is going to be complex at best. The amount of time spent in scanning
> heavily depends on the visibility map. The initial vacuum scan of a table
> can take hours or more, but it does update the visibility map even if the
> vacuum itself is aborted later. The next vacuum may scan that table in
> almost no time at all, because it skips all blocks that are marked "all
> visible".

Well, if that's true, then there's little reason to worry about giving
up quickly, because the next autovacuum a minute later won't consume
many resources.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Kevin Grittner <kgrittn(at)mail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila(at)huawei(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: autovacuum truncate exclusive lock round two
Date: 2012-12-06 18:34:08
Message-ID: 50C0E520.6020802@Yahoo.com
Lists: pgsql-hackers

Kevin and Robert are well aware of most of the below. I just want to put
this out here so other people, who haven't followed the discussion too
closely, may chime in.

Some details on the problem:

First of all, the vacuum scan must detect a minimum of 1000 pages as
possibly being empty at the end of a relation. Without at least 8MB of
possible free space at the end, the code never calls
lazy_truncate_heap(). This means we don't have to worry about tiny
relations at all. Any relation that stays under 8MB of turnover between
autovacuum VACUUM runs can never get into this situation.
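
Expressed as code, the gate is roughly of this shape (a simplified,
illustrative sketch; the helper name is made up and this is not the
actual vacuum code):

#include <stdbool.h>

#define REL_TRUNCATE_MINIMUM    1000    /* pages, ~8MB with 8kB blocks */

/* Simplified sketch of the gate described above: truncation is only
 * considered when at least this many trailing pages look empty. */
static bool
worth_trying_truncate(long rel_pages, long nonempty_pages)
{
    long    possibly_freeable = rel_pages - nonempty_pages;

    return possibly_freeable >= REL_TRUNCATE_MINIMUM;
}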

Relations that have higher turnover than that, but at random places or
with a high percentage of rather static rows, don't fall into the
problem category either. They may never accumulate that much "contiguous
free space at the end". The turnover will be reusing free space all over
the place. So again, lazy_truncate_heap() won't be called ever.

Relations that eventually build up more than 8MB of free space at the
end aren't automatically a problem. The autovacuum VACUUM scan just
scanned those pages at the end, which means that the safety scan for
truncate, done under exclusive lock, is checking exactly those pages at
the end and most likely they are still in memory. The truncate safety
scan will be fast due to a 99+% buffer cache hit rate.

The only actual problem cases (that I have found so far) are rolling
window tables of significant size that can bloat to multiple times
their normal size every now and then. This is indeed a rare corner case
and I have no idea how many users may (unknowingly) be suffering from
it.

This rare corner case triggers lazy_truncate_heap() with a significant
amount of free space to truncate. The table bloats, then all the bloat
is deleted, and the periodic 100% turnover guarantees that all "live"
tuples will shortly afterwards circulate in lower block numbers again,
with gigabytes of empty space at the end.

This by itself still isn't a problem. The existing code may do the job
just fine "unless" there is "frequent" access to that very table. Only
with this special combination of circumstances do we actually have a
problem.

Only now, with a significant amount of free space at the end and
frequent access to the table, does the truncate safety scan take long
enough, and actually have to read pages from disk, to interfere with
client transactions.

At this point, the truncate safety scan may have to be interrupted to
let the other, frequent traffic go through. This is what we accomplish
with the autovacuum_truncate_lock_check interval, where we voluntarily
release the lock whenever someone else needs it. I agree with Kevin
that a 20ms check interval is reasonable, because the code to check for
this is even less expensive than releasing the exclusive lock we're
holding.
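
In shape, that scan-and-check amounts to something like the following
(a sketch with made-up callback names, not the code from the patch):

#include <stdbool.h>
#include <stdint.h>

#define TRUNCATE_LOCK_CHECK_INTERVAL_MS 20  /* how often to look for waiters */

/* Sketch of the truncate safety scan: walk the trailing pages backwards
 * under the exclusive lock, but roughly every 20ms check whether someone
 * is queued behind us; if so, stop so the lock can be released. */
static bool
truncate_safety_scan(int64_t npages,
                     bool (*page_is_empty)(int64_t pageno),
                     bool (*lock_has_waiters)(void),
                     int64_t (*clock_ms)(void))
{
    int64_t last_check = clock_ms();

    for (int64_t page = npages - 1; page >= 0; page--)
    {
        if (!page_is_empty(page))
            return true;        /* hit live data; truncate down to here */

        if (clock_ms() - last_check >= TRUNCATE_LOCK_CHECK_INTERVAL_MS)
        {
            if (lock_has_waiters())
                return false;   /* someone is blocked on us: back off */
            last_check = clock_ms();
        }
    }
    return true;                /* the whole tail is empty */
}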

At the same time, completely giving up and relying on the autovacuum
launcher to start another worker isn't as free as it looks either. The
next autovacuum worker will have to do the VACUUM scan first, before
getting to the truncate phase. We cannot just skip blindly to the
truncate code: if we did that after every aborted truncate, the table
would deteriorate and accumulate dead tuples again. The removal of dead
tuples and their index tuples has priority.

As said earlier in the discussion, the VACUUM scan will skip pages
that are marked as all visible. So the scan won't physically read the
majority of the empty pages at the end of the table over and over. But
it will at least scan all pages that have been modified since the last
VACUUM run.

To me this means that we want to be more generous in giving the
truncate code time to acquire the exclusive lock. In my tests, I've
seen that a rolling window table with a "live" set of just 10 MB or so,
but empty space of 3 GB, can still have a 2 minute VACUUM scan time.
Throwing that work away because we can't acquire the exclusive lock
within 2 seconds is a waste of effort.

Something in between 2-60 seconds sounds more reasonable to me.

Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Kevin Grittner <kgrittn(at)mail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila(at)huawei(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: autovacuum truncate exclusive lock round two
Date: 2012-12-08 22:30:39
Message-ID: 50C3BF8F.8000405@Yahoo.com
Lists: pgsql-hackers

On 12/6/2012 12:45 PM, Robert Haas wrote:
> On Wed, Dec 5, 2012 at 10:16 PM, Jan Wieck <JanWieck(at)yahoo(dot)com> wrote:
>> That sort of "dynamic" approach would indeed be interesting. But I fear that
>> it is going to be complex at best. The amount of time spent in scanning
>> heavily depends on the visibility map. The initial vacuum scan of a table
>> can take hours or more, but it does update the visibility map even if the
>> vacuum itself is aborted later. The next vacuum may scan that table in
>> almost no time at all, because it skips all blocks that are marked "all
>> visible".
>
> Well, if that's true, then there's little reason to worry about giving
> up quickly, because the next autovacuum a minute later won't consume
> many resources.

"Almost no time" is of course "relative" to what an actual scan and dead
tuple removal cost. Looking at a table with 3 GB of dead tuples at the
end, the initial vacuum scan takes hours. When vacuum comes back to this
table, cleaning up a couple megabytes of newly deceased tuples and then
skipping over the all visible pages may take a minute.

Based on the discussion and what I feel is a consensus, I have created
an updated patch that has no GUC at all. The hard-coded parameters in
include/postmaster/autovacuum.h are:

AUTOVACUUM_TRUNCATE_LOCK_CHECK_INTERVAL 20 /* ms */
AUTOVACUUM_TRUNCATE_LOCK_WAIT_INTERVAL 50 /* ms */
AUTOVACUUM_TRUNCATE_LOCK_TIMEOUT 5000 /* ms */
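
To spell out how the two wait-related values are meant to interact,
here is a sketch of the outer lock-acquisition loop (with made-up
callback names; the real loop is in the attached patch):

#include <stdbool.h>
#include <stdint.h>

#define AUTOVACUUM_TRUNCATE_LOCK_WAIT_INTERVAL  50      /* ms between attempts */
#define AUTOVACUUM_TRUNCATE_LOCK_TIMEOUT        5000    /* ms before giving up */

/* Try to (re)acquire the access exclusive lock every 50ms, and stop
 * trying altogether after 5 seconds. */
static bool
acquire_truncate_lock(bool (*try_lock)(void),
                      void (*sleep_ms)(int),
                      int64_t (*clock_ms)(void))
{
    int64_t start = clock_ms();

    while (!try_lock())
    {
        if (clock_ms() - start >= AUTOVACUUM_TRUNCATE_LOCK_TIMEOUT)
            return false;       /* give up for this autovacuum run */
        sleep_ms(AUTOVACUUM_TRUNCATE_LOCK_WAIT_INTERVAL);
    }
    return true;                /* lock held; continue with the truncate */
}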

I gave that the worst workload I can think of: a pgbench-style
application that throws about 10 transactions per second at it, so that
there is constantly a need to give up the lock due to conflicting lock
requests and then reacquire it; a "cleanup" process that periodically
moves old tuples from the history table to an archive table, making
history a rolling window table; and a third job that 2-3 times per
minute produces a transaction lasting 10 seconds, forcing autovacuum to
give up on reacquiring the lock.

Even with that workload, autovacuum is slowly but steadily chopping
away at the table.

Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin

Attachment: autovacuum-truncate-lock-3.diff (text/x-patch, 16.7 KB)