We probably need autovacuum_max_wraparound_workers

From: Josh Berkus <josh(at)agliodbs(dot)com>
To: PgHacker <pgsql-hackers(at)postgresql(dot)org>
Subject: We probably need autovacuum_max_wraparound_workers
Date: 2012-06-28 02:00:11
Message-ID: 4FEBBAAB.4000706@agliodbs.com
Lists: pgsql-hackers

Folks,

Yeah, I can't believe I'm calling for *yet another* configuration
variable either. Suggested workaround fixes very welcome.

The basic issue is that autovacuum_max_workers is set by most users
based on autovac's fairly lightweight action most of the time: analyze,
vacuuming pages not on the visibility list, etc. However, when XID
wraparound kicks in, then autovac starts reading entire tables from disk
... and those tables may be very large.

This becomes a downtime issue if you've set autovacuum_max_workers to,
say, 5 and several large tables hit the wraparound threshold at the same
time (as they tend to do if you're using the default settings). Then
you have 5 autovacuum processes concurrently doing heavy IO and getting
in each others' way.

I've seen this at two sites now, and my conclusion is that a single
autovacuum_max_workers isn't sufficient to cover the case of
wraparound vacuum. Nor can we just single-thread the wraparound vacuum
(i.e. just one worker) since that would hurt users who have thousands of
small tables.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: PgHacker <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We probably need autovacuum_max_wraparound_workers
Date: 2012-06-28 02:18:22
Message-ID: 26822.1340849902@sss.pgh.pa.us
Lists: pgsql-hackers

Josh Berkus <josh(at)agliodbs(dot)com> writes:
> Yeah, I can't believe I'm calling for *yet another* configuration
> variable either. Suggested workaround fixes very welcome.

> The basic issue is that autovacuum_max_workers is set by most users
> based on autovac's fairly lightweight action most of the time: analyze,
> vacuuming pages not on the visibility list, etc. However, when XID
> wraparound kicks in, then autovac starts reading entire tables from disk
> ... and those tables may be very large.

It doesn't seem to me that this has much of anything to do with
wraparound; that just happens to be one possible trigger condition
for a lot of vacuuming activity to be happening. (Others are bulk
data loads or bulk updates, for instance.) Nor am I convinced that
changing the max_workers setting is an appropriate fix anyway.

I think what you've really got here is inappropriate autovacuum cost
delay settings, and/or the logic in autovacuum.c to try to divvy up the
available I/O capacity by tweaking workers' delay settings isn't working
very well. It's hard to propose improvements without a lot more detail
than you've provided, though.

regards, tom lane


From: David Johnston <polobo(at)yahoo(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: PgHacker <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We probably need autovacuum_max_wraparound_workers
Date: 2012-06-28 02:21:30
Message-ID: 04CF547E-820D-402B-A8F5-BAF45D6192BA@yahoo.com
Lists: pgsql-hackers

On Jun 27, 2012, at 22:00, Josh Berkus <josh(at)agliodbs(dot)com> wrote:

> Folks,
>
> Yeah, I can't believe I'm calling for *yet another* configuration
> variable either. Suggested workaround fixes very welcome.
>
> The basic issue is that autovacuum_max_workers is set by most users
> based on autovac's fairly lightweight action most of the time: analyze,
> vacuuming pages not on the visibility list, etc. However, when XID
> wraparound kicks in, then autovac starts reading entire tables from disk
> ... and those tables may be very large.
>
> This becomes a downtime issue if you've set autovacuum_max_workers to,
> say, 5 and several large tables hit the wraparound threshold at the same
> time (as they tend to do if you're using the default settings). Then
> you have 5 autovacuum processes concurrently doing heavy IO and getting
> in each others' way.
>
> I've seen this at two sites now, and my conclusion is that a single
> autovacuum_max_workers isn't sufficient to cover the case of
> wraparound vacuum. Nor can we just single-thread the wraparound vacuum
> (i.e. just one worker) since that would hurt users who have thousands of
> small tables.
>
>

Would there be enough benefit to setting up separate small/medium/large thresholds, with user-changeable default table-size boundaries, so that you could configure 6 workers where 3 handle the small tables, 2 handle the medium tables, and 1 handles the large tables? Or, alternatively, a small worker consumes 1 'unit', a medium worker 2, and a large worker 3 from whatever size pool has been defined, so you could have 6 small tables or 2 large tables in progress simultaneously.

David J.


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PgHacker <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We probably need autovacuum_max_wraparound_workers
Date: 2012-06-28 02:22:48
Message-ID: 4FEBBFF8.7020301@agliodbs.com
Lists: pgsql-hackers


> I think what you've really got here is inappropriate autovacuum cost
> delay settings, and/or the logic in autovacuum.c to try to divvy up the
> available I/O capacity by tweaking workers' delay settings isn't working
> very well. It's hard to propose improvements without a lot more detail
> than you've provided, though.

Wait, we *have* that logic? If so, that's the problem ... it's not
working very well.

What detail do you want?

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: PgHacker <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We probably need autovacuum_max_wraparound_workers
Date: 2012-06-28 02:29:00
Message-ID: 20120628022900.GQ1267@tamriel.snowman.net
Lists: pgsql-hackers

Josh, all,

* Josh Berkus (josh(at)agliodbs(dot)com) wrote:
> Yeah, I can't believe I'm calling for *yet another* configuration
> variable either. Suggested workaround fixes very welcome.

As I suggested on IRC, my thought would be to have a goal-based system
for autovacuum which is similar to our goal-based commit system. We
don't need autovacuum sucking up all the I/O in the box, nor should we
ask the users to manage that. Instead, let's decide when the autovacuum
on a given table needs to finish and then plan to keep on working at a
rate that'll allow us to get done well in advance of that deadline.
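
To make that a bit more concrete, the arithmetic I'm picturing is
roughly the following. This is only a sketch; the function name, the
page math, and the safety factor are all invented for illustration,
nothing like this exists in autovacuum today:

#include <stdio.h>

/*
 * Hypothetical goal-based pacing: given how many heap pages are left to
 * scan and how many seconds remain before we want this vacuum finished,
 * derive a target scan rate, padded so we finish well ahead of the
 * deadline rather than right at it.
 */
static double
target_pages_per_sec(double pages_remaining, double secs_to_deadline)
{
    const double safety_factor = 0.5;   /* aim to finish in half the time */

    if (secs_to_deadline <= 0.0)
        return pages_remaining;         /* already late: go flat out */
    return pages_remaining / (secs_to_deadline * safety_factor);
}

int
main(void)
{
    /* a 4GB table is ~524288 8kB pages; say the deadline is 24 hours out */
    printf("target: %.1f pages/sec\n",
           target_pages_per_sec(524288.0, 24.0 * 3600.0));
    return 0;
}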

Just my 2c.

Thanks,

Stephen


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: PgHacker <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We probably need autovacuum_max_wraparound_workers
Date: 2012-06-28 03:22:41
Message-ID: 29891.1340853761@sss.pgh.pa.us
Lists: pgsql-hackers

Josh Berkus <josh(at)agliodbs(dot)com> writes:
>> I think what you've really got here is inappropriate autovacuum cost
>> delay settings, and/or the logic in autovacuum.c to try to divvy up the
>> available I/O capacity by tweaking workers' delay settings isn't working
>> very well. It's hard to propose improvements without a lot more detail
>> than you've provided, though.

> Wait, we *have* that logic? If so, that's the problem ... it's not
> working very well.

> What detail do you want?

What's it doing? What do you think it should do instead?

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, PgHacker <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We probably need autovacuum_max_wraparound_workers
Date: 2012-06-28 03:38:39
Message-ID: 308.1340854719@sss.pgh.pa.us
Lists: pgsql-hackers

Stephen Frost <sfrost(at)snowman(dot)net> writes:
> * Josh Berkus (josh(at)agliodbs(dot)com) wrote:
>> Yeah, I can't believe I'm calling for *yet another* configuration
>> variable either. Suggested workaround fixes very welcome.

> As I suggested on IRC, my thought would be to have a goal-based system
> for autovacuum which is similar to our goal-based commit system. We
> don't need autovacuum sucking up all the I/O in the box, nor should we
> ask the users to manage that. Instead, let's decide when the autovacuum
> on a given table needs to finish and then plan to keep on working at a
> rate that'll allow us to get done well in advance of that deadline.

If we allow individual vacuum operations to stretch out just because
they don't need to be completed right away, we will need more concurrent
vacuum workers (so that we can respond to vacuum requirements for other
tables). So I submit that this would only move the problem around:
the number of active workers would increase to the point where things
are just as painful, plus or minus a bit.

The intent of the autovacuum cost delay features is to ensure that
autovacuum doesn't suck an untenable fraction of the machine's I/O
capacity, even when it's running flat out. So I think Josh's complaint
indicates that we have a problem with cost-delay tuning; hard to tell
what exactly without more info. It might only be that the defaults
are bad for these particular users, or it could be more involved.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Stephen Frost <sfrost(at)snowman(dot)net>, Josh Berkus <josh(at)agliodbs(dot)com>, PgHacker <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We probably need autovacuum_max_wraparound_workers
Date: 2012-06-28 04:41:35
Message-ID: CA+Tgmoa-W71Gz4wqp3DDOv_qEJd59BrtFs1ASN30bJc1ynPYPA@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jun 27, 2012 at 11:38 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Stephen Frost <sfrost(at)snowman(dot)net> writes:
>> * Josh Berkus (josh(at)agliodbs(dot)com) wrote:
>>> Yeah, I can't believe I'm calling for *yet another* configuration
>>> variable either.  Suggested workaround fixes very welcome.
>
>> As I suggested on IRC, my thought would be to have a goal-based system
>> for autovacuum which is similar to our goal-based commit system.  We
>> don't need autovacuum sucking up all the I/O in the box, nor should we
>> ask the users to manage that.  Instead, let's decide when the autovacuum
>> on a given table needs to finish and then plan to keep on working at a
>> rate that'll allow us to get done well in advance of that deadline.
>
> If we allow individual vacuum operations to stretch out just because
> they don't need to be completed right away, we will need more concurrent
> vacuum workers (so that we can respond to vacuum requirements for other
> tables).  So I submit that this would only move the problem around:
> the number of active workers would increase to the point where things
> are just as painful, plus or minus a bit.
>
> The intent of the autovacuum cost delay features is to ensure that
> autovacuum doesn't suck an untenable fraction of the machine's I/O
> capacity, even when it's running flat out.  So I think Josh's complaint
> indicates that we have a problem with cost-delay tuning; hard to tell
> what exactly without more info.  It might only be that the defaults
> are bad for these particular users, or it could be more involved.

I've certainly come across many reports of the cost delay settings
being difficult to tune, both on pgsql-hackers/performance and in
various private EnterpriseDB correspondence. I think Stephen's got it
exactly right: the system needs to figure out the rate at which vacuum
needs to happen, not rely on the user to provide that information.

For checkpoints, we estimated the percentage of the checkpoint that
ought to be completed and the percentage that actually is completed;
if the latter is less than the former, we speed things up until we're
back on track. For autovacuum, the trick is to speed things up when
the rate at which tables are coming due for autovacuum exceeds the
rate at which we are vacuuming them; or, when we anticipate that a
whole bunch of wraparound vacuums are going to come due
simultaneously, to start doing them sooner so that they are more
spread out.
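
Just to sketch the shape of the feedback I mean (invented names and
numbers, purely illustrative, not a patch):

#include <stdio.h>

/*
 * Hypothetical feedback in the spirit of the checkpoint pacing logic:
 * compare the fraction of the vacuum work actually done with the
 * fraction of the allotted time that has elapsed and nudge the cost
 * limit accordingly.  The names and the 10% step are made up, not
 * anything in autovacuum.c.
 */
static int
adjust_cost_limit(int cost_limit, double frac_done, double frac_elapsed)
{
    if (frac_done < frac_elapsed)
        cost_limit += cost_limit / 10;          /* behind: speed up */
    else if (frac_done > frac_elapsed + 0.1)
        cost_limit -= cost_limit / 10;          /* well ahead: back off */
    return (cost_limit < 1) ? 1 : cost_limit;
}

int
main(void)
{
    printf("%d\n", adjust_cost_limit(200, 0.40, 0.55));  /* behind -> 220 */
    printf("%d\n", adjust_cost_limit(200, 0.70, 0.55));  /* ahead  -> 180 */
    return 0;
}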

For example, suppose that 26 tables each of which is 4GB in size are
going to simultaneously come due for an anti-wraparound vacuum in 26
hours. For the sake of simplicity suppose that each will take 1 hour
to vacuum. What we currently do is wait for 26 hours and then start
vacuuming them all at top speed, thrashing the I/O system. What we
ought to do is start vacuuming them much sooner and do them
consecutively. Of course, the trick is to design a mechanism that
does something intelligent if we think we're on track and then all of
a sudden the rate of XID consumption changes dramatically, and now
we've got to vacuum faster or with more workers.
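
In toy form, using the made-up numbers from that example, the kind of
schedule I'm imagining looks like this (again purely illustrative):

#include <stdio.h>

/*
 * Toy schedule for the example above: 26 tables, each estimated to take
 * 1 hour to vacuum, all hitting their anti-wraparound deadline 26 hours
 * from now.  Instead of launching them all at hour 26, start them
 * back-to-back so the last one finishes just as the deadline arrives
 * (in practice you'd want some slack, but the shape is the point).
 */
int
main(void)
{
    const int    ntables = 26;
    const double est_hours = 1.0;
    const double deadline_hours = 26.0;

    for (int i = 0; i < ntables; i++)
    {
        double start = deadline_hours - (ntables - i) * est_hours;

        printf("table %2d: start at t+%5.1fh, done by t+%5.1fh\n",
               i + 1, start, start + est_hours);
    }
    return 0;
}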

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Stephen Frost <sfrost(at)snowman(dot)net>, Josh Berkus <josh(at)agliodbs(dot)com>, PgHacker <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We probably need autovacuum_max_wraparound_workers
Date: 2012-06-28 04:51:53
Message-ID: 1787.1340859113@sss.pgh.pa.us
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> For example, suppose that 26 tables each of which is 4GB in size are
> going to simultaneously come due for an anti-wraparound vacuum in 26
> hours. For the sake of simplicity suppose that each will take 1 hour
> to vacuum. What we currently do is wait for 26 hours and then start
> vacuuming them all at top speed, thrashing the I/O system.

This is a nice description of a problem that has nothing to do with
reality. In the first place, we don't vacuum them all at once; we can
only vacuum max_workers of them at a time. In the second place, the
cost-delay features ought to be keeping autovacuum from thrashing the
I/O, entirely independently of what the reason was for starting the
vacuums. Clearly, since people are complaining, there's something that
needs work there. But not the work you're proposing.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Stephen Frost <sfrost(at)snowman(dot)net>, Josh Berkus <josh(at)agliodbs(dot)com>, PgHacker <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We probably need autovacuum_max_wraparound_workers
Date: 2012-06-28 05:45:02
Message-ID: CA+TgmoYsEGfT9=41Aa8K74z=7s31V0QZUVcq1QYmJD=6LwNK_A@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jun 28, 2012 at 12:51 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> For example, suppose that 26 tables each of which is 4GB in size are
>> going to simultaneously come due for an anti-wraparound vacuum in 26
>> hours.  For the sake of simplicity suppose that each will take 1 hour
>> to vacuum.  What we currently do is wait for 26 hours and then start
>> vacuuming them all at top speed, thrashing the I/O system.
>
> This is a nice description of a problem that has nothing to do with
> reality.  In the first place, we don't vacuum them all at once; we can
> only vacuum max_workers of them at a time.  In the second place, the
> cost-delay features ought to be keeping autovacuum from thrashing the
> I/O, entirely independently of what the reason was for starting the
> vacuums.

I don't think it works that way. The point is that the workload
imposed by autovac is intermittent and spikey. If you configure the
cost limit too low, or the delay too high, or the number of autovac
workers is too low, then autovac can't keep up, which causes all of
your tables to bloat and is a total disaster. You have to make sure
that isn't going to happen, so you naturally configure the settings
aggressively enough that you're sure autovac will be able to stay
ahead of your bloat problem. But then autovac is more
resource-intensive ALL the time, not just when there's a real need for
it. This is like giving a kid a $20 bill to buy lunch and having them
walk around until they find a restaurant sufficiently expensive that
lunch there costs $20. The point of handing over $20 was that you
were willing to spend that much *if needed*, not that the money was
burning a hole in your pocket.

To make that more concrete, suppose that a table has an update rate
such that it hits the autovac threshold every 10 minutes. If you set
the autovac settings such that an autovacuum of that table takes 9
minutes to complete, you are hosed: there will eventually be some
10-minute period where the update rate is ten times the typical
amount, and the table will gradually become horribly bloated. But if
you set the autovac settings such that an autovacuum of the table can
finish in 1 minute, so that you can cope with a spike, then whenever
there isn't a spike you are processing the table ten times faster than
necessary, and now one minute out of every ten carries a heavier I/O
load than the other 9, leading to uneven response times.

It's just ridiculous to assert that it doesn't matter if all the
anti-wraparound vacuums start simultaneously. It does matter. For
one thing, once every single autovacuum worker is pinned down doing an
anti-wraparound vacuum of some table, then a table that needs an
ordinary vacuum may have to wait quite some time before a worker is
available. Depending on the order in which workers iterate through
the tables, you could end up finishing all of the anti-wraparound
vacuums before doing any of the regular vacuums. If the wraparound
vacuums had been properly spread out, then there would at all times
have been workers available for regular vacuums as needed. For
another thing, you can't possibly think that three or five workers
running simultaneously, each reading a different table, is just as
efficient as having one worker grind through them consecutively.
Parallelism is not free, ever, and particularly not here, where it has
the potential to yank the disk head around between five different
files, seeking like crazy, instead of a nice sequential I/O pattern on
each file in turn. Josh wouldn't keep complaining about this if it
didn't suck.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Stephen Frost <sfrost(at)snowman(dot)net>, Josh Berkus <josh(at)agliodbs(dot)com>, PgHacker <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We probably need autovacuum_max_wraparound_workers
Date: 2012-06-28 06:02:23
Message-ID: 3185.1340863343@sss.pgh.pa.us
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> It's just ridiculous to assert that it doesn't matter if all the
> anti-wraparound vacuums start simultaneously. It does matter. For
> one thing, once every single autovacuum worker is pinned down doing an
> anti-wraparound vacuum of some table, then a table that needs an
> ordinary vacuum may have to wait quite some time before a worker is
> available.

Well, that's a fair point, but I don't think it has anything to do with
Josh's complaint --- which AFAICT is about imposed load, not about
failure to vacuum things that need vacuumed. Any scheme you care to
design will sometimes be running max_workers workers at once, and if
that's too much load there will be trouble. I grant that there can be
value in a more complex strategy for when to schedule vacuuming
activities, but I don't think that it has a lot to do with solving the
present complaint.

> Parallelism is not free, ever, and particularly not here, where it has
> the potential to yank the disk head around between five different
> files, seeking like crazy, instead of a nice sequential I/O pattern on
> each file in turn.

Interesting point. Maybe what's going on here is that
autovac_balance_cost() is wrong to suppose that N workers can each have
1/N of the I/O bandwidth that we'd consider okay for a single worker to
eat. Maybe extra seek costs mean we have to derate that curve by some
large factor. 1/(N^2), perhaps? I bet the nature of the disk subsystem
affects this a lot, though.
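
For illustration only, with made-up baseline numbers rather than
anything from autovacuum.c, the two curves compare like so:

#include <stdio.h>

/*
 * Compare per-worker and total I/O budgets if N workers each get 1/N of
 * the single-worker budget (the current assumption) versus a
 * seek-penalized 1/(N*N) derating.  "Budget" here is just
 * cost_limit/cost_delay in arbitrary units; the numbers are illustrative.
 */
int
main(void)
{
    const double single_worker_budget = 200.0 / 20.0;   /* limit/delay */

    for (int n = 1; n <= 5; n++)
    {
        double per_linear = single_worker_budget / n;
        double per_quadratic = single_worker_budget / (n * n);

        printf("N=%d  1/N: %.2f each (%.2f total)   1/N^2: %.2f each (%.2f total)\n",
               n, per_linear, per_linear * n,
               per_quadratic, per_quadratic * n);
    }
    return 0;
}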

regards, tom lane


From: Daniel Farina <daniel(at)heroku(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: PgHacker <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We probably need autovacuum_max_wraparound_workers
Date: 2012-06-28 08:25:15
Message-ID: CAAZKuFZT_GhazRyw7=K28mOEe79n8CitwRG0z0QYciHcQ1S7gg@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jun 27, 2012 at 7:00 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> I've seen this at two sites now, and my conclusion is that a single
> autovacuum_max_workers isn't sufficient to cover the case of
> wraparound vacuum. Nor can we just single-thread the wraparound vacuum
> (i.e. just one worker) since that would hurt users who have thousands of
> small tables.

I have also witnessed very unfortunate, un-smooth performance behavior
around wraparound time. It seems like a bit of adaptive response,
scaling the allowed autovacuum throughput to the number of pages
requiring wraparound vacuuming, would be one load off my mind. Getting
gradually slower, with some way to know that autovacuum has decided it
should work harder and harder, is better than the brick wall that can
sneak up on you currently.

Count me as appreciative for improvements in this area.

--
fdr


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Stephen Frost <sfrost(at)snowman(dot)net>, Josh Berkus <josh(at)agliodbs(dot)com>, PgHacker <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We probably need autovacuum_max_wraparound_workers
Date: 2012-06-28 12:22:12
Message-ID: CA+TgmoYzC2ET_S4e43AbCQRKg=10RCW=pWOtOyN4e2E1ZO_wVA@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jun 28, 2012 at 2:02 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> It's just ridiculous to assert that it doesn't matter if all the
>> anti-wraparound vacuums start simultaneously.  It does matter.  For
>> one thing, once every single autovacuum worker is pinned down doing an
>> anti-wraparound vacuum of some table, then a table that needs an
>> ordinary vacuum may have to wait quite some time before a worker is
>> available.
>
> Well, that's a fair point, but I don't think it has anything to do with
> Josh's complaint --- which AFAICT is about imposed load, not about
> failure to vacuum things that need vacuumed.  Any scheme you care to
> design will sometimes be running max_workers workers at once, and if
> that's too much load there will be trouble.  I grant that there can be
> value in a more complex strategy for when to schedule vacuuming
> activities, but I don't think that it has a lot to do with solving the
> present complaint.

I think it's got everything to do with it. Josh could fix his problem
by reducing the cost limit and/or increasing the cost delay, but if he
did that then his database would get bloated...

>> Parallelism is not free, ever, and particularly not here, where it has
>> the potential to yank the disk head around between five different
>> files, seeking like crazy, instead of a nice sequential I/O pattern on
>> each file in turn.
>
> Interesting point.  Maybe what's going on here is that
> autovac_balance_cost() is wrong to suppose that N workers can each have
> 1/N of the I/O bandwidth that we'd consider okay for a single worker to
> eat.  Maybe extra seek costs mean we have to derate that curve by some
> large factor.  1/(N^2), perhaps?  I bet the nature of the disk subsystem
> affects this a lot, though.

...and this would have the same effect. Let's not assume that the
problem is that Josh doesn't know how to make autovacuum less
aggressive, because I'm pretty sure that ain't the issue.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Cédric Villemain <cedric(at)2ndquadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Stephen Frost <sfrost(at)snowman(dot)net>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: We probably need autovacuum_max_wraparound_workers
Date: 2012-06-28 13:25:42
Message-ID: 201206281525.46186.cedric@2ndquadrant.com
Lists: pgsql-hackers

> >> Parallelism is not free, ever, and particularly not here, where it has
> >> the potential to yank the disk head around between five different
> >> files, seeking like crazy, instead of a nice sequential I/O pattern on
> >> each file in turn.
> >
> > Interesting point. Maybe what's going on here is that
> > autovac_balance_cost() is wrong to suppose that N workers can each have
> > 1/N of the I/O bandwidth that we'd consider okay for a single worker to
> > eat. Maybe extra seek costs mean we have to derate that curve by some
> > large factor. 1/(N^2), perhaps? I bet the nature of the disk subsystem
> > affects this a lot, though.
>
> ...and this would have the same effect. Let's not assume that the
> problem is that Josh doesn't know how to make autovacuum less
> aggressive, because I'm pretty sure that ain't the issue.

We may need reserved workers to work on system tables, at least, just
as a protection in case all the workers are locked up for hours walking
'log' tables. In the meantime, the pg_type table can bloat a lot, for
example.

It might be that limiting the number of workers in 'anti-wraparound'
mode to (max_workers - round(max_workers/3)) is enough.
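
As a quick illustration of what that cap works out to (just arithmetic,
not proposed code):

#include <stdio.h>
#include <math.h>

/*
 * Cap on workers allowed to run anti-wraparound vacuums under the
 * proposal above: max_workers minus roughly a third of them, rounded,
 * so that some workers usually remain free for ordinary vacuum and
 * analyze work.
 */
int
main(void)
{
    for (int max_workers = 1; max_workers <= 10; max_workers++)
    {
        int cap = max_workers - (int) round(max_workers / 3.0);

        printf("autovacuum_max_workers=%2d -> wraparound cap %d\n",
               max_workers, cap);
    }
    return 0;
}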

--
Cédric Villemain +33 (0)6 20 30 22 52
http://2ndQuadrant.fr/
PostgreSQL: Support 24x7 - Développement, Expertise et Formation


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Stephen Frost <sfrost(at)snowman(dot)net>, Josh Berkus <josh(at)agliodbs(dot)com>, PgHacker <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We probably need autovacuum_max_wraparound_workers
Date: 2012-06-28 13:51:56
Message-ID: 11195.1340891516@sss.pgh.pa.us
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Thu, Jun 28, 2012 at 2:02 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Well, that's a fair point, but I don't think it has anything to do with
>> Josh's complaint --- which AFAICT is about imposed load, not about
>> failure to vacuum things that need vacuumed.

> I think it's got everything to do with it. Josh could fix his problem
> by reducing the cost limit and/or increasing the cost delay, but if he
> did that then his database would get bloated...

Josh hasn't actually explained what his problem is, nor what if any
adjustments he made to try to ameliorate it. In the absence of data
I refuse to rule out misconfiguration. But, again, to the extent that
he's given us any info at all, it seemed to be a complaint about
oversaturated I/O at max load, *not* about inability to complete
vacuuming tasks as needed. You are inventing problem details to fit
your solution.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Stephen Frost <sfrost(at)snowman(dot)net>, Josh Berkus <josh(at)agliodbs(dot)com>, PgHacker <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We probably need autovacuum_max_wraparound_workers
Date: 2012-06-28 13:53:46
Message-ID: CA+TgmoY3kyGQNz6wOO0Sd2Kh3uTLAQ8dkCBQzjgSW2970Qqbdg@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jun 28, 2012 at 9:51 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>  You are inventing problem details to fit
> your solution.

Well, what I'm actually doing is assuming that Josh's customers have
the same problem that our customers do.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: PgHacker <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We probably need autovacuum_max_wraparound_workers
Date: 2012-06-28 19:03:15
Message-ID: 4FECAA73.6010305@agliodbs.com
Lists: pgsql-hackers

Robert, Tom, Stephen,

So, first, a description of the specific problem I've encountered at two
sites. I'm working on another email suggesting workarounds and
solutions, but that's going to take a bit longer.

Observation
-----------

This problem occurred on two database systems which shared the following
characteristics:

1) They were running with default autovacuum & vacuum settings, except
that one database had 5 workers instead of 3.

2) They have large partitioned tables, in which the partitions are
time-based and do not receive UPDATES after a certain date. Each
partition was larger than RAM.

3) The databases are old enough, and busy enough, to have been through
XID wraparound at least a couple of times.

Users reported that the database system became unresponsive, which was
surprising since both of these DBs had been placed on hardware which was
engineered for at least 100% growth over the current database size. On
investigation, we discovered the following things:

a) Each database had autovacuum_max_workers workers (5 on one DB, 3 on
the other) doing anti-wraparound vacuum on several partitions
simultaneously.

b) The I/O created by the anti-wraparound vacuum was tying up the system.

c) Terminating any individual autovacuum process didn't help, as it
simply caused autovac to start on a different partition.

So, the first question was: why did autovacuum want to anti-wraparound
vacuum dozens of tables at the same time? A quick check showed that all
of these partitions had nearly identical XID ages (as in less than
100,000 transactions apart), all of which had exceeded
autovacuum_freeze_max_age. How did this happen? I'm still not sure.

One thought is that this is an artifact of the *previous* wraparound
vacuums on each database: cold partitions with old dead rows which have
been through wraparound vacuum several times tend to converge towards
having the same relfrozenxid over time; I'm still working on the math
to prove this. Alternatively, it's possible that a schema change to the
partitioned tables gave them all the same effective relfrozenxid at some
point in the past; both databases are still in development.

So there are two parts to this problem, each of which needs a different
solution:

1. Databases can inadvertently get to the state where many tables need
wraparound vacuuming at exactly the same time, especially if they have
many "cold" data partition tables.

2. When we do hit wraparound thresholds for multiple tables, autovacuum
has no hesitation about doing autovacuum_max_workers worth of wraparound
vacuum simultaneously, even when that exceeds the I/O capacity of the
system.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We probably need autovacuum_max_wraparound_workers
Date: 2012-06-28 19:54:42
Message-ID: 1340913200-sup-8470@alvh.no-ip.org
Lists: pgsql-hackers


Excerpts from Josh Berkus's message of Thu Jun 28 15:03:15 -0400 2012:

> 2) They have large partitioned tables, in which the partitions are
> time-based and do not receive UPDATES after a certain date. Each
> partition was larger than RAM.

I think the solution to this problem has nothing to do with vacuum or
autovacuum settings, and lots to do with cataloguing enough info about
each of these tables to note that, past a certain point, they don't need
any vacuuming at all.

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Christopher Browne <cbbrowne(at)gmail(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: PgHacker <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We probably need autovacuum_max_wraparound_workers
Date: 2012-06-28 20:02:33
Message-ID: CAFNqd5WYzo9_E2O+fth6gzT9XGNp9U7NMYXZmF4fsuUDY0hkKg@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jun 28, 2012 at 3:03 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> 1. Databases can inadvertently get to the state where many tables need
> wraparound vacuuming at exactly the same time, especially if they have
> many "cold" data partition tables.

This suggests that this should be handled rather earlier, and with
some attempt to not do them all simultaneously.

In effect, if there are 25 tables that will need wraparound vacuums in
the next million transactions, it is presumably beneficial to start
hitting on them right away, ideally one at a time, so as to draw their
future needs further apart.

The loose thought is that any time autovac isn't very busy, it should
consider (perhaps based on probability?) picking a table that is in a
cluster of tables that currently have wraparound needs at about the
same time, and, in effect, spread that cluster out.

I suppose there are two considerations, that conflict somewhat:
a) If there are tables that Absolutely Require wraparound vacuuming,
Right Soon Now, there's nothing to help this. They MUST be vacuumed,
otherwise the system will get very unhappy.
b) It's undesirable to *worsen* things by 'clustering' future
wraparound vacuums together, which gets induced any time autovac is
continually vacuuming a series of tables. If 25 tables get vacuumed
right around now, then that may cluster their next wraparound vacuum
to 2^31 transactions from 'right around now.'

But there's no helping a).

I suppose this suggests having an autovac thread that is 'devoted' to
spreading out future wraparound vacuums.

- If a *lot* of tables were just vacuumed recently, then it shouldn't
do anything, as Right Now is a cluster of 'badness.'
- It should group tables by slicing their next wraparounds (grouping
by rounding wraparound txid to the nearest, say, 10M or 20M), and
consider vacuuming a table Right Now that would take that table out of
the worst such "slice"

Thus, supposing the grouping is like:

| TxId - nearest 10 million | Tables Wrapping In Range |
|---------------------------+--------------------------|
|                         0 |                      250 |
|                         1 |                       80 |
|                         2 |                       72 |
|                         3 |                       30 |
|                         4 |                       21 |
|                         5 |                       35 |
|                         6 |                        9 |
|                         7 |                       15 |
|                         8 |                        8 |
|                         9 |                        7 |
|                        10 |                       22 |
|                        11 |                       35 |
|                        12 |                       14 |
|                        13 |                      135 |
|                        14 |                      120 |
|                        15 |                       89 |
|                        16 |                       35 |
|                        17 |                       45 |
|                        18 |                       60 |
|                        19 |                       25 |
|                        20 |                       15 |
|                        21 |                      150 |

Suppose current txid is 7500000, and the reason for there to be 250
tables in the current range is that there are a bunch of tables that
get *continually* vacuumed. No need to worry about that range, and
I'll presume that these are all in the past.

In this example, it's crucial to, pretty soon, vacuum the 150 tables
in partition #21, as they're getting near wraparound. Nothing to be
improved on there. Though it would be kind of nice to start on the
150 as early as possible, so that we *might* avoid having them
dominate autovac, as in Josh Berkus' example.

But once those are done, the next "crucial" set, in partition #20, are
a much smaller set of tables. It would be nice, at that point, to add
in a few tables from partitions #13 and #14, to smooth out the burden.
The ideal "steady state" would look like the following:

| TxId - nearest 10 million | Tables Wrapping In Range |
|---------------------------+--------------------------|
|                         0 |                      250 |
|                         1 |                       51 |
|                         2 |                       51 |
|                         3 |                       51 |
|                         4 |                       51 |
|                         5 |                       51 |
|                         6 |                       51 |
|                         7 |                       51 |
|                         8 |                       51 |
|                         9 |                       51 |
|                        10 |                       51 |
|                        11 |                       51 |
|                        12 |                       51 |
|                        13 |                       51 |
|                        14 |                       51 |
|                        15 |                       51 |
|                        16 |                       51 |
|                        17 |                       51 |
|                        18 |                       51 |
|                        19 |                       51 |
|                        20 |                       51 |
|                        21 |                       51 |

We might not get something totally smooth, but getting rid of the
*really* chunky ranges would be good.
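
A toy version of the slice bookkeeping, using the counts from the first
table above (none of this is real autovacuum code; it's only meant to
show the selection rule):

#include <stdio.h>
#include <stdint.h>

/*
 * Sketch of the "slice" idea: bucket tables by the XID at which they
 * will next need an anti-wraparound vacuum, rounded to 10M, then pick a
 * table out of the fullest future bucket whenever autovacuum is
 * otherwise idle.  Table counts and the bucket width are made up.
 */
#define NBUCKETS    22
#define BUCKET_XIDS 10000000U

int
main(void)
{
    /* tables whose wraparound deadline falls in each 10M-XID slice */
    int counts[NBUCKETS] = {250, 80, 72, 30, 21, 35, 9, 15, 8, 7, 22,
                            35, 14, 135, 120, 89, 35, 45, 60, 25, 15, 150};
    uint32_t current_xid = 7500000;
    int current_bucket = current_xid / BUCKET_XIDS;   /* = 0 */
    int worst = -1;

    for (int b = current_bucket + 1; b < NBUCKETS; b++)
        if (worst < 0 || counts[b] > counts[worst])
            worst = b;

    if (worst >= 0)
        printf("vacuum one table from slice %d (currently %d tables)\n",
               worst, counts[worst]);
    return 0;
}
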
--
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: PgHacker <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We probably need autovacuum_max_wraparound_workers
Date: 2012-06-28 23:48:55
Message-ID: 6551.1340927335@sss.pgh.pa.us
Lists: pgsql-hackers

Josh Berkus <josh(at)agliodbs(dot)com> writes:
> So there are two parts to this problem, each of which needs a different
> solution:

> 1. Databases can inadvertently get to the state where many tables need
> wraparound vacuuming at exactly the same time, especially if they have
> many "cold" data partition tables.

I'm not especially sold on your theory that there's some behavior that
forces such convergence, but it's certainly plausible that there was,
say, a schema alteration applied to all of those partitions at about the
same time. In any case, as Robert has been saying, it seems like it
would be smart to try to get autovacuum to spread out the
anti-wraparound work a bit better when it's faced with a lot of tables
with similar relfrozenxid values.

> 2. When we do hit wraparound thresholds for multiple tables, autovacuum
> has no hesitation about doing autovacuum_max_workers worth of wraparound
> vacuum simultaneously, even when that exceeds the I/O capacity of the
> system.

I continue to maintain that this problem is unrelated to wraparound as
such, and that thinking it is is a great way to design a bad solution.
There are any number of reasons why autovacuum might need to run
max_workers at once. What we need to look at is making sure that they
don't run the system into the ground when that happens.

Since your users weren't complaining about performance with one or two
autovac workers running (were they?), we can assume that the cost-delay
settings were such as to not create a problem in that scenario. So it
seems to me that it's down to autovac_balance_cost(). Either there's
a plain-vanilla bug in there, or seek costs are breaking the assumption
that it's okay to give N workers each 1/Nth of the single-worker I/O
capacity.

As far as bugs are concerned, I wonder if the premise of the calculation

* The idea here is that we ration out I/O equally. The amount of I/O
* that a worker can consume is determined by cost_limit/cost_delay, so we
* try to equalize those ratios rather than the raw limit settings.

might be wrong in itself? The ratio idea seems plausible but ...

regards, tom lane


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: PgHacker <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We probably need autovacuum_max_wraparound_workers
Date: 2012-06-29 01:57:24
Message-ID: 4FED0B84.9010905@agliodbs.com
Lists: pgsql-hackers


> I'm not especially sold on your theory that there's some behavior that
> forces such convergence, but it's certainly plausible that there was,
> say, a schema alteration applied to all of those partitions at about the
> same time. In any case, as Robert has been saying, it seems like it
> would be smart to try to get autovacuum to spread out the
> anti-wraparound work a bit better when it's faced with a lot of tables
> with similar relfrozenxid values.

Well, I think we can go even further than that. I think one of the
fundamental problems is that our "opportunistic" vacuum XID approach is
essentially broken for any table which doesn't receive continuous
update/deletes (I think Chris Browne makes largely the same point).

The way opportunism currently works is via vacuum_freeze_table_age,
which says "if you were going to vacuum this table *anyway*, and its
relfrozenxid is # old, then full-scan it". That works fine for tables
getting constant UPDATEs to avoid hitting the wraparound deadline, but
tables which have stopped getting activity, or are insert-only, never
get it.

What we should have instead is some opportunism in autovacuum which says:

"If I have otherwise idle workers, and the system isn't too busy*, find
the table with the oldest relfrozenxid which is over
autovacuum_freeze_max_age/2 and vacuum-full-scan it."

The key difficulty there is "if the system isn't too busy". That's a
hard thing to determine, and subject to frequent change. An
opportunistic solution would still be useful without that requirement,
but not as helpful.
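
Just to illustrate the selection rule I quoted above, something like
the following, where the table names and ages are invented and only the
200M default for autovacuum_freeze_max_age is real:

#include <stdio.h>
#include <stdint.h>

/*
 * Sketch of the rule: among tables whose relfrozenxid age has passed
 * half of autovacuum_freeze_max_age, pick the oldest one and freeze it
 * while workers are otherwise idle.  The struct and numbers are purely
 * illustrative, not catalog layout.
 */
typedef struct
{
    const char *name;
    uint32_t    frozenxid_age;          /* age(relfrozenxid) */
} TableAge;

int
main(void)
{
    const uint32_t freeze_max_age = 200000000;  /* default GUC value */
    TableAge tabs[] = {
        {"events_2012_01", 150000000},
        {"events_2012_02", 90000000},
        {"events_2012_03", 40000000},
    };
    const TableAge *pick = NULL;

    for (size_t i = 0; i < sizeof(tabs) / sizeof(tabs[0]); i++)
        if (tabs[i].frozenxid_age > freeze_max_age / 2 &&
            (pick == NULL || tabs[i].frozenxid_age > pick->frozenxid_age))
            pick = &tabs[i];

    if (pick)
        printf("opportunistically freeze %s (age %u)\n",
               pick->name, (unsigned) pick->frozenxid_age);
    return 0;
}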

I don't find Stephen's proposal of goal-based solutions to be practical.
A goal-based approach makes the assumption that database activity is
predictable, and IME most databases are anything but.

A second obstacle to "opportunistic wraparound vacuum" is that
wraparound vacuum is not interruptible. If you have to kill it off and
do something else for a couple hours, it can't pick up where it left
off; it needs to scan the whole table from the beginning again.

> I continue to maintain that this problem is unrelated to wraparound as
> such, and that thinking it is is a great way to design a bad solution.
> There are any number of reasons why autovacuum might need to run
> max_workers at once. What we need to look at is making sure that they
> don't run the system into the ground when that happens.

100% agree.

> Since your users weren't complaining about performance with one or two
> autovac workers running (were they?),

No, it's when we hit 3 that it fell over. Thresholds vary with memory
and table size, of course.

BTW, the primary reason I think (based on a glance at system stats) this
drove the system to its knees was that the simultaneous wraparound
vacuum of 3 old-cold tables evicted all of the "current" data out of the
FS cache, forcing user queries which would normally hit the FS cache
onto disk. I/O throughput was NOT at 100% capacity.

During busy periods, a single wraparound vacuum wasn't enough to clear
the FS cache because it's competing on equal terms with user access to
data. But three avworkers "ganged up" on the user queries and kicked
the tar out of them.

Unfortunately, for the 5-worker system, I didn't find out about the
issue until after it was over, and I know it was related to wraparound
only because we were logging autovacuum. So I don't know if it had the
same case.

There are also problems with our defaults and measurements for the
various vacuum_freeze settings, but changing those won't really fix the
underlying problem, so it's not worth fiddling with them.

The other solution, as mentioned last year, is to come up with a way in
which old-cold data doesn't need to be vacuumed *at all*. This would
be the ideal solution, but it's not clear how to implement it, since any
wraparound-counting solution would bloat the CLOG intolerably.

> we can assume that the cost-delay
> settings were such as to not create a problem in that scenario. So it
> seems to me that it's down to autovac_balance_cost(). Either there's
> a plain-vanilla bug in there, or seek costs are breaking the assumption
> that it's okay to give N workers each 1/Nth of the single-worker I/O
> capacity.

Yeah, I think our I/O balancing approach was too simplistic to deal with
situations like this one. Factors I think break it are:

* modifying cost-limit/cost-delay doesn't translate exactly into 1:1
modifying I/O (in fact, it seems highly unlikely that it does)
* seek costs, as you mention
* FS cache issues and competition with user queries (per above)

> As far as bugs are concerned, I wonder if the premise of the calculation
>
> * The idea here is that we ration out I/O equally. The amount of I/O
> * that a worker can consume is determined by cost_limit/cost_delay, so we
> * try to equalize those ratios rather than the raw limit settings.
>
> might be wrong in itself? The ratio idea seems plausible but ...

Well, I think it's "plausible but wrong under at least some common
circumstances". In addition to seeking, it ignores FS cache effects
(not that I have any idea how to account for these mathematically). It
also makes the assumption that 3 autovacuum workers running at 1/3 speed
each is better than having one worker running at full speed, which is
debatable. And it makes the assumption that the main thing autovac
needs to share I/O with is itself ... instead of with user queries.

I'm not saying I have a formula which is better, or that we should junk
that logic and go back to not allocating at all. But we should see if
we can figure out something better. Lemme think about it.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: PgHacker <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We probably need autovacuum_max_wraparound_workers
Date: 2012-06-29 02:15:19
Message-ID: 20120629021519.GR1267@tamriel.snowman.net
Lists: pgsql-hackers

* Josh Berkus (josh(at)agliodbs(dot)com) wrote:
> I don't find Stephen's proposal of goal-based solutions to be practical.
> A goal-based approach makes the assumption that database activity is
> predictable, and IME most databases are anything but.

We're talking about the entire transaction space here, and we can be
pretty liberal, in my view, with our estimates. If we get it right, we
might risk doing more autovacs for wraparound than strictly necessary,
but they should happen over a long enough time that it doesn't cause
performance issues.

One definite problem with this, of course, is that the wraparound
autovac can't be stopped and restarted, and anything that increases the
amount of wall-clock time required to complete the autovac will
necessarily increase the risk that we'll lose a bunch of work due to a
database restart.

Thanks,

Stephen


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: PgHacker <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We probably need autovacuum_max_wraparound_workers
Date: 2012-06-29 02:26:42
Message-ID: 9169.1340936802@sss.pgh.pa.us
Lists: pgsql-hackers

Josh Berkus <josh(at)agliodbs(dot)com> writes:
> Well, I think it's "plausible but wrong under at least some common
> circumstances". In addition to seeking, it ignores FS cache effects
> (not that I have any idea how to account for these mathematically). It
> also makes the assumption that 3 autovacuum workers running at 1/3 speed
> each is better than having one worker running at full speed, which is
> debatable.

Well, no, not really, because the original implementation with only one
worker was pretty untenable. But maybe we need some concept like only
one worker working on *big* tables? Or at least, less than max_workers
of them.

regards, tom lane


From: Cédric Villemain <cedric(at)2ndquadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: We probably need autovacuum_max_wraparound_workers
Date: 2012-06-29 07:11:56
Message-ID: 201206290912.04591.cedric@2ndquadrant.com
Lists: pgsql-hackers

On Friday, June 29, 2012 at 04:26:42, Tom Lane wrote:
> Josh Berkus <josh(at)agliodbs(dot)com> writes:
> > Well, I think it's "plausible but wrong under at least some common
> > circumstances". In addition to seeking, it ignores FS cache effects
> > (not that I have any idea how to account for these mathematically). It
> > also makes the assumption that 3 autovacuum workers running at 1/3 speed
> > each is better than having one worker running at full speed, which is
> > debatable.
>
> Well, no, not really, because the original implementation with only one
> worker was pretty untenable. But maybe we need some concept like only
> one worker working on *big* tables? Or at least, less than max_workers
> of them.

I think it is easier to keep some workers available to work on other
tasks instead of having all of them doing the same longest job.

pgfincore has allowed snapshotting and restoring the OS cache for years
to work around such issues. Autovacuum could snapshot the next x MB
ahead and restore the previous cache state when done.

--
Cédric Villemain +33 (0)6 20 30 22 52
http://2ndQuadrant.fr/
PostgreSQL: Support 24x7 - Développement, Expertise et Formation


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: PgHacker <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: We probably need autovacuum_max_wraparound_workers
Date: 2012-07-01 22:06:25
Message-ID: CAMkU=1z6QZfOvJfRLwf49cqdwub2Cb+hev7kRYN7vxzsv+R3JA@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jun 28, 2012 at 6:57 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>
> A second obstacle to "opportunistic wraparound vacuum" is that
> wraparound vacuum is not interruptible. If you have to kill it off and
> do something else for a couple hours, it can't pick up where it left
> off; it needs to scan the whole table from the beginning again.

Would recording a different relfrozenxid for each 1GB chunk of the
relation solve that?
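
Something along these lines, perhaps, where the per-segment values are
purely hypothetical bookkeeping that doesn't exist today:

#include <stdio.h>
#include <stdint.h>

/*
 * Sketch of the idea in the question above: track a "frozen up to" XID
 * per 1GB segment instead of a single relfrozenxid per relation, so an
 * interrupted freeze vacuum could resume at the first segment that
 * still needs work.  The layout and numbers are hypothetical.
 */
#define NSEGMENTS 4                     /* a 4GB table */

int
main(void)
{
    uint32_t seg_frozenxid[NSEGMENTS] = {1200000, 1200000, 700000, 700000};
    uint32_t freeze_target = 1200000;   /* xid we want everything frozen to */
    uint32_t rel_frozenxid = seg_frozenxid[0];
    int      resume_at = NSEGMENTS;

    for (int i = 0; i < NSEGMENTS; i++)
    {
        if (seg_frozenxid[i] < rel_frozenxid)
            rel_frozenxid = seg_frozenxid[i];
        if (resume_at == NSEGMENTS && seg_frozenxid[i] < freeze_target)
            resume_at = i;
    }

    printf("effective relfrozenxid = %u, resume freezing at segment %d\n",
           (unsigned) rel_frozenxid, resume_at);
    return 0;
}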

>> Since your users weren't complaining about performance with one or two
>> autovac workers running (were they?),
>
> No, it's when we hit 3 that it fell over.  Thresholds vary with memory
> and table size, of course.

Does that mean it worked fine with 2 workers simultaneously in large
tables, or did that situation not occur and so it is not known whether
it would have worked fine or not?

> BTW, the primary reason I think (based on a glance at system stats) this
> drove the system to its knees was that the simultaneous wraparound
> vacuum of 3 old-cold tables evicted all of the "current" data out of the
> FS cache, forcing user queries which would normally hit the FS cache
> onto disk.  I/O throughput was NOT at 100% capacity.

Do you know if it was the input or the output that caused that to
happen? I would think the kernel has logic similar to our BAS (buffer
access strategy) to prevent a large sequential read from evicting all
the other data. But that logic might be defeated if all that data is
dirtied right after being read.

If the partitions had not been touched since the last freeze, then it
should generate no dirty blocks (right?), but if they were touched
since then you could basically be writing out the entire table.

Cheers,

Jeff