From: Greg Smith <greg(at)2ndQuadrant(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Displaying accumulated autovacuum cost
Date: 2011-08-17 21:04:33
Message-ID: 4E4C2CE1.5000801@2ndQuadrant.com
Lists: pgsql-hackers

Attached is a patch that tracks and displays the accumulated cost when
autovacuum is running. Code by Noah Misch and myself. I hope this idea
will bring a formal process to vacuum tuning, which is currently too
hard to do. I was about to qualify that with "without...", but then
realized it needs no disclaimer; it's just too hard, period. Vacuum
issues are enemy #1 at all the terabyte-scale customer sites I've been
fighting with lately.

The patch updates the command string just before the workers sleep to
show how much work they've done so far. And at the end, it adds a few
new lines to the information written to the logs, when the autovacuum is
notable enough to be logged at all. The overhead it adds is at most a
few integer operations per buffer processed and a slower title string
update once per sleep, which is trivial compared to both the vacuum
itself and to the instrumentation's value at sites with vacuum issues.

To demonstrate the patch in action, here's a test case using a 6.4GB
pgbench_accounts table:

$ createdb pgbench
$ pgbench -i -s 500 pgbench
$ psql -d pgbench -c "select pg_relation_size('pgbench_accounts');"
pg_relation_size
------------------
6714761216
$ psql -d pgbench -c "select relname,relpages from pg_class where
relname='pgbench_accounts';"
relname | relpages
------------------+----------
pgbench_accounts | 819673
$ psql -d pgbench -c "delete from pgbench_accounts where aid<20000000"

You can see the new information in the command string with ps and grep:

$ while [ 1 ] ; do (ps -eaf | grep "[a]utovacuum worker" && sleep 60) ; done
gsmith 2687 17718 0 15:44 ? 00:00:00 postgres: autovacuum
worker process h=19 m=14196 d=14185
...
gsmith 2687 17718 0 15:44 ? 00:00:09 postgres: autovacuum
worker process h=182701 m=301515 d=321345
...
gsmith 2687 17718 1 15:44 ? 00:00:23 postgres: autovacuum
worker process h=740359 m=679987 d=617559
...

That's accumulated hit/miss/dirty counts, the raw numbers. When the
autovacuum is finished, those totals appear as a new line in the log entry:

LOG: automatic vacuum of table "pgbench.public.pgbench_accounts": index
scans: 1
pages: 0 removed, 819673 remain
tuples: 19999999 removed, 30000022 remain
buffer usage: 809537 hits, 749340 misses, 686660 dirtied
system usage: CPU 5.70s/19.73u sec elapsed 2211.60 sec

To check if this makes sense, we need the relevant parameters at the
time, which were the defaults (I only tweaked some basic config bits
here, including shared_buffers=400MB so a bit more was cached):

vacuum_cost_page_hit = 1 # 0-10000 credits
vacuum_cost_page_miss = 10 # 0-10000 credits
vacuum_cost_page_dirty = 20 # 0-10000 credits
vacuum_cost_limit = 200 # 1-10000 credits
autovacuum_vacuum_cost_delay = 20ms

A 20ms delay means the worker wakes up 50 times per second. The cost
accumulation should therefore be 200 * 50 = 10,000 per second, or about
600K per minute. That's how fast the cost should be increasing here.
Given a runtime of 2211.60 seconds, that's a total estimated cost of
2209.15 * 10000 = 22,091,500. Now we check that against the totals
printed at the end of the vacuum:
1 * 809537 hits=809,537
10 * 749340 misses=7,493,400
20 * 686607 dirtied=13,732,140

That gives a directly computed total of 22,035,077, close enough to
show this is working as expected. How I computed all that should also
give you an idea of how to use these numbers to extract other useful
statistics, for example if you'd like to tune the balance of the various
vacuum_cost_page_* parameters. I have no idea how anyone could ever set
those relative to one another without this data; it would take epic
guessing skills.
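
To make that arithmetic easy to repeat against your own log output,
here's a quick Python sketch. It's not part of the patch; the parameter
values and buffer counts hard-coded below are the ones from this run:

# Verify autovacuum cost totals against the expected accumulation rate.
vacuum_cost_page_hit = 1
vacuum_cost_page_miss = 10
vacuum_cost_page_dirty = 20
vacuum_cost_limit = 200
autovacuum_vacuum_cost_delay = 0.020      # seconds

# Buffer counts from the "buffer usage" log line, elapsed time from
# the "system usage" line
hits, misses, dirtied = 809537, 749340, 686660
elapsed = 2211.60

# One cost_limit's worth of work per delay interval
expected_total = (vacuum_cost_limit / autovacuum_vacuum_cost_delay) * elapsed

observed_total = (hits * vacuum_cost_page_hit
                  + misses * vacuum_cost_page_miss
                  + dirtied * vacuum_cost_page_dirty)

print("expected ~%d, observed %d" % (expected_total, observed_total))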

What else can you do with this data?

- Figure out whether a VACUUM is still making progress when it appears stuck.
- Estimate how long it will take to finish, based on current progress and
whatever total cost was logged the last time VACUUM ran against this
relation (see the sketch after this list).
- Compute an approximate hit rate on the read side. OS caching and the
ring buffer obviously muddy that figure, so it isn't too valuable.
- See how the cost is split when multiple vacuums are running. This problem
is why sites can't just use "total time to vacuum" as a useful proxy to
estimate how long one will take to run.
- Easily track the read/write ratio.
- Directly measure the write rate.
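
The time-to-finish estimate is simple enough to sketch in a few lines of
Python. This assumes the total cost from the previous vacuum of the same
table is a reasonable predictor and that the accumulation rate stays
steady; the numbers in the example call are made up:

# Rough time-remaining estimate for a running autovacuum.
def estimated_seconds_remaining(current_cost, last_run_total_cost,
                                elapsed_seconds):
    # Extrapolate from the accumulation rate observed so far.
    if current_cost <= 0 or elapsed_seconds <= 0:
        return None
    rate = current_cost / elapsed_seconds            # cost units per second
    remaining = max(last_run_total_cost - current_cost, 0)
    return remaining / rate

# Worker has accumulated 7.5M of cost after 750 seconds; the last vacuum
# of this table totaled about 22M.
print(estimated_seconds_remaining(7500000, 22000000, 750.0))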

The last item, directly measuring the write rate, is I think the part
people are most perplexed by right now, and this patch makes it trivial.
How do you turn all these cost figures into real-world read/write rates?
Until now that's been hard to do.

Now, you can take a series of samples of the data at one-minute intervals,
like my little "ps | grep" example above does. The delta in the
"d=" column is how many buffers were dirtied per minute, in units of
(usually) 8K buffers. Multiply that by 8192/(60*1024*1024) and you get
MB/s. I collected that data for a cleanup run of the
pgbench_accounts damage done above; a CSV file with all the statistics is
attached.
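
If you'd rather not do that conversion by hand, here's a small Python
sketch of the same monitoring loop: it scrapes the d= field out of the
worker's command string once a minute and turns the delta into MB/s. It
assumes the h=/m=/d= process title added by this patch and the standard
8K block size; it's just a monitoring aid, not part of the patch:

# Sample the autovacuum worker's command string and report a write rate
# derived from the delta of the d= (dirtied buffers) field.
import re, subprocess, time

BLOCK_SIZE = 8192      # bytes per buffer
INTERVAL = 60          # seconds between samples

def sample_dirtied():
    # Return the d= value of the first autovacuum worker found, or None.
    ps = subprocess.run(["ps", "-eo", "args"], capture_output=True, text=True)
    for line in ps.stdout.splitlines():
        if "autovacuum worker" in line:
            m = re.search(r"d=(\d+)", line)
            if m:
                return int(m.group(1))
    return None

last = sample_dirtied()
while True:
    time.sleep(INTERVAL)
    current = sample_dirtied()
    if last is not None and current is not None and current >= last:
        mb_per_s = (current - last) * BLOCK_SIZE / (INTERVAL * 1024.0 * 1024.0)
        print("d= delta %d buffers, %.2f MB/s" % (current - last, mb_per_s))
    last = current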

I also collected OS-level statistics from Linux about the actual
read/write rate of the process, converted into "Write Mbps" (those are
actually MB/s; the sloppy capitalization is courtesy of OpenOffice
"autocorrect"). Those numbers are close enough to make me confident the
dirty buffer totals tracked here do turn into useful MB/s values. Here's
a sample of the most interesting part:

Cost Delta   Dirty Mbps   Write Mbps
   589,890         2.56         2.73
   591,151         2.57         2.73
   589,035         2.56         2.72
   593,775         3.14         0.20
   599,420         2.05         0.00
   598,503         2.05         0.00
   599,421         2.05         0.00
   574,046         0.60         0.01
   574,779         0.64         0.67
   609,140         2.56         2.68
   612,397         2.57         2.69
   611,744         2.57         2.69
   610,008         2.56         2.68

This shows the expected ~600K/minute cost accumulation. And using the
d= numbers to compute MB/s of write speed closely matches the total
write speed of this process. Some of the difference might be I/O to
things other than the main table here, and some of it is just because OS
write caching influences the write rate. In the spots where the OS
value and the figure derived from the dirty rate diverge most, it appears
to be because vacuum is filling Linux's write cache; the actual writes
accumulated against the process then block for a while. It's a small
difference most of the time.

I'd be willing to accept a "Dirty MB/s" figure computed this way as
accurate enough for most purposes. And this patch lets you get that
data, currently unavailable without poking into OS statistics (if at
all), just by doing a little log file and/or command string scraping:
a total at the end, or real-time monitoring, depending on how much work
you want to put into it. For a busy site where one or more autovacuum
processes are running most of the time, being able to monitor the vacuum
portion of the I/O this way will be a huge improvement over the current
state of things. I'm already working on a stack of tools built on top of
this data, and they're making it much easier to come up with an iterative
tuning process for autovacuum.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us

Attachment Content-Type Size
pgbench-vacuum-stats.csv text/csv 5.1 KB
vacuum_stats_v2.patch text/x-patch 5.7 KB
