Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Tomas Vondra <tv(at)fuzzy(dot)cz>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
Date: 2013-02-05 18:23:27
Message-ID: CAMkU=1w2zdfSzmm796ZhsaDH56_yrN2+9v_A367tAB7--XZX6g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sun, Feb 3, 2013 at 4:51 PM, Tomas Vondra <tv(at)fuzzy(dot)cz> wrote:

> LOG: last_statwrite 11133-08-28 19:22:31.711744+02 is later than
> collector's time 2013-02-04 00:54:21.113439+01 for db 19093
> WARNING: pgstat wait timeout
> LOG: last_statwrite 39681-12-23 18:48:48.9093+01 is later than
> collector's time 2013-02-04 00:54:31.424681+01 for db 46494
> FATAL: could not find block containing chunk 0x2af4a60
> LOG: statistics collector process (PID 10063) exited with exit code 1
> WARNING: pgstat wait timeout
> WARNING: pgstat wait timeout
>
> I'm not entirely sure where the FATAL came from, but it seems it was
> somehow related to the issue - it was quite reproducible, although I
> don't see how exactly could this happen. There relevant block of code
> looks like this:
>
> char *writetime;
> char *mytime;
>
> /* Copy because timestamptz_to_str returns a static buffer */
> writetime = pstrdup(timestamptz_to_str(dbentry->stats_timestamp));
> mytime = pstrdup(timestamptz_to_str(cur_ts));
> elog(LOG, "last_statwrite %s is later than collector's time %s for "
> "db %d", writetime, mytime, dbentry->databaseid);
> pfree(writetime);
> pfree(mytime);
>
> which seems quite fine to mee. I'm not sure how one of the pfree calls
> could fail?

I don't recall seeing the FATAL errors myself, but didn't keep the
logfile around. (I do recall seeing the pgstat wait timeout).

Are you using windows? pstrdup seems to be different there.

I'm afraid I don't have much to say on the code. Indeed I never even
look at it (other than grepping for pstrdup just now). I am taking a
purely experimental approach, Since Alvaro and others have looked at
the code.

>
> Anyway, attached is a patch that fixes all three issues, i.e.
>
> 1) the un-initialized timestamp
> 2) the "void" ommited from the signature
> 3) rename to "pg_stat" instead of just "stat"

Thanks.

If I shutdown the server and blow away the stats with "rm
data/pg_stat/*", it recovers gracefully when I start it back up. If a
do "rm -r data/pg_stat" then it has problems the next time I shut it
down, but I have no right to do that in the first place. If I initdb
a database without this patch, then shut it down and restart with
binaries that include this patch, and need to manually make the
pg_stat directory. Does that mean it needs a catalog bump in order to
force an initdb?

A review:

It applies cleanly (some offsets, no fuzz), builds without warnings,
and passes make check including with cassert.

The final test done in "make check" inherently tests this code, and it
passes. If I intentionally break the patch by making
pgstat_read_db_statsfile add one to the oid it opens, then the test
fails. So the existing test is at least plausible as a test.

doc/src/sgml/monitoring.sgml needs to be changed: "a permanent copy of
the statistics data is stored in the global subdirectory". I'm not
aware of any other needed changes to the docs.

The big question is whether we want this. I think we do. While
having hundreds of databases in a cluster is not recommended, that is
no reason not to handle it better than we do. I don't see any
down-sides, other than possibly some code uglification. Some file
systems might not deal well with having lots of small stats files
being rapidly written and rewritten, but it is hard to see how the
current behavior would be more favorable for those systems.

We do not already have this. There is no relevant spec. I can't see
how this could need pg_dump support (but what about pg_upgrade?)

I am not aware of any dangers.

I have a question about its completeness. When I first start up the
cluster and have not yet touched it, there is very little stats
collector activity, either with or without this patch. When I kick
the cluster sufficiently (I've been using vacuumdb -a to do that) then
there is a lot of stats collector activity. Even once the vacuumdb
has long finished, this high level of activity continues even though
the database is otherwise completely idle, and this seems to happen
for every. This patch makes that high level of activity much more
efficient, but it does not reduce the activity. I don't understand
why an idle database cannot get back into the state right after
start-up.

I do not think that the patch needs to solve this problem in order to
be accepted, but if it can be addressed while the author and reviewers
are paying attention to this part of the system, that would be ideal.
And if not, then we should at least remember that there is future work
that could be done here.

I created 1000 databases each with 200 single column tables (x serial
primary key).

After vacuumdb -a, I let it idle for a long time to see what steady
state was reached.

without the patch:
vacuumdb -a real 11m2.624s
idle steady state: 48.17% user, 39.24% system, 11.78% iowait, 0.81% idle.

with the patch:
vacuumdb -a real 6m41.306s
idle steady state: 7.86% user, 5.00% system 0.09% iowait 87% idle,

I also ran pgbench on a scale that fits in memory with fsync=off, on a
singe CPU machine. With the same above-mentioned 1000 databases as
unused decoys to bloat the stats file.

pgbench_tellers and branches undergo enough turn over that they should
get vacuumed every minuted (nap time).

Without the patch, they only get vacuumed every 40 minutes or so as
the autovac workers are so distracted by reading the bloated stats
file. and the TPS is ~680.

With the patch, they get vacuumed every 1 to 2 minutes and TPS is ~940

So this seems to be effective at it intended goal.

I have not done a review of the code itself, only the performance.

Cheers,

Jeff

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2013-02-05 18:29:07 Re: split rm_name and rm_desc out of rmgr.c
Previous Message Noah Misch 2013-02-05 18:22:06 Re: Visual Studio 2012 RC