Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system

From: Tomas Vondra <tv(at)fuzzy(dot)cz>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
Date: 2013-02-05 22:31:21
Message-ID: 51118839.3020908@fuzzy.cz
Lists: pgsql-hackers

On 5.2.2013 19:23, Jeff Janes wrote:
> On Sun, Feb 3, 2013 at 4:51 PM, Tomas Vondra <tv(at)fuzzy(dot)cz> wrote:
>
>> LOG: last_statwrite 11133-08-28 19:22:31.711744+02 is later than
>> collector's time 2013-02-04 00:54:21.113439+01 for db 19093
>> WARNING: pgstat wait timeout
>> LOG: last_statwrite 39681-12-23 18:48:48.9093+01 is later than
>> collector's time 2013-02-04 00:54:31.424681+01 for db 46494
>> FATAL: could not find block containing chunk 0x2af4a60
>> LOG: statistics collector process (PID 10063) exited with exit code 1
>> WARNING: pgstat wait timeout
>> WARNING: pgstat wait timeout
>>
>> I'm not entirely sure where the FATAL came from, but it seems it was
>> somehow related to the issue - it was quite reproducible, although I
>> don't see how exactly this could happen. The relevant block of code
>> looks like this:
>>
>> char *writetime;
>> char *mytime;
>>
>> /* Copy because timestamptz_to_str returns a static buffer */
>> writetime = pstrdup(timestamptz_to_str(dbentry->stats_timestamp));
>> mytime = pstrdup(timestamptz_to_str(cur_ts));
>> elog(LOG, "last_statwrite %s is later than collector's time %s for "
>> "db %d", writetime, mytime, dbentry->databaseid);
>> pfree(writetime);
>> pfree(mytime);
>>
>> which seems quite fine to me. I'm not sure how one of the pfree calls
>> could fail?
>
> I don't recall seeing the FATAL errors myself, but didn't keep the
> logfile around. (I do recall seeing the pgstat wait timeout).
>
> Are you using windows? pstrdup seems to be different there.

Nope. I'll repeat the test with the original patch to find out what went
wrong, just to be sure it was fixed.

> I'm afraid I don't have much to say on the code. Indeed I never even
> look at it (other than grepping for pstrdup just now). I am taking a
> purely experimental approach, since Alvaro and others have looked at
> the code.

Thanks for finding the issue with the uninitialized timestamp!

> If I shutdown the server and blow away the stats with "rm
> data/pg_stat/*", it recovers gracefully when I start it back up. If a
> do "rm -r data/pg_stat" then it has problems the next time I shut it
> down, but I have no right to do that in the first place. If I initdb
> a database without this patch, then shut it down and restart with
> binaries that include this patch, I need to manually make the
> pg_stat directory. Does that mean it needs a catalog bump in order to
> force an initdb?

Ummmm, what do you mean by "catalog bump"?

Anyway, messing with files in the "base" directory is a bad idea in
general, and I don't think that's a reason to treat the pg_stat
directory differently. If you remove it by hand, you'll be rightfully
punished by various errors.

> A review:
>
> It applies cleanly (some offsets, no fuzz), builds without warnings,
> and passes make check including with cassert.
>
> The final test done in "make check" inherently tests this code, and it
> passes. If I intentionally break the patch by making
> pgstat_read_db_statsfile add one to the oid it opens, then the test
> fails. So the existing test is at least plausible as a test.
>
> doc/src/sgml/monitoring.sgml needs to be changed: "a permanent copy of
> the statistics data is stored in the global subdirectory". I'm not
> aware of any other needed changes to the docs.

Yeah, that should be "in the global/pg_stat subdirectory".

> The big question is whether we want this. I think we do. While
> having hundreds of databases in a cluster is not recommended, that is
> no reason not to handle it better than we do. I don't see any
> down-sides, other than possibly some code uglification. Some file
> systems might not deal well with having lots of small stats files
> being rapidly written and rewritten, but it is hard to see how the
> current behavior would be more favorable for those systems.

If the filesystem has issues with that many entries, it's already hosed
by the contents of the "base" directory (one subdirectory per database)
or by the database directories themselves (multiple files per table).

Moreover, it's still possible to use tmpfs to handle this at runtime
(which is often the recommended solution with the current code), and use
the actual filesystem only for keeping the data across restarts.
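
Just to illustrate what I have in mind - a rough sketch, where the mount
point and size are only examples, not recommendations:

    # mount a small tmpfs for the frequently rewritten runtime stats
    mount -t tmpfs -o size=64M tmpfs /mnt/pg_stats_tmp

    # postgresql.conf - only the temporary copies go to tmpfs; the
    # permanent copy written at shutdown stays in the data directory
    # and survives restarts
    stats_temp_directory = '/mnt/pg_stats_tmp'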

> We do not already have this. There is no relevant spec. I can't see
> how this could need pg_dump support (but what about pg_upgrade?)

pg_dump - no

pg_upgrade - IMHO it should create the pg_stat directory. I don't think
it could "convert" the old statfile into the new format (by splitting it
into the pieces). I haven't checked, but I believe the default behavior
is to delete it anyway, as there might be new fields / slight changes of
meaning etc.

> I am not aware of any dangers.
>
> I have a question about its completeness. When I first start up the
> cluster and have not yet touched it, there is very little stats
> collector activity, either with or without this patch. When I kick
> the cluster sufficiently (I've been using vacuumdb -a to do that) then
> there is a lot of stats collector activity. Even once the vacuumdb
> has long finished, this high level of activity continues even though
> the database is otherwise completely idle, and this seems to happen
> forever. This patch makes that high level of activity much more
> efficient, but it does not reduce the activity. I don't understand
> why an idle database cannot get back into the state right after
> start-up.

What do you mean by "stats collector activity"? Is it reading/writing a
lot of data, or is it just using a lot of CPU?

Isn't that just natural and expected behavior, given that the database
needs to actually run ANALYZE to collect the data? Even though the
tables are empty, that costs some CPU / IO, and there are a lot of them
(1000 dbs, each with 200 tables).

I don't think there's a way around this. You may increase the autovacuum
naptime, but that's about all.
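
Just for completeness, that's a single line in postgresql.conf (5min is
an arbitrary example value):

    # minimum delay between autovacuum runs on any given database
    # (default 1min); raising it means the workers request / rewrite
    # the stats less often
    autovacuum_naptime = 5min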

> I do not think that the patch needs to solve this problem in order to
> be accepted, but if it can be addressed while the author and reviewers
> are paying attention to this part of the system, that would be ideal.
> And if not, then we should at least remember that there is future work
> that could be done here.

If I understand that correctly, you see the same behaviour even without
the patch, right? In that case I'd vote not to make the patch more
complex, and try to improve that separately (if it's even possible).

> I created 1000 databases each with 200 single column tables (x serial
> primary key).
>
> After vacuumdb -a, I let it idle for a long time to see what steady
> state was reached.
>
> without the patch:
> vacuumdb -a real 11m2.624s
> idle steady state: 48.17% user, 39.24% system, 11.78% iowait, 0.81% idle.
>
> with the patch:
> vacuumdb -a real 6m41.306s
> idle steady state: 7.86% user, 5.00% system 0.09% iowait 87% idle,

Nice. Other interesting numbers would be device utilization, average
I/O speed, and required disk space (which should be ~2x the pgstat.stat
size without the patch).
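
Something along these lines should be enough to collect those numbers
(assuming Linux with sysstat installed; the path matches the per-db
layout of the patch):

    # per-device utilization and throughput, sampled every 5 seconds
    iostat -x 5

    # disk space used by the split stats files
    du -sh data/pg_stat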

> I also ran pgbench on a scale that fits in memory with fsync=off, on a
> single CPU machine, with the same above-mentioned 1000 databases as
> unused decoys to bloat the stats file.
>
> pgbench_tellers and branches undergo enough turnover that they should
> get vacuumed every minute (nap time).
>
> Without the patch, they only get vacuumed every 40 minutes or so as
> the autovac workers are so distracted by reading the bloated stats
> file, and the TPS is ~680.
>
> With the patch, they get vacuumed every 1 to 2 minutes and TPS is ~940

Great. I haven't really aimed at improving pgbench results, but it seems
natural that the CPU time freed up by the collector can go somewhere
else. Not bad.

Have you moved the stats to tmpfs, or did you use the default location
(on disk)?

Tomas
