Performance features the 4th

From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Performance features the 4th
Date: 2003-11-05 19:06:58
Message-ID: 3FA94A52.8070603@Yahoo.com
Lists: pgsql-hackers

I've just uploaded

http://developer.postgresql.org/~wieck/all_performance.v4.74.diff.gz

This patch contains the "still not yet ready" performance improvements
discussed over the last couple of days.

_Shared buffer replacement_:

The buffer replacement strategy is a slightly modified version of ARC.
The modifications are some specializations of CDB promotions. Since
PostgreSQL always looks up buffers multiple times when updating (first
during the scan, then during heap_update() etc.), every updated
block would jump right into the T2 (frequently accessed) queue. To
prevent that, the Xid of the transaction that added a buffer to the T1
queue is remembered, and if a block is found in T1, the same transaction
will not promote it into T2. This also affects blocks accessed like
SELECT ... FOR UPDATE; UPDATE, as this is a usual strategy and does not
mean that this particular datum is accessed frequently.
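
A minimal sketch of the promotion rule described above, not the code
from the patch. The type and function names (CdbSketch, cdb_enter_t1,
cdb_hit) are hypothetical, and a plain integer stands in for the
transaction id:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t XidSketch;          /* stand-in for a transaction id */

typedef enum { LIST_T1, LIST_T2 } ArcList;

/* Hypothetical, trimmed-down cache directory block (CDB). */
typedef struct {
    ArcList   list;                  /* queue the buffer currently sits on */
    XidSketch t1_xid;                /* transaction that put it on T1 */
} CdbSketch;

/* A block that is faulted in starts on T1; remember who brought it in. */
void cdb_enter_t1(CdbSketch *cdb, XidSketch xid)
{
    assert(cdb != NULL);
    cdb->list = LIST_T1;
    cdb->t1_xid = xid;
}

/*
 * Cache hit. A hit on T1 promotes the block to T2 only if it comes from
 * a different transaction than the one that added it to T1.
 * Returns true if this hit caused a promotion.
 */
bool cdb_hit(CdbSketch *cdb, XidSketch xid)
{
    assert(cdb != NULL);
    if (cdb->list == LIST_T1 && cdb->t1_xid != xid) {
        cdb->list = LIST_T2;
        return true;
    }
    return false;
}
```

With this rule, the scan and the following heap_update() of one
transaction count as a single access, and only a hit from a second
transaction moves the block into T2.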

Blocks faulted in by vacuum are handled specially in that they end up at
the LRU end of the T1 queue, and when evicted from there their CDB is
destroyed instead of added to the B1 queue, to prevent vacuum from
polluting the cache's autotuning.

A guc variable

buffer_strategy_status_interval = 0 # 0-600 seconds

controls DEBUG1 messages, emitted every n seconds, showing the current
queue sizes and the cache hit rates during the last interval.

_Vacuum page delay_:

Tom Lane's napping during vacuums, with another tuning option. I replaced
the usleep() call with a PG_DELAY(msec) macro in miscadmin.h, which uses
select(2) instead. That should address the possible portability
problems.
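
For illustration, a select(2)-based millisecond delay can look roughly
like this. It is a sketch under assumptions, not the PG_DELAY definition
from the patch; the names pg_delay_sketch and PG_DELAY_SKETCH are made up
here:

```c
#include <assert.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>

/*
 * Hypothetical stand-in for PG_DELAY(msec). Sleeps by calling select(2)
 * with no file descriptors and only a timeout. Returns select()'s
 * result: 0 after a full timeout, -1 if the call was interrupted
 * (for example by a signal).
 */
int pg_delay_sketch(int msec)
{
    struct timeval delay;

    assert(msec >= 0);
    if (msec == 0)
        return 0;                    /* nothing to wait for */
    delay.tv_sec = msec / 1000;
    delay.tv_usec = (msec % 1000) * 1000;
    return select(0, NULL, NULL, NULL, &delay);
}

/* Macro form that ignores the result, in the spirit of PG_DELAY(msec). */
#define PG_DELAY_SKETCH(msec) ((void) pg_delay_sketch(msec))
```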

The config options

vacuum_page_group_delay = 0 # 0-100 milliseconds
vacuum_page_group_size = 10 # 1-1000 pages

control how many pages get vacuumed as a group and how long vacuum will
nap between groups.
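
The interplay of the two options can be sketched as a loop like the one
below. This is an assumption about the control flow, not the patch's
code: the function names are hypothetical, the nap is passed in as a
callback so the sketch can run without sleeping, and skipping the nap
after the final page is my own choice.

```c
#include <assert.h>

typedef void (*NapFn)(int msec);

/* Stand-in for the real delay: only records how long we would nap. */
long napped_ms = 0;

void record_nap(int msec)
{
    napped_ms += msec;
}

/*
 * Vacuum npages pages, napping group_delay_ms after every full group of
 * group_size pages. A delay of 0 disables napping. Returns the number
 * of naps taken.
 */
int vacuum_pages_sketch(int npages, int group_size, int group_delay_ms,
                        NapFn nap)
{
    int naps = 0;

    assert(group_size >= 1);
    for (int page = 0; page < npages; page++) {
        /* ... vacuum one page here ... */
        if (group_delay_ms > 0 &&
            (page + 1) % group_size == 0 &&
            page + 1 < npages) {
            nap(group_delay_ms);
            naps++;
        }
    }
    return naps;
}
```

For example, 25 pages with a group size of 10 and a 5 ms delay nap
twice, after pages 10 and 20.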

I think this can be improved further if vacuum gets feedback from the
buffer manager on whether a page was found clean or already dirty in
the cache, or had to be faulted in. This, together with whether vacuum
actually dirties the page, would result in a sort of "vacuum page cost"
that is accumulated and controls how often to nap. Vacuuming a page
that is found in the cache and has no dead tuples is then cheap, but
vacuuming a page that caused another dirty block to get evicted, was then
read in and finally ends up dirty because of dead tuples is expensive.
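
One way such a "vacuum page cost" could be accumulated is sketched
below. Nothing like this exists in the patch yet; the struct, the
function names and in particular the cost weights and the limit are
invented for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* What the buffer manager and vacuum would report for one page. */
typedef struct {
    bool found_in_cache;   /* page was already in shared buffers */
    bool evicted_dirty;    /* faulting it in evicted a dirty block */
    bool dirtied;          /* vacuum itself dirtied the page */
} PageVisit;

/* Invented weights; real values would need benchmarking. */
enum {
    COST_HIT = 1,
    COST_MISS = 10,
    COST_EVICT_DIRTY = 20,
    COST_DIRTIED = 20,
    COST_LIMIT = 200
};

typedef struct {
    int balance;           /* cost accumulated since the last nap */
} VacuumCostSketch;

/* Cost of a single page visit under the weights above. */
int page_cost_sketch(PageVisit v)
{
    int cost = v.found_in_cache ? COST_HIT : COST_MISS;

    if (!v.found_in_cache && v.evicted_dirty)
        cost += COST_EVICT_DIRTY;
    if (v.dirtied)
        cost += COST_DIRTIED;
    return cost;
}

/* Add one visit to the balance; returns true when vacuum should nap. */
bool vacuum_account_page(VacuumCostSketch *vc, PageVisit v)
{
    assert(vc != NULL);
    vc->balance += page_cost_sketch(v);
    if (vc->balance >= COST_LIMIT) {
        vc->balance = 0;   /* start a new accounting period */
        return true;
    }
    return false;
}
```

With these made-up numbers, a clean page found in the cache costs 1 and
the worst case costs 50, so vacuum would nap after 4 expensive pages but
only after 200 cheap ones.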

_Lazy checkpoint_:

This is the checkpoint process with the ability to spread the buffer
flushing over some time. Also, the buffers are written in an order
determined by the buffer replacement strategy. Currently that is a
merged list of dirty buffers in the order of the T1 and T2 queues of
ARC. Since buffers are replaced in that order, it causes backends to
find clean buffers for eviction more often.

The config options

lazy_checkpoint_time = 0 # 0-3600 seconds
lazy_checkpoint_group_size = 50 # 10-1000 pages
lazy_checkpoint_maxdelay = 500 # 100-1000 milliseconds

control how long the buffer flushing "should" take and how many dirty
pages to write as a group before syncing and napping. The maxdelay
parameter limits the nap between groups, so that a really small amount
of changes is not spread out over that long.
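
Read this way, the nap between groups could be derived as in the sketch
below. The formula is my interpretation of the three options, not
necessarily what the patch computes, and the function name is
hypothetical.

```c
#include <assert.h>

/*
 * Nap in milliseconds between two groups of dirty pages, so that
 * flushing ndirty pages takes roughly time_s seconds in total, but
 * never napping longer than maxdelay_ms. A time of 0 means "do not
 * stretch the checkpoint at all".
 */
int lazy_checkpoint_nap_ms(int ndirty, int time_s, int group_size,
                           int maxdelay_ms)
{
    int  ngroups;
    long nap;

    assert(group_size > 0);
    if (time_s <= 0 || ndirty <= 0)
        return 0;

    ngroups = (ndirty + group_size - 1) / group_size;   /* round up */
    nap = (long) time_s * 1000 / ngroups;
    return nap > maxdelay_ms ? maxdelay_ms : (int) nap;
}
```

With 100000 dirty pages, 300 seconds and groups of 50, this gives 2000
groups and a nap of 150 ms. With only 60 dirty pages the uncapped nap
would be 150 seconds, so maxdelay cuts it down to 500 ms and the
checkpoint finishes quickly.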

The syncing is currently done in a new function in md.c, mdfsyncrecent(),
called through the smgr. The intention is to maintain an LRU list of
recently written file descriptors and pg_fdatasync() them. I haven't
found the right place for that yet, so it simply does a system-global
sync().

My idea here is that it really does not matter how accurately the
individual files are forced to disk during this. All we care about is
causing the kernel to perform some physical writes while we're writing
the buffers out, instead of buffering those writes in the OS until we
finish the checkpoint.

The lazy checkpoint configuration should only affect automatic
checkpoints started by the postmaster because a checkpoint_timeout
occurred. Actually, it currently seems to apply to manually started
checkpoints as well. BufferSync() monitors the time to finish, held in
shared memory, so it would be relatively easy to hurry up a running lazy
checkpoint by setting that to zero. It's just that the postmaster can't
do that, because it does not have a PGPROC structure and therefore can't
lock that shmem structure. This is a must-fix item, because hurrying up
the checkpointer is critical at shutdown time.

_TODO_:

* Replace the global sync() in mdfsyncrecent(int max) with calls to
pg_fdatasync()

* Add functionality to postmaster to hurry up a running checkpoint
at shutdown.

* Make sure that manual checkpoints are not affected by the lazy
checkpoint config options and that they too hurry up a running one.

* Further improve vacuum's napping strategy depending on the IO
actually caused per page.

_NOTE_:

The core team is well aware of the high demand for these features. As
things stand, however, it is impossible to get this functionality
released in version 7.4.

That does not mean that we have no chance to include some or all of the
functionality in a subsequent 7.4.x release. But for that to happen, the
TODO items mentioned above must get done first. Further, we need a
good amount of evidence that these changes actually achieve the desired
effect to a degree that justifies breaking our "no features in dot
releases" rule. We also need a good amount of evidence that the features
don't break anything or sacrifice stability, and that a backward
compatible behaviour (where possible ... not possible with ARC vs. LRU)
is the default.

I personally would like to see this work included in a 7.4.x release.
But it requires people to actually run tests, stress some hardware,
check platform portability and *give us feedback*, because this is what
we get for the release candidates, and these improvements can under no
circumstances be of any lower quality than that. If this goes into a
7.4.x release and there is any platform-dependent issue in it, it
endangers the timely fix of other bugs for those platforms, and that's
a no-go.

Happy testing

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck(at)Yahoo(dot)com #
