Re: Hot standby, recovery infra

From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot standby, recovery infra
Date: 2009-02-05 12:18:14
Message-ID: 498AD906.1030507@enterprisedb.com
Lists: pgsql-hackers

Simon Riggs wrote:
> On Thu, 2009-02-05 at 13:18 +0200, Heikki Linnakangas wrote:
>> Simon Riggs wrote:
>>> On Thu, 2009-02-05 at 11:46 +0200, Heikki Linnakangas wrote:
>>>> Simon Riggs wrote:
>>>>> So we might end up flushing more often *and* we will be doing it
>>>>> potentially in the code path of other users.
>>>> For example, imagine a database that fits completely in shared buffers.
>>>> If we update at every XLogFileRead, we have to fsync every 16MB of WAL.
>>>> If we update in XLogFlush the way I described, you only need to update
>>>> when we flush a page from the buffer cache, which will only happen at
>>>> restartpoints. That's far less updates.
>>> Oh, did you change the bgwriter so it doesn't do normal page cleaning?
>> No. Ok, that wasn't completely accurate. The page cleaning by bgwriter
>> will perform XLogFlushes, but that should be pretty insignificant. When
>> there's little page replacement going on, bgwriter will do a small
>> trickle of page cleaning, which won't matter much.
>
> Yes, that case is good, but it wasn't the use case we're trying to speed
> up by having the bgwriter active during recovery. We're worried about
> I/O bound recoveries.

Ok, let's do the math:

By updating minRecoveryPoint in XLogFileRead, you're fsyncing the
control file once every 16MB of WAL.

By updating in XLogFlush, the frequency depends on the amount of
shared_buffers available to buffer the modified pages, the average WAL
record size, and the cache hit ratio. Let's determine the worst case:

The smallest WAL record that dirties a page is a heap deletion record.
That contains just enough information to locate the tuple. If I'm
reading the headers right, that record is 48 bytes long (28 bytes of
xlog header + 18 bytes of payload + padding). Assuming that the WAL is
full of just those records, that there are no full-page images, and that
the cache hit ratio is 0%, every record dirties a different 8 kB page,
so we will need (16 MB / 48 B) * 8 kB = 2730 MB of shared_buffers to
reach the mark of one fsync per 16 MB of WAL.
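
As a quick check of that arithmetic, here is a minimal Python sketch. The
constant and function names are made up for illustration, and rounding the
record up to a multiple of 8 bytes is my assumption about where the
"padding" comes from; the other figures are the ones quoted above.

```python
# Worst-case estimate: every WAL record is a 48-byte heap delete that
# dirties a page not yet in shared_buffers (0% cache hit ratio).

WAL_SEGMENT_BYTES = 16 * 1024 * 1024   # one WAL segment, 16 MB
BLOCK_BYTES = 8 * 1024                 # one buffer page, 8 kB
XLOG_HEADER_BYTES = 28                 # xlog record header, per the text
DELETE_PAYLOAD_BYTES = 18              # heap delete payload, per the text
ALIGNMENT = 8                          # ASSUMPTION: records padded to 8 bytes


def align_up(nbytes, alignment=ALIGNMENT):
    """Round nbytes up to the next multiple of alignment."""
    return (nbytes + alignment - 1) // alignment * alignment


def worst_case_shared_buffers_mb():
    """MB of shared_buffers needed to hold every page dirtied by one segment."""
    record_bytes = align_up(XLOG_HEADER_BYTES + DELETE_PAYLOAD_BYTES)
    records_per_segment = WAL_SEGMENT_BYTES / record_bytes
    return records_per_segment * BLOCK_BYTES / (1024 * 1024)


if __name__ == "__main__":
    print(f"record size: {align_up(46)} bytes")
    print(f"shared_buffers needed: {worst_case_shared_buffers_mb():.1f} MB")
```

The result is about 2730.7 MB, which is the 2730 MB (roughly 2.7 GB)
figure used here.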

So if your shared_buffers setting is lower than 2.7 GB, you can get
more frequent fsyncs this way in the worst case. In the typical case,
you're probably not doing all deletes, and you have a non-zero cache
hit ratio, so you achieve the same frequency with a much lower
shared_buffers setting. And if you're really that I/O bound, I
doubt the few extra fsyncs matter much.
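
To get a feel for how quickly the requirement drops outside the worst case,
here is a rough Python sketch. It assumes a deliberately simple model that
is not taken from the PostgreSQL source: each WAL record touches one page,
and only cache misses bring a new page into shared_buffers. The function
name and parameters are hypothetical.

```python
# Rough model: shared_buffers needed so that one 16 MB WAL segment can be
# replayed without evicting (and therefore flushing) a dirty page.
# ASSUMPTION: one page touched per record; a cache hit re-dirties a page
# that is already resident, so only misses consume additional buffers.

MB = 1024 * 1024


def shared_buffers_needed_mb(record_bytes, hit_ratio,
                             segment_bytes=16 * MB, block_bytes=8 * 1024):
    """Estimate MB of shared_buffers for one fsync per WAL segment."""
    if record_bytes <= 0:
        raise ValueError("record_bytes must be positive")
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    records = segment_bytes / record_bytes
    new_pages = records * (1.0 - hit_ratio)
    return new_pages * block_bytes / MB


if __name__ == "__main__":
    for record_bytes, hit_ratio in [(48, 0.0), (48, 0.9), (96, 0.0), (96, 0.9)]:
        need = shared_buffers_needed_mb(record_bytes, hit_ratio)
        print(f"{record_bytes:3d} B records, {hit_ratio:.0%} hits: {need:7.1f} MB")
```

Under this model, doubling the average record size halves the requirement,
and a 90% hit ratio cuts it by a factor of ten.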

Also note that when the control file is updated in XLogFlush, it's
typically the bgwriter doing it as it cleans buffers ahead of the clock
hand, not the startup process.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
