Dirty pages in freelist cause WAL stuck

Lists: pgsql-hackers
From: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Dirty pages in freelist cause WAL stuck
Date: 2006-12-18 02:55:15
Message-ID: 20061218112321.6990.ITAGAKI.TAKAHIRO@oss.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hello,

I'm testing the recently changes of WAL entries for freezing-tuples.
VACUUM FREEZE took more time. The cause seems to be flushing WAL buffers.

Vacuuming processes free buffers into freelist. The buffers in freelist is
preferentially used on next allocation of buffers. Then, if the buffer is
dirty, the allocator must write it before reuse. However, there are few buffers
in freelist typically, buffers made dirty recently are reused too soon
-- The WAL entries for the dirty buffer has not been flushed yet, so the
allocator flushes WAL, writes the buffer, and finally reuses it.

One solution is always keeping some buffers in freelist. If there were
N buffers in freelist, the necessity of WAL-flusing was reduced to 1/N,
because all WAL entries are flushed when we do one of them.

The attached patch is an experimental implementation of the above. Keeping
32 buffers seems to be enough when executed separately. With some background
jobs, other numbers may be better.

N | time | XLogWrite/XLogFlush
---+-------+---------------------
1 | 68.2s | 25.6%
8 | 57.4s | 10.8%
32 | 54.0s | 3.4%

[initial data]
$ pgbench -s 40 -i;
# VACUUM FREEZE
[test]
# UPDATE accounts SET aid=aid WHERE random() < 0.005;
# checkpoint;
# VACUUM FREEZE accounts;

I cannot see the above problem in non-freeze vacuum. The number buffers
in freelist increases on index-vacuuming phase. When the vacuum found
seldom used buffers (refcount==0 and usage_count==0), they are added into
freelist. So the WAL entries generated in index-vacuuming or heap-vacuuming
phase are not so serious. However, entries for FREEZE are generated in
heap-scanning phase, it is before index-vacuuming.

Are there any better fixes? Comments welcome.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

Attachment Content-Type Size
freelist_buffers.patch application/octet-stream 4.6 KB

From: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
To: "ITAGAKI Takahiro" <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Dirty pages in freelist cause WAL stuck
Date: 2006-12-18 15:49:29
Message-ID: 1166456970.3654.117.camel@silverbirch.site
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, 2006-12-18 at 11:55 +0900, ITAGAKI Takahiro wrote:

> I'm testing the recently changes of WAL entries for freezing-tuples.
> VACUUM FREEZE took more time. The cause seems to be flushing WAL buffers.

Great thinking.

> Vacuuming processes free buffers into freelist. The buffers in freelist is
> preferentially used on next allocation of buffers. Then, if the buffer is
> dirty, the allocator must write it before reuse. However, there are few buffers
> in freelist typically, buffers made dirty recently are reused too soon
> -- The WAL entries for the dirty buffer has not been flushed yet, so the
> allocator flushes WAL, writes the buffer, and finally reuses it.

I think what you are saying is: VACUUM places blocks so that they are
immediately reused. This stops shared_buffers from being polluted by
vacuumed-blocks, but it also means that almost every write becomes a
backend dirty write when VACUUM is working, bgwriter or not. That also
means that we flush WAL more often than we otherwise would.

> One solution is always keeping some buffers in freelist. If there were
> N buffers in freelist, the necessity of WAL-flusing was reduced to 1/N,
> because all WAL entries are flushed when we do one of them.

That sounds very similar to an idea I'd been working on which I'd called
cache looping. There is a related (but opposite) problem with sequential
scans - they don't move through the cache fast enough. A solution to
both issues is to have the Vacuum/SeqScans continually reuse a small
pool of buffers, rather than request the next one from the buffer
manager in the normal way.

> The attached patch is an experimental implementation of the above. Keeping
> 32 buffers seems to be enough when executed separately. With some background
> jobs, other numbers may be better.
>
> N | time | XLogWrite/XLogFlush
> ---+-------+---------------------
> 1 | 68.2s | 25.6%
> 8 | 57.4s | 10.8%
> 32 | 54.0s | 3.4%

I think this is good proof; well done.

>From above my thinking would be to have a more general implementation:
Each backend keeps a list of cache buffers to reuse in its local loop,
rather than using the freelist as a global list. That way the technique
would work even when we have multiple Vacuums working concurrently. It
would also then be possible to use this for the SeqScan case as well.

Cache looping would be implemented by a modified BufferAlloc routine,
say BufferScanAlloc() that is called only when a StrategyUseCacheLoop()
has been called during SeqScan or VacuumScan. strategy_cache_loop would
replace strategy_hint_vacuum.

Each backend would have a list of previous N buffers touched. When N
=Nmax, we would link to oldest buffer to form a linked ring. Each time
we need next buffer we read from the ring rather than from the main
clock sweep. If the buffer identified is pinned, then we drop that from
the ring and apply normally for a new buffer and keep that instead. At
the end of the scan, we simply forget the buffer ring.

Another connected thought is the idea of a having a FullBufferList - the
opposite of a free buffer list. When VACUUM/INSERT/COPY fills a block we
notify the buffer manager that this block needs writing ahead of other
buffers, so that the bgwriter can work more effectively. That seems like
it would help with both this current patch and the additional thoughts
above.

> [initial data]
> $ pgbench -s 40 -i;
> # VACUUM FREEZE
> [test]
> # UPDATE accounts SET aid=aid WHERE random() < 0.005;
> # checkpoint;
> # VACUUM FREEZE accounts;
>
>
> I cannot see the above problem in non-freeze vacuum. The number buffers
> in freelist increases on index-vacuuming phase. When the vacuum found
> seldom used buffers (refcount==0 and usage_count==0), they are added into
> freelist. So the WAL entries generated in index-vacuuming or heap-vacuuming
> phase are not so serious. However, entries for FREEZE are generated in
> heap-scanning phase, it is before index-vacuuming.

This happens for setting hint-bits also in normal operation, which might
only occur once in most test situations. In practice, this can occur
each time we touch a row and then VACUUM, so we end up re-writing the
block many times in the way you describe.

IIRC Heikki was thinking of altering the way VACUUM works to avoid it
writing out blocks that it was going to come back to in the second phase
anyway. That would go some way to alleviating the problem you describe,
but wouldn't go as far as the technique you suggest.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
Cc: "ITAGAKI Takahiro" <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Dirty pages in freelist cause WAL stuck
Date: 2006-12-18 16:13:06
Message-ID: 9336.1166458386@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"Simon Riggs" <simon(at)2ndquadrant(dot)com> writes:
> I think what you are saying is: VACUUM places blocks so that they are
> immediately reused. This stops shared_buffers from being polluted by
> vacuumed-blocks, but it also means that almost every write becomes a
> backend dirty write when VACUUM is working, bgwriter or not. That also
> means that we flush WAL more often than we otherwise would.

Do we care? As long as the writes are done by the vacuum process, ISTM
this is taking load off the foreground query processes, by saving them
from having to do writes.

In any case, I'm unclear on why we should add a boatload of complexity
to improve performance of something that's done as rarely as VACUUM
FREEZE is. Quite aside from maintainability concerns, even a few extra
cycles added to the more common code paths would make it a net
performance loss overall.

regards, tom lane


From: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "ITAGAKI Takahiro" <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Dirty pages in freelist cause WAL stuck
Date: 2006-12-18 21:34:45
Message-ID: 1166477685.3654.135.camel@silverbirch.site
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, 2006-12-18 at 11:13 -0500, Tom Lane wrote:
> "Simon Riggs" <simon(at)2ndquadrant(dot)com> writes:
> > I think what you are saying is: VACUUM places blocks so that they are
> > immediately reused. This stops shared_buffers from being polluted by
> > vacuumed-blocks, but it also means that almost every write becomes a
> > backend dirty write when VACUUM is working, bgwriter or not. That also
> > means that we flush WAL more often than we otherwise would.
>
> Do we care? As long as the writes are done by the vacuum process, ISTM
> this is taking load off the foreground query processes, by saving them
> from having to do writes.

I'm not bothered about speeding up VACUUM FREEZE at all, but the effect
noted by Itagaki-san is clearly real and so can easily effect other
processes. I believe it does effect other backends and had already noted
what I thought was that effect myself. If we had better server
instrumentation it would be easy to demonstrate either way.

> In any case, I'm unclear on why we should add a boatload of complexity
> to improve performance of something that's done as rarely as VACUUM
> FREEZE is. Quite aside from maintainability concerns, even a few extra
> cycles added to the more common code paths would make it a net
> performance loss overall.

As I noted, this isn't just VACUUM FREEZE (why would it be?), but all
VACUUMs - that *is* a common code path on a busy system. VACUUM FREEZE
simply dirties more blocks and has a more clearly noticeable effect.

>From your comments we clearly need more testing to demonstrate the
effect on normal backends before we move to a solution.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com


From: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
To: "Simon Riggs" <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Dirty pages in freelist cause WAL stuck
Date: 2006-12-19 08:53:06
Message-ID: 20061219105630.6475.ITAGAKI.TAKAHIRO@oss.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"Simon Riggs" <simon(at)2ndquadrant(dot)com> wrote:

> I think what you are saying is: VACUUM places blocks so that they are
> immediately reused. This stops shared_buffers from being polluted by
> vacuumed-blocks, but it also means that almost every write becomes a
> backend dirty write when VACUUM is working, bgwriter or not. That also
> means that we flush WAL more often than we otherwise would.

That's right. I think it's acceptable that vacuuming process writes dirty
buffers made by itself, because only the process slows down; other backends
can run undisturbedly. However, frequent WAL flushing should be avoided.

I found the problem when I ran VACUUM FREEZE separately. But if there were
some backends, dirty buffers made by VACUUM would be reused by those backends,
not by the vacuuming process.

> From above my thinking would be to have a more general implementation:
> Each backend keeps a list of cache buffers to reuse in its local loop,
> rather than using the freelist as a global list. That way the technique
> would work even when we have multiple Vacuums working concurrently. It
> would also then be possible to use this for the SeqScan case as well.

Great idea! The troubles are in the usage of buffers by SeqScan and VACUUM.
The former uses too many buffers and the latter uses too few buffers.
Your cache-looping will work around both cases.

> Another connected thought is the idea of a having a FullBufferList - the
> opposite of a free buffer list. When VACUUM/INSERT/COPY fills a block we
> notify the buffer manager that this block needs writing ahead of other
> buffers, so that the bgwriter can work more effectively. That seems like
> it would help with both this current patch and the additional thoughts
> above.

Do you mean that bgwriter should take care of buffers in freelist, not only
ones in the tail of LRU? We might need activity control of bgwriter. Buffers
are reused rapidly in VACUUM or bulk insert, so bgwriter is not sufficient
if its settings are same as usual.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center


From: "Jim C(dot) Nasby" <jim(at)nasby(dot)net>
To: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Dirty pages in freelist cause WAL stuck
Date: 2006-12-28 12:13:37
Message-ID: 20061228121337.GQ71246@nasby.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Dec 19, 2006 at 05:53:06PM +0900, ITAGAKI Takahiro wrote:
> "Simon Riggs" <simon(at)2ndquadrant(dot)com> wrote:
> > Another connected thought is the idea of a having a FullBufferList - the
> > opposite of a free buffer list. When VACUUM/INSERT/COPY fills a block we
> > notify the buffer manager that this block needs writing ahead of other
> > buffers, so that the bgwriter can work more effectively. That seems like
> > it would help with both this current patch and the additional thoughts
> > above.
>
> Do you mean that bgwriter should take care of buffers in freelist, not only
> ones in the tail of LRU? We might need activity control of bgwriter. Buffers
> are reused rapidly in VACUUM or bulk insert, so bgwriter is not sufficient
> if its settings are same as usual.

Actually, if I understand the code, the "LRU" stuff actually only hits
the free list. Also, the only thing that runs the clock sweep (which is
what maintains the LRU-type info) is a backend requesting a page and not
finding one on the free list.
--
Jim Nasby jim(at)nasby(dot)net
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Dirty pages in freelist cause WAL stuck
Date: 2007-02-03 02:08:56
Message-ID: 200702030208.l1328ul26953@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


Is this a TODO item?

---------------------------------------------------------------------------

ITAGAKI Takahiro wrote:
> "Simon Riggs" <simon(at)2ndquadrant(dot)com> wrote:
>
> > I think what you are saying is: VACUUM places blocks so that they are
> > immediately reused. This stops shared_buffers from being polluted by
> > vacuumed-blocks, but it also means that almost every write becomes a
> > backend dirty write when VACUUM is working, bgwriter or not. That also
> > means that we flush WAL more often than we otherwise would.
>
> That's right. I think it's acceptable that vacuuming process writes dirty
> buffers made by itself, because only the process slows down; other backends
> can run undisturbedly. However, frequent WAL flushing should be avoided.
>
> I found the problem when I ran VACUUM FREEZE separately. But if there were
> some backends, dirty buffers made by VACUUM would be reused by those backends,
> not by the vacuuming process.
>
> > From above my thinking would be to have a more general implementation:
> > Each backend keeps a list of cache buffers to reuse in its local loop,
> > rather than using the freelist as a global list. That way the technique
> > would work even when we have multiple Vacuums working concurrently. It
> > would also then be possible to use this for the SeqScan case as well.
>
> Great idea! The troubles are in the usage of buffers by SeqScan and VACUUM.
> The former uses too many buffers and the latter uses too few buffers.
> Your cache-looping will work around both cases.
>
> > Another connected thought is the idea of a having a FullBufferList - the
> > opposite of a free buffer list. When VACUUM/INSERT/COPY fills a block we
> > notify the buffer manager that this block needs writing ahead of other
> > buffers, so that the bgwriter can work more effectively. That seems like
> > it would help with both this current patch and the additional thoughts
> > above.
>
> Do you mean that bgwriter should take care of buffers in freelist, not only
> ones in the tail of LRU? We might need activity control of bgwriter. Buffers
> are reused rapidly in VACUUM or bulk insert, so bgwriter is not sufficient
> if its settings are same as usual.
>
> Regards,
> ---
> ITAGAKI Takahiro
> NTT Open Source Software Center
>
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 9: In versions below 8.0, the planner will ignore your desire to
> choose an index scan if your joining column's datatypes do not
> match

--
Bruce Momjian bruce(at)momjian(dot)us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: Jim Nasby <decibel(at)decibel(dot)org>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Dirty pages in freelist cause WAL stuck
Date: 2007-02-06 03:29:05
Message-ID: 92250ABC-FF41-4F72-8527-68DB11657FCF@decibel.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I think there's improvement to be made in how we track buffer usage
in general. Seqscans still hold the same weight as any other
operation, the freelist is of questionable value, and there's a lot
of work done to find a free buffer out of the pool, for example.

On Feb 2, 2007, at 8:08 PM, Bruce Momjian wrote:

>
> Is this a TODO item?
>
> ----------------------------------------------------------------------
> -----
>
> ITAGAKI Takahiro wrote:
>> "Simon Riggs" <simon(at)2ndquadrant(dot)com> wrote:
>>
>>> I think what you are saying is: VACUUM places blocks so that they
>>> are
>>> immediately reused. This stops shared_buffers from being polluted by
>>> vacuumed-blocks, but it also means that almost every write becomes a
>>> backend dirty write when VACUUM is working, bgwriter or not. That
>>> also
>>> means that we flush WAL more often than we otherwise would.
>>
>> That's right. I think it's acceptable that vacuuming process
>> writes dirty
>> buffers made by itself, because only the process slows down; other
>> backends
>> can run undisturbedly. However, frequent WAL flushing should be
>> avoided.
>>
>> I found the problem when I ran VACUUM FREEZE separately. But if
>> there were
>> some backends, dirty buffers made by VACUUM would be reused by
>> those backends,
>> not by the vacuuming process.
>>
>>> From above my thinking would be to have a more general
>>> implementation:
>>> Each backend keeps a list of cache buffers to reuse in its local
>>> loop,
>>> rather than using the freelist as a global list. That way the
>>> technique
>>> would work even when we have multiple Vacuums working
>>> concurrently. It
>>> would also then be possible to use this for the SeqScan case as
>>> well.
>>
>> Great idea! The troubles are in the usage of buffers by SeqScan
>> and VACUUM.
>> The former uses too many buffers and the latter uses too few buffers.
>> Your cache-looping will work around both cases.
>>
>>> Another connected thought is the idea of a having a
>>> FullBufferList - the
>>> opposite of a free buffer list. When VACUUM/INSERT/COPY fills a
>>> block we
>>> notify the buffer manager that this block needs writing ahead of
>>> other
>>> buffers, so that the bgwriter can work more effectively. That
>>> seems like
>>> it would help with both this current patch and the additional
>>> thoughts
>>> above.
>>
>> Do you mean that bgwriter should take care of buffers in freelist,
>> not only
>> ones in the tail of LRU? We might need activity control of
>> bgwriter. Buffers
>> are reused rapidly in VACUUM or bulk insert, so bgwriter is not
>> sufficient
>> if its settings are same as usual.
>>
>> Regards,
>> ---
>> ITAGAKI Takahiro
>> NTT Open Source Software Center
>>
>>
>>
>> ---------------------------(end of
>> broadcast)---------------------------
>> TIP 9: In versions below 8.0, the planner will ignore your desire to
>> choose an index scan if your joining column's datatypes do not
>> match
>
> --
> Bruce Momjian bruce(at)momjian(dot)us
> EnterpriseDB http://www.enterprisedb.com
>
> + If your life is a hard drive, Christ can be your backup. +
>
> ---------------------------(end of
> broadcast)---------------------------
> TIP 2: Don't 'kill -9' the postmaster
>

--
Jim Nasby jim(at)nasby(dot)net
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)