Re: Fix overflow of bgwriter's request queue

From: ITAGAKI Takahiro <itagaki(dot)takahiro(at)lab(dot)ntt(dot)co(dot)jp>
To: pgsql-patches(at)postgresql(dot)org
Subject: Fix overflow of bgwriter's request queue
Date: 2006-01-13 04:48:12
Message-ID: 20060113132743.4E26.ITAGAKI.TAKAHIRO@lab.ntt.co.jp
Lists: pgsql-patches

Attached is a patch that fixes an overflow of the bgwriter's file-fsync request queue.

The overflow occurred under heavy update workloads and degraded performance. I have sent the details to HACKERS.

---
ITAGAKI Takahiro
NTT Cyber Space Laboratories

Attachment Content-Type Size
bgwriter-requests-queue-overflow-2.patch application/octet-stream 3.3 KB

From: "Qingqing Zhou" <zhouqq(at)cs(dot)toronto(dot)edu>
To: pgsql-patches(at)postgresql(dot)org
Subject: Re: Fix overflow of bgwriter's request queue
Date: 2006-01-13 06:48:24
Message-ID: dq7idc$30g7$1@news.hub.org
Lists: pgsql-patches


"ITAGAKI Takahiro" <itagaki(dot)takahiro(at)lab(dot)ntt(dot)co(dot)jp> wrote
>
> Attached is a patch that fixes overflow of bgwriter's file-fsync request
> queue.
>

while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
{
+     if (i >= count)
+         elog(ERROR, "pendingOpsTable corrupted");
+
+     memcpy(&entries[i++], entry, sizeof(PendingOperationEntry));
+
+     if (hash_search(pendingOpsTable, entry,
+                     HASH_REMOVE, NULL) == NULL)
+         elog(ERROR, "pendingOpsTable corrupted");
+ }

What's the rationale of this change?

Regards,
Qingqing


From: ITAGAKI Takahiro <itagaki(dot)takahiro(at)lab(dot)ntt(dot)co(dot)jp>
To: "Qingqing Zhou" <zhouqq(at)cs(dot)toronto(dot)edu>
Cc: pgsql-patches(at)postgresql(dot)org
Subject: Re: Fix overflow of bgwriter's request queue
Date: 2006-01-13 07:04:00
Message-ID: 20060113155017.4E2E.ITAGAKI.TAKAHIRO@lab.ntt.co.jp
Lists: pgsql-patches

"Qingqing Zhou" <zhouqq(at)cs(dot)toronto(dot)edu> wrote:

> while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
> {
> +     if (i >= count)
> +         elog(ERROR, "pendingOpsTable corrupted");
> +
> +     memcpy(&entries[i++], entry, sizeof(PendingOperationEntry));
> +
> +     if (hash_search(pendingOpsTable, entry,
> +                     HASH_REMOVE, NULL) == NULL)
> +         elog(ERROR, "pendingOpsTable corrupted");
> + }
>
> What's the rationale of this change?

AbsorbFsyncRequests will be called during the fsync loop in my patch,
so new files might be added to pendingOpsTable, and they would then be
removed from the table *before* the pages belonging to them are written.
So I changed the code to copy the contents of pendingOpsTable into a
local array and iterate over that copy later.
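
(For illustration only, a minimal sketch of that copy-then-iterate idea; this
is simplified, not the actual md.c code, and do_one_fsync() is just a
placeholder for the per-entry fsync work.)

    HASH_SEQ_STATUS hstat;
    PendingOperationEntry *entry;
    PendingOperationEntry *entries;
    long    count = hash_get_num_entries(pendingOpsTable);
    long    i = 0;

    entries = (PendingOperationEntry *)
        palloc(count * sizeof(PendingOperationEntry));

    /* Drain pendingOpsTable into a local array. */
    hash_seq_init(&hstat, pendingOpsTable);
    while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
    {
        if (i >= count)
            elog(ERROR, "pendingOpsTable corrupted");
        memcpy(&entries[i++], entry, sizeof(PendingOperationEntry));
        if (hash_search(pendingOpsTable, entry, HASH_REMOVE, NULL) == NULL)
            elog(ERROR, "pendingOpsTable corrupted");
    }

    /* Fsync from the copy; AbsorbFsyncRequests() during this loop can only
     * add new entries to pendingOpsTable, it cannot disturb the copy. */
    for (i = 0; i < count; i++)
        do_one_fsync(&entries[i]);      /* placeholder */

    pfree(entries);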

---
ITAGAKI Takahiro
NTT Cyber Space Laboratories


From: "Qingqing Zhou" <zhouqq(at)cs(dot)toronto(dot)edu>
To: pgsql-patches(at)postgresql(dot)org
Subject: Re: Fix overflow of bgwriter's request queue
Date: 2006-01-13 07:52:07
Message-ID: dq7m4q$5nk$1@news.hub.org
Lists: pgsql-patches


"ITAGAKI Takahiro" <itagaki(dot)takahiro(at)lab(dot)ntt(dot)co(dot)jp> wrote
>
> AbsorbFsyncRequests will be called during the fsync loop in my patch,
> so new files might be added to pendingOpsTable and they will be removed
> from the table *before* writing the pages belonging to them.
> So I changed it to copy the contents of pendingOpsTable to a local
> variables and iterate on the vars later.
>

I see - it is the AbsorbFsyncRequests() added to the mdsync() loop, and you
want to avoid unnecessary fsyncs. But the remove-and-reinsert method you use
has a caveat: if any hash_search(HASH_ENTER) fails when you try to reinsert
entries into pendingOpsTable, you have to raise the error to PANIC, since we
can't get the missing fds back any more.

Regards,
Qingqing


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Qingqing Zhou" <zhouqq(at)cs(dot)toronto(dot)edu>
Cc: ITAGAKI Takahiro <itagaki(dot)takahiro(at)lab(dot)ntt(dot)co(dot)jp>, pgsql-patches(at)postgresql(dot)org
Subject: Re: Fix overflow of bgwriter's request queue
Date: 2006-01-13 21:05:52
Message-ID: 13052.1137186352@sss.pgh.pa.us
Lists: pgsql-patches

"Qingqing Zhou" <zhouqq(at)cs(dot)toronto(dot)edu> writes:
> "ITAGAKI Takahiro" <itagaki(dot)takahiro(at)lab(dot)ntt(dot)co(dot)jp> wrote
>> AbsorbFsyncRequests will be called during the fsync loop in my patch,
>> so new files might be added to pendingOpsTable and they will be removed
>> from the table *before* writing the pages belonging to them.
>> So I changed it to copy the contents of pendingOpsTable to a local
>> variables and iterate on the vars later.

I think this fear is incorrect. At the time ForwardFsyncRequest is
called, the backend must *already* have done whatever write it is
concerned about fsync'ing (note that ForwardFsyncRequest may choose
to do the fsync itself). Therefore it is OK if the bgwriter does that
fsync immediately upon receipt of the request. There is no constraint
saying that we ever need to delay execution of an fsync request.

> I see - it is the AbsorbFsyncRequests() added to the mdsync() loop, and you
> want to avoid unnecessary fsyncs. But the remove-and-reinsert method you use
> has a caveat: if any hash_search(HASH_ENTER) fails when you try to reinsert
> entries into pendingOpsTable, you have to raise the error to PANIC, since we
> can't get the missing fds back any more.

Yes, the patch is wrong as-is because it may lose uncompleted fsyncs.
But I think that we could just add the AbsorbFsyncRequests call in the
fsync loop and not worry about trying to avoid doing extra fsyncs.

Another possibility is to make the copied list as in the patch, but
HASH_REMOVE an entry only after doing the fsync successfully --- as long
as you don't AbsorbFsyncRequests between doing the fsync and removing
the entry, you aren't risking missing a necessary fsync. I'm
unconvinced that this is worth the trouble, however.
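
(A rough sketch of that ordering, for illustration only, with do_one_fsync()
again standing in for the per-entry fsync; here the copy loop does NOT remove
anything from pendingOpsTable.)

    for (i = 0; i < count; i++)
    {
        AbsorbFsyncRequests();          /* safe here, before the fsync */

        do_one_fsync(&entries[i]);      /* placeholder; elog(ERROR) on failure,
                                         * so the table entry is left in place */

        /* Remove only now that the fsync has succeeded, and do not absorb
         * between the fsync and the removal. */
        if (hash_search(pendingOpsTable, &entries[i],
                        HASH_REMOVE, NULL) == NULL)
            elog(ERROR, "pendingOpsTable corrupted");
    }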

regards, tom lane


From: "Qingqing Zhou" <zhouqq(at)cs(dot)toronto(dot)edu>
To: pgsql-patches(at)postgresql(dot)org
Subject: Re: Fix overflow of bgwriter's request queue
Date: 2006-01-13 22:09:50
Message-ID: dq98d1$1mp$1@news.hub.org
Lists: pgsql-patches


"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote
>
> Yes, the patch is wrong as-is because it may lose uncompleted fsyncs.
> But I think that we could just add the AbsorbFsyncRequests call in the
> fsync loop and not worry about trying to avoid doing extra fsyncs.
>
> Another possibility is to make the copied list as in the patch, but
> HASH_REMOVE an entry only after doing the fsync successfully --- as long
> as you don't AbsorbFsyncRequests between doing the fsync and removing
> the entry, you aren't risking missing a necessary fsync. I'm
> unconvinced that this is worth the trouble, however.
>

Maybe taking a copied list is safer. I am a little afraid of doing a seqscan
of the hash table while doing HASH_ENTER into it at the same time. Do we have
this kind of hash usage anywhere else?

Regards,
Qingqing


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Qingqing Zhou" <zhouqq(at)cs(dot)toronto(dot)edu>
Cc: pgsql-patches(at)postgresql(dot)org
Subject: Re: Fix overflow of bgwriter's request queue
Date: 2006-01-13 22:20:14
Message-ID: 14418.1137190814@sss.pgh.pa.us
Lists: pgsql-patches

"Qingqing Zhou" <zhouqq(at)cs(dot)toronto(dot)edu> writes:
> Maybe the take a copied list is safer. I got a little afraid of doing
> seqscan hash while doing HASH_ENTER at the same time. Do we have this kind
> of hash usage somewhere?

Sure, it's perfectly safe. It's unspecified whether the scan will visit
such entries or not (because it might or might not already have passed
their hash bucket), but per above discussion we don't really care.
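
(As a minimal illustration of the usage in question, scanning pendingOpsTable
while AbsorbFsyncRequests() may HASH_ENTER into the same table; do_one_fsync()
is a placeholder for the per-entry fsync work.)

    HASH_SEQ_STATUS hstat;
    PendingOperationEntry *entry;

    hash_seq_init(&hstat, pendingOpsTable);
    while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
    {
        AbsorbFsyncRequests();      /* may HASH_ENTER into pendingOpsTable;
                                     * whether this scan visits such new
                                     * entries is unspecified, and anything
                                     * missed simply waits for the next
                                     * mdsync() */
        do_one_fsync(entry);        /* placeholder for the per-entry fsync */
    }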

regards, tom lane


From: ITAGAKI Takahiro <itagaki(dot)takahiro(at)lab(dot)ntt(dot)co(dot)jp>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-patches(at)postgresql(dot)org
Cc: "Qingqing Zhou" <zhouqq(at)cs(dot)toronto(dot)edu>
Subject: Re: Fix overflow of bgwriter's request queue
Date: 2006-01-16 02:00:33
Message-ID: 20060116095906.4BD5.ITAGAKI.TAKAHIRO@lab.ntt.co.jp
Lists: pgsql-patches

Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> > "ITAGAKI Takahiro" <itagaki(dot)takahiro(at)lab(dot)ntt(dot)co(dot)jp> wrote
> >> AbsorbFsyncRequests will be called during the fsync loop in my patch,
> >> so new files might be added to pendingOpsTable and they will be removed
> >> from the table *before* writing the pages belonging to them.
>
> I think this fear is incorrect. At the time ForwardFsyncRequest is
> called, the backend must *already* have done whatever write it is
> concerned about fsync'ing.

Oops, I was wrong. Also, I see that there is no need to fear endless loops,
because a hash seqscan and HASH_ENTER don't conflict.

Attached is a revised patch. It became very simple, but I worry that
one magic number (BUFFERS_PER_ABSORB) is still left.

---
ITAGAKI Takahiro
NTT Cyber Space Laboratories

Attachment Content-Type Size
bgwriter-requests-queue-overflow-3.patch application/octet-stream 2.3 KB

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: ITAGAKI Takahiro <itagaki(dot)takahiro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: pgsql-patches(at)postgresql(dot)org, "Qingqing Zhou" <zhouqq(at)cs(dot)toronto(dot)edu>
Subject: Re: Fix overflow of bgwriter's request queue
Date: 2006-01-19 23:45:38
Message-ID: 10731.1137714338@sss.pgh.pa.us
Lists: pgsql-patches

ITAGAKI Takahiro <itagaki(dot)takahiro(at)lab(dot)ntt(dot)co(dot)jp> writes:
> Attached is a revised patch. It became very simple, but I worry that
> one magic number (BUFFERS_PER_ABSORB) is still left.

Have you checked that this version of the patch fixes the problem you
saw originally? Does the problem come back if you change
BUFFERS_PER_ABSORB to too large a value? If you can identify a
threshold where the problem reappears in your test case, that would help
us choose the right value to use.

I suspect it'd probably be sufficient to absorb requests every few times
through the fsync loop, too, if you want to experiment with that.

regards, tom lane


From: ITAGAKI Takahiro <itagaki(dot)takahiro(at)lab(dot)ntt(dot)co(dot)jp>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-patches(at)postgresql(dot)org
Subject: Re: Fix overflow of bgwriter's request queue
Date: 2006-01-26 04:08:55
Message-ID: 20060126124821.48A8.ITAGAKI.TAKAHIRO@lab.ntt.co.jp
Lists: pgsql-patches

Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> ITAGAKI Takahiro <itagaki(dot)takahiro(at)lab(dot)ntt(dot)co(dot)jp> writes:
> > Attached is a revised patch. It became very simple, but I worry that
> > one magic number (BUFFERS_PER_ABSORB) is still left.
>
> Have you checked that this version of the patch fixes the problem you
> saw originally? Does the problem come back if you change
> BUFFERS_PER_ABSORB to too large a value?

The problem on my machine was resolved by this patch. I tested it and
logged the maximum of BgWriterShmem->num_requests for each checkpoint.
Test conditions were:
- shared_buffers = 65536
- connections = 30
The average of the per-checkpoint maximums was 25857, and the overall maximum
was 31807. They did not exceed max_requests (= 65536).

> I suspect it'd probably be sufficient to absorb requests every few times
> through the fsync loop, too, if you want to experiment with that.

In the above test, smgrsync took 50 sec to sync 32 files. This means an
absorb was requested about every 1.5 sec, which is less frequent than the
absorbs done by the bgwriter's normal activity (bgwriter_delay = 200ms).
So I assume that absorbing requests in the fsync loop would not be a problem.

BUFFERS_PER_ABSORB = 10 (one absorb per 1/10 of shared_buffers) is enough at
least on my machine, but it doesn't necessarily work well in all environments.
If we need to set BUFFERS_PER_ABSORB to a reasonable value, I think the number
of active backends might be useful; for example, half the number of backends.

---
ITAGAKI Takahiro
NTT Cyber Space Laboratories


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: ITAGAKI Takahiro <itagaki(dot)takahiro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: pgsql-patches(at)postgresql(dot)org
Subject: Re: Fix overflow of bgwriter's request queue
Date: 2006-03-03 00:07:23
Message-ID: 16035.1141344443@sss.pgh.pa.us
Lists: pgsql-patches

ITAGAKI Takahiro <itagaki(dot)takahiro(at)lab(dot)ntt(dot)co(dot)jp> writes:
> Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> I suspect it'd probably be sufficient to absorb requests every few times
>> through the fsync loop, too, if you want to experiment with that.

> In the above test, smgrsync took 50 sec to sync 32 files. This means an
> absorb was requested about every 1.5 sec, which is less frequent than the
> absorbs done by the bgwriter's normal activity (bgwriter_delay = 200ms).

That seems awfully high to me --- 1.5 sec to fsync a segment file that
is never larger than 1Gb, and probably usually has much less than 1Gb
of dirty data? I think you must have been testing an atypical case.

I've applied the attached modified version of your patch. In this
coding, absorbs are done after every 1000 buffer writes in BufferSync
and after every 10 fsyncs in mdsync. We may need to twiddle these
numbers but it seems at least in the right ballpark. If you have time
to repeat your original test and see how this does, it'd be much
appreciated.
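
(Roughly, the shape of the applied change; the constants are taken from the
description above, but the macro and variable names here are illustrative,
not necessarily those in the committed patch.)

    #define WRITES_PER_ABSORB   1000    /* BufferSync(): every N buffer writes */
    #define FSYNCS_PER_ABSORB     10    /* mdsync(): every N fsyncs */

    int     absorb_counter = WRITES_PER_ABSORB;

    /* ... inside BufferSync()'s loop, after writing out a dirty buffer ... */
    if (--absorb_counter <= 0)
    {
        AbsorbFsyncRequests();
        absorb_counter = WRITES_PER_ABSORB;
    }

    /* mdsync() uses the same count-down pattern in its fsync loop, resetting
     * the counter to FSYNCS_PER_ABSORB instead. */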

regards, tom lane

Attachment Content-Type Size
absorb.patch application/octet-stream 3.6 KB

From: ITAGAKI Takahiro <itagaki(dot)takahiro(at)lab(dot)ntt(dot)co(dot)jp>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-patches(at)postgresql(dot)org
Subject: Re: Fix overflow of bgwriter's request queue
Date: 2006-03-10 11:12:18
Message-ID: 20060310194320.4CA3.ITAGAKI.TAKAHIRO@lab.ntt.co.jp
Lists: pgsql-patches


Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> I've applied the attached modified version of your patch. In this
> coding, absorbs are done after every 1000 buffer writes in BufferSync
> and after every 10 fsyncs in mdsync. We may need to twiddle these
> numbers but it seems at least in the right ballpark. If you have time
> to repeat your original test and see how this does, it'd be much
> appreciated.

Thank you. It worked well on my machine (*); no undesirable behavior was seen.

(*)
TPC-C(DBT-2)
RHEL4 U1 (2.6.9-11)
XFS, 8 S-ATA disks / 8GB memory(shmem=512MB)

---
ITAGAKI Takahiro
NTT Cyber Space Laboratories