Re: [HACKERS] Fix mdsync never-ending loop problem

From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Patches <pgsql-patches(at)postgresql(dot)org>
Subject: Fix mdsync never-ending loop problem
Date: 2007-04-05 10:46:39
Message-ID: 4614D38F.2060902@enterprisedb.com
Lists: pgsql-hackers pgsql-patches

Here's a fix for the problem that on a busy system, mdsync never
finishes. See the original problem description on hackers:
http://archives.postgresql.org/pgsql-hackers/2007-04/msg00259.php

The solution is taken from ITAGAKI Takahiro's Load Distributed
Checkpoint patch. At the beginning of mdsync, the pendingOpsTable is
copied to a linked list, and that list is then processed until it's empty.
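
A minimal sketch of the copy-then-drain idea described above, written as self-contained C rather than taken from the attached patch. PendingOp, TodoNode, snapshot_pending and drain_todo are hypothetical names, and a plain array stands in for pendingOpsTable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for one pending fsync request. */
typedef struct PendingOp
{
    unsigned    relfilenode;
    unsigned    segno;
} PendingOp;

typedef struct TodoNode
{
    PendingOp   op;             /* private copy of the table entry */
    struct TodoNode *next;
} TodoNode;

/* Copy the n entries of table[] into a newly allocated list, in order. */
TodoNode *
snapshot_pending(const PendingOp *table, size_t n)
{
    TodoNode   *head = NULL;
    TodoNode  **tail = &head;

    for (size_t i = 0; i < n; i++)
    {
        TodoNode   *node = malloc(sizeof(TodoNode));

        assert(node != NULL);   /* sketch only: no real out-of-memory handling */
        node->op = table[i];
        node->next = NULL;
        *tail = node;
        tail = &node->next;
    }
    return head;
}

/*
 * Drain the list, calling sync_one (if not NULL) on each copied entry and
 * freeing each node.  Returns the number of entries processed, which can
 * never exceed the size of the snapshot.
 */
size_t
drain_todo(TodoNode *head, void (*sync_one) (const PendingOp *))
{
    size_t      done = 0;

    while (head != NULL)
    {
        TodoNode   *next = head->next;

        if (sync_one != NULL)
            sync_one(&head->op);
        free(head);
        head = next;
        done++;
    }
    return done;
}
```

Requests that reach the table after snapshot_pending() returns are not on the list, so the drain loop is bounded by the snapshot size however busy the system is.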

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachment: fix_neverending_mdsync_loop.patch (text/x-diff, 10.3 KB)

From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: Patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Fix mdsync never-ending loop problem
Date: 2007-04-05 13:45:24
Message-ID: 20070405134524.GB8578@alvh.no-ip.org
Lists: pgsql-hackers pgsql-patches

While skimming over this I was baffled a bit about the usage of
(InvalidBlockNumber - 1) as value for FORGET_DATABASE_FSYNC. It took me
a while to realize that this code is abusing the BlockNumber typedef to
pass around *segment* numbers, so the useful range is much smaller and
thus the usage of that value is not a problem in practice.

I wonder if it wouldn't be better to clean this up by creating a
separate typedef for segment numbers, with its own special values?
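
One possible shape for such a typedef, as a sketch only. SegmentNumber, InvalidSegmentNumber, ForgetDatabaseSegment and segment_number_is_regular are hypothetical names; the values simply mirror the existing convention of taking special codes from the top of the 32-bit range.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names; the thread only proposes the idea. */
typedef uint32_t SegmentNumber;

#define InvalidSegmentNumber    ((SegmentNumber) 0xFFFFFFFF)
#define ForgetDatabaseSegment   ((SegmentNumber) 0xFFFFFFFE)

/* True if segno names a real segment rather than a special request code. */
bool
segment_number_is_regular(SegmentNumber segno)
{
    /* the two special codes must sit directly above the regular range */
    assert(ForgetDatabaseSegment + 1 == InvalidSegmentNumber);
    return segno < ForgetDatabaseSegment;
}
```

With a dedicated type, the special values no longer look like block numbers, which is the confusion described above.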

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Fix mdsync never-ending loop problem
Date: 2007-04-05 16:14:32
Message-ID: 46152068.3000501@enterprisedb.com
Lists: pgsql-hackers pgsql-patches

Heikki Linnakangas wrote:
> Here's a fix for the problem that on a busy system, mdsync never
> finishes. See the original problem description on hackers:
> http://archives.postgresql.org/pgsql-hackers/2007-04/msg00259.php
>
> The solution is taken from ITAGAKI Takahiro's Load Distributed
> Checkpoint patch. At the beginning of mdsync, the pendingOpsTable is
> copied to a linked list, and that list is then processed until it's empty.

Here's an updated patch; the one I sent earlier is broken because it ignored
the return value of list_delete_cell.

We could just review and apply ITAGAKI's patch as it is instead of this
snippet of it, but because that can take some time I'd like to see this
applied before that.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachment: fix_neverending_mdsync_loop_v2.patch (text/x-diff, 10.3 KB)

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: Patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Fix mdsync never-ending loop problem
Date: 2007-04-05 16:15:00
Message-ID: 28223.1175789700@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-patches

Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
> Here's a fix for the problem that on a busy system, mdsync never
> finishes. See the original problem description on hackers:

This leaks memory, no? (list_delete_cell only deletes the ListCell.)
But I dislike copying the table entries anyway, see comment on -hackers.

BTW, it's very hard to see what a patch like this is actually changing.
It might be better to submit a version that doesn't reindent the chunks
of code you aren't changing, so as to reduce the visual size of the
diff. A note to the committer to reindent the whole function is
sufficient (or if he forgets, pg_indent will fix it eventually).

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Fix mdsync never-ending loop problem
Date: 2007-04-05 16:19:25
Message-ID: 28272.1175789965@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-patches

Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> I wonder if it wouldn't be better to clean this up by creating a
> separate typedef for segment numbers, with its own special values?

Probably. I remember having thought about it when I put in the
FORGET_DATABASE_FSYNC hack. I think I didn't do it because I needed
to backpatch and so I wanted a minimal-size patch. Feel free to do it
in HEAD ...

regards, tom lane


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Fix mdsync never-ending loop problem
Date: 2007-04-05 16:27:07
Message-ID: 4615235B.9040204@enterprisedb.com
Lists: pgsql-hackers pgsql-patches

Tom Lane wrote:
> Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
>> Here's a fix for the problem that on a busy system, mdsync never
>> finishes. See the original problem description on hackers:
>
> This leaks memory, no? (list_delete_cell only deletes the ListCell.)

Oh, I just spotted another problem with it and posted an updated patch,
but I missed that.

> But I dislike copying the table entries anyway, see comment on -hackers.

Frankly the cycle id idea sounds more ugly and fragile to me. You'll
need to do multiple scans of the hash table that way, starting from top
every time you call AbsorbFsyncRequests (like we do now). But whatever...

> BTW, it's very hard to see what a patch like this is actually changing.
> It might be better to submit a version that doesn't reindent the chunks
> of code you aren't changing, so as to reduce the visual size of the
> diff. A note to the committer to reindent the whole function is
> sufficient (or if he forgets, pg_indent will fix it eventually).

Ok, will do that. Or would you like to just take over from here?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: Patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Fix mdsync never-ending loop problem
Date: 2007-04-05 16:34:03
Message-ID: 28455.1175790843@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-patches

Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
> Tom Lane wrote:
>> But I dislike copying the table entries anyway, see comment on -hackers.

> Frankly the cycle id idea sounds more ugly and fragile to me. You'll
> need to do multiple scans of the hash table that way, starting from top
> every time you call AbsorbFsyncRequests (like we do now).

How so? You just ignore entries whose cycleid is too large. You'd have
to be careful about wraparound in the comparisons, but that's not hard
to deal with. Also, AFAICS you still have the retry problem (and an
even bigger memory leak problem) with this coding --- the "to-do list"
doesn't eliminate the issue of correct handling of a failure.
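
A minimal sketch of a wraparound-safe comparison of the kind mentioned above. CycleCtr and cycle_precedes are hypothetical names, and the 16-bit width is chosen only to keep the example small; the result is meaningful only while the two values are less than half the counter range apart.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint16_t CycleCtr;      /* hypothetical counter type */

/*
 * True if an entry stamped with entry_ctr was queued before the sync cycle
 * numbered current_ctr began.  The distance is computed modulo 2^16, so
 * the answer stays right when the counter wraps from 65535 back to 0.
 */
bool
cycle_precedes(CycleCtr entry_ctr, CycleCtr current_ctr)
{
    CycleCtr    distance = (CycleCtr) (current_ctr - entry_ctr);

    assert(sizeof(CycleCtr) == 2);      /* the 0x8000 threshold assumes 16 bits */
    return distance != 0 && distance < 0x8000;
}
```

A sync loop using this test would process only entries for which cycle_precedes(entry->cycle_ctr, current) holds and skip the rest.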

> Ok, will do that. Or would you like to just take over from here?

No, I'm up to my ears in varlena. You're the one in a position to test
this, anyway.

regards, tom lane


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Fix mdsync never-ending loop problem
Date: 2007-04-05 16:57:04
Message-ID: 46152A60.1070404@enterprisedb.com
Lists: pgsql-hackers pgsql-patches

Tom Lane wrote:
> Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
>> Tom Lane wrote:
>>> But I dislike copying the table entries anyway, see comment on -hackers.
>
>> Frankly the cycle id idea sounds more ugly and fragile to me. You'll
>> need to do multiple scans of the hash table that way, starting from top
>> every time you call AbsorbFsyncRequests (like we do now).
>
> How so? You just ignore entries whose cycleid is too large. You'd have
> to be careful about wraparound in the comparisons, but that's not hard
> to deal with. Also, AFAICS you still have the retry problem (and an
> even bigger memory leak problem) with this coding --- the "to-do list"
> doesn't eliminate the issue of correct handling of a failure.

You have to start the hash_seq_search from scratch after each call to
AbsorbFsyncRequests because it can remove entries, including the one the
scan is stopped on.

I think the failure handling is correct in the "to-do list" approach:
when an entry is read from the list, it's checked that the entry hasn't
been removed from the hash table. Actually there was a bug in the
original LDC patch in the failure handling: it replaced the per-entry
failures-counter with a local retry_counter variable, but it wasn't
cleared after a successful write which would lead to bogus ERRORs when
multiple relations are dropped during the mdsync. I kept the original
per-entry counter, though the local variable approach could be made to work.
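
A toy model of the retry-counter problem described above, not the LDC code itself. MAX_FAILURES and sync_pass_succeeds are hypothetical, and each array element simply says how many times that entry's fsync fails before it succeeds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_FAILURES 2          /* hypothetical limit before raising an error */

/*
 * Simulate one sync pass over n entries using a single local retry counter.
 * Returns false if the pass would have raised an error.
 */
bool
sync_pass_succeeds(const int *failures_before_success, size_t n,
                   bool reset_on_success)
{
    int         retry_counter = 0;      /* shared by every entry in the pass */

    assert(failures_before_success != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
    {
        for (int f = 0; f < failures_before_success[i]; f++)
        {
            if (++retry_counter >= MAX_FAILURES)
                return false;   /* the real code would raise an ERROR here */
        }
        if (reset_on_success)
            retry_counter = 0;  /* the step the shared counter was missing */
    }
    return true;
}
```

With three entries that each fail once, the pass succeeds when the counter is cleared after every success and reports a bogus error when it is not, because unrelated failures add up.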

The memory leak obviously needs to be fixed, but that's just a matter of
adding a pfree.

>> Ok, will do that. Or would you like to just take over from here?
>
> No, I'm up to my ears in varlena. You're the one in a position to test
> this, anyway.

Ok.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: Patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Fix mdsync never-ending loop problem
Date: 2007-04-05 17:08:32
Message-ID: 28872.1175792912@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-patches

Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
> I think the failure handling is correct in the "to-do list" approach:
> when an entry is read from the list, it's checked that the entry hasn't
> been removed from the hash table. Actually there was a bug in the
> original LDC patch in the failure handling: it replaced the per-entry
> failures-counter with a local retry_counter variable, but it wasn't
> cleared after a successful write which would lead to bogus ERRORs when
> multiple relations are dropped during the mdsync. I kept the original
> per-entry counter, though the local variable approach could be made to work.

Yeah. One of the things that bothered me about the patch was that it
would be easy to mess up by updating state in the copied entry instead
of the "real" info in the hashtable. It would be clearer what's
happening if the to-do list contains only the lookup keys and not the
whole struct.
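
A sketch of what a keys-only to-do list could look like, under the assumption that a linear array stands in for the hash table. OpKey, OpEntry, lookup_entry and process_todo_keys are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct OpKey
{
    unsigned    relfilenode;
    unsigned    segno;
} OpKey;

typedef struct OpEntry
{
    OpKey       key;
    bool        in_use;         /* false once forgotten or synced */
    int         failures;       /* state lives only in the table */
} OpEntry;

/* Find the live table entry for key, or NULL if it is gone. */
OpEntry *
lookup_entry(OpEntry *table, size_t n, OpKey key)
{
    for (size_t i = 0; i < n; i++)
    {
        if (table[i].in_use &&
            table[i].key.relfilenode == key.relfilenode &&
            table[i].key.segno == key.segno)
            return &table[i];
    }
    return NULL;
}

/* Work through the to-do keys; returns how many entries were synced. */
size_t
process_todo_keys(OpEntry *table, size_t n, const OpKey *todo, size_t ntodo)
{
    size_t      synced = 0;

    assert(todo != NULL || ntodo == 0);
    for (size_t i = 0; i < ntodo; i++)
    {
        OpEntry    *entry = lookup_entry(table, n, todo[i]);

        if (entry == NULL)
            continue;           /* request was forgotten after the snapshot */
        /* the fsync itself would go here; on success retire the request */
        entry->failures = 0;
        entry->in_use = false;
        synced++;
    }
    return synced;
}
```

Because the list holds only keys, every state change has to go through the table entry, so there is no copy that could be updated by mistake.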

regards, tom lane


From: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
To: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Cc: "Patches" <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Fix mdsync never-ending loop problem
Date: 2007-04-05 17:12:54
Message-ID: 1175793174.3623.357.camel@silverbirch.site
Lists: pgsql-hackers pgsql-patches

On Thu, 2007-04-05 at 17:14 +0100, Heikki Linnakangas wrote:

> We could just review and apply ITAGAKI's patch as it is instead of
> this snippet of it, but because that can take some time I'd like to
> see this applied before that.

I think we are just beginning to understand the quality of Itagaki's
thinking.

We should give him a chance to interact on this and if there are parts
of his patch that we want, then it should be him that does it. I'm not
sure that carving the good bits off each other's patches is likely to
help teamwork in the long term. At the very least he deserves much credit
for his farsighted work.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
Cc: Patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Fix mdsync never-ending loop problem
Date: 2007-04-05 17:24:00
Message-ID: 461530B0.2050106@enterprisedb.com
Lists: pgsql-hackers pgsql-patches

Simon Riggs wrote:
> On Thu, 2007-04-05 at 17:14 +0100, Heikki Linnakangas wrote:
>
>> We could just review and apply ITAGAKI's patch as it is instead of
>> this snippet of it, but because that can take some time I'd like to
>> see this applied before that.
>
> I think we are just beginning to understand the quality of Itagaki's
> thinking.
>
> We should give him a chance to interact on this and if there are parts
> of his patch that we want, then it should be him that does it.

Itagaki, would you like to take a stab at this?

> I'm not
> sure that carving the good bits off each other's patches is likely to
> help teamwork in the long term. At the very least he deserves much credit
> for his farsighted work.

Oh sure! Thank you for your efforts, Itagaki!

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
To: pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: [PATCHES] Fix mdsync never-ending loop problem
Date: 2007-04-06 06:05:35
Message-ID: 20070406142846.6A19.ITAGAKI.TAKAHIRO@oss.ntt.co.jp
Lists: pgsql-hackers pgsql-patches

Heikki Linnakangas <heikki(at)enterprisedb(dot)com> wrote:
> Itagaki, would you like to take a stab at this?

Yes, I'll try to fix the mdsync problem. I'll separate this fix from the LDC
patch. If we need to backport the fix to the back branches, a stand-alone
patch would be better.

As I understand the discussion, we had better take the "cycle ID"
approach instead of "making a copy of pendingOpsTable", because a duplicated
table is hard to debug and requires care not to leak memory.
I'll adopt the cycle ID approach and build LDC on it as a separate patch.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
Cc: pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: [PATCHES] Fix mdsync never-ending loop problem
Date: 2007-04-06 06:37:15
Message-ID: 392.1175841435@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-patches

ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp> writes:
> As I understand the discussion, we had better take the "cycle ID"
> approach instead of "making a copy of pendingOpsTable", because a duplicated
> table is hard to debug and requires care not to leak memory.
> I'll adopt the cycle ID approach and build LDC on it as a separate patch.

Heikki made some reasonable arguments against the cycle-ID idea. I'm
not intending to insist on it ...

I do think there are multiple issues here and it'd be better to try
to separate the fixes into different patches.

regards, tom lane


From: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
To: pgsql-patches(at)postgresql(dot)org
Cc: pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: [PATCHES] Fix mdsync never-ending loop problem
Date: 2007-04-10 02:02:00
Message-ID: 20070410102252.8FB5.ITAGAKI.TAKAHIRO@oss.ntt.co.jp
Lists: pgsql-hackers pgsql-patches

(Sorry if you receive duplicate messages. I am resending this because it
was not delivered after a day.)

Here is another patch to fix the never-ending loop in mdsync. I introduced
an mdsync counter (cycle id) and cancel flags to fix the problem.

The mdsync counter is incremented at the very beginning of each mdsync()
call. Each pending entry has a field that is assigned from the counter when
the entry is newly inserted into pendingOpsTable. Only entries that have
smaller counter values than the mdsync counter are fsync-ed in mdsync().

Another change is to add a cancel flag in each pending entry. When a
relation is dropped and bgwriter receives a forget-request, the corresponding
entry is marked as dropped but we don't delete it at that time. Actual
deletion is performed in the next fsync loop. We don't have to retry after
AbsorbFsyncRequests() because entries are not removed outside of seqscan.
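
A self-contained sketch of how a cycle counter and a cancel flag could work together, assuming a small fixed array in place of pendingOpsTable. It is not the attached patch: all names are hypothetical, the stamp test uses plain equality and so ignores wraparound, and a canceled entry that is requested again is revived with a fresh stamp.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TABLE_SIZE 8

typedef struct PendingEntry
{
    unsigned    relfilenode;
    bool        in_use;
    bool        canceled;       /* set by a forget request */
    uint32_t    cycle_ctr;      /* value of the sync counter at insertion */
} PendingEntry;

typedef struct SyncState
{
    PendingEntry table[TABLE_SIZE];
    uint32_t    cycle_ctr;
} SyncState;

static PendingEntry *
find_entry(SyncState *st, unsigned relfilenode)
{
    for (size_t i = 0; i < TABLE_SIZE; i++)
        if (st->table[i].in_use && st->table[i].relfilenode == relfilenode)
            return &st->table[i];
    return NULL;
}

/* Queue an fsync request; returns false if the toy table is full. */
bool
remember_request(SyncState *st, unsigned relfilenode)
{
    PendingEntry *entry = find_entry(st, relfilenode);

    if (entry != NULL)
    {
        if (entry->canceled)    /* revive: treat as an all-new request */
        {
            entry->canceled = false;
            entry->cycle_ctr = st->cycle_ctr;
        }
        return true;            /* a live entry keeps its old stamp */
    }
    for (size_t i = 0; i < TABLE_SIZE; i++)
    {
        if (!st->table[i].in_use)
        {
            st->table[i] = (PendingEntry) {relfilenode, true, false,
                                           st->cycle_ctr};
            return true;
        }
    }
    return false;
}

/* A forget request only marks the entry; nothing is deleted here. */
void
forget_request(SyncState *st, unsigned relfilenode)
{
    PendingEntry *entry = find_entry(st, relfilenode);

    if (entry != NULL)
        entry->canceled = true;
}

/* Start a sync cycle: requests remembered from now on carry the new stamp. */
void
begin_sync_cycle(SyncState *st)
{
    assert(st != NULL);
    st->cycle_ctr++;
}

/* Scan the table once; returns how many entries were "fsynced". */
size_t
sync_pending(SyncState *st)
{
    size_t      synced = 0;

    for (size_t i = 0; i < TABLE_SIZE; i++)
    {
        PendingEntry *entry = &st->table[i];

        if (!entry->in_use)
            continue;
        if (entry->canceled)
        {
            entry->in_use = false;      /* deferred deletion, inside the scan */
            continue;
        }
        if (entry->cycle_ctr == st->cycle_ctr)
            continue;           /* arrived during this cycle: next time */
        entry->in_use = false;  /* the real fsync would go here */
        synced++;
    }
    return synced;
}
```

A request that arrives after begin_sync_cycle() carries the current stamp and is left for the following cycle, so a steady stream of new requests cannot extend the scan.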

This patch can be applied to HEAD, 8.2 and 8.1 with a few hunks.

Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> > As I understand the discussion, we had better take the "cycle ID"
> > approach instead of "making a copy of pendingOpsTable", because a duplicated
> > table is hard to debug and requires care not to leak memory.
> > I'll adopt the cycle ID approach and build LDC on it as a separate patch.
>
> Heikki made some reasonable arguments against the cycle-ID idea. I'm
> not intending to insist on it ...

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

Attachment: fix_mdsync.patch (application/octet-stream, 6.6 KB)

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
Cc: pgsql-patches(at)postgresql(dot)org, pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: [PATCHES] Fix mdsync never-ending loop problem
Date: 2007-04-10 17:36:20
Message-ID: 27771.1176226580@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-patches

ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp> writes:
> Here is another patch to fix the never-ending loop in mdsync. I introduced
> an mdsync counter (cycle id) and cancel flags to fix the problem.

> The mdsync counter is incremented at the very beginning of each mdsync()
> call. Each pending entry has a field that is assigned from the counter when
> the entry is newly inserted into pendingOpsTable. Only entries that have
> smaller counter values than the mdsync counter are fsync-ed in mdsync().

> Another change is to add a cancel flag in each pending entry. When a
> relation is dropped and bgwriter receives a forget-request, the corresponding
> entry is marked as dropped but we don't delete it at that time. Actual
> deletion is performed in the next fsync loop. We don't have to retry after
> AbsorbFsyncRequests() because entries are not removed outside of seqscan.

This patch looks fairly sane to me; I have a few small gripes about
coding style but that can be fixed while applying. Heikki, you were
concerned about the cycle-ID idea; do you have any objection to this
patch?

> This patch can be applied to HEAD, 8.2 and 8.1 with a few hunks.

I don't think we should back-patch something that's a performance fix
for an extreme case, especially not when it's not been through any
extensive testing yet ...

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
Cc: pgsql-patches(at)postgresql(dot)org, pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: [HACKERS] Fix mdsync never-ending loop problem
Date: 2007-04-10 18:41:02
Message-ID: 28448.1176230462@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-patches

I wrote:
> This patch looks fairly sane to me; I have a few small gripes about
> coding style but that can be fixed while applying. Heikki, you were
> concerned about the cycle-ID idea; do you have any objection to this
> patch?

Actually, on second look I think the key idea here is Takahiro-san's
introduction of a cancellation flag in the hashtable entries, to
replace the cases where AbsorbFsyncRequests can try to delete entries.

What that means is mdsync() doesn't need an outer retry loop at all:
the periodic AbsorbFsyncRequests calls are not a hazard, and retry of
FileSync failures can be handled as an inner loop on the single failing
table entry. (We can make the failure counter a local variable, too,
instead of needing space in every hashtable entry.)
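
A minimal sketch of retrying one failing entry in an inner loop with a local failure counter. FsyncFn, sync_entry_with_retry and the fail_first_n test helper are hypothetical; a real loop would also absorb pending requests and recheck the cancel flag between attempts.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback: returns true if the fsync succeeded. */
typedef bool (*FsyncFn) (unsigned relfilenode, void *arg);

/*
 * Retry a single entry until it syncs or has failed max_failures times.
 * The counter is local, so it starts from zero for every entry.
 */
bool
sync_entry_with_retry(unsigned relfilenode, FsyncFn do_fsync, void *arg,
                      int max_failures)
{
    int         failures = 0;

    assert(do_fsync != NULL);
    while (!do_fsync(relfilenode, arg))
    {
        if (++failures >= max_failures)
            return false;       /* the caller would report an error */
    }
    return true;
}

/* Test helper: fail as many times as *arg says, then succeed. */
bool
fail_first_n(unsigned relfilenode, void *arg)
{
    int        *remaining = arg;

    (void) relfilenode;
    if (*remaining > 0)
    {
        (*remaining)--;
        return false;
    }
    return true;
}
```

For example, with max_failures set to 2, an entry whose fsync fails once and then succeeds is reported as synced, while one that keeps failing gives up after the second failure.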

And with that change, it's no longer possible for an incoming stream
of fsync requests to keep mdsync from terminating. It might fsync
more than it really needs to, but it won't repeat itself, and it must
reach the end of the hashtable eventually. So we don't actually need
the cycle counter at all.

It might be worth having the cycle counter anyway just to avoid doing
"useless" fsync work. I'm not sure about this. If we have a cycle
counter of say 32 bits, then it's theoretically possible for an fsync
to fail 2^32 consecutive times and then be skipped on the next try,
allowing a checkpoint to succeed that should not have. We can fix that
with a few more lines of logic to detect a wrapped-around value, but is
it worth the trouble?
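
The wraparound hazard can be shown in miniature. This sketch assumes an 8-bit counter purely so that the wrap is reachable; SmallCtr, stamp_looks_current and cycles_until_misread are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint8_t SmallCtr;       /* deliberately tiny so the wrap is cheap */

/* The naive test: an entry is "new" if its stamp equals the current cycle. */
bool
stamp_looks_current(SmallCtr entry_ctr, SmallCtr current_ctr)
{
    return entry_ctr == current_ctr;
}

/* Count failed cycles until a never-synced entry is mistaken for a new one. */
unsigned
cycles_until_misread(SmallCtr stamp)
{
    SmallCtr    current = stamp;
    unsigned    cycles = 0;

    do
    {
        current++;              /* one more failed cycle; wraps modulo 256 */
        cycles++;
    } while (!stamp_looks_current(stamp, current));

    assert(cycles > 0);
    return cycles;
}
```

With 8 bits the misread happens after 256 failed cycles; the same arithmetic gives 2^32 for a 32-bit counter.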

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
Cc: pgsql-patches(at)postgresql(dot)org, Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: [HACKERS] Fix mdsync never-ending loop problem
Date: 2007-04-10 21:49:05
Message-ID: 3928.1176241745@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-patches

I wrote:
> Actually, on second look I think the key idea here is Takahiro-san's
> introduction of a cancellation flag in the hashtable entries, to
> replace the cases where AbsorbFsyncRequests can try to delete entries.
> What that means is mdsync() doesn't need an outer retry loop at all:

I fooled around with this idea and came up with the attached patch.
It seems to do what's intended but could do with more eyeballs and
testing before committing. Comments please?

(Note: I ignored my own advice not to reindent. Sorry ...)

regards, tom lane

Attachment: unknown_filename (text/plain, 14.8 KB)

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
Cc: pgsql-patches(at)postgresql(dot)org, Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: [HACKERS] Fix mdsync never-ending loop problem
Date: 2007-04-10 23:50:23
Message-ID: 5021.1176249023@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-patches

I wrote:
> I fooled around with this idea and came up with the attached patch.
> It seems to do what's intended but could do with more eyeballs and
> testing before committing. Comments please?

Earlier I said that I didn't want to back-patch this change, but on
looking at the CVS history I'm reconsidering. The performance problem
originates from the decision some time ago to do an AbsorbFsyncRequests
every so often during the mdsync loop; without that, and assuming no
actual failures, there isn't any absorption of new requests before
mdsync can complete. Originally that code only existed in 8.2.x, but
very recently we back-patched it into 8.1.x as part of fixing the
file-deletion-on-Windows problem. This means that 8.1.x users could
see a performance degradation upon updating to 8.1.8 from prior
subreleases, which wouldn't make them happy.

So I'm now thinking we ought to back-patch into 8.2.x and 8.1.x,
but of course that makes it even more urgent that we test the patch
thoroughly.

regards, tom lane


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, pgsql-patches(at)postgresql(dot)org, Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: [HACKERS] Fix mdsync never-ending loop problem
Date: 2007-04-11 07:19:24
Message-ID: 461C8BFC.6000204@enterprisedb.com
Lists: pgsql-hackers pgsql-patches

Tom Lane wrote:
> I wrote:
>> Actually, on second look I think the key idea here is Takahiro-san's
>> introduction of a cancellation flag in the hashtable entries, to
>> replace the cases where AbsorbFsyncRequests can try to delete entries.
>> What that means is mdsync() doesn't need an outer retry loop at all:
>
> I fooled around with this idea and came up with the attached patch.
> It seems to do what's intended but could do with more eyeballs and
> testing before committing. Comments please?

I'm traveling today, but I'll take a closer look at it tomorrow morning.
My first thought is that the cycle_ctr just adds extra complexity. The
canceled-flag really is the key in Takahiro-san's patch, so we don't
need the cycle_ctr anymore.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, pgsql-patches(at)postgresql(dot)org, Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: [HACKERS] Fix mdsync never-ending loop problem
Date: 2007-04-11 14:11:33
Message-ID: 13940.1176300693@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-patches

Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
> My first thought is that the cycle_ctr just adds extra complexity. The
> canceled-flag really is the key in Takahiro-san's patch, so we don't
> need the cycle_ctr anymore.

We don't have to have it in the sense of the code not working without it,
but it probably pays for itself by eliminating useless fsyncs. The
overhead for it in my proposed implementation is darn near zero in the
non-error case. Also, Takahiro-san mentioned at one point that he was
concerned to avoid useless fsyncs because of some property of the LDC
patch --- I wasn't too clear on what, but maybe he can explain.

regards, tom lane


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, pgsql-patches(at)postgresql(dot)org, Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: [HACKERS] Fix mdsync never-ending loop problem
Date: 2007-04-12 08:40:19
Message-ID: 461DF073.6020204@enterprisedb.com
Lists: pgsql-hackers pgsql-patches

Tom Lane wrote:
> Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
>> My first thought is that the cycle_ctr just adds extra complexity. The
>> canceled-flag really is the key in Takahiro-san's patch, so we don't
>> need the cycle_ctr anymore.
>
> We don't have to have it in the sense of the code not working without it,
> but it probably pays for itself by eliminating useless fsyncs. The
> overhead for it in my proposed implementation is darn near zero in the
> non-error case. Also, Takahiro-san mentioned at one point that he was
> concerned to avoid useless fsyncs because of some property of the LDC
> patch --- I wasn't too clear on what, but maybe he can explain.

Ok. Perhaps we should not use the canceled-flag but just remove the
entry from pendingOpsTable like we used to when mdsync_in_progress isn't
set. We might otherwise accumulate a lot of canceled entries in the hash
table if the checkpoint interval is long and relations are created and
dropped as part of normal operation.

I think there's one little bug in the patch:

1. AbsorbFsyncRequests is called. A FORGET message is received, and an
entry in the hash table is marked as canceled
2. Another relation with the same relfilenode is created. This can
happen after OID wrap-around
3. RememberFsyncRequest is called for the new relation. The old entry is
still in the hash table, marked with the canceled-flag, so it's not touched.

The fsync request for the new relation is masked by the old canceled
entry. The trivial fix is to always clear the flag on step 3:

--- md.c 2007-04-11 08:18:08.000000000 +0100
+++ md.c.new 2007-04-12 09:21:00.000000000 +0100
@@ -1161,9 +1161,9 @@
                                               &found);
         if (!found)             /* new entry, so initialize it */
         {
-            entry->canceled = false;
             entry->cycle_ctr = mdsync_cycle_ctr;
         }
+        entry->canceled = false;
         /*
          * NB: it's intentional that we don't change cycle_ctr if the entry
          * already exists.  The fsync request must be treated as old, even

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, pgsql-patches(at)postgresql(dot)org, Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: [HACKERS] Fix mdsync never-ending loop problem
Date: 2007-04-12 13:29:52
Message-ID: 19188.1176384592@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-patches

Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
> Ok. Perhaps we should not use the canceled-flag but just remove the
> entry from pendingOpsTable like we used to when mdsync_in_progress isn't
> set.

I'm not thrilled about that; it seems overly intricate, and won't the
LDC patch make it mostly useless anyway (because of time-extended
checkpointing)?

> I think there's one little bug in the patch:

> 1. AbsorbFsyncRequests is called. A FORGET message is received, and an
> entry in the hash table is marked as canceled
> 2. Another relation with the same relfilenode is created. This can
> happen after OID wrap-around
> 3. RememberFsyncRequest is called for the new relation. The old entry is
> still in the hash table, marked with the canceled-flag, so it's not touched.

Good point. I was wondering what to do with an already-canceled entry,
but didn't think of that scenario. I think your fix is not quite right:
if we clear a pre-existing cancel flag then we do need to set cycle_ctr,
because this is effectively an all-new request.
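
The combined rule can be sketched as follows, with the hash lookup replaced by a found argument. PendingEntry and remember_into are hypothetical names; the field names follow the diff quoted in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct PendingEntry
{
    bool        canceled;
    uint32_t    cycle_ctr;
} PendingEntry;

/*
 * Record an fsync request in entry.  found says whether the entry already
 * existed in the table before this request.
 */
void
remember_into(PendingEntry *entry, bool found, uint32_t current_cycle)
{
    assert(entry != NULL);
    if (!found || entry->canceled)
    {
        /* new entry, or a canceled one being reused: an all-new request */
        entry->canceled = false;
        entry->cycle_ctr = current_cycle;
    }
    /* a live existing entry keeps its stamp, so it is still treated as old */
}
```

Clearing the flag without restamping would still work, at the cost of fsyncing the new relation's file one cycle earlier than necessary.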

regards, tom lane


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, pgsql-patches(at)postgresql(dot)org, Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: [HACKERS] Fix mdsync never-ending loop problem
Date: 2007-04-12 15:12:30
Message-ID: 461E4C5E.8050509@enterprisedb.com
Lists: pgsql-hackers pgsql-patches

Tom Lane wrote:
> Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
>> Ok. Perhaps we should not use the canceled-flag but just remove the
>> entry from pendingOpsTable like we used to when mdsync_in_progress isn't
>> set.
>
> I'm not thrilled about that; it seems overly intricate, and won't the
> LDC patch make it mostly useless anyway (because of time-extended
> checkpointing)?

Not quite useless, but definitely less useful, depending on how long the
checkpoints are stretched and how much of the time is allocated to
fsyncing (the default in the latest LDC patch was 20%). OTOH, this
doesn't seem like a very performance sensitive codepath anyway, so we
should just stick to the simplest thing that works.

>> I think there's one little bug in the patch:
>
>> 1. AbsorbFsyncRequests is called. A FORGET message is received, and an
>> entry in the hash table is marked as canceled
>> 2. Another relation with the same relfilenode is created. This can
>> happen after OID wrap-around
>> 3. RememberFsyncRequest is called for the new relation. The old entry is
>> still in the hash table, marked with the canceled-flag, so it's not touched.
>
> Good point. I was wondering what to do with an already-canceled entry,
> but didn't think of that scenario. I think your fix is not quite right:
> if we clear a pre-existing cancel flag then we do need to set cycle_ctr,
> because this is effectively an all-new request.

Hmm, I guess. Though not setting it just makes us fsync the file earlier
than necessary, which isn't too bad either.

I believe Itagaki-san's motivation for tackling this in the LDC patch
was the fact that it can fsync the same file many times, and in the
worst case go into an endless loop, and adding delays inside the loop
makes it much more likely. After that is fixed, I doubt any of the
optimizations of trying to avoid extra fsyncs make any difference in
real applications, and we should just keep it simple, especially if we
back-patch it.

That said, I'm getting tired of this piece of code :). I'm happy to have
any of the discussed approaches committed soon and move on.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, pgsql-patches(at)postgresql(dot)org, Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: [HACKERS] Fix mdsync never-ending loop problem
Date: 2007-04-12 15:42:20
Message-ID: 22503.1176392540@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-patches

Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
> I believe Itagaki-san's motivation for tackling this in the LDC patch
> was the fact that it can fsync the same file many times, and in the
> worst case go into an endless loop, and adding delays inside the loop
> makes it much more likely. After that is fixed, I doubt any of the
> optimizations of trying to avoid extra fsyncs make any difference in
> real applications, and we should just keep it simple, especially if we
> back-patch it.

I looked at the dynahash code and noticed that new entries are attached
to the *end* of their hashtable chain. While this maybe should be
changed to link them at the front, the implication at the moment is that
without a cycle counter it would still be possible to loop indefinitely
because we'd continue to revisit the same file(s) after removing their
hashtable entries. I think you'd need a constant stream of requests for
more than one file falling into the same hash chain, but it certainly
seems like a potential risk. I'd prefer a solution that adheres to the
dynahash API's statement that it's unspecified whether newly-added
entries will be visited by hash_seq_search, and will in fact not loop
even if they always are visited.
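
A toy model of that risk, assuming the worst case in which every fsync is followed by one new request that lands at the tail of the chain being scanned. It counts entries rather than building a real hash chain, and scan_chain is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Scan one chain that starts with initial entries.  After each fsync a
 * backend queues one new request, which is appended at the chain's tail.
 * Returns the number of fsyncs done before the scan reaches the end of
 * the chain, or limit if it gives up first.
 */
unsigned
scan_chain(unsigned initial, bool use_cycle_ctr, unsigned limit)
{
    unsigned    old_left = initial;     /* queued before the scan began */
    unsigned    new_left = 0;           /* appended during the scan */
    unsigned    fsyncs = 0;

    assert(limit > 0);
    while ((old_left > 0 || new_left > 0) && fsyncs < limit)
    {
        if (old_left > 0)
            old_left--;         /* sync and remove an old entry */
        else if (use_cycle_ctr)
            break;              /* only current-cycle entries remain: skip */
        else
            new_left--;         /* revisit an entry added during the scan */
        fsyncs++;
        new_left++;             /* a new request arrives at the tail */
    }
    return fsyncs;
}
```

With the stamp check the scan stops after exactly initial fsyncs; without it, this model only stops when it hits the limit.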

> That said, I'm getting tired of this piece of code :).

Me too.

regards, tom lane