Re: Support for N synchronous standby servers

From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Rajeev rastogi <rajeev(dot)rastogi(at)huawei(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Support for N synchronous standby servers
Date: 2014-09-12 05:13:46
Message-ID: CAB7nPqRvAcsXt2MCf2Fy6GH2gpU51+8JhSoZnWHQmJgwqj70gA@mail.gmail.com
Lists: pgsql-hackers

On Fri, Sep 12, 2014 at 12:48 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Wed, Sep 10, 2014 at 11:40 PM, Michael Paquier
> <michael(dot)paquier(at)gmail(dot)com> wrote:
>> Currently two nodes can only have the same priority if they have the
>> same application_name, so we could for example add a new connstring
>> parameter called, let's say, application_group, to define groups of
>> nodes that will have the same priority (if a node does not define
>> application_group, it defaults to application_name; if application_name
>> is NULL, well, we don't care much, as the node cannot be a sync
>> candidate). That's a first idea that we could use to control groups of
>> nodes. We could then switch syncrep.c to use application_group in
>> s_s_names instead of application_name. That would be
>> backward-compatible, and could open the door for more improvements for
>> quorum commits as we could control groups of nodes. This is a super-set
>> of what application_name can already do, but it remains easy to
>> identify single nodes within the same data center and how far behind
>> they are in replication, so I think that this would be really
>> user-friendly. An idea similar to that would be a base for the next
>> thing... See below.
>
> In general, I think the user's requirement for which synchronous
> standbys need to acknowledge a commit could be an arbitrary
> Boolean expression - well, probably no NOT, but any amount of AND and
> OR that you want to use. Can someone want A OR (((B AND C) OR (D AND
> E)) AND F)? Maybe! Based on previous discussions, it seems not
> unlikely that as soon as we decide we don't want to support that,
> someone will tell us they can't live without it. In general, though,
> I'd expect the two common patterns to be more or less what you've set
> forth above: any K servers from set X plus any L servers from set Y
> plus any M servers from set Z, etc. However, I'm not confident it's
> right to control this by adding more configuration on the client side.
> I think it would be better to stick with the idea that each client
> specifies an application_name, and then the master specifies the
> policy in some way. One advantage of that is that you can change the
> rules in ONE place - the master - rather than potentially having to
> update every client.
OK. I see your point.

Now, what about the following assumptions (really restrictions, meant
to simplify the user experience of setting up syncrep and the
parametrization of this feature):
- Nodes belong to the same set (or group) if they have the same
priority, i.e. the same application_name.
- One node cannot be part of two sets. That's obvious...

The current patch has its own merit, but it fails in the case you and
Heikki are describing: wait for k nodes in set 1 (nodes with the lowest
priority value), l nodes in set 2 (nodes with the 2nd-lowest priority
value), etc.
What it does is this: if, for example, we have a set of nodes with
priorities {0,1,1,2,2,3,3}, backends will wait for flush_position from
the first s_s_num nodes. Setting s_s_num to 3 waits for {0,1,1},
setting it to 4 waits for {0,1,1,2}, etc.

Now, what about this: instead of waiting for the nodes in "absolute"
order the way the current patch does, let's do it in a "relative"
way. By that I mean that a backend waits for a flush_position
confirmation from only *1* node among a set of nodes having the same
priority. So with s_s_num = 3, we'll wait for {0, "one node with
1", "one node with 2"}, and you can guess the rest.

The point as well is that we can keep the s_s_num behavior as it is now:
- if set to -1, we rely on the current s_s_names behavior (an empty
list means all nodes are async, at least one entry means that we need
to wait for a node)
- if set to 0, all nodes are forced to be async
- if set to n >= 1, we wait for one node in each of the sets with the
n lowest priority values.
I think enough users would be happy with those improvements (a rough
sketch of how those values could be interpreted follows below), and
that would help improve the coverage of the test cases that Heikki and
you envisioned.
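To make the value handling concrete, here is a minimal sketch of how
those three cases could be dispatched. The function and enum names are
invented for illustration and do not come from the patch.

/*
 * Hypothetical dispatch on the proposed s_s_num values; the names are
 * made up, only the -1 / 0 / n >= 1 semantics follow the proposal.
 */
#include <stdio.h>

typedef enum
{
    SYNCREP_ALL_ASYNC,          /* every standby is asynchronous */
    SYNCREP_BY_NAMES,           /* fall back to s_s_names behavior */
    SYNCREP_WAIT_N_SETS         /* one node per n lowest-priority sets */
} SyncRepMode;

static SyncRepMode
sync_rep_mode(int s_s_num, const char *s_s_names)
{
    if (s_s_num == 0)
        return SYNCREP_ALL_ASYNC;       /* everything forced async */
    if (s_s_num == -1)
    {
        /* default s_s_names behavior: an empty list means all async */
        if (s_s_names == NULL || s_s_names[0] == '\0')
            return SYNCREP_ALL_ASYNC;
        return SYNCREP_BY_NAMES;
    }
    return SYNCREP_WAIT_N_SETS;         /* n >= 1 */
}

int
main(void)
{
    printf("%d\n", sync_rep_mode(-1, "node_a,node_b")); /* BY_NAMES */
    printf("%d\n", sync_rep_mode(0, "node_a,node_b"));  /* ALL_ASYNC */
    printf("%d\n", sync_rep_mode(3, "node_a,node_b"));  /* WAIT_N_SETS */
    return 0;
}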

By the way, as the CF is running out of time, I am going to mark this
patch as "Returned with Feedback", as I have received enough feedback.
I am still planning to work on this for the next CF, so it would be
great if there is agreement on what can be done for this feature, to
avoid blind progress. In particular I see some merit in the last idea,
which we could still extend by allowing values of the type "k,l,m" in
s_s_num to let the user decide: wait for 3 sets, with k nodes in set 1,
l nodes in set 2 and m nodes in set 3. A GUC parameter holding a list
of integer values is not that user-friendly though, so I think I'd
stick with waiting for only one node per set.

Thoughts?
--
Michael
