From: Markus Wanner <markus(at)bluegap(dot)ch>
To: PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Postgres-R: internal messaging
Date: 2008-07-23 07:17:31
Message-ID: 4886DB0B.1090508@bluegap.ch
Lists: pgsql-hackers

Hi,

As you certainly know by now, Postgres-R introduces an additional
manager process. It is forked from the postmaster, as are all backends,
no matter whether they are processing local or remote transactions.
That leads to a communication problem, which was originally (i.e. around
Postgres-R for 6.4) solved using unix pipes. I didn't like that
approach for various reasons: first, AFAIK there are portability issues;
second, it eats file descriptors; and third, it involves copying the
messages around several times. As the replication manager needs to talk
to the backends, but both need to be forked from the postmaster, pipes
would also have to go through the postmaster process.

Trying to be as portable as Postgres itself while still wanting an
efficient messaging system, I came up with the imessages stuff, which
I've already posted to -patches before [1]. It uses shared memory to
store and 'transfer' the messages, and signals (the so far unused
SIGUSR2, IIRC) to notify the receiving processes. Of course this implies
a hard limit on the total size of messages waiting to be delivered, due
to the fixed size of the shared memory area.
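
Just to illustrate the general idea (all names here are made up for the
example, not the actual Postgres-R definitions), an imessage in the
fixed-size shared area looks roughly like this:

#include <stddef.h>
#include <sys/types.h>

typedef enum IMessageType
{
    IMSGT_CHANGESET,          /* change set from a backend to the manager */
    IMSGT_HELPER_RESULT       /* result from a helper backend             */
} IMessageType;

typedef struct IMessageHdr
{
    IMessageType type;        /* what kind of message follows             */
    pid_t        sender;      /* pid of the sending process               */
    pid_t        recipient;   /* pid of the process to signal (SIGUSR2)   */
    size_t       payload_len; /* bytes of payload following the header    */
    /* payload bytes follow immediately after the header */
} IMessageHdr;

/* the whole queue lives in one fixed-size shared memory segment, so the
 * sum of all undelivered (header + payload) bytes is hard-limited */
#define IMESSAGE_AREA_SIZE (1024 * 1024)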

Besides the communication between the replication manager and the
backends, which is currently done by using these imessages, the
replication manager also needs to communicate with the postmaster: it
needs to be able to request new helper backends and it wants to be
notified upon termination (or crash) of such a helper backend (and other
backends as well...). I'm currently doing this with imessages as well,
which violates the rule that the postmaster may not touch shared
memory. I didn't look into ripping that out yet. I'm not sure it can be
done with the existing signaling of the postmaster.

Let's have a simple example: consider a local transaction which changes
some tuples. Those are being collected into a change set, which gets
written to the shared memory area as an imessage for the replication
manager. The backend then also signals the manager, which then awakes
from its select(), checks its imessages queue and processes the message,
delivering it to the GCS. It then removes the imessage from the shared
memory area again.
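
In rough pseudo-C, the send path on the backend side looks about like
this (imsg_reserve() and imsg_commit() are hypothetical helpers standing
in for the actual queue handling):

#include <signal.h>
#include <string.h>
#include <sys/types.h>

extern void *imsg_reserve(size_t len);  /* assumed: reserves queue space   */
extern void  imsg_commit(void *msg);    /* assumed: links message in queue */

static void
send_changeset_to_manager(pid_t manager_pid, const char *cs, size_t cs_len)
{
    void *msg = imsg_reserve(cs_len);   /* may have to wait for free space */

    memcpy(msg, cs, cs_len);            /* copy the change set payload     */
    imsg_commit(msg);                   /* make it visible to the manager  */

    kill(manager_pid, SIGUSR2);         /* wake the manager from select()  */
}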

My initial design features only a single doubly linked list as the
message queue, holding all messages for all processes. An imessages lock
blocks concurrent write access. That's still what's in there, but I
realize that's not enough. Each process should better have its own
queue, and the single lock needs to vanish to avoid contention on that
lock. However, that would require dynamically allocatable shared memory...
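
Something along these lines, sketched with illustrative names only (the
lock field stands in for whatever shared-memory lock would actually be
used):

struct imessage;                         /* opaque, lives in shared memory */

typedef struct PerProcessQueue
{
    int              lock;               /* placeholder for a spinlock     */
    struct imessage *head;               /* oldest undelivered message     */
    struct imessage *tail;               /* newest undelivered message     */
} PerProcessQueue;

/* one queue per potential process, all within the shared memory area */
#define MAX_PROCS 128
extern PerProcessQueue process_queues[MAX_PROCS];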

As another side note: I've had to write methods similar to those in
libpq, which serialize and deserialize integers or strings. The libpq
functions were not appropriate because they cannot write to shared
memory; instead they are designed to flush to a socket, if I understand
correctly. Maybe these could be extended or modified to be usable there
as well? I've been hesitant and have rather implemented separate methods
in src/backend/storage/ipc/buffer.c.
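
For illustration, the kind of helpers I mean (names made up, not the
actual functions in buffer.c) pack into a caller-supplied buffer instead
of flushing to a socket:

#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>

typedef struct buffer
{
    char   *data;   /* start of the destination buffer */
    size_t  size;   /* total capacity                   */
    size_t  fill;   /* bytes already written            */
} buffer;

static int
put_int32(buffer *b, int32_t v)
{
    uint32_t nv = htonl((uint32_t) v);          /* network byte order */

    if (b->fill + sizeof(nv) > b->size)
        return -1;                              /* would overflow     */
    memcpy(b->data + b->fill, &nv, sizeof(nv));
    b->fill += sizeof(nv);
    return 0;
}

static int
put_string(buffer *b, const char *s)
{
    size_t len = strlen(s) + 1;                 /* include terminator */

    if (b->fill + len > b->size)
        return -1;
    memcpy(b->data + b->fill, s, len);
    b->fill += len;
    return 0;
}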

Comments?

Regards

Markus Wanner

[1]: last time I published IMessage stuff on -patches, WIP:
http://archives.postgresql.org/pgsql-patches/2007-01/msg00578.php


From: Alexey Klyukin <alexk(at)commandprompt(dot)com>
To: Markus Wanner <markus(at)bluegap(dot)ch>
Cc: PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Postgres-R: internal messaging
Date: 2008-07-23 09:41:43
Message-ID: 20080723094143.GA27415@katana.lan
Lists: pgsql-hackers

Markus Wanner wrote:
> Besides the communication between the replication manager and the
> backends, which is currently done by using these imessages, the
> replication manager also needs to communicate with the postmaster: it
> needs to be able to request new helper backends and it wants to be
> notified upon termination (or crash) of such a helper backend (and other
> backends as well...). I'm currently doing this with imessages as well,
> which violates the rule that the postmaster may not touch shared
> memory. I didn't look into ripping that out yet. I'm not sure it can be
> done with the existing signaling of the postmaster.

In Replicator we avoided the need for postmaster to read/write backend's
shmem data by using it as a signal forwarder. When a backend wants to
inform a special process (i.e. the queue monitor) about a
replication-related event (such as a commit), it sends SIGUSR1 to the
postmaster with a related "reason" flag, and the postmaster, upon
receiving this signal, forwards it to the destination process.
Termination of backends and special processes is handled by the
postmaster itself.
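
Roughly like this, with illustrative names only (the actual Replicator
code differs in detail):

#include <signal.h>
#include <sys/types.h>

typedef enum FwdReason
{
    FWD_NONE = 0,
    FWD_COMMIT_EVENT              /* a backend committed a replicated xact */
} FwdReason;

/* lives in a small, static piece of shared memory */
extern volatile sig_atomic_t fwd_reason;
extern pid_t postmaster_pid;
extern pid_t queue_monitor_pid;

static void
backend_notify_queue_monitor(void)
{
    fwd_reason = FWD_COMMIT_EVENT;       /* record the reason   */
    kill(postmaster_pid, SIGUSR1);       /* poke the postmaster */
}

/* postmaster's SIGUSR1 handler: only reads the flag and re-signals */
static void
postmaster_sigusr1_handler(int signo)
{
    (void) signo;

    if (fwd_reason == FWD_COMMIT_EVENT)
    {
        fwd_reason = FWD_NONE;
        kill(queue_monitor_pid, SIGUSR1);  /* forward to the destination */
    }
}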

>
> Let's have a simple example: consider a local transaction which changes
> some tuples. Those are being collected into a change set, which gets
> written to the shared memory area as an imessage for the replication
> manager. The backend then also signals the manager, which then awakes
> from its select(), checks its imessages queue and processes the message,
> delivering it to the GCS. It then removes the imessage from the shared
> memory area again.

Hm... what would happen to new data under heavy load, when the queue
eventually fills up with messages: would the relevant transactions be
aborted, or would they wait for the manager to release the queue space
occupied by already processed messages? ISTM that having a fixed-size
buffer limits the maximum transaction rate.

>
> My initial design features only a single doubly linked list as the
> message queue, holding all messages for all processes. An imessages lock
> blocks concurrent writing acces. That's still what's in there, but I
> realize that's not enough. Each process should better have its own
> queue, and the single lock needs to vanish to avoid contention on that
> lock. However, that would require dynamically allocatable shared
> memory...

What about keeping the per-process message queue in the local memory of
the process, and exporting only the queue head to the shmem, thus having
only one message per process there? When the queue manager gets a
message from a process, it may signal that process to copy the next
message from the process's local memory into the shmem. To keep a
correct ordering of queue messages, an additional shared memory queue of
pid_t can be maintained, containing one pid per message.
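
A rough sketch of that layout, with made-up names (ring overflow
handling omitted):

#include <stdbool.h>
#include <string.h>
#include <sys/types.h>

#define MSG_SLOT_SIZE 8192
#define PID_RING_SIZE 1024

typedef struct SharedSlot          /* one per process, in shmem          */
{
    size_t len;                    /* 0 means the slot is free           */
    char   payload[MSG_SLOT_SIZE]; /* the single exported message        */
} SharedSlot;

typedef struct PidRing             /* shared, preserves global ordering  */
{
    unsigned head;                 /* next position to consume           */
    unsigned tail;                 /* next position to append            */
    pid_t    pids[PID_RING_SIZE];  /* one sender pid per queued message  */
} PidRing;

/* Backend side: export the next locally queued message if the slot is
 * free; the rest of the local queue stays in process-local memory. */
static bool
export_next_message(SharedSlot *slot, PidRing *ring,
                    const char *msg, size_t len, pid_t mypid)
{
    if (slot->len != 0 || len > MSG_SLOT_SIZE)
        return false;              /* slot still busy, or message too big */

    memcpy(slot->payload, msg, len);
    slot->len = len;               /* publish the message                 */
    ring->pids[ring->tail++ % PID_RING_SIZE] = mypid;
    return true;
}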

--
Alexey Klyukin http://www.commandprompt.com/
The PostgreSQL Company - Command Prompt, Inc.


From: Markus Wanner <markus(at)bluegap(dot)ch>
To: Markus Wanner <markus(at)bluegap(dot)ch>, PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Postgres-R: internal messaging
Date: 2008-07-23 10:31:31
Message-ID: 48870883.9040807@bluegap.ch
Lists: pgsql-hackers

Hi Alexey,

thanks for your feedback, these are interesting points.

Alexey Klyukin wrote:
> In Replicator we avoided the need for postmaster to read/write backend's
> shmem data by using it as a signal forwarder. When a backend wants to
> inform a special process (i.e. the queue monitor) about a
> replication-related event (such as a commit), it sends SIGUSR1 to the
> postmaster with a related "reason" flag, and the postmaster, upon
> receiving this signal, forwards it to the destination process.
> Termination of backends and special processes is handled by the
> postmaster itself.

Hm.. how about larger data chunks, like change sets? In Postgres-R,
those need to travel between the backends and the replication manager,
which then sends them to the GCS.

> Hm... what would happen to new data under heavy load, when the queue
> eventually fills up with messages: would the relevant transactions be
> aborted, or would they wait for the manager to release the queue space
> occupied by already processed messages? ISTM that having a fixed-size
> buffer limits the maximum transaction rate.

That's why the replication manager is a very simple forwarder, which
does not block messages, but consumes them immediately from shared
memory. It already features a message cache, which holds messages it
cannot currently forward to a backend because all backends are busy.

And it takes care to only send change sets to helper backends which are
not busy and can process the remote transaction immediately. That way,
I don't think the limit on shared memory is the bottleneck. However, I
didn't measure.

WRT waiting vs aborting: I think at the moment I don't handle this
situation gracefully. I've never encountered it. ;-) But I think the
simpler option is letting the sender wait until there is enough room in
the queue for its message. To avoid deadlocks, each process should
consume its own messages before trying to send one. (Which is done
correctly only for the replication manager ATM, not for the backends,
IIRC.)
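
Roughly along these lines (all helper functions are hypothetical):

#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

extern bool imsg_try_send(pid_t to, const void *msg, size_t len);
extern void imsg_consume_all(void);     /* drain our own inbound queue */
extern void imsg_wait_for_space(void);  /* e.g. sleep until signalled  */

static void
imsg_send_blocking(pid_t to, const void *msg, size_t len)
{
    while (!imsg_try_send(to, msg, len))
    {
        imsg_consume_all();    /* never sleep while our own queue is full */
        imsg_wait_for_space();
    }
}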

> What about keeping the per-process message queue in the local memory of
> the process, and exporting only the queue head to the shmem, thus having
> only one message per process there?

The replication manager already does that with its cache. No other
process needs to send (large enough) messages which cannot be consumed
immediately. So such a local cache does not make much sense for any
other process.

Even for the replication manager, I find it dubious to require such a
cache, because it introduces an unnecessary copying of data within memory.

> When the queue manager gets a
> message from a process, it may signal that process to copy the next
> message from the process's local memory into the shmem. To keep a
> correct ordering of queue messages, an additional shared memory queue of
> pid_t can be maintained, containing one pid per message.

The replication manager takes care of the ordering for cached messages.

Regards

Markus Wanner


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alexey Klyukin <alexk(at)commandprompt(dot)com>
Cc: Markus Wanner <markus(at)bluegap(dot)ch>, PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Postgres-R: internal messaging
Date: 2008-07-23 15:01:29
Message-ID: 15284.1216825289@sss.pgh.pa.us
Lists: pgsql-hackers

Alexey Klyukin <alexk(at)commandprompt(dot)com> writes:
> Markus Wanner wrote:
>> I'm currently doing this with imessages as well,
>> which violates the rule that the postmaster may not touch shared
>> memory. I didn't look into ripping that out yet. I'm not sure it can be
>> done with the existing signaling of the postmaster.

> In Replicator we avoided the need for postmaster to read/write backend's
> shmem data by using it as a signal forwarder.

You should also look at the current code for communication between
autovac launcher and autovac workers. That seems to be largely a
similar problem, and it's been solved in a way that seems to be
safe enough with respect to the postmaster vs shared memory issue.

regards, tom lane


From: Markus Wanner <markus(at)bluegap(dot)ch>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alexey Klyukin <alexk(at)commandprompt(dot)com>, PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Postgres-R: internal messaging
Date: 2008-07-23 17:36:37
Message-ID: 48876C25.40409@bluegap.ch
Lists: pgsql-hackers

Hi,

Tom Lane wrote:
> You should also look at the current code for communication between
> autovac launcher and autovac workers. That seems to be largely a
> similar problem, and it's been solved in a way that seems to be
> safe enough with respect to the postmaster vs shared memory issue.

Oh yeah, thanks for reminding me. Back when it was added I thought I
might find some helpful insights in there. But I didn't ever take the
time to read through it...

Regards

Markus Wanner


From: Markus Wanner <markus(at)bluegap(dot)ch>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alexey Klyukin <alexk(at)commandprompt(dot)com>, PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Postgres-R: internal messaging
Date: 2008-07-23 20:26:52
Message-ID: 4887940C.4090305@bluegap.ch
Lists: pgsql-hackers

Hi,

what follows are some comments after trying to understand how the
autovacuum launcher works and thoughts on how to apply this to the
replication manager in Postgres-R.

The initial comments in autovacuum.c say:

> If the fork() call fails in the postmaster, it sets a flag in the shared
> memory area, and sends a signal to the launcher.

I note that the shmem area that the postmaster is writing to is pretty
static and not dependent on any other state stored in shmem. That
certainly makes a difference compared to my imessages approach, where a
corruption in the shmem for imessages could also confuse the postmaster.

Reading on, the 'can_launch' flag in the launcher's main loop makes sure
that only one worker is requested concurrently, so that the launcher
doesn't miss a failure or success notice from either the postmaster or
the newly started worker. The replication manager currently shamelessly
requests as many helper backends as it wants. I think I can change that
without much trouble. It would certainly make sense.
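
Something like this, sketched with made-up names:

#include <stdbool.h>

extern void request_helper_from_postmaster(void);  /* e.g. via a PMSIGNAL */

static bool helper_request_pending = false;        /* manager-local state */

static void
maybe_request_helper(void)
{
    if (helper_request_pending)
        return;                  /* wait for the previous request first */

    helper_request_pending = true;
    request_helper_from_postmaster();
}

static void
helper_fork_succeeded(void)
{
    helper_request_pending = false;  /* a new request may be issued      */
}

static void
helper_fork_failed(void)
{
    helper_request_pending = false;  /* retry on the next main loop pass */
}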

What remains is notifying the replication manager after termination or
crashes of a helper backend. Upon normal errors (i.e. elog(ERROR...)),
the backend processes themselves should take care of notifying the
replication manager. But crashes are more difficult. IMO the replication
manager needs to stay alive during this reinitialization, to keep the
GCS connection. However, it can easily detach from shared memory
temporarily (the imessages stuff is the only shmem place it touches,
IIRC). However, a more difficult aspect is: it must be able to tell if a
backend has applied its transaction *before* it died or not. Thus, after
all backends have been killed, the postmaster needs to wait with
reinitializing shared memory, until the replication manager has consumed
all its messages. (Otherwise we would risk "losing" local transactions,
probably also remote ones).

So, yes, after thinking about it, detaching the postmaster from shared
memory seems doable for Postgres-R (in the sense of "the postmaster does
not rely on possibly corrupted data in shared memory"). Reinitialization
needs some more thoughts, but in general that seems like the way to go.

Regards

Markus Wanner


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Markus Wanner <markus(at)bluegap(dot)ch>
Cc: Alexey Klyukin <alexk(at)commandprompt(dot)com>, PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Postgres-R: internal messaging
Date: 2008-07-23 20:51:42
Message-ID: 11514.1216846302@sss.pgh.pa.us
Lists: pgsql-hackers

Markus Wanner <markus(at)bluegap(dot)ch> writes:
> ... crashes are more difficult. IMO the replication
> manager needs to stay alive during this reinitialization, to keep the
> GCS connection. However, it can easily detach from shared memory
> temporarily (the imessages stuff is the only shmem place it touches,
> IIRC). However, a more difficult aspect is: it must be able to tell if a
> backend has applied its transaction *before* it died or not. Thus, after
> all backends have been killed, the postmaster needs to wait with
> reinitializing shared memory, until the replication manager has consumed
> all its messages. (Otherwise we would risk "losing" local transactions,
> probably also remote ones).

I hope you're not expecting the contents of shared memory to still be
trustworthy after a backend crash. If the manager is working strictly
from its own local memory, then it would be reasonable to operate
as above.

regards, tom lane


From: Markus Wanner <markus(at)bluegap(dot)ch>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alexey Klyukin <alexk(at)commandprompt(dot)com>, PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Postgres-R: internal messaging
Date: 2008-07-24 07:44:04
Message-ID: 488832C4.1010003@bluegap.ch
Lists: pgsql-hackers

Hi,

Tom Lane wrote:
> I hope you're not expecting the contents of shared memory to still be
> trustworthy after a backend crash.

Hm.. that's a good point.

So I would either need to bullet-proof the imessages with checksums or
some such, and I'm not sure that's doable reliably, not to mention the
performance cost.

Thus it might be better to just restart the replication manager as well.
Note that this means leaving the replication group temporarily and going
through node recovery to apply the remote transactions it has missed in
the meantime. That sounds expensive, but it's certainly the safer way to
do it. And since such backend crashes are Expected Not To Happen(tm) on
production systems, that's probably good enough.

> If the manager is working strictly
> from its own local memory, then it would be reasonable to operate
> as above.

That's not the case... :-(

Thanks for your excellent guidance.

Regards

Markus


From: Markus Wanner <markus(at)bluegap(dot)ch>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alexey Klyukin <alexk(at)commandprompt(dot)com>, PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Postgres-R: internal messaging
Date: 2008-07-30 08:07:35
Message-ID: 48902147.2010300@bluegap.ch
Lists: pgsql-hackers

Hi,

That's now changed in today's snapshot of Postgres-R: the postmaster no
longer uses imessages (and thus shared memory) to communicate with the
replication manager. Instead, the manager signals the postmaster using a
newish PMSIGNAL for requesting new helper backends. It now only requests
one helper at a time and keeps track of pending requests. The helper
backends now read the name of the database to which they must connect
from shared memory themselves. That should now adhere to the standard
Postgres rules for shared memory safety.
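
Roughly, the request path now looks like this
(PMSIGNAL_START_REPLICATION_HELPER and the shmem struct are just
illustrative names here, the value being a new entry in the
PMSignalReason enum):

#include "postgres.h"
#include "storage/pmsignal.h"

typedef struct ReplHelperRequest
{
    char dbname[NAMEDATALEN];    /* database the helper must connect to */
} ReplHelperRequest;

extern ReplHelperRequest *ReplHelperShmem;   /* assumed shmem struct */

static void
request_helper_backend(const char *dbname)
{
    strlcpy(ReplHelperShmem->dbname, dbname, NAMEDATALEN);

    /* wakes the postmaster; it only forks the helper, which then reads
     * the database name from shared memory itself */
    SendPostmasterSignal(PMSIGNAL_START_REPLICATION_HELPER);
}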

Additionally, the replication manager is now restarted after a backend
crash, to make sure it never tries to work on corrupted shared memory.
However, that part isn't complete, as the replication manager cannot
really handle that situation just yet. There are other outstanding
issues having to do with that change; those are documented in the TODO
file in src/backend/replication/.

Regards

Markus Wanner