PostgreSQL clustering (shared disk)

Lists: pgsql-general
From: "Mikko Partio" <mpartio(at)gmail(dot)com>
To: pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: PostgreSQL clustering (shared disk)
Date: 2007-08-16 06:42:01
Message-ID: 2ca799770708152342o1f92b63r26d70be3fb71936f@mail.gmail.com

Hello list,

I have a mission to implement a two-node active-passive PostgreSQL cluster.
The databases in the cluster are rather large (hundreds of GB), which prompts
me to consider a shared disk environment. I know this is not natively
supported with PostgreSQL, but I have been investigating the Red Hat Cluster
Suite with GFS. The idea would be that the cluster programs with GFS (and HP
iLO) would make sure that only one postmaster at a time would be able to
access the shared disk, and in case the active node fails the cluster
software would shift the services to the previously passive node. What I'm
wondering is whether the cluster is able to keep the postmasters
synchronized at all times so that the database won't get corrupted.

Is there anyone on the list that has seen such a configuration, or, even
better, implemented it themselves? I found a small document by Devrim Gunduz
describing this scenario but it was rather scant on details.

If shared disk is definitely out of the question, the fallback plan would be
to use drbd and linux-ha.

Regards

MP


From: Hannes Dorbath <light(at)theendofthetunnel(dot)de>
To: Mikko Partio <mpartio(at)gmail(dot)com>
Subject: Re: PostgreSQL clustering (shared disk)
Date: 2007-08-16 07:01:23
Message-ID: 46C3F643.7020102@theendofthetunnel.de

On 16.08.2007 08:42, Mikko Partio wrote:
> I have a mission to implement a two-node active-passive PostgreSQL cluster.
> The databases in the cluster are rather large (hundreds of GB), which prompts
> me to consider a shared disk environment. I know this is not natively
> supported with PostgreSQL, but I have been investigating the Red Hat Cluster
> Suite with GFS. The idea would be that the cluster programs with GFS (and HP
> iLO) would make sure that only one postmaster at a time would be able to
> access the shared disk, and in case the active node fails the cluster
> software would shift the services to the previously passive node. What I'm
> wondering is whether the cluster is able to keep the postmasters
> synchronized at all times so that the database won't get corrupted.
>
> Is there anyone on the list that has seen such a configuration, or, even
> better, implemented it themselves? I found a small document by Devrim Gunduz
> describing this scenario but it was rather scant on details.
>
> If shared disk is definitely out of the question, the fallback plan would be
> to use drbd and linux-ha.

The usual setup is DRBD + Heartbeat, which is fast, simple and proven.
Using a shared disk / SAN has disadvantages, such as being a single point
of failure, (usually) non-native fencing and (usually) much higher latency.

DRBD handles a lot by itself that you would have to take care of yourself
with a plain shared device. Using a cluster file system such as GFS2,
OCFS2 or Lustre is a waste of resources, as you can't have active/active
with PostgreSQL anyway.

--
Regards,
Hannes Dorbath


From: Devrim GÜNDÜZ <devrim(at)CommandPrompt(dot)com>
To: Mikko Partio <mpartio(at)gmail(dot)com>
Cc: pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: PostgreSQL clustering (shared disk)
Date: 2007-08-16 07:05:31
Message-ID: 1187247931.2878.10.camel@localhost.localdomain

Hi,

On Thu, 2007-08-16 at 09:42 +0300, Mikko Partio wrote:
> The idea would be that the cluster programs with GFS (and HP iLO)
> would make sure that only one postmaster at a time would be able to
> access the shared disk, and in case the active node fails the cluster
> software would shift the services to the previously passive node.

AFAIK, it is the fence device that will prevent the postmaster on the
failed node from accessing the shared disk. RHCS will just switch the
servers.

> What I'm wondering is whether the cluster is able to keep the
> postmasters synchronized at all times so that the database won't get
> corrupted.

Keep all of $PGDATA on the shared disk. That would minimize data loss.
(Of course, there is still a risk of data loss -- the postmasters are
not aware of each other and they don't share each other's buffers, etc.)

Regards,
--
Devrim GÜNDÜZ
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Managed Services, Shared and Dedicated Hosting
Co-Authors: plPHP, ODBCng - http://www.commandprompt.com/


From: Devrim GÜNDÜZ <devrim(at)CommandPrompt(dot)com>
To: Mikko Partio <mpartio(at)gmail(dot)com>
Cc: pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: PostgreSQL clustering (shared disk)
Date: 2007-08-16 07:58:30
Message-ID: 1187251110.2878.27.camel@localhost.localdomain

Hi,

On Thu, 2007-08-16 at 10:05 +0300, Devrim GÜNDÜZ wrote:
> (Of course, there is still a risk of data loss -- the postmasters are
> not aware of each other and they don't share each other's buffers,
> etc.)

Err... I was talking about uncommitted transactions, and of course losing
those does not count as data loss. (Thanks to Magnus for the reminder.)

Regards,
--
Devrim GÜNDÜZ
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Managed Services, Shared and Dedicated Hosting
Co-Authors: plPHP, ODBCng - http://www.commandprompt.com/


From: Douglas McNaught <doug(at)mcnaught(dot)org>
To: Devrim GÜNDÜZ <devrim(at)CommandPrompt(dot)com>
Cc: Mikko Partio <mpartio(at)gmail(dot)com>, pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: PostgreSQL clustering (shared disk)
Date: 2007-08-16 14:27:44
Message-ID: 87sl6jtuzj.fsf@suzuka.mcnaught.org

Devrim GÜNDÜZ <devrim(at)CommandPrompt(dot)com> writes:

>> What I'm wondering is whether the cluster is able to keep the
>> postmasters synchronized at all times so that the database won't get
>> corrupted.
>
> Keep all of $PGDATA on the shared disk. That would minimize data loss.
> (Of course, there is still a risk of data loss -- the postmasters are
> not aware of each other and they don't share each other's buffers, etc.)

It would be much better to have the cluster software only run one
postmaster at a time, starting up the secondary if the primary fails.
That's the usual practice with shared storage.
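
The "one postmaster at a time" rule can be sketched as a small Python model. This is only an illustration, not how RHCS or any real cluster manager is implemented: the `Node` and `Cluster` classes, the node names, and the `fence` method (a stand-in for a real fence agent such as a power switch or iLO) are all hypothetical. The point it shows is the ordering: the standby postmaster is started only after the old active node has been confirmed fenced.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    postmaster_running: bool = False
    fenced: bool = False  # True once the node is cut off from the shared disk


@dataclass
class Cluster:
    active: Node
    standby: Node
    log: list = field(default_factory=list)

    def fence(self, node: Node) -> bool:
        # Hypothetical stand-in for a real fence agent; here it always succeeds.
        node.postmaster_running = False
        node.fenced = True
        self.log.append(f"fenced {node.name}")
        return True

    def failover(self) -> None:
        # Never start the standby until the old active node is confirmed fenced.
        if not self.fence(self.active):
            raise RuntimeError("fencing failed; refusing to start a second postmaster")
        self.standby.postmaster_running = True
        self.log.append(f"started postmaster on {self.standby.name}")
        self.active, self.standby = self.standby, self.active

    def running_postmasters(self) -> int:
        return sum(n.postmaster_running for n in (self.active, self.standby))


cluster = Cluster(active=Node("node-a", postmaster_running=True), standby=Node("node-b"))
cluster.failover()
print(cluster.log)
print(cluster.running_postmasters())
```

If fencing cannot be confirmed, the sketch refuses to fail over at all, which trades availability for safety of the data directory.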

-Doug


From: "Mikko Partio" <mpartio(at)gmail(dot)com>
To: "Douglas McNaught" <doug(at)mcnaught(dot)org>
Cc: Devrim GÜNDÜZ <devrim(at)commandprompt(dot)com>, pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: PostgreSQL clustering (shared disk)
Date: 2007-08-17 09:12:16
Message-ID: 2ca799770708170212n5e23c6e4qf0842ebc522bd89e@mail.gmail.com

On 8/16/07, Douglas McNaught <doug(at)mcnaught(dot)org> wrote:
>
> Devrim GÜNDÜZ <devrim(at)CommandPrompt(dot)com> writes:
>
> >> What I'm wondering is whether the cluster is able to keep the
> >> postmasters synchronized at all times so that the database won't get
> >> corrupted.
> >
> > Keep all of $PGDATA on the shared disk. That would minimize data loss.
> > (Of course, there is still a risk of data loss -- the postmasters are
> > not aware of each other and they don't share each other's buffers, etc.)
>
> It would be much better to have the cluster software only run one
> postmaster at a time, starting up the secondary if the primary fails.
> That's the usual practice with shared storage.

This was my original intention. I'm still quite hesitant to trust the
fencing device's ability to guarantee that only one postmaster at a time is
running, because of the disastrous possibility of corrupting the whole
database.

Maybe I'm just better off using the more simple (crude?) method of drbd +
heartbeat?

Regards

MP


From: Hannes Dorbath <light(at)theendofthetunnel(dot)de>
To: Mikko Partio <mpartio(at)gmail(dot)com>
Subject: Re: PostgreSQL clustering (shared disk)
Date: 2007-08-17 09:47:13
Message-ID: 46C56EA1.4040200@theendofthetunnel.de

On 17.08.2007 11:12, Mikko Partio wrote:
> Maybe I'm just better off using the more simple (crude?) method of drbd +
> heartbeat?

Crude? Use what you like to use, but you should keep one thing in mind:
If you don't know the software you are running in each and every detail,
how it behaves in each and every situation you can think of, it's a bad
idea to use it in an HA setup.

You don't want to be one of those admins that just configured something
in a few days, moved production stuff onto it and failed to recover from a
split brain situation. Setting up an HA environment is something you do
in months, not days, at least if you want to do it right. There is so
much that can go wrong, and so much to learn. Keep it simple.

--
Regards,
Hannes Dorbath


From: "Mikko Partio" <mpartio(at)gmail(dot)com>
To: pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: PostgreSQL clustering (shared disk)
Date: 2007-08-17 10:18:35
Message-ID: 2ca799770708170318r8929b54g377171c644936e43@mail.gmail.com

On 8/17/07, Hannes Dorbath <light(at)theendofthetunnel(dot)de> wrote:
>
> On 17.08.2007 11:12, Mikko Partio wrote:
> > Maybe I'm just better off using the more simple (crude?) method of drbd +
> > heartbeat?
>
> Crude? Use what you like to use, but you should keep one thing in mind:
> If you don't know the software you are running in each and every detail,
> how it behaves in each and every situation you can think of, it's a bad
> idea to use it in an HA setup.
>
> You don't want to be one of those admins that just configured something
> in a few days, moved production stuff onto it and failed to recover from a
> split brain situation. Setting up an HA environment is something you do
> in months, not days, at least if you want to do it right. There is so
> much that can go wrong, and so much to learn. Keep it simple.
>

Exactly my thoughts, as I have some experience with DRBD and I know it
works. My point was that since I have access to a SAN environment, shared
storage would be a more "elegant" solution, but as you pointed out it's
probably better to stick to the method that feels most comfortable.

Thanks for your thoughts.

Regards

MP


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Mikko Partio" <mpartio(at)gmail(dot)com>
Cc: "Douglas McNaught" <doug(at)mcnaught(dot)org>, Devrim GÜNDÜZ <devrim(at)commandprompt(dot)com>, pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: PostgreSQL clustering (shared disk)
Date: 2007-08-17 13:59:26
Message-ID: 4745.1187359166@sss.pgh.pa.us

"Mikko Partio" <mpartio(at)gmail(dot)com> writes:
> This was my original intention. I'm still quite hesitant to trust the
> fencing device's ability to guarantee that only one postmaster at a time is
> running, because of the disastrous possibility of corrupting the whole
> database.

Making that guarantee is a fencing device's only excuse for existence.
So I think you should trust that a properly-implemented fence will do
what it's claimed to do.

On the other side of the coin, I have little confidence in DRBD
providing the storage semantics we need (in particular guaranteeing
write ordering). So that path doesn't sound exactly risk-free either.
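
Why write ordering matters here can be shown with a toy Python model. It is a deliberately simplified sketch, not PostgreSQL's actual recovery logic: the names `wal_record_1` and `data_page_1` and both helper functions are made up for illustration. It assumes only the write-ahead rule, namely that the WAL record describing a change must be on disk before the changed data page is, and then enumerates what a crash could leave behind when the storage layer preserves or reverses that order.

```python
def crash_states(writes):
    """Return every disk state a crash could leave behind when the writes
    reach the disk strictly in the given order (each prefix of the list)."""
    return [set(writes[:i]) for i in range(len(writes) + 1)]


def is_recoverable(on_disk):
    # A data page on disk without its WAL record reflects a change that
    # crash recovery knows nothing about.
    return not ("data_page_1" in on_disk and "wal_record_1" not in on_disk)


issued = ["wal_record_1", "data_page_1"]     # order the database issues them
reordered = ["data_page_1", "wal_record_1"]  # what a reordering layer might do

ordered_ok = all(is_recoverable(s) for s in crash_states(issued))
reordered_ok = all(is_recoverable(s) for s in crash_states(reordered))
print(ordered_ok, reordered_ok)
```

With the issued order every crash point is recoverable; with the reversed order there is a window in which the data page exists on disk without its WAL record.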

regards, tom lane


From: Hannes Dorbath <light(at)theendofthetunnel(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: drbd-user(at)lists(dot)linbit(dot)com
Subject: Re: PostgreSQL clustering (shared disk)
Date: 2007-08-17 14:19:57
Message-ID: 46C5AE8D.2080408@theendofthetunnel.de

On 17.08.2007 15:59, Tom Lane wrote:
> On the other side of the coin, I have little confidence in DRBD
> providing the storage semantics we need (in particular guaranteeing
> write ordering). So that path doesn't sound exactly risk-free either.

To my understanding DRBD provides this. I think a discussion about that
with the DRBD developers would be very useful for many users searching
for a solution to replicate PostgreSQL, so I'm cross posting this to
the DRBD list. Maybe you can make clear in detail what requirements
PostgreSQL has.

--
Regards,
Hannes Dorbath


From: "Sander Steffann" <s(dot)steffann(at)computel(dot)nl>
To: "'Tom Lane'" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "'Mikko Partio'" <mpartio(at)gmail(dot)com>
Cc: "'Douglas McNaught'" <doug(at)mcnaught(dot)org>, 'Devrim GÜNDÜZ' <devrim(at)commandprompt(dot)com>, "'pgsql-general'" <pgsql-general(at)postgresql(dot)org>
Subject: Re: PostgreSQL clustering (shared disk)
Date: 2007-08-17 14:24:14
Message-ID: 002c01c7e0da$49e623a0$dc128953@kantoor.computel.nl

Hi,

> On the other side of the coin, I have little confidence in DRBD
> providing the storage semantics we need (in particular guaranteeing
> write ordering). So that path doesn't sound exactly risk-free either.

DRBD seems to enforce strict write ordering on both sides of the link
according to the docs. I didn't look at the code, but my plug-pulling tests
on a busy PostgreSQL server didn't cause any problems. No conclusive
evidence, but useful at least in my use case. (And yes: I make pg_dumps
often just in case.)

- Sander


From: Magnus Hagander <magnus(at)hagander(dot)net>
To: Hannes Dorbath <light(at)theendofthetunnel(dot)de>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-general(at)postgresql(dot)org, drbd-user(at)lists(dot)linbit(dot)com
Subject: Re: PostgreSQL clustering (shared disk)
Date: 2007-08-17 14:45:39
Message-ID: 20070817144539.GE13741@svr2.hagander.net

On Fri, Aug 17, 2007 at 04:19:57PM +0200, Hannes Dorbath wrote:
> On 17.08.2007 15:59, Tom Lane wrote:
> >On the other side of the coin, I have little confidence in DRBD
> >providing the storage semantics we need (in particular guaranteeing
> >write ordering). So that path doesn't sound exactly risk-free either.
>
> To my understanding DRBD provides this. I think a discussion about that
> with the DRBD developers would be very useful for many users searching
> for a solution to replicate PostgreSQL, so I'm cross posting this to
> the DRBD list. Maybe you can make clear in detail what requirements
> PostgreSQL has.

It does, AFAIK, if you configure it properly. I think it's the "protocol"
parameter that you need to set to C. That is the slowest setting, but it's
the only one that waits for the block to hit *both* disks.
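
The difference between the protocol settings can be sketched in Python. This is a simplified model based on my reading of the DRBD documentation, not DRBD code: the stage names and both helper functions are invented for illustration, and the stages are treated as strictly sequential, which the real implementation need not be. It assumes protocol A reports a write complete once it is on the local disk and queued for sending, B once the peer has received it in memory, and C once the peer has written it to disk.

```python
# Stages a replicated write passes through in this simplified model, in order.
STAGES = ["local_disk", "send_buffer", "peer_memory", "peer_disk"]

# Stage that must be reached before the write is reported as complete.
ACK_STAGE = {"A": "send_buffer", "B": "peer_memory", "C": "peer_disk"}


def stages_done_at_ack(protocol):
    """Stages guaranteed to be finished when the write is acknowledged."""
    last = STAGES.index(ACK_STAGE[protocol])
    return STAGES[: last + 1]


def on_both_disks_at_ack(protocol):
    """Is an acknowledged write guaranteed to be on both nodes' disks?"""
    done = stages_done_at_ack(protocol)
    return "local_disk" in done and "peer_disk" in done


for protocol in ("A", "B", "C"):
    print(protocol, stages_done_at_ack(protocol), on_both_disks_at_ack(protocol))
```

Only C makes a committed transaction's acknowledged writes durable on the peer's disk, which is why it is the setting suggested for a database.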

//Magnus


From: Lars Ellenberg <lars(dot)ellenberg(at)linbit(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Mikko Partio <mpartio(at)gmail(dot)com>, Douglas McNaught <doug(at)mcnaught(dot)org>, Devrim GÜNDÜZ <devrim(at)commandprompt(dot)com>, pgsql-general <pgsql-general(at)postgresql(dot)org>, drbd-user(at)lists(dot)linbit(dot)com
Subject: Re: PostgreSQL clustering (shared disk)
Date: 2007-08-17 19:34:55
Message-ID: 20070817193455.GA6219@racke.local

On Fri, Aug 17, 2007 at 09:59:26AM -0400, Tom Lane wrote:
> "Mikko Partio" <mpartio(at)gmail(dot)com> writes:
> > This was my original intention. I'm still quite hesitant to trust the
> > fencing device's ability to guarantee that only one postmaster at a time is
> > running, because of the disastrous possibility of corrupting the whole
> > database.
>
> Making that guarantee is a fencing device's only excuse for existence.
> So I think you should trust that a properly-implemented fence will do
> what it's claimed to do.
>
> On the other side of the coin, I have little confidence in DRBD
> providing the storage semantics we need (in particular guaranteeing
> write ordering). So that path doesn't sound exactly risk-free either.
>
> regards, tom lane

of course we guarantee write ordering.

we (linbit, company behind drbd, paying drbd developers)
operate quite a few postgres clusters in production on clusters
"powered by heartbeat and DRBD".
there are many more that we do not operate directly ourselves.

just because we happen to have a partnership with mysql
does not mean we don't like postgres very much indeed :)

to get an idea of what drbd does for you, please,
if you are interested, read some of the publications at
http://www.drbd.org/publications.html,
maybe start with the LinuxConf 2007 pdf.

cheers,

--
: Lars Ellenberg Tel +43-1-8178292-55 :
: LINBIT Information Technologies GmbH Fax +43-1-8178292-82 :
: Vivenotgasse 48, A-1120 Vienna/Europe http://www.linbit.com :