Horizontal Write Scaling

Lists: pgsql-hackers
From: Eliot Gable <egable+pgsql-hackers(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Horizontal Write Scaling
Date: 2010-11-23 20:43:23
Message-ID: AANLkTinxTiuuDy8Up-pZVh=1YDa89ga_cDarGZ3usg3L@mail.gmail.com

I know there has been a lot of talk about replication getting built into
Postgres and I know of many projects that aim to fill the role. However, I
have not seen much in the way of a serious attempt at multi-master write
scaling. I understand the fundamental problem with write scaling across
multiple nodes is disk I/O and inter-node communication latency, and that in
the conventional synchronous, multi-master replication type setup you would
be limited to the speed of the slowest node, plus the communication protocol
overhead and latency. However, it occurs to me that if you had a shared disk
system via either iSCSI, Fiber Channel, NFS, or whatever (which also had
higher I/O capabilities than a single server could utilize), and if you used a
file system that supported locks on a particular section (extent) of a file,
it should theoretically be possible for multiple Postgres instances on
multiple systems sharing the database to read and write to the database
without causing corruption. Obviously, it would take some carefully designed
file extent locking code, but it seems like it would also entirely eliminate
the need for any type of replication system. Lustre seems
to support the type of file locking required. Of course, I am assuming the
disk system would be RAID 1, RAID 10, RAID 5, or RAID 6 for reliability
purposes and that it is sufficiently redundant that you don't have to worry
about an outage of your storage system.
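The extent locking described above can be sketched with POSIX advisory byte-range locks, which is roughly the primitive a cluster file system such as Lustre would have to export across nodes. A minimal, local-only sketch (the 8 kB "page" size and scratch file are assumptions for illustration, not anything Postgres does today):

```python
import fcntl
import os
import tempfile

# A scratch file standing in for one shared table segment.
fd, path = tempfile.mkstemp()
os.ftruncate(fd, 16384)  # pretend it holds two 8 kB pages

EXTENT = 8192  # lock granularity: one "page" of the file

def lock_extent(page):
    # Exclusive advisory lock on bytes [page*EXTENT, (page+1)*EXTENT).
    fcntl.lockf(fd, fcntl.LOCK_EX, EXTENT, page * EXTENT)

def unlock_extent(page):
    fcntl.lockf(fd, fcntl.LOCK_UN, EXTENT, page * EXTENT)

# Writers touching disjoint extents never block each other; a writer
# touching an already-locked extent would wait for the lock holder.
lock_extent(0)
os.pwrite(fd, b"page0", 0)
unlock_extent(0)

lock_extent(1)
os.pwrite(fd, b"page1", EXTENT)
unlock_extent(1)

print(os.pread(fd, 5, 0))       # b'page0'
print(os.pread(fd, 5, EXTENT))  # b'page1'
```

Within a single host this is cheap; the open question in the proposal is whether the same semantics stay cheap when every lock acquisition is a network round trip to the cluster file system's lock manager.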

Has anyone put any thought into what it would take to do this in Postgres?
Is it simply a matter of making the database file interaction code aware of
extent locking, or is it considerably more involved than that? It also
occurs to me that you probably need some form of transaction ordering
mechanism across the nodes based on synchronized timestamps, but it seems
Postgres-R already has the required code written for that portion. That
may not even be needed since all nodes would actually be working on the same
database files and the locks would ensure a strict ordering of queries. The
only place I see that possibly causing a problem is someone load balancing
across the servers and sending a delete and an insert very close together
to different nodes, such that if the insert executes first, it gets removed
by the delete even though the intent was for the delete to run
first. Timestamp ordering of the queries or just some shared transaction ID
across the nodes would eliminate that possibility. Other than that, is there
anything else I am missing? Wouldn't this type of setup be far simpler to
implement and provide better scalability than trying to do multi-master
replication using log shipping or binary object shipping or any other
techniques? Wouldn't it also be far more efficient since you don't need to
have a copy of your data on each master node and therefore also don't have to
ship your data to each node and have each node process it?
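The delete-before-insert race above is exactly what a shared, monotonically increasing transaction ID would resolve: nodes apply operations in ID order regardless of arrival order. A toy sketch (the operation format is hypothetical):

```python
# Operations arrive from different nodes in arrival order, each stamped
# with a globally assigned, monotonically increasing transaction ID.
arrived = [
    {"txid": 2, "op": "insert", "key": "a"},  # reached its node first...
    {"txid": 1, "op": "delete", "key": "a"},  # ...but was issued earlier
]

def apply_in_txid_order(ops):
    # Replay operations in global txid order, not arrival order.
    table = set()
    for op in sorted(ops, key=lambda o: o["txid"]):
        if op["op"] == "insert":
            table.add(op["key"])
        elif op["op"] == "delete":
            table.discard(op["key"])
    return table

# The delete (txid 1) runs before the insert (txid 2), so the row survives.
print(apply_in_txid_order(arrived))  # {'a'}
```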

I am mostly asking for educational purposes, and I would appreciate
technical (and hopefully specific) explanations as to what in Postgres would
need to change to support this.


From: Eliot Gable <egable+pgsql-hackers(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Horizontal Write Scaling
Date: 2010-11-23 20:55:55
Message-ID: AANLkTikfX67NKrzPXxHzriSMXgQDV37+-xoGoUH4tT0v@mail.gmail.com

On Tue, Nov 23, 2010 at 3:43 PM, Eliot Gable
<egable+pgsql-hackers(at)gmail(dot)com> wrote:
<snip>

> Other than that, is there anything else I am missing? Wouldn't this type of
> setup be far simpler to implement and provide better scalability than trying
> to do multi-master replication using log shipping or binary object shipping
> or any other techniques? Wouldn't it also be far more efficient since you
> don't need to have a copy of your data on each master node and therefore also
> don't have to ship your data to each node and have each node process it?
>
> I am mostly asking for educational purposes, and I would appreciate
> technical (and hopefully specific) explanations as to what in Postgres would
> need to change to support this.
>
>
Now that I think about this more, it seems you would still need to ship the
transactions to your other nodes and have some form of processing system on
each that knew which node was supposed to be executing each transaction and
whether that node is currently online. It would also have to have designated
backup nodes to execute the transaction on. Otherwise, you could end up
waiting forever for a transaction to finish that was sent to one node right
before that node lost power. However, if a transaction manager on each node
is able to figure out the ordering of the transactions for itself based on
some globally incrementing transaction ID and able to figure out which node
will be executing the transaction and which node is the backup if the first
one fails, etc., then if the backup sees the primary for that transaction go
offline, it could execute the transaction instead.
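The primary/backup assignment could be derived deterministically from the transaction ID itself, so every node independently agrees on who runs which transaction without extra messages. A sketch (the node names and ring-order fallback are assumptions):

```python
def executor_for(txid, nodes, online):
    """Pick the node that should execute a transaction: the primary is
    chosen deterministically from the txid; each following node in the
    ring is a backup, tried in order while earlier choices are offline."""
    n = len(nodes)
    for i in range(n):
        candidate = nodes[(txid + i) % n]
        if candidate in online:
            return candidate
    raise RuntimeError("no node online")

nodes = ["node0", "node1", "node2"]
# All up: txid 7 maps to node1.
print(executor_for(7, nodes, online={"node0", "node1", "node2"}))  # node1
# node1 lost power: its ring successor node2 takes over the same txid.
print(executor_for(7, nodes, online={"node0", "node2"}))           # node2
```

Because every node computes the same answer from the same txid and membership view, a backup can notice the primary is gone and run the transaction itself, which is the failover behavior described above.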

Then, I suppose you also need some system in Postgres which can allow
concurrent processing of transactions such that they don't process stuff in
a transaction which is dependent on a transaction that has not yet been
committed, but can process other stuff. So, evaluation of deterministic
functions could take place, but anything volatile could not until all
previous transactions finished. I assume Postgres already has something like
this in order to scale across multiple cores in a single box. This setup
would basically make all the master nodes for the database look like just
extra memory and CPU cores.


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Eliot Gable" <egable+pgsql-hackers(at)gmail(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Horizontal Write Scaling
Date: 2010-11-23 20:59:44
Message-ID: 4CEBD6E00200002500037D9E@gw.wicourts.gov

Eliot Gable <egable+pgsql-hackers(at)gmail(dot)com> wrote:

> the locks would ensure a strict ordering of queries.

PostgreSQL doesn't support S2PL. I'm not sure what locks you mean.

-Kevin


From: Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Horizontal Write Scaling
Date: 2010-11-23 22:45:51
Message-ID: 4CEC441F.4010802@catalyst.net.nz

On 24/11/10 09:43, Eliot Gable wrote:
> However, it occurs to me that if you had a shared disk system via
> either iSCSI, Fiber Channel, NFS, or whatever (which also had higher
> I/O capabilities than a single server could utilize)

Yeah, current Postgres multi-master projects seem to be focusing on
shared-nothing architecture as opposed to a shared-disk one. I guess the
advantage of the former is that specialized (i.e. expensive) hardware is
not required to attempt to overcome the single point of failure in
shared-disk systems: the disk they share.

Cheers

Mark


From: Greg Smith <greg(at)2ndquadrant(dot)com>
To: Eliot Gable <egable+pgsql-hackers(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Horizontal Write Scaling
Date: 2010-11-24 06:11:31
Message-ID: 4CECAC93.5030508@2ndquadrant.com

Eliot Gable wrote:
> However, I have not seen much in the way of a serious attempt at
> multi-master write scaling.

Scaling writes across nodes using PL/Proxy works.
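As a rough illustration of how PL/Proxy-style write scaling works: each call is routed to one partition by hashing a key, so every write for a given key lands on the same node. (PL/Proxy itself is configured in SQL with a RUN ON clause and requires a power-of-two partition count; this Python analogue just takes a modulus.)

```python
import zlib

NODES = ["db0", "db1", "db2", "db3"]  # hypothetical partition connection names

def partition_for(key: str) -> str:
    # PL/Proxy-style routing: hash the key and map it onto the
    # partition list. Deterministic, so a key always hits one node.
    return NODES[zlib.crc32(key.encode()) % len(NODES)]

# Stable routing: single-key transactions never span partitions.
assert partition_for("user:42") == partition_for("user:42")

# The hash spreads many keys roughly evenly across the nodes.
counts = {n: 0 for n in NODES}
for i in range(1000):
    counts[partition_for(f"user:{i}")] += 1
print(counts)
```

The trade-off is the usual one for hash partitioning: writes scale close to linearly, but transactions that span keys on different partitions need extra machinery.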

> Of course, I am assuming the disk system would be RAID 1, RAID 10,
> RAID 5, or RAID 6 for reliability purposes and that it is sufficiently
> redundant that you don't have to worry about an outage of your storage
> system.

The idea that you'll have a system that needs better write scalability
that isn't limited by the storage system is an unusual one, not the
expected case. And the trend everywhere in the industry is away from
giant redundant systems, and toward having multiple cheaper redundant
copies of all the data instead. It's impossible to protect against
things like environmental failure at any single location. Once you've
accepted that you have to be able to replicate this beast too if you
want high availability, you're back at having a multi-node problem
again. This is why the most active work is on distributed designs that
start on that basis, rather than projects trying to build more scalable
monoliths.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services and Support www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books


From: Markus Wanner <markus(at)bluegap(dot)ch>
To: Eliot Gable <egable+pgsql-hackers(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Horizontal Write Scaling
Date: 2010-11-25 09:03:53
Message-ID: 4CEE2679.1000803@bluegap.ch

Eliot,

On 11/23/2010 09:43 PM, Eliot Gable wrote:
> I know there has been a lot of talk about replication getting built into
> Postgres and I know of many projects that aim to fill the role. However,
> I have not seen much in the way of a serious attempt at multi-master
> write scaling.

Postgres-XC and Postgres-R are two pretty serious projects, IMO.

> I understand the fundamental problem with write scaling
> across multiple nodes is Disk I/O and inter-node communication latency
> and that in the conventional synchronous, multi-master replication type
> setup you would be limited to the speed of the slowest node,

That's not necessarily true for Postgres-R, which is why I call it an
'eager' solution (as opposed to fully synchronous). While it guarantees
that all transactions that got committed *will* be committable on all
nodes at some time in the future, nodes may still lag behind others.

Thus, even a slower / busy node doesn't hold back the others, but may
serve stale data. Ideally, your load balancer accounts for that and
gives that node a break, or at least reduces the number of transactions
going to that node, so it can catch up again.

Anyway, that's pretty Postgres-R specific.
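A lag-aware balancer of the kind just described might look like the following sketch (the lag metric, the threshold, and the node records are all assumptions, not anything Postgres-R prescribes):

```python
def pick_node(nodes, max_lag):
    """Prefer the least-lagged node whose replication lag (here counted
    in transactions) is under the threshold; a node that has fallen too
    far behind gets no new traffic until it catches up again."""
    eligible = [n for n in nodes if n["lag"] <= max_lag]
    if not eligible:
        eligible = nodes  # degrade gracefully rather than refuse service
    return min(eligible, key=lambda n: n["lag"])["name"]

nodes = [
    {"name": "a", "lag": 0},
    {"name": "b", "lag": 3},
    {"name": "c", "lag": 250},  # busy node, lagging badly
]
print(pick_node(nodes, max_lag=100))  # a
```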

> plus the
> communication protocol overhead and latency. However, it occurs to me
> that if you had a shared disk system via either iSCSI, Fiber Channel,
> NFS, or whatever (which also had higher I/O capabilities than a single
> server could utilize), if you used a file system that supported locks on
> a particular section (extent) of a file, it should theoretically be
> possible for multiple Postgres instances on multiple systems sharing the
> database to read and write to the database without causing corruption.

Possible, yes. Worthwhile to do, probably not.

> Has anyone put any thought into what it would take to do this in
> Postgres? Is it simply a matter of making the database file interaction
> code aware of extent locking, or is it considerably more involved than
> that? It also occurs to me that you probably need some form of
> transaction ordering mechanism across the nodes based on synchronized
> timestamps, but it seems Postgres-R has the required code to do that
> portion already written.

If you rely on such an ordering, why use additional locks? That seems
like a waste of resources compared to Postgres-R. Not to mention the
introduction of a SPOF with the SAN.

> Wouldn't this type of setup be far
> simpler to implement

That's certainly debatable, yes. I obviously think that the benefit per
cost ratio for Postgres-R is better :-)

> and provide better scalability than trying to do
> multi-master replication using log shipping or binary object shipping or
> any other techniques?

It's more similar to replication using two phase commit, which provably
doesn't scale (see, for example, [1]). And using a SAN for locking
certainly doesn't beat 2PC via an equally modern/expensive interconnect.

> Wouldn't it also be far more efficient since you
> don't need to have a copy of your data on each master node and therefore
> also don't have to ship your data to each node and have each node
> process it?

You have to ship it from the SAN to the node, so I definitely don't
think so; I see this as an argument against it, rather. Each node having
a local copy and exchanging only locking information and transactional
changes sounds like much less traffic overall.

Regards

Markus Wanner

[1]: The Dangers of Replication and a Solution, Gray et al., in Proc. of
the SIGMOD Conf., 1996,
http://research.microsoft.com/apps/pubs/default.aspx?id=68247


From: Koichi Suzuki <koichi(dot)szk(at)gmail(dot)com>
To: Markus Wanner <markus(at)bluegap(dot)ch>
Cc: Eliot Gable <egable+pgsql-hackers(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Horizontal Write Scaling
Date: 2010-11-25 09:45:45
Message-ID: AANLkTikf-S2H0OP_P+OOqFFGO=_yopuSG_PkeqS1_0i6@mail.gmail.com

Hi,

2010/11/25 Markus Wanner <markus(at)bluegap(dot)ch>:
> Eliot,
>
> On 11/23/2010 09:43 PM, Eliot Gable wrote:
>> I know there has been a lot of talk about replication getting built into
>> Postgres and I know of many projects that aim to fill the role. However,
>> I have not seen much in the way of a serious attempt at multi-master
>> write scaling.
>
> Postgres-XC and Postgres-R are two pretty serious projects, IMO.

Yes. Please visit http://postgres-xc.sourceforge.net/ for details.

>> I understand the fundamental problem with write scaling
>> across multiple nodes is Disk I/O and inter-node communication latency
>> and that in the conventional synchronous, multi-master replication type
>> setup you would be limited to the speed of the slowest node,
>
> That's not necessarily true for Postgres-R, which is why I call it an
> 'eager' solution (as opposed to fully synchronous). While it guarantees
> that all transactions that got committed *will* be committable on all
> nodes at some time in the future, nodes may still lag behind others.
>
> Thus, even a slower / busy node doesn't hold back the others, but may
> serve stale data. Ideally, your load balancer accounts for that and
> gives that node a break or at least reduces the amount of transactions
> going to that node, so it can catch up again.
>
> Anyway, that's pretty Postgres-R specific.

Right. In the case of Postgres-XC, tables can be partitioned (we
call this "distributed") among cluster nodes so that writing can be done in
parallel.

>
>> plus the
>> communication protocol overhead and latency. However, it occurs to me
>> that if you had a shared disk system via either iSCSI, Fiber Channel,
>> NFS, or whatever (which also had higher I/O capabilities than a single
>> server could utilize), if you used a file system that supported locks on
>> a particular section (extent) of a file, it should theoretically be
>> possible for multiple Postgres instances on multiple systems sharing the
>> database to read and write to the database without causing corruption.
>
> Possible, yes. Worthwile to do, probably not.

We may suffer from having to synchronize the caches on each database.

>
>> Has anyone put any thought into what it would take to do this in
>> Postgres? Is it simply a matter of making the database file interaction
>> code aware of extent locking, or is it considerably more involved than
>> that? It also occurs to me that you probably need some form of
>> transaction ordering mechanism across the nodes based on synchronized
>> timestamps, but it seems Postgres-R has the required code to do that
>> portion already written.
>
> If you rely on such an ordering, why use additional locks? That seems
> like a waste of resources compared to Postgres-R. Not to mention the
> introduction of a SPOF with the SAN.
>
>> Wouldn't this type of setup be far
>> simpler to implement
>
> That's certainly debatable, yes. I obviously think that the benefit per
> cost ratio for Postgres-R is better :-)
>
>> and provide better scalability than trying to do
>> multi-master replication using log shipping or binary object shipping or
>> any other techniques?

Postgres-XC uses a combination of replicated tables and distributed
(partitioned) tables, not just simple replication.

>
> It's more similar to replication using two phase commit, which provably
> doesn't scale (see for example [1]) And using a SAN for locking
> certainly doesn't beat 2PC via an equally modern/expensive interconnect.
>
>> Wouldn't it also be far more efficient since you
>> don't need to have a copy of your data on each master node and therefor
>> also don't have to ship your data to each node and have each node
>> process it?
>
> You have to ship it from the SAN to the node, so I definitely don't
> think so, but see this as an argument against it. Each having a local
> copy and only exchange locking information and transactional changes
> sounds like much less traffic overall.
>
> Regards
>
> Markus Wanner
>
>
> [1]: The Dangers of Replication and a Solution, Gray et al, In Proc. of
> the SIGMOD Conf., 1996,
> http://research.microsoft.com/apps/pubs/default.aspx?id=68247
>

Cheers;
---
Koichi Suzuki


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: koichi(dot)szk(at)gmail(dot)com
Cc: Markus Wanner <markus(at)bluegap(dot)ch>, Eliot Gable <egable+pgsql-hackers(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Horizontal Write Scaling
Date: 2010-11-25 15:06:39
Message-ID: AANLkTikCYE-FkQ97Vw7sS_mQe17hKhpVZHe4Y9UjnqGQ@mail.gmail.com

On Thu, Nov 25, 2010 at 4:45 AM, Koichi Suzuki <koichi(dot)szk(at)gmail(dot)com> wrote:
>>> plus the
>>> communication protocol overhead and latency. However, it occurs to me
>>> that if you had a shared disk system via either iSCSI, Fiber Channel,
>>> NFS, or whatever (which also had higher I/O capabilities than a single
>>> server could utilize), if you used a file system that supported locks on
>>> a particular section (extent) of a file, it should theoretically be
>>> possible for multiple Postgres instances on multiple systems sharing the
>>> database to read and write to the database without causing corruption.
>>
>> Possible, yes. Worthwhile to do, probably not.
>
> We may suffer from having to synchronize the caches on each database.

That's putting it mildly. You have to worry about the database buffer
cache, the shared invalidation queue, the control file data, the
ProcArray, the lock manager, the LWLock tables, and probably some
other things I'm forgetting about. Everything in shared memory, in
short.
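A toy model of just the first item, the buffer cache, shows the hazard: two instances each keep a private page cache over shared storage, and with no invalidation protocol between them, one happily serves a stale page (the class and values are purely illustrative):

```python
class Instance:
    """A toy database instance with a private page cache over shared disk."""
    def __init__(self, disk):
        self.disk = disk    # shared storage, a dict of page -> contents
        self.cache = {}     # this instance's private buffer cache

    def read(self, page):
        if page not in self.cache:      # cache miss: fetch from shared disk
            self.cache[page] = self.disk[page]
        return self.cache[page]         # hit: served from the private cache

    def write(self, page, value):
        self.cache[page] = value
        self.disk[page] = value         # written through to shared disk

disk = {0: "v1"}
a, b = Instance(disk), Instance(disk)

b.read(0)            # instance B caches page 0 as "v1"
a.write(0, "v2")     # instance A updates the page on shared disk
print(b.read(0))     # "v1" -- B still serves its stale copy
```

A real shared-disk design needs a cross-node invalidation or lock-coupling protocol for this cache, and then again for each of the other shared-memory structures listed above.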

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Eliot Gable <egable+pgsql-hackers(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Horizontal Write Scaling
Date: 2010-11-27 17:04:55
Message-ID: AANLkTimGuETOHVntCxoBDyZk28JRTVTC5T1mz4mQkj54@mail.gmail.com

Thanks, everyone, for all the feedback! I am nowhere near a database expert
yet, but you guys have been very helpful in clearing up some of my
confusion. I have checked out Postgres-XC and it looks like the version 1.0
that will be released soon probably covers everything I have been looking
for in terms of Postgres capabilities. The big ones are write scaling, read
scaling, consistent view of data between each server, and HA capabilities.
Last time I looked at Postgres-XC was probably a year ago, and it was
nowhere close to what I was looking for at the time, and I forgot all about
it. Now, it looks like a real contender.

I was aware of Postgres-R and was actually thinking I might be able to get
away with using that, but the project I am working on does a substantial
amount of writing, as well as being CPU intensive. Each query executes a
stored procedure which is about 2,500 lines long and pulls data from about
80 tables to compute a final result set. That final result set is returned
to the requester, and is also written into 3 tables (while still inside the
original transaction). One of those tables gets one row while the other two
get 6 - 15 rows per query. I execute hundreds of these queries per second.
So, I need to be able to spread the load across multiple boxes due to CPU
usage, but still have a consistent view of the written data. Using
conventional drives, I would saturate the disk I/O pretty quickly on
commodity hardware. With normal multi-master replication, the cost of making
sure I have enough disk I/O on each server is way more than I have the
budget for. With a write scaling solution, it suddenly looks affordable. I
was looking at maybe getting a single shared RAID array with some
enterprise-class SSDs that could guarantee writes even during a power
failure. I was hoping I could find something that would let multiple
Postgres instances share that disk array as it would be more cost effective
to get both the CPU power and disk I/O I needed than sticking such a RAID
array in each and every server I was going to spread load across.
Postgres-XC actually makes it look even more affordable, as I now probably
no longer need to consider SSDs, or at least I don't need to consider a RAID
10 array of 4 or more SSDs per box. I can probably do RAID 1 with 2 drives
per box and have plenty of disk I/O available for the amount of CPU power I
would have in the boxes.
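Back-of-envelope numbers for the row volume described above (300 queries/s and a 200-byte average row are assumptions; the message only says "hundreds" per second and gives per-table row counts):

```python
# Each query writes 1 row to one table and 6-15 rows to each of two
# others, at an assumed 300 queries/s and 200 bytes per row.
queries_per_sec = 300
rows_per_query_low = 1 + 2 * 6    # 13 rows minimum per query
rows_per_query_high = 1 + 2 * 15  # 31 rows maximum per query
row_bytes = 200

low = queries_per_sec * rows_per_query_low * row_bytes
high = queries_per_sec * rows_per_query_high * row_bytes
print(f"{low/1e6:.1f}-{high/1e6:.1f} MB/s of row data, before WAL and indexes")
```

The raw bandwidth is modest; the pressure on conventional drives comes from the random-write IOPS for heap, index, and WAL traffic rather than from sequential throughput, which is why SSDs or more spindles help here.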

So, thanks again for the feedback.