Re: SSI and Hot Standby

From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: <pgsql-hackers(at)postgresql(dot)org>
Cc: "Dan Ports" <drkp(at)csail(dot)mit(dot)edu>
Subject: SSI and Hot Standby
Date: 2011-01-20 01:05:07
Message-ID: 4D3735E30200002500039869@gw.wicourts.gov
Lists: pgsql-hackers

Here's an issue for feedback from the community -- do we want to
support truly serializable transactions on hot standby machines?

The best way Dan and I have been able to think to do this is to
build on the SERIALIZABLE READ ONLY DEFERRABLE behavior. We are
able to obtain a snapshot and then check to see if it is at a place
in the transaction processing that it would be guaranteed to be
serializable without participating in predicate locking, rw-conflict
detection, etc. If it's not, we block until a READ WRITE
transaction completes, and then check again. Repeat. We may reach
a point where we determine that the snapshot can't work, and we get
a new one and start over. Due to the somewhat complex rules for
this, you are likely to see a safe snapshot fairly quickly even in a
mix which always has short-lived READ WRITE transactions running,
although a single long-running READ WRITE transaction can block
things until it completes.
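
For reference, a rough sketch of how that existing behavior is used on
the master (the table name here is just an example):

BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE READ ONLY DEFERRABLE;
SELECT count(*) FROM receipt;   -- blocks here, if needed, until a safe
                                -- snapshot is available; then runs with
                                -- no predicate locking and no risk of
                                -- serialization failure
COMMIT;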

The idea is that whenever we see a valid snapshot which would yield
a truly serializable view of the data for a READ ONLY transaction,
we add a WAL record with that snapshot information. Of course, we
might want some limit of how often they are sent, to avoid WAL
bloat. A hot standby could just keep the most recently received of
these and use it when a SERIALIZABLE transaction is requested.
Perhaps DEFERRABLE in this context could mean that it waits for the
*next* one and uses it, to assure "freshness".

Actually, we could try to get tricky to avoid sending a complete
snapshot by having two WAL messages with no payload -- one would
mean "the snapshot you would get now is being tested for
serializability". If it failed to reach that state we would send
another when we started working with a new snapshot. The other type of
message would mean "the snapshot you built when we last told you we
were starting to test one is good." I *think* that can work, and it
may require less WAL space.

If we don't do something like this, do we just provide REPEATABLE
READ on the standby as the strictest level of transaction isolation?
If so, do we generate an error on a request for SERIALIZABLE, warn
and provide degraded behavior, or just quietly give them REPEATABLE
READ behavior?

Thoughts?

-Kevin


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: pgsql-hackers(at)postgresql(dot)org, Dan Ports <drkp(at)csail(dot)mit(dot)edu>
Subject: Re: SSI and Hot Standby
Date: 2011-01-20 01:21:14
Message-ID: 1295486474.1803.783.camel@ebony
Lists: pgsql-hackers

On Wed, 2011-01-19 at 19:05 -0600, Kevin Grittner wrote:

> Here's an issue for feedback from the community -- do we want to
> support truly serializable transactions on hot standby machines?

In this release? Maybe? In later releases? Yes.

If it blocks your excellent contribution in this release, then from me,
"no". If you can achieve this in this release, yes. However, if this is
difficult or complex, then I would rather say "not yet" quickly now,
than spend months working out the weirdnesses and possibly still get
them wrong.

--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Simon Riggs" <simon(at)2ndQuadrant(dot)com>
Cc: "Dan Ports" <drkp(at)csail(dot)mit(dot)edu>,<pgsql-hackers(at)postgresql(dot)org>
Subject: Re: SSI and Hot Standby
Date: 2011-01-20 01:34:00
Message-ID: 4D373CA80200002500039872@gw.wicourts.gov
Lists: pgsql-hackers

Simon Riggs <simon(at)2ndQuadrant(dot)com> wrote:

> In this release? Maybe? In later releases? Yes.
>
> If it blocks your excellent contribution in this release, then
> from me, "no". If you can achieve this in this release, yes.
> However, if this is difficult or complex, then I would rather say
> "not yet" quickly now, than spend months working out the
> weirdnesses and possibly still get them wrong.

We already have a mechanism for generating a good snapshot, the hard
part (for me at least) would be to get that snapshot over to the hot
standby and have it use the latest one on a request for a
serializable transaction. I have no experience with WAL file
output, and don't know what it would take for hot standby to use it
as I describe.

I agree it's pretty late in the cycle, but I'm going through all the
loose ends and found this one -- which has been hanging out on the
Wiki page as an R&D item for over a full year without discussion.
:-( If we provide the snapshots (which we can safely and easily
do), can someone else who knows what they're doing with WAL and HS
get the rest of it safely into the release? That seems to me to be
the only way it can still happen for 9.1.

If not, I agree this can be 9.2 material. We just have to decide
how to document it and answer the questions near the bottom of my
initial post of the thread.

-Kevin


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Dan Ports <drkp(at)csail(dot)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: SSI and Hot Standby
Date: 2011-01-20 01:59:23
Message-ID: AANLkTikYvkPaYhPNW4YQHGd6jfyzERjd4LLwH_0RHQBF@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jan 19, 2011 at 8:34 PM, Kevin Grittner
<Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
> I agree it's pretty late in the cycle, but I'm going through all the
> loose ends and found this one -- which has been hanging out on the
> Wiki page as an R&D item for over a full year without discussion.
> :-(  If we provide the snapshots (which we can safely and easily
> do), can someone else who knows what they're doing with WAL and HS
> get the rest of it safely into the release?  That seems to me to be
> the only way it can still happen for 9.1.

I think it's way too late to be embarking on what will probably turn
out to be a reasonably complex and possibly controversial new
development arc. I don't have a strong position on what we should do
instead, but let's NOT do that.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Dan Ports <drkp(at)csail(dot)mit(dot)edu>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: SSI and Hot Standby
Date: 2011-01-20 02:06:17
Message-ID: 20110120020617.GA87714@csail.mit.edu
Lists: pgsql-hackers

Kevin's suggestion seems eminently reasonable to me and probably the
best approach one can do for SSI and hot standby. Pulling it off in
time for 9.1 would be a stretch; 9.2 seems quite doable.

It's worth noting that one way or another, the semantics of
SERIALIZABLE transactions on hot standby replicas could be surprising
to some. There's no getting around this; serializability in distributed
systems is just a hard problem in general. Either we go with Kevin's
suggestion of treating SERIALIZABLE transactions as DEFERRABLE (whether
now or for 9.2), causing them to have to use an older snapshot or block
until an acceptable snapshot becomes available; or we require them to
be downgraded to REPEATABLE READ either implicitly or explicitly.

Now, neither of these is as alarming as they might sound, given that
replication lag is a fact of life for hot standby systems and
REPEATABLE READ is exactly the same as the current (9.0) SERIALIZABLE
behavior. But it's definitely something that should be addressed in
documentation.

Dan

--
Dan R. K. Ports MIT CSAIL http://drkp.net/


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: Dan Ports <drkp(at)csail(dot)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: SSI and Hot Standby
Date: 2011-01-20 02:50:10
Message-ID: 1295491810.1803.1105.camel@ebony
Lists: pgsql-hackers

On Wed, 2011-01-19 at 19:34 -0600, Kevin Grittner wrote:

> I agree it's pretty late in the cycle, but I'm going through all the
> loose ends and found this one -- which has been hanging out on the
> Wiki page as an R&D item for over a full year without discussion.
> :-( If we provide the snapshots (which we can safely and easily
> do), can someone else who knows what they're doing with WAL and HS
> get the rest of it safely into the release? That seems to me to be
> the only way it can still happen for 9.1.

I gave you a quick response to let you know that HS need not be a
blocker, for this release. If you are saying you have knowingly ignored
a requirement for a whole year, then I am shocked. How exactly did you
think this would ever be committed?

--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Dan Ports <drkp(at)csail(dot)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: SSI and Hot Standby
Date: 2011-01-20 03:04:28
Message-ID: 20110120030428.GG30352@tamriel.snowman.net
Lists: pgsql-hackers

* Simon Riggs (simon(at)2ndQuadrant(dot)com) wrote:
> I gave you a quick response to let you know that HS need not be a
> blocker, for this release. If you are saying you have knowingly ignored
> a requirement for a whole year, then I am shocked. How exactly did you
> think this would ever be committed?

Erm, to be perfectly honest, I think the answer is probably "I was
busy.", and "no one provided any feedback on *how* to deal with it."
Given the amount of work that Kevin's put into this patch (which has
been beyond impressive, imv), I have a hard time finding fault with
him not getting time to implement a solution for Hot Standby for this.

As you say, it's not a blocker, I agree completely with that, regardless
of when it was identified as an issue. What we're talking about is
right now, and right now is too late to fix it for HS, and to be
perfectly frank, fixing it for HS isn't required or even a terribly
important factor in whether it should be committed.

I'll refrain from casting stones about issues brought up nearly a year
ago on certain other patches which are apparently not going to include
what I, at least, consider extremely important to PG acceptance by
others.

Thanks,

Stephen


From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: pgsql-hackers(at)postgresql(dot)org, Dan Ports <drkp(at)csail(dot)mit(dot)edu>
Subject: Re: SSI and Hot Standby
Date: 2011-01-20 04:43:58
Message-ID: 1295498638.11513.93.camel@jdavis
Lists: pgsql-hackers

On Wed, 2011-01-19 at 19:05 -0600, Kevin Grittner wrote:
> If we don't do something like this, do we just provide REPEATABLE
> READ on the standby as the strictest level of transaction isolation?
> If so, do we generate an error on a request for SERIALIZABLE, warn
> and provide degraded behavior, or just quietly give them REPEATABLE
> READ behavior?
>
> Thoughts?

Hopefully there is a better option available. We don't want to silently
give wrong results.

Maybe we should bring back the compatibility GUC? It could throw an
error unless the user sets the compatibility GUC to turn "serializable"
into "repeatable read".
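
Roughly the idea, with an invented GUC name purely for illustration:

-- with the compatibility GUC off (the default), on a standby:
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
SELECT * FROM receipt;           -- would raise an error

-- with the compatibility GUC on:
SET serializable_means_repeatable_read = on;   -- invented name
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
SELECT * FROM receipt;           -- silently runs at REPEATABLE READ
COMMIT;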

Regards,
Jeff Davis


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: pgsql-hackers(at)postgresql(dot)org, Dan Ports <drkp(at)csail(dot)mit(dot)edu>
Subject: Re: SSI and Hot Standby
Date: 2011-01-20 07:36:47
Message-ID: 4D37E60F.60803@enterprisedb.com
Lists: pgsql-hackers

On 20.01.2011 03:05, Kevin Grittner wrote:
> If we don't do something like this, do we just provide REPEATABLE
> READ on the standby as the strictest level of transaction isolation?
> If so, do we generate an error on a request for SERIALIZABLE, warn
> and provide degraded behavior, or just quietly give them REPEATABLE
> READ behavior?

+1 for generating an error.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "Jeff Davis" <pgsql(at)j-davis(dot)com>
Cc: "Dan Ports" <drkp(at)csail(dot)mit(dot)edu>,<pgsql-hackers(at)postgresql(dot)org>
Subject: Re: SSI and Hot Standby
Date: 2011-01-20 21:54:36
Message-ID: 4D385ABC0200002500039914@gw.wicourts.gov
Lists: pgsql-hackers

Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> On 20.01.2011 03:05, Kevin Grittner wrote:
>> If we don't do something like this, do we just provide REPEATABLE
>> READ on the standby as the strictest level of transaction
>> isolation? If so, do we generate an error on a request for
>> SERIALIZABLE, warn and provide degraded behavior, or just quietly
>> give them REPEATABLE READ behavior?
>
> +1 for generating an error.

Before I go do that, I want to be sure everyone is clear about the
state of things.

If SSI is used to provide data integrity on the master, it will
prevent any serialization anomalies from being persisted on any hot
standby *long term*. For example, at any point where the standby is
at a point in the transaction stream where there were no read/write
transactions active, no anomalies can be observed. (That isn't the
*only* time; it's just the simplest one to describe as an example.)
Queries on the standby can, however, see *transient* anomalies when
they run queries which would cause a serialization failure if run on
the master at the same point in the transaction stream. This can
only occur when, of two concurrent transactions, the one which
*appears* to run second because the other can't read what it wrote,
*commits* first.

The most common and alarming situation where this occurs, in my
opinion, is batch processing. This is extremely common in financial
applications, and tends to show up in a lot of other places, too.
(The receipting query set is an instance of this type of problem,
but I'm going to keep it more general in hopes that people can see
where it impacts them.) Imagine an application which has some small
control record in a table, and inserts to some other table are
assigned to a batch based on the control record. The batches are
normally identified by ascending dates or serial numbers.
Periodically a new batch is opened and the old batch is closed by
updating a "current batch id" column in the control table. If the
batch ID is updated and the transaction in which that update was
executed commits while a transaction which read the old batch ID is
still in flight, a read of the database will show that the batch is
closed, but if you look at the detail of the batch, it will not yet
be complete.
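
As a concrete sketch of that schedule (table and column names invented
for illustration):

T2: BEGIN;                                        -- the inserting transaction
T2: SELECT current_batch FROM control;            -- reads the old batch id
T1: BEGIN;
T1: UPDATE control SET current_batch = current_batch + 1;  -- closes the batch
T1: COMMIT;
T3: BEGIN;
T3: SELECT current_batch FROM control;            -- sees the old batch as closed
T3: SELECT * FROM receipt WHERE batch = ... ;     -- but its detail is incomplete
T3: COMMIT;
T2: INSERT INTO receipt VALUES ( ... );           -- lands in the old batch
T2: COMMIT;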

Under SSI, one of these transactions will be canceled to prevent
this. Our implementation will always allow the update which closes
the batch to complete, and either the insert or the select of the
detail will be rolled back with a serialization failure, depending
on the timing of the actions inside those transactions. If the insert
fails, it can be retried, and will land in the new batch -- making
the listing of the batch, which omits it, correct. If the listing of the
batch details is canceled, it will be because the insert into the
old batch committed before it recognized the problem, so an
immediate retry of the select will see the complete batch contents.

A hot standby can't really take part in the predicate locking and
transaction cancellation on the master.

Dan and I have both come to the conclusion that the only reasonable
way to allow hot standby to work with SSI is for the WAL (when
wal_level = hot_standby) to contain information about when
snapshots develop which won't see such a state. In the above
example, barring some throttling mechanism skipping these particular
snapshots, or other problematic conflicts around the same time, the
master would tell the standby that the snapshot before either of the
two problem transactions was OK, and then it would tell them that
the snapshot after both had committed was OK. It would not suggest
using the snapshot available between the commit of the control
record update and the commit of the insert into the batch.

This seems to me to be not completely unrelated to the snapshot
synchronization patch. It is clearly closely related to the READ
ONLY DEFERRABLE mode, which also looks for a snapshot which is
immune to serialization anomalies without predicate locking,
conflict detection, transaction cancellation, etc. Melding these
two things with hot standby seems to be beyond what can reasonably
happen for 9.1 without delaying the release.

If someone is using one feature and not the other, they really don't
have a problem. Like anyone else, if a hot standby user has been
using SERIALIZABLE mode under 9.0 or earlier, they will need to
switch to REPEATABLE READ. A SERIALIZABLE user who doesn't set up
hot standby has no issue. Opinions so far seem to be in favor of
reporting an error on the standby if SERIALIZABLE is requested, so
that people don't silently get less protection than they expect.
The most annoying thing about that is that if the user would *like*
to use truly serializable transactions on the standby, and will do
so when they get it in 9.2, they must switch to REPEATABLE READ now,
and switch back to SERIALIZABLE with the next release.
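
In other words, an application which would like to run something like
this on the standby:

BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE READ ONLY;
SELECT * FROM receipt;
COMMIT;

would have to switch to this for 9.1, and back again later:

BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ READ ONLY;
SELECT * FROM receipt;
COMMIT;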

So, based on a more complete description of the issues, any more
opinions on whether to generate the error, as suggested by Heikki?
Does anyone think this justifies the compatibility GUC as suggested
by Jeff? It seems to me that this deserves documentation in the
MVCC chapter under both the "Serializable Isolation Level" and
"Enforcing Consistency With Serializable Transactions" sections. I
think it probably deserves a note in the SET TRANSACTION reference
page, too. Agreed? Anywhere else?

-Kevin


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: SSI and Hot Standby
Date: 2011-01-20 22:09:16
Message-ID: 4D38B28C.10701@agliodbs.com
Lists: pgsql-hackers

Kevin,

> So, based on a more complete description of the issues, any more
> opinions on whether to generate the error, as suggested by Heikki?

If it's a choice between generating an error and letting users see
inconsistent data, I'll take the former.

> Does anyone think this justifies the compatibility GUC as suggested
> by Jeff?

I think it might, yes. Since someone could simply turn on the backwards
compatibility flag for 9.1 and turn it off for 9.2, rather than trying
to mess with transaction states which might be set in application code.

Unfortunately, people have not responded to our survey :-(
http://www.postgresql.org/community/survey.77

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: pgsql-hackers(at)postgresql(dot)org, Dan Ports <drkp(at)csail(dot)mit(dot)edu>
Subject: Re: SSI and Hot Standby
Date: 2011-01-20 23:11:30
Message-ID: 1295565090.1803.6122.camel@ebony
Lists: pgsql-hackers

On Wed, 2011-01-19 at 19:05 -0600, Kevin Grittner wrote:

> The idea is that whenever we see a valid snapshot which would yield
> a truly serializable view of the data for a READ ONLY transaction,
> we add a WAL record with that snapshot information.

You haven't explained why this approach is the way forwards. What other
options have been ruled out, and why. The above approach doesn't sound
particularly viable to me.

It's not clear to me what the reason is that this doesn't just work on
HS already. If you started there it might help.

--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Simon Riggs" <simon(at)2ndQuadrant(dot)com>
Cc: "Dan Ports" <drkp(at)csail(dot)mit(dot)edu>,<pgsql-hackers(at)postgresql(dot)org>
Subject: Re: SSI and Hot Standby
Date: 2011-01-20 23:22:49
Message-ID: 4D386F69020000250003991C@gw.wicourts.gov
Lists: pgsql-hackers

Simon Riggs <simon(at)2ndQuadrant(dot)com> wrote:
> On Wed, 2011-01-19 at 19:05 -0600, Kevin Grittner wrote:
>
>> The idea is that whenever we see a valid snapshot which would
>> yield a truly serializable view of the data for a READ ONLY
>> transaction, we add a WAL record with that snapshot information.
>
> You haven't explained why this approach is the way forwards. What
> other options have been ruled out, and why. The above approach
> doesn't sound particularly viable to me.

Why not? We already generate appropriate snapshots for this in SSI,
so is the problem in getting the appropriate information into the
WAL stream or in having a request for a snapshot within a
serializable transaction while running in hot standby the problem?

> It's not clear to me what the reason is that this doesn't just
> work on HS already. If you started there it might help.

Because the standby would need to bombard the master with a stream
of predicate lock information, we would need to allow transactions
on the master to be canceled due in part to activity on the standby,
and I don't even know how you would begin to track read/write
conflicts between transactions on two different clusters.

If any of that didn't make sense, it would probably be more
efficient for everyone involved if those interested browsed the
Overview section of the Wiki page than to have me duplicate its
contents in a post.

http://wiki.postgresql.org/wiki/Serializable

-Kevin


From: Florian Pflug <fgp(at)phlo(dot)org>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, pgsql-hackers(at)postgresql(dot)org, Dan Ports <drkp(at)csail(dot)mit(dot)edu>
Subject: Re: SSI and Hot Standby
Date: 2011-01-20 23:26:49
Message-ID: 806FB186-74BB-4660-B182-92A96C352999@phlo.org
Lists: pgsql-hackers

On Jan21, 2011, at 00:11 , Simon Riggs wrote:
> It's not clear to me what the reason is that this doesn't just work on
> HS already. If you started there it might help.

The problem is that snapshots taken on the master sometimes represent a
state of the database which cannot occur under any (valid) serial schedule.
Hence, if you use that snapshot to read the *whole* database, you've
surely violated serializability. If you read only parts of the database,
things may or may not be fine, depending on the parts you read.

To have the same stringent guarantees that SERIALIZABLE provides on the
master also for queries run against the slave, you somehow need to prevent
this. The easiest way is to only use snapshots on the slave which *cannot*
produce such anomalies. We already know how to generate such snapshots -
SERIALIZABLE READ ONLY DEFERRABLE does exactly that. So the open question
is mainly how to transfer such snapshots to the slave, and how often we
transmit a new one.

best regards,
Florian Pflug


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Simon Riggs" <simon(at)2ndQuadrant(dot)com>, "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: "Dan Ports" <drkp(at)csail(dot)mit(dot)edu>,<pgsql-hackers(at)postgresql(dot)org>
Subject: Re: SSI and Hot Standby
Date: 2011-01-20 23:28:30
Message-ID: 4D3870BE0200002500039922@gw.wicourts.gov
Lists: pgsql-hackers

I wrote:

> Why not? We already generate appropriate snapshots for this in
> SSI, so is the problem in getting the appropriate information into
> the WAL stream or in having a request for a snapshot within a
> serializable transaction while running in hot standby the problem?

I dropped a few words.

That was supposed to ask whether the problem was in getting hot
standby to *use such a snapshot*.

I'm open to other suggestions on how else we might do this. I don't
see any alternatives, but maybe you're seeing some possibility that
eludes me.

-Kevin


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, pgsql-hackers(at)postgresql(dot)org, Dan Ports <drkp(at)csail(dot)mit(dot)edu>
Subject: Re: SSI and Hot Standby
Date: 2011-01-20 23:37:01
Message-ID: 23698.1295566621@sss.pgh.pa.us
Lists: pgsql-hackers

Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> On Wed, 2011-01-19 at 19:05 -0600, Kevin Grittner wrote:
>> The idea is that whenever we see a valid snapshot which would yield
>> a truly serializable view of the data for a READ ONLY transaction,
>> we add a WAL record with that snapshot information.

> You haven't explained why this approach is the way forwards. What other
> options have been ruled out, and why. The above approach doesn't sound
> particularly viable to me.

I'm pretty concerned about the performance implications, too. In
particular that sounds like you could get an unbounded amount of WAL
emitted from a *purely read only* transaction flow. Which is not
going to fly.

regards, tom lane


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Florian Pflug <fgp(at)phlo(dot)org>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, pgsql-hackers(at)postgresql(dot)org, Dan Ports <drkp(at)csail(dot)mit(dot)edu>
Subject: Re: SSI and Hot Standby
Date: 2011-01-21 00:28:25
Message-ID: 1295569705.1803.6542.camel@ebony
Lists: pgsql-hackers

On Fri, 2011-01-21 at 00:26 +0100, Florian Pflug wrote:
> On Jan21, 2011, at 00:11 , Simon Riggs wrote:
> > It's not clear to me what the reason is that this doesn't just work on
> > HS already. If you started there it might help.
>
>
> The problem is that snapshots taken on the master sometimes represent a
> state of the database which cannot occur under any (valid) serial schedule.
> Hence, if you use that snapshot to read the *whole* database, you've
> surely violated serializability. If you read only parts of the database,
> things may or may not be fine, depending on the parts you read.
>
> To have the same stringent guarantees that SERIALIZABLE provides on the
> master also for queries run against the slave, you somehow need to prevent
> this. The easiest way is to only use snapshots on the slave which *cannot*
> produce such anomalies. We already know how to generate such snapshots -
> SERIALIZABLE READ ONLY DEFERRABLE does exactly that. So the open question
> is mainly how to transfer such snapshots to the slave, and how often we
> transmit a new one.

Thank you for explaining a little more.

What I'm still not clear on is why it is that HS is different. Whatever rules
apply on the master must also apply on the standby, immutably. Why is it
we need to pass explicit snapshot information from master to standby? We
don't do that, except at startup for normal HS. Why do we need that?

I hear, but do not yet understand, that the SSI transaction sequence on
the master may differ from the WAL transaction sequence. Is it important
that the ordering on the master would differ from the standby?

--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services


From: Dan Ports <drkp(at)csail(dot)mit(dot)edu>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Florian Pflug <fgp(at)phlo(dot)org>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: SSI and Hot Standby
Date: 2011-01-21 01:19:34
Message-ID: 20110121011934.GF87714@csail.mit.edu
Lists: pgsql-hackers

> What I'm still not clear on is why it is that HS is different. Whatever rules
> apply on the master must also apply on the standby, immutably. Why is it
> we need to pass explicit snapshot information from master to standby? We
> don't do that, except at startup for normal HS. Why do we need that?
>
> I hear, but do not yet understand, that the SSI transaction sequence on
> the master may differ from the WAL transaction sequence. Is it important
> that the ordering on the master would differ from the standby?

The logical serializable ordering of transactions in SSI doesn't
necessarily match the commit time ordering (i.e. the WAL sequence). For
example, with two concurrent transactions, T1 might commit after T2,
even though it didn't see the changes made by T2 and thus has to be
considered "earlier".
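
A minimal sketch of that situation (the table name is invented):

T1: BEGIN;
T1: SELECT * FROM d;      -- takes its snapshot; reads rows T2 will update
T2: BEGIN;
T2: UPDATE d SET ... ;    -- T1 does not see this change
T2: COMMIT;               -- T2 commits first...
T1: COMMIT;               -- ...but T1 has to be ordered before T2, since it
                          -- read data T2 changed without seeing the change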

It doesn't matter whether T1 committed before T2 or the other way
around, as long as no other transaction can tell the difference. If
someone saw the changes made by T1 but not those made by T2, they'd see
T2 as happening before T1, violating serializability. Our SSI code
ensures that doesn't happen by tracking read dependencies. If it
detects that such a read is happening, it rolls back one of the
transactions involved.

Now, if we extend this to hot standby, if T2 commits before T1 on the
master, it obviously will on the slave too. A transaction run on the
slave at the right time might be able to see that T2 has happened but
not T1, which is unserializable. If that transaction had been run on the
master, then it would have been detected and something would have been
rolled back, but the master has no way to know what data is being read
on the slave.

What Kevin is suggesting is that we already have a mechanism for
identifying snapshots where serialization failures like these will
never happen. If we pass that information to the slave and allow it to
run transactions only on those snapshots, serializability is safe.

Hopefully that made some more sense...

Dan

--
Dan R. K. Ports MIT CSAIL http://drkp.net/


From: Florian Pflug <fgp(at)phlo(dot)org>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, pgsql-hackers(at)postgresql(dot)org, Dan Ports <drkp(at)csail(dot)mit(dot)edu>
Subject: Re: SSI and Hot Standby
Date: 2011-01-21 01:25:13
Message-ID: 28809270-AE80-4F89-9DE5-556C92D9B53D@phlo.org
Lists: pgsql-hackers

On Jan21, 2011, at 01:28 , Simon Riggs wrote:
> What I'm still not clear on is why it is that HS is different. Whatever rules
> apply on the master must also apply on the standby, immutably. Why is it
> we need to pass explicit snapshot information from master to standby? We
> don't do that, except at startup for normal HS. Why do we need that?

> I hear, but do not yet understand, that the SSI transaction sequence on
> the master may differ from the WAL transaction sequence. Is it important
> that the ordering on the master would differ from the standby?

The COMMIT order in the actual, concurrent schedule does not necessarily
represent the order of the transactions in an equivalent serial schedule. Here's
an example

T1: BEGIN SERIALIZABLE; -- (Assume snapshot is set here)
T1: UPDATE D1 ... ;
T2: BEGIN SERIALIZABLE; -- (Assume snapshot is set here)
T2: SELECT * FROM D1 ... ;
T2: UPDATE D2 ... ;
T1: COMMIT;
T3: SELECT * FROM D1, D2;
T2: COMMIT;

Now, the COMMIT order is T1, T3, T2. Let's check if there is an equivalent
serial schedule. In any such schedule

T2 must run before T1 because T2 didn't see T1's changes to D1
T3 must run after T1 because T3 did see T1's changes to D1
T3 must run before T2 because T3 didn't see T2's changes to D2

This is obviously impossible - if T3 runs before T2 and T2 runs before T1
then T3 runs before T1, contradicting the second requirement. There is thus
no equivalent serial schedule and we must abort one of these transactions with
a serialization error.

Note that aborting T3 is sufficient, even though T3 is READ ONLY! With T3 gone,
an equivalent serial schedule is T2,T1!

On the master, these "run before" requirements are tracked by remembering which
transactions read which parts of the data via the SIREAD-lock mechanism (These
are more flags than locks, since nobody ever blocks on them).

Since we do not want to report SIREAD locks back to the master, the slave has
to prevent this another way. Kevin's proposed solution does that by only using
those snapshots on the slave for which reading the *whole* database is safe. The
downside is that whether or not a snapshot is safe can only be decided after all
concurrent transactions have finished. The snapshot is thus always a bit outdated,
but shows a state that is known to be possible in some serial schedule.

The very same mechanism can be used on the master also by setting the isolation
level to SERIALIZABLE READ ONLY DEFERRABLE.

best regards,
Florian Pflug


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Dan Ports <drkp(at)csail(dot)mit(dot)edu>
Cc: Simon Riggs <simon(at)2ndQuadrant(dot)com>, Florian Pflug <fgp(at)phlo(dot)org>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: SSI and Hot Standby
Date: 2011-01-21 06:44:59
Message-ID: 4D392B6B.5030907@enterprisedb.com
Lists: pgsql-hackers

On 21.01.2011 03:19, Dan Ports wrote:
>> What I'm still not clear on is why it is that HS is different. Whatever rules
>> apply on the master must also apply on the standby, immutably. Why is it
>> we need to pass explicit snapshot information from master to standby? We
>> don't do that, except at startup for normal HS. Why do we need that?
>>
>> I hear, but do not yet understand, that the SSI transaction sequence on
>> the master may differ from the WAL transaction sequence. Is it important
>> that the ordering on the master would differ from the standby?
>
> The logical serializable ordering of transactions in SSI doesn't
> necessarily match the commit time ordering (i.e. the WAL sequence). For
> example, with two concurrent transactions, T1 might commit after T2,
> even though it didn't see the changes made by T2 and thus has to be
> considered "earlier".
>
> It doesn't matter whether T1 committed before T2 or the other way
> around, as long as no other transaction can tell the difference. If
> someone saw the changes made by T1 but not those made by T2, they'd see
> T2 as happening before T1, violating serializability. Our SSI code
> ensures that doesn't happen by tracking read dependencies. If it
> detects that such a read is happening, it rolls back one of the
> transactions involved.
>
> Now, if we extend this to hot standby, if T2 commits before T1 on the
> master, it obviously will on the slave too. A transaction run on the
> slave at the right time might be able to see that T2 has happened but
> not T1, which is unserializable. If that transaction had been run on the
> master, then it would have been detected and something would have been
> rolled back, but the master has no way to know what data is being read
> on the slave.

We have enough information in the standby to reconstruct all writes done
in the master. I gather that's not enough, in order to roll back
read-only transaction T3 on the standby which would see an anomaly, we'd
also need to know what reads T1 and T2 did in the master. Is that correct?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Dan Ports <drkp(at)csail(dot)mit(dot)edu>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Simon Riggs <simon(at)2ndQuadrant(dot)com>, Florian Pflug <fgp(at)phlo(dot)org>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: SSI and Hot Standby
Date: 2011-01-21 07:32:05
Message-ID: 20110121073205.GG87714@csail.mit.edu
Lists: pgsql-hackers

On Fri, Jan 21, 2011 at 08:44:59AM +0200, Heikki Linnakangas wrote:
> We have enough information in the standby to reconstruct all writes done
> in the master. I gather that's not enough, in order to roll back
> read-only transaction T3 on the standby which would see an anomaly, we'd
> also need to know what reads T1 and T2 did in the master. Is that correct?

That's some of the information we need, but it's not enough...

The problem is that the conflict might not be discovered until after T3
(the reader) commits. In that case, it's too late to abort T3, so you'd
need to roll back T2 instead. But that means a read-only transaction on
the slave has to be able to cause a concurrent read-write transaction
on the master to abort, which brings with it no end of problems.

To make that a little more concrete, let me borrow Kevin's favorite
batch processing example...

[master] T2: BEGIN
[master] T2: SELECT FROM control
[master] T1: BEGIN
[master] T1: UPDATE control
[master] T1: COMMIT
[slave] T3: BEGIN
[slave] T3: SELECT FROM control, receipt
[slave] T3: COMMIT
[master] T2: INSERT INTO receipt
[master] T2: COMMIT

If this all happened at the master, T2 would get rolled back when it
tries to do its INSERT. (I just tried it.) But if T3 happened on the
slave, the master doesn't know that it read both tables, nor does the
slave know at the time it's executing T3 that it's going to conflict
with T2.

Dan

--
Dan R. K. Ports MIT CSAIL http://drkp.net/


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Dan Ports <drkp(at)csail(dot)mit(dot)edu>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Florian Pflug <fgp(at)phlo(dot)org>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: SSI and Hot Standby
Date: 2011-01-21 09:10:47
Message-ID: 1295601047.1803.9228.camel@ebony
Lists: pgsql-hackers

On Fri, 2011-01-21 at 02:32 -0500, Dan Ports wrote:
> On Fri, Jan 21, 2011 at 08:44:59AM +0200, Heikki Linnakangas wrote:
> > We have enough information in the standby to reconstruct all writes done
> > in the master. I gather that's not enough, in order to roll back
> > read-only transaction T3 on the standby which would see an anomaly, we'd
> > also need to know what reads T1 and T2 did in the master. Is that correct?
>
> That's some of the information we need, but it's not enough...
>
> The problem is that the conflict might not be discovered until after T3
> (the reader) commits. In that case, it's too late to abort T3, so you'd
> need to roll back T2 instead. But that means a read-only transaction on
> the slave has to be able to cause a concurrent read-write transaction
> on the master to abort, which brings with it no end of problems.

So T1 and T2 are already potentially unserialized and the presence of T3
causes the sequence to be "caught out", at which point we must abort T2.
The witness does not create the offence, they just report the crime.

So any xid that commits in a different sequence to the order in which
the xid was assigned creates a potential for unserialization? Or?

On HS we know the order of arrival of xids, and we know the order of
commits, so we should be able to work out which are the potentially
unserializable snapshots. That would allow us to make the standby
independent of the master, thereby avoiding all this messy information
flow.

--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Dan Ports <drkp(at)csail(dot)mit(dot)edu>, Florian Pflug <fgp(at)phlo(dot)org>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: SSI and Hot Standby
Date: 2011-01-21 09:19:25
Message-ID: 4D394F9D.6020207@enterprisedb.com
Lists: pgsql-hackers

On 21.01.2011 11:10, Simon Riggs wrote:
> So any xid that commits in a different sequence to the order in which
> the xid was assigned creates a potential for unserialization? Or?

It's not the order in which the xid was assigned that matters, but the
order the transactions started and got their snapshots. The xids might
be assigned a lot later, after the transactions have already read data.

> On HS we know the order of arrival of xids, and we know the order of
> commits, so we should be able to work out which are the potentially
> unserializable snapshots. That would allow us to make the standby
> independent of the master, thereby avoiding all this messy information
> flow.

Unfortunately although we know the order of arrival of xids, it doesn't
tell us the order the transactions started.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Dan Ports <drkp(at)csail(dot)mit(dot)edu>, Florian Pflug <fgp(at)phlo(dot)org>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: SSI and Hot Standby
Date: 2011-01-21 10:00:54
Message-ID: 1295604054.1803.9502.camel@ebony
Lists: pgsql-hackers

On Fri, 2011-01-21 at 11:19 +0200, Heikki Linnakangas wrote:
> On 21.01.2011 11:10, Simon Riggs wrote:
> > So any xid that commits in a different sequence to the order in which
> > the xid was assigned creates a potential for unserialization? Or?
>
> It's not the order in which the xid was assigned that matters, but the
> order the transactions started and got their snapshots. The xids might
> be assigned a lot later, after the transactions have already read data.

So if a read-write transaction assigns an xid before it takes a snapshot
then we'll be OK? That seems much easier to arrange than passing chunks
of snapshot data backwards and forwards. Optionally.

--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services


From: Anssi Kääriäinen <anssi(dot)kaariainen(at)thl(dot)fi>
To: Florian Pflug <fgp(at)phlo(dot)org>
Cc: Simon Riggs <simon(at)2ndQuadrant(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Dan Ports <drkp(at)csail(dot)mit(dot)edu>
Subject: Re: SSI and Hot Standby
Date: 2011-01-21 11:55:29
Message-ID: 4D397431.1030201@thl.fi
Lists: pgsql-hackers

On 01/21/2011 03:25 AM, Florian Pflug wrote:
> The COMMIT order in the actual, concurrent schedule does not necessarily
> represent the order of the transactions in an equivalent serial schedule. Here's
> an example
>
> T1: BEGIN SERIALIZABLE; -- (Assume snapshot is set here)
> T1: UPDATE D1 ... ;
> T2: BEGIN SERIALIZABLE; -- (Assume snapshot is set here)
> T2: SELECT * FROM D1 ... ;
> T2: UPDATE D2 ... ;
> T1: COMMIT;
> T3: SELECT * FROM D1, D2;
> T2: COMMIT;
>
> Now, the COMMIT order is T1, T3, T2. Let's check if there is an equivalent
> serial schedule. In any such schedule
>
> T2 must run before T1 because T2 didn't see T1's changes to D1
> T3 must run after T1 because T3 did see T1's changes to D1
> T3 must run before T2 because T3 didn't see T2's changes to D2
>
> This is obviously impossible - if T3 runs before T2 and T2 runs before T1
> then T3 runs before T1, contradicting the second requirement. There is thus
> no equivalent serial schedule and we must abort one of these transactions with
> a serialization error.
>
> Note that aborting T3 is sufficient, even though T3 is READ ONLY! With T3 gone,
> an equivalent serial schedule is T2,T1!
Sorry for bothering all of you, but I just don't get this. What if T2
rolls back instead of committing? Then the snapshot of T3 would have
been valid, right? Now, for the snapshot of T3 it doesn't matter if T2
commits or if it doesn't, because it can't see the changes of T2 in any
case. Thus, it would seem that the snapshot is valid. On the other hand
I can't see anything wrong in the logic in your post. What am I missing?
I am feeling stupid...

At least for dumps I don't see how T2 can matter (assuming T3 is the
pg_dump's snapshot). Because if you reload from the dump, T2 never
happened in that dump. In the reloaded database it just did not exist at
all.

- Anssi


From: Florian Pflug <fgp(at)phlo(dot)org>
To: Anssi Kääriäinen <anssi(dot)kaariainen(at)thl(dot)fi>
Cc: Simon Riggs <simon(at)2ndQuadrant(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Dan Ports <drkp(at)csail(dot)mit(dot)edu>
Subject: Re: SSI and Hot Standby
Date: 2011-01-21 12:21:15
Message-ID: 876DA065-D013-4364-836E-3870D9986D34@phlo.org
Lists: pgsql-hackers

On Jan21, 2011, at 12:55 , Anssi Kääriäinen wrote:
> On 01/21/2011 03:25 AM, Florian Pflug wrote:
>> The COMMIT order in the actual, concurrent schedule does not necessarily
>> represent the order of the transactions in an equivalent serial schedule. Here's
>> an example
>>
>> T1: BEGIN SERIALIZABLE; -- (Assume snapshot is set here)
>> T1: UPDATE D1 ... ;
>> T2: BEGIN SERIALIZABLE; -- (Assume snapshot is set here)
>> T2: SELECT * FROM D1 ... ;
>> T2: UPDATE D2 ... ;
>> T1: COMMIT;
>> T3: SELECT * FROM D1, D2;
>> T2: COMMIT;
>>
>> Now, the COMMIT order is T1, T3, T2. Let's check if there is an equivalent
>> serial schedule. In any such schedule
>>
>> T2 must run before T1 because T2 didn't see T1's changes to D1
>> T3 must run after T1 because T3 did see T1's changes to D1
>> T3 must run before T2 because T3 didn't see T2's changes to D2
>>
>> This is obviously impossible - if T3 runs before T2 and T2 runs before T1
>> then T3 runs before T1, contradicting the second requirement. There is thus
>> no equivalent serial schedule and we must abort one of these transactions with
>> a serialization error.
>>
>> Note that aborting T3 is sufficient, even though T3 is READ ONLY! With T3 gone,
>> an equivalent serial schedule is T2,T1!

> Sorry for bothering all of you, but I just don't get this. What if T2 rolls back instead of committing? Then the snapshot of T3 would have been valid, right?

Yeah. If T2 is removed from the picture, the only remaining ordering constraint is "T3 must run after T1 because T3 did see T1's changes to D1", and thus T1,T3 is an equivalent serial schedule.

> Now, for the snapshot of T3 it doesn't matter if T2 commits or if it doesn't, because it can't see the changes of T2 in any case. Thus, it would seem that the snapshot is valid. On the other hand I can't see anything wrong in the logic in your post. What am I missing? I am feeling stupid...

The problem is that T3 sees the effects of T1 but not those of T2. Since T2 must run *before* T1 in any equivalent serial schedule, that is impossible. In other words, if you look at an equivalent serial schedule of the *writing* transactions T1 and T2 you won't find a place to insert T3 such that it gives the same answer as in the concurrent schedule.

It isn't really T3's snapshot that is invalid, it's the interleaving of T1,T2,T3, because there is no equivalent serial schedule (a serial schedule that produces the same results). If, for example, T3 reads only *one* of the tables D1,D2 then the whole thing suddenly *is* serializable! If T3 reads only D1, an equivalent serial schedule must run T3 after T1, and if it reads only D2 then it must run before T2.

That "validity" of snapshots comes into play if you attempt to distinguish safe and unsafe interleaved schedules *without* taking the dataset inspected by T3 into account. So you simply assume that T3 reads the *whole* database (since that's the worst case), and must thus run, in any equivalent serial schedule, *after* all transactions it saw as COMMITTED.

The root of the whole issue is that this might not be possible! Some not-yet-committed transaction (T2 in the example) may have to be placed *before* some transaction seen as COMMITTED by T3 (T1 in our example). Since T3 needs to run *after* T1 (because it saw it as committed), it would thus also see T2 in any such serial schedule. But since it didn't see T2 in the interleaved schedule, we're hosed.

> At least for dumps I don't see how T2 can matter (assuming T3 is the pg_dump's snapshot). Because if you reload from the dump, T2 never happened in that dump. In the reloaded database it just did not exist at all.

Still, the dump would reflect a database state that *logically* never existed (i.e. not in any serial schedule). If you dump for disaster recovery, you might not care. If you dump to copy the data onto some reporting server, you might.

best regards,
Florian Pflug


From: Florian Pflug <fgp(at)phlo(dot)org>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Simon Riggs <simon(at)2ndQuadrant(dot)com>, Dan Ports <drkp(at)csail(dot)mit(dot)edu>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: SSI and Hot Standby
Date: 2011-01-21 13:00:31
Message-ID: A4EF660F-1098-4F63-83A6-F9A339EC1119@phlo.org
Lists: pgsql-hackers

On Jan21, 2011, at 10:19 , Heikki Linnakangas wrote:
> On 21.01.2011 11:10, Simon Riggs wrote:
>> So any xid that commits in a different sequence to the order in which
>> the xid was assigned creates a potential for unserialization? Or?
>
> It's not the order in which the xid was assigned that matters, but the order the transactions started and got their snapshots. The xids might be assigned a lot later, after the transactions have already read data.

Any pair of concurrent transactions on the master between which r/w-dependencies exist is a potential risk. If their order in every equivalent serial schedule doesn't match their commit order, any snapshot taken between the two commits doesn't represent a fully consistent view of the database.

Hm, wait a minute... Thinks...

Since transactions on the slave don't write, they really don't need to be aware of any SIREAD locks taken on the master, right? We'd still need to detect conflicts between SIREAD locks taken on the slaves and writes by transactions on the master, but that could be handled by the recovery process without having to report anything back to the master, and without logging SIREAD lock acquisitions. So, how about the following

A) We log r/w-dependencies between transactions committed on the master in the WAL, probably in the COMMIT record
B) SERIALIZABLE queries on the slave use the SIREAD lock machinery like they'd do on the master. The recovery process does the necessary conflict flagging in case the write happens (in wall clock time) after the read on the slave, mimicking what the writing transaction would have done on the master had it seen the SIREAD lock
C) By using the r/w-dependency information from the WAL plus the r/w-dependency information generated on the slave we can detect dangerous situations on the slave, and abort the offending query on the slave.

(A) and (B) seem quite straight-forward. Regarding (C), I'm not familiar enough with the inner workings of the SSI patch to judge that.

best regards,
Florian Pflug


From: Nicolas Barbier <nicolas(dot)barbier(at)gmail(dot)com>
To: Anssi Kääriäinen <anssi(dot)kaariainen(at)thl(dot)fi>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: SSI and Hot Standby
Date: 2011-01-21 13:05:04
Message-ID: AANLkTikL4B7yEQ4WHtj-kxYGsbH0GowHJF3P93G=SVcX@mail.gmail.com
Lists: pgsql-hackers

2011/1/21 Anssi Kääriäinen <anssi(dot)kaariainen(at)thl(dot)fi>:

> Sorry for bothering all of you, but I just don't get this. What if T2 rolls
> back instead of committing? Then the snapshot of T3 would have been valid,
> right? Now, for the snapshot of T3 it doesn't matter if T2 commits or if it
> doesn't, because it can't see the changes of T2 in any case. Thus, it would
> seem that the snapshot is valid. On the other hand I can't see anything
> wrong in the logic in your post. What am I missing? I am feeling stupid...
>
> At least for dumps I don't see how T2 can matter (assuming T3 is the
> pg_dump's snapshot). Because if you reload from the dump, T2 never happened
> in that dump. In the reloaded database it just did not exist at all.

This has been discussed before; in [1] I summarized:

"IOW, one could say that the backup is consistent only if it were
never compared against the system as it continued running after the
dump took place."

Nicolas

[1] <URL:http://archives.postgresql.org/pgsql-hackers/2010-09/msg01763.php>


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Nicolas Barbier <nicolas(dot)barbier(at)gmail(dot)com>
Cc: Anssi Kääriäinen <anssi(dot)kaariainen(at)thl(dot)fi>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: SSI and Hot Standby
Date: 2011-01-21 13:45:44
Message-ID: AANLkTin9aUX662sF7LPUwqXLWpMnupQWwzKy5=_3-qBu@mail.gmail.com
Lists: pgsql-hackers

On Fri, Jan 21, 2011 at 8:05 AM, Nicolas Barbier
<nicolas(dot)barbier(at)gmail(dot)com> wrote:
> 2011/1/21 Anssi Kääriäinen <anssi(dot)kaariainen(at)thl(dot)fi>:
>
>> Sorry for bothering all of you, but I just don't get this. What if T2 rolls
>> back instead of committing? Then the snapshot of T3 would have been valid,
>> right? Now, for the snapshot of T3 it doesn't matter if T2 commits or if it
>> doesn't, because it can't see the changes of T2 in any case. Thus, it would
>> seem that the snapshot is valid. On the other hand I can't see anything
>> wrong in the logic in your post. What am I missing? I am feeling stupid...
>>
>> At least for dumps I don't see how T2 can matter (assuming T3 is the
>> pg_dump's snapshot). Because if you reload from the dump, T2 never happened
>> in that dump. In the reloaded database it just did not exist at all.
>
> This has been discussed before; in [1] I summarized:
>
> "IOW, one could say that the backup is consistent only if it were
> never compared against the system as it continued running after the
> dump took place."

But that's a pretty fair way to look at it, isn't it? I mean, I guess
it's a question of what you plan to use that backup for, but if it's
disaster recovery, everything that happened after the dump is gone, so
no such comparison will occur. And that's probably the most common
reason for taking a dump.

It occurs to me that focusing on how this is going to work on Hot
Standby might be looking at the question too narrowly. The general
issue is - does this technique generalize to a distributed computing
environment, with distributed transactions across multiple PostgreSQL
databases? For example, what if the control record in Kevin's example
is stored in another database, or on another server. Or what if some
tables are replicated via Slony? I realize this is all outside the
scope of the patch, but that's exactly the point: making this stuff
work across multiple databases (even if they are replicas of each
other) is much more complex than getting it to work on just one
machine. Even if we could agree on how to do it, coming up with some
hack that can only ever possibly work in the Hot Standby case might
not be the best thing to do.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Nicolas Barbier" <nicolas(dot)barbier(at)gmail(dot)com>, "Robert Haas" <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, <anssi(dot)kaariainen(at)thl(dot)fi>
Subject: Re: SSI and Hot Standby
Date: 2011-01-21 14:58:59
Message-ID: 4D394AD3020000250003998D@gw.wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Fri, Jan 21, 2011 at 8:05 AM, Nicolas Barbier

>> This has been discussed before; in [1] I summarized:
>>
>> "IOW, one could say that the backup is consistent only if it were
>> never compared against the system as it continued running after
>> the dump took place."
>
> But that's a pretty fair way to look at it, isn't it? I mean, I
> guess it's a question of what you plan to use that backup for, but
> if it's disaster recovery, everything that happened after the dump
> is gone, so no such comparison will occur. And that's probably
> the most common reason for taking a dump.

It's not, however, a reason for having a hot standby (versus a warm
standby or PITR backup).

> It occurs to me that focusing on how this is going to work on Hot
> Standby might be looking at the question too narrowly. The
> general issue is - does this technique generalize to a distributed
> computing environment, with distributed transactions across
> multiple PostgreSQL databases?

No, and I can pretty much guarantee that you can't have such a
solution without blocking on all masters at commit time. What
you're suggesting goes *way* beyond two-phase commit, which just
guarantees that the integrity rules of each database are honored
and that all transactions either commit or don't. You're talking
about sharing lock information (which in SSI is communicated via
LW locking) across high-latency communication links. Expect any
such generalized "and world peace!" solution to be rather slow.

> For example, what if the control record in Kevin's example is
> stored in another database, or on another server. Or what if some
> tables are replicated via Slony? I realize this is all outside
> the scope of the patch

Yep. Again, the patch achieves true serializability with minimal
cost and *no blocking*. Spend a few minutes thinking about how you
might coordinate what you propose, and you'll see it's going to
involve blocking based on waiting for messages from across the wire.

> but that's exactly the point: making this stuff work across
> multiple databases (even if they are replicas of each other) is
> much more complex than getting it to work on just one machine.
> Even if we could agree on how to do it, coming up with some hack
> that can only ever possibly work in the Hot Standby case might not
> be the best thing to do.

I don't see it as a hack. It's the logical extension of SSI onto
read only replicas. If you're looking for something more than that
(as the above suggests), it's not a good fit; but I suspect that
there are people besides me who would want to use hot standby for
reporting and read only web access who would want a serializable
view. What this proposal does is to say that there are two time
streams to look at on the standby -- how far along you are for
purposes of recovery, and how far along you are for purposes of
seeing a view of the data sure to be consistent with the later state of
the master. With SSI they can't be the same. If someone wants them
to be, they could implement a traditional S2PL serializable mode,
complete with blocking and deadlocks, and then you'd have it
automatically on the replicas, because with S2PL the apparent order
of execution matches the commit order.

-Kevin


From: Nicolas Barbier <nicolas(dot)barbier(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Anssi Kääriäinen <anssi(dot)kaariainen(at)thl(dot)fi>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: SSI and Hot Standby
Date: 2011-01-21 15:06:15
Message-ID: AANLkTin7zPmJSZ6Q013Gh2a+YUCO-SM_tRcstfD6oRMV@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

2011/1/21 Robert Haas <robertmhaas(at)gmail(dot)com>:

> On Fri, Jan 21, 2011 at 8:05 AM, Nicolas Barbier
> <nicolas(dot)barbier(at)gmail(dot)com> wrote:
>
>> 2011/1/21 Anssi Kääriäinen <anssi(dot)kaariainen(at)thl(dot)fi>:
>>
>>> Sorry for bothering all of you, but I just don't get this. What if T2 rolls
>>> back instead of committing? Then the snapshot of T3 would have been valid,
>>> right? Now, for the snapshot of T3 it doesn't matter if T2 commits or if it
>>> doesn't, because it can't see the changes of T2 in any case. Thus, it would
>>> seem that the snapshot is valid. On the other hand I can't see anything
>>> wrong in the logic in your post. What am I missing? I am feeling stupid...
>>>
>>> At least for dumps I don't see how T2 can matter (assuming T3 is the
>>> pg_dump's snapshot). Because if you reload from the dump, T2 never happened
>>> in that dump. In the reloaded database it just did not exist at all.
>>
>> This has been discussed before; in [1] I summarized:
>>
>> "IOW, one could say that the backup is consistent only if it were
>> never compared against the system as it continued running after the
>> dump took place."
>
> But that's a pretty fair way to look at it, isn't it?

Indeed, I just wanted to point Anssi to the previous discussion.

> It occurs to me that focusing on how this is going to work on Hot
> Standby might be looking at the question too narrowly.  The general
> issue is - does this technique generalize to a distributed computing
> environment, with distributed transactions across multiple PostgreSQL
> databases?  For example, what if the control record in Kevin's example
> is stored in another database, or on another server.  Or what if some
> tables are replicated via Slony?  I realize this is all outside the
> scope of the patch, but that's exactly the point: making this stuff
> work across multiple databases (even if they are replicas of each
> other) is much more complex than getting it to work on just one
> machine.  Even if we could agree on how to do it, coming up with some
> hack that can only ever possibly work in the Hot Standby case might
> not be the best thing to do.

You seem to be questioning whether the normal (?) way of using 2PC on
multiple DBs that all use SERIALIZABLE isolation mode (just start
transactions, and let 2PC coordinate the commits) always results in
globally serializable behavior.

I must say that I don't immediately see the answer (my gut feeling
says "nope"), but it sure is an interesting question.

In the special case where all databases use SS2PL as their concurrency
control mechanism, the answer is "yes". (Because any "possibly
conflicting" local transactions of the global transactions, i.e., ones
that touch the same tables, will necessarily commit in some globally
consistent order, which in the case of SS2PL is consistent with the
"must-have-happened-before-in-any-serialized-order" relation; or, in
the terminology of the literature: "because SS2PL is CO".)

I don't know whether it is generally considered a requirement for a
2PC environment to guarantee global serializability.

Nicolas


From: Anssi Kääriäinen <anssi(dot)kaariainen(at)thl(dot)fi>
To: Florian Pflug <fgp(at)phlo(dot)org>
Cc: Simon Riggs <simon(at)2ndQuadrant(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Dan Ports <drkp(at)csail(dot)mit(dot)edu>
Subject: Re: SSI and Hot Standby
Date: 2011-01-21 15:08:00
Message-ID: 4D39A150.20000@thl.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 01/21/2011 02:21 PM, Florian Pflug wrote:
> Still, the dump would reflect a database state that *logically* never existed (i.e. not in any serial schedule). If you dump for disaster recovery, you might not care. If you dump to copy the data onto some reporting server, you might.
>
> best regards,
> Florian Pflug

I am beginning to understand the problem. If you don't mind, here is a
complete example if somebody else is having troubles understanding this.

Let's say we have tables D1 and D2. Both contain a single column, id,
and a single row. The data in the beginning is as follows:

D1: id = 1
D2: id = 1

The constraints: D1.id can only be incremented. Whenever D2.id is
updated, it must be updated to D1.id + 1.

The transactions:
T1: begin; update D1 set id = id + 1;
T2: begin; update D2 set id = (select id+1 from D1);
T1: commit;
T3: begin; select id from D1; select id from D2; commit;
    Data seen: (2, 1) -- this is a possible state
T2: commit;
T4: begin; select id from D1; select id from D2; commit;
    Data seen: (2, 2)

This is again a possible state. But if we compare it to the state seen
by T3, it is not valid: from state (2, 1) we cannot get to state (2, 2)
without breaking one of the constraints. Thus, the state seen by T3 is
not valid in the database.
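
(For anyone who wants to reproduce this, here is the same schedule as
a minimal psql sketch, using three concurrent sessions and the table
names from above. REPEATABLE READ is plain snapshot isolation in 9.1,
so SSI does not step in and nothing gets cancelled:)

-- setup, run once:
create table d1 (id int);
create table d2 (id int);
insert into d1 values (1);
insert into d2 values (1);

-- session 1 (T1):
begin isolation level repeatable read;
update d1 set id = id + 1;

-- session 2 (T2):
begin isolation level repeatable read;
update d2 set id = (select id + 1 from d1);  -- subselect sees d1.id = 1

-- session 1:
commit;

-- session 3 (T3):
begin isolation level repeatable read;
select d1.id, d2.id from d1, d2;             -- returns (2, 1)
commit;

-- session 2:
commit;

-- session 3 (T4):
begin isolation level repeatable read;
select d1.id, d2.id from d1, d2;             -- returns (2, 2)
commit;

(Run everything at SERIALIZABLE instead and SSI cancels either T2 or
T3, so the (2, 1) / (2, 2) pair can never be observed on the master.)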

So, I finally got it! :-) I hope this example will help somebody else
understand the problem. The problem I had understanding this was that
the state in T3 is in fact perfectly valid. I thought that there must be
some problem with that state alone. There isn't, unless you compare it
to the state after T2 has committed.

Thanks to all explaining this to me,
- Anssi


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Dan Ports <drkp(at)csail(dot)mit(dot)edu>, Florian Pflug <fgp(at)phlo(dot)org>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: SSI and Hot Standby
Date: 2011-01-21 15:32:01
Message-ID: 10201.1295623921@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> On Fri, 2011-01-21 at 11:19 +0200, Heikki Linnakangas wrote:
>> It's not the order in which the xid was assigned that matters, but the
>> order the transactions started and got their snapshots. The xids might
>> be assigned a lot later, after the transactions have already read data.

> So if a read-write transaction assigns an xid before it takes a snapshot
> then we'll be OK? That seems much easier to arrange than passing chunks
> of snapshot data backwards and forwards. Optionally.

No, that idea is DOA from a performance standpoint. We sweated blood to
avoid having to assign XIDs to read-only transactions, and we're not
going back. If SSI requires that, SSI is not getting committed.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Dan Ports <drkp(at)csail(dot)mit(dot)edu>, Florian Pflug <fgp(at)phlo(dot)org>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: SSI and Hot Standby
Date: 2011-01-21 15:36:12
Message-ID: AANLkTin1jKp42VNSMDMt81KE_THsMNR_yRtWwA++WZ88@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Jan 21, 2011 at 10:32 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
>> On Fri, 2011-01-21 at 11:19 +0200, Heikki Linnakangas wrote:
>>> It's not the order in which the xid was assigned that matters, but the
>>> order the transactions started and got their snapshots. The xids might
>>> be assigned a lot later, after the transactions have already read data.
>
>> So if a read-write transaction assigns an xid before it takes a snapshot
>> then we'll be OK? That seems much easier to arrange than passing chunks
>> of snapshot data backwards and forwards. Optionally.
>
> No, that idea is DOA from a performance standpoint.  We sweated blood to
> avoid having to assign XIDs to read-only transactions, and we're not
> going back.  If SSI requires that, SSI is not getting committed.

So far I think all of the ideas proposed for generalizing this across
the master-standby connection seem likely to be DOA from a performance
perspective. But I think we have a pretty broad consensus that it's
OK to punt this issue for 9.1. We can always add this in 9.2 if it
can be demonstrated to work well, but it's all vapor-ware right now
anyway.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Simon Riggs" <simon(at)2ndQuadrant(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Dan Ports" <drkp(at)csail(dot)mit(dot)edu>, "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "Florian Pflug" <fgp(at)phlo(dot)org>,<pgsql-hackers(at)postgresql(dot)org>
Subject: Re: SSI and Hot Standby
Date: 2011-01-21 16:16:21
Message-ID: 4D395CF502000025000399A2@gw.wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
>> On Fri, 2011-01-21 at 11:19 +0200, Heikki Linnakangas wrote:
>>> It's not the order in which the xid was assigned that matters,
>>> but the order the transactions started and got their snapshots.
>>> The xids might be assigned a lot later, after the transactions
>>> have already read data.
>
>> So if a read-write transaction assigns an xid before it takes a
>> snapshot then we'll be OK? That seems much easier to arrange than
>> passing chunks of snapshot data backwards and forwards.

We're not talking about passing them backwards. I'm suggesting that
we probably don't even need to pass them forward, but that
suggestion has been pretty handwavy so far. I guess I should fill
it out, because everyone's been ignoring it so far.

> No, that idea is DOA from a performance standpoint. We sweated
> blood to avoid having to assign XIDs to read-only transactions,
> and we're not going back. If SSI requires that, SSI is not
> getting committed.

SSI doesn't require that. The suggestion that it would in *any* way
help with the interaction with hot standby is off-base.

-Kevin


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Florian Pflug" <fgp(at)phlo(dot)org>, Anssi Kääriäinen <anssi(dot)kaariainen(at)thl(dot)fi>
Cc: "Simon Riggs" <simon(at)2ndQuadrant(dot)com>, "Dan Ports" <drkp(at)csail(dot)mit(dot)edu>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: SSI and Hot Standby
Date: 2011-01-21 17:37:16
Message-ID: 4D396FEC02000025000399B6@gw.wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Anssi Kääriäinen<anssi(dot)kaariainen(at)thl(dot)fi> wrote:

> I am beginning to understand the problem. If you don't mind, here
> is a complete example if somebody else is having troubles
> understanding this.
>
> Let's say we have tables D1 and D2. Both contain a single column,
> id, and a single row. The data in the beginning is as follows:
>
> D1: id = 1
> D2: id = 1
>
> The constraints: D1.id can only be incremented. Whenever D2.id is
> updated, it must be updated to D1.id + 1.
>
> The transactions:
> T1: begin; update D1 set id = id + 1;
> T2: begin; update D2 set id = (select id+1 from D1);
> T1: commit;
> T3: begin; select id from D1; select id from D2; commit; Data
> seen: (2, 1) -- this is a possible state
> T2: commit;
> T4: begin; select id from D1; select id from D2; commit; Data
> seen: (2, 2)
> This is again a possible state. But if we compare this to the
> state seen
> by T3 this is not valid. From state (2, 1) we can not get to state
> (2, 2) without breaking one of the constraints. Thus, the state of
> T3 is not valid in the database.
>
> So, I finally got it! :-) I hope this example will help somebody
> else understand the problem.

Yeah, interesting example. Under SSI, once T3 selects from D2 you
have a dangerous structure, and either T2 or T3 must fail to prevent
the possibility of the sort of anomaly your example demonstrates.
We would prefer to see T2 fail, because if T3 fails it will continue
to fail on retry until T2 completes. We're trying to avoid that
kind of thrashing. If T2 fails and is retried, it will immediately
succeed and generate results consistent with what T3 saw.
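
To make that concrete: rerunning the quoted schedule from the initial
(1, 1) state, but with every session at SERIALIZABLE, the intended
behavior is sketched below. (This is only a sketch; the exact statement
at which the failure is reported can vary, and as I describe in the
next paragraph the failure currently lands on T3 instead of T2, which
is the bug.)

-- session 1 (T1):
begin isolation level serializable;
update d1 set id = id + 1;

-- session 2 (T2):
begin isolation level serializable;
update d2 set id = (select id + 1 from d1);

-- session 1:
commit;

-- session 3 (T3):
begin isolation level serializable;
select id from d1;   -- 2
select id from d2;   -- 1
commit;              -- T3 is allowed to commit

-- session 2:
commit;
-- intended result: ERROR: could not serialize access due to
--                  read/write dependencies among transactions

-- session 2, retried from scratch:
begin isolation level serializable;
update d2 set id = (select id + 1 from d1);  -- now sees d1.id = 2
commit;  -- succeeds, leaving (2, 3), which is reachable from the
         -- (2, 1) state that T3 saw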

When I test your example, though, I'm getting the serialization
failure on T3 rather than T2, so I'd call that a bug. Will
investigate. Thanks again for your tests! You seem to be able to
shake out issues better than anyone else! Once found, fixing them
is not usually very hard, it's coming up with that creative usage
pattern to *find* the problem which is the hard part.

OK if I add this one to our dcheck test suite, too?

-Kevin


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: <anssi(dot)kaariainen(at)thl(dot)fi>
Cc: "Simon Riggs" <simon(at)2ndQuadrant(dot)com>, "Dan Ports" <drkp(at)csail(dot)mit(dot)edu>, "Florian Pflug" <fgp(dot)phlo(dot)org(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: SSI and Hot Standby
Date: 2011-01-21 18:21:17
Message-ID: 4D397A3D02000025000399C5@gw.wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:

> When I test your example, though, I'm getting the serialization
> failure on T3 rather than T2, so I'd call that a bug. Will
> investigate. Thanks again for your tests!

Fixed with this:

http://git.postgresql.org/gitweb?p=users/kgrittn/postgres.git;a=commitdiff;h=b91460812396b68362c812d6e4fb67799fc6147e

Thanks again!

-Kevin


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Simon Riggs" <simon(at)2ndQuadrant(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: "Dan Ports" <drkp(at)csail(dot)mit(dot)edu>, "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "Florian Pflug" <fgp(at)phlo(dot)org>,<pgsql-hackers(at)postgresql(dot)org>
Subject: Re: SSI and Hot Standby
Date: 2011-01-22 00:52:28
Message-ID: 4D39D5EC0200002500039A19@gw.wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I wrote:

> We're not talking about passing them backwards. I'm suggesting
> that we probably don't even need to pass them forward, but that
> suggestion has been pretty handwavy so far. I guess I should fill
> it out, because everyone's been ignoring it so far.

It's been too hectic today to flesh this out very well, but I can at
least do a better brain dump -- you know, wave my hands a little
less vaguely.

The idea of communicating regarding a safe snapshot through the WAL
without actually *sending* snapshot XIDs through the WAL might work
something like this:

(1) We communicate when we are starting to consider a snapshot.
This would always be related to the commit or rollback of a
serializable read-write transaction, so perhaps we could include the
information in an existing WAL record. We would need to find one
free bit somewhere, or make room for it. Alternatively, we could
send a new WAL record type to communicate this. At the point that a
standby processes such a WAL record, it would grab a snapshot effective
after the commit, and save it as the "latest candidate", releasing
the previous candidate, if any.

(2) If a snapshot fails to make it to a safe status on the master,
it will pick a new candidate and repeat (1) -- there's no need to
explicitly quash a failed candidate.

(3) We communicate when we find that the last candidate made it to
"safe" status. Again, this would be related to the commit or
rollback of a serializable read-write transaction. Same issues
about needing (another) bit or using a new record type. When a
standby receives this, it promotes the latest candidate to the new
"safe snapshot" to be used when a serializable transaction asks for
a snapshot, replacing the previous value, if any. Any transactions
waiting for a snapshot (either because there previously wasn't a
safe snapshot on record or because they requested DEFERRABLE) could
be provided the new snapshot and turned loose.

(4) It's not inconceivable that we might want to send both (1) and
(3) with the same commit.

(5) Obviously, we can pick our heuristics for how often we try to
refresh this, limiting it to avoid too much overhead, at the cost of
less frequent snapshot updates for serializable transactions on the
standbys.

My assumption is that when we have a safe snapshot (which should be
pretty close to all the time), we immediately provide it to any
serializable transaction requesting a snapshot, except it seems to
make sense to use the new DEFERRABLE mode to mean that you want to
use the *next* one to arrive.
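
From the client's side on a standby, that would look roughly like the
sketch below. (Nothing here works on hot standby today; this is only
an illustration of the proposed behavior, using the existing 9.1
transaction-mode syntax and the d1 table from Anssi's example.)

-- Uses the most recently received safe snapshot; the view may lag
-- the master by a variable amount, but can never show a
-- serialization anomaly:
begin transaction isolation level serializable read only;
select id from d1;
commit;

-- DEFERRABLE would instead mean: block until the *next* safe
-- snapshot arrives from the master, then run with that one:
begin transaction isolation level serializable read only deferrable;
select id from d1;
commit;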

This would effectively cause the point in time which was visible to
serializable transactions to lag behind what is visible to other
transactions by a variable amount, but would ensure that a
serializable transaction couldn't see any serialization anomalies.
It would also be immune to serialization failures from SSI logic;
but obviously, standby-related cancellations would be in play. I
don't know whether the older snapshots would tend to increase the
standby-related cancellations, but it wouldn't surprise me.

Hopefully this is enough for people to make something of it.

-Kevin


From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: Simon Riggs <simon(at)2ndQuadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Dan Ports <drkp(at)csail(dot)mit(dot)edu>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Florian Pflug <fgp(at)phlo(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: SSI and Hot Standby
Date: 2011-01-22 01:32:18
Message-ID: 1295659938.19048.21.camel@jdavis-ux.asterdata.local
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, 2011-01-21 at 18:52 -0600, Kevin Grittner wrote:
> My assumption is that when we have a safe snapshot (which should be
> pretty close to all the time), we immediately provide it to any
> serializable transaction requesting a snapshot, except it seems to
> make sense to use the new DEFERRABLE mode to mean that you want to
> use the *next* one to arrive.

How would it handle this situation:
1. Standby has safe snapshot S1
2. Primary does a VACUUM which removes some stuff visible in S1
3. Standby can't replay the VACUUM because it still has S1, but also
can't get a new S2 because the WAL needed for that is behind the VACUUM

So, S1 needs to be discarded. What do we do on the standby while there
is no safe snapshot? I suppose throw errors -- I can't think of anything
else.

> This would effectively cause the point in time which was visible to
> serializable transactions to lag behind what is visible to other
> transactions by a variable amount, but would ensure that a
> serializable transaction couldn't see any serialization anomalies.
> It would also be immune to serialization failures from SSI logic;
> but obviously, standby-related cancellations would be in play. I
> don't know whether the older snapshots would tend to increase the
> standby-related cancellations, but it wouldn't surprise me.

I'm also a little concerned about the user-understandability here. Is it
possible to make the following guarantees in this approach:

1. If transactions are completing on the primary, new snapshots will be
taken on the standby; and
2. If no write transactions are in progress on the primary, then the
standby will get a snapshot that represents the exact same data as on
the primary?

That would be fairly easy to explain to users. If there is a visibility
lag, then we just say "finish the write transactions, and progress will
be made". And if the system is idle, they should see identical data.

Regards,
Jeff Davis


From: Kääriäinen Anssi <anssi(dot)kaariainen(at)thl(dot)fi>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Florian Pflug <fgp(at)phlo(dot)org>
Cc: Simon Riggs <simon(at)2ndQuadrant(dot)com>, Dan Ports <drkp(at)csail(dot)mit(dot)edu>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: SSI and Hot Standby
Date: 2011-01-22 03:50:21
Message-ID: BC19EF15D84DC143A22D6A8F2590F0A76515EE9822@EXMAIL.stakes.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"When I test your example, though, I'm getting the serialization
failure on T3 rather than T2, so I'd call that a bug. Will
investigate. Thanks again for your tests! You seem to be able to
shake out issues better than anyone else! Once found, fixing them
is not usually very hard, it's coming up with that creative usage
pattern to *find* the problem which is the hard part."

Thank you very much, but I do not deserve this honor. I was just constructing an example for myself so that I could understand why a read only transaction might pose a problem. I posted it to help other people see a concrete example of the problem. I had no idea this would show an actual bug in the code.

"OK if I add this one to our dcheck test suite, too?"

It is of course OK. And if you want to add this as an example in the documentation, that would be great. This is a simple but concrete example of why a read only serializable transaction might cause an anomaly. If I am not mistaken, there isn't any complete example in the documentation. It was hard for me to grasp why there might be a problem, and I don't think I am alone.

- Anssi