Re: Snapshot synchronization, again...

Lists: pgsql-hackers
From: Joachim Wieland <joe(at)mcknight(dot)de>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Snapshot synchronization, again...
Date: 2010-12-30 12:31:47
Message-ID: AANLkTi=XPm4FGRjzdQtWnHxJ8Su_tE+x59Yv_bYwDuaR@mail.gmail.com

The snapshot synchronization discussion from the parallel pg_dump
patch somehow ended without a clear way to go forward.

Let me sum up what has been brought up and propose a short- and a
long-term solution.

Summary:

Passing snapshot sync information can be done either:

a) by returning complete snapshot information from the backend to the
client so that the client can pass it along to a different backend
b) or by returning only a unique identifier to the client and storing
the actual snapshot data somewhere on the server side

Advantage of a: no memory is used in the backend and no memory needs
to get cleaned up. It is also theoretically possible that we could
forward that data to a hot standby server and do e.g. a dump partially
on the master server and partially on the hot standby server, or among
several hot standby servers.
Disadvantage of a: The snapshot must be validated to make sure that
its information is still current, and it might be difficult to cover
all cases of this validation. Also, a client is not limited to exactly
a published snapshot: it could install just about any snapshot that
fits and passes the validation checks (this is more a disadvantage
than an advantage because it allows a client to see a database state
that never existed in reality).

Advantage of b: No validation is necessary; as soon as the transaction
that publishes the snapshot loses that snapshot, it will also revoke
the snapshot information (either by removing a temp file or deleting
it from shared memory).
Disadvantage of b: It doesn't allow a snapshot to be installed on a
different server. It requires a serializable open transaction to hold
the snapshot.

What I am proposing now is the following:

We return snapshot information as a chunk of data to the client. At
the same time however, we set a checksum in shared memory to protect
against modification of the snapshot. A publishing backend can revoke
its snapshot by deleting the checksum and a backend that is asked to
install a snapshot can verify that the snapshot is correct and current
by calculating the checksum and comparing it with the one in shared
memory.

This only costs us a few bytes (checksum size * max_connections) in
shared memory, and apart from resetting the checksum it does not have
cleanup and verification issues. Note that we are also free to change
the internal format of the chunk of data we return whenever we like,
so we are free to enhance this feature in the future, transparently to
the client.

Thoughts?

Joachim


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Joachim Wieland <joe(at)mcknight(dot)de>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Snapshot synchronization, again...
Date: 2010-12-30 14:40:11
Message-ID: 1293719782-sup-7939@alvh.no-ip.org
Lists: pgsql-hackers

Excerpts from Joachim Wieland's message of Thu Dec 30 09:31:47 -0300 2010:

> Advantage of b: No validation is necessary; as soon as the transaction
> that publishes the snapshot loses that snapshot, it will also revoke
> the snapshot information (either by removing a temp file or deleting
> it from shared memory).
> Disadvantage of b: It doesn't allow a snapshot to be installed on a
> different server. It requires a serializable open transaction to hold
> the snapshot.

Why does it require a serializable transaction? You could simply
register the snapshot in any transaction. (Of course, the net effect
would be pretty similar to a serializable transaction).

> We return snapshot information as a chunk of data to the client. At
> the same time however, we set a checksum in shared memory to protect
> against modification of the snapshot. A publishing backend can revoke
> its snapshot by deleting the checksum and a backend that is asked to
> install a snapshot can verify that the snapshot is correct and current
> by calculating the checksum and comparing it with the one in shared
> memory.
>
> This only costs us a few bytes (checksum size * max_connections) in
> shared memory, and apart from resetting the checksum it does not have
> cleanup and verification issues.

So one registered snapshot per transaction? Sounds like a reasonable
limitation (I doubt there's a use case for more than that, anyway).

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Florian Pflug <fgp(at)phlo(dot)org>
To: Joachim Wieland <joe(at)mcknight(dot)de>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Snapshot synchronization, again...
Date: 2010-12-30 14:49:09
Message-ID: 8FD1001B-D1E7-463D-A542-865FEE23B249@phlo.org
Lists: pgsql-hackers

On Dec 30, 2010, at 13:31, Joachim Wieland wrote:
> We return snapshot information as a chunk of data to the client. At
> the same time however, we set a checksum in shared memory to protect
> against modification of the snapshot. A publishing backend can revoke
> its snapshot by deleting the checksum and a backend that is asked to
> install a snapshot can verify that the snapshot is correct and current
> by calculating the checksum and comparing it with the one in shared
> memory.

We'd still have to stream these checksums to the standbys though,
or would they be exempt from the checksum checks?

I still wonder whether these checks are worth the complexity. I
believe we'd only allow snapshot modifications for read-only queries
anyway, so what point is there in preventing clients from setting
broken snapshots?

best regards,
Florian Pflug


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Florian Pflug <fgp(at)phlo(dot)org>
Cc: Joachim Wieland <joe(at)mcknight(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Snapshot synchronization, again...
Date: 2010-12-30 21:45:20
Message-ID: 4D1CFD70.4010603@enterprisedb.com
Lists: pgsql-hackers

On 30.12.2010 16:49, Florian Pflug wrote:
> On Dec 30, 2010, at 13:31, Joachim Wieland wrote:
>> We return snapshot information as a chunk of data to the client. At
>> the same time however, we set a checksum in shared memory to protect
>> against modification of the snapshot. A publishing backend can revoke
>> its snapshot by deleting the checksum and a backend that is asked to
>> install a snapshot can verify that the snapshot is correct and current
>> by calculating the checksum and comparing it with the one in shared
>> memory.
>
> We'd still have to stream these checksums to the standbys though,
> or would they be exempt from the checksum checks?
>
> I still wonder whether these checks are worth the complexity. I
> believe we'd only allow snapshot modifications for read-only queries
> anyway, so what point is there in preventing clients from setting
> broken snapshots?

Hmm, our definition of "read-only" is a bit fuzzy. While a transaction
doesn't modify the database itself, it could still send NOTIFYs or call
a PL function to do all sorts of things outside the database. Imagine
that you're paranoid about data integrity, and have a security definer
function that runs cross checks on the data. If it finds any
anomalies, it wakes up the operator or forces shutdown or similar.

Now a malicious user could set a snapshot that passes the basic validity
checks, i.e. xmin >= GlobalXmin, but contains a combination of still
in-progress transactions that never existed in reality. If he then calls the
paranoia-function, it would see an inconsistent state of committed
tuples and get upset.

Maybe that's a bit far-fetched, but it's not entirely clear that
running with an inconsistent snapshot is harmless.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Joachim Wieland <joe(at)mcknight(dot)de>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Snapshot synchronization, again...
Date: 2010-12-31 03:15:57
Message-ID: AANLkTimiyDbGFMh2A4WOk=mH9F4nCr9eQ7XS=ko2msAX@mail.gmail.com
Lists: pgsql-hackers

On Thu, Dec 30, 2010 at 9:40 AM, Alvaro Herrera
<alvherre(at)commandprompt(dot)com> wrote:
>> Disadvantage of b: It doesn't allow a snapshot to be installed on a
>> different server. It requires a serializable open transaction to hold
>> the snapshot.
>
> Why does it require a serializable transaction?  You could simply
> register the snapshot in any transaction.  (Of course, the net effect
> would be pretty similar to a serializable transaction).

I am not assuming that the publishing transaction blocks until its
snapshot is picked up. A read committed transaction would get a
new snapshot for every subsequent query, so the published snapshot is no
longer represented by an actual backend until it is picked up by
one. Since nobody is holding off xmin/GlobalXmin, eventually vacuum
would remove tuples that the published-but-not-yet-picked-up snapshot
should still be able to see, no?

Joachim


From: Joachim Wieland <joe(at)mcknight(dot)de>
To: Florian Pflug <fgp(at)phlo(dot)org>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Snapshot synchronization, again...
Date: 2010-12-31 03:26:52
Message-ID: AANLkTimn+Coy4wUg73YPJqGxG-Sw6+HMqMSfLi6Tin0Q@mail.gmail.com
Lists: pgsql-hackers

On Thu, Dec 30, 2010 at 9:49 AM, Florian Pflug <fgp(at)phlo(dot)org> wrote:
> On Dec 30, 2010, at 13:31, Joachim Wieland wrote:
>> We return snapshot information as a chunk of data to the client. At
>> the same time however, we set a checksum in shared memory to protect
>> against modification of the snapshot. A publishing backend can revoke
>> its snapshot by deleting the checksum and a backend that is asked to
>> install a snapshot can verify that the snapshot is correct and current
>> by calculating the checksum and comparing it with the one in shared
>> memory.
>
> We'd still have to stream these checksums to the standbys though,
> or would they be exempt from the checksum checks?

I am not talking about having synchronized snapshots among standby
servers at all.

I am only proposing a client API that will work for this future idea as well.

> I still wonder whether these checks are worth the complexity. I
> believe we'd only allow snapshot modifications for read-only queries
> anyway, so what point is there in preventing clients from setting
> broken snapshots?

What's the use case for it? As long as nobody comes up with a
reasonable use case for it, let's aim for the robust version.

Joachim


From: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Florian Pflug <fgp(at)phlo(dot)org>, Joachim Wieland <joe(at)mcknight(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Snapshot synchronization, again...
Date: 2010-12-31 09:42:14
Message-ID: 4D1DA576.1030504@kaltenbrunner.cc
Lists: pgsql-hackers

On 12/30/2010 10:45 PM, Heikki Linnakangas wrote:
> On 30.12.2010 16:49, Florian Pflug wrote:
>> On Dec 30, 2010, at 13:31, Joachim Wieland wrote:
>>> We return snapshot information as a chunk of data to the client. At
>>> the same time however, we set a checksum in shared memory to protect
>>> against modification of the snapshot. A publishing backend can revoke
>>> its snapshot by deleting the checksum and a backend that is asked to
>>> install a snapshot can verify that the snapshot is correct and current
>>> by calculating the checksum and comparing it with the one in shared
>>> memory.
>>
>> We'd still have to stream these checksums to the standbys though,
>> or would they be exempt from the checksum checks?
>>
>> I still wonder whether these checks are worth the complexity. I
>> believe we'd only allow snapshot modifications for read-only queries
>> anyway, so what point is there in preventing clients from setting
>> broken snapshots?
>
> Hmm, our definition of "read-only" is a bit fuzzy. While a transaction
> doesn't modify the database itself, it could still send NOTIFYs or call
> a PL function to do all sorts of things outside the database. Imagine
> that you're paranoid about data integrity, and have a security definer
> function that runs cross checks on the data. If it finds any
> anomalies, it wakes up the operator or forces shutdown or similar.

are people actually doing that in reality? I'm also having a hard time
picturing a realistic example of what that "data integrity check"
function would actually be able to check with the default isolation mode
and concurrent activity...

>
> Now a malicious user could set a snapshot that passes the basic validity
> checks, i.e. xmin >= GlobalXmin, but contains a combination of still
> in-progress transactions that never existed in reality. If he then calls the
> paranoia-function, it would see an inconsistent state of committed
> tuples and get upset.

sure, but I would expect setting a snapshot to require either
superuser or some sort of "WITH SNAPSHOT" grant thingy - and in general
there are much more trivial and less obscure ways in which a "normal"
user can cause the admin to get paged :)

>
> Maybe that's a bit far-fetched, but it's not entirely clear that
> running with an inconsistent snapshot is harmless.

well, there has been some discussion in the context of the SSI stuff that
we might want to reconsider our definition of "read-only"; maybe that would
be the right way to approach the problem?

Stefan


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Joachim Wieland <joe(at)mcknight(dot)de>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Snapshot synchronization, again...
Date: 2010-12-31 13:28:36
Message-ID: 1293801726-sup-5463@alvh.no-ip.org
Lists: pgsql-hackers

Excerpts from Joachim Wieland's message of Fri Dec 31 00:15:57 -0300 2010:
> On Thu, Dec 30, 2010 at 9:40 AM, Alvaro Herrera
> <alvherre(at)commandprompt(dot)com> wrote:
> >> Disadvantage of b: It doesn't allow a snapshot to be installed on a
> >> different server. It requires a serializable open transaction to hold
> >> the snapshot.
> >
> > Why does it require a serializable transaction?  You could simply
> > register the snapshot in any transaction.  (Of course, the net effect
> > would be pretty similar to a serializable transaction).
>
> I am not assuming that the publishing transaction blocks until its
> snapshot is picked up. A read committed transaction would get a
> new snapshot for every subsequent query, so the published snapshot is no
> longer represented by an actual backend until it is picked up by
> one. Since nobody is holding off xmin/GlobalXmin, eventually vacuum
> would remove tuples that the published-but-not-yet-picked-up snapshot
> should still be able to see, no?

A backend can have any number of snapshots registered, and those don't
allow GlobalXmin to advance. Consider an open cursor, for example.
Even if the rest of the transaction is read committed, the snapshot
registered by the open cursor still holds back GlobalXmin. My
(handwavy) idea is that whenever the transaction calls
pg_publish_snapshot(), said snapshot is registered, which makes it safe
to use even if the transaction continues to operate and obtain newer
snapshots.
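
For example (a sketch, assuming some table t):

BEGIN;                                 -- plain READ COMMITTED
DECLARE c CURSOR FOR SELECT * FROM t;  -- the cursor's snapshot gets registered
SELECT count(*) FROM t;                -- takes a fresh snapshot, as usual
FETCH 1 FROM c;                        -- but the registered snapshot is still
                                       -- there, holding back GlobalXmin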

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Joachim Wieland <joe(at)mcknight(dot)de>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Snapshot synchronization, again...
Date: 2010-12-31 14:38:51
Message-ID: AANLkTiktRBU5M6=Zgfmodn=ZX5VkZm=K2+_TdLxhYkC0@mail.gmail.com
Lists: pgsql-hackers

On Fri, Dec 31, 2010 at 8:28 AM, Alvaro Herrera
<alvherre(at)commandprompt(dot)com> wrote:
> A backend can have any number of snapshots registered, and those don't
> allow GlobalXmin to advance. Consider an open cursor, for example.
> Even if the rest of the transaction is read committed, the snapshot
> registered by the open cursor still holds back GlobalXmin. My
> (handwavy) idea is that whenever the transaction calls
> pg_publish_snapshot(), said snapshot is registered, which makes it safe
> to use even if the transaction continues to operate and obtain newer
> snapshots.

Cool, even better that this is taken care of already :-)

Thanks,
Joachim


From: Joachim Wieland <joe(at)mcknight(dot)de>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Snapshot synchronization, again...
Date: 2011-01-07 11:41:38
Message-ID: AANLkTi=3ApbZ2CyDCf4OruEtCnxZ8tiKEmsUQapr3oC3@mail.gmail.com
Lists: pgsql-hackers

On Thu, Dec 30, 2010 at 7:31 AM, Joachim Wieland <joe(at)mcknight(dot)de> wrote:
> What I am proposing now is the following:
>
> We return snapshot information as a chunk of data to the client. At
> the same time however, we set a checksum in shared memory to protect
> against modification of the snapshot. A publishing backend can revoke
> its snapshot by deleting the checksum and a backend that is asked to
> install a snapshot can verify that the snapshot is correct and current
> by calculating the checksum and comparing it with the one in shared
> memory.

So here's the patch implementing this idea.

I named the user visible functions pg_export_snapshot() and
pg_import_snapshot().

A user starts a transaction and calls pg_export_snapshot() to get a
chunk of bytea data. In a different connection he opens a transaction in
isolation level serializable and passes that chunk of data into
pg_import_snapshot(). Subsequent queries of the second connection then see the
snapshot that was current when pg_export_snapshot() was executed. If
both transactions are in isolation level serializable, then both see the same
data from this moment on (this is the case of pg_dump).
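
As a rough sketch, the intended flow looks like this (the snapshot data
shown is only a placeholder; the internal format may still change):

-- connection 1
BEGIN;
SELECT pg_export_snapshot();  -- returns a chunk of bytea data

-- connection 2
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
SELECT pg_import_snapshot('<bytea data from connection 1>');
-- subsequent queries here see the snapshot exported by connection 1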

There are most probably a few loose ends, and someone more familiar with
this area than I am needs to look into it, but at least everybody should
be happy now with the overall approach.

These are the implementation details and restrictions of the patch:

The exporting transaction

- should be read-only (to discourage people from expecting that writes of
the exporting transaction can be seen by the importing transaction)
- must not be a subtransaction (we don't add subxips of our own transaction
to the snapshot, so importing the snapshot later would result in missing
subxips)
- adds its own xid (if any) to the xip-array

The importing transaction

- will not import a snapshot of the same backend (even though it would
probably work)
- will not import a snapshot of a different database in the cluster
- should be isolation level serializable
- must not be a subtransaction (can we completely rollback on
subxact-rollback?)
- leaves curcid as is, otherwise effects of previous commands would get lost
and magically appear later when curcid increases
- applies xmin, xmax, xip, subxip values of the exported snapshot to
GetActiveSnapshot() and GetTransactionSnapshot()
- takes itself out of the xip array
- updates MyProc->xmin, but only moves it backwards, never forwards

The snapshot is invalidated on transaction commit/rollback as well as when doing
"prepare transaction".

Comments welcome.

Joachim

Attachment Content-Type Size
pg_import_export_snapshot.diff text/x-patch 13.3 KB

From: Florian Pflug <fgp(at)phlo(dot)org>
To: Joachim Wieland <joe(at)mcknight(dot)de>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Snapshot synchronization, again...
Date: 2011-01-07 13:47:55
Message-ID: A5621BD4-B29C-44B2-9D8F-EFA66D85FA43@phlo.org
Lists: pgsql-hackers

On Jan 7, 2011, at 12:41, Joachim Wieland wrote:
> On Thu, Dec 30, 2010 at 7:31 AM, Joachim Wieland <joe(at)mcknight(dot)de> wrote:
> These are the implementation details and restrictions of the patch:
>
> The exporting transaction
>
> - should be read-only (to discourage people from expecting that writes of
> the exporting transaction can be seen by the importing transaction)

That seems overly strict. Wouldn't a warning about this in the docs suffice?

> - must not be a subtransaction (we don't add subxips of our own transaction
> to the snapshot, so importing the snapshot later would result in missing
> subxips)

If that is necessary, wouldn't a sub-transaction that was sub-committed before
exporting the snapshot also pose a problem? The snapshot wouldn't include
the sub-transaction's xid, so once the exporting transaction has committed any
importing transaction would suddenly see the sub-transaction's changes.

> - adds its own xid (if any) to the xip-array

Per the previous paragraph, I believe it should also add the xid's of all
non-aborted sub-transactions to the subxip-array, overflowing the array if
necessary.

> The importing transaction
>
> - will not import a snapshot of the same backend (even though it would
> probably work)
> - will not import a snapshot of a different database in the cluster
> - should be isolation level serializable
> - must not be a subtransaction (can we completely rollback on
> subxact-rollback?)

Sounds fine so far.

> - leaves curcid as is, otherwise effects of previous commands would get lost
> and magically appear later when curcid increases
> - applies xmin, xmax, xip, subxip values of the exported snapshot to
> GetActiveSnapshot() and GetTransactionSnapshot()
> - takes itself out of the xip array
> - updates MyProc->xmin, but only moves it backwards, never forwards

Changing a snapshot mid-transaction like that seems risky. I'd suggest simply
decreeing that an importing transaction must be read-only and must not yet
have set its snapshot. So effectively, the only allowed calling sequence would be

begin;
set transaction isolation level serializable read only;
select pg_import_snapshot(...);

best regards,
Florian Pflug


From: Noah Misch <noah(at)leadboat(dot)com>
To: Joachim Wieland <joe(at)mcknight(dot)de>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Snapshot synchronization, again...
Date: 2011-01-15 03:13:05
Message-ID: 20110115031305.GA9509@tornado.gateway.2wire.net
Lists: pgsql-hackers

Hello Joachim,

I'm reviewing this patch for CommitFest 2011-01.

On Fri, Jan 07, 2011 at 06:41:38AM -0500, Joachim Wieland wrote:
> On Thu, Dec 30, 2010 at 7:31 AM, Joachim Wieland <joe(at)mcknight(dot)de> wrote:
> > What I am proposing now is the following:
> >
> > We return snapshot information as a chunk of data to the client. At
> > the same time however, we set a checksum in shared memory to protect
> > against modification of the snapshot. A publishing backend can revoke
> > its snapshot by deleting the checksum and a backend that is asked to
> > install a snapshot can verify that the snapshot is correct and current
> > by calculating the checksum and comparing it with the one in shared
> > memory.

I like this design. It makes the backend's use of resources to track exported
snapshots very simple. It also avoids pessimizing the chances that we can, at
some point, synchronize snapshots across various standby clusters without
changing the user-facing API.

>
> So here's the patch implementing this idea.
>
> I named the user visible functions pg_export_snapshot() and
> pg_import_snapshot().
>
> A user starts a transaction and calls pg_export_snapshot() to get a
> chunk of bytea data. In a different connection he opens a transaction in
> isolation level serializable and passes that chunk of data into
> pg_import_snapshot(). Subsequent queries of the second connection then see the
> snapshot that was current when pg_export_snapshot() was executed. If
> both transactions are in isolation level serializable, then both see the same
> data from this moment on (this is the case of pg_dump).
>
> There are most probably a few loose ends, and someone more familiar with
> this area than I am needs to look into it, but at least everybody should
> be happy now with the overall approach.
>
> These are the implementation details and restrictions of the patch:
>
> The exporting transaction
>
> - should be read-only (to discourage people from expecting that writes of
> the exporting transaction can be seen by the importing transaction)
> - must not be a subtransaction (we don't add subxips of our own transaction
> to the snapshot, so importing the snapshot later would result in missing
> subxips)
> - adds its own xid (if any) to the xip-array
>
> The importing transaction
>
> - will not import a snapshot of the same backend (even though it would
> probably work)
> - will not import a snapshot of a different database in the cluster
> - should be isolation level serializable
> - must not be a subtransaction (can we completely rollback on
> subxact-rollback?)

I don't know, but this restriction does not seem onerous.

> - leaves curcid as is, otherwise effects of previous commands would get lost
> and magically appear later when curcid increases
> - applies xmin, xmax, xip, subxip values of the exported snapshot to
> GetActiveSnapshot() and GetTransactionSnapshot()
> - takes itself out of the xip array
> - updates MyProc->xmin, but only moves it backwards, never forwards
>
> The snapshot is invalidated on transaction commit/rollback as well as when doing
> "prepare transaction".
>
> Comments welcome.

General points:

Basics: patch has the correct format and applies cleanly (offset in pg_proc.h).
It includes no tests, but perhaps it should not. Really testing this requires
concurrency, and our test framework is not set up for that. We could add some
cheap tests to ensure the functions fail cleanly in some basic situations, but
that would be about it. The patch does not, but should, document the new
functions. The new functions also need comments preceding their definitions,
similar to those found on all other functions in procarray.c.

You add four elog() calls, all of which should be ereport() to facilitate
translation. Also consider suitable error codes. The error messages do not
follow style; in particular, they begin with capitals. See the style guide,
http://www.postgresql.org/docs/current/interactive/error-style-guide.html.

ImportSnapshot() has no paths that justify returning false. It should always
return true or throw a suitable error.

See below for various specific comments.

> diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
> index 95beba8..c24150f 100644
> *** a/src/backend/storage/ipc/ipci.c
> --- b/src/backend/storage/ipc/ipci.c
> *************** CreateSharedMemoryAndSemaphores(bool mak
> *** 124,129 ****
> --- 124,130 ----
> size = add_size(size, BTreeShmemSize());
> size = add_size(size, SyncScanShmemSize());
> size = add_size(size, AsyncShmemSize());
> + size = add_size(size, SyncSnapshotShmemSize());
> #ifdef EXEC_BACKEND
> size = add_size(size, ShmemBackendArraySize());
> #endif
> *************** CreateSharedMemoryAndSemaphores(bool mak
> *** 228,233 ****
> --- 229,235 ----
> BTreeShmemInit();
> SyncScanShmemInit();
> AsyncShmemInit();
> + SyncSnapshotInit();
>
> #ifdef EXEC_BACKEND
>
> diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
> index 980996e..e851bcd 100644
> *** a/src/backend/storage/ipc/procarray.c
> --- b/src/backend/storage/ipc/procarray.c
> ***************
> *** 50,60 ****
> --- 50,62 ----
> #include "access/transam.h"
> #include "access/xact.h"
> #include "access/twophase.h"
> + #include "libpq/md5.h"
> #include "miscadmin.h"
> #include "storage/procarray.h"
> #include "storage/spin.h"
> #include "storage/standby.h"
> #include "utils/builtins.h"
> + #include "utils/bytea.h"
> #include "utils/snapmgr.h"
>
>
> *************** static int KnownAssignedXidsGetAndSetXmi
> *** 159,164 ****
> --- 161,170 ----
> static TransactionId KnownAssignedXidsGetOldestXmin(void);
> static void KnownAssignedXidsDisplay(int trace_level);
>
> + typedef char snapshotChksum[16];
> + static snapshotChksum *syncSnapshotChksums;
> + static Snapshot exportedSnapshot;
> +
> /*
> * Report shared-memory space needed by CreateSharedProcArray.
> */
> *************** KnownAssignedXidsDisplay(int trace_level
> *** 3065,3067 ****
> --- 3071,3335 ----
>
> pfree(buf.data);
> }
> +
> +
> + /*
> + * Report space needed for our shared memory area, which is basically an
> + * md5 checksum per connection.
> + */
> + Size
> + SyncSnapshotShmemSize(void)
> + {
> + return PROCARRAY_MAXPROCS * sizeof(snapshotChksum);
> + }
> +
> + void
> + SyncSnapshotInit(void)
> + {
> + Size size;
> + bool found;
> + int i;
> +
> + size = SyncSnapshotShmemSize();
> +
> + syncSnapshotChksums = (snapshotChksum*) ShmemInitStruct("SyncSnapshotChksums",
> + size, &found);
> + if (!found)
> + for (i = 0; i < PROCARRAY_MAXPROCS; i++)
> + memset(syncSnapshotChksums[i], 0, sizeof(snapshotChksum));
> + }

No need for a loop:

memset(syncSnapshotChksums, 0, PROCARRAY_MAXPROCS * sizeof(snapshotChksum));

"00000000000000000000" is a valid md5 message digest. To avoid always accepting
a snapshot with that digest, we would need to separately store a flag indicating
whether a given syncSnapshotChksums member is currently valid. Maybe that hole
is acceptable, though.

> +
> + void
> + InvalidateExportedSnapshot(void)
> + {
> + if (syncSnapshotChksums[MyBackendId][0] == '\0')
> + return;

The first byte of a valid message digest will regularly be zero.

> +
> + memset(syncSnapshotChksums[MyBackendId], 0, sizeof(snapshotChksum));
> +
> + UnregisterSnapshotFromOwner(exportedSnapshot, TopTransactionResourceOwner);
> + exportedSnapshot = NULL;
> + }
> +
> + static void
> + snapshot_appendf(char ** buffer, int *bufsize_filled, int *bufsize_left, char *fmt, ...)
> + {
> + va_list ap;
> + int ret;
> +
> + if (*bufsize_left < 64)
> + {
> + /* enlarge buffer by 1024 bytes */
> + *buffer = repalloc(*buffer, *bufsize_filled + 1024);
> + *bufsize_left = *bufsize_filled + 1024;
> + }
> +
> + va_start(ap, fmt);
> + ret = vsnprintf(*buffer + *bufsize_filled, *bufsize_left, fmt, ap);
> + va_end(ap);
> +
> + /* There shouldn't be any error, we leave enough room for the data that we
> + * are writing (only numbers basically). */
> + Assert(ret < *bufsize_left && ret > 0);
> +
> + *bufsize_left -= ret;
> + *bufsize_filled += ret;
> +
> + Assert(strlen(*buffer) == *bufsize_filled);
> + }
> +
> + bytea *
> + ExportSnapshot(Snapshot snapshot)
> + {
> + int bufsize = 1024;
> + int bufsize_filled = 0; /* doesn't include NUL byte */
> + int bufsize_left = bufsize - bufsize_filled;
> + char *buf = (char *) palloc(bufsize);
> + snapshotChksum cksum;
> + int i;
> +
> + /* In a subtransaction we don't see our subxip values in the snapshot so
> + * they would be missing in the backend applying it. */
> + if (GetCurrentTransactionNestLevel() != 1)
> + elog(ERROR, "Can only export a snapshot from a top level transaction.");
> +
> + /* Write up all the data that we return */
> + snapshot_appendf(&buf, &bufsize_filled, &bufsize_left,
> + "did:%d ", MyDatabaseId);
> + snapshot_appendf(&buf, &bufsize_filled, &bufsize_left,
> + "bid:%d ", MyBackendId);
> + snapshot_appendf(&buf, &bufsize_filled, &bufsize_left,
> + "xmi:%d ", snapshot->xmin);
> + snapshot_appendf(&buf, &bufsize_filled, &bufsize_left,
> + "xma:%d ", snapshot->xmax);
> + for (i = 0; i < snapshot->xcnt; i++)
> + snapshot_appendf(&buf, &bufsize_filled, &bufsize_left,
> + "xip:%d ", snapshot->xip[i]);
> + /*
> + * Finally add our own XID if we have one, since by definition we will
> + * still be running when the other transaction takes over the snapshot.
> + */
> + if (TransactionIdIsValid(GetTopTransactionIdIfAny()))
> + snapshot_appendf(&buf, &bufsize_filled, &bufsize_left,
> + "xip:%d ", GetTopTransactionIdIfAny());
> + if (snapshot->suboverflowed)
> + snapshot_appendf(&buf, &bufsize_filled, &bufsize_left,
> + "sof:1 ");
> + else
> + for (i = 0; i < snapshot->subxcnt; i++)
> + snapshot_appendf(&buf, &bufsize_filled, &bufsize_left,
> + "sxp:%d ", snapshot->subxip[i]);
> +
> + /* Register the snapshot */
> + snapshot = RegisterSnapshotOnOwner(snapshot, TopTransactionResourceOwner);
> + exportedSnapshot = snapshot;
> +
> + pg_md5_binary(buf, bufsize_filled, (void *) &cksum);
> +
> + memcpy(syncSnapshotChksums[MyBackendId], &cksum, sizeof(snapshotChksum));

No need for an intermediate variable:

pg_md5_binary(buf, bufsize_filled, &syncSnapshotChksums[MyBackendId]);

> +
> + if (!XactReadOnly)
> + elog(WARNING, "A snapshot exporting function should be readonly.");

There are legitimate use cases for copying the snapshot of a read-write
transaction. Consider a task like calculating some summaries for insert into a
table. To speed this up, you might notionally partition the source data and
have each of several workers calculate each partition. I'd vote for removing
this warning entirely.

If we do keep the warning, it's the transaction that should be read-only, not a
"function". My first reaction to the text was "why not?". If the documentation
for pg_export_snapshot explains, that might be enough. Primary wording like
"importing transactions will not see changes from this transaction" with wording
like yours in a HINT might help.

> +
> + return DatumGetByteaP(DirectFunctionCall1(byteain, CStringGetDatum(buf)));
> + }
> +
> + static Oid
> + parseOidFromText(char **s, const char *prefix)
> + {
> + char *n, *p = strstr(*s, prefix);
> + Oid oid;
> +
> + if (!p)
> + return InvalidOid;
> + p += strlen(prefix);
> + n = strchr(p, ' ');
> + if (!n)
> + return InvalidOid;
> + *n = '\0';
> + oid = DatumGetObjectId(DirectFunctionCall1(oidin, CStringGetDatum(p)));
> + *s = n + 1;
> + return oid;
> + }
> +
> + static bool
> + parseBoolFromText(char **s, const char *prefix)
> + {
> + /* It's safe to overload parseOid to parse 0/1. This returns false if the
> + * entry could not be found, at least as long as InvalidOid is defined to be 0. */
> + Assert(InvalidOid == (Oid) 0);
> + return (bool) parseOidFromText(s, prefix);
> + }
> +
> + static TransactionId
> + parseXactFromText(char **s, const char *prefix)
> + {
> + char *n, *p = strstr(*s, prefix);
> + TransactionId xid;
> +
> + if (!p)
> + return InvalidTransactionId;
> + p += strlen(prefix);
> + n = strchr(p, ' ');
> + if (!n)
> + return InvalidTransactionId;
> + *n = '\0';
> + xid = DatumGetTransactionId(DirectFunctionCall1(xidin, CStringGetDatum(p)));
> + *s = n + 1;
> + return xid;
> + }
> +
> + bool
> + ImportSnapshot(bytea *snapshotData, Snapshot snapshot)
> + {
> + snapshotChksum cksum;
> + Oid databaseId;
> + Oid backendId;
> + int i;
> + TransactionId xid;
> + int len = VARSIZE_ANY_EXHDR(snapshotData);
> + char *s = palloc(len + 1);
> +
> + if (GetCurrentTransactionNestLevel() != 1)
> + elog(ERROR, "Can only import a snapshot to a top level transaction.");
> +
> + Assert(len > 0);

This is user-facing; "SELECT pg_import_snapshot('')" must not trigger an
assertion failure.

> + strncpy(s, VARDATA_ANY(snapshotData), len);
> + s[len] = '\0';
> +
> + pg_md5_binary(s, strlen(s), (void *) &cksum);

You know strlen(s) == len; no need to recalculate.

> +
> + if ((databaseId = parseOidFromText(&s, "did:")) == InvalidOid)
> + return false;
> + if (databaseId != MyDatabaseId)
> + return false;
> + if ((backendId = parseOidFromText(&s, "bid:")) == InvalidOid)
> + return false;
> + if (backendId == MyBackendId)
> + return false;

InvalidBackendId is -1. Is InvalidOid (0) in fact also invalid as a backend ID?
My backend IDs seem to start at 1 in practice. However, by inserting a
too-large bid, an attacker could have us overrun syncSnapshotChksums. For that,
we need to test 0 <= bid < PROCARRAY_MAXPROCS. Example crasher:
select pg_import_snapshot('did:16384 bid:4000000000 xmi:795 xma:795');

I'm thinking the order of validation should be:
1. Extract bid.
2. Validate bid. Throw error on out-of-bounds.
3. Confirm checksum. Throw error on mismatch.
4. Extract did. Assert databaseId != InvalidOid.
5. Error if databaseId != MyDatabaseid.

> +
> + /*
> + * Lock considerations:
> + *
> + * syncSnapshotChksums[backendId] is only changed by the backend with ID
> + * backendID and read by another backend that is asked to import a
> + * snapshot.
> + * syncSnapshotChksums[backendId] contains either NUL bytes or the checksum
> + * of a valid snapshot.
> + * Every comparision to the checksum while it is being written or
> + * deleted is okay to fail, so we don't need to take a lock at all.
> + */
> +
> + if (memcmp(&cksum, syncSnapshotChksums[backendId], sizeof(snapshotChksum)))
> + return false;
> +
> + snapshot->xmin = parseXactFromText(&s, "xmi:");
> + Assert(snapshot->xmin != InvalidTransactionId);
> + snapshot->xmax = parseXactFromText(&s, "xma:");
> + Assert(snapshot->xmax != InvalidTransactionId);
> +
> + i = 0;
> + while ((xid = parseXactFromText(&s, "xip:")) != InvalidTransactionId)
> + {
> + if (xid == GetTopTransactionIdIfAny())
> + continue;
> + snapshot->xip[i++] = xid;

This works on a virgin GetSnapshotData() snapshot, which always has enough space
(sized according to max_connections). A snapshot from CopySnapshot(), however,
has just enough space for the original usage. I get a SIGSEGV at this line with
the attached test script. If I change things around so xip starts non-NULL, I
get messages like these as the code scribbles past the end of the allocation:

WARNING: problem in alloc set TopTransactionContext: detected write past chunk end in block 0xc92f10, chunk 0xc93280
WARNING: problem in alloc set TopTransactionContext: detected write past chunk end in block 0xc92f10, chunk 0xc93350

The last CopySnapshot before the crash had this stack trace:

#0 CopySnapshot (snapshot=0xc27980) at snapmgr.c:203
#1 0x00000000008109e5 in PushActiveSnapshot (snap=0xc27980) at snapmgr.c:283
#2 0x00000000006e1c13 in PortalStart (portal=0xcaf7b8, params=0x0, snapshot=0x0) at pquery.c:508
#3 0x00000000006dc411 in exec_simple_query (query_string=0xd3ee38 "SELECT pg_import_snapshot('did:16384 bid:2 xmi:756 xma:756 xip:756 ')") at postgres.c:1020
#4 0x00000000006e0633 in PostgresMain (argc=2, argv=0xcb9760, username=0xc90658 "nm") at postgres.c:3939
#5 0x0000000000695520 in BackendRun (port=0xcbb7d0) at postmaster.c:3578
#6 0x0000000000694bec in BackendStartup (port=0xcbb7d0) at postmaster.c:3263
#7 0x0000000000691f85 in ServerLoop () at postmaster.c:1449
#8 0x000000000069174f in PostmasterMain (argc=1, argv=0xc8af30) at postmaster.c:1110
#9 0x000000000060e4e2 in main (argc=1, argv=0xc8af30) at main.c:199

Do you have a recipe for getting ImportSnapshot to see a non-CopySnapshot
snapshot that I could use to continue testing? All the things I tried ended
this way.

> + }
> + snapshot->xcnt = i;
> +
> + /*
> + * We only write "sof:1" if the snapshot overflowed. If not, then there is
> + * no "sof:x" entry at all and parseBoolFromText() will just return false.
> + */
> + snapshot->suboverflowed = parseBoolFromText(&s, "sof:");
> +
> + i = 0;
> + while ((xid = parseXactFromText(&s, "sxp:")) != InvalidTransactionId)
> + snapshot->subxip[i++] = xid;
> + snapshot->subxcnt = i;
> + /* Leave snapshot->curcid as is. */
> +
> + /*
> + * No ProcArrayLock held here, we assume that a write is atomic. Also note
> + * that MyProc->xmin can go backwards here. However this is safe because
> + * the xmin we set here is the same as in the backend's proc->xmin whose
> + * snapshot we are copying. At this very moment, anybody computing a
> + * minimum will calculate at least this xmin as the overall xmin with or
> + * without us setting MyProc->xmin to this value.

Though unlikely, the other backend may not even exist by now. Indeed, that
could have happened and RecentGlobalXmin advanced since we compared the
checksum. Not sure what the right locking is here, but something is needed.

> + * Instead we must check to not go forward (we might have opened a cursor
> + * in this transaction and still have its snapshot registered)
> + */

I'm thinking this case should yield an ERROR. Otherwise, our net result would
be to silently adopt a snapshot that is neither our old self nor the target.
Maybe there's a use case for that, but none comes to my mind.

> + if (TransactionIdPrecedes(snapshot->xmin, MyProc->xmin))
> + MyProc->xmin = snapshot->xmin;
> +
> + /*
> + * If we are in read committed mode then the next query will execute with a
> + * new snapshot making this function call quite useless.
> + */
> + if (!IsolationUsesXactSnapshot())
> + elog(WARNING, "A snapshot importing transaction should be ISOLATION LEVEL SERIALIZABLE");

The message should convey that the test (correctly) allows either REPEATABLE
READ or SERIALIZABLE.

I guess the use case for pg_import_snapshot from READ COMMITTED would be
something like "DO $$BEGIN PERFORM pg_import_snapshot('...'); stuff; END$$;".
First off, would that work (would "stuff" use the imported snapshot)? If it does
work, we should take the precedent of SET LOCAL and issue no warning. If it
doesn't work, then we should document that the snapshot only takes effect until
the next command, and probably keep this warning or something similar.

This warning can appear twice due to the two calls to ImportSnapshot() in
pg_import_snapshot(); that's a bit ugly.

> +
> + return true;
> + }
> +
> diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
> index 273d8bd..75315bb 100644
> *** a/src/backend/utils/time/snapmgr.c
> --- b/src/backend/utils/time/snapmgr.c
> ***************
> *** 29,34 ****
> --- 29,35 ----
> #include "access/xact.h"
> #include "storage/proc.h"
> #include "storage/procarray.h"
> + #include "utils/builtins.h"
> #include "utils/memutils.h"
> #include "utils/memutils.h"
> #include "utils/resowner.h"
> *************** AtEarlyCommit_Snapshot(void)
> *** 523,528 ****
> --- 524,530 ----
> TopTransactionResourceOwner);
> registered_xact_snapshot = false;
>
> + InvalidateExportedSnapshot();
> }
>
> /*
> *************** AtEOXact_Snapshot(bool isCommit)
> *** 559,561 ****
> --- 561,585 ----
> FirstSnapshotSet = false;
> registered_xact_snapshot = false;
> }
> +
> + Datum
> + pg_export_snapshot(PG_FUNCTION_ARGS)
> + {
> + bytea *snapshotData = ExportSnapshot(GetTransactionSnapshot());
> + PG_RETURN_BYTEA_P(snapshotData);
> + }
> +
> + Datum
> + pg_import_snapshot(PG_FUNCTION_ARGS)
> + {
> + bytea *snapshotData = PG_GETARG_BYTEA_P(0);
> + bool ret = true;
> +
> + if (ActiveSnapshotSet())
> + ret = ImportSnapshot(snapshotData, GetActiveSnapshot());
> +
> + ret &= ImportSnapshot(snapshotData, GetTransactionSnapshot());

Is it valid to scribble directly on snapshots like this?

> +
> + PG_RETURN_BOOL(ret);
> + }
> +
> diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
> index c624243..fa06f07 100644
> *** a/src/include/catalog/pg_proc.h
> --- b/src/include/catalog/pg_proc.h
> *************** DATA(insert OID = 2171 ( pg_cancel_backe
> *** 3375,3380 ****
> --- 3375,3384 ----
> DESCR("cancel a server process' current query");
> DATA(insert OID = 2096 ( pg_terminate_backend PGNSP PGUID 12 1 0 0 f f f t f v 1 0 16 "23" _null_ _null_ _null_ _null_ pg_terminate_backend _null_ _null_ _null_ ));
> DESCR("terminate a server process");
> + DATA(insert OID = 3115 ( pg_export_snapshot PGNSP PGUID 12 1 0 0 f f f t f v 0 0 17 "" _null_ _null_ _null_ _null_ pg_export_snapshot _null_ _null_ _null_ ));
> + DESCR("export a snapshot");
> + DATA(insert OID = 3116 ( pg_import_snapshot PGNSP PGUID 12 1 0 0 f f f t f v 1 0 16 "17" _null_ _null_ _null_ _null_ pg_import_snapshot _null_ _null_ _null_ ));
> + DESCR("import a snapshot");
> DATA(insert OID = 2172 ( pg_start_backup PGNSP PGUID 12 1 0 0 f f f t f v 2 0 25 "25 16" _null_ _null_ _null_ _null_ pg_start_backup _null_ _null_ _null_ ));
> DESCR("prepare for taking an online backup");
> DATA(insert OID = 2173 ( pg_stop_backup PGNSP PGUID 12 1 0 0 f f f t f v 0 0 25 "" _null_ _null_ _null_ _null_ pg_stop_backup _null_ _null_ _null_ ));
> diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
> index ea030d6..a1ac8c6 100644
> *** a/src/include/storage/procarray.h
> --- b/src/include/storage/procarray.h
> *************** extern void XidCacheRemoveRunningXids(Tr
> *** 71,74 ****
> --- 71,81 ----
> int nxids, const TransactionId *xids,
> TransactionId latestXid);
>
> + extern Size SyncSnapshotShmemSize(void);
> + extern void SyncSnapshotInit(void);
> +
> + extern bytea *ExportSnapshot(Snapshot snapshot);
> + extern bool ImportSnapshot(bytea *snapshotData, Snapshot snapshot);
> + extern void InvalidateExportedSnapshot(void);
> +
> #endif /* PROCARRAY_H */
> diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
> index f03647b..d0e3f9a 100644
> *** a/src/include/utils/snapmgr.h
> --- b/src/include/utils/snapmgr.h
> *************** extern void AtSubAbort_Snapshot(int leve
> *** 43,46 ****
> --- 43,49 ----
> extern void AtEarlyCommit_Snapshot(void);
> extern void AtEOXact_Snapshot(bool isCommit);
>
> + extern Datum pg_export_snapshot(PG_FUNCTION_ARGS);
> + extern Datum pg_import_snapshot(PG_FUNCTION_ARGS);
> +
> #endif /* SNAPMGR_H */

Attachment Content-Type Size
crash-sync-snap.pl application/x-perl 366 bytes

From: Joachim Wieland <joe(at)mcknight(dot)de>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Snapshot synchronization, again...
Date: 2011-01-19 05:12:39
Message-ID: AANLkTinvmf9yaXQp7BijTkodChUQ5Nb3wvx5riD5BxBM@mail.gmail.com
Lists: pgsql-hackers

Noah, thank you for this excellent review. I have adopted most
(almost all) of your suggestions (except for the documentation, which
is still missing). Please check the updated & attached patch for
details.

On Fri, Jan 14, 2011 at 10:13 PM, Noah Misch <noah(at)leadboat(dot)com> wrote:
> "00000000000000000000" is a valid md5 message digest.  To avoid always accepting
> a snapshot with that digest, we would need to separately store a flag indicating
> whether a given syncSnapshotChksums member is currently valid.  Maybe that hole
> is acceptable, though.

In the end I decided to store the md5 checksums as printable characters in
shmem. We now need a few more bytes for each checksum, but since a valid
checksum contains only printable characters, a NUL byte tells us that it
is not a valid checksum.

>> +
>> +     if (!XactReadOnly)
>> +             elog(WARNING, "A snapshot exporting function should be readonly.");
>
> There are legitimate use cases for copying the snapshot of a read-write
> transaction.  Consider a task like calculating some summaries for insert into a
> table.  To speed this up, you might notionally partition the source data and
> have each of several workers calculate each partition.  I'd vote for removing
> this warning entirely.

Warning removed, adding this fact to the documentation only is fine with me.

> InvalidBackendId is -1.  Is InvalidOid (0) in fact also invalid as a backend ID?

Yes, there is a check in utils/init/postinit.c that makes sure that 0
is never a valid backendId. But anyway, I have now made this more
explicit by adding a separate parse routine for ints.

>> +     while ((xid = parseXactFromText(&s, "xip:")) != InvalidTransactionId)
>> +     {
>> +             if (xid == GetTopTransactionIdIfAny())
>> +                     continue;
>> +             snapshot->xip[i++] = xid;
>
> This works on a virgin GetSnapshotData() snapshot, which always has enough space
> (sized according to max_connections).  A snapshot from CopySnapshot(), however,
> has just enough space for the original usage.  I get a SIGSEGV at this line with
> the attached test script.  If I change things around so xip starts non-NULL, I
> get messages like these as the code scribbles past the end of the allocation:

This has been fixed. xip/subxip are now allocated separately if necessary.

> Though unlikely, the other backend may not even exist by now.  Indeed, that
> could have happened and RecentGlobalXmin advanced since we compared the
> checksum.  Not sure what the right locking is here, but something is needed.

Good catch. What I have done now is a second check at the end of
ImportSnapshot(). At that point we have set up the new snapshot and
adjusted MyProc->xmin. If we succeed, then we are fine. If not, we
throw an error and invalidate the whole transaction.

>> +      * Instead we must check to not go forward (we might have opened a cursor
>> +      * in this transaction and still have its snapshot registered)
>> +      */
>
> I'm thinking this case should yield an ERROR.  Otherwise, our net result would
> be to silently adopt a snapshot that is neither our old self nor the target.
> Maybe there's a use case for that, but none comes to my mind.

This can happen when you do:

DECLARE foo CURSOR FOR SELECT * FROM bar;
import snapshot...
FETCH 1 FROM foo;

> I guess the use case for pg_import_snapshot from READ COMMITTED would be
> something like "DO $$BEGIN PERFORM pg_import_snapshot('...'); stuff; END$$;".
> First off, would that work ("stuff" use the imported snapshot)?  If it does
> work, we should take the precedent of SET LOCAL and issue no warning.  If it
> doesn't work, then we should document that the snapshot does take effect until
> the next command and probably keep this warning or something similar.

No, this will also give you a new snapshot for every query, no matter
whether it is executed within or outside of a DO block.
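
A sketch of what I mean (assuming some table t):

BEGIN;  -- READ COMMITTED
DO $$
BEGIN
  PERFORM pg_import_snapshot('<exported data>');
  PERFORM count(*) FROM t;  -- already runs with a fresh snapshot again
END$$;
COMMIT;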

>> + Datum
>> + pg_import_snapshot(PG_FUNCTION_ARGS)
>> + {
>> +     bytea  *snapshotData = PG_GETARG_BYTEA_P(0);
>> +     bool    ret = true;
>> +
>> +     if (ActiveSnapshotSet())
>> +             ret = ImportSnapshot(snapshotData, GetActiveSnapshot());
>> +
>> +     ret &= ImportSnapshot(snapshotData, GetTransactionSnapshot());
>
> Is it valid to scribble directly on snapshots like this?

I figured that previously executed code still has references pointing
to the snapshots and so we cannot just push a new snapshot on top but
really need to change the memory where they are pointing to.

I am also adding a test script that shows the difference between READ
COMMITTED and SERIALIZABLE in an importing transaction in a DO block.
It's based on the script you sent.

So thanks again and I'm looking forward to your next review... :-)

Joachim

Attachment Content-Type Size
syncsnapshots.2.diff text/x-patch 16.8 KB
test3.pl text/x-perl 1.5 KB

From: Noah Misch <noah(at)leadboat(dot)com>
To: Joachim Wieland <joe(at)mcknight(dot)de>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Snapshot synchronization, again...
Date: 2011-01-20 06:37:03
Message-ID: 20110120063703.GB13329@tornado.leadboat.com
Lists: pgsql-hackers

On Wed, Jan 19, 2011 at 12:12:39AM -0500, Joachim Wieland wrote:
> Noah, thank you for this excellent review. I have adopted most
> (almost all) of your suggestions (except for the documentation, which
> is still missing). Please check the updated & attached patch for
> details.

Great. I do get an assertion failure with this sequence:

BEGIN; SELECT pg_export_snapshot(); ROLLBACK;
SELECT 1;

TRAP: FailedAssertion("!(RegisteredSnapshots > 0)", File: "snapmgr.c", Line: 430)

Perhaps some cleanup is missing from a ROLLBACK path?

>
> On Fri, Jan 14, 2011 at 10:13 PM, Noah Misch <noah(at)leadboat(dot)com> wrote:
> > "00000000000000000000" is a valid md5 message digest. ?To avoid always accepting
> > a snapshot with that digest, we would need to separately store a flag indicating
> > whether a given syncSnapshotChksums member is currently valid. ?Maybe that hole
> > is acceptable, though.
>
> In the end I decided to store the md5 checksums as printable characters in
> shmem. We now need a few more bytes for each checksum, but since a valid
> checksum contains only printable characters, a NUL byte tells us that it
> is not a valid checksum.

Just to clarify, I was envisioning something like:

typedef struct { bool valid; char value[16]; } snapshotChksum;

I don't object to the way you've done it, but someone else might not like the
extra marshalling between text and binary. Your call.

> >> +      * Instead we must check to not go forward (we might have opened a cursor
> >> +      * in this transaction and still have its snapshot registered)
> >> +      */
> >
> > I'm thinking this case should yield an ERROR. Otherwise, our net result would
> > be to silently adopt a snapshot that is neither our old self nor the target.
> > Maybe there's a use case for that, but none comes to my mind.
>
> This can happen when you do:
>
> DECLARE foo CURSOR FOR SELECT * FROM bar;
> import snapshot...
> FETCH 1 FROM foo;

I see now. You set snapshot->xmin unconditionally, but never move MyProc->xmin
forward, only backward. Yes; that seems just right.

> > I guess the use case for pg_import_snapshot from READ COMMITTED would be
> > something like "DO $$BEGIN PERFORM pg_import_snapshot('...'); stuff; END$$;".
> > First off, would that work (would "stuff" use the imported snapshot)? If it does
> > work, we should take the precedent of SET LOCAL and issue no warning. If it
> > doesn't work, then we should document that the snapshot only takes effect until
> > the next command, and probably keep this warning or something similar.
>
> No, this will also give you a new snapshot for every query, no matter
> whether it is executed within or outside of a DO block.

You're right. Then consider "VALUES (pg_import_snapshot('...'), (SELECT
count(*) from t))" at READ COMMITTED. It works roughly as I'd guess; the
subquery uses the imported snapshot. If I flip the two expressions and do
"VALUES ((SELECT count(*) from t), pg_import_snapshot('...'))", the subquery
uses the normal snapshot. That makes sense, but I can't really see a use case
for it. A warning, like your code has today, seems appropriate.

> > Is it valid to scribble directly on snapshots like this?
>
> I figured that previously executed code still has references pointing
> to the snapshots and so we cannot just push a new snapshot on top but
> really need to change the memory where they are pointing to.

Okay. Had no special reason to believe otherwise, just didn't know.

> I am also adding a test script that shows the difference between READ
> COMMITTED and SERIALIZABLE in an importing transaction in a DO block.
> It's based on the script you sent.

Thanks; that was handy. One thing I noticed is that the second "SELECT * FROM
kidseen" yields zero rows instead of yielding "ERROR: relation "kidseen" does
not exist". I changed things around per the attached "test-drop.pl", such that
the table is already gone in the ordinary snapshot, but still visible to the
imported snapshot. Note how the pg_class row is visible, but an actual attempt
to query the table fails. Must some kind of syscache invalidation follow the
snapshot alteration to make this behave normally?

General tests involving only DML seem to do the right thing. Subtransaction
handling looks sound. Patch runs cleanly according to Valgrind.

> So thanks again and I'm looking forward to your next review... :-)

Hope this one helps, too.

> diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
> index 8b36df4..29c9426 100644

> + bytea *
> + ExportSnapshot(Snapshot snapshot)
> + {
> + int bufsize = 1024;
> + int bufsize_filled = 0; /* doesn't include NUL byte */
> + int bufsize_left = bufsize - bufsize_filled;
> + char *buf = (char *) palloc(bufsize);
> + int i;
> + TransactionId *children;
> + int nchildren;
> +
> + /* In a subtransaction we don't see our open subxip values in the snapshot
> + * so they would be missing in the backend applying it. */
> + /* XXX is ERRCODE_INVALID_TRANSACTION_STATE more appropriate or less?
> + * It's a valid transaction state but invalid for the requested operation. */
> + if (GetCurrentTransactionNestLevel() != 1)
> + ereport(ERROR,
> + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> + errmsg("can only export a snapshot from a top level transaction")));

When I'm not sure about questions like this (as in, 99.8% of the time), I
usually grep one of the .po files for messages that seem similar, then see what
they used. In this case, the most-similar is "SET TRANSACTION ISOLATION LEVEL
must not be called in a subtransaction" in assign_XactIsoLevel(), which uses
ERRCODE_ACTIVE_SQL_TRANSACTION. I'd also consider making the message similar to
that one, like "cannot export a snapshot from a subtransaction".

> + static int
> + parseIntFromText(char **s, const char *prefix, int notfound)
> + {
> + char *n, *p = strstr(*s, prefix);
> + int i;
> +
> + if (!p)
> + return notfound;
> + p += strlen(prefix);
> + n = strchr(p, ' ');
> + if (!n)
> + return notfound;
> + *n = '\0';
> + i = DatumGetObjectId(DirectFunctionCall1(int4in, CStringGetDatum(p)));

DatumGetInt32
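
That is, presumably:

    i = DatumGetInt32(DirectFunctionCall1(int4in, CStringGetDatum(p)));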

> + bool
> + ImportSnapshot(bytea *snapshotData, Snapshot snapshot)
> + {
> + snapshotChksum cksum;
> + Oid databaseId;
> + BackendId backendId;
> + int i;
> + TransactionId xid;
> + int len = VARSIZE_ANY_EXHDR(snapshotData);
> + char *s = palloc(len + 1);
> + MemoryContext oldctx;
> +
> + /* XXX is ERRCODE_INVALID_TRANSACTION_STATE more appropriate or less?
> + * It's a valid transaction state but invalid for the requested operation. */
> + if (GetCurrentTransactionNestLevel() != 1)
> + ereport(ERROR,
> + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> + errmsg("can only import a snapshot to a top level transaction")));

See discussion of similar code above.

> +
> + if (len == 0)
> + ereport(ERROR,
> + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> + errmsg("invalid data in snapshot information")));
> +
> + strncpy(s, VARDATA_ANY(snapshotData), len);
> + s[len] = '\0';
> +
> + pg_md5_hash(s, len, (void *) &cksum);
> +
> + databaseId = parseOidFromText(&s, "did:");
> + backendId = (BackendId) parseIntFromText(&s, "bid:", (int) InvalidBackendId);
> +
> + if (databaseId != MyDatabaseId
> + || backendId == InvalidBackendId
> + /*
> + * Make sure backendId is in a reasonable range before using it as an
> + * array subscript.
> + */
> + || backendId >= PROCARRAY_MAXPROCS
> + || backendId < 0
> + || backendId == MyBackendId
> + /*
> + * Lock considerations:
> + *
> + * syncSnapshotChksums[backendId] is only changed by the backend with ID
> + * backendID and read by another backend that is asked to import a
> + * snapshot.
> + * syncSnapshotChksums[backendId] contains either NUL bytes or the checksum
> + * of a valid snapshot.
> + * Every comparison to the checksum while it is being written or
> + * deleted is okay to fail, so we don't need to take a lock at all.
> + */
> + || strcmp(cksum, syncSnapshotChksums[backendId]) != 0)
> + {
> + ereport(ERROR,
> + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> + errmsg("invalid data in snapshot information")));
> + }

That error message is good for the second, third, and fourth conditions: those
only happen when we're given something not generated by pg_export_snapshot().
The first and fifth tests deserve different error messages (perhaps "snapshot is
from a different database"; "cannot export and import a snapshot in the same
connection"). The sixth test is also a bit different; failure arises there when
the input is bogus or when the exporting transaction has ended. Failing to hold
that transaction open until all child transactions have imported might be a
common early mistake when using this feature. So, what about "snapshot not
exported by any running transaction" for the sixth test?
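
To illustrate, the split could look roughly like this (a sketch only; the
exact wording is of course debatable):

    if (databaseId != MyDatabaseId)
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                 errmsg("snapshot is from a different database")));
    if (backendId == MyBackendId)
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                 errmsg("cannot export and import a snapshot in the same connection")));
    if (backendId == InvalidBackendId ||
        backendId < 0 || backendId >= PROCARRAY_MAXPROCS)
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                 errmsg("invalid data in snapshot information")));
    if (strcmp(cksum, syncSnapshotChksums[backendId]) != 0)
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                 errmsg("snapshot not exported by any running transaction")));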

> +
> + Assert(databaseId != InvalidOid);
> +
> + oldctx = MemoryContextSwitchTo(TopTransactionContext);

How about "MemoryContextAlloc(TopTransactionContext, ..." in place of the two
palloc calls and dropping this? Not sure.
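
That is, something like:

    char *s = MemoryContextAlloc(TopTransactionContext, len + 1);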

> +
> + snapshot->xmin = parseXactFromText(&s, "xmi:");
> + Assert(snapshot->xmin != InvalidTransactionId);
> + snapshot->xmax = parseXactFromText(&s, "xma:");
> + Assert(snapshot->xmax != InvalidTransactionId);
> +
> + if (snapshot->copied)
> + {
> + int xcnt = parseIntFromText(&s, "xcnt:", 0);
> + /*
> + * Unfortunately CopySnapshot allocates just one large chunk of memory,
> + * and makes snapshot->xip then point past the end of the fixed-size
> + * snapshot data. That way we cannot just repalloc(snapshot->xip, ...).
> + * Neither can we just change the base address of the snapshot because
> + * this address might still be saved somewhere.
> + */
> + if (snapshot->xcnt < xcnt)
> + snapshot->xip = palloc(xcnt * sizeof(TransactionId));
> + }
> +
> + i = 0;
> + while ((xid = parseXactFromText(&s, "xip:")) != InvalidTransactionId)
> + {
> + if (xid == GetTopTransactionIdIfAny())
> + continue;
> + snapshot->xip[i++] = xid;
> + }
> + snapshot->xcnt = i;
> +
> + /*
> + * We only write "sof:1" if the snapshot overflowed. If not, then there is
> + * no "sof:x" entry at all and parseBoolFromText() will just return false.
> + */
> + snapshot->suboverflowed = parseBoolFromText(&s, "sof:");
> +
> + if (!snapshot->suboverflowed)
> + {
> + if (snapshot->copied)
> + {
> + int sxcnt = parseIntFromText(&s, "sxcnt:", 0);
> + if (snapshot->subxcnt < sxcnt)
> + snapshot->subxip = palloc(sxcnt * sizeof(TransactionId));
> + }
> +
> + i = 0;
> + while ((xid = parseXactFromText(&s, "sxp:")) != InvalidTransactionId)
> + snapshot->subxip[i++] = xid;
> + snapshot->subxcnt = i;
> + }
> +
> + MemoryContextSwitchTo(oldctx);
> +
> + /* Leave snapshot->curcid as is. */
> +
> + /*
> + * No ProcArrayLock held here, we assume that a write is atomic. Also note
> + * that MyProc->xmin can go backwards here. However this is safe because
> + * the xmin we set here is the same as in the backend's proc->xmin whose
> + * snapshot we are copying. At this very moment, anybody computing a
> + * minimum will calculate at least this xmin as the overall xmin with or
> + * without us setting MyProc->xmin to this value.
> + * Instead we must take care not to move it forward (we might have opened a
> + * cursor in this transaction and still have its snapshot registered).
> + */
> + if (TransactionIdPrecedes(snapshot->xmin, MyProc->xmin))
> + MyProc->xmin = snapshot->xmin;

The write is atomic, in the sense that this process cannot be interrupted with
the update in progress. You'd still need a memory barrier to make other
processes see the change immediately on all architectures we support. I'd
suggest taking ProcArrayLock and adding a comment to the effect that we could
make it lockless, should we ever adopt such methods. I don't advise testing
those waters as part of this patch.
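
Something along these lines (sketch):

    LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
    /* XXX could be done locklessly if we ever adopt explicit memory barriers */
    if (TransactionIdPrecedes(snapshot->xmin, MyProc->xmin))
        MyProc->xmin = snapshot->xmin;
    LWLockRelease(ProcArrayLock);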

> +
> + /*
> + * Check the checksum again to prevent a race condition. If the exporting
> + * backend invalidated its snapshot since we last checked or before we
> + * finish with this second check, then we fail here and error out, thus
> + * invalidating the snapshot we've built up.
> + * We could succeed here even though the exporting transaction is at the
> + * same time invalidating the checksum while we are checking it here for
> + * the second time. But even then we are good, because we have set up our
> + * MyProc record already.
> + */
> + if (strcmp(cksum, syncSnapshotChksums[backendId]) != 0)
> + ereport(ERROR,
> + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> + errmsg("invalid data in snapshot information")));

I was trying to decide whether this is vulnerable to the ABA problem, and I
believe it is not. It is possible that the original exporting transaction
completed and a new transaction exported the same snapshot, all between our
first checksum comparison and now. However, this would also mean that no new
permanent transaction IDs were allocated in the gap. So I think we're good.

This message should continue to mirror the one from the earlier test, of course.

> +
> + return true;
> + }
> +
> +
> diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
> index 45b92a0..f74ea76 100644

> + Datum
> + pg_import_snapshot(PG_FUNCTION_ARGS)
> + {
> + bytea *snapshotData = PG_GETARG_BYTEA_P(0);
> + bool ret = true;
> +
> + if (ActiveSnapshotSet())
> + ret = ImportSnapshot(snapshotData, GetActiveSnapshot());
> +
> + ret &= ImportSnapshot(snapshotData, GetTransactionSnapshot());
> +
> + /*
> + * If we are in read committed mode then the next query will execute with a
> + * new snapshot making this function call quite useless.
> + */
> + if (!IsolationUsesXactSnapshot())
> + ereport(WARNING,
> + (errcode(ERRCODE_INAPPROPRIATE_ISOLATION_LEVEL_FOR_BRANCH_TRANSACTION),
> + errmsg("a snapshot importing transaction should be ISOLATION LEVEL SERIALIZABLE"),
> + errhint("A transaction of isolation level READ COMMITTED gives you a new snapshot for each query.")));

Again, REPEATABLE READ is a perfectly good isolation level for this use. The
test allows it, but the warning text does not.

nm

Attachment Content-Type Size
test-drop.pl application/x-perl 1.3 KB

From: Joachim Wieland <joe(at)mcknight(dot)de>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Snapshot synchronization, again...
Date: 2011-01-30 17:36:12
Message-ID: AANLkTik_DNm64gaBKFZyVP4GaWHzuhP2gEF16JZQGTjK@mail.gmail.com
Lists: pgsql-hackers

Here is a new version of the patch incorporating most of Noah's
suggestions. The patch now also adds documentation. Since I couldn't
really find a suitable section to document the two new functions, I
added a new one for now. Any better ideas where it should go?

On Thu, Jan 20, 2011 at 1:37 AM, Noah Misch <noah(at)leadboat(dot)com> wrote:
> Just to clarify, I was envisioning something like:
>
> typedef struct { bool valid; char value[16]; } snapshotChksum;
>
> I don't object to the way you've done it, but someone else might not like the
> extra marshalling between text and binary.  Your call.

I didn't like it in the beginning but later on I adopted your
proposal. I have also changed the locking to be more natural. Even
though we don't really need it, I am now grabbing shared ProcArrayLock
for any reads of shared memory and exclusive lock for writes. Of
course no additional lock is taken if the feature is not used.
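
The write side now looks roughly like this (a sketch, using the
snapshotChksum struct quoted above):

    LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
    memcpy(syncSnapshotChksums[MyBackendId].value, cksum,
           sizeof(syncSnapshotChksums[MyBackendId].value));
    syncSnapshotChksums[MyBackendId].valid = true;
    LWLockRelease(ProcArrayLock);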

> You're right.  Then consider "VALUES (pg_import_snapshot('...'), (SELECT
> count(*) from t))" at READ COMMITTED.  It works roughly as I'd guess; the
> subquery uses the imported snapshot.  If I flip the two expressions and do
> "VALUES ((SELECT count(*) from t), pg_import_snapshot('...'))", the subquery
> uses the normal snapshot.  That makes sense, but I can't really see a use case
> for it.  A warning, like your code has today, seems appropriate.

Yeah, that would do what you wanted to illustrate, but it truly cannot
be considered the standard use case :-)

>> > Is it valid to scribble directly on snapshots like this?
>> I figured that previously executed code still has references pointing
>> to the snapshots and so we cannot just push a new snapshot on top but
>> really need to change the memory where they are pointing to.
> Okay.  Had no special reason to believe otherwise, just didn't know.

This is one part where I'd like someone more experienced than me to look into it.

> Thanks; that was handy.  One thing I noticed is that the second "SELECT * FROM
> kidseen" yields zero rows instead of yielding "ERROR:  relation "kidseen" does
> not exist".  I changed things around per the attached "test-drop.pl", such that
> the table is already gone in the ordinary snapshot, but still visible to the
> imported snapshot.  Note how the pg_class row is visible, but an actual attempt
> to query the table fails.  Must some kind of syscache invalidation follow the
> snapshot alteration to make this behave normally?

As discussed with Noah off-list this is just not an MVCC safe
operation. You could run into this in a regular SERIALIZABLE transaction
as well: Somebody else can delete a table and you won't be able to
access it anymore. This is also the reason why pg_dump in the
beginning gets a shared lock on every table that it will dump later
on.

Thanks for the review Noah,

Joachim

Attachment Content-Type Size
syncsnapshots.3.diff text/x-patch 23.2 KB

From: Noah Misch <noah(at)leadboat(dot)com>
To: Joachim Wieland <joe(at)mcknight(dot)de>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Snapshot synchronization, again...
Date: 2011-01-30 23:26:31
Message-ID: 20110130232631.GA24897@tornado.gateway.2wire.net
Lists: pgsql-hackers

On Sun, Jan 30, 2011 at 12:36:12PM -0500, Joachim Wieland wrote:
> Here is a new version of the patch incorporating most of Noah's
> suggestions. The patch now also adds documentation. Since I couldn't
> really find a suitable section to document the two new functions, I
> added a new one for now. Any better ideas where it should go?

No major opinion here. The second best fit after a new section would be System
Administration Functions, inasmuch as that's the bucket for miscellaneous
functions. I have an idea for improving structure here, but that would go
beyond the purview of this patch.

This version looks sound; I've marked it Ready for Committer.

Thanks,
nm


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Joachim Wieland <joe(at)mcknight(dot)de>
Cc: Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Snapshot synchronization, again...
Date: 2011-02-18 16:02:22
Message-ID: 1298044925-sup-7790@alvh.no-ip.org
Lists: pgsql-hackers


Looking into this patch.

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Joachim Wieland <joe(at)mcknight(dot)de>
Cc: Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Snapshot synchronization, again...
Date: 2011-02-18 19:57:11
Message-ID: 1298058809-sup-208@alvh.no-ip.org
Lists: pgsql-hackers

I have two questions:

1. why are you using the expansible char array stuff instead of using
the StringInfo facility?

2. is md5 the most appropriate digest for this? If you need a
cryptographically secure hash, do we need something stronger? If not,
why not just use hash_any?
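
(For reference, the StringInfo route would look roughly like this, assuming
the same key/value text format the patch emits:

    StringInfoData buf;

    initStringInfo(&buf);
    appendStringInfo(&buf, "did:%u ", MyDatabaseId);
    appendStringInfo(&buf, "bid:%d ", MyBackendId);
    appendStringInfo(&buf, "xmi:%u xma:%u ", snapshot->xmin, snapshot->xmax);
    /* etc.; buf.data and buf.len then hold the exported text */

with the buffer growing automatically as needed.)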

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Joachim Wieland <joe(at)mcknight(dot)de>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Snapshot synchronization, again...
Date: 2011-02-19 00:41:51
Message-ID: AANLkTik0SFP8CJCxSeH9nzKw9a-+da93bHW8spB2joTF@mail.gmail.com
Lists: pgsql-hackers

On Fri, Feb 18, 2011 at 8:57 PM, Alvaro Herrera
<alvherre(at)commandprompt(dot)com> wrote:
> 1. why are you using the expansible char array stuff instead of using
> the StringInfo facility?
>
> 2. is md5 the most appropriate digest for this?  If you need a
> cryptographically secure hash, do we need something stronger?  If not,
> why not just use hash_any?

We don't need a cryptographically secure hash.

There is no special reason for why it is like it is, I just didn't
think of the better alternatives that you are proposing.

Should I send an updated patch? Anything else?

Thanks for the review,
Joachim


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Joachim Wieland <joe(at)mcknight(dot)de>
Cc: Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Snapshot synchronization, again...
Date: 2011-02-19 01:12:59
Message-ID: 1298077237-sup-4499@alvh.no-ip.org
Lists: pgsql-hackers

Excerpts from Joachim Wieland's message of vie feb 18 21:41:51 -0300 2011:
> On Fri, Feb 18, 2011 at 8:57 PM, Alvaro Herrera
> <alvherre(at)commandprompt(dot)com> wrote:
> > 1. why are you using the expansible char array stuff instead of using
> > the StringInfo facility?
> >
> > 2. is md5 the most appropriate digest for this?  If you need a
> > cryptographically secure hash, do we need something stronger?  If not,
> > why not just use hash_any?
>
> We don't need a cryptographically secure hash.

Ok.

> Should I send an updated patch? Anything else?

No need, I'm halfway through already.

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Joachim Wieland <joe(at)mcknight(dot)de>, Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Snapshot synchronization, again...
Date: 2011-02-19 20:17:03
Message-ID: 1298146623.5977.7.camel@vanquo.pezone.net
Lists: pgsql-hackers

On fre, 2011-02-18 at 16:57 -0300, Alvaro Herrera wrote:
> 2. is md5 the most appropriate digest for this? If you need a
> cryptographically secure hash, do we need something stronger? If not,
> why not just use hash_any?

MD5 is probably more appropriate than hash_any, because the latter is
optimized for speed and collision avoidance and doesn't have a
guaranteed external format. The only consideration against MD5 might be
that it would make us look quite lame. We should probably provide
builtin SHA1 and SHA2 functions for this and other reasons.


From: Joachim Wieland <joe(at)mcknight(dot)de>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Snapshot synchronization, again...
Date: 2011-02-19 23:02:36
Message-ID: AANLkTin-znHrBv6LZ4K=vb9o8wiOFQ5i=k_pYncea2vR@mail.gmail.com
Lists: pgsql-hackers

On Sat, Feb 19, 2011 at 9:17 PM, Peter Eisentraut <peter_e(at)gmx(dot)net> wrote:
> The only consideration against MD5 might be
> that it would make us look quite lame.  We should probably provide
> builtin SHA1 and SHA2 functions for this and other reasons.

In this particular case however the checksum is never exposed to the
user but stays within shared memory. So nobody would notice that we
are still using lame old md5sum :-)

Joachim


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Joachim Wieland <joe(at)mcknight(dot)de>, Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Snapshot synchronization, again...
Date: 2011-02-20 00:26:42
Message-ID: 643.1298161602@sss.pgh.pa.us
Lists: pgsql-hackers

Peter Eisentraut <peter_e(at)gmx(dot)net> writes:
> On fre, 2011-02-18 at 16:57 -0300, Alvaro Herrera wrote:
>> 2. is md5 the most appropriate digest for this? If you need a
>> cryptographically secure hash, do we need something stronger? If not,
>> why not just use hash_any?

> MD5 is probably more appropriate than hash_any, because the latter is
> optimized for speed and collision avoidance and doesn't have a
> guaranteed external format. The only consideration against MD5 might be
> that it would make us look quite lame.

Only to people who don't understand whether crypto strength is actually
important in a given use-case.

However ... IIRC, hash_any gives different results on bigendian and
littleendian machines. I'm not sure if a predictable cross-platform
result is important for this use? If you're hashing data containing
native integers, this is a problem anyway.
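
(For reference, the call itself would just be something like

    /* from access/hash.h; the result can differ across endianness */
    uint32 h = DatumGetUInt32(hash_any((const unsigned char *) s, len));

where s/len is the snapshot's textual form.)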

regards, tom lane


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Peter Eisentraut <peter_e(at)gmx(dot)net>, Joachim Wieland <joe(at)mcknight(dot)de>, Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Snapshot synchronization, again...
Date: 2011-02-20 00:57:46
Message-ID: 1298163311-sup-3469@alvh.no-ip.org
Lists: pgsql-hackers

Excerpts from Tom Lane's message of sáb feb 19 21:26:42 -0300 2011:

> However ... IIRC, hash_any gives different results on bigendian and
> littleendian machines. I'm not sure if a predictable cross-platform
> result is important for this use? If you're hashing data containing
> native integers, this is a problem anyway.

The current feature only lets you synchronize snapshots within a cluster
-- you can't ship them to another machine. I don't think dependency on
endianness is a problem here. (But no, the data doesn't contain native
integers -- it's the snapshot's text representation.)

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Joachim Wieland <joe(at)mcknight(dot)de>
Cc: Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Snapshot synchronization, again...
Date: 2011-02-21 15:56:36
Message-ID: 1298303173-sup-6701@alvh.no-ip.org
Lists: pgsql-hackers

A couple more questions:

What's the reason for this restriction?
if (databaseId != MyDatabaseId)
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("cannot import snapshot from a different database")));

Why are we using bytea as the output format instead of text? The output
is just plain text anyway, as can be seen by setting bytea_output to
'escape'. Perhaps there would be a value to using bytea if we were to
pglz_compress the result, but would there be a point in doing so?
Normally exported info should be short enough that it would cause more
overhead than it would save anyway.

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Joachim Wieland <joe(at)mcknight(dot)de>
Cc: Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Snapshot synchronization, again...
Date: 2011-02-21 17:56:42
Message-ID: 1298310067-sup-2301@alvh.no-ip.org
Lists: pgsql-hackers

Excerpts from Joachim Wieland's message of dom ene 30 14:36:12 -0300 2011:

> On Thu, Jan 20, 2011 at 1:37 AM, Noah Misch <noah(at)leadboat(dot)com> wrote:

> >> > Is it valid to scribble directly on snapshots like this?
> >> I figured that previously executed code still has references pointing
> >> to the snapshots and so we cannot just push a new snapshot on top but
> >> really need to change the memory where they are pointing to.
> > Okay.  Had no special reason to believe otherwise, just didn't know.
>
> This is one part where I'd like someone more experienced than me to look into it.

Yeah, I don't like this bit very much. I'm inclined to have the import
routine fill CurrentSnapshot directly (and perhaps ActiveSnapshot too,
not sure about this part yet) instead; I think we need a safety net so
that the new serializable isolation code doesn't get upset if we change
the base snapshot from under it, but I haven't looked at that yet.

Trying to import a snap into a transaction that has committed children
should also fail; otherwise, objects touched during those subxacts would
be in a spooky zone. Also, I want to have importing into a
read-committed transaction fail with an error instead of the current
warning -- just like LOCK TABLE fails if you're not inside a
transaction.

I'm also inclined to have PREPARE TRANSACTION fail if it has an exported
snapshot, instead of just invalidating it. This lets us get rid of the
extra locking added during that operation.

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Joachim Wieland <joe(at)mcknight(dot)de>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Snapshot synchronization, again...
Date: 2011-02-21 19:33:26
Message-ID: AANLkTi=m0-427u4hFDt1+Lzue13fJ_BNACgSAJtwwkZM@mail.gmail.com
Lists: pgsql-hackers

Hi,

On Mon, Feb 21, 2011 at 4:56 PM, Alvaro Herrera
<alvherre(at)commandprompt(dot)com> wrote:
> What's the reason for this restriction?
>        if (databaseId != MyDatabaseId)
>                ereport(ERROR,
>                        (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
>                         errmsg("cannot import snapshot from a different database")));

I just couldn't think of a use case for it and so didn't want to
introduce a feature that we might then have to support in the future.

> Why are we using bytea as the output format instead of text?  The output
> is just plain text anyway, as can be seen by setting bytea_output to
> 'escape'.  Perhaps there would be a value to using bytea if we were to
> pglz_compress the result, but would there be a point in doing so?
> Normally exported info should be short enough that it would cause more
> overhead than it would save anyway.

It is bytea because it should be thought of as "just some data". It
should be regarded more as a token than as text and should not be
inspected/interpreted at all. If anybody decides to do so, fine, but
then they should not rely on the stability of the message format for
the future.

Joachim


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Joachim Wieland <joe(at)mcknight(dot)de>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Snapshot synchronization, again...
Date: 2011-02-21 21:12:32
Message-ID: 6106.1298322752@sss.pgh.pa.us
Lists: pgsql-hackers

Joachim Wieland <joe(at)mcknight(dot)de> writes:
> On Mon, Feb 21, 2011 at 4:56 PM, Alvaro Herrera
> <alvherre(at)commandprompt(dot)com> wrote:
>> Why are we using bytea as the output format instead of text?

> It is bytea because it should be thought of "just some data". It
> should be regarded more as a token than as text and should not be
> inspected/interpreted at all. If anybody decides to do so, fine, but
> then they should not rely on the stability of the message format for
> the future.

I don't think that passing it as bytea conveys those semantics at all.
It just makes it harder to look at the value for debugging purposes.

Plus you gotta worry about variable formatting/escaping conventions
(bytea_output etc) that could easily be avoided if the value were
promised to be text with no funny characters in it.

regards, tom lane


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Joachim Wieland <joe(at)mcknight(dot)de>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Snapshot synchronization, again...
Date: 2011-02-22 06:59:56
Message-ID: 4D635EEC.2000106@enterprisedb.com
Lists: pgsql-hackers

On 19.02.2011 02:41, Joachim Wieland wrote:
> On Fri, Feb 18, 2011 at 8:57 PM, Alvaro Herrera
> <alvherre(at)commandprompt(dot)com> wrote:
>> 1. why are you using the expansible char array stuff instead of using
>> the StringInfo facility?
>>
>> 2. is md5 the most appropriate digest for this? If you need a
>> cryptographically secure hash, do we need something stronger? If not,
>> why not just use hash_any?
>
> We don't need a cryptographically secure hash.

Really? The idea of the hash is to prevent you from importing arbitrary
snapshots into the system, allowing only copying snapshots that are in
use by another backend. The purpose of the hash is to make sure the
snapshot the client passes in is identical to one used by another backend.

If you don't use a cryptographically secure hash, it's easy to construct
a snapshot with the same hash as an existing snapshot, with more or less
arbitrary contents.

This also makes me worried:
> +
> + /*
> + * Verify that the checksum matches before doing any more work. We will
> + * later verify again to make sure that the exporting transaction has not
> + * yet terminated by then. We don't want to optimize this into just one
> + * verification call at the very end because the instructions that follow
> + * this comment rely on a sane format of the textual snapshot data. In
> + * particular they assume that there are not more XactIds than
> + * advertised...
> + */

It sounds like you risk a buffer overflow if the snapshot is bogus,
which makes a collision-resistant hash even more important.

I know this was discussed already, but I don't much like using a hash
like this. We should find a way to just store the whole snapshot in
shared memory, or even a temporary file if we want to skimp on shared
memory. It's generally better to not rely on cryptography when you don't
have to.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Joachim Wieland <joe(at)mcknight(dot)de>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Snapshot synchronization, again...
Date: 2011-02-22 07:10:13
Message-ID: 4D636155.9050000@enterprisedb.com
Lists: pgsql-hackers

On 21.02.2011 21:33, Joachim Wieland wrote:
> Hi,
>
> On Mon, Feb 21, 2011 at 4:56 PM, Alvaro Herrera
> <alvherre(at)commandprompt(dot)com> wrote:
>> What's the reason for this restriction?
>> if (databaseId != MyDatabaseId)
>> ereport(ERROR,
>> (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
>> errmsg("cannot import snapshot from a different database")));
>
> I just couldn't think of a use case for it and so didn't want to
> introduce a feature that we might have to support in the future then.

Hmm, here's one: getting a consistent snapshot of all databases in
pg_dumpall. Not sure how useful that is...

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Joachim Wieland <joe(at)mcknight(dot)de>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Snapshot synchronization, again...
Date: 2011-02-22 12:24:22
Message-ID: AANLkTimS0wGSbAj7CjwBt2UM_+zsGg5ApjcQRCc+z3_O@mail.gmail.com
Lists: pgsql-hackers

On Tue, Feb 22, 2011 at 1:59 AM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> Really? The idea of the hash is to prevent you from importing arbitrary
> snapshots into the system, allowing only copying snapshots that are in use
> by another backend. The purpose of the hash is to make sure the snapshot the
> client passes in is identical to one used by another backend.

If that's the purpose of the hash, it is utterly useless. The
computation performed by the backend can easily be replicated by an
adversary.

> If you don't use a cryptographically secure hash, it's easy to construct a
> snapshot with the same hash as an existing snapshot, with more or less
> arbitrary contents.

And if you did use a cryptographically secure hash, it would still be
easy, unless there is some plausible way of keeping the key a secret,
which I doubt.

The original reason why we went with this design (send the snapshot to
the client and then have the client send it back to the server) was
because some people thought it could be useful to take a snapshot from
the master and recreate it on a standby, or perhaps the other way
around. If that's the goal, checking whether the snapshot being
imported is one that's already in use is missing the point. We have
to rely on our ability to detect, when importing the snapshot, any
anomalies that could lead to system instability; and allow people to
import an arbitrary snapshot that meets those constraints, even one
that the user just created out of whole cloth.

Now, as I said before, I think that's a bad idea. I think we should
NOT be designing the system to allow publication of arbitrary
snapshots; instead, the backend that wishes to export its snapshot
should so request, getting back a token (like "1") that other backends
can then pass to START TRANSACTION (SNAPSHOT '1'), or whatever syntax
we end up using. In that design, there's no need to validate
snapshots because they never leave the server's memory space, which
IMHO is as it should be. If someone can develop a convincing argument
that transmitting snapshots between the master and standby is a good
idea, we can still accommodate that: the master-standby connection is
now full-duplex. In fact, we may want to do that anyway for other
reasons (see Kevin Grittner's musings on this point vs. true
serializability).

Another reason I don't like this approach is because it's possible
that the representation of snapshots might change in the future. Tom
mused a while back about using an LSN as a snapshot (snapshots sees
all transactions committed prior to that LSN) and there are other
plausible changes, too. If we expose the internal details of the
snapshot mechanism to users, then (1) it becomes a user-visible
feature with backward compatibility implications if we choose to break
it down the road and (2) all future snapshot representations must be
able to be fully sanity checked on import. Right now it's fairly easy
to verify that the snapshot's xmin isn't too old and that the
remaining XIDs are in a reasonable range and that's probably all we
really need to be certain of, but it's not clear that some future
representation would be equally easy to validate.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Joachim Wieland <joe(at)mcknight(dot)de>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Snapshot synchronization, again...
Date: 2011-02-22 13:01:45
Message-ID: 4D63B3B9.3030700@enterprisedb.com
Lists: pgsql-hackers

On 22.02.2011 14:24, Robert Haas wrote:
> On Tue, Feb 22, 2011 at 1:59 AM, Heikki Linnakangas
> <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>> If you don't use a cryptographically secure hash, it's easy to construct a
>> snapshot with the same hash as an existing snapshot, with more or less
>> arbitrary contents.
>
> And if you did use a cryptographically secure hash, it would still be
> easy, unless there is some plausible way of keeping the key a secret,
> which I doubt.

This is hashing, not encryption, there is no key. The point is that even
if the attacker has the hash value and knows the algorithm, he can not
construct *another* snapshot that has the same hash. A cryptographically
strong hash function has that property, second preimage resistance,
whereas for a regular CRC or similar it's easy to create an arbitrary
input that has any given checksum.

> The original reason why we went with this design (send the snapshot to
> the client and then have the client send it back to the server) was
> because some people thought it could be useful to take a snapshot from
> the master and recreate it on a standby, or perhaps the other way
> around. If that's the goal, checking whether the snapshot being
> imported is one that's already in use is missing the point. We have
> to rely on our ability to detect, when importing the snapshot, any
> anomalies that could lead to system instability; and allow people to
> import an arbitrary snapshot that meets those constraints, even one
> that the user just created out of whole cloth.

Yes. It would be good to perform those sanity checks anyway.

> Now, as I said before, I think that's a bad idea. I think we should
> NOT be designing the system to allow publication of arbitrary
> snapshots; instead, the backend that wishes to export its snapshot
> should so request, getting back a token (like "1") that other backends
> can then pass to START TRANSACTION (SNAPSHOT '1'), or whatever syntax
> we end up using.

Well, I'm not sure if we should allow importing arbitrary (as long as
they're sane) snapshots or not; the arguments for and against that are
both compelling.

But even if we don't allow it, there's no harm in sending the whole
snapshot to the client, anyway. I.e., instead of "1" as the identifier,
use the snapshot itself. That leaves the door open for allowing it in
the future, should we choose to do so.

> In that design, there's no need to validate
> snapshots because they never leave the server's memory space, which
> IMHO is as it should be.

I agree that if we don't allow importing arbitrary snapshots, we should
use some out-of-band communication to get the snapshot to the backend.
IOW not rely on a hash.

> Another reason I don't like this approach is because it's possible
> that the representation of snapshots might change in the future. Tom
> mused a while back about using an LSN as a snapshot (snapshots sees
> all transactions committed prior to that LSN) and there are other
> plausible changes, too. If we expose the internal details of the
> snapshot mechanism to users, then (1) it becomes a user-visible
> feature with backward compatibility implications if we choose to break
> it down the road

Clients should treat the snapshot as an opaque block.

> and (2) all future snapshot representations must be
> able to be fully sanity checked on import. Right now it's fairly easy
> to verify that the snapshot's xmin isn't too old and that the
> remaining XIDs are in a reasonable range and that's probably all we
> really need to be certain of, but it's not clear that some future
> representation would be equally easy to validate.

Perhaps, although I can't imagine a representation that wouldn't be
equally easy to validate.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Joachim Wieland <joe(at)mcknight(dot)de>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Snapshot synchronization, again...
Date: 2011-02-22 13:52:34
Message-ID: AANLkTikgJ3OFFTqwC=QuXm8ouL2i3Ed1tNHqamQLR4Sn@mail.gmail.com
Lists: pgsql-hackers

On Tue, Feb 22, 2011 at 8:01 AM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> This is hashing, not encryption, there is no key. The point is that even if
> the attacker has the hash value and knows the algorithm, he can not
> construct *another* snapshot that has the same hash.

What good does that do us?

> Yes. It would be good to perform those sanity checks anyway.

I don't think it's good; I think it's absolutely necessary. Otherwise
someone can generate arbitrary garbage, hash it, and feed it to us.
No?

> But even if we don't allow it, there's no harm in sending the whole snapshot
> to the client, anyway. Ie. instead of "1" as the identifier, use the
> snapshot itself. That leaves the door open for allowing it in the future,
> should we choose to do so.

The door is open either way, AFAICS: we could eventually allow:

BEGIN TRANSACTION (SNAPSHOT '1');
and also
BEGIN TRANSACTION (SNAPSHOT '{xmin 123 xmax 456 xids 128 149 201}');

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Joachim Wieland <joe(at)mcknight(dot)de>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Snapshot synchronization, again...
Date: 2011-02-22 13:58:05
Message-ID: 4D63C0ED.7070906@enterprisedb.com
Lists: pgsql-hackers

On 22.02.2011 15:52, Robert Haas wrote:
> On Tue, Feb 22, 2011 at 8:01 AM, Heikki Linnakangas
> <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>> Yes. It would be good to perform those sanity checks anyway.
>
> I don't think it's good; I think it's absolutely necessary. Otherwise
> someone can generate arbitrary garbage, hash it, and feed it to us.
> No?

No, the hash is stored in shared memory. The hash of the garbage has to
match.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Joachim Wieland <joe(at)mcknight(dot)de>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Snapshot synchronization, again...
Date: 2011-02-22 14:29:20
Message-ID: AANLkTim0t=ro8pH010yrwMcNvdVO8yPw-XWRgz0qmfnq@mail.gmail.com
Lists: pgsql-hackers

On Tue, Feb 22, 2011 at 8:58 AM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> On 22.02.2011 15:52, Robert Haas wrote:
>>
>> On Tue, Feb 22, 2011 at 8:01 AM, Heikki Linnakangas
>> <heikki(dot)linnakangas(at)enterprisedb(dot)com>  wrote:
>>>
>>> Yes. It would be good to perform those sanity checks anyway.
>>
>> I don't think it's good; I think it's absolutely necessary.  Otherwise
>> someone can generate arbitrary garbage, hash it, and feed it to us.
>> No?
>
> No, the hash is stored in shared memory. The hash of the garbage has to
> match.

Oh. Well that's really silly. At that point you might as well just
store the snapshot and an integer identifier in shared memory, right?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Joachim Wieland <joe(at)mcknight(dot)de>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Snapshot synchronization, again...
Date: 2011-02-22 14:34:34
Message-ID: 4D63C97A.3020900@enterprisedb.com
Lists: pgsql-hackers

On 22.02.2011 16:29, Robert Haas wrote:
> On Tue, Feb 22, 2011 at 8:58 AM, Heikki Linnakangas
> <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>> On 22.02.2011 15:52, Robert Haas wrote:
>>>
>>> On Tue, Feb 22, 2011 at 8:01 AM, Heikki Linnakangas
>>> <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>>>>
>>>> Yes. It would be good to perform those sanity checks anyway.
>>>
>>> I don't think it's good; I think it's absolutely necessary. Otherwise
>>> someone can generate arbitrary garbage, hash it, and feed it to us.
>>> No?
>>
>> No, the hash is stored in shared memory. The hash of the garbage has to
>> match.
>
> Oh. Well that's really silly. At that point you might as well just
> store the snapshot and an integer identifier in shared memory, right?

Yes, that's the point I was trying to make. I believe the idea of a hash
was that it takes less memory than storing the whole snapshot (and more
importantly, a fixed amount of memory per snapshot). But I'm not
convinced either that dealing with a hash is any less troublesome.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Joachim Wieland <joe(at)mcknight(dot)de>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Snapshot synchronization, again...
Date: 2011-02-22 14:50:43
Message-ID: AANLkTin5sVFj+Yov=tR5487DTpMgWD9Ad=PuopiPh_f+@mail.gmail.com
Lists: pgsql-hackers

On Tue, Feb 22, 2011 at 9:34 AM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>> Oh.  Well that's really silly.  At that point you might as well just
>> store the snapshot and an integer identifier in shared memory, right?
>
> Yes, that's the point I was trying to make. I believe the idea of a hash was
> that it takes less memory than storing the whole snapshot (and more
> importantly, a fixed amount of memory per snapshot). But I'm not convinced
> either that dealing with a hash is any less troublesome.

OK, sorry for taking a while to get the point.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Joachim Wieland <joe(at)mcknight(dot)de>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Snapshot synchronization, again...
Date: 2011-02-23 01:00:15
Message-ID: AANLkTi=pJ7YnDA4r15WW2vXoNv3cq9C=r8B6MhdSFRFD@mail.gmail.com
Lists: pgsql-hackers

On Tue, Feb 22, 2011 at 3:34 PM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> On 22.02.2011 16:29, Robert Haas wrote:
>> On Tue, Feb 22, 2011 at 8:58 AM, Heikki Linnakangas
>> <heikki(dot)linnakangas(at)enterprisedb(dot)com>  wrote:
>>> No, the hash is stored in shared memory. The hash of the garbage has to
>>> match.
>>
>> Oh.  Well that's really silly.  At that point you might as well just
>> store the snapshot and an integer identifier in shared memory, right?
>
> Yes, that's the point I was trying to make. I believe the idea of a hash was
> that it takes less memory than storing the whole snapshot (and more
> importantly, a fixed amount of memory per snapshot). But I'm not convinced
> either that dealing with a hash is any less troublesome.

Both Tom and Robert voted quite explicitly against the
store-in-shared-memory idea. Others don't want to allow people to request
arbitrary snapshots, and still others wanted to pass the snapshot
through the client so that in the future we could also request
snapshots from standby servers. The hash idea seemed to at least make
nobody unhappy.

For easier review, here are a few links to the previous discussion:

http://archives.postgresql.org/pgsql-hackers/2010-12/msg00361.php
http://archives.postgresql.org/pgsql-hackers/2010-12/msg00383.php
http://archives.postgresql.org/pgsql-hackers/2010-12/msg00481.php
http://archives.postgresql.org/pgsql-hackers/2010-12/msg02454.php

Why exactly, Heikki, do you think the hash is more troublesome? And how
could we validate/invalidate snapshots without the checksum (assuming
the through-the-client approach instead of storing the whole snapshot
in shared memory)?

Joachim


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Joachim Wieland <joe(at)mcknight(dot)de>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Snapshot synchronization, again...
Date: 2011-02-23 01:18:24
Message-ID: AANLkTi=3A2p9TyN7LSs1xt7UeJ1GOy1W5E_pT4a5Ph6y@mail.gmail.com
Lists: pgsql-hackers

On Tue, Feb 22, 2011 at 8:00 PM, Joachim Wieland <joe(at)mcknight(dot)de> wrote:
> Both Tom and Robert voted quite explicitly against the
> store-in-shared-memory idea.

No, I voted *for* that approach.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Joachim Wieland <joe(at)mcknight(dot)de>, Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Snapshot synchronization, again...
Date: 2011-02-23 01:55:25
Message-ID: 1298406311-sup-5625@alvh.no-ip.org
Lists: pgsql-hackers


The current discussion seems to have reached the same conclusion as the
last one: the snapshot info shouldn't leave the server, but be
transmitted internally -- the idea of the temp file seems prevalent.
Here's an attempt at a detailed spec:

By invoking pg_export_snapshot(), a transaction can export its current
transaction snapshot.

To export a snapshot, the transaction creates a temp file containing all
information about the snapshot, including the BackendId and
TransactionId of the creating transaction. The file name is determined
internally by the exporting transaction, and returned by the function
that creates the snapshot. The act of exporting a snapshot causes a
TransactionId to be acquired, if there is none.

Care is taken that no existing snapshot file is ever overwritten, to
prevent security problems. A transaction may export any number of
snapshots.

The "cookie" to be passed around, then, is a temp file identifier. This
cookie gets passed as START TRANSACTION (snapshot='foo'), which sets the
transaction's snapshot to the one determined by the identifier. The
importing transaction must have isolation level serializable or
repeatable read declared on the same START TRANSACTION command; an
attempt to import a snapshot into a read committed transaction or one
with unspecified isolation level causes the transaction to get into
aborted state (so you still need to say ROLLBACK to get out of it. This
is necessary to prevent the session from proceeding involuntarily with the
wrong snapshot.)

Similarly, if the snapshot is not available, an error is raised and the
transaction block remains in aborted state. This can happen if the
originating transaction ended after you opened the file, for example.
The BackendId and TransactionId of the exporting transaction let the
importing backend determine the validity of the snapshot beyond the mere
existence of the file.

The snapshot temp file is to be marked for deletion on transaction end.
All snapshot temp files are also deleted on crash recovery, to prevent
dangling files (I need some help here to determine how this plays with
hot standby.)

Privileges: Any role can export or import a snapshot. The rationale
here is that exporting a snapshot doesn't cause any more harm than
holding a transaction open, so if you let your users do that, then
there's no extra harm. Similarly, there's no extra privilege given to a
role that imports a snapshot that cannot be obtained by sending a START
TRANSACTION command at the right time.

Note: this proposal doesn't mention restrictions on having
subtransactions in the exporting transaction. This is because I haven't
figured them out, not because I think there shouldn't be any.
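
To make the export path concrete, a rough sketch (the directory name, file
naming scheme, and MySnapshotSeq counter are illustrative only):

    char  path[MAXPGPATH];
    FILE *f;

    /* exporting forces XID assignment, per the spec above */
    snprintf(path, MAXPGPATH, "pg_snapshots/%08X-%d",
             GetTopTransactionId(), MySnapshotSeq++);
    /* real code must use O_CREAT|O_EXCL semantics to never overwrite */
    f = AllocateFile(path, "w");
    if (f == NULL)
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not create snapshot file \"%s\": %m", path)));
    fprintf(f, "bid:%d xid:%u xmin:%u xmax:%u\n",
            MyBackendId, GetTopTransactionId(),
            snapshot->xmin, snapshot->xmax);
    FreeFile(f);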

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Joachim Wieland <joe(at)mcknight(dot)de>, Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Snapshot synchronization, again...
Date: 2011-02-23 02:05:15
Message-ID: AANLkTimp71AzLw6KVXLxsVPtPj21OMUAmqRqrjG6F3NX@mail.gmail.com
Lists: pgsql-hackers

On Tue, Feb 22, 2011 at 8:55 PM, Alvaro Herrera
<alvherre(at)commandprompt(dot)com> wrote:
> The current discussion seems to have reached the same conclusion as the
> last one: the snapshot info shouldn't leave the server, but be
> transmitted internally -- the idea of the temp file seems prevalent.
> Here's an attempt at a detailed spec:
>
> By invoking pg_export_snapshot(), a transaction can export its current
> transaction snapshot.
>
> To export a snapshot, the transaction creates a temp file containing all
> information about the snapshot, including the BackendId and
> TransactionId of the creating transaction.  The file name is determined
> internally by the exporting transaction, and returned by the function
> that creates the snapshot.  The act of exporting a snapshot causes a
> TransactionId to be acquired, if there is none.
>
> Care is taken that no existing snapshot file is ever overwritten, to
> prevent security problems.  A transaction may export any number of
> snapshots.
>
> The "cookie" to be passed around, then, is a temp file identifier.  This
> cookie gets passed as START TRANSACTION (snapshot='foo'), which sets the
> transaction's snapshot to the one determined by the identifier.  The
> importing transaction must have isolation level serializable or
> repeatable read declared on the same START TRANSACTION command; an
> attempt to import a snapshot into a read committed transaction or one
> with unspecified isolation level causes the transaction to get into
> aborted state (so you still need to say ROLLBACK to get out of it.  This
> is necessary to prevent the session from proceeding involuntarily with the
> wrong snapshot.)
>
> Similarly, if the snapshot is not available, an error is raised and the
> transaction block remains in aborted state.  This can happen if the
> originating transaction ended after you opened the file, for example.
> The BackendId and TransactionId of the exporting transaction let the
> importing backend determine the validity of the snapshot beyond the mere
> existence of the file.
>
> The snapshot temp file is to be marked for deletion on transaction end.
> All snapshot temp files are also deleted on crash recovery, to prevent
> dangling files (I need some help here to determine how this plays with
> hot standby.)

I think this can be done within StartupXLOG(), right around where we
ResetUnloggedRelations(UNLOGGED_RELATION_CLEANUP).
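
That is, roughly (helper name hypothetical):

    /* in StartupXLOG(), next to ResetUnloggedRelations() */
    DeleteAllExportedSnapshotFiles();   /* unlink leftover snapshot temp files */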

> Privileges: Any role can export or import a snapshot.  The rationale
> here is that exporting a snapshot doesn't cause any more harm than
> holding a transaction open, so if you let your users do that, then
> there's no extra harm.  Similarly, there's no extra privilege given to a
> role that imports a snapshot that cannot be obtained by sending a START
> TRANSACTION command at the right time.

Agree.

> Note: this proposal doesn't mention restrictions on having
> subtransactions in the exporting transaction.  This is because I haven't
> figured them out, not because I think there shouldn't be any.

I think it's probably sufficient to say that a snapshot can only be
exported from the toplevel transaction.

Overall, this proposal is fine with me. But then I wasn't the one who
complained about it last time.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Joachim Wieland <joe(at)mcknight(dot)de>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Snapshot synchronization, again...
Date: 2011-02-27 14:04:31
Message-ID: 4D6A59EF.1020102@enterprisedb.com
Lists: pgsql-hackers

On 23.02.2011 03:00, Joachim Wieland wrote:
> On Tue, Feb 22, 2011 at 3:34 PM, Heikki Linnakangas
> <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>> Yes, that's the point I was trying to make. I believe the idea of a hash was
>> that it takes less memory than storing the whole snapshot (and more
>> importantly, a fixed amount of memory per snapshot). But I'm not convinced
>> either that dealing with a hash is any less troublesome.
>
> Both Tom and Robert voted quite explicitly against the
> store-in-shared-memory idea. Others don't want to allow people to
> request arbitrary snapshots, and still others wanted to pass the
> snapshot through the client so that in the future we could also
> request snapshots from standby servers. The hash idea seemed to at
> least make nobody unhappy.
>
> For easier review, here are a few links to the previous discussion:
>
> http://archives.postgresql.org/pgsql-hackers/2010-12/msg00361.php
> http://archives.postgresql.org/pgsql-hackers/2010-12/msg00383.php
> http://archives.postgresql.org/pgsql-hackers/2010-12/msg00481.php
> http://archives.postgresql.org/pgsql-hackers/2010-12/msg02454.php
>
> Why exactly, Heikki, do you think the hash is more troublesome?

It feels wrong to rely on cryptography just to save some shared memory.

I realize that there are conflicting opinions on this, but from the
user's point of view the hash is just a variant of the idea of passing
the snapshot through shared memory, merely implemented in an indirect
way.

> And how
> could we validate/invalidate snapshots without the checksum (assuming
> the through-the-client approach instead of storing the whole snapshot
> in shared memory)?

Either you accept anything that passes sanity checks, or you store the
whole snapshot in shared memory (or a temp file). I'm not sure which is
better, but they both seem better than the hash.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Joachim Wieland <joe(at)mcknight(dot)de>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Snapshot synchronization, again...
Date: 2011-02-28 01:33:44
Message-ID: AANLkTikAjgAu6nLfv7UmJ-AhXaf8fkFs3G5zWU_JBndY@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Feb 27, 2011 at 3:04 PM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>> Why exactly, Heikki, do you think the hash is more troublesome?
> It feels wrong to rely on cryptography just to save some shared memory.

Remember that it's not only about saving shared memory, it's also
about making sure that the snapshot reflects a state of the database
that has actually existed at some point in the past. Furthermore, we
can easily invalidate a snapshot that we have published earlier by
deleting its checksum in shared memory as soon as the original
transaction commits/aborts. For both of these a checksum seems to be a
good fit. Saving memory then comes as a side benefit, and it satisfies
those who don't want to argue about how many slots to reserve in shared
memory, or who don't want another GUC for what will probably be a
low-usage feature.
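
To make that concrete, the shared-memory side could be as small as this
(a sketch only; all names are hypothetical and locking is omitted):

    /* One slot per backend, max_connections entries in shared memory. */
    typedef struct SnapshotChecksumSlot
    {
        bool        valid;
        uint8       checksum[20];   /* e.g. SHA-1 of the serialized snapshot */
    } SnapshotChecksumSlot;

    extern SnapshotChecksumSlot *SnapshotChecksumArray;

    /* Exporting backend: publish the checksum of the snapshot data. */
    static void
    PublishSnapshotChecksum(int slot, const uint8 *sum)
    {
        memcpy(SnapshotChecksumArray[slot].checksum, sum, 20);
        SnapshotChecksumArray[slot].valid = true;
    }

    /* Called at commit/abort of the exporting transaction. */
    static void
    RevokeSnapshotChecksum(int slot)
    {
        SnapshotChecksumArray[slot].valid = false;
    }

An importing backend recomputes the checksum over the client-supplied
data and accepts the snapshot only if it matches a valid slot.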

> I realize that there are conflicting opinions on this, but from the
> user's point of view the hash is just a variant of the idea of passing
> the snapshot through shared memory, merely implemented in an indirect way.

The user will never see the hash, so why should he care? From the
user's point of view, he receives data and can obtain the same snapshot
by passing that data back. This user experience is no different from
any other way of passing the snapshot through the client, and from the
previous discussions this seemed to be what most people wanted.

>> And how
>> could we validate/invalidate snapshots without the checksum (assuming
>> the through-the-client approach instead of storing the whole snapshot
>> in shared memory)?
>
> Either you accept anything that passes sanity checks, or you store the whole
> snapshot in shared memory (or a temp file). I'm not sure which is better,
> but they both seem better than the hash.

True, both might work, but I don't see a real technical advantage over
the checksum approach for either of them; rather the opposite.

Nobody has come up with a use case for the accept-anything option so
far, so I don't see why we should commit ourselves to that feature at
this point, given that we have a cheap and easy way of
validating/invalidating snapshots. I might just be paranoid, but I also
fear that someone could raise security concerns about the fact that you
would be able to request an arbitrary past database state and inspect
changes from other people's transactions. We might want to allow that
later, though, and I realize that we have to allow it anyway for a
standby server taking over a snapshot from the master, but I don't want
to add that complexity to this first patch. I do, however, want to be
able to allow it in the future without touching the feature's external
API.

As for the tempfile approach, I don't see that creating and removing a
temp file involves any less code complexity than flipping a number in
shared memory. Also, it seemed that people would rather go with the
through-the-client approach because it is more flexible.

Maybe you should just look at it as a conservative accept-anything
approach that uses a checksum to allow only snapshots that we know
have existed and have been published. Does the checksum still look so
weird from this perspective?

Joachim


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Joachim Wieland <joe(at)mcknight(dot)de>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Snapshot synchronization, again...
Date: 2011-02-28 17:38:21
Message-ID: AANLkTikQABMFxa_uaeG2mdiWxq-07Yq7yd8GeThQGNcj@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Feb 27, 2011 at 8:33 PM, Joachim Wieland <joe(at)mcknight(dot)de> wrote:
> On Sun, Feb 27, 2011 at 3:04 PM, Heikki Linnakangas
> <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>>> Why exactly, Heikki, do you think the hash is more troublesome?
>> It feels wrong to rely on cryptography just to save some shared memory.
>
> Remember that it's not only about saving shared memory, it's also
> about making sure that the snapshot reflects a state of the database
> that has actually existed at some point in the past. Furthermore, we
> can easily invalidate a snapshot that we have published earlier by
> deleting its checksum in shared memory as soon as the original
> transaction commits/aborts. For both of these a checksum seems to be a
> good fit. Saving memory then comes as a side benefit, and it satisfies
> those who don't want to argue about how many slots to reserve in shared
> memory, or who don't want another GUC for what will probably be a
> low-usage feature.

But you can do all of this with files too, can't you? Just remove or
truncate the file when the snapshot is no longer valid.
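
In code, that's not much more than this in the end-of-transaction path
(a sketch; the directory and the file name variable are made up):

    /*
     * At commit/abort of the exporting transaction: drop the file, so
     * that importers fail the does-the-file-still-exist check.
     */
    static void
    AtEOXact_ExportedSnapshots(void)
    {
        char        path[MAXPGPATH];

        snprintf(path, sizeof(path), "pg_snapshots/%s", exportedSnapshotName);
        unlink(path);
    }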

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Joachim Wieland <joe(at)mcknight(dot)de>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Snapshot synchronization, again...
Date: 2011-02-28 17:59:56
Message-ID: 26171.1298915996@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Sun, Feb 27, 2011 at 8:33 PM, Joachim Wieland <joe(at)mcknight(dot)de> wrote:
>> Remember that it's not only about saving shared memory, it's also
>> about making sure that the snapshot reflects a state of the database
>> that has actually existed at some point in the past.

> But you can do all of this with files too, can't you? Just remove or
> truncate the file when the snapshot is no longer valid.

Yeah. I think adopting a solution similar to 2PC state files is a very
reasonable way to go here. This isn't likely to be a high-usage or
performance-critical feature, so it's not essential to keep the
information in shared memory for performance reasons.
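
(For comparison: 2PC state files live under $PGDATA/pg_twophase, one
small file per prepared transaction, named by transaction id and
removed at COMMIT PREPARED / ROLLBACK PREPARED. The analogous, purely
hypothetical layout here would be one file per exported snapshot, say
$PGDATA/pg_snapshots/<token>, removed when the exporting transaction
ends and swept away unconditionally during crash recovery.)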

regards, tom lane


From: Joachim Wieland <joe(at)mcknight(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Snapshot synchronization, again...
Date: 2011-03-01 04:27:44
Message-ID: AANLkTinUjdt3OFcWoSgXmULDfU_Ux-Om6hKR=ms=bbiM@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Feb 28, 2011 at 6:38 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> Remember that it's not only about saving shared memory, it's also
>> about making sure that the snapshot reflects a state of the database
>> that has actually existed at some point in the past. Furthermore, we
>> can easily invalidate a snapshot that we have published earlier by
>> deleting its checksum in shared memory as soon as the original
>> transaction commits/aborts. For both of these a checksum seems to be a
>> good fit. Saving memory then comes as a side benefit, and it satisfies
>> those who don't want to argue about how many slots to reserve in shared
>> memory, or who don't want another GUC for what will probably be a
>> low-usage feature.
>
> But you can do all of this with files too, can't you?  Just remove or
> truncate the file when the snapshot is no longer valid.

Sure we can, but it looked like the consensus of the first discussion
was that the through-the-client approach was more flexible. But then
again nobody is actively arguing for that anymore.


From: Jim Nasby <jim(at)nasby(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Joachim Wieland <joe(at)mcknight(dot)de>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Snapshot synchronization, again...
Date: 2011-03-02 02:28:06
Message-ID: 8B6E94FB-57FA-4F13-AD91-294A53CF7ED3@nasby.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Feb 28, 2011, at 11:59 AM, Tom Lane wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> On Sun, Feb 27, 2011 at 8:33 PM, Joachim Wieland <joe(at)mcknight(dot)de> wrote:
>>> Remember that it's not only about saving shared memory, it's also
>>> about making sure that the snapshot reflects a state of the database
>>> that has actually existed at some point in the past.
>
>> But you can do all of this with files too, can't you? Just remove or
>> truncate the file when the snapshot is no longer valid.
>
> Yeah. I think adopting a solution similar to 2PC state files is a very
> reasonable way to go here. This isn't likely to be a high-usage or
> performance-critical feature, so it's not essential to keep the
> information in shared memory for performance reasons.

Dumb question: Is this something that could be solved by having the
postmaster track this information in its local memory and make it
available via a variable-sized IPC mechanism, such as a port or socket?
That would eliminate the need to clean things up after a crash; I'm not
sure if there would be other benefits.
--
Jim C. Nasby, Database Architect jim(at)nasby(dot)net
512.569.9461 (cell) http://jim.nasby.net


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jim Nasby <jim(at)nasby(dot)net>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Joachim Wieland <joe(at)mcknight(dot)de>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Snapshot synchronization, again...
Date: 2011-03-02 04:54:11
Message-ID: 5233.1299041651@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Jim Nasby <jim(at)nasby(dot)net> writes:
> Dumb question: Is this something that could be solved by having the
> postmaster track this information in its local memory and make it
> available via a variable-sized IPC mechanism, such as a port or socket?
> That would eliminate the need to clean things up after a crash; I'm not
> sure if there would be other benefits.

Involving the postmaster in this is entirely *not* reasonable. The
postmaster cannot do anything IPC-wise that the stats collector couldn't
do, and every additional function we load onto the postmaster is another
potential source of unrecoverable database-wide failures. The PM is
reliable only because it doesn't do much.

regards, tom lane


From: Jim Nasby <jim(at)nasby(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Joachim Wieland <joe(at)mcknight(dot)de>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Snapshot synchronization, again...
Date: 2011-03-03 23:52:58
Message-ID: 2C44F95C-C135-4F79-AB35-2F9E73181F73@nasby.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mar 1, 2011, at 10:54 PM, Tom Lane wrote:
> Jim Nasby <jim(at)nasby(dot)net> writes:
>> Dumb question: Is this something that could be solved by having the
>> postmaster track this information in its local memory and make it
>> available via a variable-sized IPC mechanism, such as a port or socket?
>> That would eliminate the need to clean things up after a crash; I'm not
>> sure if there would be other benefits.
>
> Involving the postmaster in this is entirely *not* reasonable. The
> postmaster cannot do anything IPC-wise that the stats collector couldn't
> do, and every additional function we load onto the postmaster is another
> potential source of unrecoverable database-wide failures. The PM is
> reliable only because it doesn't do much.

Makes sense. Doesn't have to be the postmaster; it could be some other process.

Anyway, I just wanted to throw the idea out as food for thought. I don't know if it'd be better or worse than temp files...
--
Jim C. Nasby, Database Architect jim(at)nasby(dot)net
512.569.9461 (cell) http://jim.nasby.net