Quick Links

Re: Snapshot synchronization, again...

From:	Noah Misch <noah(at)leadboat(dot)com>
To:	Joachim Wieland <joe(at)mcknight(dot)de>
Cc:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Snapshot synchronization, again...
Date:	2011-01-20 06:37:03
Message-ID:	20110120063703.GB13329@tornado.leadboat.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Wed, Jan 19, 2011 at 12:12:39AM -0500, Joachim Wieland wrote:
> Noah, thank you for this excellent review. I have taken over most
> (allmost all) of your suggestions (except for the documentation which
> is still missing). Please check the updated & attached patch for
> details.

Great. I do get an assertion failure with this sequence:

BEGIN; SELECT pg_export_snapshot(); ROLLBACK;
SELECT 1;

TRAP: FailedAssertion("!(RegisteredSnapshots > 0)", File: "snapmgr.c", Line: 430)

Perhaps some cleanup is missing from a ROLLBACK path?

>
> On Fri, Jan 14, 2011 at 10:13 PM, Noah Misch <noah(at)leadboat(dot)com> wrote:
> > "00000000000000000000" is a valid md5 message digest. ?To avoid always accepting
> > a snapshot with that digest, we would need to separately store a flag indicating
> > whether a given syncSnapshotChksums member is currently valid. ?Maybe that hole
> > is acceptable, though.
>
> In the end I decided to store md5 checksums as printable characters in
> shmem. We now need a few more bytes for each checksum but as soon as a
> byte is NUL we know that it is not a valid checksum.

Just to clarify, I was envisioning something like:

typedef struct { bool valid; char value[16]; } snapshotChksum;

I don't object to the way you've done it, but someone else might not like the
extra marshalling between text and binary. Your call.

> >> + ? ? ?* Instead we must check to not go forward (we might have opened a cursor
> >> + ? ? ?* in this transaction and still have its snapshot registered)
> >> + ? ? ?*/
> >
> > I'm thinking this case should yield an ERROR. ?Otherwise, our net result would
> > be to silently adopt a snapshot that is neither our old self nor the target.
> > Maybe there's a use case for that, but none comes to my mind.
>
> This can happen when you do:
>
> DECLARE foo CURSOR FOR SELECT * FROM bar;
> import snapshot...
> FETCH 1 FROM foo;

I see now. You set snapshot->xmin unconditionally, but never move MyProc->xmin
forward, only backward. Yes; that seems just right.

> > I guess the use case for pg_import_snapshot from READ COMMITTED would be
> > something like "DO $$BEGIN PERFORM pg_import_snapshot('...'); stuff; END$$;".
> > First off, would that work ("stuff" use the imported snapshot)? ?If it does
> > work, we should take the precedent of SET LOCAL and issue no warning. ?If it
> > doesn't work, then we should document that the snapshot does take effect until
> > the next command and probably keep this warning or something similar.
>
> No, this will also give you a new snapshot for every query, no matter
> if it is executed within or outside of a DO-Block.

You're right. Then consider "VALUES (pg_import_snapshot('...'), (SELECT
count(*) from t))" at READ COMMITTED. It works roughly as I'd guess; the
subquery uses the imported snapshot. If I flip the two expressions and do
"VALUES ((SELECT count(*) from t), pg_import_snapshot('...'))", the subquery
uses the normal snapshot. That makes sense, but I can't really see a use case
for it. A warning, like your code has today, seems appropriate.

> > Is it valid to scribble directly on snapshots like this?
>
> I figured that previously executed code still has references pointing
> to the snapshots and so we cannot just push a new snapshot on top but
> really need to change the memory where they are pointing to.

Okay. Had no special reason to believe otherwise, just didn't know.

> I am also adding a test script that shows the difference of READ
> COMMITTED and SERIALIZABLE in an importing transaction in a DO-Block.
> It's based on the script you sent.

Thanks; that was handy. One thing I noticed is that the second "SELECT * FROM
kidseen" yields zero rows instead of yielding "ERROR: relation "kidseen" does
not exist". I changed things around per the attached "test-drop.pl", such that
the table is already gone in the ordinary snapshot, but still visible to the
imported snapshot. Note how the pg_class row is visible, but an actual attempt
to query the table fails. Must some kind of syscache invalidation follow the
snapshot alteration to make this behave normally?

General tests involving only DML seem to do the right thing. Subtransaction
handling looks sound. Patch runs cleanly according to Valgrind.

> So thanks again and I'm looking forward to your next review... :-)

Hope this one helps, too.

> diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
> index 8b36df4..29c9426 100644

> + bytea *
> + ExportSnapshot(Snapshot snapshot)
> + {
> + int bufsize = 1024;
> + int bufsize_filled = 0; /* doesn't include NUL byte */
> + int bufsize_left = bufsize - bufsize_filled;
> + char *buf = (char *) palloc(bufsize);
> + int i;
> + TransactionId *children;
> + int nchildren;
> +
> + /* In a subtransaction we don't see our open subxip values in the snapshot
> + * so they would be missing in the backend applying it. */
> + /* XXX is ERRCODE_INVALID_TRANSACTION_STATE more appropriate or less?
> + * It's a valid transaction state but invalid for the requested operation. */
> + if (GetCurrentTransactionNestLevel() != 1)
> + ereport(ERROR,
> + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> + errmsg("can only export a snapshot from a top level transaction")));

When I'm not sure about questions like this (as in, 99.8% of the time), I
usually grep one of the .po files for messages that seem similar, then see what
they used. In this case, the most-similar is "SET TRANSACTION ISOLATION LEVEL
must not be called in a subtransaction" in assign_XactIsoLevel(), which uses
ERRCODE_ACTIVE_SQL_TRANSACTION. I'd also consider making the message similar to
that one, like "cannot export a snapshot from a subtransaction".

> + static int
> + parseIntFromText(char **s, const char *prefix, int notfound)
> + {
> + char *n, *p = strstr(*s, prefix);
> + int i;
> +
> + if (!p)
> + return notfound;
> + p += strlen(prefix);
> + n = strchr(p, ' ');
> + if (!n)
> + return notfound;
> + *n = '\0';
> + i = DatumGetObjectId(DirectFunctionCall1(int4in, CStringGetDatum(p)));

DatumGetInt32

> + bool
> + ImportSnapshot(bytea *snapshotData, Snapshot snapshot)
> + {
> + snapshotChksum cksum;
> + Oid databaseId;
> + BackendId backendId;
> + int i;
> + TransactionId xid;
> + int len = VARSIZE_ANY_EXHDR(snapshotData);
> + char *s = palloc(len + 1);
> + MemoryContext oldctx;
> +
> + /* XXX is ERRCODE_INVALID_TRANSACTION_STATE more appropriate or less?
> + * It's a valid transaction state but invalid for the requested operation. */
> + if (GetCurrentTransactionNestLevel() != 1)
> + ereport(ERROR,
> + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> + errmsg("can only import a snapshot to a top level transaction")));

See discussion of similar code above.

> +
> + if (len == 0)
> + ereport(ERROR,
> + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> + errmsg("invalid data in snapshot information")));
> +
> + strncpy(s, VARDATA_ANY(snapshotData), len);
> + s[len] = '\0';
> +
> + pg_md5_hash(s, len, (void *) &cksum);
> +
> + databaseId = parseOidFromText(&s, "did:");
> + backendId = (BackendId) parseIntFromText(&s, "bid:", (int) InvalidBackendId);
> +
> + if (databaseId != MyDatabaseId
> + || backendId == InvalidBackendId
> + /*
> + * Make sure backendId is in a reasonable range before using it as an
> + * array subscript.
> + */
> + || backendId >= PROCARRAY_MAXPROCS
> + || backendId < 0
> + || backendId == MyBackendId
> + /*
> + * Lock considerations:
> + *
> + * syncSnapshotChksums[backendId] is only changed by the backend with ID
> + * backendID and read by another backend that is asked to import a
> + * snapshot.
> + * syncSnapshotChksums[backendId] contains either NUL bytes or the checksum
> + * of a valid snapshot.
> + * Every comparision to the checksum while it is being written or
> + * deleted is okay to fail, so we don't need to take a lock at all.
> + */
> + || strcmp(cksum, syncSnapshotChksums[backendId]) != 0)
> + {
> + ereport(ERROR,
> + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> + errmsg("invalid data in snapshot information")));
> + }

That error message is good for the second, third, and fourth conditions: those
only happen when we're given something not generated by pg_export_snapshot().
The first and fifth tests deserve different error messages (perhaps "snapshot is
from a different database"; "cannot export and import a snapshot in the same
connection"). The sixth test is also a bit different; failure arises there when
the input is bogus or when the exporting transaction has ended. Failing to hold
that transaction open until all child transactions have imported might be a
common early mistake when using this feature. So, what about "snapshot not
found exported by any running transaction" for the sixth test?

> +
> + Assert(databaseId != InvalidOid);
> +
> + oldctx = MemoryContextSwitchTo(TopTransactionContext);

How about "MemoryContextAlloc(TopTransactionContext, ..." in place of the two
palloc calls and dropping this? Not sure.

> +
> + snapshot->xmin = parseXactFromText(&s, "xmi:");
> + Assert(snapshot->xmin != InvalidTransactionId);
> + snapshot->xmax = parseXactFromText(&s, "xma:");
> + Assert(snapshot->xmax != InvalidTransactionId);
> +
> + if (snapshot->copied)
> + {
> + int xcnt = parseIntFromText(&s, "xcnt:", 0);
> + /*
> + * Unfortunately CopySnapshot allocates just one large chunk of memory,
> + * and makes snapshot->xip then point past the end of the fixed-size
> + * snapshot data. That way we cannot just repalloc(snapshot->xip, ...).
> + * Neither can we just change the base address of the snapshot because
> + * this address might still be saved somewhere.
> + */
> + if (snapshot->xcnt < xcnt)
> + snapshot->xip = palloc(xcnt * sizeof(TransactionId));
> + }
> +
> + i = 0;
> + while ((xid = parseXactFromText(&s, "xip:")) != InvalidTransactionId)
> + {
> + if (xid == GetTopTransactionIdIfAny())
> + continue;
> + snapshot->xip[i++] = xid;
> + }
> + snapshot->xcnt = i;
> +
> + /*
> + * We only write "sof:1" if the snapshot overflowed. If not, then there is
> + * no "sof:x" entry at all and parseBoolFromText() will just return false.
> + */
> + snapshot->suboverflowed = parseBoolFromText(&s, "sof:");
> +
> + if (!snapshot->suboverflowed)
> + {
> + if (snapshot->copied)
> + {
> + int sxcnt = parseIntFromText(&s, "sxcnt:", 0);
> + if (snapshot->subxcnt < sxcnt)
> + snapshot->subxip = palloc(sxcnt * sizeof(TransactionId));
> + }
> +
> + i = 0;
> + while ((xid = parseXactFromText(&s, "sxp:")) != InvalidTransactionId)
> + snapshot->subxip[i++] = xid;
> + snapshot->subxcnt = i;
> + }
> +
> + MemoryContextSwitchTo(oldctx);
> +
> + /* Leave snapshot->curcid as is. */
> +
> + /*
> + * No ProcArrayLock held here, we assume that a write is atomic. Also note
> + * that MyProc->xmin can go backwards here. However this is safe because
> + * the xmin we set here is the same as in the backend's proc->xmin whose
> + * snapshot we are copying. At this very moment, anybody computing a
> + * minimum will calculate at least this xmin as the overall xmin with or
> + * without us setting MyProc->xmin to this value.
> + * Instead we must check to not go forward (we might have opened a cursor
> + * in this transaction and still have its snapshot registered)
> + */
> + if (TransactionIdPrecedes(snapshot->xmin, MyProc->xmin))
> + MyProc->xmin = snapshot->xmin;

The write is atomic, in the sense that this process cannot be interrupted with
the update in progress. You'd still need a memory barrier to make other
processes see the change immediately on all architectures we support. I'd
suggest taking ProcArrayLock and adding a comment to the effect that we could
make it lockless, should we ever adopt such methods. I don't advise testing
those waters as part of this patch.

> +
> + /*
> + * Check the checksum again to prevent a race condition. If the exporting
> + * backend invalidated its snapshot since we last checked or before we
> + * finish with this second check, then we fail here and error out, thus
> + * invalidating the snapshot we've built up.
> + * We could succeed here even though the exporting transaction is at the
> + * same time invalidating the checksum while we are checking it here for
> + * the second time. But even then we are good, because we have set up our
> + * MyProc record already.
> + */
> + if (strcmp(cksum, syncSnapshotChksums[backendId]) != 0)
> + ereport(ERROR,
> + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> + errmsg("invalid data in snapshot information")));

I was trying to decide whether this is vulnerable to the ABA problem, and I
believe it is not. It is possible that the original exporting transaction
completed and a new transaction exported the same snapshot, all between our
first checksum comparison and now. However, this would also mean that no new
permanent transactions IDs got allocated in the gap. So I think we're good.

This message should continue to mirror the one from the earlier test, of course.

> +
> + return true;
> + }
> +
> +
> diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
> index 45b92a0..f74ea76 100644

> + Datum
> + pg_import_snapshot(PG_FUNCTION_ARGS)
> + {
> + bytea *snapshotData = PG_GETARG_BYTEA_P(0);
> + bool ret = true;
> +
> + if (ActiveSnapshotSet())
> + ret = ImportSnapshot(snapshotData, GetActiveSnapshot());
> +
> + ret &= ImportSnapshot(snapshotData, GetTransactionSnapshot());
> +
> + /*
> + * If we are in read committed mode then the next query will execute with a
> + * new snapshot making this function call quite useless.
> + */
> + if (!IsolationUsesXactSnapshot())
> + ereport(WARNING,
> + (errcode(ERRCODE_INAPPROPRIATE_ISOLATION_LEVEL_FOR_BRANCH_TRANSACTION),
> + errmsg("a snapshot importing transaction should be ISOLATION LEVEL SERIALIZABLE"),
> + errhint("A transaction of isolation level READ COMMITTED gives you a new snapshot for each query.")));

Again, REPEATABLE READ is a perfectly good isolation level for this use. The
test allows it, but the error message does not.

Attachment	Content-Type	Size
test-drop.pl	application/x-perl	1.3 KB

In response to

Re: Snapshot synchronization, again... at 2011-01-19 05:12:39 from Joachim Wieland

Responses

Re: Snapshot synchronization, again... at 2011-01-30 17:36:12 from Joachim Wieland

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Itagaki Takahiro	2011-01-20 06:37:12	Re: REVIEW: EXPLAIN and nfiltered
Previous Message	Noah Misch	2011-01-20 05:57:53	Re: ALTER TYPE 1: recheck index-based constraints