Re: pika buildfarm member failure on isolationCheck tests

Lists: pgsql-hackers
From: Rémi Zara <remi_zara(at)mac(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>
Subject: pika buildfarm member failure on isolationCheck tests
Date: 2011-06-18 15:57:02
Message-ID: F935E4FA-1701-447B-8EC6-B14851D17FDE@mac.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

Pika failed at the isolationCheck phase, hitting an assertion:

TRAP: FailedAssertion("!(!((oldSerXidControl->tailXid) != ((TransactionId) 0)) || TransactionIdFollows(xid, oldSerXidControl->tailXid))", File: "predicate.c", Line: 991)

see http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=pika&dt=2011-06-17%2015%3A45%3A30

It seems that for this stage, the server log (which contains the failed assertion) is not sent back to the buildfarm server. Maybe that should be fixed ?

Regards,

Rémi Zara


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Rémi Zara <remi_zara(at)mac(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: pika buildfarm member failure on isolationCheck tests
Date: 2011-06-18 21:21:40
Message-ID: 4DFD16E4.7090403@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 06/18/2011 11:57 AM, Rémi Zara wrote:
> Hi,
>
> Pika failed at the isolationCheck phase, hitting an assertion:
>
> TRAP: FailedAssertion("!(!((oldSerXidControl->tailXid) != ((TransactionId) 0)) || TransactionIdFollows(xid, oldSerXidControl->tailXid))", File: "predicate.c", Line: 991)
>
> see http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=pika&dt=2011-06-17%2015%3A45%3A30
>
> It seems that for this stage, the server log (which contains the failed assertion) is not sent back to the buildfarm server. Maybe that should be fixed ?
>
>

Change committed. See
<https://github.com/PGBuildFarm/client-code/commit/dd49c04>. It will be
in the next release.

cheers

andrew


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Rémi Zara <remi_zara(at)mac(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Andrew Dunstan <andrew(at)dunslane(dot)net>
Subject: Re: pika buildfarm member failure on isolationCheck tests
Date: 2011-06-20 01:10:02
Message-ID: BANLkTik29uvHyPyHTVA5LNo2p7WUsRH__w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Jun 18, 2011 at 11:57 AM, Rémi Zara <remi_zara(at)mac(dot)com> wrote:
> Pika failed at the isolationCheck phase, hitting an assertion:
>
> TRAP: FailedAssertion("!(!((oldSerXidControl->tailXid) != ((TransactionId) 0)) || TransactionIdFollows(xid, oldSerXidControl->tailXid))", File: "predicate.c", Line: 991)
>
> see http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=pika&dt=2011-06-17%2015%3A45%3A30
>
> It seems that for this stage, the server log (which contains the failed assertion) is not sent back to the buildfarm server. Maybe that should be fixed ?

Is this an open item for 9.1beta3?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Dan Ports <drkp(at)csail(dot)mit(dot)edu>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: R?mi Zara <remi_zara(at)mac(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Subject: Re: pika buildfarm member failure on isolationCheck tests
Date: 2011-06-20 09:15:29
Message-ID: 20110620091529.GG83336@csail.mit.edu
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Jun 19, 2011 at 09:10:02PM -0400, Robert Haas wrote:
> Is this an open item for 9.1beta3?

Yes. I've put it on the list.

The SxactGlobalXmin and its refcount were getting out of sync with the
actual transaction state. This is timing-dependent but I can reproduce it
fairly reliably under concurrent workloads.

It looks the problem comes from the change a couple days ago that
removed the SXACT_FLAG_ROLLED_BACK flag and changed the
SxactIsRolledBack checks to SxactIsDoomed. That's the correct thing to
do everywhere else, but gets us in trouble here. We shouldn't be
ignoring doomed transactions in SetNewSxactGlobalXmin, because they're
not eligible for cleanup until the associated backend aborts the
transaction and calls ReleasePredicateLocks.

However, it isn't as simple as just removing the SxactIsDoomed check
from SetNewSxactGlobalXmin. ReleasePredicateLocks calls
SetNewSxactGlobalXmin *before* it releases the aborted transaction's
SerializableXact (it pretty much has to, because we must drop and
reacquire SerializableXactHashLock in between). But we don't want the
aborted transaction included in the SxactGlobalXmin computation.

It seems like we do need that SXACT_FLAG_ROLLED_BACK after all, even
though it's only set for this brief interval. We need to be able to
distinguish a transaction that's just been marked for death (doomed)
from one that's already called ReleasePredicateLocks.

Dan

--
Dan R. K. Ports MIT CSAIL http://drkp.net/


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Dan Ports" <drkp(at)csail(dot)mit(dot)edu>
Cc: "Andrew Dunstan" <andrew(at)dunslane(dot)net>, "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "Robert Haas" <robertmhaas(at)gmail(dot)com>, "R?mi Zara" <remi_zara(at)mac(dot)com>, "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: pika buildfarm member failure on isolationCheck tests
Date: 2011-06-20 14:40:44
Message-ID: 4DFF159C020000250003E95F@gw.wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Dan Ports <drkp(at)csail(dot)mit(dot)edu> wrote:

> It looks the problem comes from the change a couple days ago that
> removed the SXACT_FLAG_ROLLED_BACK flag and changed the
> SxactIsRolledBack checks to SxactIsDoomed. That's the correct
> thing to do everywhere else, but gets us in trouble here. We
> shouldn't be ignoring doomed transactions in
> SetNewSxactGlobalXmin, because they're not eligible for cleanup
> until the associated backend aborts the transaction and calls
> ReleasePredicateLocks.
>
> However, it isn't as simple as just removing the SxactIsDoomed
> check from SetNewSxactGlobalXmin. ReleasePredicateLocks calls
> SetNewSxactGlobalXmin *before* it releases the aborted
> transaction's SerializableXact (it pretty much has to, because we
> must drop and reacquire SerializableXactHashLock in between). But
> we don't want the aborted transaction included in the
> SxactGlobalXmin computation.
>
> It seems like we do need that SXACT_FLAG_ROLLED_BACK after all,
> even though it's only set for this brief interval. We need to be
> able to distinguish a transaction that's just been marked for
> death (doomed) from one that's already called
> ReleasePredicateLocks.

I just looked this over and I agree. It looks to me like we can set
this flag in ReleasePredicateLocks (along with the doomed flag) and
test it only in SetNewSxactGlobalXmin (instead of the doomed flag).
Fundamentally, the problem is that while it's enough to know that a
transaction *will be* canceled in order to ignore it during
detection of rw-conflicts and dangerous structures, it's not enough
when calculating the new serializable xmin -- we need to know that a
transaction actually *has been* canceled to ignore it there.

Essentially, the fix is to revert a very small part of
http://git.postgresql.org/gitweb?p=postgresql.git;a=commit;h=264a6b127a918800d9f8bac80b5f4a8a8799d0f1

Either of us could easily fix this, but since you have the test
environment which can reproduce the problem, it's probably best that
you do it.

Since Heikki took the trouble to put the flags in a nice logical
order, we should probably put this right after the doomed flag.

-Kevin


From: Dan Ports <drkp(at)csail(dot)mit(dot)edu>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: R?mi Zara <remi_zara(at)mac(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Subject: Re: pika buildfarm member failure on isolationCheck tests
Date: 2011-06-21 02:18:28
Message-ID: 20110621021828.GL83336@csail.mit.edu
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

While testing the fix for this one, I found another bug. Patches for
both are attached.

The first patch addresses this bug by re-adding SXACT_FLAG_ROLLED_BACK,
in a more limited form than its previous incarnation.

We need to be able to distinguish transactions that have already
called ReleasePredicateLocks and are thus eligible for cleanup from
those that have been merely marked for abort by other
backends. Transactions that are ROLLED_BACK are excluded from
SxactGlobalXmin calculations, but those that are merely DOOMED need to
be included.

Also update a couple of assertions to ensure we only try to clean up
ROLLED_BACK transactions.

The second patch fixes a bug in PreCommit_CheckForSerializationFailure.
This function checks whether there's a dangerous structure of the form
far ---> near ---> me
where neither the "far" or "near" transactions have committed. If so,
it aborts the "near" transaction by marking it as DOOMED. However, that
transaction might already be PREPARED. We need to check whether that's
the case and, if so, abort the transaction that's trying to commit
instead.

One of the prepared_xacts regression tests actually hits this bug.
I removed the anomaly from the duplicate-gids test so that it fails in
the intended way, and added a new test to check serialization failures
with a prepared transaction.

Dan

--
Dan R. K. Ports MIT CSAIL http://drkp.net/

Attachment Content-Type Size
readd-rolled-back-flag.patch text/x-diff 4.2 KB
check-conflict-prepared.patch text/x-diff 3.9 KB

From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Dan Ports <drkp(at)csail(dot)mit(dot)edu>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, R?mi Zara <remi_zara(at)mac(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Subject: Re: pika buildfarm member failure on isolationCheck tests
Date: 2011-06-21 12:01:48
Message-ID: 4E00882C.8080900@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 21.06.2011 05:18, Dan Ports wrote:
> The first patch addresses this bug by re-adding SXACT_FLAG_ROLLED_BACK,
> in a more limited form than its previous incarnation.
>
> We need to be able to distinguish transactions that have already
> called ReleasePredicateLocks and are thus eligible for cleanup from
> those that have been merely marked for abort by other
> backends. Transactions that are ROLLED_BACK are excluded from
> SxactGlobalXmin calculations, but those that are merely DOOMED need to
> be included.
>
> Also update a couple of assertions to ensure we only try to clean up
> ROLLED_BACK transactions.

Thanks, committed.

> The second patch fixes a bug in PreCommit_CheckForSerializationFailure.
> This function checks whether there's a dangerous structure of the form
> far ---> near ---> me
> where neither the "far" or "near" transactions have committed. If so,
> it aborts the "near" transaction by marking it as DOOMED. However, that
> transaction might already be PREPARED. We need to check whether that's
> the case and, if so, abort the transaction that's trying to commit
> instead.

Yep, committed. All the other places where we set the DOOMED flag seem
to handle that already.

In the long term, I'm not sure this is the best way to handle this. It
feels a bit silly to set the flag, release the SerializableXactHashLock,
and reacquire it later to remove the transaction from the hash table.
Surely that could be done in some more straightforward way. But I don't
feel like fiddling with it this late in the release cycle.

> One of the prepared_xacts regression tests actually hits this bug.
> I removed the anomaly from the duplicate-gids test so that it fails in
> the intended way, and added a new test to check serialization failures
> with a prepared transaction.

Hmm, I have ran "make installcheck" with
default_transaction_isolation='serializable' earlier. I wonder why I
didn't see that.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Dan Ports <drkp(at)csail(dot)mit(dot)edu>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, R?mi Zara <remi_zara(at)mac(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Subject: Re: pika buildfarm member failure on isolationCheck tests
Date: 2011-06-22 05:31:11
Message-ID: 20110622053111.GO83336@csail.mit.edu
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Jun 21, 2011 at 03:01:48PM +0300, Heikki Linnakangas wrote:
> Thanks, committed.

Thanks.

> In the long term, I'm not sure this is the best way to handle this. It
> feels a bit silly to set the flag, release the SerializableXactHashLock,
> and reacquire it later to remove the transaction from the hash table.
> Surely that could be done in some more straightforward way. But I don't
> feel like fiddling with it this late in the release cycle.

Yes, I suspect it can be done better. The reason it's tricky is a lock
ordering issue; part of releasing a SerializableXact has to be done
while holding SerializableXactHashLock and part has to be done without
it (because it involves taking partition locks). Reworking the order of
these things might work, but would require some careful thought since
most of the code is shared with the non-abort cleanup paths. And yes,
it's definitely the time for that.

I've been meaning to take another look at this part of the code anyway,
with an eye towards performance. SerializableXactHashLock can be a
bottleneck on certain in-memory workloads.

> > One of the prepared_xacts regression tests actually hits this bug.
> > I removed the anomaly from the duplicate-gids test so that it fails in
> > the intended way, and added a new test to check serialization failures
> > with a prepared transaction.
>
> Hmm, I have ran "make installcheck" with
> default_transaction_isolation='serializable' earlier. I wonder why I
> didn't see that.

The affected test was being run at SERIALIZABLE already, so that
wouldn't have made a difference. One reason I didn't notice an issue
when I looked at that test before before, is that it was intended to
fail anyway, just with a different error. I didn't think too hard about
which error would take precedence.

Dan

--
Dan R. K. Ports MIT CSAIL http://drkp.net/


From: Dan Ports <drkp(at)csail(dot)mit(dot)edu>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, R?mi Zara <remi_zara(at)mac(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Subject: Re: pika buildfarm member failure on isolationCheck tests
Date: 2011-06-22 05:44:58
Message-ID: 20110622054458.GQ83336@csail.mit.edu
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Jun 22, 2011 at 01:31:11AM -0400, Dan Ports wrote:
> Yes, I suspect it can be done better. The reason it's tricky is a lock
> ordering issue; part of releasing a SerializableXact has to be done
> while holding SerializableXactHashLock and part has to be done without
> it (because it involves taking partition locks). Reworking the order of
> these things might work, but would require some careful thought since
> most of the code is shared with the non-abort cleanup paths. And yes,
> it's definitely the time for that.

...by which I mean it's definitely *not* the time for that, of course.

Dan

--
Dan R. K. Ports MIT CSAIL http://drkp.net/