Re: random isolation test failures

Lists: pgsql-hackers
From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: random isolation test failures
Date: 2011-09-26 15:43:37
Message-ID: 4E809DA9.1020109@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


We are seeing numerous occasional buildfarm failures of the fk-deadlock2
isolation test, that look like this:

***************
*** 32,39 ****
step s2u1: UPDATE B SET Col2 = 1 WHERE BID = 2;
step s1u2: UPDATE B SET Col2 = 1 WHERE BID = 2;<waiting ...>
step s2u2: UPDATE B SET Col2 = 1 WHERE BID = 2;
- step s1u2:<... completed>
ERROR: deadlock detected
step s2c: COMMIT;
step s1c: COMMIT;

--- 32,39 ----
step s2u1: UPDATE B SET Col2 = 1 WHERE BID = 2;
step s1u2: UPDATE B SET Col2 = 1 WHERE BID = 2;<waiting ...>
step s2u2: UPDATE B SET Col2 = 1 WHERE BID = 2;
ERROR: deadlock detected
+ step s1u2:<... completed>
step s2c: COMMIT;
step s1c: COMMIT;

If this is harmless, we could provide an alternative results file as a
simple fix. If it's not harmless, it should be fixed.

cheers

andrew


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Alvaro Herrera" <alvherre(at)alvh(dot)no-ip(dot)org>, "Andrew Dunstan" <andrew(at)dunslane(dot)net>
Cc: "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: random isolation test failures
Date: 2011-09-26 16:04:13
Message-ID: 4E805C2D0200002500041761@gw.wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andrew Dunstan <andrew(at)dunslane(dot)net> wrote:

> We are seeing numerous occasional buildfarm failures of the
> fk-deadlock2 isolation test

> If this is harmless, we could provide an alternative results file
> as a simple fix. If it's not harmless, it should be fixed.

I agree, but don't look at me. I'm not the one who added the tests,
nor are they related to serializable snapshot isolation. Tom
recently raised the same issue on this thread:

http://archives.postgresql.org/pgsql-hackers/2011-09/msg00991.php

Alvaro?

-Kevin


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: random isolation test failures
Date: 2011-09-26 17:10:27
Message-ID: 9646.1317057027@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> We are seeing numerous occasional buildfarm failures of the fk-deadlock2
> isolation test,

Yeah, I complained about this already, but Kevin disclaims all
responsibility for the fk isolation tests. It looks like Alvaro
and Noah Misch are the people to be harassing.

regards, tom lane


From: Noah Misch <noah(at)leadboat(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: random isolation test failures
Date: 2011-09-27 00:57:40
Message-ID: 20110927005740.GA17938@tornado.leadboat.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Sep 26, 2011 at 01:10:27PM -0400, Tom Lane wrote:
> Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> > We are seeing numerous occasional buildfarm failures of the fk-deadlock2
> > isolation test,
>
> Yeah, I complained about this already, but Kevin disclaims all
> responsibility for the fk isolation tests. It looks like Alvaro
> and Noah Misch are the people to be harassing.

Yep; I took advantage of Kevin's test harness for some unrelated tests.

These sporadic failures happen whenever the test case takes longer than
deadlock_timeout (currently 100ms for these tests) to setup the deadlock. I
outlined some mitigating strategies here:
http://archives.postgresql.org/message-id/20110727171438.GE18910@tornado.leadboat.com

I'd vote for #1: let's double the deadlock_timeout until the failures stop.
Other opinions?

Thanks,
nm

Attachment Content-Type Size
fk-isolation-200ms.patch text/plain 1.1 KB

From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Kevin Grittner <kevin(dot)grittner(at)wicourts(dot)gov>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: random isolation test failures
Date: 2011-09-27 03:29:00
Message-ID: 1317093816-sup-4240@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Excerpts from Noah Misch's message of lun sep 26 21:57:40 -0300 2011:

> These sporadic failures happen whenever the test case takes longer than
> deadlock_timeout (currently 100ms for these tests) to setup the deadlock. I
> outlined some mitigating strategies here:
> http://archives.postgresql.org/message-id/20110727171438.GE18910@tornado.leadboat.com
>
> I'd vote for #1: let's double the deadlock_timeout until the failures stop.
> Other opinions?

I just tweaked isolationtester so that it collects the error messages
and displays them all together at the end of the test. After seeing it
run, I didn't like it -- I think I prefer something more local, so that
in the only case where we call try_complete_step twice in the loop, we
report any errors in either. AFAICS this would make both expected cases
behave identically in test output. The only thing left to figure out is
where to store the error message between calls ... clearly Step is not
the right place for it. I'm on it now, anyway.

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Attachment Content-Type Size
isolation-fix.patch application/octet-stream 2.5 KB

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Noah Misch <noah(at)leadboat(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Kevin Grittner <kevin(dot)grittner(at)wicourts(dot)gov>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: random isolation test failures
Date: 2011-09-27 04:11:39
Message-ID: 13956.1317096699@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> I just tweaked isolationtester so that it collects the error messages
> and displays them all together at the end of the test. After seeing it
> run, I didn't like it -- I think I prefer something more local, so that
> in the only case where we call try_complete_step twice in the loop, we
> report any errors in either. AFAICS this would make both expected cases
> behave identically in test output.

Hmm, is that really an appropriate fix? I'm worried that it might mask
event-ordering differences that actually are significant.

regards, tom lane


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Noah Misch <noah(at)leadboat(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Kevin Grittner <kevin(dot)grittner(at)wicourts(dot)gov>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: random isolation test failures
Date: 2011-09-27 19:22:57
Message-ID: 1317149986-sup-664@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Excerpts from Tom Lane's message of mar sep 27 01:11:39 -0300 2011:
>
> Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> > I just tweaked isolationtester so that it collects the error messages
> > and displays them all together at the end of the test. After seeing it
> > run, I didn't like it -- I think I prefer something more local, so that
> > in the only case where we call try_complete_step twice in the loop, we
> > report any errors in either. AFAICS this would make both expected cases
> > behave identically in test output.
>
> Hmm, is that really an appropriate fix? I'm worried that it might mask
> event-ordering differences that actually are significant.

In the attached, it only affects the case where there is one blocking
command and another command that unblocks it; this is only exercised by
the much-beaten fk-deadlock cases. If either of the steps fails with a
deadlock error, it is reported identically, i.e. the error message is
emitted as

"error in s1u1 s2u1: ERROR: deadlock detected"

So the deadlock could have been detected in either s1u1 or s2u1; we
don't really care.

The way error messages are reported in all the other cases is not
changed, and these do not have a prefix; so if anything were to behave
differently, we would find out because a spurious prefix would appear.

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Attachment Content-Type Size
isolation-fix-2.patch application/octet-stream 25.4 KB

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Noah Misch <noah(at)leadboat(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Kevin Grittner <kevin(dot)grittner(at)wicourts(dot)gov>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: random isolation test failures
Date: 2011-09-27 19:59:13
Message-ID: 28334.1317153553@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> Excerpts from Tom Lane's message of mar sep 27 01:11:39 -0300 2011:
>> Hmm, is that really an appropriate fix? I'm worried that it might mask
>> event-ordering differences that actually are significant.

> In the attached, it only affects the case where there is one blocking
> command and another command that unblocks it; this is only exercised by
> the much-beaten fk-deadlock cases. If either of the steps fails with a
> deadlock error, it is reported identically, i.e. the error message is
> emitted as
> "error in s1u1 s2u1: ERROR: deadlock detected"
> So the deadlock could have been detected in either s1u1 or s2u1; we
> don't really care.

Hmm. For the case of "deadlock detected", we actually don't *want* to
care because the infrastructure is such that either process might report
it. So I agree that this is a good fix for that case. I'm just worried
whether it will obscure other situations where it's important to know
which command failed. But if you're convinced there aren't any, fine.

regards, tom lane