Re: isolation test deadlocking on buildfarm member coypu

Lists: pgsql-hackers
From: Rémi Zara <remi_zara(at)mac(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: isolation test deadlocking on buildfarm member coypu
Date: 2011-07-16 15:50:45
Message-ID: 4082D6E2-8965-4D90-A3AC-E1008DFEF23F@mac.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

Isolation tests seem to deadlock on buildfarm member coypu (NetBSD/powerpc 5.1).

Here are the process for one deadlock (several days) :

pgbuildfarm 2405 0.0 1.2 26948 7916 ? Is Wed02AM 0:00.25 postgres: pgbuildfarm isolationtest [local] INSERT waiting
pgbuildfarm 6559 0.0 0.4 10288 2684 ? I Wed02AM 0:00.05 ./pg_isolation_regress --psqldir=/home/pgbuildfarm/workdir/HEAD/inst/bin --inputdir=. --schedule=./isolation_schedule (pg_isolation_reg)
pgbuildfarm 8296 0.0 1.8 25924 11880 ? Ss Wed02AM 0:59.79 postgres: writer process
pgbuildfarm 10630 0.0 0.4 16212 2916 ? Ss Wed02AM 1:43.16 postgres: stats collector process
pgbuildfarm 11814 0.0 0.7 9084 4408 ? I Wed01AM 0:00.39 /usr/pkg/bin/perl ./run_branches.pl --run-one
pgbuildfarm 14785 0.0 0.2 3344 1276 ? I Wed02AM 0:00.03 gmake NO_LOCALE=1 installcheck
pgbuildfarm 15353 0.0 0.8 26948 4960 ? Ss Wed02AM 1:10.55 postgres: autovacuum launcher process
pgbuildfarm 16619 0.0 1.8 15992 11512 ? I Wed01AM 0:04.90 /usr/pkg/bin/perl ./run_build.pl --config build-farm.conf HEAD
pgbuildfarm 17532 0.0 0.5 25924 3052 ? Ss Wed02AM 0:34.49 postgres: wal writer process
pgbuildfarm 18073 0.0 1.0 25924 6844 ? I Wed02AM 0:19.83 /home/pgbuildfarm/workdir/HEAD/inst/bin/postgres -D data-C
pgbuildfarm 21310 0.0 1.2 26948 7948 ? Is Wed02AM 0:00.06 postgres: pgbuildfarm isolationtest [local] idle in transaction
pgbuildfarm 24137 0.0 0.2 3456 1064 ? Is Wed01AM 0:00.01 /bin/sh -c cd /home/pgbuildfarm/client-code && ./run_branches.pl --run-one
pgbuildfarm 26932 0.0 0.2 3456 1072 ? I Wed02AM 0:00.01 sh -c cd pgsql.16619/src/test/isolation && gmake NO_LOCALE=1 installcheck 2>&1
pgbuildfarm 28430 0.0 1.3 26948 8816 ? Ss Wed02AM 408:43.91 postgres: pgbuildfarm isolationtest [local] idle
pgbuildfarm 28809 0.0 0.4 10476 2844 ? R Wed02AM 27:48.24 ./isolationtester dbname=isolationtest

I killed one of the postgres process, and it lead to a failure of the buildfarm perl script (probably due to the 2GB+ log file), and persistance of the postgres cluster (which I had to clean manually)
I've kept the source and install directories, if useful for investigation.

Here are the processes for a second lock (several hours):

pgbuildfarm 1423 0.0 0.4 16212 2888 ? Ss 2:11PM 0:04.07 postgres: stats collector process
pgbuildfarm 1604 0.0 1.2 26948 7724 ? Is 2:13PM 0:00.05 postgres: pgbuildfarm isolationtest [local] idle in transaction
pgbuildfarm 2380 0.0 0.4 10288 2624 ? I 2:12PM 0:00.05 ./pg_isolation_regress --psqldir=/home/pgbuildfarm/workdir/HEAD/inst/bin --inputdir=. --schedule=./isolation_schedule (pg_isolation_reg)
pgbuildfarm 3156 0.0 0.8 26948 4920 ? Ss 2:11PM 0:02.71 postgres: autovacuum launcher process
pgbuildfarm 3253 0.0 0.5 25924 3044 ? Ss 2:11PM 0:01.65 postgres: wal writer process
pgbuildfarm 9889 0.0 1.8 25924 11856 ? Ss 2:11PM 0:04.87 postgres: writer process
pgbuildfarm 11359 0.0 1.4 26948 9032 ? Ss 2:13PM 15:16.63 postgres: pgbuildfarm isolationtest [local] idle
pgbuildfarm 13738 0.0 1.2 26948 7860 ? Ss 2:13PM 0:00.06 postgres: pgbuildfarm isolationtest [local] UPDATE waiting
pgbuildfarm 14031 0.0 1.0 25924 6800 ? S 2:11PM 0:01.82 /home/pgbuildfarm/workdir/HEAD/inst/bin/postgres -D data-C
pgbuildfarm 16555 0.0 0.4 10476 2764 ? S 2:13PM 1:01.77 ./isolationtester dbname=isolationtest
pgbuildfarm 18038 0.0 0.2 3456 1008 ? I 2:11PM 0:00.01 sh -c cd pgsql.20707/src/test/isolation && gmake NO_LOCALE=1 installcheck 2>&1
pgbuildfarm 19057 0.0 0.2 3456 1000 ? Is 1:30PM 0:00.01 /bin/sh -c cd /home/pgbuildfarm/client-code && ./run_branches.pl --run-one
pgbuildfarm 20707 0.0 1.8 15992 11460 ? I 1:30PM 0:04.92 /usr/pkg/bin/perl ./run_build.pl --config build-farm.conf HEAD
pgbuildfarm 21749 0.0 0.2 3344 1248 ? I 2:11PM 0:00.04 gmake NO_LOCALE=1 installcheck
pgbuildfarm 23054 0.0 0.7 9084 4352 ? I 1:30PM 0:00.38 /usr/pkg/bin/perl ./run_branches.pl --run-one
pgbuildfarm 2125 0.0 0.0 2652 0 ttyp0 R+ 5:09PM 0:00.00 grep pgbuildf (bash)
pgbuildfarm 9930 0.0 0.3 2652 2072 ttyp0 S 5:03PM 0:00.09 bash
pgbuildfarm 29194 0.0 0.1 3552 872 ttyp0 O+ 5:09PM 0:00.01 ps -aux
pgbuildfarm(at)cube:workdir$date
Sat Jul 16 17:10:13 CEST 2011

The log fills with:

[4e21807d.2c5f:1060074] LOG: execute isolationtester_waiting: SELECT 1 WHERE pg_stat_get_backend_waiting($1)
[4e21807d.2c5f:1060075] DETAIL: parameters: $1 = '5'
[4e21807d.2c5f:1060076] LOG: execute isolationtester_waiting: SELECT 1 WHERE pg_stat_get_backend_waiting($1)
[4e21807d.2c5f:1060077] DETAIL: parameters: $1 = '5'

I killed process 1604 (postgres: pgbuildfarm isolationtest [local] idle in transaction). Again, the buildfarm perlscript kicked into action. See http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=coypu&dt=2011-07-16%2011%3A30%3A02 for the logs

Regards,

Rémi Zara


From: Noah Misch <noah(at)2ndQuadrant(dot)com>
To: Rémi Zara <remi_zara(at)mac(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: isolation test deadlocking on buildfarm member coypu
Date: 2011-07-16 17:25:03
Message-ID: 20110716172500.GA14724@tornado.leadboat.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Jul 16, 2011 at 05:50:45PM +0200, Rémi Zara wrote:
> Isolation tests seem to deadlock on buildfarm member coypu (NetBSD/powerpc 5.1).

Thanks for the report and detailed analysis. I believe the patch here will fix
the problem:
http://archives.postgresql.org/message-id/20110716171121.GB2047@tornado.leadboat.com

If you're up for testing it locally, I'd be interested to hear how it works out.

Thanks,
nm

--
Noah Misch http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Noah Misch <noah(at)2ndquadrant(dot)com>
Cc: Rémi Zara <remi_zara(at)mac(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: isolation test deadlocking on buildfarm member coypu
Date: 2011-07-21 04:48:38
Message-ID: 1311223626-sup-2901@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Excerpts from Noah Misch's message of sáb jul 16 13:25:03 -0400 2011:
> On Sat, Jul 16, 2011 at 05:50:45PM +0200, Rémi Zara wrote:
> > Isolation tests seem to deadlock on buildfarm member coypu (NetBSD/powerpc 5.1).
>
> Thanks for the report and detailed analysis. I believe the patch here will fix
> the problem:
> http://archives.postgresql.org/message-id/20110716171121.GB2047@tornado.leadboat.com
>
> If you're up for testing it locally, I'd be interested to hear how it works out.

Rémi,

I have committed Noah's patch, so please unblock coypu and locust so we
can see whether they are OK now.

Thanks,

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Noah Misch <noah(at)2ndquadrant(dot)com>
Cc: Rémi Zara <remi_zara(at)mac(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: isolation test deadlocking on buildfarm member coypu
Date: 2011-07-26 21:04:28
Message-ID: 1311713204-sup-7462@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

After committing Noah's patch to fix the isolation tests, there have
been two more failures in Rémi's machines pika and coypu:

http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=pika&dt=2011-07-24%2006%3A45%3A45
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=coypu&dt=2011-07-23%2021%3A50%3A54

In both cases, the failure is identical in fk-deadlock2:

================== pgsql.20950/src/test/isolation/regression.diffs ===================
*** /home/pgbuildfarm/workdir/HEAD/pgsql.20950/src/test/isolation/expected/fk-deadlock2.out Sun Jul 24 08:46:44 2011
--- /home/pgbuildfarm/workdir/HEAD/pgsql.20950/src/test/isolation/results/fk-deadlock2.out Sun Jul 24 15:11:42 2011
***************
*** 22,29 ****
step s2u1: UPDATE B SET Col2 = 1 WHERE BID = 2;
step s1u2: UPDATE B SET Col2 = 1 WHERE BID = 2; <waiting ...>
step s2u2: UPDATE B SET Col2 = 1 WHERE BID = 2;
- step s1u2: <... completed>
ERROR: deadlock detected
step s1c: COMMIT;
step s2c: COMMIT;

--- 22,29 ----
step s2u1: UPDATE B SET Col2 = 1 WHERE BID = 2;
step s1u2: UPDATE B SET Col2 = 1 WHERE BID = 2; <waiting ...>
step s2u2: UPDATE B SET Col2 = 1 WHERE BID = 2;
ERROR: deadlock detected
+ step s1u2: <... completed>
step s1c: COMMIT;
step s2c: COMMIT;

I think the only explanation necessary for this is that one process
reports its status before the other one. I think it would be enough to
add another variant of the expected file to fix this problem, but I
don't quite want to do that because we already have three of them, and I
think we would need to add one per existing expected, so we'd end up
with 6 expected files which would be a pain to work with.

Not quite sure what to do with this.

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Noah Misch <noah(at)2ndQuadrant(dot)com>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Rémi Zara <remi_zara(at)mac(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: isolation test deadlocking on buildfarm member coypu
Date: 2011-07-27 17:14:38
Message-ID: 20110727171438.GE18910@tornado.leadboat.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Jul 26, 2011 at 05:04:28PM -0400, Alvaro Herrera wrote:
> *** /home/pgbuildfarm/workdir/HEAD/pgsql.20950/src/test/isolation/expected/fk-deadlock2.out Sun Jul 24 08:46:44 2011
> --- /home/pgbuildfarm/workdir/HEAD/pgsql.20950/src/test/isolation/results/fk-deadlock2.out Sun Jul 24 15:11:42 2011
> ***************
> *** 22,29 ****
> step s2u1: UPDATE B SET Col2 = 1 WHERE BID = 2;
> step s1u2: UPDATE B SET Col2 = 1 WHERE BID = 2; <waiting ...>
> step s2u2: UPDATE B SET Col2 = 1 WHERE BID = 2;
> - step s1u2: <... completed>
> ERROR: deadlock detected
> step s1c: COMMIT;
> step s2c: COMMIT;
>
> --- 22,29 ----
> step s2u1: UPDATE B SET Col2 = 1 WHERE BID = 2;
> step s1u2: UPDATE B SET Col2 = 1 WHERE BID = 2; <waiting ...>
> step s2u2: UPDATE B SET Col2 = 1 WHERE BID = 2;
> ERROR: deadlock detected
> + step s1u2: <... completed>
> step s1c: COMMIT;
> step s2c: COMMIT;
>
>
>
> I think the only explanation necessary for this is that one process
> reports its status before the other one. I think it would be enough to
> add another variant of the expected file to fix this problem, but I
> don't quite want to do that because we already have three of them, and I
> think we would need to add one per existing expected, so we'd end up
> with 6 expected files which would be a pain to work with.

To really cover the problem in this way, we would need 16*3 variations covering
every permutation of the deadlocks detected when s1 runs the first command. I
see these other options:

1. Keep increasing the s1 deadlock_timeout until we stop getting these. This is
simple, but it proportionally slows the test suite for everyone. No value will
ever be guaranteed sufficient.

2. Post-process the output to ascribe the deadlock detections in a standard way
before comparing with expected output. This would also let us drop
deadlock_timeout arbitrarily low, giving a rapid test run.

3. Implement actual deadlock priorities, per discussion at
http://archives.postgresql.org/message-id/AANLkTimAqFzKV4Sc1DScruFft_Be78Y-oWPeuURCDnjR@mail.gmail.com
This is much more work, but it would let us drop deadlock_timeout arbitrarily
low and still get consistent results from the start.

--
Noah Misch http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services