Going for "all green" buildfarm results

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Going for "all green" buildfarm results
Date: 2006-06-02 04:06:49
Message-ID: 24270.1149221209@sss.pgh.pa.us
Lists: pgsql-hackers

I've been making another pass over getting rid of buildfarm failures.
The remaining ones I see at the moment are:

firefly HEAD: intermittent failures in the stats test. We seem to have
fixed every other platform back in January, but not this one.

kudu HEAD: one-time failure 6/1/06 in statement_timeout test, never seen
before. Is it possible the system was under enough load that the 1-second
timeout fired before control reached the exception block?

tapir HEAD: pilot error, insufficient SysV shmem settings

carp various: carp seems to have *serious* hardware problems, as it
has been failing randomly in all branches for a long time. I suggest
putting that poor machine out to pasture.

penguin 8.0: fails in tsearch2. Previous investigation says that the
failure is unfixable without initdb, which we are not going to force
for 8.0 branch. I suggest retiring penguin from checking 8.0, as
there's not much point in continuing to see a failure there. Or is
it worth improving buildfarm to be able to skip specific tests?

penguin 7.4: fails in initdb, with what seems to be a variant of the
alignment issue that kills tsearch2 in 8.0. We won't fix this either,
so again might as well stop tracking this branch on this machine.

cobra, stoat, sponge 7.4: pilot error. Either install Tk or configure
--without-tk.

firefly 7.4: dblink test fails, with what looks like an rpath problem.
Another one that we fixed awhile ago, and the fix worked on every
platform but this one.

firefly 7.3: trivial regression diffs; we could install variant
comparison files if anyone cared.

cobra, stoat, caribou 7.3: same Tk configuration error as in 7.4 branch

Firefly is obviously the outlier here. I dunno if anyone cares enough
about SCO to spend time investigating it (I don't). Most of the others
just need a little bit of attention from the machine owner.

regards, tom lane


From: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Going for "all green" buildfarm results
Date: 2006-06-02 06:07:09
Message-ID: 447FD58D.9080807@kaltenbrunner.cc
Lists: pgsql-hackers

Tom Lane wrote:
> I've been making another pass over getting rid of buildfarm failures.
> The remaining ones I see at the moment are:
>
> firefly HEAD: intermittent failures in the stats test. We seem to have
> fixed every other platform back in January, but not this one.
>
> kudu HEAD: one-time failure 6/1/06 in statement_timeout test, never seen
> before. Is it possible the system was under enough load that the 1-second
> timeout fired before control reached the exception block?

[...]

FWIW: lionfish had a weird make check error 3 weeks ago which I
(unsuccessfully) tried to reproduce multiple times after that:

http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=lionfish&dt=2006-05-12%2005:30:14

[...]

> cobra, stoat, sponge 7.4: pilot error. Either install Tk or configure
> --without-tk.

sorry for that, but the issue with sponge on 7.4 was fixed nearly a week
ago; there have just been no changes until today to trigger a new build ;-)

Stefan


From: "Larry Rosenman" <ler(at)lerctr(dot)org>
To: "'Tom Lane'" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "'Andrew Dunstan'" <andrew(at)dunslane(dot)net>
Cc: <pgsql-hackers(at)postgreSQL(dot)org>
Subject: Re: Going for "all green" buildfarm results
Date: 2006-06-02 09:09:37
Message-ID: 001501c68624$4ced69d0$68c8a8c0@lerctr.org
Lists: pgsql-hackers

Tom Lane wrote:
> I've been making another pass over getting rid of buildfarm failures.
> The remaining ones I see at the moment are:
>
> firefly HEAD: intermittent failures in the stats test. We seem to
> have fixed every other platform back in January, but not this one.
>
>
> firefly 7.4: dblink test fails, with what looks like an rpath problem.
> Another one that we fixed awhile ago, and the fix worked on every
> platform but this one.
>
> firefly 7.3: trivial regression diffs; we could install variant
> comparison files if anyone cared.
>
>
> Firefly is obviously the outlier here. I dunno if anyone cares
> enough about SCO to spend time investigating it (I don't). Most of
> the others just need a little bit of attention from the machine
> owner.

If I generate fixes for firefly (I'm the owner), would they have a prayer
of being applied?

LER

>
> regards, tom lane

--
Larry Rosenman http://www.lerctr.org/~ler
Phone: +1 512-248-2683 E-Mail: ler(at)lerctr(dot)org
US Mail: 430 Valona Loop, Round Rock, TX 78681-3683 US


From: "Andrew Dunstan" <andrew(at)dunslane(dot)net>
To: <ler(at)lerctr(dot)org>
Cc: <tgl(at)sss(dot)pgh(dot)pa(dot)us>, <pgsql-hackers(at)postgreSQL(dot)org>
Subject: Re: Going for 'all green' buildfarm results
Date: 2006-06-02 11:22:23
Message-ID: 3895.24.211.165.134.1149247343.squirrel@www.dunslane.net
Lists: pgsql-hackers

Larry Rosenman said:
> Tom Lane wrote:
>> I've been making another pass over getting rid of buildfarm failures.
>> The remaining ones I see at the moment are:
>>
>> firefly HEAD: intermittent failures in the stats test. We seem to
>> have fixed every other platform back in January, but not this one.
>>
>>
>> firefly 7.4: dblink test fails, with what looks like an rpath problem.
>> Another one that we fixed awhile ago, and the fix worked on every
>> platform but this one.
>>
>> firefly 7.3: trivial regression diffs; we could install variant
>> comparison files if anyone cared.
>>
>>
>> Firefly is obviously the outlier here. I dunno if anyone cares
>> enough about SCO to spend time investigating it (I don't). Most of
>> the others just need a little bit of attention from the machine
>> owner.
>
> If I generate fixes for firefly (I'm the owner), would they have a
> prayer of being applied?
>

Sure, although I wouldn't bother with 7.3 - just take 7.3 out of firefly's
build schedule. That's not carte blanche on fixes, of course - we'd have to
see them.

cheers

andrew


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Going for "all green" buildfarm results
Date: 2006-06-02 12:16:46
Message-ID: 44802C2E.7070200@dunslane.net
Lists: pgsql-hackers

Tom Lane wrote:
> Or is
> it worth improving buildfarm to be able to skip specific tests?
>
>
>

There is a session on buildfarm improvements scheduled for the Toronto
conference. This is certainly one possibility.

cheers

andrew


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Going for "all green" buildfarm results
Date: 2006-06-02 13:22:33
Message-ID: 27657.1149254553@sss.pgh.pa.us
Lists: pgsql-hackers

Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc> writes:
> FWIW: lionfish had a weird make check error 3 weeks ago which I
> (unsuccessfully) tried to reproduce multiple times after that:

> http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=lionfish&dt=2006-05-12%2005:30:14

Weird.

SELECT ''::text AS eleven, unique1, unique2, stringu1
FROM onek WHERE unique1 < 50
ORDER BY unique1 DESC LIMIT 20 OFFSET 39;
! ERROR: could not open relation with OID 27035

AFAICS, the only way to get that error in HEAD is if ScanPgRelation
can't find a pg_class row with the mentioned OID. Presumably 27035
belongs to "onek" or one of its indexes. The very next command also
refers to "onek", and doesn't fail, so what we seem to have here is
a transient lookup failure. We've found a btree bug like that once
before ... wonder if there's still one left?

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Andrew Dunstan" <andrew(at)dunslane(dot)net>
Cc: ler(at)lerctr(dot)org, pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Going for 'all green' buildfarm results
Date: 2006-06-02 13:36:27
Message-ID: 27816.1149255387@sss.pgh.pa.us
Lists: pgsql-hackers

"Andrew Dunstan" <andrew(at)dunslane(dot)net> writes:
> Larry Rosenman said:
>> If I generate fixes for firefly (I'm the owner), would they have a
>> prayer of being applied?

> Sure, although I wouldn't bother with 7.3 - just take 7.3 out of firefly's
> build schedule. That's not carte blanche on fixes, of course - we'd have to
> see them.

What he said ... it'd depend entirely on how ugly the fixes are ;-)

regards, tom lane


From: "Larry Rosenman" <ler(at)lerctr(dot)org>
To: "'Tom Lane'" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "'Andrew Dunstan'" <andrew(at)dunslane(dot)net>
Cc: <pgsql-hackers(at)postgreSQL(dot)org>
Subject: Re: Going for 'all green' buildfarm results
Date: 2006-06-02 13:40:49
Message-ID: 000f01c6864a$2a257e80$0202fea9@aus.pervasive.com
Lists: pgsql-hackers

Tom Lane wrote:
> "Andrew Dunstan" <andrew(at)dunslane(dot)net> writes:
>> Larry Rosenman said:
>>> If I generate fixes for firefly (I'm the owner), would they have a
>>> prayer of being applied?
>
>> Sure, although I wouldn't bother with 7.3 - just take 7.3 out of
>> firefly's build schedule. That's not carte blanche on fixes, of
>> course - we'd have to see them.
>
> What he said ... it'd depend entirely on how ugly the fixes are ;-)
>
Ok, 7.3 is out of firefly's crontab.

I'll look into 7.4.

LER

--
Larry Rosenman http://www.lerctr.org/~ler
Phone: +1 512-248-2683 E-Mail: ler(at)lerctr(dot)org
US Mail: 430 Valona Loop, Round Rock, TX 78681-3893


From: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Going for "all green" buildfarm results
Date: 2006-06-02 14:47:25
Message-ID: 44804F7D.7070408@kaltenbrunner.cc
Lists: pgsql-hackers

Tom Lane wrote:
> Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc> writes:
>
>>FWIW: lionfish had a weird make check error 3 weeks ago which I
>>(unsuccessfully) tried to reproduce multiple times after that:
>
>
>>http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=lionfish&dt=2006-05-12%2005:30:14
>
>
> Weird.
>
> SELECT ''::text AS eleven, unique1, unique2, stringu1
> FROM onek WHERE unique1 < 50
> ORDER BY unique1 DESC LIMIT 20 OFFSET 39;
> ! ERROR: could not open relation with OID 27035
>
> AFAICS, the only way to get that error in HEAD is if ScanPgRelation
> can't find a pg_class row with the mentioned OID. Presumably 27035
> belongs to "onek" or one of its indexes. The very next command also
> refers to "onek", and doesn't fail, so what we seem to have here is
> a transient lookup failure. We've found a btree bug like that once
> before ... wonder if there's still one left?

If there is still one left, it must be quite hard to trigger (using the
regression tests). Like I said before, I tried quite hard to reproduce
the issue back then - without any success.

Stefan


From: "Larry Rosenman" <ler(at)lerctr(dot)org>
To: "'Tom Lane'" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "'Andrew Dunstan'" <andrew(at)dunslane(dot)net>
Cc: <pgsql-hackers(at)postgreSQL(dot)org>
Subject: Re: Going for 'all green' buildfarm results
Date: 2006-06-02 15:19:50
Message-ID: 005f01c68657$fdde9f60$0202fea9@aus.pervasive.com
Lists: pgsql-hackers

Larry Rosenman wrote:
> Tom Lane wrote:
>> "Andrew Dunstan" <andrew(at)dunslane(dot)net> writes:
>>> Larry Rosenman said:
>>>> If I generate fixes for firefly (I'm the owner), would they have a
>>>> prayer of being applied?
>>
>>> Sure, although I wouldn't bother with 7.3 - just take 7.3 out of
>>> firefly's build schedule. That's not carte blanche on fixes, of
>>> course - we'd have to see them.
>>
>> What he said ... it'd depend entirely on how ugly the fixes are ;-)
>>
> Ok, 7.3 is out of firefly's crontab.
>
> I'll look into 7.4.
>
> LER

I've taken the cheater's way out for 7.4 and turned off the Perl stuff
for now.

As for HEAD, I've played with the system send/recv space parameters;
let's see whether that helps the stats stuff.

LER

--
Larry Rosenman http://www.lerctr.org/~ler
Phone: +1 512-248-2683 E-Mail: ler(at)lerctr(dot)org
US Mail: 430 Valona Loop, Round Rock, TX 78681-3893


From: "Larry Rosenman" <ler(at)lerctr(dot)org>
To: "'Tom Lane'" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "'Andrew Dunstan'" <andrew(at)dunslane(dot)net>
Cc: <pgsql-hackers(at)postgreSQL(dot)org>
Subject: Re: Going for 'all green' buildfarm results
Date: 2006-06-02 17:20:33
Message-ID: 00e201c68668$db4ac8f0$0202fea9@aus.pervasive.com
Lists: pgsql-hackers

Larry Rosenman wrote:
> Larry Rosenman wrote:
>> Tom Lane wrote:
>>> "Andrew Dunstan" <andrew(at)dunslane(dot)net> writes:
>>>> Larry Rosenman said:
>>>>> If I generate fixes for firefly (I'm the owner), would they have a
>>>>> prayer of being applied?
>>>
>>>> Sure, although I wouldn't bother with 7.3 - just take 7.3 out of
>>>> firefly's build schedule. That's not carte blanche on fixes, of
>>>> course - we'd have to see them.
>>>
>>> What he said ... it'd depend entirely on how ugly the fixes are ;-)
>>>
>> Ok, 7.3 is out of firefly's crontab.
>>
>> I'll look into 7.4.
>>
>> LER
>
> I've taken the cheater's way out for 7.4 and turned off the Perl
> stuff for now.
>
> As for HEAD, I've played with the system send/recv space parameters;
> let's see whether that helps the stats stuff.
>
> LER

well, the changes didn't help.

I've pulled ALL the cronjobs from firefly.

consider it dead.

Since it is an outlier, it's not useful.

LER

--
Larry Rosenman http://www.lerctr.org/~ler
Phone: +1 512-248-2683 E-Mail: ler(at)lerctr(dot)org
US Mail: 430 Valona Loop, Round Rock, TX 78681-3893


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Larry Rosenman <ler(at)lerctr(dot)org>
Cc: "'Tom Lane'" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Going for 'all green' buildfarm results
Date: 2006-06-08 14:54:09
Message-ID: 44883A11.3050705@dunslane.net
Lists: pgsql-hackers

Larry Rosenman wrote:
> well, the changes didn't help.
>
> I've pulled ALL the cronjobs from firefly.
>
> consider it dead.
>
> Since it is an outlier, it's not useful.
>
>
>

OK, I am marking firefly as retired. That means we have no coverage for
Unixware.

cheers

andrew


From: ohp(at)pyrenet(dot)fr
To: Andrew Dunstan <andrew(at)dunslane(dot)net>, Larry Rosenman <ler(at)lerctr(dot)org>
Subject: Re: Going for 'all green' buildfarm results
Date: 2006-06-09 09:12:07
Message-ID: Pine.UW2.4.53.0606091111090.2307@sun.pyrenet
Lists: pgsql-hackers

I can take other if that helps.

Larry, could you help me in the setup?

Regards,
On Thu, 8 Jun 2006, Andrew Dunstan wrote:

> Date: Thu, 08 Jun 2006 10:54:09 -0400
> From: Andrew Dunstan <andrew(at)dunslane(dot)net>
> Newsgroups: pgsql.hackers
> Subject: Re: Going for 'all green' buildfarm results
>
> Larry Rosenman wrote:
> > well, the changes didn't help.
> >
> > I've pulled ALL the cronjobs from firefly.
> >
> > consider it dead.
> >
> > Since it is an outlier, it's not useful.
> >
> >
> >
>
>
>
> OK, I am marking firefly as retired. That means we have no coverage for
> Unixware.
>
> cheers
>
> andrew

--
Olivier PRENANT Tel: +33-5-61-50-97-00 (Work)
15, Chemin des Monges +33-5-61-50-97-01 (Fax)
31190 AUTERIVE +33-6-07-63-80-64 (GSM)
FRANCE Email: ohp(at)pyrenet(dot)fr
------------------------------------------------------------------------------
Make your life a dream, make your dream a reality. (St Exupery)


From: ohp(at)pyrenet(dot)fr
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Going for 'all green' buildfarm results
Date: 2006-06-09 11:18:11
Message-ID: Pine.UW2.4.53.0606091317360.3237@sun.pyrenet
Lists: pgsql-hackers

On Fri, 9 Jun 2006 ohp(at)pyrenet(dot)fr wrote:

> Date: Fri, 9 Jun 2006 11:12:07 +0200
> From: ohp(at)pyrenet(dot)fr
> To: Andrew Dunstan <andrew(at)dunslane(dot)net>, Larry Rosenman <ler(at)lerctr(dot)org>
> Newsgroups: pgsql.hackers
> Subject: Re: Going for 'all green' buildfarm results
>
> I can take other if that helps.
Ooops... takeover :)
>
> Larry, could you help me in the setup?
>
> Regards,
> On Thu, 8 Jun 2006, Andrew Dunstan wrote:
>
> > Date: Thu, 08 Jun 2006 10:54:09 -0400
> > From: Andrew Dunstan <andrew(at)dunslane(dot)net>
> > Newsgroups: pgsql.hackers
> > Subject: Re: Going for 'all green' buildfarm results
> >
> > Larry Rosenman wrote:
> > > well, the changes didn't help.
> > >
> > > I've pulled ALL the cronjobs from firefly.
> > >
> > > consider it dead.
> > >
> > > Since it is an outlier, it's not useful.
> > >
> > >
> > >
> >
> >
> >
> > OK, I am marking firefly as retired. That means we have no coverage for
> > Unixware.
> >
> > cheers
> >
> > andrew

--
Olivier PRENANT Tel: +33-5-61-50-97-00 (Work)
15, Chemin des Monges +33-5-61-50-97-01 (Fax)
31190 AUTERIVE +33-6-07-63-80-64 (GSM)
FRANCE Email: ohp(at)pyrenet(dot)fr
------------------------------------------------------------------------------
Make your life a dream, make your dream a reality. (St Exupery)


From: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Going for "all green" buildfarm results
Date: 2006-07-30 05:40:38
Message-ID: 44CC4656.2030405@kaltenbrunner.cc
Lists: pgsql-hackers

Tom Lane wrote:
> Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc> writes:
>> FWIW: lionfish had a weird make check error 3 weeks ago which I
>> (unsuccessfully) tried to reproduce multiple times after that:
>
>> http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=lionfish&dt=2006-05-12%2005:30:14
>
> Weird.
>
> SELECT ''::text AS eleven, unique1, unique2, stringu1
> FROM onek WHERE unique1 < 50
> ORDER BY unique1 DESC LIMIT 20 OFFSET 39;
> ! ERROR: could not open relation with OID 27035
>
> AFAICS, the only way to get that error in HEAD is if ScanPgRelation
> can't find a pg_class row with the mentioned OID. Presumably 27035
> belongs to "onek" or one of its indexes. The very next command also
> refers to "onek", and doesn't fail, so what we seem to have here is
> a transient lookup failure. We've found a btree bug like that once
> before ... wonder if there's still one left?

FYI: lionfish just managed to hit that problem again:

http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=lionfish&dt=2006-07-29%2023:30:06

Stefan


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Going for "all green" buildfarm results
Date: 2006-07-30 06:20:05
Message-ID: 20060730062005.GA29720@surnet.cl
Lists: pgsql-hackers

Stefan Kaltenbrunner wrote:
> Tom Lane wrote:
> > Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc> writes:
> >> FWIW: lionfish had a weird make check error 3 weeks ago which I
> >> (unsuccessfully) tried to reproduce multiple times after that:
> >
> >> http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=lionfish&dt=2006-05-12%2005:30:14
> >
> > Weird.
> >
> > SELECT ''::text AS eleven, unique1, unique2, stringu1
> > FROM onek WHERE unique1 < 50
> > ORDER BY unique1 DESC LIMIT 20 OFFSET 39;
> > ! ERROR: could not open relation with OID 27035
> >
> > AFAICS, the only way to get that error in HEAD is if ScanPgRelation
> > can't find a pg_class row with the mentioned OID. Presumably 27035
> > belongs to "onek" or one of its indexes. The very next command also
> > refers to "onek", and doesn't fail, so what we seem to have here is
> > a transient lookup failure. We've found a btree bug like that once
> > before ... wonder if there's still one left?
>
> FYI: lionfish just managed to hit that problem again:
>
> http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=lionfish&dt=2006-07-29%2023:30:06

The error message this time is

! ERROR: could not open relation with OID 27006

It's worth mentioning that the portals_p2 test, which happens in the
parallel group previous to where this test is run, also accesses the
onek table successfully. It may be interesting to see exactly which
relation 27006 is.
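
If the cluster were still around, a quick check would be something like

    SELECT oid, relname, relkind FROM pg_class WHERE oid = 27006;
    SELECT indexrelid::regclass FROM pg_index
     WHERE indrelid = 'onek'::regclass;

(the OIDs are installation-specific, of course).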

The test alter_table, which is on the same parallel group as limit (the
failing test), contains these lines:

ALTER INDEX onek_unique1 RENAME TO tmp_onek_unique1;
ALTER INDEX tmp_onek_unique1 RENAME TO onek_unique1;

Maybe this is related.

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
To: alvherre(at)commandprompt(dot)com
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Going for "all green" buildfarm results
Date: 2006-07-30 08:03:25
Message-ID: 44CC67CD.6090604@kaltenbrunner.cc
Lists: pgsql-hackers

Alvaro Herrera wrote:
> Stefan Kaltenbrunner wrote:
>> Tom Lane wrote:
>>> Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc> writes:
>>>> FWIW: lionfish had a weird make check error 3 weeks ago which I
>>>> (unsuccessfully) tried to reproduce multiple times after that:
>>>> http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=lionfish&dt=2006-05-12%2005:30:14
>>> Weird.
>>>
>>> SELECT ''::text AS eleven, unique1, unique2, stringu1
>>> FROM onek WHERE unique1 < 50
>>> ORDER BY unique1 DESC LIMIT 20 OFFSET 39;
>>> ! ERROR: could not open relation with OID 27035
>>>
>>> AFAICS, the only way to get that error in HEAD is if ScanPgRelation
>>> can't find a pg_class row with the mentioned OID. Presumably 27035
>>> belongs to "onek" or one of its indexes. The very next command also
>>> refers to "onek", and doesn't fail, so what we seem to have here is
>>> a transient lookup failure. We've found a btree bug like that once
>>> before ... wonder if there's still one left?
>> FYI: lionfish just managed to hit that problem again:
>>
>> http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=lionfish&dt=2006-07-29%2023:30:06
>
> The error message this time is
>
> ! ERROR: could not open relation with OID 27006

yeah and before it was:
! ERROR: could not open relation with OID 27035

which looks quite related :-)

>
> It's worth mentioning that the portals_p2 test, which happens in the
> parallel group previous to where this test is run, also accesses the
> onek table successfully. It may be interesting to see exactly what
> relation is 27006.

sorry, but I don't have access to the cluster in question any more
(lionfish is quite resource-starved, and I only enabled keeping failed
builds on -HEAD after the last incident ...)

>
> The test alter_table, which is on the same parallel group as limit (the
> failing test), contains these lines:
>
> ALTER INDEX onek_unique1 RENAME TO tmp_onek_unique1;
> ALTER INDEX tmp_onek_unique1 RENAME TO onek_unique1;

hmm interesting - lionfish is a slow box (250 MHz MIPS) and particularly
low on memory (48 MB + 140 MB swap), so it is quite likely that the parallel
regression tests are driving it into swap - maybe some sort of subtle
timing issue?

Stefan


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Going for "all green" buildfarm results
Date: 2006-07-30 15:44:44
Message-ID: 25567.1154274284@sss.pgh.pa.us
Lists: pgsql-hackers

Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> Stefan Kaltenbrunner wrote:
>> FYI: lionfish just managed to hit that problem again:
>> http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=lionfish&dt=2006-07-29%2023:30:06

> The test alter_table, which is on the same parallel group as limit (the
> failing test), contains these lines:
> ALTER INDEX onek_unique1 RENAME TO tmp_onek_unique1;
> ALTER INDEX tmp_onek_unique1 RENAME TO onek_unique1;

I bet Alvaro's spotted the problem. ALTER INDEX RENAME doesn't seem to
take any lock on the index's parent table, only on the index itself.
That means that a query on "onek" could be trying to read the pg_class
entries for onek's indexes concurrently with someone trying to commit
a pg_class update to rename an index. If the query manages to visit
the new and old versions of the row in that order, and the commit
happens between, *neither* of the versions would look valid. MVCC
doesn't save us because this is all SnapshotNow.
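
Spelled out, the failing interleaving would run something like this
(schematically):

    -- B: SELECT ... FROM onek ...
    --    starts a SnapshotNow scan of pg_class to look up onek's indexes
    -- A: ALTER INDEX onek_unique1 RENAME TO tmp_onek_unique1;
    --    the pg_class UPDATE inserts a new version of the index's row
    -- B: scan visits the new version first: A hasn't committed yet,
    --    so it's invisible under SnapshotNow
    -- A: COMMIT;
    -- B: scan visits the old version: its xmax is now committed,
    --    so it's invisible too
    -- B: ERROR:  could not open relation with OID ...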

Not sure what to do about this. Trying to lock the parent table could
easily be a cure-worse-than-the-disease, because it would create
deadlock risks (we've already locked the index before we could look up
and lock the parent). Thoughts?

The path of least resistance might just be to not run these tests in
parallel. The chance of this issue causing problems in the real world
seems small.

regards, tom lane


From: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Going for "all green" buildfarm results
Date: 2006-07-31 16:20:32
Message-ID: 20060731162032.GH66525@pervasive.com
Lists: pgsql-hackers

On Sun, Jul 30, 2006 at 11:44:44AM -0400, Tom Lane wrote:
> Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> > Stefan Kaltenbrunner wrote:
> >> FYI: lionfish just managed to hit that problem again:
> >> http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=lionfish&dt=2006-07-29%2023:30:06
>
> > The test alter_table, which is on the same parallel group as limit (the
> > failing test), contains these lines:
> > ALTER INDEX onek_unique1 RENAME TO tmp_onek_unique1;
> > ALTER INDEX tmp_onek_unique1 RENAME TO onek_unique1;
>
> I bet Alvaro's spotted the problem. ALTER INDEX RENAME doesn't seem to
> take any lock on the index's parent table, only on the index itself.
> That means that a query on "onek" could be trying to read the pg_class
> entries for onek's indexes concurrently with someone trying to commit
> a pg_class update to rename an index. If the query manages to visit
> the new and old versions of the row in that order, and the commit
> happens between, *neither* of the versions would look valid. MVCC
> doesn't save us because this is all SnapshotNow.
>
> Not sure what to do about this. Trying to lock the parent table could
> easily be a cure-worse-than-the-disease, because it would create
> deadlock risks (we've already locked the index before we could look up
> and lock the parent). Thoughts?
>
> The path of least resistance might just be to not run these tests in
> parallel. The chance of this issue causing problems in the real world
> seems small.

It doesn't seem that unusual to want to rename an index on a running
system, and it certainly doesn't seem like the kind of operation that
should pose a problem. So at the very least, we'd need a big fat warning
in the docs about how renaming an index could cause other queries in the
system to fail, and the error message needs to be improved.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby(at)pervasive(dot)com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461


From: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
To: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Going for "all green" buildfarm results
Date: 2006-07-31 18:17:46
Message-ID: 44CE494A.2000100@kaltenbrunner.cc
Lists: pgsql-hackers

Jim C. Nasby wrote:
> On Sun, Jul 30, 2006 at 11:44:44AM -0400, Tom Lane wrote:
>> Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
>>> Stefan Kaltenbrunner wrote:
>>>> FYI: lionfish just managed to hit that problem again:
>>>> http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=lionfish&dt=2006-07-29%2023:30:06
>>> The test alter_table, which is on the same parallel group as limit (the
>>> failing test), contains these lines:
>>> ALTER INDEX onek_unique1 RENAME TO tmp_onek_unique1;
>>> ALTER INDEX tmp_onek_unique1 RENAME TO onek_unique1;
>> I bet Alvaro's spotted the problem. ALTER INDEX RENAME doesn't seem to
>> take any lock on the index's parent table, only on the index itself.
>> That means that a query on "onek" could be trying to read the pg_class
>> entries for onek's indexes concurrently with someone trying to commit
>> a pg_class update to rename an index. If the query manages to visit
>> the new and old versions of the row in that order, and the commit
>> happens between, *neither* of the versions would look valid. MVCC
>> doesn't save us because this is all SnapshotNow.
>>
>> Not sure what to do about this. Trying to lock the parent table could
>> easily be a cure-worse-than-the-disease, because it would create
>> deadlock risks (we've already locked the index before we could look up
>> and lock the parent). Thoughts?
>>
>> The path of least resistance might just be to not run these tests in
>> parallel. The chance of this issue causing problems in the real world
>> seems small.
>
> It doesn't seem that unusual to want to rename an index on a running
> system, and it certainly doesn't seem like the kind of operation that
> should pose a problem. So at the very least, we'd need a big fat warning
> in the docs about how renaming an index could cause other queries in the
> system to fail, and the error message needs to be improved.

it is my understanding that Tom is already tackling the underlying issue
on a much more general basis ...

Stefan


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
Cc: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Going for "all green" buildfarm results
Date: 2006-07-31 20:13:20
Message-ID: 13241.1154376800@sss.pgh.pa.us
Lists: pgsql-hackers

Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc> writes:
> Jim C. Nasby wrote:
>> On Sun, Jul 30, 2006 at 11:44:44AM -0400, Tom Lane wrote:
>>> The path of least resistance might just be to not run these tests in
>>> parallel. The chance of this issue causing problems in the real world
>>> seems small.
>>
>> It doesn't seem that unusual to want to rename an index on a running
>> system, and it certainly doesn't seem like the kind of operation that
>> should pose a problem. So at the very least, we'd need a big fat warning
>> in the docs about how renaming an index could cause other queries in the
>> system to fail, and the error message needs to be improved.

> it is my understanding that Tom is already tackling the underlying issue
> on a much more general basis ...

Done in HEAD, but we might still wish to think about changing the
regression tests in the back branches, else we'll probably continue to
see this failure once in a while ...

regards, tom lane


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
Cc: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Going for "all green" buildfarm results
Date: 2006-07-31 20:20:33
Message-ID: 44CE6611.8040408@dunslane.net
Lists: pgsql-hackers

Tom Lane wrote:

>Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc> writes:
>
>
>>Jim C. Nasby wrote:
>>
>>
>>>On Sun, Jul 30, 2006 at 11:44:44AM -0400, Tom Lane wrote:
>>>
>>>
>>>>The path of least resistance might just be to not run these tests in
>>>>parallel. The chance of this issue causing problems in the real world
>>>>seems small.
>>>>
>>>>
>>>It doesn't seem that unusual to want to rename an index on a running
>>>system, and it certainly doesn't seem like the kind of operation that
>>>should pose a problem. So at the very least, we'd need a big fat warning
>>>in the docs about how renaming an index could cause other queries in the
>>>system to fail, and the error message needs to be improved.
>>>
>>>
>
>
>
>>it is my understanding that Tom is already tackling the underlying issue
>>on a much more general basis ...
>>
>>
>
>Done in HEAD, but we might still wish to think about changing the
>regression tests in the back branches, else we'll probably continue to
>see this failure once in a while ...
>
>
>
>

How sure are we that this is the cause of the problem? The feeling I got
was "this is a good guess". If so, do we want to prevent ourselves
getting any further clues in case we're wrong? It's also an interesting
case of a (low likelihood) bug which is not fixable on any stable branch.

cheers

andrew


From: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Going for "all green" buildfarm results
Date: 2006-07-31 20:28:48
Message-ID: 44CE6800.30808@kaltenbrunner.cc
Lists: pgsql-hackers

Andrew Dunstan wrote:
> Tom Lane wrote:
>
>> Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc> writes:
>>
>>
>>> Jim C. Nasby wrote:
>>>
>>>> On Sun, Jul 30, 2006 at 11:44:44AM -0400, Tom Lane wrote:
>>>>
>>>>> The path of least resistance might just be to not run these tests in
>>>>> parallel. The chance of this issue causing problems in the real world
>>>>> seems small.
>>>>>
>>>> It doesn't seem that unusual to want to rename an index on a running
>>>> system, and it certainly doesn't seem like the kind of operation that
>>>> should pose a problem. So at the very least, we'd need a big fat
>>>> warning
>>>> in the docs about how renaming an index could cause other queries in
>>>> the
>>>> system to fail, and the error message needs to be improved.
>>>>
>>
>>
>>
>>> it is my understanding that Tom is already tackling the underlying issue
>>> on a much more general basis ...
>>>
>>
>> Done in HEAD, but we might still wish to think about changing the
>> regression tests in the back branches, else we'll probably continue to
>> see this failure once in a while ...
>>
>>
>>
>>
>
> How sure are we that this is the cause of the problem? The feeling I got
> was "this is a good guess". If so, do we want to prevent ourselves
> getting any further clues in case we're wrong? It's also an interesting
> case of a (low likelihood) bug which is not fixable on any stable branch.

well, I have a lot of trust in Tom - though the main issue is that this
bug seems to be quite hard to trigger.
AFAIK only one box (lionfish) ever managed to hit it, and even there only
2 times out of several hundred builds - I don't suppose we can come up
with a testcase that shows the issue more reliably?

Stefan


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Going for "all green" buildfarm results
Date: 2006-07-31 20:31:58
Message-ID: 20060731203158.GI20962@alvh.no-ip.org
Lists: pgsql-hackers

Stefan Kaltenbrunner wrote:
> Andrew Dunstan wrote:

> > How sure are we that this is the cause of the problem? The feeling I got
> > was "this is a good guess". If so, do we want to prevent ourselves
> > getting any further clues in case we're wrong? It's also an interesting
> > case of a (low likelihood) bug which is not fixable on any stable branch.
>
> well, I have a lot of trust in Tom - though the main issue is that this
> bug seems to be quite hard to trigger.
> AFAIK only one box (lionfish) ever managed to hit it, and even there only
> 2 times out of several hundred builds - I don't suppose we can come up
> with a testcase that shows the issue more reliably?

Maybe we could write a suitable test case using Martijn's concurrent
testing framework. Or with a pair of custom SQL scripts running under
pgbench, and a separate process sending random SIGSTOP/SIGCONT to
backends.
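
Something like this, say (untested sketch, file names invented):

    -- rename.sql, run as e.g. "pgbench -n -f rename.sql -t 100000 regression"
    ALTER INDEX onek_unique1 RENAME TO tmp_onek_unique1;
    ALTER INDEX tmp_onek_unique1 RENAME TO onek_unique1;

    -- query.sql, run concurrently as "pgbench -n -f query.sql -c 4 -t 100000 regression"
    SELECT count(*) FROM onek WHERE unique1 < 50;

plus a shell loop picking random backend PIDs and hitting them with
kill -STOP / sleep / kill -CONT to widen the window.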

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Andrew Dunstan <andrew(at)dunslane(dot)net>, "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Going for "all green" buildfarm results
Date: 2006-07-31 21:03:35
Message-ID: 13755.1154379815@sss.pgh.pa.us
Lists: pgsql-hackers

Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> Maybe we could write a suitable test case using Martijn's concurrent
> testing framework.

The trick is to get process A to commit between the times that process B
looks at the new and old versions of the pg_class row (and it has to
happen to do so in that order ... although that's not a bad bet given
the way btree handles equal keys).

I think the reason we've not tracked this down before is that that's a
pretty small window. You could force the problem by stopping process B
with a debugger breakpoint and then letting A do its thing, but short of
something like that you'll never reproduce it with high probability.

As far as Andrew's question goes: I have no doubt that this race
condition is (or now, was) real and could explain Stefan's failure.
It's not impossible that there's some other problem in there, though.
If so we will still see the problem from time to time on HEAD, and
know that we have more work to do. But I don't think that continuing
to see it on the back branches will teach us anything.

regards, tom lane


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Going for "all green" buildfarm results
Date: 2006-07-31 22:57:46
Message-ID: 44CE8AEA.1070106@dunslane.net
Lists: pgsql-hackers

Tom Lane wrote:
> As far as Andrew's question goes: I have no doubt that this race
> condition is (or now, was) real and could explain Stefan's failure.
> It's not impossible that there's some other problem in there, though.
> If so we will still see the problem from time to time on HEAD, and
> know that we have more work to do. But I don't think that continuing
> to see it on the back branches will teach us anything.
>
>
>

Fair enough.

cheers

andrew


From: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Going for "all green" buildfarm results
Date: 2006-08-17 08:59:24
Message-ID: 44E42FEC.4040403@kaltenbrunner.cc
Lists: pgsql-hackers

Tom Lane wrote:
> Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
>> Maybe we could write a suitable test case using Martijn's concurrent
>> testing framework.
>
> The trick is to get process A to commit between the times that process B
> looks at the new and old versions of the pg_class row (and it has to
> happen to do so in that order ... although that's not a bad bet given
> the way btree handles equal keys).
>
> I think the reason we've not tracked this down before is that that's a
> pretty small window. You could force the problem by stopping process B
> with a debugger breakpoint and then letting A do its thing, but short of
> something like that you'll never reproduce it with high probability.
>
> As far as Andrew's question goes: I have no doubt that this race
> condition is (or now, was) real and could explain Stefan's failure.
> It's not impossible that there's some other problem in there, though.
> If so we will still see the problem from time to time on HEAD, and
> know that we have more work to do. But I don't think that continuing
> to see it on the back branches will teach us anything.

maybe the following buildfarm report means that we need a new theory :-(

http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=sponge&dt=2006-08-16%2021:30:02

Stefan


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Going for "all green" buildfarm results
Date: 2006-08-17 13:02:36
Message-ID: 5207.1155819756@sss.pgh.pa.us
Lists: pgsql-hackers

Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc> writes:
> maybe the following buildfarm report means that we need a new theory :-(

> http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=sponge&dt=2006-08-16%2021:30:02

Vacuum's always had a race condition: it makes a list of rel OIDs and
then tries to vacuum each one. It narrows the window for failure by
doing a SearchSysCacheExists test before relation_open, but there's
still a window for failure.

The rel in question is most likely a temp rel of another backend,
because sanity_check is running by itself and so there shouldn't
be anything else happening except perhaps some other session's
post-disconnect cleanup. Maybe we could put the check for "is
this a temp rel of another backend" into the initial list-making
step instead of waiting till after relation_open. That doesn't
seem to solve the general problem though.
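
The window, schematically:

    -- B: VACUUM;       -- builds its list of rel OIDs up front, which can
    --                     include another session's temp rels
    -- A: disconnects   -- backend exit drops A's temp rels from pg_class
    -- B: gets to A's temp rel OID: SearchSysCacheExists can still pass,
    --    the drop commits just after, and relation_open then errors out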

regards, tom lane


From: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Going for "all green" buildfarm results
Date: 2006-08-17 13:09:03
Message-ID: 44E46A6F.7090603@kaltenbrunner.cc
Lists: pgsql-hackers

Tom Lane wrote:
> Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc> writes:
>> maybe the following buildfarm report means that we need a new theory :-(
>
>> http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=sponge&dt=2006-08-16%2021:30:02
>
> Vacuum's always had a race condition: it makes a list of rel OIDs and
> then tries to vacuum each one. It narrows the window for failure by
> doing a SearchSysCacheExists test before relation_open, but there's
> still a window for failure.
>
> The rel in question is most likely a temp rel of another backend,
> because sanity_check is running by itself and so there shouldn't
> be anything else happening except perhaps some other session's
> post-disconnect cleanup. Maybe we could put the check for "is
> this a temp rel of another relation" into the initial list-making
> step instead of waiting till after relation_open. That doesn't
> seem to solve the general problem though.

hmm yeah - I missed the VACUUM; part of the regression diff.
Still, this means we will have to live with (rare) failures once in a
while during that test?

Stefan


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Going for "all green" buildfarm results
Date: 2006-08-18 14:06:52
Message-ID: 8105.1155910012@sss.pgh.pa.us
Lists: pgsql-hackers

Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc> writes:
> Tom Lane wrote:
>> Vacuum's always had a race condition: it makes a list of rel OIDs and
>> then tries to vacuum each one. It narrows the window for failure by
>> doing a SearchSysCacheExists test before relation_open, but there's
>> still a window for failure.

> hmm yeah - missed the VACUUM; part of the regression diff.
> Still this means we will have to live with (rare) failures once in a
> while during that test ?

I thought of what seems a pretty simple solution for this: make VACUUM
lock the relation before doing the SearchSysCacheExists, ie instead
of the existing code

    if (!SearchSysCacheExists(RELOID,
                              ObjectIdGetDatum(relid),
                              0, 0, 0))
        // give up

    lmode = vacstmt->full ? AccessExclusiveLock : ShareUpdateExclusiveLock;

    onerel = relation_open(relid, lmode);

do

    lmode = vacstmt->full ? AccessExclusiveLock : ShareUpdateExclusiveLock;

    LockRelationOid(relid, lmode);

    if (!SearchSysCacheExists(RELOID,
                              ObjectIdGetDatum(relid),
                              0, 0, 0))
        // give up

    onerel = relation_open(relid, NoLock);

Once we're holding lock, we can be sure there's not a DROP TABLE in
progress, so there's no race condition anymore. It's OK to take a
lock on the OID of a relation that no longer exists, AFAICS; we'll
just drop it again immediately (the "give up" path includes transaction
exit, so there's not even any extra code needed).

This wasn't possible before the recent adjustments to the relation
locking protocol, but now it looks trivial ... am I missing anything?

Perhaps it is worth folding this test into a "conditional_relation_open"
function that returns NULL instead of failing if the rel no longer
exists. I think there are potential uses in CLUSTER and perhaps REINDEX
as well as VACUUM.

regards, tom lane