Cause of occasional buildfarm failures in sequence test

Lists: pgsql-hackers
From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: Cause of occasional buildfarm failures in sequence test
Date: 2008-08-31 17:17:51
Message-ID: 11009.1220203071@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

The current result from buildfarm member pigeon
http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=pigeon&dt=2008-08-31%2008:04:54
shows a symptom I've noticed a few times before: everything is fine
except that one "select * from sequence" shows log_cnt = 31 instead
of the expected 32. This is a bit odd since no other backend is
touching that sequence, so you'd expect perfectly reproducible results.

I finally got around to looking through the sequence code to see if I
could figure out what was happening, and indeed I did. The relevant
parts of the test are basically
create sequence foo;
select nextval('foo');
select nextval('foo');
select * from foo;
and you can reproduce the unexpected result if you insert a checkpoint
command between the create and the first nextval. So the buildfarm
failures are due to chance occurrences of a bgwriter-driven checkpoint
occurring just there during the test.

In the normal execution of this test, the CREATE SEQUENCE leaves the
sequence in a state where one nextval() can be done "for free", without
emitting a new WAL record. So the second nextval() is the one that
emits a WAL record and pushes log_cnt up to 32. But if a checkpoint
intervenes, then the first nextval() decides it'd better emit the WAL
record (cf lines 500ff in sequence.c as of CVS HEAD), so it pushes
log_cnt up to 32, and then the second one decreases it to 31.

So unless we want to just live with this test failing occasionally,
it seems we have two choices: redesign the behavior of nextval()
to be insensitive to checkpoint timing, or provide an alternate
regression "expected" file that matches the result with log_cnt = 31.
I favor the second answer --- I don't want to touch the nextval
logic, which has been stable for over six years.

Comments?

regards, tom lane


From: Hannu Krosing <hannu(at)2ndQuadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Cause of occasional buildfarm failures in sequence test
Date: 2008-08-31 17:34:10
Message-ID: 1220204050.17281.1.camel@huvostro
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, 2008-08-31 at 13:17 -0400, Tom Lane wrote:
> The current result from buildfarm member pigeon
> http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=pigeon&dt=2008-08-31%2008:04:54
> shows a symptom I've noticed a few times before: everything is fine
> except that one "select * from sequence" shows log_cnt = 31 instead
> of the expected 32. This is a bit odd since no other backend is
> touching that sequence, so you'd expect perfectly reproducible results.
>
> I finally got around to looking through the sequence code to see if I
> could figure out what was happening, and indeed I did. The relevant
> parts of the test are basically
> create sequence foo;
> select nextval('foo');
> select nextval('foo');
> select * from foo;
> and you can reproduce the unexpected result if you insert a checkpoint
> command between the create and the first nextval. So the buildfarm
> failures are due to chance occurrences of a bgwriter-driven checkpoint
> occurring just there during the test.
>
> In the normal execution of this test, the CREATE SEQUENCE leaves the
> sequence in a state where one nextval() can be done "for free", without
> emitting a new WAL record. So the second nextval() is the one that
> emits a WAL record and pushes log_cnt up to 32. But if a checkpoint
> intervenes, then the first nextval() decides it'd better emit the WAL
> record (cf lines 500ff in sequence.c as of CVS HEAD), so it pushes
> log_cnt up to 32, and then the second one decreases it to 31.
>
> So unless we want to just live with this test failing occasionally,
> it seems we have two choices: redesign the behavior of nextval()
> to be insensitive to checkpoint timing, or provide an alternate
> regression "expected" file that matches the result with log_cnt = 31.
> I favor the second answer --- I don't want to touch the nextval
> logic, which has been stable for over six years.
>
> Comments?

Maybe you get consistent result by just changing the test thus:

checkpoint;
create sequence foo;
select nextval('foo');
select nextval('foo');
select * from foo;

--------------
Hannu


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Hannu Krosing <hannu(at)2ndQuadrant(dot)com>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Cause of occasional buildfarm failures in sequence test
Date: 2008-08-31 17:50:27
Message-ID: 11474.1220205027@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hannu Krosing <hannu(at)2ndQuadrant(dot)com> writes:
> On Sun, 2008-08-31 at 13:17 -0400, Tom Lane wrote:
>> So unless we want to just live with this test failing occasionally,
>> it seems we have two choices: redesign the behavior of nextval()
>> to be insensitive to checkpoint timing, or provide an alternate
>> regression "expected" file that matches the result with log_cnt = 31.
>> I favor the second answer --- I don't want to touch the nextval
>> logic, which has been stable for over six years.

> Maybe you get consistent result by just changing the test thus:

> checkpoint;
> create sequence foo;
> select nextval('foo');
> select nextval('foo');
> select * from foo;

Actually I think we'd need to put the checkpoint after the create,
but yeah we could do that. Or we could leave log_cnt out of the
set of columns displayed. I don't really favor either of those
answers though. They amount to avoiding testing of some code
paths ...

regards, tom lane