Re: Brittleness in regression test setup

Lists: pgsql-hackers
From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Brittleness in regression test setup
Date: 2008-11-14 15:28:53
Message-ID: 491D9935.9010200@gmx.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I have experienced some brittleness in the regression test setup that
causes the tests to be run against a different server instance or fail
in confusing ways when you have multiple instances running.

For some historic reasons, I have my local scripts set up so that they
build development instances using the hardcoded port 65432. This will
cause a temp install regression test to attempt to use port 565432 which
will be rejected silently by pg_regress, which will then use its
hardcoded default 65432 (no relation to my 65432). If I have some other
instance already running on 65432, then this will fail in non-reassuring
ways such as

============== removing existing temp installation ==============
============== creating temporary installation ==============
============== initializing database system ==============
============== starting postmaster ==============
running on port 65432 with pid 94178
============== creating database "regression" ==============
ERROR: database "regression" already exists

It evidently failed to realize that there is another postmaster already
running at that port and just ran its test setup routines against that
other instance.

If there is no database named "regression" on that other instance, then
it will happily go ahead and run its full test suite against that other
instance.

I see two problems here:

1) It fails to realize that it could not start its own temp instance
when another instance is already running.

2) It ignores the port specification almost silently.

Since ports above 49152 are for private use, I think it is valid to
configure test instances in that port range, but our regression test
setup does not handle that port range very well.

So even if I configured my local scripts to use ports that are all
different from each other and from 65432, this problem would still exist.

So, also,

2a) It has an undocumented hardcoded default port.

Any thoughts on how to fix this?


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Brittleness in regression test setup
Date: 2008-11-14 16:13:20
Message-ID: 491DA3A0.1080301@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Peter Eisentraut wrote:
>
> So even if I configured my local scripts to use ports that are all
> different from each other and from 65432, this problem would still exist.

Only if you choose the private port to be above 9999. The standard
buildfarm config file contains a warning against using such ports for
exactly this reason. Is using a private 4-digit port terribly difficult
for you?

>
> So, also,
>
> 2a) It has an undocumented hardcoded default port.
>
> Any thoughts on how to fix this?
>

a) document it

b) make it a lot noisier if it falls back on 65432.

cheers

andrew


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Brittleness in regression test setup
Date: 2008-11-14 16:50:31
Message-ID: 28469.1226681431@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Peter Eisentraut <peter_e(at)gmx(dot)net> writes:
> For some historic reasons, I have my local scripts set up so that they
> build development instances using the hardcoded port 65432. This will
> cause a temp install regression test to attempt to use port 565432 which
> will be rejected silently by pg_regress, which will then use its
> hardcoded default 65432 (no relation to my 65432).

One thing we should do is have pg_regress.c, not the Makefile,
select the default port to use. The concatenate-5 behavior is
just not intelligent enough.

regards, tom lane


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Brittleness in regression test setup
Date: 2008-11-25 13:52:34
Message-ID: 492C0322.1050907@gmx.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Peter Eisentraut wrote:
> ============== removing existing temp installation ==============
> ============== creating temporary installation ==============
> ============== initializing database system ==============
> ============== starting postmaster ==============
> running on port 65432 with pid 94178
> ============== creating database "regression" ==============
> ERROR: database "regression" already exists
>
> It evidently failed to realize that there is another postmaster already
> running at that port and just ran its test setup routines against that
> other instance.

On this matter, I noticed that pg_regress doesn't do anything to clean
up its child processes. I see zombies lying around on Linux and Mac OS
when the postmaster dies. (And the zombie is exactly the pid 94178 it
announced in the example above.)

I played around a little with signal handling to collect the dying
postmaster and report and error; see attached rough patch. I'm not
exactly understanding how this works though. I would expect lots of
psql zombies for example, because those go through the same
spawn_process() call, but I'm not seeing any. Also, the sleep() call in
my patch is necessary to get some effect. Anyone else go a clue on what
to do here?

Attachment Content-Type Size
pgregress-postmaster-death.diff text/plain 3.4 KB

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Brittleness in regression test setup
Date: 2008-11-25 14:03:52
Message-ID: 15318.1227621832@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Peter Eisentraut <peter_e(at)gmx(dot)net> writes:
> I played around a little with signal handling to collect the dying
> postmaster and report and error; see attached rough patch. I'm not
> exactly understanding how this works though. I would expect lots of
> psql zombies for example, because those go through the same
> spawn_process() call, but I'm not seeing any.

That's because wait_for_tests wait()s for them.

AFAICS the only way you'd end up with a zombie postmaster is if pg_ctl
stop fails, but I'm failing to understand why that's likely to happen.

regards, tom lane


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Brittleness in regression test setup
Date: 2008-11-25 15:17:05
Message-ID: 492C16F1.20105@gmx.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
> One thing we should do is have pg_regress.c, not the Makefile,
> select the default port to use. The concatenate-5 behavior is
> just not intelligent enough.

How about something like this, constructing a port number from the
version and a timestamp? We could also take 2 more bits from the
version and give it to the timestamp, which would make this a bit safer,
I think.

Attachment Content-Type Size
pgregress-port-number.diff text/plain 4.7 KB

From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Brittleness in regression test setup
Date: 2008-11-25 15:27:15
Message-ID: 20081125152715.GF4875@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Peter Eisentraut wrote:
> Tom Lane wrote:
>> One thing we should do is have pg_regress.c, not the Makefile,
>> select the default port to use. The concatenate-5 behavior is
>> just not intelligent enough.
>
> How about something like this, constructing a port number from the
> version and a timestamp? We could also take 2 more bits from the
> version and give it to the timestamp, which would make this a bit safer,
> I think.

Is it possible to make it retry in case the chosen port is busy? I
guess a simple check should suffice, ignoring the obvious race condition
that someone uses the port after you checked it was OK.

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Brittleness in regression test setup
Date: 2008-11-25 15:50:09
Message-ID: 27154.1227628209@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Peter Eisentraut <peter_e(at)gmx(dot)net> writes:
> Tom Lane wrote:
>> One thing we should do is have pg_regress.c, not the Makefile,
>> select the default port to use. The concatenate-5 behavior is
>> just not intelligent enough.

> How about something like this, constructing a port number from the
> version and a timestamp? We could also take 2 more bits from the
> version and give it to the timestamp, which would make this a bit safer,
> I think.

I'd vote for keeping the --temp-port option but not having the Makefile
use it. Seems like it'd still be potentially useful for hand use of
pg_regress.

Also, like Alvaro I'm thinking that a retry is really needed. As this
patch stands you'd be vulnerable to random, unrepeatable failures
anytime you picked a port that happened to be in use for something else.

regards, tom lane


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Brittleness in regression test setup
Date: 2008-11-26 11:56:35
Message-ID: 492D3973.4050507@gmx.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
> AFAICS the only way you'd end up with a zombie postmaster is if pg_ctl
> stop fails, but I'm failing to understand why that's likely to happen.

No, the zombies appear if the postmaster dies (briefly) after launch.


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Brittleness in regression test setup
Date: 2008-11-26 12:03:21
Message-ID: 492D3B09.3020604@gmx.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Alvaro Herrera wrote:
> Is it possible to make it retry in case the chosen port is busy? I
> guess a simple check should suffice, ignoring the obvious race condition
> that someone uses the port after you checked it was OK.

Well, the whole point of this exercise was to avoid that. If we had a
way to do a "simple check", we might as well stick to the hardcoded port
and count up from that or something.

The problem with doing the checking is that you have to emulate the
complete postmaster logic for port numbers, listen addresses, Unix
domain socket directories, etc. That can become quite involved.

Then again, a simple way to avoid the issue altogether on platforms
supporting Unix-domain sockets would be to run the test over Unix-domain
sockets (which we do anyway) placed in a private directory. How about that?


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Brittleness in regression test setup
Date: 2008-11-26 12:04:36
Message-ID: 492D3B54.1020004@gmx.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
> I'd vote for keeping the --temp-port option but not having the Makefile
> use it. Seems like it'd still be potentially useful for hand use of
> pg_regress.

Sorry, I didn't document this fully. The --temp-port option appears to
be redundant with the --port option, so I figured we could drop the
former and just use the latter for both the temp install and existing
install cases.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Brittleness in regression test setup
Date: 2008-11-26 13:28:04
Message-ID: 6822.1227706084@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Peter Eisentraut <peter_e(at)gmx(dot)net> writes:
> Then again, a simple way to avoid the issue altogether on platforms
> supporting Unix-domain sockets would be to run the test over Unix-domain
> sockets (which we do anyway) placed in a private directory. How about that?

Then the brittleness is still there on Windows, only we'd probably get
confused and think it was a platform-specific bug.

regards, tom lane


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Brittleness in regression test setup
Date: 2008-11-27 10:31:05
Message-ID: 492E76E9.4050208@gmx.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Peter Eisentraut wrote:
> Alvaro Herrera wrote:
>> Is it possible to make it retry in case the chosen port is busy? I
>> guess a simple check should suffice, ignoring the obvious race condition
>> that someone uses the port after you checked it was OK.
>
> Well, the whole point of this exercise was to avoid that. If we had a
> way to do a "simple check", we might as well stick to the hardcoded port
> and count up from that or something.

Well, duh, the checking is actually pretty simple. We just try to
connect with psql to the candidate port number before starting our own
postmaster and see if anyone is already there.

Patch attached. It solves my immediate problems nicely.

Attachment Content-Type Size
pgregress-port-number.diff text/plain 6.2 KB

From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Brittleness in regression test setup
Date: 2008-11-27 11:56:52
Message-ID: 20081127115652.GA4586@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Peter Eisentraut wrote:
> Peter Eisentraut wrote:
>> Alvaro Herrera wrote:
>>> Is it possible to make it retry in case the chosen port is busy? I
>>> guess a simple check should suffice, ignoring the obvious race condition
>>> that someone uses the port after you checked it was OK.
>>
>> Well, the whole point of this exercise was to avoid that. If we had a
>> way to do a "simple check", we might as well stick to the hardcoded
>> port and count up from that or something.
>
> Well, duh, the checking is actually pretty simple. We just try to
> connect with psql to the candidate port number before starting our own
> postmaster and see if anyone is already there.

But what if something else is using the port? I think you could attempt
a bare connect().

Note typo here:

> + fprintf(stderr, _("Specify an used port using the --port option or shut down any conflicting PostgreSQL servers.\n"));

Should say "an unused port"

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Brittleness in regression test setup
Date: 2008-11-27 12:23:27
Message-ID: 492E913F.8070207@gmx.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Alvaro Herrera wrote:
>> Well, duh, the checking is actually pretty simple. We just try to
>> connect with psql to the candidate port number before starting our own
>> postmaster and see if anyone is already there.
>
> But what if something else is using the port? I think you could attempt
> a bare connect().

Well, that goes beyond the scope of my original problem, which is that
the regression tests will silently run against a different installation.
If you run psql against a non-PostgreSQL server, you will hopefully
see more obvious failures. We could add this in the future, if there
are complaints from the field.