Quick Links

regresssion script hole

Lists:	pgsql-hackers

From:	Andrew Dunstan <andrew(at)dunslane(dot)net>
To:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	regresssion script hole
Date:	2006-06-18 21:30:55
Message-ID:	4495C60F.9090509@dunslane.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

While investigating some problems with buildfarm member spoonbill I came
across this piece of code in pg_regress.sh, which seems less than robust:

# Wait till postmaster is able to accept connections (normally only
# a second or so, but Cygwin is reportedly *much* slower). Don't
# wait forever, however.
i=0
max=60
until "$bindir/psql" -X $psql_options postgres </dev/null 2>/dev/null
do
i=`expr $i + 1`
if [ $i -ge $max ]
then
break
fi
if kill -0 $postmaster_pid >/dev/null 2>&1
then
: still starting up
else
break
fi
sleep 1
done

if kill -0 $postmaster_pid >/dev/null 2>&1
then
echo "running on port $PGPORT with pid $postmaster_pid"
else
echo
echo "$me: postmaster did not start"
echo "Examine $LOGDIR/postmaster.log for the reason."
echo
(exit 2); exit
fi

The problem is that if the postmaster takes more than 60 seconds to
start listening (as is apparently happening on spoonbill - don't yet
know why) this code falls through.

I'm inclined to run the psql test one more time to make sure we can
actually connect, and if not then fail at this point. I wouldn't bother
but it did confuse the heck out of both Stefan and me when createlang
failed.

Thoughts?

cheers

andrew

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: regresssion script hole
Date:	2006-06-18 21:39:01
Message-ID:	23852.1150666741@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> The problem is that if the postmaster takes more than 60 seconds to
> start listening (as is apparently happening on spoonbill - don't yet
> know why) this code falls through.

If the postmaster takes that long to start listening, I'd say we need to
fix the postmaster not pg_regress.

> I'm inclined to run the psql test one more time to make sure we can
> actually connect, and if not then fail at this point.

How does that differ from just iterating the loop one more time?

regards, tom lane

From:	Andrew Dunstan <andrew(at)dunslane(dot)net>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: regresssion script hole
Date:	2006-06-18 22:03:13
Message-ID:	4495CDA1.3000802@dunslane.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Tom Lane wrote:

>Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
>
>
>>The problem is that if the postmaster takes more than 60 seconds to
>>start listening (as is apparently happening on spoonbill - don't yet
>>know why) this code falls through.
>>
>>
>
>If the postmaster takes that long to start listening, I'd say we need to
>fix the postmaster not pg_regress.
>
>
>

We need both, I think. I am still trying to find out why it's taking so
long. This is on the 8.0 branch, though. Later branches seem to be working.

>>I'm inclined to run the psql test one more time to make sure we can
>>actually connect, and if not then fail at this point.
>>
>>
>
>How does that differ from just iterating the loop one more time?
>
>

There is no guarantee that at the end of the loop we have connected
successfully to postgres.

I will post a patch that shows what I suggest.

cheers

andrew

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: regresssion script hole
Date:	2006-06-18 22:47:38
Message-ID:	27921.1150670858@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> We need both, I think. I am still trying to find out why it's taking so
> long. This is on the 8.0 branch, though. Later branches seem to be working.

One idea that comes to mind is a DNS lookup timeout. Can you strace the
postmaster to see what it's doing?

regards, tom lane

From:	Andrew Dunstan <andrew(at)dunslane(dot)net>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: regresssion script hole
Date:	2006-06-18 23:36:31
Message-ID:	4495E37F.20709@dunslane.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Tom Lane wrote:

>Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
>
>
>>We need both, I think. I am still trying to find out why it's taking so
>>long. This is on the 8.0 branch, though. Later branches seem to be working.
>>
>>
>
>One idea that comes to mind is a DNS lookup timeout. Can you strace the
>postmaster to see what it's doing?
>
>
>
>

There is ktrace output I managed to capture at
http://developer.postgresql.org/~adunstan/ktrace.txt

Not sure what it tells us. I do see it reading in the whole timezone db,
which isn't pretty.

cheers

andrew

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: regresssion script hole
Date:	2006-06-18 23:50:04
Message-ID:	165.1150674604@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> Tom Lane wrote:
>> One idea that comes to mind is a DNS lookup timeout. Can you strace the
>> postmaster to see what it's doing?

> There is ktrace output I managed to capture at
> http://developer.postgresql.org/~adunstan/ktrace.txt
> Not sure what it tells us. I do see it reading in the whole timezone db,
> which isn't pretty.

Yeah... I think we fixed that in 8.1. Anyway, the tail end of the trace
shows it repeatedly sending off a UDP packet and getting practically the
same data back:

24583 postgres CALL socket(0x2,0x2,0)
24583 postgres RET socket 3
24583 postgres CALL sendto(0x3,0x43e1e000,0x25,0,0x493a6338,0x10)
24583 postgres GIO fd 3 wrote 37 bytes
"\M-Sr\^A\0\0\^A\0\0\0\0\0\0\^B''\rkaltenbrunner\^Bcc\0\0\^A\0\^A"
24583 postgres RET sendto 37/0x25
24583 postgres CALL getpid()
24583 postgres RET getpid 24583/0x6007
24583 postgres CALL select(0x4,0x40739600,0,0,0xfffffffffffe4d90)
24583 postgres RET select 1
24583 postgres CALL recvfrom(0x3,0x477e4000,0x10000,0,0xfffffffffffe4da0,0xfffffffffffe4d5c)
24583 postgres GIO fd 3 read 37 bytes
"\M-Sr\M^A\M^B\0\^A\0\0\0\0\0\0\^B''\rkaltenbrunner\^Bcc\0\0\^A\0\^A"
24583 postgres RET recvfrom 37/0x25
24583 postgres CALL close(0x3)
24583 postgres RET close 0

I'm not too up on what the DNS protocol looks like on-the-wire, but I'll
bet this is it. I think it's trying to look up "kaltenbrunner.cc" and
failing.

regards, tom lane

From:	Andrew Dunstan <andrew(at)dunslane(dot)net>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
Subject:	Re: regresssion script hole
Date:	2006-06-19 00:43:18
Message-ID:	4495F326.6000404@dunslane.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Tom Lane wrote:

>Anyway, the tail end of the trace
>shows it repeatedly sending off a UDP packet and getting practically the
>same data back:
>
>
>I'm not too up on what the DNS protocol looks like on-the-wire, but I'll
>bet this is it. I think it's trying to look up "kaltenbrunner.cc" and
>failing.
>
>
>

Why are we actually looking up anything? Just so we can bind to a
listening socket?

Anyway, maybe the box needs a lookup line in its /etc/resolv.conf to
direct it to use files first, something like

lookup file bind

Stefan, can you look into that? It would be a bit ugly if it's calling
DNS (and failing) to resolve localhost.

cheers

andrew

From:	Michael Fuhr <mike(at)fuhr(dot)org>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Andrew Dunstan <andrew(at)dunslane(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: regresssion script hole
Date:	2006-06-19 01:18:07
Message-ID:	20060619011807.GA3554@winnie.fuhr.org
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sun, Jun 18, 2006 at 07:50:04PM -0400, Tom Lane wrote:
> 24583 postgres CALL recvfrom(0x3,0x477e4000,0x10000,0,0xfffffffffffe4da0,0xfffffffffffe4d5c)
> 24583 postgres GIO fd 3 read 37 bytes
> "\M-Sr\M^A\M^B\0\^A\0\0\0\0\0\0\^B''\rkaltenbrunner\^Bcc\0\0\^A\0\^A"
> 24583 postgres RET recvfrom 37/0x25
> 24583 postgres CALL close(0x3)
> 24583 postgres RET close 0
>
> I'm not too up on what the DNS protocol looks like on-the-wire, but I'll
> bet this is it. I think it's trying to look up "kaltenbrunner.cc" and
> failing.

Maybe I'm misreading the packet, but I think the query is for
''kaltenbrunner.cc (two single quotes followed by kaltenbrunner.cc)
and the DNS server is responding with SRVFAIL.

--
Michael Fuhr

From:	Michael Fuhr <mike(at)fuhr(dot)org>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Andrew Dunstan <andrew(at)dunslane(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: regresssion script hole
Date:	2006-06-19 01:21:42
Message-ID:	20060619012142.GA3638@winnie.fuhr.org
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sun, Jun 18, 2006 at 07:18:07PM -0600, Michael Fuhr wrote:
> Maybe I'm misreading the packet, but I think the query is for
> ''kaltenbrunner.cc (two single quotes followed by kaltenbrunner.cc)

Correction: ''.kaltenbrunner.cc

--
Michael Fuhr

From:	Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
To:	Michael Fuhr <mike(at)fuhr(dot)org>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrew Dunstan <andrew(at)dunslane(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: regresssion script hole
Date:	2006-06-19 08:44:10
Message-ID:	449663DA.5070203@kaltenbrunner.cc
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Michael Fuhr wrote:
> On Sun, Jun 18, 2006 at 07:18:07PM -0600, Michael Fuhr wrote:
>> Maybe I'm misreading the packet, but I think the query is for
>> ''kaltenbrunner.cc (two single quotes followed by kaltenbrunner.cc)
>
> Correction: ''.kaltenbrunner.cc

yes that is exactly the issue - the postmaster tries to resolve
''.kaltenbrunner.cc multiple times during startup and getting ServFail
as a response from the upstream resolver.

Stefan

From:	Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
To:	Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: regresssion script hole
Date:	2006-06-19 08:52:09
Message-ID:	449665B9.8040806@kaltenbrunner.cc
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Andrew Dunstan wrote:
>
>
> Tom Lane wrote:
>
>> Anyway, the tail end of the trace
>> shows it repeatedly sending off a UDP packet and getting practically the
>> same data back:
>>
>>
>> I'm not too up on what the DNS protocol looks like on-the-wire, but I'll
>> bet this is it. I think it's trying to look up "kaltenbrunner.cc" and
>> failing.
>>
>>
>>
>
> Why are we actually looking up anything? Just so we can bind to a
> listening socket?
>
> Anyway, maybe the box needs a lookup line in its /etc/resolv.conf to
> direct it to use files first, something like
>
> lookup file bind
>
> Stefan, can you look into that? It would be a bit ugly if it's calling
> DNS (and failing) to resolve localhost.

no - resolving localhost works fine (both using /etc/hosts and through
the dns-resolver) - and I infact verified that when we initially started
to investigate that issue a while ago :-)

Stefan

From:	Andrew Dunstan <andrew(at)dunslane(dot)net>
To:	Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: regresssion script hole
Date:	2006-06-19 12:08:46
Message-ID:	449693CE.9090302@dunslane.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Stefan Kaltenbrunner wrote:

>Andrew Dunstan wrote:
>
>
>>Why are we actually looking up anything? Just so we can bind to a
>>listening socket?
>>
>>Anyway, maybe the box needs a lookup line in its /etc/resolv.conf to
>>direct it to use files first, something like
>>
>> lookup file bind
>>
>>Stefan, can you look into that? It would be a bit ugly if it's calling
>>DNS (and failing) to resolve localhost.
>>
>>
>
>
>no - resolving localhost works fine (both using /etc/hosts and through
>the dns-resolver) - and I infact verified that when we initially started
>to investigate that issue a while ago :-)
>
>
>
>

Why are we looking up 'kaltenbrunner.cc' at all then? In any case, can
we just try with that resolver line?

The question isn't whether is succeeds, it's how long it takes to
succeed. When I increased the pg_regress timeout it actually went
through the whole regression test happily. I suspect we have 2 things
eating up the 60s timeout here: loading the timezone db and resolving
whatever it is we are trying to resolve.

cheers

andrew

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc:	Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: regresssion script hole
Date:	2006-06-19 13:21:21
Message-ID:	18364.1150723281@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> The question isn't whether is succeeds, it's how long it takes to
> succeed. When I increased the pg_regress timeout it actually went
> through the whole regression test happily. I suspect we have 2 things
> eating up the 60s timeout here: loading the timezone db and resolving
> whatever it is we are trying to resolve.

The behavior of loading the whole TZ database was there for awhile
before anyone noticed; I believe it could only be responsible for a
few seconds. So the failed DNS responses must be the problem. Could
we get a ktrace with timestamps on the syscalls to confirm that?

Of course the $64 question is *why* is 8.0 trying to resolve that name,
particularly seeing that the later branches apparently aren't.

regards, tom lane

From:	Martijn van Oosterhout <kleptog(at)svana(dot)org>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Andrew Dunstan <andrew(at)dunslane(dot)net>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: regresssion script hole
Date:	2006-06-19 13:29:29
Message-ID:	20060619132929.GC4297@svana.org
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, Jun 19, 2006 at 09:21:21AM -0400, Tom Lane wrote:
> Of course the $64 question is *why* is 8.0 trying to resolve that name,
> particularly seeing that the later branches apparently aren't.

The formatting of the message suggests it is a gethostbyname("''")
doing it. Did any quoting rules change between 8.0 and 8.1 w.r.t. the
configuration files?

I wonder it it'd be worth adding some conditional code around
gethostbyname() calls that warn if the call took longer than say 10
seconds. By printing out that and the string it's looking up you could
save a lot of time confirming if the delay is there or elsewhere...

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

From:	Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
To:	Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrew Dunstan <andrew(at)dunslane(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: regresssion script hole
Date:	2006-06-19 13:39:58
Message-ID:	4496A92E.6090901@kaltenbrunner.cc
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Martijn van Oosterhout wrote:
> On Mon, Jun 19, 2006 at 09:21:21AM -0400, Tom Lane wrote:
>> Of course the $64 question is *why* is 8.0 trying to resolve that name,
>> particularly seeing that the later branches apparently aren't.
>
> The formatting of the message suggests it is a gethostbyname("''")
> doing it. Did any quoting rules change between 8.0 and 8.1 w.r.t. the
> configuration files?

I tcpdump'd the dns-traffic on that box during a postmaster startup and
it's definitly trying to look up ''.kaltenbrunner.cc a lot of times.
And from what it looks like it might be getting somehow rate limited by
my ISPs recursive resolvers after doing the same query a dozens of times
and getting a "servfail" every time.
At least the timestamps seem to indicate that the responses are getting
delayed up to 10 seconds after a number of queries ...
It might be a complete shot in the dark but spoonbill worked fine on
REL_8_0_STABLE until i disabled reporting 3 month ago.
During this time the large escaping security fix/standard_strings patch
went in - could this be related in any way ?

Stefan

From:	Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Andrew Dunstan <andrew(at)dunslane(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: regresssion script hole
Date:	2006-06-19 13:44:44
Message-ID:	4496AA4C.7090704@kaltenbrunner.cc
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Tom Lane wrote:
> Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
>> The question isn't whether is succeeds, it's how long it takes to
>> succeed. When I increased the pg_regress timeout it actually went
>> through the whole regression test happily. I suspect we have 2 things
>> eating up the 60s timeout here: loading the timezone db and resolving
>> whatever it is we are trying to resolve.
>
> The behavior of loading the whole TZ database was there for awhile
> before anyone noticed; I believe it could only be responsible for a
> few seconds. So the failed DNS responses must be the problem. Could
> we get a ktrace with timestamps on the syscalls to confirm that?
>
> Of course the $64 question is *why* is 8.0 trying to resolve that name,
> particularly seeing that the later branches apparently aren't.

hmm maybe the later branches are trying to resolve that too - but only
the combination of the TZ database loading + the failed DNS-queries is
pushing the startup time over the 60 second limit on this (quite slow) box ?

I will try to verify what the later branches are doing exactly ...

Stefan

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
Cc:	Martijn van Oosterhout <kleptog(at)svana(dot)org>, Andrew Dunstan <andrew(at)dunslane(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: regresssion script hole
Date:	2006-06-19 13:49:03
Message-ID:	18572.1150724943@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc> writes:
> I tcpdump'd the dns-traffic on that box during a postmaster startup and
> it's definitly trying to look up ''.kaltenbrunner.cc a lot of times.

I just strace'd postmaster start on a Fedora box and can see nothing
corresponding. Since this is a "make check" we know that the PG
configuration files it's using are stock ... so it must be something
about the system config that's sending it round the bend. What do you
have in /etc/hosts, /etc/resolv.conf, /etc/nsswitch.conf?

regards, tom lane

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
Cc:	Martijn van Oosterhout <kleptog(at)svana(dot)org>, Andrew Dunstan <andrew(at)dunslane(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: regresssion script hole
Date:	2006-06-19 13:54:38
Message-ID:	18613.1150725278@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Oh, I think I see the problem:

8.0 pg_regress:

if [ "$unix_sockets" = no ]; then
postmaster_options="$postmaster_options -c listen_addresses=$hostname"
else
postmaster_options="$postmaster_options -c listen_addresses=''"
fi

8.1 pg_regress:

if [ "$unix_sockets" = no ]; then
postmaster_options="$postmaster_options -c listen_addresses=$hostname"
else
postmaster_options="$postmaster_options -c listen_addresses="
fi

regards, tom lane

From:	Andrew Dunstan <andrew(at)dunslane(dot)net>
To:	Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: regresssion script hole
Date:	2006-06-19 13:59:08
Message-ID:	4496ADAC.80007@dunslane.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Stefan Kaltenbrunner wrote:
> Tom Lane wrote:
>
>> Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
>>
>>> The question isn't whether is succeeds, it's how long it takes to
>>> succeed. When I increased the pg_regress timeout it actually went
>>> through the whole regression test happily. I suspect we have 2 things
>>> eating up the 60s timeout here: loading the timezone db and resolving
>>> whatever it is we are trying to resolve.
>>>
>> The behavior of loading the whole TZ database was there for awhile
>> before anyone noticed; I believe it could only be responsible for a
>> few seconds. So the failed DNS responses must be the problem. Could
>> we get a ktrace with timestamps on the syscalls to confirm that?
>>
>> Of course the $64 question is *why* is 8.0 trying to resolve that name,
>> particularly seeing that the later branches apparently aren't.
>>
>
> hmm maybe the later branches are trying to resolve that too - but only
> the combination of the TZ database loading + the failed DNS-queries is
> pushing the startup time over the 60 second limit on this (quite slow) box ?
>
> I will try to verify what the later branches are doing exactly ...
>
>
>

Yes, we're on the margin here. The successful runs I saw got through the
timeout in 5 or 10 seconds over the 60 that we currently allow.

What interests me is where it even gets the string 'kaltenbrunner.cc'
from. It looks to me like the most likely place is the search line in
/etc/resolv.conf. Ir would be nice to know exactly what it is trying to
resolve.

cheers

andrew

cheers

andrew

From:	Andrew Dunstan <andrew(at)dunslane(dot)net>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: regresssion script hole
Date:	2006-06-19 14:16:08
Message-ID:	4496B1A8.6090205@dunslane.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Tom Lane wrote:
> Oh, I think I see the problem:
>
> 8.0 pg_regress:
>
> if [ "$unix_sockets" = no ]; then
> postmaster_options="$postmaster_options -c listen_addresses=$hostname"
> else
> postmaster_options="$postmaster_options -c listen_addresses=''"
> fi
>
> 8.1 pg_regress:
>
> if [ "$unix_sockets" = no ]; then
> postmaster_options="$postmaster_options -c listen_addresses=$hostname"
> else
> postmaster_options="$postmaster_options -c listen_addresses="
> fi
>
>
>

Good catch! I'm impressed! This is surely the heart of the problem.

That change (from rev 1.56) clearly needs to be backported to 8.0.

cheers

andrew

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
Cc:	Martijn van Oosterhout <kleptog(at)svana(dot)org>, Andrew Dunstan <andrew(at)dunslane(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: regresssion script hole
Date:	2006-06-19 14:16:18
Message-ID:	18772.1150726578@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

I wrote:
> 8.0 pg_regress:
> postmaster_options="$postmaster_options -c listen_addresses=''"
> 8.1 pg_regress:
> postmaster_options="$postmaster_options -c listen_addresses="

and in fact here's the commit that changed that:

2005-06-19 22:26 tgl

* src/test/regress/pg_regress.sh: Change shell syntax that seems
not to work right on FreeBSD 6-CURRENT buildfarm machines.

So apparently it's some generic disease of the BSD shell. I should have
back-patched at the time but did not. Will take care of it.

On the timezone search business, it's still the case that HEAD will
search through all the timezones if it's not given an explicit setting
(eg an explicit environment TZ value). We could suppress that by having
pg_regress set TZ, but then the regression tests wouldn't exercise the
search code at all, which is probably not a great idea.

regards, tom lane