Re: join regression failure on cygwin

Lists: pgsql-hackers
From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: join regression failure on cygwin
Date: 2009-07-23 00:29:09
Message-ID: 4A67AED5.6000908@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


My Cygwin buildfarm member started failing (hanging, in fact) recently.
It seems to hang consistently in join.sql and the only way I can get it
to complete is to kill the backend fairly violently. See
<http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=brown_bat&dt=2009-07-22%2023:10:21>
It looks like this is the result of the semi-join ordering bug fix Tom
applied a few days ago, but it is building and running 8.4 quite
happily, and I think that branch got the same changes, so I'm not quite
sure about it.

I'm going to try reversing that change locally to see if it fixes the
problem. Unfortunately, this isn't an easy platform to debug.

cheers

andrew


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To:
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: join regression failure on cygwin
Date: 2009-07-23 02:02:34
Message-ID: 4A67C4BA.90505@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I wrote:
>
> My Cygwin buildfarm member started failing (hanging, in fact)
> recently. It seems to hang consistently in join.sql and the only way I
> can get it to complete is to kill the backend fairly violently. See
> <http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=brown_bat&dt=2009-07-22%2023:10:21>
> It looks like this is the result of the semi-join ordering bug fix Tom
> applied a few days ago, but it is building and running 8.4 quite
> happily, and I think that branch got the same changes, so I'm not
> quite sure about it.
>
> I'm going to try reversing that change locally to see if it fixes the
> problem. Unfortunately, this isn't an easy platform to debug.
>
>

Nope, it wasn't that. So my next candidate is the prior
join_is_legal/GEQO changes. But reverting that would get rid of the new
regression test where we seem to be failing, so I'm not quite sure where
to go on it.

cheers

andrew


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: join regression failure on cygwin
Date: 2009-07-23 04:42:37
Message-ID: 8689.1248324157@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
>> My Cygwin buildfarm member started failing (hanging, in fact)
>> recently. It seems to hang consistently in join.sql and the only way I
>> can get it to complete is to kill the backend fairly violently. See
>> <http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=brown_bat&dt=2009-07-22%2023:10:21>

> Nope, it wasn't that. So my next candidate is the prior
> join_is_legal/GEQO changes. But reverting that would get rid of the new
> regression test where we seem to be failing, so I'm not quite sure where
> to go on it.

I see it claims to have working erand48, but maybe not so much in
reality?

regards, tom lane


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: join regression failure on cygwin
Date: 2009-07-23 18:09:53
Message-ID: 4A68A771.1040401@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
>
> I see it claims to have working erand48, but maybe not so much in
> reality?
>
>
>

Good guess. I removed erand48 from the configure file and added
erand48.o to OBJS in src/port/Makefile and it suddenly sailed through.

cheers

andrew


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: join regression failure on cygwin
Date: 2009-07-23 18:16:59
Message-ID: 22722.1248373019@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> Tom Lane wrote:
>> I see it claims to have working erand48, but maybe not so much in
>> reality?

> Good guess. I removed erand48 from the configure file and added
> erand48.o to OBJS in src/port/Makefile and it suddenly sailed through.

Hmm. So we need to figure out how to improve configure's check so that
it rejects whatever broken version you've got ...

regards, tom lane


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: join regression failure on cygwin
Date: 2009-07-23 18:42:56
Message-ID: 4A68AF30.6060003@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
> Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
>
>> Tom Lane wrote:
>>
>>> I see it claims to have working erand48, but maybe not so much in
>>> reality?
>>>
>
>
>> Good guess. I removed erand48 from the configure file and added
>> erand48.o to OBJS in src/port/Makefile and it suddenly sailed through.
>>
>
> Hmm. So we need to figure out how to improve configure's check so that
> it rejects whatever broken version you've got ...
>
>
>

Yeah. Any ideas? I'd hate just to exclude the system erand48 on Cygwin
and then find out later it's broken on some other abstruse system.

cheers

andrew


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: join regression failure on cygwin
Date: 2009-07-23 21:57:19
Message-ID: 9684.1248386239@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> Tom Lane wrote:
>> Hmm. So we need to figure out how to improve configure's check so that
>> it rejects whatever broken version you've got ...

> Yeah. Any ideas? I'd hate just to exclude the system erand48 on Cygwin
> and then find out later it's broken on some other abstruse system.

Seems like it would be useful to figure out exactly why it's failing.

I don't personally have a problem with just forcing use of our own
erand48 on Cygwin; it's not a lot of code and it would make the behavior
of that build more like the MSVC build. But it's curious that such a
simple library function is seemingly broken on Cygwin ... especially
when their random() and srandom() evidently work.

regards, tom lane


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: join regression failure on cygwin
Date: 2009-07-23 22:43:45
Message-ID: 4A68E7A1.70301@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
> Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
>
>> Tom Lane wrote:
>>
>>> Hmm. So we need to figure out how to improve configure's check so that
>>> it rejects whatever broken version you've got ...
>>>
>
>
>> Yeah. Any ideas? I'd hate just to exclude the system erand48 on Cygwin
>> and then find out later it's broken on some other abstruse system.
>>
>
> Seems like it would be useful to figure out exactly why it's failing.
>
> I don't personally have a problem with just forcing use of our own
> erand48 on Cygwin; it's not a lot of code and it would make the behavior
> of that build more like the MSVC build. But it's curious that such a
> simple library function is seemingly broken on Cygwin ... especially
> when their random() and srandom() evidently work.
>

I'll work on it, but for now I propose to make the following change to
configure.in and the corresponding change in configure:

diff -u -r1.605 configure.in
--- configure.in 16 Jul 2009 17:43:52 -0000 1.605
+++ configure.in 23 Jul 2009 22:39:19 -0000
@@ -1249,7 +1249,7 @@
pgac_save_LIBS="$LIBS"
LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`

-AC_REPLACE_FUNCS([crypt erand48 getopt getrusage inet_aton random rint
srandom strdup strerror strlcat strlcpy strtol strtoul])
+AC_REPLACE_FUNCS([crypt getopt getrusage inet_aton random rint srandom
strdup strerror strlcat strlcpy strtol strtoul])

case $host_os in

@@ -1262,6 +1262,12 @@
;;
esac

+# Cygwin's erand48 sometimes hangs, so force use of ours
+if test "$PORTNAME" = "cygwin"; then
+ AC_LIBOBJ(erand48)
+else
+ AC_REPLACE_FUNCS([erand48])
+fi

LIBS="$pgac_save_LIBS"

> regards, tom lane
>
>


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: join regression failure on cygwin
Date: 2009-07-23 23:06:07
Message-ID: 14722.1248390367@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> I'll work on it, but for now I propose to make the following change to
> configure.in and the corresponding change in configure:

I believe you can just add AC_LIBOBJ(erand48) in the Cygwin-specific
section without touching the other part; that's supposed to be a no-op
if the filename has already been added to LIBOBJS.

regards, tom lane


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: join regression failure on cygwin
Date: 2009-07-24 13:05:50
Message-ID: 4A69B1AE.1030502@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
> I don't personally have a problem with just forcing use of our own
> erand48 on Cygwin; it's not a lot of code and it would make the behavior
> of that build more like the MSVC build. But it's curious that such a
> simple library function is seemingly broken on Cygwin ... especially
> when their random() and srandom() evidently work.
>
>

It appears on Googling a bit that the erand48() is buggy in that it
requires the seed to have been initialized with srand48() or it will
constantly return 0.0.

So I think just forcing use of ours is the safe way to go. It might have
been fixed since I installed Cygwin, although I can't find a reference
to that, and I don't feel like triangulating it anyway.

cheers

andrew


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: join regression failure on cygwin
Date: 2009-07-24 13:33:20
Message-ID: 25962.1248442400@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> It appears on Googling a bit that the erand48() is buggy in that it
> requires the seed to have been initialized with srand48() or it will
> constantly return 0.0.

Huh, and that sends us into an infinite loop? I'll take a look at that.
Even though it's surely nonrandom, it doesn't seem like pathological
behavior of the RNG should wedge us completely.

> So I think just forcing use of ours is the safe way to go.

Agreed, but I'm curious ...

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: join regression failure on cygwin
Date: 2009-07-24 14:22:23
Message-ID: 26720.1248445343@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I wrote:
> Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
>> It appears on Googling a bit that the erand48() is buggy in that it
>> requires the seed to have been initialized with srand48() or it will
>> constantly return 0.0.

> Huh, and that sends us into an infinite loop? I'll take a look at that.
> Even though it's surely nonrandom, it doesn't seem like pathological
> behavior of the RNG should wedge us completely.

The answer is that a constant RNG result sends this bit of
geqo_selection() into a tight loop:

int first,
second;

first = linear(root, pool->size, bias);
second = linear(root, pool->size, bias);

if (pool->size > 1)
{
while (first == second)
second = linear(root, pool->size, bias);
}

Not sure if it's worth trying to do something about that, or exactly
what we'd do anyway. Even if we hacked this up somehow, a constant RNG
result would pretty much break GEQO for any useful purpose. So it could
be argued that having the regression tests fail here is a good thing...

regards, tom lane


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: join regression failure on cygwin
Date: 2009-07-24 14:27:54
Message-ID: 4A69C4EA.4000909@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
> I wrote:
>
>> Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
>>
>>> It appears on Googling a bit that the erand48() is buggy in that it
>>> requires the seed to have been initialized with srand48() or it will
>>> constantly return 0.0.
>>>
>
>
>> Huh, and that sends us into an infinite loop? I'll take a look at that.
>> Even though it's surely nonrandom, it doesn't seem like pathological
>> behavior of the RNG should wedge us completely.
>>
>
> The answer is that a constant RNG result sends this bit of
> geqo_selection() into a tight loop:
>
> int first,
> second;
>
> first = linear(root, pool->size, bias);
> second = linear(root, pool->size, bias);
>
> if (pool->size > 1)
> {
> while (first == second)
> second = linear(root, pool->size, bias);
> }
>
> Not sure if it's worth trying to do something about that, or exactly
> what we'd do anyway. Even if we hacked this up somehow, a constant RNG
> result would pretty much break GEQO for any useful purpose. So it could
> be argued that having the regression tests fail here is a good thing...
>
>
>

Right. Let's let sleeping dogs lie. I think at most a code comment is
the only action called for.

cheers

andrew