Re: reproduced elusive cygwin bug

Lists: pgsql-cygwin
From: Jason Tishler <jason(at)tishler(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Michael Adler <adler(at)glimpser(dot)org>
Cc: Pgsql-Cygwin <pgsql-cygwin(at)postgresql(dot)org>
Subject: Re: reproduced elusive cygwin bug
Date: 2002-01-16 14:31:16
Message-ID: 20020116143116.GA1804@dothill.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-cygwin

Tom,
Michael,

Please post to pgsql-cygwin(at)postgresql(dot)org instead of sending private
email so that others can benefit and possibly help and to record this
issue in the archives too.

On Tue, Jan 15, 2002 at 03:38:56PM -0500, Tom Lane wrote:
> Michael Adler <adler(at)glimpser(dot)org> writes:
> >> Most curious. Needless to say, that test procedure shows no problems on
> >> Unix machines. Perhaps a cygwin bug?
>
> > Correct. In the first thread mentioned above, you and Jason discuss what
> > is probably happening. Windows doesn't let one process delete another's
> > file. For some reason, it fails less gracefully here.
>
> Hmm. Actually, the sequence of events appears to be:
>
> 1. Vacuum process deletes pg_internal.init (as it's supposed to).
>
> 2. next backend to start attempts to create a new pg_internal.init.
> It does this by creating a temporary file name, writing it, and then
> renaming it into place. (This is to ensure we don't leave broken files
> behind if two or more new backends try to do this at about the same
> time. On Unix, at least, rename() is guaranteed atomic.)

I can reproduce the problem with just the above two steps...

> BTW, I do not see what connection a temp table would have with this.
> AFAICS it wouldn't matter what you vacuumed. Does it really have to
> be a temp table?

..and I agree that temporary tables are not part of the root cause.

> It would appear that if the process that deleted pg_internal.init is
> still around, a rename to create a fresh pg_internal.init fails. I
> cannot understand why this should be; surely it's either a Windows bug
> or a cygwin bug?

Actually, the root cause is due to the following Windows "feature":

If one process has a file open (i.e, has a open handle), then other
processes will not be able to perform certain operations on this file.

Hence, until *all* handles open during the unlink() are closed, rename()
(and other operations) will fail.

So the problem is, in fact, even worse. Consider the following scenario:

1. A connects
2. B connects
3. C connects
4. B vacuums
5. D connects
6. B disconnects
7. D connects
8. A disconnects
9. C disconnects
10. D connects

After 3, the three backends associated with each client have open handles
to pg_internal.init (verified by the SysInternals Handle utility). After
4, pg_internal.init is successfully deleted (verified by Cygwin's strace).

During 5, rename() fails because the backends associated with A, B,
and C have open handles. During 7, rename() stills fail even though
the backend that performed the vacuum has exited (and closed its handle)
because of the other open handles. During 10, rename() succeeds because
*all* handles open during unlink() (i.e., DeleteFile()) have been closed.

> If it's true, however, you should be able to repeat
> the problem with a couple of trivial C programs,

See attached for such programs which can be built as follows:

$ g++ -o open open.cc
$ g++ -o unlink unlink.cc

The following is a sample run:

$ # window 1
$ >foo
open foo
Enter any character (and CF) to continue...

$ # window 2
$ unlink foo
Enter any character (and CF) to continue...

$ # window 3
$ # both open and unlink running
$ handle foo | fgrep foo
open.exe pid: 1784 C:\home\jtishler\tmp\pgsql\foo
unlink.exe pid: 2284 C:\home\jtishler\tmp\pgsql\foo
$ ls foo
ls: foo: Permission denied
$ # only open running
$ ls foo
ls: foo: Permission denied
$ # both open and unlink have exited
$ ls foo
ls: foo: No such file or directory
$ date >foo
$ cat foo
Wed Jan 16 09:03:18 2002

> which would be a fit setting for a bug report to the cygwin people.

IMO, this is not fixable in Cygwin. Unfortunately, I don't think
that this is fixable in PostgreSQL either -- unless it can close all
pg_internal.init file descriptors when necessary. Hopefully, I'm wrong...

Jason

Attachment Content-Type Size
open.cc text/plain 520 bytes
unlink.cc text/plain 641 bytes

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jason Tishler <jason(at)tishler(dot)net>
Cc: Michael Adler <adler(at)glimpser(dot)org>, Pgsql-Cygwin <pgsql-cygwin(at)postgresql(dot)org>
Subject: Re: reproduced elusive cygwin bug
Date: 2002-01-16 15:40:52
Message-ID: 9289.1011195652@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-cygwin

Jason Tishler <jason(at)tishler(dot)net> writes:
>> It would appear that if the process that deleted pg_internal.init is
>> still around, a rename to create a fresh pg_internal.init fails. I
>> cannot understand why this should be; surely it's either a Windows bug
>> or a cygwin bug?

> Actually, the root cause is due to the following Windows "feature":

> If one process has a file open (i.e, has a open handle), then other
> processes will not be able to perform certain operations on this file.

> Hence, until *all* handles open during the unlink() are closed, rename()
> (and other operations) will fail.

Hmm. But pg_internal.init is not held open, or shouldn't be. It should
be read once and immediately closed during backend startup. Are we
leaking an open file descriptor someplace?

Based on your description, there would be a small window for the problem
to occur anyway, but the window seems to be much larger than it oughta
be.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jason Tishler <jason(at)tishler(dot)net>
Cc: Michael Adler <adler(at)glimpser(dot)org>, Pgsql-Cygwin <pgsql-cygwin(at)postgresql(dot)org>
Subject: Re: reproduced elusive cygwin bug
Date: 2002-01-16 16:02:06
Message-ID: 9480.1011196926@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-cygwin

I said:
> Hmm. But pg_internal.init is not held open, or shouldn't be. It should
> be read once and immediately closed during backend startup.

Well damn. init_irels() forgets to close the file. Will fix for 7.2.

Meanwhile, that still leaves us with a JDBC problem. Michael, will you
report that to the pgsql-jdbc list?

regards, tom lane


From: Jason Tishler <jason(at)tishler(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Michael Adler <adler(at)glimpser(dot)org>, Tom Pfau <T(dot)Pfau(at)emCrit(dot)com>, Pgsql-Cygwin <pgsql-cygwin(at)postgresql(dot)org>
Subject: Re: reproduced elusive cygwin bug
Date: 2002-01-16 16:08:56
Message-ID: 20020116160855.GD1804@dothill.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-cygwin

Tom,

On Wed, Jan 16, 2002 at 10:40:52AM -0500, Tom Lane wrote:
> Hmm. But pg_internal.init is not held open, or shouldn't be. It should
> be read once and immediately closed during backend startup. Are we
> leaking an open file descriptor someplace?

Yes. In src/backend/utils/cache/relcache.c:init_irels(), there is a
FileNameOpenFile() but no corresponding FileClose(). I will patch my
7.1.3 and see if this solves the problem.

Jason


From: Michael Adler <adler(at)glimpser(dot)org>
To: Pgsql-Cygwin <pgsql-cygwin(at)postgresql(dot)org>
Subject: Re: reproduced elusive cygwin bug
Date: 2002-01-16 16:28:05
Message-ID: Pine.NEB.4.44.0201161126440.19719-100000@reva.sixgirls.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-cygwin


On Wed, 16 Jan 2002, Tom Lane wrote:

> I said:
> > Hmm. But pg_internal.init is not held open, or shouldn't be. It
should
> > be read once and immediately closed during backend startup.
>
> Well damn. init_irels() forgets to close the file. Will fix for 7.2.
>
> Meanwhile, that still leaves us with a JDBC problem. Michael, will you
> report that to the pgsql-jdbc list?

Yes, I'll report as soon as I am granted access to that list. The
subscriber hasn't responded with a confirmation, however, after several
hours I started receiving the CYGWIN messages. I guess I'll shortly be on
th JDBC list.

It looks like the JDBC interfaces throws a nasty SQLexception when it
receives any NOTICE (or any message at that point which starts with "N").
I think that's a bug.

Mike Adler


From: Jason Tishler <jason(at)tishler(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Michael Adler <adler(at)glimpser(dot)org>, Tom Pfau <T(dot)Pfau(at)emCrit(dot)com>, Pgsql-Cygwin <pgsql-cygwin(at)postgresql(dot)org>
Subject: Re: reproduced elusive cygwin bug
Date: 2002-01-16 17:04:56
Message-ID: 20020116170456.GE1804@dothill.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-cygwin

Tom,

On Wed, Jan 16, 2002 at 11:02:06AM -0500, Tom Lane wrote:
> I said:
> > Hmm. But pg_internal.init is not held open, or shouldn't be. It should
> > be read once and immediately closed during backend startup.
>
> Well damn. init_irels() forgets to close the file. Will fix for 7.2.

The attached patch (against Cygwin PostgreSQL 7.1.3-1) solves this
problem. I will release 7.1.3-2 ASAP.

Jason

Attachment Content-Type Size
relcache.c.diff text/plain 2.6 KB

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jason Tishler <jason(at)tishler(dot)net>
Cc: Michael Adler <adler(at)glimpser(dot)org>, Tom Pfau <T(dot)Pfau(at)emCrit(dot)com>, Pgsql-Cygwin <pgsql-cygwin(at)postgresql(dot)org>
Subject: Re: reproduced elusive cygwin bug
Date: 2002-01-16 17:36:00
Message-ID: 11242.1011202560@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-cygwin

Jason Tishler <jason(at)tishler(dot)net> writes:
> The attached patch (against Cygwin PostgreSQL 7.1.3-1) solves this
> problem. I will release 7.1.3-2 ASAP.

I have committed a fix to CVS for 7.2, also.

regards, tom lane