Re: Cygwin PostgreSQL Regression Test Problems

Lists: pgsql-ports
From: Jason Tishler <Jason(dot)Tishler(at)dothill(dot)com>
To: pgsql-ports(at)postgresql(dot)org
Subject: Cygwin PostgreSQL Regression Test Problems
Date: 2001-01-16 04:37:55
Message-ID: 20010115233755.B1748@dothill.com

Over the last few days, I ran the regression tests for 7.1 Beta 3 much more
than I have in the past for 7.0.2 and 7.0.3. Unfortunately, I experienced
the following problems:

1. Until I did a cvs update last night (1/14/2001), the regression tests
were failing on 1/12 and 1/13. Did anyone do a cvs commit that would
fix backend children from stackdump-ing on Cygwin? I hope so.

Here are some interesting snippets:

--- pg_regress output ---
..
parallel group (7 tests): create_aggregate create_operator inherit triggers constraints create_misc create_index
constraints ... FAILED
triggers ... FAILED
create_misc ... FAILED
create_aggregate ... ok
..
--- pg_regress output ---

--- postmaster output ---
NOTICE: Message from PostgreSQL backend:
The Postmaster has informed me that some other backend died abnormally and possibly corrupted shared memory.
I have rolled back the current transaction and am going to terminate your database system connection and exit.
Please reconnect to the database system and repeat your query.
..
ERROR: Relation 'temptest' does not exist
0 [main] postmaster 2640 handle_exceptions: Exception: STATUS_ACCESS_VIOLATION
479 [main] postmaster 2640 stackdump: Dumping stack trace to postmaster.exe.stackdump
Server process (pid 2640) exited with status 139 at Sat Jan 13 21:28:36 2001
Terminating any active server processes...
Server processes were terminated at Sat Jan 13 21:28:36 2001
Reinitializing shared memory and semaphores
IpcMemoryDetach: shmdt(0x120b0000) failed: Invalid argument
..
--- postmaster output ---

2. I am unable to successfully run the regression tests on an NT 4.0 SP5
machine with only 64 MB of physical memory and about 175 MB of swap space.
Other than lacking RAM and swap space, this machine is the "same" as other
NT/2000 machines which can successfully run the regression tests.

The tests usually hang during the "parallel group (18 tests)" test
right after numerology. By "hang," I mean that the original postmaster
is still running, but there are no postmaster children, and there are
some number of psql processes hanging around. Using NT's TaskManager,
I can see that the machine is running out of memory. I have even seen
the "Windows is running low on virtual memory" dialog a few times.
Should I expect this behavior from such a lame machine?

3. Once (or twice), I noticed that the plpgsql test failed.
Unfortunately, I didn't capture the precise output but I think that
postmaster was complaining about being unable to

mv <somepath>/pg_internal.init.<somepid> <somepath>/pg_internal.init

due to a permissions problem. Sorry for being vague...

Thanks,
Jason

--
Jason Tishler
Director, Software Engineering Phone: +1 (732) 264-8770 x235
Dot Hill Systems Corp. Fax: +1 (732) 264-8798
82 Bethany Road, Suite 7 Email: Jason(dot)Tishler(at)dothill(dot)com
Hazlet, NJ 07730 USA WWW: http://www.dothill.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jason Tishler <Jason(dot)Tishler(at)dothill(dot)com>
Cc: pgsql-ports(at)postgresql(dot)org
Subject: Re: Cygwin PostgreSQL Regression Test Problems
Date: 2001-01-16 06:45:21
Message-ID: 11518.979627521@sss.pgh.pa.us
Lists: pgsql-ports

Jason Tishler <Jason(dot)Tishler(at)dothill(dot)com> writes:
> parallel group (7 tests): create_aggregate create_operator inherit triggers constraints create_misc create_index
> constraints ... FAILED
> triggers ... FAILED
> create_misc ... FAILED
> create_aggregate ... ok

Can't tell much from this. What are the detail diffs (regression.diffs file?)

> 2. I am unable to successfully run the regression tests on an NT 4.0 SP5
> machine with only 64 MB of physical memory and about 175 MB of swap space.
> Other than lacking RAM and swap space, this machine is the "same" as other
> NT/2000 machines which can successfully run the regression tests.

> The tests usually hang during the "parallel group (18 tests)" test
> right after numerology. By "hang," I mean that the original postmaster
> is still running, but there are no postmaster children, and there are
> some number of psql processes hanging around.

Hm. You will have 18 backends firing up there, plus 18 psqls to drive
'em, and probably 18 shell subprocesses parenting the psqls. I wouldn't
be too surprised at running out of memory --- but one would like to
expect a more graceful failure than just hanging. What if anything
shows up in the postmaster log?

> 3. Once (or twice), I noticed that the plpgsql test failed.
> Unfortunately, I didn't capture the precise output but I think that
> postmaster was complaining about being unable to
> mv <somepath>/pg_internal.init.<somepid> <somepath>/pg_internal.init
> due to a permissions problem. Sorry for being vague...

Hm. The first backend to fire up after a vacuum will try to rebuild
pg_internal.init, and then move it into place with

/*
* And rename the temp file to its final name, deleting any
* previously-existing init file.
*/
if (rename(tempfilename, finalfilename) < 0)
{
elog(NOTICE, "Cannot rename init file %s to %s: %m\n\tContinuing anyway, but there's something wrong.", tempfilename, finalfilename);
}

In a parallel test it's possible that several backends would try to do
this at about the same time, but that should be OK; we should end up
with just one file from the last-to-finish backend. I think you have
found another Cygwin bug :-(

regards, tom lane
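
The write-then-rename step described above can be sketched in portable C.
This is a minimal illustration, not the backend's code: the function name
install_init_file and its return codes are hypothetical, and it reports
through fprintf rather than elog. On POSIX systems rename() replaces an
existing destination, so when several callers race, the last to finish wins.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: write `contents` to `temp_path`, then rename the
 * temp file over `final_path`.
 * Returns 0 on success, -1 if the temp file could not be written, and
 * 1 if only the rename failed (the stray temp file is left behind).
 */
int install_init_file(const char *temp_path, const char *final_path,
                      const char *contents)
{
    assert(temp_path != NULL && final_path != NULL && contents != NULL);

    FILE *fp = fopen(temp_path, "wb");
    if (fp == NULL)
        return -1;

    size_t len = strlen(contents);
    if (fwrite(contents, 1, len, fp) != len) {
        fclose(fp);
        remove(temp_path);
        return -1;
    }
    if (fclose(fp) != 0) {
        remove(temp_path);
        return -1;
    }

    /* Move the finished file into place; complain but carry on if that fails. */
    if (rename(temp_path, final_path) < 0) {
        fprintf(stderr, "NOTICE: cannot rename %s to %s: %s\n",
                temp_path, final_path, strerror(errno));
        return 1;
    }
    return 0;
}
```

Calling it twice with different temp names and the same final name leaves
one file holding the second caller's contents on a POSIX system. Whether the
second rename succeeds under Cygwin while another process holds the
destination open is the question raised in this thread.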


From: Jason Tishler <Jason(dot)Tishler(at)dothill(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-ports(at)postgresql(dot)org
Subject: Re: Cygwin PostgreSQL Regression Test Problems
Date: 2001-01-18 13:46:55
Message-ID: 20010118084655.C1092@dothill.com
Lists: pgsql-ports

Tom,

I'm finally back in front of the machine where I ran these tests...

On Tue, Jan 16, 2001 at 01:45:21AM -0500, Tom Lane wrote:
> Jason Tishler <Jason(dot)Tishler(at)dothill(dot)com> writes:
> > parallel group (7 tests): create_aggregate create_operator inherit triggers constraints create_misc create_index
> > constraints ... FAILED
> > triggers ... FAILED
> > create_misc ... FAILED
> > create_aggregate ... ok
>
> Can't tell much from this. What are the detail diffs (regression.diffs file?)

Unfortunately, I ran more (successful) tests after these failures, so the
detail diffs are no longer available.

> > 2. I am unable to successfully run the regression tests on an NT 4.0 SP5
> > machine with only 64 MB of physical memory and about 175 MB of swap space.
> > Other than lacking RAM and swap space, this machine is the "same" as other
> > NT/2000 machines which can successfully run the regression tests.
>
> What if anything shows up in the postmaster log?

Sorry, the postmaster log is gone too.

> > 3. Once (or twice), I noticed that the plpgsql test failed.
> > Unfortunately, I didn't capture the precise output but I think that
> > postmaster was complaining about being unable to
> > mv <somepath>/pg_internal.init.<somepid> <somepath>/pg_internal.init
> > due to a permissions problem. Sorry for being vague...
>
> Hm. The first backend to fire up after a vacuum will try to rebuild
> pg_internal.init, and then move it into place with
>
> /*
> * And rename the temp file to its final name, deleting any
> * previously-existing init file.
> */
> if (rename(tempfilename, finalfilename) < 0)
> {
> elog(NOTICE, "Cannot rename init file %s to %s: %m\n\tContinuing anyway, but there's something wrong.", tempfilename, finalfilename);
> }
>
> In a parallel test it's possible that several backends would try to do
> this at about the same time, but that should be OK; we should end up
> with just one file from the last-to-finish backend. I think you have
> found another Cygwin bug :-(

Windows has issues with open files. So, if a backend is trying to
rename a file when it is open (by another), then the rename will fail.
Will this cause database integrity problems? Or, will there just be
some spurious warning?

Thanks,
Jason



From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jason Tishler <Jason(dot)Tishler(at)dothill(dot)com>
Cc: pgsql-ports(at)postgresql(dot)org
Subject: Re: Cygwin PostgreSQL Regression Test Problems
Date: 2001-01-18 17:39:59
Message-ID: 3721.979839599@sss.pgh.pa.us
Lists: pgsql-ports

Jason Tishler <Jason(dot)Tishler(at)dothill(dot)com> writes:
>> In a parallel test it's possible that several backends would try to do
>> this at about the same time, but that should be OK; we should end up
>> with just one file from the last-to-finish backend. I think you have
>> found another Cygwin bug :-(

> Windows has issues with open files. So, if a backend is trying to
> rename a file when it is open (by another), then the rename will fail.
> Will this cause database integrity problems? Or, will there just be
> some spurious warning?

In this context the only bad side-effect is that a useless temporary
file gets left around. It's small, so I wouldn't worry too much.

However --- I suppose Windows can't cope with deleting a file someone
else is holding open, either? That would cause significantly bigger
problems :-(

regards, tom lane


From: Jason Tishler <Jason(dot)Tishler(at)dothill(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-ports(at)postgresql(dot)org
Subject: Re: Cygwin PostgreSQL Regression Test Problems
Date: 2001-01-18 17:51:24
Message-ID: 20010118125124.E1092@dothill.com
Lists: pgsql-ports

Tom,

On Thu, Jan 18, 2001 at 12:39:59PM -0500, Tom Lane wrote:
> However --- I suppose Windows can't cope with deleting a file someone
> else is holding open, either?

Yes.

> That would cause significantly bigger problems :-(

That sounds ominous, please elaborate.

Thanks,
Jason



From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jason Tishler <Jason(dot)Tishler(at)dothill(dot)com>
Cc: pgsql-ports(at)postgresql(dot)org
Subject: Re: Cygwin PostgreSQL Regression Test Problems
Date: 2001-01-18 17:59:00
Message-ID: 3889.979840740@sss.pgh.pa.us
Lists: pgsql-ports

Jason Tishler <Jason(dot)Tishler(at)dothill(dot)com> writes:
> On Thu, Jan 18, 2001 at 12:39:59PM -0500, Tom Lane wrote:
>> However --- I suppose Windows can't cope with deleting a file someone
>> else is holding open, either?

> Yes.

>> That would cause significantly bigger problems :-(

> That sounds ominous, please elaborate.

If you drop a table that someone else has recently used, the someone
else's backend is probably still holding the file open. We generally
don't close open file descriptors until we have to.

In current sources I think that you'd get a "cannot unlink" NOTICE,
but the table would get logically dropped anyway, and the sole
side-effect would be failure to recover the disk space. But in this
case we could be talking about large amounts of disk space.

regards, tom lane
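
The Unix behavior that this reasoning depends on can be shown with a short
POSIX C sketch. The names drop_storage and read_after_unlink are
hypothetical, and the NOTICE text only imitates the message mentioned above;
none of this is taken from the PostgreSQL sources. It unlinks a file while a
descriptor is still open on it, and treats an unlink failure as non-fatal.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: try to remove the file behind a dropped table.
 * A failure is reported but not treated as fatal; the only cost is disk
 * space that is not reclaimed.
 * Returns 1 if the file was removed, 0 if it was left behind.
 */
int drop_storage(const char *path)
{
    assert(path != NULL);
    if (unlink(path) < 0) {
        fprintf(stderr, "NOTICE: cannot unlink %s: %s\n",
                path, strerror(errno));
        return 0;
    }
    return 1;
}

/*
 * Create a file, keep a descriptor open on it (standing in for another
 * backend), unlink it, and then read through the held descriptor.
 * Returns the number of bytes read after the unlink, or -1 on setup failure.
 */
long read_after_unlink(const char *path)
{
    assert(path != NULL);

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    if (write(fd, "tuple", 5) != 5) {
        close(fd);
        unlink(path);
        return -1;
    }

    /* On POSIX the name disappears here even though fd is still open. */
    int removed = drop_storage(path);

    char buf[8];
    ssize_t n = -1;
    if (lseek(fd, 0, SEEK_SET) == 0)
        n = read(fd, buf, sizeof buf);

    close(fd);
    if (!removed)
        unlink(path);   /* retry once nothing holds the file open */
    return (long) n;
}
```

On a POSIX system read_after_unlink returns 5 and the name is gone
afterwards. On a platform where the unlink fails because the file is open,
drop_storage would print the notice and return 0, leaving the space
unreclaimed as described above.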


From: Jason Tishler <Jason(dot)Tishler(at)dothill(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-ports(at)postgresql(dot)org
Subject: Re: Cygwin PostgreSQL Regression Test Problems
Date: 2001-01-18 18:23:57
Message-ID: 20010118132357.H1092@dothill.com
Lists: pgsql-ports

Tom,

On Thu, Jan 18, 2001 at 12:59:00PM -0500, Tom Lane wrote:
> In current sources I think that you'd get a "cannot unlink" NOTICE,
> but the table would get logically dropped anyway, and the sole
> side-effect would be failure to recover the disk space. But in this
> case we could be talking about large amounts of disk space.

Cygwin does attempt to overcome the Windows open file issue. If a sharing
violation is detected (i.e., the file is open) during an unlink operation
(really DeleteFile), Cygwin will queue it for deletion later. However,
reading the Cygwin code, I found the following:

/* FIXME: this delqueue module is very flawed and should be rewritten.
First, having an array of a fixed size for keeping track of the
unlinked but not yet deleted files is bad. Second, some programs
will unlink files and then create a new one in the same location
and this behavior is not supported in the current code. Probably
we should find a move/rename function that will work on open files,
and move delqueue files to some special location or some such
hack... */

With the above caveats, is the current functionality sufficient for
PostgreSQL's needs?

Thanks,
Jason



From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jason Tishler <Jason(dot)Tishler(at)dothill(dot)com>
Cc: pgsql-ports(at)postgresql(dot)org
Subject: Re: Cygwin PostgreSQL Regression Test Problems
Date: 2001-01-18 18:53:36
Message-ID: 4370.979844016@sss.pgh.pa.us
Lists: pgsql-ports

Jason Tishler <Jason(dot)Tishler(at)dothill(dot)com> writes:
> /* FIXME: this delqueue module is very flawed and should be rewritten.
> First, having an array of a fixed size for keeping track of the
> unlinked but not yet deleted files is bad. Second, some programs
> will unlink files and then create a new one in the same location
> and this behavior is not supported in the current code. Probably
> we should find a move/rename function that will work on open files,
> and move delqueue files to some special location or some such
> hack... */

> With the above caveats, is the current functionality sufficient for
> PostgreSQL's needs?

The fixed-size-array thing sounds like a gotcha waiting to bite someone.
How big is the array, anyway?

The unlink/recreate issue is not a problem for us anymore, since we use
OIDs as filenames --- we won't try to reuse the same filename.

regards, tom lane


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Jason Tishler <Jason(dot)Tishler(at)dothill(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-ports(at)postgresql(dot)org
Subject: Re: Re: Cygwin PostgreSQL Regression Test Problems
Date: 2001-01-18 18:58:48
Message-ID: 200101181858.NAA07791@candle.pha.pa.us
Lists: pgsql-ports

> Tom,
>
> On Thu, Jan 18, 2001 at 12:59:00PM -0500, Tom Lane wrote:
> > In current sources I think that you'd get a "cannot unlink" NOTICE,
> > but the table would get logically dropped anyway, and the sole
> > side-effect would be failure to recover the disk space. But in this
> > case we could be talking about large amounts of disk space.
>
> Cygwin does attempt to overcome the Windows open file issue. If a sharing
> violation is detected (i.e., the file is open) during an unlink operation
> (really DeleteFile), Cygwin will queue it for deletion later. However,
> reading the Cygwin code, I found the following:
>
> /* FIXME: this delqueue module is very flawed and should be rewritten.
> First, having an array of a fixed size for keeping track of the
> unlinked but not yet deleted files is bad. Second, some programs
> will unlink files and then create a new one in the same location
> and this behavior is not supported in the current code. Probably
> we should find a move/rename function that will work on open files,
> and move delqueue files to some special location or some such
> hack... */
>
> With the above caveats, is the current functionality sufficient for
> PostgreSQL's needs?

No, it doesn't seem sufficient, though 7.1 will be a little better
because of oid file names.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026


From: Jason Tishler <Jason(dot)Tishler(at)dothill(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-ports(at)postgresql(dot)org
Subject: Re: Cygwin PostgreSQL Regression Test Problems
Date: 2001-01-18 20:01:18
Message-ID: 20010118150118.N1092@dothill.com
Lists: pgsql-ports

Tom,

On Thu, Jan 18, 2001 at 01:53:36PM -0500, Tom Lane wrote:
> Jason Tishler <Jason(dot)Tishler(at)dothill(dot)com> writes:
> > With the above caveats, is the current functionality sufficient for
> > PostgreSQL's needs?
>
> The fixed-size-array thing sounds like a gotcha waiting to bite someone.

Agreed.

> How big is the array, anyway?

The current size is 100 deep. Is that sufficient for PostgreSQL or is
this dependent on usage?

Jason



From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jason Tishler <Jason(dot)Tishler(at)dothill(dot)com>
Cc: pgsql-ports(at)postgresql(dot)org
Subject: Re: Cygwin PostgreSQL Regression Test Problems
Date: 2001-01-18 20:53:44
Message-ID: 5056.979851224@sss.pgh.pa.us
Lists: pgsql-ports

Jason Tishler <Jason(dot)Tishler(at)dothill(dot)com> writes:
>> The fixed-size-array thing sounds like a gotcha waiting to bite someone.

> Agreed.

>> How big is the array, anyway?

> The current size is 100 deep. Is that sufficient for PostgreSQL or is
> this dependent on usage?

Mumble. I'm sure you could gin up a scenario where it fails, but
deleting 100 recently-used tables in one transaction doesn't seem like a
very likely situation.

Probably a more interesting question to ask is how graceful is the
behavior when that array fills up?

regards, tom lane


From: Jason Tishler <Jason(dot)Tishler(at)dothill(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-ports(at)postgresql(dot)org
Subject: Re: Cygwin PostgreSQL Regression Test Problems
Date: 2001-01-18 21:07:05
Message-ID: 20010118160705.R1092@dothill.com
Lists: pgsql-ports

Tom,

On Thu, Jan 18, 2001 at 03:53:44PM -0500, Tom Lane wrote:
> Probably a more interesting question to ask is how graceful is the
> behavior when that array fills up?

If no slots are available, then the file is never queued. Hence, it is
never deleted.

Jason
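
The failure mode described above can be modeled with a fixed array of slots.
This sketch is not Cygwin's delqueue code: the struct, the function names,
and the 260-byte path limit are assumptions made for illustration; only the
depth of 100 comes from this thread. When every slot is taken, queue_file
returns -1 and the path is simply dropped, so nothing will ever retry that
deletion.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

#define QUEUE_SLOTS 100   /* "100 deep", per the thread */
#define PATH_LEN    260   /* assumed maximum path length */

/* Hypothetical queue of files whose deletion must be retried later. */
struct delete_queue {
    char paths[QUEUE_SLOTS][PATH_LEN];   /* empty string marks a free slot */
};

void queue_init(struct delete_queue *q)
{
    assert(q != NULL);
    memset(q, 0, sizeof *q);
}

/*
 * Remember `path` for a later deletion attempt.
 * Returns the slot index used, or -1 if the path does not fit or no slot
 * is free. In the -1 case the file is never queued, hence never deleted.
 */
int queue_file(struct delete_queue *q, const char *path)
{
    assert(q != NULL && path != NULL);
    if (path[0] == '\0' || strlen(path) >= PATH_LEN)
        return -1;
    for (int i = 0; i < QUEUE_SLOTS; i++) {
        if (q->paths[i][0] == '\0') {
            strcpy(q->paths[i], path);
            return i;
        }
    }
    return -1;
}

/*
 * Retry every pending deletion. A slot is freed when the file is removed
 * or turns out to be gone already. Returns the number of slots freed.
 */
int process_queue(struct delete_queue *q)
{
    assert(q != NULL);
    int freed = 0;
    for (int i = 0; i < QUEUE_SLOTS; i++) {
        if (q->paths[i][0] == '\0')
            continue;
        if (remove(q->paths[i]) == 0 || errno == ENOENT) {
            q->paths[i][0] = '\0';
            freed++;
        }
    }
    return freed;
}
```

A caller that checks the return value of queue_file could at least log the
overflow; silently ignoring it matches the "never deleted" outcome described
above.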
