Re: Permission denied on fsync / Win32 (was right

Lists: pgsql-bugs
From: "Magnus Hagander" <mha(at)sollentuna(dot)net>
To: "Peter Brant" <Peter(dot)Brant(at)wicourts(dot)gov>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: Permission denied on fsync / Win32 (was right
Date: 2006-04-18 19:00:54
Message-ID: 6BCB9D8A16AC4241919521715F4D8BCEA0F91B@algol.sollentuna.se
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

> It happens often enough and the episodes last long enough
> that grabbing a handle dump while this is going on should be
> easily done.
>
> Regarding the Win32 error code, backend/storage/file/fd.c
> calls _commit().
> http://msdn2.microsoft.com/en-us/library/17618685(VS.80).aspx
> It looks
> like it is already using errno to report errors. Will
> GetLastError() return something useful there?

Good point.
Ran a quick test. If I open the file read-only and then fsync, I get
errno=9 (EBADF) and GetLastError()=5. Which explains the fact that we
got the wrong error-code. The *underlying API call* to _commit() returns
access denied...

Looking at the source to _commit(), if the call to FlushFileBuffers()
returns an error, it will set _doserrno to that value,and then return
with errno=EBADF.

So, this basicalliyu means that FlushFileBuffers() returns ACCESS
DENIED.

//Magnus


From: "Peter Brant" <Peter(dot)Brant(at)wicourts(dot)gov>
To: "Magnus Hagander" <mha(at)sollentuna(dot)net>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: Permission denied on fsync / Win32 (was right
Date: 2006-04-19 15:50:52
Message-ID: 4446160C020000BE00002FC6@gwmta.wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

Here's the evidence from this morning. I have to admit I'm not really
sure what to make of it though.

Pete

The fsync / Permission denied errors occurred on 2 of 3 active servers
for the 7 am CLUSTER cycle.

Server #1 (with fsync errors):
- Both old and new relfilenodes are still visible with a 'dir':
04/19/2006 07:00 AM 16,384 1532868
04/19/2006 07:06 AM 8,192 1536650

- postgres.exe processes have handles to both old and new
relfilenodes

#1:
F64: File G:\pgsql\data\base\16385\1532868
F84: Event \BaseNamedObjects\pgident: postgres: bigbird
bigbird 127.0.0.1(1745) idle

#2:
F80: File G:\pgsql\data\base\16385\1536650
AB4: Event \BaseNamedObjects\pgident: postgres: writer
process

#3:
F0C: File G:\pgsql\data\base\16385\1536650
F48: Event \BaseNamedObjects\pgident: postgres: bigbird
bigbird 127.0.0.1(1732) idle

(plus a few more like this)

Server #2 (with fsync errors):
- Same pattern as Server #1. bgwriter has a handle to the new
relfilenode. Other backends have a handle to either old or new.

Server #3 (w/o fsync errors):
- Only the new relfilenode is visible with a 'dir':
04/19/2006 07:34 AM 16,384 1550915

- postgres.exe processes have handles to both old and new
relfilenodes

#1:
F60: File G:\pgsql\data\base\16385\1547888
F84: Event \BaseNamedObjects\pgident: postgres: bigbird
bigbird 127.0.0.1(4060) idle

(plus two more like this)

#2:
F78: File G:\pgsql\data\base\16385\1550915
F84: Event \BaseNamedObjects\pgident: postgres: bigbird
bigbird 127.0.0.1(2925) idle

(plus two more like this)

>>> "Magnus Hagander" <mha(at)sollentuna(dot)net> 04/18/06 9:00 pm >>>
> It happens often enough and the episodes last long enough
> that grabbing a handle dump while this is going on should be
> easily done.
>
> Regarding the Win32 error code, backend/storage/file/fd.c
> calls _commit().
> http://msdn2.microsoft.com/en-us/library/17618685(VS.80).aspx
> It looks
> like it is already using errno to report errors. Will
> GetLastError() return something useful there?

Good point.
Ran a quick test. If I open the file read-only and then fsync, I get
errno=9 (EBADF) and GetLastError()=5. Which explains the fact that we
got the wrong error-code. The *underlying API call* to _commit()
returns
access denied...

Looking at the source to _commit(), if the call to FlushFileBuffers()
returns an error, it will set _doserrno to that value,and then return
with errno=EBADF.

So, this basicalliyu means that FlushFileBuffers() returns ACCESS
DENIED.

//Magnus

---------------------------(end of
broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Peter Brant" <Peter(dot)Brant(at)wicourts(dot)gov>
Cc: "Magnus Hagander" <mha(at)sollentuna(dot)net>, pgsql-bugs(at)postgresql(dot)org
Subject: Re: Permission denied on fsync / Win32 (was right
Date: 2006-04-19 17:21:04
Message-ID: 196.1145467264@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

"Peter Brant" <Peter(dot)Brant(at)wicourts(dot)gov> writes:
> Here's the evidence from this morning. I have to admit I'm not really
> sure what to make of it though.
> ...
> - Same pattern as Server #1. bgwriter has a handle to the new
> relfilenode. Other backends have a handle to either old or new.

It seems pretty clear to me that the problem occurs when we try to
fsync the old relfilenode, which is in a pending-delete state but
hasn't been deleted yet because of the presence of open handles in
some backends. (Peter, did you check that the error messages in
the postmaster log all refer to old relfilenodes not new ones?)

We should be able to ignore this error, but I'm certainly unwilling
to just treat EACCES in general as an ignorable error. So the problem
is to distinguish this case from genuine permission failures. Perhaps
ERROR_SHARING_VIOLATION should be mapped to something other than EACCES?
It seems like that's a rather poor fit. Or we could leave the mapping
as-is and add an #ifdef'd test on GetLastError to mdsync() (ugly...)

One worry is whether there are any other possible causes of
ERROR_SHARING_VIOLATION during fsync, and if so are they all ignorable.

regards, tom lane


From: "Peter Brant" <Peter(dot)Brant(at)wicourts(dot)gov>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: <pgsql-bugs(at)postgresql(dot)org>,"Magnus Hagander" <mha(at)sollentuna(dot)net>
Subject: Re: Permission denied on fsync / Win32 (was right
Date: 2006-04-19 17:55:33
Message-ID: 44463345020000BE00002FD3@gwmta.wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

I'm not sure that's the whole story. "Server #3" had backends with
handles to the old relfilenode. It didn't have any fsync errors and the
old relfilenode was apparently successfully deleted (or at least it
wasn't visible in the file system anymore). That's the part of the
morning's investigations that left me kind of nonplussed.

Of half dozen or so cases I've looked at more closely, the error has
always referred to the old relfilenode. I'll spot check a few more
though.

Pete

>>> Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> 04/19/06 7:21 pm >>>
It seems pretty clear to me that the problem occurs when we try to
fsync the old relfilenode, which is in a pending-delete state but
hasn't been deleted yet because of the presence of open handles in
some backends.