Re: BUG #5038: WAL file is pending deletion in pg_xlog folder, this interferes with WAL archiving.

Lists: pgsql-bugs
From: "Luke Koops" <luke(dot)koops(at)entrust(dot)com>
To: pgsql-bugs(at)postgresql(dot)org
Subject: BUG #5038: WAL file is pending deletion in pg_xlog folder, this interferes with WAL archiving.
Date: 2009-09-05 03:52:29
Message-ID: 200909050352.n853qTEH071667@wwwmaster.postgresql.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs


The following bug has been logged online:

Bug reference: 5038
Logged by: Luke Koops
Email address: luke(dot)koops(at)entrust(dot)com
PostgreSQL version: 8.3.7
Operating system: Windows 2003 Server Enterprise Edition
Description: WAL file is pending deletion in pg_xlog folder, this
interferes with WAL archiving.
Details:

On my system, one of the WAL files is pending deletion. The handle is being
held by one of the postgres backend processes, but that is another potential
bug.

At first, the unlink worked, and the .ready and .done files were deleted.
But the WAL file still shows up in the pg_xlog directory listing.

Note: the WAL file did get archived properly. There was no error reported
at the time.

When it comes time to recycle the log files, RemoveOldXLogFiles() calls
ReadDir() to get the list of files, then it calls XLogArchiveCheckDone()
which, if it cannot find a .done or a .ready file, calls
XLogArchiveNotify(). XLogArchiveNotify() creates the .ready file again.
This causes the archiver to call the archive command on the old WAL file
that is pending deletion. The copy command will fail and all subsequent
archive attempts will keep trying to copy the old WAL file that is pending
deletion.

At this point, none of the WAL files will get shipped and the pg_xlog folder
will start filling up.

Before calling XLogArchiveCheckDone(), RemoveOldXLogFiles() makes a number
of tests to make sure the name is for a legitimate XLOG. This would be a
good time to make sure the file is real, not pending deletion. That would
prevent the creation of the .ready file and WAL archiving would continue to
work.

It might be a good idea to log something at the DEBUG level if a directory
entry is encoutered that matches the naming conventions but is not a real
file.

You could probably reproduce this behaviour by changing the permissions on a
WAL file, although you wouldn't be able to test a fix in the same way.

I have not reliably reproduced the WAL file handle "leak" in the postgres
back end. I believe may be related to statements timing out. My system
currently has statement_timeout=1min, but that will be removed. I will
report the "leak" when I have a better handle (no pun) on the situation.

-Luke


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Luke Koops <luke(dot)koops(at)entrust(dot)com>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #5038: WAL file is pending deletion in pg_xlog folder, this interferes with WAL archiving.
Date: 2009-09-07 09:21:43
Message-ID: 4AA4D0A7.7040204@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

Luke Koops wrote:
> The following bug has been logged online:
>
> Bug reference: 5038
> Logged by: Luke Koops
> Email address: luke(dot)koops(at)entrust(dot)com
> PostgreSQL version: 8.3.7
> Operating system: Windows 2003 Server Enterprise Edition
> Description: WAL file is pending deletion in pg_xlog folder, this
> interferes with WAL archiving.
> Details:
>
> On my system, one of the WAL files is pending deletion. The handle is being
> held by one of the postgres backend processes, but that is another potential
> bug.

Hmm. Under normal Unix filesystem semantics, that doesn't matter much
since the file can still be unlinked. It will still consume space, but
that's not a big issue. On Windows, however, the open handle keeps the
file locked, so it can't be deleted.

If I'm reading the code correctly, when a backend opens a WAL file for
writing to it, it stays open until the backend needs to write to another
WAL file. If the backend only writes once to a file, and then doesn't
create WAL records anymore (= executes read-only queries only), the file
is kept open indefinitely.

Perhaps we should try to close the old WAL file sooner. It's easy to
check if the current open log segment is old and close it if so, but I'm
not sure what the check should be hooked into.

> At first, the unlink worked, and the .ready and .done files were deleted.
> But the WAL file still shows up in the pg_xlog directory listing.

If the file didn't go away, it seems like the unlink didn't work. We
don't check the return code in RemoveOldXlogFiles(); I suspect that
we're getting EBUSY in that scenario. We should check for that if we're
going to delete the .ready and .done files. Patch attached, but I
haven't tested it. I don't have Windows environment at hand, but I'll
try to find one..

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachment Content-Type Size
windows-unlink-can-fail-1.patch text/x-diff 638 bytes

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Luke Koops <luke(dot)koops(at)entrust(dot)com>, pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #5038: WAL file is pending deletion in pg_xlog folder, this interferes with WAL archiving.
Date: 2009-09-07 17:17:18
Message-ID: 7737.1252343838@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
> Perhaps we should try to close the old WAL file sooner.

There is zero hope of making that work. What we probably need to do
is fix the code that scans pg_xlog so that it ignores files that are
pending deletion. (I assume there's some way to find that out on
Windows.)

regards, tom lane


From: Luke Koops <luke(dot)koops(at)entrust(dot)com>
To: 'Tom Lane' <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: "pgsql-bugs(at)postgresql(dot)org" <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #5038: WAL file is pending deletion in pg_xlog folder, this interferes with WAL archiving.
Date: 2009-09-07 20:29:45
Message-ID: A3144629B5AC714A8BF27806EBFA7057514623A2@sottexch7.corp.ad.entrust.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

> -----Original Message-----
> From: Tom Lane [mailto:tgl(at)sss(dot)pgh(dot)pa(dot)us]
> Sent: Monday, September 07, 2009 1:17 PM
> To: Heikki Linnakangas
> Cc: Luke Koops; pgsql-bugs(at)postgresql(dot)org
> Subject: Re: [BUGS] BUG #5038: WAL file is pending deletion
> in pg_xlog folder, this interferes with WAL archiving.
>
> Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
> > Perhaps we should try to close the old WAL file sooner.
>
> There is zero hope of making that work. What we probably
> need to do is fix the code that scans pg_xlog so that it
> ignores files that are pending deletion. (I assume there's
> some way to find that out on
> Windows.)
On *nux systems, unlink removes the link from the directory. When other processes get a directory listing, the file will no longer be listed. On Windows, the file name continues to show up in directory listings. The file is in a state called pending deletion. Windows documentation doesn't give a specific test for this state. Perhaps you could use _access().
====================
>From http://support.microsoft.com/kb/159199

This file is in a state known as pending deletion. This file has been deleted, but there are still handles open to it. NTFS will wait until all handles to this file are closed before updating the index. If an attempt is made to access the file, however, NTFS will deny the attempt. Because the file is listed in the index, but is effectively deleted, you can see the file but you cannot access it.

Windows NT returns an "Access Denied" error message when you attempt to manipulate the file. You are not able to view the permissions, the owner, or the contents of the file. The file does, however, show up in a DIR listing in File Manager and in Explorer. This occurs even though the user trying to access the file has permissions to the file. Even an administrator will be unable to take ownership of this file.
====================
>
> regards, tom lane
>


From: Luke Koops <luke(dot)koops(at)entrust(dot)com>
To: Luke Koops <luke(dot)koops(at)entrust(dot)com>, 'Tom Lane' <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: "pgsql-bugs(at)postgresql(dot)org" <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #5038: WAL file is pending deletion in pg_xlog folder, this interferes with WAL archiving.
Date: 2009-09-09 18:36:23
Message-ID: A3144629B5AC714A8BF27806EBFA7057514623BC@sottexch7.corp.ad.entrust.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

For those of you who are still looking at this, I tried to reproduce the issue by holding one of the WAL files open with another program (just opened it with the cygwin build of less.exe for windows). That didn't do the trick. It prevented unlink or rename from working at all. I wrote a program (open.exe) that opens the file using pgwin32_open() and passed in the same parameters that postgres uses when opening a WAL file. That allowed the file to be renamed. And, when deleted, the open file went into the pending deletion state.

I used open.exe to hold onto a WAL file that was going to be recycled. The recycling worked, but what is going to happen down the road when the handle is released, leaving a gap in the WAL file sequence. Or if it is not released, when a backend tries to open the WAL file and does not have access to it?

When open.exe was holding onto a WAL file that was going to be deleted, the deletion worked. The file went into the deletion pending state. The archive status for the WAL file went through the .ready ==> .done ==> {no status file} ==> .ready sequence. At that point Postgres repeatedly tries to archive the WAL file.

I reported earlier that I believe postgres leaked the file handle to the WAL file. I don't believe that is the case. We have a process that only checks data in the database for integrity. It is only reading. I think it opened the WAL file initially, perhaps the backend had some maintenance work to do when that session started and had to write something to the WAL and then never moved on to a new one.

Now that I can reproduce the pending deletion case, I'm working on code to detect it reliably and, hopefully, efficiently.

-Luke
> -----Original Message-----
> From: pgsql-bugs-owner(at)postgresql(dot)org
> [mailto:pgsql-bugs-owner(at)postgresql(dot)org] On Behalf Of Luke Koops
> Sent: Monday, September 07, 2009 4:30 PM
> To: 'Tom Lane'; Heikki Linnakangas
> Cc: pgsql-bugs(at)postgresql(dot)org
> Subject: Re: [BUGS] BUG #5038: WAL file is pending deletion
> in pg_xlog folder, this interferes with WAL archiving.
>
> > -----Original Message-----
> > From: Tom Lane [mailto:tgl(at)sss(dot)pgh(dot)pa(dot)us]
> > Sent: Monday, September 07, 2009 1:17 PM
> > To: Heikki Linnakangas
> > Cc: Luke Koops; pgsql-bugs(at)postgresql(dot)org
> > Subject: Re: [BUGS] BUG #5038: WAL file is pending deletion
> in pg_xlog
> > folder, this interferes with WAL archiving.
> >
> > Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
> > > Perhaps we should try to close the old WAL file sooner.
> >
> > There is zero hope of making that work. What we probably
> need to do
> > is fix the code that scans pg_xlog so that it ignores files
> that are
> > pending deletion. (I assume there's some way to find that out on
> > Windows.)
> On *nux systems, unlink removes the link from the directory.
> When other processes get a directory listing, the file will
> no longer be listed. On Windows, the file name continues to
> show up in directory listings. The file is in a state called
> pending deletion. Windows documentation doesn't give a
> specific test for this state. Perhaps you could use _access().
> ====================
> From http://support.microsoft.com/kb/159199
>
> This file is in a state known as pending deletion. This file
> has been deleted, but there are still handles open to it.
> NTFS will wait until all handles to this file are closed
> before updating the index. If an attempt is made to access
> the file, however, NTFS will deny the attempt. Because the
> file is listed in the index, but is effectively deleted, you
> can see the file but you cannot access it.
>
> Windows NT returns an "Access Denied" error message when you
> attempt to manipulate the file. You are not able to view the
> permissions, the owner, or the contents of the file. The file
> does, however, show up in a DIR listing in File Manager and
> in Explorer. This occurs even though the user trying to
> access the file has permissions to the file. Even an
> administrator will be unable to take ownership of this file.
> ====================
> >
> > regards, tom lane
> >
>
> --
> Sent via pgsql-bugs mailing list (pgsql-bugs(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-bugs
>


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Luke Koops <luke(dot)koops(at)entrust(dot)com>
Cc: 'Tom Lane' <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "pgsql-bugs(at)postgresql(dot)org" <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #5038: WAL file is pending deletion in pg_xlog folder, this interferes with WAL archiving.
Date: 2009-09-09 18:58:47
Message-ID: 4AA7FAE7.5040707@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

Luke Koops wrote:
> For those of you who are still looking at this, I tried to reproduce the issue by holding one of the WAL files open with another program (just opened it with the cygwin build of less.exe for windows). That didn't do the trick. It prevented unlink or rename from working at all. I wrote a program (open.exe) that opens the file using pgwin32_open() and passed in the same parameters that postgres uses when opening a WAL file. That allowed the file to be renamed. And, when deleted, the open file went into the pending deletion state.

Yeah, it's the FILE_SHARE_DELETE flag that allows the deletion.

> I used open.exe to hold onto a WAL file that was going to be recycled. The recycling worked, but what is going to happen down the road when the handle is released, leaving a gap in the WAL file sequence. Or if it is not released, when a backend tries to open the WAL file and does not have access to it?

When the file is recycled, I believe we're fine. The file is not
deleted, only renamed, so it won't be deleted when open.exe closes it.
No gap in WAL sequence is created.

> When open.exe was holding onto a WAL file that was going to be deleted, the deletion worked. The file went into the deletion pending state. The archive status for the WAL file went through the .ready ==> .done ==> {no status file} ==> .ready sequence. At that point Postgres repeatedly tries to archive the WAL file.

> I reported earlier that I believe postgres leaked the file handle to the WAL file. I don't believe that is the case. We have a process that only checks data in the database for integrity. It is only reading. I think it opened the WAL file initially, perhaps the backend had some maintenance work to do when that session started and had to write something to the WAL and then never moved on to a new one.
>
> Now that I can reproduce the pending deletion case, I'm working on code to detect it reliably and, hopefully, efficiently.

I got hold of a Windows virtual machine as well, and could reproduce the
issue. It was a bit tricky to coerce the file to be deleted instead of
recycled, but setting "max_advance = 0" in RemoveOldXlogFiles() finally
did the trick.

I googled around, and saw some discussion that suggest that when a file
is in "pending deletion" state, it's implemented by setting a
"delete-on-close" flag on the existing file handle. The upshot of that
is that if you pull the power plug, the file won't be deleted after all.

One option is to rename the file before deleting it. For all practical
purposes, that's the same as if the file no longer exists. Seems like
the simplest solution to me.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Luke Koops <luke(dot)koops(at)entrust(dot)com>, "pgsql-bugs(at)postgresql(dot)org" <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #5038: WAL file is pending deletion in pg_xlog folder, this interferes with WAL archiving.
Date: 2009-09-09 19:06:05
Message-ID: 5092.1252523165@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
> One option is to rename the file before deleting it.

I was under the impression that renaming would fail in the same cases
where deletion fails, ie something else holding the file open ...

regards, tom lane


From: Luke Koops <luke(dot)koops(at)entrust(dot)com>
To: 'Tom Lane' <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: "pgsql-bugs(at)postgresql(dot)org" <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #5038: WAL file is pending deletion in pg_xlog folder, this interferes with WAL archiving.
Date: 2009-09-09 19:11:09
Message-ID: A3144629B5AC714A8BF27806EBFA7057514623BD@sottexch7.corp.ad.entrust.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

Tom,

If you first delete the file and it goes into deletion pending, then subsequent attempts to rename or delete will fail.

Heikki is proposing that the file be renamed before it is first deleted.

-Luke

> -----Original Message-----
> From: Tom Lane [mailto:tgl(at)sss(dot)pgh(dot)pa(dot)us]
> Sent: Wednesday, September 09, 2009 3:06 PM
> To: Heikki Linnakangas
> Cc: Luke Koops; pgsql-bugs(at)postgresql(dot)org
> Subject: Re: [BUGS] BUG #5038: WAL file is pending deletion
> in pg_xlog folder, this interferes with WAL archiving.
>
> Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
> > One option is to rename the file before deleting it.
>
> I was under the impression that renaming would fail in the
> same cases where deletion fails, ie something else holding
> the file open ...
>
> regards, tom lane
>


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Luke Koops <luke(dot)koops(at)entrust(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "pgsql-bugs(at)postgresql(dot)org" <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #5038: WAL file is pending deletion in pg_xlog folder, this interferes with WAL archiving.
Date: 2009-09-09 19:20:21
Message-ID: 5386.1252524021@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

Luke Koops <luke(dot)koops(at)entrust(dot)com> writes:
> Heikki is proposing that the file be renamed before it is first deleted.

Indeed, and what I'm saying is that that will still fail if some
unhelpful program is holding the file open without FILE_SHARE_DELETE.
Which is what I understood your case to be ...

regards, tom lane


From: Luke Koops <luke(dot)koops(at)entrust(dot)com>
To: 'Tom Lane' <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "pgsql-bugs(at)postgresql(dot)org" <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #5038: WAL file is pending deletion in pg_xlog folder, this interferes with WAL archiving.
Date: 2009-09-09 19:27:04
Message-ID: A3144629B5AC714A8BF27806EBFA7057514623C0@sottexch7.corp.ad.entrust.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

My case is that a backend was holding the file open. It was not some other unhelpful program. That's why I had to write a program that held the file open with the same flags that postgres uses (including FILE_SHARE_DELETE).

We have a process that wrote to the DB, and then for many days it only read from the DB, so it didn't move on to another WAL file.

-Luke

> -----Original Message-----
> From: Tom Lane [mailto:tgl(at)sss(dot)pgh(dot)pa(dot)us]
> Sent: Wednesday, September 09, 2009 3:20 PM
> To: Luke Koops
> Cc: Heikki Linnakangas; pgsql-bugs(at)postgresql(dot)org
> Subject: Re: [BUGS] BUG #5038: WAL file is pending deletion
> in pg_xlog folder, this interferes with WAL archiving.
>
> Luke Koops <luke(dot)koops(at)entrust(dot)com> writes:
> > Heikki is proposing that the file be renamed before
> it is first deleted.
>
> Indeed, and what I'm saying is that that will still fail if
> some unhelpful program is holding the file open without
> FILE_SHARE_DELETE.
> Which is what I understood your case to be ...
>
> regards, tom lane
>


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Luke Koops <luke(dot)koops(at)entrust(dot)com>, "pgsql-bugs(at)postgresql(dot)org" <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #5038: WAL file is pending deletion in pg_xlog folder, this interferes with WAL archiving.
Date: 2009-09-09 19:27:49
Message-ID: 4AA801B5.4070700@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

Tom Lane wrote:
> Luke Koops <luke(dot)koops(at)entrust(dot)com> writes:
>> Heikki is proposing that the file be renamed before it is first deleted.
>
> Indeed, and what I'm saying is that that will still fail if some
> unhelpful program is holding the file open without FILE_SHARE_DELETE.
> Which is what I understood your case to be ...

No, it's a backend that's holding the file open, with FILE_SHARE_DELETE.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Luke Koops <luke(dot)koops(at)entrust(dot)com>, "pgsql-bugs(at)postgresql(dot)org" <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #5038: WAL file is pending deletion in pg_xlog folder, this interferes with WAL archiving.
Date: 2009-09-09 19:44:09
Message-ID: 5849.1252525449@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
> No, it's a backend that's holding the file open, with FILE_SHARE_DELETE.

If that's the only case we care about covering, then rename might be
enough. I was just wondering what it would take to solve the more
general problem of something holding it open with the wrong flags
at the time we want to get rid of it.

regards, tom lane


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Luke Koops <luke(dot)koops(at)entrust(dot)com>, "pgsql-bugs(at)postgresql(dot)org" <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #5038: WAL file is pending deletion in pg_xlog folder, this interferes with WAL archiving.
Date: 2009-09-10 05:36:23
Message-ID: 4AA89057.10806@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

Tom Lane wrote:
> Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
>> No, it's a backend that's holding the file open, with FILE_SHARE_DELETE.
>
> If that's the only case we care about covering, then rename might be
> enough. I was just wondering what it would take to solve the more
> general problem of something holding it open with the wrong flags
> at the time we want to get rid of it.

Yes, that's a separate problem, and I think we should address that too.
That's what I thought was going on in OP's case at first, the patch I
posted in my first reply should address that.

I'll try to reproduce that case too, and verify that the patch fixes it.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To:
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Luke Koops <luke(dot)koops(at)entrust(dot)com>, "pgsql-bugs(at)postgresql(dot)org" <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #5038: WAL file is pending deletion in pg_xlog folder, this interferes with WAL archiving.
Date: 2009-09-10 09:44:27
Message-ID: 4AA8CA7B.4020608@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

Heikki Linnakangas wrote:
> Tom Lane wrote:
>> Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
>>> No, it's a backend that's holding the file open, with FILE_SHARE_DELETE.
>> If that's the only case we care about covering, then rename might be
>> enough. I was just wondering what it would take to solve the more
>> general problem of something holding it open with the wrong flags
>> at the time we want to get rid of it.
>
> Yes, that's a separate problem, and I think we should address that too.
> That's what I thought was going on in OP's case at first, the patch I
> posted in my first reply should address that.
>
> I'll try to reproduce that case too, and verify that the patch fixes it.

Ok, I've committed a patch along those lines. The file is now renamed
before unlinking (on Windows), and the return code of rename() and
unlink() is checked, so that we don't delete the .done file if the WAL
file deletion failed. This fixes both scenarios, the one OP reported
with another backend keeping the file open, and the one where a
different process keeps a file open without FILE_SHARE_DELETE.

I considered making failure to rename or delete a WARNING instead of
ERROR, so that RemoveOldXLogFiles() would still clean up any other old
WAL files. However, when a file is recycled, we throw an error anyway if
the rename fails in InstallXLogFileSegment(), so it doesn't seem like it
would buy us much.

BTW, it seems that errno is not set on Windows when rename fails, but we
still try to print the OS error message in InstallXLogFileSegment().
When I tested the case where another process is keeping the file locked,
for example, I got this:

ERROR: could not rename file "pg_xlog/000000010000000100000073" to
"pg_xlog/000000010000000100000092" (initialization of log file 1,
segment 146): No such file or directory

even though the file clearly exists, it's just locked. I'm not sure
where errno is coming from in that case, and if we should do something
about that, but that exceeds my appetite for fixing Windows issues right
now.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Luke Koops <luke(dot)koops(at)entrust(dot)com>
To: 'Heikki Linnakangas' <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "pgsql-bugs(at)postgresql(dot)org" <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #5038: WAL file is pending deletion in pg_xlog folder, this interferes with WAL archiving.
Date: 2009-09-12 03:37:37
Message-ID: A3144629B5AC714A8BF27806EBFA7057514623F2@sottexch7.corp.ad.entrust.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

I picked up the patch and verified both fixes on 8.3.7.

In one test, Handles to two different WAL files were being held by two different backends. The WAL files were renamed to .deleted after I forced a switch xlog. Eventually the .deleted files disappeared. In one case the backend exited. In the other, the backend moved on to the latest WAL file.

In another test, I opened a WAL file so that it could not be renamed or deleted. The appropriate error was logged and the .done file remained. The error is logged quite frequently. When released the WAL file it was soon deleted.

If you get into a case where the rename works but the unlink fails (I don't see how this could happen in real life, except possibly for a race condition with AV software), you will have a situation where there is a .done file that does not match any WAL logs, and you will have a .deleted file that won't get cleaned up.

I couldn't reproduce this, so I faked it by adding a .done file back into the archive_status folder after it was deleted. The orphaned .done file doesn't cause any trouble. It doesn't get cleaned up, it doesn't generate any log messages, and it doesn't interfere with WAL file recycling or removal (unlike the trouble that is caused by orphaned .ready files).

The patch looks good.

Thank-you,

-Luke

> -----Original Message-----
> From: Heikki Linnakangas [mailto:heikki(dot)linnakangas(at)enterprisedb(dot)com]
> Sent: Thursday, September 10, 2009 5:44 AM
> Cc: Tom Lane; Luke Koops; pgsql-bugs(at)postgresql(dot)org
> Subject: Re: [BUGS] BUG #5038: WAL file is pending deletion
> in pg_xlog folder, this interferes with WAL archiving.
>
> Heikki Linnakangas wrote:
> > Tom Lane wrote:
> >> Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
> >>> No, it's a backend that's holding the file open, with
> FILE_SHARE_DELETE.
> >> If that's the only case we care about covering, then
> rename might be
> >> enough. I was just wondering what it would take to solve the more
> >> general problem of something holding it open with the
> wrong flags at
> >> the time we want to get rid of it.
> >
> > Yes, that's a separate problem, and I think we should
> address that too.
> > That's what I thought was going on in OP's case at first,
> the patch I
> > posted in my first reply should address that.
> >
> > I'll try to reproduce that case too, and verify that the
> patch fixes it.
>
> Ok, I've committed a patch along those lines. The file is now
> renamed before unlinking (on Windows), and the return code of
> rename() and
> unlink() is checked, so that we don't delete the .done file
> if the WAL file deletion failed. This fixes both scenarios,
> the one OP reported with another backend keeping the file
> open, and the one where a different process keeps a file open
> without FILE_SHARE_DELETE.
>
> I considered making failure to rename or delete a WARNING
> instead of ERROR, so that RemoveOldXLogFiles() would still
> clean up any other old WAL files. However, when a file is
> recycled, we throw an error anyway if the rename fails in
> InstallXLogFileSegment(), so it doesn't seem like it would
> buy us much.
>
> BTW, it seems that errno is not set on Windows when rename
> fails, but we still try to print the OS error message in
> InstallXLogFileSegment().
> When I tested the case where another process is keeping the
> file locked, for example, I got this:
>
> ERROR: could not rename file
> "pg_xlog/000000010000000100000073" to
> "pg_xlog/000000010000000100000092" (initialization of log
> file 1, segment 146): No such file or directory
>
> even though the file clearly exists, it's just locked. I'm
> not sure where errno is coming from in that case, and if we
> should do something about that, but that exceeds my appetite
> for fixing Windows issues right now.
>
> --
> Heikki Linnakangas
> EnterpriseDB http://www.enterprisedb.com
>