Re: could not create directory "...": File exists

Lists: pgsql-hackers
From: Stephen Frost <sfrost(at)snowman(dot)net>
To: pgsql-hackers(at)postgresql(dot)org
Subject: could not create directory "...": File exists
Date: 2013-01-17 14:19:37
Message-ID: 20130117141936.GR16126@tamriel.snowman.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Greetings,

We've come across this rather annoying error happening during our
builds:

ERROR: could not create directory "pg_tblspc/25120/PG_9.3_201212081/231253": File exists

It turns out that this is coming from copydir() when called by
createdb() during a CREATE DATABASE .. FROM TEMPLATE where the
template has tables in tablespaces.

It turns out that createdb() currently only takes an AccessShareLock
on pg_tablespace when scanning it with SnapshotNow, making it possible
for a concurrent process to make some uninteresting modification to a
tablespace (such as an ACL change) which will cause the heap scan in
createdb() to see a given tablespace multiple times. copydir() will
then, rightfully, complain that it's being asked to create a directory
which already exists.

Given that this is during createdb(), I'm guessing we don't have any
option but to switch the scan to using ShareLock, since there isn't a
snapshot available to do an MVCC scan with (I'm guessing that there
could be other issues trying to do that anyway).

Attached is a patch which does this and corrects the problem for us
(of course, we're now going to go rework our build system to not
modify tablespace ACLs, since with this patch concurrent builds end up
blocking on each other, which is annoying).

Thanks,

Stephen

Attachment Content-Type Size
fix-concurrent-tablespace-update.patch text/x-diff 1.7 KB

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: could not create directory "...": File exists
Date: 2013-01-17 15:47:53
Message-ID: 8980.1358437673@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Stephen Frost <sfrost(at)snowman(dot)net> writes:
> It turns out that createdb() currently only takes an AccessShareLock
> on pg_tablespace when scanning it with SnapshotNow, making it possible
> for a concurrent process to make some uninteresting modification to a
> tablespace (such as an ACL change) which will cause the heap scan in
> createdb() to see a given tablespace multiple times. copydir() will
> then, rightfully, complain that it's being asked to create a directory
> which already exists.

Ugh. Still another problem with non-MVCC catalog scans.

> Given that this is during createdb(), I'm guessing we don't have any
> option but to switch the scan to using ShareLock, since there isn't a
> snapshot available to do an MVCC scan with (I'm guessing that there
> could be other issues trying to do that anyway).

It seems that the only thing we actually use from each tuple is the OID.
So there are other ways to fix it, of which probably the minimum-change
one is to keep a list of already-processed tablespace OIDs and skip a
tuple if we get a match in the list. This would be O(N^2) in the number
of tablespaces, but I rather doubt that's a problem.

[ thinks for a bit... ] Ugh, no, because the *other* risk you've got
here is not seeing a row at all, which would be really bad.

I'm not sure that transiently increasing the lock here is enough,
either. The concurrent transactions you're worried about probably
aren't holding their locks till commit, so you could get this lock
and still see tuples with unstable committed-good states.

regards, tom lane


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: could not create directory "...": File exists
Date: 2013-01-17 17:37:00
Message-ID: 20130117173700.GS16126@tamriel.snowman.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

* Tom Lane (tgl(at)sss(dot)pgh(dot)pa(dot)us) wrote:
> Ugh. Still another problem with non-MVCC catalog scans.

Indeed.

> It seems that the only thing we actually use from each tuple is the OID.

Yes, that's true.

> So there are other ways to fix it, of which probably the minimum-change
> one is to keep a list of already-processed tablespace OIDs and skip a
> tuple if we get a match in the list. This would be O(N^2) in the number
> of tablespaces, but I rather doubt that's a problem.

I did consider this option also but it felt like a lot of additional
code to write and I wasn't entirely convinced it'd be worth it. It's
very frustrating that we can't simply get a list of *currently valid*
tablespaces with a guarantee that no one else is going to mess with that
list while we process it. That's what MVCC would give us, but we can't
rely on that yet..

If we agree that keeping a list and then skipping over anything on the
list already seen, I can go ahead and code that up.

> [ thinks for a bit... ] Ugh, no, because the *other* risk you've got
> here is not seeing a row at all, which would be really bad.

I'm not sure that I see how that could happen..? I agree that it'd be
really bad if it did though. Or are you thinking if we were to take a
snapshot and then walk the table?

> I'm not sure that transiently increasing the lock here is enough,
> either. The concurrent transactions you're worried about probably
> aren't holding their locks till commit, so you could get this lock
> and still see tuples with unstable committed-good states.

If there are other transactiosn which have open locks against the table,
wouldn't that prevent my being able to acquire ShareLock on it? Or put
another way, how would they not hold their locks till commit or
rollback? We do only look at tablespaces which exist for the database
we're copying, as I recall, so any newly created tablespaces shouldn't
impact this.

Thanks,

Stephen


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: could not create directory "...": File exists
Date: 2013-01-17 18:46:44
Message-ID: 14758.1358448404@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Stephen Frost <sfrost(at)snowman(dot)net> writes:
> * Tom Lane (tgl(at)sss(dot)pgh(dot)pa(dot)us) wrote:
>> [ thinks for a bit... ] Ugh, no, because the *other* risk you've got
>> here is not seeing a row at all, which would be really bad.

> I'm not sure that I see how that could happen..? I agree that it'd be
> really bad if it did though. Or are you thinking if we were to take a
> snapshot and then walk the table?

The case where you see a tuple twice is if an update drops a new version
of a row beyond your seqscan, and then commits before you get to the new
version. But if it drops the new version of the row *behind* your
seqscan, and then commits before you get to the original tuple, you'll
not see either row version as committed. Both of these cases are
inherent risks in SnapshotNow scans.

>> I'm not sure that transiently increasing the lock here is enough,
>> either. The concurrent transactions you're worried about probably
>> aren't holding their locks till commit, so you could get this lock
>> and still see tuples with unstable committed-good states.

> If there are other transactiosn which have open locks against the table,
> wouldn't that prevent my being able to acquire ShareLock on it?

What I'm worried about is that we very commonly release locks on
catalogs as soon as we're done with the immediate operation --- not only
read locks, but update locks as well. So there are lots of cases where
locks are released before commit. It's possible that that never happens
in operations applied to pg_tablespace, but I'd not want to bet on it.

I wonder whether it's practical to look at the on-disk directories
instead of relying on pg_tablespace for this particular purpose.

Or maybe we should just write this off as a case we can't realistically
fix before we have MVCC catalog scans. It seems that any other fix is
going to be hopelessly ugly.

regards, tom lane


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Stephen Frost <sfrost(at)snowman(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: could not create directory "...": File exists
Date: 2013-01-17 18:51:00
Message-ID: 20130117185100.GC31262@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2013-01-17 13:46:44 -0500, Tom Lane wrote:
> Or maybe we should just write this off as a case we can't realistically
> fix before we have MVCC catalog scans. It seems that any other fix is
> going to be hopelessly ugly.

ISTM we could just use a MVCC snapshot in this specific case?

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: could not create directory "...": File exists
Date: 2013-01-17 19:01:30
Message-ID: 20130117190130.GT16126@tamriel.snowman.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

* Tom Lane (tgl(at)sss(dot)pgh(dot)pa(dot)us) wrote:
> The case where you see a tuple twice is if an update drops a new version
> of a row beyond your seqscan, and then commits before you get to the new
> version. But if it drops the new version of the row *behind* your
> seqscan, and then commits before you get to the original tuple, you'll
> not see either row version as committed. Both of these cases are
> inherent risks in SnapshotNow scans.

Ah, I see.

> > If there are other transactiosn which have open locks against the table,
> > wouldn't that prevent my being able to acquire ShareLock on it?
>
> What I'm worried about is that we very commonly release locks on
> catalogs as soon as we're done with the immediate operation --- not only
> read locks, but update locks as well. So there are lots of cases where
> locks are released before commit. It's possible that that never happens
> in operations applied to pg_tablespace, but I'd not want to bet on it.

Fair enough.

> I wonder whether it's practical to look at the on-disk directories
> instead of relying on pg_tablespace for this particular purpose.

So we'd traverse all of the tablespace directories and copy any
directories which we come across which have the oid of the template
database..? Sounds pretty reasonable, though I'm not sure that we'd
want to back-patch a large change like that.

> Or maybe we should just write this off as a case we can't realistically
> fix before we have MVCC catalog scans. It seems that any other fix is
> going to be hopelessly ugly.

I feel like we should be able to do better than what we have now, at
least. Using ShareLock prevented the specific case that we were
experiencing and is therefore MUCH better (for us, anyway) than
released versions where we ran into the error on a regular basis.

Thanks,

Stephen


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: could not create directory "...": File exists
Date: 2013-01-17 19:31:54
Message-ID: 15649.1358451114@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Stephen Frost <sfrost(at)snowman(dot)net> writes:
> * Tom Lane (tgl(at)sss(dot)pgh(dot)pa(dot)us) wrote:
>> Or maybe we should just write this off as a case we can't realistically
>> fix before we have MVCC catalog scans. It seems that any other fix is
>> going to be hopelessly ugly.

> I feel like we should be able to do better than what we have now, at
> least. Using ShareLock prevented the specific case that we were
> experiencing and is therefore MUCH better (for us, anyway) than
> released versions where we ran into the error on a regular basis.

If it actually *reliably* fixed the problem, rather than just reducing
the probabilities, that would mean that the updates your other sessions
were doing didn't release RowExclusiveLock early. Have you dug into the
code to see if that's true? (Or even more to the point, if it's still
true in HEAD? I have no idea if all the recent refactoring has changed
this behavior for GRANT.)

My thought about a localized fix is similar to Andres' --- maybe we
could take a snapshot and use a real MVCC scan.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: could not create directory "...": File exists
Date: 2013-01-17 19:51:28
Message-ID: 16046.1358452288@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I wrote:
> Stephen Frost <sfrost(at)snowman(dot)net> writes:
>> I feel like we should be able to do better than what we have now, at
>> least. Using ShareLock prevented the specific case that we were
>> experiencing and is therefore MUCH better (for us, anyway) than
>> released versions where we ran into the error on a regular basis.

> My thought about a localized fix is similar to Andres' --- maybe we
> could take a snapshot and use a real MVCC scan.

BTW, it strikes me that the reason this particular SnapshotNow scan is a
big problem for you is that, because we stop and copy a subdirectory
after each tuple, the elapsed time for the seqscan is several orders of
magnitude more than it is for most (perhaps all) other cases where
SnapshotNow is used. So the window for trouble with concurrent updates
is that much bigger. This seems to provide a reasonably principled
argument why we might want to fix this case with a localized use of an
MVCC scan before we have such a fix globally.

Not that I wouldn't still want to mark it with a REVERT_ME_SOMEDAY
kind of annotation. We know we need the SnapshotNow scan fix.

regards, tom lane


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: could not create directory "...": File exists
Date: 2013-01-17 23:33:22
Message-ID: 20130117233322.GU16126@tamriel.snowman.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

* Tom Lane (tgl(at)sss(dot)pgh(dot)pa(dot)us) wrote:
> This seems to provide a reasonably principled
> argument why we might want to fix this case with a localized use of an
> MVCC scan before we have such a fix globally.

I had discussed that idea a bit with Andres on IRC and my only concern
was if there's some reason that acquiring a snapshot during createdb()
would be problematic. It doesn't appear to currently and I wasn't sure
if there'd be any issues.

I'll start working on a patch for that.

> Not that I wouldn't still want to mark it with a REVERT_ME_SOMEDAY
> kind of annotation. We know we need the SnapshotNow scan fix.

Agreed.

Thanks,

Stephen


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: could not create directory "...": File exists
Date: 2013-01-18 00:26:18
Message-ID: 20947.1358468778@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Stephen Frost <sfrost(at)snowman(dot)net> writes:
> * Tom Lane (tgl(at)sss(dot)pgh(dot)pa(dot)us) wrote:
>> This seems to provide a reasonably principled
>> argument why we might want to fix this case with a localized use of an
>> MVCC scan before we have such a fix globally.

> I had discussed that idea a bit with Andres on IRC and my only concern
> was if there's some reason that acquiring a snapshot during createdb()
> would be problematic. It doesn't appear to currently and I wasn't sure
> if there'd be any issues.

Don't see what. The main reason we've not yet attempted a global fix is
that the most straightforward way (take a new snapshot each time we
start a new SnapshotNow scan) seems too expensive. But CREATE DATABASE
is so expensive that the cost of an extra snapshot there ain't gonna
matter.

regards, tom lane


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: could not create directory "...": File exists
Date: 2013-01-18 13:20:56
Message-ID: 20130118132056.GY16126@tamriel.snowman.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom,

* Tom Lane (tgl(at)sss(dot)pgh(dot)pa(dot)us) wrote:
> Don't see what. The main reason we've not yet attempted a global fix is
> that the most straightforward way (take a new snapshot each time we
> start a new SnapshotNow scan) seems too expensive. But CREATE DATABASE
> is so expensive that the cost of an extra snapshot there ain't gonna
> matter.

Patch attached. Passes the regression tests and our internal testing
shows that it eliminates the error we were seeing (and doesn't cause
blocking, which is even better).

We have a workaround in place for our build system (more-or-less
"don't do that" approach), but it'd really be great if this was
back-patched and in the next round of point releases. Glancing
through the branches, looks like it should apply pretty cleanly.

Thanks,

Stephen

Attachment Content-Type Size
fix-concurrent-tablespace-update2.patch text/x-diff 2.4 KB

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: could not create directory "...": File exists
Date: 2013-01-18 19:49:59
Message-ID: 22774.1358538599@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Stephen Frost <sfrost(at)snowman(dot)net> writes:
> Patch attached. Passes the regression tests and our internal testing
> shows that it eliminates the error we were seeing (and doesn't cause
> blocking, which is even better).

> We have a workaround in place for our build system (more-or-less
> "don't do that" approach), but it'd really be great if this was
> back-patched and in the next round of point releases. Glancing
> through the branches, looks like it should apply pretty cleanly.

It looks like it will work back to 8.4; before that we didn't have
RegisterSnapshot. The patch could be adjusted for 8.3 if anyone
is sufficiently excited about it, but personally I'm not.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: could not create directory "...": File exists
Date: 2013-01-18 23:12:04
Message-ID: 8246.1358550724@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Stephen Frost <sfrost(at)snowman(dot)net> writes:
> Tom,
> * Tom Lane (tgl(at)sss(dot)pgh(dot)pa(dot)us) wrote:
>> Don't see what. The main reason we've not yet attempted a global fix is
>> that the most straightforward way (take a new snapshot each time we
>> start a new SnapshotNow scan) seems too expensive. But CREATE DATABASE
>> is so expensive that the cost of an extra snapshot there ain't gonna
>> matter.

> Patch attached. Passes the regression tests and our internal testing
> shows that it eliminates the error we were seeing (and doesn't cause
> blocking, which is even better).

I committed this with a couple of changes:

* I used GetLatestSnapshot rather than GetTransactionSnapshot. Since
we don't allow these operations inside transaction blocks, there
shouldn't be much difference, but in principle GetTransactionSnapshot
is the wrong thing; in a serializable transaction it could be quite old.

* After reviewing the other uses of SnapshotNow in dbcommands.c, I
decided we'd better make the same type of change in remove_dbtablespaces
and check_db_file_conflict, because those are likewise doing filesystem
operations on the strength of what they find in pg_tablespace.

I also ended up deciding to back-patch to 8.3 as well.

regards, tom lane


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: could not create directory "...": File exists
Date: 2013-01-19 18:50:18
Message-ID: 20130119185017.GH16126@tamriel.snowman.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

* Tom Lane (tgl(at)sss(dot)pgh(dot)pa(dot)us) wrote:
> I committed this with a couple of changes:

Great, thanks!

> * I used GetLatestSnapshot rather than GetTransactionSnapshot. Since
> we don't allow these operations inside transaction blocks, there
> shouldn't be much difference, but in principle GetTransactionSnapshot
> is the wrong thing; in a serializable transaction it could be quite old.

Makes sense.

> * After reviewing the other uses of SnapshotNow in dbcommands.c, I
> decided we'd better make the same type of change in remove_dbtablespaces
> and check_db_file_conflict, because those are likewise doing filesystem
> operations on the strength of what they find in pg_tablespace.

Thanks for that. I had noticed the other places where we were using
SnapshotNow, but I hadn't run down if they needed to be changed or not.

> I also ended up deciding to back-patch to 8.3 as well.

Excellent, thanks again.

Stephen