Re: 7.2.3 vacuum bug

Lists: pgsql-hackers
From: Rod Taylor <rbt(at)rbt(dot)ca>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: 7.2.3 vacuum bug
Date: 2002-10-30 20:14:22
Message-ID: 1036008862.88796.16.camel@jester
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

ERROR: RelationClearRelation: relation 11584078 deleted while still in
use

I've been unable to come up with a test case that will cause the
problem, seems to be timing related. The queries that are currently
running when these errors occur do a lot or work with temp tables that
are frequently truncated.

--
Rod Taylor


From: Neil Conway <neilc(at)samurai(dot)com>
To: Rod Taylor <rbt(at)rbt(dot)ca>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 7.2.3 vacuum bug
Date: 2002-10-30 20:38:31
Message-ID: 87hef3ek6w.fsf@mailbox.samurai.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Rod Taylor <rbt(at)rbt(dot)ca> writes:
> ERROR: RelationClearRelation: relation 11584078 deleted while still in
> use

I was going to report a similar error that arises in a different
situation:

client 1:

CREATE TABLE a (b int);
BEGIN;
DROP TABLE a;
-- wait

client 2:

SELECT * FROM a;

client 1:

COMMIT;

Now, client 2 will receive "RelationClearRelation: relation 25172
deleted while still in use", rather than "Relation "a" does not
exist", as you might expect. Not sure if it's the same bug, or just a
different problem...

Cheers,

Neil

--
Neil Conway <neilc(at)samurai(dot)com> || PGP Key ID: DB3C29FC


From: Rod Taylor <rbt(at)rbt(dot)ca>
To: Neil Conway <neilc(at)samurai(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 7.2.3 vacuum bug
Date: 2002-10-30 20:49:30
Message-ID: 1036010978.88796.18.camel@jester
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, 2002-10-30 at 15:38, Neil Conway wrote:
> Rod Taylor <rbt(at)rbt(dot)ca> writes:
> > ERROR: RelationClearRelation: relation 11584078 deleted while still in
> > use
>
> I was going to report a similar error that arises in a different
> situation:

Probably a different look at the same problem.

--
Rod Taylor


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Neil Conway <neilc(at)samurai(dot)com>
Cc: Rod Taylor <rbt(at)rbt(dot)ca>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 7.2.3 vacuum bug
Date: 2002-10-31 04:40:05
Message-ID: 3879.1036039205@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Neil Conway <neilc(at)samurai(dot)com> writes:
> client 1:

> CREATE TABLE a (b int);
> BEGIN;
> DROP TABLE a;
> -- wait

> client 2:

> SELECT * FROM a;

> client 1:

> COMMIT;

> Now, client 2 will receive "RelationClearRelation: relation 25172
> deleted while still in use", rather than "Relation "a" does not
> exist", as you might expect.

But relation "a" *does* exist at the start of client 2's operation.
While I'm not here to defend the exact phrasing of this error message,
it does seem to me that it's appropriate to give a different error
message than what appears when the table wasn't found at all.

An example of why the two cases shouldn't be folded together: suppose
that client 2's schema search path is "myschema, public", and that
client 1 creates/drops myschema.a while there is also a public.a.
client 2 will locate myschema.a as the meaning of "a", and one way or
another it is going to error out when myschema.a gets dropped from
underneath it --- it will not (and shouldn't IMHO) go back and repeat
the schema search to find public.a. But a user who gets a "Relation "a"
does not exist" error message in such a scenario would be justifiably
confused.

regards, tom lane


From: Neil Conway <neilc(at)samurai(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Rod Taylor <rbt(at)rbt(dot)ca>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 7.2.3 vacuum bug
Date: 2002-10-31 04:54:50
Message-ID: 87smynb42t.fsf@mailbox.samurai.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
> But relation "a" *does* exist at the start of client 2's operation.
> While I'm not here to defend the exact phrasing of this error message,
> it does seem to me that it's appropriate to give a different error
> message than what appears when the table wasn't found at all.

Ok, fair enough -- I agree that we should treat the two cases
differently. But one thing I think we should do in any case is improve
the wording of the error message.

Cheers,

Neil

--
Neil Conway <neilc(at)samurai(dot)com> || PGP Key ID: DB3C29FC


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Rod Taylor <rbt(at)rbt(dot)ca>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 7.2.3 vacuum bug
Date: 2002-10-31 04:57:29
Message-ID: 3975.1036040249@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Rod Taylor <rbt(at)rbt(dot)ca> writes:
> ERROR: RelationClearRelation: relation 11584078 deleted while still in
> use

> I've been unable to come up with a test case that will cause the
> problem, seems to be timing related. The queries that are currently
> running when these errors occur do a lot or work with temp tables that
> are frequently truncated.

Hm. vacuum.c tries to avoid this class of problem:

/*
* Race condition -- if the pg_class tuple has gone away since the
* last time we saw it, we don't need to vacuum it.
*/
if (!SearchSysCacheExists(RELOID,
ObjectIdGetDatum(relid),
0, 0, 0))
{
CommitTransactionCommand(true);
return true; /* okay 'cause no data there */
}

...

onerel = relation_open(relid, lmode);

but on reflection it's clear that this doesn't really prevent a race
condition. If the table is already exclusive-locked by a DROP TABLE
that hasn't committed yet (eg, the implicit DROP that happens when temp
tables are cleared out at backend exit), then the syscache lookup will
go fine, but the relation_open() routine blocks waiting for lock and
eventually fails.

What would probably work better is to first lock the relation OID,
then see if we can open the relation or not.

Thinking further, it's really kinda bogus that LockRelation() works on
an already-opened Relation; if possible we should acquire the lock
before attempting to create a relcache entry. (We only need to know the
OID and the relisshared status before we can make a locktag, so it'd be
possible to acquire the lock using only the contents of the pg_class row.)
Not sure how much code restructuring might be involved to make this
happen, but it'd be worth thinking about for 7.4.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Neil Conway <neilc(at)samurai(dot)com>
Cc: Rod Taylor <rbt(at)rbt(dot)ca>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 7.2.3 vacuum bug
Date: 2002-10-31 05:02:40
Message-ID: 4024.1036040560@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Neil Conway <neilc(at)samurai(dot)com> writes:
> Ok, fair enough -- I agree that we should treat the two cases
> differently. But one thing I think we should do in any case is improve
> the wording of the error message.

Got a suggestion?

regards, tom lane


From: "scott(dot)marlowe" <scott(dot)marlowe(at)ihs(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Neil Conway <neilc(at)samurai(dot)com>, Rod Taylor <rbt(at)rbt(dot)ca>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 7.2.3 vacuum bug
Date: 2002-10-31 16:33:23
Message-ID: Pine.LNX.4.33.0210310932330.4588-100000@css120.ihs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, 31 Oct 2002, Tom Lane wrote:

> Neil Conway <neilc(at)samurai(dot)com> writes:
> > Ok, fair enough -- I agree that we should treat the two cases
> > differently. But one thing I think we should do in any case is improve
> > the wording of the error message.
>
> Got a suggestion?

Change: RelationClearRelation: relation 25172 deleted while still in use
to: RelationClearRelation: a relation (id: 25172) was deleted while still
in use


From: Rod Taylor <rbt(at)rbt(dot)ca>
To: "scott(dot)marlowe" <scott(dot)marlowe(at)ihs(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Neil Conway <neilc(at)samurai(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 7.2.3 vacuum bug
Date: 2002-10-31 17:56:24
Message-ID: 1036086999.94263.32.camel@jester
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Found another:

ERROR: cannot find attribute 2 of relation pg_temp_12100_0

On Thu, 2002-10-31 at 11:33, scott.marlowe wrote:
> On Thu, 31 Oct 2002, Tom Lane wrote:
>
> > Neil Conway <neilc(at)samurai(dot)com> writes:
> > > Ok, fair enough -- I agree that we should treat the two cases
> > > differently. But one thing I think we should do in any case is improve
> > > the wording of the error message.
> >
> > Got a suggestion?
>
> Change: RelationClearRelation: relation 25172 deleted while still in use
> to: RelationClearRelation: a relation (id: 25172) was deleted while still
> in use
>
>
--
Rod Taylor


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Rod Taylor <rbt(at)rbt(dot)ca>
Cc: "scott(dot)marlowe" <scott(dot)marlowe(at)ihs(dot)com>, Neil Conway <neilc(at)samurai(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 7.2.3 vacuum bug
Date: 2002-10-31 18:03:36
Message-ID: 8216.1036087416@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Rod Taylor <rbt(at)rbt(dot)ca> writes:
> Found another:
> ERROR: cannot find attribute 2 of relation pg_temp_12100_0

Can you reproduce that?

It could be that this just represents someone's temp table deletion
committing while VACUUM is partway through trying to build a relcache
entry to open the relation. If so, it is only another manifestation
of the should-lock-before-relation-open problem.

regards, tom lane


From: Rod Taylor <rbt(at)rbt(dot)ca>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "scott(dot)marlowe" <scott(dot)marlowe(at)ihs(dot)com>, Neil Conway <neilc(at)samurai(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 7.2.3 vacuum bug
Date: 2002-10-31 20:23:08
Message-ID: 1036095788.94263.47.camel@jester
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, 2002-10-31 at 13:03, Tom Lane wrote:
> Rod Taylor <rbt(at)rbt(dot)ca> writes:
> > Found another:
> > ERROR: cannot find attribute 2 of relation pg_temp_12100_0
>
> Can you reproduce that?
>
> It could be that this just represents someone's temp table deletion
> committing while VACUUM is partway through trying to build a relcache
> entry to open the relation. If so, it is only another manifestation
> of the should-lock-before-relation-open problem.

Yes, but not easily (very timing dependent), takes a lot of worker
processes to throw it. So it's likely a part of the locking issue.

--
Rod Taylor


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Rod Taylor <rbt(at)rbt(dot)ca>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 7.2.3 vacuum bug
Date: 2002-11-02 03:50:32
Message-ID: 200211020350.gA23oW910502@candle.pha.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


Is this a TODO?

---------------------------------------------------------------------------

Tom Lane wrote:
> Rod Taylor <rbt(at)rbt(dot)ca> writes:
> > ERROR: RelationClearRelation: relation 11584078 deleted while still in
> > use
>
> > I've been unable to come up with a test case that will cause the
> > problem, seems to be timing related. The queries that are currently
> > running when these errors occur do a lot or work with temp tables that
> > are frequently truncated.
>
> Hm. vacuum.c tries to avoid this class of problem:
>
> /*
> * Race condition -- if the pg_class tuple has gone away since the
> * last time we saw it, we don't need to vacuum it.
> */
> if (!SearchSysCacheExists(RELOID,
> ObjectIdGetDatum(relid),
> 0, 0, 0))
> {
> CommitTransactionCommand(true);
> return true; /* okay 'cause no data there */
> }
>
> ...
>
> onerel = relation_open(relid, lmode);
>
> but on reflection it's clear that this doesn't really prevent a race
> condition. If the table is already exclusive-locked by a DROP TABLE
> that hasn't committed yet (eg, the implicit DROP that happens when temp
> tables are cleared out at backend exit), then the syscache lookup will
> go fine, but the relation_open() routine blocks waiting for lock and
> eventually fails.
>
> What would probably work better is to first lock the relation OID,
> then see if we can open the relation or not.
>
> Thinking further, it's really kinda bogus that LockRelation() works on
> an already-opened Relation; if possible we should acquire the lock
> before attempting to create a relcache entry. (We only need to know the
> OID and the relisshared status before we can make a locktag, so it'd be
> possible to acquire the lock using only the contents of the pg_class row.)
> Not sure how much code restructuring might be involved to make this
> happen, but it'd be worth thinking about for 7.4.
>
> regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 2: you can get off all lists at once with the unregister command
> (send "unregister YourEmailAddressHere" to majordomo(at)postgresql(dot)org)
>

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: Rod Taylor <rbt(at)rbt(dot)ca>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 7.2.3 vacuum bug
Date: 2002-11-02 04:52:33
Message-ID: 17751.1036212753@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> Is this a TODO?

Yes. Maybe

* Acquire lock on a relation before building a relcache entry for it

I'm not quite sure yet how this should interact with the case where
you already have a relcache entry, but certainly the existing behavior
of "build the whole entry and then acquire lock" is not good.

regards, tom lane


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Rod Taylor <rbt(at)rbt(dot)ca>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 7.2.3 vacuum bug
Date: 2002-11-02 05:42:03
Message-ID: 200211020542.gA25g3T29847@candle.pha.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


Added to TODO.

---------------------------------------------------------------------------

Tom Lane wrote:
> Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> > Is this a TODO?
>
> Yes. Maybe
>
> * Acquire lock on a relation before building a relcache entry for it
>
> I'm not quite sure yet how this should interact with the case where
> you already have a relcache entry, but certainly the existing behavior
> of "build the whole entry and then acquire lock" is not good.
>
> regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Don't 'kill -9' the postmaster
>

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073