Re: pg_relation_size / could not open relation with OID #

Lists: pgsql-general
From: tv(at)fuzzy(dot)cz
To: pgsql-general(at)postgresql(dot)org
Subject: pg_relation_size / could not open relation with OID #
Date: 2010-09-20 11:52:35
Message-ID: e5ebb6f285135aebdc53b199aef95be5.squirrel@sq.gransy.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

Hi everyone,

I've run into a strange problem with system catalogs - we're collecting
database stats periodically (every 10 minutes), and from time to time we
get the following error:

--------------------------------------------------
ERROR: could not open relation with OID 154873708
--------------------------------------------------

Most of the time it works fine - there are about 144 executions every day,
and it fails about twice a day (i.e. about 1%). The OID value is different
every time.

The query executed is this (this one reads table stats, there is a similar
query for indexes and it fails too):

-----------------------------------------------------------------------
SELECT now() AS stat_time, pg_relation_size(stat.relid) AS relation_size,
relfrozenxid AS frozen_xid, age(relfrozenxid) AS frozen_xid_age,
current_database() AS dbname, stat.schemaname AS schemaname, stat.relname
AS tablename, seq_scan, seq_tup_read, idx_scan, idx_tup_fetch,
heap_blks_read, heap_blks_hit, idx_blks_read, idx_blks_hit,
toast_blks_read, toast_blks_hit, tidx_blks_read, tidx_blks_hit, n_tup_ins,
n_tup_upd, n_tup_del, relpages, reltuples, n_tup_hot_upd, n_live_tup,
n_dead_tup, last_vacuum, last_autovacuum, last_analyze, last_autoanalyze
FROM pg_stat_all_tables AS stat LEFT JOIN pg_statio_all_tables AS statio
USING (relid) LEFT JOIN pg_class ON (stat.relid = pg_class.oid)
-----------------------------------------------------------------------

I guess this has something to do with pg_relation_size function, used in
the query. Maybe a race condition or something like that ... Or is that a
know feature?

We've checked if there are any suspicious cron scripts (maintenance with
temporary tables, reindexing etc.) but we have found nothing. Otherwise
the database seems just fine, everything (including backups etc.) works
fine.

The database is running on 8.4.x (not sure which if it's 8.4.4).

regards
Tomas


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: tv(at)fuzzy(dot)cz
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: pg_relation_size / could not open relation with OID #
Date: 2010-09-20 13:44:02
Message-ID: 8436.1284990242@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

tv(at)fuzzy(dot)cz writes:
> I've run into a strange problem with system catalogs - we're collecting
> database stats periodically (every 10 minutes), and from time to time we
> get the following error:

> --------------------------------------------------
> ERROR: could not open relation with OID 154873708
> --------------------------------------------------

I think you're probably hitting a problem with a table being deleted
while you're scanning pg_class. pg_relation_size() will fail if the
given OID isn't valid "now" --- but the underlying query returns all
OIDs that were valid when the transaction or statement snapshot was
taken. So you have a race condition.

You might consider excluding temp tables from the query, if that's the
most likely source of the problem.

regards, tom lane


From: Tomas Vondra <tv(at)fuzzy(dot)cz>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: pg_relation_size / could not open relation with OID #
Date: 2010-09-20 22:37:55
Message-ID: 4C97E243.8030607@fuzzy.cz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

Dne 20.9.2010 15:44, Tom Lane napsal(a):
> tv(at)fuzzy(dot)cz writes:
>> I've run into a strange problem with system catalogs - we're collecting
>> database stats periodically (every 10 minutes), and from time to time we
>> get the following error:
>
>> --------------------------------------------------
>> ERROR: could not open relation with OID 154873708
>> --------------------------------------------------
>
> I think you're probably hitting a problem with a table being deleted
> while you're scanning pg_class. pg_relation_size() will fail if the
> given OID isn't valid "now" --- but the underlying query returns all
> OIDs that were valid when the transaction or statement snapshot was
> taken. So you have a race condition.
>
> You might consider excluding temp tables from the query, if that's the
> most likely source of the problem.
>
> regards, tom lane
>

OK, I'm a bit confused now. So it's not a race condition (i.e. a bug) in
a pg_relation_size but a feature?

I expected the whole query (including function calls etc.) will execute
on a consistent snapshot but as I understand the whole scenario is
something like this:

1) execute the SELECT statement (load the OIDs)
2) drop one of the tables (before the functions are evaluated)
3) execute pg_relation_size for all the OID (one of the tables does not
exist anymore so the function call will fail)

I'll try to exclude the temp tables but I'm not sure if it will solve
the issue. It seems to me the very same scenario is possible with
regular tables, right?

Is there some other way to prevent this issue? E.g. locking the pg_class
table before executing the query or something like that?

regards
Tomas


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Tomas Vondra <tv(at)fuzzy(dot)cz>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: pg_relation_size / could not open relation with OID #
Date: 2010-09-21 02:31:10
Message-ID: 21489.1285036270@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

Tomas Vondra <tv(at)fuzzy(dot)cz> writes:
> Dne 20.9.2010 15:44, Tom Lane napsal(a):
>> I think you're probably hitting a problem with a table being deleted
>> while you're scanning pg_class. pg_relation_size() will fail if the
>> given OID isn't valid "now" --- but the underlying query returns all
>> OIDs that were valid when the transaction or statement snapshot was
>> taken. So you have a race condition.
>>
>> You might consider excluding temp tables from the query, if that's the
>> most likely source of the problem.

> OK, I'm a bit confused now. So it's not a race condition (i.e. a bug) in
> a pg_relation_size but a feature?

Well, "feature" is in the eye of the beholder I guess. The race
condition is not really avoidable; certainly pg_relation_size() can't
do anything to prevent it. And you do *not* want "guaranteed consistent"
results; that would mean taking a lock on every table in the system,
which would likely result in far more failures, not fewer failures,
because of deadlocks.

We could prevent your query from failing if we did something like having
pg_relation_size() return NULL, rather than throwing an error, if the
OID it's given doesn't turn out to correspond to a live table. I'm not
sure if that'd be a net improvement or not --- it certainly seems to
reduce the system's ability to detect simple errors, and depending on
what your query was doing with the results, a NULL could bollix it up in
other ways. But IIRC we've done similar things for other system inquiry
functions, so maybe it'd be reasonable here too.

regards, tom lane


From: Tomas Vondra <tv(at)fuzzy(dot)cz>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: pg_relation_size / could not open relation with OID #
Date: 2010-09-21 20:38:24
Message-ID: 4C9917C0.7000301@fuzzy.cz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

>> OK, I'm a bit confused now. So it's not a race condition (i.e. a bug) in
>> a pg_relation_size but a feature?
>
> Well, "feature" is in the eye of the beholder I guess. The race
> condition is not really avoidable; certainly pg_relation_size() can't
> do anything to prevent it. And you do *not* want "guaranteed consistent"
> results; that would mean taking a lock on every table in the system,
> which would likely result in far more failures, not fewer failures,
> because of deadlocks.

Well, I was thinking about locking the pg_class itself (SHARE ROW
EXCLUSIVE or EXCLUSIVE), but I really am not sure it's a good idea. But
I'm not sure locking a table is different from locking a corresponding
row in the pg_class table - maybe it's the same.

All (or most of) the DDL commands have to modify pg_class at some point,
so I thought that if those commands lock the row (representing a
table/index/...), I could lock the whole table (preventing the DDL from
running).

But I admit this may be a hilariously stupid idea ...

>
> We could prevent your query from failing if we did something like having
> pg_relation_size() return NULL, rather than throwing an error, if the
> OID it's given doesn't turn out to correspond to a live table. I'm not
> sure if that'd be a net improvement or not --- it certainly seems to
> reduce the system's ability to detect simple errors, and depending on
> what your query was doing with the results, a NULL could bollix it up in
> other ways. But IIRC we've done similar things for other system inquiry
> functions, so maybe it'd be reasonable here too.

Hmm, sounds nice. Actually I could create such 'graceful wrapper' on my
own - just catch the exception and return NULL. Right?

Another possible solution is to remove the pg_relation_size from the
SELECT itself execute it in a loop for each of the rows (I have to do a
row-by-row processing anyway, and in case of pg_relation_size the
slowdown should be negligible - I guess).

regards
Tomas


From: Greg Smith <greg(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Tomas Vondra <tv(at)fuzzy(dot)cz>, pgsql-general(at)postgresql(dot)org
Subject: Re: pg_relation_size / could not open relation with OID #
Date: 2010-10-07 13:42:22
Message-ID: 4CADCE3E.60705@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

Tom Lane wrote:
> We could prevent your query from failing if we did something like having
> pg_relation_size() return NULL, rather than throwing an error, if the
> OID it's given doesn't turn out to correspond to a live table. I'm not
> sure if that'd be a net improvement or not --- it certainly seems to
> reduce the system's ability to detect simple errors

I've struggled with scripts using pg_relation_size doing strange things
because of this more than once. How about a) return NULL and b) log at
NOTICE that you just asked for something undefined? That would let
scripts run without interruption in some of these race condition cases,
while still generating some clue the user was likely to see that there
was a problem when people were just using the thing blatantly wrong--the
biggest subset of cases that produce errors now I suspect.

--
Greg Smith, 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services and Support www.2ndQuadrant.us


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: Tomas Vondra <tv(at)fuzzy(dot)cz>, pgsql-general(at)postgresql(dot)org
Subject: Re: pg_relation_size / could not open relation with OID #
Date: 2010-10-07 14:10:27
Message-ID: 28177.1286460627@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

Greg Smith <greg(at)2ndquadrant(dot)com> writes:
> Tom Lane wrote:
>> We could prevent your query from failing if we did something like having
>> pg_relation_size() return NULL, rather than throwing an error, if the
>> OID it's given doesn't turn out to correspond to a live table. I'm not
>> sure if that'd be a net improvement or not --- it certainly seems to
>> reduce the system's ability to detect simple errors

> I've struggled with scripts using pg_relation_size doing strange things
> because of this more than once. How about a) return NULL and b) log at
> NOTICE that you just asked for something undefined?

I don't care for the NOTICE at all; it's just useless log bloat (that
likely will never be seen by a human) in most use-cases. Either we
think this is an expected case, or not.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Smith <greg(at)2ndquadrant(dot)com>, Tomas Vondra <tv(at)fuzzy(dot)cz>, pgsql-general(at)postgresql(dot)org
Subject: Re: pg_relation_size / could not open relation with OID #
Date: 2010-10-07 14:26:50
Message-ID: 28488.1286461610@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

I wrote:
> Greg Smith <greg(at)2ndquadrant(dot)com> writes:
>> I've struggled with scripts using pg_relation_size doing strange things
>> because of this more than once. How about a) return NULL and b) log at
>> NOTICE that you just asked for something undefined?

> I don't care for the NOTICE at all; it's just useless log bloat (that
> likely will never be seen by a human) in most use-cases. Either we
> think this is an expected case, or not.

It strikes me that if we were willing to throw code at the problem,
we could make it work like this:

1. Try to open the relation. If successful, proceed as normal.

2. Try to fetch the pg_class row by OID, using SnapshotDirty (or perhaps
the surrounding query's snapshot). If we can find it under a
non-current snapshot, return NULL.

3. Else throw error.

This would properly throw error for cases where you'd passed the wrong
catalog's OID column to pg_relation_size. Depending on how tense we
were about the snapshot selection, it might sometimes return NULL in
cases where an error would be more appropriate (because the relation
had been dead for some time).

regards, tom lane


From: Greg Smith <greg(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Tomas Vondra <tv(at)fuzzy(dot)cz>, pgsql-general(at)postgresql(dot)org
Subject: Re: pg_relation_size / could not open relation with OID #
Date: 2010-10-07 15:07:09
Message-ID: 4CADE21D.4080107@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

Tom Lane wrote:
> It strikes me that if we were willing to throw code at the problem,
> we could make it work like this:
> ...
> This would properly throw error for cases where you'd passed the wrong
> catalog's OID column to pg_relation_size.

Yeah, you're right that is the proper way to handle this. As the
problem isn't that serious once you're aware of it, I don't see a large
amount of motivation to work on that now though, if that's what it will
take to fix. And Tomas seemed satisfied with a workaround too. I just
added a TODO item pointing to your suggested implementation so it's more
likely people will stumble onto the relevant trivia here on their own,
and maybe some day a patch will get written to implement that idea too.

--
Greg Smith, 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services and Support www.2ndQuadrant.us
Author, "PostgreSQL 9.0 High Performance" Pre-ordering at:
https://www.packtpub.com/postgresql-9-0-high-performance/book


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: Tomas Vondra <tv(at)fuzzy(dot)cz>, pgsql-general(at)postgresql(dot)org
Subject: Re: pg_relation_size / could not open relation with OID #
Date: 2010-10-07 15:40:33
Message-ID: 29893.1286466033@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

Greg Smith <greg(at)2ndquadrant(dot)com> writes:
> Yeah, you're right that is the proper way to handle this. As the
> problem isn't that serious once you're aware of it, I don't see a large
> amount of motivation to work on that now though, if that's what it will
> take to fix. And Tomas seemed satisfied with a workaround too. I just
> added a TODO item pointing to your suggested implementation so it's more
> likely people will stumble onto the relevant trivia here on their own,
> and maybe some day a patch will get written to implement that idea too.

Well, if we're leaving TODO crumbs, what I had in mind was:

1. Use try_relation_open() not relation_open() to start with.

2. On failure return, do something like

if (relation_recently_dead(relid))
... return NULL ...
else
... throw error ...

relation_recently_dead() would probably be a few dozen lines of code,
but it could be shared among all places where we want to do something
like this.

regards, tom lane