Re: [GENERAL] cache lookup of relation 165058647 failed

Lists: pgsql-bugspgsql-general
From: Juris Krumins <juriskr(at)komin(dot)lv>
To: pgsql-general(at)postgresql(dot)org
Subject: cache lookup of relation 165058647 failed
Date: 2004-04-14 09:35:43
Message-ID: 20040414123113.S61393@juriskr_workstation
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-general

I have a problem with postgresql tables. periodicaly, I would say
frequently about 5-10 time per hour i have such errors in my server log
file:

2004-04-14 12:23:32 [73692] ERROR: cache lookup of relation 149064743
failed
2004-04-14 12:23:32 [73692] ERROR: Relation "tmp_table1" does not exist
2004-04-14 12:23:32 [73692] ERROR: Relation "tmp_table1" does not exist

So turn on debugging options and have that's what i got:

2004-04-14 12:24:54 [74021] LOG: connection received: host=[local]
2004-04-14 12:24:54 [74021] LOG: connection authorized: user=ejob_guest
database=ejob
2004-04-14 12:24:54 [74021] LOG: query: begin; select
getdatabaseencoding(); commit
2004-04-14 12:24:54 [74021] LOG: duration: 0.000987 sec
2004-04-14 12:24:54 [74021] LOG: query: SELECT d.config_id, d.text1 as
text, d.field_name,
2004-04-14 12:24:54 [74021] LOG: duration: 0.005295 sec
2004-04-14 12:24:54 [74021] LOG: query: CREATE TEMP TABLE tmp_table1 AS
SELECT advert.id, (SELECT employer.fullname FROM employer WHERE
advert.empid=employer.id) as emp_name, (CASE WHEN advert.status1 = 'A'
THEN advert.postname1 ELSE advert.postname2 END) as postname, (SELECT
cd.name FROM catalog_names cd WHERE advert.industry=cd.catalog_id AND
cd.language_id=1) AS f1002, (CASE WHEN status1='A' THEN
advert.obligations1 ELSE advert.obligations2 END) as obligations1 FROM
good_adv as advert
2004-04-14 12:24:54 [74021] ERROR: cache lookup of relation 165058647
failed
2004-04-14 12:24:54 [74021] LOG: query: SELECT * FROM tmp_table1
2004-04-14 12:24:54 [74021] ERROR: Relation "tmp_table1" does not exist
2004-04-14 12:24:54 [74021] LOG: query: SELECT * FROM tmp_table1 ORDER
BY id desc LIMIT 100
2004-04-14 12:24:54 [74021] ERROR: Relation "tmp_table1" does not exist

Does anybody have any idea about what can cause such kind of problems and
how to overcome it ?

Thanks a lot.


From: Richard Huxton <dev(at)archonet(dot)com>
To: Juris Krumins <juriskr(at)komin(dot)lv>, pgsql-general(at)postgresql(dot)org
Subject: Re: cache lookup of relation 165058647 failed
Date: 2004-04-14 11:58:20
Message-ID: 200404141258.20763.dev@archonet.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-general

On Wednesday 14 April 2004 10:35, Juris Krumins wrote:

> 2004-04-14 12:24:54 [74021] LOG: query: SELECT d.config_id, d.text1 as
> text, d.field_name,
> 2004-04-14 12:24:54 [74021] LOG: duration: 0.005295 sec
> 2004-04-14 12:24:54 [74021] LOG: query: CREATE TEMP TABLE tmp_table1 AS
> SELECT advert.id, (SELECT employer.fullname FROM employer WHERE
> advert.empid=employer.id) as emp_name, (CASE WHEN advert.status1 = 'A'
> THEN advert.postname1 ELSE advert.postname2 END) as postname, (SELECT
> cd.name FROM catalog_names cd WHERE advert.industry=cd.catalog_id AND
> cd.language_id=1) AS f1002, (CASE WHEN status1='A' THEN
> advert.obligations1 ELSE advert.obligations2 END) as obligations1 FROM
> good_adv as advert
> 2004-04-14 12:24:54 [74021] ERROR: cache lookup of relation 165058647
> failed

The error is complaining about a relation that used to have an OID of
165058647. This is almost certainly a temporary table. Are there any plpgsql
functions referring to a temporary table, or do you create, drop, recreate
tmp_table1 within the same connection?

--
Richard Huxton
Archonet Ltd


From: Juris Krumins <juriskr(at)komin(dot)lv>
To: Richard Huxton <dev(at)archonet(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: cache lookup of relation 165058647 failed
Date: 2004-04-14 12:18:02
Message-ID: 20040414151230.K61393@juriskr_workstation
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-general


On Wed, 14 Apr 2004, Richard Huxton wrote:

> On Wednesday 14 April 2004 10:35, Juris Krumins wrote:
>
> > 2004-04-14 12:24:54 [74021] LOG: query: SELECT d.config_id, d.text1 as
> > text, d.field_name,
> > 2004-04-14 12:24:54 [74021] LOG: duration: 0.005295 sec
> > 2004-04-14 12:24:54 [74021] LOG: query: CREATE TEMP TABLE tmp_table1 AS
> > SELECT advert.id, (SELECT employer.fullname FROM employer WHERE
> > advert.empid=employer.id) as emp_name, (CASE WHEN advert.status1 = 'A'
> > THEN advert.postname1 ELSE advert.postname2 END) as postname, (SELECT
> > cd.name FROM catalog_names cd WHERE advert.industry=cd.catalog_id AND
> > cd.language_id=1) AS f1002, (CASE WHEN status1='A' THEN
> > advert.obligations1 ELSE advert.obligations2 END) as obligations1 FROM
> > good_adv as advert
> > 2004-04-14 12:24:54 [74021] ERROR: cache lookup of relation 165058647
> > failed
>
> The error is complaining about a relation that used to have an OID of
> 165058647. This is almost certainly a temporary table. Are there any plpgsql
> functions referring to a temporary table, or do you create, drop, recreate
> tmp_table1 within the same connection?
>
> --
> Richard Huxton
> Archonet Ltd
>
There is actually no creation, droping, recreation of tmp_table1 within
the same connection. The only thing is that good_adv is not actually a
table, but a view. Can it be reason for this problem ?
The second thing is that queries I'v posted before going actually the same
way, first of all create temporary table, then making couple select
queries to that temporary table and that's all. The problem is that i use
such method all over my web site, that's why create temp table tmp_table1
... are quite large. Coulde it be problem ?


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Richard Huxton <dev(at)archonet(dot)com>
Cc: Juris Krumins <juriskr(at)komin(dot)lv>, pgsql-general(at)postgresql(dot)org
Subject: Re: cache lookup of relation 165058647 failed
Date: 2004-04-14 12:44:50
Message-ID: 4508.1081946690@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-general

Richard Huxton <dev(at)archonet(dot)com> writes:
> The error is complaining about a relation that used to have an OID of
> 165058647. This is almost certainly a temporary table. Are there any plpgsql
> functions referring to a temporary table, or do you create, drop, recreate
> tmp_table1 within the same connection?

The trace suggests this is happening immediately after connect, so the
last part doesn't sound right. It could be a plpgsql function problem,
though the query doesn't look like it invokes any functions.

What I was wondering about was dangling references within a view, that
is a view referring to a table that no longer exists. (That shouldn't
happen anymore in 7.3 and later, but if this is a pre-7.3 system then
it's possible.) Does this query make use of any views, and if so do the
views change from time to time?

regards, tom lane


From: Juris Krumins <juriskr(at)komin(dot)lv>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Richard Huxton <dev(at)archonet(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: cache lookup of relation 165058647 failed
Date: 2004-04-14 14:02:55
Message-ID: 20040414165109.Q61393@juriskr_workstation
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-general

On Wed, 14 Apr 2004, Tom Lane wrote:

> Richard Huxton <dev(at)archonet(dot)com> writes:
> > The error is complaining about a relation that used to have an OID of
> > 165058647. This is almost certainly a temporary table. Are there any plpgsql
> > functions referring to a temporary table, or do you create, drop, recreate
> > tmp_table1 within the same connection?
>
> The trace suggests this is happening immediately after connect, so the
> last part doesn't sound right. It could be a plpgsql function problem,
> though the query doesn't look like it invokes any functions.
>
> What I was wondering about was dangling references within a view, that
> is a view referring to a table that no longer exists. (That shouldn't
> happen anymore in 7.3 and later, but if this is a pre-7.3 system then
> it's possible.) Does this query make use of any views, and if so do the
> views change from time to time?
>
> regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 5: Have you checked our extensive FAQ?
>
> http://www.postgresql.org/docs/faqs/FAQ.html
>

Sorry i didn't mention my version. It is PostgreSQL 7.3.4. As I metioned
there is no function invocted from that query.

The view I use to make temporary table was made by query:

SELECT advert.id, advert.status1, advert.empid, advert.email,
advert.postname1, advert.postname2, advert.industry,
advert.obligations1, advert.obligations2, advert.chk_edu, advert.chk_lang,
advert.chk_compskills, advert.chk_work, advert.chk_travel
, advert.chk_drivlic, advert.chk_auto, advert.chk_worktime,
advert.chk_workarea, advert.edu_deg, advert.travel, advert.lica, advert.
licb, advert.licc, advert.licd, advert.lice, advert.auto,
advert.aquirement1, advert.aquirement2, advert.offer1, advert.offer2, adve
rt.fut_all_time, advert.salary1, advert.address1, advert."location",
advert.work_time1, advert.work_time2, advert.person1, advert.pe
rson2, advert.personpost1, advert.personpost2, advert.phone, advert.fax,
advert.web, advert.address2, advert.salary2, advert.time_cr
eate, advert.enddate, advert.status2, advert.send_email FROM advert,
employer WHERE (((((advert.empid = employer.id) AND (employer.s
tatus <> 'B'::bpchar)) AND (employer.act_lidz >=
date("timestamp"('today'::text)))) AND (employer.act_no <=
date("timestamp"('today'
::text)))) AND ((advert.enddate >= date("timestamp"('today'::text))) AND
((advert.status1 = 'A'::bpchar) OR (advert.status2 = 'A'::b
pchar))));

So there is no mucj information you can get out of this, but the only
thing is true, that information in view change from time to time.
I'v checked everything conected with nonexistent tables and things like
that. Everything looks fine.

Diging through the source coude reveals me that:

src/backend/catalog/dependency.c: elog(ERROR, "cache lookup
of relation %u failed", relid);
src/backend/catalog/heap.c: elog(ERROR, "cache lookup of
relation %u failed",
src/backend/catalog/pg_constraint.c:
elog(ERROR, "cache lookup of relation %u failed",);
src/backend/utils/adt/ruleutils.c: elog(ERROR, "cache lookup
of relation %u failed", relid);

So bacicaly we have only 4 places, which cane generate errors like that,
and all this places refer to HeapTupleIsValid function.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Juris Krumins <juriskr(at)komin(dot)lv>
Cc: Richard Huxton <dev(at)archonet(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: cache lookup of relation 165058647 failed
Date: 2004-04-14 15:24:25
Message-ID: 6002.1081956265@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-general

Juris Krumins <juriskr(at)komin(dot)lv> writes:
> So bacicaly we have only 4 places, which cane generate errors like that,
> and all this places refer to HeapTupleIsValid function.

Hmm. What's even more interesting is that none of them are in places
that I would expect to be called during a CREATE TABLE AS operation.
Can you dig into it a bit more and find out which one is failing?
A stack trace back from the error would be useful too.

regards, tom lane


From: Juris Krumins <juriskr(at)komin(dot)lv>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: cache lookup of relation 165058647 failed
Date: 2004-04-26 09:35:31
Message-ID: 20040426123152.C61393@juriskr_workstation
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-general


I'v find out that this error occurs in:
dependency.c file

2004-04-26 11:09:34 ERROR: dependency.c 1621: cache lookup of relation
149064743 failed
2004-04-26 11:09:34 ERROR: Relation "tmp_table1" does not exist
2004-04-26 11:09:34 ERROR: Relation "tmp_table1" does not exist

in getRelationDescription(StringInfo buffer, Oid relid) function.

Any ideas what can cause this errors.

On Wed, 14 Apr 2004, Tom Lane wrote:

> Richard Huxton <dev(at)archonet(dot)com> writes:
> > The error is complaining about a relation that used to have an OID of
> > 165058647. This is almost certainly a temporary table. Are there any plpgsql
> > functions referring to a temporary table, or do you create, drop, recreate
> > tmp_table1 within the same connection?
>
> The trace suggests this is happening immediately after connect, so the
> last part doesn't sound right. It could be a plpgsql function problem,
> though the query doesn't look like it invokes any functions.
>
> What I was wondering about was dangling references within a view, that
> is a view referring to a table that no longer exists. (That shouldn't
> happen anymore in 7.3 and later, but if this is a pre-7.3 system then
> it's possible.) Does this query make use of any views, and if so do the
> views change from time to time?
>
> regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 5: Have you checked our extensive FAQ?
>
> http://www.postgresql.org/docs/faqs/FAQ.html
>


From: Sean Chittenden <sean(at)chittenden(dot)org>
To: Jan Wieck <JanWieck(at)Yahoo(dot)com>
Cc: PostgreSQL Bugs List <pgsql-bugs(at)postgresql(dot)org>, Juris Krumins <juriskr(at)komin(dot)lv>
Subject: Re: [GENERAL] cache lookup of relation 165058647 failed
Date: 2004-05-04 05:19:08
Message-ID: 91B18B5E-9D8A-11D8-8912-000A95C705DC@chittenden.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-general

> I'v find out that this error occurs in:
> dependency.c file
>
> 2004-04-26 11:09:34 ERROR: dependency.c 1621: cache lookup of relation
> 149064743 failed
> 2004-04-26 11:09:34 ERROR: Relation "tmp_table1" does not exist
> 2004-04-26 11:09:34 ERROR: Relation "tmp_table1" does not exist
>
> in getRelationDescription(StringInfo buffer, Oid relid) function.
>
> Any ideas what can cause this errors.

<aol>Me too.</aol>

But, I am suspecting that it's a race condition with the new background
writer code. I've started testing a new database design and was able
to reproduce this on my laptop nearly 90% of the time, but could only
reproduce it about 10% of the time on my production databases until I
figured out what the difference was, fsync.

fsync was causing enough of a slow down that SearchSysCache() was
finding the tuple, whereas with fsync = false, it wasn't able to find
it. But, in search of proving that it wasn't fsync (I use fsync =
false on my laptop to save my pour drive), I threw in a sleep in
between my tests, and I'm able to get things to work 100% of the time
by adding a sleep. The following fails to work with fsync = false, 90%
of the time and with fsync = true, only 10% of the time.

% psql -f test-begin.sql template1 && psql -f test_enterprise_class.sql
&& psql -f test-end1.sql template1 && psql -f test-end2.sql template1

But, if I change the command to:

% psql -f test-begin.sql template1 && psql -f test_enterprise_class.sql
&& psql -f test-end1.sql template1 && sleep 1 && psql -f test-end2.sql
template1

I have no problems with cache relation misses. As for what happens in
those commands, I'm:

-- 1) Dropping the test database and re-creating it
-- 2) In a different connection, load a rather large schema as the dba
-- 3) Connect again and create a temp table
-- 4) Connect a second time, and check to see if the temp table exists

The sleep comes at step 3.5 in the above sequence of operations.

*boom* Here's a snippet of my terminal (the first thing I do after
BEGINning a transaction is create a temp table if it doesn't exist):

## BEGIN ##
[snip]
[...]
COMMIT
You are now connected to database "test" as user "usr".
BEGIN
psql:test-end2.sql:3: ERROR: cache lookup failed for relation 398033
CONTEXT: SQL query "SELECT TRUE FROM pg_catalog.pg_class c LEFT JOIN
pg_catalog.pg_namespace n ON n.oid = c.relnamespace WHERE c.relname =
'tmptbl'::TEXT AND c.relkind = 'r'::TEXT AND
pg_catalog.pg_table_is_visible(c.oid)"
PL/pgSQL function "create_tmptbl" line 2 at perform
PL/pgSQL function "check_or_populate_func" line 8 at assignment
PL/pgSQL function "setuid_wrapper_func" line 5 at return
## END ##

What's really bothering me is I can push the up arrow on the console,
run the exact same thing (including dropping the database), and it'll
work sometimes. Very disturbing. As I said, I'm *very* suspicious of
the background writer goo that Jan added simply because I can't think
of anything else that'd have this problem.

I've run each of those commands 100 times now, with and without the
sleep 1. With the sleep 1, it's worked 100% of the time. Jan, any bit
of code that comes to mind?

All of my bgwriter_* settings are set to their default.

-sc

--
Sean Chittenden


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Sean Chittenden <sean(at)chittenden(dot)org>
Cc: Jan Wieck <JanWieck(at)Yahoo(dot)com>, PostgreSQL Bugs List <pgsql-bugs(at)postgresql(dot)org>, Juris Krumins <juriskr(at)komin(dot)lv>
Subject: Re: [GENERAL] cache lookup of relation 165058647 failed
Date: 2004-05-04 05:55:23
Message-ID: 465.1083650123@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-general

Sean Chittenden <sean(at)chittenden(dot)org> writes:
> But, I am suspecting that it's a race condition with the new background
> writer code.

Why? Have you demonstrated that the failure does not occur in 7.4?

> psql:test-end2.sql:3: ERROR: cache lookup failed for relation 398033
> CONTEXT: SQL query "SELECT TRUE FROM pg_catalog.pg_class c LEFT JOIN
> pg_catalog.pg_namespace n ON n.oid = c.relnamespace WHERE c.relname =
> 'tmptbl'::TEXT AND c.relkind = 'r'::TEXT AND
> pg_catalog.pg_table_is_visible(c.oid)"

I think that pg_table_is_visible() will inspect the catalogs using
SnapshotNow semantics, while the above query will feed it with OIDs that
were valid under a start-of-query snapshot. So I'd expect failures in
any recent PG version, if tables are being dynamically created/deleted
by concurrent transactions.

regards, tom lane


From: Sean Chittenden <sean(at)chittenden(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Jan Wieck <JanWieck(at)Yahoo(dot)com>, PostgreSQL Bugs List <pgsql-bugs(at)postgresql(dot)org>, Juris Krumins <juriskr(at)komin(dot)lv>
Subject: Re: [GENERAL] cache lookup of relation 165058647 failed
Date: 2004-05-04 07:06:51
Message-ID: 9DF5F484-9D99-11D8-8912-000A95C705DC@chittenden.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-general

[Renames thread from "The Pepsi Challenge" to "The PostgreSQL
Challenge"]

>> But, I am suspecting that it's a race condition with the new
>> background
>> writer code.
>
> Why? Have you demonstrated that the failure does not occur in 7.4?

What other operations have been added to HEAD that would allow for
successful operation of sequential use or testing of temp tables?

More importantly, yes, I can confirm that this behavior doesn't exist
in REL7_4 as of 40min ago. I've been running the exact same tests
repeatedly with nearly identical configs (as close as they can be given
the config changes) on REL7_4 as I have HEAD and only HEAD is giving me
problems.

>> psql:test-end2.sql:3: ERROR: cache lookup failed for relation 398033
>> CONTEXT: SQL query "SELECT TRUE FROM pg_catalog.pg_class c LEFT JOIN
>> pg_catalog.pg_namespace n ON n.oid = c.relnamespace WHERE c.relname =
>> 'tmptbl'::TEXT AND c.relkind = 'r'::TEXT AND
>> pg_catalog.pg_table_is_visible(c.oid)"
>
> I think that pg_table_is_visible() will inspect the catalogs using
> SnapshotNow semantics, while the above query will feed it with OIDs
> that
> were valid under a start-of-query snapshot. So I'd expect failures in
> any recent PG version, if tables are being dynamically created/deleted
> by concurrent transactions.

There is no concurrency in the test I gave: it's all sequential.

0 | -- connect
1 | BEGIN;
2 | SELECT setuid_wrapper();
3 | -- Do other things;
4 | COMMIT;
5 | -- disconnect
6 | -- connect again
7 | BEGIN;
8 | SELECT setuid_wrapper(); -- This fails some % of the time

If I add step 5.5 that is a sleep, step 8 will work, without fail. In
7.4, I have no problems, however. That said, the bgwriter code is the
only think that I can think of that would muck with caching. -sc

PS Other comments temp schema permission patch?

--
Sean Chittenden


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Sean Chittenden <sean(at)chittenden(dot)org>
Cc: PostgreSQL Bugs List <pgsql-bugs(at)postgresql(dot)org>, Juris Krumins <juriskr(at)komin(dot)lv>
Subject: Re: [GENERAL] cache lookup of relation 165058647 failed
Date: 2004-05-05 16:32:39
Message-ID: 40991727.3060405@Yahoo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-general

Sean Chittenden wrote:
>> I'v find out that this error occurs in:
>> dependency.c file
>>
>> 2004-04-26 11:09:34 ERROR: dependency.c 1621: cache lookup of relation
>> 149064743 failed
>> 2004-04-26 11:09:34 ERROR: Relation "tmp_table1" does not exist
>> 2004-04-26 11:09:34 ERROR: Relation "tmp_table1" does not exist
>>
>> in getRelationDescription(StringInfo buffer, Oid relid) function.
>>
>> Any ideas what can cause this errors.
>
> <aol>Me too.</aol>
>
> But, I am suspecting that it's a race condition with the new background
> writer code. I've started testing a new database design and was able
> to reproduce this on my laptop nearly 90% of the time, but could only
> reproduce it about 10% of the time on my production databases until I
> figured out what the difference was, fsync.

temp tables don't use the shared buffer cache, how can this be related
to the BG writer?

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck(at)Yahoo(dot)com #


From: Sean Chittenden <sean(at)chittenden(dot)org>
To: Jan Wieck <JanWieck(at)Yahoo(dot)com>
Cc: PostgreSQL Bugs List <pgsql-bugs(at)postgresql(dot)org>, Juris Krumins <juriskr(at)komin(dot)lv>
Subject: Re: [GENERAL] cache lookup of relation 165058647 failed
Date: 2004-05-05 20:40:54
Message-ID: 816FE1CE-9ED4-11D8-B669-000A95C705DC@chittenden.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-general

>>> I'v find out that this error occurs in:
>>> dependency.c file
>>>
>>> 2004-04-26 11:09:34 ERROR: dependency.c 1621: cache lookup of
>>> relation
>>> 149064743 failed
>>> 2004-04-26 11:09:34 ERROR: Relation "tmp_table1" does not exist
>>> 2004-04-26 11:09:34 ERROR: Relation "tmp_table1" does not exist
>>>
>>> in getRelationDescription(StringInfo buffer, Oid relid) function.
>>>
>>> Any ideas what can cause this errors.
>> <aol>Me too.</aol>
>> But, I am suspecting that it's a race condition with the new
>> background writer code. I've started testing a new database design
>> and was able to reproduce this on my laptop nearly 90% of the time,
>> but could only reproduce it about 10% of the time on my production
>> databases until I figured out what the difference was, fsync.
>
> temp tables don't use the shared buffer cache, how can this be related
> to the BG writer?

Don't the system catalogs use the shared buffer cache?

BEGIN;
SELECT create_temp_table_func(); -- Inserts a row into pg_class via
CREATE TEMP TABLE
-- Do other stuff
COMMIT; -- After the commit, the row is now visible to other
backends
-- disconnect -- If the delay between the disconnect and reconnect is
small enough
-- reconnect -- It's as though there is a race condition that allows
the function
-- pg_table_is_visible() to assert the "cache lookup of relation"
-- error.
BEGIN;
SELECT create_temp_table_func(); -- Before the CREATE TEMP TABLE, I
call
/* SELECT TRUE FROM pg_catalog.pg_class c
LEFT JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace
WHERE c.relname = ''footmp''::TEXT AND
c.relkind = ''r''::TEXT AND
pg_catalog.pg_table_is_visible(c.oid); */
-- But the query fails

My guess was that the series of events went something like:

proc 0) COMMIT's and the row in pg_class is committed
proc 1) bgwriter writer code removes a page for the cache
proc 2) queries for the page [*]
proc 1) writes it to disk
proc 2) queries for the page [*]
proc 1) sync's the fd

[*] proc 2 queries for the page at either of these points

In 7.4, there is no bgwriter or background process mucking with cache,
which is why this works 100% of the time. In 7.5, however, there's a
200ms gap where a race condition appears and pg_table_is_visible()
fails its PointerIsValid() check. If I put a sleep in, the sleep gives
the bgwriter enough time to commit the pages to disk so that the
queries for the page happen after the fd's been sync()'ed.

I have no other clue as to why this would be happening though, so
believe me when I say, I could very well be quite wrong.... but this is
my best, quasi-educated/grep(1)'ed guess.

-sc

--
Sean Chittenden


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Sean Chittenden <sean(at)chittenden(dot)org>
Cc: PostgreSQL Bugs List <pgsql-bugs(at)postgresql(dot)org>, Juris Krumins <juriskr(at)komin(dot)lv>
Subject: Re: [GENERAL] cache lookup of relation 165058647 failed
Date: 2004-05-06 03:30:11
Message-ID: 4099B143.8020006@Yahoo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-general

Sean Chittenden wrote:

>>>> I'v find out that this error occurs in:
>>>> dependency.c file
>>>>
>>>> 2004-04-26 11:09:34 ERROR: dependency.c 1621: cache lookup of
>>>> relation
>>>> 149064743 failed
>>>> 2004-04-26 11:09:34 ERROR: Relation "tmp_table1" does not exist
>>>> 2004-04-26 11:09:34 ERROR: Relation "tmp_table1" does not exist
>>>>
>>>> in getRelationDescription(StringInfo buffer, Oid relid) function.
>>>>
>>>> Any ideas what can cause this errors.
>>> <aol>Me too.</aol>
>>> But, I am suspecting that it's a race condition with the new
>>> background writer code. I've started testing a new database design
>>> and was able to reproduce this on my laptop nearly 90% of the time,
>>> but could only reproduce it about 10% of the time on my production
>>> databases until I figured out what the difference was, fsync.
>>
>> temp tables don't use the shared buffer cache, how can this be related
>> to the BG writer?
>
> Don't the system catalogs use the shared buffer cache?
>
> BEGIN;
> SELECT create_temp_table_func(); -- Inserts a row into pg_class via
> CREATE TEMP TABLE
> -- Do other stuff
> COMMIT; -- After the commit, the row is now visible to other
> backends
> -- disconnect -- If the delay between the disconnect and reconnect is
> small enough
> -- reconnect -- It's as though there is a race condition that allows
> the function
> -- pg_table_is_visible() to assert the "cache lookup of relation"
> -- error.
> BEGIN;
> SELECT create_temp_table_func(); -- Before the CREATE TEMP TABLE, I
> call
> /* SELECT TRUE FROM pg_catalog.pg_class c
> LEFT JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace
> WHERE c.relname = ''footmp''::TEXT AND
> c.relkind = ''r''::TEXT AND
> pg_catalog.pg_table_is_visible(c.oid); */
> -- But the query fails
>
> My guess was that the series of events went something like:
>
> proc 0) COMMIT's and the row in pg_class is committed
> proc 1) bgwriter writer code removes a page for the cache
> proc 2) queries for the page [*]
> proc 1) writes it to disk
> proc 2) queries for the page [*]
> proc 1) sync's the fd
>
> [*] proc 2 queries for the page at either of these points
>
> In 7.4, there is no bgwriter or background process mucking with cache,

Except for the checkpoint process, which does exactly the same as the
bgwriter does, and ALL concurrent backends whenever they feel the need
to evict a dirty buffer.

If it makes a difference if a pg_class page is dirty in the buffer or
copied out to disk with respect to visibility rules of the tuples
contained in it, then the whole thing is a way larger bug than the one
in MIB. First of all, committed or not, a temp object from one session
should NEVER be visible in any other.

Jan

> which is why this works 100% of the time. In 7.5, however, there's a
> 200ms gap where a race condition appears and pg_table_is_visible()
> fails its PointerIsValid() check. If I put a sleep in, the sleep gives
> the bgwriter enough time to commit the pages to disk so that the
> queries for the page happen after the fd's been sync()'ed.
>
> I have no other clue as to why this would be happening though, so
> believe me when I say, I could very well be quite wrong.... but this is
> my best, quasi-educated/grep(1)'ed guess.
>
> -sc
>

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck(at)Yahoo(dot)com #


From: Sean Chittenden <sean(at)chittenden(dot)org>
To: Jan Wieck <JanWieck(at)Yahoo(dot)com>
Cc: PostgreSQL Bugs List <pgsql-bugs(at)postgresql(dot)org>, Juris Krumins <juriskr(at)komin(dot)lv>
Subject: Re: [GENERAL] cache lookup of relation 165058647 failed
Date: 2004-05-06 05:50:50
Message-ID: 5455CBEC-9F21-11D8-A24D-000A95C705DC@chittenden.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-general

>>> temp tables don't use the shared buffer cache, how can this be
>>> related to the BG writer?
>> Don't the system catalogs use the shared buffer cache?
>> BEGIN;
>> SELECT create_temp_table_func(); -- Inserts a row into pg_class via
>> CREATE TEMP TABLE
>> -- Do other stuff
>> COMMIT; -- After the commit, the row is now visible to other
>> backends
>> -- disconnect -- If the delay between the disconnect and reconnect
>> is small enough
>> -- reconnect -- It's as though there is a race condition that allows
>> the function
>> -- pg_table_is_visible() to assert the "cache lookup of relation"
>> -- error.
>> BEGIN;
>> SELECT create_temp_table_func(); -- Before the CREATE TEMP TABLE, I
>> call
>> /* SELECT TRUE FROM pg_catalog.pg_class c
>> LEFT JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace
>> WHERE c.relname = ''footmp''::TEXT AND
>> c.relkind = ''r''::TEXT AND
>> pg_catalog.pg_table_is_visible(c.oid); */
>> -- But the query fails
>> My guess was that the series of events went something like:
>> proc 0) COMMIT's and the row in pg_class is committed
>> proc 1) bgwriter writer code removes a page for the cache
>> proc 2) queries for the page [*]
>> proc 1) writes it to disk
>> proc 2) queries for the page [*]
>> proc 1) sync's the fd
>> [*] proc 2 queries for the page at either of these points
>> In 7.4, there is no bgwriter or background process mucking with cache,
>
> Except for the checkpoint process, which does exactly the same as the
> bgwriter does, and ALL concurrent backends whenever they feel the need
> to evict a dirty buffer.

Hrm... well, haven't the slightest idea what would be causing this
then. About all I can say is that some problem does exist in HEAD that
doesn't exist in REL7_4 that I'm able to tickle via temp tables. :-/
Because this is time sensitive, what debugging foo could I insert to
get some useful diagnostic output?

> If it makes a difference if a pg_class page is dirty in the buffer or
> copied out to disk with respect to visibility rules of the tuples
> contained in it, then the whole thing is a way larger bug than the one
> in MIB. First of all, committed or not, a temp object from one session
> should NEVER be visible in any other.

Hrm... well, I'm going to take my test scripts and reduce them down to
a test case. For sure, there's something different in HEAD that's
causing problems that are time sensitive. I've even thought about
grabbing my camera and making a low res 320x200 movie of the test
sequence. I went and ran script(1) on one of the runs for the sake of
something.

http://sean.chittenden.org/postgresql/pgsql-create-temp-bug-
typescript.txt

Any help or assistance is greatly appreciated, I'm not sure where to go
from here. -sc

--
Sean Chittenden