Cache lookup failure for index during pg_dump

From: Bob Lunney <bob_lunney(at)yahoo(dot)com>
To: pgsql-bugs(at)postgresql(dot)org
Subject: Cache lookup failure for index during pg_dump
Date: 2010-02-19 21:13:28
Message-ID: 497351.17667.qm@web39708.mail.mud.yahoo.com
Lists: pgsql-bugs

I have a job that summarizes some data from a table, truncates the table, drops the indexes, writes the summarized data back into the table, then recreates the indexes. The operations above are not in a single transaction, but separate statements executed by a script. Easy, runs great, has for years.
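
In outline, the script runs something like this (table and index names
changed for illustration), each statement issued separately:

    -- summarize the current contents
    CREATE TEMP TABLE summary AS
        SELECT id, sum(amount) AS amount
        FROM stats
        GROUP BY id;
    -- empty the table and drop its indexes
    TRUNCATE TABLE stats;
    DROP INDEX stats_id_idx;
    -- write the summarized data back
    INSERT INTO stats SELECT id, amount FROM summary;
    -- recreate the indexes
    CREATE INDEX stats_id_idx ON stats (id);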

Recently the job takes a little longer to run and is still going when the database dump starts. That's when I started getting this:

ERROR: cache lookup failed for index 70424
STATEMENT: SELECT t.tableoid, t.oid, t.relname AS indexname,
        pg_catalog.pg_get_indexdef(i.indexrelid) AS indexdef,
        t.relnatts AS indnkeys, i.indkey, i.indisclustered,
        c.contype, c.conname, c.tableoid AS contableoid, c.oid AS conoid,
        (SELECT spcname FROM pg_catalog.pg_tablespace s
         WHERE s.oid = t.reltablespace) AS tablespace,
        array_to_string(t.reloptions, ', ') AS options
    FROM pg_catalog.pg_index i
    JOIN pg_catalog.pg_class t ON (t.oid = i.indexrelid)
    LEFT JOIN pg_catalog.pg_depend d
        ON (d.classid = t.tableoid AND d.objid = t.oid AND d.deptype = 'i')
    LEFT JOIN pg_catalog.pg_constraint c
        ON (d.refclassid = c.tableoid AND d.refobjid = c.oid)
    WHERE i.indrelid = '56448'::pg_catalog.oid
    ORDER BY indexname

The OID value changes daily, of course. pg_dump then disconnects and stops. I checked the source code, and the query is definitely coming from pg_dump. When I run the dump again a few minutes later it works beautifully, so I don't think the system catalogs are corrupt.

My questions are: can making DDL changes during a dump cause this error? Are the queries used by pg_dump transactionally consistent, i.e. do they run in a transaction and get a single view of the database system catalogs? Other than finer coordination of jobs, how can this situation be avoided?

I'm running PG 8.4.1 on Linux.

Thanks in advance for your responses.

Bob Lunney


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Bob Lunney <bob_lunney(at)yahoo(dot)com>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: Cache lookup failure for index during pg_dump
Date: 2010-02-20 00:39:51
Message-ID: 14781.1266626391@sss.pgh.pa.us
Lists: pgsql-bugs

Bob Lunney <bob_lunney(at)yahoo(dot)com> writes:
> I have a job that summarizes some data from a table, truncates the table, drops the indexes, writes the summarized data back into the table, then recreates the indexes. The operations above are not in a single transaction, but separate statements executed by a script. Easy, runs great, has for years.
> Recently the job takes a little longer to run and is still going when the database dump starts. That's when I started getting this:

> ERROR: cache lookup failed for index 70424

> My questions are: can making DDL changes during a dump cause this error? Are the queries used by pg_dump transactionally consistent, i.e. do they run in a transaction and get a single view of the database system catalogs? Other than finer coordination of jobs, how can this situation be avoided?

It's a bit messy. pg_dump runs in a serializable transaction, so it
sees a consistent snapshot of the database including system catalogs.
However, it relies in part on various specialized backend functions like
pg_get_indexdef(), and those things tend to run on SnapshotNow time, i.e.
they look at the currently committed state. So it is possible to get
this type of error if someone performs DDL changes while a dump is
happening: pg_dump sees index 70424 still listed in the catalogs,
so it asks about it, and the backend says "there is no such index",
which there isn't anymore because somebody dropped it since pg_dump's
transaction started.
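
A minimal two-session illustration of the race (untested sketch, with
invented table and index names):

    -- session 1: essentially what pg_dump does
    BEGIN ISOLATION LEVEL SERIALIZABLE;
    SELECT indexrelid FROM pg_index
        WHERE indrelid = 'stats'::regclass;  -- snapshot taken here; still sees the index

    -- session 2: concurrent DDL, commits immediately
    DROP INDEX stats_id_idx;

    -- session 1: deparse it, using the OID returned above
    SELECT pg_get_indexdef(70424);  -- ERROR: cache lookup failed for index 70424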

The window for this sort of thing isn't very large, because the first
thing pg_dump does is acquire AccessShareLock on every table it intends
to dump, and past that point it won't be possible for anyone to modify
the table's DDL. But it can happen.
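
For instance (invented names again):

    -- session 1: pg_dump already holds its table lock
    BEGIN ISOLATION LEVEL SERIALIZABLE;
    LOCK TABLE stats IN ACCESS SHARE MODE;

    -- session 2: DROP INDEX needs AccessExclusiveLock on the table,
    -- so it simply waits until session 1 commits or aborts
    DROP INDEX stats_id_idx;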

The right fix for this is to make all those inquiry functions use the
calling query's snapshot; but duplicating a lot of backend
infrastructure is going to be a major pain in the rear, so the
discussion has kind of petered out every time it's come up in the past.

In practice, because pg_dump does lock out DDL changes for the bulk of
its run, it's not a great idea to be scheduling DDL-changing jobs during
your dumps anyhow. Most of the time they'll just get blocked till the
dump finishes, and if they touch more than one table it's not at all
unlikely for them to end up deadlocked against pg_dump's locks. A fix
for the snapshot-timing problem wouldn't do a thing for that problem.
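
The classic shape of such a deadlock, sketched with explicit LOCK TABLE
commands:

    -- pg_dump-like session, still collecting its initial locks
    BEGIN;
    LOCK TABLE a IN ACCESS SHARE MODE;

    -- maintenance job, taking the tables in the opposite order
    BEGIN;
    LOCK TABLE b IN ACCESS EXCLUSIVE MODE;
    LOCK TABLE a IN ACCESS EXCLUSIVE MODE;  -- waits behind the dump's lock

    -- pg_dump-like session again
    LOCK TABLE b IN ACCESS SHARE MODE;  -- deadlock detected, one side aborts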

So in short, the path of least resistance is to reschedule your dumps.
Or reconsider whether you really need to drop and recreate those indexes
--- could you use REINDEX instead?
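
That is, something like (assuming the table is called stats):

    REINDEX TABLE stats;  -- rebuilds all of stats' indexes in place,
                          -- keeping the same pg_class entries and OIDs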

regards, tom lane


From: Euler Taveira de Oliveira <euler(at)timbira(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Bob Lunney <bob_lunney(at)yahoo(dot)com>, pgsql-bugs(at)postgresql(dot)org
Subject: Re: Cache lookup failure for index during pg_dump
Date: 2010-02-20 01:50:35
Message-ID: 4B7F3FEB.3070700@timbira.com
Lists: pgsql-bugs

Tom Lane escreveu:
> The window for this sort of thing isn't very large, because the first
> thing pg_dump does is acquire AccessShareLock on every table it intends
> to dump, and past that point it won't be possible for anyone to modify
> the table's DDL. But it can happen.
>
I did not see it documented anywhere. Should we at least add a comment at the
top of pg_dump documenting this behavior? Attached is a proposed patch using
your own words.

--
Euler Taveira de Oliveira
http://www.timbira.com/

Attachment: cmt.diff (text/x-patch, 1.0 KB)

From: Bob Lunney <bob_lunney(at)yahoo(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: Cache lookup failure for index during pg_dump
Date: 2010-02-20 20:10:39
Message-ID: 724775.65373.qm@web39704.mail.mud.yahoo.com
Lists: pgsql-bugs

Tom,

Thanks for the detailed explanation - I appreciate it. If I understand correctly, I should be able to change the job so it does not drop and recreate the indexes, and the dump should then run without the cache lookup error. I'll take a hit in performance reloading the table, but I like having good database dumps more than speed at this point.

Could pg_dump also take an AccessShare lock on the system tables to prevent DDL changes during the dump, thereby preventing this error? Just a thought...

Bob Lunney



From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Bob Lunney <bob_lunney(at)yahoo(dot)com>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: Cache lookup failure for index during pg_dump
Date: 2010-02-20 20:47:00
Message-ID: 20060.1266698820@sss.pgh.pa.us
Lists: pgsql-bugs

Bob Lunney <bob_lunney(at)yahoo(dot)com> writes:
> Could pg_dump also take an AccessShared lock on the system tables to prevent DDL changes during the dump, thereby preventing this error? Just a thought...

I think that cure would probably be worse than the disease ...

regards, tom lane


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Euler Taveira de Oliveira <euler(at)timbira(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Bob Lunney <bob_lunney(at)yahoo(dot)com>, pgsql-bugs(at)postgresql(dot)org
Subject: Re: Cache lookup failure for index during pg_dump
Date: 2010-03-03 23:37:51
Message-ID: 201003032337.o23NbpH23410@momjian.us
Lists: pgsql-bugs

Euler Taveira de Oliveira wrote:
> Tom Lane escreveu:
> > The window for this sort of thing isn't very large, because the first
> > thing pg_dump does is acquire AccessShareLock on every table it intends
> > to dump, and past that point it won't be possible for anyone to modify
> > the table's DDL. But it can happen.
> >
> I did not see it documented anywhere. Should we at least add a comment at the
> top of pg_dump documenting this behavior? Attached is a proposed patch using
> your own words.

Applied, thanks. I also added the URL of the discussion.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

PG East: http://www.enterprisedb.com/community/nav-pg-east-2010.do


From: Boszormenyi Zoltan <zb(at)cybertec(dot)at>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Bob Lunney <bob_lunney(at)yahoo(dot)com>, pgsql-bugs(at)postgresql(dot)org, Hans-Juergen Schoenig <hs(at)cybertec(dot)at>
Subject: Re: Cache lookup failure for index during pg_dump
Date: 2010-05-27 11:34:40
Message-ID: 4BFE58D0.2020307@cybertec.at
Lists: pgsql-bugs

Tom Lane wrote:
> It's a bit messy. pg_dump runs in a serializable transaction, so it
> sees a consistent snapshot of the database including system catalogs.
> However, it relies in part on various specialized backend functions like
> pg_get_indexdef(), and those things tend to run on SnapshotNow time, i.e.
> they look at the currently committed state. So it is possible to get
> this type of error if someone performs DDL changes while a dump is
> happening: pg_dump sees index 70424 still listed in the catalogs,
> so it asks about it, and the backend says "there is no such index",
> which there isn't anymore because somebody dropped it since pg_dump's
> transaction started.

I looked at ruleutils.c and now I am a little confused: different
functions use different subsystems to get their info. E.g.:

syscache is used by e.g.:
pg_get_indexdef
pg_get_constraintdef
pg_get_userbyid
pg_get_functiondef

direct systable_beginscan(SnapshotNow) is used by:
pg_get_triggerdef
pg_get_serial_sequence

SPI is used by:
pg_get_ruledef
pg_get_viewdef

SPI uses the same snapshot as the calling query, right? So a
serializable transaction gets consistent info from those functions.

The syscache doesn't contain old info, only the latest committed
state; this is where the "cache lookup failed for index" problem
comes from, but other objects, e.g. dropped functions or usernames,
can produce a similar error message.

Looking at tqual.c and snapmgr.c, using
systable_beginscan(GetTransactionSnapshot())
instead of SearchSysCache() would solve the problem for
pg_dump. Why would we have to duplicate these functions?
It seems (at least for pg_get_ruledef and pg_get_indexdef)
that these functions are only used by pg_dump and psql's
backslash commands. We would lose a little performance
by not using the syscache, but gain a lot of correctness.
Am I missing something?
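
If that reading is right, the difference should be directly
observable; an untested sketch (object names invented):

    -- in a serializable transaction whose snapshot predates a
    -- concurrent CREATE OR REPLACE VIEW and DROP INDEX:
    SELECT pg_get_viewdef('my_view'::regclass);  -- SPI: still shows the old definition
    SELECT pg_get_indexdef(70424);               -- syscache: cache lookup failed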

Best regards,
Zoltán Böszörményi

--
Bible has answers for everything. Proof:
"But let your communication be, Yea, yea; Nay, nay: for whatsoever is more
than these cometh of evil." (Matthew 5:37) - basics of digital technology.
"May your kingdom come" - superficial description of plate tectonics

----------------------------------
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
http://www.postgresql.at/


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Boszormenyi Zoltan <zb(at)cybertec(dot)at>
Cc: Bob Lunney <bob_lunney(at)yahoo(dot)com>, pgsql-bugs(at)postgresql(dot)org, Hans-Juergen Schoenig <hs(at)cybertec(dot)at>
Subject: Re: Cache lookup failure for index during pg_dump
Date: 2010-05-27 13:50:08
Message-ID: 14175.1274968208@sss.pgh.pa.us
Lists: pgsql-bugs

Boszormenyi Zoltan <zb(at)cybertec(dot)at> writes:
> I looked at ruleutils.c and now I am a little confused.
> I mean different functions use different subsystems to
> get their info. E.g.:

Yup.

> Am I missing something?

Yes. That code is hopelessly intertwined with other operations inside
the backend that need to run on SnapshotNow time. Disentangling it is
not a simple project.

regards, tom lane