Non-transactional pg_class, try 2

Lists: pgsql-hackerspgsql-patches
From: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
To: Patches <pgsql-patches(at)postgresql(dot)org>
Subject: Non-transactional pg_class, try 2
Date: 2006-06-11 21:53:19
Message-ID: 20060611215318.GA15181@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Here I repost the patch to implement non-transactional catalogs, the
first of which is pg_ntclass, intended to hold the non-transactional
info about pg_class (reltuples, relpages).

pg_ntclass is a relation of a new relkind, RELKIND_NON_TRANSACTIONAL
(ideas for shorter names welcome). In pg_class, we store a TID to the
corresponding tuple. The tuples are not cached; they are obtained by
heap_fetch() each time they are requested. This may be worth
reconsideration.

heap_update refuses to operate on a non-transactional catalog, because
there's no (easy) way to update pg_class accordingly. This normally
shouldn't be a problem. vac_update_relstats updates the tuple by using
the new heap_inplace_update call.

VACUUM FULL also refuses to operate on these tables, and ANALYZE
silently skips them. Only plain VACUUM cleans them.

Note that you can DELETE from pg_ntclass. Not sure if we should
disallow it somehow, because it's not easy to get out from that if you
do. (But it's possible -- just insert enough tuples until you reach the
needed TID, and then delete the ones that are not pointed by any
pg_class row).

Regression test pass; I updated the stats test because it was accessing
pg_class.relpages. So there's already a test to verify that it's
working.

There is one caveat that I'm worried about. I had to add a "typedef" to
pg_class.h to put ItemPointerData in FormData_pg_class, because the C
struct doesn't recognize the "tid" type, but the bootstrap type system
does not recognize ItemPointerData as a valid type. I find this mighty
ugly because it will have side effects whenever we #include pg_class.h
(which is virtually anywhere, since that header is #included in htup.h
which in turn is included almost everywhere). Suggestions welcome.
Maybe this is not a problem.

Other two caveats are:
1. During bootstrap, RelationBuildLocalRelation creates nailed relations
with hardcoded TID=(0,1). This is because we don't have access to
pg_class yet, so we can't find the real pointer; and furthermore, we are
going to fix the entries later in the bootstrapping process.

2. The whole VACUUM/VACUUM FULL/ANALYZE relation list stuff is pretty
ugly as well; and autovacuum is skipping pg_ntclass (really all
non-transactional catalogs) altogether. We could improve the situation
by introducing some sort of struct like {relid, relkind}, so that
vacuum_rel could know what relkind to expect, and it could skip
non-transactional catalogs cleanly in vacuum full and analyze.

I intend to apply this patch by tuesday or wednesday, unless an
objection is raised prior to that.

Attachment Content-Type Size
fixclass-3.patch text/plain 74.8 KB

From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Non-transactional pg_class, try 2
Date: 2006-06-11 21:58:03
Message-ID: 20060611215803.GA15211@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Alvaro Herrera wrote:
> Here I repost the patch to implement non-transactional catalogs, the
> first of which is pg_ntclass, intended to hold the non-transactional
> info about pg_class (reltuples, relpages).

I forgot to attach the new file pg_ntclass.h (src/include/catalog).
Here it is.

Attachment Content-Type Size
pg_ntclass.h text/x-chdr 2.0 KB

From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Patches <pgsql-patches(at)postgresql(dot)org>
Subject: The corresponding relminxid patch; try 1
Date: 2006-06-11 22:20:22
Message-ID: 20060611222022.GB15211@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Hi,

This is the relminxid patch corresponding to the pg_ntclass patch I just
posted. Obviously, the relminxid and relvacuumxid fields are in
pg_ntclass (not pg_class like in the previous incarnations of this
patch). This makes the whole business much saner and now I don't need
to insert bogus CommandCounterIncrement calls. Regression tests pass.

The thing that bothers me most about this is that it turns LockRelation
into an operation that needs to heap_fetch() from pg_ntclass in some
cases, and possibly update it. I think we should consider some sort of
"non-transactional shared cache" for storing RELKIND_NON_TRANSACTIONAL
catalog entries. Eventually it may help the sequences stuff as well, if
we implement sequences using that kind of catalog.

The documentation changes may be a bit off in this patch, since I didn't
worry about merging it with the pg_ntclass patch. But it's easy to
correct and I'll do it before committing it.

My intention is to wait two or three days after committing the
pg_ntclass patch, and then commit this one, unless I hear objections
before that.

Attachment Content-Type Size
relminxid-ntclass-1.patch text/plain 93.8 KB

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Non-transactional pg_class, try 2
Date: 2006-06-12 00:03:22
Message-ID: 28527.1150070602@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> I forgot to attach the new file pg_ntclass.h (src/include/catalog).
> Here it is.

Couple thoughts about this:

* I still suggest calling it pg_class_nt not pg_ntclass; that naming
convention seems like it will scale better if there are more
nontransactional "appendage" relations. I'm surprised you didn't
already need to invent pg_database_nt, for instance ... don't
datvacuumxid and datfrozenxid need to be nontransactional?

* The DATA() entries for the bootstrapped relations ought to be
commented as to which rels they belong to (corresponding to the
hardwired TIDs in pg-class.h):

DATA(insert ( 0 0 )); /* pg_type */
etc

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Cc: Patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Non-transactional pg_class, try 2
Date: 2006-06-12 00:27:23
Message-ID: 28617.1150072043@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> writes:
> VACUUM FULL also refuses to operate on these tables, and ANALYZE
> silently skips them. Only plain VACUUM cleans them.

I wonder whether VACUUM FULL applied to an NT table shouldn't just act
like plain VACUUM instead. Probably not very important though.

> Note that you can DELETE from pg_ntclass. Not sure if we should
> disallow it somehow, because it's not easy to get out from that if you
> do.

No worse than DELETE FROM pg_class ;-). Please verify that the
aclcheck prohibitions on changing system catalogs are properly applied
to these catalogs too.

> There is one caveat that I'm worried about. I had to add a "typedef" to
> pg_class.h to put ItemPointerData in FormData_pg_class, because the C
> struct doesn't recognize the "tid" type, but the bootstrap type system
> does not recognize ItemPointerData as a valid type. I find this mighty
> ugly because it will have side effects whenever we #include pg_class.h
> (which is virtually anywhere, since that header is #included in htup.h
> which in turn is included almost everywhere). Suggestions welcome.
> Maybe this is not a problem.

Would it work to do

#define tid ItemPointerData
...
tid relntrans;
...
#undef tid
?

I'm not sure whether this will cause the right things to appear in the
.bki file, but if it does then the #undef would limit the scope of the
"tid" name.

In any case, the only thing uglier than a hack is an uncommented hack
;-) ... the typedef or macro needs to have a comments explaining what
and why.

The *real* problem with what you've done is that pg_class.h now depends
on having ItemPointerData defined before it's included, which creates an
inclusion ordering dependency that should not exist. If you stick with
either of these approaches then pg_class.h needs to #include whatever
defines ItemPointerData.

I notice that postgres.h defines a typedef for aclitem to work around a
similar issue. Is it reasonable to move ItemPointerData into postgres.h
so we could put the tid typedef beside the aclitem one?

> Other two caveats are:
> 1. During bootstrap, RelationBuildLocalRelation creates nailed relations
> with hardcoded TID=(0,1). This is because we don't have access to
> pg_class yet, so we can't find the real pointer; and furthermore, we are
> going to fix the entries later in the bootstrapping process.

This seems dangerous; can't you set it to InvalidItemPointer instead?
If it's not used before fixed, this doesn't matter, and if someone
*does* try to use it, that will catch the problem.

> 2. The whole VACUUM/VACUUM FULL/ANALYZE relation list stuff is pretty
> ugly as well; and autovacuum is skipping pg_ntclass (really all
> non-transactional catalogs) altogether. We could improve the situation
> by introducing some sort of struct like {relid, relkind}, so that
> vacuum_rel could know what relkind to expect, and it could skip
> non-transactional catalogs cleanly in vacuum full and analyze.

Need to do something about this. pg_ntclass will contain XIDs (of
inserting/deleting transactions) so it MUST get vacuumed to be sure
we don't expose ourselves to XID wrap problems.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: The corresponding relminxid patch; try 1
Date: 2006-06-12 00:46:27
Message-ID: 28705.1150073187@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> This is the relminxid patch corresponding to the pg_ntclass patch I just
> posted.

That disable_heap_unfreeze thing seriously sucks. How bad are the API
changes needed to pass that as a parameter instead of having a
very-dangerous global variable?

The comment at line 328ff in dbcommands.c seems misguided, which makes
me doubt the code too. datfrozenxid and datvacuumxid should be
considered as indicating what XIDs appear inside the database, not what
is in its pg_database row.

The changes in vacuum.c are far too extensive to review meaningfully.
What did you do, and did it really need to touch so much code?

> The thing that bothers me most about this is that it turns LockRelation
> into an operation that needs to heap_fetch() from pg_ntclass in some
> cases, and possibly update it.

Have you done any profiling to see what that actually costs?

I think we could possibly dodge the work in the normal case if we are
willing to make VACUUM FREEZE take ExclusiveLock and send out a relation
cache inval on the relation. Then, we can cache the pg_ntclass tuple in
relcache entries (discarding it on cache inval), and if the cached value
says it's not frozen then it's not frozen. You couldn't trust the
cached value much further than that, but that would at least take the
fetch out of the normal path in LockRelation. The trick here is the
problem that if VACUUM FREEZE fails after modifying pg_ntclass, its
relcache inval won't be sent out.

A bigger issue here is that I'm not sure what the locking protocol is
for pg_ntclass itself. It looks like you're not consistently taking
a RowExclusiveLock when you update it.

BTW, I think you have the order of operations wrong in LockRelation;
should it not unfreeze only *after* obtaining lock? Consider race
condition against relation drop for instance.

regards, tom lane


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: The corresponding relminxid patch; try 1
Date: 2006-06-12 01:16:12
Message-ID: 20060612011612.GA25809@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Tom Lane wrote:
> Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> > This is the relminxid patch corresponding to the pg_ntclass patch I just
> > posted.
>
> That disable_heap_unfreeze thing seriously sucks. How bad are the API
> changes needed to pass that as a parameter instead of having a
> very-dangerous global variable?

Let's see -- I would need to fix all callers of LockRelation, and the
problem I found in an earlier version of the patch (before the invention
of the non-transaction stuff) was that some callers needed to pass that
information several levels down. It's possible that this was an
artifact of the fact that it was using the relcache. I'll experiment
with changing stuff so that the global variable is not needed.

> The comment at line 328ff in dbcommands.c seems misguided, which makes
> me doubt the code too. datfrozenxid and datvacuumxid should be
> considered as indicating what XIDs appear inside the database, not what
> is in its pg_database row.

No, actually it's correct. The point of that comment is that if the
source database is frozen, then all XIDs appearing inside both databases
(source and newly created) are frozen. So it's possible that the row in
pg_database is frozen as well. But because we are creating a new row in
pg_database, it's not really frozen any longer; so we change the
pg_database fields in the new row to match.

Actually, pg_database is going to be unfrozen right after that code,
because it's opened with RowExclusiveLock shortly after, precisely to
insert that new row we are inserting. So maybe this is not an issue.

> The changes in vacuum.c are far too extensive to review meaningfully.
> What did you do, and did it really need to touch so much code?

Yeah, they are extensive. I did several things there: get rid of a
couple of global variables that no longer need to be global; remove the
return value from vacuum_rel, since it's no longer needed (it's used to
determine whether we can truncate pg_clog, but now we can do it
regardless of whether this particular vacuuming took place or not); I
changed some variables from the old "frozenXid" name to "minXid"; I put
in a hack to make VACUUM FREEZE take a stronger lock; changed the API of
vacuum_rel so that instead of taking a specific acceptable relkind, it
receives whether TOAST is acceptable or not; and added the code needed
to keep track of the earliest Xid found in code. But by far the most
extensive change is the melding of vac_update_dbstats into
vac_update_dbminxid, and the update of vac_update_relstats to cope with
pg_ntclass.

Maybe I should take a stab at making incremental patches instead of
doing everything in one patch. This way it would be easier to review
for correctness (and I'd be more confident that it is actually correct
as well).

> > The thing that bothers me most about this is that it turns LockRelation
> > into an operation that needs to heap_fetch() from pg_ntclass in some
> > cases, and possibly update it.
>
> Have you done any profiling to see what that actually costs?

No, but I guess it must be expensive. While relminxid was still in the
relcache, it was cheap because we checked the value before having to
actually do anything else. That's why I was suggesting having a
separate cache for non-transactional stuff.

> I think we could possibly dodge the work in the normal case if we are
> willing to make VACUUM FREEZE take ExclusiveLock and send out a relation
> cache inval on the relation.

Well, one problem is that if enough transactions pass after the last
update to a table, a normal VACUUM (i.e. not FREEZE) could mark a table
as frozen as well; marking frozen is not an exclusive property of VACUUM
FREEZE.

> BTW, I think you have the order of operations wrong in LockRelation;
> should it not unfreeze only *after* obtaining lock? Consider race
> condition against relation drop for instance.

Hmm, good point. I think it was OK (and actually, it was required)
while relminxid was still on pg_class; or rather, there was a race
condition the other way around, so it was required to unfreeze the table
_before_ obtaining the lock. But it's certainly wrong now.

I'll work on pg_class_nt and I'll be back to this soon. Thanks for the
review.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: The corresponding relminxid patch; try 1
Date: 2006-06-12 01:21:16
Message-ID: 28916.1150075276@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> No, actually it's correct. The point of that comment is that if the
> source database is frozen, then all XIDs appearing inside both databases
> (source and newly created) are frozen. So it's possible that the row in
> pg_database is frozen as well. But because we are creating a new row in
> pg_database, it's not really frozen any longer; so we change the
> pg_database fields in the new row to match.

No, this only says that pg_database has to be unfrozen. If the source
DB is frozen then the clone is frozen too.

>> The changes in vacuum.c are far too extensive to review meaningfully.
>> What did you do, and did it really need to touch so much code?

> Yeah, they are extensive. ...

> Maybe I should take a stab at making incremental patches instead of
> doing everything in one patch. This way it would be easier to review
> for correctness (and I'd be more confident that it is actually correct
> as well).

Please. I've got no confidence that I see what's going on there.

regards, tom lane


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Non-transactional pg_class, try 2
Date: 2006-06-12 01:26:19
Message-ID: 20060612012619.GB25809@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Tom Lane wrote:
> Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> writes:

> Would it work to do
>
> #define tid ItemPointerData
> ...
> tid relntrans;
> ...
> #undef tid
> ?

Yeah, it probably would. I'll try.

> The *real* problem with what you've done is that pg_class.h now depends
> on having ItemPointerData defined before it's included, which creates an
> inclusion ordering dependency that should not exist. If you stick with
> either of these approaches then pg_class.h needs to #include whatever
> defines ItemPointerData.

storage/itemptr.h is #included in pg_class.h (first chunk of the patch).

> > Other two caveats are:
> > 1. During bootstrap, RelationBuildLocalRelation creates nailed relations
> > with hardcoded TID=(0,1). This is because we don't have access to
> > pg_class yet, so we can't find the real pointer; and furthermore, we are
> > going to fix the entries later in the bootstrapping process.
>
> This seems dangerous; can't you set it to InvalidItemPointer instead?
> If it's not used before fixed, this doesn't matter, and if someone
> *does* try to use it, that will catch the problem.

Doesn't work because the bootstrap system actually _writes_ there :-( A
workaround could be to disable writing in bootstrapping mode, and store
InvalidItemPointer. (Actually storing InvalidItemPointer was the first
thing I did, but it crashed on bootstrap.)

I wasn't worried about bootstrap writing invalid values, because the
correct values are written shortly after (at the latest, when vacuum is
run by initdb). And if I set it to Invalid and have bootstrap mode skip
writing, exactly the same thing will happen ...

> > 2. The whole VACUUM/VACUUM FULL/ANALYZE relation list stuff is pretty
> > ugly as well; and autovacuum is skipping pg_ntclass (really all
> > non-transactional catalogs) altogether. We could improve the situation
> > by introducing some sort of struct like {relid, relkind}, so that
> > vacuum_rel could know what relkind to expect, and it could skip
> > non-transactional catalogs cleanly in vacuum full and analyze.
>
> Need to do something about this. pg_ntclass will contain XIDs (of
> inserting/deleting transactions) so it MUST get vacuumed to be sure
> we don't expose ourselves to XID wrap problems.

Oh, certainly it does get vacuumed. vacuum.c is modified so that
non-transactional catalogs are vacuumed (in database-wide VACUUM for
instance). The only thing I was stating above was that the way vacuum.c
handles the list of relations, is a bit of a mess, because vacuum_rel
wants to check the relkind but get_oids_list forms only a single list
and it's assumed that they are all RELKIND_RELATION rels. I had to
modify it a bit so that NON_TRANSACTIONAL rels are also included in that
list, and therefore the check had to be relaxed.

I also made ANALYZE silently skip non-transactional catalogs, in an
similarly ugly way. I don't remember what was the rationale for this --
certainly there isn't any harm. But on the other hand, analyzing it
serves no purpose since the statistics are not used for anything.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Non-transactional pg_class, try 2
Date: 2006-06-12 01:31:32
Message-ID: 28994.1150075892@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> Tom Lane wrote:
>> Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> writes:
>>> Other two caveats are:
>>> 1. During bootstrap, RelationBuildLocalRelation creates nailed relations
>>> with hardcoded TID=(0,1).
>>
>> This seems dangerous; can't you set it to InvalidItemPointer instead?
>> If it's not used before fixed, this doesn't matter, and if someone
>> *does* try to use it, that will catch the problem.

> Doesn't work because the bootstrap system actually _writes_ there :-( A
> workaround could be to disable writing in bootstrapping mode, and store
> InvalidItemPointer. (Actually storing InvalidItemPointer was the first
> thing I did, but it crashed on bootstrap.)

Or, set it to (0,1) and reserve that TID as a dummy entry. What I'm
afraid of here is scribbling on some other relation's entry. I'd like
to see some defense against that, don't much care what.

We do plenty of disable-this-in-bootstrap-mode checks, so one more
doesn't seem like a problem; so the first solution may be better.

regards, tom lane


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Non-transactional pg_class, try 2
Date: 2006-06-12 19:05:07
Message-ID: 20060612190507.GD4035@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Tom Lane wrote:

> Or, set it to (0,1) and reserve that TID as a dummy entry. What I'm
> afraid of here is scribbling on some other relation's entry. I'd like
> to see some defense against that, don't much care what.
>
> We do plenty of disable-this-in-bootstrap-mode checks, so one more
> doesn't seem like a problem; so the first solution may be better.

New version of the patch, including fixes to all the feedback you
provided. Thanks!

I used a dummy entry in (0,1), which seems cleaner to me (the
index-creation stuff in bootstrap is apparently still needed to generate
sinval messages, so it's not as easy as returning early from the
function). Maybe we could include a step in initdb to get rid of it,
but it doesn't seem too much of an issue.

Attachment Content-Type Size
fixclass-4.patch text/plain 76.5 KB

From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Cc: Patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Non-transactional pg_class, try 2
Date: 2006-06-12 22:15:19
Message-ID: 1150150519.13699.43.camel@localhost.localdomain
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

On Sun, 2006-06-11 at 17:53 -0400, Alvaro Herrera wrote:
> Here I repost the patch to implement non-transactional catalogs, the
> first of which is pg_ntclass, intended to hold the non-transactional
> info about pg_class (reltuples, relpages).

Would it be possible to get a summary of what this new feature gives us?
I'm trying to follow the implementation but the why of it seems to have
been buried in the detail.

Will a user be able to update reltuples and relpages manually?

Thanks.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: [PATCHES] Non-transactional pg_class, try 2
Date: 2006-06-12 23:15:14
Message-ID: 13541.1150154114@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

[ moving to -hackers to get some more eyeballs on the question ]

Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> On Sun, 2006-06-11 at 17:53 -0400, Alvaro Herrera wrote:
>> Here I repost the patch to implement non-transactional catalogs, the
>> first of which is pg_ntclass, intended to hold the non-transactional
>> info about pg_class (reltuples, relpages).

> Will a user be able to update reltuples and relpages manually?

No, which is a tad annoying now that you mention it. I'm not sure that
there's any very good reason for users to want to do that, though. Once
or twice I've hacked those fields manually to set up test cases for the
planner, which is why I'd be annoyed to lose the ability --- but does it
really matter to users? (Especially in view of the fact that the
planner no longer trusts relpages anyway.)

It does seem like rather a lot of mechanism and overhead though,
especially in view of Alvaro's worries about the non-cacheability of
pg_class_nt rows. I wonder whether we shouldn't take two steps back
and rethink.

The main thing we are trying to accomplish here is to decouple
transactional and nontransactional updates to a pg_class row.
Is there another way to do that? Do we need complete decoupling?
It strikes me that the only case where we absolutely must not lose a
nontransactional update is where we are un-freezing a frozen rel.
If we could guarantee that un-freezing happens before any transactional
update within a particular transaction, then maybe we could have that.
Manual updates to pg_class seem like they'd risk breaking such a
guarantee, but maybe there's a way around that. Personally I'd be
willing to live with commands that try to modify a frozen rel erroring
out if they see the current pg_class row is uncommitted.

regards, tom lane


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: [PATCHES] Non-transactional pg_class, try 2
Date: 2006-06-13 08:27:05
Message-ID: 1150187225.13699.68.camel@localhost.localdomain
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

On Mon, 2006-06-12 at 19:15 -0400, Tom Lane wrote:
> [ moving to -hackers to get some more eyeballs on the question ]
>
> Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> > On Sun, 2006-06-11 at 17:53 -0400, Alvaro Herrera wrote:
> >> Here I repost the patch to implement non-transactional catalogs, the
> >> first of which is pg_ntclass, intended to hold the non-transactional
> >> info about pg_class (reltuples, relpages).
>
> > Will a user be able to update reltuples and relpages manually?
>
> No, which is a tad annoying now that you mention it. I'm not sure that
> there's any very good reason for users to want to do that, though. Once
> or twice I've hacked those fields manually to set up test cases for the
> planner, which is why I'd be annoyed to lose the ability --- but does it
> really matter to users? (Especially in view of the fact that the
> planner no longer trusts relpages anyway.)

No need to have an SQL route. A special function call would suffice.

I'd like to be able to set up a test database that has the statistics
copied from the live system. A schema only pg_dump with mods is all I
need, but it sounds like we're moving away from that. We can then
perform various what-ifs on the design.

Elsewhere, it has been discussed that we might hold the number of blocks
in a relation in shared memory. Does that idea now fall down, or is it
complementary to this? i.e. would we replace ANALYZE's relpages with an
accurate relpages for the planner?

> It does seem like rather a lot of mechanism and overhead though,
> especially in view of Alvaro's worries about the non-cacheability of
> pg_class_nt rows. I wonder whether we shouldn't take two steps back
> and rethink.

Review, yes. Could still be the best way.

> The main thing we are trying to accomplish here is to decouple
> transactional and nontransactional updates to a pg_class row.

With the goal being avoiding table bloat??

> Is there another way to do that? Do we need complete decoupling?

> It strikes me that the only case where we absolutely must not lose a
> nontransactional update is where we are un-freezing a frozen rel.

Not sure why you'd want to do that, assuming I've understood you.

For me, freezing is last step before writing to WORM media, so there is
never an unfreeze step.

> If we could guarantee that un-freezing happens before any transactional
> update within a particular transaction, then maybe we could have that.
> Manual updates to pg_class seem like they'd risk breaking such a
> guarantee, but maybe there's a way around that. Personally I'd be
> willing to live with commands that try to modify a frozen rel erroring
> out if they see the current pg_class row is uncommitted.

Sounds OK. It's a major state change after all.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: [PATCHES] Non-transactional pg_class, try 2
Date: 2006-06-13 14:02:10
Message-ID: 22280.1150207330@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> Elsewhere, it has been discussed that we might hold the number of blocks
> in a relation in shared memory. Does that idea now fall down, or is it
> complementary to this?

It's been the case for some time that the planner uses
RelationGetNumberOfBlocks() to determine true rel size. The only reason
relpages is still stored at all is that it's used to approximate true
number of tuples via
true_ntuples = (reltuples/relpages) * true_npages
ie, assuming that the tuple density is still what it was at the last
VACUUM or ANALYZE. So you can't fool the system with a totally made-up
relation size anyway. (This too is moderately annoying for planner
testing, but it seems the only way to get the planner to react when a
table's been filled without an immediate vacuum/analyze.)

The only point of tracking rel size in shared memory would be to avoid
the costs of lseek() kernel calls in RelationGetNumberOfBlocks.

>> The main thing we are trying to accomplish here is to decouple
>> transactional and nontransactional updates to a pg_class row.

> With the goal being avoiding table bloat??

No, with the goal being correctness. If you have a freeze/unfreeze
mechanism then unfreezing a relation is an action that must NOT be
rolled back if your transaction (or any other one for that matter) later
aborts. The tuples you put into it meanwhile need to be vacuumed anyway.
So you can't mark it unfrozen in an uncommitted pg_class entry that
might never become committed.

> For me, freezing is last step before writing to WORM media, so there is
> never an unfreeze step.

That is not what Alvaro is after. Nor anyone else here. I have not
heard anyone mention WORM media for Postgres in *years*.

It strikes me though that automatic UNFREEZE isn't necessarily the
requirement. What if VACUUM FREEZE causes the table to become
effectively read-only, and you need an explicit UNFREEZE command to
put it back into a read-write state? Then UNFREEZE could be a
transactional operation, and most of these issues go away. The case
where this doesn't work conveniently is copying a frozen database
(viz template0), but maybe biting the bullet and finding a way to do
prep work in a freshly made database is the answer for that. We've
certainly seen plenty of other possible uses for post-CREATE processing
in a new database.

Another reason for not doing unfreeze automatically is that as the patch
stands, any database user can force unfreezing of any table, whether he
has any access rights on it or not (because the LockTable will happen
before we check access rights, I believe). This is probably Not Good.
Ideally I think FREEZE/UNFREEZE would be owner-permission-required.

regards, tom lane


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: [PATCHES] Non-transactional pg_class, try 2
Date: 2006-06-13 15:19:39
Message-ID: 1150211979.2691.656.camel@localhost.localdomain
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

On Tue, 2006-06-13 at 10:02 -0400, Tom Lane wrote:
> Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> > Elsewhere, it has been discussed that we might hold the number of blocks
> > in a relation in shared memory. Does that idea now fall down, or is it
> > complementary to this?
>
> It's been the case for some time that the planner uses
> RelationGetNumberOfBlocks() to determine true rel size. The only reason
> relpages is still stored at all is that it's used to approximate true
> number of tuples via
> true_ntuples = (reltuples/relpages) * true_npages
> ie, assuming that the tuple density is still what it was at the last
> VACUUM or ANALYZE. So you can't fool the system with a totally made-up
> relation size anyway. (This too is moderately annoying for planner
> testing, but it seems the only way to get the planner to react when a
> table's been filled without an immediate vacuum/analyze.)
>
> The only point of tracking rel size in shared memory would be to avoid
> the costs of lseek() kernel calls in RelationGetNumberOfBlocks.

Yes, understood. With the second point to allow them to be separately
set for PGSQL developer testing of optimizer, and application dev
testing of SQL and/or what/if scenarios.

> >> The main thing we are trying to accomplish here is to decouple
> >> transactional and nontransactional updates to a pg_class row.
>
> > With the goal being avoiding table bloat??
>
> No, with the goal being correctness. If you have a freeze/unfreeze
> mechanism then unfreezing a relation is an action that must NOT be
> rolled back if your transaction (or any other one for that matter) later
> aborts. The tuples you put into it meanwhile need to be vacuumed anyway.
> So you can't mark it unfrozen in an uncommitted pg_class entry that
> might never become committed.
>
> > For me, freezing is last step before writing to WORM media, so there is
> > never an unfreeze step.
>
> That is not what Alvaro is after. Nor anyone else here.

So what is unfreeze for again?

> I have not
> heard anyone mention WORM media for Postgres in *years*.

Oh? Big requirements for archive these days, much more so than before.
This will allow years of data in a seamless on-line/near-line
partitioned table set. Lots of people want that: .gov, .mil, .com

More modern equivalent: a MAID archive system for WORO data

> It strikes me though that automatic UNFREEZE isn't necessarily the
> requirement. What if VACUUM FREEZE causes the table to become
> effectively read-only, and you need an explicit UNFREEZE command to
> put it back into a read-write state? Then UNFREEZE could be a
> transactional operation, and most of these issues go away.

That works for me. Very much preferred.

> The case
> where this doesn't work conveniently is copying a frozen database
> (viz template0), but maybe biting the bullet and finding a way to do
> prep work in a freshly made database is the answer for that. We've
> certainly seen plenty of other possible uses for post-CREATE processing
> in a new database.
>
> Another reason for not doing unfreeze automatically is that as the patch
> stands, any database user can force unfreezing of any table, whether he
> has any access rights on it or not (because the LockTable will happen
> before we check access rights, I believe). This is probably Not Good.
> Ideally I think FREEZE/UNFREEZE would be owner-permission-required.

Seems like a plan.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: [PATCHES] Non-transactional pg_class, try 2
Date: 2006-06-26 17:58:23
Message-ID: 20060626175822.GD8999@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches


Ok, let's step back to discuss this again. Sorry for the length -- this
is a description of the problem I'm trying to solve, the issues I found,
and how I tried to solve them.

The relminxid Patch
===================

What I'm after is not freezing for read-only media, nor archive, nor
read-only tables. What I'm after is removing the requirement that all
databases must be vacuumed wholly every 2 billion transactions.

Now, why do we need to vacuum whole databases at a time?

The Transaction Id Counter
==========================

We know that the Xid counter is weird; it cycles, for starters, and also
there are special values at the "start" of the cycle that are lesser
than all other values (BootstrapXid, FrozenXid). The idea here is to
allow the counter to wrap around and old tuples not be affected, i.e.,
appear like they were committed in some distant past.

So we use the special Xid values to mark special stuff, like tuples
created by the bootstrap processing (which are always known to be good)
or tuples in template databases that are not connectable ("frozen"
databases). We also use FrozenXid to mark tuples that are very old,
i.e. were committed a long time ago and never deleted. Any such tuple
is unaffected by the status of the Xid counter.

It should be clear that we must ensure that after a suitable amount of
"time" (measured in advancement of the Xid counter) has passed, we
should change the old Xids in tuples to the special FrozenXid value.
The requirement for whole-database vacuuming is there because we
need to ensure that this is done in all the tables in the database.

We keep track of a "minimum Xid", call it minxid. The Xid generator
refuses to assign a new Xid counter if this minxid is too far in the
past, because we'd risk causing Xid-wraparound data loss if we did; the
Xid comparison semantics would start behaving funny, and some tuples
that appeared to be alive not many transactions ago now suddenly appear
dead. Clearly, it's important that before we advance this minxid we
ensure that all tables in the database have been under the process of
changing all regular Xids into FrozenXid.

Currently the only way to ensure that all tables have gone through this
process is processing them in a single VACUUM pass. Skip even one
table, and you can forget about advancing the minxid. Even if the
skipped table was vacuumed in the transaction just before this one.
Even if the table is fully frozen, i.e., all tables on it are marked
with FrozenXid. Even if the table is empty.

Tracking minxid Per Table
=========================

So, my idea is to track this minxid per table. To do this, I added a
column to pg_class called relminxid. The minimum of it across a
database is used to determine each database's minimum, datminxid. The
minimum of all databases is used to advance the global minimum Xid
counter.

So, if a table has 3 tuples whose Xmins are 42, 512 and FrozenXid, the
relminxid is 42. If we keep track of all these religiously during
vacuum, we know exactly what is the minxid we should apply to this
particular table.

It is obvious that vacuuming one table can set the minimum for that
table. So when the vacuuming is done, we can recalculate the database
minimum; and using the minima of all databases, we can advance the
global minimum Xid counter and truncate pg_clog. We can do this on each
single-table vacuum -- so, no more need for database-wide vacuuming.

If a table is empty, or all tuples on it are frozen, then we must mark
the table with relminxid = RecentXmin. This is because there could be
an open transaction that writes a new tuple to the table after the
vacuum is finished. A newly created table must also be created with
relminxid = RecentXid. Because of this, we never mark a table with
relminxid = FrozenXid.

Template Databases
==================

Up to this point everything is relatively simple. Here is where the
strange problems appear. The main issue is template databases.

Why are template databases special? Because they are never vacuumed.
More generally, we assume that every database that is marked as
"datallowconn = false" is fully frozen, i.e. all tables on it are
frozen. Autovacuum skips them. VACUUM ignores them. The minxid
calculations ignore them. They are fully frozen so they don't matter
and they don't harm anybody.

That's fine and dandy until you realize what happens when you freeze a
database, let a couple billion transactions pass, and then create a
database using that as a template (or just "reallow connections" to a
database). Because all the tables were frozen 2 billion transaction
ago, they are marked with an old relminxid, so as soon as you vacuum any
table, the minxid computations went to hell, and we have a DoS
condition.

So, we have to do something to cope with frozen databases. I see two
ways:

1. Remove the special case, i.e., process frozen databases in VACUUM
like every other database.
This is the easiest, because no extra logic is needed. Just make
sure they are vacuumed in time. The only problem would be that we'd
need to uselessly vacuum tables that we know are frozen, from time to
time. But then, those tables are probably small, so what's the
problem with that?

2. Mark frozen databases specially somehow.
To mark databases frozen, we need a way to mark tables as frozen.
How do we do that? As I explain below, this allows some nice
optimizations, but it's a very tiny can full of a huge amount of
worms.

Marking a Table Frozen
======================

Marking a table frozen is simple as setting relminxid = FrozenXid for a
table. As explained above, this cannot be done in a regular postmaster
environment, because a concurrent transaction could be doing nasty stuff
to a table. So we can do it only in a standalone backend.

On the other hand, a "frozen" table must be marked with relminxid =
a-regular-Xid as soon as a transaction writes some tuples on it. Note
that this "unfreezing" must take place even if the offending transaction
is aborted, because the Xid is written in the table nevertheless and
thus it would be incorrect to lose the unfreezing.

This is how pg_class_nt came into existence -- it would be a place where
information about a table would be stored and not subject to the rolling
back of the transaction that wrote it. So if you find that a table is
frozen, you write an unfreezing into its pg_class_nt tuple, and that's
it.

Nice optimization: if we detect that a table is fully frozen, then
VACUUM is a no-op (not VACUUM FULL), because by definition there are no
tuples to remove.

Another optimization: if we are sure that unfreezing works, we can even
mark a table as frozen in a postmaster environment, as long as we take
an ExclusiveLock on the table. Thus we know that the vacuum is the sole
transaction concurrently accessing the table; and if another transaction
comes about and writes something after we're finished, it'll correctly
unfreeze the table and all is well.

Where are the problems in this approach?

1. Performance. We'll need to keep a cache of pg_class_nt tuples. This
cache must be independent of the current relcache, because the relcache
is properly transactional while the pg_class_nt cache must not be.

2. The current implementation puts the unfreezing in LockRelation. This
is a problem, because any user can cause a LockRelation on any table,
even if the user does not have access to that table.

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: [PATCHES] Non-transactional pg_class, try 2
Date: 2006-06-26 18:21:07
Message-ID: 20060626182107.GE8999@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

[Resending: apparently the previous one to the list was eaten by spam
filters or something. Changing SMTP relay again ... ]

Ok, let's step back to discuss this again. Sorry for the length -- this
is a description of the problem I'm trying to solve, the issues I found,
and how I tried to solve them.

The relminxid Patch
===================

What I'm after is not freezing for read-only media, nor archive, nor
read-only tables. What I'm after is removing the requirement that all
databases must be vacuumed wholly every 2 billion transactions.

Now, why do we need to vacuum whole databases at a time?

The Transaction Id Counter
==========================

We know that the Xid counter is weird; it cycles, for starters, and also
there are special values at the "start" of the cycle that are lesser
than all other values (BootstrapXid, FrozenXid). The idea here is to
allow the counter to wrap around and old tuples not be affected, i.e.,
appear like they were committed in some distant past.

So we use the special Xid values to mark special stuff, like tuples
created by the bootstrap processing (which are always known to be good)
or tuples in template databases that are not connectable ("frozen"
databases). We also use FrozenXid to mark tuples that are very old,
i.e. were committed a long time ago and never deleted. Any such tuple
is unaffected by the status of the Xid counter.

It should be clear that we must ensure that after a suitable amount of
"time" (measured in advancement of the Xid counter) has passed, we
should change the old Xids in tuples to the special FrozenXid value.
The requirement for whole-database vacuuming is there because we
need to ensure that this is done in all the tables in the database.

We keep track of a "minimum Xid", call it minxid. The Xid generator
refuses to assign a new Xid counter if this minxid is too far in the
past, because we'd risk causing Xid-wraparound data loss if we did; the
Xid comparison semantics would start behaving funny, and some tuples
that appeared to be alive not many transactions ago now suddenly appear
dead. Clearly, it's important that before we advance this minxid we
ensure that all tables in the database have been under the process of
changing all regular Xids into FrozenXid.

Currently the only way to ensure that all tables have gone through this
process is processing them in a single VACUUM pass. Skip even one
table, and you can forget about advancing the minxid. Even if the
skipped table was vacuumed in the transaction just before this one.
Even if the table is fully frozen, i.e., all tables on it are marked
with FrozenXid. Even if the table is empty.

Tracking minxid Per Table
=========================

So, my idea is to track this minxid per table. To do this, I added a
column to pg_class called relminxid. The minimum of it across a
database is used to determine each database's minimum, datminxid. The
minimum of all databases is used to advance the global minimum Xid
counter.

So, if a table has 3 tuples whose Xmins are 42, 512 and FrozenXid, the
relminxid is 42. If we keep track of all these religiously during
vacuum, we know exactly what is the minxid we should apply to this
particular table.

It is obvious that vacuuming one table can set the minimum for that
table. So when the vacuuming is done, we can recalculate the database
minimum; and using the minima of all databases, we can advance the
global minimum Xid counter and truncate pg_clog. We can do this on each
single-table vacuum -- so, no more need for database-wide vacuuming.

If a table is empty, or all tuples on it are frozen, then we must mark
the table with relminxid = RecentXmin. This is because there could be
an open transaction that writes a new tuple to the table after the
vacuum is finished. A newly created table must also be created with
relminxid = RecentXid. Because of this, we never mark a table with
relminxid = FrozenXid.

Template Databases
==================

Up to this point everything is relatively simple. Here is where the
strange problems appear. The main issue is template databases.

Why are template databases special? Because they are never vacuumed.
More generally, we assume that every database that is marked as
"datallowconn = false" is fully frozen, i.e. all tables on it are
frozen. Autovacuum skips them. VACUUM ignores them. The minxid
calculations ignore them. They are fully frozen so they don't matter
and they don't harm anybody.

That's fine and dandy until you realize what happens when you freeze a
database, let a couple billion transactions pass, and then create a
database using that as a template (or just "reallow connections" to a
database). Because all the tables were frozen 2 billion transaction
ago, they are marked with an old relminxid, so as soon as you vacuum any
table, the minxid computations went to hell, and we have a DoS
condition.

So, we have to do something to cope with frozen databases. I see two
ways:

1. Remove the special case, i.e., process frozen databases in VACUUM
like every other database.
This is the easiest, because no extra logic is needed. Just make
sure they are vacuumed in time. The only problem would be that we'd
need to uselessly vacuum tables that we know are frozen, from time to
time. But then, those tables are probably small, so what's the
problem with that?

2. Mark frozen databases specially somehow.
To mark databases frozen, we need a way to mark tables as frozen.
How do we do that? As I explain below, this allows some nice
optimizations, but it's a very tiny can full of a huge amount of
worms.

Marking a Table Frozen
======================

Marking a table frozen is simple as setting relminxid = FrozenXid for a
table. As explained above, this cannot be done in a regular postmaster
environment, because a concurrent transaction could be doing nasty stuff
to a table. So we can do it only in a standalone backend.

On the other hand, a "frozen" table must be marked with relminxid =
a-regular-Xid as soon as a transaction writes some tuples on it. Note
that this "unfreezing" must take place even if the offending transaction
is aborted, because the Xid is written in the table nevertheless and
thus it would be incorrect to lose the unfreezing.

This is how pg_class_nt came into existence -- it would be a place where
information about a table would be stored and not subject to the rolling
back of the transaction that wrote it. So if you find that a table is
frozen, you write an unfreezing into its pg_class_nt tuple, and that's
it.

Nice optimization: if we detect that a table is fully frozen, then
VACUUM is a no-op (not VACUUM FULL), because by definition there are no
tuples to remove.

Another optimization: if we are sure that unfreezing works, we can even
mark a table as frozen in a postmaster environment, as long as we take
an ExclusiveLock on the table. Thus we know that the vacuum is the sole
transaction concurrently accessing the table; and if another transaction
comes about and writes something after we're finished, it'll correctly
unfreeze the table and all is well.

Where are the problems in this approach?

1. Performance. We'll need to keep a cache of pg_class_nt tuples. This
cache must be independent of the current relcache, because the relcache
is properly transactional while the pg_class_nt cache must not be.

2. The current implementation puts the unfreezing in LockRelation. This
is a problem, because any user can cause a LockRelation on any table,
even if the user does not have access to that table.

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: [PATCHES] Non-transactional pg_class, try 2
Date: 2006-06-26 20:12:55
Message-ID: 1151352775.2479.101.camel@localhost.localdomain
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

On Mon, 2006-06-26 at 13:58 -0400, Alvaro Herrera wrote:
> Ok, let's step back to discuss this again. Sorry for the length -- this
> is a description of the problem I'm trying to solve, the issues I found,
> and how I tried to solve them.

Thanks. This is good.

> The relminxid Patch
> ===================
>
> What I'm after is not freezing for read-only media, nor archive, nor
> read-only tables.

OK, but I am... but I'm happy to not to confuse the discussion.

> Now, why do we need to vacuum whole databases at a time?

> So, we have to do something to cope with frozen databases. I see two
> ways:
>
> 1. Remove the special case, i.e., process frozen databases in VACUUM
> like every other database.
> This is the easiest, because no extra logic is needed. Just make
> sure they are vacuumed in time. The only problem would be that we'd
> need to uselessly vacuum tables that we know are frozen, from time to
> time. But then, those tables are probably small, so what's the
> problem with that?

> 2. Mark frozen databases specially somehow.
> To mark databases frozen, we need a way to mark tables as frozen.
> How do we do that? As I explain below, this allows some nice
> optimizations, but it's a very tiny can full of a huge amount of
> worms.

At this stage you talk about databases, yet below we switch to
discussing tables. Not sure why we switched from one to the other.

> Marking a Table Frozen
> ======================
>
> Marking a table frozen is simple as setting relminxid = FrozenXid for a
> table. As explained above, this cannot be done in a regular postmaster
> environment, because a concurrent transaction could be doing nasty stuff
> to a table. So we can do it only in a standalone backend.

Surely we just lock the table? No concurrent transactions?

> On the other hand, a "frozen" table must be marked with relminxid =
> a-regular-Xid as soon as a transaction writes some tuples on it. Note
> that this "unfreezing" must take place even if the offending transaction
> is aborted, because the Xid is written in the table nevertheless and
> thus it would be incorrect to lose the unfreezing.
>
> This is how pg_class_nt came into existence -- it would be a place where
> information about a table would be stored and not subject to the rolling
> back of the transaction that wrote it. So if you find that a table is
> frozen, you write an unfreezing into its pg_class_nt tuple, and that's
> it.
>
> Nice optimization: if we detect that a table is fully frozen, then
> VACUUM is a no-op (not VACUUM FULL), because by definition there are no
> tuples to remove.

Yes please, but we don't need it anymore do we? Guess we need it for
backwards compatibility? VACUUM still needs to vacuum every table.

> Another optimization: if we are sure that unfreezing works, we can even
> mark a table as frozen in a postmaster environment, as long as we take
> an ExclusiveLock on the table. Thus we know that the vacuum is the sole
> transaction concurrently accessing the table; and if another transaction
> comes about and writes something after we're finished, it'll correctly
> unfreeze the table and all is well.

Why not just have a command to FREEZE and UNFREEZE an object? It can
hold an ExclusiveLock, avoiding all issues. Presumably FREEZE and
UNFREEZE are rare commands?

> Where are the problems in this approach?
>
> 1. Performance. We'll need to keep a cache of pg_class_nt tuples. This
> cache must be independent of the current relcache, because the relcache
> is properly transactional while the pg_class_nt cache must not be.
>
> 2. The current implementation puts the unfreezing in LockRelation. This
> is a problem, because any user can cause a LockRelation on any table,
> even if the user does not have access to that table.

That last bit just sounds horrible to me. But thinking about it: how
come any user can lock a relation they shouldn't even be allowed to know
exists? Possibly OT.

I can see other reasons for having pg_class_nt, so having table info
cached in shared memory does make sense to me (yet not being part of the
strict definitions of the relcache).

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: [PATCHES] Non-transactional pg_class, try 2
Date: 2006-06-26 20:54:07
Message-ID: 20060626205407.GA11926@surnet.cl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Simon Riggs wrote:
> On Mon, 2006-06-26 at 13:58 -0400, Alvaro Herrera wrote:

> > The relminxid Patch
> > ===================
> >
> > What I'm after is not freezing for read-only media, nor archive, nor
> > read-only tables.
>
> OK, but I am... but I'm happy to not to confuse the discussion.

Ok :-) I think I put a note about this but removed it while
restructuring the text so it would be clearer. The note is that while I
don't care about read-only stuff in this proposal, it may be that
read-only tables may come as a "side effect of implementing this. But I
agree we should not make the discussion more complex than it already is.

> > 2. Mark frozen databases specially somehow.
> > To mark databases frozen, we need a way to mark tables as frozen.
> > How do we do that? As I explain below, this allows some nice
> > optimizations, but it's a very tiny can full of a huge amount of
> > worms.
>
> At this stage you talk about databases, yet below we switch to
> discussing tables. Not sure why we switched from one to the other.

Sorry, I forgot one step. To mark a database frozen, we must make sure
that all tables within that database are frozen as well. So the first
step to freezing a database is freezing all its tables.

> > Marking a Table Frozen
> > ======================
> >
> > Marking a table frozen is simple as setting relminxid = FrozenXid for a
> > table. As explained above, this cannot be done in a regular postmaster
> > environment, because a concurrent transaction could be doing nasty stuff
> > to a table. So we can do it only in a standalone backend.
>
> Surely we just lock the table? No concurrent transactions?

No, because a transaction can have been started previously and yet not
hold any lock on the table, and write on the table after the vacuum
finishes. Or write on an earlier page of the table, after the vacuuming
already processed it. But here it comes one of the "nice points" below,
which was that if we acquire a suitable exclusive lock on the table, we
_can_ mark it frozen. Of course, this cannot be done by plain vacuum,
because we want the table to be still accesible by other transactions.
This is where VACUUM FREEZE comes in -- it does the same processing as
lazy vacuum, except that it locks the table exclusively and marks it
with FrozenXid.

> > Nice optimization: if we detect that a table is fully frozen, then
> > VACUUM is a no-op (not VACUUM FULL), because by definition there are no
> > tuples to remove.
>
> Yes please, but we don't need it anymore do we? Guess we need it for
> backwards compatibility? VACUUM still needs to vacuum every table.

Sorry, I don't understand what you mean here. We don't need what
anymore?

> > Another optimization: if we are sure that unfreezing works, we can even
> > mark a table as frozen in a postmaster environment, as long as we take
> > an ExclusiveLock on the table. Thus we know that the vacuum is the sole
> > transaction concurrently accessing the table; and if another transaction
> > comes about and writes something after we're finished, it'll correctly
> > unfreeze the table and all is well.
>
> Why not just have a command to FREEZE and UNFREEZE an object? It can
> hold an ExclusiveLock, avoiding all issues. Presumably FREEZE and
> UNFREEZE are rare commands?

Ok, if I'm following you here, your point is that FREEZE'ing a table
sets the relminxid to FrozenXid, and UNFREEZE removes that; and also, in
between, no one can write to the table?

This seems to make sense. However, I'm not very sure about the
FREEZE'ing operation, because we need to make sure the table is really
frozen. So we either scan it, or we make sure something else already
scanned it; to me what makes the most sense is having a VACUUM option
that would do the freezing (and a separate command to do the
unfreezing).

> > Where are the problems in this approach?
> >
> > 2. The current implementation puts the unfreezing in LockRelation. This
> > is a problem, because any user can cause a LockRelation on any table,
> > even if the user does not have access to that table.
>
> That last bit just sounds horrible to me. But thinking about it: how
> come any user can lock a relation they shouldn't even be allowed to know
> exists? Possibly OT.

Hmm, I guess there must be several commands that open the relation and
lock it, and then check permissions. I haven't checked the code but you
shouldn't check permissions before acquiring some kind of lock, and we
shouldn't be upgrading locks either.

> I can see other reasons for having pg_class_nt, so having table info
> cached in shared memory does make sense to me (yet not being part of the
> strict definitions of the relcache).

Yeah.

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCHES] Non-transactional pg_class, try 2
Date: 2006-06-26 21:26:33
Message-ID: 23140.1151357193@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> What I'm after is not freezing for read-only media, nor archive, nor
> read-only tables. What I'm after is removing the requirement that all
> databases must be vacuumed wholly every 2 billion transactions.

Well, if that's the only goal then I hardly think we need to have a
discussion, because your alternative #1 is *obviously* the winner:

> 1. Remove the special case, i.e., process frozen databases in VACUUM
> like every other database.
> This is the easiest, because no extra logic is needed. Just make
> sure they are vacuumed in time. The only problem would be that we'd
> need to uselessly vacuum tables that we know are frozen, from time to
> time. But then, those tables are probably small, so what's the
> problem with that?

> 2. Mark frozen databases specially somehow.
> To mark databases frozen, we need a way to mark tables as frozen.
> How do we do that? As I explain below, this allows some nice
> optimizations, but it's a very tiny can full of a huge amount of
> worms.

Avoiding a vacuum pass over template0 once every 2 billion transactions
cannot be thought worthy of the amount of complexity and risk entailed
in the nontransactional-catalog thing. Especially since in the normal
case those would be read-only passes (the tuples all being frozen already).

So if you want to bring in the other goals that you're trying to pretend
aren't there, step right up and do it. You have not here made a case
that would convince anyone that we shouldn't just do #1 and be done with
it.

regards, tom lane


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Non-transactional pg_class, try 2
Date: 2006-06-26 21:36:14
Message-ID: 20060626213614.GF11926@surnet.cl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Tom Lane wrote:
> Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> > What I'm after is not freezing for read-only media, nor archive, nor
> > read-only tables. What I'm after is removing the requirement that all
> > databases must be vacuumed wholly every 2 billion transactions.
>
> Well, if that's the only goal then I hardly think we need to have a
> discussion, because your alternative #1 is *obviously* the winner:
>
> > 1. Remove the special case, i.e., process frozen databases in VACUUM
> > like every other database.
> > This is the easiest, because no extra logic is needed. Just make
> > sure they are vacuumed in time. The only problem would be that we'd
> > need to uselessly vacuum tables that we know are frozen, from time to
> > time. But then, those tables are probably small, so what's the
> > problem with that?

I'm happy to do at least this for 8.2. We can still try to do the
non-transactional catalog later, either in this release or the next; the
code is almost there, and it'll be easier to discuss/design because
we'll have taken the relminxid stuff out of the way.

So if everyone agrees, I'll do this now. Beware -- this may make you
vacuum databases that you previously weren't vacuuming. (I really doubt
anyone is setting datallowconn=false just to skip vacuuming big
databases, but there are people with strange ideas out there.)

> So if you want to bring in the other goals that you're trying to pretend
> aren't there, step right up and do it. You have not here made a case
> that would convince anyone that we shouldn't just do #1 and be done with
> it.

We can do it in a separate discussion.

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: [PATCHES] Non-transactional pg_class, try 2
Date: 2006-06-26 21:59:49
Message-ID: 1151359190.2479.121.camel@localhost.localdomain
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

On Mon, 2006-06-26 at 16:54 -0400, Alvaro Herrera wrote:
> > > Another optimization: if we are sure that unfreezing works, we can
> even
> > > mark a table as frozen in a postmaster environment, as long as we
> take
> > > an ExclusiveLock on the table. Thus we know that the vacuum is
> the sole
> > > transaction concurrently accessing the table; and if another
> transaction
> > > comes about and writes something after we're finished, it'll
> correctly
> > > unfreeze the table and all is well.
> >
> > Why not just have a command to FREEZE and UNFREEZE an object? It can
> > hold an ExclusiveLock, avoiding all issues. Presumably FREEZE and
> > UNFREEZE are rare commands?
>
> Ok, if I'm following you here, your point is that FREEZE'ing a table
> sets the relminxid to FrozenXid, and UNFREEZE removes that; and also,
> in
> between, no one can write to the table?
>
> This seems to make sense. However, I'm not very sure about the
> FREEZE'ing operation, because we need to make sure the table is really
> frozen. So we either scan it, or we make sure something else already
> scanned it; to me what makes the most sense is having a VACUUM option
> that would do the freezing (and a separate command to do the
> unfreezing).

Sounds like we're in step here:

VACUUM FREEZE
-- works at either table or database level
-- takes ExclusiveLock, reads all blocks of a table, freezing rows
-- once complete, all write operations are prevented until...

ALTER TABLE xxx UNFREEZE;
ALTER DATABASE xxx UNFREEZE;
-- takes AccessExclusiveLock, allows writes again

CREATE DATABASE automatically does unfreeze after template db copy

Suggest that we prevent write operations on Frozen tables by revoking
all INSERT, UPDATE or DELETE rights held, then enforcing a check during
GRANT to prevent them being re-enabled. Superusers would need to check
every time. If we dont do this, then we will have two contradictory
states marked in the catalog - privilges saying Yes and freezing saying
No.

Not sure where pg_class_nt comes in here though, even though I think I
still want it.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com


From: "Zeugswetter Andreas DCP SD" <ZeugswetterA(at)spardat(dot)at>
To: "Alvaro Herrera" <alvherre(at)commandprompt(dot)com>, "Simon Riggs" <simon(at)2ndquadrant(dot)com>
Cc: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, <pgsql-hackers(at)postgreSQL(dot)org>
Subject: Re: [PATCHES] Non-transactional pg_class, try 2
Date: 2006-06-27 08:04:25
Message-ID: E1539E0ED7043848906A8FF995BDA579011F00C8@m0143.s-mxs.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches


Very nice explanation, thanks Alvaro.

> 2. Mark frozen databases specially somehow.
> To mark databases frozen, we need a way to mark tables as frozen.
> How do we do that? As I explain below, this allows some nice
> optimizations, but it's a very tiny can full of a huge amount of
> worms.
>
> Marking a Table Frozen
> ======================
>
> Marking a table frozen is simple as setting relminxid =
> FrozenXid for a table. As explained above, this cannot be
> done in a regular postmaster environment, because a
> concurrent transaction could be doing nasty stuff to a table.
> So we can do it only in a standalone backend.

Unless you lock the table exclusively during vacuum, that could be done
with
vacuum freeze. I like that more, than changing stuff that is otherwise
completely
frozen/static. (I see you wrote that below)

> On the other hand, a "frozen" table must be marked with
> relminxid = a-regular-Xid as soon as a transaction writes
> some tuples on it. Note that this "unfreezing" must take
> place even if the offending transaction is aborted, because
> the Xid is written in the table nevertheless and thus it
> would be incorrect to lose the unfreezing.

The other idea was to need a special unfreeze command ...

>
> This is how pg_class_nt came into existence -- it would be a
> place where information about a table would be stored and not
> subject to the rolling back of the transaction that wrote it.

Oh, that puts it in another league, since it must guarantee commit.
I am not sure we can do that. The previous discussion was about
concurrency and data that was not so important like tuple count.

In short:
- I'd start with #1 (no relminxid = FrozenXid) like Tom
suggested
- and then implement FREEZE/UNFREEZE with exclusive locks
like Simon wrote (so it does not need pg_class_nt) and use that
for the templates.

Simon wrote:
> Suggest that we prevent write operations on Frozen tables by revoking
all INSERT, UPDATE or DELETE rights held, then enforcing a check during
GRANT to prevent them being re-enabled. Superusers would need to check
every time. If we dont do this, then we will have two contradictory
states marked in the catalog - privilges saying Yes and freezing saying
No.

No, I'd not mess with the permissions and return a different error when
trying to
modify a frozen table. (It would also be complicated to unfreeze after
create database)
We should make it clear, that freezing is no replacement for revoke.

Andreas


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Zeugswetter Andreas DCP SD <ZeugswetterA(at)spardat(dot)at>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: [PATCHES] Non-transactional pg_class, try 2
Date: 2006-06-27 08:08:35
Message-ID: 1151395715.2691.1623.camel@localhost.localdomain
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

On Tue, 2006-06-27 at 10:04 +0200, Zeugswetter Andreas DCP SD wrote:

> Simon wrote:
> > Suggest that we prevent write operations on Frozen tables by revoking
> all INSERT, UPDATE or DELETE rights held, then enforcing a check during
> GRANT to prevent them being re-enabled. Superusers would need to check
> every time. If we dont do this, then we will have two contradictory
> states marked in the catalog - privilges saying Yes and freezing saying
> No.
>
> No, I'd not mess with the permissions and return a different error when
> trying to
> modify a frozen table. (It would also be complicated to unfreeze after
> create database)
> We should make it clear, that freezing is no replacement for revoke.

That was with a mind to performance. Checking every INSERT, UPDATE and
DELETE statement to see if they are being done against a frozen table
seems like a waste.

There would still be a specific error message for frozen tables, just on
the GRANT rather than the actual DML statements.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com


From: "Zeugswetter Andreas DCP SD" <ZeugswetterA(at)spardat(dot)at>
To: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
Cc: "Alvaro Herrera" <alvherre(at)commandprompt(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, <pgsql-hackers(at)postgreSQL(dot)org>
Subject: Re: [PATCHES] Non-transactional pg_class, try 2
Date: 2006-06-27 08:57:15
Message-ID: E1539E0ED7043848906A8FF995BDA579011F00E1@m0143.s-mxs.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches


> > > Suggest that we prevent write operations on Frozen tables by
> > > revoking
> > all INSERT, UPDATE or DELETE rights held, then enforcing a check
> > during GRANT to prevent them being re-enabled. Superusers would need

> > to check every time. If we dont do this, then we will have two
> > contradictory states marked in the catalog - privilges saying Yes
and
> > freezing saying No.
> >
> > No, I'd not mess with the permissions and return a different error
> > when trying to modify a frozen table. (It would also be complicated
to
> > unfreeze after create database) We should make it clear, that
freezing
> > is no replacement for revoke.
>
> That was with a mind to performance. Checking every INSERT,
> UPDATE and DELETE statement to see if they are being done
> against a frozen table seems like a waste.

I'd think we would have relminxid in the relcache, so I don't buy the
performance argument :-) (You could still do the actual check in the
same place where the permission is checked)

> There would still be a specific error message for frozen
> tables, just on the GRANT rather than the actual DML statements.

I'd still prefer to see the error on modify. Those that don't can
revoke.

Andreas


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Zeugswetter Andreas DCP SD" <ZeugswetterA(at)spardat(dot)at>
Cc: "Simon Riggs" <simon(at)2ndquadrant(dot)com>, "Alvaro Herrera" <alvherre(at)commandprompt(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCHES] Non-transactional pg_class, try 2
Date: 2006-06-27 13:58:02
Message-ID: 12797.1151416682@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

"Zeugswetter Andreas DCP SD" <ZeugswetterA(at)spardat(dot)at> writes:
>> That was with a mind to performance. Checking every INSERT,
>> UPDATE and DELETE statement to see if they are being done
>> against a frozen table seems like a waste.

> I'd think we would have relminxid in the relcache, so I don't buy the
> performance argument :-)

Me either. Further, auto-revoking permissions loses information.
I think that idea is an ugly kluge.

Anyway, the bottom line here seems to be that we should forget about
pg_class_nt and just keep the info in pg_class; there's not sufficient
justification to build the infrastructure needed for a nontransactional
auxiliary catalog. This implies the following conclusions:

* template0 has to be vacuumed against wraparound, same as any other
database.

* To support frozen tables, "VACUUM FREEZE" and "ALTER TABLE UNFREEZE"
would need to be explicit commands taking ExclusiveLock, and can't be
nested inside transaction blocks either. Automatic unfreeze upon an
updating command isn't possible.

Neither of these are bad enough to justify pg_class_nt --- in fact,
I'd argue that explicit unfreeze is better than automatic anyway.
So it was a cute idea, but its time hasn't come.

regards, tom lane


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-patches(at)postgresql(dot)org
Subject: Re: [HACKERS] Non-transactional pg_class, try 2
Date: 2006-06-28 23:44:09
Message-ID: 20060628234408.GA460@surnet.cl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Alvaro Herrera wrote:
> Tom Lane wrote:
> > Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> > > What I'm after is not freezing for read-only media, nor archive, nor
> > > read-only tables. What I'm after is removing the requirement that all
> > > databases must be vacuumed wholly every 2 billion transactions.
> >
> > Well, if that's the only goal then I hardly think we need to have a
> > discussion, because your alternative #1 is *obviously* the winner:
> >
> > > 1. Remove the special case, i.e., process frozen databases in VACUUM
> > > like every other database.
> > > This is the easiest, because no extra logic is needed. Just make
> > > sure they are vacuumed in time. The only problem would be that we'd
> > > need to uselessly vacuum tables that we know are frozen, from time to
> > > time. But then, those tables are probably small, so what's the
> > > problem with that?
>
> I'm happy to do at least this for 8.2. We can still try to do the
> non-transactional catalog later, either in this release or the next; the
> code is almost there, and it'll be easier to discuss/design because
> we'll have taken the relminxid stuff out of the way.

I attach a patch pursuant to this idea (lacking doc patches for the
maintenance section.)

This patch has the nasty side effect mentioned above -- people will have
to set template0 as connectable and manually run vacuum on it
periodically, unless they run autovacuum.

A future improvement in this area would be to allow frozen tables and
frozen databases, removing this requirement. But I'm inclined to apply
it as is, in the spirit of incremental improvement. Objectors please
speak up!

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-patches(at)postgresql(dot)org
Subject: Re: [HACKERS] Non-transactional pg_class, try 2
Date: 2006-06-28 23:49:54
Message-ID: 20060628234954.GC460@surnet.cl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Alvaro Herrera wrote:

> I attach a patch pursuant to this idea (lacking doc patches for the
> maintenance section.)

Huh, really attached, sorry :-(

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Attachment Content-Type Size
relminxid-5.patch text/plain 73.0 KB

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-patches(at)postgresql(dot)org
Subject: Re: [HACKERS] Non-transactional pg_class, try 2
Date: 2006-06-29 00:08:21
Message-ID: 9109.1151539701@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> This patch has the nasty side effect mentioned above -- people will have
> to set template0 as connectable and manually run vacuum on it
> periodically, unless they run autovacuum.

That's pretty messy --- making template0 connectable is a great way to
shoot yourself in the foot. What I'd propose instead is that even if
autovacuum is nominally off, the system forces autovacuum when it
notices that a non-connectable database is approaching wraparound.
In this mode the autovac daemon would be restricted to processing
non-connectable databases. (The subtext here is that autovac is the
wave of the future anyway.)

In fact, maybe we should just force an autovac cycle for any DB that
appears to be approaching wraparound, rather than waiting for the
shutdown-before-wraparound code to kick in. Getting into that state
amounts to whacking DBAs upside the head for being stupid, which
doesn't really win us any friends ...

Implementation-wise, I'd propose that we add another PostmasterSignal
event type whereby a backend could request the postmaster to launch
an autovac process even if autovac is off. The end-of-VACUUM code that
scans pg_database.datminxid would issue the signal if it finds anything
seriously old.

regards, tom lane


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, pgsql-patches(at)postgresql(dot)org
Subject: Re: [HACKERS] Non-transactional pg_class, try 2
Date: 2006-06-29 08:39:27
Message-ID: 1151570367.2691.1964.camel@localhost.localdomain
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

On Wed, 2006-06-28 at 20:08 -0400, Tom Lane wrote:
> In fact, maybe we should just force an autovac cycle for any DB that
> appears to be approaching wraparound, rather than waiting for the
> shutdown-before-wraparound code to kick in. Getting into that state
> amounts to whacking DBAs upside the head for being stupid, which
> doesn't really win us any friends ...

Yes, please can we have the auto autovacuum cut in rather than the
wraparound message? I'd rather have them complain that we did this, than
complain that we didn't.

Normally, I wouldn't support automatically starting admin tasks without
thr sysadmins knowledge.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-patches(at)postgresql(dot)org
Subject: Re: [HACKERS] Non-transactional pg_class, try 2
Date: 2006-06-29 20:37:36
Message-ID: 20060629203736.GB8591@surnet.cl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Tom Lane wrote:

> In fact, maybe we should just force an autovac cycle for any DB that
> appears to be approaching wraparound, rather than waiting for the
> shutdown-before-wraparound code to kick in. Getting into that state
> amounts to whacking DBAs upside the head for being stupid, which
> doesn't really win us any friends ...

Sounds fine. How far back should we allow databases to go? If we wait
too long, pg_clog won't be truncated regularly, so I think we should do
it rather early than wait until it's close to wraparound.

> Implementation-wise, I'd propose that we add another PostmasterSignal
> event type whereby a backend could request the postmaster to launch
> an autovac process even if autovac is off. The end-of-VACUUM code that
> scans pg_database.datminxid would issue the signal if it finds anything
> seriously old.

I think we could give autovac a "reason for being started", which would
normally be the periodic stuff, but if the postmaster got the signal
from a backend, pass that info to autovac and it could use a different
database selection algorithm -- say, just select the oldest database,
even if it's not in danger of Xid wraparound. So this would allow early
database-wide vacuums for non-connectable databases (template0), and
normal per-table vacuuming for database that are in actual use.

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, pgsql-patches(at)postgresql(dot)org
Subject: Re: [HACKERS] Non-transactional pg_class, try 2
Date: 2006-06-29 21:37:41
Message-ID: 20060629213740.GP17241@pervasive.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

On Thu, Jun 29, 2006 at 09:39:27AM +0100, Simon Riggs wrote:
> On Wed, 2006-06-28 at 20:08 -0400, Tom Lane wrote:
> > In fact, maybe we should just force an autovac cycle for any DB that
> > appears to be approaching wraparound, rather than waiting for the
> > shutdown-before-wraparound code to kick in. Getting into that state
> > amounts to whacking DBAs upside the head for being stupid, which
> > doesn't really win us any friends ...
>
> Yes, please can we have the auto autovacuum cut in rather than the
> wraparound message? I'd rather have them complain that we did this, than
> complain that we didn't.
>
> Normally, I wouldn't support automatically starting admin tasks without
> thr sysadmins knowledge.

I think it'd be good to put a big, fat WARNING in the log if we fire up
an autovac to avoid an XID wrap, since it's an indication that the
vacuuming scheme that's in place probably isn't good enough.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby(at)pervasive(dot)com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, pgsql-patches(at)postgresql(dot)org
Subject: Re: [HACKERS] Non-transactional pg_class, try 2
Date: 2006-06-29 22:01:08
Message-ID: 17831.1151618468@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

"Jim C. Nasby" <jnasby(at)pervasive(dot)com> writes:
> I think it'd be good to put a big, fat WARNING in the log if we fire up
> an autovac to avoid an XID wrap, since it's an indication that the
> vacuuming scheme that's in place probably isn't good enough.

No, for nonconnectable databases it'd be expected behavior (given the
proposed patch). Warn away if the db is connectable, though ...

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-patches(at)postgresql(dot)org
Subject: Re: [HACKERS] Non-transactional pg_class, try 2
Date: 2006-06-29 23:34:55
Message-ID: 18246.1151624095@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> I think we could give autovac a "reason for being started", which would
> normally be the periodic stuff, but if the postmaster got the signal
> from a backend, pass that info to autovac and it could use a different
> database selection algorithm -- say, just select the oldest database,
> even if it's not in danger of Xid wraparound.

I don't think it's worth the trouble to provide such a signaling
mechanism (it'd be a bit of a PITA to make it work in both fork and
EXEC_BACKEND cases). ISTM it's sufficient for autovac to look at the
GUC state and notice whether it's nominally enabled or not. If not,
it should only consider anti-wraparound vacuum operations.

If the signal is given just when VACUUM sees a datvacuumxid value that's
old enough to cause autovac to decide it had better go prevent
wraparound, then this will work without any additional data transfer.
We could negotiate exactly how old DBs need to be to cause this; if you
want to make it less than a billion xacts so that we can keep pg_clog
cut down to size, that's fine with me.

regards, tom lane


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-patches(at)postgresql(dot)org
Subject: Re: [HACKERS] Non-transactional pg_class, try 2
Date: 2006-07-10 16:30:15
Message-ID: 20060710163014.GA25554@surnet.cl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Tom Lane wrote:
> Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> > I think we could give autovac a "reason for being started", which would
> > normally be the periodic stuff, but if the postmaster got the signal
> > from a backend, pass that info to autovac and it could use a different
> > database selection algorithm -- say, just select the oldest database,
> > even if it's not in danger of Xid wraparound.
>
> I don't think it's worth the trouble to provide such a signaling
> mechanism (it'd be a bit of a PITA to make it work in both fork and
> EXEC_BACKEND cases). ISTM it's sufficient for autovac to look at the
> GUC state and notice whether it's nominally enabled or not. If not,
> it should only consider anti-wraparound vacuum operations.
>
> If the signal is given just when VACUUM sees a datvacuumxid value that's
> old enough to cause autovac to decide it had better go prevent
> wraparound, then this will work without any additional data transfer.
> We could negotiate exactly how old DBs need to be to cause this; if you
> want to make it less than a billion xacts so that we can keep pg_clog
> cut down to size, that's fine with me.

Committed. I didn't do anything but the simplest stuff; autovacuum
checks whether it's enabled in GUC, and if it doesn't then it just picks
the oldest database and issues a database-wide vacuum. The
vac_truncate_clog routine will send a signal at the same time as the
WARNING about nearing Xid wraparound would be emitted.

One funny thing I noticed is that if there is more than one database
needing db-wide vacuum, vacuum will send a signal; autovacuum will
process one database; and autovacuum will send a signal as well when
it's done because of the other databases. But autovacuum will get the
second signal and do nothing, because of the code to check for frequent
autovacuum starts. This kind of suprised me at first but it's really to
be expected. I considered disabling that timing code in the case of
getting the signal, but I don't think it's worth it.

One important improvement we'd may want to do is changing vacuum so that
it only calls vac_truncate_clog once if invoked by autovacuum (currently
it'll be called every time a table is vacuumed). Also I think we could
change the OldestXmin stuff so that it only takes database-local
transactions into account for non-shared tables.

But in the spirit of incremental improvement I think we're much better
now.

Happy sprinting to everyone,

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.