Re: [PATCHES] CLUSTER not lose indexes

Lists: pgsql-hackerspgsql-patches
From: Alvaro Herrera <alvherre(at)atentus(dot)com>
To: pgsql-patches(at)postgresql(dot)org
Subject: CLUSTER not lose indexes
Date: 2002-07-05 02:54:49
Message-ID: Pine.LNX.4.44.0207042235580.16373-200000@cm-lcon-46-187.cm.vtr.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Hackers:

I've modified commands/cluster.c so that it recreates the indexes on the
table after clustering the table. I attach the patch.

There are (of course) things I don't understand. For example, whether
(or when) I should use CommandCounterIncrement() after each
index_create, or if I should call setRelhasindex() only once (and not
once per index); or whether I need to acquire some lock on the indexes.

I tested it with one table and several indexes. Truth is I don't know
how to test for concurrency, or if it's worth the trouble.

The purpose of this experiment (and, I hope, more to follow) is to
familiarize myself with the guts of PostgreSQL, so I can work on my CS
thesis with it. If you can point me my misconceptions I'd be happy to
try again (and again, and...)

Thank you.

--
Alvaro Herrera (<alvherre[a]atentus.com>)
"La Primavera ha venido. Nadie sabe como ha sido" (A. Machado)

Attachment Content-Type Size
cluster.patch text/plain 6.9 KB

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)atentus(dot)com>
Cc: pgsql-patches(at)postgresql(dot)org
Subject: Re: CLUSTER not lose indexes
Date: 2002-07-05 03:44:08
Message-ID: 7952.1025840648@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Alvaro Herrera <alvherre(at)atentus(dot)com> writes:
> There are (of course) things I don't understand. For example, whether
> (or when) I should use CommandCounterIncrement() after each
> index_create, or if I should call setRelhasindex() only once (and not
> once per index); or whether I need to acquire some lock on the indexes.

I think you probably want a CommandCounterIncrement at the bottom of the
loop (after setRelhasindex). If it works as-is it's just by chance,
ie due to internal CCI calls in index_create.

Locking newly-created indexes is not really necessary, since no one else
can see them until you commit anyhow.

+ tuple = SearchSysCache(RELOID, ObjectIdGetDatum(attrs->indexOID),
+ 0, 0, 0);
+ if (!HeapTupleIsValid(tuple))
+ break;

Breaking out of the loop hardly seems an appropriate response to this
failure condition. Not finding the index' pg_class entry is definitely
an error.

I'd also suggest more-liberal commenting, as well as more attention to
updating the existing comments to match new reality.

In general, I'm not thrilled about expending more code on the existing
fundamentally-broken implementation of CLUSTER. We need to look at
making use of the ability to write a new version of a table (or index)
under a new relfilenode value, without changing the table's OID.
However, some parts of your patch will probably still be needed when
someone gets around to making that happen, so I won't object for now.

regards, tom lane


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)atentus(dot)com>
Cc: pgsql-patches(at)postgresql(dot)org
Subject: Re: CLUSTER not lose indexes
Date: 2002-07-05 05:06:38
Message-ID: 200207050506.g6556cZ06097@candle.pha.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Alvaro Herrera wrote:
> Hackers:
>
> I've modified commands/cluster.c so that it recreates the indexes on the
> table after clustering the table. I attach the patch.
>
> There are (of course) things I don't understand. For example, whether
> (or when) I should use CommandCounterIncrement() after each
> index_create, or if I should call setRelhasindex() only once (and not
> once per index); or whether I need to acquire some lock on the indexes.
>
> I tested it with one table and several indexes. Truth is I don't know
> how to test for concurrency, or if it's worth the trouble.
>
> The purpose of this experiment (and, I hope, more to follow) is to
> familiarize myself with the guts of PostgreSQL, so I can work on my CS
> thesis with it. If you can point me my misconceptions I'd be happy to
> try again (and again, and...)

I think Tom was suggesting that you may want to continue work on CLUSTER
and make use of relfilenode. After the cluster, you can just update
pg_class.relfilenode with the new file name (random oid generated at
build time) and as soon as you commit, all backends will start using the
new file and you can delete the old one.

The particular case we would like to improve is this:

/* Destroy old heap (along with its index) and rename new. */
heap_drop_with_catalog(OIDOldHeap, allowSystemTableMods);

CommandCounterIncrement();

renamerel(OIDNewHeap, oldrelation->relname);

In this code, we delete the old relation, then rename the new one. It
would be good to have this all happen in one update of
pg_class.relfilenode; that way it is an atomic operation.

So, create a heap (in the temp namespace so it is deleted on crash),
copy the old heap into the new file in cluster order, and when you are
done, point the old pg_class relfilenode at the new clustered heap
filename, then point the new cluster heap pg_class at the old heap file,
and then drop the cluster heap file; that will remove the _old_ file
(I believe on commit) and you are ready to go.

So, you are basically creating a new heap, but at finish, the new heap's
pg_class and the old heap's file go away. I thought about doing it
without creating the pg_class entry for the new heap, but the code
really wants to have a heap it can manipulate.

Same with index rebuilding, I think. The indexes are built on the new
heap, and the relfilenode swap works just the same.

Let us know if you want to pursue that and we can give you additional
assistance. I would like to see CLUSTER really polished up. More
people are using it now and it really needs someone to focus on it.

Glad to see you working on CLUSTER. Welcome aboard.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026


From: Alvaro Herrera <alvherre(at)atentus(dot)com>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: Alvaro Herrera <alvherre(at)atentus(dot)com>, <pgsql-patches(at)postgresql(dot)org>
Subject: Re: CLUSTER not lose indexes
Date: 2002-07-05 05:34:38
Message-ID: Pine.LNX.4.44.0207050123080.19948-100000@cm-lcon-46-187.cm.vtr.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Bruce Momjian dijo:

> Alvaro Herrera wrote:
> > Hackers:
> >
> > I've modified commands/cluster.c so that it recreates the indexes on the
> > table after clustering the table. I attach the patch.

> I think Tom was suggesting that you may want to continue work on CLUSTER
> and make use of relfilenode. After the cluster, you can just update
> pg_class.relfilenode with the new file name (random oid generated at
> build time) and as soon as you commit, all backends will start using the
> new file and you can delete the old one.

I'm looking at pg_class, and relfilenode is equal to oid in all cases
AFAICS. If I create a new, "random" relfilenode, the equality will not
hold. Is that OK? Also, is the new relfilenode somehow guaranteed to
not be assigned to another relation (pg_class tuple, I think)?

> In this code, we delete the old relation, then rename the new one. It
> would be good to have this all happen in one update of
> pg_class.relfilenode; that way it is an atomic operation.

OK, I'll see if I can do that.

> Let us know if you want to pursue that and we can give you additional
> assistance. I would like to see CLUSTER really polished up. More
> people are using it now and it really needs someone to focus on it.

I thought CLUSTER was meant to be replaced by automatically clustering
indexes. Isn't that so? Is anyone working on that?

> Glad to see you working on CLUSTER. Welcome aboard.

Thank you. I'm really happy to be able to work in Postgres.

--
Alvaro Herrera (<alvherre[a]atentus.com>)
"Limitate a mirar... y algun dia veras"


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)atentus(dot)com>
Cc: pgsql-patches(at)postgresql(dot)org
Subject: Re: CLUSTER not lose indexes
Date: 2002-07-05 05:45:09
Message-ID: 200207050545.g655j9s08777@candle.pha.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Alvaro Herrera wrote:
> Bruce Momjian dijo:
>
> > Alvaro Herrera wrote:
> > > Hackers:
> > >
> > > I've modified commands/cluster.c so that it recreates the indexes on the
> > > table after clustering the table. I attach the patch.
>
> > I think Tom was suggesting that you may want to continue work on CLUSTER
> > and make use of relfilenode. After the cluster, you can just update
> > pg_class.relfilenode with the new file name (random oid generated at
> > build time) and as soon as you commit, all backends will start using the
> > new file and you can delete the old one.
>
> I'm looking at pg_class, and relfilenode is equal to oid in all cases
> AFAICS. If I create a new, "random" relfilenode, the equality will not
> hold. Is that OK? Also, is the new relfilenode somehow guaranteed to

Yes, they are all the same only because we haven't gotten CLUSTER fixed
yet. ;-) We knew we needed to make cases where they are not the same
specifically for cases like CLUSTER.

> not be assigned to another relation (pg_class tuple, I think)?

It will be fine. The relfilenode will be the one assigned to the new
heap table and will not be used by others.

>
>
> > In this code, we delete the old relation, then rename the new one. It
> > would be good to have this all happen in one update of
> > pg_class.relfilenode; that way it is an atomic operation.
>
> OK, I'll see if I can do that.
>
> > Let us know if you want to pursue that and we can give you additional
> > assistance. I would like to see CLUSTER really polished up. More
> > people are using it now and it really needs someone to focus on it.
>
> I thought CLUSTER was meant to be replaced by automatically clustering
> indexes. Isn't that so? Is anyone working on that?

We have no idea how to automatically cluster based on an index. Not
even sure how the commercial guys do it.

> > Glad to see you working on CLUSTER. Welcome aboard.
>
> Thank you. I'm really happy to be able to work in Postgres.

Great.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026


From: Alvaro Herrera <alvherre(at)atentus(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-patches(at)postgresql(dot)org
Subject: Re: CLUSTER not lose indexes
Date: 2002-07-05 06:28:47
Message-ID: Pine.LNX.4.44.0207050216160.19948-200000@cm-lcon-46-187.cm.vtr.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Tom Lane dijo:

> I think you probably want a CommandCounterIncrement at the bottom of the
> loop (after setRelhasindex). If it works as-is it's just by chance,
> ie due to internal CCI calls in index_create.

Done.

> + tuple = SearchSysCache(RELOID, ObjectIdGetDatum(attrs->indexOID),
> + 0, 0, 0);
> + if (!HeapTupleIsValid(tuple))
> + break;
>
> Breaking out of the loop hardly seems an appropriate response to this
> failure condition. Not finding the index' pg_class entry is definitely
> an error.

Sure. elog(ERROR) now. I'm not sure what was I thinking when I wrote
that.

> I'd also suggest more-liberal commenting, as well as more attention to
> updating the existing comments to match new reality.

I'm afraid I cannot get too verbose no matter how hard I try. I hope
this one is OK.

> In general, I'm not thrilled about expending more code on the existing
> fundamentally-broken implementation of CLUSTER. We need to look at
> making use of the ability to write a new version of a table (or index)
> under a new relfilenode value, without changing the table's OID.
> However, some parts of your patch will probably still be needed when
> someone gets around to making that happen, so I won't object for now.

Will try to do this.

--
Alvaro Herrera (<alvherre[a]atentus.com>)
Licensee shall have no right to use the Licensed Software
for productive or commercial use. (Licencia de StarOffice 6.0 beta)

Attachment Content-Type Size
cluster.patch text/plain 6.7 KB

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)atentus(dot)com>
Cc: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, pgsql-patches(at)postgresql(dot)org
Subject: Re: CLUSTER not lose indexes
Date: 2002-07-05 14:27:05
Message-ID: 10821.1025879225@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Alvaro Herrera <alvherre(at)atentus(dot)com> writes:
> I'm looking at pg_class, and relfilenode is equal to oid in all cases
> AFAICS. If I create a new, "random" relfilenode, the equality will not
> hold. Is that OK?

The idea is that OID is the logical table identifier and relfilenode is
the physical identifier (so relfilenode is what should be used in bufmgr
and below). There are undoubtedly a few places that get this wrong, but
we won't be able to ferret them out until someone starts to exercise
the facility.

> Also, is the new relfilenode somehow guaranteed to
> not be assigned to another relation (pg_class tuple, I think)?

I've been wondering about that myself. We might have to add a unique
index on pg_class.relfilenode to ensure this; otherwise, after OID
wraparound there would be no guarantees.

>> In this code, we delete the old relation, then rename the new one. It
>> would be good to have this all happen in one update of
>> pg_class.relfilenode; that way it is an atomic operation.

As long as you have not committed, it's atomic anyway because no one can
see your updates. It'd be nice to do it in one update for efficiency,
but don't contort the code beyond reason to achieve that.

regards, tom lane


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alvaro Herrera <alvherre(at)atentus(dot)com>, pgsql-patches(at)postgresql(dot)org
Subject: Re: CLUSTER not lose indexes
Date: 2002-07-05 18:08:57
Message-ID: 200207051808.g65I8w226685@candle.pha.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Tom Lane wrote:
> > Also, is the new relfilenode somehow guaranteed to
> > not be assigned to another relation (pg_class tuple, I think)?
>
> I've been wondering about that myself. We might have to add a unique
> index on pg_class.relfilenode to ensure this; otherwise, after OID
> wraparound there would be no guarantees.

Yep, good point.

> >> In this code, we delete the old relation, then rename the new one. It
> >> would be good to have this all happen in one update of
> >> pg_class.relfilenode; that way it is an atomic operation.
>
> As long as you have not committed, it's atomic anyway because no one can
> see your updates. It'd be nice to do it in one update for efficiency,
> but don't contort the code beyond reason to achieve that.

Sorry, I meant to say that we added relfilenode for exactly this case,
so that we have atomic file access semantics. Do we already have that
feature in the current code?

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)atentus(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-patches(at)postgresql(dot)org
Subject: Re: CLUSTER not lose indexes
Date: 2002-07-05 20:12:27
Message-ID: 200207052012.g65KCR215377@candle.pha.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Alvaro Herrera wrote:
> > In general, I'm not thrilled about expending more code on the existing
> > fundamentally-broken implementation of CLUSTER. We need to look at
> > making use of the ability to write a new version of a table (or index)
> > under a new relfilenode value, without changing the table's OID.
> > However, some parts of your patch will probably still be needed when
> > someone gets around to making that happen, so I won't object for now.

OK, I remember now. In renamerel(), the new clustered file is renamed
to the old table name. However, it has a new oid. The idea of
relfilenode was that we want to cluster the table and keep the same
relation oid. That way, other system tables that reference that OID
will still work.

Seems like renamerel will have to stay because it is used by ALTER TABLE
RENAME, so we just need some new code that updates the relfilenode of
the old pg_class row to point to the new clustered file. Swapping
relfilenodes between the old and new pg_class rows and deleting the new
table should do the trick of deleting the non-clustered file and the
temp pg_class row at the same time.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: Alvaro Herrera <alvherre(at)atentus(dot)com>, pgsql-patches(at)postgresql(dot)org
Subject: Re: CLUSTER not lose indexes
Date: 2002-07-06 15:11:20
Message-ID: 22491.1025968280@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> Seems like renamerel will have to stay because it is used by ALTER TABLE
> RENAME, so we just need some new code that updates the relfilenode of
> the old pg_class row to point to the new clustered file. Swapping
> relfilenodes between the old and new pg_class rows and deleting the new
> table should do the trick of deleting the non-clustered file and the
> temp pg_class row at the same time.

I think you're still letting your thinking be contorted by the existing
CLUSTER implementation. Do we need a temp pg_class entry at all? Seems
like we just want to UPDATE the pg_class row with the new relfilenode
value; then we can see the update but no one else can (till we commit).
Ditto for the indexes.

What's still a little unclear to me is how to access the old heap and
index files to read the data while simultaneously accessing the new ones
to write it. Much of the existing code likes to have a Relation struct
available, not only a RelFileNode, so it may be necessary to have both
old and new Relations present at the same time. If that's the case we
might be stuck with making a temp pg_class entry just to support a phony
Relation :-(

regards, tom lane


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alvaro Herrera <alvherre(at)atentus(dot)com>, pgsql-patches(at)postgresql(dot)org
Subject: Re: CLUSTER not lose indexes
Date: 2002-07-06 16:24:56
Message-ID: 200207061624.g66GOun29971@candle.pha.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Tom Lane wrote:
> Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> > Seems like renamerel will have to stay because it is used by ALTER TABLE
> > RENAME, so we just need some new code that updates the relfilenode of
> > the old pg_class row to point to the new clustered file. Swapping
> > relfilenodes between the old and new pg_class rows and deleting the new
> > table should do the trick of deleting the non-clustered file and the
> > temp pg_class row at the same time.
>
> I think you're still letting your thinking be contorted by the existing
> CLUSTER implementation. Do we need a temp pg_class entry at all? Seems
> like we just want to UPDATE the pg_class row with the new relfilenode
> value; then we can see the update but no one else can (till we commit).
> Ditto for the indexes.
>
> What's still a little unclear to me is how to access the old heap and
> index files to read the data while simultaneously accessing the new ones
> to write it. Much of the existing code likes to have a Relation struct
> available, not only a RelFileNode, so it may be necessary to have both
> old and new Relations present at the same time. If that's the case we
> might be stuck with making a temp pg_class entry just to support a phony
> Relation :-(

Yes, that was my conclusion, that we need the temp heap so we can access
it in a clean manner. Sure, it would be nice if we could access a file
on its own, but it doesn't seem worth the complexity required to
accomplish it.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026


From: Alvaro Herrera <alvherre(at)atentus(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)atentus(dot)com>, <pgsql-patches(at)postgresql(dot)org>
Subject: Re: CLUSTER not lose indexes
Date: 2002-07-06 22:40:19
Message-ID: Pine.LNX.4.44.0207061832350.2262-100000@cm-lcon-46-187.cm.vtr.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Tom Lane dijo:

> Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> > Seems like renamerel will have to stay because it is used by ALTER TABLE
> > RENAME, so we just need some new code that updates the relfilenode of
> > the old pg_class row to point to the new clustered file. Swapping
> > relfilenodes between the old and new pg_class rows and deleting the new
> > table should do the trick of deleting the non-clustered file and the
> > temp pg_class row at the same time.
>
> I think you're still letting your thinking be contorted by the existing
> CLUSTER implementation. Do we need a temp pg_class entry at all? Seems
> like we just want to UPDATE the pg_class row with the new relfilenode
> value; then we can see the update but no one else can (till we commit).
> Ditto for the indexes.

That's what I originally thought: mess around directly with the smgr (or
with some upper layer? I don't know) to create a new relfilenode, and
then attach it to the heap. I don't know if it's possible or too
difficult.

Then, with Bruce's explanation, I thought I should just create a temp
table and exchange relfilenodes, which is much simpler.

--
Alvaro Herrera (<alvherre[a]atentus.com>)
"El dia que dejes de cambiar dejaras de vivir"


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)atentus(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-patches(at)postgresql(dot)org
Subject: Re: CLUSTER not lose indexes
Date: 2002-07-07 23:29:30
Message-ID: 200207072329.g67NTUC21748@candle.pha.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Alvaro Herrera wrote:
> > I think you're still letting your thinking be contorted by the existing
> > CLUSTER implementation. Do we need a temp pg_class entry at all? Seems
> > like we just want to UPDATE the pg_class row with the new relfilenode
> > value; then we can see the update but no one else can (till we commit).
> > Ditto for the indexes.
>
> That's what I originally thought: mess around directly with the smgr (or
> with some upper layer? I don't know) to create a new relfilenode, and
> then attach it to the heap. I don't know if it's possible or too
> difficult.
>
> Then, with Bruce's explanation, I thought I should just create a temp
> table and exchange relfilenodes, which is much simpler.

Yes, creating the file is easy. Inserting into a file without a heap
relation pointer is a pain. ;-)

Actually, I did a little research on the crash/abort handling of
swapping the relfilenodes between the old and clustered pg_class heap
entries. A abort/crash can happen in several ways- abort of the
transaction, backend crash during the transaction, or cleaning up of old
file at commit.

First, when you create a table, the relfilenode is _copied_ to
pendingDeletes. These files are cleaned up if the transaction aborts in
smgrDoPendingDeletes(). There will also be an entry in there for the
same table after you call heap_delete. As long as the relfilenode at
heap_delete time points to the original heap file, you should be fine.
If it commits, the new file is removed (remember that oid was saved).
If it aborts, the new file is removed (CLUSTER failed).

As far as a crash, there is RemoveTempRelations which calls
FindTempRelations() but I think that only gets populated after the
commit happens and the temp namespaces is populated. I assume the
aborted part of the transaction is invisible to the backend at that
point (crash time). Of course, it wouldn't hurt to test these cases to
make sure it works right, and I would document that the copying of
relfilenode in smgr create/unlink is a desired effect relied upon by
cluster and any other commands that may use relfilenode renaming in the
future. (If it wasn't copied, then an abort would delete the _old_
heap. Bad.)

FYI, RENAME also deals with relfilenode renaming in setNewRelfilenode().
The difference is that RENAME doesn't need to access the old index, just
build a new one, so it can take shortcuts in how it handles things. It
uses two methods to modify the tuple, one directly modifying pg_class,
rather than inserting a new heap row. and another of doing it the
traditional way. It does this because REINDEX is used to fix corrupt
indexes, including corrupt system indexes. You will not be using that
type of code in CLUSTER because there is a real temp heap associated
with this operation. Just heap_update() like normal for both relations.

Hope this helps.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026


From: Alvaro Herrera <alvherre(at)atentus(dot)com>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: Alvaro Herrera <alvherre(at)atentus(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, <pgsql-patches(at)postgresql(dot)org>
Subject: Re: CLUSTER not lose indexes
Date: 2002-07-11 02:41:48
Message-ID: Pine.LNX.4.44.0207102212270.23138-200000@cm-lcon1-46-187.cm.vtr.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Bruce Momjian dijo:

> FYI, RENAME also deals with relfilenode renaming in setNewRelfilenode().
> The difference is that RENAME doesn't need to access the old index, just
> build a new one, so it can take shortcuts in how it handles things. It
> uses two methods to modify the tuple, one directly modifying pg_class,
> rather than inserting a new heap row. and another of doing it the
> traditional way. It does this because REINDEX is used to fix corrupt
> indexes, including corrupt system indexes. You will not be using that
> type of code in CLUSTER because there is a real temp heap associated
> with this operation. Just heap_update() like normal for both relations.

Well, I think my approach is somewhat more naive. What I'm actually
doing is something like:

1. save information on extant indexes
2. create a new heap, and populate (clustered)
3. swap the relfilenodes of the new and old heaps
4. drop new heap (with old relfilenode)
5. for each index saved in 1:
5.1. create a new index with the same attributes
5.2. swap relfilenodes of original and new index
5.3. drop new index (with old relfilenode)

But now I'm lost. It has worked sometimes; then I change a minimal
thing, recompile and then it doesn't work (bufmgr fails an assertion).
I can tell that the new (table) filenode is correct if I skip step 4
above, and check the temp table manually (but of course it has no
indexes). I've never gotten as far as completing all steps right, so I
cannot tell whether the new indexes' filenodes are correct.

I'm posting the new patch. Please review it; I've turned it upside down
a few times and frankly don't know what's happening. I sure am
forgetting a lock or something but cannot find what is it.

--
Alvaro Herrera (<alvherre[a]atentus.com>)
"Aprender sin pensar es inutil; pensar sin aprender, peligroso" (Confucio)

Attachment Content-Type Size
cluster.patch text/plain 11.1 KB

From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alvaro Herrera <alvherre(at)atentus(dot)com>, Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PATCHES] CLUSTER not lose indexes
Date: 2002-07-15 22:54:09
Message-ID: Pine.LNX.4.44.0207130051430.28230-100000@localhost.localdomain
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Tom Lane writes:

> > Also, is the new relfilenode somehow guaranteed to
> > not be assigned to another relation (pg_class tuple, I think)?
>
> I've been wondering about that myself. We might have to add a unique
> index on pg_class.relfilenode to ensure this; otherwise, after OID
> wraparound there would be no guarantees.

I've never been happy with the current setup. It's much too tempting to
think file name = OID, both in the backend code and by external onlookers,
especially since it's currently rare/impossible(?) for them to be
different.

It would be a lot clearer if relfilenode were replaced by an integer
version, starting at 0, and the heap files were named "OID_VERSION".

(In related news, how about filling up the oid/relfilenode numbers with
zeros on the left, so a directory listing would reflect the numerical
order?)

--
Peter Eisentraut peter_e(at)gmx(dot)net


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)atentus(dot)com>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PATCHES] CLUSTER not lose indexes
Date: 2002-07-15 22:58:41
Message-ID: 200207152258.g6FMwfD16504@candle.pha.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Peter Eisentraut wrote:
> Tom Lane writes:
>
> > > Also, is the new relfilenode somehow guaranteed to
> > > not be assigned to another relation (pg_class tuple, I think)?
> >
> > I've been wondering about that myself. We might have to add a unique
> > index on pg_class.relfilenode to ensure this; otherwise, after OID
> > wraparound there would be no guarantees.
>
> I've never been happy with the current setup. It's much too tempting to
> think file name = OID, both in the backend code and by external onlookers,
> especially since it's currently rare/impossible(?) for them to be
> different.

Yea, only reindex and cluster change them. Problem is we already have
oid as a nice unique number ready for use. I don't see a huge advantage
of improving it.

> It would be a lot clearer if relfilenode were replaced by an integer
> version, starting at 0, and the heap files were named "OID_VERSION".

Problem there is that we can't have relfilenode as an int unless we take
the table oid and sequence number and merge them on the fly in the
backend. Would be nice for admins, though, so the oid would be there.
I thought WAL liked the relfilenode as a single number.

> (In related news, how about filling up the oid/relfilenode numbers with
> zeros on the left, so a directory listing would reflect the numerical
> order?)

Problem there is that we increase the size of much of the directory
lookups. Not sure if it is worth worrying about.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: Alvaro Herrera <alvherre(at)atentus(dot)com>, Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PATCHES] CLUSTER not lose indexes
Date: 2002-07-16 04:01:41
Message-ID: 28372.1026792101@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Peter Eisentraut <peter_e(at)gmx(dot)net> writes:
> It would be a lot clearer if relfilenode were replaced by an integer
> version, starting at 0, and the heap files were named "OID_VERSION".

The reason to not do that is that the bufmgr and levels below would now
need to pass around three numbers, not two, to identify physical files.

We already beat this topic to death a year ago, I'm not eager to re-open
it.

regards, tom lane


From: Curt Sampson <cjs(at)cynic(dot)net>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: Peter Eisentraut <peter_e(at)gmx(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)atentus(dot)com>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PATCHES] CLUSTER not lose indexes
Date: 2002-07-16 08:59:46
Message-ID: Pine.NEB.4.44.0207161758150.465-100000@angelic.cynic.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

On Mon, 15 Jul 2002, Bruce Momjian wrote:

> > (In related news, how about filling up the oid/relfilenode numbers with
> > zeros on the left, so a directory listing would reflect the numerical
> > order?)
>
> Problem there is that we increase the size of much of the directory
> lookups. Not sure if it is worth worrying about.

Probably not such a big deal, since most systems will be reading
a full block (8K or 16K under *BSD) to load the directory information
anyway. Doing it in hex would give you only 8-char filenames, anyway.

cjs
--
Curt Sampson <cjs(at)cynic(dot)net> +81 90 7737 2974 http://www.netbsd.org
Don't you know, in this new Dark Age, we're all light. --XTC


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Curt Sampson <cjs(at)cynic(dot)net>
Cc: Peter Eisentraut <peter_e(at)gmx(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)atentus(dot)com>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PATCHES] CLUSTER not lose indexes
Date: 2002-07-16 15:41:44
Message-ID: 200207161541.g6GFfiq26196@candle.pha.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Curt Sampson wrote:
> On Mon, 15 Jul 2002, Bruce Momjian wrote:
>
> > > (In related news, how about filling up the oid/relfilenode numbers with
> > > zeros on the left, so a directory listing would reflect the numerical
> > > order?)
> >
> > Problem there is that we increase the size of much of the directory
> > lookups. Not sure if it is worth worrying about.
>
> Probably not such a big deal, since most systems will be reading
> a full block (8K or 16K under *BSD) to load the directory information
> anyway. Doing it in hex would give you only 8-char filenames, anyway.

Yes, hex may be interesting as a more compact, consistent format. We
need to change the docs so oid2name and queries convert to hex on
output.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: Curt Sampson <cjs(at)cynic(dot)net>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Alvaro Herrera <alvherre(at)atentus(dot)com>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PATCHES] CLUSTER not lose indexes
Date: 2002-07-16 15:57:00
Message-ID: 12989.1026835020@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
>> (In related news, how about filling up the oid/relfilenode numbers with
>> zeros on the left, so a directory listing would reflect the numerical
>> order?)

> Yes, hex may be interesting as a more compact, consistent format. We
> need to change the docs so oid2name and queries convert to hex on
> output.

I don't really see the value-added here. If we had made this decision
before releasing 7.1, I'd not have objected; but at this point we're
talking about breaking oid2name and any similar scripts that people
may have developed, for what's really a *very* marginal gain. Who cares
whether a directory listing reflects numerical order?

regards, tom lane


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Curt Sampson <cjs(at)cynic(dot)net>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Alvaro Herrera <alvherre(at)atentus(dot)com>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PATCHES] CLUSTER not lose indexes
Date: 2002-07-16 15:59:07
Message-ID: 200207161559.g6GFx7128098@candle.pha.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Tom Lane wrote:
> Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> >> (In related news, how about filling up the oid/relfilenode numbers with
> >> zeros on the left, so a directory listing would reflect the numerical
> >> order?)
>
> > Yes, hex may be interesting as a more compact, consistent format. We
> > need to change the docs so oid2name and queries convert to hex on
> > output.
>
> I don't really see the value-added here. If we had made this decision
> before releasing 7.1, I'd not have objected; but at this point we're
> talking about breaking oid2name and any similar scripts that people
> may have developed, for what's really a *very* marginal gain. Who cares
> whether a directory listing reflects numerical order?

I don't see the big value either, just brainstorming.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026