Re: [WIP] In-place upgrade

Lists: pgsql-hackers
From: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: [WIP] In-place upgrade
Date: 2008-10-31 21:43:55
Message-ID: 490B7C1B.8050408@sun.com

This is really the first patch that is not just cleanup: it adds in-place upgrade
functionality. It depends on the other cleanup patches that I have already sent.
You can also find a GIT repository with a "workable" version.

The main point is that tuples are converted to the latest version in the SeqScan
and IndexScan nodes. All storage/access modules are able to process databases
from 8.1 to 8.4 (Page Layout 3 and 4).

What works:
- select - heap scan is OK, but index scan does not work on varlena datatypes. I
need to convert the index key somewhere in index access.

What does not work:
- conversion of tuples that contain arrays, composite datatypes, or toast
- vacuum - it tries to clean up old pages; it would probably be better to convert
them to the new format during processing...
- insert/delete/update

The patch contains a lot of extra comments and rubbish, but it is in the process
of cleanup.

What I need to know/solve:

1) yes/no for this kind of online upgrade method
2) I'm not sure whether the call to ExecStoreTuple is correct.
3) I'm still looking for the best place to store old data structures and
conversion functions. My idea is to create new directories:
src/include/odf/v03/...
src/backend/storage/upgrade/
src/backend/access/upgrade
(odf = On Disk Format)

Links:
http://git.postgresql.org/?p=~davidfetter/upgrade_in_place/.git;a=summary
http://src.opensolaris.org/source/xref/sfw/usr/src/cmd/postgres/postgresql-upgrade/

Thanks for your comments

Zdenek

--
Zdenek Kotala Sun Microsystems
Prague, Czech Republic http://sun.com/postgresql

Attachment Content-Type Size
inplaceupgrade.patch text/x-diff 80.1 KB

From: "Robert Haas" <robertmhaas(at)gmail(dot)com>
To: "Zdenek Kotala" <Zdenek(dot)Kotala(at)sun(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP] In-place upgrade
Date: 2008-11-03 04:22:16
Message-ID: 603c8f070811022022x582a9cbdp60798a6b87910edf@mail.gmail.com
Lists: pgsql-hackers

I tried to apply this patch to CVS HEAD and it blew up all over the
place. It doesn't seem to be intended to apply against CVS HEAD; for
example, I don't have backend/access/heap/htup.c at all, so can't
apply changes to that file. I was able to clone the GIT repository
with the following command...

git clone http://git.postgresql.org/git/~davidfetter/upgrade_in_place/.git

...but now I'm confused, because I don't see the changes from the diff
reflected in the resulting tree. As you can see, I am not a git
wizard. Any help would be appreciated.

Here are a few initial thoughts based mostly on reading the diff:

In the minor nit department, I don't really like the idea of
PageHeaderData_04, SizeOfPageHeaderData04, PageLayoutIsValid_04, etc.
I think the latest version should just be PageHeaderData and
SizeOfPageHeaderData, and previous versions should be, e.g.
PageHeaderDataV3. It looks to me like this would cut a few hunks out
of this and maybe make it a bit easier to understand what is going on.
At any rate, if we are going to stick with an explicit version number
in both versions, it should be marked in a consistent way, not _04
sometimes and just 04 other times. My suggestion is e.g. "V4" but
YMMV.

The changes to nodeIndexscan.c and nodeSeqscan.c are worrisome to me.
It looks like the added code is (nearly?) identical in both places, so
probably it needs to be refactored to avoid code duplication. I'm
also a bit skeptical about the idea of doing the tuple conversion
here. Why here rather than ExecStoreTuple()? If you decide to
convert the tuple, you can palloc the new one, pfree the old one if
shouldFree is set, and reset shouldFree to true.
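
A minimal sketch of that suggestion, not code from the patch. It assumes a
simplified stand-in tuple type, with malloc/free in place of palloc/pfree;
SketchTuple, sketch_convert_to_v4 and sketch_store_tuple are hypothetical names
used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified tuple; not the real HeapTuple layout. */
typedef struct SketchTuple
{
    int  version;   /* on-disk layout version: 3 or 4 */
    char data[16];  /* payload, copied unchanged in this sketch */
} SketchTuple;

/* Hypothetical converter: builds a newly allocated V4 copy of a V3 tuple. */
SketchTuple *
sketch_convert_to_v4(const SketchTuple *old_tuple)
{
    SketchTuple *new_tuple = malloc(sizeof(SketchTuple));

    if (new_tuple == NULL)
        abort();
    memcpy(new_tuple, old_tuple, sizeof(SketchTuple));
    new_tuple->version = 4;
    return new_tuple;
}

/*
 * Stand-in for a check at the top of ExecStoreTuple(). Returns the tuple the
 * slot should hold and updates *shouldFree to say whether the slot owns it.
 */
SketchTuple *
sketch_store_tuple(SketchTuple *tuple, bool *shouldFree)
{
    SketchTuple *converted;

    assert(tuple != NULL && shouldFree != NULL);

    if (tuple->version == 4)
        return tuple;               /* current layout: nothing to convert */

    converted = sketch_convert_to_v4(tuple);
    if (*shouldFree)
        free(tuple);                /* the old copy was ours to release */
    *shouldFree = true;             /* the converted copy always is */
    return converted;
}
```

The V4 case costs one integer comparison; all allocation happens only when an
old-format tuple is seen.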

I am pretty skeptical of the idea that all of the HeapTuple* functions
can just be conditionalized on the page version and everything will
Just Work. It seems like that is too low a level to be worrying about
such things. Even if it happens to work for the changes between V3
and V4, what happens when V5 or V6 is changed in such a way that the
answer to HeapTupleIsWhatever is neither "Yes" nor "No", but rather
"Maybe" or "Seven"? The performance hit also sounds painful. I don't
have a better idea right now though...

I think it's going to be absolutely imperative to begin vacuuming away
old V3 pages as quickly as possible after the upgrade. If you go with
the approach of converting the tuple in, or just before,
ExecStoreTuple, then you're going to introduce a lot of overhead when
working with V3 pages. I think that's fine. You should plan to do
your in-place upgrade at 1AM on Christmas morning (or whenever your
load hits rock bottom...) and immediately start converting the
database, starting with your most important and smallest tables. In
fact, I would look whenever possible for ways to make the V4 case a
fast-path and just accept that the system is going to labor a bit when
dealing with V3 stuff. Any overhead you introduce when dealing with
V3 pages can go away; any V4 overhead is permanent and therefore much
more difficult to accept.

That's about all I have for now... if you can give me some pointers on
working with this git repository, or provide a complete patch that
applies cleanly to CVS HEAD, I will try to look at this in more
detail.

...Robert


From: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP] In-place upgrade
Date: 2008-11-03 20:36:07
Message-ID: 490F60B7.4030409@sun.com
Lists: pgsql-hackers

Many thanks for the review.

Robert Haas wrote:
> I tried to apply this patch to CVS HEAD and it blew up all over the
> place. It doesn't seem to be intended to apply against CVS HEAD; for
> example, I don't have backend/access/heap/htup.c at all, so can't
> apply changes to that file.

You also need to apply two other patches, which are located here:
http://wiki.postgresql.org/wiki/CommitFestInProgress#Upgrade-in-place_and_related_issues
I moved one related patch from another category to the correct place here.

The problem is that it is difficult to keep it in sync with HEAD, because they
change a lot of things. That is the reason why I also put everything into a GIT
repository, but ...

> I was able to clone the GIT repository
> with the following command...
>
> git clone http://git.postgresql.org/git/~davidfetter/upgrade_in_place/.git
>
> ...but now I'm confused, because I don't see the changes from the diff
> reflected in the resulting tree. As you can see, I am not a git
> wizard. Any help would be appreciated.

I'm a GIT newbie; I use Mercurial for development and I applied the changes to
GIT manually. I asked David Fetter for help with getting a correct clone back.
In the meantime you can download a tarball.

http://git.postgresql.org/?p=~davidfetter/upgrade_in_place/.git;a=snapshot;h=c72bafada59ed278ffac59657c913bc375f77808;sf=tgz

It should contain everything, including yesterday's improvements (delete,
insert, and update work; insert/update only on tables without an index).

> Here are a few initial thoughts based mostly on reading the diff:
>
> In the minor nit department, I don't really like the idea of
> PageHeaderData_04, SizeOfPageHeaderData04, PageLayoutIsValid_04, etc.
> I think the latest version should just be PageHeaderData and
> SizeOfPageHeaderData, and previous versions should be, e.g.
> PageHeaderDataV3. It looks to me like this would cut a few hunks out
> of this and maybe make it a bit easier to understand what is going on.
> At any rate, if we are going to stick with an explicit version number
> in both versions, it should be marked in a consistent way, not _04
> sometimes and just 04 other times. My suggestion is e.g. "V4" but
> YMMV.

Yeah, that is the most difficult part :-) finding the correct names. I think
that each version of the structure should have a version suffix, including the
latest one. And of course the latest one should also have a general name
without a suffix - see the example:

typedef struct PageHeaderData_04 { ... } PageHeaderData_04;
typedef struct PageHeaderData_03 { ... } PageHeaderData_03;
typedef PageHeaderData_04 PageHeaderData;

This allows you to specify the exact version in places where you need it and
keep the general name where the version is not relevant.

What the suffix should look like is another question. I prefer 04 rather than
just 4. What about PageHeaderData_V04?

By the way, what does YMMV mean?

> The changes to nodeIndexscan.c and nodeSeqscan.c are worrisome to me.
> It looks like the added code is (nearly?) identical in both places, so
> probably it needs to be refactored to avoid code duplication. I'm
> also a bit skeptical about the idea of doing the tuple conversion
> here. Why here rather than ExecStoreTuple()? If you decide to
> convert the tuple, you can palloc the new one, pfree the old one if
> ShouldFree is set, and reset shouldFree to true.

Good point. I had considered it as one variant, and now that I look at it more
closely it is really a much better place. It should also fix the problem that
keeps REINDEX from working. I will move it.

> I am pretty skeptical of the idea that all of the HeapTuple* functions
> can just be conditionalized on the page version and everything will
> Just Work. It seems like that is too low a level to be worrying about
> such things. Even if it happens to work for the changes between V3
> and V4, what happens when V5 or V6 is changed in such a way that the
> answer to HeapTupleIsWhatever is neither "Yes" nor "No", but rather
> "Maybe" or "Seven"? The performance hit also sounds painful. I don't
> have a better idea right now though...

OK. Currently it works (or I hope that it works). If somebody in the future
invents some special change, I think that in most (maybe all) cases a mapping
will be possible.

Speed is the key point. When I last checked, I got a 1% performance drop on a
fresh database. I think 1% is a good price for in-place online upgrade.

> I think it's going to be absolutely imperative to begin vacuuming away
> old V3 pages as quickly as possible after the upgrade. If you go with
> the approach of converting the tuple in, or just before,
> ExecStoreTuple, then you're going to introduce a lot of overhead when
> working with V3 pages. I think that's fine. You should plan to do
> your in-place upgrade at 1AM on Christmas morning (or whenever your
> load hits rock bottom...) and immediately start converting the
> database, starting with your most important and smallest tables. In
> fact, I would look whenever possible for ways to make the V4 case a
> fast-path and just accept that the system is going to labor a bit when
> dealing with V3 stuff. Any overhead you introduce when dealing with
> V3 pages can go away; any V4 overhead is permanent and therefore much
> more difficult to accept.

Yes, the plan is to improve vacuum to convert old pages to new ones, but as a
second step. I already have page converter code. With some modification it could
be integrated easily into the vacuum code.

> That's about all I have for now... if you can give me some pointers on
> working with this git repository, or provide a complete patch that
> applies cleanly to CVS HEAD, I will try to look at this in more
> detail.

Thanks for your comments. Try the snapshot link; I hope that it will work.

Zdenek

PS: I'm sorry about the response time, but I'm in training this week.

--
Zdenek Kotala Sun Microsystems
Prague, Czech Republic http://sun.com/postgresql


From: "Robert Haas" <robertmhaas(at)gmail(dot)com>
To: "Zdenek Kotala" <Zdenek(dot)Kotala(at)sun(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP] In-place upgrade
Date: 2008-11-04 01:20:31
Message-ID: 603c8f070811031720n75506fcfx68b682e21eb8984f@mail.gmail.com
Lists: pgsql-hackers

> You also need to apply two other patches, which are located here:
> http://wiki.postgresql.org/wiki/CommitFestInProgress#Upgrade-in-place_and_related_issues
> I moved one related patch from another category to the correct place here.

Just to confirm, which two?

> http://git.postgresql.org/?p=~davidfetter/upgrade_in_place/.git;a=snapshot;h=c72bafada59ed278ffac59657c913bc375f77808;sf=tgz
>
> It should contain everything, including yesterday's improvements (delete,
> insert, and update work; insert/update only on tables without an index).

Wow, sounds like great improvements. I understand your difficulties
in keeping up with HEAD, but I hope we can figure out some solution,
because right now I have a diff (that I can't apply) and a tarball
(that I can't diff) and that is not ideal for reviewing.

> Yeah, that is the most difficult part :-) finding the correct names. I think
> that each version of the structure should have a version suffix, including
> the latest one. And of course the latest one should also have a general name
> without a suffix - see the example:
>
> typedef struct PageHeaderData_04 { ... } PageHeaderData_04;
> typedef struct PageHeaderData_03 { ... } PageHeaderData_03;
> typedef PageHeaderData_04 PageHeaderData;
>
> This allows you to specify the exact version in places where you need it and
> keep the general name where the version is not relevant.

That doesn't make sense to me. If PageHeaderData and
PageHeaderData_04 are the same type, how do you decide which one to
use in any particular place in the code?

> What the suffix should look like is another question. I prefer 04 rather than
> just 4. What about PageHeaderData_V04?

I prefer "V" as a delimiter rather than "_" because that makes it more
clear that the number which follows is a version number, but I think
"_V" is overkill. However, I don't really want to argue the point;
I'm just throwing in my $0.02 and I am sure others will have their own
views as well.

> By the way, what does YMMV mean?

"Your Mileage May Vary."
http://www.urbandictionary.com/define.php?term=YMMV

>> I am pretty skeptical of the idea that all of the HeapTuple* functions
>> can just be conditionalized on the page version and everything will
>> Just Work. It seems like that is too low a level to be worrying about
>> such things. Even if it happens to work for the changes between V3
>> and V4, what happens when V5 or V6 is changed in such a way that the
>> answer to HeapTupleIsWhatever is neither "Yes" nor "No", but rather
>> "Maybe" or "Seven"? The performance hit also sounds painful. I don't
>> have a better idea right now though...
>
> OK. Currently it works (or I hope that it works). If somebody in the future
> invents some special change, I think that in most (maybe all) cases a mapping
> will be possible.
>
> Speed is the key point. When I last checked, I got a 1% performance drop on a
> fresh database. I think 1% is a good price for in-place online upgrade.

I think that's arguable and something that needs to be more broadly
discussed. I wouldn't be keen to pay a 1% performance drop for this
feature, because it's not a feature I really need. Sure, in-place
upgrade would be nice to have, but for me, dump and reload isn't a
huge problem. It's a lot better than the 5% number you quoted
previously, but I'm not sure whether it is good enough.

I would feel more comfortable if the feature could be completely
disabled via compile-time defines. Then you could build the system
either with or without in-place upgrade, according to your needs. But
I don't think that's very practical with HeapTuple* as functions. You
could conditionalize away the switch, but the function call overhead
would remain. To get rid of that, you'd need some enormous, fragile
hack that I don't even want to contemplate.

Really, what I'd ideally like to see here is a system where the V3
code is in essence error-recovery code. Everything should be V4-only
unless you detect a V3 page, and then you error out (if in-place
upgrade is not enabled) or jump to the appropriate V3-aware code (if
in-place upgrade is enabled). In theory, with a system like this, it
seems like the overhead for V4 ought to be no more than the cost of
checking the page version on each page read, which is a cheap sanity
check we'd be willing to pay for anyway, and trivial in cost.

But I think we probably need some input from -core on this topic as well.

...Robert


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Robert Haas" <robertmhaas(at)gmail(dot)com>
Cc: "Zdenek Kotala" <Zdenek(dot)Kotala(at)sun(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP] In-place upgrade
Date: 2008-11-04 01:27:37
Message-ID: 10341.1225762057@sss.pgh.pa.us
Lists: pgsql-hackers

"Robert Haas" <robertmhaas(at)gmail(dot)com> writes:
> Really, what I'd ideally like to see here is a system where the V3
> code is in essence error-recovery code. Everything should be V4-only
> unless you detect a V3 page, and then you error out (if in-place
> upgrade is not enabled) or jump to the appropriate V3-aware code (if
> in-place upgrade is enabled). In theory, with a system like this, it
> seems like the overhead for V4 ought to be no more than the cost of
> checking the page version on each page read, which is a cheap sanity
> check we'd be willing to pay for anyway, and trivial in cost.

We already do check the page version on read-in --- see PageHeaderIsValid.

> But I think we probably need some input from -core on this topic as well.

I concur that I don't want to see this patch adding more than the
absolute unavoidable minimum of overhead for data that meets the
"current" layout definition. I'm disturbed by the proposal to stick
overhead into tuple header access, for example.

regards, tom lane


From: "Robert Haas" <robertmhaas(at)gmail(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Zdenek Kotala" <Zdenek(dot)Kotala(at)sun(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP] In-place upgrade
Date: 2008-11-04 02:22:36
Message-ID: 603c8f070811031822q7d3b33f7x8576b7028f498cc4@mail.gmail.com
Lists: pgsql-hackers

> We already do check the page version on read-in --- see PageHeaderIsValid.

Right, but the only place this is called is in ReadBuffer_common,
which doesn't seem like a suitable place to deal with the possibility
of a V3 page since you don't yet know what you plan to do with it.
I'm not quite sure what the right solution to that problem is...

>> But I think we probably need some input from -core on this topic as well.
> I concur that I don't want to see this patch adding more than the
> absolute unavoidable minimum of overhead for data that meets the
> "current" layout definition. I'm disturbed by the proposal to stick
> overhead into tuple header access, for example.

...but it seems like we both agree that conditionalizing heap tuple
header access on page version is not the right answer. Based on that,
I'm going to move the "htup and bufpage API clean up" patch to
"Returned with feedback" and continue reviewing the remainder of these
patches.

As I'm looking at this, I'm realizing another problem - there is a lot
of code that looks like this:

void HeapTupleSetXmax(HeapTuple tuple, TransactionId xmax)
{
    switch (tuple->t_ver)
    {
        case 4:
            tuple->t_data->t_choice.t_heap.t_xmax = xmax;
            break;
        case 3:
            TPH03(tuple)->t_choice.t_heap.t_xmax = xmax;
            break;
        default:
            elog(PANIC, "HeapTupleSetXmax is not supported.");
    }
}

TPH03 is a macro that is casting tuple->t_data to HeapTupleHeader_03.
Unless I'm missing something, that means that given an arbitrary
pointer to HeapTuple, there is absolutely no guarantee that
tuple->t_data->t_choice actually points to that field at all. It will
if tuple->t_ver happens to be 4 OR if HeapTupleHeader and
HeapTupleHeader_03 happen to agree on where t_choice is; otherwise it
points to some other member of HeapTupleHeader_03, or off the end of
the structure. To me that seems unacceptably fragile, because it
means the compiler can't warn us that we're using a pointer
inappropriately. If we truly want to be safe here then we need to
create an opaque HeapTupleHeader structure that contains only those
elements that HeapTupleHeader_03 and HeapTupleHeader_04 have in
common, and cast BOTH of them after checking the version. That way if
someone writes a function that attempts to dereference a HeapTupleHeader
without going through the API, it will fail to compile rather than
mostly working but possibly failing on a V3 page.

...Robert


From: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP] In-place upgrade
Date: 2008-11-04 14:06:38
Message-ID: 491056EE.3070701@sun.com
Lists: pgsql-hackers

Robert Haas wrote:

>
> Really, what I'd ideally like to see here is a system where the V3
> code is in essence error-recovery code. Everything should be V4-only
> unless you detect a V3 page, and then you error out (if in-place
> upgrade is not enabled) or jump to the appropriate V3-aware code (if
> in-place upgrade is enabled). In theory, with a system like this, it
> seems like the overhead for V4 ought to be no more than the cost of
> checking the page version on each page read, which is a cheap sanity
> check we'd be willing to pay for anyway, and trivial in cost.

OK. The original idea was to do "convert on read", which has several problems
with no easy solution. One is that the new data may not fit on the page, and the
second big problem is how to convert TOAST table data. Another, more general,
problem is how to convert indexes...

Convert on read has minimal impact on core when the latest version is processed.
But the problem is what happens when you need to migrate a tuple from its page
to a new one, modify the index, and also convert toast value(s)... The response
time of some queries could be long, because they invoke a lot of changes and
conversions. I think that in a corner case it could require converting an entire
index when you request one record.

Zdenek

--
Zdenek Kotala Sun Microsystems
Prague, Czech Republic http://sun.com/postgresql


From: "Robert Haas" <robertmhaas(at)gmail(dot)com>
To: "Zdenek Kotala" <Zdenek(dot)Kotala(at)sun(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP] In-place upgrade
Date: 2008-11-04 14:32:56
Message-ID: 603c8f070811040632pf877480uf6f4d7ab7fd2525b@mail.gmail.com
Lists: pgsql-hackers

> OK. The original idea was to do "convert on read", which has several problems
> with no easy solution. One is that the new data may not fit on the page, and
> the second big problem is how to convert TOAST table data. Another, more
> general, problem is how to convert indexes...
>
> Convert on read has minimal impact on core when the latest version is
> processed. But the problem is what happens when you need to migrate a tuple
> from its page to a new one, modify the index, and also convert toast
> value(s)... The response time of some queries could be long, because they
> invoke a lot of changes and conversions. I think that in a corner case it
> could require converting an entire index when you request one record.

I don't think I'm proposing convert on read, exactly. If you actually
try to convert the entire page when you read it in, I think you're
doomed to failure, because, as you rightly point out, there is
absolutely no guarantee that the page contents in their new format
will still fit into one block. I think what you want to do is convert
the structures within the page one by one as you read them out of the
page. The proposed refactoring of ExecStoreTuple will do exactly
this, for example.

HEAD uses a pointer into the actual buffer for a V4 tuple that comes
from an existing relation, and a pointer to a palloc'd structure for a
tuple that is generated during query execution. The proposed
refactoring will keep these rules, plus add a new rule that if you
happen to read a V3 page, you will palloc space for a new V4 tuple
that is semantically equivalent to the V3 tuple on the page, and use
that pointer instead. That, it seems to me, is exactly the right
balance - the PAGE is still a V3 page, but all of the tuples that the
upper-level code ever sees are V4 tuples.

I'm not sure how far this particular approach can be generalized.
ExecStoreTuple has the advantage that it already has to deal with both
direct buffer pointers and palloc'd structures, so the code doesn't
need to be much more complex to handle this case as well. I think the
thing to do is go through and scrutinize all of the ReadBuffer call
sites and figure out an approach to each one. I haven't looked at
your latest code yet, so you may have already done this, but just for
example, RelationGetBufferForTuple should probably just reject any V3
pages encountered as if they were full, including updating the FSM
where appropriate. I would think that it would be possible to
implement that with almost zero performance impact. I'm happy to look
at and discuss the problem cases with you, and hopefully others will
chime in as well since my knowledge of the code is far from
exhaustive.
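
As a sketch of the RelationGetBufferForTuple idea only: the page array and the
fsm[] free-space map below are hypothetical stand-ins, not the buffer manager
or the real FSM.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical page summary: layout version plus actual free space. */
typedef struct SketchPage
{
    int    layout_version;  /* 3 or 4 */
    size_t free_bytes;
} SketchPage;

/*
 * Returns the index of the first V4 page with room for tuple_len bytes, or -1
 * if there is none. fsm[] is a hypothetical free-space map with one entry per
 * page; old-format pages met along the way are recorded there as full.
 */
int
sketch_find_page_for_tuple(const SketchPage *pages, size_t *fsm,
                           int npages, size_t tuple_len)
{
    int i;

    assert(pages != NULL && fsm != NULL);

    for (i = 0; i < npages; i++)
    {
        if (pages[i].layout_version != 4)
        {
            fsm[i] = 0;     /* advertise old-format pages as full */
            continue;
        }
        if (pages[i].free_bytes >= tuple_len)
            return i;
    }
    return -1;
}
```

With this rule new tuples never land on a V3 page, so the insert path needs no
V3-specific code beyond the version test.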

...Robert


From: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP] In-place upgrade
Date: 2008-11-04 15:09:16
Message-ID: 4910659C.2090806@sun.com
Lists: pgsql-hackers

Robert Haas wrote:
>> OK. The original idea was to do "convert on read", which has several problems
>> with no easy solution. One is that the new data may not fit on the page, and
>> the second big problem is how to convert TOAST table data. Another, more
>> general, problem is how to convert indexes...
>>
>> Convert on read has minimal impact on core when the latest version is
>> processed. But the problem is what happens when you need to migrate a tuple
>> from its page to a new one, modify the index, and also convert toast
>> value(s)... The response time of some queries could be long, because they
>> invoke a lot of changes and conversions. I think that in a corner case it
>> could require converting an entire index when you request one record.
>
> I don't think I'm proposing convert on read, exactly. If you actually
> try to convert the entire page when you read it in, I think you're
> doomed to failure, because, as you rightly point out, there is
> absolutely no guarantee that the page contents in their new format
> will still fit into one block. I think what you want to do is convert
> the structures within the page one by one as you read them out of the
> page. The proposed refactoring of ExecStoreTuple will do exactly
> this, for example.

I see. But vacuum and other internal functions access heap pages directly
without ExecStoreTuple. However, you point to an idea that I'm currently
thinking about too. Here is my version:

If you look at the new page API, it has PageGetHeapTuple, which could do the
conversion job. The problem is that you don't have relation info there, so you
cannot convert the data, but transaction information can be converted.

I am thinking about modifying the HeapTupleData structure. It would have a
pointer to transaction info, t_transinfo, which for V4 would point to the tuple
on the page. For V3, the PageGetHeapTuple function would allocate memory and put
the converted data there.

ExecStoreTuple would finally convert the data, because it knows about the
relation and it does not make sense to convert data early. Who wants to convert
invisible or dead data?

With this approach tuples would be processed the same way as V4 without any
overhead (there would be a small overhead for allocating and freeing
HeapTupleData in some places - mostly vacuum).

Only multi-version access would be driven on a per-page basis.
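
A sketch of the t_transinfo idea under invented layouts. Only the name
t_transinfo comes from the description above; the SketchTransInfo structures,
the allocated flag and both functions are assumptions for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Transaction fields in the (invented) current layout. */
typedef struct SketchTransInfo
{
    uint32_t xmin;
    uint32_t xmax;
} SketchTransInfo;

/* Invented old layout with an extra field between xmin and xmax. */
typedef struct SketchTransInfoV3
{
    uint32_t xmin;
    uint32_t cmin;
    uint32_t xmax;
} SketchTransInfoV3;

typedef struct SketchTupleData
{
    SketchTransInfo *t_transinfo;  /* into the page (V4) or allocated (V3) */
    bool             allocated;    /* true if t_transinfo must be freed */
} SketchTupleData;

/* Stand-in for PageGetHeapTuple(): item points at the tuple on the page. */
SketchTupleData
sketch_page_get_tuple(void *item, int page_version)
{
    SketchTupleData tuple;

    assert(item != NULL);

    if (page_version == 4)
    {
        tuple.t_transinfo = (SketchTransInfo *) item;   /* no copy needed */
        tuple.allocated = false;
    }
    else
    {
        const SketchTransInfoV3 *old = (const SketchTransInfoV3 *) item;

        tuple.t_transinfo = malloc(sizeof(SketchTransInfo));
        if (tuple.t_transinfo == NULL)
            abort();
        tuple.t_transinfo->xmin = old->xmin;
        tuple.t_transinfo->xmax = old->xmax;
        tuple.allocated = true;
    }
    return tuple;
}

/* Frees the converted copy, if any, and clears the pointer. */
void
sketch_release_tuple(SketchTupleData *tuple)
{
    if (tuple->allocated)
        free(tuple->t_transinfo);
    tuple->t_transinfo = NULL;
    tuple->allocated = false;
}
```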

Zdenek

--
Zdenek Kotala Sun Microsystems
Prague, Czech Republic http://sun.com/postgresql


From: "Robert Haas" <robertmhaas(at)gmail(dot)com>
To: "Zdenek Kotala" <Zdenek(dot)Kotala(at)sun(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP] In-place upgrade
Date: 2008-11-04 15:38:32
Message-ID: 603c8f070811040738r4b3f54fap841b4655704a4793@mail.gmail.com
Lists: pgsql-hackers

> I see. But vacuum and other internal functions access heap pages directly
> without ExecStoreTuple.

Right. I don't think there's any getting around the fact that any
function which accesses heap pages directly is going to need
modification. The key is to make those modifications as non-invasive
as possible. For example, in the case of vacuum, as soon as it
detects that a V3 page has been read, it should call a special
function whose only purpose in life is to move the data out of that V3
page and onto one or more V4 pages, and return. What you shouldn't do
is try to make the regular vacuum code handle both V3 and V4 pages,
because that will lead to code that may be slow and will almost
certainly be complicated and difficult to maintain.
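
Roughly, and only as a sketch with a toy page type: items are plain ints, a
single destination page is assumed where the real thing might need several, and
all names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_MAX_ITEMS 8

/* Toy page: a layout version and a small array of items. */
typedef struct SketchPage
{
    int layout_version;             /* 3 or 4 */
    int nitems;
    int items[SKETCH_MAX_ITEMS];
} SketchPage;

/*
 * Moves every item off a V3 page onto a V4 page, then reinitializes the
 * source as an empty V4 page. Returns false, changing nothing, if the
 * destination has too little room.
 */
bool
sketch_evacuate_v3_page(SketchPage *src, SketchPage *dst)
{
    int i;

    assert(src->layout_version == 3 && dst->layout_version == 4);

    if (dst->nitems + src->nitems > SKETCH_MAX_ITEMS)
        return false;
    for (i = 0; i < src->nitems; i++)
        dst->items[dst->nitems++] = src->items[i];
    src->nitems = 0;
    src->layout_version = 4;
    return true;
}

/* Returns true only if a V3 page was emptied by the special-case function. */
bool
sketch_vacuum_page(SketchPage *page, SketchPage *spare_v4_page)
{
    if (page->layout_version == 3)
        return sketch_evacuate_v3_page(page, spare_v4_page);

    /* Regular V4-only vacuum work would go here. */
    return false;
}
```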

I'll read through the rest of this when I have a bit more time.

...Robert


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP] In-place upgrade
Date: 2008-11-04 15:46:09
Message-ID: 49106E41.90302@enterprisedb.com
Lists: pgsql-hackers

Zdenek Kotala wrote:
> Robert Haas wrote:
>> Really, what I'd ideally like to see here is a system where the V3
>> code is in essence error-recovery code. Everything should be V4-only
>> unless you detect a V3 page, and then you error out (if in-place
>> upgrade is not enabled) or jump to the appropriate V3-aware code (if
>> in-place upgrade is enabled). In theory, with a system like this, it
>> seems like the overhead for V4 ought to be no more than the cost of
>> checking the page version on each page read, which is a cheap sanity
>> check we'd be willing to pay for anyway, and trivial in cost.
>
> OK. The original idea was to do "convert on read", which has several problems
> with no easy solution. One is that the new data may not fit on the page, and
> the second big problem is how to convert TOAST table data. Another, more
> general, problem is how to convert indexes...

We've talked about this many times before, so I'm sure you know what my
opinion is. Let me phrase it one more time:

1. You *will* need a function to convert a page from old format to new
format. We do want to get rid of the old format pages eventually,
whether it's during VACUUM, whenever a page is read in, or by using an
extra utility. And that process needs to be online. Please speak up now if
you disagree with that.

2. It follows from point 1, that you *will* need to solve the problems
with pages where the data doesn't fit on the page in new format, as well
as converting TOAST data.

We've discussed various solutions to those problems; it's not
insurmountable. For the "data doesn't fit anymore" problem, a fairly
simple solution is to run a pre-upgrade utility in the old version, that
reserves some free space on each page, to make sure everything fits
after converting to new format. For TOAST, you can retoast tuples when
the heap page is read in. I'm not sure what the problem with indexes is,
but you can split pages if necessary, for example.
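
The space check such a pre-upgrade utility would make can be sketched as simple
arithmetic. The per-tuple and per-page growth figures are parameters here
because the thread does not give them; both function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Extra bytes a page needs after conversion, given assumed growth figures. */
size_t
sketch_bytes_needed(size_t ntuples, size_t growth_per_tuple,
                    size_t header_growth)
{
    /* Guard against overflow in the multiplication below. */
    assert(growth_per_tuple == 0 ||
           ntuples <= (SIZE_MAX - header_growth) / growth_per_tuple);

    return header_growth + ntuples * growth_per_tuple;
}

/* True if the page already has enough free space to be converted in place. */
bool
sketch_page_ready_for_upgrade(size_t free_bytes, size_t ntuples,
                              size_t growth_per_tuple, size_t header_growth)
{
    return free_bytes >= sketch_bytes_needed(ntuples, growth_per_tuple,
                                             header_growth);
}
```

Pages that fail the check are the ones where the old version would have to move
tuples elsewhere before the upgrade.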

Assuming everyone agrees with point 1, could we focus on these issues?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: "Robert Haas" <robertmhaas(at)gmail(dot)com>
To: "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: "Zdenek Kotala" <Zdenek(dot)Kotala(at)sun(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP] In-place upgrade
Date: 2008-11-04 16:01:07
Message-ID: 603c8f070811040801x13031ba4sa07a8024ea29644c@mail.gmail.com
Lists: pgsql-hackers

> We've talked about this many times before, so I'm sure you know what my
> opinion is. Let me phrase it one more time:
>
> 1. You *will* need a function to convert a page from old format to new
> format. We do want to get rid of the old format pages eventually, whether
> it's during VACUUM, whenever a page is read in, or by using an extra
> utility. And that process needs to be online. Please speak up now if you
> disagree with that.

Well, I just proposed an approach that doesn't work this way, so I
guess I'll have to put myself in the disagree category, or anyway yet
to be convinced. As long as you can move individual tuples onto new
pages, you can eventually empty V3 pages and reinitialize them as new,
empty V4 pages. You can force that process along via, say, VACUUM,
but in the meantime you can still continue to read the old pages
without being forced to change them to the new format. That's not the
only possible approach, but it's not obvious to me that it's insane.
If you think it's a non-starter, it would be good to know why.

...Robert


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Robert Haas" <robertmhaas(at)gmail(dot)com>
Cc: "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "Zdenek Kotala" <Zdenek(dot)Kotala(at)sun(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP] In-place upgrade
Date: 2008-11-04 16:14:11
Message-ID: 20742.1225815251@sss.pgh.pa.us
Lists: pgsql-hackers

"Robert Haas" <robertmhaas(at)gmail(dot)com> writes:
> Well, I just proposed an approach that doesn't work this way, so I
> guess I'll have to put myself in the disagree category, or anyway yet
> to be convinced. As long as you can move individual tuples onto new
> pages, you can eventually empty V3 pages and reinitialize them as new,
> empty V4 pages. You can force that process along via, say, VACUUM,
> but in the meantime you can still continue to read the old pages
> without being forced to change them to the new format. That's not the
> only possible approach, but it's not obvious to me that it's insane.
> If you think it's a non-starter, it would be good to know why.

That's sane *if* you can guarantee that only negligible overhead is
added for accessing data that is in the up-to-date format. I don't
think that will be the case if we start putting version checks into
every tuple access macro.

regards, tom lane


From: "Robert Haas" <robertmhaas(at)gmail(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "Zdenek Kotala" <Zdenek(dot)Kotala(at)sun(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP] In-place upgrade
Date: 2008-11-04 16:42:39
Message-ID: 603c8f070811040842r4a26ac3am6833436e6f1ed1dd@mail.gmail.com
Lists: pgsql-hackers

> That's sane *if* you can guarantee that only negligible overhead is
> added for accessing data that is in the up-to-date format. I don't
> think that will be the case if we start putting version checks into
> every tuple access macro.

Yes, the point is that you'll read the page as V3 or V4, whichever it
is, but if it's V3, you'll convert the tuples to V4 format before you
try to do anything with them (for example by modifying
ExecStoreTuple to copy any V3 tuple into a palloc'd buffer, which fits
nicely into what that function already does).
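
To make the idea concrete, here is a minimal Python sketch (a toy model, not
PostgreSQL code). The names Tuple, convert_v3_tuple and read_page are
hypothetical, and the converter only copies the payload where a real one would
rewrite the tuple header and varlena fields. What it tries to show is that the
layout is checked once per page, so V4 pages take a fast path with no per-tuple
version checks.

```python
from dataclasses import dataclass

CURRENT_LAYOUT = 4  # page layout version treated as "current" in this thread


@dataclass(frozen=True)
class Tuple:
    layout: int
    payload: bytes


def convert_v3_tuple(old: Tuple) -> Tuple:
    """Hypothetical converter: copies the payload into a new V4 tuple.

    A real converter would rewrite header and varlena fields here.
    """
    return Tuple(layout=CURRENT_LAYOUT, payload=bytes(old.payload))


def read_page(page_layout: int, tuples: list[Tuple]) -> list[Tuple]:
    """Check the layout once per page, not once per tuple access."""
    if page_layout == CURRENT_LAYOUT:
        return tuples  # fast path: no copy, no per-tuple checks
    if page_layout == 3:
        # Slow path: hand the executor converted copies of the old tuples.
        return [convert_v3_tuple(t) for t in tuples]
    raise ValueError(f"unsupported page layout {page_layout}")
```

The old page itself is not modified in this model; only the copies handed to
the caller are in the new format.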

...Robert


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Robert Haas" <robertmhaas(at)gmail(dot)com>
Cc: "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "Zdenek Kotala" <Zdenek(dot)Kotala(at)sun(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP] In-place upgrade
Date: 2008-11-04 19:49:14
Message-ID: 878wrz6zlx.fsf@oxford.xeocode.com
Lists: pgsql-hackers

"Robert Haas" <robertmhaas(at)gmail(dot)com> writes:

>> We've talked about this many times before, so I'm sure you know what my
>> opinion is. Let me phrase it one more time:
>>
>> 1. You *will* need a function to convert a page from old format to new
>> format. We do want to get rid of the old format pages eventually, whether
>> it's during VACUUM, whenever a page is read in, or by using an extra
>> utility. And that process needs to be online. Please speak up now if you
>> disagree with that.
>
> Well, I just proposed an approach that doesn't work this way, so I
> guess I'll have to put myself in the disagree category, or anyway yet
> to be convinced. As long as you can move individual tuples onto new
> pages, you can eventually empty V3 pages and reinitialize them as new,
> empty V4 pages. You can force that process along via, say, VACUUM,

No, if you can force that process along via some command, whatever it is, then
you're still in the category he described.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's 24x7 Postgres support!


From: "Robert Haas" <robertmhaas(at)gmail(dot)com>
To: "Gregory Stark" <stark(at)enterprisedb(dot)com>
Cc: "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "Zdenek Kotala" <Zdenek(dot)Kotala(at)sun(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP] In-place upgrade
Date: 2008-11-04 20:57:34
Message-ID: 603c8f070811041257m60df429ct40694f509b876829@mail.gmail.com
Lists: pgsql-hackers

>> Well, I just proposed an approach that doesn't work this way, so I
>> guess I'll have to put myself in the disagree category, or anyway yet
>> to be convinced. As long as you can move individual tuples onto new
>> pages, you can eventually empty V3 pages and reinitialize them as new,
>> empty V4 pages. You can force that process along via, say, VACUUM,
>
> No, if you can force that process along via some command, whatever it is, then
> you're still in the category he described.

Maybe. The difference is that I'm talking about converting tuples,
not pages, so "What happens when the data doesn't fit on the new
page?" is a meaningless question. Since that seemed to be Heikki's
main concern, I thought we must be talking about different things. My
thought was that the code path for converting a tuple would be very
similar to what heap_update does today, and large tuples would be
handled via TOAST just as they are now - by converting the relation
one tuple at a time, you might end up with a new relation that has
either more or fewer pages than the old relation, and it really
doesn't matter which.

I haven't really thought through all of the other kinds of things that
might need to be converted, though. That's where it would be useful
for someone more experienced to weigh in on indexes, etc.

...Robert


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Robert Haas" <robertmhaas(at)gmail(dot)com>
Cc: "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "Zdenek Kotala" <Zdenek(dot)Kotala(at)sun(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP] In-place upgrade
Date: 2008-11-04 22:18:39
Message-ID: 87skq75e4g.fsf@oxford.xeocode.com
Lists: pgsql-hackers

"Robert Haas" <robertmhaas(at)gmail(dot)com> writes:

>>> Well, I just proposed an approach that doesn't work this way, so I
>>> guess I'll have to put myself in the disagree category, or anyway yet
>>> to be convinced. As long as you can move individual tuples onto new
>>> pages, you can eventually empty V3 pages and reinitialize them as new,
>>> empty V4 pages. You can force that process along via, say, VACUUM,
>>
>> No, if you can force that process along via some command, whatever it is, then
>> you're still in the category he described.
>
> Maybe. The difference is that I'm talking about converting tuples,
> not pages, so "What happens when the data doesn't fit on the new
> page?" is a meaningless question.

No it's not, because as you pointed out you still need a way for the user to
force it to happen sometime. Unless you're going to be happy with telling
users they need to update all their tuples which would not be an online
process.

In any case it sounds like you're saying you want to allow multiple versions
of tuples on the same page -- which a) would be much harder and b) doesn't
solve the problem since the page still has to be converted sometime anyways.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's PostGIS support!


From: "Robert Haas" <robertmhaas(at)gmail(dot)com>
To: "Gregory Stark" <stark(at)enterprisedb(dot)com>
Cc: "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "Zdenek Kotala" <Zdenek(dot)Kotala(at)sun(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP] In-place upgrade
Date: 2008-11-05 03:49:57
Message-ID: 603c8f070811041949y7964c507wded15c6a5f583480@mail.gmail.com
Lists: pgsql-hackers

>> Maybe. The difference is that I'm talking about converting tuples,
>> not pages, so "What happens when the data doesn't fit on the new
>> page?" is a meaningless question.
>
> No it's not, because as you pointed out you still need a way for the user to
> force it to happen sometime. Unless you're going to be happy with telling
> users they need to update all their tuples which would not be an online
> process.
>
> In any case it sounds like you're saying you want to allow multiple versions
> of tuples on the same page -- which a) would be much harder and b) doesn't
> solve the problem since the page still has to be converted sometime anyways.

No, that's not what I'm suggesting. My thought was that any V3 page
would be treated as if it were completely full, with the exception of
a completely empty page which can be reinitialized as a V4 page. So
you would never add any tuples to a V3 page, but you would need to
update xmax, hint bits, etc. Eventually when all the tuples were dead
you could reuse the page.
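
A rough Python sketch of that policy, as a toy model only: Page,
usable_free_space and recycle_if_empty are hypothetical names, and
empty_free_bytes is a parameter supplied by the caller rather than a value
taken from PostgreSQL.

```python
from dataclasses import dataclass


@dataclass
class Page:
    layout: int        # 3 or 4
    free_bytes: int    # raw free space on the page
    live_tuples: int   # tuples that are not yet dead


def usable_free_space(page: Page) -> int:
    """Free space as the insert path would see it under this policy."""
    if page.layout == 4:
        return page.free_bytes
    return 0  # a V3 page is treated as full: never a target for new tuples


def recycle_if_empty(page: Page, empty_free_bytes: int) -> bool:
    """Reinitialize a V3 page with no live tuples as an empty V4 page."""
    if page.layout == 3 and page.live_tuples == 0:
        page.layout = 4
        page.free_bytes = empty_free_bytes
        return True
    return False
```

Updates to xmax and hint bits on V3 tuples are outside this model; it only
covers where new tuples may go and when a page can be reused.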

...Robert


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Robert Haas" <robertmhaas(at)gmail(dot)com>
Cc: "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "Zdenek Kotala" <Zdenek(dot)Kotala(at)sun(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP] In-place upgrade
Date: 2008-11-05 04:25:30
Message-ID: 87od0u6bph.fsf@oxford.xeocode.com
Lists: pgsql-hackers

"Robert Haas" <robertmhaas(at)gmail(dot)com> writes:

>>> Maybe. The difference is that I'm talking about converting tuples,
>>> not pages, so "What happens when the data doesn't fit on the new
>>> page?" is a meaningless question.
>>
>> No it's not, because as you pointed out you still need a way for the user to
>> force it to happen sometime. Unless you're going to be happy with telling
>> users they need to update all their tuples which would not be an online
>> process.
>>
>> In any case it sounds like you're saying you want to allow multiple versions
>> of tuples on the same page -- which a) would be much harder and b) doesn't
>> solve the problem since the page still has to be converted sometime anyways.
>
> No, that's not what I'm suggesting. My thought was that any V3 page
> would be treated as if it were completely full, with the exception of
> a completely empty page which can be reinitialized as a V4 page. So
> you would never add any tuples to a V3 page, but you would need to
> update xmax, hint bits, etc. Eventually when all the tuples were dead
> you could reuse the page.

But there's no guarantee that will ever happen. Heikki claimed you would need
a mechanism to convert the page some day and you said you proposed a system
where that wasn't true.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's Slony Replication support!


From: "Robert Haas" <robertmhaas(at)gmail(dot)com>
To: "Gregory Stark" <stark(at)enterprisedb(dot)com>
Cc: "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "Zdenek Kotala" <Zdenek(dot)Kotala(at)sun(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP] In-place upgrade
Date: 2008-11-05 04:51:32
Message-ID: 603c8f070811042051t6c2a7d63j97f83aa367794e47@mail.gmail.com
Lists: pgsql-hackers

>> No, that's not what I'm suggesting. My thought was that any V3 page
>> would be treated as if it were completely full, with the exception of
>> a completely empty page which can be reinitialized as a V4 page. So
>> you would never add any tuples to a V3 page, but you would need to
>> update xmax, hint bits, etc. Eventually when all the tuples were dead
>> you could reuse the page.
>
> But there's no guarantee that will ever happen. Heikki claimed you would need
> a mechanism to convert the page some day and you said you proposed a system
> where that wasn't true.

What's the scenario you're concerned about? An old snapshot that
never goes away?

Can we lock the old and new pages, move the tuple to a V4 page, and
update index entries without changing xmin/xmax?

...Robert


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Robert Haas" <robertmhaas(at)gmail(dot)com>
Cc: "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "Zdenek Kotala" <Zdenek(dot)Kotala(at)sun(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP] In-place upgrade
Date: 2008-11-05 05:11:03
Message-ID: 87k5bi69lk.fsf@oxford.xeocode.com
Lists: pgsql-hackers

"Robert Haas" <robertmhaas(at)gmail(dot)com> writes:

>>> No, that's not what I'm suggesting. My thought was that any V3 page
>>> would be treated as if it were completely full, with the exception of
>>> a completely empty page which can be reinitialized as a V4 page. So
>>> you would never add any tuples to a V3 page, but you would need to
>>> update xmax, hint bits, etc. Eventually when all the tuples were dead
>>> you could reuse the page.
>>
>> But there's no guarantee that will ever happen. Heikki claimed you would need
>> a mechanism to convert the page some day and you said you proposed a system
>> where that wasn't true.
>
> What's the scenario you're concerned about? An old snapshot that
> never goes away?

An old page which never goes away. New page formats are introduced for a
reason -- to support new features. An old page lying around indefinitely means
some pages can't support those new features. Just as an example, DBAs may be
surprised to find out that large swathes of their database are still not
protected by CRC checksums months or years after having upgraded to 8.4 (or
even 8.5 or 8.6 or ...). They would certainly want a way to ensure all their
data is upgraded.

> Can we lock the old and new pages, move the tuple to a V4 page, and
> update index entries without changing xmin/xmax?

Not exactly. But regardless -- the point is we need to do something.

(And then the argument goes that since we *have* to do that then we needn't
bother with doing anything else. At least if we do it's just an optimization
over just doing the whole page right away.)

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL training!


From: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Zdenek Kotala <Zdenek(dot)Kotala(at)sun(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP] In-place upgrade
Date: 2008-11-05 05:51:41
Message-ID: 4911346D.3040200@commandprompt.com
Lists: pgsql-hackers

Gregory Stark wrote:
> "Robert Haas" <robertmhaas(at)gmail(dot)com> writes:

> An old page which never goes away. New page formats are introduced for a
> reason -- to support new features. An old page lying around indefinitely means
> some pages can't support those new features. Just as an example, DBAs may be
> surprised to find out that large swathes of their database are still not
> protected by CRC checksums months or years after having upgraded to 8.4 (or
> even 8.5 or 8.6 or ...). They would certainly want a way to ensure all their
> data is upgraded.

Then provide a manual mechanism to convert all pages?

Joshua D. Drake


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Zdenek Kotala <Zdenek(dot)Kotala(at)sun(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP] In-place upgrade
Date: 2008-11-05 06:04:41
Message-ID: 87d4ha6746.fsf@oxford.xeocode.com
Lists: pgsql-hackers

"Joshua D. Drake" <jd(at)commandprompt(dot)com> writes:

> Gregory Stark wrote:
>> "Robert Haas" <robertmhaas(at)gmail(dot)com> writes:
>
>> An old page which never goes away. New page formats are introduced for a
>> reason -- to support new features. An old page lying around indefinitely means
>> some pages can't support those new features. Just as an example, DBAs may be
>> surprised to find out that large swathes of their database are still not
>> protected by CRC checksums months or years after having upgraded to 8.4 (or
>> even 8.5 or 8.6 or ...). They would certainly want a way to ensure all their
>> data is upgraded.
>
> Then provide a manual mechanism to convert all pages?

The origin of this thread was the dispute over this claim:

1. You *will* need a function to convert a page from old format to new
format. We do want to get rid of the old format pages eventually, whether
it's during VACUUM, whenever a page is read in, or by using an extra
utility. And that process needs to be online. Please speak up now if you
disagree with that.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL training!


From: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Zdenek Kotala <Zdenek(dot)Kotala(at)sun(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP] In-place upgrade
Date: 2008-11-05 06:30:35
Message-ID: 49113D8B.9040006@commandprompt.com
Lists: pgsql-hackers

Gregory Stark wrote:
> "Joshua D. Drake" <jd(at)commandprompt(dot)com> writes:
>
>> Gregory Stark wrote:
>>> "Robert Haas" <robertmhaas(at)gmail(dot)com> writes:
>>> An old page which never goes away. New page formats are introduced for a
>>> reason -- to support new features. An old page lying around indefinitely means
>>> some pages can't support those new features. Just as an example, DBAs may be
>>> surprised to find out that large swathes of their database are still not
>>> protected by CRC checksums months or years after having upgraded to 8.4 (or
>>> even 8.5 or 8.6 or ...). They would certainly want a way to ensure all their
>>> data is upgraded.
>> Then provide a manual mechanism to convert all pages?
>
> The origin of this thread was the dispute over this claim:
>
> 1. You *will* need a function to convert a page from old format to new
> format. We do want to get rid of the old format pages eventually, whether
> it's during VACUUM, whenever a page is read in, or by using an extra
> utility. And that process needs to be online. Please speak up now if you
> disagree with that.
>

I agree.

Joshua D. Drake


From: "Robert Haas" <robertmhaas(at)gmail(dot)com>
To: "Gregory Stark" <stark(at)enterprisedb(dot)com>
Cc: "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "Zdenek Kotala" <Zdenek(dot)Kotala(at)sun(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP] In-place upgrade
Date: 2008-11-05 12:32:07
Message-ID: 603c8f070811050432i2495bc94i8ab8fff65bf60c33@mail.gmail.com
Lists: pgsql-hackers

> An old page which never goes away. New page formats are introduced for a
> reason -- to support new features. An old page lying around indefinitely means
> some pages can't support those new features. Just as an example, DBAs may be
> surprised to find out that large swathes of their database are still not
> protected by CRC checksums months or years after having upgraded to 8.4 (or
> even 8.5 or 8.6 or ...). They would certainly want a way to ensure all their
> data is upgraded.

OK, I see your point. In the absence of any old snapshots,
convert-on-write allows you to forcibly upgrade the whole table by
rewriting all of the tuples into new pages:

UPDATE table SET col = col

In the absence of page expansion, you can put logic into VACUUM to
upgrade each page in place.

If you have both old snapshots that you can't get rid of, and page
expansion, then you have a big problem, which I guess brings us back
to Heikki's point.

...Robert


From: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP] In-place upgrade
Date: 2008-11-05 12:43:26
Message-ID: 491194EE.8040603@sun.com
Lists: pgsql-hackers

Heikki Linnakangas wrote:
> Zdenek Kotala wrote:

>
> We've talked about this many times before, so I'm sure you know what my
> opinion is. Let me phrase it one more time:
>
> 1. You *will* need a function to convert a page from old format to new
> format. We do want to get rid of the old format pages eventually,
> whether it's during VACUUM, whenever a page is read in, or by using an
> extra utility. And that process needs to be online. Please speak up now if
> you disagree with that.

Yes, agreed. The basic idea is to create a new empty page and copy and convert
the tuples into it. This new page will then overwrite the old one. I already have
code which converts heap tables (excluding arrays and composite datatypes).
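
As an illustration only, the copy+convert step might look like this in a toy
Python model. convert_page, the convert callback and capacity are hypothetical;
tuples are plain byte strings, and line pointers and page headers are ignored.

```python
from typing import Callable


def convert_page(old_tuples: list[bytes],
                 convert: Callable[[bytes], bytes],
                 capacity: int) -> list[bytes]:
    """Build the tuple list for a fresh page; raise if it no longer fits."""
    new_tuples: list[bytes] = []
    used = 0
    for old in old_tuples:
        new = convert(old)
        used += len(new)
        if used > capacity:
            # Nothing has been written yet, so the old page is still intact
            # and the caller can handle the overflow case separately.
            raise OverflowError("converted tuples do not fit on one page")
        new_tuples.append(new)
    return new_tuples
```

Because the new page is built separately, the old one is overwritten only after
every tuple has been converted successfully.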

> 2. It follows from point 1, that you *will* need to solve the problems
> with pages where the data doesn't fit on the page in new format, as well
> as converting TOAST data.

Yes and no. It depends on whether we want to live with old pages forever. But I
think converting all pages to the newest version is a good idea.

> We've discussed various solutions to those problems; it's not
> insurmountable. For the "data doesn't fit anymore" problem, a fairly
> simple solution is to run a pre-upgrade utility in the old version, that
> reserves some free space on each page, to make sure everything fits
> after converting to new format.

I think that will not work. You also need to prevent PostgreSQL from putting any
extra data on the page afterwards, which requires modifying PostgreSQL code in
the old branches.

> For TOAST, you can retoast tuples when
> the heap page is read in.

Yes, you have to retoast it, which is the only possible method, but the problem is
that you need a working TOAST index ... yeah, indexes are a different story.

> I'm not sure what the problem with indexes is,
> but you can split pages if necessary, for example.

Indexes are a different story. As a first step I prefer to use REINDEX. But in the
future I would prefer to extend pg_am and add ampageconvert, which will point to a
conversion function. Maybe we can extend it now and keep this column empty.
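
A small Python sketch of how such a per-access-method hook could be consulted.
This is a toy model: the dictionary stands in for pg_am rows, ampageconvert does
not exist in PostgreSQL, and upgrade_index_page is a hypothetical name.

```python
from typing import Callable, Optional

PageConverter = Callable[[bytes], bytes]

# Toy stand-in for pg_am: access method name -> optional page converter.
# None models the "keep this column empty" case from the message above.
am_page_convert: dict[str, Optional[PageConverter]] = {
    "btree": None,
    "hash": None,
}


def upgrade_index_page(am_name: str, page: bytes) -> Optional[bytes]:
    """Return the converted page, or None if the index must be rebuilt."""
    converter = am_page_convert.get(am_name)
    if converter is None:
        return None  # no converter registered: fall back to REINDEX
    return converter(page)
```

With every entry empty, each index falls back to REINDEX, which matches the
proposed first step; a converter can be registered later per access method.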

> Assuming everyone agrees with point 1, could we focus on these issues?

Yes, OK. I'm going to clean up the code I have and I will send it soon. Tuple
conversion is already part of the patch which I already sent. See
access/heapam/htup_03.c.

Zdenek

--
Zdenek Kotala Sun Microsystems
Prague, Czech Republic http://sun.com/postgresql


From: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP] In-place upgrade
Date: 2008-11-05 13:26:37
Message-ID: 49119F0D.3030406@sun.com
Lists: pgsql-hackers

Tom Lane wrote:

>
> I concur that I don't want to see this patch adding more than the
> absolute unavoidable minimum of overhead for data that meets the
> "current" layout definition. I'm disturbed by the proposal to stick
> overhead into tuple header access, for example.

OK. I agree that it is overhead. However, the patch also contains Tuple and Page
API cleanup, which is useful in general. All functions should access tuples
through HeapTuple, not HeapTupleHeader. I used functions in the patch because I
added multi-version access, but they can be macros.

The main change to the page API is the addition of two functions,
PageGetHeapTuple and PageGetIndexTuple. I also added functions like
PageItemIsDead and so on. These changes are not only related to upgrade.

I accept your complaints about tuples, but I think we should have a
multi-version page access method. The main advantage is that indexes are ready
for reading without any problem. It helps mostly with TOAST chunk data access,
and it is necessary for retoasting. OK, it will work only until somebody changes
the btree on-disk format, but for now it helps.

Zdenek


From: Greg Stark <greg(dot)stark(at)enterprisedb(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Gregory Stark <stark(at)enterprisedb(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Zdenek Kotala <Zdenek(dot)Kotala(at)sun(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP] In-place upgrade
Date: 2008-11-05 13:38:41
Message-ID: 0C0FFDB7-9E62-49AB-8BA7-842D2B1F375A@enterprisedb.com
Lists: pgsql-hackers

I don't think this really qualifies as "in place upgrade" since it
would mean creating a whole second copy of all your data. And it's
only online for read-only queries too.

I think we need a way to upgrade the pages in place and deal with any
overflow data as exceptional cases or else there's hardly much point
in the exercise.

greg

On 5 Nov 2008, at 07:32 AM, "Robert Haas" <robertmhaas(at)gmail(dot)com> wrote:

>> An old page which never goes away. New page formats are introduced for a
>> reason -- to support new features. An old page lying around indefinitely means
>> some pages can't support those new features. Just as an example, DBAs may be
>> surprised to find out that large swathes of their database are still not
>> protected by CRC checksums months or years after having upgraded to 8.4 (or
>> even 8.5 or 8.6 or ...). They would certainly want a way to ensure all their
>> data is upgraded.
>
> OK, I see your point. In the absence of any old snapshots,
> convert-on-write allows you to forcibly upgrade the whole table by
> rewriting all of the tuples into new pages:
>
> UPDATE table SET col = col
>
> In the absence of page expansion, you can put logic into VACUUM to
> upgrade each page in place.
>
> If you have both old snapshots that you can't get rid of, and page
> expansion, then you have a big problem, which I guess brings us back
> to Heikki's point.
>
> ...Robert


From: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
To: Greg Stark <greg(dot)stark(at)enterprisedb(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP] In-place upgrade
Date: 2008-11-05 14:04:42
Message-ID: 4911A7FA.1030805@sun.com
Lists: pgsql-hackers

Greg Stark wrote:
> I don't think this really qualifies as "in place upgrade" since it would
> mean creating a whole second copy of all your data. And it's only online
> for read-only queries too.
>
> I think we need a way to upgrade the pages in place and deal with any
> overflow data as exceptional cases or else there's hardly much point in
> the exercise.

It is an exceptional case between V3 and V4, and only on the heap, because you
save space in varlena. But between V4 and V5 we will lose another 4 bytes to the
page header -> the page header will be 28 bytes long, but the tuple size stays
the same.

Try to get the raw free space on each page in an 8.3 database and you will
probably see a lot of pages where the free space is 0. In my last experience it
was about 1-2% of pages.
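
The arithmetic can be sketched in a few lines of Python. The 4-byte growth is
the figure from this message; pages_needing_room is a hypothetical helper and
any free-space values fed to it are illustrative inputs, not measurements.

```python
HEADER_GROWTH = 4  # bytes the page header grows by, per the message above


def pages_needing_room(free_bytes_per_page: list[int],
                       growth: int = HEADER_GROWTH) -> list[int]:
    """Indexes of pages whose raw free space cannot absorb the larger header."""
    return [i for i, free in enumerate(free_bytes_per_page) if free < growth]
```

Only the pages returned here need special handling before an in-place
conversion; every other page has enough slack for the larger header.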

Zdenek


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
Cc: Greg Stark <greg(dot)stark(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP] In-place upgrade
Date: 2008-11-05 18:16:12
Message-ID: 20081105181612.GB3531@svana.org
Lists: pgsql-hackers

On Wed, Nov 05, 2008 at 03:04:42PM +0100, Zdenek Kotala wrote:
> Greg Stark wrote:
> It is an exceptional case between V3 and V4, and only on the heap, because you
> save space in varlena. But between V4 and V5 we will lose another 4 bytes to
> the page header -> the page header will be 28 bytes long, but the tuple size
> stays the same.
>
> Try to get the raw free space on each page in an 8.3 database and you will
> probably see a lot of pages where the free space is 0. In my last experience
> it was about 1-2% of pages.

Is this really such a big deal? You do the null-update on the last
tuple of the page and then you do have enough room. So Phase one moves
a few tuples to make room. Phase 2 actually converts the pages inplace.
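
Phase one could be modelled roughly like this in Python. It is a toy sketch:
make_room is a hypothetical name, tuples are represented only by their sizes,
and the model assumes the space of a moved tuple becomes free immediately. In
PostgreSQL the old tuple version stays on the page until it is dead to all
snapshots and has been pruned or vacuumed.

```python
def make_room(tuple_sizes: list[int], free_bytes: int,
              needed: int) -> tuple[list[int], list[int], int]:
    """Phase one: move tuples off the end of a page until `needed` bytes are free.

    Returns (sizes left on the page, sizes moved elsewhere, free bytes after).
    """
    remaining = list(tuple_sizes)
    moved: list[int] = []
    while free_bytes < needed and remaining:
        size = remaining.pop()  # the "null-update" of the last tuple
        moved.append(size)
        free_bytes += size
    return remaining, moved, free_bytes
```

Phase two would then convert the page in place, since the returned free space
is at least `needed` whenever the page held any tuple to move.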

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.


From: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: Greg Stark <greg(dot)stark(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP] In-place upgrade
Date: 2008-11-05 19:55:25
Message-ID: 4911FA2D.2070303@sun.com
Lists: pgsql-hackers

Martijn van Oosterhout wrote:
> On Wed, Nov 05, 2008 at 03:04:42PM +0100, Zdenek Kotala wrote:
>> Greg Stark wrote:
>> It is an exceptional case between V3 and V4, and only on the heap, because you
>> save space in varlena. But between V4 and V5 we will lose another 4 bytes to
>> the page header -> the page header will be 28 bytes long, but the tuple size
>> stays the same.
>>
>> Try to get the raw free space on each page in an 8.3 database and you will
>> probably see a lot of pages where the free space is 0. In my last experience
>> it was about 1-2% of pages.
>
> Is this really such a big deal? You do the null-update on the last
> tuple of the page and then you do have enough room. So Phase one moves
> a few tuples to make room. Phase 2 actually converts the pages inplace.

The problem is how to move a tuple from one page to another and keep the indexes
in sync. One solution is to perform something like an "update" operation on the
tuple. But you need an exclusive lock on the page, and the pin count has to be
zero. And the question is where this is a safe operation.

Zdenek

--
Zdenek Kotala Sun Microsystems
Prague, Czech Republic http://sun.com/postgresql


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, Greg Stark <greg(dot)stark(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP] In-place upgrade
Date: 2008-11-05 20:44:36
Message-ID: 2549.1225917876@sss.pgh.pa.us
Lists: pgsql-hackers

Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM> writes:
> Martijn van Oosterhout wrote:
>> Is this really such a big deal? You do the null-update on the last
>> tuple of the page and then you do have enough room. So Phase one moves
>> a few tuples to make room. Phase 2 actually converts the pages inplace.

> The problem is how to move a tuple from one page to another and keep the
> indexes in sync. One solution is to perform something like an "update"
> operation on the tuple. But you need an exclusive lock on the page, and
> the pin count has to be zero. And the question is where this is a safe
> operation.

Hmm. Well, it may be a nasty problem but you have to find a solution.
We're not going to guarantee that no update ever expands the data ...

regards, tom lane


From: "Robert Haas" <robertmhaas(at)gmail(dot)com>
To: "Zdenek Kotala" <Zdenek(dot)Kotala(at)sun(dot)com>
Cc: "Martijn van Oosterhout" <kleptog(at)svana(dot)org>, "Greg Stark" <greg(dot)stark(at)enterprisedb(dot)com>, "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP] In-place upgrade
Date: 2008-11-05 21:05:53
Message-ID: 603c8f070811051305y3c185a61o2142fab2095c41a@mail.gmail.com
Lists: pgsql-hackers

> The problem is how to move a tuple from one page to another and keep the
> indexes in sync. One solution is to perform something like an "update"
> operation on the tuple. But you need an exclusive lock on the page, and
> the pin count has to be zero. And the question is where this is a safe
> operation.

But doesn't this problem go away if you do it in a transaction? You
set xmax on the old tuple, write the new tuple, and add index entries
just as you would for a normal update.

When the old tuple is no longer visible to any transaction, you nuke it.

...Robert


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Robert Haas" <robertmhaas(at)gmail(dot)com>
Cc: "Zdenek Kotala" <Zdenek(dot)Kotala(at)sun(dot)com>, "Martijn van Oosterhout" <kleptog(at)svana(dot)org>, "Greg Stark" <greg(dot)stark(at)enterprisedb(dot)com>, "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP] In-place upgrade
Date: 2008-11-05 21:41:52
Message-ID: 873ai5c0kf.fsf@oxford.xeocode.com
Lists: pgsql-hackers

"Robert Haas" <robertmhaas(at)gmail(dot)com> writes:

>> The problem is how to move a tuple from one page to another and keep the
>> indexes in sync. One solution is to perform something like an "update"
>> operation on the tuple. But you need an exclusive lock on the page, and
>> the pin count has to be zero. And the question is where this is a safe
>> operation.
>
> But doesn't this problem go away if you do it in a transaction? You
> set xmax on the old tuple, write the new tuple, and add index entries
> just as you would for a normal update.

But that doesn't actually solve the overflow problem on the old page...

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's PostGIS support!


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Zdenek Kotala <Zdenek(dot)Kotala(at)sun(dot)com>, Greg Stark <greg(dot)stark(at)enterprisedb(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP] In-place upgrade
Date: 2008-11-05 22:07:39
Message-ID: 20081105220739.GC3531@svana.org
Lists: pgsql-hackers

On Wed, Nov 05, 2008 at 09:41:52PM +0000, Gregory Stark wrote:
> "Robert Haas" <robertmhaas(at)gmail(dot)com> writes:
>
> >> The problem is how to move a tuple from one page to another and keep the
> >> indexes in sync. One solution is to perform something like an "update"
> >> operation on the tuple. But you need an exclusive lock on the page, and
> >> the pin count has to be zero. And the question is where this is a safe
> >> operation.
> >
> > But doesn't this problem go away if you do it in a transaction? You
> > set xmax on the old tuple, write the new tuple, and add index entries
> > just as you would for a normal update.
>
> But that doesn't actually solve the overflow problem on the old page...

Sure it does. You move just enough tuples that you can convert the page
without an overflow.

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Zdenek Kotala <Zdenek(dot)Kotala(at)sun(dot)com>, Greg Stark <greg(dot)stark(at)enterprisedb(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP] In-place upgrade
Date: 2008-11-06 01:38:26
Message-ID: 87iqr1ab1p.fsf@oxford.xeocode.com
Lists: pgsql-hackers

Martijn van Oosterhout <kleptog(at)svana(dot)org> writes:

> On Wed, Nov 05, 2008 at 09:41:52PM +0000, Gregory Stark wrote:
>> "Robert Haas" <robertmhaas(at)gmail(dot)com> writes:
>>
>> >> The problem is how to move a tuple from one page to another and keep the
>> >> indexes in sync. One solution is to perform something like an "update"
>> >> operation on the tuple. But you need an exclusive lock on the page, and
>> >> the pin count has to be zero. And the question is where this is a safe
>> >> operation.
>> >
>> > But doesn't this problem go away if you do it in a transaction? You
>> > set xmax on the old tuple, write the new tuple, and add index entries
>> > just as you would for a normal update.
>>
>> But that doesn't actually solve the overflow problem on the old page...
>
> Sure it does. You move just enough tuples that you can convert the page
> without an overflow.

setting the xmax on a tuple doesn't "move" the tuple

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's RemoteDBA services!


From: "Robert Haas" <robertmhaas(at)gmail(dot)com>
To: "Gregory Stark" <stark(at)enterprisedb(dot)com>
Cc: "Martijn van Oosterhout" <kleptog(at)svana(dot)org>, "Zdenek Kotala" <Zdenek(dot)Kotala(at)sun(dot)com>, "Greg Stark" <greg(dot)stark(at)enterprisedb(dot)com>, "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP] In-place upgrade
Date: 2008-11-06 03:12:20
Message-ID: 603c8f070811051912w16827461q9bf5576eb814e885@mail.gmail.com
Lists: pgsql-hackers

>>> >> The problem is how to move a tuple from one page to another and keep the
>>> >> indexes in sync. One solution is to perform something like an "update"
>>> >> operation on the tuple. But you need an exclusive lock on the page, and
>>> >> the pin count has to be zero. And the question is where this is a safe
>>> >> operation.
>>> >
>>> > But doesn't this problem go away if you do it in a transaction? You
>>> > set xmax on the old tuple, write the new tuple, and add index entries
>>> > just as you would for a normal update.
>>>
>>> But that doesn't actually solve the overflow problem on the old page...
>>
>> Sure it does. You move just enough tuples that you can convert the page
>> without an overflow.
>
> setting the xmax on a tuple doesn't "move" the tuple

Nobody said it did. I think this would have been more clear if you
had quoted my whole email instead of stopping in the middle:

>> But doesn't this problem go away if you do it in a transaction? You
>> set xmax on the old tuple, write the new tuple, and add index entries
>> just as you would for a normal update.
>>
>> When the old tuple is no longer visible to any transaction, you nuke it.

To spell this out in more detail:

Suppose page 123 is a V3 page containing 6 tuples A, B, C, D, E, and
F. We examine the page and determine that if we convert this to a V4
page, only five tuples will fit. So we need to get rid of one of the
tuples. We begin a transaction and choose F as the victim. Searching
the FSM, we discover that page 456 is a V4 page with available free
space. We pin and lock pages 123 and 456 just as if we were doing a
heap_update. We create F', the V4 version of F, and write it onto
page 456. We set xmax on the original F. We perform the corresponding
index updates and commit the transaction.

Time passes. Eventually F becomes dead. We reclaim the space
previously used by F, and page 123 now contains only 5 tuples. This
is exactly what we needed in order to convert page 123 to a V4 page, so
we do.
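
As a rough illustration of the sequence above, here is a toy Python model, not
backend code. The HeapTuple and Page classes, the count-based capacity, and all
helper names are assumptions made for the sketch; real pages are accounted in
bytes and real visibility checks are far more involved.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class HeapTuple:
    name: str
    version: int                 # on-disk format of this tuple
    xmax: Optional[int] = None   # xid of the transaction that superseded it


@dataclass
class Page:
    version: int
    capacity: int                # tuples that fit (toy stand-in for byte accounting)
    tuples: List[HeapTuple] = field(default_factory=list)


def move_tuple(pages: Dict[int, Page], index: Dict[str, List[int]],
               name: str, src_no: int, dst_no: int, xid: int) -> None:
    """Phase 1: relocate one live tuple the way an ordinary update would."""
    src, dst = pages[src_no], pages[dst_no]
    if len(dst.tuples) >= dst.capacity:
        raise ValueError("destination page has no free space")
    old = next(t for t in src.tuples if t.name == name and t.xmax is None)
    dst.tuples.append(HeapTuple(name, dst.version))  # F' in the new format
    old.xmax = xid                                   # old F stays until it is dead
    index[name].append(dst_no)                       # index now points at both


def prune(pages: Dict[int, Page], index: Dict[str, List[int]],
          page_no: int, oldest_active_xid: int) -> None:
    """Reclaim tuples whose deleting xid precedes every active transaction."""
    page = pages[page_no]
    dead = [t for t in page.tuples
            if t.xmax is not None and t.xmax < oldest_active_xid]
    for t in dead:
        page.tuples.remove(t)
        index[t.name].remove(page_no)


def convert_in_place(page: Page, new_version: int, new_capacity: int) -> bool:
    """Phase 2: convert only if everything still on the page fits."""
    if len(page.tuples) > new_capacity:
        return False
    for t in page.tuples:
        t.version = new_version
    page.version, page.capacity = new_version, new_capacity
    return True
```

In this model tuples A through E stay in V3 format, and must remain readable,
until the prune step has run and the page can finally be converted.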

...Robert


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Robert Haas" <robertmhaas(at)gmail(dot)com>
Cc: "Gregory Stark" <stark(at)enterprisedb(dot)com>, "Martijn van Oosterhout" <kleptog(at)svana(dot)org>, "Zdenek Kotala" <Zdenek(dot)Kotala(at)sun(dot)com>, "Greg Stark" <greg(dot)stark(at)enterprisedb(dot)com>, "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP] In-place upgrade
Date: 2008-11-06 13:52:01
Message-ID: 16613.1225979521@sss.pgh.pa.us
Lists: pgsql-hackers

"Robert Haas" <robertmhaas(at)gmail(dot)com> writes:
> To spell this out in more detail:

> Suppose page 123 is a V3 page containing 6 tuples A, B, C, D, E, and
> F. We examine the page and determine that if we convert this to a V4
> page, only five tuples will fit. So we need to get rid of one of the
> tuples. We begin a transaction and choose F as the victim. Searching
> the FSM, we discover that page 456 is a V4 page with available free
> space. We pin and lock pages 123 and 456 just as if we were doing a
> heap_update. We create F', the V4 version of F, and write it onto
> page 456. We set xmax on the original F. We perform the corresponding
> index updates and commit the transaction.

> Time passes. Eventually F becomes dead. We reclaim the space
> previously used by F, and page 123 now contains only 5 tuples. This
> is exactly what we needed in order to convert page 123 to a V4 page, so
> we do.

That's all fine and dandy, except that it presumes that you can perform
SELECT/UPDATE/DELETE on V3 tuple versions; you can't just pretend that
A-E aren't there until they get converted. Which is exactly the
overhead we were looking to avoid.

(Another small issue is exactly when you convert the index entries,
should you be faced with an upgrade that requires that.)

regards, tom lane


From: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Gregory Stark <stark(at)enterprisedb(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, Greg Stark <greg(dot)stark(at)enterprisedb(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP] In-place upgrade
Date: 2008-11-06 14:30:07
Message-ID: 4912FF6F.6010007@sun.com
Lists: pgsql-hackers

Tom Lane wrote:
> "Robert Haas" <robertmhaas(at)gmail(dot)com> writes:
>> To spell this out in more detail:
>
>> Suppose page 123 is a V3 page containing 6 tuples A, B, C, D, E, and
>> F. We examine the page and determine that if we convert this to a V4
>> page, only five tuples will fit. So we need to get rid of one of the
>> tuples. We begin a transaction and choose F as the victim. Searching
>> the FSM, we discover that page 456 is a V4 page with available free
>> space. We pin and lock pages 123 and 456 just as if we were doing a
>> heap_update. We create F', the V4 version of F, and write it onto
>> page 456. We set xmax on the original F. We perform the corresponding
>> index updates and commit the transaction.
>
>> Time passes. Eventually F becomes dead. We reclaim the space
>> previously used by F, and page 123 now contains only 5 tuples. This
>> is exactly what we needed in order to convert page 123 to a V4 page, so
>> we do.
>
> That's all fine and dandy, except that it presumes that you can perform
> SELECT/UPDATE/DELETE on V3 tuple versions; you can't just pretend that
> A-E aren't there until they get converted. Which is exactly the
> overhead we were looking to avoid.

We want to avoid overhead on V$latest$ tuples, but I guess a small performance
gap on old tuples is acceptable. The only way I can see now for this to work is
to support processing of multiple page versions, with an old tuple being
converted when PageGetHeapTuple is called.

However, as Heikki mentioned, tuple and page conversion is basic and the same
for all upgrade methods, so it should be done first.

Zdenek

--
Zdenek Kotala Sun Microsystems
Prague, Czech Republic http://sun.com/postgresql


From: "Robert Haas" <robertmhaas(at)gmail(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Gregory Stark" <stark(at)enterprisedb(dot)com>, "Martijn van Oosterhout" <kleptog(at)svana(dot)org>, "Zdenek Kotala" <Zdenek(dot)Kotala(at)sun(dot)com>, "Greg Stark" <greg(dot)stark(at)enterprisedb(dot)com>, "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP] In-place upgrade
Date: 2008-11-06 14:39:27
Message-ID: 603c8f070811060639n15bbdb79p8ffdc6b84aacf32e@mail.gmail.com
Lists: pgsql-hackers

> That's all fine and dandy, except that it presumes that you can perform
> SELECT/UPDATE/DELETE on V3 tuple versions; you can't just pretend that
> A-E aren't there until they get converted. Which is exactly the
> overhead we were looking to avoid.

I don't understand this comment at all. Unless you have some sort of
magical wand in your back pocket that will instantaneously transform
the entire database, there is going to be a period of time when you
have to cope with both V3 and V4 pages. ISTM that what we should be
talking about here is:

(1) How are we going to do that in a way that imposes near-zero
overhead once the entire database has been converted?
(2) How are we going to do that in a way that is minimally invasive to the code?
(3) Can we accomplish (1) and (2) while still retaining somewhat
reasonable performance for V3 pages?

Zdenek's initial proposal did this by replacing all of the tuple
header macros with functions that were conditionalized on page
version. I think we agree that's not going to work. That doesn't
mean that there is no approach that can work, and we were discussing
possible ways to make it work upthread until the thread got hijacked
to discuss the right way of handling page expansion. Now that it
seems we agree that a transaction can be used to move tuples onto new
pages, I think we'd be well served to stop talking about page
expansion and get back to the original topic: where and how to insert
the hooks for V3 tuple handling.

> (Another small issue is exactly when you convert the index entries,
> should you be faced with an upgrade that requires that.)

Zdenek set out his thoughts on this point upthread, no need to rehash here.

...Robert


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Gregory Stark <stark(at)enterprisedb(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, Zdenek Kotala <Zdenek(dot)Kotala(at)sun(dot)com>, Greg Stark <greg(dot)stark(at)enterprisedb(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP] In-place upgrade
Date: 2008-11-06 17:15:09
Message-ID: 200811061715.mA6HF9g00508@momjian.us
Lists: pgsql-hackers

Robert Haas wrote:
> > That's all fine and dandy, except that it presumes that you can perform
> > SELECT/UPDATE/DELETE on V3 tuple versions; you can't just pretend that
> > A-E aren't there until they get converted. Which is exactly the
> > overhead we were looking to avoid.
>
> I don't understand this comment at all. Unless you have some sort of
> magical wand in your back pocket that will instantaneously transform
> the entire database, there is going to be a period of time when you
> have to cope with both V3 and V4 pages. ISTM that what we should be
> talking about here is:
>
> (1) How are we going to do that in a way that imposes near-zero
> overhead once the entire database has been converted?
> (2) How are we going to do that in a way that is minimally invasive to the code?
> (3) Can we accomplish (1) and (2) while still retaining somewhat
> reasonable performance for V3 pages?
>
> Zdenek's initial proposal did this by replacing all of the tuple
> header macros with functions that were conditionalized on page
> version. I think we agree that's not going to work. That doesn't
> mean that there is no approach that can work, and we were discussing
> possible ways to make it work upthread until the thread got hijacked
> to discuss the right way of handling page expansion. Now that it
> seems we agree that a transaction can be used to move tuples onto new
> pages, I think we'd be well served to stop talking about page
> expansion and get back to the original topic: where and how to insert
> the hooks for V3 tuple handling.

I think the above is a good summary. For me, the problem with any
approach that has information about prior-version block formats in the
main code path is code complexity, and secondarily performance.

I know there is concern that converting all blocks on read-in might
expand the page beyond 8k in size. One idea Heikki had was to require
that some tool be run on minor releases before a major upgrade to
guarantee there is enough free space to convert the block to the current
format on read-in, which would localize the information about prior
block formats. We could release the tool in minor branches around the
same time as a major release. Also consider that there are very few
releases that expand the page size.

For these reasons, the expand-the-page-beyond-8k problem should not be
dictating what approach we take for upgrade-in-place because there are
workarounds for the problem, and the problem is rare. I would like us
to again focus on converting the pages to the current version format on
read-in, and perhaps a tool to convert all old pages to the new format.
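
A minimal sketch of the convert-on-read idea, in Python rather than backend C.
It assumes the V4-to-V5 change mentioned earlier in the thread (page header
growing from 24 to 28 bytes, tuple size unchanged) and an 8k page. The Page
class, convert_on_read, and PageTooFull are hypothetical names, not existing
PostgreSQL functions.

```python
from dataclasses import dataclass

PAGE_SIZE = 8192
HEADER_SIZE = {4: 24, 5: 28}   # assumption: V5 header is 4 bytes larger than V4
CURRENT_VERSION = 5


class PageTooFull(Exception):
    """Raised when an old page lacks the room needed for the new header."""


@dataclass
class Page:
    version: int
    tuple_bytes: int           # space taken by line pointers and tuple data

    def free_space(self) -> int:
        return PAGE_SIZE - HEADER_SIZE[self.version] - self.tuple_bytes


def convert_on_read(page: Page) -> Page:
    """Return the page in the current format so callers never see old layouts."""
    if page.version == CURRENT_VERSION:
        return page
    extra = HEADER_SIZE[CURRENT_VERSION] - HEADER_SIZE[page.version]
    if page.free_space() < extra:
        # This is the case a pre-upgrade step would have to rule out.
        raise PageTooFull(f"need {extra} bytes, have {page.free_space()}")
    return Page(CURRENT_VERSION, page.tuple_bytes)
```

Only convert_on_read knows about the old header size here, which is the
localization argument made above; the PageTooFull branch is what the
pre-upgrade free-space guarantee is meant to make unreachable.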

FYI, we are also going to need the ability to convert all pages to the
current format for multi-release upgrades. For example, if you did
upgrade-in-place from 8.2 to 8.3, you are going to need to update all
pages to the 8.3 format before doing upgrade-in-place to 8.4; perhaps
vacuum can do something like this on a per-table basis, and we can
record that status in a pg_class column.

Also, consider that when we did PITR, we required commands before and
after the tar so that there was a consistent API for PITR, and later had
to add capabilities to those functions, but the user API didn't change.

I envision a similar system where we have utilities to guarantee all
pages have enough free space, and all pages are the current version,
before allowing an upgrade-in-place to the next version. Such a
consistent API will make the job for users easier and our job simpler,
and with upgrade-in-place, where we have limited time and resources to
code this for each release, simplicity is important.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Gregory Stark <stark(at)enterprisedb(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, Zdenek Kotala <Zdenek(dot)Kotala(at)sun(dot)com>, Greg Stark <greg(dot)stark(at)enterprisedb(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP] In-place upgrade
Date: 2008-11-06 18:09:15
Message-ID: 3332.1225994955@sss.pgh.pa.us
Lists: pgsql-hackers

Bruce Momjian <bruce(at)momjian(dot)us> writes:
> I envision a similar system where we have utilities to guarantee all
> pages have enough free space, and all pages are the current version,
> before allowing an upgrade-in-place to the next version. Such a
> consistent API will make the job for users easier and our job simpler,
> and with upgrade-in-place, where we have limited time and resources to
> code this for each release, simplicity is important.

An external utility doesn't seem like the right way to approach it.
For example, given the need to ensure X amount of free space in each
page, the only way to guarantee that would be to shut down the database
while you run the utility over all the pages --- otherwise somebody
might fill some page up again. And that completely defeats the purpose,
which is to have minimal downtime during upgrade.

I think we can have a notion of pre-upgrade maintenance, but it would
have to be integrated into normal operations. For instance, if
conversion to 8.4 requires extra free space, we'd make late releases
of 8.3.x not only be able to force that to occur, but also tweak the
normal code paths to maintain that minimum free space.

The full concept as I understood it (dunno why Bruce left all these
details out of his message) went like this:

* Add a "format serial number" column to pg_class, and probably also
pg_database. Rather like the frozenxid columns, this would have the
semantics that all pages in a relation or database are known to have at
least the specified format number.

* There would actually be two serial numbers per release, at least for
releases where pre-update prep work is involved --- for instance,
between 8.3 and 8.4 there'd be an "8.3-and-a-half" format which is
8.3 but known ready to update to 8.4 (eg, enough free space available).
Minor releases of 8.3 that appear with or subsequent to 8.4 release
understand the "half" format number and how to upgrade to it.

* VACUUM would be empowered, in the same way as it handles frozenxid
maintenance, to update any less-than-the-latest-version pages and then
fix the pg_class and pg_database entries.

* We could mechanically enforce that you not update until the database
is ready for it by checking pg_database.datformatversion during
postmaster startup.

So the update process would require users to install a suitably late
version of 8.3, vacuum everything over a suitable maintenance window,
then install 8.4, then perhaps vacuum everything again if they want to
try to push page update work into specific maintenance windows. But
the DB is up and functioning the whole time.
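
The bookkeeping described above can be sketched as a toy Python model. Only
datformatversion is named in this message; relformatversion, the numeric format
codes, and the function names are assumptions for illustration, and a real
VACUUM would of course rewrite pages rather than integers.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical numbering: a plain format per release plus a "half" step
# meaning "old layout, but known ready for the next release".
FMT_83 = 830
FMT_83_HALF = 835
FMT_84 = 840


@dataclass
class Relation:
    name: str
    page_formats: List[int]    # format number stored on each page
    relformatversion: int      # hypothetical pg_class column: lower bound


@dataclass
class Database:
    relations: List[Relation]
    datformatversion: int      # lower bound over all relations


def vacuum(db: Database, rel: Relation, target: int) -> None:
    """Bring every older page up to `target`, then advance the recorded bounds."""
    rel.page_formats = [max(f, target) for f in rel.page_formats]
    rel.relformatversion = min(rel.page_formats, default=target)
    db.datformatversion = min(r.relformatversion for r in db.relations)


def may_start(db: Database, oldest_readable: int) -> bool:
    """Startup check: refuse to run on a database that is not ready."""
    return db.datformatversion >= oldest_readable
```

As with the frozenxid columns, the recorded value is only a lower bound, so the
database-wide number cannot advance until every relation has been vacuumed.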

regards, tom lane


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Gregory Stark <stark(at)enterprisedb(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, Zdenek Kotala <Zdenek(dot)Kotala(at)sun(dot)com>, Greg Stark <greg(dot)stark(at)enterprisedb(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP] In-place upgrade
Date: 2008-11-06 18:36:07
Message-ID: 200811061836.mA6Ia7o13784@momjian.us
Lists: pgsql-hackers

Tom Lane wrote:
> Bruce Momjian <bruce(at)momjian(dot)us> writes:
> > I envision a similar system where we have utilities to guarantee all
> > pages have enough free space, and all pages are the current version,
> > before allowing an upgrade-in-place to the next version. Such a
> > consistent API will make the job for users easier and our job simpler,
> > and with upgrade-in-place, where we have limited time and resources to
> > code this for each release, simplicity is important.
>
> An external utility doesn't seem like the right way to approach it.
> For example, given the need to ensure X amount of free space in each
> page, the only way to guarantee that would be to shut down the database
> while you run the utility over all the pages --- otherwise somebody
> might fill some page up again. And that completely defeats the purpose,
> which is to have minimal downtime during upgrade.
>
> I think we can have a notion of pre-upgrade maintenance, but it would
> have to be integrated into normal operations. For instance, if
> conversion to 8.4 requires extra free space, we'd make late releases
> of 8.3.x not only be able to force that to occur, but also tweak the
> normal code paths to maintain that minimum free space.
>
> The full concept as I understood it (dunno why Bruce left all these
> details out of his message) went like this:

Exactly. I didn't go into the implementation details to make it easier
for people to see my general goals. Tom's implementation steps are the
correct approach, assuming we can get agreement on the general goals.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: "Robert Haas" <robertmhaas(at)gmail(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Bruce Momjian" <bruce(at)momjian(dot)us>, "Gregory Stark" <stark(at)enterprisedb(dot)com>, "Martijn van Oosterhout" <kleptog(at)svana(dot)org>, "Zdenek Kotala" <Zdenek(dot)Kotala(at)sun(dot)com>, "Greg Stark" <greg(dot)stark(at)enterprisedb(dot)com>, "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP] In-place upgrade
Date: 2008-11-06 19:17:41
Message-ID: 603c8f070811061117t7bc31de6u81825999647736e7@mail.gmail.com
Lists: pgsql-hackers

> An external utility doesn't seem like the right way to approach it.
> For example, given the need to ensure X amount of free space in each
> page, the only way to guarantee that would be to shut down the database
> while you run the utility over all the pages --- otherwise somebody
> might fill some page up again. And that completely defeats the purpose,
> which is to have minimal downtime during upgrade.

Agreed.

> I think we can have a notion of pre-upgrade maintenance, but it would
> have to be integrated into normal operations. For instance, if
> conversion to 8.4 requires extra free space, we'd make late releases
> of 8.3.x not only be able to force that to occur, but also tweak the
> normal code paths to maintain that minimum free space.

1. This seems to fly in the face of the sort of thing we've
traditionally back-patched. The code to make pages ready for upgrade
to the next major release will not necessarily be straightforward (in
fact it probably isn't, otherwise we wouldn't have insisted on a
two-stage conversion process), which turns a seemingly safe minor
upgrade into a potentially dangerous operation.

2. Just because I want to upgrade to 8.3.47 and get the latest bug
fixes does not mean that I have any intention of upgrading to 8.4, and
yet you've rearranged all of my pages to have useless free space in
them (possibly at considerable and unexpected I/O cost for at least as
long as the conversion is running).

The second point could probably be addressed with a GUC but the first
one certainly can't.

3. What about multi-release upgrades? Say someone wants to upgrade
from 8.3 to 8.6. 8.6 only knows how to read pages that are
8.5-and-a-half or better, 8.5 only knows how to read pages that are
8.4-and-a-half or better, and 8.4 only knows how to read pages that
are 8.3-and-a-half or better. So the user will have to upgrade to
8.3.MAX, then 8.4.MAX, then 8.5.MAX, and then 8.6.
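
To make the chain in point 3 concrete, here is a small Python sketch that lists
the stops such a scheme would force. The release list (including the
not-yet-existing 8.5 and 8.6) and the upgrade_path helper are hypothetical.

```python
from typing import List

RELEASES = ["8.3", "8.4", "8.5", "8.6"]   # hypothetical release sequence


def upgrade_path(start: str, target: str) -> List[str]:
    """List every version to install when each release reads only N-1-and-a-half."""
    i, j = RELEASES.index(start), RELEASES.index(target)
    if i > j:
        raise ValueError("cannot upgrade to an older release")
    # Each intermediate major must be taken to its last minor release,
    # since only that release can produce the "-and-a-half" format.
    steps = [f"{release}.MAX" for release in RELEASES[i:j]]
    steps.append(target)
    return steps


print(upgrade_path("8.3", "8.6"))   # ['8.3.MAX', '8.4.MAX', '8.5.MAX', '8.6']
```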

It seems to me that if there is any way to put all of the logic to
handle old page versions in the new code that would be much better,
especially if it's an optional feature that can be compiled in or not.
Then when it's time to upgrade from 8.3 to 8.6 you could do:

./configure --with-upgrade-83 --with-upgrade-84 --with-upgrade-85

but if you don't need the code to handle old page versions you can:

./configure --without-upgrade-85

Admittedly, this requires making the new code capable of rearranging
pages to create free space when necessary, and to be able to continue
to execute queries while doing it, but ways of doing this have been
proposed. The only uncertainty is as to whether the performance and
code complexity can be kept manageable, but I don't believe that
question has been explored to the point where we should be ready to
declare defeat.

...Robert


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Gregory Stark <stark(at)enterprisedb(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, Zdenek Kotala <Zdenek(dot)Kotala(at)sun(dot)com>, Greg Stark <greg(dot)stark(at)enterprisedb(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP] In-place upgrade
Date: 2008-11-06 19:31:15
Message-ID: 200811061931.mA6JVF711246@momjian.us
Lists: pgsql-hackers

Robert Haas wrote:
> The second point could probably be addressed with a GUC but the first
> one certainly can't.
>
> 3. What about multi-release upgrades? Say someone wants to upgrade
> from 8.3 to 8.6. 8.6 only knows how to read pages that are
> 8.5-and-a-half or better, 8.5 only knows how to read pages that are
> 8.4-and-a-half or better, and 8.4 only knows how to read pages that
> are 8.3-and-a-half or better. So the user will have to upgrade to
> 8.3.MAX, then 8.4.MAX, then 8.5.MAX, and then 8.6.

Yes.

> It seems to me that if there is any way to put all of the logic to
> handle old page versions in the new code that would be much better,
> especially if it's an optional feature that can be compiled in or not.
> Then when it's time to upgrade from 8.3 to 8.6 you could do:
>
> ./configure --with-upgrade-83 --with-upgrade-84 --with-upgrade-85
>
> but if you don't need the code to handle old page versions you can:
>
> ./configure --without-upgrade85
>
> Admittedly, this requires making the new code capable of rearranging
> pages to create free space when necessary, and to be able to continue
> to execute queries while doing it, but ways of doing this have been
> proposed. The only uncertainty is as to whether the performance and
> code complexity can be kept manageable, but I don't believe that
> question has been explored to the point where we should be ready to
> declare defeat.

And almost guarantee that the job will never be completed, or tested
fully. Remember that in-place upgrades would be pretty painless so
doing multiple major upgrades should not be a difficult requirement, or
they can dump/reload their data to skip it.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Gregory Stark <stark(at)enterprisedb(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, Zdenek Kotala <Zdenek(dot)Kotala(at)sun(dot)com>, Greg Stark <greg(dot)stark(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP] In-place upgrade
Date: 2008-11-06 19:49:07
Message-ID: 49134A33.6030503@enterprisedb.com
Lists: pgsql-hackers

Tom Lane wrote:
> I think we can have a notion of pre-upgrade maintenance, but it would
> have to be integrated into normal operations. For instance, if
> conversion to 8.4 requires extra free space, we'd make late releases
> of 8.3.x not only be able to force that to occur, but also tweak the
> normal code paths to maintain that minimum free space.

Agreed, the backend needs to be modified to reserve the space.

> The full concept as I understood it (dunno why Bruce left all these
> details out of his message) went like this:
>
> * Add a "format serial number" column to pg_class, and probably also
> pg_database. Rather like the frozenxid columns, this would have the
> semantics that all pages in a relation or database are known to have at
> least the specified format number.
>
> * There would actually be two serial numbers per release, at least for
> releases where pre-update prep work is involved --- for instance,
> between 8.3 and 8.4 there'd be an "8.3-and-a-half" format which is
> 8.3 but known ready to update to 8.4 (eg, enough free space available).
> Minor releases of 8.3 that appear with or subsequent to 8.4 release
> understand the "half" format number and how to upgrade to it.
>
> * VACUUM would be empowered, in the same way as it handles frozenxid
> maintenance, to update any less-than-the-latest-version pages and then
> fix the pg_class and pg_database entries.
>
> * We could mechanically enforce that you not update until the database
> is ready for it by checking pg_database.datformatversion during
> postmaster startup.

Adding catalog columns seems rather complicated, and not back-patchable.
Not backpatchable means that we'd need to be sure now that the format
serial numbers are enough for the upcoming 8.4-8.5 upgrade.

I imagined that you would have just a single cluster-wide variable, a
GUC perhaps, indicating how much space should be reserved by
updates/inserts. Then you'd have an additional program, perhaps a new
contrib module, that sets the variable to the right value for the
version you're upgrading, and scans through all tables, moving tuples so
that every page has enough free space for the upgrade. After that's
done, it'd set a flag in the data directory indicating that the cluster
is ready for upgrade.

The tool could run concurrently with normal activity, so you could just
let it run for as long as it takes.
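
The pre-upgrade pass described above can be sketched roughly as follows. This
is a toy model in Python, not PostgreSQL code: `Page`, `reserve_space`, and the
8192-byte page size are assumptions made for illustration, and real pages also
carry headers and line pointers that are ignored here.

```python
from dataclasses import dataclass, field

PAGE_SIZE = 8192  # assumed page size for this sketch


@dataclass
class Page:
    tuples: list = field(default_factory=list)  # tuple sizes in bytes

    def free(self) -> int:
        return PAGE_SIZE - sum(self.tuples)


def reserve_space(pages: list, reserved: int) -> bool:
    """Move tuples off any page that has less than `reserved` free bytes.

    Returns True once every page satisfies the reservation, which is the
    point at which the tool would mark the cluster as ready for upgrade.
    """
    i = 0
    while i < len(pages):  # pages appended during the scan are visited too
        page = pages[i]
        while page.free() < reserved:
            if len(page.tuples) <= 1:
                # A lone tuple cannot be made to fit by moving it elsewhere.
                raise ValueError("tuple too large for the requested reservation")
            size = page.tuples.pop()
            # Prefer an existing page that keeps its own reserve afterwards.
            target = next(
                (p for p in pages if p is not page and p.free() - size >= reserved),
                None,
            )
            if target is None:
                target = Page()
                pages.append(target)
            target.tuples.append(size)
        i += 1
    return all(p.free() >= reserved for p in pages)
```

Because the scan only ever moves tuples to pages that keep their reserve, it
could in principle be interleaved with normal activity, provided inserts and
updates honor the same reservation.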

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: "Robert Haas" <robertmhaas(at)gmail(dot)com>
To: "Bruce Momjian" <bruce(at)momjian(dot)us>
Cc: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Gregory Stark" <stark(at)enterprisedb(dot)com>, "Martijn van Oosterhout" <kleptog(at)svana(dot)org>, "Zdenek Kotala" <Zdenek(dot)Kotala(at)sun(dot)com>, "Greg Stark" <greg(dot)stark(at)enterprisedb(dot)com>, "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP] In-place upgrade
Date: 2008-11-06 20:40:31
Message-ID: 603c8f070811061240i2b8d98a4ja5f5d68492208a9f@mail.gmail.com
Lists: pgsql-hackers

> And almost guarantee that the job will never be completed, or tested
> fully. Remember that in-place upgrades would be pretty painless so
> doing multiple major upgrades should not be a difficult requirement, or
> they can dump/reload their data to skip it.

Regardless of what design is chosen, there's no requirement that we
support in-place upgrade from 8.3 to 8.6, or even 8.4 to 8.6, in one
shot. But the design that you and Tom are proposing pretty much
ensures that it will be impossible.

But that's certainly the least important reason not to do it this way.
I think this comment from Heikki is pretty revealing:

> Adding catalog columns seems rather complicated, and not back-patchable. Not backpatchable means that we'd need to be sure now
> that the format serial numbers are enough for the upcoming 8.4-8.5 upgrade.

That means, in essence, that the earliest possible version that could
be in-place upgraded would be an 8.4 system - we are giving up
completely on in-place upgrade to 8.4 from any earlier version (which
personally I thought was the whole point of this feature in the first
place). And we'll only be able to in-place upgrade to 8.5 if the
unproven assumption that these catalog changes are sufficient turns
out to be true, or if whatever other changes turn out to be necessary
are back-patchable.

...Robert


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Robert Haas" <robertmhaas(at)gmail(dot)com>
Cc: "Bruce Momjian" <bruce(at)momjian(dot)us>, "Gregory Stark" <stark(at)enterprisedb(dot)com>, "Martijn van Oosterhout" <kleptog(at)svana(dot)org>, "Zdenek Kotala" <Zdenek(dot)Kotala(at)sun(dot)com>, "Greg Stark" <greg(dot)stark(at)enterprisedb(dot)com>, "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP] In-place upgrade
Date: 2008-11-06 21:26:17
Message-ID: 15464.1226006777@sss.pgh.pa.us
Lists: pgsql-hackers

"Robert Haas" <robertmhaas(at)gmail(dot)com> writes:
> That means, in essence, that the earliest possible version that could
> be in-place upgraded would be an 8.4 system - we are giving up
> completely on in-place upgrade to 8.4 from any earlier version (which
> personally I thought was the whole point of this feature in the first
> place).

Quite honestly, given where we are in the schedule and the lack of
consensus about how to do this, I think we would be well advised to
decide right now to forget about supporting in-place upgrade to 8.4,
and instead work on allowing in-place upgrades from 8.4 onwards.
Shooting for a general-purpose does-it-all scheme that can handle
old versions that had no thought of supporting such updates is likely
to ensure that we end up with *NOTHING*.

What Bruce is proposing, I think, is that we intentionally restrict what
we want to accomplish to something that might be within reach now and
also sustainable over the long term. Planning to update any version to
any other version is *not* sustainable --- we haven't got the resources
nor the interest to create large amounts of conversion code.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Gregory Stark <stark(at)enterprisedb(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, Zdenek Kotala <Zdenek(dot)Kotala(at)sun(dot)com>, Greg Stark <greg(dot)stark(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP] In-place upgrade
Date: 2008-11-06 21:42:09
Message-ID: 15637.1226007729@sss.pgh.pa.us
Lists: pgsql-hackers

Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
> Adding catalog columns seems rather complicated, and not back-patchable.

Agreed, we'd not be able to make them retroactively appear in 8.3.

> I imagined that you would have just a single cluster-wide variable, a
> GUC perhaps, indicating how much space should be reserved by
> updates/inserts. Then you'd have an additional program, perhaps a new
> contrib module, that sets the variable to the right value for the
> version you're upgrading, and scans through all tables, moving tuples so
> that every page has enough free space for the upgrade. After that's
> done, it'd set a flag in the data directory indicating that the cluster
> is ready for upgrade.

Possibly that could work. The main thing is to have a way of being sure
that the prep work has been completed on every page of the database.
The disadvantage of not having catalog support is that you'd have to
complete the entire scan operation in one go to be sure you'd hit
everything.

Another thought here is that I don't think we are yet committed to any
changes that require extra space between 8.3 and 8.4, are we? The
proposed addition of CRC words could be put off to 8.5, for instance.
So it seems at least within reach to not require any preparatory steps
for 8.3-to-8.4, and put the infrastructure in place now to support such
steps in future go-rounds.

regards, tom lane


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Gregory Stark <stark(at)enterprisedb(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, Zdenek Kotala <Zdenek(dot)Kotala(at)sun(dot)com>, Greg Stark <greg(dot)stark(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP] In-place upgrade
Date: 2008-11-06 23:51:20
Message-ID: Pine.GSO.4.64.0811061754120.15452@westnet.com
Lists: pgsql-hackers

On Thu, 6 Nov 2008, Tom Lane wrote:

> Another thought here is that I don't think we are yet committed to any
> changes that require extra space between 8.3 and 8.4, are we? The
> proposed addition of CRC words could be put off to 8.5, for instance.

I was just staring at that code as you wrote this thinking about the same
thing. CRCs are a great feature I'd really like to see. On the other
hand, announcing that 8.4 features in-place upgrades for 8.3 databases,
and that the project has laid the infrastructure such that future releases
will also upgrade in-place, would IMHO be the biggest positive
announcement of the new release by a large margin. At least then new
large (>1TB) installs could kick off on either the stable 8.3 or 8.4
knowing they'd never be forced to deal with dump/reload, whereas right now
there is no reasonable solution for them that involves PostgreSQL (I just
crossed 3TB on a system last month and I'm not looking forward to its
future upgrades).

Two questions come to mind here:

-If you reduce the page layout upgrade problem to "convert from V4 to V5
adding support for CRCs", is there a worthwhile simpler path to handling
that without dragging the full complexity of the older page layout changes
in?

-Is it worth considering making CRCs an optional compile-time feature, and
that (for now at least) you couldn't get them and the in-place upgrade at
the same time?

Stepping back for a second, the idea that in-place upgrade is only
worthwhile if it yields zero downtime isn't necessarily the case. Even
having an offline-only upgrade tool to handle the more complicated
situations where tuples have to be squeezed onto another page would still
be a major improvement over the current situation. The thing that you
have to recognize here is that dump/reload is extremely slow because of
bottlenecks in the COPY process. That makes for a large amount of
downtime--many hours isn't unusual.

If older version upgrade downtime was reduced to how long it takes to run
a "must scan every page and fiddle with it if full" tool, that would still
be a giant improvement over the current state of things. If Zdenek's
figure that only a small percentage of pages will need such adjustment
holds up, that should take only some factor longer than a sequential scan
of the whole database. That's not instant, but it's at least an order of
magnitude faster than a dump/reload on a big system.

The idea that you're going to get in-place upgrade all the way back to 8.2
without taking the database down for even a little bit to run such a
utility is hard to pull off, and it's impressive that Zdenek and everyone
else involved has gotten so close to doing it. I personally am on the
fence as to whether it's worth paying even the 1% penalty for that
implementation all the time just to get in-place upgrades. If an offline
utility with reasonable (scan instead of dump/reload) downtime and closer
to zero overhead when finished was available instead, that might be a more
reasonable trade-off to make for handling older releases. There are so
many bottlenecks in the older versions that you're less likely to find a
database too large to dump and reload there anyway. It would also be the
case that improvements to that offline utility could continue after 8.4
proper was completely frozen.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: "Robert Haas" <robertmhaas(at)gmail(dot)com>
To: "Greg Smith" <gsmith(at)gregsmith(dot)com>
Cc: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "Bruce Momjian" <bruce(at)momjian(dot)us>, "Gregory Stark" <stark(at)enterprisedb(dot)com>, "Martijn van Oosterhout" <kleptog(at)svana(dot)org>, "Zdenek Kotala" <Zdenek(dot)Kotala(at)sun(dot)com>, "Greg Stark" <greg(dot)stark(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP] In-place upgrade
Date: 2008-11-07 00:25:38
Message-ID: 603c8f070811061625t3fdc59cem68e21e89b1c9a418@mail.gmail.com
Lists: pgsql-hackers

> The idea that you're going to get in-place upgrade all the way back to 8.2
> without taking the database down for even a little bit to run such a utility
> is hard to pull off, and it's impressive that Zdenek and everyone else
> involved has gotten so close to doing it.

I think we should at least wait to see what the next version of his
patch looks like before making any final decisions.

...Robert


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Smith <gsmith(at)gregsmith(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Gregory Stark <stark(at)enterprisedb(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, Zdenek Kotala <Zdenek(dot)Kotala(at)sun(dot)com>, Greg Stark <greg(dot)stark(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP] In-place upgrade
Date: 2008-11-07 01:43:20
Message-ID: 18128.1226022200@sss.pgh.pa.us
Lists: pgsql-hackers

Greg Smith <gsmith(at)gregsmith(dot)com> writes:
> On Thu, 6 Nov 2008, Tom Lane wrote:
>> Another thought here is that I don't think we are yet committed to any
>> changes that require extra space between 8.3 and 8.4, are we? The
>> proposed addition of CRC words could be put off to 8.5, for instance.

> I was just staring at that code as you wrote this thinking about the same
> thing. ...

> -Is it worth considering making CRCs an optional compile-time feature, and
> that (for now at least) you couldn't get them and the in-place upgrade at
> the same time?

Hmm ... might be better than not offering them in 8.4 at all, but the
thing is that then you are asking packagers to decide for their
customers which is more important. And I'd bet you anything you want
that in-place upgrade would be their choice.

Also, having such an option would create extra complexity for 8.4-to-8.5
upgrades.

regards, tom lane


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Gregory Stark <stark(at)enterprisedb(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, Zdenek Kotala <Zdenek(dot)Kotala(at)sun(dot)com>, Greg Stark <greg(dot)stark(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP] In-place upgrade
Date: 2008-11-07 02:31:12
Message-ID: Pine.GSO.4.64.0811062111120.10016@westnet.com
Lists: pgsql-hackers

On Thu, 6 Nov 2008, Tom Lane wrote:

>> -Is it worth considering making CRCs an optional compile-time feature, and
>> that (for now at least) you couldn't get them and the in-place upgrade at
>> the same time?
>
> Hmm ... might be better than not offering them in 8.4 at all, but the
> thing is that then you are asking packagers to decide for their
> customers which is more important. And I'd bet you anything you want
> that in-place upgrade would be their choice.

I was thinking of something similar to how --enable-thread-safety has been
rolled out. It could be hanging around there and available to those who
want it in their build, even though it might not be available by default
in a typical mainstream distribution. Since there's already a GUC for
toggling the checksums in the code, internally it could work like
debug_assertions where you only get that option if support was compiled in
appropriately. Just a thought I wanted to throw out there, if it makes
eventual upgrades from 8.4 more complicated it may not be worth even
considering.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Bruce Momjian <bruce(at)momjian(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Gregory Stark <stark(at)enterprisedb(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, Greg Stark <greg(dot)stark(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP] In-place upgrade
Date: 2008-11-07 20:19:09
Message-ID: 4914A2BD.6090706@sun.com
Lists: pgsql-hackers

Heikki Linnakangas wrote:
> Tom Lane wrote:
>> I think we can have a notion of pre-upgrade maintenance, but it would
>> have to be integrated into normal operations. For instance, if
>> conversion to 8.4 requires extra free space, we'd make late releases
>> of 8.3.x not only be able to force that to occur, but also tweak the
>> normal code paths to maintain that minimum free space.
>
> Agreed, the backend needs to be modified to reserve the space.
>
>> The full concept as I understood it (dunno why Bruce left all these
>> details out of his message) went like this:
>>
>> * Add a "format serial number" column to pg_class, and probably also
>> pg_database. Rather like the frozenxid columns, this would have the
>> semantics that all pages in a relation or database are known to have at
>> least the specified format number.
>>
>> * There would actually be two serial numbers per release, at least for
>> releases where pre-update prep work is involved --- for instance,
>> between 8.3 and 8.4 there'd be an "8.3-and-a-half" format which is
>> 8.3 but known ready to update to 8.4 (eg, enough free space available).
>> Minor releases of 8.3 that appear with or subsequent to 8.4 release
>> understand the "half" format number and how to upgrade to it.
>>
>> * VACUUM would be empowered, in the same way as it handles frozenxid
>> maintenance, to update any less-than-the-latest-version pages and then
>> fix the pg_class and pg_database entries.
>>
>> * We could mechanically enforce that you not update until the database
>> is ready for it by checking pg_database.datformatversion during
>> postmaster startup.
>
> Adding catalog columns seems rather complicated, and not back-patchable.
> Not backpatchable means that we'd need to be sure now that the format
> serial numbers are enough for the upcoming 8.4-8.5 upgrade.

Reloptions are suitable for storing the amount of reserved space, and they can
be back-ported to 8.3 and 8.2. And of course there is no problem converting
8.1->8.2.

For the back-patched branches it would be better to combine an internal
modification that preserves the space with, e.g., a stored procedure that
checks all relations.

In 8.4 and newer, pg_class could be extended with new attributes.

> I imagined that you would have just a single cluster-wide variable, a
> GUC perhaps, indicating how much space should be reserved by
> updates/inserts.

You sometimes need a different reserved size for different types of relation.
For example, on 32-bit x86 you don't need to reserve space for the heap, but you
do need it for indexes (between v3->v4). It is better to use reloptions and have
the pre-upgrade procedure set this information correctly.
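
A minimal sketch of the per-relation idea, in Python rather than backend code.
The option name `reserved_space`, the relation names, and the byte counts are
all hypothetical; the only point is that the insert path consults a
per-relation value instead of one cluster-wide setting.

```python
# Hypothetical per-relation settings, as a pre-upgrade procedure might store
# them in reloptions. The byte counts are placeholders, not real requirements.
reloptions = {
    "orders": {"kind": "heap", "reserved_space": 0},
    "orders_pkey": {"kind": "index", "reserved_space": 16},
}


def can_insert(relname: str, page_free: int, item_size: int) -> bool:
    """Allow an insert only if the page keeps the relation's reserve."""
    reserved = reloptions[relname].get("reserved_space", 0)
    return page_free - item_size >= reserved
```

With these placeholder values, a 100-byte item fits on a heap page with exactly
100 free bytes, while the index page needs 116 free bytes for the same item.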

> Then you'd have an additional program, perhaps a new
> contrib module, that sets the variable to the right value for the
> version you're upgrading, and scans through all tables, moving tuples so
> that every page has enough free space for the upgrade. After that's
> done, it'd set a flag in the data directory indicating that the cluster
> is ready for upgrade.

I prefer to have this information in pg_class, where it is accessible by SQL
commands. pg_class should also contain information about the last checked page,
to avoid repeating the check on very large tables.

> The tool could run concurrently with normal activity, so you could just
> let it run for as long as it takes.

Agree.

Zdenek

--
Zdenek Kotala Sun Microsystems
Prague, Czech Republic http://sun.com/postgresql


From: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Gregory Stark <stark(at)enterprisedb(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, Greg Stark <greg(dot)stark(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP] In-place upgrade
Date: 2008-11-07 20:24:20
Message-ID: 4914A3F4.10008@sun.com
Lists: pgsql-hackers

Tom Lane wrote:
> Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
>> Adding catalog columns seems rather complicated, and not back-patchable.
>
> Agreed, we'd not be able to make them retroactively appear in 8.3.
>
>> I imagined that you would have just a single cluster-wide variable, a
>> GUC perhaps, indicating how much space should be reserved by
>> updates/inserts. Then you'd have an additional program, perhaps a new
>> contrib module, that sets the variable to the right value for the
>> version you're upgrading, and scans through all tables, moving tuples so
>> that every page has enough free space for the upgrade. After that's
>> done, it'd set a flag in the data directory indicating that the cluster
>> is ready for upgrade.
>
> Possibly that could work. The main thing is to have a way of being sure
> that the prep work has been completed on every page of the database.
> The disadvantage of not having catalog support is that you'd have to
> complete the entire scan operation in one go to be sure you'd hit
> everything.

I prefer to have catalog support. Especially on very large tables, it helps when
somebody stops the pre-upgrade script for some reason.

> Another thought here is that I don't think we are yet committed to any
> changes that require extra space between 8.3 and 8.4, are we? The
> proposed addition of CRC words could be put off to 8.5, for instance.
> So it seems at least within reach to not require any preparatory steps
> for 8.3-to-8.4, and put the infrastructure in place now to support such
> steps in future go-rounds.

Yeah. We still have V4 without any storage modification (excluding the HASH
index). However, I think that if reloptions are used for storing information
about reserved space, then it shouldn't be a problem. But we need to be sure
that it is possible.

Zdenek

--
Zdenek Kotala Sun Microsystems
Prague, Czech Republic http://sun.com/postgresql


From: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Gregory Stark <stark(at)enterprisedb(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, Greg Stark <greg(dot)stark(at)enterprisedb(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP] In-place upgrade
Date: 2008-11-07 20:41:30
Message-ID: 4914A7FA.6030705@sun.com
Lists: pgsql-hackers

Tom Lane wrote:

> I think we can have a notion of pre-upgrade maintenance, but it would
> have to be integrated into normal operations. For instance, if
> conversion to 8.4 requires extra free space, we'd make late releases
> of 8.3.x not only be able to force that to occur, but also tweak the
> normal code paths to maintain that minimum free space.

OK. I will focus on this. I guess this approach revives my hook patch:

http://archives.postgresql.org/pgsql-hackers/2008-04/msg00990.php

> The full concept as I understood it (dunno why Bruce left all these
> details out of his message) went like this:
>
> * Add a "format serial number" column to pg_class, and probably also
> pg_database. Rather like the frozenxid columns, this would have the
> semantics that all pages in a relation or database are known to have at
> least the specified format number.
>
> * There would actually be two serial numbers per release, at least for
> releases where pre-update prep work is involved --- for instance,
> between 8.3 and 8.4 there'd be an "8.3-and-a-half" format which is
> 8.3 but known ready to update to 8.4 (eg, enough free space available).
> Minor releases of 8.3 that appear with or subsequent to 8.4 release
> understand the "half" format number and how to upgrade to it.

I prefer to store the latest processed block. InvalidBlockNumber would mean
nothing is processed and 0 means everything is already reserved. I suggest
processing the relation backward. That should avoid checking newly extended
blocks, which will already be set up correctly.
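
The backward, resumable scan can be sketched like this. It is a Python model,
not backend code: `process_backward`, `budget`, and `fix_block` are names
invented for the example, while 0xFFFFFFFF is the value PostgreSQL uses for
InvalidBlockNumber.

```python
INVALID_BLOCK_NUMBER = 0xFFFFFFFF  # PostgreSQL's InvalidBlockNumber


def process_backward(nblocks, last_processed, budget, fix_block):
    """Process up to `budget` blocks, walking from the end of the relation
    toward block 0, and return the new 'last processed block' value.

    INVALID_BLOCK_NUMBER means nothing has been processed yet; 0 means the
    whole relation is done.
    """
    if nblocks == 0:
        return 0  # an empty relation has nothing left to reserve
    if last_processed == INVALID_BLOCK_NUMBER:
        next_block = nblocks - 1
    else:
        next_block = last_processed - 1
    while budget > 0 and next_block >= 0:
        fix_block(next_block)
        last_processed = next_block
        next_block -= 1
        budget -= 1
    return last_processed
```

On resume, only blocks below the stored value are visited, so blocks added by
relation extension in the meantime are skipped on the assumption that they were
created with the reservation already in place.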

> * VACUUM would be empowered, in the same way as it handles frozenxid
> maintenance, to update any less-than-the-latest-version pages and then
> fix the pg_class and pg_database entries.
>
> * We could mechanically enforce that you not update until the database
> is ready for it by checking pg_database.datformatversion during
> postmaster startup.

I don't understand you here. Do you mean on the old server version or the new
server version? Who will perform this check? Don't forget that we currently do
catalog conversion by dump and import, which loses all extended information.

Thanks Zdenek

--
Zdenek Kotala Sun Microsystems
Prague, Czech Republic http://sun.com/postgresql


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Gregory Stark <stark(at)enterprisedb(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, Greg Stark <greg(dot)stark(at)enterprisedb(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP] In-place upgrade
Date: 2008-11-07 20:51:58
Message-ID: 23592.1226091118@sss.pgh.pa.us
Lists: pgsql-hackers

Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM> writes:
> Tom Lane wrote:
>> * Add a "format serial number" column to pg_class, and probably also
>> pg_database. Rather like the frozenxid columns, this would have the
>> semantics that all pages in a relation or database are known to have at
>> least the specified format number.

> I prefer to store the latest processed block. InvalidBlockNumber would mean
> nothing is processed and 0 means everything is already reserved. I
> suggest processing the relation backward. That should avoid checking newly
> extended blocks, which will already be set up correctly.

That seems bizarre and not very helpful. In the first place, if we're
driving it off vacuum there would be no opportunity for recording a
half-processed state value. In the second place, this formulation fails
to provide any evidence of *what* processing you completed or didn't
complete. In a multi-step upgrade sequence I think it's going to be a
mess if we aren't explicit about that.
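
For illustration, the explicit format-number bookkeeping proposed in this thread
might look roughly like the following Python model. The numeric values, the
function names, and the in-memory lists standing in for pages and catalog rows
are assumptions for the sketch; only the "at least this format" semantics and
the startup check come from the discussion.

```python
# Hypothetical format serial numbers. "8.3-and-a-half" is the 8.3 layout with
# the preparation for 8.4 completed.
FORMAT_83, FORMAT_83_HALF, FORMAT_84 = 30, 35, 40


def vacuum_relation(page_formats, target, upgrade_page):
    """Bring every older page up to `target` and return the relation-level
    format number, i.e. the minimum over all of its pages."""
    for blkno, fmt in enumerate(page_formats):
        if fmt < target:
            upgrade_page(blkno)
            page_formats[blkno] = target
    return min(page_formats, default=target)


def database_format(relation_formats):
    """A database is only as new as its oldest relation."""
    return min(relation_formats.values())


def check_startup(datformatversion, minimum_readable):
    """Refuse to start on a database that has not finished its prep work."""
    if datformatversion < minimum_readable:
        raise RuntimeError("database is not ready for this server version")
```

Recording a format number, rather than a block position, states which step of a
multi-step upgrade has been completed for each relation.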

regards, tom lane


From: Decibel! <decibel(at)decibel(dot)org>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Gregory Stark <stark(at)enterprisedb(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, Zdenek Kotala <Zdenek(dot)Kotala(at)sun(dot)com>, Greg Stark <greg(dot)stark(at)enterprisedb(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP] In-place upgrade
Date: 2008-11-10 00:12:21
Message-ID: 75081FC9-2909-496E-BAFB-2541636F66DA@decibel.org
Lists: pgsql-hackers

On Nov 6, 2008, at 1:31 PM, Bruce Momjian wrote:
>> 3. What about multi-release upgrades? Say someone wants to upgrade
>> from 8.3 to 8.6. 8.6 only knows how to read pages that are
>> 8.5-and-a-half or better, 8.5 only knows how to read pages that are
>> 8.4-and-a-half or better, and 8.4 only knows how to read pages that
>> are 8.3-and-a-half or better. So the user will have to upgrade to
>> 8.3.MAX, then 8.4.MAX, then 8.5.MAX, and then 8.6.
>
> Yes.

I think that's pretty seriously undesirable. It's not at all
uncommon for databases to stick around for a very long time and then
jump ahead many versions. I don't think we want to tell people they
can't do that.

More importantly, I think we're barking up the wrong tree by putting
migration knowledge into old versions. All that the old versions need
to do is guarantee a specific amount of free space per page. We
should provide a mechanism to tell a cluster what that free space
requirement is, and not hard-code it into the backend.

Unless I'm mistaken, there are only two cases we care about for
additional space: per-page and per-tuple. Those requirements could
also vary for different types of pg_class objects. What we need is an
API that allows an administrator to tell the database to start
setting this space aside. One possibility:

pg_min_free_space( version, relkind, bytes_per_page, bytes_per_tuple );
pg_min_free_space_index( version, indexkind, bytes_per_page, bytes_per_tuple );

version: This would be provided as a safety mechanism. You would have
to provide the major version that matches what the backend is
running. See below for an example.

relkind: Essentially, heap vs toast, though I suppose it's possible
we might need this for sequences.

indexkind: Because we support different types of indexes, I think we
need to handle them differently than heap/toast. If we wanted, we
could have a single function that demands that indexkind is NULL if
relkind != 'index'.

bytes_per_(page|tuple): obvious. :)

Once we have an API, we need to get users to make use of it. I'm
thinking add something like the following to the release notes:

"To upgrade from a prior version to 8.4, you will need to run some of
the following commands, depending on what version you are currently
using:

For version 8.3:
SELECT pg_min_free_space( '8.3', 'heap', 4, 12 );
SELECT pg_min_free_space( '8.3', 'toast', 4, 12 );

For version 8.2:
SELECT pg_min_free_space( '8.2', 'heap', 14, 12 );
SELECT pg_min_free_space( '8.2', 'toast', 14, 12 );
SELECT pg_min_free_space_index( '8.2', 'b-tree', 4, 4);"

(Note I'm just pulling numbers out of thin air in this example.)

As you can see, we pass in the version number to ensure that if
someone accidentally cuts and pastes the wrong commands, they know
what they did wrong immediately.
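
A rough model of that safety check, written in Python rather than as backend code. Everything here is an assumption for illustration: the function only mirrors the proposed signature, the running version is hard-coded, and the numbers are the same made-up values as in the example.

```python
# Model only: these names mirror the proposal, not any existing PostgreSQL function.
RUNNING_MAJOR_VERSION = "8.3"       # assumption: major version of the running backend
KNOWN_RELKINDS = {"heap", "toast"}  # assumption: the relkinds named in the proposal

reservations = {}  # relkind -> (bytes_per_page, bytes_per_tuple)


def pg_min_free_space(version, relkind, bytes_per_page, bytes_per_tuple):
    """Record a free-space reservation, refusing commands meant for another version."""
    if version != RUNNING_MAJOR_VERSION:
        raise ValueError(
            f"command is for {version}, but this server is {RUNNING_MAJOR_VERSION}"
        )
    if relkind not in KNOWN_RELKINDS:
        raise ValueError(f"unknown relkind: {relkind}")
    if bytes_per_page < 0 or bytes_per_tuple < 0:
        raise ValueError("reserved sizes must not be negative")
    reservations[relkind] = (bytes_per_page, bytes_per_tuple)


pg_min_free_space("8.3", "heap", 4, 12)       # accepted
try:
    pg_min_free_space("8.2", "heap", 14, 12)  # pasted from the wrong section
except ValueError as err:
    print(err)
```

The mismatched call fails before anything is recorded, which is the "know what they did wrong immediately" behaviour.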

One downside to this scheme is that it doesn't provide a mechanism to
ensure that all required minimum free space requirements were passed
in. Perhaps we want a function that takes an array of complex types
and forces you to supply information for all known storage
mechanisms. Another possibility would be to pass in some kind of
binary format that contains a checksum.
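
One way the "supply everything at once" variant might look, again as a Python model and not backend code. The set of known storage mechanisms and the FreeSpaceRequirement type are assumptions made up for this sketch.

```python
from typing import NamedTuple


class FreeSpaceRequirement(NamedTuple):
    storage: str          # e.g. "heap", "toast", "b-tree"
    bytes_per_page: int
    bytes_per_tuple: int


# Assumption: the storage mechanisms the old version knows about.
KNOWN_STORAGE = {"heap", "toast", "b-tree"}


def set_all_min_free_space(requirements):
    """Accept the requirements only if every known storage mechanism is covered."""
    supplied = {r.storage for r in requirements}
    missing = KNOWN_STORAGE - supplied
    unknown = supplied - KNOWN_STORAGE
    if missing or unknown:
        raise ValueError(f"missing: {sorted(missing)}, unknown: {sorted(unknown)}")
    return {r.storage: (r.bytes_per_page, r.bytes_per_tuple) for r in requirements}
```

A call that leaves out "b-tree", for instance, is rejected as a whole instead of silently applying a partial configuration.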

Even if we do come up with a pretty fool-proof way to tell the old
version what free space it needs to set aside, I think we should
still have a mechanism for the new version to know exactly what the
old version has set aside, and if it's actually been accomplished or
not. One option that comes to mind is to add min_free_space_per_page
and min_free_space_per_tuple to pg_class. Normally these fields would
be NULL; the old version would only set them once it had verified
that all pages in a given relation met those requirements (presumably
via vacuum). The new version would check all these values on startup
to ensure they made sense.
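
A sketch of what that startup check could amount to, under the assumption that each relation carries the two proposed fields and that NULL (None here) means "not yet verified". The relation names and required values are invented for the example.

```python
# Assumption: what the new version requires, keyed by relation kind.
# The numbers are placeholders, like the ones in the release-note example.
REQUIRED = {"heap": (4, 12), "toast": (4, 12)}


def check_relations(relations, required=REQUIRED):
    """relations maps name -> (kind, min_free_space_per_page, min_free_space_per_tuple).

    Returns a list of problems; an empty list means the upgrade may proceed.
    """
    problems = []
    for name, (kind, per_page, per_tuple) in sorted(relations.items()):
        need_page, need_tuple = required[kind]
        if per_page is None or per_tuple is None:
            # The old version never confirmed that all pages meet the requirement.
            problems.append(f"{name}: reservation not verified")
        elif per_page < need_page or per_tuple < need_tuple:
            problems.append(f"{name}: reserved space too small")
    return problems


example = {
    "orders": ("heap", 4, 12),
    "orders_toast": ("toast", None, None),
    "events": ("heap", 2, 12),
}
for problem in check_relations(example):
    print(problem)
```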

OTOH, we might not want to go mucking around with changing the
catalog for older versions (I'm not even sure if we can). So perhaps
it would be better to store this information in a separate table, or
maybe a separate file. That might be best anyway; we generally
wouldn't need this information, so it would be nice if it wasn't
bloating pg_class all the time.
--
Decibel!, aka Jim C. Nasby, Database Architect decibel(at)decibel(dot)org
Give your computer some brain candy! www.distributed.net Team #1828


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Decibel! <decibel(at)decibel(dot)org>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Gregory Stark <stark(at)enterprisedb(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, Zdenek Kotala <Zdenek(dot)Kotala(at)sun(dot)com>, Greg Stark <greg(dot)stark(at)enterprisedb(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP] In-place upgrade
Date: 2008-11-10 01:02:19
Message-ID: 17159.1226278939@sss.pgh.pa.us
Lists: pgsql-hackers

Decibel! <decibel(at)decibel(dot)org> writes:
> I think that's pretty seriously un-desirable. It's not at all
> uncommon for databases to stick around for a very long time and then
> jump ahead many versions. I don't think we want to tell people they
> can't do that.

Of course they can do that --- they just have to do it one version at a
time.

I think it's time for people to stop asking for the moon and realize
that if we don't constrain this feature pretty darn tightly, we will
have *nothing at all* for 8.4. Again.

regards, tom lane


From: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Decibel! <decibel(at)decibel(dot)org>, Bruce Momjian <bruce(at)momjian(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Gregory Stark <stark(at)enterprisedb(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, Zdenek Kotala <Zdenek(dot)Kotala(at)sun(dot)com>, Greg Stark <greg(dot)stark(at)enterprisedb(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP] In-place upgrade
Date: 2008-11-10 04:09:27
Message-ID: 1226290167.17553.2.camel@jd-laptop.pragmaticzealot.org
Lists: pgsql-hackers

On Sun, 2008-11-09 at 20:02 -0500, Tom Lane wrote:
> Decibel! <decibel(at)decibel(dot)org> writes:
> > I think that's pretty seriously un-desirable. It's not at all
> > uncommon for databases to stick around for a very long time and then
> > jump ahead many versions. I don't think we want to tell people they
> > can't do that.
>
> Of course they can do that --- they just have to do it one version at a
> time.
>
> I think it's time for people to stop asking for the moon and realize
> that if we don't constrain this feature pretty darn tightly, we will
> have *nothing at all* for 8.4. Again.

Gotta go with Tom on this one. The idea that we would somehow upgrade
from 8.1 to 8.4 is silly. Yes it will be unfortunate for those running
8.1 but keeping track of multi version like that is going to be entirely
too expensive.

At some point it won't matter but right now it really does.

Joshua D. Drake

>
> regards, tom lane
>
--


From: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
To: Decibel! <decibel(at)decibel(dot)org>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Gregory Stark <stark(at)enterprisedb(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, Greg Stark <greg(dot)stark(at)enterprisedb(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP] In-place upgrade
Date: 2008-11-10 08:15:13
Message-ID: 4917ED91.5080302@sun.com
Lists: pgsql-hackers

Decibel! wrote:

> Unless I'm mistaken, there are only two cases we care about for
> additional space: per-page and per-tuple.

Yes. The special space of indexes might also need to be extended, but that is
covered by the per-page setting.

> Those requirements could also
> vary for different types of pg_class objects. What we need is an API
> that allows an administrator to tell the database to start setting this
> space aside. One possibility:

We need an API or mechanism through which in-place upgrade can set this up. It
must be done by in-place upgrade itself.

<snip>

> relkind: Essentially, heap vs toast, though I suppose it's possible we
> might need this for sequences.

Sequences are converted during catalog upgrade.

<snip>
> Once we have an API, we need to get users to make use of it. I'm
> thinking add something like the following to the release notes:
>
> "To upgrade from a prior version to 8.4, you will need to run some of
> the following commands, depending on what version you are currently using:
>
<snip>

It is too complicated. First, the values also depend on the architecture, and an
in-place upgrade script can easily compute them. All you should need to do is
run a script that does all the settings for you. You can obtain it from the next
version (IIRC Oracle does it this way), or we can add this configuration script
to the previous version in a minor update.

>
> OTOH, we might not want to go mucking around with changing the catalog
> for older versions (I'm not even sure if we can). So perhaps it would be
> better to store this information in a separate table, or maybe a
> separate file. That might be best anyway; we generally wouldn't need
> this information, so it would be nice if it wasn't bloating pg_class all
> the time.

That is why I chose relopt for storing this configuration parameter; it has been
supported since 8.2, and upgrade from 8.1->8.2 works fine.
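
Assuming "relopt" here means pg_class.reloptions, which holds a text array of "keyword=value" strings, reading a reservation back out of it could look roughly like this. The option name reserved_page_space is hypothetical and is not taken from the patch.

```python
def parse_reloptions(reloptions):
    """Turn ['name=value', ...] into a dict; None means no options are set."""
    options = {}
    for item in reloptions or []:
        name, sep, value = item.partition("=")
        if not sep:
            raise ValueError(f"malformed reloption: {item!r}")
        options[name] = value
    return options


def reserved_page_space(reloptions, default=0):
    # "reserved_page_space" is a hypothetical option name used only for illustration.
    return int(parse_reloptions(reloptions).get("reserved_page_space", default))
```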

Zdenek


From: "Matthew T(dot) O'Connor" <matthew(at)zeut(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Decibel!" <decibel(at)decibel(dot)org>, Bruce Momjian <bruce(at)momjian(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Gregory Stark <stark(at)enterprisedb(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, Zdenek Kotala <Zdenek(dot)Kotala(at)sun(dot)com>, Greg Stark <greg(dot)stark(at)enterprisedb(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP] In-place upgrade
Date: 2008-11-10 14:14:19
Message-ID: 491841BB.9020305@zeut.net
Lists: pgsql-hackers

Tom Lane wrote:
> Decibel! <decibel(at)decibel(dot)org> writes:
>
>> I think that's pretty seriously un-desirable. It's not at all
>> uncommon for databases to stick around for a very long time and then
>> jump ahead many versions. I don't think we want to tell people they
>> can't do that.
>>
>
> Of course they can do that --- they just have to do it one version at a
> time.

Also, people may be less likely to stick with an old outdated version
for years and years if the upgrade process is easier.


From: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To: "Matthew T(dot) O'Connor" <matthew(at)zeut(dot)net>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Decibel! <decibel(at)decibel(dot)org>, Bruce Momjian <bruce(at)momjian(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Gregory Stark <stark(at)enterprisedb(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, Zdenek Kotala <Zdenek(dot)Kotala(at)sun(dot)com>, Greg Stark <greg(dot)stark(at)enterprisedb(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP] In-place upgrade
Date: 2008-11-10 18:16:34
Message-ID: 1226340994.21694.42.camel@jd-laptop.pragmaticzealot.org
Lists: pgsql-hackers

On Mon, 2008-11-10 at 09:14 -0500, Matthew T. O'Connor wrote:
> Tom Lane wrote:
> > Decibel! <decibel(at)decibel(dot)org> writes:
> >
> >> I think that's pretty seriously un-desirable. It's not at all
> >> uncommon for databases to stick around for a very long time and then
> >> jump ahead many versions. I don't think we want to tell people they
> >> can't do that.
> >>
> >
> > Of course they can do that --- they just have to do it one version at a
> > time.
>
> Also, people may be less likely to stick with an old outdated version
> for years and years if the upgrade process is easier.

Kind of OT, but I don't agree with this. There will always be those who
are willing to just upgrade because they can, but the smart play is to
upgrade because you need to. If anything, in-place upgrades are just going
to remove the last real business and technical barrier to using
postgresql for enterprises.

Joshua D. Drake

>
>
--


From: Jeff <threshar(at)torgo(dot)978(dot)org>
To: jd(at)commandprompt(dot)com
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP] In-place upgrade
Date: 2008-11-10 20:36:16
Message-ID: 580C7B8A-3605-4DDF-96F7-B220FF4B19FE@torgo.978.org
Lists: pgsql-hackers


On Nov 9, 2008, at 11:09 PM, Joshua D. Drake wrote:
>> I think it's time for people to stop asking for the moon and realize
>> that if we don't constrain this feature pretty darn tightly, we will
>> have *nothing at all* for 8.4. Again.
>
> Gotta go with Tom on this one. The idea that we would somehow upgrade
> from 8.1 to 8.4 is silly. Yes it will be unfortunate for those running
> 8.1 but keeping track of multi version like that is going to be entirely
> too expensive.
>

I agree as well. If we can get at least the base-level stuff into
8.4 so that 8.5 and beyond are in-place upgradable, then that is a huge
win. If we could support 8.2 or 8.3 or 6.5 :) that would be nice,
but I think dealing with everything retroactively will cause our heads
to explode and a mountain of awful code to arise. If we say "8.4 and
beyond will be upgradable" we can toss everything in we think we'll
need to deal with it and not worry about the retroactive case (unless
someone has a really clever(tm) idea!)

This can't be an original problem to solve, too many other databases
do it as well.

--
Jeff Trout <jeff(at)jefftrout(dot)com>
http://www.stuarthamm.net/
http://www.dellsmartexitin.com/


From: "Robert Haas" <robertmhaas(at)gmail(dot)com>
To: "Zdenek Kotala" <Zdenek(dot)Kotala(at)sun(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP] In-place upgrade
Date: 2008-11-26 04:40:29
Message-ID: 603c8f070811252040w24e32b46sc4612e154955e41f@mail.gmail.com
Lists: pgsql-hackers

Zdenek -

I am a bit murky on where we stand with upgrade-in-place in terms of
reviewing. Initially, you had submitted four patches for this
commitfest:

1. htup and bufpage API clean up
2. HeapTuple version extension + code cleanup
3. In-place online upgrade
4. Extending pg_class info + more flexible TOAST chunk size

I think that it was decided that replacing the heap tuple access
macros with function calls was not acceptable, so I have moved patches
#1 and #2 to the "Returned with feedback" section. I thought that
perhaps the third patch could be salvaged, but the consensus seemed to
be to go in a new direction, so I'm thinking that one should probably
be moved to "Returned with feedback" as well. However, I'm not clear
on whether you will be submitting something else instead and whether
that thing should be considered material for this commitfest. Can you
let me know how you are thinking about this?

With respect to #4, I know that Alvaro submitted a draft patch, but
I'm not clear on whether that needs to be reviewed, because:

- I'm not sure whether it's close enough to being finished for a
review to be a good use of time.
- I'm not sure how much you and Heikki have already reviewed it.
- I'm not sure whether this patch buys us anything by itself.

Thoughts?

...Robert


From: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP] In-place upgrade
Date: 2008-11-26 09:03:26
Message-ID: 492D10DE.3020108@sun.com
Lists: pgsql-hackers

Robert,

Many thanks for your review. I think #1 is still partially valid, because it
contains general cleanups, but part of it is not necessary now. You can move
#2, #3 and #4 to the "Returned with feedback" section.

Thanks Zdenek

Robert Haas wrote:
> Zdenek -
>
> I am a bit murky on where we stand with upgrade-in-place in terms of
> reviewing. Initially, you had submitted four patches for this
> commitfest:
>
> 1. htup and bufpage API clean up
> 2. HeapTuple version extension + code cleanup
> 3. In-place online upgrade
> 4. Extending pg_class info + more flexible TOAST chunk size
>
> I think that it was decided that replacing the heap tuple access
> macros with function calls was not acceptable, so I have moved patches
> #1 and #2 to the "Returned with feedback" section. I thought that
> perhaps the third patch could be salvaged, but the consensus seemed to
> be to go in a new direction, so I'm thinking that one should probably
> be moved to "Returned with feedback" as well. However, I'm not clear
> on whether you will be submitting something else instead and whether
> that thing should be considered material for this commitfest. Can you
> let me know how you are thinking about this?
>
> With respect to #4, I know that Alvaro submitted a draft patch, but
> I'm not clear on whether that needs to be reviewed, because:
>
> - I'm not sure whether it's close enough to being finished for a
> review to be a good use of time.
> - I'm not sure how much you and Heikki have already reviewed it.
> - I'm not sure whether this patch buys us anything by itself.
>
> Thoughts?
>
> ...Robert


From: "Robert Haas" <robertmhaas(at)gmail(dot)com>
To: "Zdenek Kotala" <Zdenek(dot)Kotala(at)sun(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP] In-place upgrade
Date: 2008-11-26 14:44:56
Message-ID: 603c8f070811260644v765118a5pcf44fe40ead52871@mail.gmail.com
Lists: pgsql-hackers

>> 1. htup and bufpage API clean up
>> 2. HeapTuple version extension + code cleanup
>> 3. In-place online upgrade
>> 4. Extending pg_class info + more flexible TOAST chunk size
> Many thanks for your review. I think #1 is still partially valid, because it
> contains general cleanups, but part of it is not necessary now. You can move
> #2, #3 and #4 to the "Returned with feedback" section.

OK, when can you submit a new version of #1 with the parts that are
still valid, updated to CVS HEAD, etc?

Thanks,

...Robert


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Zdenek Kotala <Zdenek(dot)Kotala(at)sun(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP] In-place upgrade
Date: 2008-11-26 14:55:50
Message-ID: 20081126145550.GE4275@alvh.no-ip.org
Lists: pgsql-hackers

Robert Haas wrote:

> With respect to #4, I know that Alvaro submitted a draft patch, but
> I'm not clear on whether that needs to be reviewed, because:
>
> - I'm not sure whether it's close enough to being finished for a
> review to be a good use of time.
> - I'm not sure how much you and Heikki have already reviewed it.
> - I'm not sure whether this patch buys us anything by itself.

I finished that patch, but I didn't submit it because in later
discussion it turned out (at least as I read it) that it's considered to
be unnecessary.

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP] In-place upgrade
Date: 2008-11-26 15:18:35
Message-ID: 492D68CB.1020807@sun.com
Lists: pgsql-hackers

Alvaro Herrera wrote:
> Robert Haas wrote:
>
>> With respect to #4, I know that Alvaro submitted a draft patch, but
>> I'm not clear on whether that needs to be reviewed, because:
>>
>> - I'm not sure whether it's close enough to being finished for a
>> review to be a good use of time.
>> - I'm not sure how much you and Heikki have already reviewed it.
>> - I'm not sure whether this patch buys us anything by itself.
>
> I finished that patch, but I didn't submit it because in later
> discussion it turned out (at least as I read it) that it's considered to
> be unnecessary.
>

From the pg_upgrade perspective, it is something we will need to do anyway,
because TOAST_MAX_CHUNK_SIZE will be different in 8.5 (if the CRC patch is
committed). So we will need the patch for 8.5. It is not necessary for the
8.3->8.4 upgrade because TOAST_MAX_CHUNK_SIZE is the same, and making this
change to the toast table now would add unnecessary complexity.

Zdenek


From: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP] In-place upgrade
Date: 2008-11-27 11:55:45
Message-ID: 492E8AC1.6020808@sun.com
Lists: pgsql-hackers

Robert Haas wrote:
>>> 1. htup and bufpage API clean up
>>> 2. HeapTuple version extension + code cleanup
>>> 3. In-place online upgrade
>>> 4. Extending pg_class info + more flexible TOAST chunk size
>> Many thanks for your review. I think #1 is still partially valid, because it
>> contains general cleanups, but part of it is not necessary now. You can move
>> #2, #3 and #4 to the "Returned with feedback" section.
>
> OK, when can you submit a new version of #1 with the parts that are
> still valid, updated to CVS HEAD, etc?
>

It is not a priority right now. I'm working on space reservation first.

Thanks Zdenek