WIP: Pg_upgrade - page layout converter (PLC) hook

Lists: pgsql-hackers
From: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: WIP: Pg_upgrade - page layout converter (PLC) hook
Date: 2008-04-15 10:46:35
Message-ID: 4804878B.3040709@sun.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


I attached patch which implemented page layout converter (PLC) hook. It is base
stone for in-place upgrade.

How it works:

When PLC module is loaded, then for each page which does not have native page
version conversion routine is called. Buffer is mark as a dirty and upgraded
page is inserted into WAL.

Performance:

I executed "select count(*) from table" on 2,2GB table (4671039 rows) (without
any tunning) and with conversion 2033s (34min) and after conversion and server
restart 31s (0,5min).

Request for comments:

1) I not sure if calling log_newpage is correct.

a) Calling from storage something in access method seems to me as bad think.
I'm thinking to move log_newpage to storage, but it invokes more question about
placement, RM ...

b) log_newpage is used for new page logging, but I use it for storing
converted page. It seems to me that it safe and heap_xlog_newpage correctly
works for new and converted page. I have only doubt about assert macro
mdextend/mdwrite which checks extend vs.write.

2) PLC module placement. I'm looking for best place (directory) where I can put
PLC code. One possibility is to put under contrib/pg_upgrade another
possibility is to put into backend/storage/upgrade/, but in this location it
will not be possible make it as a module.

3) data structures version tracking

For PLC I need to have old version of data structures like page header, tuple
header and so on. It is also useful for external tools to handle more version of
postgresql easily (e.g. pg_control should show data from all supported
postgresql versions).

My idea is to have for each structure version keep own header e.g. bufpage_03.h,
bufpage_04.h ... which will contain typedef struct PageHeaderData_03 ... and
generic bufpage.h with following content:

...
#include "bufpage_04.h"
...
typedef PageHeaderData_04 PageHeaderData;

#define PageGetPageSize(page) PageGetPageSize_04(page)
...

4) how to handle corrupted page? If page is corrupted it could invoke false
calling of convert routine. It could hide problems and conversion could "fix" it
in wrong way. Probably we need to have PageHeaderIsValid for all page layout
version.


Thanks for your comments

Attachment Content-Type Size
plc_hook_01.diff text/x-patch 2.1 KB

From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: Pg_upgrade - page layout converter (PLC) hook
Date: 2008-04-15 15:58:01
Message-ID: 4804D089.2070008@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Zdenek Kotala wrote:
> I attached patch which implemented page layout converter (PLC) hook. It
> is base stone for in-place upgrade.

I agree it's an important piece of the puzzle, but not the most
complicated one by far. How will you deal with catalog changes for
example? I'm going to assume that you'll use pg_migrator for that, and
this page layout conversion is just the final step of the migration and
all the other things like the catalog, clog, config file etc. have
already been converted.

I would suggest starting with putting some serious effort into
pg_migrator. This page layout conversion is actually pretty trivial
compared to all that, and even more so if you can get the page layout
conversion working in pg_migrator first.

Which versions do you plan to support, initially?

> How it works:
>
> When PLC module is loaded, then for each page which does not have native
> page version conversion routine is called. Buffer is mark as a dirty and
> upgraded page is inserted into WAL.

I would suggest delegating the WAL logging to the PLC module. In some
cases it's going to be just a matter of changing

> Performance:
>
> I executed "select count(*) from table" on 2,2GB table (4671039 rows)
> (without any tunning) and with conversion 2033s (34min) and after
> conversion and server restart 31s (0,5min).

Wow, that's a big slowdown. Where's the time spent? Is it the extra I/O
of rewriting the table?

> Request for comments:
>
> 1) I not sure if calling log_newpage is correct.
>
> a) Calling from storage something in access method seems to me as bad
> think. I'm thinking to move log_newpage to storage, but it invokes more
> question about placement, RM ...

Yeah, it probably should be moved somewhere else. We already use it not
only for heapam, but for indexes as well.

> b) log_newpage is used for new page logging, but I use it for storing
> converted page. It seems to me that it safe and heap_xlog_newpage
> correctly works for new and converted page. I have only doubt about
> assert macro mdextend/mdwrite which checks extend vs.write.

That should be fine. In WAL replay, we don't distinguish between write
and extend.

> 3) data structures version tracking
>
> For PLC I need to have old version of data structures like page header,
> tuple header and so on. It is also useful for external tools to handle
> more version of postgresql easily (e.g. pg_control should show data from
> all supported postgresql versions).
>
> My idea is to have for each structure version keep own header e.g.
> bufpage_03.h, bufpage_04.h ... which will contain typedef struct
> PageHeaderData_03 ... and generic bufpage.h with following content:
>
> ...
> #include "bufpage_04.h"
> ...
> typedef PageHeaderData_04 PageHeaderData;
>
> #define PageGetPageSize(page) PageGetPageSize_04(page)
> ...

That's not enough, in general. There might be changes in other header
files that affect the page layout. Like the packed varlen change, which
affected c.h.

> + /* Hook for page layout convertor plugin */
> + typedef void (*plc_hook_type)(char *buffer);
> + extern PGDLLIMPORT plc_hook_type plc_hook;

That's not flexible enough. We've changed the internal representations
of data types in the past, and will likely do that in the future. The
hook therefore needs to have at least the tuple descriptor, to know how
to convert the tuples. I would suggest passing Relation, we have that
available at the call site, and that should contain all the necessary
information.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: Pg_upgrade - page layout converter (PLC) hook
Date: 2008-04-15 16:15:45
Message-ID: 16356.1208276145@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
> Zdenek Kotala wrote:
>> I attached patch which implemented page layout converter (PLC) hook. It
>> is base stone for in-place upgrade.

> I agree it's an important piece of the puzzle, but not the most
> complicated one by far.

In fact, it's not even worth thinking about at this stage, because you
couldn't meaningfully use or test it before you have dealt with catalog
update issues.

Even more to the point, we already had consensus that we would try to
avoid page-level changes once we had a working migration solution,
so it's not clear that there will ever be a big need for page layout
conversion.

I think this patch is a complete waste of effort at this stage, and
should not be considered for application until we have a context in
which we could meaningfully test it.

regards, tom lane


From: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: Pg_upgrade - page layout converter (PLC) hook
Date: 2008-04-15 19:36:43
Message-ID: 480503CB.10407@sun.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Thanks for your comment. See my comments inline:

Heikki Linnakangas napsal(a):
> Zdenek Kotala wrote:
>> I attached patch which implemented page layout converter (PLC) hook. It
>> is base stone for in-place upgrade.
>
> I agree it's an important piece of the puzzle, but not the most
> complicated one by far. How will you deal with catalog changes for
> example? I'm going to assume that you'll use pg_migrator for that, and
> this page layout conversion is just the final step of the migration and
> all the other things like the catalog, clog, config file etc. have
> already been converted.

I'm sorry I forgot to link to my original proposal.
http://archives.postgresql.org/pgsql-hackers/2007-07/msg00057.php

I agree is that it is easiest part, but it is base for my work and upgrade in
place itself. Final state should have everything inside postgres include catalog
conversion. New version should startup on old cluster and converted data "on the
fly".

For testing I'm currently use my own ksh script which is similar to pg_migrator,
because I need database with upgraded catalog. And you cannot upgrade catalog
internally until you are not able to convert page layout. egg chicken problem

> I would suggest starting with putting some serious effort into
> pg_migrator. This page layout conversion is actually pretty trivial
> compared to all that, and even more so if you can get the page layout
> conversion working in pg_migrator first.

By my opinion pg_migrator is good workaround but it is not suitable for real
production. For example TOAST relid protection is dirty hack and you have
problem with tablespaces and so on...

> Which versions do you plan to support, initially?

Currently I'm able to upgrade from 8.1, 8.2 to 8.3, 8.4. It means conversion
between layout version 3 and 4. I'm verifying PLC now and when this part will be
ready it is very easy to implement others as well.

>> How it works:
>>
>> When PLC module is loaded, then for each page which does not have
>> native page version conversion routine is called. Buffer is mark as a
>> dirty and upgraded page is inserted into WAL.
>
> I would suggest delegating the WAL logging to the PLC module. In some
> cases it's going to be just a matter of changing

I rejected this idea. PLC function should be used separately. Put WAL logging
inside makes stronger dependency by my opinion.

>> Performance:
>>
>> I executed "select count(*) from table" on 2,2GB table (4671039 rows)
>> (without any tunning) and with conversion 2033s (34min) and after
>> conversion and server restart 31s (0,5min).
>
> Wow, that's a big slowdown. Where's the time spent? Is it the extra I/O
> of rewriting the table?

Not idea yet. I going to test it. But I guess that it is I/O problem. Each page
is written twice.

>> Request for comments:
>>
>> 1) I not sure if calling log_newpage is correct.
>>
>> a) Calling from storage something in access method seems to me as
>> bad think. I'm thinking to move log_newpage to storage, but it invokes
>> more question about placement, RM ...
>
> Yeah, it probably should be moved somewhere else. We already use it not
> only for heapam, but for indexes as well.

But question where is good place? Because it is related to whole page I suggest
to put it in smgr.c and under SGMR RM.

>> b) log_newpage is used for new page logging, but I use it for
>> storing converted page. It seems to me that it safe and
>> heap_xlog_newpage correctly works for new and converted page. I have
>> only doubt about assert macro mdextend/mdwrite which checks extend
>> vs.write.
>
> That should be fine. In WAL replay, we don't distinguish between write
> and extend.

But in this case the asserts macros in mdexted/mdwrite are odd and the should be
removed.

>> 3) data structures version tracking
>>
>> For PLC I need to have old version of data structures like page
>> header, tuple header and so on. It is also useful for external tools
>> to handle more version of postgresql easily (e.g. pg_control should
>> show data from all supported postgresql versions).
>>
>> My idea is to have for each structure version keep own header e.g.
>> bufpage_03.h, bufpage_04.h ... which will contain typedef struct
>> PageHeaderData_03 ... and generic bufpage.h with following content:
>>
>> ...
>> #include "bufpage_04.h"
>> ...
>> typedef PageHeaderData_04 PageHeaderData;
>>
>> #define PageGetPageSize(page) PageGetPageSize_04(page)
>> ...
>
> That's not enough, in general. There might be changes in other header
> files that affect the page layout. Like the packed varlen change, which
> affected c.h.

Yes, I know, tuple headers and so on. It is only example.

>> + /* Hook for page layout convertor plugin */
>> + typedef void (*plc_hook_type)(char *buffer);
>> + extern PGDLLIMPORT plc_hook_type plc_hook;
>
> That's not flexible enough. We've changed the internal representations
> of data types in the past, and will likely do that in the future. The
> hook therefore needs to have at least the tuple descriptor, to know how
> to convert the tuples. I would suggest passing Relation, we have that
> available at the call site, and that should contain all the necessary
> information.
>

Good point, I thought about it. I currently don't use it but in the future it
could be useful. I will extend it.

thank Zdenek


From: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: Pg_upgrade - page layout converter (PLC) hook
Date: 2008-04-15 20:15:18
Message-ID: 48050CD6.5030102@sun.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane napsal(a):
> Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
>> Zdenek Kotala wrote:
>>> I attached patch which implemented page layout converter (PLC) hook. It
>>> is base stone for in-place upgrade.
>
>> I agree it's an important piece of the puzzle, but not the most
>> complicated one by far.
>
> In fact, it's not even worth thinking about at this stage, because you
> couldn't meaningfully use or test it before you have dealt with catalog
> update issues.

It is egg and chicken problem. I'm able to test it because I use own script for
catalog upgrade (similar to pg_migrator).

> Even more to the point, we already had consensus that we would try to
> avoid page-level changes once we had a working migration solution,
> so it's not clear that there will ever be a big need for page layout
> conversion.

I'm really not convenient that page layout will never changed in the future. It
is modified each second version. And you need way how to convert data from
pre8.4 versions as well. And PLC convert page header, tuple headers and it could
convert also data items.

> I think this patch is a complete waste of effort at this stage, and
> should not be considered for application until we have a context in
> which we could meaningfully test it.

I will send PLC code soon. I'm now testing it. For catalog upgrade you can use
pg_migrator or pg_upgrade.sh script
(http://src.opensolaris.org/source/xref/sfw/usr/src/cmd/postgres/postgresql-upgrade/)

Zdenek


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: Pg_upgrade - page layout converter (PLC) hook
Date: 2008-04-16 05:34:26
Message-ID: 48058FE2.2000205@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Zdenek Kotala wrote:
> Heikki Linnakangas napsal(a):
>> I would suggest starting with putting some serious effort into
>> pg_migrator. This page layout conversion is actually pretty trivial
>> compared to all that, and even more so if you can get the page layout
>> conversion working in pg_migrator first.
>
> By my opinion pg_migrator is good workaround but it is not suitable for
> real production. For example TOAST relid protection is dirty hack and
> you have problem with tablespaces and so on...

Sure, it's not perfect, that's exactly why I suggested working on it! If
you have something else that works better, that's fine, but you will
need to release it under the BSD license if you want help with it.

>> Which versions do you plan to support, initially?
>
> Currently I'm able to upgrade from 8.1, 8.2 to 8.3, 8.4. It means
> conversion between layout version 3 and 4. I'm verifying PLC now and
> when this part will be ready it is very easy to implement others as well.

Hmm. I don't see any of that code in the directory you posted. ?

>>> b) log_newpage is used for new page logging, but I use it for
>>> storing converted page. It seems to me that it safe and
>>> heap_xlog_newpage correctly works for new and converted page. I have
>>> only doubt about assert macro mdextend/mdwrite which checks extend
>>> vs.write.
>>
>> That should be fine. In WAL replay, we don't distinguish between write
>> and extend.
>
> But in this case the asserts macros in mdexted/mdwrite are odd and the
> should be removed.

Those asserts are enforced during normal operation. It's just in WAL
replay that we extend the relation automatically when full pages are
restored. See XLogReadBuffer().

>>> + /* Hook for page layout convertor plugin */
>>> + typedef void (*plc_hook_type)(char *buffer);
>>> + extern PGDLLIMPORT plc_hook_type plc_hook;
>>
>> That's not flexible enough. We've changed the internal representations
>> of data types in the past, and will likely do that in the future. The
>> hook therefore needs to have at least the tuple descriptor, to know
>> how to convert the tuples. I would suggest passing Relation, we have
>> that available at the call site, and that should contain all the
>> necessary information.
>
> Good point, I thought about it. I currently don't use it but in the
> future it could be useful. I will extend it.

Surely you need it already to do the packed varlen conversion?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: Pg_upgrade - page layout converter (PLC) hook
Date: 2008-04-16 07:37:25
Message-ID: 4805ACB5.9030504@sun.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Heikki Linnakangas napsal(a):
> Zdenek Kotala wrote:
>> Heikki Linnakangas napsal(a):
>>> I would suggest starting with putting some serious effort into
>>> pg_migrator. This page layout conversion is actually pretty trivial
>>> compared to all that, and even more so if you can get the page layout
>>> conversion working in pg_migrator first.
>>
>> By my opinion pg_migrator is good workaround but it is not suitable
>> for real production. For example TOAST relid protection is dirty hack
>> and you have problem with tablespaces and so on...
>
> Sure, it's not perfect, that's exactly why I suggested working on it! If
> you have something else that works better, that's fine, but you will
> need to release it under the BSD license if you want help with it.

Sure, my upgrade script is in opensolaris and it is under CDDL license. By my
meaning it is only workaround until PostgreSQL 8.x will not have upgrade
integrated inside. After that this script will be removed.

Of course all work will be under BSD.

>>> Which versions do you plan to support, initially?
>>
>> Currently I'm able to upgrade from 8.1, 8.2 to 8.3, 8.4. It means
>> conversion between layout version 3 and 4. I'm verifying PLC now and
>> when this part will be ready it is very easy to implement others as well.
>
> Hmm. I don't see any of that code in the directory you posted. ?

PLC is not there yet, because it is not reviewed, tested and verified. And I
tested it only on SPARC where varlen works without any modification, but on x86
it needs extra work yet.

>>>> b) log_newpage is used for new page logging, but I use it for
>>>> storing converted page. It seems to me that it safe and
>>>> heap_xlog_newpage correctly works for new and converted page. I have
>>>> only doubt about assert macro mdextend/mdwrite which checks extend
>>>> vs.write.
>>>
>>> That should be fine. In WAL replay, we don't distinguish between
>>> write and extend.
>>
>> But in this case the asserts macros in mdexted/mdwrite are odd and the
>> should be removed.
>
> Those asserts are enforced during normal operation. It's just in WAL
> replay that we extend the relation automatically when full pages are
> restored. See XLogReadBuffer().

OK

>>>> + /* Hook for page layout convertor plugin */
>>>> + typedef void (*plc_hook_type)(char *buffer);
>>>> + extern PGDLLIMPORT plc_hook_type plc_hook;
>>>
>>> That's not flexible enough. We've changed the internal
>>> representations of data types in the past, and will likely do that in
>>> the future. The hook therefore needs to have at least the tuple
>>> descriptor, to know how to convert the tuples. I would suggest
>>> passing Relation, we have that available at the call site, and that
>>> should contain all the necessary information.
>>
>> Good point, I thought about it. I currently don't use it but in the
>> future it could be useful. I will extend it.
>
> Surely you need it already to do the packed varlen conversion?
>

It works on SPARC without any varlen modification :-). I originally planed to
implement varlen modification in another way but it seems to be better to do it
in this place as well.

Thanks for your comment


From: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: Pg_upgrade - page layout converter (PLC) hook
Date: 2008-05-17 20:02:29
Message-ID: 482F39D5.6060901@sun.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Zdenek Kotala napsal(a):

<snip>

> How it works:
>
> When PLC module is loaded, then for each page which does not have native
> page version conversion routine is called. Buffer is mark as a dirty and
> upgraded page is inserted into WAL.
>

Unfortunately, this approach does not work between layout 3 and 4. It works only
for heap on platfrom with maxallign=4.

The main problem is that PageHeader has been extended to 24 bytes and it
requires reindexing, TOAST chunk resizing, converted tuples does not fit on page
on platform where maxallign=8.

I'm now working on offline conversion method.

Zdenek


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Zdenek Kotala <Zdenek(dot)Kotala(at)sun(dot)com>
Subject: Re: WIP: Pg_upgrade - page layout converter (PLC) hook
Date: 2008-05-17 20:24:31
Message-ID: 200805171324.33410.josh@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Zdenek,

> Unfortunately, this approach does not work between layout 3 and 4. It
> works only for heap on platfrom with maxallign=4.
>
> The main problem is that PageHeader has been extended to 24 bytes and it
> requires reindexing, TOAST chunk resizing, converted tuples does not fit
> on page on platform where maxallign=8.
>
> I'm now working on offline conversion method.

Hmmm. I thought we determined earlier that the use case for *offline*
binary conversion was rather limited. It would have to be 10x faster than
dump/reload to be worthwhile. Do you think this is possible?

--
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco


From: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
To: josh(at)agliodbs(dot)com
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP: Pg_upgrade - page layout converter (PLC) hook
Date: 2008-05-17 20:40:59
Message-ID: 482F42DB.1070808@sun.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Josh Berkus napsal(a):
> Zdenek,
>
>> Unfortunately, this approach does not work between layout 3 and 4. It
>> works only for heap on platfrom with maxallign=4.
>>
>> The main problem is that PageHeader has been extended to 24 bytes and it
>> requires reindexing, TOAST chunk resizing, converted tuples does not fit
>> on page on platform where maxallign=8.
>>
>> I'm now working on offline conversion method.
>
> Hmmm. I thought we determined earlier that the use case for *offline*
> binary conversion was rather limited. It would have to be 10x faster than
> dump/reload to be worthwhile. Do you think this is possible?
>

It is difficult to say without prototype. However, binary conversion can be
performed without (or minimal) WAL logging and without sync. I think async write
is a big improvement. And there is not problem convert tables in parallel mode.
And of course you don't need bin<->text<->bin conversion. Unfortunately reindex
is necessary in both cases.

Zdenek