Column reordering in pg_dump

Lists: pgsql-hackers
From: "hernan gonzalez" <hgonzalez(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Column reordering in pg_dump
Date: 2008-11-14 17:21:50
Message-ID: 48692c2d0811140921v4504162ctb2c22b53e2f1b5c9@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi.
As I wanted to change the order of columns of some tables (I know, I
know, my code does not depend
on that; but I prefer that \dt gives me a more organized ouput...)
I've added an option to pg_dump to reorder
columns in the ouput "CREATE TABLE" dump.
The specified order is given in a external file (one column by line,
format "scheme.tablename.colname"),
you only need to write the desired columns to reorder.
I've done it for my own use, minimal testing, etc; but the code is
quite simple and has little coupling
with the rest; and very little risk, IMO.
The code (reoder.c and patched pg_dump.c) is here:
http://hjg.com.ar/tech/prog/pg_dump_reorder/
See the README.txt, and the comments at reorder.c.
If the dev team is interested in adding this little feature to the
core, delighted (in that case, I guess the code
needs some polishing, and perhaps revise terminology, options names,
modify man page, etc)
If not, there it is so the people interested can use it.

BTW: I've been using Postgresql since ten years ago, and it rocks. Congrats.

--
Hernán J. González
http://hjg.com.ar/
Buenos Aires, Argentina


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "hernan gonzalez" <hgonzalez(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Column reordering in pg_dump
Date: 2008-11-14 18:12:09
Message-ID: 14588.1226686329@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"hernan gonzalez" <hgonzalez(at)gmail(dot)com> writes:
> I've added an option to pg_dump to reorder
> columns in the ouput "CREATE TABLE" dump.

This doesn't seem like a particularly good idea to me. In the first
place, pg_dump is a tool for reproducing your database, not altering it,
so it seems like basically the wrong place to be inserting this type of
feature. (There's been some talk of a Postgres ETL tool, which would be
the right place, but so far it's only talk :-(.) In the second place,
column order is actually a pretty delicate affair when you start to
think about table inheritance situations and tables that have been
altered via ADD/DROP COLUMN. We had bugs in pg_dump in the past with
its ability to deal with column order in such cases. So I'm not nearly
as optimistic as you are that such a feature is incapable of causing
problems.

regards, tom lane


From: "hernan gonzalez" <hgonzalez(at)gmail(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Column reordering in pg_dump
Date: 2008-11-14 18:37:05
Message-ID: 48692c2d0811141037r62014180s10d34e8048a004b6@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Nov 14, 2008 at 4:12 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> "hernan gonzalez" <hgonzalez(at)gmail(dot)com> writes:
>> I've added an option to pg_dump to reorder
>> columns in the ouput "CREATE TABLE" dump.
>
> This doesn't seem like a particularly good idea to me. In the first
> place, pg_dump is a tool for reproducing your database, not altering it,
> so it seems like basically the wrong place to be inserting this type of
> feature. (There's been some talk of a Postgres ETL tool, which would be
> the right place, but so far it's only talk :-(.) In the second place,
> column order is actually a pretty delicate affair when you start to
> think about table inheritance situations and tables that have been
> altered via ADD/DROP COLUMN. We had bugs in pg_dump in the past with
> its ability to deal with column order in such cases. So I'm not nearly
> as optimistic as you are that such a feature is incapable of causing
> problems.
>
> regards, tom lane

> In the first placeplace, pg_dump is a tool for reproducing your database, not altering it

Yes, but the standard/recommended procedure for reorder columns in postgresql
is "pg_dump , edit , restore". I just didn't want to mess editing a dump.

Of couse, the standard behavior of pg_dump is not altered when the
"reorder" option
is not use. And bear in mind that the reordering hook is guaranteed to
alter only the order
of the "CREATE TABLE" fields. (The original and the modified dump
will differ only in that;
even in the case of dropped columns and inherited tables). The only
possible troubling scenario
I can imagine: a dump using a COPY without columns names in the data
dump; but that
only arises with version < 7.3.

Anyway, I know you know better.
Best regards!

Hernán J. González
http://hjg.com.ar/


From: Robert Treat <xzilla(at)users(dot)sourceforge(dot)net>
To: pgsql-hackers(at)postgresql(dot)org
Cc: "hernan gonzalez" <hgonzalez(at)gmail(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Column reordering in pg_dump
Date: 2008-11-14 23:11:24
Message-ID: 200811141811.24326.xzilla@users.sourceforge.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Friday 14 November 2008 13:37:05 hernan gonzalez wrote:
> On Fri, Nov 14, 2008 at 4:12 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> > "hernan gonzalez" <hgonzalez(at)gmail(dot)com> writes:
> >> I've added an option to pg_dump to reorder
> >> columns in the ouput "CREATE TABLE" dump.
> >
> > This doesn't seem like a particularly good idea to me. In the first
> > place, pg_dump is a tool for reproducing your database, not altering it,
> > so it seems like basically the wrong place to be inserting this type of
> > feature. (There's been some talk of a Postgres ETL tool, which would be
> > the right place, but so far it's only talk :-(.) In the second place,
> > column order is actually a pretty delicate affair when you start to
> > think about table inheritance situations and tables that have been
> > altered via ADD/DROP COLUMN. We had bugs in pg_dump in the past with
> > its ability to deal with column order in such cases. So I'm not nearly
> > as optimistic as you are that such a feature is incapable of causing
> > problems.
> >
> > regards, tom lane
> >
> > In the first placeplace, pg_dump is a tool for reproducing your database,
> > not altering it
>
> Yes, but the standard/recommended procedure for reorder columns in
> postgresql is "pg_dump , edit , restore". I just didn't want to mess
> editing a dump.
>

it's one method, but not the only method, see
http://wiki.postgresql.org/wiki/Alter_column_position

> Of couse, the standard behavior of pg_dump is not altered when the
> "reorder" option
> is not use. And bear in mind that the reordering hook is guaranteed to
> alter only the order
> of the "CREATE TABLE" fields. (The original and the modified dump
> will differ only in that;
> even in the case of dropped columns and inherited tables). The only
> possible troubling scenario
> I can imagine: a dump using a COPY without columns names in the data
> dump; but that
> only arises with version < 7.3.
>

yeah, i remember using that trick a lot... ah the good ole days :-P

--
Robert Treat
Conjecture: http://www.xzilla.net
Consulting: http://www.omniti.com


From: Decibel! <decibel(at)decibel(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "hernan gonzalez" <hgonzalez(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Column reordering in pg_dump
Date: 2008-11-25 21:10:30
Message-ID: F5F50792-8B6E-4273-9537-BB8FEEC12459@decibel.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Nov 14, 2008, at 12:12 PM, Tom Lane wrote:
> "hernan gonzalez" <hgonzalez(at)gmail(dot)com> writes:
>> I've added an option to pg_dump to reorder
>> columns in the ouput "CREATE TABLE" dump.
>
> This doesn't seem like a particularly good idea to me. In the first
> place, pg_dump is a tool for reproducing your database, not
> altering it,
> so it seems like basically the wrong place to be inserting this
> type of
> feature. (There's been some talk of a Postgres ETL tool, which
> would be
> the right place, but so far it's only talk :-(.) In the second place,
> column order is actually a pretty delicate affair when you start to
> think about table inheritance situations and tables that have been
> altered via ADD/DROP COLUMN. We had bugs in pg_dump in the past with
> its ability to deal with column order in such cases. So I'm not
> nearly
> as optimistic as you are that such a feature is incapable of causing
> problems.

IIRC the community did come to a consensus on allowing for a
different logical ordering from physical ordering, it was an issue of
actually doing the work. If this is an itch you want to scratch, you
might look into fixing that problem instead.
--
Decibel!, aka Jim C. Nasby, Database Architect decibel(at)decibel(dot)org
Give your computer some brain candy! www.distributed.net Team #1828


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Decibel! <decibel(at)decibel(dot)org>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, hernan gonzalez <hgonzalez(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Column reordering in pg_dump
Date: 2008-11-25 22:40:07
Message-ID: 20081125224007.GB27273@svana.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Nov 25, 2008 at 03:10:30PM -0600, Decibel! wrote:
> IIRC the community did come to a consensus on allowing for a
> different logical ordering from physical ordering, it was an issue of
> actually doing the work. If this is an itch you want to scratch, you
> might look into fixing that problem instead.

Err, as I recall it was decided that the chance for confusion was too
high.

http://www.mail-archive.com/pgsql-hackers(at)postgresql(dot)org/msg85548.html

However, it seems to me we could have "reasonably bulletproof or
machine-checkable way to keep the two kinds of column numbers
distinct", like so:

typedef struct { short log; } logical_pos;
typedef struct { short phys; } physical_pos;

This doesn't change the size of the objects, but the compiler will
prevent them from being assigned interchangably.

It does mean you need to use macros to access them, even if it's in a
loop. Fortunatly, we don't need to do too much arithmetic on them.

If the size of the object doesn't matter, you can do thing like typedef
a pointer to a one byte struct. Then most standard arithmetic
operations will still work (IIRC the Linux kernel uses this trick a
lot).

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: Decibel! <decibel(at)decibel(dot)org>, hernan gonzalez <hgonzalez(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Column reordering in pg_dump
Date: 2008-11-26 00:30:34
Message-ID: 23035.1227659434@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Martijn van Oosterhout <kleptog(at)svana(dot)org> writes:
> On Tue, Nov 25, 2008 at 03:10:30PM -0600, Decibel! wrote:
>> IIRC the community did come to a consensus on allowing for a
>> different logical ordering from physical ordering, it was an issue of
>> actually doing the work. If this is an itch you want to scratch, you
>> might look into fixing that problem instead.

> Err, as I recall it was decided that the chance for confusion was too
> high.
> http://www.mail-archive.com/pgsql-hackers(at)postgresql(dot)org/msg85548.html

That message was about an approach that didn't have consensus ;-)

The ultimate conclusion was that a three-way split (identity, logical
position, physical position) could work because most of the code only
cares about column identity; the places where logical or physical
positions are important are pretty narrowly circumscribed, or could
be made so.

regards, tom lane


From: "Robert Haas" <robertmhaas(at)gmail(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Martijn van Oosterhout" <kleptog(at)svana(dot)org>, Decibel! <decibel(at)decibel(dot)org>, "hernan gonzalez" <hgonzalez(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Column reordering in pg_dump
Date: 2008-11-26 02:03:25
Message-ID: 603c8f070811251803g660ed938w3ab6b521837a00da@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> The ultimate conclusion was that a three-way split (identity, logical
> position, physical position) could work because most of the code only
> cares about column identity; the places where logical or physical
> positions are important are pretty narrowly circumscribed, or could
> be made so.

I started to take a look at this at one point and quickly got
intimidated. Do you have any sense of what sort of refactoring would
be required to make this viable?

I believe that the original discussion[1] may have somewhat
underestimated the number of places where logical position is
relevant. The list includes at least:

SELECT * FROM foo;
TABLE foo;
INSERT INTO foo VALUES (...) (or SELECT, but without column list
COPY foo FROM 'foo';
COPY foo TO 'foo';

There are also some problems with this syntax:

alias (column_alias, column_alias, column_alias)

Imagine for example:

CREATE TABLE foo (c1 integer, c2 text, c3 boolean, c4 date, c5
timestamp, c6 numeric, c7 varchar);
CREATE OR REPLACE VIEW tricky AS SELECT * FROM foo AS bar (a, b, c);
ALTER TABLE foo ALTER COLUMN c2 POSITION LAST;

After some thought, it seems pretty clear, at least to me, that the
third (hypothetical) command should not change the result of "SELECT *
FROM tricky" (the contrary conclusion gives rise to a lot of problems,
especially if there are other views depending on it). But what will
"pg_dump -t tricky" output at this point? I suspect it will be
necessary to introduce some new syntax here.

...Robert

[1] http://archives.postgresql.org/pgsql-hackers/2006-12/msg00977.php


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, Decibel! <decibel(at)decibel(dot)org>, hernan gonzalez <hgonzalez(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Column reordering in pg_dump
Date: 2008-11-26 02:18:41
Message-ID: 20081126021841.GU4875@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas escribió:

> After some thought, it seems pretty clear, at least to me, that the
> third (hypothetical) command should not change the result of "SELECT *
> FROM tricky" (the contrary conclusion gives rise to a lot of problems,
> especially if there are other views depending on it). But what will
> "pg_dump -t tricky" output at this point? I suspect it will be
> necessary to introduce some new syntax here.

Everything that's user-visible needs to use logical positioning. That
includes pg_dump.

Changing physical positioning is purely an internal matter. A first-cut
implementation should probably just make it identical to logical
positioning, until the latter is changed by the user (after which,
physical positioning continues to reflect the original ordering). Only
after this work has been done and gotten battle-tested, we can get into
niceties like having the server automatically rearrange physical
positioning to improve performance.

Column identity is, of course, set in stone as soon as decided for the
first time.

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


From: "Robert Haas" <robertmhaas(at)gmail(dot)com>
To: "Alvaro Herrera" <alvherre(at)commandprompt(dot)com>
Cc: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Martijn van Oosterhout" <kleptog(at)svana(dot)org>, Decibel! <decibel(at)decibel(dot)org>, "hernan gonzalez" <hgonzalez(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Column reordering in pg_dump
Date: 2008-11-26 03:41:11
Message-ID: 603c8f070811251941w40e9e080vccbb821a211aee7@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Nov 25, 2008 at 9:18 PM, Alvaro Herrera
<alvherre(at)commandprompt(dot)com> wrote:
>> After some thought, it seems pretty clear, at least to me, that the
>> third (hypothetical) command should not change the result of "SELECT *
>> FROM tricky" (the contrary conclusion gives rise to a lot of problems,
>> especially if there are other views depending on it). But what will
>> "pg_dump -t tricky" output at this point? I suspect it will be
>> necessary to introduce some new syntax here.
>
> Everything that's user-visible needs to use logical positioning. That
> includes pg_dump.

Obviously. The point is that the "alias (column_alias, column_alias,
column_alias)" syntax only allows you to alias the columns that are
the first N logical positions. If you have a view which is aliasing
columns 1..3 of some table, and column 2 of the table gets moved to
position 7, the view definition now needs to alias columns 1, 2, and
7, which isn't possible with the present syntax unless you also alias
3, 4, 5, and 6.

> Changing physical positioning is purely an internal matter. A first-cut
> implementation should probably just make it identical to logical
> positioning, until the latter is changed by the user (after which,
> physical positioning continues to reflect the original ordering). Only
> after this work has been done and gotten battle-tested, we can get into
> niceties like having the server automatically rearrange physical
> positioning to improve performance.

Yeah. The problem with that is that, as Tom pointed out in a previous
iteration of this discussion, you will likely have lurking bugs. The
bugs are going to come from confusing physical vs. logical vs. column
identity, and if some of those are always-equal, it's gonna be pretty
hard to know if you have bugs that confuse the two. Now, if you could
run the regression tests with a special option that would randomly
permute the two orderings with respect to one another, that would give
you at least some degree of confidence...

> Column identity is, of course, set in stone as soon as decided for the
> first time.

Agreed... but I'd still like to hear some thoughts on where to put
the abstraction boundaries.

...Robert


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Robert Haas" <robertmhaas(at)gmail(dot)com>
Cc: "Martijn van Oosterhout" <kleptog(at)svana(dot)org>, Decibel! <decibel(at)decibel(dot)org>, "hernan gonzalez" <hgonzalez(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Column reordering in pg_dump
Date: 2008-11-26 14:23:24
Message-ID: 7469.1227709404@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"Robert Haas" <robertmhaas(at)gmail(dot)com> writes:
> Imagine for example:

> CREATE TABLE foo (c1 integer, c2 text, c3 boolean, c4 date, c5
> timestamp, c6 numeric, c7 varchar);
> CREATE OR REPLACE VIEW tricky AS SELECT * FROM foo AS bar (a, b, c);
> ALTER TABLE foo ALTER COLUMN c2 POSITION LAST;

> After some thought, it seems pretty clear, at least to me, that the
> third (hypothetical) command should not change the result of "SELECT *
> FROM tricky" (the contrary conclusion gives rise to a lot of problems,
> especially if there are other views depending on it). But what will
> "pg_dump -t tricky" output at this point?

I don't think it's as bad as you fear, because you can always insert
additional aliases that aren't changing the column names. Furthermore,
per spec the column ordering of tricky doesn't change when foo's does.
So immediately after the CREATE VIEW tricky ought to look like

SELECT a, b, c, c4, c5, c6, c7 FROM foo AS bar (a, b, c);

which we could also represent as

SELECT a, b, c, c4, c5, c6, c7 FROM foo AS bar (a, b, c, c4, c5, c6, c7);

and the column position change would morph this into

SELECT a, b, c, c4, c5, c6, c7 FROM foo AS bar (a, c, c4, c5, c6, c7, b);

Now admittedly the current internal representation of alias-lists
doesn't cope with that (unless maybe you consider that list position
corresponds to column identity), but that representation isn't set in
stone.

regards, tom lane


From: Decibel! <decibel(at)decibel(dot)org>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "Alvaro Herrera" <alvherre(at)commandprompt(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Martijn van Oosterhout" <kleptog(at)svana(dot)org>, "hernan gonzalez" <hgonzalez(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Column reordering in pg_dump
Date: 2008-11-26 19:32:53
Message-ID: 5B520153-8D8B-4008-9489-B2A4D3A5AC83@decibel.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Nov 25, 2008, at 9:41 PM, Robert Haas wrote:
>> Changing physical positioning is purely an internal matter. A
>> first-cut
>> implementation should probably just make it identical to logical
>> positioning, until the latter is changed by the user (after which,
>> physical positioning continues to reflect the original ordering).
>> Only
>> after this work has been done and gotten battle-tested, we can get
>> into
>> niceties like having the server automatically rearrange physical
>> positioning to improve performance.
>
> Yeah. The problem with that is that, as Tom pointed out in a previous
> iteration of this discussion, you will likely have lurking bugs. The
> bugs are going to come from confusing physical vs. logical vs. column
> identity, and if some of those are always-equal, it's gonna be pretty
> hard to know if you have bugs that confuse the two. Now, if you could
> run the regression tests with a special option that would randomly
> permute the two orderings with respect to one another, that would give
> you at least some degree of confidence...

Random is good, but I suspect there are some boundary cases that
could be tested too.

As for the complexity, it might make sense to only tackle part of
this at a time. There would be value in only allowing logical order
to differ from literal order, or only allowing physical order to
differ. That means you could tackle just one of those for the first
go-round and still get a benefit from it.
--
Decibel!, aka Jim C. Nasby, Database Architect decibel(at)decibel(dot)org
Give your computer some brain candy! www.distributed.net Team #1828