Re: logical column order and physical column order

Lists: pgsql-hackers
From: David Rowley <dgrowleyml(at)gmail(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: logical column order and physical column order
Date: 2013-11-03 07:37:31
Message-ID: CAApHDvqhnuznxd4xVMFDcGn+nHVYyUtJ-TvbRsOuR=PaVbbGqw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I've just been looking at how alignment of columns in tuples can make the
tuple larger than needed.

I created 2 tables... None of which are very real world, but I was hunting
for the extremes here...

The first table contained an int primary key and then a total of 10 int
columns and 10 boolean columns, I placed one boolean column after an int
column so that it was int,bool, int bool, etc
With the 2nd table I had all the ints first then all the booleans at the
end of the table. I then inserted 1 million records per table and checked
the sizes of each table.

The first table was 112 MB and the 2nd table was 81MB, so naturally there
is a difference when it comes to running queries on these tables.

postgres=# select sum(Value1) from test1;
sum
--------------
500000500000
(1 row)

Time: 239.306 ms

postgres=# select sum(Value1) from test2;
sum
--------------
500000500000
(1 row)

Time: 186.926 ms

So in this example a full scan and aggregate of a single column is 28%
faster.

I'm sure in the real world there are many cases where a better choice in
column ordering would save space and save processing times, but is this
something that we want to leave up to our users?

I've not yet looked at the code to see how hard implementing separation of
column physical order and logical order would be. I really just want to get
an idea of what the thoughts would be on such a change.

I would imagine it should be possible to have a function which optimises
column orders which is run when a table is created or rewritten. New
columns would still go onto the end of the tuple unless the table had to be
rewritten and in this case the column order would be optimised again. All
plays where column names were displayed without explicit ordering, e.g
select * and in psql the catalog could be queried to see which order these
columns should be displayed in.

For reference I've attached the script I used for testing this.

I'd like to implement this as a project, but before I start any work on it
I'd just like to find out other people's thoughts on it.

Regards

David Rowley

Attachment Content-Type Size
column_order_test.txt text/plain 2.4 KB

From: Gavin Flower <GavinFlower(at)archidevsys(dot)co(dot)nz>
To: David Rowley <dgrowleyml(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: logical column order and physical column order
Date: 2013-11-03 08:40:18
Message-ID: 52760BF2.60200@archidevsys.co.nz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 03/11/13 20:37, David Rowley wrote:
> I've just been looking at how alignment of columns in tuples can make
> the tuple larger than needed.
>
> I created 2 tables... None of which are very real world, but I was
> hunting for the extremes here...
>
> The first table contained an int primary key and then a total of 10
> int columns and 10 boolean columns, I placed one boolean column after
> an int column so that it was int,bool, int bool, etc
> With the 2nd table I had all the ints first then all the booleans at
> the end of the table. I then inserted 1 million records per table and
> checked the sizes of each table.
>
> The first table was 112 MB and the 2nd table was 81MB, so naturally
> there is a difference when it comes to running queries on these tables.
>
> postgres=# select sum(Value1) from test1;
> sum
> --------------
> 500000500000
> (1 row)
>
>
> Time: 239.306 ms
>
> postgres=# select sum(Value1) from test2;
> sum
> --------------
> 500000500000
> (1 row)
>
>
> Time: 186.926 ms
>
> So in this example a full scan and aggregate of a single column is 28%
> faster.
>
> I'm sure in the real world there are many cases where a better choice
> in column ordering would save space and save processing times, but is
> this something that we want to leave up to our users?
>
> I've not yet looked at the code to see how hard implementing
> separation of column physical order and logical order would be. I
> really just want to get an idea of what the thoughts would be on such
> a change.
>
> I would imagine it should be possible to have a function which
> optimises column orders which is run when a table is created or
> rewritten. New columns would still go onto the end of the tuple unless
> the table had to be rewritten and in this case the column order would
> be optimised again. All plays where column names were displayed
> without explicit ordering, e.g select * and in psql the catalog could
> be queried to see which order these columns should be displayed in.
>
> For reference I've attached the script I used for testing this.
>
> I'd like to implement this as a project, but before I start any work
> on it I'd just like to find out other people's thoughts on it.
>
> Regards
>
> David Rowley
>
>
>
I think the system should PHYSICALLY store the columns in the most space
efficient order, and have a facility for mapping to & from the LOGICAL
order - so that users & application developers only have worry about the
logical order. Even system programers would normally not have to be
concerned with the physical order. I am a little surprised that this is
not already done, to be honest.

Cheers,
Gavin


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Gavin Flower <GavinFlower(at)archidevsys(dot)co(dot)nz>
Cc: David Rowley <dgrowleyml(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: logical column order and physical column order
Date: 2013-11-03 10:35:05
Message-ID: 20131103103505.GA3552@svana.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Nov 03, 2013 at 09:40:18PM +1300, Gavin Flower wrote:
> I think the system should PHYSICALLY store the columns in the most
> space efficient order, and have a facility for mapping to & from the
> LOGICAL order - so that users & application developers only have
> worry about the logical order. Even system programers would
> normally not have to be concerned with the physical order. I am a
> little surprised that this is not already done, to be honest.

This has been discussed before, extensively. I beleive there have even
been some patches. Apart from the space savings it also allow postgres
to support column reordering of tables.

The main objection IIRC is that you now have a logical order and a
physical order and there would be an endless stream of bugs caused by
code confusing the two.

I don't really buy this: you can make the two identifiers
non-overlapping so you can always tell which kind you have and some
properly places checks will catch obvious problems. Logical order is
only used in a handful of places anyway. You could even make them two
seperate datatypes so the compiler will complain if you screw up.

Dig through the archives for the full story.

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he does
> not attach much importance to his own thoughts.
-- Arthur Schopenhauer


From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: David Rowley <dgrowleyml(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: logical column order and physical column order
Date: 2013-11-03 14:14:12
Message-ID: 20131103141411.GF5809@eldon.alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

David Rowley escribió:
> I've just been looking at how alignment of columns in tuples can make the
> tuple larger than needed.

This has been discussed at length previously, and there was a design
proposed to solve this problem. See these past discussions:

http://archives.postgresql.org/pgsql-hackers/2006-12/msg00983.php
http://archives.postgresql.org/pgsql-hackers/2007-02/msg01235.php
http://archives.postgresql.org/pgsql-hackers/2008-11/msg01680.php

I started work on this, and managed to get parts of it to work. While
doing so I realized that it was quite a lot more hideous than I had
originally expected. I published a tree at github:
https://github.com/alvherre/postgres/tree/column

This is incomplete and there are lots of things that don't yet work. I
posted a to-do.org file there noting some of these, which are probably
also way incomplete, and that was intended only for my own consumption,
so don't expect anything too elaborate.

One thing I never even get to the point of discussing is the matter of
UI, that is, should some of these things be automatic or should they
require some command; and if so, what would that look like.

Hope this helps clarify where we stand. I warn you, it's a major
undertaking, but if you're able to make inroads I'm sure many people
will be happy.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: David Johnston <polobo(at)yahoo(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: logical column order and physical column order
Date: 2013-11-03 15:10:12
Message-ID: 1383491412523-5776784.post@n5.nabble.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

David Rowley wrote
> I'm sure in the real world there are many cases where a better choice in
> column ordering would save space and save processing times, but is this
> something that we want to leave up to our users?

Right now there is little visibility, from probably 99% of people, that this
is even something to be concerned with. I have no qualms with making a
person run a routine to change the physical ordering of their tables - and
if they really care about logical order in the output it is best to list the
column names anyway - so the problem that is worth solving is providing a
way for the system to tell the user, for a given table, what the most
efficient physical order would be - ideally in the form of a CREATE TABLE AS
statement - and let the user manually effect the change.

So invent the algorithm to identify the best physical order and make it
accessible to users for manual use. If the benefits seem great enough after
its use for a couple of releases a more informed decision can be made as to
whether to try and automate its application.

David J.

--
View this message in context: http://postgresql.1045698.n5.nabble.com/logical-column-order-and-physical-column-order-tp5776770p5776784.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.


From: David Rowley <dgrowleyml(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: logical column order and physical column order
Date: 2013-11-05 07:07:04
Message-ID: CAApHDvo1KyFpneFXXQEF94GMVAd=Z07A0Srh9T8nczMoOqx7UA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Nov 4, 2013 at 3:14 AM, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>wrote:

> David Rowley escribió:
> > I've just been looking at how alignment of columns in tuples can make the
> > tuple larger than needed.
>
> This has been discussed at length previously, and there was a design
> proposed to solve this problem. See these past discussions:
>
> http://archives.postgresql.org/pgsql-hackers/2006-12/msg00983.php
> http://archives.postgresql.org/pgsql-hackers/2007-02/msg01235.php
> http://archives.postgresql.org/pgsql-hackers/2008-11/msg01680.php
>
> I started work on this, and managed to get parts of it to work. While
> doing so I realized that it was quite a lot more hideous than I had
> originally expected. I published a tree at github:
> https://github.com/alvherre/postgres/tree/column

Thanks for the archive links... I read these last night and pulled out some
key pieces of information from some of the posts.

I should say that I've not dived into the code too much to see how hard it
would be, but my, perhaps naive original idea would have just be to add 1
column to pg_attribute to store the logical order and have attnum store the
physical order... This would have meant that at least only the following
places would have to take into account the change.

1. pg_dump needs to display columns in logical order both for create tables
and copy/insert statements.
2. INSERT INTO table values( ... ) (i.e without column names) needs to look
at logical order.
3. create table like <table>
4. backup and restore using copy
5. select * expand to column names

And of lesser importance as I'd assume it would just be a change in an
ORDER BY clause in their queries of pg_attribute

1. Display in clients... psql Pg Admin

I thought the above would have been doable and I did wonder what all the
fuss was about relating to bugs in the code where it could use the logical
number instead of attnum.

On reading of the posts last night I can see that the idea was to add not 1
but 2 new fields to pg_attribute. One was for the physical order and one
for the logical order and at first I didn't really understand as I thought
attnum would always be the physical order. I didn't really know before this
that attnum was static... I did some tests were I dropped one of the middle
columns out of a table and then rewrote the table with cluster and I see
that the pg_attribute record is kept even though the remains of the column
values have been wiped out by the rewrite... Is this done because things
like indexes, foreign keys and sequences reference the {attrelid,attnum} ?
if so then I see why the 2 extra columns are needed and I guess that's
where the extra complications come from.

So now I'm wondering, with my freshly clustered table which I dropped one
of the middle columns from before the cluster... my pg_attributes look
something like:

relname | attname | attnum
---------+------------------------------+--------
dropcol | tableoid | -7
dropcol | cmax | -6
dropcol | xmax | -5
dropcol | cmin | -4
dropcol | xmin | -3
dropcol | ctid | -1
dropcol | one | 1
dropcol | ........pg.dropped.2........ | 2
dropcol | three | 3

and I would imagine since the table has just been clustered that the
columns are stored like {..., ctid, one,three}
In this case how does Postgresql know that attnum 3 is the 2nd user column
in that table? Unless I have misunderstood something then there must be
some logic in there to skip dropped columns and if so then could it not
just grab the "attphynum" at that location? then just modify the 1-5 places
listed above to sort on attlognum?

Regards

David Rowley


From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: David Rowley <dgrowleyml(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: logical column order and physical column order
Date: 2013-11-05 14:22:01
Message-ID: 20131105142201.GG5809@eldon.alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

David Rowley escribió:

> In this case how does Postgresql know that attnum 3 is the 2nd user column
> in that table? Unless I have misunderstood something then there must be
> some logic in there to skip dropped columns and if so then could it not
> just grab the "attphynum" at that location? then just modify the 1-5 places
> listed above to sort on attlognum?

During parse analysis, those columns obtained from pg_attribute are
transformed to target list entries; they travel through the parser and
executor in that representation, and TupleDescs are constructed from
those lists. Making that works correctly needs some more code than just
sorting on attlognum. There are some other messy parts like handling
composite types when passed to functions, COPY, and some other things I
don't remember. Did you look at the places my patch touches?

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services