Re: Enum proposal / design

Lists: pgsql-hackers
From: Tom Dunstan <pgsql(at)tomd(dot)cc>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Enum proposal / design
Date: 2006-08-16 03:54:00
Message-ID: 44E296D8.2060505@tomd.cc
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi guys

Andrew and I got together and worked out a more detailed idea of how we
want to add enums to the postgresql core. This follows on from his
original enumkit prototype last year [1]. Here's a more formal proposal
/ design with what we came up with. Comments / criticism hereby solicited.

How they will work (once created) is more or less the same as last time
with the enumkit, with the exception of how they're created.

Enum types will be created with a specialised version of the CREATE TYPE
command thusly:

CREATE TYPE rgb AS ENUM ('red', 'green', 'blue');

They can then be used as column types, being input in quoted string form
as with other user types:

CREATE TABLE enumtest (col rgb);
INSERT INTO enumtest VALUES ('red');

Input is to be case sensitive, and ordering is to be in the definition
order, not the collation order of the text values (ie 'red' < 'green' in
the example above). See the original thread for more discussion and
usage examples.

The implementation will work as below. I've included something of a list
of stuff to do as well.

On disk, enums will occupy 4 bytes: the high 22 bits will be an enum
identifier, with the bottom 10 bits being the enum value. This allows
1024 values for a given enum, and 2^22 different enum types, both of
which should be heaps. The exact distribution of bits doesn't matter all
that much, we just picked some that we were comfortable with.

The identifier is required as output functions are not fed information
about which exact type they are being asked to format (see below).

The creation of a new pg_enum catalog is required. This will hold:
- the type OID for the enum, from pg_type
- the enum identifier for on disk storage
- the enum values in definition order, as an array of text values

The CREATE TYPE command will create a row in pg_type and a row in
pg_enum. We will get a new enum id by scanning pg_enum and looking for
the first unused value, rather than using a sequence, to make reuse of
enum ids more predictable.

Two new syscaches on pg_enum will be created to simplify lookup in the
i/o functions: one indexed by type oid for the input function, and one
indexed by enum id for the output function.

All functions will be builtins; there will be no duplicate entries of
them in pg_proc as was required for the enumkit.

The i/o functions will both cache enum info in the same way that the
domain and composite type i/o functions do, by attaching the data to the
fcinfo->flinfo->fn_extra pointer. The input function will look up the
enum data in the syscache using the type oid that it will be passed, and
cache it in a hashtable or binary tree for easy repeated lookup. The
output function will look up the enum data in the syscache using the
enum id stripped from the high 22 bits of the on-disk value and cache
the data as a straight array for easy access, with the enum value being
used as a index into the array.

The other functions will all work pretty much like they did in the
enumkit, with comparison operators more or less treating the enum as its
integer representation.

The grammar will have to be extended to support the new CREATE TYPE
syntax. This should not require making ENUM a reserved word. Likewise
psql will be extended to learn the new grammar. There's probably a bit
of work to do in DROP TYPE to make sure it deletes rows from pg_enum
when appropriate.

pg_dump must be taught how to dump enums properly.

We'll need some regression tests, maybe including one in one of the PL
testsuites to ensure that the io functions work happily when called from
a non-standard direction.

Documentation etc.

General discussion:

While we would really like to have had a 2 byte representation on disk
(or even 1 for most cases), with the stored value being *just* the enum
ordinal and not containing any type info about the enum type itself,
this is difficult. Since the output function cleanup [2] [3], postgresql
doesn't pass through the expected output type to output functions. This
makes it difficult to tell the difference between e.g. the first value
of the various enums, which would all have an integer representation of
0. We could have gone down the path of having the output function look
up its expected type from the fcinfo->flinfo struct, as Martijn's tagged
types do [4], but that would have required extra entries in pg_proc for
every single enum. Alternatively we could have stored the full enum type
oid on disk, but that would have blown out the on-disk representation to
5 or 6 bytes. The given approach of having a smaller enum id and the
enum ordinal value stored in the 4 bytes seems a reasonable tradeoff
given the current constraints.

To preempt some questions (particularly some which came up in the
enumkit discussion), here's a list of stuff which will *not* be
implemented in the initial patch (and quite possibly never):

- Support for ALTER TYPE to allow adding / modifying values etc. For
the time being you'll just have to create a new type, do a bunch of
ALTER TABLE commands, DROP the old type and rename the new one if you
want the old name back.

- Inline column enum declarations a la MySQL. While this feature might
allow easier migration from MySQL, and we could theoretically do it by
creating an anonymous type when creating the table, the cleanup when the
column/table are dropped is a real problem, and pg_dump has to get a lot
smarter. Given the ugliness of suppporting something similar with SERIAL
columns [5], this is definitely not on the cards anytime soon.

- Ordering by text value rather than the declaration order. If you
want this, you really want a varchar domain instead. Or alternately you
can order by e.g. colname::text if that does what you want. Doing
something like that sounds suspiciously like ordering something for
human consumption, though, which sounds like a really fast way to make
your application difficult to localize. Anyway, if that's the only
ordering you'll ever want, just define the values in alphabetical order. :)

- Access to the internal integer representation. If you need to modify
the values used or want to know what the integer is, use a lookup table
instead. Enums are the wrong abstraction for you.

Comments? Particularly on implementation strategy; the functionality was
thrashed out pretty well last time around.

Cheers

Tom "unholy chimera" Dunstan

[1] http://archives.postgresql.org/pgsql-hackers/2005-10/msg01243.php
[2] http://archives.postgresql.org/pgsql-hackers/2005-04/msg00998.php
[3] http://archives.postgresql.org/pgsql-hackers/2005-12/msg00454.php
[4] http://svana.org/kleptog/pgsql/taggedtypes.html
[5] http://archives.postgresql.org/pgsql-hackers/2006-04/msg01118.php


From: andrew(at)dunslane(dot)net
To: "Tom Dunstan" <pgsql(at)tomd(dot)cc>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Enum proposal / design
Date: 2006-08-16 07:37:33
Message-ID: 1328.24.211.165.134.1155713853.squirrel@www.dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

We forgot to mention that we'll need to implement domains over enums and
arrays of enums too.

cheers

andrew

Tom Dunstan wrote:
> Hi guys
>
> Andrew and I got together and worked out a more detailed idea of how we
> want to add enums to the postgresql core. This follows on from his
> original enumkit prototype last year [1]. Here's a more formal proposal
> / design with what we came up with. Comments / criticism hereby solicited.
>
>
> How they will work (once created) is more or less the same as last time
> with the enumkit, with the exception of how they're created.
>
> Enum types will be created with a specialised version of the CREATE TYPE
> command thusly:
>
> CREATE TYPE rgb AS ENUM ('red', 'green', 'blue');
>
> They can then be used as column types, being input in quoted string form
> as with other user types:
>
> CREATE TABLE enumtest (col rgb);
> INSERT INTO enumtest VALUES ('red');
>
> Input is to be case sensitive, and ordering is to be in the definition
> order, not the collation order of the text values (ie 'red' < 'green' in
> the example above). See the original thread for more discussion and
> usage examples.
>
>
> The implementation will work as below. I've included something of a list
> of stuff to do as well.
>
> On disk, enums will occupy 4 bytes: the high 22 bits will be an enum
> identifier, with the bottom 10 bits being the enum value. This allows
> 1024 values for a given enum, and 2^22 different enum types, both of
> which should be heaps. The exact distribution of bits doesn't matter all
> that much, we just picked some that we were comfortable with.
>
> The identifier is required as output functions are not fed information
> about which exact type they are being asked to format (see below).
>
> The creation of a new pg_enum catalog is required. This will hold:
> - the type OID for the enum, from pg_type
> - the enum identifier for on disk storage
> - the enum values in definition order, as an array of text values
>
> The CREATE TYPE command will create a row in pg_type and a row in
> pg_enum. We will get a new enum id by scanning pg_enum and looking for
> the first unused value, rather than using a sequence, to make reuse of
> enum ids more predictable.
>
> Two new syscaches on pg_enum will be created to simplify lookup in the
> i/o functions: one indexed by type oid for the input function, and one
> indexed by enum id for the output function.
>
> All functions will be builtins; there will be no duplicate entries of
> them in pg_proc as was required for the enumkit.
>
> The i/o functions will both cache enum info in the same way that the
> domain and composite type i/o functions do, by attaching the data to the
> fcinfo->flinfo->fn_extra pointer. The input function will look up the
> enum data in the syscache using the type oid that it will be passed, and
> cache it in a hashtable or binary tree for easy repeated lookup. The
> output function will look up the enum data in the syscache using the
> enum id stripped from the high 22 bits of the on-disk value and cache
> the data as a straight array for easy access, with the enum value being
> used as a index into the array.
>
> The other functions will all work pretty much like they did in the
> enumkit, with comparison operators more or less treating the enum as its
> integer representation.
>
> The grammar will have to be extended to support the new CREATE TYPE
> syntax. This should not require making ENUM a reserved word. Likewise
> psql will be extended to learn the new grammar. There's probably a bit
> of work to do in DROP TYPE to make sure it deletes rows from pg_enum
> when appropriate.
>
> pg_dump must be taught how to dump enums properly.
>
> We'll need some regression tests, maybe including one in one of the PL
> testsuites to ensure that the io functions work happily when called from
> a non-standard direction.
>
> Documentation etc.
>
>
> General discussion:
>
> While we would really like to have had a 2 byte representation on disk
> (or even 1 for most cases), with the stored value being *just* the enum
> ordinal and not containing any type info about the enum type itself,
> this is difficult. Since the output function cleanup [2] [3], postgresql
> doesn't pass through the expected output type to output functions. This
> makes it difficult to tell the difference between e.g. the first value
> of the various enums, which would all have an integer representation of
> 0. We could have gone down the path of having the output function look
> up its expected type from the fcinfo->flinfo struct, as Martijn's tagged
> types do [4], but that would have required extra entries in pg_proc for
> every single enum. Alternatively we could have stored the full enum type
> oid on disk, but that would have blown out the on-disk representation to
> 5 or 6 bytes. The given approach of having a smaller enum id and the
> enum ordinal value stored in the 4 bytes seems a reasonable tradeoff
> given the current constraints.
>
> To preempt some questions (particularly some which came up in the
> enumkit discussion), here's a list of stuff which will *not* be
> implemented in the initial patch (and quite possibly never):
>
> - Support for ALTER TYPE to allow adding / modifying values etc. For
> the time being you'll just have to create a new type, do a bunch of
> ALTER TABLE commands, DROP the old type and rename the new one if you
> want the old name back.
>
> - Inline column enum declarations a la MySQL. While this feature might
> allow easier migration from MySQL, and we could theoretically do it by
> creating an anonymous type when creating the table, the cleanup when the
> column/table are dropped is a real problem, and pg_dump has to get a lot
> smarter. Given the ugliness of suppporting something similar with SERIAL
> columns [5], this is definitely not on the cards anytime soon.
>
> - Ordering by text value rather than the declaration order. If you
> want this, you really want a varchar domain instead. Or alternately you
> can order by e.g. colname::text if that does what you want. Doing
> something like that sounds suspiciously like ordering something for
> human consumption, though, which sounds like a really fast way to make
> your application difficult to localize. Anyway, if that's the only
> ordering you'll ever want, just define the values in alphabetical order.
> :)
>
> - Access to the internal integer representation. If you need to modify
> the values used or want to know what the integer is, use a lookup table
> instead. Enums are the wrong abstraction for you.
>
>
> Comments? Particularly on implementation strategy; the functionality was
> thrashed out pretty well last time around.
>
> Cheers
>
> Tom "unholy chimera" Dunstan
>
>
> [1] http://archives.postgresql.org/pgsql-hackers/2005-10/msg01243.php
> [2] http://archives.postgresql.org/pgsql-hackers/2005-04/msg00998.php
> [3] http://archives.postgresql.org/pgsql-hackers/2005-12/msg00454.php
> [4] http://svana.org/kleptog/pgsql/taggedtypes.html
> [5] http://archives.postgresql.org/pgsql-hackers/2006-04/msg01118.php
>
>


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Tom Dunstan <pgsql(at)tomd(dot)cc>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Enum proposal / design
Date: 2006-08-16 13:28:29
Message-ID: 13550.1155734909@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Dunstan <pgsql(at)tomd(dot)cc> writes:
> Andrew and I got together and worked out a more detailed idea of how we
> want to add enums to the postgresql core. This follows on from his
> original enumkit prototype last year [1]. Here's a more formal proposal
> / design with what we came up with. Comments / criticism hereby solicited.
> ...
> On disk, enums will occupy 4 bytes: the high 22 bits will be an enum
> identifier, with the bottom 10 bits being the enum value. This allows
> 1024 values for a given enum, and 2^22 different enum types, both of
> which should be heaps. The exact distribution of bits doesn't matter all
> that much, we just picked some that we were comfortable with.

I think this is excessive concern for bit-shaving. Make the on-disk
representation be 8 bytes instead of 4, then you can store the OID
directly and have no need for the separate identifier concept. This
in turn eliminates one index, one syscache, and one set of lookup/cache
routines. And you can have as many values of an enum as you darn please.

> The i/o functions will both cache enum info in the same way that the
> domain and composite type i/o functions do, by attaching the data to the
> fcinfo->flinfo->fn_extra pointer. The input function will look up the
> enum data in the syscache using the type oid that it will be passed, and
> cache it in a hashtable or binary tree for easy repeated lookup.

If you didn't notice already: typcache is the place to put any
type-related caching you need to add.

regards, tom lane


From: Tom Dunstan <pgsql(at)tomd(dot)cc>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Enum proposal / design
Date: 2006-08-16 17:17:18
Message-ID: 44E3531E.4090103@tomd.cc
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
> Tom Dunstan <pgsql(at)tomd(dot)cc> writes:
>>On disk, enums will occupy 4 bytes: the high 22 bits will be an enum
>>identifier, with the bottom 10 bits being the enum value. This allows
>>1024 values for a given enum, and 2^22 different enum types, both of
>>which should be heaps. The exact distribution of bits doesn't matter all
>>that much, we just picked some that we were comfortable with.
>
>
> I think this is excessive concern for bit-shaving. Make the on-disk
> representation be 8 bytes instead of 4, then you can store the OID
> directly and have no need for the separate identifier concept. This
> in turn eliminates one index, one syscache, and one set of lookup/cache
> routines. And you can have as many values of an enum as you darn please.

That's all true. It's a bit depressing to think that IMO 99% of users of
this will have enum values whose range would fit into 1 byte, but we'll
be using 8 to store it on disk. I had convinced myself that 4 was ok on
the basis that alignment issues in surrounding columns would pad out the
remaining bits anyway much of the time. Was I correct in that
assumption? Would e.g. an int after a char require 3 bytes of padding?

Ok, I'll run one more idea up the flagpole before giving up on a 4 byte
on disk representation. :) How about assigning a unique 4 byte id to
each enum value, and storing that on disk. This would be unique across
the database, not per enum type. The structure of pg_enum would be a bit
different, as the per-type enum id would be gone, and there would be
multiple rows for each enum type. The columns would be: the type oid,
the associated unique id and the textual representation. That would
probably simplify the caching mechanism as well, since input function
lookups could do a straight syscache lookup on type oid and text
representation, and the output function could do a straight lookup on
the unique id. No need to muck around creating a little dynahash or
whatever to attach to the fn_entra pointer.

It does still require the extra syscache, but it removes the limitations
on number of enum types and number of values per type while keeping the
on disk size smallish. I like that better than the original idea, actually.

> If you didn't notice already: typcache is the place to put any
> type-related caching you need to add.

I hadn't. I'll investigate. Thanks.

Cheers

Tom


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Tom Dunstan <pgsql(at)tomd(dot)cc>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Enum proposal / design
Date: 2006-08-16 17:33:18
Message-ID: 26936.1155749598@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Dunstan <pgsql(at)tomd(dot)cc> writes:
> Tom Lane wrote:
>> I think this is excessive concern for bit-shaving. Make the on-disk
>> representation be 8 bytes instead of 4, then you can store the OID
>> directly and have no need for the separate identifier concept.

> That's all true. It's a bit depressing to think that IMO 99% of users of
> this will have enum values whose range would fit into 1 byte, but we'll
> be using 8 to store it on disk. I had convinced myself that 4 was ok on
> the basis that alignment issues in surrounding columns would pad out the
> remaining bits anyway much of the time.

Right, and on a 64-bit machine the same frequently holds at the 8-byte
level, so it's not real clear how much you're saving.

> Ok, I'll run one more idea up the flagpole before giving up on a 4 byte
> on disk representation. :) How about assigning a unique 4 byte id to
> each enum value, and storing that on disk. This would be unique across
> the database, not per enum type. The structure of pg_enum would be a bit
> different, as the per-type enum id would be gone, and there would be
> multiple rows for each enum type. The columns would be: the type oid,
> the associated unique id and the textual representation.

That seems not a bad idea. I had been considering complaining that the
array-based catalog structure was denormalized, but refrained ... I like
the fact that this approach makes it normalized.

Another thought is that this isn't really tied to any particular width
of stored enum values. You could easily imagine a compile time switch
to say you want 2-byte enums instead of 4. Or 8; or even 1.

Even more radical: do it at runtime. You could assign the typlen
(stored width) of an enum type at creation time on the basis of the
largest identifier it contains. This might be a bit too weird because
enums created earlier would have a size advantage over those created
later, but if you are looking to shave space ...

That reminds me: were you intending to allow an ALTER ENUM operation
to add (or remove, or rename) elements of an enum type? The above
method would fail for the case where an ADD operation needed to assign
an identifier wider than the type allowed for.

regards, tom lane


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Tom Dunstan <pgsql(at)tomd(dot)cc>
Subject: Re: Enum proposal / design
Date: 2006-08-16 17:55:25
Message-ID: 44E35C0D.8010204@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


(I had a private bet with myself that Tom Lane would object to the "bit
shaving" ;-) )

Tom Lane wrote:

>> Ok, I'll run one more idea up the flagpole before giving up on a 4 byte
>> on disk representation. :) How about assigning a unique 4 byte id to
>> each enum value, and storing that on disk. This would be unique across
>> the database, not per enum type. The structure of pg_enum would be a bit
>> different, as the per-type enum id would be gone, and there would be
>> multiple rows for each enum type. The columns would be: the type oid,
>> the associated unique id and the textual representation.
>>
>
> That seems not a bad idea. I had been considering complaining that the
> array-based catalog structure was denormalized, but refrained ... I like
> the fact that this approach makes it normalized.
>
> Another thought is that this isn't really tied to any particular width
> of stored enum values. You could easily imagine a compile time switch
> to say you want 2-byte enums instead of 4. Or 8; or even 1.
>
> Even more radical: do it at runtime. You could assign the typlen
> (stored width) of an enum type at creation time on the basis of the
> largest identifier it contains. This might be a bit too weird because
> enums created earlier would have a size advantage over those created
> later, but if you are looking to shave space ...
>

I'm not sure I like either of these options. The configure option at
least would make it too easy to break loading a dump from a db with
different compile time limit, and the runtime typelen stuff just seems
messy.

I'm inclined to say let's keep it simple and stay with a fixed 4-byte
global size.

> That reminds me: were you intending to allow an ALTER ENUM operation
> to add (or remove, or rename) elements of an enum type? The above
> method would fail for the case where an ADD operation needed to assign
> an identifier wider than the type allowed for.
>
>
>

No, I think that's something of a footgun. We'd have to check every row
to ensure we weren't orphaning some value.

The workaround is to create a new enum type and then do "alter table
alter column type ..." although I realise that could cause dependency
problems too.

Of course, people will be able to hack the catalog if they want to, but
then it will be on their heads if things break - the intention is to
treat these as essentially static - for dynamic stuff use a domain or a
lookup table.

cheers

andrew


From: Tom Dunstan <pgsql(at)tomd(dot)cc>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Enum proposal / design
Date: 2006-08-16 18:51:00
Message-ID: 44E36914.8050907@tomd.cc
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andrew Dunstan wrote:

>> Even more radical: do it at runtime. You could assign the typlen
>> (stored width) of an enum type at creation time on the basis of the
>> largest identifier it contains. This might be a bit too weird because
>> enums created earlier would have a size advantage over those created
>> later, but if you are looking to shave space ...
>
> I'm not sure I like either of these options. The configure option at
> least would make it too easy to break loading a dump from a db with
> different compile time limit, and the runtime typelen stuff just seems
> messy.

I thought the runtime one was kinda cute, actually, but you would have
to have duplicate functions for the differently sized types, eg.
enum1_out, enum2_out etc since otherwise you wouldn't know what sized
parameter you were just handed. And as Tom pointed out there could be
issues when someone wanted to modify the type.

> I'm inclined to say let's keep it simple and stay with a fixed 4-byte
> global size.

Fair enough. I'm ok with 4 bytes; 8 seemed a bit gratuitous.

>> That reminds me: were you intending to allow an ALTER ENUM operation
>> to add (or remove, or rename) elements of an enum type? The above
>> method would fail for the case where an ADD operation needed to assign
>> an identifier wider than the type allowed for.
>
> No, I think that's something of a footgun. We'd have to check every row
> to ensure we weren't orphaning some value.
>
> The workaround is to create a new enum type and then do "alter table
> alter column type ..." although I realise that could cause dependency
> problems too.

Well, one option that we might want to consider down the line is doing
all that behind the scenes in an ALTER TYPE statement. Of the
unsupported stuff that I listed, being able to alter the enum definition
was the one that I thought had the most likely use case.

Anyway, it's not something that we need to sort out straight away since
there's a workaround. I suspect that it only came up because there would
have been consequences for the ALTER if we had gone with the variable
size idea, depending on how the ALTER was implemented.

> Of course, people will be able to hack the catalog if they want to, but
> then it will be on their heads if things break - the intention is to
> treat these as essentially static - for dynamic stuff use a domain or a
> lookup table.

Right. Altering the values is a schema change (and I'd argue that
domains fall into the same boat). If you want user-editable entries,
create a separate table.

Cheers

Tom


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Tom Dunstan <pgsql(at)tomd(dot)cc>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Enum proposal / design
Date: 2006-08-16 20:13:43
Message-ID: 28234.1155759223@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Dunstan <pgsql(at)tomd(dot)cc> writes:
> I thought the runtime one was kinda cute, actually, but you would have
> to have duplicate functions for the differently sized types, eg.
> enum1_out, enum2_out etc since otherwise you wouldn't know what sized
> parameter you were just handed.

I'm not sure that that matters really. What you are actually going to
get handed is a Datum that IIRC is right-justified and zero-padded, so
very probably one function would work for all stored widths. The bigger
issue I think is the surprise factor if a column gets wider over a dump
and reload.

regards, tom lane


From: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Tom Dunstan <pgsql(at)tomd(dot)cc>, Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Enum proposal / design
Date: 2006-08-16 22:52:37
Message-ID: 20060816225237.GU21363@pervasive.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Aug 16, 2006 at 04:13:43PM -0400, Tom Lane wrote:
> Tom Dunstan <pgsql(at)tomd(dot)cc> writes:
> > I thought the runtime one was kinda cute, actually, but you would have
> > to have duplicate functions for the differently sized types, eg.
> > enum1_out, enum2_out etc since otherwise you wouldn't know what sized
> > parameter you were just handed.
>
> I'm not sure that that matters really. What you are actually going to
> get handed is a Datum that IIRC is right-justified and zero-padded, so
> very probably one function would work for all stored widths. The bigger
> issue I think is the surprise factor if a column gets wider over a dump
> and reload.

Actually, if we're going to support variable-width enums, I think it
makes the most sense to just expose that to the user, since they'll be
able to have a chance of figuring out which size would make the most
sense for a given table (unless you want to add logic to look at the
table's layout...)

If we wanted to provide an idiot-proof version that was "unsized", we
could just make that an alias for a 4 or 8 byte enum.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby(at)pervasive(dot)com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Tom Dunstan <pgsql(at)tomd(dot)cc>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Enum proposal / design
Date: 2006-08-16 23:02:01
Message-ID: 29335.1155769321@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Dunstan <pgsql(at)tomd(dot)cc> writes:
> Andrew Dunstan wrote:
>> I'm inclined to say let's keep it simple and stay with a fixed 4-byte
>> global size.

> Fair enough. I'm ok with 4 bytes; 8 seemed a bit gratuitous.

If you're gonna fix it at 4 bytes, then I strongly suggest that the
value identifiers actually be OIDs assigned through the standard
OID-generating mechanism, and that the pg_enum table have the structure

standard system OID column unique enum-value identifier
enumtypid OID of enum type it belongs to
enumname name of enum value

unique indexes on:
oid
(enumtypid, enumname)

The advantage of doing this is that you can use the existing, well
debugged, normally-quite-fast mechanisms for generating new unique value
identifiers. Rather than consing up your own slow full-table-scan
mechanism as envisioned in the original proposal.

regards, tom lane


From: Gregory Stark <gsstark(at)mit(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Tom Dunstan <pgsql(at)tomd(dot)cc>, Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Enum proposal / design
Date: 2006-08-16 23:21:06
Message-ID: 87veosxo65.fsf@stark.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

> I think this is excessive concern for bit-shaving. Make the on-disk
> representation be 8 bytes instead of 4, then you can store the OID
> directly and have no need for the separate identifier concept. This
> in turn eliminates one index, one syscache, and one set of lookup/cache
> routines. And you can have as many values of an enum as you darn please.

Egads. bit-shaving is *important*. If it's 8 bytes you could just use a
char(4) and store 4 character text codes instead. The whole reason to want
this feature is precisely for bit-shaving.

I was originally going to reply with some thoughts about how we really ought
to fix things so that we don't need to store the type in every record of the
entire table. That would let you use 1 or 2 bytes for most applications.

Data density is the dominant factor controlling the overall speed of your
database. If you can shave 10% off the width of your records that's a 10%
speed gain in i/o and a 10% gain in headroom.

This is the same issue we have with char(n) and numeric(x,y) already. If we
found a general solution for getting the type name to the enum would it also
help getting the typmod to char(n) and numeric(x,y)? Would it let us store
those as fixed sized data types?

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com


From: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
To: Gregory Stark <gsstark(at)mit(dot)edu>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Tom Dunstan <pgsql(at)tomd(dot)cc>, Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Enum proposal / design
Date: 2006-08-17 17:11:45
Message-ID: 20060817171145.GR21363@pervasive.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Aug 16, 2006 at 07:21:06PM -0400, Gregory Stark wrote:
> This is the same issue we have with char(n) and numeric(x,y) already. If we
> found a general solution for getting the type name to the enum would it also
> help getting the typmod to char(n) and numeric(x,y)? Would it let us store
> those as fixed sized data types?

Hopefully. It would be great to also have a fixed-width raw/bytea field.
See the recent discussions about adding a guid type, etc.

How about this for a TODO:

* Allow for field widths to be stored in the catalog instead of each tuple

Data types such as char are usually used when the user knows that a
field will always contain a fixed amount of data. In these cases,
our char implementation is wasteful, because the varlena header
always contains the same value. If there was a mechanism to obtain
field widths from the catalog there would be no need to store the
field width in each tuple. This would be useful for other types as
well (UUID and ENUM, for example).
--
Jim C. Nasby, Sr. Engineering Consultant jnasby(at)pervasive(dot)com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
Cc: Gregory Stark <gsstark(at)mit(dot)edu>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Tom Dunstan <pgsql(at)tomd(dot)cc>, Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Enum proposal / design
Date: 2006-08-17 17:13:58
Message-ID: 200608171713.k7HHDwL17536@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Jim C. Nasby wrote:
> On Wed, Aug 16, 2006 at 07:21:06PM -0400, Gregory Stark wrote:
> > This is the same issue we have with char(n) and numeric(x,y) already. If we
> > found a general solution for getting the type name to the enum would it also
> > help getting the typmod to char(n) and numeric(x,y)? Would it let us store
> > those as fixed sized data types?
>
> Hopefully. It would be great to also have a fixed-width raw/bytea field.
> See the recent discussions about adding a guid type, etc.
>
> How about this for a TODO:
>
> * Allow for field widths to be stored in the catalog instead of each tuple
>
> Data types such as char are usually used when the user knows that a
> field will always contain a fixed amount of data. In these cases,
> our char implementation is wasteful, because the varlena header
> always contains the same value. If there was a mechanism to obtain
> field widths from the catalog there would be no need to store the
> field width in each tuple. This would be useful for other types as
> well (UUID and ENUM, for example).

I don't think there is concensus on adding that.

--
Bruce Momjian bruce(at)momjian(dot)us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>, Gregory Stark <gsstark(at)mit(dot)edu>, Tom Dunstan <pgsql(at)tomd(dot)cc>, Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Enum proposal / design
Date: 2006-08-17 17:42:20
Message-ID: 17982.1155836540@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Bruce Momjian <bruce(at)momjian(dot)us> writes:
> Jim C. Nasby wrote:
>> If there was a mechanism to obtain
>> field widths from the catalog there would be no need to store the
>> field width in each tuple. This would be useful for other types as
>> well (UUID and ENUM, for example).

> I don't think there is concensus on adding that.

Well, it's pie-in-the-sky at the moment because we have no credible
design for doing it. Whether any given proposal would get accepted
would depend on what its downsides were.

Do we (or should we) have a TODO section for "blue sky research
ideas"? I'd not object to putting an item like this in such a
section. But for most of the TODO items we have a reasonably clear
idea of what we're talking about, so this doesn't seem to belong
in with the rest.

regards, tom lane


From: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Gregory Stark <gsstark(at)mit(dot)edu>, Tom Dunstan <pgsql(at)tomd(dot)cc>, Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Fixed-width types (was: Enum proposal / design)
Date: 2006-08-17 18:55:27
Message-ID: 20060817185527.GZ21363@pervasive.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Aug 17, 2006 at 01:42:20PM -0400, Tom Lane wrote:
> Bruce Momjian <bruce(at)momjian(dot)us> writes:
> > Jim C. Nasby wrote:
> >> If there was a mechanism to obtain
> >> field widths from the catalog there would be no need to store the
> >> field width in each tuple. This would be useful for other types as
> >> well (UUID and ENUM, for example).
>
> > I don't think there is concensus on adding that.

I seem to recall it coming up a few times in recent history, but
consider this a call for people to "+1" it.

> Well, it's pie-in-the-sky at the moment because we have no credible
> design for doing it. Whether any given proposal would get accepted
> would depend on what its downsides were.
>
> Do we (or should we) have a TODO section for "blue sky research
> ideas"? I'd not object to putting an item like this in such a
> section. But for most of the TODO items we have a reasonably clear
> idea of what we're talking about, so this doesn't seem to belong
> in with the rest.

It seems what's desired is fairly clear (even if my proposed TODO
wasn't)... have a means for types that do not contain varlena info but
aren't pre-compiled to a fixed width like int is.

Of course, how to do it is a whole different story. I thought that there
was precident for putting items like that on the TODO, but maybe not.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby(at)pervasive(dot)com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>, Gregory Stark <gsstark(at)mit(dot)edu>, Tom Dunstan <pgsql(at)tomd(dot)cc>, Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Enum proposal / design
Date: 2006-08-18 00:02:32
Message-ID: 200608180002.k7I02Wi26998@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
> Bruce Momjian <bruce(at)momjian(dot)us> writes:
> > Jim C. Nasby wrote:
> >> If there was a mechanism to obtain
> >> field widths from the catalog there would be no need to store the
> >> field width in each tuple. This would be useful for other types as
> >> well (UUID and ENUM, for example).
>
> > I don't think there is concensus on adding that.
>
> Well, it's pie-in-the-sky at the moment because we have no credible
> design for doing it. Whether any given proposal would get accepted
> would depend on what its downsides were.
>
> Do we (or should we) have a TODO section for "blue sky research
> ideas"? I'd not object to putting an item like this in such a
> section. But for most of the TODO items we have a reasonably clear
> idea of what we're talking about, so this doesn't seem to belong
> in with the rest.

Blue sky ideas just don't seem natural on the TODO list. Some people
wanted to use a wiki, and maybe it would be good for that.

--
Bruce Momjian bruce(at)momjian(dot)us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Gregory Stark <gsstark(at)mit(dot)edu>, Tom Dunstan <pgsql(at)tomd(dot)cc>, Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Enum proposal / design
Date: 2006-08-23 07:10:00
Message-ID: 20060823071000.GH88878@pervasive.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Aug 17, 2006 at 08:02:32PM -0400, Bruce Momjian wrote:
> Tom Lane wrote:
> > Bruce Momjian <bruce(at)momjian(dot)us> writes:
> > > Jim C. Nasby wrote:
> > >> If there was a mechanism to obtain
> > >> field widths from the catalog there would be no need to store the
> > >> field width in each tuple. This would be useful for other types as
> > >> well (UUID and ENUM, for example).
> >
> > > I don't think there is concensus on adding that.
> >
> > Well, it's pie-in-the-sky at the moment because we have no credible
> > design for doing it. Whether any given proposal would get accepted
> > would depend on what its downsides were.
> >
> > Do we (or should we) have a TODO section for "blue sky research
> > ideas"? I'd not object to putting an item like this in such a
> > section. But for most of the TODO items we have a reasonably clear
> > idea of what we're talking about, so this doesn't seem to belong
> > in with the rest.
>
> Blue sky ideas just don't seem natural on the TODO list. Some people
> wanted to use a wiki, and maybe it would be good for that.

I think it would be good to have something, so that people are
occasionally reminded about these things. That's a good way to help
shake ideas out.

Something else to consider is that anything is doable, given enough
effort, which is an argument for just putting it on the TODO.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby(at)pervasive(dot)com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461


From: Greg Stark <gsstark(at)mit(dot)edu>
To: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Gregory Stark <gsstark(at)mit(dot)edu>, Tom Dunstan <pgsql(at)tomd(dot)cc>, Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Enum proposal / design
Date: 2006-08-23 09:13:27
Message-ID: 87zmdvolvs.fsf@stark.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"Jim C. Nasby" <jnasby(at)pervasive(dot)com> writes:

> I think it would be good to have something, so that people are
> occasionally reminded about these things. That's a good way to help
> shake ideas out.

I think the only reason there aren't more outrageous dreamworld ideas in the
TODO is that people came along and did a lot of them. 3-phase-commit,
nested-transactions, PITR, etc. were all at some point pretty pie in the sky.

At some level there's not much point in keeping a TODO of ideas we know how to
do, most of those ideas just get done.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com