Normal vs Surrogate Primary Keys...

Lists: pgsql-general
From: "rlee0001" <robeddielee(at)hotmail(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Normal vs Surrogate Primary Keys...
Date: 2006-10-02 00:58:28
Message-ID: 1159750708.611855.18730@h48g2000cwc.googlegroups.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

I know this is an old topic and also a religious one so I won't get
into the debate, but I thought up one possible solution that would make
almost everybody happy and was wondering if any PostgreSQL hackers out
there had any thoughts.

I was wondering if, considering that an entity can only have a single
primary key, a combination of both methods could be possible. For
example, using a surrogate key exclusively for maintaining entity
relationships while at the same time using a natural candidate key as a
user-friendly method of identifying records. The reason I ask is that
it seems like this is the sort of thing that even a fully SQL-compliant
DBMS could do internally to compensate for the performance issues with
using large natural keys in relationships.

I know, for example, that by default PostgreSQL assigns every record a
small unique identifier called an OID. It seems reasonable then, that
when the DBA creates a cascading foreign key to a record, that the DBMS
could, instead of storing the record's entire natural key, store only a
reference to the OID and abstract/hide this behavior from the
environment just as PostgreSQL does with its OID feature now. Of
course, this would require that the OID be guaranteed unique, which I
don't beleave is the case in the current versions.

This would completely eliminate concerns related to the performance of
cascading updates because no actual cascade would take place, but
rather the update would affect all referencing records implicitly via
the abstraction. Additionally, storage concerns related to large
foreign keys would similarly be addressed since only the OID reference
would be stored. It would then, as far as I can tell, be completely
reasonable to use a person's mailing address as well as other large and
numerous fields as primary keys so long as the resulting index
performance would be acceptable.

As far as index performance is concerned, for many tables "unique"
indexes already exist on these large natural candidate keys, and I
don't think any index performance would be lost by turning those unique
indexes into primary key indexes.

Is such a "behind the scenes" abstraction of large foreign keys
practical? If so, I think such a mechanism would be a very powerful
feature for physical data modelers working with PostgreSQL.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "rlee0001" <robeddielee(at)hotmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Normal vs Surrogate Primary Keys...
Date: 2006-10-02 02:14:28
Message-ID: 26430.1159755268@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

"rlee0001" <robeddielee(at)hotmail(dot)com> writes:
> ... I know, for example, that by default PostgreSQL assigns every record a
> small unique identifier called an OID.

Well, actually, that hasn't been the default for some time, and even if
you turn it on it's not guaranteed unique without additional steps, and
seeing that it's only 32bits wide it's not usable as a unique ID for
very large tables.

"unique" and "small" are more or less contradictory in this context.

> ... it seems like this is the sort of thing that even a fully SQL-compliant
> DBMS could do internally to compensate for the performance issues with
> using large natural keys in relationships.

What performance issues are those, exactly? I have seen no data that
proves that large key size is a bottleneck for reasonable schema designs
in Postgres. Maybe that just means that we have more fundamental
problems to fix :-( ... but there's no point in investing a lot of
effort on an aspect that isn't the next bottleneck.

regards, tom lane


From: "rlee0001" <robeddielee(at)hotmail(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Normal vs Surrogate Primary Keys...
Date: 2006-10-02 02:48:14
Message-ID: 1159757293.961477.282240@k70g2000cwa.googlegroups.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general


Tom Lane wrote:
> "rlee0001" <robeddielee(at)hotmail(dot)com> writes:
> > ... I know, for example, that by default PostgreSQL assigns every record a
> > small unique identifier called an OID.
>
> Well, actually, that hasn't been the default for some time, and even if
> you turn it on it's not guaranteed unique without additional steps, and
> seeing that it's only 32bits wide it's not usable as a unique ID for
> very large tables.
>
> "unique" and "small" are more or less contradictory in this context.
>
> > ... it seems like this is the sort of thing that even a fully SQL-compliant
> > DBMS could do internally to compensate for the performance issues with
> > using large natural keys in relationships.
>
> What performance issues are those, exactly? I have seen no data that
> proves that large key size is a bottleneck for reasonable schema designs
> in Postgres. Maybe that just means that we have more fundamental
> problems to fix :-( ... but there's no point in investing a lot of
> effort on an aspect that isn't the next bottleneck.

Well from what I understand, there are basically two reasons to use
surrogate primary keys: 1) No reliable natural candidate key exists or
2) The natural candidate keys are percieved to be far too large/complex
to use as the primary key. I have yet to hear anyone recommend the use
of surrogate keys in modern databases for any other reason. Obviously
that some entities in practice have no reliable natural keys cannot be
helped and in those cases a surrogate key pretty much has to be used,
enless the data modeler would find it acceptable to use every attribute
in the entity as part of the primary key (in order to ensure uniqueness
of each record overall).

The second argument for the use of surrogate keys is simply that
reliable natural candidate keys are often perceived to be too large to
use as primary keys. The perception seems to be that large primary keys
consume a considerable amount of storage space when used in foreign
keys. For example, if I key "employee" by Last Name, First Name, Date
of Hire and Department, I would need to store copies of all this data
in any entity that relates to an employee (e.g. payroll, benefits and
so on). In addition, if any of these fields change in value, that
update would need to cascade to any related entities, which might be
perceived as a performance issue if there are many related records.

I'm not saying that PostgreSQL specifically has performance problems
but that using large natural keys can hamper performance (both in terms
of storage space required and cascading update time) when a lot of
relationships exist between entities with such large keys.

Personally I hate using surrogate keys except in places where no
reliable natural key exists but find it nessisary in order to improve
the efficiency of foreign keys in the database. So my proposal was
simply to have the DBMS internally create and use an invisible
surrogate key to identify and relate records, but expose the natural
key to the environment. The currently OID implimentation cannot be used
for this as you've already stated but I think a similar implimentation
could work.

I'm just wondering if anyone else would take advantage of the
performance benefit, or perhaps sees a flaw in my logic.

> regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 1: if posting/reading through Usenet, please send an appropriate
> subscribe-nomail command to majordomo(at)postgresql(dot)org so that your
> message can get through to the mailing list cleanly


From: Stephan Szabo <sszabo(at)megazone(dot)bigpanda(dot)com>
To: rlee0001 <robeddielee(at)hotmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Normal vs Surrogate Primary Keys...
Date: 2006-10-02 02:53:47
Message-ID: 20061001194334.S14663@megazone.bigpanda.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On Sun, 1 Oct 2006, rlee0001 wrote:

> I know, for example, that by default PostgreSQL assigns every record a
> small unique identifier called an OID. It seems reasonable then, that
> when the DBA creates a cascading foreign key to a record, that the DBMS
> could, instead of storing the record's entire natural key, store only a
> reference to the OID and abstract/hide this behavior from the
> environment just as PostgreSQL does with its OID feature now. Of
> course, this would require that the OID be guaranteed unique, which I
> don't beleave is the case in the current versions.
>
> This would completely eliminate concerns related to the performance of
> cascading updates because no actual cascade would take place, but
> rather the update would affect all referencing records implicitly via
> the abstraction.

Well, that alone isn't enough I think.

MATCH SIMPLE allows you to pass the constraint for a row if any of the
columns in a multi-column foreign key are NULL, so there isn't always a
matching row, but there's also meaningful information in the column
values. MATCH PARTIAL (which we admittedly don't support yet) allows you
to have a valid key if the non-NULL portions of the multi-column foreign
key match to one or more rows in the referenced table, so there may be
more than one matching row. The all NULL case is pretty easy to handle in
general.

In addition, AFAICT for cascades you would potentially be trading the cost
at cascade time with the cost at select time, so that would itself not
always be a win. Also, I don't see how you get away with not needing two
indexes on the referenced table to do this well unless you're storing
something like a ctid which has its own problems with updates.

I think there may be some better options than what we've got, but there's
tradeoffs as well.


From: "rlee0001" <robeddielee(at)hotmail(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Normal vs Surrogate Primary Keys...
Date: 2006-10-02 03:13:58
Message-ID: 1159758838.673055.92430@k70g2000cwa.googlegroups.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general


Stephan Szabo wrote:
> On Sun, 1 Oct 2006, rlee0001 wrote:
>
> > I know, for example, that by default PostgreSQL assigns every record a
> > small unique identifier called an OID. It seems reasonable then, that
> > when the DBA creates a cascading foreign key to a record, that the DBMS
> > could, instead of storing the record's entire natural key, store only a
> > reference to the OID and abstract/hide this behavior from the
> > environment just as PostgreSQL does with its OID feature now. Of
> > course, this would require that the OID be guaranteed unique, which I
> > don't beleave is the case in the current versions.
> >
> > This would completely eliminate concerns related to the performance of
> > cascading updates because no actual cascade would take place, but
> > rather the update would affect all referencing records implicitly via
> > the abstraction.
>
> Well, that alone isn't enough I think.
>
> MATCH SIMPLE allows you to pass the constraint for a row if any of the
> columns in a multi-column foreign key are NULL, so there isn't always a
> matching row, but there's also meaningful information in the column
> values. MATCH PARTIAL (which we admittedly don't support yet) allows you
> to have a valid key if the non-NULL portions of the multi-column foreign
> key match to one or more rows in the referenced table, so there may be
> more than one matching row. The all NULL case is pretty easy to handle in
> general.
>
> In addition, AFAICT for cascades you would potentially be trading the cost
> at cascade time with the cost at select time, so that would itself not
> always be a win. Also, I don't see how you get away with not needing two
> indexes on the referenced table to do this well unless you're storing
> something like a ctid which has its own problems with updates.
>
> I think there may be some better options than what we've got, but there's
> tradeoffs as well.

LOL, I hadn't even considered that. In my proposed solution, every
select against a table containing a foreign key (which selects at least
part of the foreign key) would require an implicit join to retrieve the
actual foreign key values. Additionally, selects with explicit joins
would need to be converted, which would require the DBMS to execute
several additional queries (to retrieve the internal surrogate key for
the given natural key values) before executing the original join query.

Oh well, can't kill a guy for trying.

Thanks for pointing out what I was too blind to see. :o)

-Robert

> ---------------------------(end of broadcast)---------------------------
> TIP 4: Have you searched our list archives?
>
> http://archives.postgresql.org


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: rlee0001 <robeddielee(at)hotmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Normal vs Surrogate Primary Keys...
Date: 2006-10-02 15:58:11
Message-ID: 20061002155811.GB29699@svana.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On Sun, Oct 01, 2006 at 07:48:14PM -0700, rlee0001 wrote:
> <snip> For example, if I key "employee" by Last Name, First Name, Date
> of Hire and Department, I would need to store copies of all this data
> in any entity that relates to an employee (e.g. payroll, benefits and
> so on). In addition, if any of these fields change in value, that
> update would need to cascade to any related entities, which might be
> perceived as a performance issue if there are many related records.

Err, those fields don't make a natural key since they have no guarentee
of uniqueness. You've simply decided that the chance of collision is
low enough that you don't care, but for me that's not really good
enough for use as a key.

Secondly, three of the four fields you suggest are subject to change,
so that indeed makes them a bad choice. My definition of "key" includes
"unchanged for the lifetime of the tuple".

In that situation your idea may work well, but that's just a surrogate
key in disguise...

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.


From: "rlee0001" <robeddielee(at)hotmail(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Normal vs Surrogate Primary Keys...
Date: 2006-10-02 21:29:31
Message-ID: 1159824571.772571.259860@e3g2000cwe.googlegroups.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

Martijn van Oosterhout wrote:
> On Sun, Oct 01, 2006 at 07:48:14PM -0700, rlee0001 wrote:
> > <snip> For example, if I key "employee" by Last Name, First Name, Date
> > of Hire and Department, I would need to store copies of all this data
> > in any entity that relates to an employee (e.g. payroll, benefits and
> > so on). In addition, if any of these fields change in value, that
> > update would need to cascade to any related entities, which might be
> > perceived as a performance issue if there are many related records.
>
> Err, those fields don't make a natural key since they have no guarentee
> of uniqueness. You've simply decided that the chance of collision is
> low enough that you don't care, but for me that's not really good
> enough for use as a key.

Oh look mommy, a usenet troll. Sweet. I'm bored, so...

Those fields were a contrived example of a key that might be perceived
to be too large to use as a key for performance reasons. Are you
suggesting that because they are not guaranteed to be unique that no
perforance problem would exist in using such large and complex fields
as keys? Or do you acknowledge that my example holds regardless?

The fact of the matter is, non-abstract (natural) entities have only
one perfect candidate key, which is the compound of all their natural
attributes. For these entities, a decision must be made by the data
modeler after gathering the requirements of the application as to what
the minimum subset of attributes are that would never be duplicated
(again: within the context of the application). In my employee example,
I, as the data modeler, have decided that those four fields constitute
a reasonable candidate key based on the requirements of the
application.

> Secondly, three of the four fields you suggest are subject to change,
> so that indeed makes them a bad choice. My definition of "key" includes
> "unchanged for the lifetime of the tuple".

There is no such rule of normalization or good database logic. You are
refering to a technical limitation in some obsolete system that lack
cascading update support.

> In that situation your idea may work well, but that's just a surrogate
> key in disguise...

I know. But not just in disguise -- invisible. An internal peice of the
database, like an index. This is where perforance hacks belong, not
mixed in with business logic (or in this case business data). Basically
I'm introducing the concept of a hidden-psudo-sub-primary-key. The
index of relationships. Additionally the ID could be extracted and used
by the application for other uses such as transmitting a record pointer
via a query-string and other internal/technical/non-business-logic
activities.

> Have a nice day,

Which one?

> --
> Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> > From each according to his ability. To each according to his ability to litigate.
>
> --0ntfKIWw70PvrIHh
> Content-Type: application/pgp-signature
> Content-Disposition: inline;
> filename="signature.asc"
> Content-Description: Digital signature
> X-Google-AttachSize: 190