Re: Behavior of GENERATED columns per SQL2003

Lists: pgsql-hackers
From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Zoltan Boszormenyi <zb(at)cybertec(dot)at>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Behavior of GENERATED columns per SQL2003
Date: 2007-05-07 21:16:15
Message-ID: 18812.1178572575@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I've been studying the SQL spec in a bit more detail and I'm suddenly
thinking that we've got the behavior all wrong in the current
GENERATED/IDENTITY patch. In particular, it looks to me like we've
been implementing GENERATED ALWAYS AS (expr) according to the rules
that the spec in fact lays down only for GENERATED ALWAYS AS IDENTITY.
You'd think the two constructs would be pretty closely related but
the behaviors specified by the spec are light-years apart. If you
look closely, a "generated column" in the meaning of section 4.14.8
is one that has GENERATED ALWAYS AS (expr), and identity columns are
*not* in this class.

It looks to me like the behavior the spec intends for a generated column
is actually that it can be implemented as a "virtual column" occupying
no space on disk and instead computed on-the-fly when retrieved.
Identity columns can have their values overridden by the
user (it's a little harder if GENERATED ALWAYS, but still possible),
and they don't change during an UPDATE unless specifically forced to.
In contrast, generated columns cannot be overridden by
assignment, and are recomputed from their base columns during updates.
This realization also explains the following, otherwise rather strange,
facts:

* There is no GENERATED BY DEFAULT AS (expr) in the spec.

* GENERATED expressions are specifically disallowed from containing
subselects, calling functions that access any SQL-data, or being
nondeterministic; hence their values depend solely on the regular
columns in the same row.

* While identity columns are updated (if needed) before execution of
BEFORE triggers, generated columns are updated after BEFORE triggers;
hence a BEFORE trigger can override the value in one case and not the
other. (The current patch gets this wrong, btw.)

* Generated columns are forcibly updated when their base columns change
as a result of FK constraints (such as ON UPDATE CASCADE).

It looks to me like a BEFORE trigger is actually the only place that can
(transiently) see values of a generated column that are different from
the result of applying the generation expression on the rest of the row.
It's unclear whether that's intentional or an oversight.

Is anyone familiar with a database that implements SQL-spec generated
columns? Do they actually store the columns?

regards, tom lane


From: David Fuhry <dfuhry(at)cs(dot)kent(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Zoltan Boszormenyi <zb(at)cybertec(dot)at>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Behavior of GENERATED columns per SQL2003
Date: 2007-05-07 22:29:50
Message-ID: 463FA85E.4010706@cs.kent.edu
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Oracle 10g, MySQL 5, and SQL Server 2005 don't appear to support the
syntax. The SQL:2003 SIGMOD paper [1] indicates pretty clearly that
their intention is for the values of generated columns to be stored on disk:

"... commonly used expressions are evaluated once and their results
stored for future use"

"Generated columns can lead to higher performance... because of reduced
computation"

-Dave

[1] Eisenberg, A., Melton, J., Kulkarni, K., Michels, J., and Zemke, F.
2004. SQL:2003 has been published. SIGMOD Rec. 33, 1 (Mar. 2004), 119-126.
http://www.sigmod.org/record/issues/0403/E.JimAndrew-standard.pdf

Tom Lane wrote:
> I've been studying the SQL spec in a bit more detail and I'm suddenly
> thinking that we've got the behavior all wrong in the current
> GENERATED/IDENTITY patch. In particular, it looks to me like we've
> been implementing GENERATED ALWAYS AS (expr) according to the rules
> that the spec in fact lays down only for GENERATED ALWAYS AS IDENTITY.
> You'd think the two constructs would be pretty closely related but
> the behaviors specified by the spec are light-years apart. If you
> look closely, a "generated column" in the meaning of section 4.14.8
> is one that has GENERATED ALWAYS AS (expr), and identity columns are
> *not* in this class.
>
> It looks to me like the behavior the spec intends for a generated column
> is actually that it can be implemented as a "virtual column" occupying
> no space on disk and instead computed on-the-fly when retrieved.
> Identity columns can have their values overridden by the
> user (it's a little harder if GENERATED ALWAYS, but still possible),
> and they don't change during an UPDATE unless specifically forced to.
> In contrast, generated columns cannot be overridden by
> assignment, and are recomputed from their base columns during updates.
> This realization also explains the following, otherwise rather strange,
> facts:
>
> * There is no GENERATED BY DEFAULT AS (expr) in the spec.
>
> * GENERATED expressions are specifically disallowed from containing
> subselects, calling functions that access any SQL-data, or being
> nondeterministic; hence their values depend solely on the regular
> columns in the same row.
>
> * While identity columns are updated (if needed) before execution of
> BEFORE triggers, generated columns are updated after BEFORE triggers;
> hence a BEFORE trigger can override the value in one case and not the
> other. (The current patch gets this wrong, btw.)
>
> * Generated columns are forcibly updated when their base columns change
> as a result of FK constraints (such as ON UPDATE CASCADE).
>
> It looks to me like a BEFORE trigger is actually the only place that can
> (transiently) see values of a generated column that are different from
> the result of applying the generation expression on the rest of the row.
> It's unclear whether that's intentional or an oversight.
>
> Is anyone familiar with a database that implements SQL-spec generated
> columns? Do they actually store the columns?
>
> regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 3: Have you checked our extensive FAQ?
>
> http://www.postgresql.org/docs/faq


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Zoltan Boszormenyi <zb(at)cybertec(dot)at>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Behavior of GENERATED columns per SQL2003
Date: 2007-05-09 21:33:42
Message-ID: 2776.1178746422@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

After some more study of the SQL spec, the distinction between GENERATED
ALWAYS AS IDENTITY and GENERATED BY DEFAULT AS IDENTITY is not what
I thought it was.

* As far as I can find from the spec, there is *no* difference between
the two cases for INSERT commands. The rule is that you ignore any
user-supplied data and use the default (ie, nextval()) unless OVERRIDING
SYSTEM VALUE is specified. It is not an error to try to insert data
into an identity column, it's just ignored unless OVERRIDING SYSTEM
VALUE.

* The difference for UPDATE commands is that you can update a BY DEFAULT
identity column to anything you want, whereas for an ALWAYS identity
it's an error to update to anything but DEFAULT (which causes a fresh
nextval() to be assigned). Both behaviors are different from a
generated column, which is updated whether you mention it or not.

This means that GENERATED BY DEFAULT AS IDENTITY is not at all
equivalent to our historical behavior for SERIAL columns and hence we
cannot merge the two cases.

The lack of any behavioral difference for INSERT seems surprising
and counterintuitive; have I just missed something in the spec?

BTW, I found what they did about the problem that generated columns
are out of sync with their underlying columns during BEFORE-trigger
execution: in 11.39

12)If BEFORE is specified, then:
...
c) The <triggered action> shall not contain a <field
reference> that references a field in the new transition
variable corresponding to a generated column of T.

IOW they just pretend you can't look. So I think we need not worry
about leaving the values out-of-date until after the triggers fire.

regards, tom lane


From: Zoltan Boszormenyi <zb(at)cybertec(dot)at>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Behavior of GENERATED columns per SQL2003
Date: 2007-05-10 07:22:04
Message-ID: 4642C81C.8000607@cybertec.at
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane írta:
> After some more study of the SQL spec, the distinction between GENERATED
> ALWAYS AS IDENTITY and GENERATED BY DEFAULT AS IDENTITY is not what
> I thought it was.
>
> * As far as I can find from the spec, there is *no* difference between
> the two cases for INSERT commands. The rule is that you ignore any
> user-supplied data and use the default (ie, nextval()) unless OVERRIDING
> SYSTEM VALUE is specified. It is not an error to try to insert data
> into an identity column, it's just ignored unless OVERRIDING SYSTEM
> VALUE.
>
> * The difference for UPDATE commands is that you can update a BY DEFAULT
> identity column to anything you want, whereas for an ALWAYS identity
> it's an error to update to anything but DEFAULT (which causes a fresh
> nextval() to be assigned). Both behaviors are different from a
> generated column, which is updated whether you mention it or not.
>

The quoted SIGMOD paper mentioned that specifying a value
for a generated column should raise an error in INSERT but
this behaviour is not mentioned by the standard.
BTW, do you know what's a "self-referencing column"?
I haven't found a definition of it and there are places where the standard
uses this term on behaviour that would be natural for generated columns.
E.g. page 860 in latest drafts, section 10.14, or SQL:2003, section 14.8,
about INSERT statement: the value the user specified should be stored if
"some underlying column of Ci is a self-referencing column and
OVERRIDING SYSTEM VALUE is specified."

> This means that GENERATED BY DEFAULT AS IDENTITY is not at all
> equivalent to our historical behavior for SERIAL columns and hence we
> cannot merge the two cases.
>

Yes, they are equivalent if you read 5IWD2-02-Foundation-2006-04.pdf
or 5CD2-02-Foundation-2006-01.pdf, i.e. the latest two drafts.
(The latter seems to be misnamed considering that www.wiscorp.com
refreshed the sql200n.zip on 2007-03-11.) Page 860, section 14.10, INSERT.
The value the user provides should be accepted for storage if:
- the column is an identity column and you provide
OVERRIDING SYSTEM VALUE, or
- the column is an GENERATED BY DEFAULT AS IDENTITY
and you provide neither OVERRIDING USER VALUE nor
the DEFAULT specification for the column.

I think the babble about OVERRIDING USER VALUE
is somewhat controversial. Why would you want to do e.g.
INSERT INTO tabname (id, ...) OVERRIDING USER VALUE (N, ...);
where N is an explicit constant? And I haven't even implemented
handling it. Anyway, without specifying OVERRIDING USER VALUE
the GENERATED BY DEFAULT AS IDENTITY is
equivalent with traditional SERIAL in PostgreSQL.
Implementing OVERRIDING USER VALUE behaviour means
that GENERATED BY DEFAULT AS IDENTITY (or SERIAL) would be
marked as an identity as well, not as a column simply having a DEFAULT
clause.
Otherwise OVERRIDING USER VALUE would override every
user-specified value for regular columns having a DEFAULT expression.

> The lack of any behavioral difference for INSERT seems surprising
> and counterintuitive; have I just missed something in the spec?
>

No, I was just ahead of the times and read newer drafts than SQL:2003.

> BTW, I found what they did about the problem that generated columns
> are out of sync with their underlying columns during BEFORE-trigger
> execution: in 11.39
>
> 12)If BEFORE is specified, then:
> ...
> c) The <triggered action> shall not contain a <field
> reference> that references a field in the new transition
> variable corresponding to a generated column of T.
>

I vaguely remember reading it, although the idea seem to have remained
in my mind. :-)

> IOW they just pretend you can't look. So I think we need not worry
> about leaving the values out-of-date until after the triggers fire.
>
> regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: explain analyze is your friend
>
>

--
----------------------------------
Zoltán Böszörményi
Cybertec Geschwinde & Schönig GmbH
http://www.postgresql.at/


From: Zoltan Boszormenyi <zb(at)cybertec(dot)at>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Behavior of GENERATED columns per SQL2003
Date: 2007-05-10 08:57:23
Message-ID: 4642DE73.4040809@cybertec.at
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Zoltan Boszormenyi írta:
> The quoted SIGMOD paper mentioned that specifying a value
> for a generated column should raise an error in INSERT but
> this behaviour is not mentioned by the standard.

I found it now, I haven't read hard enough before.
SQL:2003, section 14.8, syntax rules:

10) If <contextually typed table value constructor> CTTVC is specified,
then every <contextually typed row
value constructor element> simply contained in CTTVC whose
positionally corresponding <column name>
in <insert column list> references a column of which some underlying
column is a generated column shall
be a <default specification>.

So, I can only omit the generated column or specify DEFAULT.
Anything else should raise an error. Should it be done in analyze.c
or in rewriteHandler.c?

--
----------------------------------
Zoltán Böszörményi
Cybertec Geschwinde & Schönig GmbH
http://www.postgresql.at/


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Zoltan Boszormenyi <zb(at)cybertec(dot)at>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Behavior of GENERATED columns per SQL2003
Date: 2007-05-10 14:11:23
Message-ID: 12923.1178806283@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Zoltan Boszormenyi <zb(at)cybertec(dot)at> writes:
> Tom Lane rta:
>> This means that GENERATED BY DEFAULT AS IDENTITY is not at all
>> equivalent to our historical behavior for SERIAL columns and hence we
>> cannot merge the two cases.

> Yes, they are equivalent if you read 5IWD2-02-Foundation-2006-04.pdf
> or 5CD2-02-Foundation-2006-01.pdf, i.e. the latest two drafts.

Hm. So what we've got here is that the committee has decided the 2003
spec is broken, and they may someday come out with a revised definition
that might, or might not, bear any resemblance to the current 200n
working papers. According to some off-list discussion, nobody is
entirely sure what the current draft is trying to say anyway.

That about tears it for me: I think we should reject at least the
IDENTITY parts of this patch, and very likely the entire thing. I've
spent more than three days now trying to get it into a committable form,
time I can't really afford to spend right now on a patch that adds such
a marginal feature.

AFAICS the only actual new feature that IDENTITY adds is the ability to
make the default expression silently override user-specified insertion
data, as in fact was the use-case argued by you here:
http://archives.postgresql.org/pgsql-hackers/2006-08/msg00038.php

Now we find that as the spec is actually written, it doesn't work that
way: if you try to specify a value other than DEFAULT for an inserted
column, you get a syntax error, unless you specify OVERRIDING SYSTEM
VALUE or OVERRIDING USER VALUE, neither of which are going to be found
in legacy apps ported from other DBMSes, and neither of which add any
actual new functionality (if you have to write OVERRIDING USER VALUE,
you could just as well not specify the column). So I'm seeing a lot
of complexity and a lot of confusion added for not much, not even
considering the risks of trying to track a moving standards target.

As for GENERATED ALWAYS AS (expr), now that we understand that it's not
supposed to define a virtual column, what's the point? You can get the
same behavior with a trivial BEFORE INSERT/UPDATE trigger that
recomputes the derived value, and you don't have to buy into the rather
ill-defined spec behavior (in particular the point that the generated
column is effectively undefined during trigger firing seems really
poorly done). In fact, given that the only plausible use-cases involve
expressions that are expensive to compute, a trigger can probably do
*better* than the built-in feature, since it can make use of application
knowledge about when a recomputation is really necessary. The current
patch recomputes the expression on every UPDATE, and would have a hard
time being any brighter than that, given that we don't know what BEFORE
triggers might do to the row.

So at this point I'm feeling that we've poured a lot of effort down a
hole :-(. We are not really close to having a patch that implements
the current 200n draft (in particular note that OVERRIDING USER VALUE
is not a no-op according to this morning's understanding of the draft).
Even if we were, I'd vote against implementing a part of the draft
that's clearly still in flux --- if they change it again, we'd be stuck.
But the real bottom line is that I don't see enough use-case for these
features to justify the work done already, let alone a bunch more work.

regards, tom lane


From: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Zoltan Boszormenyi" <zb(at)cybertec(dot)at>, <pgsql-hackers(at)postgreSQL(dot)org>
Subject: Re: Behavior of GENERATED columns per SQL2003
Date: 2007-05-10 20:51:19
Message-ID: 1178830280.10861.236.camel@silverbirch.site
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, 2007-05-10 at 10:11 -0400, Tom Lane wrote:

> As for GENERATED ALWAYS AS (expr), now that we understand that it's not
> supposed to define a virtual column, what's the point? You can get the
> same behavior with a trivial BEFORE INSERT/UPDATE trigger that
> recomputes the derived value, and you don't have to buy into the rather
> ill-defined spec behavior (in particular the point that the generated
> column is effectively undefined during trigger firing seems really
> poorly done). In fact, given that the only plausible use-cases involve
> expressions that are expensive to compute, a trigger can probably do
> *better* than the built-in feature, since it can make use of application
> knowledge about when a recomputation is really necessary. The current
> patch recomputes the expression on every UPDATE, and would have a hard
> time being any brighter than that, given that we don't know what BEFORE
> triggers might do to the row.

We do need virtual columns, whether the spec requires them or not. They
would allow us to completely remove the column value when using
value-list based partitioning, giving considerable space savings for
VLDBs.

ISTM that we should interpret this as a requirement for a virtual
column. We can always move from that to a stored column if the spec
becomes more specific, though it would be harder to move the other way.
And as you point out, storing the value would make no sense.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
Cc: "Zoltan Boszormenyi" <zb(at)cybertec(dot)at>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Behavior of GENERATED columns per SQL2003
Date: 2007-05-11 02:31:30
Message-ID: 9988.1178850690@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"Simon Riggs" <simon(at)2ndquadrant(dot)com> writes:
> On Thu, 2007-05-10 at 10:11 -0400, Tom Lane wrote:
>> As for GENERATED ALWAYS AS (expr), now that we understand that it's not
>> supposed to define a virtual column, what's the point?

> We do need virtual columns, whether the spec requires them or not.

Agreed, they seem more useful than what the spec's got in mind. You can
fake a virtual column using a view, but it's a whole lot more painful
than faking a GENERATED column using a trigger (at least if you wish the
view to be updatable).

> ISTM that we should interpret this as a requirement for a virtual
> column. We can always move from that to a stored column if the spec
> becomes more specific, though it would be harder to move the other way.

If you're suggesting commandeering the spec's GENERATED ALWAYS syntax
to represent virtual columns, when the committee has made it clear that
that's not what they intend, I say that's sheer folly. What will you do
when they tweak the spec to the point where a virtual column clearly
doesn't satisfy it? If we want a nonstandard feature we should use a
nonstandard syntax for it.

regards, tom lane


From: Richard Huxton <dev(at)archonet(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Zoltan Boszormenyi <zb(at)cybertec(dot)at>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Behavior of GENERATED columns per SQL2003
Date: 2007-05-11 07:32:56
Message-ID: 46441C28.6000003@archonet.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
> "Simon Riggs" <simon(at)2ndquadrant(dot)com> writes:
>> On Thu, 2007-05-10 at 10:11 -0400, Tom Lane wrote:
>>> As for GENERATED ALWAYS AS (expr), now that we understand that it's not
>>> supposed to define a virtual column, what's the point?
>
>> We do need virtual columns, whether the spec requires them or not.
>
> Agreed, they seem more useful than what the spec's got in mind. You can
> fake a virtual column using a view, but it's a whole lot more painful
> than faking a GENERATED column using a trigger (at least if you wish the
> view to be updatable).

If they could play a part in foreign keys that would be useful too.

--
Richard Huxton
Archonet Ltd


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Zoltan Boszormenyi <zb(at)cybertec(dot)at>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Behavior of GENERATED columns per SQL2003
Date: 2007-05-15 02:36:32
Message-ID: 200705150236.l4F2aWB09919@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


URL added to TODO item. Patch rejected for 8.3.

---------------------------------------------------------------------------

Tom Lane wrote:
> I've been studying the SQL spec in a bit more detail and I'm suddenly
> thinking that we've got the behavior all wrong in the current
> GENERATED/IDENTITY patch. In particular, it looks to me like we've
> been implementing GENERATED ALWAYS AS (expr) according to the rules
> that the spec in fact lays down only for GENERATED ALWAYS AS IDENTITY.
> You'd think the two constructs would be pretty closely related but
> the behaviors specified by the spec are light-years apart. If you
> look closely, a "generated column" in the meaning of section 4.14.8
> is one that has GENERATED ALWAYS AS (expr), and identity columns are
> *not* in this class.
>
> It looks to me like the behavior the spec intends for a generated column
> is actually that it can be implemented as a "virtual column" occupying
> no space on disk and instead computed on-the-fly when retrieved.
> Identity columns can have their values overridden by the
> user (it's a little harder if GENERATED ALWAYS, but still possible),
> and they don't change during an UPDATE unless specifically forced to.
> In contrast, generated columns cannot be overridden by
> assignment, and are recomputed from their base columns during updates.
> This realization also explains the following, otherwise rather strange,
> facts:
>
> * There is no GENERATED BY DEFAULT AS (expr) in the spec.
>
> * GENERATED expressions are specifically disallowed from containing
> subselects, calling functions that access any SQL-data, or being
> nondeterministic; hence their values depend solely on the regular
> columns in the same row.
>
> * While identity columns are updated (if needed) before execution of
> BEFORE triggers, generated columns are updated after BEFORE triggers;
> hence a BEFORE trigger can override the value in one case and not the
> other. (The current patch gets this wrong, btw.)
>
> * Generated columns are forcibly updated when their base columns change
> as a result of FK constraints (such as ON UPDATE CASCADE).
>
> It looks to me like a BEFORE trigger is actually the only place that can
> (transiently) see values of a generated column that are different from
> the result of applying the generation expression on the rest of the row.
> It's unclear whether that's intentional or an oversight.
>
> Is anyone familiar with a database that implements SQL-spec generated
> columns? Do they actually store the columns?
>
> regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 3: Have you checked our extensive FAQ?
>
> http://www.postgresql.org/docs/faq

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +