Re: ALTER TYPE 1: recheck index-based constraints

Lists: pgsql-hackers
From: Noah Misch <noah(at)leadboat(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: ALTER TYPE 1: recheck index-based constraints
Date: 2011-01-09 22:00:23
Message-ID: 20110109220023.GB5777@tornado.leadboat.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

When ALTER TABLE rewrites a table, it reindexes, but the reindex does not
revalidate UNIQUE/EXCLUDE constraints. This behaves badly in cases like this,
neglecting to throw an error on the new UNIQUE violation:

CREATE TABLE t (c numeric UNIQUE);
INSERT INTO t VALUES (1.1),(1.2);
ALTER TABLE t ALTER c TYPE int;

The comment gave a reason for skipping the checks: it would cause deadlocks when
we rewrite a system catalog. So, this patch changes things to only skip the
check for system catalogs.

Attachment Content-Type Size
at1-check-unique.patch text/plain 4.5 KB

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: ALTER TYPE 1: recheck index-based constraints
Date: 2011-01-12 03:03:11
Message-ID: AANLkTi=qwcNkjO63KYXm134u+BU2wNLOKrj2nFpx0=Ve@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Jan 9, 2011 at 5:00 PM, Noah Misch <noah(at)leadboat(dot)com> wrote:
> When ALTER TABLE rewrites a table, it reindexes, but the reindex does not
> revalidate UNIQUE/EXCLUDE constraints.  This behaves badly in cases like this,
> neglecting to throw an error on the new UNIQUE violation:
>
> CREATE TABLE t (c numeric UNIQUE);
> INSERT INTO t VALUES (1.1),(1.2);
> ALTER TABLE t ALTER c TYPE int;
>
> The comment gave a reason for skipping the checks: it would cause deadlocks when
> we rewrite a system catalog.  So, this patch changes things to only skip the
> check for system catalogs.

It looks like this behavior was introduced by Tom's commit
1ddc2703a936d03953657f43345460b9242bbed1 on 2010-02-07, and it appears
to be quite broken. The behavior is reasonable in the case of VACUUM
FULL and CLUSTER, but your example demonstrates that it's completely
broken in the case of ALTER TABLE. This strikes me as something we
need to fix in REL9_0_STABLE as well as master, and my gut feeling is
that we ought to fix it not by excluding system relations but by
making ALTER TABLE work differently from VACUUM FULL and CLUSTER.
There's an efficiency benefit to skipping this even on normal
relations in cases where it isn't necessary, and it shouldn't be
necessary if we're rewriting the rows unchanged.

Also, you need to start adding these patches to the CF app.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Noah Misch <noah(at)leadboat(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: ALTER TYPE 1: recheck index-based constraints
Date: 2011-01-12 13:14:51
Message-ID: 20110112131451.GA16787@tornado.leadboat.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Jan 11, 2011 at 10:03:11PM -0500, Robert Haas wrote:
> On Sun, Jan 9, 2011 at 5:00 PM, Noah Misch <noah(at)leadboat(dot)com> wrote:
> > When ALTER TABLE rewrites a table, it reindexes, but the reindex does not
> > revalidate UNIQUE/EXCLUDE constraints. ?This behaves badly in cases like this,
> > neglecting to throw an error on the new UNIQUE violation:
> >
> > CREATE TABLE t (c numeric UNIQUE);
> > INSERT INTO t VALUES (1.1),(1.2);
> > ALTER TABLE t ALTER c TYPE int;
> >
> > The comment gave a reason for skipping the checks: it would cause deadlocks when
> > we rewrite a system catalog. ?So, this patch changes things to only skip the
> > check for system catalogs.
>
> It looks like this behavior was introduced by Tom's commit
> 1ddc2703a936d03953657f43345460b9242bbed1 on 2010-02-07, and it appears
> to be quite broken. The behavior is reasonable in the case of VACUUM
> FULL and CLUSTER, but your example demonstrates that it's completely
> broken in the case of ALTER TABLE. This strikes me as something we
> need to fix in REL9_0_STABLE as well as master, and my gut feeling is
> that we ought to fix it not by excluding system relations but by
> making ALTER TABLE work differently from VACUUM FULL and CLUSTER.
> There's an efficiency benefit to skipping this even on normal
> relations in cases where it isn't necessary, and it shouldn't be
> necessary if we're rewriting the rows unchanged.

Something like the attached?

> Also, you need to start adding these patches to the CF app.

Done for all.

Attachment Content-Type Size
at1-check-unique.patch text/plain 11.2 KB

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: ALTER TYPE 1: recheck index-based constraints
Date: 2011-01-12 13:56:40
Message-ID: AANLkTikTGkA-_9WBOY+XUsb35s-Ezg=t0y3u3HLZoxZ=@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Jan 12, 2011 at 8:14 AM, Noah Misch <noah(at)leadboat(dot)com> wrote:
> Something like the attached?

I haven't really analyzed in this detail, but yes, approximately
something sorta like that.

>> Also, you need to start adding these patches to the CF app.
>
> Done for all.

Thanks.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: ALTER TYPE 1: recheck index-based constraints
Date: 2011-01-20 04:50:12
Message-ID: AANLkTi=m96jovYbrBOj27JvK=Qb1edWj-92FPJvucprd@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Jan 12, 2011 at 8:56 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Wed, Jan 12, 2011 at 8:14 AM, Noah Misch <noah(at)leadboat(dot)com> wrote:
>> Something like the attached?
>
> I haven't really analyzed in this detail, but yes, approximately
> something sorta like that.

I looked this over some more tonight. The name "tuples_changed" is
giving me some angst, because if we rewrote the heap... the tuples
changed. I think what you intend this to indicate whether the tuples
have changed in a semantic sense, ignoring CTIDs and so on. But it's
not even quite that, either, because ExecuteTruncate() is passing
false, and the set of tuples has probably changed in that case. It
seems that the cases here are:

1. When ALTER TABLE causes a rewrite, we set heap_rebuilt = true and
tuples_changed = true. This causes constraints to be revalidated and
suppresses use of indexes while the rebuild is in progress.
2. When VACUUM FULL or CLUSTER causes a rewrite, we set heap_rebuilt =
true and tuples_changed = false. This avoids constraint revalidation
but still suppresses use of indexes while the rebuild is in progress.
3. When TRUNCATE or REINDEX is invoked, we set heap_rebuilt = false
and tuples_changed = false. Here we neither revalidate constraints
nor suppress use of indexes while the rebuild is in progress.

The first question I asked myself is whether the above is actually
correct, and whether it's possible to simplify back to two cases. So:
The REINDEX case should definitely avoid suppressing the use of the
old index while the new one is rebuilt; I'm not really sure it matters
what TRUNCATE does, since we're going to be operating on a non-system
catalog on which we hold AccessExclusiveLock; the VACUUM FULL/CLUSTER
case certainly needs to suppress uses of indexes, because it can be
used on system catalogs, which may need to be used during the rebuild
itself. Thus #2 and #3 must be distinct. #1 must be distinct from #2
both for performance reasons and to prevent deadlocks when using
VACUUM FULL or CLUSTER on a system catalog (ALTER TABLE can't be so
used, so it's safe for it to not worry about this problem). So I
think the logic is correct and not overly complex.

I think what might make sense is - instead of adding another Boolean
argument, change the heap_rebuilt argument to int flags, and define
REINDEX_CHECK_CONSTRAINTS and REINDEX_SUPPRESS_INDEX_USE as OR-able
flags. I think that's more clear about what's actually going on than
heap_rebuit/tuples_changed.

Thoughts?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Noah Misch <noah(at)leadboat(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: ALTER TYPE 1: recheck index-based constraints
Date: 2011-01-20 05:57:53
Message-ID: 20110120055753.GA13329@tornado.leadboat.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Jan 19, 2011 at 11:50:12PM -0500, Robert Haas wrote:
> On Wed, Jan 12, 2011 at 8:56 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> > On Wed, Jan 12, 2011 at 8:14 AM, Noah Misch <noah(at)leadboat(dot)com> wrote:
> >> Something like the attached?
> >
> > I haven't really analyzed in this detail, but yes, approximately
> > something sorta like that.
>
> [assessment of current uses] So I think the logic is correct and not overly
> complex.

Sounds correct to me.

> I think what might make sense is - instead of adding another Boolean
> argument, change the heap_rebuilt argument to int flags, and define
> REINDEX_CHECK_CONSTRAINTS and REINDEX_SUPPRESS_INDEX_USE as OR-able
> flags. I think that's more clear about what's actually going on than
> heap_rebuit/tuples_changed.

There are two distinct questions here:

(1) Should reindex_relation receive boolean facts from its callers by way of two
boolean arguments, or by way of one flags vector?

The former seems best when you want every caller to definitely think about which
answer is right, and the latter when that's not so necessary. (For a very long
list of options, the flags might be chosen on other grounds.) As framed, I'd
lean toward keeping distinct arguments, as these are important questions.

However, suppose we inverted both flags, say REINDEX_SKIP_CONSTRAINT_CHECKS and
REINDEX_ALLOW_OLD_INDEX_USE. Then, flags = 0 can hurt performance but not
correctness. That's looking like a win.

(2) Should reindex_relation frame its boolean arguments in terms of what the
caller did (heap_rebuilt, tuples_changed), or in terms of what reindex_relation
will be doing (check_constraints, suppress_index_use)?

The former should be the default approach, but it requires that we be able to
frame good names that effectively convey an abstraction. Prospective callers
must know what to send just by looking at the name and reading the header
comment. When no prospective name is that expressive and you'll end up reading
the reindex_relation code to see how they play out, then it's better to go with
the latter instead. A bad abstraction is worse than none at all.

I agree that both heap_rebuilt and tuples_changed are bad abstractions.
TRUNCATE is lying about the former, and the latter, as you say, is never really
correct. column_values_changed, perhaps. tuples_could_now_violate_constraints
would be correct, but that's just a bad spelling for REINDEX_CHECK_CONSTRAINTS.
I guess the equivalent long-winded improvement for heap_rebuilt would be
indexes_still_valid_for_snapshotnow. Eh.

So yes, let's adopt callee-action-focused names like you propose.

Overall, I'd vote for a flags parameter with negative senses like I noted above.
My second preference would be for two boolean parameters, check_constraints and
suppress_index_use. Not really a big deal to me, though. (I feel a bit silly
writing this email.) What do you think?

Thanks,
nm


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: ALTER TYPE 1: recheck index-based constraints
Date: 2011-01-20 14:26:29
Message-ID: AANLkTikV-sWCXEnPS1NZ3mGc2F09bQmMScycRGN+iz9V@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Jan 20, 2011 at 12:57 AM, Noah Misch <noah(at)leadboat(dot)com> wrote:
> There are two distinct questions here:

Agreed.

> (1) Should reindex_relation receive boolean facts from its callers by way of two
> boolean arguments, or by way of one flags vector?
>
> The former seems best when you want every caller to definitely think about which
> answer is right, and the latter when that's not so necessary.  (For a very long
> list of options, the flags might be chosen on other grounds.)  As framed, I'd
> lean toward keeping distinct arguments, as these are important questions.

My main beef with the Boolean flags is that this kind of thing is not too clear:

reindex_relation(myrel, false, false, true, true, false, true,
false, false, true);

Unless you have an excellent memory, you can't tell what the heck
that's doing without flipping back and forth between the function
definition and the call site. With a bit-field, it's a lot easier to
glance at the call site and have a clue what's going on. We're of
course not quite to the point of that exaggerated example yet.

> However, suppose we inverted both flags, say REINDEX_SKIP_CONSTRAINT_CHECKS and
> REINDEX_ALLOW_OLD_INDEX_USE.  Then, flags = 0 can hurt performance but not
> correctness.  That's looking like a win.

I prefer the positive sense for those flags because I think it's more
clear. There aren't so many call sites or so many people using this
that we have to worry about what people are going to do in new calling
locations; getting it right in any new code shouldn't be a
consideration.

> (2) Should reindex_relation frame its boolean arguments in terms of what the
> caller did (heap_rebuilt, tuples_changed), or in terms of what reindex_relation
> will be doing (check_constraints, suppress_index_use)?

Yeah, I know we're doing the former now, but I think it's just getting
confusing for exactly the reasons you state:

> I agree that both heap_rebuilt and tuples_changed are bad abstractions.
> TRUNCATE is lying about the former, and the latter, as you say, is never really
> correct.  column_values_changed, perhaps.  tuples_could_now_violate_constraints
> would be correct, but that's just a bad spelling for REINDEX_CHECK_CONSTRAINTS.
> I guess the equivalent long-winded improvement for heap_rebuilt would be
> indexes_still_valid_for_snapshotnow.  Eh.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Noah Misch <noah(at)leadboat(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: ALTER TYPE 1: recheck index-based constraints
Date: 2011-01-20 19:22:24
Message-ID: 20110120192224.GA21107@tornado.leadboat.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Jan 20, 2011 at 09:26:29AM -0500, Robert Haas wrote:
> My main beef with the Boolean flags is that this kind of thing is not too clear:
>
> reindex_relation(myrel, false, false, true, true, false, true,
> false, false, true);
>
> Unless you have an excellent memory, you can't tell what the heck
> that's doing without flipping back and forth between the function
> definition and the call site. With a bit-field, it's a lot easier to
> glance at the call site and have a clue what's going on. We're of
> course not quite to the point of that exaggerated example yet.

Agreed.

> > However, suppose we inverted both flags, say REINDEX_SKIP_CONSTRAINT_CHECKS and
> > REINDEX_ALLOW_OLD_INDEX_USE. ?Then, flags = 0 can hurt performance but not
> > correctness. ?That's looking like a win.
>
> I prefer the positive sense for those flags because I think it's more
> clear. There aren't so many call sites or so many people using this
> that we have to worry about what people are going to do in new calling
> locations; getting it right in any new code shouldn't be a
> consideration.

Okay. I've attached a new patch version based on that strategy.

Attachment Content-Type Size
at1v3-check-unique.patch text/plain 14.1 KB

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: ALTER TYPE 1: recheck index-based constraints
Date: 2011-01-21 03:55:30
Message-ID: AANLkTi=NSnGovwNtc1rgfmDcFBe4J6TQdaSg8dJdr49L@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Jan 20, 2011 at 2:22 PM, Noah Misch <noah(at)leadboat(dot)com> wrote:
> Okay.  I've attached a new patch version based on that strategy.

Thanks. Committed and back-patched to 9.0 (but I didn't use your
regression test).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company