Re: INSERT ... ON CONFLICT {UPDATE | IGNORE}

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Anssi Kääriäinen <anssi(dot)kaariainen(at)thl(dot)fi>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Kevin Grittner <kgrittn(at)ymail(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: INSERT ... ON CONFLICT {UPDATE | IGNORE}
Date: 2014-12-17 23:02:19
Message-ID: CAM3SWZQ7PdxFhg-B6EN80N3X0=raeyOFXxeweU2pF09hOQUb1Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Dec 17, 2014 at 1:12 PM, Heikki Linnakangas
<hlinnakangas(at)vmware(dot)com> wrote:
> It looks like we are close to reaching consensus on the syntax. Phew! Thanks
> for maintaining the wiki pages and the documentation. All of the below is
> based on those, I haven't looked at the patch itself yet.

Great, thanks!

Yes, I am relieved that we appeared to have agreed on a syntax.

> The one thing that I still feel uneasy about is the Unique Index Inference
> thing.

> The documentation says that:
>
>> Omitting the specification indicates a total indifference to where
>> any would-be uniqueness violation could occur, which isn't always
>> appropriate;

> Some questions:
>
> 1. Does that mean that if you leave out the key columns, the insertion is
> IGNOREd if it violates *any* unique key constraint?

Yes. This is particularly important for the implementation of things
like IGNORE's updatable view support. More generally, for various ETL
use cases it's possible to imagine the user simply not caring.

> 2. If you do specify the key columns, then the IGNORE path is taken only if
> the insertion violates a unique key constraint on those particular columns.
> Otherwise an error is thrown. Right?

That's right.

> Now, let's imagine a table like this:
>
> CREATE TABLE persons (
> username text unique,
> real_name text unique,
> data text
> );
>
> Is there any way to specify both of those constraints, so that the insertion
> is IGNOREd if it violates either one of them? If you try to do:
>
> INSERT INTO persons(username, real_name, data)
> VALUES('foobar', 'foo bar')
> ON CONFLICT (username, real_name) IGNORE;
>
> It will fail because there is no unique index on (username, real_name). In
> this particular case, you could leave out the specification, but if there
> was a third constraint that you're not expecting to conflict with, you would
> want violations of that constraint to still throw an error. And you can't
> leave out the specification with ON CONFLICT UPDATE anyway.

Good point.

For the IGNORE case: I guess the syntax just isn't that flexible. I
agree that that isn't ideal.

For the UPDATE case: Suppose your example was an UPDATE where we
simply assigned the excluded.data value to the data column in the
auxiliary UPDATE's targetlist. What would the user really be asking
for with that command, at a really high level? It seems like they
might actually want to run two UPSERT commands (one for username, the
other for real_name), or rethink their indexing strategy - in
particular, whether it's appropriate that there isn't a composite
unique constraint on (username, real_name).

Now, suppose that by accident or by convention it will always be
possible for a composite unique index to be built on (username,
real_name) - no dup violations would be raised if it was attempted,
but it just hasn't been and won't be. In other words, it's generally
safe to actually pretend that there is one. Then, surely it doesn't
matter if the user picks one or the other unique index. It'll all work
out when the user assigns to both in the UPDATE targetlist, because of
the assumed convention that I think is implied by the example. If the
convention is violated, at least you get a dup violation letting you
know (iff you bothered to assign). But I wouldn't like to encourage
that pattern.

I think that the long and the short of it is that you really ought to
have one unique index as an arbiter in mind when writing a DML
statement for the UPDATE variant. Relying on this type of convention
is possible, I suppose, but ill-advised.

> 3. Why is the specification required with ON CONFLICT UPDATE, but not with
> ON CONFLICT IGNORE?

That was a fairly recent decision, taken mainly to keep Kevin happy --
although TBH I don't recall that he was particularly insistent on that
restriction. I could still go either way on that question.

The idea, as I mentioned, is that it's legitimate to not care where a
dup violation might occur for certain ETL use cases. For UPSERT,
though, the only argument for not making it mandatory is that that's
something extra to type, and lazy people would prefer not to bother.
This is because we assume that the first dup violation is the only
possible one without the unique index inference clause. If we don't
have the index expressions with which to infer an arbiter unique index
(as with MySQL's ON DUPLICATE KEY UPDATE), you'd better be sure that
you accounted for all possible sources of would-be duplicate
violations - otherwise a random row will be updated!

That isn't a fantastic argument for not making a unique index
inference clause mandatory, but it might be an okay one.

> 4. What happens if there are multiple unique indexes with identical columns,
> and you give those columns in the inference specification? Doesn't matter
> which index you use, I guess, if they're all identical, but see next
> question.

It doesn't really matter which one you pick, but right now, at the
urging of Robert, we cost the list of candidates and pick the cheapest
iff there is more than one [1] (and error if there are none). This is
roughly similar to costing of indexes for CLUSTER, and occurs during
optimization (only parse analysis of the expressions associated with
the unique index inference clause occurs during parse analysis -
indexes are looked up and matched in the optimizer).

> 5. What if there are multiple unique indexes with the same columns, but
> different operator classes?

I thought about that. I am reusing a little bit of the CREATE INDEX
infrastructure for raw parsing, and for a small amount of parse
analysis (conveniently, this makes the command reject things like
aggregate functions with no additional code - the error messages only
mention "index expressions", so I believe that's fine). This could
include an opclass specification, but right now non-default opclasses
are rejected during extra steps in parse analysis, for no particular
reason.

I could easily have the unique index inference specification accept a
named opclass, if you thought that was important, and you thought
naming a non-default opclass by name was a good SQL interface. It
would take only a little effort to support non-default opclasses.

> 6. Why are partial unique indexes not supported as arbitrators?

Robert and I discussed this quite a bit -- it was the argument for
being able to name a unique index by name (not that I'm very happy
with that idea or anything) [2]. Basically, dealing with the possible
behaviors with before row insert triggers might in general greatly
complicate the implementation, even though the issues we'd then be
protected against would seldom arise. Robert seemed to think that we
could revisit this in a future version [3].

Note that IGNORE will still IGNORE any partial unique index -- it just
won't accept one as the sole arbiter of whether or not the IGNORE path
should be taken (so it's really the inference specification syntax
that doesn't accept partial unique indexes, just as it doesn't accept
updatable views, exclusion constraints, and inheritance parents where
the semantics are similarly iffy -- that's both the reason for and the
mechanism by which ON CONFLICT UPDATE does not support these things).

[1] http://www.postgresql.org/message-id/CAM3SWZQz+jYkwfuZvcSf0qtpa2QiY+8NGNcHjfWgz3DDzRfzEg@mail.gmail.com
[2] http://www.postgresql.org/message-id/CAM3SWZQ8tDPdjiwj_FW4AO8gEvpyiixwBE67OVQuufPJ+y1e1g@mail.gmail.com
[3] http://www.postgresql.org/message-id/CA+TgmoZgLgY2PBAMTY3T1jpYXAvNL-w=T6o+6pMqrVR+Vn-iyg@mail.gmail.com
--
Peter Geoghegan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2014-12-18 01:37:37 Re: moving from contrib to bin
Previous Message Josh Berkus 2014-12-17 22:19:59 Re: INSERT ... ON CONFLICT {UPDATE | IGNORE}