Re: INSERT ... ON CONFLICT {UPDATE | IGNORE}

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Peter Geoghegan <pg(at)heroku(dot)com>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Anssi Kääriäinen <anssi(dot)kaariainen(at)thl(dot)fi>, "Simon Riggs" <simon(at)2ndquadrant(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, "Kevin Grittner" <kgrittn(at)ymail(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, "Pg Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: INSERT ... ON CONFLICT {UPDATE | IGNORE}
Date: 2014-12-18 14:59:39
Message-ID: 5492EBDB.3020008@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 12/18/2014 01:02 AM, Peter Geoghegan wrote:
> On Wed, Dec 17, 2014 at 1:12 PM, Heikki Linnakangas
> <hlinnakangas(at)vmware(dot)com> wrote:
>> Now, let's imagine a table like this:
>>
>> CREATE TABLE persons (
>> username text unique,
>> real_name text unique,
>> data text
>> );
>>
>> Is there any way to specify both of those constraints, so that the insertion
>> is IGNOREd if it violates either one of them? If you try to do:
>>
>> INSERT INTO persons(username, real_name, data)
>> VALUES('foobar', 'foo bar')
>> ON CONFLICT (username, real_name) IGNORE;
>>
>> It will fail because there is no unique index on (username, real_name). In
>> this particular case, you could leave out the specification, but if there
>> was a third constraint that you're not expecting to conflict with, you would
>> want violations of that constraint to still throw an error. And you can't
>> leave out the specification with ON CONFLICT UPDATE anyway.
>
> Good point.
>
> For the IGNORE case: I guess the syntax just isn't that flexible. I
> agree that that isn't ideal.

It should be simple to allow multiple key specifications:

INSERT INTO persons (username, real_name, data)
VALUES('foobar', 'foo bar')
ON CONFLICT (username), (real_name) IGNORE;

It's a rather niche use case, but might as well support it for the sake
of completeness.

> For the UPDATE case: Suppose your example was an UPDATE where we
> simply assigned the excluded.data value to the data column in the
> auxiliary UPDATE's targetlist. What would the user really be asking
> for with that command, at a really high level? It seems like they
> might actually want to run two UPSERT commands (one for username, the
> other for real_name), or rethink their indexing strategy - in
> particular, whether it's appropriate that there isn't a composite
> unique constraint on (username, real_name).
>
> Now, suppose that by accident or by convention it will always be
> possible for a composite unique index to be built on (username,
> real_name) - no dup violations would be raised if it was attempted,
> but it just hasn't been and won't be. In other words, it's generally
> safe to actually pretend that there is one. Then, surely it doesn't
> matter if the user picks one or the other unique index. It'll all work
> out when the user assigns to both in the UPDATE targetlist, because of
> the assumed convention that I think is implied by the example. If the
> convention is violated, at least you get a dup violation letting you
> know (iff you bothered to assign). But I wouldn't like to encourage
> that pattern.
>
> I think that the long and the short of it is that you really ought to
> have one unique index as an arbiter in mind when writing a DML
> statement for the UPDATE variant. Relying on this type of convention
> is possible, I suppose, but ill-advised.

Another thought is that you might want to specify a different action
depending on which constraint is violated:

INSERT INTO persons (username, real_name, data)
VALUES('foobar', 'foo bar')
ON CONFLICT (username) IGNORE
ON CONFLICT (real_name) UPDATE ...;

Although that leaves the question of what to do if both are violated.
Perhaps:

INSERT INTO persons (username, real_name, data)
VALUES('foobar', 'foo bar')
ON CONFLICT (username, real_name) IGNORE
ON CONFLICT (real_name) UPDATE username = excluded.username;
ON CONFLICT (username) UPDATE real_name = excluded.real_name;

>> 5. What if there are multiple unique indexes with the same columns, but
>> different operator classes?
>
> I thought about that. I am reusing a little bit of the CREATE INDEX
> infrastructure for raw parsing, and for a small amount of parse
> analysis (conveniently, this makes the command reject things like
> aggregate functions with no additional code - the error messages only
> mention "index expressions", so I believe that's fine). This could
> include an opclass specification, but right now non-default opclasses
> are rejected during extra steps in parse analysis, for no particular
> reason.
>
> I could easily have the unique index inference specification accept a
> named opclass, if you thought that was important, and you thought
> naming a non-default opclass by name was a good SQL interface. It
> would take only a little effort to support non-default opclasses.

It's a little weird to mention an opclass by name. It's similar to
naming an index by name, really. How about naming the operator? For an
exclusion constraint, that would be natural, as the syntax to create an
exclusion constraint in the first place is "EXCLUDE USING gist (c WITH &&)"

Naming the index by columns makes sense in most cases, and I don't like
specifying the index's name, but how about allowing naming a constraint?
Indexes are just an implementation detail, but constraints are not.
Unique and exclusion constraints are always backed by an index, so there
is little difference in practice, but I would feel much more comfortable
mentioning constraints by name than indexes.

Most people would list the columns, but if there is a really bizarre
constraint, with non-default opclasses, or an exclusion constraint, it's
probably been given a name that you could use.

In theory, with the promise tuple approach to locking, you don't
necessarily even need an index to back up the constraint. You could just
do a sequential scan of the whole table to see if there are any
conflicting rows, then insert the row, and perform another scan to see
if any conflicting rows appeared in the meantime. Performance would
suck, and there is no guarantee that another backend doesn't do a
regular INSERT into to the table that violates the imaginary constraint,
so this is pretty useless in practice. So probably better to not allow it.

- Heikki

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Fabrízio de Royes Mello 2014-12-18 15:14:35 Re: Commitfest problems
Previous Message Alvaro Herrera 2014-12-18 14:15:40 Re: Table-level log_autovacuum_min_duration