INSERT...ON DUPLICATE KEY IGNORE

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: INSERT...ON DUPLICATE KEY IGNORE
Date: 2013-08-30 22:09:59
Message-ID: CAM3SWZRg4VcLCKprVEyuLHoXwUefuiz1-P69YLye8qktZb_=UQ@mail.gmail.com
Lists: pgsql-hackers

For many years now, MySQL has had a feature called INSERT IGNORE [1];
SQLite has INSERT ON CONFLICT IGNORE [2]; SQL Server has an option
called IGNORE_DUP_KEY and Oracle has a hint called
IGNORE_ROW_ON_DUPKEY_INDEX (they acknowledge that it's a bit odd that
a hint changes the semantics of a DML statement - of
course, MERGE has always been able to do something approximately equivalent).

Each of these features makes its respective system simply not insert
certain tuples that would result in a duplicate key violation, while
potentially proceeding with the insertion of other, non-violating
tuples.

The attached WIP patch implements this for Postgres, with a few
notable differences:

1) The patch is only interested in unique index violations
(specifically, violations of amcanunique access method unique
indexes); it will not do anything about NOT NULL constraint violations, as
the MySQL feature does, for example. It also doesn't care about
exclusion constraints, even though it's currently possible to defer
exclusion constraint enforcement in a way that is analogous to how
it's done for btree indexes.

2) The clause that I've proposed is, as you'll have already gathered,
spelt slightly differently to any of these systems. I'm not
particularly attached to how I've spelt it. I've spelt it this way so
as to avoid suggesting that it's fully compatible with the MySQL
feature. I don't think we want weird NULLness behavior, but I may be
mistaken. If we want that additional behavior, we probably want it as
an optional adjunct, or perhaps an entirely independent clause to what
I've specified here.

3) RETURNING is expanded - "RETURNING REJECTS *" is now possible where
that makes sense.

It is possible for clients to interrogate the wire protocol to see the
number of rows affected (this can be done with a PQcmdTuples() call by
libpq clients, for example), and see how that compares with the number
of tuples they tried to insert. In addition, a client can use
RETURNING to see what rows were successfully inserted. Here is a
simple example session that shows this usage:

postgres=# create table tab(id int4 primary key, val text);
CREATE TABLE
postgres=# insert into tab(id, val) values(1, 'val'), (3, 'val'), (4, 'val');
INSERT 0 3
postgres=# insert into tab(id, val) values(1, 'nosert'), (2,
'nosert'), (3, 'nosert') on duplicate key ignore returning id;
 id
----
  2
(1 row)

INSERT 0 1
postgres=# select xmin, * from tab;
  xmin  | id |  val
--------+----+--------
 580843 |  1 | val
 580843 |  3 | val
 580843 |  4 | val
 580844 |  2 | nosert
(4 rows)

If this all happened in two separate, concurrent sessions, the
transaction with xid 580844 might well have blocked on the outcome of
580843 in the usual way (ultimately inserting or doing nothing as
appropriate for each tuple), just as you'd expect. Note, however, that
the xid of the "noserter" is only seen here for the tuple that it
actually successfully inserted - in effect, the noserter session does
not lock tuples that it did not originally insert. That's consistent
with MySQL's INSERT IGNORE, for example. However, some people might
prefer it if this did share lock rows that prevented tuples from being
inserted, or if that was at least an option. That would be something
quite a bit closer to a fully-fledged upsert - I imagine we'd have to
lock the old tuple using EvalPlanQual, and if a submission did that
correctly then I think we'd be well on our way to solving all the
problems.
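
To make the blocking behaviour concrete, here is a hypothetical timeline
for two concurrent sessions (the values are made up; the point is only
that the second session waits on the first session's outcome rather than
erroring):

-- Session 1
begin;
insert into tab(id, val) values(5, 'first');

-- Session 2, concurrently
insert into tab(id, val) values(5, 'second') on duplicate key ignore;
-- ...blocks, waiting on session 1's transaction...

-- Session 1
commit;

-- Session 2 now returns "INSERT 0 0": its proposed tuple is ignored. Had
-- session 1 aborted instead, session 2 would have inserted (5, 'second').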

RETURNING REJECTS is a way of getting the inverse of the affected tuples
of an ordinary INSERT...RETURNING statement - rather than returning only
the rows successfully inserted, we return only the rows that were not
successfully inserted (i.e. tuples where either a BEFORE trigger returned
NULL or we didn't proceed with insertion). This is only
meaningful in the context of INSERT...ON DUPLICATE KEY IGNORE. Sure,
clients could figure this out for themselves using vanilla "RETURNING
*", but I think that this addition is justified by the fact that it's
generally expected that the number of rejects will be relatively
small. People are going to want to use this feature in an ad-hoc
manner within a psql session, and supporting this will particularly
help there. It might be nice to find a way to indicate the exact
constraint violated per row returned, but I preferred to wait and see
if people thought it was a good idea generally.
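
To show what this looks like in practice, continuing the session from
above (the output here is what I'd expect given the semantics described,
not a verbatim transcript):

postgres=# insert into tab(id, val) values(4, 'nosert'), (5, 'nosert')
postgres-#   on duplicate key ignore returning rejects *;
 id |  val
----+--------
  4 | nosert
(1 row)

INSERT 0 1

(4, 'nosert') is rejected because id = 4 already exists, so it is
returned; (5, 'nosert') is successfully inserted, so it isn't.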

Use-cases
=========

The use-cases for this feature are fairly obvious. It can be used as a
way of merging tuples easily, if clients can live with the present
limitations.
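
For example, a typical bulk-load pattern might look something like this
(staging_tab is a made-up staging table, and I'm assuming the clause
composes with INSERT ... SELECT in the obvious way):

-- Merge new rows from a staging table, silently skipping any whose
-- keys are already present in the target.
insert into tab(id, val)
select id, val from staging_tab
on duplicate key ignore;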

In addition to that, the feature may be of value to clients of the
logical change-set generation infrastructure (such as the proposed BDR
contrib module[3]), if and when that infrastructure is finally
committed. Currently, the BDR prototype simply balks at INSERT
conflicts. This is because the number of subtransactions/xids
generated to do anything else is considered prohibitive. If Postgres
is ever going to solve the kinds of problems that BDR proposes to
solve well (except for the relatively narrow use-case where balking at
INSERT conflicts is acceptable), it needs something like what I've
proposed here. The only alternative is probably to replay a
transaction without using subtransactions, and if that fails, remember
that fact and wrap every single change in a subxact. This is really
going to be painful for large transactions.

Andres privately said some weeks back (prior to my sharing of this
information, and even prior to writing most of the code posted here)
that he'd like to see an INSERT...ON DUPLICATE KEY LOCK. I suppose
that he must have meant that he'd like to see shared locks obtained on
rows that blocked a noserter from actually inserting some of their
proposed tuples, for the purposes of conflict resolution (which, as
I've said, this doesn't do). Perhaps he'd like to comment on these
issues here.

Mechanism
=========

This patch is a spin-off from a broader effort to implement
INSERT...ON DUPLICATE KEY UPDATE (upsert). During the 2012 developer
meeting, it was agreed that we needed a simple, atomic upsert, and
that we should probably use the MySQL syntax as long as we don't have
merge in its full generality [4]. In doing things this way, I'm
attempting to manage the risk of that broader patch not being accepted
by checkpointing progress towards it in a major release. In
particular, I'd like to see us reach consensus on the questions
surrounding our basic approach to implementing upsert. I'm keen to see
this happen before attempting to solve all the other problems, in a
way which is perhaps predicated on my own personal opinion about how
to solve the basic problem. Having said all that, even if I never work
on upsert, this patch certainly has independent value.

Having reviewed the archives for previous discussions on upsert or
MERGE, it seems that there is some diversity of opinion among more
senior developers about just how to solve the basic concurrency issue.
For example, a few years ago Heikki opined that continuing to insert
the heap tuple first, and then potentially cleaning that up in the
event of a duplicate rather than raising an error was a workable
approach [5]. Simon addressed Heikki's point here specifically, and
suggested that we should try and find the tuple first [6], which is
approximately the approach I have taken here. In saying this, Simon
appears to have changed his mind, because on a previous occasion he
remarked that doing an UPDATE potentially followed by an INSERT might
be preferable [7]. Tom at least once briefly sketched how he thought
this ought to work [8], and it appeared to me that he was roughly in
agreement with Simon in [6]. (I hope I haven't misrepresented
somebody's view here - if so, please correct me).

From a high level, the patch works by adding something that is
tentatively internally referred to as "speculative insertion" in
lower-level code comments. The basic approach is:

* Within ExecInsert, when inserting into a regular heap relation from
a tuple table slot (after BEFORE triggers fire but before inserting
the heap tuple), there is an extra function call to a new function,
ExecLockIndexTuples().

* For each applicable amcanunique unique index, ExecLockIndexTuples()
calls a new am method, amlock. For each unique index, we end up
acquiring a write buffer lock in a similar fashion to _bt_doinsert.
When we return from the new amlock function, we report whether or not
there was a duplicate without having inserted any index tuples or
having otherwise done something that requires a heap tuple to be
present. I like to call this point "the point of no return", where
individual unique indexes commit themselves to insertion (which is to
say, the point after which a unique constraint violation becomes
impossible but before actual index tuple insertion in _bt_doinsert
[9]). Another duplicate key cannot be inserted by anyone else until
that write lock is released, which it won't be until insertion or
IGNORE, for the same reason that this is currently true once control
reaches the point of no return. A page split might need to happen
before insertion proper of that index tuple, for example, and that's
okay.

* If we have total consensus among all unique indexes, we proceed with
a heap insertion of the tuple formed from the tuple table slot that
ExecInsert() was called in respect of (a simple example session
illustrating this appears after this list). Yes, this means that
potentially multiple exclusive buffer locks are held for the duration
of the heap tuple insertion. If not, we release the locks early and
return NULL from ExecInsert() before heap insertion ever happens (and
certainly before index tuple insertion ever happens).

* If and when we get to ExecInsertIndexTuples(), and ultimately to
_bt_doinsert, the new _bt_doinsert knows that, for any unique indexes
it sees, the process has already been partially completed. It
takes state made available from the first phase of speculative
insertion, and finishes the job, pretty much picking up from the point
of no return for amcanunique unique indexes (or starting from scratch
in the usual way for non-unique amcanunique-method indexes that happen
to be involved in speculative insertion).
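
Here is a simple example session illustrating the "total consensus"
requirement (the table and values are made up, but the behaviour follows
from the description above):

postgres=# create table dup(a int4 primary key, b int4 unique);
CREATE TABLE
postgres=# insert into dup(a, b) values(1, 10);
INSERT 0 1
postgres=# insert into dup(a, b) values(2, 10), (1, 20), (3, 30)
postgres-#   on duplicate key ignore;
INSERT 0 1

Only (3, 30) is inserted: (2, 10) is vetoed by the unique index on b, and
(1, 20) by the primary key, so neither ever results in a heap tuple or
any index tuples being inserted.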

Deadlock hazards
==============

In order to avoid deadlocks, the locking stage acquires locks in
reverse order to the usual index insertion order. So, regardless of
how those locks are subsequently freed (either by actual insertion or
by finishing early in the IGNORE case), we avoid buffer lock related
deadlocks. Still, deadlock hazards are something that reviewers will
want to pay particular attention to.

Unlike some other systems like DB2, we have always allowed BEFORE ROW
triggers to execute arbitrary SQL. I've frequently thought this was a
bit of a wart (e.g. [10]), and certainly not supportive of sensible,
idiomatic trigger use, but there isn't much we can do about it at this
stage. Right now, BEFORE ROW triggers will still fire when the new
code decides to not do the insert. It certainly wouldn't be acceptable
to allow before triggers to run *after* the first phase of speculative
insertion, because they might try to access an index with an
exclusively locked buffer, resulting in backend self-deadlock. Besides,
we cannot really know *what* to lock until after the before triggers
fire. That leaves us with two options, as far as I can tell:

1. Just have users live with those semantics. This is what the patch
does presently, and anyone who is using triggers in what I consider to
be a sensible way doesn't have to care. For everyone else, it's a
gotcha that they have to deal with, to be noted prominently.

2. Do something overly-cute to prevent the trigger from doing anything
we can't back out of. I'm not sure how feasible this is, because I
haven't looked at it at all (since option 1 looks perfectly reasonable
to me), but everything is on the table I suppose. As noted in code
comments, this would be pretty fragile. It would also probably be at
least as ugly as requiring users to live with option 1.
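
To make the option 1 semantics concrete, here is a hypothetical session
(the function and trigger names are made up, and the output is what I'd
expect, not a verbatim transcript). The BEFORE ROW trigger fires even for
a row whose insertion is ultimately skipped:

postgres=# create function note_insert() returns trigger as $$
postgres$# begin
postgres$#   raise notice 'before trigger fired for id = %', new.id;
postgres$#   return new;
postgres$# end;
postgres$# $$ language plpgsql;
CREATE FUNCTION
postgres=# create trigger tab_notice before insert on tab
postgres-#   for each row execute procedure note_insert();
CREATE TRIGGER
postgres=# insert into tab(id, val) values(1, 'dup') on duplicate key ignore;
NOTICE:  before trigger fired for id = 1
INSERT 0 0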

BEFORE ROW triggers aside, there are clearly other hazards implied by
taking exclusive locks on unique index buffers - a naive
implementation could be dealing with a system index. While these locks
aren't held for much longer than in the existing code, a requirement
to carefully avoid catalog access for the duration of the heap
insertion seems do-able but still quite fragile, and perhaps a source
of bugs in the future. So as a precaution against some catalog access
using an index that is already locked by speculative insertion in a
way that leads to deadlock, we simply forbid speculative insertion
into system catalogs outright. This doesn't seem like too onerous a
restriction to me.

Transaction isolation
=================

The interaction with SSI is something that I considered, but may not
have got right. With the transaction isolation level set to
serializable, each exclusive-locked buffer gets a
CheckForSerializableConflictIn() call in a way that I believe is
analogous to the extant _bt_doinsert code. Crucially, we always make
those calls when the isolation level is serializable, which means
extra work (we've perhaps already concluded that the speculative
insertion won't go ahead) over read committed and repeatable read. So
because the number of new calls to CheckForSerializableConflictIn()
isn't influenced by interaction with other sessions, if I'm not
mistaken that preserves SSI's guarantees. However, this may still be
unnecessary - it may also be fine to just rely on the buffer locks
that are taken. I need better analysis here.

Tests
=====

Isolation tests are included that test the feature in an obvious way.
My original test case involved a pgbench session with a custom script
where the primary key value is a random value between 1 and :scale
inclusive, and the expectation is that we'll end up with exactly
:scale number of tuples. After many sessions did INSERT...ON DUPLICATE
KEY IGNORE, this is consistently the case, without spurious duplicates
or any other incorrect behavior (that I'm aware of). Presently, this
works fine with tens of thousands of tuples. I've also done quite a
lot of other smoke testing of the feature, but this isn't included
here.
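
Not included here, but for concreteness, a script to that effect might
look something like this (assuming the tab table from the earlier
example, with id as the primary key):

\setrandom keyno 1 :scale
insert into tab(id, val) values(:keyno, 'val') on duplicate key ignore;

Once every key value has come up at least once, the table stabilizes at
exactly :scale rows, no matter how many clients run the script
concurrently.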

I have done no performance testing to date. Reviewers will want to pay
attention to the performance implications, particularly in the regular
insert case; it's conceivable that I've regressed things, though I
don't specifically suspect that I have.

Documentation
============

I haven't added any documentation, because I didn't think that there
was much point before hearing comments from others. Obviously at least
the INSERT documentation, and "Chapter 52. Index Access Method
Interface Definition" and the pg_am docs will need to be updated if
we're to proceed.

Thoughts?

*** References ***

[1] https://dev.mysql.com/doc/refman/5.5/en/insert.html

[2] https://www.sqlite.org/lang_conflict.html

[3] http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/bdr

[4] https://wiki.postgresql.org/wiki/PgCon_2012_Developer_Meeting#MERGE

[5] http://www.postgresql.org/message-id/45E845C4.6030000@enterprisedb.com

[6] http://www.postgresql.org/message-id/1172858409.3760.1618.camel@silverbirch.site

[7] http://www.postgresql.org/message-id/1132000496.3388.31.camel@holly

[8] http://www.postgresql.org/message-id/20083.1131730779@sss.pgh.pa.us

[9] https://github.com/postgres/postgres/blob/REL9_3_STABLE/src/backend/access/nbtree/nbtinsert.c#L174

[10] http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=6868ed7491b7ea7f0af6133bb66566a2f5fe5a75

--
Peter Geoghegan

Attachment: insert_on_dup.v1.2013_08_30.patch.gz (application/x-gzip, 25.7 KB)
