From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Cc: Craig Ringer <craig(at)2ndquadrant(dot)com>
Subject: INSERT ... ON CONFLICT {UPDATE | IGNORE}
Date: 2014-08-28 02:43:44
Message-ID: CAM3SWZTEODEJLz82LK4eF2HYX+qEKrbc8-Vtq3_-aOf6kRSfiA@mail.gmail.com
Lists: pgsql-hackers

Attached WIP patch extends the INSERT statement, adding a new ON
CONFLICT {UPDATE | IGNORE} clause. This allows INSERT statements to
perform UPSERT operations (if you want a more formal definition of
UPSERT, I refer you to my pgCon talk's slides [1], or the thread in
which I delineated the differences between SQL MERGE and UPSERT [2]).
The patch builds on previous work in this area, and incorporates
feedback from Kevin and Andres.

Overview
=======

Example usage:

INSERT INTO upsert(key, val) VALUES(1, 'insert') ON CONFLICT UPDATE
SET val = 'update';

Essentially, the implementation has all stages of query processing
track some auxiliary UPDATE state. So, for example, during parse
analysis, UPDATE transformation occurs in an ad-hoc fashion tightly
driven by the parent INSERT, but using the existing infrastructure
(i.e. transformStmt()/transformUpdateStmt() is called, and is
insulated from having to care about the feature as a special case).
There are some restrictions on what this auxiliary update may do, but
FWIW there are considerably fewer than those imposed by the equivalent
MySQL or SQLite features. All of the following SQL queries are valid
with the patch applied:

-- Nesting within wCTE:
WITH t AS (
  INSERT INTO z SELECT i, 'insert'
  FROM generate_series(0, 16) i
  ON CONFLICT UPDATE SET v = v || 'update' -- operators/functions in targetlist
  RETURNING * -- only projects inserted tuples, never updated tuples
)
SELECT * FROM t JOIN y ON t.k = y.a ORDER BY a, k;

-- IGNORE variant:
INSERT INTO upsert(key, val) VALUES(1, 'insert') ON CONFLICT IGNORE;

-- Predicate within the UPDATE auxiliary statement (the row is still
-- locked when the UPDATE predicate isn't satisfied):
INSERT INTO upsert(key, val) VALUES(1, 'insert') ON CONFLICT UPDATE
WHERE val != 'delete';

As with SQL MERGE (at least as implemented in other systems),
subqueries may not appear within the UPDATE's targetlist, nor may they
appear within the special WHERE clause. But the "INSERT part" of the
query has no additional limitations, so you may for example put
subqueries within a VALUES() clause, or INSERT...SELECT...ON CONFLICT
UPDATE... just as you'd expect. INSERT has been augmented with a new
clause, but that clause plays nicely with essentially every other
aspect of insertion. (The exceptions, at least for now, are table
inheritance, updatable views and foreign tables, which are
unsupported. This can be revisited.)
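
For instance, statements along these lines are meant to be accepted (a
sketch only, reusing the example tables that appear elsewhere in this
mail rather than anything from the regression tests):

-- INSERT ... SELECT with the new clause:
INSERT INTO upsert(key, val)
SELECT i, 'insert' FROM generate_series(1, 10) i
ON CONFLICT UPDATE SET val = 'update';

-- Subquery within the VALUES() list (the "INSERT part" is unrestricted):
INSERT INTO upsert(key, val)
VALUES((SELECT max(a) FROM y), 'insert')
ON CONFLICT UPDATE SET val = 'update';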

I think that without initially realizing it, I copied the SQLite
syntax [3]. However, unlike with that SQLite feature, CONFLICT only
refers to a would-be duplicate violation, and not a violation of any
other kind of constraint.

How this new approach works (Executor and Optimizer stuff)
============================================

During the execution of the parent ModifyTable, a special auxiliary
subquery (the UPDATE ModifyTable) is considered as a special case.
This is not a subplan of the ModifyTable node in the conventional
sense, and so does not appear within EXPLAIN output. However, it is
more or less independently planned, and entirely driven by the INSERT
ModifyTable. ExecModifyTable() is never called with this special
auxiliary plan state passed directly. Rather, its parent manages the
process as the need arises. ExecLockUpdateTuple() locks and
potentially updates tuples, using the EvalPlanQual() mechanism (even
at higher isolation levels, with appropriate precautions).
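
A quick way to observe that from psql (just a sketch; the INSERT's
plan shape will of course vary):

-- Only the INSERT's ModifyTable plan is displayed; the auxiliary UPDATE
-- plan is driven internally by it and never appears in EXPLAIN output:
EXPLAIN (COSTS OFF)
INSERT INTO upsert(key, val) VALUES(1, 'insert')
ON CONFLICT UPDATE SET val = 'update';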

The per-tuple expression context of the auxiliary query/plan is used
with EvalPlanQual() from within ExecLockUpdateTuple() (the new routine
tasked with locking and updating on conflict). There is a new
ExecUpdate() call site within ExecLockUpdateTuple(). Given the
restrictions necessarily imposed on this pseudo-rescanning
(principally the outright rejection of anything that necessitates
PARAM_EXEC parameters during planning), this is safe, as far as I'm
aware. It is convenient to be able to re-use infrastructure in such a
way as to more or less handle the UPDATE independently, driven by the
INSERT, except for execution which is more directly handled by the
INSERT (i.e. there is no ExecModifyTable() call in respect of this new
auxiliary ModifyTable plan). Granted, it is kind of bizarre that the
auxiliary query may have a more complex plan than is necessary for our
purposes, but it doesn't actually appear to be a problem when
"rescanning" (like a SELECT FOR UPDATE/FOR SHARE node, we call
EvalPlanQualSetTuple() directly). It is likely worthwhile to teach the
optimizer that we really don't care about how the one and only base
rel within the UPDATE auxiliary subquery (the target table) is
scanned, if only to save a few cycles. I have (temporarily) hacked the
optimizer to prevent index-only scans, which are problematic here, by
adding disable_cost when a query parse tree that uses the feature is
seen. Although what I've done is a temporary kludge, the basic idea of
forcing a particular type of relation scan has a precedent: UPDATE
WHERE CURRENT OF artificially forces a TID scan, because only a TID
scan will work correctly there. I couldn't come up with a convenient
way to artificially inject disable_cost into alternative scan types,
in the less invasive style of isCurrentOf, because there is no
convenient qual to target within cost_qual_eval().

As in previous incarnations, we lock each tuple (although, of course,
only with the UPDATE variant). We may or may not also actually proceed
with the update, depending on whether or not the user-specified
special update predicate (if any) is satisfied. But if we do,
EvalPlanQual() is (once the tuple is locked) only ever evaluated on a
conclusively committed and locked-by-us conflict tuple as part of the
process of updating, even though it's possible for the UPDATE
predicate to be satisfied where conceivably it would not be satisfied
by the tuple version actually visible to the command's MVCC snapshot.
I think this is the correct behavior. We all seem to be in agreement
that we should update at READ COMMITTED if *no* version of the tuple
is visible. It seems utterly arbitrary to me to suggest that on the
one hand it's okay to introduce one particular "MVCC violation", but
not another equivalent one. The first scenario is one in which we
update despite our update's (or rather insert's) "predicate" not being
satisfied (according to our MVCC snapshot). The second scenario is one
in which the same "predicate" is also not satisfied according to our
MVCC snapshot, but in a slightly different way. Why bother introducing
a complicated distinction, if it's a distinction without a difference?
I'd rather have a behavior that is consistent, easy to reason about,
and easy to explain. And so, the predicate is considered once, after
conclusively locking a conflict tuple.

It feels natural and appropriate to me that if the special UPDATE qual
isn't satisfied, we still lock the tuple. After all, in order to make
a conclusive determination about the qual not being satisfied, we need
to lock the tuple. This happens to insulate ExecUpdate() from having
to care about "invisible tuples", which are now possible (although we
still throw an error, just with a useful error message that phrases
the problem in reference to this new feature).
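
To make the timing concrete, consider a sketch of one possible
interleaving at READ COMMITTED (two sessions against the upsert table
from the earlier examples; this is not taken from the patch's
isolation tests):

-- Setup: the row (1, 'insert') is already committed and visible.

-- Session A:
BEGIN;
UPDATE upsert SET val = 'delete' WHERE key = 1; -- not yet committed

-- Session B (finds a conflict and blocks until A's transaction ends):
INSERT INTO upsert(key, val) VALUES(1, 'insert') ON CONFLICT UPDATE
WHERE val != 'delete';

-- Session A:
COMMIT;

-- Session B's special WHERE clause is now evaluated against the
-- conclusively committed tuple it has locked (val = 'delete'), so no
-- update occurs, even though the version visible to B's MVCC snapshot
-- satisfied the predicate. Either way, B retains its lock on the row.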

Of course, at higher isolation levels serialization errors are thrown
when something inconsistent with the higher level's guarantees would
otherwise need to occur (even for the IGNORE variant). Still,
interactions with SSI, and preserving the guarantees of SSI should
probably be closely considered by a subject matter expert.

Omission
=======

The patch currently lacks a way of referencing datums rejected for
insertion when updating. The way MySQL handles the issue seems
questionable. They allow you to do something like this:

INSERT INTO upsert (key, val) VALUES (1, 'val') ON DUPLICATE KEY UPDATE
val = VALUES(val);

The implication is that the updated value comes from the INSERT's
VALUES() list, but emulating that seems like a bad idea. In general,
at least with Postgres it's entirely possible that values rejected
differ from the values appearing in the VALUES() list, due to the
effects of before triggers. I'm not sure whether or not we should
assume equivalent transformations during any UPDATE before triggers.

This is an open item. I think it makes sense to deal with it a bit later.

"Value locking"
===========

To date, on-list discussion around UPSERT has almost exclusively
concerned what I've called "value locking"; the idea of locking values
in unique indexes in the abstract (to establish the right to insert
ahead of time). There was some useful discussion on this question
between myself and Heikki back around December/January. Ultimately, we
were unable to reach agreement on an approach and discussion tapered
off. However, Heikki did understand the concerns that informed my
design. He recognized the need to be able to easily *release* value
locks, so as to avoid "unprincipled deadlocks", where under high
concurrency there are deadlocks between sessions that only UPSERT a
single row at a time. I'm not sure how widely appreciated this point
is, but I believe that Heikki appreciates it. It is a very important
point in my opinion. I don't want an implementation that is in any way
inferior to the "UPSERT looping subxact" pattern (i.e. the plpgsql
thing that the docs suggest).

When we left off, Heikki continued to favor an approach that involved
speculatively inserting heap tuples, and then deleting them in the
event of a conflict. This design was made more complicated when the
need to *release* value locks became apparent (Heikki ended up making
some changes to HeapTupleSatisfiesDirty(), as well as sketching a
design for what you might call a "super delete", where xmin can be set
to InvalidTransactionId for speculatively-inserted heap tuples). After
all, it wasn't as if we could abort a subxact to release locks, which
is what the "UPSERT looping subxact" pattern does. I think it's fair
to say that that design became more complicated than initially
anticipated [4] [5].

Anyway, the greater point here is that fundamentally, AFAICT Heikki
and I were in agreement. Once you buy into the idea that we must avoid
holding on to "value locks" of whatever form - as Heikki evidently did
- then exactly what form they take is ultimately only a detail.
Granted, it's a very important detail, but a detail nonetheless. It
can be discussed entirely independently of all of this new stuff, and
thank goodness for that.

If anyone finds my (virtually unchanged) page heavyweight lock based
value locking approach objectionable, I ask that the criticism be
framed in a way that makes a sharp distinction between each of the
following:

1. You don't accept that value locks must be easily released in the
event of a conflict. Is anyone in this camp? It's far from obvious to
me what side of this question Andres is on at this stage, for example.
Robert might have something to say here too.

2. Having taken into account the experience of myself and Heikki, and
all that is implied by taking that approach ***while avoiding
unprincipled deadlocks***, you continue to believe that an approach
based on speculative heap insertion, or some alternative scheme is
better than what I have done to the nbtree code here, or you otherwise
dislike something about the proposed value locking scheme. You accept
that value locks must be released and released easily in the event of
a conflict, but like Heikki you just don't like what I've done to get
there.

Since we can (I believe) talk about the value locking aspect and the
rest of the patch independently, we should do so...unless you're in
camp 1, in which case I guess that we'll have to thrash it out.

Syntax, footguns
=============

As I mentioned, I have incorporated feedback from Kevin Grittner. You
may specify a unique index to merge on from within the INSERT
statement, thus avoiding the risk of inadvertently having the update
affect the wrong tuple due to the user failing to consider that there
was a would-be unique violation within some other unique index
constraining some other attribute. You may write the DML statement
like this:

INSERT INTO upsert(key, val) VALUES(1, 'insert') ON CONFLICT WITHIN
upsert_pkey UPDATE SET val = 'update';

I think that there is a good chance that at least some people will
want to make this mandatory. I guess that's fair enough, but I
*really* don't want to *mandate* that users specify the name of their
unique index in DML for obvious reasons. Perhaps we can come up with a
more tasteful syntax that covers all interesting cases (consider the
issues with partial unique indexes and before triggers for example,
where a conclusion reached about which index to use during parse
analysis may subsequently be invalidated by user-defined code, or
ambiguous specifications in the face of overlapping attributes between
two unique composite indexes, etc). The Right Thing is far from
obvious, and there is very little to garner from other systems, since
SQL MERGE promises essentially nothing about concurrency, both as
specified by the standard and in practice. You don't need a unique
index at all, and as I showed in my pgCon talk, there are race
conditions even for trivial UPSERT operations in all major SQL MERGE
implementations.
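
For illustration, consider a hypothetical table with two distinct
unique indexes (the table and the default index names here are only
for the sake of the example):

create table upsert2
(
  key int4 primary key, -- implicitly creates unique index upsert2_pkey
  alt text unique,      -- implicitly creates unique index upsert2_alt_key
  val text
);

-- Merge on the primary key index only; the point of WITHIN is that a
-- would-be duplicate violation within the other unique index
-- (upsert2_alt_key) isn't what this UPDATE is meant to handle:
INSERT INTO upsert2(key, alt, val) VALUES(1, 'a', 'insert')
ON CONFLICT WITHIN upsert2_pkey UPDATE SET val = 'update';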

Note that making merging on one particular unique index mandatory (via
syntax) buys the implementation no useful leeway. Just for
example, the unprincipled deadlocks test case that illustrated the
problem with early "promise tuple" style approaches to value locking
[6] involved only a single unique index. AFAICT, the question of
whether or not this should be mandatory is just a detail of the
feature's high level design, as opposed to something expected to
significantly influence the implementation.

Testing, performance
===============

As you'd expect, I've included both isolation tests and regression
tests covering a reasonable variety of cases. In addition, stress
testing is an important part of my testing strategy. Reviewers are
encouraged to try out these test bash scripts:

https://github.com/petergeoghegan/upsert

(Interested hackers should request collaborator status on that Github
project from me privately. I welcome new, interesting test cases.)

The performance of the patch seems quite good, and is something that
these stress-testing bash scripts also test. Upserts are compared
against "equivalent" inserts when we know we'll never update, and
against "equivalent" updates when we know we'll never insert.

On an 8 core test server, I can sustain ~90,000 ordinary insert
transactions per second on an unlogged table defined as follows:

create unlogged table foo
(
merge serial primary key,
b int4,
c text
);

In all cases pgbench uses 8 clients (1 per CPU core).

With "equivalent" upserts, it's about ~66,000 TPS. But this is a
particularly unsympathetic case, because I've deliberately exaggerated
the effects of heavyweight lock contention on leaf pages by using a
serial primary key. Plus, there's the additional planning and parsing
overhead.

When comparing updating with upserting, it's a similar story.
100,000 tuples are pre-inserted in each case. I can sustain ~98,000
TPS with plain updates, or ~70,000 TPS with "equivalent" upserts.
B-Tree index page heavyweight lock contention probably explains some
of the difference between "UPSERT inserts" and "UPSERT updates".
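
For reference, the "equivalent" upserts in these comparisons are along
the lines of the following (a sketch only; the real pgbench scripts
live in the stress-testing repository linked above, and :id stands in
for a pgbench variable):

-- Roughly the shape of an "UPSERT insert"/"UPSERT update" statement
-- against the table above; whether it inserts or updates depends on
-- whether the key is already present:
INSERT INTO foo(merge, b, c) VALUES(:id, 1, 'pgbench')
ON CONFLICT UPDATE SET b = b + 1, c = 'pgbench update';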

Interlocking with VACUUM, race conditions
===============================

In previous revisions, when we went to lock + update a tuple, no
"value locks" were held, and neither were any B-Tree page buffer pins,
because they were both released at the same time (recall that I call
my heavyweight lock on B-Tree leaf pages a value lock). We still do
that (unprincipled deadlocks are our only alternative), but now hold
on to the pin for longer, until after tuple locking. Old versions of
this patch sat on the B-Tree buffer pin (to prevent concurrent
deletion) only for as long as value locks were held, but it may not be
good enough to give up the pin just before we lock/update, when value
locks are released: dropping the pin implies that the heap tuple can
physically go away, and in general the same TID may then contain
anything. We may have to interlock against vacuum by sitting on the
B-Tree buffer pin (but not the value lock) throughout locking +
update. That makes it impossible for the heap tuple slot to fail to
relate to the B-Tree tuple that is under consideration for
locking/updating. Recall that we aren't quite dealing with MVCC
semantics here, since in READ COMMITTED mode we can lock a
conclusively committed + visible tuple with *no* version visible to
our command's MVCC snapshot. Therefore, it seems worth considering the
possibility that the nbtree README's observations on the necessity of
holding a pin to interlock against VACUUM (for non-MVCC snapshots)
apply.

In this revision we have two callbacks (or two calls to the same
callback, with different effects): One to release value locks early,
to avoid unprincipled deadlocks, and a second to finally release the
last unneeded buffer pin.

Recall that when we find a conflict (within _bt_checkunique()), it
must be conclusively committed and visible to new MVCC snapshots; we
know at that juncture that it's live. The concern is that it might be
deleted *and* garbage collected in the interim between finding the
conflict tuple, and locking it (in practice this interim period is
only an instant).

This is probably too paranoid, though: the fact that the upserter's
transaction is running ought to imply that GetOldestXmin() returns an
XID sufficient to prevent this. OTOH, I'm not sure that there exists
anything that looks like a precedent for relying on blocking vacuum in
this manner, and it might turn out to be limiting to rely on this.
And, I hasten to add, my fix (sitting on a B-Tree pin throughout row
locking) is in another way perhaps not paranoid enough: Who is to say
that our conflicting value is on the same B-Tree leaf page as our
value lock? It might not be, since _bt_checkunique() looks at later
B-Tree pages (the value locked page is merely "the first leaf page the
value could be on"). Pinning the heavyweight lock page's buffer is
certainly justified by the need for non-speculative inserters to see a
flag that obligates them to acquire the heavyweight page lock
themselves (see comments in patch for more), but this other reason is
kind of dubious.

In other words: I'm relying on the way VACUUM actually works to
prevent premature garbage collection. It's possible to imagine a world
in which HeapTupleSatisfiesVacuum() is smart enough to realize that
the tuple UPSERT wants to lock is not visible to anyone (assuming MVCC
semantics, etc), and never can be. I've tentatively added code to keep
a buffer pin for longer, but that's probably not good enough if we
assume that it's necessary at all. Basically, I want to be comfortable
about my rationale for it being okay that a "non-MVCC" "index scan"
doesn't hold a pin, but right now I'm not. I was conflicted on whether
or not I should include the "unpin later" logic at all; for now I've
left it in, if only as a placeholder. Needless to say, if there is a
race condition you can take it that it's very difficult to isolate.

FWIW, somewhat extensive stress-testing has revealed no bugs that you
might associate with these problems, with and without extended buffer
pinning, and with artificial random sleeps added at key points in an
effort to make any race condition bugs manifest themselves. I have
made a concerted effort to break the patch in that way, and I'm now
running out of ideas. Running the stress tests (with random delays in
key points in the code) for several days reveals no bugs. This is on
the same dedicated 8 core server, with plenty of concurrency.

It's probably a good idea to begin using my B-Tree verification tool
[7] for testing...on the other hand, it doesn't know anything about
MVCC, and will only detect the violation of invariants that are
localized to the B-Tree code, at least at the moment.

Open items
=========

I already mentioned the inability to reference rejected rows in an
UPDATE, as well as my unease about VACUUM interlocking, both of which
are open items. Also, some of the restrictions that I already
mentioned - on updatable views, inheritance, and foreign tables - are
probably unnecessary. We should be able to come up with reasonable
behavior for at least some of those.

Patch
====

I thought that I went too long without posting something about all of
this to the list to get feedback, and so I decided to post this WIP
patch set. I've tried to break it up into pieces, but it isn't all
that suitable for representing as cumulative commits. I've also tried
to break up the discussion usefully (the question of how everything
fits together at a high level can hopefully be discussed separately
from the question of how "value locks" are actually implemented).

Thoughts?

[1] http://www.pgcon.org/2014/schedule/attachments/327_upsert_weird.pdf,
("Goals for UPSERT in Postgres")
[2] http://www.postgresql.org/message-id/CAM3SWZRP0c3g6+aJ=YYDGYAcTZg0xA8-1_FCVo5Xm7hrEL34kw@mail.gmail.com
[3] https://sqlite.org/lang_conflict.html
[4] http://www.postgresql.org/message-id/CAM3SWZQoArVQGMi=v-jk3sBjsPg+wdjeUkM_6L5TZG_i9pyGzQ@mail.gmail.com
[5] http://www.postgresql.org/message-id/52B4AAF0.5090806@vmware.com
[6] http://www.postgresql.org/message-id/CAM3SWZShbE29KpoD44cVc3vpZJGmDer6k_6FGHiSzeOZGmTFSQ@mail.gmail.com
[7] http://www.postgresql.org/message-id/CAM3SWZRtV+xmRWLWq6c-x7czvwavFdwFi4St1zz4dDgFH4yN4g@mail.gmail.com
--
Peter Geoghegan

Attachment Content-Type Size
0001-Make-UPDATE-privileges-distinct-from-INSERT-privileg.patch text/x-patch 17.8 KB
0004-Internal-documentation-for-INSERT-.-ON-CONFLICT-UPDA.patch text/x-patch 18.2 KB
0003-Tests-for-INSERT-.-ON-CONFLICT-UPDATE-IGNORE.patch text/x-patch 33.1 KB
0002-Support-INSERT-.-ON-CONFLICT-UPDATE-IGNORE.patch text/x-patch 137.5 KB
