INSERT...ON DUPLICATE KEY IGNORE

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: INSERT...ON DUPLICATE KEY IGNORE
Date: 2013-08-30 22:09:59
Message-ID: CAM3SWZRg4VcLCKprVEyuLHoXwUefuiz1-P69YLye8qktZb_=UQ@mail.gmail.com
Lists: pgsql-hackers

For many years now, MySQL has had a feature called INSERT IGNORE [1];
SQLite has INSERT ON CONFLICT IGNORE [2]; SQL Server has an option
called IGNORE_DUP_KEY and Oracle has a hint called
IGNORE_ROW_ON_DUPKEY_INDEX (they acknowledge that it's a bit odd that
a hint changes the semantics of a DML statement - of
course, MERGE has always been able to do something approximately equivalent).

Each of these features has its respective system simply not insert
certain tuples that would result in a duplicate key violation, while
potentially proceeding with the insertion of other, non-violating
tuples.

The attached WIP patch implements this for Postgres, with a few
notable differences:

1) The patch is only interested in unique index violations
(specifically, violations of amcanunique access method unique
indexes); it will not do anything with NULL constraint violations, as
the MySQL feature does, for example. It also doesn't care about
exclusion constraints, even though it's currently possible to defer
exclusion constraint enforcement in a way that is analogous to how
it's done for btree indexes.

2) The clause that I've proposed is, as you'll have already gathered,
spelt slightly differently to any of these systems. I'm not
particularly attached to how I've spelt it. I've spelt it this way so
as to avoid suggesting that it's fully compatible with the MySQL
feature. I don't think we want weird NULLness behavior, but I may be
mistaken. If we want that additional behavior, we probably want it as
an optional adjunct, or perhaps an entirely independent clause to what
I've specified here.

3) RETURNING is expanded - "RETURNING REJECTS *" is now possible where
that makes sense.

It is possible for clients to interrogate the wire protocol to see the
number of rows affected (this can be done with a PQcmdTuples() call by
libpq clients, for example), and see how that compares with the number
of tuples they tried to insert. In addition, a client can use
RETURNING to see what rows were successfully inserted. Here is a
simple example session that shows this usage:

postgres=# create table tab(id int4 primary key, val text);
CREATE TABLE
postgres=# insert into tab(id, val) values(1, 'val'), (3, 'val'), (4, 'val');
INSERT 0 3
postgres=# insert into tab(id, val) values(1, 'nosert'), (2,
'nosert'), (3, 'nosert') on duplicate key ignore returning id;
 id
----
  2
(1 row)

INSERT 0 1
postgres=# select xmin, * from tab;
  xmin  | id |  val
--------+----+--------
 580843 |  1 | val
 580843 |  3 | val
 580843 |  4 | val
 580844 |  2 | nosert
(4 rows)

If this all happened in two separate, concurrent sessions, the
transaction with xid 580844 might well have blocked on the outcome of
580843 in the usual way (ultimately inserting or doing nothing as
appropriate for each tuple), just as you'd expect. Note, however, that
the xid of the "noserter" is only seen here for the tuple that it
actually successfully inserted - in effect, the noserter session does
not lock tuples that it did not originally insert. That's consistent
with MySQL's INSERT IGNORE, for example. However, some people might
prefer it if this did share lock rows that prevented tuples from being
inserted, or if that was at least an option. That would be something
quite a bit closer to a fully-fledged upsert - I imagine we'd have to
lock the old tuple using EvalPlanQual, and if a submission did that
correctly then I think we'd be well on our way to solving all the
problems.

RETURNING REJECTS is a way of getting the inverse of the affected
tuples of an ordinary INSERT...RETURNING statement - rather than
returning only the rows successfully inserted, we return only the rows
that were not successfully inserted (i.e. tuples where either a BEFORE
trigger returned NULL or we didn't proceed with insertion). This is only
meaningful in the context of INSERT...ON DUPLICATE KEY IGNORE. Sure,
clients could figure this out for themselves using vanilla "RETURNING
*", but I think that this addition is justified by the fact that it's
generally expected that the number of rejects will be relatively
small. People are going to want to use this feature in an ad-hoc
manner within a psql session, and supporting this will particularly
help there. It might be nice to find a way to indicate the exact
constraint violated per row returned, but I preferred to wait and see
if people thought it was a good idea generally.
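
To illustrate, continuing the session above (note that the output shown
here is illustrative rather than copied from a real run), a client can
ask for the rejected rows instead of the inserted ones:

postgres=# insert into tab(id, val) values(4, 'nosert'), (5, 'nosert')
on duplicate key ignore returning rejects *;
 id |  val
----+--------
  4 | nosert
(1 row)

INSERT 0 1

The (5, 'nosert') tuple is inserted, while the proposed (4, 'nosert')
tuple is rejected because of the existing row with id 4, so it is the
one returned.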

Use-cases
=========

The use-cases for this feature are fairly obvious. It can be used as a
way of merging tuples easily, if clients can live with the present
limitations.
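
For instance, new rows could be merged from a staging table like so (a
sketch only - the table and column names are made up, and this assumes
the clause works with INSERT ... SELECT just as it does with a VALUES
list):

insert into customers (id, name)
select id, name from customers_incoming
on duplicate key ignore;

Rows whose id already exists in customers are simply skipped; all other
rows are inserted.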

In addition to that, the feature may be of value to clients of the
logical change-set generation infrastructure (such as the proposed BDR
contrib module[3]), if and when that infrastructure is finally
committed. Currently, the BDR prototype simply balks at INSERT
conflicts. This is because the amount of subtransactions/xids
generated to do anything else is considered prohibitive. If Postgres
is ever going to solve the kinds of problems that BDR proposes to
solve well (except for the relatively narrow use-case where balking at
INSERT conflicts is acceptable), it needs something like what I've
proposed here. The only alternative is probably to replay a
transaction without using subtransactions, and if that fails, remember
that fact and wrap every single change in a subxact. This is really
going to be painful for large transactions.
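
For comparison, the subxact-per-change fallback looks something like
this on the apply side (a simplified, hypothetical sketch - the table
name and values are made up):

begin;
savepoint change_1;
insert into t values (1, 'x');  -- may raise a unique_violation
release savepoint change_1;     -- on error: rollback to savepoint change_1
savepoint change_2;
insert into t values (2, 'y');
release savepoint change_2;
-- ...one subtransaction per replayed change...
commit;

Every replayed change burns a subtransaction (and an xid, since each
one writes), which is exactly the overhead that INSERT...ON DUPLICATE
KEY IGNORE would avoid.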

Andres privately said some weeks back (prior to my sharing of this
information, and even prior to writing most of the code posted here)
that he'd like to see an INSERT...ON DUPLICATE KEY LOCK. I suppose
that he must have meant that he'd like to see shared locks obtained on
rows that blocked a noserter from actually inserting some of their
proposed tuples, for the purposes of conflict resolution (which, as
I've said, this doesn't do). Perhaps he'd like to comment on these
issues here.

Mechanism
=========

This patch is a spin-off from a broader effort to implement
INSERT...ON DUPLICATE KEY UPDATE (upsert). During the 2012 developer
meeting, it was agreed that we needed a simple, atomic upsert, and
that we should probably use the MySQL syntax as long as we don't have
merge in its full generality [4]. In doing things this way, I'm
attempting to manage the risk of that broader patch not being accepted
by checkpointing progress towards it in a major release. In
particular, I'd like to see us reach consensus on the questions
surrounding our basic approach to implementing upsert. I'm keen to see
this happen before attempting to solve all the other problems, in a
way which is perhaps predicated on my own personal opinion about how
to solve the basic problem. Having said all that, even if I never work
on upsert, this patch certainly has independent value.

Having reviewed the archives for previous discussions on upsert or
MERGE, it seems that there is some diversity of opinion among more
senior developers about just how to solve the basic concurrency issue.
For example, a few years ago Heikki opined that continuing to insert
the heap tuple first, and then potentially cleaning that up in the
event of a duplicate rather than raising an error was a workable
approach [5]. Simon addressed Heikki's point here specifically, and
suggested that we should try and find the tuple first [6], which is
approximately the approach I have taken here. In saying this, Simon
appears to have changed his mind, because on a previous occasion he
remarked that doing an UPDATE potentially followed by an INSERT might
be preferable [7]. Tom at least once briefly sketched how he thought
this ought to work [8], and it appeared to me that he was roughly in
agreement with Simon in [6]. (I hope I haven't misrepresented
somebody's view here - if so, please correct me).

From a high level, the patch works by adding something that is
tentatively internally referred to as "speculative insertion" in
lower-level code comments. The basic approach is:

* Within ExecInsert, when inserting into a regular heap relation from
a tuple table slot (after BEFORE triggers fire but before inserting
the heap tuple), there is an extra function call to a new function,
ExecLockIndexTuples().

* For each applicable amcanunique unique index, ExecLockIndexTuples()
calls a new am method, amlock. For each unique index, we end up
acquiring a write buffer lock in a similar fashion to _bt_doinsert.
When we return from the new amlock function, we report whether or not
there was a duplicate without having inserted any index tuples or
having otherwise done something that requires a heap tuple to be
present. I like to call this point "the point of no return", where
individual unique indexes commit themselves to insertion (which is to
say, the point after which a unique constraint violation becomes
impossible but before actual index tuple insertion in _bt_doinsert
[9]). Another duplicate key cannot be inserted by anyone else until
that write lock is released, which it won't be until insertion or
IGNORE, for the same reason why that's currently true once control
reaches the point of no return. A page split might need to happen
before insertion proper of that index tuple, for example, and that's
okay.

* If we have total consensus among all unique indexes, we proceed with
a heap insertion of the tuple formed from the tuple table slot
ExecInsert() was called in respect of. Yes, this means that
potentially multiple exclusive buffer locks are held for the duration
of the heap tuple insertion. If not, we release the locks early and
return NULL from ExecInsert() before heap insertion ever happens (and
certainly before index tuple insertion ever happens).

* If and when we get to ExecInsertIndexTuples(), and ultimately to
_bt_doinsert, the new _bt_doinsert knows that for any unique indexes
it sees, the process has already been partially completed. It
takes state made available from the first phase of speculative
insertion, and finishes the job, pretty much picking up from the point
of no return for amcanunique unique indexes (or starting from scratch
in the usual way for non-unique amcanunique-method indexes that happen
to be involved in speculative insertion).

Deadlock hazards
==============

In order to avoid deadlocks, the locking stage acquires locks in
reverse order to the usual index insertion order. So, regardless of
how those locks are subsequently freed (either by actual insertion or
by finishing early in the IGNORE case), we avoid buffer lock related
deadlocks. Still, deadlock hazards are something that reviewers will
want to pay particular attention to.

Unlike some other systems like DB2, we have always allowed BEFORE ROW
triggers to execute arbitrary SQL. I've frequently thought this was a
bit of a wart (e.g. [10]), and certainly not supportive of sensible,
idiomatic trigger use, but there isn't much we can do about it at this
stage. Right now, BEFORE ROW triggers will still fire when the new
code decides to not do the insert. It certainly wouldn't be acceptable
to allow before triggers to run *after* the first phase of speculative
insertion, because they might try and access an index with an
exclusive locked buffer, resulting in backend self-deadlock. Besides,
we cannot really know *what* to lock until after the before triggers
fire. That leaves us with two options, as far as I can tell:

1. Just have users live with those semantics. This is what the patch
does presently, and anyone who is using triggers in what I consider to
be a sensible way doesn't have to care. For everyone else, it's a
gotcha that they have to deal with, to be noted prominently (see the
example session below).

2. Do something overly-cute to prevent the trigger from doing anything
we can't back out of. I'm not sure how feasible this is, because I
haven't looked at it at all (since option 1 looks perfectly reasonable
to me), but everything is on the table I suppose. As noted in code
comments, this would be pretty fragile. It would also probably be at
least as ugly as requiring users to live with option 1.
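
To illustrate the option 1 semantics, here is a hypothetical session
(the table, function and trigger names are made up, and the output is
what I'd expect rather than copied from a real run):

create table tab2(id int4 primary key);
create function note_insert() returns trigger language plpgsql as $$
begin
  raise notice 'BEFORE trigger fired for id %', new.id;
  return new;
end $$;
create trigger tab2_notice before insert on tab2
  for each row execute procedure note_insert();

postgres=# insert into tab2 values(1);
NOTICE:  BEFORE trigger fired for id 1
INSERT 0 1
postgres=# insert into tab2 values(1) on duplicate key ignore;
NOTICE:  BEFORE trigger fired for id 1
INSERT 0 0

The trigger fires for the second insert even though the tuple is
ultimately not inserted.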

BEFORE ROW triggers aside, there are clearly other hazards implied by
taking exclusive locks on unique index buffers - a naive
implementation could be dealing with a system index. While these locks
aren't held for much longer than in the existing code, a requirement
to carefully avoid catalog access for the duration of the heap
insertion seems do-able but still quite fragile, and perhaps a source
of bugs in the future. So as a precaution against some catalog access
using an index that is already locked by speculative insertion in a
way that leads to deadlock, we simply forbid speculative insertion
into system catalogs outright. This doesn't seem like too onerous a
restriction to me.

Transaction isolation
=================

The interaction with SSI is something that I considered, but may not
have got right. With the transaction isolation level set to
serializable, each exclusive-locked buffer gets a
CheckForSerializableConflictIn() call in a way that I believe is
analogous to the extant _bt_doinsert code. Crucially, we always make
those calls when the isolation level is serializable, which means
extra work (we've perhaps already concluded that the speculative
insertion won't go ahead) over read committed and repeatable read. So
because the number of new calls to CheckForSerializableConflictIn()
isn't influenced by interaction with other sessions, if I'm not
mistaken that preserves SSI's guarantees. However, this may still be
unnecessary - it may also be fine to just rely on the buffer locks
that are taken. I need better analysis here.

Tests
=====

Isolation tests are included that test the feature in an obvious way.
My original test case involved a pgbench session with a custom script
where the primary key value is a random value between 1 and :scale
inclusive, and the expectation is that we'll end up with exactly
:scale number of tuples. After many sessions did INSERT...ON DUPLICATE
KEY IGNORE, this is consistently the case, without spurious duplicates
or any other incorrect behavior (that I'm aware of). Presently, this
works fine with tens of thousands of tuples. I've also done quite a
lot of other smoke testing of the feature, but this isn't included
here.
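
For reference, the custom script was along these lines (an illustrative
reconstruction - the table name is made up, and :scale was supplied via
pgbench's -s option):

\setrandom id 1 :scale
insert into upsert_test(id, val) values (:id, 'x') on duplicate key ignore;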

I have done no performance testing to date. Reviewers will want to pay
attention to the performance implications, particularly in the regular
insert case; it's conceivable that I've regressed things, though I
don't specifically suspect that I have.

Documentation
============

I haven't added any documentation, because I didn't think that there
was much point before hearing comments from others. Obviously at least
the INSERT documentation, and "Chapter 52. Index Access Method
Interface Definition" and the pg_am docs will need to be updated if
we're to proceed.

Thoughts?

*** References ***

[1] https://dev.mysql.com/doc/refman/5.5/en/insert.html

[2] https://www.sqlite.org/lang_conflict.html

[3] http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/bdr

[4] https://wiki.postgresql.org/wiki/PgCon_2012_Developer_Meeting#MERGE

[5] http://www.postgresql.org/message-id/45E845C4.6030000@enterprisedb.com

[6] http://www.postgresql.org/message-id/1172858409.3760.1618.camel@silverbirch.site

[7] http://www.postgresql.org/message-id/1132000496.3388.31.camel@holly

[8] http://www.postgresql.org/message-id/20083.1131730779@sss.pgh.pa.us

[9] https://github.com/postgres/postgres/blob/REL9_3_STABLE/src/backend/access/nbtree/nbtinsert.c#L174

[10] http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=6868ed7491b7ea7f0af6133bb66566a2f5fe5a75

--
Peter Geoghegan

Attachment Content-Type Size
insert_on_dup.v1.2013_08_30.patch.gz application/x-gzip 25.7 KB

From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: INSERT...ON DUPLICATE KEY IGNORE
Date: 2013-08-30 22:40:15
Message-ID: 52211F4F.7090503@agliodbs.com
Lists: pgsql-hackers

On 08/30/2013 03:09 PM, Peter Geoghegan wrote:
> The attached WIP patch implements this for Postgres, with a few
> notable differences:

Thank you for addressing this. If nobody is going to hack out MERGE, we
could really use at least this feature.

> 3) RETURNING is expanded - "RETURNING REJECTS *" is now possible where
> that makes sense.

Oh, nifty! OK, now I can *really* use this feature.

> This patch is a spin-off from a broader effort to implement
> INSERT...ON DUPLICATE KEY UPDATE (upsert). During the 2012 developer

Yeah, I was wondering when we'd get to that. Obviously there will be
users clamoring for it ...

> Unlike some other systems like DB2, we have always allowed BEFORE ROW
> triggers to execute arbitrary SQL. I've frequently thought this was a
> bit of a wart (e.g. [10]), and certainly not supportive of sensible,
> idiomatic trigger use, but there isn't much we can do about it at this
> stage. Right now, BEFORE ROW triggers will still fire when the new
> code decides to not do the insert. It certainly wouldn't be acceptable
> to allow before triggers to run *after* the first phase of speculative
> insertion, because they might try and access an index with an
> exclusive locked buffer, resulting in backend self-deadlock. Besides,
> we cannot really know *what* to lock until after the before triggers
> fire. That leaves us with two options, as far as I can tell:
>
> 1. Just have users live with those semantics. This is what the patch
> does presently, and anyone who is using triggers in what I consider to
> be a sensible way doesn't have to care. For everyone else, it's a
> gotcha that they have to deal with, to be noted prominently.

+1. We already allow BEFORE triggers to violate referential integrity.
I don't think that allowing them to behave oddly around INSERT ...
IGNORE is a problem compared to that. We just need to document it, so
that users know that their BEFORE code will fire even if the INSERT is
being ignored, and that a BEFORE trigger can cause an INSERT ... IGNORE
to error out.

> I have done no performance testing to date. Reviewers will want to pay
> attention to the performance implications, particularly in the regular
> insert case; it's conceivable that I've regressed things, though I
> don't specifically suspect that I have.

Yeah, we'll also want to document the performance overhead for the bulk
loading case. I know I'll want to use this syntax as a primitive form
of MERGE, and I'll want to know what it costs me.

Does this work with multiple VALUES rows?

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: INSERT...ON DUPLICATE KEY IGNORE
Date: 2013-08-30 22:42:51
Message-ID: CAM3SWZTFPX-PncUuEaCBuZ3Bcq108KoRxN4WZnmu2d+VM=wFfw@mail.gmail.com
Lists: pgsql-hackers

On Fri, Aug 30, 2013 at 3:40 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> Does this work with multiple VALUES rows?

Yes.

--
Peter Geoghegan


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Peter Geoghegan <pg(at)heroku(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: INSERT...ON DUPLICATE KEY IGNORE
Date: 2013-08-31 00:47:53
Message-ID: 20130831004753.GD1138@awork2.anarazel.de
Lists: pgsql-hackers

Hi,

This is awesome.

On 2013-08-30 15:09:59 -0700, Peter Geoghegan wrote:
> 1) The patch is only interested in unique index violations
> (specifically, violations of amcanunique access method unique
> indexes); it will not do anything with NULL constraint violations, as
> the MySQL feature does, for example. It also doesn't care about
> exclusion constraints, even though it's currently possible to defer
> exclusion constraint enforcement in a way that is analogous to how
> it's done for btree indexes.

All that seems sane to me. I very, very much do not want it to deal with
NOT NULL violations.

> [syntax]

Haven't thought about syntax at all so far, that's food for another day.

> Use-cases
> =========
> ...
> In addition to that, the feature is may be of value to clients of the
> logical change-set generation infrastructure (such as the proposed BDR
> contrib module[3]), if and when that infrastructure is finally
> committed. Currently, the BDR prototype simply balks at INSERT
> conflicts. This is because the amount of subtransactions/xids
> generated to do anything else is considered prohibitive. If Postgres
> is ever going to solve the kinds of problems that BDR proposes to
> solve well (except for the relatively narrow use-case where balking at
> INSERT conflicts is acceptable), it needs something like what I've
> proposed here. The only alternative is probably to replay a
> transaction without using subtransactions, and if that fails, remember
> that fact and wrap every single change in a subxact. This is really
> going to be painful for large transactions.

Exactly. In completely unscientific tests it reduces speed to less than
1/10 of the version without subxacts. Without a single conflict that is.

> Andres privately said some weeks back (prior to my sharing of this
> information, and even prior to writing most of the code posted here)
> that he'd like to see an INSERT...ON DUPLICATE KEY LOCK. I suppose
> that he must have meant that he'd like to see shared locks obtained on
> rows that blocked a noserter from actually inserting some of their
> proposed tuples, for the purposes of conflict resolution (which, as
> I've said, this doesn't do). Perhaps he'd like to comment on these
> issues here.

Ok, so what we would like to do is basically to follow a protocol
(simplified) like:

1) replay INSERT, using ON DUPLICATE IGNORE
2) if INSERT succeeded, be happy, no conflict (another INSERT with the
same key might conflict though)
3) if the INSERT failed, lock tuple for UPDATE
4) if the tuple got deleted by another session, goto 1
5) check whether local tuple or the already inserted tuple wins conflict resolution
6) UPDATE or skip accordingly

If there's a variant of INSERT ... ON DUPLICATE that gets a FOR UPDATE
lock on the existing row this gets simpler because there's no chance
anymore we need to loop or look for other versions of the row.
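
Expressed as hand-wavy SQL from the apply side (table and column names
made up, and assuming the RETURNING behaviour described upthread), that
would be roughly:

-- steps 1/2: try the insert and see whether it took effect
INSERT INTO t(k, v) VALUES (:k, :v) ON DUPLICATE KEY IGNORE RETURNING k;
-- no row returned: step 3, lock the existing row
SELECT * FROM t WHERE k = :k FOR UPDATE;
-- no row found (deleted concurrently): step 4, retry the INSERT
-- otherwise: step 5, decide which version wins conflict resolution, then
UPDATE t SET v = :v WHERE k = :k;  -- step 6, or skip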

Makes sense?

> Mechanism
> =========
>
> [...]
>
> From a high level, the patch works by adding something that is
> tentatively internally referred to as "speculative insertion" in
> lower-level code comments. The basic approach is:
>
> * Within ExecInsert, when inserting into a regular heap relation from
> a tuple table slot (after BEFORE triggers fire but before inserting
> the heap tuple), there is an extra function call to a new function,
> ExecLockIndexTuples().
>
> * For each applicable amcanunique unique index, ExecLockIndexTuples()
> calls a new am method, amlock. For each unique index, we end up
> acquiring a write buffer lock in a similar fashion to _bt_doinsert.
> When we return from the new amlock function, we report whether or not
> there was a duplicate without having inserted any index tuples or
> having otherwise done something that requires a heap tuple to be
> present. I like to call this point "the point of no return", where
> individual unique indexes commit themselves to insertion (which is to
> say, the point after which a unique constraint violation becomes
> impossible but before actual index tuple insertion in _bt_doinsert
> [9]). Another duplicate key cannot be inserted by anyone else until
> that write lock is released, which it won't be until insertion or
> IGNORE, for the same reason why that's currently true once control
> reaches the point of no return. A pagesplit might need to happen
> before insertion proper of that index tuple, for example, and that's
> okay.
>
> * If we have total consensus among all unique indexes, we proceed with
> a heap insertion of the tuple formed from the tuple table slot
> ExecInsert() was called in respect of. Yes, this means that
> potentially multiple exclusive buffer locks are held for the duration
> of the heap tuple insertion. If not, we release the locks early and
> return NULL from ExecInsert() before heap insertion ever happens (and
> certainly before index tuple insertion ever happens).
>
> * If and when we get to ExecInsertIndexTuples(), and ultimately to
> _bt_doinsert, the new _bt_doinsert knows that for any unique indexes
> it sees, that the process has already been partially completed. It
> takes state made available from the first phase of speculative
> insertion, and finishes the job, pretty much picking up from the point
> of no return for amcanunique unique indexes (or starting from scratch
> in the usual way for non-unique amcanunique-method indexes that happen
> to be involved in speculative insertion).

Puh. It took me some time to even start to understand what you're doing
here...

While I ponder on it some more, could you explain whether I understood
something correctly? Consider the following scenario:

CREATE TABLE foo(a int UNIQUE, b int UNIQUE);

S1: INSERT INTO foo VALUES (0, 0);
S1: BEGIN;
S1: INSERT INTO foo VALUES (1, 1);
S1: SELECT pg_sleep(3600);
S2: BEGIN;
S2: INSERT INTO foo VALUES (2, 1);
S3: SELECT * FROM foo WHERE a = 0;

Won't S2 in this case exclusively lock a page in foo_a_key (the one
where 2 will reside on) for 3600s, while it's waiting for S1 to finish
while getting the speculative insertion into foo_b_key?

ISTM we could use something like a new lock level to make that work more
sensibly, basically something like LW_FOR_UPDATE which does not conflict
with _SHARED but conflicts with _EXCLUSIVE. This can then be used by the
speculative insert to "reserve" that page.
But while that would allow concurrent searches, this algorithm would
still prevent any insertion on the same page for 3600s, right?

> Deadlock hazards
> ==============
>
> In order to avoid deadlocks, the locking stage acquires locks in
> reverse order to the usual index insertion order. So, regardless of
> how those locks are subsequently freed (either by actual insertion or
> by finishing early in the IGNORE case), we avoid buffer lock related
> deadlocks. Still, deadlock hazards are something that reviewers will
> want to pay particular attention to.
>
> Unlike some other systems like DB2, we have always allowed BEFORE ROW
> triggers to execute arbitrary SQL. I've frequently thought this was a
> bit of a wart (e.g. [10]), and certainly not supportive of sensible,
> idiomatic trigger use, but there isn't much we can do about it at this
> stage. Right now, BEFORE ROW triggers will still fire when the new
> code decides to not do the insert. It certainly wouldn't be acceptable
> to allow before triggers to run *after* the first phase of speculative
> insertion, because they might try and access an index with an
> exclusive locked buffer, resulting in backend self-deadlock. Besides,
> we cannot really know *what* to lock until after the before triggers
> fire. That leaves us with two options, as far as I can tell:
>
> 1. Just have users live with those semantics. This is what the patch
> does presently, and anyone who is using triggers in what I consider to
> be a sensible way doesn't have to care. For everyone else, it's a
> gotcha that they have to deal with, to be noted prominently.

I find that acceptable. They already need to deal with the fact that a
second trigger can stop insertion.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: INSERT...ON DUPLICATE KEY IGNORE
Date: 2013-08-31 02:38:24
Message-ID: CAM3SWZS20wQua9xosrcK9opAt=x_3uFhWDP+kF4rcV59bdG+bA@mail.gmail.com
Lists: pgsql-hackers

On Fri, Aug 30, 2013 at 5:47 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> This is awesome.

Thanks.

> All that seems sane to me. I very, very much do not want it to deal with
> NOT NULL violations.

Sure. But there's nothing stopping us from doing that as a totally
orthogonal thing. Not that I personally find it to be terribly
compelling or anything. It's an easy addition, but I'm certainly not
going to try and mandate it as integral to what I've done here if that
doesn't suit you.

> Ok, so what we would like to do is basically to follow a protocol
> (simplified) like:
>
> 1) replay INSERT, using ON DUPLICATE IGNORE
> 2) if INSERT succeeded, be happy, no conflict (another INSERT with the
> same key might conflict though)
> 3) if the INSERT failed, lock tuple for UPDATE
> 4) if the tuple got deleted by another session, goto 1
> 5) check whether local tuple or the already inserted tuple wins conflict resolution
> 6) UPDATE or skip accordingly

Right, I thought that's what you meant. I'll have to ponder it
further. However, I don't think locking the old tuple(s) is mandatory
to make this a useful feature - as I've noted, MySQL doesn't do that.
That said, I really want to support that, perhaps as another DUPLICATE
KEY variant. As I've noted, if we had that, we almost wouldn't need
upsert - loosely speaking it would be mere syntactic sugar on top of
what you require.

> If there's a variant of INSERT ... ON DUPLICATE that gets a FOR UPDATE
> lock on the existing row this gets simpler because there's no chance
> anymore we need to loop or look for other versions of the row.
>
> Makes sense?

Perfect sense.

> Puh. It took me some time to even start to understand what you're doing
> here...
>
> While I ponder on it some more, could you explain whether I understood
> something correctly? Consider the following scenario:
>
> CREATE TABLE foo(a int UNIQUE, b int UNIQUE);
>
> S1: INSERT INTO foo(0, 0);
> S1: BEGIN;
> S1: INSERT INTO foo(1, 1);
> S1: SELECT pg_sleep(3600);
> S2: BEGIN;
> S2: INSERT INTO foo(2, 1);
> S3: SELECT * FROM foo WHERE a = 0;
>
> Won't S2 in this case exclusively lock a page in foo_a_key (the one
> where 2 will reside on) for 3600s, while it's waiting for S1 to finish
> while getting the speculative insertion into foo_b_key?

Yes. The way the patch currently handles that case is unacceptable. It
needs to be more concerned about holding an exclusive lock on
earlier-locked indexes when locking the second and subsequent indexes
iff it blocks on another transaction in the course of locking the
second and subsequent indexes.

> ISTM we could use something like a new lock level to make that work more
> sensibly, basically something like LW_FOR_UPDATE which does not conflict
> with _SHARED but conflicts with _EXCLUSIVE.

I'll ponder it further. There are probably a number of options.

--
Peter Geoghegan


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Peter Geoghegan <pg(at)heroku(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: INSERT...ON DUPLICATE KEY IGNORE
Date: 2013-08-31 18:34:26
Message-ID: 20130831183426.GA6989@awork2.anarazel.de
Lists: pgsql-hackers

Hi,

On 2013-08-30 19:38:24 -0700, Peter Geoghegan wrote:
> On Fri, Aug 30, 2013 at 5:47 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> > While I ponder on it some more, could you explain whether I understood
> > something correctly? Consider the following scenario:
> >
> > CREATE TABLE foo(a int UNIQUE, b int UNIQUE);
> >
> > S1: INSERT INTO foo(0, 0);
> > S1: BEGIN;
> > S1: INSERT INTO foo(1, 1);
> > S1: SELECT pg_sleep(3600);
> > S2: BEGIN;
> > S2: INSERT INTO foo(2, 1);
> > S3: SELECT * FROM foo WHERE a = 0;
> >
> > Won't S2 in this case exclusively lock a page in foo_a_key (the one
> > where 2 will reside on) for 3600s, while it's waiting for S1 to finish
> > while getting the speculative insertion into foo_b_key?
>
> Yes. The way the patch currently handles that case is unacceptable. It
> needs to be more concerned about holding an exclusive lock on
> earlier-locked indexes when locking the second and subsequent indexes
> iff it blocks on another transaction in the course of locking the
> second and subsequent indexes.

After some thinking I don't think any solution primarily based on
holding page level locks across other index operations is going to scale
ok.

What I had in mind when playing around with this is the following trick:
Just as in your patch split index insertion into two phases and perform
one of them before inserting the heap tuple. In what you called
"speculative insertion" don't stop at locking the page, but actually
insert an index tuple in the index. As we haven't yet inserted the heap
tuple we obviously can't insert an actual tuple. Instead set a flag on
the item and reuse the item pointer to store the xid of the
inserting process. I coined those items "insertion promises"...

Index searches will ignore promises they see, but concurrent index
insertions will notice it's a promise and wait for the xid (or not, if
it has aborted). That fits pretty well into the existing btree code. If
the insertion of an index promise worked for all unique indexes and the
heap tuple has been inserted, convert those promises into actual item
pointers pointing to the heap tuple. That probably requires a second
search and some cuteness to find the correct pointer because of
concurrent splits, but that's not too bad (possibly you can do away with
that if we continue to hold a pin).

The fact that now xids can be in the index creates some slight
complications because it now needs to be vacuumed - but that's
relatively easy to fit into the bulkdelete api and I don't think the
contortions to the vacuum code will be too bad.

Imo that solution is fairly elegant because it doesn't cause any heap
bloat, and it causes the same amount of index bloat
"insert-into-heap-first" type of solutions cause. It also has pretty
good concurrency behaviour because you never need to retain a page lock
for longer than a normal index insertion.
It's certainly a bit more complex than your solution tho...

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: INSERT...ON DUPLICATE KEY IGNORE
Date: 2013-09-01 02:38:49
Message-ID: CAM3SWZRUfCaB1L8mxXki2hqS+AnJz2=P2LpHSUoUGxMRCTt5UQ@mail.gmail.com
Lists: pgsql-hackers

On Sat, Aug 31, 2013 at 11:34 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> > Won't S2 in this case exclusively lock a page in foo_a_key (the one
>> > where 2 will reside on) for 3600s, while it's waiting for S1 to finish
>> > while getting the speculative insertion into foo_b_key?
>>
>> Yes. The way the patch currently handles that case is unacceptable. It
>> needs to be more concerned about holding an exclusive lock on
>> earlier-locked indexes when locking the second and subsequent indexes
>> iff it blocks on another transaction in the course of locking the
>> second and subsequent indexes.

I think that the fix here may be to do something analogous to what the
code already does: release all buffer locks, wait for the other
transaction to commit/abort, and try again. The difference being that
we try again across all unique indexes rather than just this one. I
have other obligations to take care of over the next couple of days,
so I don't really have time to code that up and satisfy myself that
it's robust; I'd prefer to have something more concrete to back this
thought up, but that's my immediate reaction for what it's worth.

We're looking for the first duplicate. So it would probably be okay
for the IGNORE case to not bother retrying and re-locking if the other
transaction committed (which, at least very broadly speaking, is the
likely outcome).

> After some thinking I don't think any solution primarily based on
> holding page level locks across other index operations is going to scale
> ok.

Just to be clear: I'm certainly not suggesting that it's okay to have
exclusive buffer locks held for any amount of time longer than
instantaneous. Furthermore, it isn't necessary to do anything
different to what we currently do with non-unique btree indexes during
speculative insertion - I don't treat them any differently, and they
only get locked in the usual way at the usual time, after heap
insertion. Much of the time, there'll only be a primary key index to
lock, and you'll have increased the window of the exclusive write-lock
(compared to the existing code when doing a regular insertion) by only
a very small amount - the duration of heap tuple insertion (the
baseline here is the duration of finding an insertion location on page
for index tuple, tuple insertion + possible page split). We're talking
about a write-lock on a btree leaf page, which is only going to block
other sessions from *finishing off* tuple insertion, and then only for
a minority of such insertions much of the time.

It's probably worth doing as much work up-front as possible (in terms
of asking questions about if we even want to lock the indexes and so
on, within ExecLockIndexTuples()), so as to decrease the window that
those locks are held. That's just an obvious optimization.

As users add unique indexes to tables, things do of course get worse
and worse. However, since in order for any of this to matter, there
has to be lots of concurrent activity in clustered ranges of values,
it's pretty likely that you'd conclude that tuple insertion should not
go ahead early on, and not end up doing too much exclusive/write
locking per tuple.

The question must be asked: if you don't think this will scale okay,
what are you comparing it to? The most useful basis of comparison is
likely to be other systems. Asking about uniqueness across many unique
indexes, and getting them all to commit to their unanimous position
for a moment in time is inherently involved - I don't see a way to do
it other than lock values, and I don't see any granularity at which
that's possible other than the btree page granularity.

> What I had in mind when playing around with this is the following trick:
> Just as in your patch split index insertion into two phases and perform
> one of them before inserting the heap tuple. In what you called
> "speculative insertion" don't stop at locking the page, but actually
> insert an index tuple in the index. As we haven't yet inserted the heap
> tuple we obviously can't insert an actual tuple. Instead set a flag on
> the item and reuse the the item pointer to store the xid of the
> inserting process. I coined those items "insertion promises"...

I did consider that sort of approach from an early stage - Simon
suggested it to me privately early last year. Actually, I even pointed
out in the 2012 developer meeting that there were unused IndexTuple
flag bits available for such purposes.

> Index searches will ignore promises they see, but concurrent index
> insertions will notice it's a promise and wait for the xid (or not, if
> it has aborted).

If inserters must block, they'll have to restart - I'm sure you'll
agree that they can't sit on even a shared/read buffer lock
indefinitely. So, have you really saved anything? Especially since
you'll have to search at least twice (more shared locks on *root*
pages) where my scheme only needs to search once. Also, are you really
sure that index searches should have license to ignore these promises?
What about dirty snapshots?

The question I should really ask about your scheme is the same
question you've asked of mine: What about multiple unique indexes on
the same table? Complexity issues aside, is it really better to get
exclusive/write locks on each of them twice (albeit shorter held
ones), and approximately double the number of shared locks too, just
to avoid concurrently held index tuple exclusive locks?

Now, I guess you probably could make all this work - somehow - but
that's even more complexity and even more cycles for vanishingly small
gain in one specific area. And frankly I don't think there was much
gain to begin with. You're glossing over too many details about what
your scheme would need to do to make a fair comparison with mine. It
seems like you're really concerned about scalability in the sense of
scaling the number of unique indexes on one table when there is lots
of concurrent activity on the same range of values across multiple
unique indexes. With my patch, and a single primary key unique index,
the overhead of this extended window of locking is very small
indeed - because we usually only walk the tree and lock once, the
overhead relative to regular insertion is fixed. 2 or 3 unique indexes
aren't much worse in practice, I suspect.

> Imo that solution is fairly elegant because it doesn't cause any heap
> bloat, and it causes the same amount of index bloat
> "insert-into-heap-first" type of solutions cause.

I don't think that's a reasonable comparison. Bloating indexes with
dead *duplicate* tuples is, in general, worse than doing the same with
the heap. If you end up with a heavily bloated index with lots of
duplicates, that will adversely affect performance worse than with
non-duplicate index tuples (see [1] for example). That was one reason
that HOT helped as much as it did, I believe.

More fundamentally, I don't like the idea that very routine bloat is
okay. I've seen people abuse unique indexes by using them in a way
that made it inevitable that there'd be a constant stream of
constraint violations, with attendant bloat. I don't like that.

> It also has pretty
> good concurrency behaviour because you never need to retain a page lock
> for longer than a normal index insertion.
> It's certainly a bit more complex than your solution tho...

It's a lot more complex. You have to write an index tuple, and WAL-log
that, just to do nothing. You may also need to WAL-log "insertion
proper", which in your scheme involved updating the index in-place.

Having said all that, I'm perfectly willing to talk about the
scalability aspects of what I've done when I produce a revision that
fixes that bug. It would certainly help your argument here if you had
a benchmark, since it is my understanding that your concern is
entirely about performance/scalability.

Thanks for your input.

Unrelated:

Does the patch need logic about looking to the primary key for
duplicates first? Or maybe, if it's a SERIAL column, to look there
last since it's unlikely to have a duplicate and it might hurt more to
lock a leaf page there? It doesn't really matter now, but for upsert
we're going to want that. For upsert, which row to update may be
ambiguous, so it makes sense to prioritise indexes for checking if
only the first duplicate found will be updated in respect of a single
inserted tuple. This kind of thing also helps performance.

*** References ***

[1] https://github.com/postgres/postgres/blob/REL9_3_STABLE/src/backend/access/nbtree/nbtinsert.c#L556

--
Peter Geoghegan


From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: INSERT...ON DUPLICATE KEY IGNORE
Date: 2013-09-01 02:53:03
Message-ID: CAM3SWZS2fvGu7rmFoCHeYDF8Qtfr+45c1B8Scy5shBEjwKCe=Q@mail.gmail.com
Lists: pgsql-hackers

On Sat, Aug 31, 2013 at 7:38 PM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
>> Imo that solution is fairly elegant because it doesn't cause any heap
>> bloat, and it causes the same amount of index bloat
>> "insert-into-heap-first" type of solutions cause.
>
> I don't think that's a reasonable comparison. Bloating indexes with
> dead *duplicate* tuples is, in general, worse than doing the same with
> the heap. If you end up with a heavily bloated index with lots of
> duplicates, that will adversely affect performance worse than with
> non-duplicate index tuples (see [1] for example). That was one reason
> that HOT helped as much as it did, I believe.

Bear in mind also that one unique index could hardly ever have real
duplicates, but it would still get "broken promise"/bloat tuples
because another, later index said it had a duplicate. All of the
indexes get bloated (quite possibly) just because of a duplicate in
one. With a table with many unique indexes and many reasons to decide
to reject a tuple proposed for insertion by INSERT...ON DUPLICATE KEY
IGNORE, it isn't hard to imagine them all becoming heavily bloated
very quickly.

--
Peter Geoghegan


From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: INSERT...ON DUPLICATE KEY IGNORE
Date: 2013-09-01 06:02:46
Message-ID: CAM3SWZQj0B2sA57GR30JEu1RQAtoULxe30uK=9_jwKCKDHvpmQ@mail.gmail.com
Lists: pgsql-hackers

On Sat, Aug 31, 2013 at 7:53 PM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
> With a table with many unique indexes and many reasons to decide
> to reject a tuple proposed for insertion by INSERT...ON DUPLICATE KEY
> IGNORE, it isn't hard to imagine them all becoming heavily bloated
> very quickly.

There may be no index tuple directly corresponding to some heap tuple
where you might expect that, due to transactions aborting (as with
traditional unique constraint violations) or because of HOT, but,
prior to now, never the other way around (heap tuples are created
first and physically killed last). That is an assumption that has
existed since the Berkeley days, and it is something that may be
treated as an invariant in some places. Autovacuum is one such place,
I'd say. Isn't it the case that the autovacuum launcher daemon might
not hear about all of this index-only bloat, even if it was extremely
serious? Isn't the threshold it follows only in respect of heap tuples
(even if the statistics collector does collect index tuple stats
separately)?

--
Peter Geoghegan


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Peter Geoghegan <pg(at)heroku(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: INSERT...ON DUPLICATE KEY IGNORE
Date: 2013-09-01 11:06:39
Message-ID: 20130901110639.GA5695@awork2.anarazel.de
Lists: pgsql-hackers

On 2013-08-31 19:38:49 -0700, Peter Geoghegan wrote:
> On Sat, Aug 31, 2013 at 11:34 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> >> > Won't S2 in this case exclusively lock a page in foo_a_key (the one
> >> > where 2 will reside on) for 3600s, while it's waiting for S1 to finish
> >> > while getting the speculative insertion into foo_b_key?
> >>
> >> Yes. The way the patch currently handles that case is unacceptable. It
> >> needs to be more concerned about holding an exclusive lock on
> >> earlier-locked indexes when locking the second and subsequent indexes
> >> iff it blocks on another transaction in the course of locking the
> >> second and subsequent indexes.
>
> I think that the fix here may be to do something analogous to what the
> code already does: release all buffer locks, wait for the other
> transaction to commit/abort, and try again. The difference being that
> we try again across all unique indexes rather than just this one. I
> have other obligations to take care of over the next couple of days,
> so I don't really have time to code that up and satisfy myself that
> it's robust; I'd prefer to have something more concrete to back this
> thought up, but that's my immediate reaction for what it's worth.

That could probably be made to work. I still have some concerns with
the fundamentals of your concept, which I voice later in the mail...

> We're looking for the first duplicate. So it would probably be okay
> for the IGNORE case to not bother retrying and re-locking if the other
> transaction committed (which, at least very broadly speaking, is the
> likely outcome).

Hm. Could you explain this a bit more? Do you mean that we can stop
trying to insert after the first violation (yes, sure if so)?

> > After some thinking I don't think any solution primarily based on
> > holding page level locks across other index operations is going to scale
> > ok.
>
> Just to be clear: I'm certainly not suggesting that it's okay to have
> exclusive buffer locks held for any amount of time longer than
> instantaneous. Furthermore, it isn't necessary to do anything
> different to what we currently do with non-unique btree indexes during
> speculative insertion - I don't treat them any differently, and they
> only get locked in the usual way at the usual time, after heap
> insertion. Much of the time, there'll only be a primary key index to
> lock, and you'll have increased the window of the exclusive write-lock
> (compared to the existing code when doing a regular insertion) by only
> a very small amount - the duration of heap tuple insertion (the
> baseline here is the duration of finding an insertion location on page
> for index tuple, tuple insertion + possible page split). We're talking
> about a write-lock on a btree leaf page, which is only going to block
> other sessions from *finishing off* tuple insertion, and then only for
> a minority of such insertions much of the time.

I can't really agree with this part. First off, a heap_insert()
certainly isn't a lightweight operation. There are several things that can
take extended time:
* toasting of rows (writing a gigabyte worth of data...)
* extension of the heap/toast table
* writeout of a dirty victim buffer if the page isn't in s_b
* reading in the target page if the page isn't in s_b
* waiting for an exclusive lock on the target pages

Secondly, you seem to be saying that such a lock would only block other
sessions from finishing off tuple insertions - but it will block normal
index searches as well? I have a hard time accepting the fact that one
session using INSERT ... KEY IGNORE will block lookup of a range of values
from the index in *other* sessions.

> The question must be asked: if you don't think this will scale okay,
> what are you comparing it to? The most useful basis of comparison is
> likely to be other systems. Asking about uniqueness across many unique
> indexes, and getting them all to commit to their unanimous position
> for a moment in time is inherently involved - I don't see a way to do
> it other than lock values, and I don't see any granularity at which
> that's possible other than the btree page granularity.

I'd expect that with the current implementation strategy if you
e.g. made pgbench use INSERT .. ON DUPLICATE IGNORE without ever
actually hitting duplicates, you'd see a very, very noticeable hit in
scalability. I wouldn't expect a single session to get much slower, but
trying a dozen or two should show a very noticeable dip.

> > What I had in mind when playing around with this is the following trick:
> > Just as in your patch split index insertion into two phases and perform
> > one of them before inserting the heap tuple. In what you called
> > "speculative insertion" don't stop at locking the page, but actually
> > insert an index tuple in the index. As we haven't yet inserted the heap
> > tuple we obviously can't insert an actual tuple. Instead set a flag on
> > the item and reuse the the item pointer to store the xid of the
> > inserting process. I coined those items "insertion promises"...
>
> I did consider that sort of approach from an early stage - Simon
> suggested it to me privately early last year. Actually, I even pointed
> out in the 2012 developer meeting that there were unused IndexTuple
> flag bits available for such purposes.

Yea, I am not trying to say that I am the first one having that idea,
just that that's what I had investigated.

> > Index searches will ignore promises they see, but concurrent index
> > insertions will notice it's a promise and wait for the xid (or not, if
> > it has aborted).
>
> If inserters must block, they'll have to restart - I'm sure you'll
> agree that they can't sit on even a shared/read buffer lock
> indefinitely.

We already have the logic for waiting in _bt_doinsert() - we can just
reuse the logic already there for plain unique insertions. Just that we
would know the xid to wait on without doing actual heap lookups (in
_bt_check_unique()).

> So, have you really saved anything? Especially since
> you'll have to search at least twice (more shared locks on *root*
> pages) where my scheme only needs to search once.

I think we could optimize away the second search with some
cleverness. If we remember & pin the page from the first search we know
that the promise will have to be either on that page or - after a page
split - potentially on ones to the right, which we should be able to
find by repeatedly stepping right, similar to the logic of doing the
_bt_moveright() in _bt_doinsert().

> Also, are you really sure that index searches should have license to
> ignore these promises? What about dirty snapshots?

Well, normal index searches are looking for actual heap tuples, which
do not yet exist at that point, so we cannot return anything for those. And
given that all current logic using dirty snapshots works with us
inserting the index tuples only after inserting heap tuples I cannot see
this causing distress there.
Obviously we cannot ignore the promises from inside the btree insertion
code, since them being visible there is their raison d'être...

> The question I should really ask about your scheme is the same
> question you've asked of mine: What about multiple unique indexes on
> the same table? Complexity issues aside, is it really better to get
> exclusive/write locks on each of them twice (albeit shorter held
> ones), and approximately double the number of shared locks too, just
> to avoid concurrently held index tuple exclusive locks?

I am pretty sure it would be better. We know that nbtree works pretty
damn well under concurrency. And the "promise" logic wouldn't change
anything of that.
Sure, the ON DUPLICATE IGNORE inserter will have a slightly higher
overhead because it has to navigate to wherever the promise has been
split to and then modify the promise to point to an on-disk tuple instead
of the xid. But it won't have to do splits a second time or anything
like that. And it will just need to lock one page at a time.

Furthermore, I am pretty sure that locking an index page *and* related
heap pages exclusively at the same time will be a deadlock
nightmare. Even in the case of only having a single unique index.

> Now, I guess you probably could make all this work - somehow - but
> that's even more complexity and even more cycles for vanishingly small
> gain in one specific area. And frankly I don't think there was much
> gain to begin with. You're glossing over too many details about what
> your scheme would need to do to make a fair comparison with mine. It
> seems like you're really concerned about scalability in the sense of
> scaling the number of unique indexes on one table when there is lots
> of concurrent activity on the same range of values across multiple
> unique indexes.

No, I am primarily concerned of the performance of concurrent inserts
and concurrent lookups, even in the case of a single unique index. I
don't think we can disregard multiple unique indexes - they are not all
that uncommon - but I think we'll see problems before that.

> With my patch, and a single primary key unique index,
> the overhead of this is extended window of locking is very small
> indeed - because we usually only walk the tree and lock once, the
> overhead relative to regular insertion is fixed. 2 or 3 unique indexes
> aren't much worse in practice, I suspect.

Maybe I am missing something, but I can't see how that could be the
case.

> > Imo that solution is fairly elegant because it doesn't cause any heap
> > bloat, and it causes the same amount of index bloat
> > "insert-into-heap-first" type of solutions cause.
>
> I don't think that's a reasonable comparison. Bloating indexes with
> dead *duplicate* tuples is, in general, worse than doing the same with
> the heap. If you end up with a heavily bloated index with lots of
> duplicates, that will adversely affect performance worse than with
> non-duplicate index tuples (see [1] for example). That was one reason
> that HOT helped as much as it did, I believe.

But we wouldn't ever insert promises when there is a duplicate in the
index? We obviously need to check for existing unique violations before
inserting the promise?
Also, unless we abort with a fatal error, we can relatively easily mark
all those promises as dead when the insertion doesn't succeed due to
other indexes failing the uniqueness check (that won't merge pages though).

> > It also has pretty
> > good concurrency behaviour because you never need to retain a page lock
> > for longer than a normal index insertion.
> > It's certainly a bit more complex than your solution tho...
>
> It's a lot more complex.

No denying there, although I think your scheme will also grow a good bit
more complex ;).

Please don't get me wrong, I'd be very happy if your scheme can be made
to work, it certainly has advantages. I just don't see how, that's why I
am explaining the problems I see and what solution *I* can see. But
then, I unsurprisingly have thought more about my idea of doing things,
so I might completely miss some solutions to your idea.

Also, I think, even if we were to switch schemes, about 70% of your
patch could be used with barely any changes. It's mostly the btree
parts that would have to change.

> You have to write an index tuple, and WAL-log
> that, just to do nothing. You may also need to WAL-log "insertion
> proper", which in your scheme involved updating the index in-place.

You certainly would need to log that, yes.

> Having said all that, I'm perfectly willing to talk about the
> scalability aspects of what I've done when I produce a revision that
> fixes that bug. It would certainly help your argument here if you had
> a benchmark, since it is my understanding that your concern is
> entirely about performance/scalability.

Yes. And deadlocks due to holding the exclusive lock for a prolonged time.

> Does the patch need logic about looking to the primary key for
> duplicates first? Or maybe, if it's a SERIAL column, to look there
> last since it's unlikely to have a duplicate and it might hurt more to
> lock a leaf page there? It doesn't really matter now, but for upsert
> we're going to want that. For upsert, which row to update may be
> ambiguous, so it makes sense to prioritise indexes for checking if
> only the first duplicate found will be updated in respect of a single
> inserted tuple. This kind of thing also helps performance.

Yes, I wondered whether we could sort the indexes to move the ones with
the highest conflict ratio upfront. But I am afraid that doing so with a
scheme that involves keeping exclusive locks across multiple other
operations would result in deadlocks.
We might want to keep unique violation stats per index to do this
efficiently. I think keeping those would be a good idea anyway.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Peter Geoghegan <pg(at)heroku(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: INSERT...ON DUPLICATE KEY IGNORE
Date: 2013-09-01 11:13:57
Message-ID: 20130901111357.GB5695@awork2.anarazel.de
Lists: pgsql-hackers

On 2013-08-31 23:02:46 -0700, Peter Geoghegan wrote:
> On Sat, Aug 31, 2013 at 7:53 PM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
> > With a table with many unique indexes and many reasons to decide
> > to reject a tuple proposed for insertion by INSERT...ON DUPLICATE KEY
> > IGNORE, it isn't hard to imagine them all becoming heavily bloated
> > very quickly.
>
> There may be no index tuple directly corresponding to some heap tuple
> where you might expect that, due to transactions aborting (as with
> traditional unique constraint violations) or because of HOT, but,
> prior to now, never the other way around (heap tuples are created
> first and physically killed last). That is an assumption that has
> existed since the Berkeley days, and it is something that may be
> treated as an invariant in some places. Autovacuum is one such place,
> I'd say. Isn't it the case that the autovacuum launcher daemon might
> not hear about all of this index-only bloat, even if it was extremely
> serious? Isn't the threshold it follows only in respect of heap tuples
> (even if the statistics collector does collect index tuple stats
> separately)?

I don't think the amount of bloat should get that big since we should
immediately kill the promises when we've detected that we conflict on a
following unique index. The reason vacuum needs to care at all is only
that we might have crashed after inserting the promise but before
deleting it and thus haven't cleaned up. That itself isn't too bad since
that should happen infrequently but if we then get an xid wraparound
later we could be waiting on the wrong transaction.

Sure, there would need to be some changes to that code. I think we can
trigger that fairly easily by just triggering relation vacuums via
relfrozenxid and enforce an index bulkdelete on full-table vacuums even
if there are no dead tuples.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Peter Geoghegan <pg(at)heroku(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: INSERT...ON DUPLICATE KEY IGNORE
Date: 2013-09-02 06:38:08
Message-ID: 1378103888.10345.3.camel@vanquo.pezone.net
Lists: pgsql-hackers

On Fri, 2013-08-30 at 15:09 -0700, Peter Geoghegan wrote:
> The attached WIP patch implements this for Postgres, with a few
> notable differences:
>
Your patch breaks the ECPG regression tests:

../../preproc/ecpg --regression -I./../../include -o insupd.c -I. insupd.pgc
insupd.pgc:22: ERROR: syntax error at or near "into"
make[3]: *** [insupd.c] Error 3


From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: INSERT...ON DUPLICATE KEY IGNORE
Date: 2013-09-02 06:40:34
Message-ID: CAM3SWZT_+RyNbeCm8hG_9rG+LCm6KCEdxv=KzXt=X0nJMQTmVQ@mail.gmail.com
Lists: pgsql-hackers

On Sun, Sep 1, 2013 at 11:38 PM, Peter Eisentraut <peter_e(at)gmx(dot)net> wrote:
> ../../preproc/ecpg --regression -I./../../include -o insupd.c -I. insupd.pgc
> insupd.pgc:22: ERROR: syntax error at or near "into"
> make[3]: *** [insupd.c] Error 3

I'll work on fixing it then. Thanks.

--
Peter Geoghegan


From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: INSERT...ON DUPLICATE KEY IGNORE
Date: 2013-09-02 13:25:43
Message-ID: 522491D7.2070705@2ndquadrant.com
Lists: pgsql-hackers

On 08/31/2013 06:40 AM, Josh Berkus wrote:
>> > 3) RETURNING is expanded - "RETURNING REJECTS *" is now possible where
>> > that makes sense.
> Oh, nifty! OK, now I can *really* use this feature.

Absolutely; especially combined with COPY to a staging TEMPORARY or
UNLOGGED table.

It'll be yet another way for people to get upsert wrong, of course.
They'll use a wCTE with RETURNING REJECTS to do an UPDATE of the rejects
w/o locking the table against writes first. Documenting this pitfall
should be enough, though.
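
To make the pitfall concrete, I'd expect people to write roughly the
following (sketch only, using the proposed syntax; "target" and "staging"
are made-up tables with matching (id, val) columns, the staging table
typically having been filled with COPY beforehand):

    WITH rejects AS (
        INSERT INTO target (id, val)
        SELECT id, val FROM staging
        ON DUPLICATE KEY IGNORE
        RETURNING REJECTS *
    )
    UPDATE target t
    SET    val = rejects.val
    FROM   rejects
    WHERE  t.id = rejects.id;

Without first taking a lock that keeps out concurrent writers (say,
LOCK TABLE target IN SHARE ROW EXCLUSIVE MODE in the same transaction), a
row that was rejected by the INSERT can be deleted or have its key changed
before the UPDATE runs, so the "upsert" silently loses changes.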

Speaking of upsert, I'm starting to think that to solve the upsert
problem without forcing full-table locking on people we need a lock type
that conflicts only with DELETE/INSERT and a way to prevent UPDATEs from
changing the key. Kind of like the table level inverse of FOR KEY UPDATE
- a way to say "you can change rows, just cannot add or remove from the
set of keys".

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Craig Ringer <craig(at)2ndquadrant(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: INSERT...ON DUPLICATE KEY IGNORE
Date: 2013-09-02 22:18:42
Message-ID: CAM3SWZSYWprR_ezYS3cZi22bZNFf8Z=CVr_qLK40dt1++CwkKA@mail.gmail.com
Lists: pgsql-hackers

On Mon, Sep 2, 2013 at 6:25 AM, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:
> It'll be yet another way for people to get upsert wrong, of course.
> They'll use a wCTE with RETURNING REJECTS to do an UPDATE of the rejects
> w/o locking the table against writes first. Documenting this pitfall
> should be enough, though.

My preferred solution is to actually provide a variant to lock the
rows implicated in the would-be unique constraint violation. Obviously
that's a harder problem to solve.

--
Peter Geoghegan


From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: INSERT...ON DUPLICATE KEY IGNORE
Date: 2013-09-03 04:59:42
Message-ID: CAM3SWZT=3nH=xVO9G-zcjW+muwhr9OiGU7Z-G5+D=fzWa9O42g@mail.gmail.com
Lists: pgsql-hackers

On Sun, Sep 1, 2013 at 4:06 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> We're looking for the first duplicate. So it would probably be okay
>> for the IGNORE case to not bother retrying and re-locking if the other
>> transaction committed (which, at least very broadly speaking, is the
>> likely outcome).
>
> Hm. Could you explain this a bit more? Do you mean that we can stop
> trying to insert after the first violation (yes, sure if so)?

Yeah, that's all I mean. Nothing controversial there.

> I can't really agree with this part. First off, a heap_insert()
> certainly isn't a lightweight operation. There's several things that can
> take extended time:
> * toasting of rows (writing a gigabyte worth of data...)
> * extension of the heap/toast table
> * writeout of a dirty victim buffer if the page isn't in s_b
> * reading in the target page if the page isn't in s_b
> * waiting for an exclusive lock on the target pages

These are all valid concerns. I'll need to better assess just how bad
this can get (hopefully with help from others). But some of this stuff
can surely be done separately for my case. For example, surely we
could make TOASTing occur after index insertion proper when locks are
released, without too much work.

> Secondly, you seem to be saying that such a lock would only block other
> sessions from finishing tuple insertions - but it will block normal
> index searches as well? I have a hard time accepting the fact that one
> session using INSERT ... KEY IGNORE will block lookup of a range of values
> from the index in *other* sessions.

I'm sorry that I created the impression that I was denying an impact
on read queries as compared to regular insertion. Of course, regular
insertion also blocks read queries for very small intervals, and there
could be quite a bit of work to be done with that exclusive lock held
already. I am only proposing increasing that period for this case, and
usually by not terribly much.

> I'd expect that with the current implementation strategy if you
> e.g. made pgbench use INSERT .. ON DUPLICATE IGNORE without ever
> actually hitting duplicates, you'd see a very, very noticeable hit in
> scalability. I wouldn't expect a single session to get much slower but
> trying a dozen or two should show a very noticeable dip.

I do not accept the premise of your questions regarding the patch's
performance, which appears to be that it's fair to expect the same
level of performance of INSERT ... ON DUPLICATE IGNORE as it is to
expect of regular INSERT. No scheme is going to get that, including
your scheme, which necessitates WAL logging everything related to
index tuple insertion twice for each unique index. We certainly didn't
require SSI to perform as well as the current REPEATABLE READ, and
certainly not as well as READ COMMITTED, and for good reasons.

That said, I'm not dismissive of these concerns. As I said already, by
all means, let us scrutinise the performance of the patch.

I'm not going to bother benchmarking the patch myself until I give it
another whack. There is some really obvious low-hanging performance
fruit. Obviously by this I mean doing as little as possible when
holding the exclusive lock.

> I think we could optimize away the second search with some
> cleverness. If we remember & pin the page from the first search we know
> that the promise will have to be either on that page or - after a page
> split - potentially on pages to the right. Which we should be able to find
> by repeatedly stepping right, similar to the logic of the
> _bt_moveright() call in _bt_doinsert().

That may well be possible, but we'd still have to do a binary search
on the page. And under your proposal, that will happen twice - each
time with an exclusive buffer lock held. Since our intent would be to
update in-place, the entire binary search operation would need a full
write/exclusive lock the second time around as well. You might even
end up doing more work with an exclusive/write lock held than under my
proposal, before you even factor in the other increased costs
(granted, that's not that likely, but you're still doing much more
with exclusive locks held than in the regular insertion case).

What you describe here is the kind of trick that could be very useful
for another, existing common case -- blocking on an xid in the event
of a possible duplicate (as you know, it's okay to hold pins pretty
much indefinitely) -- and yet has remained unimplemented for many
years.

> I am pretty sure it would be better. We know that nbtree works pretty
> damn well under concurrency. And the "promise" logic wouldn't change
> anything of that.

It would have multiple extra steps. Including WAL-logging of the
initial insertion, and clean-up for the IGNORE case. And WAL-logging
of the cleanup. Instead of doing exactly nothing extra in the IGNORE
case, which, under my proposal, has an additional duration of the
exclusive lock being held of zero (instead of the ereport() call we
just immediately release the lock and report the duplicate). The
IGNORE case matters too, and there is going to be considerably more
exclusive locking for that case with your proposal, before we even
talk about other overhead.

> Furthermore, I am pretty sure that locking an index page *and* related
> heap pages exclusively at the same time will be a deadlock
> nightmare. Even in the case of only having a single unique index.

It seems like what you're saying here is "I don't know, this
extended index locking thing seems pretty sketchy to me". Which is a
perfectly reasonable thing to say in the absence of better analysis
from either of us - it is sketchy. I think about the same of your idea
of creating a whole new category of bloat, and doing all that extra
work, though.

I grant you entirely that:

1) My analysis of the deadlock hazards generally, but particularly wrt
interactions with heap page locks, has to date not been terribly
extensive. I've thought about it, but not in enough detail. It made
more sense for me to test the waters about the basic approach here
first.

2) In order for us to commit this, or a patch implementing any other
similar scheme, there needs to be at the very least a highly
scrutinized, watertight analysis of this and other deadlock hazards.

I have taken a certain amount of reassurance from the fact that I was
not able to observe any deadlocks when pushing the patch hard -
perhaps that's naive. I mean hammering a table with inserts using
pgbench for 3 hours, with 64 INSERT...ON DUPLICATE KEY IGNORE sessions.
PK values were over a range from 1 to 15 million. This was
with shared_buffers set to 512MB and assertions enabled. It stood up
without any deadlocks or other manifestations of bugs. Granted, I did
that before fixing the bug you've highlighted (but after you pointed
it out), and so I haven't tested the multi-index case in the same way.
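
For anyone who wants to try something similar: a two-line custom pgbench
script along these lines is enough to drive that kind of workload (an
illustrative sketch, not the exact file I used). It assumes a table
created beforehand as "create table ins_ignore_test(id int4 primary key,
val text)", and would be run with something like
"pgbench -n -f ins_ignore.sql -c 64 -T 10800":

\setrandom id 1 15000000
INSERT INTO ins_ignore_test(id, val) VALUES (:id, 'val') ON DUPLICATE KEY IGNORE;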

>> Now, I guess you probably could make all this work - somehow - but
>> that's even more complexity and even more cycles for vanishingly small
>> gain in one specific area. And frankly I don't think there was much
>> gain to begin with. You're glossing over too many details about what
>> your scheme would need to do to make a fair comparison with mine. It
>> seems like you're really concerned about scalability in the sense of
>> scaling the number of unique indexes on one table when there is lots
>> of concurrent activity on the same range of values across multiple
>> unique indexes.
>
> No, I am primarily concerned about the performance of concurrent inserts
> and concurrent lookups, even in the case of a single unique index. I
> don't think we can disregard multiple unique indexes - they are not all
> that uncommon - but I think we'll see problems before that.

I'm actually more worried about deadlock caused by the heap buffer
locks than anything else. I don't think that locking multiple indexes
in this way is really much of a special case at all. I was annoyed at
myself for having overlooked such a simple bug, but it happened
because my testing of multiple unique indexes was lacking.

> Also, unless we abort with a fatal error, we can relatively easily mark
> all those promises as dead when the insertion doesn't succeed due to
> other indexes failing the uniqueness check (that won't merge pages though).

I suppose that's true. But then, you're getting into more extra steps.
The first step, and the clean-up step. Plus WAL-logging each. Just to
do nothing.

>> It's a lot more complex.
>
> No denying there, although I think your scheme will also grow a good bit
> more complex ;).

I suspect you're right. :-)

> Please don't get me wrong, I'd be very happy if your scheme can be made
> to work, it certainly has advantages. I just don't see how, that's why I
> am explaining the problems I see and what solution *I* can see. But
> then, I unsurprisingly have thought more about my idea of doing things,
> so I might completely miss some solutions to your idea.

Fair enough.

> Yes, I wondered whether we could sort the indexes to move the ones with
> the highest conflict ratio upfront. But I am afraid that doing so with a
> scheme that involves keeping exclusive locks across multiple other
> operations would result in deadlocks.
> We might want to keep unique violation stats per index to do this
> efficiently. I think keeping those would be a good idea anyway.

Why should unique index violation stats be a reliable proxy here? In
the IGNORE case, a violation doesn't occur, so you'd have to track
those separately.

I don't think that there'd be any deadlock hazards if the ordering was
a function of properties of relations that can only be changed by DDL.
Which is all I had in mind.

--
Peter Geoghegan


From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Peter Geoghegan <pg(at)heroku(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: INSERT...ON DUPLICATE KEY IGNORE
Date: 2013-09-03 07:52:50
Message-ID: 52259552.7000406@2ndquadrant.com
Lists: pgsql-hackers

On 09/03/2013 06:18 AM, Peter Geoghegan wrote:
> On Mon, Sep 2, 2013 at 6:25 AM, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:
>> It'll be yet another way for people to get upsert wrong, of course.
>> They'll use a wCTE with RETURNING REJECTS to do an UPDATE of the rejects
>> w/o locking the table against writes first. Documenting this pitfall
>> should be enough, though.
>
> My preferred solution is to actually provide a variant to lock the
> rows implicated in the would-be unique constraint violation. Obviously
> that's a harder problem to solve.

That'd certainly be the ideal, but as you say is far from simple, and
IIRC prior discussions here have suggested the SSI / predicate locking
stuff won't help.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Craig Ringer <craig(at)2ndquadrant(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: INSERT...ON DUPLICATE KEY IGNORE
Date: 2013-09-03 09:11:43
Message-ID: CAM3SWZQ5r=EMrqpiUENTYARJTqvG-pr3tnd_qd7Zg2gw9u7EwQ@mail.gmail.com
Lists: pgsql-hackers

On Tue, Sep 3, 2013 at 12:52 AM, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:
> That'd certainly be the ideal, but as you say is far from simple, and
> IIRC prior discussions here have suggested the SSI / predicate locking
> stuff won't help.

I think that by far the strongest argument for what Andres suggested
(that has, incidentally, been suggested by a number of others), which
is to insert what he termed promise index tuples before a heap tuple,
is that there is no sensible way to lock the tuples while holding an
exclusive buffer lock on an index.

It could take basically any amount of time to acquire those locks (see
GetTupleForTrigger() to get a rough idea of how that ought to work),
and it would be totally unacceptable if that meant that lots of people
blocked on the page lock that an upserter held while it tried to
acquire locks on tuples (we could only unlock the buffer lock/locked
value when the rows were locked). So while I think what I've done here
is probably workable for the ON DUPLICATE KEY IGNORE case, perhaps I'm
being short-sighted in focusing on that. I'm surprised Andres didn't
call me out on this himself, actually.

--
Peter Geoghegan


From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Craig Ringer <craig(at)2ndquadrant(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: INSERT...ON DUPLICATE KEY IGNORE
Date: 2013-09-03 09:48:18
Message-ID: CAM3SWZSAjMqtAd6mBBWBcmxNKQt8350EQ4V38kMm9Q9ozvmEpA@mail.gmail.com
Lists: pgsql-hackers

On Tue, Sep 3, 2013 at 2:11 AM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
> and it would be totally unacceptable if that meant that lots of people
> blocked on the page lock that an upserter held while it tried to
> acquire locks on tuples

By which I mean: it seems shaky that I'd then be assuming that to lock
all of a row's unique index values was to lock the row itself, and so
that there was no need to worry about blocking on acquiring an actual
row lock while holding all those exclusive/write buffer locks.

--
Peter Geoghegan


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Peter Geoghegan <pg(at)heroku(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: INSERT...ON DUPLICATE KEY IGNORE
Date: 2013-09-03 13:26:54
Message-ID: 20130903132654.GC5783@awork2.anarazel.de
Lists: pgsql-hackers

On 2013-09-02 21:59:42 -0700, Peter Geoghegan wrote:
> > I can't really agree with this part. First off, a heap_insert()
> > certainly isn't a lightweight operation. There's several things that can
> > take extended time:
> > * toasting of rows (writing a gigabyte worth of data...)
> > * extension of the heap/toast table
> > * writeout of a dirty victim buffer if the page isn't in s_b
> > * reading in the target page if the page isn't in s_b
> > * waiting for an exclusive lock on the target pages
>
> These are all valid concerns. I'll need to better assess just how bad
> this can get (hopefully with help from others). But some of this stuff
> can surely be done separately for my case. For example, surely we
> could make TOASTing occur after index insertion proper when locks are
> released, without too much work.

I don't think those are that easy to solve, but I'd like to be surprised.

E.g. the toasting cannot really happen after the main table insert since
the inserted row needs to refer to the toast id. And you don't even
know the final size of a row without doing the toasting, since how much
it will get compressed depends on the amount of free space on the
current target buffer.

> > I'd expect that with the current implementation strategy if you
> > e.g. made pgbench use INSERT .. ON DUPLICATE IGNORE without ever
> > actually hitting duplicates, you'd see a very, very noticeable hit in
> > scalability. I wouldn't expect a single session to get much slower but
> > trying a dozen or two should show a very noticeable dip.
>
> I do not accept the premise of your questions regarding the patch's
> performance, which appears to be that it's fair to expect the same
> level of performance of INSERT ... ON DUPLICATE IGNORE as it is to
> expect of regular INSERT. No scheme is going to get that, including
> your scheme, which necessitates WAL logging everything related to
> index tuple insertion twice for each unique index. We certainly didn't
> require SSI to perform as well as the current REPEATABLE READ, and
> certainly not as well as READ COMMITTED, and for good reasons.

I think it's ok that INSERT .. IGNORE itself is noticeably slower than a
normal INSERT. What I am concerned about is it negatively impacting
overall scalability.

> I'm not going to bother benchmarking the patch myself until I give it
> another whack. There is some really obvious low-hanging performance
> fruit. Obviously by this I mean doing as little as possible when
> holding the exclusive lock.

Sounds fair.

> > I am pretty sure it would be better. We know that nbtree works pretty
> > damn well under concurrency. And the "promise" logic wouldn't change
> > anything of that.

> It would have multiple extra steps. Including WAL-logging of the
> initial insertion, and clean-up for the IGNORE case. And WAL-logging
> of the cleanup. Instead of doing exactly nothing extra in the IGNORE
> case, which, under my proposal, has an additional duration of the
> exclusive lock being held of zero (instead of the ereport() call we
> just immediately release the lock and report the duplicate). The
> IGNORE case matters too, and there is going to be considerably more
> exclusive locking for that case with your proposal, before we even
> talk about other overhead.

What I mean is that none of the individual proposed steps require
changing the locking semantics/granularity. I.e. we will never hold
locks on more pages at once than we do for a normal insert. Yes, one
page will get locked twice, but for a shorter time than a normal insert (all
you need to do is to rewrite the target of the item) and without having
any other pages locked.

> > Furthermore, I am pretty sure that locking an index page *and* related
> > heap pages exclusively at the same time will be a deadlock
> > nightmare. Even in the case of only having a single unique index.

> It seems like what you're saying here is "I don't know, this
> extended index locking thing seems pretty sketchy to me". Which is a
> perfectly reasonable thing to say in the absence of better analysis
> from either of us - it is sketchy.

Yes, that's basically what I am saying.

I'd really like some input from others on this. Perhaps you could write
up an email/thread just about the locking semantics after the next
version of the patch (since that presumably will change the semantics)?

> I have taken a certain amount of reassurance from the fact that I was
> not able to observe any deadlocks when pushing the patch hard -
> perhaps that's naive. I mean hammering a table with inserts using
> pgbench for 3 hours, with 64 INSERT...ON DUPLICATE KEY IGNORE sessions.
> PK values were over a range from 1 to 15 million. This was
> with shared_buffers set to 512MB and assertions enabled. It stood up
> without any deadlocks or other manifestations of bugs. Granted, I did
> that before fixing the bug you've highlighted (but after you pointed
> it out), and so I haven't tested the multi-index case in the same way.

I can readily believe that there are no deadlocks with rather
straightforward schemas/queries like in pgbench. I think it's more
likely to be problematic in more complex scenarios. Running pgbench with
--foreign-keys might be enough to make that scenario more interesting.

> > No, I am primarily concerned about the performance of concurrent inserts
> > and concurrent lookups, even in the case of a single unique index. I
> > don't think we can disregard multiple unique indexes - they are not all
> > that uncommon - but I think we'll see problems before that.
>
> I'm actually more worried about deadlock caused by the heap buffer
> locks than anything else.

The above was meant in the context of performance/scalability.

But yes, I think the more fundamental issue is likely to be the locking
semantics.

> > Yes, I wondered whether we could sort the indexes to move the ones with
> > the highest conflict ratio upfront. But I am afraid that doing so with a
> > scheme that involves keeping exclusive locks across multiple other
> > operations would result in deadlocks.
> > We might want to keep unique violation stats per index to do this
> > efficiently. I think keeping those would be a good idea anyway.

> Why should unique index violation stats be a reliable proxy here? In
> the IGNORE case, a violation doesn't occur, so you'd have to track
> those separately.

I assumed we'd track violations on a per-index basis and would include
INSERT .. IGNORE and UPSERTs that detect that there already is a
preexisting row.

> I don't think that there'd be any deadlock hazards if the ordering was
> a function of properties of relations that can only be changed by DDL.
> Which is all I had in mind.

Depends on the locking scheme. If it's possible that one backend does an
INSERT IGNORE with the old order and one with the new one, you certainly
have a deadlock hazard.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Peter Geoghegan <pg(at)heroku(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: INSERT...ON DUPLICATE KEY IGNORE
Date: 2013-09-04 20:26:53
Message-ID: CA+TgmoYrJxbRTha_+fbwH=TsiudgSb5Ze7iXe8=jwJkB9PV9PA@mail.gmail.com
Lists: pgsql-hackers

On Sat, Aug 31, 2013 at 2:34 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> After some thinking I don't think any solution primarily based on
> holding page level locks across other index operations is going to scale
> ok.

I'd like to chime in with a large +1 for this sentiment and pretty
much everything else Andres said further downthread. The operations
across which you're proposing to hold buffer locks seem at least an
order of magnitude too complex to get away with something like that.
Concurrent readers will block in a non-interruptible wait if they try
to access a buffer, and that's a situation that will be intolerable
if, for example, it can persist across a disk I/O. And I don't see
any way to avoid that.

One possible alternative to inserting promises into the index pages
themselves might be to use some kind of heavyweight lock. The way
that SIREAD locks work is not entirely dissimilar to what's needed
here, I think. Of course, the performance implications of checking
for lots of extra locks during inserts could be pretty bad, so you'd
probably need some way of avoiding that in common cases, which I don't
know exactly how to do, but maybe there's a way.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Gavin Flower <GavinFlower(at)archidevsys(dot)co(dot)nz>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Peter Geoghegan <pg(at)heroku(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: INSERT...ON DUPLICATE KEY IGNORE
Date: 2013-09-04 20:42:35
Message-ID: 52279B3B.2020404@archidevsys.co.nz
Lists: pgsql-hackers

On 05/09/13 08:26, Robert Haas wrote:
> On Sat, Aug 31, 2013 at 2:34 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> After some thinking I don't think any solution primarily based on
>> holding page level locks across other index operations is going to scale
>> ok.
> I'd like to chime in with a large +1 for this sentiment and pretty
> much everything else Andres said further downthread. The operations
> across which you're proposing to hold buffer locks seem at least an
> order of magnitude too complex to get away with something like that.
> Concurrent readers will block in a non-interruptible wait if they try
> to access a buffer, and that's a situation that will be intolerable
> if, for example, it can persist across a disk I/O. And I don't see
> any way to avoid that.
>
> One possible alternative to inserting promises into the index pages
> themselves might be to use some kind of heavyweight lock. The way
> that SIREAD locks work is not entirely dissimilar to what's needed
> here, I think. Of course, the performance implications of checking
> for lots of extra locks during inserts could be pretty bad, so you'd
> probably need some way of avoiding that in common cases, which I don't
> know exactly how to do, but maybe there's a way.
>
How about an 'Expensive bit' (of course, renamed to sound more
professional and to better indicate what it does!) - if the bit is set,
then do the expensive processing. This should have minimal impact for
the common case, so extensive checking would only be required when lots
of locks need to be checked.

I strongly suspect that the situation is way more complicated than I
imply above - but possibly a more sophisticated version of the above
might help?


From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: INSERT...ON DUPLICATE KEY IGNORE
Date: 2013-09-04 22:01:57
Message-ID: CAM3SWZRLu80UtKfdS4V0Sdt1Qv3-_7Ey2bhOCayjLYt7oTmkHQ@mail.gmail.com
Lists: pgsql-hackers

On Wed, Sep 4, 2013 at 1:26 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> Concurrent readers will block in a non-interruptible wait if they try
> to access a buffer, and that's a situation that will be intolerable
> if, for example, it can persist across a disk I/O. And I don't see
> any way to avoid that.

Then I have some bad news for you - that's already quite possible.
_bt_insertonpg() is called with the very same buffer exclusive locked,
and is where we do btree page splits. The first thing that _bt_split
does is this:

/* Acquire a new page to split into */
rbuf = _bt_getbuf(rel, P_NEW, BT_WRITE);

(Obviously this may ultimately result in the storage manager extending
the index relation).

Plus the split is WAL-logged immediately afterwards, which could
result in us blocking on someone else's I/O under contention (granted,
the XLogInsert scaling patch has now considerably ameliorated that
general problem). All the while holding an exclusive lock on the same
buffer. Note also that _bt_insertonpg() is called again recursively
after a page split. And of course we WAL-log btree index tuple
insertion proper all the time.

Let me be clear about something, though: I am not at all dismissive of
Andres' concerns. I was concerned about many of the same things before
I posted the patch.

I think that Andres and I ought to re-frame this discussion a little
bit. Right now, the presumption we seem to be making, perhaps without
even realizing it, is that this is about providing functionality equivalent
to MySQL's INSERT IGNORE; insert tuples proposed for insertion where
possible, otherwise do not. However, Andres and I, not to mention
almost every Postgres user, are actually much more interested in
something like INSERT...ON DUPLICATE KEY LOCK FOR UPDATE.

That's what this mechanism has to support, even if it is *technically*
possible to commit just INSERT...ON DUPLICATE KEY IGNORE and punt on
these other questions. It was silly of me not to do that up-front. So
I'm going to try and produce a patch that does this as well for my
next revision.

Maybe this will enable Andres to refute my position that the buffer
locking approach to speculative insertion/value locking may actually
be acceptable. If that gets us closer to having the feature committed
in some form, then I welcome it. I fully expect to be held to this new
standard - it would be insane to do anything less. I don't want to
throw out an old IGNORE value locking mechanism and invent a whole new
one for upserting a little bit down the line.

--
Peter Geoghegan


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Peter Geoghegan <pg(at)heroku(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: INSERT...ON DUPLICATE KEY IGNORE
Date: 2013-09-04 22:39:37
Message-ID: 20130904223937.GA860@awork2.anarazel.de
Lists: pgsql-hackers

On 2013-09-04 15:01:57 -0700, Peter Geoghegan wrote:
> On Wed, Sep 4, 2013 at 1:26 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> > Concurrent readers will block in a non-interruptible wait if they try
> > to access a buffer, and that's a situation that will be intolerable
> > if, for example, it can persist across a disk I/O. And I don't see
> > any way to avoid that.
>
> Then I have some bad news for you - that's already quite possible.
> _bt_insertonpg() is called with the very same buffer exclusive locked,
> and is where we do btree page splits. The first thing that _bt_split
> does is this:
>
> /* Acquire a new page to split into */
> rbuf = _bt_getbuf(rel, P_NEW, BT_WRITE);
>
> (Obviously this may ultimately result in the storage manager extending
> the index relation).

I don't think that's an argument for much TBH. Those operations are much
less heavyweight than the ones you're proposing to hold the pages locked
over, and there actually is a guarantee of forward progress. And it's very
hard to avoid locking a page exclusively once you've decided that you
need to split the page. You cannot just release the lock while you look
for a victim buffer.

> I think that Andres and I ought to re-frame this discussion a little
> bit. Right now, the presumption we seem to be making, perhaps without
> even realizing it, is that this is about providing functionality equivalent
> to MySQL's INSERT IGNORE; insert tuples proposed for insertion where
> possible, otherwise do not. However, Andres and I, not to mention
> almost every Postgres user, are actually much more interested in
> something like INSERT...ON DUPLICATE KEY LOCK FOR UPDATE.

Yes, the promises approach gets more advantageous if you think about
UPSERT because most of the work will be paid off when the UPDATE occurs.

> Maybe this will enable Andres to refute my position that the buffer
> locking approach to speculative insertion/value locking may actually
> be acceptable.

Sorry to be harsh here, but I don't think I need to do that. I've
explained most of the reasons I see that that approach won't work out
and so far I don't see those refuted. And to me those issues seem to be
fatal for the approach. If you find a solution to the problems noted
above - great. So far it seems neither Robert nor I see how that is
possible, but that obviously doesn't mean it's impossible that you find
a way.
But why should I argue further until you prove me wrong (a newer patch or
an explanation of changed algorithms)? If you don't think my arguments are
valid, well, I've brought up those I see as relevant and that's
it. Can't do much more.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: INSERT...ON DUPLICATE KEY IGNORE
Date: 2013-09-04 23:16:16
Message-ID: CAM3SWZSZWkdjwjKCki5YH-YZ2kr5+sO+St0h5cfY8bn+pARfeQ@mail.gmail.com
Lists: pgsql-hackers

On Wed, Sep 4, 2013 at 3:39 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> Sorry to be harsh here, but I don't think I need to do that. I've
> explained most of the reasons I see that that approach won't work out
> and so far I don't see those refuted. And to me those issues seem to be
> fatal for the approach. If you find a solution to the problems noted
> above - great. So far it seems neither Robert nor I see how that is
> possible, but that obviously doesn't mean it's impossible that you find
> a way.

It seems you've misunderstood. My position was that I don't think it's
terribly useful to have a discussion about approaches to value locking
without considering how that needs to fit in with row locking too. So
it makes sense as an immediate goal to introduce that into the patch.
Any scheme is going to be constrained by having to think about the
interplay with value locking and row locking going forward.

For example, in my scheme, I couldn't block on locking the row if that
meant that buffer locks would be held indefinitely. There are also
deadlock hazards there for either scheme that must be carefully
considered.

What possible objection could you have? Suppose it was the case that I
was dead set on using buffer locking like this, because I'm stubborn
or whatever. I've just made life harder for myself, while probably not
putting the same degree of burden on alternative proposals. Maybe
I am stubborn, but I don't think I'm stubborn about the basic approach
taken in this particular patch. I've merely been pointing out, as I
feel is my role as a participant in the community's adversarial system
of reaching agreement, the problems that exist with your proposal, and
some inconsistencies in your objections to mine.

Obviously the basic approach will remain the most difficult and
probably controversial part of this. Even if I threw my hands up and
immediately accepted everything you said, that would still be true. We
need to get all of the constraints in place sooner rather than later.

> But why should I argue further until you prove me wrong (a newer patch or
> an explanation of changed algorithms)?

I didn't ask you to. You shouldn't.

> If you don't think my arguments are
> valid, well, I've brought up those I see as relevant and that's
> it. Can't do much more.

Uh, I just said that I thought your arguments were totally valid. I
couldn't have been clearer about that. Actually, I'm pretty surprised
that you haven't been the one insisting that I add a row locking
component from quite early on for exactly these reasons.

--
Peter Geoghegan


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Peter Geoghegan <pg(at)heroku(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: INSERT...ON DUPLICATE KEY IGNORE
Date: 2013-09-05 00:08:48
Message-ID: c0ddb74c-41d9-4dd5-9a3c-64a5f6f05dad@email.android.com
Lists: pgsql-hackers

Hi,

We seem to be miscommunicating a bit.

You've proposed a framework and algorithm for something I'd really, really like to see. I don't think that it can work exactly as you proposed, so I roughly sketched out a solution I can see.
I don't want "my" idea to win, I want a idea to win. I haven't fleshed out my idea to the point where I would consider it something ready to implement or something.
You're the patch author here whose plans are laid open to be scrutinized ;). If you think my idea has merit, use and adapt it to reality. If not, find another, better, solution.

Even if our path to that goal is confrontational at times, the goal is to find a good solution, not the argument itself.

I haven't argued about INSERT ... DUPLICATE LOCK because the page locking scheme doesn't seem to work out for plain DUPLICATE. No need to think/argue about the fancier version in that case.

Regards,

Andres
Please excuse brevity and formatting - I am writing this on my mobile phone.

Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: INSERT...ON DUPLICATE KEY IGNORE
Date: 2013-09-05 00:47:22
Message-ID: CAM3SWZQVJg_bJA9B-G93VoVezH07FEqx8BcqzrXM1JUggAqBpA@mail.gmail.com
Lists: pgsql-hackers

On Wed, Sep 4, 2013 at 5:08 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> I don't want "my" idea to win, I want a idea to win.

I know. I want the same thing.

> You're the patch author here whose plans are laid open to be scrutinized ;). If you think my idea has merit, use and adapt it to reality. If not, find another, better, solution.

Sure.

> Even if our path to that goal is confrontational at times, the goal is to find a good solution, not the argument itself.

Agreed.

> I haven't argued about INSERT ... DUPLICATE LOCK because the page locking scheme doesn't seem to work out for plain DUPLICATE. No need to think/argue about the fancier version in that case.

I see where you're coming from, but my point is precisely that adding
a row locking component *isn't* fancier. I've come to realize that
it's an integral part of the patch, and that my previous omission of
row locking - and the subsequent defence of that decision I made in
passing - was ridiculous. In a world where IGNORE/not locking is a
feature we support, it can only exist as an adjunct to ON DUPLICATE
KEY LOCK - certainly not the other way around.

The tail cannot be allowed to wag the dog.

In posting the patch with a row locking component, I'll only be asking
you to consider that aspect separately. You may find that seeing the
problems I encounter and how I handle them will make you (or others)
re-assess your thoughts on value locking in a direction that nobody
expects right now. Equally, I myself may reassess things.

Now, I don't guarantee that that's the case, but it certainly seems
very possible. And so even if I were to concede right now that the
buffer locking approach is not workable, I feel it would be a little
premature to seriously get down to talking about the alternatives in
detail.

--
Peter Geoghegan


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Peter Geoghegan <pg(at)heroku(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: INSERT...ON DUPLICATE KEY IGNORE
Date: 2013-09-05 16:52:20
Message-ID: CA+TgmoYqPiL1JJk=Bh6JthidDqOrHJRza+v7+bid4thQ26Npjw@mail.gmail.com
Lists: pgsql-hackers

On Wed, Sep 4, 2013 at 8:08 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> You've proposed a framework and algorithm for something I'd really, really like to see. I don't think that it can work exactly as you proposed, so I roughly sketched out a solution I can see.
> I don't want "my" idea to win, I want a idea to win. I haven't fleshed out my idea to the point where I would consider it something ready to implement or something.
> You're the patch author here whose plans are laid open to be scrutinized ;). If you think my idea has merit, use and adapt it to reality. If not, find another, better, solution.
>
> Even if our path to that goal is confrontational at times, the goal is to find a good solution, not the argument itself.

Totally agreed.

> I haven't argued about INSERT ... DUPLICATE LOCK because the page locking scheme doesn't seem to work out for plain DUPLICATE. No need to think/argue about the fancier version in that case.

Check.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company