Re: Problems with approach #2 to value locking (INSERT ... ON CONFLICT UPDATE/IGNORE patch)

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Cc: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Subject: Problems with approach #2 to value locking (INSERT ... ON CONFLICT UPDATE/IGNORE patch)
Date: 2015-01-02 07:17:39
Message-ID: CAM3SWZTfTt_fehet3tU3YKCpCYPYnNaUqUZ3Q+NAASnH-60teA@mail.gmail.com
Lists: pgsql-hackers

I've been working on fixing the bugs that Jeff Janes found [1] with
approach #2 to value locking [2]. Approach #1 was unaffected.

I'm starting this new thread to discuss these issues. Let's try to
confine discussion of semantics and syntax to the main thread, since
that is what has mostly been discussed there.

Jeff produced a stress testing tool which tests concurrent deletions
(and not just concurrent updates); his test case deletes tuples when a
counter value goes under zero, which happens occasionally. It also
independently does bookkeeping for all values in a range of values
that are randomly incremented or decremented, and verifies that
everything is as it should be. This revealed problems with #2 around
concurrent "super deletion" of "unkept" promise heap tuples. As Jeff
pointed out, all-NULL tuples were appearing as visible, or spurious
duplicates appeared. For example:

postgres=# select xmin, xmax, ctid, * from upsert_race_test where index = 1287;
  xmin  |  xmax  |   ctid    | index  | count
--------+--------+-----------+--------+--------
      0 | 205438 | (260,27)  | [null] | [null]
 176066 |      0 | (258,158) |   1287 |      1
(2 rows)

I should point out that one of the problems I found here was a
garden-variety bug, which has been fixed (the conflict flag in
ExecInsert() was not being reset correctly, I believe). However, the
other remaining problems could only be fixed with further changes to
the routines in tqual.c, which I'm not happy about and which require
further discussion here. I attach a differential patch which applies
on top of the most recent revision of #2, v1.8.vallock2.tar.gz [3].

Note that I've been extending Jeff's stress testing tool to make it
highlight problems earlier, and to make it provide more and better
instrumentation (e.g. printing of XIDs to correlate with pg_xlogdump
output). A good trick I added was to have an INSERT...ON CONFLICT
UPDATE predicate of "WHERE TARGET.index = EXCLUDED.index", which
serves as a kind of assertion when used within the Perl script (the
Perl script checks RETURNING-projected tuples, and expects to always
insert or update - that WHERE clause should always pass, obviously).
Jeff's tool is available here (with my revisions):
https://github.com/petergeoghegan/jjanes_upsert
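
For illustration, the statement the script issues looks roughly like
this (using the upsert_race_test table from the psql output above; the
SET arithmetic is only a sketch, and the exact ON CONFLICT spelling is
whatever the patch currently accepts):

INSERT INTO upsert_race_test (index, count)
VALUES (1287, 1)
ON CONFLICT UPDATE
  SET count = TARGET.count + 1
  WHERE TARGET.index = EXCLUDED.index
RETURNING index, count;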

I've been running with fsync off, which Jeff felt increased the
likelihood of revealing the bug. It isn't necessary, though, and FWIW
it was easy for me to recreate this problem once I understood it,
using only my 4 core laptop. I found it important to build without
assertions and at optimization level -O2, though.

The attached patch fixes Jeff's test case, as far as it goes. But, as
I mentioned, it's not intended as a real fix. For one thing, I have
made multiple changes to HeapTupleSatisfies*() routines, but haven't
fully considered the consequences for, say, ri_triggers.c.
HeapTupleSatisfiesSelf() and HeapTupleSatisfiesHistoricMVCC() were not
modified. I've modified HeapTupleSatisfiesVacuum() to see
InvalidTransactionId (not invalid xid infomask bits) as visible for
garbage collection (alongside HeapTupleSatisfiesDirty() and
HeapTupleSatisfiesMVCC(), which I changed first), and I wouldn't be
surprised if that created new problems in other areas. In short, this
patch is just for illustrative purposes.

To see the difference this patch makes, I suggest commenting out the
new code, and running the following test after a fresh initdb:

$ perl count_upsert.pl 4 100000

I think that the actual nature of the problem I fixed was that
concurrent upserters felt entitled to update a promise tuple that they
could at least initially see, but which was then "super deleted" - and
yet, as far as they could then tell, it was still okay to update (the
tuple wasn't actually duplicated insofar as HeapTupleSatisfiesDirty()
was able to ascertain, due to other concurrent activity). It's pretty
complicated, in any case. This is just a band-aid.

Thoughts? Does anyone have any thoughts on my diagnosis of the problem?

[1] http://www.postgresql.org/message-id/CAMkU=1w8e9Q6ZX3U85RtsyCMpdYWFrvVAO3=uNEvtqiRUzntaQ@mail.gmail.com
[2] https://wiki.postgresql.org/wiki/Value_locking#.232._.22Promise.22_heap_tuples_.28Heikki_Linnakangas.29
[3] http://www.postgresql.org/message-id/CAM3SWZRg_hTrOL-6_wfe6_d_UcUYc28JfaPsFh_tra76GkkdNw@mail.gmail.com
--
Peter Geoghegan

Attachment Content-Type Size
fix_val2_races.patch text/x-patch 4.2 KB

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Cc: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Subject: Re: Problems with approach #2 to value locking (INSERT ... ON CONFLICT UPDATE/IGNORE patch)
Date: 2015-01-03 00:43:00
Message-ID: CAM3SWZR-0oc1AZYkaDgw0QekbFy2PMRRij1AGEmpOG-DwoGm5w@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jan 1, 2015 at 11:17 PM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
> I've been working on fixing the bugs that Jeff Janes found [1] with
> approach #2 to value locking [2]. Approach #1 was unaffected.

Unfortunately, exclusion constraints (which only value locking
approach #2 has support for, for the IGNORE variant) are broken, even
with my fix for the largely unrelated issues also described in my
opening mail. I now believe that it is inherently impossible to
support exclusion constraints with INSERT ... ON CONFLICT IGNORE, if
we are insistent that there cannot be livelocks, unprincipled
deadlocks or spurious exclusion violations....and I think you all know
how I feel about those :-) . I think that we should drop support for
exclusion constraints, regardless of whatever else happens with value
locking.

Previously, my mental model of approach #2 was that it more or less
handled B-Trees and exclusion constraint related indexes equivalently.
This is untrue, though.

Today, in general, exclusion constraints more or less work by
physically inserting a heap tuple and index tuple relating to the
exclusion constraint (typically this is with a GiST index). There
could well be a conflict due to a concurrent insertion, or even due to
the fact that there was a conflicting tuple all along. The
implementation tries to find a conclusively committed conflicting
tuple. If it does not find such a tuple, it knows that its own already
physically inserted tuple is enough of a reason for others to have to
worry about it, and so it may proceed and commit the xact -- once it
does a whole index scan with only its own tuple found (and not any
overlapping tuple from an unfinished xact), it's clear that it can
proceed.
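
To make this concrete, here is a minimal sketch of the kind of
constraint involved -- an exclusion constraint that is semantically
equivalent to a unique constraint, of the GiST + btree_gist variety
that the test case further down uses (the table and column names here
are made up):

CREATE EXTENSION IF NOT EXISTS btree_gist;

CREATE TABLE exclusion_race
(
    val int,
    EXCLUDE USING gist (val WITH =)
);

-- A plain INSERT physically inserts the heap and index tuples first,
-- and only afterwards scans the index for a conflicting tuple:
INSERT INTO exclusion_race (val) VALUES (1);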

Approach #2 to value locking makes exclusion constraints do a
pre-check in order to implement IGNORE, but that isn't really
significant here. The reason that #2 is broken for exclusion
constraints but not for unique indexes is that unique index insertion
reports a conflict in the process of inserting. In contrast, exclusion
constraints insert optimistically, and re-check. What's the
difference? Well, even though (since the implementation passes
UNIQUE_CHECK_PARTIAL for unique index insertion) we still insert when
there is a conflict, there is an important distinction: value locking.
That is to say, even though we still insert, we also still use buffer
locks as value locks in the usual way that unique indexes do (we hold
an exclusive buffer lock on the first B-Tree leaf page the value could
be on, as always). We still decide whether or not there is a conflict
while that exclusive buffer lock is held, which is a sufficient choke
point to make the determination conclusive. This implies that even
with many concurrent insertions into a newly created table, there will
definitely be one inserter that conclusively does *not* have a
conflict, while the others might (depending on the status of the guy
who conclusively had *no* conflict, or his line of successor xacts
that may or may not themselves commit).

In a private e-mail to Heikki, I pointed out that he was mistaken when
he said that there is a pre-existing problem with exclusion
constraints: it is not possible for concurrent inserters to both
spuriously get exclusion violations when only one needs to. Heikki
accepted this. However, I now realize that Heikki was closer to being
right about it than I was; you can't get spurious exclusion
violations, but you can get something that's about the same, which is
a deadlock. That usually doesn't matter that much, because people
usually don't worry about the issue with exclusion constraints. But in
a world where they support INSERT ... ON CONFLICT IGNORE, that isn't
acceptable, IMV.

I have a test case:
https://github.com/petergeoghegan/upsert/blob/master/exclusion.sh

I see both spurious exclusion violations and unprincipled deadlocks
with this simple IGNORE-based test case, which uses a GiST + btree_gist
exclusion constraint. I didn't see livelocks, which would have been
really bad; a mutual need to wait on each other's xact causes a
deadlock instead, and the deadlock detector detects those. I think
that the spurious exclusion violations are merely an artifact of the
implementation, as opposed to an inherent problem with exclusion
constraints (in other words, I don't withdraw my remarks from the last
paragraph - only (arguably) unprincipled deadlocks occur with ordinary
exclusion constraint enforcement under lots of concurrency).

--
Peter Geoghegan


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Peter Geoghegan <pg(at)heroku(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Subject: Re: Problems with approach #2 to value locking (INSERT ... ON CONFLICT UPDATE/IGNORE patch)
Date: 2015-01-03 09:29:03
Message-ID: 54A7B65F.1060009@vmware.com
Lists: pgsql-hackers

On 01/03/2015 02:43 AM, Peter Geoghegan wrote:
> On Thu, Jan 1, 2015 at 11:17 PM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
>> I've been working on fixing the bugs that Jeff Janes found [1] with
>> approach #2 to value locking [2]. Approach #1 was unaffected.
>
> Unfortunately, exclusion constraints (which only value locking
> approach #2 has support for, for the IGNORE variant) are broken, even
> with my fix for the largely unrelated issues also described in my
> opening mail. I now believe that it is inherently impossible to
> support exclusion constraints with INSERT ... ON CONFLICT IGNORE, if
> we are insistent that there cannot be livelocks, unprincipled
> deadlocks or spurious exclusion violations....and I think you all know
> how I feel about those :-) . I think that we should drop support for
> exclusion constraints, regardless of whatever else happens with value
> locking.
>
> Previously, my mental model of approach #2 was that it more or less
> handled B-Trees and exclusion constraint related indexes equivalently.
> This is untrue, though.
>
> Today, in general, exclusion constraints more or less work by
> physically inserting a heap tuple and index tuple relating to the
> exclusion constraint (typically this is with a GiST index). There
> could well be a conflict due to a concurrent insertion, or even due to
> the fact that there was a conflicting tuple all along. The
> implementation tries to find a conclusively committed conflicting
> tuple. If it does not find such a tuple, it knows that its own already
> physically inserted tuple is enough of a reason for others to have to
> worry about it, and so it may proceed and commit the xact -- once it
> does a whole index scan with only its own tuple found (and not any
> overlapping tuple from an unfinished xact), it's clear that it can
> proceed.
>
> Approach #2 to value locking makes exclusion constraints do a
> pre-check in order to implement IGNORE, but that isn't really
> significant here. The reason that #2 is broken for exclusion
> constraints but not for unique indexes is that unique index insertion
> reports a conflict in the process of inserting. In contrast, exclusion
> constraints insert optimistically, and re-check. What's the
> difference? Well, even though (since the implementation passes
> UNIQUE_CHECK_PARTIAL for unique index insertion) we still insert when
> there is a conflict, there is an important distinction: value locking.
> That is to say, even though we still insert, we also still use buffer
> locks as value locks in the usual way that unique indexes do (we hold
> an exclusive buffer lock on the first B-Tree leaf page the value could
> be on, as always). We still decide whether or not there is a conflict
> while that exclusive buffer lock is held, which is a sufficient choke
> point to make the determination conclusive. This implies that even
> with many concurrent insertions into a newly created table, there will
> definitely be one inserter that conclusively does *not* have a
> conflict, while the others might (depending on the status of the guy
> who conclusively had *no* conflict, or his line of successor xacts
> that may or may not themselves commit).
>
> In a private e-mail to Heikki, I pointed out that he was mistaken when
> he said that there is a pre-existing problem with exclusion
> constraints: it is not possible for concurrent inserters to both
> spuriously get exclusion violations when only one needs to. Heikki
> accepted this. However, I now realize that Heikki was closer to being
> right about it than I was; you can't get spurious exclusion
> violations, but you can get something that's about the same, which is
> a deadlock. That usually doesn't matter that much, because people
> usually don't worry about the issue with exclusion constraints. But in
> a world where they support INSERT ... ON CONFLICT IGNORE, that isn't
> acceptable, IMV.
>
> I have a test case:
> https://github.com/petergeoghegan/upsert/blob/master/exclusion.sh
>
> I see both spurious exclusion violations and unprincipled deadlocks
> with this simple IGNORE-based test case, which uses a GiST + btree_gist
> exclusion constraint. I didn't see livelocks, which would have been
> really bad; a mutual need to wait on each other's xact causes a
> deadlock instead, and the deadlock detector detects those. I think
> that the spurious exclusion violations are merely an artifact of the
> implementation, as opposed to an inherent problem with exclusion
> constraints (in other words, I don't withdraw my remarks from the last
> paragraph - only (arguably) unprincipled deadlocks occur with ordinary
> exclusion constraint enforcement under lots of concurrency).

I read this email several times but could not understand it. Please explain
in words of one syllable how the deadlock arises. Then, please find a
way to fix it.

- Heikki


From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Subject: Re: Problems with approach #2 to value locking (INSERT ... ON CONFLICT UPDATE/IGNORE patch)
Date: 2015-01-03 10:00:52
Message-ID: CAM3SWZQgP2j5TU0F-OcAgnDtxy2MdfQQdFRBhTNhrmQW01xZ4g@mail.gmail.com
Lists: pgsql-hackers

On Sat, Jan 3, 2015 at 1:29 AM, Heikki Linnakangas
<hlinnakangas(at)vmware(dot)com> wrote:
> Please explain in words of one syllable how the deadlock arises. Then,
> please find a way to fix it.

I believe that the deadlock arises because there is no choke point.
Exclusion constraint insertions always insert first, then check for a
conflict later. Whereas with B-Trees, we check *while* inserting, at
the critical point when an exclusive buffer lock is held on the first
leaf page a value could be on.

Two concurrent exclusion constraint inserters can easily insert at
exactly the same time, and then wait on each other's xact, and then
deadlock. That can't happen with B-Tree inserts because the checking
and insertion happen at the same time, when that exclusive buffer lock
is held. Some inserter establishes the right to insert, and then
actually inserts atomically, and when it releases the buffer lock
every other inserter will see for itself that it has inserted (and
established the right to do so).
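
To spell out the dangerous interleaving (only a sketch, against a toy
exclusion-constrained table like the one upthread; hitting it for real
requires both inserts to reach the index at almost exactly the same
instant):

-- Session 1:
INSERT INTO exclusion_race VALUES (1);
--   heap and index tuples physically inserted; conflict check not yet run

-- Session 2, at nearly the same instant:
INSERT INTO exclusion_race VALUES (1);
--   its heap and index tuples are physically inserted as well

-- Session 1's conflict check now finds session 2's in-progress tuple
-- and waits on xact 2; session 2's check finds session 1's tuple and
-- waits on xact 1.  Mutual wait: the deadlock detector aborts one of
-- the two with "deadlock detected".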

I'm sorry, but I honestly don't see a way to fix this one. It would
take a very novel approach, since exclusion constraints can work with
any amgettuple AM. I briefly thought about doing something crazy with
the deadlock detector, but apart from anything else I think that might
introduce livelock risks.
--
Peter Geoghegan


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Peter Geoghegan <pg(at)heroku(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Subject: Re: Problems with approach #2 to value locking (INSERT ... ON CONFLICT UPDATE/IGNORE patch)
Date: 2015-01-03 10:41:49
Message-ID: 54A7C76D.3070101@vmware.com
Lists: pgsql-hackers

On 01/03/2015 12:00 PM, Peter Geoghegan wrote:
> Two concurrent exclusion constraint inserters can easily insert at
> exactly the same time, and then wait on each other's xact, and then
> deadlock. That can't happen with B-Tree inserts because the checking
> and insertion happen at the same time, when that exclusive buffer lock
> is held. Some inserter establishes the right to insert, and then
> actually inserts atomically, and when it releases the buffer lock
> every other inserter will see for itself that it has inserted (and
> established the right to do so).

A-ha, I see. And this can happen without INSERT ON CONFLICT, too? In
that case, one of the transactions is bound to error and roll back
anyway, but you get a deadlock error instead of the constraint violation
error, which is not as nice.

> I'm sorry, but I honestly don't see a way to fix this one. It would
> take a very novel approach, since exclusion constraints can work with
> any amgettuple AM. I briefly thought about doing something crazy with
> the deadlock detector, but apart from anything else I think that might
> introduce livelock risks.

Some ideas off the top of my head:

1. On conflict, mark the inserted tuple as killed, and retry. But before
retrying, acquire a new kind of lock on the table, let's call it
SpeculativeInsertionLock. This fixes the deadlock, by retrying instead
of sleeping, and avoids the livelock because the new lock ensures that
only one backend retries at a time.

2. Use e.g. the XID (or backend id or something) to resolve the tie.
When you have inserted a tuple and find that it conflicts with another
in-progress insertion, check the conflicting tuple's xmin. If it's <
current XID, wait for the other inserter to finish. Otherwise kill the
already-inserted tuple first, and wait only after that.

3. Don't allow the deadlock checker to kick in. Instead, use timeout
with exponential backoff to avoid getting stuck in the livelock
indefinitely.

Can there be other lock types involved in the deadlock? AFAICS it's
always going to be between two or more backends that wait on each other
XactLockTableWait (or some variant of that specific to speculative
insertions).

- Heikki


From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Subject: Re: Problems with approach #2 to value locking (INSERT ... ON CONFLICT UPDATE/IGNORE patch)
Date: 2015-01-03 20:42:24
Message-ID: CAM3SWZTMgDJY7kZcTUBG7XPRcpCzbeV7DzrGbMiKvkOkF6n3Wg@mail.gmail.com
Lists: pgsql-hackers

On Sat, Jan 3, 2015 at 2:41 AM, Heikki Linnakangas
<hlinnakangas(at)vmware(dot)com> wrote:
> A-ha, I see. And this can happen without INSERT ON CONFLICT, too? In that
> case, one of the transactions is bound to error and roll back anyway, but
> you get a deadlock error instead of the constraint violation error, which is
> not as nice.

Agreed.

I haven't experimentally verified that this can happen with exclusion
constraints without the ON CONFLICT patch being involved, but I would
be very surprised if it didn't. How could it possibly not happen? It's
not all that bad since, as you say, one or the other xact was going to
be aborted anyway. Since the insertions would have to occur at exactly
the same time, even if one backend were to get an exclusion violation
rather than being killed by the deadlock detector, the choice of which
backend to raise an error within would be essentially random anyway.

There are probably additional factors that make the situation more
complicated than it is for the ON CONFLICT patch, but it's clear to me
that the mutual dependency that can be involved with two ordinary
exclusion constraint inserters is reason enough for those to deadlock.

>> I'm sorry, but I honestly don't see a way to fix this one. It would
>> take a very novel approach, since exclusion constraints can work with
>> any amgettuple AM. I briefly thought about doing something crazy with
>> the deadlock detector, but apart from anything else I think that might
>> introduce livelock risks.
>
> Some ideas off the top of my head:
>
> 1. On conflict, mark the inserted tuple as killed, and retry. But before
> retrying, acquire a new kind of lock on the table, let's call it
> SpeculativeInsertionLock. This fixes the deadlock, by retrying instead of
> sleeping, and avoids the livelock because the new lock ensures that only one
> backend retries at a time.

We "super delete" the tuple on retry already. However, we wait for the
other xact with our broken promise tuple still physically inserted
into the heap. We can't super delete the tuple before
XactLockTableWait()/SpeculativeInsertionWait() lock acquisition,
because doing so risks livelock. I think you already agree with me up
to here.

However, if we're not sleep waiting on the other xact (rather, we're
"retrying [with a new class of exclusive table-level lock] instead of
sleeping"), why should our xact be able to do useful work on retry?
Why won't it just spin/busy wait? More to the point, why wouldn't it
deadlock when it is obliged to wait on a third inserter's tuple?
AFAICT, you've just moved the problem around, because now we're
obliged to get a shared lock on the xid or speculative token
(XactLockTableWait()/SpeculativeInsertionWait()) of a session that
itself wants this new table level lock that only we have.

> 2. Use e.g. the XID (or backend id or something) to resolve the tie. When
> you have inserted a tuple and find that it conflicts with another
> in-progress insertion, check the conflicting tuple's xmin. If it's < current
> XID, wait for the other inserter to finish. Otherwise kill the
> already-inserted tuple first, and wait only after that.

That sounds scary. I think it might introduce livelock hazards. What
about some third xact, that's waiting on our XID, when we ourselves
super delete the already-inserted tuple in deference to the first
xact/inserter? I also worry about the bloating this could cause.

> 3. Don't allow the deadlock checker to kick in. Instead, use timeout with
> exponential backoff to avoid getting stuck in the livelock indefinitely.

I don't know if that would work, but it seems very undesirable...to
add a new behavior to the deadlock detector just to make ON CONFLICT
IGNORE work with exclusion constraints. It seems to me like the best
suggestion, though.

> Can there be other lock types involved in the deadlock? AFAICS it's always
> going to be between two or more backends that wait on each other with
> XactLockTableWait (or some variant of that specific to speculative
> insertions).

I believe you're right about that. But don't forget that at least with
the current implementation, we also get spurious exclusion violations.
I think that's because we super-delete, as we must for other reasons
(so they're not a concern for regular inserters today, as I said). So
solving the problem for regular inserters is probably easier than
solving it for the ON CONFLICT patch.

--
Peter Geoghegan


From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Subject: Re: Problems with approach #2 to value locking (INSERT ... ON CONFLICT UPDATE/IGNORE patch)
Date: 2015-01-04 06:16:20
Message-ID: CAM3SWZTWVui+m0mYFa+jRyVwnA71t20zHLhXexUQp+TNZe7Emw@mail.gmail.com
Lists: pgsql-hackers

On Sat, Jan 3, 2015 at 12:42 PM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
> There are probably additional factors that make the situation more
> complicated than it is for the ON CONFLICT patch, but it's clear to me
> that the mutual dependency that can be involved with two ordinary
> exclusion constraint inserters is reason enough for those to deadlock.

I looked at the code in more detail, and realized that there were old
bugs in the exclusion constraint related modifications. I attach a
delta patch that fixes them. This is a combined patch that is all that
is needed to apply on top of v1.8.vallock2.tar.gz [1] to have all
available bugfixes.

This unbreaks my previous exclusion constraint test case, as far as it
goes. I am still concerned about the risk of lock starvation/livelocks
here. But I must admit, having stressed the patch with this latest
fix, that it's possible that the real risks have been overestimated.
Maybe this is in the same category as the way L&Y's algorithm could
theoretically starve a session performing a page split that needs to
move right indefinitely. It's a theoretical risk that turns out not to
matter in practice, for reasons that lack a real theoretical
justification beyond the fact that sustained very bad luck - a
sustained "perfect storm" - is required for starvation to occur. OTOH,
maybe my OS scheduler just doesn't have the characteristics that
provoke the suspected bug. I really don't know, but I suppose that one
or the other concurrent inserter will tend to win the race often
enough - a draw may be a very rare thing.

[1] http://www.postgresql.org/message-id/CAM3SWZRg_hTrOL-6_wfe6_d_UcUYc28JfaPsFh_tra76GkkdNw@mail.gmail.com
--
Peter Geoghegan

Attachment Content-Type Size
combined_bugfixes.patch text/x-patch 7.8 KB

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Subject: Re: Problems with approach #2 to value locking (INSERT ... ON CONFLICT UPDATE/IGNORE patch)
Date: 2015-01-04 08:14:19
Message-ID: CAM3SWZTRVfmVBJptCUpZSFydso8WJvWTaMZSv+UFmfqpoeu=Vw@mail.gmail.com
Lists: pgsql-hackers

On Sat, Jan 3, 2015 at 10:16 PM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
> I looked at the code in more detail, and realized that there were old
> bugs in the exclusion constraint related modifications. I attach a
> delta patch that fixes them. This is a combined patch that is all that
> is needed to apply on top of v1.8.vallock2.tar.gz [1] to have all
> available bugfixes.

I've updated Jeff Janes' test suite to support testing of exclusion
constraints that are equivalent to unique indexes:

https://github.com/petergeoghegan/jjanes_upsert/commit/a941f423e9500b847b1a9d1805ba52cb11db0ae9

(This requires a quick hack to the Postgres source code to accept
exclusion constraints as ON CONFLICT UPDATE arbiters).

So far, everything seems okay with exclusion constraints, as far as I
can determine using the stress tests that we have. This is an
encouraging sign.

--
Peter Geoghegan


From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Cc: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Subject: Re: Problems with approach #2 to value locking (INSERT ... ON CONFLICT UPDATE/IGNORE patch)
Date: 2015-01-06 01:53:24
Message-ID: CAM3SWZSd3MU=c9HcP-SxCbUcLvFWT8t9MZCotOJxuE3RbTvd0A@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jan 1, 2015 at 11:17 PM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
> The attached patch fixes Jeff's test case, as far as it goes. But, as
> I mentioned, it's not intended as a real fix. For one thing, I have
> made multiple changes to HeapTupleSatisfies*() routines, but haven't
> fully considered the consequences for, say, ri_triggers.c.
> HeapTupleSatisfiesSelf() and HeapTupleSatisfiesHistoricMVCC() were not
> modified. I've modified HeapTupleSatisfiesVacuum() to see
> InvalidTransactionId (not invalid xid infomask bits) as visible for
> garbage collection (alongside HeapTupleSatisfiesDirty() and
> HeapTupleSatisfiesMVCC(), which I changed first), and I wouldn't be
> surprised if that created new problems in other areas. In short, this
> patch is just for illustrative purposes.

I think that I see another race condition, one that requires custom
instrumentation of the server to demonstrate.

Basically, even with the fix above, there are still similar races. I'm
no longer trying to call ExecLockUpdateTuple() against "super deleted"
NULL tuples, because those are no longer visible to any relevant
snapshot type - that was the fix described above. However, I can still
see the tuple header xmin change between the dirty index scan and the
attempt to lock that row for update. If I'm not mistaken, the race
condition is small enough that something like Jeff's tool won't catch
it directly in a reasonable amount of time, because the new tuple
happens to have the same index key value.

It's not all that easy to recreate, even with my custom
instrumentation. With fsync=off, it occurs somewhat infrequently with
a custom instrumentation Postgres build:

pg(at)hamster:~/jjanes_upsert$ perl count_upsert.pl 8 100000
[Mon Jan 5 17:31:57 2015] init done at count_upsert.pl line 101.
[Mon Jan 5 17:32:17 2015] child abnormal exit [Mon Jan 5 17:32:17
2015] DBD::Pg::db selectall_arrayref failed: ERROR: mismatch in xmin
for (322,264). Initially 4324145, then 4324429 at count_upsert.pl line
210.\n at count_upsert.pl line 227.
[Mon Jan 5 17:32:49 2015] sum is -800
[Mon Jan 5 17:32:49 2015] count is 9515
[Mon Jan 5 17:32:49 2015] normal exit at 1420507969 after 733912
items processed at count_upsert.pl line 182.

I think that "super deleting" broken promise tuples undermines how
vacuum interlocking is supposed to work [1]. We end up at the wrong
tuple, one that happens to occupy the same slot as the old one (and
happens to have the same index key value, because the race is so small
and there are relatively few distinct values in play - it's hard to
recreate). The attached instrumentation patch was used with Jeff
Janes' tool.
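
For reference, the heap slot named in that error message can be
inspected directly, although by the time you look the line pointer may
well have been reused again (same upsert_race_test table as in my
opening mail):

SELECT xmin, xmax, ctid, *
FROM upsert_race_test
WHERE ctid = '(322,264)';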

If we weren't super deleting in the patch, then I think that our
backend's xmin horizon would be enough of an interlock (I haven't
tested that theory against the value locking approach #1
implementation yet, but I worried about the scenario quite a lot
before and things seemed okay). But we are "super deleting" with
implementation #2, and we're also not holding a pin on the heap buffer
or B-Tree leaf page buffer throughout, so this happens (if I'm not
mistaken).

[1] https://github.com/postgres/postgres/blob/REL9_3_STABLE/src/backend/access/nbtree/README#L142
--
Peter Geoghegan

Attachment Content-Type Size
xmin_changes_race.patch text/x-patch 6.2 KB

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Cc: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Subject: Re: Problems with approach #2 to value locking (INSERT ... ON CONFLICT UPDATE/IGNORE patch)
Date: 2015-01-06 07:12:11
Message-ID: CAM3SWZQHzVpqXDWVysGszFivxFrpwHNFekTuf4UTb6Dc6nvz-Q@mail.gmail.com
Lists: pgsql-hackers

On Mon, Jan 5, 2015 at 5:53 PM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
> It's not all that easy to recreate, even with my custom
> instrumentation. With fsync=off, it occurs somewhat infrequently with
> a custom instrumentation Postgres build:
>
> pg(at)hamster:~/jjanes_upsert$ perl count_upsert.pl 8 100000
> [Mon Jan 5 17:31:57 2015] init done at count_upsert.pl line 101.
> [Mon Jan 5 17:32:17 2015] child abnormal exit [Mon Jan 5 17:32:17
> 2015] DBD::Pg::db selectall_arrayref failed: ERROR: mismatch in xmin
> for (322,264). Initially 4324145, then 4324429 at count_upsert.pl line
> 210.\n at count_upsert.pl line 227.
> [Mon Jan 5 17:32:49 2015] sum is -800
> [Mon Jan 5 17:32:49 2015] count is 9515
> [Mon Jan 5 17:32:49 2015] normal exit at 1420507969 after 733912
> items processed at count_upsert.pl line 182.
>
> I think that "super deleting" broken promise tuples undermines how
> vacuum interlocking is supposed to work [1].

Hmm. On second thought, I think that the instrumentation I added here
can be confused by redirect line pointers, and that may account for
this apparent problem. Still, I would like for us to figure out how it
was previously possible for the implementation to update a
"super-deleted" tuple (resulting in those unusual all-NULL tuples),
since I thought that in order for that to happen, other xacts would
have to see what was only ever a promise tuple as a fully-fledged
tuple -- in order for the logic that pre-checks for conflicts to find
a conflict, it would have to find the tuple already committed (so it
certainly couldn't be an unresolved promise tuple). My superficial fix
[1] was written without fully understanding what the problem was.

In any case, I doubt the changes that I made to tqual.c in that fix
will be accepted as-is.

[1] http://www.postgresql.org/message-id/CAM3SWZTfTt_fehet3tU3YKCpCYPYnNaUqUZ3Q+NAASnH-60teA@mail.gmail.com
--
Peter Geoghegan


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Peter Geoghegan <pg(at)heroku(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Subject: Re: Problems with approach #2 to value locking (INSERT ... ON CONFLICT UPDATE/IGNORE patch)
Date: 2015-02-02 17:36:19
Message-ID: 54CFB593.3020509@vmware.com
Lists: pgsql-hackers

On 01/03/2015 10:42 PM, Peter Geoghegan wrote:
> On Sat, Jan 3, 2015 at 2:41 AM, Heikki Linnakangas
> <hlinnakangas(at)vmware(dot)com> wrote:
>> A-ha, I see. And this can happen without INSERT ON CONFLICT, too? In that
>> case, one of the transactions is bound to error and roll back anyway, but
>> you get a deadlock error instead of the constraint violation error, which is
>> not as nice.
>
> Agreed.
>
> I haven't experimentally verified that this can happen with exclusion
> constraints without the ON CONFLICT patch being involved, but I would
> be very surprised if it didn't. How could it possibly not happen? It's
> not all that bad since, as you say, one or the other xact was going to
> be aborted anyway. Since the insertions would have to occur at exactly
> the same time, even if one backend were to get an exclusion violation
> rather than being killed by the deadlock detector, the choice of which
> backend to raise an error within would be essentially random anyway.

Yep. I just tested this, and I can confirm that it does happen with
vanilla exclusion constraints. If two backends insert the same value at
the same time, you usually get the error "conflicting key value violates
exclusion constraint", but if the timing is right, you get "deadlock
detected" instead.

Let's focus on fixing that case first. I wouldn't care otherwise, but it
allows us to work on the locking, the "super-deletion" etc. without all
the rest of the patch. That will provide you a way to split the patch
into two: 1. Avoid deadlock errors with regular exclusion constraints,
with all the locking etc. that's needed to solve that, and 2. Upsert.

>> 1. On conflict, mark the inserted tuple as killed, and retry. But before
>> retrying, acquire a new kind of lock on the table, let's call it
>> SpeculativeInsertionLock. This fixes the deadlock, by retrying instead of
>> sleeping, and avoids the livelock because the new lock ensures that only one
>> backend retries at a time.
>
> We "super delete" the tuple on retry already. However, we wait for the
> other xact with our broken promise tuple still physically inserted
> into the heap. We can't super delete the tuple before
> XactLockTableWait()/SpeculativeInsertionWait() lock acquisition,
> because doing so risks livelock. I think you already agree with me up
> to here.
>
> However, if we're not sleep waiting on the other xact (rather, we're
> "retrying [with a new class of exclusive table-level lock] instead of
> sleeping"), why should our xact be able to do useful work on retry?
> Why won't it just spin/busy wait? More to the point, why wouldn't it
> deadlock when it is obliged to wait on a third inserter's tuple?
> AFAICT, you've just moved the problem around, because now we're
> obliged to get a shared lock on the xid or speculative token
> (XactLockTableWait()/SpeculativeInsertionWait()) of a session that
> itself wants this new table level lock that only we have.

Sorry, I was very terse in my previous explanation. Let me try again.
Let's begin with a simpler, poor-performance version of the scheme:

1. Acquire the new SpeculativeInsertionLock on the table
2. Insert tuple to heap and index.
3. Scan the index to see if there is any other conflicting tuple. (If
there isn't, or the conflicting tuple is already committed, we're done)
4. Super-delete the tuple we already inserted
5. Release SpeculativeInsertionLock
6. XactLockTableWait() on the other guy

Note that we don't try to acquire any other locks while holding
SpeculativeInsertionLock.

Compare this with the way unique-checks in b-tree work today. The B-tree
check is free of race conditions because we hold the lock on the
B-tree page while we check the visibility of any conflicting tuples on
it, and we don't insert the index tuple if we have to wait. The
SpeculativeInsertionLock
accomplishes the same. It makes steps 3 and 4 atomic.

Compared to your patch as it stands, this fixes the deadlock because
when A inserts a tuple and scans the index, any conflicting tuples it
finds must have completed the insertion before A. The other inserters
won't later try to wait on the tuple that A inserts.

We could do the above without the new lock, but as you've said, that
would risk a livelock. Two concurrent inserters might repeatedly insert,
scan, find each other, super-delete, and retry and not make progress.
The lock fixes that by ensuring that only one inserter is doing
the above at a time.

(With the above procedure, we could also first scan the index for
conflicts, and only insert after that, because the
SpeculativeInsertionLock prevents anyone else from inserting a
conflicting row. But of course, we're not going to hold an exclusive
table-level lock under normal circumstances anyway; the above was just a
prelude to the real scheme below.)

The full procedure is this:

1. Insert tuple to heap and index
2. Scan the index to see if there is any other conflicting tuple. (If
there isn't, or the conflicting tuple is already committed, we're done)
3. Super-delete the tuple we already inserted

4. Acquire SpeculativeInsertionLock on the table
5. Insert tuple to heap and index
6. Scan the index to see if there is any other conflicting tuple. (If
there isn't, or the conflicting tuple is already committed, we're done)
7. Super-delete the tuple we already inserted
8. Release SpeculativeInsertionLock
9. XactLockTableWait() on the other guy

So in a nutshell, first try to do the insertion without the table-level
lock. But because that would be prone to livelocks, if it doesn't
immediately work, retry with the table-level lock.

>> 2. Use e.g. the XID (or backend id or something) to resolve the tie. When
>> you have inserted a tuple and find that it conflicts with another
>> in-progress insertion, check the conflicting tuple's xmin. If it's < current
>> XID, wait for the other inserter to finish. Otherwise kill the
>> already-inserted tuple first, and wait only after that.
>
> That sounds scary. I think it might introduce livelock hazards. What
> about some third xact, that's waiting on our XID, when we ourselves
> super delete the already-inserted tuple in deference to the first
> xact/inserter?

He will get woken up when we super-delete the already-inserted tuple.
What problem do you see there?

I actually like this scheme the best. It's simple. I haven't made any
rigorous analysis to prove that it's correct, but I don't immediately
see a problem with it.

> I also worry about the bloating this could cause.

Well, this is all under the assumption that conflicts and
super-deletions are rare. Of course, we must avoid the situation where
we have to attempt an insertion dozens of times until it succeeds,
leaving behind 20x bloat compared to the real data. I don't think that
would happen with these schemes, although it will need some testing I guess.

>> Can there be other lock types involved in the deadlock? AFAICS it's always
>> going to be between two or more backends that wait on each other with
>> XactLockTableWait (or some variant of that specific to speculative
>> insertions).
>
> I believe you're right about that. But don't forget that at least with
> the current implementation, we also get spurious exclusion violations.

Ok, I clearly haven't been able to keep up with all the issues here. How
do spurious exclusion violations arise?

- Heikki