Re: foreign key locks, 2nd attempt

Lists: pgsql-hackers
From: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
To: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: foreign key locks, 2nd attempt
Date: 2011-11-03 18:12:49
Message-ID: 1320343602-sup-2290@alvh.no-ip.org

Hello,

After some rather extensive rewriting, I submit the patch to improve
foreign key locks.

To recap, the point of this patch is to introduce a new tuple lock mode
that lets the RI code obtain a lighter lock on tuples, one which doesn't
conflict with updates that do not modify the key columns.

So Noah Misch proposed using the FOR KEY SHARE syntax, and that's what I
have implemented here. (There was some discussion that instead of
inventing new SQL syntax we could pass the necessary lock mode
internally in the ri_triggers code. That can still be done of course,
though I haven't done so in the current version of the patch.)
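
For illustration, this is roughly the query shape involved; the real
statement is generated internally by ri_triggers.c, and the table and
column names here are made up:

```sql
-- Today the RI trigger locks the referenced row with something like:
SELECT 1 FROM ONLY parent x WHERE id = $1 FOR SHARE OF x;

-- With this patch it can use the lighter mode instead:
SELECT 1 FROM ONLY parent x WHERE id = $1 FOR KEY SHARE OF x;
```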

The other user-visible pending item is that it was said that instead of
simply using "columns used by unique indexes" as the key columns
considered by this patch, we should do some ALTER TABLE command. This
would be a comparatively trivial undertaking, I think, but I would like
there to be consensus that this is really the way to go before I
implement it.

There are three places that have been extensively touched for this to be
possible:

- multixact.c stores two flag bits for each member transaction of a
MultiXactId. With those two flags we can tell whether each member
transaction is a key-share locker, a Share locker, an Exclusive
locker, or an updater. This also required new truncation logic:
previously we could truncate multixact as soon
as the member xacts went below the oldest multi generated by current
transactions. The new code cannot do this, because some multis can
contain updates, which means that they need to persist beyond that.
The new design is to truncate multixact segments when the
corresponding Xid is frozen by vacuum -- to this end, we keep track
of RecentGlobalXmin (and corresponding Xid epoch) on each multixact
SLRU segment, and remove previous segments when that Xid is frozen.
These RecentGlobalXmin and epoch values are stored in the first two
multixact/offset values in the first page of each segment.
(AFAICT there are serious bugs in the implementation of this, but I
believe the basic idea to be sound.)

- heapam.c needs some new logic to keep closer track of multixacts
after updates and locks.

- tqual needed to be touched extensively too, mainly so that we
understand that some multixacts can contain updates -- and this needs
to show as HeapTupleBeingUpdated (or equivalent) when consulted.

The new code mostly works fine but I'm pretty sure there must be serious
bugs somewhere. Also, in places, heap_update and heap_lock_tuple have
become spaghetti-like, though I'm not sure I see better ways to write them.

I would like some opinions on the ideas on this patch, and on the patch
itself. If someone wants more discussion on implementation details of
each part of the patch, I'm happy to provide a textual description --
please just ask.

--
Álvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>

Attachment Content-Type Size
fklocks-4.patch application/octet-stream 228.1 KB

From: Jeroen Vermeulen <jtv(at)xs4all(dot)nl>
To: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2011-11-06 07:28:52
Message-ID: 4EB63734.1070706@xs4all.nl
Lists: pgsql-hackers

On 2011-11-04 01:12, Alvaro Herrera wrote:

> I would like some opinions on the ideas on this patch, and on the patch
> itself. If someone wants more discussion on implementation details of
> each part of the patch, I'm happy to provide a textual description --
> please just ask.

Jumping in a bit late here, but thanks for working on this: it looks
like it could solve some annoying problems for us.

I do find myself idly wondering if those problems couldn't be made to go
away more simply given some kind of “I will never ever update this key”
constraint. I'm having trouble picturing the possible lock interactions
as it is. :-)

Jeroen


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2011-11-10 19:59:20
Message-ID: 201111101959.pAAJxKl14538@momjian.us
Lists: pgsql-hackers

Alvaro Herrera wrote:
> Hello,
>
> After some rather extensive rewriting, I submit the patch to improve
> foreign key locks.
>
> To recap, the point of this patch is to introduce a new tuple lock mode
> that lets the RI code obtain a lighter lock on tuples, one which doesn't
> conflict with updates that do not modify the key columns.

What kind of operations benefit from a non-key lock like this?

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +


From: Christopher Browne <cbbrowne(at)gmail(dot)com>
To: Jeroen Vermeulen <jtv(at)xs4all(dot)nl>
Cc: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2011-11-10 20:17:59
Message-ID: CAFNqd5W41U8sGcOzCPvz3pOay-wYYX9QqQE2KMqht3U0UXk-WA@mail.gmail.com
Lists: pgsql-hackers

On Sun, Nov 6, 2011 at 2:28 AM, Jeroen Vermeulen <jtv(at)xs4all(dot)nl> wrote:
> On 2011-11-04 01:12, Alvaro Herrera wrote:
>
>> I would like some opinions on the ideas on this patch, and on the patch
>> itself.  If someone wants more discussion on implementation details of
>> each part of the patch, I'm happy to provide a textual description --
>> please just ask.
>
> Jumping in a bit late here, but thanks for working on this: it looks like it
> could solve some annoying problems for us.
>
> I do find myself idly wondering if those problems couldn't be made to go
> away more simply given some kind of “I will never ever update this key”
> constraint.  I'm having trouble picturing the possible lock interactions as
> it is.  :-)

+1 on that, though I'd make it more general than that. There's value
in having an "immutability" constraint on a column, where, in effect,
you're not allowed to modify the value of the column, once assigned.
That certainly doesn't prevent issuing DELETE + INSERT to get whatever
value you want into place, but that's a big enough hoop to need to
jump through to get rid of some nonsensical updates.

And if the target of a foreign key constraint consists of immutable
columns, then, yes, indeed, UPDATE on that table no longer conflicts
with references.

In nearly all cases, I'd expect that SERIAL would be reasonably
followed by IMMUTABLE.

create table something_assigned (
something_id serial immutable primary key,
something_identifier text not null unique
);
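
Until such syntax exists, the behavioral part of an immutability
constraint could be approximated with a trigger -- a rough sketch only
(all names invented), and note it gives none of the locking benefit
under discussion:

```sql
CREATE FUNCTION something_id_guard() RETURNS trigger AS $$
BEGIN
  IF NEW.something_id IS DISTINCT FROM OLD.something_id THEN
    RAISE EXCEPTION 'something_id is immutable';
  END IF;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER something_id_immutable
  BEFORE UPDATE ON something_assigned
  FOR EACH ROW EXECUTE PROCEDURE something_id_guard();
```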
--
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"


From: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
To: Christopher Browne <cbbrowne(at)gmail(dot)com>
Cc: Jeroen Vermeulen <jtv(at)xs4all(dot)nl>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2011-11-10 20:21:39
Message-ID: CAFj8pRB=vZttBCG8_aE9bg2Qo6wGWAkPtGev7tL+CG0m82BnVQ@mail.gmail.com
Lists: pgsql-hackers

2011/11/10 Christopher Browne <cbbrowne(at)gmail(dot)com>:
> On Sun, Nov 6, 2011 at 2:28 AM, Jeroen Vermeulen <jtv(at)xs4all(dot)nl> wrote:
>> On 2011-11-04 01:12, Alvaro Herrera wrote:
>>
>>> I would like some opinions on the ideas on this patch, and on the patch
>>> itself.  If someone wants more discussion on implementation details of
>>> each part of the patch, I'm happy to provide a textual description --
>>> please just ask.
>>
>> Jumping in a bit late here, but thanks for working on this: it looks like it
>> could solve some annoying problems for us.
>>
>> I do find myself idly wondering if those problems couldn't be made to go
>> away more simply given some kind of “I will never ever update this key”
>> constraint.  I'm having trouble picturing the possible lock interactions as
>> it is.  :-)
>
> +1 on that, though I'd make it more general than that.  There's value
> in having an "immutability" constraint on a column, where, in effect,
> you're not allowed to modify the value of the column, once assigned.
> That certainly doesn't prevent issuing DELETE + INSERT to get whatever
> value you want into place, but that's a big enough hoop to need to
> jump through to get rid of some nonsensical updates.
>
> And if the target of a foreign key constraint consists of immutable
> columns, then, yes, indeed, UPDATE on that table no longer conflicts
> with references.
>
> In nearly all cases, I'd expect that SERIAL would be reasonably
> followed by IMMUTABLE.
>
> create table something_assigned (
>   something_id serial immutable primary key,
>   something_identifier text not null unique
> );

I like this idea - it can solve two problems

Regards

Pavel Stehule

> --
> When confronted by a difficult problem, solve it by reducing it to the
> question, "How would the Lone Ranger handle this?"
>


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Christopher Browne" <cbbrowne(at)gmail(dot)com>, "Jeroen Vermeulen" <jtv(at)xs4all(dot)nl>
Cc: "Alvaro Herrera" <alvherre(at)alvh(dot)no-ip(dot)org>, "Pg Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2011-11-10 20:29:44
Message-ID: 4EBBDFD80200002500042D09@gw.wicourts.gov
Lists: pgsql-hackers

Christopher Browne <cbbrowne(at)gmail(dot)com> wrote:

> There's value in having an "immutability" constraint on a column,
> where, in effect, you're not allowed to modify the value of the
> column, once assigned.

+1 We would definitely use such a feature, should it become
available.

-Kevin


From: Christopher Browne <cbbrowne(at)gmail(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: Jeroen Vermeulen <jtv(at)xs4all(dot)nl>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2011-11-10 20:38:41
Message-ID: CAFNqd5Ui+=3YcM=WWBsPDjy0Q0FFwy87q9BPrT2fcaJBS3s0cg@mail.gmail.com
Lists: pgsql-hackers

On Thu, Nov 10, 2011 at 3:29 PM, Kevin Grittner
<Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
> Christopher Browne <cbbrowne(at)gmail(dot)com> wrote:
>
>> There's value in having an "immutability" constraint on a column,
>> where, in effect, you're not allowed to modify the value of the
>> column, once assigned.
>
> +1  We would definitely use such a feature, should it become
> available.

Added to TODO list.
--
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2011-11-10 21:09:12
Message-ID: 1320959229-sup-8122@alvh.no-ip.org
Lists: pgsql-hackers


Excerpts from Bruce Momjian's message of Thu Nov 10 16:59:20 -0300 2011:
> Alvaro Herrera wrote:
> > Hello,
> >
> > After some rather extensive rewriting, I submit the patch to improve
> > foreign key locks.
> >
> > To recap, the point of this patch is to introduce a new tuple lock mode
> > that lets the RI code obtain a lighter lock on tuples, one which doesn't
> > conflict with updates that do not modify the key columns.
>
> What kind of operations benefit from a non-key lock like this?

I'm not sure I understand the question.

With this patch, an RI check does "SELECT FOR KEY SHARE". This means the
tuple is locked with that mode until the transaction finishes. An
UPDATE that modifies only non-key columns of the referenced row will not
conflict with that lock.

An UPDATE that modifies the key columns will be blocked, just as now.
Same with a DELETE.
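
Concretely, with made-up table names:

```sql
-- Session 1: an RI check takes the light lock on the referenced row
BEGIN;
SELECT 1 FROM parent WHERE id = 1 FOR KEY SHARE;

-- Session 2: proceeds immediately -- no key column is modified
UPDATE parent SET info = 'changed' WHERE id = 1;

-- Session 2: blocks until session 1 commits -- the key is modified
UPDATE parent SET id = 2 WHERE id = 1;

-- Session 2: a DELETE blocks likewise
DELETE FROM parent WHERE id = 1;
```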

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2011-11-10 21:19:59
Message-ID: 201111102119.pAALJxM27424@momjian.us
Lists: pgsql-hackers

Alvaro Herrera wrote:
>
> Excerpts from Bruce Momjian's message of Thu Nov 10 16:59:20 -0300 2011:
> > Alvaro Herrera wrote:
> > > Hello,
> > >
> > > After some rather extensive rewriting, I submit the patch to improve
> > > foreign key locks.
> > >
> > > To recap, the point of this patch is to introduce a new tuple lock mode
> > > that lets the RI code obtain a lighter lock on tuples, one which doesn't
> > > conflict with updates that do not modify the key columns.
> >
> > What kind of operations benefit from a non-key lock like this?
>
> I'm not sure I understand the question.
>
> With this patch, an RI check does "SELECT FOR KEY SHARE". This means the
> tuple is locked with that mode until the transaction finishes. An
> UPDATE that modifies only non-key columns of the referenced row will not
> conflict with that lock.
>
> An UPDATE that modifies the key columns will be blocked, just as now.
> Same with a DELETE.

OK, so it prevents non-key data modifications from spilling to the
referred rows --- nice.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +


From: David Kerr <dmk(at)mr-paradox(dot)net>
To: Christopher Browne <cbbrowne(at)gmail(dot)com>
Cc: Jeroen Vermeulen <jtv(at)xs4all(dot)nl>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2011-11-11 17:30:18
Message-ID: 20111111173018.GA6219@mr-paradox.net
Lists: pgsql-hackers

On Thu, Nov 10, 2011 at 03:17:59PM -0500, Christopher Browne wrote:
- On Sun, Nov 6, 2011 at 2:28 AM, Jeroen Vermeulen <jtv(at)xs4all(dot)nl> wrote:
- > On 2011-11-04 01:12, Alvaro Herrera wrote:
- >
- >> I would like some opinions on the ideas on this patch, and on the patch
- >> itself.  If someone wants more discussion on implementation details of
- >> each part of the patch, I'm happy to provide a textual description --
- >> please just ask.
- >
- > Jumping in a bit late here, but thanks for working on this: it looks like it
- > could solve some annoying problems for us.
- >
- > I do find myself idly wondering if those problems couldn't be made to go
- > away more simply given some kind of “I will never ever update this key”
- > constraint.  I'm having trouble picturing the possible lock interactions as
- > it is.  :-)
-
- +1 on that, though I'd make it more general than that. There's value
- in having an "immutability" constraint on a column, where, in effect,
- you're not allowed to modify the value of the column, once assigned.
- That certainly doesn't prevent issuing DELETE + INSERT to get whatever
- value you want into place, but that's a big enough hoop to need to
- jump through to get rid of some nonsensical updates.
-
- And if the target of a foreign key constraint consists of immutable
- columns, then, yes, indeed, UPDATE on that table no longer conflicts
- with references.
-
- In nearly all cases, I'd expect that SERIAL would be reasonably
- followed by IMMUTABLE.
-
- create table something_assigned (
- something_id serial immutable primary key,
- something_identifier text not null unique
- );

Is this being suggested in lieu of Alvaro's patch? Because it seems to be adding
complexity to the system (multiple types of primary key definitions) instead of
just fixing an obvious problem (over-aggressive locking done on FK checks).

If it's going to be in addition to, then it sounds like it'd be really nice.

Dave


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: foreign key locks, 2nd attempt
Date: 2011-11-11 22:02:00
Message-ID: 4EBD9B58.5060108@agliodbs.com
Lists: pgsql-hackers


>> An UPDATE that modifies the key columns will be blocked, just as now.
>> Same with a DELETE.
>
> OK, so it prevents non-key data modifications from spilling to the
> referred rows --- nice.

Yes. Eliminates the leading cause of deadlocks in Postgres applications.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Jeroen Vermeulen <jtv(at)xs4all(dot)nl>
To: David Kerr <dmk(at)mr-paradox(dot)net>
Cc: Christopher Browne <cbbrowne(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2011-11-12 04:21:10
Message-ID: 4EBDF436.7000004@xs4all.nl
Lists: pgsql-hackers

On 2011-11-12 00:30, David Kerr wrote:

> Is this being suggested in lieu of Alvaro's patch? because it seems to be adding
> complexity to the system (multiple types of primary key definitions) instead of
> just fixing an obvious problem (over-aggressive locking done on FK checks).

It wouldn't be a new type of primary key definition, just a new type of
column constraint similar to "not null." Particularly useful with keys,
but entirely orthogonal to them.

Parser and reserved words aside, it seems a relatively simple change.
Of course that's not necessarily the same as "small."

> If it's going to be in addition to, then it sounds like it'd be really nice.

I wasn't thinking that far ahead, myself. But if some existing lock
type covers the situation well enough, then that could be a big argument
for doing it in-lieu-of.

I haven't looked at lock types much so I could be wrong, but my
impression is that there are dangerously many lock types already. One
would expect the risk of subtle locking bugs to grow as the square of
the number of interacting lock types.

Jeroen


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Christopher Browne <cbbrowne(at)gmail(dot)com>
Cc: Jeroen Vermeulen <jtv(at)xs4all(dot)nl>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2011-11-19 09:21:41
Message-ID: CA+U5nM+5c-BtDbQPmZ03iBuaABkK0xW0JF4m24M-9ifi-1mX+A@mail.gmail.com
Lists: pgsql-hackers

On Thu, Nov 10, 2011 at 8:17 PM, Christopher Browne <cbbrowne(at)gmail(dot)com> wrote:
> On Sun, Nov 6, 2011 at 2:28 AM, Jeroen Vermeulen <jtv(at)xs4all(dot)nl> wrote:
>> On 2011-11-04 01:12, Alvaro Herrera wrote:
>>
>>> I would like some opinions on the ideas on this patch, and on the patch
>>> itself.  If someone wants more discussion on implementation details of
>>> each part of the patch, I'm happy to provide a textual description --
>>> please just ask.
>>
>> Jumping in a bit late here, but thanks for working on this: it looks like it
>> could solve some annoying problems for us.
>>
>> I do find myself idly wondering if those problems couldn't be made to go
>> away more simply given some kind of “I will never ever update this key”
>> constraint.  I'm having trouble picturing the possible lock interactions as
>> it is.  :-)
>
> +1 on that, though I'd make it more general than that.  There's value
> in having an "immutability" constraint on a column, where, in effect,
> you're not allowed to modify the value of the column, once assigned.
> That certainly doesn't prevent issuing DELETE + INSERT to get whatever
> value you want into place, but that's a big enough hoop to need to
> jump through to get rid of some nonsensical updates.
>
> And if the target of a foreign key constraint consists of immutable
> columns, then, yes, indeed, UPDATE on that table no longer conflicts
> with references.
>
> In nearly all cases, I'd expect that SERIAL would be reasonably
> followed by IMMUTABLE.
>
> create table something_assigned (
>   something_id serial immutable primary key,
>   something_identifier text not null unique
> );

This is a good idea, but it doesn't do what KEY LOCKS are designed to do.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2011-11-19 09:21:51
Message-ID: CA+U5nMK13ij662FZKkc6ZcwA1EVd_Dj1DozhTS-jQc1dftKCow@mail.gmail.com
Lists: pgsql-hackers

On Thu, Nov 3, 2011 at 6:12 PM, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> wrote:

> So Noah Misch proposed using the FOR KEY SHARE syntax, and that's what I
> have implemented here.  (There was some discussion that instead of
> inventing new SQL syntax we could pass the necessary lock mode
> internally in the ri_triggers code.  That can still be done of course,
> though I haven't done so in the current version of the patch.)

FKs are a good shorthand, but they aren't the only constraint people
implement. It can often be necessary to write triggers to enforce
complex constraints. So user triggers need access to the same
facilities that ri triggers use. Please keep the syntax.
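
For example, a hand-written check in a user trigger could then take the
same light lock the RI triggers get (sketch only; all names invented):

```sql
CREATE FUNCTION check_account_exists() RETURNS trigger AS $$
BEGIN
  PERFORM 1 FROM accounts WHERE id = NEW.account_id FOR KEY SHARE;
  IF NOT FOUND THEN
    RAISE EXCEPTION 'account % does not exist', NEW.account_id;
  END IF;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;
```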

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2011-11-19 15:36:48
Message-ID: 16993.1321717008@sss.pgh.pa.us
Lists: pgsql-hackers

Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> On Thu, Nov 3, 2011 at 6:12 PM, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> wrote:
>> So Noah Misch proposed using the FOR KEY SHARE syntax, and that's what I
>> have implemented here. (There was some discussion that instead of
>> inventing new SQL syntax we could pass the necessary lock mode
>> internally in the ri_triggers code. That can still be done of course,
>> though I haven't done so in the current version of the patch.)

> FKs are a good shorthand, but they aren't the only constraint people
> implement. It can often be necessary to write triggers to enforce
> complex constraints. So user triggers need access to the same
> facilities that ri triggers use. Please keep the syntax.

It's already the case that RI triggers require access to special
executor features that are not accessible at the SQL level. I don't
think the above argument is a compelling reason for exposing more
such features at the SQL level. All we need is that C-coded functions
can get at them somehow.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2011-11-21 17:09:06
Message-ID: CA+TgmoYhn+ne55TyK+cMO+g2xqfwPTtNUA+7CSNA3oyxpuAGrg@mail.gmail.com
Lists: pgsql-hackers

On Sat, Nov 19, 2011 at 10:36 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
>> On Thu, Nov 3, 2011 at 6:12 PM, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> wrote:
>>> So Noah Misch proposed using the FOR KEY SHARE syntax, and that's what I
>>> have implemented here.  (There was some discussion that instead of
>>> inventing new SQL syntax we could pass the necessary lock mode
>>> internally in the ri_triggers code.  That can still be done of course,
>>> though I haven't done so in the current version of the patch.)
>
>> FKs are a good shorthand, but they aren't the only constraint people
>> implement. It can often be necessary to write triggers to enforce
>> complex constraints. So user triggers need access to the same
>> facilities that ri triggers use. Please keep the syntax.
>
> It's already the case that RI triggers require access to special
> executor features that are not accessible at the SQL level.  I don't
> think the above argument is a compelling reason for exposing more
> such features at the SQL level.  All we need is that C-coded functions
> can get at them somehow.

I kinda agree with Simon. In general, if we don't need to expose
something at the SQL level, then sure, let's not. But it seems weird
to me to say, well, we have four lock modes internally, and you can
get to three of them via SQL. To me, that sort of inconsistency feels
like a wart.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2011-11-21 20:26:55
Message-ID: m2aa7pfchc.fsf@2ndQuadrant.fr
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Sat, Nov 19, 2011 at 10:36 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> It's already the case that RI triggers require access to special
>> executor features that are not accessible at the SQL level.  I don't
>> think the above argument is a compelling reason for exposing more
>> such features at the SQL level.  All we need is that C-coded functions
>> can get at them somehow.
>
> I kinda agree with Simon. In general, if we don't need to expose
> something at the SQL level, then sure, let's not. But it seems weird
> to me to say, well, we have four lock modes internally, and you can
> get to three of them via SQL. To me, that sort of inconsistency feels
> like a wart.

+1

I've already rolled constraint triggers into production; being
able to use FOR KEY SHARE locks would be good.

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support


From: Noah Misch <noah(at)leadboat(dot)com>
To: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2011-12-04 12:20:27
Message-ID: 20111204122027.GA10035@tornado.leadboat.com
Lists: pgsql-hackers

Hi Alvaro,

On Thu, Nov 03, 2011 at 03:12:49PM -0300, Alvaro Herrera wrote:
> After some rather extensive rewriting, I submit the patch to improve
> foreign key locks.

I've reviewed this patch. The basic design and behaviors are sound. All the
bugs noted in my previous review are gone.

Making pg_multixact persistent across clean shutdowns is no bridge to cross
lightly, since it means committing to an on-disk format for an indefinite
period. We should do it; the benefits of this patch justify it, and I haven't
identified a way to avoid it without incurring worse problems.

FWIW, I pondered a dead-end alternate idea of having every MultiXactId also be
a SubTransactionId. That way, you could still truncate pg_multixact early,
with the subtransaction commit status being adequate going forward. However,
for the case when the locker arrives after the updater, this would require the
ability to create a new subtransaction on behalf of a different backend. It
would also burn the xid space more quickly, slow commits affected by the added
subtransaction load, and aggravate "suboverflowed" incidence.

I did some halfhearted benchmarking to at least ensure the absence of any
gross performance loss on at-risk operations. Benchmarks done with a vanilla
build, -O2, no --enable-cassert. First, to exercise the cost of comparing
large column values an extra time, I created a table with a 2000-byte key
column and another int4 column. I then did a HOT update of every tuple. The
patch did not significantly change runtime. See attached fklock-wide.sql for
the commands run and timings collected.

Second, I tried a SELECT FOR SHARE on a table of 1M tuples; this might incur
some cost due to the now-guaranteed use of pg_multixact for FOR SHARE. See
attached fklock-test-forshare.sql. The median run slowed by 7% under the
patch, albeit with a rather brief benchmark run. Both master and patched
PostgreSQL seemed to exhibit a statement-scope memory leak in this test case:
to lock 1M rows, backend-private memory grew by about 500M. When trying 10M
rows, I cancelled the query after 1.2 GiB of consumption. This limited the
duration of a convenient test run.

I planned to benchmark the overhead of the HeapTupleSatisfiesMVCC() changes
when no foreign keys are in use, but I did not get around to that.

For anyone else following along, here are some important past threads:
http://archives.postgresql.org/message-id/1294953201-sup-2099@alvh.no-ip.org
http://archives.postgresql.org/message-id/20110211071322.GB26971@tornado.leadboat.com
http://archives.postgresql.org/message-id/1312907125-sup-9346@alvh.no-ip.org
http://archives.postgresql.org/message-id/cmdap.323308e530.1315601945-sup-7377@alvh.no-ip.org
http://archives.postgresql.org/message-id/1317053656-sup-7193@alvh.no-ip.org
http://archives.postgresql.org/message-id/1317840445-sup-7142@alvh.no-ip.org

> So Noah Misch proposed using the FOR KEY SHARE syntax, and that's what I
> have implemented here. (There was some discussion that instead of
> inventing new SQL syntax we could pass the necessary lock mode
> internally in the ri_triggers code. That can still be done of course,
> though I haven't done so in the current version of the patch.)

From a UI perspective, I'd somewhat rather we exposed not only FOR KEY SHARE,
but also FOR KEY UPDATE, in the grammar. That lets us document the tuple lock
conflict table entirely in terms of other documented tuple locks.

> The other user-visible pending item is that it was said that instead of
> simply using "columns used by unique indexes" as the key columns
> considered by this patch, we should do some ALTER TABLE command. This
> would be a comparatively trivial undertaking, I think, but I would like
> there to be consensus that this is really the way to go before I
> implement it.

As I mentioned in my last review, the base heuristic should be to select
attributes actually referenced by some foreign key constraint. We don't gain
enough by automatically involving columns of indexes that could, but do not,
support an actual FK constraint.

I see value in that ALTER TABLE proposal, as a follow-on patch, for the
benefit of user-defined referential integrity constraints. Users ought to get
the benefit of the core patch automatically, not just when they know to mark
every key column. For that reason, the key columns defined through ALTER
TABLE should add to, not replace, the automatically-identified set.

> The new code mostly works fine but I'm pretty sure there must be serious
> bugs somewhere. Also, in places, heap_update and heap_lock_tuple have
> become spaguettish, though I'm not sure I see better ways to write them.

Agreed, but I'm also short on ideas for rapidly and significantly improving
the situation. If I were going to try something, I'd try splitting out the
lock-acquisition loop into a separate function, preferably shared between
heap_update() and heap_lock_tuple().

Starting with source clusters of the catversion introducing this feature,
pg_upgrade must copy pg_multixact to the new cluster. The patch does not add
code for this.

This patch adds no user documentation. We need to document FOR KEY SHARE. I
don't see any current documentation of the tuple lock consequences of foreign
key constraints, so we don't need any changes there.

Should we add a few (2-6) unused flag bits to each multixact member to provide
growing room for future needs?
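
For illustration, widening the per-member flag space is cheap if the flags
are packed into words separately from the xids; the sketch below assumes a
4-bit-per-member layout (the real member area in multixact.c uses its own
group layout, and all constants here are illustrative only):

```c
#include <stdint.h>

/*
 * Bits of flag space reserved per multixact member; the patch uses 2,
 * the question is whether to reserve a few more for future growth.
 */
#define MXACT_MEMBER_BITS_PER_XACT	4
#define MXACT_MEMBER_FLAGS_PER_WORD	(32 / MXACT_MEMBER_BITS_PER_XACT)
#define MXACT_MEMBER_XACT_BITMASK	((1U << MXACT_MEMBER_BITS_PER_XACT) - 1)

/* Store the flags of member number "n" into a flags-word array. */
static void
member_set_flags(uint32_t *words, int n, uint32_t flags)
{
	int			word = n / MXACT_MEMBER_FLAGS_PER_WORD;
	int			shift = (n % MXACT_MEMBER_FLAGS_PER_WORD) *
						MXACT_MEMBER_BITS_PER_XACT;

	words[word] &= ~(MXACT_MEMBER_XACT_BITMASK << shift);
	words[word] |= (flags & MXACT_MEMBER_XACT_BITMASK) << shift;
}

/* Fetch the flags of member number "n". */
static uint32_t
member_get_flags(const uint32_t *words, int n)
{
	int			word = n / MXACT_MEMBER_FLAGS_PER_WORD;
	int			shift = (n % MXACT_MEMBER_FLAGS_PER_WORD) *
						MXACT_MEMBER_BITS_PER_XACT;

	return (words[word] >> shift) & MXACT_MEMBER_XACT_BITMASK;
}

/* Round-trip helper for testing: set then read back member n's flags. */
static uint32_t
member_flags_roundtrip(int n, uint32_t flags)
{
	uint32_t	words[8] = {0};

	member_set_flags(words, n, flags);
	return member_get_flags(words, n);
}
```

With a layout like this, unused bits cost disk space per member but no code
complexity, which is the trade-off the question is really about.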

Does this patch have any special implications for REPEATABLE READ?

> --- a/contrib/pgrowlocks/pgrowlocks.c
> +++ b/contrib/pgrowlocks/pgrowlocks.c

I used pgrowlocks to play with this patch, and the output clarity seems to
have fallen somewhat:

**** w/ patch
-- SELECT * FROM test_rowlock FOR KEY SHARE;
 locked_row | lock_type | locker | multi |  xids   | modes |  pids
------------+-----------+--------+-------+---------+-------+---------
 (0,1)      | KeyShare  |  15371 | f     | {15371} |       | {27276}
-- SELECT * FROM test_rowlock FOR SHARE;
 locked_row |  lock_type   | locker | multi |  xids   | modes |  pids
------------+--------------+--------+-------+---------+-------+---------
 (0,1)      | IsNotUpdate  |     70 | t     | {15372} | {shr} | {27276}
-- SELECT * FROM test_rowlock FOR UPDATE;
 locked_row |       lock_type        | locker | multi |  xids   |  modes   |  pids
------------+------------------------+--------+-------+---------+----------+---------
 (0,1)      | Exclusive IsNotUpdate  |     71 | t     | {15373} | {forupd} | {27276}
-- UPDATE test_rowlock SET non_key_col = 11;
 locked_row | lock_type | locker | multi |  xids   | modes |  pids
------------+-----------+--------+-------+---------+-------+---------
 (0,1)      |           |  15374 | f     | {15374} |       | {27276}
-- UPDATE test_rowlock SET key_col = 2;
 locked_row | lock_type | locker | multi |  xids   | modes |  pids
------------+-----------+--------+-------+---------+-------+---------
 (0,1)      |           |  15375 | f     | {15375} |       | {27276}

**** 9.1.1
-- SELECT * FROM test_rowlock FOR SHARE;
 locked_row | lock_type | locker | multi | xids  |  pids
------------+-----------+--------+-------+-------+---------
 (0,1)      | Shared    |    757 | f     | {757} | {27349}
-- SELECT * FROM test_rowlock FOR UPDATE;
 locked_row | lock_type | locker | multi | xids  |  pids
------------+-----------+--------+-------+-------+---------
 (0,1)      | Exclusive |    758 | f     | {758} | {27349}
-- UPDATE test_rowlock SET non_key_col = 11;
 locked_row | lock_type | locker | multi | xids  |  pids
------------+-----------+--------+-------+-------+---------
 (0,1)      | Exclusive |    759 | f     | {759} | {27349}
-- UPDATE test_rowlock SET key_col = 2;
 locked_row | lock_type | locker | multi | xids  |  pids
------------+-----------+--------+-------+-------+---------
 (0,1)      | Exclusive |    760 | f     | {760} | {27349}

I've attached fklock-pgrowlocks.sql, used to produce the above results. In
particular, the absence of any distinction between the key_col and non_key_col
update scenarios is suboptimal. Also, the "SELECT * FROM test_rowlock FOR
UPDATE" ought not to use a multi, right? Choices such as letting lock_type be
blank or contain IsNotUpdate tend to make the output reflect implementation
details more than user-relevant lock semantics. One could go either way on
those. I tend to think pageinspect is for raw implementation exposure and
pgrowlocks for something more cooked.

I have not reviewed your actual pgrowlocks code changes.

> --- a/src/backend/access/heap/heapam.c
> +++ b/src/backend/access/heap/heapam.c

heap_xlog_update() still sets xmax of the old tuple and xmin of the new tuple
based on the XLogRecord, so it will always be the plain xid of the updater. I
suppose that's still fine, because locks don't matter after a crash. I
suggest adding a comment, though.

> @@ -1620,7 +1622,7 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
> ItemPointerGetBlockNumber(tid));
> offnum = ItemPointerGetOffsetNumber(&heapTuple->t_data->t_ctid);
> at_chain_start = false;
> - prev_xmax = HeapTupleHeaderGetXmax(heapTuple->t_data);
> + prev_xmax = HeapTupleHeaderGetUpdateXid(heapTuple->t_data);
> }
> else
> break; /* end of chain */

The HOT search logic in pruneheap.c needs the same change.

> @@ -1743,7 +1745,7 @@ heap_get_latest_tid(Relation relation,
> * tuple. Check for XMIN match.
> */
> if (TransactionIdIsValid(priorXmax) &&
> - !TransactionIdEquals(priorXmax, HeapTupleHeaderGetXmin(tp.t_data)))
> + !TransactionIdEquals(priorXmax, HeapTupleHeaderGetXmin(tp.t_data)))

pgindent will undo this change.

> @@ -2174,20 +2178,22 @@ l1:
> */
> if (!have_tuple_lock)
> {
> - LockTuple(relation, &(tp.t_self), ExclusiveLock);
> + LockTuple(relation, &(tp.t_self),
> + get_lockmode_for_tuplelock(LockTupleKeyUpdate));

I suggest hiding calls to get_lockmode_for_tuplelock() behind a local macro
wrapper around LockTuple(), itself taking a LockTupleMode directly.
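
Something along these lines; the enum values and the KeyShare-to-ShareLock
mapping below are stand-ins of my own, not the patch's actual mapping:

```c
#include <assert.h>

/* Stand-ins for the real heavyweight lock modes (illustrative values). */
typedef enum { ShareLock = 5, ExclusiveLock = 7 } LOCKMODE;
typedef enum { LockTupleKeyShare, LockTupleShare,
			   LockTupleUpdate, LockTupleKeyUpdate } LockTupleMode;

/* Map a tuple lock mode to the heavyweight lock mode used for it. */
static LOCKMODE
get_lockmode_for_tuplelock(LockTupleMode mode)
{
	return (mode == LockTupleKeyShare || mode == LockTupleShare) ?
		ShareLock : ExclusiveLock;
}

/*
 * The suggested wrapper: call sites pass a LockTupleMode directly, and
 * the translation to a LOCKMODE happens in exactly one place.
 */
#define LockTupleTuplock(rel, tid, mode) \
	LockTuple((rel), (tid), get_lockmode_for_tuplelock(mode))
#define UnlockTupleTuplock(rel, tid, mode) \
	UnlockTuple((rel), (tid), get_lockmode_for_tuplelock(mode))
```

That keeps heap_delete(), heap_update() and heap_lock_tuple() free of the
repeated get_lockmode_for_tuplelock() boilerplate.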

> @@ -2471,8 +2483,14 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
> bool have_tuple_lock = false;
> bool iscombo;
> bool use_hot_update = false;
> + bool key_intact;
> bool all_visible_cleared = false;
> bool all_visible_cleared_new = false;
> + bool keep_xmax_multi = false;
> + TransactionId keep_xmax = InvalidTransactionId;

Those two new variables should be initialized after `l2'. If the final goto
fires (we lack a needed pin on the visibility map page), their existing values
become invalid. (Not positive there's a live bug here, but it's fragile.)
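
The hazard is the usual one with goto-based retry loops: any state computed
inside the loop body must be re-derived after a "goto l2", or a stale value
from the failed attempt leaks into the retry. A toy illustration (nothing
here is patch code):

```c
/*
 * Toy retry loop.  "flag" stands in for keep_xmax: it must be
 * (re)initialized after the retry label, because an earlier, abandoned
 * attempt may have set it before jumping back.
 */
static int
retry_demo(int nattempts)
{
	int			attempt = 0;
	int			flag;

retry:
	flag = 0;					/* reinitialize after the label! */
	if (attempt == 0)
		flag = 1;				/* only the first attempt sets it */
	if (++attempt < nattempts)
		goto retry;				/* pretend this attempt failed; start over */
	return flag;
}
```

Had "flag" been initialized at its declaration instead, the second attempt
would inherit the first attempt's value, which is exactly the fragility in
question.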

> + TransactionId keep_xmax_old = InvalidTransactionId;
> + uint16 keep_xmax_infomask = 0;
> + uint16 keep_xmax_old_infomask = 0;
>
> Assert(ItemPointerIsValid(otid));
>
> @@ -2488,7 +2506,8 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
> * Note that we get a copy here, so we need not worry about relcache flush
> * happening midway through.
> */
> - hot_attrs = RelationGetIndexAttrBitmap(relation);
> + hot_attrs = RelationGetIndexAttrBitmap(relation, false);
> + key_attrs = RelationGetIndexAttrBitmap(relation, true);
>
> block = ItemPointerGetBlockNumber(otid);
> buffer = ReadBuffer(relation, block);
> @@ -2513,6 +2532,24 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
> oldtup.t_self = *otid;
>
> /*
> + * If we're not updating any "key" column, we can grab a milder lock type.
> + * This allows for more concurrency when we are running simultaneously with
> + * foreign key checks.
> + */
> + if (HeapSatisfiesHOTUpdate(relation, key_attrs, &oldtup, newtup))

This will only pass when the toastedness matches, too. That is, something
like "UPDATE t SET keycol = keycol || ''" will take the stronger lock if
keycol is toasted. That's probably for the best -- the important case to
optimize is updates that never manipulate key columns at all, not those that
serendipitously arrive at the same key values. I wouldn't expect a net win
from the toaster effort needed to recognize the latter case.

Nonetheless, a comment here should note the decision.

> + {
> + tuplock = LockTupleUpdate;
> + mxact_status = MultiXactStatusUpdate;
> + key_intact = true;
> + }
> + else
> + {
> + tuplock = LockTupleKeyUpdate;
> + mxact_status = MultiXactStatusKeyUpdate;
> + key_intact = false;
> + }
> +
> + /*
> * Note: beyond this point, use oldtup not otid to refer to old tuple.
> * otid may very well point at newtup->t_self, which we will overwrite
> * with the new tuple's location, so there's great risk of confusion if we
> @@ -2522,6 +2559,9 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
> l2:
> result = HeapTupleSatisfiesUpdate(oldtup.t_data, cid, buffer);
>
> + /* see below about the "no wait" case */
> + Assert(result != HeapTupleBeingUpdated || wait);

Maybe just Assert(wait)? Given that we're breaking that usage, no point in
giving the user any illusions. However ...

> +
> if (result == HeapTupleInvisible)
> {
> UnlockReleaseBuffer(buffer);
> @@ -2529,8 +2569,21 @@ l2:
> }
> else if (result == HeapTupleBeingUpdated && wait)
> {
> - TransactionId xwait;
> + TransactionId xwait;

pgindent will undo this change.

> uint16 infomask;
> + bool none_remain = false;
> +
> + /*
> + * XXX note that we don't consider the "no wait" case here. This
> + * isn't a problem currently because no caller uses that case, but it
> + * should be fixed if such a caller is introduced. It wasn't a problem
> + * previously because this code would always wait, but now that some
> + * tuple locks do not conflict with one of the lock modes we use, it is
> + * possible that this case is interesting to handle specially.
> + *
> + * This may cause failures with third-party code that calls heap_update
> + * directly.
> + */

... consider that this introduces code drift between heap_update() and the
presently-similar logic in heap_lock_tuple().

>
> /* must copy state data before unlocking buffer */
> xwait = HeapTupleHeaderGetXmax(oldtup.t_data);
> @@ -2549,20 +2602,26 @@ l2:
> */
> if (!have_tuple_lock)
> {
> - LockTuple(relation, &(oldtup.t_self), ExclusiveLock);
> + LockTuple(relation, &(oldtup.t_self),
> + get_lockmode_for_tuplelock(tuplock));
> have_tuple_lock = true;
> }
>
> /*
> - * Sleep until concurrent transaction ends. Note that we don't care
> - * if the locker has an exclusive or shared lock, because we need
> - * exclusive.
> + * Now sleep on the locker. Note that if there are only key-share
> + * lockers and we're not updating the key columns, we will be awaken
> + * before it is gone, so we may need to mark the new tuple with a
> + * new MultiXactId including the original xmax and ourselves.

Well, we'll never actually sleep at all.

> + *
> + * XXX this comment needs to be more comprehensive
> */
> -
> if (infomask & HEAP_XMAX_IS_MULTI)
> {
> + TransactionId update_xact;
> + int remain;
> +
> /* wait for multixact */
> - MultiXactIdWait((MultiXactId) xwait);
> + MultiXactIdWait((MultiXactId) xwait, mxact_status, &remain);
> LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
>
> /*
> @@ -2576,41 +2635,98 @@ l2:
> goto l2;
>
> /*
> - * You might think the multixact is necessarily done here, but not
> - * so: it could have surviving members, namely our own xact or
> - * other subxacts of this backend. It is legal for us to update
> - * the tuple in either case, however (the latter case is
> - * essentially a situation of upgrading our former shared lock to
> - * exclusive). We don't bother changing the on-disk hint bits
> - * since we are about to overwrite the xmax altogether.
> + * Note that the multixact may not be done by now. It could have
> + * surviving members; our own xact or other subxacts of this
> + * backend, and also any other concurrent transaction that locked
> + * the tuple with KeyShare if we only got TupleLockUpdate. If this
> + * is the case, we have to be careful to mark the updated tuple
> + * with the surviving members in Xmax.
> + *
> + * Note that there could have been another update in the MultiXact.
> + * In that case, we need to check whether it committed or aborted.
> + * If it aborted we are safe to update it again; otherwise there is
> + * an update conflict that must be handled below.

It's handled below in the sense that we bail, returning HeapTupleUpdated?

> + *
> + * In the LockTupleKeyUpdate case, we still need to preserve the
> + * surviving members: those would include the tuple locks we had
> + * before this one, which are important to keep in case this
> + * subxact aborts.
> */
> + update_xact = InvalidTransactionId;
> + if (!(oldtup.t_data->t_infomask & HEAP_XMAX_IS_NOT_UPDATE))
> + update_xact = HeapTupleGetUpdateXid(oldtup.t_data);
> +
> + /* there was no UPDATE in the MultiXact; or it aborted. */
> + if (update_xact == InvalidTransactionId ||
> + TransactionIdDidAbort(update_xact))
> + {
> + /*
> + * if the multixact still has live members, we need to preserve
> + * it by creating a new multixact. If all members are gone, we
> + * can simply update the tuple by setting ourselves in Xmax.
> + */
> + if (remain > 0)
> + {
> + keep_xmax = HeapTupleHeaderGetXmax(oldtup.t_data);
> + keep_xmax_multi =
> + (oldtup.t_data->t_infomask & HEAP_XMAX_IS_MULTI) != 0;

Will keep_xmax_multi ever be false? Would we not have exited at the above
"goto l2;" in those cases?

> + }
> + else
> + {
> + /*
> + * We could set the HEAP_XMAX_INVALID bit here instead of
> + * using a separate boolean flag. However, since we're going
> + * to set up a new xmax below, this would waste time
> + * setting up the buffer's dirty bit.
> + */
> + none_remain = false;
> + }
> + }
> }
> else

This would require less reindentation as an "else if", rather than "else { if".

> {
> - /* wait for regular transaction to end */
> - XactLockTableWait(xwait);
> - LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
> -
> /*
> - * xwait is done, but if xwait had just locked the tuple then some
> - * other xact could update this tuple before we get to this point.
> - * Check for xmax change, and start over if so.
> + * If it's just a key-share locker, and we're not changing the
> + * key columns, we don't need to wait for it to wait; but we
> + * need to preserve it as locker.
> */
> - if ((oldtup.t_data->t_infomask & HEAP_XMAX_IS_MULTI) ||
> - !TransactionIdEquals(HeapTupleHeaderGetXmax(oldtup.t_data),
> - xwait))
> - goto l2;
> + if ((oldtup.t_data->t_infomask & HEAP_XMAX_KEYSHR_LOCK) &&
> + key_intact)

You don't have a content lock on the buffer at this point, so the test should
be against "infomask", not "oldtup.t_data->t_infomask".

> + {
> + LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
> + keep_xmax = xwait;
> + keep_xmax_multi = false;
> + }

Like the other branches, this one needs to recheck the t_infomask after
reacquiring the content lock.

It would be nice to completely avoid releasing the content lock in cases that
don't involve any waiting. However, since that (have_tuple_lock = true) is
already something of a slow path, I doubt it's worth the complexity.

> + else
> + {
> + /* wait for regular transaction to end */
> + XactLockTableWait(xwait);
> + LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
>
> - /* Otherwise check if it committed or aborted */
> - UpdateXmaxHintBits(oldtup.t_data, buffer, xwait);
> + /*
> + * xwait is done, but if xwait had just locked the tuple then some
> + * other xact could update this tuple before we get to this point.
> + * Check for xmax change, and start over if so.
> + */
> + if ((oldtup.t_data->t_infomask & HEAP_XMAX_IS_MULTI) ||
> + !TransactionIdEquals(HeapTupleHeaderGetXmax(oldtup.t_data),
> + xwait))
> + goto l2;
> +
> + /* Otherwise check if it committed or aborted */
> + UpdateXmaxHintBits(oldtup.t_data, buffer, xwait);
> + }
> }
>
> /*
> * We may overwrite if previous xmax aborted, or if it committed but
> - * only locked the tuple without updating it.
> + * only locked the tuple without updating it, or if we are going to
> + * keep it around in Xmax.
> */
> - if (oldtup.t_data->t_infomask & (HEAP_XMAX_INVALID |
> - HEAP_IS_LOCKED))
> + if (TransactionIdIsValid(keep_xmax) ||
> + none_remain ||
> + (oldtup.t_data->t_infomask & HEAP_XMAX_INVALID) ||
> + HeapTupleHeaderIsLocked(oldtup.t_data))

When is the HeapTupleHeaderIsLocked(oldtup.t_data) condition needed? Offhand,
I'd think none_remain = true and HEAP_XMAX_INVALID cover its cases.

> result = HeapTupleMayBeUpdated;
> else
> result = HeapTupleUpdated;
> @@ -2630,13 +2746,15 @@ l2:
> result == HeapTupleBeingUpdated);
> Assert(!(oldtup.t_data->t_infomask & HEAP_XMAX_INVALID));
> *ctid = oldtup.t_data->t_ctid;
> - *update_xmax = HeapTupleHeaderGetXmax(oldtup.t_data);
> + *update_xmax = HeapTupleHeaderGetUpdateXid(oldtup.t_data);
> UnlockReleaseBuffer(buffer);
> if (have_tuple_lock)
> - UnlockTuple(relation, &(oldtup.t_self), ExclusiveLock);
> + UnlockTuple(relation, &(oldtup.t_self),
> + get_lockmode_for_tuplelock(tuplock));
> if (vmbuffer != InvalidBuffer)
> ReleaseBuffer(vmbuffer);
> bms_free(hot_attrs);
> + bms_free(key_attrs);
> return result;
> }
>
> @@ -2645,7 +2763,7 @@ l2:
> * visible while we were busy locking the buffer, or during some subsequent
> * window during which we had it unlocked, we'll have to unlock and
> * re-lock, to avoid holding the buffer lock across an I/O. That's a bit
> - * unfortunate, esepecially since we'll now have to recheck whether the
> + * unfortunate, especially since we'll now have to recheck whether the
> * tuple has been locked or updated under us, but hopefully it won't
> * happen very often.
> */
> @@ -2678,13 +2796,54 @@ l2:
> Assert(!(newtup->t_data->t_infomask & HEAP_HASOID));
> }
>
> + /*
> + * If the tuple we're updating is locked, we need to preserve this in the
> + * new tuple's Xmax as well as in the old tuple. Prepare the new xmax
> + * value for these uses.
> + *
> + * Note there cannot be an xmax to save if we're changing key columns; in
> + * this case, the wait above should have only returned when the locking
> + * transactions finished.
> + */
> + if (TransactionIdIsValid(keep_xmax))
> + {
> + if (keep_xmax_multi)
> + {
> + keep_xmax_old = MultiXactIdExpand(keep_xmax,
> + xid, MultiXactStatusUpdate);
> + keep_xmax_infomask = HEAP_XMAX_KEYSHR_LOCK | HEAP_XMAX_IS_MULTI;

Not directly related to this line, but is the HEAP_XMAX_IS_NOT_UPDATE bit
getting cleared where needed?

> + }
> + else
> + {
> + /* not a multi? must be a KEY SHARE locker */
> + keep_xmax_old = MultiXactIdCreate(keep_xmax, MultiXactStatusForKeyShare,
> + xid, MultiXactStatusUpdate);
> + keep_xmax_infomask = HEAP_XMAX_KEYSHR_LOCK;
> + }
> + keep_xmax_old_infomask = HEAP_XMAX_IS_MULTI | HEAP_XMAX_KEYSHR_LOCK;
> + /* FIXME -- need more infomask bits? */

Maybe ... I haven't thought it all through.

> + }
> +
> + /*
> + * Prepare the new tuple with the appropriate initial values of Xmin and
> + * Xmax, as well as initial infomask bits.
> + */
> newtup->t_data->t_infomask &= ~(HEAP_XACT_MASK);
> newtup->t_data->t_infomask2 &= ~(HEAP2_XACT_MASK);
> - newtup->t_data->t_infomask |= (HEAP_XMAX_INVALID | HEAP_UPDATED);
> + newtup->t_data->t_infomask |= HEAP_UPDATED;
> HeapTupleHeaderSetXmin(newtup->t_data, xid);
> HeapTupleHeaderSetCmin(newtup->t_data, cid);
> - HeapTupleHeaderSetXmax(newtup->t_data, 0); /* for cleanliness */
> newtup->t_tableOid = RelationGetRelid(relation);
> + if (TransactionIdIsValid(keep_xmax))
> + {
> + newtup->t_data->t_infomask |= keep_xmax_infomask;
> + HeapTupleHeaderSetXmax(newtup->t_data, keep_xmax);
> + }
> + else
> + {
> + newtup->t_data->t_infomask |= HEAP_XMAX_INVALID;
> + HeapTupleHeaderSetXmax(newtup->t_data, 0); /* for cleanliness */
> + }
>
> /*
> * Replace cid with a combo cid if necessary. Note that we already put
> @@ -2725,11 +2884,20 @@ l2:
> oldtup.t_data->t_infomask &= ~(HEAP_XMAX_COMMITTED |
> HEAP_XMAX_INVALID |
> HEAP_XMAX_IS_MULTI |
> - HEAP_IS_LOCKED |
> + HEAP_LOCK_BITS |
> HEAP_MOVED);
> + oldtup.t_data->t_infomask2 &= ~HEAP_UPDATE_KEY_INTACT;
> HeapTupleClearHotUpdated(&oldtup);
> /* ... and store info about transaction updating this tuple */
> - HeapTupleHeaderSetXmax(oldtup.t_data, xid);
> + if (TransactionIdIsValid(keep_xmax_old))
> + {
> + HeapTupleHeaderSetXmax(oldtup.t_data, keep_xmax_old);
> + oldtup.t_data->t_infomask |= keep_xmax_old_infomask;
> + }
> + else
> + HeapTupleHeaderSetXmax(oldtup.t_data, xid);
> + if (key_intact)
> + oldtup.t_data->t_infomask2 |= HEAP_UPDATE_KEY_INTACT;
> HeapTupleHeaderSetCmax(oldtup.t_data, cid, iscombo);
> /* temporarily make it look not-updated */
> oldtup.t_data->t_ctid = oldtup.t_self;

Shortly after this, we release the content lock and go off toasting the tuple
and finding free space. When we come back, could the old tuple have
accumulated additional KEY SHARE locks that we need to re-copy?

> @@ -2883,10 +3051,19 @@ l2:
> oldtup.t_data->t_infomask &= ~(HEAP_XMAX_COMMITTED |
> HEAP_XMAX_INVALID |
> HEAP_XMAX_IS_MULTI |
> - HEAP_IS_LOCKED |
> + HEAP_LOCK_BITS |
> HEAP_MOVED);
> + oldtup.t_data->t_infomask2 &= ~HEAP_UPDATE_KEY_INTACT;
> /* ... and store info about transaction updating this tuple */
> - HeapTupleHeaderSetXmax(oldtup.t_data, xid);
> + if (TransactionIdIsValid(keep_xmax_old))
> + {
> + HeapTupleHeaderSetXmax(oldtup.t_data, keep_xmax_old);
> + oldtup.t_data->t_infomask |= keep_xmax_old_infomask;
> + }
> + else
> + HeapTupleHeaderSetXmax(oldtup.t_data, xid);
> + if (key_intact)
> + oldtup.t_data->t_infomask2 |= HEAP_UPDATE_KEY_INTACT;
> HeapTupleHeaderSetCmax(oldtup.t_data, cid, iscombo);
> }

> @@ -3201,13 +3429,13 @@ heap_lock_tuple(Relation relation, HeapTuple tuple, Buffer *buffer,
> Page page;
> TransactionId xid;
> TransactionId xmax;
> + TransactionId keep_xmax = InvalidTransactionId;
> + bool keep_xmax_multi = false;
> + bool none_remains = false;
> uint16 old_infomask;
> uint16 new_infomask;
> - LOCKMODE tuple_lock_type;
> bool have_tuple_lock = false;
>
> - tuple_lock_type = (mode == LockTupleShared) ? ShareLock : ExclusiveLock;
> -
> *buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
> LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
>
> @@ -3220,6 +3448,9 @@ heap_lock_tuple(Relation relation, HeapTuple tuple, Buffer *buffer,
> tuple->t_tableOid = RelationGetRelid(relation);
>
> l3:
> + /* shouldn't get back here if we already set keep_xmax */
> + Assert(keep_xmax == InvalidTransactionId);
> +
> result = HeapTupleSatisfiesUpdate(tuple->t_data, cid, *buffer);
>
> if (result == HeapTupleInvisible)
> @@ -3231,30 +3462,70 @@ l3:
> {
> TransactionId xwait;
> uint16 infomask;
> + uint16 infomask2;
> + bool require_sleep;
>
> /* must copy state data before unlocking buffer */
> xwait = HeapTupleHeaderGetXmax(tuple->t_data);
> infomask = tuple->t_data->t_infomask;
> + infomask2 = tuple->t_data->t_infomask2;
>
> LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);
>
> /*
> - * If we wish to acquire share lock, and the tuple is already
> - * share-locked by a multixact that includes any subtransaction of the
> - * current top transaction, then we effectively hold the desired lock
> - * already. We *must* succeed without trying to take the tuple lock,
> - * else we will deadlock against anyone waiting to acquire exclusive
> - * lock. We don't need to make any state changes in this case.
> + * If we wish to acquire share or key lock, and the tuple is already
> + * key or share locked by a multixact that includes any subtransaction
> + * of the current top transaction, then we effectively hold the desired
> + * lock already (except if we own key share lock and now desire share
> + * lock). We *must* succeed without trying to take the tuple lock,

This can now apply to FOR UPDATE as well.

For the first sentence, I suggest the wording "If any subtransaction of the
current top transaction already holds a stronger lock, we effectively hold the
desired lock already."

> + * else we will deadlock against anyone wanting to acquire a stronger
> + * lock.

> + *
> + * FIXME -- we don't do the below currently, but I think we should:
> + *
> + * We update the Xmax with a new MultiXactId to include the new lock
> + * mode in this case.
> + *
> + * Note that since we want to alter the Xmax, we need to re-acquire the
> + * buffer lock. The xmax could have changed in the meantime, so we
> + * recheck it in that case, but we keep the buffer lock while doing it
> + * to prevent starvation. The second time around we know we must be
> + * part of the MultiXactId in any case, which is why we don't need to
> + * go back to recheck HeapTupleSatisfiesUpdate. Also, after we
> + * re-acquire lock, the MultiXact is likely to (but not necessarily) be
> + * the same that we see here, so it should be in multixact's cache and
> + * thus quick to obtain.

What is the benefit of doing so?

> */
> - if (mode == LockTupleShared &&
> - (infomask & HEAP_XMAX_IS_MULTI) &&
> - MultiXactIdIsCurrent((MultiXactId) xwait))
> + if ((infomask & HEAP_XMAX_IS_MULTI) &&
> + ((mode == LockTupleShare) || (mode == LockTupleKeyShare)))
> {
> - Assert(infomask & HEAP_XMAX_SHARED_LOCK);
> - /* Probably can't hold tuple lock here, but may as well check */
> - if (have_tuple_lock)
> - UnlockTuple(relation, tid, tuple_lock_type);
> - return HeapTupleMayBeUpdated;
> + int i;
> + int nmembers;
> + MultiXactMember *members;
> +
> + nmembers = GetMultiXactIdMembers(xwait, &members);
> +
> + for (i = 0; i < nmembers; i++)
> + {
> + if (TransactionIdIsCurrentTransactionId(members[i].xid))

This does not handle subtransactions like the previous code.

I have not yet reviewed the rest of the heap_lock_tuple() changes.

> @@ -3789,6 +4305,8 @@ recheck_xmax:
> * extremely low-probability scenario with minimal downside even if
> * it does happen, so for now we don't do the extra bookkeeping that
> * would be needed to clean out MultiXactIds.
> + *
> + * FIXME -- today is that day. Figure this out.

Yep. I think you can just use HeapTupleHeaderGetUpdateXid() and remove the
explicit conditional on HEAP_XMAX_IS_MULTI.

> @@ -3919,6 +4536,7 @@ HeapTupleHeaderAdvanceLatestRemovedXid(HeapTupleHeader tuple,
> TransactionId *latestRemovedXid)
> {
> TransactionId xmin = HeapTupleHeaderGetXmin(tuple);
> + /* FIXME -- change this? */
> TransactionId xmax = HeapTupleHeaderGetXmax(tuple);

Yes. Since this function is only passed dead tuples, it could previously
expect to never see a multixact xmax. No longer.

> @@ -4991,14 +5609,18 @@ heap_xlog_lock(XLogRecPtr lsn, XLogRecord *record)
> htup->t_infomask &= ~(HEAP_XMAX_COMMITTED |
> HEAP_XMAX_INVALID |
> HEAP_XMAX_IS_MULTI |
> - HEAP_IS_LOCKED |
> + HEAP_LOCK_BITS |
> HEAP_MOVED);
> - if (xlrec->xid_is_mxact)
> + if (xlrec->infobits_set & XLHL_XMAX_IS_MULTI)
> htup->t_infomask |= HEAP_XMAX_IS_MULTI;
> - if (xlrec->shared_lock)
> - htup->t_infomask |= HEAP_XMAX_SHARED_LOCK;
> - else
> + if (xlrec->infobits_set & XLHL_XMAX_IS_NOT_UPDATE)
> + htup->t_infomask |= HEAP_XMAX_IS_NOT_UPDATE;
> + if (xlrec->infobits_set & XLHL_XMAX_EXCL_LOCK)
> htup->t_infomask |= HEAP_XMAX_EXCL_LOCK;
> + if (xlrec->infobits_set & XLHL_XMAX_KEYSHR_LOCK)
> + htup->t_infomask |= HEAP_XMAX_KEYSHR_LOCK;
> + if (xlrec->infobits_set & XLHL_UPDATE_KEY_INTACT)
> + htup->t_infomask2 |= HEAP_UPDATE_KEY_INTACT;
> HeapTupleHeaderClearHotUpdated(htup);
> HeapTupleHeaderSetXmax(htup, xlrec->locking_xid);
> HeapTupleHeaderSetCmax(htup, FirstCommandId, false);

Just after here is this code:

/* Make sure there is no forward chain link in t_ctid */
htup->t_ctid = xlrec->target.tid;

Now that a KEY SHARE locker could apply over an UPDATE, that's no longer
always valid.

Incidentally, why is this level of xlog detail needed for tuple locks? We
need an FPI of the page before the lock-related changes start scribbling on
it, and we need to log any xid, even that of a locker, that could land in the
heap on disk. But, why do we actually need to replay each lock?

> --- a/src/backend/access/transam/multixact.c
> +++ b/src/backend/access/transam/multixact.c
> @@ -4,7 +4,7 @@
> * PostgreSQL multi-transaction-log manager
> *
> * The pg_multixact manager is a pg_clog-like manager that stores an array
> - * of TransactionIds for each MultiXactId. It is a fundamental part of the
> + * of MultiXactMember for each MultiXactId. It is a fundamental part of the
> * shared-row-lock implementation. A share-locked tuple stores a
> * MultiXactId in its Xmax, and a transaction that needs to wait for the
> * tuple to be unlocked can sleep on the potentially-several TransactionIds

This header comment (including more than the portion quoted here) needs
further updates. In particular, there's no direct reference to the flag bits
now stored with each member xid. Also, the comment mentions merely
preserving state across crashes, but this data now has the pg_clog life cycle.
Consider mentioning that the name is a bit historical: a singleton multixact
now has value for storing flags having no other expression.

> @@ -48,6 +48,8 @@
> */
> #include "postgres.h"
>
> +#include <unistd.h>
> +
> #include "access/multixact.h"
> #include "access/slru.h"
> #include "access/transam.h"
> @@ -60,6 +62,7 @@
> #include "storage/procarray.h"
> #include "utils/builtins.h"
> #include "utils/memutils.h"
> +#include "utils/snapmgr.h"
>
>
> /*
> @@ -75,19 +78,58 @@
> * (see MultiXact{Offset,Member}PagePrecedes).
> */

The comment just ending here mentions "MULTIXACT_*_PER_PAGE", but it's now
only correct for MULTIXACT_OFFSETS_PER_PAGE.

> @@ -180,7 +213,8 @@ static MultiXactId *OldestVisibleMXactId;
> * so they will be uninteresting by the time our next transaction starts.
> * (XXX not clear that this is correct --- other members of the MultiXact
> * could hang around longer than we did. However, it's not clear what a
> - * better policy for flushing old cache entries would be.)
> + * better policy for flushing old cache entries would be.) FIXME actually
> + * this is plain wrong now that multixact's may contain update Xids.

A key role of the cache is to avoid creating vast numbers of multixacts each
having the same membership. In that role, the existing policy seems no less
suitable than before. I agree that this patch makes the policy less suitable
for readers, though. Not sure what should be done about that, if anything.

> @@ -235,29 +297,59 @@ static bool MultiXactOffsetPrecedes(MultiXactOffset offset1,
> MultiXactOffset offset2);
> static void ExtendMultiXactOffset(MultiXactId multi);
> static void ExtendMultiXactMember(MultiXactOffset offset, int nmembers);
> -static void TruncateMultiXact(void);
> -static void WriteMZeroPageXlogRec(int pageno, uint8 info);
> +static void fillSegmentInfoData(SlruCtl ctl, SegmentInfo *segment);
> +static int compareTruncateXidEpoch(const void *a, const void *b);
> +static void WriteMZeroOffsetPageXlogRec(int pageno, TransactionId truncateXid,
> + uint32 truncateXidEpoch);
> +static void WriteMZeroMemberPageXlogRec(int pageno);
>
>
> /*
> + * MultiXactIdCreateSingleton
> + * Construct a MultiXactId representing a single transaction.

I suggest mentioning that this is useful for marking a tuple in a manner that
can only be achieved through multixact flags.

> + *
> + * NB - we don't worry about our local MultiXactId cache here, because that
> + * is handled by the lower-level routines.
> + */
> +MultiXactId
> +MultiXactIdCreateSingleton(TransactionId xid, MultiXactStatus status)
> +{
> + MultiXactId newMulti;
> + MultiXactMember member[1];
> +
> + AssertArg(TransactionIdIsValid(xid));
> +
> + member[0].xid = xid;
> + member[0].status = status;
> +
> + newMulti = CreateMultiXactId(1, member);
> +
> + debug_elog4(DEBUG2, "Create: returning %u for %u",
> + newMulti, xid);
> +
> + return newMulti;
> +}
> +
> +/*
> * MultiXactIdCreate
> * Construct a MultiXactId representing two TransactionIds.
> *
> - * The two XIDs must be different.
> + * The two XIDs must be different, or be requesting different lock modes.

Why is it not sufficient to store the strongest type for a particular xid?

> @@ -376,7 +480,7 @@ MultiXactIdExpand(MultiXactId multi, TransactionId xid)
> bool
> MultiXactIdIsRunning(MultiXactId multi)
> {
> - TransactionId *members;
> + MultiXactMember *members;
> int nmembers;
> int i;
>
> @@ -397,7 +501,7 @@ MultiXactIdIsRunning(MultiXactId multi)
> */
> for (i = 0; i < nmembers; i++)
> {
> - if (TransactionIdIsCurrentTransactionId(members[i]))
> + if (TransactionIdIsCurrentTransactionId(members[i].xid))
> {
> debug_elog3(DEBUG2, "IsRunning: I (%d) am running!", i);
> pfree(members);
> @@ -412,10 +516,10 @@ MultiXactIdIsRunning(MultiXactId multi)
> */

Just before here, there's a comment referring to the now-nonexistent
MultiXactIdIsCurrent().

> @@ -576,17 +541,24 @@ MultiXactIdSetOldestVisible(void)
> * this would not merely be useless but would lead to Assert failure inside
> * XactLockTableWait. By the time this returns, it is certain that all
> * transactions *of other backends* that were members of the MultiXactId
> - * are dead (and no new ones can have been added, since it is not legal
> - * to add members to an existing MultiXactId).
> + * that conflict with the requested status are dead (and no new ones can have
> + * been added, since it is not legal to add members to an existing
> + * MultiXactId).
> + *
> + * We return the number of members that we did not test for. This is dubbed
> + * "remaining" as in "the number of members that remaing running", but this is

Typo: "remaing".

> + * slightly incorrect, because lockers whose status did not conflict with ours
> + * are not even considered and so might have gone away anyway.
> *
> * But by the time we finish sleeping, someone else may have changed the Xmax
> * of the containing tuple, so the caller needs to iterate on us somehow.
> */
> void
> -MultiXactIdWait(MultiXactId multi)
> +MultiXactIdWait(MultiXactId multi, MultiXactStatus status, int *remaining)

This function should probably move (with a new name) to heapam.c (or maybe
lmgr.c, in part). It's an abstraction violation to have multixact.c knowing
about lock conflict tables. multixact.c should be marshalling those two bits
alongside each xid without any deep knowledge of their meaning.

> @@ -663,7 +649,7 @@ CreateMultiXactId(int nxids, TransactionId *xids)
> xl_multixact_create xlrec;
>
> debug_elog3(DEBUG2, "Create: %s",
> - mxid_to_string(InvalidMultiXactId, nxids, xids));
> + mxid_to_string(InvalidMultiXactId, nmembers, members));
>
> /*
> * See if the same set of XIDs already exists in our cache; if so, just

XIDs -> members

> @@ -870,13 +875,14 @@ GetNewMultiXactId(int nxids, MultiXactOffset *offset)
> *
> * We don't care about MultiXactId wraparound here; it will be handled by
> * the next iteration. But note that nextMXact may be InvalidMultiXactId
> - * after this routine exits, so anyone else looking at the variable must
> - * be prepared to deal with that. Similarly, nextOffset may be zero, but
> - * we won't use that as the actual start offset of the next multixact.
> + * or the first value on a segment-beggining page after this routine exits,

Typo: "beggining".

> @@ -904,64 +932,61 @@ GetMultiXactIdMembers(MultiXactId multi, TransactionId **xids)
> int length;
> int truelength;
> int i;
> + MultiXactId oldestMXact;
> MultiXactId nextMXact;
> MultiXactId tmpMXact;
> MultiXactOffset nextOffset;
> - TransactionId *ptr;
> + MultiXactMember *ptr;
>
> debug_elog3(DEBUG2, "GetMembers: asked for %u", multi);
>
> Assert(MultiXactIdIsValid(multi));
>
> /* See if the MultiXactId is in the local cache */
> - length = mXactCacheGetById(multi, xids);
> + length = mXactCacheGetById(multi, members);
> if (length >= 0)
> {
> debug_elog3(DEBUG2, "GetMembers: found %s in the cache",
> - mxid_to_string(multi, length, *xids));
> + mxid_to_string(multi, length, *members));
> return length;
> }
>
> - /* Set our OldestVisibleMXactId[] entry if we didn't already */
> - MultiXactIdSetOldestVisible();
> -
> /*
> * We check known limits on MultiXact before resorting to the SLRU area.
> *
> - * An ID older than our OldestVisibleMXactId[] entry can't possibly still
> - * be running, and we'd run the risk of trying to read already-truncated
> - * SLRU data if we did try to examine it.
> + * An ID older than MultiXactState->oldestMultiXactId cannot possibly be
> + * useful; it should have already been frozen by vacuum. We've truncated
> + * the on-disk structures anyway, so we return empty if such a value is
> + * queried.

Per the "XXX perhaps someday" comment in heap_freeze_tuple(), the implication
of probing for an old multixact record has heretofore been minor. From now on,
it can mean making the wrong visibility decision. Enter data loss. Hence, an
elog(ERROR) is normally in order. For the benefit of binary upgrades, we
could be permissive in the face of HEAP_XMAX_IS_NOT_UPDATE (formerly known as
HEAP_XMAX_SHARED_LOCK).

> *
> * Conversely, an ID >= nextMXact shouldn't ever be seen here; if it is
> * seen, it implies undetected ID wraparound has occurred. We just
> * silently assume that such an ID is no longer running.

Likewise, this is now fatal.

This raises a notable formal hazard: it's possible to burn through the
MultiXactId space faster than the regular TransactionId space. We could get
into a situation where pg_clog is covering 2B xids, and yet we need >4B
MultiXactId to cover that period. We had better at least notice this and
halt, if not have autovacuum actively prevent it.

> @@ -1026,9 +1051,8 @@ retry:
> {
> MultiXactOffset nextMXOffset;
>
> - /* handle wraparound if needed */
> - if (tmpMXact < FirstMultiXactId)
> - tmpMXact = FirstMultiXactId;
> + /* Handle corner cases if needed */
> + tmpMXact = HandleMxactOffsetCornerCases(tmpMXact);

Is there a reason apart from cycle shaving to increment a MultiXactId in one
place and elsewhere fix up the incremented value to skip the special values?
Compare to just having a MultiXactIdIncrement() function. This isn't new with
your patch, but it certainly looks odd.

> @@ -1113,26 +1170,27 @@ retry:
> * for the majority of tuples, thus keeping MultiXactId usage low (saving
> * both I/O and wraparound issues).
> *
> - * NB: the passed xids[] array will be sorted in-place.
> + * NB: the passed members array will be sorted in-place.
> */
> static MultiXactId
> -mXactCacheGetBySet(int nxids, TransactionId *xids)
> +mXactCacheGetBySet(int nmembers, MultiXactMember *members)
> {
> mXactCacheEnt *entry;
>
> debug_elog3(DEBUG2, "CacheGet: looking for %s",
> - mxid_to_string(InvalidMultiXactId, nxids, xids));
> + mxid_to_string(InvalidMultiXactId, nmembers, members));
>
> /* sort the array so comparison is easy */
> - qsort(xids, nxids, sizeof(TransactionId), xidComparator);
> + qsort(members, nmembers, sizeof(MultiXactMember), mxactMemberComparator);
>
> for (entry = MXactCache; entry != NULL; entry = entry->next)
> {
> - if (entry->nxids != nxids)
> + if (entry->nmembers != nmembers)
> continue;
>
> /* We assume the cache entries are sorted */
> - if (memcmp(xids, entry->xids, nxids * sizeof(TransactionId)) == 0)
> + /* XXX we assume the unused bits in "status" are zeroed */

That's a fair assumption if the public entry points assert it. However, ...

> + if (memcmp(members, entry->members, nmembers * sizeof(MultiXactMember)) == 0)

... this also assumes the structure has no padding. To make that safe,
MultiXactStatus should be an int32, not an enum.

> @@ -1338,17 +1367,7 @@ void
> multixact_twophase_recover(TransactionId xid, uint16 info,
> void *recdata, uint32 len)
> {
> - BackendId dummyBackendId = TwoPhaseGetDummyBackendId(xid);
> - MultiXactId oldestMember;
> -
> - /*
> - * Get the oldest member XID from the state file record, and set it in the
> - * OldestMemberMXactId slot reserved for this prepared transaction.
> - */
> - Assert(len == sizeof(MultiXactId));
> - oldestMember = *((MultiXactId *) recdata);
> -
> - OldestMemberMXactId[dummyBackendId] = oldestMember;
> + /* nothing to do */
> }
>
> /*
> @@ -1359,11 +1378,7 @@ void
> multixact_twophase_postcommit(TransactionId xid, uint16 info,
> void *recdata, uint32 len)
> {
> - BackendId dummyBackendId = TwoPhaseGetDummyBackendId(xid);
> -
> - Assert(len == sizeof(MultiXactId));
> -
> - OldestMemberMXactId[dummyBackendId] = InvalidMultiXactId;
> + /* nothing to do */
> }
>
> /*
> @@ -1374,7 +1389,7 @@ void
> multixact_twophase_postabort(TransactionId xid, uint16 info,
> void *recdata, uint32 len)
> {
> - multixact_twophase_postcommit(xid, info, recdata, len);
> + /* nothing to do */
> }

Looks like you can completely remove TWOPHASE_RM_MULTIXACT_ID.

> /*
> - * Also truncate MultiXactMember at the previously determined offset.
> + * FIXME there's a race condition here: somebody might have created a new
> + * segment after we finished scanning the dir. That scenario would leave
> + * us with an invalid truncateXid in shared memory, which is not an easy
> + * situation to get out of. Needs more thought.

Agreed. Not sure.

Broadly, this feels like a lot of code to handle truncating the segments, but
I don't know how to simplify it.

> @@ -1947,13 +2130,29 @@ MultiXactOffsetPrecedes(MultiXactOffset offset1, MultiXactOffset offset2)
> return (diff < 0);
> }
>
> +static void
> +WriteMZeroOffsetPageXlogRec(int pageno, TransactionId truncateXid,
> + uint32 truncateXidEpoch)
> +{
> + XLogRecData rdata;
> + MxactZeroOffPg zerooff;
> +
> + zerooff.pageno = pageno;
> + zerooff.truncateXid = truncateXid;
> + zerooff.truncateXidEpoch = truncateXidEpoch;
> +
> + rdata.data = (char *) (&zerooff);
> + rdata.len = sizeof(MxactZeroOffPg);

A MinSizeOf* macro is more conventional.

> + rdata.buffer = InvalidBuffer;
> + rdata.next = NULL;
> + (void) XLogInsert(RM_MULTIXACT_ID, XLOG_MULTIXACT_ZERO_OFF_PAGE, &rdata);
> +}

> --- a/src/backend/utils/time/tqual.c
> +++ b/src/backend/utils/time/tqual.c
> @@ -966,10 +1088,25 @@ HeapTupleSatisfiesMVCC(HeapTupleHeader tuple, Snapshot snapshot,
> if (tuple->t_infomask & HEAP_XMAX_INVALID) /* xid invalid */
> return true;
>
> - if (tuple->t_infomask & HEAP_IS_LOCKED) /* not deleter */
> + if (HeapTupleHeaderIsLocked(tuple)) /* not deleter */
> return true;
>
> - Assert(!(tuple->t_infomask & HEAP_XMAX_IS_MULTI));
> + if (tuple->t_infomask & HEAP_XMAX_IS_MULTI)
> + {
> + TransactionId xmax;
> +
> + xmax = HeapTupleGetUpdateXid(tuple);
> + if (!TransactionIdIsValid(xmax))
> + return true;

When does this happen? Offhand, I'd expect the HeapTupleHeaderIsLocked() test
to keep us from reaching this scenario. Anyway, the next test would catch it.

> +
> + /* updating subtransaction must have aborted */
> + if (!TransactionIdIsCurrentTransactionId(xmax))
> + return true;
> + else if (HeapTupleHeaderGetCmax(tuple) >= snapshot->curcid)
> + return true; /* updated after scan started */
> + else
> + return false; /* updated before scan started */
> + }
>
> if (!TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmax(tuple)))
> {
> @@ -1008,13 +1145,34 @@ HeapTupleSatisfiesMVCC(HeapTupleHeader tuple, Snapshot snapshot,
> if (tuple->t_infomask & HEAP_XMAX_INVALID) /* xid invalid or aborted */
> return true;
>
> - if (tuple->t_infomask & HEAP_IS_LOCKED)
> + if (HeapTupleHeaderIsLocked(tuple))
> return true;
>
> if (tuple->t_infomask & HEAP_XMAX_IS_MULTI)
> {
> - /* MultiXacts are currently only allowed to lock tuples */
> - Assert(tuple->t_infomask & HEAP_IS_LOCKED);
> + TransactionId xmax;
> +
> + if (HeapTupleHeaderIsLocked(tuple))
> + return true;

This test is redundant with the one just prior.

> +
> + xmax = HeapTupleGetUpdateXid(tuple);
> + if (TransactionIdIsCurrentTransactionId(xmax))
> + {
> + if (HeapTupleHeaderGetCmax(tuple) >= snapshot->curcid)
> + return true; /* deleted after scan started */
> + else
> + return false; /* deleted before scan started */
> + }
> + if (TransactionIdIsInProgress(xmax))
> + return true;
> + if (TransactionIdDidCommit(xmax))
> + {
> + SetHintBits(tuple, buffer, HEAP_XMAX_COMMITTED, xmax);
> + /* updating transaction committed, but when? */
> + if (XidInMVCCSnapshot(xmax, snapshot))
> + return true; /* treat as still in progress */
> + return false;
> + }

In both HEAP_XMAX_MULTI conditional blocks, you do not set HEAP_XMAX_INVALID
for an aborted updater. What is the new meaning of HEAP_XMAX_INVALID for
multixacts? What implications would arise if we instead had it mean that the
updating xid is aborted? That would allow us to get the mid-term performance
benefit of the hint bit when the updating xid spills into a multixact, and it
would reduce code duplication in this function.

I did not review the other tqual.c changes. Could you summarize how the
changes to the other functions must differ from the changes to
HeapTupleSatisfiesMVCC()?

> --- a/src/bin/pg_resetxlog/pg_resetxlog.c
> +++ b/src/bin/pg_resetxlog/pg_resetxlog.c
> @@ -332,6 +350,11 @@ main(int argc, char *argv[])
> if (set_mxoff != -1)
> ControlFile.checkPointCopy.nextMultiOffset = set_mxoff;
>
> + /*
> + if (set_mxfreeze != -1)
> + ControlFile.checkPointCopy.mxactFreezeXid = set_mxfreeze;
> + */
> +
> if (minXlogTli > ControlFile.checkPointCopy.ThisTimeLineID)
> ControlFile.checkPointCopy.ThisTimeLineID = minXlogTli;
>
> @@ -578,6 +601,10 @@ PrintControlValues(bool guessed)
> ControlFile.checkPointCopy.nextMulti);
> printf(_("Latest checkpoint's NextMultiOffset: %u\n"),
> ControlFile.checkPointCopy.nextMultiOffset);
> + /*
> + printf(_("Latest checkpoint's MultiXact freezeXid: %u\n"),
> + ControlFile.checkPointCopy.mxactFreezeXid);
> + */

Should these changes be live? They look reasonable at first glance.

> --- a/src/include/access/htup.h
> +++ b/src/include/access/htup.h
> @@ -164,12 +164,15 @@ typedef HeapTupleHeaderData *HeapTupleHeader;
> #define HEAP_HASVARWIDTH 0x0002 /* has variable-width attribute(s) */
> #define HEAP_HASEXTERNAL 0x0004 /* has external stored attribute(s) */
> #define HEAP_HASOID 0x0008 /* has an object-id field */
> -/* bit 0x0010 is available */
> +#define HEAP_XMAX_KEYSHR_LOCK 0x0010 /* xmax is a key-shared locker */
> #define HEAP_COMBOCID 0x0020 /* t_cid is a combo cid */
> #define HEAP_XMAX_EXCL_LOCK 0x0040 /* xmax is exclusive locker */
> -#define HEAP_XMAX_SHARED_LOCK 0x0080 /* xmax is shared locker */
> -/* if either LOCK bit is set, xmax hasn't deleted the tuple, only locked it */
> -#define HEAP_IS_LOCKED (HEAP_XMAX_EXCL_LOCK | HEAP_XMAX_SHARED_LOCK)
> +#define HEAP_XMAX_IS_NOT_UPDATE 0x0080 /* xmax, if valid, is only a locker.
> + * Note this is not set unless
> + * XMAX_IS_MULTI is also set. */
> +
> +#define HEAP_LOCK_BITS (HEAP_XMAX_EXCL_LOCK | HEAP_XMAX_IS_NOT_UPDATE | \
> + HEAP_XMAX_KEYSHR_LOCK)
> #define HEAP_XMIN_COMMITTED 0x0100 /* t_xmin committed */
> #define HEAP_XMIN_INVALID 0x0200 /* t_xmin invalid/aborted */
> #define HEAP_XMAX_COMMITTED 0x0400 /* t_xmax committed */
> @@ -187,14 +190,30 @@ typedef HeapTupleHeaderData *HeapTupleHeader;
> #define HEAP_XACT_MASK 0xFFE0 /* visibility-related bits */

HEAP_XACT_MASK should gain HEAP_XMAX_KEYSHR_LOCK, becoming 0xFFF0.

>
> /*
> + * A tuple is only locked (i.e. not updated by its Xmax) if it the Xmax is not
> + * a multixact and it has either the EXCL_LOCK or KEYSHR_LOCK bits set, or if
> + * the xmax is a multi that doesn't contain an update.
> + *
> + * Beware of multiple evaluation of arguments.
> + */
> +#define HeapTupleHeaderInfomaskIsLocked(infomask) \
> + ((!((infomask) & HEAP_XMAX_IS_MULTI) && \
> + (infomask) & (HEAP_XMAX_EXCL_LOCK | HEAP_XMAX_KEYSHR_LOCK)) || \
> + (((infomask) & HEAP_XMAX_IS_MULTI) && ((infomask) & HEAP_XMAX_IS_NOT_UPDATE)))
> +
> +#define HeapTupleHeaderIsLocked(tup) \
> + HeapTupleHeaderInfomaskIsLocked((tup)->t_infomask)

I'm uneasy having a HeapTupleHeaderIsLocked() that returns false when a tuple
is both updated and KEY SHARE-locked. Perhaps HeapTupleHeaderIsUpdated() with
the opposite meaning, or HeapTupleHeaderIsOnlyLocked()?

> +
> +/*
> * information stored in t_infomask2:
> */
> #define HEAP_NATTS_MASK 0x07FF /* 11 bits for number of attributes */
> -/* bits 0x3800 are available */
> +/* bits 0x1800 are available */
> +#define HEAP_UPDATE_KEY_INTACT 0x2000 /* tuple updated, key cols untouched */
> #define HEAP_HOT_UPDATED 0x4000 /* tuple was HOT-updated */
> #define HEAP_ONLY_TUPLE 0x8000 /* this is heap-only tuple */
>
> -#define HEAP2_XACT_MASK 0xC000 /* visibility-related bits */
> +#define HEAP2_XACT_MASK 0xE000 /* visibility-related bits */
>
> /*
> * HEAP_TUPLE_HAS_MATCH is a temporary flag used during hash joins. It is
> @@ -221,6 +240,23 @@ typedef HeapTupleHeaderData *HeapTupleHeader;
> (tup)->t_choice.t_heap.t_xmin = (xid) \
> )
>
> +/*
> + * HeapTupleHeaderGetXmax gets you the raw Xmax field. To find out the Xid
> + * that updated a tuple, you might need to resolve the MultiXactId if certain
> + * bits are set. HeapTupleHeaderGetUpdateXid checks those bits and takes care
> + * to resolve the MultiXactId if necessary. This might involve multixact I/O,
> + * so it should only be used if absolutely necessary.
> + */
> +#define HeapTupleHeaderGetUpdateXid(tup) \
> +( \
> + (!((tup)->t_infomask & HEAP_XMAX_INVALID) && \
> + ((tup)->t_infomask & HEAP_XMAX_IS_MULTI) && \
> + !((tup)->t_infomask & HEAP_XMAX_IS_NOT_UPDATE)) ? \
> + HeapTupleGetUpdateXid(tup) \
> + : \
> + HeapTupleHeaderGetXmax(tup) \
> +)
> +
> #define HeapTupleHeaderGetXmax(tup) \

How about making HeapTupleHeaderGetXmax() do an AssertMacro() against
HEAP_XMAX_IS_MULTI and adding HeapTupleHeaderGetRawXmax() for places that
truly do not care?

> ( \
> (tup)->t_choice.t_heap.t_xmax \
> @@ -721,16 +757,22 @@ typedef struct xl_heap_newpage
>
> #define SizeOfHeapNewpage (offsetof(xl_heap_newpage, blkno) + sizeof(BlockNumber))
>
> +/* flags for xl_heap_lock.infobits_set */
> +#define XLHL_XMAX_IS_MULTI 0x01
> +#define XLHL_XMAX_IS_NOT_UPDATE 0x02
> +#define XLHL_XMAX_EXCL_LOCK 0x04
> +#define XLHL_XMAX_KEYSHR_LOCK 0x08
> +#define XLHL_UPDATE_KEY_INTACT 0x10
> +
> /* This is what we need to know about lock */
> typedef struct xl_heap_lock
> {
> xl_heaptid target; /* locked tuple id */
> TransactionId locking_xid; /* might be a MultiXactId not xid */
> - bool xid_is_mxact; /* is it? */
> - bool shared_lock; /* shared or exclusive row lock? */
> + int8 infobits_set; /* infomask and infomask2 bits to set */
> } xl_heap_lock;
>
> -#define SizeOfHeapLock (offsetof(xl_heap_lock, shared_lock) + sizeof(bool))
> +#define SizeOfHeapLock (offsetof(xl_heap_lock, infobits_set) + sizeof(int8))
>
> /* This is what we need to know about in-place update */
> typedef struct xl_heap_inplace
> @@ -768,8 +810,7 @@ extern void HeapTupleHeaderAdvanceLatestRemovedXid(HeapTupleHeader tuple,
> extern CommandId HeapTupleHeaderGetCmin(HeapTupleHeader tup);
> extern CommandId HeapTupleHeaderGetCmax(HeapTupleHeader tup);
> extern void HeapTupleHeaderAdjustCmax(HeapTupleHeader tup,
> - CommandId *cmax,
> - bool *iscombo);
> + CommandId *cmax, bool *iscombo);

Spurious change?

>
> /* ----------------
> * fastgetattr
> @@ -854,6 +895,9 @@ extern Datum fastgetattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
> heap_getsysattr((tup), (attnum), (tupleDesc), (isnull)) \
> )
>
> +/* Prototype for HeapTupleHeader accessor in heapam.c */
> +extern TransactionId HeapTupleGetUpdateXid(HeapTupleHeader tuple);
> +
> /* prototypes for functions in common/heaptuple.c */
> extern Size heap_compute_data_size(TupleDesc tupleDesc,
> Datum *values, bool *isnull);
> diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
> index c3ec763..ff255d7 100644
> --- a/src/include/access/multixact.h
> +++ b/src/include/access/multixact.h
> @@ -13,8 +13,14 @@
>
> #include "access/xlog.h"
>
> +
> +/*
> + * The first two MultiXactId values are reserved to store the truncation Xid
> + * and epoch of the first segment, so we start assigning multixact values from
> + * 2.
> + */
> #define InvalidMultiXactId ((MultiXactId) 0)
> -#define FirstMultiXactId ((MultiXactId) 1)
> +#define FirstMultiXactId ((MultiXactId) 2)
>
> #define MultiXactIdIsValid(multi) ((multi) != InvalidMultiXactId)

Seems like this should reject 1, as well.

> --- a/src/include/access/xlog_internal.h
> +++ b/src/include/access/xlog_internal.h
> @@ -71,7 +71,7 @@ typedef struct XLogContRecord
> /*
> * Each page of XLOG file has a header like this:
> */
> -#define XLOG_PAGE_MAGIC 0xD068 /* can be used as WAL version indicator */
> +#define XLOG_PAGE_MAGIC 0xD069 /* can be used as WAL version indicator */

Need to bump pg_control_version, too.

> --- a/src/test/isolation/expected/fk-contention.out
> +++ b/src/test/isolation/expected/fk-contention.out
> @@ -7,9 +7,8 @@ step upd: UPDATE foo SET b = 'Hello World';
>
> starting permutation: ins upd com
> step ins: INSERT INTO bar VALUES (42);
> -step upd: UPDATE foo SET b = 'Hello World'; <waiting ...>
> +step upd: UPDATE foo SET b = 'Hello World';
> step com: COMMIT;
> -step upd: <... completed>

Excellent!

Thanks,
nm

Attachment Content-Type Size
fklock-wide.sql text/plain 969 bytes
fklock-test-forshare.sql text/plain 339 bytes
fklock-pgrowlocks.sql text/plain 1.3 KB

From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2011-12-12 20:20:39
Message-ID: 1323720855-sup-3526@alvh.no-ip.org
Lists: pgsql-hackers


Noah,

Many thanks for this review. I'm going through items on it; definitely
there are serious issues here, as well as minor things that also need
fixing. Thanks for all the detail.

I'll post an updated patch shortly (probably not today though); in the
meantime, this bit:

Excerpts from Noah Misch's message of dom dic 04 09:20:27 -0300 2011:

> Second, I tried a SELECT FOR SHARE on a table of 1M tuples; this might incur
> some cost due to the now-guaranteed use of pg_multixact for FOR SHARE. See
> attached fklock-test-forshare.sql. The median run slowed by 7% under the
> patch, albeit with a rather brief benchmark run. Both master and patched
> PostgreSQL seemed to exhibit a statement-scope memory leak in this test case:
> to lock 1M rows, backend-private memory grew by about 500M. When trying 10M
> rows, I cancelled the query after 1.2 GiB of consumption. This limited the
> duration of a convenient test run.

I found that this is caused by mxid_to_string being leaked all over the
place :-( I "fixed" it by making the returned string be a static that's
malloced and then freed on the next call. There's still virtsize growth
(not sure it's a legitimate leak) with that, but it's much smaller.
This being a debugging aid, I don't think there's any need to backpatch
this.

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index ddf76b3..c45bd36 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1305,9 +1305,14 @@ mxstatus_to_string(MultiXactStatus status)
static char *
mxid_to_string(MultiXactId multi, int nmembers, MultiXactMember *members)
{
- char *str = palloc(15 * (nmembers + 1) + 4);
+ static char *str = NULL;
int i;

+ if (str != NULL)
+ free(str);
+
+ str = malloc(15 * (nmembers + 1) + 4);
+
snprintf(str, 47, "%u %d[%u (%s)", multi, nmembers, members[0].xid,
mxstatus_to_string(members[0].status));

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2011-12-12 21:03:37
Message-ID: 1323723554-sup-9850@alvh.no-ip.org
Lists: pgsql-hackers


Excerpts from Alvaro Herrera's message of lun dic 12 17:20:39 -0300 2011:

> I found that this is caused by mxid_to_string being leaked all over the
> place :-( I "fixed" it by making the returned string be a static that's
> malloced and then freed on the next call. There's still virtsize growth
> (not sure it's a legitimate leak) with that, but it's much smaller.

This fixes the remaining leaks. AFAICS it now grows to a certain point
and stays at a fixed size after that. I was able to share-lock a 10M-row
table with a 30MB RSS process.

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 49d3369..7069950 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -3936,6 +3936,8 @@ l3:
keep_xmax = xwait;
keep_xmax_multi = true;
}
+
+ pfree(members);
}
}
else if (infomask & HEAP_XMAX_KEYSHR_LOCK)
@@ -4693,6 +4695,9 @@ GetMultiXactIdHintBits(MultiXactId multi)
if (!has_update)
bits |= HEAP_XMAX_IS_NOT_UPDATE;

+ if (nmembers > 0)
+ pfree(members);
+
return bits;
}

@@ -4743,6 +4748,8 @@ HeapTupleGetUpdateXid(HeapTupleHeader tuple)
break;
#endif
}
+
+ pfree(members);
}

return update_xact;

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Noah Misch <noah(at)leadboat(dot)com>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2011-12-13 14:44:49
Message-ID: 20111213144449.GB5736@tornado.leadboat.com
Lists: pgsql-hackers

On Mon, Dec 12, 2011 at 05:20:39PM -0300, Alvaro Herrera wrote:
> Excerpts from Noah Misch's message of dom dic 04 09:20:27 -0300 2011:
>
> > Second, I tried a SELECT FOR SHARE on a table of 1M tuples; this might incur
> > some cost due to the now-guaranteed use of pg_multixact for FOR SHARE. See
> > attached fklock-test-forshare.sql. The median run slowed by 7% under the
> > patch, albeit with a rather brief benchmark run. Both master and patched
> > PostgreSQL seemed to exhibit a statement-scope memory leak in this test case:
> > to lock 1M rows, backend-private memory grew by about 500M. When trying 10M
> > rows, I cancelled the query after 1.2 GiB of consumption. This limited the
> > duration of a convenient test run.
>
> I found that this is caused by mxid_to_string being leaked all over the
> place :-( I "fixed" it by making the returned string be a static that's
> malloced and then freed on the next call. There's still virtsize growth
> (not sure it's a legitimate leak) with that, but it's much smaller.

Great. I'll retry that benchmark with the next patch version. I no longer
see a leak on master, so I probably messed up that part of the test somehow.

By the way, do you have a rapid procedure for finding the call site behind a
leak like this?

> This being a debugging aid, I don't think there's any need to backpatch
> this.

Agreed.

> diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
> index ddf76b3..c45bd36 100644
> --- a/src/backend/access/transam/multixact.c
> +++ b/src/backend/access/transam/multixact.c
> @@ -1305,9 +1305,14 @@ mxstatus_to_string(MultiXactStatus status)
> static char *
> mxid_to_string(MultiXactId multi, int nmembers, MultiXactMember *members)
> {
> - char *str = palloc(15 * (nmembers + 1) + 4);
> + static char *str = NULL;
> int i;
>
> + if (str != NULL)
> + free(str);
> +
> + str = malloc(15 * (nmembers + 1) + 4);

Need a check for NULL return.

> +
> snprintf(str, 47, "%u %d[%u (%s)", multi, nmembers, members[0].xid,
> mxstatus_to_string(members[0].status));

Thanks,
nm


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2011-12-13 16:09:46
Message-ID: 1323792157-sup-5657@alvh.no-ip.org
Lists: pgsql-hackers


Excerpts from Noah Misch's message of mar dic 13 11:44:49 -0300 2011:
>
> On Mon, Dec 12, 2011 at 05:20:39PM -0300, Alvaro Herrera wrote:
> > Excerpts from Noah Misch's message of dom dic 04 09:20:27 -0300 2011:
> >
> > > Second, I tried a SELECT FOR SHARE on a table of 1M tuples; this might incur
> > > some cost due to the now-guaranteed use of pg_multixact for FOR SHARE. See
> > > attached fklock-test-forshare.sql. The median run slowed by 7% under the
> > > patch, albeit with a rather brief benchmark run. Both master and patched
> > > PostgreSQL seemed to exhibit a statement-scope memory leak in this test case:
> > > to lock 1M rows, backend-private memory grew by about 500M. When trying 10M
> > > rows, I cancelled the query after 1.2 GiB of consumption. This limited the
> > > duration of a convenient test run.
> >
> > I found that this is caused by mxid_to_string being leaked all over the
> > place :-( I "fixed" it by making the returned string be a static that's
> > malloced and then freed on the next call. There's still virtsize growth
> > (not sure it's a legitimate leak) with that, but it's much smaller.
>
> Great. I'll retry that benchmark with the next patch version. I no longer
> see a leak on master, so I probably messed up that part of the test somehow.

Maybe you recompiled without the MULTIXACT_DEBUG symbol defined?

> By the way, do you have a rapid procedure for finding the call site behind a
> leak like this?

Not really ... I tried some games with GDB (which yielded the first
report: I did some "call MemoryContextStats(TopMemoryContext)" to see
where the bloat was, and then stepped with breaks on MemoryContextAlloc,
also with a watch on CurrentMemoryContext and noting when it was
pointing to the bloated context. But since I'm a rookie with GDB I
didn't find a way to only break when MemoryContextAlloc was pointing at
that context. I know there must be a way.) and then went to do some
code inspection instead. I gather some people use valgrind
successfully.

> > + if (str != NULL)
> > + free(str);
> > +
> > + str = malloc(15 * (nmembers + 1) + 4);
>
> Need a check for NULL return.

Yeah, thanks ... I changed it to MemoryContextAlloc(TopMemoryContext),
because I'm not sure whether a combination of malloc plus palloc might end
up causing extra memory fragmentation.

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Noah Misch <noah(at)leadboat(dot)com>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2011-12-13 16:20:10
Message-ID: 20111213162010.GC5736@tornado.leadboat.com
Lists: pgsql-hackers

On Tue, Dec 13, 2011 at 01:09:46PM -0300, Alvaro Herrera wrote:
>
> Excerpts from Noah Misch's message of mar dic 13 11:44:49 -0300 2011:
> >
> > On Mon, Dec 12, 2011 at 05:20:39PM -0300, Alvaro Herrera wrote:
> > > Excerpts from Noah Misch's message of dom dic 04 09:20:27 -0300 2011:
> > >
> > > > Second, I tried a SELECT FOR SHARE on a table of 1M tuples; this might incur
> > > > some cost due to the now-guaranteed use of pg_multixact for FOR SHARE. See
> > > > attached fklock-test-forshare.sql. The median run slowed by 7% under the
> > > > patch, albeit with a rather brief benchmark run. Both master and patched
> > > > PostgreSQL seemed to exhibit a statement-scope memory leak in this test case:
> > > > to lock 1M rows, backend-private memory grew by about 500M. When trying 10M
> > > > rows, I cancelled the query after 1.2 GiB of consumption. This limited the
> > > > duration of a convenient test run.
> > >
> > > I found that this is caused by mxid_to_string being leaked all over the
> > > place :-( I "fixed" it by making the returned string be a static that's
> > > malloced and then freed on the next call. There's still virtsize growth
> > > (not sure it's a legitimate leak) with that, but it's much smaller.
> >
> > Great. I'll retry that benchmark with the next patch version. I no longer
> > see a leak on master, so I probably messed up that part of the test somehow.
>
> Maybe you recompiled without the MULTIXACT_DEBUG symbol defined?

Neither my brain nor my shell history recall that, but it remains possible.

> > By the way, do you have a rapid procedure for finding the call site behind a
> > leak like this?
>
> Not really ... I tried some games with GDB (which yielded the first
> report: I did some "call MemoryContextStats(TopMemoryContext)" to see
> where the bloat was, and then stepped with breaks on MemoryContextAlloc,
> also with a watch on CurrentMemoryContext and noting when it was
> pointing to the bloated context. But since I'm a rookie with GDB I
> didn't find a way to only break when MemoryContextAlloc was pointing at
> that context. I know there must be a way.) and then went to do some
> code inspection instead. I gather some people use valgrind
> successfully.

Understood. Incidentally, the GDB command in question is "condition".
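For example, a conditional breakpoint that fires only when MemoryContextAlloc targets the suspect context might look like this (a hypothetical session; the breakpoint number and context address are placeholders):

```
(gdb) call MemoryContextStats(TopMemoryContext)   # identify the bloated context
(gdb) break MemoryContextAlloc
(gdb) condition 1 context == (MemoryContext) 0x55555595e2a0
(gdb) continue
```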

> > > + if (str != NULL)
> > > + free(str);
> > > +
> > > + str = malloc(15 * (nmembers + 1) + 4);
> >
> > Need a check for NULL return.
>
> Yeah, thanks ... I changed it to MemoryContextAlloc(TopMemoryContext),
> because I'm not sure whether a combination of malloc plus palloc would
> result in extra memory fragmentation.

Sounds good.
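The pattern under discussion, caching the previous result in a static pointer and freeing it on the next call, can be sketched as follows. This is an illustration only: members_to_string is a hypothetical stand-in for mxid_to_string, and plain malloc stands in for the backend allocator.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical stand-in for mxid_to_string: formats a member count and
 * the member xids into one string.  The previous result is kept in a
 * static pointer and freed on re-entry, so at most one result leaks.
 */
static char *
members_to_string(const unsigned *xids, int nmembers)
{
    static char *str = NULL;    /* previous result, freed on next call */
    char       *p;
    int         i;

    if (str != NULL)
        free(str);

    /* 15 bytes is ample for one formatted uint32 plus separator */
    str = malloc(15 * (nmembers + 1) + 4);
    if (str == NULL)
        return NULL;            /* the patch instead switched to
                                 * MemoryContextAlloc, which errors out
                                 * rather than returning NULL */

    p = str;
    p += sprintf(p, "%d", nmembers);
    for (i = 0; i < nmembers; i++)
        p += sprintf(p, " %u", xids[i]);
    return str;
}
```

Note the remaining trade-off mentioned above: the last result is still live until the next call, which may explain residual virtual-size growth without being a true leak.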


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2011-12-13 21:36:21
Message-ID: 1323808901-sup-7243@alvh.no-ip.org
Lists: pgsql-hackers


Excerpts from Noah Misch's message of dom dic 04 09:20:27 -0300 2011:

> > + /*
> > + * If the tuple we're updating is locked, we need to preserve this in the
> > + * new tuple's Xmax as well as in the old tuple. Prepare the new xmax
> > + * value for these uses.
> > + *
> > + * Note there cannot be an xmax to save if we're changing key columns; in
> > + * this case, the wait above should have only returned when the locking
> > + * transactions finished.
> > + */
> > + if (TransactionIdIsValid(keep_xmax))
> > + {
> > + if (keep_xmax_multi)
> > + {
> > + keep_xmax_old = MultiXactIdExpand(keep_xmax,
> > + xid, MultiXactStatusUpdate);
> > + keep_xmax_infomask = HEAP_XMAX_KEYSHR_LOCK | HEAP_XMAX_IS_MULTI;
>
> Not directly related to this line, but is the HEAP_IS_NOT_UPDATE bit getting
> cleared where needed?

AFAICS it's reset along with the rest of the HEAP_LOCK_BITS when the
tuple is modified.

> > @@ -2725,11 +2884,20 @@ l2:
> > oldtup.t_data->t_infomask &= ~(HEAP_XMAX_COMMITTED |
> > HEAP_XMAX_INVALID |
> > HEAP_XMAX_IS_MULTI |
> > - HEAP_IS_LOCKED |
> > + HEAP_LOCK_BITS |
> > HEAP_MOVED);
> > + oldtup.t_data->t_infomask2 &= ~HEAP_UPDATE_KEY_INTACT;
> > HeapTupleClearHotUpdated(&oldtup);
> > /* ... and store info about transaction updating this tuple */
> > - HeapTupleHeaderSetXmax(oldtup.t_data, xid);
> > + if (TransactionIdIsValid(keep_xmax_old))
> > + {
> > + HeapTupleHeaderSetXmax(oldtup.t_data, keep_xmax_old);
> > + oldtup.t_data->t_infomask |= keep_xmax_old_infomask;
> > + }
> > + else
> > + HeapTupleHeaderSetXmax(oldtup.t_data, xid);
> > + if (key_intact)
> > + oldtup.t_data->t_infomask2 |= HEAP_UPDATE_KEY_INTACT;
> > HeapTupleHeaderSetCmax(oldtup.t_data, cid, iscombo);
> > /* temporarily make it look not-updated */
> > oldtup.t_data->t_ctid = oldtup.t_self;
>
> Shortly after this, we release the content lock and go off toasting the tuple
> and finding free space. When we come back, could the old tuple have
> accumulated additional KEY SHARE locks that we need to re-copy?

Yeah, I've been wondering about this: do we have a problem already with
exclusion constraints? I mean, if a concurrent inserter doesn't see the
tuple that we've marked here as deleted while we toast it, it could
result in a violated constraint, right? I haven't built a test case to
prove it.

> > @@ -3231,30 +3462,70 @@ l3:
> > {
> > TransactionId xwait;
> > uint16 infomask;
> > + uint16 infomask2;
> > + bool require_sleep;
> >
> > /* must copy state data before unlocking buffer */
> > xwait = HeapTupleHeaderGetXmax(tuple->t_data);
> > infomask = tuple->t_data->t_infomask;
> > + infomask2 = tuple->t_data->t_infomask2;
> >
> > LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);
> >
> > /*
> > - * If we wish to acquire share lock, and the tuple is already
> > - * share-locked by a multixact that includes any subtransaction of the
> > - * current top transaction, then we effectively hold the desired lock
> > - * already. We *must* succeed without trying to take the tuple lock,
> > - * else we will deadlock against anyone waiting to acquire exclusive
> > - * lock. We don't need to make any state changes in this case.
> > + * If we wish to acquire share or key lock, and the tuple is already
> > + * key or share locked by a multixact that includes any subtransaction
> > + * of the current top transaction, then we effectively hold the desired
> > + * lock already (except if we own key share lock and now desire share
> > + * lock). We *must* succeed without trying to take the tuple lock,
>
> This can now apply to FOR UPDATE as well.
>
> For the first sentence, I suggest the wording "If any subtransaction of the
> current top transaction already holds a stronger lock, we effectively hold the
> desired lock already."

I settled on this:

/*
* If any subtransaction of the current top transaction already holds a
* lock as strong or stronger than what we're requesting, we
* effectively hold the desired lock already. We *must* succeed
* without trying to take the tuple lock, else we will deadlock against
* anyone wanting to acquire a stronger lock.
*/
if (infomask & HEAP_XMAX_IS_MULTI)
{
int i;
int nmembers;
MultiXactMember *members;
MultiXactStatus cutoff = get_mxact_status_for_tuplelock(mode);

nmembers = GetMultiXactIdMembers(xwait, &members);

for (i = 0; i < nmembers; i++)
{
if (TransactionIdIsCurrentTransactionId(members[i].xid))
{
if (members[i].status >= cutoff)
{
if (have_tuple_lock)
UnlockTupleTuplock(relation, tid, mode);

pfree(members);
return HeapTupleMayBeUpdated;
}
}
}

pfree(members);
}

Now, I can't see the reason that we didn't previously consider locks "as
strong as what we're requesting" ... but surely it's the same case?
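The "as strong or stronger" test in the loop boils down to an ordered comparison of lock statuses against the cutoff. A toy restatement (the enum names and ordering here are illustrative, not the patch's actual MultiXactStatus values):

```c
#include <stdbool.h>

/*
 * Lock strengths in increasing order, mirroring the idea behind
 * get_mxact_status_for_tuplelock().  Illustrative only.
 */
typedef enum
{
    STATUS_FOR_KEY_SHARE,
    STATUS_FOR_SHARE,
    STATUS_FOR_UPDATE,
    STATUS_UPDATE
} TupleLockStatus;

/*
 * The per-member check from the quoted loop: a member xid belonging to
 * the current transaction satisfies the request if its status is at or
 * above the cutoff derived from the requested lock mode.
 */
static bool
already_holds_lock(TupleLockStatus held, TupleLockStatus requested)
{
    return held >= requested;
}
```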

> > + * else we will deadlock against anyone wanting to acquire a stronger
> > + * lock.
>
> > + *
> > + * FIXME -- we don't do the below currently, but I think we should:
> > + *
> > + * We update the Xmax with a new MultiXactId to include the new lock
> > + * mode in this case.
> > + *
> > + * Note that since we want to alter the Xmax, we need to re-acquire the
> > + * buffer lock. The xmax could have changed in the meantime, so we
> > + * recheck it in that case, but we keep the buffer lock while doing it
> > + * to prevent starvation. The second time around we know we must be
> > + * part of the MultiXactId in any case, which is why we don't need to
> > + * go back to recheck HeapTupleSatisfiesUpdate. Also, after we
> > + * re-acquire lock, the MultiXact is likely to (but not necessarily) be
> > + * the same that we see here, so it should be in multixact's cache and
> > + * thus quick to obtain.
>
> What is the benefit of doing so?

After thinking more about it, I think it's bogus. I've removed it.

> Incidentally, why is this level of xlog detail needed for tuple locks? We
> need an FPI of the page before the lock-related changes start scribbling on
> it, and we need to log any xid, even that of a locker, that could land in the
> heap on disk. But, why do we actually need to replay each lock?

Uhm. I remember thinking that a hot standby replica needed it ...

> > + * slightly incorrect, because lockers whose status did not conflict with ours
> > + * are not even considered and so might have gone away anyway.
> > *
> > * But by the time we finish sleeping, someone else may have changed the Xmax
> > * of the containing tuple, so the caller needs to iterate on us somehow.
> > */
> > void
> > -MultiXactIdWait(MultiXactId multi)
> > +MultiXactIdWait(MultiXactId multi, MultiXactStatus status, int *remaining)
>
> This function should probably move (with a new name) to heapam.c (or maybe
> lmgr.c, in part). It's an abstraction violation to have multixact.c knowing
> about lock conflict tables. multixact.c should be marshalling those two bits
> alongside each xid without any deep knowledge of their meaning.

Interesting thought.

> > /*
> > - * Also truncate MultiXactMember at the previously determined offset.
> > + * FIXME there's a race condition here: somebody might have created a new
> > + * segment after we finished scanning the dir. That scenario would leave
> > + * us with an invalid truncateXid in shared memory, which is not an easy
> > + * situation to get out of. Needs more thought.
>
> Agreed. Not sure.
>
> Broadly, this feels like a lot of code to handle truncating the segments, but
> I don't know how to simplify it.

It is a lot of code. And it took me quite a while to even figure out
how to do it. I don't see any other way to go about it.

> > + xmax = HeapTupleGetUpdateXid(tuple);
> > + if (TransactionIdIsCurrentTransactionId(xmax))
> > + {
> > + if (HeapTupleHeaderGetCmax(tuple) >= snapshot->curcid)
> > + return true; /* deleted after scan started */
> > + else
> > + return false; /* deleted before scan started */
> > + }
> > + if (TransactionIdIsInProgress(xmax))
> > + return true;
> > + if (TransactionIdDidCommit(xmax))
> > + {
> > + SetHintBits(tuple, buffer, HEAP_XMAX_COMMITTED, xmax);
> > + /* updating transaction committed, but when? */
> > + if (XidInMVCCSnapshot(xmax, snapshot))
> > + return true; /* treat as still in progress */
> > + return false;
> > + }
>
> In both HEAP_XMAX_MULTI conditional blocks, you do not set HEAP_XMAX_INVALID
> for an aborted updater. What is the new meaning of HEAP_XMAX_INVALID for
> multixacts? What implications would arise if we instead had it mean that the
> updating xid is aborted? That would allow us to get the mid-term performance
> benefit of the hint bit when the updating xid spills into a multixact, and it
> would reduce code duplication in this function.

Well, HEAP_XMAX_INVALID means the Xmax is not valid, period. If there's
a multi whose updater is aborted, there's still a multi that needs to be
checked in various places, so we cannot set that bit.

> I did not review the other tqual.c changes. Could you summarize how the
> changes to the other functions must differ from the changes to
> HeapTupleSatisfiesMVCC()?

I don't think they should differ in any significant way ... if they do,
it's probably bogus. Only HeapTupleSatisfiesVacuum should differ
significantly, because it's a world on its own.

> > --- a/src/bin/pg_resetxlog/pg_resetxlog.c
> > +++ b/src/bin/pg_resetxlog/pg_resetxlog.c
> > @@ -332,6 +350,11 @@ main(int argc, char *argv[])
> > if (set_mxoff != -1)
> > ControlFile.checkPointCopy.nextMultiOffset = set_mxoff;
> >
> > + /*
> > + if (set_mxfreeze != -1)
> > + ControlFile.checkPointCopy.mxactFreezeXid = set_mxfreeze;
> > + */
> > +
> > if (minXlogTli > ControlFile.checkPointCopy.ThisTimeLineID)
> > ControlFile.checkPointCopy.ThisTimeLineID = minXlogTli;
> >
> > @@ -578,6 +601,10 @@ PrintControlValues(bool guessed)
> > ControlFile.checkPointCopy.nextMulti);
> > printf(_("Latest checkpoint's NextMultiOffset: %u\n"),
> > ControlFile.checkPointCopy.nextMultiOffset);
> > + /*
> > + printf(_("Latest checkpoint's MultiXact freezeXid: %u\n"),
> > + ControlFile.checkPointCopy.mxactFreezeXid);
> > + */
>
> Should these changes be live? They look reasonable at first glance.

Oh, I forgot about these. Yeah, these need to be live, but not in the
exact form they have here; there were some tweaks I needed to do IIRC.

> > /*
> > + * A tuple is only locked (i.e. not updated by its Xmax) if the Xmax is not
> > + * a multixact and it has either the EXCL_LOCK or KEYSHR_LOCK bits set, or if
> > + * the xmax is a multi that doesn't contain an update.
> > + *
> > + * Beware of multiple evaluation of arguments.
> > + */
> > +#define HeapTupleHeaderInfomaskIsLocked(infomask) \
> > + ((!((infomask) & HEAP_XMAX_IS_MULTI) && \
> > + (infomask) & (HEAP_XMAX_EXCL_LOCK | HEAP_XMAX_KEYSHR_LOCK)) || \
> > + (((infomask) & HEAP_XMAX_IS_MULTI) && ((infomask) & HEAP_XMAX_IS_NOT_UPDATE)))
> > +
> > +#define HeapTupleHeaderIsLocked(tup) \
> > + HeapTupleHeaderInfomaskIsLocked((tup)->t_infomask)
>
> I'm uneasy having a HeapTupleHeaderIsLocked() that returns false when a tuple
> is both updated and KEY SHARE-locked. Perhaps HeapTupleHeaderIsUpdated() with
> the opposite meaning, or HeapTupleHeaderIsOnlyLocked()?

I had the IsOnlyLocked thought too. I will go that route.

(I changed HeapTupleHeaderGetXmax to GetRawXmax, thanks for that
suggestion)
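For reference, the bit test being renamed can be restated as a standalone function. The bit values below are the ones quoted earlier in the thread plus the pre-existing HEAP_XMAX_IS_MULTI; this is an illustration of the logic, not the actual htup.h macro:

```c
#include <stdbool.h>
#include <stdint.h>

/* Infomask bits as quoted in this thread (illustrative copy). */
#define HEAP_XMAX_KEYSHR_LOCK   0x0010
#define HEAP_XMAX_EXCL_LOCK     0x0040
#define HEAP_XMAX_IS_NOT_UPDATE 0x0080  /* only set with IS_MULTI */
#define HEAP_XMAX_IS_MULTI      0x1000

/*
 * True when xmax is only locking the tuple: either a plain xid carrying
 * one of the lock bits, or a multixact flagged as containing no update.
 * A function avoids the multiple-evaluation hazard the macro comment
 * warns about.
 */
static bool
infomask_is_only_locked(uint16_t infomask)
{
    if (infomask & HEAP_XMAX_IS_MULTI)
        return (infomask & HEAP_XMAX_IS_NOT_UPDATE) != 0;
    return (infomask & (HEAP_XMAX_EXCL_LOCK | HEAP_XMAX_KEYSHR_LOCK)) != 0;
}
```

Under the "IsOnlyLocked" naming, a tuple that is both updated and KEY SHARE-locked correctly tests false here, which is the ambiguity the rename resolves.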

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Noah Misch <noah(at)leadboat(dot)com>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2011-12-14 15:21:29
Message-ID: 20111214152129.GA28988@tornado.leadboat.com
Lists: pgsql-hackers

On Tue, Dec 13, 2011 at 06:36:21PM -0300, Alvaro Herrera wrote:
> Excerpts from Noah Misch's message of dom dic 04 09:20:27 -0300 2011:

> > > @@ -2725,11 +2884,20 @@ l2:
> > > oldtup.t_data->t_infomask &= ~(HEAP_XMAX_COMMITTED |
> > > HEAP_XMAX_INVALID |
> > > HEAP_XMAX_IS_MULTI |
> > > - HEAP_IS_LOCKED |
> > > + HEAP_LOCK_BITS |
> > > HEAP_MOVED);
> > > + oldtup.t_data->t_infomask2 &= ~HEAP_UPDATE_KEY_INTACT;
> > > HeapTupleClearHotUpdated(&oldtup);
> > > /* ... and store info about transaction updating this tuple */
> > > - HeapTupleHeaderSetXmax(oldtup.t_data, xid);
> > > + if (TransactionIdIsValid(keep_xmax_old))
> > > + {
> > > + HeapTupleHeaderSetXmax(oldtup.t_data, keep_xmax_old);
> > > + oldtup.t_data->t_infomask |= keep_xmax_old_infomask;
> > > + }
> > > + else
> > > + HeapTupleHeaderSetXmax(oldtup.t_data, xid);
> > > + if (key_intact)
> > > + oldtup.t_data->t_infomask2 |= HEAP_UPDATE_KEY_INTACT;
> > > HeapTupleHeaderSetCmax(oldtup.t_data, cid, iscombo);
> > > /* temporarily make it look not-updated */
> > > oldtup.t_data->t_ctid = oldtup.t_self;
> >
> > Shortly after this, we release the content lock and go off toasting the tuple
> > and finding free space. When we come back, could the old tuple have
> > accumulated additional KEY SHARE locks that we need to re-copy?
>
> Yeah, I've been wondering about this: do we have a problem already with
> exclusion constraints? I mean, if a concurrent inserter doesn't see the
> tuple that we've marked here as deleted while we toast it, it could
> result in a violated constraint, right? I haven't built a test case to
> prove it.

Does the enforcement code for exclusion constraints differ significantly from
the ordinary unique constraint code? If not, I'd expect the concurrent inserter
to treat the tuple precisely like an uncommitted delete, in which case it will
wait for the deleter.

> I settled on this:
>
> /*
> * If any subtransaction of the current top transaction already holds a
> * lock as strong or stronger than what we're requesting, we
> * effectively hold the desired lock already. We *must* succeed
> * without trying to take the tuple lock, else we will deadlock against
> * anyone wanting to acquire a stronger lock.
> */

> Now, I can't see the reason that we didn't previously consider locks "as
> strong as what we're requesting" ... but surely it's the same case?

I think it does degenerate to the same case. When we hold an exclusive lock
in master, HeapTupleSatisfiesUpdate() will return HeapTupleMayBeUpdated. So,
we can only get here while holding a mere share lock.

> > In both HEAP_XMAX_MULTI conditional blocks, you do not set HEAP_XMAX_INVALID
> > for an aborted updater. What is the new meaning of HEAP_XMAX_INVALID for
> > multixacts? What implications would arise if we instead had it mean that the
> > updating xid is aborted? That would allow us to get the mid-term performance
> > benefit of the hint bit when the updating xid spills into a multixact, and it
> > would reduce code duplication in this function.
>
> Well, HEAP_XMAX_INVALID means the Xmax is not valid, period. If there's
> a multi whose updater is aborted, there's still a multi that needs to be
> checked in various places, so we cannot set that bit.

Ah, yes. Perhaps a better question: would changing HEAP_XMAX_INVALID to
HEAP_UPDATER_INVALID pay off? That would help HeapTupleSatisfiesMVCC() at the
expense of HeapTupleSatisfiesUpdate(), probably along with other consequences I
haven't contemplated adequately.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2011-12-14 15:36:54
Message-ID: 7284.1323877014@sss.pgh.pa.us
Lists: pgsql-hackers

Noah Misch <noah(at)leadboat(dot)com> writes:
> On Tue, Dec 13, 2011 at 06:36:21PM -0300, Alvaro Herrera wrote:
>> Yeah, I've been wondering about this: do we have a problem already with
>> exclusion constraints? I mean, if a concurrent inserter doesn't see the
>> tuple that we've marked here as deleted while we toast it, it could
>> result in a violated constraint, right? I haven't built a test case to
>> prove it.

> Does the enforcement code for exclusion constraints differ significantly from
> the ordinary unique constraint code?

It's an entirely separate code path (involving an AFTER trigger). I
don't know if there's a problem, but Alvaro's right to worry that it
might behave differently.

regards, tom lane


From: Greg Smith <greg(at)2ndQuadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: foreign key locks, 2nd attempt
Date: 2011-12-16 13:37:16
Message-ID: 4EEB498C.3070207@2ndQuadrant.com
Lists: pgsql-hackers

Sounds like there are still a few things left to research on Alvaro's
side, and I'm thinking there's a performance/reliability under load
testing side of this that will take some work to validate too. Since I
can't see all that happening fast enough for this to be committed soon, I'm going
to mark it returned with feedback for now. I'm trying to remove
everything that isn't likely to end up in the next alpha from the open list.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-01-15 04:49:54
Message-ID: 1326601991-sup-3787@alvh.no-ip.org
Lists: pgsql-hackers

Here's an updated version of this patch. It fixes many of Noah's review
points from the previous version, and it also contains some fixes for
other problems. The most glaring one was the fact that when we locked
an old version of a tuple, the locking code did not walk the update
chain to make the newer versions locked too. This is necessary for
correctness; moreover, locking a tuple whose future version is being
deleted by a concurrent transaction needs to cause the locking
transaction to block until the deleting transaction finishes. The
current code correctly sleeps on the deleter (or, if the deleter has
already committed, causes the tuple lock acquisition to fail.)

One other interesting change is that I flipped the
HEAP_UPDATE_KEY_INTACT bit meaning, so that it's now
HEAP_UPDATE_KEY_REVOKED; it's now set not only when an UPDATE changes a
key column, but also when a tuple is deleted. Only now that I write
this message do I realize that I should have changed the name too,
because it's no longer just about UPDATE.

There are a number of smaller items still remaining, and I will be
working on those in the next few days. Most notably,

- I have not updated the docs yet.

- I haven't done anything about exposing FOR KEY UPDATE to the SQL
level. There clearly isn't consensus about exposing this; in fact there
isn't consensus on exposing FOR KEY SHARE, but I haven't changed that
from the previous patch, either.

- pg_rowlocks hasn't been updated; in this patch it's in the same shape
as it was previously. I agree with the idea that this module should
display user-level lock information instead of just decoding infomask
bits.

- I'm not sure that the multixact truncation code is sane on
checkpoints. It might be that I need to further tweak the pg_control
info we keep about truncation. The whole truncation thing needs more
testing, too.

- pg_upgrade bits are missing.

- Go over Noah's two reviews again and see if I missed anything; also
make sure I haven't missed anything from other reviewers.

At the core, ISTM this patch is much closer to final form. The list of
things that are still open is much shorter. I'm pretty sure this can be
taken to committable state during the 2012-01 commitfest.

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Attachment Content-Type Size
fklocks-5.patch.gz application/x-gzip 60.8 KB

From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-01-16 19:17:42
Message-ID: 4F1477D6.5090809@enterprisedb.com
Lists: pgsql-hackers

On 15.01.2012 06:49, Alvaro Herrera wrote:
> --- 164,178 ----
> #define HEAP_HASVARWIDTH 0x0002 /* has variable-width attribute(s) */
> #define HEAP_HASEXTERNAL 0x0004 /* has external stored attribute(s) */
> #define HEAP_HASOID 0x0008 /* has an object-id field */
> ! #define HEAP_XMAX_KEYSHR_LOCK 0x0010 /* xmax is a key-shared locker */
> #define HEAP_COMBOCID 0x0020 /* t_cid is a combo cid */
> #define HEAP_XMAX_EXCL_LOCK 0x0040 /* xmax is exclusive locker */
> ! #define HEAP_XMAX_IS_NOT_UPDATE 0x0080 /* xmax, if valid, is only a locker.
> ! * Note this is not set unless
> ! * XMAX_IS_MULTI is also set. */
> !
> ! #define HEAP_LOCK_BITS (HEAP_XMAX_EXCL_LOCK | HEAP_XMAX_IS_NOT_UPDATE | \
> ! HEAP_XMAX_KEYSHR_LOCK)
> #define HEAP_XMIN_COMMITTED 0x0100 /* t_xmin committed */
> #define HEAP_XMIN_INVALID 0x0200 /* t_xmin invalid/aborted */
> #define HEAP_XMAX_COMMITTED 0x0400 /* t_xmax committed */

HEAP_XMAX_IS_NOT_UPDATE is a pretty opaque name for that. From the name,
I'd say that a DELETE would set that, but the comment says it has to do
with locking. After going through all the combinations in my mind, I
think I finally understood it: HEAP_XMAX_IS_NOT_UPDATE is set if xmax is
a multi-xact that represents only locking xids, no updates. How about
calling it "HEAP_XMAX_LOCK_ONLY", and setting it whether or not
xmax is a multi-xid?

> - I haven't done anything about exposing FOR KEY UPDATE to the SQL
> level. There clearly isn't consensus about exposing this; in fact
> there isn't consensus on exposing FOR KEY SHARE, but I haven't
> changed that from the previous patch, either.

I think it would be useful to expose it. Not that anyone should be using
them in an application (or would it be useful?), but I feel it could
make testing significantly easier.

> - pg_upgrade bits are missing.

I guess we'll need to rewrite pg_multixact contents in pg_upgrade. Is
the page format backwards-compatible?

Why are you renaming HeapTupleHeaderGetXmax() into
HeapTupleHeaderGetRawXmax()? Any current callers of
HeapTupleHeaderGetXmax() should already check that HEAP_XMAX_IS_MULTI is
not set.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-01-16 19:52:36
Message-ID: 1326743000-sup-7028@alvh.no-ip.org
Lists: pgsql-hackers


Excerpts from Heikki Linnakangas's message of lun ene 16 16:17:42 -0300 2012:
>
> On 15.01.2012 06:49, Alvaro Herrera wrote:
> > --- 164,178 ----
> > #define HEAP_HASVARWIDTH 0x0002 /* has variable-width attribute(s) */
> > #define HEAP_HASEXTERNAL 0x0004 /* has external stored attribute(s) */
> > #define HEAP_HASOID 0x0008 /* has an object-id field */
> > ! #define HEAP_XMAX_KEYSHR_LOCK 0x0010 /* xmax is a key-shared locker */
> > #define HEAP_COMBOCID 0x0020 /* t_cid is a combo cid */
> > #define HEAP_XMAX_EXCL_LOCK 0x0040 /* xmax is exclusive locker */
> > ! #define HEAP_XMAX_IS_NOT_UPDATE 0x0080 /* xmax, if valid, is only a locker.
> > ! * Note this is not set unless
> > ! * XMAX_IS_MULTI is also set. */
> > !
> > ! #define HEAP_LOCK_BITS (HEAP_XMAX_EXCL_LOCK | HEAP_XMAX_IS_NOT_UPDATE | \
> > ! HEAP_XMAX_KEYSHR_LOCK)
> > #define HEAP_XMIN_COMMITTED 0x0100 /* t_xmin committed */
> > #define HEAP_XMIN_INVALID 0x0200 /* t_xmin invalid/aborted */
> > #define HEAP_XMAX_COMMITTED 0x0400 /* t_xmax committed */
>
> HEAP_XMAX_IS_NOT_UPDATE is a pretty opaque name for that. From the name,
> I'd say that a DELETE would set that, but the comment says it has to do
> with locking. After going through all the combinations in my mind, I
> think I finally understood it: HEAP_XMAX_IS_NOT_UPDATE is set if xmax is
> a multi-xact that represents only locking xids, no updates. How about
> calling it "HEAP_XMAX_LOCK_ONLY", and setting it whether or not
> xmax is a multi-xid?

Hm, sounds like a good idea. Will do.

> > - I haven't done anything about exposing FOR KEY UPDATE to the SQL
> > level. There clearly isn't consensus about exposing this; in fact
> > there isn't consensus on exposing FOR KEY SHARE, but I haven't
> > changed that from the previous patch, either.
>
> I think it would be useful to expose it. Not that anyone should be using
> them in an application (or would it be useful?), but I feel it could
> make testing significantly easier.

Okay, two votes in favor; I'll go do that too.

> > - pg_upgrade bits are missing.
>
> I guess we'll need to rewrite pg_multixact contents in pg_upgrade. Is
> the page format backwards-compatible?

It's not.

I haven't worked out what pg_upgrade needs to do, honestly. I think we
should just not copy old pg_multixact files when upgrading across this
patch. I was initially thinking that pg_multixact should return the
empty set when asked for the members of a multi that preceded the freeze
point. That way, I thought, we would just never try to access a page
originated in the older version (assuming the freeze point is set to
"current" whenever pg_upgrade runs). However, as things currently
stand, accessing an old multi raises an error. So maybe we need a
scheme a bit more complex to handle this.

> Why are you renaming HeapTupleHeaderGetXmax() into
> HeapTupleHeaderGetRawXmax()? Any current callers of
> HeapTupleHeaderGetXmax() should already check that HEAP_XMAX_IS_MULTI is
> not set.

I had this vague impression that it'd be better to break existing
callers, so that they are forced to decide between
HeapTupleHeaderGetRawXmax and HeapTupleHeaderGetUpdateXid. Noah
suggested changing the macro name, too. It's up to each caller to
decide which semantics they want. Most want the latter; and callers
outside core are more likely to want that one. If we kept the old name,
they would get the wrong value.

If we want to keep the original name, it's all the same to me.

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-01-17 06:21:28
Message-ID: 4F151368.4020507@enterprisedb.com
Lists: pgsql-hackers

On 16.01.2012 21:52, Alvaro Herrera wrote:
>
> Excerpts from Heikki Linnakangas's message of lun ene 16 16:17:42 -0300 2012:
>>
>> On 15.01.2012 06:49, Alvaro Herrera wrote:
>>> - pg_upgrade bits are missing.
>>
>> I guess we'll need to rewrite pg_multixact contents in pg_upgrade. Is
>> the page format backwards-compatible?
>
> It's not.
>
> I haven't worked out what pg_upgrade needs to do, honestly. I think we
> should just not copy old pg_multixact files when upgrading across this
> patch.

Sorry, I meant: is the *data* page format backwards-compatible? The
multixact page format clearly isn't.

> I was initially thinking that pg_multixact should return the
> empty set if requested members of a multi that preceded the freeze
> point. That way, I thought, we would just never try to access a page
> originated in the older version (assuming the freeze point is set to
> "current" whenever pg_upgrade runs). However, as things currently
> stand, accessing an old multi raises an error. So maybe we need a
> scheme a bit more complex to handle this.

Hmm, could we create new multixact files filled with zeros, covering the
range that was valid in the old cluster?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Noah Misch <noah(at)leadboat(dot)com>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-01-17 09:39:13
Message-ID: 20120117093913.GA13462@tornado.leadboat.com
Lists: pgsql-hackers

On Sun, Jan 15, 2012 at 01:49:54AM -0300, Alvaro Herrera wrote:
> - I'm not sure that the multixact truncation code is sane on
> checkpoints. It might be that I need to further tweak the pg_control
> info we keep about truncation. The whole truncation thing needs more
> testing, too.

My largest outstanding concern involves the possibility of MultiXactId
wraparound. From my last review:

This raises a notable formal hazard: it's possible to burn through the
MultiXactId space faster than the regular TransactionId space. We could get
into a situation where pg_clog is covering 2B xids, and yet we need >4B
MultiXactId to cover that period. We had better at least notice this and
halt, if not have autovacuum actively prevent it.

(That should have been 2B rather than 4B, since MultiXactId uses the same
2B-in-past, 2B-in-future behavior as regular xids.)

Are we willing to guess that this will "never" happen and make recovery
minimally possible? If so, we could have GetNewMultiXactId() grow defenses
similar to GetNewTransactionId() and leave it at that. If not, we need to
involve autovacuum.
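The 2B-in-past/2B-in-future behavior comes from comparing ids in modulo-2^32 arithmetic, where each id logically has 2^31 ids behind it and 2^31 ahead of it. A minimal sketch of such a comparison (an illustration in the style of TransactionIdPrecedes, not the actual multixact.c code):

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t MultiXactId;

/*
 * Circular comparison: a precedes b if the signed modulo-2^32
 * difference is negative.  Each id thus sees ~2 billion ids in its
 * past and ~2 billion in its future, which is why the usable range
 * before wraparound is ~2B, not ~4B.
 */
static bool
mxid_precedes(MultiXactId a, MultiXactId b)
{
    int32_t diff = (int32_t) (a - b);

    return diff < 0;
}
```

This is also why burning multixacts faster than xids is a hazard: the multixact counter can make a full ~2B circuit while pg_clog still covers the same xid window.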

The other remaining high-level thing is to have key_attrs contain only columns
actually referenced by FKs.

> - Go over Noah's two reviews again and see if I missed anything; also
> make sure I haven't missed anything from other reviewers.

There are some, yes.

> *** a/src/backend/access/heap/heapam.c
> --- b/src/backend/access/heap/heapam.c

> ***************
> *** 2773,2783 **** l2:
> }
> else if (result == HeapTupleBeingUpdated && wait)
> {
> ! TransactionId xwait;
> uint16 infomask;
>
> /* must copy state data before unlocking buffer */
> ! xwait = HeapTupleHeaderGetXmax(oldtup.t_data);
> infomask = oldtup.t_data->t_infomask;
>
> LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
> --- 2848,2871 ----
> }
> else if (result == HeapTupleBeingUpdated && wait)
> {
> ! TransactionId xwait;
> uint16 infomask;
> + bool none_remain = false;

Nothing can ever set this variable to anything different. It seems that
keep_xact == InvalidTransactionId substitutes well enough, though.

> /*
> * We may overwrite if previous xmax aborted, or if it committed but
> ! * only locked the tuple without updating it, or if we are going to
> ! * keep it around in Xmax.
> */

The second possibility is just a subset of the third.

> ! if (TransactionIdIsValid(keep_xmax) ||
> ! none_remain ||
> ! (oldtup.t_data->t_infomask & HEAP_XMAX_INVALID))
> result = HeapTupleMayBeUpdated;
> else
> result = HeapTupleUpdated;

> ***************
> *** 3314,3323 **** heap_tuple_attr_equals(TupleDesc tupdesc, int attrnum,
> */
> static bool
> HeapSatisfiesHOTUpdate(Relation relation, Bitmapset *hot_attrs,
> ! HeapTuple oldtup, HeapTuple newtup)
> {
> int attrnum;
>
> while ((attrnum = bms_first_member(hot_attrs)) >= 0)
> {
> /* Adjust for system attributes */
> --- 3537,3549 ----
> */
> static bool
> HeapSatisfiesHOTUpdate(Relation relation, Bitmapset *hot_attrs,
> ! HeapTuple oldtup, HeapTuple newtup, bool empty_okay)
> {
> int attrnum;
>
> + if (!empty_okay && bms_is_empty(hot_attrs))
> + return false;

When a table contains no key attributes, it seems arbitrary whether we call
the key revoked or not. What is the motivation for this change?

> ! /*
> ! * If we're requesting KeyShare, and there's no update present, we
> ! * don't need to wait. Even if there is an update, we can still
> ! * continue if the key hasn't been modified.
> ! *
> ! * However, if there are updates, we need to walk the update chain
> ! * to mark future versions of the row as locked, too. That way, if
> ! * somebody deletes that future version, we're protected against the
> ! * key going away. This locking of future versions could block
> ! * momentarily, if a concurrent transaction is deleting a key; or it
> ! * could return a value to the effect that the transaction deleting the
> ! * key has already committed. So we do this before re-locking the
> ! * buffer; otherwise this would be prone to deadlocks. Note that the TID
> ! * we're locking was grabbed before we unlocked the buffer. For it to
> ! * change while we're not looking, the other properties we're testing
> ! * for below after re-locking the buffer would also change, in which
> ! * case we would restart this loop above.
> ! */
> ! if ((mode == LockTupleKeyShare) &&
> ! (HeapTupleHeaderInfomaskIsOnlyLocked(infomask) ||
> ! !(infomask2 & HEAP_UPDATE_KEY_REVOKED)))

Isn't the OnlyLocked test redundant, a subset of the !KEY_REVOKED test?

> {
> ! /* if there are updates, follow the update chain */
> ! if (!HeapTupleHeaderInfomaskIsOnlyLocked(infomask))
> ! {
> ! HTSU_Result res;
> !
> ! res = heap_lock_updated_tuple(relation, tid,
> ! GetCurrentTransactionId(),
> ! mode);
> ! if (res != HeapTupleMayBeUpdated)
> ! {
> ! result = res;
> ! /* recovery code expects to have buffer lock held */
> ! LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
> ! goto failed;
> ! }
> ! }
> !
> LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
>
> /*
> ! * Make sure it's still an appropriate lock, else start over.
> */
> ! if (!HeapTupleHeaderIsOnlyLocked(tuple->t_data) &&
> ! (tuple->t_data->t_infomask2 & HEAP_UPDATE_KEY_REVOKED))
> goto l3;
> + require_sleep = false;
> +
> + /*
> + * Note we allow Xmax to change here; other updaters/lockers could
> + * have modified it before we grabbed the buffer lock. However,
> + * this is not a problem, because with the recheck we just did we
> + * ensure that they still don't conflict with the lock we want.
> + */

If an updater has appeared in the meantime, don't we need to go back and lock
along its update chain?

> }
>
> + /*
> + * If we're requesting Share, we can similarly avoid sleeping if
> + * there's no update and no exclusive lock present.
> + */
> + if (mode == LockTupleShare &&
> + (infomask & (HEAP_XMAX_KEYSHR_LOCK | HEAP_XMAX_IS_NOT_UPDATE)) &&
> + !(infomask & HEAP_XMAX_EXCL_LOCK))
> + {
> LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
>
> /*
> ! * Make sure it's still an appropriate lock, else start over. See
> ! * above about allowing xmax to change.
> */

I agree that it should be safe here, though.

> ! if (!(tuple->t_data->t_infomask &
> ! (HEAP_XMAX_KEYSHR_LOCK | HEAP_XMAX_IS_NOT_UPDATE)) ||
> ! (tuple->t_data->t_infomask & HEAP_XMAX_EXCL_LOCK))
> goto l3;
> + require_sleep = false;
> + }

> ! elog(ERROR, "invalid lock mode in heap_tuple_lock");

"heap_lock_tuple" in that message.

> ! /*
> ! * Make sure there is no forward chain link in t_ctid. Note that in the
> ! * cases where the tuple has been updated, we must not overwrite t_ctid,
> ! * because it was set by the updater. Moreover, if the tuple has been
> ! * updated, we need to follow the update chain to lock the new versions
> ! * of the tuple as well.
> ! *
> ! * FIXME -- not 100% sure of the implications of this.
> ! */
> ! if (HeapTupleHeaderInfomaskIsOnlyLocked(new_infomask))
> ! tuple->t_data->t_ctid = *tid;

This seems right.

> + /*
> + * Given an original set of Xmax and infomask, and a transaction acquiring a
> + * new lock of some mode, compute the new Xmax and corresponding infomask to
> + * use on the tuple.
> + *
> + * Note that this might have side effects such as creating a new MultiXactId.
> + *
> + * Most callers will have called HeapTupleSatisfiesUpdate before this function;
> + * that will have set the HEAP_XMAX_INVALID bit if the xmax was a MultiXactId
> + * but it was not running anymore. There is a race condition, which is that the
> + * MultiXactId may have finished since then, but that uncommon case is handled
> + * within MultiXactIdExpand.
> + *
> + * There is a similar race condition possible when the old xmax was a regular
> + * TransactionId. We test TransactionIdIsInProgress again just to narrow the
> + * window, but it's still possible to end up creating an unnecessary
> + * MultiXactId. Fortunately this is harmless.
> + */
> + static void
> + compute_new_xmax_infomask(TransactionId xmax, uint16 old_infomask,
> + TransactionId add_to_xmax, LockTupleMode mode,
> + TransactionId *result_xmax, uint16 *result_infomask)
> + {
> + TransactionId new_xmax;
> + uint16 new_infomask = old_infomask;
> +
> + if (old_infomask & (HEAP_XMAX_INVALID | HEAP_XMAX_COMMITTED))
> + {
> + /*
> + * No previous locker, or it already finished; we just insert our own
> + * TransactionId.
> + */
> + switch (mode)
> + {
> + case LockTupleKeyShare:
> + new_xmax = add_to_xmax;
> + new_infomask |= HEAP_XMAX_KEYSHR_LOCK;
> + break;
> + case LockTupleShare:
> + /* need a multixact here in any case */
> + new_xmax = MultiXactIdCreateSingleton(add_to_xmax, MultiXactStatusForShare);
> + new_infomask |= GetMultiXactIdHintBits(new_xmax);
> + break;
> + case LockTupleUpdate:
> + new_infomask |= HEAP_XMAX_EXCL_LOCK;
> + new_xmax = xmax;

Shouldn't that be "new_xmax = add_to_xmax"?

> + break;
> + default:
> + elog(ERROR, "invalid lock mode");
> + new_xmax = InvalidTransactionId; /* keep compiler quiet */
> + }
> + /* no other updater; just add myself */
> + }
> + else if (old_infomask & HEAP_XMAX_IS_MULTI)
> + {
> + MultiXactStatus new_mxact_status;
> +
> + new_mxact_status = get_mxact_status_for_tuplelock(mode);
> + /*
> + * If the XMAX is already a MultiXactId, then we need to
> + * expand it to include our own TransactionId.
> + */
> + new_xmax = MultiXactIdExpand((MultiXactId) xmax, add_to_xmax, new_mxact_status);
> + new_infomask |= GetMultiXactIdHintBits(new_xmax);
> + }
> + else if (TransactionIdIsInProgress(xmax))
> + {
> + /*
> + * If the XMAX is a valid, in-progress TransactionId, then we need to
> + * create a new MultiXactId that includes both the old locker and our
> + * own TransactionId.
> + */
> + MultiXactStatus status;
> + MultiXactStatus new_mxact_status;
> +
> + new_mxact_status = get_mxact_status_for_tuplelock(mode);
> +
> + if (old_infomask & HEAP_XMAX_EXCL_LOCK)
> + status = MultiXactStatusForUpdate;
> + else if (old_infomask & HEAP_XMAX_KEYSHR_LOCK)
> + status = MultiXactStatusForKeyShare;
> + else
> + {
> + status = MultiXactStatusUpdate;
> + }
> +
> + /* FIXME need to verify the KEY_REVOKED bit, and block if it's set */
> +
> + new_xmax = MultiXactIdCreate(xmax, status, add_to_xmax, new_mxact_status);
> + new_infomask |= GetMultiXactIdHintBits(new_xmax);
> + /* FIXME -- we need to add bits to the infomask here! */
> + }
> + else if (mode == LockTupleShare)
> + {
> + MultiXactStatus new_mxact_status;
> +
> + /*
> + * There's no hint bit for FOR SHARE, so we need a multixact
> + * here no matter what.
> + */
> + new_mxact_status = get_mxact_status_for_tuplelock(mode);
> + new_xmax = MultiXactIdCreateSingleton(add_to_xmax, new_mxact_status);
> + new_infomask |= GetMultiXactIdHintBits(new_xmax);
> + }

If you remove the conditional block above, the next conditional block will
handle it fine.

> + else
> + {
> + /*
> + * Can get here iff the updating transaction was running when the
> + * infomask was extracted from the tuple, but finished before
> + * TransactionIdIsInProgress got to run. Treat it like there's no
> + * locker in the tuple.
> + */
> + switch (mode)
> + {
> + case LockTupleKeyShare:
> + new_infomask |= HEAP_XMAX_KEYSHR_LOCK;
> + new_xmax = xmax;

Shouldn't that be add_to_xmax?

> + break;
> + case LockTupleShare:
> + /* need a multixact here in any case */
> + new_xmax = MultiXactIdCreateSingleton(add_to_xmax, MultiXactStatusForShare);
> + new_infomask |= GetMultiXactIdHintBits(new_xmax);
> + break;
> + case LockTupleUpdate:
> + new_infomask |= HEAP_XMAX_EXCL_LOCK;
> + new_xmax = xmax;
> + break;
> + default:
> + elog(ERROR, "invalid lock mode");
> + new_xmax = InvalidTransactionId; /* keep compiler quiet */
> + }

Can you rearrange conditional flow to avoid having two copies of this switch
statement?
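For illustration, one way to rearrange the flow is to factor the duplicated switch into a helper that both the "xmax invalid/committed" branch and the "locker finished in the meantime" branch call. The sketch below is self-contained, so the types, the helper name, and the fake multixact constructor are stand-ins, not the patch's code; note it also consistently uses add_to_xmax in the update case, per the review comments above.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t TransactionId;
#define InvalidTransactionId	((TransactionId) 0)
#define HEAP_XMAX_KEYSHR_LOCK	0x0010
#define HEAP_XMAX_EXCL_LOCK		0x0040

typedef enum
{
	LockTupleKeyShare,
	LockTupleShare,
	LockTupleUpdate
} LockTupleMode;

/* Fake MultiXactIdCreateSingleton(): just tags the xid for the example. */
static TransactionId
create_singleton_multi(TransactionId xid)
{
	return xid | 0x80000000;
}

/*
 * Single copy of the "no live locker" switch, callable both when the old
 * xmax was invalid/committed and when the old locker turned out to have
 * finished before TransactionIdIsInProgress() ran.
 */
static void
set_xmax_for_new_locker(LockTupleMode mode, TransactionId add_to_xmax,
						TransactionId *new_xmax, uint16_t *new_infomask)
{
	switch (mode)
	{
		case LockTupleKeyShare:
			*new_infomask |= HEAP_XMAX_KEYSHR_LOCK;
			*new_xmax = add_to_xmax;
			break;
		case LockTupleShare:
			/* no hint bit for FOR SHARE: need a multixact regardless */
			*new_xmax = create_singleton_multi(add_to_xmax);
			break;
		case LockTupleUpdate:
			*new_infomask |= HEAP_XMAX_EXCL_LOCK;
			*new_xmax = add_to_xmax;
			break;
	}
}
```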

> + }
> +
> + /* must unset the XMAX_INVALID bit */
> + new_infomask &= ~HEAP_XMAX_INVALID;
> +
> + *result_infomask = new_infomask;
> + *result_xmax = new_xmax;
> + }
> +
> + static HTSU_Result
> + heap_lock_updated_tuple(Relation rel, ItemPointer tid, TransactionId xid,
> + LockTupleMode mode)
> + {

This function could use a comment.

> + ItemPointerData tupid;
> + HeapTupleData mytup;
> + Buffer buf;
> + uint16 new_infomask,
> + old_infomask;
> + TransactionId xmax,
> + new_xmax;
> +
> + ItemPointerCopy(tid, &tupid);
> +
> + restart:
> + new_infomask = 0;
> + new_xmax = InvalidTransactionId;
> + ItemPointerCopy(&tupid, &(mytup.t_self));
> +
> + l5:
> + if (!heap_fetch(rel, SnapshotAny, &mytup, &buf, false, NULL))
> + elog(ERROR, "unable to fetch updated version of tuple");
> +
> + /*
> + * XXX we do not lock this tuple here; the theory is that it's sufficient
> + * with the buffer lock we're about to grab. Any other code must be able
> + * to cope with tuple lock specifics changing while they don't hold buffer
> + * lock anyway.
> + */
> +
> + /*
> + * We've got a more recent (updated) version of a tuple we locked.
> + * We need to propagate the lock to it; here we don't sleep at all
> + * or try to check visibility, we just inconditionally mark it as
> + * locked by us. We only need to ensure we have buffer lock.
> + */
> + LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
> +
> + old_infomask = mytup.t_data->t_infomask;
> + xmax = HeapTupleHeaderGetRawXmax(mytup.t_data);
> +
> + if (!(old_infomask & HEAP_XMAX_INVALID) &&
> + (mytup.t_data->t_infomask & HEAP_UPDATE_KEY_REVOKED))
> + {
> + TransactionId xmax;
> +
> + xmax = HeapTupleHeaderGetUpdateXid(mytup.t_data);
> + if (TransactionIdIsCurrentTransactionId(xmax))
> + {
> + UnlockReleaseBuffer(buf);
> + return HeapTupleSelfUpdated;
> + }

Is reaching this code indeed possible, with cursors or something?

> + else if (TransactionIdIsInProgress(xmax))
> + {
> + UnlockReleaseBuffer(buf);
> + XactLockTableWait(xmax);
> + goto l5;

What about just unlocking the buffer here and moving "l5" to after the
heap_fetch()? We should not need to refetch.

> + }
> + else if (TransactionIdDidAbort(xmax))
> + ; /* okay to proceed */
> + else if (TransactionIdDidCommit(xmax))
> + {
> + UnlockReleaseBuffer(buf);
> + return HeapTupleUpdated;
> + }
> + }
> +
> + compute_new_xmax_infomask(xmax, old_infomask, xid, mode,
> + &new_xmax, &new_infomask);
> +
> + START_CRIT_SECTION();
> +
> + /* And set them. */
> + HeapTupleHeaderSetXmax(mytup.t_data, new_xmax);
> + mytup.t_data->t_infomask = new_infomask;
> +
> + MarkBufferDirty(buf);
> +
> + /*
> + * FIXME XLOG stuff goes here. Is it really necessary to have this, or
> + * would it be sufficient to just WAL log the original tuple lock, and have
> + * the replay code follow the update chain too?
> + */

A single WAL record referencing the root tuple and a PageSetLSN()/PageSetTLI()
on all affected pages should be sufficient. However, that requires a critical
section from here until the writing of that WAL record. As the code stands,
plenty can fail in the meantime.

> +
> + END_CRIT_SECTION();
> +
> + LockBuffer(buf, BUFFER_LOCK_UNLOCK);
> +
> + /* found end of update chain? */
> + /* FIXME -- ISTM we must also check that Xmin in the new tuple matches
> + * updating Xid of the old, as other routines do. */

Agreed. I suggest mimicking heap_get_latest_tid() here.
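The heap_get_latest_tid()-style check amounts to verifying that each fetched version's xmin matches the xid that updated the previous version; if they differ, the chain was broken (e.g. the old version was vacuumed away and the slot reused) and the walk must stop. A minimal stand-alone sketch of that guard, where FakeHeader is a stand-in for the real tuple header:

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Stand-in for the tuple header fields the check needs. */
typedef struct
{
	TransactionId xmin;			/* inserter of this version */
	TransactionId update_xid;	/* updater recorded in xmax, if any */
} FakeHeader;

/*
 * heap_get_latest_tid()-style guard: the fetched tuple continues the
 * update chain only if its xmin matches the xid that updated the
 * previous version.
 */
static int
chain_link_matches(const FakeHeader *next, TransactionId prior_xmax)
{
	return next->xmin == prior_xmax;
}
```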

> + if (ItemPointerEquals(&(mytup.t_self), &(mytup.t_data->t_ctid)))
> + {
> + ReleaseBuffer(buf);
> + return HeapTupleMayBeUpdated;
> + }
> +
> + /* tail recursion */
> + ItemPointerCopy(&(mytup.t_data->t_ctid), &tupid);

You normally need at least a shared buffer content lock to read t_ctid. Any
concurrent updater who changes t_ctid after we released the lock will also be
copying the lock we just added, so there would be no need to continue up the
chain. So, if you copied t_ctid outside of any content lock and then verified
the xmax/xmin match per above, it might be fine. However, I wouldn't use that
cleverness to merely shave a couple of instructions from the locked region.

> + ReleaseBuffer(buf);
> + goto restart;
> + }

> /*
> * The tuple might be marked either XMAX_INVALID or XMAX_COMMITTED
> ! * + LOCKED, possibly with IS_MULTI too. Normalize to INVALID just
> ! * to be sure no one gets confused. Also get rid of the
> ! * HEAP_UPDATE_KEY_REVOKED bit.
> */
> ! tuple->t_infomask &= ~(HEAP_XMAX_COMMITTED | HEAP_LOCK_BITS |
> ! HEAP_XMAX_IS_MULTI);
> ! tuple->t_infomask &= ~HEAP_LOCK_BITS;

The most recent line is redundant with its predecessor.

> *** a/src/backend/access/transam/multixact.c
> --- b/src/backend/access/transam/multixact.c

> + #define MULTIXACT_DEBUG

Omit the above line.

> + /*
> * MultiXactIdCreate
> * Construct a MultiXactId representing two TransactionIds.
> *
> ! * The two XIDs must be different, or be requesting different lock modes.

Why is it not sufficient to store the strongest type for a particular xid?

> ***************
> *** 775,786 **** RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
>
> prev_pageno = -1;
>
> ! for (i = 0; i < nxids; i++, offset++)
> {
> TransactionId *memberptr;
>
> pageno = MXOffsetToMemberPage(offset);
> ! entryno = MXOffsetToMemberEntry(offset);
>
> if (pageno != prev_pageno)
> {
> --- 768,792 ----
>
> prev_pageno = -1;
>
> ! for (i = 0; i < nmembers; i++, offset++)
> {
> TransactionId *memberptr;
> + uint32 *flagsptr;
> + uint32 flagsval;
> + int bshift;
> + int flagsoff;
> + int memberoff;
> +
> + if (members[i].xid < 900)
> + abort();

Leftover from testing?

> ***************
> *** 1222,1236 **** mXactCachePut(MultiXactId multi, int nxids, TransactionId *xids)
>
> #ifdef MULTIXACT_DEBUG
> static char *
> ! mxid_to_string(MultiXactId multi, int nxids, TransactionId *xids)
> {
> ! char *str = palloc(15 * (nxids + 1) + 4);
> int i;
>
> ! snprintf(str, 47, "%u %d[%u", multi, nxids, xids[0]);
>
> ! for (i = 1; i < nxids; i++)
> ! snprintf(str + strlen(str), 17, ", %u", xids[i]);
>
> strcat(str, "]");
> return str;
> --- 1285,1327 ----
>
> #ifdef MULTIXACT_DEBUG
> static char *
> ! mxstatus_to_string(MultiXactStatus status)
> {
> ! switch (status)
> ! {
> ! case MultiXactStatusForKeyShare:
> ! return "keysh";
> ! case MultiXactStatusForShare:
> ! return "sh";
> ! case MultiXactStatusForUpdate:
> ! return "forupd";
> ! case MultiXactStatusUpdate:
> ! return "upd";
> ! case MultiXactStatusKeyUpdate:
> ! return "keyup";
> ! default:
> ! elog(ERROR, "unrecognized multixact status %d", status);
> ! return "";
> ! }
> ! }
> !
> ! static char *
> ! mxid_to_string(MultiXactId multi, int nmembers, MultiXactMember *members)
> ! {
> ! static char *str = NULL;
> int i;
>
> ! if (str != NULL)
> ! pfree(str);
> !
> ! str = MemoryContextAlloc(TopMemoryContext, 15 * (nmembers + 1) + 4);
>
> ! snprintf(str, 47, "%u %d[%u (%s)", multi, nmembers, members[0].xid,
> ! mxstatus_to_string(members[0].status));
> !
> ! for (i = 1; i < nmembers; i++)
> ! snprintf(str + strlen(str), 17, ", %u (%s)", members[i].xid,
> ! mxstatus_to_string(members[i].status));

This could truncate: 10 chars from %u, 6 from %s, 5 constant chars.

How about using a StringInfoData instead?
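For illustration, an appendStringInfo()-style growable buffer removes the fixed-size arithmetic (and the truncation hazard) entirely. The StrBuf type below is a self-contained stand-in for PostgreSQL's StringInfoData, not the real API:

```c
#include <assert.h>
#include <stdarg.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef uint32_t MultiXactId;
typedef uint32_t TransactionId;
typedef struct
{
	TransactionId xid;
	const char *status;			/* e.g. "keysh", "sh", "upd" */
} MultiXactMember;

/* Minimal stand-in for StringInfoData: grows as needed, never truncates. */
typedef struct
{
	char	   *data;
	size_t		len;
	size_t		cap;
} StrBuf;

static void
buf_appendf(StrBuf *b, const char *fmt, ...)
{
	for (;;)
	{
		va_list		ap;
		int			n;

		va_start(ap, fmt);
		n = vsnprintf(b->data + b->len, b->cap - b->len, fmt, ap);
		va_end(ap);
		if (n >= 0 && (size_t) n < b->cap - b->len)
		{
			b->len += (size_t) n;
			return;
		}
		b->cap *= 2;			/* too small: grow and retry */
		b->data = realloc(b->data, b->cap);
	}
}

static char *
mxid_to_string(MultiXactId multi, int nmembers, MultiXactMember *members)
{
	StrBuf		b = {malloc(64), 0, 64};
	int			i;

	buf_appendf(&b, "%u %d[%u (%s)", multi, nmembers,
				members[0].xid, members[0].status);
	for (i = 1; i < nmembers; i++)
		buf_appendf(&b, ", %u (%s)", members[i].xid, members[i].status);
	buf_appendf(&b, "]");
	return b.data;
}
```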

>
> strcat(str, "]");
> return str;

> *** a/src/backend/utils/time/combocid.c
> --- b/src/backend/utils/time/combocid.c
> ***************
> *** 118,126 **** HeapTupleHeaderGetCmax(HeapTupleHeader tup)
> {
> CommandId cid = HeapTupleHeaderGetRawCommandId(tup);
>
> /* We do not store cmax when locking a tuple */
> ! Assert(!(tup->t_infomask & (HEAP_MOVED | HEAP_IS_LOCKED)));
> ! Assert(TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmax(tup)));
>
> if (tup->t_infomask & HEAP_COMBOCID)
> return GetRealCmax(cid);
> --- 118,128 ----
> {
> CommandId cid = HeapTupleHeaderGetRawCommandId(tup);
>
> + Assert(!(tup->t_infomask & HEAP_MOVED));
> /* We do not store cmax when locking a tuple */

The comment is deceptive now.

> ! Assert(!HeapTupleHeaderIsOnlyLocked(tup));
> ! Assert((tup->t_infomask & HEAP_XMAX_IS_MULTI) ||
> ! TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetRawXmax(tup)));

How about the more-specific
"Assert(TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetUpdateXid(tup)))"
in place of both of these asserts?

>
> if (tup->t_infomask & HEAP_COMBOCID)
> return GetRealCmax(cid);
> *** a/src/backend/utils/time/tqual.c
> --- b/src/backend/utils/time/tqual.c

I still haven't reviewed the tqual.c changes in detail, but I see that it has
several FIXMEs.

> *** a/src/test/isolation/isolationtester.c
> --- b/src/test/isolation/isolationtester.c
> ***************
> *** 395,401 **** run_named_permutations(TestSpec * testspec)
> Permutation *p = testspec->permutations[i];
> Step **steps;
>
> ! if (p->nsteps != nallsteps)
> {
> fprintf(stderr, "invalid number of steps in permutation %d\n", i + 1);
> exit_nicely();
> --- 395,401 ----
> Permutation *p = testspec->permutations[i];
> Step **steps;
>
> ! if (p->nsteps > nallsteps)
> {
> fprintf(stderr, "invalid number of steps in permutation %d\n", i + 1);
> exit_nicely();
> ***************
> *** 565,570 **** run_permutation(TestSpec * testspec, int nsteps, Step ** steps)
> --- 565,571 ----
> * steps from this session can run until it is unblocked, but it
> * can only be unblocked by running steps from other sessions.
> */
> + fflush(stdout);
> fprintf(stderr, "invalid permutation detected\n");
>
> /* Cancel the waiting statement from this session. */

Why these isolationtester.c changes?

Thanks,
nm


From: Noah Misch <noah(at)leadboat(dot)com>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-01-17 09:56:04
Message-ID: 20120117095604.GB13462@tornado.leadboat.com

On Mon, Jan 16, 2012 at 04:52:36PM -0300, Alvaro Herrera wrote:
> Excerpts from Heikki Linnakangas's message of lun ene 16 16:17:42 -0300 2012:
> > On 15.01.2012 06:49, Alvaro Herrera wrote:
> > > --- 164,178 ----
> > > #define HEAP_HASVARWIDTH 0x0002 /* has variable-width attribute(s) */
> > > #define HEAP_HASEXTERNAL 0x0004 /* has external stored attribute(s) */
> > > #define HEAP_HASOID 0x0008 /* has an object-id field */
> > > ! #define HEAP_XMAX_KEYSHR_LOCK 0x0010 /* xmax is a key-shared locker */
> > > #define HEAP_COMBOCID 0x0020 /* t_cid is a combo cid */
> > > #define HEAP_XMAX_EXCL_LOCK 0x0040 /* xmax is exclusive locker */
> > > ! #define HEAP_XMAX_IS_NOT_UPDATE 0x0080 /* xmax, if valid, is only a locker.
> > > ! * Note this is not set unless
> > > ! * XMAX_IS_MULTI is also set. */
> > > !
> > > ! #define HEAP_LOCK_BITS (HEAP_XMAX_EXCL_LOCK | HEAP_XMAX_IS_NOT_UPDATE | \
> > > ! HEAP_XMAX_KEYSHR_LOCK)
> > > #define HEAP_XMIN_COMMITTED 0x0100 /* t_xmin committed */
> > > #define HEAP_XMIN_INVALID 0x0200 /* t_xmin invalid/aborted */
> > > #define HEAP_XMAX_COMMITTED 0x0400 /* t_xmax committed */
> >
> > HEAP_XMAX_IS_NOT_UPDATE is a pretty opaque name for that. From the name,
> > I'd say that a DELETE would set that, but the comment says it has to do
> > with locking. After going through all the combinations in my mind, I
> > think I finally understood it: HEAP_XMAX_IS_NOT_UPDATE is set if xmax is
> > a multi-xact, that represent only locking xids, no updates. How about
> > calling it "HEAP_XMAX_LOCK_ONLY", and setting it whether or not
> > xmax is a multi-xid?
>
> Hm, sounds like a good idea. Will do.

+1

> > Why are you renaming HeapTupleHeaderGetXmax() into
> > HeapTupleHeaderGetRawXmax()? Any current callers of
> > HeapTupleHeaderGetXmax() should already check that HEAP_XMAX_IS_MULTI is
> > not set.
>
> I had this vague impression that it'd be better to break existing
> callers so that they ensure they decide between
> HeapTupleHeaderGetRawXmax and HeapTupleHeaderGetUpdateXid. Noah
> suggested changing the macro name, too. It's up to each caller to
> decide what semantics they want. Most want the latter; and callers
> outside core are more likely to want that one. If we kept the old name,
> they would get the wrong value.

My previous suggestion was to have both macros:

#define HeapTupleHeaderGetXmax(tup) \
( \
AssertMacro(!((tup)->t_infomask & HEAP_XMAX_IS_MULTI)), \
HeapTupleHeaderGetRawXmax(tup) \
)

#define HeapTupleHeaderGetRawXmax(tup) \
( \
(tup)->t_choice.t_heap.t_xmax \
)

No strong preference, though.


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-01-18 20:18:31
Message-ID: 1326917500-sup-7593@alvh.no-ip.org


Excerpts from Heikki Linnakangas's message of mar ene 17 03:21:28 -0300 2012:
>
> On 16.01.2012 21:52, Alvaro Herrera wrote:
> >
> > Excerpts from Heikki Linnakangas's message of lun ene 16 16:17:42 -0300 2012:
> >>
> >> On 15.01.2012 06:49, Alvaro Herrera wrote:
> >>> - pg_upgrade bits are missing.
> >>
> >> I guess we'll need to rewrite pg_multixact contents in pg_upgrade. Is
> >> the page format backwards-compatible?
> >
> > It's not.
> >
> > I haven't worked out what pg_upgrade needs to do, honestly. I think we
> > should just not copy old pg_multixact files when upgrading across this
> > patch.
>
> Sorry, I meant whether the *data* page format is backwards-compatible?
> the multixact page format clearly isn't.

It's supposed to be, yes. The HEAP_XMAX_IS_NOT_UPDATE bit (now renamed)
was chosen so that it'd take the place of the old HEAP_XMAX_SHARE_LOCK
bit, so any pages with that bit set should continue to work similarly.
The other infomask bits I used were previously unused.

> > I was initially thinking that pg_multixact should return the
> > empty set if requested members of a multi that preceded the freeze
> > point. That way, I thought, we would just never try to access a page
> > originated in the older version (assuming the freeze point is set to
> > "current" whenever pg_upgrade runs). However, as things currently
> > stand, accessing an old multi raises an error. So maybe we need a
> > scheme a bit more complex to handle this.
>
> Hmm, could we create new multixact files filled with zeros, covering the
> range that was valid in the old cluster?

Hm, we could do something like that I guess. I'm not sure that just
zeroes is the right pattern, but it should be something simple.

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Noah Misch <noah(at)leadboat(dot)com>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-01-19 02:11:35
Message-ID: 20120119021135.GA15158@tornado.leadboat.com

On Wed, Jan 18, 2012 at 05:18:31PM -0300, Alvaro Herrera wrote:
> Excerpts from Heikki Linnakangas's message of mar ene 17 03:21:28 -0300 2012:
> > On 16.01.2012 21:52, Alvaro Herrera wrote:
> > > I was initially thinking that pg_multixact should return the
> > > empty set if requested members of a multi that preceded the freeze
> > > point. That way, I thought, we would just never try to access a page
> > > originated in the older version (assuming the freeze point is set to
> > > "current" whenever pg_upgrade runs). However, as things currently
> > > stand, accessing an old multi raises an error. So maybe we need a
> > > scheme a bit more complex to handle this.
> >
> > Hmm, could we create new multixact files filled with zeros, covering the
> > range that was valid in the old cluster?
>
> Hm, we could do something like that I guess. I'm not sure that just
> zeroes is the right pattern, but it should be something simple.

PostgreSQL 9.1 can have all ~4B MultiXactId on disk at any given time.

We could silently ignore the lookup miss when HEAP_XMAX_LOCK_ONLY is also set.
That makes existing data files acceptable while still catching data loss
scenarios going forward. (It's tempting to be stricter when we know the
cluster data files originated in PostgreSQL 9.2+, but I'm not sure whether
that's worth its weight.)
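A sketch of that rule, with stand-in types: the bit value and enum names here are illustrative only, and the real code would ereport(ERROR) rather than return a sentinel on the data-loss case.

```c
#include <assert.h>
#include <stdint.h>

#define HEAP_XMAX_LOCK_ONLY 0x0080	/* bit value illustrative only */

/* Outcome of looking up a multi that may predate the valid range. */
typedef enum
{
	MULTI_OK,					/* members found normally */
	MULTI_IGNORED,				/* miss, but harmless: lockers long gone */
	MULTI_ERROR					/* miss on a possible update: data loss */
} MultiLookupOutcome;

/*
 * Proposed rule: a members-lookup miss (e.g. files not carried over by
 * pg_upgrade) is harmless when the xmax is marked lock-only, but must be
 * treated as data loss when the multi could contain an update.
 */
static MultiLookupOutcome
resolve_multi_lookup(int found, uint16_t infomask)
{
	if (found)
		return MULTI_OK;
	if (infomask & HEAP_XMAX_LOCK_ONLY)
		return MULTI_IGNORED;
	return MULTI_ERROR;			/* real code would ereport(ERROR) */
}
```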


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-01-24 18:47:16
Message-ID: 1327429258-sup-8214@alvh.no-ip.org

This version of the patch fixes most of the problems pointed out by this
review. There are a bunch of relatively minor items that need
addressing yet, but I wanted to throw this out in case anyone is
interested in giving this some testing or more feedback.

The biggest item remaining is the point you raised about MultiXactId
wraparound. This is closely related to multixact truncation and the way
checkpoints are to be handled. If we think that MultiXactId wraparound
is possible, and we need to involve autovacuum to keep it at bay, then I
think the only way to make that work is to add another column to
pg_class so that each table's oldest multixact is tracked, same as we do
with relfrozenxid for Xids. If we do that, I think we can do away with
most of the MultiXactTruncate junk I added -- things would become a lot
simpler. The cost would be bloating pg_class a bit more. Are we okay
with paying that cost? I asked this question some months ago and I
decided that I would not add the column, but I am starting to lean the
other way. I would like some opinions on this.
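For illustration, with such a column (call it relminmxid, by analogy with relfrozenxid; the name is an assumption, not something in the patch), the truncation point would just be the minimum over all tables, computed with the usual wraparound-aware comparison:

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t MultiXactId;

/* Modulo-2^32 "precedes" test, as for regular xids. */
static int
MultiXactIdPrecedes(MultiXactId a, MultiXactId b)
{
	return (int32_t) (a - b) < 0;
}

/*
 * Everything before the returned multi can be truncated: it is the
 * oldest per-table value, seeded with the next multi to assign (the
 * correct answer when no table pins anything older).
 */
static MultiXactId
compute_truncation_point(const MultiXactId *relminmxid, int nrels,
						 MultiXactId next_multi)
{
	MultiXactId oldest = next_multi;
	int			i;

	for (i = 0; i < nrels; i++)
		if (MultiXactIdPrecedes(relminmxid[i], oldest))
			oldest = relminmxid[i];
	return oldest;
}
```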

You asked two questions about WAL-logging locks: one was about the level
of detail we log for each lock we grab; the other was about
heap_xlog_update logging the right info. AFAICS, the main thing that
makes detailed WAL logging necessary is hot standby. That is, the
standby must have all the locking info so that concurrent transactions
are blocked by the same locks as on the master ... or am I wrong about that? (ISTM
this means I need to fix heap_xlog_update so that it faithfully
registers the lock info we're storing, not just the current Xid).

You also asked about heap_update and TOAST; in particular, do we need to
re-check locking info after we've unlocked the buffer for toasting and
finding free space? I believe the current version handles this, but I
haven't tested it.

Some open questions:

* Do we need some extra flag bits for each multi?

* how to deal with heap_update in the 'nowait' case?

* multixact.c cache
Do we need some updates to that?

* multis with multiple members per Xid ??
> * MultiXactIdCreate
> * Construct a MultiXactId representing two TransactionIds.
> *
> - * The two XIDs must be different.
> + * The two XIDs must be different, or be requesting different lock modes.

Why is it not sufficient to store the strongest type for a particular xid?
In this version, I've moved MultiXactIdWait to heapam.c. This makes
multixact largely unaware of the meaning of the flag bits stored with
each multi. I believe we can fix this problem, not at this level but
rather in heap_lock_tuple.

* Columns that are part of the key
Noah thinks the set of columns should only consider those actually referenced
by keys, not those that *could* be referenced.

Also, in a table with no key columns, are all columns part of the key, or is the
key the empty set? I changed HeapSatisfiesHOTUpdate, but that choice seems arbitrary.

Need more code changes for the following:

* pg_upgrade issues are still open
There are two things here. One is what to do when migrating from an old
version that only has HEAP_XMAX_SHARED_LOCK into a new one. The other is
what we need to do from 9.2 into the future (need to copy pg_multixact
contents).

* export FOR KEY UPDATE lock mode in SQL

* Ensure that MultiXactIdIsValid is sane.

* heap_lock_updated_tuple needs WAL logging.

git diff --stat:

contrib/pageinspect/heapfuncs.c | 2 +-
contrib/pgrowlocks/Makefile | 2 +-
contrib/pgrowlocks/pgrowlocks--1.0--1.1.sql | 17 +
contrib/pgrowlocks/pgrowlocks--1.0.sql | 15 -
contrib/pgrowlocks/pgrowlocks--1.1.sql | 15 +
contrib/pgrowlocks/pgrowlocks.c | 133 ++-
contrib/pgrowlocks/pgrowlocks.control | 2 +-
doc/src/sgml/pgrowlocks.sgml | 14 +-
doc/src/sgml/ref/select.sgml | 117 +-
src/backend/access/common/heaptuple.c | 2 +-
src/backend/access/heap/heapam.c | 1524 ++++++++++++++++----
src/backend/access/heap/pruneheap.c | 10 +-
src/backend/access/heap/rewriteheap.c | 6 +-
src/backend/access/transam/multixact.c | 1436 ++++++++++---------
src/backend/access/transam/twophase_rmgr.c | 4 -
src/backend/access/transam/xact.c | 3 -
src/backend/access/transam/xlog.c | 14 +-
src/backend/catalog/index.c | 4 +-
src/backend/commands/analyze.c | 3 +-
src/backend/commands/cluster.c | 2 +-
src/backend/commands/sequence.c | 3 +-
src/backend/commands/trigger.c | 2 +-
src/backend/commands/vacuum.c | 2 +-
src/backend/executor/execMain.c | 7 +-
src/backend/executor/nodeLockRows.c | 20 +-
src/backend/nodes/copyfuncs.c | 4 +-
src/backend/nodes/equalfuncs.c | 4 +-
src/backend/nodes/outfuncs.c | 4 +-
src/backend/nodes/readfuncs.c | 2 +-
src/backend/optimizer/plan/initsplan.c | 6 +-
src/backend/optimizer/plan/planner.c | 24 +-
src/backend/parser/analyze.c | 24 +-
src/backend/parser/gram.y | 12 +-
src/backend/rewrite/rewriteHandler.c | 26 +-
src/backend/storage/lmgr/predicate.c | 4 +-
src/backend/tcop/utility.c | 40 +-
src/backend/utils/adt/ri_triggers.c | 41 +-
src/backend/utils/adt/ruleutils.c | 26 +-
src/backend/utils/cache/relcache.c | 23 +-
src/backend/utils/time/combocid.c | 5 +-
src/backend/utils/time/tqual.c | 356 ++++-
src/bin/pg_resetxlog/pg_resetxlog.c | 33 +-
src/include/access/heapam.h | 16 +-
src/include/access/htup.h | 66 +-
src/include/access/multixact.h | 68 +-
src/include/access/twophase_rmgr.h | 3 +-
src/include/catalog/pg_control.h | 6 +-
src/include/nodes/execnodes.h | 8 +-
src/include/nodes/parsenodes.h | 34 +-
src/include/nodes/plannodes.h | 9 +-
src/include/parser/analyze.h | 2 +-
src/include/utils/rel.h | 1 +
src/include/utils/relcache.h | 2 +-
src/test/isolation/expected/fk-contention.out | 3 +-
src/test/isolation/expected/fk-deadlock.out | 34 +-
src/test/isolation/expected/fk-deadlock2.out | 68 +-
src/test/isolation/expected/fk-deadlock2_1.out | 75 +-
src/test/isolation/expected/fk-deadlock2_2.out | 105 ++
src/test/isolation/expected/fk-deadlock_1.out | 44 +-
src/test/isolation/expected/fk-deadlock_2.out | 67 +
src/test/isolation/expected/fk-delete-insert.out | 41 +
src/test/isolation/expected/lock-update-delete.out | 65 +
.../isolation/expected/lock-update-traversal.out | 18 +
src/test/isolation/isolation_schedule | 2 +
src/test/isolation/isolationtester.c | 3 +-
src/test/isolation/specs/fk-deadlock2.spec | 16 +-
src/test/isolation/specs/lock-update-delete.spec | 38 +
.../isolation/specs/lock-update-traversal.spec | 32 +
68 files changed, 3323 insertions(+), 1496 deletions(-)

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Attachment Content-Type Size
fklocks-6.patch.gz application/x-gzip 66.9 KB

From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-01-26 19:03:02
Message-ID: 1327604235-sup-4501@alvh.no-ip.org
Lists: pgsql-hackers


Excerpts from Alvaro Herrera's message of mar ene 24 15:47:16 -0300 2012:

> Need more code changes for the following:

> * export FOR KEY UPDATE lock mode in SQL

While doing this, I realized that there's an open item here regarding a
transaction that locks a tuple, and then in an aborted savepoint deletes
it. As things stand, what happens is that the original tuple lock is
forgotten entirely, which was one of the things I wanted to fix (and in
fact is fixed for all other cases AFAICS). So what we need is to be
able to store a MultiXactId that includes a member for KeyUpdate locks,
which will represent an UPDATE that touches key columns as well as
DELETEs. That closes the hole. However, the problem with this is that
we have no more bits left in the flag bitmask, which is another concern
you had raised. I chose the easy way out and added a full byte of flags
per transaction.

This means that we now have 1636 xacts per members page rather than
1900+, but I'm not too concerned about that. (We could cut back to 4
flag bits per xact -- but then, having some room for future growth is
probably a good plan anyway).

So DELETEs can also create multis. I'm going to submit an updated patch
shortly.

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-01-27 23:47:27
Message-ID: 1327707718-sup-2731@alvh.no-ip.org
Lists: pgsql-hackers

Excerpts from Alvaro Herrera's message of jue ene 26 16:03:02 -0300 2012:
> Excerpts from Alvaro Herrera's message of mar ene 24 15:47:16 -0300 2012:
>
> > Need more code changes for the following:
>
> > * export FOR KEY UPDATE lock mode in SQL
>
> While doing this, I realized that there's an open item here regarding a
> transaction that locks a tuple, and then in an aborted savepoint deletes
> it. As things stand, what happens is that the original tuple lock is
> forgotten entirely, which was one of the things I wanted to fix (and in
> fact is fixed for all other cases AFAICS). So what we need is to be
> able to store a MultiXactId that includes a member for KeyUpdate locks,
> which will represent an UPDATE that touches key columns as well as
> DELETEs. That closes the hole. However, the problem with this is that
> we have no more bits left in the flag bitmask, which is another concern
> you had raised. I chose the easy way out and added a full byte of flags
> per transaction.
>
> This means that we now have 1636 xacts per members page rather than
> 1900+, but I'm not too concerned about that. (We could cut back to 4
> flag bits per xact -- but then, having some room for future growth is
> probably a good plan anyway).
>
> So DELETEs can also create multis. I'm going to submit an updated patch
> shortly.

... and here it is. The main change here is that FOR KEY UPDATE is
supported, and multis can now represent both FOR KEY UPDATE as well as
DELETEs and UPDATEs that change the key values. I have enlarged the
status bits for each member of a multi to eight, as mentioned above.
We're currently using only three (and not completely -- we currently
have only six states to represent), so that gives us a lot of room for
growth.

(Looking at this code, I have the impression that either moving
MultiXactIdWait to heapam.c was a mistake, or defining MultiXactStatus
in multixact.h is the mistake.)

Other than that and that it's based on current master, this is pretty
much the same as version 6 of the patch.

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Attachment Content-Type Size
fklocks-7.patch.gz application/x-gzip 68.2 KB

From: Noah Misch <noah(at)leadboat(dot)com>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-01-30 23:48:47
Message-ID: 20120130234847.GA20642@tornado.leadboat.com
Lists: pgsql-hackers

On Tue, Jan 24, 2012 at 03:47:16PM -0300, Alvaro Herrera wrote:
> The biggest item remaining is the point you raised about multixactid
> wraparound. This is closely related to multixact truncation and the way
> checkpoints are to be handled. If we think that MultiXactId wraparound
> is possible, and we need to involve autovacuum to keep it at bay, then I

To prove it possible, we need to prove there exists some sequence of operations
consuming N xids and M > N multixactids. Have N transactions key-lock N-1
rows apiece, then have each of them key-lock one of the rows locked by each
other transaction. This consumes N xids and N(N-1) multixactids. I believe
you could construct a workload with N! multixact usage, too.

Existence proofs are one thing, real workloads another. My unsubstantiated
guess is that multixactid use will not overtake xid use in bulk on a workload
not specifically tailored to do so. So, I think it's enough to notice it,
refuse to assign a new multixactid, and tell the user to clear the condition
with a VACUUM FREEZE of all databases. Other opinions would indeed be useful.

> think the only way to make that work is to add another column to
> pg_class so that each table's oldest multixact is tracked, same as we do
> with relfrozenxid for Xids. If we do that, I think we can do away with
> most of the MultiXactTruncate junk I added -- things would become a lot
> simpler. The cost would be bloating pg_class a bit more. Are we okay
> with paying that cost? I asked this question some months ago and I
> decided that I would not add the column, but I am starting to lean the
> other way. I would like some opinions on this.

That's not the only way; autovacuum could just accelerate normal freezing to
advance the multixactid horizon indirectly. Your proposal could make it far
more efficient, though.

> You asked two questions about WAL-logging locks: one was about the level
> of detail we log for each lock we grab; the other was about
> heap_xlog_update logging the right info. AFAICS, the main thing that
> makes detailed WAL logging necessary is hot standbys. That is, the
> standby must have all the locking info so that concurrent transactions
> are similarly locked as in the master ... or am I wrong in that? (ISTM
> this means I need to fix heap_xlog_update so that it faithfully
> registers the lock info we're storing, not just the current Xid).

Standby servers do not need accurate tuple locks, because it's not possible to
wait on a tuple lock during recovery. (By the way, the question about log
detail was just from my own curiosity. We don't especially need an answer to
move forward with this patch, unless you want one.)

> * Columns that are part of the key
> Noah thinks the set of columns should only consider those actually referenced
> by keys, not those that *could* be referenced.

Well, do you disagree? To me it's low-hanging fruit, because it isolates the
UPDATE-time overhead of this patch to FK-referenced tables rather than all
tables having a PK or PK-like index (often just "all tables").

> Also, in a table without columns, are all columns part of the key, or is the
> key the empty set? I changed HeapSatisfiesHOTUpdate but that seems arbitrary.

It does seem arbitrary. What led you to switch in a later version?

Thanks,
nm


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-01-31 13:17:40
Message-ID: CA+Tgmob6FQUPHA_Shgnwj5oFwQc4wUTFkzwJWS5PAPhoLXKqyA@mail.gmail.com
Lists: pgsql-hackers

On Mon, Jan 30, 2012 at 6:48 PM, Noah Misch <noah(at)leadboat(dot)com> wrote:
> On Tue, Jan 24, 2012 at 03:47:16PM -0300, Alvaro Herrera wrote:
>> The biggest item remaining is the point you raised about multixactid
>> wraparound.  This is closely related to multixact truncation and the way
>> checkpoints are to be handled.  If we think that MultiXactId wraparound
>> is possible, and we need to involve autovacuum to keep it at bay, then I
>
> To prove it possible, we need prove there exists some sequence of operations
> consuming N xids and M > N multixactids.  Have N transactions key-lock N-1
> rows apiece, then have each of them key-lock one of the rows locked by each
> other transaction.  This consumes N xids and N(N-1) multixactids.  I believe
> you could construct a workload with N! multixact usage, too.
>
> Existence proofs are one thing, real workloads another.  My unsubstantiated
> guess is that multixactid use will not overtake xid use in bulk on a workload
> not specifically tailored to do so.  So, I think it's enough to notice it,
> refuse to assign a new multixactid, and tell the user to clear the condition
> with a VACUUM FREEZE of all databases.  Other opinions would indeed be useful.

I suspect you are right that it is unlikely, but OTOH that sounds like
an extremely painful recovery procedure. We probably don't need to
put a ton of thought into handling this case as efficiently as
possible, but I think we would do well to avoid situations that could
lead to, basically, a full-cluster shutdown. If that happens to one
of my customers I expect to lose the customer.

I have a couple of other concerns about this patch:

1. I think it's probably fair to assume that this is going to be a
huge win in cases where it avoids deadlocks or lock waits. But is
there a worst case where we don't avoid that but still add a lot of
extra multi-xact lookups? What's the worst case we can imagine and
how pathological does the workload have to be to tickle that case?

2. What algorithm did we end up using to fix the set of key columns,
and is there any user configuration that can or needs to happen there?
Do we handle cleanly the case where the set of key columns is changed
by DDL?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Noah Misch <noah(at)leadboat(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-01-31 14:19:57
Message-ID: 1328016507-sup-3313@alvh.no-ip.org
Lists: pgsql-hackers


Excerpts from Robert Haas's message of mar ene 31 10:17:40 -0300 2012:

> I suspect you are right that it is unlikely, but OTOH that sounds like
> an extremely painful recovery procedure. We probably don't need to
> put a ton of thought into handling this case as efficiently as
> possible, but I think we would do well to avoid situations that could
> lead to, basically, a full-cluster shutdown. If that happens to one
> of my customers I expect to lose the customer.

Okay, so the worst case here is really bad and we should do something
about it. Are you okay with a new pg_class column of type xid? The
advantage is not only that we would be able to track it with high
precision; we would also get rid of a lot of code in which I have little
confidence.

> I have a couple of other concerns about this patch:
>
> 1. I think it's probably fair to assume that this is going to be a
> huge win in cases where it avoids deadlocks or lock waits. But is
> there a worst case where we don't avoid that but still add a lot of
> extra multi-xact lookups? What's the worst case we can imagine and
> how pathological does the workload have to be to tickle that case?

Hm. I haven't really thought about this. There are some code paths
that now have to resolve Multixacts that previously did not; things like
vacuum. I don't think there's any case in which we previously did not
block and now block, but there might be things that got slower without
blocking. One thing that definitely got slower is use of SELECT FOR
SHARE. (This command previously used hint bits to mark the row as
locked; now it is always going to create a multixact). However, I
expect that with foreign keys switching to FOR KEY SHARE, the use of FOR
SHARE is going to decline, maybe disappear completely, so it shouldn't
be a problem.

> 2. What algorithm did we end up using do fix the set of key columns,
> and is there any user configuration that can or needs to happen there?

Currently we just use all columns indexed by unique indexes (excluding
expressional and partial ones). Furthermore, in a table without unique
indexes we consider every column a "key column". Noah disagrees with this
choice; he says we should drop this last point, and that we should relax
the first to "columns actually used by foreign key constraints". I
expect that this is a rather simple change.

Currently there's nothing that the user can do to add more columns to
the set considered (other than creating more unique indexes, of course).
We discussed having an ALTER TABLE command to do it, but this isn't seen
as essential.

> Do we handle cleanly the case where the set of key columns is changed
> by DDL?

Hmm, I remember thinking about this at some point, but now I'm not 100%
sure. I think it doesn't matter due to multis being so ephemeral. Let
me try and figure it out.

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Noah Misch <noah(at)leadboat(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-01-31 16:18:30
Message-ID: CA+TgmoZh2==pt4hO+uw7eYnqUsQriwcOairAYmzhQD3tko0uHA@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jan 31, 2012 at 9:19 AM, Alvaro Herrera
<alvherre(at)commandprompt(dot)com> wrote:
> Excerpts from Robert Haas's message of mar ene 31 10:17:40 -0300 2012:
>> I suspect you are right that it is unlikely, but OTOH that sounds like
>> an extremely painful recovery procedure.  We probably don't need to
>> put a ton of thought into handling this case as efficiently as
>> possible, but I think we would do well to avoid situations that could
>> lead to, basically, a full-cluster shutdown.  If that happens to one
>> of my customers I expect to lose the customer.
>
> Okay, so the worst case here is really bad and we should do something
> about it.  Are you okay with a new pg_class column of type xid?  The
> advantage is not only that we would be able to track it with high
> precision; we would also get rid of a lot of code in which I have little
> confidence.

I think it's butt-ugly, but it's only slightly uglier than
relfrozenxid which we're already stuck with. The slight amount of
additional ugliness is that you're going to use an XID column to store
a uint4 that is not an XID - but I don't have a great idea how to fix
that. You could mislabel it as an OID or a (signed) int4, but I'm not
sure that either of those is any better. We could also create an mxid
data type, but that seems like it might be overkill.

>> I have a couple of other concerns about this patch:
>>
>> 1. I think it's probably fair to assume that this is going to be a
>> huge win in cases where it avoids deadlocks or lock waits.  But is
>> there a worst case where we don't avoid that but still add a lot of
>> extra multi-xact lookups?  What's the worst case we can imagine and
>> how pathological does the workload have to be to tickle that case?
>
> Hm.  I haven't really thought about this.  There are some code paths
> that now have to resolve Multixacts that previously did not; things like
> vacuum.  I don't think there's any case in which we previously did not
> block and now block, but there might be things that got slower without
> blocking.  One thing that definitely got slower is use of SELECT FOR
> SHARE.  (This command previously used hint bits to mark the row as
> locked; now it is always going to create a multixact).  However, I
> expect that with foreign keys switching to FOR KEY SHARE, the use of FOR
> SHARE is going to decline, maybe disappear completely, so it shouldn't
> be a problem.

What about SELECT FOR UPDATE? That's a pretty common case, I think.
If that's now going to force a multixact to get created and
additionally force multixact lookups when the row is subsequently
examined, that seems, well, actually pretty scary at first glance.
SELECT FOR UPDATE is fairly expensive as it stands, and is commonly
used.

>> 2. What algorithm did we end up using do fix the set of key columns,
>> and is there any user configuration that can or needs to happen there?
>
> Currently we just use all columns indexed by unique indexes (excluding
> expressional and partial ones).  Furthermore we consider "key column"
> all columns in a table without unique indexes. Noah disagrees with this
> choice; he says we should drop this last point, and that we should relax
> the first to "columns actually used by foreign key constraints".  I
> expect that this is a rather simple change.

Why the special case for tables without unique indexes? Like Noah, I
don't see the point. Unless there's some trade-off I'm not seeing, we
should want the number of key columns to be as minimal as possible, so
that as many updates as possible can use the "cheap" path that doesn't
involve locking the whole tuple.

>>  Do we handle cleanly the case where the set of key columns is changed
>> by DDL?
>
> Hmm, I remember thinking about this at some point, but now I'm not 100%
> sure.  I think it doesn't matter due to multis being so ephemeral.  Let
> me try and figure it out.

I thought part of the point here was that multixacts aren't so
ephemeral any more: they're going to stick around until the table gets
frozen. I'm worried that's going to turn out to be a problem somehow.
With respect to this particular issue, what I'm worried about is
something like this:

1. Transaction A begins.
2. Transaction B begins and does some updating or locking of table T,
and then commits.
3. Transaction C begins and does DDL on table T, acquiring
AccessExclusiveLock while it does so, and changes the set of key
columns. It then commits.
4A. Transaction A now accesses table T
and/or
4B. Transaction D begins and accesses table T.

At step 4, A and/or D have up-to-date relcache entries that
correctly describes the current set of key columns in T. But the work
done by transaction B was done with a different set of key columns
(could be more or less), and A and/or D mustn't get confused on that
basis. Also, in the case of A, there is the further possibility that
A's snapshot can't see B as committed yet (even though C subsequently
held an AccessExclusiveLock on the table).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Noah Misch <noah(at)leadboat(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-01-31 16:58:21
Message-ID: 1328027380-sup-3177@alvh.no-ip.org
Lists: pgsql-hackers


Excerpts from Robert Haas's message of mar ene 31 13:18:30 -0300 2012:
>
> On Tue, Jan 31, 2012 at 9:19 AM, Alvaro Herrera
> <alvherre(at)commandprompt(dot)com> wrote:
> > Excerpts from Robert Haas's message of mar ene 31 10:17:40 -0300 2012:
> >> I suspect you are right that it is unlikely, but OTOH that sounds like
> >> an extremely painful recovery procedure.  We probably don't need to
> >> put a ton of thought into handling this case as efficiently as
> >> possible, but I think we would do well to avoid situations that could
> >> lead to, basically, a full-cluster shutdown.  If that happens to one
> >> of my customers I expect to lose the customer.
> >
> > Okay, so the worst case here is really bad and we should do something
> > about it.  Are you okay with a new pg_class column of type xid?  The
> > advantage is not only that we would be able to track it with high
> > precision; we would also get rid of a lot of code in which I have little
> > confidence.
>
> I think it's butt-ugly, but it's only slightly uglier than
> relfrozenxid which we're already stuck with. The slight amount of
> additional ugliness is that you're going to use an XID column to store
> a uint4 that is not an XID - but I don't have a great idea how to fix
> that. You could mislabel it as an OID or a (signed) int4, but I'm not
> sure that either of those is any better. We could also create an mxid
> data type, but that seems like it might be overkill.

Well, we're already storing a multixact in Xmax, so it's not like we
don't assume that we can store multis in space normally reserved for
Xids. What I've been wondering is not how ugly it is, but rather of the
fact that we're bloating pg_class some more.

> >> 1. I think it's probably fair to assume that this is going to be a
> >> huge win in cases where it avoids deadlocks or lock waits.  But is
> >> there a worst case where we don't avoid that but still add a lot of
> >> extra multi-xact lookups?  What's the worst case we can imagine and
> >> how pathological does the workload have to be to tickle that case?
> >
> > Hm.  I haven't really thought about this.  There are some code paths
> > that now have to resolve Multixacts that previously did not; things like
> > vacuum.  I don't think there's any case in which we previously did not
> > block and now block, but there might be things that got slower without
> > blocking.  One thing that definitely got slower is use of SELECT FOR
> > SHARE.  (This command previously used hint bits to mark the row as
> > locked; now it is always going to create a multixact).  However, I
> > expect that with foreign keys switching to FOR KEY SHARE, the use of FOR
> > SHARE is going to decline, maybe disappear completely, so it shouldn't
> > be a problem.
>
> What about SELECT FOR UPDATE? That's a pretty common case, I think.
> If that's now going to force a multixact to get created and
> additionally force multixact lookups when the row is subsequently
> examined, that seems, well, actually pretty scary at first glance.
> SELECT FOR UPDATE is fairly expensive as it stands, and is commonly
> used.

SELECT FOR UPDATE is still going to work without a multi in the simple
cases. The case where it's different is when somebody else grabs a KEY
SHARE lock on the same tuple; it's now going to get a multi, where it
previously blocked. So other transactions later checking the tuple will
have a bit of a larger cost. That's okay considering that it meant
the other transaction did not have to wait anymore.

> >> 2. What algorithm did we end up using do fix the set of key columns,
> >> and is there any user configuration that can or needs to happen there?
> >
> > Currently we just use all columns indexed by unique indexes (excluding
> > expressional and partial ones).  Furthermore we consider "key column"
> > all columns in a table without unique indexes. Noah disagrees with this
> > choice; he says we should drop this last point, and that we should relax
> > the first to "columns actually used by foreign key constraints".  I
> > expect that this is a rather simple change.
>
> Why the special case for tables without unique indexes? Like Noah, I
> don't see the point. Unless there's some trade-off I'm not seeing, we
> should want the number of key columns to be as minimal as possible, so
> that as many updates as possible can use the "cheap" path that doesn't
> involve locking the whole tuple.

No trade-off. I just thought it was safer: my thought was that if
there's no nominated key column, the safer bet was that any of them
could have been. But then, in reality there cannot be any foreign key
here anyway. I'll revert that bit.

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Noah Misch <noah(at)leadboat(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-01-31 17:12:10
Message-ID: CA+TgmoYprk=6VJ-tDVgafYg1Nkz-CExxChmiX1R4ojm7Z00=Dg@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jan 31, 2012 at 11:58 AM, Alvaro Herrera
<alvherre(at)commandprompt(dot)com> wrote:
> Well, we're already storing a multixact in Xmax, so it's not like we
> don't assume that we can store multis in space normally reserved for
> Xids.  What I've been wondering is not how ugly it is, but rather of the
> fact that we're bloating pg_class some more.

I don't think another 4 bytes in pg_class is that big a deal. We
don't do relcache rebuilds frequently enough for that to really matter
much. The bigger cost of this patch seems to me to be that we're
going to have to carry around multi-xact IDs for a long time, and
probably fsync and/or WAL-log them moreso than now. I'm not sure how
much you've worried about that, but a new column in pg_class seems
like line noise by comparison.

>> What about SELECT FOR UPDATE?  That's a pretty common case, I think.
>> If that's now going to force a multixact to get created and
>> additionally force multixact lookups when the row is subsequently
>> examined, that seems, well, actually pretty scary at first glance.
>> SELECT FOR UPDATE is fairly expensive as it stands, and is commonly
>> used.
>
> SELECT FOR UPDATE is still going to work without a multi in the simple
> cases.  The case where it's different is when somebody else grabs a KEY
> SHARE lock on the same tuple; it's now going to get a multi, where it
> previously blocked.  So other transactions later checking the tuple will
> have a bit of a larger cost.  That's okay considering that it meant
> the other transaction did not have to wait anymore.

OK. I assume that the different treatment of SELECT FOR SHARE is due
to lack of bit space?

>> >> 2. What algorithm did we end up using do fix the set of key columns,
>> >> and is there any user configuration that can or needs to happen there?
>> >
>> > Currently we just use all columns indexed by unique indexes (excluding
>> > expressional and partial ones).  Furthermore we consider "key column"
>> > all columns in a table without unique indexes.  Noah disagrees with this
>> > choice; he says we should drop this last point, and that we should relax
>> > the first to "columns actually used by foreign key constraints".  I
>> > expect that this is a rather simple change.
>>
>> Why the special case for tables without unique indexes?  Like Noah, I
>> don't see the point.  Unless there's some trade-off I'm not seeing, we
>> should want the number of key columns to be as minimal as possible, so
>> that as many updates as possible can use the "cheap" path that doesn't
>> involve locking the whole tuple.
>
> No trade-off.  I just thought it was safer: my thought was that if
> there's no nominated key column, the safer bet was that any of them
> could have been.  But then, in reality there cannot be any foreign key
> here anyway.  I'll revert that bit.

OK.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Noah Misch <noah(at)leadboat(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-01-31 23:55:19
Message-ID: 1328053433-sup-1091@alvh.no-ip.org
Lists: pgsql-hackers

Excerpts from Robert Haas's message of mar ene 31 14:12:10 -0300 2012:
>
> On Tue, Jan 31, 2012 at 11:58 AM, Alvaro Herrera
> <alvherre(at)commandprompt(dot)com> wrote:
> > Well, we're already storing a multixact in Xmax, so it's not like we
> > don't assume that we can store multis in space normally reserved for
> > Xids.  What I've been wondering is not how ugly it is, but rather of the
> > fact that we're bloating pg_class some more.
>
> I don't think another 4 bytes in pg_class is that big a deal. We
> don't do relcache rebuilds frequently enough for that to really matter
> much. The bigger cost of this patch seems to me to be that we're
> going to have to carry around multi-xact IDs for a long time, and
> probably fsync and/or WAL-log them moreso than now. I'm not sure how
> much you've worried about that, but a new column in pg_class seems
> like line noise by comparison.

I'm not too worried by either fsyncing or WAL logging, because those
costs are only going to be paid when a multixact is used at all; if we
avoid having to wait for an arbitrary length of time at some point, then
it doesn't matter that some things are a bit slower. I worry about a
new pg_class column because it's going to be paid by everyone, whether
they use multixacts or not.

But, having never heard anybody stand against this proposal, I'll go do
that.

> >> What about SELECT FOR UPDATE?  That's a pretty common case, I think.
> >> If that's now going to force a multixact to get created and
> >> additionally force multixact lookups when the row is subsequently
> >> examined, that seems, well, actually pretty scary at first glance.
> >> SELECT FOR UPDATE is fairly expensive as it stands, and is commonly
> >> used.
> >
> > SELECT FOR UPDATE is still going to work without a multi in the simple
> > cases.  The case where it's different is when somebody else grabs a KEY
> > SHARE lock on the same tuple; it's now going to get a multi, where it
> > previously blocked.  So other transactions later checking the tuple will
> > have a bit of a larger cost.  That's okay considering that it meant
> > the other transaction did not have to wait anymore.
>
> OK. I assume that the different treatment of SELECT FOR SHARE is due
> to lack of bit space?

Yes. I gave preference to SELECT FOR UPDATE and SELECT FOR KEY SHARE
because those are presumably going to be used much more frequently than
SELECT FOR SHARE; one because it's part of the standard and there are
plenty of use cases; the other because we're going to use it internally
very frequently.

Now, perhaps we could fix that (i.e. have a separate hint bit for SELECT
FOR SHARE), but I don't think it's justified.
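To picture the encoding being discussed: the original proposal stores two flag bits per multixact member, which is exactly enough to distinguish four member states. Here is a minimal sketch of that idea in C; the names and layout are invented for illustration and are not the patch's actual definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative member statuses: two bits suffice for four states. */
typedef enum
{
    MEMBER_KEY_SHARE = 0,       /* FOR KEY SHARE locker */
    MEMBER_SHARE     = 1,       /* FOR SHARE locker */
    MEMBER_EXCLUSIVE = 2,       /* FOR UPDATE locker */
    MEMBER_UPDATE    = 3        /* actual updater */
} MemberStatus;

#define BITS_PER_MEMBER 2
#define MEMBER_MASK     0x3

/* Pack member i's status into a shared flags word. */
static void
set_member_status(uint32_t *flags, int i, MemberStatus st)
{
    int shift = i * BITS_PER_MEMBER;

    *flags = (*flags & ~((uint32_t) MEMBER_MASK << shift)) |
             ((uint32_t) st << shift);
}

/* Retrieve member i's status from the flags word. */
static MemberStatus
get_member_status(uint32_t flags, int i)
{
    return (MemberStatus) ((flags >> (i * BITS_PER_MEMBER)) & MEMBER_MASK);
}
```

This is only meant to show why a share-lock state still fits in the two bits even though it gets no dedicated infomask hint bit of its own.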

In the meantime, here's an updated version which fixes some funny
border cases, mostly involving locks acquired in aborted
subtransactions. Interestingly, it seems to me the code in heapam.c is
now clearer than before.

The other bit about columns to be considered keys isn't yet changed in
this version.

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Attachment Content-Type Size
fklocks-8.patch.gz application/x-gzip 72.0 KB

From: Jim Nasby <jim(at)nasby(dot)net>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-02-02 00:33:47
Message-ID: 7854C4C9-8871-4FB0-B737-EA7E7EE56914@nasby.net
Lists: pgsql-hackers

On Jan 31, 2012, at 10:58 AM, Alvaro Herrera wrote:
>> I think it's butt-ugly, but it's only slightly uglier than
>> relfrozenxid which we're already stuck with. The slight amount of
>> additional ugliness is that you're going to use an XID column to store
>> a uint4 that is not an XID - but I don't have a great idea how to fix
>> that. You could mislabel it as an OID or a (signed) int4, but I'm not
>> sure that either of those is any better. We could also create an mxid
>> data type, but that seems like it might be overkill.
>
> Well, we're already storing a multixact in Xmax, so it's not like we
> don't assume that we can store multis in space normally reserved for
> Xids. What I've been wondering is not how ugly it is, but rather of the
> fact that we're bloating pg_class some more.

FWIW, users have been known to request uint datatypes; so if this really is a uint perhaps we should just create a uint datatype...
--
Jim C. Nasby, Database Architect jim(at)nasby(dot)net
512.569.9461 (cell) http://jim.nasby.net


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Jim Nasby <jim(at)nasby(dot)net>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-02-02 00:58:42
Message-ID: 1328144272-sup-774@alvh.no-ip.org
Lists: pgsql-hackers


Excerpts from Jim Nasby's message of mié feb 01 21:33:47 -0300 2012:
>
> On Jan 31, 2012, at 10:58 AM, Alvaro Herrera wrote:
> >> I think it's butt-ugly, but it's only slightly uglier than
> >> relfrozenxid which we're already stuck with. The slight amount of
> >> additional ugliness is that you're going to use an XID column to store
> >> a uint4 that is not an XID - but I don't have a great idea how to fix
> >> that. You could mislabel it as an OID or a (signed) int4, but I'm not
> >> sure that either of those is any better. We could also create an mxid
> >> data type, but that seems like it might be overkill.
> >
> > Well, we're already storing a multixact in Xmax, so it's not like we
> > don't assume that we can store multis in space normally reserved for
> > Xids. What I've been wondering is not how ugly it is, but rather of the
> > fact that we're bloating pg_class some more.
>
> FWIW, users have been known to request uint datatypes; so if this really is a uint perhaps we should just create a uint datatype...

Yeah. This is just for internal consumption, though, not a full-blown
datatype.

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Noah Misch <noah(at)leadboat(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-02-13 22:16:58
Message-ID: 1329170123-sup-322@alvh.no-ip.org
Lists: pgsql-hackers

Excerpts from Alvaro Herrera's message of mar ene 31 20:55:19 -0300 2012:
> Excerpts from Robert Haas's message of mar ene 31 14:12:10 -0300 2012:
> >
> > On Tue, Jan 31, 2012 at 11:58 AM, Alvaro Herrera
> > <alvherre(at)commandprompt(dot)com> wrote:
> > > Well, we're already storing a multixact in Xmax, so it's not like we
> > > don't assume that we can store multis in space normally reserved for
> > > Xids.  What I've been wondering is not how ugly it is, but rather of the
> > > fact that we're bloating pg_class some more.
> >
> > I don't think another 4 bytes in pg_class is that big a deal. We
> > don't do relcache rebuilds frequently enough for that to really matter
> > much. The bigger cost of this patch seems to me to be that we're
> > going to have to carry around multi-xact IDs for a long time, and
> > probably fsync and/or WAL-log them moreso than now. I'm not sure how
> > much you've worried about that, but a new column in pg_class seems
> > like line noise by comparison.
>
> I'm not too worried by either fsyncing or WAL logging, because those
> costs are only going to be paid when a multixact is actually used; if we
> thereby avoid having to wait for an arbitrary length of time at some
> point, then it doesn't matter that some things are a bit slower. I do
> worry about a new pg_class column, because that cost is going to be paid
> by everyone, whether they use multixacts or not.
>
> But, having never heard anybody stand against this proposal, I'll go do
> that.

Okay, so this patch fixes the truncation and wraparound issues through a
mechanism much like pg_clog's: it keeps track of the oldest possibly
existing multi on each and every table, and those are then removed
during tuple freezing. I also took the liberty of making the code remove
multis altogether (i.e. resetting the IS_MULTI hint bit) when only the
update remains and the lockers are all gone.
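The "remove multis altogether" step can be sketched roughly as follows. This is a hedged illustration with invented names (`MultiMember`, `try_demote_multi`); the real logic in heapam.c and multixact.c also has to deal with infomask bits, subtransactions, and WAL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Illustrative stand-in for a decoded multixact member. */
typedef struct
{
    TransactionId xid;
    bool          is_update;    /* updater, as opposed to a locker */
    bool          running;      /* transaction still in progress? */
} MultiMember;

/*
 * If every locker in the multi is gone and at most one updater remains,
 * the multi can be replaced by the updater's plain xid (0 meaning "no
 * xmax needed at all").  Returns true if the demotion applies, storing
 * the replacement xid in *new_xmax.
 */
static bool
try_demote_multi(const MultiMember *members, int n, TransactionId *new_xmax)
{
    TransactionId update_xid = 0;

    for (int i = 0; i < n; i++)
    {
        if (members[i].is_update)
            update_xid = members[i].xid;
        else if (members[i].running)
            return false;       /* a live locker: must keep the multi */
    }
    *new_xmax = update_xid;
    return true;
}
```

The point of the demotion is that later visibility checks on the tuple no longer need a multixact lookup at all once the lockers are gone.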

I also cleaned up the code in heapam so that there are a couple of
tables mapping MultiXactStatus to LockTupleMode and back, and to
heavyweight lock modes (the older patches used functions to do this,
which was pretty ugly). I had to add a little helper function to lock.c
to make this work. I also made a fair number of other small changes to
close minor bugs here and there.
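The table-driven mapping described here might look something like this in spirit. The enum members and values below are illustrative only, not the patch's actual definitions.

```c
#include <assert.h>

/* Illustrative enums, loosely modeled on the patch's concepts. */
typedef enum
{
    MXS_KEYSHARE,               /* FOR KEY SHARE */
    MXS_SHARE,                  /* FOR SHARE */
    MXS_FORUPDATE,              /* FOR UPDATE */
    MXS_UPDATE,                 /* actual update */
    NUM_MXS
} MultiXactStatus;

typedef enum
{
    LTM_KEYSHARE,
    LTM_SHARE,
    LTM_EXCLUSIVE
} LockTupleMode;

/* One static lookup table replaces a chain of if/switch logic. */
static const LockTupleMode mxs_to_locktuplemode[NUM_MXS] = {
    [MXS_KEYSHARE]  = LTM_KEYSHARE,
    [MXS_SHARE]     = LTM_SHARE,
    [MXS_FORUPDATE] = LTM_EXCLUSIVE,
    [MXS_UPDATE]    = LTM_EXCLUSIVE
};
```

A table like this keeps the status-to-mode conversions in one auditable place, which is presumably why it read as cleaner than the earlier function-based approach.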

Docs have been added, as have new tests for the isolation harness, which
I've ensured pass in both read committed and serializable modes. WAL
logging was added for locking updated versions of a tuple when an old
one is locked due to an old snapshot. There's plenty of room for growth
in the MultiXact flag bits; the bit that made tables with no keys lock
the entire row all the time was removed; code comments that referred to
this feature as "FOR KEY LOCK" were cleaned up in multiple places to
also mention FOR KEY UPDATE; and the pgrowlocks, pageinspect,
pg_controldata and pg_resetxlog utilities have been updated.

All in all, I think this is in pretty much final shape. Only pg_upgrade
bits are still missing. If sharp eyes could give this a critical look
and knuckle-cracking testers could give it a spin, that would be
helpful.

contrib/pageinspect/heapfuncs.c | 2 +-
contrib/pgrowlocks/Makefile | 2 +-
contrib/pgrowlocks/pgrowlocks--1.0--1.1.sql | 17 +
contrib/pgrowlocks/pgrowlocks--1.0.sql | 15 -
contrib/pgrowlocks/pgrowlocks--1.1.sql | 15 +
contrib/pgrowlocks/pgrowlocks.c | 159 ++-
contrib/pgrowlocks/pgrowlocks.control | 2 +-
doc/src/sgml/pgrowlocks.sgml | 14 +-
doc/src/sgml/ref/select.sgml | 137 +-
src/backend/access/common/heaptuple.c | 2 +-
src/backend/access/heap/heapam.c | 1841 ++++++++++++++++----
src/backend/access/heap/pruneheap.c | 10 +-
src/backend/access/heap/rewriteheap.c | 16 +-
src/backend/access/transam/README | 6 +-
src/backend/access/transam/multixact.c | 1093 ++++++++----
src/backend/access/transam/varsup.c | 2 +
src/backend/access/transam/xact.c | 3 -
src/backend/access/transam/xlog.c | 19 +-
src/backend/catalog/heap.c | 14 +-
src/backend/catalog/index.c | 8 +-
src/backend/commands/analyze.c | 10 +-
src/backend/commands/cluster.c | 36 +-
src/backend/commands/dbcommands.c | 15 +-
src/backend/commands/sequence.c | 8 +-
src/backend/commands/tablecmds.c | 11 +-
src/backend/commands/trigger.c | 2 +-
src/backend/commands/vacuum.c | 92 +-
src/backend/commands/vacuumlazy.c | 23 +-
src/backend/executor/execMain.c | 14 +-
src/backend/executor/nodeLockRows.c | 23 +-
src/backend/nodes/copyfuncs.c | 4 +-
src/backend/nodes/equalfuncs.c | 4 +-
src/backend/nodes/outfuncs.c | 4 +-
src/backend/nodes/readfuncs.c | 2 +-
src/backend/optimizer/plan/initsplan.c | 6 +-
src/backend/optimizer/plan/planner.c | 39 +-
src/backend/parser/analyze.c | 58 +-
src/backend/parser/gram.y | 20 +-
src/backend/postmaster/autovacuum.c | 45 +-
src/backend/rewrite/rewriteHandler.c | 32 +-
src/backend/storage/lmgr/lock.c | 13 +
src/backend/storage/lmgr/predicate.c | 4 +-
src/backend/tcop/utility.c | 46 +-
src/backend/utils/adt/ri_triggers.c | 41 +-
src/backend/utils/adt/ruleutils.c | 30 +-
src/backend/utils/cache/relcache.c | 29 +-
src/backend/utils/time/combocid.c | 5 +-
src/backend/utils/time/tqual.c | 411 ++++-
src/bin/pg_controldata/pg_controldata.c | 4 +
src/bin/pg_resetxlog/pg_resetxlog.c | 17 +
src/include/access/heapam.h | 21 +-
src/include/access/htup.h | 85 +-
src/include/access/multixact.h | 65 +-
src/include/access/rewriteheap.h | 2 +-
src/include/access/xlog.h | 2 +
src/include/catalog/pg_class.h | 24 +-
src/include/catalog/pg_control.h | 2 +
src/include/catalog/pg_database.h | 10 +-
src/include/commands/cluster.h | 3 +-
src/include/commands/vacuum.h | 6 +-
src/include/nodes/execnodes.h | 8 +-
src/include/nodes/parsenodes.h | 36 +-
src/include/nodes/plannodes.h | 12 +-
src/include/parser/analyze.h | 2 +-
src/include/postgres.h | 7 +
src/include/storage/lock.h | 1 +
src/include/utils/rel.h | 1 +
src/include/utils/relcache.h | 4 +-
src/test/isolation/expected/aborted-keyrevoke.out | 276 +++
.../isolation/expected/aborted-keyrevoke_2.out | 278 +++
.../isolation/expected/delete-abort-savept-2.out | 76 +
.../isolation/expected/delete-abort-savept.out | 111 ++
src/test/isolation/expected/fk-contention.out | 3 +-
src/test/isolation/expected/fk-deadlock.out | 34 +-
src/test/isolation/expected/fk-deadlock2.out | 68 +-
src/test/isolation/expected/fk-deadlock2_1.out | 75 +-
src/test/isolation/expected/fk-deadlock2_2.out | 105 ++
src/test/isolation/expected/fk-deadlock_1.out | 44 +-
src/test/isolation/expected/fk-deadlock_2.out | 67 +
src/test/isolation/expected/fk-delete-insert.out | 41 +
src/test/isolation/expected/lock-update-delete.out | 65 +
.../isolation/expected/lock-update-traversal.out | 18 +
src/test/isolation/isolation_schedule | 5 +
src/test/isolation/isolationtester.c | 1 +
src/test/isolation/specs/aborted-keyrevoke.spec | 31 +
.../isolation/specs/delete-abort-savept-2.spec | 34 +
src/test/isolation/specs/delete-abort-savept.spec | 52 +
src/test/isolation/specs/fk-deadlock2.spec | 16 +-
src/test/isolation/specs/lock-update-delete.spec | 38 +
.../isolation/specs/lock-update-traversal.spec | 32 +
90 files changed, 4855 insertions(+), 1331 deletions(-)

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Attachment Content-Type Size
fklocks-9.patch.gz application/x-gzip 82.9 KB

From: Noah Misch <noah(at)leadboat(dot)com>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-02-22 17:00:07
Message-ID: 20120222170007.GA24935@tornado.leadboat.com
Lists: pgsql-hackers

On Mon, Feb 13, 2012 at 07:16:58PM -0300, Alvaro Herrera wrote:
> Okay, so this patch fixes the truncation and wraparound issues through a
> mechanism much like pg_clog's: it keeps track of the oldest possibly
> existing multis on each and every table, and then during tuple freezing
> those are removed. I also took the liberty to make the code remove
> multis altogether (i.e. resetting the IS_MULTI hint bit) when only the
> update remains and lockers are all gone.
>
> I also cleaned up the code in heapam so that there's a couple of tables
> mapping MultiXactStatus to LockTupleMode and back, and to heavyweight
> lock modes (the older patches used functions to do this, which was
> pretty ugly). I had to add a little helper function to lock.c to make
> this work. I made a rather large bunch of other minor changes to close
> minor bugs here and there.
>
> Docs have been added, as have new tests for the isolation harness, which
> I've ensured pass in both read committed and serializable modes. WAL
> logging was added for locking updated versions of a tuple when an old
> one is locked due to an old snapshot. There's plenty of room for growth
> in the MultiXact flag bits; the bit that made tables with no keys lock
> the entire row all the time was removed; code comments that referred to
> this feature as "FOR KEY LOCK" were cleaned up in multiple places to
> also mention FOR KEY UPDATE; and the pgrowlocks, pageinspect,
> pg_controldata and pg_resetxlog utilities have been updated.

All of the above sounds great. I especially like the growing test coverage.

> All in all, I think this is in pretty much final shape. Only pg_upgrade
> bits are still missing. If sharp eyes could give this a critical look
> and knuckle-cracking testers could give it a spin, that would be
> helpful.

Lack of pg_upgrade support leaves this version incomplete, because that
omission would constitute a blocker for beta 2. This version changes as much
code compared to the version I reviewed at the beginning of the CommitFest as
that version changed overall. In that light, it's time to close the books on
this patch for the purpose of this CommitFest; I'm marking it Returned with
Feedback. Thanks for your efforts thus far.

On Mon, Jan 30, 2012 at 06:48:47PM -0500, Noah Misch wrote:
> On Tue, Jan 24, 2012 at 03:47:16PM -0300, Alvaro Herrera wrote:
> > * Columns that are part of the key
> > Noah thinks the set of columns should only consider those actually referenced
> > by keys, not those that *could* be referenced.
>
> Well, do you disagree? To me it's low-hanging fruit, because it isolates the
> UPDATE-time overhead of this patch to FK-referenced tables rather than all
> tables having a PK or PK-like index (often just "all tables").

You have not answered my question above.

nm


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-02-23 09:18:57
Message-ID: CA+U5nMJKLrwG4SBcF-n=MAm2N7zfe4993UvJRz-YFQyfjpUb0g@mail.gmail.com
Lists: pgsql-hackers

On Wed, Feb 22, 2012 at 5:00 PM, Noah Misch <noah(at)leadboat(dot)com> wrote:

>> All in all, I think this is in pretty much final shape.  Only pg_upgrade
>> bits are still missing.  If sharp eyes could give this a critical look
>> and knuckle-cracking testers could give it a spin, that would be
>> helpful.
>
> Lack of pg_upgrade support leaves this version incomplete, because that
> omission would constitute a blocker for beta 2.  This version changes as much
> code compared to the version I reviewed at the beginning of the CommitFest as
> that version changed overall.  In that light, it's time to close the books on
> this patch for the purpose of this CommitFest; I'm marking it Returned with
> Feedback.  Thanks for your efforts thus far.

My view is that, with 90 files touched, this is a very large patch;
that alone makes me wonder whether we should commit it. So I agree with
Noah, and I compliment him on an excellent, detailed review.

However, review of such a large patch should not be simply pass or
fail. We should be looking back at the original problem and ask
ourselves whether some subset of the patch could solve a useful subset
of the problem. For me, that seems quite likely and this is very
definitely an important patch.

Even if we can't solve some part of the problem we can at least commit
some useful parts of infrastructure to allow later work to happen more
smoothly and quickly.

So please let's not focus on the 100%; let's focus on the 80/20.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Jeroen Vermeulen <jtv(at)xs4all(dot)nl>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Noah Misch <noah(at)leadboat(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-02-23 13:08:28
Message-ID: 4F463A4C.9000906@xs4all.nl
Lists: pgsql-hackers

On 2012-02-23 10:18, Simon Riggs wrote:

> However, review of such a large patch should not be simply pass or
> fail. We should be looking back at the original problem and ask
> ourselves whether some subset of the patch could solve a useful subset
> of the problem. For me, that seems quite likely and this is very
> definitely an important patch.
>
> Even if we can't solve some part of the problem we can at least commit
> some useful parts of infrastructure to allow later work to happen more
> smoothly and quickly.
>
> So please let's not focus on the 100%; let's focus on the 80/20.

The suggested immutable-column constraint was meant as a potential
"80/20 workaround." Definitely not a full solution, helpful to some,
probably easier to do. I don't know if an immutable key would actually
be enough to elide foreign-key locks though.

Simon, I think you had a reason why it couldn't work, but I didn't quite
get your meaning and didn't want to derail the discussion further at
that stage. You wrote that it "doesn't do what KEY LOCKS are designed to
do"... any chance you might recall what the problem was?

I don't mean to be pushy about my pet idea, and heaven knows I don't
have time to implement it, but it'd be good to know whether I should put
the whole thought to rest.

Jeroen


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-02-23 14:15:45
Message-ID: CA+U5nM+tT=yR3KgWCksLPFpaaBjEPP5Ha_e_9nOgUpaLMr=Sgg@mail.gmail.com
Lists: pgsql-hackers

On Sun, Dec 4, 2011 at 12:20 PM, Noah Misch <noah(at)leadboat(dot)com> wrote:

> Making pg_multixact persistent across clean shutdowns is no bridge to cross
> lightly, since it means committing to an on-disk format for an indefinite
> period.  We should do it; the benefits of this patch justify it, and I haven't
> identified a way to avoid it without incurring worse problems.

I can't actually see anything in the patch that explains why this is
required. (That is something we should reject more patches on, since
it creates a higher maintenance burden).

Can someone explain? We might think of a way to avoid that.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Jeroen Vermeulen <jtv(at)xs4all(dot)nl>
Cc: Noah Misch <noah(at)leadboat(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-02-23 14:21:39
Message-ID: CA+U5nMJY0Mc7wmyVjR9AEr0mrb473e+5Rmhi+RDOWMRq-qA8zg@mail.gmail.com
Lists: pgsql-hackers

On Thu, Feb 23, 2012 at 1:08 PM, Jeroen Vermeulen <jtv(at)xs4all(dot)nl> wrote:

> Simon, I think you had a reason why it couldn't work, but I didn't quite get
> your meaning and didn't want to distract things further at that stage.  You
> wrote that it "doesn't do what KEY LOCKS are designed to do"...  any chance
> you might recall what the problem was?

The IMMUTABLE idea would work, but it requires all users to recode
their apps. By the time they've done that, we'll probably have fixed
the problem in full anyway, so then we'd have to ask them to stop again,
which is hard; we'd be stuck with a performance tweak that applies to
just one release. So it's the fully automatic solution we're looking
for. I don't object to someone implementing IMMUTABLE; I'm just saying
it's not a way to make this patch simpler and therefore acceptable.

If people are willing to recode apps to avoid this then hire me and
I'll tell you how ;-)

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Noah Misch <noah(at)leadboat(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-02-23 15:04:14
Message-ID: 1330009090-sup-3614@alvh.no-ip.org
Lists: pgsql-hackers


Excerpts from Simon Riggs's message of jue feb 23 11:15:45 -0300 2012:
> On Sun, Dec 4, 2011 at 12:20 PM, Noah Misch <noah(at)leadboat(dot)com> wrote:
>
> > Making pg_multixact persistent across clean shutdowns is no bridge to cross
> > lightly, since it means committing to an on-disk format for an indefinite
> > period.  We should do it; the benefits of this patch justify it, and I haven't
> > identified a way to avoid it without incurring worse problems.
>
> I can't actually see anything in the patch that explains why this is
> required. (That is something we should reject more patches on, since
> it creates a higher maintenance burden).
>
> Can someone explain? We might think of a way to avoid that.

Sure. The problem is that we are allowing updated rows to be locked (and
locked rows to be updated). This means that we need to store extended
Xmax information in tuples that goes beyond mere locks, which is what we
were doing previously -- they may now have locks and updates simultaneously.

(In the previous code, a multixact never meant an update, it always
signified only shared locks. After a crash, all backends that could
have been holding locks must necessarily be gone, so the multixact info
is not interesting and can be treated like the tuple is simply live.)

This means that this extended Xmax info needs to survive a crash, so
that it's possible to retrieve it afterwards; because even if the
lockers are all gone, the updater might have committed, and that means
the tuple is dead. If we failed to keep this info, the tuple would be
considered live, which would be wrong because the other version of the
tuple, the one created by the update, is also live.
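Put in pseudo-C, the post-crash hazard described here is roughly the following. All names are hypothetical stand-ins (the committed-xid check is stubbed out for illustration); the sketch only demonstrates why the multi's member data must be durable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Illustrative decoded view of a multixact found in a tuple's Xmax. */
typedef struct
{
    bool          has_update;   /* did the multi contain an updater? */
    TransactionId update_xid;   /* valid only if has_update */
} MultiInfo;

/* Stub: pretend only xid 100 committed before the crash. */
static bool
xid_committed(TransactionId xid)
{
    return xid == 100;
}

/*
 * After a crash, lockers are irrelevant (their backends are gone), but a
 * committed updater hidden inside the multi makes this tuple version
 * dead.  If the multi's member data were not durable, we could not make
 * this decision and would wrongly treat the old tuple version as live,
 * alongside the new version created by the update.
 */
static bool
tuple_dead_after_crash(MultiInfo mi)
{
    return mi.has_update && xid_committed(mi.update_xid);
}
```

In the old world, where a multixact could only ever mean shared locks, this function would unconditionally return false, which is why pg_multixact never needed to persist across shutdowns before.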

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Noah Misch <noah(at)leadboat(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-02-23 15:12:13
Message-ID: CA+U5nM+OdwvJuiAsFfBBYrR3juWhu8o=fQxv3jrtp2Z2rAdpqA@mail.gmail.com
Lists: pgsql-hackers

On Thu, Feb 23, 2012 at 3:04 PM, Alvaro Herrera
<alvherre(at)commandprompt(dot)com> wrote:
>
> Excerpts from Simon Riggs's message of jue feb 23 11:15:45 -0300 2012:
>> On Sun, Dec 4, 2011 at 12:20 PM, Noah Misch <noah(at)leadboat(dot)com> wrote:
>>
>> > Making pg_multixact persistent across clean shutdowns is no bridge to cross
>> > lightly, since it means committing to an on-disk format for an indefinite
>> > period.  We should do it; the benefits of this patch justify it, and I haven't
>> > identified a way to avoid it without incurring worse problems.
>>
>> I can't actually see anything in the patch that explains why this is
>> required. (That is something we should reject more patches on, since
>> it creates a higher maintenance burden).
>>
>> Can someone explain? We might think of a way to avoid that.
>
> Sure.  The problem is that we are allowing updated rows to be locked (and
> locked rows to be updated).  This means that we need to store extended
> Xmax information in tuples that goes beyond mere locks, which is what we
> were doing previously -- they may now have locks and updates simultaneously.
>
> (In the previous code, a multixact never meant an update, it always
> signified only shared locks.  After a crash, all backends that could
> have been holding locks must necessarily be gone, so the multixact info
> is not interesting and can be treated like the tuple is simply live.)
>
> This means that this extended Xmax info needs to be able to survive, so
> that it's possible to retrieve it after a crash; because even if the
> lockers are all gone, the updater might have committed and this means
> the tuple is dead.  If we failed to keep this, the tuple would be
> considered live which would be wrong because the other version of the
> tuple, which was created by the update, is also live.

OK, thanks.

So why do we need pg_upgrade support?

If pg_multixact is not persistent now, surely there is no requirement
for pg_upgrade to do any form of upgrade? The only time we'll need to
do this is from 9.2 to 9.3, which can of course occur any time in the
next year. That doesn't sound like a reason to block a patch now because
of something that will be needed a year from now.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-02-23 15:28:20
Message-ID: 13826.1330010900@sss.pgh.pa.us
Lists: pgsql-hackers

Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> Sure. The problem is that we are allowing updated rows to be locked (and
> locked rows to be updated). This means that we need to store extended
> Xmax information in tuples that goes beyond mere locks, which is what we
> were doing previously -- they may now have locks and updates simultaneously.

> (In the previous code, a multixact never meant an update, it always
> signified only shared locks. After a crash, all backends that could
> have been holding locks must necessarily be gone, so the multixact info
> is not interesting and can be treated like the tuple is simply live.)

Ugh. I had not been paying attention to what you were doing in this
patch, and now that I read this I wish I had objected earlier. This
seems like a horrid mess that's going to be unsustainable both from a
complexity and a performance standpoint. The only reason multixacts
were tolerable at all was that they had only one semantics. Changing
it so that maybe a multixact represents an actual updater and maybe
it doesn't is not sane.

regards, tom lane


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Noah Misch <noah(at)leadboat(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-02-23 15:43:11
Message-ID: 1330009849-sup-9783@alvh.no-ip.org
Lists: pgsql-hackers


Excerpts from Simon Riggs's message of jue feb 23 06:18:57 -0300 2012:
>
> On Wed, Feb 22, 2012 at 5:00 PM, Noah Misch <noah(at)leadboat(dot)com> wrote:
>
> >> All in all, I think this is in pretty much final shape.  Only pg_upgrade
> >> bits are still missing.  If sharp eyes could give this a critical look
> >> and knuckle-cracking testers could give it a spin, that would be
> >> helpful.
> >
> > Lack of pg_upgrade support leaves this version incomplete, because that
> > omission would constitute a blocker for beta 2.  This version changes as much
> > code compared to the version I reviewed at the beginning of the CommitFest as
> > that version changed overall.  In that light, it's time to close the books on
> > this patch for the purpose of this CommitFest; I'm marking it Returned with
> > Feedback.  Thanks for your efforts thus far.

Now this is an interesting turn of events. I must thank you for your
extensive review effort on the current version of the patch, and also
credit you for the idea that grew this patch out of the older, smaller,
simpler version I wrote during the 9.1 timeframe (which you also
reviewed exhaustively). Without your and Simon's brilliant ideas, this
patch wouldn't exist at all.

I completely understand that you don't want to review this latest
version of the patch; it's a lot of effort and I wouldn't inflict it on
anybody who hasn't volunteered. However, it doesn't seem to me that
this is reason to boot the patch from the commitfest. I think the thing
to do would be to remove yourself from the reviewers column and set it
back to "needs review", so that other reviewers can pick it up.

As for the late code churn, it mostly happened as a result of your
own feedback; I would have left most of it in the original state, but as
I went ahead it seemed much better to refactor things. This is mostly
in heapam.c. As for multixact.c, it also had a lot of churn, but that
was mostly to restore it to the state it has in the master branch,
dropping much of the code I had written to handle multixact truncation.
The new code there and in the vacuum code path (relminmxid and so on) is
a lot smaller than that other code was, and it's closely based on
relfrozenxid which is a known piece of technology.

> My view would be that with 90 files touched this is a very large
> patch, so that alone makes me wonder whether we should commit this
> patch, so I agree with Noah and compliment him on an excellent
> detailed review.

I note, however, that the bulk of the patch is in three files --
multixact.c, tqual.c, heapam.c, as is clearly illustrated in the diff
stats I posted. The rest of them are touched mostly to follow their new
APIs (and of course to add tests and docs).

To summarize, of 94 files touched in total:
* 22 files are in src/test/isolation/
(new and updated tests and expected files)
* 19 files are in src/include/
* 10 files are in contrib/
* 39 files are in src/backend;
* in that subdir, there are 3097 insertions and 1006 deletions
* 3047 (83%) of which are in heapam.c multixact.c tqual.c
* one is a README

> However, review of such a large patch should not be simply pass or
> fail. We should be looking back at the original problem and ask
> ourselves whether some subset of the patch could solve a useful subset
> of the problem. For me, that seems quite likely and this is very
> definitely an important patch.
>
> Even if we can't solve some part of the problem we can at least commit
> some useful parts of infrastructure to allow later work to happen more
> smoothly and quickly.
>
> So please let's not focus on the 100%, let's focus on 80/20.

Well, we have the patch I originally posted in the 9.1 timeframe.
That's a lot smaller and simpler. However, that solves only part of the
blocking problem, and in particular it doesn't fix the initial deadlock
reports from Joel Jacobson at Glue Finance (now renamed Trustly, in case
you wonder about his change of email address) that started this effort
in the first place. I don't think we can cut down to that and still
satisfy the users that requested this; and Glue was just the first one,
because after I started blogging about this, some more people started
asking for it.

I don't think there's any useful middle ground between that one and the
current one, but maybe I'm wrong.

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Noah Misch <noah(at)leadboat(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-02-23 15:49:02
Message-ID: 1330011962-sup-8@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


Excerpts from Simon Riggs's message of jue feb 23 12:12:13 -0300 2012:
> On Thu, Feb 23, 2012 at 3:04 PM, Alvaro Herrera
> <alvherre(at)commandprompt(dot)com> wrote:

> > Sure.  The problem is that we are allowing updated rows to be locked (and
> > locked rows to be updated).  This means that we need to store extended
> > Xmax information in tuples that goes beyond mere locks, which is what we
> > were doing previously -- they may now have locks and updates simultaneously.

> OK, thanks.
>
> So why do we need pg_upgrade support?

Two reasons. One is that in upgrades from a version that contains this
patch to another version that also contains this patch (i.e. future
upgrades), we need to copy the multixact files from the old cluster to
the new.

The other is that in upgrades from a version that doesn't contain this
patch to a version that does, we need to set the multixact limit values
so that values that were used in the old cluster are returned as empty
values (keeping the old semantics); otherwise they would cause errors
trying to read the member Xids from disk.

> If pg_multixact is not persistent now, surely there is no requirement
> to have pg_upgrade do any form of upgrade? The only time we'll need to
> do this is from 9.2 to 9.3, which can of course occur any time in next
> year. That doesn't sound like a reason to block a patch now, because
> of something that will be needed a year from now.

I think there's a policy that we must allow upgrades from one beta to
the next, which is why Noah says this is a blocker starting from beta2.

The pg_upgrade code for this is rather simple, however; there's no
rocket science there.

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-02-23 16:01:33
Message-ID: 1330012167-sup-6111@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


Excerpts from Tom Lane's message of jue feb 23 12:28:20 -0300 2012:
>
> Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> > Sure. The problem is that we are allowing updated rows to be locked (and
> > locked rows to be updated). This means that we need to store extended
> > Xmax information in tuples that goes beyond mere locks, which is what we
> > were doing previously -- they may now have locks and updates simultaneously.
>
> > (In the previous code, a multixact never meant an update, it always
> > signified only shared locks. After a crash, all backends that could
> > have been holding locks must necessarily be gone, so the multixact info
> > is not interesting and can be treated like the tuple is simply live.)
>
> Ugh. I had not been paying attention to what you were doing in this
> patch, and now that I read this I wish I had objected earlier.

Uhm, yeah, a lot earlier -- I initially blogged about this in August
last year:
http://www.commandprompt.com/blogs/alvaro_herrera/2011/08/fixing_foreign_key_deadlocks_part_three/

and in several posts in pgsql-hackers.

> This
> seems like a horrid mess that's going to be unsustainable both from a
> complexity and a performance standpoint. The only reason multixacts
> were tolerable at all was that they had only one semantics. Changing
> it so that maybe a multixact represents an actual updater and maybe
> it doesn't is not sane.

As far as complexity, yeah, it's a lot more complex now -- no question
about that.

Regarding performance, the good thing about this patch is that an
operation that used to block might now not block. So maybe a
multixact-related operation is a bit slower than before, but if it
allows you to continue operating rather than sit waiting until some
other transaction releases you, it's much better.

As for sanity -- I regard multixacts as a way to store extended Xmax
information. The original idea was obviously much more limited in that
the extended info was just locking info. We've extended it but I don't
think it's such a stretch.
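As an illustration of that framing (a toy sketch with invented names and flag values; the real representation lives in multixact.c): each multixact member carries a transaction id plus two flag bits classifying it as a key-share locker, share locker, exclusive locker, or updater, and only the updater bit matters after a crash.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Two flag bits per multixact member (illustrative values only). */
typedef enum MemberStatus {
    MEMBER_KEY_SHARE = 0,       /* FOR KEY SHARE locker */
    MEMBER_SHARE     = 1,       /* FOR SHARE locker */
    MEMBER_EXCLUSIVE = 2,       /* exclusive (FOR UPDATE) locker */
    MEMBER_UPDATER   = 3        /* transaction that updated the tuple */
} MemberStatus;

typedef struct MultiXactMember {
    TransactionId xid;
    MemberStatus  status;
} MultiXactMember;

/* After a crash, a multixact is interesting only if some member was an
 * updater; pure-lock members belonged to sessions that are gone. */
static int multi_has_updater(const MultiXactMember *members, int n)
{
    for (int i = 0; i < n; i++)
        if (members[i].status == MEMBER_UPDATER)
            return 1;
    return 0;
}
```

This is also why the new truncation rules differ from the old ones: a multixact containing an updater must persist, whereas one containing only locks can be discarded once its members are gone.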

I have been posting about (most? all of?) the ideas that I've been
following to make this work at all, so that people had plenty of chances
to disagree with them -- and Noah and others did disagree with many of
them, so I changed the patch accordingly. I'm not closed to further
rework, but I'm not going to abandon the idea lightly.

I'm sure there are bugs too, but hopefully, given enough interested
reviewer eyeballs, they are all shallow.

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Alvaro Herrera" <alvherre(at)commandprompt(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Simon Riggs" <simon(at)2ndquadrant(dot)com>, "Noah Misch" <noah(at)leadboat(dot)com>, "Pg Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-02-23 16:31:36
Message-ID: 4F4615880200002500045AB5@gw.wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Alvaro Herrera <alvherre(at)commandprompt(dot)com> wrote:

> As for sanity -- I regard multixacts as a way to store extended
> Xmax information. The original idea was obviously much more
> limited in that the extended info was just locking info. We've
> extended it but I don't think it's such a stretch.

Since the limitation on what can be stored in xmax was the killer
for Florian's attempt to support SELECT FOR UPDATE in a form which
was arguably more useful (and certainly more convenient for those
converting from other database products), I'm wondering whether
anyone has determined whether this new scheme would allow Florian's
work to be successfully completed. The issues seem very similar.
If this approach also provides a basis for the other work, I think
it helps bolster the argument that this is a good design; if not, I
think it suggests that maybe it should be made more general or
extensible in some way. Once this has to be supported by pg_upgrade
it will be harder to change the format, if that is needed for some
other feature.

-Kevin


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Kevin Grittner <kevin(dot)grittner(at)wicourts(dot)gov>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-02-23 17:45:45
Message-ID: 1330017319-sup-9160@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


Excerpts from Kevin Grittner's message of jue feb 23 13:31:36 -0300 2012:
>
> Alvaro Herrera <alvherre(at)commandprompt(dot)com> wrote:
>
> > As for sanity -- I regard multixacts as a way to store extended
> > Xmax information. The original idea was obviously much more
> > limited in that the extended info was just locking info. We've
> > extended it but I don't think it's such a stretch.
>
> Since the limitation on what can be stored in xmax was the killer
> for Florian's attempt to support SELECT FOR UPDATE in a form which
> was arguably more useful (and certainly more convenient for those
> converting from other database products), I'm wondering whether
> anyone has determined whether this new scheme would allow Florian's
> work to be successfully completed. The issues seem very similar.
> If this approach also provides a basis for the other work, I think
> it helps bolster the argument that this is a good design; if not, I
> think it suggests that maybe it should be made more general or
> extensible in some way. Once this has to be supported by pg_upgrade
> it will be harder to change the format, if that is needed for some
> other feature.

I have no idea what improvements Florian was seeking, but multixacts now
have plenty of bit flag space to indicate whatever we want for each
member transaction, so most likely the answer is yes. However, we need
to make clear that a single SELECT FOR UPDATE on a tuple does not
currently use a multixact; if we wish to always store flags, then we are
forced to use one, which incurs a performance hit.
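To make that trade-off concrete, here is a toy decision rule (invented names; the real code keys off tuple infomask bits in heapam.c): a multixact is needed only when xmax must carry more than one piece of information, so the common lone-FOR-UPDATE case stores a bare xid and pays no multixact overhead.

```c
#include <assert.h>

/* How a tuple's xmax is represented (illustrative names only). */
typedef enum XmaxKind { XMAX_PLAIN_XID, XMAX_MULTIXACT } XmaxKind;

/* A multixact is needed only when several lockers, or a lock plus an
 * update, must be recorded in xmax at the same time. */
static XmaxKind choose_xmax_kind(int nlockers, int has_updater)
{
    int nmembers = nlockers + (has_updater ? 1 : 0);
    return nmembers > 1 ? XMAX_MULTIXACT : XMAX_PLAIN_XID;
}
```

Forcing flags onto every lock would mean taking the XMAX_MULTIXACT path even for a single locker, which is the performance hit referred to above.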

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Greg Smith <greg(at)2ndQuadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-02-23 17:48:13
Message-ID: 4F467BDD.2060804@2ndQuadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 02/23/2012 10:43 AM, Alvaro Herrera wrote:
> I completely understand that you don't want to review this latest
> version of the patch; it's a lot of effort and I wouldn't inflict it on
anybody who hasn't volunteered. However, it doesn't seem to me that
> this is reason to boot the patch from the commitfest. I think the thing
> to do would be to remove yourself from the reviewers column and set it
> back to "needs review", so that other reviewers can pick it up.

This feature made Robert's list of serious CF concerns, too, and the
idea that majorly revised patches might be punted isn't a new one. Noah
is certainly justified in saying you're off his community support list,
after all the review work he's been doing for this CF.

We here think it would be a shame to have all of these other
performance bits sorted but still leave this one loose, if it's possible
to keep going on it. This has been on Simon's peeve list for some time
now. Just yesterday I was reading someone ranting about how this
foreign key locking issue proves Postgres isn't "enterprise scale"; I
think it was part of an article arguing that DB2 is worth paying for.
This change crosses over into the advocacy area because of that, albeit
only for the people who have been burned by this already.

If the main problem is pg_upgrade complexity, eventually progress on
that front needs to be made. I'm surprised the project has survived
this long without needing anything beyond catalog conversion for
in-place upgrade. That luck won't hold forever.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-02-23 18:04:26
Message-ID: 1330019620-sup-3837@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


Excerpts from Greg Smith's message of jue feb 23 14:48:13 -0300 2012:
> On 02/23/2012 10:43 AM, Alvaro Herrera wrote:
> > I completely understand that you don't want to review this latest
> > version of the patch; it's a lot of effort and I wouldn't inflict it on
> > anybody who hasn't volunteered. However, it doesn't seem to me that
> > this is reason to boot the patch from the commitfest. I think the thing
> > to do would be to remove yourself from the reviewers column and set it
> > back to "needs review", so that other reviewers can pick it up.
>
> This feature made Robert's list of serious CF concerns, too, and the
> idea that majorly revised patches might be punted isn't a new one.

Well, this patch (or rather, a previous incarnation of it) got punted
from 9.1's fourth commitfest; I intended to have the new version in
9.2's first CF, but business reasons (which I will not discuss in
public) forced me otherwise. So here we are again -- as I said to Tom,
I don't intend to let go of this one easily, though of course I will
concede to whatever the community decides.

> Noah
> is certainly justified in saying you're off his community support list,
> after all the review work he's been doing for this CF.

Yeah, I can't blame him. I've been trying to focus most of my review
availability on his patches precisely because of that, but it's very
clear to me that his effort is larger than mine.

> We here think it would be a shame to have all of these other
> performance bits sorted but still leave this one loose, if it's
> possible to keep going on it. This has been on Simon's peeve list for
> some time now. Just yesterday I was reading someone ranting about how
> this foreign key locking issue proves Postgres isn't "enterprise
> scale"; I think it was part of an article arguing that DB2 is worth
> paying for. This change crosses over into the advocacy area because of
> that, albeit only for the people who have been burned by this already.

Yeah, Simon's been on this particular issue for quite some time -- which
is probably why the initial idea that kickstarted this patch was his.
Personally I've been in the "not enterprise strength" camp for a long
time, mostly unintentionally; you can see that by tracing how my major
patches close holes in that kind of area ("cluster loses indexes", "we
don't have subtransactions", "foreign key concurrency sucks" (--> SELECT
FOR SHARE), "manual vacuum is teh sux0r", and now this one about FKs
again).

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Greg Smith <greg(at)2ndQuadrant(dot)com>
To:
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-02-23 18:30:09
Message-ID: 4F4685B1.50207@2ndQuadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 02/23/2012 01:04 PM, Alvaro Herrera wrote:
> "manual vacuum is teh sux0r"

I think you've just named my next conference talk submission.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Noah Misch <noah(at)leadboat(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-02-23 18:41:21
Message-ID: CA+U5nML3f8Gvuk0diaOEHorG+GbKmwkHiy2Pafp7drbJFxZU9A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Feb 23, 2012 at 4:01 PM, Alvaro Herrera
<alvherre(at)commandprompt(dot)com> wrote:

> As far as complexity, yeah, it's a lot more complex now -- no question
> about that.

As far as complexity goes, would it be easier if we treated the UPDATE
of a primary key column as a DELETE plus an INSERT?

There's not really a logical reason why updating a primary key has
meaning, so allowing EvalPlanQual to follow the update chain across
primary key values doesn't seem valid to me.

That would make all primary keys IMMUTABLE to updates.

No primary key, no problem.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Alvaro Herrera" <alvherre(at)commandprompt(dot)com>
Cc: "Simon Riggs" <simon(at)2ndquadrant(dot)com>, "Noah Misch" <noah(at)leadboat(dot)com>, "Pg Hackers" <pgsql-hackers(at)postgresql(dot)org>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-02-23 18:44:50
Message-ID: 4F4634C20200002500045ACE@gw.wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Alvaro Herrera <alvherre(at)commandprompt(dot)com> wrote:
> Excerpts from Kevin Grittner's message:

>> Since the limitation on what can be stored in xmax was the killer
>> for Florian's attempt to support SELECT FOR UPDATE in a form
>> which was arguably more useful (and certainly more convenient for
>> those converting from other database products), I'm wondering
>> whether anyone has determined whether this new scheme would allow
>> Florian's work to be successfully completed. The issues seem
>> very similar. If this approach also provides a basis for the
>> other work, I think it helps bolster the argument that this is a
>> good design; if not, I think it suggests that maybe it should be
>> made more general or extensible in some way. Once this has to be
>> supported by pg_upgrade it will be harder to change the format,
>> if that is needed for some other feature.
>
> I have no idea what improvements Florian was seeking, but
> multixacts now have plenty of bit flag space to indicate whatever
> we want for each member transaction, so most likely the answer is
> yes. However, we need to make clear that a single SELECT FOR
> UPDATE on a tuple does not currently use a multixact; if we wish
> to always store flags, then we are forced to use one, which
> incurs a performance hit.

Well, his effort really started to go into a tailspin on the related
issues here:

http://archives.postgresql.org/pgsql-hackers/2010-12/msg01743.php

... with a summary of the problem and possible directions for a
solution here:

http://archives.postgresql.org/pgsql-hackers/2010-12/msg01833.php

One of the problems that Florian was trying to address is that
people often have a need to enforce something with a lot of
similarity to a foreign key, but with more subtle logic than
declarative foreign keys support. One example would be the case
Robert has used in some presentations, where the manager column in
each row in a project table must contain the id of a row in a person
table *which has the project_manager boolean column set to TRUE*.
Short of using the new serializable transaction isolation level in
all related transactions, hand-coding enforcement of this useful
invariant through trigger code (or application code enforced through
some framework) is very tricky. The change to SELECT FOR UPDATE
that Florian was working on would make it pretty straightforward.

-Kevin


From: Noah Misch <noah(at)leadboat(dot)com>
To: Jeroen Vermeulen <jtv(at)xs4all(dot)nl>
Cc: Simon Riggs <simon(at)2ndQuadrant(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-02-23 21:12:35
Message-ID: 20120223211235.GA9520@tornado.leadboat.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Feb 23, 2012 at 02:08:28PM +0100, Jeroen Vermeulen wrote:
> On 2012-02-23 10:18, Simon Riggs wrote:
>
>> However, review of such a large patch should not be simply pass or
>> fail. We should be looking back at the original problem and asking
>> ourselves whether some subset of the patch could solve a useful subset
>> of the problem. For me, that seems quite likely and this is very
>> definitely an important patch.
>>
>> Even if we can't solve some part of the problem we can at least commit
>> some useful parts of infrastructure to allow later work to happen more
>> smoothly and quickly.
>>
>> So please let's not focus on the 100%, let's focus on 80/20.
>
> The suggested immutable-column constraint was meant as a potential
> "80/20 workaround." Definitely not a full solution, helpful to some,
> probably easier to do. I don't know if an immutable key would actually
> be enough to elide foreign-key locks though.

That alone would not simplify the patch much. INSERT/UPDATE/DELETE on the
foreign side would still need to take some kind of tuple lock on the primary
side to prevent primary-side DELETE. You then still face the complicated case
of a tuple that's both locked and updated (non-key/immutable columns only).
Updates that change keys are relatively straightforward, following what we
already do today. It's the non-key updates that complicate things.

If you had both an immutable column constraint and a never-deleted table
constraint, that combination would be sufficient to simplify the picture.
(Directly or indirectly, it would not actually be a never-deleted constraint
so much as a "you must take AccessExclusiveLock to DELETE" constraint.)
Foreign-side DML would then take an AccessShareLock on the parent table with
no tuple lock at all.

By then, though, that change would share little or no code with the current
patch. It may have its own value, but it's not a means for carving a subset
from the current patch.

Thanks,
nm


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-02-23 21:36:42
Message-ID: 1330032679-sup-7677@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


Excerpts from Noah Misch's message of mié feb 22 14:00:07 -0300 2012:
>
> On Mon, Feb 13, 2012 at 07:16:58PM -0300, Alvaro Herrera wrote:

> On Mon, Jan 30, 2012 at 06:48:47PM -0500, Noah Misch wrote:
> > On Tue, Jan 24, 2012 at 03:47:16PM -0300, Alvaro Herrera wrote:
> > > * Columns that are part of the key
> > > Noah thinks the set of columns should only consider those actually referenced
> > > by keys, not those that *could* be referenced.
> >
> > Well, do you disagree? To me it's low-hanging fruit, because it isolates the
> > UPDATE-time overhead of this patch to FK-referenced tables rather than all
> > tables having a PK or PK-like index (often just "all tables").
>
> You have not answered my question above.

Sorry. The reason I didn't research this is that at the very start of
the discussion it was said that having heapam.c figure out whether
columns are being used as FK destinations or not would be more of a
modularity violation than "indexed columns" already are for HOT support
(this was a contentious issue for HOT, so I don't take it lightly). I
don't think I need any more reasons for Tom to object to this patch, or
more bulk in it; both are already serious issues.

In any case, with the way we've defined FOR KEY SHARE locks (the docs
explicitly say that the set of columns considered could vary in the
future), it's a relatively easy patch to add on top of what I've
submitted. Like the ALTER TABLE bits to add columns to the considered
set, it could be left for a second pass on the issue.

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Noah Misch <noah(at)leadboat(dot)com>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-02-23 23:57:47
Message-ID: 20120223235747.GB9520@tornado.leadboat.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Feb 23, 2012 at 06:36:42PM -0300, Alvaro Herrera wrote:
>
> Excerpts from Noah Misch's message of mié feb 22 14:00:07 -0300 2012:
> >
> > On Mon, Feb 13, 2012 at 07:16:58PM -0300, Alvaro Herrera wrote:
>
> > On Mon, Jan 30, 2012 at 06:48:47PM -0500, Noah Misch wrote:
> > > On Tue, Jan 24, 2012 at 03:47:16PM -0300, Alvaro Herrera wrote:
> > > > * Columns that are part of the key
> > > > Noah thinks the set of columns should only consider those actually referenced
> > > > by keys, not those that *could* be referenced.
> > >
> > > Well, do you disagree? To me it's low-hanging fruit, because it isolates the
> > > UPDATE-time overhead of this patch to FK-referenced tables rather than all
> > > tables having a PK or PK-like index (often just "all tables").
> >
> > You have not answered my question above.
>
> Sorry. The reason I didn't research this is that at the very start of
> the discussion it was said that having heapam.c figure out whether
> columns are being used as FK destinations or not would be more of a
> modularity violation than "indexed columns" already are for HOT support
> (this was a contentious issue for HOT, so I don't take it lightly). I
> don't think I need any more reasons for Tom to object to this patch, or
> more bulk in it; both are already serious issues.

That's fair.

> In any case, with the way we've defined FOR KEY SHARE locks (the docs
> explicitly say that the set of columns considered could vary in the
> future), it's a relatively easy patch to add on top of what I've
> submitted. Like the ALTER TABLE bits to add columns to the considered
> set, it could be left for a second pass on the issue.

Agreed. Let's have that debate another day, as a follow-on patch.

Thanks for shedding this light.


From: Noah Misch <noah(at)leadboat(dot)com>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-02-24 04:53:34
Message-ID: 20120224045334.GD9520@tornado.leadboat.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Feb 23, 2012 at 12:43:11PM -0300, Alvaro Herrera wrote:
> Excerpts from Simon Riggs's message of jue feb 23 06:18:57 -0300 2012:
> > On Wed, Feb 22, 2012 at 5:00 PM, Noah Misch <noah(at)leadboat(dot)com> wrote:
> >
> > >> All in all, I think this is in pretty much final shape. Only pg_upgrade
> > >> bits are still missing. If sharp eyes could give this a critical look
> > >> and knuckle-cracking testers could give it a spin, that would be
> > >> helpful.
> > >
> > > Lack of pg_upgrade support leaves this version incomplete, because that
> > > omission would constitute a blocker for beta 2. This version changes as much
> > > code compared to the version I reviewed at the beginning of the CommitFest as
> > > that version changed overall. In that light, it's time to close the books on
> > > this patch for the purpose of this CommitFest; I'm marking it Returned with
> > > Feedback. Thanks for your efforts thus far.
>
> Now this is an interesting turn of events. I must thank you for your
> extensive review effort in the current version of the patch, and also
> thank you and credit you for the idea that initially kicked this patch
> from the older, smaller, simpler version I wrote during the 9.1 timeline
> (which you also reviewed exhaustively). Without your and Simon's
> brilliant ideas, this patch wouldn't exist at all.
>
> I completely understand that you don't want to review this latest
> version of the patch; it's a lot of effort and I wouldn't inflict it on
> anybody who hasn't volunteered. However, it doesn't seem to me that
> this is reason to boot the patch from the commitfest. I think the thing
> to do would be to remove yourself from the reviewers column and set it
> back to "needs review", so that other reviewers can pick it up.

It would indeed be wrong to change any patch from Needs Review to Returned
with Feedback on account of a personal distaste for reviewing the patch. I
hope I did not harbor such a motive here. Rather, this CommitFest has given
your patch its fair shake, and I and other reviewers would better serve the
CF's needs by reviewing, say, "ECPG FETCH readahead" instead of your latest
submission. Likewise, you would better serve the CF by evaluating one of the
four non-committer patches that have been Ready for Committer since January.
That's not to imply that the goals of the CF align with my goals, your goals,
or broader PGDG goals. The patch status on commitfest.postgresql.org does
exist solely for the advancement of the CF, and I have set it accordingly.

> As for the late code churn, it mostly happened as a result of your
> own feedback; I would have left most of it in the original state, but as
> I went ahead it seemed much better to refactor things. This is mostly
> in heapam.c. As for multixact.c, it also had a lot of churn, but that
> was mostly to restore it to the state it has in the master branch,
> dropping much of the code I had written to handle multixact truncation.
> The new code there and in the vacuum code path (relminmxid and so on) is
> a lot smaller than that other code was, and it's closely based on
> relfrozenxid which is a known piece of technology.

I appreciate that.

> > However, review of such a large patch should not be simply pass or
> > fail. We should be looking back at the original problem and asking
> > ourselves whether some subset of the patch could solve a useful subset
> > of the problem. For me, that seems quite likely and this is very
> > definitely an important patch.

Incidentally, I find it harmful to think of "Returned with Feedback" as
"fail". For large patches, it's healthier to think of a CF as a bimonthly
project status meeting with stakeholders. When the project is done,
wonderful! When there's work left, that's no great surprise.

> > Even if we can't solve some part of the problem we can at least commit
> > some useful parts of infrastructure to allow later work to happen more
> > smoothly and quickly.
> >
> > So please let's not focus on the 100%, let's focus on 80/20.
>
> Well, we have the patch I originally posted in the 9.1 timeframe.
> That's a lot smaller and simpler. However, that solves only part of the
> blocking problem, and in particular it doesn't fix the initial deadlock
> reports from Joel Jacobson at Glue Finance (now renamed Trustly, in case
> you wonder about his change of email address) that started this effort
> in the first place. I don't think we can cut down to that and still
> satisfy the users that requested this; and Glue was just the first one,
> because after I started blogging about this, some more people started
> asking for it.
>
> I don't think there's any useful middle ground between that one and
> the current one, but maybe I'm wrong.

Nothing additional comes to my mind, either. This patch is monolithic.

Thanks,
nm


From: Jeroen Vermeulen <jtv(at)xs4all(dot)nl>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Simon Riggs <simon(at)2ndQuadrant(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-02-24 09:38:52
Message-ID: 4F475AAC.9090806@xs4all.nl

On 2012-02-23 22:12, Noah Misch wrote:

> That alone would not simplify the patch much. INSERT/UPDATE/DELETE on the
> foreign side would still need to take some kind of tuple lock on the primary
> side to prevent primary-side DELETE. You then still face the complicated case
> of a tuple that's both locked and updated (non-key/immutable columns only).
> Updates that change keys are relatively straightforward, following what we
> already do today. It's the non-key updates that complicate things.

Ah, so there's the technical hitch. From previous discussion I was
under the impression that:

1. Foreign-key updates only inherently conflict with _key_ updates on
the foreign side, and that non-key updates on the foreign side were just
caught in the locking cross-fire, so to speak.

And

2. The DELETE case was somehow trivially accounted for. But, for
instance, there does not seem to be a lighter lock type that DELETE
conflicts with but UPDATE does not. Bummer.
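Jeroen's two points can be restated as a toy conflict table. This is only a sketch of the semantics under discussion (the mode names are borrowed from the syntax the patch proposes, plus a hypothetical mode for updates that leave the key columns alone), not PostgreSQL code:

```python
# Tuple-lock modes, weakest to strongest. A plain non-key UPDATE takes
# NO_KEY_UPDATE; DELETE and key-changing UPDATE take UPDATE.
KEY_SHARE, SHARE, NO_KEY_UPDATE, UPDATE = (
    "key share", "share", "no key update", "update")

# Each mode maps to the set of modes it conflicts with (symmetric).
CONFLICTS = {
    KEY_SHARE:     {UPDATE},
    SHARE:         {NO_KEY_UPDATE, UPDATE},
    NO_KEY_UPDATE: {SHARE, NO_KEY_UPDATE, UPDATE},
    UPDATE:        {KEY_SHARE, SHARE, NO_KEY_UPDATE, UPDATE},
}

def conflicts(a, b):
    return b in CONFLICTS[a]

# Point 1: an RI key-share lock does not block a non-key update...
assert not conflicts(NO_KEY_UPDATE, KEY_SHARE)
# ...but point 2: DELETE (strongest mode) still conflicts with it; there
# is no mode that conflicts with DELETE but not with key-changing UPDATE,
# since both take the same (strongest) mode.
assert conflicts(UPDATE, KEY_SHARE)
```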

> By then, though, that change would share little or no code with the current
> patch. It may have its own value, but it's not a means for carving a subset
> from the current patch.

No, to be clear, it was never meant to be. Only a possible way to give
users a way out of foreign-key locks more quickly. Not a way to get
some of the branch out to users more quickly.

At any rate, that seems to be moot then. And to be honest, mechanisms
designed for more than one purpose rarely pan out.

Thanks for explaining!

Jeroen


From: Vik Reykja <vikreykja(at)gmail(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-02-25 01:06:49
Message-ID: CALDgxVvr=O60Om6Y58f1s4kb=9UiOgMoAq0wmDkNe26KiaNpaA@mail.gmail.com

On Thu, Feb 23, 2012 at 19:44, Kevin Grittner
<Kevin(dot)Grittner(at)wicourts(dot)gov>wrote:

> One of the problems that Florian was trying to address is that
> people often have a need to enforce something with a lot of
> similarity to a foreign key, but with more subtle logic than
> declarative foreign keys support. One example would be the case
> Robert has used in some presentations, where the manager column in
> each row in a project table must contain the id of a row in a person
> table *which has the project_manager boolean column set to TRUE*.
> Short of using the new serializable transaction isolation level in
> all related transactions, hand-coding enforcement of this useful
> invariant through trigger code (or application code enforced through
> some framework) is very tricky. The change to SELECT FOR UPDATE
> that Florian was working on would make it pretty straightforward.
>

I'm not sure what Florian's patch does, but I've been trying to advocate
syntax like the following for this exact scenario:

foreign key (manager_id, true) references person (id, is_manager)

Basically, allow us to use constants instead of field names as part of
foreign keys. I have no idea what the implementation aspect of this is,
but I need the user aspect of it and don't know the best way to get it.
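For what it's worth, the user-visible behaviour can be emulated today with a denormalized flag column plus a composite foreign key. A minimal sketch (SQLite through Python's sqlite3, purely so the constraint machinery is runnable; the table and column names are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.execute("""
    CREATE TABLE person (
        id INTEGER PRIMARY KEY,
        is_manager INTEGER NOT NULL,
        UNIQUE (id, is_manager)  -- composite target for the FK below
    )""")
conn.execute("""
    CREATE TABLE project (
        id INTEGER PRIMARY KEY,
        manager_id INTEGER NOT NULL,
        -- stand-in for the constant 'true' in the proposed syntax:
        manager_flag INTEGER NOT NULL DEFAULT 1 CHECK (manager_flag = 1),
        FOREIGN KEY (manager_id, manager_flag) REFERENCES person (id, is_manager)
    )""")
conn.executemany("INSERT INTO person VALUES (?, ?)", [(1, 1), (2, 0)])

conn.execute("INSERT INTO project (id, manager_id) VALUES (10, 1)")  # 1 is a manager: ok
rejected = False
try:
    conn.execute("INSERT INTO project (id, manager_id) VALUES (11, 2)")  # 2 is not
except sqlite3.IntegrityError:
    rejected = True  # no (2, 1) row in person, so the FK rejects it
```

The same trick works in PostgreSQL with a redundant CHECK-constrained column and a UNIQUE constraint on person (id, is_manager), at the cost of carrying the extra column around.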


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Vik Reykja" <vikreykja(at)gmail(dot)com>
Cc: "Simon Riggs" <simon(at)2ndquadrant(dot)com>, "Alvaro Herrera" <alvherre(at)commandprompt(dot)com>, "Noah Misch" <noah(at)leadboat(dot)com>, "Pg Hackers" <pgsql-hackers(at)postgresql(dot)org>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-02-25 18:06:38
Message-ID: 4F48CECE0200002500045BBF@gw.wicourts.gov

Vik Reykja <vikreykja(at)gmail(dot)com> wrote:
> Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>wrote:
>
>> One of the problems that Florian was trying to address is that
>> people often have a need to enforce something with a lot of
>> similarity to a foreign key, but with more subtle logic than
>> declarative foreign keys support. One example would be the case
>> Robert has used in some presentations, where the manager column
>> in each row in a project table must contain the id of a row in a
>> person table *which has the project_manager boolean column set to
>> TRUE*. Short of using the new serializable transaction isolation
>> level in all related transactions, hand-coding enforcement of
>> this useful invariant through trigger code (or application code
>> enforced through some framework) is very tricky. The change to
>> SELECT FOR UPDATE that Florian was working on would make it
>> pretty straightforward.
>
> I'm not sure what Florian's patch does, but I've been trying to
> advocate syntax like the following for this exact scenario:
>
> foreign key (manager_id, true) references person (id, is_manager)
>
> Basically, allow us to use constants instead of field names as
> part of foreign keys.

Interesting. IMV, a declarative approach like that is almost always
better than the alternatives, so something like this (possibly with
different syntax) would be another step in the right direction. I
suspect that there will always be a few corner cases where the
business logic required is too esoteric to be handled by a
generalized declarative construct, so I think Florian's idea still
has merit -- especially if we want to ease the transition to
PostgreSQL for large shops using other products.

> I have no idea what the implementation aspect of this is,
> but I need the user aspect of it and don't know the best way to
> get it.

There are those in the community who make their livings by helping
people get the features they want. If you have some money to fund
development, I would bet you could get this addressed -- it sure
sounds reasonable to me.

-Kevin


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-02-27 02:47:50
Message-ID: CA+TgmoaC-iZMSOPPrzCyoxH2qW5EF0dhotL2tFhy92E=CEcuDg@mail.gmail.com

On Thu, Feb 23, 2012 at 11:01 AM, Alvaro Herrera
<alvherre(at)commandprompt(dot)com> wrote:
>> This
>> seems like a horrid mess that's going to be unsustainable both from a
>> complexity and a performance standpoint.  The only reason multixacts
>> were tolerable at all was that they had only one semantics.  Changing
>> it so that maybe a multixact represents an actual updater and maybe
>> it doesn't is not sane.
>
> As far as complexity, yeah, it's a lot more complex now -- no question
> about that.
>
> Regarding performance, the good thing about this patch is that if you
> have an operation that used to block, it might now not block.  So maybe
> multixact-related operation is a bit slower than before, but if it
> allows you to continue operating rather than sit waiting until some
> other transaction releases you, it's much better.

That's probably true, although there is some deferred cost that is
hard to account for. You might not block immediately, but then later
somebody might block either because the mxact SLRU now needs fsyncs or
because they've got to decode an mxid long after the relevant segment
has been evicted from the SLRU buffers. In general, it's hard to
bound that latter cost, because you only avoid blocking once (when the
initial update happens) but you might pay the extra cost of decoding
the mxid as many times as the row is read, which could be arbitrarily
many. How much of a problem that is in practice, I'm not completely
sure, but it has worried me before and it still does. In the worst
case scenario, a handful of frequently-accessed rows with MXIDs all of
whose members are dead except for the UPDATE they contain could result
in continual SLRU cache-thrashing.
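The worst case sketched above is the classic LRU pathology: once the set of hot mxid pages exceeds the number of SLRU buffer slots, cycling through them misses on every single access. A toy model (invented names; the real replacement policy in slru.c differs in detail but not in spirit):

```python
from collections import OrderedDict

def slru_misses(page_accesses, nslots):
    """Miss count for an LRU-managed buffer pool with nslots page slots."""
    cache, misses = OrderedDict(), 0
    for page in page_accesses:
        if page in cache:
            cache.move_to_end(page)        # hit: mark most recently used
        else:
            misses += 1
            if len(cache) >= nslots:
                cache.popitem(last=False)  # evict least recently used
            cache[page] = None
    return misses

hot_cycle = [p for _ in range(100) for p in range(8)]  # cycle 8 hot pages
fits = slru_misses(hot_cycle, 8)       # hot set fits: cold-start misses only
thrashes = slru_misses(hot_cycle, 4)   # hot set too big: every access misses
```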

From a performance standpoint, we really need to think not only about
the cases where the patch wins, but also, and maybe more importantly,
the cases where it loses. There are some cases where the current
mechanism of using SHARE locks for foreign keys is adequate. In
particular, it's adequate whenever the parent table is not updated at
all, or only very lightly. I believe that those people will pay
somewhat more with this patch, and especially in any case where
backends end up waiting for fsyncs in order to create new mxids, but
also just because I think this patch will have the effect of
increasing the space consumed by each individual mxid, which imposes a
distributed cost of its own.

I think we should avoid having a theoretical argument about how
serious these problems are; instead, you should try to construct
somewhat-realistic worst case scenarios and benchmark them. Tom's
complaint about code complexity is basically a question of opinion, so
I don't know how to evaluate that objectively, but performance is
something we can measure. We might still disagree on the
interpretation of the results, but I still think having some real
numbers to talk about based on carefully-thought-out test cases would
advance the debate.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-02-27 12:13:32
Message-ID: 4F4B736C.2080404@enterprisedb.com

On 23.02.2012 18:01, Alvaro Herrera wrote:
>
> Excerpts from Tom Lane's message of jue feb 23 12:28:20 -0300 2012:
>>
>> Alvaro Herrera<alvherre(at)commandprompt(dot)com> writes:
>>> Sure. The problem is that we are allowing updated rows to be locked (and
>>> locked rows to be updated). This means that we need to store extended
>>> Xmax information in tuples that goes beyond mere locks, which is what we
>>> were doing previously -- they may now have locks and updates simultaneously.
>>
>>> (In the previous code, a multixact never meant an update, it always
>>> signified only shared locks. After a crash, all backends that could
>>> have been holding locks must necessarily be gone, so the multixact info
>>> is not interesting and can be treated like the tuple is simply live.)
>>
>> Ugh. I had not been paying attention to what you were doing in this
>> patch, and now that I read this I wish I had objected earlier.
>
> Uhm, yeah, a lot earlier -- I initially blogged about this in August
> last year:
> http://www.commandprompt.com/blogs/alvaro_herrera/2011/08/fixing_foreign_key_deadlocks_part_three/
>
> and in several posts in pgsql-hackers.
>
>> This
>> seems like a horrid mess that's going to be unsustainable both from a
>> complexity and a performance standpoint. The only reason multixacts
>> were tolerable at all was that they had only one semantics. Changing
>> it so that maybe a multixact represents an actual updater and maybe
>> it doesn't is not sane.
>
> As far as complexity, yeah, it's a lot more complex now -- no question
> about that.

How about assigning a new, real transaction id to represent the group
of transaction ids? The new transaction id would be treated as a
subtransaction of the updater, and the xids of the lockers would be
stored in the multixact-members slru. That way the multixact structures
wouldn't need to survive a crash; you don't care about the shared
lockers after a crash, and the xid of the updater would be safely stored
as is in the xmax field.

That way you wouldn't need to handle multixact wraparound, because we
already handle xid wraparound, and you wouldn't need to make multixact
slrus crash-safe.

Not sure what the performance implications would be. You would use up
xids more quickly, which would require more frequent anti-wraparound
vacuuming. And if we just start using real xids as the key to
multixact-offsets slru, we would need to extend that a lot more often.
But I feel it would probably be acceptable.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Noah Misch <noah(at)leadboat(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-02-28 00:28:14
Message-ID: 20120228002814.GA29227@tornado.leadboat.com

On Mon, Feb 27, 2012 at 02:13:32PM +0200, Heikki Linnakangas wrote:
> On 23.02.2012 18:01, Alvaro Herrera wrote:
>> As far as complexity, yeah, it's a lot more complex now -- no question
>> about that.
>
> How about assigning a new, real, transaction id, to represent the group
> of transaction ids. The new transaction id would be treated as a
> subtransaction of the updater, and the xids of the lockers would be
> stored in the multixact-members slru. That way the multixact structures
> wouldn't need to survive a crash; you don't care about the shared
> lockers after a crash, and the xid of the updater would be safely stored
> as is in the xmax field.
>
> That way you wouldn't need to handle multixact wraparound, because we
> already handle xid wraparound, and you wouldn't need to make multixact
> slrus crash-safe.
>
> Not sure what the performance implications would be. You would use up
> xids more quickly, which would require more frequent anti-wraparound
> vacuuming. And if we just start using real xids as the key to
> multixact-offsets slru, we would need to extend that a lot more often.
> But I feel it would probably be acceptable.

When a key locker arrives after the updater and creates this implicit
subtransaction of the updater, how might you arrange for the xid's clog status
to eventually get updated in accordance with the updater's outcome?

Thanks,
nm


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-02-28 07:55:31
Message-ID: CA+U5nMLLdrgV-co4H5qsEtRHN+rvBUc-wGdFynHSGLwrbUuj8w@mail.gmail.com

On Tue, Feb 28, 2012 at 12:28 AM, Noah Misch <noah(at)leadboat(dot)com> wrote:
> On Mon, Feb 27, 2012 at 02:13:32PM +0200, Heikki Linnakangas wrote:
>> On 23.02.2012 18:01, Alvaro Herrera wrote:
>>> As far as complexity, yeah, it's a lot more complex now -- no question
>>> about that.
>>
>> How about assigning a new, real, transaction id, to represent the group
>> of transaction ids. The new transaction id would be treated as a
>> subtransaction of the updater, and the xids of the lockers would be
>> stored in the multixact-members slru. That way the multixact structures
>> wouldn't need to survive a crash; you don't care about the shared
>> lockers after a crash, and the xid of the updater would be safely stored
>> as is in the xmax field.
>>
>> That way you wouldn't need to handle multixact wraparound, because we
>> already handle xid wraparound, and you wouldn't need to make multixact
>> slrus crash-safe.
>>
>> Not sure what the performance implications would be. You would use up
>> xids more quickly, which would require more frequent anti-wraparound
>> vacuuming. And if we just start using real xids as the key to
>> multixact-offsets slru, we would need to extend that a lot more often.
>> But I feel it would probably be acceptable.
>
> When a key locker arrives after the updater and creates this implicit
> subtransaction of the updater, how might you arrange for the xid's clog status
> to eventually get updated in accordance with the updater's outcome?

Somewhat off-topic, but I've just seen another bad case of FK lock contention.

Thanks for working on this everybody.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Noah Misch <noah(at)leadboat(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-03-05 18:28:59
Message-ID: CA+U5nM+mPbv3N-6iAU9DUmPgvrgJsgj3VOQOygTT8OWbnLxtig@mail.gmail.com

On Mon, Feb 27, 2012 at 2:47 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Thu, Feb 23, 2012 at 11:01 AM, Alvaro Herrera
> <alvherre(at)commandprompt(dot)com> wrote:
>>> This
>>> seems like a horrid mess that's going to be unsustainable both from a
>>> complexity and a performance standpoint.  The only reason multixacts
>>> were tolerable at all was that they had only one semantics.  Changing
>>> it so that maybe a multixact represents an actual updater and maybe
>>> it doesn't is not sane.
>>
>> As far as complexity, yeah, it's a lot more complex now -- no question
>> about that.
>>
>> Regarding performance, the good thing about this patch is that if you
>> have an operation that used to block, it might now not block.  So maybe
>> multixact-related operation is a bit slower than before, but if it
>> allows you to continue operating rather than sit waiting until some
>> other transaction releases you, it's much better.
>
> That's probably true, although there is some deferred cost that is
> hard to account for.  You might not block immediately, but then later
> somebody might block either because the mxact SLRU now needs fsyncs or
> because they've got to decode an mxid long after the relevant segment
> has been evicted from the SLRU buffers.  In general, it's hard to
> bound that latter cost, because you only avoid blocking once (when the
> initial update happens) but you might pay the extra cost of decoding
> the mxid as many times as the row is read, which could be arbitrarily
> many.  How much of a problem that is in practice, I'm not completely
> sure, but it has worried me before and it still does.  In the worst
> case scenario, a handful of frequently-accessed rows with MXIDs all of
> whose members are dead except for the UPDATE they contain could result
> in continual SLRU cache-thrashing.

Cases I regularly see involve wait times of many seconds.

When this patch helps, the gains are algorithmic, so perhaps 10-100x.

That can and should be demonstrated though, I agree.

> From a performance standpoint, we really need to think not only about
> the cases where the patch wins, but also, and maybe more importantly,
> the cases where it loses.  There are some cases where the current
> mechanism, use SHARE locks for foreign keys, is adequate.  In
> particular, it's adequate whenever the parent table is not updated at
> all, or only very lightly.  I believe that those people will pay
> somewhat more with this patch, and especially in any case where
> backends end up waiting for fsyncs in order to create new mxids, but
> also just because I think this patch will have the effect of
> increasing the space consumed by each individual mxid, which imposes a
> distributed cost of its own.

That is a concern also.

It's taken me a while reviewing the patch to realise that space usage
is actually 4 times worse than before.

> I think we should avoid having a theoretical argument about how
> serious these problems are; instead, you should try to construct
> somewhat-realistic worst case scenarios and benchmark them.  Tom's
> complaint about code complexity is basically a question of opinion, so
> I don't know how to evaluate that objectively, but performance is
> something we can measure.  We might still disagree on the
> interpretation of the results, but I still think having some real
> numbers to talk about based on carefully-thought-out test cases would
> advance the debate.

It's a shame that the isolation tester can't be used directly by
pgbench - I think we need something similar for performance regression
testing.

So yes, performance testing is required.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Noah Misch <noah(at)leadboat(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-03-05 18:37:53
Message-ID: 1330972495-sup-9447@alvh.no-ip.org


Excerpts from Simon Riggs's message of lun mar 05 15:28:59 -0300 2012:
>
> On Mon, Feb 27, 2012 at 2:47 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> > From a performance standpoint, we really need to think not only about
> > the cases where the patch wins, but also, and maybe more importantly,
> > the cases where it loses.  There are some cases where the current
> > mechanism, use SHARE locks for foreign keys, is adequate.  In
> > particular, it's adequate whenever the parent table is not updated at
> > all, or only very lightly.  I believe that those people will pay
> > somewhat more with this patch, and especially in any case where
> > backends end up waiting for fsyncs in order to create new mxids, but
> > also just because I think this patch will have the effect of
> > increasing the space consumed by each individual mxid, which imposes a
> > distributed cost of its own.
>
> That is a concern also.
>
> It's taken me a while reviewing the patch to realise that space usage
> is actually 4 times worse than before.

Eh. You're probably misreading something. Previously each member of a
multixact used 4 bytes (the size of an Xid). With the current patch a
member uses 5 bytes (same plus a flags byte). An earlier version used
4.25 bytes per member, which I increased to leave space for future
expansion.

So it's 1.25x worse, not 4x worse.
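Spelled out, the accounting is: each multixact member was a bare 4-byte xid, and the patch adds per-member flag bits (stored, in this version, as a whole byte) to record each member's lock strength. A sketch of that layout (the names and exact packing are invented; the on-disk format in multixact.c differs in detail):

```python
import struct
from enum import IntEnum

class MemberStatus(IntEnum):
    # Two flag bits per member distinguish the lock strength.
    KEY_SHARE = 0
    SHARE = 1
    EXCLUSIVE = 2
    UPDATE = 3

def pack_member(xid: int, status: MemberStatus) -> bytes:
    # 4-byte xid + 1 flags byte = 5 bytes per member, vs. 4 bytes before.
    return struct.pack("<IB", xid, status)

xid, flags = struct.unpack("<IB", pack_member(4242, MemberStatus.UPDATE))
overhead = 5 / 4  # 1.25x, not 4x
```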

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Noah Misch <noah(at)leadboat(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-03-05 19:34:10
Message-ID: CA+U5nMKoW1U=5D6smVNik-k2RO3SuhBe3r=z0wUU+c=sTBf-JA@mail.gmail.com

On Mon, Mar 5, 2012 at 6:37 PM, Alvaro Herrera
<alvherre(at)commandprompt(dot)com> wrote:
>
> Excerpts from Simon Riggs's message of lun mar 05 15:28:59 -0300 2012:
>>
>> On Mon, Feb 27, 2012 at 2:47 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
>> > From a performance standpoint, we really need to think not only about
>> > the cases where the patch wins, but also, and maybe more importantly,
>> > the cases where it loses.  There are some cases where the current
>> > mechanism, use SHARE locks for foreign keys, is adequate.  In
>> > particular, it's adequate whenever the parent table is not updated at
>> > all, or only very lightly.  I believe that those people will pay
>> > somewhat more with this patch, and especially in any case where
>> > backends end up waiting for fsyncs in order to create new mxids, but
>> > also just because I think this patch will have the effect of
>> > increasing the space consumed by each individual mxid, which imposes a
>> > distributed cost of its own.
>>
>> That is a concern also.
>>
>> It's taken me a while reviewing the patch to realise that space usage
>> is actually 4 times worse than before.
>
> Eh.  You're probably misreading something.  Previously each member of a
> multixact used 4 bytes (the size of an Xid).  With the current patch a
> member uses 5 bytes (same plus a flags byte).  An earlier version used
> 4.25 bytes per multi, which I increased to leave space for future
> expansion.
>
> So it's 1.25x worse, not 4x worse.

Thanks for correcting me. That sounds better.

It does, however, illustrate my next review comment, which is that the
comments and README items are sorely lacking here. It's quite hard to
see how it works, let alone comment on major design decisions. It
would help me and others immensely if we could improve that.

Is there a working copy on a git repo? Easier than waiting for next
versions of a patch.

My other comments so far are

* some permutations commented out - no comments as to why
It's something of a fault with the isolation tester that it just shows
output; there's no way to record expected output in the spec

Comments required for these points

* Why do we need multixact to be persistent? Do we need every page of
multixact to be persistent, or just particular pages in certain
circumstances?

* Why do we need to expand multixact with flags? Can we avoid that in
some cases?

* Why do we need to store just single xids in multixact members?
Didn't understand comments, no explanation

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Noah Misch <noah(at)leadboat(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-03-05 19:53:37
Message-ID: 1330976418-sup-7849@alvh.no-ip.org


Excerpts from Simon Riggs's message of lun mar 05 16:34:10 -0300 2012:
> On Mon, Mar 5, 2012 at 6:37 PM, Alvaro Herrera
> <alvherre(at)commandprompt(dot)com> wrote:

> It does however, illustrate my next review comment which is that the
> comments and README items are sorely lacking here. It's quite hard to
> see how it works, let along comment on major design decisions. It
> would help myself and others immensely if we could improve that.

Hm. Okay.

> Is there a working copy on a git repo? Easier than waiting for next
> versions of a patch.

No, I don't have an external mirror of my local repo.

> My other comments so far are
>
> * some permutations commented out - no comments as to why
> Something of a fault with the isolation tester that it just shows
> output, there's no way to record expected output in the spec

The reason they are commented out is that they are "invalid", that is,
they require running a command in a session that's blocked on its
previous command. Obviously, that cannot happen in real life.

isolationtester now has support for detecting such conditions; if the
spec specifies running a command in a locked session, the permutation is
killed with an error message "invalid permutation" and just continues
with the next permutation. It used to simply die, aborting the test.
Maybe we could just modify the specs so that all permutations are there
(this can be done by simply removing the permutation lines), and the
"invalid permutation" messages are part of the expected file. Would
that be better?

> Comments required for these points
>
> * Why do we need multixact to be persistent? Do we need every page of
> multixact to be persistent, or just particular pages in certain
> circumstances?

Any page that contains at least one multi with an update as a member
must persist. It's possible that some pages contain no update (and this
is even likely in some workloads, if updates are rare), but I'm not sure
it's worth complicating the code to cater for early removal of some
pages.

> * Why do we need to expand multixact with flags? Can we avoid that in
> some cases?

Did you read my blog post?
http://www.commandprompt.com/blogs/alvaro_herrera/2011/08/fixing_foreign_key_deadlocks_part_three/
This explains the reason -- the point is that we need to distinguish the
lock strength acquired by each locker.

> * Why do we need to store just single xids in multixact members?
> Didn't understand comments, no explanation

This is just for SELECT FOR SHARE. We don't have a hint bit to indicate
"this tuple has a for-share lock", so we need to create a multi for it.
Since FOR SHARE is probably going to be very uncommon, this isn't likely
to be a problem. We're mainly catering for users of SELECT FOR SHARE so
that it continues to work, i.e. maintain backwards compatibility.

(Maybe I misunderstood your question -- what I think you're asking is,
"why are there some multixacts that have a single member?")

I'll try to come up with a good place to add some paragraphs about all
this. Please let me know if answers here are unclear and/or you have
further questions.

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Noah Misch <noah(at)leadboat(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-03-05 20:35:15
Message-ID: CA+U5nMLqqf3pQaPsOPB26p8qZKUvsSm230ZNCNtdBM4U8UaYbg@mail.gmail.com

On Mon, Mar 5, 2012 at 7:53 PM, Alvaro Herrera
<alvherre(at)commandprompt(dot)com> wrote:

>> My other comments so far are
>>
>> * some permutations commented out - no comments as to why
>> Something of a fault with the isolation tester that it just shows
>> output, there's no way to record expected output in the spec
>
> The reason they are commented out is that they are "invalid", that is,
> it requires running a command on a session that's blocked in the
> previous command.  Obviously, that cannot happen in real life.
>
> isolationtester now has support for detecting such conditions; if the
> spec specifies running a command in a locked session, the permutation is
> killed with an error message "invalid permutation" and just continues
> with the next permutation.  It used to simply die, aborting the test.
> Maybe we could just modify the specs so that all permutations are there
> (this can be done by simply removing the permutation lines), and the
> "invalid permutation" messages are part of the expected file.  Would
> that be better?

It would be better to have an isolation tester mode that checks whether
a permutation was invalid and, if not, reports that.

At the moment we can't say why you commented something out. There's no
comment or explanation, and we need something, otherwise 3 years from
now we'll be completely in the dark.

>> Comments required for these points
>>
>> * Why do we need multixact to be persistent? Do we need every page of
>> multixact to be persistent, or just particular pages in certain
>> circumstances?
>
> Any page that contains at least one multi with an update as a member
> must persist.  It's possible that some pages contain no update (and this
> is even likely in some workloads, if updates are rare), but I'm not sure
> it's worth complicating the code to cater for early removal of some
> pages.

If the multixact contains an xid and that is being persisted, then you
need to set an LSN to ensure that a page write causes an XLogFlush()
before the multixact write. And you need to set do_fsync, no? Or
explain why not in comments...

I was really thinking we could skip the fsync of a page if we've not
persisted anything important on that page, since that was one of
Robert's performance points.

>> * Why do we need to expand multixact with flags? Can we avoid that in
>> some cases?
>
> Did you read my blog post?
> http://www.commandprompt.com/blogs/alvaro_herrera/2011/08/fixing_foreign_key_deadlocks_part_three/
> This explains the reason -- the point is that we need to distinguish the
> lock strength acquired by each locker.

Thanks, I will, but it all belongs in a README please.

>> * Why do we need to store just single xids in multixact members?
>> Didn't understand comments, no explanation
>
> This is just for SELECT FOR SHARE.  We don't have a hint bit to indicate
> "this tuple has a for-share lock", so we need to create a multi for it.
> Since FOR SHARE is probably going to be very uncommon, this isn't likely
> to be a problem.  We're mainly catering for users of SELECT FOR SHARE so
> that it continues to work, i.e. maintain backwards compatibility.

Good, thanks.

Are we actively recommending people use FOR KEY SHARE rather than FOR
SHARE, in explicit use?

> (Maybe I misunderstood your question -- what I think you're asking is,
> "why are there some multixacts that have a single member?")
>
> I'll try to come up with a good place to add some paragraphs about all
> this.  Please let me know if answers here are unclear and/or you have
> further questions.

Thanks

I think we need to define some test workloads to measure the
performance impact of this patch. We need to be certain that it has a
good impact in target cases, plus a known impact in other cases.

Suggest

* basic pgbench - no RI

* inserts into large table, RI checks to small table, no activity on small table

* large table parent, large table child
20 child rows per parent, fk from child to parent
updates of multiple children at same time
low/medium/heavy locking

* large table parent, large table child
20 child rows per parent, fk from child to parent
updates of parent and child at same time
low/medium/heavy locking

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Noah Misch <noah(at)leadboat(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-03-06 19:39:32
Message-ID: 1331060319-sup-2769@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


Excerpts from Simon Riggs's message of Mon Mar 05 16:34:10 -0300 2012:

> It does however, illustrate my next review comment which is that the
> comments and README items are sorely lacking here. It's quite hard to
> see how it works, let along comment on major design decisions. It
> would help myself and others immensely if we could improve that.

Here's a first attempt at a README illustrating this. I intend this to
be placed in src/backend/access/heap/README.tuplock; the first three
paragraphs are stolen from the comment in heap_lock_tuple, so I'd remove
those from there, directing people to this new file instead. Is there
something that you think should be covered more extensively (or at all)
here?

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Locking tuples
--------------

Because the shared-memory lock table is of finite size, but users could
reasonably want to lock large numbers of tuples, we do not rely on the
standard lock manager to store tuple-level locks over the long term. Instead,
a tuple is marked as locked by setting the current transaction's XID as its
XMAX, and setting additional infomask bits to distinguish this usage from the
more normal case of having deleted the tuple. When multiple transactions
concurrently lock a tuple, a MultiXact is used; see below.

When it is necessary to wait for a tuple-level lock to be released, the basic
delay is provided by XactLockTableWait or MultiXactIdWait on the contents of
the tuple's XMAX. However, that mechanism will release all waiters
concurrently, so there would be a race condition as to which waiter gets the
tuple, potentially leading to indefinite starvation of some waiters. The
possibility of share-locking makes the problem much worse --- a steady stream
of share-lockers can easily block an exclusive locker forever. To provide
more reliable semantics about who gets a tuple-level lock first, we use the
standard lock manager. The protocol for waiting for a tuple-level lock is
really

LockTuple()
XactLockTableWait()
mark tuple as locked by me
UnlockTuple()

When there are multiple waiters, arbitration of who is to get the lock next
is provided by LockTuple(). However, at most one tuple-level lock will
be held or awaited per backend at any time, so we don't risk overflow
of the lock table. Note that incoming share-lockers are required to
do LockTuple as well, if there is any conflict, to ensure that they don't
starve out waiting exclusive-lockers. However, if there is not any active
conflict for a tuple, we don't incur any extra overhead.

We provide four levels of tuple locking strength: SELECT FOR KEY UPDATE is
super-exclusive locking (used to delete tuples and more generally to update
tuples modifying the values of the columns that make up the key of the tuple);
SELECT FOR UPDATE is a standards-compliant exclusive lock; SELECT FOR SHARE
implements shared locks; and finally SELECT FOR KEY SHARE is a super-weak mode
that does not conflict with exclusive mode, but conflicts with SELECT FOR KEY
UPDATE. This last mode is just strong enough to implement RI checks, i.e. it
ensures that tuples do not go away from under a check, without blocking other
transactions that want to update the tuple without changing its key.

The conflict table is:

              KEY UPDATE  UPDATE      SHARE       KEY SHARE
KEY UPDATE    conflict    conflict    conflict    conflict
UPDATE        conflict    conflict    conflict
SHARE         conflict    conflict
KEY SHARE     conflict

When there is a single locker in a tuple, we can just store the locking info
in the tuple itself. We do this by storing the locker's Xid in XMAX, and
setting hint bits specifying the locking strength. There is one exception
here: since hint bit space is limited, we do not provide a separate hint bit
for SELECT FOR SHARE, so we have to use the extended info in a MultiXact in
that case. (The other cases, SELECT FOR UPDATE and SELECT FOR KEY SHARE, are
presumably more commonly used due to being the standards-mandated locking
mechanism, or heavily used by the RI code, so we want to provide fast paths
for those.)

MultiXacts
----------

A tuple header provides very limited space for storing information about tuple
locking and updates: there is room only for a single Xid and a small number of
hint bits. Whenever we need to store more than one lock, we replace the first
locker's Xid with a new MultiXactId. Each MultiXact provides extended locking
data; it comprises an array of Xids plus some flag bits for each one. The
flags are currently used to store the locking strength of each member
transaction. (The flags also distinguish a pure locker from an actual
updater.)

In earlier PostgreSQL releases, a MultiXact always meant that the tuple was
locked in shared mode by multiple transactions. This is no longer the case; a
MultiXact may contain an update or delete Xid. (Keep in mind that tuple locks
in a transaction do not conflict with other tuple locks in the same
transaction, so it's possible to have otherwise conflicting locks in a
MultiXact if they belong to the same transaction).

Note that each lock is attributed to the subtransaction that acquires it.
This means that a subtransaction that aborts is seen as though it releases the
locks it acquired; concurrent transactions can then proceed without having to
wait for the main transaction to finish. It also means that a subtransaction
can upgrade to a stronger lock level than an earlier transaction had, and if
the subxact aborts, the earlier, weaker lock is kept.

The possibility of having an update within a MultiXact means that they must
persist across crashes and restarts: a future reader of the tuple needs to
figure out whether the update committed or aborted. So we have a requirement
that pg_multixact needs to retain pages of its data until we're certain that
the MultiXacts in them are no longer of interest.

VACUUM is in charge of removing old MultiXacts at the time of tuple freezing.
This works in the same way that pg_clog segments are removed: we have a
pg_class column that stores the earliest multixact that could possibly be
stored in the table; the minimum of all such values is stored in a pg_database
column. VACUUM computes the minimum across all pg_database values, and
removes pg_multixact segments older than the minimum.

Hint Bits
---------

The following hint bits are applicable:

- HEAP_XMAX_INVALID
Any tuple with this hint bit set does not have a valid value stored in XMAX.

- HEAP_XMAX_IS_MULTI
This bit is set if the tuple's Xmax is a MultiXactId (as opposed to a
regular TransactionId).

- HEAP_XMAX_LOCK_ONLY
This bit is set when the XMAX is a locker only; that is, if it's a
multixact, it does not contain an update among its members. It is also
set when the XMAX is a plain Xid that merely locked the tuple.

- HEAP_XMAX_KEYSHR_LOCK
- HEAP_XMAX_EXCL_LOCK
These bits indicate the strength of the lock acquired; they are useful when
the XMAX is not a MultiXactId. If it's a multi, the info is to be found in
the member flags. If HEAP_XMAX_IS_MULTI is not set and HEAP_XMAX_LOCK_ONLY
is set, then one of these *must* be set as well.
Note there is no hint bit for a SELECT FOR SHARE lock. Also there is no
separate bit for a SELECT FOR KEY UPDATE lock; this is implemented by the
HEAP_UPDATE_KEY_REVOKED bit.

- HEAP_UPDATE_KEY_REVOKED
This bit lives in t_infomask2. If set, indicates that a transaction updated
this tuple and changed the key values, or a transaction deleted the tuple.
It's set regardless of whether the XMAX is a TransactionId or a MultiXactId.

We currently never set the HEAP_XMAX_COMMITTED bit when the
HEAP_XMAX_IS_MULTI bit is set.


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Noah Misch <noah(at)leadboat(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-03-06 19:58:22
Message-ID: CA+U5nMJrp1pgeTpDTQAf2kud_XUzYa7y92G4i0o9jdQa5eWPZA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Mar 5, 2012 at 8:35 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:

>>> * Why do we need multixact to be persistent? Do we need every page of
>>> multixact to be persistent, or just particular pages in certain
>>> circumstances?
>>
>> Any page that contains at least one multi with an update as a member
>> must persist.  It's possible that some pages contain no update (and this
>> is even likely in some workloads, if updates are rare), but I'm not sure
>> it's worth complicating the code to cater for early removal of some
>> pages.
>
> If the multixact contains an xid and that is being persisted then you
> need to set an LSN to ensure that a page writes causes an XLogFlush()
> before the multixact write. And you need to set do_fsync, no? Or
> explain why not in comments...
>
> I was really thinking we could skip the fsync of a page if we've not
> persisted anything important on that page, since that was one of
> Robert's performance points.

We need to increase these values to 32 as well

#define NUM_MXACTOFFSET_BUFFERS 8
#define NUM_MXACTMEMBER_BUFFERS 16

using same logic as for clog.

We're using 25% more space and we already know clog benefits from
increasing them, so there's little doubt we need it here also, since
we are increasing the access rate and potentially the longevity.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Noah Misch <noah(at)leadboat(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-03-06 20:28:12
Message-ID: CA+U5nMKdz0eCBtx99uw2zEJP4TGb+Vn4ET4eT0rAQ4fS5sqrwQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Mar 6, 2012 at 7:39 PM, Alvaro Herrera
<alvherre(at)commandprompt(dot)com> wrote:

> We provide four levels of tuple locking strength: SELECT FOR KEY UPDATE is
> super-exclusive locking (used to delete tuples and more generally to update
> tuples modifying the values of the columns that make up the key of the tuple);
> SELECT FOR UPDATE is a standards-compliant exclusive lock; SELECT FOR SHARE
> implements shared locks; and finally SELECT FOR KEY SHARE is a super-weak mode
> that does not conflict with exclusive mode, but conflicts with SELECT FOR KEY
> UPDATE.  This last mode implements a mode just strong enough to implement RI
> checks, i.e. it ensures that tuples do not go away from under a check, without
> blocking when some other transaction that want to update the tuple without
> changing its key.

So there are 4 lock types, but we only have room for 3 on the tuple
header, so we store the least common/deprecated of the 4 types as a
multixactid. Some rewording would help there.

Neat scheme!

My understanding is that all of these workloads will change:

* Users of explicit SHARE locks will be slightly worse off in the case
of the 1st locker, but after that they'll be the same as before.

* Updates against an RI-locked table will be dramatically faster
because of reduced lock waits

...and that these previous workloads are effectively unchanged:

* Stream of RI checks causes mxacts

* Multi row deadlocks still possible

* Queues of writers still wait in the same way

* Deletes don't cause mxacts unless by same transaction

> In earlier PostgreSQL releases, a MultiXact always meant that the tuple was
> locked in shared mode by multiple transactions.  This is no longer the case; a
> MultiXact may contain an update or delete Xid.  (Keep in mind that tuple locks
> in a transaction do not conflict with other tuple locks in the same
> transaction, so it's possible to have otherwise conflicting locks in a
> MultiXact if they belong to the same transaction).

Somewhat confusing, but am getting there.

> Note that each lock is attributed to the subtransaction that acquires it.
> This means that a subtransaction that aborts is seen as though it releases the
> locks it acquired; concurrent transactions can then proceed without having to
> wait for the main transaction to finish.  It also means that a subtransaction
> can upgrade to a stronger lock level than an earlier transaction had, and if
> the subxact aborts, the earlier, weaker lock is kept.

OK

> The possibility of having an update within a MultiXact means that they must
> persist across crashes and restarts: a future reader of the tuple needs to
> figure out whether the update committed or aborted.  So we have a requirement
> that pg_multixact needs to retain pages of its data until we're certain that
> the MultiXacts in them are no longer of interest.

I think the "no longer of interest" aspect needs to be tracked more
closely because it will necessarily lead to more I/O.

If we store the LSN on each mxact page, as I think we need to, we can
get rid of pages more quickly when we know they don't have an LSN set.
So it's possible we can optimise that more.

> VACUUM is in charge of removing old MultiXacts at the time of tuple freezing.

You mean mxact segments?

Surely we set hint bits on tuples same as now? Hope so.

> This works in the same way that pg_clog segments are removed: we have a
> pg_class column that stores the earliest multixact that could possibly be
> stored in the table; the minimum of all such values is stored in a pg_database
> column.  VACUUM computes the minimum across all pg_database values, and
> removes pg_multixact segments older than the minimum.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Noah Misch <noah(at)leadboat(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-03-06 21:10:16
Message-ID: CA+Tgmobuef=hJrv05VgHLFS8sFmhjcuHpjMspnxfPpoxMY06Ew@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Preliminary comment:

This README is very helpful.

On Tue, Mar 6, 2012 at 2:39 PM, Alvaro Herrera
<alvherre(at)commandprompt(dot)com> wrote:
> We provide four levels of tuple locking strength: SELECT FOR KEY UPDATE is
> super-exclusive locking (used to delete tuples and more generally to update
> tuples modifying the values of the columns that make up the key of the tuple);
> SELECT FOR UPDATE is a standards-compliant exclusive lock; SELECT FOR SHARE
> implements shared locks; and finally SELECT FOR KEY SHARE is a super-weak mode
> that does not conflict with exclusive mode, but conflicts with SELECT FOR KEY
> UPDATE.  This last mode implements a mode just strong enough to implement RI
> checks, i.e. it ensures that tuples do not go away from under a check, without
> blocking when some other transaction that want to update the tuple without
> changing its key.

I feel like there is a naming problem here. The semantics that have
always been associated with SELECT FOR UPDATE are now attached to
SELECT FOR KEY UPDATE; and SELECT FOR UPDATE itself has been weakened.
I think users will be surprised to find that SELECT FOR UPDATE
doesn't block all concurrent updates.

It seems to me that SELECT FOR KEY UPDATE should be called SELECT FOR
UPDATE, and what you're calling SELECT FOR UPDATE should be called
something else - essentially NONKEY UPDATE, though I don't much like
that name.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Noah Misch <noah(at)leadboat(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-03-06 21:27:51
Message-ID: 1331068942-sup-7658@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


Excerpts from Robert Haas's message of Tue Mar 06 18:10:16 -0300 2012:
>
> Preliminary comment:
>
> This README is very helpful.

Thanks. I feel silly that I didn't write it earlier.

> On Tue, Mar 6, 2012 at 2:39 PM, Alvaro Herrera
> <alvherre(at)commandprompt(dot)com> wrote:
> > We provide four levels of tuple locking strength: SELECT FOR KEY UPDATE is
> > super-exclusive locking (used to delete tuples and more generally to update
> > tuples modifying the values of the columns that make up the key of the tuple);
> > SELECT FOR UPDATE is a standards-compliant exclusive lock; SELECT FOR SHARE
> > implements shared locks; and finally SELECT FOR KEY SHARE is a super-weak mode
> > that does not conflict with exclusive mode, but conflicts with SELECT FOR KEY
> > UPDATE.  This last mode implements a mode just strong enough to implement RI
> > checks, i.e. it ensures that tuples do not go away from under a check, without
> > blocking when some other transaction that want to update the tuple without
> > changing its key.
>
> I feel like there is a naming problem here. The semantics that have
> always been associated with SELECT FOR UPDATE are now attached to
> SELECT FOR KEY UPDATE; and SELECT FOR UPDATE itself has been weakened.
> I think users will be surprised to find that SELECT FOR UPDATE
> doesn't block all concurrent updates.

I'm not sure why you say that. Certainly SELECT FOR UPDATE continues to
block all updates. It continues to block SELECT FOR SHARE as well.
The things that it doesn't block are the new SELECT FOR KEY SHARE locks;
since those didn't exist before, it doesn't seem correct to consider
that SELECT FOR UPDATE changed in any way.

The main difference in the UPDATE behavior is that an UPDATE is regarded
as though it might acquire two different lock modes -- it either
acquires SELECT FOR KEY UPDATE if the key is modified, or SELECT FOR
UPDATE if not. Since SELECT FOR KEY UPDATE didn't exist before, we can
consider that previous to this patch, what UPDATE did was always acquire
a lock of strength SELECT FOR UPDATE. So UPDATE also hasn't been
weakened.

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Noah Misch <noah(at)leadboat(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-03-06 21:33:13
Message-ID: CA+U5nMJHMbsRFKyg2Yhw3_FYbi-EC0Ug9+qgiR4xuShMwCEW1w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Mar 6, 2012 at 9:10 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> Preliminary comment:
>
> This README is very helpful.
>
> On Tue, Mar 6, 2012 at 2:39 PM, Alvaro Herrera
> <alvherre(at)commandprompt(dot)com> wrote:
>> We provide four levels of tuple locking strength: SELECT FOR KEY UPDATE is
>> super-exclusive locking (used to delete tuples and more generally to update
>> tuples modifying the values of the columns that make up the key of the tuple);
>> SELECT FOR UPDATE is a standards-compliant exclusive lock; SELECT FOR SHARE
>> implements shared locks; and finally SELECT FOR KEY SHARE is a super-weak mode
>> that does not conflict with exclusive mode, but conflicts with SELECT FOR KEY
>> UPDATE.  This last mode implements a mode just strong enough to implement RI
>> checks, i.e. it ensures that tuples do not go away from under a check, without
>> blocking when some other transaction that want to update the tuple without
>> changing its key.
>
> I feel like there is a naming problem here.  The semantics that have
> always been associated with SELECT FOR UPDATE are now attached to
> SELECT FOR KEY UPDATE; and SELECT FOR UPDATE itself has been weakened.
>  I think users will be surprised to find that SELECT FOR UPDATE
> doesn't block all concurrent updates.
>
> It seems to me that SELECT FOR KEY UPDATE should be called SELECT FOR
> UPDATE, and what you're calling SELECT FOR UPDATE should be called
> something else - essentially NONKEY UPDATE, though I don't much like
> that name.

No, because that would stop it from doing what it is designed to do.

The lock modes are correct, appropriate and IMHO have meaningful
names. No redesign required here.

Not sure about the naming of some of the flag bits however.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Noah Misch <noah(at)leadboat(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-03-06 21:40:51
Message-ID: CA+TgmoZeCee1JsFxf=ZH1EFdhbL+gSDwBory0DnL8T0=93xuuQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Mar 6, 2012 at 4:27 PM, Alvaro Herrera
<alvherre(at)commandprompt(dot)com> wrote:
> Excerpts from Robert Haas's message of mar mar 06 18:10:16 -0300 2012:
>>
>> Preliminary comment:
>>
>> This README is very helpful.
>
> Thanks.  I feel silly that I didn't write it earlier.
>
>> On Tue, Mar 6, 2012 at 2:39 PM, Alvaro Herrera
>> <alvherre(at)commandprompt(dot)com> wrote:
>> > We provide four levels of tuple locking strength: SELECT FOR KEY UPDATE is
>> > super-exclusive locking (used to delete tuples and more generally to update
>> > tuples modifying the values of the columns that make up the key of the tuple);
>> > SELECT FOR UPDATE is a standards-compliant exclusive lock; SELECT FOR SHARE
>> > implements shared locks; and finally SELECT FOR KEY SHARE is a super-weak mode
>> > that does not conflict with exclusive mode, but conflicts with SELECT FOR KEY
>> > UPDATE.  This last mode implements a mode just strong enough to implement RI
>> > checks, i.e. it ensures that tuples do not go away from under a check, without
>> > blocking when some other transaction that want to update the tuple without
>> > changing its key.
>>
>> I feel like there is a naming problem here.  The semantics that have
>> always been associated with SELECT FOR UPDATE are now attached to
>> SELECT FOR KEY UPDATE; and SELECT FOR UPDATE itself has been weakened.
>>  I think users will be surprised to find that SELECT FOR UPDATE
>> doesn't block all concurrent updates.
>
> I'm not sure why you say that.  Certainly SELECT FOR UPDATE continues to
> block all updates.  It continues to block SELECT FOR SHARE as well.
> The things that it doesn't block are the new SELECT FOR KEY SHARE locks;
> since those didn't exist before, it doesn't seem correct to consider
> that SELECT FOR UPDATE changed in any way.
>
> The main difference in the UPDATE behavior is that an UPDATE is regarded
> as though it might acquire two different lock modes -- it either
> acquires SELECT FOR KEY UPDATE if the key is modified, or SELECT FOR
> UPDATE if not.  Since SELECT FOR KEY UPDATE didn't exist before, we can
> consider that previous to this patch, what UPDATE did was always acquire
> a lock of strength SELECT FOR UPDATE.  So UPDATE also hasn't been
> weakened.

Ah, I see. My mistake.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Noah Misch <noah(at)leadboat(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-03-07 09:24:11
Message-ID: CAHMh4-YY3AGrPRQ4jqzjB97KtWj7SFg_tztMh7k4yPJtWyDWKw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I feel sad that I followed this topic very late, but I still want to put
forward my views. Have we thought along the lines of how Robert has
implemented relation level locks? In short, it would go like this:

a) The locks for enforcing referential integrity should be taken only when
the rarest of the events (those that would cause an integrity failure)
occurs, namely the update of the referenced column. Other cases of update,
delete and insert should not be required to take locks. In this way, we can
reduce a lot of lock traffic.

So if we have a table like employee(empid, empname, ... depid references
dept(deptid)) and a table dept(depid, depname):

Currently we take shared locks on referenced rows in the dept table
whenever we update something in the employee table. This should not
happen. Instead, any insert / update of the referenced column / delete
should check for some lock in its PGPROC structure, which would only get
created when the depid gets updated / deleted (a rare event).

b) But the operation of updating the referenced column will be made more
costly. Maybe it can create something like a predicate lock (as used for
enforcing serializable transactions) and keep it in all the PGPROC
structures.

I know this is an abstract idea, but I just wanted to know whether we have
thought on those lines.

Thanks,
Gokul.


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Noah Misch <noah(at)leadboat(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-03-07 10:02:43
Message-ID: CA+U5nMLpLUG2Y3-gRFXbZCA+RbHvfKvsO0_16khb6C73axFXTg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Mar 7, 2012 at 9:24 AM, Gokulakannan Somasundaram
<gokul007(at)gmail(dot)com> wrote:
> I feel sad, that i followed this topic very late. But i still want to put
> forward my views.
> Have we thought on the lines of how Robert has implemented relation level
> locks. In short it should go like this
>
> a) The locks for enforcing Referential integrity should be taken only when
> the rarest of the events( that would cause the integrity failure) occur.
> That would be the update of the referenced column. Other cases of update,
> delete and insert should not be required to take locks. In this way, we can
> reduce a lot of lock traffic.

Insert, Update and Delete don't take locks; they simply mark the tuples
they change with an xid. Anybody else wanting to "wait on the lock"
just waits on the xid. We do insert a lock row for each xid, but not
one per row changed.

> So if we have a table like employee( empid, empname, ... depid references
> dept(deptid)) and table dept(depid depname).
>
> Currently we are taking shared locks on referenced rows in dept table,
> whenever we are updating something in the employee table. This should not
> happen. Instead any insert / update of referenced column / delete should
> check for some lock in its PGPROC structure, which will only get created
> when the depid gets updated / deleted( rare event )

It's worked that way for 5 years, so it's too late to modify it now, and
this patch won't change that.

The way we do RI locking is designed to prevent holding that in memory
and then having the lock table overflow, which would then either require
us to revert to the current design or escalate to table-level locks to
save space in the lock table - which is a total disaster, if you've
ever worked with DB2.

What you're suggesting is that we store the locks in memory only as a
way of avoiding updating the row.

My understanding is we have two optimisation choices. A single set of
xids can be used in many places, since the same set of transactions
may do roughly the same thing.

1. We could assign a new mxactid every time we touch a new row. That
way there is no correspondence between sets of xids, and we may store
the same set many times. OTOH since each set is unique we can expand
it easily and we don't need to touch each row once for each lock. That
saves on row touches but it also greatly increases the mxactid
creation rate, which causes cache scrolling.

2. We assign a new mxactid each time we create a new unique set of
xids. We keep a separate cache for local sets. This way reduces the
mxactid creation rate but causes a row update each time we lock the
row, which then needs WAL.

(2) is how we currently handle the very difficult decision of how to
optimise this for the general case. I'm not sure that is right in all
cases, but it is at least scalable and it is the devil we know.
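The reuse in (2) can be sketched as a toy model: each unique set of member xids maps to one mxactid, so a repeating set costs no new allocation. All names here are illustrative, not actual PostgreSQL internals.

```python
# Toy model of option (2): allocate a new "mxactid" only for each unique
# set of locker xids, reusing an existing one when the same set reappears.
class MultiXactCache:
    def __init__(self):
        self._next_mxid = 1
        self._by_members = {}   # frozenset of member xids -> mxid
        self.created = 0        # how many mxids we actually allocated

    def get_mxid(self, member_xids):
        key = frozenset(member_xids)
        mxid = self._by_members.get(key)
        if mxid is None:
            mxid = self._next_mxid
            self._next_mxid += 1
            self._by_members[key] = mxid
            self.created += 1
        return mxid

cache = MultiXactCache()
a = cache.get_mxid({100, 101})   # first lock by xids 100 and 101
b = cache.get_mxid({100, 101})   # same set on another row: reused, no allocation
c = cache.get_mxid({100, 102})   # new combination: new mxid
```

The trade-off in the text is visible here: reuse keeps `created` low, but every row locked with a reused mxid still needs its xmax rewritten (and WAL), which the sketch does not show.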

> b) But the operation of update of the referenced column will be made more
> costly. May be it can create something like a predicate lock(used for
> enforcing serializable) and keep it in all the PG_PROC structures.

No, updates of referenced columns are exactly the same as now when no
RI checks are happening.

If the update occurs when an RI check takes place there is more work
to do, but previously it would have just blocked and done nothing. So
that path is relatively heavyweight but much better than nothing.

> I know this is a abstract idea, but just wanted to know, whether we have
> thought on those lines.

Thanks for your thoughts.

The most useful way to help with this patch right now is to run
performance investigations and see if there are non-optimal cases. We
can then see how the patch handles those. Theory is good, but it needs
to drive experimentation, as I myself re-discover continually.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Noah Misch <noah(at)leadboat(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-03-07 10:18:26
Message-ID: CAHMh4-YkKp8RH1ucwL3QGSeqTgo-UAy0yooTH=CxbpGhgrTFDw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

>
>
> Insert, Update and Delete don't take locks they simply mark the tuples
> they change with an xid. Anybody else wanting to "wait on the lock"
> just waits on the xid. We do insert a lock row for each xid, but not
> one per row changed.
>
I mean the foreign key checks here. They take a SELECT FOR SHARE lock,
right? That's what we are trying to optimize here. Or am I missing
something? So by following the suggested methodology, the foreign key
checks won't take any locks.

> It's worked that way for 5 years, so its too late to modify it now and
> this patch won't change that.
>
> The way we do RI locking is designed to prevent holding that in memory
> and then having the lock table overflow, which then either requires us
> to revert to the current design or upgrade to table level locks to
> save space in the lock table - which is a total disaster, if you've
> ever worked with DB2.
>
> What you're suggesting is that we store the locks in memory only as a
> way of avoiding updating the row.
>
But that memory would be consumed only when someone updates the
referenced column (which will usually be the primary key of the referenced
table). Any normal database programmer knows that updating a primary key is
not good for performance. So we go by the same logic.

> No, updates of referenced columns are exactly the same as now when no
> RI checks happening.
>
> If the update occurs when an RI check takes place there is more work
> to do, but previously it would have just blocked and done nothing. So
> that path is relatively heavyweight but much better than nothing.
>
As I have already said, that path is definitely heavyweight (like how
Robert has made the DDL path heavyweight). If we assume that DDLs are
going to be a rare phenomenon, then we can also assume that updates of
primary keys are a rare phenomenon in a normal database.

>
> The most useful way to help with this patch right now is to run
> performance investigations and see if there are non-optimal cases. We
> can then see how the patch handles those. Theory is good, but it needs
> to drive experimentation, as I myself re-discover continually.
>
I understand. I just wanted to know whether the developers considered that
line of thought.

Thanks,
Gokul.


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Noah Misch <noah(at)leadboat(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-03-07 11:11:06
Message-ID: CA+U5nMLAr1VoDsmkJFYvFk_GvW-bqVgQnTLzmAyzLFQhZM1W1Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Mar 7, 2012 at 10:18 AM, Gokulakannan Somasundaram
<gokul007(at)gmail(dot)com> wrote:
>>
>> Insert, Update and Delete don't take locks they simply mark the tuples
>> they change with an xid. Anybody else wanting to "wait on the lock"
>> just waits on the xid. We do insert a lock row for each xid, but not
>> one per row changed.
>
> I mean the foreign key checks here. They take a Select for Share Lock right.
> That's what we are trying to optimize here. Or am i missing something? So by
> following the suggested methodology, the foreign key checks won't take any
> locks.

Please explain in detail your idea of how it will work.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Noah Misch <noah(at)leadboat(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-03-07 11:37:01
Message-ID: CAHMh4-aL-LbVBVDyn+Mg_2R-bVJ6iiQ5HOyCiCz4psruUeKvNg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

>
>
> Please explain in detail your idea of how it will work.
>
>
OK. I will try to explain the abstract idea I have.
a) Referential integrity gets violated when there are referencing key
values not present in the referenced key values. We maintain the
integrity by taking a SELECT FOR SHARE lock during the foreign key checks,
so that the referred value is not updated/deleted during the operation.

b) We can do the same in the reverse way. When there is an update/delete of
the referred value, we don't want any new inserts of the referred value
into the referring table, nor any update that would change a value to the
referred value being updated/deleted. So we will take some kind of lock
that stops that from happening. This can be achieved through
i) the predicate locking infrastructure already present (or)
ii) a temporary B-Tree index (no WAL protection) that gets created only
for referred-value updates and holds the values that are being
updated/deleted (if we are scared of predicate locking).

So whenever we do foreign key checks, we just need to make sure there is no
such referential-integrity lock in our own PGPROC structure (if implemented
with predicate locking), or check the temporary B-Tree index for any entry
matching the one we are going to insert/update to (an empty tree can be
tracked with a flag as an optimization).

Maybe someone can come up with better ideas than this.

Gokul.


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Noah Misch <noah(at)leadboat(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-03-07 12:16:22
Message-ID: CA+U5nMLeYXTjfJggtGvGULZEinPTHVZjtKQEgd8e5txoA3pc3A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Mar 7, 2012 at 11:37 AM, Gokulakannan Somasundaram
<gokul007(at)gmail(dot)com> wrote:
>>
>> Please explain in detail your idea of how it will work.

> So we will take some kind of lock, which will stop such a happening.
...
> May be someone can come up with better ideas than this.

With respect, I don't call this a detailed explanation of an idea. For
consideration here, come up with a very detailed design of how your
suggestion will work. Think about it carefully, spend hours and days
thinking it through and when you are personally sure it is better than
what we have now, please raise it on list at an appropriate time. Bear
in mind that most people throw away 90% of their ideas before even
mentioning them here. I hope that helps you to contribute.

At the moment we're trying to review patches for specific code to
include or exclude, not discuss huge redesign of internal mechanisms
using broad brush descriptions. It is possible you may find an
improvement and if you do, people will be interested but that seems an
unlikely thing to happen here and now.

If you have specific comments or tests on this patch those are very welcome.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-03-12 17:28:11
Message-ID: CA+TgmoakeJzSyhJQSHQrEoCciwFd4iw5-uCGpwZUe3jncN=Y=Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Feb 26, 2012 at 9:47 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> Regarding performance, the good thing about this patch is that if you
>> have an operation that used to block, it might now not block.  So maybe
>> multixact-related operation is a bit slower than before, but if it
>> allows you to continue operating rather than sit waiting until some
>> other transaction releases you, it's much better.
>
> That's probably true, although there is some deferred cost that is
> hard to account for.  You might not block immediately, but then later
> somebody might block either because the mxact SLRU now needs fsyncs or
> because they've got to decode an mxid long after the relevant segment
> has been evicted from the SLRU buffers.  In general, it's hard to
> bound that latter cost, because you only avoid blocking once (when the
> initial update happens) but you might pay the extra cost of decoding
> the mxid as many times as the row is read, which could be arbitrarily
> many.  How much of a problem that is in practice, I'm not completely
> sure, but it has worried me before and it still does.  In the worst
> case scenario, a handful of frequently-accessed rows with MXIDs all of
> whose members are dead except for the UPDATE they contain could result
> in continual SLRU cache-thrashing.
>
> From a performance standpoint, we really need to think not only about
> the cases where the patch wins, but also, and maybe more importantly,
> the cases where it loses.  There are some cases where the current
> mechanism, use SHARE locks for foreign keys, is adequate.  In
> particular, it's adequate whenever the parent table is not updated at
> all, or only very lightly.  I believe that those people will pay
> somewhat more with this patch, and especially in any case where
> backends end up waiting for fsyncs in order to create new mxids, but
> also just because I think this patch will have the effect of
> increasing the space consumed by each individual mxid, which imposes a
> distributed cost of its own.

I spent some time thinking about this over the weekend, and I have an
observation, and an idea. Here's the observation: I believe that
locking a tuple whose xmin is uncommitted is always a noop, because if
it's ever possible for a transaction to wait for an XID that is part
of its own transaction (exact same XID, or sub-XIDs of the same top
XID), then a transaction could deadlock against itself. I believe
that this is not possible: if a transaction were to wait for an XID
assigned to that same backend, then the lock manager would observe
that an ExclusiveLock on the xid is already held, so the request for a
ShareLock would be granted immediately. I also don't believe there's
any situation in which the existence of an uncommitted tuple fails to
block another backend, but a lock on that same uncommitted tuple would
have caused another backend to block. If any of that sounds wrong,
you can stop reading here (but please tell me why it sounds wrong).

If it's right, then here's the idea: what if we stored mxids using
xmin rather than xmax? This would mean that, instead of making mxids
contain the tuple's original xmax, they'd need to instead contain the
tuple's original xmin. This might seem like rearranging the deck
chairs on the titanic, but I think it actually works out substantially
better, because if we can assume that the xmin is committed, then we
only need to know its exact value until it becomes older than
RecentGlobalXmin. This means that a tuple can be both updated and
locked at the same time without the MultiXact SLRU needing to be
crash-safe, because if we crash and restart, any mxids that are still
around from before the crash are known to contain only xmins that are
now all-visible. We therefore don't need their exact values, so it
doesn't matter if that data actually made it to disk. Furthermore, in
the case where a previously-locked tuple is read repeatedly, we only
need to keep doing SLRU lookups until the xmin becomes all-visible;
after that, we can set a hint bit indicating that the tuple's xmin is
all-visible, and any future readers (or writers) can use that to skip
the SLRU lookup. In the degenerate (and probably common) case where a
tuple is already all-visible at the time it's locked, we don't really
need to record the original xmin at all; we can still do so if
convenient, but we can set the xmin-all-visible hint right away, so
nobody needs to probe the SLRU just to get xmin.

In other words, we'd entirely avoid needing to make mxacts crash-safe,
and we'd avoid most of the extra SLRU lookups that the current
implementation requires; they'd only be needed when (and for as long
as) the locked tuple was not yet all-visible.
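That fast path can be sketched as a toy model (the names `slru_fetch_xmin`, `xmin_all_visible`, etc. are assumptions made for the sketch, not real PostgreSQL symbols): readers probe the slow SLRU only until the stored xmin falls behind RecentGlobalXmin, after which the hint bit short-circuits every later read.

```python
# Toy model of the proposed xmin-in-mxid fast path: probe the SLRU until
# the recorded xmin is older than RecentGlobalXmin, then set a hint bit
# so all future readers skip the lookup entirely.
slru_lookups = 0

def slru_fetch_xmin(mxid, slru):
    """Stand-in for the (potentially slow) MultiXact SLRU lookup."""
    global slru_lookups
    slru_lookups += 1
    return slru[mxid]

class HeapTuple:
    def __init__(self, mxid):
        self.mxid = mxid
        self.xmin_all_visible = False   # the proposed hint bit

def xmin_is_all_visible(tup, slru, recent_global_xmin):
    if tup.xmin_all_visible:
        return True                     # hint set: no SLRU probe needed
    xmin = slru_fetch_xmin(tup.mxid, slru)
    if xmin < recent_global_xmin:
        tup.xmin_all_visible = True     # cache the fact for later readers
        return True
    return False

slru = {7: 50}                          # mxid 7 holds the tuple's original xmin, 50
t = HeapTuple(mxid=7)
xmin_is_all_visible(t, slru, recent_global_xmin=40)  # not yet visible: probes SLRU
xmin_is_all_visible(t, slru, recent_global_xmin=60)  # probes once, sets the hint
xmin_is_all_visible(t, slru, recent_global_xmin=60)  # hint bit: no probe at all
```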

This also seems like it would make the anti-wraparound issues simpler
to handle - once an mxid is old enough that any xmin it contains must
be all-visible, we can simply overwrite the tuple's xmin with
FrozenXID, which is pretty much what we're already doing anyway. It's
already the case that a table has to have an anti-wraparound vacuum at
least once after any given XID falls behind RecentGlobalXmin and
before it can be reused, so we wouldn't need to do anti-wraparound
vacuums any more frequently than currently. There's still the problem
that we might exhaust the mxid space more quickly than the XID space,
but that exists in either implementation.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Noah Misch <noah(at)leadboat(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-03-12 17:50:11
Message-ID: CA+U5nML-=GmS5kZa2rQL_2saUdRREbJ5sxwcjpuqQsxFzL8Nyg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Mar 12, 2012 at 5:28 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> In other words, we'd entirely avoid needing to make mxacts crash-safe,
> and we'd avoid most of the extra SLRU lookups that the current
> implementation requires; they'd only be needed when (and for as long
> as) the locked tuple was not yet all-visible.

The current implementation only requires additional lookups in the
update/check case, which is the case that does nothing other than
block right now. Since we're replacing lock contention with
physical access contention, even the worst case situation is still
better than what we have now. Please feel free to point out worst case
situations and show that this isn't true.

I've also pointed out how to avoid the overhead of making mxacts crash
safe when the new facilities are not in use, so I don't see problems
with the proposed mechanism, though I am still reviewing the actual
code myself.

So those things are not something we need to avoid.

My feeling is that overwriting xmin is a clever idea, but it arrives too
late for sensible analysis at this stage of the CF. It's not
solving a problem; it's just an alternate mechanism, and at best an
optimisation of the mechanism. Were we to explore it now, it seems
certain that another person would observe that design was taking
place and argue that the patch should be rejected, which would be
unnecessary and wasteful. I also think it would alter our ability to
diagnose problems, not least the normal test that xmax matches xmin
across an update.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Noah Misch <noah(at)leadboat(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-03-12 18:14:33
Message-ID: CA+TgmoYYvLaT1pNYk34+mrQtqTJRkOL-cGwfAFWCC1Wd_NSKbQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Mar 12, 2012 at 1:50 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> On Mon, Mar 12, 2012 at 5:28 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> In other words, we'd entirely avoid needing to make mxacts crash-safe,
>> and we'd avoid most of the extra SLRU lookups that the current
>> implementation requires; they'd only be needed when (and for as long
>> as) the locked tuple was not yet all-visible.
>
> The current implementation only requires additional lookups in the
> update/check case, which is the case that doesn't do anything other
> than block right now. Since we're replacing lock contention with
> physical access contention even the worst case situation is still
> better than what we have now. Please feel free to point out worst case
> situations and show that isn't true.

I think I already have:

http://archives.postgresql.org/pgsql-hackers/2012-02/msg01258.php

The case I'm worried about is where we are allocating mxids quickly,
and we end up having to wait for fsyncs on mxact segments. That might
be very slow, but you could argue that it could *possibly* be still
worthwhile if it avoids blocking. That doesn't strike me as a
slam-dunk, though, because we've already seen and fixed cases where
too many fsyncs causes the performance of the entire system to go down
the tubes (cf. commit 7f242d880b5b5d9642675517466d31373961cf98). But
it's really bad if there are no updates on the parent table - then,
whatever extra overhead there is will be all for naught, since the
more fine-grained locking doesn't help anyway.

> I've also pointed out how to avoid overhead of making mxacts crash
> safe when the new facilities are not in use, so I don't see problems
> with the proposed mechanism. Given that I am still myself reviewing
> the actual code.

The closest thing I can find to a proposal from you in that regard is
this comment:

# I was really thinking we could skip the fsync of a page if we've not
# persisted anything important on that page, since that was one of
# Robert's performance points.

It might be possible to do something with that idea, but at the moment
I'm not seeing how to make it work.

> So those things are not something we need to avoid.
>
> My feeling is that overwriting xmin is a clever idea, but arrives too
> late to require sensible analysis in this stage of the CF. It's not
> solving a problem, its just an alternate mechanism and at best an
> optimisation of the mechanism. Were we to explore it now, it seems
> certain that another person would observe that design were taking
> place and so the patch should be rejected, which would be unnecessary
> and wasteful.

Considering that nobody's done any work to resolve the uncertainty
about whether the worst-case performance characteristics of this patch
are acceptable, and considering further that it was undergoing massive
code churn for more than a month after the final CommitFest, I think
it's not that unreasonable to think it might not be ready for prime
time at this point. In any event, your argument is exactly backwards:
we need to first decide whether the patch needs a redesign and then,
if it does, postpone it.  Deciding that we don't want to postpone it
first, and therefore that we're not going to redesign it even if that
is what's really needed, makes no sense.

> I also think it would alter our ability to diagnose
> problems, not least the normal test that xmax matches xmin across an
> update.

There's nothing stopping the new tuple from being frozen before the
old one, even today.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Noah Misch <noah(at)leadboat(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-03-12 19:28:21
Message-ID: CA+U5nM+CYYK8rsMGmZdbeiMcqZS1tYFWSZGKORw9ZtSZw-O8_Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Mar 12, 2012 at 6:14 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> Considering that nobody's done any work to resolve the uncertainty
> about whether the worst-case performance characteristics of this patch
> are acceptable, and considering further that it was undergoing massive
> code churn for more than a month after the final CommitFest, I think
> it's not that unreasonable to think it might not be ready for prime
> time at this point.

Thank you for cutting to the chase.

The "uncertainty" of which you speak is a theoretical point you
raised. It has been explained, but nobody has yet shown performance
numbers to illustrate the point, only because the conclusions seemed
so clear. I would point out that you haven't demonstrated the
existence of a problem either, so redesigning something without any
proof of a problem seems strange.

Let me explain again what this patch does and why it has such major
performance benefit.

This feature give us a step change in lock reductions from FKs. A real
world "best case" might be to examine the benefit this patch has on a
large batch load that inserts many new orders for existing customers.
In my example case the orders table has a FK to the customer table. At
the same time as the data load, we attempt to update a customer's
additional details, address or current balance etc. The large load
takes locks on the customer table and keeps them for the whole
transaction. So the customer updates are locked out for multiple
seconds, minutes or maybe hours, depending upon how far you want to
stretch the example. With this patch the customer updates don't cause
lock conflicts, but they require mxact lookups in *some* cases, so they
might take 1-10ms extra rather than 1-10 minutes more: over 1000x faster.
The only case that causes the additional lookups is the case that
otherwise would have been locked. So producing "best case" results is
trivial and can be as enormous as you like.
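The intended semantics behind that example can be written out as a toy conflict table. This only encodes the behaviour the thread describes (it is not code from the patch): a KEY SHARE lock held by an RI check does not conflict with a non-key update of the parent row, whereas the old SHARE lock conflicts with any update.

```python
# Toy conflict table for the relevant lock/operation pairs. True means the
# second operation must wait for the holder of the first.
CONFLICTS = {
    # New RI lock mode: only key updates and deletes conflict.
    ("KEY SHARE", "UPDATE non-key column"): False,   # the win this patch buys
    ("KEY SHARE", "UPDATE key column"):     True,
    ("KEY SHARE", "DELETE"):                True,
    # Old RI lock mode: any update or delete conflicts.
    ("SHARE",     "UPDATE non-key column"): True,
    ("SHARE",     "UPDATE key column"):     True,
    ("SHARE",     "DELETE"):                True,
}

def blocks(held_lock, wanted_op):
    """Does wanted_op block behind a row whose lock held_lock is held?"""
    return CONFLICTS[(held_lock, wanted_op)]

# In Simon's scenario: the batch load's RI checks hold KEY SHARE on
# customer rows, so an address/balance update no longer waits.
assert blocks("SHARE", "UPDATE non-key column")          # old behaviour: blocked
assert not blocks("KEY SHARE", "UPDATE non-key column")  # patched: proceeds
```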

I agree with you that some worst case performance tests should be
done. Could you please say what you think the worst cases would be, so
those can be tested? That would avoid wasting time or getting anything
backwards.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Noah Misch <noah(at)leadboat(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-03-13 01:24:40
Message-ID: 20120313012440.GA27122@tornado.leadboat.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Mar 12, 2012 at 01:28:11PM -0400, Robert Haas wrote:
> I spent some time thinking about this over the weekend, and I have an
> observation, and an idea. Here's the observation: I believe that
> locking a tuple whose xmin is uncommitted is always a noop, because if
> it's ever possible for a transaction to wait for an XID that is part
> of its own transaction (exact same XID, or sub-XIDs of the same top
> XID), then a transaction could deadlock against itself. I believe
> that this is not possible: if a transaction were to wait for an XID
> assigned to that same backend, then the lock manager would observe
> that an ExclusiveLock on the xid is already held, so the request for a
> ShareLock would be granted immediately. I also don't believe there's
> any situation in which the existence of an uncommitted tuple fails to
> block another backend, but a lock on that same uncommitted tuple would
> have caused another backend to block. If any of that sounds wrong,
> you can stop reading here (but please tell me why it sounds wrong).

When we lock an update-in-progress row, we walk the t_ctid chain and lock all
descendant tuples. They may all have uncommitted xmins. This is essential to
ensure that the final outcome of the updating transaction does not affect
whether the locking transaction has its KEY SHARE lock. Similarly, when we
update a previously-locked tuple, we copy any locks (always KEY SHARE locks)
to the new version. That new tuple is both uncommitted and has locks, and we
cannot easily sacrifice either property. Do you see a way to extend your
scheme to cover these needs?
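A minimal sketch of the chain-walking behaviour described above (illustrative structures, not the real heapam code): locking an update-in-progress row means following the t_ctid links and locking every descendant version, including versions whose xmin is still uncommitted.

```python
# Toy model of locking along a t_ctid update chain. Each version of a row
# points at its successor; locking must reach every descendant so the
# outcome of the updating transaction cannot strip the KEY SHARE lock.
class TupleVersion:
    def __init__(self, xmin, xmin_committed):
        self.xmin = xmin
        self.xmin_committed = xmin_committed
        self.ctid_next = None          # next version in the update chain
        self.key_share_locked = False

def lock_chain(tup):
    """Apply a KEY SHARE lock to tup and to all of its descendants."""
    while tup is not None:
        tup.key_share_locked = True
        tup = tup.ctid_next

v1 = TupleVersion(xmin=10, xmin_committed=True)
v2 = TupleVersion(xmin=20, xmin_committed=False)   # uncommitted update of v1
v1.ctid_next = v2
lock_chain(v1)   # v2 gets locked even though its xmin is uncommitted
```

This is exactly the case that defeats the "locking an uncommitted-xmin tuple is a no-op" assumption: v2 is both uncommitted and locked.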


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Noah Misch <noah(at)leadboat(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-03-13 17:00:52
Message-ID: 20120313170052.GB9030@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Mar 06, 2012 at 04:39:32PM -0300, Alvaro Herrera wrote:
> Here's a first attempt at a README illustrating this. I intend this to
> be placed in src/backend/access/heap/README.tuplock; the first three
> paragraphs are stolen from the comment in heap_lock_tuple, so I'd remove
> those from there, directing people to this new file instead. Is there
> something that you think should be covered more extensively (or at all)
> here?
...
>
> When there is a single locker in a tuple, we can just store the locking info
> in the tuple itself. We do this by storing the locker's Xid in XMAX, and
> setting hint bits specifying the locking strength. There is one exception
> here: since hint bit space is limited, we do not provide a separate hint bit
> for SELECT FOR SHARE, so we have to use the extended info in a MultiXact in
> that case. (The other cases, SELECT FOR UPDATE and SELECT FOR KEY SHARE, are
> presumably more commonly used due to being the standards-mandated locking
> mechanism, or heavily used by the RI code, so we want to provide fast paths
> for those.)

Are those tuple bits actually "hint" bits? They seem quite a bit more
powerful than a "hint".

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-03-13 17:09:57
Message-ID: CA+TgmoYA4HdOTm-w7g73=NYXC4m5SYJMYtxazO67aKWOHOxv3g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Mar 12, 2012 at 9:24 PM, Noah Misch <noah(at)leadboat(dot)com> wrote:
> When we lock an update-in-progress row, we walk the t_ctid chain and lock all
> descendant tuples.  They may all have uncommitted xmins.  This is essential to
> ensure that the final outcome of the updating transaction does not affect
> whether the locking transaction has its KEY SHARE lock.  Similarly, when we
> update a previously-locked tuple, we copy any locks (always KEY SHARE locks)
> to the new version.  That new tuple is both uncommitted and has locks, and we
> cannot easily sacrifice either property.  Do you see a way to extend your
> scheme to cover these needs?

No, I think that sinks it. Good analysis.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Noah Misch <noah(at)leadboat(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-03-13 17:35:02
Message-ID: 1331659946-sup-3775@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


Excerpts from Bruce Momjian's message of mar mar 13 14:00:52 -0300 2012:
>
> On Tue, Mar 06, 2012 at 04:39:32PM -0300, Alvaro Herrera wrote:

> > When there is a single locker in a tuple, we can just store the locking info
> > in the tuple itself. We do this by storing the locker's Xid in XMAX, and
> > setting hint bits specifying the locking strength. There is one exception
> > here: since hint bit space is limited, we do not provide a separate hint bit
> > for SELECT FOR SHARE, so we have to use the extended info in a MultiXact in
> > that case. (The other cases, SELECT FOR UPDATE and SELECT FOR KEY SHARE, are
> > presumably more commonly used due to being the standards-mandated locking
> > mechanism, or heavily used by the RI code, so we want to provide fast paths
> > for those.)
>
> Are those tuple bits actually "hint" bits? They seem quite a bit more
> powerful than a "hint".

I'm not sure what your point is. We've had a "hint" bit for SELECT FOR
UPDATE for ages. Even 8.2 had HEAP_XMAX_EXCL_LOCK and
HEAP_XMAX_SHARED_LOCK. Maybe they are misnamed and aren't really
"hints", but it's not the job of this patch to fix that problem.

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Noah Misch <noah(at)leadboat(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-03-13 17:46:24
Message-ID: CA+TgmobONEodK-M4CF5Ov0UONxb-5d5ru4QDnL77OqzBcEnLNw@mail.gmail.com
Lists: pgsql-hackers

On Mon, Mar 12, 2012 at 3:28 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> I agree with you that some worst case performance tests should be
> done. Could you please say what you think the worst cases would be, so
> those can be tested? That would avoid wasting time or getting anything
> backwards.

I've thought about this some and here's what I've come up with so far:

1. SELECT FOR SHARE on a large table on a system with no write cache.

2. A small parent table (say 30 rows or so) and a larger child table
with a many-to-one FK relationship to the parent (say 100 child rows
per parent row), with heavy update activity on the child table, on a
system where fsyncs are very slow. This should generate lots of mxid
consumption, and every 1600 or so mxids (I think) we've got to fsync;
does that generate a noticeable performance hit?

3. It would be nice to test the impact of increased mxid lookups in
the parent, but I've realized that the visibility map will probably
mask a good chunk of that effect, which is a good thing. Still, maybe
something like this: a fairly large parent table, say a million rows,
but narrow rows, so that many of them fit on a page, with frequent
reads and occasional updates (if there are only reads, autovacuum
might end with all the visibility map bits set); plus a child table
with one or a few rows per parent which is heavily updated. In theory
this ought to be good for the patch, since the more fine-grained
locking will avoid blocking, but in this case the parent table is
large enough that you shouldn't get much blocking anyway, yet you'll
still pay the cost of mxid lookups because the occasional updates on
the parent will clear VM bits. This might not be exactly the right
workload to measure this effect, but if it's not maybe someone can
devote a little time to thinking about what would be.

4. A plain old pgbench run or two, to see whether there's any
regression when none of this matters at all...
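For concreteness, the parent/child workload in test case (2) could be driven by a small script along these lines; this is only a sketch, and the table names, column names, and scale constants are illustrative assumptions, not taken from the patch:

```python
# Sketch of the statements a driver for test case (2) might issue.
# "parent"/"child" and the scale constants are illustrative.
PARENTS = 30
CHILDREN_PER_PARENT = 100

def setup_sql():
    yield "CREATE TABLE parent (id int PRIMARY KEY, info text);"
    yield ("CREATE TABLE child (id serial PRIMARY KEY, "
           "parent_id int NOT NULL REFERENCES parent(id), payload int);")
    yield ("INSERT INTO parent SELECT g, 'p' || g "
           f"FROM generate_series(1, {PARENTS}) g;")
    yield ("INSERT INTO child (parent_id, payload) "
           f"SELECT (g % {PARENTS}) + 1, 0 "
           f"FROM generate_series(1, {PARENTS * CHILDREN_PER_PARENT}) g;")

def update_sql(child_id):
    # Each non-key child update fires the RI trigger, which takes a
    # KEY SHARE lock on the referenced parent row; with several
    # concurrent sessions this is what drives multixact creation.
    return f"UPDATE child SET payload = payload + 1 WHERE id = {child_id};"

for stmt in setup_sql():
    print(stmt)
print(update_sql(1))
```

Running the update statement from many concurrent sessions (e.g. via a custom pgbench script) should generate the mxid traffic this test is after.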

This isn't exactly a test case, but from Noah's previous comments I
gather that there is a theoretical risk of mxid consumption running
ahead of xid consumption. We should try to think about whether there
are any realistic workloads where that might actually happen. I'm
willing to believe that there aren't, but not just because somebody
asserts it. The reason I'm concerned about this is because, if it
should happen, the result will be more frequent anti-wraparound
vacuums on every table in the cluster. Those are already quite
painful for some users.

It would be nice if Noah or someone else who has reviewed this patch
in detail could comment further. I am shooting from the hip here, a
bit.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Noah Misch <noah(at)leadboat(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-03-14 03:42:26
Message-ID: 20120314034226.GC27122@tornado.leadboat.com
Lists: pgsql-hackers

On Tue, Mar 13, 2012 at 01:46:24PM -0400, Robert Haas wrote:
> On Mon, Mar 12, 2012 at 3:28 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> > I agree with you that some worst case performance tests should be
> > done. Could you please say what you think the worst cases would be, so
> > those can be tested? That would avoid wasting time or getting anything
> > backwards.
>
> I've thought about this some and here's what I've come up with so far:
>
> 1. SELECT FOR SHARE on a large table on a system with no write cache.

Easy enough that we may as well check it. Share-locking an entire large table
is impractical in a real application, so I would not worry if this shows a
substantial regression.

> 2. A small parent table (say 30 rows or so) and a larger child table
> with a many-to-one FK relationship to the parent (say 100 child rows
> per parent row), with heavy update activity on the child table, on a
> system where fsyncs are very slow. This should generate lots of mxid
> consumption, and every 1600 or so mxids (I think) we've got to fsync;
> does that generate a noticeable performance hit?

More often than that; each 2-member mxid takes 4 bytes in an offsets file and
10 bytes in a members file. So, more like one fsync per ~580 mxids. Note
that we already fsync the multixact SLRUs today, so any increase will arise
from the widening of member entries from 4 bytes to 5. The realism of this
test is attractive. Nearly-static parent tables are plenty common, and this
test will illustrate the impact on those designs.
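The ~580 figure can be sanity-checked with a little arithmetic, assuming 8 KB SLRU pages and the per-mxid sizes quoted above; this is a sketch of the estimate, not an authoritative accounting:

```python
# Rough check of the "one fsync per ~580 mxids" estimate above.
# Assumptions: 8 KB SLRU pages; a 2-member multixact costs 4 bytes in
# the offsets SLRU plus 2 x 5 = 10 bytes in the members SLRU (member
# entries widen from 4 to 5 bytes under the patch).
SLRU_PAGE_BYTES = 8192
BYTES_PER_MXID = 4 + 2 * 5      # offsets entry + two member entries

mxids_per_page = SLRU_PAGE_BYTES // BYTES_PER_MXID
print(mxids_per_page)           # 585, i.e. one page of SLRU data per ~580 mxids
```

That is, counting fsyncs across both the offsets and members files, a page's worth of SLRU data is filled roughly every 585 two-member mxids.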

> 3. It would be nice to test the impact of increased mxid lookups in
> the parent, but I've realized that the visibility map will probably
> mask a good chunk of that effect, which is a good thing. Still, maybe
> something like this: a fairly large parent table, say a million rows,
> but narrow rows, so that many of them fit on a page, with frequent
> reads and occasional updates (if there are only reads, autovacuum
> might end with all the visibility map bits set); plus a child table
> with one or a few rows per parent which is heavily updated. In theory
> this ought to be good for the patch, since the more fine-grained
> locking will avoid blocking, but in this case the parent table is
> large enough that you shouldn't get much blocking anyway, yet you'll
> still pay the cost of mxid lookups because the occasional updates on
> the parent will clear VM bits. This might not be exactly the right
> workload to measure this effect, but if it's not maybe someone can
> devote a little time to thinking about what would be.

You still have HEAP_XMAX_{INVALID,COMMITTED} to reduce the pressure on mxid
lookups, so I think something more sophisticated is needed to exercise that
cost. Not sure what.

> 4. A plain old pgbench run or two, to see whether there's any
> regression when none of this matters at all...

Might as well.

> This isn't exactly a test case, but from Noah's previous comments I
> gather that there is a theoretical risk of mxid consumption running
> ahead of xid consumption. We should try to think about whether there
> are any realistic workloads where that might actually happen. I'm
> willing to believe that there aren't, but not just because somebody
> asserts it. The reason I'm concerned about this is because, if it
> should happen, the result will be more frequent anti-wraparound
> vacuums on every table in the cluster. Those are already quite
> painful for some users.

Yes. Pre-release, what can we really do here other than have more people
thinking about ways it might happen in practice? Post-release, we could
suggest monitoring methods or perhaps have VACUUM emit a WARNING when a table
is using more mxid space than xid space.

Also consider a benchmark that does plenty of non-key updates on a parent
table with no activity on the child table. We'll pay the overhead of
determining that the key column(s) have not changed, but it will never pay off
by preventing a lock wait. Granted, this is barely representative of
application behavior. Perhaps, too, we already have a good sense of this cost
from the HOT benchmarking efforts and have no cause to revisit it.
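The non-key-update benchmark suggested here needs only statements like the following; the names are illustrative, with "info" standing in for any column not covered by a unique index:

```python
# Sketch of the statement mix for the suggested worst case: non-key
# updates on the referenced (parent) table with an idle child table.
# Every update forces the patch's "did any key column change?"
# comparison, but no lock wait is ever avoided in return.
def nonkey_update_sql(parent_id):
    # "info" is assumed not to belong to any unique index, so this is
    # a non-key update under the patch's definition.
    return f"UPDATE parent SET info = info || 'x' WHERE id = {parent_id};"

print(nonkey_update_sql(7))
```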

Thanks,
nm


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-03-14 17:23:14
Message-ID: CA+Tgmob4wLtvd2djg5DtzCoieRNcsCrVyRg8fWPwwEkOugvUOg@mail.gmail.com
Lists: pgsql-hackers

On Tue, Mar 13, 2012 at 11:42 PM, Noah Misch <noah(at)leadboat(dot)com> wrote:
> More often than that; each 2-member mxid takes 4 bytes in an offsets file and
> 10 bytes in a members file.  So, more like one fsync per ~580 mxids.  Note
> that we already fsync the multixact SLRUs today, so any increase will arise
> from the widening of member entries from 4 bytes to 5.  The realism of this
> test is attractive.  Nearly-static parent tables are plenty common, and this
> test will illustrate the impact on those designs.

Agreed. But speaking of that, why exactly do we fsync the multixact SLRU today?

> You still have HEAP_XMAX_{INVALID,COMMITTED} to reduce the pressure on mxid
> lookups, so I think something more sophisticated is needed to exercise that
> cost.  Not sure what.

I don't think HEAP_XMAX_COMMITTED is much help, because committed !=
all-visible. HEAP_XMAX_INVALID will obviously help, when it happens.

>> This isn't exactly a test case, but from Noah's previous comments I
>> gather that there is a theoretical risk of mxid consumption running
>> ahead of xid consumption.  We should try to think about whether there
>> are any realistic workloads where that might actually happen.  I'm
>> willing to believe that there aren't, but not just because somebody
>> asserts it.  The reason I'm concerned about this is because, if it
>> should happen, the result will be more frequent anti-wraparound
>> vacuums on every table in the cluster.  Those are already quite
>> painful for some users.
>
> Yes.  Pre-release, what can we really do here other than have more people
> thinking about ways it might happen in practice?  Post-release, we could
> suggest monitoring methods or perhaps have VACUUM emit a WARNING when a table
> is using more mxid space than xid space.

Well, post-release, the cat is out of the bag: we'll be stuck with
this whether the performance characteristics are acceptable or not.
That's why we'd better be as sure as possible before committing to
this implementation that there's nothing we can't live with. It's not
like there's any reasonable way to turn this off if you don't like it.

> Also consider a benchmark that does plenty of non-key updates on a parent
> table with no activity on the child table.  We'll pay the overhead of
> determining that the key column(s) have not changed, but it will never pay off
> by preventing a lock wait.

Good idea.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Noah Misch <noah(at)leadboat(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-03-14 22:10:00
Message-ID: 20120314221000.GG27122@tornado.leadboat.com
Lists: pgsql-hackers

On Wed, Mar 14, 2012 at 01:23:14PM -0400, Robert Haas wrote:
> On Tue, Mar 13, 2012 at 11:42 PM, Noah Misch <noah(at)leadboat(dot)com> wrote:
> > More often than that; each 2-member mxid takes 4 bytes in an offsets file and
> > 10 bytes in a members file.  So, more like one fsync per ~580 mxids.  Note
> > that we already fsync the multixact SLRUs today, so any increase will arise
> > from the widening of member entries from 4 bytes to 5.  The realism of this
> > test is attractive.  Nearly-static parent tables are plenty common, and this
> > test will illustrate the impact on those designs.
>
> Agreed. But speaking of that, why exactly do we fsync the multixact SLRU today?

Good question. So far, I can't think of a reason. "nextMulti" is critical,
but we already fsync it with pg_control. We could delete the other multixact
state data at every startup and set OldestVisibleMXactId accordingly.

> > You still have HEAP_XMAX_{INVALID,COMMITTED} to reduce the pressure on mxid
> > lookups, so I think something more sophisticated is needed to exercise that
> > cost.  Not sure what.
>
> I don't think HEAP_XMAX_COMMITTED is much help, because committed !=
> all-visible. HEAP_XMAX_INVALID will obviously help, when it happens.

True. The patch (see ResetMultiHintBit()) also replaces a multixact xmax with
the updater xid when all transactions of the multixact have ended. You would
need a test workload with long-running multixacts that delay such replacement.
However, the workloads that come to mind are the very workloads for which this
patch eliminates lock waits; they wouldn't illustrate a worst-case.

> >> This isn't exactly a test case, but from Noah's previous comments I
> >> gather that there is a theoretical risk of mxid consumption running
> >> ahead of xid consumption.  We should try to think about whether there
> >> are any realistic workloads where that might actually happen.  I'm
> >> willing to believe that there aren't, but not just because somebody
> >> asserts it.  The reason I'm concerned about this is because, if it
> >> should happen, the result will be more frequent anti-wraparound
> >> vacuums on every table in the cluster.  Those are already quite
> >> painful for some users.
> >
> > Yes.  Pre-release, what can we really do here other than have more people
> > thinking about ways it might happen in practice?  Post-release, we could
> > suggest monitoring methods or perhaps have VACUUM emit a WARNING when a table
> > is using more mxid space than xid space.
>
> Well, post-release, the cat is out of the bag: we'll be stuck with
> this whether the performance characteristics are acceptable or not.
> That's why we'd better be as sure as possible before committing to
> this implementation that there's nothing we can't live with. It's not
> like there's any reasonable way to turn this off if you don't like it.

I disagree; we're only carving in stone the FOR KEY SHARE and FOR KEY UPDATE
syntax additions. We could even avoid doing that by not documenting them. A
later major release could implement them using a completely different
mechanism or even reduce them to aliases, KEY SHARE = SHARE and KEY UPDATE =
UPDATE. To be sure, let's still do a good job the first time.


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-03-15 01:17:33
Message-ID: 1331773686-sup-1655@alvh.no-ip.org
Lists: pgsql-hackers


Excerpts from Noah Misch's message of Wed Mar 14 19:10:00 -0300 2012:
>
> On Wed, Mar 14, 2012 at 01:23:14PM -0400, Robert Haas wrote:
> > On Tue, Mar 13, 2012 at 11:42 PM, Noah Misch <noah(at)leadboat(dot)com> wrote:
> > > More often than that; each 2-member mxid takes 4 bytes in an offsets file and
> > > 10 bytes in a members file.  So, more like one fsync per ~580 mxids.  Note
> > > that we already fsync the multixact SLRUs today, so any increase will arise
> > > from the widening of member entries from 4 bytes to 5.  The realism of this
> > > test is attractive.  Nearly-static parent tables are plenty common, and this
> > > test will illustrate the impact on those designs.
> >
> > Agreed. But speaking of that, why exactly do we fsync the multixact SLRU today?
>
> Good question. So far, I can't think of a reason. "nextMulti" is critical,
> but we already fsync it with pg_control. We could delete the other multixact
> state data at every startup and set OldestVisibleMXactId accordingly.

Hmm, yeah.

> > > You still have HEAP_XMAX_{INVALID,COMMITTED} to reduce the pressure on mxid
> > > lookups, so I think something more sophisticated is needed to exercise that
> > > cost.  Not sure what.
> >
> > I don't think HEAP_XMAX_COMMITTED is much help, because committed !=
> > all-visible. HEAP_XMAX_INVALID will obviously help, when it happens.
>
> True. The patch (see ResetMultiHintBit()) also replaces a multixact xmax with
> the updater xid when all transactions of the multixact have ended.

I have noticed that this code is not correct, because we don't know that
we're holding an appropriate lock on the page, so we can't simply change
the Xmax and reset those hint bits. As things stand today, mxids
persist longer. (We could do some cleanup at HOT-style page prune, for
example, though the lock we need is even weaker than that.) Overall
this means that coming up with a test case demonstrating this pressure
probably isn't that hard.

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-03-15 02:15:22
Message-ID: CA+TgmoYr=yohzXJsnudWVt+Rg7FguSDnCVMG+kinYQ1-+Mxmjg@mail.gmail.com
Lists: pgsql-hackers

On Wed, Mar 14, 2012 at 6:10 PM, Noah Misch <noah(at)leadboat(dot)com> wrote:
>> Well, post-release, the cat is out of the bag: we'll be stuck with
>> this whether the performance characteristics are acceptable or not.
>> That's why we'd better be as sure as possible before committing to
>> this implementation that there's nothing we can't live with.  It's not
>> like there's any reasonable way to turn this off if you don't like it.
>
> I disagree; we're only carving in stone the FOR KEY SHARE and FOR KEY UPDATE
> syntax additions.  We could even avoid doing that by not documenting them.  A
> later major release could implement them using a completely different
> mechanism or even reduce them to aliases, KEY SHARE = SHARE and KEY UPDATE =
> UPDATE.  To be sure, let's still do a good job the first time.

What I mean is really that, once the release is out, we don't get to
take it back. Sure, the next release can fix things, but any
regressions will become obstacles to upgrading and pain points for new
users.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Noah Misch <noah(at)leadboat(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-03-15 02:26:53
Message-ID: CA+Tgmoa3rJ1uuQfj=gfRkDdrDirHCHpVrQq5cNj4wv4qKsSwKg@mail.gmail.com
Lists: pgsql-hackers

On Wed, Mar 14, 2012 at 9:17 PM, Alvaro Herrera
<alvherre(at)commandprompt(dot)com> wrote:
>> > Agreed.  But speaking of that, why exactly do we fsync the multixact SLRU today?
>>
>> Good question.  So far, I can't think of a reason.  "nextMulti" is critical,
>> but we already fsync it with pg_control.  We could delete the other multixact
>> state data at every startup and set OldestVisibleMXactId accordingly.
>
> Hmm, yeah.

In a way, the fact that we don't do that is kind of fortuitous in this
situation. I had just assumed that we were not fsyncing it because
there seems to be no reason to do so. But since we are, we already
know that the fsyncs resulting from frequent mxid allocation aren't a
huge pain point. If they were, somebody would have presumably
complained about it and fixed it before now. So that means that what
we're really worrying about here is the overhead of fsyncing a little
more often, which is a lot less scary than starting to do it when we
weren't previously.

Now, we could look at this as an opportunity to optimize the existing
implementation by removing the fsyncs, rather than adding the new
infrastructure Alvaro is proposing. But that would only make sense if
we thought that getting rid of the fsyncs would be more valuable than
avoiding the blocking here, and I don't.

I still think that someone needs to do some benchmarking here, because
this is a big complicated performance patch, and we can't predict the
impact of it on real-world scenarios without testing. There is
clearly some additional overhead, and it makes sense to measure it and
hopefully discover that it isn't excessive. Still, I'm a bit
relieved.

> I have noticed that this code is not correct, because we don't know that
> we're holding an appropriate lock on the page, so we can't simply change
> the Xmax and reset those hint bits.  As things stand today, mxids
> persist longer.  (We could do some cleanup at HOT-style page prune, for
> example, though the lock we need is even weaker than that.)  Overall
> this means that coming up with a test case demonstrating this pressure
> probably isn't that hard.

What would such a test case look like?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Noah Misch <noah(at)leadboat(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-03-15 21:07:28
Message-ID: CA+U5nM+gb_vBSCcWUwGxEtypV9v7h69zaA_3i1zZn0SrR5z8XQ@mail.gmail.com
Lists: pgsql-hackers

On Wed, Mar 14, 2012 at 5:23 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

>> You still have HEAP_XMAX_{INVALID,COMMITTED} to reduce the pressure on mxid
>> lookups, so I think something more sophisticated is needed to exercise that
>> cost.  Not sure what.
>
> I don't think HEAP_XMAX_COMMITTED is much help, because committed !=
> all-visible.

So, because committed does not equal all-visible, there will be
additional lookups on mxids? That's complete rubbish.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Noah Misch <noah(at)leadboat(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-03-15 21:20:38
Message-ID: CA+U5nMK7Ro7uej2ds+M8Oz1-E5CMOSFMFnP2R7-vn7uRqB5xPA@mail.gmail.com
Lists: pgsql-hackers

On Thu, Mar 15, 2012 at 2:15 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Wed, Mar 14, 2012 at 6:10 PM, Noah Misch <noah(at)leadboat(dot)com> wrote:
>>> Well, post-release, the cat is out of the bag: we'll be stuck with
>>> this whether the performance characteristics are acceptable or not.
>>> That's why we'd better be as sure as possible before committing to
>>> this implementation that there's nothing we can't live with.  It's not
>>> like there's any reasonable way to turn this off if you don't like it.
>>
>> I disagree; we're only carving in stone the FOR KEY SHARE and FOR KEY UPDATE
>> syntax additions.  We could even avoid doing that by not documenting them.  A
>> later major release could implement them using a completely different
>> mechanism or even reduce them to aliases, KEY SHARE = SHARE and KEY UPDATE =
>> UPDATE.  To be sure, let's still do a good job the first time.
>
> What I mean is really that, once the release is out, we don't get to
> take it back.  Sure, the next release can fix things, but any
> regressions will become obstacles to upgrading and pain points for new
> users.

This comment is completely superfluous. It's a complete waste of time
to turn up on a thread and remind people that if they commit something
and it doesn't actually work, that would be a bad thing. Why, we might
ask, do you think that thought needs to be expressed here? Please,
don't answer; let's spend the time actually reviewing the patch.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-03-15 21:38:53
Message-ID: CA+U5nM+obGXKz6M16u8OFQfj3LNnFh2bqvZD0C3zpEt3LMt6kg@mail.gmail.com
Lists: pgsql-hackers

On Thu, Mar 15, 2012 at 2:26 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Wed, Mar 14, 2012 at 9:17 PM, Alvaro Herrera
> <alvherre(at)commandprompt(dot)com> wrote:
>>> > Agreed.  But speaking of that, why exactly do we fsync the multixact SLRU today?
>>>
>>> Good question.  So far, I can't think of a reason.  "nextMulti" is critical,
>>> but we already fsync it with pg_control.  We could delete the other multixact
>>> state data at every startup and set OldestVisibleMXactId accordingly.
>>
>> Hmm, yeah.
>
> In a way, the fact that we don't do that is kind of fortuitous in this
> situation.  I had just assumed that we were not fsyncing it because
> there seems to be no reason to do so.  But since we are, we already
> know that the fsyncs resulting from frequent mxid allocation aren't a
> huge pain point.  If they were, somebody would have presumably
> complained about it and fixed it before now.  So that means that what
> we're really worrying about here is the overhead of fsyncing a little
> more often, which is a lot less scary than starting to do it when we
> weren't previously.

Good

> Now, we could look at this as an opportunity to optimize the existing
> implementation by removing the fsyncs, rather than adding the new
> infrastructure Alvaro is proposing.

This is not an exercise in tuning mxact code. There is a serious
algorithmic problem that is causing real-world problems.

Removing the fsync will *not* provide a solution to the problem, so
there is no "opportunity" here.

> But that would only make sense if
> we thought that getting rid of the fsyncs would be more valuable than
> avoiding the blocking here, and I don't.

You're right that the existing code could use some optimisation.

I'm a little tired, but I can't see a reason to fsync this except at checkpoint.

Also, I'm seeing that we issue two WAL records for each RI check: we
issue one during MultiXactIdCreate/MultiXactIdExpand and then
immediately afterwards issue a XLOG_HEAP_LOCK record. The comments on
both show that each thinks it is doing it for the same reason and is
the only place it's being done. Alvaro, any ideas why that is?

> I still think that someone needs to do some benchmarking here, because
> this is a big complicated performance patch, and we can't predict the
> impact of it on real-world scenarios without testing.  There is
> clearly some additional overhead, and it makes sense to measure it and
> hopefully discover that it isn't excessive.  Still, I'm a bit
> relieved.

Very much agreed.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Noah Misch <noah(at)leadboat(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-03-15 21:46:44
Message-ID: CA+U5nM+Mc0H7yFXC0h2TCLUtYRks9XgZSg0TRE1u0qmEn99wqA@mail.gmail.com
Lists: pgsql-hackers

On Thu, Mar 15, 2012 at 1:17 AM, Alvaro Herrera
<alvherre(at)commandprompt(dot)com> wrote:

> As things stand today

Can I confirm where we are now? Is there another version of the patch
coming out soon?

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-03-15 21:54:32
Message-ID: 1331848117-sup-4005@alvh.no-ip.org
Lists: pgsql-hackers


Excerpts from Simon Riggs's message of Thu Mar 15 18:38:53 -0300 2012:
> On Thu, Mar 15, 2012 at 2:26 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> > But that would only make sense if
> > we thought that getting rid of the fsyncs would be more valuable than
> > avoiding the blocking here, and I don't.
>
> You're right that the existing code could use some optimisation.
>
> I'm a little tired, but I can't see a reason to fsync this except at checkpoint.

Hang on. What fsyncs are we talking about? I don't see that the
multixact code calls any fsync except at checkpoint and shutdown.

> Also seeing that we issue 2 WAL records for each RI check. We issue
> one during MultiXactIdCreate/MultiXactIdExpand and then immediately
> afterwards issue a XLOG_HEAP_LOCK record. The comments on both show
> that each thinks it is doing it for the same reason and is the only
> place its being done. Alvaro, any ideas why that is.

AFAIR the XLOG_HEAP_LOCK log entry only records the fact that the row is
being locked by a multixact -- it doesn't record the contents (member
xids) of said multixact, which is what the other log entry records.

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Noah Misch <noah(at)leadboat(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-03-15 21:55:05
Message-ID: 1331848492-sup-171@alvh.no-ip.org
Lists: pgsql-hackers


Excerpts from Simon Riggs's message of Thu Mar 15 18:46:44 -0300 2012:
>
> On Thu, Mar 15, 2012 at 1:17 AM, Alvaro Herrera
> <alvherre(at)commandprompt(dot)com> wrote:
>
> > As things stand today
>
> Can I confirm where we are now? Is there another version of the patch
> coming out soon?

Yes, another version is coming soon.

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-03-15 22:04:41
Message-ID: CA+U5nMLofcYGVX_iHoGo+=vHnPWQm+2WuNNvx0Gv++rcFAcdFw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Mar 15, 2012 at 9:54 PM, Alvaro Herrera
<alvherre(at)commandprompt(dot)com> wrote:
>
> Excerpts from Simon Riggs's message of jue mar 15 18:38:53 -0300 2012:
>> On Thu, Mar 15, 2012 at 2:26 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
>> > But that would only make sense if
>> > we thought that getting rid of the fsyncs would be more valuable than
>> > avoiding the blocking here, and I don't.
>>
>> You're right that the existing code could use some optimisation.
>>
>> I'm a little tired, but I can't see a reason to fsync this except at checkpoint.
>
> Hang on.  What fsyncs are we talking about?  I don't see that the
> multixact code calls any fsync except that checkpoint and shutdown.

If a dirty page is evicted it will fsync.

>> Also seeing that we issue 2 WAL records for each RI check. We issue
>> one during MultiXactIdCreate/MultiXactIdExpand and then immediately
>> afterwards issue a XLOG_HEAP_LOCK record. The comments on both show
>> that each thinks it is doing it for the same reason and is the only
>> place its being done. Alvaro, any ideas why that is.
>
> AFAIR the XLOG_HEAP_LOCK log entry only records the fact that the row is
> being locked by a multixact -- it doesn't record the contents (member
> xids) of said multixact, which is what the other log entry records.

Agreed. But issuing two records when we could issue just one seems a
little strange, especially when the two record types follow one
another so closely - so we end up queuing for the lock twice while
holding the lock on the data block.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-03-15 22:13:45
Message-ID: 1331849597-sup-8328@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


Excerpts from Simon Riggs's message of jue mar 15 19:04:41 -0300 2012:
>
> On Thu, Mar 15, 2012 at 9:54 PM, Alvaro Herrera
> <alvherre(at)commandprompt(dot)com> wrote:
> >
> > Excerpts from Simon Riggs's message of jue mar 15 18:38:53 -0300 2012:
> >> On Thu, Mar 15, 2012 at 2:26 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> >
> >> > But that would only make sense if
> >> > we thought that getting rid of the fsyncs would be more valuable than
> >> > avoiding the blocking here, and I don't.
> >>
> >> You're right that the existing code could use some optimisation.
> >>
> >> I'm a little tired, but I can't see a reason to fsync this except at checkpoint.
> >
> > Hang on.  What fsyncs are we talking about?  I don't see that the
> > multixact code calls any fsync except that checkpoint and shutdown.
>
> If a dirty page is evicted it will fsync.

Ah, right.

> >> Also seeing that we issue 2 WAL records for each RI check. We issue
> >> one during MultiXactIdCreate/MultiXactIdExpand and then immediately
> >> afterwards issue a XLOG_HEAP_LOCK record. The comments on both show
> >> that each thinks it is doing it for the same reason and is the only
> >> place its being done. Alvaro, any ideas why that is.
> >
> > AFAIR the XLOG_HEAP_LOCK log entry only records the fact that the row is
> > being locked by a multixact -- it doesn't record the contents (member
> > xids) of said multixact, which is what the other log entry records.
>
> Agreed. But issuing two records when we could issue just one seems a
> little strange, especially when the two record types follow one
> another so closely - so we end up queuing for the lock twice while
> holding the lock on the data block.

Hmm, that seems like an optimization that could be done separately.

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-03-15 22:23:32
Message-ID: CA+U5nMLX7oOMnO_T+xbf0iBbp90r=XfvsNtqDN+cema+MYXhKw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Mar 15, 2012 at 10:13 PM, Alvaro Herrera
<alvherre(at)commandprompt(dot)com> wrote:
>
> Excerpts from Simon Riggs's message of jue mar 15 19:04:41 -0300 2012:
>>
>> On Thu, Mar 15, 2012 at 9:54 PM, Alvaro Herrera
>> <alvherre(at)commandprompt(dot)com> wrote:
>> >
>> > Excerpts from Simon Riggs's message of jue mar 15 18:38:53 -0300 2012:
>> >> On Thu, Mar 15, 2012 at 2:26 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> >
>> >> > But that would only make sense if
>> >> > we thought that getting rid of the fsyncs would be more valuable than
>> >> > avoiding the blocking here, and I don't.
>> >>
>> >> You're right that the existing code could use some optimisation.
>> >>
>> >> I'm a little tired, but I can't see a reason to fsync this except at checkpoint.
>> >
>> > Hang on.  What fsyncs are we talking about?  I don't see that the
>> > multixact code calls any fsync except that checkpoint and shutdown.
>>
>> If a dirty page is evicted it will fsync.
>
> Ah, right.
>
>> >> Also seeing that we issue 2 WAL records for each RI check. We issue
>> >> one during MultiXactIdCreate/MultiXactIdExpand and then immediately
>> >> afterwards issue a XLOG_HEAP_LOCK record. The comments on both show
>> >> that each thinks it is doing it for the same reason and is the only
>> >> place its being done. Alvaro, any ideas why that is.
>> >
>> > AFAIR the XLOG_HEAP_LOCK log entry only records the fact that the row is
>> > being locked by a multixact -- it doesn't record the contents (member
>> > xids) of said multixact, which is what the other log entry records.
>>
>> Agreed. But issuing two records when we could issue just one seems a
>> little strange, especially when the two record types follow one
>> another so closely - so we end up queuing for the lock twice while
>> holding the lock on the data block.
>
> Hmm, that seems like an optimization that could be done separately.

Oh yes, definitely not something for you to add to the main patch.

Just some additional tuning to alleviate Robert's concerns.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Noah Misch <noah(at)leadboat(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-03-16 00:37:36
Message-ID: CA+TgmoZx300pkRkL27-ejWAcHCp-6-1MuyMvotJZqX91qyzOtA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Mar 15, 2012 at 5:07 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> On Wed, Mar 14, 2012 at 5:23 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>> You still have HEAP_XMAX_{INVALID,COMMITTED} to reduce the pressure on mxid
>>> lookups, so I think something more sophisticated is needed to exercise that
>>> cost.  Not sure what.
>>
>> I don't think HEAP_XMAX_COMMITTED is much help, because committed !=
>> all-visible.
>
> So because committed does not equal all visible there will be
> additional lookups on mxids? That's complete rubbish.

Noah seemed to be implying that once the updating transaction
committed, HEAP_XMAX_COMMITTED would get set and save the mxid lookup.
But I think that's not true, because anyone who looks at the tuple
afterward will still need to know the exact xmax, to test it against
their snapshot.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-03-16 00:53:05
Message-ID: 1331859074-sup-6025@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


Excerpts from Robert Haas's message of jue mar 15 21:37:36 -0300 2012:
>
> On Thu, Mar 15, 2012 at 5:07 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> > On Wed, Mar 14, 2012 at 5:23 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> >>> You still have HEAP_XMAX_{INVALID,COMMITTED} to reduce the pressure on mxid
> >>> lookups, so I think something more sophisticated is needed to exercise that
> >>> cost.  Not sure what.
> >>
> >> I don't think HEAP_XMAX_COMMITTED is much help, because committed !=
> >> all-visible.
> >
> > So because committed does not equal all visible there will be
> > additional lookups on mxids? That's complete rubbish.
>
> Noah seemed to be implying that once the updating transaction
> committed, HEAP_XMAX_COMMITTED would get set and save the mxid lookup.
> But I think that's not true, because anyone who looks at the tuple
> afterward will still need to know the exact xmax, to test it against
> their snapshot.

Yeah, we don't set HEAP_XMAX_COMMITTED on multis, even when there's a
committed update in them. I think we could handle it, at least in some
of the cases, but that'd require careful re-examination of all the
tqual.c code, which is not something I want to do right now.

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Noah Misch <noah(at)leadboat(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-03-16 01:52:12
Message-ID: 20120316015212.GA6150@tornado.leadboat.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Mar 15, 2012 at 08:37:36PM -0400, Robert Haas wrote:
> On Thu, Mar 15, 2012 at 5:07 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> > On Wed, Mar 14, 2012 at 5:23 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> >>> You still have HEAP_XMAX_{INVALID,COMMITTED} to reduce the pressure on mxid
> >>> lookups, so I think something more sophisticated is needed to exercise that
> >>> cost. ?Not sure what.
> >>
> >> I don't think HEAP_XMAX_COMMITTED is much help, because committed !=
> >> all-visible.
> >
> > So because committed does not equal all visible there will be
> > additional lookups on mxids? That's complete rubbish.
>
> Noah seemed to be implying that once the updating transaction
> committed, HEAP_XMAX_COMMITTED would get set and save the mxid lookup.
> But I think that's not true, because anyone who looks at the tuple
> afterward will still need to know the exact xmax, to test it against
> their snapshot.

Yeah, my comment above was wrong. I agree that we'll need to retrieve the
mxid members during every MVCC scan until we either mark the page all-visible
or have occasion to simplify the mxid xmax to the updater xid.


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Noah Misch <noah(at)leadboat(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-03-16 03:04:06
Message-ID: 20120316030406.GA8738@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Mar 13, 2012 at 02:35:02PM -0300, Alvaro Herrera wrote:
>
> Excerpts from Bruce Momjian's message of mar mar 13 14:00:52 -0300 2012:
> >
> > On Tue, Mar 06, 2012 at 04:39:32PM -0300, Alvaro Herrera wrote:
>
> > > When there is a single locker in a tuple, we can just store the locking info
> > > in the tuple itself. We do this by storing the locker's Xid in XMAX, and
> > > setting hint bits specifying the locking strength. There is one exception
> > > here: since hint bit space is limited, we do not provide a separate hint bit
> > > for SELECT FOR SHARE, so we have to use the extended info in a MultiXact in
> > > that case. (The other cases, SELECT FOR UPDATE and SELECT FOR KEY SHARE, are
> > > presumably more commonly used due to being the standards-mandated locking
> > > mechanism, or heavily used by the RI code, so we want to provide fast paths
> > > for those.)
> >
> > Are those tuple bits actually "hint" bits? They seem quite a bit more
> > powerful than a "hint".
>
> I'm not sure what's your point. We've had a "hint" bit for SELECT FOR
> UPDATE for ages. Even 8.2 had HEAP_XMAX_EXCL_LOCK and
> HEAP_XMAX_SHARED_LOCK. Maybe they are misnamed and aren't really
> "hints", but it's not the job of this patch to fix that problem.

Now I am confused. Where do you see the word "hint" used for
HEAP_XMAX_EXCL_LOCK and HEAP_XMAX_SHARED_LOCK? These are tuple infomask
bits, not hints, meaning they are neither optional nor there merely for
performance.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Noah Misch <noah(at)leadboat(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-03-16 03:08:29
Message-ID: 20120316030829.GB8738@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Mar 15, 2012 at 11:04:06PM -0400, Bruce Momjian wrote:
> On Tue, Mar 13, 2012 at 02:35:02PM -0300, Alvaro Herrera wrote:
> >
> > Excerpts from Bruce Momjian's message of mar mar 13 14:00:52 -0300 2012:
> > >
> > > On Tue, Mar 06, 2012 at 04:39:32PM -0300, Alvaro Herrera wrote:
> >
> > > > When there is a single locker in a tuple, we can just store the locking info
> > > > in the tuple itself. We do this by storing the locker's Xid in XMAX, and
> > > > setting hint bits specifying the locking strength. There is one exception
> > > > here: since hint bit space is limited, we do not provide a separate hint bit
> > > > for SELECT FOR SHARE, so we have to use the extended info in a MultiXact in
> > > > that case. (The other cases, SELECT FOR UPDATE and SELECT FOR KEY SHARE, are
> > > > presumably more commonly used due to being the standards-mandated locking
> > > > mechanism, or heavily used by the RI code, so we want to provide fast paths
> > > > for those.)
> > >
> > > Are those tuple bits actually "hint" bits? They seem quite a bit more
> > > powerful than a "hint".
> >
> > I'm not sure what's your point. We've had a "hint" bit for SELECT FOR
> > UPDATE for ages. Even 8.2 had HEAP_XMAX_EXCL_LOCK and
> > HEAP_XMAX_SHARED_LOCK. Maybe they are misnamed and aren't really
> > "hints", but it's not the job of this patch to fix that problem.
>
> Now I am confused. Where do you see the word "hint" used by
> HEAP_XMAX_EXCL_LOCK and HEAP_XMAX_SHARED_LOCK. These are tuple infomask
> bits, not hints, meaning they are not optional or there just for
> performance.

Are you saying that the bit is only a guide and is there only for
performance? If so, I understand why it is called a "hint".

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Noah Misch <noah(at)leadboat(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-03-16 03:09:52
Message-ID: 20120316030952.GC8738@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Mar 13, 2012 at 01:46:24PM -0400, Robert Haas wrote:
> On Mon, Mar 12, 2012 at 3:28 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> > I agree with you that some worst case performance tests should be
> > done. Could you please say what you think the worst cases would be, so
> > those can be tested? That would avoid wasting time or getting anything
> > backwards.
>
> I've thought about this some and here's what I've come up with so far:

I question whether we are in a position to do the testing necessary to
commit this for 9.2.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Noah Misch <noah(at)leadboat(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-03-16 13:36:11
Message-ID: 1331904601-sup-237@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


Excerpts from Bruce Momjian's message of vie mar 16 00:04:06 -0300 2012:
>
> On Tue, Mar 13, 2012 at 02:35:02PM -0300, Alvaro Herrera wrote:
> >
> > Excerpts from Bruce Momjian's message of mar mar 13 14:00:52 -0300 2012:
> > >
> > > On Tue, Mar 06, 2012 at 04:39:32PM -0300, Alvaro Herrera wrote:
> >
> > > > When there is a single locker in a tuple, we can just store the locking info
> > > > in the tuple itself. We do this by storing the locker's Xid in XMAX, and
> > > > setting hint bits specifying the locking strength. There is one exception
> > > > here: since hint bit space is limited, we do not provide a separate hint bit
> > > > for SELECT FOR SHARE, so we have to use the extended info in a MultiXact in
> > > > that case. (The other cases, SELECT FOR UPDATE and SELECT FOR KEY SHARE, are
> > > > presumably more commonly used due to being the standards-mandated locking
> > > > mechanism, or heavily used by the RI code, so we want to provide fast paths
> > > > for those.)
> > >
> > > Are those tuple bits actually "hint" bits? They seem quite a bit more
> > > powerful than a "hint".
> >
> > I'm not sure what's your point. We've had a "hint" bit for SELECT FOR
> > UPDATE for ages. Even 8.2 had HEAP_XMAX_EXCL_LOCK and
> > HEAP_XMAX_SHARED_LOCK. Maybe they are misnamed and aren't really
> > "hints", but it's not the job of this patch to fix that problem.
>
> Now I am confused. Where do you see the word "hint" used by
> HEAP_XMAX_EXCL_LOCK and HEAP_XMAX_SHARED_LOCK. These are tuple infomask
> bits, not hints, meaning they are not optional or there just for
> performance.

Okay, I think this is just a case of confusing terminology. I have
always assumed (because I have not seen any evidence to the contrary)
that anything in t_infomask and t_infomask2 is a "hint bit" --
regardless of whether it is actually a hint or something with stronger
significance. HEAP_XMAX_EXCL_LOCK and HEAP_XMAX_SHARED_LOCK are
certainly not "optional" in the sense that if they are missing, the
meaning of the Xmax field is completely different. So strictly
speaking they are not "hints", though we call them that.

Now, if we want to differentiate infomask bits that are just hints from
those that are something else, we can do that, but I'm not sure it's
useful -- AFAICS only XMAX_COMMITTED and XMIN_COMMITTED are proper
hints.

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Noah Misch <noah(at)leadboat(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-03-16 13:40:01
Message-ID: 1331905118-sup-7192@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


Excerpts from Alvaro Herrera's message of vie mar 16 10:36:11 -0300 2012:

> > Now I am confused. Where do you see the word "hint" used by
> > HEAP_XMAX_EXCL_LOCK and HEAP_XMAX_SHARED_LOCK. These are tuple infomask
> > bits, not hints, meaning they are not optional or there just for
> > performance.
>
> Okay, I think this is just a case of confusing terminology. I have
> always assumed (because I have not seen any evidence to the contrary)
> that anything in t_infomask and t_infomask2 is a "hint bit" --
> regardless of it being actually a hint or something with a stronger
> significance.

Maybe this is just my mistake. I see in
http://wiki.postgresql.org/wiki/Hint_Bits that we only call the
COMMITTED/INVALID infomask bits "hints".

I think it's easy enough to correct the README to call them "infomask
bits" rather than hints ... I'll go do that.

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Noah Misch <noah(at)leadboat(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-03-16 14:08:07
Message-ID: CA+TgmoYzYaEb2V+XYrZ1ZURG0Z+WC-UwJ5RTAPfOMSaTPO8A5w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Mar 15, 2012 at 11:09 PM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
> On Tue, Mar 13, 2012 at 01:46:24PM -0400, Robert Haas wrote:
>> On Mon, Mar 12, 2012 at 3:28 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>> > I agree with you that some worst case performance tests should be
>> > done. Could you please say what you think the worst cases would be, so
>> > those can be tested? That would avoid wasting time or getting anything
>> > backwards.
>>
>> I've thought about this some and here's what I've come up with so far:
>
> I question whether we are in a position to do the testing necessary to
> commit this for 9.2.

Is anyone even working on testing it?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Noah Misch <noah(at)leadboat(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-03-16 18:15:11
Message-ID: 20120316181511.GB28340@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Mar 16, 2012 at 10:40:01AM -0300, Alvaro Herrera wrote:
>
> Excerpts from Alvaro Herrera's message of vie mar 16 10:36:11 -0300 2012:
>
> > > Now I am confused. Where do you see the word "hint" used by
> > > HEAP_XMAX_EXCL_LOCK and HEAP_XMAX_SHARED_LOCK. These are tuple infomask
> > > bits, not hints, meaning they are not optional or there just for
> > > performance.
> >
> > Okay, I think this is just a case of confusing terminology. I have
> > always assumed (because I have not seen any evidence to the contrary)
> > that anything in t_infomask and t_infomask2 is a "hint bit" --
> > regardless of it being actually a hint or something with a stronger
> > significance.
>
> Maybe this is just my mistake. I see in
> http://wiki.postgresql.org/wiki/Hint_Bits that we only call the
> COMMITTED/INVALID infomask bits "hints".
>
> I think it's easy enough to correct the README to call them "infomask
> bits" rather than hints .. I'll go do that.

OK, thanks. I only brought it up so people would not be confused into
thinking that these are optional pieces of information and that the
real information is stored somewhere else.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Noah Misch <noah(at)leadboat(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-03-16 18:22:05
Message-ID: 20120316182205.GC28340@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Mar 16, 2012 at 10:08:07AM -0400, Robert Haas wrote:
> On Thu, Mar 15, 2012 at 11:09 PM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
> > On Tue, Mar 13, 2012 at 01:46:24PM -0400, Robert Haas wrote:
> >> On Mon, Mar 12, 2012 at 3:28 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> >> > I agree with you that some worst case performance tests should be
> >> > done. Could you please say what you think the worst cases would be, so
> >> > those can be tested? That would avoid wasting time or getting anything
> >> > backwards.
> >>
> >> I've thought about this some and here's what I've come up with so far:
> >
> > I question whether we are in a position to do the testing necessary to
> > commit this for 9.2.
>
> Is anyone even working on testing it?

No one I know of. I am just trying to set expectations that this still
has a long way to go.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Noah Misch <noah(at)leadboat(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-03-16 18:49:03
Message-ID: 1331923720-sup-3214@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


Excerpts from Bruce Momjian's message of vie mar 16 15:22:05 -0300 2012:
>
> On Fri, Mar 16, 2012 at 10:08:07AM -0400, Robert Haas wrote:
> > On Thu, Mar 15, 2012 at 11:09 PM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
> > > On Tue, Mar 13, 2012 at 01:46:24PM -0400, Robert Haas wrote:
> > >> On Mon, Mar 12, 2012 at 3:28 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> > >> > I agree with you that some worst case performance tests should be
> > >> > done. Could you please say what you think the worst cases would be, so
> > >> > those can be tested? That would avoid wasting time or getting anything
> > >> > backwards.
> > >>
> > >> I've thought about this some and here's what I've come up with so far:
> > >
> > > I question whether we are in a position to do the testing necessary to
> > > commit this for 9.2.
> >
> > Is anyone even working on testing it?
>
> No one I know of. I am just trying to set expectations that this still
> has a long way to go.

A Command Prompt colleague is on it.

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Noah Misch <noah(at)leadboat(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-03-17 22:45:20
Message-ID: 1332023610-sup-4118@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Excerpts from Simon Riggs's message of jue mar 15 18:46:44 -0300 2012:
>
> On Thu, Mar 15, 2012 at 1:17 AM, Alvaro Herrera
> <alvherre(at)commandprompt(dot)com> wrote:
>
> > As things stand today
>
> Can I confirm where we are now? Is there another version of the patch
> coming out soon?

Here is v11. This version is mainly updated to add pg_upgrade support,
as discussed. It also contains the README file that was posted earlier
(plus wording fixes per Bruce), a couple of bug fixes, and some comment
updates.

There's also an SRF for inspecting multixact members, pg_get_multixact_members,
but I'm not sure how useful it is for the general user, so maybe I'll rip
it out of the patch before committing.

I mentioned elsewhere in the thread that ResetMultiHintBit was bogus: we
don't know, while running the various HeapTupleSatisfies routines, what
kind of lock we hold; so we can't do anything to the tuple beyond
setting HEAP_XMAX_INVALID. There's probably a good place in page
pruning that could be used to transform multis containing committed
updates into plain no-multi Xmax.

The whole thing can be seen in github here:
https://github.com/alvherre/postgres/tree/fklocks

(While creating this patch I noticed that I had created v10 of the patch
on March 6th but apparently never sent it.)

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Attachment Content-Type Size
fklocks-11.patch.gz application/x-gzip 86.7 KB

From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Noah Misch <noah(at)leadboat(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-03-17 22:58:41
Message-ID: 1332024762-sup-6979@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


Excerpts from Simon Riggs's message of lun mar 05 15:28:59 -0300 2012:
>
> On Mon, Feb 27, 2012 at 2:47 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> >> Regarding performance, the good thing about this patch is that if you
> >> have an operation that used to block, it might now not block.  So maybe
> >> multixact-related operation is a bit slower than before, but if it
> >> allows you to continue operating rather than sit waiting until some
> >> other transaction releases you, it's much better.
> >
> > That's probably true, although there is some deferred cost that is
> > hard to account for.  You might not block immediately, but then later
> > somebody might block either because the mxact SLRU now needs fsyncs or
> > because they've got to decode an mxid long after the relevant segment
> > has been evicted from the SLRU buffers.  In general, it's hard to
> > bound that latter cost, because you only avoid blocking once (when the
> > initial update happens) but you might pay the extra cost of decoding
> > the mxid as many times as the row is read, which could be arbitrarily
> > many.  How much of a problem that is in practice, I'm not completely
> > sure, but it has worried me before and it still does.  In the worst
> > case scenario, a handful of frequently-accessed rows with MXIDs all of
> > whose members are dead except for the UPDATE they contain could result
> > in continual SLRU cache-thrashing.
>
> Cases I regularly see involve wait times of many seconds.
>
> When this patch helps, it will help performance by algorithmic gains,
> so perhaps x10-100.
>
> That can and should be demonstrated though, I agree.

BTW, the isolation tester cases include a few scenarios that die with
deadlocks in the unpatched code but continue without dying with the
patched code. Others block in unpatched master, and continue without
blocking when patched. This should be proof enough that there are
"algorithmic gains" here.

There's also a test case that demonstrates a fix for the problem
(pointed out in the docs) that if you acquire a row lock, and then a
subxact upgrades it (say by deleting the row) and aborts, the original
row lock is lost. With the patched code, the original lock is no longer
lost.

I completely agree with the idea that we need some mitigation against
repeated lookups of mxids that contain committed updates.

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Noah Misch <noah(at)leadboat(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-03-17 23:01:51
Message-ID: 1332025214-sup-4400@alvh.no-ip.org


Excerpts from Simon Riggs's message of Tue Mar 06 18:33:13 -0300 2012:

> The lock modes are correct, appropriate and IMHO have meaningful
> names. No redesign required here.
>
> Not sure about the naming of some of the flag bits however.

Feel free to suggest improvements ... I've probably seen them for too
long to find them anything but what I intended them to mean.

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Noah Misch <noah(at)leadboat(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-03-17 23:11:57
Message-ID: 1332025380-sup-7328@alvh.no-ip.org


Excerpts from Simon Riggs's message of Tue Mar 06 17:28:12 -0300 2012:
> On Tue, Mar 6, 2012 at 7:39 PM, Alvaro Herrera
> <alvherre(at)commandprompt(dot)com> wrote:
>
> > We provide four levels of tuple locking strength: SELECT FOR KEY UPDATE is
> > super-exclusive locking (used to delete tuples and more generally to update
> > tuples modifying the values of the columns that make up the key of the tuple);
> > SELECT FOR UPDATE is a standards-compliant exclusive lock; SELECT FOR SHARE
> > implements shared locks; and finally SELECT FOR KEY SHARE is a super-weak mode
> > that does not conflict with exclusive mode, but conflicts with SELECT FOR KEY
> > UPDATE.  This last mode is just strong enough to implement RI checks, i.e. it
> > ensures that tuples do not go away from under a check, without blocking some
> > other transaction that wants to update the tuple without changing its key.
>
> So there are 4 lock types, but we only have room for 3 on the tuple
> header, so we store the least common/deprecated of the 4 types as a
> multixactid. Some rewording would help there.

Hmm, I rewrote that paragraph two times. I'll try to adjust it a bit
more.
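For reference, the conflict behaviour that paragraph describes can be sketched
as a tiny table (a sketch only; the enum and array names are illustrative, not
the patch's actual symbols):

```c
#include <assert.h>
#include <stdbool.h>

/* The four tuple lock strengths, weakest to strongest (names illustrative). */
typedef enum
{
	TL_KEY_SHARE,				/* SELECT FOR KEY SHARE */
	TL_SHARE,					/* SELECT FOR SHARE */
	TL_UPDATE,					/* SELECT FOR UPDATE */
	TL_KEY_UPDATE				/* SELECT FOR KEY UPDATE (delete, or an
								 * update that changes key columns) */
} TupleLockMode;

/*
 * Conflict table per the quoted paragraph: KEY SHARE conflicts only with
 * KEY UPDATE; SHARE conflicts with UPDATE and KEY UPDATE; KEY UPDATE
 * conflicts with everything, including itself.
 */
static const bool lock_conflicts[4][4] = {
	/*                 KS     S      U      KU   */
	/* KEY SHARE  */ {false, false, false, true},
	/* SHARE      */ {false, false, true,  true},
	/* UPDATE     */ {false, true,  true,  true},
	/* KEY UPDATE */ {true,  true,  true,  true},
};
```

The row for KEY SHARE is the point of the patch: an RI check's lock no longer
conflicts with a plain (non-key) update's FOR UPDATE lock.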

> My understanding is that all of theses workloads will change
>
> * Users of explicit SHARE lockers will be slightly worse in the case
> of the 1st locker, but then after that they'll be the same as before.

Right. (We're assuming that there *are* users of SHARE locks, which I'm
not sure is a given.)

> * Updates against an RI locked table will be dramatically faster
> because of reduced lock waits

Correct.

> ...and that these previous workloads are effectively unchanged:
>
> * Stream of RI checks causes mxacts

Yes.

> * Multi row deadlocks still possible

Yes.

> * Queues of writers still wait in the same way

Yes.

> * Deletes don't cause mxacts unless by same transaction

Yeah ... there's no way to avoid conflicting with a FOR KEY UPDATE lock
(the strength grabbed by a delete) unless you're the same transaction.

> > The possibility of having an update within a MultiXact means that they must
> > persist across crashes and restarts: a future reader of the tuple needs to
> > figure out whether the update committed or aborted.  So we have a requirement
> > that pg_multixact needs to retain pages of its data until we're certain that
> > the MultiXacts in them are no longer of interest.
>
> I think the "no longer of interest" aspect needs to be tracked more
> closely because it will necessarily lead to more I/O.

Not sure what you mean here.

> If we store the LSN on each mxact page, as I think we need to, we can
> get rid of pages more quickly if we know they don't have an LSN set.
> So its possible we can optimise that more.

Hmm, I had originally thought that this was rather pointless, because it
seemed unlikely that a segment would contain *no* update-carrying multis
at all. But then, maybe Robert is right and there are users
out there that run a lot of RI checks and never update the masters ...
Hm. I'm not sure that LSN tracking is the right tool to do that
optimization, however -- I mean, a single multi containing an update in
a whole segment will prevent that segment from being considered useless.
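To illustrate the concern, here is a sketch of the truncation rule under
discussion (all names are hypothetical; this is not patch code): a segment
of pure-locker multis can go away once its members are old enough, but one
update-carrying multi pins the whole segment until the referencing tuples
are dealt with.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical summary of one multi's state, for the truncation decision. */
typedef struct
{
	bool		contains_update;	/* does the multi include an updater? */
	bool		members_all_old;	/* members older than oldest running xact */
	bool		frozen_away;		/* no tuple references this multi anymore */
} MultiInfo;

/*
 * A segment is removable only if every multi in it is done with: lockers
 * must be old, and any update-carrying multi must already have been
 * frozen out of the tuples that pointed at it.
 */
static bool
segment_removable(const MultiInfo *multis, int n)
{
	for (int i = 0; i < n; i++)
	{
		if (!multis[i].members_all_old)
			return false;
		if (multis[i].contains_update && !multis[i].frozen_away)
			return false;		/* one such multi pins the whole segment */
	}
	return true;
}
```

A single multi failing the check keeps the entire segment on disk, which is
why a per-segment "has updates" signal (LSN-based or otherwise) helps only
for segments with no updates at all.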

> > VACUUM is in charge of removing old MultiXacts at the time of tuple freezing.
>
> You mean mxact segments?

Well, both. When a tuple is frozen, we both remove its Xmin/Xmax and
any possible multi that it might have in Xmax. That's what I really
meant above. But also, vacuum will remove pg_multixact segments just as
it will remove pg_clog segments.

(It is possible, and probably desirable, to remove a Multi much earlier
than freezing the tuple. The patch does not (yet) do that, however.)

> Surely we set hint bits on tuples same as now? Hope so.

We set hint bits, but if a multi contains an update, we don't set
HEAP_XMAX_COMMITTED even when the update is known committed. I think
we could do this in some cases.
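A minimal sketch of that hinting rule (the infomask bit values below match
PostgreSQL's heap tuple header definitions, but the helper function and its
conservative policy are my own illustration, not the patch's code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Infomask bits as defined in PostgreSQL's heap tuple header. */
#define HEAP_XMAX_COMMITTED	0x0400	/* t_xmax committed */
#define HEAP_XMAX_INVALID	0x0800	/* t_xmax invalid/aborted */
#define HEAP_XMAX_IS_MULTI	0x1000	/* t_xmax is a MultiXactId */

/*
 * Hypothetical helper: may we set the HEAP_XMAX_COMMITTED hint?
 * Per the discussion above, when xmax is a multi we decline to set it
 * even if the update inside is known committed.
 */
static bool
can_hint_xmax_committed(uint16_t infomask, bool update_known_committed)
{
	if (infomask & HEAP_XMAX_INVALID)
		return false;			/* nothing to hint */
	if (infomask & HEAP_XMAX_IS_MULTI)
		return false;			/* conservative: never hint a multi xmax */
	return update_known_committed;
}
```

Relaxing the multi case -- hinting when the multi's update is committed and
all lockers are gone -- is the "could do this in some cases" above.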

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Noah Misch <noah(at)leadboat(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-03-25 08:17:59
Message-ID: CA+U5nMJ_HiHxyZh+NMt1RAc694C+gWo7EEc_yZsLV2hzK5iaHg@mail.gmail.com

On Sat, Mar 17, 2012 at 10:45 PM, Alvaro Herrera
<alvherre(at)commandprompt(dot)com> wrote:

> Here is v11.  This version is mainly updated to add pg_upgrade support,
> as discussed.  It also contains the README file that was posted earlier
> (plus wording fixes per Bruce), a couple of bug fixes, and some comment
> updates.

The main thing we're waiting on are the performance tests to confirm
the lack of regression.

You are working on that, right?

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Peter Geoghegan <peter(at)2ndquadrant(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: foreign key locks, 2nd attempt
Date: 2012-04-05 18:50:58
Message-ID: CAEYLb_VGv=rGJx-HuC9tStc70MUtQht5x-HbOf-Xp-tnEeTv+Q@mail.gmail.com

On 25 March 2012 09:17, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> The main thing we're waiting on are the performance tests to confirm
> the lack of regression.

I have extensively benchmarked the latest revision of the patch
(tpc-b.sql), which I pulled from Alvaro's github account. The
benchmark was of the feature branch's then-and-current HEAD, "Don't
follow update chains unless caller requests it".

I've had to split these numbers out into two separate reports.
Incidentally, at some future point I hope that pgbench-tools can handle
testing across feature branches, initdb'ing and suchlike automatically
as needed. That's something that's likely to happen sooner rather than
later.

The server used was kindly supplied by the University of Oregon open
source lab.

Server (It's virtualised, but apparently this is purely for sandboxing
purposes and the virtualisation technology is rather good):

IBM,8231-E2B POWER7 processor (8 cores).
Fedora 16
8GB Ram
Dedicated RAID1 disks. Exact configuration unknown.

postgresql.conf (this information is available when you drill down
into each test too, fwiw):
max_connections = 200
shared_buffers = 2GB
checkpoint_segments = 30
checkpoint_completion_target = 0.8
effective_cache_size = 6GB

Reports:

http://results_fklocks.staticloud.com/
http://results_master_for_fks.staticloud.com/

Executive summary: There is a clear regression of less than 10%. There
also appears to be a new source of contention at higher client counts.

I realise that the likely upshot of this, and of other concerns
generally held at this late stage, is that this patch will not make it
into 9.2. For what it's worth, that comes as a big disappointment to
me. I would like to thank both Alvaro and Noah for their hard work
here.

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services