parallel mode and parallel contexts

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: parallel mode and parallel contexts
Date: 2014-12-12 22:52:38
Message-ID: CA+Tgmob8u=J-D-_5SCXOQ9-ZtK_xCHjwa3M29CxBhU3VPPnezw@mail.gmail.com
Lists: pgsql-hackers

Attached is a patch that adds two new concepts: parallel mode, and
parallel contexts. The idea of this code is to provide a framework
for specific parallel things that you want to do, such as parallel
sequential scan or parallel sort. When you're in parallel mode,
certain operations - like DDL, and anything that would update the
command counter - are prohibited. But you gain the ability to create
a parallel context, which in turn can be used to fire up parallel
workers. And if you do that, then your snapshot, combo CID hash, and
GUC values will be copied to the worker, which is handy.
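In outline, a caller uses it something like this (just a sketch to show the shape of the API, not the literal code: my_entrypoint is a placeholder, and the enter/exit function names are approximate -- see the patch for the real names and signatures):

/* in the user backend; declarations and error handling mostly omitted */
ParallelContext *pcxt;

EnterParallelMode();            /* unsafe state changes now error out */
pcxt = CreateParallelContext(my_entrypoint, nworkers);
/* reserve space in the DSM segment for whatever the caller wants to share;
 * snapshot, combo CID hash, and GUC state are serialized automatically */
LaunchParallelWorkers(pcxt);
/* ... do this backend's share of the work ... */
WaitForParallelWorkersToFinish(pcxt);
DestroyParallelContext(pcxt);
ExitParallelMode();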

This patch is very much half-baked. Among the things that aren't right yet:

- There's no handling of heavyweight locking, so I'm quite sure it'll
be possible to cause undetected deadlocks if you work at it. There
are some existing threads on this topic and perhaps we can incorporate
one of those concepts into this patch, but this version does not.
- There's no provision for copying the parent's XID and sub-XIDs, if
any, to the background workers, which means that if you use this and
your transaction has written data, you will get wrong answers, because
TransactionIdIsCurrentTransactionId() will do the wrong thing.
- There's no really deep integration with the transaction system yet.
Previous discussions seem to point toward the need to do various types
of coordinated cleanup when the parallel phase is done, or when an
error happens. In particular, you probably don't want the abort
record to get written while there are still possibly backends that are
part of that transaction doing work; and you certainly don't want
files created by the current transaction to get removed while some
other backend is still writing them. The right way to work all of
this out needs some deep thought; agreeing on what the design should
be is probably harder than implementing it.

Despite the above, I think this does a fairly good job laying out how
I believe parallelism can be made to work in PostgreSQL: copy a bunch
of state from the user backend to the parallel workers, compute for a
while, and then shut everything down. While parallelism is
running, forbid changes to state that's already been synchronized, so
that things don't get out of step. I think the patch shows how the
act of synchronizing state from the master to the workers can be made
quite modular and painless, even though it doesn't synchronize
everything relevant. I'd really appreciate any design thoughts anyone
may have on how to fix the problems mentioned above, how to fix any
other problems you foresee, or even just a list of reasons why you
think this will blow up.

What I think is that we're really pretty close to doing real parallelism,
and that this is probably the last major piece of infrastructure that
we need in order to support parallel execution in a reasonable way.
That's a pretty bold statement, but I believe it to be true: despite
the limitations of the current version of this patch, I think we're
very close to being able to sit down and code up a parallel algorithm
in PostgreSQL and have that not be all that hard. Once we get the
first one, I expect a whole bunch more to come together far more
quickly than the first one did.

I would be remiss if I failed to mention that this patch includes work
by my colleagues Amit Kapila, Rushabh Lathia, and Jeevan Chalke, as
well as my former colleague Noah Misch; and that it would not have
been possible without the patient support of EnterpriseDB management.

Thanks,

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
parallel-mode-v0.patch text/x-patch 82.2 KB

From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel mode and parallel contexts
Date: 2014-12-17 14:16:01
Message-ID: CA+U5nM+=h=WBoWvnDqxLFmV3d8zzObR0rb=aVvrRMbb6srog6A@mail.gmail.com
Lists: pgsql-hackers

On 12 December 2014 at 22:52, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> I would be remiss if I failed to mention that this patch includes work
> by my colleagues Amit Kapila, Rushabh Lathia, and Jeevan Chalke, as
> well as my former colleague Noah Misch; and that it would not have
> been possible without the patient support of EnterpriseDB management.

Noted and thanks to all.

I will make this my priority for review, but regrettably not until New
Year, about 2.5-3 weeks away.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel mode and parallel contexts
Date: 2014-12-17 19:53:51
Message-ID: CA+TgmobV7a5Z4cQ_Hkuvvb9BrrkHWpOPyyzMZZcznk=7fKwJ+g@mail.gmail.com
Lists: pgsql-hackers

On Wed, Dec 17, 2014 at 9:16 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> On 12 December 2014 at 22:52, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
>> I would be remiss if I failed to mention that this patch includes work
>> by my colleagues Amit Kapila, Rushabh Lathia, and Jeevan Chalke, as
>> well as my former colleague Noah Misch; and that it would not have
>> been possible without the patient support of EnterpriseDB management.
>
> Noted and thanks to all.
>
> I will make this my priority for review, but regrettably not until New
> Year, about 2.5-3 weeks away.

OK!

In the meantime, I had a good chat with Heikki on IM yesterday which
gave me some new ideas on how to fix up the transaction handling in
here, which I am working on implementing. So hopefully I will have
that by then.

I am also hoping Amit will be adapting his parallel seq-scan patch to
this framework.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel mode and parallel contexts
Date: 2014-12-19 13:59:18
Message-ID: CA+Tgmobrbizn41fmCYJqXx-YDYaYTTm3YYo_+H2BRPXjtk9ORg@mail.gmail.com
Lists: pgsql-hackers

On Wed, Dec 17, 2014 at 2:53 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> In the meantime, I had a good chat with Heikki on IM yesterday which
> gave me some new ideas on how to fix up the transaction handling in
> here, which I am working on implementing. So hopefully I will have
> that by then.

And here is a new version. This version has some real integration
with the transaction system, along the lines of what Heikki suggested
to me, so hopefully tuple visibility calculations in a parallel worker
will now produce the right answers, though I need to integrate this
with code to actually do something-or-other in parallel in order to
really test that. There are still some problems with parallel worker
shutdown. As hard as I tried to fight it, I'm gradually resigning
myself to the fact that we're probably going to have to set things up
so that the master waits for all of its workers to die during abort
processing (and during commit processing, but that's not bothersome).
Otherwise, to take just one example, chaos potentially ensues if you
run a parallel query in a subtransaction and then roll back to a
savepoint. But this version just kills the workers and doesn't
actually wait for them to die; I'll see about fixing that, but wanted
to send this around for comments first.
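The key idea on the visibility side is roughly this (a sketch of the logic only; the InParallelWorker flag and the CurrentXidsFromLeader names below are illustrative, not necessarily what the patch uses): a worker knows the initiating backend's top-level XID plus the list of XIDs that backend considered current, and TransactionIdIsCurrentTransactionId() consults that list instead of walking a real transaction state stack.

/* sketch: parallel-worker branch of TransactionIdIsCurrentTransactionId() */
if (TransactionIdEquals(xid, XactTopTransactionId))
    return true;
if (InParallelWorker)               /* illustrative flag name */
{
    int     i;

    /* XIDs serialized from the initiating backend at worker startup */
    for (i = 0; i < nCurrentXidsFromLeader; i++)
        if (TransactionIdEquals(xid, CurrentXidsFromLeader[i]))
            return true;
    return false;
}
/* ... otherwise fall through to the normal walk of the state stack ... */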

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
parallel-mode-v0.1.patch text/x-patch 99.9 KB

From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel mode and parallel contexts
Date: 2014-12-22 00:56:35
Message-ID: CAB7nPqS29+LhoH1912bFM9mY+yjABwCpUb+ZTjHNMdVoVVp3mA@mail.gmail.com
Lists: pgsql-hackers

On Thu, Dec 18, 2014 at 4:53 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Wed, Dec 17, 2014 at 9:16 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>> On 12 December 2014 at 22:52, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>
>>> I would be remiss if I failed to mention that this patch includes work
>>> by my colleagues Amit Kapila, Rushabh Lathia, and Jeevan Chalke, as
>>> well as my former colleague Noah Misch; and that it would not have
>>> been possible without the patient support of EnterpriseDB management.
>>
>> Noted and thanks to all.
>>
>> I will make this my priority for review, but regrettably not until New
>> Year, about 2.5-3 weeks away.
>
> OK!
>
> In the meantime, I had a good chat with Heikki on IM yesterday which
> gave me some new ideas on how to fix up the transaction handling in
> here, which I am working on implementing. So hopefully I will have
> that by then.
>
> I am also hoping Amit will be adapting his parallel seq-scan patch to
> this framework.
Simon, if you are planning to review this patch soon, could you add
your name as a reviewer for this patch in the commit fest app?
Thanks,
--
Michael


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel mode and parallel contexts
Date: 2014-12-22 19:14:31
Message-ID: CA+TgmoZJUOhd0hzXSraSPqYDaiXoDzeopxt6dTJpuGGGRkurjA@mail.gmail.com
Lists: pgsql-hackers

On Fri, Dec 19, 2014 at 8:59 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> And here is a new version.

Here is another new version, with lots of bugs fixed. The worker
shutdown sequence is now much more robust, although I think there may
still be a bug or two lurking, and I fixed a bunch of other things
too. There's now a function called parallel_count() in the
parallel_dummy extension contained herein, which does a parallel count
of a relation you choose:

rhaas=# select count(*) from pgbench_accounts;
count
---------
4000000
(1 row)

Time: 396.635 ms
rhaas=# select parallel_count('pgbench_accounts'::regclass, 0);
NOTICE: PID 2429 counted 4000000 tuples
parallel_count
----------------
4000000
(1 row)

Time: 234.445 ms
rhaas=# select parallel_count('pgbench_accounts'::regclass, 4);
NOTICE: PID 2499 counted 583343 tuples
CONTEXT: parallel worker, pid 2499
NOTICE: PID 2500 counted 646478 tuples
CONTEXT: parallel worker, pid 2500
NOTICE: PID 2501 counted 599813 tuples
CONTEXT: parallel worker, pid 2501
NOTICE: PID 2502 counted 611389 tuples
CONTEXT: parallel worker, pid 2502
NOTICE: PID 2429 counted 1558977 tuples
parallel_count
----------------
4000000
(1 row)

Time: 150.004 ms
rhaas=# select parallel_count('pgbench_accounts'::regclass, 8);
NOTICE: PID 2429 counted 1267566 tuples
NOTICE: PID 2504 counted 346236 tuples
CONTEXT: parallel worker, pid 2504
NOTICE: PID 2505 counted 345077 tuples
CONTEXT: parallel worker, pid 2505
NOTICE: PID 2506 counted 355325 tuples
CONTEXT: parallel worker, pid 2506
NOTICE: PID 2507 counted 350872 tuples
CONTEXT: parallel worker, pid 2507
NOTICE: PID 2508 counted 338855 tuples
CONTEXT: parallel worker, pid 2508
NOTICE: PID 2509 counted 336903 tuples
CONTEXT: parallel worker, pid 2509
NOTICE: PID 2511 counted 326716 tuples
CONTEXT: parallel worker, pid 2511
NOTICE: PID 2510 counted 332450 tuples
CONTEXT: parallel worker, pid 2510
parallel_count
----------------
4000000
(1 row)

Time: 166.347 ms

This example table (pgbench_accounts, scale 40, ~537 MB) is small
enough that parallelism doesn't really make sense; you can see from
the notice messages above that the master manages to count a quarter
of the table before the workers get themselves up and running. The
point is rather to show how the infrastructure works and that it can
be used to write code to do practically useful tasks in a surprisingly
small number of lines of code; parallel_count is only maybe ~100 lines
on top of the base patch.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
parallel-mode-v0.2.patch text/x-patch 107.3 KB

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel mode and parallel contexts
Date: 2015-01-02 14:04:13
Message-ID: CAA4eK1LOHfex-ykTD1uHVDFELwhX04EyPWG=8iy78UGCZuo6-A@mail.gmail.com
Lists: pgsql-hackers

On Thu, Dec 18, 2014 at 1:23 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
>
> In the meantime, I had a good chat with Heikki on IM yesterday which
> gave me some new ideas on how to fix up the transaction handling in
> here, which I am working on implementing. So hopefully I will have
> that by then.
>
> I am also hoping Amit will be adapting his parallel seq-scan patch to
> this framework.
>

While working on the parallel seq-scan patch to adapt it to this framework, I
noticed a few things and have questions regarding them.

1.
Currently the parallel worker attaches only to the error queue. For the tuple
queue, do you expect the attach to be done in the same place, or in the
caller-supplied function? If the latter, then that function needs the segment
address as input so it can attach the queue to the segment (shm_mq_attach()).
Another question I have in this regard: if we have redirected messages to the
error queue using pq_redirect_to_shm_mq, then how can we set up a tuple queue
for the same purpose? Similarly, I think more handling is needed for the tuple
queue in the master backend, and the answer to the above will dictate the best
way to do it.

2.
Currently there is no interface like wait_for_workers_to_become_ready()
in your patch. Don't you think it is important that, before we start fetching
tuples, we make sure all workers have started? What if some worker
fails to start?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel mode and parallel contexts
Date: 2015-01-04 00:31:22
Message-ID: 20150104003122.GA29301@awork2.anarazel.de
Lists: pgsql-hackers

On 2014-12-22 14:14:31 -0500, Robert Haas wrote:
> On Fri, Dec 19, 2014 at 8:59 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> > And here is a new version.
>
> Here is another new version, with lots of bugs fixed.

A couple remarks:
* Shouldn't this provide more infrastructure to deal with the fact that
we might get less parallel workers than we had planned for?
* I wonder if parallel contexts shouldn't be tracked via resowners
* combocid.c should error out in parallel mode, as insurance
* I doubt it's a good idea to allow heap_insert, heap_inplace_update,
index_insert. I'm not convinced that those will be handled correctly and
relaxing restrictions is easier than adding them.
* I'd much rather separate HandleParallelMessageInterrupt() into one
variant that just tells the machinery there's an interrupt (called from
signal handlers) and one that actually handles them. We shouldn't even
consider adding any new code that does allocations, errors and such in
signal handlers. That's just a *bad* idea.
* I'm not a fan of the shm_toc stuff... Verbose and hard to read. I'd
much rather place a struct there and be careful about not using
pointers. That also would obviate the need to reserve some ids.
* Doesn't the restriction in GetSerializableTransactionSnapshotInt()
apply for repeatable read just the same?
* I'm not a fan of relying on GetComboCommandId() to restore things in
the same order as before.
* I'd say go ahead and commit the BgworkerByOid thing
earlier/independently. I've seen need for that a couple times.
* s/entrypints/entrypoints/

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel mode and parallel contexts
Date: 2015-01-05 15:23:31
Message-ID: CA+TgmoZFctU39uJmBXzOm=rW_GpP1jxUqdDQvrDtuHPnz5JK0g@mail.gmail.com
Lists: pgsql-hackers

On Fri, Jan 2, 2015 at 9:04 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> While working on the parallel seq-scan patch to adapt it to this framework, I
> noticed a few things and have questions regarding them.
>
> 1.
> Currently the parallel worker attaches only to the error queue. For the tuple
> queue, do you expect the attach to be done in the same place, or in the
> caller-supplied function? If the latter, then that function needs the segment
> address as input so it can attach the queue to the segment (shm_mq_attach()).
> Another question I have in this regard: if we have redirected messages to the
> error queue using pq_redirect_to_shm_mq, then how can we set up a tuple queue
> for the same purpose? Similarly, I think more handling is needed for the tuple
> queue in the master backend, and the answer to the above will dictate the best
> way to do it.

I've come to the conclusion that it's a bad idea for tuples to be sent
through the same queue as errors. We want errors (or notices, but
especially errors) to be processed promptly, but there may be a
considerable delay in processing tuples. For example, imagine a plan
that looks like this:

Nested Loop
  -> Parallel Seq Scan on p
  -> Index Scan on q
       Index Cond: q.x = p.x

The parallel workers should fill up the tuple queues used by the
parallel seq scan so that the master doesn't have to do any of that
work itself. Therefore, the normal situation will be that those tuple
queues are all full. If an error occurs in a worker at that point, it
can't add it to the tuple queue, because the tuple queue is full. But
even if it could do that, the master then won't notice the error until
it reads all of the queued-up tuple messages that are in the queue in
front of the error, and maybe some messages from the other queues as
well, since it probably round-robins between the queues or something
like that. Basically, it could do a lot of extra work before noticing
that error in there.

Now we could avoid that by having the master read messages from the
queue immediately and just save them off to local storage if they
aren't error messages. But that's not very desirable either, because
now we have no flow-control. The workers will just keep spamming
tuples that the master isn't ready for into the queues, and the master
will keep reading them and saving them to local storage, and
eventually it will run out of memory and die.

We could engineer some solution to this problem, of course, but it
seems quite a bit simpler to just have two queues. The error queues
don't need to be very big (I made them 16kB, which is trivial on any
system on which you care about having working parallelism) and the
tuple queues can be sized as needed to avoid pipeline stalls.
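Concretely, setting that up per worker is just a couple of shm_mq calls, something like this (sketch; the tuple-queue size and the variable names are illustrative, the 16kB figure is the one mentioned above, and toc is the segment's shm_toc):

#define PARALLEL_ERROR_QUEUE_SIZE   16384   /* errors/notices only, so small */
#define PARALLEL_TUPLE_QUEUE_SIZE   65536   /* illustrative; size to avoid stalls */

/* master, while laying out the DSM segment */
shm_mq *error_mq = shm_mq_create(shm_toc_allocate(toc, PARALLEL_ERROR_QUEUE_SIZE),
                                 PARALLEL_ERROR_QUEUE_SIZE);
shm_mq *tuple_mq = shm_mq_create(shm_toc_allocate(toc, PARALLEL_TUPLE_QUEUE_SIZE),
                                 PARALLEL_TUPLE_QUEUE_SIZE);
shm_mq_set_receiver(error_mq, MyProc);   /* master drains errors promptly */
shm_mq_set_receiver(tuple_mq, MyProc);   /* master drains tuples as it needs them */

/* each worker then calls shm_mq_set_sender() on its two queues, and points
 * its elog/ereport output at the error queue (pq_redirect_to_shm_mq) */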

> 2.
> Currently there is no interface like wait_for_workers_to_become_ready()
> in your patch. Don't you think it is important that, before we start fetching
> tuples, we make sure all workers have started? What if some worker
> fails to start?

I think that, in general, getting the most benefit out of parallelism
means *avoiding* situations where backends have to wait for each
other. If the relation being scanned is not too large, the user
backend might be able to finish the whole scan - or a significant
fraction of it - before the workers initialize. Of course in that
case it might have been a bad idea to parallelize in the first place,
but we should still try to make the best of the situation. If some
worker fails to start, then instead of having the full degree N
parallelism we were hoping for, we have some degree K < N, so things
will take a little longer, but everything should still work.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel mode and parallel contexts
Date: 2015-01-05 16:27:57
Message-ID: CA+TgmoZZ+vODB=dNKY_RSAHL0j0mUnZWNJX7ecvhy0pmyxNzAQ@mail.gmail.com
Lists: pgsql-hackers

On Sat, Jan 3, 2015 at 7:31 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> A couple remarks:
> * Shouldn't this provide more infrastructure to deal with the fact that
> we might get less parallel workers than we had planned for?

Maybe, but I'm not really sure what that should look like. My working
theory is that individual parallel applications (executor nodes, or
functions that use parallelism internally, or whatever) should be
coded in such a way that they work correctly even if the number of
workers that start is smaller than what they requested, or even zero.
It may turn out that this is impractical in some cases, but I don't
know what those cases are yet so I don't know what the common threads
will be.

I think parallel seq scan should work by establishing N tuple queues
(where N is the number of workers we have). When asked to return a
tuple, the master should poll all of those queues one after another
until it finds one that contains a tuple. If none of them contain a
tuple then it should claim a block to scan and return a tuple from
that block. That way, if there are enough running workers to keep up
with the master's demand for tuples, the master won't do any of the
actual scan itself. But if the workers can't keep up -- e.g. suppose
90% of the CPU consumed by the query is in the filter qual for the
scan -- then the master can pitch in along with everyone else. As a
non-trivial fringe benefit, if the workers don't all start, or take a
while to initialize, the user backend isn't stalled meanwhile.
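In pseudo-C, the master's "give me the next tuple" path would be roughly this (a sketch of the idea only; form_tuple_from_message, claim_next_block, scan_block, and shared_scan_state are hypothetical helpers, not code from any patch):

/* poll the workers' tuple queues round-robin, without blocking */
for (i = 0; i < nworkers; i++)
{
    int     q = (last_queue + i + 1) % nworkers;

    if (shm_mq_receive(tuple_queue[q], &nbytes, &data, true) == SHM_MQ_SUCCESS)
    {
        last_queue = q;
        return form_tuple_from_message(data, nbytes);
    }
}

/* nobody had a tuple ready, so pitch in: claim a block and scan it ourselves */
blkno = claim_next_block(shared_scan_state);    /* hypothetical shared counter */
if (BlockNumberIsValid(blkno))
    return scan_block(blkno);

/* no blocks left either: now block on the queues until the workers drain */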

> * I wonder if parallel contexts shouldn't be tracked via resowners

That is a good question. I confess that I'm somewhat fuzzy about
which things should be tracked via the resowner mechanism vs. which
things should have their own bespoke bookkeeping. However, the
AtEOXact_Parallel() stuff happens well before ResourceOwnerRelease(),
which makes merging them seem not particularly clean.

> * combocid.c should error out in parallel mode, as insurance

Eh, which function? HeapTupleHeaderGetCmin(),
HeapTupleHeaderGetCmax(), and AtEOXact_ComboCid() are intended to work
in parallel mode. HeapTupleHeaderAdjustCmax() could
Assert(!IsInParallelMode()) but the only calls are in heap_update()
and heap_delete() which already have error checks, so putting yet
another elog() there seems like overkill.

> * I doubt it's a good idea to allow heap_insert, heap_inplace_update,
> index_insert. I'm not convinced that those will be handled correctly and
> relaxing restrictions is easier than adding them.

I'm fine with adding checks to heap_insert() and
heap_inplace_update(). I'm not sure we really need to add anything to
index_insert(); how are we going to get there without hitting some
other prohibited operation first?
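(Concretely, for heap_insert() and heap_inplace_update() I'm thinking of something in the same style as the existing update/delete checks -- a sketch only:)

/* mirrors the existing checks in heap_update() and heap_delete() */
if (IsInParallelMode())
    ereport(ERROR,
            (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
             errmsg("cannot insert tuples during a parallel operation")));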

> * I'd much rather separate HandleParallelMessageInterrupt() into one
> variant that just tells the machinery there's an interrupt (called from
> signal handlers) and one that actually handles them. We shouldn't even
> consider adding any new code that does allocations, errors and such in
> signal handlers. That's just a *bad* idea.

That's a nice idea, but I didn't invent the way this crap works today.
ImmediateInterruptOK gets set to true while performing a lock wait,
and we need to be able to respond to errors while in that state. I
think there's got to be a better way to handle that than force every
asynchronous operation to have to cope with the fact that
ImmediateInterruptOK may be set or not set, but as of today that's
what you have to do.

> * I'm not a fan of the shm_toc stuff... Verbose and hard to read. I'd
> much rather place a struct there and be careful about not using
> pointers. That also would obviate the need to reserve some ids.

I don't see how that's going to work with variable-size data
structures, and a bunch of the things that we need to serialize are
variable-size. Moreover, I'm not really interested in rehashing this
design again. I know it's not your favorite; you've said that before.
But it makes it possible to write code to do useful things in
parallel, a capability that we do not otherwise have. And I like it
fine.
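For anyone who hasn't read that code, the shm_toc pattern boils down to this (sketch; the magic value, the keys, and the FixedState struct are made-up examples):

/* master: carve up the segment and publish each chunk under a key */
toc = shm_toc_create(MY_MAGIC, dsm_segment_address(seg), segsize);
fixed = shm_toc_allocate(toc, sizeof(FixedState));
shm_toc_insert(toc, KEY_FIXED_STATE, fixed);
gucspace = shm_toc_allocate(toc, guc_len);      /* variable-size data is fine */
shm_toc_insert(toc, KEY_GUC_STATE, gucspace);

/* worker: find the same chunks again by key */
toc = shm_toc_attach(MY_MAGIC, dsm_segment_address(seg));
fixed = shm_toc_lookup(toc, KEY_FIXED_STATE);

That's more typing than casting the segment to a struct, but it copes with the variable-size pieces.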

> * Doesn't the restriction in GetSerializableTransactionSnapshotInt()
> apply for repeatable read just the same?

Yeah. I'm not sure whether we really need that check at all, because
there is a check in GetTransactionSnapshot() that probably checks the
same thing.

> * I'm not a fan of relying on GetComboCommandId() to restore things in
> the same order as before.

I thought that was a little wonky, but it's not likely to break, and
there is an elog(ERROR) there to catch it if it does, so I'd rather
not make it more complicated.

> * I'd say go ahead and commit the BgworkerByOid thing
> earlier/independently. I've seen need for that a couple times.

Woohoo! I was hoping someone would say that.

> * s/entrypints/entrypoints/

Thanks.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel mode and parallel contexts
Date: 2015-01-05 17:01:07
Message-ID: 32660.1420477267@sss.pgh.pa.us
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Sat, Jan 3, 2015 at 7:31 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> * I wonder if parallel contexts shouldn't be tracked via resowners

> That is a good question. I confess that I'm somewhat fuzzy about
> which things should be tracked via the resowner mechanism vs. which
> things should have their own bespoke bookkeeping. However, the
> AtEOXact_Parallel() stuff happens well before ResourceOwnerRelease(),
> which makes merging them seem not particularly clean.

FWIW, the resowner mechanism was never meant as a substitute for bespoke
bookkeeping. What it is is a helper mechanism to reduce the need for
PG_TRY blocks that guarantee that a resource-releasing function will be
called even in error paths. I'm not sure whether that analogy applies
well in parallel-execution cases.

regards, tom lane


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel mode and parallel contexts
Date: 2015-01-05 23:21:32
Message-ID: CA+U5nMK52iL1R+REVBqW9OYu5gkfhSgRKHQLhd5rxhw0rUcGHw@mail.gmail.com
Lists: pgsql-hackers

On 22 December 2014 at 19:14, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> Here is another new version, with lots of bugs fixed.

An initial blind review, independent of other comments already made on thread.

OK src/backend/access/heap/heapam.c
heapam.c prohibitions on update and delete look fine

OK src/backend/access/transam/Makefile

OK src/backend/access/transam/README.parallel
README.parallel and all concepts look good

PARTIAL src/backend/access/transam/parallel.c
wait_patiently mentioned in comment but doesn’t exist
Why not make nworkers into a uint?
Trampoline? Really? I think we should define what we mean by that,
somewhere, rather than just use the term as if it was self-evident.
Entrypints?
..

OK src/backend/access/transam/varsup.c

QUESTIONS src/backend/access/transam/xact.c
These comments don’t have any explanation or justification

+ * This stores XID of the toplevel transaction to which we belong.
+ * Normally, it's the same as TopTransactionState.transactionId, but in
+ * a parallel worker it may not be.

+ * If we're a parallel worker, there may be additional XIDs that we need
+ * to regard as part of our transaction even though they don't appear
+ * in the transaction state stack.

This comment is copied-and-pasted too many times for safety and elegance
+ /*
+  * Workers synchronize transaction state at the beginning of each parallel
+  * operation, so we can't account for transaction state change after that
+  * point.  (Note that this check will certainly error out if s->blockState
+  * is TBLOCK_PARALLEL_INPROGRESS, so we can treat that as an invalid case
+  * below.)
+  */
We need a single Check() that contains more detail and comments within

Major comments

* Parallel stuff at start sounded OK
* Transaction stuff strikes me as overcomplicated and error prone.
Given there is no explanation or justification for most of it, I’d be
inclined to question why it's there
* If we have a trampoline for DSM, why don’t we use a trampoline for
snapshots, then you wouldn’t need to do all this serialize stuff

This is very impressive, but it looks like we’re trying to lift too
much weight on the first lift. If we want to go anywhere near this, we
need to have very clear documentation about how and why it's like that.
I’m actually very sorry to say that because the review started very
well, much much better than most.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel mode and parallel contexts
Date: 2015-01-06 03:59:35
Message-ID: CAA4eK1KgfbEyLCWaqU4Q5i1ETn6R1znf9L-W1AKQ_cq5iFiYJQ@mail.gmail.com
Lists: pgsql-hackers

On Mon, Jan 5, 2015 at 9:57 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> On Sat, Jan 3, 2015 at 7:31 PM, Andres Freund <andres(at)2ndquadrant(dot)com>
wrote:
>
> > * Doesn't the restriction in GetSerializableTransactionSnapshotInt()
> > apply for repeatable read just the same?
>
> Yeah. I'm not sure whether we really need that check at all, because
> there is a check in GetTransactionSnapshot() that probably checks the
> same thing.
>

The check in GetSerializableTransactionSnapshotInt() will also prohibit
any user/caller of SetSerializableTransactionSnapshot() in parallel mode
as that can't be prohibited by GetTransactionSnapshot().

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel mode and parallel contexts
Date: 2015-01-06 16:33:22
Message-ID: CA+TgmoYmp_=XcJEhvJZt9P8drBgW-pDpjHxBhZA79+M4o-CZQA@mail.gmail.com
Lists: pgsql-hackers

On Mon, Jan 5, 2015 at 6:21 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> PARTIAL src/backend/access/transam/parallel.c
> wait_patiently mentioned in comment but doesn’t exist

Woops. Fixed. Originally, DestroyParallelContext() had a Boolean
argument wait_patiently, specifying whether it should exit at once or
do the equivalent of WaitForParallelWorkersToFinish() first. But
that turned out not to work well - for example, in the included
parallel_count example, it's very important to (1) first wait for all
the workers to exit, (2) then read the final count, and (3) only after
that destroy the dynamic shared memory segment. So
WaitForParallelWorkersToFinish() got separated out into a different
function, but I failed to update the comments.
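In other words, the shutdown path in parallel_count now reads roughly like this (sketch; result_in_dsm stands in for wherever the shared count actually lives):

LaunchParallelWorkers(pcxt);
/* ... master counts its own share of the blocks ... */
WaitForParallelWorkersToFinish(pcxt);   /* (1) all workers have exited */
total = *result_in_dsm;                 /* (2) the shared count is now final */
DestroyParallelContext(pcxt);           /* (3) only now tear down the DSM segment */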

> Why not make nworkers into a uint?

We don't have or use a type called uint. I could make it uint32 or
uint64 or uint16, but I don't see much advantage in that. I could
also make it unsigned, but we make very little use of the unsigned
type generally, so I picked int as most consistent with our general
practice. If there's a consensus that something else is much better
than int, I will change it throughout, but I think we have bigger fish
to fry.

> Trampoline? Really? I think we should define what we mean by that,
> somewhere, rather than just use the term as if it was self-evident.

Comments improved.

> Entrypints?

Already noted by Andres; fixed in the attached version.

> These comments don’t have any explanation or justification

OK, I rewrote them. Hopefully it's better now.

> This comment is copied-and-pasted too many times for safety and elegance

It's not the same comment each time it appears; it appears in the
exact form you quoted just twice. It does get a little long-winded, I
guess, but I think that parallelism is a sufficiently-unfamiliar
concept to us as a group that it makes sense to have a detailed
comment in each place explaining exactly why that particular callsite
requires a check, so that's what I've tried to do.

> * Parallel stuff at start sounded OK
> * Transaction stuff strikes me as overcomplicated and error prone.
> Given there is no explanation or justification for most of it, I’d be
>> inclined to question why it's there

Gosh, I was pretty pleased with how simple the transaction integration
turned out to be. Most of what's there right now is either (a) code
to copy state from the master to the parallel workers or (b) code to
throw errors if the workers try to do things that aren't safe. I suspect
there are a few things missing, but I don't see anything there that
looks unnecessary.

> * If we have a trampoline for DSM, why don’t we use a trampoline for
> snapshots, then you wouldn’t need to do all this serialize stuff

The trampoline is just to let extensions use this infrastructure if
they want to; there's no way to avoid the serialization-and-restore
stuff unless we switch to creating the child processes using fork(),
but that would:

1. Not work on Windows.

2. Require the postmaster to deal with processes that are not its
immediate children.

3. Possibly introduce other bugs.

Since I've spent a year and a half trying to get this method to work
and am now, I think, almost there, I'm not particularly sanguine about
totally changing the approach.

> This is very impressive, but it looks like we’re trying to lift too
> much weight on the first lift. If we want to go anywhere near this, we
>> need to have very clear documentation about how and why it's like that.
> I’m actually very sorry to say that because the review started very
> well, much much better than most.

When I posted the group locking patch, it got criticized because it
didn't actually do anything useful by itself; similarly, the
pg_background patch was criticized for not being a large enough step
toward parallelism. So, this time, I posted something more
comprehensive. I don't think it's quite complete yet. I expect a
committable version of this patch to be maybe another 500-1000 lines
over what I have here right now -- I think it needs to do something
about heavyweight locking, and I expect that there are some unsafe
things that aren't quite prohibited yet. But the current patch is
only 2300 lines, which is not astonishingly large for a feature of
this magnitude; if anything, I'd say it's surprisingly small, due to a
year and a half of effort laying the necessary groundwork via a long
series of preliminary commits. I'm not unwilling to divide this up
some more if we can agree on a way to do that that makes sense, but I
think we're nearing the point where we need to take the plunge and
say, look, this is version one of parallelism. Thunk.

In addition to the changes mentioned above, the attached version
prohibits a few more things (as suggested by Andres) and passes the
dsm_segment to the user-supplied entrypoint (as requested off-list by
Andres, because otherwise you can't set up additional shm_mq
structures).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
parallel-mode-v0.3.patch text/x-patch 109.8 KB

From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel mode and parallel contexts
Date: 2015-01-06 20:04:04
Message-ID: CA+U5nMKaCpDN1VZ8Om6JFQnZ=6ECHBYpfXqHW5rJSgvePmvEnA@mail.gmail.com
Lists: pgsql-hackers

On 6 January 2015 at 16:33, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

>> These comments don’t have any explanation or justification
>
> OK, I rewrote them. Hopefully it's better now.

Thanks for new version and answers.

>> * Transaction stuff strikes me as overcomplicated and error prone.
>> Given there is no explanation or justification for most of it, I’d be
>> inclined to question why it's there
>
> Gosh, I was pretty pleased with how simple the transaction integration
> turned out to be. Most of what's there right now is either (a) code
> to copy state from the master to the parallel workers or (b) code to
> throw errors if the workers try to do things that aren't safe. I suspect
> there are a few things missing, but I don't see anything there that
> looks unnecessary.

If you can explain it in more detail in comments and README, I may
agree. At present, I don't get it and it makes me nervous.

The comment says
"Only the top frame of the transaction state stack is copied to a parallel
worker"
but I'm not sure why.

Top meaning the current subxact or the main xact?
If main, why do we need XactTopTransactionId?

>> This is very impressive, but it looks like we’re trying to lift too
>> much weight on the first lift. If we want to go anywhere near this, we
>> need to have very clear documentation about how and why it's like that.
>> I’m actually very sorry to say that because the review started very
>> well, much much better than most.
>
> When I posted the group locking patch, it got criticized because it
> didn't actually do anything useful by itself; similarly, the
> pg_background patch was criticized for not being a large enough step
> toward parallelism. So, this time, I posted something more
> comprehensive. I don't think it's quite complete yet. I expect a
> committable version of this patch to be maybe another 500-1000 lines
> over what I have here right now -- I think it needs to do something
> about heavyweight locking, and I expect that there are some unsafe
> things that aren't quite prohibited yet. But the current patch is
> only 2300 lines, which is not astonishingly large for a feature of
> this magnitude; if anything, I'd say it's surprisingly small, due to a
> year and a half of effort laying the necessary groundwork via a long
> series of preliminary commits. I'm not unwilling to divide this up
> some more if we can agree on a way to do that that makes sense, but I
> think we're nearing the point where we need to take the plunge and
> say, look, this is version one of parallelism. Thunk.

I want this also; the only debate is where to draw the line and please
don't see that as criticism.

I'm very happy it's so short, I agree it could be longer.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel mode and parallel contexts
Date: 2015-01-06 21:01:51
Message-ID: CA+TgmoazbDuaQ2mWqDHFEDtBqj_rYSn7BQQM=G6jmHT7C3RXJw@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jan 6, 2015 at 3:04 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> If you can explain it in more detail in comments and README, I may
> agree. At present, I don't get it and it makes me nervous.
>
> The comment says
> "Only the top frame of the transaction state stack is copied to a parallel
> worker"
> but I'm not sure why.
>
> Top meaning the current subxact or the main xact?
> If main, why do we need XactTopTransactionId?

Current subxact.

I initially thought of copying the entire TransactionStateData stack,
but Heikki suggested (via IM) that I do it this way instead. I
believe his concern was that it's never valid to commit or roll back
to a subtransaction that is not at the top of the stack, and if you
don't copy the stack, you avoid the risk of somehow ending up in that
state. Also, you avoid having to invent resource owners for
(sub)transactions that don't really exist in the current process. On
the other hand, you do end up with a few special cases that wouldn't
exist with the other approach. Still, I'm pretty happy to have taken
Heikki's advice: it was certainly simple to implement this way, plus
hopefully that way at least one person likes what I ended up with.
:-)

What else needs clarification?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel mode and parallel contexts
Date: 2015-01-06 21:37:36
Message-ID: CA+U5nM+Rwp2NbSBFJP_-vogwYH8k9V8OwZLFFr7odwYvTkzhxg@mail.gmail.com
Lists: pgsql-hackers

On 6 January 2015 at 21:01, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Tue, Jan 6, 2015 at 3:04 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>> If you can explain it in more detail in comments and README, I may
>> agree. At present, I don't get it and it makes me nervous.
>>
>> The comment says
>> "Only the top frame of the transaction state stack is copied to a parallel
>> worker"
>> but I'm not sure why.
>>
>> Top meaning the current subxact or the main xact?
>> If main, why do we need XactTopTransactionId?
>
> Current subxact.

TopTransactionStateData points to the top-level transaction data, but
XactTopTransactionId points to the current subxact.

So when you say "Only the top frame of the transaction state stack is
copied" you don't mean the top, you mean the bottom (the latest
subxact)? Which then becomes the top in the parallel worker? OK...

> I initially thought of copying the entire TransactionStateData stack,
> but Heikki suggested (via IM) that I do it this way instead. I
> believe his concern was that it's never valid to commit or roll back
> to a subtransaction that is not at the top of the stack, and if you
> don't copy the stack, you avoid the risk of somehow ending up in that
> state. Also, you avoid having to invent resource owners for
> (sub)transactions that don't really exist in the current process. On
> the other hand, you do end up with a few special cases that wouldn't
> exist with the other approach. Still, I'm pretty happy to have taken
> Heikki's advice: it was certainly simple to implement this way, plus
> hopefully that way at least one person likes what I ended up with.
> :-)
>
> What else needs clarification?

Those comments really belong in a README, not the first visible
comment in xact.c

You need to start with the explanation that parallel workers have a
faked-up xact stack to make it easier to copy and manage. That is
valid because we never change xact state during a worker operation.

I get it now and agree, but please work some more on clarity of
README, comments and variable naming!

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Jim Nasby <Jim(dot)Nasby(at)BlueTreble(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel mode and parallel contexts
Date: 2015-01-07 02:37:56
Message-ID: 54AC9C04.8060309@BlueTreble.com
Lists: pgsql-hackers

On 1/6/15, 10:33 AM, Robert Haas wrote:
>> >Entrypints?
> Already noted by Andres; fixed in the attached version.

Perhaps we only parallelize while drinking beer... ;)

CreateParallelContext(): Does it actually make sense to have nworkers=0? ISTM that's a bogus case. Also, since the number of workers will normally be determined dynamically by the planner, should that check be a regular conditional instead of just an Assert?

In LaunchParallelWorkers() the "Start Workers" comment states that we give up registering more workers if one fails to register, but there's nothing in the if condition to do that, and I don't see RegisterDynamicBackgroundWorker() doing it either. Is the comment just incorrect?

SerializeTransactionState(): instead of looping through the transaction stack to calculate nxids, couldn't we just set it to maxsize - sizeof(TransactionId) * 3? If we're looping a second time for safety a comment explaining that would be useful...

sequence.c: Is it safe to read a sequence value in a parallel worker if the cache_value is > 1?

This may be a dumb question, but for functions do we know that all pl's besides C and SQL use SPI? If not I think they could end up writing in a worker.

@@ -2968,7 +2969,8 @@ ProcessInterrupts(void)
errmsg("canceling statement due to user request")));
}
}
- /* If we get here, do nothing (probably, QueryCancelPending was reset) */
+ if (ParallelMessagePending)
+ HandleParallelMessageInterrupt(false);
ISTM it'd be good to leave that comment in place (after the if).

RestoreComboCIDState(): Assert(!comboCids || !comboHash): Shouldn't that be &&? AIUI both should always be either set or 0.

Typo: + elog(ERROR, "cannot update SecondarySnapshpt during a parallel operation");

The comment for RestoreSnapshot refers to set_transaction_snapshot, but that doesn't actually exist (or isn't referenced).
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel mode and parallel contexts
Date: 2015-01-07 10:04:44
Message-ID: CA+U5nMK5SOgyAzQ4xtwp1y_0tnBRsuB2RhREiGQOBXLRy42zwg@mail.gmail.com
Lists: pgsql-hackers

On 6 January 2015 at 21:37, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:

> I get it now and agree

Yes, very much.

Should we copy both the top-level frame and the current subxid?
Hot Standby links subxids directly to the top-level, so this would be similar.

If we copied both, we wouldn't need to special case the Get functions.
It would still be O(1).

> but please work some more on clarity of
> README, comments and variable naming!

Something other than "top" sounds good.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel mode and parallel contexts
Date: 2015-01-07 13:11:12
Message-ID: CA+TgmoYWbMmRX+1cr5HiwfMfndYD5WKcpK3wCQrJ4w_X=gjO1Q@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jan 6, 2015 at 4:37 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> So when you say "Only the top frame of the transaction state stack is
> copied" you don't mean the top, you mean the bottom (the latest
> subxact)? Which then becomes the top in the parallel worker? OK...

The item most recently added to the stack is properly called the top,
but I guess it's confusing in this case because the item on the bottom
of the stack is referred to as the TopTransaction. I'll see if I can
rephrase that.

> Those comments really belong in a README, not the first visible
> comment in xact.c

OK.

> You need to start with the explanation that parallel workers have a
> faked-up xact stack to make it easier to copy and manage. That is
> valid because we never change xact state during a worker operation.

OK.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel mode and parallel contexts
Date: 2015-01-07 15:34:13
Message-ID: CA+U5nM+PAprPg54yPkZMOrX5xoM_u48x+yLwMvKz+XCSoxEB=Q@mail.gmail.com
Lists: pgsql-hackers

On 7 January 2015 at 13:11, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Tue, Jan 6, 2015 at 4:37 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>> So when you say "Only the top frame of the transaction state stack is
>> copied" you don't mean the top, you mean the bottom (the latest
>> subxact)? Which then becomes the top in the parallel worker? OK...
>
> The item most recently added to the stack is properly called the top,
> but I guess it's confusing in this case because the item on the bottom
> of the stack is referred to as the TopTransaction. I'll see if I can
> rephrase that.

Yes, I didn't mean to suggest the confusion was introduced by you -
it's just you stepped into the mess by correctly using the word top in
a place where its meaning would be opposite to the close-by usage of
TopTransaction.

Anyway, feeling good about this now. Thanks for your patience.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel mode and parallel contexts
Date: 2015-01-07 15:39:09
Message-ID: CA+Tgmob_WLEJ7RCtzD5nrfQB0zrJ=2SdiXzhN9yqj3Ybwi+Vcg@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jan 6, 2015 at 9:37 PM, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com> wrote:
> CreateParallelContext(): Does it actually make sense to have nworkers=0?
> ISTM that's a bogus case.

I'm not sure whether we'll ever use the zero-worker case in
production, but I've certainly found it useful for
performance-testing.

> Also, since the number of workers will normally be
> determined dynamically by the planner, should that check be a regular
> conditional instead of just an Assert?

I don't think that's really necessary. It shouldn't be too much of a
stretch for the planner to come up with a non-negative value.

> In LaunchParallelWorkers() the "Start Workers" comment states that we give
> up registering more workers if one fails to register, but there's nothing in
> the if condition to do that, and I don't see
> RegisterDynamicBackgroundWorker() doing it either. Is the comment just
> incorrect?

Woops, that got changed at some point and I forgot to update the
comment. Will fix.

> SerializeTransactionState(): instead of looping through the transaction
> stack to calculate nxids, couldn't we just set it to maxsize -
> sizeof(TransactionId) * 3? If we're looping a second time for safety a
> comment explaining that would be useful...

Yeah, I guess we could do that. I don't think it really matters very
much one way or the other.

> sequence.c: Is it safe to read a sequence value in a parallel worker if the
> cache_value is > 1?

No, because the sequence cache isn't synchronized between the workers.
Maybe it would be safe if cache_value == 1, but there's not much
use-case: how often are you going to have a read-only query that uses
a sequence value. At some point we can look at making sequences
parallel-safe, but worrying about it right now doesn't seem like a
good use of time.

> This may be a dumb question, but for functions do we know that all pl's
> besides C and SQL use SPI? If not I think they could end up writing in a
> worker.

Well, the lower-level checks would catch that. But it is generally
true that there's no way to prevent arbitrary C code from doing things
that are unsafe in parallel mode and that we can't tell are unsafe.
As I've said before, I think that we'll need to have a method of
labeling functions as parallel-safe or not, but this patch isn't
trying to solve that part of the problem.

> @@ -2968,7 +2969,8 @@ ProcessInterrupts(void)
> errmsg("canceling statement due to
> user request")));
> }
> }
> - /* If we get here, do nothing (probably, QueryCancelPending was
> reset) */
> + if (ParallelMessagePending)
> + HandleParallelMessageInterrupt(false);
> ISTM it'd be good to leave that comment in place (after the if).
>
> RestoreComboCIDState(): Assert(!comboCids || !comboHash): Shouldn't that be
> &&? AIUI both should always be either set or 0.

Fixed.

> Typo: + elog(ERROR, "cannot update SecondarySnapshpt during a
> parallel operation");

Fixed.

> The comment for RestoreSnapshot refers to set_transaction_snapshot, but that
> doesn't actually exist (or isn't referenced).

Fixed.

Will post a new version in a bit.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel mode and parallel contexts
Date: 2015-01-07 17:33:56
Message-ID: CA+TgmoZfkKWoMmcy3RUT43YwzGW8BEf2Opzc3_+wwzV99gSoWw@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jan 7, 2015 at 10:34 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> Yes, I didn't mean to suggest the confusion was introduced by you -
> it's just you stepped into the mess by correctly using the word top in
> a place where its meaning would be opposite to the close-by usage of
> TopTransaction.
>
> Anyway, feeling good about this now. Thanks for your patience.

Thanks for the kind words, and the review. Here's a new version with
a much-expanded README and some other changes to the comments to
hopefully address some of your other concerns. I also fixed a couple
of other problems while I was at it: the comments in xact.c claimed
that parallel shutdown waits for workers to exit, but parallel.c
didn't know anything about that. Also, I fixed it so that when an
error is propagated from the parallel worker to the user backend, the
error context in effect at the time the parallel context was created
is used, rather than whatever is in effect when we notice the error.

I have little doubt that this version is still afflicted with various
bugs, and the heavyweight locking issue remains to be dealt with, but
on the whole I think this is headed in the right direction.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
parallel-mode-v0.4.patch text/x-patch 122.7 KB

From: Jim Nasby <Jim(dot)Nasby(at)BlueTreble(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel mode and parallel contexts
Date: 2015-01-07 20:54:06
Message-ID: 54AD9CEE.9000601@BlueTreble.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 1/7/15, 9:39 AM, Robert Haas wrote:
>> sequence.c: Is it safe to read a sequence value in a parallel worker if the
>> >cache_value is > 1?
> No, because the sequence cache isn't synchronized between the workers.
> Maybe it would be safe if cache_value == 1, but there's not much
> use-case: how often are you going to have a read-only query that uses
> a sequence value. At some point we can look at making sequences
> parallel-safe, but worrying about it right now doesn't seem like a
> good use of time.

Agreed, I was more concerned with calls to nextval(), which don't seem to be prevented in parallel mode?

>> >This may be a dumb question, but for functions do we know that all pl's
>> >besides C and SQL use SPI? If not I think they could end up writing in a
>> >worker.
> Well, the lower-level checks would catch that. But it is generally
> true that there's no way to prevent arbitrary C code from doing things
> that are unsafe in parallel mode and that we can't tell are unsafe.
> As I've said before, I think that we'll need to have a method of
> labeling functions as parallel-safe or not, but this patch isn't
> trying to solve that part of the problem.

I was more thinking about all the add-on pl's like pl/ruby. But yeah, I don't see that there's much we can do if they're not using SPI.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel mode and parallel contexts
Date: 2015-01-08 11:52:26
Message-ID: CAA4eK1KaNw6Rsr1eCv8tD4FPJFOurGCYU-HdsuOEkuJ9WNdaRw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Jan 7, 2015 at 11:03 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> I have little doubt that this version is still afflicted with various
> bugs, and the heavyweight locking issue remains to be dealt with, but
> on the whole I think this is headed in the right direction.
>

+ParallelMain(Datum main_arg)
{
..
+ /*
+ * Now that we have a resource owner, we can attach to the dynamic
+ * shared memory segment and read the table of contents.
+ */
+ seg = dsm_attach(DatumGetInt32(main_arg));

Here, I think DatumGetUInt32() needs to be used instead of
DatumGetInt32() as the segment handle is uint32.
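
In other words, just the one-line change (illustrative sketch; dsm_handle
is declared as uint32, so the unsigned accessor matches its type):

    seg = dsm_attach(DatumGetUInt32(main_arg));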

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel mode and parallel contexts
Date: 2015-01-12 19:38:36
Message-ID: CA+TgmoZmkwkDJo85KpPLaL1h4mcQtX_muj2OrXryQr_8SAb0eQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Jan 7, 2015 at 3:54 PM, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com> wrote:
> Agreed, I was more concerned with calls to nextval(), which don't seem to be
> prevented in parallel mode?

It looks like it is prevented:

/*
* Forbid this during parallel operation because, to make it work,
* the cooperating backends would need to share the backend-local cached
* sequence information. Currently, we don't support that.
*/
PreventCommandIfParallelMode("nextval()");

> I was more thinking about all the add-on pl's like pl/ruby. But yeah, I
> don't see that there's much we can do if they're not using SPI.

Actually, there is: it's forbidden at the layer of heap_insert(),
heap_update(), heap_delete(). It's hard to imagine anyone trying to
modify the database at a lower level than that.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel mode and parallel contexts
Date: 2015-01-12 20:03:50
Message-ID: CA+TgmoZuR3CN7P8RMq8pi=NBJf=EQnB9TTDSKAK6yoLiMUpbmQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Jan 8, 2015 at 6:52 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> + seg = dsm_attach(DatumGetInt32(main_arg));
>
> Here, I think DatumGetUInt32() needs to be used instead of
> DatumGetInt32() as the segment handle is uint32.

OK, I'll change that in the next version.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel mode and parallel contexts
Date: 2015-01-15 12:00:41
Message-ID: CAA4eK1JOyp5bby-_m76RSwWNMPfg1yptgd11fz2ctfx9ydtVPQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Jan 13, 2015 at 1:33 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> On Thu, Jan 8, 2015 at 6:52 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
wrote:
> > + seg = dsm_attach(DatumGetInt32(main_arg));
> >
> > Here, I think DatumGetUInt32() needs to be used instead of
> > DatumGetInt32() as the segment handle is uint32.
>
> OK, I'll change that in the next version.
>

No issues, I have another question related to below code:

+HandleParallelMessages(void)
+{
..
..
+ for (i = 0; i < pcxt->nworkers; ++i)
+ {
+ /*
+ * Read messages for as long as we have an error queue; if we
+ * have hit (or hit while reading) ReadyForQuery, this will go to
+ * NULL.
+ */
+ while (pcxt->worker[i].error_mqh != NULL)
+ {
+ shm_mq_result res;
+
+ CHECK_FOR_INTERRUPTS();
+
+ res = shm_mq_receive(pcxt->worker[i].error_mqh, &nbytes,
+ &data, true);
+ if (res == SHM_MQ_SUCCESS)

Here we check the error queue for each worker, and this loop continues
until all of them have sent the ReadyForQuery ('Z') message, which means
it keeps going until every worker has finished its work. Consider a
situation where the first worker has completed its work and sent the 'Z'
message while the second worker is still sending tuples: the above code
will keep waiting for the 'Z' message from the second worker and won't
allow the tuples sent by the second worker to be received until it sends 'Z'.

As each worker sends its own 'Z' message after completion, ideally
the above code should receive the message only for the worker that
actually sent it. I think for that it needs to know which worker has
sent the message.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel mode and parallel contexts
Date: 2015-01-15 13:22:59
Message-ID: CA+TgmoagwFJyKVkt7b9m=vLiHM16De3rbykTVyQ5f46qiSHJFw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Jan 15, 2015 at 7:00 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> On Tue, Jan 13, 2015 at 1:33 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> On Thu, Jan 8, 2015 at 6:52 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
>> wrote:
>> > + seg = dsm_attach(DatumGetInt32(main_arg));
>> >
>> > Here, I think DatumGetUInt32() needs to be used instead of
>> > DatumGetInt32() as the segment handle is uint32.
>>
>> OK, I'll change that in the next version.
>>
>
> No issues, I have another question related to below code:
>
> +HandleParallelMessages(void)
> +{
> ..
> ..
> + for (i = 0; i < pcxt->nworkers; ++i)
> + {
> + /*
> + * Read messages for as long as we have an error queue; if we
> + * have hit (or hit while reading) ReadyForQuery, this will go to
> + * NULL.
> + */
> + while (pcxt->worker[i].error_mqh != NULL)
> + {
> + shm_mq_result res;
> +
> + CHECK_FOR_INTERRUPTS();
> +
> + res = shm_mq_receive(pcxt->worker[i].error_mqh, &nbytes,
> + &data, true);
> + if (res == SHM_MQ_SUCCESS)
>
> Here we are checking the error queue for all the workers and this loop
> will continue untill all have sent ReadyForQuery() message ('Z') which
> will make this loop continue till all workers have finished their work.
> Assume situation where first worker has completed the work and sent
> 'Z' message and second worker is still sending some tuples, now above
> code will keep on waiting for 'Z' message from second worker and won't
> allow to receive tuples sent by second worker till it send 'Z' message.
>
> As each worker send its own 'Z' message after completion, so ideally
> the above code should receive the message only for worker which has
> sent the message. I think for that it needs worker information who has
> sent the message.

Are you talking about HandleParallelMessages() or
WaitForParallelWorkersToFinish()? The former doesn't wait for
anything; it just handles any messages that are available now. The
latter does wait for all workers to finish, but the intention is that
you only call it when you're ready to wind up the entire parallel
operation, so that's OK.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel mode and parallel contexts
Date: 2015-01-15 14:09:51
Message-ID: CAA4eK1L3QUWYX7mj10xPwuEbuEEAjrgd=c2jBWrSAjT-CsJ8KA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Jan 15, 2015 at 6:52 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> On Thu, Jan 15, 2015 at 7:00 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
wrote:
> > +HandleParallelMessages(void)
> > +{
> > ..
> > ..
> > + for (i = 0; i < pcxt->nworkers; ++i)
> > + {
> > + /*
> > + * Read messages for as long as we have an error queue; if we
> > + * have hit (or hit while reading) ReadyForQuery, this will go to
> > + * NULL.
> > + */
> > + while (pcxt->worker[i].error_mqh != NULL)
> > + {
> > + shm_mq_result res;
> > +
> > + CHECK_FOR_INTERRUPTS();
> > +
> > + res = shm_mq_receive(pcxt->worker[i].error_mqh, &nbytes,
> > + &data, true);
> > + if (res == SHM_MQ_SUCCESS)
> >
> > Here we are checking the error queue for all the workers and this loop
> > will continue untill all have sent ReadyForQuery() message ('Z') which
> > will make this loop continue till all workers have finished their work.
> > Assume situation where first worker has completed the work and sent
> > 'Z' message and second worker is still sending some tuples, now above
> > code will keep on waiting for 'Z' message from second worker and won't
> > allow to receive tuples sent by second worker till it send 'Z' message.
> >
> > As each worker send its own 'Z' message after completion, so ideally
> > the above code should receive the message only for worker which has
> > sent the message. I think for that it needs worker information who has
> > sent the message.
>
> Are you talking about HandleParallelMessages() or
> WaitForParallelWorkersToFinish()? The former doesn't wait for
> anything; it just handles any messages that are available now.

I am talking about HandleParallelMessages(). It doesn't wait, but
it keeps looping, which will make it run for a long time as explained
above. Just imagine a case where there are two workers, the first
worker has sent the 'Z' message, and the second worker is still doing
some work; in that scenario the loop will not finish until the second
worker also sends a 'Z' message or an error. Am I missing something?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel mode and parallel contexts
Date: 2015-01-15 14:54:23
Message-ID: CA+TgmoZDuaAi0UMNTPb5TV8yQQVVa58+89nUtqM8jJ6G6FyWEw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Jan 15, 2015 at 9:09 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> On Thu, Jan 15, 2015 at 6:52 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> On Thu, Jan 15, 2015 at 7:00 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
>> wrote:
>> > +HandleParallelMessages(void)
>> > +{
>> > ..
>> > ..
>> > + for (i = 0; i < pcxt->nworkers; ++i)
>> > + {
>> > + /*
>> > + * Read messages for as long as we have an error queue; if we
>> > + * have hit (or hit while reading) ReadyForQuery, this will go to
>> > + * NULL.
>> > + */
>> > + while (pcxt->worker[i].error_mqh != NULL)
>> > + {
>> > + shm_mq_result res;
>> > +
>> > + CHECK_FOR_INTERRUPTS();
>> > +
>> > + res = shm_mq_receive(pcxt->worker[i].error_mqh, &nbytes,
>> > + &data, true);
>> > + if (res == SHM_MQ_SUCCESS)
>> >
>> > Here we are checking the error queue for all the workers and this loop
>> > will continue untill all have sent ReadyForQuery() message ('Z') which
>> > will make this loop continue till all workers have finished their work.
>> > Assume situation where first worker has completed the work and sent
>> > 'Z' message and second worker is still sending some tuples, now above
>> > code will keep on waiting for 'Z' message from second worker and won't
>> > allow to receive tuples sent by second worker till it send 'Z' message.
>> >
>> > As each worker send its own 'Z' message after completion, so ideally
>> > the above code should receive the message only for worker which has
>> > sent the message. I think for that it needs worker information who has
>> > sent the message.
>>
>> Are you talking about HandleParallelMessages() or
>> WaitForParallelWorkersToFinish()? The former doesn't wait for
>> anything; it just handles any messages that are available now.
>
> I am talking about HandleParallelMessages(). It doesn't wait but
> it is looping which will make it run for longer time as explained
> above. Just imagine a case where there are two workers and first
> worker has sent 'Z' message and second worker is doing some
> work, now in such a scenario loop will not finish until second worker
> also send 'Z' message or error. Am I missing something?

Blah. You're right. I intended to write this loop so that it only
runs until shm_mq_receive() returns SHM_MQ_WOULD_BLOCK. But that's
not what I did. Will fix.
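
Just to sketch the shape I had in mind (untested, and not necessarily the
exact fix I'll end up committing), the inner loop should drain whatever is
already available and bail out as soon as the queue would block:

    while (pcxt->worker[i].error_mqh != NULL)
    {
        shm_mq_result res;

        CHECK_FOR_INTERRUPTS();

        res = shm_mq_receive(pcxt->worker[i].error_mqh, &nbytes,
                             &data, true);
        if (res == SHM_MQ_WOULD_BLOCK)
            break;          /* nothing more to read right now; move on */
        else if (res == SHM_MQ_SUCCESS)
        {
            /* process the message; clear error_mqh on 'Z', as before */
        }
        else
        {
            /* queue detached or error: clean up error_mqh, as before */
        }
    }

That way the function only handles messages that have already arrived,
instead of blocking on workers that haven't sent 'Z' yet.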

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel mode and parallel contexts
Date: 2015-01-16 13:05:29
Message-ID: 20150116130529.GC16991@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2015-01-05 11:27:57 -0500, Robert Haas wrote:
> On Sat, Jan 3, 2015 at 7:31 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> > * I wonder if parallel contexts shouldn't be tracked via resowners
>
> That is a good question. I confess that I'm somewhat fuzzy about
> which things should be tracked via the resowner mechanism vs. which
> things should have their own bespoke bookkeeping. However, the
> AtEOXact_Parallel() stuff happens well before ResourceOwnerRelease(),
> which makes merging them seem not particularly clean.

I'm not sure either. But I think the current location is wrong anyway -
during AtEOXact_Parallel() before running user defined queries via
AfterTriggerFireDeferred() seems wrong.

> > * combocid.c should error out in parallel mode, as insurance
>
> Eh, which function? HeapTupleHeaderGetCmin(),
> HeapTupleHeaderGetCmax(), and AtEOXact_ComboCid() are intended to work
> in parallel mode. HeapTupleHeaderAdjustCmax() could
> Assert(!IsInParallelMode()) but the only calls are in heap_update()
> and heap_delete() which already have error checks, so putting yet
> another elog() there seems like overkill.

To me it seems like a good idea, but whatever.

> > * I doubt it's a good idea to allow heap_insert, heap_inplace_update,
> > index_insert. I'm not convinced that those will be handled correct and
> > relaxing restrictions is easier than adding them.
>
> I'm fine with adding checks to heap_insert() and
> heap_inplace_update(). I'm not sure we really need to add anything to
> index_insert(); how are we going to get there without hitting some
> other prohibited operation first?

I'm not sure. But it's not that hard to imagine that somebody will start
adding codepaths that insert into indexes first. Think upsert.

> > * I'd much rather separate HandleParallelMessageInterrupt() in one
> > variant that just tells the machinery there's a interrupt (called from
> > signal handlers) and one that actually handles them. We shouldn't even
> > consider adding any new code that does allocations, errors and such in
> > signal handlers. That's just a *bad* idea.
>
> That's a nice idea, but I didn't invent the way this crap works today.
> ImmediateInterruptOK gets set to true while performing a lock wait,
> and we need to be able to respond to errors while in that state. I
> think there's got to be a better way to handle that than force every
> asynchronous operation to have to cope with the fact that
> ImmediateInterruptOK may be set or not set, but as of today that's
> what you have to do.

I personally think it's absolutely unacceptable to add any more of
that. That the current mess works is more luck than anything else - and
I'm pretty sure there are several bugs in it. But since I realize I can't
force you to develop an alternative solution, I tried to implement enough
infrastructure to deal with it without too much work.

As far as I can see this could relatively easily be implemented ontop of
the removal of ImmediateInterruptOK in the patch series I posted
yesterday? Afaics you just need to remove most of
+HandleParallelMessageInterrupt() and replace it by a single SetLatch().

> > * I'm not a fan of the shm_toc stuff... Verbose and hard to read. I'd
> > much rather place a struct there and be careful about not using
> > pointers. That also would obliviate the need to reserve some ids.
>
> I don't see how that's going to work with variable-size data
> structures, and a bunch of the things that we need to serialize are
> variable-size.

Meh. Just appending the variable data to the end of the structure and
calculating offsets works just fine.

> Moreover, I'm not really interested in rehashing this
> design again. I know it's not your favorite; you've said that before.

Well, if I keep having to read code using it, I'll keep being annoyed by
it. I guess we both have to live with that.

Greetings,

Andres Freund


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel mode and parallel contexts
Date: 2015-01-16 22:10:14
Message-ID: CA+TgmoZdUK4K3XHBxc9vM-82khourEZdvQWTfgLhWsd2R2aAGQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Jan 16, 2015 at 8:05 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> I'm not sure either. But I think the current location is wrong anyway -
> during AtEOXact_Parallel() before running user defined queries via
> AfterTriggerFireDeferred() seems wrong.

Good point.

>> I'm fine with adding checks to heap_insert() and
>> heap_inplace_update(). I'm not sure we really need to add anything to
>> index_insert(); how are we going to get there without hitting some
>> other prohibited operation first?
>
> I'm not sure. But it's not that hard to imagine that somebody will start
> adding codepaths that insert into indexes first. Think upsert.

OK, but what's the specific reason it's unsafe? The motivation for
prohibiting update and delete is that, if a new combo CID were to be
created mid-scan, we have no way to make other workers aware of it.
There's no special reason to think that heap_insert() or
heap_inplace_update() are unsafe, but, sure, we can prohibit them for
symmetry. If we're going to start extending the net further and
further, we should have specific comments explaining what the hazards are.
People will eventually want to relax these restrictions, I think, and
there's nothing scarier than removing a prohibition that has
absolutely no comments to explain why that thing was restricted in the
first place.
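
To be concrete about what I'd want such a check to look like if we add one
to heap_insert() - just a sketch, with wording and error code up for debate:

    /*
     * Forbid this during a parallel operation.  There's no combo CID hazard
     * here as there is for update/delete; we disallow it out of caution and
     * for symmetry with those checks, and this comment is here so that
     * whoever eventually relaxes the restriction knows that's all it was.
     */
    if (IsInParallelMode())
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
                 errmsg("cannot insert tuples during a parallel operation")));

The check itself is cheap; it's the explanatory comment that I really care
about.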

> As far as I can see this could relatively easily be implemented ontop of
> the removal of ImmediateInterruptOK in the patch series I posted
> yesterday? Afaics you just need to remove most of
> +HandleParallelMessageInterrupt() and replace it by a single SetLatch().

That would be swell. I'll have a look, when I have time, or when it's
committed, whichever happens first.

>> > * I'm not a fan of the shm_toc stuff... Verbose and hard to read. I'd
>> > much rather place a struct there and be careful about not using
>> > pointers. That also would obliviate the need to reserve some ids.
>>
>> I don't see how that's going to work with variable-size data
>> structures, and a bunch of the things that we need to serialize are
>> variable-size.
>
> Meh. Just appending the variable data to the end of the structure and
> calculating offsets works just fine.

I think coding it all up ad-hoc would be pretty bug-prone. This
provides some structure, where each module only needs to know about
its own serialization format. To me, that's a lot easier to work
with.

New patch attached. I'm going to take the risk of calling this v1
(previous versions have been 0.x), since I've now done something about
the heavyweight locking issue, as well as fixed the message-looping
bug Amit pointed out. It doubtless needs more work, but it's starting
to smell a bit more like a real patch.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
parallel-mode-v1.patch text/x-patch 135.2 KB

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel mode and parallel contexts
Date: 2015-01-20 14:41:55
Message-ID: CAA4eK1K8mZ+B1pSUhaij9G0bSUSKhwuWy6cCvx17ZEYup16MLg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Jan 17, 2015 at 3:40 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> New patch attached. I'm going to take the risk of calling this v1
> (previous versions have been 0.x), since I've now done something about
> the heavyweight locking issue, as well as fixed the message-looping
> bug Amit pointed out. It doubtless needs more work, but it's starting
> to smell a bit more like a real patch.
>

I need some clarification regarding below code:

+BgwHandleStatus
+WaitForBackgroundWorkerShutdown(BackgroundWorkerHandle *handle)
+{
+ BgwHandleStatus status;
+ int rc;
+ bool save_set_latch_on_sigusr1;
+
+ save_set_latch_on_sigusr1 = set_latch_on_sigusr1;
+ set_latch_on_sigusr1 = true;
+
+ PG_TRY();
+ {
+ for (;;)
+ {
+ pid_t pid;
+
+ CHECK_FOR_INTERRUPTS();
+
+ status = GetBackgroundWorkerPid(handle, &pid);
+ if (status == BGWH_STOPPED)
+ return status;
+
+ rc = WaitLatch(&MyProc->procLatch,
+ WL_LATCH_SET | WL_POSTMASTER_DEATH, 0);
+
+ if (rc & WL_POSTMASTER_DEATH)
+ return BGWH_POSTMASTER_DIED;

It seems this code has the possibility of waiting forever.
Assume one of the workers is not able to start (not able to attach
to shared memory, or for some other reason); then the status returned by
GetBackgroundWorkerPid() will be BGWH_NOT_YET_STARTED,
and after that it can wait forever in WaitLatch.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel mode and parallel contexts
Date: 2015-01-20 16:22:48
Message-ID: CA+TgmoYZQWuDPF0a3_tw8t+af7ZwoyMSS3A4ybOS0rcV9fb0xw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Jan 20, 2015 at 9:41 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> It seems [WaitForBackgroundWorkerShutdown] has possibility to wait forever.
> Assume one of the worker is not able to start (not able to attach
> to shared memory or some other reason), then status returned by
> GetBackgroundWorkerPid() will be BGWH_NOT_YET_STARTED
> and after that it can wait forever in WaitLatch.

I don't think that's right. The status only remains
BGWH_NOT_YET_STARTED until the postmaster forks the child process. At
that point it immediately changes to BGWH_STARTED. If it starts up
and then dies early on, for example because it can't attach to shared
memory or somesuch, the status will change to BGWH_STOPPED.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel mode and parallel contexts
Date: 2015-01-21 07:11:21
Message-ID: CAA4eK1Lrbhcq-s-_vjT00U_Gm8JKyna9B+9mCrzhKEuj9=X5UA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Jan 20, 2015 at 9:52 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> On Tue, Jan 20, 2015 at 9:41 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
wrote:
> > It seems [WaitForBackgroundWorkerShutdown] has possibility to wait
forever.
> > Assume one of the worker is not able to start (not able to attach
> > to shared memory or some other reason), then status returned by
> > GetBackgroundWorkerPid() will be BGWH_NOT_YET_STARTED
> > and after that it can wait forever in WaitLatch.
>
> I don't think that's right. The status only remains
> BGWH_NOT_YET_STARTED until the postmaster forks the child process.

I think the control flow can reach the above location before the
postmaster has forked all the workers. Consider a case where
we have launched all workers during the ExecutorStart phase,
and then, before the postmaster could start all of them, an error
occurs in the master backend; it then tries to abort the transaction
and destroy the parallel context, at which point it will get the
above status and wait forever in the above code.

I am able to reproduce the above scenario with the debugger by using
the parallel_seqscan patch.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel mode and parallel contexts
Date: 2015-01-21 13:05:00
Message-ID: CA+TgmoaAOHSmwJtJQ5unaeR=Z_saQX6SThi7oZtfFEsDsN=G1w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Jan 21, 2015 at 2:11 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> On Tue, Jan 20, 2015 at 9:52 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> On Tue, Jan 20, 2015 at 9:41 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
>> wrote:
>> > It seems [WaitForBackgroundWorkerShutdown] has possibility to wait
>> > forever.
>> > Assume one of the worker is not able to start (not able to attach
>> > to shared memory or some other reason), then status returned by
>> > GetBackgroundWorkerPid() will be BGWH_NOT_YET_STARTED
>> > and after that it can wait forever in WaitLatch.
>>
>> I don't think that's right. The status only remains
>> BGWH_NOT_YET_STARTED until the postmaster forks the child process.
>
> I think the control flow can reach the above location before
> postmaster could fork all the workers. Consider a case that
> we have launched all workers during ExecutorStart phase
> and then before postmaster could start all workers an error
> occurs in master backend and then it try to Abort the transaction
> and destroy the parallel context, at that moment it will get the
> above status and wait forever in above code.
>
> I am able to reproduce above scenario with debugger by using
> parallel_seqscan patch.

Hmm. Well, if you can reproduce it, there clearly must be a bug. But
I'm not quite sure where. What should happen in that case is that the
process that started the worker has to wait for the postmaster to
actually start it, and then after that for the new process to die, and
then it should return.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel mode and parallel contexts
Date: 2015-01-21 14:10:04
Message-ID: CAA4eK1JKq3T7PijBoVNhzscgOUNHSbSSFu1GkktPAD=Q0g-DCQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Jan 21, 2015 at 6:35 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> On Wed, Jan 21, 2015 at 2:11 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
wrote:
> > On Tue, Jan 20, 2015 at 9:52 PM, Robert Haas <robertmhaas(at)gmail(dot)com>
wrote:
> >> On Tue, Jan 20, 2015 at 9:41 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
> >> wrote:
> >> > It seems [WaitForBackgroundWorkerShutdown] has possibility to wait
> >> > forever.
> >> > Assume one of the worker is not able to start (not able to attach
> >> > to shared memory or some other reason), then status returned by
> >> > GetBackgroundWorkerPid() will be BGWH_NOT_YET_STARTED
> >> > and after that it can wait forever in WaitLatch.
> >>
> >> I don't think that's right. The status only remains
> >> BGWH_NOT_YET_STARTED until the postmaster forks the child process.
> >
> > I think the control flow can reach the above location before
> > postmaster could fork all the workers. Consider a case that
> > we have launched all workers during ExecutorStart phase
> > and then before postmaster could start all workers an error
> > occurs in master backend and then it try to Abort the transaction
> > and destroy the parallel context, at that moment it will get the
> > above status and wait forever in above code.
> >
> > I am able to reproduce above scenario with debugger by using
> > parallel_seqscan patch.
>
> Hmm. Well, if you can reproduce it, there clearly must be a bug. But
> I'm not quite sure where. What should happen in that case is that the
> process that started the worker has to wait for the postmaster to
> actually start it,

Okay, I think this should solve the issue; also, it should be done
before the call to TerminateBackgroundWorker().
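
That is, roughly this ordering per worker (just a sketch to show what I mean;
where exactly it fits in the cleanup path is up to you):

    BgwHandleStatus status;
    pid_t       pid;

    /* Make sure the postmaster has actually started the worker ... */
    status = WaitForBackgroundWorkerStartup(pcxt->worker[i].bgwhandle, &pid);

    /* ... before asking it to stop and waiting for it to go away. */
    TerminateBackgroundWorker(pcxt->worker[i].bgwhandle);
    status = WaitForBackgroundWorkerShutdown(pcxt->worker[i].bgwhandle);

With the startup wait done first, the shutdown wait can no longer see
BGWH_NOT_YET_STARTED and hang.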

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel mode and parallel contexts
Date: 2015-01-30 17:08:57
Message-ID: CA+TgmobZuKjLQuOBF+N8Z+6CFyBezBg8oEm2eK-L6ZXCQeVjZg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Jan 16, 2015 at 5:10 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> New patch attached. I'm going to take the risk of calling this v1
> (previous versions have been 0.x), since I've now done something about
> the heavyweight locking issue, as well as fixed the message-looping
> bug Amit pointed out. It doubtless needs more work, but it's starting
> to smell a bit more like a real patch.

Here's a new version. Andres mentioned previously that he thought it
would be a good idea to commit the addition of
BackgroundWorkerInitializeConnectionByOid() separately, as he's had
cause to want it independently of the rest of this stuff. It would be
useful for pg_background, too. So I've broken that out into a
separate patch here (bgworker-by-oid.patch) and will commit that RSN
unless somebody thinks it's a bad idea for some reason. AFAICS it
should be uncontroversial.

The main patch (parallel-mode-v2.patch) has evolved just a bit more
since the previous version. It now arranges for all libraries
libraries which we'd dynamically loaded into the original process to
be loaded into the worker as well, which fixes a possible failure when
those libraries define custom GUCs that need to be properly restored.
I'm wondering how much more there really is to do here. From the
parallel sequential scan discussion, and off-list conversations with
Amit, it's becoming clear to me that there are a bunch of things that
are generic to query planning and execution that this patch doesn't
currently consider, so Amit's handling them in the parallel sequential
scan patch. It seems likely that some of that stuff should be pulled
out of that patch and formed into some more generic infrastructure.

But I don't favor putting that stuff in this patch, because there are
already useful things you can do with what I've got here. If you want
to write a parallel version of some SQL-callable function, for
example, you can do that without anything more than this. You could
probably also parallelize utility commands with just this
infrastructure. The stuff that may need to be pulled out of Amit's
patch is generic to query planning and execution, but probably not for
other applications of server-side parallelism. That seems like a
relatively coherent place to draw a line. So I'm feeling like it
might be just as well to push this much into the tree and move on to
the next set of problems. Of course, I don't want to preempt anyone's
desire to review and comment on this further, either.

The final patch attached here (parallel-dummy-v2.patch) has been
updated slightly to incorporate some prefetching logic. It's still
just demo code and is not intended for commit. I'm not sure whether
the prefetching logic can actually be made to improve performance,
either; if anyone feels like playing with that and reporting results
back, that would be swell.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
bgworker-by-oid.patch binary/octet-stream 10.1 KB
parallel-mode-v2.patch binary/octet-stream 123.7 KB
parallel-dummy-v2.patch binary/octet-stream 8.9 KB

From: Jim Nasby <Jim(dot)Nasby(at)BlueTreble(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel mode and parallel contexts
Date: 2015-01-30 23:38:54
Message-ID: 54CC160E.8010407@BlueTreble.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 1/30/15 11:08 AM, Robert Haas wrote:
> The final patch attached her (parallel-dummy-v2.patch) has been
> updated slightly to incorporate some prefetching logic. It's still
> just demo code and is not intended for commit. I'm not sure whether
> the prefetching logic can actually be made to improve performance,
> either; if anyone feels like playing with that and reporting results
> back, that would be swell.

Wouldn't we want the prefetching to happen after ReadBuffer() instead of
before? This way we risk pushing the buffer we actually want out, plus
if it's not already available the OS probably won't try and read it
until it's read all the prefetch blocks.
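
In other words, instead of something like

    PrefetchBuffer(rel, MAIN_FORKNUM, next_blkno);  /* issue prefetch first */
    buf = ReadBuffer(rel, blkno);                   /* then read the block we need */

do the synchronous read first and only then issue the prefetch:

    buf = ReadBuffer(rel, blkno);
    PrefetchBuffer(rel, MAIN_FORKNUM, next_blkno);

(Sketch only; rel/blkno/next_blkno are placeholders, and I haven't looked
closely at how parallel-dummy actually arranges its prefetch window.)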
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel mode and parallel contexts
Date: 2015-02-01 00:23:37
Message-ID: CA+TgmoZ3cUue42e6ZPKK8Hq+z0sspAqxN7Y9ztxt-tWFt=3i_Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Jan 30, 2015 at 6:38 PM, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com> wrote:
> On 1/30/15 11:08 AM, Robert Haas wrote:
>> The final patch attached her (parallel-dummy-v2.patch) has been
>> updated slightly to incorporate some prefetching logic. It's still
>> just demo code and is not intended for commit. I'm not sure whether
>> the prefetching logic can actually be made to improve performance,
>> either; if anyone feels like playing with that and reporting results
>> back, that would be swell.
>
> Wouldn't we want the prefetching to happen after ReadBuffer() instead of
> before? This way we risk pushing the buffer we actually want out, plus if
> it's not already available the OS probably won't try and read it until it's
> read all the prefetch blocks.

Please feel free to try it both ways and let me know what you find
out. We've had a decent amount of commentary on this patch and the
parallel sequential scan patch, but AFAICT, not much actual testing.
The best way to find out whether your proposal would be an improvement
is to try it. I have my own theory based on the testing I've done so
far, but what would be better than a theory is some evidence,
preferably collected by diverse users on diverse systems.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel mode and parallel contexts
Date: 2015-02-04 16:10:27
Message-ID: CA+TgmoYg9mfjGHzKKUNx=muK8c1SFP+ab=Lv0yBFs6woeXrcHw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Jan 30, 2015 at 12:08 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> Here's a new version. Andres mentioned previously that he thought it
> would be a good idea to commit the addition of
> BackgroundWorkerInitializeConnectionByOid() separately, as he's had
> cause to want it independently of the rest of this stuff. It would be
> useful for pg_background, too. So I've broken that out into a
> separate patch here (bgworker-by-oid.patch) and will commit that RSN
> unless somebody thinks it's a bad idea for some reason. AFAICS it
> should be uncontroversial.

This is now done.

The main patch needed some updating in light of Andres's recent
assault on ImmediateInterruptOK (final result: Andres 1-0 IIOK) so
here's a new version.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
parallel-mode-v3.patch application/x-patch 122.1 KB

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel mode and parallel contexts
Date: 2015-02-05 03:47:03
Message-ID: CAA4eK1K1o0RAwT0vt7Szx=xoG5_9XrLBF6S-4zJDi6kh8r6Zrw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Feb 4, 2015 at 9:40 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> On Fri, Jan 30, 2015 at 12:08 PM, Robert Haas <robertmhaas(at)gmail(dot)com>
wrote:
> > Here's a new version. Andres mentioned previously that he thought it
> > would be a good idea to commit the addition of
> > BackgroundWorkerInitializeConnectionByOid() separately, as he's had
> > cause to want it independently of the rest of this stuff. It would be
> > useful for pg_background, too. So I've broken that out into a
> > separate patch here (bgworker-by-oid.patch) and will commit that RSN
> > unless somebody thinks it's a bad idea for some reason. AFAICS it
> > should be uncontroversial.
>
> This is now done.
>
> The main patch needed some updating in light of Andres's recent
> assault on ImmediateInterruptOK (final result: Andres 1-0 IIOK) so
> here's a new version.
>

I think we should expose the variable ParallelWorkerNumber (or, if you don't
want to expose it, then at least a GetParallel* function call is required to
get its value), as that is needed by external applications wherever
they want to allocate something for each worker. Some examples w.r.t. the
parallel seq scan patch: each worker should have a separate tuple
queue, and we will probably also need it for the implementation of the
Explain statement.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel mode and parallel contexts
Date: 2015-02-07 03:43:21
Message-ID: CA+Tgmob4_OeOnS4e_Kg5y=HmwNdnzLOYFLrFSfKDuTgcuEmcGA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Feb 4, 2015 at 10:47 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> I think we should expose variable ParallelWorkerNumber (or if you don't
> want to expose it then atleast GetParallel* function call is required to get
> the value of same), as that is needed for external applications wherever
> they want to allocate something for each worker, some examples w.r.t
> parallel seq scan patch are each worker should have separate tuple
> queue and probably for implementation of Explain statement also we
> might need it.

Oh, right. You asked for that before, and I made the variable itself
non-static, but didn't add the prototype to the header file. Oops.

Here's v4, with that fixed and a few more tweaks.
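
Concretely, the header now carries something along these lines (sketch;
exact comment wording aside):

    /* in parallel.h */
    extern int ParallelWorkerNumber;    /* which worker slot am I? */

so that a loadable module can do things like picking its own tuple queue
by indexing a per-worker array with ParallelWorkerNumber.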

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
parallel-mode-v4.patch application/x-patch 122.6 KB

From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel mode and parallel contexts
Date: 2015-02-08 00:20:27
Message-ID: 20150208002027.GH9201@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

On 2015-02-06 22:43:21 -0500, Robert Haas wrote:
> Here's v4, with that fixed and a few more tweaks.

If you attached files generated with 'git format-patch' I could directly
apply them with the commit message and such. All at once if it's
multiple patches, as individual commits. On nontrivial patches it's nice
to see the commit message(s) alongside the diff(s).

Observations:
* Some trailing whitespace in the readme. Very nice otherwise.

* Don't like CreateParallelContextForExtension much as a function name -
I don't think we normally equate the fact that code is located in a
loadable library with the extension mechanism.

* StaticAssertStmt(BUFFERALIGN(PARALLEL_ERROR_QUEUE_SIZE) ...) what's
that about?

* the plain shm calculations intentionally use mul_size/add_size to deal
with overflows. On 32bit it doesn't seem impossible, but unlikely to
overflow size_t.

* I'd s/very // in "We might be running in a very short-lived memory
context.". Or replace it with "too".

* In +LaunchParallelWorkers, does it make sense trying to start workers
if one failed? ISTM that's likely to not be helpful. I.e. it should
just break; after the first failure.

* +WaitForParallelWorkersToFinish says that it waits for workers to exit
cleanly. To me that's ambiguous. How about "fully"?

* ParallelMain restores libraries before GUC state. Given that
libraries can, and actually somewhat frequently do, inspect GUCs on
load, it seems better to do it the other way round? You argue "We want
to do this before restoring GUCs, because the libraries might define
custom variables.", but I don't buy that. It's completely normal for
namespaced GUCs to be present before a library is loaded? Especially
as you then go and process_session_preload_libraries() after setting
the GUCs.

* Should ParallelMain maybe enter the parallel context before loading
user defined libraries? It's far from absurd to execute code touching
the database on library initialization...

* rename ParallelMain to ParallelWorkerMain?

* I think restoring snapshots needs to fudge the worker's PGXACT->xmin
to be the minimum of the top transaction id and the
snapshots. Otherwise if the initiating process dies badly
(e.g. because postmaster died) the workers will continue to work,
while other processes may remove things. Also, you don't seem to
prohibit popping the active snapshot (should that be prohibited
maybe?) which might bump the initiator's xmin horizon.

* Comment about 'signal_handler' in +HandleParallelMessageInterrupt
is outdated.

* Is it really a good idea to overlay Z/ReadyForQuery with 'this worker
exited cleanly'? Shouldn't it at least be T/Terminate? I'm generally
not sure it's wise to use this faux FE/BE protocol here...

* HandlingParallelMessages looks dead.

* ParallelErrorContext has the wrong comment.

* ParallelErrorContext() provides the worker's pid in the context
message. I guess that's so it's easier to interpret when sent to the
initiator? It'll look odd when logged by the failing process.

* We now have pretty much the same handle_sigterm in a bunch of
places. Can't we get rid of those? You actually probably can just use
die().

* The comments in xact.c above XactTopTransactionId pretty much assume
that the reader knows that that is about parallel stuff.

* I'm a bit confused by the fact that Start/CommitTransactionCommand()
emit a generic elog when encountering a PARALLEL_INPROGRESS, whereas
ReleaseSavepoint()/RollbackTo has a special message. Shouldn't it be
pretty much impossible to hit a ReleaseSavepoint() with a parallel
transaction in progress? We'd have had to end the previous transaction
command while parallel stuff was in progress - right?
(Internal/ReleaseCurrentSubTransaction is different, that's used in code)

* Why are you deviating from the sorting order used in other places for
ParallelCurrentXids? That seems really weird, especially as we use
something else a couple lines down. Especially as you actually seem to
send stuff in xidComparator order?

* I don't think skipping AtEOXact_Namespace() entirely in parallel mode is a
good idea. That function already does some other stuff than cleaning
up temp tables. I think you should pass the parallel flag in and do the
skip in there.

* Start/EndParallelWorkerTransaction assert the current state, whereas
the rest of the file FATALs in that case. I think it'd actually be
good to be conservative and do the same in this case.

* You're manually passing function names to
PreventCommandIfParallelMode() in a fair number of cases. I'd either
try and keep the names consistent with what the functions are actually
called at the sql level (adding the types in the parens) or just use
PG_FUNCNAME_MACRO to keep them correct.

* Wait. You now copy all held relation locks "as is" to the standby? I
quite doubt that's a good idea, and it's not my reading of the
conclusions made in the group locking thread. At the very least this
part needs to be extensively documented. And while
LockAcquireExtended() refers to
src/backend/access/transam/README.parallel for details I don't see
anything pertinent in there. And the function header sounds like the
only difference is the HS logging - not mentioning that it essentially
disables lock queuing entirely.

This seems unsafe (e.g. consider if the initiating backend died and
somebody else acquired the lock, possible e.g. if postmaster died) and
not even remotely enough discussed. I think this should be removed
from the patch for now.

Greetings,

Andres Freund


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel mode and parallel contexts
Date: 2015-02-10 16:49:58
Message-ID: CA+TgmoY0Nj_eNx3tGQKsi3M8wf7hq0p6jXSUpsdPLPRn5O1vxQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Feb 7, 2015 at 7:20 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> Observations:
> * Some tailing whitespace in the readme. Very nice otherwise.

Fixed. Thanks.

> * Don't like CreateParallelContextForExtension much as a function name -
> I don't think we normally equate the fact that code is located in a
> loadable library with the extension mechanism.

Suggestions for a better name? CreateParallelContextForLoadableFunction?

> * StaticAssertStmt(BUFFERALIGN(PARALLEL_ERROR_QUEUE_SIZE) ...) what's
> that about?

Gee, maybe I should have added a comment so I'd remember. If the
buffer size isn't MAXALIGN'd, that would be really bad, because
shm_mq_create() assumes that it's being given an aligned address.
Maybe I should add an Assert() there. If it is MAXALIGN'd but not
BUFFERALIGN'd, we might waste a few bytes of space, since
shm_toc_allocate() always allocates in BUFFERALIGN'd chunks, but I
don't think anything will actually break. Not sure if that's worth an
assert or not.
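
If I do add an Assert(), it would be something along these lines at the top
of shm_mq_create() (sketch only):

    Assert(address == (void *) MAXALIGN(address)); /* caller must pass aligned memory */

which would at least catch a non-MAXALIGN'd queue placement at its first use.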

> * the plain shm calculations intentionally use mul_size/add_size to deal
> with overflows. On 32bit it doesn't seem impossible, but unlikely to
> overflow size_t.

Yes, I do that here too, though as with the plain shm calculations,
only in the estimate functions. The functions that actually serialize
stuff don't have to worry about overflow because it's already been
checked.

> * I'd s/very // i "We might be running in a very short-lived memory
> context.". Or replace it with "too".

Removed "very".

> * In +LaunchParallelWorkers, does it make sense trying to start workers
> if one failed? ISTM that's likely to not be helpful. I.e. it should
> just break; after the first failure.

It can't just break, because clearing pcxt->worker[i].error_mqh is an
essential step. I could add a flag variable that tracks whether any
registrations have failed and change the if statement to if
(!any_registrations_failed && RegisterDynamicBackgroundWorker(&worker,
&pcxt->worker[i].bgwhandle)), if you want. I thought about doing that
but decided it was very unlikely to affect the real-world performance
of anything, so easier just to keep the code simple. But I'll change
it if you want.
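
For the record, the change would look roughly like this (sketch, untested):

    bool        any_registrations_failed = false;

    for (i = 0; i < pcxt->nworkers; ++i)
    {
        /* ... set up "worker" and the error queue for slot i ... */
        if (!any_registrations_failed &&
            RegisterDynamicBackgroundWorker(&worker,
                                            &pcxt->worker[i].bgwhandle))
        {
            /* registered; nothing else to do for this slot */
        }
        else
        {
            /* Stop trying, but still clear the handle for this slot. */
            any_registrations_failed = true;
            pcxt->worker[i].error_mqh = NULL;
        }
    }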

> * +WaitForParallelWorkersToFinish says that it waits for workers to exit
> cleanly. To me that's ambiguous. How about "fully"?

I've removed the word "cleanly" and added a comment to more fully
explain the danger:

+ * Even if the parallel operation seems to have completed successfully, it's
+ * important to call this function afterwards. We must not miss any errors
+ * the workers may have thrown during the parallel operation, or any that they
+ * may yet throw while shutting down.

> * ParallelMain restores libraries before GUC state. Given that
> librararies can, and actually somewhat frequently do, inspect GUCs on
> load, it seems better to do it the other way round? You argue "We want
> to do this before restoring GUCs, because the libraries might define
> custom variables.", but I don't buy that. It's completely normal for
> namespaced GUCs to be present before a library is loaded? Especially
> as you then go and process_session_preload_libraries() after setting
> the GUCs.

This is a good question, but the answer is not entirely clear to me.
I'm thinking I should probably just remove
process_session_preload_libraries() altogether at this point. That
was added at a time when RestoreLibraryState() didn't exist yet, and I
think RestoreLibraryState() is a far superior way of handling this,
because surely the list of libraries that the parallel leader
*actually had loaded* is more definitive than any GUC.

Now, that doesn't answer the question about whether we should load
libraries first or GUCs. I agree that the comment's reasoning is
bogus, but I'm not sure I understand why you think the other order is
better. It *is* normal for namespaced GUCs to be present before a
library is loaded, but it's equally normal (and, I think, often more
desirable in practice) to load the library first and then set the
GUCs. Generally, I think that libraries ought to be loaded as early
as possible, because they may install hooks that change the way other
stuff works, and should therefore be loaded before that other stuff
happens.
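
In code terms, what I'm leaning toward is keeping the current ordering in
the worker and just dropping the extra call - roughly (sketch; RestoreGUCState
stands in for whatever the GUC-restore entry point ends up being named, and
the *space pointers are the corresponding shm_toc entries):

    RestoreLibraryState(libraryspace);  /* load the leader's shared libraries first */
    RestoreGUCState(gucspace);          /* then apply the leader's GUC values */

rather than following those with process_session_preload_libraries().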

> * Should ParallelMain maybe enter the parallel context before loading
> user defined libraries? It's far from absurd to execute code touching
> the database on library initialization...

It's hard to judge without specific examples. What kinds of things do
they do? Are they counting on a transaction being active? I would
have thought that was a no-no, since there are many instances in which
it won't be true. Also, you might have just gotten loaded because a
function stored in your library was called, so you could be in a
transaction that's busy doing something else, or deep in a
subtransaction stack, etc. It seems unsafe to do very much more than
a few syscache lookups here, even if there does happen to be a
transaction active.

> * rename ParallelMain to ParallelWorkerMain?

Sounds good. Done.

> * I think restoring snapshots needs to fudge the worker's PGXACT->xmin
> to be the minimum of the top transaction id and the
> snapshots. Otherwise if the initiating process dies badly
> (e.g. because postmaster died) the workers will continue to work,
> while other processes may remove things.

RestoreTransactionSnapshot() works the same way as the existing
import/export snapshot stuff, so that ought to be no less safe than
what we're doing already. Any other snapshots that we're restoring
had better not have an xmin lower than that one; if they do, the
master messed up. Possibly it would be a good idea to have additional
safeguards there; not sure exactly what. Perhaps RestoreSnapshot()
could assert that the xmin of the restored snapshot
follows-or-is-equal-to PGXACT->xmin? That would be safe for the first
snapshot we restore because our xmin will be InvalidTransactionId, and
after that it should check the condition you're worried about?
Thoughts?
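
Something like this in RestoreSnapshot(), perhaps (sketch; exact spelling
and placement to be checked):

    /* The restored snapshot must not see behind our advertised xmin. */
    Assert(!TransactionIdIsValid(MyPgXact->xmin) ||
           TransactionIdFollowsOrEquals(snapshot->xmin, MyPgXact->xmin));

For the first snapshot restored our xmin is still invalid, so the first arm
makes the assertion vacuous; after that it checks exactly the condition
you're worried about.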

> Also, you don't seem to
> prohibit popping the active snapshot (should that be prohibitted
> maybe?) which might bump the initiator's xmin horizon.

I think as long as our transaction snapshot is installed correctly our
xmin horizon can't advance; am I missing something?

It's generally OK to pop the active snapshot, as long as you only pop
what you pushed. In a worker, you can push additional snapshots on
the active snapshot stack and then pop them, but you can't pop the one
ParallelWorkerMain installed for you. You'll probably notice if you
do, because then the snapshot stack will be empty when you get back to
ParallelWorkerMain() and you'll fail an assertion. Similarly, the
master shouldn't pop the snapshot that was active at the start of
parallelism until parallelism is done, but again if you did you'd
probably fail an assertion later on. Generally, we haven't had much
of a problem with PushActiveSnapshot() and PopActiveSnapshot() calls
being unbalanced, so I don't really think this is an area that needs
especially strong cross-checks. At least not unless we get some
evidence that this is a more-common mistake in code that touches
parallelism than it is in general.

> * Comment about 'signal_handler' in +HandleParallelMessageInterrupt
> is outdated.

Removed.

> * Is it really a good idea to overlay Z/ReadyForQuery with 'this worker
> exited cleanly'? Shouldn't it at least be T/Terminate? I'm generally
> not sure it's wise to use this faux FE/BE protocol here...

Well, I'm not sure about that either and never have been, but I was
even less sure inventing a new one was any better. We might need a
few new protocol messages (or to reuse a few existing ones for other
things) but being able to reuse the existing format for ErrorResponse,
NoticeResponse, etc. seems like a pretty solid win. Those are
reasonably complex message formats and reimplementing them for no
reason seems like a bad idea.

Terminate is 'X', not 'T', and it's a frontend-only message. The
worker is speaking the backend half of the protocol. We could use it
anyway; that wouldn't be silly. I picked ReadyForQuery because that's
what the backend sends when it is completely done processing
everything that the user most recently requested, which seems
defensible here.

> * HandlingParallelMessages looks dead.

Good catch. Removed.

> * ParallelErrorContext has the wrong comment.

Doh. Fixed.

> * ParallelErrorContext() provides the worker's pid in the context
> message. I guess that's so it's easier to interpret when sent to the
> initiator? It'll look odd when logged by the failing process.

Yes, that's why. Regarding logging, true. I guess the master could
add the context instead, although making sure the PID is available
looks pretty annoying. At the time we establish the queue, the PID
isn't known yet, and by the time we read the error from it, the worker
might be gone, such that we can't read its PID. To fix, we'd have to
invent a new protocol message that means "here's my PID". Another
alternative is to just say that the error came from a parallel worker
(e.g. "while executing in parallel worker") and not mention the PID,
but that seems like it'd be losing possibly-useful information.

> * We now have pretty much the same handle_sigterm in a bunch of
> places. Can't we get rid of those? You actually probably can just use
> die().

Good idea. Done.

> * The comments in xact.c above XactTopTransactionId pretty much assume
> that the reader knows that that is about parallel stuff.

What would you suggest? The comment begins "Only a single
TransactionStateData is placed on the parallel worker's state stack",
which seems like a pretty clear way of giving the user a hint that we
are talking about parallel stuff. An older version of the patch was
much less clear, but I thought I'd remedied that.

> * I'm a bit confused by the fact that Start/CommitTransactionCommand()
> emit a generic elog when encountering a PARALLEL_INPROGRESS, whereas
> ReleaseSavepoint()/RollbackTo has a special message. Shouldn't it be
> pretty much impossible to hit a ReleaseSavepoint() with a parallel
> transaction in progress? We'd have had to end the previous transaction
> command while parallel stuff was in progress - right?
> (Internal/ReleaseCurrentSubTransaction is different, that's used in code)

It's pretty simple to hit ReleaseSavepoint() while a transaction is in
progress. It's pretty much directly SQL-callable, so a PL function
run in a parallel worker could easily hit it, or anything that uses
SPI. There are similar checks in e.g. EndTransaction(), but you can't
invoke CommitTransactionCommand() directly.

> * Why are you deviating from the sorting order used in other places for
> ParallelCurrentXids? That seems really weird, especially as we use
> something else a couple lines down. Especially as you actually seem to
> send stuff in xidComparator order?

The transaction IDs have to be sorted into some order so that they can
be binary-searched, and this seemed simplest. xidComparator sorts in
numerical order, not transaction-ID order, so that's how we send them.
That turns out to be convenient anyway, because binary-search on a
sorted array of integers is really simple. If we sent them sorted in
transaction-ID order we'd have to make sure that the reference
transaction ID was the same for both backends, which might not be
hard, but I don't see how it would be better than this.
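
To illustrate, with the array sent in plain numerical order the
worker-side membership test is just a bsearch; something like this
(sketch only - the variable names here are made up rather than copied
from the patch):

    /* In the worker: XIDs received from the master, in numerical order. */
    static TransactionId *ParallelCurrentXids;
    static int  nParallelCurrentXids;

    static bool
    XidIsInParallelCurrentXids(TransactionId xid)
    {
        /* xidComparator compares plain uint32 values, matching the sort. */
        return bsearch(&xid, ParallelCurrentXids, nParallelCurrentXids,
                       sizeof(TransactionId), xidComparator) != NULL;
    }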

> * I don't think skipping AtEOXact_Namespace() entirely if parallel is a
> good idea. That function already does some other stuff than cleaning
> up temp tables. I think you should pass in parallel and do the skip in
> there.

That's a very good point. Fixed.

> * Start/EndParallelWorkerTransaction assert the current state, whereas
> the rest of the file FATALs in that case. I think it'd actually be
> good to be conservative and do the same in this case.

Well, actually, StartTransaction() does this:

    if (s->state != TRANS_DEFAULT)
        elog(WARNING, "StartTransaction while in %s state",
             TransStateAsString(s->state));

I could copy and paste that code into StartParallelWorkerTransaction()
and change WARNING to FATAL, but the first thing
StartParallelWorkerTransaction() does is call StartTransaction(). It
seems pretty stupid to have two identical tests that differ only in
their log level. The same considerations apply to
EndParallelWorkerTransaction() and CommitTransaction().

A worthwhile question is why somebody thought that it was a good idea
for the log level there to be WARNING rather than FATAL. But I don't
think it's this patch's job to second-guess that decision.

> * You're manually passing function names to
> PreventCommandIfParallelMode() in a fair number of cases. I'd either
> try and keep the names consistent with what the functions are actually
> called at the sql level (adding the types in the parens) or just use
> PG_FUNCNAME_MACRO to keep them correct.

I think putting the type names in is too chatty; I'm not aware that we
use that style in other error messages. And we don't want to lead
people to believe that only the variant with the particular argument
types they used is disallowed.

PG_FUNCNAME_MACRO will give us the C name, not the SQL name.

> * Wait. You now copy all held relation locks "as is" to the standby? I
> quite doubt that's a good idea, and it's not my reading of the
> conclusions made in the group locking thread. At the very least this
> part needs to be extensively documented. And while
> LockAcquireExtended() refers to
> src/backend/access/transam/README.parallel for details I don't see
> anything pertinent in there. And the function header sounds like the
> only difference is the HS logging - not mentioning that it essentially
> disables lock queuing entirely.
>
> This seems unsafe (e.g. consider if the initiating backend died and
> somebody else acquired the lock, possible e.g. if postmaster died) and
> not even remotely enough discussed. I think this should be removed
> from the patch for now.

If it's broken, we need to identify what's wrong and fix it, not just
rip it out. It's possible that something is broken with that code,
but it's dead certain that something is broken without it:

rhaas=# select parallel_count('pgbench_accounts', 1);
NOTICE: PID 57956 counted 2434815 tuples
NOTICE: PID 57957 counted 1565185 tuples
CONTEXT: parallel worker, pid 57957
parallel_count
----------------
4000000
(1 row)

rhaas=# begin;
BEGIN
rhaas=# lock pgbench_accounts;
LOCK TABLE
rhaas=# select parallel_count('pgbench_accounts', 1);
NOTICE: PID 57956 counted 4000000 tuples

...and then it hangs forever.

On the specific issues:

1. I agree that it's very dangerous for the parallel backend to
acquire the lock this way if the master no longer holds it.
Originally, I was imagining that there would be no interlock between
the master shutting down and the worker starting up, but you and
others convinced me that was a bad idea. So now transaction commit or
abort waits for all workers to be gone, which I think reduces the
scope of possible problems here pretty significantly. However, it's
quite possible that it isn't airtight. One thing we could maybe do to
make it safer is pass a pointer to the initiator's PGPROC. If we get
the lock via the fast-path we are safe anyway, but if we have to
acquire the partition lock, then we can cross-check that the
initiator's lock is still there. I think that would button this up
pretty tight.
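
In very rough terms, what I'm imagining is something like the
following in the slow path of LockAcquireExtended(), once the
partition lock is held (names invented for illustration -
"leaderProc" would be the initiator's PGPROC passed down through
shared memory; none of this is in the patch yet):

    PROCLOCKTAG leadertag;
    PROCLOCK   *leaderproclock;

    leadertag.myLock = lock;
    leadertag.myProc = leaderProc;
    leaderproclock = (PROCLOCK *) hash_search(LockMethodProcLockHash,
                                              (void *) &leadertag,
                                              HASH_FIND, NULL);
    /* Refuse to copy a lock the initiator no longer holds. */
    if (leaderproclock == NULL ||
        (leaderproclock->holdMask & LOCKBIT_ON(lockmode)) == 0)
        elog(ERROR, "parallel master no longer holds the requested lock");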

2. My reading of the group locking discussions was that everybody
agreed that the originally-proposed group locking approach, which
involved considering locks from the same locking group as mutually
non-conflicting, was not OK. Several specific examples were offered -
e.g. it's clearly not OK for two backends to extend a relation at the
same time just because they are in the same locking group. So I abandoned that
approach. When I instead proposed the approach of copying only the
locks that the master *already* had at the beginning of parallelism
and considering *only those* as mutually conflicting, I believe I got
several comments to the effect that this was "less scary".
Considering the topic area, I'm not sure I'm going to do any better
than that.

3. I welcome proposals for other ways of handling this problem, even
if they restrict the functionality that can be offered. For example,
a proposal that would make parallel_count revert to single-threaded
mode but terminate without an indefinite wait would be acceptable to
me, provided that it offers some advantage in safety and security over
what I've already got. A proposal to make parallel_count error out
in the above case would not be acceptable to me; the planner must not
generate parallel plans that will sometimes fail unexpectedly at
execution-time. I generally believe that we will be much happier if
application programmers need not worry about the failure of parallel
workers to obtain locks already held by the master; some such failure
modes may be very complex and hard to predict. The fact that the
current approach handles the problem entirely within the lock manager,
combined with the fact that it is extremely simple, is therefore very
appealing to me. Nonetheless, a test case that demonstrates this
approach falling down badly would force a rethink; do you have one?
Or an idea about what it might look like?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel mode and parallel contexts
Date: 2015-02-11 01:45:28
Message-ID: 20150211014528.GQ21017@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2015-02-10 11:49:58 -0500, Robert Haas wrote:
> On Sat, Feb 7, 2015 at 7:20 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> > * Don't like CreateParallelContextForExtension much as a function name -
> > I don't think we normally equate the fact that code is located in a
> > loadable library with the extension mechanism.
>
> Suggestions for a better name? CreateParallelContextForLoadableFunction?

*ExternalFunction maybe? That's what dfmgr.c calls it.

> > * the plain shm calculations intentionally use mul_size/add_size to deal
> with overflows. On 32bit it doesn't seem impossible, though unlikely,
> to overflow size_t.
>
> Yes, I do that here too, though as with the plain shm calculations,
> only in the estimate functions. The functions that actually serialize
> stuff don't have to worry about overflow because it's already been
> checked.

I thought I'd seen some estimation functions that didn't use it. But
apparently I was wrong.

> > * In +LaunchParallelWorkers, does it make sense trying to start workers
> > if one failed? ISTM that's likely to not be helpful. I.e. it should
> > just break; after the first failure.
>
> It can't just break, because clearing pcxt->worker[i].error_mqh is an
> essential step. I could add a flag variable that tracks whether any
> registrations have failed and change the if statement to if
> (!any_registrations_failed && RegisterDynamicBackgroundWorker(&worker,
> &pcxt->worker[i].bgwhandle)), if you want. I thought about doing that
> but decided it was very unlikely to affect the real-world performance
> of anything, so easier just to keep the code simple. But I'll change
> it if you want.

I think it'd be better.

> > * ParallelMain restores libraries before GUC state. Given that
> > libraries can, and actually somewhat frequently do, inspect GUCs on
> > load, it seems better to do it the other way round? You argue "We want
> > to do this before restoring GUCs, because the libraries might define
> > custom variables.", but I don't buy that. It's completely normal for
> > namespaced GUCs to be present before a library is loaded? Especially
> > as you then go and process_session_preload_libraries() after setting
> > the GUCs.
>
> This is a good question, but the answer is not entirely clear to me.
> I'm thinking I should probably just remove
> process_session_preload_libraries() altogether at this point. That
> was added at a time when RestoreLibraryState() didn't exist yet, and I
> think RestoreLibraryState() is a far superior way of handling this,
> because surely the list of libraries that the parallel leader
> *actually had loaded* is more definitive than any GUC.

That sounds like a good idea to me.

> Now, that doesn't answer the question about whether we should load
> libraries first or GUCs. I agree that the comment's reasoning is
> bogus, but I'm not sure I understand why you think the other order is
> better. It *is* normal for namespaced GUCs to be present before a
> library is loaded, but it's equally normal (and, I think, often more
> desirable in practice) to load the library first and then set the
> GUCs.

Well, it's pretty much never the case that the library is loaded before
postgresql.conf gucs, right? A changed postgresql.conf is the only
exception I can see. Neither is it the normal case for
session|local_preload_libraries. Not even when GUCs are loaded via
pg_db_role_setting or the startup packet...

> Generally, I think that libraries ought to be loaded as early
> as possible, because they may install hooks that change the way other
> stuff works, and should therefore be loaded before that other stuff
> happens.

While that may be desirable I don't really see a reason for this to be
treated differently for worker processes than the majority of cases
otherwise.

Anyway, I think this is a relatively minor issue.

> > * Should ParallelMain maybe enter the parallel context before loading
> > user defined libraries? It's far from absurd to execute code touching
> > the database on library initialization...
>
> It's hard to judge without specific examples. What kinds of things do
> they do?

I've seen code filling lookup caches and creating system objects
(including tables and extensions).

> Are they counting on a transaction being active?

> I would have thought that was a no-no, since there are many instances
> in which it won't be true. Also, you might have just gotten loaded
> because a function stored in your library was called, so you could be
> in a transaction that's busy doing something else, or deep in a
> subtransaction stack, etc. It seems unsafe to do very much more than
> a few syscache lookups here, even if there does happen to be a
> transaction active.

The only reason I'd like it to be active is because that'd *prohibit*
doing the crazier stuff. There seems little reason not to do it, given
the additional protection against crazy things that it'd give us?

> > * I think restoring snapshots needs to fudge the worker's PGXACT->xmin
> > to be the minimum of the top transaction id and the
> > snapshots. Otherwise if the initiating process dies badly
> > (e.g. because postmaster died) the workers will continue to work,
> > while other processes may remove things.
>
> RestoreTransactionSnapshot() works the same way as the existing
> import/export snapshot stuff, so that ought to be no less safe than
> what we're doing already.

Well, ExportSnapshot()/ImportSnapshot() has quite a few more restrictions than
what you're doing... Most importantly it cannot import in transactions
unless using read committed isolation, forces xid assignment during
export, forces the old snapshot to stick around for the whole
transaction and only works on a primary. I'm not actually sure it makes
a relevant difference, but it's certainly worth calling attention to.

The fudging I was wondering about certainly is unnecessary though -
GetSnapshotData() has already performed it...

I'm not particularly awake right now, and I think this needs a closer
look by someone who is... I'm not fully convinced this is safe.

> > Also, you don't seem to
> > prohibit popping the active snapshot (should that be prohibitted
> > maybe?) which might bump the initiator's xmin horizon.
>
> I think as long as our transaction snapshot is installed correctly our
> xmin horizon can't advance; am I missing something?

Maybe I'm missing something, but why would that be the case in a read
committed session? ImportSnapshot() only ever calls
SetTransactionSnapshot in such a case (why does it contain code to cope
without it?), but your patch doesn't seem to guarantee that.

But as you don't actually transport XactIsoLevel anywhere (an omission,
probably?) that seems to mean that SetTransactionSnapshot() won't do
much, even if the source transaction is repeatable read.

Now the thing that actually does give some guarantees is that you
immediately afterwards restore the active snapshot and do a
PushActiveSnapshot(), which'll prevent MyPgXact->xmin from changing
because SnapshotResetXmin() refuses to do anything if there's an
ActiveSnapshot. Seems a bit complicated and fragile though.

> > * Is it really a good idea to overlay Z/ReadyForQuery with 'this worker
> > exited cleanly'? Shouldn't it at least be T/Terminate? I'm generally
> > not sure it's wise to use this faux FE/BE protocol here...
>
> Well, I'm not sure about that either and never have been, but I was
> even less sure inventing a new one was any better. We might need a
> few new protocol messages (or to reuse a few existing ones for other
> things) but being able to reuse the existing format for ErrorResponse,
> NoticeResponse, etc. seems like a pretty solid win. Those are
> reasonably complex message formats and reimplementing them for no
> reason seems like a bad idea.
>
> Terminate is 'X', not 'T'

Oops, yes.

> and it's a frontend-only message. The worker is speaking the backend
> half of the protocol. We could use it anyway; that wouldn't be silly.

Even if it's a frontend-only one it doesn't seem like a bad idea. I've
occasionally wished the backend would send explicit termination messages
in a bunch of scenarios. I'm a bit tired of seeing:

FATAL: 57P01: terminating connection due to administrator command
LOCATION: ProcessInterrupts, postgres.c:2888
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.

when the session was terminated while a query is active. An explicit
termination message seems like a nicer solution than interpreting the
FATAL severity.

> I picked ReadyForQuery because that's
> what the backend sends when it is completely done processing
> everything that the user most recently requested, which seems
> defensible here.

I'm pretty sure that we're going to reuse workers within a parallel
query at some point and ready for query seems like a nicer code for
saying 'finished with my work, give me the next thing'.

> > * ParallelErrorContext() provides the worker's pid in the context
> > message. I guess that's so it's easier to interpret when sent to the
> > initiator? It'll look odd when logged by the failing process.
>
> Yes, that's why. Regarding logging, true. I guess the master could
> add the context instead, although making sure the PID is available
> looks pretty annoying. At the time we establish the queue, the PID
> isn't known yet, and by the time we read the error from it, the worker
> might be gone, such that we can't read its PID. To fix, we'd have to
> invent a new protocol message that means "here's my PID".

Hm. Or we could attach the pid to the error message in that case - just
like there already is schema_name etc. Or, to stay more in the FE/BE
vibe - we could just send a 'K' message at startup which sends
MyProcPid, MyCancelKey in normal connections.
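
Roughly like this, mirroring BackendKeyData (untested sketch, assuming
the worker's libpq output is already redirected to its error queue as
the patch does for errors and notices):

    StringInfoData msgbuf;

    /* announce our pid (and cancel key) to the initiator, FE/BE style */
    pq_beginmessage(&msgbuf, 'K');
    pq_sendint(&msgbuf, (int32) MyProcPid, 4);
    pq_sendint(&msgbuf, (int32) MyCancelKey, 4);
    pq_endmessage(&msgbuf);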

> > * The comments in xact.c above XactTopTransactionId pretty much assume
> > that the reader knows that that is about parallel stuff.
>
> What would you suggest? The comment begins "Only a single
> TransactionStateData is placed on the parallel worker's state stack",
> which seems like a pretty clear way of giving the user a hint that we
> are talking about parallel stuff.

Replacing "Only" by "When running as parallel worker we only place a
..." would already help. To me the comment currently readss like it
desperately wishes to be located in the function initiating parallelism
rather than global file scope. Maybe it's lonely or such.

> > * I'm a bit confused by the fact that Start/CommitTransactionCommand()
> > emit a generic elog when encountering a PARALLEL_INPROGRESS, whereas
> > ReleaseSavepoint()/RollbackTo has a special message. Shouldn't it be
> > pretty much impossible to hit a ReleaseSavepoint() with a parallel
> > transaction in progress? We'd have had to end the previous transaction
> > command while parallel stuff was in progress - right?
> > (Internal/ReleaseCurrentSubTransaction is different, that's used in code)
>
> It's pretty simple to hit ReleaseSavepoint() while a transaction is in
> progress. It's pretty much directly SQL-callable, so a PL function
> run in a parallel worker could easily hit it, or anything that uses
> SPI.

SPI doesn't really work across subtransaction boundaries, so it really
should never be called that way. Which is why it (and thus the PLs)
return SPI_ERROR_TRANSACTION in case you try to execute it. If it's
possible to hit it, we have a problem - I think it'd pretty much lead to
rampant memory clobbering and such. Now, I'm not against providing an
error check, I was just surprised about the verbosity of the different
locations.

> > * Why are you deviating from the sorting order used in other places for
> > ParallelCurrentXids? That seems really wierd, especially as we use
> > something else a couple lines down. Especially as you actually seem to
> > send stuff in xidComparator order?
>
> The transaction IDs have to be sorted into some order so that they can
> be binary-searched, and this seemed simplest. xidComparator sorts in
> numerical order, not transaction-ID order, so that's how we send them.

Forget that bit. Was tired. I'm not sure what I thought.

> > * Start/EndParallelWorkerTransaction assert the current state, whereas
> > the rest of the file FATALs in that case. I think it'd actually be
> > good to be conservative and do the same in this case.
>
> Well, actually, StartTransaction() does this:
>
> if (s->state != TRANS_DEFAULT)
>     elog(WARNING, "StartTransaction while in %s state",
>          TransStateAsString(s->state));

But StartTransactionCommand does

    elog(ERROR, "StartTransactionCommand: unexpected state %s",
         BlockStateAsString(s->blockState));

and CommitTransactionCommand does

    elog(FATAL, "CommitTransactionCommand: unexpected state %s",
         BlockStateAsString(s->blockState));

and some more. These only deal with blockState though...

> A worthwhile question is why somebody thought that it was a good idea
> for the log level there to be WARNING rather than FATAL.

Yea, it's really rather odd. Every time I've seen those WARNINGs, e.g. in
the recent "hung backends stuck in spinlock" thread, things got even more
pear-shaped shortly afterwards.

> But I don't think it's this patch's job to second-guess that decision.

Fair enough.

> > * You're manually passing function names to
> > PreventCommandIfParallelMode() in a fair number of cases. I'd either
> > try and keep the names consistent with what the functions are actually
> > called at the sql level (adding the types in the parens) or just use
> > PG_FUNCNAME_MACRO to keep them correct.
>
> I think putting the type names in is too chatty; I'm not aware we use
> that style in other error messages. We don't want to lead people to
> believe that only the form with the particular argument types they
> used is not OK.

> PG_FUNCNAME_MACRO will give us the C name, not the SQL name.

Your manual ones don't either, that's what made me
complain. E.g. pg_advisory_lock_int8 isn't called that on the SQL
level. Instead they use the C name + parens.

> > * Wait. You now copy all held relation locks "as is" to the standby? I
> > quite doubt that's a good idea, and it's not my reading of the
> > conclusions made in the group locking thread. At the very least this
> > part needs to be extensively documented. And while
> > LockAcquireExtended() refers to
> > src/backend/access/transam/README.parallel for details I don't see
> > anything pertinent in there. And the function header sounds like the
> > only difference is the HS logging - not mentioning that it essentially
> > disables lock queuing entirely.
> >
> > This seems unsafe (e.g. consider if the initiating backend died and
> > somebody else acquired the lock, possible e.g. if postmaster died) and
> > not even remotely enough discussed. I think this should be removed
> > from the patch for now.
>
> If it's broken, we need to identify what's wrong and fix it, not just
> rip it out. It's possible that something is broken with that code,
> but it's dead certain that something is broken without it:
>
> rhaas=# select parallel_count('pgbench_accounts', 1);
> NOTICE: PID 57956 counted 2434815 tuples
> NOTICE: PID 57957 counted 1565185 tuples
> CONTEXT: parallel worker, pid 57957
> parallel_count
> ----------------
> 4000000
> (1 row)
>
> rhaas=# begin;
> BEGIN
> rhaas=# lock pgbench_accounts;
> LOCK TABLE
> rhaas=# select parallel_count('pgbench_accounts', 1);
> NOTICE: PID 57956 counted 4000000 tuples
>
> ...and then it hangs forever.

Which is *not* a good example for the problem. Your primary reasoning
for needing something more than sharing the locks that we know the
individual query is going to need (which we acquire in the parse
analyze/planner/executor) is that we can't predict what some random code
does. Now, I still don't think that argument should hold too much sway
because we'll only be able to run carefully controlled code
*anyway*. But *if* we take it as reasoning for doing more than granting
the locks we need for the individual query, we *still* need to cope with
exactly the variants of the above invisible deadlock. You can still
call a function acquiring some form of lock on either side.

If you'd, as I've argued for before, provide an API that granted workers
precisely the locks needed for the execution of a certain type of
parallel action, I'd be happy. I.e. regular parallel queries would
transport an extended version of what InitPlan() does with relation
locks.

> On the specific issues:
>
> 1. I agree that it's very dangerous for the parallel backend to
> acquire the lock this way if the master no longer holds it.
> Originally, I was imagining that there would be no interlock between
> the master shutting down and the worker starting up, but you and
> others convinced me that was a bad idea. So now transaction commit or
> abort waits for all workers to be gone, which I think reduces the
> scope of possible problems here pretty significantly. However, it's
> quite possible that it isn't airtight. One thing we could maybe do to
> make it safer is pass a pointer to the initiator's PGPROC. If we get
> the lock via the fast-path we are safe anyway, but if we have to
> acquire the partition lock, then we can cross-check that the
> initiator's lock is still there. I think that would button this up
> pretty tight.

That'd certainly make me feel less bad about it.

> 2. My reading of the group locking discussions was that everybody
> agreed that the originally-proposed group locking approach, which
> involved considering locks from the same locking group as mutually
> non-conflicting, was not OK. Several specific examples were offered -
> e.g. it's clearly not OK for two backends to extend a relation at the
> same time just because the same locking group. So I abandoned that
> approach. When I instead proposed the approach of copying only the
> locks that the master *already* had at the beginning of parallelism
> and considering *only those* as mutually conflicting, I believe I got
> several comments to the effect that this was "less scary".
> Considering the topic area, I'm not sure I'm going to do any better
> than that.

It surely is less scary, but still darn scary. It's just friggin hard to
have a mental model where (self) exclusive locks suddenly aren't that
anymore. It'll get more and more dangerous the more we start to relax
restrictions around parallelism - and that *will* come.

This concern does not only apply to user level commands, but also to our
own C code. We will find more and more reasons to parallelize commands,
and then we will have problems with suddenly not having a chance to
prevent parallelism during certain actions. Say we parallelize VACUUM
(grand idea for the index scans!) and someone wants to also improve the
truncation. Suddenly the AEL during truncation doesn't give us
guarantees anymore. There'll be many more of these, and we can't really
reason about them because we're used to working locks.

> 3. I welcome proposals for other ways of handling this problem, even
> if they restrict the functionality that can be offered. For example,
> a proposal that would make parallel_count revert to single-threaded
> mode but terminate without an indefinite wait would be acceptable to
> me, provided that it offers some advantage in safety and security over
> what I've already got.

I think the above example where we only grant the locks required for a
specific thing and only on the relevant severity would already be a big
improvement. Even better would be to use the computed set of locks to
check whether workers could acquire them and refuse parallelism in that
case.

Another thing to do is to mark the lock, both in the master and workers,
as not effectively being of the original severity anymore. If the
process itself again tries to acquire the lock at a heavier severity than
what was needed at the time of query execution, error out. At least until
parallel mode has been confirmed to have ended. That should pretty much never
happen.

I don't see how we can avoid teaching the deadlock detector about all
these silent deadlocks in any case. By your own reasoning.

> A proposal to make parallel_count error out
> in the above case would not be acceptable to me; the planner must not
> generate parallel plans that will sometimes fail unexpectedly at
> execution-time. I generally believe that we will be much happier if
> application programmers need not worry about the failure of parallel
> workers to obtain locks already held by the master; some such failure
> modes may be very complex and hard to predict. The fact that the
> current approach handles the problem entirely within the lock manager,
> combined with the fact that it is extremely simple, is therefore very
> appealing to me.

It's trivial to be simple and unsafe...

> Nonetheless, a test case that demonstrates this approach falling down
> badly would force a rethink; do you have one? Or an idea about what
> it might look like?

No, I don't have one handy. I have the very strong gut feeling that this
would introduce a severe architectural debt that we won't be able to get
rid of afterwards, even if it proves to be a really bad idea. To me
it's taking a shortcut directly through the hard problems.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel mode and parallel contexts
Date: 2015-02-11 18:59:04
Message-ID: CA+TgmobhSd_14Fb6kV6GkSVK=7P-xwPqU5ZwtaZWZNMfVPnksA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Feb 10, 2015 at 8:45 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> Suggestions for a better name? CreateParallelContextForLoadableFunction?
>
> *ExternalFunction maybe? That's what dfmgr.c calls it.

Done.

>> It can't just break, because clearing pcxt->worker[i].error_mqh is an
>> essential step. I could add a flag variable that tracks whether any
>> registrations have failed and change the if statement to if
>> (!any_registrations_failed && RegisterDynamicBackgroundWorker(&worker,
>> &pcxt->worker[i].bgwhandle)), if you want. I thought about doing that
>> but decided it was very unlikely to affect the real-world performance
>> of anything, so easier just to keep the code simple. But I'll change
>> it if you want.
>
> I think it'd be better.

Done.
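
For the archives, the registration loop now looks roughly like this
(paraphrased rather than quoted from the patch):

    bool        any_registrations_failed = false;

    for (i = 0; i < pcxt->nworkers; ++i)
    {
        /*
         * Once one registration fails, stop trying to register more
         * workers, but still clear bgwhandle and error_mqh for every
         * remaining slot so that cleanup knows they were never started.
         */
        if (!any_registrations_failed &&
            RegisterDynamicBackgroundWorker(&worker,
                                            &pcxt->worker[i].bgwhandle))
            shm_mq_set_handle(pcxt->worker[i].error_mqh,
                              pcxt->worker[i].bgwhandle);
        else
        {
            any_registrations_failed = true;
            pcxt->worker[i].bgwhandle = NULL;
            pcxt->worker[i].error_mqh = NULL;
        }
    }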

>> This is a good question, but the answer is not entirely clear to me.
>> I'm thinking I should probably just remove
>> process_session_preload_libraries() altogether at this point. That
>> was added at a time when RestoreLibraryState() didn't exist yet, and I
>> think RestoreLibraryState() is a far superior way of handling this,
>> because surely the list of libraries that the parallel leader
>> *actually had loaded* is more definitive than any GUC.
>
> That sounds like a good idea to me.

Done.

> Well, it's pretty much never the case that the library is loaded before
> postgresql.conf gucs, right? A changed postgresql.conf is the only
> exception I can see. Neither is it the normal case for
> session|local_preload_libraries. Not even when GUCs are loaded via
> pg_db_role_setting or the startup packet...

Well, it can happen when you issue an explicit LOAD, or an implicit
one by calling a function that's in that module, and then go back and
set the GUC afterward, if nothing else.

> Anyway, I think this is a relatively minor issue.

I'm going to leave this alone for now. Evidence may emerge that this
is better done in the other order, but in the absence of evidence I'm
going to follow my gut rather than yours. With all due respect for
your gut, of course.

> The only reason I'd like it to be active is because that'd *prohibit*
> doing the crazier stuff. There seems little reason not to do it, given
> the additional protection against crazy things that it'd give us?

Trying to load additional libraries once parallel mode is in progress
can produce failures, because the load of the libraries causes the
system to do GUC_ACTION_SET on the GUCs whose initialization was
deferred, and that trips up the
no-changing-things-while-in-parallel-mode checks. I'm not sure if
there's anything we can, or should, do about that.

> Well, ExportSnapshot()/Import has quite a bit more restrictions than
> what you're doing... Most importantly it cannot import in transactions
> unless using read committed isolation, forces xid assignment during
> export, forces the old snapshot to stick around for the whole
> transaction and only works on a primary. I'm not actually sure it makes
> a relevant difference, but it's certainly worth calling attention to.
>
> The fuding I was wondering about certainly is unnecessary though -
> GetSnapshotData() has already performed it...
>
> I'm not particularly awake right now and I think this needs a closer
> look either by someone awake... I'm not fully convinced this is safe.

I'm not 100% comfortable with it either, but I just spent some time
looking at it and can't see what's wrong with it. Basically, we're
trying to get the parallel worker into a state that matches the
master's state after doing GetTransactionSnapshot() - namely,
CurrentSnapshot should point to the same snapshot on the master, and
FirstSnapshotSet should be true, plus the same additional processing
that GetTransactionSnapshot() would have done if we're in a higher
transaction isolation level. It's possible we don't need to mimic
that state, but it seems like a good idea.

Still, I wonder if we ought to be banning GetTransactionSnapshot()
altogether. I'm not sure if there's ever a time when it's safe for a
worker to call that.

>> > Also, you don't seem to
>> > prohibit popping the active snapshot (should that be prohibitted
>> > maybe?) which might bump the initiator's xmin horizon.
>>
>> I think as long as our transaction snapshot is installed correctly our
>> xmin horizon can't advance; am I missing something?
>
> Maybe I'm missing something, but why would that be the case in a read
> committed session? ImportSnapshot() only ever calls
> SetTransactionSnapshot in such a case (why does it contain code to cope
> without it?), but your patch doesn't seem to guarantee that.
>
> But as you don't actually transport XactIsoLevel anywhere (omission
> probably?) that seems to mean that the SetTransactionSnapshot() won't do
> much, even if the source transaction is repeatable read.

Doesn't XactIsoLevel get set by assign_XactIsoLevel() when we restore
the GUC state?

> Now the thing that actually does give some guarantees is that you
> immediately afterwards restore the active snapshot and do a
> PushActiveSnapshot(), which'll prevent MyPgXact->xmin from changing
> because SnapshotResetXmin() refuses to do anything if there's an
> ActiveSnapshot. Seems a bit comlicated and fragile though.

Suggestions?

>> Terminate is 'X', not 'T'
>
> Oops, yes.
>
>> and it's a frontend-only message. The worker is speaking the backend
>> half of the protocol. We could use it anyway; that wouldn't be silly.
>
> Even if it's a frontend only one it doesn't seem like a bad idea.

Done.

>> I picked ReadyForQuery because that's
>> what the backend sends when it is completely done processing
>> everything that the user most recently requested, which seems
>> defensible here.
>
> I'm pretty sure that we're going to reuse workers within a parallel
> query at some point and ready for query seems like a nicer code for
> saying 'finished with my work, give me the next thing'.

Yeah, could well be.

> Hm. Or we could attach the pid to the error message in that case - just
> like there already is schema_name etc. Or, to stay more in the FE/BE
> vibe - we could just send a 'K' message at startup which sends
> MyProcPid, MyCancelKey in normal connections.

Done.

> Replacing "Only" by "When running as parallel worker we only place a
> ..." would already help. To me the comment currently readss like it
> desperately wishes to be located in the function initiating parallelism
> rather than global file scope. Maybe it's lonely or such.

Heh, heh. Done.

>> > * You're manually passing function names to
>> > PreventCommandIfParallelMode() in a fair number of cases. I'd either
>> > try and keep the names consistent with what the functions are actually
>> > called at the sql level (adding the types in the parens) or just use
>> > PG_FUNCNAME_MACRO to keep them correct.
>>
>> I think putting the type names in is too chatty; I'm not aware we use
>> that style in other error messages. We don't want to lead people to
>> believe that only the form with the particular argument types they
>> used is not OK.
>
>> PG_FUNCNAME_MACRO will give us the C name, not the SQL name.
>
> Your manual ones don't either, that's what made me
> complain. E.g. pg_advisory_lock_int8 isn't called that on the SQL
> level. Instead they use the C name + parens.

On further reflection, I think I should probably just change all of
those to use a general message about advisory locks, i.e.:

ERROR: cannot use advisory locks during a parallel operation

That sound good to you?

> Which is *not* a good example for the problem. Your primary reasoning
> for needing something more than sharing the locks that we know the
> individual query is going to need (which we acquire in the parse
> analzye/planner/executor) is that we can't predict what some random code
> does. Now, I still don't think that argument should hold too much sway
> because we'll only be able to run carefully controlled code
> *anyway*.

I'll avoid repeating what I've said about this before, except to say
that I still don't believe for a minute you can predict which locks
you'll need.

> But *if* we take it as reasoning for doing more than granting
> the locks we need for the individual query, we *still* need to cope with
> exactly the variants of the above invisible deadlock. You can still can
> call a function acquiring some form of lock on either side.

That by itself is not a deadlock. If the worker blocks trying to
acquire a lock that the master held before the start of parallelism,
it will wait forever, even if the master never acquires a lock. But
if the worker blocks trying to acquire a lock that the master acquired
*during* parallelism, it's presumably something like a catalog lock or
a tuple lock or a relation extension lock that the master is going to
release quickly. So the master will finish with that resource and
then the worker will go ahead. You're right that there are other
scenarios to consider, but you need more than that.

>> On the specific issues:
>>
>> 1. I agree that it's very dangerous for the parallel backend to
>> acquire the lock this way if the master no longer holds it.
>> Originally, I was imagining that there would be no interlock between
>> the master shutting down and the worker starting up, but you and
>> others convinced me that was a bad idea. So now transaction commit or
>> abort waits for all workers to be gone, which I think reduces the
>> scope of possible problems here pretty significantly. However, it's
>> quite possible that it isn't airtight. One thing we could maybe do to
>> make it safer is pass a pointer to the initiator's PGPROC. If we get
>> the lock via the fast-path we are safe anyway, but if we have to
>> acquire the partition lock, then we can cross-check that the
>> initiator's lock is still there. I think that would button this up
>> pretty tight.
>
> That'd certainly make me feel less bad about it.

Done.

>> 2. My reading of the group locking discussions was that everybody
>> agreed that the originally-proposed group locking approach, which
>> involved considering locks from the same locking group as mutually
>> non-conflicting, was not OK. Several specific examples were offered -
>> e.g. it's clearly not OK for two backends to extend a relation at the
>> same time just because the same locking group. So I abandoned that
>> approach. When I instead proposed the approach of copying only the
>> locks that the master *already* had at the beginning of parallelism
>> and considering *only those* as mutually conflicting, I believe I got
>> several comments to the effect that this was "less scary".
>> Considering the topic area, I'm not sure I'm going to do any better
>> than that.
>
> It surely is less scary, but still darn scary. It's just friggin hard to
> have a mental model where (self) exclusive locks suddenly aren't that
> anymore. It'll get more and more dangerous the more we start to relax
> restrictions around parallelism - and that *will* come.

Assuming we get a first version committed at some point, yes.

> This concern does not only apply to user level commands, but also our
> own C code. We will find more and more reasons to parallelize commands
> and if we will have problems with suddenly not having a chance to
> prevent parallelism during certain actions. Say we parallelize VACUUM
> (grand idea for the index scans!) and someone wants to also improve the
> truncation. Suddenly the AEL during truncation doesn't give us
> guarantees anymore

If the VACUUM leader process tries to grab an AccessExclusiveLock
after starting parallel mode and before terminating it, surely that's
going to conflict with the locks held by any remaining workers at that
point. Even if it weren't going to conflict, you'd have to be pretty
stupid to code it that way, because you can't remove dead line
pointers until you've finished all of the index scans, and you can't
very well truncate the relation until you've removed dead line
pointers. So to me, this seems like an impossible scenario that
wouldn't break even if it were possible.

> There'll be many more of these, and we can't really
> reason about them because we used to working locks.

FUD.

I think one thing that might help allay your concerns in this area to
a degree is to discuss under what circumstances we think it's safe to
create a parallel context in the first place. For example, if you
were to insert code that tries to go parallel in the middle of
heap_update() or in the middle of relation extension, that's probably
not going to work out well with this model, or with any model. It's
not *impossible* for it to work out well - if you wanted to extend the
relation by a whole bunch of blocks, the worker could do that while
the master goes and tries to read from CLOG or something, but you'd have
to be awfully careful and it's probably a stupid idea for lots of
reasons. It really only makes sense to enter parallelism when the
backend is in a relatively quiescent state - like at the beginning of
a query, or after vacuum scans the heap but before it scans the
indexes. At that point it's a lot easier to reason about what the
parallel backends can safely do.

Broadly, I think that there's a close connection between what state
the master is in when we initiate parallelism and what the workers can
safely do. I'm not sure exactly how to describe what the connection
is there, but I think it exists, and firming up the way we think about
that might make it easier to reason about this.

>> 3. I welcome proposals for other ways of handling this problem, even
>> if they restrict the functionality that can be offered. For example,
>> a proposal that would make parallel_count revert to single-threaded
>> mode but terminate without an indefinite wait would be acceptable to
>> me, provided that it offers some advantage in safety and security over
>> what I've already got.
>
> I think the above example where we only grant the locks required for a
> specific thing and only on the relevant severity would already be a big
> improvement. Even better would be to to use the computed set of locks to
> check whether workers could acquire them and refuse paralellism in that
> case.

I think that will just produce lots of very-hard-to-debug undetected
deadlocks and huge volumes of code that tries and fails to assess what
locks the worker will need.

> Another thing to do is to mark the lock, both in the master and workers,
> as not effectively being of the original severity anymore. If the own
> process again tries to acquire the lock on a heavier severity than what
> was needed at the time of query execution, error out. At least until
> parallel mode has confirmed to have ended. That should pretty much never
> happen.

I'm not sure what you mean by the "severity" of the lock. How about
marking the locks that the worker inherited from the parallel master
and throwing an error if it tries to lock one of those objects in a
mode that it does not already hold? That'd probably catch a sizeable
number of programming errors without tripping up many legitimate use
cases (and we can always relax or modify the prohibition if it later
turns out to be problematic).
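
Concretely, I'm thinking of something along these lines in the lock
manager, with completely made-up names (nothing like this exists in
the patch today):

    /*
     * Hypothetically, each LOCALLOCK copied from the master at the
     * start of parallelism could be flagged, remembering which modes
     * the master held at that time.  A worker asking for anything
     * beyond that errors out instead of queueing.
     */
    if (IsParallelWorker() &&
        locallock->copied_from_master &&
        (LOCKBIT_ON(lockmode) & locallock->master_held_modes) == 0)
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
                 errmsg("cannot acquire additional lock modes on an object locked by the parallel master")));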

> I don't see how we can avoid teaching the deadlock detector about all
> these silent deadlocks in any cases. By your own reasoning.

We're not seeing eye to eye here yet, since I don't accept your
example scenario and you don't accept mine. Let's keep discussing.

Meanwhile, here's an updated patch.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
parallel-mode-v5.patch application/x-patch 128.1 KB

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel mode and parallel contexts
Date: 2015-02-11 19:03:18
Message-ID: CA+TgmoaHqNmzw1yD+yvXMnEbuLRMTFvH2DXNMo9jCQv3saBhXw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Feb 11, 2015 at 1:59 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> I'm not sure what you mean by the "severity" of the lock. How about
> marking the locks that the worker inherited from the parallel master
> and throwing an error if it tries to lock one of those objects in a
> mode that it does not already hold? That'd probably catch a sizeable
> number of programming errors without tripping up many legitimate use
> cases (and we can always relax or modify the prohibition if it later
> turns out to be problematic).

Or maybe do that only when the new lock mode is stronger than one it
already holds.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel mode and parallel contexts
Date: 2015-02-13 07:22:41
Message-ID: CAB7nPqQ4hr5V0qUwboZJ0cgXFdsyRZMDjAyoN+0WD_Za6bx9Zg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Feb 12, 2015 at 3:59 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> We're not seeing eye to eye here yet, since I don't accept your
> example scenario and you don't accept mine. Let's keep discussing.
>
> Meanwhile, here's an updated patch.
>

A lot of cool activity is showing up here, so I moved the patch to CF
2015-02. Perhaps, Andres, you could add yourself as a reviewer?
--
Michael


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel mode and parallel contexts
Date: 2015-02-15 06:18:46
Message-ID: CA+TgmobCMwFOz-9=hFv=hJ4SH7p=5X6Ga5V=WtT8=huzE6C+Mg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Feb 13, 2015 at 2:22 AM, Michael Paquier
<michael(dot)paquier(at)gmail(dot)com> wrote:
> On Thu, Feb 12, 2015 at 3:59 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>
>> We're not seeing eye to eye here yet, since I don't accept your
>> example scenario and you don't accept mine. Let's keep discussing.
>>
>> Meanwhile, here's an updated patch.
>
> A lot of cool activity is showing up here, so moved the patch to CF 2015-02.
> Perhaps Andres you could add yourself as a reviewer?

Here's a new version of the patch with (a) some additional
restrictions on heavyweight locking both at the start of, and during,
parallel mode and (b) a write-up in the README explaining the
restrictions and my theory of why the handling of heavyweight locking
is safe. Hopefully this goes some way towards addressing Andres's
concerns. I've also replaced the specific (and wrong) messages about
advisory locks with a more generic message, as previously proposed;
and I've fixed at least one bug.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
parallel-mode-v6.patch application/x-patch 138.0 KB

From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel mode and parallel contexts
Date: 2015-02-17 16:01:50
Message-ID: 20150217160150.GF2895@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2015-02-11 13:59:04 -0500, Robert Haas wrote:
> On Tue, Feb 10, 2015 at 8:45 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> > The only reason I'd like it to be active is because that'd *prohibit*
> > doing the crazier stuff. There seems little reason to not da it under
> > the additional protection against crazy things that'd give us?
>
> Trying to load additional libraries once parallel mode is in progress
> can provide failures, because the load of the libraries causes the
> system to do GUC_ACTION_SET on the GUCs whose initialization was
> deferred, and that trips up the
> no-changing-things-while-in-parallel-mode checks.

Oh, that's a good point.

> I'm not sure if there's anything we can, or should, do about that.

Fine with me.

> > Well, ExportSnapshot()/Import has quite a bit more restrictions than
> > what you're doing... Most importantly it cannot import in transactions
> > unless using read committed isolation, forces xid assignment during
> > export, forces the old snapshot to stick around for the whole
> > transaction and only works on a primary. I'm not actually sure it makes
> > a relevant difference, but it's certainly worth calling attention to.
> >
> > The fuding I was wondering about certainly is unnecessary though -
> > GetSnapshotData() has already performed it...
> >
> > I'm not particularly awake right now and I think this needs a closer
> > look either by someone awake... I'm not fully convinced this is safe.
>
> I'm not 100% comfortable with it either, but I just spent some time
> looking at it and can't see what's wrong with it. Basically, we're
> trying to get the parallel worker into a state that matches the
> master's state after doing GetTransactionSnapshot() - namely,
> CurrentSnapshot should point to the same snapshot on the master, and
> FirstSnapshotSet should be true, plus the same additional processing
> that GetTransactionSnapshot() would have done if we're in a higher
> transaction isolation level. It's possible we don't need to mimic
> that state, but it seems like a good idea.

I plan to look at this soonish.

> Still, I wonder if we ought to be banning GetTransactionSnapshot()
> altogether. I'm not sure if there's ever a time when it's safe for a
> worker to call that.

Why?

> >> > Also, you don't seem to
> >> > prohibit popping the active snapshot (should that be prohibitted
> >> > maybe?) which might bump the initiator's xmin horizon.
> >>
> >> I think as long as our transaction snapshot is installed correctly our
> >> xmin horizon can't advance; am I missing something?
> >
> > Maybe I'm missing something, but why would that be the case in a read
> > committed session? ImportSnapshot() only ever calls
> > SetTransactionSnapshot it such a case (why does it contain code to cope
> > without it?), but your patch doesn't seem to guarantee that.
> >
> > But as don't actually transport XactIsoLevel anywhere (omission
> > probably?) that seems to mean that the SetTransactionSnapshot() won't do
> > much, even if the source transaction is repeatable read.
>
> Doesn't XactIsoLevel get set by assign_XactIsoLevel() when we restore
> the GUC state?

Yes, but afaics the transaction start will just overwrite it again:

static void
StartTransaction(void)
{
    ...
    XactDeferrable = DefaultXactDeferrable;
    XactIsoLevel = DefaultXactIsoLevel;
    ...

For a client issued BEGIN it works because utility.c does:
case TRANS_STMT_BEGIN:
case TRANS_STMT_START:
    {
        ListCell   *lc;

        BeginTransactionBlock();
        foreach(lc, stmt->options)
        {
            DefElem    *item = (DefElem *) lfirst(lc);

            if (strcmp(item->defname, "transaction_isolation") == 0)
                SetPGVariable("transaction_isolation",
                              list_make1(item->arg),
                              true);
            else if (strcmp(item->defname, "transaction_read_only") == 0)
                SetPGVariable("transaction_read_only",
                              list_make1(item->arg),
                              true);
            else if (strcmp(item->defname, "transaction_deferrable") == 0)
                SetPGVariable("transaction_deferrable",
                              list_make1(item->arg),
                              true);
        }

Pretty, isn't it?

> > Your manual ones don't either, that's what made me
> > complain. E.g. pg_advisory_lock_int8 isn't called that on the SQL
> > level. Instead they use the C name + parens.
>
> On further reflection, I think I should probably just change all of
> those to use a general message about advisory locks, i.e.:
>
> ERROR: cannot use advisory locks during a parallel operation
>
> That sound good to you?

Perfectly fine with me.

> > Which is *not* a good example for the problem. Your primary reasoning
> > for needing something more than sharing the locks that we know the
> > individual query is going to need (which we acquire in the parse
> > analyze/planner/executor) is that we can't predict what some random code
> > does. Now, I still don't think that argument should hold too much sway
> > because we'll only be able to run carefully controlled code
> > *anyway*.
>
> I'll avoid repeating what I've said about this before, except to say
> that I still don't believe for a minute you can predict which locks
> you'll need.

I don't understand. Leaving AEL locks on catalog tables aside, pretty
much everything else is easily visible? We already do that for RTE
permission checks and such? There might be some holes, but those should
rather be fixed anyway. What's so hard about determining the locks
required for a query?

> > But *if* we take it as reasoning for doing more than granting
> > the locks we need for the individual query, we *still* need to cope with
> > exactly the variants of the above invisible deadlock. You can still
> > call a function acquiring some form of lock on either side.
>
> That by itself is not a deadlock. If the worker blocks trying to
> acquire a lock that the master held before the start of parallelism,
> it will wait forever, even if the master never acquires a lock. But
> if the worker blocks trying to acquire a lock that the master acquired
> *during* parallelism, it's presumably something like a catalog lock or
> a tuple lock or a relation extension lock that the master is going to
> release quickly.

If there's one lock that's acquired that way, there can easily be
two. And then they just need to be acquired in an inconsistent
order and you have a deadlock.

> > There'll be many more of these, and we can't really
> > reason about them because we're used to locks just working.
>
> FUD.

It certainly scares me.

> >> 3. I welcome proposals for other ways of handling this problem, even
> >> if they restrict the functionality that can be offered. For example,
> >> a proposal that would make parallel_count revert to single-threaded
> >> mode but terminate without an indefinite wait would be acceptable to
> >> me, provided that it offers some advantage in safety and security over
> >> what I've already got.
> >
> > I think the above example where we only grant the locks required for a
> > specific thing and only on the relevant severity would already be a big
> > improvement. Even better would be to use the computed set of locks to
> > check whether workers could acquire them and refuse parallelism in that
> > case.
>
> I think that will just produce lots of very-hard-to-debug undetected
> deadlocks and huge volumes of code that tries and fails to assess what
> locks the worker will need.

Teaching the deadlock detector to recognize that

> > Another thing to do is to mark the lock, both in the master and workers,
> > as not effectively being of the original severity anymore. If the
> > process itself again tries to acquire the lock at a heavier severity than
> > what was needed at the time of query execution, error out. At least until
> > parallel mode has been confirmed to have ended. That should pretty much never
> > happen.
>
> I'm not sure what you mean by the "severity" of the lock.

The lock level.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel mode and parallel contexts
Date: 2015-03-06 12:01:54
Message-ID: CAA4eK1+eRkLYAW1w5wE+xotNSdM7WgVFoPSMi2kPJ_r4zwjJFA@mail.gmail.com
Lists: pgsql-hackers

On Sun, Feb 15, 2015 at 11:48 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> On Fri, Feb 13, 2015 at 2:22 AM, Michael Paquier
> <michael(dot)paquier(at)gmail(dot)com> wrote:
> > On Thu, Feb 12, 2015 at 3:59 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> >>
> >> We're not seeing eye to eye here yet, since I don't accept your
> >> example scenario and you don't accept mine. Let's keep discussing.
> >>
> >> Meanwhile, here's an updated patch.
> >
> > A lot of cool activity is showing up here, so moved the patch to CF 2015-02.
> > Perhaps Andres you could add yourself as a reviewer?
>
> Here's a new version of the patch with (a) some additional
> restrictions on heavyweight locking both at the start of, and during,
> parallel mode and (b) a write-up in the README explaining the
> restrictions and my theory of why the handling of heavyweight locking
> is safe. Hopefully this goes some way towards addressing Andres's
> concerns. I've also replaced the specific (and wrong) messages about
> advisory locks with a more generic message, as previously proposed;
> and I've fixed at least one bug.
>

Today, while testing the parallel_seqscan patch, I encountered one
intermittent issue (it hangs in the function below), and I have a question
about that function.

+void
+WaitForParallelWorkersToFinish(ParallelContext *pcxt)
+{
..
+    for (;;)
+    {
..
+        CHECK_FOR_INTERRUPTS();
+        for (i = 0; i < pcxt->nworkers; ++i)
+        {
+            if (pcxt->worker[i].error_mqh != NULL)
+            {
+                anyone_alive = true;
+                break;
+            }
+        }
+
+        if (!anyone_alive)
+            break;
+
+        WaitLatch(&MyProc->procLatch, WL_LATCH_SET, -1);
+        ResetLatch(&MyProc->procLatch);
+    }

Isn't there a race condition in this function: after it finds that some
worker is still alive, but before it does WaitLatch(), the worker
completes its work and exits. In such a case, who is going to wake the
backend waiting on procLatch?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel mode and parallel contexts
Date: 2015-03-06 12:44:19
Message-ID: CA+TgmoZpBFoXNmHRt13ktKJR1RXmABbKusd6rz9Z=EqBDLUrVA@mail.gmail.com
Lists: pgsql-hackers

On Fri, Mar 6, 2015 at 7:01 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> Today, while testing the parallel_seqscan patch, I encountered one
> intermittent issue (it hangs in the function below), and I have a question
> about that function.
>
> +void
> +WaitForParallelWorkersToFinish(ParallelContext *pcxt)
> +{
> ..
> + for (;;)
> + {
> ..
> + CHECK_FOR_INTERRUPTS();
> + for (i = 0; i < pcxt->nworkers; ++i)
> + {
> + if (pcxt->worker[i].error_mqh != NULL)
> + {
> + anyone_alive = true;
> + break;
> + }
> + }
> +
> + if (!anyone_alive)
> + break;
> +
> + WaitLatch(&MyProc->procLatch, WL_LATCH_SET, -1);
> + ResetLatch(&MyProc->procLatch);
> + }
>
> Isn't there a race condition in this function: after it finds that some
> worker is still alive, but before it does WaitLatch(), the worker
> completes its work and exits. In such a case, who is going to wake the
> backend waiting on procLatch?

It doesn't matter whether some other backend sets the process latch
before we reach WaitLatch() or after we begin waiting. Either way,
it's fine.

It would be bad if the process could exit without setting the latch at
all, though. I hope that's not the case.
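
To spell out the latch idiom in general terms (this is a sketch, not the
patch's exact code; condition_is_satisfied(), make_condition_true() and
waiter_proc are stand-ins for the error_mqh checks, the worker's cleanup,
and the waiting backend's PGPROC):

    /* Waiter: */
    for (;;)
    {
        CHECK_FOR_INTERRUPTS();

        if (condition_is_satisfied())
            break;

        /* If SetLatch() has already run, this returns immediately. */
        WaitLatch(&MyProc->procLatch, WL_LATCH_SET, -1);
        ResetLatch(&MyProc->procLatch);
    }

    /* Other backend: update the shared state first, then wake the waiter. */
    make_condition_true();
    SetLatch(&waiter_proc->procLatch);

As long as the shared state is changed before SetLatch() is called, the
waiter either finds the latch already set and falls through WaitLatch() at
once, or gets woken; either way it re-checks the condition, so the wakeup
can't be lost.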

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel mode and parallel contexts
Date: 2015-03-06 15:16:12
Message-ID: CA+TgmobqVqKfu+A1rWEVGi7KCyN8tJQirNbqOYs4bu94-9iq8g@mail.gmail.com
Lists: pgsql-hackers

On Tue, Feb 17, 2015 at 11:01 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> I'm not 100% comfortable with it either, but I just spent some time
>> looking at it and can't see what's wrong with it. Basically, we're
>> trying to get the parallel worker into a state that matches the
>> master's state after doing GetTransactionSnapshot() - namely,
>> CurrentSnapshot should point to the same snapshot on the master, and
>> FirstSnapshotSet should be true, plus the same additional processing
>> that GetTransactionSnapshot() would have done if we're in a higher
>> transaction isolation level. It's possible we don't need to mimic
>> that state, but it seems like a good idea.
>
> I plan to look at this soonish.

Did you get a chance to look at this yet?

>> Still, I wonder if we ought to be banning GetTransactionSnapshot()
>> altogether. I'm not sure if there's ever a time when it's safe for a
>> worker to call that.
>
> Why?

I don't know. I looked at it more and I don't really see a problem,
but what do I know?

>> Doesn't XactIsoLevel get set by assign_XactIsoLevel() when we restore
>> the GUC state?
>
> Yes, but afaics the transaction start will just overwrite it again:
>
> static void
> StartTransaction(void)
> {
> ...
> XactDeferrable = DefaultXactDeferrable;
> XactIsoLevel = DefaultXactIsoLevel;
> ...

Ah, crap. So, yeah, we need to save and restore that, too. Fixed in
the attached version.
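
Concretely, the fix amounts to something along these lines (the struct and
field names here are illustrative, not necessarily what the patch ends up
using):

    /* Leader, while serializing transaction state for the workers: */
    fps->xact_iso_level = XactIsoLevel;
    fps->xact_deferrable = XactDeferrable;

    /* Worker, after StartTransaction() has reset these to the defaults: */
    XactIsoLevel = fps->xact_iso_level;
    XactDeferrable = fps->xact_deferrable;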

> For a client issued BEGIN it works because utility.c does:
> case TRANS_STMT_BEGIN:
> case TRANS_STMT_START:
> {
> ListCell *lc;
>
> BeginTransactionBlock();
> foreach(lc, stmt->options)
> {
> DefElem *item = (DefElem *) lfirst(lc);
>
> if (strcmp(item->defname, "transaction_isolation") == 0)
> SetPGVariable("transaction_isolation",
> list_make1(item->arg),
> true);
> else if (strcmp(item->defname, "transaction_read_only") == 0)
> SetPGVariable("transaction_read_only",
> list_make1(item->arg),
> true);
> else if (strcmp(item->defname, "transaction_deferrable") == 0)
> SetPGVariable("transaction_deferrable",
> list_make1(item->arg),
> true);
> }
>
> Pretty, isn't it?

I think that's just to handle things like BEGIN ISOLATION LEVEL
SERIALIZABLE. A plain BEGIN would work fine without that AFAICS. It
doesn't seem particularly ugly to me either, but I guess that's
neither here nor there.

>> I'll avoid repeating what I've said about this before, except to say
>> that I still don't believe for a minute you can predict which locks
>> you'll need.
>
> I don't understand. Leaving AEL locks on catalog tables aside, pretty
> much everything else is easily visible? We already do that for RTE
> permission checks and such? There might be some holes, but those should
> rather be fixed anyway. What's so hard about determining the locks
> required for a query?

Well, if you're only going to call built-in functions, and if you
exclude all of the built-in functions that can take locks on
arbitrary objects like pg_get_object_address() and table_to_xml(), and
if you leave locks on catalog tables aside, then, given further that
we're restricting ourselves to read-only transactions, you *can*
determine the locks that will be required for a query -- there won't
be any. But one cannot exclude catalog tables by fiat, and all of
those other restrictions are ones that I'd like to have a chance of
relaxing at some point. It's entirely reasonable for a user to want
to parallelize a query that contains a user-defined PL/pgsql function,
though, and that might do anything.
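
For example, nothing stops somebody from writing a function like this
(illustrative only), where the lock it takes isn't knowable until runtime:

PG_FUNCTION_INFO_V1(lock_some_relation);

Datum
lock_some_relation(PG_FUNCTION_ARGS)
{
    Oid         relid = PG_GETARG_OID(0);

    /* Which relation gets locked depends entirely on the argument value. */
    LockRelationOid(relid, AccessShareLock);
    PG_RETURN_VOID();
}

A PL/pgsql function doing LOCK TABLE, or querying a table whose name is
computed at runtime, has exactly the same property.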

> If there's one lock that's acquired that way, there can easily be
> two. And then they just need need to be acquired in an inconsistent
> order and you have a deadlock.

There is a detailed and hopefully rigorous analysis of locking-related
scenarios in README.parallel in the patch version after the one you
reviewed (posted 2015-02-15). Have you looked at that? (It's also in
this version.)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
parallel-mode-v7.patch binary/octet-stream 138.2 KB

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel mode and parallel contexts
Date: 2015-03-17 22:53:00
Message-ID: CA+TgmoavU_rfLD_yN6fACZRhHECMXeNxJdiNByL2Y9JL1YXQeg@mail.gmail.com
Lists: pgsql-hackers

On Fri, Mar 6, 2015 at 10:16 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> [ responses to review comments ]

Here's a new version, fixing a bug Amit found (missing SetLatch) and a
more significant oversight identified by Noah (XactLastRecEnd
propagation) on the "assessing parallel safety" thread.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
parallel-mode-v8.patch binary/octet-stream 141.5 KB

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel mode and parallel contexts
Date: 2015-03-18 16:02:07
Message-ID: CA+TgmoZJjzYnpXChL3gr7NwRUzkAzPMPVKAtDt5sHvC5Cd7RKw@mail.gmail.com
Lists: pgsql-hackers

On Tue, Mar 17, 2015 at 6:53 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Fri, Mar 6, 2015 at 10:16 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> [ responses to review comments ]
>
> Here's a new version, fixing a bug Amit found (missing SetLatch) and a
> more significant oversight identified by Noah (XactLastRecEnd
> propagation) on the "assessing parallel safety" thread.

Rebased.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
parallel-mode-v8.1.patch binary/octet-stream 141.5 KB

From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel mode and parallel contexts
Date: 2015-03-18 21:36:52
Message-ID: 20150318213652.GA9495@alap3.anarazel.de
Lists: pgsql-hackers

Hi,

Reading the README first, the rest later. So you can comment on my
comments, while I actually look at the code. Parallelism, yay!

On 2015-03-18 12:02:07 -0400, Robert Haas wrote:
> +Instead, we take a more pragmatic approach: we try to make as many of the
> +operations that are safe outside of parallel mode work correctly in parallel
> +mode as well, and we try to prohibit the rest via suitable error
> checks.

I'd say that we'd try to prohibit a bunch of common cases. Among them
the ones that are triggerable from SQL. We don't really try to prohibit
many kinds of traps, as you describe.

> + - The authenticated user ID and current database. Each parallel worker
> + will connect to the same database as the initiating backend, using the
> + same user ID.

This sentence immediately makes me wonder why only the authenticated
user, and not the currently active role, is mentioned.

> + - The values of all GUCs. Accordingly, permanent changes to the value of
> + any GUC are forbidden while in parallel mode; but temporary changes,
> + such as entering a function with non-NULL proconfig, are potentially OK.

"potentially OK" sounds odd to me. Which danger are you seing that isn't
relevan tfor norm

> + - The combo CID mappings. This is needed to ensure consistent answers to
> + tuple visibility checks. The need to synchronize this data structure is
> + a major reason why we can't support writes in parallel mode: such writes
> + might create new combo CIDs, and we have no way to let other workers
> + (or the initiating backend) know about them.

Absolutely *not* in the initial version, but we really need to find a
better solution for this.

> + - The currently active user ID and security context. Note that this is
> + the fourth user ID we restore: the initial step of binding to the correct
> + database also involves restoring the authenticated user ID. When GUC
> + values are restored, this incidentally sets SessionUserId and OuterUserId
> + to the correct values. This final step restores CurrentUserId.

Ah. That's the answer for above. Could you just move it next to the
other user bit?

> +We could copy the entire transaction state stack,
> +but most of it would be useless: for example, you can't roll back to a
> +savepoint from within a parallel worker, and there are no resources
> +associated with the memory contexts or resource owners of intermediate
> +subtransactions.

I do wonder if we're not going to have to change this in the not too far
away future. But then, this isn't externally visible at all, so
whatever.

> +At the end of a parallel operation, which can happen either because it
> +completed successfully or because it was interrupted by an error, parallel
> +workers associated with that operation exit. In the error case, transaction
> +abort processing in the parallel leader kills off any remaining workers, and
> +the parallel leader then waits for them to die.

Very slightly awkward because first you talk about successful *or* error
and then about abort processing.

> + - Cleanup of pg_temp namespaces is not done. The initiating backend is
> + responsible for this, too.

How could a worker have its own pg_temp namespace?

> +Lock Management
> +===============
> +
> +Certain heavyweight locks that the initiating backend holds at the beginning
> +of parallelism are copied to each worker, which unconditionally acquires them.
> +The parallel backends acquire - without waiting - each such lock that the
> +leader holds, even if that lock is self-exclusive. This creates the unusual
> +situation that a lock which could normally only be held by a single backend
> +can be shared among several backends in a parallel group.
> +
> +Obviously, this presents significant hazards that would not be present in
> +normal execution. If, for example, a backend were to initiate parallelism
> +while ReindexIsProcessingIndex() were true for some index, the parallel
> +backends launched at that time would neither share this state nor be excluded
> +from accessing the index via the heavyweight lock mechanism. It is therefore
> +imperative that backends only initiate parallelism from places where it will
> +be safe for parallel workers to access the relations on which they hold locks.
> +It is also important that they not afterwards do anything which causes access
> +to those relations to become unsafe, or at least not until after parallelism
> +has concluded. The fact that parallelism is strictly read-only means that the
> +opportunities for such mishaps are few and far between; furthermore, most
> +operations which present these hazards are DDL operations, which will be
> +rejected by CheckTableNotInUse() if parallel mode is active.
> +
> +Only relation locks, locks on database or shared objects, and perhaps the lock
> +on our transaction ID are copied to workers.

"perhaps"?

> Advisory locks are not copied,
> +but the leader may hold them at the start of parallelism; they cannot
> +subsequently be manipulated while parallel mode is active. It is not safe to
> +attempt parallelism while holding a lock of any other type, such as a page,
> +tuple, or relation extension lock, and such attempts will fail. Although
> +there is currently no reason to do so, such locks could be taken and released
> +during parallel mode; they merely cannot be held at the start of parallel
> +mode, since we would then fail to provide necessary mutual exclusion.

Is it really true that no such locks are acquired? What about e.g. hash
indexes? They seem to be acquiring page locks while searching.

> +Copying locks to workers is important for the avoidance of undetected
> +deadlock between the initiating process and its parallel workers. If the
> +initiating process holds a lock on an object at the start of parallelism,
> +and the worker subsequently attempts to acquire a lock on that same object
> +and blocks, this will result in an undetected deadlock, because the
> +initiating process cannot finish the transaction (thus releasing the lock)
> +until the worker terminates, and the worker cannot acquire the lock while
> +the initiating process holds it. Locks which the processes involved acquire
> +and then release during parallelism do not present this hazard; they simply
> +force the processes involved to take turns accessing the protected resource.

I don't think this is a strong guarantee. There very well can be lack of
forward progress if they're waiting on each other in some way. Say the
master backend holds the lock and waits on output from a worker. The
worker then will endlessly wait for the lock to become free. A
deadlock. Or, as another scenario, consider cooperating backends that
both try to send tuples to each other but the queue is full. A deadlock.

To me it seems the deadlock detector has to be enhanced to be able to
see 'waiting for' edges. Independent of how we resolve our difference of
opinion on the copying of locks.

It seems to me that this isn't all that hard: Whenever we block waiting
for another backend we enter ourselves on the wait queue to that
backend's virtual transaction. When finished we take the blocking
backend off. That, afaics, should do it. Alternatively we can just
publish what backend we're waiting for in PGPROC and make deadlock also
look at that; but, while slightly cleaner, that looks like being more
invasive.

> +Copying locks to workers is also important for the avoidance of undetected
> +deadlock involving both the parallel processe and other processes.

"processe"

> For
> +example, suppose processes A1 and A2 are cooperating parallel processes and
> +B is an unrelated process. Suppose A1 holds a lock L1 and waits for A2 to
> +finish; B holds a lock L2 and blocks waiting for L1; and A2 blocks waiting
> +for L2. Without lock copying, the waits-for graph is A2 -> B -> A1; there
> +is no cycle. With lock copying, A2 will also hold the lock on L1 and the
> +deadlock detector can find the cycle A2 -> B -> A2. As in the case of
> +deadlocks within the parallel group, undetected deadlocks occur if either A1
> +or A2 acquired a lock after the start of parallelism and attempted to
> +retain it beyond the end of parallelism. The prohibitions discussed above
> +protect us against this case.

I think we'd need to add more restrictions to actually make this
guarantee anything. At the very least it would not only have to be
prohibited to end with a lock held, but also to wait for a worker (or
the leader) backend with a not initially granted lock.

Am I missing something or does the copying currently break deadlock.c?
Because afaics that'll compute lock conflicts in FindLockCycleRecurse()
without being aware of the conflicting lock being granted to two
backends. Won't this at least trigger spurious deadlocks? It might
happen to be without consequence for some reason, but this would, at the
very least, need some very careful review.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel mode and parallel contexts
Date: 2015-03-18 23:10:36
Message-ID: 20150318231036.GB9495@alap3.anarazel.de
Lists: pgsql-hackers

On 2015-03-18 12:02:07 -0400, Robert Haas wrote:
> diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
> index cb6f8a3..173f0ba 100644
> --- a/src/backend/access/heap/heapam.c
> +++ b/src/backend/access/heap/heapam.c
> @@ -2234,6 +2234,17 @@ static HeapTuple
> heap_prepare_insert(Relation relation, HeapTuple tup, TransactionId xid,
> CommandId cid, int options)
> {
> + /*
> + * For now, parallel operations are required to be strictly read-only.
> + * Unlike heap_update() and heap_delete(), an insert should never create
> + * a combo CID, so it might be possible to relax this restriction, but
> + * not without more thought and testing.
> + */
> + if (IsInParallelMode())
> + ereport(ERROR,
> + (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
> + errmsg("cannot insert tuples during a parallel operation")));
> +

Minor nitpick: Should we perhaps move the ereport to a separate
function? Akin to PreventTransactionChain()? Seems a bit nicer to not
have the same message everywhere.
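
I.e. something vaguely like this (the name is made up):

void
PreventDuringParallelMode(const char *what)
{
    if (IsInParallelMode())
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
                 errmsg("cannot %s during a parallel operation", what)));
}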

> +void
> +DestroyParallelContext(ParallelContext *pcxt)
> +{
> + int i;
> +
> + /*
> + * Be careful about order of operations here! We remove the parallel
> + * context from the list before we do anything else; otherwise, if an
> + * error occurs during a subsequent step, we might try to nuke it again
> + * from AtEOXact_Parallel or AtEOSubXact_Parallel.
> + */
> + dlist_delete(&pcxt->node);

Hm. I'm wondering about this. What if we actually fail below? Like in
dsm_detach() or its callbacks? Then we'll enter abort again at some
point (during proc_exit at the latest) but will not wait for the child
workers. Right?

> +/*
> + * End-of-subtransaction cleanup for parallel contexts.
> + *
> + * Currently, it's forbidden to enter or leave a subtransaction while
> + * parallel mode is in effect, so we could just blow away everything. But
> + * we may want to relax that restriction in the future, so this code
> + * contemplates that there may be multiple subtransaction IDs in pcxt_list.
> + */
> +void
> +AtEOSubXact_Parallel(bool isCommit, SubTransactionId mySubId)
> +{
> + while (!dlist_is_empty(&pcxt_list))
> + {
> + ParallelContext *pcxt;
> +
> + pcxt = dlist_head_element(ParallelContext, node, &pcxt_list);
> + if (pcxt->subid != mySubId)
> + break;
> + if (isCommit)
> + elog(WARNING, "leaked parallel context");
> + DestroyParallelContext(pcxt);
> + }
> +}

> +/*
> + * End-of-transaction cleanup for parallel contexts.
> + */
> +void
> +AtEOXact_Parallel(bool isCommit)
> +{
> + while (!dlist_is_empty(&pcxt_list))
> + {
> + ParallelContext *pcxt;
> +
> + pcxt = dlist_head_element(ParallelContext, node, &pcxt_list);
> + if (isCommit)
> + elog(WARNING, "leaked parallel context");
> + DestroyParallelContext(pcxt);
> + }
> +}

Is there any reason to treat the isCommit case as a warning? To me that
sounds like a pretty much guaranteed programming error. If you're going
to argue that a couple of resource leakage warnings do similarly: I don't
think that counts for that much. For one, e.g. a leaked tupdesc has much
less severe consequences; for another, IIRC those warnings were introduced
after the fact.

> + * When running as a parallel worker, we place only a single
> + * TransactionStateData is placed on the parallel worker's
> + * state stack,

'we place .. is placed'

> /*
> + * EnterParallelMode
> + */
> +void
> +EnterParallelMode(void)
> +{
> + TransactionState s = CurrentTransactionState;
> +
> + Assert(s->parallelModeLevel >= 0);
> +
> + ++s->parallelModeLevel;
> +}
> +
> +/*
> + * ExitParallelMode
> + */
> +void
> +ExitParallelMode(void)
> +{
> + TransactionState s = CurrentTransactionState;
> +
> + Assert(s->parallelModeLevel > 0);
> + Assert(s->parallelModeLevel > 1 || !ParallelContextActive());
> +
> + --s->parallelModeLevel;
> +
> + if (s->parallelModeLevel == 0)
> + CheckForRetainedParallelLocks();
> +}

Hm. Is it actually ok for nested parallel mode to retain locks on its
own? Won't that pose a problem? Or did you do it that way just because
we don't have more state? If so, that deserves a comment explaining that
that's the case and why that's acceptable.

> @@ -1769,6 +1904,9 @@ CommitTransaction(void)
> {
> TransactionState s = CurrentTransactionState;
> TransactionId latestXid;
> + bool parallel;
> +
> + parallel = (s->blockState == TBLOCK_PARALLEL_INPROGRESS);

> + /* If we might have parallel workers, clean them up now. */
> + if (IsInParallelMode())
> + {
> + CheckForRetainedParallelLocks();
> + AtEOXact_Parallel(true);
> + s->parallelModeLevel = 0;
> + }

'parallel' looks strange when we're also, rightly so, doing stuff like
checking IsInParallelMode(). How about naming it is_parallel_worker or
something?

Sorry, ran out of concentration here. It's been a long day.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel mode and parallel contexts
Date: 2015-03-19 18:13:59
Message-ID: CA+TgmoZfSXZhS6qy4Z0786D7iU_AbhBVPQFwLthpSvGieczqHg@mail.gmail.com
Lists: pgsql-hackers

On Wed, Mar 18, 2015 at 5:36 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> Reading the README first, the rest later. So you can comment on my
> comments, while I actually look at the code. Parallelism, yay!

Sorry, we don't support parallelism yet. :-)

> On 2015-03-18 12:02:07 -0400, Robert Haas wrote:
>> +Instead, we take a more pragmatic approach: we try to make as many of the
>> +operations that are safe outside of parallel mode work correctly in parallel
>> +mode as well, and we try to prohibit the rest via suitable error
>> checks.
>
> I'd say that we'd try to prohibit a bunch of common cases. Among them
> the ones that are triggerable from SQL. We don't really try to prohibit
> many kinds of traps, as you describe.

I revised the text along these lines.

>> + - The values of all GUCs. Accordingly, permanent changes to the value of
>> + any GUC are forbidden while in parallel mode; but temporary changes,
>> + such as entering a function with non-NULL proconfig, are potentially OK.
>
> "potentially OK" sounds odd to me. Which danger are you seing that isn't
> relevan tfor norm

Removed "potentially".

>> + - The combo CID mappings. This is needed to ensure consistent answers to
>> + tuple visibility checks. The need to synchronize this data structure is
>> + a major reason why we can't support writes in parallel mode: such writes
>> + might create new combo CIDs, and we have no way to let other workers
>> + (or the initiating backend) know about them.
>
> Absolutely *not* in the initial version, but we really need to find a
> better solution for this.

I have some ideas, but that's not for right now.

>> + - The currently active user ID and security context. Note that this is
>> + the fourth user ID we restore: the initial step of binding to the correct
>> + database also involves restoring the authenticated user ID. When GUC
>> + values are restored, this incidentally sets SessionUserId and OuterUserId
>> + to the correct values. This final step restores CurrentUserId.
>
> Ah. That's the answer for above. Could you just move it next to the
> other user bit?

Well, I think it's good to keep this in the same order it happens.
That's almost true, with the exception of the libraries, which were
out of order. (I've fixed that now.)

>> +We could copy the entire transaction state stack,
>> +but most of it would be useless: for example, you can't roll back to a
>> +savepoint from within a parallel worker, and there are no resources
>> +associated with the memory contexts or resource owners of intermediate
>> +subtransactions.
>
> I do wonder if we're not going to have to change this in the not too far
> away future. But then, this isn't externally visible at all, so
> whatever.

I definitely thought about copying the whole stack, but Heikki
suggested this approach, and I didn't see a reason to argue with it.
As you say, we can change it in the future if it proves problematic,
but so far I don't see a problem.

>> +At the end of a parallel operation, which can happen either because it
>> +completed successfully or because it was interrupted by an error, parallel
>> +workers associated with that operation exit. In the error case, transaction
>> +abort processing in the parallel leader kills off any remaining workers, and
>> +the parallel leader then waits for them to die.
>
> Very slightly awkward because first you talk about successful *or* error
> and then about abort processing.

I don't understand what's awkward about that. I make a general
statement about what happens at the end of a parallel operation, and
then the next few sentences follow up by explaining what happens in
the error case, and what happens in the success case.

>> + - Cleanup of pg_temp namespaces is not done. The initiating backend is
>> + responsible for this, too.
>
> How could a worker have its own pg_temp namespace?

It can't. My point here is that it won't clean up the master's
pg_temp namespace. I'll add an explicit prohibition against accessing
temporary relations and clarify these remarks.

>> +Lock Management
>> +===============
>> +
>> +Certain heavyweight locks that the initiating backend holds at the beginning
>> +of parallelism are copied to each worker, which unconditionally acquires them.
>> +The parallel backends acquire - without waiting - each such lock that the
>> +leader holds, even if that lock is self-exclusive. This creates the unusual
>> +situation that a lock which could normally only be held by a single backend
>> +can be shared among several backends in a parallel group.
>> +
>> +Obviously, this presents significant hazards that would not be present in
>> +normal execution. If, for example, a backend were to initiate parallelism
>> +while ReindexIsProcessingIndex() were true for some index, the parallel
>> +backends launched at that time would neither share this state nor be excluded
>> +from accessing the index via the heavyweight lock mechanism. It is therefore
>> +imperative that backends only initiate parallelism from places where it will
>> +be safe for parallel workers to access the relations on which they hold locks.
>> +It is also important that they not afterwards do anything which causes access
>> +to those relations to become unsafe, or at least not until after parallelism
>> +has concluded. The fact that parallelism is strictly read-only means that the
>> +opportunities for such mishaps are few and far between; furthermore, most
>> +operations which present these hazards are DDL operations, which will be
>> +rejected by CheckTableNotInUse() if parallel mode is active.
>> +
>> +Only relation locks, locks on database or shared objects, and perhaps the lock
>> +on our transaction ID are copied to workers.
>
> "perhaps"?

I just meant "if we have one". Changed to "and any lock on our transaction ID".

>> Advisory locks are not copied,
>> +but the leader may hold them at the start of parallelism; they cannot
>> +subsequently be manipulated while parallel mode is active. It is not safe to
>> +attempt parallelism while holding a lock of any other type, such as a page,
>> +tuple, or relation extension lock, and such attempts will fail. Although
>> +there is currently no reason to do so, such locks could be taken and released
>> +during parallel mode; they merely cannot be held at the start of parallel
>> +mode, since we would then fail to provide necessary mutual exclusion.
>
> Is it really true that no such locks are acquired? What about e.g. hash
> indexes? They seem to be acquiring page locks while searching.

Ah, right. I have removed "although ... do so". I was thinking that
until we allow writes it wouldn't come up, but that's wrong.

>> +Copying locks to workers is important for the avoidance of undetected
>> +deadlock between the initiating process and its parallel workers. If the
>> +initiating process holds a lock on an object at the start of parallelism,
>> +and the worker subsequently attempts to acquire a lock on that same object
>> +and blocks, this will result in an undetected deadlock, because the
>> +initiating process cannot finish the transaction (thus releasing the lock)
>> +until the worker terminates, and the worker cannot acquire the lock while
>> +the initiating process holds it. Locks which the processes involved acquire
>> +and then release during parallelism do not present this hazard; they simply
>> +force the processes involved to take turns accessing the protected resource.
>
> I don't think this is a strong guarantee. There very well can be lack of
> forward progress if they're waiting on each other in some way. Say the
> master backend holds the lock and waits on output from a worker. The
> worker then will endlessly wait for the lock to become free. A
> deadlock. Or, as another scenario, consider cooperating backends that
> both try to send tuples to each other but the queue is full. A deadlock.

The idea is that both the master and the workers are restricted to
locks which they take and release again. The idea is, specifically,
to allow syscache lookups. Taking a lock and then waiting for the
worker is no good, which is why WaitForParallelWorkersToFinish() calls
CheckForRetainedParallelLocks(). That would catch your first
scenario.

Your second scenario seems to me to be a different kind of problem.
If that comes up, what you're going to want to do is rewrite the
workers to avoid the deadlock by using non-blocking messaging. (The
recent discussions of fixing similar deadlocks where a libpq client
and a postgres server are both blocked on write while both output
buffers are full center around similar issues.) Detecting the
deadlock and aborting is better than nothing, but not by much. In any
case, I don't think it's really this patch's job to prevent deadlocks
in code that doesn't exist today and in no way involves the lock
manager.
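
By non-blocking messaging I mean the existing shm_mq nowait interface,
roughly like this (sketch only; outqh, nbytes and data are whatever the
caller is sending, and drain_incoming_messages() is a placeholder for
reading the peer's queue):

    for (;;)
    {
        shm_mq_result res;

        /* nowait = true: return instead of sleeping if the queue is full */
        res = shm_mq_send(outqh, nbytes, data, true);

        if (res == SHM_MQ_SUCCESS)
            break;
        if (res == SHM_MQ_DETACHED)
            ereport(ERROR,
                    (errcode(ERRCODE_INTERNAL_ERROR),
                     errmsg("lost connection to parallel worker")));

        /*
         * SHM_MQ_WOULD_BLOCK: before retrying, consume whatever the peer
         * has sent us so that it can make progress, then wait on our latch.
         */
        drain_incoming_messages();
        WaitLatch(&MyProc->procLatch, WL_LATCH_SET, -1);
        ResetLatch(&MyProc->procLatch);
        CHECK_FOR_INTERRUPTS();
    }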

> To me it seems the deadlock detector has to be enhanced to be able to
> see 'waiting for' edges. Independent of how we resolve our difference of
> opinion on the copying of locks.
>
> It seems to me that this isn't all that hard: Whenever we block waiting
> for another backend we enter ourselves on the wait queue to that
> backend's virtual transaction. When finished we take the blocking
> backend off. That, afaics, should do it. Alternatively we can just
> publish what backend we're waiting for in PGPROC and make deadlock also
> look at that; but, while slightly cleaner, that looks like being more
> invasive.

That's an interesting idea. It would be more flexible than what I've
got here right now, in that parallel backends could take and retain
locks on arbitrary objects, and we'd only error out if it actually
created a deadlock, instead of erroring out because of the potential
for a deadlock under some unlikely circumstances. But it can't be
done with existing lock manager APIs - right now there is no way to
put yourself on a wait queue for a virtual transaction except to try
to acquire a conflicting lock, and that's no good because then you
aren't actually trying to read data from it. You'd need some kind of
API that says "pretend I'm waiting for this lock, but don't really
wait for it", and you'd need to be darn sure that you removed yourself
from the wait queue again before doing any other heavyweight lock
manipulation. Do you have specific thoughts on how to implement this?

>> +Copying locks to workers is also important for the avoidance of undetected
>> +deadlock involving both the parallel processe and other processes.
>
> "processe"

Fixed.

>> For
>> +example, suppose processes A1 and A2 are cooperating parallel processes and
>> +B is an unrelated process. Suppose A1 holds a lock L1 and waits for A2 to
>> +finish; B holds a lock L2 and blocks waiting for L1; and A2 blocks waiting
>> +for L2. Without lock copying, the waits-for graph is A2 -> B -> A1; there
>> +is no cycle. With lock copying, A2 will also hold the lock on L1 and the
>> +deadlock detector can find the cycle A2 -> B -> A2. As in the case of
>> +deadlocks within the parallel group, undetected deadlocks occur if either A1
>> +or A2 acquired a lock after the start of parallelism and attempted to
>> +retain it beyond the end of parallelism. The prohibitions discussed above
>> +protect us against this case.
>
> I think we'd need to add more restrictions to actually make this
> guarantee anything. At the very least it would not only have to be
> prohibited to end with a lock held, but also to wait for a worker (or
> the leader) backend with a not initially granted lock.

I don't think so. The latter is safe. The other backend will finish
with the lock and then you get it afterwards.

> Am I missing something or does the copying currently break deadlock.c?
> Because afaics that'll compute lock conflicts in FindLockCycleRecurse()
> without being aware of the conflicting lock being granted to two
> backends. Won't this at least trigger spurious deadlocks? It might
> happen to be without consequence for some reason, but this would, at the
> very least, need some very careful review.

There may be a problem here, but I'm not seeing it. Initially, we're
blocking on a lock that we don't already hold, so it's not one of the
copied ones. If we follow one or more waits-for edges and arrive back
at a copied lock, that's a real deadlock and should be reported as
such.

Here is yet another version of this patch. In addition to the fixes
mentioned above, this version includes some minor rebasing around
recent commits, and also better handling of the case where we discover
that we cannot launch workers after all. This can happen because (1)
dynamic_shared_memory_type=none, (2) all of the DSM segments permitted
by the system configuration are already in use, or
(3) the user creates a parallel context with nworkers=0. In any of
those cases, the system will now create a backend-private memory
segment instead of a dynamic shared memory segment, and will skip
steps that don't need to be done in that case. This is obviously an
undesirable scenario. If we choose a parallel sequential scan, we
want it to launch workers and really run in parallel. Hopefully, in
case (1) or case (3), we will avoid choosing a parallel plan in the
first place, but case (2) is pretty hard to avoid completely, as we
have no idea what other processes may or may not be doing with dynamic
shared memory segments ... and, in any case, degrading to non-parallel
execution beats failing outright.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
parallel-mode-v9.patch binary/octet-stream 145.3 KB

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel mode and parallel contexts
Date: 2015-03-19 18:42:14
Message-ID: CA+TgmoYn4zoyUN-Yb-9dVFFAu5mNgFXcifCXZ=8-Qs1Zqy2y2Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Mar 18, 2015 at 7:10 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> On 2015-03-18 12:02:07 -0400, Robert Haas wrote:
>> diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
>> index cb6f8a3..173f0ba 100644
>> --- a/src/backend/access/heap/heapam.c
>> +++ b/src/backend/access/heap/heapam.c
>> @@ -2234,6 +2234,17 @@ static HeapTuple
>> heap_prepare_insert(Relation relation, HeapTuple tup, TransactionId xid,
>> CommandId cid, int options)
>> {
>> + /*
>> + * For now, parallel operations are required to be strictly read-only.
>> + * Unlike heap_update() and heap_delete(), an insert should never create
>> + * a combo CID, so it might be possible to relax this restriction, but
>> + * not without more thought and testing.
>> + */
>> + if (IsInParallelMode())
>> + ereport(ERROR,
>> + (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
>> + errmsg("cannot insert tuples during a parallel operation")));
>> +
>
> Minor nitpick: Should we perhaps move the ereport to a separate
> function? Akin to PreventTransactionChain()? Seems a bit nicer to not
> have the same message everywhere.

Well, it's not actually the same message. They're all a bit
different. Or mostly all of them. And the variable part is not a
command name, as in the PreventTransactionChain() case, so it would
affect translatability if we did that. I don't actually think this is
a big deal.

>> +void
>> +DestroyParallelContext(ParallelContext *pcxt)
>> +{
>> + int i;
>> +
>> + /*
>> + * Be careful about order of operations here! We remove the parallel
>> + * context from the list before we do anything else; otherwise, if an
>> + * error occurs during a subsequent step, we might try to nuke it again
>> + * from AtEOXact_Parallel or AtEOSubXact_Parallel.
>> + */
>> + dlist_delete(&pcxt->node);
>
> Hm. I'm wondering about this. What if we actually fail below? Like in
> dsm_detach() or its callbacks? Then we'll enter abort again at some
> point (during proc_exit at the latest) but will not wait for the child
> workers. Right?

Right. It's actually pretty hard to recover when things that get
called in the abort path fail, which is why dsm_detach() and
shm_mq_detach_callback() try pretty hard to only do things that are
no-fail. For example, failing to detach a dsm gives a WARNING, not
an ERROR. Now, I did make some attempt in previous patches to ensure
that proc_exit() has some ability to recover even if a callback fails
(cf 001a573a2011d605f2a6e10aee9996de8581d099) but I'm not too sure how
useful that's going to be in practice. I'm open to ideas on how to
improve this.

>> +void
>> +AtEOXact_Parallel(bool isCommit)
>> +{
>> + while (!dlist_is_empty(&pcxt_list))
>> + {
>> + ParallelContext *pcxt;
>> +
>> + pcxt = dlist_head_element(ParallelContext, node, &pcxt_list);
>> + if (isCommit)
>> + elog(WARNING, "leaked parallel context");
>> + DestroyParallelContext(pcxt);
>> + }
>> +}
>
> Is there any reason to treat the isCommit case as a warning? To me that
> sounds like a pretty much guaranteed programming error. If you're going
> to argue that a couple of resource leakage warnings do similarly: I don't
> think that counts for that much. For one, e.g. a leaked tupdesc has much
> less severe consequences; for another, IIRC those warnings were introduced
> after the fact.

Yeah, I'm going to argue that. A leaked parallel context is pretty
harmless if there are no workers associated with it. If there are,
it's less harmless, of course, but it doesn't seem to me that making
that an ERROR buys us much. I mean, the transaction is going to end
either way, and a message is going to get printed either way, and it's
a bug either way, so whatever.

>> + * When running as a parallel worker, we place only a single
>> + * TransactionStateData is placed on the parallel worker's
>> + * state stack,
>
> 'we place .. is placed'

Will fix in next update.

>> + if (s->parallelModeLevel == 0)
>> + CheckForRetainedParallelLocks();
>> +}
>
> Hm. Is it actually ok for nested parallel mode to retain locks on its
> own? Won't that pose a problem? Or did you do it that way just because
> we don't have more state? If so, that deserves a comment explaining that
> that's the case and why that's acceptable.

The only time it's really a problem to retain locks is if you are
doing WaitForParallelWorkersToFinish(). This is pretty much just a
belt-and-suspenders check to make it easier to notice that you've
goofed. But, sure, removing the if part makes sense. I'll do that in
the next update.
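
So ExitParallelMode() would end up looking something like this:

void
ExitParallelMode(void)
{
    TransactionState s = CurrentTransactionState;

    Assert(s->parallelModeLevel > 0);
    Assert(s->parallelModeLevel > 1 || !ParallelContextActive());

    --s->parallelModeLevel;

    /* Checked at every exit from parallel mode now, not just the outermost. */
    CheckForRetainedParallelLocks();
}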

>> @@ -1769,6 +1904,9 @@ CommitTransaction(void)
>> {
>> TransactionState s = CurrentTransactionState;
>> TransactionId latestXid;
>> + bool parallel;
>> +
>> + parallel = (s->blockState == TBLOCK_PARALLEL_INPROGRESS);
>
>> + /* If we might have parallel workers, clean them up now. */
>> + if (IsInParallelMode())
>> + {
>> + CheckForRetainedParallelLocks();
>> + AtEOXact_Parallel(true);
>> + s->parallelModeLevel = 0;
>> + }
>
> 'parallel' looks strange when we're also, rightly so, doing stuff like
> checking IsInParallelMode(). How about naming it is_parallel_worker or
> something?

Yeah, makes sense. Will do that, too.

> Sorry, ran out of concentration here. It's been a long day.

Thanks for the review so far.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel mode and parallel contexts
Date: 2015-03-24 19:26:28
Message-ID: 20150324192628.GG15229@alap3.anarazel.de
Lists: pgsql-hackers

On 2015-03-19 14:13:59 -0400, Robert Haas wrote:
> On Wed, Mar 18, 2015 at 5:36 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> > Reading the README first, the rest later. So you can comment on my
> > comments, while I actually look at the code. Parallelism, yay!
>
> Sorry, we don't support parallelism yet. :-)

And not even proper sequential work apparently...

> >> + - The currently active user ID and security context. Note that this is
> >> + the fourth user ID we restore: the initial step of binding to the correct
> >> + database also involves restoring the authenticated user ID. When GUC
> >> + values are restored, this incidentally sets SessionUserId and OuterUserId
> >> + to the correct values. This final step restores CurrentUserId.
> >
> > Ah. That's the answer for above. Could you just move it next to the
> > other user bit?
>
> Well, I think it's good to keep this in the same order it happens.
> That's almost true, with the exception of the libraries, which were
> out of order. (I've fixed that now.)

I don't find this convincing in the least. This should explain to a
developer which state he can rely on being shared. For that, a sane order
is helpful. The precise order in which this happens doesn't seem to matter
for that at all; it's also likely to get out of date.

> >> +At the end of a parallel operation, which can happen either because it
> >> +completed successfully or because it was interrupted by an error, parallel
> >> +workers associated with that operation exit. In the error case, transaction
> >> +abort processing in the parallel leader kills off any remaining workers, and
> >> +the parallel leader then waits for them to die.
> >
> > Very slightly awkward because first you talk about successful *or* error
> > and then about abort processing.
>
> I don't understand what's awkward about that. I make a general
> statement about what happens at the end of a parallel operation, and
> then the next few sentences follow up by explaining what happens in
> the error case, and what happens in the success case.

I seem to have completely overlooked the "In the error case, " part of the
sentence. Forget what I wrote.

> >> +Copying locks to workers is important for the avoidance of undetected
> >> +deadlock between the initiating process and its parallel workers. If the
> >> +initiating process holds a lock on an object at the start of parallelism,
> >> +and the worker subsequently attempts to acquire a lock on that same object
> >> +and blocks, this will result in an undetected deadlock, because the
> >> +initiating process cannot finish the transaction (thus releasing the lock)
> >> +until the worker terminates, and the worker cannot acquire the lock while
> >> +the initiating process holds it. Locks which the processes involved acquire
> >> +and then release during parallelism do not present this hazard; they simply
> >> +force the processes involved to take turns accessing the protected resource.
> >
> > I don't think this is a strong guarantee. There very well can be lack of
> > forward progress if they're waiting on each other in some way. Say the
> > master backend holds the lock and waits on output from a worker. The
> > worker then will endlessly wait for the lock to become free. A
> > deadlock. Or, as another scenario, consider cooperating backends that
> > both try to send tuples to each other but the queue is full. A deadlock.
>
> The idea is that both the master and the workers are restricted to
> locks which they take and release again. The idea is, specifically,
> to allow syscache lookups. Taking a lock and then waiting for the
> worker is no good, which is why WaitForParallelWorkersToFinish() calls
> CheckForRetainedParallelLocks(). That would catch your first
> scenario.
>
> Your second scenario seems to me to be a different kind of problem.
> If that comes up, what you're going to want to do is rewrite the
> workers to avoid the deadlock by using non-blocking messaging.

I don't think that actually really solves the problem. You'll still end
up blocking with full "pipes" at some point. Whether you do it inside
the messaging or outside the messaging code doesn't particularly matter.

> In any case, I don't think it's really this patch's job to prevent
> deadlocks in code that doesn't exist today and in no way involves the
> lock manager.

Well, you're arguing that we need a solution in the parallelism
infrastructure for the lock deadlocks problem. I don't think it's absurd
to extend care to other cases.

> > To me it seems the deadlock detector has to be enhanced to be able to
> > see 'waiting for' edges. Independent of how we resolve our difference of
> > opinion on the copying of locks.
> >
> > It seems to me that this isn't all that hard: Whenever we block waiting
> > for another backend we enter ourselves on the wait queue to that
> > backend's virtual transaction. When finished we take the blocking
> > backend off. That, afaics, should do it. Alternatively we can just
> > publish what backend we're waiting for in PGPROC and make deadlock also
> > look at that; but, while slightly cleaner, that looks like being more
> > invasive.
>
> That's an interesting idea. It would be more flexible than what I've
> got here right now, in that parallel backends could take and retain
> locks on arbitrary objects, and we'd only error out if it actually
> created a deadlock, instead of erroring out because of the potential
> for a deadlock under some unlikely circumstances.

It also seems like it'd be able to deal with a bunch of scenarios that
the current approach wouldn't be able to deal with.

> But it can't be
> done with existing lock manager APIs - right now there is no way to
> put yourself on a wait queue for a virtual transaction except to try
> to acquire a conflicting lock, and that's no good because then you
> aren't actually trying to read data from it.

Right.

> You'd need some kind of
> API that says "pretend I'm waiting for this lock, but don't really
> wait for it", and you'd need to be darn sure that you removed yourself
> from the wait queue again before doing any other heavyweight lock
> manipulation. Do you have specific thoughts on how to implement this?

I've thought some about this, and I think it's a bit easier to not do it
on the actual lock waitqueues, but teach deadlock.c about that kind of
blocking.

deadlock.c is far from simple, and at least I don't find the control
flow to be particularly clear. So it's not easy. It'd be advantageous
to tackle things at that level because it'd avoid the need to acquire
locks on some lock's waitqueue when blocking; we're going to do that a
lot.

But it seems to me that it should be possible to succeed: In
FindLockCycleRecurse(), in the case that we're not waiting for an actual
lock (checkProc->links.next == NULL) we can add a case that considers
the 'blocking parallelism' case. ISTM that that's just a
FindLockCycleRecurse() call on the process that we're waiting for. We'd
either have to find the other side's locktag for DEADLOCK_INFO or invent
another field in there; but that seems like a solvable problem.
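
To make that concrete, here is a minimal, self-contained sketch of the idea
(simplified stand-in structures, not actual deadlock.c code): a process can
be blocked either on a lock or directly on another process, and the cycle
search simply follows whichever edge is present:

#include <stdbool.h>
#include <stdio.h>

/* Simplified stand-ins for PGPROC and LOCK; not PostgreSQL structures. */
typedef struct Proc Proc;

typedef struct Lock
{
    Proc       *holder;         /* who currently holds this lock */
} Lock;

struct Proc
{
    const char *name;
    Lock       *waitLock;       /* non-NULL if blocked on a lock */
    Proc       *waitProc;       /* non-NULL if blocked on another process,
                                 * e.g. a leader waiting for its worker */
    bool        visited;
};

/*
 * Analogue of the proposed FindLockCycleRecurse() extension: follow
 * waits-for edges from "proc" and report whether we get back to "start".
 * The waitProc edge is the new "blocking parallelism" case.
 */
static bool
ReachesStart(Proc *start, Proc *proc)
{
    if (proc == start)
        return true;
    if (proc->visited)
        return false;
    proc->visited = true;

    if (proc->waitLock != NULL && proc->waitLock->holder != NULL)
        return ReachesStart(start, proc->waitLock->holder);
    if (proc->waitProc != NULL)
        return ReachesStart(start, proc->waitProc);
    return false;
}

int
main(void)
{
    Proc        leader = {"leader", NULL, NULL, false};
    Proc        worker = {"worker", NULL, NULL, false};
    Lock        rel = {&leader}; /* leader holds a lock on some relation */

    worker.waitLock = &rel;      /* worker blocks on that lock */
    leader.waitProc = &worker;   /* leader waits for the worker to finish */

    /* The leader is deadlocked if whatever it waits for leads back to it. */
    printf("deadlock: %s\n",
           ReachesStart(&leader, leader.waitProc) ? "yes" : "no");
    return 0;
}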

> > Am I missing something or does the copying currently break deadlock.c?
> > Because afaics that'll compute lock conflicts in FindLockCycleRecurse()
> > without being aware of the conflicting lock being granted to two
> > backends. Won't this at least trigger spurious deadlocks? It might
> > happen to be without consequence for some reason, but this would, at the
> > very least, need some very careful review.
>
> There may be a problem here, but I'm not seeing it. Initially, we're
> blocking on a lock that we don't already hold, so it's not one of the
> copied ones. If we follow one or more waits-for edges and arrive back
> at a copied lock, that's a real deadlock and should be reported as
> such.

That's a fair point. I was worried that this is going to introduce
additional hard edges between processes. I think it actually might, but
only when a lock is upgraded; which really shouldn't happen for the
copied locks.

Also: Man, trying to understand the guts of deadlock.c only made me
understand how *friggin* expensive deadlock checking is. I'm really
rather surprised that it only infrequently causes problems. I think I
have seen a report or two where it might have been the deadlock check
that went bonkers during schema upgrades on larger setups, but that's
it.

I'm not sure if I said that somewhere before: If we aborted parallelism
if any of the to-be-copied locks would conflict with its copy, instead
of duplicating them, I could live with this. Then this would amount to
jumping the lock queue, which seems reasonable to me. Since not being
able to do parallelism already needs to be handled gracefully, this
doesn't seem to be too bad.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel mode and parallel contexts
Date: 2015-03-24 21:22:11
Message-ID: 8027.1427232131@sss.pgh.pa.us
Lists: pgsql-hackers

Andres Freund <andres(at)2ndquadrant(dot)com> writes:
> Also: Man, trying to understand the guts of deadlock.c only made me
> understand how *friggin* expensive deadlock checking is. I'm really
> rather surprised that it only infrequently causes problems.

The reason for that is that we only run deadlock checking if something's
been waiting for at least one second, which pretty much takes it out
of any performance-relevant code pathway. I think it would be a serious
error to put any deadlock-checking-like behavior into mainline code.
Which probably means that Andres is right that teaching deadlock.c
about any new sources of deadlock is the way to approach this.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel mode and parallel contexts
Date: 2015-03-24 22:29:11
Message-ID: CA+Tgmoap6dPa4ueg=DE4kdH6aiKF__0GK=pVzeLiuTbz6d+Zkg@mail.gmail.com
Lists: pgsql-hackers

On Tue, Mar 24, 2015 at 3:26 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> >> + - The currently active user ID and security context. Note that this is
>> >> + the fourth user ID we restore: the initial step of binding to the correct
>> >> + database also involves restoring the authenticated user ID. When GUC
>> >> + values are restored, this incidentally sets SessionUserId and OuterUserId
>> >> + to the correct values. This final step restores CurrentUserId.
>> >
>> > Ah. That's the answer for above. Could you just move it next to the
>> > other user bit?
>>
>> Well, I think it's good to keep this in the same order it happens.
>> That's almost true, with the exception of the libraries, which were
>> out of order. (I've fixed that now.)
>
> I don't find this convincing in the least. This should explain to a
> developer which state he can rely on being shared. For that a sane order
> is helpful. In which precise order this happens doesn't seem to matter
> for that at all; it's also likely to get out of date.

I'm hoping we can agree to disagree on this point. I like my order
better, you like your order better; we're arguing about the ordering of
paragraphs in a README.

>> You'd need some kind of
>> API that says "pretend I'm waiting for this lock, but don't really
>> wait for it", and you'd need to be darn sure that you removed yourself
>> from the wait queue again before doing any other heavyweight lock
>> manipulation. Do you have specific thoughts on how to implement this?
>
> I've thought some about this, and I think it's a bit easier to not do it
> on the actual lock waitqueues, but teach deadlock.c about that kind of
> blocking.
>
> deadlock.c is far from simple, and at least I don't find the control
> flow to be particularly clear. So it's not easy. It'd be advantageous
> to tackle things at that level because it'd avoid the need to acquire
> locks on some lock's waitqueue when blocking; we're going to do that a
> lot.
>
> But it seems to me that it should be possible to succeed: In
> FindLockCycleRecurse(), in the case that we're not waiting for an actual
> lock (checkProc->links.next == NULL) we can add a case that considers
> the 'blocking parallelism' case. ISTM that that's just a
> FindLockCycleRecurse() call on the process that we're waiting for. We'd
> either have to find the other side's locktag for DEADLOCK_INFO or invent
> another field in there; but that seems like a solvable problem.

I'll take a look at this. Thanks for the pointers.

> That's a fair point. I was worried that this is going to introduce
> additional hard edges between processes. I think it actually might, but
> only when a lock is upgraded; which really shouldn't happen for the
> copied locks.

Agreed. And if it does, and we get a deadlock, I think I'm sorta OK
with that. It's a known fact that lock upgrades are a deadlock
hazard, and the possible existence of obscure scenarios where that's
more likely when parallelism is involved than without it doesn't
bother me terribly. I mean, we have to look at the specifics, but I
think it's far from obvious that

> I'm not sure if I said that somewhere before: If we aborted parallelism
> if any of the to-be-copied locks would conflict with its copy, instead
> of duplicating them, I could live with this. Then this would amount to
> jumping the lock queue, which seems reasonable to me. Since not being
> able to do parallelism already needs to be handled gracefully, this
> doesn't seem to be too bad.

It's a tempting offer since it would shorten the path to getting this
committed, but I think that's leaving a lot on the table. In
particular, if we ever want to have parallel CLUSTER, VACUUM FULL, or
ALTER TABLE, that's not going to get us there. If we don't have the
right infrastructure to support that in whatever the initial commit
is, fine. But we've got to have a plan that gets us there, or we're
just boxing ourselves into a corner.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel mode and parallel contexts
Date: 2015-03-25 03:56:07
Message-ID: 20150325035607.GK3636@alvh.no-ip.org
Lists: pgsql-hackers

Robert Haas wrote:
> On Wed, Mar 18, 2015 at 7:10 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:

> >> + /*
> >> + * For now, parallel operations are required to be strictly read-only.
> >> + * Unlike heap_update() and heap_delete(), an insert should never create
> >> + * a combo CID, so it might be possible to relax this restrction, but
> >> + * not without more thought and testing.
> >> + */
> >> + if (IsInParallelMode())
> >> + ereport(ERROR,
> >> + (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
> >> + errmsg("cannot insert tuples during a parallel operation")));
> >> +
> >
> > Minor nitpick: Should we perhaps move the ereport to a separate
> > function? Akin to PreventTransactionChain()? Seems a bit nicer to not
> > have the same message everywhere.
>
> Well, it's not actually the same message. They're all a bit
> different. Or mostly all of them. And the variable part is not a
> command name, as in the PreventTransactionChain() case, so it would
> affect translatability if we did that.

Three things that vary are 1) the fact that some check IsParallelWorker()
and others check IsInParallelMode(), and 2) that some of them are
ereports() while others are elog(), and 3) that some are ERROR and
others are FATAL. Is there a reason for these differences?

(Note typo "restrction" in quoted paragraph above.)

Maybe something similar to what commit f88d4cfc did: have a function
where all possible messages are together, and one is selected with some
enum. That way, it's easier to maintain consistency if more cases are
added in the future.
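
For illustration, a rough sketch of what that might look like (backend-style
code; the enum and function names here are invented, not taken from the
patch or from f88d4cfc):

/* Hypothetical; names invented for illustration. */
typedef enum ParallelUnsafeOperation
{
    PARALLEL_UNSAFE_INSERT,
    PARALLEL_UNSAFE_UPDATE,
    PARALLEL_UNSAFE_DELETE
} ParallelUnsafeOperation;

static void
PreventOperationDuringParallelMode(ParallelUnsafeOperation op)
{
    if (!IsInParallelMode())
        return;

    switch (op)
    {
        case PARALLEL_UNSAFE_INSERT:
            ereport(ERROR,
                    (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
                     errmsg("cannot insert tuples during a parallel operation")));
            break;
        case PARALLEL_UNSAFE_UPDATE:
            ereport(ERROR,
                    (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
                     errmsg("cannot update tuples during a parallel operation")));
            break;
        case PARALLEL_UNSAFE_DELETE:
            ereport(ERROR,
                    (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
                     errmsg("cannot delete tuples during a parallel operation")));
            break;
    }
}

Each message stays a separate string literal, so translatability shouldn't
be affected; the enum only centralizes where the messages live.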

> I don't actually think this is a big deal.

Yeah.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel mode and parallel contexts
Date: 2015-04-20 22:49:21
Message-ID: CAM3SWZSefE4uQk3r_3gwpfDWWtT3P51SceVsL4=g8v_mE2Abtg@mail.gmail.com
Lists: pgsql-hackers

On Thu, Mar 19, 2015 at 11:13 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> Here is yet another version of this patch. In addition to the fixes
> mentioned above, this version includes some minor rebasing around
> recent commits, and also better handling of the case where we discover
> that we cannot launch workers after all. This can happen because (1)
> dynamic_shared_memory_type=none, (2) the maximum number of DSM
> segments supported by the system configuration is already in use, or
> (3) the user creates a parallel context with nworkers=0. In any of
> those cases, the system will now create a backend-private memory
> segment instead of a dynamic shared memory segment, and will skip
> steps that don't need to be done in that case. This is obviously an
> undesirable scenario. If we choose a parallel sequential scan, we
> want it to launch workers and really run in parallel. Hopefully, in
> case (1) or case (3), we will avoid choosing a parallel plan in the
> first place, but case (2) is pretty hard to avoid completely, as we
> have no idea what other processes may or may not be doing with dynamic
> shared memory segments ... and, in any case, degrading to non-parallel
> execution beats failing outright.

I see that you're using git format-patch to generate this. But the
patch is only patch 1/4. Is that intentional? Where are the other
pieces?

I think that the parallel seqscan patch and the assess-parallel-safety
patch are intended to fit together with this patch, but I can't
find a place where there is a high level overview explaining just how
they fit together (I notice Amit's patch has an #include
"access/parallel.h", which is here, but that wasn't trivial to figure
out). I haven't been paying too much attention to this patch series.

--
Peter Geoghegan


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel mode and parallel contexts
Date: 2015-04-21 18:08:00
Message-ID: CA+TgmoZ0tOPvfuz5BmEUqmdQ9xYQHzkG4jjJXduYDgQJXmPFbQ@mail.gmail.com
Lists: pgsql-hackers

On Tue, Mar 24, 2015 at 3:26 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> You'd need some kind of
>> API that says "pretend I'm waiting for this lock, but don't really
>> wait for it", and you'd need to be darn sure that you removed yourself
>> from the wait queue again before doing any other heavyweight lock
>> manipulation. Do you have specific thoughts on how to implement this?
>
> I've thought some about this, and I think it's a bit easier to not do it
> on the actual lock waitqueues, but teach deadlock.c about that kind of
> blocking.
>
> deadlock.c is far from simple, and at least I don't find the control
> flow to be particularly clear. So it's not easy. It'd be advantageous
> to tackle things at that level because it'd avoid the need to acquire
> locks on some lock's waitqueue when blocking; we're going to do that a
> lot.
>
> But it seems to me that it should be possible to succeed: In
> FindLockCycleRecurse(), in the case that we're not waiting for an actual
> lock (checkProc->links.next == NULL) we can add a case that considers
> the 'blocking parallelism' case. ISTM that that's just a
> FindLockCycleRecurse() call on the process that we're waiting for. We'd
> either have to find the other side's locktag for DEADLOCK_INFO or invent
> another field in there; but that seems like a solvable problem.

I (finally) had some time to look at this today. Initially, it looked
good. Unfortunately, the longer I looked at it, the less convinced I
got that we could solve the problem this way. The core of the issue
is that the Funnel node in the parallel group leader basically does
this:

while (!local_scan_done || !remote_scan_done)
{
    attempt a read from each remaining worker's tuple queue, blocking
        only if local_scan_done;
    if (we got a tuple)
        return it;
    else if (there are no remaining workers)
        remote_scan_done = true;

    attempt to produce a tuple just as if we were a worker ourselves;
    if (we got a tuple)
        return it;
    else
        local_scan_done = true;
}

Imagine that we're doing a parallel sequential scan; each worker
claims one page but goes into the tank before it has returned all of
the tuples on that page. The master reads all the remaining pages but
must now wait for the workers to finish returning the tuples on the
pages they claimed.

So what this means is:

1. The master doesn't actually *wait* until the very end of the parallel phase.
2. When the master does wait, it waits for all of the parallel workers
at once, not each one individually.

So, I don't think anything as simplistic as teaching a blocking
shm_mq_receive() to tip off the deadlock detector that we're waiting
for the process on the other end of that particular queue can ever
work. Because of point #2, that never happens. When I first started
thinking about how to fix this, I said, well, that's no big deal, we
can just advertise the whole list of processes that we're waiting for
in shared memory, rather than just the one. This is a bit tricky,
though. Any general API for any backend to advertise that it's waiting
for an arbitrary subset of the other backends would require O(n^2)
shared memory state. That wouldn't be completely insane, but it's not
great, either. For this particular case, we could optimize that down
to O(n) by just chaining all of the children of a given parallel group
leader in a linked list whose nodes are inlined in their PGPROCs, but that
doesn't feel very general, because it forecloses the possibility of
the children ever using that API, and I think they might need to. If
nothing else, they might need to advertise that they're blocking on
the master if they are trying to send data, the queue is full, and
they have to wait for the master to drain some of it before they can
proceed.
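
For what it's worth, here is a minimal sketch of that O(n) chaining idea
(simplified stand-in structures, not the real PGPROC):

#include <stdio.h>

/* Simplified stand-in for PGPROC with an inlined group-member link. */
typedef struct Proc
{
    int         pid;
    struct Proc *groupLeader;     /* NULL if not part of a parallel group */
    struct Proc *nextGroupMember; /* singly linked list rooted at the leader */
} Proc;

/* Worker joins its leader's group: O(1) insert, O(n) total shared state. */
static void
JoinGroup(Proc *leader, Proc *worker)
{
    worker->groupLeader = leader;
    worker->nextGroupMember = leader->nextGroupMember;
    leader->nextGroupMember = worker;
}

int
main(void)
{
    Proc        leader = {1000, NULL, NULL};
    Proc        w1 = {1001, NULL, NULL};
    Proc        w2 = {1002, NULL, NULL};

    JoinGroup(&leader, &w1);
    JoinGroup(&leader, &w2);

    /* "Everyone the leader is waiting for" is just a list walk. */
    for (Proc *p = leader.nextGroupMember; p != NULL; p = p->nextGroupMember)
        printf("leader waits for pid %d\n", p->pid);
    return 0;
}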

After thinking about it a bit more, I realized that even if we settle
on some solution to that problem, there's another issue: the
wait-edges created by this system don't really have the same semantics
as regular lock waits. Suppose A is waiting on a lock held by B and B
is waiting for a lock held by A; that's a deadlock. But if A is
waiting for B to write to a tuple queue and B is waiting for A to read
from a tuple queue, that's not a deadlock if the queues in question
are the same. If they are different queues, it might be a deadlock,
but then again maybe not. It may be that A is prepared to accept B's
message from one queue, and that upon fully receiving it, it will do
some further work that will lead it to write a tuple into the other
queue. If so, we're OK; if not, it's a deadlock. I'm not sure
whether you'll want to argue that that is an implausible scenario, but
I'm not too sure it is. The worker could be saying "hey, I need some
additional piece of your backend-local state in order to finish this
computation", and the master could then provide it. I don't have any
plans like that, but it's been suggested previously by others, so it's
not an obviously nonsensical thing to want to do.

A further difference in the semantics of these wait-edges is that if
process A is awaiting AccessExclusiveLock on a resource held by B, C,
and D at AccessShareLock, it needs to wait for all of those processes
to release their locks before it can do anything else. On the other
hand, if process A is awaiting tuples from B, C, and D, it just needs
ONE of those processes to emit tuples in order to make progress. Now
maybe that doesn't make any difference in practice, because even if
two of those processes are making lively progress and A is receiving
tuples from them and processing them like gangbusters, that's probably
not going to help the third one get unstuck. If we adopt the approach
of discounting that possibility, then as long as the parallel leader
can generate tuples locally (that is, local_scan_done = false) we
don't report the deadlock, but as soon as it can no longer do that
(local_scan_done = true) then we do, even though we could still
theoretically read more tuples from the non-stuck workers. So then
you have to wonder why we're not solving problem #1, because the
deadlock was just as certain before we generated the maximum possible
number of tuples locally as it was afterwards.

Thoughts?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel mode and parallel contexts
Date: 2015-04-21 19:14:56
Message-ID: CA+Tgmoaz6kFN5KVzatWuyCp1Mx5OmXRC3JXgz3QpXC7fE64BQg@mail.gmail.com
Lists: pgsql-hackers

On Tue, Mar 24, 2015 at 11:56 PM, Alvaro Herrera
<alvherre(at)2ndquadrant(dot)com> wrote:
>> Well, it's not actually the same message. They're all a bit
>> different. Or mostly all of them. And the variable part is not a
>> command name, as in the PreventTransactionChain() case, so it would
>> affect translatability if we did that.
>
> Three things that vary are 1) the fact that some check IsParallelWorker()
> and others check IsInParallelMode(), and 2) that some of them are
> ereports() while others are elog(), and 3) that some are ERROR and
> others are FATAL. Is there a reason for these differences?

The message text also varies, because it states the particular
operation that is prohibited.

We should check IsParallelWorker() for operations that are allowed in
the master during parallel mode, but not allowed in the workers - e.g.
the master can scan its own temporary relations, but its workers
can't. We should check IsInParallelMode() for operations that are
completely off-limits in parallel mode - i.e. writes.

We use ereport() where we expect that SQL could hit that check, and
elog() where we expect that only (buggy?) C code could hit that check.

We use FATAL for some of the checks in xact.c, for parity with other,
similar checks in xact.c that relate to checking the transaction
state. I think this is because we assume that if the transaction
state is hosed, it's necessary to terminate the backend to recover.
In other cases, we use ERROR.
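
To spell the convention out with a couple of examples (the first is modeled
on the insert check quoted upthread; the second is an invented illustration
of a worker-only restriction, not code from the patch):

/*
 * A write is off-limits for the whole parallel group, leader included,
 * and ordinary SQL can reach this, so: IsInParallelMode() + ereport(ERROR).
 */
if (IsInParallelMode())
    ereport(ERROR,
            (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
             errmsg("cannot insert tuples during a parallel operation")));

/*
 * Scanning the leader's temporary relations is fine in the leader but not
 * in a worker, and only (buggy?) C code should reach this, so:
 * IsParallelWorker() + elog(ERROR).  (Invented example.)
 */
if (IsParallelWorker())
    elog(ERROR, "cannot access temporary relations in a parallel worker");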

> (Note typo "restrction" in quoted paragraph above.)

Thanks, will fix.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Peter Geoghegan <pg(at)heroku(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel mode and parallel contexts
Date: 2015-04-21 19:17:34
Message-ID: CA+TgmoartTF8eptBhiNwxUkfkctsFc7WtZFhGEGQywE8e2vCmg@mail.gmail.com
Lists: pgsql-hackers

On Mon, Apr 20, 2015 at 6:49 PM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
> I see that you're using git format-patch to generate this. But the
> patch is only patch 1/4. Is that intentional? Where are the other
> pieces?
>
> I think that the parallel seqscan patch, and the assessing parallel
> safety patch are intended to fit together with this patch, but I can't
> find a place where there is a high level overview explaining just how
> they fit together (I notice Amit's patch has an "#include
> "access/parallel.h", which is here, but that wasn't trivial to figure
> out). I haven't been paying too much attention to this patch series.

The intended order of application is:

- parallel mode/contexts
- assess parallel safety
- parallel heap scan
- parallel seq scan

The parallel heap scan patch should probably be merged into the
parallel seq scan patch, which should probably then be split into
several patches, one or more with general parallel query
infrastructure and then another with stuff specific to parallel
sequential scan. But we're not quite there yet. Until we can get
agreement on how to finalize the parallel mode/contexts patch, I
haven't been too worried about getting the other stuff sorted out
exactly right.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel mode and parallel contexts
Date: 2015-04-26 05:35:21
Message-ID: CAA4eK1JBr3b7r34fpwBzUdm9kaF5cQUsjNWCzW697LXKtmGgwg@mail.gmail.com
Lists: pgsql-hackers

On Tue, Apr 21, 2015 at 11:38 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
>
> After thinking about it a bit more, I realized that even if we settle
> on some solution to that problem, there's another issue: the
> wait-edges created by this system don't really have the same semantics
> as regular lock waits. Suppose A is waiting on a lock held by B and B
> is waiting for a lock held by A; that's a deadlock. But if A is
> waiting for B to write to a tuple queue and B is waiting for A to read
> from a tuple queue, that's not a deadlock if the queues in question
> are the same. If they are different queues, it might be a deadlock,
> but then again maybe not. It may be that A is prepared to accept B's
> message from one queue, and that upon fully receiving it, it will do
> some further work that will lead it to write a tuple into the other
> queue. If so, we're OK; if not, it's a deadlock.

I agree that the way the deadlock detector works for locks might not be
the same as the way it needs to work for communication buffers (tuple
queues). I think we also need to devise some different way to remove
resources from the wait/resource queue; it might not be a good idea to do
it the same way as for locks, because locks are released at transaction
end whereas these new resources (communication buffers) don't operate at
transaction boundaries.

> I'm not sure
> whether you'll want to argue that that is an implausible scenario, but
> I'm not too sure it is. The worker could be saying "hey, I need some
> additional piece of your backend-local state in order to finish this
> computation", and the master could then provide it. I don't have any
> plans like that, but it's been suggested previously by others, so it's
> not an obviously nonsensical thing to want to do.
>

If such a thing is not possible today, and it doesn't seem like a good
design for solving any problem anyway, then why spend so much effort
trying to find ways to detect deadlocks for it?

> A further difference in the semantics of these wait-edges is that if
> process A is awaiting AccessExclusiveLock on a resource held by B, C,
> and D at AccessShareLock, it needs to wait for all of those processes
> to release their locks before it can do anything else. On the other
> hand, if process A is awaiting tuples from B, C, and D, it just needs
> ONE of those processes to emit tuples in order to make progress. Now
> maybe that doesn't make any difference in practice, because even if
> two of those processes are making lively progress and A is receiving
> tuples from them and processing them like gangbusters, that's probably
> not going to help the third one get unstuck. If we adopt the approach
> of discounting that possibility, then as long as the parallel leader
> can generate tuples locally (that is, local_scan_done = false) we
> don't report the deadlock, but as soon as it can no longer do that
> (local_scan_done = true) then we do, even though we could still
> theoretically read more tuples from the non-stuck workers. So then
> you have to wonder why we're not solving problem #1, because the
> deadlock was just as certain before we generated the maximum possible
> number of tuples locally as it was afterwards.
>

I think the deadlock detection code should run any time we have to wait
for more than the deadlock timeout. Now I think the important point
here is when to actually start waiting: in the current parallel_seqscan
patch we wait only after checking the tuples from all queues, but we
could have done it the other way (wait if we can't fetch from one queue)
as well. Both approaches have pros and cons, so we can proceed assuming
the current way is okay and consider changing it in the future if we
find some reason to.

Having said that, I feel this is not a problem that we should try to
solve at this point unless there is some scenario where we could hit a
deadlock due to communication buffers. I think such a solution would
be required for more advanced forms of parallelism (like intra-query
parallelism). By the way, why are we trying to solve this problem at
all; is there any way we can hit it with parallel sequential scan?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Euler Taveira <euler(at)timbira(dot)com(dot)br>
To: Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel mode and parallel contexts
Date: 2015-04-26 18:03:42
Message-ID: 553D287E.4020008@timbira.com.br
Lists: pgsql-hackers

On 19-03-2015 15:13, Robert Haas wrote:
> On Wed, Mar 18, 2015 at 5:36 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> Reading the README first, the rest later. So you can comment on my
>> comments, while I actually look at the code. Parallelism, yay!
>
I'm also looking at this piece of code and found some low-hanging fruit.

--
Euler Taveira Timbira - http://www.timbira.com.br/
PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento

Attachment Content-Type Size
0001-fix-some-typos.patch text/x-patch 2.2 KB
0002-Avoid-compiler-warnings-about-unused-variables-in-as.patch text/x-patch 834 bytes
0003-Fix-some-more-typos.patch text/x-patch 2.1 KB

From: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
To: Robert Haas <robertmhaas(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel mode and parallel contexts
Date: 2015-04-27 01:57:51
Message-ID: 553D979F.8080608@lab.ntt.co.jp
Lists: pgsql-hackers

On 2015-04-22 AM 04:14, Robert Haas wrote:
>
> We should check IsParallelWorker() for operations that are allowed in
> the master during parallel mode, but not allowed in the workers - e.g.
> the master can scan its own temporary relations, but its workers
> can't. We should check IsInParallelMode() for operations that are
> completely off-limits in parallel mode - i.e. writes.
>
> We use ereport() where we expect that SQL could hit that check, and
> elog() where we expect that only (buggy?) C code could hit that check.
>

By the way, perhaps orthogonal to the basic issue Alvaro and you are
discussing here - when playing around with (parallel-mode + parallel-safety +
parallel heapscan + parallel seqscan), I observed (a while ago now) that
if you run a CREATE TABLE AS ... SELECT and the SELECT happens to use parallel
scan, the statement as a whole fails because the top level statement (CTAS) is
not read-only. So, the only way to make CTAS succeed is to disable seqscan,
which the error reporting may not make obvious to the user. I guess it
happens because the would-be parallel leader of the scan would also be the one
doing the DefineRelation() DDL. Apologies if this has been addressed since.

Thanks,
Amit


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Euler Taveira <euler(at)timbira(dot)com(dot)br>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel mode and parallel contexts
Date: 2015-04-28 18:38:42
Message-ID: CA+TgmoZN4KeyQ+=VOZFnO6EDhe_Yufj4fhRg8Do2W06kiWxmrg@mail.gmail.com
Lists: pgsql-hackers

On Sun, Apr 26, 2015 at 2:03 PM, Euler Taveira <euler(at)timbira(dot)com(dot)br> wrote:
> On 19-03-2015 15:13, Robert Haas wrote:
>> On Wed, Mar 18, 2015 at 5:36 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>>> Reading the README first, the rest later. So you can comment on my
>>> comments, while I actually look at the code. Parallelism, yay!
>>
> I'm also looking at this piece of code and found some low-hanging fruit.

Thanks. 0001 and 0003 look good, but 0002 is actually unrelated to this patch.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel mode and parallel contexts
Date: 2015-04-28 18:40:04
Message-ID: CA+TgmobgUA+Us6_UbkyENhQRJtmq6r02u3AQFi_+w3hNtbqccA@mail.gmail.com
Lists: pgsql-hackers

On Sun, Apr 26, 2015 at 9:57 PM, Amit Langote
<Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> On 2015-04-22 AM 04:14, Robert Haas wrote:
>> We should check IsParallelWorker() for operations that are allowed in
>> the master during parallel mode, but not allowed in the workers - e.g.
>> the master can scan its own temporary relations, but its workers
>> can't. We should check IsInParallelMode() for operations that are
>> completely off-limits in parallel mode - i.e. writes.
>>
>> We use ereport() where we expect that SQL could hit that check, and
>> elog() where we expect that only (buggy?) C code could hit that check.
>>
>
> By the way, perhaps orthogonal to the basic issue Alvaro and you are
> discussing here - when playing around with (parallel-mode + parallel-safety +
> parallel heapscan + parallel seqscan), I observed (a while ago now) that
> if you run a CREATE TABLE AS ... SELECT and the SELECT happens to use parallel
> scan, the statement as a whole fails because the top level statement (CTAS) is
> not read-only. So, the only way to make CTAS succeed is to disable seqscan,
> which the error reporting may not make obvious to the user. I guess it
> happens because the would-be parallel leader of the scan would also be the one
> doing the DefineRelation() DDL. Apologies if this has been addressed since.

That sounds like a bug in the assess-parallel-safety patch.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Euler Taveira <euler(at)timbira(dot)com(dot)br>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel mode and parallel contexts
Date: 2015-04-29 16:23:48
Message-ID: CA+TgmoYL4mXLZe6b9x68+z8SgHFuDiS0frwqDvAP_T8nD0HGTQ@mail.gmail.com
Lists: pgsql-hackers

On Tue, Apr 28, 2015 at 2:38 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Sun, Apr 26, 2015 at 2:03 PM, Euler Taveira <euler(at)timbira(dot)com(dot)br> wrote:
>> On 19-03-2015 15:13, Robert Haas wrote:
>>> On Wed, Mar 18, 2015 at 5:36 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>>>> Reading the README first, the rest later. So you can comment on my
>>>> comments, while I actually look at the code. Parallelism, yay!
>>>
>> I'm also looking at this piece of code and found some low-hanging fruit.
>
> Thanks. 0001 and 0003 look good, but 0002 is actually unrelated to this patch.

So, I think it makes sense to split up this patch in two. There's no
real debate, AFAICS, about anything in the patch other than the
heavyweight locking stuff. So I'd like to go ahead and commit the
rest. That's attached here as parallel-mode-v10.patch.

The remaining question here is what to do about the heavyweight
locking. There seem to be a couple of possible approaches to that,
and perhaps with this specific issue separated out we can focus in on
that aspect of the problem. Let me start by restating my goals for
that portion of the patch:

1. Avoid undetected deadlocks. For example, suppose A and A1 are
cooperating parallel processes, and B is an unrelated process. If A1
waits for a lock held by B, and B waits for a lock held by A, and A is
waiting for A1 to finish executing before winding up parallelism, the
deadlock detector currently won't detect that. Either the deadlock
detector needs to know that A waits for A1, or it needs to view the
lock held by A as also held by A1, so that it can detect a cycle A1 ->
B -> A1.

2. Avoid unprincipled deadlocks. For example, suppose A holds
AccessExclusiveLock on a relation R and attempts a parallel sequential
scan on R. It launches a worker A1. If A1 attempts to lock R and
blocks, and A then waits for A1, we have a deadlock. Detecting this
deadlock (as per goal 1) is better than not detecting it, but it's not
good enough. Either A shouldn't have attempted a parallel sequential
scan in the first place, or it should run to completion without
deadlocking with its own workers.

3. Allow parallel DDL. For example, parallel VACUUM FULL or parallel
CLUSTER. We don't have code for these things today, but the locking
architecture we choose should not preclude doing them later.

4. Avoid restricting parallelism based on the history of the
transaction. If BEGIN; COPY foo FROM ...; SELECT ... FROM foo can use
parallelism, then BEGIN; TRUNCATE foo; COPY foo FROM ...; SELECT ...
FROM foo should also be able to use parallelism.

5. Impose as few restrictions as possible on what can be done in
parallel mode. For example, the current version of this patch
(attached as parallel-mode-locking.patch) allows parallel workers to
do system cache lookups, where a relation lock is taken and released,
but they cannot take and retain any heavyweight locks; so for example
a user-defined function that runs an SQL query against some unrelated
table and returns the results isn't parallel-safe right now. This is
not ideal.

That's all I've got for goals. How do we achieve those goals? So far
I think we've hit on three approaches to deadlock detection that I
think are at least somewhat plausible. None of them are perfect:

D-1. Teach the deadlock detector about implicit edges between
cooperating parallel processes. Upthread, I wrote about some problems
with this approach:
http://www.postgresql.org/message-id/CA+TgmoZ0tOPvfuz5BmEUqmdQ9xYQHzkG4jjJXduYDgQJXmPFbQ@mail.gmail.com

D-2. Copy locks from the master to every worker, and forbid workers
from acquiring any new, long-lived locks. (Locks that are taken and
released quickly, like a system catalog lock for a syscache lookup, or
a relation extension lock, won't cause deadlock.) Copying locks
ensures that whenever there is a dangerous structure like A1 -> B -> A
in the waits-for graph, there will also be a cycle A1 -> B -> A1, so
the unmodified deadlock detector will detect the deadlock. Forbidding
workers from acquiring any new, long-lived locks prevents locks taken
after the copying step from causing similar problems. The restriction
against new, long-lived locks is annoying. Also, a transaction
holding many locks (like pg_dump) and using many workers will use an
enormous number of lock table slots.

D-3. Teach the deadlock detector to consider all locks held by any
member of the locking group as held by the leader. This guarantees
that all deadlocks between the parallel group and other processes in
the system will be detected. When a process in the group requests a
lock that conflicts with one already held by another group member, but
not with any held by a process outside the group, grant it anyway, and
without waiting behind pending conflicting lock requests. At the end
of parallelism, transfer any locks held by workers and not released
back to the leader. In a world where lock requests within a group
can't conflict, there's no such thing as a deadlock within a parallel
group, so we don't need to detect deadlocks of that type, and there
will never be any unprincipled deadlocks within a group because there
will never be any deadlocks within a group at all.

The biggest problem with this approach is figuring out how to make it
safe. I originally proposed this approach on the "group locking"
thread and it kind of went down in flames. If two parallel backends
are trying to extend the same relation at the same time, or lock the
same tuple or page at the same time, those requests have got to
conflict. In those cases, we are using locking to enforce real mutual
exclusion. However, that doesn't seem like a serious problem. First,
locks of those types will not be held at the start of parallelism;
starting parallelism while you hold a lock of one of those types would
be ridiculous. Second, locks of those types are never long-lived: you
take them, do what you need to do, and then release them. Finally, I
don't believe you ever take any other heavyweight locks while holding
them; I *think* they're always the last heavyweight lock you take. So
I think we could allow these locks to conflict normally without
introducing any deadlock risk. The only locks that we need to
consider as mutually non-conflicting are relation and database object
locks. Those are a different kettle of fish: when we change a
relation's relfilenode, for example, we must keep the relation locked
for the duration of the transaction because of MVCC concerns. The new
pg_class tuple isn't visible to anyone else yet, and any further
modifications to the relation must be done using the new relfilenode.
But none of that matters for a parallel worker, which shares the same
snapshot. There are still cases that do matter; for example, if a
parallel backend could REINDEX a relation being concurrently scanned by
another parallel backend in the same group, that would cause problems,
because REINDEX uses backend-local state that wouldn't be shared. But
these cases can also arise without parallelism, due to cursors, and
there is an existing mechanism (CheckTableNotInUse) to prevent them.
So I think that's OK, too. If not, we can fix other problems as we
find them; I don't see any major architectural problems here.

The other big problem with this approach is figuring out how to
implement it. It doesn't work to have the worker processes manipulate
the leader's lock-related data structures as if they were the parent,
both because the locking code relies on the shared data structures to
match its backend-private data structures and also because if we do
block waiting for a lock, there could be more than one waiting process
in the lock group at a time, and we need to know which specific
processes are waiting so we can wake them up at the correct time.
Instead, I think the way to do this is to add an additional field to
each PROCLOCK storing a pointer to the leader's PGPROC;
FindLockCycleRecurse() can check whether two PGPROCs have the same
leader pointer instead of whether the PGPROCs themselves are the same.
It can also refuse to follow a waits-for edge if that edge is of one
of the types we do want to conflict within a locking group (e.g.
relation extension).
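
Here is a small self-contained model of that granting rule (simplified
structures, not the real lock manager): a request has to wait only if some
holder outside the requester's lock group holds the lock in a conflicting
mode. (The real rule would still let a few lock types, like relation
extension, conflict within a group; that part is omitted here.)

#include <stdbool.h>
#include <stdio.h>

typedef enum LockMode { ACCESS_SHARE, ACCESS_EXCLUSIVE } LockMode;

typedef struct Holder
{
    int         groupLeaderPid; /* same value for all members of a group */
    LockMode    mode;
} Holder;

/* Two modes conflict unless both are ACCESS_SHARE (simplified matrix). */
static bool
ModesConflict(LockMode a, LockMode b)
{
    return a == ACCESS_EXCLUSIVE || b == ACCESS_EXCLUSIVE;
}

/*
 * D-3 rule: ignore conflicts with holders belonging to the requester's own
 * lock group; only holders outside the group can block the request.
 */
static bool
RequestMustWait(int requesterLeaderPid, LockMode reqMode,
                const Holder *holders, int nholders)
{
    for (int i = 0; i < nholders; i++)
    {
        if (holders[i].groupLeaderPid == requesterLeaderPid)
            continue;           /* same group: never conflicts */
        if (ModesConflict(reqMode, holders[i].mode))
            return true;
    }
    return false;
}

int
main(void)
{
    /* The leader (group 1000) holds AccessExclusiveLock on a relation. */
    Holder      held[] = {{1000, ACCESS_EXCLUSIVE}};

    /* A worker in the same group asks for AccessShareLock: granted. */
    printf("worker in group: %s\n",
           RequestMustWait(1000, ACCESS_SHARE, held, 1) ? "waits" : "granted");

    /* An unrelated backend (group 2000) asks for the same: it waits. */
    printf("outside backend: %s\n",
           RequestMustWait(2000, ACCESS_SHARE, held, 1) ? "waits" : "granted");
    return 0;
}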

---

I'm OK with any of these approaches if we can agree on how to solve
the problems they respectively present. Currently, I've got an
implementation of D-2; I started with an implementation of D-3, which
was flawed, but perhaps the idea is salvageable, per the discussion
above. Andres has consistently advocated for D-1, but see the link
above for where I'm stuck in terms of implementing that. There may be
another approach we haven't come up with yet, too, for all I know, and
that's fine too if it accomplishes the goals listed above.

Thanks,

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
parallel-mode-v10.patch binary/octet-stream 127.8 KB
parallel-mode-locking.patch binary/octet-stream 22.3 KB

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Euler Taveira <euler(at)timbira(dot)com(dot)br>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel mode and parallel contexts
Date: 2015-04-30 19:23:25
Message-ID: CA+TgmoYs6eerq===P-_sXQCz7sfcw9d-tr5cCMsTfDK-MxbFeA@mail.gmail.com
Lists: pgsql-hackers

On Wed, Apr 29, 2015 at 12:23 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> So, I think it makes sense to split up this patch in two. There's no
> real debate, AFAICS, about anything in the patch other than the
> heavyweight locking stuff. So I'd like to go ahead and commit the
> rest. That's attached here as parallel-mode-v10.patch.

Hearing no objections, done.

Still hoping for some input on the rest.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Noah Misch <noah(at)leadboat(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel mode and parallel contexts
Date: 2015-05-02 16:35:22
Message-ID: 20150502163522.GA3461810@tornado.leadboat.com
Lists: pgsql-hackers

On Tue, Apr 21, 2015 at 02:08:00PM -0400, Robert Haas wrote:
> On Tue, Mar 24, 2015 at 3:26 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:

[proposal: teach deadlock detector about parallel master waiting on workers]

> > deadlock.c is far from simple, and at least I don't find the control
> > flow to be particularly clear. So it's not easy. It'd be advantageous
> > to tackle things at that level because it'd avoid the need to acquire
> > locks on some lock's waitqueue when blocking; we're going to do that a
> > lot.
> >
> > But it seems to me that it should be possible to succeed: In
> > FindLockCycleRecurse(), in the case that we're not waiting for an actual
> > lock (checkProc->links.next == NULL) we can add a case that considers
> > the 'blocking parallelism' case. ISTM that that's just a
> > FindLockCycleRecurse() call on the process that we're waiting for. We'd
> > either have to find the other side's locktag for DEADLOCK_INFO or invent
> > another field in there; but that seems like a solvable problem.

> 1. The master doesn't actually *wait* until the very end of the parallel phase.
> 2. When the master does wait, it waits for all of the parallel workers
> at once, not each one individually.
>
> So, I don't think anything as simplistic as teaching a blocking
> shm_mq_receive() to tip off the deadlock detector that we're waiting
> for the process on the other end of that particular queue can ever
> work. Because of point #2, that never happens. When I first started
> thinking about how to fix this, I said, well, that's no big deal, we
> can just advertise the whole list of processes that we're waiting for
> in shared memory, rather than just the one. This is a bit tricky,
> though. Any general API for any backend to advertise that it's waiting
> for an arbitrary subset of the other backends would require O(n^2)
> shared memory state. That wouldn't be completely insane, but it's not
> great, either. For this particular case, we could optimize that down
> to O(n) by just chaining all of the children of a given parallel group
> leader in a linked list whose nodes are inlined in their PGPROCs, but that
> doesn't feel very general, because it forecloses the possibility of
> the children ever using that API, and I think they might need to. If
> nothing else, they might need to advertise that they're blocking on
> the master if they are trying to send data, the queue is full, and
> they have to wait for the master to drain some of it before they can
> proceed.

Perhaps, rather than model it as master M waiting on worker list W1|W2|W3,
model it with queue-nonempty and queue-nonfull events, one pair per queue. M
subscribes to queue0-nonempty, and each of W1,W2,W3 publish to it. M
publishes to queue0-nonfull, and the workers subscribe. An edge forms in the
deadlock detector's graph when all of an event's subscribers wait on it. (One
can approximate that in userspace with a pair of advisory locks.)
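
A minimal sketch of that edge-formation rule (invented structures, purely
illustrative; not a proposal for actual shared-memory layout):

#include <stdbool.h>
#include <stdio.h>

#define MAX_SUBS 8

/* An event such as "queue0-nonempty" or "queue0-nonfull". */
typedef struct Event
{
    const char *name;
    int         nsubscribers;
    bool        waiting[MAX_SUBS];  /* is subscriber i currently waiting? */
} Event;

/*
 * Per the proposal above, a waits-for edge (from the event's publishers to
 * its subscribers) forms only once every subscriber of the event is waiting
 * on it.
 */
static bool
EdgeForms(const Event *ev)
{
    for (int i = 0; i < ev->nsubscribers; i++)
        if (!ev->waiting[i])
            return false;
    return ev->nsubscribers > 0;
}

int
main(void)
{
    /* Master M is the sole subscriber of queue0-nonempty. */
    Event       nonempty = {"queue0-nonempty", 1, {false}};

    printf("%s edge: %d\n", nonempty.name, EdgeForms(&nonempty));
    nonempty.waiting[0] = true; /* M blocks reading an empty queue */
    printf("%s edge: %d\n", nonempty.name, EdgeForms(&nonempty));
    return 0;
}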

> After thinking about it a bit more, I realized that even if we settle
> on some solution to that problem, there's another issue: the
> wait-edges created by this system don't really have the same semantics
> as regular lock waits. Suppose A is waiting on a lock held by B and B
> is waiting for a lock held by A; that's a deadlock. But if A is
> waiting for B to write to a tuple queue and B is waiting for A to read
> from a tuple queue, that's not a deadlock if the queues in question
> are the same.

I do see a deadlock. B wants to block until A reads from the queue, so it
advertises that and sleeps. Waking up deadlock_timeout later, it notices that
A is waiting for B to write something. B will not spontaneously suspend
waiting to write to the queue, nor will A suspend waiting to read from the
queue. Thus, the deadlock is valid. This assumes the deadlock detector
reasons from an authoritative picture of pending waits and that we reliably
wake up a process when the condition it sought has arrived. We have that for
heavyweight lock acquisitions. The assumption may be incompatible with
Andres's hope, quoted above, of avoiding the full lock acquisition procedure.

> A further difference in the semantics of these wait-edges is that if
> process A is awaiting AccessExclusiveLock on a resource held by B, C,
> and D at AccessShareLock, it needs to wait for all of those processes
> to release their locks before it can do anything else. On the other
> hand, if process A is awaiting tuples from B, C, and D, it just needs
> ONE of those processes to emit tuples in order to make progress. Now

Right.

> maybe that doesn't make any difference in practice, because even if
> two of those processes are making lively progress and A is receiving
> tuples from them and processing them like gangbusters, that's probably
> not going to help the third one get unstuck. If we adopt the approach
> of discounting that possibility, then as long as the parallel leader
> can generate tuples locally (that is, local_scan_done = false) we
> don't report the deadlock, but as soon as it can no longer do that
> (local_scan_done = true) then we do, even though we could still
> theoretically read more tuples from the non-stuck workers.

I can't discount the possibility that the master could unstick a worker in the
course of ingesting tuples from other workers. We could accept a coding rule
against parallel algorithms behaving that way, on pain of unprincipled
deadlock. It's unattractive to bake such a high-level notion of acceptable
wait patterns into the deadlock detector, and every coding rule has noteworthy
cost. Since detecting deadlocks long after they were inevitable is also
unattractive, the rule might be worthwhile.

> So then
> you have to wonder why we're not solving problem #1, because the
> deadlock was just as certain before we generated the maximum possible
> number of tuples locally as it was afterwards.

The "1." above reads like a benefit, not a problem. What might you solve
about it?

Thanks,
nm


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel mode and parallel contexts
Date: 2015-05-05 17:05:38
Message-ID: CA+TgmoZCsirzt5Hg=18sbA=joMXS=VaAF0h11=TXSv+z5K=wtg@mail.gmail.com
Lists: pgsql-hackers

On Sat, May 2, 2015 at 12:35 PM, Noah Misch <noah(at)leadboat(dot)com> wrote:
> Perhaps, rather than model it as master M waiting on worker list W1|W2|W3,
> model it with queue-nonempty and queue-nonfull events, one pair per queue.

Each individual queue has only a single reader and a single writer.
In your example, there would be three queues, not just one. Of
course, one could design a new shared memory data structure
representing a collection of related queues, but I still don't see
exactly how we get around the problem of that requiring
O(MaxBackends^2) storage space.

> M
> subscribes to queue0-nonempty, and each of W1,W2,W3 publish to it. M
> publishes to queue0-nonfull, and the workers subscribe. An edge forms in the
> deadlock detector's graph when all of an event's subscribers wait on it. (One
> can approximate that in userspace with a pair of advisory locks.)

An edge between which two processes?

>> After thinking about it a bit more, I realized that even if we settle
>> on some solution to that problem, there's another issue: the
>> wait-edges created by this system don't really have the same semantics
>> as regular lock waits. Suppose A is waiting on a lock held by B and B
>> is waiting for a lock held by A; that's a deadlock. But if A is
>> waiting for B to write to a tuple queue and B is waiting for A to read
>> from a tuple queue, that's not a deadlock if the queues in question
>> are the same.
>
> I do see a deadlock. B wants to block until A reads from the queue, so it
> advertises that and sleeps. Waking up deadlock_timeout later, it notices that
> A is waiting for B to write something. B will not spontaneously suspend
> waiting to write to the queue, nor will A suspend waiting to read from the
> queue. Thus, the deadlock is valid. This assumes the deadlock detector
> reasons from an authoritative picture of pending waits and that we reliably
> wake up a process when the condition it sought has arrived. We have that for
> heavyweight lock acquisitions. The assumption may be incompatible with
> Andres's hope, quoted above, of avoiding the full lock acquisition procedure.

I don't understand. What's typically going to happen here is that the
queue will initially be empty, and A will block. Suppose B has a
large message to write to the queue, let's say 100x the queue size.
It will write the first 1% of the message into the queue, set A's
latch, and go to sleep. A will now wake up, drain the queue, set B's
latch, and go to sleep. B will then wake up, write the next chunk of
the message, set A's latch again, and go to sleep. They'll go back
and forth like this until the entire message has been transmitted.
There's no deadlock here: everything is working as designed. But
there may be instants where both A and B are waiting, because (for
example) A may set B's latch and finish going to sleep before B gets
around to waking up.

That's probably something that we could patch around, but I think it's
missing the larger point. When you're dealing with locks, there is
one operation that causes blocking (acquiring a lock) and another
operation that unblocks other processes (releasing a lock). With
message queues, there are still two operations (reading and writing),
but either of them can either cause you to block yourself, or to
unblock another process. To put that another way, locks enforce
mutual exclusion; message queues force mutual *inclusion*. Locks
cause blocking when two processes try to operate on the same object at
the same time; message queues cause blocking *unless* two processes
operate on the same object at the same time. That difference seems to
me to be quite fundamental.
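
To illustrate, here is a tiny self-contained model of that mutual-inclusion
behavior (threads standing in for backends; this is not the shm_mq code):
the writer blocks when the queue is full, the reader blocks when it is
empty, and a message far larger than the queue still gets through because
each side's progress unblocks the other.

#include <pthread.h>
#include <stdio.h>

#define QUEUE_SIZE   4          /* tiny "tuple queue" */
#define MESSAGE_LEN  400        /* 100x the queue size */

static int  queue[QUEUE_SIZE];
static int  count;              /* items currently in the queue */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  notfull = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  notempty = PTHREAD_COND_INITIALIZER;

/* "Worker" B: blocks whenever the queue is full. */
static void *
writer(void *arg)
{
    (void) arg;
    for (int i = 0; i < MESSAGE_LEN; i++)
    {
        pthread_mutex_lock(&lock);
        while (count == QUEUE_SIZE)
            pthread_cond_wait(&notfull, &lock);
        queue[count++] = i;
        pthread_cond_signal(&notempty);
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int
main(void)
{
    pthread_t   b;
    int         received = 0;

    pthread_create(&b, NULL, writer, NULL);

    /* "Leader" A: blocks whenever the queue is empty. */
    while (received < MESSAGE_LEN)
    {
        pthread_mutex_lock(&lock);
        while (count == 0)
            pthread_cond_wait(&notempty, &lock);
        count--;                /* consume one item */
        received++;
        pthread_cond_signal(&notfull);
        pthread_mutex_unlock(&lock);
    }
    pthread_join(b, NULL);
    printf("transferred %d items through a %d-slot queue\n",
           received, QUEUE_SIZE);
    return 0;
}

Both sides can be momentarily asleep at the same time, as described above,
but neither stays asleep: every read makes the queue non-full and every
write makes it non-empty.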

>> So then
>> you have to wonder why we're not solving problem #1, because the
>> deadlock was just as certain before we generated the maximum possible
>> number of tuples locally as it was afterwards.
>
> The "1." above reads like a benefit, not a problem. What might you solve
> about it?

Sorry, that wasn't very clear. Normally, it is a benefit, but in some
cases, it could result in a long delay in reporting an inevitable
deadlock.

What I mean is: suppose we construct a working mechanism where the
deadlock detector knows about tuple queue waits. Suppose further that
we have a parallel leader and two workers cooperating to do a task
that takes 300 s and can be parallelized with perfect efficiency so
that, working together, those three processes can get the job done in
just 100 s. If the leader tries to take a lock held in a conflicting
mode by one of the workers, the worker will fill up the tuple queue -
probably quite quickly - and wait. We now detect a deadlock. Our
detection is timely, and life is good. On the other hand, suppose
that one of the *workers* tries to take a lock held in a conflicting
mode by the other worker or by the leader. There is no immediate
deadlock, because the leader is not waiting. Instead, we'll finish
the computation with the remaining worker and the leader, which will
take 150 seconds since we only have two processes instead of three,
and *then* the leader waits. I predict that users will be unhappy
about doing the whole computation - and only then detecting, well
after the time the query would normally have finished, a deadlock that
was inevitable from the beginning.

That problem is actually not too hard to avoid if you're OK with
extremely frequent lock manager manipulations. It's not a deadlock if
a worker is waiting for a lock (say, a relation extension lock) held
by the leader, because the leader might release that lock quickly.
But if the leader gets to the top of the funnel node's main loop and
one of its workers is still waiting for the lock at that point, that
is almost certainly a deadlock. Locks might get taken and released
during a single pass through that loop, but any lock held across
iterations is presumably going to be held until end-of-transaction, so
we're hosed. But checking for deadlock at the top of every main loop
iteration is far too expensive to contemplate. Even doing some much
lighter-weight manipulation of the lock manager data structures at the
top of every loop iteration is probably going to recreate the same
kind of lock manager contention problems that I tried to solve with
the fast-path locking stuff in 9.2.
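
For illustration, here is roughly what that per-iteration check would
have to decide; the types and names are invented, and in reality the
same information lives in the shared lock manager, so collecting it
means touching lock-manager state on every pass through the loop:

#include <stdbool.h>

typedef struct WorkerLockWait
{
    int     worker_pid;     /* a worker blocked on a heavyweight lock */
    int     waiting_on;     /* identity (e.g. a locktag hash) of that lock */
} WorkerLockWait;

/*
 * True if some worker is waiting for a lock the leader still holds.  A
 * lock still held by the leader at the top of the loop is most likely
 * held to end of transaction, so a hit here amounts to a deadlock.
 */
static bool
leader_blocks_a_worker(const WorkerLockWait *waits, int nwaits,
                       const int *leader_held_locks, int nheld)
{
    for (int i = 0; i < nwaits; i++)
        for (int j = 0; j < nheld; j++)
            if (waits[i].waiting_on == leader_held_locks[j])
                return true;
    return false;
}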

To put all of my cards on the table, I've never really believed in
this approach. The reason we have a deadlock detector in the first
place is because we can't prevent users from making fundamentally
incompatible locking requests, like locking two tables in incompatible
orderings in two different sessions. But we *don't* have deadlock
detection for lwlocks because we (rightly) view it as our job to write
the code in such a way that deadlocks don't happen in the first place.
So we take locks in a predictable order, even in cases (like updating
a tuple) where that involves some fancy gymnastics. We could have
update always lock the old-tuple buffer before the new-tuple buffer
and make it the deadlock detector's job to sort that out, but that's
pretty obviously inferior. We can expect users to tolerate a deadlock
when it is their own actions that have precipitated the deadlock, but
it's a whole different thing to ask them to accept that what from the
user's point of view is an indivisible action (like an UPDATE, or a
parallel query, or perhaps an UPSERT) occasionally deadlocks for some
reason that they can neither understand nor prevent. Tuple queues are
an implementation detail. They shouldn't deadlock for the same reason
that lwlock acquisition shouldn't deadlock, and this whole project
should be entirely unnecessary.

The way we backed into worrying about exposing tuple queues to the
deadlock detector is that, if you let one backend in a parallel group
get and keep a lock on some resource and then another backend in the
same group tries to lock that resource and can't get the lock, you
will eventually get an undetected deadlock. Andres's view is that we
should fix the deadlock detector to detect and report such cases, but
I think that's wrong. If a process in a parallel group can take and
retain until end of transaction a lock on some resource without which
other processes in the same parallel group cannot do useful work,
that's a BUG. We shouldn't need the deadlock detector to report that
problem for the same reason we shouldn't need the deadlock detector to
report lwlock-based deadlocks: if they are ever happening, you need to
fix your code, not the deadlock detector.

I think where this project went off the rails is when we made the
decision as to how to fix the problems in the original group locking
approach. In the original concept, locks between processes in the
same parallel group just didn't ever conflict; two
otherwise-conflicting locks held by different backends within the same
group were regarded as those processes sharing that lock. Andres and
possibly others pointed out that for stuff like relation locks, that's
probably not the right behavior. I agree with that. It was also
suggested that this would be less scary if we limited ourselves to
sharing the locks held at the beginning of parallelism. That had the
further advantage of making things like relation extension locks,
which won't be held at the start of parallelism, unshared, while
relation locks would, in most cases, be shared. That felt pretty
good, so I did it, but I now think that was a mistake, because it
creates edge cases where parallel groups can self-deadlock. If we
instead regard relation locks between parallel group members as NEVER
conflicting, rather than conflicting only when both locks were
acquired after the start of parallelism, those edge cases go away.
The only downside is that if a worker A manages to do something to a
relation R that makes it unsafe for worker B to access, and worker B
then gets the lock anyway, we get no-holds-barred insanity instead of
a deadlock. But I am unconvinced that downside amounts to much,
because I believe the only way that A can make it unsafe for B to
access the relation is by doing something that CheckTableNotInUse()
will already prohibit in parallel mode. If it turns out there is some
oversight there, I don't think it'll be hard to plug. In contrast,
exposing this tuple queue wait information to the deadlock detector is
looking quite complex.
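
A sketch of that rule, with invented types and names rather than the
actual lock.c logic: within a parallel group, relation locks never
conflict, no matter when each lock was acquired.

#include <stdbool.h>

typedef enum { LOCK_SHARED, LOCK_EXCLUSIVE } LockModeToy;   /* stand-in modes */

typedef struct HeldLock
{
    int         holder_pid;
    int         holder_group;   /* leader pid shared by all group members */
    LockModeToy mode;
} HeldLock;

/* Two-mode stand-in for the real lock conflict table. */
static bool
modes_conflict(LockModeToy a, LockModeToy b)
{
    return a == LOCK_EXCLUSIVE || b == LOCK_EXCLUSIVE;
}

static bool
relation_lock_conflicts(int requester_group, LockModeToy requested_mode,
                        const HeldLock *held)
{
    if (held->holder_group == requester_group)
        return false;           /* same parallel group: never a conflict */
    return modes_conflict(requested_mode, held->mode);
}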

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Noah Misch <noah(at)leadboat(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel mode and parallel contexts
Date: 2015-05-07 07:09:54
Message-ID: 20150507070954.GA3557937@tornado.leadboat.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, May 05, 2015 at 01:05:38PM -0400, Robert Haas wrote:
> On Sat, May 2, 2015 at 12:35 PM, Noah Misch <noah(at)leadboat(dot)com> wrote:
> > Perhaps, rather than model it as master M waiting on worker list W1|W2|W3,
> > model it with queue-nonempty and queue-nonfull events, one pair per queue.

That comment of mine was junk; it suggested a data structure containing the
same worker list it purported to remove. Oops.

> Each individual queue has only a single reader and a single writer.
> In your example, there would be three queues, not just one. Of
> course, one could design a new shared memory data structure
> representing a collection of related queues, but I still don't see
> exactly how we get around the problem of that requiring
> O(MaxBackends^2) storage space.

If each backend waits on a uniformly-distributed 50% of other backends,
tracking that wait graph indeed requires O(MaxBackends^2) space. Backends
completing useful work in the field will have far-sparser wait graphs, and
that should inform the choice of data structure:

On Tue, Apr 21, 2015 at 02:08:00PM -0400, Robert Haas wrote:
> When I first started thinking about how to fix this, I said, well,
> that's no big deal, we can just advertise the whole list of processes
> that we're waiting for in shared memory, rather than just the one.

That boiled down to representing the wait graph with an adjacency list, which
sounds like an efficient choice.
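
A sketch of that representation, with invented names and an arbitrary
per-backend cap: each backend advertises, in a fixed slot in shared
memory, the short list of processes it is currently waiting for, which
keeps the space linear in MaxBackends for the sparse graphs we expect.

#define MAX_WAIT_TARGETS 8          /* arbitrary small cap, for illustration */

typedef struct BackendWaitSlot
{
    int     npids;                          /* number of valid entries */
    int     waiting_for[MAX_WAIT_TARGETS];  /* pids this backend waits on */
} BackendWaitSlot;

/* One slot per backend, sized at startup; O(MaxBackends) overall. */
typedef struct WaitGraph
{
    int             nslots;
    BackendWaitSlot slots[];                /* flexible array member */
} WaitGraph;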

> > M
> > subscribes to queue0-nonempty, and each of W1,W2,W3 publish to it. M
> > publishes to queue0-nonfull, and the workers subscribe. An edge forms in the
> > deadlock detector's graph when all of an event's subscribers wait on it. (One
> > can approximate that in userspace with a pair of advisory locks.)
>
> An edge between which two processes?

I hadn't de-fuzzed my thinking that far yet. If you still need the answer
after this message, let me know, and I'll work on that. As a guess, I think
it's three edges M-W1, M-W2 and M-W3.

> >> After thinking about it a bit more, I realized that even if we settle
> >> on some solution to that problem, there's another issue: the
> >> wait-edges created by this system don't really have the same semantics
> >> as regular lock waits. Suppose A is waiting on a lock held by B and B
> >> is waiting for a lock held by A; that's a deadlock. But if A is
> >> waiting for B to write to a tuple queue and B is waiting for A to read
> >> from a tuple queue, that's not a deadlock if the queues in question
> >> are the same.
> >
> > I do see a deadlock. B wants to block until A reads from the queue, so it
> > advertises that and sleeps. Waking up deadlock_timeout later, it notices that
> > A is waiting for B to write something. B will not spontaneously suspend
> > waiting to write to the queue, nor will A suspend waiting to read from the
> > queue. Thus, the deadlock is valid. This assumes the deadlock detector
> > reasons from an authoritative picture of pending waits and that we reliably
> > wake up a process when the condition it sought has arrived. We have that for
> > heavyweight lock acquisitions. The assumption may be incompatible with
> > Andres's hope, quoted above, of avoiding the full lock acquisition procedure.
>
> I don't understand. What's typically going to happen here is that the
> queue will initially be empty, and A will block. Suppose B has a
> large message to write to the queue, let's say 100x the queue size.
> It will write the first 1% of the message into the queue, set A's
> latch, and go to sleep. A will now wake up, drain the queue, set B's
> latch, and go to sleep. B will then wake up, write the next chunk of
> the message, set A's latch again, and go to sleep. They'll go back
> and forth like this until the entire message has been transmitted.
> There's no deadlock here: everything is working as designed. But
> there may be instants where both A and B are waiting, because (for
> example) A may set B's latch and finish going to sleep before B gets
> around to waking up.

I see what you had in mind. The problem showed up in the last sentence;
namely, the hand-off from process to process is not atomic like it is for
heavyweight locks. That's exactly what I was (not too clearly) getting at
when I wrote, "This assumes ..." above. For the sake of illustration, suppose
you replace your queue in this algorithm with a heavyweight lock and a small
buffer. Initially, B holds the lock and A waits for the lock. The following
event sequence repeats until the transfer is done:

B fills the buffer
B releases the lock, granting it to A during LockRelease()
B starts waiting for the lock
A empties the buffer
A releases the lock, granting it to B during LockRelease()
A starts waiting for the lock

The deadlock detector will rightly never report a deadlock. To have the same
safety in your example, it's not enough for one process to set the latch of
another process. The process must update something in static shared memory to
indicate that the waited-for condition (queue-nonempty for A, queue-nonfull
for B) is now satisfied. You need such a unit of state anyway, don't you?
How else would the deadlock detector know that A waits for B to write to the
tuple queue? (When the deadlock detector notices this deadlock, it may be
running in a process not participating in the parallel group. It can't rely
on anything in backend or dynamic shared memory.) With that, the deadlock
detector can distinguish a process waiting for a yet-unsatisfied condition
from a process that will soon wake to exploit a recently-satisfied condition.
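
A sketch of such a unit of state, with invented field names: a small
per-queue record in static shared memory that both ends keep current,
so a deadlock detector running in any process can tell a still-unsatisfied
wait from one whose wake-up is already in flight.

#include <stdbool.h>

typedef struct TupleQueueWaitState
{
    int     reader_pid;             /* 0 if no reader attached */
    int     writer_pid;             /* 0 if no writer attached */
    bool    reader_waiting;         /* reader blocked on queue-nonempty */
    bool    writer_waiting;         /* writer blocked on queue-nonfull */
    bool    nonempty_satisfied;     /* writer has made the queue non-empty */
    bool    nonfull_satisfied;      /* reader has made the queue non-full */
} TupleQueueWaitState;

/*
 * The detector would record an edge "reader waits for writer" only if
 * reader_waiting is set and nonempty_satisfied is clear, with the
 * symmetric rule for the writer, mirroring how a granted-but-not-yet-woken
 * heavyweight lock is not treated as a wait.
 */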

> That's probably something that we could patch around, but I think it's
> missing the larger point. When you're dealing with locks, there is
> one operation that causes blocking (acquiring a lock) and another
> operation that unblocks other processes (releasing a lock). With
> message queues, there are still two operations (reading and writing),
> but either of them can either cause you to block yourself, or to
> unblock another process. To put that another way, locks enforce
> mutual exclusion; message queues force mutual *inclusion*. Locks
> cause blocking when two processes try to operate on the same object at
> the same time; message queues cause blocking *unless* two processes
> operate on the same object at the same time. That difference seems to
> me to be quite fundamental.

Interesting thought.

> That problem is actually not too hard to avoid if you're OK with
> extremely frequent lock manager manipulations. It's not a deadlock if
> a worker is waiting for a lock (say, a relation extension lock) held
> by the leader, because the leader might release that lock quickly.
> But if the leader gets to the top of the funnel node's main loop and
> one of its workers is still waiting for the lock at that point, that
> is almost certainly a deadlock. Locks might get taken and released
> during a single pass through that loop, but any lock held across
> iterations is presumably going to be held until end-of-transaction, so
> we're hosed. But checking for deadlock at the top of every main loop
> iteration is far too expensive to contemplate. Even doing some much
> lighter-weight manipulation of the lock manager data structures at the
> top of every loop iteration is probably going to recreate the same
> kind of lock manager contention problems that I tried to solve with
> the fast-path locking stuff in 9.2.

Agreed.

> To put all of my cards on the table, I've never really believed in
> this approach. The reason we have a deadlock detector in the first
> place is because we can't prevent users from making fundamentally
> incompatible locking requests, like locking two tables in incompatible
> orderings in two different sessions. But we *don't* have deadlock
> detection for lwlocks because we (rightly) view it as our job to write
> the code in such a way that deadlocks don't happen in the first place.
> So we take locks in a predictable order, even in cases (like updating
> a tuple) where that involves some fancy gymnastics. We could have
> update always lock the old-tuple buffer before the new-tuple buffer
> and make it the deadlock detector's job to sort that out, but that's
> pretty obviously inferior. We can expect users to tolerate a deadlock
> when it is their own actions that have precipitated the deadlock, but
> it's a whole different thing to ask them to accept that what from the
> user's point of view is an indivisible action (like an UPDATE, or a
> parallel query, or perhaps an UPSERT) occasionally deadlocks for some
> reason that they can neither understand nor prevent. Tuple queues are
> an implementation detail. They shouldn't deadlock for the same reason
> that lwlock acquisition shouldn't deadlock, and this whole project
> should be entirely unnecessary.
>
> The way we backed into worrying about exposing tuple queues to the
> deadlock detector is that, if you let one backend in a parallel group
> get and keep a lock on some resource and then another backend in the
> same group tries to lock that resource and can't get the lock, you
> will eventually get an undetected deadlock. Andres's view is that we
> should fix the deadlock detector to detect and report such cases, but
> I think that's wrong. If a process in a parallel group can take and
> retain until end of transaction a lock on some resource without which
> other processes in the same parallel group cannot do useful work,
> that's a BUG. We shouldn't need the deadlock detector to report that
> problem for the same reason we shouldn't need the deadlock detector to
> report lwlock-based deadlocks: if they are ever happening, you need to
> fix your code, not the deadlock detector.

The number of held LWLocks is always zero when entering user-defined code and
again when exiting user-defined code. Therefore, the possible LWLock wait
graphs are known at compile time, and we could prove that none contain a
deadlock. Therefore, we scarcely miss having an LWLock deadlock detector.
That does not map to the tuple queue behavior at hand, because we hope to
allow the queue's writer to run user-defined code while the queue's reader is
waiting. My term "user-defined code" does come with some hand-waving. A
superuser can cause an undetected deadlock via a C-language hook. We would
not call that a PostgreSQL bug, though the hook is literally user-defined.
Let's keep the deadlock detector able to identify every deadlock a
non-superuser can cause. That includes deadlocks arising from heavyweight
lock acquisition in parallel-safe functions.

> I think where this project went off the rails is when we made the
> decision as to how to fix the problems in the original group locking
> approach. In the original concept, locks between processes in the
> same parallel group just didn't ever conflict; two
> otherwise-conflicting locks held by different backends within the same
> group were regarded as those processes sharing that lock. Andres and
> possibly others pointed out that for stuff like relation locks, that's

I think you mean "relation extension locks".

> probably not the right behavior. I agree with that. It was also
> suggested that this would be less scary if we limited ourselves to
> sharing the locks held at the beginning of parallelism. That had the
> further advantage of making things like relation extension locks,
> which won't be held at the start of parallelism, unshared, while
> relation locks would, in most cases, be shared. That felt pretty
> good, so I did it, but I now think that was a mistake, because it
> creates edge cases where parallel groups can self-deadlock. If we
> instead regard relation locks between parallel group members as NEVER
> conflicting, rather than conflicting only when both locks were
> acquired after the start of parallelism, those edge cases go away.

Yes. Then you're back to something like the LWLock scenario, where an
undetected deadlock implies a PostgreSQL bug. That's a good place to be.

> The only downside is that if a worker A manages to do something to a
> relation R that makes it unsafe for worker B to access, and worker B
> then gets the lock anyway, we get no-holds-barred insanity instead of
> a deadlock. But I am unconvinced that downside amounts to much,
> because I believe the only way that A can make it unsafe for B to
> access the relation is by doing something that CheckTableNotInUse()
> will already prohibit in parallel mode. If it turns out there is some
> oversight there, I don't think it'll be hard to plug.

This is the bottom line, and I agree. I wrote more about that here:
http://www.postgresql.org/message-id/20141226040546.GC1971688@tornado.leadboat.com

I admit that it's alarming to have different conflict semantics by locktag.
The tension originates, I think, from e.g. LOCKTAG_RELATION_EXTEND serving two
distinct purposes simultaneously. As I wrote in the mail just cited, before
altering a table in a manner that threatens concurrent usage, one must ensure
that (a) other transactions and (b) other parts of my own transaction aren't
in the way. Customarily, a heavyweight lock rules out (a), and
CheckTableNotInUse() rules out (b). Relation extension does not have or need
a CheckTableNotInUse() call or similar, because it doesn't call arbitrary code
that might reenter the relation extension process. Under parallelism, though,
it will be possible for multiple processes of a given transaction to attempt
relation extension simultaneously. So we need a (b) test. It just so happens
that allowing LOCKTAG_RELATION_EXTEND to conflict in a parallel group causes
that lock acquisition to suffice for both purposes (a) and (b). That's neat
but arguably impure. Every cure I've pondered has been worse than the
disease, though. It will be okay.

> In contrast,
> exposing this tuple queue wait information to the deadlock detector is
> looking quite complex.

Yep.

Thanks,
nm


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel mode and parallel contexts
Date: 2015-05-07 13:47:39
Message-ID: CA+TgmoZ=eapnv-yvfg+wxLyftXnRchr2MtEeT+NKdVSL70B3BA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, May 7, 2015 at 3:09 AM, Noah Misch <noah(at)leadboat(dot)com> wrote:
>> Each individual queue has only a single reader and a single writer.
>> In your example, there would be three queues, not just one. Of
>> course, one could design a new shared memory data structure
>> representing a collection of related queues, but I still don't see
>> exactly how we get around the problem of that requiring
>> O(MaxBackends^2) storage space.
>
> If each backend waits on a uniformly-distributed 50% of other backends,
> tracking that wait graph indeed requires O(MaxBackends^2) space. Backends
> completing useful work in the field will have far-sparser wait graphs, and
> that should inform the choice of data structure:

We can, of course, underallocate and hope for the best.

>> > M
>> > subscribes to queue0-nonempty, and each of W1,W2,W3 publish to it. M
>> > publishes to queue0-nonfull, and the workers subscribe. An edge forms in the
>> > deadlock detector's graph when all of an event's subscribers wait on it. (One
>> > can approximate that in userspace with a pair of advisory locks.)
>>
>> An edge between which two processes?
>
> I hadn't de-fuzzed my thinking that far yet. If you still need the answer
> after this message, let me know, and I'll work on that. As a guess, I think
> it's three edges M-W1, M-W2 and M-W3.

I think it's an edge where one end is M and the other end is fuzzily
diffused across W1, W2, and W3, because we only need one of them to
wake up. There's no way to represent such a thing in the deadlock
detector today. We could invent one, of course, but it sounds
complicated. Also, I suspect deadlock checking in this world reduces
to http://en.wikipedia.org/wiki/Boolean_satisfiability_problem and is
therefore NP-complete.
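
Schematically, such an edge would have to carry a set of targets rather
than a single process, which is what gives the cycle search its
any-one-of-them flavor; the names below are invented for illustration.

#define MAX_EDGE_TARGETS 8

typedef struct AnyOfWaitEdge
{
    int     waiter_pid;                     /* M in the example */
    int     ntargets;
    int     target_pids[MAX_EDGE_TARGETS];  /* W1, W2, W3: any one suffices */
} AnyOfWaitEdge;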

> I see what you had in mind. The problem showed up in the last sentence;
> namely, the hand-off from process to process is not atomic like it is for
> heavyweight locks. That's exactly what I was (not too clearly) getting at
> when I wrote, "This assumes ..." above. For the sake of illustration, suppose
> you replace your queue in this algorithm with a heavyweight lock and a small
> buffer. Initially, B holds the lock and A waits for the lock. The following
> event sequence repeats until the transfer is done:
>
> B fills the buffer
> B releases the lock, granting it to A during LockRelease()
> B starts waiting for the lock
> A empties the buffer
> A releases the lock, granting it to B during LockRelease()
> A starts waiting for the lock
>
> The deadlock detector will rightly never report a deadlock. To have the same
> safety in your example, it's not enough for one process to set the latch of
> another process. The process must update something in static shared memory to
> indicate that the waited-for condition (queue-nonempty for A, queue-nonfull
> for B) is now satisfied. You need such a unit of state anyway, don't you?

Yes. It's the number of bytes read and written, and it's stored in
dynamic shared memory. You can move some of the state to the main
shared memory segment, but I don't look forward to the performance
consequences of a lock acquire & release cycle every time the queue
fills or drains. Of course, there is also a loss of flexibility
there: a big reason for developing this stuff in the first place was
to avoid being constrained by the main shared memory segment.
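
In rough outline (simplified, not the real shm_mq header), the state is
just a pair of running byte counts, from which "queue non-empty" and
"queue non-full" are derived rather than stored as separate flags:

#include <stdbool.h>
#include <stdint.h>

typedef struct ToyTupleQueue
{
    uint64_t bytes_written;     /* advanced only by the single writer */
    uint64_t bytes_read;        /* advanced only by the single reader */
    uint64_t ring_size;         /* capacity of the ring buffer */
} ToyTupleQueue;

static inline bool
queue_nonempty(const ToyTupleQueue *q)
{
    return q->bytes_written > q->bytes_read;
}

static inline bool
queue_nonfull(const ToyTupleQueue *q)
{
    return q->bytes_written - q->bytes_read < q->ring_size;
}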

> The number of held LWLocks is always zero when entering user-defined code and
> again when exiting user-defined code. Therefore, the possible LWLock wait
> graphs are known at compile time, and we could prove that none contain a
> deadlock. Therefore, we scarcely miss having an LWLock deadlock detector.
> That does not map to the tuple queue behavior at hand, because we hope to
> allow the queue's writer to run user-defined code while the queue's reader is
> waiting. My term "user-defined code" does come with some hand-waving. A
> superuser can cause an undetected deadlock via a C-language hook. We would
> not call that a PostgreSQL bug, though the hook is literally user-defined.
> Let's keep the deadlock detector able to identify every deadlock a
> non-superuser can cause. That includes deadlocks arising from heavyweight
> lock acquisition in parallel-safe functions.

I'm in agreement with that goal.

>> I think where this project went off the rails is when we made the
>> decision as to how to fix the problems in the original group locking
>> approach. In the original concept, locks between processes in the
>> same parallel group just didn't ever conflict; two
>> otherwise-conflicting locks held by different backends within the same
>> group were regarded as those processes sharing that lock. Andres and
>> possibly others pointed out that for stuff like relation locks, that's
>
> I think you mean "relation extension locks".

Yes, thanks.

>> probably not the right behavior. I agree with that. It was also
>> suggested that this would be less scary if we limited ourselves to
>> sharing the locks held at the beginning of parallelism. That had the
>> further advantage of making things like relation extension locks,
>> which won't be held at the start of parallelism, unshared, while
>> relation locks would, in most cases, be shared. That felt pretty
>> good, so I did it, but I now think that was a mistake, because it
>> creates edge cases where parallel groups can self-deadlock. If we
>> instead regard relation locks between parallel group members as NEVER
>> conflicting, rather than conflicting only when both locks were
>> acquired after the start of parallelism, those edge cases go away.
>
> Yes. Then you're back to something like the LWLock scenario, where an
> undetected deadlock implies a PostgreSQL bug. That's a good place to be.

Good.

>> The only downside is that if a worker A manages to do something to a
>> relation R that makes it unsafe for worker B to access, and worker B
>> then gets the lock anyway, we get no-holds-barred insanity instead of
>> a deadlock. But I am unconvinced that downside amounts to much,
>> because I believe the only way that A can make it unsafe for B to
>> access the relation is by doing something that CheckTableNotInUse()
>> will already prohibit in parallel mode. If it turns out there is some
>> oversight there, I don't think it'll be hard to plug.
>
> This is the bottom line, and I agree. I wrote more about that here:
> http://www.postgresql.org/message-id/20141226040546.GC1971688@tornado.leadboat.com
>
> I admit that it's alarming to have different conflict semantics by locktag.
> The tension originates, I think, from e.g. LOCKTAG_RELATION_EXTEND serving two
> distinct purposes simultaneously. As I wrote in the mail just cited, before
> altering a table in a manner that threatens concurrent usage, one must ensure
> that (a) other transactions and (b) other parts of my own transaction aren't
> in the way. Customarily, a heavyweight lock rules out (a), and
> CheckTableNotInUse() rules out (b). Relation extension does not have or need
> a CheckTableNotInUse() call or similar, because it doesn't call arbitrary code
> that might reenter the relation extension process. Under parallelism, though,
> it will be possible for multiple processes of a given transaction to attempt
> relation extension simultaneously. So we need a (b) test. It just so happens
> that allowing LOCKTAG_RELATION_EXTEND to conflict in a parallel group causes
> that lock acquisition to suffice for both purposes (a) and (b). That's neat
> but arguably impure. Every cure I've pondered has been worse than the
> disease, though. It will be okay.

That's my opinion as well.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company