Re: Parallel Sequence Scan doubts

From: Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>
To: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Parallel Sequence Scan doubts
Date: 2014-08-21 06:47:01
Message-ID: CAJrrPGcSWfKp1Mpmgj+Y4WdYc2VaH7nPRkNuirAgK9oSuYRpGw@mail.gmail.com
Lists: pgsql-hackers

Corrected subject.

Hi Hackers,

Implementation of "Parallel Sequence Scan"

Approach:

1."Parallel Sequence Scan" can achieved by using the background
workers doing the job of actual sequence scan including the
qualification check also.

2. The planner generates the parallel scan plan by checking whether
the table meets the criteria for a parallel scan, and generates the
tasks (ranges of blocks).

3. In the executor Init phase, try to copy the necessary data required
by the workers and start the workers.

4. In the executor run phase, just get the tuples which are sent by
the workers and process them further in the plan node execution.
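
A rough sketch of the task generation in step 2 (the struct and
function names here are made up for illustration; they are not
existing PostgreSQL symbols):

#include "postgres.h"
#include "storage/block.h"

/* Hypothetical task descriptor: each worker scans one block range. */
typedef struct ParallelScanTask
{
    BlockNumber start_block;    /* first block this worker scans */
    BlockNumber nblocks;        /* number of blocks assigned */
} ParallelScanTask;

/*
 * Divide a relation's blocks evenly among nworkers, handing the
 * remainder out one block at a time to the first few workers.
 */
static void
divide_scan_blocks(BlockNumber total_blocks, int nworkers,
                   ParallelScanTask *tasks)
{
    BlockNumber per_worker = total_blocks / nworkers;
    BlockNumber remainder = total_blocks % nworkers;
    BlockNumber next = 0;
    int         i;

    for (i = 0; i < nworkers; i++)
    {
        tasks[i].start_block = next;
        tasks[i].nblocks = per_worker +
            ((BlockNumber) i < remainder ? 1 : 0);
        next += tasks[i].nblocks;
    }
}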

Some of the problems I am thinking about:

1. Data structures that are required to be copied from the backend to
the workers are currentTransactionState, Snapshot, GUC, ComboCID,
EState, etc.

I see some problems in copying the "EState" data structure into shared
memory because it contains so many pointers. Some infrastructure is
needed to copy these data structures into shared memory.

If the backend takes care of locking the relation and there are no
subplans involved in the qual and targetlist, is the EState really
required in the worker to achieve the parallel scan?

Is it possible to initialize the qual, targetlist, and projinfo with
a local "EState" structure created in the worker, with the help of the
"plan" structure received from the backend?

Or

In case if "estate" is required in the worker for the processing, is
there any better way possible to share backend data structures with
the workers?

Any suggestions?

Regards,
Hari Babu
Fujitsu Australia


From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parallel Sequence Scan doubts
Date: 2014-08-23 15:11:54
Message-ID: 53F8AF3A.6010906@2ndquadrant.com
Lists: pgsql-hackers

On 08/21/2014 02:47 PM, Haribabu Kommi wrote:
> Corrected subject.
>
> Hi Hackers,
>
> Implementation of "Parallel Sequence Scan"

I think you mean "Parallel Sequential Scan".

Scanning a sequence in parallel doesn't make much sense.

> 1."Parallel Sequence Scan" can achieved by using the background
> workers doing the job of actual sequence scan including the
> qualification check also.

Only if the qualifiers are stable/immutable I think.

Not even necessarily stable functions - consider use of the
fmgr/executor state contexts to carry information over between calls,
and what parallel execution would do to that.

> 3. In the executor Init phase, try to copy the necessary data required
> by the workers and start the workers.

Copy how?

Back-ends can only communicate with each other over shared memory,
signals, and using sockets.

Presumably you'd use a dynamic shared memory segment, but it's going to
be "interesting" to copy that kind of state over. Some of the work
Robert has proposed for supporting data structures native to dynamic
shmem might help, I guess...
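
As a very rough illustration of the mechanics, using the dsm/shm_toc
APIs as they stand, and assuming snapshot serialization helpers along
the lines of EstimateSnapshotSpace()/SerializeSnapshot() - exactly the
kind of infrastructure that would have to be built. A sketch, not
working code:

#include "postgres.h"
#include "storage/dsm.h"
#include "storage/shm_toc.h"
#include "utils/snapmgr.h"

#define PSCAN_MAGIC         0x50534353  /* arbitrary magic for the toc */
#define PSCAN_KEY_SNAPSHOT  0

/*
 * Serialize the active snapshot into a new dynamic shared memory
 * segment, via a table of contents, so workers can attach and restore
 * it.  Real code would add further keys for transaction state, GUCs,
 * combo CIDs, the plan, and so on.
 */
static dsm_segment *
make_scan_segment(Snapshot snapshot)
{
    shm_toc_estimator e;
    Size        snap_size = EstimateSnapshotSpace(snapshot);
    Size        segsize;
    dsm_segment *seg;
    shm_toc    *toc;
    char       *snapspace;

    shm_toc_initialize_estimator(&e);
    shm_toc_estimate_chunk(&e, snap_size);
    shm_toc_estimate_keys(&e, 1);
    segsize = shm_toc_estimate(&e);

    seg = dsm_create(segsize);
    toc = shm_toc_create(PSCAN_MAGIC, dsm_segment_address(seg), segsize);

    snapspace = shm_toc_allocate(toc, snap_size);
    SerializeSnapshot(snapshot, snapspace);
    shm_toc_insert(toc, PSCAN_KEY_SNAPSHOT, snapspace);

    return seg;
}

A worker would dsm_attach() using the handle passed in bgw_main_arg,
shm_toc_attach() to the segment, and restore the snapshot from the
PSCAN_KEY_SNAPSHOT entry.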

> 4. In the executor run phase, just get the tuples which are sent by
> the workers and process them further in the plan node execution.

Again, how do you propose to copy these back to the main backend?

That's been one of the things that's limited parallel scan when it's
been looked at before.

> 1. Data structures that are required to be copied from the backend to
> the workers are currentTransactionState, Snapshot, GUC, ComboCID,
> EState, etc.

That's a big "etc". Huge, in fact.

Any function can reference any global variable. Even an immutable
function might use globals for cache etc - and might, for example, set
them up using an executor start hook. You cannot really make any
assumptions about what functions access what memory.

So it's not possible, as far as I can understand, to define a simple
subset of state that must be copied.

Nor is it necessarily correct to discard the copied state at the end of
the run even if you can copy it. Code may well depend on that state
being updated and preserved across calls.

> I see some problems in copying the "EState" data structure into shared
> memory because it contains so many pointers. Some infrastructure is
> needed to copy these data structures into shared memory.

It's not just a matter of copying them into/via shmem.

It's about their meaning. Does it even make sense to copy the executor
state to another backend? If so, you can't copy it back, so what do you
do at the end of the scans?

> Any suggestions?

Before you try to design anything more on this, study the *large* amount
of discussion that has happened on this topic on this mailing list over
the last few years.

This is not a simple or easy task, and it's not one you should approach
without studying what's already been worked on, built, contemplated, etc.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>
To: Craig Ringer <craig(at)2ndquadrant(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parallel Sequence Scan doubts
Date: 2014-08-24 01:40:39
Message-ID: CAJrrPGcPU4zbbndv61Bidmc67zSkOsOFgWw-Eh7BF2Zd=HCzdg@mail.gmail.com
Lists: pgsql-hackers

On Sun, Aug 24, 2014 at 1:11 AM, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:
> On 08/21/2014 02:47 PM, Haribabu Kommi wrote:
>> Implementation of "Parallel Sequence Scan"
>
> I think you mean "Parallel Sequential Scan".

Sorry for not being clear. Yes, it is a parallel sequential scan.

>> 1."Parallel Sequence Scan" can achieved by using the background
>> workers doing the job of actual sequence scan including the
>> qualification check also.
>
> Only if the qualifiers are stable/immutable I think.
>
> Not even necessarily stable functions - consider use of the
> fmgr/executor state contexts to carry information over between calls,
> and what parallel execution would do to that.

Thanks for your input. As of now we are targeting only immutable
functions, which can be executed in parallel without any side effects.
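
A minimal sketch of that gate on the planner side, reusing the
existing contain_mutable_functions() helper (deliberately
conservative; note it trusts the pg_proc volatility labels):

#include "postgres.h"
#include "optimizer/clauses.h"

/*
 * Allow a parallel scan only when nothing in the qual or targetlist
 * could be mutable.  A function mislabeled IMMUTABLE defeats this
 * check, which is the caveat raised above.
 */
static bool
scan_exprs_are_parallel_safe(Node *qual, Node *targetlist)
{
    if (contain_mutable_functions(qual))
        return false;
    if (contain_mutable_functions(targetlist))
        return false;
    return true;
}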

>> 3. In the executor Init phase, try to copy the necessary data required
>> by the workers and start the workers.
>
> Copy how?
>
> Back-ends can only communicate with each other over shared memory,
> signals, and using sockets.

Sorry for not being clear; I meant copying those data structures into
dynamic shared memory only. From there the workers can access them.

>> 4. In the executor run phase, just get the tuples which are sent by
>> the workers and process them further in the plan node execution.
>
> Again, how do you propose to copy these back to the main backend?

With the help of message queues created in dynamic shared memory, the
workers can send the data to the queue. On the other side, the main
backend receives the tuples from the queue.
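
A rough sketch of both sides with the shm_mq API (queue creation and
attachment omitted; signatures as in the current sources):

#include "postgres.h"
#include "access/htup_details.h"
#include "storage/shm_mq.h"

/* Worker side: send one minimal tuple through the queue. */
static void
worker_send_tuple(shm_mq_handle *mqh, MinimalTuple tuple)
{
    shm_mq_result res;

    /* shm_mq transparently chunks messages larger than the ring buffer */
    res = shm_mq_send(mqh, tuple->t_len, tuple, false);
    if (res == SHM_MQ_DETACHED)
        elog(ERROR, "master backend detached from tuple queue");
}

/* Backend side: fetch the next tuple; NULL means the sender is done. */
static MinimalTuple
master_receive_tuple(shm_mq_handle *mqh)
{
    Size        nbytes;
    void       *data;
    shm_mq_result res;

    res = shm_mq_receive(mqh, &nbytes, &data, false);
    if (res == SHM_MQ_DETACHED)
        return NULL;
    /* data points into the queue; copy it out before the next receive */
    return (MinimalTuple) data;
}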

>> 1. Data structures that are required to be copied from the backend to
>> the workers are currentTransactionState, Snapshot, GUC, ComboCID,
>> EState, etc.
>
> That's a big "etc". Huge, in fact.
>
> Any function can reference any global variable. Even an immutable
> function might use globals for cache etc - and might, for example, set
> them up using an executor start hook. You cannot really make any
> assumptions about what functions access what memory.

Yes, you are correct. For that reason I am thinking of supporting only
functions that depend solely on their input variables and do not modify
any global data.

>> I see some problems in copying the "EState" data structure into shared
>> memory because it contains so many pointers. Some infrastructure is
>> needed to copy these data structures into shared memory.
>
> It's not just a matter of copying them into/via shmem.
>
> It's about their meaning. Does it even make sense to copy the executor
> state to another backend? If so, you can't copy it back, so what do you
> do at the end of the scans?

If we handle the locking of the relation in the backend and avoid doing
a parallel sequential scan when any subquery is involved, then there is
no need for the full EState in the worker. In those cases, by sharing
less information, I think we can execute the plan in the worker.
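
Something like the following is what I have in mind for the worker
side (a sketch against the 9.4-era executor API, assuming the qual
list has already been rebuilt from the plan received via shared
memory, the backend holds the relation lock, and no subplans exist):

#include "postgres.h"
#include "executor/executor.h"
#include "nodes/execnodes.h"

/*
 * Build a private EState in the worker and initialize the qual with
 * no parent PlanState.  Valid only under the restrictions above.
 */
static List *
worker_init_qual(List *qual, EState **estate_out)
{
    EState *estate = CreateExecutorState();

    *estate_out = estate;
    return (List *) ExecInitExpr((Expr *) qual, NULL);
}

/* Evaluate the qual against one tuple fetched by the worker's scan. */
static bool
worker_check_qual(List *qualstate, EState *estate, TupleTableSlot *slot)
{
    ExprContext *econtext = GetPerTupleExprContext(estate);

    ResetExprContext(econtext);
    econtext->ecxt_scantuple = slot;
    return ExecQual(qualstate, econtext, false);
}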

>> Any suggestions?
>
> Before you try to design anything more on this, study the *large* amount
> of discussion that has happened on this topic on this mailing list over
> the last few years.
>
> This is not a simple or easy task, and it's not one you should approach
> without studying what's already been worked on, built, contemplated, etc.

Thanks for your information.

Regards,
Hari Babu
Fujitsu Australia


From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parallel Sequence Scan doubts
Date: 2014-08-24 02:34:42
Message-ID: 53F94F42.4040702@2ndquadrant.com
Lists: pgsql-hackers

On 08/24/2014 09:40 AM, Haribabu Kommi wrote:

> Any suggestions?

Another point I didn't raise first time around, but that's IMO quite
significant, is that you haven't addressed why this approach to fully
parallel seqscans is useful and solves real problems in effective ways.

It might seem obvious - "of course they're useful". But I see two things
they'd address:

- CPU-limited sequential scans, where expensive predicates are filtering
the scan; and

- I/O limited sequential scans, where the predicates already execute
fast enough on one CPU, so most time is spent waiting for more disk I/O.

The problem I see with your design is that it's not going to be useful
for a large class of CPU-limited scans where the predicate isn't
composed entirely of immutable functions and operators. Especially since
immutable-only predicates are the best candidates for expression indexes
anyway.

While it'd likely be useful for I/O limited scans, it's going to
increase contention on shared_buffers locking and page management. More
importantly, is it the most efficient way to solve the problem with I/O
limited scans?

I would seriously suggest looking at first adding support for
asynchronous I/O across ranges of extents during a sequential scan. You
might not need multiple worker backends at all.
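
For illustration, a single backend could issue prefetch requests a
configurable distance ahead of the block it is actually reading, the
way bitmap heap scans already drive PrefetchBuffer() - a sketch only,
with the distance playing the role effective_io_concurrency plays
there:

#include "postgres.h"
#include "storage/bufmgr.h"
#include "utils/rel.h"

/*
 * Before reading block blkno of a seqscan, hint the kernel (via
 * posix_fadvise under the hood) about a block we'll want shortly.
 */
static void
seqscan_prefetch(Relation rel, BlockNumber blkno,
                 BlockNumber nblocks, int distance)
{
    BlockNumber target = blkno + distance;

    if (target < nblocks)
        PrefetchBuffer(rel, MAIN_FORKNUM, target);
}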

I'm sure using async I/O to implement effective_io_concurrency for
seqscans has been discussed and explored before, so again I think
some time in the list archives might make sense.

I don't know if it makes sense to do something as complex as parallel
multi-process seqscans without having a path forward for supporting
non-immutable functions - probably with fmgr API enhancements,
additional function labels ("PARALLEL"), etc, depending on what you find
is needed.

Do you have specific workloads where you see this as useful, and where
doing async I/O and readahead within a single back-end wouldn't solve
the same problem?

>>> 3. In the executor Init phase, Try to copy the necessary data required
>>> by the workers and start the workers.
>>
>> Copy how?
>>
>> Back-ends can only communicate with each other over shared memory,
>> signals, and using sockets.
>
> Sorry for not being clear; I meant copying those data structures into
> dynamic shared memory only. From there the workers can access them.

That'll probably work with read-only data, but it's not viable for
read/write data unless you use a big lock to protect it, in which case
you lose the parallelism you want to achieve.

You'd have to classify what may be modified during scan execution
carefully and determine if you need to feed any of the resulting
modifications back to the original backend - and how to merge
modifications by multiple workers, if it's even possible.

That's going to involve a detailed structure-by-structure analysis and
seems likely to be error prone and buggy.

I think you should probably talk to Robert Haas about what he's been
doing over the last couple of years on parallel query.

>>> 4. In the executor run phase, just get the tuples which are sent by
>>> the workers and process them further in the plan node execution.
>>
>> Again, how do you propose to copy these back to the main backend?
>
> With the help of message queues created in dynamic shared memory, the
> workers can send the data to the queue. On the other side, the main
> backend receives the tuples from the queue.

OK, so you plan to implement shmem queues.

That'd be a useful starting point, as it'd be something that would be
useful in its own right.

You'd have to be able to handle individual values that're larger than
the ring buffer or whatever you're using for transfers, in case you're
dealing with already-detoasted tuples or in-memory tuples.

Again, chatting with Robert and others who've worked on dynamic shmem,
parallel query, etc would be wise here.

> Yes, you are correct. For that reason I am thinking of supporting only
> functions that depend solely on their input variables and do not modify
> any global data.

You'll want to be careful with that. Nothing stops an immutable function
referencing a cache in a C global that it initializes once and then
treats as read only, for example.
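
To illustrate (an entirely hypothetical function; build_cache() is
made up):

#include "postgres.h"
#include "fmgr.h"

static int32 *cache = NULL;     /* process-local; never shared with workers */

extern int32 *build_cache(void);    /* hypothetical one-time setup */

PG_FUNCTION_INFO_V1(lookup_cached);

/*
 * Marked IMMUTABLE at the SQL level, and perfectly correct in a single
 * backend.  In a parallel worker the cache starts out NULL, and any
 * priming done by an executor start hook in the original backend never
 * happened there.
 */
Datum
lookup_cached(PG_FUNCTION_ARGS)
{
    int32 key = PG_GETARG_INT32(0);

    if (cache == NULL)
        cache = build_cache();

    PG_RETURN_INT32(cache[key]);
}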

I suspect you'll need a per-function whitelist. I'd love to be wrong.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>
To: Craig Ringer <craig(at)2ndquadrant(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parallel Sequence Scan doubts
Date: 2014-08-24 11:22:03
Message-ID: CAJrrPGd98mCCQiET0ceThjYg_ogdKs1dM3adPs4AyEAi18fY3w@mail.gmail.com
Lists: pgsql-hackers

On Sun, Aug 24, 2014 at 12:34 PM, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:
> On 08/24/2014 09:40 AM, Haribabu Kommi wrote:
>
>> Any suggestions?
>
> Another point I didn't raise first time around, but that's IMO quite
> significant, is that you haven't addressed why this approach to fully
> parallel seqscans is useful and solves real problems in effective ways.
>
> It might seem obvious - "of course they're useful". But I see two things
> they'd address:
>
> - CPU-limited sequential scans, where expensive predicates are filtering
> the scan; and

Yes, we are mainly targeting CPU-limited sequential scans. For this
reason I want the worker to handle the predicates also, not just read
the tuples from disk.

> - I/O limited sequential scans, where the predicates already execute
> fast enough on one CPU, so most time is spent waiting for more disk I/O.
>
> The problem I see with your design is that it's not going to be useful
> for a large class of CPU-limited scans where the predicate isn't
> composed entirely of immutable functions and operators. Especially since
> immutable-only predicates are the best candidates for expression indexes
> anyway.
>
> While it'd likely be useful for I/O limited scans, it's going to
> increase contention on shared_buffers locking and page management. More
> importantly, is it the most efficient way to solve the problem with I/O
> limited scans?
>
> I would seriously suggest looking at first adding support for
> asynchronous I/O across ranges of extents during a sequential scan. You
> might not need multiple worker backends at all.
>
> I'm sure using async I/O to implement effective_io_concurrency for
> seqscans has been discussed and explored before, so again I think
> some time in the list archives might make sense.

Thanks for your inputs. I will check it.

> I don't know if it makes sense to do something as complex as parallel
> multi-process seqscans without having a path forward for supporting
> non-immutable functions - probably with fmgr API enhancements,
> additional function labels ("PARALLEL"), etc, depending on what you find
> is needed.

Thanks for your inputs.
I will look for a solution to extend the support to non-immutable
functions as well.

>>>> 3. In the executor Init phase, try to copy the necessary data required
>>>> by the workers and start the workers.
>>>
>>> Copy how?
>>>
>>> Back-ends can only communicate with each other over shared memory,
>>> signals, and using sockets.
>>
>> Sorry for not being clear; I meant copying those data structures into
>> dynamic shared memory only. From there the workers can access them.
>
> That'll probably work with read-only data, but it's not viable for
> read/write data unless you use a big lock to protect it, in which case
> you lose the parallelism you want to achieve.

As of now I am thinking of sharing only read-only data with the workers.
If read/write data needs to be shared, then we may need another
approach to handle that.

> You'd have to classify what may be modified during scan execution
> carefully and determine if you need to feed any of the resulting
> modifications back to the original backend - and how to merge
> modifications by multiple workers, if it's even possible.
>
> That's going to involve a detailed structure-by-structure analysis and
> seems likely to be error prone and buggy.

Thanks for your inputs. I will look into this carefully.

> I think you should probably talk to Robert Haas about what he's been
> doing over the last couple of years on parallel query.

Sure, I will check with him.

>>>> 4. In the executor run phase, just get the tuples which are sent by
>>>> the workers and process them further in the plan node execution.
>>>
>>> Again, how do you propose to copy these back to the main backend?
>>
>> With the help of message queues created in dynamic shared memory, the
>> workers can send the data to the queue. On the other side, the main
>> backend receives the tuples from the queue.
>
> OK, so you plan to implement shmem queues.
>
> That'd be a useful starting point, as it'd be something that would be
> useful in its own right.

Shmem queues are already possible with dynamic shared memory;
I just want to use them here.

> You'd have to be able to handle individual values that're larger than
> the ring buffer or whatever you're using for transfers, in case you're
> dealing with already-detoasted tuples or in-memory tuples.
>
> Again, chatting with Robert and others who've worked on dynamic shmem,
> parallel query, etc would be wise here.
>
>> Yes, you are correct. For that reason I am thinking of supporting only
>> functions that depend solely on their input variables and do not modify
>> any global data.
>
> You'll want to be careful with that. Nothing stops an immutable function
> referencing a cache in a C global that it initializes once and then
> treats as read only, for example.
>
> I suspect you'll need a per-function whitelist. I'd love to be wrong.

Yes, we need per-function-level details. Once we have a better
solution that handles non-immutable functions as well, these may not
be required.

Regards,
Hari Babu
Fujitsu Australia


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parallel Sequence Scan doubts
Date: 2014-08-27 12:33:18
Message-ID: CA+TgmoYFvZeNoXaHpDDJYq56reAmMEMQG1PthFm6m5fDnqz2SA@mail.gmail.com
Lists: pgsql-hackers

On Thu, Aug 21, 2014 at 2:47 AM, Haribabu Kommi
<kommi(dot)haribabu(at)gmail(dot)com> wrote:
> Implementation of "Parallel Sequence Scan"
>
> Approach:
>
> 1."Parallel Sequence Scan" can achieved by using the background
> workers doing the job of actual sequence scan including the
> qualification check also.
>
> 2. The planner generates the parallel scan plan by checking whether
> the table meets the criteria for a parallel scan, and generates the
> tasks (ranges of blocks).
>
> 3. In the executor Init phase, try to copy the necessary data required
> by the workers and start the workers.
>
> 4. In the executor run phase, just get the tuples which are sent by
> the workers and process them further in the plan node execution.

Well, this is something I've thought quite a bit about already. Many
of my thoughts on parallelism are here:

https://wiki.postgresql.org/wiki/Parallel_Sort

Although the page title is parallel sort, many of the concerns are
applicable to parallelism of any sort.

I posted some patches containing some of the necessary infrastructure here:

http://archives.postgresql.org/message-id/CA+Tgmoam66dTzCP8N2cRcS6S6dBMFX+JMba+mDf68H=KAkNjPQ@mail.gmail.com

I seem to have forgotten to add that message to the CommitFest. Crap.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Jim Nasby <jim(at)nasby(dot)net>
To: Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parallel Sequence Scan doubts
Date: 2014-08-27 17:38:58
Message-ID: 53FE17B2.4070700@nasby.net
Lists: pgsql-hackers

On 8/24/14, 6:22 AM, Haribabu Kommi wrote:
> Yes, we are mainly targeting CPU-limited sequential scans. For this
> reason I want the worker to handle the predicates also, not just read
> the tuples from disk.

In that case, I would suggest focusing on parallel execution of conditions regardless of where they show up in the query plan. In my experience, they often have nothing to do with a seqscan.

Here's a real-world example. We have a view that pivots our application's accounting journal into a ledger. The expensive part of the view is this:

sum(
    CASE
        WHEN b.tag::text = 'installment_principal'::text
            THEN b.type_cd      -- type_cd is either 1, 0, or -1
        ELSE 0::numeric
    END
) * transaction_amount AS installment_principal

The view with this pivot has about 100 of these CASE statements. Frequently we only reference a few of them, but anytime we need to refer to 20+ of them, the evaluation of those expressions gets VERY CPU-expensive compared to the rest of the query.

The other thing I would look at before seqscan filters is join processing and bitmap index combining (ie: ANDing together the results of several bitmap index scans). Those are things that can be very CPU-intensive even when doing simple equality comparisons.
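
For reference, the ANDing step is essentially repeated tbm_intersect()
calls over the per-index bitmaps - compare MultiExecBitmapAnd in
nodeBitmapAnd.c (sketch only):

#include "postgres.h"
#include "nodes/tidbitmap.h"

/*
 * Fold several index scans' bitmaps into one by intersection.  This
 * loop is the CPU-heavy combining step referred to above.
 */
static TIDBitmap *
and_bitmaps(TIDBitmap **bitmaps, int nbitmaps)
{
    TIDBitmap *result = bitmaps[0];
    int         i;

    for (i = 1; i < nbitmaps; i++)
    {
        tbm_intersect(result, bitmaps[i]);
        tbm_free(bitmaps[i]);
    }
    return result;
}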

BTW, it's also possible that these cases would be good fits for GPU parallel execution.
--
Jim C. Nasby, Data Architect jim(at)nasby(dot)net
512.569.9461 (cell) http://jim.nasby.net