Re: Support Parallel Query Execution in Executor

From: "Qingqing Zhou" <zhouqq(at)cs(dot)toronto(dot)edu>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Support Parallel Query Execution in Executor
Date: 2006-04-06 10:28:33
Message-ID: e12qms$2s6q$1@news.hub.org
Lists: pgsql-hackers pgsql-patches

I have written some experimental code implementing a master-slave seqscan
in PostgreSQL. During this work, I came to feel that we already have enough
infrastructure to support parallel query execution.

What I did is add a new node, PARA, and plug it in above the node that we
want to execute in parallel. At this stage, a PARA node just wraps a
SeqScan node, which is:

typedef struct Para
{
    /* TODO: add a union to put all nodes supporting parallelism here */
    SeqScan     scan;

    /* Split / Merge / Redistribute */
    ParaType    type;

    /* TODO: other possible parameters */
} Para;

At execution time, the master (the process that receives the query) wakes
up a slave process (an idle ordinary backend), and the slave passes the
scan results to the master via a shared-memory communication buffer. In
detail, the execution works like this:

Master process:
1. PARA init: wake up a slave, pass the queryTree and outerPlan(planTree)
   to it by nodeToString();
2. PARA exec:
       get an item from the communication-buffer;
       if item is a valid tuple
           return item;
       else
           handle other types of item;  /* execution done/error */
3. PARA end: do some cleanup.

As we can see from the PARA init stage, even with the most simple PARA
node, it is easy to support inter-node parallelism.

Slave process (using code similar to the autovacuum process):
1. Get queryTree and planTree;
2. Redirect the destReceiver to the communication-buffer;
3. Encapsulate them in an executor and run;

The query plan is like this:
TEST=# explain select max(a), max(b) from t;
                               QUERY PLAN
----------------------------------------------------------------------
 Aggregate  (cost=7269.01..7269.02 rows=1 width=53)
   ->  Para [Split = 1]  (cost=10.00..5879.00 rows=278000 width=53)
         ->  Seq Scan on T  (cost=0.00..5879.00 rows=278000 width=53)
(3 rows)

There are some problems I haven't addressed yet. The most difficult one for
me is xid assignment: the master and slaves should see an identical view,
and the key to that is the xid. I am not sure of the correct solution to
this problem; we might use the same xid, or a contiguous range of xids, for
the master and slaves. Other problems, like authentication (the master and
slaves should act as the same user) and elog message passing, are also
important, but I think we can handle them without much trouble.

I haven't touched the most difficult part, the parallel query optimizer.
But thanks to the two-phase parallel optimization technique, this part can
be treated like the geqo optimizer: without enough evidence of benefit, we
simply don't enable parallel query execution.

Are there any show-stopping reasons not to do this?

Regards,
Qingqing


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Qingqing Zhou <zhouqq(at)cs(dot)toronto(dot)edu>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Support Parallel Query Execution in Executor
Date: 2006-04-06 10:41:13
Message-ID: 20060406104112.GB2120@svana.org

On Thu, Apr 06, 2006 at 06:28:33PM +0800, Qingqing Zhou wrote:
> I have written some experimental code of doing master-slave seqscan in
> PostgreSQL. During the work, I feel we had enough infrastructure to support
> parallel query execution.

Good work. One question though: temporary tables. Currently there is no
way to get a consistent view of a temporary table from another process.

You probably have to avoid shipping off non-immutable function calls.

Is it possible to deadlock? (I don't see how, but I may not be looking
hard enough.)

The hardest work would be finding what can be safely farmed off.

Nice work though, I hadn't thought of this approach.

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.


From: "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
To: "Qingqing Zhou" <zhouqq(at)cs(dot)toronto(dot)edu>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Support Parallel Query Execution in Executor
Date: 2006-04-06 12:51:57
Message-ID: 36e682920604060551l6c2d1e8ct84ff73e7077c07ae@mail.gmail.com

On 4/6/06, Qingqing Zhou <zhouqq(at)cs(dot)toronto(dot)edu> wrote:
> I have written some experimental code of doing master-slave seqscan in
> PostgreSQL. During the work, I feel we had enough infrastructure to support
> parallel query execution.

Great work! I had looked into this a little bit and came to the same
ideas/problems you did, but none of them seemed insurmountable at all.
I'd be interested in working with you on this if you'd like.

I'm interested in hearing Tom's suggestions on this topic too.

--
Jonah H. Harris, Database Internals Architect
EnterpriseDB Corporation
732.331.1324


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
Cc: Qingqing Zhou <zhouqq(at)cs(dot)toronto(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Support Parallel Query Execution in Executor
Date: 2006-04-06 14:41:15
Message-ID: 4435288B.3050003@dunslane.net

Jonah H. Harris wrote:
> On 4/6/06, Qingqing Zhou <zhouqq(at)cs(dot)toronto(dot)edu> wrote:
>
>> I have written some experimental code of doing master-slave seqscan in
>> PostgreSQL. During the work, I feel we had enough infrastructure to support
>> parallel query execution.
>>
>
> Great work! I had looked into this a little bit and came to the same
> ideas/problems you did, but none of them seemed insurmountable at all.
> I'd be interested in working with you on this if you'd like.
>
> I'm interested in hearing Tom's suggestions on this topic too.
>
>

From time to time there is (often ill-informed) discussion about making
postgres multi-threaded - this strikes me as a possible candidate area
for multithreading, and maybe that would be a way to overcome some of
the problems mentioned.

Just a thought

cheers

andrew


From: Thomas Hallgren <thomas(at)tada(dot)se>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Qingqing Zhou <zhouqq(at)cs(dot)toronto(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Support Parallel Query Execution in Executor
Date: 2006-04-06 16:56:24
Message-ID: 44354838.4030500@tada.se

Hear hear ;-)

Regards,
Thomas Hallgren

Andrew Dunstan wrote:
> Jonah H. Harris wrote:
>> On 4/6/06, Qingqing Zhou <zhouqq(at)cs(dot)toronto(dot)edu> wrote:
>>
>>> I have written some experimental code of doing master-slave seqscan in
>>> PostgreSQL. During the work, I feel we had enough infrastructure to
>>> support
>>> parallel query execution.
>>>
>>
>> Great work! I had looked into this a little bit and came to the same
>> ideas/problems you did, but none of them seemed insurmountable at all.
>> I'd be interested in working with you on this if you'd like.
>>
>> I'm interested in hearing Tom's suggestions on this topic too.
>>
>>
>
> From time to time there is (often ill-informed) discussion about making
> postgres multi-threaded - this strikes me as a possible candidate area
> for multithreading, and maybe that would be a way to overcome some of
> the problems mentioned.
>
> Just a thought
>
> cheers
>
> andrew
>


From: "Qingqing Zhou" <zhouqq(at)cs(dot)toronto(dot)edu>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Support Parallel Query Execution in Executor
Date: 2006-04-07 02:18:24
Message-ID: e14ibn$1u09$1@news.hub.org


"Jonah H. Harris" <jonah(dot)harris(at)gmail(dot)com> wrote
>
> Great work! I had looked into this a little bit and came to the same
> ideas/problems you did, but none of them seemed insurmountable at all.
> I'd be interested in working with you on this if you'd like.
>

Yes, I am happy to work with anyone on this topic. The plan I have in mind
is like this:
(1) stabilize the master-slave seqscan: solve all the remaining problems;
(2) parallelize the seqscan: AFAICS this should not be very difficult given
(1); it may only need some scan-portion assignment;
(3) add an indexscan or another one or two node types to the master-slave
solution, in order to make the framework extensible;
(4) parallelize these nodes - this will be a big chunk of work;
(5) add two-phase optimization to the server - we have to consider
partitioned tables at this stage; yet another big chunk of work.

Regards,
Qingqing


From: "Mike Rylander" <mrylander(at)gmail(dot)com>
To: "Qingqing Zhou" <zhouqq(at)cs(dot)toronto(dot)edu>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Support Parallel Query Execution in Executor
Date: 2006-04-07 12:59:46
Message-ID: b918cf3d0604070559r5d139018h3693ec95ac42092c@mail.gmail.com

On 4/6/06, Qingqing Zhou <zhouqq(at)cs(dot)toronto(dot)edu> wrote:
>
> ""Jonah H. Harris"" <jonah(dot)harris(at)gmail(dot)com> wrote
> >
> > Great work! I had looked into this a little bit and came to the same
> > ideas/problems you did, but none of them seemed insurmountable at all.
> > I'd be interested in working with you on this if you'd like.
> >

First, I want to second Jonah's enthusiasm. This is very exciting!

>
> Yes, I am happy to work with anyone on the topic. The plan in mind is like
> this:
> (1) stable the master-slave seqscan: solve all the problems left;
> (2) parallize the seqscan: AFAICS, this should not very difficult based on
> 1, may only need some scan portition assignment;

This is really only a gut feeling for me (it can't be otherwise, since
we can't yet test), but I think parallelizing a single seqscan is
pretty much guaranteed to do nothing, because seqscans, especially on
large tables, are IO bound.

There was a plan some time ago (during 8.0 beta, I think) to allow
multiple seqscans from different queries to join each other, such that
scans that begin later start scanning the table at the point, or just
behind the point, that the first running scan is already at. That
plan would reduce IO contention, and buffer and OS cache thrashing, by
having multiple readers pull from the same hose.

I can't see how asking for more than one stream from the same file
would do anything but increase both cache thrashing and IO bandwidth
contention. Am I missing something here?

> (3) add an indexscan or other one or two node type to master-slave
> solution: this is in order to make the framework extensible;
> (4) parallize these node - this will be a big chunk of job;

Now that could be a _big_ win! Especially if tablespaces are used to
balance commonly combined tables and indexes.

> (5) add a two-phase optimization to the server - we have to consider the
> partitioned table in this stage, yet another big chunk of job;
>

Same here. This would be a place where parallel seqscans of different
tables (instead of multi-headed scan of one table) could buy you a
lot, especially with proper tablespace use.

Thanks again, Qingqing, for the work on this. I'm very excited about
where this could go. :)

> Regards,
> Qingqing
>

--
Mike Rylander
mrylander(at)gmail(dot)com
GPLS -- PINES Development
Database Developer
http://open-ils.org


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: "Mike Rylander" <mrylander(at)gmail(dot)com>, "Qingqing Zhou" <zhouqq(at)cs(dot)toronto(dot)edu>
Subject: Re: Support Parallel Query Execution in Executor
Date: 2006-04-08 00:52:15
Message-ID: 200604071752.16202.josh@agliodbs.com

Qingqing,

> First, I want to second Jonah's enthusiasm. This is very exciting!

Me, three! I didn't think this was ever going to come to Postgres absent
major corporate funding.

> This is really only a gut feeling for me (it can't be otherwise, since
> we can't yet test), but I think parallelizing a single seqscan is
> pretty much guaranteed to do nothing, because seqscans, especially on
> large tables, are IO bound.

Actually, not true. Our current seqscan performance suffers from
producer-consumer fluctuation. Greenplum and Sun did a whole bunch of
testing on this.

Basically reading a large table off disk does this:

read some table while not processing
process in cpu while not reading
read some more table while not processing
process some more in cpu while not reading
etc.
resulting in an I/O throughput graph that looks like:

* * *
* * * * * *
* * * * * *
* * * *

The really annoying part about this, for me personally, is that the peaks
are significantly faster than comparable commercial DBMSes ... but our
average is far less. So even on a single seq scan, parallel query
execution would make a significant difference in performance, possibly as
much as +75% on seq scans of large tables.

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco


From: "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
To: josh(at)agliodbs(dot)com
Cc: pgsql-hackers(at)postgresql(dot)org, "Mike Rylander" <mrylander(at)gmail(dot)com>, "Qingqing Zhou" <zhouqq(at)cs(dot)toronto(dot)edu>
Subject: Re: Support Parallel Query Execution in Executor
Date: 2006-04-08 01:00:13
Message-ID: 36e682920604071800q4200331fp9828f1840870aa83@mail.gmail.com

On 4/7/06, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> So even on a single seq scan, parallel query
> execution would make a significant difference
> in performance, possibly as
> much as +75% on seq scans of large tables.

I've been looking at several commercial systems which employ dynamic
partitioning for sequential scans of large tables. While 75% sounds a
little optimistic, there is most definitely a sizable performance
increase.

--
Jonah H. Harris, Database Internals Architect
EnterpriseDB Corporation
732.331.1324


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: josh(at)agliodbs(dot)com
Cc: pgsql-hackers(at)postgresql(dot)org, "Mike Rylander" <mrylander(at)gmail(dot)com>, "Qingqing Zhou" <zhouqq(at)cs(dot)toronto(dot)edu>
Subject: Re: Support Parallel Query Execution in Executor
Date: 2006-04-08 06:49:49
Message-ID: 20531.1144478989@sss.pgh.pa.us

Josh Berkus <josh(at)agliodbs(dot)com> writes:
> Basically reading a large table off disk does this:

> read some table while not processing
> process in cpu while not reading
> read some more table while not processing
> process some more in cpu while not reading
> etc.
> resulting in an I/O througput graph that looks like:

> * * *
> * * * * * *
> * * * * * *
> * * * *

Interesting ...

> The really annoying part about this, for me personally, is that the peaks
> are significantly faster than comparable commercial DBMSes ... but our
> average is far less. So even on a single seq scan, parallel query
> execution would make a significant difference in performance, possibly as
> much as +75% on seq scans of large tables.

... but I'm failing to follow where it says that parallel processing
will fix that. All I can foresee in that direction is extra data
transfer costs, bought at the price of portability and locking headaches.

regards, tom lane


From: "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: josh(at)agliodbs(dot)com, pgsql-hackers(at)postgresql(dot)org, "Mike Rylander" <mrylander(at)gmail(dot)com>, "Qingqing Zhou" <zhouqq(at)cs(dot)toronto(dot)edu>
Subject: Re: Support Parallel Query Execution in Executor
Date: 2006-04-08 17:16:08
Message-ID: 36e682920604081016o39bca5al53344b24569a8fec@mail.gmail.com

On 4/8/06, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> ... but I'm failing to follow where it says that parallel processing
> will fix that. All I can foresee in that direction is extra data
> transfer costs, bought at the price of portability and locking headaches.

I don't think it's any less portable than the system is now; it's just
enabling multiple slave processes to participate in scans and
processing (parallel query, parallel index builds, parallel sorts,
...). Likewise, the additional I/O cost isn't that much of an issue
because systems which really take advantage of this type of parallel
processing have large bandwidth I/O arrays anyway.

I didn't even want to mention that EVERY other database I know of
(Oracle, DB2, Sybase, SQL Server, Ingres, Bizgres MPP, MaxDB) supports
this, but it's a pretty obvious win for many environments.

--
Jonah H. Harris, Database Internals Architect
EnterpriseDB Corporation
732.331.1324


From: Markus Schiltknecht <markus(at)bluegap(dot)ch>
To: "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, josh(at)agliodbs(dot)com, pgsql-hackers(at)postgresql(dot)org, Mike Rylander <mrylander(at)gmail(dot)com>, Qingqing Zhou <zhouqq(at)cs(dot)toronto(dot)edu>
Subject: Re: Support Parallel Query Execution in Executor
Date: 2006-04-08 17:37:11
Message-ID: 1144517831.5726.18.camel@localhost.localdomain

Hi,

On Sat, 2006-04-08 at 13:16 -0400, Jonah H. Harris wrote:
> On 4/8/06, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> > ... but I'm failing to follow where it says that parallel processing
> > will fix that. All I can foresee in that direction is extra data
> > transfer costs, bought at the price of portability and locking headaches.
>
> I don't think it's any less portable than the system is now;

ACK. As long as processes, signals and shared memory are used this could
be as portable as PostgreSQL is now.

> It's just
> enabling multiple slave processes to participate in scans and
> processing (parallel query, parallel index builds, parallel sorts,
> ...) Likewise, the additional I/O cost isn't that much of an issue
> because systems which really take advantage of this type of parallel
> processing have large bandwidth I/O arrays anyway.

Ehm... which additional I/O cost? Or do you count inter-process
communication as I/O?

I'd like to help teach PostgreSQL the art of parallel query execution. I
have roughly implemented an internal message passing system on a shared
memory basis. It allows all pg-processes to send messages to each other,
addressed by PID. Upon receipt of a message, a process gets a SIGUSR2 (as
far as I can remember now).

On the positive side:
- portable, as I'm using only shmem and signals
- fast (not much copying of the data around in memory)

One downside I see with my approach:
- the shared memory buffer is limited in size, so it can simply be
'full', and no more messages can be sent until another process reads
its messages and frees the memory for other messages.

In case you're interested I'll compile a patch for review.

Regards

Markus


From: "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
To: "Markus Schiltknecht" <markus(at)bluegap(dot)ch>
Cc: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, josh(at)agliodbs(dot)com, pgsql-hackers(at)postgresql(dot)org, "Mike Rylander" <mrylander(at)gmail(dot)com>, "Qingqing Zhou" <zhouqq(at)cs(dot)toronto(dot)edu>
Subject: Re: Support Parallel Query Execution in Executor
Date: 2006-04-08 17:45:24
Message-ID: 36e682920604081045v1cd0c1der3662933bf298a233@mail.gmail.com

On 4/8/06, Markus Schiltknecht <markus(at)bluegap(dot)ch> wrote:
> ACK. As long as processes, signals and shared memory are used this could
> be as portable as PostgreSQL is now.

This is certainly the case.

> Ehm.. which additional I/O cost? Or do you count inter-process
> communication to I/O?

Inter-process communication will add a minimal amount, but if it's done
correctly, you'll still end up ahead.

> I'd like to help teaching PostgreSQL the art of parallel query
> execution.

Cool, we're not at the communication-level yet, but your help would be
appreciated.

> In case you're interested I'll compile a patch for review.

Surely!

--
Jonah H. Harris, Database Internals Architect
EnterpriseDB Corporation
732.331.1324


From: Markus Schiltknecht <markus(at)bluegap(dot)ch>
To: "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
Cc: pgsql-patches(at)postgresql(dot)org
Subject: Re: Support Parallel Query Execution in Executor
Date: 2006-04-08 18:54:35
Message-ID: 1144522475.5726.28.camel@localhost.localdomain

Hello,

On Sat, 2006-04-08 at 13:45 -0400, Jonah H. Harris wrote:
> > I'd like to help teaching PostgreSQL the art of parallel query
> > execution.
>
> Cool, we're not at the communication-level yet, but your help would be
> appreciated.
>
> > In case you're interested I'll compile a patch for review.
>
> Surely!

[ This patch is not meant to be included immediately. But it might be
one step towards supporting parallel query execution, IMHO. ]

This is my internal message passing implementation. I'd probably also
need to document its use...

It's not tested except for 'it works for me'. And it still misses the
'wrap around' functionality to recycle the buffer. Also, I should
probably rename buffer.h and buffer.c since there are a lot of other
buffers in PostgreSQL.

Markus

Attachment: imsg.patch (text/x-patch, 18.0 KB)

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
Cc: "Markus Schiltknecht" <markus(at)bluegap(dot)ch>, josh(at)agliodbs(dot)com, pgsql-hackers(at)postgresql(dot)org, "Mike Rylander" <mrylander(at)gmail(dot)com>, "Qingqing Zhou" <zhouqq(at)cs(dot)toronto(dot)edu>
Subject: Re: Support Parallel Query Execution in Executor
Date: 2006-04-08 22:02:46
Message-ID: 28812.1144533766@sss.pgh.pa.us

"Jonah H. Harris" <jonah(dot)harris(at)gmail(dot)com> writes:
> On 4/8/06, Markus Schiltknecht <markus(at)bluegap(dot)ch> wrote:
>> Ehm.. which additional I/O cost? Or do you count inter-process
>> communication to I/O?

> Inter-process will add a minimal amount, but if it's done correctly,
> you'll still end up ahead.

This is exactly the bit of optimism I was questioning. We've already
been sweating blood trying to reduce multiprocessor contention on data
structures in which collisions ought to be avoidable (ie, buffer arrays
where you hope not everyone is hitting the same buffer at once). I
think passing large volumes of data between different processes is going
to incur quite a lot of locking overhead, pipeline stalls for cache line
transfers, etc, etc, because heavy contention for the transfer buffer is
simply not going to be avoidable.

regards, tom lane


From: "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Markus Schiltknecht" <markus(at)bluegap(dot)ch>, josh(at)agliodbs(dot)com, pgsql-hackers(at)postgresql(dot)org, "Mike Rylander" <mrylander(at)gmail(dot)com>, "Qingqing Zhou" <zhouqq(at)cs(dot)toronto(dot)edu>
Subject: Re: Support Parallel Query Execution in Executor
Date: 2006-04-08 22:21:38
Message-ID: 36e682920604081521v213ff096s8b200a840cd43161@mail.gmail.com

On 4/8/06, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> passing large volumes of data between different processes is going
> to incur quite a lot of locking overhead, pipeline stalls for cache line
> transfers, etc, etc, because heavy contention for the transfer buffer is
> simply not going to be avoidable.

I don't think anyone believes there isn't going to be contention
somewhere. Anyone who's developed a parallel system acknowledges
there's a price to pay for parallelism (in complexity), which requires
good algorithms, strategy, and design.

I may certainly be misunderstanding you, but it seems like you want to
avoid parallelism altogether. Aside from PostgreSQL, the only systems
I know of which don't support parallelism in query execution are MySQL
and Firebird, but I don't see competing at the low end as a compelling
argument. Similarly, I don't think every major vendor would have put
money into developing something that isn't useful.

I do agree that there are many times you won't want to have parallel
query. The reasons for employing parallelism are very dependent on the
environment and application it's used in, so it's certainly going to
be up to the user to decide whether they want to use it or not.

We're currently playing with a proof-of-concept to find issues before
proposing a design. This doesn't affect the community in any way.
Rather than negativity, I'd really like to see your suggestions on
avoiding contention as much as possible; your ideas may certainly get
us past several obstacles more quickly.

--
Jonah H. Harris, Database Internals Architect
EnterpriseDB Corporation
732.331.1324


From: "Gregory Maxwell" <gmaxwell(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Support Parallel Query Execution in Executor
Date: 2006-04-09 00:43:07
Message-ID: e692861c0604081743k7b3e8f7bu45241ecc34d26bcf@mail.gmail.com

On 4/8/06, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> This is exactly the bit of optimism I was questioning. We've already
> been sweating blood trying to reduce multiprocessor contention on data
> structures in which collisions ought to be avoidable (ie, buffer arrays
> where you hope not everyone is hitting the same buffer at once). I
> think passing large volumes of data between different processes is going
> to incur quite a lot of locking overhead, pipeline stalls for cache line
> transfers, etc, etc, because heavy contention for the transfer buffer is
> simply not going to be avoidable.

We should consider true parallel execution and overlapping execution
with I/O as distinct cases.

For example, one case made in this thread involved bursty performance
with seqscans presumably because the I/O was stalling while processing
was being performed. In general this can be avoided without parallel
execution through the use of non-blocking I/O and making an effort to
keep the request pipeline full.

There are other cases where it is useful to perform parallel I/O
without parallel processing. For example, a query that will perform
an index lookup per row can benefit from running some number of those
lookups in parallel, in order to hide the lookup latency and give the
OS and disk elevators a chance to make the random accesses a little
more orderly. This can be accomplished without true parallel
processing. (Perhaps PG does this already?)

Parallel execution to get access to more CPU and memory bandwidth is a
fine thing, and worth the costs in many cases... but it shouldn't be
used as an easy way to get parallel IO without careful consideration.


From: "Luke Lonergan" <llonergan(at)greenplum(dot)com>
To: "Gregory Maxwell" <gmaxwell(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Support Parallel Query Execution in Executor
Date: 2006-04-09 01:47:12
Message-ID: C05DB5B0.21263%llonergan@greenplum.com

Greg,

On 4/8/06 5:43 PM, "Gregory Maxwell" <gmaxwell(at)gmail(dot)com> wrote:

> For example, one case made in this thread involved bursty performance
> with seqscans presumably because the I/O was stalling while processing
> was being performed. In general this can be avoided without parallel
> execution through the use of non-blocking I/O and making an effort to
> keep the request pipeline full.

I agree - there is a real continuing need for overlapped scan and execution,
though I'm not sure quite how to get it efficiently using inter-process
communication. The AIO interface is nicely proven at this point, but to
Tom's point, it's non-portable.

What we see now is that the executor is CPU bound at about 300MB/s scan rate
using modern Opteron CPUs. Using some kind of overlap strategy would help
this a lot - using programmable readahead to stage I/O requests
asynchronously from the scan node would be better than OS hints like
fadvise(), but the tuning would be tricky IMO.

- Luke


From: Myron Scott <lister(at)sacadia(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Support Parallel Query Execution in Executor
Date: 2006-04-09 04:54:14
Message-ID: 44389376.5090707@sacadia.com

Gregory Maxwell wrote:
> We should consider true parallel execution and overlapping execution
> with I/O as distinct cases.
>
> For example, one case made in this thread involved bursty performance
> with seqscans presumably because the I/O was stalling while processing
> was being performed. In general this can be avoided without parallel
> execution through the use of non-blocking I/O and making an effort to
> keep the request pipeline full.
>
> There are other cases where it is useful to perform parallel I/O
> without parallel processing.. for example: a query that will perform
> an index lookup per row can benefit from running some number of those
> lookups in parallel in order to hide the lookup latency and give the
> OS and disk elevators a chance to make the random accesses a little
> more orderly. This can be accomplished without true parallel
> processing. (Perhaps PG does this already?)
>
>

I have done some testing along these lines with an old fork of Postgres
code (2001). In my tests, I used a thread to delegate out the actual heap
scan of the SeqScan. The job of the "slave" thread was to fault in buffer
pages and determine the time validity of the tuples. ItemPointers are
passed back to the "master" thread via a common memory area guarded by
mutex locking. The master thread is then responsible for converting the
ItemPointers to HeapTuples and finishing the execution run. I added a
little hack to the buffer code to force pages read into the buffer to stay
at the back of the free buffer list until the master thread has had a
chance to use it. These are the parameters of my test table.

Pages 9459: ; Tup 961187: Live 673029, Dead 288158

Average tuple size is 70 bytes

create table test (rand int, message varchar(256))

So far I've done a couple of runs with a single query on a 2-processor
machine, with the following results via dtrace.

select * from test;

CPU ID FUNCTION:NAME
1 46218 ExecEndSeqScan:return Inline scan time 81729
0 46216 ExecEndDelegatedSeqScan:return Delegated scan time 59903
0 46218 ExecEndSeqScan:return Inline scan time 95708
0 46216 ExecEndDelegatedSeqScan:return Delegated scan time 58255
0 46218 ExecEndSeqScan:return Inline scan time 79028
0 46216 ExecEndDelegatedSeqScan:return Delegated scan time 50500

On average, a 34% decrease in total time using the delegated scan.

A very crude, simple test but I think it shows some promise.

I know I used threads, but you could probably just as easily use a
slave process and pass ItemPointers via pipes or shared memory.
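For anyone who wants to experiment along these lines, here is a
minimal, self-contained sketch of the mutex-guarded transfer area
described above. This is not Myron's actual code: all type and function
names are hypothetical, ItemPtr is a simplified stand-in for
PostgreSQL's real ItemPointerData, and pthread condition variables
stand in for whatever wakeup mechanism a real implementation would use.

```c
#include <pthread.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical stand-in for ItemPointerData: block number plus line
 * offset, enough for the master to re-fetch the tuple later. */
typedef struct { uint32_t block; uint16_t offset; } ItemPtr;

#define XFER_CAP 4096            /* matches the 4096-entry list above */

typedef struct {
    ItemPtr items[XFER_CAP];
    size_t  head, tail, count;
    int     done;                /* slave has finished the scan */
    pthread_mutex_t lock;
    pthread_cond_t  not_empty, not_full;
} XferBuf;

static void xfer_init(XferBuf *b) {
    b->head = b->tail = b->count = 0;
    b->done = 0;
    pthread_mutex_init(&b->lock, NULL);
    pthread_cond_init(&b->not_empty, NULL);
    pthread_cond_init(&b->not_full, NULL);
}

/* Slave side: publish one ItemPtr, blocking if the master lags. */
static void xfer_put(XferBuf *b, ItemPtr ip) {
    pthread_mutex_lock(&b->lock);
    while (b->count == XFER_CAP)
        pthread_cond_wait(&b->not_full, &b->lock);
    b->items[b->tail] = ip;
    b->tail = (b->tail + 1) % XFER_CAP;
    b->count++;
    pthread_cond_signal(&b->not_empty);
    pthread_mutex_unlock(&b->lock);
}

/* Slave side: mark the scan complete and wake the master. */
static void xfer_finish(XferBuf *b) {
    pthread_mutex_lock(&b->lock);
    b->done = 1;
    pthread_cond_broadcast(&b->not_empty);
    pthread_mutex_unlock(&b->lock);
}

/* Master side: returns 1 and fills *ip, or 0 when the scan is done. */
static int xfer_get(XferBuf *b, ItemPtr *ip) {
    pthread_mutex_lock(&b->lock);
    while (b->count == 0 && !b->done)
        pthread_cond_wait(&b->not_empty, &b->lock);
    if (b->count == 0) {         /* done and fully drained */
        pthread_mutex_unlock(&b->lock);
        return 0;
    }
    *ip = b->items[b->head];
    b->head = (b->head + 1) % XFER_CAP;
    b->count--;
    pthread_cond_signal(&b->not_full);
    pthread_mutex_unlock(&b->lock);
    return 1;
}

/* Demo "slave": scan 10000 fake tuples and hand them to the master. */
static void *producer(void *arg) {
    XferBuf *b = arg;
    for (uint32_t i = 0; i < 10000; i++) {
        ItemPtr ip = { i / 100, (uint16_t)(i % 100) };
        xfer_put(b, ip);
    }
    xfer_finish(b);
    return NULL;
}

/* Demo "master": count the ItemPtrs received from the slave thread. */
static long run_delegated_scan(void) {
    XferBuf b;
    pthread_t slave;
    ItemPtr ip;
    long n = 0;
    xfer_init(&b);
    pthread_create(&slave, NULL, producer, &b);
    while (xfer_get(&b, &ip))
        n++;
    pthread_join(slave, NULL);
    return n;
}
```

Moving from threads to a slave process would mean placing XferBuf in
shared memory and using process-shared locks (or PostgreSQL's own
spinlocks), but the handoff logic stays the same.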

Thanks,

Myron Scott


From: "Luke Lonergan" <llonergan(at)greenplum(dot)com>
To: "Myron Scott" <lister(at)sacadia(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Support Parallel Query Execution in Executor
Date: 2006-04-09 05:29:05
Message-ID: C05DE9B1.21296%llonergan@greenplum.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Myron,

First, this sounds really good!

On 4/8/06 9:54 PM, "Myron Scott" <lister(at)sacadia(dot)com> wrote:

> I added a little hack to the buffer
> code to force
> pages read into the buffer to stay at the back of the free buffer list
> until the master
> thread has had a chance to use it.

This is the part I'm curious about - is this using the shared_buffers region
in a circular buffer fashion to store pre-fetched pages?

One thing I've wondered about is: how much memory is required to get
efficient overlap? Did you find that you had to tune the amount of buffer
memory to get the performance to work out?

- Luke


From: Hannu Krosing <hannu(at)skype(dot)net>
To: Luke Lonergan <llonergan(at)greenplum(dot)com>
Cc: Gregory Maxwell <gmaxwell(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Support Parallel Query Execution in Executor
Date: 2006-04-09 08:58:06
Message-ID: 1144573086.3657.3.camel@localhost.localdomain
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

On Sat, 2006-04-08 at 18:47, Luke Lonergan wrote:

> I agree - there is a real continuing need for overlapped scan and execution,
> though I'm not sure quite how to get it efficiently using inter-process
> communication. The AIO interface is nicely proven at this point, but to
> Tom's point, it's non-portable.
>
> What we see now is that the executor is CPU bound at about 300MB/s scan rate
> using modern Opteron CPUs.

Where exactly does the CPU spend its time at 300MB/s? Is it mainly
I/O-related processing, processing pages into in-memory tuples, or
something else?

--------------
Hannu


From: "Qingqing Zhou" <zhouqq(at)cs(dot)toronto(dot)edu>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Support Parallel Query Execution in Executor
Date: 2006-04-09 09:11:18
Message-ID: e1aj3m$dim$1@news.hub.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches


"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote
>
> This is exactly the bit of optimism I was questioning. We've already
> been sweating blood trying to reduce multiprocessor contention on data
> structures in which collisions ought to be avoidable (ie, buffer arrays
> where you hope not everyone is hitting the same buffer at once). I
> think passing large volumes of data between different processes is going
> to incur quite a lot of locking overhead, pipeline stalls for cache line
> transfers, etc, etc, because heavy contention for the transfer buffer is
> simply not going to be avoidable.
>

Yes, there is overhead. Let's see how we can tackle each issue:

* lock contention on the transfer buffer *
A general way is partitioning. We can try to alleviate the lock
contention by sharing a transfer buffer only between each pair of
processes (master and slave).

* large amount of data *
Thinking of the simplest master-slave seqscan scenario, I am copying
the HeapTuple so far (like the code in sort read/write tuple), but
potentially we could pass only the offset and the pinned page, which
would require the slave process to keep each page pinned a little
longer, say every 5 pages. I haven't seen any application that causes
Postgres to pin a lot of pages, so this might be one step toward
reducing the data volume.

On the other hand, why do we need parallel execution anyway? Because if
we have a 4-way SMP machine with a strong I/O subsystem, we may want to
trade extra resources for better performance on some queries.
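As a rough sketch of the partitioning idea (all names hypothetical,
threads standing in for backends): give each master/slave pair its own
buffer and its own lock, so slaves never contend with one another, and
let the master poll the pairs round-robin.

```c
#include <pthread.h>
#include <stdint.h>

#define SLOTS_PER_SLAVE 1024
#define MAX_SLAVES 8

/* One transfer buffer per master/slave pair: only that slave writes,
 * only the master reads, so the lock is never contended across slaves. */
typedef struct {
    uint64_t tuples[SLOTS_PER_SLAVE];   /* stand-in for copied tuples */
    int head, tail, count;
    pthread_mutex_t lock;
} PairBuf;

typedef struct {
    PairBuf pair[MAX_SLAVES];
    int nslaves;
} TransferArea;

static void transfer_init(TransferArea *t, int nslaves) {
    t->nslaves = nslaves;
    for (int i = 0; i < nslaves; i++) {
        t->pair[i].head = t->pair[i].tail = t->pair[i].count = 0;
        pthread_mutex_init(&t->pair[i].lock, NULL);
    }
}

/* Slave i publishes one tuple; returns 0 if its buffer is full. */
static int pair_put(TransferArea *t, int i, uint64_t tup) {
    PairBuf *b = &t->pair[i];
    int ok = 0;
    pthread_mutex_lock(&b->lock);
    if (b->count < SLOTS_PER_SLAVE) {
        b->tuples[b->tail] = tup;
        b->tail = (b->tail + 1) % SLOTS_PER_SLAVE;
        b->count++;
        ok = 1;
    }
    pthread_mutex_unlock(&b->lock);
    return ok;
}

/* Master polls the pairs round-robin; returns 1 if a tuple was found. */
static int master_get(TransferArea *t, uint64_t *tup) {
    for (int i = 0; i < t->nslaves; i++) {
        PairBuf *b = &t->pair[i];
        pthread_mutex_lock(&b->lock);
        if (b->count > 0) {
            *tup = b->tuples[b->head];
            b->head = (b->head + 1) % SLOTS_PER_SLAVE;
            b->count--;
            pthread_mutex_unlock(&b->lock);
            return 1;
        }
        pthread_mutex_unlock(&b->lock);
    }
    return 0;
}
```

The point of the design is that each lock is shared by exactly two
parties, so contention scales with the pair, not with the total number
of slaves.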

Regards,
Qingqing


From: Myron Scott <lister(at)sacadia(dot)com>
To: Luke Lonergan <llonergan(at)greenplum(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Support Parallel Query Execution in Executor
Date: 2006-04-09 15:23:36
Message-ID: 6D5B293D-6139-4498-939E-FF44155D5080@sacadia.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches


On Apr 8, 2006, at 10:29 PM, Luke Lonergan wrote:

> Myron,
>
> First, this sounds really good!
>
> On 4/8/06 9:54 PM, "Myron Scott" <lister(at)sacadia(dot)com> wrote:
>
>> I added a little hack to the buffer
>> code to force
>> pages read into the buffer to stay at the back of the free buffer
>> list
>> until the master
>> thread has had a chance to use it.
>
> This is the part I'm curious about - is this using the
> shared_buffers region
> in a circular buffer fashion to store pre-fetched pages?

Yes. That is basically what the slave thread is trying to do, as well
as weed out any tuples/pages that don't need to be looked at due to
dead tuples. I did several things to try to ensure that a buffer needed
by the master thread would not be pulled out of the buffer pool before
the master had seen it. I wanted to do this without holding the buffer
pinned, so I made this change to the buffer free list:

static void
AddBufferToFreelist(BufferDesc *bf)
{
    int movebehind;

    S_LOCK(&SLockArray[FreeBufMgrLock]);
    movebehind = SharedFreeList->freePrev;

    /* find the right spot with bias */
    while (BufferDescriptors[movebehind].bias > bf->bias)
        movebehind = BufferDescriptors[movebehind].freePrev;
    ...

The bias number is removed the next time the buffer is pulled out of
the free list.

Also, I force an ItemPointer transfer when the ItemPointer transfer
list is full (currently 4096 entries) or when 10% of the buffer pool
has been affected by the slave thread. Lastly, if the slave thread gets
too far ahead of the master thread, it waits for the master to catch
up. To my knowledge, this hasn't happened yet.

>
> One thing I've wondered about is: how much memory is required to get
> efficient overlap? Did you find that you had to tune the amount of
> buffer
> memory to get the performance to work out?

I haven't done much tuning yet. I think there is an optimal balance
that I most likely haven't found yet.

Myron Scott


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Myron Scott <lister(at)sacadia(dot)com>
Cc: Luke Lonergan <llonergan(at)greenplum(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Support Parallel Query Execution in Executor
Date: 2006-04-09 16:26:30
Message-ID: 20060409162630.GA5938@svana.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

On Sun, Apr 09, 2006 at 08:23:36AM -0700, Myron Scott wrote:
> >This is the part I'm curious about - is this using the
> >shared_buffers region
> >in a circular buffer fashion to store pre-fetched pages?
>
> Yes. That is basically what the slave thread is trying to do. As
> well as weed out
> any tuples/pages that don't need to be looked at due to dead tuples.
> I did several things to try and insure that a buffer needed by the
> master thread
> would not be pulled out of the buffer pool before it was seen by the
> master.
> I wanted to do this without holding the buffer pinned, so I did the
> change to
> the buffer free list to do this.

Is this necessary? I mean, what's the chance that a page might get
thrown out early? And if so, what's the chance that page will still be
in the OS cache?

The cost of fetching a page from the OS is not really much of an
overhead, so I'd like to know how much benefit these buffer cache hacks
actually produce.

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Myron Scott <lister(at)sacadia(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Support Parallel Query Execution in Executor
Date: 2006-04-09 16:27:30
Message-ID: 9234.1144600050@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Myron Scott <lister(at)sacadia(dot)com> writes:
> Gregory Maxwell wrote:
>> There are other cases where it is useful to perform parallel I/O
>> without parallel processing..

> I have done some testing more along these lines with an old fork of
> postgres code (2001). In my tests, I used a thread to delegate out
> the actual heap scan of the SeqScan. The job of the "slave" thread
> was to fault in buffer pages and determine the time validity of
> the tuples. ItemPointers are passed back to the "master" thread via a
> common memory area guarded by mutex locking.

I was considering a variant idea in the shower this morning: suppose
that we invent one or more "background reader" processes that have
basically the same infrastructure as the background writer, but have
the responsibility of causing buffer reads to happen at useful times
(whereas the writer causes writes to happen at useful times). The
idea would be for backends to signal the readers when they know they
will need a given block soon, and then hopefully when they need it
it'll already be in shared buffers. For instance, in a seqscan it'd be
pretty trivial to request block N+1 just after reading block N, and then
do our actual processing on block N while (we hope) some reader
process is faulting in N+1. Bitmap indexscans could use this structure
too; I'm less sure about whether plain indexscans could do much with it
though.
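A minimal sketch of what the request side of such a bgreader might
look like (all names here are hypothetical, and this deliberately
ignores the locking issues discussed below): backends drop non-binding
prefetch hints into a shared queue; if the queue is full, the hint is
simply discarded and the backend reads the block synchronously, as it
does today.

```c
#include <pthread.h>
#include <stdint.h>

#define READQ_CAP 256

typedef struct {
    uint32_t relid;              /* which relation (hypothetical key) */
    uint32_t blocknum;           /* block to fault into shared buffers */
} ReadRequest;

typedef struct {
    ReadRequest reqs[READQ_CAP];
    int head, tail, count;
    pthread_mutex_t lock;
} ReadQueue;

static void readq_init(ReadQueue *q) {
    q->head = q->tail = q->count = 0;
    pthread_mutex_init(&q->lock, NULL);
}

/* Backend side: hint that (relid, blocknum) will be needed soon.
 * Non-blocking: if the queue is full we drop the hint, and the backend
 * just reads the block itself when it gets there. */
static int readq_hint(ReadQueue *q, uint32_t relid, uint32_t blocknum) {
    int ok = 0;
    pthread_mutex_lock(&q->lock);
    if (q->count < READQ_CAP) {
        q->reqs[q->tail].relid = relid;
        q->reqs[q->tail].blocknum = blocknum;
        q->tail = (q->tail + 1) % READQ_CAP;
        q->count++;
        ok = 1;
    }
    pthread_mutex_unlock(&q->lock);
    return ok;
}

/* Reader side: pull the next request; returns 0 if the queue is empty. */
static int readq_next(ReadQueue *q, ReadRequest *out) {
    int ok = 0;
    pthread_mutex_lock(&q->lock);
    if (q->count > 0) {
        *out = q->reqs[q->head];
        q->head = (q->head + 1) % READQ_CAP;
        q->count--;
        ok = 1;
    }
    pthread_mutex_unlock(&q->lock);
    return ok;
}
```

Because hints are droppable, the hot path never blocks on the queue
lock for long, which is one way to soften the contention concern in
issue 1.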

The major issues I can see are:

1. We'd need a shared-memory queue of read requests, probably much like
the queue of fsync requests. We've already seen problems with
contention for the fsync queue, IIRC, and that's used much less heavily
than the read request queue would be. So there might be some
performance issues with getting the block requests sent over to the
readers.

2. There are some low-level assumptions that no one reads in pages of
a relation without having some kind of lock on the relation (consider
eg the case where the relation is being dropped). A bgwriter-like
process wouldn't be able to hold lmgr locks, and we wouldn't really want
it to be thrashing the lmgr shared data structures for each read anyway.
So you'd have to design some interlock to guarantee that no backend
abandons a query (and releases its own lmgr locks) while an async read
request it made is still pending. Ugh.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Gregory Maxwell" <gmaxwell(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Support Parallel Query Execution in Executor
Date: 2006-04-09 16:34:03
Message-ID: 9279.1144600443@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

"Gregory Maxwell" <gmaxwell(at)gmail(dot)com> writes:
> For example, one case made in this thread involved bursty performance
> with seqscans presumably because the I/O was stalling while processing
> was being performed.

Actually, the question that that raised in my mind is "why isn't the
kernel doing read-ahead properly?" When we're doing nonsequential
access like an indexscan, it's unsurprising that the kernel can't guess
which block we need next, but in a plain seqscan you'd certainly expect
the read-ahead algorithm to kick in and ensure that the next block is
fetched before we need it.

So before we go inventing complicated bits of code with lots of added
overhead, we should first find out exactly why the system doesn't
already work the way it's supposed to.

regards, tom lane


From: "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Myron Scott" <lister(at)sacadia(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Support Parallel Query Execution in Executor
Date: 2006-04-09 16:57:56
Message-ID: 36e682920604090957x34eb7a0r57ad576033668c49@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

On 4/9/06, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> I was considering a variant idea in the shower this morning: suppose
> that we invent one or more "background reader" processes that have
> basically the same infrastructure as the background writer, but have
> the responsibility of causing buffer reads to happen at useful times
> (whereas the writer causes writes to happen at useful times). The
> idea would be for backends to signal the readers when they know they
> will need a given block soon, and then hopefully when they need it
> it'll already be in shared buffers.

This is sort of what I'm playing with. There are N backends, configured
at startup, which are available solely for parallel processing. If you
have parallelism enabled and you have a query which the planner says
can take advantage of it, your backend becomes a sort of "coordinator"
and signals the number of backends defined by (right now) a global
parallelism-degree parameter to participate in query execution.

> The major issues I can see are:
>
> 1. We'd need a shared-memory queue of read requests, probably much like
> the queue of fsync requests. We've already seen problems with
> contention for the fsync queue, IIRC, and that's used much less heavily
> than the read request queue would be. So there might be some
> performance issues with getting the block requests sent over to the
> readers.

Yes, I was looking at this contention as well. However, I haven't
looked at the fsync request queue in detail, but I'm going to play
around with a simple initial semaphore test such that if the queue is
full, the query will just be processed serially. I'll also take a
look at the fsync queue and see how it's working.

While it's down the road a ways, I'm really looking forward to seeing
parallel index builds, sorts, and joins. In my opinion, all of our
combined ideas and system knowledge are really going to pay off big
time on this one.

--
Jonah H. Harris, Database Internals Architect
EnterpriseDB Corporation
732.331.1324


From: Myron Scott <lister(at)sacadia(dot)com>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: Luke Lonergan <llonergan(at)greenplum(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Support Parallel Query Execution in Executor
Date: 2006-04-09 17:10:46
Message-ID: 1E8A8988-53CB-49F3-B4DE-75B567BF0529@sacadia.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches


On Apr 9, 2006, at 9:26 AM, Martijn van Oosterhout wrote:

> On Sun, Apr 09, 2006 at 08:23:36AM -0700, Myron Scott wrote:
>>> This is the part I'm curious about - is this using the
>>> shared_buffers region
>>> in a circular buffer fashion to store pre-fetched pages?
>>
>> Yes. That is basically what the slave thread is trying to do. As
>> well as weed out
>> any tuples/pages that don't need to be looked at due to dead tuples.
>> I did several things to try and insure that a buffer needed by the
>> master thread
>> would not be pulled out of the buffer pool before it was seen by the
>> master.
>> I wanted to do this without holding the buffer pinned, so I did the
>> change to
>> the buffer free list to do this.
>
> Is this necessary? I mean, what's the chance that a page might get
> thrown out early? And if so, what's the chance that page will still be
> in the OS cache?
>
> The cost of fetching a page from the OS is not really much of an
> overhead, so I'd like to know how much benefit these buffer cache
> hacks
> actually produce.
>

You may be right on this one. I wanted to ensure that I didn't lose
pages I needed; I may have just added a belt to my suspenders. I'll
add a switch to turn it off and on, and try to devise a test to see
what the costs are either way.

Myron Scott


From: "Gregory Maxwell" <gmaxwell(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Support Parallel Query Execution in Executor
Date: 2006-04-09 18:18:04
Message-ID: e692861c0604091118n36f1896ei6e53f6274a2bea7e@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

On 4/9/06, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> "Gregory Maxwell" <gmaxwell(at)gmail(dot)com> writes:
> > For example, one case made in this thread involved bursty performance
> > with seqscans presumably because the I/O was stalling while processing
> > was being performed.
>
> Actually, the question that that raised in my mind is "why isn't the
> kernel doing read-ahead properly?" When we're doing nonsequential
> access like an indexscan, it's unsurprising that the kernel can't guess
> which block we need next, but in a plain seqscan you'd certainly expect
> the read-ahead algorithm to kick in and ensure that the next block is
> fetched before we need it.
>
> So before we go inventing complicated bits of code with lots of added
> overhead, we should first find out exactly why the system doesn't
> already work the way it's supposed to.

But is that really the behavior we should expect?

How much memory can we expect the OS to spend on opportunistic read-in?
How much disk access should be spent on a guess? There is an intrinsic
tradeoff here: applications tend to be bursty, so just because you're
reading a lot now doesn't mean you'll continue... and the filesystem
will have fragmentation, so a failed guess can translate into a lot of
pointless seeking.

As I recall, Linux 2.6 has a maximum of something like 128KB of
readahead. Given that, and a disk subsystem that reads at 200MB/sec,
you can't spend more than about 600us processing before requesting
enough additional blocks to put the disk back into readahead, or you
will stall the disk. Stalling the disk costs more than you'd expect:
due to FS fragmentation, there can be terrific gains from allowing the
OS and disk to issue reads out of order from a large request queue.
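For reference, the arithmetic behind that ~600us figure: a 128KB
readahead window at 200MB/sec is consumed in roughly 650 microseconds.
A throwaway helper (hypothetical, just to show the calculation):

```c
/* Microseconds of processing budget per readahead window: how long the
 * window lasts before the disk runs dry, given the window size in
 * bytes and the disk's sequential read rate in bytes per second. */
static double stall_budget_us(double window_bytes, double disk_bytes_per_s)
{
    return window_bytes / disk_bytes_per_s * 1e6;
}
```

With window_bytes = 128 * 1024 and disk_bytes_per_s = 200e6 this gives
about 655us, consistent with the rough 600us quoted above.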

It would probably be reasonable to say that the OS should be using much
larger readahead buffers, especially on systems with fast disk
subsystems... But that doesn't come for free, and it can slaughter
performance for many workloads (consider: what if it triggered 5MB of
file-oriented read-ahead for every index scan seek we did?). There is
an adaptive readahead patch for Linux which should improve things
(http://lwn.net/Articles/176279/; if you google around there are some
benchmarks), but I doubt that even that would be able to keep a
200MB/sec+ disk subsystem saturated with the sort of access patterns
PG has...

To address this in a cross-platform way will be a challenge. I doubt
Linux is alone in having skimpy readahead (because big readahead
translates into huge losses if you get it wrong).

Given this information, a stupid 'fault-in' process should probably
give huge gains for seqscans... but I think the extra work required to
find a solution which is also useful for index operations is probably
worth it as well.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
Cc: "Myron Scott" <lister(at)sacadia(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Support Parallel Query Execution in Executor
Date: 2006-04-09 18:39:41
Message-ID: 12625.1144607981@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

"Jonah H. Harris" <jonah(dot)harris(at)gmail(dot)com> writes:
> On 4/9/06, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> I was considering a variant idea in the shower this morning: suppose
>> that we invent one or more "background reader" processes that have
>> basically the same infrastructure as the background writer, but have
>> the responsibility of causing buffer reads to happen at useful times

> This is sort of what I'm playing with. There are N-number of backends
> which are configured at startup and are available solely for parallel
> processing.

That's not remotely the same thing: a backend is a very different animal
from a bgwriter. In particular, bgwriter (and bgreaders if we had 'em)
aren't database-specific, don't need to think about permission checking
as they don't execute on behalf of particular users, don't have syscaches
to keep in-sync with everything else, etc etc.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Gregory Maxwell" <gmaxwell(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Support Parallel Query Execution in Executor
Date: 2006-04-09 18:48:54
Message-ID: 12686.1144608534@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

"Gregory Maxwell" <gmaxwell(at)gmail(dot)com> writes:
> On 4/9/06, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> So before we go inventing complicated bits of code with lots of added
>> overhead, we should first find out exactly why the system doesn't
>> already work the way it's supposed to.

> But is that really the behavior we should expect?

Certainly. If the OS has readahead logic at all, it ought to think that
a seqscan of a large table qualifies. Your arguments seem to question
whether readahead is useful at all --- but they would apply *just as
well* to an app doing its own readahead, which is what is really
getting proposed in this thread.

Before we go replacing a standard OS-level facility with our own
version, we need to have a much clearer idea of why the OS isn't getting
the job done for us. Otherwise we're likely to write a large amount of
code and find out that it doesn't work very well either.

regards, tom lane


From: "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Myron Scott" <lister(at)sacadia(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Support Parallel Query Execution in Executor
Date: 2006-04-09 18:57:16
Message-ID: 36e682920604091157x29b43b71rb525af4f568b7e8f@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

On 4/9/06, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> That's not remotely the same thing: a backend is a very different animal
> from a bgwriter.

Yes, I know. Had I known it would've been picked apart, I would've
described it better.

> In particular, bgwriter (and bgreaders if we had 'em)
> aren't database-specific, don't need to think about permission checking
> as they don't execute on behalf of particular users, don't have syscaches
> to keep in-sync with everything else, etc etc.

In the current proof of concept, there is a backend (manager) and
several reader-like processes; however, in this case, the readers
perform actual execution on behalf of the user.

The reader-like processes or "execution agents", whatever you want to
call them, are limited to the postgres database only because they've
been coded that way... but there's nothing that actually restricts
them from being used for any database.

--
Jonah H. Harris, Database Internals Architect
EnterpriseDB Corporation
732.331.1324


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
Cc: "Myron Scott" <lister(at)sacadia(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Support Parallel Query Execution in Executor
Date: 2006-04-09 19:11:03
Message-ID: 12866.1144609863@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

"Jonah H. Harris" <jonah(dot)harris(at)gmail(dot)com> writes:
> The reader-like processes or "execution agents", whatever you want to
> call them, are limited to the postgres database only because they've
> been coded that way... but there's nothing that actually restricts
> them from being used for any database.

Oh? Anything much above the level of bufmgr is going to be dependent on
relcache, syscache, etc, and those are all definitely database-specific.
You can't just retarget a backend to operate in another database, at
least not without major changes in that infrastructure.

regards, tom lane


From: "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Myron Scott" <lister(at)sacadia(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Support Parallel Query Execution in Executor
Date: 2006-04-09 19:15:31
Message-ID: 36e682920604091215g658235e7u14d16bc9b104e797@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

On 4/9/06, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Oh? Anything much above the level of bufmgr is going to be dependent on
> relcache, syscache, etc, and those are all definitely database-specific.
> You can't just retarget a backend to operate in another database, at
> least not without major changes in that infrastructure.

It's not that complicated and it may not prove to be the best method,
but it's worth a try. When I get done playing with it, I'll send it
to ya to check out.

--
Jonah H. Harris, Database Internals Architect
EnterpriseDB Corporation
732.331.1324


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Myron Scott <lister(at)sacadia(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Support Parallel Query Execution in Executor
Date: 2006-04-09 19:18:04
Message-ID: 20060409191804.GC15094@surnet.cl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Jonah H. Harris wrote:
> On 4/9/06, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> > Oh? Anything much above the level of bufmgr is going to be dependent on
> > relcache, syscache, etc, and those are all definitely database-specific.
> > You can't just retarget a backend to operate in another database, at
> > least not without major changes in that infrastructure.
>
> It's not that complicated and it may not prove to be the best method,
> but it's worth a try. When I get done playing with it, I'll send it
> to ya to check out.

Hmm, if it turns out to work, I'm interested. This is something that
definitely could be of some help to autovacuum. Please post to
pgsql-patches.

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: "Gregory Maxwell" <gmaxwell(at)gmail(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Support Parallel Query Execution in Executor
Date: 2006-04-09 20:36:54
Message-ID: e692861c0604091336r4b5d0a5bjec2f91bd5386043f@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

On 4/9/06, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Certainly. If the OS has readahead logic at all, it ought to think that
> a seqscan of a large table qualifies. Your arguments seem to question
> whether readahead is useful at all --- but they would apply *just as
> well* to an app doing its own readahead, which is what is really
> getting proposed in this thread.

We know we're going to read the whole table; the OS doesn't. We can be
confident that we will not use read-ahead when we're really doing
random accesses.

The OS has to deal with many applications with many workloads running
on a wider spectrum of hardware. It's important that it does the right
thing, but probably more important that it doesn't do the wrong thing.
This encourages erring on the side of small readahead.

> Before we go replacing a standard OS-level facility with our own
> version, we need to have a much clearer idea of why the OS isn't getting
> the job done for us. Otherwise we're likely to write a large amount of
> code and find out that it doesn't work very well either.

That's a fair position... It would be useful to know how much readahead
PG needs in order to keep a high-speed disk subsystem saturated. This
would involve profiling how frequently PG requests data, and how much
it requests, when running out of a hot cache. We could then say that
the OS would need to read ahead x MB to keep a y MB/s disk subsystem
saturated.

It would be good to know how much FBSD will readahead...

It might also be interesting for someone with the right testing rig on
Linux to try the adaptive readahead patch to see if that improves PG's
ability to keep the disk busy.


From: "Luke Lonergan" <llonergan(at)greenplum(dot)com>
To: "Gregory Maxwell" <gmaxwell(at)gmail(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Support Parallel Query Execution in Executor
Date: 2006-04-09 20:51:50
Message-ID: C05EC1F6.21323%llonergan@greenplum.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Gregory,

On 4/9/06 1:36 PM, "Gregory Maxwell" <gmaxwell(at)gmail(dot)com> wrote:

> It might also be interesting for someone with the right testing rig on
> linux to try the adaptive
> readahead patch to see if that improves PG's ability to keep the disk busy.

"the" adaptive readahead patch? Did I miss one?

We will happily test experimental patches that improve I/O utilization.
We have an assortment of gear with high speed I/O, mostly Linux now.

- Luke


From: "Luke Lonergan" <llonergan(at)greenplum(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Myron Scott" <lister(at)sacadia(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Support Parallel Query Execution in Executor
Date: 2006-04-09 20:59:40
Message-ID: C05EC3CC.2132C%llonergan@greenplum.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Tom,

On 4/9/06 9:27 AM, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> 2. There are some low-level assumptions that no one reads in pages of
> a relation without having some kind of lock on the relation (consider
> eg the case where the relation is being dropped). A bgwriter-like
> process wouldn't be able to hold lmgr locks, and we wouldn't really want
> it to be thrashing the lmgr shared data structures for each read anyway.
> So you'd have to design some interlock to guarantee that no backend
> abandons a query (and releases its own lmgr locks) while an async read
> request it made is still pending. Ugh.

Does this lead us right back to planning for the appropriate amount of
readahead at plan time? We could consider a "page range" lock at that point
instead of locking each individual page.

It seems that the planner might have enough information to identify the
times when this approach would be beneficial (low selectivity aggregate,
etc), and to allocate sufficient resources to accommodate the overlap. The
planner specific knowledge would be the main reason this would work better
than the OS built-in dynamic readahead algorithms.

- Luke


From: "Gregory Maxwell" <gmaxwell(at)gmail(dot)com>
To: "Luke Lonergan" <llonergan(at)greenplum(dot)com>
Cc: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Support Parallel Query Execution in Executor
Date: 2006-04-09 21:04:06
Message-ID: e692861c0604091404g7f8fbf0ayec135086da03e128@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

On 4/9/06, Luke Lonergan <llonergan(at)greenplum(dot)com> wrote:
> Gregory,
>
> On 4/9/06 1:36 PM, "Gregory Maxwell" <gmaxwell(at)gmail(dot)com> wrote:
>
> > It might also be interesting for someone with the right testing rig on
> > linux to try the adaptive
> > readahead patch to see if that improves PG's ability to keep the disk busy.
>
> "the" adaptive readahead patch? Did I miss one?
>
> We will happily test experimental patches that improve I/O utilization.
> We have an assortment of gear with high speed I/O, mostly Linux now.

Linux kernel patch, I'd mentioned it in a prior post.

http://www.vanheusden.com/ara/

It increases Linux's maximum readahead from 128K to 1MB, and it
should be smart enough that you could crank it up further without too
much risk of hurting performance elsewhere.

If PG is bottlenecked on seqscans due to insufficient readahead, I'd
expect this to show an improvement... although I am still somewhat
doubtful that it'll be enough to keep the disk saturated if PG's
behavior is highly suboptimal.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Luke Lonergan" <llonergan(at)greenplum(dot)com>
Cc: "Myron Scott" <lister(at)sacadia(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Support Parallel Query Execution in Executor
Date: 2006-04-09 21:54:56
Message-ID: 14259.1144619696@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

"Luke Lonergan" <llonergan(at)greenplum(dot)com> writes:
> On 4/9/06 9:27 AM, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> 2. There are some low-level assumptions that no one reads in pages of
>> a relation without having some kind of lock on the relation (consider
>> eg the case where the relation is being dropped). A bgwriter-like
>> process wouldn't be able to hold lmgr locks, and we wouldn't really want
>> it to be thrashing the lmgr shared data structures for each read anyway.
>> So you'd have to design some interlock to guarantee that no backend
>> abandons a query (and releases its own lmgr locks) while an async read
>> request it made is still pending. Ugh.

> Does this lead us right back to planning for the appropriate amount of
> readahead at plan time? We could consider a "page range" lock at that point
> instead of locking each individual page.

No, you're missing my point entirely. What's bothering me is the
prospect of a "bgreader" process taking actions that are only safe
because of a lock that is held by a different process --- changing the
granularity of that lock doesn't get you out of trouble.

Here's a detailed scenario:

1. Backend X reads page N of a table T, queues a request for N+1.
2. While processing page N, backend X gets an error and aborts
its transaction, thereby dropping all its lmgr locks.
3. Backend Y executes a DROP or TRUNCATE on T, which it can
now do because there's no lock held anywhere on T. There
are actually two interesting sub-phases of this:
3a. Kill any shared buffers holding pages to be deleted.
3b. Physically drop or truncate the OS file.
4. Bgreader tries to execute the pending read request. Ooops.

If step 4 happens after 3b, the bgreader gets an error. Maybe we could
kluge that to not cause any serious problems, but the nastier case is
where 4 happens between 3a and 3b --- the bgreader sees nothing wrong,
but it's now loaded a shared buffer that must *not* be there.

Having thought more about this, there may be a solution possible using
tighter integration with the bufmgr. The bufmgr already has a notion of
a buffer page being "read busy". Maybe, rather than just pushing a read
request into a separate queue somewhere, the requestor has to assign a
shared buffer for the page it wants and put the buffer into read-busy
state, but then pass the request to perform the physical read off to
someone else. The advantage of this is that there's state that step 3a
can see telling it that a conflicting read is pending, and it just needs
to wait for the read to finish before killing the buffer.

Bottom line seems to be: just as the bgwriter is pretty intimately tied
to bufmgr, bgreaders would have to be as well.

regards, tom lane


From: "Qingqing Zhou" <zhouqq(at)cs(dot)toronto(dot)edu>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Support Parallel Query Execution in Executor
Date: 2006-04-10 08:38:06
Message-ID: e1d5nq$1a3m$1@news.hub.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches


"Markus Schiltknecht" <markus(at)bluegap(dot)ch> wrote
> Hi,
>
> On Sun, 2006-04-09 at 15:11 -0400, Tom Lane wrote:
> > You can't just retarget a backend to operate in another database, at
> > least not without major changes in that infrastructure.
>
> Why not? What would be needed to retarget a backend to operate in
> another database?
>

As Tom pointed out, without big changes, a backend on database "D1" can't
connect to "D2". This is because to connect to a database, we need to
initialize a lot of variables. So when you reconnect to another one on the
fly, you have to change these variables one by one.

Regards,
Qingqing


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Qingqing Zhou <zhouqq(at)cs(dot)toronto(dot)edu>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Support Parallel Query Execution in Executor
Date: 2006-04-10 08:47:25
Message-ID: 1144658845.21409.123.camel@localhost.localdomain
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

On Sun, 2006-04-09 at 17:11 +0800, Qingqing Zhou wrote:
> "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote
> >
> > This is exactly the bit of optimism I was questioning. We've already
> > been sweating blood trying to reduce multiprocessor contention on data
> > structures in which collisions ought to be avoidable (ie, buffer arrays
> > where you hope not everyone is hitting the same buffer at once). I
> > think passing large volumes of data between different processes is going
> > to incur quite a lot of locking overhead, pipeline stalls for cache line
> > transfers, etc, etc, because heavy contention for the transfer buffer is
> > simply not going to be avoidable.
> >
>
> Yes, there is overhead. Let's say how can we tackle them one by one:
>
> * lock contention of transfer buffer *

I think it would be useful to think about exactly what type of
query/activity we are looking to improve the performance on. That way we
can understand the benefit of this proposal and take some baseline
measurements to analyse what is happening for those cases.

For example, a CREATE INDEX will require most if not all rows, whereas a
SeqScan may have a selectivity of 0.001 or below in some cases. The
SeqScan case might also vary considerably in the amount of CPU required
to identify qualifying rows, and that CPU work might benefit from
parallelisation also.

e.g. A low-selectivity SeqScan is a good case for fairly simple
performance improvement, since the output from that executor node is
going to be much less than its input and there is less need to
parallelize the complete tree. A parallel slave node might then be
placed in charge of running a complete SeqScan node on an agreed section
of a table, rather than simply performing I/O for another query.

Best Regards, Simon Riggs


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Luke Lonergan <llonergan(at)greenplum(dot)com>, Myron Scott <lister(at)sacadia(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Support Parallel Query Execution in Executor
Date: 2006-04-10 09:16:56
Message-ID: 20060410091656.GB22409@svana.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

On Sun, Apr 09, 2006 at 05:54:56PM -0400, Tom Lane wrote:
> Here's a detailed scenario:
>
> 1. Backend X reads page N of a table T, queues a request for N+1.
> 2. While processing page N, backend X gets an error and aborts
> its transaction, thereby dropping all its lmgr locks.
> 3. Backend Y executes a DROP or TRUNCATE on T, which it can
> now do because there's no lock held anywhere on T. There
> are actually two interesting sub-phases of this:
> 3a. Kill any shared buffers holding pages to be deleted.
> 3b. Physically drop or truncate the OS file.
> 4. Bgreader tries to execute the pending read request. Ooops.
>
> If step 4 happens after 3b, the bgreader gets an error. Maybe we could
> kluge that to not cause any serious problems, but the nastier case is
> where 4 happens between 3a and 3b --- the bgreader sees nothing wrong,
> but it's now loaded a shared buffer that must *not* be there.

The way I dealt with this when playing with async i/o for postgres was
to have the initiator mark the buffer I/O-in-progress then send the
request, just like you suggest. This would cause other people to wait
on it. When the request later finished you processed it as usual and
backends waiting would continue.

There appear to be two separate cases here, though: one is to just farm
out the read request to another process (basically aio); the other is
to do actual processing there. The latter is obviously far more useful
but requires a fair bit more infrastructure.

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.


From: "Qingqing Zhou" <zhouqq(at)cs(dot)toronto(dot)edu>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Support Parallel Query Execution in Executor
Date: 2006-04-10 09:22:13
Message-ID: e1d8ah$1f8l$1@news.hub.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches


"Markus Schiltknecht" <markus(at)bluegap(dot)ch> wrote
> Hi Qingqing,
>
> >
> > As Tom pointed out, without big changes, a backend on database "D1" can't
> > connect to "D2". This is because to connect to a database, we need to
> > initialize a lot of variables. So when you reconnect to another one on the
> > fly, you have to change these variables one by one.
>
> Sure, the question is: what is needed to retarget a backend?
>
Check the code in InitPostgres(). These global variables are scattered in
many places, so I am not sure it is easy to write clean code to clear them
up. But if you can come up with a patch to reconnect without disconnecting,
that would be cool.

> IMHO, in all of the three examples mentioned up-thread, it would be
> better to retarget an existing backend, instead of forking a new one,
> because that would perform better and/or save some resources.
>
That's for sure.

> Please also consider that for parallel query execution, as well as for
> replaying remote transactions, access rights checking and any client
> connection setup could be omitted. Such special processing backends could
> easily be run with superuser privileges and without a client connection.
>
This is a problem for parallel query execution. I suspect we can't run with
superuser privileges, either now or in the future. For now, I am not clear
whether there is any ACL check at the query execution stage -- it seems
there should not be, at least for DML. For the future, I am fairly sure
this is not the way to go -- the mere possibility of future support for
audit or MAC is reason enough.

Regards,
Qingqing


From: Hannu Krosing <hannu(at)skype(dot)net>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: Myron Scott <lister(at)sacadia(dot)com>, Luke Lonergan <llonergan(at)greenplum(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Support Parallel Query Execution in Executor
Date: 2006-04-10 09:23:41
Message-ID: 1144661022.3657.15.camel@localhost.localdomain
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

On Sun, 2006-04-09 at 18:26, Martijn van Oosterhout wrote:

> The cost of fetching a page from the OS is not really much of an
> overhead,

Have you tested this ?

I remember having a case where a data load using COPY into a table with
several indexes ran an order of magnitude faster with bigger work_mem.
It was on a computer with enough memory to fit the whole working set
(table + indexes) into the OS cache, so I can't think of any other
overhead except reading the index pages repeatedly from the OS cache into
shared memory buffers.

I agree that a real disk read is much slower than OS cache read, but
even an OS cache read is still much slower than no read at all. Sure
it's not nearly as bad on seqscans, but it still takes time.

> so I'd like to know how much benefit these buffer cache hacks
> actually produce.

-------------------
Hannu


From: Markus Schiltknecht <markus(at)bluegap(dot)ch>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, Myron Scott <lister(at)sacadia(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Support Parallel Query Execution in Executor
Date: 2006-04-10 10:02:31
Message-ID: 1144663351.32238.5.camel@fotomarburg
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Hi,

On Sun, 2006-04-09 at 15:11 -0400, Tom Lane wrote:
> You can't just retarget a backend to operate in another database, at
> least not without major changes in that infrastructure.

Why not? What would be needed to retarget a backend to operate in
another database?

Such a retargetting would be helpful in several cases:
- pre-forking of backends (well, depends on efficiency of
retargetting..)
- slave backends for parallel query execution
- slave backends for replaying replicated transactions

I'm using backends operating without a client connection, with superuser
privileges only. But those are still connected to one specific database.
It would really be helpful if I could retarget them as needed.

Regards

Markus


From: Markus Schiltknecht <markus(at)bluegap(dot)ch>
To: Qingqing Zhou <zhouqq(at)cs(dot)toronto(dot)edu>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Support Parallel Query Execution in Executor
Date: 2006-04-10 10:39:41
Message-ID: 1144665581.32238.21.camel@fotomarburg
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Hi Qingqing,

On Mon, 2006-04-10 at 16:38 +0800, Qingqing Zhou wrote:
> > Why not? What would be needed to retarget a backend to operate in
> > another database?
>
> As Tom pointed out, without big changes, a backend on database "D1" can't
> connect to "D2". This is because to connect to a database, we need to
> initialize a lot of variables. So when you reconnect to another one on the
> fly, you have to change these variables one by one.

Sure, the question is: what is needed to retarget a backend?

IMHO, in all of the three examples mentioned up-thread, it would be
better to retarget an existing backend, instead of forking a new one,
because that would perform better and/or save some resources.

Please also consider that for parallel query execution, as well as for
replaying remote transactions, access rights checking and any client
connection setup could be omitted. Such special processing backends could
easily be run with superuser privileges and without a client connection.

Regards

Markus


From: Markus Schiltknecht <markus(at)bluegap(dot)ch>
To: Qingqing Zhou <zhouqq(at)cs(dot)toronto(dot)edu>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Support Parallel Query Execution in Executor
Date: 2006-04-10 11:11:08
Message-ID: 1144667468.32238.31.camel@fotomarburg
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

On Mon, 2006-04-10 at 17:22 +0800, Qingqing Zhou wrote:
> Check the code in InitPostgres(). These global variables are scattered in
> many places, so I am not sure it is easy to write clean code to clear them
> up. But if you can come up with a patch to reconnect without disconnecting,
> that would be cool.

Yes, that's where I've been playing around already. Not with much
success until now, though. :-(

> This is a problem for parallel query execution. I suspect we can't run with
> superuser privileges, either now or in the future. For now, I am not clear
> whether there is any ACL check at the query execution stage -- it seems
> there should not be, at least for DML. For the future, I am fairly sure
> this is not the way to go -- the mere possibility of future support for
> audit or MAC is reason enough.

(What's audit or MAC?)

Well, then let such a slave backend inherit access rights from its
caller. You'd still save rechecking of passwords and such, and you save
the client connection.

I'll give the retargetting code another try. If anybody has some hints
on what stumbling blocks and potholes to watch out for on the way
there...

Regards

Markus


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Markus Schiltknecht <markus(at)bluegap(dot)ch>
Cc: Qingqing Zhou <zhouqq(at)cs(dot)toronto(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Support Parallel Query Execution in Executor
Date: 2006-04-10 17:36:45
Message-ID: 20060410173645.GB2461@surnet.cl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Markus Schiltknecht wrote:
> On Mon, 2006-04-10 at 17:22 +0800, Qingqing Zhou wrote:
> > Check the code in InitPostgres(). These global variables are scattered in
> > many places, so I am not sure it is easy to write clean code to clear them
> > up. But if you can come up with a patch to reconnect without disconnecting,
> > that would be cool.
>
> Yes, that's where I've been playing around already. Not with much
> success until now, though. :-(

An idea arising in chat with Joshua Drake: the retargetting code, if it
turns out to work and not be excessively expensive, could also be useful
to implement a server-side "connection pooling" of sorts: the postmaster
could keep idle backends and retarget them to a database that receives
an incoming connection. However, we'd also need a mechanism to clean
all backend state previous to reusing a connection, to leave it "as
new" (no prepared statements, WITH HOLD cursors, etc.)

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Markus Schiltknecht <markus(at)bluegap(dot)ch>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Support Parallel Query Execution in Executor
Date: 2006-04-10 18:26:14
Message-ID: 1144693574.4895.2.camel@localhost.localdomain
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Hi,

On Mon, 2006-04-10 at 13:36 -0400, Alvaro Herrera wrote:
> An idea arising in chat with Joshua Drake: the retargetting code, if it
> turns out to work and not be excessively expensive, could also be useful
> to implement a server-side "connection pooling" of sorts: the postmaster
> could keep idle backends and retarget them to a database that receives
> an incoming connection. However, we'd also need a mechanism to clean
> all backend state previous to reusing a connection, to leave it "as
> new" (no prepared statements, WITH HOLD cursors, etc.)

I suspect that's what I was calling 'pre-forking'. Please also note that
this would be the most difficult of my three examples (by difficulty of
implementation).

Regards

Markus


From: "Luke Lonergan" <llonergan(at)greenplum(dot)com>
To: "Skype Technologies OY" <hannu(at)skype(dot)net>, "Martijn van Oosterhout" <kleptog(at)svana(dot)org>
Cc: "Myron Scott" <lister(at)sacadia(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Support Parallel Query Execution in Executor
Date: 2006-04-10 19:02:56
Message-ID: C05FF9F0.213EB%llonergan@greenplum.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Hannu,

On 4/10/06 2:23 AM, "Hannu Krosing" <hannu(at)skype(dot)net> wrote:

>> The cost of fetching a page from the OS is not really much of an
>> overhead,
>
> Have you tested this ?

I have - the overhead of fetching a page from Linux I/O cache to buffer
cache is about an additional 20% over fetching it directly from buffer cache
on PG 7.4.

Haven't tested it on 8.1.

- Luke


From: "Luke Lonergan" <llonergan(at)greenplum(dot)com>
To: "Gregory Maxwell" <gmaxwell(at)gmail(dot)com>
Cc: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Support Parallel Query Execution in Executor
Date: 2006-04-10 19:09:26
Message-ID: C05FFB76.213EE%llonergan@greenplum.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Gregory,

On 4/9/06 2:04 PM, "Gregory Maxwell" <gmaxwell(at)gmail(dot)com> wrote:

> It increases Linux's maximum readahead from 128K to 1MB, and it
> should be smart enough that you could crank it up further without too
> much risk of hurting performance elsewhere.

Interesting - we now routinely use "blockdev --setra" to set the block
device readahead, and our common setting is 16MB, so I will look at how
the behavior of this adaptive readahead affects a normal query workload.

From the patch:

Normally, the kernel uses a stock readahead logic that is well
understood and well tuned. This option enables a much complex and
feature rich one. It is more aggressive and memory efficient in
doing readahead, and supports some less-common access patterns such
as reading backward and reading sparsely. However, due to the great
diversity of real world applications, it might not fit everyone.

- Luke


From: Myron Scott <lister(at)sacadia(dot)com>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Luke Lonergan <llonergan(at)greenplum(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Support Parallel Query Execution in Executor
Date: 2006-04-11 14:47:14
Message-ID: 1144766833.514665.48.camel@bear
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

On Mon, 2006-04-10 at 02:16, Martijn van Oosterhout wrote:

> There appear to be two separate cases here, though: one is to just farm
> out the read request to another process (basically aio); the other is
> to do actual processing there. The latter is obviously far more useful
> but requires a fair bit more infrastructure.
>

I ran some tests to see where time is spent during SeqScans. I did the
following.

tester=# vacuum analyze verbose test;
INFO: vacuuming "public.test"
INFO: "test": found 0 removable, 727960 nonremovable row
versions in 5353 pages
DETAIL: 0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
0 pages are entirely empty.
CPU 0.18s/0.27u sec elapsed 0.91 sec.
INFO: analyzing "public.test"
INFO: "test": scanned 3000 of 5353 pages, containing 407952
live rows and 0 dead rows; 3000 rows in sample, 727922 estimated
total rows
VACUUM
tester=# select version();
version
-------------------------------------------------------------------------------
PostgreSQL 8.2devel on sparc-sun-solaris2.11, compiled by GCC
gcc (GCC) 3.3.2
(1 row)

tester=# select count(random) from test;
count
--------
727960
(1 row)

With the following dtrace results...

# ./probediff2.d 514607
dtrace: script './probediff2.d' matched 10 probes
CPU ID FUNCTION:NAME
0 46811 ExecEndSeqScan:return scan time 20406
^C

smgrread
641566800
Virtualized - smgrread
439798800
smgread - Call Count
5353
HeapTupleSatisfiesSnapshot
6735471000
Virtualized - HeapTupleSatisfiesSnapshot
3516556800
HeapTupleSatisfiesSnapshot - Call Count
727960
Virtualized - ReadBuffer
558230600
ReadBuffer
864931000
Virtualized - ExecutePlan
7331181400
Virtualized - ExecSeqScan
7331349600
ExecutePlan
20405943000
ExecSeqScan
20406161000

The virtualized times are supposed to be actual time spent on the CPU with
the time spent in the probe factored out. It seems here that half the time
in SeqScan is spent validating the tuples, as opposed to 1/10th doing I/O.
I'm not sure that just farming out read I/O is going to be all that helpful
in this situation. That's why I think it's a good idea to create a slave
process that prefetches pages and transfers valid ItemPointers to the
master.

There may not be much to be gained on simple SeqScans; however, in complex
queries that include a SeqScan, you may gain a lot by offloading this work
onto a slave thread. A table with TOAST'ed attributes comes to mind. The
slave thread could be working away on the rest of the table while the
master is PG_DETOAST_DATUM'ing the attributes for transmission back to the
client or additional processing. Am I missing something in this analysis?

I've attached my dtrace script.

Myron Scott

Attachment Content-Type Size
probediff2.d text/plain 1.6 KB

From: Greg Stark <gsstark(at)mit(dot)edu>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Markus Schiltknecht <markus(at)bluegap(dot)ch>, Qingqing Zhou <zhouqq(at)cs(dot)toronto(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Support Parallel Query Execution in Executor
Date: 2006-04-11 17:15:33
Message-ID: 87irpgjaoa.fsf@stark.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:

> An idea arising in chat with Joshua Drake: the retargetting code, if it
> turns out to work and not be excessively expensive, could also be useful
> to implement a server-side "connection pooling" of sorts: the postmaster
> could keep idle backends and retarget them to a database that receives
> an incoming connection. However, we'd also need a mechanism to clean
> all backend state previous to reusing a connection, to leave it "as
> new" (no prepared statements, WITH HOLD cursors, etc.)

Isn't all that work pretty much exactly the main cost of starting a new
backend?

--
greg


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Markus Schiltknecht <markus(at)bluegap(dot)ch>, Qingqing Zhou <zhouqq(at)cs(dot)toronto(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Support Parallel Query Execution in Executor
Date: 2006-04-11 17:20:19
Message-ID: 20060411172019.GL11760@surnet.cl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Greg Stark wrote:
> Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
>
> > An idea arising in chat with Joshua Drake: the retargetting code, if it
> > turns out to work and not be excessively expensive, could also be useful
> > to implement a server-side "connection pooling" of sorts: the postmaster
> > could keep idle backends and retarget them to a database that receives
> > an incoming connection. However, we'd also need a mechanism to clean
> > all backend state previous to reusing a connection, to leave it "as
> > new" (no prepared statements, WITH HOLD cursors, etc.)
>
> Isn't all that work pretty much exactly the main cost of starting a new
> backend?

On Linux and other systems where fork() has negligible cost, maybe; but
on Windows and Solaris, it's certainly not.

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Qingqing Zhou <zhouqq(at)cs(dot)toronto(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Support Parallel Query Execution in Executor
Date: 2006-04-11 17:28:20
Message-ID: 87d5foja2z.fsf@stark.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches


Simon Riggs <simon(at)2ndquadrant(dot)com> writes:

> I think it would be useful to think about exactly what type of
> query/activity we are looking to improve the performance on. That way we
> can understand the benefit of this proposal and take some baseline
> measurements to analyse what is happening for those cases.

I find the focus on sequential scans, index scans, etc. quite odd when you're
discussing parallel query processing. The whole goal of parallel query
processing is to bring more *cpu* to bear on the problem. That's going to be
most relevant when you're cpu bound, not i/o bound.

The queries I would expect to be helped most by parallel query processing are
queries that involve sorting. For example, a big merge join with two sorts on
either side could perform the two sorts simultaneously. If they provide the
results of the final pass to a third thread it can execute the merge join and
the rest of the query plan while the sorts are still executing on two other
processors.

--
greg


From: Myron Scott <lister(at)sacadia(dot)com>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Luke Lonergan <llonergan(at)greenplum(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Support Parallel Query Execution in Executor
Date: 2006-04-11 17:49:22
Message-ID: 1144777761.514665.102.camel@bear
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

On Tue, 2006-04-11 at 07:47, Myron Scott wrote:
> client
> or additional processing. Am I missing something in this analysis?
>
> I've attached my dtrace script.
>

To answer my own question, I suppose my processors are
relatively slow compared to most setups.

Myron Scott


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Markus Schiltknecht <markus(at)bluegap(dot)ch>, Qingqing Zhou <zhouqq(at)cs(dot)toronto(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Support Parallel Query Execution in Executor
Date: 2006-04-11 18:00:41
Message-ID: 877j5wj8l2.fsf@stark.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches


Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:

> Greg Stark wrote:
> > Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> >
> > > An idea arising in chat with Joshua Drake: the retargetting code, if it
> > > turns out to work and not be excessively expensive, could also be useful
> > > to implement a server-side "connection pooling" of sorts: the postmaster
> > > could keep idle backends and retarget them to a database that receives
> > > an incoming connection. However, we'd also need a mechanism to clean
> > > all backend state previous to reusing a connection, to leave it "as
> > > new" (no prepared statements, WITH HOLD cursors, etc.)
> >
> > Isn't all that work pretty much exactly the main cost of starting a new
> > backend?
>
> On Linux and other systems where fork() has negligible cost, maybe; but
> on Windows and Solaris, it's certainly not.

Even on Solaris I'm sure parsing and preparing plans for all the queries,
building up the system table cache for all the objects in the database, and
so on are much, much more expensive than fork(). I wouldn't be surprised if
even on Windows it was still a pretty close race.

--
greg


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Markus Schiltknecht <markus(at)bluegap(dot)ch>, Qingqing Zhou <zhouqq(at)cs(dot)toronto(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Support Parallel Query Execution in Executor
Date: 2006-04-11 18:06:27
Message-ID: 20060411180627.GQ11760@surnet.cl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Greg Stark wrote:

> Even on Solaris I'm sure parsing and preparing plans for all the queries,
> building up the system table cache for all the objects in the database, and so
> on are much much more expensive than fork(). I wouldn't be surprised if even
> on windows it was still a pretty close race.

Parsing/planning what queries? Regarding system caches, they are
populated from a cache file; they are not read from the catalogs each
time.

But until we see a patch implementing the idea, this is all very
theoretical and probably wrong.

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Qingqing Zhou <zhouqq(at)cs(dot)toronto(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Support Parallel Query Execution in Executor
Date: 2006-04-11 18:10:00
Message-ID: 443BF0F8.7070004@commandprompt.com

Greg Stark wrote:
> Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
>
>> I think it would be useful to think about exactly what type of
>> query/activity we are looking to improve the performance on. That way we
>> can understand the benefit of this proposal and take some baseline
>> measurements to analyse what is happening for those cases.
>
> I find the focus on sequential scans, index scans, etc. quite odd when you're
> discussing parallel query processing. The whole goal of parallel query
> processing is to bring more *cpu* to bear on the problem. That's going to be
> most relevant when you're cpu bound, not i/o bound.
>
> The queries I would expect to be helped most by parallel query processing are
> queries that involve sorting. For example, a big merge join with two sorts on
> either side could perform the two sorts simultaneously. If they provide the
> results of the final pass to a third thread it can execute the merge join and
> the rest of the query plan while the sorts are still executing on two other
> processors.

Well, I am going to go out on a limb here and guess that sequential
scans and index scans are still very relevant, because the CPU could be
bound by either kind of scan depending on the type of query being performed.

This doesn't really have anything to do with being I/O bound; it has to
do with the kinds of data access we have to deal with in query
processing.

You are correct that parallel query processing helps multiple-sort
queries, but those sorts may or may not hit an index.

Sincerely,

Joshua D. Drake

>
>

--

=== The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive PostgreSQL solutions since 1997
http://www.commandprompt.com/


From: Greg Stark <gsstark(at)mit(dot)edu>
To: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Qingqing Zhou <zhouqq(at)cs(dot)toronto(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Support Parallel Query Execution in Executor
Date: 2006-04-11 19:00:41
Message-ID: 87vetghr8m.fsf@stark.xeocode.com

"Joshua D. Drake" <jd(at)commandprompt(dot)com> writes:

> Well, I am going to go out on a limb here and guess that sequential scans and
> index scans are still very relevant, because the CPU could be bound by either
> kind of scan depending on the type of query being performed.
>
> This doesn't really have anything to do with being I/O bound; it has to do
> with the kinds of data access we have to deal with in query processing.

It has everything to do with being i/o bound. The only way having two
processors perform part of an index or sequential scan would help is if your
disk subsystem is capable of providing data faster than a single processor is
capable of requesting it.

That's only going to be true for very high end systems with multiple raid
controllers and dozens of spindles.

On the other hand even moderately sized dual-core systems could probably
benefit from being able to perform multiple cpu-intensive operations
simultaneously.

--
greg


From: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
To: Luke Lonergan <llonergan(at)greenplum(dot)com>
Cc: Skype Technologies OY <hannu(at)skype(dot)net>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, Myron Scott <lister(at)sacadia(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: OS cached buffers (was: Support Parallel Query Execution in Executor)
Date: 2006-04-11 20:38:58
Message-ID: 20060411203858.GT49405@pervasive.com

On Mon, Apr 10, 2006 at 12:02:56PM -0700, Luke Lonergan wrote:
> Hannu,
>
> On 4/10/06 2:23 AM, "Hannu Krosing" <hannu(at)skype(dot)net> wrote:
>
> >> The cost of fetching a page from the OS is not really much of an
> >> overhead,
> >
> > Have you tested this ?
>
> I have - the overhead of fetching a page from Linux I/O cache to buffer
> cache is about an additional 20% over fetching it directly from buffer cache
> on PG 7.4.

Is there any practical way to tell the difference between a page coming
from the OS cache and one coming from disk? Or maybe, for a set of pages,
an estimate of how many came from cache vs. disk? There are some areas
where having this information would be very useful, such as for vacuum
delay. It would make tuning much easier, and it would also give us some
insight into how heavily loaded disks were, which would also be useful
info for vacuum to have (so we could adjust vacuum_cost_delay
dynamically based on load).
--
Jim C. Nasby, Sr. Engineering Consultant jnasby(at)pervasive(dot)com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461


From: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
To: Qingqing Zhou <zhouqq(at)cs(dot)toronto(dot)edu>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Support Parallel Query Execution in Executor
Date: 2006-04-11 20:49:48
Message-ID: 20060411204948.GU49405@pervasive.com

On Mon, Apr 10, 2006 at 05:22:13PM +0800, Qingqing Zhou wrote:
>
> "Markus Schiltknecht" <markus(at)bluegap(dot)ch> wrote
> > Hi Qingqing,
> >
> > >
> > > As Tom pointed out, without big change, a backend on database "D1" can't
> > > connect to "D2". This is because to connect to a database, we need to
> > > initialize a lot of variables. So when you reconnect to another one on
> the
> > > fly, you have to change these variables one by one.
> >
> > Sure, the question is: what is needed to retarget a backend?
> >
Check the code InitPostgres(). These global variables are scattered in many
> places, so I am not sure if it is easy to write clean code to clear up these
> variables. But if you can come up with a patch to do reconnect without
> disconnect, that will be cool.

Something else to consider: most queries that would benefit from
parallel execution are expensive enough that the cost of spawning some
new backends wouldn't be that big a deal, so perhaps for an initial
version it would be best to KISS and just spawn parallel execution
backends as needed.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby(at)pervasive(dot)com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461


From: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
To: Markus Schiltknecht <markus(at)bluegap(dot)ch>, Qingqing Zhou <zhouqq(at)cs(dot)toronto(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Support Parallel Query Execution in Executor
Date: 2006-04-11 20:54:56
Message-ID: 20060411205456.GV49405@pervasive.com

On Mon, Apr 10, 2006 at 01:36:45PM -0400, Alvaro Herrera wrote:
> Markus Schiltknecht wrote:
> > On Mon, 2006-04-10 at 17:22 +0800, Qingqing Zhou wrote:
> > > Check the code InitPostgres(). These global variables are scattered in many
> > > places, so I am not sure if it is easy to write clean code to clear up these
> > > variables. But if you can come up with a patch to do reconnect without
> > > disconnect, that will be cool.
> >
> > Yes, that's where I've been playing around already. Not with much
> > success until now, though. :-(
>
> An idea arising in chat with Joshua Drake: the retargetting code, if it
> turns out to work and not be excessively expensive, could also be useful
> to implement a server-side "connection pooling" of sorts: the postmaster
> could keep idle backends and retarget them to a database that receives
> an incoming connection. However, we'd also need a mechanism to clean
> all backend state previous to reusing a connection, to leave it "as
> new" (no prepared statements, WITH HOLD cursors, etc.)

Oracle allows you to essentially re-connect to an existing connection by
saving connection state when a connection goes back to the connection
pool. Essentially, you connect, and then re-authenticate as a different
user. That user has a specific environment associated with it which is
then pulled in. This makes it reasonable to use Oracle's built-in code
for handling permissions, etc; you just give each system user a database
account. While this sounds scary to folk that haven't used it, it's
actually safer than rolling your own authentication and security
mechanism (which will likely have bugs in it) and having your middleware
connect to the database with some dedicated account.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby(at)pervasive(dot)com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461


From: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Qingqing Zhou <zhouqq(at)cs(dot)toronto(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Support Parallel Query Execution in Executor
Date: 2006-04-11 22:11:06
Message-ID: 1144793466.13511.9.camel@localhost.localdomain


> That's only going to be true for very high end systems with multiple raid
> controllers and dozens of spindles.

Not true. I have a system right now that would benefit. My database is
only 500 megs and I have 2 gigs of RAM and two processors... I only have
a RAID 1, but that is o.k. because most things are cached in memory
anyway.

>
> On the other hand even moderately sized dual-core systems could probably
> benefit from being able to perform multiple cpu-intensive operations
> simultaneously.

See above :)

Joshua D. Drake

>
--

=== The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive PostgreSQL solutions since 1997
http://www.commandprompt.com/


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Myron Scott <lister(at)sacadia(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Support Parallel Query Execution in Executor
Date: 2006-04-13 10:49:55
Message-ID: 200604131049.k3DAntS10181@candle.pha.pa.us


Enhanced TODO:

* Experiment with multi-threaded backend for better resource utilization

This would allow a single query to make use of multiple CPU's or
multiple I/O channels simultaneously. One idea is to create a
background reader that can pre-fetch sequential and index scan
pages needed by other backends. This could be expanded to allow
concurrent reads from multiple devices in a partitioned table.

The issue of parallelism is basically a sliding scale of what you want
to do concurrently. In this case, we are trying to do I/O and CPU work
concurrently. Another idea is to do I/O for partitioned tables
concurrently, and of course there are many CPU-concurrent cases, like
sorting.

---------------------------------------------------------------------------

Tom Lane wrote:
> Myron Scott <lister(at)sacadia(dot)com> writes:
> > Gregory Maxwell wrote:
> >> There are other cases where it is useful to perform parallel I/O
> >> without parallel processing..
>
> > I have done some testing more along these lines with an old fork of
> > postgres code (2001). In my tests, I used a thread to delegate out
> > the actual heap scan of the SeqScan. The job of the "slave" thread
> > the was to fault in buffer pages and determine the time validity of
> > the tuples. ItemPointers are passed back to the "master" thread via a
> > common memory area guarded by mutex locking.
>
> I was considering a variant idea in the shower this morning: suppose
> that we invent one or more "background reader" processes that have
> basically the same infrastructure as the background writer, but have
> the responsibility of causing buffer reads to happen at useful times
> (whereas the writer causes writes to happen at useful times). The
> idea would be for backends to signal the readers when they know they
> will need a given block soon, and then hopefully when they need it
> it'll already be in shared buffers. For instance, in a seqscan it'd be
> pretty trivial to request block N+1 just after reading block N, and then
> doing our actual processing on block N while (we hope) some reader
> process is faulting in N+1. Bitmap indexscans could use this structure
> too; I'm less sure about whether plain indexscans could do much with it
> though.
>
> The major issues I can see are:
>
> 1. We'd need a shared-memory queue of read requests, probably much like
> the queue of fsync requests. We've already seen problems with
> contention for the fsync queue, IIRC, and that's used much less heavily
> than the read request queue would be. So there might be some
> performance issues with getting the block requests sent over to the
> readers.
>
> 2. There are some low-level assumptions that no one reads in pages of
> a relation without having some kind of lock on the relation (consider
> eg the case where the relation is being dropped). A bgwriter-like
> process wouldn't be able to hold lmgr locks, and we wouldn't really want
> it to be thrashing the lmgr shared data structures for each read anyway.
> So you'd have to design some interlock to guarantee that no backend
> abandons a query (and releases its own lmgr locks) while an async read
> request it made is still pending. Ugh.
>
> regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 5: don't forget to increase your free space map settings
>

--
Bruce Momjian http://candle.pha.pa.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
Cc: Luke Lonergan <llonergan(at)greenplum(dot)com>, Skype Technologies OY <hannu(at)skype(dot)net>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, Myron Scott <lister(at)sacadia(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: OS cached buffers (was: Support Parallel Query Execution
Date: 2006-04-13 19:38:04
Message-ID: 200604131938.k3DJc5A21801@candle.pha.pa.us

Jim C. Nasby wrote:
> On Mon, Apr 10, 2006 at 12:02:56PM -0700, Luke Lonergan wrote:
> > Hannu,
> >
> > On 4/10/06 2:23 AM, "Hannu Krosing" <hannu(at)skype(dot)net> wrote:
> >
> > >> The cost of fetching a page from the OS is not really much of an
> > >> overhead,
> > >
> > > Have you tested this ?
> >
> > I have - the overhead of fetching a page from Linux I/O cache to buffer
> > cache is about an additional 20% over fetching it directly from buffer cache
> > on PG 7.4.
>
> Is there any practical way to tell the difference between a page coming
> from the OS cache and one coming from disk? Or maybe, for a set of pages,
> an estimate of how many came from cache vs. disk? There are some areas
> where having this information would be very useful, such as for vacuum
> delay. It would make tuning much easier, and it would also give us some
> insight into how heavily loaded disks were, which would also be useful
> info for vacuum to have (so we could adjust vacuum_cost_delay
> dynamically based on load).

getrusage() returns:

! 0.000062 elapsed 0.000000 user 0.000062 system sec
! [0.000000 user 0.009859 sys total]
! 0/0 [19/2] filesystem blocks in/out
! 0/0 [0/0] page faults/reclaims, 0 [0] swaps
! 0 [0] signals rcvd, 0/0 [4/5] messages rcvd/sent
! 0/0 [23/6] voluntary/involuntary context switches

but I don't see anything in there that would show kernel cache vs. disk
I/O. In fact, there is usually little connection in the kernel between
an I/O request and the process that requests it.

--
Bruce Momjian http://candle.pha.pa.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: Luke Lonergan <llonergan(at)greenplum(dot)com>, Skype Technologies OY <hannu(at)skype(dot)net>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, Myron Scott <lister(at)sacadia(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: OS cached buffers (was: Support Parallel Query Execution
Date: 2006-04-14 04:45:05
Message-ID: 20060414044505.GW49405@pervasive.com

On Thu, Apr 13, 2006 at 03:38:04PM -0400, Bruce Momjian wrote:
> > Is there any practical way to tell the difference between a page coming
> > from the OS cache and one coming from disk? Or maybe, for a set of pages,
> > an estimate of how many came from cache vs. disk? There are some areas
> > where having this information would be very useful, such as for vacuum
> > delay. It would make tuning much easier, and it would also give us some
> > insight into how heavily loaded disks were, which would also be useful
> > info for vacuum to have (so we could adjust vacuum_cost_delay
> > dynamically based on load).
>
> getrusage() returns:
>
> ! 0.000062 elapsed 0.000000 user 0.000062 system sec
> ! [0.000000 user 0.009859 sys total]
> ! 0/0 [19/2] filesystem blocks in/out
> ! 0/0 [0/0] page faults/reclaims, 0 [0] swaps
> ! 0 [0] signals rcvd, 0/0 [4/5] messages rcvd/sent
> ! 0/0 [23/6] voluntary/involuntary context switches
>
> but I don't see anything in there that would show kernel cache vs. disk
> I/O. In fact, there is usually little connection in the kernel between
> an I/O request and the process that requests it.

Yeah, my assumption has been that the only way to tell the difference
would be by timing, but I don't know how practical that is. Since
gettime() or whatever EXPLAIN ANALYZE uses is apparently very expensive,
perhaps there's some other alternative. Perhaps the timing info in
getrusage would work for this. Another idea is setting an alarm for a
fairly short period before making the I/O request. If the request comes
back before the alarm fires, the data must have been in the OS cache.

Another thought is that any IO request that goes to disk would most
likely put the process requesting the IO to sleep, but a request being
served out of cache might not do that. Perhaps there's some way to
recognize that.

Or maybe a better tack would be to develop a patch for as many OSes as
possible that would tell the caller whether an I/O request came out of
cache or not.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby(at)pervasive(dot)com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
Cc: Luke Lonergan <llonergan(at)greenplum(dot)com>, Skype Technologies OY <hannu(at)skype(dot)net>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, Myron Scott <lister(at)sacadia(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: OS cached buffers (was: Support Parallel Query Execution
Date: 2006-04-14 13:46:57
Message-ID: 200604141346.k3EDkvZ07955@candle.pha.pa.us

Jim C. Nasby wrote:
> Yeah, my assumption has been that the only way to tell the difference
> would be by timing, but I don't know how practical that is. Since
> gettime() or whatever EXPLAIN ANALYZE uses is apparently very expensive,
> perhaps there's some other alternative. Perhaps the timing info in
> getrusage would work for this. Another idea is setting an alarm for a
> fairly short period before making the I/O request. If the request comes
> back before the alarm fires, the data must have been in the OS cache.
>
> Another thought is that any IO request that goes to disk would most
> likely put the process requesting the IO to sleep, but a request being
> served out of cache might not do that. Perhaps there's some way to
> recognize that.
>
> Or maybe a better tack would be to develop a patch for as many OSes as
> possible that would tell the caller whether an I/O request came out of
> cache or not.

Or give up, which seems the most fruitful approach.

--
Bruce Momjian http://candle.pha.pa.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +