Re: Research/Implementation of Nested Loop Join optimization

Lists: pgsql-hackers
From: "Manoel Henrique" <mhenriquesgbd(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Research/Implementation of Nested Loop Join optimization
Date: 2008-07-23 20:16:58
Message-ID: 5b27a1ad0807231316w6eac8675m47b856836bcfab8e@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi! I`m a researcher from PUC-Rio (Brazil) and we`re studying about an
Joins, and we`d like to implement an optimization on the Nested Loop Join,
this optimization consists on while scanning the inner table, the iteration
would go from up-down then backwards(down-up) to take advantage of the
buffer pages in memory.
We`d work with MaterialScan and only NestedLoop (we`re dropping all indexes
and keys to make it this way). The research objective is to show some
students how a DBMS works.

Does PostgreSQL already works this way?
Is it possible to implement such thing? Is it easy? how hard?

Thank you in advance,
Manoel Henrique Souza.


From: "Dann Corbit" <DCorbit(at)connx(dot)com>
To: "Manoel Henrique" <mhenriquesgbd(at)gmail(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Research/Implementation of Nested Loop Join optimization
Date: 2008-07-23 20:23:25
Message-ID: D425483C2C5C9F49B5B7A41F8944154701000FA1@postal.corporate.connx.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

When you install the source tree (e.g. in folder \postgresql-8.3.x) you
will want to examine nodeMergejoin.c typically found in a path similar
to this:

\postgresql-8.3.x\src\backend\executor\nodeMergejoin.c

Here are the comments from the version on my machine:

/*

* INTERFACE ROUTINES

* ExecMergeJoin
mergejoin outer and inner relations.

* ExecInitMergeJoin
creates and initializes run time states

* ExecEndMergeJoin
cleans up the node.

*

* NOTES

*

* Merge-join is done by joining the inner
and outer tuples satisfying

* join clauses of the form ((= outerKey
innerKey) ...).

* The join clause list is provided by the
query planner and may contain

* more than one (= outerKey innerKey) clause
(for composite sort key).

*

* However, the query executor needs to know
whether an outer

* tuple is "greater/smaller" than an inner
tuple so that it can

* "synchronize" the two relations. For
example, consider the following

* relations:

*

* outer: (0
^1 1 2 5 5 5 6 6 7) current tuple: 1

* inner: (1
^3 5 5 5 5 6) current tuple: 3

*

* To continue the merge-join, the executor
needs to scan both inner

* and outer relations till the matching
tuples 5. It needs to know

* that currently inner tuple 3 is "greater"
than outer tuple 1 and

* therefore it should scan the outer
relation first to find a

* matching tuple and so on.

*

* Therefore, rather than directly executing
the merge join clauses,

* we evaluate the left and right key
expressions separately and then

* compare the columns one at a time (see
MJCompare). The planner

* passes us enough information about the
sort ordering of the inputs

* to allow us to determine how to make the
comparison. We may use the

* appropriate btree comparison function,
since Postgres' only notion

* of ordering is specified by btree
opfamilies.

*

*

* Consider the above relations and suppose
that the executor has

* just joined the first outer "5" with the
last inner "5". The

* next step is of course to join the second
outer "5" with all

* the inner "5's". This requires
repositioning the inner "cursor"

* to point at the first inner "5". This is
done by "marking" the

* first inner 5 so we can restore the
"cursor" to it before joining

* with the second outer 5. The access method
interface provides

* routines to mark and restore to a tuple.

*

*

* Essential operation of the merge join
algorithm is as follows:

*

* Join {

* get initial outer and
inner tuples
INITIALIZE

* do forever {

* while
(outer != inner) {
SKIP_TEST

*
if (outer < inner)

*
advance outer
SKIPOUTER_ADVANCE

*
else

*
advance inner
SKIPINNER_ADVANCE

* }

* mark inner
position
SKIP_TEST

* do forever
{

*
while (outer == inner) {

*
join tuples
JOINTUPLES

*
advance inner position
NEXTINNER

*
}

*
advance outer position
NEXTOUTER

*
if (outer == mark)
TESTOUTER

*
restore inner position to mark TESTOUTER

*
else

*
break // return to top of outer loop

* }

* }

* }

*

* The merge join operation is coded in the
fashion

* of a state machine. At each state, we do
something and then

* proceed to another state. This state is
stored in the node's

* execution state information and is
preserved across calls to

* ExecMergeJoin. -cim 10/31/89

*/

From: pgsql-hackers-owner(at)postgresql(dot)org
[mailto:pgsql-hackers-owner(at)postgresql(dot)org] On Behalf Of Manoel Henrique
Sent: Wednesday, July 23, 2008 1:17 PM
To: pgsql-hackers(at)postgresql(dot)org
Subject: [HACKERS] Research/Implementation of Nested Loop Join
optimization

Hi! I`m a researcher from PUC-Rio (Brazil) and we`re studying about an
Joins, and we`d like to implement an optimization on the Nested Loop
Join, this optimization consists on while scanning the inner table, the
iteration would go from up-down then backwards(down-up) to take
advantage of the buffer pages in memory.

We`d work with MaterialScan and only NestedLoop (we`re dropping all
indexes and keys to make it this way). The research objective is to show
some students how a DBMS works.

Does PostgreSQL already works this way?

Is it possible to implement such thing? Is it easy? how hard?

Thank you in advance,

Manoel Henrique Souza.


From: "Manoel Henrique" <mhenriquesgbd(at)gmail(dot)com>
To: "Dann Corbit" <DCorbit(at)connx(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Research/Implementation of Nested Loop Join optimization
Date: 2008-07-23 20:47:22
Message-ID: 5b27a1ad0807231347o53a08dccj26bea6b04e541448@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

The nodeMergejoin.c is the code for the Merge Join isn`t it? I am trying to
find a way to change the Nested Loop Join, It would be more like on
nodeNestloop.c when rescanning the inner plan, (second time scanning the
inner plan and so on) he`d change the scan direction, If the scan direction
was from first tuple to last tuple it would go backwards, if it was from
last to first it would go forward... The code I`m looking atm is from 8.3.1
, seems to have some kind of direction manager but doesn`t seems to be in
use.

--Manoel

On Wed, Jul 23, 2008 at 5:23 PM, Dann Corbit <DCorbit(at)connx(dot)com> wrote:

> When you install the source tree (e.g. in folder \postgresql-8.3.x) you
> will want to examine nodeMergejoin.c typically found in a path similar to
> this:
>
> \postgresql-8.3.x\src\backend\executor\nodeMergejoin.c
>
>
>
> Here are the comments from the version on my machine:
>
>
>
> /*
>
> * INTERFACE ROUTINES
>
> * ExecMergeJoin
> mergejoin outer and inner relations.
>
> * ExecInitMergeJoin
> creates and initializes run time states
>
> * ExecEndMergeJoin
> cleans up the node.
>
> *
>
> * NOTES
>
> *
>
> * Merge-join is done by joining the inner and
> outer tuples satisfying
>
> * join clauses of the form ((= outerKey
> innerKey) ...).
>
> * The join clause list is provided by the query
> planner and may contain
>
> * more than one (= outerKey innerKey) clause
> (for composite sort key).
>
> *
>
> * However, the query executor needs to know
> whether an outer
>
> * tuple is "greater/smaller" than an inner
> tuple so that it can
>
> * "synchronize" the two relations. For example,
> consider the following
>
> * relations:
>
> *
>
> * outer: (0 ^1
> 1 2 5 5 5 6 6 7) current tuple: 1
>
> * inner: (1 ^3
> 5 5 5 5 6) current tuple: 3
>
> *
>
> * To continue the merge-join, the executor
> needs to scan both inner
>
> * and outer relations till the matching tuples
> 5. It needs to know
>
> * that currently inner tuple 3 is "greater"
> than outer tuple 1 and
>
> * therefore it should scan the outer relation
> first to find a
>
> * matching tuple and so on.
>
> *
>
> * Therefore, rather than directly executing the
> merge join clauses,
>
> * we evaluate the left and right key
> expressions separately and then
>
> * compare the columns one at a time (see
> MJCompare). The planner
>
> * passes us enough information about the sort
> ordering of the inputs
>
> * to allow us to determine how to make the
> comparison. We may use the
>
> * appropriate btree comparison function, since
> Postgres' only notion
>
> * of ordering is specified by btree opfamilies.
>
> *
>
> *
>
> * Consider the above relations and suppose that
> the executor has
>
> * just joined the first outer "5" with the last
> inner "5". The
>
> * next step is of course to join the second
> outer "5" with all
>
> * the inner "5's". This requires repositioning
> the inner "cursor"
>
> * to point at the first inner "5". This is done
> by "marking" the
>
> * first inner 5 so we can restore the "cursor"
> to it before joining
>
> * with the second outer 5. The access method
> interface provides
>
> * routines to mark and restore to a tuple.
>
> *
>
> *
>
> * Essential operation of the merge join
> algorithm is as follows:
>
> *
>
> * Join {
>
> * get initial outer and inner
> tuples
> INITIALIZE
>
> * do forever {
>
> * while (outer
> != inner) {
> SKIP_TEST
>
> *
> if (outer < inner)
>
> *
> advance
> outer
> SKIPOUTER_ADVANCE
>
> *
> else
>
> *
> advance
> inner
> SKIPINNER_ADVANCE
>
> * }
>
> * mark inner
> position
> SKIP_TEST
>
> * do forever {
>
> *
> while (outer == inner) {
>
> *
> join
> tuples
> JOINTUPLES
>
> *
> advance inner position
> NEXTINNER
>
> *
> }
>
> *
> advance outer
> position
> NEXTOUTER
>
> *
> if (outer ==
> mark)
> TESTOUTER
>
> *
> restore inner position to mark TESTOUTER
>
> *
> else
>
> *
> break // return to top of outer loop
>
> * }
>
> * }
>
> * }
>
> *
>
> * The merge join operation is coded in the
> fashion
>
> * of a state machine. At each state, we do
> something and then
>
> * proceed to another state. This state is
> stored in the node's
>
> * execution state information and is preserved
> across calls to
>
> * ExecMergeJoin. -cim 10/31/89
>
> */
>
>
>
> *From:* pgsql-hackers-owner(at)postgresql(dot)org [mailto:
> pgsql-hackers-owner(at)postgresql(dot)org] *On Behalf Of *Manoel Henrique
> *Sent:* Wednesday, July 23, 2008 1:17 PM
> *To:* pgsql-hackers(at)postgresql(dot)org
> *Subject:* [HACKERS] Research/Implementation of Nested Loop Join
> optimization
>
>
>
> Hi! I`m a researcher from PUC-Rio (Brazil) and we`re studying about an
> Joins, and we`d like to implement an optimization on the Nested Loop Join,
> this optimization consists on while scanning the inner table, the iteration
> would go from up-down then backwards(down-up) to take advantage of the
> buffer pages in memory.
>
> We`d work with MaterialScan and only NestedLoop (we`re dropping all indexes
> and keys to make it this way). The research objective is to show some
> students how a DBMS works.
>
>
> Does PostgreSQL already works this way?
>
> Is it possible to implement such thing? Is it easy? how hard?
>
>
>
>
>
> Thank you in advance,
>
> Manoel Henrique Souza.
>


From: "Dann Corbit" <DCorbit(at)connx(dot)com>
To: "Manoel Henrique" <mhenriquesgbd(at)gmail(dot)com>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Research/Implementation of Nested Loop Join optimization
Date: 2008-07-23 22:50:38
Message-ID: D425483C2C5C9F49B5B7A41F8944154701000FA2@postal.corporate.connx.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

From: Manoel Henrique [mailto:mhenriquesgbd(at)gmail(dot)com]
Sent: Wednesday, July 23, 2008 1:47 PM
To: Dann Corbit
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [HACKERS] Research/Implementation of Nested Loop Join
optimization

The nodeMergejoin.c is the code for the Merge Join isn`t it? I am trying
to find a way to change the Nested Loop Join, It would be more like on
nodeNestloop.c when rescanning the inner plan, (second time scanning the
inner plan and so on) he`d change the scan direction, If the scan
direction was from first tuple to last tuple it would go backwards, if
it was from last to first it would go forward... The code I`m looking
atm is from 8.3.1 , seems to have some kind of direction manager but
doesn`t seems to be in use.

>>

You are right. You want file: nodeNestloop.c

<<


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Manoel Henrique" <mhenriquesgbd(at)gmail(dot)com>
Cc: "Dann Corbit" <DCorbit(at)connx(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Research/Implementation of Nested Loop Join optimization
Date: 2008-07-24 05:49:15
Message-ID: 20189.1216878555@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"Manoel Henrique" <mhenriquesgbd(at)gmail(dot)com> writes:
> The nodeMergejoin.c is the code for the Merge Join isn`t it? I am trying to
> find a way to change the Nested Loop Join, It would be more like on
> nodeNestloop.c when rescanning the inner plan, (second time scanning the
> inner plan and so on) he`d change the scan direction, If the scan direction
> was from first tuple to last tuple it would go backwards, if it was from
> last to first it would go forward... The code I`m looking atm is from 8.3.1
> , seems to have some kind of direction manager but doesn`t seems to be in
> use.

I find this a bit dubious. If the inner rel is small enough to fit in
memory then it buys nothing. If not, then you win only to the extent
that a pretty large fraction of the inner rel fits in memory. In any
case you are relying on the assumption that backwards scan is just as
efficient as forward scan, which seems to me to be a pretty large
assumption --- we expect forward seqscans to get a performance boost
from kernel readahead, but I'd be surprised if the kernel recognized
what was happening in a backwards scan.

Note also that backwards scan doesn't work at all in some plan
node types (cf ExecSupportsBackwardScan). You'd need to check
what the inner input node was before trying this.

regards, tom lane


From: "Manoel Henrique" <mhenriquesgbd(at)gmail(dot)com>
To:
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Research/Implementation of Nested Loop Join optimization
Date: 2008-07-25 19:05:30
Message-ID: 5b27a1ad0807251205p3aef4f9do8cf60ccc0c7b75a7@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Yes, I'm relying on the assumption that backwards scan has the same cost as
forward scan, why shouldn't it?

Yet, all plan node types we are testing works with backwards scan (looking
on ExecSupportsBackwardScan). But, is there a easy way to make a query
execute only in backwards scan? How we can do that? Our first objective is
to make a backwards scan only and then test a forward-and-backward scan.

-- Manoel

On Thu, Jul 24, 2008 at 2:49 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> "Manoel Henrique" <mhenriquesgbd(at)gmail(dot)com> writes:
> > The nodeMergejoin.c is the code for the Merge Join isn`t it? I am trying
> to
> > find a way to change the Nested Loop Join, It would be more like on
> > nodeNestloop.c when rescanning the inner plan, (second time scanning the
> > inner plan and so on) he`d change the scan direction, If the scan
> direction
> > was from first tuple to last tuple it would go backwards, if it was from
> > last to first it would go forward... The code I`m looking atm is from
> 8.3.1
> > , seems to have some kind of direction manager but doesn`t seems to be in
> > use.
>
> I find this a bit dubious. If the inner rel is small enough to fit in
> memory then it buys nothing. If not, then you win only to the extent
> that a pretty large fraction of the inner rel fits in memory. In any
> case you are relying on the assumption that backwards scan is just as
> efficient as forward scan, which seems to me to be a pretty large
> assumption --- we expect forward seqscans to get a performance boost
> from kernel readahead, but I'd be surprised if the kernel recognized
> what was happening in a backwards scan.
>
> Note also that backwards scan doesn't work at all in some plan
> node types (cf ExecSupportsBackwardScan). You'd need to check
> what the inner input node was before trying this.
>
> regards, tom lane
>


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Manoel Henrique" <mhenriquesgbd(at)gmail(dot)com>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Research/Implementation of Nested Loop Join optimization
Date: 2008-07-25 22:10:54
Message-ID: 87myk5haxd.fsf@oxford.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"Manoel Henrique" <mhenriquesgbd(at)gmail(dot)com> writes:

> Yes, I'm relying on the assumption that backwards scan has the same cost as
> forward scan, why shouldn't it?

Because hard drives only spin one direction

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's RemoteDBA services!


From: "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
To: "Gregory Stark" <stark(at)enterprisedb(dot)com>
Cc: "Manoel Henrique" <mhenriquesgbd(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Research/Implementation of Nested Loop Join optimization
Date: 2008-07-25 23:31:09
Message-ID: 36e682920807251631o48cba2c0y915f0315bd1a6e59@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Jul 25, 2008 at 6:10 PM, Gregory Stark <stark(at)enterprisedb(dot)com> wrote:
> "Manoel Henrique" <mhenriquesgbd(at)gmail(dot)com> writes:
>
>> Yes, I'm relying on the assumption that backwards scan has the same cost as
>> forward scan, why shouldn't it?
>
> Because hard drives only spin one direction

:)

--
Jonah H. Harris, Senior DBA
myYearbook.com


From: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To: "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
Cc: Gregory Stark <stark(at)enterprisedb(dot)com>, Manoel Henrique <mhenriquesgbd(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Research/Implementation of Nested Loop Join optimization
Date: 2008-07-25 23:35:19
Message-ID: 1217028919.16378.142.camel@jd-laptop
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, 2008-07-25 at 19:31 -0400, Jonah H. Harris wrote:
> On Fri, Jul 25, 2008 at 6:10 PM, Gregory Stark <stark(at)enterprisedb(dot)com> wrote:
> > "Manoel Henrique" <mhenriquesgbd(at)gmail(dot)com> writes:
> >
> >> Yes, I'm relying on the assumption that backwards scan has the same cost as
> >> forward scan, why shouldn't it?
> >
> > Because hard drives only spin one direction
>
> :)

What if you are below the equator?

> --
> Jonah H. Harris, Senior DBA
> myYearbook.com
>
--
The PostgreSQL Company since 1997: http://www.commandprompt.com/
PostgreSQL Community Conference: http://www.postgresqlconference.org/
United States PostgreSQL Association: http://www.postgresql.us/
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Cc: "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, Gregory Stark <stark(at)enterprisedb(dot)com>, Manoel Henrique <mhenriquesgbd(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Research/Implementation of Nested Loop Join optimization
Date: 2008-07-26 00:15:48
Message-ID: 20080726001548.GS9891@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Joshua D. Drake escribió:
> On Fri, 2008-07-25 at 19:31 -0400, Jonah H. Harris wrote:
> > On Fri, Jul 25, 2008 at 6:10 PM, Gregory Stark <stark(at)enterprisedb(dot)com> wrote:
> > > "Manoel Henrique" <mhenriquesgbd(at)gmail(dot)com> writes:
> > >
> > >> Yes, I'm relying on the assumption that backwards scan has the same cost as
> > >> forward scan, why shouldn't it?
> > >
> > > Because hard drives only spin one direction
> >
> > :)
>
> What if you are below the equator?

They spin the same direction here too, thanks :-) (Coriolis does not
affect much in this case)

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: "Manoel Henrique" <mhenriquesgbd(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Research/Implementation of Nested Loop Join optimization
Date: 2008-07-26 00:26:43
Message-ID: 4160.1217032003@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Gregory Stark <stark(at)enterprisedb(dot)com> writes:
> "Manoel Henrique" <mhenriquesgbd(at)gmail(dot)com> writes:
>> Yes, I'm relying on the assumption that backwards scan has the same cost as
>> forward scan, why shouldn't it?

> Because hard drives only spin one direction

Good joke, but to be serious: we expect that forward scans will result
in the kernel doing read-ahead, which will allow overlapping of
CPU work to process one page with the I/O to bring in the next page.
A backwards scan will get no such overlapping and thus be up to 2X
slower, unless the kernel is smart enough to do read-ahead for
descending-order read requests. Which seems not too probable. A fairly
typical kernel behavior is that read-ahead is triggered by successive
read() requests without any intervening seek(), and this is impossible
for a backward scan.

(Yes, we do optimize out the seek calls in a forward scan. IIRC it's
done in fd.c.)

regards, tom lane


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Gregory Stark <stark(at)enterprisedb(dot)com>, Manoel Henrique <mhenriquesgbd(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Research/Implementation of Nested Loop Join optimization
Date: 2008-07-26 00:37:25
Message-ID: 20080726003725.GT9891@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane escribió:
> Gregory Stark <stark(at)enterprisedb(dot)com> writes:
> > "Manoel Henrique" <mhenriquesgbd(at)gmail(dot)com> writes:
> >> Yes, I'm relying on the assumption that backwards scan has the same cost as
> >> forward scan, why shouldn't it?
>
> > Because hard drives only spin one direction
>
> Good joke, but to be serious: we expect that forward scans will result
> in the kernel doing read-ahead, which will allow overlapping of
> CPU work to process one page with the I/O to bring in the next page.

I wonder if this is spoiled (or rather, the backwards case fixed) by the
attempts to call posix_fadvise() on certain types of scan.

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Gregory Stark <stark(at)enterprisedb(dot)com>, Manoel Henrique <mhenriquesgbd(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Research/Implementation of Nested Loop Join optimization
Date: 2008-07-26 00:44:49
Message-ID: 4383.1217033089@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> Tom Lane escribi:
>> Good joke, but to be serious: we expect that forward scans will result
>> in the kernel doing read-ahead, which will allow overlapping of
>> CPU work to process one page with the I/O to bring in the next page.

> I wonder if this is spoiled (or rather, the backwards case fixed) by the
> attempts to call posix_fadvise() on certain types of scan.

Yeah, I started wondering about that too after sending off the above.
The fadvise patch might eliminate the distinction ... on platforms where
fadvise exists and actually works well.

regards, tom lane


From: Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Gregory Stark <stark(at)enterprisedb(dot)com>, Manoel Henrique <mhenriquesgbd(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Research/Implementation of Nested Loop Join optimization
Date: 2008-07-26 05:27:58
Message-ID: 488AB5DE.8070100@cheapcomplexdevices.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
> Gregory Stark <stark(at)enterprisedb(dot)com> writes:
>> "Manoel Henrique" <mhenriquesgbd(at)gmail(dot)com> writes:
>>> Yes, I'm relying on the assumption that backwards scan has the same cost as
>>> forward scan, why shouldn't it?
>
> G...we expect that forward scans will result
> in the kernel doing read-ahead, ...
> A backwards scan will get no such overlapping and thus be up to 2X
> slower, unless the kernel is smart enough to do read-ahead for
> descending-order read requests. Which seems not too probable.

Linux's old adaptive readahead patches claimed to[1]:
It also have methods to detect some less common cases:
- reading backward"
Interestingly the author of that patch used postgres as the example
application that benefits from the patch (30%).

I'm not sure if the backward reading feature got kept
in the simplified on-demand readahead that seems to have
superseded the adaptive readahead stuff in 2.6.23[2].

[1] http://lwn.net/Articles/185469/
[2] http://kernelnewbies.org/Linux_2_6_23#head-102af265937262a7a21766ae58fddc1a29a5d8d7


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com>
Cc: Gregory Stark <stark(at)enterprisedb(dot)com>, Manoel Henrique <mhenriquesgbd(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Research/Implementation of Nested Loop Join optimization
Date: 2008-07-26 05:38:27
Message-ID: 8815.1217050707@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com> writes:
> Tom Lane wrote:
>> A backwards scan will get no such overlapping and thus be up to 2X
>> slower, unless the kernel is smart enough to do read-ahead for
>> descending-order read requests. Which seems not too probable.

> Linux's old adaptive readahead patches claimed to[1]:

I didn't say that there were *no* platforms that could do it.

regards, tom lane


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Manoel Henrique" <mhenriquesgbd(at)gmail(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Research/Implementation of Nested Loop Join optimization
Date: 2008-07-26 10:29:17
Message-ID: 87ej5hgcqq.fsf@oxford.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

> Gregory Stark <stark(at)enterprisedb(dot)com> writes:
>> "Manoel Henrique" <mhenriquesgbd(at)gmail(dot)com> writes:
>>> Yes, I'm relying on the assumption that backwards scan has the same cost as
>>> forward scan, why shouldn't it?
>
>> Because hard drives only spin one direction
>
> Good joke, but to be serious: we expect that forward scans will result
> in the kernel doing read-ahead, which will allow overlapping of
> CPU work to process one page with the I/O to bring in the next page.

Well it wasn't a joke but you're right that it's not the whole picture. But
then neither is considering interleaving of I/O with CPU work.

Hard drives spin in a particular direction which means if you stream I/O
requests to them in that direction you can stream data off the hard drive as
fast as it passes under the read head. That's going to be 50-60MB/s for a
single modern 7200rpm drive.

On the other hand if you send an I/O request for the previous block then you
have to wait a whole rotation before it passes under the head. On a 7200rpm
drive that's over 8ms which is a *lot* of CPU work to interleave. The most
bandwidth you'll be able to get is under 1MB/s.

However there's another reason fadvise helps -- the kernel or the drive gets a
chance to reorder the I/O. If we read-ahead a whole track's worth of I/O
backwards then during the first 8ms latency the kernel has a chance to notice
that it should reorder the queued up requests and do them the right way
around.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's 24x7 Postgres support!