Re: Replacement Selection

Lists: pgsql-hackers
From: <mac_man2005(at)hotmail(dot)it>
To: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Replacement Selection
Date: 2007-11-27 08:25:35
Message-ID: BAY132-DS30840B8472DA2670B81C2E6760@phx.gbl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi to all.

It seems a previous mail of mine with following body hasn't been sent.
Sorry for possibly getting it twice.

Actually I have now modified that body, so it's worth to read it once again.

Thanks for your attention.
Regards.

------------PREVIOUS MAIL--------------------------
Well, the refinements are the followings:

Using 2 heaps instead of just one:
one heap creating a "descending" run and the
other one creating an "ascending" run.
Both associated to the same "logical" run.

Suppose we want the input elements to be finally sorted in an ascending
order. To do this we could QuickSort the first M initialization elements
into RAM
and then divide it into 2 parts.
Suppose the first heap creates the following run:
10
9
8

And suppose the second heap creates the following run:
3
5
7

Those two runs can be seen as just one by mergesort... since they "could" be
physically merged into one single run: at first we could write the elements
3,5,7 and then the elements of the other run, red upside down.

Possible advantages:
Having two heaps of that kinds lets RS better adapt to local variations of
the input trend.
This technique can be called Two Ways Replacement Selection (2WRS) just
because of those 2 heaps.
As an extreme example, we can say that having the input already sort in
reverse order
no more leads us to the worst case: with 2WRS no matter the input is already
sort
in ascending/descending order... in this case we'll produce just one run
instead
of producing the maximum number of runs as in RS worst case (input in
reverse order).
Moreover it lets us to grow the current run in 2 ways: just imagine we would
output runs
in a regular file. With 2WRS this could be seen as start outputting elements
from the middle
of such a regular file, the descending heap outputting elements from the
middle upwards
while the ascending one outputting from the middle downward. This could
imply getting
a smaller number of "dead records" (as I said in previous mails, a dear
record is an element
that won't form part of the current run) and so having longer runs.

Others optimizations, for example, can be done with the "virtual
concatenation" technique:
storing a cache of couples (first_element,last_element) for each created
run. This
could be useful in case we can find 2 couples (first_element_1,
last_element_1) and
(first_element_2, last_element_2) with last_element_1 <= first_element_2.
In this case, those runs too can be seen as belonging to the same "logical
run"
(actually they are 2 RS different physical runs, or even 4 in 2WRS
but can be seen as just one by mergesort). Of course, once those 2 (or 4)
runs are
logically merged into that only one, this last one in turn could be merged
to other runs.

What does all that imply? Mergesort would actually consider a smaller number
of runs
(since it should just work on logical runs). This means less jumps between
runs on disk.

Now... to test those refinements I should integrate my code into
PostgreSQL... but it's not that easy for me...

Thanks for your attention.
------------PREVIOUS MAIL--------------------------


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: mac_man2005(at)hotmail(dot)it
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Replacement Selection
Date: 2007-11-27 12:03:38
Message-ID: 1196165018.4246.950.camel@ebony.site
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, 2007-11-27 at 09:25 +0100, mac_man2005(at)hotmail(dot)it wrote:

> Others optimizations, for example, can be done with the "virtual
> concatenation" technique:
> storing a cache of couples (first_element,last_element) for each created
> run. This
> could be useful in case we can find 2 couples (first_element_1,
> last_element_1) and
> (first_element_2, last_element_2) with last_element_1 <= first_element_2.
> In this case, those runs too can be seen as belonging to the same "logical
> run"
> (actually they are 2 RS different physical runs, or even 4 in 2WRS
> but can be seen as just one by mergesort). Of course, once those 2 (or 4)
> runs are
> logically merged into that only one, this last one in turn could be merged
> to other runs.
>
> What does all that imply? Mergesort would actually consider a smaller number
> of runs
> (since it should just work on logical runs). This means less jumps between
> runs on disk.

That's actually a refinement of an idea I've been working on for
optimizing sort. I'll post those separately.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com


From: <mac_man2005(at)hotmail(dot)it>
To: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Replacement Selection
Date: 2007-11-27 16:49:36
Message-ID: BAY132-DS3EB671F59B320B8303BCAE6760@phx.gbl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Any comment about Two Ways Replacement Selection (two heaps instead of just
one) ?

--------------------------------------------------
From: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
Sent: Tuesday, November 27, 2007 1:03 PM
To: <mac_man2005(at)hotmail(dot)it>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] Replacement Selection

> On Tue, 2007-11-27 at 09:25 +0100, mac_man2005(at)hotmail(dot)it wrote:
>
>> Others optimizations, for example, can be done with the "virtual
>> concatenation" technique:
>> storing a cache of couples (first_element,last_element) for each created
>> run. This
>> could be useful in case we can find 2 couples (first_element_1,
>> last_element_1) and
>> (first_element_2, last_element_2) with last_element_1 <=
>> first_element_2.
>> In this case, those runs too can be seen as belonging to the same
>> "logical
>> run"
>> (actually they are 2 RS different physical runs, or even 4 in 2WRS
>> but can be seen as just one by mergesort). Of course, once those 2 (or 4)
>> runs are
>> logically merged into that only one, this last one in turn could be
>> merged
>> to other runs.
>>
>> What does all that imply? Mergesort would actually consider a smaller
>> number
>> of runs
>> (since it should just work on logical runs). This means less jumps
>> between
>> runs on disk.
>
> That's actually a refinement of an idea I've been working on for
> optimizing sort. I'll post those separately.
>
> --
> Simon Riggs
> 2ndQuadrant http://www.2ndQuadrant.com
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 7: You can help support the PostgreSQL project by donating at
>
> http://www.postgresql.org/about/donate
>


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: mac_man2005(at)hotmail(dot)it
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Replacement Selection
Date: 2007-11-27 20:16:22
Message-ID: 1196194582.4246.1125.camel@ebony.site
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, 2007-11-27 at 17:49 +0100, mac_man2005(at)hotmail(dot)it wrote:
> Any comment about Two Ways Replacement Selection (two heaps instead of just
> one) ?

It might allow dynamic heap size management more easily than with a
single heap.

If you really think it will be better, try it. You'll learn loads, right
or wrong. It's difficult to forecast ahead of time what's a good idea
and what's a bad idea. The real truth of these things is that you need
to pop the hood and start tinkering and its's quite hard to make a plan
for that. If you have a bad idea, just move on to the next one; they're
just ideas.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com