Re: Proposal: Pre ordered aggregates, default ORDER BY clause for aggregates - median support

From: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: Pre ordered aggregates, default ORDER BY clause for aggregates - median support
Date: 2009-12-21 11:43:24
Message-ID: 162867790912210343r9ce2d53s51616392c3003042@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

2009/12/21 Greg Stark <gsstark(at)mit(dot)edu>:
> On Mon, Dec 21, 2009 at 5:48 AM, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com> wrote:
>> Information about count of rows are not detected in planner time. This
>>  have to by available in any executor's sort node. So this value will
>> be available every time - index scan is problem. Interesting is Greg's
>> info about O(N) algorithms. Then we can implement median as classic
>> aggregate.
>>...
>> On second hand - any internal implementation of median will be faster,
>> then currently used techniques.
>
> Some further information now that I'm on a real keyboard.
>
> The O(n) algorithm for median of which I'm aware is Quickselect. It's
> based on Quicksort which makes it unsuitable for a disk-based
> algorithm since it would have to read every block many times. Perhaps
> there's some way to adapt it or there might be some other O(n)
> algorithm which has better i/o patterns.
>
> But in cases where the tupleset/tuplesort is still in memory it would
> be easy to add support for pulling out the nth element and use that in
> a magic median() function. It wouldn't be a "classic" aggregate, at
> least as I understand it where you also need something like O(1) space
> to store the state, because you definitely need access to the whole
> tupleset to do the quickselect.
>

I can check speed diferences on orafce's median implementation. But
orafce isn't correct - it ignores work_mem limit - so it usable only
for some external modules or for testing. I'll try simple median
implementation based on aggregate API. Then I'll compare speed.

> If we don't find a way to optimize this for on-disk tuplestores though
> then I wonder whether it's worth it. It would only help in the narrow
> case that you had a large enough set to see the difference between
> doing quickselect and quicksort but small enough to fit in workmem. I
> suppose that case is widening over time though as memory sizes get
> bigger and bigger.

Thank you

I have to do same test and get more info about quickselect

Pavel

>
> --
> greg
>

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2009-12-21 11:56:13 Re: using separate parameters in psql query execution
Previous Message Simon Riggs 2009-12-21 11:34:01 Re: alpha3 release schedule?