Re: Selecting top N percent of records.

Lists: pgsql-general
From: Tim Uckun <timuckun(at)gmail(dot)com>
To: pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Selecting top N percent of records.
Date: 2010-10-17 23:33:54
Message-ID: AANLkTimAcWFDvvzB2JLmFXTjM+_f1Kkj+7B1byxNxhzO@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

Is there a way to select the top 10% of the values from a column?

For example the top 10% best selling items where number of sales is a column.

Thanks.


From: Darren Duncan <darren(at)darrenduncan(dot)net>
To: Tim Uckun <timuckun(at)gmail(dot)com>, pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: Selecting top N percent of records.
Date: 2010-10-17 23:53:43
Message-ID: 4CBB8C87.8020409@darrenduncan.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

Tim Uckun wrote:
> Is there a way to select the top 10% of the values from a column?
>
> For example the top 10% best selling items where number of sales is a column.

The top 10% would be a variable number of records. Is that what you want, or
would you rather, say, just see the top N items?

The latter is easy, just something like this:

SELECT * FROM mytbl ORDER BY num_sales DESC LIMIT $1

... where $1 is the number of records you want.

To actually get 10% of the records, you can replace the $1 with a subquery,
something like this probably:

SELECT * FROM mytbl ORDER BY num_sales DESC
LIMIT (SELECT (count(*) / 10) AS selnum FROM mytbl)

There are probably other ways to do it also.

Maybe you want all items whose sales are in the top 90 percentile or something,
or maybe you want what generated the most profit, etc.

-- Darren Duncan


From: Peter Geoghegan <peter(dot)geoghegan86(at)gmail(dot)com>
To: Tim Uckun <timuckun(at)gmail(dot)com>
Cc: pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: Selecting top N percent of records.
Date: 2010-10-17 23:56:02
Message-ID: AANLkTi=i9G+5aVqvfnbB2RKGL4Rtfnh_PRN_aZOCu2te@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On 18 October 2010 00:33, Tim Uckun <timuckun(at)gmail(dot)com> wrote:
> Is there a way to select the top 10% of the values from a column?
>
> For example the top 10% best selling items where number of sales is a column.
>

That is a bit problematic because it necessitates knowing the number
of rows total, and slow counting is an idiosyncrasy of postgres.

http://wiki.postgresql.org/wiki/Slow_Counting

To get the top 10%:

SELECT * FROM table LIMIT(SELECT (COUNT(*) * 0.1)::integer FROM table)

--
Regards,
Peter Geoghegan


From: Tim Uckun <timuckun(at)gmail(dot)com>
To: Darren Duncan <darren(at)darrenduncan(dot)net>
Cc: pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: Selecting top N percent of records.
Date: 2010-10-18 00:03:29
Message-ID: AANLkTikznm=5rqtNZ1QH9k3yYtmfv6Cn3puwRuNzT_cK@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On Mon, Oct 18, 2010 at 12:53 PM, Darren Duncan <darren(at)darrenduncan(dot)net> wrote:
> Tim Uckun wrote:
>>
>> Is there a way to select the top 10% of the values from a column?
>>
>> For example the top 10% best selling items where number of sales is a
>> column.
>
> The top 10% would be a variable number of records.  Is that what you want,
> or would you rather, say, just see the top N items?

Yes this is what I want. I don't care how many records I pull up (well
I might want to put a LIMIT on them). I want the top 10%


From: Tim Uckun <timuckun(at)gmail(dot)com>
To: Peter Geoghegan <peter(dot)geoghegan86(at)gmail(dot)com>
Cc: pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: Selecting top N percent of records.
Date: 2010-10-18 00:06:19
Message-ID: AANLkTimBXwCheGaet+qKdZ_xGJxOUe4zs6KMRMBYeEps@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

> That is a bit problematic because it necessitates knowing the number
> of rows total, and slow counting is an idiosyncrasy of postgres.
>
> http://wiki.postgresql.org/wiki/Slow_Counting
>
> To get the top 10%:
>
> SELECT * FROM table LIMIT(SELECT (COUNT(*) * 0.1)::integer FROM table)

I think I wasn't making myself clear. I don't want the top 10% of the
rows. I want the rows with the top 10% of the values in a column.

In my case there is a very non linear set of values. The lowest value
is 1 and the vast majority of records have a 1 in the column. The
highest value might be in the tens of thousands. I want to pull up
the records that have the top ten percent values.


From: Craig Ringer <craig(at)postnewspapers(dot)com(dot)au>
To: Tim Uckun <timuckun(at)gmail(dot)com>
Cc: Peter Geoghegan <peter(dot)geoghegan86(at)gmail(dot)com>, pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: Selecting top N percent of records.
Date: 2010-10-18 02:13:17
Message-ID: 4CBBAD3D.9010805@postnewspapers.com.au
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On 10/18/2010 08:06 AM, Tim Uckun wrote:
>> That is a bit problematic because it necessitates knowing the number
>> of rows total, and slow counting is an idiosyncrasy of postgres.
>>
>> http://wiki.postgresql.org/wiki/Slow_Counting
>>
>> To get the top 10%:
>>
>> SELECT * FROM table LIMIT(SELECT (COUNT(*) * 0.1)::integer FROM table)
>
>
> I think I wasn't making myself clear. I don't want the top 10% of the
> rows. I want the rows with the top 10% of the values in a column.

OK, so you want a median-style "sort them in descending order and count
down until you've selected the first 10% of rows" approach? In other
words, values in the 90th percentile of the distribution?

Try this. Given table "x" with single integer column "y", obtain rows of
x in the 90th percentile of y:

select ranked.y FROM (select percent_rank() over (order by y desc) as
pc, y from x) AS ranked WHERE pc <= 0.1;

or:

select ranked.y from (select ntile(10) over (order by y desc) as pc, y
from x) AS ranked WHERE pc = 1;

See:

http://www.postgresql.org/docs/current/static/functions-window.html

Both of these seem to produce odd results with small input row counts.
Test carefully before trusting these expressions, as I'm quite new to
the use of window functions.

--
Craig Ringer


From: Tim Uckun <timuckun(at)gmail(dot)com>
To: Craig Ringer <craig(at)postnewspapers(dot)com(dot)au>
Cc: Peter Geoghegan <peter(dot)geoghegan86(at)gmail(dot)com>, pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: Selecting top N percent of records.
Date: 2010-10-18 02:18:50
Message-ID: AANLkTimo_q_LqCuFm9+k3gYs9VOW4s0uwYBRt0q_rmDM@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

>
> OK, so you want a median-style "sort them in descending order and count down
> until you've selected the first 10% of rows" approach? In other words,
> values in the 90th percentile of the distribution?
>
> Try this. Given table "x" with single integer column "y", obtain rows of x
> in the 90th percentile of y:
>
> select ranked.y FROM (select percent_rank() over (order by y desc) as pc, y
> from x) AS ranked WHERE pc <= 0.1;
>
> or:
>
> select ranked.y from (select ntile(10) over (order by y desc) as pc, y from
> x) AS ranked WHERE pc = 1;
>

Thanks I will look into the window functions. I haven't used them
before so thanks for the tip.