Re: Suggestion for aggregate function

Lists: pgsql-hackers
From: Greg Stark <gsstark(at)mit(dot)edu>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Suggestion for aggregate function
Date: 2003-01-17 18:39:11
Message-ID: 87k7h339kg.fsf@stark.dyndns.tv
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


I have an idea for an aggregate function (actually a pair) that would be very
useful. It's something I've wanted very frequently with Oracle and other
databases and while it's possible to implement in SQL it's hard to do
efficiently. Whereas it would be really easy for the database to do it
efficiently.

lookup_min(column1,column2)
lookup_max(column1,column2)

would return the value of column2 (or one of the values in the case of
duplicates) where column1 is the minimum/maximum value. Ie, it would have an
accumulator that stores two values, the minimum/maximum value found so far,
and the value of column2 for that record.

So it would be possible to say for example:

select min(column1),lookup_min(column1,column2) from tab

to do the equivalent of:

select column1,column2 where column1=(select min(column1) from tab) limit 1

except it would be way more efficient. (Especially if there's an index on
column1 and postgres were taught to use indexes for min/max, but that's a
different story.)

I'm not sure on the names, perhaps someone has a better idea?

I would be interested in doing this myself, it sounds like a fairly
straightforward thing to implement and would be a useful first project.
However I'm really a bit bewildered by the number of steps aggregate functions
seem to have to go through to store accumulator data. It seems like they're
going to a lot of effort to store the accumulator data in a database internal
data-type. Is there something I can read to catch up on what these data
structures are for and how to use them?

--
greg


From: Bruno Wolff III <bruno(at)wolff(dot)to>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Suggestion for aggregate function
Date: 2003-01-17 19:29:01
Message-ID: 20030117192901.GA3839@wolff.to
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Jan 17, 2003 at 13:39:11 -0500,
Greg Stark <gsstark(at)mit(dot)edu> wrote:
>
> So it would be possible to say for example:
>
> select min(column1),lookup_min(column1,column2) from tab
>
> to do the equivalent of:
>
> select column1,column2 where column1=(select min(column1) from tab) limit 1
>
> except it would be way more efficient. (Especially if there's an index on
> column1 and postgres were taught to use indexes for min/max, but that's a
> different story.)

The following will be more efficient than your function if there is a usable
index on column1:
select column1,column2 from tab order by column 1 limit 1


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Bruno Wolff III <bruno(at)wolff(dot)to>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Suggestion for aggregate function
Date: 2003-01-17 20:12:58
Message-ID: 873cnr3585.fsf@stark.dyndns.tv
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


Bruno Wolff III <bruno(at)wolff(dot)to> writes:

> On Fri, Jan 17, 2003 at 13:39:11 -0500,
> Greg Stark <gsstark(at)mit(dot)edu> wrote:
> >
> > So it would be possible to say for example:
> >
> > select min(column1),lookup_min(column1,column2) from tab
> >
> > to do the equivalent of:
> >
> > select column1,column2 where column1=(select min(column1) from tab) limit 1

As several people have pointed out this example isn't sufficiently complex to
make rule out various other reasonably efficient SQL implementations.

If you're unconvinced that this function would be handy consider a more
complex query:

SELECT item.*, store.*, x.lowest_price
FROM item, store, (
SELECT item_id,
min(price) AS lowest_price,
lookup_min(price,store_id) AS lowest_price_store
FROM items_for_sale
WHERE item_category = ?
GROUP BY item_id) AS x
WHERE item.item_id = x.item_id
AND store.store_id = x.store_id

There's really no reason for the database to have to do more than one scan of
items_for_sale with one nested_loops lookup of item and store. Ideally if
there's an index on items_for_sale on item_id, price it should be able to use
it too, but that's unlikely.

Currently to write this I think you would have to join against items_for_sale
twice, once to group by item_id and get the least price, then again to lookup
the store.

SELECT item_id, min(store_id)
FROM items_for_sale, (
SELECT min(price) AS lowest_price
FROM items_for_sale
WHERE item_category = ?
GROUP BY item_id
) AS x
WHERE items_for_sale.item_id = x.item_id
AND items_for_sale.price = x.lowest_price

--
greg


From: Manfred Koizar <mkoi-pg(at)aon(dot)at>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Bruno Wolff III <bruno(at)wolff(dot)to>, Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Suggestion for aggregate function
Date: 2003-01-17 22:16:22
Message-ID: amvg2vorsgcdns20hhs9qj2hconmj37hfr@4ax.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 17 Jan 2003 15:12:58 -0500, Greg Stark <gsstark(at)mit(dot)edu> wrote:
>SELECT item.*, store.*, x.lowest_price
> FROM item, store, (
> SELECT item_id,
> min(price) AS lowest_price,
> lookup_min(price,store_id) AS lowest_price_store
> FROM items_for_sale
> WHERE item_category = ?
> GROUP BY item_id) AS x
> WHERE item.item_id = x.item_id
> AND store.store_id = x.store_id
>
>There's really no reason for the database to have to do more than one scan of
>items_for_sale with one nested_loops lookup of item and store.

Greg, we already have this feature, just the syntax is a bit different :-)

SELECT item.*, store.*, x.lowest_price
FROM item, store, (
SELECT DISTINCT ON (item_id) item_id,
price AS lowest_price,
store_id AS lowest_price_store
FROM items_for_sale
WHERE item_category = ?
ORDER BY item_id, price) AS x
WHERE item.item_id = x.item_id
AND store.store_id = x.lowest_price_store;

> Ideally if
>there's an index on items_for_sale on item_id, price it should be able to use
>it too, but that's unlikely.

Servus
Manfred


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Bruno Wolff III <bruno(at)wolff(dot)to>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Suggestion for aggregate function
Date: 2003-01-17 23:17:12
Message-ID: 19622.1042845432@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Greg Stark <gsstark(at)mit(dot)edu> writes:
>> select min(column1),lookup_min(column1,column2) from tab

One small problem is that we only support single-argument aggregates.
As of 7.3 this is no longer wired into the system catalog layout, but
it's still wired into various internal datastructures. Anyone
interested in trying to fix it?

regards, tom lane


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Manfred Koizar <mkoi-pg(at)aon(dot)at>
Cc: Greg Stark <gsstark(at)MIT(dot)EDU>, Bruno Wolff III <bruno(at)wolff(dot)to>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Suggestion for aggregate function
Date: 2003-01-18 00:08:06
Message-ID: 87fzrr1frt.fsf@stark.dyndns.tv
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


Manfred Koizar <mkoi-pg(at)aon(dot)at> writes:

> Greg, we already have this feature, just the syntax is a bit different :-)
>
> SELECT DISTINCT ON (item_id) item_id,
> price AS lowest_price,
> store_id AS lowest_price_store
> FROM items_for_sale
> WHERE item_category = ?
> ORDER BY item_id, price

Neat! I hadn't seen this. I would have liked to have had that feature on
Oracle! (Please don't tell me I did, I went through such pains to work around
not having it.)

Would this query be efficient if there's an index on item_id, price ? That is,
would it know to do an index scan and be able to skip to the next item_id in
the index as soon as a price was found?

--
greg


From: Manfred Koizar <mkoi-pg(at)aon(dot)at>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Bruno Wolff III <bruno(at)wolff(dot)to>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Suggestion for aggregate function
Date: 2003-01-19 18:14:06
Message-ID: u3ql2vspo0c0ucgerp3sac86f9sjde26v9@4ax.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 17 Jan 2003 19:08:06 -0500, Greg Stark <gsstark(at)mit(dot)edu> wrote:
>Would this query be efficient if there's an index on item_id, price ? That is,
>would it know to do an index scan

Yes, at least to avoid the sort step.

> and be able to skip to the next item_id in
>the index as soon as a price was found?

I don't think so. Look at how the index scan retrieves all rows:

=> EXPLAIN ANALYZE
-> SELECT DISTINCT ON (item) item, price, store FROM sale ORDER BY item, price;
NOTICE: QUERY PLAN:

Unique (cost=0.00..412.24 rows=1024 width=12)
(actual time=0.93..549.95 rows=101 loops=1)
-> Index Scan using s_x1 on sale (cost=0.00..386.64 rows=10240 width=12)
(actual time=0.90..399.52 rows=10240 loops=1)
Total runtime: 551.55 msec

EXPLAIN
=> DROP INDEX s_x1;
DROP
=> EXPLAIN ANALYZE
-> SELECT DISTINCT ON (item) item, price, store FROM sale ORDER BY item, price;
NOTICE: QUERY PLAN:

Unique (cost=845.48..871.08 rows=1024 width=12)
(actual time=941.83..1152.25 rows=101 loops=1)
-> Sort (cost=845.48..845.48 rows=10240 width=12)
(actual time=941.71..1061.93 rows=10240 loops=1)
-> Seq Scan on sale (cost=0.00..163.40 rows=10240 width=12)
(actual time=0.37..273.41 rows=10240 loops=1)
Total runtime: 1304.63 msec

Servus
Manfred


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Greg Stark <gsstark(at)MIT(dot)EDU>
Cc: Manfred Koizar <mkoi-pg(at)aon(dot)at>, Greg Stark <gsstark(at)mit(dot)edu>, Bruno Wolff III <bruno(at)wolff(dot)to>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Suggestion for aggregate function
Date: 2003-01-24 17:37:00
Message-ID: 878yxaphz7.fsf@stark.dyndns.tv
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Greg Stark <gsstark(at)MIT(dot)EDU> writes:

> Manfred Koizar <mkoi-pg(at)aon(dot)at> writes:
>
> > Greg, we already have this feature, just the syntax is a bit different :-)
> >
> > SELECT DISTINCT ON (item_id) item_id,
> > price AS lowest_price,
> > store_id AS lowest_price_store
> > FROM items_for_sale
> > WHERE item_category = ?
> > ORDER BY item_id, price
>
> Neat! I hadn't seen this.

Ok, so I still think DISTINCT ON is the neatest thing since sliced bread. But
it strikes me as a bit of an odd syntax. It's very similar to GROUP BY except
where all the fields are implicitly aggregated using a peculiar aggregate
function that grabs the first value according to the order by expression.

I'm using this already for lots of queries, it's very handy. But I'm finding
it awkward in one situation -- when I also want other aggregate values other
than the first value according to the sort.

Consider the above query if I also wanted to know the maximum and average
prices per item. Along with the store that had the maximum and minimum prices
and the total number of stores that stock the item.

With DISTINCT ON I would have to do two queries to get the maximum and minimum
along with the relevant stores, and then do a third query with GROUP BY to get
the average and total number of stores.

What would be useful is something like

SELECT item_id,
first(price) as min_price, first(store_id) as min_store,
avg(price) as avg_price,
last(price) as max_price, last(store_id) as min_store,
count(distinct store_id) as num_stores
FROM (SELECT * FROM items_for_sale ORDER BY item_id, store_id)
GROUP BY store_id

This gives the benefits of DISTINCT ON but makes it easier to combine with
GROUP BY.

--
greg


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Manfred Koizar <mkoi-pg(at)aon(dot)at>, Bruno Wolff III <bruno(at)wolff(dot)to>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Suggestion for aggregate function
Date: 2003-01-24 20:55:53
Message-ID: 22608.1043441753@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Greg Stark <gsstark(at)mit(dot)edu> writes:
> What would be useful is something like

> SELECT item_id,
> first(price) as min_price, first(store_id) as min_store,
> avg(price) as avg_price,
> last(price) as max_price, last(store_id) as min_store,
> count(distinct store_id) as num_stores
> FROM (SELECT * FROM items_for_sale ORDER BY item_id, store_id)
> GROUP BY store_id

Write it yourself --- both first() and last() are trivial to code as
user-defined aggregates.

regards, tom lane