Re: Stuck with a query...

From: Edmund Bacon <ebacon-xlii(at)onesystem(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Stuck with a query...
Date: 2005-03-08 19:54:19
Message-ID: m3vf813od0.fsf@elb_lx.onesystem.ca
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

gsstark(at)mit(dot)edu (Greg Stark) writes:

> Geoff Caplan <geoff(at)variosoft(dot)com> writes:
>
> > Hi folks,
> >
> > Sorry to ask a newbie SQL question but I'm struggling...
>
> There's no efficient way to write this in standard SQL. However Postgres has
> an extension DISTINCT ON that would do it:
>
> select url,count(*)
> from (select distinct on (session_id)
> url
> from clickstream
> order by session_id,sequence_num desc
> )
> group by url
>
> This isn't going to be a superfast query. It has to sort all the clickstream
> records by session and sequence, take just the last one, then probably sort
> those again.
>

As an experiment I tried a more 'standard SQL' approach to this problem:

SELECT url, count(1)
FROM clickstream
WHERE (session_id, sequence_num) IN
(SELECT session_id, max(sequence_num)
FROM clickstream
GROUP BY session_id)
GROUP BY url;

On a table with about 100,000 rows this runs in about 720ms on my
system , compared to the ON DISTICNT version which runs in about
1000ms. Adding an index on (session_id, sequence_num) reduced the run
time to about 690ms, but made no difference to the DISTINCT ON
version. With only about 10,000 rows, there's no appreciable
difference. This surprised me, because I expected the DISTINCT ON to
be better.

--
Remove -42 for email

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Paul Moore 2005-03-08 20:55:13 New user: Windows, Postgresql, Python
Previous Message peter Willis 2005-03-08 19:37:14 Question about accessing current row data inside trigger