Re: Picking out the most recent row using a time stamp column

Lists: pgsql-performance
From: Dave Crooke <dcrooke(at)gmail(dot)com>
To: pgsql-performance <pgsql-performance(at)postgresql(dot)org>
Subject: Picking out the most recent row using a time stamp column
Date: 2011-02-24 19:55:31
Message-ID: AANLkTinTgU-jfPOhrfteBvhAhgG_8zknpk13N9L4QMap@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Hi foks

This is an old chestnut which I've found a number of online threads for, and
never seen a clever answer to. It seems a common enough idiom that there
might be some slicker way to do it, so I thought I might inquire with this
august group if such a clever answer exists ....

Consider the following table

create table data
(id_key int,
time_stamp timestamp without time zone,
value double precision);

create unique index data_idx on data (id_key, time_stamp);

with around 1m rows, with 3500 or so distinct values of id_key.

I need to find the most recent value for each distinct value of id_key.
There is no elegant (that I know of) syntax for this, and there are two ways
I've typically seen it done:

1. Use a dependent subquery to find the most recent time stamp, i.e.

select
a.id_key, a.time_stamp, a.value
from
data a
where
a.time_stamp=
(select max(time_stamp)
from data b
where a.id_key=b.id_key)

2. Define a temporary table / view with the most recent time stamp for each
key, and join against it:

select
a.id_key, a.time_stamp, a.value
from
data a,
(select id_key, max(time_stamp) as mts
from data group by id_key) b
where
a.id_key=b.id_key and a.time_stamp=b.mts

I've found that for my data set, PG 8.4.2 selects the "obvious" / "do it as
written" plan in each case, and that method 2. is much quicker (2.6 sec vs.
2 min on my laptop) ....

Is there a more elegant way to write this, perhaps using PG-specific
extensions?

Cheers
Dave


From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Dave Crooke <dcrooke(at)gmail(dot)com>
Cc: pgsql-performance <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Picking out the most recent row using a time stamp column
Date: 2011-02-24 20:11:29
Message-ID: AANLkTikSK3YWBjUyjm-4ZnSOvevFpRXqiJeDBVvOTUQt@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

On Thu, Feb 24, 2011 at 1:55 PM, Dave Crooke <dcrooke(at)gmail(dot)com> wrote:
> Hi foks
>
> This is an old chestnut which I've found a number of online threads for, and
> never seen a clever answer to. It seems a common enough idiom that there
> might be some slicker way to do it, so I thought I might inquire with this
> august group if such a clever answer exists ....
>
> Consider the following table
>
> create table data
>    (id_key int,
>     time_stamp timestamp without time zone,
>     value double precision);
>
> create unique index data_idx on data (id_key, time_stamp);
>
> with around 1m rows, with 3500 or so distinct values of id_key.
>
> I need to find the most recent value for each distinct value of id_key.
> There is no elegant (that I know of) syntax for this, and there are two ways
> I've typically seen it done:
>
> 1. Use a dependent subquery to find the most recent time stamp, i.e.
>
> select
>    a.id_key, a.time_stamp, a.value
> from
>    data a
> where
>   a.time_stamp=
>      (select max(time_stamp)
>       from data b
>       where a.id_key=b.id_key)
>
> 2. Define a temporary table / view with the most recent time stamp for each
> key, and join against it:
>
> select
>    a.id_key, a.time_stamp, a.value
> from
>    data a,
>    (select id_key, max(time_stamp) as mts
>     from data group by id_key) b
> where
>    a.id_key=b.id_key and a.time_stamp=b.mts
>
> I've found that for my data set, PG 8.4.2 selects the "obvious" / "do it as
> written" plan in each case, and that method 2. is much quicker (2.6 sec vs.
> 2 min on my laptop) ....
>
> Is there a more elegant way to write this, perhaps using PG-specific
> extensions?

one pg specific method that a lot of people overlook for this sort of
problem is custom aggregates.

create or replace function maxfoo(foo, foo) returns foo as
$$
select case when $1.t > $2.t then $1 else $2 end;
$$ language sql immutable;

create aggregate aggfoo(foo)
(
sfunc=maxfoo,
stype=foo
);

create table foo(id int, t timestamptz default now());
insert into foo values (1);
insert into foo values (1);

select (f).* from (select aggfoo(foo) as f from foo group by id) q;

postgres=# select (f).* from (select aggfoo(foo) as f from foo group by id) q;
id | t
----+----------------------------
1 | 2011-02-24 14:01:20.051-06
(1 row)

where this approach can be useful is when you have a very complicated
aggregation condition that can be awkward to express in a join.

merlin


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Dave Crooke" <dcrooke(at)gmail(dot)com>, "pgsql-performance" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Picking out the most recent row using a time stamp column
Date: 2011-02-24 20:18:55
Message-ID: 4D6668CF020000250003AFC9@gw.wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Dave Crooke <dcrooke(at)gmail(dot)com> wrote:

> create table data
> (id_key int,
> time_stamp timestamp without time zone,
> value double precision);
>
> create unique index data_idx on data (id_key, time_stamp);

> I need to find the most recent value for each distinct value of
> id_key.

Well, unless you use timestamp WITH time zone, you might not be able
to do that at all. There are very few places where timestamp
WITHOUT time zone actually makes sense.

> There is no elegant (that I know of) syntax for this

How about this?:

select distinct on (id_key) * from data order by id_key, time_stamp;

> select
> a.id_key, a.time_stamp, a.value
> from
> data a
> where
> a.time_stamp=
> (select max(time_stamp)
> from data b
> where a.id_key=b.id_key)

Rather than the above, I typically find this much faster:

select
a.id_key, a.time_stamp, a.value
from
data a
where not exists
(select * from data b
where b.id_key=a.id_key and b.time_stamp > a.time_stamp)

-Kevin


From: Michael Glaesemann <grzm(at)seespotcode(dot)net>
To: Dave Crooke <dcrooke(at)gmail(dot)com>
Cc: pgsql-performance <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Picking out the most recent row using a time stamp column
Date: 2011-02-24 20:21:49
Message-ID: 08B3368E-E623-4E9C-83F5-AB05B471AF4C@seespotcode.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance


On Feb 24, 2011, at 14:55, Dave Crooke wrote:

> Is there a more elegant way to write this, perhaps using PG-specific
> extensions?

SELECT DISTINCT ON (data.id_key)
data.id_key, data.time_stamp, data.value
FROM data
ORDER BY data.id_key, data.time_stamp DESC;

Michael Glaesemann
grzm seespotcode net


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Dave Crooke" <dcrooke(at)gmail(dot)com>, "Michael Glaesemann" <grzm(at)seespotcode(dot)net>
Cc: "pgsql-performance" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Picking out the most recent row using a time stamp column
Date: 2011-02-24 20:24:15
Message-ID: 4D666A0F020000250003AFCE@gw.wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Michael Glaesemann <grzm(at)seespotcode(dot)net> wrote:

> SELECT DISTINCT ON (data.id_key)
> data.id_key, data.time_stamp, data.value
> FROM data
> ORDER BY data.id_key, data.time_stamp DESC;

Dang! I forgot the DESC in my post! Thanks for showing the
*correct* version.

-Kevin


From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: Dave Crooke <dcrooke(at)gmail(dot)com>, pgsql-performance <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Picking out the most recent row using a time stamp column
Date: 2011-02-24 21:14:59
Message-ID: AANLkTinmw55=b0ESxgwtPjjMCk=Kz9XBVPTBZMbKmmNw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

On Thu, Feb 24, 2011 at 2:18 PM, Kevin Grittner
<Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
> Dave Crooke <dcrooke(at)gmail(dot)com> wrote:
>
>> create table data
>>    (id_key int,
>>     time_stamp timestamp without time zone,
>>     value double precision);
>>
>> create unique index data_idx on data (id_key, time_stamp);
>
>> I need to find the most recent value for each distinct value of
>> id_key.
>
> Well, unless you use timestamp WITH time zone, you might not be able
> to do that at all.  There are very few places where timestamp
> WITHOUT time zone actually makes sense.
>
>> There is no elegant (that I know of) syntax for this
>
> How about this?:
>
> select distinct on (id_key) * from data order by id_key, time_stamp;
>
>> select
>>    a.id_key, a.time_stamp, a.value
>> from
>>    data a
>> where
>>   a.time_stamp=
>>      (select max(time_stamp)
>>       from data b
>>       where a.id_key=b.id_key)
>
> Rather than the above, I typically find this much faster:
>
> select
>   a.id_key, a.time_stamp, a.value
> from
>   data a
> where not exists
>  (select * from data b
>   where b.id_key=a.id_key and b.time_stamp > a.time_stamp)

hm. not only is it faster, but much more flexible...that's definitely
the way to go.

merlin


From: Dave Crooke <dcrooke(at)gmail(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: Michael Glaesemann <grzm(at)seespotcode(dot)net>, pgsql-performance <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Picking out the most recent row using a time stamp column
Date: 2011-02-24 23:38:37
Message-ID: AANLkTimSmjGrCqwZWRXU1_=FW0adbWdxLXMK0n=Vw+Cc@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Thanks to all .... I had a tickling feeling at the back of my mind that
there was a neater answer here. For the record, times (all from in-memory
cached data, averaged over a bunch of runs):

Dependent subquery = 117.9 seconds
Join to temp table = 2.7 sec
DISTINCT ON = 2.7 sec

So the DISTINCT ON may not be quicker, but it sure is tidier.

Cheers
Dave

On Thu, Feb 24, 2011 at 2:24 PM, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov
> wrote:

> Michael Glaesemann <grzm(at)seespotcode(dot)net> wrote:
>
> > SELECT DISTINCT ON (data.id_key)
> > data.id_key, data.time_stamp, data.value
> > FROM data
> > ORDER BY data.id_key, data.time_stamp DESC;
>
> Dang! I forgot the DESC in my post! Thanks for showing the
> *correct* version.
>
> -Kevin
>


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Dave Crooke <dcrooke(at)gmail(dot)com>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Michael Glaesemann <grzm(at)seespotcode(dot)net>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Picking out the most recent row using a time stamp column
Date: 2011-02-25 00:20:00
Message-ID: 4D66F5B0.5080502@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

On 2/24/11 3:38 PM, Dave Crooke wrote:
> Thanks to all .... I had a tickling feeling at the back of my mind that
> there was a neater answer here. For the record, times (all from
> in-memory cached data, averaged over a bunch of runs):
>
> Dependent subquery = 117.9 seconds
> Join to temp table = 2.7 sec
> DISTINCT ON = 2.7 sec

But wait, there's more! You haven't tested the Windowing Function
solution. I'll bet it's even faster.

SELECT id_key, time_stamp, value
FROM (
SELECT id_key, time_stamp, value,
row_number()
OVER ( PARTITION BY id_key
ORDER BY time_stamp DESC)
as ranking
FROM thetable
) as filtered_table
WHERE ranking = 1

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


From: Shaun Thomas <sthomas(at)peak6(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Dave Crooke <dcrooke(at)gmail(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Michael Glaesemann <grzm(at)seespotcode(dot)net>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Picking out the most recent row using a time stamp column
Date: 2011-02-25 00:58:33
Message-ID: 4D66FEB9.6000501@peak6.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

On 02/24/2011 06:20 PM, Josh Berkus wrote:

> SELECT id_key, time_stamp, value
> FROM (
> SELECT id_key, time_stamp, value,
> row_number()
> OVER ( PARTITION BY id_key
> ORDER BY time_stamp DESC)
> as ranking
> FROM thetable
> ) as filtered_table
> WHERE ranking = 1

Why did you use row_number instead of rank?

I am now curious how the speed compares though. I still think the
DISTINCT ON will be faster, but it would be a great surprise.

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 800 | Chicago IL, 60604
312-676-8870
sthomas(at)peak6(dot)com

______________________________________________

See http://www.peak6.com/email_disclaimer.php
for terms and conditions related to this email


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Shaun Thomas <sthomas(at)peak6(dot)com>
Cc: Dave Crooke <dcrooke(at)gmail(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Michael Glaesemann <grzm(at)seespotcode(dot)net>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Picking out the most recent row using a time stamp column
Date: 2011-02-25 01:52:19
Message-ID: 4D670B53.70003@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance


> Why did you use row_number instead of rank?

Because I assumed he only wanted one row in the event of ties.

Hmmm, although with that schema, there won't be ties. So it's pretty
much arbitrary then.

> I am now curious how the speed compares though. I still think the
> DISTINCT ON will be faster, but it would be a great surprise.

Hopefully we'll find out! The windowing functions are usually much
faster for me. I think in 9.0 or 9.1 someone replumbed DISTINCT ON to
use a bunch of the window function internals, at which point it'll cease
to matter.

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


From: Dave Johansen <davejohansen(at)gmail(dot)com>
To: pgsql-performance <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Picking out the most recent row using a time stamp column
Date: 2011-02-25 16:50:40
Message-ID: AANLkTin9K_E37QzwEN2ARbh9SHZWv5DwQ6Go7sZpzGL+@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

On Thu, Feb 24, 2011 at 4:38 PM, Dave Crooke <dcrooke(at)gmail(dot)com> wrote:

> Thanks to all .... I had a tickling feeling at the back of my mind that
> there was a neater answer here. For the record, times (all from in-memory
> cached data, averaged over a bunch of runs):
>
> Dependent subquery = 117.9 seconds
> Join to temp table = 2.7 sec
> DISTINCT ON = 2.7 sec
>
> So the DISTINCT ON may not be quicker, but it sure is tidier.
>
> Cheers
> Dave

I'm using 8.3.3 and I have a similar sort of setup and just thought I'd add
another point of reference, here's the timing from doing the same sort of
queries on my dataset of ~700,000 records with ~10,000 unique "id_key"s.

I also added a 4th version that uses a permanent table that's auto-populated
by a trigger with the rid of the most recent entry from the main table, so
it's a simple join to get the latest entries.

Dependent subquery = (killed it after it ran for over 10 minutes)
Join on temp table = 1.5 seconds
DISTINCT ON = 2.9 seconds
Join on auto-populated table = 0.8 seconds

Dave


From: Dave Crooke <dcrooke(at)gmail(dot)com>
To: Dave Johansen <davejohansen(at)gmail(dot)com>
Cc: pgsql-performance <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Picking out the most recent row using a time stamp column
Date: 2011-02-25 20:45:23
Message-ID: AANLkTi=N0LJp43Zvsyfj=Kd8f0wbF+4-ex8jy1zddWzP@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Hi Dave

Yes, 100% the best solution .... I did the same thing a while back, I just
have a separate copy of the data in a "latest" table and the Java code just
runs a second SQL statement to update it when writing a new record (I've
never been a trigger fan).

I found myself looking at the "find the latest" query again though in the
process of building a "demo mode" into our application, which will replay a
finite set of data on a rolling loop by moving it forward in time, and also
has to simulate the continuous updating of the "latest" table so the the
business logic will be appropriately fooled.

My next tweak will be to cache the "latest" table in the Java layer ;-)

Cheers
Dave

On Fri, Feb 25, 2011 at 10:50 AM, Dave Johansen <davejohansen(at)gmail(dot)com>wrote:

> On Thu, Feb 24, 2011 at 4:38 PM, Dave Crooke <dcrooke(at)gmail(dot)com> wrote:
>
>> Thanks to all .... I had a tickling feeling at the back of my mind that
>> there was a neater answer here. For the record, times (all from in-memory
>> cached data, averaged over a bunch of runs):
>>
>> Dependent subquery = 117.9 seconds
>> Join to temp table = 2.7 sec
>> DISTINCT ON = 2.7 sec
>>
>> So the DISTINCT ON may not be quicker, but it sure is tidier.
>>
>> Cheers
>> Dave
>
>
> I'm using 8.3.3 and I have a similar sort of setup and just thought I'd add
> another point of reference, here's the timing from doing the same sort of
> queries on my dataset of ~700,000 records with ~10,000 unique "id_key"s.
>
> I also added a 4th version that uses a permanent table that's
> auto-populated by a trigger with the rid of the most recent entry from the
> main table, so it's a simple join to get the latest entries.
>
> Dependent subquery = (killed it after it ran for over 10 minutes)
> Join on temp table = 1.5 seconds
> DISTINCT ON = 2.9 seconds
> Join on auto-populated table = 0.8 seconds
>
> Dave
>


From: Dave Johansen <davejohansen(at)gmail(dot)com>
To: pgsql-performance <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Picking out the most recent row using a time stamp column
Date: 2011-02-26 13:44:28
Message-ID: AANLkTim4Bx1rUW0J+5YQ9TXu9eqfRwpotGPnBUb3QTXW@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

On Fri, Feb 25, 2011 at 1:45 PM, Dave Crooke <dcrooke(at)gmail(dot)com> wrote:

> Hi Dave
>
> Yes, 100% the best solution .... I did the same thing a while back, I just
> have a separate copy of the data in a "latest" table and the Java code just
> runs a second SQL statement to update it when writing a new record (I've
> never been a trigger fan).
>
> I found myself looking at the "find the latest" query again though in the
> process of building a "demo mode" into our application, which will replay a
> finite set of data on a rolling loop by moving it forward in time, and also
> has to simulate the continuous updating of the "latest" table so the the
> business logic will be appropriately fooled.
>
> My next tweak will be to cache the "latest" table in the Java layer ;-)
>
> Cheers
> Dave

Our application has what sounds like a similar functionality that we call
"playback". The way that we did it was to have a schema called "playback"
with identical tables to those that we want to have repopulated. All the
other tables exist in only the "public" schema and then we don't have to do
any duplication of that data. Then during playback it just runs a query to
copy from the "public" table to the "playback" table and the trigger will
populate the "latest" table in the "playback" schema automatically just like
when the program is running normally and populating the "public" version.

The secret sauce comes in by setting "SET search_path TO playback, public;"
because then your application runs all the same queries to get the data and
doesn't have to know that anything different is going on other than the copy
coperation that it's doing. It's nice because it takes all of the data
management burden off of the application and then allows the database to do
the hard work for you. It's obviously not the perfect solution but it wasn't
too hard to setup and we've really liked the way it works.

Dave


From: Florian Weimer <fw(at)deneb(dot)enyo(dot)de>
To: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: "Dave Crooke" <dcrooke(at)gmail(dot)com>, "pgsql-performance" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Picking out the most recent row using a time stamp column
Date: 2011-02-26 20:54:50
Message-ID: 87r5auwm79.fsf@mid.deneb.enyo.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

* Kevin Grittner:

> Dave Crooke <dcrooke(at)gmail(dot)com> wrote:
>
>> create table data
>> (id_key int,
>> time_stamp timestamp without time zone,
>> value double precision);
>>
>> create unique index data_idx on data (id_key, time_stamp);
>
>> I need to find the most recent value for each distinct value of
>> id_key.
>
> Well, unless you use timestamp WITH time zone, you might not be able
> to do that at all. There are very few places where timestamp
> WITHOUT time zone actually makes sense.

I don't think PostgreSQL keeps track of actual time zone values, just
as it doesn't keep track of the character encoding of TEXT columns.
Unless suppressed with WITHOUT TIME ZONE, PostgreSQL makes up some
time zone on demand. This makes TIMESTAMP WITH TIME ZONE not that
useful, and it's often to use TIMESTAMP WITHOUT TIME ZONE with times
in UTC.


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Dave Crooke <dcrooke(at)gmail(dot)com>
Cc: Dave Johansen <davejohansen(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Picking out the most recent row using a time stamp column
Date: 2011-02-26 21:06:00
Message-ID: 4D696B38.5070603@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Dave,

Why not test the windowing version I posted?

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


From: Dave Johansen <davejohansen(at)gmail(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Picking out the most recent row using a time stamp column
Date: 2011-02-26 21:38:05
Message-ID: AANLkTikXLneQ1wsYXxS6JqTwnHQdjP54FsXL_NEdOjVF@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Unfortunately, I'm running 8.3.3 and to my knowledge the windowing stuff
wasn't added til 8.4.
Dave
On Feb 26, 2011 2:06 PM, "Josh Berkus" <josh(at)agliodbs(dot)com> wrote:
> Dave,
>
> Why not test the windowing version I posted?
>
> --
> -- Josh Berkus
> PostgreSQL Experts Inc.
> http://www.pgexperts.com


From: Dave Johansen <davejohansen(at)gmail(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Picking out the most recent row using a time stamp column
Date: 2013-04-05 16:54:06
Message-ID: CAAcYxUcJpVCVgxo4mdq6LOyCagVA80oXEpYD0pg0dULeDpf0Tg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

On Sat, Feb 26, 2011 at 2:38 PM, Dave Johansen <davejohansen(at)gmail(dot)com>
wrote:
>
> Unfortunately, I'm running 8.3.3 and to my knowledge the windowing stuff
> wasn't added til 8.4.
> Dave
>
> On Feb 26, 2011 2:06 PM, "Josh Berkus" <josh(at)agliodbs(dot)com> wrote:
> > Dave,
> >
> > Why not test the windowing version I posted?

We finally have moved over to 8.4 and so I just wanted to post the
time comparison numbers to show the times on 8.4 as well. This is also
a newer data set with ~700k rows and ~4k distinct id_key values.

1) Dependent subquery
SELECT a.id_key, a.time_stamp, a.value FROM data AS a WHERE
a.time_stamp = (SELECT MAX(time_stamp) FROM data AS b WHERE a.id_key =
b.id_key);
8.3.3: Killed it after a few minutes
8.4.13: Killed it after a few minutes

2) Join against temporary table
SELECT a.id_key, a.time_stamp, a.value FROM data AS a JOIN (SELECT
id_key, MAX(time_stamp) AS max_time_stamp FROM data GROUP BY id_key)
AS b WHERE a.id_key = b.id_key AND a.time_stamp = b.max_time_stamp;
8.3.3: 1.4 s
8.4.13: 0.5 s

3) DISTINCT ON:
SELECT DISTINCT ON (id_key) id_key, time_stamp, value FROM data ORDER
BY id_key, time_stamp DESC;
Without Index:
8.3.3: 34.1 s
8.4.13: 98.7 s
With Index (data(id_key, time_stamp DESC)):
8.3.3: 3.4 s
8.4.13: 1.3 s

4) Auto-populated table
SELECT id_key, time_stamp, value FROM data WHERE rid IN (SELECT rid
FROM latestdata);
8.3.3: 0.2 s
8.4.13: 0.06 s

5) Windowing
SELECT id_key, time_stamp, value FROM (SELECT id_key, time_stamp,
value, row_number() OVER (PARTITION BY id_key ORDER BY time_stamp
DESC) AS ranking FROM data) AS a WHERE ranking=1;
8.3.3: N/A
8.4.13: 1.6 s

So the auto-populated table (#4) is the fastest by an order of
magnitude, but the join against the temporary table (#2) is the next
best option based on speed and doesn't require the extra multi-column
index that DISTINCT ON (#3) does.

On a related note though, is there a way to make the multi-column
index used in the DISTINCT ON more efficient. Based on the results, it
appears that the multi-column index is actually a single index with
the ordering of the tree based on the first value and then the second
value. Is there a way to make it be a "multi-level index"? What I mean
is that the first value is basically a tree/hash that then points to
the second index because if that's possible then that would probably
make the DISTINCT ON (#3) version as fast or faster than the
auto-populated table (#4). Is there a way to create an index like that
in postgres?

Thanks,
Dave


From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Dave Johansen <davejohansen(at)gmail(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Picking out the most recent row using a time stamp column
Date: 2013-04-05 18:40:16
Message-ID: CAHyXU0zrGJzr-R1rOqA=7m7k6BOdxauvbsfPYSM9+iaQELaK3g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

On Fri, Apr 5, 2013 at 11:54 AM, Dave Johansen <davejohansen(at)gmail(dot)com> wrote:
> On Sat, Feb 26, 2011 at 2:38 PM, Dave Johansen <davejohansen(at)gmail(dot)com>
> wrote:
>>
>> Unfortunately, I'm running 8.3.3 and to my knowledge the windowing stuff
>> wasn't added til 8.4.
>> Dave
>>
>> On Feb 26, 2011 2:06 PM, "Josh Berkus" <josh(at)agliodbs(dot)com> wrote:
>> > Dave,
>> >
>> > Why not test the windowing version I posted?
>
> We finally have moved over to 8.4 and so I just wanted to post the
> time comparison numbers to show the times on 8.4 as well. This is also
> a newer data set with ~700k rows and ~4k distinct id_key values.
>
> 1) Dependent subquery
> SELECT a.id_key, a.time_stamp, a.value FROM data AS a WHERE
> a.time_stamp = (SELECT MAX(time_stamp) FROM data AS b WHERE a.id_key =
> b.id_key);
> 8.3.3: Killed it after a few minutes
> 8.4.13: Killed it after a few minutes
>
> 2) Join against temporary table
> SELECT a.id_key, a.time_stamp, a.value FROM data AS a JOIN (SELECT
> id_key, MAX(time_stamp) AS max_time_stamp FROM data GROUP BY id_key)
> AS b WHERE a.id_key = b.id_key AND a.time_stamp = b.max_time_stamp;
> 8.3.3: 1.4 s
> 8.4.13: 0.5 s
>
> 3) DISTINCT ON:
> SELECT DISTINCT ON (id_key) id_key, time_stamp, value FROM data ORDER
> BY id_key, time_stamp DESC;
> Without Index:
> 8.3.3: 34.1 s
> 8.4.13: 98.7 s
> With Index (data(id_key, time_stamp DESC)):
> 8.3.3: 3.4 s
> 8.4.13: 1.3 s
>
> 4) Auto-populated table
> SELECT id_key, time_stamp, value FROM data WHERE rid IN (SELECT rid
> FROM latestdata);
> 8.3.3: 0.2 s
> 8.4.13: 0.06 s
>
> 5) Windowing
> SELECT id_key, time_stamp, value FROM (SELECT id_key, time_stamp,
> value, row_number() OVER (PARTITION BY id_key ORDER BY time_stamp
> DESC) AS ranking FROM data) AS a WHERE ranking=1;
> 8.3.3: N/A
> 8.4.13: 1.6 s

I would also test:

*) EXISTS()

SELECT a.id_key, a.time_stamp, a.value FROM data
WHERE NOT EXISTS
(
SELECT 1 FROM data b
WHERE
a.id_key = b.id_key
and b.time_stamp > a.time_stamp
);

*) custom aggregate (this will not be the fastest option but is a good
technique to know -- it can be a real life saver when selection
criteria is complex)

CREATE FUNCTION agg_latest_data(data, data) returns data AS
$$
SELECT CASE WHEN $1 > $2 THEN $1 ELSE $2 END;
$$ LANGUAGE SQL IMMUTABLE;

CREATE AGGREGATE latest_data (
SFUNC=agg_latest_data,
STYPE=data
);

SELECT latest_data(d) FROM data d group by d.id_key;

the above returns the composite, not the fields, but that can be worked around.

merlin


From: Dave Johansen <davejohansen(at)gmail(dot)com>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Picking out the most recent row using a time stamp column
Date: 2013-04-08 21:10:53
Message-ID: CAAcYxUcobsw92aoFuqDmSYnEoaLvc63Hs7_o1wwJ1vH6r8nJ9g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

On Fri, Apr 5, 2013 at 11:40 AM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
>
> On Fri, Apr 5, 2013 at 11:54 AM, Dave Johansen <davejohansen(at)gmail(dot)com> wrote:
> > On Sat, Feb 26, 2011 at 2:38 PM, Dave Johansen <davejohansen(at)gmail(dot)com>
> > wrote:
> >>
> >> Unfortunately, I'm running 8.3.3 and to my knowledge the windowing stuff
> >> wasn't added til 8.4.
> >> Dave
> >>
> >> On Feb 26, 2011 2:06 PM, "Josh Berkus" <josh(at)agliodbs(dot)com> wrote:
> >> > Dave,
> >> >
> >> > Why not test the windowing version I posted?
> >
> > We finally have moved over to 8.4 and so I just wanted to post the
> > time comparison numbers to show the times on 8.4 as well. This is also
> > a newer data set with ~700k rows and ~4k distinct id_key values.
> >
> > 1) Dependent subquery
> > SELECT a.id_key, a.time_stamp, a.value FROM data AS a WHERE
> > a.time_stamp = (SELECT MAX(time_stamp) FROM data AS b WHERE a.id_key =
> > b.id_key);
> > 8.3.3: Killed it after a few minutes
> > 8.4.13: Killed it after a few minutes
> >
> > 2) Join against temporary table
> > SELECT a.id_key, a.time_stamp, a.value FROM data AS a JOIN (SELECT
> > id_key, MAX(time_stamp) AS max_time_stamp FROM data GROUP BY id_key)
> > AS b WHERE a.id_key = b.id_key AND a.time_stamp = b.max_time_stamp;
> > 8.3.3: 1.4 s
> > 8.4.13: 0.5 s
> >
> > 3) DISTINCT ON:
> > SELECT DISTINCT ON (id_key) id_key, time_stamp, value FROM data ORDER
> > BY id_key, time_stamp DESC;
> > Without Index:
> > 8.3.3: 34.1 s
> > 8.4.13: 98.7 s
> > With Index (data(id_key, time_stamp DESC)):
> > 8.3.3: 3.4 s
> > 8.4.13: 1.3 s
> >
> > 4) Auto-populated table
> > SELECT id_key, time_stamp, value FROM data WHERE rid IN (SELECT rid
> > FROM latestdata);
> > 8.3.3: 0.2 s
> > 8.4.13: 0.06 s
> >
> > 5) Windowing
> > SELECT id_key, time_stamp, value FROM (SELECT id_key, time_stamp,
> > value, row_number() OVER (PARTITION BY id_key ORDER BY time_stamp
> > DESC) AS ranking FROM data) AS a WHERE ranking=1;
> > 8.3.3: N/A
> > 8.4.13: 1.6 s
>
> I would also test:
>
> *) EXISTS()
>
> SELECT a.id_key, a.time_stamp, a.value FROM data
> WHERE NOT EXISTS
> (
> SELECT 1 FROM data b
> WHERE
> a.id_key = b.id_key
> and b.time_stamp > a.time_stamp
> );

I tried this and it was slow:
8.3.3: 674.4 s
8.4.13: 40.4 s

>
> *) custom aggregate (this will not be the fastest option but is a good
> technique to know -- it can be a real life saver when selection
> criteria is complex)
>
> CREATE FUNCTION agg_latest_data(data, data) returns data AS
> $$
> SELECT CASE WHEN $1 > $2 THEN $1 ELSE $2 END;
> $$ LANGUAGE SQL IMMUTABLE;
>
> CREATE AGGREGATE latest_data (
> SFUNC=agg_latest_data,
> STYPE=data
> );
>
> SELECT latest_data(d) FROM data d group by d.id_key;
>
> the above returns the composite, not the fields, but that can be worked around.

My real table actually returns/needs all the values from the row so I
didn't feel like messing with aggregate stuff.

Thanks,
Dave