From: Alessandro Gagliardi <alessandro(at)path(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: efficient data reduction (and deduping)
Date: 2012-03-01 18:27:27
Message-ID: CAAB3BBJQxQ1VVd-mub90a=2H6ezNXx1_bRL28VoxZH5EXWE0Lw@mail.gmail.com

Hi folks,

I have a system that racks up about 40M log lines per day. I'm able to COPY
the log files into a PostgreSQL table that looks like this:

CREATE TABLE activity_unlogged
(
  user_id character(24) NOT NULL,
  client_ip inet,
  hr_timestamp timestamp without time zone,
  locale character varying,
  log_id character(36),
  method character varying(6),
  server_ip inet,
  uri character varying,
  user_agent character varying
);

Now, I want to reduce that data to get the last activity that was performed
by each user in any given hour. It should fit into a table like this:

CREATE TABLE hourly_activity
(
  activity_hour timestamp without time zone NOT NULL,
  user_id character(24) NOT NULL,
  client_ip inet,
  hr_timestamp timestamp without time zone,
  locale character varying,
  log_id character(36),
  method character varying(6),
  server_ip inet,
  uri character varying,
  user_agent character varying,
  CONSTRAINT hourly_activity_pkey PRIMARY KEY (activity_hour, user_id)
);

where activity_hour is date_trunc('hour', hr_timestamp). (N.B. the
primary key constraint.)

I am attempting to do that with the following:

INSERT INTO hourly_activity
SELECT DISTINCT date_trunc('hour', hr_timestamp) AS activity_hour,
       activity_unlogged.user_id,
       client_ip, hr_timestamp, locale, log_id, method,
       server_ip, uri, user_agent
FROM activity_unlogged,
     (SELECT user_id, MAX(hr_timestamp) AS last_timestamp
      FROM activity_unlogged
      GROUP BY user_id, date_trunc('hour', hr_timestamp)) AS last_activity
WHERE activity_unlogged.user_id = last_activity.user_id
  AND activity_unlogged.hr_timestamp = last_activity.last_timestamp;

I have two problems:

1. It's incredibly slow (like: hours). I assume this is because I am
scanning through a huge unindexed table twice. I imagine there is a more
efficient way to do this, but I can't think of what it is. If I were doing
this in a procedural programming language, it might look something like:

    for row in activity_unlogged:
        key = (date_trunc('hour', row.hr_timestamp), row.user_id)
        if key in hourly_activity:
            if row.hr_timestamp > hourly_activity[key].hr_timestamp:
                hourly_activity[key] = row  # UPDATE
        else:
            hourly_activity[key] = row  # INSERT

I suspect some implementation of this (hopefully my pseudocode is at
least somewhat comprehensible) would be very slow as well, but at least it
would only go through activity_unlogged once. (Then again, it would have
to rescan hourly_activity each time, so it really wouldn't be any faster
at all, would it?) I feel like there must be a more efficient way to do
this in SQL, though I can't put my finger on it.
2. It turns out (hr_timestamp, user_id) is not unique. So selecting WHERE
activity_unlogged.user_id = last_activity.user_id AND
activity_unlogged.hr_timestamp = last_activity.last_timestamp can return
multiple rows for the same user and hour, leading to a primary key
collision. In such cases I don't really care which of the rows is picked;
I just want to make sure that no more than one row is inserted per user
per hour. In fact, though I would prefer to get the last row for each
hour, I could probably get much the same effect if I just limited it to
one per hour. Though I don't know if that really helps at all.
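
PostgreSQL's DISTINCT ON can express both requirements in a single scan:
it emits exactly one row per distinct key, keeping the first row under
the given ORDER BY. A minimal sketch against the tables above (ties on
hr_timestamp are broken arbitrarily, which matches the "no more than one
row per user per hour" requirement):

INSERT INTO hourly_activity
-- One pass: keep the first row per (hour, user_id) under the ORDER BY
-- below, i.e. the one with the latest hr_timestamp.
SELECT DISTINCT ON (date_trunc('hour', hr_timestamp), user_id)
       date_trunc('hour', hr_timestamp) AS activity_hour,
       user_id, client_ip, hr_timestamp, locale, log_id, method,
       server_ip, uri, user_agent
FROM activity_unlogged
ORDER BY date_trunc('hour', hr_timestamp), user_id, hr_timestamp DESC;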


From: Claudio Freire <klaussfreire(at)gmail(dot)com>
To: Alessandro Gagliardi <alessandro(at)path(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: efficient data reduction (and deduping)
Date: 2012-03-01 18:35:26
Message-ID: CAGTBQpYfx9Vt6zMK40jziMcHHRLZABDYQiTwq1SNkkE4F4k03w@mail.gmail.com

On Thu, Mar 1, 2012 at 3:27 PM, Alessandro Gagliardi
<alessandro(at)path(dot)com> wrote:
> INSERT INTO hourly_activity
> SELECT DISTINCT date_trunc('hour', hr_timestamp) AS activity_hour,
>        activity_unlogged.user_id,
>        client_ip, hr_timestamp, locale, log_id, method,
>        server_ip, uri, user_agent
> FROM activity_unlogged,
>      (SELECT user_id, MAX(hr_timestamp) AS last_timestamp
>       FROM activity_unlogged
>       GROUP BY user_id, date_trunc('hour', hr_timestamp)) AS last_activity
> WHERE activity_unlogged.user_id = last_activity.user_id
>   AND activity_unlogged.hr_timestamp = last_activity.last_timestamp;

Try

INSERT INTO hourly_activity
SELECT ... everything from au1 ...
FROM activity_unlogged au1
LEFT JOIN activity_unlogged au2
       ON au2.user_id = au1.user_id
      AND date_trunc('hour', au2.hr_timestamp) = date_trunc('hour', au1.hr_timestamp)
      AND au2.hr_timestamp < au1.hr_timestamp
WHERE au2.user_id IS NULL;
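
For this to beat the original query on a 40M-row unindexed table, the
anti-join would need index lookups rather than repeated full scans; a
sketch of an index matching the join keys (an assumption worth checking
with EXPLAIN, not something settled in the thread):

CREATE INDEX activity_unlogged_user_hour_idx
    ON activity_unlogged (user_id, (date_trunc('hour', hr_timestamp)), hr_timestamp);
-- Covers the equality on user_id, the equality on the truncated hour,
-- and the range comparison on hr_timestamp.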


From: Craig James <cjames(at)emolecules(dot)com>
To: Alessandro Gagliardi <alessandro(at)path(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: efficient data reduction (and deduping)
Date: 2012-03-01 18:35:27
Message-ID: CAFwQ8rfmR1xLr=Z=xpXBoJpaZxkbNerBuMCntzm0eMsic2nyhA@mail.gmail.com

On Thu, Mar 1, 2012 at 10:27 AM, Alessandro Gagliardi
<alessandro(at)path(dot)com> wrote:
> Hi folks,
>
> I have a system that racks up about 40M log lines per day. I'm able to COPY
> the log files into a PostgreSQL table that looks like this:

Since you're using a COPY command and the table has a simple column
with exactly the value you want, why not filter it using grep(1) or
something similar and load the filtered result directly into the
hourly table?

Craig



From: Peter van Hardenberg <pvh(at)pvh(dot)ca>
To: Alessandro Gagliardi <alessandro(at)path(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: efficient data reduction (and deduping)
Date: 2012-03-01 18:40:20
Message-ID: CAAcg=kVXjbd_CSaMQUB5iigXO5x+7QyX12z8xHueHd4_15p7nQ@mail.gmail.com

On Thu, Mar 1, 2012 at 10:27 AM, Alessandro Gagliardi
<alessandro(at)path(dot)com> wrote:
> Now, I want to reduce that data to get the last activity that was performed
> by each user in any given hour. It should fit into a table like this:
>

How about:

1) Create an expression-based index on date_trunc('hour', hr_timestamp)
2) Create a view on that showing the last value
3) If you want to throw away the data, use CREATE TABLE AS on the
results of the view.

You may also want to investigate window functions.

-p

--
Peter van Hardenberg
San Francisco, California
"Everything was beautiful, and nothing hurt." -- Kurt Vonnegut


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Alessandro Gagliardi" <alessandro(at)path(dot)com>, <pgsql-performance(at)postgresql(dot)org>
Subject: Re: efficient data reduction (and deduping)
Date: 2012-03-01 18:51:42
Message-ID: 4F4F70DE0200002500045D73@gw.wicourts.gov

Alessandro Gagliardi <alessandro(at)path(dot)com> wrote:

> hr_timestamp timestamp without time zone,

In addition to the responses which more directly answer your
question, I feel I should point out that this will not represent a
single moment in time. At the end of Daylight Saving Time, the
value will jump backward and you will run through a range of time
which will overlap existing entries. There is almost never a good
reason to use TIMESTAMP WITHOUT TIME ZONE -- TIMESTAMP WITH TIME
ZONE is required if you want the value to represent a moment in
time.

-Kevin
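
Kevin's point is easy to demonstrate; a small illustration, assuming a
session timezone that observes US daylight saving (2011 rules):

SET timezone = 'America/Los_Angeles';
-- Two distinct instants, an hour apart in UTC...
SELECT ('2011-11-06 08:30:00+00'::timestamptz)::timestamp AS first_pass,   -- 01:30 PDT
       ('2011-11-06 09:30:00+00'::timestamptz)::timestamp AS second_pass;  -- 01:30 PST
-- ...both come out as 2011-11-06 01:30:00 once the zone is cast away,
-- which is what a timestamp WITHOUT time zone column would record.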


From: Alessandro Gagliardi <alessandro(at)path(dot)com>
To: Peter van Hardenberg <pvh(at)pvh(dot)ca>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: efficient data reduction (and deduping)
Date: 2012-03-01 19:28:32
Message-ID: CAAB3BBKS2ncPZROej1OtE_k-DAEsRpJ2JP8w7zhoVJ8+i3EfDA@mail.gmail.com

I was thinking of adding an index, but thought it would be pointless since
I would only be using the index once before dropping the table (after it's
loaded into hourly_activity). I assumed it would take longer to create the
index and then use it than to just seq scan once or twice. Am I wrong in
that assumption?

On Thu, Mar 1, 2012 at 10:40 AM, Peter van Hardenberg <pvh(at)pvh(dot)ca> wrote:

> On Thu, Mar 1, 2012 at 10:27 AM, Alessandro Gagliardi
> <alessandro(at)path(dot)com> wrote:
> > Now, I want to reduce that data to get the last activity that was
> performed
> > by each user in any given hour. It should fit into a table like this:
> >
>
> How about:
>
> 1) Create an expression-based index on date_trunc('hour', hr_timestamp)
> 2) Create a view on that showing the last value
> 3) If you want to throw away the data, use CREATE TABLE AS on the
> results of the view.
>
> You may also want to investigate window functions.
>
> -p
>
> --
> Peter van Hardenberg
> San Francisco, California
> "Everything was beautiful, and nothing hurt." -- Kurt Vonnegut
>


From: Alessandro Gagliardi <alessandro(at)path(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: efficient data reduction (and deduping)
Date: 2012-03-01 19:29:00
Message-ID: CAAB3BB+c4+Y5E0hzsfo6fi-y1acOZFYjE+EY2cJmYZM4h0TjpQ@mail.gmail.com

All of our servers run in UTC specifically to avoid this sort of problem.
It's kind of annoying actually, because we're a San Francisco company and
so whenever I have to do daily analytics, I have to shift everything to
Pacific. But in this case it's handy. Thanks for the keen eye though.

On Thu, Mar 1, 2012 at 10:51 AM, Kevin Grittner
<Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:

> Alessandro Gagliardi <alessandro(at)path(dot)com> wrote:
>
> > hr_timestamp timestamp without time zone,
>
> In addition to the responses which more directly answer your
> question, I feel I should point out that this will not represent a
> single moment in time. At the end of Daylight Saving Time, the
> value will jump backward and you will run through a range of time
> which will overlap existing entries. There is almost never a good
> reason to use TIMESTAMP WITHOUT TIME ZONE -- TIMESTAMP WITH TIME
> ZONE is required if you want the value to represent a moment in
> time.
>
> -Kevin
>


From: Alessandro Gagliardi <alessandro(at)path(dot)com>
To: Craig James <cjames(at)emolecules(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: efficient data reduction (and deduping)
Date: 2012-03-01 19:30:06
Message-ID: CAAB3BBJNq_utncTKD1NGVfNWTkYmaXgmG269A1+M0QTobca9fw@mail.gmail.com

Hah! Yeah, that might work. Except that I suck at grep. :(
Perhaps that's a weakness I should remedy.

On Thu, Mar 1, 2012 at 10:35 AM, Craig James <cjames(at)emolecules(dot)com> wrote:

> On Thu, Mar 1, 2012 at 10:27 AM, Alessandro Gagliardi
> <alessandro(at)path(dot)com> wrote:
> > Hi folks,
> >
> > I have a system that racks up about 40M log lines per day. I'm able to
> COPY
> > the log files into a PostgreSQL table that looks like this:
>
> Since you're using a COPY command and the table has a simple column
> with exactly the value you want, why not filter it using grep(1) or
> something similar and load the filtered result directly into the
> hourly table?
>
> Craig


From: Alessandro Gagliardi <alessandro(at)path(dot)com>
To: Claudio Freire <klaussfreire(at)gmail(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: efficient data reduction (and deduping)
Date: 2012-03-01 19:35:48
Message-ID: CAAB3BBL_6ju5QS2qEbmRAtePN2BOi==dBig925RjXH2QyyGzwA@mail.gmail.com

Interesting solution. If I'm not mistaken, this does solve the problem of
having two entries for the same user at the exact same time (which violates
my pk constraint), but it does so by leaving both of them out (since there
is no au1.hr_timestamp > au2.hr_timestamp in that case). Is that right?

On Thu, Mar 1, 2012 at 10:35 AM, Claudio Freire
<klaussfreire(at)gmail(dot)com> wrote:
>
> Try
>
> INSERT INTO hourly_activity
> SELECT ... everything from au1 ...
> FROM activity_unlogged au1
> LEFT JOIN activity_unlogged au2
>        ON au2.user_id = au1.user_id
>       AND date_trunc('hour', au2.hr_timestamp) = date_trunc('hour', au1.hr_timestamp)
>       AND au2.hr_timestamp < au1.hr_timestamp
> WHERE au2.user_id IS NULL;


From: Claudio Freire <klaussfreire(at)gmail(dot)com>
To: Alessandro Gagliardi <alessandro(at)path(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: efficient data reduction (and deduping)
Date: 2012-03-01 19:39:12
Message-ID: CAGTBQpaBKj8e8wrBrF8+HjNyNKOrJrCBqdCyj1ZR+3ssXjK3sQ@mail.gmail.com

On Thu, Mar 1, 2012 at 4:35 PM, Alessandro Gagliardi
<alessandro(at)path(dot)com> wrote:
> Interesting solution. If I'm not mistaken, this does solve the problem of
> having two entries for the same user at the exact same time (which violates
> my pk constraint) but it does so by leaving both of them out (since there is
> no au1.hr_timestamp > au2.hr_timestamp in that case). Is that right?

Yes, but it would have to be the same *exact* time (not just the same hour).

You can use more fields to disambiguate too, i.e.:

au1.hr_timestamp > au2.hr_timestamp OR (au1.hr_timestamp =
au2.hr_timestamp AND au1.some_other_field > au2.some_other_field)

If you have a sequential id to use in disambiguation, it would be best.


From: Alessandro Gagliardi <alessandro(at)path(dot)com>
To: Claudio Freire <klaussfreire(at)gmail(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: efficient data reduction (and deduping)
Date: 2012-03-01 19:43:54
Message-ID: CAAB3BBLG4JL+cURYLKCobki=D43Nva=w2taTcEy+M+NHqpL_9g@mail.gmail.com

Ah, yes, that makes sense. Thank you!

On Thu, Mar 1, 2012 at 11:39 AM, Claudio Freire
<klaussfreire(at)gmail(dot)com> wrote:

> On Thu, Mar 1, 2012 at 4:35 PM, Alessandro Gagliardi
> <alessandro(at)path(dot)com> wrote:
> > Interesting solution. If I'm not mistaken, this does solve the problem of
> > having two entries for the same user at the exact same time (which
> violates
> > my pk constraint) but it does so by leaving both of them out (since
> there is
> > no au1.hr_timestamp > au2.hr_timestamp in that case). Is that right?
>
> Yes, but it would have to be the same *exact* time (not just the same hour).
>
> You can use more fields to disambiguate too, i.e.:
>
> au1.hr_timestamp > au2.hr_timestamp OR (au1.hr_timestamp =
> au2.hr_timestamp AND au1.some_other_field > au2.some_other_field)
>
> If you have a sequential id to use in disambiguation, it would be best.
>


From: Claudio Freire <klaussfreire(at)gmail(dot)com>
To: Alessandro Gagliardi <alessandro(at)path(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: efficient data reduction (and deduping)
Date: 2012-03-01 19:44:22
Message-ID: CAGTBQpY-e-TTnd-+7wWeKZ8ecYjdjqeRt9LRN15toxG0To_o4g@mail.gmail.com

On Thu, Mar 1, 2012 at 4:39 PM, Claudio Freire <klaussfreire(at)gmail(dot)com> wrote:
>> Interesting solution. If I'm not mistaken, this does solve the problem of
>> having two entries for the same user at the exact same time (which violates
>> my pk constraint) but it does so by leaving both of them out (since there is
>> no au1.hr_timestamp > au2.hr_timestamp in that case). Is that right?
>
> Yes, but it would have to be the same *exact* time (not just the same hour).
>
> You can use more fields to disambiguate too, i.e.:
>
> au1.hr_timestamp > au2.hr_timestamp OR (au1.hr_timestamp =
> au2.hr_timestamp AND au1.some_other_field > au2.some_other_field)
>
> If you have a sequential id to use in disambiguation, it would be best.

Sorry for double posting - but you can also *generate* such an identifier:

CREATE SEQUENCE temp_seq;

WITH identified_au AS (
    SELECT nextval('temp_seq') AS id, *
    FROM activity_unlogged
)
INSERT INTO hourly_activity
SELECT ... everything from au1 ...
FROM identified_au au1
LEFT JOIN identified_au au2
       ON au2.user_id = au1.user_id
      AND date_trunc('hour', au2.hr_timestamp) = date_trunc('hour', au1.hr_timestamp)
      AND (au2.hr_timestamp < au1.hr_timestamp
           OR (au2.hr_timestamp = au1.hr_timestamp AND au2.id < au1.id))
WHERE au2.user_id IS NULL;

Should work if you have 9.1 or later (a WITH clause attached to INSERT needs 9.1).


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Alessandro Gagliardi" <alessandro(at)path(dot)com>
Cc: <pgsql-performance(at)postgresql(dot)org>
Subject: Re: efficient data reduction (and deduping)
Date: 2012-03-01 20:12:02
Message-ID: 4F4F83B20200002500045D82@gw.wicourts.gov

Alessandro Gagliardi <alessandro(at)path(dot)com> wrote:

> All of our servers run in UTC specifically to avoid this sort of
> problem. It's kind of annoying actually, because we're a San
> Francisco company and so whenever I have to do daily analytics, I
> have to shift everything to Pacific. But in this case it's handy.

If that's working for you, you might just want to leave it alone;
but just so you know:

If you declare the column as TIMESTAMP WITH TIME ZONE, it stores
the timestamp as UTC regardless of what your server's definition of
time zone is. (It doesn't actually store the time zone -- it just
normalizes the time into UTC for storage.) On retrieval it shows
that UTC moment in the local timezone. So, all that work you're
doing to switch the time zone info around would be pretty automatic
if you used the other type.

-Kevin
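
A short sketch of the retrieval behavior Kevin describes, assuming a
TIMESTAMP WITH TIME ZONE value; the stored instant never changes, only
its rendering does:

SET timezone = 'UTC';
SELECT now();                                     -- rendered in UTC
SET timezone = 'America/Los_Angeles';
SELECT now();                                     -- same instant, rendered in Pacific
-- Or convert per query, without touching the session setting:
SELECT now() AT TIME ZONE 'America/Los_Angeles';  -- timestamp in Pacific local time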