Re: [PERFORMANCE] slow small delete on large table

From: "Ed L(dot)" <pgsql(at)bluepolka(dot)net>
To: pgsql-performance(at)postgresql(dot)org
Subject: [PERFORMANCE] slow small delete on large table
Date: 2004-02-24 02:10:57
Message-ID: 200402231910.57078.pgsql@bluepolka.net
Lists: pgsql-performance


A 7.3.4 question...

I want to "expire" some data after 90 days, but not delete too
much at once, so as not to overwhelm a system with precariously
balanced disk I/O, on a table with millions of rows. If I
could say it the way I think, a simple example would be
like this:

delete from mytable
where posteddatetime < now() - '90 days'
limit 100;

Of course, that's not legal 7.3.4 syntax. These are both too
slow due to a sequential scan of the table:

delete from mytable where key in (
    select key
    from mytable
    where posteddatetime < now() - '90 days'
    limit 100);
or
delete from mytable where exists (
    select m.key
    from mytable m
    where m.key = mytable.key
    and m.posteddatetime < now() - '90 days'
    limit 100);

Tried to use a cursor, but couldn't figure out the syntax
for select-for-delete yet, or find an appropriate example on
Google. Any clues?

TIA.


From: Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>
To: "Ed L(dot)" <pgsql(at)bluepolka(dot)net>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: [PERFORMANCE] slow small delete on large table
Date: 2004-02-24 02:34:00
Message-ID: 403AB818.9090100@familyhealth.com.au
Lists: pgsql-performance

> Of course, that's not legal 7.3.4 syntax. These are both too
> slow due to a sequential scan of the table:
>
> delete from mytable where key in (
>     select key
>     from mytable
>     where posteddatetime < now() - '90 days'
>     limit 100);

Upgrade to 7.4 - the query above will be vastly faster.

> delete from mytable where exists (
>     select m.key
>     from mytable m
>     where m.key = mytable.key
>     and m.posteddatetime < now() - '90 days'
>     limit 100);

That one I used to use on 7.3 - I seem to recall it indexed nicely.

Chris


From: Bruno Wolff III <bruno(at)wolff(dot)to>
To: "Ed L(dot)" <pgsql(at)bluepolka(dot)net>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: [PERFORMANCE] slow small delete on large table
Date: 2004-02-24 03:00:17
Message-ID: 20040224030017.GA2872@wolff.to
Lists: pgsql-performance

On Mon, Feb 23, 2004 at 19:10:57 -0700,
"Ed L." <pgsql(at)bluepolka(dot)net> wrote:
>
> A 7.3.4 question...
>
> I want to "expire" some data after 90 days, but not delete too
> much at once, so as not to overwhelm a system with precariously
> balanced disk I/O, on a table with millions of rows. If I
> could say it the way I think, a simple example would be
> like this:

If there aren't foreign keys into the table from which rows are being
deleted, then a delete shouldn't have a big impact on the system.
If you do the expires frequently, then there won't be as many records
to delete at one time. The other response showed you how to avoid the
sequential scan, which is the other part of the problem.


From: Christopher Browne <cbbrowne(at)acm(dot)org>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: [PERFORMANCE] slow small delete on large table
Date: 2004-02-24 03:48:29
Message-ID: m3oerp6stu.fsf@wolfe.cbbrowne.com
Lists: pgsql-performance

pgsql(at)bluepolka(dot)net ("Ed L.") wrote:
> A 7.3.4 question...
>
> I want to "expire" some data after 90 days, but not delete too
> much at once, so as not to overwhelm a system with precariously
> balanced disk I/O, on a table with millions of rows. If I
> could say it the way I think, a simple example would be
> like this:
>
> delete from mytable
> where posteddatetime < now() - '90 days'
> limit 100;
>
> Of course, that's not legal 7.3.4 syntax. These are both too
> slow due to a sequential scan of the table:
>
> delete from mytable where key in (
>     select key
>     from mytable
>     where posteddatetime < now() - '90 days'
>     limit 100);
> or
> delete from mytable where exists (
>     select m.key
>     from mytable m
>     where m.key = mytable.key
>     and m.posteddatetime < now() - '90 days'
>     limit 100);
>
> Tried to use a cursor, but couldn't figure out the syntax
> for select-for-delete yet, or find an appropriate example on
> Google. Any clues?

I'm hoping that there's an index on posteddatetime, right?

There are several approaches that would be quite sensible to consider...

1. Delete records as often as possible, so that the number deleted at
any given time stays small.

2. Or find an hour at which the system isn't busy, and blow through a
lot of them then.

3. Open a cursor querying records in your acceptable range, e.g.

declare nukem cursor for
    select key from mytable
    where posteddatetime < now() - '90 days'::interval;

Fetch 100 entries from the cursor, and submit, across another
connection, delete requests for the 100 entries, all as one
transaction, which you commit.

Sleep a bit, and fetch another 100.
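
In sketch form, the pair of connections looks something like this
(untested; the batch size and the literal keys are only illustrative):

-- connection 1: open the cursor inside a transaction
begin;
declare nukem cursor for
    select key from mytable
    where posteddatetime < now() - '90 days'::interval;
fetch 100 from nukem;

-- connection 2: delete the keys that fetch returned, as one transaction
begin;
delete from mytable where key in (40001, 40017, 40023);  -- the fetched keys
commit;

-- back on connection 1: sleep a bit, then
fetch 100 from nukem;

(A list of literal keys in an IN clause indexes fine even on 7.3; it's
only IN (subquery) that's slow before 7.4.)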

Note that the cursor will draw groups of 100 entries into memory;
it's good to immediately delete them, as they'll be in buffers.
Keeping the number of rows deleted small, and sleeping a bit, means
you're not trashing buffers too badly. The query doesn't enforce
any particular order on things; it effect chews out old entries in
any order the query finds them. If you can't keep up with
insertions, there could be rather old entries that would linger
around...

This parallels the "sleepy vacuum" approach, which uses a similar
strategy to keep vacuums from destroying performance.

4. Rotor tables.

Have "mytable" be a view on a sequence of tables.

create view mytable as
select * from mytable1
union all
select * from mytable2
union all
select * from mytable3
union all
select * from mytable4
union all
select * from mytable5
union all
select * from mytable6
union all
select * from mytable7
union all
select * from mytable8
union all
select * from mytable9
union all
select * from mytable10

A rule can choose an appropriate table from the 10 to _actually_ insert
into.

Every 3 days, you truncate the eldest table and rotate on to insert
into the next table.
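
The insert side, as a sketch (assuming, for illustration, that mytable
has just the two columns (key, posteddatetime)):

create rule mytable_ins as on insert to mytable
    do instead
    insert into mytable1 values (new.key, new.posteddatetime);

-- every 3 days: truncate the eldest table, then re-point the rule at it
truncate table mytable2;
create or replace rule mytable_ins as on insert to mytable
    do instead
    insert into mytable2 values (new.key, new.posteddatetime);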

That will take mere moments, which is a real help in saving I/O on
the deletes.

There is an unfortunate problem with this: joins against mytable
are pretty bad, and self-joins effectively turn into a union all
across 100 joins. (Table 1 against 1-10, Table 2 against 1-10, and so
forth...)

For this not to suck rather incredibly requires fairly careful
structuring of queries on the table. That may or may not be compatible
with your needs...
--
let name="cbbrowne" and tld="cbbrowne.com" in name ^ "@" ^ tld;;
http://cbbrowne.com/info/x.html
A Linux machine! because a 486 is a terrible thing to waste!
-- <jjs(at)wintermute(dot)ucr(dot)edu> Joe Sloan


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Ed L(dot)" <pgsql(at)bluepolka(dot)net>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: [PERFORMANCE] slow small delete on large table
Date: 2004-02-24 05:23:58
Message-ID: 18315.1077600238@sss.pgh.pa.us
Lists: pgsql-performance

"Ed L." <pgsql(at)bluepolka(dot)net> writes:
> If I could say it the way I think, a simple example would be
> like this:

> delete from mytable
> where posteddatetime < now() - '90 days'
> limit 100;

> Of course, that's not legal 7.3.4 syntax.

Assuming you have a primary key on the table, consider this:

CREATE TEMP TABLE doomed AS
SELECT key FROM mytable WHERE posteddatetime < now() - '90 days'
LIMIT 100;

DELETE FROM mytable WHERE key = doomed.key;

DROP TABLE doomed;

Depending on the size of mytable, you might need an "ANALYZE doomed"
in there, but I'm suspecting not. A quick experiment suggests that
you'll get a plan with an inner indexscan on mytable.key, which is
exactly what you need.
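
If it does prove necessary, it just slots in between the CREATE and the
DELETE, e.g. (one possible arrangement, wrapped in a transaction):

BEGIN;
CREATE TEMP TABLE doomed AS
  SELECT key FROM mytable WHERE posteddatetime < now() - '90 days'
  LIMIT 100;
ANALYZE doomed;
DELETE FROM mytable WHERE key = doomed.key;
DROP TABLE doomed;
COMMIT;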

See also Chris Browne's excellent suggestions nearby, if you are willing
to make larger readjustments in your thinking...

regards, tom lane


From: "Ed L(dot)" <pgsql(at)bluepolka(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: [PERFORMANCE] slow small delete on large table
Date: 2004-02-24 18:36:08
Message-ID: 200402241136.08556.pgsql@bluepolka.net
Lists: pgsql-performance

On Monday February 23 2004 10:23, Tom Lane wrote:
> "Ed L." <pgsql(at)bluepolka(dot)net> writes:
> > If I could say it the way I think, a simple example would be
> > like this:
> >
> > delete from mytable
> > where posteddatetime < now() - '90 days'
> > limit 100;
> >
> > Of course, that's not legal 7.3.4 syntax.
>
> Assuming you have a primary key on the table, consider this:
>
> CREATE TEMP TABLE doomed AS
> SELECT key FROM mytable WHERE posteddatetime < now() - '90 days'
> LIMIT 100;
>
> DELETE FROM mytable WHERE key = doomed.key;
>
> DROP TABLE doomed;
>
> Depending on the size of mytable, you might need an "ANALYZE doomed"
> in there, but I'm suspecting not. A quick experiment suggests that
> you'll get a plan with an inner indexscan on mytable.key, which is
> exactly what you need.

I didn't mention I'd written a trigger to delete N rows on each new
insert (with a delay governor preventing deletion avalanches). The
approach looks a little heavy to be done from within a trigger with the
response time I need, but I'll try it. Cantcha just toss in that "limit N"
functionality to delete clauses? How hard could that be? ;)
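
The trigger is along these lines (a simplified sketch: the delay
governor is omitted and the batch size here is invented):

create function expire_batch() returns trigger as '
declare
    r record;
begin
    -- grab a small batch of expired keys via the posteddatetime index,
    -- then delete each row by primary key (avoiding the slow 7.3
    -- IN-subquery plan)
    for r in select key from mytable
             where posteddatetime < now() - ''90 days''::interval
             limit 10
    loop
        delete from mytable where key = r.key;
    end loop;
    return new;
end;
' language plpgsql;

create trigger expire_trig after insert on mytable
    for each row execute procedure expire_batch();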

> See also Chris Browne's excellent suggestions nearby, if you are willing
> to make larger readjustments in your thinking...

I did a search for articles by Chris Browne, didn't see one that appeared
relevant. What is the thread subject to which you refer?


From: Christopher Browne <cbbrowne(at)acm(dot)org>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: [PERFORMANCE] slow small delete on large table
Date: 2004-02-24 19:12:54
Message-ID: m3r7wk5m15.fsf@wolfe.cbbrowne.com
Lists: pgsql-performance

After a long battle with technology, pgsql(at)bluepolka(dot)net ("Ed L."), an earthling, wrote:
> On Monday February 23 2004 10:23, Tom Lane wrote:
>> "Ed L." <pgsql(at)bluepolka(dot)net> writes:
>> Depending on the size of mytable, you might need an "ANALYZE doomed"
>> in there, but I'm suspecting not. A quick experiment suggests that
>> you'll get a plan with an inner indexscan on mytable.key, which is
>> exactly what you need.
>
> I didn't mention I'd written a trigger to delete N rows on each new
> insert (with a delay governor preventing deletion avalanches). The
> approach looks a little heavy to be done from within a trigger with the
> response time I need, but I'll try it. Cantcha just toss in that "limit N"
> functionality to delete clauses? How hard could that be? ;)

It's nonstandard, which will get you a certain amount of opposition
"for free"; the problem with nonstandard behaviour is that sometimes
the implications haven't been thought out...

>> See also Chris Browne's excellent suggestions nearby, if you are willing
>> to make larger readjustments in your thinking...
>
> I did a search for articles by Chris Browne, didn't see one that
> appeared relevant. What is the thread subject to which you refer?

It's in the same thread. I suggested having a daemon running a cursor
(amounting to a slightly more expensive version of Tom's "doomed temp
table" approach), or using "rotor" tables where you could TRUNCATE a
table every few days, which would be _really_ cheap...
--
output = ("cbbrowne" "@" "cbbrowne.com")
http://cbbrowne.com/info/emacs.html
Expect the unexpected.
-- The Hitchhiker's Guide to the Galaxy, page 7023