Re: Delete performance again

Lists: pgsql-performance
From: Віталій Тимчишин <tivv00(at)gmail(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Delete performance again
Date: 2008-10-02 09:42:15
Message-ID: 331e40660810020242u2f0f5de2qe146312e448a7df4@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Hello.

I have a database with company table that have a bunch of related
(delete=cascade) tables.
Also it has 1<->M relation to company_descr table.
Once we've found that ~half of our companies do not have any description and
we would like to remove them.
First this I've tried was
delete from company where id not in (select company_id from company_descr);
I've tried to analyze command, but unlike to other RDBM I've used it did not
include cascade deletes/checks into query plan. That is first problem.
It was SLOW. To make it faster I've done next thing:

create temporary table comprm(id) as select id from company;
delete from comprm where id in (select company_id from company_descr);
delete from company where id in (select id from comprm);

That was much better. So the question is why postgresql can't do such a
thing.
But it was better only until "removing" dataset was small (~5% of all
table).
As soon as I've tried to remove 50% I've got speed problems. I've ensured I
have all indexes for both ends of foreign key.
I've tried to remove all cascaded entries by myself, e.g.:

create temporary table comprm(id) as select id from company;
delete from comprm where id in (select company_id from company_descr);
delete from company_alias where company_id in (select id from comprm);
...
delete from company where id in (select id from comprm);

It did not help until I drop all constraints before and recreate all
constraints after.
Now I have it work for 15minutes, while previously it could not do in a day.

Is it OK? I'd say, some (if not all) of the optimizations could be done by
postgresql optimizer.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Віталій Тимчишин <tivv00(at)gmail(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Delete performance again
Date: 2008-10-02 12:14:20
Message-ID: 23315.1222949660@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

"=?ISO-8859-5?B?svbi0Nv22SDC2Nzn2OjY3Q==?=" <tivv00(at)gmail(dot)com> writes:
> delete from company where id not in (select company_id from company_descr);
> I've tried to analyze command, but unlike to other RDBM I've used it did not
> include cascade deletes/checks into query plan. That is first problem.
> It was SLOW.

Usually the reason for that is having forgotten to make an index on the
referencing column(s) ?

regards, tom lane


From: Віталій Тимчишин <tivv00(at)gmail(dot)com>
To:
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Delete performance again
Date: 2008-10-02 15:21:59
Message-ID: 331e40660810020821w4c4f212br3943df435589f731@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

2008/10/2 Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>

> "=?ISO-8859-5?B?svbi0Nv22SDC2Nzn2OjY3Q==?=" <tivv00(at)gmail(dot)com> writes:
> > delete from company where id not in (select company_id from
> company_descr);
> > I've tried to analyze command, but unlike to other RDBM I've used it did
> not
> > include cascade deletes/checks into query plan. That is first problem.
> > It was SLOW.
>
> Usually the reason for that is having forgotten to make an index on the
> referencing column(s) ?
>

Not at all. As you can see below in original message, simply "extending" the
query to what should have been done by optimizer helps. I'd say optimizer
always uses fixed plan not taking into account that this is massive update
and id doing index lookup of children records for each parent record, while
it would be much more effective to perform removal of all children records
in single table scan.

It's like trigger "for each record" instead of "for each statement".


From: "Marc Mamin" <M(dot)Mamin(at)intershop(dot)de>
To: "??????? ????????" <tivv00(at)gmail(dot)com>
Cc: <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Delete performance again
Date: 2008-10-03 20:55:00
Message-ID: CA896D7906BF224F8A6D74A1B7E54AB3039E8BEF@JENMAIL01.ad.intershop.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Hi,

Maybe you can try this syntax. I'm not sure, but it eventually perform better:


delete from company_alias USING comprm
where company_alias.company_id =comprm.id

Cheers,

Marc


From: Віталій Тимчишин <tivv00(at)gmail(dot)com>
To:
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Delete performance again
Date: 2008-10-09 12:54:38
Message-ID: 331e40660810090554n4633de19gc2ee5fd4aa85457c@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

OK, I did try you proposal and correlated subselect.
I have a database ~900000 companies.
First try was to remove randomly selected 1000 companies
Uncorrelated subselect: 65899ms
Correlated subselect: 97467ms
using: 9605ms
my way: 104979ms. (without constraints recreate)
My is the worst because it is oriented on massive delete.
So I thought USING would perform better, so I did try 10000 companies
my way: 190527ms. (without constraints recreate)
using: 694144ms
I was a little shocked, but I did check plans and found out that it did
switch from Nested Loop to Hash Join.
I did disable Hash Join, it not show Merge Join. This was also disabled....
and I've got 747253ms.
Then I've tried combinations: Without hash join it was the best result of
402629ms, without merge join it was 1096116ms.

My conclusion: Until optimizer would take into account additional actions
needed (like constraints check/cascade deletes/triggers), it can not make
good plan.


From: Віталій Тимчишин <tivv00(at)gmail(dot)com>
To:
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Delete performance again
Date: 2008-10-10 08:24:08
Message-ID: 331e40660810100124h7aef87cap4cabbf15848b9380@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

BTW: Have just tried "clean" (without any foreign keys constraints)
peformance of
"delete from tbl where field not in (select)"
vs
"create temporary table tmp(id) as select distinct field from tbl; delete
from tmp where id in (select); delete from tbl where field in (select id
from tmp)".
both tbl and select are huge.
tbl cardinality is ~5 million, select is ~1 milliion. Number of records to
delete is small.
select is simply "select id from table2".

First (simple) one could not do in a night, second did in few seconds.