REVIEW: Optimize referential integrity checks (todo item)

From: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
To: Vik Reykja <vikreykja(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: REVIEW: Optimize referential integrity checks (todo item)
Date: 2012-06-16 13:50:30
Message-ID: CAEZATCWm8M00RA814o4DC2cD_aj44gQLb0tDdxMHA312qg7HCQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 12 February 2012 02:06, Vik Reykja <vikreykja(at)gmail(dot)com> wrote:
> I decided to take a crack at the todo item created from the following post:
> http://archives.postgresql.org/pgsql-performance/2005-10/msg00458.php
>
> The attached patch makes the desired changes in both code and function
> naming.
>
> It seemed quite easy to do but wasn't marked as easy on the todo, so I'm
> wondering if I've missed something.  All regression tests pass.
>

Here's my review of this patch.

Basic stuff:
------------

* Patch applies OK (some offsets).

BTW, I had no problems applying both the original patch and Chetan
Suttraway's version. The only difference between the patches seems to
be that the original is in context format, and Chetan Suttraway's is
in unified format.

Which format do hackers actually prefer? The wiki page
http://wiki.postgresql.org/wiki/Working_with_Git#Context_diffs_with_Git
suggests context format, but then the linked example
http://wiki.postgresql.org/wiki/Creating_Clean_Patches is in unified
format. Do people care, or are both formats OK?

* Compiles cleanly with no warnings.

* Regression tests pass.

The regression tests have not been updated. I think that's fair enough
- I don't see a way to test this in a regression test - but I did some
testing (see below) to confirm the expected behaviour.

* No doc changes needed.

What it does:
-------------

The primary benefit this patch offers is to prevent unnecessary
queuing of RI triggers in the case of updates to a FK table where the
old and new FK values are both NULL. It does this by effectively
replacing the existing key equality checks with IS [NOT] DISTINCT FROM
checks.

This seems like a worthwhile optimisation, because I think that it is
fairly common to have NULLs in FKs columns, and then update some other
column.

The patch also prevents unnecessary queuing of RI triggers when the PK
table is updated, and the old and new values are both NULL, but that
seems like a much less common case.

I've looked over the code changes fairly closely, and I believe that
this is a safe change.

Technically, I think the changes to ri_OneKeyEqual() and
ri_AllKeysUnequal() are unnecessary, since they can only be called
from the trigger functions in the case where all the old values are
non-NULL, hence the new versions end up behaving the same. However, I
think it makes sense for all these functions to be consistent.

Talking of consistency, I wonder if RI_FKey_keyequal_upd_pk() and
RI_FKey_keyequal_upd_fk() ought to be renamed to
RI_FKey_keyunchanged_upd_pk() and RI_FKey_keyunchanged_upd_pk()?

Testing:
--------

I tested using the following tables:

CREATE TABLE pk_table
(
a int PRIMARY KEY
);

INSERT INTO pk_table
SELECT * FROM generate_series(1,100000);

CREATE TABLE fk_table
(
a int PRIMARY KEY,
b int,
c int,
d int,
e int REFERENCES pk_table(a)
);

INSERT INTO fk_table
SELECT i, i*2, i*3, i*4, CASE WHEN i%10 = 0 THEN i END
FROM generate_series(1,100000) g(i);

(i.e., FK populated in 10% of rows)

Then in HEAD:
EXPLAIN ANALYSE UPDATE fk_table SET b=b+1, c=c+1, d=d+1;

QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
Update on fk_table (cost=0.00..2300.00 rows=100000 width=26) (actual
time=1390.037..1390.037 rows=0 loops=1)
-> Seq Scan on fk_table (cost=0.00..2300.00 rows=100000 width=26)
(actual time=0.010..60.841 rows=100000 loops=1)
Trigger for constraint fk_table_e_fkey: time=210.184 calls=90000
Total runtime: 1607.626 ms
(4 rows)

So the RI trigger is fired 90000 times, for the unchanged NULL FK rows.

With this patch, the RI trigger is not fired at all:
EXPLAIN ANALYSE UPDATE fk_table SET b=b+1, c=c+1, d=d+1;

QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
Update on fk_table (cost=0.00..2300.00 rows=100000 width=26) (actual
time=1489.640..1489.640 rows=0 loops=1)
-> Seq Scan on fk_table (cost=0.00..2300.00 rows=100000 width=26)
(actual time=0.010..66.328 rows=100000 loops=1)
Total runtime: 1489.679 ms
(3 rows)

Similarly, if I update the FK column in HEAD the RI trigger is fired
for every row:
EXPLAIN ANALYSE UPDATE fk_table SET e=e-1;

QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
Update on fk_table (cost=0.00..1800.00 rows=100000 width=26) (actual
time=1565.148..1565.148 rows=0 loops=1)
-> Seq Scan on fk_table (cost=0.00..1800.00 rows=100000 width=26)
(actual time=0.010..42.725 rows=100000 loops=1)
Trigger for constraint fk_table_e_fkey: time=705.962 calls=100000
Total runtime: 2279.408 ms
(4 rows)

whereas with this patch it is only fired for the non-NULL FK rows that
are changing:
EXPLAIN ANALYSE UPDATE fk_table SET e=e-1;

QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
Update on fk_table (cost=0.00..5393.45 rows=299636 width=26) (actual
time=1962.755..1962.755 rows=0 loops=1)
-> Seq Scan on fk_table (cost=0.00..5393.45 rows=299636 width=26)
(actual time=0.023..52.850 rows=100000 loops=1)
Trigger for constraint fk_table_e_fkey: time=257.845 calls=10000
Total runtime: 2221.912 ms
(4 rows)

Conclusion:
-----------

I think that this patch is in good shape.

There's the question about whether to rename RI_FKey_keyequal_upd_pk()
and RI_FKey_keyequal_upd_fk(), but I'll leave that to the committer.

There's also a separate question about whether the RI trigger
functions need to be doing these key comparisons, given that they are
done earlier when the triggers are queued
(http://archives.postgresql.org/pgsql-performance/2005-10/msg00459.php),
but the savings to be made there are likely to be smaller than the
savings this patch makes by not queuing the triggers at all.

I'm going to mark this ready for committer.

Regards,
Dean

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bruce Momjian 2012-06-16 14:15:10 Re: [COMMITTERS] pgsql: Run pgindent on 9.2 source tree in preparation for first 9.3
Previous Message Andres Freund 2012-06-16 11:43:35 Re: [RFC][PATCH] Logical Replication/BDR prototype and architecture