Compressing the AFTER TRIGGER queue

From: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Compressing the AFTER TRIGGER queue
Date: 2011-08-01 16:04:48
Message-ID: CAEZATCV5eCxdhGZXYv7G_K7ssRSwV6nM+p_SPksQWG5RrjsnTA@mail.gmail.com
Lists: pgsql-hackers

I've been thinking some more about the long-standing problem of the
AFTER TRIGGER queue using too much memory, and I think that the
situation can be improved by using some basic compression.

Currently each event added to the AFTER TRIGGER queue uses 10 bytes
per trigger per row for INSERTs and DELETEs, and 16 for UPDATEs. The
attached patch reduces this down to just 1 byte in a number of common
cases.

It works by optimising for the case where an event's CTID is the same
as, or close to, the previous event's CTID, and similarly for the
event's pointer to the shared trigger data (AfterTriggerSharedData).

If there is only 1 trigger on the table, it will use 1 byte for most
INSERTed rows. For UPDATEs and DELETEs, the space needed will vary
according to the table's layout and traversal order. Given a not too
fragmented table, and a sequential or bitmap heap scan, it should use
1-2 bytes per row. This will increase for more random traversal orders
- the theoretical maximum is 7 bytes per row, but that would take a
pretty pathological case.

If there are multiple triggers firing for each row, then each
additional trigger event will take 1 byte, provided the triggers fire
unconditionally (no WHERE clauses). If the triggers fire more
randomly, due to WHERE clauses, then some of the trigger events may
take 2 bytes each (for up to 256 triggers on the table) or 3 bytes
each if there are more triggers than that.
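
To give a flavour of the kind of encoding I mean, here's a simplified
sketch (not the exact format in the patch -- the constants and the
function name below are made up for illustration):

#include "postgres.h"
#include "storage/itemptr.h"

/* How to get this event's CTID from the previous event's (header bits 0-1) */
#define EVT_CTID_SAME       0           /* same CTID as previous event */
#define EVT_CTID_NEXT_LP    1           /* same block, offset + 1 */
#define EVT_CTID_NEXT_BLK   2           /* block + 1, first offset */
#define EVT_CTID_FULL       3           /* full 6-byte CTID follows */

/* How to get the AfterTriggerSharedData offset (header bits 2-3) */
#define EVT_TRIG_SAME       (0 << 2)    /* same shared entry as previous */
#define EVT_TRIG_NEXT       (1 << 2)    /* next shared entry */
#define EVT_TRIG_IDX8       (2 << 2)    /* 1-byte shared index follows */
#define EVT_TRIG_IDX16      (3 << 2)    /* 2-byte shared index follows */

/*
 * Append one event, encoded relative to the previous event, to *buf.
 * Returns the number of bytes written: 1 byte in the common sequential
 * cases, up to 1 + 6 + 2 bytes in the worst case.  This sketch assumes
 * fewer than 65536 distinct AfterTriggerSharedData entries.
 */
static int
encode_event(unsigned char *buf,
             ItemPointer prev_ctid, ItemPointer ctid,
             int prev_shared, int shared)
{
    unsigned char *p = buf + 1;
    unsigned char header;

    if (ItemPointerGetBlockNumber(ctid) == ItemPointerGetBlockNumber(prev_ctid) &&
        ItemPointerGetOffsetNumber(ctid) == ItemPointerGetOffsetNumber(prev_ctid))
        header = EVT_CTID_SAME;
    else if (ItemPointerGetBlockNumber(ctid) == ItemPointerGetBlockNumber(prev_ctid) &&
             ItemPointerGetOffsetNumber(ctid) == ItemPointerGetOffsetNumber(prev_ctid) + 1)
        header = EVT_CTID_NEXT_LP;
    else if (ItemPointerGetBlockNumber(ctid) == ItemPointerGetBlockNumber(prev_ctid) + 1 &&
             ItemPointerGetOffsetNumber(ctid) == FirstOffsetNumber)
        header = EVT_CTID_NEXT_BLK;
    else
    {
        header = EVT_CTID_FULL;
        memcpy(p, ctid, sizeof(ItemPointerData));       /* 6 bytes */
        p += sizeof(ItemPointerData);
    }

    if (shared == prev_shared)
        header |= EVT_TRIG_SAME;
    else if (shared == prev_shared + 1)
        header |= EVT_TRIG_NEXT;
    else if (shared <= 0xFF)
    {
        header |= EVT_TRIG_IDX8;
        *p++ = (unsigned char) shared;
    }
    else
    {
        header |= EVT_TRIG_IDX16;
        *p++ = (unsigned char) (shared & 0xFF);
        *p++ = (unsigned char) (shared >> 8);
    }

    buf[0] = header;
    return p - buf;
}

Decoding just reverses this, reconstructing each event's CTID and
shared-data offset from the previous one as the chunk is walked.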

The compression code is pretty simple, so it should only use a few
additional cycles. In my testing so far, it seems to make no
detectable difference to performance (actually it seems to be 2-3%
faster for a 50M row INSERT, presumably because the reduced memory
usage leads to fewer cache misses).

For comparison, saving a 1MB AfterTriggerEventChunk produced by the
current code to a file and compressing it with gzip gives a
compression ratio of around 4:1, and bzip2 gives around 8:1. I've not
tried the pg_lzcompress code, but at first glance it is hard to
imagine that it would be as good as gzip, and it is not so suited to
on-the-fly compression/decompression.

It still needs more testing, but the initial results are encouraging.

Thoughts?

Regards,
Dean

Attachment: after-triggers.patch (application/octet-stream, 36.1 KB)

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compressing the AFTER TRIGGER queue
Date: 2011-08-01 16:49:08
Message-ID: 24460.1312217348@sss.pgh.pa.us
Lists: pgsql-hackers

Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com> writes:
> I've been thinking some more about the long-standing problem of the
> AFTER TRIGGER queue using too much memory, and I think that the
> situation can be improved by using some basic compression.

> Currently each event added to the AFTER TRIGGER queue uses 10 bytes
> per trigger per row for INSERTs and DELETEs, and 16 for UPDATEs. The
> attached patch reduces this down to just 1 byte in a number of common
> cases.

Ummm ... I only read the data structure comments, not the code, but I
don't see where you store the second CTID for an update event?

regards, tom lane


From: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compressing the AFTER TRIGGER queue
Date: 2011-08-01 17:00:00
Message-ID: CAEZATCW5A47-hog46kOwtSXJf-fwkQfmnVp=OfiS_53B85bH_A@mail.gmail.com
Lists: pgsql-hackers

On 1 August 2011 17:49, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com> writes:
>> I've been thinking some more about the long-standing problem of the
>> AFTER TRIGGER queue using too much memory, and I think that the
>> situation can be improved by using some basic compression.
>
>> Currently each event added to the AFTER TRIGGER queue uses 10 bytes
>> per trigger per row for INSERTs and DELETEs, and 16 for UPDATEs. The
>> attached patch reduces this down to just 1 byte in a number of common
>> cases.
>
> Ummm ... I only read the data structure comments, not the code, but I
> don't see where you store the second CTID for an update event?
>

Ah yes, I forgot to mention that bit. I'm using
&(tuple1.t_data->t_ctid) to get the second CTID from the old tuple. Is
that safe? As far as I could see, this would match the new tuple after
a heap update.
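
In other words, roughly this (just a sketch of the idea -- heap_update()
sets the old version's t_ctid to point at the new version, so the queue
only needs to store the old CTID; the function name is invented):

#include "postgres.h"
#include "access/htup.h"
#include "storage/itemptr.h"

/*
 * Sketch only: recover the new tuple version's TID for an UPDATE event
 * by following the old version's t_ctid link, instead of storing a
 * second CTID in the event itself.
 */
static void
get_successor_tid(HeapTuple oldtup, ItemPointer new_tid)
{
    ItemPointerCopy(&oldtup->t_data->t_ctid, new_tid);
}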

Regards,
Dean


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compressing the AFTER TRIGGER queue
Date: 2011-08-01 17:31:22
Message-ID: 27705.1312219882@sss.pgh.pa.us
Lists: pgsql-hackers

Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com> writes:
> On 1 August 2011 17:49, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Ummm ... I only read the data structure comments, not the code, but I
>> don't see where you store the second CTID for an update event?

> Ah yes, I forgot to mention that bit. I'm using
> &(tuple1.t_data->t_ctid) to get the second CTID from the old tuple. Is
> that safe?

Hmmmm ... not sure. It seems a bit scary, but on the other hand we
should be able to assume that the updating subtransaction hasn't been
rolled back (else surely we shouldn't be firing the trigger). So in
principle it seems like the t_ctid link can't have been replaced.
This will foreclose any ideas about collapsing t_ctid link chains,
if anyone had it in mind to do that.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compressing the AFTER TRIGGER queue
Date: 2011-08-01 17:36:09
Message-ID: CA+TgmoZQN0x=zpZA_vYa4O8GxOd+6oPuLO8g0Uw40X1zYdZa0Q@mail.gmail.com
Lists: pgsql-hackers

On Mon, Aug 1, 2011 at 1:31 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com> writes:
>> On 1 August 2011 17:49, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> Ummm ... I only read the data structure comments, not the code, but I
>>> don't see where you store the second CTID for an update event?
>
>> Ah yes, I forgot to mention that bit. I'm using
>> &(tuple1.t_data->t_ctid) to get the second CTID from the old tuple. Is
>> that safe?
>
> Hmmmm ... not sure.  It seems a bit scary, but on the other hand we
> should be able to assume that the updating subtransaction hasn't been
> rolled back (else surely we shouldn't be firing the trigger).  So in
> principle it seems like the t_ctid link can't have been replaced.
> This will foreclose any ideas about collapsing t_ctid link chains,
> if anyone had it in mind to do that.

Don't we already do that when pruning HOT chains?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compressing the AFTER TRIGGER queue
Date: 2011-08-01 17:42:53
Message-ID: CAEZATCXk14N+rtefhkijHvbdZhAG=qmBj7qizn2xZ5khM=DMBQ@mail.gmail.com
Lists: pgsql-hackers

On 1 August 2011 18:36, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Mon, Aug 1, 2011 at 1:31 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com> writes:
>>> On 1 August 2011 17:49, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>>> Ummm ... I only read the data structure comments, not the code, but I
>>>> don't see where you store the second CTID for an update event?
>>
>>> Ah yes, I forgot to mention that bit. I'm using
>>> &(tuple1.t_data->t_ctid) to get the second CTID from the old tuple. Is
>>> that safe?
>>
>> Hmmmm ... not sure.  It seems a bit scary, but on the other hand we
>> should be able to assume that the updating subtransaction hasn't been
>> rolled back (else surely we shouldn't be firing the trigger).  So in
>> principle it seems like the t_ctid link can't have been replaced.
>> This will foreclose any ideas about collapsing t_ctid link chains,
>> if anyone had it in mind to do that.
>
> Don't we already do that when pruning HOT chains?
>

I thought that only happens after the transaction is committed, and
old enough, whereas the trigger code only needs to follow the chain in
the updating transaction.

Regards,
Dean


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compressing the AFTER TRIGGER queue
Date: 2011-08-01 17:46:53
Message-ID: CA+Tgmob1jWU=Zq-EQnBoPkzC+y_xW5v8O8D_Y3Y4a3d5gAPVEA@mail.gmail.com
Lists: pgsql-hackers

On Mon, Aug 1, 2011 at 1:42 PM, Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com> wrote:
>>> Hmmmm ... not sure.  It seems a bit scary, but on the other hand we
>>> should be able to assume that the updating subtransaction hasn't been
>>> rolled back (else surely we shouldn't be firing the trigger).  So in
>>> principle it seems like the t_ctid link can't have been replaced.
>>> This will foreclose any ideas about collapsing t_ctid link chains,
>>> if anyone had it in mind to do that.
>>
>> Don't we already do that when pruning HOT chains?
>
> I thought that only happens after the transaction is committed, and
> old enough, whereas the trigger code only needs to follow the chain in
> the updating transaction.

Hmm, true.

I worry a bit that this might foreclose possible future optimization
of the "self update" case, which is a known pain point. Am I wrong to
worry?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compressing the AFTER TRIGGER queue
Date: 2011-08-01 17:55:21
Message-ID: 28230.1312221321@sss.pgh.pa.us
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Mon, Aug 1, 2011 at 1:42 PM, Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com> wrote:
>>> Don't we already do that when pruning HOT chains?

>> I thought that only happens after the transaction is committed, and
>> old enough, whereas the trigger code only needs to follow the chain in
>> the updating transaction.

> Hmm, true.

> I worry a bit that this might foreclose possible future optimization
> of the "self update" case, which is a known pain point. Am I wrong to
> worry?

I think it might be OK if you explicitly verify that xmin/cmin of the
linked-to tuple matches the (sub)transaction/command that queued the
trigger event. I don't recall whether the trigger code does that
already; I think there is some related test but it might not be that
strict.

There's also a definitional issue involved: if a transaction updates the
same tuple twice, in the presence of a deferred update trigger for the
table, is it supposed to (eventually) fire the trigger for both update
actions or only the last one? I have a feeling we might already be
locked into the second choice, but if not, this would probably force it.

regards, tom lane


From: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compressing the AFTER TRIGGER queue
Date: 2011-08-01 18:27:47
Message-ID: CAEZATCXBdfryeZmXsM8UXVMgCR5XGi5Y+vXzWZ15nhRgGORzYQ@mail.gmail.com
Lists: pgsql-hackers

On 1 August 2011 18:55, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> On Mon, Aug 1, 2011 at 1:42 PM, Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com> wrote:
>>>> Don't we already do that when pruning HOT chains?
>
>>> I thought that only happens after the transaction is committed, and
>>> old enough, whereas the trigger code only needs to follow the chain in
>>> the updating transaction.
>
>> Hmm, true.
>
>> I worry a bit that this might foreclose possible future optimization
>> of the "self update" case, which is a known pain point.  Am I wrong to
>> worry?
>
> I think it might be OK if you explicitly verify that xmin/cmin of the
> linked-to tuple matches the (sub)transaction/command that queued the
> trigger event.  I don't recall whether the trigger code does that
> already; I think there is some related test but it might not be that
> strict.
>
> There's also a definitional issue involved: if a transaction updates the
> same tuple twice, in the presence of a deferred update trigger for the
> table, is it supposed to (eventually) fire the trigger for both update
> actions or only the last one?  I have a feeling we might already be
> locked into the second choice, but if not, this would probably force it.
>

Do you mean this sort of case:

DROP TABLE IF EXISTS foo;
CREATE TABLE foo(a int);
INSERT INTO foo VALUES(1);

CREATE OR REPLACE FUNCTION foo_trig_fn() RETURNS trigger AS
$$
BEGIN
  RAISE NOTICE 'In foo_trig_fn(): old.a=%, new.a=%', old.a, new.a;
  RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE CONSTRAINT TRIGGER foo_trig AFTER UPDATE ON foo
  DEFERRABLE INITIALLY DEFERRED
  FOR EACH ROW EXECUTE PROCEDURE foo_trig_fn();

BEGIN;
UPDATE foo SET a=a+1;
UPDATE foo SET a=a+1;
COMMIT;

In this case we currently fire the trigger twice (once for each
update) when the transaction commits, and the new code behaves the
same. So at commit time you get:

NOTICE: In foo_trig_fn(): old.a=1, new.a=2
NOTICE: In foo_trig_fn(): old.a=2, new.a=3

Thinking back to the deferred PK checking trigger, I thought that in
this sort of case it is the trigger's responsibility to check that the
tuple(s) it is given are not dead.

Regards,
Dean


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compressing the AFTER TRIGGER queue
Date: 2011-08-01 18:56:57
Message-ID: 29407.1312225017@sss.pgh.pa.us
Lists: pgsql-hackers

Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com> writes:
> On 1 August 2011 18:55, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Mon, Aug 1, 2011 at 1:42 PM, Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com> wrote:
>>> Don't we already do that when pruning HOT chains?

>> I thought that only happens after the transaction is committed, and
>> old enough, whereas the trigger code only needs to follow the chain in
>> the updating transaction.

> Hmm, true.

On reflection I think we're probably worrying over nothing. The only
way that the t_ctid link might have gotten changed (given that the
original updating subtransaction hasn't been aborted) is if we were
willing to prune an inserted-and-then-deleted-in-same-xact tuple out of
the t_ctid link chain while its parent xact is still running. However,
it is unsafe to do that unless you know for certain that the xact in
question has no snapshots that could see the tuple as live. VACUUM
certainly doesn't know that, and neither does pruneheap.c, and I don't
really foresee us introducing enough bookkeeping so that they would know
it. (If we did try to do that, we could handle the problem by
considering that queueing of trigger events introduces a quasi-snapshot
that requires the relevant tuples to remain available.)

So I think this is probably OK as long as it's adequately documented in
the code. Just for luck, though, we should add an xmin/cmin match check
to the code if there's not one already.

However, this means that Dean is not simply adding some self-contained
compression logic; he's eliminating information from the data structure
on the grounds that he can get it from elsewhere. I think that that
ought to be treated as a separate patch, and that the costs/benefits
of the "compressed" data structure ought to be evaluated relative to the
situation with the second CTID removed but the data structure not
otherwise changed.

regards, tom lane


From: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compressing the AFTER TRIGGER queue
Date: 2011-08-01 19:48:07
Message-ID: CAEZATCX9cWKpXM4YXDjDLXLJgQuq3=5W2N3mA5J7v50-Y+U_0w@mail.gmail.com
Lists: pgsql-hackers

On 1 August 2011 19:56, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com> writes:
>> On 1 August 2011 18:55, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> On Mon, Aug 1, 2011 at 1:42 PM, Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com> wrote:
>>>> Don't we already do that when pruning HOT chains?
>
>>> I thought that only happens after the transaction is committed, and
>>> old enough, whereas the trigger code only needs to follow the chain in
>>> the updating transaction.
>
>> Hmm, true.
>
> On reflection I think we're probably worrying over nothing.  The only
> way that the t_ctid link might have gotten changed (given that the
> original updating subtransaction hasn't been aborted) is if we were
> willing to prune an inserted-and-then-deleted-in-same-xact tuple out of
> the t_ctid link chain while its parent xact is still running.  However,
> it is unsafe to do that unless you know for certain that the xact in
> question has no snapshots that could see the tuple as live.  VACUUM
> certainly doesn't know that, and neither does pruneheap.c, and I don't
> really foresee us introducing enough bookkeeping so that they would know
> it.  (If we did try to do that, we could handle the problem by
> considering that queueing of trigger events introduces a quasi-snapshot
> that requires the relevant tuples to remain available.)
>
> So I think this is probably OK as long as it's adequately documented in
> the code.  Just for luck, though, we should add an xmin/cmin match check
> to the code if there's not one already.
>

I don't see any such existing check.
Is this the same as checking that xmax/cmax on the old tuple matches
xmin/cmin on the linked-to tuple? If so, that's pretty
straightforward. Otherwise I'm not sure where the xmin/cmin to check
against would come from, without adding more information to the queues
somewhere.
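
If it is just a cross-check between the two versions, I'm picturing
something like this (only a sketch; the function name and error text
are invented, and I'd need to confirm the combo-CID functions are
usable in this context):

#include "postgres.h"
#include "access/htup.h"
#include "access/transam.h"
#include "utils/combocid.h"

/*
 * Sketch only: verify that the tuple found by following the old
 * version's t_ctid link really is the version created by the UPDATE
 * that queued this event, i.e. its xmin/cmin match the old version's
 * xmax/cmax.
 */
static void
check_successor_tuple(HeapTuple oldtup, HeapTuple newtup)
{
    if (!TransactionIdEquals(HeapTupleHeaderGetXmin(newtup->t_data),
                             HeapTupleHeaderGetXmax(oldtup->t_data)) ||
        HeapTupleHeaderGetCmin(newtup->t_data) !=
        HeapTupleHeaderGetCmax(oldtup->t_data))
        elog(ERROR, "updated tuple version does not match trigger event");
}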

> However, this means that Dean is not simply adding some self-contained
> compression logic; he's eliminating information from the data structure
> on the grounds that he can get it from elsewhere.  I think that that
> ought to be treated as a separate patch, and that the costs/benefits
> of the "compressed" data structure ought to be evaluated relative to the
> situation with the second CTID removed but the data structure not
> otherwise changed.
>

OK, so I should split this into 2 patches?
Even without the compression, it's probably worth the 16 -> 10 byte
reduction that would result from removing the 2nd CTID in the UPDATE
case, and that part would be a pretty small patch.

Regards,
Dean


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compressing the AFTER TRIGGER queue
Date: 2011-08-01 19:53:19
Message-ID: 672.1312228399@sss.pgh.pa.us
Lists: pgsql-hackers

Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com> writes:
> OK, so I should split this into 2 patches?
> Even without the compression, it's probably worth the 16 -> 10 byte
> reduction that would result from removing the 2nd CTID in the UPDATE
> case, and that part would be a pretty small patch.

Yeah, my point exactly. The rest of it might or might not be worth the
extra complication.

regards, tom lane


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compressing the AFTER TRIGGER queue
Date: 2011-08-01 20:02:45
Message-ID: CA+U5nMJ1YnkA0SMkL5iaivdffEzy71qK4aBQaedUOx=csvnFcA@mail.gmail.com
Lists: pgsql-hackers

On Mon, Aug 1, 2011 at 7:56 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> However, this means that Dean is not simply adding some self-contained
> compression logic; he's eliminating information from the data structure
> on the grounds that he can get it from elsewhere.  I think that that
> ought to be treated as a separate patch, and that the costs/benefits
> of the "compressed" data structure ought to be evaluated relative to the
> situation with the second CTID removed but the data structure not
> otherwise changed.

I would prefer an approach where we store the events in a buffer,
which gets added to the main event queue when it fills/block number
changes/etc. That way we can apply intelligence to the actual
compression format used, yet retain all required info in the queued
event. We might also be able to ensure the event queue is ordered more
naturally by block number, to ensure we have reasonable readahead
while re-traversing the event queue - it's not just the size of the
event queue that is the problem here.

I'm also nervous of compression formats that rely too heavily on
earlier data, so any corner case bug in one of the formats will screw
the whole data structure. I'd prefer a more modular approach that is
easier to test in regression tests.

The cases we have problems with are the bulk cases, so we should be
specifically optimising for that. For example, if we are running a
COPY/INSERT SELECT or a seqscan DELETE then we will queue a trigger
for every tuple in a block. The best compression and flexibility in
that case is to store a bitmap since that will average out at about 1
bit per row, with variable length bitmaps. Which is about 8 times
better compression ratio than originally suggested, without any loss
of metadata.
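
Roughly the sort of structure I have in mind (a sketch with invented
names, one entry per heap block and trigger rather than per row):

#include "postgres.h"
#include "access/htup.h"        /* for MaxHeapTuplesPerPage */
#include "storage/block.h"
#include "storage/off.h"

/*
 * Sketch only: one queue entry covers all pending events for one
 * trigger on one heap block, with a bitmap of the line pointers
 * involved.  For a bulk COPY or a seqscan DELETE this averages out at
 * about 1 bit per row.
 */
typedef struct AfterTriggerBlockEvents
{
    BlockNumber blkno;          /* heap block these events are for */
    int         shared_index;   /* which AfterTriggerSharedData entry */
    uint8       offsets[(MaxHeapTuplesPerPage + 7) / 8];
                                /* bit N set => line pointer N+1 has an event */
} AfterTriggerBlockEvents;

/* mark line pointer 'off' (1-based) as having a pending event */
static void
block_events_add(AfterTriggerBlockEvents *be, OffsetNumber off)
{
    be->offsets[(off - 1) / 8] |= (uint8) (1 << ((off - 1) % 8));
}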

We should also optimise the HOT UPDATE case.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compressing the AFTER TRIGGER queue
Date: 2011-08-02 07:50:53
Message-ID: CAEZATCUO+SOweZuyuk+Gh-OzKwKpK1bvq7ohcN_PGcJ63eRwUQ@mail.gmail.com
Lists: pgsql-hackers

On 1 August 2011 20:53, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com> writes:
>> OK, so I should split this into 2 patches?
>> Even without the compression, it's probably worth the 16 -> 10 byte
>> reduction that would result from removing the 2nd CTID in the UPDATE
>> case, and that part would be a pretty small patch.
>
> Yeah, my point exactly.  The rest of it might or might not be worth the
> extra complication.
>

OK, here's a patch for the first bit - just removing the second CTID
in the UPDATE case, and including a sanity check of the new tuple's
xmin and cmin.

It passes all the regression tests. I also tested it by doing a 10M
row UPDATE x=x+1 on a deferrable PK, and it gave about the expected
reduction in memory usage, with no difference in run time.

I'll test out the additional compression separately.

Regards,
Dean

Attachment: after-triggers-1.patch (application/octet-stream, 9.7 KB)

From: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compressing the AFTER TRIGGER queue
Date: 2011-08-02 11:28:04
Message-ID: CAEZATCU-VdFcEagNbnzA0ChRAhxPBQL_KX1hSHHZFFxWJXRrvg@mail.gmail.com
Lists: pgsql-hackers

On 1 August 2011 21:02, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> I would prefer an approach where we store the events in a buffer,
> which gets added to the main event queue when it fills/block number
> changes/etc. That way we can apply intelligence to the actual
> compression format used, yet retain all required info in the queued
> event. We might also be able to ensure the event queue is ordered more
> naturally by block number, to ensure we have reasonable readahead
> while re-traversing the event queue - it's not just the size of the
> event queue that is the problem here.
>

I tried a similar approach before
(http://archives.postgresql.org/pgsql-hackers/2009-10/msg00464.php)
but one of the problems was that, for user-defined triggers at least,
it was thought that the order of firing for the triggers needed to
match the row update order, which is difficult to achieve with a
bitmap.

I also worry about the runtime cost of anything too complicated. What
I was aiming for with this patch was a more modest compression that
wouldn't add too many additional cycles (or complexity) to the
existing code (I need to do more testing to confirm that that is
indeed the case).

The biggest recent complaint I found in this area was this thread:
http://archives.postgresql.org/pgsql-bugs/2010-12/msg00002.php, which
was a one billion row insert. Personally, if I ever insert that much
data I always drop and re-create all the constraints and triggers, but
it would be nice if we could achieve that sort of size on reasonably
modern hardware.

> I'm also nervous of compression formats that rely too heavily on
> earlier data, so any corner case bug in one of the formats will screw
> the whole data structure. I'd prefer a more modular approach that is
> easier to test in regression tests.
>

Nearly all compression formats rely on earlier data. In this case each
event only depends on the previous event, so it just needs testing
with various classes of difference between successive rows, perhaps
with some C test code that makes up CTIDs to exercise the corner cases.
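
For example, something along these lines (a sketch only, driving the
hypothetical encode_event() from my earlier outline; a matching decode
would be round-tripped in the same loop):

#include "postgres.h"
#include "storage/itemptr.h"

/*
 * Sketch only: manufacture CTID pairs covering the interesting classes
 * of difference between successive events and run them through the
 * encoder.
 */
static void
test_ctid_deltas(void)
{
    struct
    {
        BlockNumber  blk1;
        OffsetNumber off1;
        BlockNumber  blk2;
        OffsetNumber off2;
    }           cases[] = {
        {0, 1, 0, 1},           /* same row (extra trigger firing) */
        {0, 1, 0, 2},           /* next line pointer on the same page */
        {0, 200, 1, 1},         /* crossing a page boundary */
        {5, 10, 3, 7},          /* backwards jump (index scan order) */
        {0, 1, 4000000, 123},   /* large forward jump */
    };
    unsigned char buf[16];
    int         i;

    for (i = 0; i < lengthof(cases); i++)
    {
        ItemPointerData prev;
        ItemPointerData cur;
        int         len;

        ItemPointerSet(&prev, cases[i].blk1, cases[i].off1);
        ItemPointerSet(&cur, cases[i].blk2, cases[i].off2);

        len = encode_event(buf, &prev, &cur, 0, 0);
        Assert(len >= 1 && len <= 7);
        /* decode here and Assert the reconstructed CTID equals 'cur' */
    }
}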

> The cases we have problems with are the bulk cases, so we should be
> specifically optimising for that. For example, if we are running a
> COPY/INSERT SELECT or a seqscan DELETE then we will queue a trigger
> for every tuple in a block.

Agreed. That's what I tried to target with my code too - so it
achieves its best compression with sequential data.

>  The best compression and flexibility in
> that case is to store a bitmap since that will average out at about 1
> bit per row, with variable length bitmaps. Which is about 8 times
> better compression ratio than originally suggested, without any loss
> of metadata.

Yeah that's probably possible in specific cases, but I'm still not
sure how to make it meet the full requirements of the after trigger
queue.

Regards,
Dean

>
> We should also optimise the HOT UPDATE case.
>
> --
>  Simon Riggs                   http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services
>


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compressing the AFTER TRIGGER queue
Date: 2011-08-02 12:09:11
Message-ID: CA+U5nMKH_btFYHiBM9EP1MY-cJZx5z14Q5zJ3S5Yt2Ry5EcV0w@mail.gmail.com
Lists: pgsql-hackers

On Tue, Aug 2, 2011 at 12:28 PM, Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com> wrote:
> On 1 August 2011 21:02, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>> I would prefer an approach where we store the events in a buffer,
>> which gets added to the main event queue when it fills/block number
>> changes/etc. That way we can apply intelligence to the actual
>> compression format used, yet retain all required info in the queued
>> event. We might also be able to ensure the event queue is ordered more
>> naturally by block number, to ensure we have reasonable readahead
>> while re-traversing the event queue - it's not just the size of the
>> event queue that is the problem here.
>>
>
> I tried a similar approach before
> (http://archives.postgresql.org/pgsql-hackers/2009-10/msg00464.php)
> but one of the problems was that, for user-defined triggers at least,
> it was thought that the order of firing for the triggers needed to
> match the row update order, which is difficult to achieve with a
> bitmap.

Luckily, I think what I said 2 years ago was correct.
http://archives.postgresql.org/pgsql-hackers/2009-10/msg01528.php

We just need a way to say that trigger execution order is not
important, which we know to be true for FK checks - or any STABLE
function.

I think that label is "STABLE" so we don't even need to invent new
function labels.

>> The cases we have problems with are the bulk cases, so we should be
>> specifically optimising for that. For example, if we are running a
>> COPY/INSERT SELECT or a seqscan DELETE then we will queue a trigger
>> for every tuple in a block.
>
> Agreed. That's what I tried to target with my code too - so it
> achieves its best compression with sequential data.

Maybe, but you're still storing 1 byte per row at least, often more.
Whereas 1 bit would be much better compression.

>> The best compression and flexibility in
>> that case is to store a bitmap since that will average out at about 1
>> bit per row, with variable length bitmaps. Which is about 8 times
>> better compression ratio than originally suggested, without any loss
>> of metadata.
>
> Yeah that's probably possible in specific cases, but I'm still not
> sure how to make it meet the full requirements of the after trigger
> queue.

I think you'd better explain what use case you are trying to optimise
for. It seems unlikely that you will come up with a compression scheme
that will fit all cases.

The only cases that seem a problem to me are
* bulk RI checks
* large writes on tables using trigger based replication
maybe you have others?

If you are trying to optimise trigger based replication, I would be
inclined to look at immediate execution triggers. If we insert into a
log table then that will only take effect if the transaction commits.
That way we would just bypass the after trigger queue entirely. I see
nothing in the SQL Standard that prevents that.

Bitmaps work better than the current suggestion for optimising bulk RI
checks. If you used immediate execution triggers you could optimise
bulk RI checks against small tables even further by just storing a
local hash table of already acquired row locks, again avoiding the
need to build up a huge after trigger event queue in the first place.
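
For instance (sketch only; the names are invented, and a real version
would have to remember the actual key values rather than just a hash of
them):

#include "postgres.h"
#include "utils/hsearch.h"

/*
 * Sketch only: remember which parent-table keys this query has already
 * checked/locked, so bulk RI checks against a small table never queue
 * more than one check per distinct key.
 */
typedef struct RIKeyLockedEntry
{
    uint32      keyhash;        /* hash of the referenced key values */
} RIKeyLockedEntry;

static HTAB *ri_locked_keys = NULL;

/* returns true if this key was already checked in the current query */
static bool
ri_key_already_locked(uint32 keyhash)
{
    bool        found;

    if (ri_locked_keys == NULL)
    {
        HASHCTL     ctl;

        MemSet(&ctl, 0, sizeof(ctl));
        ctl.keysize = sizeof(uint32);
        ctl.entrysize = sizeof(RIKeyLockedEntry);
        ctl.hash = tag_hash;
        ri_locked_keys = hash_create("RI locked keys", 1024, &ctl,
                                     HASH_ELEM | HASH_FUNCTION);
    }

    hash_search(ri_locked_keys, &keyhash, HASH_ENTER, &found);
    return found;
}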

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Jim Nasby <jim(at)nasby(dot)net>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compressing the AFTER TRIGGER queue
Date: 2011-08-03 22:05:21
Message-ID: 4B502D16-0C61-408F-8664-AA432FD0A3DA@nasby.net
Lists: pgsql-hackers

On Aug 2, 2011, at 7:09 AM, Simon Riggs wrote:
>>> The best compression and flexibility in
>>> that case is to store a bitmap since that will average out at about 1
>>> bit per row, with variable length bitmaps. Which is about 8 times
>>> better compression ratio than originally suggested, without any loss
>>> of metadata.
>>
>> Yeah that's probably possible in specific cases, but I'm still not
>> sure how to make it meet the full requirements of the after trigger
>> queue.
>
> I think you'd better explain what use case you are trying to optimise
> for. It seems unlikely that you will come up with a compression scheme
> that will fit all cases.
>
> The only cases that seem a problem to me are
> * bulk RI checks
> * large writes on tables using trigger based replication
> maybe you have others?

Not sure how much this relates to this discussion, but I have often wished we had AFTER FOR EACH STATEMENT triggers that provided OLD and NEW recordsets you could make use of. Sometimes it's very valuable to be able to look at *all* the rows that changed in a transaction in one shot.
--
Jim C. Nasby, Database Architect jim(at)nasby(dot)net
512.569.9461 (cell) http://jim.nasby.net


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Jim Nasby <jim(at)nasby(dot)net>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compressing the AFTER TRIGGER queue
Date: 2011-08-04 01:17:21
Message-ID: CA+TgmoYQKqMaYpGFnwbiHK2mhwnkUPmWWfpcLY3PaztJFAC6Rw@mail.gmail.com
Lists: pgsql-hackers

On Wed, Aug 3, 2011 at 6:05 PM, Jim Nasby <jim(at)nasby(dot)net> wrote:
> Not sure how much this relates to this discussion, but I have often wished we had AFTER FOR EACH STATEMENT triggers that provided OLD and NEW recordsets you could make use of. Sometimes it's very valuable to be able to look at *all* the rows that changed in a transaction in one shot.

Yeah, that would be awesome. I think some of our competitors provide
exactly that feature...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Jim Nasby <jim(at)nasby(dot)net>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compressing the AFTER TRIGGER queue
Date: 2011-08-04 01:32:21
Message-ID: 1312421493-sup-6060@alvh.no-ip.org
Lists: pgsql-hackers

Excerpts from Jim Nasby's message of Wed Aug 03 18:05:21 -0400 2011:

> Not sure how much this relates to this discussion, but I have often wished we had AFTER FOR EACH STATEMENT triggers that provided OLD and NEW recordsets you could make use of. Sometimes it's very valuable to be able to look at *all* the rows that changed in a transaction in one shot.

I have a mind to implement this sometime.

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Robert Haas" <robertmhaas(at)gmail(dot)com>,"Jim Nasby" <jim(at)nasby(dot)net>
Cc: "Simon Riggs" <simon(at)2ndquadrant(dot)com>, "Dean Rasheed" <dean(dot)a(dot)rasheed(at)gmail(dot)com>, "PostgreSQL Hackers" <pgsql-hackers(at)postgresql(dot)org>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Compressing the AFTER TRIGGER queue
Date: 2011-08-04 13:40:28
Message-ID: 4E3A5AFC020000250003FA46@gw.wicourts.gov
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Wed, Aug 3, 2011 at 6:05 PM, Jim Nasby <jim(at)nasby(dot)net> wrote:
>> Not sure how much this relates to this discussion, but I have
>> often wished we had AFTER FOR EACH STATEMENT triggers that
>> provided OLD and NEW recordsets you could make use of. Sometimes
>> it's very valuable to be able to look at *all* the rows that
>> changed in a transaction in one shot.
>
> Yeah, that would be awesome. I think some of our competitors
> provide exactly that feature...

If I remember correctly, MS SQL Server and Sybase ASE provide
INSERTED and DELETED relations in triggers instead of NEW and OLD
records. In a FOR EACH ROW trigger the relation contains only one
row.

This is related to the thread on BEFORE triggers, in that these
products require that you UPDATE the row in the base table to modify
it (normally by joining to the INSERTED relation), making the latest
values available to other trigger code, and providing a clear
distinction between the values coming in to the trigger and the
latest values in the database.

-Kevin


From: Jim Nasby <jim(at)nasby(dot)net>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: "Robert Haas" <robertmhaas(at)gmail(dot)com>, "Simon Riggs" <simon(at)2ndquadrant(dot)com>, "Dean Rasheed" <dean(dot)a(dot)rasheed(at)gmail(dot)com>, "PostgreSQL Hackers" <pgsql-hackers(at)postgresql(dot)org>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Compressing the AFTER TRIGGER queue
Date: 2011-08-15 20:07:03
Message-ID: BD669E03-7295-46D2-8DF7-F7A927EB7B4E@nasby.net
Lists: pgsql-hackers

On Aug 4, 2011, at 8:40 AM, Kevin Grittner wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> On Wed, Aug 3, 2011 at 6:05 PM, Jim Nasby <jim(at)nasby(dot)net> wrote:
>>> Not sure how much this relates to this discussion, but I have
>>> often wished we had AFTER FOR EACH STATEMENT triggers that
>>> provided OLD and NEW recordsets you could make use of. Sometimes
>>> it's very valuable to be able to look at *all* the rows that
>>> changed in a transaction in one shot.
>>
>> Yeah, that would be awesome. I think some of our competitors
>> provide exactly that feature...
>
> If I remember correctly, MS SQL Server and Sybase ASE provide
> INSERTED and DELETED relations in triggers instead of NEW and OLD
> records. In a FOR EACH ROW trigger the relation contains only one
> row.
>
> This is related to the thread on BEFORE triggers, in that these
> products require that you UPDATE the row in the base table to modify
> it (normally by joining to the INSERTED relation), making the latest
> values available to other trigger code, and providing a clear
> distinction between the values coming in to the trigger and the
> latest values in the database.

Yeah, that sounds like how DB2 does it (though the relations were by default named NEW and OLD).

If there is agreement on having a NEW recordset that contains all the new records (or new values on updates) and an OLD recordset that contains all the old records, are there any other API issues that need to be resolved? How would this actually be implemented?
--
Jim C. Nasby, Database Architect jim(at)nasby(dot)net
512.569.9461 (cell) http://jim.nasby.net