Re: Marginal performance improvement: replace bms_first_member loops

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: Marginal performance improvement: replace bms_first_member loops
Date: 2014-11-27 19:20:43
Message-ID: 4218.1417116043@sss.pgh.pa.us
Lists: pgsql-hackers

Another thing that came out of the discussion at
http://www.postgresql.org/message-id/flat/CAOR=d=3j1U_q-zf8+jUx1hkx8ps+N8pm=EUTqyFdJ5ov=+fawg(at)mail(dot)gmail(dot)com
was that there was a significant amount of palloc/pfree traffic blamable
on the bms_first_member() loop in plpgsql's setup_param_list(). I've
been experimenting with a variant of bms_first_member() called
bms_next_member(), which doesn't modify the input bitmapset and therefore
removes the need to make a working copy when iterating over the members
of a set.

In isolation, bms_next_member() is fractionally slower than
bms_first_member() because it has to do a bit more shifting-and-masking,
but of course we more than win that back from eliminating a palloc/pfree
cycle. It's also worth noting that in principle, a bms_first_member()
loop is O(N^2) for large sets because it scans from the start of the
set each time; but I doubt this is much of an issue in practice, because
the bitmapsets we work with just aren't very large. (I did some
microbenchmarking and found that if one ignores the palloc overhead
question, a bms_next_member loop is a tad slower up to about four words
in the bitmapset, and faster beyond that because the rescans start to
make a difference. But four words would be 128 bits and very very few
bitmapsets in PG would have more members than that.)

The attached proposed patch adds bms_next_member() and replaces
bms_first_member() calls where it seemed to make sense. I've had a
hard time measuring much speed difference for this patch in isolation,
but in principle it should be less code and less cycles. It also seems
safer and more natural to not use destructive looping techniques.

regards, tom lane

Attachment Content-Type Size
bms_next_member.patch text/x-diff 17.3 KB

From: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Marginal performance improvement: replace bms_first_member loops
Date: 2014-11-28 08:38:00
Message-ID: CAEZATCVZ_4tcpGPb1h=2NUzXhK2z8mWPRbsQyMWa4Y3Q=fzVWg@mail.gmail.com
Lists: pgsql-hackers

On 27 November 2014 at 19:20, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> The attached proposed patch adds bms_next_member() and replaces
> bms_first_member() calls where it seemed to make sense. I've had a
> hard time measuring much speed difference for this patch in isolation,
> but in principle it should be less code and less cycles. It also seems
> safer and more natural to not use destructive looping techniques.
>

+1. I had a similar idea a while back but didn't have time to produce
a complete patch.

There is another micro-optimisation that you could make in
bms_next_member() -- it isn't necessary to do

w = RIGHTMOST_ONE(w)

because unlike bms_first_member, w isn't being used to mask out the bit
retrieved, so any higher bits don't matter and the later use of
rightmost_one_pos[...] will pick out the required rightmost bit.

Should this function protect against large negative inputs? As it
stands, passing in a value of prevbit less than -1 would be
problematic. Maybe it's sufficient to say "don't do that" in the docs,
rather than waste more cycles checking.

Regards,
Dean


From: David Rowley <dgrowleyml(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Marginal performance improvement: replace bms_first_member loops
Date: 2014-11-28 09:52:42
Message-ID: CAApHDvo3+QYMo_4gzjnGcJRPu_1+CpKBkC8LrzVXjdrGsxaDMA@mail.gmail.com
Lists: pgsql-hackers

On Fri, Nov 28, 2014 at 8:20 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

>
> The attached proposed patch adds bms_next_member() and replaces
> bms_first_member() calls where it seemed to make sense. I've had a
> hard time measuring much speed difference for this patch in isolation,
> but in principle it should be less code and less cycles. It also seems
> safer and more natural to not use destructive looping techniques.
>
>
I've had a quick read of the patch and it seems like a good idea.

I have to say I don't really like the modifying of the loop iterator that's
going on here:

col = -1;
while ((col = bms_next_member(rte->modifiedCols, col)) >= 0)
{
    col += FirstLowInvalidHeapAttributeNumber;
    /* do stuff */
    col -= FirstLowInvalidHeapAttributeNumber;
}

Some other code is doing this:

col = -1;
while ((col = bms_next_member(cols, col)) >= 0)
{
    /* bit numbers are offset by FirstLowInvalidHeapAttributeNumber */
    AttrNumber attno = col + FirstLowInvalidHeapAttributeNumber;

Which seems less prone to future breakage and possibly slightly less cycles.

A while back when I was benchmarking the planner time during my trials with
anti/semi join removals, I wrote a patch to change the usage pattern for
cases such as:

if (bms_membership(a) != BMS_SINGLETON)
    return;    /* nothing to do */
singleton = bms_singleton_member(a);
...

Into:

if (!bms_get_singleton(a, &singleton))
    return;    /* nothing to do */
...

Which means 1 function call and loop over the bitmapset, rather than 2
function calls and 2 loops over the set when the set is a singleton.

This knocked between 4 and 22% off of the time the planner spent in the join
removals path.
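[Editor's note: the combined check can be sketched as a single pass, again over a toy plain word array with invented names; the real bms_get_singleton() works on a Bitmapset.]

```c
#include <stdbool.h>
#include <stdint.h>

/* One pass over the words: succeed, filling *member, only if exactly one
 * bit is set anywhere in the set. This folds the separate
 * bms_membership() and bms_singleton_member() passes into one. */
static bool toy_get_singleton(const uint64_t *words, int nwords, int *member)
{
    int found = -1;

    for (int w = 0; w < nwords; w++)
    {
        uint64_t x = words[w];

        if (x == 0)
            continue;
        if (found >= 0 || (x & (x - 1)) != 0)
            return false;       /* bits in two words, or >1 bit in this one */
        found = w * 64 + __builtin_ctzll(x);
    }
    if (found < 0)
        return false;           /* empty set */
    *member = found;
    return true;
}
```

The caller then writes `if (!toy_get_singleton(words, n, &singleton)) return;` in place of the membership-then-member pair.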

The patch to implement this and change all suitable call sites is attached.

Regards

David Rowley

Attachment Content-Type Size
bms_get_singleton_v1.patch application/octet-stream 4.1 KB

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Marginal performance improvement: replace bms_first_member loops
Date: 2014-11-28 15:08:36
Message-ID: 9024.1417187316@sss.pgh.pa.us
Lists: pgsql-hackers

Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com> writes:
> On 27 November 2014 at 19:20, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> The attached proposed patch adds bms_next_member() and replaces
>> bms_first_member() calls where it seemed to make sense.

> There is another micro-optimisation that you could make in
> bms_next_member() -- it isn't necessary to do
> w = RIGHTMOST_ONE(w)

Excellent point! Thanks for noticing that.

> Should this function protect against large negative inputs? As it
> stands, passing in a value of prevbit less than -1 would be
> problematic. Maybe it's sufficient to say "don't do that" in the docs,
> rather than waste more cycles checking.

Yeah, I had considered whether to do that; instead of just prevbit++
it would need to be something like
prevbit = (prevbit < 0) ? 0 : prevbit + 1;
This would add one test-and-branch, and moreover one that would be
hard to predict correctly (given that most of our bitmapsets don't
have very many members). So it seems pretty expensive. Probably
a more explicit warning in the header comment is good enough; or
we could put in an Assert().

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: David Rowley <dgrowleyml(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Marginal performance improvement: replace bms_first_member loops
Date: 2014-11-28 15:11:40
Message-ID: 9142.1417187500@sss.pgh.pa.us
Lists: pgsql-hackers

David Rowley <dgrowleyml(at)gmail(dot)com> writes:
> I have to say I don't really like the modifying of the loop iterator that's
> going on here:

> col = -1;
> while ((col = bms_next_member(rte->modifiedCols, col)) >= 0)
> {
>     col += FirstLowInvalidHeapAttributeNumber;
>     /* do stuff */
>     col -= FirstLowInvalidHeapAttributeNumber;
> }

> Some other code is doing this:

> col = -1;
> while ((col = bms_next_member(cols, col)) >= 0)
> {
>     /* bit numbers are offset by FirstLowInvalidHeapAttributeNumber */
>     AttrNumber attno = col + FirstLowInvalidHeapAttributeNumber;

> Which seems less prone to future breakage and possibly slightly less cycles.

Yeah, I'd come to the same conclusion while thinking about it in the
shower this morning ...

> A while back when I was benchmarking the planner time during my trials with
> anti/semi join removals, I wrote a patch to change the usage pattern for
> cases such as:
> ...
> This knocked between 4 and 22% off of the time the planner spent in the join
> removals path.

Really!? I've never seen either of those functions show up all that high
in profiles. Can you share the test case you were measuring?

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: David Rowley <dgrowleyml(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Marginal performance improvement: replace bms_first_member loops
Date: 2014-11-28 19:18:10
Message-ID: 5102.1417202290@sss.pgh.pa.us
Lists: pgsql-hackers

David Rowley <dgrowleyml(at)gmail(dot)com> writes:
> A while back when I was benchmarking the planner time during my trials with
> anti/semi join removals, I wrote a patch to change the usage pattern for
> cases such as:

> if (bms_membership(a) != BMS_SINGLETON)
>     return;    /* nothing to do */
> singleton = bms_singleton_member(a);
> ...

> Into:

> if (!bms_get_singleton(a, &singleton))
>     return;    /* nothing to do */
> ...

> Which means 1 function call and loop over the bitmapset, rather than 2
> function calls and 2 loops over the set when the set is a singleton.

I went ahead and committed this with some cosmetic adjustments.
I'm not sure about there being any performance win in existing use-cases,
but it seems worth doing on notational grounds anyway.

regards, tom lane


From: David Rowley <dgrowleyml(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Marginal performance improvement: replace bms_first_member loops
Date: 2014-11-29 03:36:06
Message-ID: CAApHDvpsPXzULsOi7YunTnqhbwZQmv3yDjV5ZbPe2FsX1t2DcQ@mail.gmail.com
Lists: pgsql-hackers

On Sat, Nov 29, 2014 at 8:18 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> David Rowley <dgrowleyml(at)gmail(dot)com> writes:
> > A while back when I was benchmarking the planner time during my trials
> > with anti/semi join removals, I wrote a patch to change the usage
> > pattern for cases such as:
> >
> > if (bms_membership(a) != BMS_SINGLETON)
> >     return;    /* nothing to do */
> > singleton = bms_singleton_member(a);
> > ...
> >
> > Into:
> >
> > if (!bms_get_singleton(a, &singleton))
> >     return;    /* nothing to do */
> > ...
> >
> > Which means 1 function call and loop over the bitmapset, rather than 2
> > function calls and 2 loops over the set when the set is a singleton.
>
> I went ahead and committed this with some cosmetic adjustments.
>

Thank you!

> I'm not sure about there being any performance win in existing use-cases,
> but it seems worth doing on notational grounds anyway.
>
>
My original benchmarks for this were based on the semi/anti join patch I
was working on at the time.

Benchmarks here:
http://www.postgresql.org/message-id/CAApHDvo21-b+PU=sC9B1QiEG3YL4T919i4WjZfnfP6UPXS9nZg@mail.gmail.com

The existing left join removal code should see similar speed-ups, although
the time spent in the join removal code path only amounted to between 0.02
and 0.2% of total planning time with my test cases.

Regards

David Rowley