Re: [PATCH v3] Avoid manual shift-and-test logic in AllocSetFreeIndex

Lists: pgsql-hackers
From: Jeremy Kerr <jk(at)ozlabs(dot)org>
To: <pgsql-hackers(at)postgresql(dot)org>
Subject: [PATCH v3] Avoid manual shift-and-test logic in AllocSetFreeIndex
Date: 2009-06-30 07:08:17
Message-ID: 1246345697.45390.889305672663.1.gpush@pingu

Move the shift-and-test logic into a separate fls() function, which
can use __builtin_clz() if it's available.

This requires a new check for __builtin_clz in the configure script.

Results in a ~2% performance increase on PowerPC.

Signed-off-by: Jeremy Kerr <jk(at)ozlabs(dot)org>

---
v3: respin as context diff

---
configure.in | 13 +++++++++++++
src/backend/utils/mmgr/aset.c | 34 +++++++++++++++++++++++++++-------
2 files changed, 40 insertions(+), 7 deletions(-)

*** a/configure.in
--- b/configure.in
***************
*** 1361,1366 **** case $host_os in
--- 1361,1379 ----
AC_FUNC_FSEEKO;;
esac

+ # GCC builtins
+ #
+ # We need AC_TRY_LINK here, as the prototype generated by AC_CHECK_FUNC
+ # will cause gcc to try to reference a non-builtin symbol.
+
+ AC_MSG_CHECKING([for __builtin_clz])
+ AC_TRY_LINK([],
+ [__builtin_clz(0);],
+ [AC_DEFINE(HAVE_BUILTIN_CLZ, 1,
+ [Define to 1 if you have __builtin_clz().])
+ AC_MSG_RESULT(yes)],
+ [AC_MSG_RESULT(no)])
+

#
# Pthreads
*** a/src/backend/utils/mmgr/aset.c
--- b/src/backend/utils/mmgr/aset.c
***************
*** 255,260 **** static MemoryContextMethods AllocSetMethods = {
--- 255,285 ----
#define AllocAllocInfo(_cxt, _chunk)
#endif

+ /*
+  * fls: find last set bit.
+  *
+  * Returns the 1-based index of the most-significant bit in x.  The MSB
+  * is bit number 32, the LSB is bit number 1.  If x is zero, the result
+  * is undefined.
+  */
+ static inline int
+ fls(unsigned int x)
+ {
+ #ifdef HAVE_BUILTIN_CLZ
+ 	return 32 - __builtin_clz(x);
+ #else
+ 	int			ls = 0;
+ 
+ 	while (x != 0)
+ 	{
+ 		ls++;
+ 		x >>= 1;
+ 	}
+ 
+ 	return ls;
+ #endif
+ }
+ 
/* ----------
* AllocSetFreeIndex -
*
***************
*** 268,281 **** AllocSetFreeIndex(Size size)
  {
  	int			idx = 0;
  
! 	if (size > 0)
  	{
! 		size = (size - 1) >> ALLOC_MINBITS;
! 		while (size != 0)
! 		{
! 			idx++;
! 			size >>= 1;
! 		}
  		Assert(idx < ALLOCSET_NUM_FREELISTS);
  	}
  
--- 293,301 ----
  {
  	int			idx = 0;
  
! 	if (size > (1 << ALLOC_MINBITS))
  	{
! 		idx = fls((size - 1) >> ALLOC_MINBITS);
  		Assert(idx < ALLOCSET_NUM_FREELISTS);
  	}


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Jeremy Kerr <jk(at)ozlabs(dot)org>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCH v3] Avoid manual shift-and-test logic in AllocSetFreeIndex
Date: 2009-07-19 11:22:05
Message-ID: 603c8f070907190422r13c19deaw74b1d9fb95e4602f@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jun 30, 2009 at 3:08 AM, Jeremy Kerr<jk(at)ozlabs(dot)org> wrote:
> Move the shift-and-test logic into a separate fls() function, which
> can use __builtin_clz() if it's available.
>
> This requires a new check for __builtin_clz in the configure script.
>
> Results in a ~2% performance increase on PowerPC.

There are some comments on this patch that have been posted to
commitfest.postgresql.org rather than emailed to -hackers. Note to
all: please don't do this. The purpose of commitfest.postgresql.org
is to summarize the mailing list discussion for easy reference, not to
replace the mailing lists.

That having been said, Jeremy, you probably want to take a look at
those comments and I have a few responses to them as well.

Dan Colish, the round-robin reviewer assigned to this patch, added the
following comment:
> Applied and built cleanly. Regress passes. Trying to hunt down ppc box to see if performance enhancement
> can be seen.

Question: are we only doing this because of PowerPC? What is going to
happen on x86 and other architectures? x86 is not the center of the
universe, but at a very minimum we need to make sure that things don't
go backwards on what is a very common platform. Has anyone done any
benchmarking of this?

A related question: the original email for this patch says that it
results in a performance increase of about 2% on PPC. But since it
gives no details on exactly what the test was that improved by 2%,
it's hard to judge the impact of this. If this means that
AllocSetFreeIndex() is 2% faster, I think we should reject this patch
and move on. It's not worth introducing architecture-specific code to
get a performance improvement that probably won't even move the needle
on a macro-benchmark. On the other hand, if the test was something
like a pgbench run, then this is really very impressive. But we don't
know which it is.

Zdenek Kotala added this comment:
> I have several objections:
>
> - inline is forbidden to use in PostgreSQL - you need exception or do it differently
>
> - type mismatch - Size vs unsigned int vs 32. you should use only Size and sizeof(Size)
>
> And general comment:
>
> Do we want to have this kind of code which is optimized for one compiler and platform. See openSSL or
> MySQL, they have many optimizations which work fine on one platform with specific version of compiler and
> specific version of OS. But if you select different version of compiler or different compiler you can get very
> different performance result (e.g. MySQL 30% degradation, OpenSSL RSA 3x faster and so on).
>
> I think in this case, it makes sense to have optimization here, but we should do it like spinlocks are
> implemented and put this code into /port. It should help to implemented special code for SPARC and SUN
> Studio for example.

I don't have any thoughts on this part beyond what I stated above, but
hopefully Jeremy does.

I am going to mark this "Waiting on author" pending a response to all
of the above, though hopefully Dan Colish will continue with his
reviewing efforts in the meantime.

Thanks,

...Robert


From: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Jeremy Kerr <jk(at)ozlabs(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCH v3] Avoid manual shift-and-test logic in AllocSetFreeIndex
Date: 2009-07-19 13:13:15
Message-ID: 4A631BEB.9040606@kaltenbrunner.cc
Lists: pgsql-hackers

Robert Haas wrote:
> On Tue, Jun 30, 2009 at 3:08 AM, Jeremy Kerr<jk(at)ozlabs(dot)org> wrote:
>> Move the shift-and-test logic into a separate fls() function, which
>> can use __builtin_clz() if it's available.
>>
>> This requires a new check for __builtin_clz in the configure script.
>>
>> Results in a ~2% performance increase on PowerPC.
>
> There are some comments on this patch that have been posted to
> commitfest.postgresql.org rather than emailed to -hackers. Note to
> all: please don't do this. The purpose of commitfest.postgresql.org
> is to summarize the mailing list discussion for easy reference, not to
> replace the mailing lists.
>
> That having been said, Jeremy, you probably want to take a look at
> those comments and I have a few responses to them as well.
>
> Dan Colish, the round-robin reviewer assigned to this patch, added the
> following comment:
>> Applied and built cleanly. Regress passes. Trying to hunt down ppc box to see if performance enhancement
>> can be seen.
>
> Question: are we only doing this because of PowerPC? What is going to
> happen on x86 and other architectures? x86 is not the center of the
> universe, but at a very minimum we need to make sure that things don't
> go backwards on what is a very common platform. Has anyone done any
> benchmarking of this?
>
> A related question: the original email for this patch says that it
> results in a performance increase of about 2% on PPC. But since it
> gives no details on exactly what the test was that improved by 2%,
> it's hard to judge the impact of this. If this means that
> AllocSetFreeIndex() is 2% faster, I think we should reject this patch
> and move on. It's not worth introducing architecture-specific code to
> get a performance improvement that probably won't even move the needle
> on a macro-benchmark. On the other hand, if the test was something
> like a pgbench run, then this is really very impressive. But we don't
> know which it is.

There are numbers in the original thread that this patch came out of:
http://archives.postgresql.org/pgsql-hackers/2009-06/msg00159.php

Stefan


From: Jeremy Kerr <jk(at)ozlabs(dot)org>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCH v3] Avoid manual shift-and-test logic in AllocSetFreeIndex
Date: 2009-07-20 00:47:18
Message-ID: 200907201047.18603.jk@ozlabs.org
Lists: pgsql-hackers

Hi Robert,

> That having been said, Jeremy, you probably want to take a look at
> those comments and I have a few responses to them as well.

OK, thanks for the heads-up.

> following comment:
> > Applied and built cleanly. Regress passes. Trying to hunt down ppc
> > box to see if performance enhancement can be seen.
>
> Question: are we only doing this because of PowerPC?

No, this patch will benefit any architecture that has a gcc
implementation of __builtin_clz. I know that both x86 and powerpc gcc
support this.

> What is going to happen on x86 and other architectures? x86 is not
> the center of the universe, but at a very minimum we need to make sure
> that things don't go backwards on what is a very common platform. Has
> anyone done any benchmarking of this?

Yes, Atsushi Ogawa did some benchmarking with this patch on x86, his
numbers are here:

http://archives.postgresql.org/message-id/4A2895C4.9050108@hi-ho.ne.jp

In fact, Atsushi originally submitted a patch using inline asm (using
"bsr") to do this on x86. Coincidentally, I was working on a powerpc
implementation (using "cntlz") at the same time, so submitted a patch
using the gcc builtin that would work on both (and possibly other)
architectures.

> A related question: the original email for this patch says that it
> results in a performance increase of about 2% on PPC. But since it
> gives no details on exactly what the test was that improved by 2%,
> it's hard to judge the impact of this. If this means that
> AllocSetFreeIndex() is 2% faster, I think we should reject this patch
> and move on. It's not worth introducing architecture-specific code
> to get a performance improvement that probably won't even move the
> needle on a macro-benchmark. On the other hand, if the test was
> something like a pgbench run, then this is really very impressive.
> But we don't know which it is.

The 2% improvement I saw is for a sysbench OLTP run. I'm happy to re-do
the run and report specific numbers if you need them.

> Zdenek Kotala added this comment:
> > I have several objections:
> >
> > - inline is forbidden to use in PostgreSQL - you need exception or
> > do it differently
> >
> > - type mismatch - Size vs unsigned int vs 32. you should use only
> > Size and sizeof(Size)

OK, happy to make these changes. What's the commitfest process here,
should I redo the patch and re-send to -hackers?

(inline again: should I just make this a static, the compiler can inline
where possible? or do you want a macro?)

> > And general comment:
> >
> > Do we want to have this kind of code which is optimized for one
> > compiler and platform.

One compiler, multiple platforms.

> > See openSSL or MySQL, they have many
> > optimizations which work fine on one platform with specific version
> > of compiler and specific version of OS. But if you select different
> > version of compiler or different compiler you can get very
> > different performance result (e.g. MySQL 30% degradation, OpenSSL
> > RSA 3x faster and so on).

I don't think we're going to see a lot of variation here (besides the
difference where __builtin_clz isn't available). Given that this is
implemented with a couple of instructions, I'm confident that we won't
see any degradation for platforms that support __builtin_clz. For the
other cases, we just use the existing while-loop algorithm so there
should be no change (unless we mess up the inlining...).

> > I think in this case, it makes sense to have optimization here, but
> > we should do it like spinlocks are implemented and put this code
> > into /port.

Unless I'm missing something, spinlocks are done the same way - we have
one file, src/include/storage/s_lock.h, which is mostly different
implementations of the lock primitives for different platforms,
separated by preprocessor tests.

Essentially, this is just one line of code, protected by
HAVE_BUILTIN_CLZ (which is a feature-test, rather than a specific
platform or compiler test), and is only used in one place. I don't think
that warrants a separate file.

Cheers,

Jeremy


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jeremy Kerr <jk(at)ozlabs(dot)org>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCH v3] Avoid manual shift-and-test logic in AllocSetFreeIndex
Date: 2009-07-20 01:34:11
Message-ID: 14195.1248053651@sss.pgh.pa.us
Lists: pgsql-hackers

Jeremy Kerr <jk(at)ozlabs(dot)org> writes:
>>> - inline is forbidden to use in PostgreSQL - you need exception or
>>> do it differently

> (inline again: should I just make this a static, the compiler can inline
> where possible? or do you want a macro?)

I don't know where Zdenek got the idea that we have something against
"inline".

So far as I can see, recent versions of gcc claim to support
__builtin_clz on all supported architectures. On some it might be no
faster than our existing loop, but it seems unlikely to be slower.

The two comments I have are

* do something other than the hardwired "32" for word size; perhaps
sizeof(int) * BITS_PER_BYTE.

* do not use the separate "fls" function. On a compiler that fails to
inline it, this patch would be a net performance loss, which we're not
likely to tolerate for a patch that has no other reason to live than
performance. Just #if the builtin right into the one place where it
will be used.

regards, tom lane


From: Jeremy Kerr <jk(at)ozlabs(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: [PATCH v4] Avoid manual shift-and-test logic in AllocSetFreeIndex
Date: 2009-07-20 05:33:15
Message-ID: 1248067995.540170.573982756831.1.gpush@pingu
Lists: pgsql-hackers

Rather than testing single bits in a loop, change AllocSetFreeIndex to
use the __builtin_clz() function to calculate the chunk index.

This requires a new check for __builtin_clz in the configure script.

Results in a ~2% performance increase on sysbench on PowerPC.

Signed-off-by: Jeremy Kerr <jk(at)ozlabs(dot)org>

---
v4: don't use separate fls() function, remove hardcoded 32

---
configure.in | 13 +++++++++++++
src/backend/utils/mmgr/aset.c | 14 +++++++++++++-
2 files changed, 26 insertions(+), 1 deletion(-)

*** a/configure.in
--- b/configure.in
***************
*** 1361,1366 **** case $host_os in
--- 1361,1379 ----
AC_FUNC_FSEEKO;;
esac

+ # GCC builtins
+ #
+ # We need AC_TRY_LINK here, as the prototype generated by AC_CHECK_FUNC
+ # will cause gcc to try to reference a non-builtin symbol.
+
+ AC_MSG_CHECKING([for __builtin_clz])
+ AC_TRY_LINK([],
+ [__builtin_clz(0);],
+ [AC_DEFINE(HAVE_BUILTIN_CLZ, 1,
+ [Define to 1 if you have __builtin_clz().])
+ AC_MSG_RESULT(yes)],
+ [AC_MSG_RESULT(no)])
+

#
# Pthreads
*** a/src/backend/utils/mmgr/aset.c
--- b/src/backend/utils/mmgr/aset.c
***************
*** 268,281 **** AllocSetFreeIndex(Size size)
  {
  	int			idx = 0;
  
! 	if (size > 0)
  	{
  		size = (size - 1) >> ALLOC_MINBITS;
  		while (size != 0)
  		{
  			idx++;
  			size >>= 1;
  		}
  		Assert(idx < ALLOCSET_NUM_FREELISTS);
  	}
  
--- 268,293 ----
  {
  	int			idx = 0;
  
! 	if (size > (1 << ALLOC_MINBITS))
  	{
  		size = (size - 1) >> ALLOC_MINBITS;
+ 
+ #ifdef HAVE_BUILTIN_CLZ
+ 		/*
+ 		 * If possible, use __builtin_clz to calculate the chunk index - this
+ 		 * should be O(1) rather than O(n).  The builtin takes an unsigned int,
+ 		 * so we need to cast from the possibly 64-bit Size type.  This cast
+ 		 * won't overflow, as the limit is at 2^(32 + ALLOC_MINBITS) bytes,
+ 		 * which is larger than ALLOC_CHUNK_LIMIT.
+ 		 */
+ 		idx = sizeof(unsigned int) * BITS_PER_BYTE -
+ 			__builtin_clz((unsigned int) size);
+ #else
  		while (size != 0)
  		{
  			idx++;
  			size >>= 1;
  		}
+ #endif
  		Assert(idx < ALLOCSET_NUM_FREELISTS);
  	}


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jeremy Kerr <jk(at)ozlabs(dot)org>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCH v4] Avoid manual shift-and-test logic in AllocSetFreeIndex
Date: 2009-07-20 16:45:47
Message-ID: 22170.1248108347@sss.pgh.pa.us
Lists: pgsql-hackers

Jeremy Kerr <jk(at)ozlabs(dot)org> writes:
> Rather than testing single bits in a loop, change AllocSetFreeIndex to
> use the __builtin_clz() function to calculate the chunk index.

> This requires a new check for __builtin_clz in the configure script.

> Results in a ~2% performance increase on sysbench on PowerPC.

I did some performance testing on this by extracting the
AllocSetFreeIndex function into a standalone test program that just
executed it a lot of times in a loop. And there's a problem: on
x86_64 it is not much of a win. The code sequence that gcc generates
for __builtin_clz is basically

bsrl %eax, %eax
xorl $31, %eax

and it turns out that Intel hasn't seen fit to put a lot of effort into
the BSR instruction. It's constant time, all right, but on most of
their CPUs that constant time is like 8 or 16 times slower than an ADD;
cf http://www.intel.com/Assets/PDF/manual/248966.pdf

What I found on both a couple-years-old Xeon and a pretty new Macbook
is that the __builtin_clz code is actually *slower* than the naive loop
for the fewest-iterations case (request size up to 16 bytes) and about
a breakeven at the next size category (up to 32). It doesn't start to
look like a useful win until you get to sizes up to 1KB or so.
Unfortunately PG spends a lot more time allocating little structs than
big ones.

I suppose that it's more of a win on PPC (where probably __builtin_clz
is actually fast), but on the basis of these results it's going to be
no gain and maybe a loss on Intel.

The other problem is that this approach is gcc-specific, which seems
particularly bad for something that's only going to be a win on
non-Intel platforms; they're much more likely to be using non-gcc
compilers.

I wonder whether it would work better to do manual unrolling of the
shift loop. That is, forget about the builtin and do something like

	while (size >= 8)
	{
		idx += 4;
		size >>= 4;
	}
	while (size)
	{
		idx++;
		size >>= 1;
	}

Obviously there's room for experimental optimization of the particular
shift strides we use here, but remember that the total shift count is
never going to exceed about 10, and will usually be quite a bit less.

I did some quick experimentation along these lines and couldn't really
persuade myself that unrolling was going to win much, but that's an
Intel-specific result.

Anyway, based on what I'm seeing here I can't convince myself that this
is worth introducing a compiler dependency for.

regards, tom lane


From: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Jeremy Kerr <jk(at)ozlabs(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCH v4] Avoid manual shift-and-test logic in AllocSetFreeIndex
Date: 2009-07-20 18:00:00
Message-ID: 4A64B0A0.80107@kaltenbrunner.cc
Lists: pgsql-hackers

Tom Lane wrote:
> Jeremy Kerr <jk(at)ozlabs(dot)org> writes:
>> Rather than testing single bits in a loop, change AllocSetFreeIndex to
>> use the __builtin_clz() function to calculate the chunk index.
>
>> This requires a new check for __builtin_clz in the configure script.
>
>> Results in a ~2% performance increase on sysbench on PowerPC.
>
> I did some performance testing on this by extracting the
> AllocSetFreeIndex function into a standalone test program that just
> executed it a lot of times in a loop. And there's a problem: on
> x86_64 it is not much of a win. The code sequence that gcc generates
> for __builtin_clz is basically
>
> bsrl %eax, %eax
> xorl $31, %eax
>
> and it turns out that Intel hasn't seen fit to put a lot of effort into
> the BSR instruction. It's constant time, all right, but on most of
> their CPUs that constant time is like 8 or 16 times slower than an ADD;
> cf http://www.intel.com/Assets/PDF/manual/248966.pdf

hmm, interesting - I don't have the exact numbers any more, but that
patch (or a previous version of it) definitely showed a noticeable
improvement when I tested with sysbench on a current-generation Intel
Nehalem...

Stefan


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
Cc: Jeremy Kerr <jk(at)ozlabs(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCH v4] Avoid manual shift-and-test logic in AllocSetFreeIndex
Date: 2009-07-20 19:25:25
Message-ID: 25859.1248117925@sss.pgh.pa.us
Lists: pgsql-hackers

Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc> writes:
> Tom Lane wrote:
>> and it turns out that Intel hasn't seen fit to put a lot of effort into
>> the BSR instruction. It's constant time, all right, but on most of
>> their CPUs that constant time is like 8 or 16 times slower than an ADD;
>> cf http://www.intel.com/Assets/PDF/manual/248966.pdf

> hmm, interesting - I don't have the exact numbers any more, but that
> patch (or a previous version of it) definitely showed a noticeable
> improvement when I tested with sysbench on a current-generation Intel
> Nehalem...

Hmm. I may be overestimating the importance of the smaller size
categories. To try to get some trustworthy numbers, I made a quick-hack
patch (attachment #1) to count the actual numbers of calls to
AllocSetFreeIndex, and measured the totals for a run of the regression
tests on both a 32-bit machine and a 64-bit machine. On 32 I got
these totals:

0 5190113
1 5663980
2 3573261
3 4476398
4 4246178
5 1100634
6 386501
7 601654
8 44884
9 52372
10 202801

and on 64 these:

0 2139534
1 5994692
2 5711479
3 3289034
4 4550931
5 2573389
6 487566
7 588470
8 155148
9 52750
10 202597

If you want to do the same in some other workload, feel free. I
wouldn't trust the results from a single-purpose benchmark too much,
though.

I then put together a test harness that exercises AllocSetFreeIndex
according to these distributions (attachments #2,#3). (Note: I found
out that it's necessary to split the test into two files --- otherwise
gcc will inline AllocSetFreeIndex and partially const-fold the work,
leading to skewed results.)

What I'm seeing with this harness on my x86 machines is that
__builtin_clz is indeed a bit faster than a naive loop, but not by
very much --- it saves maybe 25% of the runtime. It's better on an
old PPC Mac; saves about 50%. Still, these are not impressive numbers
for a microbenchmark that is testing *only* AllocSetFreeIndex.

I'm still interested in the idea of doing a manual unroll instead of
relying on a compiler-specific feature. However, some quick testing
didn't find an unrolling that helps much.

regards, tom lane

Attachment Content-Type Size
unknown_filename text/plain 980 bytes
unknown_filename text/plain 936 bytes
unknown_filename text/plain 720 bytes

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Jeremy Kerr <jk(at)ozlabs(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCH v4] Avoid manual shift-and-test logic in AllocSetFreeIndex
Date: 2009-07-20 19:37:42
Message-ID: 26046.1248118662@sss.pgh.pa.us
Lists: pgsql-hackers

I wrote:
> I'm still interested in the idea of doing a manual unroll instead of
> relying on a compiler-specific feature. However, some quick testing
> didn't find an unrolling that helps much.

Hmm, actually this seems to work ok:

	idx++;
	size >>= 1;
	if (size != 0)
	{
		idx++;
		size >>= 1;
		if (size != 0)
		{
			idx++;
			size >>= 1;
			if (size != 0)
			{
				idx++;
				size >>= 1;
				while (size != 0)
				{
					idx++;
					size >>= 1;
				}
			}
		}
	}

(this is with the initial "if (size > (1 << ALLOC_MINBITS))" so that
we know the starting value is nonzero)

This seems to be about a wash or a small gain on x86_64, but on my
PPC Mac laptop it's very nearly identical in speed to the __builtin_clz
code. I also see a speedup on HPPA, for which my gcc is too old to
know about __builtin_clz.

Anyone want to see if they can beat that? Some testing on other
architectures would help too.

regards, tom lane


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Jeremy Kerr <jk(at)ozlabs(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCH v4] Avoid manual shift-and-test logic in AllocSetFreeIndex
Date: 2009-07-20 21:21:29
Message-ID: 407d949e0907201421m1eb26ef9u171f2790e5ef16f7@mail.gmail.com
Lists: pgsql-hackers

On Mon, Jul 20, 2009 at 8:37 PM, Tom Lane<tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Anyone want to see if they can beat that?  Some testing on other
> architectures would help too.

Hm, I took the three implementations so far (normal, unrolled, and
clz) as well as the two from
http://graphics.stanford.edu/~seander/bithacks.html#IntegerLogObvious
and got some very strange results:

normal: 1.494s
clz: 2.214s
unrolled: 2.966s
lookup table: 0.001s
float hack: 11.930s

I can't see why the unrolled implementation is slower than the
non-unrolled so I'm suspecting something's wrong with my #ifdefs but I
don't see it.

I do think the code I grabbed from the stanford page might be
off-by-one for our purposes but I haven't looked closely at that.

I also wonder if this microbenchmark is actually ok because it's
testing the same value over and over so any branch-prediction will
shine unrealistically well.

--
greg
http://mit.edu/~gsstark/resume.pdf

Attachment Content-Type Size
testit.c text/x-csrc 3.4 KB

From: Greg Stark <gsstark(at)mit(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Jeremy Kerr <jk(at)ozlabs(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCH v4] Avoid manual shift-and-test logic in AllocSetFreeIndex
Date: 2009-07-20 22:58:57
Message-ID: 407d949e0907201558t1ea2087mea35f7606ab8b373@mail.gmail.com
Lists: pgsql-hackers

Doh, I forgot this bit. Will rerun tests now.

On Mon, Jul 20, 2009 at 8:25 PM, Tom Lane<tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>  (Note: I found
> out that it's necessary to split the test into two files --- otherwise
> gcc will inline AllocSetFreeIndex and partially const-fold the work,
> leading to skewed results.)

--
greg
http://mit.edu/~gsstark/resume.pdf


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Jeremy Kerr <jk(at)ozlabs(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCH v4] Avoid manual shift-and-test logic in AllocSetFreeIndex
Date: 2009-07-20 23:07:40
Message-ID: 29065.1248131260@sss.pgh.pa.us
Lists: pgsql-hackers

Greg Stark <gsstark(at)mit(dot)edu> writes:
> I also wonder if this microbenchmark is actually ok because it's
> testing the same value over and over so any branch-prediction will
> shine unrealistically well.

Yeah, that is a good point --- and it would benefit the unrolled loop
more than the other versions. We should probably revise the test
harness so it mixes the size requests a bit. I'm not sure of a suitably
simple way to do that, though.

regards, tom lane


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Jeremy Kerr <jk(at)ozlabs(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCH v4] Avoid manual shift-and-test logic in AllocSetFreeIndex
Date: 2009-07-21 01:11:37
Message-ID: 407d949e0907201811i13c73e18x58295566d27aadcc@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jul 21, 2009 at 12:07 AM, Tom Lane<tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Greg Stark <gsstark(at)mit(dot)edu> writes:
>> I also wonder if this microbenchmark is actually ok because it's
>> testing the same value over and over so any branch-prediction will
>> shine unrealistically well.
>
> Yeah, that is a good point --- and it would benefit the unrolled loop
> more than the other versions.  We should probably revise the test
> harness so it mixes the size requests a bit.  I'm not sure of a suitably
> simple way to do that, though.

Well it was a bit of a pain but I filled an array with (1/1000 scaled
down) values and then permuted them. I also went ahead and set the
low-order bits to random values since the lookup table based algorithm
might be affected by it.

The results are a bit disappointing on my machine, only the CLZ and
lookup table come out significantly ahead:

$ ./testit 10
0: 620
1: 4949
2: 5378
3: 3560
4: 4426
5: 4218
6: 1098
7: 387
8: 599
9: 44
10: 52
clz 1.530s
lookup table 1.720s
float hack 4.424s
unrolled 5.280s
normal 5.369s

--
greg
http://mit.edu/~gsstark/resume.pdf

Attachment Content-Type Size
testit.c text/x-csrc 3.6 KB
testit2.c text/x-csrc 2.4 KB

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Jeremy Kerr <jk(at)ozlabs(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCH v4] Avoid manual shift-and-test logic in AllocSetFreeIndex
Date: 2009-07-21 03:07:36
Message-ID: 3958.1248145656@sss.pgh.pa.us
Lists: pgsql-hackers

Greg Stark <gsstark(at)mit(dot)edu> writes:
> Well it was a bit of a pain but I filled an array with (1/1000 scaled
> down) values and then permuted them. I also went ahead and set the
> low-order bits to random values since the lookup table based algorithm
> might be affected by it.

> The results are a bit disappointing on my machine, only the CLZ and
> lookup table come out significantly ahead:

> clz 1.530s
> lookup table 1.720s
> float hack 4.424s
> unrolled 5.280s
> normal 5.369s

It strikes me that we could assume that the values are < 64K and hence
drop the first case in the lookup table code. I've added that variant
and get these results on my machines:

x86_64 (Xeon):

clz 15.357s
lookup table 16.582s
small lookup table 16.705s
float hack 25.138s
unrolled 64.630s
normal 79.025s

PPC:

clz 3.842s
lookup table 7.298s
small lookup table 8.799s
float hack 19.418s
unrolled 7.656s
normal 8.949s

HPPA:
clz (n/a)
lookup table 11.515s
small lookup table 10.803s
float hack 16.502s
unrolled 17.632s
normal 19.754s

Not sure why the "small lookup table" variant actually seems slower
than the original on two of these machines; it can hardly be slower in
reality since it's strictly less code. Maybe some weird code-alignment
issue?

It also seems weird that the x86_64 is now showing a much bigger gap
between clz and "normal" than before. I don't see how branch prediction
would do much for the "normal" code.

regards, tom lane


From: Jeremy Kerr <jk(at)ozlabs(dot)org>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCH v4] Avoid manual shift-and-test logic in AllocSetFreeIndex
Date: 2009-07-21 03:08:11
Message-ID: 200907211308.12371.jk@ozlabs.org
Lists: pgsql-hackers

Hi Greg,

Thanks for the benchmark app, thought I'd pitch in with some ppc
results:

> clz 1.530s
> lookup table 1.720s
> float hack 4.424s
> unrolled 5.280s
> normal 5.369s

POWER5+:
clz 2.046s
lookup table 2.188s
float hack 7.786s
unrolled 6.353s
normal 7.033s

POWER6:
clz 1.063s
lookup table 1.199s
float hack 3.575s
unrolled 2.290s
normal 3.456s

Cheers,

Jeremy


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jeremy Kerr <jk(at)ozlabs(dot)org>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCH v4] Avoid manual shift-and-test logic in AllocSetFreeIndex
Date: 2009-07-21 14:03:22
Message-ID: 11828.1248185002@sss.pgh.pa.us
Lists: pgsql-hackers

Jeremy Kerr <jk(at)ozlabs(dot)org> writes:
> Thanks for the benchmark app, thought I'd pitch in with some ppc
> results:

It looks to me like we should go with the lookup table approach, as
being the best tradeoff of speed improvement vs platform and compiler
independence. The "float hack" is right out ;-)

I wonder whether it is worth fooling with alternatives for the data type
of the lookup table entries. unsigned char might be a tad faster
(avoid useless sign-extension work). int might be faster yet, but it
would enlarge the table, creating a distributed overhead that the
microbenchmark would completely fail to show. Or we could buy that
back by reducing the table to cover only 6 bits, knowing that 12 is
more than we need.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jeremy Kerr <jk(at)ozlabs(dot)org>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCH v4] Avoid manual shift-and-test logic in AllocSetFreeIndex
Date: 2009-07-21 19:59:58
Message-ID: 18739.1248206398@sss.pgh.pa.us
Lists: pgsql-hackers

Jeremy Kerr <jk(at)ozlabs(dot)org> writes:
> Rather than testing single bits in a loop, change AllocSetFreeIndex to
> use the __builtin_clz() function to calculate the chunk index.

Per subsequent discussion and testing, I've committed a patch that
uses a lookup table instead of depending on __builtin_clz().
http://archives.postgresql.org/message-id/20090721195312.207CA75331E@cvs.postgresql.org

regards, tom lane