Re: [PATCH v4] Avoid manual shift-and-test logic in AllocSetFreeIndex

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jeremy Kerr <jk(at)ozlabs(dot)org>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCH v4] Avoid manual shift-and-test logic in AllocSetFreeIndex
Date: 2009-07-20 16:45:47
Message-ID: 22170.1248108347@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Jeremy Kerr <jk(at)ozlabs(dot)org> writes:
> Rather than testing single bits in a loop, change AllocSetFreeIndex to
> use the __builtin_clz() function to calculate the chunk index.

> This requires a new check for __builtin_clz in the configure script.

> Results in a ~2% performance increase on sysbench on PowerPC.

I did some performance testing on this by extracting the
AllocSetFreeIndex function into a standalone test program that just
executed it a lot of times in a loop. And there's a problem: on
x86_64 it is not much of a win. The code sequence that gcc generates
for __builtin_clz is basically

bsrl %eax, %eax
xorl $31, %eax

and it turns out that Intel hasn't seen fit to put a lot of effort into
the BSR instruction. It's constant time, all right, but on most of
their CPUs that constant time is like 8 or 16 times slower than an ADD;
cf http://www.intel.com/Assets/PDF/manual/248966.pdf

What I found on both a couple-years-old Xeon and a pretty new Macbook
is that the __builtin_clz code is actually *slower* than the naive loop
for the fewest-iterations case (request size up to 16 bytes) and about
a breakeven at the next size category (up to 32). It doesn't start to
look like a useful win until you get to sizes up to 1KB or so.
Unfortunately PG spends a lot more time allocating little structs than
big ones.

I suppose that it's more of a win on PPC (where probably __builtin_clz
is actually fast), but on the basis of these results it's going to be
no gain and maybe a loss on Intel.

The other problem is that this approach is gcc-specific, which seems
particularly bad for something that's only going to be a win on
non-Intel platforms; they're much more likely to be using non-gcc
compilers.

I wonder whether it would work better to do manual unrolling of the
shift loop. That is, forget about the builtin and do something like

while (size >= 8)
{
idx += 4;
size >>= 4;
}
while (size)
{
idx++;
size >>= 1;
}

Obviously there's room for experimental optimization of the particular
shift strides we use here, but remember that the total shift count is
never going to exceed about 10, and will usually be quite a bit less.

I did some quick experimentation along these lines and couldn't really
persuade myself that unrolling was going to win much, but that's an
Intel-specific result.

Anyway, based on what I'm seeing here I can't convince myself that this
is worth introducing a compiler dependency for.

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2009-07-20 17:00:12 Re: WIP: Deferrable unique constraints
Previous Message Joshua D. Drake 2009-07-20 16:43:46 Re: hot standby - merged up to CVS HEAD