WIP: dynahash replacement for buffer table

Lists: pgsql-hackers
From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: WIP: dynahash replacement for buffer table
Date: 2014-10-14 13:30:58
Message-ID: CA+TgmoYE4t-Pt+v08kMO5u_XN-HNKBWtfMgcUXEGBrQiVgdV9Q@mail.gmail.com
Lists: pgsql-hackers

A few years ago I started working on a concurrent hash table for
PostgreSQL. The hash table part of it worked, but I never did
anything with it, really. Amit mentioned to me earlier this week that
he was seeing contention inside the dynahash machinery, which inspired
me to go back and update the patch. I took the basic infrastructure
from before and used it to replace the buffer table. Patch is
attached.

The key idea here is that lookups are done without any locks, only
memory barriers; and inserts and deletes are done using atomic ops.
The algorithm is not strictly lock-free for reasons explained in the
comments in chash.c, but it's a lot less locky than what we have now,
so in theory you might think that would be a good thing.
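
To make the shape of that concrete, here is a rough sketch of the insert
path. This is not the code in chash.c - the types and names are invented,
and it uses C11 atomics rather than PostgreSQL's own atomics layer - but
it shows the basic pattern: a node is fully initialized and then
published onto a bucket chain with a compare-and-swap loop.

#include <stdatomic.h>

typedef struct CNode
{
    struct CNode *next;         /* safe to set non-atomically before publish */
    unsigned      key;
    unsigned      value;
} CNode;

typedef struct CBucket
{
    _Atomic(CNode *) head;
} CBucket;

/* Push a fully initialized, not-yet-visible node onto the bucket chain. */
static void
cbucket_push(CBucket *b, CNode *n)
{
    CNode *old = atomic_load(&b->head);

    do
    {
        n->next = old;          /* link in front of the current head */
    } while (!atomic_compare_exchange_weak(&b->head, &old, n));
}

Deletes are the harder part, because a concurrent reader may still hold a
pointer into the chain; that is where the hazard pointers come in.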

I haven't had time to do much performance testing yet, but it looks
like this may be slower at low client counts and faster at high client
counts. However, my results weren't really reproducible, and I haven't
done comprehensive testing yet. What was really bizarre is that I
couldn't really pin down the cause of the slowness at low client
counts; a quick perf profile showed overhead concentrated in
CHashBucketScan, basically memory access latency for walking the
bucket chain. But the table is built to have a load factor of 1, so I
can't see why that should amount to much, or why it should be
significantly worse than for dynahash.

This patch contains assorted leftovers and is grotty in various ways,
but I'm sharing it anyway just to get it out there.

git branch also available at:
http://git.postgresql.org/gitweb/?p=users/rhaas/postgres.git;a=shortlog;h=refs/heads/chash2014

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
chash-buftable-v1.patch text/x-patch 73.9 KB

From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: dynahash replacement for buffer table
Date: 2014-10-14 14:25:03
Message-ID: 20141014142503.GG9267@awork2.anarazel.de
Lists: pgsql-hackers

On 2014-10-14 09:30:58 -0400, Robert Haas wrote:
> A few years ago I started working on a concurrent hash table for
> PostgreSQL. The hash table part of it worked, but I never did
> anything with it, really. Amit mentioned to me earlier this week that
> he was seeing contention inside the dynahash machinery, which inspired
> me to go back and update the patch.

Interestingly I've benchmarked similar loads, even on the same machine
as Amit, and I see tremendous amounts of time spent in dynahash.c. It's
nearly all cache misses in my tests though.

> I took the basic infrastructure
> from before and used it to replace the buffer table. Patch is
> attached.

That's pretty cool. The algorithm is complex enough that I haven't fully
understood it yet, but it sounds sane at first glance.

> The key idea here is that lookups are done without any locks, only
> memory barriers; and inserts and deletes are done using atomic ops.

Hm. I quickly looked and I see that you use two full barriers for every
lookup. That's pretty expensive. I think this should be doable using
only read/write barriers on the lookup side? Even on architectures where
read/write memory barriers have to be something more than a compiler
barrier, they're still usually an order of magnitude or so cheaper than
full barriers.

> The algorithm is not strictly lock-free for reasons explained in the
> comments in chash.c, but it's a lot less locky than what we have now,
> so in theory you might think that would be a good thing.

I don't see much reason to strive for full lock-freeness. If the locks
aren't noticeable in real-world scenarios - who cares?

> I haven't had time to do much performance testing yet, but it looks
> like this may be slower at low client counts and faster at high client
> counts. However, my results weren't real reproducible, and I haven't
> done comprehensive testing yet. What was really bizarre is that I
> couldn't really pin down the cause of the slowness at low client
> counts; a quick perf profile showed overhead concentrated in
> CHashBucketScan, basically memory access latency for walking the
> bucket chain. But the table is built to have a load factor of 1, so I
> can't see why that should amount to much, or why it should be
> significantly worse than for dynahash.

With regard to using a hash table for the buffer mapping lock, I'm
doubtful that any form of separate chaining is the right one. We
currently have a quite noticeable problem with the number of cache
misses in the buffer mapping hash (and also the heavyweight lock mgr) -
if we stay with hashes, that's only going to be fundamentally lower than
with dynahash if we change the type of hashing. I've had good,
*preliminary*, results using an open addressing + linear probing
approach.

My guess is that the additional indirection via the arena explains the
difference in cache misses? But I might be completely off here.

It'd be quite interesting to see a perf stat -ddd of builds employing
dynahash.c/chash under some comparable load.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: dynahash replacement for buffer table
Date: 2014-10-14 14:47:27
Message-ID: 20141014144727.GF22660@awork2.anarazel.de
Lists: pgsql-hackers

On 2014-10-14 09:30:58 -0400, Robert Haas wrote:
> I took the basic infrastructure from before and used it to replace the
> buffer table. Patch is attached.

Hm. Unless I missed something you pretty much completely eradicated
locks from the buffer mapping code? That'd be pretty cool, but it's also
scary.

How confident are you that your conversion is correct? Not in the sense
that there aren't any buglets, but fundamentally.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: dynahash replacement for buffer table
Date: 2014-10-14 15:00:45
Message-ID: CAA4eK1Ks1OCPdwapZNwHiqLOuUzzCXg7JB4RXNp-81_hnLzCUg@mail.gmail.com
Lists: pgsql-hackers

On Tue, Oct 14, 2014 at 7:55 PM, Andres Freund <andres(at)2ndquadrant(dot)com>
wrote:
>
> On 2014-10-14 09:30:58 -0400, Robert Haas wrote:
> > A few years ago I started working on a concurrent hash table for
> > PostgreSQL. The hash table part of it worked, but I never did
> > anything with it, really. Amit mentioned to me earlier this week that
> > he was seeing contention inside the dynahash machinery, which inspired
> > me to go back and update the patch.
>
> Interestingly I've benchmarked similar loads, even on the same machine
> as Amit,

There is one catch here, for these profiles I am using Power-8 m/c
and the load is slightly higher (5000 scale factor).

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: dynahash replacement for buffer table
Date: 2014-10-14 15:08:02
Message-ID: 20141014150802.GH9267@awork2.anarazel.de
Lists: pgsql-hackers

On 2014-10-14 20:30:45 +0530, Amit Kapila wrote:
> On Tue, Oct 14, 2014 at 7:55 PM, Andres Freund <andres(at)2ndquadrant(dot)com>
> wrote:
> >
> > On 2014-10-14 09:30:58 -0400, Robert Haas wrote:
> > > A few years ago I started working on a concurrent hash table for
> > > PostgreSQL. The hash table part of it worked, but I never did
> > > anything with it, really. Amit mentioned to me earlier this week that
> > > he was seeing contention inside the dynahash machinery, which inspired
> > > me to go back and update the patch.
> >
> > Interestingly I've benchmarked similar loads, even on the same machine
> > as Amit,
>
> There is one catch here, for these profiles I am using Power-8 m/c
> and the load is slightly higher (5000 scale factor).

Ah, right. I don't think the scale factor changes much, but the
different architecture certainly does. As I said elsewhere, I would not
believe these profiles much until they're actually done with optimized
code...

I also think we need to be a bit careful about optimizing too much for
stock pgbench with working set >> s_b. The uniform readonly access
pattern you're profiling here isn't something super realistic. Still
valuable, but we should take it with a grain of salt.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: dynahash replacement for buffer table
Date: 2014-10-14 15:08:08
Message-ID: CA+TgmobKOF=gV2ue5p14iAJj0QuQdL+eBLXsM8+rogMjABmG8A@mail.gmail.com
Lists: pgsql-hackers

On Tue, Oct 14, 2014 at 10:47 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> On 2014-10-14 09:30:58 -0400, Robert Haas wrote:
>> I took the basic infrastructure from before and used it to replace the
>> buffer table. Patch is attached.
>
> Hm. Unless I missed something you pretty much completely eradicated
> locks from the buffer mapping code? That'd be pretty cool, but it's also
> scary.
>
> How confident are you that your conversion is correct? Not in the sense
> that there aren't any buglets, but fundamentally.

It doesn't look particularly dangerous to me. Famous last words.

Basically, I think what we're doing right now is holding the buffer
mapping lock so that the buffer can't be renamed under us while we're
pinning it. If we don't do that, I think the only consequence is
that, by the time we get the buffer pin, the buffer might no longer be
the one we looked up. So we need to recheck. But assuming we do
that, I don't see an issue. In fact, it occurred to me while I was
cobbling this together that we might want to experiment with that
change independently of chash. We already know (from the
StrategyGetBuffer locking changes) that holding system-wide locks to
prevent people from twaddling the state of individual buffers can be a
loser. This case isn't nearly as bad, because we're only pinning one
buffer, rather than potentially all of them multiple times, and the
lock we're holding only affects 1/128th of the buffers, not all of
them.
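
Sketched out, the recheck loop looks about like this (all of the helper
names here are made up for illustration; they are not actual bufmgr.c
functions):

#include <stdbool.h>

extern int  lockless_lookup(const void *tag);    /* search with no lock held */
extern void pin_buffer(int buf_id);              /* pin prevents eviction */
extern bool tag_still_matches(int buf_id, const void *tag);
extern void unpin_buffer(int buf_id);

static int
lookup_and_pin(const void *tag)
{
    for (;;)
    {
        int     buf_id = lockless_lookup(tag);

        if (buf_id < 0)
            return -1;              /* not in the table */

        pin_buffer(buf_id);         /* buffer can no longer be renamed */

        if (tag_still_matches(buf_id, tag))
            return buf_id;          /* still the buffer we looked up */

        /* It was renamed between lookup and pin; drop the pin and retry. */
        unpin_buffer(buf_id);
    }
}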

The other application I had in mind for chash was SLRU stuff. I
haven't tried to write the code yet, but fundamentally, the same
considerations apply there. You do the lookup locklessly to find out
which buffer is thought to contain the SLRU page you want, but by the
time you lock the page the answer might be out of date. Assuming that
this doesn't happen many times in a row and that lookups are
relatively cheap, you could get rid of having any centralized lock for
SLRU, and just have per-page locks.

I'm not quite sure what fundamental dangers you're thinking about
here, but from what I understand from reading the literature, doing
lookups in a lock-free hash table needs to be thought about in a sort
of relativistic way. The different CPUs have slightly different views
of the world, so any answer you get may be obsolete by the time you
get it, and may according to some other CPU's view of the world have
been obsolete even at the time that it was copied. That's OK, because
it's just equivalent to some other CPU doing its hash table update
slightly sooner or later than it actually did it, within the bounds
imposed by memory barriers, which it could have done anyway. There is
no one source of truth. The result of all this is that some of the
synchronization responsibility is transferred to the caller. That's
obviously a bit more complex to reason about, but it hasn't stopped
lock-free hash tables from being wildly popular data structures.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: dynahash replacement for buffer table
Date: 2014-10-14 15:10:19
Message-ID: CAA4eK1J=8jF3iV-2FgKm=0Tnx8BNOKYmJVBsPVh8tEpbfSBsEQ@mail.gmail.com
Lists: pgsql-hackers

On Tue, Oct 14, 2014 at 8:38 PM, Andres Freund <andres(at)2ndquadrant(dot)com>
wrote:
>
> On 2014-10-14 20:30:45 +0530, Amit Kapila wrote:
> > On Tue, Oct 14, 2014 at 7:55 PM, Andres Freund <andres(at)2ndquadrant(dot)com>
> > wrote:
> > >
> > > On 2014-10-14 09:30:58 -0400, Robert Haas wrote:
> > > > A few years ago I started working on a concurrent hash table for
> > > > PostgreSQL. The hash table part of it worked, but I never did
> > > > anything with it, really. Amit mentioned to me earlier this week that
> > > > he was seeing contention inside the dynahash machinery, which inspired
> > > > me to go back and update the patch.
> > >
> > > Interestingly I've benchmarked similar loads, even on the same machine
> > > as Amit,
> >
> > There is one catch here, for these profiles I am using Power-8 m/c
> > and the load is slightly higher (5000 scale factor).
>
> Ah, right. I don't think the scale factor changes much, but the
> different architecture certainly does. As I said elsewhere, I would not
> believe these profiles much until they're actually done with optimized
> code...

Today, that machine is not accessible, so it will take a day or so to get
the optimized profiles, and I will post them once I am able to collect
them.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: dynahash replacement for buffer table
Date: 2014-10-14 15:19:16
Message-ID: CA+TgmoYfe+SK6h=d_EWAzxznBjh=b_SO8ogkY9Qd9S=LSsXC_g@mail.gmail.com
Lists: pgsql-hackers

On Tue, Oct 14, 2014 at 10:25 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> The key idea here is that lookups are done without any locks, only
>> memory barriers; and inserts and deletes are done using atomic ops.
>
> Hm. I quickly looked and I see that you use two full barriers for every
> lookup. That's pretty expensive. I think this should be doable using
> only read/write barriers on the lookup side? Even on architectures where
> read/write memory barriers have to be something but a compiler barrier,
> they're still usually a magnitude or so cheaper than full barriers.

The code in CHashSearch shows the problem there: you need to STORE the
hazard pointer before you begin to do the LOAD operations to scan the
bucket, and you must finish all of those LOADs before you STORE the
NULL hazard pointer. A read or write barrier won't provide store/load
or load/store ordering.
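
Schematically - this is not the actual chash.c code, just a sketch in C11
atomics with invented names - the lookup path is shaped like this, and
the two fences are exactly the full barriers in question:

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct HNode
{
    _Atomic(struct HNode *) next;
    unsigned                key;
    unsigned                value;
} HNode;

typedef struct HBucket
{
    _Atomic(HNode *) head;
} HBucket;

/* One hazard pointer slot per backend; assumed to live in shared memory. */
extern _Atomic(HBucket *) my_hazard_pointer;

static bool
hazard_lookup(HBucket *b, unsigned key, unsigned *value)
{
    bool    found = false;
    HNode  *n;

    /* STORE: publish which bucket we are about to scan. */
    atomic_store_explicit(&my_hazard_pointer, b, memory_order_relaxed);

    /*
     * Full fence: the store above must be visible before any of the LOADs
     * below.  A read or write barrier alone gives no store/load ordering.
     */
    atomic_thread_fence(memory_order_seq_cst);

    for (n = atomic_load(&b->head); n != NULL; n = atomic_load(&n->next))
    {
        if (n->key == key)
        {
            *value = n->value;
            found = true;
            break;
        }
    }

    /*
     * Full fence: all LOADs above must complete before the hazard pointer
     * is cleared, or a concurrent deleter could recycle a node we are
     * still looking at.
     */
    atomic_thread_fence(memory_order_seq_cst);
    atomic_store_explicit(&my_hazard_pointer, NULL, memory_order_relaxed);

    return found;
}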

> I don't see much reason to strive for full lock-free ness. If the locks
> aren't noticeable in real world scenarios - who cares?

That's my feeling too. ISTR that when I stress-tested the hash table
back in 2012 when I started this code, the concurrency was far better
than dynahash with 16 lwlock partitions. I don't remember off hand
exactly how I tested that, but it was with the code in hashtest.c. I
also seem to recall that it was possible to make the freelists get
VERY hot, but of course that was a workload where hash table updates
were the only thing happening, so I assume that on a workload where
we're actually doing work (like, copying the data in and out of kernel
buffers) that's not going to be a real problem unless you have
thousands of cores. But we'll see.

> With regard for using a hash table for the buffer mapping lock I'm
> doubtful that any form of separate chaining is the right one. We
> currently have a quite noticeable problem with the number of cache
> misses in the buffer mapping hash (and also the heavyweight lock mgr) -
> if we stay with hashes that's only going to be fundamentally lower than
> dynahash if we change the type of hashing. I've had good, *preliminary*,
> results using an open addressing + linear probing approach.

I'm very skeptical of open addressing + linear probing. Such hash
tables tend to degrade over time, because you have to leave tombstones
behind to ensure that future scans advance far enough. There's really
no way to recover without rebuilding the whole hash table, and
eventually it degrades to linear search. If we're spending too much
time walking hash chains, I think the solution is to increase the
number of buckets so that the chains get shorter.
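
To spell the tombstone problem out with a toy, single-threaded, invented
example (nothing to do with either patch): deletion can't simply mark the
slot empty, because lookups for other keys that collided past that slot
would then stop probing too early.

#include <stddef.h>

#define TABLE_SIZE  1024        /* power of two; assumed never to fill up */

typedef enum { SLOT_EMPTY, SLOT_USED, SLOT_DEAD } SlotState;

typedef struct Slot
{
    SlotState   state;
    unsigned    key;
    unsigned    value;
} Slot;

static Slot slots[TABLE_SIZE];

static Slot *
probe(unsigned key)
{
    unsigned    i = key & (TABLE_SIZE - 1);

    for (;;)
    {
        Slot   *s = &slots[i];

        if (s->state == SLOT_EMPTY)
            return NULL;                    /* definitely not present */
        if (s->state == SLOT_USED && s->key == key)
            return s;
        /* Tombstone or a colliding key: keep probing. */
        i = (i + 1) & (TABLE_SIZE - 1);
    }
}

static void
delete_key(unsigned key)
{
    Slot   *s = probe(key);

    /*
     * Leave a tombstone rather than SLOT_EMPTY, so probes for other keys
     * that passed through this slot still advance far enough.  Tombstones
     * accumulate until the table is rebuilt, which is the degradation
     * described above.
     */
    if (s != NULL)
        s->state = SLOT_DEAD;
}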

> My guess is that the additional indirection via the arena explains the
> difference in cache misses? But I might be completely off here.

The arena makes the calculation of the pointer address less
predictable, which might make things more difficult for the CPU
pipeline. But aside from that, I think it's the same basic idea: you
bounce from some sort of bucket header to the first element of the
chain and then, hopefully, you're done. Am I missing something?

> It'd be quite interesting to see a perf stat -ddd of dynahash.c/chash
> employing builds for some comparable load.

I'm hoping you can test this out on x86. All I have to work with are
the POWER machines, and the characteristics of those are somewhat
different. I can test it there, though.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: dynahash replacement for buffer table
Date: 2014-10-14 15:31:20
Message-ID: 20141014153120.GI9267@awork2.anarazel.de
Lists: pgsql-hackers

On 2014-10-14 11:08:08 -0400, Robert Haas wrote:
> On Tue, Oct 14, 2014 at 10:47 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> > On 2014-10-14 09:30:58 -0400, Robert Haas wrote:
> >> I took the basic infrastructure from before and used it to replace the
> >> buffer table. Patch is attached.
> >
> > Hm. Unless I missed something you pretty much completely eradicated
> > locks from the buffer mapping code? That'd be pretty cool, but it's also
> > scary.
> >
> > How confident are you that your conversion is correct? Not in the sense
> > that there aren't any buglets, but fundamentally.
>
> It doesn't look particularly dangerous to me. Famous last words.

> Basically, I think what we're doing right now is holding the buffer
> mapping lock so that the buffer can't be renamed under us while we're
> pinning it.

What I'm afraid of is that there are hidden assumptions about the
consistency provided by the mapping locks.

> If we don't do that, I think the only consequence is
> that, by the time we get the buffer pin, the buffer might no longer be
> the one we looked up. So we need to recheck. But assuming we do
> that, I don't see an issue. In fact, it occurred to me while I was
> cobbling this together that we might want to experiment with that
> change independently of chash. We already know (from the
> StrategyGetBuffer locking changes) that holding system-wide locks to
> prevent people from twaddling the state of individual buffers can be a
> loser. This case isn't nearly as bad, because we're only pinning one
> buffer, rather than potentially all of them multiple times, and the
> lock we're holding only affects 1/128th of the buffers, not all of
> them.

Also, IIRC at least Linux has targeted wakeup/time-transfer logic when
using semaphores - and doesn't for spinlocks. So if you don't happen to
sleep while the lwlock's spinlock is held, the result will be
better. Only that we'll frequently sleep within that for very frequently
taken locks...

> The other application I had in mind for chash was SLRU stuff. I
> haven't tried to write the code yet, but fundamentally, the same
> considerations apply there. You do the lookup locklessly to find out
> which buffer is thought to contain the SLRU page you want, but by the
> time you lock the page the answer might be out of date. Assuming that
> this doesn't happen many times in a row and that lookups are
> relatively cheap, you could get rid of having any centralized lock for
> SLRU, and just have per-page locks.

Hm. I have to admit I haven't really looked enough into the slru code to
judge this. My impression so far wasn't that the locking itself was the
problem in most scenarios - that's not what you've seen?

> I'm not quite sure what fundamental dangers you're thinking about
> here

Oh, only in the context of the bufmgr.c conversion. Not more
generally. I agree that a lock-free hash table is something quite
worthwhile to have.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: dynahash replacement for buffer table
Date: 2014-10-14 15:36:39
Message-ID: CA+TgmoZw+DaHA+Pf1A26Vh57zp4hE0di_3AJx-0vU6wcdRt5Vw@mail.gmail.com
Lists: pgsql-hackers

On Tue, Oct 14, 2014 at 11:31 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> It doesn't look particularly dangerous to me. Famous last words.
>
>> Basically, I think what we're doing right now is holding the buffer
>> mapping lock so that the buffer can't be renamed under us while we're
>> pinning it.
>
> What I'm afraid of is that there's hidden assumptions about the
> consistency provided by the mapping locks.

That's certainly worth checking for, but I think the only code that
needs to be checked is the code that would formerly have run while
holding said lock. And there isn't that much of that.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: dynahash replacement for buffer table
Date: 2014-10-14 20:42:32
Message-ID: 20141014204232.GA7242@alap3.anarazel.de
Lists: pgsql-hackers

On 2014-10-14 11:19:16 -0400, Robert Haas wrote:
> On Tue, Oct 14, 2014 at 10:25 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> >> The key idea here is that lookups are done without any locks, only
> >> memory barriers; and inserts and deletes are done using atomic ops.
> >
> > Hm. I quickly looked and I see that you use two full barriers for every
> > lookup. That's pretty expensive. I think this should be doable using
> > only read/write barriers on the lookup side? Even on architectures where
> > read/write memory barriers have to be something but a compiler barrier,
> > they're still usually a magnitude or so cheaper than full barriers.
>
> The code in CHashSearch shows the problem there: you need to STORE the
> hazard pointer before you begin to do the LOAD operations to scan the
> bucket, and you must finish all of those LOADs before you STORE the
> NULL hazard pointer. A read or write barrier won't provide store/load
> or load/store ordering.

I'm not sure I understand the problem here - but a read + write barrier
should suffice? To avoid falling back to two full barriers, we'd need a
separate define pg_read_write_barrier(), but that's not a problem. IIUC
that should allow us to avoid emitting any actual code on x86.

> > With regard for using a hash table for the buffer mapping lock I'm
> > doubtful that any form of separate chaining is the right one. We
> > currently have a quite noticeable problem with the number of cache
> > misses in the buffer mapping hash (and also the heavyweight lock mgr) -
> > if we stay with hashes that's only going to be fundamentally lower than
> > dynahash if we change the type of hashing. I've had good, *preliminary*,
> > results using an open addressing + linear probing approach.
>
> I'm very skeptical of open addressing + linear probing. Such hash
> tables tend to degrade over time, because you have to leave tombstones
> behind to ensure that future scans advance far enough.

You can try to be a bit smarter than a plain open addressing
approach. But yes, it has disadvantages.

> There's really
> no way to recover without rebuilding the whole hash table, and
> eventually it degrades to linear search. If we're spending too much
> time walking hash chains, I think the solution is to increase the
> number of buckets so that the chains get shorter.

Might be worthwhile to try. I know that both my hand-rolled open
addressing and my hand-rolled chaining approach have significantly fewer
cache misses than dynahash for the same amount of work.

Hm. Possibly that's at least partially because of the segment stuff in
dynahash?

Anyway, you can get the size of the entire hashtable down quite a
bit. Primarily because there's no need to store a next pointer. There's
also not really any need for the 'hashvalue' in the bufmgr case
imo.

> > My guess is that the additional indirection via the arena explains the
> > difference in cache misses? But I might be completely off here.
>
> The arena makes the calculation of the pointer address less
> predictable, which might make things more difficult for the CPU
> pipeline. But aside from that, I think it's the same basic idea: you
> bounce from some sort of bucket header to the first element of the
> chain and then, hopefully, you're done. Am I missing something?

I haven't really read much of the code yet (wrote most of this email on
my workstation before embarking on a plane), but afaics there's one
indirection to the bucket, and then one to the first node in the linked
list. Right?
In dynahash it's first to the segment, but there are only a few, so it's
likely to be cached. Then to the bucket - which, afaics, can already
contain the key.

So it's not really the arena, but that you chose not to store the first
element in a chain inline...

Greetings,

Andres Freund


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: dynahash replacement for buffer table
Date: 2014-10-14 21:53:10
Message-ID: CA+TgmoYZ=0vO5W97v=EH18WrpsgLELOih3yYiSPyHQtBOa=XUw@mail.gmail.com
Lists: pgsql-hackers

On Tue, Oct 14, 2014 at 4:42 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> The code in CHashSearch shows the problem there: you need to STORE the
>> hazard pointer before you begin to do the LOAD operations to scan the
>> bucket, and you must finish all of those LOADs before you STORE the
>> NULL hazard pointer. A read or write barrier won't provide store/load
>> or load/store ordering.
>
> I'm not sure I understand the problem here - but a read + write barrier
> should suffice? To avoid falling back to two full barriers, we'd need a
> separate define pg_read_write_barrier(), but that's not a problem. IIUC
> that should allow us to avoid emitting any actual code on x86.

Well, thinking about x86 specifically, it definitely needs at least
one mfence, after setting the hazard pointer and before beginning to
read the bucket chain. It probably doesn't need the second mfence,
before clearing the hazard pointer, because x86 allows loads to be
reordered before stores but not stores before loads. We could
certainly try removing the second fence on x86 for benchmarking
purposes, and we could also check whether mfence outperforms lock;
addl in that scenario.

I tested PPC, because that's what I have. I think we're emitting two
sync instructions there, but maybe lwsync or isync would be enough in
one or both cases. The first of these links indicates that lwsync
ought to be enough for both cases; the second says that we need a sync
after setting the hazard pointer but that an lwsync is enough before
clearing it. Or that's my reading anyway.

http://www.ibm.com/developerworks/systems/articles/powerpc.html
http://peeterjoot.wordpress.com/2010/07/23/use-of-lwsync-instead-of-isync-as-a-mutex-acquire-memory-barrier/

>> > My guess is that the additional indirection via the arena explains the
>> > difference in cache misses? But I might be completely off here.
>>
>> The arena makes the calculation of the pointer address less
>> predictable, which might make things more difficult for the CPU
>> pipeline. But aside from that, I think it's the same basic idea: you
>> bounce from some sort of bucket header to the first element of the
>> chain and then, hopefully, you're done. Am I missing something?
>
> I haven't really read much of the code yet (wrote most of this email on
> my workstation before embarking on a plane), but afaics there's one
> indirection to the bucket, and then one to the first node in the linked
> list. Right?
> In dynahash it's first to the segment, but there's only few, so it's
> likely to be cached. Then to the bucket - which, afaics, already can
> contain the key.
>
> So it's not really the arena, but that you chose not to store the first
> element in a chain inline...

Hmm, you have a point. I think the reason I did it that way is
because I don't know how to make the memory management work out
otherwise. If you store the first element inline, then even an empty
bucket uses up as much memory space as one with a single element.
More seriously, it breaks the deletion algorithm. There's no good way
to atomically swap out the bucket header if it is located at a fixed
position and bigger than the size of an object the machine can
manipulate atomically.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: dynahash replacement for buffer table
Date: 2014-10-15 06:03:09
Message-ID: 20141015060309.GE7242@alap3.anarazel.de
Lists: pgsql-hackers

On 2014-10-14 17:53:10 -0400, Robert Haas wrote:
> On Tue, Oct 14, 2014 at 4:42 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> >> The code in CHashSearch shows the problem there: you need to STORE the
> >> hazard pointer before you begin to do the LOAD operations to scan the
> >> bucket, and you must finish all of those LOADs before you STORE the
> >> NULL hazard pointer. A read or write barrier won't provide store/load
> >> or load/store ordering.
> >
> > I'm not sure I understand the problem here - but a read + write barrier
> > should suffice? To avoid falling back to two full barriers, we'd need a
> > separate define pg_read_write_barrier(), but that's not a problem. IIUC
> > that should allow us to avoid emitting any actual code on x86.
>
> Well, thinking about x86 specifically, it definitely needs at least
> one mfence, after setting the hazard pointer and before beginning to
> read the bucket chain.

Yes, I can see that now. I do wonder if that's not going to prove quite
expensive... I think I can see some ways around needing that. But I
think that needs some benchmarking first - no need to build an even more
complex solution if not needed.

The solution I'm thinking of is to essentially move away from hazard
pointers and store something like a generation counter per backend,
which is updated less often, and in combination with other operations.
When a backend needs to clean up and sees that there's a straggler with
an old generation, it sends that backend a signal to ensure it sets the
latest generation.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Ants Aasma <ants(at)cybertec(dot)at>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: dynahash replacement for buffer table
Date: 2014-10-15 16:32:42
Message-ID: CA+CSw_vQVZnqsQE8X10fAZ486FCTj-h0Dkby=JJsgA578AmLEg@mail.gmail.com
Lists: pgsql-hackers

On Tue, Oct 14, 2014 at 6:19 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> With regard for using a hash table for the buffer mapping lock I'm
>> doubtful that any form of separate chaining is the right one. We
>> currently have a quite noticeable problem with the number of cache
>> misses in the buffer mapping hash (and also the heavyweight lock mgr) -
>> if we stay with hashes that's only going to be fundamentally lower than
>> dynahash if we change the type of hashing. I've had good, *preliminary*,
>> results using an open addressing + linear probing approach.
>
> I'm very skeptical of open addressing + linear probing. Such hash
> tables tend to degrade over time, because you have to leave tombstones
> behind to ensure that future scans advance far enough. There's really
> no way to recover without rebuilding the whole hash table, and
> eventually it degrades to linear search. If we're spending too much
> time walking hash chains, I think the solution is to increase the
> number of buckets so that the chains get shorter.

What about cuckoo hashing? There was a recent paper on how to do fine
grained locking with cuckoo hashes. [1]

I'm imagining a bucketized cuckoo hash with 5 item buckets (5-way
associativity). This allows us to fit the bucket onto 2 regular sized
cache lines and have 8 bytes left over. Buckets would be protected by
seqlocks stored in the extra space. On the read side we would only
need 2 read barriers (basically free on x86), and we are guaranteed to
have an answer by fetching two pairs of cache lines. We can trade
memory bandwidth for latency by issuing prefetches (once we add
primitives for this). Alternatively we can trade insert speed for
lookup speed by using asymmetrically sized tables.

[1] https://www.cs.princeton.edu/~mfreed/docs/cuckoo-eurosys14.pdf
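
For illustration, the read side of such a seqlock-protected bucket could
look roughly like the following (C11 atomics, a toy key/value layout and
invented names rather than the BufferTag packing described above):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define BUCKET_SLOTS 5

typedef struct CuckooBucket
{
    _Atomic uint32_t seq;           /* even = stable, odd = writer active */
    uint64_t         keys[BUCKET_SLOTS];
    uint32_t         values[BUCKET_SLOTS];
} CuckooBucket;

/*
 * Read one bucket under its seqlock.  The key/value reads are plain loads
 * for brevity; any inconsistency caused by a concurrent writer is caught
 * by the sequence recheck and causes a retry.
 */
static bool
cuckoo_bucket_lookup(CuckooBucket *b, uint64_t key, uint32_t *value)
{
    for (;;)
    {
        uint32_t    seq = atomic_load_explicit(&b->seq, memory_order_acquire);
        bool        found = false;

        if (seq & 1)
            continue;               /* writer in progress; retry */

        for (int i = 0; i < BUCKET_SLOTS; i++)
        {
            if (b->keys[i] == key)
            {
                *value = b->values[i];
                found = true;
                break;
            }
        }

        /* Second read barrier, then recheck that nothing changed. */
        atomic_thread_fence(memory_order_acquire);
        if (atomic_load_explicit(&b->seq, memory_order_relaxed) == seq)
            return found;
        /* Bucket changed underneath us; retry. */
    }
}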

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de


From: Ryan Johnson <ryan(dot)johnson(at)cs(dot)utoronto(dot)ca>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP: dynahash replacement for buffer table
Date: 2014-10-15 17:36:51
Message-ID: 543EB0B3.7080608@cs.utoronto.ca
Lists: pgsql-hackers

On 15/10/2014 10:32 AM, Ants Aasma wrote:
> On Tue, Oct 14, 2014 at 6:19 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>> With regard for using a hash table for the buffer mapping lock I'm
>>> doubtful that any form of separate chaining is the right one. We
>>> currently have a quite noticeable problem with the number of cache
>>> misses in the buffer mapping hash (and also the heavyweight lock mgr) -
>>> if we stay with hashes that's only going to be fundamentally lower than
>>> dynahash if we change the type of hashing. I've had good, *preliminary*,
>>> results using an open addressing + linear probing approach.
>> I'm very skeptical of open addressing + linear probing. Such hash
>> tables tend to degrade over time, because you have to leave tombstones
>> behind to ensure that future scans advance far enough. There's really
>> no way to recover without rebuilding the whole hash table, and
>> eventually it degrades to linear search. If we're spending too much
>> time walking hash chains, I think the solution is to increase the
>> number of buckets so that the chains get shorter.
> What about cuckoo hashing? There was a recent paper on how to do fine
> grained locking with cuckoo hashes. [1]
>
> I'm imagining a bucketized cuckoo hash with 5 item buckets (5-way
> associativity). This allows us to fit the bucket onto 2 regular sized
> cache lines and have 8 bytes left over. Buckets would be protected by
> seqlocks stored in the extra space. On the read side we would only
> need 2 read barriers (basically free on x86), and we are guaranteed to
> have an answer by fetching two pairs of cache lines. We can trade
> memory bandwidth for latency by issuing prefetches (once we add
> primitives for this). Alternatively we can trade insert speed for
> lookup speed by using asymmetrically sized tables.
>
> [1] https://www.cs.princeton.edu/~mfreed/docs/cuckoo-eurosys14.pdf
Actually, I'd go with second-chance hashing [2], same number of hash
functions but it's more stable (no infinite loops, for example). Most
probably the techniques from [1] would apply equally well.

[2]
http://www.eecs.harvard.edu/~michaelm/postscripts/infocom_hardware_submit.pdf

Ryan


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: dynahash replacement for buffer table
Date: 2014-10-15 17:41:25
Message-ID: CA+TgmoZ=xR14my-mMPZcuO1RGwXbdppytUU3vfT8Oca-2WYRRQ@mail.gmail.com
Lists: pgsql-hackers

On Wed, Oct 15, 2014 at 2:03 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> On 2014-10-14 17:53:10 -0400, Robert Haas wrote:
>> On Tue, Oct 14, 2014 at 4:42 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> >> The code in CHashSearch shows the problem there: you need to STORE the
>> >> hazard pointer before you begin to do the LOAD operations to scan the
>> >> bucket, and you must finish all of those LOADs before you STORE the
>> >> NULL hazard pointer. A read or write barrier won't provide store/load
>> >> or load/store ordering.
>> >
>> > I'm not sure I understand the problem here - but a read + write barrier
>> > should suffice? To avoid falling back to two full barriers, we'd need a
>> > separate define pg_read_write_barrier(), but that's not a problem. IIUC
>> > that should allow us to avoid emitting any actual code on x86.
>>
>> Well, thinking about x86 specifically, it definitely needs at least
>> one mfence, after setting the hazard pointer and before beginning to
>> read the bucket chain.
>
> Yes, I can see that now. I do wonder if that's not going to prove quite
> expensive... I think I can see some ways around needing that. But I
> think that needs some benchmarking first - no need to build a even more
> complex solution if not needed.
>
> The solution I'm thinking of is to essentially move away from hazard
> pointers and store something like a generation counter per
> backend. Which is updated less often, and in combination with other
> operations. When a backend need to clean up and sees that there's a
> straggler with a old generation it sends that backend a signal to ensure
> it sets the latest generation.

It's possible that might work ... but on the timescale we're talking
about here, that's asking the garbage collecting process to wait for
practically geologic time.

Back when I first wrote this code, I spent a fair amount of time
looking at existing work in the area of lock-free hash tables.
Essentially all of the work I was able to find on this topic assumes a
threaded model - or more precisely, it assumes that shared memory
allocation is cheap and easy and you'll have no problem getting as
much of it as you need whenever you need it. This assumption often
isn't even spelled out explicitly: it's just assumed that that's the
programming environment you're working in. Finding highly parallel
algorithms that don't rely on memory allocation as a primitive is
hard. Hazard pointers are one of the few techniques I found that
seems like it can work in our architecture. I'm quite reluctant to
leave aside techniques that have been proven to work well in favor of
inventing something entirely novel to PostgreSQL.

That having been said, there is some literature on generation numbers,
and I think something like that could be made to work. It does have
some significant disadvantages, though. One is that a single process
which fails to update its generation number prevents all reclamation,
system-wide. In my algorithm, a process whose execution is
suspended while a hazard pointer is set prevents recycling of only one
of many garbage lists. A process searching for a reusable element can
most likely find some other garbage list to reclaim instead. Also,
a generation number implies that we update the value periodically,
rather than before and after each critical section. Anything that
forces garbage collection to be postponed longer than absolutely
necessary seems to me likely to be a loser. It might be a little
faster as long as we have free elements to allocate, but as soon as
we're out and have to wait longer than we otherwise would for garbage
collection, and all system activity halts as a result, even for a few
milliseconds, it's going to be a whole lot slower. Or at least, I
think.

That having been said, I don't know what to do about the fact that the
fence is too expensive. I don't know that we've really established
that that's the true root of the problem rather than some other
pedestrian optimization failure. But the existing code is using an
atomic operation to acquire a spinlock, then releasing it, walking the
bucket chain, and then using another atomic operation to acquire a
spinlock and then releasing it again. Surely a pure fence shouldn't
cost more than a spinlock cycle? Even with arguably one extra cache
line touch, that seems like it ought to be a win. But my intuitions
in this area are shaky.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: dynahash replacement for buffer table
Date: 2014-10-15 20:23:39
Message-ID: 20141015202339.GI7242@alap3.anarazel.de
Lists: pgsql-hackers

On 2014-10-15 13:41:25 -0400, Robert Haas wrote:
> On Wed, Oct 15, 2014 at 2:03 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> > The solution I'm thinking of is to essentially move away from hazard
> > pointers and store something like a generation counter per
> > backend. Which is updated less often, and in combination with other
> > operations. When a backend need to clean up and sees that there's a
> > straggler with a old generation it sends that backend a signal to ensure
> > it sets the latest generation.
>
> It's possible that might work ... but on the timescale we're talking
> about here, that's asking the garbage collecting process to wait for
> practically geologic time.

I think it depends on what we're tying the generation increase to. We
could very well add an implicit generation increment to, say, lwlock
acquisition - which is a full barrier anyway. And I don't think we'll
normally have that high a rate of garbage collection anyway - there
should be plenty of headroom.

> Back when I first wrote this code, I spent a fair amount of time
> looking at existing work in the area of lock-free hash tables.
> Essentially all of the work I was able to find on this topic assumes a
> threaded model - or more precisely, it assumes that shared memory
> allocation is cheap and easy and you'll have no problem getting as
> much of it as you need whenever you need it.

FWIW, I think many of the solutions that are actually used in practice
don't rely on that heavily. I know at least some that require memory to
be reserved beforehand, in special contexts.

> This assumption often
> isn't even spelled out explicitly: it's just assumed that that's the
> programming environment you're working in. Finding highly parallel
> algorithms that don't rely on memory allocation as a primitive is
> hard. Hazard pointers are one of the few techniques I found that
> seems like it can work in our architecture. I'm quite reluctant to
> leave aside techniques that have been proven to work well in favor of
> inventing something entirely novel to PostgreSQL.

I don't think something like generation numbers is really that new and
unproven - it's essentially a more trivial version of RCU, which is used
quite heavily, and under different names.

That said, I'm far from convinced that they are the solution - they were
just the first thing that came to my mind thinking about the problem.

> That having been said, there is some literature on generation numbers,
> and I think something like that could be made to work. It does have
> some significant disadvantages, though. One is that a single process
> which fails to update its generation number prevents all reclamation,
> system-wide. In my algorithm, a process whose execution is
> suspended while a hazard pointer is set prevents recycling of only one
> of many garbage lists. A process searching for a reusable element can
> mostly likely find some other garbage list to reclaim instead.

Hm. Couldn't a similar scheme with several lists be used with generation
numbers?

> Also, a generation number implies that we update the value
> periodically, rather than before and after each critical section.

Hm. Don't think it necessarily has to do that.

> Anything that forces garbage collection to be postponed longer than
> absolutely necessary seems to me likely to be a loser. It might be a
> little faster as long as we have free elements to allocate, but as
> soon as we're out and have to wait longer than we otherwise would for
> garbage collection, and all system activity halts as a result, even
> for a few milliseconds, it's going to be a whole lot slower. Or at
> least, I think.

I think it really depends on the user of the facility. If the generation
were e.g. also tied to lwlocks I couldn't see bufmgr.c outrunning that
realistically.

> That having been said, I don't know what to do about the fact that the
> fence is too expensive.

I'm far from sure that it's the problem at hand here - the reason I'm
wondering about the barriers is primarily that the buffer mapping hash
table is one of the top profile entries in all concurrent workloads -
often even the top one, after removing some locking bottleneck. Nearly
all of the time is spent there due to cache misses. While I think we can
get the table smaller and more efficient for the same NBuffers value,
realistically we need to cope with much bigger NBuffers values.

Since cache misses are a problem that we're going to have to deal with,
restricting the processor's tools for efficiently dealing with them
(out-of-order execution, SMT) doesn't seem like a wise choice.

> I don't know that we've really established
> that that's the true root of the problem rather than some other
> pedestrian optimization failure.

Absolutely.

> But the existing code is using an
> atomic operation to acquire a spinlock, then releasing it, walking the
> bucket chain, and then using another atomic operation to acquire a
> spinlock and then releasing it again.

Well, we don't do that for lookups right now. Just for
insertions/removals. Or are you talking about the buffer mapping lock?

> Surely a pure fence shouldn't cost more than a spinlock cycle?

It really shouldn't - the difference right now is that there's something
else the processor can do while dealing with the cache miss.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: dynahash replacement for buffer table
Date: 2014-10-16 07:11:34
Message-ID: 20141016071134.GK7242@alap3.anarazel.de
Lists: pgsql-hackers

On 2014-10-14 17:53:10 -0400, Robert Haas wrote:
> On Tue, Oct 14, 2014 at 4:42 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> >> The code in CHashSearch shows the problem there: you need to STORE the
> >> hazard pointer before you begin to do the LOAD operations to scan the
> >> bucket, and you must finish all of those LOADs before you STORE the
> >> NULL hazard pointer. A read or write barrier won't provide store/load
> >> or load/store ordering.
> >
> > I'm not sure I understand the problem here - but a read + write barrier
> > should suffice? To avoid falling back to two full barriers, we'd need a
> > separate define pg_read_write_barrier(), but that's not a problem. IIUC
> > that should allow us to avoid emitting any actual code on x86.
>
> Well, thinking about x86 specifically, it definitely needs at least
> one mfence, after setting the hazard pointer and before beginning to
> read the bucket chain. It probably doesn't need the second mfence,
> before clearing the the hazard pointer, because x86 allows loads to be
> reordered before stores but not stores before loads. We could
> certainly try removing the second fence on x86 for benchmarking
> purposes, and we could also check whether mfence outperforms lock;
> addl in that scenario.

Hm. So I thought about this for a while this morning after I wasn't able
to unprogram my hotel room's alarm clock, which a previous occupant set
way too early. Given that premise, take the following with a grain of
salt.

The reason for needing an mfence is that a read/write barrier doesn't
guarantee that the store is visible to other processes - just the order
in which stores become visible. Right? And that's essentially because
the write might sit in the process's store buffer.

So, how about *not* using a full barrier on the reader side (i.e. the
side that does the hazard pointer writes), but instead doing something
like an atomic_fetch_add(hazard_ptr, 0) (i.e. lock; xaddl) on the side
that needs to scan the hazard pointers? That, combined with the write
memory barrier, should guarantee that the store buffers are emptied.
Which is pretty nice, because it moves the overhead to the rather
infrequent scanning of the hazard pointers - which needs to do lots of
other atomic ops anyway.

Sounds sane?

That's something that should best be simulated in an exhaustive x86
simulator, but it sounds sane - and it'd be quite awesome imo :)

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Ryan Johnson <ryan(dot)johnson(at)cs(dot)utoronto(dot)ca>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP: dynahash replacement for buffer table
Date: 2014-10-16 12:03:20
Message-ID: 543FB408.6020206@cs.utoronto.ca
Lists: pgsql-hackers

On 15/10/2014 11:41 AM, Robert Haas wrote:
> On Wed, Oct 15, 2014 at 2:03 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> On 2014-10-14 17:53:10 -0400, Robert Haas wrote:
>>> On Tue, Oct 14, 2014 at 4:42 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>>>>> The code in CHashSearch shows the problem there: you need to STORE the
>>>>> hazard pointer before you begin to do the LOAD operations to scan the
>>>>> bucket, and you must finish all of those LOADs before you STORE the
>>>>> NULL hazard pointer. A read or write barrier won't provide store/load
>>>>> or load/store ordering.
>>>> I'm not sure I understand the problem here - but a read + write barrier
>>>> should suffice? To avoid falling back to two full barriers, we'd need a
>>>> separate define pg_read_write_barrier(), but that's not a problem. IIUC
>>>> that should allow us to avoid emitting any actual code on x86.
>>> Well, thinking about x86 specifically, it definitely needs at least
>>> one mfence, after setting the hazard pointer and before beginning to
>>> read the bucket chain.
>> Yes, I can see that now. I do wonder if that's not going to prove quite
>> expensive... I think I can see some ways around needing that. But I
>> think that needs some benchmarking first - no need to build a even more
>> complex solution if not needed.
>>
>> The solution I'm thinking of is to essentially move away from hazard
>> pointers and store something like a generation counter per
>> backend. Which is updated less often, and in combination with other
>> operations. When a backend need to clean up and sees that there's a
>> straggler with a old generation it sends that backend a signal to ensure
>> it sets the latest generation.
> It's possible that might work ... but on the timescale we're talking
> about here, that's asking the garbage collecting process to wait for
> practically geologic time.
>
> Back when I first wrote this code, I spent a fair amount of time
> looking at existing work in the area of lock-free hash tables.
> Essentially all of the work I was able to find on this topic assumes a
> threaded model - or more precisely, it assumes that shared memory
> allocation is cheap and easy and you'll have no problem getting as
> much of it as you need whenever you need it. This assumption often
> isn't even spelled out explicitly: it's just assumed that that's the
> programming environment you're working in. Finding highly parallel
> algorithms that don't rely on memory allocation as a primitive is
> hard. Hazard pointers are one of the few techniques I found that
> seems like it can work in our architecture. I'm quite reluctant to
> leave aside techniques that have been proven to work well in favor of
> inventing something entirely novel to PostgreSQL.
>
> That having been said, there is some literature on generation numbers,
> and I think something like that could be made to work. It does have
> some significant disadvantages, though. One is that a single process
> which fails to update its generation number prevents all reclamation,
> system-wide. In my algorithm, a process whose execution is
> suspended while a hazard pointer is set prevents recycling of only one
> of many garbage lists. A process searching for a reusable element can
> mostly likely find some other garbage list to reclaim instead. Also,
> a generation number implies that we update the value periodically,
> rather than before and after each critical section. Anything that
> forces garbage collection to be postponed longer than absolutely
> necessary seems to me likely to be a loser. It might be a little
> faster as long as we have free elements to allocate, but as soon as
> we're out and have to wait longer than we otherwise would for garbage
> collection, and all system activity halts as a result, even for a few
> milliseconds, it's going to be a whole lot slower. Or at least, I
> think.
>
> That having been said, I don't know what to do about the fact that the
> fence is too expensive. I don't know that we've really established
> that that's the true root of the problem rather than some other
> pedestrian optimization failure. But the existing code is using an
> atomic operation to acquire a spinlock, then releasing it, walking the
> bucket chain, and then using another atomic operation to acquire a
> spinlock and then releasing it again. Surely a pure fence shouldn't
> cost more than a spinlock cycle? Even with arguably one extra cache
> line touch, that seems like it ought to be a win. But my intuitions
> in this area are shaky.
Why not use an RCU mechanism [1] and ditch the hazard pointers? Seems
like an ideal fit...

In brief, RCU has the following requirements:

* Read-heavy access pattern
* Writers must be able to make dead objects unreachable to new readers
(easily done for most data structures)
* Writers must be able to mark dead objects in such a way that
existing readers know to ignore their contents but can still
traverse the data structure properly (usually straightforward)
* Readers must occasionally inform the system that they are not
currently using any RCU-protected pointers (to allow resource
reclamation)

[1] http://www.rdrop.com/~paulmck/RCU/

If the above are satisfied, then:

* Readers need no synchronization at all (not even compiler fences)
* Writers can use the synchronization mechanism of their choice to
coordinate with each other (lwlock, atomic CAS, etc.)
* Object reclamation can be performed in the background, or
synchronously (and incrementally, if desired )

I've had very good results implementing user-level RCU for my own
database projects. It slashes the complexity of reasoning about
concurrent reader accesses. Implemented properly, it requires only a bit
of shared memory, has extremely low overhead in the common case of no
stragglers, and requires only minimal communication except when
stragglers are present (and even then mostly between stragglers). It's
reasonably simple to implement in a userspace DBMS context (< 1kLoC with
comments, assertions, etc.). Just don't forget to quiesce all in-flight
back ends at regular intervals, or the system won't be able to reclaim
anything...
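
To make that concrete, here is a minimal sketch of the kind of epoch
machinery I mean (C11 atomics; the names, the fixed backend array and
the spin-wait are invented for the example - a real implementation
needs the straggler signalling described above):

/* Minimal quiescent-state reclamation sketch; purely illustrative. */
#include <stdatomic.h>
#include <stdint.h>

#define MAX_BACKENDS 64

static _Atomic uint64_t global_epoch = 1;
static _Atomic uint64_t backend_epoch[MAX_BACKENDS];   /* 0 = idle */

/* Called at quiescent points (e.g. between statements).  Cheap: one
 * load of the shared epoch plus one store to this backend's own slot. */
void
rcu_quiesce(int my_slot)
{
    uint64_t e = atomic_load_explicit(&global_epoch, memory_order_acquire);

    atomic_store_explicit(&backend_epoch[my_slot], e, memory_order_release);
}

/* Reclaimer: advance the epoch, then wait until every non-idle backend
 * has reported an epoch at least that new.  Anything unlinked before
 * the bump is then unreachable by all readers and can be freed. */
void
rcu_wait_for_grace_period(void)
{
    uint64_t e = atomic_fetch_add(&global_epoch, 1) + 1;

    for (int i = 0; i < MAX_BACKENDS; i++)
    {
        uint64_t seen;

        do
        {
            seen = atomic_load_explicit(&backend_epoch[i],
                                        memory_order_acquire);
            /* a real system would signal a straggler here, not spin */
        } while (seen != 0 && seen < e);
    }
}

The whole game is keeping the rcu_quiesce() calls frequent enough and
the straggler handling cheap enough; everything else is bookkeeping.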

Thoughts?
Ryan


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Ryan Johnson <ryan(dot)johnson(at)cs(dot)utoronto(dot)ca>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: dynahash replacement for buffer table
Date: 2014-10-16 13:19:16
Message-ID: CA+Tgmoak1koy+i0RmLWS8Fh+haG-0R0hkWZViGmu4aV5UwpGGA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Oct 16, 2014 at 8:03 AM, Ryan Johnson
<ryan(dot)johnson(at)cs(dot)utoronto(dot)ca> wrote:
> Why not use an RCU mechanism [1] and ditch the hazard pointers? Seems like
> an ideal fit...
>
> In brief, RCU has the following requirements:
>
> Read-heavy access pattern
> Writers must be able to make dead objects unreachable to new readers (easily
> done for most data structures)
> Writers must be able to mark dead objects in such a way that existing
> readers know to ignore their contents but can still traverse the data
> structure properly (usually straightforward)
> Readers must occasionally inform the system that they are not currently
> using any RCU-protected pointers (to allow resource reclamation)

Have a look at http://lwn.net/Articles/573424/ and specifically the
"URCU overview" section. Basically, that last requirement - that
readers inform the system that they are not currently using any
RCU-protected pointers - turns out to require either memory barriers
or signals.

All of the many techniques that have been developed in this area are
merely minor variations on a very old theme: set some kind of flag
variable in shared memory to let people know that you are reading a
shared data structure, and clear it when you are done. Then, other
people can figure out when it's safe to recycle memory that was
previously part of that data structure. In Linux's RCU, the flag
variable is "whether the process is currently scheduled on a CPU",
which is obviously not workable from user-space. Lacking that, you
need an explicit flag variable, which means you need memory barriers,
since the protected operation is a load and the flag variable is
updated via a store. You can try to avoid some of the overhead by
updating the flag variable less often (say, when a signal arrives) or
you can make it more fine-grained (in my case, we only prevent reclaim
of a fraction of the data structure at a time, rather than all of it)
or various other variants, but none of this is unfortunately so simple
as "apply technique X and your problem just goes away".

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Ryan Johnson <ryan(dot)johnson(at)cs(dot)utoronto(dot)ca>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: dynahash replacement for buffer table
Date: 2014-10-16 13:30:15
Message-ID: 20141016133015.GF21348@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2014-10-16 09:19:16 -0400, Robert Haas wrote:
> On Thu, Oct 16, 2014 at 8:03 AM, Ryan Johnson
> <ryan(dot)johnson(at)cs(dot)utoronto(dot)ca> wrote:
> > Why not use an RCU mechanism [1] and ditch the hazard pointers? Seems like
> > an ideal fit...
> >
> > In brief, RCU has the following requirements:
> >
> > Read-heavy access pattern
> > Writers must be able to make dead objects unreachable to new readers (easily
> > done for most data structures)
> > Writers must be able to mark dead objects in such a way that existing
> > readers know to ignore their contents but can still traverse the data
> > structure properly (usually straightforward)
> > Readers must occasionally inform the system that they are not currently
> > using any RCU-protected pointers (to allow resource reclamation)
>
> Have a look at http://lwn.net/Articles/573424/ and specifically the
> "URCU overview" section. Basically, that last requirement - that
> readers inform the system that they are not currently using any
> RCU-protected pointers - turns out to require either memory barriers
> or signals.

Well, there's the "quiescent-state-based RCU" - that's actually
something that could reasonably be used inside postgres. Put something
around socket reads (syscalls are synchronization points) and non-nested
lwlock acquires. That'd mean it's nearly no additional overhead.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Ryan Johnson <ryan(dot)johnson(at)cs(dot)utoronto(dot)ca>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: dynahash replacement for buffer table
Date: 2014-10-16 13:59:26
Message-ID: CA+TgmoahSgDNOkdqnktLBBzvDLTAiYuSuJYyLDXyycwxf88fcw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Oct 16, 2014 at 9:30 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> On 2014-10-16 09:19:16 -0400, Robert Haas wrote:
>> On Thu, Oct 16, 2014 at 8:03 AM, Ryan Johnson
>> <ryan(dot)johnson(at)cs(dot)utoronto(dot)ca> wrote:
>> > Why not use an RCU mechanism [1] and ditch the hazard pointers? Seems like
>> > an ideal fit...
>> >
>> > In brief, RCU has the following requirements:
>> >
>> > Read-heavy access pattern
>> > Writers must be able to make dead objects unreachable to new readers (easily
>> > done for most data structures)
>> > Writers must be able to mark dead objects in such a way that existing
>> > readers know to ignore their contents but can still traverse the data
>> > structure properly (usually straightforward)
>> > Readers must occasionally inform the system that they are not currently
>> > using any RCU-protected pointers (to allow resource reclamation)
>>
>> Have a look at http://lwn.net/Articles/573424/ and specifically the
>> "URCU overview" section. Basically, that last requirement - that
>> readers inform the system that they are not currently using any
>> RCU-protected pointers - turns out to require either memory barriers
>> or signals.
>
> Well, there's the "quiescent-state-based RCU" - that's actually
> something that could reasonably be used inside postgres. Put something
> around socket reads (syscalls are synchronization points) and non-nested
> lwlock acquires. That'd mean it's nearly no additional overhead.

Sure, so, you reuse your existing barriers instead of adding new ones.
Making it work sounds like a lot of work for an uncertain benefit
though.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Ryan Johnson <ryan(dot)johnson(at)cs(dot)utoronto(dot)ca>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: dynahash replacement for buffer table
Date: 2014-10-16 15:33:03
Message-ID: 543FE52F.9050300@cs.utoronto.ca
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 16/10/2014 7:19 AM, Robert Haas wrote:
> On Thu, Oct 16, 2014 at 8:03 AM, Ryan Johnson
> <ryan(dot)johnson(at)cs(dot)utoronto(dot)ca> wrote:
>> Why not use an RCU mechanism [1] and ditch the hazard pointers? Seems like
>> an ideal fit...
>>
>> In brief, RCU has the following requirements:
>>
>> Read-heavy access pattern
>> Writers must be able to make dead objects unreachable to new readers (easily
>> done for most data structures)
>> Writers must be able to mark dead objects in such a way that existing
>> readers know to ignore their contents but can still traverse the data
>> structure properly (usually straightforward)
>> Readers must occasionally inform the system that they are not currently
>> using any RCU-protected pointers (to allow resource reclamation)
> Have a look at http://lwn.net/Articles/573424/ and specifically the
> "URCU overview" section. Basically, that last requirement - that
> readers inform the system that they are not currently using any
> RCU-protected pointers - turns out to require either memory barriers
> or signals.
> All of the many techniques that have been developed in this area are
> merely minor variations on a very old theme: set some kind of flag
> variable in shared memory to let people know that you are reading a
> shared data structure, and clear it when you are done. Then, other
> people can figure out when it's safe to recycle memory that was
> previously part of that data structure.
Sure, but RCU has the key benefit of decoupling its machinery (esp. that
flag update) from the actual critical section(s) it protects. In a DBMS
setting, for example, once per transaction or SQL statement would do
just fine. The notification can be much better than a simple flag---you
want to know whether the thread has ever quiesced since the last reclaim
cycle began, not whether it is currently quiesced (which it usually
isn't). In the implementation I use, a busy thread (e.g. not about to go
idle) can "chain" its RCU "transactions." In the common case, a chained
quiesce call comes when the RCU epoch is not trying to change, and the
"flag update" degenerates to a simple load. Further, the only time it's
critical to have that memory barrier is if the quiescing thread is about
to go idle. Otherwise, missing a flag just imposes a small delay on
resource reclamation (and that's assuming the flag in question even
belonged to a straggler process). How you implement epoch management,
especially the handling of stragglers, is the deciding factor in whether
RCU works well. The early URCU techniques were pretty terrible, and
maybe general-purpose URCU is doomed to stay that way, but in a DBMS
core it can be done very cleanly and efficiently because we can easily
add the quiescent points at appropriate locations in the code.
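
Concretely, the chained fast path looks roughly like this (names and
layout hypothetical, C11 atomics for the example; my_epoch would be
per-thread state in reality):

#include <stdatomic.h>
#include <stdint.h>

_Atomic uint64_t global_epoch = 1;      /* advanced by reclaimers */
static _Atomic uint64_t my_epoch;       /* this thread's published epoch */

void
rcu_quiesce_chained(void)
{
    uint64_t g = atomic_load_explicit(&global_epoch, memory_order_acquire);

    /* Common case: no reclaim cycle is trying to advance the epoch, so
     * the "flag update" degenerates to this one load and compare. */
    if (atomic_load_explicit(&my_epoch, memory_order_relaxed) == g)
        return;

    /* The epoch moved: publish that we have quiesced since it changed.
     * The expensive ordering only happens on this slow path. */
    atomic_store_explicit(&my_epoch, g, memory_order_seq_cst);
}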

> In Linux's RCU, the flag
> variable is "whether the process is currently scheduled on a CPU",
> which is obviously not workable from user-space. Lacking that, you
> need an explicit flag variable, which means you need memory barriers,
> since the protected operation is a load and the flag variable is
> updated via a store. You can try to avoid some of the overhead by
> updating the flag variable less often (say, when a signal arrives) or
> you can make it more fine-grained (in my case, we only prevent reclaim
> of a fraction of the data structure at a time, rather than all of it)
> or various other variants, but none of this is unfortunately so simple
> as "apply technique X and your problem just goes away".
Magic wand, no (does nothing for update contention, for example, and
requires some care to apply). But from a practical perspective RCU,
properly implemented, does make an awful lot of problems an awful lot
simpler to tackle. Especially for the readers.

Ryan


From: Ryan Johnson <ryan(dot)johnson(at)cs(dot)utoronto(dot)ca>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP: dynahash replacement for buffer table
Date: 2014-10-16 16:26:10
Message-ID: 543FF1A2.10808@cs.utoronto.ca
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 16/10/2014 7:59 AM, Robert Haas wrote:
> On Thu, Oct 16, 2014 at 9:30 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> On 2014-10-16 09:19:16 -0400, Robert Haas wrote:
>>> On Thu, Oct 16, 2014 at 8:03 AM, Ryan Johnson
>>> <ryan(dot)johnson(at)cs(dot)utoronto(dot)ca> wrote:
>>>> Why not use an RCU mechanism [1] and ditch the hazard pointers? Seems like
>>>> an ideal fit...
>>>>
>>>> In brief, RCU has the following requirements:
>>>>
>>>> Read-heavy access pattern
>>>> Writers must be able to make dead objects unreachable to new readers (easily
>>>> done for most data structures)
>>>> Writers must be able to mark dead objects in such a way that existing
>>>> readers know to ignore their contents but can still traverse the data
>>>> structure properly (usually straightforward)
>>>> Readers must occasionally inform the system that they are not currently
>>>> using any RCU-protected pointers (to allow resource reclamation)
>>> Have a look at http://lwn.net/Articles/573424/ and specifically the
>>> "URCU overview" section. Basically, that last requirement - that
>>> readers inform the system that they are not currently using any
>>> RCU-protected pointers - turns out to require either memory barriers
>>> or signals.
>> Well, there's the "quiescent-state-based RCU" - that's actually
>> something that could reasonably be used inside postgres. Put something
>> around socket reads (syscalls are synchronization points) and non-nested
>> lwlock acquires. That'd mean it's nearly no additional overhead.
> Sure, so, you reuse your existing barriers instead of adding new ones.
> Making it work sounds like a lot of work for an uncertain benefit
> though.
Perhaps RCU is too much work and/or too little benefit to justify
replacing the current latch-based code. I personally am not convinced,
but have no hard data to offer other than to point at the rapid (even
accelerating) uptake of RCU throughout the Linux kernel (cf. Fig. 1 of
http://www2.rdrop.com/users/paulmck/techreports/RCUUsage.2013.02.24a.pdf).

However---
If we are convinced the latch-based code needs to go and the question is
whether to replace it with RCU or hazard pointers, then RCU wins
hands-down IMO. It's comparable work to implement, easier to reason
about (RCU read protocol is significantly simpler), and a clearer
performance benefit (hazard pointers require more barriers, more atomic
ops, more validating, and more pointer chasing than RCU). The only
metric where RCU loses to hazard pointers is if you have really tight
timing constraints on resource reclamation. Hazard pointers give a tight
bound on the number of dead objects that cannot be reclaimed at any
given moment, while RCU does not. I've not heard that this is a major
concern, though, and in practice it doesn't seem to be a problem unless
you forget to annotate an important quiescent point (like a blocking
syscall).

Ryan


From: Ants Aasma <ants(at)cybertec(dot)at>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>
Subject: Re: WIP: dynahash replacement for buffer table
Date: 2014-10-16 16:59:25
Message-ID: CA+CSw_uRPJY5WReBtrS3oiTX0CYqOPbJh1SUNgZdKrytA0OxMw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Oct 15, 2014 7:32 PM, "Ants Aasma" <ants(at)cybertec(dot)at> wrote:
> I'm imagining a bucketized cuckoo hash with 5 item buckets (5-way
> associativity). This allows us to fit the bucket onto 2 regular sized
> cache lines and have 8 bytes left over. Buckets would be protected by
> seqlocks stored in the extra space. On the read side we would only
> need 2 read barriers (basically free on x86), and we are guaranteed to
> have an answer by fetching two pairs of cache lines. We can trade
> memory bandwidth for latency by issuing prefetches (once we add
> primitives for this). Alternatively we can trade insert speed for
> lookup speed by using asymmetrically sized tables.

I was thinking about this. It came to me that we might not even need
BufferTags to be present in the hash table. In the hash table itself
we store just a pair of a 4-byte buffer tag hash and a buffer id. This
way we can store 8 pairs in one cacheline. Lookup must account for the
fact that, due to hash collisions, we may find multiple matches, one of
which may be the buffer we are looking for. We can make this condition
exceedingly unlikely by taking advantage of the fact that we have two
hashes anyway: in each table we use the lookup hash of the other table
for tagging (a 32-bit hash collision within 8 items). By having a
reasonably high load factor (75% should work) and 8 bytes per key we
even have a pretty good chance of fitting the hotter parts of the
buffer lookup table in CPU caches.

We use a separate array for the locks (spinlocks should be ok here)
for fine grained locking, this may be 1:1 with the buckets or we can
map many buckets to a single lock. Otherwise the scheme should work
mostly the same. Using this scheme we can get by on the lookup side
using 64 bit atomic reads with no barriers, we only need atomic ops
for pinning the found buffer.
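
To illustrate, the entry and bucket layout I have in mind would be
roughly the following (field names invented for the example; the
second cuckoo table, the empty-slot convention and the lock array are
left out):

#include <stdint.h>

#define SLOTS_PER_BUCKET 8      /* 8 x 8-byte entries = one 64-byte line */

typedef struct
{
    uint32_t tag_hash;          /* BufferTag hashed with the *other*
                                 * table's hash function, used as a filter */
    uint32_t buf_id;            /* index into the buffer descriptors */
} PackedEntry;

typedef struct
{
    PackedEntry slots[SLOTS_PER_BUCKET];    /* exactly one cache line */
} PackedBucket;

/* Collect candidate buffer ids whose filter hash matches; the caller
 * then compares the full BufferTag in the buffer descriptor itself.
 * Usually 0 or 1 candidates; more only on a 32-bit hash collision
 * within the same 8 items. */
int
bucket_probe(const PackedBucket *bucket, uint32_t tag_hash,
             uint32_t *candidates, int max_candidates)
{
    int n = 0;

    for (int i = 0; i < SLOTS_PER_BUCKET && n < max_candidates; i++)
    {
        if (bucket->slots[i].tag_hash == tag_hash)
            candidates[n++] = bucket->slots[i].buf_id;
    }
    return n;
}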

I haven't had the chance to check out how second-chance hashing works
and if this scheme would be applicable to it.

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Ryan Johnson <ryan(dot)johnson(at)cs(dot)utoronto(dot)ca>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: dynahash replacement for buffer table
Date: 2014-10-16 18:36:35
Message-ID: CA+TgmoZ6k1BKenaf3n-gYZoAhxRG2P3A4sVqmLAHswyd+-cwsA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Oct 16, 2014 at 12:26 PM, Ryan Johnson
<ryan(dot)johnson(at)cs(dot)utoronto(dot)ca> wrote:
> The only metric where RCU loses to
> hazard pointers is if you have really tight timing constraints on resource
> reclamation.

I think we do have that problem. It's certainly completely
unacceptable for a query to prevent buffer reclaim on any significant
number of buffers even until the end of the query, let alone the end
of the transaction.

But, hey, if somebody wants to try writing a patch different than the
one that I wrote and see whether it works better than mine, I'm
totally cool with that. This is something I came up with, and we're
here to evaluate whether it works better than any other option that we
have now or that someone wants to develop. I'm not saying this is the
best solution; it's just something I've got that seems to basically
work.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: dynahash replacement for buffer table
Date: 2014-10-16 20:16:52
Message-ID: CA+TgmoaXUdd1i=dNrTswxEDPaLudWfyGJVfWCWcRdweWn3-NXA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Oct 15, 2014 at 2:03 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> Yes, I can see that now. I do wonder if that's not going to prove quite
> expensive... I think I can see some ways around needing that. But I
> think that needs some benchmarking first - no need to build an even more
> complex solution if not needed.

I did a bit of testing on my MacBook Pro last night. Unfortunately, I
don't have access to a large x86 machine at the moment.[1] I tried four
configurations:

(1) master
(2) master with chash patch
(3) master with chash patch, pg_memory_barrier() changed from lock
addl to mfence
(4) same as (3), plus barrier at the end of CHashSearch changed to a
compiler barrier (which should be safe on x86)

I tested pgbench -S, scale factor 100, shared_buffers 400MB. 3 trials
per configuration; runs were interleaved. Percentages indicate the
average difference between that line and master.

At 1 client:

(1) 11689.388986 11426.653176 11297.775005
(2) 11618.306822 11331.805396 11273.272945 -0.55%
(3) 11813.664402 11300.082928 11231.265030 -0.20%
(4) 11428.097384 11266.979165 11294.422376 -1.23%

At 16 clients:

(1) 56716.465893 56992.590587 56755.298362
(2) 57126.787712 56848.527712 57351.559138 +0.51%
(3) 56779.624757 57036.610925 56878.036445 +0.13%
(4) 56818.522369 56885.544509 56977.810413 +0.13%

I think this tends to confirm that the patch is a small loss (for
unknown reasons) at 1 client, but that it picks up speed with more
contention, even on a machine with only 4 cores. But if there's an
important difference between the different fencing techniques, it
doesn't show up in this test.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

[1] Yes, this is a hint.


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: dynahash replacement for buffer table
Date: 2014-10-16 22:53:57
Message-ID: 20141016225357.GB19064@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

On 2014-10-14 09:30:58 -0400, Robert Haas wrote:
> A few years ago I started working on a concurrent hash table for
> PostgreSQL. The hash table part of it worked, but I never did
> anything with it, really. Amit mentioned to me earlier this week that
> he was seeing contention inside the dynahash machinery, which inspired
> me to go back and update the patch. I took the basic infrastructure
> from before and used it to replace the buffer table. Patch is
> attached.

So. I ran some quick tests on a larger x86 machine. 4xE5-4620, 256GB RAM.

The hotel wifi is too flaky to get detailed results, but some tasty
bits.

The test I used is read-only pgbench on a scale 5000 DB - about 73GB. I'm
benchmarking chash on top of my LW_SHARED and StrategyGetBuffer()
optimizations because otherwise the bottleneck is completely elsewhere.

When using shared_buffers = 96GB there's a performance benefit, but not
huge:
master (f630b0dd5ea2de52972d456f5978a012436115e): 153621.8
master + LW_SHARED + lockless StrategyGetBuffer(): 477118.4
master + LW_SHARED + lockless StrategyGetBuffer() + chash: 496788.6
master + LW_SHARED + lockless StrategyGetBuffer() + chash-nomb: 499562.7

But with shared_buffers = 16GB:
master (f630b0dd5ea2de52972d456f5978a012436115e): 177302.9
master + LW_SHARED + lockless StrategyGetBuffer(): 206172.4
master + LW_SHARED + lockless StrategyGetBuffer() + chash: 413344.1
master + LW_SHARED + lockless StrategyGetBuffer() + chash-nomb: 426405.8

All the numbers here are -P5 output taken once it looks like things have
stabilized - it takes a couple of minutes to get to that point, so pgbench
runs would have to be really long not to be skewed by the startup. I don't
think my
manual selection of measurements matters much at this scale of
differences ;)

I had to play with setting max_connections+1 sometimes to get halfway
comparable results for master - unaligned data otherwise causes weird
results. Without doing that the performance gap between master
96/16G was even bigger. We really need to fix that...

This is pretty awesome.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: dynahash replacement for buffer table
Date: 2014-10-17 00:22:24
Message-ID: CA+Tgmoa80iLreNDPhFVu856dcivsWF9x2sUcgNQ6Uy=PS56rWQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Oct 16, 2014 at 6:53 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> When using shared_buffers = 96GB there's a performance benefit, but not
> huge:
> master (f630b0dd5ea2de52972d456f5978a012436115e): 153621.8
> master + LW_SHARED + lockless StrategyGetBuffer(): 477118.4
> master + LW_SHARED + lockless StrategyGetBuffer() + chash: 496788.6
> master + LW_SHARED + lockless StrategyGetBuffer() + chash-nomb: 499562.7
>
> But with shared_buffers = 16GB:
> master (f630b0dd5ea2de52972d456f5978a012436115e): 177302.9
> master + LW_SHARED + lockless StrategyGetBuffer(): 206172.4
> master + LW_SHARED + lockless StrategyGetBuffer() + chash: 413344.1
> master + LW_SHARED + lockless StrategyGetBuffer() + chash-nomb: 426405.8

Very interesting. This doesn't show that chash is the right solution,
but it definitely shows that doing nothing is the wrong solution. It
shows that, even with the recent bump to 128 lock manager partitions,
and LW_SHARED on top of that, workloads that actually update the
buffer mapping table still produce a lot of contention there. This
hasn't been obvious to me from profiling, but the numbers above make
it pretty clear.

It also seems to suggest that trying to get rid of the memory barriers
isn't a very useful optimization project. We might get a couple of
percent out of it, but it's pretty small potatoes, so unless it can be
done more easily than I suspect, it's probably not worth bothering
with. An approach I think might have more promise is to have bufmgr.c
call the CHash stuff directly instead of going through buf_table.c.
Right now, for example, BufferAlloc() creates and initializes a
BufferTag and passes a pointer to that buffer tag to BufTableLookup,
which copies it into a BufferLookupEnt. But it would be just as easy
for BufferAlloc() to put the BufferLookupEnt on its own stack, and
then you wouldn't need to copy the data an extra time. Now a 20-byte
copy isn't a lot, but it's completely unnecessary and looks easy to
get rid of.
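
Roughly, the change in data flow I mean is the following (struct
shapes simplified here, and the chash call left as a comment since the
real signatures live in the patch):

#include <stdint.h>
#include <string.h>

typedef struct      /* 20 bytes, standing in for the real BufferTag */
{
    uint32_t spc, db, rel, fork, block;
} TagSketch;

typedef struct      /* standing in for BufferLookupEnt */
{
    TagSketch key;
    int       id;
} LookupEntSketch;

/* Current shape: BufferAlloc() fills in a BufferTag, and the
 * buf_table.c wrapper copies it into a BufferLookupEnt before calling
 * into chash. */
void
lookup_via_buf_table(const TagSketch *tag)
{
    LookupEntSketch ent;

    memcpy(&ent.key, tag, sizeof(ent.key));   /* the avoidable 20-byte copy */
    /* found = CHashSearch(SharedBufHash, &ent); */
    (void) ent;
}

/* Proposed shape: BufferAlloc() builds the BufferLookupEnt on its own
 * stack in the first place and hands it to chash directly. */
void
lookup_direct(uint32_t spc, uint32_t db, uint32_t rel,
              uint32_t fork, uint32_t block)
{
    LookupEntSketch ent = { { spc, db, rel, fork, block }, 0 };

    /* found = CHashSearch(SharedBufHash, &ent); */
    (void) ent;
}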

> I had to play with setting max_connections+1 sometimes to get halfway
> comparable results for master - unaligned data otherwise causes weird
> results. Without doing that the performance gap between master
> 96/16G was even bigger. We really need to fix that...
>
> This is pretty awesome.

Thanks. I wasn't quite sure how to test this or where the workloads
that it would benefit would be found, so I appreciate you putting time
into it. And I'm really glad to hear that it delivers good results.

I think it would be useful to plumb the chash statistics into the
stats collector or at least a debugging dump of some kind for testing.
They include a number of useful contention measures, and I'd be
interested to see what those look like on this workload. (If we're
really desperate for every last ounce of performance, we could also
disable those statistics in production builds. That's probably worth
testing at least once to see if it matters much, but I kind of hope it
doesn't.)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: dynahash replacement for buffer table
Date: 2014-10-17 19:02:51
Message-ID: 20141017190251.GG2075@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2014-10-16 20:22:24 -0400, Robert Haas wrote:
> On Thu, Oct 16, 2014 at 6:53 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> > When using shared_buffers = 96GB there's a performance benefit, but not
> > huge:
> > master (f630b0dd5ea2de52972d456f5978a012436115e): 153621.8
> > master + LW_SHARED + lockless StrategyGetBuffer(): 477118.4
> > master + LW_SHARED + lockless StrategyGetBuffer() + chash: 496788.6
> > master + LW_SHARED + lockless StrategyGetBuffer() + chash-nomb: 499562.7
> >
> > But with shared_buffers = 16GB:
> > master (f630b0dd5ea2de52972d456f5978a012436115e): 177302.9
> > master + LW_SHARED + lockless StrategyGetBuffer(): 206172.4
> > master + LW_SHARED + lockless StrategyGetBuffer() + chash: 413344.1
> > master + LW_SHARED + lockless StrategyGetBuffer() + chash-nomb: 426405.8
>
> Very interesting. This doesn't show that chash is the right solution,
> but it definitely shows that doing nothing is the wrong solution.

Absolutely.

The thing worrying me most (but not all that much on an absolute scale)
about chash is that lots of the solutions to memory management it builds
are specific to it... And generalizing afterwards will be hard because
we'll have to prove that the general solution is as performant as the
special case code...

> It shows that, even with the recent bump to 128 lock manager
> partitions, and LW_SHARED on top of that, workloads that actually
> update the buffer mapping table still produce a lot of contention
> there.

FWIW, I couldn't see much of a benefit without LW_SHARED even though I tried
a *few* things. The bottleneck simply is completely elsewhere.

> This hasn't been obvious to me from profiling, but the numbers
> above make it pretty clear.

So I'm not super surprised about that. There very well might be cases
where it was a bad bottleneck before, but I've not seen them.

> It also seems to suggest that trying to get rid of the memory barriers
> isn't a very useful optimization project.

I'm not yet convinced of that. Yes, it's not showing up hugely in the
numbers here, but that's simply because the workload is again completely
dominated by the kernel copying data for the read()s, GetSnapshotData(),
the buffer mapping cache misses and a few other familiar friends.

> We might get a couple of
> percent out of it, but it's pretty small potatoes, so unless it can be
> done more easily than I suspect, it's probably not worth bothering
> with.

I still think that moving the barrier to the reading side would be
simple (implementation wise) and correct for x86. If you think about it,
otherwise our spinlock implementation for x86 would be broken. As a
unlock is just a compiler barrier the full barrier on acquire better be
a full synchronization point. Am I missing something?

> An approach I think might have more promise is to have bufmgr.c
> call the CHash stuff directly instead of going through buf_table.c.

I don't see all that much point in buf_table.c currently - on the other
hand it has made replacing the buffer mapping slightly
easier... Maybe it should just go into a header...

> Right now, for example, BufferAlloc() creates and initializes a
> BufferTag and passes a pointer to that buffer tag to BufTableLookup,
> which copies it into a BufferLookupEnt. But it would be just as easy
> for BufferAlloc() to put the BufferLookupEnt on its own stack, and
> then you wouldn't need to copy the data an extra time. Now a 20-byte
> copy isn't a lot, but it's completely unnecessary and looks easy to
> get rid of.

Worthwile to try.

> > I had to play with setting max_connections+1 sometimes to get halfway
> > comparable results for master - unaligned data otherwise causes weird
> > results. Without doing that the performance gap between master
> > 96/16G was even bigger. We really need to fix that...
> >
> > This is pretty awesome.
>
> Thanks. I wasn't quite sure how to test this or where the workloads
> that it would benefit would be found, so I appreciate you putting time
> into it. And I'm really glad to hear that it delivers good results.

I wasn't sure either ;). Lucky that it played out so impressively. After
the first results I was nearly ready to send out a 'Meh.' type of
message ;)

Btw, CHashTableGetNode isn't exactly cheap. It shows up noticeably in
profiles. It results in several non-pipelineable instructions in an
already penalized section of the code... Replacing arena_stride by a
constant provided measurable improvements (no imul is required anymore;
instead you get shr;lea or so). I'm not sure how to deal with that. If
it shows up even on my quite new laptop, it'll be even more noticeable on
older x86 platforms.
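
To spell out what I mean about the stride (the struct here is just a
stand-in for the example, not the patch's actual layout):

#include <stddef.h>

typedef struct
{
    char   *arena;          /* base of the node array */
    size_t  arena_stride;   /* node size, only known at runtime */
} CHashTableSketch;

/* Runtime stride: the compiler has to emit an integer multiply (imul)
 * on every CHashTableGetNode()-style access. */
static inline void *
get_node_dynamic(CHashTableSketch *t, size_t i)
{
    return t->arena + i * t->arena_stride;
}

/* Compile-time stride: for, say, a 24-byte node this turns into plain
 * shift/lea address arithmetic, with no multiply on the hot path. */
#define NODE_STRIDE 24
static inline void *
get_node_constant(CHashTableSketch *t, size_t i)
{
    return t->arena + i * NODE_STRIDE;
}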

> I think it would be useful to plumb the chash statistics into the
> stats collector or at least a debugging dump of some kind for testing.

I'm not sure it's solid enough at this point to be in the stats
collector. But I surely would like to access it somehow. I'm
e.g. absolutely not sure that your load factor is "good" and it'd be much
easier if those stats were visible. I'm not really sure how to make
them visible though.

> They include a number of useful contention measures, and I'd be
> interested to see what those look like on this workload. (If we're
> really desperate for every last ounce of performance, we could also
> disable those statistics in production builds. That's probably worth
> testing at least once to see if it matters much, but I kind of hope it
> doesn't.)

I can't retest on bigger HW right now, but IIRC they didn't show up in
the profile there. It's not visible on my laptop.

Profile-wise it'd be helpful if BucketScan were inlined. Right now
it's hard to see the difference in cost between search/insert/delete and
I think that'd be worth the cost of duplicated instructions...

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: dynahash replacement for buffer table
Date: 2014-11-05 23:19:58
Message-ID: 20141105231958.GC28295@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

> git branch also available at:
> http://git.postgresql.org/gitweb/?p=users/rhaas/postgres.git;a=shortlog;h=refs/heads/chash2014

A minor review of this:
* Should be rebased on top of the atomics API

* In benchmarks it becomes apparent that the dynamic element width makes
some macros like CHashTableGetNode() and
CHashTableGetGarbageByBucket() quite expensive. At least gcc on x86
can't figure out how to build an efficient expression for the
target. There are two ways to address this:
a) Actually morph chash into something that will be included into the
user sites. Since I think there'll only be a limited number of those
that's probably acceptable.
b) Simplify the access rules. I quite seriously doubt that the
interleaving of garbage and freelists is actually necessary/helpful.

* There's several cases where it's noticeable that chash creates more
cacheline misses - that's imo a good part of why the single threaded
performance regresses. There's good reasons for the current design
without an inline bucket, namely that it makes the concurrency handling
easier. But I think that can be countered by still storing a pointer -
just to an element inside the bucket. Afaics that'd allow this to be
introduced quite easily?

I'm afraid that I can't see us going forward unless we address the
noticeable single threaded penalties. Do you see that differently?

* There's some whitespace damage. (space followed by tab, followed by
actual contents)

* I still think it's a good idea to move the full memory barriers into
the cleanup/writing process by doing write memory barriers when
acquiring a hazard pointer and RMW atomic operations (e.g. atomic add)
in the process testing for the hazard pointer.

* Shouldn't we relax the CPU in a couple places like CHashAllocate(),
CHashBucketScan()'s loops?

* I don't understand right now (but then I'm in a Hotel bar) why it's
safe for CHashAddToGarbage() to clobber the hash value?
CHashBucketScan() relies on the chain being ordered by hashcode. No?
Another backend might just have been past the IsMarked() bit and look
at the hashcode?

* We really should find a way to sensibly print the stats.

Greetings,

Andres Freund


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: dynahash replacement for buffer table
Date: 2014-11-07 16:08:57
Message-ID: CA+Tgmoak-hK4k-sjCBXQvBR3BZ-Tib3zx_0xB1nb6DgL0fiH=w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Nov 5, 2014 at 6:19 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> * In benchmarks it becomes apparent that the dynamic element width makes
> some macros like CHashTableGetNode() and
> CHashTableGetGarbageByBucket() quite expensive. At least gcc on x86
> can't figure how to build an efficient expression for the
> target. There's two ways to address this:
> a) Actually morph chash into something that will be included into the
> user sites. Since I think there'll only be a limited number of those
> that's probably acceptable.

Have you benchmarked this? How much does it help?

> b) Simplify the access rules. I quite seriously doubt that the
> interleaving of garbage and freelists is actually necessary/helpful.

That wasn't added for nothing. I did a bunch of performance testing
on this vs. dynahash (I think the test code is included in the patch)
and the region of memory containing the freelists practically caught
fire. Spreading them out like this greatly improved the performance
under heavy concurrency, but even with those changes the freelists
were extremely hot. Now, since this is the buffer mapping table
rather than a tight loop around hash table manipulation, the
difference may not be enough to measure. But on a microbenchmark it
*definitely* was.

> * There's several cases where it's noticeable that chash creates more
> cacheline misses - that's imo a good part of why the single threaded
> performance regresses. There's good reasons for the current design
> without an inline bucket, namely that it makes the concurrency handling
> easier. But I think that can be countered by still storing a pointer -
> just to an element inside the bucket. Afaics that'd allow this to be
> introduced quite easily?

It's not obvious to me how that would work. If you just throw those
elements on the garbage lists with everything else, it will soon be
the case that one bucket header ends up using the cell stored in some
other bucket, which isn't going to be awesome. So you need some kind
of more complex scheme to preserve locality.

> I'm afraid that I can't see us going forward unless we address the
> noticeable single threaded penalties. Do you see that differently?

I'm not sure. We're talking about something on the order of half a
percent on my tests, and lots of changes cause things to bounce around
by that much. I'm not sure it makes sense to get too worked up about
that if the alternative is to add a big pile of new complexity.

> * There's some whitespace damage. (space followed by tab, followed by
> actual contents)

That's the least of our problems at this point.

> * I still think it's a good idea to move the full memory barriers into
> the cleanup/writing process by doing write memory barriers when
> acquiring a hazard pointer and RMW atomic operations (e.g. atomic add)
> in the process testing for the hazard pointer.

Can you cite any existing precedent in other system software? Does
Linux do anything like that, for example?

> * Shouldn't we relax the CPU in a couple places like CHashAllocate(),
> CHashBucketScan()'s loops?

You mean like, have it sleep the way SpinLockAcquire() does? Yeah,
possibly, but that adds non-trivial code complexity which may not be
worth it if - as is hoped for - there's no real contention there.

> * I don't understand right now (but then I'm in a Hotel bar) why it's
> safe for CHashAddToGarbage() to clobber the hash value?
> > CHashBucketScan() relies on the chain being ordered by hashcode. No?
> Another backend might just have been past the IsMarked() bit and look
> at the hashcode?

I think the worst case scenario is that we falsely conclude that
there's a hash match when there really isn't. The ensuing memcmp()
will clarify matters.

> * We really should find a way to sensibly print the stats.

Yep.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: dynahash replacement for buffer table
Date: 2014-11-09 20:58:15
Message-ID: 20141109205815.GA28007@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2014-11-07 11:08:57 -0500, Robert Haas wrote:
> On Wed, Nov 5, 2014 at 6:19 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> > * In benchmarks it becomes apparent that the dynamic element width makes
> > some macros like CHashTableGetNode() and
> > CHashTableGetGarbageByBucket() quite expensive. At least gcc on x86
> > can't figure how to build an efficient expression for the
> > target. There's two ways to address this:
> > a) Actually morph chash into something that will be included into the
> > user sites. Since I think there'll only be a limited number of those
> > that's probably acceptable.
>
> Have you benchmarked this? How much does it help?

I've done quick benchmarks, and it helped in the 2-3% range in some
tests, and was a wash in others. What I did wasn't actually including it
in buf_table.c, but hardcoding the size in chash. That's obviously not
the real solution.

> > b) Simplify the access rules. I quite seriously doubt that the
> > interleaving of garbage and freelists is actually necessary/helpful.
>
> That wasn't added for nothing. I did a bunch of performance testing
> on this vs. dynahash (I think the test code is included in the patch)
> and the region of memory containing the freelists practically caught
> fire. Spreading them out like this greatly improved the performance
> under heavy concurrency, but even with those changes the freelists
> were extremely hot. Now, since this is the buffer mapping table
> rather than a tight loop around hash table manipulation, the
> difference may not be enough to measure. But on a microbenchmark it
> *definitely* was.

I think it'd be much less visible for the buffer manager, but it might
still be visible...

> > * There's several cases where it's noticeable that chash creates more
> > cacheline misses - that's imo a good part of why the single threaded
> > performance regresses. There's good reasons for the current design
> > without an inline bucket, namely that it makes the concurrency handling
> > easier. But I think that can be countered by still storing a pointer -
> > just to an element inside the bucket. Afaics that'd allow this to be
> > introduced quite easily?
>
> It's not obvious to me how that would work. If you just throw those
> elements on the garbage lists with everything else, it will soon be
> the case that one bucket header ends up using the cell stored in some
> other bucket, which isn't going to be awesome. So you need some kind
> of more complex scheme to preserve locality.

Treating the element inside the bucket as a kind of one element freelist
seems quite workable to me. Test whether it's marked deleted, scan the
hazard pointer list to see whether it can be used. If yes, grand. If
not, go to the current code. The fact that the element is cacheline
local will make the test for its deletedness almost free. Yes, the
hazard pointer scan surely ain't free - but at least for cases like
bufmgr where reads are never less frequent than writes and very often
*much* more frequent I'm pretty sure it'd be a noticeable win.
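
A rough sketch of the shape I'm thinking of (the struct layout and the
hazard pointer check are invented here, just to show where the inline
element would sit in the allocation path):

#include <stdbool.h>
#include <stdint.h>

typedef struct
{
    uint64_t hashcode;
    bool     deleted;       /* stands in for the real delete-mark bit */
    char     item[16];
} InlineNodeSketch;

typedef struct
{
    InlineNodeSketch inline_node;   /* lives on the bucket's cache line */
    void            *chain;         /* the existing out-of-line chain */
} BucketSketch;

/* Stub: in the real thing this scans all backends' hazard pointers. */
static bool
hazard_pointer_points_at(const void *p)
{
    (void) p;
    return false;
}

/* Allocation path: try the cacheline-local node first; fall back to
 * the normal freelist machinery only if it is still live or hazarded. */
InlineNodeSketch *
try_alloc_inline(BucketSketch *bucket)
{
    if (bucket->inline_node.deleted &&
        !hazard_pointer_points_at(&bucket->inline_node))
        return &bucket->inline_node;   /* reuse, no freelist touched */

    return NULL;    /* caller goes to the current allocation code */
}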

> > I'm afraid that I can't see us going forward unless we address the
> > noticeable single threaded penalties. Do you see that differently?
>
> I'm not sure. We're talking about something on the order of half a
> percent on my tests, and lots of changes cause things to bounce around
> by that much. I'm not sure it makes sense to get too worked up about
> that if the alternative is to add a big pile of new complexity.

I saw things in the range of 3-4% on my laptop.

> > * There's some whitespace damage. (space followed by tab, followed by
> > actual contents)
>
> That's the least of our problems at this point.

Sure, but still worth cleaning up ;)

> > * I still think it's a good idea to move the full memory barriers into
> > the cleanup/writing process by doing write memory barriers when
> > acquiring a hazard pointer and RMW atomic operations (e.g. atomic add)
> > in the process testing for the hazard pointer.
>
> Can you cite any existing precedent in other system software? Does
> Linux do anything like that, for example?

No, I can't right now. I think it follows from the cache coherency
rules, but I can understand suspicion there.

> > * Shouldn't we relax the CPU in a couple places like CHashAllocate(),
> > CHashBucketScan()'s loops?
>
> You mean like, have it sleep the way SpinLockAcquire() does?

Not actually like in s_lock(), but rather like the SPIN_DELAY() used in
the spinlock code for retries.

> Yeah, possibly, but that adds non-trivial code complexity which may
> not be worth it if - as is hoped for - there's no real contention
> there.

I think just adding a pg_spin_delay() before retrying should be trivial.

> > * I don't understand right now (but then I'm in a Hotel bar) why it's
> > safe for CHashAddToGarbage() to clobber the hash value?
> > CHashBucketScan() relies on the chain being ordered by hashcode. No?
> > Another backend might just have been past the IsMarked() bit and look
> > at the hashcode?
>
> I think the worst case scenario is that we falsely conclude that
> there's a hash match when there really isn't. The ensuing memcmp()
> will clarify matters.

Hm. Let me think about it for a while.

I was thinking that a spurious cmp < 0 could cause us to miss a
match - but that could only happen if the match has just been removed
from the list. Which is fine. Perhaps that case deserves to be mentioned
in the comment?

* Another thing I'm wondering about here is whether it's possible that
somebody is currently walking an "older version" of the bucket list
(i.e. is currently at an already unlinked element) and then visits a
newly inserted element with an 'apparently' out of order hash
value. That seems possible because the inserter might not have seen the
unlinked element. Afaics that's not a problem for searches - but I'm not
sure whether it couldn't cause a problem for concurrent inserts and
deletes.

* One thing I noticed while looking at that part of code is:
+ h = target_node->un.hashcode;
+ if (h == hashcode)
+ cmp = memcmp(CHashNodeGetItem(target_node), key,
+ table->desc.key_size);
+ else if (h > hashcode)
+ cmp = 1;
+ else
+ cmp = -1;

Compilers can feel entirely free to reload local variables from memory
(i.e. use target_node->un.hashcode instead of h) in situations like
these. That's why I used the ACCESS_ONCE macros in my freelist
patch. The same is done in a couple of other places (like looking at
->next). I'm *very* wary of relying on the compiler to not do stuff like
that. I unfortunately think you'll need similar things here.
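
For reference, by ACCESS_ONCE I mean the usual volatile-cast trick as
in the Linux kernel; applied to a hunk like the one above it'd look
roughly like this (the node struct is a simplified stand-in for the
patch's type):

#include <stdint.h>

/* Read x from memory exactly once; the compiler may not reload it. */
#define ACCESS_ONCE(x) (*(volatile __typeof__(x) *) &(x))

typedef struct CHashNodeSketch      /* stand-in for the patch's node */
{
    uint64_t hashcode;
} CHashNodeSketch;

int
compare_node_hash(CHashNodeSketch *target_node, uint64_t hashcode)
{
    /* Without ACCESS_ONCE the compiler is free to drop 'h' and re-read
     * target_node->hashcode later, possibly seeing a different value. */
    uint64_t h = ACCESS_ONCE(target_node->hashcode);

    if (h == hashcode)
        return 0;           /* caller falls through to the memcmp() */

    return (h > hashcode) ? 1 : -1;
}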

* I noticed is the following comment:
+ * Note: We can't begin this operation until the clearing of the
+ * garbage list has been committed to memory, but since that was
+ * done using an atomic operation no explicit barrier is needed
+ * here.
+ *

Uh. I don't think that holds true. A memory barrier on one cpu doesn't
forbid all kinds of reordering on others.

* How is the consistency of the element data guaranteed? Afaics there
very well could be two concurrent CHashInsert()'s for the same key
interleaving inside their respective memcpy()s. It'll often be fine
for small element sizes, but even there the compiler might resort to
a char-by-char memcpy.

* Should CHashInsert perhaps be renamed to CHashSetKey or CHashUpdate -
since it's not actually just inserting?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: dynahash replacement for buffer table
Date: 2014-12-15 03:13:38
Message-ID: CAB7nPqQMdpmeJXWc7wHZcOmgz75AmLPwfw=5ag2u7K8n-Zrkig@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

There hasn't been much activity on this patch these days, and Andres
has provided some feedback, hence switching to Returned with feedback.
Regards,
--
Michael


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: dynahash replacement for buffer table
Date: 2014-12-15 18:39:09
Message-ID: CA+TgmobA=sBeX9P4T8sGJSsUsEqZWcqEJeN13y71J-2ewag11g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Nov 9, 2014 at 3:58 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> > * There's several cases where it's noticeable that chash creates more
>> > cacheline misses - that's imo a good part of why the single threaded
>> > performance regresses. There's good reasons for the current design
>> > without an inline bucket, namely that it makes the concurrency handling
>> > easier. But I think that can be countered by still storing a pointer -
>> > just to an element inside the bucket. Afaics that'd allow this to be
>> > introduced quite easily?
>>
>> It's not obvious to me how that would work. If you just throw those
>> elements on the garbage lists with everything else, it will soon be
>> the case that one bucket header ends up using the cell stored in some
>> other bucket, which isn't going to be awesome. So you need some kind
>> of more complex scheme to preserve locality.
>
> Treating the element inside the bucket as a kind of one element freelist
> seems quite workable to me. Test whether it's marked deleted, scan the
> hazard pointer list to see whether it can be used. If yes, grand. If
> not, go to the current code. The fact that the element is cacheline
> local will make the test for its deletedness almost free. Yes, the
> hazard pointer scan surely ain't free - but at least for cases like
> bufmgr where reads are never less frequent than writes and very often
> *much* more frequent I'm pretty sure it'd be a noticeable win.

Maybe. I'd expect that to radically increase the time spent doing
hazard pointer scans. The performance of this system depends heavily
on garbage collection not happening too often, and ISTR that the
performance changes significantly if you vary the amount of
overallocation.

>> I'm not sure. We're talking about something on the order of half a
>> percent on my tests, and lots of changes cause things to bounce around
>> by that much. I'm not sure it makes sense to get too worked up about
>> that if the alternative is to add a big pile of new complexity.
>
> I saw things in the range of 3-4% on my laptop.

Mrmpf. Well, that sucks.

>> > * I don't understand right now (but then I'm in a Hotel bar) why it's
>> > safe for CHashAddToGarbage() to clobber the hash value?
>> > CHashBucketScan() relies on the chain being ordered by hashcode. No?
>> > Another backend might just have been past the IsMarked() bit and look
>> > at the hashcode?
>>
>> I think the worst case scenario is that we falsely conclude that
>> there's a hash match when there really isn't. The ensuing memcmp()
>> will clarify matters.
>
> Hm. Let me think about it for a while.
>
> I was thinking that a spurious cmp < 0 could cause us to miss a
> match - but that could only happen if the match has just been removed
> from the list. Which is fine. Perhaps that case deserves to be mentioned
> in the comment?

Maybe. I mean, the general principle here is that there may be some
difference between the apparent order of execution on one CPU as
opposed to another, and the code that uses the hash table has to be OK
with that - at least, unless it has independent methods of assuring
that things happen in the right order, but in that case that other
thing may well become the bottleneck. This is just one example of
that. You might also fail to see an insert that's just happened on
some other node but is not committed to main memory yet, which is not
really an issue we need to fix; it's just how things are. A general
discussion of this somewhere might be worthwhile, but it's pretty much
the same as any other highly-concurrent hashtable implementation you'll
find out there.

(It's also not really different from surrounding the hash table
operation, and only the hash table operation, with a big lock. Then
things can't change while you are scanning the bucket list, but they
can change by the time you can do anything with the returned value,
which is the same problem we have to cope with here.)

> * Another thing I'm wondering about here is whether it's possible that
> somebody is currently walking an "older version" of the bucket list
> (i.e. is currently at an already unlinked element) and then visits a
> newly inserted element with an 'apparently' out of order hash
> value. That seems possible because the inserter might not have seen the
> unlinked element. Afaics that's not a problem for searches - but I'm not
> sure whether it couldn't cause a problem for concurrent inserts and
> deletes.

This is why the delete-mark bit has to be part of the same atomic word
as the next-pointer. If somebody applies a delete-mark, a subsequent
attempt to insert an entry at that position will fail because the
CAS() of the next-word won't find the right value there - it will find
a delete-marked value, which will never be the value it's expecting.
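
In case it helps, here's the invariant in the form of a tiny
marked-pointer sketch (C11 atomics; the names are mine, not the
patch's):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define MARK_BIT ((uintptr_t) 1)    /* low bit of an aligned next-pointer */

typedef struct
{
    _Atomic uintptr_t next;     /* next-pointer and delete-mark, one word */
} LinkSketch;

/* Insert: succeeds only if prev->next still holds the unmarked value
 * we saw during the scan (new_node's own next must already point at
 * seen_next).  A delete-marked word can never match, so nothing can
 * ever be attached after a node that has been marked for deletion. */
bool
link_insert_after(LinkSketch *prev, uintptr_t seen_next, uintptr_t new_node)
{
    return atomic_compare_exchange_strong(&prev->next, &seen_next, new_node);
}

/* Delete: set the mark bit in the same word with the same CAS
 * discipline, which simultaneously locks out concurrent inserts at
 * this position. */
bool
link_mark_deleted(LinkSketch *node, uintptr_t seen_next)
{
    return atomic_compare_exchange_strong(&node->next, &seen_next,
                                          seen_next | MARK_BIT);
}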

> * One thing I noticed while looking at that part of code is:
> + h = target_node->un.hashcode;
> + if (h == hashcode)
> + cmp = memcmp(CHashNodeGetItem(target_node), key,
> + table->desc.key_size);
> + else if (h > hashcode)
> + cmp = 1;
> + else
> + cmp = -1;
>
> Compilers can feel entirely free to reload local variables from memory
> (i.e. use target_node->un.hashcode instead of h) in situations like
> these. That's why I used the ACCESS_ONCE macros in my freelist
> patch. The same is done in a couple of other places (like looking at
> ->next). I'm *very* wary of relying on the compiler to not do stuff like
> that. I unfortunately think you'll need similar things here.

Even if the compiler decides to fetch the target node's hash code
twice, it won't create a correctness issue, because it's guaranteed to
see the same value both times. The hash code is set before adding the
item to the list (with a memory barrier to enforce ordering), and it's
never reset until the item is put back on a new list - which requires
an intervening cycle of garbage collection and hazard-pointer
scanning, so we can't still be looking at the old item at that point.

> * I noticed is the following comment:
> + * Note: We can't begin this operation until the clearing of the
> + * garbage list has been committed to memory, but since that was
> + * done using an atomic operation no explicit barrier is needed
> + * here.
> + *
>
> Uh. I don't think that holds true. A memory barrier on one cpu doesn't
> forbid all kinds of reordering on others.

Hmm, you might be right about that. I thought that was guaranteed,
but a quick look at http://en.wikipedia.org/wiki/Memory_ordering
suggests otherwise.

> * How is the consistency of the element data guaranteed? Afaics there
> very well could be two concurrent CHashInsert()'s for the same key
> interleaving inside their respective memcpy()s. It'll often be fine
> for small element sizes, but even there the compiler might resort to
> a char-by-char memcpy.

The two operations would have to CAS() the same next-pointer, and one
of those operations will fail. Note that the item is completely
initialized *before* we insert it into the hash table, so the premise
that two memcpy() operations can target the same address is just
wrong.

> * Should CHashInsert perhaps be renamed to CHashSetKey or CHashUpdate -
> since it's not actually just inserting?

This may be the root of the confusion. It *is* just inserting. It
does not update; there is no update operation, and like most
algorithms for highly-concurrent hash tables, the algorithm can't
support such an operation. The copy goes the other way: if an insert
fails, the caller gets back the value already present in the table,
which cannot be concurrently getting modified for the same reasons the
hashcode can't be modified while a bucket scan is looking at it.
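
From the caller's point of view the contract is roughly the following;
the types and the exact signature here are paraphrased for
illustration, not copied from the patch:

#include <stdbool.h>
#include <stdint.h>

/*
 * Stand-in for the real buffer-table entry, which pairs a buffer tag with
 * a buffer id.  The contract described above: on success the entry is
 * copied into the table; on failure the entry already present in the table
 * is copied back out into *entry.
 */
typedef struct LookupEntSketch
{
    uint32_t    key;
    int         id;
} LookupEntSketch;

extern bool CHashInsert(void *table, void *entry);  /* signature paraphrased */

static int
lookup_or_insert(void *table, uint32_t key, int my_id)
{
    LookupEntSketch ent;

    ent.key = key;
    ent.id = my_id;

    if (CHashInsert(table, &ent))
        return my_id;           /* we inserted our entry */

    /*
     * Insert failed: ent now holds a copy of the entry that was already in
     * the table, and per the reasoning above that entry cannot have been
     * concurrently modified while it was being copied back, so it is safe
     * to just use it.
     */
    return ent.id;
}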

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: dynahash replacement for buffer table
Date: 2015-01-27 04:20:36
Message-ID: CA+TgmoZDHHjQxsn18pEeDYnNXMg39R9yrGnmq7kuYcNwm-JadA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

This developed a slight merge conflict. I've rebased the attached
version, and I also took the step of getting rid of buf_table.c, as I
think I proposed somewhere upthread. This avoids the overhead of
constructing a BufferTag only to copy it into a BufferLookupEnt, plus
some function calls and so forth. A quick-and-dirty test suggests
this might not have cut down on the 1-client overhead much, but I
think it's worth doing anyway: it's certainly saving a few cycles, and
I don't think it's complicating anything measurably.
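
For anyone following along without the tree handy, the pre-patch shapes
are roughly these (paraphrased; the field types come from the existing
storage headers, so this fragment isn't compilable on its own). The
point is just that the wrapper struct forced callers to build a
BufferTag and then copy it a second time:

/* Roughly the pre-patch declarations, paraphrased: */
typedef struct buftag
{
    RelFileNode rnode;          /* physical relation identifier */
    ForkNumber  forkNum;
    BlockNumber blockNum;       /* block number within the fork */
} BufferTag;

typedef struct
{
    BufferTag   key;            /* tag of the disk page */
    int         id;             /* associated buffer ID */
} BufferLookupEnt;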

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
chash-buftable-v2.patch text/x-patch 81.5 KB

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: dynahash replacement for buffer table
Date: 2015-01-27 15:36:35
Message-ID: CA+TgmoYv=4ojw-DxBC0EvpFfa9D0T-qHyOkWb94BTRoTfCTi8A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Jan 26, 2015 at 11:20 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> This developed a slight merge conflict. I've rebased the attached
> version, and I also took the step of getting rid of buf_table.c, as I
> think I proposed somewhere upthread. This avoids the overhead of
> constructing a BufferTag only to copy it into a BufferLookupEnt, plus
> some function calls and so forth. A quick-and-dirty test suggests
> this might not have cut down on the 1-client overhead much, but I
> think it's worth doing anyway: it's certainly saving a few cycles, and
> I don't think it's complicating anything measurably.

So here are median-of-three results for 5-minute read-only pgbench
runs at scale factor 1000, shared_buffers = 8GB, on hydra (POWER7) at
various client counts.

clients - master@4b2a25 (tps) - master+chash-buftable-v2 (tps)
1 8319.254199 8105.479632
2 15485.561948 15138.227533
3 23690.186576 23264.702784
4 32157.346740 31536.870881
5 40879.532747 40455.794841
6 49063.279970 49625.681573
7 57518.374517 57492.275197
8 65351.415323 65622.211763
16 126166.416498 126668.793798
24 155727.916112 155587.414299
32 180329.012019 179543.424754
40 201384.186317 200109.614362
48 222352.265595 225688.574611
56 240400.659338 254398.144976
64 253699.031075 266624.224699
72 261198.336625 270652.578322
80 264569.862257 270332.738084

That's extremely unimpressive. You (Andres) showed a much bigger
benefit on a different machine with a much higher scale factor (5000),
so I don't know whether the modest benefit here is because of the
different machine, the different scale factor, or what.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: dynahash replacement for buffer table
Date: 2015-01-27 15:46:05
Message-ID: 20150127154605.GM4655@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2015-01-27 10:36:35 -0500, Robert Haas wrote:
> On Mon, Jan 26, 2015 at 11:20 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> > This developed a slight merge conflict. I've rebased the attached
> > version, and I also took the step of getting rid of buf_table.c, as I
> > think I proposed somewhere upthread. This avoids the overhead of
> > constructing a BufferTag only to copy it into a BufferLookupEnt, plus
> > some function calls and so forth. A quick-and-dirty test suggests
> > this might not have cut down on the 1-client overhead much, but I
> > think it's worth doing anyway: it's certainly saving a few cycles, and
> > I don't think it's complicating anything measurably.
>
> So here are median-of-three results for 5-minute read-only pgbench
> runs at scale factor 1000, shared_buffers = 8GB, on hydra (POWER7) at
> various client counts.

> That's extremely unimpressive. You (Andres) showed a much bigger
> benefit on a different machine with a much higher scale factor (5000)
> so I don't know whether the modest benefit here is because of the
> different machine, the different scale factor, or what.

Based on my test on hydra some time back the bottleneck is nearly
entirely in GetSnapshotData() at higher client counts. So I'm not that
surprised you don't see the big benefit here. IIRC on hydra the results
for a shared_buffers setting large enough to give a fully cached run
at that scale are pretty close to your master results - so there's really
not much performance increase to be expected from making the buf table
lockless.

I guess you would see a slightly bigger difference at a different
shared_buffers/scale combination. IIRC a scale factor of 1000 is about
15GB? So about half of the data set fits in shared buffers. In the
scale 5k results I posted it was about 1/5.

I had also tested on a four-socket x86 machine where GetSnapshotData()
was a big bottleneck, but not the only one.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: dynahash replacement for buffer table
Date: 2015-02-02 03:59:11
Message-ID: CAA4eK1Jz7wVsXe3M3qV_5tmRfT18AmD06T9NgKHFwVBNH3LcgA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Jan 27, 2015 at 9:50 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> This developed a slight merge conflict. I've rebased the attached
> version, and I also took the step of getting rid of buf_table.c, as I
> think I proposed somewhere upthread. This avoids the overhead of
> constructing a BufferTag only to copy it into a BufferLookupEnt, plus
> some function calls and so forth. A quick-and-dirty test suggests
> this might not have cut down on the 1-client overhead much, but I
> think it's worth doing anyway: it's certainly saving a few cycles, and
> I don't think it's complicating anything measurably.
>

Performance data at some of the configurations.

Configuration and Db Details
----------------------------------------------
IBM POWER-8 24 cores, 192 hardware threads
RAM = 492GB
checkpoint_segments=300
checkpoint_timeout =25min
Client Count = number of concurrent sessions and threads (ex. -c 8 -j 8)
Duration of each individual run = 5min
Scale_factor - 5000
HEAD (commit id - 168a809d)

Below is the data (median of 3 runs) for the pgbench read-only
configuration (using -M prepared):

Shared_buffers = 8GB (tps)

Client Count       1       8      16      32      64     128     256
HEAD           17748  119106  164949  246632  216763  183177  173055
HEAD + patch   17802  119721  167422  298746  457863  422621  356756

Shared_buffers = 16GB (tps)

Client Count       1       8      16      32      64     128     256
HEAD           18139  113265  169217  270172  310936  238490  215308
HEAD + patch   17900  119960  174196  314866  448238  425916  347919

Observations as per data
--------------------------------------
a. It improves the tps by a great margin at higher client counts.
b. As we increase shared_buffers, the gain shrinks (the gain at
shared_buffers = 16GB is smaller than at shared_buffers = 8GB).

I think the patch is valuable for such loads even though it shows less
benefit at higher shared_buffers values, although we might want to
verify that it doesn't regress at configurations such as
shared_buffers = scale_factor.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com