From: Ryan Johnson <ryan(dot)johnson(at)cs(dot)utoronto(dot)ca>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP: dynahash replacement for buffer table
Date: 2014-10-16 12:03:20
Message-ID: 543FB408.6020206@cs.utoronto.ca
Lists: pgsql-hackers

On 15/10/2014 11:41 AM, Robert Haas wrote:
> On Wed, Oct 15, 2014 at 2:03 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> On 2014-10-14 17:53:10 -0400, Robert Haas wrote:
>>> On Tue, Oct 14, 2014 at 4:42 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>>>>> The code in CHashSearch shows the problem there: you need to STORE the
>>>>> hazard pointer before you begin to do the LOAD operations to scan the
>>>>> bucket, and you must finish all of those LOADs before you STORE the
>>>>> NULL hazard pointer. A read or write barrier won't provide store/load
>>>>> or load/store ordering.
>>>> I'm not sure I understand the problem here - but a read + write barrier
>>>> should suffice? To avoid falling back to two full barriers, we'd need a
>>>> separate define pg_read_write_barrier(), but that's not a problem. IIUC
>>>> that should allow us to avoid emitting any actual code on x86.
>>> Well, thinking about x86 specifically, it definitely needs at least
>>> one mfence, after setting the hazard pointer and before beginning to
>>> read the bucket chain.
>> Yes, I can see that now. I do wonder if that's not going to prove quite
>> expensive... I think I can see some ways around needing that. But I
>> think that needs some benchmarking first - no need to build an even more
>> complex solution if not needed.
>>
>> The solution I'm thinking of is to essentially move away from hazard
>> pointers and store something like a generation counter per
>> backend. Which is updated less often, and in combination with other
>> operations. When a backend needs to clean up and sees that there's a
>> straggler with an old generation it sends that backend a signal to ensure
>> it sets the latest generation.
> It's possible that might work ... but on the timescale we're talking
> about here, that's asking the garbage collecting process to wait for
> practically geologic time.
>
> Back when I first wrote this code, I spent a fair amount of time
> looking at existing work in the area of lock-free hash tables.
> Essentially all of the work I was able to find on this topic assumes a
> threaded model - or more precisely, it assumes that shared memory
> allocation is cheap and easy and you'll have no problem getting as
> much of it as you need whenever you need it. This assumption often
> isn't even spelled out explicitly: it's just assumed that that's the
> programming environment you're working in. Finding highly parallel
> algorithms that don't rely on memory allocation as a primitive is
> hard. Hazard pointers are one of the few techniques I found that
> seems like it can work in our architecture. I'm quite reluctant to
> leave aside techniques that have been proven to work well in favor of
> inventing something entirely novel to PostgreSQL.
>
> That having been said, there is some literature on generation numbers,
> and I think something like that could be made to work. It does have
> some significant disadvantages, though. One is that a single process
> which fails to update its generation number prevents all reclamation,
> system-wide. In my algorithm, a process whose execution is
> suspended while a hazard pointer is set prevents recycling of only one
> of many garbage lists. A process searching for a reusable element can
> most likely find some other garbage list to reclaim instead. Also,
> a generation number implies that we update the value periodically,
> rather than before and after each critical section. Anything that
> forces garbage collection to be postponed longer than absolutely
> necessary seems to me likely to be a loser. It might be a little
> faster as long as we have free elements to allocate, but as soon as
> we're out and have to wait longer than we otherwise would for garbage
> collection, and all system activity halts as a result, even for a few
> milliseconds, it's going to be a whole lot slower. Or at least, I
> think.
>
> That having been said, I don't know what to do about the fact that the
> fence is too expensive. I don't know that we've really established
> that that's the true root of the problem rather than some other
> pedestrian optimization failure. But the existing code is using an
> atomic operation to acquire a spinlock, then releasing it, walking the
> bucket chain, and then using another atomic operation to acquire a
> spinlock and then releasing it again. Surely a pure fence shouldn't
> cost more than a spinlock cycle? Even with arguably one extra cache
> line touch, that seems like it ought to be a win. But my intuitions
> in this area are shaky.
Why not use an RCU mechanism [1] and ditch the hazard pointers? Seems
like an ideal fit...

In brief, RCU has the following requirements:

* Read-heavy access pattern
* Writers must be able to make dead objects unreachable to new readers
(easily done for most data structures)
* Writers must be able to mark dead objects in such a way that
existing readers know to ignore their contents but can still
traverse the data structure properly (usually straightforward)
* Readers must occasionally inform the system that they are not
currently using any RCU-protected pointers (to allow resource
reclamation)

[1] http://www.rdrop.com/~paulmck/RCU/

If the above are satisfied, then:

* Readers need no synchronization at all (not even compiler fences)
* Writers can use the synchronization mechanism of their choice to
coordinate with each other (lwlock, atomic CAS, etc.)
* Object reclamation can be performed in the background, or
  synchronously (and incrementally, if desired)

I've had very good results implementing user-level RCU for my own
database projects. It slashes the complexity of reasoning about
concurrent reader accesses. Implemented properly, it requires only a bit
of shared memory, has extremely low overhead in the common case of no
stragglers, and requires only minimal communication except when
stragglers are present (and even then mostly between stragglers). It's
reasonably simple to implement in a userspace DBMS context (<1 kLoC with
comments, assertions, etc.). Just don't forget to quiesce all in-flight
backends at regular intervals, or the system won't be able to reclaim
anything...

Thoughts?
Ryan
