Re: Sequence Access Method WIP

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Petr Jelinek <petr(at)2ndquadrant(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Simon Riggs <simon(at)2ndQuadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Sequence Access Method WIP
Date: 2014-11-04 12:11:04
Message-ID: 5458C258.802@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 10/13/2014 01:01 PM, Petr Jelinek wrote:
> Hi,
>
> I rewrote the patch with different API along the lines of what was
> discussed.

Thanks, that's better.

It would be good to see an alternative seqam to implement this API, to
see how it really works. The "local" one is too dummy to expose any
possible issues.

> The API now consists of following functions:
> sequence_alloc - allocating range of new values
> The function receives the sequence relation, current value, number of
> requested values amdata and relevant sequence options like min/max and
> returns new amdata, new current value, number of values allocated and
> also if it needs wal write (that should be returned if amdata has
> changed plus other reasons the AM might have to force the wal update).
>
> sequence_setval - notification that setval is happening
> This function gets sequence relation, previous value and new value plus
> the amdata and returns amdata (I can imagine some complex sequence AMs
> will want to throw error that setval can't be done on them).
>
> sequence_request_update/sequence_update - used for background processing
> Basically AM can call the sequence_request_update and backend will then
> call the sequence_update method of an AM with current amdata and will
> write the updated amdata to disk
>
> sequence_seqparams - function to process/validate the standard sequence
> options like start position, min/max, increment by etc by the AM, it's
> called in addition to the standard processing
>
> sequence_reloptions - this is the only thing that remained unchanged
> from previous patch, it's meant to pass custom options to the AM
>
> Only the alloc and reloptions methods are required (and implemented by
> the local AM).
>
> The caching, xlog writing, updating the page, etc is handled by backend,
> the AM does not see the tuple at all. I decided to not pass even the
> struct around and just pass the relevant options because I think if we
> want to abstract the storage properly then the AM should not care about
> how the pg_sequence looks like at all, even if it means that the
> sequence_alloc parameter list is bit long.

Hmm. The division of labour between the seqam and commands/sequence.c
still feels a bit funny. sequence.c keeps track of how many values have
been WAL-logged, and thus usable immediately, but we still call
sequence_alloc even when using up those already WAL-logged values. If
you think of using this for something like a centralized sequence server
in a replication cluster, you certainly don't want to make a call to the
remote server for every value - you'll want to cache them.

With the "local" seqam, there are two levels of caching. Each backend
caches some values (per the CACHE <value> option in CREATE SEQUENCE). In
addition to that, the server WAL-logs 32 values at a time. If you have a
remote seqam, it would most likely add a third cache, but it would
interact in strange ways with the second cache.

Considering a non-local seqam, the locking is also a bit strange. The
server keeps the sequence page locked throughout nextval(). But if the
actual state of the sequence is maintained elsewhere, there's no need to
serialize the calls to the remote allocator, i.e. the sequence_alloc()
calls.

I'm not exactly sure what to do about that. One option is to completely
move the maintenance of the "current" value, i.e. sequence.last_value,
to the seqam. That makes sense from an abstraction point of view. For
example with a remote server managing the sequence, storing the
"current" value in the local catalog table makes no sense as it's always
going to be out-of-date. The local seqam would store it as part of the
am-private data. However, you would need to move the responsibility of
locking and WAL-logging to the seqam. Maybe that's OK, but we'll need to
provide an API that the seqam can call to do that. Perhaps just let the
seqam call heap_inplace_update on the sequence relation.

> For the amdata handling (which is the AM's private data variable) the
> API assumes that (Datum) 0 is NULL, this seems to work well for
> reloptions so should work here also and it simplifies things a little
> compared to passing pointers to pointers around and making sure
> everything is allocated, etc.
>
> Sadly the fact that amdata is not fixed size and can be NULL made the
> page updates of the sequence relation quite more complex that it used to
> be.

It would be nice if the seqam could define exactly the columns it needs,
with any datatypes. There would be a set of common attributes:
sequence_name, start_value, cache_value, increment_by, max_value,
min_value, is_cycled. The local seqam would add "last_value", "log_cnt"
and "is_called" to that. A remote seqam that calls out to some other
server might store the remote server's hostname etc.

There could be a seqam function that returns a TupleDesc with the
required columns, for example.

- Heikki

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alvaro Herrera 2014-11-04 13:01:00 Re: tracking commit timestamps
Previous Message Alexey Vasiliev 2014-11-04 11:03:46 Re[2]: [HACKERS] Patch: add recovery_timeout option to control timeout of restore_command nonzero status code