Re: Highly Efficient Custom Sorting

From: Eliot Gable <egable+pgsql-performance(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Merlin Moncure <mmoncure(at)gmail(dot)com>, Craig Ringer <craig(at)postnewspapers(dot)com(dot)au>, Craig James <craig_james(at)emolecules(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Highly Efficient Custom Sorting
Date: 2010-07-06 19:42:14
Message-ID: AANLkTikLPKC8ebtLs6qxR85A4aPnH5klOwljROl4D2eu@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

On Tue, Jul 6, 2010 at 3:01 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> On Sat, Jul 3, 2010 at 4:17 PM, Eliot Gable
> <egable+pgsql-performance(at)gmail(dot)com <egable%2Bpgsql-performance(at)gmail(dot)com>>
> wrote:
> > Read RFC 2782 on random weighted load balancing of SRV records inside
> DNS.
>
> It may be asking a bit much to expect people here to read an RFC to
> figure out how to help you solve this problem, but...
>
>
Yeah, I was not actually expecting them to read the whole RFC. The section
on random weighted load balancing is only a few paragraphs and describes
just the algorithm I am trying to implement as efficiently as possible:

Priority
The priority of this target host. A client MUST attempt to
contact the target host with the lowest-numbered priority it can
reach; target hosts with the same priority SHOULD be tried in an
order defined by the weight field. The range is 0-65535. This
is a 16 bit unsigned integer in network byte order.

Weight
A server selection mechanism. The weight field specifies a
relative weight for entries with the same priority. Larger
weights SHOULD be given a proportionately higher probability of
being selected. The range of this number is 0-65535. This is a
16 bit unsigned integer in network byte order. Domain
administrators SHOULD use Weight 0 when there isn't any server
selection to do, to make the RR easier to read for humans (less
noisy). In the presence of records containing weights greater
than 0, records with weight 0 should have a very small chance of
being selected.

In the absence of a protocol whose specification calls for the
use of other weighting information, a client arranges the SRV
RRs of the same Priority in the order in which target hosts,
specified by the SRV RRs, will be contacted. The following
algorithm SHOULD be used to order the SRV RRs of the same
priority:

To select a target to be contacted next, arrange all SRV RRs
(that have not been ordered yet) in any order, except that all
those with weight 0 are placed at the beginning of the list.

Compute the sum of the weights of those RRs, and with each RR
associate the running sum in the selected order. Then choose a
uniform random number between 0 and the sum computed
(inclusive), and select the RR whose running sum value is the
first in the selected order which is greater than or equal to
the random number selected. The target host specified in the
selected SRV RR is the next one to be contacted by the client.
Remove this SRV RR from the set of the unordered SRV RRs and
apply the described algorithm to the unordered SRV RRs to select
the next target host. Continue the ordering process until there
are no unordered SRV RRs. This process is repeated for each
Priority.

The difference between this description and my implementation is that I have
added a "group" field to the mix so that this algorithm is applied to each
group independently of the others. Also, my input data has an "id" field
which must be present on the same rows of the output and is used to map the
output back to my original input.

> ...there's no question that writing things in C is a lot more work,
> and takes some getting used to. Still, it's fast, so maybe worth it,
> especially since you already know C++, and will therefore mostly just
> need to learn the PostgreSQL coding conventions. The best thing to do
> is probably to look at some of the existing examples within the
> backend code. Most of the datatype code is in src/backend/utils/adt.
> You might want to look at arrayfuncs.c (perhaps array_ref() or
> array_map()); and also rowtypes.c (perhaps record_cmp()).
>
>
I did actually find the arrayfuncs.c file and start looking through it for
examples. I'm just not entirely clear on what is going on in some of those
functions -- what is necessary to keep in order to extract my data and get
it represented in C structures and what I can remove. I was hoping there was
some sort of documentation on how to work with input arrays for extracting
the data and getting it converted. In any event, I have spent several hours
reverse engineering how that stuff works, and I think I am pretty close to
being able to get my data into a C structure that I can work with.

> > Specifically, how do I go about preserving the pointers to the data that
> I
> > allocate in multi-call memory context so that they still point to the
> data
> > on the next call to the function for the next result row? Am I supposed
> to
> > set up some global variables to do that, or am I supposed to take a
> > different approach? If I need to use global variables, then how do I deal
> > with concurrency?
>
> Global variables would be a bad idea, not so much because of
> concurrency as because they won't get cleaned up properly. Again, the
> best thing to do is to look at existing examples, like array_unnest()
> in src/backend/utils/adt/arrayfuncs.c; the short answer is that you
> probably want to compute all your results on the first call and stash
> them in the FuncCallContext (funcctx->user_fctx); and then on
> subsequent calls just return one row per call.
>

Thanks for suggesting array_unnest(). I think that will actually prove more
useful to me than the other example I'm using for extracting my data from an
array. I was actually planning on computing the order on the first call and
storing it in a linked list which gets returned one item at a time until all
rows have been returned. Also, I found a code example using Google that
showed someone storing data across function calls using that pointer. I used
their example to produce this:

<snip>
if(SRF_IS_FIRSTCALL()) {
funcctx = SRF_FIRSTCALL_INIT();

/* This is where we stick or sorted data for returning later */
funcctx->user_fctx =
MemoryContextAlloc(funcctx->multi_call_memory_ctx, sizeof(sort_data));
oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
data = (sort_data*) funcctx->user_fctx;
</snip>

I have a structure set up that is typedef'd to "sort_data" which stores
pointers to various things that I need to survive across the calls. Since
this seems to be what you are suggesting, I assume this is the correct
approach.

--
Eliot Gable

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Joe Conway 2010-07-06 20:00:33 Re: Highly Efficient Custom Sorting
Previous Message Robert Haas 2010-07-06 19:30:29 Re: Question about partitioned query behavior