Re: gaussian distribution pgbench

From: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
To: KONDO Mitsumasa <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: gaussian distribution pgbench
Date: 2014-02-09 12:32:54
Message-ID: alpine.DEB.2.10.1402091328110.1388@sto
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers


Hello,

> I revise my gaussian pgbench patch which wss requested from community.

With a lot of delay for which I apologise, please find hereafter the
review.

Gaussian Pgbench v3 patch by Mitsumasa KONDO review

* The purpose of the patch is to allow a pgbench script to draw from normally
distributed integer values instead of uniformly distributed.

This is a valuable contribution to enable pgbench to generate more realistic
loads, which is seldom uniform in practice.

However, ISTM that other distributions such an exponantial one would make
more sense, and also the values should be further randomized so that
neighboring values are not more likely to be drawn. The latest point is non
trivial.

* Compilation

The patch applies and compiles against current head. It works as expected,
although there is few feedback from the script to show that.

* Mathematical soundness

We want to derive a discrete normal distribution from a uniform one.
Well, normal distributions are for continuous variables... Anyway, this is
done by computing a continuous normal distribution which is then projected
onto integers. I'm basically fine with that.

The system uses a Box-Muller transform (1958) to do this transformation.
The Ziggurat method seems to be prefered for this purpose, *but* it would
require precalculated tables which depends on the target values. So I'm
fine with the Box-Muller transform for pgbench.

The BM method uses 2 uniformly distributed numbers to derive 2 normally
distributed numbers. The implementation computes one of these, and loops
over till one match a threshold criterion.

More explanations, at least in comments, are needed about this threshold
and its meaning. It is required to be more than 2. I guess is that it allows
to limit the number of iterations of the while loop, but in what proportion
is unclear. The documentation does not also help the user to understand
this value and its meaning.

What I think it is: it is the deviation for the FURTHEST point around the
mean, that is the actual deviation associated to the "min" and "max" target
values. The 2 minimum value induces that there is a least 4 stddev lengths
between min & max, with the most likely mean in the middle.

If the threshold test fails, one of the 2 uniform number is redrawn, a new
candidate value is tested. I'm not at ease about why only 1 value is redrawn
and not both, some explanations would be welcome. Also, on the other hand,
why not test the other possible value (with cos) if the first one fails?

Also, as suggested above, I would like some explanations about how much this
while loop may iterate without success, say with the expected average number
of iterations with its explanation in a comment.

* Implementation

Random values :
double rand1 = 1.0 - rand; // instead of the LONG_MAX computation & limits.h
rand2 should be in (0, 1], but it is in [0, 1), use "1.0 - ..." as well?!

What is called "stdev*" in getrand() is really the chosen deviation from
the target mean, so it would make more sense to name it "dev".

I do not think that the getrand refactoring was such a good idea. I'm sorry
if I may have suggested that in a previous comment.
The new getrand possibly ignores its parameters, hmmmm. ISTM that it would
be much simpler in the code to have a separate and clean "getrand_normal"
or "getrand_gauss" called for "\setgaussian", and that's it. This would
allow to get rid of DistType and all of getrand changes in the code.

There are heavy constants computations (sqrt(log()) within the while
loop which would be moved out of the loop.

ISTM that the while condition would be easier to read as:

while ( dev < - threshold || threshold < dev )

Maybe the \\setgaussian argument handling may be transformed into a function,
so that it could be used easily later for some other distribution (say some
setexp:-)

* Options

ISTM that the test options would be better if made orthogonal, i.e. not to
have three --gaussian* options. I would suggest to have only one
--gaussian=NUM which would trigger gaussian tests with this threshold,
and --gaussian=3.5 --select-only would use the select-only variant,
and so on.

* Typos

gausian -> gaussian
patern -> pattern

* Conclusion :

- this is a valuable patch to help create more realistic load and make pgbench
a more useful tool. I'm greatly in favor of having such a functionality.

- it seems to me that the patch should be further improved before being
committed, in particular I would suggest:

(1) improve the explanations in the code and in the documentation, especially
about what is the "deviation threshold" and its precise link to generated
values.

(2) simplify the code with a separate gaussian getrand, and simpler or
more efficient code here and there, see comments above.

(3) use only one option to trigger gaussian tests.

(bonus) \setexp would be a nice:-)

--
Fabien.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andrew Dunstan 2014-02-09 13:10:45 Re: narwhal and PGDLLIMPORT
Previous Message Magnus Hagander 2014-02-09 12:17:48 Terminating pg_basebackup background streamer