Re: gaussian distribution pgbench -- splits v4

From: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Mitsumasa KONDO <kondo(dot)mitsumasa(at)gmail(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: gaussian distribution pgbench -- splits v4
Date: 2014-07-31 14:01:46
Message-ID: alpine.DEB.2.10.1407311534120.12755@sto
Lists: pgsql-hackers


Hello Robert,

[...]

> One of the concerns that I have about the proposal of simply slapping a
> gaussian or exponential modifier onto \setrandom aid 1 :naccounts is
> that, while it will allow you to make part of the relation hot and
> another part of the relation cold, you really can't get any more
> fine-grained than that. If you use exponential, all the hot accounts
> will be near the beginning of the relation, and if you use gaussian,
> they'll all be in the middle.

That is a very good remark. Although I have thought about it, I do not have a
very good solution yet :-)

From a testing perspective, if we assume that keys have no semantics, it is
reasonable to expect that the access distribution of an actual realistic
workload is exponential (or gaussian; in any case hardly uniform), but without
any direct correlation between key values and access frequency.

In order to simulate that, we would have to apply a fixed (pseudo-)random
permutation to the exponentially-drawn key values. This is a non-trivial
problem. Version zero of solving it is to do nothing... which is the current
status ;-) Version one is "k' = 1 + (a * k + b) modulo n", with "a" coprime to
"n", "n" being the number of keys. This is nearly possible already; only the
modulo operator is currently missing from pgbench expressions, and I'm
planning to submit it for this very reason, but probably another time.
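To illustrate the idea, here is a quick C sketch (not pgbench code, and the
constants are just made up for the example): it draws keys in 1..N from a
truncated exponential distribution, then scatters them with the fixed affine
permutation above, which is a bijection over 1..N because "a" is coprime to
"n".

  /*
   * Sketch only: exponentially-drawn keys, then a fixed affine permutation
   * k' = 1 + (A*k + B) mod N, so the hot keys are spread over the whole
   * key range instead of clustering near key 1.
   */
  #include <stdio.h>
  #include <stdlib.h>
  #include <math.h>

  #define N 1000000LL     /* number of keys, e.g. naccounts */
  #define A 982451653LL   /* multiplier, coprime to N */
  #define B 12345LL       /* arbitrary offset */

  /* draw a key in 1..N, larger keys exponentially less likely */
  static long long
  draw_exponential(double threshold)
  {
      double u = (double) rand() / RAND_MAX;
      double x = -log(1.0 - u * (1.0 - exp(-threshold))) / threshold;
      long long k = (long long) (x * N) + 1;
      return k > N ? N : k;
  }

  /* fixed pseudo-random permutation of 1..N */
  static long long
  permute(long long k)
  {
      return 1 + (A * k + B) % N;
  }

  int
  main(void)
  {
      for (int i = 0; i < 10; i++)
      {
          long long k = draw_exponential(5.0);
          printf("drawn %lld -> permuted %lld\n", k, permute(k));
      }
      return 0;
  }

With a modulo operator available, the permutation itself could be written
directly in a pgbench script instead of in C.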

> I'm not sure exactly what will happen after some updating has happened; I'm
> guessing some of the keys will still be in their original location and
> others will have been pushed to the end of the relation following
> relation-extension.

This side of it is not too bad. What matters most in the long term is not the
key-value correlation, but the actual storage correlation, i.e. whether two
required tuples are in the same page or not. At the beginning of a run, with
close key numbers being picked by an exponential distribution, the correlation
is higher than what would be expected. However, once a significant amount of
the table has been updated, this initial artificial correlation fades, and the
test should become more realistic.

--
Fabien.
