Desirable pgbench features?

Lists: pgsql-hackers
From: Fabien <coelho(at)cri(dot)ensmp(dot)fr>
To: PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Desirable pgbench features?
Date: 2016-03-30 15:29:34
Message-ID: alpine.DEB.2.10.1603301618570.5677@sto
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


Hello pgdevs,

I've been having a look at why pgbench only implements a TPC-B "like" benchmark,
and not the full, although obsolete, TPC-B.

I'm not particulary interested in running TPC-B per se, although I like
the principle of a simple steady-state update-intensive OLTP stress test,
but more at what relevant capabilities are missing in pgbench to implement
a simple test.

I also had a brief look at the TPC-C benchmark, but it is several order of
magnitudes more complex, and there is an open source implementation available
(TPC-C-Uva).

(0) TPC-B assumes signed integers which can hold 10 figures, which means
that currently used INTs are not enough, it should be INT8. Quite easy
to fix, or extend to keep some upward compatibility with the previous
version.

(1) TPC-B test driver must obtain a value from a query (the branch is the one
of the chosen teller, not any random branch) and reuse it in another
query. Currently the communication is one way, all results are silently
discarded.

This is not representative of client applications (say a web app) which
interact with the server with some logic of their own, including reading
things and writing others depending on the previous reading.

This may be simulated on server side with a plpgsql script, but that
would not exercise the client/server protocol logic and its performance
impact, so I think that this simple read capability is important and
missing.

(2) There is an "if" required in TPC-B with a boolean condition when
selecting an account: 85% of the time the account does an operation in
its own branch, 15% in another branch.

(3) More binary operators would be useful, eg | is used by TPC-C for
building a non uniform random generation from a uniform random one
(although pgbench has clean exponential and gaussian randoms, which
still lack some shuffling, though).

(4) Weighted transaction scripts are used by TPC-C, some of which can be
voluntarily aborted. This already works with recently added weights
and using ROLLBACK instead of END in a given script. Nothing to do.

(5) Consistency check: after a run, some properties are expected to be
true, such as the balances of branches is the balance of its
tellers and also of its accounts... This should/could be checked,
maybe with an additional query.

I think that these capabilities are useful features for composing a
reasonable bench, and allowing pgbench to do some of these, while
remaining at a basic expression level (i.e. not creating a full
client-side language, not my intent in any way), should be ok if the
syntax, code and performance impacts are small.

Indeed, some of the missing features can be included, probably at low cost
in pgbench:

* better expressions: comparisons, condition, binary operators...

Comparisons (say <= < == > >= != operators) and an if() function for (2),
on the client side, could look like that:

\set abid if(random(0, 99) < 85, expression-1, expression-2)

This is pretty easy to implement with the current function infrastructure,
as well as new operators such as |&^! (3).

Note: the "?:" C-like syntax looks attractive but would probably interact
quite badly with the existing ":variable" syntax; moreover the ?: syntax
suggests a short-circuit evaluation, but pgbench expressions are fully
evaluated. These operators & function would probably require around 100
lines of pretty basic code, plus doc.

* using values from a query

For this use case (1), the best syntax and implementation is unclear. In
particular, I'm not fond of the \gset syntax used in psql because the ';'
is dropped and the \gset seems to act as a statement terminator.

After giving it some thought, I would suggest a simple two-line explicit
syntax compatible with current conventions, with a SELECT statement
terminated with a ';', on one side and where to put the results on the
other, something like:

SELECT ... ;
\into some variable names

Or maybe in the other way around:

\setsql some variable names
SELECT ... ;

The variable names provided could be stored in the command structure of
the SELECT and they would be assigned when the query is executed.

Among the possible constraints, enforced or not, the variable types should
be int or double, the query should be a select, there should be one row
only (or keep the first and set to zero if nothing?), the number of
variables should be less than the number of columns...

* shuffling function, i.e. a parametric permutation so that non uniform
randoms can be shuffled so that rows with close pkey do not have close
drawing probabilities. This is non trivial, basically some kind of cypher
function but which does not operate on power-of-two sizes...

* if all necessary features are available, being able to run a strict
tpc-b bench (this mean adapting the init phase, and creating a
new builtin script which matches the real spec): no big deal.

Any thoughts on these points? Especially the query to variable syntax?

--
Fabien.


From: Josh berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Desirable pgbench features?
Date: 2016-03-30 17:01:18
Message-ID: 56FC065E.9040908@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 03/30/2016 08:29 AM, Fabien wrote:

> (1) TPC-B test driver must obtain a value from a query (the branch is
> the one
> of the chosen teller, not any random branch) and reuse it in another
> query. Currently the communication is one way, all results are silently
> discarded.
>
> This is not representative of client applications (say a web app) which
> interact with the server with some logic of their own, including
> reading
> things and writing others depending on the previous reading.
>
> This may be simulated on server side with a plpgsql script, but that
> would not exercise the client/server protocol logic and its performance
> impact, so I think that this simple read capability is important and
> missing.

Yes. Particularly, one of the things I'd like to benchmark is
load-balancing between masters and replicas, including checks for
coherency. Without being able to retrieve and reuse values, this can't
be tested.

The simplest way I'd see doing this is being able to SELECT INTO a
pgbench variable.

> (5) Consistency check: after a run, some properties are expected to be
> true, such as the balances of branches is the balance of its
> tellers and also of its accounts... This should/could be checked,
> maybe with an additional query.

I'd also love to have a consistency check which would be client-only
which I could run in the pgbench unit itself. That is, a way to log
"errors" if, say, two variables were not equal at the end of the unit of
work.

An example of this would be using this to test if load-balanced
connections were getting "stale reads", especially since the
*percentage* of stale reads is what I want to know. 5% is acceptable,
50% is not.

> * using values from a query
>
> For this use case (1), the best syntax and implementation is unclear. In
> particular, I'm not fond of the \gset syntax used in psql because the ';'
> is dropped and the \gset seems to act as a statement terminator.
>
> After giving it some thought, I would suggest a simple two-line explicit
> syntax compatible with current conventions, with a SELECT statement
> terminated with a ';', on one side and where to put the results on the
> other, something like:
>
> SELECT ... ;
> \into some variable names

This works for me if it works for the parser.

>
> Or maybe in the other way around:
>
> \setsql some variable names
> SELECT ... ;

This also works, but is not my preference. It would be somewhat harder
to avoid variable/column mismatches.

One more wishlist item, which would make my request above for unit tests
unnecessary:

* Allow custom logging:

\vlog TAG varname1, varname2

Which would produce a custom log file called:

PID.TAG.varlog

With the format:

timestamp, var1, var2

e.g. if I had this:

SELECT id, abalance FROM account WHERE id = :aid
\into :lid, :lbal

\vlog balancelog :lid, :lbal

It would create a file called:

2247.balancelog.varlog

and/or append a line:

2016-03-30 21:37:33.899, 511, 2150

This would allow CSV logging of all sorts of user custom information,
including de-facto response times.

--
--
Josh Berkus
Red Hat OSAS
(any opinions are my own)