Re: [PATCH] pgbench --throttle (submission 7 - with lag measurement)

From: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
To: Robert Haas <robertmhaas(at)gmail(dot)com>, Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PATCH] pgbench --throttle (submission 7 - with lag measurement)
Date: 2013-06-22 16:54:40
Message-ID: alpine.DEB.2.02.1306221634211.23902@localhost6.localdomain6
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers


Dear Robert and Greg,

> I think so. If it doesn't get fixed now, it's not likely to get fixed
> later. And the fact that nobody understands why it's happening is
> kinda worrisome...

Possibly, but I thing that it is not my fault:-)

So, I spent some time at performance debugging...

First, I finally succeeded in reproducing Greg Smith spikes on my ubuntu
laptop. It needs short transactions and many clients per thread so as to
be a spike. With "pgbench" standard transactions and not too many clients
per thread it is more of a little bump, or even it is not there at all.

After some poking around, and pursuing various red herrings, I resorted to
measure the delay for calling "PQfinish()", which is really the only
special thing going around at the end of pgbench run...

BINGO!

In his tests Greg is using one thread. Once the overall timer is exceeded,
clients start closing their connections by calling PQfinish once their
transactions are done. This call takes between a few us and a few ... ms.
So if some client closing time hit the bad figures, the transactions of
other clients are artificially delayed by this time, and it seems they
have a high latency, but it is really because the thread is in another
client's PQfinish and was not available to process the data. If you have
one thread per client, no problem, especially as the final PQfinish() time
is not measured at all by pgbench:-) Also, the more clients, the higher
the spike because more are to be stopped and may hit the bad figures.

Here is a trace, with the simple SELECT transaction.

sh> ./pgbench --rate=14000 -T 10 -r -l -c 30 -S bench
...

sh> less pgbench_log.*

# short transactions, about 0.250 ms
...
20 4849 241 0 1371916583.455400
21 4844 256 0 1371916583.455470
22 4832 348 0 1371916583.455569
23 4829 218 0 1371916583.455627

** TIMER EXCEEDED **

25 4722 390 0 1371916583.455820
25 done in 1560 [1,2] # BING, 1560 us for PQfinish()
26 4557 1969 0 1371916583.457407
26 done in 21 [1,2]
27 4372 1969 0 1371916583.457447
27 done in 19 [1,2]
28 4009 1910 0 1371916583.457486
28 done in 1445 [1,2] # BOUM
2 interrupted in 1300 [0,0] # BANG
3 interrupted in 15 [0,0]
4 interrupted in 40 [0,0]
5 interrupted in 203 [0,0] # boum?
6 interrupted in 1352 [0,0] # ZIP
7 interrupted in 18 [0,0]
...
23 interrupted in 12 [0,0]

## the cumulated PQfinish() time above is about 6 ms which
## appears as an apparent latency for the next clients:

0 4880 6521 0 1371916583.462157
0 done in 9 [1,2]
1 4877 6397 0 1371916583.462194
1 done in 9 [1,2]
24 4807 6796 0 1371916583.462217
24 done in 9 [1,2]
...

Note that the bad measures also appear when there is no throttling:

sh> ./pgbench -T 10 -r -l -c 30 -S bench
sh> grep 'done.*[0-9][0-9][0-9]' pgbench_log.*
0 done in 1974 [1,2]
1 done in 312 [1,2]
3 done in 2159 [1,2]
7 done in 409 [1,2]
11 done in 393 [1,2]
12 done in 2212 [1,2]
13 done in 1458 [1,2]
## other clients execute PQfinish in less than 100 us

This "done" is issued by my instrumented version of clientDone().

The issue does also appear if I instrument pgbench from master, without
anything from the throttling patch at all:

sh> git diff master
diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index 1303217..7c5ea81 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -869,7 +869,15 @@ clientDone(CState *st, bool ok)

if (st->con != NULL)
{
+ instr_time now;
+ int64 s0, s1;
+ INSTR_TIME_SET_CURRENT(now);
+ s0 = INSTR_TIME_GET_MICROSEC(now);
PQfinish(st->con);
+ INSTR_TIME_SET_CURRENT(now);
+ s1 = INSTR_TIME_GET_MICROSEC(now);
+ fprintf(stderr, "%d done in %ld [%d,%d]\n",
+ st->id, s1-s0, st->listen, st->state);
st->con = NULL;
}
return false; /* always false */

sh> ./pgbench -T 10 -r -l -c 30 -S bench 2> x.err

sh> grep 'done.*[0-9][0-9][0-9]' x.err
14 done in 1985 [1,2]
16 done in 276 [1,2]
17 done in 1418 [1,2]

So my argumented conclusion is that the issue is somewhere within
PQfinish(), possibly in interaction with pgbench doings, but is *NOT*
related in any way to the throttling patch, as it is preexisting it. Gregs
stumbled upon it because he looked at latencies.

I'll submit a slightly improved v12 shortly.

--
Fabien.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alvaro Herrera 2013-06-22 17:19:58 Re: Support for REINDEX CONCURRENTLY
Previous Message Simon Riggs 2013-06-22 16:19:49 Re: A better way than tweaking NTUP_PER_BUCKET