Re: [PATCH] pgbench --throttle (submission 7 - with lag measurement)

From: Greg Smith <greg(at)2ndQuadrant(dot)com>
To: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
Cc: PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PATCH] pgbench --throttle (submission 7 - with lag measurement)
Date: 2013-06-19 01:17:49
Message-ID: 51C106BD.2090407@2ndQuadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I'm still getting the same sort of pauses waiting for input with your
v11. This is a pretty frustrating problem; I've spent about two days so
far trying to narrow down how it happens. I've attached the test
program I'm using. It seems related to my picking a throttled rate
that's close to (but below) the maximum possible on your system. I'm
using 10,000 on a system that can do about 16,000 TPS when running
pgbench in debug mode.

This problem is 100% reproducible here; happens every time. This is a
laptop running Mac OS X. It's possible the problem is specific to that
platform. I'm doing all my tests with the database itself setup for
development, with debug and assertions on. The lag spikes seem smaller
without assertions on, but they are still there.

Here's a sample:

transaction type: SELECT only
scaling factor: 10
query mode: simple
number of clients: 25
number of threads: 1
duration: 30 s
number of transactions actually processed: 301921
average transaction lag: 1.133 ms (max 137.683 ms)
tps = 10011.527543 (including connections establishing)
tps = 10027.834189 (excluding connections establishing)

And those slow ones are all at the end of the latency log file, as shown
in column 3 here:

22 11953 3369 0 1371578126 954881
23 11926 3370 0 1371578126 954918
3 12238 30310 0 1371578126 984634
7 12205 30350 0 1371578126 984742
8 12207 30359 0 1371578126 984792
11 12176 30325 0 1371578126 984837
13 12074 30292 0 1371578126 984882
0 12288 175452 0 1371578127 126340
9 12194 171948 0 1371578127 126421
12 12139 171915 0 1371578127 126466
24 11876 175657 0 1371578127 126507

Note that there are two long pauses here, which happens maybe half of
the time. There's a 27 ms pause and then a 145 ms one.

The exact sequence for when the pauses show up is:

-All remaining clients have sent their SELECT statement and are waiting
for a response. They are not sleeping, they're waiting for the server
to return a query result.
-A client receives part of the data it wants, but there is still data
left. This is the path through pgbench.c where the "if
(PQisBusy(st->con))" test is true after receiving some information. I
hacked up some logging that distinguishes those as a "client %d partial
receive" to make this visible.
-A select() call is made with no timeout, so it can only be satisfied by
more data being returned.
-Around ~100ms (can be less, can be more) goes by before that select()
returns more data to the client, where normally it would happen in ~2ms.

You were concerned about a possible bug in the timeout code. If you
hack up the select statement to show some state information, the setup
for the pauses at the end always looks like this:

calling select min_usec=9223372036854775807 sleeping=0

When no one is sleeping, the timeout becomes infinite, so only returning
data will break it. This is intended behavior though. This exact same
sequence and select() call parameters happen in pgbench code every time
at the end of a run. But clients without the throttling patch applied
never seem to get into the state where they pause for so long, waiting
for one of the active clients to receive the next bit of result.

I don't think the st->listen related code has anything to do with this
either. That flag is only used to track when clients have completed
sending their first query over to the server. Once reaching that point
once, afterward they should be "listening" for results each time they
exit the doCustom() code. st->listen goes to 1 very soon after startup
and then it stays there, and that logic seems fine too.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com

Attachment Content-Type Size
test text/plain 194 bytes

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Eisentraut 2013-06-19 02:28:23 Re: event trigger API documentation?
Previous Message Jim Nasby 2013-06-19 01:11:42 Re: Vacuum, Freeze and Analyze: the big picture