Re: [PATCH] pgbench --throttle (submission 7 - with lag measurement)

From: Greg Smith <greg(at)2ndQuadrant(dot)com>
To: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PATCH] pgbench --throttle (submission 7 - with lag measurement)
Date: 2013-06-29 23:11:21
Message-ID: 51CF6999.4060103@2ndQuadrant.com
Lists: pgsql-hackers

On 6/22/13 12:54 PM, Fabien COELHO wrote:
> After some poking around, and pursuing various red herrings, I resorted
> to measure the delay for calling "PQfinish()", which is really the only
> special thing going around at the end of pgbench run...

This wasn't what I was seeing, but it's related. I've proved to myself
the throttle change isn't responsible for the weird stuff I'm seeing now.
I'd like to rearrange when PQfinish happens now based on what I'm
seeing, but that's not related to this review.

I duplicated the PQfinish problem you found too. On my Linux system,
calls to PQfinish are normally about 36 us long. They will sometimes
get lost for >15ms before they return. That's a different problem
though, because the ones I'm seeing on my Mac are sometimes >150ms.
PQfinish never takes quite that long.

PQfinish doesn't pause for a long time on this platform. But it does
*something* that causes socket select() polling to stutter. I have
instrumented everything interesting in this part of the pgbench code,
and here is the problem event.

1372531862.062236 select with no timeout sleeping=0
1372531862.109111 select returned 6 sockets latency 46875 us

Here select() is called with 0 sleeping processes, 11 that are done, and
14 that are running. The running ones have all sent SELECT statements
to the server, and they are waiting for a response. Some of them
received some data from the server, but they haven't gotten the entire
response back. (The PQfinish calls could be involved in how that happened)
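
For anyone who wants to reproduce this kind of trace, here is a minimal
sketch of the sort of timing wrapper that could produce the two lines
above. It is not the instrumentation actually added to pgbench; the names
elapsed_us and timed_select are hypothetical, and the log format is only
copied from the trace. It assumes gettimeofday() timestamps and a
blocking select() on the read set only.

```c
#include <stdio.h>
#include <sys/select.h>
#include <sys/time.h>

/* Microseconds elapsed between two gettimeofday() readings. */
static long long
elapsed_us(const struct timeval *start, const struct timeval *end)
{
    return (long long) (end->tv_sec - start->tv_sec) * 1000000LL
        + (long long) (end->tv_usec - start->tv_usec);
}

/*
 * Hypothetical wrapper (not pgbench code): block in select() with no
 * timeout, log a timestamped line before and after the call, and report
 * how long the call took.  latency_us may be NULL.
 */
static int
timed_select(int maxfd, fd_set *readfds, int nsleeping, long long *latency_us)
{
    struct timeval before;
    struct timeval after;
    long long   us;
    int         nready;

    gettimeofday(&before, NULL);
    fprintf(stderr, "%ld.%06ld select with no timeout sleeping=%d\n",
            (long) before.tv_sec, (long) before.tv_usec, nsleeping);

    /* NULL timeout: wait until at least one descriptor is readable. */
    nready = select(maxfd + 1, readfds, NULL, NULL, NULL);

    gettimeofday(&after, NULL);
    us = elapsed_us(&before, &after);
    fprintf(stderr, "%ld.%06ld select returned %d sockets latency %lld us\n",
            (long) after.tv_sec, (long) after.tv_usec, nready, us);

    if (latency_us != NULL)
        *latency_us = us;
    return nready;
}
```

Feeding the two timestamps from the trace into elapsed_us gives the same
46875 us that the log reports, which is where the 47 ms figure comes from.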

With that setup, select() runs for 47 *ms* before it delivers the next
byte to a client. During that time 6 clients get responses back, but
select() stays stuck in there for a long time anyway. Why? I don't know exactly
why, but I am sure that pgbench isn't doing anything weird. It's either
libpq acting funny, or the OS. When pgbench is waiting on a set of
sockets, and none of them are returning anything, that's interesting.
But there's nothing pgbench can do about it.

The cause/effect here is that the randomness in the throttling code
spreads out a bit the times at which the connections end. There are more times
during which you might have 20 connections finished while 5 still run.

I need to catch up with revisions done to this feature since I started
instrumenting my copy more heavily. I hope I can get this ready for
commit by Monday. I've certainly beaten on the feature for long enough now.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
