Re: A couple of proposed pgbench changes

Lists: pgsql-hackers pgsql-patches
From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Tatsuo Ishii <t-ishii(at)sra(dot)co(dot)jp>
Cc: pgsql-patches(at)postgreSQL(dot)org
Subject: A couple of proposed pgbench changes
Date: 2005-11-30 00:08:50
Message-ID: 11316.1133309330@sss.pgh.pa.us

Hi Tatsuo,

Attached is a proposed patch that deals with a couple of pgbench issues.

The change at line 490 updates doCustom's local variable "commands"
after selecting a new file (command sequence). I think that the
existing coding will cause the thing to use the first command of the
old sequence in the remainder of the routine, which would be a bug.
I have not tried to set up a test case to prove it, though.

The other two changes cause doCustom to loop after processing a
meta-command. This might be a bit controversial, but as the code
is currently written, each meta-command "costs" one cycle of the
outer select() loop. Thus, for example, with the default TPC-B script,
once a backend returns "COMMIT" it will not receive a new command
until four cycles of issuing commands to other backends have elapsed.
(You can see this very easily by strace'ing pgbench under load.)
ISTM it is bad to have backends sitting idle waiting for pgbench to
give them a new command. On the other hand you could argue that
finishing out several consecutive metacommands will delay issuance of
new commands to other backends. In the test case I was running,
making this change made for a small but noticeable improvement in
total CPU usage, but I'm not entirely convinced it'd always be a win.

I get the impression that pgbench itself is a bit of a bottleneck when
running on multi-CPU machines. I can't get the total CPU usage to reach
90% with ten client threads on a dual HT-enabled Xeon, and the only
explanation I can see is that there are too many backends waiting for
pgbench to give them new commands instead of doing useful work. strace
shows that each select() iteration usually finds *all ten* sockets
read-ready, which is definitely bad. (I tried nice'ing pgbench to
negative priority, but it did not improve matters. It could easily be
that this is a Heisenproblem, though, with strace itself slowing pgbench
enough to cause the behavior.) I wonder if it would help for pgbench to
fork off multiple sub-processes and have each sub-process tend just one
backend.

Anyway, since I'm not sure of either of these changes, I'm passing them
to you for review.

regards, tom lane

Index: pgbench.c
===================================================================
RCS file: /cvsroot/pgsql/contrib/pgbench/pgbench.c,v
retrieving revision 1.48
diff -c -r1.48 pgbench.c
*** pgbench.c 23 Nov 2005 12:19:12 -0000 1.48
--- pgbench.c 29 Nov 2005 23:51:46 -0000
***************
*** 411,416 ****
--- 411,417 ----
CState *st = &state[n];
Command **commands;

+ top:
commands = sql_files[st->use_file];

if (st->listen)
***************
*** 489,494 ****
--- 490,496 ----
{
st->state = 0;
st->use_file = getrand(0, num_files - 1);
+ commands = sql_files[st->use_file];
}
}

***************
*** 572,577 ****
--- 574,581 ----
free(val);
st->listen = 1;
}
+
+ goto top;
}
}


From: Tatsuo Ishii <ishii(at)sraoss(dot)co(dot)jp>
To: tgl(at)sss(dot)pgh(dot)pa(dot)us
Cc: t-ishii(at)sra(dot)co(dot)jp, pgsql-patches(at)postgresql(dot)org
Subject: Re: A couple of proposed pgbench changes
Date: 2005-12-01 07:13:05
Message-ID: 20051201.161305.26273167.t-ishii@sraoss.co.jp
Lists: pgsql-hackers pgsql-patches

> Hi Tatsuo,
>
> Attached is a proposed patch that deals with a couple of pgbench issues.
>
> The change at line 490 updates doCustom's local variable "commands"
> after selecting a new file (command sequence). I think that the
> existing coding will cause the thing to use the first command of the
> old sequence in the remainder of the routine, which would be a bug.
> I have not tried to set up a test case to prove it, though.

Thanks for pointing it out. It's definitely a bug. To reproduce the
problem, prepare two SQL files, say:

a.sql:
SELECT 1;
SELECT 2;

b.sql:
SELECT 300;
SELECT 400;

then run:

pgbench -f a.sql -f b.sql

and watch the SQL trace with log_statement = all enabled.

I see:
LOG: statement: SELECT 300;
LOG: statement: SELECT 400;
LOG: statement: SELECT 300;
LOG: statement: SELECT 2;
LOG: statement: SELECT 1;
LOG: statement: SELECT 400;
LOG: statement: SELECT 300;
LOG: statement: SELECT 400;
LOG: statement: SELECT 300;
LOG: statement: SELECT 2;
LOG: statement: SELECT 1;
:
:

Statements #1 and #2 are OK, but the #3 (SELECT 300) followed by
#4 (SELECT 2) combination should be impossible if pgbench is working
correctly.

> The other two changes cause doCustom to loop after processing a
> meta-command. This might be a bit controversial, but as the code
> is currently written, each meta-command "costs" one cycle of the
> outer select() loop. Thus, for example, with the default TPC-B script,
> once a backend returns "COMMIT" it will not receive a new command
> until four cycles of issuing commands to other backends have elapsed.
> (You can see this very easily by strace'ing pgbench under load.)
> ISTM it is bad to have backends sitting idle waiting for pgbench to
> give them a new command. On the other hand you could argue that
> finishing out several consecutive metacommands will delay issuance of
> new commands to other backends. In the test case I was running,
> making this change made for a small but noticeable improvement in
> total CPU usage, but I'm not entirely convinced it'd always be a win.

Agreed.

> I get the impression that pgbench itself is a bit of a bottleneck when
> running on multi-CPU machines. I can't get the total CPU usage to reach
> 90% with ten client threads on a dual HT-enabled Xeon, and the only
> explanation I can see is that there are too many backends waiting for
> pgbench to give them new commands instead of doing useful work. strace
> shows that each select() iteration usually finds *all ten* sockets
> read-ready, which is definitely bad. (I tried nice'ing pgbench to
> negative priority, but it did not improve matters. It could easily be
> that this is a Heisenproblem, though, with strace itself slowing pgbench
> enough to cause the behavior.) I wonder if it would help for pgbench to
> fork off multiple sub-processes and have each sub-process tend just one
> backend.

I'm not sure a multi-sub-process version of pgbench would show
performance superior to the current implementation, given the process
context-switching overhead. Maybe threading is better? Mr. Yasuo Ohgaki
implemented a pthread version of pgbench:

http://wiki.ohgaki.net/index.php?plugin=attach&pcmd=open&file=ppgbench-20050906-3.tar.bz2&refer=PostgreSQL%2Fppgbench

You might want to try it.

> Anyway, since I'm not sure of either of these changes, I'm passing them
> to you for review.

Your patches seem fine.
--
Tatsuo Ishii
SRA OSS, Inc. Japan


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Tatsuo Ishii <ishii(at)sraoss(dot)co(dot)jp>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: [PATCHES] A couple of proposed pgbench changes
Date: 2005-12-01 15:07:31
Message-ID: 10727.1133449651@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-patches

Tatsuo Ishii <ishii(at)sraoss(dot)co(dot)jp> writes:
>> ... I wonder if it would help for pgbench to
>> fork off multiple sub-processes and have each sub-process tend just one
>> backend.

> I'm not sure a multi-sub-process version of pgbench would show
> performance superior to the current implementation, given the process
> context-switching overhead. Maybe threading is better? Mr. Yasuo Ohgaki
> implemented a pthread version of pgbench.

Oh, I wasn't aware someone had done it. Last night I rewrote pgbench
to fork() off one subprocess for each client. At least on the TPC-B
script, it's a bit slower, which lends weight to your worry about extra
context swap overhead. But there is something funny going on on the
backend side too. I don't have the numbers in front of me right now,
but very roughly this is what I was seeing:

                  pgbench alone   + idle transaction

stock pgbench        680 tps           330 tps
fork() version       640 tps            50 tps

It's hard to explain that last number as being pgbench's fault. I think
that somehow the fork() version is stressing the backends more than
stock pgbench does. It might have to do with the fact that the stock
version delivers commands to multiple backends in lockstep whereas the
fork() version is more random. I've been poking at it to determine why
without a lot of success, but one interesting thing I've found out is
that with the fork() version there is a *whole* lot more contention for
the SubTransControlLock. I hacked lwlock.c to print out per-process
counts of LWLock acquisitions and number of times blocked, and this is
what I got (with the prototype patch I posted the other day to take
shared lock for SubTransGetParent):

stock pgbench, no idle xact:

PID 31529 lwlock 14: shacq 438 exacq 1 blk 0
PID 31530 lwlock 14: shacq 401 exacq 0 blk 0
PID 31531 lwlock 14: shacq 378 exacq 0 blk 0
PID 31532 lwlock 14: shacq 381 exacq 1 blk 0
PID 31533 lwlock 14: shacq 377 exacq 2 blk 0
PID 31534 lwlock 14: shacq 354 exacq 0 blk 0
PID 31535 lwlock 14: shacq 373 exacq 0 blk 0
PID 31536 lwlock 14: shacq 373 exacq 0 blk 0
PID 31537 lwlock 14: shacq 370 exacq 1 blk 0
PID 31538 lwlock 14: shacq 377 exacq 0 blk 0

fork(), no idle xact:

PID 414 lwlock 14: shacq 82401 exacq 0 blk 0
PID 415 lwlock 14: shacq 82500 exacq 3 blk 0
PID 417 lwlock 14: shacq 77727 exacq 2 blk 0
PID 419 lwlock 14: shacq 83272 exacq 2 blk 0
PID 421 lwlock 14: shacq 78579 exacq 2 blk 0
PID 424 lwlock 14: shacq 82704 exacq 0 blk 0
PID 426 lwlock 14: shacq 82252 exacq 2 blk 0
PID 429 lwlock 14: shacq 86002 exacq 0 blk 0
PID 431 lwlock 14: shacq 86617 exacq 2 blk 0
PID 432 lwlock 14: shacq 78842 exacq 1 blk 0

stock pgbench + idle xact:

PID 17868 lwlock 14: shacq 3342147 exacq 3250 blk 67
PID 17869 lwlock 14: shacq 3318728 exacq 3477 blk 74
PID 17870 lwlock 14: shacq 3324261 exacq 3858 blk 102
PID 17871 lwlock 14: shacq 3388431 exacq 3436 blk 120
PID 17872 lwlock 14: shacq 3409427 exacq 4232 blk 108
PID 17873 lwlock 14: shacq 3416117 exacq 5763 blk 130
PID 17874 lwlock 14: shacq 3396471 exacq 4860 blk 70
PID 17875 lwlock 14: shacq 3369113 exacq 4828 blk 161
PID 17876 lwlock 14: shacq 3428814 exacq 5286 blk 193
PID 17877 lwlock 14: shacq 3476198 exacq 5073 blk 147

fork() + idle xact:

PID 519 lwlock 14: shacq 83979662 exacq 2 blk 7
PID 526 lwlock 14: shacq 94968544 exacq 1 blk 1
PID 529 lwlock 14: shacq 91672324 exacq 0 blk 2
PID 530 lwlock 14: shacq 92307866 exacq 3 blk 16
PID 531 lwlock 14: shacq 93694118 exacq 0 blk 2
PID 532 lwlock 14: shacq 90776114 exacq 1 blk 2
PID 533 lwlock 14: shacq 89445464 exacq 1 blk 2
PID 534 lwlock 14: shacq 94407745 exacq 2 blk 2
PID 535 lwlock 14: shacq 88223627 exacq 2 blk 2
PID 536 lwlock 14: shacq 87223449 exacq 3 blk 6

I don't know yet why pgbench would be able to affect this, but it's
repeatable. If anyone's interested in trying to duplicate it, I'll
post my version of pgbench.

regards, tom lane


From: "Qingqing Zhou" <zhouqq(at)cs(dot)toronto(dot)edu>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCHES] A couple of proposed pgbench changes
Date: 2005-12-01 17:52:48
Message-ID: dmnd7r$7t1$1@news.hub.org
Lists: pgsql-hackers pgsql-patches


"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote
>
> I don't know yet why pgbench would be able to affect this, but it's
> repeatable. If anyone's interested in trying to duplicate it, I'll
> post my version of pgbench.
>

I'd like to try to duplicate it here.

Regards,
Qingqing


From: Tatsuo Ishii <ishii(at)sraoss(dot)co(dot)jp>
To: tgl(at)sss(dot)pgh(dot)pa(dot)us
Cc: pgsql-patches(at)postgresql(dot)org
Subject: Re: A couple of proposed pgbench changes
Date: 2005-12-04 01:11:22
Message-ID: 20051204.101122.97294376.t-ishii@sraoss.co.jp
Lists: pgsql-hackers pgsql-patches

I have committed your fixes into the PostgreSQL 8.1 stable branch.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
