Re: RAID stripe size question

From: "Mikael Carneholm" <Mikael(dot)Carneholm(at)WirelessCar(dot)com>
To: "Markus Schaber" <schabi(at)logix-tt(dot)com>
Cc: <pgsql-performance(at)postgresql(dot)org>
Subject: Re: RAID stripe size question
Date: 2006-07-17 12:52:28
Message-ID: 7F10D26ECFA1FB458B89C5B4B0D72C2B4E4BDF@sesrv12.wirelesscar.com
Lists: pgsql-performance

>> This is something I'd also like to test, as a common best practice
>> these days is to go for a SAME (stripe all, mirror everything) setup.
>> From a development perspective it's easier to use SAME as the
>> developers won't have to think about physical location for new
>> tables/indices, so if there's no performance penalty with SAME I'll
>> gladly keep it that way.

> Usually, it's not the developer's task to care about that, but the
> DBA's responsibility.

As we don't have a full-time dedicated DBA (although I'm the one who does
most DBA-related tasks), I would aim to make physical location as
transparent as possible; otherwise I'm afraid I won't be doing anything
other than supporting developers with that - and I *do* have other things
to do as well :)

>> In a previous test, using cd=5000 and cs=20 increased transaction
>> throughput by ~20% so I'll definitely fiddle with that in the coming
>> tests as well.

> How many parallel transactions do you have?

That was when running BenchmarkSQL
(http://sourceforge.net/projects/benchmarksql) with 100 concurrent users
("terminals"), which I assume means at most 100 parallel transactions.
The target application for this DB has 3-4 times as many concurrent
connections, so it's possible that other cs/cd numbers would suit that
scenario better. Tweaking the bgwriter settings is another task I'll look
into as well.
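
For clarity - assuming cd/cs above refer to the commit_delay and
commit_siblings parameters - the settings under discussion would sit in
postgresql.conf roughly like this (the bgwriter lines just list the
8.1-era knobs with illustrative values, not recommendations):

  # group commit settings from the earlier test
  commit_delay = 5000          # usec to wait for more commits to piggyback (cd)
  commit_siblings = 20         # min. concurrent open transactions required (cs)

  # bgwriter knobs to experiment with (illustrative values)
  bgwriter_delay = 200         # ms between bgwriter rounds
  bgwriter_lru_maxpages = 5    # max LRU-driven pages written per round
  bgwriter_all_maxpages = 5    # max pages written per round from the whole pool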

Btw, here are the bonnie++ results from two different array sets (10+18,
4+24) on the MSA1500:

LUN: WAL, 10 disks, stripe size 32K
------------------------------------
Version 1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
sesell01        32G 56139  93 73250  22 16530   3 30488  45 57489   5 477.3   1
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  2458  90 +++++ +++ +++++ +++  3121  99 +++++ +++ 10469  98

LUN: WAL, 4 disks, stripe size 8K
----------------------------------
Version 1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
sesell01        32G 49170  82 60108  19 13325   2 15778  24 21489   2 266.4   0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  2432  86 +++++ +++ +++++ +++  3106  99 +++++ +++ 10248  98

LUN: DATA, 18 disks, stripe size 32K
-------------------------------------
Version 1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
sesell01        32G 59990  97 87341  28 19158   4 30200  46 57556   6 495.4   1
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  1640  92 +++++ +++ +++++ +++  1736  99 +++++ +++ 10919  99

LUN: DATA, 24 disks, stripe size 64K
-------------------------------------
Version 1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
sesell01        32G 59443  97 118515 39 25023   5 30926  49 60835   6 531.8   1
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  2499  90 +++++ +++ +++++ +++  2817  99 +++++ +++ 10971 100
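
For reference, a bonnie++ run producing output in this format would look
something along these lines (directory, size and user are illustrative,
not necessarily what was used here; the 32GB working set is roughly twice
the machine's RAM so the OS cache doesn't inflate the numbers):

  bonnie++ -d /mnt/data_lun/bonnie -s 32768 -n 16 -u postgres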

Regards,
Mikael


From: "Alex Turner" <armtuk(at)gmail(dot)com>
To: "Mikael Carneholm" <Mikael(dot)Carneholm(at)wirelesscar(dot)com>
Cc: "Markus Schaber" <schabi(at)logix-tt(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: RAID stripe size question
Date: 2006-07-17 15:23:23
Message-ID: 33c6269f0607170823v7df531a8o678505125e85880@mail.gmail.com
Lists: pgsql-performance

On 7/17/06, Mikael Carneholm <Mikael(dot)Carneholm(at)wirelesscar(dot)com> wrote:
> [snip - full quote of Mikael's message and bonnie++ results, as above]

These bonnie++ numbers are very worrying. Your controller should easily
max out your FC interface on these tests, passing 192MB/sec with ease on
anything more than a 6-drive RAID 10. This is a bad omen if you want high
performance... Each mirror pair can do 60-80MB/sec. A 24-disk RAID 10 can
do 12*60MB/sec, which is 720MB/sec - I have seen this performance, it's
not unreachable, but time and again we see these bad perf numbers from FC
and SCSI systems alike. Consider a different controller, because this one
is not up to snuff. A single drive would get better numbers than your
4-disk RAID 10: 21MB/sec read speed is really pretty sorry, it should be
closer to 120MB/sec. If you can't swap out, software RAID may turn out to
be your friend. The only saving grace is that this is OLTP, and perhaps,
just maybe, the controller will be better at ordering IOs, but I highly
doubt it.
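
To make the software RAID option concrete: on Linux that would mean
exporting the drives as individual LUNs and letting md build the RAID 10.
A rough sketch only - device names and chunk size are purely illustrative:

  mdadm --create /dev/md0 --level=10 --raid-devices=24 --chunk=64 /dev/sd[b-y]
  cat /proc/mdstat    # wait for the initial sync to finish before benchmarking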

Please people, do the numbers - benchmark before you buy. Many, many HBAs
really suck under Linux/FreeBSD, and you may end up paying vast sums of
money for very sub-optimal performance (I'd say sub-standard, but alas, it
seems that this kind of poor performance is tolerated, even though it's way
off where it should be). There's no point having a 40-disk cab if your
controller can't handle it.

Maximum theoretical linear throughput can be achieved in a white box for
under $20k, and I have seen this kind of system outperform a server 5 times
its price even in OLTP.

Alex


From: Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>
To: Mikael Carneholm <Mikael(dot)Carneholm(at)WirelessCar(dot)com>
Cc: Markus Schaber <schabi(at)logix-tt(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: RAID stripe size question
Date: 2006-07-18 00:22:29
Message-ID: 44BC29C5.5090607@paradise.net.nz
Lists: pgsql-performance

Mikael Carneholm wrote:
>
> Btw, here are the bonnie++ results from two different array sets (10+18,
> 4+24) on the MSA1500:
>
> LUN: DATA, 24 disks, stripe size 64K
> -------------------------------------
> Version 1.03       ------Sequential Output------ --Sequential Input- --Random-
>                     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
> Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
> sesell01        32G 59443  97 118515 39 25023   5 30926  49 60835   6 531.8   1
>                     ------Sequential Create------ --------Random Create--------
>                     -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
>               files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
>                  16  2499  90 +++++ +++ +++++ +++  2817  99 +++++ +++ 10971 100
>

It might be interesting to see if a 128K or 256K stripe size gives better
sequential throughput while still leaving the random performance OK.
Having said that, the seeks/s figure of 531 is not that great - for
instance I've seen a 12-disk (15K SCSI) system report about 1400 seeks/s
in this test.

Sorry if you mentioned this already - but what OS and filesystem are you
using? (if Linux and ext3, it might be worth experimenting with xfs or jfs).
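
If you do go the xfs route, it's usually worth aligning the filesystem
with the array geometry. A rough sketch only - the device name is
illustrative, su matches the 64K controller stripe and sw the 12
data-bearing spindles of the 24-disk RAID 10:

  mkfs.xfs -d su=64k,sw=12 /dev/sdc1
  mount -o noatime,logbufs=8 /dev/sdc1 /data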

Cheers

Mark