Re: How to improve db performance with $7K?

From: "Mohan, Ross" <RMohan(at)arbinet(dot)com>
To: <pgsql-performance(at)postgresql(dot)org>
Subject: Re: How to improve db performance with $7K?
Date: 2005-04-14 18:30:29
Message-ID: CC74E7E10A8A054798B6611BD1FEF4D30625DA5B@vamail01.thexchange.com
Lists: pgsql-performance

I've been doing some reading up on this, trying to keep up here,
and have found out that (experts, just yawn and cover your ears)

1) some SATA drives (just type II, I think?) have a "Phase Zero"
implementation of Tagged Command Queueing (the special sauce
for SCSI).
2) This SATA "TCQ" is called NCQ, and I believe it basically
allows the drive itself to do the reordering
(this is called "simple" in TCQ terminology). It does not
yet allow the TCQ "head of queue" command, which lets the
current tagged request go to the head of the queue -- a
simple way of expressing a "high priority" request.

3) SATA drives are not yet multi-initiator?

Largely because of 2 and 3, multi-initiator SCSI RAID'ed drives
are likely to whomp SATA II drives for a while yet (read: a
year or two) in multi-user Postgres applications.

-----Original Message-----
From: pgsql-performance-owner(at)postgresql(dot)org [mailto:pgsql-performance-owner(at)postgresql(dot)org] On Behalf Of Greg Stark
Sent: Thursday, April 14, 2005 2:04 PM
To: Kevin Brown
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: [PERFORM] How to improve db performance with $7K?

Kevin Brown <kevin(at)sysexperts(dot)com> writes:

> Greg Stark wrote:
>
>
> > I think you're being misled by analyzing the write case.
> >
> > Consider the read case. When a user process requests a block and
> > that read makes its way down to the driver level, the driver can't
> > just put it aside and wait until it's convenient. It has to go ahead
> > and issue the read right away.
>
> Well, strictly speaking it doesn't *have* to. It could delay for a
> couple of milliseconds to see if other requests come in, and then
> issue the read if none do. If there are already other requests being
> fulfilled, then it'll schedule the request in question just like the
> rest.

But then the cure is worse than the disease. You're basically describing exactly what does happen anyways, only you're delaying more requests than necessary. That intervening time isn't really idle, it's filled with all the requests that were delayed during the previous large seek...

> Once the first request has been fulfilled, the driver can now schedule
> the rest of the queued-up requests in disk-layout order.
>
> I really don't see how this is any different between a system that has
> tagged queueing to the disks and one that doesn't. The only
> difference is where the queueing happens.

And *when* it happens. Instead of being able to issue requests while a large seek is happening and having some of them satisfied they have to wait until that seek is finished and get acted on during the next large seek.

If my theory is correct then I would expect bandwidth to be essentially equivalent but the latency on SATA drives to be increased by about 50% of the average seek time. Ie, while a busy SCSI drive can satisfy most requests in about 10ms a busy SATA drive would satisfy most requests in 15ms. (add to that that 10k RPM and 15kRPM SCSI drives have even lower seek times and no such IDE/SATA drives exist...)

In reality higher latency feeds into a system feedback loop causing your application to run slower causing bandwidth demands to be lower as well. It's often hard to distinguish root causes from symptoms when optimizing complex systems.

--
greg



From: Steve Poe <spoe(at)sfnet(dot)cc>
To: "Mohan, Ross" <RMohan(at)arbinet(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: How to improve db performance with $7K?
Date: 2005-04-14 18:44:28
Message-ID: 425EBA0C.6030804@sfnet.cc
Lists: pgsql-performance

If SATA drives don't have the ability to replace SCSI for multi-user
Postgres apps, but you need to save on cost (ALWAYS an issue),
could/would you implement SATA for your logs (pg_xlog) and keep the rest
on SCSI?

Steve Poe

Mohan, Ross wrote:

>I've been doing some reading up on this, trying to keep up here,
>and have found out that (experts, just yawn and cover your ears)
>
>1) some SATA drives (just type II, I think?) have a "Phase Zero"
> implementation of Tagged Command Queueing (the special sauce
> for SCSI).
>2) This SATA "TCQ" is called NCQ and I believe it basically
> allows the disk software itself to do the reordering
> (this is called "simple" in TCQ terminology) It does not
> yet allow the TCQ "head of queue" command, allowing the
> current tagged request to go to head of queue, which is
> a simple way of manifesting a "high priority" request.
>
>3) SATA drives are not yet multi-initiator?
>
>Largely b/c of 2 and 3, multi-initiator SCSI RAID'ed drives
>are likely to whomp SATA II drives for a while yet (read: a
>year or two) in multiuser PostGres applications.
>
>
>
>-----Original Message-----
>From: pgsql-performance-owner(at)postgresql(dot)org [mailto:pgsql-performance-owner(at)postgresql(dot)org] On Behalf Of Greg Stark
>Sent: Thursday, April 14, 2005 2:04 PM
>To: Kevin Brown
>Cc: pgsql-performance(at)postgresql(dot)org
>Subject: Re: [PERFORM] How to improve db performance with $7K?
>
>
>Kevin Brown <kevin(at)sysexperts(dot)com> writes:
>
>
>
>>Greg Stark wrote:
>>
>>
>>
>>
>>>I think you're being misled by analyzing the write case.
>>>
>>>Consider the read case. When a user process requests a block and
>>>that read makes its way down to the driver level, the driver can't
>>>just put it aside and wait until it's convenient. It has to go ahead
>>>and issue the read right away.
>>>
>>>
>>Well, strictly speaking it doesn't *have* to. It could delay for a
>>couple of milliseconds to see if other requests come in, and then
>>issue the read if none do. If there are already other requests being
>>fulfilled, then it'll schedule the request in question just like the
>>rest.
>>
>>
>
>But then the cure is worse than the disease. You're basically describing exactly what does happen anyways, only you're delaying more requests than necessary. That intervening time isn't really idle, it's filled with all the requests that were delayed during the previous large seek...
>
>
>
>>Once the first request has been fulfilled, the driver can now schedule
>>the rest of the queued-up requests in disk-layout order.
>>
>>I really don't see how this is any different between a system that has
>>tagged queueing to the disks and one that doesn't. The only
>>difference is where the queueing happens.
>>
>>
>
>And *when* it happens. Instead of being able to issue requests while a large seek is happening and having some of them satisfied they have to wait until that seek is finished and get acted on during the next large seek.
>
>If my theory is correct then I would expect bandwidth to be essentially equivalent but the latency on SATA drives to be increased by about 50% of the average seek time. Ie, while a busy SCSI drive can satisfy most requests in about 10ms a busy SATA drive would satisfy most requests in 15ms. (add to that that 10k RPM and 15kRPM SCSI drives have even lower seek times and no such IDE/SATA drives exist...)
>
>In reality higher latency feeds into a system feedback loop causing your application to run slower causing bandwidth demands to be lower as well. It's often hard to distinguish root causes from symptoms when optimizing complex systems.
>
>
>


From: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To: Steve Poe <spoe(at)sfnet(dot)cc>
Cc: "Mohan, Ross" <RMohan(at)arbinet(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: How to improve db performance with $7K?
Date: 2005-04-14 19:16:55
Message-ID: 425EC1A7.1030702@commandprompt.com
Lists: pgsql-performance

Steve Poe wrote:

> If SATA drives don't have the ability to replace SCSI for a multi-user

I don't think it is a matter of not having the ability. SATA, all in all,
is fine as long as the write cache is battery-backed. It isn't as
high-performing as SCSI, but who says it has to be?

There are plenty of companies running databases on SATA without issue.
Would I put it on a database that is expected to have 500 connections at
all times? No. Then again, if you have an application with that
requirement, you have the money to buy a big fat SCSI array.

Sincerely,

Joshua D. Drake

> Postgres apps, but you needed to save on cost (ALWAYS an issue),
> could/would you implement SATA for your logs (pg_xlog) and keep the
> rest on SCSI?
>
> Steve Poe
>
> Mohan, Ross wrote:
>
>> I've been doing some reading up on this, trying to keep up here, and
>> have found out that (experts, just yawn and cover your ears)
>>
>> 1) some SATA drives (just type II, I think?) have a "Phase Zero"
>> implementation of Tagged Command Queueing (the special sauce
>> for SCSI).
>> 2) This SATA "TCQ" is called NCQ and I believe it basically
>> allows the disk software itself to do the reordering
>> (this is called "simple" in TCQ terminology) It does not
>> yet allow the TCQ "head of queue" command, allowing the
>> current tagged request to go to head of queue, which is
>> a simple way of manifesting a "high priority" request.
>>
>> 3) SATA drives are not yet multi-initiator?
>>
>> Largely b/c of 2 and 3, multi-initiator SCSI RAID'ed drives
>> are likely to whomp SATA II drives for a while yet (read: a
>> year or two) in multiuser PostGres applications.
>>
>>
>> -----Original Message-----
>> From: pgsql-performance-owner(at)postgresql(dot)org
>> [mailto:pgsql-performance-owner(at)postgresql(dot)org] On Behalf Of Greg Stark
>> Sent: Thursday, April 14, 2005 2:04 PM
>> To: Kevin Brown
>> Cc: pgsql-performance(at)postgresql(dot)org
>> Subject: Re: [PERFORM] How to improve db performance with $7K?
>>
>>
>> Kevin Brown <kevin(at)sysexperts(dot)com> writes:
>>
>>
>>
>>> Greg Stark wrote:
>>>
>>>
>>>
>>>
>>>> I think you're being misled by analyzing the write case.
>>>>
>>>> Consider the read case. When a user process requests a block and
>>>> that read makes its way down to the driver level, the driver can't
>>>> just put it aside and wait until it's convenient. It has to go
>>>> ahead and issue the read right away.
>>>>
>>>
>>> Well, strictly speaking it doesn't *have* to. It could delay for a
>>> couple of milliseconds to see if other requests come in, and then
>>> issue the read if none do. If there are already other requests
>>> being fulfilled, then it'll schedule the request in question just
>>> like the rest.
>>>
>>
>>
>> But then the cure is worse than the disease. You're basically
>> describing exactly what does happen anyways, only you're delaying
>> more requests than necessary. That intervening time isn't really
>> idle, it's filled with all the requests that were delayed during the
>> previous large seek...
>>
>>
>>
>>> Once the first request has been fulfilled, the driver can now
>>> schedule the rest of the queued-up requests in disk-layout order.
>>>
>>> I really don't see how this is any different between a system that
>>> has tagged queueing to the disks and one that doesn't. The only
>>> difference is where the queueing happens.
>>>
>>
>>
>> And *when* it happens. Instead of being able to issue requests while
>> a large seek is happening and having some of them satisfied they have
>> to wait until that seek is finished and get acted on during the next
>> large seek.
>>
>> If my theory is correct then I would expect bandwidth to be
>> essentially equivalent but the latency on SATA drives to be increased
>> by about 50% of the average seek time. Ie, while a busy SCSI drive
>> can satisfy most requests in about 10ms a busy SATA drive would
>> satisfy most requests in 15ms. (add to that that 10k RPM and 15kRPM
>> SCSI drives have even lower seek times and no such IDE/SATA drives
>> exist...)
>>
>> In reality higher latency feeds into a system feedback loop causing
>> your application to run slower causing bandwidth demands to be lower
>> as well. It's often hard to distinguish root causes from symptoms
>> when optimizing complex systems.
>>
>>
>>
>
>


From: William Yu <wyu(at)talisys(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: How to improve db performance with $7K?
Date: 2005-04-18 08:05:09
Message-ID: d3vpnn$2iu$1@news.hub.org
Lists: pgsql-performance

There's a problem with this strategy. You want battery-backed write
caching for best performance & safety. (I've tried IDE for WAL before w/
write caching off -- the DB got crippled whenever I had to copy files
from/to the drive on the WAL partition -- I ended up just moving WAL back
onto the same SCSI drive as the main DB.) That means in addition to a $$$
SCSI caching controller, you also need a $$$ SATA caching controller. From
my glance at prices, advanced SATA controllers seem to cost nearly as much
as their SCSI counterparts.

This also looks to be the case for the drives themselves. Sure, you can
get super-cheap 7200RPM SATA drives, but they absolutely suck for
database work. Believe me, I gave it a try once -- ugh. The high-end WD
10K Raptors look pretty good though -- the benchmarks at StorageReview
seem to put these drives at about 90% of SCSI 10Ks for both single-user
and multi-user loads. However, they're also priced like SCSIs -- here's
what I found at Mwave (going through Pricewatch to find WD740GDs):

Seagate 7200 SATA -- 80GB $59
WD 10K SATA -- 72GB $182
Seagate 10K U320 -- 72GB $289

Using the above prices for a fixed budget for RAID-10, you could get:

SATA 7200 -- 680GB per $1000
SATA 10K -- 200GB per $1000
SCSI 10K -- 125GB per $1000
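
(Back-of-envelope check of those figures, sketched in Python; the only inputs
are the drive prices above, RAID-10 usable capacity is taken as half the raw
capacity, and fractional drive counts are kept just for the arithmetic:)

    def usable_gb_per_1000(price_usd, capacity_gb, budget_usd=1000.0):
        # RAID-10 mirrors every drive, so usable capacity is half of raw.
        return (budget_usd / price_usd) * capacity_gb / 2.0

    print(round(usable_gb_per_1000(59, 80)))    # Seagate 7200 SATA -> ~678
    print(round(usable_gb_per_1000(182, 72)))   # WD 10K SATA       -> ~198
    print(round(usable_gb_per_1000(289, 72)))   # Seagate 10K U320  -> ~125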

For a 99% read-only DB that requires lots of disk space (say something
like Wikipedia or a blog host), using consumer-level SATA is probably OK.
For anything else, I'd consider SATA 10K if (1) I do not need 15K RPM
and (2) I don't have SCSI infrastructure already.

Steve Poe wrote:
> If SATA drives don't have the ability to replace SCSI for a multi-user
> Postgres apps, but you needed to save on cost (ALWAYS an issue),
> could/would you implement SATA for your logs (pg_xlog) and keep the rest
> on SCSI?
>
> Steve Poe
>
> Mohan, Ross wrote:
>
>> I've been doing some reading up on this, trying to keep up here, and
>> have found out that (experts, just yawn and cover your ears)
>>
>> 1) some SATA drives (just type II, I think?) have a "Phase Zero"
>> implementation of Tagged Command Queueing (the special sauce
>> for SCSI).
>> 2) This SATA "TCQ" is called NCQ and I believe it basically
>> allows the disk software itself to do the reordering
>> (this is called "simple" in TCQ terminology) It does not
>> yet allow the TCQ "head of queue" command, allowing the
>> current tagged request to go to head of queue, which is
>> a simple way of manifesting a "high priority" request.
>>
>> 3) SATA drives are not yet multi-initiator?
>>
>> Largely b/c of 2 and 3, multi-initiator SCSI RAID'ed drives
>> are likely to whomp SATA II drives for a while yet (read: a
>> year or two) in multiuser PostGres applications.
>>
>>
>> -----Original Message-----
>> From: pgsql-performance-owner(at)postgresql(dot)org
>> [mailto:pgsql-performance-owner(at)postgresql(dot)org] On Behalf Of Greg Stark
>> Sent: Thursday, April 14, 2005 2:04 PM
>> To: Kevin Brown
>> Cc: pgsql-performance(at)postgresql(dot)org
>> Subject: Re: [PERFORM] How to improve db performance with $7K?
>>
>>
>> Kevin Brown <kevin(at)sysexperts(dot)com> writes:
>>
>>
>>
>>> Greg Stark wrote:
>>>
>>>
>>>
>>>
>>>> I think you're being misled by analyzing the write case.
>>>>
>>>> Consider the read case. When a user process requests a block and
>>>> that read makes its way down to the driver level, the driver can't
>>>> just put it aside and wait until it's convenient. It has to go ahead
>>>> and issue the read right away.
>>>>
>>>
>>> Well, strictly speaking it doesn't *have* to. It could delay for a
>>> couple of milliseconds to see if other requests come in, and then
>>> issue the read if none do. If there are already other requests being
>>> fulfilled, then it'll schedule the request in question just like the
>>> rest.
>>>
>>
>>
>> But then the cure is worse than the disease. You're basically
>> describing exactly what does happen anyways, only you're delaying more
>> requests than necessary. That intervening time isn't really idle, it's
>> filled with all the requests that were delayed during the previous
>> large seek...
>>
>>
>>
>>> Once the first request has been fulfilled, the driver can now
>>> schedule the rest of the queued-up requests in disk-layout order.
>>>
>>> I really don't see how this is any different between a system that
>>> has tagged queueing to the disks and one that doesn't. The only
>>> difference is where the queueing happens.
>>>
>>
>>
>> And *when* it happens. Instead of being able to issue requests while a
>> large seek is happening and having some of them satisfied they have to
>> wait until that seek is finished and get acted on during the next
>> large seek.
>>
>> If my theory is correct then I would expect bandwidth to be
>> essentially equivalent but the latency on SATA drives to be increased
>> by about 50% of the average seek time. Ie, while a busy SCSI drive can
>> satisfy most requests in about 10ms a busy SATA drive would satisfy
>> most requests in 15ms. (add to that that 10k RPM and 15kRPM SCSI
>> drives have even lower seek times and no such IDE/SATA drives exist...)
>>
>> In reality higher latency feeds into a system feedback loop causing
>> your application to run slower causing bandwidth demands to be lower
>> as well. It's often hard to distinguish root causes from symptoms when
>> optimizing complex systems.
>>
>>
>>
>
>
>


From: Greg Stark <gsstark(at)mit(dot)edu>
To: William Yu <wyu(at)talisys(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: How to improve db performance with $7K?
Date: 2005-04-18 14:59:05
Message-ID: 87pswsku6e.fsf@stark.xeocode.com
Lists: pgsql-performance


William Yu <wyu(at)talisys(dot)com> writes:

> Using the above prices for a fixed budget for RAID-10, you could get:
>
> SATA 7200 -- 680GB per $1000
> SATA 10K -- 200GB per $1000
> SCSI 10K -- 125GB per $1000

What a lot of these analyses miss is that cheaper == faster, because cheaper
means you can buy more spindles for the same price. I'm assuming you picked
equal-sized drives to compare, so that 200GB/$1000 for SATA is almost twice as
many spindles as the 125GB/$1000. That means it would have almost double the
bandwidth. And the 7200 RPM case would have more than 5x the bandwidth.

While 10k RPM drives have lower seek times, and SCSI drives have a natural
seek time advantage, under load a RAID array with fewer spindles will start
hitting contention sooner, which results in higher latency. If the controller
works well, the larger SATA arrays above should be able to maintain their
mediocre latency much better under load than the SCSI array with fewer drives
would maintain its low-latency response time, despite its drives' lower average
seek time.

--
greg


From: Alex Turner <armtuk(at)gmail(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: William Yu <wyu(at)talisys(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: How to improve db performance with $7K?
Date: 2005-04-18 15:17:25
Message-ID: 33c6269f0504180817511b27a3@mail.gmail.com
Lists: pgsql-performance

This is fundamentally untrue.

A mirror is still a mirror. At most in a RAID 10 you can have two
simultaneous seeks. You are always going to be limited by the seek
time of your drives. It's a stripe, so you have to read from all
members of the stripe to get data, requiring all drives to seek.
There is no advantage to seek time in adding more drives. By adding
more drives you can increase throughput, but the max throughput of the
PCI-X bus isn't that high (I think around 400MB/sec). You can easily
get this with a six or seven drive RAID 5, or a ten drive RAID 10. At
that point you start having to factor in the cost of a bigger chassis
to hold more drives, which can be big bucks.

Alex Turner
netEconomist

On 18 Apr 2005 10:59:05 -0400, Greg Stark <gsstark(at)mit(dot)edu> wrote:
>
> William Yu <wyu(at)talisys(dot)com> writes:
>
> > Using the above prices for a fixed budget for RAID-10, you could get:
> >
> > SATA 7200 -- 680MB per $1000
> > SATA 10K -- 200MB per $1000
> > SCSI 10K -- 125MB per $1000
>
> What a lot of these analyses miss is that cheaper == faster because cheaper
> means you can buy more spindles for the same price. I'm assuming you picked
> equal sized drives to compare so that 200MB/$1000 for SATA is almost twice as
> many spindles as the 125MB/$1000. That means it would have almost double the
> bandwidth. And the 7200 RPM case would have more than 5x the bandwidth.
>
> While 10k RPM drives have lower seek times, and SCSI drives have a natural
> seek time advantage, under load a RAID array with fewer spindles will start
> hitting contention sooner which results into higher latency. If the controller
> works well the larger SATA arrays above should be able to maintain their
> mediocre latency much better under load than the SCSI array with fewer drives
> would maintain its low latency response time despite its drives' lower average
> seek time.
>
> --
> greg
>
>
>


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Alex Turner <armtuk(at)gmail(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, William Yu <wyu(at)talisys(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: How to improve db performance with $7K?
Date: 2005-04-18 15:43:54
Message-ID: 878y3gks3p.fsf@stark.xeocode.com
Lists: pgsql-performance


Alex Turner <armtuk(at)gmail(dot)com> writes:

> This is fundamentaly untrue.
>
> A mirror is still a mirror. At most in a RAID 10 you can have two
> simultaneous seeks. You are always going to be limited by the seek
> time of your drives. It's a stripe, so you have to read from all
> members of the stripe to get data, requiring all drives to seek.
> There is no advantage to seek time in adding more drives.

Adding drives will not let you get lower response times than the average seek
time on your drives*. But it will let you reach that response time more often.

The actual response time for a random access to a drive is the seek time plus
the time waiting for your request to actually be handled. Under heavy load
that could be many milliseconds. The more drives you have the fewer requests
each drive has to handle.
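
(A toy sketch of that effect -- hypothetical numbers, not a benchmark: random
reads arrive at a fixed rate, each needs about 10ms of seek plus rotation, and
each lands on one of N drives. More spindles means shorter per-drive queues,
so the average response time falls toward the bare 10ms service time:)

    import random

    def mean_response_ms(n_drives, arrivals_per_s=300, service_ms=10.0,
                         n_requests=200_000, seed=42):
        rng = random.Random(seed)
        t = 0.0
        free_at = [0.0] * n_drives             # time each drive becomes idle
        total = 0.0
        for _ in range(n_requests):
            t += rng.expovariate(arrivals_per_s) * 1000.0  # ms between arrivals
            d = rng.randrange(n_drives)        # request lands on one random drive
            start = max(t, free_at[d])         # wait if that drive is still busy
            free_at[d] = start + service_ms
            total += free_at[d] - t            # queueing delay + service time
        return total / n_requests

    for n in (4, 8, 16, 32):
        print(f"{n:2d} drives: ~{mean_response_ms(n):.1f} ms average response")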

Look at the await and svctm columns of iostat -x.

Under heavy random access load those columns can show up performance problems
more accurately than the bandwidth columns. You could be doing less bandwidth
but be having latency issues. While reorganizing data to allow for more
sequential reads is the normal way to address that, simply adding more
spindles can be surprisingly effective.

> By adding more drives you can increase throughput, but the max throughput of
> the PCI-X bus isn't that high (I think around 400MB/sec) You can easily get
> this with a six or seven drive RAID 5, or a ten drive RAID 10. At that point
> you start having to factor in the cost of a bigger chassis to hold more
> drives, which can be big bucks.

You could use software RAID to spread the drives over multiple PCI-X cards.
But if 400MB/s isn't enough bandwidth then you're probably in the realm of
"enterprise-class" hardware anyway.

* (Actually even that's possible: you could limit yourself to a portion of the
drive surface to reduce seek time)

--
greg


From: Alex Turner <armtuk(at)gmail(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: William Yu <wyu(at)talisys(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: How to improve db performance with $7K?
Date: 2005-04-18 16:56:48
Message-ID: 33c6269f05041809566abf91d2@mail.gmail.com
Lists: pgsql-performance

[snip]
>
> Adding drives will not let you get lower response times than the average seek
> time on your drives*. But it will let you reach that response time more often.
>
[snip]

I believe your assertion is fundamentally flawed. Adding more drives
will not let you reach that response time more often. All drives are
required to fill every request in all RAID levels (except possibly
0+1, but that isn't used for enterprise applications). Most requests
in OLTP spend most of the request time seeking, not reading. Only
in single large-block data transfers will you get any benefit from
adding more drives, which is atypical in most database applications.
For most database applications, the only way to increase
transactions/sec is to decrease request service time, which is
generally achieved with better seek times or a better controller card,
or possibly by spreading your database across multiple tablespaces on
separate partitions.

My assertion therefore is that simply adding more drives to an already
competent* configuration is about as likely to increase your database
effectiveness as Swiss cheese is to make your car run faster.

Alex Turner
netEconomist

*The assumption here is that the DBA didn't simply configure all tables and
xlog on a single 7200 RPM disk, but has separate physical drives for
xlog and tablespace, at least on 10k drives.


From: John A Meinel <john(at)arbash-meinel(dot)com>
To: Alex Turner <armtuk(at)gmail(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, William Yu <wyu(at)talisys(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: How to improve db performance with $7K?
Date: 2005-04-18 17:16:14
Message-ID: 4263EB5E.8000309@arbash-meinel.com
Lists: pgsql-performance

Alex Turner wrote:

>[snip]
>
>
>>Adding drives will not let you get lower response times than the average seek
>>time on your drives*. But it will let you reach that response time more often.
>>
>>
>>
>[snip]
>
>I believe your assertion is fundamentaly flawed. Adding more drives
>will not let you reach that response time more often. All drives are
>required to fill every request in all RAID levels (except possibly
>0+1, but that isn't used for enterprise applicaitons).
>
Actually 0+1 is the recommended configuration for Postgres databases
(both for xlog and for the bulk data), because the write speed of RAID5
is quite poor.
Hence your base assumption is not correct, and adding drives *does* help.

>Most requests
>in OLTP require most of the request time to seek, not to read. Only
>in single large block data transfers will you get any benefit from
>adding more drives, which is atypical in most database applications.
>For most database applications, the only way to increase
>transactions/sec is to decrease request service time, which is
>generaly achieved with better seek times or a better controller card,
>or possibly spreading your database accross multiple tablespaces on
>seperate paritions.
>
>
This is probably true. However, if you are doing lots of concurrent
connections, and things are properly spread across multiple spindles
(using RAID 0+1, or possibly tablespaces across multiple RAIDs), then
each seek occurs on a separate drive, which allows them to occur at
the same time, rather than sequentially. Having 2 processes competing
for seeks on the same drive is going to be worse than having them on
separate drives.
John
=:->


From: Jacques Caron <jc(at)directinfos(dot)com>
To: Alex Turner <armtuk(at)gmail(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, William Yu <wyu(at)talisys(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: How to improve db performance with $7K?
Date: 2005-04-18 17:32:09
Message-ID: 6.2.0.14.0.20050418192528.04238088@pop.interactivemediafactory.net
Lists: pgsql-performance

Hi,

At 18:56 18/04/2005, Alex Turner wrote:
>All drives are required to fill every request in all RAID levels

No, this is definitely wrong. In many cases, most drives don't actually
have the data requested -- how could they handle the request?

When reading one random sector, only *one* drive out of N is ever used to
service any given request, be it RAID 0, 1, 0+1, 1+0 or 5.

When writing:
- in RAID 0, 1 drive
- in RAID 1, RAID 0+1 or 1+0, 2 drives
- in RAID 5, you need to read on all drives and write on 2.

Otherwise, what would be the point of RAID 0, 0+1 or 1+0?

Jacques.


From: Alan Stange <stange(at)rentec(dot)com>
To: Alex Turner <armtuk(at)gmail(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, William Yu <wyu(at)talisys(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: How to improve db performance with $7K?
Date: 2005-04-18 17:34:28
Message-ID: 4263EFA4.2000900@rentec.com
Lists: pgsql-performance

Alex Turner wrote:

>[snip]
>
>
>>Adding drives will not let you get lower response times than the average seek
>>time on your drives*. But it will let you reach that response time more often.
>>
>>
>>
>[snip]
>
>I believe your assertion is fundamentaly flawed. Adding more drives
>will not let you reach that response time more often. All drives are
>required to fill every request in all RAID levels (except possibly
>0+1, but that isn't used for enterprise applicaitons). Most requests
>in OLTP require most of the request time to seek, not to read. Only
>in single large block data transfers will you get any benefit from
>adding more drives, which is atypical in most database applications.
>For most database applications, the only way to increase
>transactions/sec is to decrease request service time, which is
>generaly achieved with better seek times or a better controller card,
>or possibly spreading your database accross multiple tablespaces on
>seperate paritions.
>
>My assertion therefore is that simply adding more drives to an already
>competent* configuration is about as likely to increase your database
>effectiveness as swiss cheese is to make your car run faster.
>
>

Consider the case of a mirrored file system with a mostly read()
workload. Typical behavior is to use a round-robin method for issuing
the read operations to each mirror in turn, but one can use other
methods, like a geometric algorithm that will issue the reads to the
drive whose head is located closest to the desired track. Some
systems have many mirrors of the data for exactly this reason. In
fact, one can carry this logic to the extreme and have one drive for
every cylinder in the mirror, thus removing seek latencies completely.
Indeed, this extreme case would also remove the rotational latency, as
the cylinder would be in the disk's read cache. :-) Of course, writing
data would be a bit slow!

I'm not sure I understand your assertion that "all drives are required
to fill every request in all RAID levels". After all, in mirrored
reads only one mirror needs to read any given block of data, so I don't
know what goal is achieved in making other mirrors read the same data.

My assertion (based on ample personal experience) is that one can
*always* get improved performance by adding more drives. Just limit the
drives to use the first few cylinders so that the average seek time is
greatly reduced and concatenate the drives together. One can then build
the usual RAID device out of these concatenated metadevices. Yes, one
is wasting lots of disk space, but that's life. If your goal is
performance, then you need to put your money on the table. The
system will be somewhat unreliable because of the device count,
additional SCSI buses, etc., but that too is life in the high
performance world.

-- Alan


From: Jacques Caron <jc(at)directinfos(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: William Yu <wyu(at)talisys(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: How to improve db performance with $7K?
Date: 2005-04-18 17:41:49
Message-ID: 6.2.0.14.0.20050418183007.03d0ce18@pop.interactivemediafactory.net
Lists: pgsql-performance

Hi,

At 16:59 18/04/2005, Greg Stark wrote:

>William Yu <wyu(at)talisys(dot)com> writes:
>
> > Using the above prices for a fixed budget for RAID-10, you could get:
> >
> > SATA 7200 -- 680MB per $1000
> > SATA 10K -- 200MB per $1000
> > SCSI 10K -- 125MB per $1000
>
>What a lot of these analyses miss is that cheaper == faster because cheaper
>means you can buy more spindles for the same price. I'm assuming you picked
>equal sized drives to compare so that 200MB/$1000 for SATA is almost twice as
>many spindles as the 125MB/$1000. That means it would have almost double the
>bandwidth. And the 7200 RPM case would have more than 5x the bandwidth.
>
>While 10k RPM drives have lower seek times, and SCSI drives have a natural
>seek time advantage, under load a RAID array with fewer spindles will start
>hitting contention sooner which results into higher latency. If the controller
>works well the larger SATA arrays above should be able to maintain their
>mediocre latency much better under load than the SCSI array with fewer drives
>would maintain its low latency response time despite its drives' lower average
>seek time.

I would definitely agree. More factors in favor of more cheap drives:
- cheaper drives (7200 rpm) have larger platters (3.7" diameter against 2.6 or
3.3). That means the outer tracks hold more data, and the same amount of
data is held on a smaller area, which means fewer tracks, which means
reduced seek times. You can roughly estimate the real average seek time as
(average seek time over full disk * size of dataset / capacity of disk).
And you actually need to physically seek less often too.

- more disks means less data per disk, which means the data is further
concentrated on outer tracks, which means even lower seek times

Also, what counts is indeed not so much the time it takes to do one single
random seek, but the number of random seeks you can do per second. Hence,
more disks means more seeks per second (if requests are evenly distributed
among all disks, which a good stripe size should achieve).

Not taking into account TCQ/NCQ or write cache optimizations, the important
parameter (random seeks per second) can be approximated as:

N * 1000 / (lat + seek * ds / (N * cap))

Where:
N is the number of disks
lat is the average rotational latency in milliseconds (500/(rpm/60))
seek is the average seek over the full disk in milliseconds
ds is the dataset size
cap is the capacity of each disk
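
(A quick sketch of this estimate in Python; the drive counts, seek times and
capacities below are made-up examples for illustration, not the configurations
discussed in the next paragraph:)

    def seeks_per_second(n, rpm, full_seek_ms, dataset_gb, capacity_gb):
        lat = 500.0 / (rpm / 60.0)             # average rotational latency, ms
        # Only dataset_gb / (n * capacity_gb) of each disk is actually used,
        # so the effective average seek shrinks proportionally.
        seek = full_seek_ms * dataset_gb / (n * capacity_gb)
        return n * 1000.0 / (lat + seek)

    # 100 GB dataset on many cheap drives vs. a few fast ones (made-up specs):
    print(seeks_per_second(n=12, rpm=7200, full_seek_ms=9.0,
                           dataset_gb=100, capacity_gb=80))   # ~2350 seeks/s
    print(seeks_per_second(n=4, rpm=10000, full_seek_ms=5.0,
                           dataset_gb=100, capacity_gb=36))   # ~620 seeks/s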

Using this formula and a variety of disks, counting only the disks
themselves (no enclosures, controllers, rack space, power, maintenance...),
trying to maximize the number of seeks/second for a fixed budget (1000
euros) with a dataset size of 100 GB makes SATA drives clear winners: you
can get more than 4000 seeks/second (with 21 x 80GB disks) where SCSI
cannot even make it to the 1400 seeks/second point (with 8 x 36 GB disks).
Results can vary quite a lot based on the dataset size, which illustrates
the importance of "staying on the edges" of the disks. I'll try to make the
analysis more complete by counting some of the "overhead" (obviously 21
drives have a lot of other implications!), but I believe SATA drives still
win in theory.

It would be interesting to actually compare this to real-world (or
nearly-real-world) benchmarks to measure the effectiveness of features like
TCQ/NCQ etc.

Jacques.


From: Steve Poe <spoe(at)sfnet(dot)cc>
To: Alex Turner <armtuk(at)gmail(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: How to improve db performance with $7K?
Date: 2005-04-18 17:46:01
Message-ID: 4263F259.7030804@sfnet.cc
Lists: pgsql-performance

Alex,

In the situation of the animal hospital server I oversee, their
application is OLTP. Adding hard drives (6-8) does help performance.
Benchmarks like pgbench and OSDB agree with that, but in reality users
could not see a noticeable change. However, moving the top 5/10 tables and
indexes to their own space made a greater impact.

Someone who reads the PostgreSQL 8.0 Performance Checklist is going to see
that point #1, "add more disks", is the key. How about adding a subpoint
explaining when more disks isn't enough or applicable? I may be
generalizing the complexity of tuning an OLTP application, but some
clarity could help.

Steve Poe


From: Alex Turner <armtuk(at)gmail(dot)com>
To: Jacques Caron <jc(at)directinfos(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, William Yu <wyu(at)talisys(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: How to improve db performance with $7K?
Date: 2005-04-18 18:16:15
Message-ID: 33c6269f05041811165228c814@mail.gmail.com
Lists: pgsql-performance

Ok - well - I am partially wrong...

If your stripe size is 64KB and you are reading 256KB worth of data,
it will be spread across four drives, so you will need to read from
four devices to get your 256KB of data (RAID 0 or 5 or 10), but if you
are only reading 64KB of data, I guess you would only need to read
from one disk.
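
(A minimal sketch of that arithmetic -- ignoring alignment, which can add one
more disk to the count:)

    import math

    def disks_touched(request_kb, stripe_kb, n_disks):
        # A request spans roughly ceil(request/stripe) stripe chunks,
        # each on a different member disk (capped at the array size).
        return min(n_disks, math.ceil(request_kb / stripe_kb))

    print(disks_touched(256, 64, 4))   # 4 -> every member disk has to seek
    print(disks_touched(64, 64, 4))    # 1 -> one disk seeks, the others stay free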

So my assertion that adding more drives doesn't help is pretty
wrong... particularly with OLTP, because it's always dealing with
blocks that are smaller than the stripe size.

Alex Turner
netEconomist

On 4/18/05, Jacques Caron <jc(at)directinfos(dot)com> wrote:
> Hi,
>
> At 18:56 18/04/2005, Alex Turner wrote:
> >All drives are required to fill every request in all RAID levels
>
> No, this is definitely wrong. In many cases, most drives don't actually
> have the data requested, how could they handle the request?
>
> When reading one random sector, only *one* drive out of N is ever used to
> service any given request, be it RAID 0, 1, 0+1, 1+0 or 5.
>
> When writing:
> - in RAID 0, 1 drive
> - in RAID 1, RAID 0+1 or 1+0, 2 drives
> - in RAID 5, you need to read on all drives and write on 2.
>
> Otherwise, what would be the point of RAID 0, 0+1 or 1+0?
>
> Jacques.
>
>


From: Alex Turner <armtuk(at)gmail(dot)com>
To: John A Meinel <john(at)arbash-meinel(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, William Yu <wyu(at)talisys(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: How to improve db performance with $7K?
Date: 2005-04-18 18:18:21
Message-ID: 33c6269f05041811187170139f@mail.gmail.com
Lists: pgsql-performance

Not true - the recommended RAID level is RAID 10, not RAID 0+1 (at
least I would never recommend 1+0 for anything).

RAID 10 and RAID 0+1 are _quite_ different. One gives you very good
redundancy, the other is only slightly better than RAID 5, but
operates faster in degraded mode (single drive).

Alex Turner
netEconomist

On 4/18/05, John A Meinel <john(at)arbash-meinel(dot)com> wrote:
> Alex Turner wrote:
>
> >[snip]
> >
> >
> >>Adding drives will not let you get lower response times than the average seek
> >>time on your drives*. But it will let you reach that response time more often.
> >>
> >>
> >>
> >[snip]
> >
> >I believe your assertion is fundamentaly flawed. Adding more drives
> >will not let you reach that response time more often. All drives are
> >required to fill every request in all RAID levels (except possibly
> >0+1, but that isn't used for enterprise applicaitons).
> >
> Actually 0+1 is the recommended configuration for postgres databases
> (both for xlog and for the bulk data), because the write speed of RAID5
> is quite poor.
> Hence you base assumption is not correct, and adding drives *does* help.
>
> >Most requests
> >in OLTP require most of the request time to seek, not to read. Only
> >in single large block data transfers will you get any benefit from
> >adding more drives, which is atypical in most database applications.
> >For most database applications, the only way to increase
> >transactions/sec is to decrease request service time, which is
> >generaly achieved with better seek times or a better controller card,
> >or possibly spreading your database accross multiple tablespaces on
> >seperate paritions.
> >
> >
> This is probably true. However, if you are doing lots of concurrent
> connections, and things are properly spread across multiple spindles
> (using RAID0+1, or possibly tablespaces across multiple raids).
> Then each seek occurs on a separate drive, which allows them to occur at
> the same time, rather than sequentially. Having 2 processes competing
> for seeking on the same drive is going to be worse than having them on
> separate drives.
> John
> =:->
>
>
>


From: Alex Turner <armtuk(at)gmail(dot)com>
To: Steve Poe <spoe(at)sfnet(dot)cc>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: How to improve db performance with $7K?
Date: 2005-04-18 18:19:36
Message-ID: 33c6269f05041811191e5047fd@mail.gmail.com
Lists: pgsql-performance

I think the "add more disks" thing is really from the point of view that
one disk isn't ever enough. You should really have at least four
drives configured into two RAID 1s. Most DBAs will know this, but
most average Joes won't.

Alex Turner
netEconomist

On 4/18/05, Steve Poe <spoe(at)sfnet(dot)cc> wrote:
> Alex,
>
> In the situation of the animal hospital server I oversee, their
> application is OLTP. Adding hard drives (6-8) does help performance.
> Benchmarks like pgbench and OSDB agree with it, but in reality users
> could not see noticeable change. However, moving the top 5/10 tables and
> indexes to their own space made a greater impact.
>
> Someone who reads PostgreSQL 8.0 Performance Checklist is going to see
> point #1 add more disks is the key. How about adding a subpoint to
> explaining when more disks isn't enough or applicable? I maybe
> generalizing the complexity of tuning an OLTP application, but some
> clarity could help.
>
> Steve Poe
>
>


From: Alex Turner <armtuk(at)gmail(dot)com>
To: Jacques Caron <jc(at)directinfos(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, William Yu <wyu(at)talisys(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: How to improve db performance with $7K?
Date: 2005-04-18 18:21:04
Message-ID: 33c6269f0504181121385a1f0a@mail.gmail.com
Lists: pgsql-performance

So I wonder if one could take this stripe size thing further and say
that a larger stripe size is more likely to result in requests getting
served parallelized across disks, which would lead to increased
performance?

Again, thanks to all the people on this list; I know that I have learnt a
_hell_ of a lot since subscribing.

Alex Turner
netEconomist

On 4/18/05, Alex Turner <armtuk(at)gmail(dot)com> wrote:
> Ok - well - I am partially wrong...
>
> If you're stripe size is 64Kb, and you are reading 256k worth of data,
> it will be spread across four drives, so you will need to read from
> four devices to get your 256k of data (RAID 0 or 5 or 10), but if you
> are only reading 64kb of data, I guess you would only need to read
> from one disk.
>
> So my assertion that adding more drives doesn't help is pretty
> wrong... particularly with OLTP because it's always dealing with
> blocks that are smaller that the stripe size.
>
> Alex Turner
> netEconomist
>
> On 4/18/05, Jacques Caron <jc(at)directinfos(dot)com> wrote:
> > Hi,
> >
> > At 18:56 18/04/2005, Alex Turner wrote:
> > >All drives are required to fill every request in all RAID levels
> >
> > No, this is definitely wrong. In many cases, most drives don't actually
> > have the data requested, how could they handle the request?
> >
> > When reading one random sector, only *one* drive out of N is ever used to
> > service any given request, be it RAID 0, 1, 0+1, 1+0 or 5.
> >
> > When writing:
> > - in RAID 0, 1 drive
> > - in RAID 1, RAID 0+1 or 1+0, 2 drives
> > - in RAID 5, you need to read on all drives and write on 2.
> >
> > Otherwise, what would be the point of RAID 0, 0+1 or 1+0?
> >
> > Jacques.
> >
> >
>


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Jacques Caron <jc(at)directinfos(dot)com>
Cc: Alex Turner <armtuk(at)gmail(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, William Yu <wyu(at)talisys(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: How to improve db performance with $7K?
Date: 2005-04-18 18:24:14
Message-ID: 873btokkoh.fsf@stark.xeocode.com
Lists: pgsql-performance


Jacques Caron <jc(at)directinfos(dot)com> writes:

> When writing:
> - in RAID 0, 1 drive
> - in RAID 1, RAID 0+1 or 1+0, 2 drives
> - in RAID 5, you need to read on all drives and write on 2.

Actually RAID 5 only really needs to read from two drives: the existing parity
block and the block you're replacing. It just XORs the old block, the new
block, and the existing parity block to generate the new parity block.
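
(A tiny sketch of that read-modify-write parity update; the helper name is
made up for illustration:)

    def updated_parity(old_parity: bytes, old_block: bytes, new_block: bytes) -> bytes:
        # new parity = old parity XOR old data XOR new data
        return bytes(p ^ o ^ n for p, o, n in zip(old_parity, old_block, new_block))

    # Check with two toy data blocks a, b whose parity is a ^ b; replacing b
    # with c must yield the same parity as recomputing a ^ c from scratch.
    a, b, c = b"\x0f" * 4, b"\xf0" * 4, b"\x55" * 4
    parity = bytes(x ^ y for x, y in zip(a, b))
    assert updated_parity(parity, b, c) == bytes(x ^ y for x, y in zip(a, c))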

--
greg


From: Jacques Caron <jc(at)directinfos(dot)com>
To: Alex Turner <armtuk(at)gmail(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, William Yu <wyu(at)talisys(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: How to improve db performance with $7K?
Date: 2005-04-18 18:24:25
Message-ID: 6.2.0.14.0.20050418202039.03f12008@wheresmymailserver.com
Lists: pgsql-performance

Hi,

At 20:16 18/04/2005, Alex Turner wrote:
>So my assertion that adding more drives doesn't help is pretty
>wrong... particularly with OLTP because it's always dealing with
>blocks that are smaller that the stripe size.

When doing random seeks (which is what a database needs most of the time),
the number of disks helps improve the number of seeks per second (which is
the bottleneck in this case). When doing sequential reads, the number of
disks helps improve total throughput (which is the bottleneck in that case).

In short: it always helps :-)

Jacques.


From: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To: Alex Turner <armtuk(at)gmail(dot)com>
Cc: John A Meinel <john(at)arbash-meinel(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, William Yu <wyu(at)talisys(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: How to improve db performance with $7K?
Date: 2005-04-18 18:27:58
Message-ID: 4263FC2E.6070207@commandprompt.com
Lists: pgsql-performance

Alex Turner wrote:
> Not true - the recommended RAID level is RAID 10, not RAID 0+1 (at
> least I would never recommend 1+0 for anything).

Uhmm I was under the impression that 1+0 was RAID 10 and that 0+1 is NOT
RAID 10.

Ref: http://www.acnc.com/raid.html

Sincerely,

Joshua D. Drake



From: Jacques Caron <jc(at)directinfos(dot)com>
To: Alex Turner <armtuk(at)gmail(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, William Yu <wyu(at)talisys(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: How to improve db performance with $7K?
Date: 2005-04-18 18:43:45
Message-ID: 6.2.0.14.0.20050418202657.03cf3bd0@wheresmymailserver.com
Lists: pgsql-performance

Hi,

At 20:21 18/04/2005, Alex Turner wrote:
>So I wonder if one could take this stripe size thing further and say
>that a larger stripe size is more likely to result in requests getting
>served parallized across disks which would lead to increased
>performance?

Actually, it would be pretty much the opposite. The smaller the stripe
size, the more evenly distributed data is, and the more disks can be used
to serve requests. If your stripe size is too large, many random accesses
within one single file (whose size is smaller than the stripe size/number
of disks) may all end up on the same disk, rather than being split across
multiple disks (the extreme case being stripe size = total size of all
disks, which means concatenation). If all accesses had the same cost (i.e.
no seek time, only transfer time), the ideal would be to have a stripe size
equal to the number of disks.
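
(A small sketch of how striping maps a file onto member disks -- the stripe
sizes and the 8-disk array here are arbitrary examples:)

    def disk_for_offset(offset_kb, stripe_kb, n_disks):
        # Consecutive stripe-sized chunks rotate round-robin across the disks.
        return (offset_kb // stripe_kb) % n_disks

    file_kb = 512
    for stripe_kb in (64, 1024):
        used = {disk_for_offset(off, stripe_kb, 8) for off in range(0, file_kb, 8)}
        print(f"stripe {stripe_kb} KB: a {file_kb} KB file sits on disks {sorted(used)}")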

But below a certain size, you're going to use multiple disks to serve one
single request which would not have taken much more time from a single disk
(reading even a large number of consecutive blocks within one cylinder does
not take much more time than reading a single block), so you would add
unnecessary seeks on a disk that could have served another request in the
meantime. You should definitely not go below the filesystem block size or
the database block size.

There is an interesting discussion of the optimal stripe size in the vinum
manpage on FreeBSD:

http://www.freebsd.org/cgi/man.cgi?query=vinum&apropos=0&sektion=0&manpath=FreeBSD+5.3-RELEASE+and+Ports&format=html

(look for "Performance considerations", towards the end -- note however
that some of the calculations are not entirely correct).

Basically it says the optimal stripe size is somewhere between 256KB and
4MB, preferably an odd number, and that some hardware RAID controllers
don't like big stripe sizes. YMMV, as always.

Jacques.


From: Alex Turner <armtuk(at)gmail(dot)com>
To: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Cc: John A Meinel <john(at)arbash-meinel(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, William Yu <wyu(at)talisys(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: How to improve db performance with $7K?
Date: 2005-04-18 18:50:46
Message-ID: 33c6269f05041811504c737b6f@mail.gmail.com
Lists: pgsql-performance

Mistype... I meant 0+1 in the second instance :(

On 4/18/05, Joshua D. Drake <jd(at)commandprompt(dot)com> wrote:
> Alex Turner wrote:
> > Not true - the recommended RAID level is RAID 10, not RAID 0+1 (at
> > least I would never recommend 1+0 for anything).
>
> Uhmm I was under the impression that 1+0 was RAID 10 and that 0+1 is NOT
> RAID 10.
>
> Ref: http://www.acnc.com/raid.html
>
> Sincerely,
>
> Joshua D. Drake
>
>
>
>


From: Alex Turner <armtuk(at)gmail(dot)com>
To: Jacques Caron <jc(at)directinfos(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, William Yu <wyu(at)talisys(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: How to improve db performance with $7K?
Date: 2005-04-18 19:26:31
Message-ID: 33c6269f050418122650848e37@mail.gmail.com
Lists: pgsql-performance

On 4/18/05, Jacques Caron <jc(at)directinfos(dot)com> wrote:
> Hi,
>
> At 20:21 18/04/2005, Alex Turner wrote:
> >So I wonder if one could take this stripe size thing further and say
> >that a larger stripe size is more likely to result in requests getting
> >served parallized across disks which would lead to increased
> >performance?
>
> Actually, it would be pretty much the opposite. The smaller the stripe
> size, the more evenly distributed data is, and the more disks can be used
> to serve requests. If your stripe size is too large, many random accesses
> within one single file (whose size is smaller than the stripe size/number
> of disks) may all end up on the same disk, rather than being split across
> multiple disks (the extreme case being stripe size = total size of all
> disks, which means concatenation). If all accesses had the same cost (i.e.
> no seek time, only transfer time), the ideal would be to have a stripe size
> equal to the number of disks.
>
[snip]

Ahh yes - but the critical distinction is this:
the smaller the stripe size, the more disks will be used to serve _a_
request - which is bad for OLTP, because you want fewer disks per
request so that you can have more requests per second (the cost is
mostly seek). If more than one disk has to seek to serve a single
request, you are preventing that disk from serving a second request at
the same time.

To have more throughput in MB/sec, you want a smaller stripe size, so
that you have more disks serving a single request, allowing you to
multiply by the number of effective drives to get total bandwidth.

Because OLTP is made up of small reads and writes to a small number of
different files, I would guess that you want those files split up
across your RAID, but not so much that a single small read or write
operation would traverse more than one disk. That would imply that
your optimal stripe size is somewhere on the right side of the bell
curve that represents your database read and write block count
distribution. If on average the dbwriter never flushes less than 1MB
to disk at a time, then I guess your best stripe size would be 1MB,
but that seems very large to me.

So I think therefore that I may be contending the exact opposite of
what you are postulating!

Alex Turner
netEconomist


From: William Yu <wyu(at)talisys(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: How to improve db performance with $7K?
Date: 2005-04-18 20:37:11
Message-ID: d415ps$2cir$1@news.hub.org
Lists: pgsql-performance

Oooops, I revived the never-ending $7K thread. :)

Well, part of my message is to first take another look at the idea that SATA
is cheap but slow. Most people look at SATA from the view of consumer-level
drives with no NCQ/TCQ -- basically these drives are IDEs that can connect
to SATA cables. But if you then look at the server-level SATAs from WD,
you see performance close to server-level 10K SCSIs, and pricing also close.

Starting with the idea of using 20 consumer-level SATA drives versus 4
10K SCSIs, the main problem of course is the lack of advanced queueing
in these drives. I'm sure there's some threshold where the number of
drives advantage exceeds the disadvantage of no queueing -- what that
is, I don't have a clue.

Now if you stuffed a ton of memory onto a SATA caching controller and
these controllers did the queue management instead of the drives, that
would eliminate most of the performance issues.

Then you're just left with the management issues: getting those 20
drives stuffed into a big case, and keeping a close eye on them, since
drive failure will be a much bigger deal.

Greg Stark wrote:
> William Yu <wyu(at)talisys(dot)com> writes:
>
>
>>Using the above prices for a fixed budget for RAID-10, you could get:
>>
>>SATA 7200 -- 680MB per $1000
>>SATA 10K -- 200MB per $1000
>>SCSI 10K -- 125MB per $1000
>
>
> What a lot of these analyses miss is that cheaper == faster because cheaper
> means you can buy more spindles for the same price. I'm assuming you picked
> equal sized drives to compare so that 200MB/$1000 for SATA is almost twice as
> many spindles as the 125MB/$1000. That means it would have almost double the
> bandwidth. And the 7200 RPM case would have more than 5x the bandwidth.
>
> While 10k RPM drives have lower seek times, and SCSI drives have a natural
> seek time advantage, under load a RAID array with fewer spindles will start
> hitting contention sooner which results into higher latency. If the controller
> works well the larger SATA arrays above should be able to maintain their
> mediocre latency much better under load than the SCSI array with fewer drives
> would maintain its low latency response time despite its drives' lower average
> seek time.
>


From: "Jim C(dot) Nasby" <decibel(at)decibel(dot)org>
To: Jacques Caron <jc(at)directinfos(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, William Yu <wyu(at)talisys(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: How to improve db performance with $7K?
Date: 2005-04-20 00:03:39
Message-ID: 20050420000339.GR58835@decibel.org
Lists: pgsql-performance

On Mon, Apr 18, 2005 at 07:41:49PM +0200, Jacques Caron wrote:
> It would be interesting to actually compare this to real-world (or
> nearly-real-world) benchmarks to measure the effectiveness of features like
> TCQ/NCQ etc.

I was just thinking that it would be very interesting to benchmark
different RAID configurations using dbt2. I don't know if this is
something that the lab is set up for or capable of, though.
--
Jim C. Nasby, Database Consultant decibel(at)decibel(dot)org
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"