Re: How to improve db performance with $7K?

Lists: pgsql-performance
From: Steve Poe <spoe(at)sfnet(dot)cc>
To: performance pgsql <pgsql-performance(at)postgresql(dot)org>
Subject: How to improve db performance with $7K?
Date: 2005-03-25 16:12:19
Message-ID: 42443863.4040302@sfnet.cc

Situation: A 24/7 animal hospital (100 employees) runs their business
on CentOS 3.3 (RHEL 3) and Postgres 7.4.2 (because they have to) off a 2-CPU
Xeon 2.8GHz, 4GB of RAM, and (3) SCSI disks in RAID 0 (zcav value 35MB per
sec). The database is 11GB, comprising over 100 tables and indexes from 1MB
to 2GB in size.

I recently told the hospital management team that, worst-case scenario, they
need to get the database on its own drive array, since the RAID 0 is a
disaster waiting to happen. Ideally, a new dual AMD server with a
6/7-disk configuration would be best for safety and performance, but
they don't have $15K. I said a separate drive array offers a balance
of safety and performance.

I have been given budget of $7K to accomplish a safer/faster database
through hardware upgrades. The objective is to get a drive array, but I
can use the budget any way I see fit to accomplish the goal.

Since I am a DBA novice, I did not physically build this server, nor did
I write the application the hospital runs on, but I have the opportunity
to make it better, so I thought I should seek some advice from those who
have been down this road before. Suggestions/ideas, anyone?

Thanks.

Steve Poe


From: Steve Poe <spoe(at)sfnet(dot)cc>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: performance pgsql <pgsql-performance(at)postgresql(dot)org>
Subject: Re: How to improve db performance with $7K?
Date: 2005-03-25 17:19:55
Message-ID: 4244483B.8020401@sfnet.cc

Tom,

From what I understand, the vendor used ProIV for development. When
they attempted to use 7.4.3, they had ODBC issues and something else I
honestly don't know about, but I was told that data was not coming through
properly. They're somewhat at the mercy of the ProIV people to give them
the stamp of approval; then the vendor will tell us what they support.

Thanks.

Steve Poe

Tom Lane wrote:

>Steve Poe <spoe(at)sfnet(dot)cc> writes:
>
>
>Situation: A 24/7 animal hospital (100 employees) runs their business
>on CentOS 3.3 (RHEL 3) and Postgres 7.4.2 (because they have to)
>>
>>
>
>[ itch... ] Surely they could at least move to 7.4.7 without pain.
>There are serious data-loss bugs known in 7.4.2.
>
> regards, tom lane
>
>---------------------------(end of broadcast)---------------------------
>TIP 7: don't forget to increase your free space map settings
>
>
>


From: Steve Poe <spoe(at)sfnet(dot)cc>
To: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Cc: performance pgsql <pgsql-performance(at)postgresql(dot)org>
Subject: Re: How to improve db performance with $7K?
Date: 2005-03-26 00:12:59
Message-ID: 4244A90B.2030200@sfnet.cc

You could build a dual Opteron with 4 GB of RAM and 12 10k Raptor SATA
drives with a battery-backed cache for about $7k or less.

Okay. You trust SATA drives? I've been leary of them for a production
database. Pardon my ignorance, but what is a "battery backed cache"? I
know the drives have a built-in cache but I don't if that's the same.
Are the 12 drives internal or an external chasis? Could you point me to
a place that this configuration exist?

>
> Or, if they are not CPU bound, just IO bound, you could easily
> add an external 12-drive array (even if SCSI) for less than $7k.
>
I don't believe it is CPU bound. At our busiest hour, the CPU is idle
about 70% on average, down to 30% idle at its heaviest. Context switching
averages about 4-5K per hour, with momentary peaks of 25-30K for a
minute. Overall disk performance is poor (35MB per sec).
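
For reference, idle figures like these fall out of the cumulative CPU jiffy
counters in a Linux /proc/stat "cpu" line; a quick sketch of the arithmetic
(the counter deltas below are made up, not from our box):

```shell
# Hypothetical user/nice/system/idle jiffy deltas over a sample interval,
# in the order they appear on the /proc/stat "cpu" line.
user=4705 nice=150 system=1120 idle=16250
total=$((user + nice + system + idle))
# idle time as a percentage of total CPU time over the interval
awk -v i="$idle" -v t="$total" 'BEGIN { printf "%.1f%% idle\n", 100*i/t }'
```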

Thanks for your input.

Steve Poe


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Steve Poe <spoe(at)sfnet(dot)cc>
Cc: performance pgsql <pgsql-performance(at)postgresql(dot)org>
Subject: Re: How to improve db performance with $7K?
Date: 2005-03-26 00:59:17
Message-ID: 20356.1111798757@sss.pgh.pa.us

Steve Poe <spoe(at)sfnet(dot)cc> writes:
> Situation: A 24/7 animal hospital (100 employees) runs their business
> on CentOS 3.3 (RHEL 3) and Postgres 7.4.2 (because they have to)

[ itch... ] Surely they could at least move to 7.4.7 without pain.
There are serious data-loss bugs known in 7.4.2.

regards, tom lane


From: Will LaShell <will(at)lashell(dot)net>
To: Steve Poe <spoe(at)sfnet(dot)cc>
Cc: performance pgsql <pgsql-performance(at)postgresql(dot)org>
Subject: Re: How to improve db performance with $7K?
Date: 2005-03-26 01:03:04
Message-ID: 4244B4C8.8010909@lashell.net

You can purchase a whole new dual Opteron 740, with 6 gigs of RAM, a
case to match, and 6 74-gig Ultra320 SCA drives for about $7k.

I know because that's what I bought one for 2 weeks ago, using Tyan's
dual board.

If you need some details and are willing to go that route, let me know
and I'll get you the information.

Sincerely,

Will LaShell

Steve Poe wrote:

> Situation: A 24/7 animal hospital (100 employees) runs their
> business on CentOS 3.3 (RHEL 3) and Postgres 7.4.2 (because they have to)
> off a 2-CPU Xeon 2.8GHz, 4GB of RAM, and (3) SCSI disks in RAID 0 (zcav
> value 35MB per sec). The database is 11GB, comprising over 100 tables and
> indexes from 1MB to 2GB in size.
>
> I recently told the hospital management team that, worst-case scenario,
> they need to get the database on its own drive array, since the RAID 0 is
> a disaster waiting to happen. Ideally, a new dual AMD server with a
> 6/7-disk configuration would be best for safety and performance, but
> they don't have $15K. I said a separate drive array offers a balance
> of safety and performance.
>
> I have been given budget of $7K to accomplish a safer/faster database
> through hardware upgrades. The objective is to get a drive array, but
> I can use the budget any way I see fit to accomplish the goal.
>
> Since I am a DBA novice, I did not physically build this server, nor
> did I write the application the hospital runs on, but I have the
> opportunity to make it better, so I thought I should seek some advice
> from those who have been down this road before. Suggestions/ideas,
> anyone?
>
> Thanks.
>
> Steve Poe
>


From: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To: Steve Poe <spoe(at)sfnet(dot)cc>
Cc: performance pgsql <pgsql-performance(at)postgresql(dot)org>
Subject: Re: How to improve db performance with $7K?
Date: 2005-03-26 01:13:40
Message-ID: 4244B744.6020908@commandprompt.com

Steve Poe wrote:

> Situation: A 24/7 animal hospital (100 employees) runs their
> business on CentOS 3.3 (RHEL 3) and Postgres 7.4.2 (because they have to)
> off a 2-CPU Xeon 2.8GHz, 4GB of RAM, and (3) SCSI disks in RAID 0 (zcav
> value 35MB per sec). The database is 11GB, comprising over 100 tables and
> indexes from 1MB to 2GB in size.
>
> I recently told the hospital management team that, worst-case scenario,
> they need to get the database on its own drive array, since the RAID 0 is
> a disaster waiting to happen. Ideally, a new dual AMD server with a
> 6/7-disk configuration would be best for safety and performance, but
> they don't have $15K. I said a separate drive array offers a balance
> of safety and performance.
>
> I have been given budget of $7K to accomplish a safer/faster database
> through hardware upgrades. The objective is to get a drive array, but
> I can use the budget any way I see fit to accomplish the goal.

You could build a dual Opteron with 4 GB of RAM and 12 10k Raptor SATA
drives with a battery-backed cache for about $7k or less.

Or, if they are not CPU bound, just IO bound, you could easily
add an external 12-drive array (even if SCSI) for less than $7k.

Sincerely,

Joshua D. Drake

>
> Since I am a DBA novice, I did not physically build this server, nor
> did I write the application the hospital runs on, but I have the
> opportunity to make it better, so I thought I should seek some advice
> from those who have been down this road before. Suggestions/ideas,
> anyone?
>
> Thanks.
>
> Steve Poe
>

--
Command Prompt, Inc., home of Mammoth PostgreSQL - S/ODBC and S/JDBC
Postgresql support, programming shared hosting and dedicated hosting.
+1-503-667-4564 - jd(at)commandprompt(dot)com - http://www.commandprompt.com
PostgreSQL Replicator -- production quality replication for PostgreSQL


From: Bjoern Metzdorf <bm(at)turtle-entertainment(dot)de>
To: Steve Poe <spoe(at)sfnet(dot)cc>
Cc: performance pgsql <pgsql-performance(at)postgresql(dot)org>
Subject: Re: How to improve db performance with $7K?
Date: 2005-03-26 09:59:15
Message-ID: 42453273.2060509@turtle-entertainment.de

Hi Steve,

> Okay. You trust SATA drives? I've been leery of them for a production
> database. Pardon my ignorance, but what is a "battery backed cache"? I
> know the drives have a built-in cache, but I don't know if that's the same.
> Are the 12 drives internal or in an external chassis? Could you point me to
> a place where this configuration exists?

Get 12 or 16 x 74GB Western Digital Raptor S-ATA drives, one 3ware
9500S-12 or two 3ware 9500S-8 RAID controllers with a battery backup
unit (in case of power loss, the controller saves unflushed data), a
decent Tyan board for the existing dual Xeon with 2 PCI-X slots, and a
matching 3U case for 12 drives (12 drives internal).

Here in Germany, chassis by Chenbro are quite popular; a matching one for
your needs would be the Chenbro RM312 or RM414
(http://61.30.15.60/product/product_preview.php?pid=90 and
http://61.30.15.60/product/product_preview.php?pid=95 respectively).

Take 6 or 10 drives in RAID 10 for pgdata, a 2-drive RAID 1 for transaction
logs (xlog), a 2-drive RAID 1 for OS and swap, and 2 spare disks.

That should give you about 250 MB/s reads and a 70 MB/s sustained write
rate with XFS.

Regards,
Bjoern


From: Steve Poe <spoe(at)sfnet(dot)cc>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Bjoern Metzdorf <bm(at)turtle-entertainment(dot)de>, performance pgsql <pgsql-performance(at)postgresql(dot)org>
Subject: Re: How to improve db performance with $7K?
Date: 2005-03-26 13:04:44
Message-ID: 42455DEC.2080301@sfnet.cc

>Steve, can we clarify that you are not currently having any performance
>issues, you're just worried about failure? Recommendations should be based
>on whether improving application speed is a requirement ...

Josh,

The priorities are: 1) improve safety/failure-prevention, 2) improve
performance.

The owner of the company wants greater performance (and I concur to a
certain degree), but the owner's vote is only 1/7 of the management team,
and the rest of the management team is not as focused on performance.
They all agree on safety/failure-prevention.

Steve


From: Steve Poe <spoe(at)sfnet(dot)cc>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: performance pgsql <pgsql-performance(at)postgresql(dot)org>
Subject: Re: How to improve db performance with $7K?
Date: 2005-03-26 13:19:20
Message-ID: 42456158.3090404@sfnet.cc


>The Chenbros are nice, but kinda pricey ($800) if Steve doesn't need the
>machine to be rackable.
>
>If your primary goal is redundancy, you may wish to consider the possibility
>of building a brand-new machine for $7k (you can do a lot of machine for
>$7000 if it doesn't have to be rackable) and re-configuring the old machine
>and using it as a replication or PITR backup. This would allow you to
>configure the new machine with only a moderate amount of hardware redundancy
>while still having 100% confidence in staying running.
>
>
>
Our servers are not racked, so a new one does not have to be. *If* it is
possible, I'd like to replace the main server with a new one. I could
tweak the new one the way I need it and work with the vendor to make
sure everything works well. In either case, I'll still need to test how
the positioning of the tables/indexes across a RAID 10 will perform. I am
also waiting on the ProIV developers' feedback. If their ProIV modules will
not run under AMD64, or take advantage of the processor, then I'll stick
with the server we have.

Steve Poe


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Bjoern Metzdorf <bm(at)turtle-entertainment(dot)de>
Cc: Steve Poe <spoe(at)sfnet(dot)cc>, performance pgsql <pgsql-performance(at)postgresql(dot)org>
Subject: Re: How to improve db performance with $7K?
Date: 2005-03-26 20:55:58
Message-ID: 200503261255.59007.josh@agliodbs.com

Bjoern, Josh, Steve,

> Get 12 or 16 x 74GB Western Digital Raptor S-ATA drives, one 3ware
> 9500S-12 or two 3ware 9500S-8 raid controllers with a battery backup
> unit (in case of power loss the controller saves unflushed data), a
> decent tyan board for the existing dual xeon with 2 pci-x slots and a
> matching 3U case for 12 drives (12 drives internal).

Based on both my testing and feedback from one of the WD Raptor engineers,
Raptors are still only optimal for 90%-read applications. This makes them a
great buy for web applications (which are usually 95% read) but a bad choice
for OLTP applications, which sounds more like what Steve is describing. For
those, it would be better to get 6 quality SCSI drives than 12 Raptors.

The reason for this is that SATA still doesn't do bi-directional traffic
(simultaneous reads and writes) very well, and OSes and controllers simply
haven't caught up with the drive spec and features. WD hopes that in a year
they will be able to offer a Raptor that performs all operations as well as a
10K SCSI drive, for 25% less ... but that's in the next generation of drives,
controllers and drivers.

Steve, can we clarify that you are not currently having any performance
issues, you're just worried about failure? Recommendations should be based
on whether improving application speed is a requirement ...

> Here in Germany, chassis by Chenbro are quite popular; a matching one for
> your needs would be the Chenbro RM312 or RM414
> (http://61.30.15.60/product/product_preview.php?pid=90 and
> http://61.30.15.60/product/product_preview.php?pid=95 respectively).

The Chenbros are nice, but kinda pricey ($800) if Steve doesn't need the
machine to be rackable.

If your primary goal is redundancy, you may wish to consider the possibility
of building a brand-new machine for $7k (you can do a lot of machine for
$7000 if it doesn't have to be rackable) and re-configuring the old machine
and using it as a replication or PITR backup. This would allow you to
configure the new machine with only a moderate amount of hardware redundancy
while still having 100% confidence in staying running.

--
Josh Berkus
Aglio Database Solutions
San Francisco


From: Steve Poe <spoe(at)sfnet(dot)cc>
To: Cott Lang <cott(at)internetstaff(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: How to improve db performance with $7K?
Date: 2005-03-28 17:36:46
Message-ID: 424840AE.3050200@sfnet.cc

Cott Lang wrote:

>Have you already considered application/database tuning? Adding
>indexes? shared_buffers large enough? etc.
>
>Your database doesn't seem that large for the hardware you've already
>got. I'd hate to spend $7k and end up back in the same boat. :)
>
>
Cott,

I agree with you. Unfortunately, I am not the developer of the
application. The vendor uses ProIV, which connects via ODBC. The vendor
could certainly do some tuning and create more indexes where applicable. I
am encouraging the vendor to take a more active role so we can work
together on this.

With hardware tuning, I am sure we can do better than 35MB per sec. Also,
moving the top 3 or 5 tables and indexes to their own slice of a RAID 10
and moving pg_xlog to its own drive will help too.
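
If we do move pg_xlog, my understanding is the usual approach is to relocate
the directory and leave a symlink behind, with the postmaster stopped. A
minimal sketch — the paths below are stand-ins created on the fly so the
commands can be tried safely, not our real cluster:

```shell
# Stand-in directories; on a real cluster PGDATA would be the actual data
# directory and WALDIR a mount point on the dedicated drive, and the
# postmaster would be stopped before doing any of this.
PGDATA="$(mktemp -d)/data"
WALDIR="$(mktemp -d)/wal"
mkdir -p "$PGDATA/pg_xlog" "$WALDIR"

# Relocate the WAL directory onto the dedicated spindle, then symlink back
# so Postgres still finds it at $PGDATA/pg_xlog.
mv "$PGDATA/pg_xlog" "$WALDIR/pg_xlog"
ln -s "$WALDIR/pg_xlog" "$PGDATA/pg_xlog"

ls -ld "$PGDATA/pg_xlog"   # now a symlink pointing at the WAL drive
```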

Since you asked about tuned settings, here's what we're using:

kernel.shmmax = 1073741824
shared_buffers = 10000
sort_mem = 8192
vacuum_mem = 65536
effective_cache_size = 65536
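
For anyone following along: in 7.4, shared_buffers and effective_cache_size
are counted in 8 KB pages while sort_mem and vacuum_mem are in KB, so those
settings work out roughly to:

```
# postgresql.conf (7.4) - same values, effective sizes spelled out
shared_buffers = 10000          # 8 KB pages  -> ~80 MB
sort_mem = 8192                 # KB per sort -> 8 MB
vacuum_mem = 65536              # KB          -> 64 MB
effective_cache_size = 65536    # 8 KB pages  -> 512 MB
# kernel.shmmax = 1073741824 bytes -> 1 GB, comfortably above shared_buffers
```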

Steve Poe


From: Cott Lang <cott(at)internetstaff(dot)com>
To: Steve Poe <spoe(at)sfnet(dot)cc>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: How to improve db performance with $7K?
Date: 2005-03-28 22:43:14
Message-ID: 1112049794.23284.29.camel@localhost.localdomain

Have you already considered application/database tuning? Adding
indexes? shared_buffers large enough? etc.

Your database doesn't seem that large for the hardware you've already
got. I'd hate to spend $7k and end up back in the same boat. :)

On Sat, 2005-03-26 at 13:04 +0000, Steve Poe wrote:
> >Steve, can we clarify that you are not currently having any performance
> >issues, you're just worried about failure? Recommendations should be based
> >on whether improving application speed is a requirement ...
>
> Josh,
>
> The priorities are: 1) improve safety/failure-prevention, 2) improve
> performance.
>
> The owner of the company wants greater performance (and I concur to a
> certain degree), but the owner's vote is only 1/7 of the management team,
> and the rest of the management team is not as focused on performance.
> They all agree on safety/failure-prevention.
>
> Steve
>
>
>
>
>
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 1: subscribe and unsubscribe commands go to majordomo(at)postgresql(dot)org


From: PFC <lists(at)boutiquenumerique(dot)com>
To: "Steve Poe" <spoe(at)sfnet(dot)cc>, "Cott Lang" <cott(at)internetstaff(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: How to improve db performance with $7K?
Date: 2005-03-29 09:48:34
Message-ID: op.sod768rgth1vuj@localhost


> With hardware tuning, I am sure we can do better than 35Mb per sec. Also

WTF ?

My Laptop does 19 MB/s (reading <10 KB files, reiser4) !

A recent desktop 7200rpm IDE drive
# hdparm -t /dev/hdc1
/dev/hdc1:
Timing buffered disk reads: 148 MB in 3.02 seconds = 49.01 MB/sec

# ll "DragonBall 001.avi"
-r--r--r-- 1 peufeu users 218M mar 9 20:07 DragonBall 001.avi

# time cat "DragonBall 001.avi" >/dev/null
real 0m4.162s
user 0m0.020s
sys 0m0.510s

(the file was not in the cache)
=> about 52 MB/s (reiser3.6)

So, you have a problem with your hardware...


From: Dave Cramer <dave(at)fastcrypt(dot)com>
To: PFC <lists(at)boutiquenumerique(dot)com>
Cc: Steve Poe <spoe(at)sfnet(dot)cc>, Cott Lang <cott(at)internetstaff(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: How to improve db performance with $7K?
Date: 2005-03-29 12:15:04
Message-ID: 424946C8.6020003@fastcrypt.com

Yeah, 35MB per sec is slow for a RAID controller; the 3ware mirrored is
about 50MB/sec, and striped is about 100.

Dave

PFC wrote:

>
>> With hardware tuning, I am sure we can do better than 35Mb per sec. Also
>
>
> WTF ?
>
> My Laptop does 19 MB/s (reading <10 KB files, reiser4) !
>
> A recent desktop 7200rpm IDE drive
> # hdparm -t /dev/hdc1
> /dev/hdc1:
> Timing buffered disk reads: 148 MB in 3.02 seconds = 49.01 MB/sec
>
> # ll "DragonBall 001.avi"
> -r--r--r-- 1 peufeu users 218M mar 9 20:07 DragonBall
> 001.avi
>
> # time cat "DragonBall 001.avi" >/dev/null
> real 0m4.162s
> user 0m0.020s
> sys 0m0.510s
>
> (the file was not in the cache)
> => about 52 MB/s (reiser3.6)
>
> So, you have a problem with your hardware...
>


From: Dave Cramer <pg(at)fastcrypt(dot)com>
To:
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: How to improve db performance with $7K?
Date: 2005-03-29 12:17:05
Message-ID: 42494741.7060806@fastcrypt.com

Yeah, 35MB per sec is slow for a RAID controller; the 3ware mirrored is
about 50MB/sec, and striped is about 100.

Dave

PFC wrote:

>
>> With hardware tuning, I am sure we can do better than 35Mb per sec. Also
>
>
> WTF ?
>
> My Laptop does 19 MB/s (reading <10 KB files, reiser4) !
>
> A recent desktop 7200rpm IDE drive
> # hdparm -t /dev/hdc1
> /dev/hdc1:
> Timing buffered disk reads: 148 MB in 3.02 seconds = 49.01 MB/sec
>
> # ll "DragonBall 001.avi"
> -r--r--r-- 1 peufeu users 218M mar 9 20:07 DragonBall
> 001.avi
>
> # time cat "DragonBall 001.avi" >/dev/null
> real 0m4.162s
> user 0m0.020s
> sys 0m0.510s
>
> (the file was not in the cache)
> => about 52 MB/s (reiser3.6)
>
> So, you have a problem with your hardware...
>

--
Dave Cramer
http://www.postgresintl.com
519 939 0336
ICQ#14675561


From: Cott Lang <cott(at)internetstaff(dot)com>
To: Steve Poe <spoe(at)sfnet(dot)cc>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: How to improve db performance with $7K?
Date: 2005-03-29 14:52:54
Message-ID: 1112107974.7903.28.camel@sixtyfour.internetstaff.com

On Mon, 2005-03-28 at 17:36 +0000, Steve Poe wrote:

> I agree with you. Unfortunately, I am not the developer of the
> application. The vendor uses ProIV, which connects via ODBC. The vendor
> could certainly do some tuning and create more indexes where applicable. I
> am encouraging the vendor to take a more active role so we can work
> together on this.

I've done a lot of browsing through pg_stat_activity, looking for queries
that either hang around for a while or show up very often, and using
explain to find out if they can use some assistance.

You may also find that a dump and restore with a reconfiguration to
mirrored drives speeds you up a lot - just from the dump and restore.

> With hardware tuning, I am sure we can do better than 35MB per sec. Also,
> moving the top 3 or 5 tables and indexes to their own slice of a RAID 10
> and moving pg_xlog to its own drive will help too.

If your database activity involves a lot of random I/O, 35MB per second
wouldn't be too bad.

While conventional wisdom is that pg_xlog on its own drives (I know you
meant plural :) ) is a big boost, in my particular case I could never
get a measurable boost that way. Obviously, YMMV.


From: Greg Stark <gsstark(at)mit(dot)edu>
To: pg(at)fastcrypt(dot)com
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: How to improve db performance with $7K?
Date: 2005-03-29 18:11:04
Message-ID: 87ekdyz5jb.fsf@stark.xeocode.com


Dave Cramer <pg(at)fastcrypt(dot)com> writes:

> PFC wrote:
> >
> > My Laptop does 19 MB/s (reading <10 KB files, reiser4) !
>
> Yeah, 35Mb per sec is slow for a raid controller, the 3ware mirrored is
> about 50Mb/sec, and striped is about 100

Well, you're comparing apples and oranges here. A modern 7200rpm drive should
be capable of doing 40-50MB/s, depending on the location of the data on the
disk.

But that's only doing sequential access, using something like dd and
without other processes intervening and causing seeks. In practice it seems
busy databases see random_page_costs of about 4, which for a drive with 10ms
seek time translates to only about 3.2MB/s.
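
That figure is easy to sanity-check with back-of-the-envelope arithmetic; a
sketch (the pages-per-seek factor is an assumption on my part, chosen to put
random I/O in the right ballpark, not an exact model):

```shell
# Random-I/O throughput estimate for a single drive. Assumptions (mine):
# 10 ms average seek, 8 KB Postgres pages, ~4 pages effectively fetched
# per seek thanks to readahead.
awk 'BEGIN {
  seeks_per_sec = 1 / 0.010           # 10 ms per random seek -> 100/s
  bytes = seeks_per_sec * 8192 * 4    # seeks * page size * pages per seek
  printf "%.2f MB/s\n", bytes / 1e6   # vs 40-50 MB/s sequential
}'
```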

I think the first order of business is getting pg_xlog onto its own device.
That alone should remove a lot of the seeking. If it's an ext3 device, I would
also consider moving the journal to a dedicated drive as well. (Or, if they're
SCSI drives, or you're sure the RAID controller is safe from write caching,
then just switch file systems to something that doesn't journal data.)

--
greg


From: Steve Poe <spoe(at)sfnet(dot)cc>
To:
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Follow-Up: How to improve db performance with $7K?
Date: 2005-04-01 02:01:01
Message-ID: 424CAB5D.9020807@sfnet.cc

Thanks for everyone's feedback on how to best improve our PostgreSQL
database for the animal hospital. I re-read the PostgreSQL 8.0
Performance Checklist just to keep focused.

We purchased (2) 4 x 146GB 10,000rpm SCSI U320 SCA drive arrays ($2600)
and (1) Sun W2100z dual AMD64 workstation with 4GB RAM ($2500). We did
not need a rack-mount server, so I thought Sun's workstation would do
fine. I'll double the RAM. Hopefully, this should out-perform our dual
2.8 Xeon with 4GB of RAM.

Now we need to purchase a good U320 RAID card. Any suggestions for
ones which run well under Linux?

These two drive arrays' main purpose is for our database. For those who have
messed with drive arrays before, how would you slice up the drive array?
Will database performance be affected by how our RAID 10 is configured? Any
suggestions?

Thanks.

Steve Poe


From: Thomas F(dot)O'Connell <tfo(at)sitening(dot)com>
To: Steve Poe <spoe(at)sfnet(dot)cc>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Follow-Up: How to improve db performance with $7K?
Date: 2005-04-01 08:09:50
Message-ID: c6991bbd1f642296f117134a9aed7ad5@sitening.com

I'd use two of your drives to create a mirrored partition where pg_xlog
resides separate from the actual data.

RAID 10 is probably appropriate for the remaining drives.

Fortunately, you're not using Dell, so you don't have to worry about
the Perc3/Di RAID controller, which is not so compatible with Linux...

-tfo

--
Thomas F. O'Connell
Co-Founder, Information Architect
Sitening, LLC
http://www.sitening.com/
110 30th Avenue North, Suite 6
Nashville, TN 37203-6320
615-260-0005

On Mar 31, 2005, at 9:01 PM, Steve Poe wrote:

> Thanks for everyone's feedback on how to best improve our PostgreSQL
> database for the animal hospital. I re-read the PostgreSQL 8.0
> Performance Checklist just to keep focused.
>
> We purchased (2) 4 x 146GB 10,000rpm SCSI U320 SCA drive arrays
> ($2600) and (1) Sun W2100z dual AMD64 workstation with 4GB RAM
> ($2500). We did not need a rack-mount server, so I thought Sun's
> workstation would do fine. I'll double the RAM. Hopefully, this should
> out-perform our dual 2.8 Xeon with 4GB of RAM.
>
> Now we need to purchase a good U320 RAID card. Any suggestions
> for ones which run well under Linux?
>
> These two drive arrays' main purpose is for our database. For those who
> have messed with drive arrays before, how would you slice up the drive
> array? Will database performance be affected by how our RAID 10 is
> configured? Any suggestions?
>
> Thanks.
>
> Steve Poe


From: Vivek Khera <vivek(at)khera(dot)org>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Follow-Up: How to improve db performance with $7K?
Date: 2005-04-01 21:23:13
Message-ID: 6cceaeb5dc30b08f54f7a37d3f1f90c9@khera.org


On Mar 31, 2005, at 9:01 PM, Steve Poe wrote:

> Now we need to purchase a good U320 RAID card. Any suggestions
> for ones which run well under Linux?

Not sure if it works with Linux, but under FreeBSD 5, the LSI MegaRAID
cards are well supported. You should be able to pick up a 320-2X with
128MB battery-backed cache for about $1k. Wicked fast... I'm surprised
you didn't go for the 15k RPM drives for a small extra cost.


From: Will LaShell <will(at)lashell(dot)net>
To: Vivek Khera <vivek(at)khera(dot)org>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Follow-Up: How to improve db performance with $7K?
Date: 2005-04-02 16:21:36
Message-ID: 424EC690.5030909@lashell.net

Vivek Khera wrote:

>
> On Mar 31, 2005, at 9:01 PM, Steve Poe wrote:
>
>> Now we need to purchase a good U320 RAID card. Any suggestions
>> for ones which run well under Linux?
>
>
> Not sure if it works with Linux, but under FreeBSD 5, the LSI MegaRAID
> cards are well supported. You should be able to pick up a 320-2X with
> 128MB battery-backed cache for about $1k. Wicked fast... I'm surprised
> you didn't go for the 15k RPM drives for a small extra cost.

Wow, okay, so I'm not sure where everyone's email went, but I got
over a week's worth of list emails at once.

Several of you have sent me requests on where we purchased our systems.
Compsource was the vendor, www.c-source.com or
www.compsource.com. The sales rep we have is Steve Taylor, or you
can talk to the sales manager, Tom. I've bought hardware from them
for the last 2 years and I've been very pleased. I'm sorry I wasn't able
to respond sooner.

Steve, the LSI MegaRAID cards are where it's at. I've had -great- luck
with them over the years. There were a few weird problems with a series
awhile back where the Linux driver needed to be tweaked by the developers
along with a new BIOS update. The 320 series is just as Vivek said:
wicked fast. Very strong cards. Be sure, though, when you order it to
specify the battery backup either with it, or make sure you buy the
right one for it. There are a couple of options with battery cache on
the cards that can trip you up.

Good luck with your systems! Now that I've got my email problems
resolved, I'm definitely more than happy to help with any information
you all need.


From: Alex Turner <armtuk(at)gmail(dot)com>
To: pg(at)fastcrypt(dot)com
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: How to improve db performance with $7K?
Date: 2005-04-04 13:43:52
Message-ID: 33c6269f0504040643317b69cd@mail.gmail.com

To be honest, I've yet to run across a SCSI configuration that can
touch the 3ware SATA controllers. I have yet to see one top 80MB/sec,
let alone 180MB/sec read or write, which is why we moved _away_ from
SCSI. I've seen Compaq, Dell and LSI controllers all do pathetically
badly on RAID 1, RAID 5 and RAID 10.

35MB/sec for a three-drive RAID 0 is not bad, it's appalling. The
hardware manufacturer should be publicly embarrassed for this kind of
speed. A single U320 10k drive can do close to 70MB/sec sustained.

If someone can offer benchmarks to the contrary (particularly in
linux), I would be greatly interested.

Alex Turner
netEconomist

On Mar 29, 2005 8:17 AM, Dave Cramer <pg(at)fastcrypt(dot)com> wrote:
> Yeah, 35Mb per sec is slow for a raid controller, the 3ware mirrored is
> about 50Mb/sec, and striped is about 100
>
> Dave
>
> PFC wrote:
>
> >
> >> With hardware tuning, I am sure we can do better than 35Mb per sec. Also
> >
> >
> > WTF ?
> >
> > My Laptop does 19 MB/s (reading <10 KB files, reiser4) !
> >
> > A recent desktop 7200rpm IDE drive
> > # hdparm -t /dev/hdc1
> > /dev/hdc1:
> > Timing buffered disk reads: 148 MB in 3.02 seconds = 49.01 MB/sec
> >
> > # ll "DragonBall 001.avi"
> > -r--r--r-- 1 peufeu users 218M mar 9 20:07 DragonBall
> > 001.avi
> >
> > # time cat "DragonBall 001.avi" >/dev/null
> > real 0m4.162s
> > user 0m0.020s
> > sys 0m0.510s
> >
> > (the file was not in the cache)
> > => about 52 MB/s (reiser3.6)
> >
> > So, you have a problem with your hardware...
> >
> > ---------------------------(end of broadcast)---------------------------
> > TIP 7: don't forget to increase your free space map settings
> >
> >
>
> --
> Dave Cramer
> http://www.postgresintl.com
> 519 939 0336
> ICQ#14675561
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: Have you searched our list archives?
>
> http://archives.postgresql.org
>


From: Steve Poe <spoe(at)sfnet(dot)cc>
To: Alex Turner <armtuk(at)gmail(dot)com>
Cc: pg(at)fastcrypt(dot)com, pgsql-performance(at)postgresql(dot)org
Subject: Re: How to improve db performance with $7K?
Date: 2005-04-04 14:39:20
Message-ID: 42515198.9090903@sfnet.cc
Lists: pgsql-performance

Alex Turner wrote:

>To be honest, I've yet to run across a SCSI configuration that can
>touch the 3ware SATA controllers. I have yet to see one top 80MB/sec,
>let alone 180MB/sec read or write, which is why we moved _away_ from
>SCSI. I've seen Compaq, Dell and LSI controllers all do pathetically
>badly on RAID 1, RAID 5 and RAID 10.
>
>
Alex,

How does the 3ware controller do under heavy writes back to the database?
It may have been Josh, but someone said that SATA does well with reads
but not writes. Wouldn't an equal number of SCSI drives outperform SATA?
I don't want to start a "which is better" war; I am just trying to learn
here. It would seem that the more drives you place in a RAID
configuration, the more performance would increase.

Steve Poe


From: Alex Turner <armtuk(at)gmail(dot)com>
To: Steve Poe <spoe(at)sfnet(dot)cc>
Cc: pg(at)fastcrypt(dot)com, pgsql-performance(at)postgresql(dot)org
Subject: Re: How to improve db performance with $7K?
Date: 2005-04-04 19:12:20
Message-ID: 33c6269f05040412126b5bc719@mail.gmail.com
Lists: pgsql-performance

I'm no drive expert, but it seems to me that our write performance is
excellent. I think what most people are concerned about is OLTP, where
you are doing heavy write _and_ heavy read workloads at the same time.

Our system is mostly read during the day, but we do a full system
update every night that is all writes, and it's very fast compared to
the smaller SCSI system we moved off of. Nearly a 6x speed
improvement, as fast as 900 rows/sec with a 48-byte record, one row
per transaction.

I don't know enough about how SATA works to really comment on its
performance as a protocol compared with SCSI. If anyone has a useful
link on that, it would be greatly appreciated.

More drives will give more throughput/sec, but not necessarily more
transactions/sec. For that you will need more RAM on the controller,
and definitely a BBU to keep your data safe.
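
A back-of-envelope note on the 900-rows/sec, one-row-per-transaction figure (my gloss, not the poster's analysis): with no write cache, each synchronous commit waits for the WAL platter to rotate back under the head, so commits/sec is capped near the log disk's rotation rate; a battery-backed controller cache acknowledges the write immediately, which is what lifts that ceiling.

```python
def max_commits_per_sec_no_cache(rpm):
    """Without a write cache, each commit waits roughly one rotation
    of the WAL disk, bounding commits/sec near the rotation rate."""
    return rpm / 60.0

# A 10K RPM spindle turns ~167 times/sec, so sustaining ~900
# single-row transactions/sec implies the fsyncs are being absorbed
# by a write cache (e.g. a BBU-protected controller).
ceiling_10k = max_commits_per_sec_no_cache(10_000)
```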

Alex Turner
netEconomist

On Apr 4, 2005 10:39 AM, Steve Poe <spoe(at)sfnet(dot)cc> wrote:
>
>
> Alex Turner wrote:
>
> >To be honest, I've yet to run across a SCSI configuration that can
> >touch the 3ware SATA controllers. I have yet to see one top 80MB/sec,
> >let alone 180MB/sec read or write, which is why we moved _away_ from
> >SCSI. I've seen Compaq, Dell and LSI controllers all do pathetically
> >badly on RAID 1, RAID 5 and RAID 10.
> >
> >
> Alex,
>
> How does the 3ware controller do in heavy writes back to the database?
> It may have been Josh, but someone said that SATA does well with reads
> but not writes. Would not equal amount of SCSI drives outperform SATA?
> I don't want to start a "whose better" war, I am just trying to learn
> here. It would seem the more drives you could place in a RAID
> configuration, the performance would increase.
>
> Steve Poe
>
>


From: Vivek Khera <vivek(at)khera(dot)org>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: How to improve db performance with $7K?
Date: 2005-04-04 19:23:33
Message-ID: cf58ceb65e6ab939e6c737cca23eef4c@khera.org
Lists: pgsql-performance


On Apr 4, 2005, at 3:12 PM, Alex Turner wrote:

> Our system is mostly read during the day, but we do a full system
> update everynight that is all writes, and it's very fast compared to
> the smaller SCSI system we moved off of. Nearly a 6x spead
> improvement, as fast as 900 rows/sec with a 48 byte record, one row
> per transaction.
>

Well, if you're not heavily multitasking, the advantage of SCSI is lost
on you.

Vivek Khera, Ph.D.
+1-301-869-4449 x806


From: Alex Turner <armtuk(at)gmail(dot)com>
To: Vivek Khera <vivek(at)khera(dot)org>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: How to improve db performance with $7K?
Date: 2005-04-04 19:33:35
Message-ID: 33c6269f05040412336cbfe3b4@mail.gmail.com
Lists: pgsql-performance

I'm doing some research on SATA vs SCSI right now, but to be honest
I'm not turning up much at the protocol level. A lot of sloppy
benchmarks compare 10k Raptor drives against top-of-the-line 15k
drives, where unsurprisingly the SCSI drives win, but of course cost 4
times as much. Although even in some of those, SATA wins or draws. I'm
trying to find something more apples-to-apples: 10k to 10k.

Alex Turner
netEconomist

On Apr 4, 2005 3:23 PM, Vivek Khera <vivek(at)khera(dot)org> wrote:
>
> On Apr 4, 2005, at 3:12 PM, Alex Turner wrote:
>
> > Our system is mostly read during the day, but we do a full system
> > update everynight that is all writes, and it's very fast compared to
> > the smaller SCSI system we moved off of. Nearly a 6x spead
> > improvement, as fast as 900 rows/sec with a 48 byte record, one row
> > per transaction.
> >
>
> Well, if you're not heavily multitasking, the advantage of SCSI is lost
> on you.
>
> Vivek Khera, Ph.D.
> +1-301-869-4449 x806
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 1: subscribe and unsubscribe commands go to majordomo(at)postgresql(dot)org
>


From: Kevin Brown <kevin(at)sysexperts(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Follow-Up: How to improve db performance with $7K?
Date: 2005-04-06 04:44:56
Message-ID: 20050406044456.GA19518@filer
Lists: pgsql-performance

Thomas F.O'Connell wrote:
> I'd use two of your drives to create a mirrored partition where pg_xlog
> resides separate from the actual data.
>
> RAID 10 is probably appropriate for the remaining drives.
>
> Fortunately, you're not using Dell, so you don't have to worry about
> the Perc3/Di RAID controller, which is not so compatible with
> Linux...

Hmm...I have to wonder how true this is these days.

My company has a Dell 2500 with a Perc3/Di running Debian Linux, with
the 2.6.10 kernel. The controller seems to work reasonably well,
though I wouldn't doubt that it's slower than a different one might
be. But so far we haven't had any reliability issues with it.

Now, the performance is pretty bad considering the setup -- a RAID 5
with five 73.6 gig SCSI disks (10K RPM, I believe). Reads through the
filesystem come through at about 65 megabytes/sec, writes about 35
megabytes/sec (at least, so says "bonnie -s 8192"). This is on a
system with a single 3 GHz Xeon and 1 gigabyte of memory. I'd expect
much better read performance from what is essentially a stripe of 4
fast SCSI disks.
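
As a rough sanity check on that expectation (assuming the ~70MB/sec per 10K U320 drive figure quoted earlier in the thread), a sequential read from a five-disk RAID 5 can stream from the equivalent of four data spindles:

```python
def raid5_seq_read_ceiling(n_disks, per_disk_mb_s):
    """Rough sequential-read ceiling for RAID 5: parity rotates across
    the set, so about (n_disks - 1) spindles' worth of data streams at
    once. Real controllers rarely approach this number."""
    return (n_disks - 1) * per_disk_mb_s

ceiling = raid5_seq_read_ceiling(5, 70)  # vs the ~65 MB/s observed
```

Even allowing for generous overhead, 65MB/sec against a ~280MB/sec ceiling suggests the controller, not the drives, is the bottleneck.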

While compatibility hasn't really been an issue, at least as far as
the basics go, I still agree with your general sentiment -- stay away
from the Dells, at least if they have the Perc3/Di controller. You'll
probably get much better performance out of something else.

--
Kevin Brown kevin(at)sysexperts(dot)com


From: William Yu <wyu(at)talisys(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: How to improve db performance with $7K?
Date: 2005-04-06 07:30:44
Message-ID: d30377$1jn8$1@news.hub.org
Lists: pgsql-performance

Alex Turner wrote:
> I'm no drive expert, but it seems to me that our write performance is
> excellent. I think what most are concerned about is OLTP where you
> are doing heavy write _and_ heavy read performance at the same time.
>
> Our system is mostly read during the day, but we do a full system
> update everynight that is all writes, and it's very fast compared to
> the smaller SCSI system we moved off of. Nearly a 6x spead
> improvement, as fast as 900 rows/sec with a 48 byte record, one row
> per transaction.

I started with SATA in a multi-read/multi-write environment. While it
ran pretty well with one thread writing, the addition of a second thread
(whether reading or writing) would cause exponential slowdowns.

I suffered through this for a week and then switched to SCSI.
Single-threaded performance was pretty similar, but with the advanced
command queueing SCSI has, I was able to do multiple reads/writes
simultaneously with only a small performance hit for each thread.

Perhaps having a SATA caching RAID controller might help this situation.
I don't know. It's pretty hard to justify buying a $$$ 3ware controller
just to test it when you could spend the same money on SCSI and have a
guarantee it'll work well under multi-IO scenarios.


From: "Steinar H(dot) Gunderson" <sgunderson(at)bigfoot(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Follow-Up: How to improve db performance with $7K?
Date: 2005-04-06 11:28:17
Message-ID: 20050406112817.GB31767@uio.no
Lists: pgsql-performance

On Tue, Apr 05, 2005 at 09:44:56PM -0700, Kevin Brown wrote:
> Now, the performance is pretty bad considering the setup -- a RAID 5
> with five 73.6 gig SCSI disks (10K RPM, I believe). Reads through the
> filesystem come through at about 65 megabytes/sec, writes about 35
> megabytes/sec (at least, so says "bonnie -s 8192"). This is on a
> system with a single 3 GHz Xeon and 1 gigabyte of memory. I'd expect
> much better read performance from what is essentially a stripe of 4
> fast SCSI disks.

Data point here: We have a Linux software RAID quite close to the setup you
describe, with an onboard Adaptec controller and four 146GB 10000rpm disks,
and we get about 65MB/sec sustained when writing to an ext3 filesystem
(actually, when wgetting a file off the gigabit LAN :-) ). I haven't tested
reading, though.

/* Steinar */
--
Homepage: http://www.sesse.net/


From: PFC <lists(at)boutiquenumerique(dot)com>
To: "Steinar H(dot) Gunderson" <sgunderson(at)bigfoot(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Follow-Up: How to improve db performance with $7K?
Date: 2005-04-06 13:26:33
Message-ID: op.sotbmjp6th1vuj@localhost
Lists: pgsql-performance


> and we get about 65MB/sec sustained when writing to an ext3 filesystem
> (actually, when wgetting a file off the gigabit LAN :-) ). I haven't

Well, unless you have 64-bit PCI, "standard" PCI does 133 MB/s, which
is then split into roughly two halves of 66.5 MB/s each for 1) reading
from the PCI network card and 2) writing to the PCI hard disk
controller. No wonder you get this figure: you're saturating your PCI
bus, and it tells you nothing about the performance of your disk or
network card... Note that the server which serves the file is limited
in the same way, unless the file is in cache (RAM) or it's PCI64. So...
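
Spelling that arithmetic out (32-bit, 33 MHz conventional PCI, with the NIC's inbound traffic and the disk controller's outbound writes sharing the one bus):

```python
bus_width_bytes = 4                           # 32-bit conventional PCI
clock_hz = 33.33e6                            # ~33 MHz bus clock
pci_mb_s = bus_width_bytes * clock_hz / 1e6   # ~133 MB/s total bus bandwidth
each_way = pci_mb_s / 2                       # NIC in + disk out share it
```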

> tested
> reading, though.
>
> /* Steinar */


From: "Steinar H(dot) Gunderson" <sgunderson(at)bigfoot(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Follow-Up: How to improve db performance with $7K?
Date: 2005-04-06 13:33:48
Message-ID: 20050406133348.GA16108@uio.no
Lists: pgsql-performance

On Wed, Apr 06, 2005 at 03:26:33PM +0200, PFC wrote:
> Well, unless you have PCI 64 bits, the "standard" PCI does 133 MB/s
> which is then split exactly in two times 66.5 MB/s for 1) reading from the
> PCI network card and 2) writing to the PCI harddisk controller. No wonder
> you get this figure, you're able to saturate your PCI bus, but it does not
> tell you a thing on the performance of your disk or network card... Note
> that the server which serves the file is limited in the same way unless
> the file is in cache (RAM) or it's PCI64. So...

This is PCI-X.

/* Steinar */
--
Homepage: http://www.sesse.net/


From: Alex Turner <armtuk(at)gmail(dot)com>
To: William Yu <wyu(at)talisys(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: How to improve db performance with $7K?
Date: 2005-04-06 15:35:10
Message-ID: 33c6269f050406083533d2045d@mail.gmail.com
Lists: pgsql-performance

It's hardly the same money; the drives are twice as much.

It's all about the controller, baby, with any kind of drive. A bad SCSI
controller will give sucky performance too, believe me. We had a
Compaq Smart Array 5304, and its performance was _very_ sub-par.

If someone has a simple benchmark test database to run, I would be
happy to run it on our hardware here.

Alex Turner

On Apr 6, 2005 3:30 AM, William Yu <wyu(at)talisys(dot)com> wrote:
> Alex Turner wrote:
> > I'm no drive expert, but it seems to me that our write performance is
> > excellent. I think what most are concerned about is OLTP where you
> > are doing heavy write _and_ heavy read performance at the same time.
> >
> > Our system is mostly read during the day, but we do a full system
> > update everynight that is all writes, and it's very fast compared to
> > the smaller SCSI system we moved off of. Nearly a 6x spead
> > improvement, as fast as 900 rows/sec with a 48 byte record, one row
> > per transaction.
>
> I've started with SATA in a multi-read/multi-write environment. While it
> ran pretty good with 1 thread writing, the addition of a 2nd thread
> (whether reading or writing) would cause exponential slowdowns.
>
> I suffered through this for a week and then switched to SCSI. Single
> threaded performance was pretty similar but with the advanced command
> queueing SCSI has, I was able to do multiple reads/writes simultaneously
> with only a small performance hit for each thread.
>
> Perhaps having a SATA caching raid controller might help this situation.
> I don't know. It's pretty hard justifying buying a $$$ 3ware controller
> just to test it when you could spend the same money on SCSI and have a
> guarantee it'll work good under multi-IO scenarios.
>
> ---------------------------(end of broadcast)---------------------------
> TIP 8: explain analyze is your friend
>


From: William Yu <wyu(at)talisys(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: How to improve db performance with $7K?
Date: 2005-04-06 20:01:35
Message-ID: d31f71$2epm$1@news.hub.org
Lists: pgsql-performance

It's the same money if you factor in the 3ware controller. Even without
a caching controller, SCSI works well under multi-threaded IO
(notwithstanding crappy shit from Dell or Compaq). You can get such
cards from LSI for $75. And of course, many server motherboards come
with LSI controllers built in. Our older 32-bit production servers all
use Linux software RAID w/ SCSI and there are no issues when multiple
users/processes hit the DB.

*Maybe* a 3ware controller w/ onboard cache + battery backup might do
much better for multi-threaded IO than just plain-jane SATA.
Unfortunately, I have not been able to find anything online that can
confirm or deny this. Hence the choice: spend $$$ on the 3ware
controller and hope it meets your needs, or spend $$$ on SCSI drives
and be sure.

Now if you want to run such tests, we'd all be delighted to see the
results so we have another option for building servers.

Alex Turner wrote:
> It's hardly the same money, the drives are twice as much.
>
> It's all about the controller baby with any kind of dive. A bad SCSI
> controller will give sucky performance too, believe me. We had a
> Compaq Smart Array 5304, and it's performance was _very_ sub par.
>
> If someone has a simple benchmark test database to run, I would be
> happy to run it on our hardware here.
>
> Alex Turner
>
> On Apr 6, 2005 3:30 AM, William Yu <wyu(at)talisys(dot)com> wrote:
>
>>Alex Turner wrote:
>>
>>>I'm no drive expert, but it seems to me that our write performance is
>>>excellent. I think what most are concerned about is OLTP where you
>>>are doing heavy write _and_ heavy read performance at the same time.
>>>
>>>Our system is mostly read during the day, but we do a full system
>>>update everynight that is all writes, and it's very fast compared to
>>>the smaller SCSI system we moved off of. Nearly a 6x spead
>>>improvement, as fast as 900 rows/sec with a 48 byte record, one row
>>>per transaction.
>>
>>I've started with SATA in a multi-read/multi-write environment. While it
>>ran pretty good with 1 thread writing, the addition of a 2nd thread
>>(whether reading or writing) would cause exponential slowdowns.
>>
>>I suffered through this for a week and then switched to SCSI. Single
>>threaded performance was pretty similar but with the advanced command
>>queueing SCSI has, I was able to do multiple reads/writes simultaneously
>>with only a small performance hit for each thread.
>>
>>Perhaps having a SATA caching raid controller might help this situation.
>>I don't know. It's pretty hard justifying buying a $$$ 3ware controller
>>just to test it when you could spend the same money on SCSI and have a
>>guarantee it'll work good under multi-IO scenarios.
>>
>>---------------------------(end of broadcast)---------------------------
>>TIP 8: explain analyze is your friend
>>
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 2: you can get off all lists at once with the unregister command
> (send "unregister YourEmailAddressHere" to majordomo(at)postgresql(dot)org)
>


From: Alex Turner <armtuk(at)gmail(dot)com>
To: William Yu <wyu(at)talisys(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: How to improve db performance with $7K?
Date: 2005-04-06 22:12:06
Message-ID: 33c6269f050406151241b01148@mail.gmail.com
Lists: pgsql-performance

Well, unfortunately software RAID isn't appropriate for everyone, and
some of us need a hardware RAID controller. The LSI MegaRAID 320-2
card is almost exactly the same price as the 3ware 9500S-12 card
(although I will concede that a 320-2 card can handle at most 2x14
devices, compared with the 12 on the 9500S).

If someone can come up with a test, I will be happy to run it and see
how it goes. I would be _very_ interested in the results, having just
spent $7k on a new DB server!!

I have also seen really bad performance out of SATA. It was with
either an on-board controller or a cheap RAID controller from
HighPoint. As soon as I put in a decent controller, things went much
better. I think it's unfair to base your opinion of SATA on a test
that had a poor controller.

I know I'm not the only one here running SATA RAID and being very
satisfied with the results.

Thanks,

Alex Turner
netEconomist

On Apr 6, 2005 4:01 PM, William Yu <wyu(at)talisys(dot)com> wrote:
> It's the same money if you factor in the 3ware controller. Even without
> a caching controller, SCSI works good in multi-threaded IO (not
> withstanding crappy shit from Dell or Compaq). You can get such cards
> from LSI for $75. And of course, many server MBs come with LSI
> controllers built-in. Our older 32-bit production servers all use Linux
> software RAID w/ SCSI and there's no issues when multiple
> users/processes hit the DB.
>
> *Maybe* a 3ware controller w/ onboard cache + battery backup might do
> much better for multi-threaded IO than just plain-jane SATA.
> Unfortunately, I have not been able to find anything online that can
> confirm or deny this. Hence, the choice is spend $$$ on the 3ware
> controller and hope it meets your needs -- or spend $$$ on SCSI drives
> and be sure.
>
> Now if you want to run such tests, we'd all be delighted with to see the
> results so we have another option for building servers.
>
>
> Alex Turner wrote:
> > It's hardly the same money, the drives are twice as much.
> >
> > It's all about the controller baby with any kind of dive. A bad SCSI
> > controller will give sucky performance too, believe me. We had a
> > Compaq Smart Array 5304, and it's performance was _very_ sub par.
> >
> > If someone has a simple benchmark test database to run, I would be
> > happy to run it on our hardware here.
> >
> > Alex Turner
> >
> > On Apr 6, 2005 3:30 AM, William Yu <wyu(at)talisys(dot)com> wrote:
> >
> >>Alex Turner wrote:
> >>
> >>>I'm no drive expert, but it seems to me that our write performance is
> >>>excellent. I think what most are concerned about is OLTP where you
> >>>are doing heavy write _and_ heavy read performance at the same time.
> >>>
> >>>Our system is mostly read during the day, but we do a full system
> >>>update everynight that is all writes, and it's very fast compared to
> >>>the smaller SCSI system we moved off of. Nearly a 6x spead
> >>>improvement, as fast as 900 rows/sec with a 48 byte record, one row
> >>>per transaction.
> >>
> >>I've started with SATA in a multi-read/multi-write environment. While it
> >>ran pretty good with 1 thread writing, the addition of a 2nd thread
> >>(whether reading or writing) would cause exponential slowdowns.
> >>
> >>I suffered through this for a week and then switched to SCSI. Single
> >>threaded performance was pretty similar but with the advanced command
> >>queueing SCSI has, I was able to do multiple reads/writes simultaneously
> >>with only a small performance hit for each thread.
> >>
> >>Perhaps having a SATA caching raid controller might help this situation.
> >>I don't know. It's pretty hard justifying buying a $$$ 3ware controller
> >>just to test it when you could spend the same money on SCSI and have a
> >>guarantee it'll work good under multi-IO scenarios.
> >>
> >>---------------------------(end of broadcast)---------------------------
> >>TIP 8: explain analyze is your friend
> >>
> >
> >
> > ---------------------------(end of broadcast)---------------------------
> > TIP 2: you can get off all lists at once with the unregister command
> > (send "unregister YourEmailAddressHere" to majordomo(at)postgresql(dot)org)
> >
>
> ---------------------------(end of broadcast)---------------------------
> TIP 1: subscribe and unsubscribe commands go to majordomo(at)postgresql(dot)org
>


From: "Jim C(dot) Nasby" <decibel(at)decibel(dot)org>
To: Alex Turner <armtuk(at)gmail(dot)com>
Cc: William Yu <wyu(at)talisys(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: How to improve db performance with $7K?
Date: 2005-04-06 22:41:02
Message-ID: 20050406224102.GM93835@decibel.org
Lists: pgsql-performance

Sorry if I'm pointing out the obvious here, but it seems worth
mentioning. AFAIK all 3ware controllers are set up so that each SATA
drive gets its own SATA bus. My understanding is that, by and large,
SATA still suffers from a general inability to have multiple outstanding
commands on the bus at once, unlike SCSI. Therefore, to get good
performance out of SATA you need to have a separate bus for each drive.
Theoretically, it shouldn't really matter that it's SATA rather than
ATA, other than that I certainly wouldn't want to try to cram 8 ATA
cables into a machine...

Incidentally, when we were investigating storage options at a previous
job we talked to someone who deals with RS/6000 storage. He had a bunch
of info about their serial controller protocol (which I can't think of
the name of) vs SCSI. SCSI had a lot more overhead, so you could end up
saturating even a 160MB/sec SCSI bus with only 2 or 3 drives.

People are finally realizing how important bandwidth has become in
modern machines. Memory bandwidth is why the RS/6000 was (and maybe
still is) cleaning Sun's clock, and it's why the Opteron blows Itaniums
out of the water. Likewise, it's why SCSI is so much better than IDE
(unless you just give each drive its own dedicated bandwidth).
--
Jim C. Nasby, Database Consultant decibel(at)decibel(dot)org
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"


From: Alex Turner <armtuk(at)gmail(dot)com>
To: "Jim C(dot) Nasby" <decibel(at)decibel(dot)org>
Cc: William Yu <wyu(at)talisys(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: How to improve db performance with $7K?
Date: 2005-04-07 00:32:50
Message-ID: 33c6269f0504061732195f6ae4@mail.gmail.com
Lists: pgsql-performance

I guess I'm setting myself up here, and I'm really not being ignorant,
but can someone explain exactly how SCSI is supposed to be better than
SATA?

Both systems use drives with platters. Each drive can physically only
read one thing at a time.

SATA gives each drive its own channel, but you have to share in SCSI.
A SATA controller can typically do 3Gb/sec (384MB/sec) per drive, but
SCSI can only do 320MB/sec across the entire array.

What am I missing here?
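
One wrinkle in the per-drive figure above (my note, not from the thread): SATA uses 8b/10b encoding, ten line bits per payload byte, so a 3Gb/sec link carries about 300MB/sec of payload rather than 384, and most drives shipping in 2005 were still 1.5Gb/sec SATA I. The per-drive-channel vs shared-bus contrast holds either way:

```python
def sata_payload_mb_s(line_rate_gb_s):
    """8b/10b encoding: 10 bits on the wire per byte of payload."""
    return line_rate_gb_s * 1e9 / 10 / 1e6

sata2_per_drive = sata_payload_mb_s(3.0)  # ~300 MB/s, per drive
sata1_per_drive = sata_payload_mb_s(1.5)  # ~150 MB/s, per drive
u320_shared = 320.0                       # MB/s, shared by the whole bus
```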

Alex Turner
netEconomist

On Apr 6, 2005 5:41 PM, Jim C. Nasby <decibel(at)decibel(dot)org> wrote:
> Sorry if I'm pointing out the obvious here, but it seems worth
> mentioning. AFAIK all 3ware controllers are setup so that each SATA
> drive gets it's own SATA bus. My understanding is that by and large,
> SATA still suffers from a general inability to have multiple outstanding
> commands on the bus at once, unlike SCSI. Therefore, to get good
> performance out of SATA you need to have a seperate bus for each drive.
> Theoretically, it shouldn't really matter that it's SATA over ATA, other
> than I certainly wouldn't want to try and cram 8 ATA cables into a
> machine...
>
> Incidentally, when we were investigating storage options at a previous
> job we talked to someone who deals with RS/6000 storage. He had a bunch
> of info about their serial controller protocol (which I can't think of
> the name of) vs SCSI. SCSI had a lot more overhead, so you could end up
> saturating even a 160MB SCSI bus with only 2 or 3 drives.
>
> People are finally realizing how important bandwidth has become in
> modern machines. Memory bandwidth is why RS/6000 was (and maybe still
> is) cleaning Sun's clock, and it's why the Opteron blows Itaniums out of
> the water. Likewise it's why SCSI is so much better than IDE (unless you
> just give each drive it's own dedicated bandwidth).
> --
> Jim C. Nasby, Database Consultant decibel(at)decibel(dot)org
> Give your computer some brain candy! www.distributed.net Team #1828
>
> Windows: "Where do you want to go today?"
> Linux: "Where do you want to go tomorrow?"
> FreeBSD: "Are you guys coming, or what?"
>


From: Alex Turner <armtuk(at)gmail(dot)com>
To: "Jim C(dot) Nasby" <decibel(at)decibel(dot)org>
Cc: William Yu <wyu(at)talisys(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: How to improve db performance with $7K?
Date: 2005-04-07 01:12:14
Message-ID: 33c6269f05040618124f5c02d9@mail.gmail.com
Lists: pgsql-performance

Ok - so I found this fairly good online review of various SATA cards
out there, with 3ware not doing too hot on RAID 5, but ok on RAID 10.

http://www.tweakers.net/reviews/557/

Very interesting stuff.

Alex Turner
netEconomist

On Apr 6, 2005 7:32 PM, Alex Turner <armtuk(at)gmail(dot)com> wrote:
> I guess I'm setting myself up here, and I'm really not being ignorant,
> but can someone explain exactly how is SCSI is supposed to better than
> SATA?
>
> Both systems use drives with platters. Each drive can physically only
> read one thing at a time.
>
> SATA gives each drive it's own channel, but you have to share in SCSI.
> A SATA controller typicaly can do 3Gb/sec (384MB/sec) per drive, but
> SCSI can only do 320MB/sec across the entire array.
>
> What am I missing here?
>
> Alex Turner
> netEconomist
>
> On Apr 6, 2005 5:41 PM, Jim C. Nasby <decibel(at)decibel(dot)org> wrote:
> > Sorry if I'm pointing out the obvious here, but it seems worth
> > mentioning. AFAIK all 3ware controllers are setup so that each SATA
> > drive gets it's own SATA bus. My understanding is that by and large,
> > SATA still suffers from a general inability to have multiple outstanding
> > commands on the bus at once, unlike SCSI. Therefore, to get good
> > performance out of SATA you need to have a seperate bus for each drive.
> > Theoretically, it shouldn't really matter that it's SATA over ATA, other
> > than I certainly wouldn't want to try and cram 8 ATA cables into a
> > machine...
> >
> > Incidentally, when we were investigating storage options at a previous
> > job we talked to someone who deals with RS/6000 storage. He had a bunch
> > of info about their serial controller protocol (which I can't think of
> > the name of) vs SCSI. SCSI had a lot more overhead, so you could end up
> > saturating even a 160MB SCSI bus with only 2 or 3 drives.
> >
> > People are finally realizing how important bandwidth has become in
> > modern machines. Memory bandwidth is why RS/6000 was (and maybe still
> > is) cleaning Sun's clock, and it's why the Opteron blows Itaniums out of
> > the water. Likewise it's why SCSI is so much better than IDE (unless you
> > just give each drive it's own dedicated bandwidth).
> > --
> > Jim C. Nasby, Database Consultant decibel(at)decibel(dot)org
> > Give your computer some brain candy! www.distributed.net Team #1828
> >
> > Windows: "Where do you want to go today?"
> > Linux: "Where do you want to go tomorrow?"
> > FreeBSD: "Are you guys coming, or what?"
> >
>


From: Alex Turner <armtuk(at)gmail(dot)com>
To: "Jim C(dot) Nasby" <decibel(at)decibel(dot)org>
Cc: William Yu <wyu(at)talisys(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: How to improve db performance with $7K?
Date: 2005-04-07 01:23:59
Message-ID: 33c6269f050406182365961772@mail.gmail.com
Lists: pgsql-performance

Ok - I take it back - I'm reading through this now, and realising that
the reviews are pretty clueless in several places...

On Apr 6, 2005 8:12 PM, Alex Turner <armtuk(at)gmail(dot)com> wrote:
> Ok - so I found this fairly good online review of various SATA cards
> out there, with 3ware not doing too hot on RAID 5, but ok on RAID 10.
>
> http://www.tweakers.net/reviews/557/
>
> Very interesting stuff.
>
> Alex Turner
> netEconomist
>
> On Apr 6, 2005 7:32 PM, Alex Turner <armtuk(at)gmail(dot)com> wrote:
> > I guess I'm setting myself up here, and I'm really not being ignorant,
> > but can someone explain exactly how is SCSI is supposed to better than
> > SATA?
> >
> > Both systems use drives with platters. Each drive can physically only
> > read one thing at a time.
> >
> > SATA gives each drive it's own channel, but you have to share in SCSI.
> > A SATA controller typicaly can do 3Gb/sec (384MB/sec) per drive, but
> > SCSI can only do 320MB/sec across the entire array.
> >
> > What am I missing here?
> >
> > Alex Turner
> > netEconomist
> >
> > On Apr 6, 2005 5:41 PM, Jim C. Nasby <decibel(at)decibel(dot)org> wrote:
> > > Sorry if I'm pointing out the obvious here, but it seems worth
> > > mentioning. AFAIK all 3ware controllers are set up so that each SATA
> > > drive gets its own SATA bus. My understanding is that by and large,
> > > SATA still suffers from a general inability to have multiple outstanding
> > > commands on the bus at once, unlike SCSI. Therefore, to get good
> > > performance out of SATA you need to have a separate bus for each drive.
> > > Theoretically, it shouldn't really matter that it's SATA over ATA, other
> > > than I certainly wouldn't want to try and cram 8 ATA cables into a
> > > machine...
> > >
> > > Incidentally, when we were investigating storage options at a previous
> > > job we talked to someone who deals with RS/6000 storage. He had a bunch
> > > of info about their serial controller protocol (which I can't think of
> > > the name of) vs SCSI. SCSI had a lot more overhead, so you could end up
> > > saturating even a 160MB SCSI bus with only 2 or 3 drives.
> > >
> > > People are finally realizing how important bandwidth has become in
> > > modern machines. Memory bandwidth is why RS/6000 was (and maybe still
> > > is) cleaning Sun's clock, and it's why the Opteron blows Itaniums out of
> > > the water. Likewise it's why SCSI is so much better than IDE (unless you
> > > just give each drive its own dedicated bandwidth).
> > > --
> > > Jim C. Nasby, Database Consultant decibel(at)decibel(dot)org
> > > Give your computer some brain candy! www.distributed.net Team #1828
> > >
> > > Windows: "Where do you want to go today?"
> > > Linux: "Where do you want to go tomorrow?"
> > > FreeBSD: "Are you guys coming, or what?"
> > >
> >
>


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Alex Turner <armtuk(at)gmail(dot)com>
Cc: "Jim C(dot) Nasby" <decibel(at)decibel(dot)org>, William Yu <wyu(at)talisys(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: How to improve db performance with $7K?
Date: 2005-04-07 03:00:54
Message-ID: 87wtrfmgt5.fsf@stark.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance


Alex Turner <armtuk(at)gmail(dot)com> writes:

> SATA gives each drive it's own channel, but you have to share in SCSI.
> A SATA controller typicaly can do 3Gb/sec (384MB/sec) per drive, but
> SCSI can only do 320MB/sec across the entire array.

SCSI controllers often have separate channels for each device too.

In any case the issue with the IDE protocol is that fundamentally you can only
have a single command pending. SCSI can have many commands pending. This is
especially important for a database like postgres that may be busy committing
one transaction while another is trying to read. Having several commands
queued on the drive gives it a chance to execute any that are "on the way" to
the committing transaction.

However I'm under the impression that 3ware has largely solved this problem.
Also, if you save a few dollars and can afford one additional drive that
additional drive may improve your array speed enough to overcome that
inefficiency.
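To make the effect concrete, here is a rough Python sketch (hypothetical track numbers, not anyone's benchmark) comparing strict arrival-order service against a drive that can reorder its queue to pick the nearest pending request:

```python
import random

def total_seek_distance(requests, reorder):
    """Total head travel needed to service all requests, starting at track 0."""
    pending = list(requests)
    head, travel = 0, 0
    while pending:
        if reorder:
            # Tagged queuing: the drive services the nearest queued request.
            nxt = min(pending, key=lambda t: abs(t - head))
        else:
            # No queuing: requests are serviced strictly in arrival order.
            nxt = pending[0]
        pending.remove(nxt)
        travel += abs(nxt - head)
        head = nxt
    return travel

random.seed(7)
reqs = [random.randrange(10_000) for _ in range(32)]  # 32 random track numbers
fifo = total_seek_distance(reqs, reorder=False)
nearest = total_seek_distance(reqs, reorder=True)
print(f"arrival order: {fifo} tracks of head travel")
print(f"reordered:     {nearest} tracks of head travel")
```

On random workloads the reordered total typically comes out several times smaller, which is exactly the win from executing requests that are "on the way".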

--
greg


From: Alex Turner <armtuk(at)gmail(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: "Jim C(dot) Nasby" <decibel(at)decibel(dot)org>, William Yu <wyu(at)talisys(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: How to improve db performance with $7K?
Date: 2005-04-07 03:06:47
Message-ID: 33c6269f050406200623d43daf@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Yeah - the more reading I'm doing - the more I'm finding out.

Allegedly the Western Digital Raptor drives implement a version of
ATA-4 Tagged Queuing which allows reordering of commands. Some
controllers support this. The 3ware docs say the controller supports
reordering both on the controller and at the drive. *shrug*

This is of course all supposed to go away with SATA II, which has NCQ,
Native Command Queueing. Of course the 3ware controllers don't
support SATA II, but a few others do, and I'm sure 3ware will come out
with a controller that does.

Alex Turner
netEconomist

On 06 Apr 2005 23:00:54 -0400, Greg Stark <gsstark(at)mit(dot)edu> wrote:
>
> Alex Turner <armtuk(at)gmail(dot)com> writes:
>
> > SATA gives each drive it's own channel, but you have to share in SCSI.
> > A SATA controller typicaly can do 3Gb/sec (384MB/sec) per drive, but
> > SCSI can only do 320MB/sec across the entire array.
>
> SCSI controllers often have separate channels for each device too.
>
> In any case the issue with the IDE protocol is that fundamentally you can only
> have a single command pending. SCSI can have many commands pending. This is
> especially important for a database like postgres that may be busy committing
> one transaction while another is trying to read. Having several commands
> queued on the drive gives it a chance to execute any that are "on the way" to
> the committing transaction.
>
> However I'm under the impression that 3ware has largely solved this problem.
> Also, if you save a few dollars and can afford one additional drive that
> additional drive may improve your array speed enough to overcome that
> inefficiency.
>
> --
> greg
>
>


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Alex Turner <armtuk(at)gmail(dot)com>, "Jim C(dot) Nasby" <decibel(at)decibel(dot)org>, William Yu <wyu(at)talisys(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: How to improve db performance with $7K?
Date: 2005-04-07 04:14:57
Message-ID: 13449.1112847297@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Greg Stark <gsstark(at)mit(dot)edu> writes:
> In any case the issue with the IDE protocol is that fundamentally you
> can only have a single command pending. SCSI can have many commands
> pending.

That's the bottom line: the SCSI protocol was designed (twenty years ago!)
to allow the drive to do physical I/O scheduling, because the CPU can
issue multiple commands before the drive has to report completion of the
first one. IDE isn't designed to do that. I understand that the latest
revisions to the IDE/ATA specs allow the drive to do this sort of thing,
but support for it is far from widespread.

regards, tom lane


From: Thomas F(dot)O'Connell <tfo(at)sitening(dot)com>
To: Kevin Brown <kevin(at)sysexperts(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Follow-Up: How to improve db performance with $7K?
Date: 2005-04-07 04:40:26
Message-ID: f9c0f39b77928687bf79a6fb0e4252c7@sitening.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Things might've changed somewhat over the past year, but this is from
_the_ Linux guy at Dell...

-tfo

--
Thomas F. O'Connell
Co-Founder, Information Architect
Sitening, LLC

Strategic Open Source — Open Your i™

http://www.sitening.com/
110 30th Avenue North, Suite 6
Nashville, TN 37203-6320
615-260-0005

Date: Mon, 26 Apr 2004 14:15:02 -0500
From: Matt Domsch <Matt_Domsch(at)dell(dot)com>
To: linux-poweredge(at)dell(dot)com
Subject: PERC3/Di failure workaround hypothesis

On Mon, Apr 26, 2004 at 11:10:36AM -0500, Sellek, Greg wrote:
> Short of ordering a Perc4 for every 2650 that I want to upgrade to RH
> ES, is there anything else I can do to get around the Perc3/Di
> problem?

Our working hypothesis for a workaround is to do as follows:

In afacli, set:

Read Cache: enabled
Write Cache: enabled when protected

Then unplug the ROMB battery. A reboot is not necessary. The firmware
will immediately drop into Write-Through Cache mode, which in our
testing has not exhibited the problem. Setting the write cache to
disabled in afacli doesn't seem to help - you've got to unplug the
battery with it in the above settings.

We are continuing to search for the root cause to the problem, and will
update the list when we can.

Thanks,
Matt

--
Matt Domsch
Sr. Software Engineer, Lead Engineer
Dell Linux Solutions linux.dell.com & www.dell.com/linux
Linux on Dell mailing lists @ http://lists.us.dell.com

On Apr 5, 2005, at 11:44 PM, Kevin Brown wrote:

> Thomas F.O'Connell wrote:
>> I'd use two of your drives to create a mirrored partition where
>> pg_xlog
>> resides separate from the actual data.
>>
>> RAID 10 is probably appropriate for the remaining drives.
>>
>> Fortunately, you're not using Dell, so you don't have to worry about
>> the Perc3/Di RAID controller, which is not so compatible with
>> Linux...
>
> Hmm...I have to wonder how true this is these days.
>
> My company has a Dell 2500 with a Perc3/Di running Debian Linux, with
> the 2.6.10 kernel. The controller seems to work reasonably well,
> though I wouldn't doubt that it's slower than a different one might
> be. But so far we haven't had any reliability issues with it.
>
> Now, the performance is pretty bad considering the setup -- a RAID 5
> with five 73.6 gig SCSI disks (10K RPM, I believe). Reads through the
> filesystem come through at about 65 megabytes/sec, writes about 35
> megabytes/sec (at least, so says "bonnie -s 8192"). This is on a
> system with a single 3 GHz Xeon and 1 gigabyte of memory. I'd expect
> much better read performance from what is essentially a stripe of 4
> fast SCSI disks.
>
>
> While compatibility hasn't really been an issue, at least as far as
> the basics go, I still agree with your general sentiment -- stay away
> from the Dells, at least if they have the Perc3/Di controller. You'll
> probably get much better performance out of something else.
>
>
> --
> Kevin Brown kevin(at)sysexperts(dot)com


From: "Douglas J(dot) Trainor" <trainor(at)transborder(dot)net>
To: Alex Turner <armtuk(at)gmail(dot)com>
Cc: "Jim C(dot) Nasby" <decibel(at)decibel(dot)org>, pgsql-performance(at)postgresql(dot)org, William Yu <wyu(at)talisys(dot)com>
Subject: Re: How to improve db performance with $7K?
Date: 2005-04-07 04:58:33
Message-ID: cff304a11722a6d926cabb73e89b8919@transborder.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

You asked for it! ;-)

If you want cheap, get SATA. If you want fast under
*load* conditions, get SCSI. Everything else at this
time is marketing hype, either intentional or learned.
Ignoring dollars, expect to see SCSI beat SATA by 40%.

* * * What I tell you three times is true * * *

Also, compare the warranty you get with any SATA
drive with any SCSI drive. Yes, you still have some
change leftover to buy more SATA drives when they
fail, but... it fundamentally comes down to some
actual implementation and not what is printed on
the cardboard box. Disk systems are bound by the
rules of queueing theory. You can hit the sales rep
over the head with your queueing theory book.

Ultra320 SCSI is king of the hill for high concurrency
databases. If you're only streaming or serving files,
save some money and get a bunch of SATA drives.
But if you're reading/writing all over the disk, the
simple first-come-first-serve SATA heuristic will
hose your performance under load conditions.
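The queueing-theory point is easy to see with the textbook M/M/1 result, where the mean time in system is W = 1/(mu - lambda). Assuming a hypothetical drive that averages 10 ms per random I/O (a service rate mu of roughly 100 requests/sec), response time blows up as offered load approaches capacity:

```python
mu = 100.0  # service rate: ~10 ms per random I/O => ~100 requests/sec

for lam in (10.0, 50.0, 80.0, 90.0, 95.0, 99.0):
    rho = lam / mu            # utilization
    w = 1.0 / (mu - lam)      # M/M/1 mean time in system, in seconds
    print(f"load {lam:5.0f}/s  utilization {rho:4.0%}  mean response {w * 1000:7.1f} ms")
```

At 50% utilization the mean response is 20 ms; at 99% it is a full second. A scheduler that reorders requests effectively raises mu, pushing the knee of this curve to the right.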

Next year, they will *try* to bring out some SATA cards
that improve on first-come-first-serve, but they ain't
here now. There are a lot of rigged performance tests
out there... Maybe by the time they fix the queueing
problems, Serial Attached SCSI (a/k/a SAS) will be out.
Looks like Ultra320 is the end of the line for parallel
SCSI, as Ultra640 SCSI (a/k/a SPI-5) is dead in the
water.

Ultra320 SCSI.
Ultra320 SCSI.
Ultra320 SCSI.

Serial Attached SCSI.
Serial Attached SCSI.
Serial Attached SCSI.

For future trends, see:
http://www.incits.org/archive/2003/in031163/in031163.htm

douglas

p.s. For extra credit, try comparing SATA and SCSI drives
when they're 90% full.

On Apr 6, 2005, at 8:32 PM, Alex Turner wrote:

> I guess I'm setting myself up here, and I'm really not being ignorant,
> but can someone explain exactly how is SCSI is supposed to better than
> SATA?
>
> Both systems use drives with platters. Each drive can physically only
> read one thing at a time.
>
> SATA gives each drive it's own channel, but you have to share in SCSI.
> A SATA controller typicaly can do 3Gb/sec (384MB/sec) per drive, but
> SCSI can only do 320MB/sec across the entire array.
>
> What am I missing here?
>
> Alex Turner
> netEconomist


From: "Douglas J(dot) Trainor" <trainor(at)transborder(dot)net>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: How to improve db performance with $7K?
Date: 2005-04-07 09:55:59
Message-ID: ff501fced53e507c0f192cb2e474cbb5@transborder.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

A good one page discussion on the future of SCSI and SATA can
be found in the latest CHIPS (The Department of the Navy Information
Technology Magazine, formerly CHIPS AHOY) in an article by
Patrick G. Koehler and Lt. Cmdr. Stan Bush.

Click below if you don't mind being logged while visiting Space and Naval
Warfare Systems Center Charleston:

http://www.chips.navy.mil/archives/05_Jan/web_pages/scuzzy.htm


From: Richard_D_Levine(at)raytheon(dot)com
To: "Douglas J(dot) Trainor" <trainor(at)transborder(dot)net>
Cc: Alex Turner <armtuk(at)gmail(dot)com>, "Jim C(dot) Nasby" <decibel(at)decibel(dot)org>, pgsql-performance(at)postgresql(dot)org, pgsql-performance-owner(at)postgresql(dot)org, William Yu <wyu(at)talisys(dot)com>
Subject: Re: How to improve db performance with $7K?
Date: 2005-04-07 15:37:33
Message-ID: OF74B315AF.D4C3F1C2-ON05256FDC.00559FCC-05256FDC.0055D617@ftw.us.ray.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Another simple question: Why is SCSI more expensive? After the
eleventy-millionth controller is made, it seems like SCSI and SATA are
both just a controller board and a spinning disk. Is somebody still
making money by licensing SCSI technology?

Rick

pgsql-performance-owner(at)postgresql(dot)org wrote on 04/06/2005 11:58:33 PM:

> You asked for it! ;-)
>
> If you want cheap, get SATA. If you want fast under
> *load* conditions, get SCSI. Everything else at this
> time is marketing hype, either intentional or learned.
> Ignoring dollars, expect to see SCSI beat SATA by 40%.
>
> * * * What I tell you three times is true * * *
>
> Also, compare the warranty you get with any SATA
> drive with any SCSI drive. Yes, you still have some
> change leftover to buy more SATA drives when they
> fail, but... it fundamentally comes down to some
> actual implementation and not what is printed on
> the cardboard box. Disk systems are bound by the
> rules of queueing theory. You can hit the sales rep
> over the head with your queueing theory book.
>
> Ultra320 SCSI is king of the hill for high concurrency
> databases. If you're only streaming or serving files,
> save some money and get a bunch of SATA drives.
> But if you're reading/writing all over the disk, the
> simple first-come-first-serve SATA heuristic will
> hose your performance under load conditions.
>
> Next year, they will *try* bring out some SATA cards
> that improve on first-come-first-serve, but they ain't
> here now. There are a lot of rigged performance tests
> out there... Maybe by the time they fix the queueing
> problems, serial Attached SCSI (a/k/a SAS) will be out.
> Looks like Ultra320 is the end of the line for parallel
> SCSI, as Ultra640 SCSI (a/k/a SPI-5) is dead in the
> water.
>
> Ultra320 SCSI.
> Ultra320 SCSI.
> Ultra320 SCSI.
>
> Serial Attached SCSI.
> Serial Attached SCSI.
> Serial Attached SCSI.
>
> For future trends, see:
> http://www.incits.org/archive/2003/in031163/in031163.htm
>
> douglas
>
> p.s. For extra credit, try comparing SATA and SCSI drives
> when they're 90% full.
>
> On Apr 6, 2005, at 8:32 PM, Alex Turner wrote:
>
> > I guess I'm setting myself up here, and I'm really not being ignorant,
> > but can someone explain exactly how is SCSI is supposed to better than
> > SATA?
> >
> > Both systems use drives with platters. Each drive can physically only
> > read one thing at a time.
> >
> > SATA gives each drive it's own channel, but you have to share in SCSI.
> > A SATA controller typicaly can do 3Gb/sec (384MB/sec) per drive, but
> > SCSI can only do 320MB/sec across the entire array.
> >
> > What am I missing here?
> >
> > Alex Turner
> > netEconomist
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 9: the planner will ignore your desire to choose an index scan if
your
> joining column's datatypes do not match


From: Alex Turner <armtuk(at)gmail(dot)com>
To: "Richard_D_Levine(at)raytheon(dot)com" <Richard_D_Levine(at)raytheon(dot)com>
Cc: "Douglas J(dot) Trainor" <trainor(at)transborder(dot)net>, "Jim C(dot) Nasby" <decibel(at)decibel(dot)org>, pgsql-performance(at)postgresql(dot)org, pgsql-performance-owner(at)postgresql(dot)org, William Yu <wyu(at)talisys(dot)com>
Subject: Re: How to improve db performance with $7K?
Date: 2005-04-07 15:46:31
Message-ID: 33c6269f05040708464df3909a@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Based on the reading I'm doing, and somebody please correct me if I'm
wrong, it seems that SCSI drives contain an on-disk controller that
has to process the tagged queue; SATA-I doesn't have this. This
additional controller is basically an on-board computer that figures
out the best order in which to process commands. I believe you are
also paying for the increased tolerances that generate better speed.
If you compare an 80GB 7200RPM IDE drive to a WD Raptor 74GB 10k RPM
to a Seagate 10k.6 drive to a Seagate Cheetah 15k drive, each one
represents a step up in parts and technology, thereby generating a
cost increase (at least that's what the manufacturers tell us). I know
if you've ever held a 15k drive in your hand, you can notice a
considerable weight difference between it and a 7200RPM IDE drive.

Alex Turner
netEconomist

On Apr 7, 2005 11:37 AM, Richard_D_Levine(at)raytheon(dot)com
<Richard_D_Levine(at)raytheon(dot)com> wrote:
> Another simple question: Why is SCSI more expensive? After the
> eleventy-millionth controller is made, it seems like SCSI and SATA are
> using a controller board and a spinning disk. Is somebody still making
> money by licensing SCSI technology?
>
> Rick
>
> pgsql-performance-owner(at)postgresql(dot)org wrote on 04/06/2005 11:58:33 PM:
>
> > [snip]
>


From: Richard_D_Levine(at)raytheon(dot)com
To: Alex Turner <armtuk(at)gmail(dot)com>
Cc: "Jim C(dot) Nasby" <decibel(at)decibel(dot)org>, pgsql-performance(at)postgresql(dot)org, pgsql-performance-owner(at)postgresql(dot)org, "Douglas J(dot) Trainor" <trainor(at)transborder(dot)net>, William Yu <wyu(at)talisys(dot)com>
Subject: Re: How to improve db performance with $7K?
Date: 2005-04-07 16:28:29
Message-ID: OFBBA6D5E0.8FAB98C1-ON05256FDC.005A2CCD-05256FDC.005A7FAA@ftw.us.ray.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Yep, that's it, as well as increased quality control. I found this from
Seagate:

http://www.seagate.com/content/docs/pdf/whitepaper/D2c_More_than_Interface_ATA_vs_SCSI_042003.pdf

With this quote (note that ES stands for Enterprise System and PS stands
for Personal System):

There is significantly more silicon on ES products. The following
comparison comes from a study done in 2000:
· the ES ASIC gate count is more than 2x a PS drive,
· the embedded SRAM space for program code is 2x,
· the permanent flash memory for program code is 2x,
· data SRAM and cache SRAM space is more than 10x.
The complexity of the SCSI/FC interface compared to the
IDE/ATA interface shows up here due in part to the more
complex system architectures in which ES drives find themselves.
ES interfaces support multiple initiators or hosts. The
drive must keep track of separate sets of information for each
host to which it is attached, e.g., maintaining the processor
pointer sets for multiple initiators and tagged commands.
The capability of SCSI/FC to efficiently process commands
and tasks in parallel has also resulted in a higher overhead
“kernel” structure for the firmware. All of these complexities
and an overall richer command set result in the need for a
more expensive PCB to carry the electronics.

Rick

Alex Turner <armtuk(at)gmail(dot)com> wrote on 04/07/2005 10:46:31 AM:

> Based on the reading I'm doing, and somebody please correct me if I'm
> wrong, it seems that SCSI drives contain an on disk controller that
> has to process the tagged queue. SATA-I doesn't have this. This
> additional controller, is basicaly an on board computer that figures
> out the best order in which to process commands. I believe you are
> also paying for the increased tolerance that generates a better speed.
> If you compare an 80Gig 7200RPM IDE drive to a WD Raptor 76G 10k RPM
> to a Seagate 10k.6 drive to a Seagate Cheatah 15k drive, each one
> represents a step up in parts and technology, thereby generating a
> cost increase (at least thats what the manufactures tell us). I know
> if you ever held a 15k drive in your hand, you can notice a
> considerable weight difference between it and a 7200RPM IDE drive.
>
> Alex Turner
> netEconomist
>
> On Apr 7, 2005 11:37 AM, Richard_D_Levine(at)raytheon(dot)com
> <Richard_D_Levine(at)raytheon(dot)com> wrote:
> > Another simple question: Why is SCSI more expensive? After the
> > eleventy-millionth controller is made, it seems like SCSI and SATA are
> > using a controller board and a spinning disk. Is somebody still making
> > money by licensing SCSI technology?
> >
> > Rick
> >
> > pgsql-performance-owner(at)postgresql(dot)org wrote on 04/06/2005 11:58:33 PM:
> >
> > > [snip]
> >
> >


From: Kevin Brown <kevin(at)sysexperts(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: How to improve db performance with $7K?
Date: 2005-04-14 05:56:55
Message-ID: 20050414055655.GB19518@filer
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Tom Lane wrote:
> Greg Stark <gsstark(at)mit(dot)edu> writes:
> > In any case the issue with the IDE protocol is that fundamentally you
> > can only have a single command pending. SCSI can have many commands
> > pending.
>
> That's the bottom line: the SCSI protocol was designed (twenty years ago!)
> to allow the drive to do physical I/O scheduling, because the CPU can
> issue multiple commands before the drive has to report completion of the
> first one. IDE isn't designed to do that. I understand that the latest
> revisions to the IDE/ATA specs allow the drive to do this sort of thing,
> but support for it is far from widespread.

My question is: why does this (physical I/O scheduling) seem to matter
so much?

Before you flame me for asking a terribly idiotic question, let me
provide some context.

The operating system maintains a (sometimes large) buffer cache, with
each buffer being mapped to a "physical" (which in the case of RAID is
really a virtual) location on the disk. When the kernel needs to
flush the cache (e.g., during a sync(), or when it needs to free up
some pages), it doesn't write the pages in memory address order, it
writes them in *device* address order. And it, too, maintains a queue
of disk write requests.
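The kernel-side ordering described above is essentially an elevator pass over the dirty-buffer queue; a toy sketch, with made-up block numbers:

```python
def elevator_order(pending, head):
    """One-directional elevator pass: service all blocks at or beyond the
    current head position in ascending order, then wrap to the remainder."""
    ahead = sorted(b for b in pending if b >= head)
    behind = sorted(b for b in pending if b < head)
    return ahead + behind

dirty = [9001, 12, 512, 7777, 4096, 64, 30000]  # pending writes, by block address
print(elevator_order(dirty, head=1000))
# [4096, 7777, 9001, 30000, 12, 64, 512]
```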

Now, unless some of the blocks on the disk are remapped behind the
scenes such that an ordered list of blocks in the kernel translates to
an out of order list on the target disk (which should be rare, since
such remapping usually happens only when the target block is bad), how
can the fact that the disk controller doesn't do tagged queuing
*possibly* make any real difference unless the kernel's disk
scheduling algorithm is suboptimal? In fact, if the kernel's
scheduling algorithm is close to optimal, wouldn't the disk queuing
mechanism *reduce* the overall efficiency of disk writes? After all,
the kernel's queue is likely to be much larger than the disk
controller's, and the kernel has knowledge of things like the
filesystem layout that the disk controller and disks do not have. If
the controller is only able to execute a subset of the write commands
that the kernel has in its queue, at the very least the controller may
end up leaving the head(s) in a suboptimal position relative to the
next set of commands that it hasn't received yet, unless it simply
writes the blocks in the order it receives them, right (admittedly, this
is somewhat trivially dealt with by having the controller exclude the
first and last blocks in the request from its internal sort).

I can see how you might configure the RAID controller so that the
kernel's scheduling algorithm will screw things up horribly. For
instance, if the controller has several RAID volumes configured in
such a way that the volumes share spindles, the kernel isn't likely to
know about that (since each volume appears as its own device), so
writes to multiple volumes can cause head movement where the kernel
might be treating the volumes as completely independent. But that
just means that you can't be dumb about how you configure your RAID
setup.

So what gives? Given the above, why is SCSI so much more efficient
than plain, dumb SATA? And why wouldn't you be much better off with a
set of dumb controllers in conjunction with (kernel-level) software
RAID?

--
Kevin Brown kevin(at)sysexperts(dot)com


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Kevin Brown <kevin(at)sysexperts(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: How to improve db performance with $7K?
Date: 2005-04-14 06:37:15
Message-ID: 878y3lj23o.fsf@stark.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance


Kevin Brown <kevin(at)sysexperts(dot)com> writes:

> My question is: why does this (physical I/O scheduling) seem to matter
> so much?
>
> Before you flame me for asking a terribly idiotic question, let me
> provide some context.
>
> The operating system maintains a (sometimes large) buffer cache, with
> each buffer being mapped to a "physical" (which in the case of RAID is
> really a virtual) location on the disk. When the kernel needs to
> flush the cache (e.g., during a sync(), or when it needs to free up
> some pages), it doesn't write the pages in memory address order, it
> writes them in *device* address order. And it, too, maintains a queue
> of disk write requests.

I think you're being misled by analyzing the write case.

Consider the read case. When a user process requests a block and that read
makes its way down to the driver level, the driver can't just put it aside and
wait until it's convenient. It has to go ahead and issue the read right away.

In the 10ms or so that it takes to seek to perform that read *nothing* gets
done. If the driver receives more read or write requests it just has to sit on
them and wait. 10ms is a lifetime for a computer. In that time dozens of other
processes could have been scheduled and issued reads of their own.

If any of those requests would have lain on the intervening tracks, the drive
missed a chance to execute them. Worse, it actually has to backtrack to get to
them, meaning another long seek.

The same thing would happen if you had lots of processes issuing lots of small
fsynced writes all over the place. Postgres doesn't really do that though. It
sort of does with the WAL logs, but that shouldn't cause a lot of seeking.
Perhaps it would mean that having your WAL share a spindle with other parts of
the OS would have a bigger penalty on IDE drives than on SCSI drives though?

--
greg


From: Kevin Brown <kevin(at)sysexperts(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: How to improve db performance with $7K?
Date: 2005-04-14 08:36:08
Message-ID: 20050414083608.GC19518@filer
Lists: pgsql-performance

Greg Stark wrote:

> I think you're being misled by analyzing the write case.
>
> Consider the read case. When a user process requests a block and
> that read makes its way down to the driver level, the driver can't
> just put it aside and wait until it's convenient. It has to go ahead
> and issue the read right away.

Well, strictly speaking it doesn't *have* to. It could delay for a
couple of milliseconds to see if other requests come in, and then
issue the read if none do. If there are already other requests being
fulfilled, then it'll schedule the request in question just like the
rest.

> In the 10ms or so that it takes to seek to perform that read
> *nothing* gets done. If the driver receives more read or write
> requests it just has to sit on them and wait. 10ms is a lifetime for
> a computer. In that time dozens of other processes could have been
> scheduled and issued reads of their own.

This is true, but now you're talking about a situation where the
system goes from an essentially idle state to one of furious
activity. In other words, it's a corner case that I strongly suspect
isn't typical in situations where SCSI has historically made a big
difference.

Once the first request has been fulfilled, the driver can now schedule
the rest of the queued-up requests in disk-layout order.

I really don't see how this is any different between a system that has
tagged queueing to the disks and one that doesn't. The only
difference is where the queueing happens. In the case of SCSI, the
queueing happens on the disks (or at least on the controller). In the
case of SATA, the queueing happens in the kernel.

I suppose the tagged queueing setup could begin the head movement and,
if another request comes in that requests a block on a cylinder
between where the head currently is and where it's going, go ahead and
read the block in question. But is that *really* what happens in a
tagged queueing system? It's the only major advantage I can see it
having.
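The gap between arrival-order and disk-layout-order servicing is easy to see in a toy model (logical block numbers only, ignoring rotational position; a sketch, not a claim about any particular kernel or drive):

```python
import random

def head_travel(order):
    """Total head travel, in blocks, to service requests in the given order."""
    pos, total = 0, 0
    for block in order:
        total += abs(block - pos)
        pos = block
    return total

random.seed(42)
requests = [random.randrange(1_000_000) for _ in range(64)]

fifo = head_travel(requests)              # service in arrival order
elevator = head_travel(sorted(requests))  # one ascending sweep (elevator order)
print(fifo, elevator)
```

Starting from block 0, one ascending sweep travels exactly max(requests) blocks, while arrival order typically travels many times farther. Whichever layer does the sorting, kernel or drive, captures most of that win for long seeks; rotational position is another matter.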

> The same thing would happen if you had lots of processes issuing
> lots of small fsynced writes all over the place. Postgres doesn't
> really do that though. It sort of does with the WAL logs, but that
> shouldn't cause a lot of seeking. Perhaps it would mean that having
> your WAL share a spindle with other parts of the OS would have a
> bigger penalty on IDE drives than on SCSI drives though?

Perhaps.

But I rather doubt that has to be a huge penalty, if any. When a
process issues an fsync (or even a sync), the kernel doesn't *have* to
drop everything it's doing and get to work on it immediately. It
could easily gather a few more requests, bundle them up, and then
issue them. If there's a lot of disk activity, it's probably smart to
do just that. All fsync and sync require is that the caller block
until the data hits the disk (from the point of view of the kernel).
The specification doesn't require that the kernel act on the calls
immediately or write only the blocks referred to by the call in
question.

--
Kevin Brown kevin(at)sysexperts(dot)com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Kevin Brown <kevin(at)sysexperts(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: How to improve db performance with $7K?
Date: 2005-04-14 14:44:14
Message-ID: 446.1113489854@sss.pgh.pa.us
Lists: pgsql-performance

Kevin Brown <kevin(at)sysexperts(dot)com> writes:
> I really don't see how this is any different between a system that has
> tagged queueing to the disks and one that doesn't. The only
> difference is where the queueing happens. In the case of SCSI, the
> queueing happens on the disks (or at least on the controller). In the
> case of SATA, the queueing happens in the kernel.

That's basically what it comes down to: SCSI lets the disk drive itself
do the low-level I/O scheduling whereas the ATA spec prevents the drive
from doing so (unless it cheats, ie, caches writes). Also, in SCSI it's
possible for the drive to rearrange reads as well as writes --- which
AFAICS is just not possible in ATA. (Maybe in the newest spec...)

The reason this is so much more of a win than it was when ATA was
designed is that in modern drives the kernel has very little clue about
the physical geometry of the disk. Variable-size tracks, bad-block
sparing, and stuff like that make for a very hard-to-predict mapping
from linear sector addresses to actual disk locations. Combine that
with the fact that the drive controller can be much smarter than it was
twenty years ago, and you can see that the case for doing I/O scheduling
in the kernel and not in the drive is pretty weak.

regards, tom lane


From: Rosser Schwarz <rosser(dot)schwarz(at)gmail(dot)com>
To: Kevin Brown <kevin(at)sysexperts(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: How to improve db performance with $7K?
Date: 2005-04-14 14:48:58
Message-ID: 37d451f7050414074876acdb82@mail.gmail.com
Lists: pgsql-performance

while you weren't looking, Kevin Brown wrote:

[reordering bursty reads]

> In other words, it's a corner case that I strongly suspect
> isn't typical in situations where SCSI has historically made a big
> difference.

[...]

> But I rather doubt that has to be a huge penalty, if any. When a
> process issues an fsync (or even a sync), the kernel doesn't *have* to
> drop everything it's doing and get to work on it immediately. It
> could easily gather a few more requests, bundle them up, and then
> issue them.

To make sure I'm following you here, are you or are you not suggesting
that the kernel could sit on -all- IO requests for some small handful
of ms before actually performing any IO to address what you "strongly
suspect" is a "corner case"?

/rls

--
:wq


From: Matthew Nuzum <mattnuzum(at)gmail(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: How to improve db performance with $7K?
Date: 2005-04-14 15:51:46
Message-ID: f3c0b40805041408517d7ede90@mail.gmail.com
Lists: pgsql-performance

On 4/14/05, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
> That's basically what it comes down to: SCSI lets the disk drive itself
> do the low-level I/O scheduling whereas the ATA spec prevents the drive
> from doing so (unless it cheats, ie, caches writes). Also, in SCSI it's
> possible for the drive to rearrange reads as well as writes --- which
> AFAICS is just not possible in ATA. (Maybe in the newest spec...)
>
> The reason this is so much more of a win than it was when ATA was
> designed is that in modern drives the kernel has very little clue about
> the physical geometry of the disk. Variable-size tracks, bad-block
> sparing, and stuff like that make for a very hard-to-predict mapping
> from linear sector addresses to actual disk locations. Combine that
> with the fact that the drive controller can be much smarter than it was
> twenty years ago, and you can see that the case for doing I/O scheduling
> in the kernel and not in the drive is pretty weak.
>
>

So if you all were going to choose between two hard drives where:
drive A has capacity C and spins at 15K rpms, and
drive B has capacity 2 x C and spins at 10K rpms and
all other features are the same, the price is the same, and C is enough
disk space, which would you choose?

I've noticed that on IDE drives, as the capacity increases the data
density increases and there is a perceived (I've not measured it)
performance increase.

Would the increased data density of the higher capacity drive be of
greater benefit than the faster spindle speed of drive A?
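A quick sketch of the spindle-speed half of that trade-off, assuming average rotational latency is half a revolution (the density benefit is harder to pin down without real transfer-rate numbers):

```python
def avg_rotational_latency_ms(rpm):
    # On average the target sector is half a revolution away.
    return (60_000 / rpm) / 2

for rpm in (7_200, 10_000, 15_000):
    print(rpm, round(avg_rotational_latency_ms(rpm), 2))
```

So drive A saves about 1 ms per random access over drive B, while drive B's doubled capacity, if it comes from doubled linear density, could mean a markedly higher sequential transfer rate. Random-access workloads would tend to favor A; big sequential scans may favor B.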

--
Matthew Nuzum
www.bearfruit.org


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Kevin Brown <kevin(at)sysexperts(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: How to improve db performance with $7K?
Date: 2005-04-14 18:03:31
Message-ID: 87d5sxgrrg.fsf@stark.xeocode.com
Lists: pgsql-performance

Kevin Brown <kevin(at)sysexperts(dot)com> writes:

> Greg Stark wrote:
>
>
> > I think you're being misled by analyzing the write case.
> >
> > Consider the read case. When a user process requests a block and
> > that read makes its way down to the driver level, the driver can't
> > just put it aside and wait until it's convenient. It has to go ahead
> > and issue the read right away.
>
> Well, strictly speaking it doesn't *have* to. It could delay for a
> couple of milliseconds to see if other requests come in, and then
> issue the read if none do. If there are already other requests being
> fulfilled, then it'll schedule the request in question just like the
> rest.

But then the cure is worse than the disease. You're basically describing
exactly what does happen anyway, only you're delaying more requests than
necessary. That intervening time isn't really idle; it's filled with all the
requests that were delayed during the previous large seek...

> Once the first request has been fulfilled, the driver can now schedule
> the rest of the queued-up requests in disk-layout order.
>
> I really don't see how this is any different between a system that has
> tagged queueing to the disks and one that doesn't. The only
> difference is where the queueing happens.

And *when* it happens. Instead of being able to issue requests while a large
seek is happening and having some of them satisfied they have to wait until
that seek is finished and get acted on during the next large seek.

If my theory is correct then I would expect bandwidth to be essentially
equivalent but the latency on SATA drives to be increased by about 50% of the
average seek time. Ie, while a busy SCSI drive can satisfy most requests in
about 10ms a busy SATA drive would satisfy most requests in 15ms. (add to that
that 10k RPM and 15kRPM SCSI drives have even lower seek times and no such
IDE/SATA drives exist...)

In reality higher latency feeds into a system feedback loop causing your
application to run slower causing bandwidth demands to be lower as well. It's
often hard to distinguish root causes from symptoms when optimizing complex
systems.
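That back-of-envelope model can be stated as a one-liner (the 10 ms figures are the assumed round numbers from above, not measurements):

```python
def busy_latency_ms(base_ms, avg_seek_ms, drive_reorders):
    """If the drive can't accept new requests mid-seek, a request arriving
    during a seek waits, on average, an extra half of that seek."""
    return base_ms if drive_reorders else base_ms + 0.5 * avg_seek_ms

print(busy_latency_ms(10.0, 10.0, True))   # SCSI-style: 10.0 ms
print(busy_latency_ms(10.0, 10.0, False))  # SATA-style: 15.0 ms
```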

--
greg


From: Kevin Brown <kevin(at)sysexperts(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: How to improve db performance with $7K?
Date: 2005-04-15 02:03:37
Message-ID: 20050415020337.GD19518@filer
Lists: pgsql-performance

Tom Lane wrote:
> Kevin Brown <kevin(at)sysexperts(dot)com> writes:
> > I really don't see how this is any different between a system that has
> > tagged queueing to the disks and one that doesn't. The only
> > difference is where the queueing happens. In the case of SCSI, the
> > queueing happens on the disks (or at least on the controller). In the
> > case of SATA, the queueing happens in the kernel.
>
> That's basically what it comes down to: SCSI lets the disk drive itself
> do the low-level I/O scheduling whereas the ATA spec prevents the drive
> from doing so (unless it cheats, ie, caches writes). Also, in SCSI it's
> possible for the drive to rearrange reads as well as writes --- which
> AFAICS is just not possible in ATA. (Maybe in the newest spec...)
>
> The reason this is so much more of a win than it was when ATA was
> designed is that in modern drives the kernel has very little clue about
> the physical geometry of the disk. Variable-size tracks, bad-block
> sparing, and stuff like that make for a very hard-to-predict mapping
> from linear sector addresses to actual disk locations.

Yeah, but it's not clear to me, at least, that this is a first-order
consideration. A second-order consideration, sure, I'll grant that.

What I mean is that when it comes to scheduling disk activity,
knowledge of the specific physical geometry of the disk isn't really
important. What's important is whether or not the disk conforms to a
certain set of expectations. Namely, that the general organization is
such that addressing the blocks in block number order guarantees
maximum throughput.

Now, bad block remapping destroys that guarantee, but unless you've
got a LOT of bad blocks, it shouldn't destroy your performance, right?

> Combine that with the fact that the drive controller can be much
> smarter than it was twenty years ago, and you can see that the case
> for doing I/O scheduling in the kernel and not in the drive is
> pretty weak.

Well, I certainly grant that allowing the controller to do the I/O
scheduling is faster than having the kernel do it, as long as it can
handle insertion of new requests into the list while it's in the
middle of executing a request. The most obvious case is when the head
is in motion and the new request can be satisfied by reading from the
media between where the head is at the time of the new request and
where the head is being moved to.

My argument is that a sufficiently smart kernel scheduler *should*
yield performance results that are reasonably close to what you can
get with that feature. Perhaps not quite as good, but reasonably
close. It shouldn't be an orders-of-magnitude type difference.

--
Kevin Brown kevin(at)sysexperts(dot)com


From: Alex Turner <armtuk(at)gmail(dot)com>
To: Kevin Brown <kevin(at)sysexperts(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: How to improve db performance with $7K?
Date: 2005-04-15 02:24:22
Message-ID: 33c6269f050414192472a58cf6@mail.gmail.com
Lists: pgsql-performance

3ware claim that their 'software' implemented command queueing
performs at 95% effectiveness compared to the hardware queueing on a
SCSI drive, so I would say that they agree with you.

I'm still learning, but as I read it, the bits are split across the
platters and there is only 'one' head, but it happens to be reading from
multiple platters. The 'further' in linear distance the data is from
the current position, the longer it's going to take to get there.
This seems to be true based on a document that was circulated. A hard
drive takes a considerable amount of time to 'find' a track on the
platter compared to the rotational speed, which would agree with the
fact that you can read 70MB/sec, but it takes up to 13ms to seek.

The ATA protocol is just how the HBA communicates with the drive;
there is no reason why the HBA can't reschedule reads and writes just
like the SCSI drive would do natively, and this is in fact what 3ware
claims. I get the feeling, based on my own historical experience, that
generally drives don't just have a bunch of bad blocks. This all leads
me to believe that you can predict with pretty good accuracy how
expensive it is to retrieve a given block knowing its linear
address.

Alex Turner
netEconomist

On 4/14/05, Kevin Brown <kevin(at)sysexperts(dot)com> wrote:
> Tom Lane wrote:
> > Kevin Brown <kevin(at)sysexperts(dot)com> writes:
> > > I really don't see how this is any different between a system that has
> > > tagged queueing to the disks and one that doesn't. The only
> > > difference is where the queueing happens. In the case of SCSI, the
> > > queueing happens on the disks (or at least on the controller). In the
> > > case of SATA, the queueing happens in the kernel.
> >
> > That's basically what it comes down to: SCSI lets the disk drive itself
> > do the low-level I/O scheduling whereas the ATA spec prevents the drive
> > from doing so (unless it cheats, ie, caches writes). Also, in SCSI it's
> > possible for the drive to rearrange reads as well as writes --- which
> > AFAICS is just not possible in ATA. (Maybe in the newest spec...)
> >
> > The reason this is so much more of a win than it was when ATA was
> > designed is that in modern drives the kernel has very little clue about
> > the physical geometry of the disk. Variable-size tracks, bad-block
> > sparing, and stuff like that make for a very hard-to-predict mapping
> > from linear sector addresses to actual disk locations.
>
> Yeah, but it's not clear to me, at least, that this is a first-order
> consideration. A second-order consideration, sure, I'll grant that.
>
> What I mean is that when it comes to scheduling disk activity,
> knowledge of the specific physical geometry of the disk isn't really
> important. What's important is whether or not the disk conforms to a
> certain set of expectations. Namely, that the general organization is
> such that addressing the blocks in block number order guarantees
> maximum throughput.
>
> Now, bad block remapping destroys that guarantee, but unless you've
> got a LOT of bad blocks, it shouldn't destroy your performance, right?
>
> > Combine that with the fact that the drive controller can be much
> > smarter than it was twenty years ago, and you can see that the case
> > for doing I/O scheduling in the kernel and not in the drive is
> > pretty weak.
>
> Well, I certainly grant that allowing the controller to do the I/O
> scheduling is faster than having the kernel do it, as long as it can
> handle insertion of new requests into the list while it's in the
> middle of executing a request. The most obvious case is when the head
> is in motion and the new request can be satisfied by reading from the
> media between where the head is at the time of the new request and
> where the head is being moved to.
>
> My argument is that a sufficiently smart kernel scheduler *should*
> yield performance results that are reasonably close to what you can
> get with that feature. Perhaps not quite as good, but reasonably
> close. It shouldn't be an orders-of-magnitude type difference.
>
> --
> Kevin Brown kevin(at)sysexperts(dot)com
>


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Kevin Brown <kevin(at)sysexperts(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: How to improve db performance with $7K?
Date: 2005-04-15 02:41:56
Message-ID: 28523.1113532916@sss.pgh.pa.us
Lists: pgsql-performance

Kevin Brown <kevin(at)sysexperts(dot)com> writes:
> Tom Lane wrote:
>> The reason this is so much more of a win than it was when ATA was
>> designed is that in modern drives the kernel has very little clue about
>> the physical geometry of the disk. Variable-size tracks, bad-block
>> sparing, and stuff like that make for a very hard-to-predict mapping
>> from linear sector addresses to actual disk locations.

> What I mean is that when it comes to scheduling disk activity,
> knowledge of the specific physical geometry of the disk isn't really
> important.

Oh?

Yes, you can probably assume that blocks with far-apart numbers are
going to require a big seek, and you might even be right in supposing
that a block with an intermediate number should be read on the way.
But you have no hope at all of making the right decisions at a more
local level --- say, reading various sectors within the same cylinder
in an optimal fashion. You don't know where the track boundaries are,
so you can't schedule in a way that minimizes rotational latency.
You're best off to throw all the requests at the drive together and
let the drive sort it out.

This is not to say that there's not a place for a kernel-side scheduler
too. The drive will probably have a fairly limited number of slots in
its command queue. The optimal thing is for those slots to be filled
with requests that are in the same area of the disk. So you can still
get some mileage out of an elevator algorithm that works on logical
block numbers to give the drive requests for nearby block numbers at the
same time. But there's also a lot of use in letting the drive do its
own low-level scheduling.
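That division of labor, a kernel elevator feeding a small drive queue, might look like this in outline (block numbers and queue depth made up for illustration):

```python
def fill_drive_queue(pending, head_pos, slots):
    """Hand the drive the requests nearest the head by logical block number;
    the drive then orders those few using geometry only it knows."""
    return sorted(pending, key=lambda blk: abs(blk - head_pos))[:slots]

pending = [900, 120, 130, 880, 125, 500]
print(fill_drive_queue(pending, head_pos=128, slots=3))  # [130, 125, 120]
```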

> My argument is that a sufficiently smart kernel scheduler *should*
> yield performance results that are reasonably close to what you can
> get with that feature. Perhaps not quite as good, but reasonably
> close. It shouldn't be an orders-of-magnitude type difference.

That might be the case with respect to decisions about long seeks,
but not with respect to rotational latency. The kernel simply hasn't
got the information.

regards, tom lane


From: Kevin Brown <kevin(at)sysexperts(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: How to improve db performance with $7K?
Date: 2005-04-15 05:03:36
Message-ID: 20050415050336.GE19518@filer
Lists: pgsql-performance

Tom Lane wrote:
> Kevin Brown <kevin(at)sysexperts(dot)com> writes:
> > Tom Lane wrote:
> >> The reason this is so much more of a win than it was when ATA was
> >> designed is that in modern drives the kernel has very little clue about
> >> the physical geometry of the disk. Variable-size tracks, bad-block
> >> sparing, and stuff like that make for a very hard-to-predict mapping
> >> from linear sector addresses to actual disk locations.
>
> > What I mean is that when it comes to scheduling disk activity,
> > knowledge of the specific physical geometry of the disk isn't really
> > important.
>
> Oh?
>
> Yes, you can probably assume that blocks with far-apart numbers are
> going to require a big seek, and you might even be right in supposing
> that a block with an intermediate number should be read on the way.
> But you have no hope at all of making the right decisions at a more
> local level --- say, reading various sectors within the same cylinder
> in an optimal fashion. You don't know where the track boundaries are,
> so you can't schedule in a way that minimizes rotational latency.

This is true, but has to be examined in the context of the workload.

If the workload is a sequential read, for instance, then the question
becomes whether or not giving the controller a set of sequential
blocks (in block ID order) will get you maximum read throughput.
Given that the manufacturers all attempt to generate the biggest read
throughput numbers, I think it's reasonable to assume that (a) the
sectors are ordered within a cylinder such that reading block x + 1
immediately after block x will incur the smallest possible amount of
delay if requested quickly enough, and (b) the same holds true when
block x + 1 is on the next cylinder.

In the case of pure random reads, you'll end up having to wait an
average of half of a rotation before beginning the read. Where SCSI
buys you something here is when you have sequential chunks of reads
that are randomly distributed. The SCSI drive can determine which
block in the set to start with first. But for that to really be a big
win, the chunks themselves would have to span more than half a track
at least, else you'd have a greater than half a track gap in the
middle of your two sorted sector lists for that track (a really
well-engineered SCSI disk could take advantage of the fact that there
are multiple platters and fill the "gap" with reads from a different
platter).

Admittedly, this can be quite a big win. With an average rotational
latency of 4 milliseconds on a 7200 RPM disk, being able to begin the
read at the earliest possible moment will shave at most 25% off the
total average random-access latency, if the average seek time is 12
milliseconds.
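Running those numbers exactly (half a revolution at 7200 RPM, plus the assumed 12 ms seek):

```python
rotation_ms = 60_000 / 7_200      # one revolution: ~8.33 ms
avg_rot_ms = rotation_ms / 2      # average rotational latency: ~4.17 ms
seek_ms = 12.0
# Best case: the drive eliminates the rotational wait entirely.
best_case_saving = avg_rot_ms / (seek_ms + avg_rot_ms)
print(round(avg_rot_ms, 2), round(best_case_saving * 100, 1))  # 4.17 25.8
```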

> That might be the case with respect to decisions about long seeks,
> but not with respect to rotational latency. The kernel simply hasn't
> got the information.

True, but that should reduce the total latency by something like 17%
(on average). Not trivial, to be sure, but not an order of magnitude,
either.

--
Kevin Brown kevin(at)sysexperts(dot)com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Kevin Brown <kevin(at)sysexperts(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: How to improve db performance with $7K?
Date: 2005-04-15 05:28:54
Message-ID: 29785.1113542934@sss.pgh.pa.us
Lists: pgsql-performance

Kevin Brown <kevin(at)sysexperts(dot)com> writes:
> In the case of pure random reads, you'll end up having to wait an
> average of half of a rotation before beginning the read.

You're assuming the conclusion. The above is true if the disk is handed
one request at a time by a kernel that doesn't have any low-level timing
information. If there are multiple random requests on the same track,
the drive has an opportunity to do better than that --- if it's got all
the requests in hand.

regards, tom lane


From: PFC <lists(at)boutiquenumerique(dot)com>
To: "Kevin Brown" <kevin(at)sysexperts(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: How to improve db performance with $7K?
Date: 2005-04-15 10:07:43
Message-ID: op.so9qe5y2th1vuj@localhost
Lists: pgsql-performance

> My argument is that a sufficiently smart kernel scheduler *should*
> yield performance results that are reasonably close to what you can
> get with that feature. Perhaps not quite as good, but reasonably
> close. It shouldn't be an orders-of-magnitude type difference.

And a controller card (or drive) has a lot less RAM to use as a cache /
queue for reordering stuff than the OS has; potentially the OS can use most
of the available RAM, which can be gigabytes on a big server, whereas the
drive has at most a few tens of megabytes...

However all this is a bit looking at the problem through the wrong end.
The OS should provide a multi-read call for the applications to pass a
list of blocks they'll need, then reorder them and read them the fastest
possible way, clustering them with similar requests from other threads.

Right now when a thread/process issues a read() it will block until the
block is delivered to this thread. The OS does not know if this thread
will then need the next block (which can be had very cheaply if you know
ahead of time you'll need it) or not. Thus it must make guesses, read
ahead (sometimes), etc...


From: PFC <lists(at)boutiquenumerique(dot)com>
To: "Alex Turner" <armtuk(at)gmail(dot)com>, "Kevin Brown" <kevin(at)sysexperts(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: How to improve db performance with $7K?
Date: 2005-04-15 10:13:59
Message-ID: op.so9qplulth1vuj@localhost
Lists: pgsql-performance

> platter compared to the rotational speed, which would agree with the
> fact that you can read 70MB/sec, but it takes up to 13ms to seek.

Actually:
- the head has to be moved;
this time depends on the distance; for instance moving from one cylinder to
the next is very fast (it needs to be, to get good throughput)
- then you have to wait for the disk to spin until the information you
want comes in front of the head... statistically you have to wait half a
rotation. And this does not depend on the distance between the cylinders;
it depends on the position of the data in the cylinder.
The more RPMs you have, the less you wait, which is why higher-RPM
drives have faster seek times (they must also have faster actuators to move
the head)...


From: Alan Stange <stange(at)rentec(dot)com>
To: PFC <lists(at)boutiquenumerique(dot)com>
Cc: Kevin Brown <kevin(at)sysexperts(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: How to improve db performance with $7K?
Date: 2005-04-15 13:11:48
Message-ID: 425FBD94.3060600@rentec.com
Lists: pgsql-performance

PFC wrote:

>
>
>> My argument is that a sufficiently smart kernel scheduler *should*
>> yield performance results that are reasonably close to what you can
>> get with that feature. Perhaps not quite as good, but reasonably
>> close. It shouldn't be an orders-of-magnitude type difference.
>
>
> And a controller card (or drive) has a lot less RAM to use as a
> cache / queue for reordering stuff than the OS has, potentially the
> OS can use most of the available RAM, which can be gigabytes on a big
> server, whereas in the drive there are at most a few tens of
> megabytes...
>
> However all this is a bit looking at the problem through the wrong
> end. The OS should provide a multi-read call for the applications to
> pass a list of blocks they'll need, then reorder them and read them
> the fastest possible way, clustering them with similar requests from
> other threads.
>
> Right now when a thread/process issues a read() it will block
> until the block is delivered to this thread. The OS does not know if
> this thread will then need the next block (which can be had very
> cheaply if you know ahead of time you'll need it) or not. Thus it
> must make guesses, read ahead (sometimes), etc...

All true. Which is why high performance computing folks use
aio_read()/aio_write() and load up the kernel with all the requests they
expect to make.

The kernels that I'm familiar with will do read ahead on files based on
some heuristics: when you read the first byte of a file the OS will
typically load up several pages of the file (depending on file size,
etc). If you continue doing read() calls without a seek() on the file
descriptor the kernel will get the hint that you're doing a sequential
read and continue caching up the pages ahead of time, usually using the
pages you just read to hold the new data so that one isn't bloating out
memory with data that won't be needed again. Throw in a seek() and the
amount of read ahead caching may be reduced.
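Userspace can hand the kernel exactly this kind of hint. Besides aio_read(), POSIX offers posix_fadvise(POSIX_FADV_WILLNEED) to announce a batch of regions before reading them; a sketch (the offsets and sizes are arbitrary, and on platforms without posix_fadvise the hint is simply skipped):

```python
import os
import tempfile

def prefetch(fd, offsets, length):
    """Advise the kernel that these regions will be needed soon.
    Purely a hint; a no-op where posix_fadvise is unavailable."""
    if hasattr(os, "posix_fadvise"):
        for off in offsets:
            os.posix_fadvise(fd, off, length, os.POSIX_FADV_WILLNEED)

with tempfile.TemporaryFile() as f:
    f.write(b"x" * 65536)
    f.flush()
    offsets = [0, 16384, 49152]
    prefetch(f.fileno(), offsets, 8192)  # announce the whole batch...
    data = [os.pread(f.fileno(), 8192, off) for off in offsets]  # ...then read
```

This lets the elevator see several requests at once instead of guessing from sequential-read heuristics.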

One point that is being missed in all this discussion is that the file
system also imposes some constraints on how IO's can be done. For
example, simply doing a write(fd, buf, 100000000) doesn't emit a stream
of sequential blocks to the drives. Some file systems (UFS was one)
would force portions of large files into other cylinder groups so that
small files could be located near the inode data, thus avoiding/reducing
the size of seeks. Similarly, extents need to be allocated and the
bitmaps recording this data usually need synchronous updates, which will
require some seeks, etc. Not to mention the need to update inode data,
etc. Anyway, my point is that the allocation policies of the file
system can confuse the situation.

Also, the seek times one sees reported are an average. One really needs
to look at the track-to-track seek time and also the "full stroke" seek
times. It takes a *long* time to move the heads across the whole
platter. I've seen people partition drives to only use small regions of
the drives to avoid long seeks and to better use the increased number of
bits going under the head in one rotation. A 15K drive doesn't
necessarily have a faster seek time than a 10K drive just because its
rotational speed is higher; the average seek time might be lower simply
because the 15K drives are physically smaller, with fewer cylinders.
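The short-stroking trick is easy to model: for two uniformly random cylinder positions, the mean seek distance is one third of the region used, so confining data to the first 20% of the platter cuts the average seek distance by a factor of five. A quick Monte Carlo sketch (trial count and seed are arbitrary):

```python
import random

def mean_seek_distance(region_fraction, trials=100_000, seed=42):
    """Average distance between two uniformly random head positions,
    when only the first `region_fraction` of the platter is used.
    Distances are in units of the full stroke."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        a = rng.random() * region_fraction
        b = rng.random() * region_fraction
        total += abs(a - b)
    return total / trials
```

The full-platter average comes out near 1/3 of a full stroke; restricting to 20% of the platter brings it down to about 1/15, before even counting the density benefit.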

-- Alan


From: Vivek Khera <vivek(at)khera(dot)org>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: How to improve db performance with $7K?
Date: 2005-04-15 15:43:40
Message-ID: 9fde2d603a18b53b5b3f22ad47f3a064@khera.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance


On Apr 14, 2005, at 10:03 PM, Kevin Brown wrote:

> Now, bad block remapping destroys that guarantee, but unless you've
> got a LOT of bad blocks, it shouldn't destroy your performance, right?
>

ALL disks have bad blocks, even when you receive them. Do you honestly
think that on these large disks made today (18+ GB is the smallest now)
there are no defects on the surfaces?

/me remembers trying to cram an old donated 5MB (yes M) disk into an
old 8088 Zenith PC in college...

Vivek Khera, Ph.D.
+1-301-869-4449 x806


From: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To: Vivek Khera <vivek(at)khera(dot)org>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: How to improve db performance with $7K?
Date: 2005-04-15 15:58:47
Message-ID: 425FE4B7.6080406@commandprompt.com

Vivek Khera wrote:
>
> On Apr 14, 2005, at 10:03 PM, Kevin Brown wrote:
>
>> Now, bad block remapping destroys that guarantee, but unless you've
>> got a LOT of bad blocks, it shouldn't destroy your performance, right?
>>
>
> ALL disks have bad blocks, even when you receive them. you honestly
> think that these large disks made today (18+ GB is the smallest now)
> that there are no defects on the surfaces?

That is correct. It is just that the HD makers will mark the bad blocks
so that the OS knows not to use them. You can also run the badblocks
command to try to find new bad blocks.

Over time hard drives develop more bad blocks. That doesn't always mean
you have to replace the drive, but it does mean you need to maintain it:
usually at least back up, low-level format (if SCSI) to mark the bad
blocks, then restore.

Sincerely,

Joshua D. Drake

>
> /me remembers trying to cram an old donated 5MB (yes M) disk into an old
> 8088 Zenith PC in college...
>
> Vivek Khera, Ph.D.
> +1-301-869-4449 x806
>

--
Your PostgreSQL solutions provider, Command Prompt, Inc.
24x7 support - 1.800.492.2240, programming, and consulting
Home of PostgreSQL Replicator, plPHP, plPerlNG and pgPHPToolkit
http://www.commandprompt.com / http://www.postgresql.org


From: Vivek Khera <vivek(at)khera(dot)org>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: How to improve db performance with $7K?
Date: 2005-04-15 16:10:14
Message-ID: fff6340b58e9d740bac7f082be46fa07@khera.org


On Apr 15, 2005, at 11:58 AM, Joshua D. Drake wrote:

>> ALL disks have bad blocks, even when you receive them. you honestly
>> think that these large disks made today (18+ GB is the smallest now)
>> that there are no defects on the surfaces?
>
> That is correct. It is just that the HD makers will mark the bad blocks
> so that the OS knows not to use them. You can also run the bad blocks
> command to try and find new bad blocks.
>

My point was that you cannot assume a linear correlation between block
number and physical location, since the bad blocks will be mapped all
over the place.

Vivek Khera, Ph.D.
+1-301-869-4449 x806


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Kevin Brown <kevin(at)sysexperts(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: How to improve db performance with $7K?
Date: 2005-04-15 18:01:41
Message-ID: 87sm1rdim2.fsf@stark.xeocode.com

Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

> Yes, you can probably assume that blocks with far-apart numbers are
> going to require a big seek, and you might even be right in supposing
> that a block with an intermediate number should be read on the way.
> But you have no hope at all of making the right decisions at a more
> local level --- say, reading various sectors within the same cylinder
> in an optimal fashion. You don't know where the track boundaries are,
> so you can't schedule in a way that minimizes rotational latency.
> You're best off to throw all the requests at the drive together and
> let the drive sort it out.

Consider for example three reads, one at the beginning of the disk, one at the
very end, and one in the middle. If the three are performed in the logical
order (assuming the head starts at the beginning), then the drive has to seek,
say, 4ms to get to the middle and 4ms to get to the end.

But if the middle block requires a full rotation to reach it from when the
head arrives that adds another 8ms of rotational delay (assuming a 7200RPM
drive).

Whereas the drive could have seeked over to the last block, then seeked back
in 8ms and gotten there just in time to perform the read for free.
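As back-of-the-envelope arithmetic, using the assumed 4 ms and 8 ms seek figures from the example above and a 7200 RPM rotation:

```python
ROTATION_MS = 60_000 / 7200   # one revolution at 7200 RPM, ~8.33 ms
SEEK_HALF_MS = 4.0            # assumed: seek across half the platter
SEEK_FULL_MS = 8.0            # assumed: seek across the whole platter

# Logical order (start -> middle -> end): the head reaches the middle
# sector just after it passed, so it stalls for one full rotation.
logical_order = SEEK_HALF_MS + ROTATION_MS + SEEK_HALF_MS

# Reordered (start -> end -> middle): the seek back to the middle
# overlaps the rotation, so the sector arrives "for free".
reordered = SEEK_FULL_MS + SEEK_FULL_MS
```

In this toy version the margin is small; the point is that the full rotational stall is converted into seek time the drive would have spent anyway.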

I'm not entirely convinced this explains all of the SCSI drives' superior
performance, though. The above is a worst-case scenario; on average it
should only have a small effect, and it's not like the drive firmware
can really schedule things perfectly either.

I think most of the difference is that the drive manufacturers just don't
package their high end drives with ATA interfaces. So there are no 10k RPM ATA
drives and no 15k RPM ATA drives. I think WD is making fast SATA drives but
most of the manufacturers aren't even doing that.

--
greg


From: Kevin Brown <kevin(at)sysexperts(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: How to improve db performance with $7K?
Date: 2005-04-16 00:53:50
Message-ID: 20050416005350.GF19518@filer

Tom Lane wrote:
> Kevin Brown <kevin(at)sysexperts(dot)com> writes:
> > In the case of pure random reads, you'll end up having to wait an
> > average of half of a rotation before beginning the read.
>
> You're assuming the conclusion. The above is true if the disk is handed
> one request at a time by a kernel that doesn't have any low-level timing
> information. If there are multiple random requests on the same track,
> the drive has an opportunity to do better than that --- if it's got all
> the requests in hand.

True, but see below. Actually, I suspect what matters is if they're
on the same cylinder (which may be what you're talking about here).
And in the above, I was assuming randomly distributed single-sector
reads. In that situation, we can't generically know the probability
that more than one will appear on the same cylinder without knowing
something about the drive geometry.

That said, most modern drives have tens of thousands of cylinders (the
Seagate ST380011a, an 80 gigabyte drive, has 94,600 tracks per inch
according to its datasheet), but much, much smaller queue lengths
(tens of entries, hundreds at most, I'd expect. Hard data on this
would be appreciated). For purely random reads, the probability that
two or more requests in the queue happen to be in the same cylinder is
going to be quite small.
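This is just the birthday problem. A sketch with assumed numbers (50,000 cylinders and a 64-entry queue, both invented for illustration):

```python
def p_shared_cylinder(queue_len, cylinders):
    """Birthday-problem estimate: probability that at least two of
    `queue_len` uniformly random requests land on the same cylinder."""
    p_all_distinct = 1.0
    for i in range(queue_len):
        p_all_distinct *= (cylinders - i) / cylinders
    return 1.0 - p_all_distinct
```

With those numbers there is only about a 4% chance that *any* pair in a 64-deep queue shares a cylinder, so same-cylinder optimization rarely gets a chance to fire on purely random reads.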

--
Kevin Brown kevin(at)sysexperts(dot)com


From: Kevin Brown <kevin(at)sysexperts(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: How to improve db performance with $7K?
Date: 2005-04-16 00:58:25
Message-ID: 20050416005824.GG19518@filer

Vivek Khera wrote:
>
> On Apr 14, 2005, at 10:03 PM, Kevin Brown wrote:
>
> >Now, bad block remapping destroys that guarantee, but unless you've
> >got a LOT of bad blocks, it shouldn't destroy your performance, right?
> >
>
> ALL disks have bad blocks, even when you receive them. you honestly
> think that these large disks made today (18+ GB is the smallest now)
> that there are no defects on the surfaces?

Oh, I'm not at all arguing that you won't have bad blocks. My
argument is that the probability of any given block read or write
operation actually dealing with a remapped block is going to be
relatively small, unless the fraction of bad blocks to total blocks is
large (in which case you basically have a bad disk). And so the
ability to account for remapped blocks shouldn't itself represent a
huge improvement in overall throughput.

--
Kevin Brown kevin(at)sysexperts(dot)com


From: Kevin Brown <kevin(at)sysexperts(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: How to improve db performance with $7K?
Date: 2005-04-16 01:33:31
Message-ID: 20050416013331.GH19518@filer

Rosser Schwarz wrote:
> while you weren't looking, Kevin Brown wrote:
>
> [reordering bursty reads]
>
> > In other words, it's a corner case that I strongly suspect
> > isn't typical in situations where SCSI has historically made a big
> > difference.
>
> [...]
>
> > But I rather doubt that has to be a huge penalty, if any. When a
> > process issues an fsync (or even a sync), the kernel doesn't *have* to
> > drop everything it's doing and get to work on it immediately. It
> > could easily gather a few more requests, bundle them up, and then
> > issue them.
>
> To make sure I'm following you here, are you or are you not suggesting
> that the kernel could sit on -all- IO requests for some small handful
> of ms before actually performing any IO to address what you "strongly
> suspect" is a "corner case"?

The kernel *can* do so. Whether or not it's a good idea depends on
the activity in the system. You'd only consider doing this if you
didn't already have a relatively large backlog of I/O requests to
handle. You wouldn't do this for every I/O request.

Consider this: I/O operations to a block device are so slow compared
with the speed of other (non-I/O) operations on the system that the
system can easily wait for, say, a hundredth of the typical latency on
the target device before issuing requests to it and not have any real
negative impact on the system's I/O throughput.

A process running on my test system, a 3 GHz Xeon, can issue a million
read system calls per second (I've measured it; I can post the rather
trivial source code if you're interested). That's the full round trip
of issuing the system call and having the kernel return back. That
means that in the span of a millisecond, the system could receive 1000
requests if the system were busy enough.

If the average latency for a random read from the disk (including head
movement and everything) is 10 milliseconds, and we decide to delay the
issuance of the first I/O request for a tenth of a millisecond (a
hundredth of the latency), then the system might receive 100 additional
I/O requests, which it could then put into the queue and sort by block
address before issuing the read request. As long as the system knows
what the last block that was requested from that physical device was,
it can order the requests properly and then begin issuing them. Since
the latency on the target device is so high, this is likely to be a
rather big win for overall throughput.
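The sort-by-block-address step is essentially the classic elevator algorithm. A minimal sketch (block numbers stand in for physical position, which, as discussed elsewhere in this thread, is only an approximation):

```python
def order_requests(pending, head_position):
    """Elevator-style ordering: serve everything at or beyond the
    current head position in ascending block order, then sweep back
    for the remainder in descending order."""
    ahead = sorted(b for b in pending if b >= head_position)
    behind = sorted((b for b in pending if b < head_position), reverse=True)
    return ahead + behind
```

For example, with the head near block 40, the batch [50, 10, 90, 30] would be served as [50, 90, 30, 10]: one sweep out, one sweep back, no zig-zagging.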

--
Kevin Brown kevin(at)sysexperts(dot)com


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Kevin Brown <kevin(at)sysexperts(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: How to improve db performance with $7K?
Date: 2005-04-18 20:58:29
Message-ID: 200504182058.j3IKwTG13788@candle.pha.pa.us

Kevin Brown wrote:
> Greg Stark wrote:
>
>
> > I think you're being misled by analyzing the write case.
> >
> > Consider the read case. When a user process requests a block and
> > that read makes its way down to the driver level, the driver can't
> > just put it aside and wait until it's convenient. It has to go ahead
> > and issue the read right away.
>
> Well, strictly speaking it doesn't *have* to. It could delay for a
> couple of milliseconds to see if other requests come in, and then
> issue the read if none do. If there are already other requests being
> fulfilled, then it'll schedule the request in question just like the
> rest.

The idea with SCSI or any command queuing is that you don't have to wait
for another request to come in --- you can send the request as it
arrives, then if another shows up, you send that too, and the drive
optimizes the grouping at a later time, knowing what the drive is doing,
rather than queueing in the kernel.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Alex Turner <armtuk(at)gmail(dot)com>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: Kevin Brown <kevin(at)sysexperts(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: How to improve db performance with $7K?
Date: 2005-04-18 22:49:44
Message-ID: 33c6269f0504181549571e004b@mail.gmail.com

Does it really matter at which end of the cable the queueing is done
(assuming both ends know as much about drive geometry, etc.)?

Alex Turner
netEconomist

On 4/18/05, Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> wrote:
> Kevin Brown wrote:
> > Greg Stark wrote:
> >
> >
> > > I think you're being misled by analyzing the write case.
> > >
> > > Consider the read case. When a user process requests a block and
> > > that read makes its way down to the driver level, the driver can't
> > > just put it aside and wait until it's convenient. It has to go ahead
> > > and issue the read right away.
> >
> > Well, strictly speaking it doesn't *have* to. It could delay for a
> > couple of milliseconds to see if other requests come in, and then
> > issue the read if none do. If there are already other requests being
> > fulfilled, then it'll schedule the request in question just like the
> > rest.
>
> The idea with SCSI or any command queuing is that you don't have to wait
> for another request to come in --- you can send the request as it
> arrives, then if another shows up, you send that too, and the drive
> optimizes the grouping at a later time, knowing what the drive is doing,
> rather queueing in the kernel.
>
> --
> Bruce Momjian | http://candle.pha.pa.us
> pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
> + If your life is a hard drive, | 13 Roberts Road
> + Christ can be your backup. | Newtown Square, Pennsylvania 19073
>


From: Alvaro Herrera <alvherre(at)dcc(dot)uchile(dot)cl>
To: Alex Turner <armtuk(at)gmail(dot)com>
Cc: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, Kevin Brown <kevin(at)sysexperts(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: How to improve db performance with $7K?
Date: 2005-04-18 22:56:21
Message-ID: 20050418225621.GC28733@dcc.uchile.cl

On Mon, Apr 18, 2005 at 06:49:44PM -0400, Alex Turner wrote:
> Does it really matter at which end of the cable the queueing is done
> (Assuming both ends know as much about drive geometry etc..)?

That is a pretty strong assumption, isn't it? Also you seem to be
assuming that the controller<->disk protocol (some internal mechanism,
unknown to mere mortals) is as powerful as the host<->controller one
(SATA, SCSI, etc).

I've lost track of whether this thread is about what is possible with
current, in-market technology, or about what could in theory be possible
[if you were to design "open source" disk controllers and disks.]

--
Alvaro Herrera (<alvherre[(at)]dcc(dot)uchile(dot)cl>)
"La fuerza no está en los medios físicos
sino que reside en una voluntad indomable" (Gandhi)


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Alex Turner <armtuk(at)gmail(dot)com>
Cc: Kevin Brown <kevin(at)sysexperts(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: How to improve db performance with $7K?
Date: 2005-04-19 01:45:15
Message-ID: 200504190145.j3J1jFb28220@candle.pha.pa.us

Alex Turner wrote:
> Does it really matter at which end of the cable the queueing is done
> (Assuming both ends know as much about drive geometry etc..)?

Good question. If the SCSI system was moving the head from track 1 to
10, and a request then came in for track 5, could the system make the
head stop at track 5 on its way to track 10? That is something that
only the controller could do. However, I have no idea if SCSI does
that.

The only part I am pretty sure about is that real-world experience shows
SCSI is better for a mixed I/O environment. Not sure why, exactly, but
the command queueing obviously helps, and I am not sure what else does.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: "Jim C(dot) Nasby" <decibel(at)decibel(dot)org>
To: newz(at)bearfruit(dot)org
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: How to improve db performance with $7K?
Date: 2005-04-19 23:08:36
Message-ID: 20050419230836.GQ58835@decibel.org

On Thu, Apr 14, 2005 at 10:51:46AM -0500, Matthew Nuzum wrote:
> So if you all were going to choose between two hard drives where:
> drive A has capacity C and spins at 15K rpms, and
> drive B has capacity 2 x C and spins at 10K rpms and
> all other features are the same, the price is the same and C is enough
> disk space which would you choose?
>
> I've noticed that on IDE drives, as the capacity increases the data
> density increases and there is a perceived (I've not measured it)
> performance increase.
>
> Would the increased data density of the higher capacity drive be of
> greater benefit than the faster spindle speed of drive A?

The increased data density will help transfer speed off the platter, but
that's it. It won't help rotational latency.
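The two effects are easy to separate with arithmetic: average rotational latency is half a revolution regardless of areal density, so the 15K drive wins a fixed millisecond on every random access, while the denser 10K drive only shortens the transfer phase:

```python
def avg_rotational_latency_ms(rpm):
    """Half a revolution, on average, before the target sector comes
    under the head -- independent of how densely data is packed."""
    return 60_000 / rpm / 2

latency_10k = avg_rotational_latency_ms(10_000)  # 3.0 ms
latency_15k = avg_rotational_latency_ms(15_000)  # 2.0 ms
```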
--
Jim C. Nasby, Database Consultant decibel(at)decibel(dot)org
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"