Re: Hardware/OS recommendations for large databases (

From: "Luke Lonergan" <LLonergan(at)greenplum(dot)com>
To: "Adam Weisberg" <Aweisberg(at)seiu1199(dot)org>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-15 12:09:56
Message-ID: 3E37B936B592014B978C4415F90D662D01995F7B@MI8NYCMAIL06.Mi8.com
Lists: pgsql-performance

Adam,

> -----Original Message-----
> From: pgsql-performance-owner(at)postgresql(dot)org
> [mailto:pgsql-performance-owner(at)postgresql(dot)org] On Behalf Of
> Claus Guttesen
> Sent: Tuesday, November 15, 2005 12:29 AM
> To: Adam Weisberg
> Cc: pgsql-performance(at)postgresql(dot)org
> Subject: Re: [PERFORM] Hardware/OS recommendations for large
> databases ( 5TB)
>
> > Does anyone have recommendations for hardware and/or OS to
> work with
> > around 5TB datasets?
>
> Hardware-wise I'd say dual-core Opterons. One
> dual-core Opteron performs better than two single-cores at the
> same speed. Tyan makes some boards that have four sockets,
> thereby giving you 8 CPUs (if you need that many). Sun and
> HP also make nice hardware, although the Tyan board is more
> competitively priced.
>
> OS-wise I would choose the FreeBSD amd64 port, but
> partitions larger than 2 TB need some special care (using
> gpt rather than disklabel, etc.), and tools like fsck may not
> be able to completely check partitions larger than 2 TB. Linux
> or Solaris with either LVM or Veritas FS sound like candidates.

I agree - you can get a very good one from www.acmemicro.com or
www.rackable.com with 8x 400GB SATA disks and the new 3Ware 9550SX SATA
RAID controller for about $6K with two Opteron 272 CPUs and 8GB of RAM
on a Tyan 2882 motherboard. We get about 400MB/s sustained disk read
performance on these (with tuning) on Linux using the xfs filesystem,
which is one of the most critical factors for large databases.
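A minimal sketch of the kind of tuning involved (read-ahead being the
usual first knob; device name and values are examples only, not our
exact recipe):

  # raise the block-device read-ahead (in 512-byte sectors)
  blockdev --setra 16384 /dev/sda

  # sanity-check raw sequential read speed outside the database
  dd if=/dev/sda of=/dev/null bs=1M count=4096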

Note that you want to have your DBMS use all of the CPU and disk channel
bandwidth you have on each query, which takes a parallel database like
Bizgres MPP to achieve.

Regards,

- Luke


From: Dave Cramer <pg(at)fastcrypt(dot)com>
To: Luke Lonergan <LLonergan(at)greenplum(dot)com>
Cc: "Adam Weisberg" <Aweisberg(at)seiu1199(dot)org>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-15 14:15:04
Message-ID: B41A9E46-AC4F-4A3A-B383-AB70091745CF@fastcrypt.com
Lists: pgsql-performance

Luke,

Have you tried the Areca cards? They are slightly faster yet.

Dave
On 15-Nov-05, at 7:09 AM, Luke Lonergan wrote:

>
> I agree - you can get a very good one from www.acmemicro.com or
> www.rackable.com with 8x 400GB SATA disks and the new 3Ware 9550SX
> SATA
> RAID controller for about $6K with two Opteron 272 CPUs and 8GB of RAM
> on a Tyan 2882 motherboard. We get about 400MB/s sustained disk read
> performance on these (with tuning) on Linux using the xfs filesystem,
> which is one of the most critical factors for large databases.
>
> Note that you want to have your DBMS use all of the CPU and disk
> channel
> bandwidth you have on each query, which takes a parallel database like
> Bizgres MPP to achieve.
>
> Regards,


From: Alex Turner <armtuk(at)gmail(dot)com>
To: Luke Lonergan <LLonergan(at)greenplum(dot)com>
Cc: Adam Weisberg <Aweisberg(at)seiu1199(dot)org>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-16 05:15:36
Message-ID: 33c6269f0511152115v1a34ab81j30f1f5df5a6c6956@mail.gmail.com
Lists: pgsql-performance

On 11/15/05, Luke Lonergan <LLonergan(at)greenplum(dot)com> wrote:
> Adam,
>
> > -----Original Message-----
> > From: pgsql-performance-owner(at)postgresql(dot)org
> > [mailto:pgsql-performance-owner(at)postgresql(dot)org] On Behalf Of
> > Claus Guttesen
> > Sent: Tuesday, November 15, 2005 12:29 AM
> > To: Adam Weisberg
> > Cc: pgsql-performance(at)postgresql(dot)org
> > Subject: Re: [PERFORM] Hardware/OS recommendations for large
> > databases ( 5TB)
> >
> > > Does anyone have recommendations for hardware and/or OS to
> > work with
> > > around 5TB datasets?
> >
> > Hardware-wise I'd say dual-core Opterons. One
> > dual-core Opteron performs better than two single-cores at the
> > same speed. Tyan makes some boards that have four sockets,
> > thereby giving you 8 CPUs (if you need that many). Sun and
> > HP also make nice hardware, although the Tyan board is more
> > competitively priced.
> >
> > OS-wise I would choose the FreeBSD amd64 port, but
> > partitions larger than 2 TB need some special care (using
> > gpt rather than disklabel, etc.), and tools like fsck may not
> > be able to completely check partitions larger than 2 TB. Linux
> > or Solaris with either LVM or Veritas FS sound like candidates.
>
> I agree - you can get a very good one from www.acmemicro.com or
> www.rackable.com with 8x 400GB SATA disks and the new 3Ware 9550SX SATA
> RAID controller for about $6K with two Opteron 272 CPUs and 8GB of RAM
> on a Tyan 2882 motherboard. We get about 400MB/s sustained disk read
> performance on these (with tuning) on Linux using the xfs filesystem,
> which is one of the most critical factors for large databases.
>

Spend a fortune on dual core CPUs and then buy crappy disks... I bet
for most applications this system will be IO bound, and you will see a
nice lot of drive failures in the first year of operation with
consumer grade drives.

Spend your money on better Disks, and don't bother with Dual Core IMHO
unless you can prove the need for it.

Alex


From: Alex Turner <armtuk(at)gmail(dot)com>
To: Dave Cramer <pg(at)fastcrypt(dot)com>
Cc: Luke Lonergan <LLonergan(at)greenplum(dot)com>, Adam Weisberg <Aweisberg(at)seiu1199(dot)org>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-16 05:16:58
Message-ID: 33c6269f0511152116s4483af82s91c711b1b902da14@mail.gmail.com
Lists: pgsql-performance

Not at random access in RAID 10 they aren't, and anyone with their
head screwed on right is using RAID 10. The 9500S will still beat the
Areca cards at RAID 10 database access pattern.

Alex.

On 11/15/05, Dave Cramer <pg(at)fastcrypt(dot)com> wrote:
> Luke,
>
> Have you tried the Areca cards? They are slightly faster yet.
>
> Dave
>
> On 15-Nov-05, at 7:09 AM, Luke Lonergan wrote:
>
> I agree - you can get a very good one from www.acmemicro.com or
> www.rackable.com with 8x 400GB SATA disks and the new 3Ware 9550SX SATA
> RAID controller for about $6K with two Opteron 272 CPUs and 8GB of RAM
> on a Tyan 2882 motherboard. We get about 400MB/s sustained disk read
> performance on these (with tuning) on Linux using the xfs filesystem,
> which is one of the most critical factors for large databases.
>
> Note that you want to have your DBMS use all of the CPU and disk channel
> bandwidth you have on each query, which takes a parallel database like
> Bizgres MPP to achieve.
>
> Regards,


From: William Yu <wyu(at)talisys(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-16 12:51:49
Message-ID: dlfa10$2b2n$1@news.hub.org
Lists: pgsql-performance

Alex Turner wrote:
> Not at random access in RAID 10 they aren't, and anyone with their
> head screwed on right is using RAID 10. The 9500S will still beat the
> Areca cards at RAID 10 database access pattern.

The max 256MB onboard for 3ware cards is disappointing though. While
good enough for 95% of cases, there's that 5% that could use a gig or
two of onboard ram for ultrafast updates. For example, I'm specing out
an upgrade to our current data processing server. Instead of the
traditional 6xFast-Server-HDs, we're gonna go for broke and do
32xConsumer-HDs. This will give us mega I/O bandwidth but we're
vulnerable to random access since consumer-grade HDs don't have the RPMs
or the queueing-smarts. This means we're very dependent on the
controller using onboard RAM to do I/O scheduling. 256MB divided over
4/6/8 drives -- OK. 256MB divided over 32 drives -- ugh, each HD's own
buffer is bigger than the RAM allotted to it.
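The per-drive arithmetic is what makes me nervous (quick shell check):

  echo '256/32' | bc    # 8MB of controller cache per drive, versus the
                        # 8-16MB buffer already on each consumer drive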

At least this is how it seems it would work from thinking through all
the factors. Unfortunately, I haven't found anybody else who has gone
this route and reported their results so I guess we're the guinea pig.


From: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To: Alex Turner <armtuk(at)gmail(dot)com>
Cc: Luke Lonergan <LLonergan(at)greenplum(dot)com>, Adam Weisberg <Aweisberg(at)seiu1199(dot)org>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-16 13:58:48
Message-ID: 437B3B18.1040509@commandprompt.com
Lists: pgsql-performance


>> I agree - you can get a very good one from www.acmemicro.com or
>> www.rackable.com with 8x 400GB SATA disks and the new 3Ware 9550SX SATA
>> RAID controller for about $6K with two Opteron 272 CPUs and 8GB of RAM
>> on a Tyan 2882 motherboard. We get about 400MB/s sustained disk read
>> performance on these (with tuning) on Linux using the xfs filesystem,
>> which is one of the most critical factors for large databases.
>>
>>
>
> Spend a fortune on dual core CPUs and then buy crappy disks... I bet
> for most applications this system will be IO bound, and you will see a
> nice lot of drive failures in the first year of operation with
> consumer grade drives.
>
There is nothing wrong with using SATA disks, and they perform very
well. The catch is: make sure you have a battery backup on the RAID
controller.

> Spend your money on better Disks, and don't bother with Dual Core IMHO
> unless you can prove the need for it.
>
The reason you want the dual core cpus is that PostgreSQL can only
execute 1 query per cpu at a time, so the application will see a big
boost in overall transactional velocity if you push two dual-core cpus
into the machine.
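
An easy way to see the concurrency effect is the pgbench tool shipped
in contrib: each client connection is a separate backend process, so
extra clients can land on the extra cores (a rough sketch; database
name, scale and counts are arbitrary):

  createdb bench
  pgbench -i -s 10 bench          # create and populate the test tables
  pgbench -S -c 1 -t 5000 bench   # 1 client keeps at most 1 cpu busy
  pgbench -S -c 4 -t 5000 bench   # 4 clients can spread across 4 cores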

Joshua D. Drake

> Alex


From: Ron <rjpeace(at)earthlink(dot)net>
To: Alex Turner <armtuk(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases
Date: 2005-11-16 13:58:56
Message-ID: 6.2.5.6.0.20051116083943.01be0b98@earthlink.net
Lists: pgsql-performance

Got some hard numbers to back your statement up? IME, the Areca
1160's with >= 1GB of cache beat any other commodity RAID
controller. This seems to be in agreement with at least one
independent testing source:

http://print.tweakers.net/?reviews/557

RAID HW from Xyratex, Engenio, or Dot Hill will _destroy_ any
commodity HW solution, but their price point is considerably higher.

...on another note, I completely agree with the poster who says we
need more cache on RAID controllers. We should all be beating on the
RAID HW manufacturers to use standard DIMMs for their caches and to
provide 2 standard DIMM slots in their full height cards (allowing
for up to 8GB of cache using 2 4GB DIMMs as of this writing).

It should also be noted that 64-drive chassis are going to become
possible once 2.5" 10Krpm SATA II and FC HDs become the standard next
year (48-drive chassis are the top of the line now). We need controller
technology to keep up.

Ron

At 12:16 AM 11/16/2005, Alex Turner wrote:
>Not at random access in RAID 10 they aren't, and anyone with their
>head screwed on right is using RAID 10. The 9500S will still beat the
>Areca cards at RAID 10 database access pattern.
>
>Alex.
>
>On 11/15/05, Dave Cramer <pg(at)fastcrypt(dot)com> wrote:
> > Luke,
> >
> > Have you tried the Areca cards? They are slightly faster yet.
> >
> > Dave
> >
> > On 15-Nov-05, at 7:09 AM, Luke Lonergan wrote:
> >
> > I agree - you can get a very good one from www.acmemicro.com or
> > www.rackable.com with 8x 400GB SATA disks and the new 3Ware 9550SX SATA
> > RAID controller for about $6K with two Opteron 272 CPUs and 8GB of RAM
> > on a Tyan 2882 motherboard. We get about 400MB/s sustained disk read
> > performance on these (with tuning) on Linux using the xfs filesystem,
> > which is one of the most critical factors for large databases.
> >
> > Note that you want to have your DBMS use all of the CPU and disk channel
> > bandwidth you have on each query, which takes a parallel database like
> > Bizgres MPP to achieve.
> >
> > Regards,


From: Alex Stapleton <alexs(at)advfn(dot)com>
To: William Yu <wyu(at)talisys(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-16 14:13:26
Message-ID: 41AD9695-69AC-4677-AEBD-9D9E21BD8DCF@advfn.com
Lists: pgsql-performance

On 16 Nov 2005, at 12:51, William Yu wrote:

> Alex Turner wrote:
>
>> Not at random access in RAID 10 they aren't, and anyone with their
>> head screwed on right is using RAID 10. The 9500S will still beat the
>> Areca cards at RAID 10 database access pattern.
>>
>
> The max 256MB onboard for 3ware cards is disappointing though.
> While good enough for 95% of cases, there's that 5% that could use
> a gig or two of onboard ram for ultrafast updates. For example, I'm
> specing out an upgrade to our current data processing server.
> Instead of the traditional 6xFast-Server-HDs, we're gonna go for
> broke and do 32xConsumer-HDs. This will give us mega I/O bandwidth
> but we're vulnerable to random access since consumer-grade HDs
> don't have the RPMs or the queueing-smarts. This means we're very
> dependent on the controller using onboard RAM to do I/O scheduling.
> 256MB divided over 4/6/8 drives -- OK. 256MB divided over 32 drives
> -- ugh, each HD's own buffer is bigger than the RAM allotted to it.
>
> At least this is how it seems it would work from thinking through
> all the factors. Unfortunately, I haven't found anybody else who
> has gone this route and reported their results so I guess we're the
> guinea pig.
>

You're going to have to factor in the increased failure rate in your
cost measurements, including any downtime or performance degradation
whilst rebuilding parts of your RAID array. It depends on how long
you're planning for this system to be operational as well, of course.

Pick two: Fast, cheap, reliable.


From: Steve Wampler <swampler(at)noao(dot)edu>
To: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Cc: Alex Turner <armtuk(at)gmail(dot)com>, Luke Lonergan <LLonergan(at)greenplum(dot)com>, Adam Weisberg <Aweisberg(at)seiu1199(dot)org>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-16 14:29:44
Message-ID: 437B4258.8050507@noao.edu
Lists: pgsql-performance

Joshua D. Drake wrote:
> The reason you want the dual core cpus is that PostgreSQL can only
> execute 1 query per cpu at a time,...

Is that true? I knew that PG only used one cpu per query, but how
does PG know how many CPUs there are to limit the number of queries?

--
Steve Wampler -- swampler(at)noao(dot)edu
The gods that smiled on your birth are now laughing out loud.


From: David Boreham <david_list(at)boreham(dot)org>
To: Steve Wampler <swampler(at)noao(dot)edu>
Cc: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, Alex Turner <armtuk(at)gmail(dot)com>, Luke Lonergan <LLonergan(at)greenplum(dot)com>, Adam Weisberg <Aweisberg(at)seiu1199(dot)org>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-16 14:35:23
Message-ID: 437B43AB.7050104@boreham.org
Lists: pgsql-performance

Steve Wampler wrote:

>Joshua D. Drake wrote:
>
>>The reason you want the dual core cpus is that PostgreSQL can only
>>execute 1 query per cpu at a time,...
>
>Is that true? I knew that PG only used one cpu per query, but how
>does PG know how many CPUs there are to limit the number of queries?
He means only one query can be executing on each cpu at any particular
instant.


From: William Yu <wyu(at)talisys(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-16 14:37:48
Message-ID: dlfg7t$1r3u$1@news.hub.org
Lists: pgsql-performance

Alex Stapleton wrote:
> You're going to have to factor in the increased failure rate in your cost
> measurements, including any downtime or performance degradation whilst
> rebuilding parts of your RAID array. It depends on how long you're
> planning for this system to be operational as well, of course.

If we go 32xRAID10, rebuild time should be the same as rebuild time in a
4xRAID10 system. Only the hard drive that was replaced needs rebuild --
not the entire array.

And yes, definitely need a bunch of drives lying around as spares.


From: Steve Wampler <swampler(at)noao(dot)edu>
To: David Boreham <david_list(at)boreham(dot)org>
Cc: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, Alex Turner <armtuk(at)gmail(dot)com>, Luke Lonergan <LLonergan(at)greenplum(dot)com>, Adam Weisberg <Aweisberg(at)seiu1199(dot)org>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-16 14:38:52
Message-ID: 437B447C.7050508@noao.edu
Lists: pgsql-performance

David Boreham wrote:
> Steve Wampler wrote:
>
>> Joshua D. Drake wrote:
>>
>>> The reason you want the dual core cpus is that PostgreSQL can only
>>> execute 1 query per cpu at a time,...
>>
>> Is that true? I knew that PG only used one cpu per query, but how
>> does PG know how many CPUs there are to limit the number of queries?
>
> He means only one query can be executing on each cpu at any particular
> instant.

Got it - the cpu is only acting on one query in any instant but may be
switching between many 'simultaneous' queries. PG isn't really involved
in the decision. That makes sense.

--
Steve Wampler -- swampler(at)noao(dot)edu
The gods that smiled on your birth are now laughing out loud.


From: David Boreham <david_list(at)boreham(dot)org>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-16 14:51:31
Message-ID: 437B4773.1040404@boreham.org
Lists: pgsql-performance

>Spend a fortune on dual core CPUs and then buy crappy disks... I bet
>for most applications this system will be IO bound, and you will see a
>nice lot of drive failures in the first year of operation with
>consumer grade drives.

I guess I've never bought into the vendor story that there are
two reliability grades. Why would they bother making two
different kinds of bearings, motors, etc.? Seems like it's more
likely an excuse to justify higher prices. In my experience the
expensive SCSI drives I own break frequently while the cheapo
desktop drives just keep chunking along (modulo certain products
that have a specific known reliability problem).

I'd expect that a larger number of hotter drives will give a less reliable
system than a smaller number of cooler ones.


From: William Yu <wyu(at)talisys(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-16 15:33:39
Message-ID: dlfjgf$41b$1@news.hub.org
Lists: pgsql-performance

Alex Turner wrote:
> Spend a fortune on dual core CPUs and then buy crappy disks... I bet
> for most applications this system will be IO bound, and you will see a
> nice lot of drive failures in the first year of operation with
> consumer grade drives.
>
> Spend your money on better Disks, and don't bother with Dual Core IMHO
> unless you can prove the need for it.

I would say the opposite -- you always want Dual Core nowadays. DC
Opterons simply give you better bang for the buck than single core
Opterons. Price out a 1xDC system against a 2x1P system -- the 1xDC will
be cheaper. Do the same for 2xDC versus 4x1P, 4xDC versus 8x1P, 8xDC
versus 16x1P, etc. -- DC gets cheaper by wider and wider margins because
those mega-CPU motherboards are astronomically expensive.

DC also gives you a better upgrade path. Let's say you do testing and
figure 2x246 is the right setup to handle the load. Well instead of
getting 2x1P, use the same 2P motherboard but only populate 1 CPU w/ a
DC/270. Now you have a server that can be upgraded to +80% more CPU by
popping in another DC/270 versus throwing out the entire thing to get a
4x1P setup.

The only questions would be:
(1) Do you need an SMP server at all? I'd claim yes -- you always need 2+
cores, whether it's DC or 2P, to avoid I/O interrupts blocking other
processes from running.

(2) Does a DC system perform better than its Nx1P cousin? My experience
is yes. Did some rough tests in a drop-in-replacement 1x265 versus 2x244
and saw about +10% for DC. All the official benchmarks (SPEC, Java, SAP,
etc.) from AMD/Sun/HP/IBM show DCs outperforming the Nx1P setups.

(3) Do you need an insane amount of memory? Well here's the case where
the more expensive motherboard will serve you better since each CPU slot
has its own bank of memory. Spend more money on memory, get cheaper
single-core CPUs.

Of course, this doesn't apply if you are an Intel/Dell-only shop. Xeon
DCs, while cheaper than their corresponding single-core SMPs, don't have
the same performance profile as Opteron DCs. Basically, you're paying a
bit extra so your server can generate a ton more heat.


From: William Yu <wyu(at)talisys(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-16 15:41:10
Message-ID: dlfjuk$9n9$1@news.hub.org
Lists: pgsql-performance

David Boreham wrote:
> >Spend a fortune on dual core CPUs and then buy crappy disks... I bet
> >for most applications this system will be IO bound, and you will see a
> >nice lot of drive failures in the first year of operation with
> >consumer grade drives.
>
> I guess I've never bought into the vendor story that there are
> two reliability grades. Why would they bother making two
> different kinds of bearings, motors, etc.? Seems like it's more
> likely an excuse to justify higher prices. In my experience the
> expensive SCSI drives I own break frequently while the cheapo
> desktop drives just keep chunking along (modulo certain products
> that have a specific known reliability problem).
>
> I'd expect that a larger number of hotter drives will give a less reliable
> system than a smaller number of cooler ones.

Our SCSI drives have failed maybe a little less than our IDE drives.
Hell, some of the SCSIs even came bad when we bought them. Of course,
the IDE drive failure % is inflated by all the IBM Deathstars we got -- ugh.

Basically, I've found it's cooling that's most important. Packing the
drives together into really small rackmounts? Good for your density, not
good for the drives. Now we do larger rackmounts -- drives have more
space in between each other plus fans in front and back of the drives.


From: "Douglas J(dot) Trainor" <trainor(at)transborder(dot)net>
To: Steve Wampler <swampler(at)noao(dot)edu>
Cc: David Boreham <david_list(at)boreham(dot)org>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, Alex Turner <armtuk(at)gmail(dot)com>, Luke Lonergan <LLonergan(at)greenplum(dot)com>, Adam Weisberg <Aweisberg(at)seiu1199(dot)org>, pgsql-performance(at)postgresql(dot)org
Subject: OT Re: Hardware/OS recommendations for large databases (
Date: 2005-11-16 15:45:15
Message-ID: 2F6294FF-F274-46E8-85D0-461A6FF36653@transborder.net
Lists: pgsql-performance


AMD added quad-core processors to their public roadmap for 2007.

Beyond 2007, the quad-cores will scale up to 32 sockets!!!!!!!!
(using Direct Connect Architecture 2.0)

Expect Intel to follow.

douglas

On Nov 16, 2005, at 9:38 AM, Steve Wampler wrote:

> [...]
>
> Got it - the cpu is only acting on one query in any instant but may be
> switching between many 'simultaneous' queries. PG isn't really
> involved
> in the decision. That makes sense.


From: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To: David Boreham <david_list(at)boreham(dot)org>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-16 15:47:53
Message-ID: 437B54A9.5070104@commandprompt.com
Lists: pgsql-performance

>
> I guess I've never bought into the vendor story that there are
> two reliability grades. Why would they bother making two
> different kinds of bearings, motors, etc.? Seems like it's more
> likely an excuse to justify higher prices. In my experience the
> expensive SCSI drives I own break frequently while the cheapo
> desktop drives just keep chunking along (modulo certain products
> that have a specific known reliability problem).

I don't know if the reliability grade is true or not, but what I can
tell you is that I have SCSI drives that are 5+ years old that still
work without issue.

I have never had an IDE drive last longer than 3 years (when used in
production).

That being said, so what? That is what RAID is for. You lose a drive
and hot-swap it back in. Heck, keep a hot spare in the trays.

Joshua D. Drake

>
> I'd expect that a larger number of hotter drives will give a less reliable
> system than a smaller number of cooler ones.


From: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To: William Yu <wyu(at)talisys(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-16 16:03:37
Message-ID: 437B5859.9040805@commandprompt.com
Lists: pgsql-performance

>
> The only questions would be:
> (1) Do you need a SMP server at all? I'd claim yes -- you always need
> 2+ cores whether it's DC or 2P to avoid IO interrupts blocking other
> processes from running.

I would back this up, even for smaller installations (single RAID 1,
1 GB of RAM). Why? Because many applications are going to be CPU bound.
For example, we have a PHP application that is a CMS. On a single-CPU
machine with RAID 1 it takes about 300ms to deliver a single page, point
to point. We are not I/O bound. So what happens is that under reasonable
load we are actually waiting for the CPU to process the code.

A simple upgrade to an SMP machine literally doubles our performance
because we are still not I/O bound. I strongly suggest that everyone use
at least a single dual core because of this experience.

>
> (3) Do you need an insane amount of memory? Well here's the case where
> the more expensive motherboard will serve you better since each CPU
> slot has its own bank of memory. Spend more money on memory, get
> cheaper single-core CPUs.
Agreed. A lot of times the slowest dual-core is 5x what you actually
need. So get the slowest, and bulk up on memory. If nothing else memory
is cheap today and it might not be tomorrow.

> Of course, this doesn't apply if you are an Intel/Dell-only shop. Xeon
> DCs, while cheaper than their corresponding single-core SMPs, don't
> have the same performance profile of Opteron DCs. Basically, you're
> paying a bit extra so your server can generate a ton more heat.
>
Well if you are an Intel/Dell shop running PostgreSQL you have bigger
problems ;)

Sincerely,

Joshua D. Drake



From: Scott Marlowe <smarlowe(at)g2switchworks(dot)com>
To: David Boreham <david_list(at)boreham(dot)org>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-16 17:06:25
Message-ID: 1132160785.3582.60.camel@state.g2switchworks.com
Lists: pgsql-performance

On Wed, 2005-11-16 at 08:51, David Boreham wrote:
> >Spend a fortune on dual core CPUs and then buy crappy disks... I bet
> >for most applications this system will be IO bound, and you will see a
> >nice lot of drive failures in the first year of operation with
> >consumer grade drives.
>
> I guess I've never bought into the vendor story that there are
> two reliability grades. Why would they bother making two
> different kinds of bearings, motors, etc.? Seems like it's more
> likely an excuse to justify higher prices. In my experience the
> expensive SCSI drives I own break frequently while the cheapo
> desktop drives just keep chunking along (modulo certain products
> that have a specific known reliability problem).
>
> I'd expect that a larger number of hotter drives will give a less reliable
> system than a smaller number of cooler ones.

My experience has mirrored this.

Anyone remember back when HP made their SureStore drives? We built 8
drive RAID arrays to ship to customer sites, pre-filled with data. Not
a single one arrived fully operational. The failure rate on those
drives was something like 60% in the first year, and HP quit making hard
drives because of it.

Those were SCSI Server class drives, supposedly built to last 5 years.

OTOH, I remember putting a pair of 60 Gig IDEs into a server that had
lots of ventilation and fans and such, and having no problems
whatsoever.

There was a big commercial EMC style array in the hosting center at the
same place that had something like a 16 wide by 16 tall array of IDE
drives for storing pdf / tiff stuff on it, and we had at least one
failure a month in it. Of course, that's 256 drives, so you're gonna
have failures, and it was configured with a spare on every other row or
some such. We just had a big box of hard drives and it was smart enough
to rebuild automagically when you put a new one in, so the maintenance
wasn't really that bad. The performance was quite impressive too.


From: Alex Turner <armtuk(at)gmail(dot)com>
To: Ron <rjpeace(at)earthlink(dot)net>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases
Date: 2005-11-16 17:08:37
Message-ID: 33c6269f0511160908s5c432ee2r82f78598b8f5d036@mail.gmail.com
Lists: pgsql-performance

Yes - that very benchmark shows that for a MySQL Datadrive in RAID 10,
the 3ware controllers beat the Areca card.

Alex.

On 11/16/05, Ron <rjpeace(at)earthlink(dot)net> wrote:
> Got some hard numbers to back your statement up? IME, the Areca
> 1160's with >= 1GB of cache beat any other commodity RAID
> controller. This seems to be in agreement with at least one
> independent testing source:
>
> http://print.tweakers.net/?reviews/557
>
> RAID HW from Xyratex, Engenio, or Dot Hill will _destroy_ any
> commodity HW solution, but their price point is considerably higher.
>
> ...on another note, I completely agree with the poster who says we
> need more cache on RAID controllers. We should all be beating on the
> RAID HW manufacturers to use standard DIMMs for their caches and to
> provide 2 standard DIMM slots in their full height cards (allowing
> for up to 8GB of cache using 2 4GB DIMMs as of this writing).
>
> It should also be noted that 64 drive chassis' are going to become
> possible once 2.5" 10Krpm SATA II and FC HDs become the standard next
> year (48's are the TOTL now). We need controller technology to keep up.
>
> Ron
>
> At 12:16 AM 11/16/2005, Alex Turner wrote:
> >Not at random access in RAID 10 they aren't, and anyone with their
> >head screwed on right is using RAID 10. The 9500S will still beat the
> >Areca cards at RAID 10 database access pattern.
> >
> >Alex.
> >
> >On 11/15/05, Dave Cramer <pg(at)fastcrypt(dot)com> wrote:
> > > Luke,
> > >
> > > Have you tried the Areca cards? They are slightly faster yet.
> > >
> > > Dave
> > >
> > > On 15-Nov-05, at 7:09 AM, Luke Lonergan wrote:
> > >
> > > I agree - you can get a very good one from www.acmemicro.com or
> > > www.rackable.com with 8x 400GB SATA disks and the new 3Ware 9550SX SATA
> > > RAID controller for about $6K with two Opteron 272 CPUs and 8GB of RAM
> > > on a Tyan 2882 motherboard. We get about 400MB/s sustained disk read
> > > performance on these (with tuning) on Linux using the xfs filesystem,
> > > which is one of the most critical factors for large databases.
> > >
> > > Note that you want to have your DBMS use all of the CPU and disk channel
> > > bandwidth you have on each query, which takes a parallel database like
> > > Bizgres MPP to achieve.
> > >
> > > Regards,


From: Scott Marlowe <smarlowe(at)g2switchworks(dot)com>
To: William Yu <wyu(at)talisys(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-16 17:09:38
Message-ID: 1132160978.3582.64.camel@state.g2switchworks.com
Lists: pgsql-performance

On Wed, 2005-11-16 at 09:33, William Yu wrote:
> Alex Turner wrote:
> > Spend a fortune on dual core CPUs and then buy crappy disks... I bet
> > for most applications this system will be IO bound, and you will see a
> > nice lot of drive failures in the first year of operation with
> > consumer grade drives.
> >
> > Spend your money on better Disks, and don't bother with Dual Core IMHO
> > unless you can prove the need for it.
>
> I would say the opposite -- you always want Dual Core nowadays. DC
> Opterons simply give you better bang for the buck than single core
> Opterons. Price out a 1xDC system against a 2x1P system -- the 1xDC will
> be cheaper. Do the same for 2xDC versus 4x1P, 4xDC versus 8x1P, 8xDC
> versus 16x1P, etc. -- DC gets cheaper by wider and wider margins because
> those mega-CPU motherboards are astronomically expensive.

The biggest gain is going from 1 to 2 CPUs (real cpus, like the DC
Opterons or genuine dual CPU mobo, not "hyperthreaded"). Part of the
issue isn't just raw CPU processing power. The second CPU allows the
machine to be more responsive because it doesn't have to context switch
as much.

While I've seen plenty of single CPU servers start to bog under load
running one big query, the dual CPU machines always seem more than just
twice as snappy under similar loads.


From: "Luke Lonergan" <llonergan(at)greenplum(dot)com>
To: "Scott Marlowe" <smarlowe(at)g2switchworks(dot)com>, "William Yu" <wyu(at)talisys(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-16 17:47:26
Message-ID: BFA0B0AE.13D4E%llonergan@greenplum.com
Lists: pgsql-performance

Scott,

On 11/16/05 9:09 AM, "Scott Marlowe" <smarlowe(at)g2switchworks(dot)com> wrote:

> The biggest gain is going from 1 to 2 CPUs (real cpus, like the DC
> Opterons or genuine dual CPU mobo, not "hyperthreaded"). Part of the
> issue isn't just raw CPU processing power. The second CPU allows the
> machine to be more responsive because it doesn't have to context switch
> as much.
>
> While I've seen plenty of single CPU servers start to bog under load
> running one big query, the dual CPU machines always seem more than just
> twice as snappy under similar loads.
>
I agree, 2 CPUs are better than one in most cases.

The discussion was kicked off by the suggestion to get 8 dual core CPUs to
process a large database with postgres. Say your decision support query
takes 15 minutes to run with one CPU. Add another and it still takes 15
minutes. Add 15 and the same ...

OLTP is so different from Business Intelligence and Decision Support that
very little of this thread's discussion is relevant IMO.

The job is to design a system that can process sequential scan as fast as
possible and uses all resources (CPUs, mem, disk channels) on each query.
Sequential scan is 100x more important than random seeks.

Here are the facts so far:
* Postgres can only use 1 CPU on each query
* Postgres I/O for sequential scan is CPU limited to 110-120 MB/s on the
fastest modern CPUs
* Postgres disk-based sort speed is 1/10th that of commercial databases or
worse, and memory doesn't improve it (much)
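
Each of these is easy to check on your own hardware; a rough sketch
(device, database and table names are placeholders):

  # raw sequential read rate of the array, outside postgres
  dd if=/dev/sda of=/dev/null bs=8k count=1000000

  # postgres sequential scan rate: time a full scan of a table larger
  # than RAM, then divide its on-disk size by the elapsed time
  time psql -d mydb -c 'SELECT count(*) FROM bigtable;'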

These are the conclusions that follow about decision support / BI system
architecture for normal Postgres:
* I/O hardware with more than 110MB/s of read bandwidth is not useful
* More than 1 CPU is not useful
* More RAM than a nominal amount for small table caching is not useful

In other words, big SMP doesn't address the problem at all. By contrast,
having all CPUs on multiple machines, or even on a big SMP with lots of I/O
channels, solves all of the above issues.

Regards,

- Luke


From: "Luke Lonergan" <llonergan(at)greenplum(dot)com>
To: "Scott Marlowe" <smarlowe(at)g2switchworks(dot)com>, "William Yu" <wyu(at)talisys(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-16 17:49:28
Message-ID: BFA0B128.13D50%llonergan@greenplum.com
Lists: pgsql-performance

Oops,

Last point should be worded: "All CPUs on all machines used by a parallel
database"

- Luke

On 11/16/05 9:47 AM, "Luke Lonergan" <llonergan(at)greenplum(dot)com> wrote:

> Scott,
>
> On 11/16/05 9:09 AM, "Scott Marlowe" <smarlowe(at)g2switchworks(dot)com> wrote:
>
>> The biggest gain is going from 1 to 2 CPUs (real cpus, like the DC
>> Opterons or genuine dual CPU mobo, not "hyperthreaded"). Part of the
>> issue isn't just raw CPU processing power. The second CPU allows the
>> machine to be more responsive because it doesn't have to context switch
>> as much.
>>
>> While I've seen plenty of single CPU servers start to bog under load
>> running one big query, the dual CPU machines always seem more than just
>> twice as snappy under similar loads.
>>
> I agree, 2 CPUs are better than one in most cases.
>
> The discussion was kicked off by the suggestion to get 8 dual core CPUs to
> process a large database with postgres. Say your decision support query takes
> 15 minutes to run with one CPU. Add another and it still takes 15 minutes.
> Add 15 and the same ...
>
> OLTP is so different from Business Intelligence and Decision Support that very
> little of this thread's discussion is relevant IMO.
>
> The job is to design a system that can process sequential scan as fast as
> possible and uses all resources (CPUs, mem, disk channels) on each query.
> Sequential scan is 100x more important than random seeks.
>
> Here are the facts so far:
> * Postgres can only use 1 CPU on each query
> * Postgres I/O for sequential scan is CPU limited to 110-120 MB/s on the
> fastest modern CPUs
> * Postgres disk-based sort speed is 1/10th that of commercial databases or
> worse, and memory doesn't improve it (much)
>
> These are the conclusions that follow about decision support / BI system
> architecture for normal Postgres:
> * I/O hardware with more than 110MB/s of read bandwidth is not useful
> * More than 1 CPU is not useful
> * More RAM than a nominal amount for small table caching is not useful
>
> In other words, big SMP doesn't address the problem at all. By contrast,
> having all CPUs on multiple machines, or even on a big SMP with lots of I/O
> channels, solves all of the above issues.
>
> Regards,
>
- Luke


From: Scott Marlowe <smarlowe(at)g2switchworks(dot)com>
To: Luke Lonergan <llonergan(at)greenplum(dot)com>
Cc: William Yu <wyu(at)talisys(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-16 17:54:34
Message-ID: 1132163674.3582.81.camel@state.g2switchworks.com
Lists: pgsql-performance

On Wed, 2005-11-16 at 11:47, Luke Lonergan wrote:
> Scott,

Some cutting for clarity... I agree on the OLTP versus OLAP
discussion.

> Here are the facts so far:
> * Postgres can only use 1 CPU on each query
> * Postgres I/O for sequential scan is CPU limited to 110-120
> MB/s on the fastest modern CPUs
> * Postgres disk-based sort speed is 1/10 or more slower than
> commercial databases and memory doesn’t improve it (much)

But PostgreSQL only spills to disk if the data set won't fit into the
amount of memory allocated by work_mem / sort_mem. And for most
business analysis stuff, this can be quite large, and you can even crank
it up for a single query.

I've written reports that were horrifically slow, hitting the disk and
all, and I upped sort_mem to hundreds of megabytes until it fit into
memory, and suddenly a slow query was running many times faster than
before.

Or did you mean something else by "disk-based sort speed"???
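
Something like this is all it takes, i.e. a session-local bump just for
the big report (a sketch; sort_mem is in kilobytes in the 7.4-era
releases, 8.x renames it work_mem, and the table and column names here
are made up):

  psql mydb <<'EOF'
  SET sort_mem = 524288;            -- ~512MB, this session only
  SELECT customer_id, sum(amount)   -- stand-in for the big report query
    FROM orders
   GROUP BY customer_id
   ORDER BY 2 DESC;
  EOF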


From: Matthew Nuzum <mattnuzum(at)gmail(dot)com>
To: David Boreham <david_list(at)boreham(dot)org>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-16 18:51:25
Message-ID: f3c0b4080511161051t514b1bd0o8ebe0f5dab6b591b@mail.gmail.com
Lists: pgsql-performance

On 11/16/05, David Boreham <david_list(at)boreham(dot)org> wrote:
> >Spend a fortune on dual core CPUs and then buy crappy disks... I bet
> >for most applications this system will be IO bound, and you will see a
> >nice lot of drive failures in the first year of operation with
> >consumer grade drives.
>
> I guess I've never bought into the vendor story that there are
> two reliability grades. Why would they bother making two
> different kinds of bearings, motors, etc.? Seems like it's more
> likely an excuse to justify higher prices. In my experience the
> expensive SCSI drives I own break frequently while the cheapo
> desktop drives just keep chunking along (modulo certain products
> that have a specific known reliability problem).
>
> I'd expect that a larger number of hotter drives will give a less reliable
> system than a smaller number of cooler ones.

Of all the SCSI and IDE drives I've used, and I've used a lot, there
is a definite difference in quality. The SCSI drives primarily use
higher quality components that are intended to last longer under 24/7
workloads. I've taken several SCSI and IDE drives apart and you can
tell from the guts that the SCSI drives are built with sturdier
components.

I haven't gotten my hands on the Raptor line of ATA drives yet, but
I've heard they share this in common with the SCSI drives - they are
built with components made to be used day and night, for years on end.

That doesn't mean they will last longer than IDE drives; it just
means they've been designed to withstand higher amounts of heat and
sustained activity. I've got some IDE drives that have lasted years++
and I've got some IDE drives that have lasted months. However, the SCSI
drives I've had over the years all lasted longer than the server they
were installed in.

I will say that in the last 10 years, the MTBF of IDE/ATA drives has
improved dramatically, so I regularly use them in servers; however, I
have also shifted my ideology so that a server should be replaced
after 3 years, where before I aimed for 5.

It seems to me that the least reliable components in servers these
days are the fans.

--
Matthew Nuzum
www.bearfruit.org


From: "Steinar H(dot) Gunderson" <sgunderson(at)bigfoot(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-16 18:51:38
Message-ID: 20051116185138.GA22831@uio.no
Lists: pgsql-performance

On Wed, Nov 16, 2005 at 11:06:25AM -0600, Scott Marlowe wrote:
> There was a big commercial EMC style array in the hosting center at the
> same place that had something like a 16 wide by 16 tall array of IDE
> drives for storing pdf / tiff stuff on it, and we had at least one
> failure a month in it. Of course, that's 256 drives, so you're gonna
> have failures, and it was configured with a spare on every other row or
> some such. We just had a big box of hard drives and it was smart enough
> to rebuild automagically when you put a new one in, so the maintenance
> wasn't really that bad. The performance was quite impressive too.

If you have a cool SAN, it alerts you and removes all data off a disk
_before_ it starts giving hard failures :-)

/* Steinar */
--
Homepage: http://www.sesse.net/


From: Matthew Nuzum <mattnuzum(at)gmail(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-16 19:03:46
Message-ID: f3c0b4080511161103j7677939bldf1c55b137ce9741@mail.gmail.com
Lists: pgsql-performance

On 11/16/05, Steinar H. Gunderson <sgunderson(at)bigfoot(dot)com> wrote:
> If you have a cool SAN, it alerts you and removes all data off a disk
> _before_ it starts giving hard failures :-)
>
> /* Steinar */
> --
> Homepage: http://www.sesse.net/

Good point. I have avoided data loss *twice* this year by using SMART
hard drive monitoring software.

I can't tell you how good it feels to replace a drive that is about to
die, as compared to restoring data because a drive died.
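
If you're not already running it, smartmontools makes the monitoring
easy; a minimal sketch (device name and mail address are examples):

  smartctl -H /dev/sda    # one-shot overall health self-assessment
  smartctl -A /dev/sda    # attributes: reallocated sectors, temp, etc.

  # or let the smartd daemon watch the drive and mail you first,
  # via a line in /etc/smartd.conf like:
  /dev/sda -a -m admin@example.com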
--
Matthew Nuzum
www.bearfruit.org


From: Scott Marlowe <smarlowe(at)g2switchworks(dot)com>
To: "Steinar H(dot) Gunderson" <sgunderson(at)bigfoot(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-16 20:08:03
Message-ID: 1132171683.3582.85.camel@state.g2switchworks.com
Lists: pgsql-performance

On Wed, 2005-11-16 at 12:51, Steinar H. Gunderson wrote:
> On Wed, Nov 16, 2005 at 11:06:25AM -0600, Scott Marlowe wrote:
> > There was a big commercial EMC style array in the hosting center at the
> > same place that had something like a 16 wide by 16 tall array of IDE
> > drives for storing pdf / tiff stuff on it, and we had at least one
> > failure a month in it. Of course, that's 256 drives, so you're gonna
> > have failures, and it was configured with a spare on every other row or
> > some such. We just had a big box of hard drives and it was smart enough
> > to rebuild automagically when you put a new one in, so the maintenance
> > wasn't really that bad. The performance was quite impressive too.
>
> If you have a cool SAN, it alerts you and removes all data off a disk
> _before_ it starts giving hard failures :-)

Yeah, I forget who made the unit we used, but it was pretty much fully
automated. It was something like a large RAID 5+0 (0+5???) and would
send an alert when a drive died or started getting errors, and the bad
drive's caddy would be flashing red instead of steady green.

I just remember thinking that I'd never used a drive array that was
taller than I was before that.


From: Ron <rjpeace(at)earthlink(dot)net>
To: Alex Turner <armtuk(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases
Date: 2005-11-16 20:57:20
Message-ID: 6.2.5.6.0.20051116154502.01c41f50@earthlink.net
Lists: pgsql-performance

You _ARE_ kidding right? In what hallucination?

The performance numbers for the 1GB cache version of the Areca 1160
are the _grey_ line in the figures, and were added after the original
article was published:

"Note: Since the original Dutch article was published in late
January, we have finished tests of the 16-port Areca ARC-1160 using
128MB, 512MB and 1GB cache configurations and RAID 5 arrays of up to
12 drives. The ARC-1160 was using the latest 1.35 beta firmware. The
performance graphs have been updated to include the ARC-1160 results.
Discussions of the results have not been updated, however. "

With 1GB of cache, the 1160s beat everything else in almost all of
the tests they participated in. For the few where they do not win
hands down, the Escalades (very occasionally) essentially tie.

These are very easy to read full-color graphs where higher is better,
and the grey line representing the 1GB 1160s is almost always higher
on the graph than anything else. Granted, the Escalades seem to give
them the overall best run for their money, but they still are clearly
second best when looking at all the graphs and the CPU utilization
numbers in aggregate.

Ron

At 12:08 PM 11/16/2005, Alex Turner wrote:
>Yes - that very benchmark shows that for a MySQL Datadrive in RAID 10,
>the 3ware controllers beat the Areca card.
>
>Alex.
>
>On 11/16/05, Ron <rjpeace(at)earthlink(dot)net> wrote:
> > Got some hard numbers to back your statement up? IME, the Areca
> > 1160's with >= 1GB of cache beat any other commodity RAID
> > controller. This seems to be in agreement with at least one
> > independent testing source:
> >
> > http://print.tweakers.net/?reviews/557
> >
> > RAID HW from Xyratex, Engenio, or Dot Hill will _destroy_ any
> > commodity HW solution, but their price point is considerably higher.
> >
> > ...on another note, I completely agree with the poster who says we
> > need more cache on RAID controllers. We should all be beating on the
> > RAID HW manufacturers to use standard DIMMs for their caches and to
> > provide 2 standard DIMM slots in their full height cards (allowing
> > for up to 8GB of cache using 2 4GB DIMMs as of this writing).
> >
> > It should also be noted that 64 drive chassis' are going to become
> > possible once 2.5" 10Krpm SATA II and FC HDs become the standard next
> > year (48's are the TOTL now). We need controller technology to keep up.
> >
> > Ron
> >
> > At 12:16 AM 11/16/2005, Alex Turner wrote:
> > >Not at random access in RAID 10 they aren't, and anyone with their
> > >head screwed on right is using RAID 10. The 9500S will still beat the
> > >Areca cards at RAID 10 database access pattern.
> > >
> > >Alex.
> > >
> > >On 11/15/05, Dave Cramer <pg(at)fastcrypt(dot)com> wrote:
> > > > Luke,
> > > >
> > > > Have you tried the Areca cards? They are slightly faster yet.
> > > >
> > > > Dave
> > > >
> > > > On 15-Nov-05, at 7:09 AM, Luke Lonergan wrote:
> > > >
> > > > I agree - you can get a very good one from www.acmemicro.com or
> > > > www.rackable.com with 8x 400GB SATA disks and the new 3Ware 9550SX SATA
> > > > RAID controller for about $6K with two Opteron 272 CPUs and 8GB of RAM
> > > > on a Tyan 2882 motherboard. We get about 400MB/s sustained disk read
> > > > performance on these (with tuning) on Linux using the xfs filesystem,
> > > > which is one of the most critical factors for large databases.
> > > >
> > > > Note that you want to have your DBMS use all of the CPU and disk channel
> > > > bandwidth you have on each query, which takes a parallel database like
> > > > Bizgres MPP to achieve.
> > > >
> > > > Regards,


From: mudfoot(at)rawbw(dot)com
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-16 21:03:37
Message-ID: 1132175017.437b9ea9e33e2@webmail.rawbw.com
Lists: pgsql-performance

Yeah, those big disk arrays are real sweet.

One day last week I was in a data center in Arizona when the big LSI/Storagetek
array in the cage next to mine had a hard drive failure. So the alarm shrieked
at like 13225535 decibels continuously for hours. BEEEP BEEEEP BEEEEP BEEEEP.
Of course since this was a colo facility it wasn't staffed on site by the idiots
who own the array. BEEEEP BEEEEEEEP BEEEEEEEP for hours. So I had to stand
next to this thing--only separated by a few feet and a little wire mesh--while
it shrieked for hours until a knuckle-dragger arrived on site to swap the drive.

Yay.

So if you're going to get a fancy array (they're worth it if somebody else is
paying) then make sure to *turn off the @#%@#SF'ing audible alarm* if you deploy
it in a colo facility.

Quoting Scott Marlowe <smarlowe(at)g2switchworks(dot)com>:

> On Wed, 2005-11-16 at 12:51, Steinar H. Gunderson wrote:
> > On Wed, Nov 16, 2005 at 11:06:25AM -0600, Scott Marlowe wrote:
> > > There was a big commercial EMC style array in the hosting center at the
> > > same place that had something like a 16 wide by 16 tall array of IDE
> > > drives for storing pdf / tiff stuff on it, and we had at least one
> > > failure a month in it. Of course, that's 256 drives, so you're gonna
> > > have failures, and it was configured with a spare on every other row or
> > > some such. We just had a big box of hard drives and it was smart enough
> > > to rebuild automagically when you put a new one in, so the maintenance
> > > wasn't really that bad. The performance was quite impressive too.
> >
> > If you have a cool SAN, it alerts you and removes all data off a disk
> > _before_ it starts giving hard failures :-)
>
> Yeah, I forget who made the unit we used, but it was pretty much fully
> automated. IT was something like a large RAID 5+0 (0+5???) and would
> send an alert when a drive died or started getting errors, and the bad
> drive's caddy would be flashing read instead of steady green.
>
> I just remember thinking that I'd never used a drive array that was
> taller than I was before that.


From: Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com>
To: William Yu <wyu(at)talisys(dot)com>
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-16 21:08:58
Message-ID: 437B9FEA.5010003@cheapcomplexdevices.com
Lists: pgsql-performance

William Yu wrote:
>
> Our SCSI drives have failed maybe a little less than our IDE drives.

Microsoft, in their database showcase TerraServer project, has had
the same experience. They studied multiple configurations, including
a SCSI/SAN solution as well as a cluster of SATA boxes.

They measured a 6.4% average annual failure rate on their SATA version
and a 5.5% average annual failure rate on their SCSI implementation.

ftp://ftp.research.microsoft.com/pub/tr/TR-2004-107.pdf

"We lost 9 drives out of 140 SATA drives on the Web and
Storage Bricks in one year. This is a 6.4% annual failure rate.
In contrast, the Compaq Storageworks SAN and Web servers lost
approximately 32 drives in three years out of a total of 194
drives. This is a 5.5% annual failure rate.

The failure rates indicate that SCSI drives are more
reliable than SATA. SATA drives are substantially
cheaper than SCSI drives. Because the SATA failure rate
is so close to the SCSI failure rate gives SATA a
substantial return on investment advantage."

So unless your system is extremely sensitive to single drive
failures, the difference is pretty small. And for the cost
it seems you can buy enough extra spindles of SATA drives to
easily make up for the performance difference.
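
The paper's arithmetic is easy to verify from the raw counts it gives
(a quick shell check):

  echo 'scale=4; 9/140' | bc        # SATA: .0642, about 6.4%/year
  echo 'scale=4; (32/3)/194' | bc   # SCSI: .0549, about 5.5%/year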

> Basically, I've found it's cooling that's most important. Packing the
> drives together into really small rackmounts? Good for your density, not
> good for the drives.

Indeed that was their guess for their better-than-expected
life of their SATA drives as well. From the same paper:

"We were careful about disk cooling – SATA
drives are rarely cooled with the same care that a SCSI
array receives."


From: Ron <rjpeace(at)earthlink(dot)net>
To: Alex Turner <armtuk(at)gmail(dot)com>,pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases
Date: 2005-11-16 21:21:25
Message-ID: 6.2.5.6.0.20051116161052.03d54a70@earthlink.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Amendment: there are graphs where the 1GB Areca 1160's do not do as
well. Given that those graphs are mySQL-specific, and that similar usage
scenarios not involving mySQL (as well as most of the usage scenarios
that do involve mySQL; as I said, these did not follow the pattern of
the rest of the benchmarks) show the usual pattern of the 1GB 1160's in
1st place or tied for 1st place, it seems reasonable that mySQL has
something to do with the aberrant results in those 2 (IIRC) cases.

Ron

At 03:57 PM 11/16/2005, Ron wrote:
>You _ARE_ kidding right? In what hallucination?
>
>The performance numbers for the 1GB cache version of the Areca 1160
>are the _grey_ line in the figures, and were added after the
>original article was published:
>
>"Note: Since the original Dutch article was published in late
>January, we have finished tests of the 16-port Areca ARC-1160 using
>128MB, 512MB and 1GB cache configurations and RAID 5 arrays of up to
>12 drives. The ARC-1160 was using the latest 1.35 beta firmware. The
>performance graphs have been updated to include the ARC-1160
>results. Discussions of the results have not been updated, however. "
>
>With 1GB of cache, the 1160's beat everything else in almost all of
>the tests they participated in. For the few where they do not win
>hands down, the Escalades (very occasionally) essentially tie.
>
>These are very easy to read full color graphs where higher is better
>and the grey line representing the 1GB 1160's is almost always
>higher on the graph than anything else. Granted the Escalades seem
>to give them the overall best run for their money, but they still
>are clearly second best when looking at all the graphs and the CPU
>utilization numbers in aggregate.
>
>Ron
>
>
>
>At 12:08 PM 11/16/2005, Alex Turner wrote:
>>Yes - that very benchmark shows that for a MySQL Datadrive in RAID 10,
>>the 3ware controllers beat the Areca card.
>>
>>Alex.
>>
>>On 11/16/05, Ron <rjpeace(at)earthlink(dot)net> wrote:
>> > Got some hard numbers to back your statement up? IME, the Areca
>> > 1160's with >= 1GB of cache beat any other commodity RAID
>> > controller. This seems to be in agreement with at least one
>> > independent testing source:
>> >
>> > http://print.tweakers.net/?reviews/557
>> >
>> > RAID HW from Xyratex, Engino, or Dot Hill will _destroy_ any
>> > commodity HW solution, but their price point is considerably higher.
>> >
>> > ...on another note, I completely agree with the poster who says we
>> > need more cache on RAID controllers. We should all be beating on the
>> > RAID HW manufacturers to use standard DIMMs for their caches and to
>> > provide 2 standard DIMM slots in their full height cards (allowing
>> > for up to 8GB of cache using 2 4GB DIMMs as of this writing).
>> >
>> > It should also be noted that 64-drive chassis are going to become
>> > possible once 2.5" 10Krpm SATA II and FC HDs become the standard next
>> > year (48's are the TOTL now). We need controller technology to keep up.
>> >
>> > Ron
>> >
>> > At 12:16 AM 11/16/2005, Alex Turner wrote:
>> > >Not at random access in RAID 10 they aren't, and anyone with their
>> > >head screwed on right is using RAID 10. The 9500S will still beat the
>> > >Areca cards at RAID 10 database access pattern.
>> > >
>> > >Alex.
>> > >
>> > >On 11/15/05, Dave Cramer <pg(at)fastcrypt(dot)com> wrote:
>> > > > Luke,
>> > > >
>> > > > Have you tried the areca cards, they are slightly faster yet.
>> > > >
>> > > > Dave
>> > > >
>> > > > On 15-Nov-05, at 7:09 AM, Luke Lonergan wrote:
>> > > >
>> > > >
>> > > >
>> > > >
>> > > >
>> > > > I agree - you can get a very good one from www.acmemicro.com or
>> > > >
>> > > > www.rackable.com with 8x 400GB SATA disks and the new 3Ware
>> 9550SX SATA
>> > > >
>> > > > RAID controller for about $6K with two Opteron 272 CPUs and 8GB of RAM
>> > > >
>> > > > on a Tyan 2882 motherboard. We get about 400MB/s sustained disk read
>> > > >
>> > > > performance on these (with tuning) on Linux using the xfs filesystem,
>> > > >
>> > > > which is one of the most critical factors for large databases.
>> > > >
>> > > >
>> > > >
>> > > >
>> > > > Note that you want to have your DBMS use all of the CPU and
>> disk channel
>> > > >
>> > > > bandwidth you have on each query, which takes a parallel database like
>> > > >
>> > > > Bizgres MPP to achieve.
>> > > >
>> > > >
>> > > >
>> > > >
>> > > > Regards,
>> > > >
>> > >
>> >
>> >
>> >
>> >
>
>
>


From: Alex Turner <armtuk(at)gmail(dot)com>
To: William Yu <wyu(at)talisys(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-17 19:48:38
Message-ID: 33c6269f0511171148q7ea67482xd595ea5717956aa2@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

On 11/16/05, William Yu <wyu(at)talisys(dot)com> wrote:
> Alex Turner wrote:
> > Spend a fortune on dual core CPUs and then buy crappy disks... I bet
> > for most applications this system will be IO bound, and you will see a
> > nice lot of drive failures in the first year of operation with
> > consumer grade drives.
> >
> > Spend your money on better Disks, and don't bother with Dual Core IMHO
> > unless you can prove the need for it.
>
> I would say the opposite -- you always want Dual Core nowadays. DC
> Opterons simply give you better bang for the buck than single core
> Opterons. Price out a 1xDC system against a 2x1P system -- the 1xDC will
> be cheaper. Do the same for 2xDC versus 4x1P, 4xDC versus 8x1P, 8xDC
> versus 16x1P, etc. -- DC gets cheaper by wider and wider margins because
> those mega-CPU motherboards are astronomically expensive.
>

Opteron 242 - $178.00
Opteron 242 - $178.00
Tyan S2882 - $377.50
Total: $733.50

Opteron 265 - $719.00
Tyan K8E - $169.00
Total: $888.00

Tyan K8E - doesn't have any PCI-X, so it's also not apples to apples.
In fact I couldn't find a single CPU slot board that did, so you pretty
much have to buy a dual CPU board to get PCI-X.

1xDC is _not_ cheaper.

Our DB application does about 5 queries/second peak, plus a heavy
insert job once per day. We only _need_ two CPUs, which is true for a
great many DB applications. Unless you like EJB of course, which will
thrash the crap out of your system.

Consider the two most common workloads for DBs:

a) OLTP - probably IO bound, large number of queries/sec updating info
on _disks_, not requiring much CPU activity except to retrieve item
information which is well indexed and normalized.

b) Data warehouse - needs CPU, but probably still IO bound; a large
data set that won't fit in RAM will require large amounts of disk
reads. CPU can easily keep up with disk reads.

I have yet to come across a DB system that wasn't IO bound.

> DC also gives you a better upgrade path. Let's say you do testing and
> figure 2x246 is the right setup to handle the load. Well instead of
> getting 2x1P, use the same 2P motherboard but only populate 1 CPU w/ a
> DC/270. Now you have a server that can be upgraded to +80% more CPU by
> popping in another DC/270 versus throwing out the entire thing to get a
> 4x1P setup.

No argument there. But it's pointless if you are IO bound.

>
> The only questions would be:
> (1) Do you need a SMP server at all? I'd claim yes -- you always need 2+
> cores whether it's DC or 2P to avoid IO interrupts blocking other
> processes from running.

At least 2CPUs is always good for precisely those reasons. More than
2CPUs gives diminishing returns.

>
> (2) Does a DC system perform better than its Nx1P cousin? My experience
> is yes. Did some rough tests in a drop-in-replacement 1x265 versus 2x244
> and saw about +10% for DC. All the official benchmarks (Spec, Java, SAP,
> etc) from AMD/Sun/HP/IBM show DCs outperforming the Nx1P setups.

Maybe true, but the 265 does have a 25% faster FSB than the 244, which
might play a role.

>
> (3) Do you need an insane amount of memory? Well here's the case where
> the more expensive motherboard will serve you better since each CPU slot
> has its own bank of memory. Spend more money on memory, get cheaper
> single-core CPUs.

Remember - a large DB is going to be IO bound. Memory will get thrashed
for file block buffers; even if you have large amounts, it's all gonna
be cycled in and out again.

>
> Of course, this doesn't apply if you are an Intel/Dell-only shop. Xeon
> DCs, while cheaper than their corresponding single-core SMPs, don't have
> the same performance profile of Opteron DCs. Basically, you're paying a
> bit extra so your server can generate a ton more heat.

Dell/Xeon/Postgres is just a bad combination any day of the week ;)

Alex.


From: Alex Turner <armtuk(at)gmail(dot)com>
To: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Cc: William Yu <wyu(at)talisys(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-17 19:54:54
Message-ID: 33c6269f0511171154t22e0bce0p803a1803f8b1a539@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

On 11/16/05, Joshua D. Drake <jd(at)commandprompt(dot)com> wrote:
> >
> > The only questions would be:
> > (1) Do you need a SMP server at all? I'd claim yes -- you always need
> > 2+ cores whether it's DC or 2P to avoid IO interrupts blocking other
> > processes from running.
>
> I would back this up. Even for smaller installations (single raid 1, 1
> gig of ram). Why? Well because many applications are going to be CPU
> bound. For example
> we have a PHP application that is a CMS. On a single CPU machine, RAID 1
> it takes about 300ms to deliver a single page, point to point. We are
> not IO bound.
> So what happens is that under reasonable load we are actually waiting
> for the CPU to process the code.
>

This is the performance profile for PHP, not for Postgresql. This is
the postgresql mailing list.

> A simple upgrade to an SMP machine literally doubles our performance
> because we are still not IO bound. I strongly suggest that everyone use
> at least a single dual core because of this experience.
>

Performance of PHP, not postgresql.

> >
> > (3) Do you need an insane amount of memory? Well here's the case where
> > the more expensive motherboard will serve you better since each CPU
> > slot has its own bank of memory. Spend more money on memory, get
> > cheaper single-core CPUs.
> Agreed. A lot of times the slowest dual-core is 5x what you actually
> need. So get the slowest, and bulk up on memory. If nothing else memory
> is cheap today and it might not be tomorrow.
[snip]

Running postgresql on a single drive RAID 1 with PHP on the same
machine is not a typical installation.

300ms for PHP in CPU time? wow dude - that's quite a page. PHP
can typically handle up to 30-50 pages per second for a typical OLTP
application on a single CPU box. Something is really wrong with that
system if it takes 300ms per page.

Alex.


From: William Yu <wyu(at)talisys(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-17 20:38:11
Message-ID: dlipnh$902$1@news.hub.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Alex Turner wrote:
> Opteron 242 - $178.00
> Opteron 242 - $178.00
> Tyan S2882 - $377.50
> Total: $733.50
>
> Opteron 265 - $719.00
> Tyan K8E - $169.00
> Total: $888.00

You're comparing the wrong CPUs. The 265 is the dual-core counterpart of
the 244, so you'll have to bump up the price more, although not enough to
make a difference.

Looks like the prices of the 2X MBs have dropped since I last looked at
it. Just a few months back, Tyan duals were $450-$500, which is what I
was basing my "priced less" statement on.

> Tyan K8E - doesn't have any PCI-X, so it's also not apples to apples.
> In fact I couldn't find a single CPU slot board that did, so you pretty
> much have to buy a dual CPU board to get PCI-X.

You can get single CPU boards w/ PCIe and use PCIe controller cards.
Probably expensive right now because they're so bleeding-edge new but
definitely on the downswing.

> a) OLTP - probably IO bound, large number of queries/sec updating info
> on _disks_, not requiring much CPU activity except to retrieve item
> information which is well indexed and normalized.

Not in my experience. I find on our OLTP servers, we run 98% in RAM and
hence are 100% CPU-bound. Our DB is about 50GB in size now, servers run
w/ 8GB of RAM. We were *very* CPU limited running 2x244. During busy
hours of the day, our avg "user transaction" times were jumping from
0.8sec to 1.3+sec. Did the 2x265 and now we're always in the 0.7sec to
0.9sec range.

>>DC also gives you a better upgrade path. Let's say you do testing and
>>figure 2x246 is the right setup to handle the load. Well instead of
>>getting 2x1P, use the same 2P motherboard but only populate 1 CPU w/ a
>>DC/270. Now you have a server that can be upgraded to +80% more CPU by
>>popping in another DC/270 versus throwing out the entire thing to get a
>>4x1P setup.
>
>
> No argument there. But it's pointless if you are IO bound.

Why would you just accept "we're IO bound, nothing we can do"? I'd do
everything in my power to make my app go from IO bound to CPU bound --
whether by optimizing my code or buying more hardware. I can tell you if
our OLTP servers were IO bound, it would run like crap. Instead of < 1
sec, we'd be looking at 5-10 seconds per "user transaction" and our
users would be screaming bloody murder.

In theory, you can always convert your IO bound DB to CPU bound by
stuffing more and more RAM into your server. (Or partitioning the DB
across multiple servers.) Whether it's cost effective depends on the DB
and how much your users are paying you -- and that's a case-by-case
analysis. Not a global statement of "IO-bound, pointless".

>>(2) Does a DC system perform better than its Nx1P cousin? My experience
>>is yes. Did some rough tests in a drop-in-replacement 1x265 versus 2x244
>>and saw about +10% for DC. All the official benchmarks (Spec, Java, SAP,
>>etc) from AMD/Sun/HP/IBM show DCs outperforming the Nx1P setups.
>
>
> Maybe true, but the 265 does have a 25% faster FSB than the 244, which
> might perhaps play a role.

Nope. There's no such thing as FSB on Opterons. On-die memory controller
runs @ CPU speed and hence connects at whatever the memory runs at
(rounded off to some multiplier math). There's the HT speed that
controls the max IO bandwidth but that's based on the motherboard, not
the CPU. Plus the 265 and 244 both run at 1.8GHz, so the memory
multiplier & HT IO are both the same.


From: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To: Alex Turner <armtuk(at)gmail(dot)com>
Cc: William Yu <wyu(at)talisys(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-17 20:39:55
Message-ID: 437CEA9B.6020401@commandprompt.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance


>> So what happens is that under reasonable load we are actually waiting
>> for the CPU to process the code.
>>
>>
>
> This is the performance profile for PHP, not for Postgresql. This is
> the postgresql mailing list.
And your point? PostgreSQL benefits directly from what I am speaking
about as well.

>> Performance of PHP, not postgresql.
>>
>>
Actually both.

> [snip]
>
> Running postgresql on a single drive RAID 1 with PHP on the same
> machine is not a typical installation.
>
Want to bet? What do you think the majority of people hosting at
rackshack, rackspace,
superrack etc... are doing? Or how about all those virtual hosts?

> 300ms for PHP in CPU time? wow dude - that's quite a page. PHP
> typical can handle up to 30-50 pages per second for a typical OLTP
> application on a single CPU box. Something is really wrong with that
> system if it takes 300ms per page.
>
There is wait time associated with that because we are hitting it with
50-100 connections at a time.

Joshua D. Drake

> Alex.
>


From: Joshua Marsh <icub3d(at)gmail(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-17 20:58:46
Message-ID: 38242de90511171258v22f334f3ua474a8aa99bdc29d@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

On 11/17/05, William Yu <wyu(at)talisys(dot)com> wrote:
>
> > No argument there. But it's pointless if you are IO bound.
>
> Why would you just accept "we're IO bound, nothing we can do"? I'd do
> everything in my power to make my app go from IO bound to CPU bound --
> whether by optimizing my code or buying more hardware. I can tell you if
> our OLTP servers were IO bound, it would run like crap. Instead of < 1
> sec, we'd be looking at 5-10 seconds per "user transaction" and our
> users would be screaming bloody murder.
>
> In theory, you can always convert your IO bound DB to CPU bound by
> stuffing more and more RAM into your server. (Or partitioning the DB
> across multiple servers.) Whether it's cost effective depends on the DB
> and how much your users are paying you -- and that's a case-by-case
> analysis. Not a global statement of "IO-bound, pointless".

We all want our systems to be CPU bound, but it's not always possible.
Remember, he is managing a 5 TB database. That's quite a bit different than a
100 GB or even 500 GB database.


From: William Yu <wyu(at)talisys(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-17 21:22:47
Message-ID: dlisba$fpf$1@news.hub.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Joshua Marsh wrote:
>
> On 11/17/05, *William Yu* <wyu(at)talisys(dot)com> wrote:
>
> > No argument there. But it's pointless if you are IO bound.
>
> Why would you just accept "we're IO bound, nothing we can do"? I'd do
> everything in my power to make my app go from IO bound to CPU bound --
> whether by optimizing my code or buying more hardware. I can tell you if
> our OLTP servers were IO bound, it would run like crap. Instead of < 1
> sec, we'd be looking at 5-10 seconds per "user transaction" and our
> users would be screaming bloody murder.
>
> In theory, you can always convert your IO bound DB to CPU bound by
> stuffing more and more RAM into your server. (Or partitioning the DB
> across multiple servers.) Whether it's cost effective depends on the DB
> and how much your users are paying you -- and that's a case-by-case
> analysis. Not a global statement of "IO-bound, pointless".
>
>
> We all want our systems to be CPU bound, but it's not always possible.
> Remember, he is managing a 5 TB database. That's quite a bit different
> than a 100 GB or even 500 GB database.

I did say "in theory". :) I'm pretty sure google is more CPU bound than
IO bound -- they just spread their DB over 50K servers or whatever. Not
everybody is willing to pay for that but it's always in the realm of
plausibility.

Plus we have to go back to the statement I was replying to which was "I
have yet to come across a DB system that wasn't IO bound".


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Joshua Marsh <icub3d(at)gmail(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-18 04:58:45
Message-ID: 873blur1q2.fsf@stark.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance


Joshua Marsh <icub3d(at)gmail(dot)com> writes:

> We all want our systems to be CPU bound, but it's not always possible.

Sure it is, let me introduce you to my router, a 486DX100...

Ok, I guess that wasn't very helpful, I admit.

--
greg


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Joshua Marsh <icub3d(at)gmail(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-18 05:17:15
Message-ID: 87zmo2pmas.fsf@stark.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance


Joshua Marsh <icub3d(at)gmail(dot)com> writes:

> We all want our systems to be CPU bound, but it's not always possible.
> Remember, he is managing a 5 TB database. That's quite a bit different than a
> 100 GB or even 500 GB database.

Ok, a more productive point: it's not really the size of the database that
controls whether you're I/O bound or CPU bound. It's the available I/O
bandwidth versus your CPU speed.

If your I/O subsystem can feed data to your CPU as fast as it can consume it
then you'll be CPU bound no matter how much data you have in total. It's
harder to scale up I/O subsystems than CPUs: instead of just replacing a CPU,
it tends to mean replacing the whole system to get a better motherboard with a
faster, better bus, as well as adding more controllers and more disks.
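
Greg's rule of thumb reduces to a one-line comparison. A sketch (the MB/s
figures here are made-up placeholders, not measurements from this thread):

  io_rate=400   # MB/s the disk subsystem can deliver on sequential reads
  cpu_rate=110  # MB/s one CPU can push through the executor
  if [ "$io_rate" -ge "$cpu_rate" ]; then
    echo "CPU bound: faster or more CPUs help"
  else
    echo "I/O bound: more spindles and controllers help"
  fi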

--
greg


From: "Luke Lonergan" <llonergan(at)greenplum(dot)com>
To: "Greg Stark" <gsstark(at)mit(dot)edu>, "Joshua Marsh" <icub3d(at)gmail(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-18 06:07:54
Message-ID: BFA2AFBA.13FC4%llonergan@greenplum.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Greg,

On 11/17/05 9:17 PM, "Greg Stark" <gsstark(at)mit(dot)edu> wrote:

> Ok, a more productive point: it's not really the size of the database that
> controls whether you're I/O bound or CPU bound. It's the available I/O
> bandwidth versus your CPU speed.

Postgres + Any x86 CPU from 2.4GHz up to Opteron 280 is CPU bound after
110MB/s of I/O. This is true of Postgres 7.4, 8.0 and 8.1.

A $1,000 system with one CPU and two SATA disks in a software RAID0 will
perform exactly the same as a $80,000 system with 8 dual core CPUs and the
world's best SCSI RAID hardware on a large database for decision support
(what the poster asked about).

Regards,

- Luke


From: Dave Cramer <pg(at)fastcrypt(dot)com>
To: Luke Lonergan <llonergan(at)greenplum(dot)com>
Cc: "Greg Stark" <gsstark(at)mit(dot)edu>, "Joshua Marsh" <icub3d(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-18 13:00:17
Message-ID: 0F3B99E2-4575-42B8-8AAC-3FE4B231348C@fastcrypt.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance


On 18-Nov-05, at 1:07 AM, Luke Lonergan wrote:

> Greg,
>
>
> On 11/17/05 9:17 PM, "Greg Stark" <gsstark(at)mit(dot)edu> wrote:
>
>> Ok, a more productive point: it's not really the size of the
>> database that
>> controls whether you're I/O bound or CPU bound. It's the available
>> I/O
>> bandwidth versus your CPU speed.
>
> Postgres + Any x86 CPU from 2.4GHz up to Opteron 280 is CPU bound
> after
> 110MB/s of I/O. This is true of Postgres 7.4, 8.0 and 8.1.
>
> A $1,000 system with one CPU and two SATA disks in a software RAID0
> will
> perform exactly the same as a $80,000 system with 8 dual core CPUs
> and the
> world's best SCSI RAID hardware on a large database for decision
> support
> (what the poster asked about).

Now there's an interesting line drawn in the sand. I presume you have
numbers to back this up ?

This should draw some interesting posts.

Dave
>
> Regards,
>
> - Luke
>
>
>


From: Richard Huxton <dev(at)archonet(dot)com>
To: Dave Cramer <pg(at)fastcrypt(dot)com>
Cc: Luke Lonergan <llonergan(at)greenplum(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, Joshua Marsh <icub3d(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-18 13:22:41
Message-ID: 437DD5A1.1090605@archonet.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Dave Cramer wrote:
>
> On 18-Nov-05, at 1:07 AM, Luke Lonergan wrote:
>
>> Postgres + Any x86 CPU from 2.4GHz up to Opteron 280 is CPU bound after
>> 110MB/s of I/O. This is true of Postgres 7.4, 8.0 and 8.1.
>>
>> A $1,000 system with one CPU and two SATA disks in a software RAID0 will
>> perform exactly the same as a $80,000 system with 8 dual core CPUs
>> and the
>> world's best SCSI RAID hardware on a large database for decision support
>> (what the poster asked about).
>
>
> Now there's an interesting line drawn in the sand. I presume you have
> numbers to back this up ?
>
> This should draw some interesting posts.

Well, I'm prepared to swap Luke *TWO* $1000 systems for one $80,000
system if he's got one going :-)

--
Richard Huxton
Archonet Ltd


From: "Luke Lonergan" <llonergan(at)greenplum(dot)com>
To: "Richard Huxton" <dev(at)archonet(dot)com>, "Dave Cramer" <pg(at)fastcrypt(dot)com>
Cc: "Greg Stark" <gsstark(at)mit(dot)edu>, "Joshua Marsh" <icub3d(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-18 13:30:34
Message-ID: BFA3177A.14008%llonergan@greenplum.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Richard,

On 11/18/05 5:22 AM, "Richard Huxton" <dev(at)archonet(dot)com> wrote:

> Well, I'm prepared to swap Luke *TWO* $1000 systems for one $80,000
> system if he's got one going :-)

Finally, a game worth playing!

Except it's backward – I'll show you 80 $1,000 systems performing 80 times
faster than one $80,000 system.

On your proposition – I don't have any $80,000 systems for trade, do you?

- Luke


From: Alan Stange <stange(at)rentec(dot)com>
To:
Cc: Dave Cramer <pg(at)fastcrypt(dot)com>, Luke Lonergan <llonergan(at)greenplum(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, Joshua Marsh <icub3d(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-18 13:41:58
Message-ID: 437DDA26.4000603@rentec.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Richard Huxton wrote:
> Dave Cramer wrote:
>>
>> On 18-Nov-05, at 1:07 AM, Luke Lonergan wrote:
>>
>>> Postgres + Any x86 CPU from 2.4GHz up to Opteron 280 is CPU bound
>>> after
>>> 110MB/s of I/O. This is true of Postgres 7.4, 8.0 and 8.1.
>>>
>>> A $1,000 system with one CPU and two SATA disks in a software RAID0
>>> will
>>> perform exactly the same as a $80,000 system with 8 dual core CPUs
>>> and the
>>> world's best SCSI RAID hardware on a large database for decision
>>> support
>>> (what the poster asked about).
>>
>>
>> Now there's an interesting line drawn in the sand. I presume you
>> have numbers to back this up ?
>>
>> This should draw some interesting posts.

That's interesting, as I occasionally see more than 110MB/s of
postgresql IO on our system. I'm using a 32KB block size, which has
been a huge win in performance for our usage patterns. 300GB database
with a lot of turnover. A vacuum analyze now takes about 3 hours, which
is much shorter than before. Postgresql 8.1, dual opteron, 8GB memory,
Linux 2.6.11, FC drives.
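
For anyone wanting to reproduce the 32KB block size: it is a compile-time
setting, not a runtime one. In 8.1 that meant editing BLCKSZ in the source
tree before compiling; later releases expose it as a configure flag. A
sketch, assuming a source build:

  ./configure --with-blocksize=32   # block size in kB; requires a fresh initdb
  make && make install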

-- Alan


From: "Luke Lonergan" <llonergan(at)greenplum(dot)com>
To: stange(at)rentec(dot)com
Cc: "Dave Cramer" <pg(at)fastcrypt(dot)com>, "Greg Stark" <gsstark(at)mit(dot)edu>, "Joshua Marsh" <icub3d(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-18 13:46:58
Message-ID: BFA31B52.14012%llonergan@greenplum.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Alan,

On 11/18/05 5:41 AM, "Alan Stange" <stange(at)rentec(dot)com> wrote:

>
> That's interesting, as I occasionally see more than 110MB/s of
> postgresql IO on our system. I'm using a 32KB block size, which has
> been a huge win in performance for our usage patterns. 300GB database
> with a lot of turnover. A vacuum analyze now takes about 3 hours, which
> is much shorter than before. Postgresql 8.1, dual opteron, 8GB memory,
> Linux 2.6.11, FC drives.

300GB / 3 hours = 27MB/s.

If you are using the 2.6 linux kernel, you may be fooled into thinking you
burst more than you actually get in net I/O because the I/O stats changed in
tools like iostat and vmstat.

The only meaningful stats are (size of data) / (time to process data). Do a
sequential scan of one of your large tables that you know the size of, then
divide by the run time and report it.

I'm compiling some new test data to make my point now.

Regards,

- Luke


From: Dave Cramer <pg(at)fastcrypt(dot)com>
To: Luke Lonergan <llonergan(at)greenplum(dot)com>
Cc: "Richard Huxton" <dev(at)archonet(dot)com>, "Greg Stark" <gsstark(at)mit(dot)edu>, "Joshua Marsh" <icub3d(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-18 13:47:43
Message-ID: B06243D5-43B8-4D4F-A065-A000C2BD7207@fastcrypt.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance


On 18-Nov-05, at 8:30 AM, Luke Lonergan wrote:

> Richard,
>
> On 11/18/05 5:22 AM, "Richard Huxton" <dev(at)archonet(dot)com> wrote:
>
>> Well, I'm prepared to swap Luke *TWO* $1000 systems for one $80,000
>> system if he's got one going :-)
>
> Finally, a game worth playing!
>
> Except it’s backward – I’ll show you 80 $1,000 systems performing
> 80 times faster than one $80,000 system.
Now you wouldn't happen to be selling a system that would enable this
for postgres, now would ya ?
>
> On your proposition – I don’t have any $80,000 systems for trade,
> do you?
>
> - Luke


From: "Luke Lonergan" <llonergan(at)greenplum(dot)com>
To: "Dave Cramer" <pg(at)fastcrypt(dot)com>
Cc: "Greg Stark" <gsstark(at)mit(dot)edu>, "Joshua Marsh" <icub3d(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-18 14:04:24
Message-ID: BFA31F68.14017%llonergan@greenplum.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Dave,

On 11/18/05 5:00 AM, "Dave Cramer" <pg(at)fastcrypt(dot)com> wrote:
>
> Now there's an interesting line drawn in the sand. I presume you have
> numbers to back this up ?
>
> This should draw some interesting posts.

OK, here we go:

The $1,000 system (System A):

- I bought 16 of these in 2003 for $1,200 each. They have Intel or Asus
motherboards, Intel P4 3.0GHz CPUs with an 800MHz FSB. They have a system
drive and two RAID0 SATA drives, the Western Digital 74GB Raptor (10K RPM).
They have 1GB of RAM.

* A test of write and read performance on the RAID0:

> [llonergan(at)kite4 raid0]$ time dd if=/dev/zero of=bigfile bs=8k count=250000
> 250000+0 records in
> 250000+0 records out
>
> real 0m17.453s
> user 0m0.249s
> sys 0m10.246s

> [llonergan(at)kite4 raid0]$ time dd if=bigfile of=/dev/null bs=8k
> 250000+0 records in
> 250000+0 records out
>
> real 0m18.930s
> user 0m0.130s
> sys 0m3.590s

> So, the write performance is 114MB/s and read performance is 106MB/s.
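
For reference, those rates follow directly from the dd output above: each
run moves 250000 8KB blocks, i.e. about 2.05GB, divided by the elapsed time:

  awk 'BEGIN { printf "write: ~%.0f MB/s\n", 250000 * 8192 / 1e6 / 17.453 }'
  awk 'BEGIN { printf "read:  ~%.0f MB/s\n", 250000 * 8192 / 1e6 / 18.930 }'

(These come out within a few percent of the quoted 114MB/s and 106MB/s; the
exact figures depend on MB-vs-MiB rounding and which elapsed time you take.)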

The $6,000 system (System B):

* I just bought 5 of these systems for $6,000 each. They are dual Opteron
systems with 8GB of RAM and 2x 250 model CPUs, which are close to the
fastest. They have the new 3Ware 9550SX SATA RAID adapters coupled to
Western Digital 400GB RE2 model hard drives. They are organized as a RAID5.

* A test of write and read performance on the RAID5:

> [root(at)modena2 dbfast1]# time dd if=/dev/zero of=bigfile bs=8k count=2000000
> 2000000+0 records in
> 2000000+0 records out
>
> real 0m51.441s
> user 0m0.288s
> sys 0m29.119s
>
> [root(at)modena2 dbfast1]# time dd if=bigfile of=/dev/null bs=8k
> 2000000+0 records in
> 2000000+0 records out
>
> real 0m39.605s
> user 0m0.244s
> sys 0m19.207s
>
> So, the write performance is 314MB/s and read performance is 404MB/s (!) This
> is the fastest I've seen 8 disk drives perform.
>
So, the question is: which of these systems (A or B) can scan a large table
faster using non-MPP postgres? How much faster would you wager?

Send your answer, and I'll post the result.

Regards,

- Luke


From: Alan Stange <stange(at)rentec(dot)com>
To: Luke Lonergan <llonergan(at)greenplum(dot)com>
Cc: Dave Cramer <pg(at)fastcrypt(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, Joshua Marsh <icub3d(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-18 14:46:56
Message-ID: 437DE960.9070606@rentec.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Luke Lonergan wrote:
> Alan,
>
> On 11/18/05 5:41 AM, "Alan Stange" <stange(at)rentec(dot)com> wrote:
>
>
>> That's interesting, as I occasionally see more than 110MB/s of
>> postgresql IO on our system. I'm using a 32KB block size, which has
>> been a huge win in performance for our usage patterns. 300GB database
>> with a lot of turnover. A vacuum analyze now takes about 3 hours, which
>> is much shorter than before. Postgresql 8.1, dual opteron, 8GB memory,
>> Linux 2.6.11, FC drives.
>>
>
> 300GB / 3 hours = 27MB/s.
>
That's 3 hours under load, with 80 compute clients beating on the
database at the same time. We have the stats turned way up, so the
analyze tends to read a big chunk of the tables a second time as
well. We typically don't have three hours a day of idle time.

-- Alan


From: Ron <rjpeace(at)earthlink(dot)net>
To: "Luke Lonergan" <llonergan(at)greenplum(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases
Date: 2005-11-18 15:00:56
Message-ID: 6.2.5.6.0.20051118092018.03b6cba0@earthlink.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

While I agree with you in principle that pg becomes CPU bound
relatively easily compared to other DB products (at ~110-120MBps
according to a recent thread), there's a bit of hyperbole in your post.

a. There's a big difference between the worst performing 1C x86 ISA
CPU available and the best performing 2C one (IIRC, that's the
2.4GHz, 1MB L2 cache AMDx2 4800+ as of this writing)

b. Two 2C CPU's vs one 1C CPU means that a pg process will almost
never be waiting on other non pg processes. It also means that 3-4
pg processes, CPU bound or not, can execute in parallel. Not an
option with one 1C CPU.

c. Mainboards with support for multiple CPUs and lots of RAM are
_not_ the cheap ones.

d. No one should ever use RAID 0 for valuable data. Ever. So at
the least you need 4 HD's for a RAID 10 set (RAID 5 is not a good
option unless write performance is unimportant. 4HD RAID 5 is
particularly not a good option.)

e. The server usually needs to talk to things over a network
connection. Often performance here matters. Mainboards with 2 1GbE
NICs and/or PCI-X (or PCI-E) slots for 10GbE cards are not the cheap ones.

f. Trash HDs mean poor IO performance and lower reliability. While
TOTL 15Krpm 4Gb FC HDs are usually overkill (Not always. It depends
on context.),
you at least want SATA II HDs with NCQ or TCQ support. And you want
them to have a decent media warranty - preferably a 5 year one if you
can get it. Again, these are not the cheapest HD's available.

g. Throughput limitations say nothing about latency
considerations. OLTP-like systems _want_ HD spindles, AMAP. Even
non OLTP-like systems need a fair number of spindles to optimize HD
IO: a dedicated WAL set, multiple dedicated DB sets, a dedicated OS and
swap space set, etc, etc. At 50MBps ASTR, you need 16 HD's operating
in parallel to saturate the bandwidth of a PCI-X channel (see the
sketch after this list). That's ~8 independent pg tasks (queries using
different tables, dedicated WAL IO, etc) running in parallel,
regardless of application domain.

h. Decent RAID controllers and HBAs are not cheap either. Even SW
RAID benefits from having a big dedicated RAM buffer to talk to.

While the above may not cost you $80K, it sure isn't costing you $1K either.
Maybe ~$15-$20K, but not $1K.
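
The spindle arithmetic behind point (g), made explicit. This sketch assumes
a 64-bit, 100MHz PCI-X channel, i.e. roughly 800MB/s of bus bandwidth:

  awk 'BEGIN { bus = 64 / 8 * 100;   # 8 bytes/transfer * 100MHz = ~800 MB/s
               printf "HDs to saturate the channel: %d\n", bus / 50 }'   # -> 16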

Ron

At 01:07 AM 11/18/2005, Luke Lonergan wrote:
>Greg,
>
>
>On 11/17/05 9:17 PM, "Greg Stark" <gsstark(at)mit(dot)edu> wrote:
>
> > Ok, a more productive point: it's not really the size of the database that
> > controls whether you're I/O bound or CPU bound. It's the available I/O
> > bandwidth versus your CPU speed.
>
>Postgres + Any x86 CPU from 2.4GHz up to Opteron 280 is CPU bound after
>110MB/s of I/O. This is true of Postgres 7.4, 8.0 and 8.1.
>
>A $1,000 system with one CPU and two SATA disks in a software RAID0 will
>perform exactly the same as a $80,000 system with 8 dual core CPUs and the
>world's best SCSI RAID hardware on a large database for decision support
>(what the poster asked about).
>
>Regards,
>
>- Luke
>
>
>


From: "Luke Lonergan" <llonergan(at)greenplum(dot)com>
To: "Dave Cramer" <pg(at)fastcrypt(dot)com>
Cc: "Greg Stark" <gsstark(at)mit(dot)edu>, "Joshua Marsh" <icub3d(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-18 15:13:42
Message-ID: BFA32FA6.14027%llonergan@greenplum.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Dave,

On 11/18/05 5:00 AM, "Dave Cramer" <pg(at)fastcrypt(dot)com> wrote:
>
> Now there's an interesting line drawn in the sand. I presume you have
> numbers to back this up ?
>
> This should draw some interesting posts.

Part 2: The answer

System A:
> This system is running RedHat 3 Update 4, with a Fedora 2.6.10 Linux kernel.
>
> On a single table with 15 columns (the Bizgres IVP) at a size double memory
> (2.12GB), Postgres 8.0.3 with Bizgres enhancements takes 32 seconds to scan
> the table: that's 66 MB/s. Not the efficiency I'd hoped for from the onboard
> SATA controller; I would have expected to get 85% of the 100MB/s raw
> read performance.
>
> So that's $1,200 / 66 MB/s (without adjusting for 2003 price versus now) =
> 18.2 $/MB/s
>
> Raw data:
> [llonergan(at)kite4 IVP]$ cat scan.sh
> #!/bin/bash
>
> time psql -c "select count(*) from ivp.bigtable1" dgtestdb
> [llonergan(at)kite4 IVP]$ cat sysout1
> count
> ----------
> 10000000
> (1 row)
>
>
> real 0m32.565s
> user 0m0.002s
> sys 0m0.003s
>
> Size of the table data:
> [llonergan(at)kite4 IVP]$ du -sk dgtestdb/base
> 2121648 dgtestdb/base
>
System B:
> This system is running an XFS filesystem, and has been tuned to use very large
> (16MB) readahead. It's running the Centos 4.1 distro, which uses a Linux
> 2.6.9 kernel.
>
> Same test as above, but with 17GB of data takes 69.7 seconds to scan (!)
> That's 244.2MB/s, which is obviously double my earlier point of 110-120MB/s.
> This system is running with a 16MB Linux readahead setting, let's try it with
> the default (I think) setting of 256KB – AHA! Now we get 171.4 seconds or
> 99.3MB/s.
>
> So, using the tuned setting of "blockdev --setra 16384" we get $6,000 / 244MB/s
> = 24.6 $/MB/s
> If we use the default Linux setting it's 2.5x worse.
>
> Raw data:
> [llonergan(at)modena2 IVP]$ cat scan.sh
> #!/bin/bash
>
> time psql -c "select count(*) from ivp.bigtable1" dgtestdb
> [llonergan(at)modena2 IVP]$ cat sysout3
> count
> ----------
> 80000000
> (1 row)
>
>
> real 1m9.875s
> user 0m0.000s
> sys 0m0.004s
> [llonergan(at)modena2 IVP]$ !du
> du -sk dgtestdb/base
> 17021260 dgtestdb/base

Summary:

<cough, cough> OK – you can get more I/O bandwidth out of the current I/O
path for sequential scan if you tune the filesystem for large readahead.
This is a cheap alternative to overhauling the executor to use asynch I/O.

Still, there is a CPU limit here – this is not I/O bound, it is CPU limited
as evidenced by the sensitivity to readahead settings. If the filesystem
could do 1GB/s, you wouldn't go any faster than 244MB/s.
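
For anyone who wants to repeat the readahead experiment, the commands look
like this (the device path is an assumption; --setra is given in 512-byte
sectors):

  blockdev --getra /dev/sda          # report the current readahead setting
  blockdev --setra 16384 /dev/sda    # the large setting used in the test above
  blockdev --setra 256 /dev/sda      # back to the default mentioned above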

- Luke


From: Dave Cramer <pg(at)fastcrypt(dot)com>
To: Luke Lonergan <llonergan(at)greenplum(dot)com>
Cc: "Greg Stark" <gsstark(at)mit(dot)edu>, "Joshua Marsh" <icub3d(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-18 15:25:52
Message-ID: A4D5EB2A-73BC-43F8-8B5A-36268193A047@fastcrypt.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Luke,

Interesting numbers. I'm a little concerned about the use of blockdev
--setra 16384. If I understand this correctly, it assumes that the
table is contiguous on the disk, does it not?

Dave
On 18-Nov-05, at 10:13 AM, Luke Lonergan wrote:

> Dave,
>
> On 11/18/05 5:00 AM, "Dave Cramer" <pg(at)fastcrypt(dot)com> wrote:
> >
> > Now there's an interesting line drawn in the sand. I presume you
> have
> > numbers to back this up ?
> >
> > This should draw some interesting posts.
>
> Part 2: The answer
>
> System A:
>> This system is running RedHat 3 Update 4, with a Fedora 2.6.10
>> Linux kernel.
>>
>> On a single table with 15 columns (the Bizgres IVP) at a size
>> double memory (2.12GB), Postgres 8.0.3 with Bizgres enhancements
>> takes 32 seconds to scan the table: that’s 66 MB/s. Not the
>> efficiency I’d hope from the onboard SATA controller that I’d
>> like, I would have expected to get 85% of the 100MB/s raw read
>> performance.
>>
>> So that’s $1,200 / 66 MB/s (without adjusting for 2003 price
>> versus now) = 18.2 $/MB/s
>>
>> Raw data:
>> [llonergan(at)kite4 IVP]$ cat scan.sh
>> #!/bin/bash
>>
>> time psql -c "select count(*) from ivp.bigtable1" dgtestdb
>> [llonergan(at)kite4 IVP]$ cat sysout1
>> count
>> ----------
>> 10000000
>> (1 row)
>>
>>
>> real 0m32.565s
>> user 0m0.002s
>> sys 0m0.003s
>>
>> Size of the table data:
>> [llonergan(at)kite4 IVP]$ du -sk dgtestdb/base
>> 2121648 dgtestdb/base
>>
> System B:
>> This system is running an XFS filesystem, and has been tuned to
>> use very large (16MB) readahead. It’s running the Centos 4.1
>> distro, which uses a Linux 2.6.9 kernel.
>>
>> Same test as above, but with 17GB of data takes 69.7 seconds to
>> scan (!) That’s 244.2MB/s, which is obviously double my earlier
>> point of 110-120MB/s. This system is running with a 16MB Linux
>> readahead setting, let’s try it with the default (I think) setting
>> of 256KB – AHA! Now we get 171.4 seconds or 99.3MB/s.
>>
>> So, using the tuned setting of “blockdev --setra 16384” we get
>> $6,000 / 244MB/s = 24.6 $/MB/s
>> If we use the default Linux setting it’s 2.5x worse.
>>
>> Raw data:
>> [llonergan(at)modena2 IVP]$ cat scan.sh
>> #!/bin/bash
>>
>> time psql -c "select count(*) from ivp.bigtable1" dgtestdb
>> [llonergan(at)modena2 IVP]$ cat sysout3
>> count
>> ----------
>> 80000000
>> (1 row)
>>
>>
>> real 1m9.875s
>> user 0m0.000s
>> sys 0m0.004s
>> [llonergan(at)modena2 IVP]$ !du
>> du -sk dgtestdb/base
>> 17021260 dgtestdb/base
>
> Summary:
>
> <cough, cough> OK – you can get more I/O bandwidth out of the
> current I/O path for sequential scan if you tune the filesystem for
> large readahead. This is a cheap alternative to overhauling the
> executor to use asynch I/O.
>
> Still, there is a CPU limit here – this is not I/O bound, it is CPU
> limited as evidenced by the sensitivity to readahead settings. If
> the filesystem could do 1GB/s, you wouldn’t go any faster than
> 244MB/s.
>
> - Luke


From: "Luke Lonergan" <llonergan(at)greenplum(dot)com>
To: stange(at)rentec(dot)com
Cc: "Dave Cramer" <pg(at)fastcrypt(dot)com>, "Greg Stark" <gsstark(at)mit(dot)edu>, "Joshua Marsh" <icub3d(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-18 15:27:42
Message-ID: BFA332EE.14035%llonergan@greenplum.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Alan,

On 11/18/05 6:46 AM, "Alan Stange" <stange(at)rentec(dot)com> wrote:

> That's 3 hours under load, with 80 compute clients beating on the
> database at the same time. We have the stats turned way up, so the
> analyze tends to read a big chunk of the tables a second time as
> well. We typically don't have three hours a day of idle time.

So I guess you're saying you don't know what your I/O rate is?

- Luke


From: "Luke Lonergan" <llonergan(at)greenplum(dot)com>
To: "Dave Cramer" <pg(at)fastcrypt(dot)com>
Cc: "Greg Stark" <gsstark(at)mit(dot)edu>, "Joshua Marsh" <icub3d(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-18 15:30:31
Message-ID: BFA33397.1403A%llonergan@greenplum.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Dave,

On 11/18/05 7:25 AM, "Dave Cramer" <pg(at)fastcrypt(dot)com> wrote:

> Luke,
>
> Interesting numbers. I'm a little concerned about the use of blockdev --setra
> 16384. If I understand this correctly it assumes that the table is contiguous
> on the disk does it not ?

For optimum performance, yes it does. Remember that the poster is asking
about a 5TB warehouse. Decision support applications deal with large tables
and sequential scans a lot, and the data is generally contiguous on disk.
If delete gaps are there, they will generally vacuum them away.

- Luke


From: Bill McGonigle <bill(at)bfccomputing(dot)com>
To: Dave Cramer <pg(at)fastcrypt(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-18 15:55:19
Message-ID: d555b0eedb008ce34e18463cceb73b04@bfccomputing.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

On Nov 18, 2005, at 08:00, Dave Cramer wrote:

>> A $1,000 system with one CPU and two SATA disks in a software RAID0
>> will
>> perform exactly the same as a $80,000 system with 8 dual core CPUs
>> and the
>> world's best SCSI RAID hardware on a large database for decision
>> support
>> (what the poster asked about).
>
> Now there's an interesting line drawn in the sand. I presume you have
> numbers to back this up ?
> This should draw some interesting posts.

There is some truth to it. For an app I'm currently running (full-text
search using tsearch2 on ~100MB of data) on:

Dev System:
Asus bare-bones bookshelf case/mobo
3GHz P4 w/ HT
800MHz memory Bus
Fedora Core 3 (nightly update)
1GB RAM
1 SATA Seagate disk (7200RPM, 8MB Cache)
$800
worst-case query: 7.2 seconds

now, the machine I'm deploying to:

Dell SomthingOrOther
(4) 2.4GHz Xeons
533MHz memory bus
RedHat Enterprise 3.6
1GB RAM
(5) 15000 RPM Ultra SCSI 320 on an Adaptec RAID 5 controller
> $10000
same worst-case query: 9.6 seconds

Now it's not apples-to-apples. There's a kernel 2.4 vs. 2.6 difference,
the memory bus is much faster, and I'm not sure what kind of context
switching hit you get with the Xeon MP memory controller. On a
previous postgresql app I did I ran nearly identically spec'ed machines
except for the memory bus and saw about a 30% boost in performance just
with the 800MHz bus. I imagine the Opteron bus does even better.

So the small machine is probably slower on disk but makes up for it in
single-threaded access to CPU and memory speed. But if this app were to
be scaled it would make much more sense to cluster several $800
machines than it would to buy 'big-iron'.

-Bill
-----
Bill McGonigle, Owner Work: 603.448.4440
BFC Computing, LLC Home: 603.448.1668
bill(at)bfccomputing(dot)com Mobile: 603.252.2606
http://www.bfccomputing.com/ Pager: 603.442.1833
Jabber: flowerpt(at)gmail(dot)com Text: bill+text(at)bfccomputing(dot)com
Blog: http://blog.bfccomputing.com/


From: Vivek Khera <vivek(at)khera(dot)org>
To: Postgresql Performance <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-18 16:05:16
Message-ID: 7F33229E-AE95-436B-AB2F-01AC023DD7EE@khera.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

On Nov 18, 2005, at 1:07 AM, Luke Lonergan wrote:

> A $1,000 system with one CPU and two SATA disks in a software RAID0
> will
> perform exactly the same as a $80,000 system with 8 dual core CPUs
> and the
> world's best SCSI RAID hardware on a large database for decision
> support
> (what the poster asked about).

Hahahahahahahahahahahahaha! Whooo... needed to fall out of my chair
laughing this morning.

I can tell you from direct personal experience that you're just plain
wrong.

I've had to move my primary DB server from a dual P3 1GHz with 4-disk
RAID10 SCSI, to Dual P3 2GHz with 14-disk RAID10 and faster drives,
to Dual Opteron 2GHz with 8-disk RAID10 and even faster disks to keep
up with my load on a 60+ GB database. The Dual opteron system has
just a little bit of extra capacity if I offload some of the
reporting operations to a replicated copy (via slony1). If I run all
the queries on the one DB it can't keep up.

The most telling point about the difference in speed is that the 14-
disk array system cannot keep up with the replication being generated
by the dual opteron, even when it is not doing any other queries of
its own. The I/O system just ain't fast enough.


From: Vivek Khera <vivek(at)khera(dot)org>
To: Postgresql Performance <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-18 16:07:04
Message-ID: 1F5BD2B1-8A75-498E-ADB7-8CB661066013@khera.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance


On Nov 18, 2005, at 10:13 AM, Luke Lonergan wrote:

> Still, there is a CPU limit here – this is not I/O bound, it is CPU
> limited as evidenced by the sensitivity to readahead settings. If
> the filesystem could do 1GB/s, you wouldn’t go any faster than
> 244MB/s.

Yeah, and mysql would probably be faster on your trivial queries.
Try concurrent large joins and updates and see which system is faster.


From: Alan Stange <stange(at)rentec(dot)com>
To: Luke Lonergan <llonergan(at)greenplum(dot)com>
Cc: Dave Cramer <pg(at)fastcrypt(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, Joshua Marsh <icub3d(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-18 16:13:44
Message-ID: 437DFDB8.1030208@rentec.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Luke Lonergan wrote:
> Alan,
>
> On 11/18/05 6:46 AM, "Alan Stange" <stange(at)rentec(dot)com> wrote:
>
> That's 3 hours under load, with 80 compute clients beating on the
> database at the same time. We have the stats turned way up, so the
> analyze tends to read a big chunk of the tables a second time as
> well. We typically don't have three hours a day of idle time.
>
>
> So I guess you’re saying you don’t know what your I/O rate is?
No, I'm saying *you* don't know what my IO rate is.

I told you in my initial post that I was observing numbers in excess of
what you were claiming, but you seemed to think I didn't know how to measure
an IO rate.

I should note too that our system uses about 20% of a single cpu when
performing a table scan at >100MB/s of IO. I think you claimed the
system would be cpu bound at this low IO rate.

Cheers,

-- Alan


From: "Luke Lonergan" <llonergan(at)greenplum(dot)com>
To: "Vivek Khera" <vivek(at)khera(dot)org>, "Postgresql Performance" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-18 16:16:39
Message-ID: BFA33E67.1404D%llonergan@greenplum.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Vivek,

On 11/18/05 8:07 AM, "Vivek Khera" <vivek(at)khera(dot)org> wrote:

>
> On Nov 18, 2005, at 10:13 AM, Luke Lonergan wrote:
>
>> Still, there is a CPU limit here – this is not I/O bound, it is CPU limited
>> as evidenced by the sensitivity to readahead settings. If the filesystem
>> could do 1GB/s, you wouldn't go any faster than 244MB/s.
>
> Yeah, and mysql would probably be faster on your trivial queries.  Try
> concurrent large joins and updates and see which system is faster.

That's what we do to make a living. And it's Oracle that's a lot faster,
because they implemented a much tighter, optimized I/O path to disk than
Postgres.

Since you asked, we bought the 5 systems as a cluster – and with Bizgres MPP
we get close to 400MB/s per machine on complex queries.

- Luke


From: "Luke Lonergan" <llonergan(at)greenplum(dot)com>
To: stange(at)rentec(dot)com
Cc: "Dave Cramer" <pg(at)fastcrypt(dot)com>, "Greg Stark" <gsstark(at)mit(dot)edu>, "Joshua Marsh" <icub3d(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-18 16:17:48
Message-ID: BFA33EAC.1404E%llonergan@greenplum.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Alan,

On 11/18/05 8:13 AM, "Alan Stange" <stange(at)rentec(dot)com> wrote:

> I told you in my initial post that I was observing numbers in excess of
> what you claiming, but you seemed to think I didn't know how to measure
> an IO rate.
>
Prove me wrong, post your data.

> I should note too that our system uses about 20% of a single cpu when
> performing a table scan at >100MB/s of IO. I think you claimed the
> system would be cpu bound at this low IO rate.

See above.

- Luke
>


From: "Luke Lonergan" <llonergan(at)greenplum(dot)com>
To: "Vivek Khera" <vivek(at)khera(dot)org>, "Postgresql Performance" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-18 16:20:11
Message-ID: BFA33F3B.1404F%llonergan@greenplum.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Vivek,

On 11/18/05 8:05 AM, "Vivek Khera" <vivek(at)khera(dot)org> wrote:

> I can tell you from direct personal experience that you're just plain
> wrong.
>
> up with my load on a 60+ GB database. The Dual opteron system has

I'm always surprised by what passes for a large database. The poster is
talking about 5,000GB, or almost 100 times the data you have.

Post your I/O numbers on sequential scan. Sequential scan is critical for
Decision Support / Data Warehousing.

- Luke


From: Alex Turner <armtuk(at)gmail(dot)com>
To: Luke Lonergan <llonergan(at)greenplum(dot)com>
Cc: Dave Cramer <pg(at)fastcrypt(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, Joshua Marsh <icub3d(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-18 16:28:40
Message-ID: 33c6269f0511180828m3bc7f41dp186b4573792bf6a0@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Ok - so I ran the same test on my system and get a total speed of
113MB/sec. Why is this? Why is the system limited to just around
110MB/sec? I tuned readahead up a bit, and my results improve a
bit...

Alex

On 11/18/05, Luke Lonergan <llonergan(at)greenplum(dot)com> wrote:
> Dave,
>
> On 11/18/05 5:00 AM, "Dave Cramer" <pg(at)fastcrypt(dot)com> wrote:
> >
> > Now there's an interesting line drawn in the sand. I presume you have
> > numbers to back this up ?
> >
> > This should draw some interesting posts.
>
> Part 2: The answer
>
> System A:
>
> This system is running RedHat 3 Update 4, with a Fedora 2.6.10 Linux kernel.
>
> On a single table with 15 columns (the Bizgres IVP) at a size double memory
> (2.12GB), Postgres 8.0.3 with Bizgres enhancements takes 32 seconds to scan
> the table: that's 66 MB/s. Not the efficiency I'd hoped for from the onboard
> SATA controller; I would have expected to get 85% of the
> 100MB/s raw read performance.
>
> So that's $1,200 / 66 MB/s (without adjusting for 2003 price versus now) =
> 18.2 $/MB/s
>
> Raw data:
> [llonergan(at)kite4 IVP]$ cat scan.sh
> #!/bin/bash
>
> time psql -c "select count(*) from ivp.bigtable1" dgtestdb
> [llonergan(at)kite4 IVP]$ cat sysout1
> count
> ----------
> 10000000
> (1 row)
>
>
> real 0m32.565s
> user 0m0.002s
> sys 0m0.003s
>
> Size of the table data:
> [llonergan(at)kite4 IVP]$ du -sk dgtestdb/base
> 2121648 dgtestdb/base
>
> System B:
>
> This system is running an XFS filesystem, and has been tuned to use very
> large (16MB) readahead. It's running the Centos 4.1 distro, which uses a
> Linux 2.6.9 kernel.
>
> Same test as above, but with 17GB of data takes 69.7 seconds to scan (!)
> That's 244.2MB/s, which is obviously double my earlier point of 110-120MB/s.
> This system is running with a 16MB Linux readahead setting, let's try it
> with the default (I think) setting of 256KB – AHA! Now we get 171.4 seconds
> or 99.3MB/s.
>
> So, using the tuned setting of "blockdev --setra 16384" we get $6,000 /
> 244MB/s = 24.6 $/MB/s
> If we use the default Linux setting it's 2.5x worse.
>
> Raw data:
> [llonergan(at)modena2 IVP]$ cat scan.sh
> #!/bin/bash
>
> time psql -c "select count(*) from ivp.bigtable1" dgtestdb
> [llonergan(at)modena2 IVP]$ cat sysout3
> count
> ----------
> 80000000
> (1 row)
>
>
> real 1m9.875s
> user 0m0.000s
> sys 0m0.004s
> [llonergan(at)modena2 IVP]$ !du
> du -sk dgtestdb/base
> 17021260 dgtestdb/base
>
> Summary:
>
> <cough, cough> OK – you can get more I/O bandwidth out of the current I/O
> path for sequential scan if you tune the filesystem for large readahead.
> This is a cheap alternative to overhauling the executor to use asynch I/O.
>
> Still, there is a CPU limit here – this is not I/O bound, it is CPU limited
> as evidenced by the sensitivity to readahead settings. If the filesystem
> could do 1GB/s, you wouldn't go any faster than 244MB/s.
>
> - Luke


From: "Luke Lonergan" <llonergan(at)greenplum(dot)com>
To: "Bill McGonigle" <bill(at)bfccomputing(dot)com>, "Dave Cramer" <pg(at)fastcrypt(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-18 16:31:00
Message-ID: BFA341C4.14059%llonergan@greenplum.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Bill,

On 11/18/05 7:55 AM, "Bill McGonigle" <bill(at)bfccomputing(dot)com> wrote:
>
> There is some truth to it. For an app I'm currently running (full-text
> search using tsearch2 on ~100MB of data) on:

Do you mean 100GB? Sounds like yours is more like a decision support /
warehousing application.

> Dev System:
> Asus bare-bones bookshelf case/mobo
> 3GHz P4 w/ HT
> 800MHz memory Bus
> Fedora Core 3 (nightly update)
> 1GB RAM
> 1 SATA Seagate disk (7200RPM, 8MB Cache)
> $800
> worst-case query: 7.2 seconds

About the same machine I posted results for, except I had two faster disks.

> now, the machine I'm deploying to:
>
> Dell SomthingOrOther
> (4) 2.4GHz Xeons
> 533MHz memory bus
> RedHat Enterprise 3.6
> 1GB RAM
> (5) 15000 RPM Ultra SCSI 320 on an Adaptec RAID 5 controller
>> $10000
> same worst-case query: 9.6 seconds

Your problem here is the HW RAID controller - if you dump it and use the
onboard SCSI channels and Linux RAID, you will see a jump from 40MB/s to
about 220MB/s in read performance and from 20MB/s to 110MB/s in write
performance. It will use less CPU too.
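
A minimal sketch of that software-RAID alternative (the device names, RAID
level, and mount point are assumptions, not the actual hardware here):

mdadm --create /dev/md0 --level=10 --raid-devices=4 \
    /dev/sdb /dev/sdc /dev/sdd /dev/sde    # md RAID 10 on the onboard channels
mkfs.xfs /dev/md0                          # xfs, per the results earlier in the thread
mount -o noatime /dev/md0 /data            # hypothetical mount point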

> Now it's not apples-to-apples. There's a kernel 2.4 vs. 2.6 difference
> and the memory bus is much faster and I'm not sure what kind of context
> switching hit you get with the Xeon MP memory controller. On a
> previous postgresql app I did I ran nearly identically spec'ed machines
> except for the memory bus and saw about a 30% boost in performance just
> with the 800MHz bus. I imagine the Opteron bus does even better.

Memory bandwidth is so high on both that it's not a factor. Context
switching / memory bus contention isn't either.

> So the small machine is probably slower on disk but makes up for it in
> single-threaded access to CPU and memory speed. But if this app were to
> be scaled it would make much more sense to cluster several $800
> machines than it would to buy 'big-iron'.

Yes it does - by a lot too. Also, having a multiprocessing executor gets
the most out of each machine by having multiple CPUs scan simultaneously.

- Luke


From: "Luke Lonergan" <llonergan(at)greenplum(dot)com>
To: "Alex Turner" <armtuk(at)gmail(dot)com>
Cc: "Dave Cramer" <pg(at)fastcrypt(dot)com>, "Greg Stark" <gsstark(at)mit(dot)edu>, "Joshua Marsh" <icub3d(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-18 16:33:35
Message-ID: BFA3425F.1405B%llonergan@greenplum.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Alex,

On 11/18/05 8:28 AM, "Alex Turner" <armtuk(at)gmail(dot)com> wrote:

> Ok - so I ran the same test on my system and get a total speed of
> 113MB/sec. Why is this? Why is the system so limited to around just
> 110MB/sec? I tuned read ahead up a bit, and my results improve a
> bit..

OK! Now we're on the same page. Finally someone who actually tests!

Check the CPU usage while it's doing the scan. Know what it's doing?
Memory copies. We've profiled it extensively.

So - that's the suckage - throwing more CPU power helps a bit, but the
underlying issue is poorly optimized code in the Postgres executor and lack
of I/O asynchrony.

- Luke


From: Alan Stange <stange(at)rentec(dot)com>
To: Luke Lonergan <llonergan(at)greenplum(dot)com>
Cc: Dave Cramer <pg(at)fastcrypt(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, Joshua Marsh <icub3d(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-18 17:31:33
Message-ID: 437E0FF5.5090203@rentec.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Luke Lonergan wrote:
> Alan,
>
> On 11/18/05 8:13 AM, "Alan Stange" <stange(at)rentec(dot)com> wrote:
>
> I told you in my initial post that I was observing numbers in excess of
> what you were claiming, but you seemed to think I didn't know how to
> measure an IO rate.
>
> Prove me wrong, post your data.
>
> I should note too that our system uses about 20% of a single cpu when
> performing a table scan at >100MB/s of IO. I think you claimed the
> system would be cpu bound at this low IO rate.
>
>
> See above.
Here's the output from one iteration of iostat -k 60 while the box is
doing a select count(1) on a 238GB table.

avg-cpu: %user %nice %sys %iowait %idle
0.99 0.00 17.97 32.40 48.64

Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sdd 345.95 130732.53 0.00 7843952 0

We're reading 130MB/s for a full minute. About 20% of a single cpu was
being used. The remainder being idle.

We've done nothing fancy and achieved results you claim shouldn't be
possible. This is a system that was re-installed yesterday, no tuning
was done to the file systems, kernel or storage array.

What am I doing wrong?

9 years ago I co-designed a petabyte data store with a goal of 1GB/s IO
(for a DOE lab). And now I don't know what I'm doing,

Cheers,

-- Alan


From: "Luke Lonergan" <llonergan(at)greenplum(dot)com>
To: stange(at)rentec(dot)com
Cc: "Dave Cramer" <pg(at)fastcrypt(dot)com>, "Greg Stark" <gsstark(at)mit(dot)edu>, "Joshua Marsh" <icub3d(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-18 17:54:07
Message-ID: BFA3553F.1406D%llonergan@greenplum.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Alan,

On 11/18/05 9:31 AM, "Alan Stange" <stange(at)rentec(dot)com> wrote:

> Here's the output from one iteration of iostat -k 60 while the box is
> doing a select count(1) on a 238GB table.
>
> avg-cpu: %user %nice %sys %iowait %idle
> 0.99 0.00 17.97 32.40 48.64
>
> Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
> sdd 345.95 130732.53 0.00 7843952 0
>
> We're reading 130MB/s for a full minute. About 20% of a single cpu was
> being used. The remainder being idle.

Cool - thanks for the results. Is that % of one CPU, or of 2? Was the
system otherwise idle?

> We've done nothing fancy and achieved results you claim shouldn't be
> possible. This is a system that was re-installed yesterday, no tuning
> was done to the file systems, kernel or storage array.

Are you happy with 130MB/s? How much did you pay for that? Is it more than
$2,000, or double my 2003 PC?

> What am I doing wrong?
>
> 9 years ago I co-designed a petabyte data store with a goal of 1GB/s IO
> (for a DOE lab). And now I don't know what I'm doing,

Cool. Would that be Sandia?

We routinely sustain 2,000 MB/s from disk on 16x 2003 era machines on
complex queries.

- Luke


From: Alan Stange <stange(at)rentec(dot)com>
To: Luke Lonergan <llonergan(at)greenplum(dot)com>
Cc: Dave Cramer <pg(at)fastcrypt(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, Joshua Marsh <icub3d(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-18 18:30:06
Message-ID: 437E1DAE.1070301@rentec.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Luke Lonergan wrote:
> Alan,
>
> On 11/18/05 9:31 AM, "Alan Stange" <stange(at)rentec(dot)com> wrote:
>
>
>> Here's the output from one iteration of iostat -k 60 while the box is
>> doing a select count(1) on a 238GB table.
>>
>> avg-cpu: %user %nice %sys %iowait %idle
>> 0.99 0.00 17.97 32.40 48.64
>>
>> Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
>> sdd 345.95 130732.53 0.00 7843952 0
>>
>> We're reading 130MB/s for a full minute. About 20% of a single cpu was
>> being used. The remainder being idle.
>>
>
> Cool - thanks for the results. Is that % of one CPU, or of 2? Was the
> system otherwise idle?
>
Actually, this was dual cpu and there was other activity during the full
minute, but it was on other file devices, which I didn't include in the
above output. Given that, and given what I see on the box now, I'd
raise the 20% to 30% just to be more conservative. It's all in the
kernel either way; using a different scheduler or file system would
change that result. Even better would be using direct IO to not flush
everything else from memory and avoid some memory copies from kernel to
user space. Note that almost none of the time is user time. Changing
postgresql won't change the cpu usage.

One IMHO obvious improvement would be to have vacuum and analyze only do
direct IO. Now they appear to be very effective memory flushing tools.
Table scans on tables larger than say 4x memory should probably also use
direct IO for reads.
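
A quick way to see that flushing effect (a sketch: bigfile is a hypothetical
multi-GB file, and iflag=direct needs a reasonably recent GNU dd):

free -m                                      # note the page cache size
dd if=bigfile of=/dev/null bs=1M count=1000 iflag=direct
free -m                                      # cache barely moves: O_DIRECT bypassed it
dd if=bigfile of=/dev/null bs=1M count=1000
free -m                                      # cache grew by ~1GB: buffered reads evicted other data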

>
>
>> We've done nothing fancy and achieved results you claim shouldn't be
>> possible. This is a system that was re-installed yesterday, no tuning
>> was done to the file systems, kernel or storage array.
>>
>
> Are you happy with 130MB/s? How much did you pay for that? Is it more than
> $2,000, or double my 2003 PC?
>
I don't know what the system cost. It was part of a block of dual
opterons from Sun that we got some time ago. I think the 130MB/s is
slow given the hardware, but it's acceptable. I'm not too price
sensitive; I care much more about reliability, uptime, etc.

>
>
>> What am I doing wrong?
>>
>> 9 years ago I co-designed a petabyte data store with a goal of 1GB/s IO
>> (for a DOE lab). And now I don't know what I'm doing,
>>
> Cool. Would that be Sandia?
>
> We routinely sustain 2,000 MB/s from disk on 16x 2003 era machines on
> complex queries.
Disk?! 4 StorageTek tape silos. That would be .002 TB/s. One has to
change how you think when you have that much data. And hope you don't
have a fire, because there's no backup. That work was while I was at
BNL. I believe they are now at 4PB of tape and 150TB of disk.

-- Alan


From: "Luke Lonergan" <llonergan(at)greenplum(dot)com>
To: stange(at)rentec(dot)com
Cc: "Dave Cramer" <pg(at)fastcrypt(dot)com>, "Greg Stark" <gsstark(at)mit(dot)edu>, "Joshua Marsh" <icub3d(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-18 18:52:35
Message-ID: BFA362F3.14099%llonergan@greenplum.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Alan,

On 11/18/05 10:30 AM, "Alan Stange" <stange(at)rentec(dot)com> wrote:

> Actually, this was dual cpu and there was other activity during the full
> minute, but it was on other file devices, which I didn't include in the
> above output. Given that, and given what I see on the box now I'd
> raise the 20% to 30% just to be more conservative. It's all in the
> kernel either way; using a different scheduler or file system would
> change that result. Even better would be using direct IO to not flush
> everything else from memory and avoid some memory copies from kernel to
> user space. Note that almost none of the time is user time. Changing
> postgresql won't change the cpu usage.

These are all things that may help on the IO wait side; however, there
is a producer/consumer problem in postgres that goes something like this:

- Read some (small number of, sometimes 1) 8k pages
- Do some work on those pages, including lots of copies
- repeat

This back and forth without threading (like AIO, or a multiprocessing
executor) causes cycling and inefficiency that limits throughput.
Optimizing away some of the memcopies and other garbage, plus increasing
the internal (postgres) readahead, would probably double the disk bandwidth.

But being disk-bound (meaning that the disk subsystem is running at full
speed) requires asynchronous I/O. We do this now with Bizgres MPP, and we
get fully saturated disk channels on every machine. That means that even on
one machine, we run many times faster than non-MPP postgres.

> One IMHO obvious improvement would be to have vacuum and analyze only do
> direct IO. Now they appear to be very effective memory flushing tools.
> Table scans on tables larger than say 4x memory should probably also use
> direct IO for reads.

That's been suggested many times prior - I agree, but this also needs AIO to
be maximally effective.

> I don't know what the system cost. It was part of a block of dual
> opterons from Sun that we got some time ago. I think the 130MB/s is
> slow given the hardware, but it's acceptable. I'm not too price
> sensitive; I care much more about reliability, uptime, etc.

Then I know what they cost - we have them too (V20z and V40z). You should
be getting 400MB/s+ with external RAID.

>>> What am I doing wrong?
>>>
>>> 9 years ago I co-designed a petabyte data store with a goal of 1GB/s IO
>>> (for a DOE lab). And now I don't know what I'm doing,
>>>
>> Cool. Would that be Sandia?
>>
>> We routinely sustain 2,000 MB/s from disk on 16x 2003 era machines on
>> complex queries.
> Disk?! 4 StorageTek tape silos. That would be .002 TB/s. One has to
> change how you think when you have that much data. And hope you don't
> have a fire, because there's no backup. That work was while I was at
> BNL. I believe they are now at 4PB of tape and 150TB of disk.

We had 1.5 Petabytes on 2 STK Silos at NAVO from 1996-1998, where I ran R&D.
We also had a Cray T932, an SGI Origin 3000 with 256 CPUs, a Cray T3E with
1280 CPUs, 2 Cray J916s with 1 TB of shared disk, a Cray C90-16, a Sun E10K,
etc., along with clusters of Alpha machines and lots of SGIs. It's nice
to work with a $40M annual budget.

Later, working with FSL we implemented a weather forecasting cluster that,
built from 512 Alpha cluster nodes, ultimately became the #5 fastest computer
on the TOP500 supercomputing list. That machine had a 10-way shared SAN, tape
robotics and a Myrinet interconnect and ran 64-bit Linux (in 1998).

- Luke


From: Greg Stark <gsstark(at)mit(dot)edu>
To: stange(at)rentec(dot)com
Cc: Luke Lonergan <llonergan(at)greenplum(dot)com>, Dave Cramer <pg(at)fastcrypt(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, Joshua Marsh <icub3d(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-18 19:07:34
Message-ID: 87oe4hpyfd.fsf@stark.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Alan Stange <stange(at)rentec(dot)com> writes:

> Luke Lonergan wrote:
> > Alan,
> >
> > On 11/18/05 9:31 AM, "Alan Stange" <stange(at)rentec(dot)com> wrote:
> >
> >
> >> Here's the output from one iteration of iostat -k 60 while the box is
> >> doing a select count(1) on a 238GB table.
> >>
> >> avg-cpu: %user %nice %sys %iowait %idle
> >> 0.99 0.00 17.97 32.40 48.64
> >>
> >> Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
> >> sdd 345.95 130732.53 0.00 7843952 0
> >>
> >> We're reading 130MB/s for a full minute. About 20% of a single cpu was
> >> being used. The remainder being idle.
> >>
> >
> > Cool - thanks for the results. Is that % of one CPU, or of 2? Was the
> > system otherwise idle?
> >
> Actually, this was dual cpu

I hate to agree with him but that looks like a dual machine with one CPU
pegged. Yes most of the time is being spent in the kernel, but you're still
basically cpu limited.

That said, 130MB/s is nothing to sneeze at, that's maxing out two high end
drives and quite respectable for a 3-disk stripe set, even reasonable for a
4-disk stripe set. If you're using 5 or more disks in RAID-0 or RAID 1+0 and
only getting 130MB/s then it does seem likely the cpu is actually holding you
back here.

Still it doesn't show Postgres being nearly so CPU wasteful as the original
poster claimed.

> It's all in the kernel either way; using a different scheduler or file
> system would change that result. Even better would be using direct IO to not
> flush everything else from memory and avoid some memory copies from kernel
> to user space. Note that almost none of the time is user time. Changing
> postgresql won't change the cpu usage.

Well changing to direct i/o would still be changing Postgres so that's
unclear. And there are plenty of more mundane ways that Postgres is
responsible for how efficiently or not the kernel is used. Just using fewer
syscalls to do the same amount of reading would reduce cpu consumption.
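
One way to see that syscall cost in isolation (a sketch; bigfile is
hypothetical, and re-reading an already-cached file isolates the
per-read() CPU cost from the disk):

# read the same 2GB with different request sizes and compare "sys" time
for spec in "8k 262144" "32k 65536" "256k 8192" "1M 2048"; do
    set -- $spec
    echo "bs=$1"
    time dd if=bigfile of=/dev/null bs=$1 count=$2
done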

> One IMHO obvious improvement would be to have vacuum and analyze only do direct
> IO. Now they appear to be very effective memory flushing tools. Table scans
> on tables larger than say 4x memory should probably also use direct IO for
> reads.

--
greg


From: "Luke Lonergan" <llonergan(at)greenplum(dot)com>
To: "Greg Stark" <gsstark(at)mit(dot)edu>, stange(at)rentec(dot)com
Cc: "Dave Cramer" <pg(at)fastcrypt(dot)com>, "Joshua Marsh" <icub3d(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-18 19:24:48
Message-ID: BFA36A80.140AE%llonergan@greenplum.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Greg,

On 11/18/05 11:07 AM, "Greg Stark" <gsstark(at)mit(dot)edu> wrote:

> That said, 130MB/s is nothing to sneeze at, that's maxing out two high end
> drives and quite respectable for a 3-disk stripe set, even reasonable for a
> 4-disk stripe set. If you're using 5 or more disks in RAID-0 or RAID 1+0 and
> only getting 130MB/s then it does seem likely the cpu is actually holding you
> back here.

With an FC array, it's undoubtedly more like 14 drives, in which case
130MB/s is laughable. On the other hand, I wouldn't be surprised if it were
a single 200MB/s Fibre Channel attachment.

It does make you wonder why people keep recommending 15K RPM drives, like it
would help *not*.

> Still it doesn't show Postgres being nearly so CPU wasteful as the original
> poster claimed.

It's partly about waste, and partly about lack of a concurrent I/O
mechanism. We've profiled it for the waste, we've implemented concurrent
I/O to prove the other point.

>> It's all in the kernel either way; using a different scheduler or file
>> system would change that result. Even better would be using direct IO to not
>> flush everything else from memory and avoid some memory copies from kernel
>> to user space. Note that almost none of the time is user time. Changing
>> postgresql won't change the cpu usage.
>
> Well changing to direct i/o would still be changing Postgres so that's
> unclear. And there are plenty of more mundane ways that Postgres is
> responsible for how efficiently or not the kernel is used. Just using fewer
> syscalls to do the same amount of reading would reduce cpu consumption.

Bingo.

- Luke


From: Alan Stange <stange(at)rentec(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Luke Lonergan <llonergan(at)greenplum(dot)com>, Dave Cramer <pg(at)fastcrypt(dot)com>, Joshua Marsh <icub3d(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-18 19:39:30
Message-ID: 437E2DF2.50906@rentec.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Greg Stark wrote:
> Alan Stange <stange(at)rentec(dot)com> writes:
>
>
>> Luke Lonergan wrote:
>>
>>> Alan,
>>>
>>> On 11/18/05 9:31 AM, "Alan Stange" <stange(at)rentec(dot)com> wrote:
>>>
>>>
>>>
>>>> Here's the output from one iteration of iostat -k 60 while the box is
>>>> doing a select count(1) on a 238GB table.
>>>>
>>>> avg-cpu: %user %nice %sys %iowait %idle
>>>> 0.99 0.00 17.97 32.40 48.64
>>>>
>>>> Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
>>>> sdd 345.95 130732.53 0.00 7843952 0
>>>>
>>>> We're reading 130MB/s for a full minute. About 20% of a single cpu was
>>>> being used. The remainder being idle.
>>>>
>>>>
>>> Cool - thanks for the results. Is that % of one CPU, or of 2? Was the
>>> system otherwise idle?
>>>
>>>
>> Actually, this was dual cpu
>>
>
> I hate to agree with him but that looks like a dual machine with one CPU
> pegged. Yes most of the time is being spent in the kernel, but you're still
> basically cpu limited.
>
> That said, 130MB/s is nothing to sneeze at, that's maxing out two high end
> drives and quite respectable for a 3-disk stripe set, even reasonable for a
> 4-disk stripe set. If you're using 5 or more disks in RAID-0 or RAID 1+0 and
> only getting 130MB/s then it does seem likely the cpu is actually holding you
> back here.
>
> Still it doesn't show Postgres being nearly so CPU wasteful as the original
> poster claimed.
>
Yes and no. The one cpu is clearly idle. The second cpu is 40% busy
and 60% idle (aka iowait in the above numbers).
Of that 40%, other things were happening as well during the 1 minute
snapshot. During some iostat outputs that I didn't post the cpu time
was ~ 20%.

So, you can take your pick. The single cpu usage is somewhere between
20% and 40%. As I can't remove other users of the system, it's the best
measurement that I can make right now.

Either way, it's not close to being cpu bound. This is with Opteron
248, 2.2GHz cpus.

Note that the storage system has been a bit disappointing: it's an IBM
FAStT600 with a 200MB/s fiber attachment. It could be better, but
it's not been the bottleneck in our work, so we haven't put any energy
into it.

>> It's all in the kernel either way; using a different scheduler or file
>> system would change that result. Even better would be using direct IO to not
>> flush everything else from memory and avoid some memory copies from kernel
>> to user space. Note that almost none of the time is user time. Changing
>> postgresql won't change the cpu usage.
>>
> Well changing to direct i/o would still be changing Postgres so that's
> unclear. And there are plenty of more mundane ways that Postgres is
> responsible for how efficiently or not the kernel is used. Just using fewer
> syscalls to do the same amount of reading would reduce cpu consumption.
Absolutely. This is why we're using a 32KB block size and have also
switched to using O_SYNC for the WAL syncing method. That's many MB/s that
don't need to be cached in the kernel (thus evicting other data), and we
avoid all the fsync/fdatasync syscalls.
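
For reference, the corresponding setting is a one-liner (a sketch;
open_sync is a standard wal_sync_method value, though which methods are
available depends on the platform):

# postgresql.conf
wal_sync_method = open_sync   # WAL written with O_SYNC instead of write()+fsync()

# after a reload, verify:
psql -c "SHOW wal_sync_method"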

The purpose of direct IO isn't to make the vacuum or analyze faster, but
to lessen their impact on queries with someone waiting for the
results. That's our biggest hit: running a sequential scan on 240GB
of data and flushing everything else out of memory.

Now that I think about this a bit, a big chunk of time is probably
being lost in TLB misses and other virtual memory events that would be
avoided if a larger page size were being used.

-- Alan


From: Alan Stange <stange(at)rentec(dot)com>
To: Luke Lonergan <llonergan(at)greenplum(dot)com>
Cc: Dave Cramer <pg(at)fastcrypt(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, Joshua Marsh <icub3d(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-18 19:39:50
Message-ID: 437E2E06.5070306@rentec.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Luke Lonergan wrote:
>> I don't know what the system cost. It was part of a block of dual
>> opterons from Sun that we got some time ago. I think the 130MB/s is
>> slow given the hardware, but it's acceptable. I'm not too price
>> sensitive; I care much more about reliability, uptime, etc.
>>
> Then I know what they cost - we have them too (V20z and V40z). You should
> be getting 400MB/s+ with external RAID.
Yes, but we don't. This is where I would normally begin a rant on how
craptacular Linux can be at times. But, for the sake of this
discussion, postgresql isn't reading the data any more slowly than does
any other program.

And we don't have the time to experiment with the box.

I know it should be better, but it's good enough for our purposes at
this time.

-- Alan


From: Ron <rjpeace(at)earthlink(dot)net>
To: "Luke Lonergan" <llonergan(at)greenplum(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases
Date: 2005-11-18 20:29:11
Message-ID: 6.2.5.6.0.20051118151319.01d16288@earthlink.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Breaking the ~120MBps pg IO ceiling by any means
is an important result. Particularly when you
get a ~2x improvement. I'm curious how far we
can get using simple approaches like this.

At 10:13 AM 11/18/2005, Luke Lonergan wrote:
>Dave,
>
>On 11/18/05 5:00 AM, "Dave Cramer" <pg(at)fastcrypt(dot)com> wrote:
> >
> > Now there's an interesting line drawn in the sand. I presume you have
> > numbers to back this up ?
> >
> > This should draw some interesting posts.
>
>Part 2: The answer
>
>System A:
>This system is running RedHat 3 Update 4, with a Fedora 2.6.10 Linux kernel.
>
>On a single table with 15 columns (the Bizgres
>IVP) at a size double memory (2.12GB), Postgres
>8.0.3 with Bizgres enhancements takes 32 seconds
>to scan the table: that’s 66 MB/s. Not the
>efficiency I’d hoped for from the onboard SATA
>controller: I would have expected to get 85%
>of the 100MB/s raw read performance.
Have you tried the large read ahead trick with
this system? It would be interesting to see how
much it would help. It might even be worth it to
run the experiment at each of [default, 2x default,
4x default, 8x default, etc.] read ahead until
either a) you run out of resources to support the
desired read ahead, or b) performance levels
off. I can imagine the results being very enlightening.
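
Something like this would run that sweep (a sketch; the device, database,
and table names are assumptions, and note that blockdev --setra counts
512-byte sectors, not bytes):

DEV=/dev/sdd
for RA in 256 512 1024 2048 4096 8192 16384 32768; do
    blockdev --setra $RA $DEV
    echo "readahead = $RA sectors ($((RA / 2))KB)"
    time psql -c "select count(*) from ivp.bigtable1" dgtestdb
done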

>System B:
>This system is running an XFS filesystem, and
>has been tuned to use very large (16MB)
>readahead. It’s running the Centos 4.1 distro,
>which uses a Linux 2.6.9 kernel.
>
>Same test as above, but with 17GB of data takes
>69.7 seconds to scan (!) That’s 244.2MB/s,
>which is obviously double my earlier point of
>110-120MB/s. This system is running with a 16MB
>Linux readahead setting, let’s try it with the
>default (I think) setting of 256KB – AHA! Now we get 171.4 seconds or 99.3MB/s.
The above experiment would seem useful here as well.

>Summary:
>
><cough, cough> OK – you can get more I/O
>bandwidth out of the current I/O path for
>sequential scan if you tune the filesystem for
>large readahead. This is a cheap alternative to
>overhauling the executor to use asynch I/O.
>
>Still, there is a CPU limit here – this is not
>I/O bound, it is CPU limited as evidenced by the
>sensitivity to readahead settings. If the
>filesystem could do 1GB/s, you wouldn’t go any faster than 244MB/s.
>
>- Luke

I respect your honesty in reporting results that
were different from your expectations or
previously taken stance. Alan Stange's comment
re: the use of direct IO, along with your comments
re: async IO and mem copies, plus the results of
these experiments, could very well point us
directly at how to most easily solve pg's CPU-boundness during IO.

[HACKERS] are you watching this?

Ron


From: Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>
To: Luke Lonergan <llonergan(at)greenplum(dot)com>
Cc: Dave Cramer <pg(at)fastcrypt(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, Joshua Marsh <icub3d(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-18 23:46:54
Message-ID: 437E67EE.4070605@paradise.net.nz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Luke Lonergan wrote:

> (mass snippage)
> time psql -c "select count(*) from ivp.bigtable1" dgtestdb
> [llonergan(at)modena2 IVP]$ cat sysout3
> count
> ----------
> 80000000
> (1 row)
>
>
> real 1m9.875s
> user 0m0.000s
> sys 0m0.004s
> [llonergan(at)modena2 IVP]$ !du
> du -sk dgtestdb/base
> 17021260 dgtestdb/base
>
>
> Summary:
>
> <cough, cough> OK – you can get more I/O bandwidth out of the current
> I/O path for sequential scan if you tune the filesystem for large
> readahead. This is a cheap alternative to overhauling the executor to
> use asynch I/O.
>
> Still, there is a CPU limit here – this is not I/O bound, it is CPU
> limited as evidenced by the sensitivity to readahead settings. If the
> filesystem could do 1GB/s, you wouldn’t go any faster than 244MB/s.
>
>

Luke,

Interesting - but possibly only representative of a workload consisting
entirely of one executor doing "SELECT ... FROM my_single_table".

If you alter this to involve more complex joins (e.g. a 4-way star) and
(maybe add a small number of concurrent executors too) - is it still the
case?

Cheers

Mark


From: "Luke Lonergan" <llonergan(at)greenplum(dot)com>
To: "Mark Kirkwood" <markir(at)paradise(dot)net(dot)nz>
Cc: "Dave Cramer" <pg(at)fastcrypt(dot)com>, "Greg Stark" <gsstark(at)mit(dot)edu>, "Joshua Marsh" <icub3d(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-19 00:04:00
Message-ID: BFA3ABF0.14124%llonergan@greenplum.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Mark,

On 11/18/05 3:46 PM, "Mark Kirkwood" <markir(at)paradise(dot)net(dot)nz> wrote:

> If you alter this to involve more complex joins (e.g. a 4-way star) and
> (maybe add a small number of concurrent executors too) - is it still the
> case?

4-way star, same result, that's part of my point. With Bizgres MPP, the
4-way star uses 4 concurrent scanners, though not all are active all the
time. And that's per segment instance - we normally use one segment
instance per CPU, so our concurrency is NCPUs plus some.

The trick is the "small number of concurrent executors" part. The only way
to get this with normal postgres is to have concurrent users, and normally
they are doing different things, scanning different parts of the disk.
These are competing things, and for concurrency enhancement something like
"sync scan" would be an effective optimization.

But in reporting, business analytics and warehousing in general, there are
reports that take hours to run. If you can knock that down by factors of 10
using parallelism, it's a big win. That's the reason that Teradata did $1.5
Billion in business last year.

More importantly - that's the kind of work that everyone using internet data
for analytics wants right now.

- Luke


From: "Luke Lonergan" <llonergan(at)greenplum(dot)com>
To: "Mark Kirkwood" <markir(at)paradise(dot)net(dot)nz>
Cc: "Dave Cramer" <pg(at)fastcrypt(dot)com>, "Greg Stark" <gsstark(at)mit(dot)edu>, "Joshua Marsh" <icub3d(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-19 00:05:59
Message-ID: BFA3AC67.14125%llonergan@greenplum.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Mark,

On 11/18/05 3:46 PM, "Mark Kirkwood" <markir(at)paradise(dot)net(dot)nz> wrote:

> If you alter this to involve more complex joins (e.g. a 4-way star) and
> (maybe add a small number of concurrent executors too) - is it still the
> case?

I may not have listened to you - are you asking about whether the readahead
works for these cases?

I'll be running some massive TPC-H benchmarks on these machines soon - we'll
see then.

- Luke


From: Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>
To: Luke Lonergan <llonergan(at)greenplum(dot)com>
Cc: Dave Cramer <pg(at)fastcrypt(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, Joshua Marsh <icub3d(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-19 02:27:49
Message-ID: 437E8DA5.1050703@paradise.net.nz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Luke Lonergan wrote:
> Mark,
>
> On 11/18/05 3:46 PM, "Mark Kirkwood" <markir(at)paradise(dot)net(dot)nz> wrote:
>
> If you alter this to involve more complex joins (e.g. a 4-way star) and
> (maybe add a small number of concurrent executors too) - is it still the
> case?
>
>
> I may not have listened to you - are you asking about whether the
> readahead works for these cases?
>
> I’ll be running some massive TPC-H benchmarks on these machines soon –
> we’ll see then.

That too, meaning the business of 1 executor random reading a given
relation file whilst another is sequentially scanning (some other) part
of it....

Cheers

Mark


From: "Luke Lonergan" <llonergan(at)greenplum(dot)com>
To: stange(at)rentec(dot)com, "Greg Stark" <gsstark(at)mit(dot)edu>
Cc: "Dave Cramer" <pg(at)fastcrypt(dot)com>, "Joshua Marsh" <icub3d(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-19 16:13:09
Message-ID: BFA48F15.14182%llonergan@greenplum.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Alan,

On 11/18/05 11:39 AM, "Alan Stange" <stange(at)rentec(dot)com> wrote:

> Yes and no. The one cpu is clearly idle. The second cpu is 40% busy
> and 60% idle (aka iowait in the above numbers).

The "aka iowait" is the problem here - iowait is not idle (otherwise it
would be in the "idle" column).

Iowait is time spent waiting on blocking io calls. As another poster
pointed out, you have a two CPU system, and during your scan, as predicted,
one CPU went 100% busy on the seq scan. During iowait periods, the CPU can
be context switched to other users, but as I pointed out earlier, that's not
useful for getting response on decision support queries.

Thanks for your data, it exemplifies many of the points brought up:
- Lots of disks and expensive I/O hardware does not help improve performance
on large table queries because I/O bandwidth does not scale beyond
110-120MB/s on the fastest CPUs
- OLTP performance optimizations are different than decision support

Regards,

- Luke


From: "Luke Lonergan" <llonergan(at)greenplum(dot)com>
To: "Mark Kirkwood" <markir(at)paradise(dot)net(dot)nz>
Cc: "Dave Cramer" <pg(at)fastcrypt(dot)com>, "Greg Stark" <gsstark(at)mit(dot)edu>, "Joshua Marsh" <icub3d(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-19 16:15:29
Message-ID: BFA48FA1.14183%llonergan@greenplum.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Mark,

On 11/18/05 6:27 PM, "Mark Kirkwood" <markir(at)paradise(dot)net(dot)nz> wrote:

> That too, meaning the business of 1 executor random reading a given
> relation file whilst another is sequentially scanning (some other) part
> of it....

I think it should actually improve things - each I/O will read 16MB into the
I/O cache, then the next scanner will seek for 10ms to get the next 16MB
into cache, etc. It should minimize the seek/data ratio nicely. As long as
the tables are contiguous it should rock and roll.
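
Rough numbers behind that: at 200MB/s a 16MB read takes ~80ms, so with a
~10ms seek between interleaved scanners the disk still spends about 80ms of
every 90ms (roughly 89%) transferring data rather than seeking.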

- Luke


From: Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>
To: Luke Lonergan <llonergan(at)greenplum(dot)com>
Cc: Dave Cramer <pg(at)fastcrypt(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, Joshua Marsh <icub3d(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-19 21:28:57
Message-ID: 437F9919.20305@paradise.net.nz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Luke Lonergan wrote:
> Mark,
>
> On 11/18/05 3:46 PM, "Mark Kirkwood" <markir(at)paradise(dot)net(dot)nz> wrote:
>
>
>>If you alter this to involve more complex joins (e.g. a 4-way star) and
>>(maybe add a small number of concurrent executors too) - is it still the
>>case?
>
>
> 4-way star, same result, that's part of my point. With Bizgres MPP, the
> 4-way star uses 4 concurrent scanners, though not all are active all the
> time. And that's per segment instance - we normally use one segment
> instance per CPU, so our concurrency is NCPUs plus some.
>

Luke - I don't think I was clear enough about what I was asking, sorry.

I added the more "complex joins" comment because:

- I am happy that seqscan is cpu bound after ~110MB/s (It's cpu bound on
my old P3 system even earlier than that....)
- I am curious if the *other* access methods (indexscan, nested loop,
hash, merge, bitmap) also suffer the same fate.

I'm guessing from your comment that you have tested this too, but I
think it's worth clarifying!

With respect to Bizgres MPP, scan parallelism is a great addition...
very nice! (BTW - is that in 0.8, or are we talking a new product variant?)

regards

Mark


From: Alan Stange <stange(at)rentec(dot)com>
To: Luke Lonergan <llonergan(at)greenplum(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Dave Cramer <pg(at)fastcrypt(dot)com>, Joshua Marsh <icub3d(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-20 02:43:48
Message-ID: 437FE2E4.5040600@rentec.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Luke Lonergan wrote:
> Alan,
>
> On 11/18/05 11:39 AM, "Alan Stange" <stange(at)rentec(dot)com> wrote:
>
>
>> Yes and no. The one cpu is clearly idle. The second cpu is 40% busy
>> and 60% idle (aka iowait in the above numbers).
>>
>
> The "aka iowait" is the problem here - iowait is not idle (otherwise it
> would be in the "idle" column).
>
> Iowait is time spent waiting on blocking io calls. As another poster
> pointed out, you have a two CPU system, and during your scan, as predicted,
> one CPU went 100% busy on the seq scan. During iowait periods, the CPU can
> be context switched to other users, but as I pointed out earlier, that's not
> useful for getting response on decision support queries.
>
iowait time is idle time. Period. This point has been debated
endlessly for Solaris and other OS's as well.

Here's the man page:
%iowait
       Show the percentage of time that the CPU or CPUs were
       idle during which the system had an outstanding disk I/O
       request.

If the system had some other cpu bound work to perform you wouldn't ever
see any iowait time. Anyone claiming the cpu was 100% busy on the
sequential scan using the one set of numbers I posted is
misunderstanding the actual metrics.

> Thanks for your data, it exemplifies many of the points brought up:
> - Lots of disks and expensive I/O hardware does not help improve performance
> on large table queries because I/O bandwidth does not scale beyond
> 110-120MB/s on the fastest CPUs
>
I don't think that is the conclusion to draw from the anecdotal numbers I
posted. This file subsystem doesn't perform as well as expected for any tool.
Bonnie, dd, star, etc., don't get a better data rate either. In fact,
the storage system wasn't built for performance; it was built to
reliably hold a big chunk of data. Even so, postgresql is reading at
130MB/s on it, using about 30% of a single cpu, almost all of which was
system time. I would get the same 130MB/s on a system with cpus that
were substantially slower; the limitation isn't the cpus, or
postgresql. It's the IO system that is poorly configured for this test,
not postgresql's ability to use it.

In fact, given the numbers I posted, it's clear this system could
handily generate more than 120 MB/s using a single cpu given a better IO
subsystem; it has cpu time to spare. A simple test can be done:
build the database in /dev/shm and time the scans. It's the same read()
system call being used and now one has made the IO system "infinitely
fast". The claim is being made that standard postgresql is unable to
generate more than 120MB/s of IO on any IO system due to an inefficient
use of the kernel API and excessive memory copies, etc. Having the
database be on a ram based file system is an example of "expensive IO
hardware" and all else would be the same. Hmmm, now that I think about
this, I could throw a medium sized table onto /dev/shm using
tablespaces on one of our 8GB linux boxes. So why is this experiment
not valid, or what is it about the above assertion that I am missing?
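
A sketch of that experiment with 8.x tablespaces (all names are
hypothetical, and the copied table has to fit comfortably in the 8GB of RAM):

# as root: a RAM-backed directory owned by the postgres user
mkdir /dev/shm/pgram && chown postgres /dev/shm/pgram

# as the postgres user:
psql dgtestdb <<'SQL'
CREATE TABLESPACE ramspace LOCATION '/dev/shm/pgram';
CREATE TABLE ram_copy (LIKE ivp.bigtable1) TABLESPACE ramspace;
INSERT INTO ram_copy SELECT * FROM ivp.bigtable1 LIMIT 10000000;
SQL

# the same read() path, now against an "infinitely fast" IO system:
time psql -c "SELECT count(*) FROM ram_copy" dgtestdb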

Anyway, if one cares about high speed sequential IO, then one should use
a much larger block size to start. Using 8KB IOs is inappropriate for
such a configuration. We happen to be using 32KB blocks on our largest
database and it's been the best move for us.

-- Alan


From: Alan Stange <stange(at)rentec(dot)com>
To: Luke Lonergan <llonergan(at)greenplum(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Dave Cramer <pg(at)fastcrypt(dot)com>, Joshua Marsh <icub3d(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-20 04:43:28
Message-ID: 437FFEF0.9090409@rentec.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Another data point.

We had some down time on our system today to complete some maintenance
work. I took the opportunity to rebuild the 700GB file system using
XFS instead of Reiser.

One iostat output for 30 seconds is

avg-cpu: %user %nice %sys %iowait %idle
1.58 0.00 19.69 31.94 46.78

Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sdd 343.73 175035.73 277.55 5251072 8326

while doing a select count(1) on the same large table as before.
Subsequent iostat output all showed that this data rate was being
maintained. The system is otherwise mostly idle during this measurement.

The sequential read rate is 175MB/s. The system is the same as earlier,
one cpu is idle and the second is ~40% busy doing the scan and ~60%
idle. This is postgresql 8.1rc1, 32KB block size. No tuning except
for using a 1024KB read ahead.

The peak speed of the attached storage is 200MB/s (a 2Gb/s fiber channel
controller). I see no reason why this configuration wouldn't generate
higher IO rates if a faster IO connection were available.

Can you explain again why you think there's an IO ceiling of 120MB/s?
I really don't understand.

-- Alan


From: Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>
To: Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>
Cc: Luke Lonergan <llonergan(at)greenplum(dot)com>, pgsql-performance(at)postgresql(dot)org, stange(at)rentec(dot)com, gsstark(at)mit(dot)edu, pg(at)fastcrypt(dot)com, icub3d(at)gmail(dot)com
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-20 08:55:59
Message-ID: 43803A1F.9030708@paradise.net.nz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Mark Kirkwood wrote:

>
> - I am happy that seqscan is cpu bound after ~110MB/s (It's cpu bound on
> my old P3 system even earlier than that....)

Ahem - after reading Alan's postings I am not so sure; ISTM that there
is some more investigation required here too :-).


From: William Yu <wyu(at)talisys(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-20 12:55:14
Message-ID: dlprnd$25d0$1@news.hub.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Alan Stange wrote:
> Luke Lonergan wrote:
>> The "aka iowait" is the problem here - iowait is not idle (otherwise it
>> would be in the "idle" column).
>>
>> Iowait is time spent waiting on blocking io calls. As another poster
>> pointed out, you have a two CPU system, and during your scan, as
>
> iowait time is idle time. Period. This point has been debated
> endlessly for Solaris and other OS's as well.

I'm sure the theory is nice, but here's my experience with iowait
just a minute ago. I run Linux/XFce as my desktop -- decided I wanted to
look up some stuff in Wikipedia under Mozilla and my computer system
became completely unusable for nearly a minute while who knows what
Mozilla was doing. (Probably loading all the language packs.) I could
not even switch to IRC (already loaded) to chat with other people while
Mozilla was chewing up all my disk I/O.

So I went to another computer, connected to mine remotely (slow...) and
checked top. 90% in the "wa" column which I assume is the iowait column.
It may be idle in theory but it's not a very useful idle -- wasn't able
to switch to any programs already running, couldn't click on the XFce
launchbar to run any new programs.


From: "Steinar H(dot) Gunderson" <sgunderson(at)bigfoot(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-20 13:04:38
Message-ID: 20051120130438.GA31206@uio.no
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

On Sat, Nov 19, 2005 at 08:13:09AM -0800, Luke Lonergan wrote:
> Iowait is time spent waiting on blocking io calls.

To be picky, iowait is time spent in the idle task while the I/O queue is not
empty. It does not matter if the I/O is blocking or not (from userspace's
point of view), and if the I/O was blocking (say, PIO) from the kernel's
point of view, it would be counted in system.

/* Steinar */
--
Homepage: http://www.sesse.net/


From: Alan Stange <stange(at)rentec(dot)com>
To: William Yu <wyu(at)talisys(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-20 13:42:09
Message-ID: 43807D31.6050008@rentec.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

William Yu wrote:
> Alan Stange wrote:
>> Luke Lonergan wrote:
>>> The "aka iowait" is the problem here - iowait is not idle (otherwise it
>>> would be in the "idle" column).
>>>
>>> Iowait is time spent waiting on blocking io calls. As another poster
>>> pointed out, you have a two CPU system, and during your scan, as
>>
>> iowait time is idle time. Period. This point has been debated
>> endlessly for Solaris and other OS's as well.
>
> I'm sure the theory is nice, but here's my experience with iowait
> just a minute ago. I run Linux/XFce as my desktop -- decided I wanted
> to look up some stuff in Wikipedia under Mozilla and my computer system
> became completely unusable for nearly a minute while who knows what
> Mozilla was doing. (Probably loading all the language packs.) I could
> not even switch to IRC (already loaded) to chat with other people
> while Mozilla was chewing up all my disk I/O.
>
> So I went to another computer, connected to mine remotely (slow...)
> and checked top. 90% in the "wa" column which I assume is the iowait
> column. It may be idle in theory but it's not a very useful idle --
> wasn't able to switch to any programs already running, couldn't click
> on the XFce launchbar to run any new programs.

So, you have a sucky computer. I'm sorry, but iowait is still idle
time, whether you believe it or not.

-- Alan


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Alan Stange <stange(at)rentec(dot)com>
Cc: Luke Lonergan <llonergan(at)greenplum(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, Dave Cramer <pg(at)fastcrypt(dot)com>, Joshua Marsh <icub3d(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-20 14:22:41
Message-ID: 87y83jo0um.fsf@stark.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance


Alan Stange <stange(at)rentec(dot)com> writes:

> > Iowait is time spent waiting on blocking io calls. As another poster
> > pointed out, you have a two CPU system, and during your scan, as predicted,
> > one CPU went 100% busy on the seq scan. During iowait periods, the CPU can
> > be context switched to other users, but as I pointed out earlier, that's not
> > useful for getting response on decision support queries.

I don't think that's true. If the syscall was preemptable then it wouldn't
show up under "iowait", but rather "idle". The time spent in iowait is time in
uninterruptable sleeps where no other process can be scheduled.

> iowait time is idle time. Period. This point has been debated endlessly for
> Solaris and other OS's as well.
>
> Here's the man page:
> %iowait
> Show the percentage of time that the CPU or CPUs were
> idle during which the system had an outstanding disk I/O
> request.
>
> If the system had some other cpu bound work to perform you wouldn't ever see
> any iowait time. Anyone claiming the cpu was 100% busy on the sequential scan
> using the one set of numbers I posted is misunderstanding the actual metrics.

That's easy to test. Rerun the test with another process running a simple C
program like "main() {while(1);}" (or two invocations of that on your system
because of the extra processor). I bet you'll see about half the percentage of
iowait because postgres will get half as much opportunity to schedule i/o. If
what you are saying were true then you should get 0% iowait.

--
greg


From: "Steinar H(dot) Gunderson" <sgunderson(at)bigfoot(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-20 14:29:35
Message-ID: 20051120142935.GA32311@uio.no
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

On Sun, Nov 20, 2005 at 09:22:41AM -0500, Greg Stark wrote:
> I don't think that's true. If the syscall was preemptable then it wouldn't
> show up under "iowait", but rather "idle". The time spent in iowait is time in
> uninterruptable sleeps where no other process can be scheduled.

You are confusing userspace with kernel space. When a process is stuck in
uninterruptable sleep, it means _that process_ can't be interrupted (say,
by a signal). The kernel can preempt it without problems.

/* Steinar */
--
Homepage: http://www.sesse.net/


From: Alan Stange <stange(at)rentec(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Luke Lonergan <llonergan(at)greenplum(dot)com>, Dave Cramer <pg(at)fastcrypt(dot)com>, Joshua Marsh <icub3d(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-20 18:09:06
Message-ID: 4380BBC2.7060109@rentec.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Greg Stark wrote:
> Alan Stange <stange(at)rentec(dot)com> writes:
>
>
>>> Iowait is time spent waiting on blocking io calls. As another poster
>>> pointed out, you have a two CPU system, and during your scan, as predicted,
>>> one CPU went 100% busy on the seq scan. During iowait periods, the CPU can
>>> be context switched to other users, but as I pointed out earlier, that's not
>>> useful for getting response on decision support queries.
>>>
>
> I don't think that's true. If the syscall was preemptable then it wouldn't
> show up under "iowait", but rather "idle". The time spent in iowait is time in
> uninterruptable sleeps where no other process can be scheduled.
>
That would be wrong. The time spent in iowait is idle time. The
iowait stat would be 0 on a machine with a compute bound runnable
process available for each cpu.

Come on people, read the man page or look at the source code. Just
stop making stuff up.

>
>> iowait time is idle time. Period. This point has been debated endlessly for
>> Solaris and other OS's as well.
>>
>> Here's the man page:
>> %iowait
>> Show the percentage of time that the CPU or CPUs were
>> idle during which the system had an outstanding disk I/O
>> request.
>>
>> If the system had some other cpu bound work to perform you wouldn't ever see
>> any iowait time. Anyone claiming the cpu was 100% busy on the sequential scan
>> using the one set of numbers I posted is misunderstanding the actual metrics.
>>
>
> That's easy to test. Rerun the test with another process running a simple C
> program like "main() {while(1);}" (or two invocations of that on your system
> because of the extra processor). I bet you'll see about half the percentage of
> iowait because postgres will get half as much opportunity to schedule i/o. If
> what you are saying were true then you should get 0% iowait.
Yes, I did this once about 10 years ago. But instead of saying "I bet"
and guessing at the result, you should try it yourself. Without
guessing, I can tell you that the iowait time will go to 0%. You can do
this loop in the shell, so there's no code to write. Also, it helps to
do this with the shell running at a lower priority.
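
The whole experiment in shell (a sketch; two hogs for the two Opterons,
nice'd so the scan still gets whatever CPU it asks for):

nice -n 19 sh -c 'while :; do :; done' &   # one CPU hog per processor
nice -n 19 sh -c 'while :; do :; done' &
iostat -k 60 2                             # re-run the scan and watch %iowait
kill %1 %2                                 # stop the hogs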

-- Alan


From: Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>
To: Alan Stange <stange(at)rentec(dot)com>
Cc: Luke Lonergan <llonergan(at)greenplum(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, Dave Cramer <pg(at)fastcrypt(dot)com>, Joshua Marsh <icub3d(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-21 00:11:15
Message-ID: 438110A3.20506@paradise.net.nz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Alan Stange wrote:
> Another data point.
> We had some down time on our system today to complete some maintenance
> work. I took the opportunity to rebuild the 700GB file system using
> XFS instead of Reiser.
>
> One iostat output for 30 seconds is
>
> avg-cpu: %user %nice %sys %iowait %idle
> 1.58 0.00 19.69 31.94 46.78
>
> Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
> sdd 343.73 175035.73 277.55 5251072 8326
>
> while doing a select count(1) on the same large table as before.
> Subsequent iostat output all showed that this data rate was being
> maintained. The system is otherwise mostly idle during this measurement.
>
> The sequential read rate is 175MB/s. The system is the same as earlier,
> one cpu is idle and the second is ~40% busy doing the scan and ~60%
> idle. This is postgresql 8.1rc1, 32KB block size. No tuning except
> for using a 1024KB read ahead.
>
> The peak speed of the attached storage is 200MB/s (a 2Gb/s fiber channel
> controller). I see no reason why this configuration wouldn't generate
> higher IO rates if a faster IO connection were available.
>
> Can you explain again why you think there's an IO ceiling of 120MB/s?
> I really don't understand.
>

I think what is going on here is that Luke's observation of the 120 MB/s
rate is taken from data using an 8K block size - it looks like we can get
higher rates with 32K.

A quick test on my P3 system seems to support this (the numbers are a
bit feeble, but the difference is interesting):

The test is SELECT 1 FROM table, stopping Pg and unmounting the file
system after each test.

8K blocksize:
25 s elapsed
48 % idle from vmstat (dual cpu system)
70 % busy from gstat (Freebsd GEOM io monitor)
181819 pages in relation
56 MB/s effective IO throughput

32K blocksize:
23 s elapsed
44 % idle from vmstat
80 % busy from gstat
45249 pages in relation
60 Mb/s effective IO throughput

I re-ran these several times - very repeatable (+/- 0.25 seconds).

This is FreeBSD 6.0 with the readahead set to 16 blocks, UFS2 filesystem
created with 32K blocksize (both cases). It might be interesting to see
the effect of using 16K (the default) with the 8K Pg block size; I would
expect this to widen the gap.
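
For reference, the "effective IO throughput" figures are just pages x
block size / elapsed time; a quick check with bc (binary megabytes,
integer arithmetic):

$ echo '181819 * 8192 / 1048576 / 25' | bc    # 8K case: 56 MB/s
$ echo '45249 * 32768 / 1048576 / 23' | bc    # 32K case: 61 MB/s

The small difference from the 60 MB/s above comes from the rounded
elapsed times.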

Cheers

Mark


From: Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>
To: Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>
Cc: Alan Stange <stange(at)rentec(dot)com>, Luke Lonergan <llonergan(at)greenplum(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, Dave Cramer <pg(at)fastcrypt(dot)com>, Joshua Marsh <icub3d(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-21 01:40:54
Message-ID: 438125A6.2040200@paradise.net.nz
Lists: pgsql-performance

Mark Kirkwood wrote:

> The test is SELECT 1 FROM table

That should read "The test is SELECT count(1) FROM table...."


From: "Luke Lonergan" <llonergan(at)greenplum(dot)com>
To: "Alan Stange" <stange(at)rentec(dot)com>
Cc: "Greg Stark" <gsstark(at)mit(dot)edu>, "Dave Cramer" <pg(at)fastcrypt(dot)com>, "Joshua Marsh" <icub3d(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-21 08:12:55
Message-ID: BFA6C187.1425C%llonergan@greenplum.com
Lists: pgsql-performance

Alan,

On 11/19/05 8:43 PM, "Alan Stange" <stange(at)rentec(dot)com> wrote:

> Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
> sdd 343.73 175035.73 277.55 5251072 8326
>
> while doing a select count(1) on the same large table as before.
> Subsequent iostat output all showed that this data rate was being
> maintained. The system is otherwise mostly idle during this measurement.

Yes - interesting. Note the other result using XFS that I posted earlier
where I got 240+MB/s. XFS has more aggressive readahead, which is why I
used it.

> Can you explain again why you think there's an IO ceiling of 120MB/s
> because I really don't understand?

OK - slower this time:

We've seen between 110MB/s and 120MB/s on a wide variety of fast CPU
machines with fast I/O subsystems that can sustain 250MB/s+ using dd, but
which all are capped at 120MB/s when doing sequential scans with different
versions of Postgres.

Understand my point: It doesn't matter that there is idle or iowait on the
CPU, the postgres executor is not able to drive the I/O rate for two
reasons: there is a lot of CPU used for the scan (the 40% you reported) and
a lack of asynchrony (the iowait time). That means that by speeding up the
CPU you only reduce the first part, but you don't fix the second, and vice versa.

With more aggressive readahead, the second problem (the I/O asynchrony) is
handled better by the Linux kernel and filesystem. That's what we're seeing
with XFS.
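
For reference, the readahead knob in question is the Linux block device
setting, which can be inspected and raised with blockdev; --setra counts
512-byte sectors, so 2048 sectors is 1MB. The device name below is a
placeholder:

$ blockdev --getra /dev/sdd         # current readahead, in 512-byte sectors
$ blockdev --setra 2048 /dev/sdd    # 2048 x 512 bytes = 1MB of readahead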

- Luke


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Luke Lonergan" <llonergan(at)greenplum(dot)com>
Cc: "Alan Stange" <stange(at)rentec(dot)com>, "Greg Stark" <gsstark(at)mit(dot)edu>, "Dave Cramer" <pg(at)fastcrypt(dot)com>, "Joshua Marsh" <icub3d(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-21 14:56:46
Message-ID: 1611.1132585006@sss.pgh.pa.us
Lists: pgsql-performance

"Luke Lonergan" <llonergan(at)greenplum(dot)com> writes:
> OK - slower this time:

> We've seen between 110MB/s and 120MB/s on a wide variety of fast CPU
> machines with fast I/O subsystems that can sustain 250MB/s+ using dd, but
> which all are capped at 120MB/s when doing sequential scans with different
> versions of Postgres.

Luke, sometime it would be nice if you would post your raw evidence
and let other people do their own analysis. I for one have gotten
tired of reading sweeping generalizations unbacked by any data.

I find the notion of a magic 120MB/s barrier, independent of either
CPU or disk speed, to be pretty dubious to say the least. I would
like to know exactly what the "wide variety" of data points you
haven't shown us are.

regards, tom lane


From: Alan Stange <stange(at)rentec(dot)com>
To: Luke Lonergan <llonergan(at)greenplum(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Dave Cramer <pg(at)fastcrypt(dot)com>, Joshua Marsh <icub3d(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-21 14:57:59
Message-ID: 4381E077.80009@rentec.com
Lists: pgsql-performance

Luke Lonergan wrote:
> OK - slower this time:
> We've seen between 110MB/s and 120MB/s on a wide variety of fast CPU
> machines with fast I/O subsystems that can sustain 250MB/s+ using dd, but
> which all are capped at 120MB/s when doing sequential scans with different
> versions of Postgres.
>
Postgresql issues the exact same sequence of read() calls as does dd.
So why is dd so much faster?

I'd be careful with the dd read of a 16GB file on an 8GB system. Make
sure you umount the file system first, to make sure all of the file is
flushed from memory. Some systems use freebehind on sequential reads to
avoid flushing useful data from memory; without it, you'd find that half
of your 16GB file is still in memory. The same point also holds for the
writes: when dd finishes, not all the data is on disk. You need to issue
a sync() call to make that happen. Use lmdd to ensure that the data is
actually all written. In other words, I think your dd results are
possibly misleading.

It's trivial to demonstrate:

$ time dd if=/dev/zero of=/fidb1/bigfile bs=8k count=800000
800000+0 records in
800000+0 records out

real 0m13.780s
user 0m0.134s
sys 0m13.510s

Oops. I just wrote 470MB/s to a file system that has a peak write speed
of 200MB/s.

Now, you might say that you wrote a 16GB file on an 8 GB machine so this
isn't an issue. It does make your dd numbers look fast as some of the
data will be unwritten.

I'd also suggest running dd on the same files as postgresql. I suspect
you'd find that the layout of the postgresql files isn't that good as
they are grown bit by bit, unlike the file created by simply dd'ing a
large file.
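
Concretely, a cache-honest version of the dd test looks something like
this (paths hypothetical; sync makes the write timing honest, and the
umount/mount cycle empties the cache before the read-back):

$ time sh -c 'dd if=/dev/zero of=/fidb1/bigfile bs=8k count=800000 && sync'
$ umount /fidb1 && mount /fidb1     # drop the cached pages
$ time dd if=/fidb1/bigfile of=/dev/null bs=8k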

> Understand my point: It doesn't matter that there is idle or iowait on the
> CPU, the postgres executor is not able to drive the I/O rate for two
> reasons: there is a lot of CPU used for the scan (the 40% you reported) and
> a lack of asynchrony (the iowait time). That means that by speeding up the
> CPU you only reduce the first part, but you don't fix the second, and vice versa.
>
> With more aggressive readahead, the second problem (the I/O asynchrony) is
> handled better by the Linux kernel and filesystem. That's what we're seeing
> with XFS.

I think your point doesn't hold up. Every time you make it, I come away
posting another result showing it to be incorrect.

The point you're making doesn't match my experience with *any* storage or
program I've ever used, including postgresql. Your point suggests that
the storage system is idle and that postgresql is broken because it
isn't able to use the resources available...even when the cpu is very
idle. How can that make sense? The issue here is that the storage
system is very active doing reads on the files...which might be somewhat
poorly allocated on disk because postgresql grows the tables bit by bit.

I had the same readahead in Reiser and in XFS. The XFS performance was
better because XFS does a better job of large file allocation on disk,
thus resulting in many fewer seeks (generated by the file system itself)
to read the files back in. As an example, some file systems like UFS
purposely scatter large files across cylinder groups to avoid forcing
large seeks on small files; one can tune this behavior so that large
files are more tightly allocated.

Of course, because this is engineering, I have another obligatory data
point: This time it's a 4.2GB table using 137,138 32KB pages with
nearly 41 million rows.

A "select count(1)" on the table completes in 14.6 seconds, for an
average read rate of 320 MB/s.
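
The arithmetic, for the record, is just pages x page size over the scan
time:

$ echo '137138 * 32768 / 1000000 / 14' | bc    # 320 (MB/s) over the ~14 s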

One cpu was idle, the other averaged 32% system time and 68% user time
for the 14 second period. This is on a 2.2GHz Opteron. A faster cpu
would show increased performance as I really am cpu bound finally.

Postgresql is clearly able to issue the relevant sequential read()
system calls and sink the resulting data without a problem if the file
system is capable of providing the data. It can do this up to a speed
of ~300MB/s on this class of system. Now it should be fairly simple to
tweak the few spots where some excess memory copies are being done and
up this result substantially. I hope postgresql is always using the
libc memcpy as that's going to be a lot faster than some private routine.

-- Alan


From: "Luke Lonergan" <llonergan(at)greenplum(dot)com>
To: stange(at)rentec(dot)com
Cc: "Greg Stark" <gsstark(at)mit(dot)edu>, "Dave Cramer" <pg(at)fastcrypt(dot)com>, "Joshua Marsh" <icub3d(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-21 18:06:48
Message-ID: BFA74CB8.142B5%llonergan@greenplum.com
Lists: pgsql-performance

Alan,

On 11/21/05 6:57 AM, "Alan Stange" <stange(at)rentec(dot)com> wrote:

> $ time dd if=/dev/zero of=/fidb1/bigfile bs=8k count=800000
> 800000+0 records in
> 800000+0 records out
>
> real 0m13.780s
> user 0m0.134s
> sys 0m13.510s
>
> Oops. I just wrote 470MB/s to a file system that has a peak write speed
> of 200MB/s.

How much RAM on this machine?

> Now, you might say that you wrote a 16GB file on an 8 GB machine so this
> isn't an issue. It does make your dd numbers look fast as some of the
> data will be unwritten.

This simple test, at 2x memory, correlates very closely to Bonnie++ numbers
for sequential scan. What's more, we see close to the same peak in practice
with multiple scanners. Furthermore, if you run two of them simultaneously
(on two filesystems), you can also see that the I/O is the limit.

> I'd also suggest running dd on the same files as postgresql. I suspect
> you'd find that the layout of the postgresql files isn't that good as
> they are grown bit by bit, unlike the file created by simply dd'ing a
> large file.

Can happen if you're not careful with filesystems (see above).

There's nothing "wrong" with the dd test.

> I think your point doesn't hold up. Every time you make it, I come away
> posting another result showing it to be incorrect.

Prove it - your Reiserfs number was about the same.

I also posted an XFS number that was substantially higher than 110-120.

> The point you're making doesn't match my experience with *any* storage or
> program I've ever used, including postgresql. Your point suggests that
> the storage system is idle and that postgresql is broken because it
> isn't able to use the resources available...even when the cpu is very
> idle. How can that make sense? The issue here is that the storage
> system is very active doing reads on the files...which might be somewhat
> poorly allocated on disk because postgresql grows the tables bit by bit.

Then you've made my point - if the problem is contiguity of files on disk,
then larger allocation blocks would help on the CPU side.

The objective is clear: given a high performance filesystem, how much of the
available bandwidth can Postgres achieve? I think what we're seeing is that
XFS dramatically improves that fraction.

> I had the same readahead in Reiser and in XFS. The XFS performance was
> better because XFS does a better job of large file allocation on disk,
> thus resulting in many fewer seeks (generated by the file system itself)
> to read the files back in. As an example, some file systems like UFS
> purposely scatter large files across cylinder groups to avoid forcing
> large seeks on small files; one can tune this behavior so that large
> files are more tightly allocated.

Our other tests have used ext3, reiser and Solaris 10 UFS, so this might
make some sense.

> Of course, because this is engineering, I have another obligatory data
> point: This time it's a 4.2GB table using 137,138 32KB pages with
> nearly 41 million rows.
>
> A "select count(1)" on the table completes in 14.6 seconds, for an
> average read rate of 320 MB/s.

So, assuming that the net memory scan rate is about 2GB/s, and two copies
(one from FS cache to buffer cache, one from buffer cache to the agg node),
you have a 700MB/s filesystem with the equivalent of DirectIO (no FS cache)
because you are reading directly from the I/O cache. You got half of that
because the I/O processing in the executor is limited to 320MB/s on that
fast CPU.

My point is this: if you were to decrease the filesystem speed to say
400MB/s and still use the equivalent of DirectIO, I think Postgres would not
deliver 320MB/s, but rather something like 220MB/s due to the
producer/consumer arch of the executor. If you get that part, then we're on
the same track, otherwise we disagree.
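
As a sketch of that arithmetic, under a simple serialized-pipeline model
(an assumption, not necessarily the exact producer/consumer model meant
here: net rate = 1/(1/io_rate + 1/cpu_rate)), backing the cpu-side rate
out of the observed numbers and re-running at 400MB/s:

$ echo '1/(1/320 - 1/700)' | bc -l            # implied cpu-side rate, ~589 MB/s
$ echo '1/(1/400 + 1/320 - 1/700)' | bc -l    # predicted net rate, ~238 MB/s

which lands in the same ballpark as the ~220MB/s above.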

> One cpu was idle, the other averaged 32% system time and 68% user time
> for the 14 second period. This is on a 2.2GHz Opteron. A faster cpu
> would show increased performance as I really am cpu bound finally.

Yep, with the equivalent of DirectIO you are.

- Luke


From: "Luke Lonergan" <llonergan(at)greenplum(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Alan Stange" <stange(at)rentec(dot)com>, "Greg Stark" <gsstark(at)mit(dot)edu>, "Dave Cramer" <pg(at)fastcrypt(dot)com>, "Joshua Marsh" <icub3d(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-21 18:14:29
Message-ID: BFA74E85.142B8%llonergan@greenplum.com
Lists: pgsql-performance

Tom,

On 11/21/05 6:56 AM, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> "Luke Lonergan" <llonergan(at)greenplum(dot)com> writes:
>> OK - slower this time:
>
>> We've seen between 110MB/s and 120MB/s on a wide variety of fast CPU
>> machines with fast I/O subsystems that can sustain 250MB/s+ using dd, but
>> which all are capped at 120MB/s when doing sequential scans with different
>> versions of Postgres.
>
> Luke, sometime it would be nice if you would post your raw evidence
> and let other people do their own analysis. I for one have gotten
> tired of reading sweeping generalizations unbacked by any data.

This has partly been a challenge to get others to post their results.

> I find the notion of a magic 120MB/s barrier, independent of either
> CPU or disk speed, to be pretty dubious to say the least. I would
> like to know exactly what the "wide variety" of data points you
> haven't shown us are.

I'll try to put up some of them; they've occurred over the last 3 years on
various platforms including:
- Dual 3.2GHz Xeon, 2 x Adaptec U320 SCSI attached to 6 x 10K RPM disks,
Linux 2.6.4(?) - 2.6.10 kernel, ext2/3 and Reiser filesystems
120-130MB/s Postgres seq scan rate on 7.4 and 8.0.

- Dual 1.8 GHz Opteron, 2 x LSI U320 SCSI attached to 6 x 10K RPM disks,
Linux 2.6.10 kernel, ext2/3 and Reiser filesystems
110-120MB/s Postgres seq scan rate on 8.0

- Same machine as above running Solaris 10, with UFS filesystem. When I/O
caching is tuned, we reach the same 110-120MB/s Postgres seq scan rate

- Same machine as above with 7 x 15K RPM 144GB disks in an external disk
tray, same scan rate

Only when we got these new SATA systems and tried out XFS with large
readahead have we been able to break past the 120-130MB/s. After Alan's
post, it seems that XFS might be a big part of that. I think we'll test
ext2/3 against XFS on the same machine to find out.

It may have to wait a week, as many of us are on vacation.

- Luke


From: Greg Stark <gsstark(at)mit(dot)edu>
To: stange(at)rentec(dot)com
Cc: Luke Lonergan <llonergan(at)greenplum(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, Dave Cramer <pg(at)fastcrypt(dot)com>, Joshua Marsh <icub3d(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-21 19:01:26
Message-ID: 87ek59omex.fsf@stark.xeocode.com
Lists: pgsql-performance


Alan Stange <stange(at)rentec(dot)com> writes:

> The point you're making doesn't match my experience with *any* storage or program
> I've ever used, including postgresql. Your point suggests that the storage
> system is idle and that postgresql is broken because it isn't able to use the
> resources available...even when the cpu is very idle. How can that make sense?

Well I think what he's saying is that Postgres is issuing a read, then waiting
for the data to return. Then it does some processing, and goes back to issue
another read. The CPU is idle half the time because Postgres isn't capable of
doing any work while waiting for i/o, and the i/o system is idle half the time
while the CPU intensive part happens.

(Consider as a pathological example a program that reads 8k then sleeps for
10ms, and loops doing that 1,000 times. Now consider the same program
optimized to read 8M asynchronously and sleep for 10s. By the time it's
finished sleeping it has probably read in all 8M. Whereas the program that
read 8k in little chunks interleaved with small sleeps would probably take
twice as long and appear to be entirely i/o-bound with 50% iowait and 50%
idle.)
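
A shell rendering of that pathological reader, with a hypothetical file
name; each pass issues one synchronous 8k read and then stalls for 10ms:

$ for i in $(seq 0 999); do dd if=bigfile of=/dev/null bs=8k count=1 skip=$i 2>/dev/null; sleep 0.01; done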

It's a reasonable theory and it's not inconsistent with the results you sent.
But it's not exactly proven either. Nor is it clear how to improve matters.
Adding additional threads to handle the i/o adds an enormous amount of
complexity and creates lots of opportunity for other contention that could
easily eat all of the gains.

I also fear that heading in that direction could push Postgres even further
from the niche of software that works fine even on low end hardware into the
realm of software that only works on high end hardware. It's already suffering
a bit from that.

--
greg


From: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: stange(at)rentec(dot)com, Luke Lonergan <llonergan(at)greenplum(dot)com>, Dave Cramer <pg(at)fastcrypt(dot)com>, Joshua Marsh <icub3d(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-21 19:51:47
Message-ID: 20051121195147.GC26621@surnet.cl
Lists: pgsql-performance

Greg Stark wrote:

> I also fear that heading in that direction could push Postgres even further
> from the niche of software that works fine even on low end hardware into the
> realm of software that only works on high end hardware. It's already suffering
> a bit from that.

What's high end hardware for you? I do development on a Celeron 533
machine with 448 MB of RAM and I find it to work well (for a "slow"
value of "well", certainly.) If you're talking about embedded hardware,
that's another matter entirely and I don't think we really support the
idea of running Postgres on one of those things.

It's certainly true that the memory requirements have increased a
bit, but I don't think it really qualifies as "high end" even on 8.1.

--
Alvaro Herrera Developer, http://www.PostgreSQL.org
Jude: I wish humans laid eggs
Ringlord: Why would you want humans to lay eggs?
Jude: So I can eat them


From: Bill McGonigle <bill(at)bfccomputing(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: stange(at)rentec(dot)com, Dave Cramer <pg(at)fastcrypt(dot)com>, Joshua Marsh <icub3d(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org, Luke Lonergan <llonergan(at)greenplum(dot)com>
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-21 19:58:18
Message-ID: 8c110ae5a3902fba586f3353b25fc54f@bfccomputing.com
Lists: pgsql-performance

Would it be worth first agreeing on a common set of criteria to
measure? I see many data points going back and forth but not much
agreement on what's worth measuring and how to measure.

I'm not necessarily trying to herd cats, but it sure would be swell to
have the several knowledgeable minds here come up with something that
could be uniformly tested on a range of machines, possibly even integrated
into pgbench or something. Disagreements on criteria or methodology
should be methodically testable.

Then I have dreams of a new pg_autotune that would know about these
kinds of system-level settings.

I haven't been on this list for long, and have only been using postgres
for a handful of years, so forgive me if this has been hashed out before.

-Bill
-----
Bill McGonigle, Owner Work: 603.448.4440
BFC Computing, LLC Home: 603.448.1668
bill(at)bfccomputing(dot)com Mobile: 603.252.2606
http://www.bfccomputing.com/ Pager: 603.442.1833
Jabber: flowerpt(at)gmail(dot)com Text: bill+text(at)bfccomputing(dot)com
Blog: http://blog.bfccomputing.com/


From: Michael Stone <mstone+postgres(at)mathom(dot)us>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-21 19:59:09
Message-ID: 20051121195909.GP7330@mathom.us
Lists: pgsql-performance

On Mon, Nov 21, 2005 at 02:01:26PM -0500, Greg Stark wrote:
>I also fear that heading in that direction could push Postgres even further
>from the niche of software that works fine even on low end hardware into the
>realm of software that only works on high end hardware. It's already suffering
>a bit from that.

Well, there are already a bunch of open source DBs that can handle
the low end. Postgres is the closest thing to being able to handle the
high end.

Mike Stone


From: Alan Stange <stange(at)rentec(dot)com>
To: Luke Lonergan <llonergan(at)greenplum(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Dave Cramer <pg(at)fastcrypt(dot)com>, Joshua Marsh <icub3d(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-21 21:53:41
Message-ID: 438241E5.2010701@rentec.com
Lists: pgsql-performance

Luke,

it's time to back yourself up with some numbers. You're claiming the
need for a significant rewrite of portions of postgresql and you haven't
done the work to make that case.

You've apparently made some mistakes on the use of dd to benchmark a
storage system. Use lmdd and umount the file system before the read
and post your results. Using a file 2x the size of memory doesn't work
correctly. You can quote any other numbers you want, but until you use
lmdd correctly you should be ignored. Ideally, since postgresql uses
1GB files, you'll want to use 1GB files for dd as well.
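
For instance, a synced 1GB write sized to match Postgres' 1GB segment
files might look like this (path hypothetical):

$ lmdd if=internal of=/fidb1/onegig bs=32k count=32768 sync=1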

Luke Lonergan wrote:
> Alan,
>
> On 11/21/05 6:57 AM, "Alan Stange" <stange(at)rentec(dot)com> wrote:
>
>
>> $ time dd if=/dev/zero of=/fidb1/bigfile bs=8k count=800000
>> 800000+0 records in
>> 800000+0 records out
>>
>> real 0m13.780s
>> user 0m0.134s
>> sys 0m13.510s
>>
>> Oops. I just wrote 470MB/s to a file system that has a peak write speed
>> of 200MB/s.
>>
> How much RAM on this machine?
>
Doesn't matter. The result will always be wrong without a call to
sync() or fsync() before the close() if you're trying to measure the
speed of the disk subsystem. Add that sync() and the result will be
correct for any memory size. Just for completeness: Solaris implicitly
calls sync() as part of close. Bonnie used to get this wrong, so
quoting Bonnie isn't any good. Note that on some systems using 2x
memory for these tests is almost OK. For example, Solaris used to have
a hiwater mark that would throttle processes and not allow more than a
few 100K of writes to be outstanding on a file. Linux/XFS clearly
allows a lot of write data to be outstanding. It's better to understand
the tools and know what they do and why they can be wrong than to simply
quote some other tool that makes the same mistakes.

I find that postgresql is able to achieve about 175MB/s on average from
a system capable of delivering 200MB/s peak and it does this with a lot
of cpu time to spare. Maybe dd can do a little better and deliver
185MB/s. If I were to double the speed of my IO system, I might find
that a single postgresql instance can sink about 300MB/s of data (based
on the last numbers I posted). That's why I have multi-cpu opterons and
more than one query/client as they soak up the remaining IO capacity.

It is guaranteed that postgresql will hit some threshold of performance
in the future and possible rewrites of some core functionality will be
needed, but no numbers posted here so far have made the case that
postgresql is in trouble now. In the meantime, build balanced
systems with cpus that match the capabilities of the storage subsystems,
use 32KB block sizes for large memory databases that are doing lots of
sequential scans, use file systems tuned for large files, use opterons, etc.

As always, one has to post some numbers. Here's an example of how dd
doesn't do what you might expect:

mite02:~ # lmdd if=internal of=/fidb2/bigfile bs=8k count=2k
16.7772 MB in 0.0235 secs, 714.5931 MB/sec

mite02:~ # lmdd if=internal of=/fidb2/bigfile bs=8k count=2k sync=1
16.7772 MB in 0.1410 secs, 118.9696 MB/sec

Both numbers are "correct". But one measures the kernel's ability to
absorb 2000 8KB writes with no guarantee that the data is on disk and
the second measures the disk subsystems ability to write 16MB of data.
dd is equivalent to the first result. You can't use the first type of
result and complain that postgresql is slow. If you wrote 16G of data
on a machine with 8G memory then your dd result is possibly too fast by
a factor of two as 8G of the data might not be on disk yet. We won't
know until you post some results.

Cheers,

-- Alan


From: "Luke Lonergan" <llonergan(at)greenplum(dot)com>
To: stange(at)rentec(dot)com
Cc: "Greg Stark" <gsstark(at)mit(dot)edu>, "Dave Cramer" <pg(at)fastcrypt(dot)com>, "Joshua Marsh" <icub3d(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-21 23:44:35
Message-ID: BFA79BE3.14369%llonergan@greenplum.com
Lists: pgsql-performance

Alan,

Unless noted otherwise all results posted are for block device readahead set
to 16M using "blockdev --setra=16384 <block_device>". All are using the
2.6.9-11 Centos 4.1 kernel.

For those who don't have lmdd, here is a comparison of two results on an
ext2 filesystem:

============================================================================
[root(at)modena1 dbfast1]# time bash -c "(dd if=/dev/zero of=/dbfast1/bigfile
bs=8k count=800000 && sync)"
800000+0 records in
800000+0 records out

real 0m33.057s
user 0m0.116s
sys 0m13.577s

[root(at)modena1 dbfast1]# time lmdd if=/dev/zero of=/dbfast1/bigfile bs=8k
count=800000 sync=1
6553.6000 MB in 31.2957 secs, 209.4092 MB/sec

real 0m33.032s
user 0m0.087s
sys 0m13.129s
============================================================================

So lmdd with sync=1 is apparently equivalent to a sync after a dd.

I use 2x memory with dd for the *READ* performance testing, but let's make
sure things are synced on both sides for this set of comparisons.

First, let's test ext2 versus "ext3, data=ordered", versus reiserfs versus
xfs:


From: Michael Stone <mstone+postgres(at)mathom(dot)us>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-21 23:54:44
Message-ID: 20051121235444.GR7330@mathom.us
Lists: pgsql-performance

On Mon, Nov 21, 2005 at 10:14:29AM -0800, Luke Lonergan wrote:
>This has partly been a challenge to get others to post their results.

You'll find that people respond better if you don't play games with
them.


From: "Luke Lonergan" <llonergan(at)greenplum(dot)com>
To: stange(at)rentec(dot)com
Cc: "Greg Stark" <gsstark(at)mit(dot)edu>, "Dave Cramer" <pg(at)fastcrypt(dot)com>, "Joshua Marsh" <icub3d(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-22 04:35:26
Message-ID: BFA7E00E.143DB%llonergan@greenplum.com
Lists: pgsql-performance

Alan,

Looks like Postgres gets sensible scan rate scaling as the filesystem speed
increases, as shown below. I'll drop my 120MB/s observation - perhaps CPUs
got faster since I last tested this.

The scaling looks like 64% of the I/O subsystem speed is available to the
executor - so as the I/O subsystem increases in scan rate, so does Postgres'
executor scan speed.

So that leaves the question - why not more than 64% of the I/O scan rate?
And why is it a flat 64% as the I/O subsystem increases in speed from
333-400MB/s?

- Luke

================= Results ===================

Unless noted otherwise all results posted are for block device readahead set
to 16M using "blockdev --setra=16384 <block_device>". All are using the
2.6.9-11 Centos 4.1 kernel.

For those who don't have lmdd, here is a comparison of two results on an
ext2 filesystem:

============================================================================
[root(at)modena1 dbfast1]# time bash -c "(dd if=/dev/zero of=/dbfast1/bigfile
bs=8k count=800000 && sync)"
800000+0 records in
800000+0 records out

real 0m33.057s
user 0m0.116s
sys 0m13.577s

[root(at)modena1 dbfast1]# time lmdd if=/dev/zero of=/dbfast1/bigfile bs=8k
count=800000 sync=1
6553.6000 MB in 31.2957 secs, 209.4092 MB/sec

real 0m33.032s
user 0m0.087s
sys 0m13.129s
============================================================================

So lmdd with sync=1 is equivalent to a sync after a dd.

I use 2x memory with dd for the *READ* performance testing, but let's make
sure things are synced on both write and read for this set of comparisons.

First, let's test ext2 versus "ext3, data=ordered", versus xfs:

============================================================================
16GB write, then read
============================================================================
-----------------------
ext2:
-----------------------
[root(at)modena1 dbfast1]# time lmdd if=/dev/zero of=/dbfast1/bigfile bs=8k
count=2000000 sync=1
16384.0000 MB in 144.2670 secs, 113.5672 MB/sec

[root(at)modena1 dbfast1]# time lmdd if=/dbfast1/bigfile of=/dev/null bs=8k
count=2000000 sync=1
16384.0000 MB in 49.3766 secs, 331.8170 MB/sec

-----------------------
ext3, data=ordered:
-----------------------
[root(at)modena1 ~]# time lmdd if=/dev/zero of=/dbfast1/bigfile bs=8k
count=2000000 sync=1
16384.0000 MB in 137.1607 secs, 119.4511 MB/sec

[root(at)modena1 ~]# time lmdd if=/dbfast1/bigfile of=/dev/null bs=8k
count=2000000 sync=1
16384.0000 MB in 48.7398 secs, 336.1527 MB/sec

-----------------------
xfs:
-----------------------
[root(at)modena1 ~]# time lmdd if=/dev/zero of=/dbfast1/bigfile bs=8k
count=2000000 sync=1
16384.0000 MB in 52.6141 secs, 311.3994 MB/sec

[root(at)modena1 ~]# time lmdd if=/dbfast1/bigfile of=/dev/null bs=8k
count=2000000 sync=1
16384.0000 MB in 40.2807 secs, 406.7453 MB/sec
============================================================================

I'm liking xfs! Something about the way files are laid out, as Alan
suggested, seems to dramatically improve write performance and perhaps
consequently the read also improves. There doesn't seem to be a difference
between ext3 and ext2, as expected.

Now on to the Postgres 8 tests. We'll do a 16GB table size to ensure that
we aren't reading from the read cache. I'll write this file through
Postgres COPY to be sure that the file layout is as Postgres creates it. The
alternative would be to use COPY once, then tar/untar onto different
filesystems, but that may not duplicate the real world results.

These tests will use Bizgres 0_8_1, which is an augmented 8.0.3. None of
the augmentations act to improve the executor I/O though, so for these
purposes it should be the same as 8.0.3.

============================================================================
26GB of DBT-3 data from the lineitem table
============================================================================
llonergan=# select relpages from pg_class where relname='lineitem';
relpages
----------
3159138
(1 row)

3159138*8192/1000000
25879 Million Bytes, or 25.9GB

-----------------------
xfs:
-----------------------
llonergan=# \timing
Timing is on.
llonergan=# select count(1) from lineitem;
count
-----------
119994608
(1 row)

Time: 394908.501 ms
llonergan=# select count(1) from lineitem;
count
-----------
119994608
(1 row)

Time: 99425.223 ms
llonergan=# select count(1) from lineitem;
count
-----------
119994608
(1 row)

Time: 99187.205 ms

-----------------------
ext2:
-----------------------
llonergan=# select relpages from pg_class where relname='lineitem';
relpages
----------
3159138
(1 row)

llonergan=# \timing
Timing is on.
llonergan=# select count(1) from lineitem;
count
-----------
119994608
(1 row)

Time: 395286.475 ms
llonergan=# select count(1) from lineitem;
count
-----------
119994608
(1 row)

Time: 195756.381 ms
llonergan=# select count(1) from lineitem;
count
-----------
119994608
(1 row)

Time: 122822.090 ms
============================================================================
Analysis of Postgres 8.0.3 results
============================================================================
                            ext2     xfs
Write Speed (MB/s)           114     311
Read Speed (MB/s)            332     407
Postgres Seq Scan (MB/s)     212     263
Scan % of lmdd Read Speed   63.9%   64.6%

Well - looks like we get linear scaling with disk/file subsystem speedup.

- Luke


From: Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>
To: Luke Lonergan <llonergan(at)greenplum(dot)com>
Cc: stange(at)rentec(dot)com, Greg Stark <gsstark(at)mit(dot)edu>, Dave Cramer <pg(at)fastcrypt(dot)com>, Joshua Marsh <icub3d(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-22 05:10:24
Message-ID: 4382A840.3030401@paradise.net.nz
Lists: pgsql-performance

Luke Lonergan wrote:

> So that leaves the question - why not more than 64% of the I/O scan rate?
> And why is it a flat 64% as the I/O subsystem increases in speed from
> 333-400MB/s?
>

It might be interesting to see what effect reducing the cpu consumption
entailed by the count aggregation has - by (say) writing a little bit
of code to heap scan the desired relation (sample attached).

Cheers

Mark

Attachment: fastcount.c (text/plain, 978 bytes)

From: Alan Stange <stange(at)rentec(dot)com>
To: Luke Lonergan <llonergan(at)greenplum(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Dave Cramer <pg(at)fastcrypt(dot)com>, Joshua Marsh <icub3d(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-22 14:26:38
Message-ID: 43832A9E.3030801@rentec.com
Lists: pgsql-performance

Luke,

- XFS will probably generate better data rates with larger files. You
really need to use the same file size as does postgresql. Why compare
the speed of reading a 16G file with the speed of reading a 1G file?
They won't be the same. If need be, write some code that does the test
or modify lmdd to read a sequence of 1G files. Will this make a
difference? You don't know until you do it. Any time you cross a
couple of powers of two in computing, you should expect some differences.

- you did umount the file system before reading the 16G file back in?
Because if you didn't then your read numbers are possibly garbage.
When the read began, 8G of the file was in memory. You'd be very naive
to think that the read of the first 8GB somehow flushed that
cached data out of memory. After all, why would the kernel flush pages
from file X when you're in the middle of a sequential read of...file
X? I'm not sure how Linux handles this, but Solaris would've found the
8G still in memory.

- What was the hardware and disk configuration on which these numbers
were generated? For example, if you have a U320 controller, how did
the read rate become larger than 320MB/s?

- how did the results change from before? Just posting the new results
is misleading given all the boasting we've had to read about your past
results.

- there are two results below for writing to ext2: one at 209 MB/s and
one at 113MB/s. Why are they different?

- what was the cpu usage during these tests? We see postgresql doing
200+MB/s of IO. You've claimed many times that the machine would be
compute bound at lower IO rates, so how much idle time does the cpu
still have?

- You wrote: "We'll do a 16GB table size to ensure that we aren't
reading from the read cache." Do you really believe that?? You have
to umount the file system before each test to ensure you're really
measuring the disk IO rate. If I'm reading your results correctly, you
have three results each for ext2 and xfs, each faster than the prior
one, which means the later runs are clearly reading from the read cache.

- Gee, it's so nice of you to drop your 120MB/s observation. I guess my
reading at 300MB/s wasn't convincing enough. Yeah, I think it was the
cpus too...

- I wouldn't focus on the flat 64% of the data rate number. It'll
probably be different on other systems.

I'm all for testing and more testing. It seems you still cut a corner
without umounting the file system first. Maybe I'm a little too old
school on this, but I wouldn't spend a dime until you've done the
measurements correctly.

Good Luck.

-- Alan

Luke Lonergan wrote:
> Alan,
>
> Looks like Postgres gets sensible scan rate scaling as the filesystem speed
> increases, as shown below. I'll drop my 120MB/s observation - perhaps CPUs
> got faster since I last tested this.
>
> The scaling looks like 64% of the I/O subsystem speed is available to the
> executor - so as the I/O subsystem increases in scan rate, so does Postgres'
> executor scan speed.
>
> So that leaves the question - why not more than 64% of the I/O scan rate?
> And why is it a flat 64% as the I/O subsystem increases in speed from
> 333-400MB/s?
>
> - Luke
>
> ================= Results ===================
>
> Unless noted otherwise all results posted are for block device readahead set
> to 16M using "blockdev --setra=16384 <block_device>". All are using the
> 2.6.9-11 Centos 4.1 kernel.
>
> For those who don't have lmdd, here is a comparison of two results on an
> ext2 filesystem:
>
> ============================================================================
> [root(at)modena1 dbfast1]# time bash -c "(dd if=/dev/zero of=/dbfast1/bigfile
> bs=8k count=800000 && sync)"
> 800000+0 records in
> 800000+0 records out
>
> real 0m33.057s
> user 0m0.116s
> sys 0m13.577s
>
> [root(at)modena1 dbfast1]# time lmdd if=/dev/zero of=/dbfast1/bigfile bs=8k
> count=800000 sync=1
> 6553.6000 MB in 31.2957 secs, 209.4092 MB/sec
>
> real 0m33.032s
> user 0m0.087s
> sys 0m13.129s
> ============================================================================
>
> So lmdd with sync=1 is equivalent to a sync after a dd.
>
> I use 2x memory with dd for the *READ* performance testing, but let's make
> sure things are synced on both write and read for this set of comparisons.
>
> First, let's test ext2 versus "ext3, data=ordered", versus xfs:
>
> ============================================================================
> 16GB write, then read
> ============================================================================
> -----------------------
> ext2:
> -----------------------
> [root(at)modena1 dbfast1]# time lmdd if=/dev/zero of=/dbfast1/bigfile bs=8k
> count=2000000 sync=1
> 16384.0000 MB in 144.2670 secs, 113.5672 MB/sec
>
> [root(at)modena1 dbfast1]# time lmdd if=/dbfast1/bigfile of=/dev/null bs=8k
> count=2000000 sync=1
> 16384.0000 MB in 49.3766 secs, 331.8170 MB/sec
>
> -----------------------
> ext3, data=ordered:
> -----------------------
> [root(at)modena1 ~]# time lmdd if=/dev/zero of=/dbfast1/bigfile bs=8k
> count=2000000 sync=1
> 16384.0000 MB in 137.1607 secs, 119.4511 MB/sec
>
> [root(at)modena1 ~]# time lmdd if=/dbfast1/bigfile of=/dev/null bs=8k
> count=2000000 sync=1
> 16384.0000 MB in 48.7398 secs, 336.1527 MB/sec
>
> -----------------------
> xfs:
> -----------------------
> [root(at)modena1 ~]# time lmdd if=/dev/zero of=/dbfast1/bigfile bs=8k
> count=2000000 sync=1
> 16384.0000 MB in 52.6141 secs, 311.3994 MB/sec
>
> [root(at)modena1 ~]# time lmdd if=/dbfast1/bigfile of=/dev/null bs=8k
> count=2000000 sync=1
> 16384.0000 MB in 40.2807 secs, 406.7453 MB/sec
> ============================================================================
>
> I'm liking xfs! Something about the way files are laid out, as Alan
> suggested, seems to dramatically improve write performance and perhaps
> consequently the read also improves. There doesn't seem to be a difference
> between ext3 and ext2, as expected.
>
> Now on to the Postgres 8 tests. We'll do a 16GB table size to ensure that
> we aren't reading from the read cache. I'll write this file through
> Postgres COPY to be sure that the file layout is as Postgres creates it. The
> alternative would be to use COPY once, then tar/untar onto different
> filesystems, but that may not duplicate the real world results.
>
> These tests will use Bizgres 0_8_1, which is an augmented 8.0.3. None of
> the augmentations act to improve the executor I/O though, so for these
> purposes it should be the same as 8.0.3.
>
> ============================================================================
> 26GB of DBT-3 data from the lineitem table
> ============================================================================
> llonergan=# select relpages from pg_class where relname='lineitem';
> relpages
> ----------
> 3159138
> (1 row)
>
> 3159138*8192/1000000
> 25879 Million Bytes, or 25.9GB
>
> -----------------------
> xfs:
> -----------------------
> llonergan=# \timing
> Timing is on.
> llonergan=# select count(1) from lineitem;
> count
> -----------
> 119994608
> (1 row)
>
> Time: 394908.501 ms
> llonergan=# select count(1) from lineitem;
> count
> -----------
> 119994608
> (1 row)
>
> Time: 99425.223 ms
> llonergan=# select count(1) from lineitem;
> count
> -----------
> 119994608
> (1 row)
>
> Time: 99187.205 ms
>
> -----------------------
> ext2:
> -----------------------
> llonergan=# select relpages from pg_class where relname='lineitem';
> relpages
> ----------
> 3159138
> (1 row)
>
> llonergan=# \timing
> Timing is on.
> llonergan=# select count(1) from lineitem;
> count
> -----------
> 119994608
> (1 row)
>
> Time: 395286.475 ms
> llonergan=# select count(1) from lineitem;
> count
> -----------
> 119994608
> (1 row)
>
> Time: 195756.381 ms
> llonergan=# select count(1) from lineitem;
> count
> -----------
> 119994608
> (1 row)
>
> Time: 122822.090 ms
> ============================================================================
> Analysis of Postgres 8.0.3 results
> ============================================================================
>                             ext2     xfs
> Write Speed (MB/s)           114     311
> Read Speed (MB/s)            332     407
> Postgres Seq Scan (MB/s)     212     263
> Scan % of lmdd Read Speed   63.9%   64.6%
>
> Well - looks like we get linear scaling with disk/file subsystem speedup.
>
> - Luke
>


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: stange(at)rentec(dot)com, Luke Lonergan <llonergan(at)greenplum(dot)com>, Dave Cramer <pg(at)fastcrypt(dot)com>, Joshua Marsh <icub3d(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-23 00:13:20
Message-ID: 200511230013.jAN0DKV10698@candle.pha.pa.us
Lists: pgsql-performance

Greg Stark wrote:
>
> Alan Stange <stange(at)rentec(dot)com> writes:
>
> > The point you're making doesn't match my experience with *any* storage or program
> > I've ever used, including postgresql. Your point suggests that the storage
> > system is idle and that postgresql is broken because it isn't able to use the
> > resources available...even when the cpu is very idle. How can that make sense?
>
> Well I think what he's saying is that Postgres is issuing a read, then waiting
> for the data to return. Then it does some processing, and goes back to issue
> another read. The CPU is idle half the time because Postgres isn't capable of
> doing any work while waiting for i/o, and the i/o system is idle half the time
> while the CPU intensive part happens.
>
> (Consider as a pathological example a program that reads 8k then sleeps for
> 10ms, and loops doing that 1,000 times. Now consider the same program
> optimized to read 8M asynchronously and sleep for 10s. By the time it's
> finished sleeping it has probably read in all 8M. Whereas the program that
> read 8k in little chunks interleaved with small sleeps would probably take
> twice as long and appear to be entirely i/o-bound with 50% iowait and 50%
> idle.)
>
> It's a reasonable theory and it's not inconsistent with the results you sent.
> But it's not exactly proven either. Nor is it clear how to improve matters.
> Adding additional threads to handle the i/o adds an enormous amount of
> complexity and creates lots of opportunity for other contention that could
> easily eat all of the gains.

Perfect summary. We have a background writer now. Ideally we would
have a background reader that reads blocks ahead into the buffer cache.
The problem is that while there is a relatively long time between a
buffer being dirtied and the time it must be on disk (checkpoint time),
the read-ahead time is much shorter, requiring some kind of quick
"create a thread" approach that could easily bog us down as outlined
above.

Right now the file system will do read-ahead for a heap scan (but not an
index scan), but even then, there is time required to get that kernel
block into the PostgreSQL shared buffers, backing up Luke's observation
of heavy memcpy() usage.

So what are our options? mmap()? I have no idea. It seems a larger
page size does help.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Alan Stange <stange(at)rentec(dot)com>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Luke Lonergan <llonergan(at)greenplum(dot)com>, Dave Cramer <pg(at)fastcrypt(dot)com>, Joshua Marsh <icub3d(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-23 03:57:16
Message-ID: 4383E89C.8040200@rentec.com
Lists: pgsql-performance

Bruce Momjian wrote:
> Greg Stark wrote:
>
>> Alan Stange <stange(at)rentec(dot)com> writes:
>>
>>
>>> The point you're making doesn't match my experience with *any* storage or program
>>> I've ever used, including postgresql. Your point suggests that the storage
>>> system is idle and that postgresql is broken because it isn't able to use the
>>> resources available...even when the cpu is very idle. How can that make sense?
>>>
>> Well I think what he's saying is that Postgres is issuing a read, then waiting
>> for the data to return. Then it does some processing, and goes back to issue
>> another read. The CPU is idle half the time because Postgres isn't capable of
>> doing any work while waiting for i/o, and the i/o system is idle half the time
>> while the CPU intensive part happens.
>>
>> (Consider as a pathological example a program that reads 8k then sleeps for
>> 10ms, and loops doing that 1,000 times. Now consider the same program
>> optimized to read 8M asynchronously and sleep for 10s. By the time it's
>> finished sleeping it has probably read in all 8M. Whereas the program that
>> read 8k in little chunks interleaved with small sleeps would probably take
>> twice as long and appear to be entirely i/o-bound with 50% iowait and 50%
>> idle.)
>>
>> It's a reasonable theory and it's not inconsistent with the results you sent.
>> But it's not exactly proven either. Nor is it clear how to improve matters.
>> Adding additional threads to handle the i/o adds an enormous amount of
>> complexity and creates lots of opportunity for other contention that could
>> easily eat all of the gains.
>>
>
> Perfect summary. We have a background writer now. Ideally we would
> have a background reader, that reads-ahead blocks into the buffer cache.
> The problem is that while there is a relatively long time between a
> buffer being dirtied and the time it must be on disk (checkpoint time),
> the read-ahead time is much shorter, requiring some kind of quick
> "create a thread" approach that could easily bog us down as outlined
> above.
>
> Right now the file system will do read-ahead for a heap scan (but not an
> index scan), but even then, there is time required to get that kernel
> block into the PostgreSQL shared buffers, backing up Luke's observation
> of heavy memcpy() usage.
>
> So what are our options? mmap()? I have no idea. Seems larger page
> size does help.
For sequential scans, you do have a background reader. It's the
kernel. As long as you don't issue a seek() between read() calls, the
kernel will get the hint about sequential IO and begin to perform a read
ahead for you. This is where the above analysis isn't quite right:
while postgresql is processing the returned data from the read() call,
the kernel has also issued reads as part of the read ahead, keeping the
device busy while the cpu is busy. (I'm assuming these details for
Linux; Solaris/UFS does work this way). Issue one seek on the file and
the read ahead algorithm will back off for a while. This was my point
about some descriptions of how the system works not being sensible.

If your goal is sequential IO, then one must use larger block sizes.
No one would use 8KB IO for achieving high sequential IO rates. Simply
put, read() is about the slowest way to get 8KB of data. Switching
to 32KB blocks reduces all the system call overhead by a large margin.
Larger blocks would be better still, up to the stripe size of your
mirror. (Of course, you're using a mirror and not raid5 if you care
about performance.)
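
An easy way to see the system-call overhead in isolation is to read a
file that is already in the page cache, so the disk never moves, and
vary only the block size (file name hypothetical):

$ time dd if=bigfile of=/dev/null bs=8k     # four times the read() calls...
$ time dd if=bigfile of=/dev/null bs=32k    # ...versus this

The bytes moved are identical; the difference in system time is the
per-call overhead.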

I don't think the memcpy of data from the kernel to userspace is that
big of an issue right now. dd and all the high end network interfaces
manage OK doing it, so I'd expect postgresql to do all right with it for
now, too. Direct IO will avoid that memcpy, but then you also don't get
any caching of the files in memory. I'd be more concerned about any
memcpy calls or general data management within postgresql. Does
postgresql use the platform specific memcpy() in libc? Some care might
be needed to ensure that the memory blocks within postgresql are all
properly aligned to make sure that one isn't ping-ponging cache lines
around (usually done by padding the buffer sizes by an extra 32 bytes or
L1 line size). Whatever you do, all the usual high performance
computing tricks should be used prior to considering any rewriting of
major code sections.

Personally, I'd like to see some detailed profiling being done using
hardware counters for cpu cycles and cache misses, etc. Given the poor
quality of work that has been discussed here in this thread, I don't
have much confidence in any additional results at this time.
None of the analysis would be acceptable in any environment in which
I've worked. Be sure to take a look at Sun's free Workshop tools as
they are excellent for this sort of profiling and one doesn't need to
recompile to use them. If I get a little time in the next week or two
I might take a crack at this.

Cheers,

-- Alan


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Alan Stange <stange(at)rentec(dot)com>
Cc: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, Greg Stark <gsstark(at)mit(dot)edu>, Luke Lonergan <llonergan(at)greenplum(dot)com>, Dave Cramer <pg(at)fastcrypt(dot)com>, Joshua Marsh <icub3d(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-23 04:21:36
Message-ID: 87d5ksm1tb.fsf@stark.xeocode.com
Lists: pgsql-performance


Alan Stange <stange(at)rentec(dot)com> writes:

> For sequential scans, you do have a background reader. It's the kernel. As
> long as you don't issue a seek() between read() calls, the kernel will get the
> hint about sequential IO and begin to perform a read ahead for you. This is
> where the above analysis isn't quite right: while postgresql is processing the
> returned data from the read() call, the kernel has also issued reads as part of
> the read ahead, keeping the device busy while the cpu is busy. (I'm assuming
> these details for Linux; Solaris/UFS does work this way). Issue one seek on
> the file and the read ahead algorithm will back off for a while. This was my
> point about some descriptions of how the system works not being sensible.

Well that's certainly the hope. But we don't know that this is actually as
effective as you assume it is. It's awfully hard in the kernel to make much
more than a vague educated guess about what kind of readahead would actually
help.

This is especially true when a file isn't really being accessed in a
sequential fashion as Postgres may well do if, for example, multiple backends
are reading the same file. And as you pointed out it doesn't help at all for
random access index scans.

> If your goal is sequential IO, then one must use larger block sizes. No one
> would use 8KB IO for achieving high sequential IO rates. Simply put, read()
> is about the slowest way to get 8KB of data. Switching to 32KB blocks
> reduces all the system call overhead by a large margin. Larger blocks would be
> better still, up to the stripe size of your mirror. (Of course, you're using
> a mirror and not raid5 if you care about performance.)

Switching to 32kB blocks throughout Postgres has pros but also major cons, not
the least of which is *extra* i/o for random access read patterns. One of the possible
advantages of the suggestions that were made, the ones you're shouting down,
would actually be the ability to use 32kB scatter/gather reads without
necessarily switching block sizes.

(Incidentally, your parenthetical comment is a bit confused. By "mirror" I
imagine you're referring to raid1+0 since mirrors alone, aka raid1, aren't a
popular way to improve performance. But raid5 actually performs better than
raid1+0 for sequential reads.)

> Does postgresql use the platform specific memcpy() in libc? Some care might
> be needed to ensure that the memory blocks within postgresql are all
> properly aligned to make sure that one isn't ping-ponging cache lines around
> (usually done by padding the buffer sizes by an extra 32 bytes or L1 line
> size). Whatever you do, all the usual high performance computing tricks
> should be used prior to considering any rewriting of major code sections.

So your philosophy is to worry about microoptimizations before worrying about
architectural issues?

--
greg


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Alan Stange <stange(at)rentec(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Luke Lonergan <llonergan(at)greenplum(dot)com>, Dave Cramer <pg(at)fastcrypt(dot)com>, Joshua Marsh <icub3d(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-23 04:53:16
Message-ID: 200511230453.jAN4rGs24421@candle.pha.pa.us
Lists: pgsql-performance

Alan Stange wrote:
> Bruce Momjian wrote:
> > Right now the file system will do read-ahead for a heap scan (but not an
> > index scan), but even then, there is time required to get that kernel
> > block into the PostgreSQL shared buffers, backing up Luke's observation
> > of heavy memcpy() usage.
> >
> > So what are our options? mmap()? I have no idea. Seems larger page
> > size does help.

> For sequential scans, you do have a background reader. It's the
> kernel. As long as you don't issue a seek() between read() calls, the

I guess you missed my text of "Right now the file system will do
read-ahead", meaning the kernel.

> I don't think the memcpy of data from the kernel to userspace is that
> big of an issue right now. dd and all the high end network interfaces
> manage OK doing it, so I'd expect postgresql to do all right with it now
> too. Direct IO will avoid that memcpy, but then you also don't get
> any caching of the files in memory. I'd be more concerned about any
> memcpy calls or general data management within postgresql. Does
> postgresql use the platform specific memcpy() in libc? Some care might
> be needed to ensure that the memory blocks within postgresql are all
> properly aligned to make sure that one isn't ping-ponging cache lines
> around (usually done by padding the buffer sizes by an extra 32 bytes or
> L1 line size). Whatever you do, all the usual high performance
> computing tricks should be used prior to considering any rewriting of
> major code sections.

We have dealt with alignment, and MemCpy is what we used for small-sized
copies to reduce function call overhead. If you want to improve it,
feel free to take a look.
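
For anyone following along, the padding trick Alan describes looks
something like this - an illustration only, not our actual buffer layout,
and the 64-byte line size is an assumption you'd check per platform:

/*
 * Pad each buffer out by a cache line so two buffers never share a
 * line; sharing is what makes a line ping-pong between CPUs.
 */
#define L1_LINE_SIZE 64
#define BUF_SIZE     8192

typedef struct
{
    char    data[BUF_SIZE];
    char    pad[L1_LINE_SIZE];  /* keeps the next buffer off our line */
} PaddedBuffer;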

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: "Luke Lonergan" <llonergan(at)greenplum(dot)com>
To: "Bruce Momjian" <pgman(at)candle(dot)pha(dot)pa(dot)us>, "Greg Stark" <gsstark(at)mit(dot)edu>
Cc: stange(at)rentec(dot)com, "Dave Cramer" <pg(at)fastcrypt(dot)com>, "Joshua Marsh" <icub3d(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-23 17:51:06
Message-ID: BFA9EC0A.14580%llonergan@greenplum.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Bruce,

On 11/22/05 4:13 PM, "Bruce Momjian" <pgman(at)candle(dot)pha(dot)pa(dot)us> wrote:

> Perfect summary. We have a background writer now. Ideally we would
> have a background reader, that reads-ahead blocks into the buffer cache.
> The problem is that while there is a relatively long time between a
> buffer being dirtied and the time it must be on disk (checkpoint time),
> the read-ahead time is much shorter, requiring some kind of quick
> "create a thread" approach that could easily bog us down as outlined
> above.

Yes, the question is "how much read-ahead buffer is needed to equate to the
38% of I/O wait time in the current executor profile?"

The idea of asynchronous buffering would seem appropriate if the executor
could then use that 38% of time for useful work.

A background reader is an interesting approach - it would require admin
management of buffers, where AIO would leave that to the kernel. The
advantage over AIO would be more universal platform support, I suppose?
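
To make the comparison concrete, the AIO side would look roughly like
this - an untested sketch assuming POSIX AIO (<aio.h>, linked with -lrt
on Linux), not a proposal for the actual executor code:

/*
 * Issue the read for block N+1, process block N, then collect the
 * result: the kernel keeps the disk busy while the executor works.
 */
#include <aio.h>
#include <string.h>

#define BLKSZ 8192

int
start_prefetch(struct aiocb *cb, int fd, void *buf, off_t offset)
{
    memset(cb, 0, sizeof(*cb));
    cb->aio_fildes = fd;
    cb->aio_buf = buf;
    cb->aio_nbytes = BLKSZ;
    cb->aio_offset = offset;
    return aio_read(cb);
}

ssize_t
finish_prefetch(struct aiocb *cb)
{
    const struct aiocb *list[1] = {cb};

    aio_suspend(list, 1, NULL); /* blocks only if the read isn't done */
    return aio_return(cb);
}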

> Right now the file system will do read-ahead for a heap scan (but not an
> index scan), but even then, there is time required to get that kernel
> block into the PostgreSQL shared buffers, backing up Luke's observation
> of heavy memcpy() usage.

As evidenced by the 16MB readahead setting still resulting in only 36% IO
wait.

> So what are our options? mmap()? I have no idea. Seems larger page
> size does help.

Not sure about that - we used to run with a 32KB page size and I didn't see
any benefit on seq scan at all. I haven't seen tests in this thread that
compare 8K to 32K.

- Luke


From: "Luke Lonergan" <llonergan(at)greenplum(dot)com>
To: "Alan Stange" <stange(at)rentec(dot)com>, "Postgresql Performance" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-23 17:53:04
Message-ID: BFA9EC80.14581%llonergan@greenplum.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Alan,

Why not contribute something - put up proof of your stated 8KB versus 32KB
page size improvement.

- Luke


From: Alan Stange <stange(at)rentec(dot)com>
To: Luke Lonergan <llonergan(at)greenplum(dot)com>
Cc: Postgresql Performance <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-23 22:00:37
Message-ID: 4384E685.4060108@rentec.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Luke Lonergan wrote:
> Why not contribute something - put up proof of your stated 8KB versus
> 32KB page size improvement.

I did observe that 32KB block sizes were a significant win "for our
usage patterns". It might be a win for any of the following reasons:

0) The preliminaries: ~300GB database with ~50GB daily turnover. Our
data is reasonably well grouped: if we're getting one item on a page,
we're usually looking at the other items as well.

1) we can live with a smaller FSM size. We were often leaking pages
with a 10M page FSM setting. With 32K pages, a 10M FSM size is
sufficient. Yes, the solution to this is "run vacuum more often", but
when the vacuum was taking 10 hours at a time, that was hard to do.

2) The typical datum size in our largest table is about 2.8KB, which is
more than 1/4 of the page size, thus forcing the use of a toast table.
Switching to 32KB pages allows us to get a decent storage of this data
into the main tables, thus avoiding another table and associated large
index. Not having the extra index in memory for a table with 90M rows
is probably beneficial.

3) vacuum time has been substantially reduced. Vacuum analyze now runs
in the 2 to 3 hour range, depending on load.

4) less cpu time spent in the kernel. We're basically doing 1/4 as many
system calls.

Overall the system has now been working well. We used to see the
database being a bottleneck at times, but now it's keeping up nicely.
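
In case anyone wants to reproduce this: as I recall, with the 8.x sources
the block size is a one-line edit (verify against your own tree), followed
by a full recompile and a fresh initdb, since BLCKSZ is baked into the
on-disk format:

/* src/include/pg_config_manual.h */
#define BLCKSZ  32768           /* default is 8192 */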

Hope this helps.

Happy Thanksgiving!

-- Alan


From: "Luke Lonergan" <llonergan(at)greenplum(dot)com>
To: stange(at)rentec(dot)com
Cc: "Postgresql Performance" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-24 01:50:57
Message-ID: BFAA5C81.145DB%llonergan@greenplum.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Alan,

On 11/23/05 2:00 PM, "Alan Stange" <stange(at)rentec(dot)com> wrote:

> Luke Lonergan wrote:
>> Why not contribute something - put up proof of your stated 8KB versus
>> 32KB page size improvement.
>
> I did observe that 32KB block sizes were a significant win "for our
> usage patterns". It might be a win for any of the following reasons:
> (* big snip *)

Though all of what you relate is interesting, it seems irrelevant to your
earlier statement here:

>> Alan Stange <stange(at)rentec(dot)com> writes:
>> If your goal is sequential IO, then one must use larger block sizes.
>> No one would use 8KB IO for achieving high sequential IO rates. Simply
>> put, read() is about the slowest way to get 8KB of data. Switching
>> to 32KB blocks reduces all the system call overhead by a large margin.
>> Larger blocks would be better still, up to the stripe size of your
>> mirror. (Of course, you're using a mirror and not raid5 if you care
>> about performance.)

And I am interested in seeing if your statement is correct. Do you have any
proof of this to share?

- Luke


From: "Luke Lonergan" <llonergan(at)greenplum(dot)com>
To: "Mark Kirkwood" <markir(at)paradise(dot)net(dot)nz>
Cc: pgsql-performance(at)postgresql(dot)org, "eng(at)intranet(dot)greenplum(dot)com" <eng(at)intranet(dot)greenplum(dot)com>
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-24 02:29:49
Message-ID: BFAA659D.145E5%llonergan@greenplum.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Mark,

This is an excellent idea - unfortunately I'm in Maui right now (Mahalo!)
and I'm not getting to testing with this. My first try was with 8.0.3 and
it's an 8.1 function I presume.

Not to be lazy - but any hint as to how to do the same thing for 8.0?

- Luke

On 11/21/05 9:10 PM, "Mark Kirkwood" <markir(at)paradise(dot)net(dot)nz> wrote:

> Luke Lonergan wrote:
>
>> > So that leaves the question - why not more than 64% of the I/O scan rate?
>> > And why is it a flat 64% as the I/O subsystem increases in speed from
>> > 333-400MB/s?
>> >
>
> It might be interesting to see what effect reducing the cpu consumption
> entailed by the count aggregation has - by (say) writing a little bit
> of code to heap scan the desired relation (sample attached).
>
> Cheers
>
> Mark
>
> /*
> * fastcount.c
> *
> * Do a count that uses considerably less CPU time than an aggregate.
> */
>
> #include "postgres.h"
>
> #include "funcapi.h"
> #include "access/heapam.h"
> #include "catalog/namespace.h"
> #include "utils/builtins.h"
>
>
> extern Datum fastcount(PG_FUNCTION_ARGS);
>
>
> PG_FUNCTION_INFO_V1(fastcount);
> Datum
> fastcount(PG_FUNCTION_ARGS)
> {
>     text       *relname = PG_GETARG_TEXT_P(0);
>     RangeVar   *relrv;
>     Relation    rel;
>     HeapScanDesc scan;
>     HeapTuple   tuple;
>     int64       result = 0;
>
>     /* Use the name to get a suitable range variable and open the relation. */
>     relrv = makeRangeVarFromNameList(textToQualifiedNameList(relname));
>     rel = heap_openrv(relrv, AccessShareLock);
>
>     /* Start a heap scan on the relation. */
>     scan = heap_beginscan(rel, SnapshotNow, 0, NULL);
>     while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
>     {
>         result++;
>     }
>
>     /* End the scan and close up the relation. */
>     heap_endscan(scan);
>     heap_close(rel, AccessShareLock);
>
>     PG_RETURN_INT64(result);
> }


From: Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>
To: Luke Lonergan <llonergan(at)greenplum(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org, "eng(at)intranet(dot)greenplum(dot)com" <eng(at)intranet(dot)greenplum(dot)com>
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-24 05:34:03
Message-ID: 438550CB.5020506@paradise.net.nz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Luke Lonergan wrote:
> Mark,
>
> This is an excellent idea – unfortunately I’m in Maui right now
> (Mahalo!) and I’m not getting to testing with this. My first try was
> with 8.0.3 and it’s an 8.1 function I presume.
>
> Not to be lazy – but any hint as to how to do the same thing for 8.0?
>

Yeah, it's 8.1 - I didn't think to check against 8.0. The attached
variant works with 8.0.4 (textToQualifiedNameList needs 2 args)

cheers

Mark

P.s. Maui eh, sounds real nice.

Attachment Content-Type Size
fastcount-8.0.c text/plain 1.0 KB

From: "Luke Lonergan" <llonergan(at)greenplum(dot)com>
To: "Mark Kirkwood" <markir(at)paradise(dot)net(dot)nz>
Cc: stange(at)rentec(dot)com, "Greg Stark" <gsstark(at)mit(dot)edu>, "Dave Cramer" <pg(at)fastcrypt(dot)com>, "Joshua Marsh" <icub3d(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-24 08:17:06
Message-ID: BFAAB702.1460D%llonergan@greenplum.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Mark,

See the results below and analysis - the pure HeapScan gets 94.1% of the max
available read bandwidth (cool!). Nothing wrong with heapscan in the
presence of large readahead, which is good news.

That says it's something else in the path. As you probably know there is a
page lock taken, a copy of the tuple from the page, lock removed, count
incremented for every iteration of the agg node on a count(*). Is the same
true of a count(1)?

I recall that the profile is full of memcpy and memory context calls.

It would be nice to put some tracers into the executor and see where the
time is going. I'm also curious about the impact of the new 8.1 virtual
tuples in reducing the executor overhead. In this case my bet's on the agg
node itself - what do you think?

- Luke

On 11/21/05 9:10 PM, "Mark Kirkwood" <markir(at)paradise(dot)net(dot)nz> wrote:

> Luke Lonergan wrote:
>
>> So that leaves the question - why not more than 64% of the I/O scan rate?
>> And why is it a flat 64% as the I/O subsystem increases in speed from
>> 333-400MB/s?
>>
>
> It might be interesting to see what effect reducing the cpu consumption
> entailed by the count aggregation has - by (say) writing a little bit
> of code to heap scan the desired relation (sample attached).

OK - here are results for a slightly smaller (still bigger than RAM)
lineitem on the same machine, using the same xfs filesystem that achieved
407MB/s:

============================================================================
12.9GB of DBT-3 data from the lineitem table
============================================================================
llonergan=# select relpages from pg_class where relname='lineitem';
relpages
----------
1579270
(1 row)

1579270 * 8192 / 1000000 = ~12937 million bytes, or 12.9GB

llonergan=# \timing
Timing is on.
llonergan=# select count(1) from lineitem;
count
----------
59986052
(1 row)

Time: 197870.105 ms
llonergan=# select count(1) from lineitem;
count
----------
59986052
(1 row)

Time: 49912.164 ms
llonergan=# select count(1) from lineitem;
count
----------
59986052
(1 row)

Time: 49218.739 ms

llonergan=# select fastcount('lineitem');
fastcount
-----------
59986052
(1 row)

Time: 33752.778 ms
llonergan=# select fastcount('lineitem');
fastcount
-----------
59986052
(1 row)

Time: 34543.646 ms
llonergan=# select fastcount('lineitem');
fastcount
-----------
59986052
(1 row)

Time: 34528.053 ms

============================================================================
Analysis:
============================================================================
              Bandwidth    Percent of max
dd Read       407MB/s      100%
Count(1)      263MB/s      64.6%
HeapScan      383MB/s      94.1%

Wow - looks like the HeapScan gets almost all of the available bandwidth!

- Luke


From: Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>
To: Luke Lonergan <llonergan(at)greenplum(dot)com>
Cc: stange(at)rentec(dot)com, Greg Stark <gsstark(at)mit(dot)edu>, Dave Cramer <pg(at)fastcrypt(dot)com>, Joshua Marsh <icub3d(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-24 08:53:16
Message-ID: 43857F7C.6060305@paradise.net.nz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Luke Lonergan wrote:

> ============================================================================
> 12.9GB of DBT-3 data from the lineitem table
> ============================================================================
> llonergan=# select relpages from pg_class where relname='lineitem';
> relpages
> ----------
> 1579270
> (1 row)
>
> 1579270*8192/1000000
> 12937 Million Bytes or 12.9GB
>
> llonergan=# \timing
> Timing is on.
> llonergan=# select count(1) from lineitem;
> count
> ----------
> 59986052
> (1 row)
>
> Time: 197870.105 ms

So 198 seconds is the uncached read time with count (Just for clarity,
did you clear the Pg and filesystem caches or unmount / remount the
filesystem?)

> llonergan=# select count(1) from lineitem;
> count
> ----------
> 59986052
> (1 row)
>
> Time: 49912.164 ms
> llonergan=# select count(1) from lineitem;
> count
> ----------
> 59986052
> (1 row)
>
> Time: 49218.739 ms
>

and ~50 seconds is the (partially) cached read time with count

> llonergan=# select fastcount('lineitem');
> fastcount
> -----------
> 59986052
> (1 row)
>
> Time: 33752.778 ms
> llonergan=# select fastcount('lineitem');
> fastcount
> -----------
> 59986052
> (1 row)
>
> Time: 34543.646 ms
> llonergan=# select fastcount('lineitem');
> fastcount
> -----------
> 59986052
> (1 row)
>
> Time: 34528.053 ms
>

so ~34 seconds is the (partially) cached read time for fastcount -
I calculate this to give a ~362MB/s effective IO rate (I'm dividing by
1024*1024, not 1000*1000) FWIW.

While this is interesting, you probably want to stop Pg, unmount the
filesystem, and restart Pg to get the uncached time for fastcount too
(and how does this compare to uncached read with dd using the same block
size?).

But at this stage it certainly looks like the heapscan code is pretty
efficient - great!

Oh - and do you want to try out a 32K block size? I'm interested to see
what level of improvement you get (as my system is hopelessly cpu bound...)!

> ============================================================================
> Analysis:
> ============================================================================
>               Bandwidth    Percent of max
> dd Read       407MB/s      100%
> Count(1)      263MB/s      64.6%
> HeapScan      383MB/s      94.1%

Cheers

Mark


From: Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>
To: Luke Lonergan <llonergan(at)greenplum(dot)com>
Cc: stange(at)rentec(dot)com, Greg Stark <gsstark(at)mit(dot)edu>, Dave Cramer <pg(at)fastcrypt(dot)com>, Joshua Marsh <icub3d(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-24 09:11:36
Message-ID: 438583C8.4050809@paradise.net.nz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Luke Lonergan wrote:
> Mark,
>
>
> It would be nice to put some tracers into the executor and see where the
> time is going. I'm also curious about the impact of the new 8.1 virtual
> tuples in reducing the executor overhead. In this case my bet's on the agg
> node itself, what do you think?
>

Yeah - it's pretty clear that the count aggregate is fairly expensive
wrt cpu. However, I am not sure if all agg nodes suffer this way (guess
we could try a trivial aggregate that does nothing for all tuples bar
the last and just reports the final value it sees).
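
Something like this, perhaps - an untested sketch against the same 8.x
headers fastcount.c uses, and it would still need the usual CREATE
FUNCTION / CREATE AGGREGATE wrappers:

/*
 * Transition function that ignores the incoming value and hands its
 * state back unchanged, isolating the pure agg-node overhead.
 */
#include "postgres.h"
#include "fmgr.h"

PG_FUNCTION_INFO_V1(noop_trans);

Datum
noop_trans(PG_FUNCTION_ARGS)
{
    /* arg 0 is the running state; arg 1 (the scanned value) is unused */
    PG_RETURN_DATUM(PG_GETARG_DATUM(0));
}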

Cheers

Mark


From: "Luke Lonergan" <llonergan(at)greenplum(dot)com>
To: "Mark Kirkwood" <markir(at)paradise(dot)net(dot)nz>
Cc: stange(at)rentec(dot)com, "Greg Stark" <gsstark(at)mit(dot)edu>, "Dave Cramer" <pg(at)fastcrypt(dot)com>, "Joshua Marsh" <icub3d(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-24 09:22:03
Message-ID: BFAAC63B.14618%llonergan@greenplum.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Mark,

>> Time: 197870.105 ms
>
> So 198 seconds is the uncached read time with count (Just for clarity,
> did you clear the Pg and filesystem caches or unmount / remount the
> filesystem?)

Nope - the longer time is due to the known "second write" issue with
Postgres: it writes the data to the table, but all of the pages are left
marked dirty, so on the first scan after loading they are written again.
This is clear as you watch vmstat - the pattern on the first seq scan is
half read / half write.

>> Time: 49218.739 ms
>>
>
> and ~50 seconds is the (partially) cached read time with count

Again - the pattern here is pure read and completely non-cached. You see a
very nearly constant I/O rate when watching vmstat for the entire scan.

>> Time: 34528.053 ms

> so ~34 seconds is the (partially) cached read time for fastcount -
> I calculate this to give a ~362MB/s effective IO rate (I'm dividing by
> 1024*1024, not 1000*1000) FWIW.

The dd number uses 1000*1000, so I maintained it for the percentage of max.

> While this is interesting, you probably want to stop Pg, unmount the
> filesystem, and restart Pg to get the uncached time for fastcount too
> (and how does this compare to uncached read with dd using the same block
> size?).

I'll do it again sometime, but I've already deleted the file. I've done the
following in the past to validate this though:

- Reboot machine
- Rerun scan

And we get identical results.

> But at this stage it certainly looks the the heapscan code is pretty
> efficient - great!

Yep.

> Oh - and do you want to try out 32K block size, I'm interested to see
> what level of improvement you get (as my system is hopelessly cpu bound...)!

Yah - I've done so in the past and not seen any improvement - was waiting
for Alan to post his results.

>> ============================================================================
>> Analysis:
>> ============================================================================
>>              Bandwidth    Percent of max
>> dd Read      407MB/s      100%
>> Count(1)     263MB/s      64.6%
>> HeapScan     383MB/s      94.1%

Note these are all in consistent 1000x1000 units.

Thanks for the test - neat trick! We'll use it to do some more profiling
some time soon...

- Luke


From: Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>
To: Luke Lonergan <llonergan(at)greenplum(dot)com>
Cc: stange(at)rentec(dot)com, Greg Stark <gsstark(at)mit(dot)edu>, Dave Cramer <pg(at)fastcrypt(dot)com>, Joshua Marsh <icub3d(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-24 09:24:35
Message-ID: 438586D3.9010701@paradise.net.nz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Luke Lonergan wrote:

> That says it's something else in the path. As you probably know there is a
> page lock taken, a copy of the tuple from the page, lock removed, count
> incremented for every iteration of the agg node on a count(*). Is the same
> true of a count(1)?
>

Sorry Luke - message 3 - I seem to be suffering from a very small
working memory buffer myself right now, I think it's after a day of
working with DB2 ... :-)

Anyway, as I read src/backend/parser/gram.y:6542 - count(*) is
transformed into count(1), so these two are identical.

Cheers (last time tonight, promise!)

Mark


From: Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>
To: Luke Lonergan <llonergan(at)greenplum(dot)com>
Cc: stange(at)rentec(dot)com, Greg Stark <gsstark(at)mit(dot)edu>, Dave Cramer <pg(at)fastcrypt(dot)com>, Joshua Marsh <icub3d(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-24 09:26:44
Message-ID: 43858754.4020300@paradise.net.nz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Luke Lonergan wrote:
> Mark,
>
>
>>>Time: 197870.105 ms
>>
>>So 198 seconds is the uncached read time with count (Just for clarity,
>>did you clear the Pg and filesystem caches or unmount / remount the
>>filesystem?)
>
>
> Nope - the longer time is due to the known "second write" issue with
> Postgres: it writes the data to the table, but all of the pages are left
> marked dirty, so on the first scan after loading they are written again.
> This is clear as you watch vmstat - the pattern on the first seq scan is
> half read / half write.
>

Ah - indeed - first access after a COPY, no? I should have thought of
that, sorry!


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>
Cc: Luke Lonergan <llonergan(at)greenplum(dot)com>, stange(at)rentec(dot)com, Greg Stark <gsstark(at)mit(dot)edu>, Dave Cramer <pg(at)fastcrypt(dot)com>, Joshua Marsh <icub3d(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-24 16:00:25
Message-ID: 87d5kqkpd2.fsf@stark.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Mark Kirkwood <markir(at)paradise(dot)net(dot)nz> writes:

> Yeah - it's pretty clear that the count aggregate is fairly expensive wrt cpu -
> However, I am not sure if all agg nodes suffer this way (guess we could try a
> trivial aggregate that does nothing for all tuples bar the last and just
> reports the final value it sees).

As you mention, count(*) and count(1) are the same thing.

Last I heard, the reason count(*) was so expensive was that its state
variable is a bigint. That means it doesn't fit in a Datum and has to be
alloced and stored as a pointer. And because of the Aggregate API, that means
it has to be allocated and freed for every tuple processed.
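
To illustrate - a standalone sketch, not the actual executor code:

/*
 * On a 32-bit platform a Datum is pointer-sized (4 bytes), so an int64
 * state can't be passed by value; it gets boxed, and every transition
 * copies it - the per-tuple alloc/free described above.
 */
#include <stdint.h>
#include <stdlib.h>

typedef uintptr_t Datum;        /* 4 bytes on a 32-bit machine */

static Datum
count_step(Datum state)
{
    int64_t    *oldval = (int64_t *) state;
    int64_t    *newval = malloc(sizeof(int64_t));   /* per-tuple palloc */

    *newval = *oldval + 1;
    free(oldval);                                   /* per-tuple pfree */
    return (Datum) newval;
}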

There was some talk of having a special case API for count(*) and maybe
sum(...) to avoid having to do this.

There was also some talk of making Datum 8 bytes wide on platforms where that
was natural (I guess AMD64, Sparc64, Alpha, Itanic).

AFAIK none of these items has happened, but I don't know for sure.

--
greg


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>, Luke Lonergan <llonergan(at)greenplum(dot)com>, stange(at)rentec(dot)com, Dave Cramer <pg(at)fastcrypt(dot)com>, Joshua Marsh <icub3d(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-24 16:25:28
Message-ID: 14286.1132849528@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Greg Stark <gsstark(at)mit(dot)edu> writes:
> Last I heard, the reason count(*) was so expensive was that its state
> variable is a bigint. That means it doesn't fit in a Datum and has to be
> alloced and stored as a pointer. And because of the Aggregate API, that means
> it has to be allocated and freed for every tuple processed.

There's a hack in 8.1 to avoid the palloc overhead (courtesy of Neil
Conway IIRC).

regards, tom lane


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>, Luke Lonergan <llonergan(at)greenplum(dot)com>, stange(at)rentec(dot)com, Dave Cramer <pg(at)fastcrypt(dot)com>, Joshua Marsh <icub3d(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-24 16:40:21
Message-ID: 877jayknii.fsf@stark.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

> Greg Stark <gsstark(at)mit(dot)edu> writes:
> > Last I heard, the reason count(*) was so expensive was that its state
> > variable is a bigint. That means it doesn't fit in a Datum and has to be
> > alloced and stored as a pointer. And because of the Aggregate API, that means
> > it has to be allocated and freed for every tuple processed.
>
> There's a hack in 8.1 to avoid the palloc overhead (courtesy of Neil
> Conway IIRC).

ah, cool, missed that.

--
greg


From: "Luke Lonergan" <llonergan(at)greenplum(dot)com>
To: "Luke Lonergan" <llonergan(at)greenplum(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-24 17:07:40
Message-ID: BFAB335C.14684%llonergan@greenplum.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

The same 12.9GB distributed across 4 machines using Bizgres MPP fits into
I/O cache. The interesting result is that the query "select count(1)" is
limited in speed to 280 MB/s per CPU when run on the lineitem table. So
when I run it spread over 4 machines, one CPU per machine, I get this:

======================================================
Bizgres MPP, 4 data segments, 1 per 2 CPUs
======================================================
llonergan=# explain select count(1) from lineitem;
                                 QUERY PLAN
----------------------------------------------------------------------------
 Aggregate  (cost=582452.00..582452.00 rows=1 width=0)
   ->  Gather Motion  (cost=582452.00..582452.00 rows=1 width=0)
         ->  Aggregate  (cost=582452.00..582452.00 rows=1 width=0)
               ->  Seq Scan on lineitem  (cost=0.00..544945.00 rows=15002800 width=0)
(4 rows)

llonergan=# \timing
Timing is on.
llonergan=# select count(1) from lineitem;
count
----------
59986052
(1 row)

Time: 12191.435 ms
llonergan=# select count(1) from lineitem;
count
----------
59986052
(1 row)

Time: 11986.109 ms
llonergan=# select count(1) from lineitem;
count
----------
59986052
(1 row)

Time: 11448.941 ms
======================================================

That's 12,937 MB in 11.45 seconds, or 1,130 MB/s. When you divide out the
number of Postgres instances (4), that's 283MB/s per Postgres instance.

To verify that this has nothing to do with MPP, I ran it in a special
internal mode on one instance and got the same result.

So - we should be able to double this rate by running one segment per CPU,
or two per host:

======================================================
Bizgres MPP, 8 data segments, 1 per CPU
======================================================
llonergan=# select count(1) from lineitem;
count
----------
59986052
(1 row)

Time: 6484.594 ms
llonergan=# select count(1) from lineitem;
count
----------
59986052
(1 row)

Time: 6156.729 ms
llonergan=# select count(1) from lineitem;
count
----------
59986052
(1 row)

Time: 6063.416 ms
======================================================
That's 12,937 MB in 6.06 seconds, or 2,134 MB/s. When you divide out the
number of Postgres instances (8), that's 267MB/s per Postgres instance.

So, if you want to "select count(1)", using more CPUs is a good idea! For
most complex queries, having lots of CPUs + MPP is a good combo.

Here is an example of a sorting plan - this should probably be done with a
hash aggregation, but using 8 CPUs makes it go 8x faster:

- Luke


From: Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Luke Lonergan <llonergan(at)greenplum(dot)com>, stange(at)rentec(dot)com, Dave Cramer <pg(at)fastcrypt(dot)com>, Joshua Marsh <icub3d(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-24 22:07:50
Message-ID: 438639B6.5090902@paradise.net.nz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Tom Lane wrote:
> Greg Stark <gsstark(at)mit(dot)edu> writes:
>
>>Last I heard, the reason count(*) was so expensive was that its state
>>variable is a bigint. That means it doesn't fit in a Datum and has to be
>>alloced and stored as a pointer. And because of the Aggregate API, that means
>>it has to be allocated and freed for every tuple processed.
>
>
> There's a hack in 8.1 to avoid the palloc overhead (courtesy of Neil
> Conway IIRC).
>

It certainly makes quite a difference as I measure it:

doing select count(1) from a 181000 page table (completely uncached) on my PIII:

8.0 : 32 s
8.1 : 25 s

Note that the 'fastcount()' function takes 21 s in both cases - so all
the improvement seems to be from the count overhead reduction.

Cheers

Mark