Re: Performance increase with elevator=deadline

Lists: pgsql-performance
From: "Albe Laurenz" <laurenz(dot)albe(at)wien(dot)gv(dot)at>
To: <pgsql-performance(at)postgresql(dot)org>
Subject: Performance increase with elevator=deadline
Date: 2008-04-11 11:22:04
Message-ID: D960CB61B694CF459DCFB4B0128514C201FA62C8@exadv11.host.magwien.gv.at

This refers to the performance problem reported in
http://archives.postgresql.org/pgsql-performance/2008-04/msg00052.php

After some time of trial and error we found that changing the I/O scheduling
algorithm to "deadline" improved I/O performance by a factor 4 (!) for
this specific load test.

It seems that the bottleneck in this case was actually in the Linux kernel.

Since performance statements are useless without a description of
the system and the type of load, I'll send a few details to make this
report more useful for the archives:

The machine is a PC with 8 AMD Opteron 885 CPUs and 32 GB RAM, attached to
a HP EVA 8100 storage system with 72 disks.

We are running 64-bit Linux 2.6.18-53.1.6.el5 from Red Hat Enterprise Linux 5.1.
The I/O queue depth is set to 64.

Our benchmark tools show a possible I/O performance of about 11000 transactions
per second for random scattered reads of 8k blocks.

PostgreSQL version is 8.2.4.

The database system is a cluster with 6 databases containing tables up
to a couple of GB in size. The whole database cluster takes about
200 GB of storage.

The database load is a set of read-only statements, several of which have
miserable execution plans and perform table and index scans.

With the default I/O scheduler we observe a performance of about
600 I/O transactions or 7 MB per second.

After booting with elevator=deadline both values increase by a factor
of up to 4 and the query response times drop drastically.

Yours,
Laurenz Albe


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Albe Laurenz" <laurenz(dot)albe(at)wien(dot)gv(dot)at>
Cc: <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Performance increase with elevator=deadline
Date: 2008-04-11 14:47:17
Message-ID: 878wzkh2iy.fsf@oxford.xeocode.com
Lists: pgsql-performance

"Albe Laurenz" <laurenz(dot)albe(at)wien(dot)gv(dot)at> writes:

> This refers to the performance problem reported in
> http://archives.postgresql.org/pgsql-performance/2008-04/msg00052.php
>
> After some time of trial and error we found that changing the I/O scheduling
> algorithm to "deadline" improved I/O performance by a factor 4 (!) for
> this specific load test.

What was the algorithm before?

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's On-Demand Production Tuning


From: Jeff <threshar(at)torgo(dot)978(dot)org>
To: Albe Laurenz <laurenz(dot)albe(at)wien(dot)gv(dot)at>
Cc: <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Performance increase with elevator=deadline
Date: 2008-04-11 15:09:07
Message-ID: C050E63A-2209-4E89-AA99-6A327CCE7380@torgo.978.org
Lists: pgsql-performance


On Apr 11, 2008, at 7:22 AM, Albe Laurenz wrote:
> After some time of trial and error we found that changing the I/O
> scheduling
> algorithm to "deadline" improved I/O performance by a factor 4 (!) for
> this specific load test.
>
I was inspired once again to look into this - as I'm recently hitting
some glass ceilings with my machines.

I have a little app I wrote called pgiosim (it's on pgfoundry - http://pgfoundry.org/projects/pgiosim)
that basically just opens some files, and does random seeks and 8kB
reads, much like what our beloved PG does.

Using 4 of these with a dataset of about 30GB across a few files
(Machine has 8GB mem) I went from around 100 io/sec to 330 changing to
noop. Quite an improvement. If you have a decent controller CFQ is
not what you want. I tried deadline as well and it was a touch
slower. The controller is a 3ware 9550sx with 4 disks in a raid10.
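The core of such a tester is tiny. A minimal sketch in Python of the same idea (random seeks plus 8 kB reads) might look like the following; this is a hypothetical illustration of the approach, not the actual pgiosim code, and a real test needs a data set much larger than RAM (ideally with O_DIRECT) so the kernel page cache cannot absorb the reads:

```python
import os
import random
import time

def random_read_test(path, duration_s=2.0, block=8192):
    """Issue random 8 kB reads against a file for duration_s seconds
    and return the achieved reads/second (a crude pgiosim-style probe).
    Note: without O_DIRECT and a dataset >> RAM, this mostly measures
    the page cache, not the disks."""
    blocks = os.path.getsize(path) // block
    reads = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        start = time.time()
        while time.time() - start < duration_s:
            # Seek to a random block boundary and read one 8 kB block,
            # roughly what a PostgreSQL index-driven heap fetch does.
            os.lseek(fd, random.randrange(blocks) * block, os.SEEK_SET)
            os.read(fd, block)
            reads += 1
    finally:
        os.close(fd)
    return reads / duration_s

if __name__ == "__main__":
    # Small scratch file just to demonstrate the mechanics.
    with open("scratch.dat", "wb") as f:
        f.write(os.urandom(8192 * 128))
    print("reads/sec: %.0f" % random_read_test("scratch.dat", 0.5))
    os.remove("scratch.dat")
```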

I'll be trying this out on the big array later today. I found it
surprising this info wasn't more widespread (the use of CFQ on a good
controller).

It also seems changing elevators on the fly works fine (echo
schedulername > /sys/block/.../queue/scheduler). I admit I sat there
flipping back and forth going "disk go fast.. disk go slow.. disk go
fast... " :)

--
Jeff Trout <jeff(at)jefftrout(dot)com>
http://www.stuarthamm.net/
http://www.dellsmartexitin.com/


From: Matthew <matthew(at)flymine(dot)org>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Performance increase with elevator=deadline
Date: 2008-04-11 16:40:02
Message-ID: Pine.LNX.4.64.0804111736580.20402@aragorn.flymine.org
Lists: pgsql-performance

On Fri, 11 Apr 2008, Jeff wrote:
> Using 4 of these with a dataset of about 30GB across a few files (Machine has
> 8GB mem) I went from around 100 io/sec to 330 changing to noop. Quite an
> improvement. If you have a decent controller CFQ is not what you want. I
> tried deadline as well and it was a touch slower. The controller is a 3ware
> 9550sx with 4 disks in a raid10.

I ran Greg's fadvise test program a while back on a 12-disc array. The
three schedulers (deadline, noop, anticipatory) all performed pretty-much
the same, with the fourth (cfq, the default) being consistently slower.

> It also seems changing elevators on the fly works fine (echo schedulername >
> /sys/block/.../queue/scheduler). I admit I sat there flipping back and forth
> going "disk go fast.. disk go slow.. disk go fast... " :)

Oh Homer Simpson, your legacy lives on.

Matthew

--
I suppose some of you have done a Continuous Maths course. Yes? Continuous
Maths? <menacing stares from audience> Whoah, it was like that, was it!
-- Computer Science Lecturer


From: Craig Ringer <craig(at)postnewspapers(dot)com(dot)au>
To: Matthew <matthew(at)flymine(dot)org>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Performance increase with elevator=deadline
Date: 2008-04-11 18:03:43
Message-ID: 47FFA7FF.10801@postnewspapers.com.au
Lists: pgsql-performance

Matthew wrote:
> On Fri, 11 Apr 2008, Jeff wrote:
>> Using 4 of these with a dataset of about 30GB across a few files
>> (Machine has 8GB mem) I went from around 100 io/sec to 330 changing to
>> noop. Quite an improvement. If you have a decent controller CFQ is
>> not what you want. I tried deadline as well and it was a touch
>> slower. The controller is a 3ware 9550sx with 4 disks in a raid10.
>
> I ran Greg's fadvise test program a while back on a 12-disc array. The
> three schedulers (deadline, noop, anticipatory) all performed
> pretty-much the same, with the fourth (cfq, the default) being
> consistently slower.

I use CFQ on some of my servers, despite the fact that it's often slower
in total throughput terms, because it delivers much more predictable I/O
latencies that help prevent light I/O processes being starved by heavy
I/O processes. In particular, a Linux terminal server used at work has
taken a lot of I/O tuning before it delivers even faintly acceptable I/O
latencies under any sort of load.

Bounded I/O latency at the expense of throughput is not what you usually
want on a DB server, where throughput is king, so I'm not at all
surprised that CFQ performs poorly for PostgreSQL. I've done no testing
on that myself, though, because with my DB size and the nature of my
queries most of them are CPU bound anyway.

Speaking of I/O performance with PostgreSQL, has anybody here done any
testing to compare results with LVM to results with the same filesystem
on a conventionally partitioned or raw volume? I'd probably use LVM even
at a performance cost because of its admin benefits, but I'd be curious
if there is any known cost for use with Pg. I've never been able to
measure one with other workloads.

--
Craig Ringer


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: Craig Ringer <craig(at)postnewspapers(dot)com(dot)au>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Performance increase with elevator=deadline
Date: 2008-04-11 19:56:36
Message-ID: Pine.GSO.4.64.0804111553410.21193@westnet.com
Lists: pgsql-performance

On Sat, 12 Apr 2008, Craig Ringer wrote:

> Speaking of I/O performance with PostgreSQL, has anybody here done any
> testing to compare results with LVM to results with the same filesystem on a
> conventionally partitioned or raw volume?

There was some chatter on this topic last year; a quick search finds

http://archives.postgresql.org/pgsql-performance/2007-06/msg00005.php

which is a fair statement of the situation. I don't recall any specific
benchmarks.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Florian Weimer <fweimer(at)bfk(dot)de>
To: Jeff <threshar(at)real(dot)jefftrout(dot)com>
Cc: Albe Laurenz <laurenz(dot)albe(at)wien(dot)gv(dot)at>, <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Performance increase with elevator=deadline
Date: 2008-04-15 13:27:20
Message-ID: 82od8bi6yv.fsf@mid.bfk.de
Lists: pgsql-performance

* Jeff:

> Using 4 of these with a dataset of about 30GB across a few files
> (Machine has 8GB mem) I went from around 100 io/sec to 330 changing to
> noop. Quite an improvement. If you have a decent controller CFQ is
> not what you want. I tried deadline as well and it was a touch
> slower. The controller is a 3ware 9550sx with 4 disks in a raid10.
>
> I'll be trying this out on the big array later today. I found it
> surprising this info wasn't more widespread (the use of CFQ on a good
> controller).

3ware might be a bit special because the controller has got very deep
queues on its own, so many assumptions of the kernel I/O schedulers do
not seem to apply. Toying with the kernel/controller queue depths
might help, but I haven't done real benchmarks to see if it's actually
a difference.

A few days ago, I experienced this: On a machine with a 3ware
controller, a simple getxattr call on a file in an uncontended
directory took several minutes because a PostgreSQL dump process was
running in the background (and some other activity of a legacy
database which caused frequent fdatasync calls). Clearly, this is
unacceptable, and I've since switched to the deadline scheduler, too.
So far, this particular behavior hasn't occurred again.

--
Florian Weimer <fweimer(at)bfk(dot)de>
BFK edv-consulting GmbH http://www.bfk.de/
Kriegsstraße 100 tel: +49-721-96201-1
D-76133 Karlsruhe fax: +49-721-96201-99


From: david(at)lang(dot)hm
To: Florian Weimer <fweimer(at)bfk(dot)de>
Cc: Jeff <threshar(at)threshar(dot)is-a-geek(dot)com>, Albe Laurenz <laurenz(dot)albe(at)wien(dot)gv(dot)at>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Performance increase with elevator=deadline
Date: 2008-04-15 19:08:31
Message-ID: alpine.DEB.1.10.0804151204120.14908@asgard
Lists: pgsql-performance

On Tue, 15 Apr 2008, Florian Weimer wrote:

> * Jeff:
>
>> Using 4 of these with a dataset of about 30GB across a few files
>> (Machine has 8GB mem) I went from around 100 io/sec to 330 changing to
>> noop. Quite an improvement. If you have a decent controller CFQ is
>> not what you want. I tried deadline as well and it was a touch
>> slower. The controller is a 3ware 9550sx with 4 disks in a raid10.
>>
>> I'll be trying this out on the big array later today. I found it
>> surprising this info wasn't more widespread (the use of CFQ on a good
>> controller).
>
> 3ware might be a bit special because the controller has got very deep
> queues on its own, so many assumptions of the kernel I/O schedulers do
> not seem to apply. Toying with the kernel/controller queue depths
> might help, but I haven't done real benchmarks to see if it's actually
> a difference.
>
> A few days ago, I experienced this: On a machine with a 3ware
> controller, a simple getxattr call on a file in an uncontended
> directory took several minutes because a PostgreSQL dump process was
> running in the background (and some other activity of a legacy
> database which caused frequent fdatasync calls). Clearly, this is
> unacceptable, and I've since switched to the deadline scheduler, too.
> So far, this particular behavior hasn't occurred again.

One other thing to watch out for: up until very recent kernels (2.6.23 or
2.6.24) it was possible for one very busy block device to starve other
block devices. They added isolation of queues for different block devices,
but I've seen reports that the isolation can end up throttling high
performance devices as a result. I haven't had time to really dig into
this, but there are tuning knobs available to adjust the queue space
available to different devices, and the reports are of significantly better
activity on a tuned system.

David Lang


From: "Albe Laurenz" <laurenz(dot)albe(at)wien(dot)gv(dot)at>
To: "Gregory Stark *EXTERN*" <stark(at)enterprisedb(dot)com>
Cc: <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Performance increase with elevator=deadline
Date: 2008-04-16 05:27:14
Message-ID: D960CB61B694CF459DCFB4B0128514C22D6E7A@exadv11.host.magwien.gv.at
Lists: pgsql-performance

Gregory Stark wrote:
>> After some time of trial and error we found that changing the I/O scheduling
>> algorithm to "deadline" improved I/O performance by a factor 4 (!) for
>> this specific load test.
>
> What was the algorithm before?

The default algorithm, CFQ I think it is.

Yours,
Laurenz Albe


From: Enrico Sirola <enrico(dot)sirola(at)gmail(dot)com>
To: Craig Ringer <craig(at)postnewspapers(dot)com(dot)au>
Cc: Matthew <matthew(at)flymine(dot)org>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Performance increase with elevator=deadline
Date: 2008-04-18 14:38:05
Message-ID: 544D07A2-EB3F-4D91-A749-0C4D5D38AB80@gmail.com
Lists: pgsql-performance

Hi,

Il giorno 11/apr/08, alle ore 20:03, Craig Ringer ha scritto:
>
> Speaking of I/O performance with PostgreSQL, has anybody here done
> any testing to compare results with LVM to results with the same
> filesystem on a conventionally partitioned or raw volume? I'd
> probably use LVM even at a performance cost because of its admin
> benefits, but I'd be curious if there is any known cost for use with
> Pg. I've never been able to measure one with other workloads.

I performed some tests some time ago. LVM is significantly slower.
The disk subsystem is a HP P400/512MB battery-backed controller with 4
disks in raid 10.
See the tests:

ext3 tests:

bonnie++ -s 16000 -u 0 -f -b
========================================================================

Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
             16000M           153637  50 78895  17           204124  17 700.6   1
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  2233  10 +++++ +++  2606   8  2255  10 +++++ +++  2584   7
16000M,,,153637,50,78895,17,,,204124,17,700.6,1,16,2233,10,+++++,+++,2606,8,2255,10,+++++,+++,2584,7

bonnie++ -s 16000 -u 0 -f
========================================================================

Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
             16000M           162223  51 77277  17           207055  17 765.3   1
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
16000M,,,162223,51,77277,17,,,207055,17,765.3,1,16,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++

========================================================================

LVM tests:

bonnie++ -u 0 -f -s 16000 -b
Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
             16000M           153768  52 53420  13           177414  15 699.8   1
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  2158   9 +++++ +++  2490   7  2177   9 +++++ +++  2460   7
16000M,,,153768,52,53420,13,,,177414,15,699.8,1,16,2158,9,+++++,+++,2490,7,2177,9,+++++,+++,2460,7

bonnie++ -u 0 -f -s 16000
Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
             16000M           161476  53 54904  13           171693  14 774.3   1
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
16000M,,,161476,53,54904,13,,,171693,14,774.3,1,16,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++

Bye,
e.


From: Craig James <craig_james(at)emolecules(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: RAID 10 Benchmark with different I/O schedulers (was: Performance increase with elevator=deadline)
Date: 2008-05-05 23:33:01
Message-ID: 481F992D.7070509@emolecules.com
Lists: pgsql-performance

I had the opportunity to do more testing on another new server to see whether the kernel's I/O scheduling makes any difference. Conclusion: On a battery-backed RAID 10 system, the kernel's I/O scheduling algorithm has no effect. This makes sense, since a battery-backed cache will supersede any I/O rescheduling that the kernel tries to do.

Hardware:
Dell 2950
8 CPU (Intel 2GHz Xeon)
8 GB memory
Dell Perc 6i with battery-backed cache
RAID 10 of 8x 146GB SAS 10K 2.5" disks

Software:
Linux 2.6.24, 64-bit
XFS file system
Postgres 8.3.0
max_connections = 1000
shared_buffers = 2000MB
work_mem = 256MB
max_fsm_pages = 1000000
max_fsm_relations = 5000
synchronous_commit = off
wal_buffers = 256kB
checkpoint_segments = 30
effective_cache_size = 4GB

Each test was run 5 times:
drop database test
create database test
pgbench -i -s 20 -U test
pgbench -c 10 -t 50000 -v -U test

The I/O scheduler was changed on-the-fly using (for example) "echo cfq >/sys/block/sda/queue/scheduler".

Autovacuum was turned off during the test.

Here are the results. The numbers are those reported as "tps = xxxx (including connections establishing)" (which were almost identical to the "excluding..." tps number).

I/O Sched AVG Test1 Test2 Test3 Test4 Test5
--------- ----- ----- ----- ----- ----- -----
cfq 3355 3646 3207 3132 3204 3584
noop 3163 2901 3190 3293 3124 3308
deadline 3547 3923 3722 3351 3484 3254
anticipatory 3384 3453 3916 2944 3451 3156

As you can see, the averages are very close -- closer than the "noise" between runs. As far as I can tell, there is no significant advantage, or even any significant difference, between the various I/O scheduler algorithms.
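A quick way to see this from the table above is to compare the per-scheduler averages against the run-to-run spread; a short sketch using the numbers as reported:

```python
from statistics import mean, stdev

# Test1..Test5 tps figures from the table above.
runs = {
    "cfq":          [3646, 3207, 3132, 3204, 3584],
    "noop":         [2901, 3190, 3293, 3124, 3308],
    "deadline":     [3923, 3722, 3351, 3484, 3254],
    "anticipatory": [3453, 3916, 2944, 3451, 3156],
}

averages = {name: mean(vals) for name, vals in runs.items()}
for name, vals in runs.items():
    # Within-scheduler stdev is the "noise" between runs.
    print("%-12s avg=%6.0f stdev=%5.0f" % (name, averages[name], stdev(vals)))

# Spread between scheduler averages vs. that noise:
between = max(averages.values()) - min(averages.values())
print("spread between averages: %.0f tps" % between)
```

The between-scheduler spread (roughly 380 tps) is comparable to the per-scheduler standard deviations, which is exactly the "closer than the noise" situation described.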

(It also reinforces what the pgbench man page says: Short runs aren't useful. Even these two-minute runs have a lot of variability. Before I turned off AutoVacuum, the variability was more like 50% between runs.)

Craig


From: "Scott Marlowe" <scott(dot)marlowe(at)gmail(dot)com>
To: "Craig James" <craig_james(at)emolecules(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: RAID 10 Benchmark with different I/O schedulers (was: Performance increase with elevator=deadline)
Date: 2008-05-06 03:15:44
Message-ID: dcc563d10805052015j352d71fdla2cfca22da899590@mail.gmail.com
Lists: pgsql-performance

On Mon, May 5, 2008 at 5:33 PM, Craig James <craig_james(at)emolecules(dot)com> wrote:
>
> (It also reinforces what the pgbench man page says: Short runs aren't
> useful. Even these two-minute runs have a lot of variability. Before I
> turned off AutoVacuum, the variability was more like 50% between runs.)

I'd suggest a couple things for more realistic tests. Run the tests
much longer, say 30 minutes to an hour. Crank up your scaling factor
until your test db is larger than memory. Turn on autovacuum, maybe
raising the cost / delay factors so it doesn't affect performance too
negatively. And lastly tuning the bgwriter so that checkpoints are
short and don't interfere too much.

My guess is if you let it run for a while, you'll get a much more
reliable number.


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: Craig James <craig_james(at)emolecules(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: RAID 10 Benchmark with different I/O schedulers (was: Performance increase with elevator=deadline)
Date: 2008-05-06 03:59:30
Message-ID: Pine.GSO.4.64.0805052354120.3752@westnet.com
Lists: pgsql-performance

On Mon, 5 May 2008, Craig James wrote:

> pgbench -i -s 20 -U test

That's way too low to expect you'll see a difference in I/O schedulers.
A scale of 20 is giving you a 320MB database, you can fit the whole thing
in RAM and almost all of it on your controller cache. What's there to
schedule? You're just moving between buffers that are generally large
enough to hold most of what they need.

> pgbench -c 10 -t 50000 -v -U test

This is OK, because when you increase the size you're not going to be
pushing 3500 TPS anymore and this test will take quite a while.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Jeff <threshar(at)torgo(dot)978(dot)org>
To: Craig James <craig_james(at)emolecules(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: RAID 10 Benchmark with different I/O schedulers (was: Performance increase with elevator=deadline)
Date: 2008-05-06 12:26:03
Message-ID: 11417349-DBA5-48AA-AE6B-0A376235D421@torgo.978.org
Lists: pgsql-performance


On May 5, 2008, at 7:33 PM, Craig James wrote:

> I had the opportunity to do more testing on another new server to
> see whether the kernel's I/O scheduling makes any difference.
> Conclusion: On a battery-backed RAID 10 system, the kernel's I/O
> scheduling algorithm has no effect. This makes sense, since a
> battery-backed cache will supersede any I/O rescheduling that the
> kernel tries to do.
>

This goes against my real-world experience here.

> pgbench -i -s 20 -U test
> pgbench -c 10 -t 50000 -v -U test
>

You should use a sample size of 2x RAM to get a more realistic number,
or try out my pgiosim tool on pgfoundry, which "sort of" simulates an
index scan. I posted numbers from that a month or two ago here.

--
Jeff Trout <jeff(at)jefftrout(dot)com>
http://www.stuarthamm.net/
http://www.dellsmartexitin.com/


From: Craig James <craig_james(at)emolecules(dot)com>
To: Greg Smith <gsmith(at)gregsmith(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: RAID 10 Benchmark with different I/O schedulers
Date: 2008-05-06 18:30:26
Message-ID: 4820A3C2.4000307@emolecules.com
Lists: pgsql-performance

Greg Smith wrote:
> On Mon, 5 May 2008, Craig James wrote:
>
>> pgbench -i -s 20 -U test
>
> That's way too low to expect you'll see a difference in I/O schedulers.
> A scale of 20 is giving you a 320MB database, you can fit the whole
> thing in RAM and almost all of it on your controller cache. What's
> there to schedule? You're just moving between buffers that are
> generally large enough to hold most of what they need.

Test repeated with:
autovacuum enabled
database destroyed and recreated between runs
pgbench -i -s 600 ...
pgbench -c 10 -t 50000 -n ...

I/O Sched AVG Test1 Test2
--------- ----- ----- -----
cfq 705 695 715
noop 758 769 747
deadline 741 705 775
anticipatory 494 477 511

I only did two runs of each, which took about 24 minutes. Like the first round of tests, the "noise" in the measurements (about 10%) exceeds the difference between scheduler-algorithm performance, except that "anticipatory" seems to be measurably slower.

So it still looks like cfq, noop and deadline are more or less equivalent when used with a battery-backed RAID.

Craig


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: Craig James <craig_james(at)emolecules(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: RAID 10 Benchmark with different I/O schedulers
Date: 2008-05-06 20:21:08
Message-ID: Pine.GSO.4.64.0805061605030.11474@westnet.com
Lists: pgsql-performance

On Tue, 6 May 2008, Craig James wrote:

> I only did two runs of each, which took about 24 minutes. Like the first
> round of tests, the "noise" in the measurements (about 10%) exceeds the
> difference between scheduler-algorithm performance, except that
> "anticipatory" seems to be measurably slower.

Those are much better results. Any test that says anticipatory is
anything other than useless for database system use with a good controller
I presume is broken, so that's how I know you're in the right ballpark now
but weren't before.

In order to actually get some useful data out of the noise that is
pgbench, you need a lot more measurements of longer runs. As perspective,
the last time I did something in this area, in order to get enough data to
get a clear picture I ran tests for 12 hours. I'm hoping to repeat that
soon with some more common hardware that gives useful results I can give
out.

> So it still looks like cfq, noop and deadline are more or less equivalent
> when used with a battery-backed RAID.

I think it's fair to say they're within 10% of one another on raw
throughput. The thing you're not measuring here is worst-case latency,
and that's where there might be a more interesting difference. Most tests
I've seen suggest deadline is the best in that regard, cfq the worst, and
where noop fits in depends on the underlying controller.

pgbench produces log files with latency measurements if you pass it "-l".
Here's a snippet of shell that runs pgbench then looks at the resulting
latency results for the worst 5 numbers:

pgbench ... -l &
p=$!
wait $p
mv pgbench_log.${p} pgbench.log
echo Worst latency results:
cat pgbench.log | cut -f 3 -d " " | sort -n | tail -n 5

However, that may not give you much useful info either--in most cases
checkpoint issues kind of swamp the worst-case behavior in PostgreSQL,
and to quantify I/O schedulers you need to look at more complicated
statistics on latency.
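One such statistic is latency percentiles computed from the same per-transaction log. A sketch (assuming the pgbench log format of that era, where the third whitespace-separated field is the latency in microseconds, as in the cut command above):

```python
import math

def latency_percentiles(lines, pcts=(50, 90, 99)):
    """Parse pgbench -l output lines (latency in microseconds is
    assumed to be the 3rd field) and return nearest-rank percentiles,
    in microseconds."""
    lat = sorted(int(line.split()[2]) for line in lines if line.strip())
    n = len(lat)
    # Nearest-rank percentile: the ceil(p/100 * n)-th smallest value.
    return {p: lat[max(0, math.ceil(p / 100.0 * n) - 1)] for p in pcts}

# Example with three synthetic log lines (hypothetical values):
sample = [
    "0 1 1200 0 1210000000 0",
    "0 2 3400 0 1210000001 0",
    "1 1 250000 0 1210000002 0",   # a checkpoint-induced outlier
]
print(latency_percentiles(sample))  # -> {50: 3400, 90: 250000, 99: 250000}
```

With only a handful of samples the high percentiles collapse onto the outlier, which is the point: the median can look fine while the tail tells the real scheduler story.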

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Craig James <craig_james(at)emolecules(dot)com>
To: Greg Smith <gsmith(at)gregsmith(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: RAID 10 Benchmark with different I/O schedulers
Date: 2008-05-06 20:43:54
Message-ID: 4820C30A.8010709@emolecules.com
Lists: pgsql-performance

Greg Smith wrote:
> On Tue, 6 May 2008, Craig James wrote:
>
>> I only did two runs of each, which took about 24 minutes. Like the
>> first round of tests, the "noise" in the measurements (about 10%)
>> exceeds the difference between scheduler-algorithm performance, except
>> that "anticipatory" seems to be measurably slower.
>
> Those are much better results. Any test that says anticipatory is
> anything other than useless for database system use with a good
> controller I presume is broken, so that's how I know you're in the right
> ballpark now but weren't before.
>
> In order to actually get some useful data out of the noise that is
> pgbench, you need a lot more measurements of longer runs. As
> perspective, the last time I did something in this area, in order to get
> enough data to get a clear picture I ran tests for 12 hours. I'm hoping
> to repeat that soon with some more common hardware that gives useful
> results I can give out.

This data is good enough for what I'm doing. There were reports from non-RAID users that the I/O scheduling could make as much as a 4x difference in performance (which makes sense for non-RAID), but these tests show me that three of the four I/O schedulers are within 10% of each other. Since this matches my intuition of how battery-backed RAID will work, I'm satisfied. If our servers get overloaded to the point where 10% matters, then I need a much more dramatic solution, like faster machines or more machines.

Craig


From: "Albe Laurenz *EXTERN*" <laurenz(dot)albe(at)wien(dot)gv(dot)at>
To: "Craig James *EXTERN*" <craig_james(at)emolecules(dot)com>, "Greg Smith" <gsmith(at)gregsmith(dot)com>
Cc: <pgsql-performance(at)postgresql(dot)org>
Subject: Re: RAID 10 Benchmark with different I/O schedulers
Date: 2008-05-07 07:29:05
Message-ID: D960CB61B694CF459DCFB4B0128514C202122018@exadv11.host.magwien.gv.at
Lists: pgsql-performance

Craig James wrote:
> This data is good enough for what I'm doing. There were
> reports from non-RAID users that the I/O scheduling could
> make as much as a 4x difference in performance (which makes
> sense for non-RAID), but these tests show me that three of
> the four I/O schedulers are within 10% of each other. Since
> this matches my intuition of how battery-backed RAID will
> work, I'm satisfied. If our servers get overloaded to the
> point where 10% matters, then I need a much more dramatic
> solution, like faster machines or more machines.

I should comment on this as I am the one who reported the
big performance increase with the deadline scheduler.
I was very surprised at this increase myself as I had not seen
any similar reports, so I thought I should share it for whatever
it is worth.

Our SAN *is* a RAID-5 with lots of cache, so there must be a flaw
in your intuition.

Performance measures depend a lot on your hardware and
software setup (e.g. kernel version in this case) and on the
specific load. The load we used was a real-life load, collected
over several hours and extracted from the log files.

My opinion is that performance observations can rarely be
generalized - I am not surprised that with a different system
and a different load you observe hardly any difference between
"cfq" and "deadline".

For the record, in our test case "noop" performed practically
as well as "deadline", while the other two did way worse.

Like yourself, I have wondered why different I/O scheduling
algorithms should make so much difference.
Here is my home-spun theory of what may happen; tear it apart
and replace it with a better one at your convenience:

Our SAN probably (we're investigating) has its own brains to
optimize I/O, and I guess that any optimization that the kernel
does can only deteriorate performance because the two algorithms
might "step on each other's toes". This is backed by "noop"
performing well.
I believe that caching will not make much difference, because the
cache is way smaller than the database, and whatever is neither in
the shared buffer nor in the kernel filesystem cache is also not
likely to be in the storage system's cache. Remember that our load
was read-only.

Yours,
Laurenz Albe


From: Matthew Wakeling <matthew(at)flymine(dot)org>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: RAID 10 Benchmark with different I/O schedulers
Date: 2008-05-07 10:46:27
Message-ID: Pine.LNX.4.64.0805071142320.16756@aragorn.flymine.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

On Tue, 6 May 2008, Craig James wrote:
> I/O Sched     AVG   Test1 Test2
> ------------  ----- ----- -----
> cfq           705   695   715
> noop          758   769   747
> deadline      741   705   775
> anticipatory  494   477   511

Interesting. That contrasts with some tests I did a while back on a
16-disc RAID-0, where noop, deadline, and anticipatory were all identical
in performance, with cfq being significantly slower. Admittedly, the disc
test was single-process, which is probably why the anticipatory behaviour
didn't kick in. You are seeing a little bit of degradation with cfq - I
guess it's worse the bigger the disc subsystem you have.

Matthew

--
Matthew: That's one of things about Cambridge - all the roads keep changing
names as you walk along them, like Hills Road in particular.
Sagar: Yes, Sidney Street is a bit like that too.
Matthew: Sidney Street *is* Hills Road.


From: Craig James <craig_james(at)emolecules(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Typecast bug?
Date: 2008-06-26 04:17:50
Message-ID: 4863186E.1000306@emolecules.com
Lists: pgsql-performance

This seems like a bug to me, but it shows up as a performance problem. Since the column being queried is an integer, the second query (see below) can't possibly match, yet Postgres uses a typecast, forcing a full table scan for a value that can't possibly be in the table.

The application could intercept these bogus queries, but that requires building schema-specific and postgres-specific knowledge into the application (i.e. "What is the maximum legal integer for this column?").

Craig

explain analyze select version_id, parent_id from version where version_id = 99999;
                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Index Scan using version_pkey on version  (cost=0.00..9.89 rows=1 width=8) (actual time=0.054..0.054 rows=0 loops=1)
   Index Cond: (version_id = 99999)
 Total runtime: 0.130 ms
(3 rows)

emol_warehouse_1=> explain analyze select version_id, parent_id from version where version_id = 999999999999999999999999999;
                                                   QUERY PLAN
----------------------------------------------------------------------------------------------------------------
 Seq Scan on version  (cost=0.00..253431.77 rows=48393 width=8) (actual time=3135.530..3135.530 rows=0 loops=1)
   Filter: ((version_id)::numeric = 999999999999999999999999999::numeric)
 Total runtime: 3135.557 ms
(3 rows)

\d version
     Table "emol_warehouse_1.version"
   Column   |  Type   | Modifiers
------------+---------+-----------
 version_id | integer | not null
 parent_id  | integer | not null
 ... more columns
Indexes:
    "version_pkey" PRIMARY KEY, btree (version_id)


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Craig James <craig_james(at)emolecules(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Typecast bug?
Date: 2008-06-26 05:33:30
Message-ID: 5564.1214458410@sss.pgh.pa.us
Lists: pgsql-performance

Craig James <craig_james(at)emolecules(dot)com> writes:
> This seems like a bug to me, but it shows up as a performance problem.

> emol_warehouse_1=> explain analyze select version_id, parent_id from version where version_id = 999999999999999999999999999;

If you actually *need* so many 9's here as to force it out of the range
of bigint, then why is your id column not declared numeric?

This seems to me to be about on par with complaining that "intcol = 4.2e1"
won't be indexed. We have a numeric data type hierarchy, learn to
work with it ...

regards, tom lane


From: Craig James <craig_james(at)emolecules(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Typecast bug?
Date: 2008-06-26 06:22:20
Message-ID: 4863359C.1040906@emolecules.com
Lists: pgsql-performance

Tom Lane wrote:
> Craig James <craig_james(at)emolecules(dot)com> writes:
>> This seems like a bug to me, but it shows up as a performance problem.
>
>> emol_warehouse_1=> explain analyze select version_id, parent_id from version where version_id = 999999999999999999999999999;
>
> If you actually *need* so many 9's here as to force it out of the range
> of bigint, then why is your id column not declared numeric?
>
> This seems to me to be about on par with complaining that "intcol = 4.2e1"
> won't be indexed. We have a numeric data type hierarchy, learn to
> work with it ...

Your suggestion to "learn to work with it" doesn't fly. A good design separates the database schema details from the application to the greatest extent possible. What you're suggesting is that every application that queries a Postgres database must know the exact range of every numeric data type of every indexed column in the schema, simply because Postgres can't recognize an out-of-range numeric value.

In this case, the optimizer could have instantly returned zero results with no further work, since the query was out of range for that column.

This seems like a pretty simple optimization to me, and it seems like a helpful suggestion to make to this forum.

BTW, this query came from throwing lots of junk at a web app in an effort to uncover exactly this sort of problem. It's not a real query, but then, hackers don't use real queries. The app checks that its input is a well-formed integer expression, but then assumes Postgres can deal with it from there.
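For the archives, the application-side guard discussed above can be
sketched in a few lines. The function name is hypothetical; only the
int4 range (-2147483648 to 2147483647) is a Postgres fact:

```python
# Hypothetical guard: reject values outside PostgreSQL's integer (int4)
# range before the query ever reaches the server, so the planner never
# has to compare an int column against a numeric literal.
INT4_MIN, INT4_MAX = -2**31, 2**31 - 1

def version_lookup_params(raw: str):
    """Return a parameter tuple for the query, or None if the value
    cannot possibly match an int4 column (caller returns zero rows)."""
    value = int(raw)  # assumes raw was already validated as an integer string
    if not (INT4_MIN <= value <= INT4_MAX):
        return None
    return (value,)

print(version_lookup_params("99999"))                        # prints "(99999,)"
print(version_lookup_params("999999999999999999999999999"))  # prints "None"
```

This is exactly the schema-specific knowledge Craig objects to having to
embed in the application, but it does avoid the sequential scan.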

Craig


From: "Frank Joerdens" <frank(at)joerdens(dot)de>
To: "Craig James" <craig_james(at)emolecules(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Typecast bug?
Date: 2008-06-26 07:02:13
Message-ID: 7d10d2df0806260002w8707ea1rf8f62ca8126099a8@mail.gmail.com
Lists: pgsql-performance

On 6/26/08, Craig James <craig_james(at)emolecules(dot)com> wrote:
> This seems like a bug to me, but it shows up as a performance problem.
> Since the column being queried is an integer, the second query (see below)
> can't possibly match, yet Postgres uses a typecast, forcing a full table
> scan for a value that can't possibly be in the table.

Which version are you using? 8.3 removes a lot of implicit casts (all?
not sure), so this may already be your fix.

Cheers,

Frank


From: "Guillaume Smet" <guillaume(dot)smet(at)gmail(dot)com>
To: "Frank Joerdens" <frank(at)joerdens(dot)de>
Cc: "Craig James" <craig_james(at)emolecules(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Typecast bug?
Date: 2008-06-26 07:12:56
Message-ID: 1d4e0c10806260012pef3b340y2ec5e5855d18974c@mail.gmail.com
Lists: pgsql-performance

On Thu, Jun 26, 2008 at 9:02 AM, Frank Joerdens <frank(at)joerdens(dot)de> wrote:
> Which version are you using? 8.3 removes a lot of implicit casts (all?
> not sure), so this may already be your fix.

8.3 only removed implicit casts from non text types to text (date ->
text, int -> text, interval -> text...) to avoid unexpected
behaviours.

--
Guillaume