Re: Postgres, fsync and RAID controller with 100M of internal cache & dedicated battery

Lists: pgsql-general
From: "Dmitry Koterov" <dmitry(at)koterov(dot)ru>
To: "Postgres General" <pgsql-general(at)postgresql(dot)org>
Subject: Postgres, fsync and RAID controller with 100M of internal cache & dedicated battery
Date: 2007-08-22 15:28:39
Message-ID: d7df81620708220828y123c03n48ca23e3457f5d2@mail.gmail.com

Hello.

We are trying to use an HP CISS controller (Smart Array E200i) with internal
cache memory (100M for write caching, built-in battery) together with
Postgres. Typically, under heavy load, the checkpoint fsync in Postgres runs
very slowly:

checkpoint buffers dirty=16.8 MB (3.3%) write=24.3 ms sync=6243.3 ms

(If we turn fsync off, fsync=0, the speed increases greatly.) Unfortunately,
the slow fsync hurts the throughput of the whole database during the
checkpoint.
Here is the timing (in milliseconds) of a test transaction called multiple
times concurrently (6 threads) with fsync turned ON:

40.4
44.4
37.4
44.0
42.7
41.8
218.1
254.2
101.0
42.2
42.4
41.0
39.5

(note the significant slowdown during the checkpoint).
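
A minimal sketch, in Python, of how the checkpoint stalls in the timings
above can be separated from the baseline. The samples are copied from the
list; the threshold of twice the median and the name find_stalls are
arbitrary assumptions, not something from the original test.

```python
from statistics import median

# Transaction timings in milliseconds, copied from the list above.
timings_ms = [40.4, 44.4, 37.4, 44.0, 42.7, 41.8, 218.1, 254.2, 101.0,
              42.2, 42.4, 41.0, 39.5]


def find_stalls(samples, factor=2.0):
    """Return (baseline, stalls): the median and the samples above factor * median.

    The factor is an illustrative assumption; pick whatever threshold suits
    the workload being measured.
    """
    baseline = median(samples)
    stalls = [s for s in samples if s > factor * baseline]
    return baseline, stalls


baseline, stalls = find_stalls(timings_ms)
print(f"median {baseline:.1f} ms, stalled samples: {stalls}")
```

With these numbers the three slow transactions stand out clearly from a
baseline of roughly 40 ms.
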
Here is the dstat disk write activity log for that test:

0
0
284k
0
0
84k
0
0
276k
37M
208k
0
0
0
0
156k
0
0
0
0

I have written a small Perl script to check how slow fsync is on the Smart
Array E200i controller. Theoretically, because of the write cache, fsync MUST
cost nothing, but in practice this is not true:

# cd /mnt/c0d1p1/
# perl -e 'use Time::HiRes qw(gettimeofday tv_interval); system "sync"; open
F, ">bulk"; print F ("a" x (1024 * 1024 * 20)); close F; $t0=[gettimeofday];
system "sync"; print ">>> fsync took " . tv_interval ( $t0, [gettimeofday])
. " s\n"; unlink "bulk"'
>>> fsync took 0.247033 s

You see, the 20M block (the script writes 1024 * 1024 * 20 bytes) took
0.25 s to sync.
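
Note that the one-liner above times the system-wide "sync" command
(including the cost of spawning it), not an fsync() call on the test file
itself. A minimal sketch of the narrower measurement in Python follows. The
function name time_fsync and the 20 MiB default are illustrative
assumptions that mirror the script above; they are not part of it.

```python
import os
import tempfile
import time


def time_fsync(directory, size_bytes=20 * 1024 * 1024):
    """Write size_bytes to a scratch file in directory and time only os.fsync()."""
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        view = memoryview(b"a" * size_bytes)
        written = 0
        # os.write may write fewer bytes than requested, so loop until done.
        while written < size_bytes:
            written += os.write(fd, view[written:])
        start = time.perf_counter()
        os.fsync(fd)  # flush this one file, not every dirty page in the OS
        return time.perf_counter() - start
    finally:
        os.close(fd)
        os.unlink(path)


if __name__ == "__main__":
    # Point this at the filesystem under test; the temp directory is only
    # a placeholder so the sketch runs anywhere.
    print(f">>> fsync took {time_fsync(tempfile.gettempdir()):.6f} s")
```

To test the array, pass the mount point of the volume under test as the
directory argument.
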

The question is: how can we solve this problem and make fsync run with no
delay? It seems to me that the controller's internal write cache is not being
used (strange, because all configuration options look fine), but how can I
check that? Or maybe there is some other side effect?


From: "Scott Marlowe" <scott(dot)marlowe(at)gmail(dot)com>
To: dmitry(at)koterov(dot)ru
Cc: "Postgres General" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Postgres, fsync and RAID controller with 100M of internal cache & dedicated battery
Date: 2007-08-22 15:47:25
Message-ID: dcc563d10708220847ydc2e90dk8d7e98462fc99c22@mail.gmail.com

On 8/22/07, Dmitry Koterov <dmitry(at)koterov(dot)ru> wrote:
> Hello.
> You see, 20M block was fsynced for 0.25 s.
>
> The question is: how to solve this problem and make fsync run with no delay.
> Seems to me that controller's internal write cache is not used (strange,
> because all configuration options are fine), but how to check it? Or, maybe,
> there is another side-effect?

I would suggest that either the controller is NOT configured fine, OR
there's some bug in how the OS is interacting with it.

What options are there for this RAID controller, and what are they set
to? Specifically, the writeback / writethru type options for the
cache. It may be that if it doesn't properly detect a battery backup
module, it refuses to go into writeback mode.


From: "Dmitry Koterov" <dmitry(at)koterov(dot)ru>
To: "Postgres General" <pgsql-general(at)postgresql(dot)org>
Subject: Postgres, fsync and RAID controller with 100M of internal cache & dedicated battery
Date: 2007-08-22 17:29:13
Message-ID: d7df81620708221029q4ceeef61u3c0cc4fdadd1c13b@mail.gmail.com

And here are the results of the built-in Postgres test script:

Simple write timing:
write 0.006355

Compare fsync times on write() and non-write() descriptor:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
write, fsync, close 0.233793
write, close, fsync 0.227444

Compare one o_sync write to two:
one 16k o_sync write 0.297093
two 8k o_sync writes 0.402803

Compare file sync methods with one 8k write:

(o_dsync unavailable)
write, fdatasync 0.228725
write, fsync, 0.223302

Compare file sync methods with 2 8k writes:
(o_dsync unavailable)
open o_sync, write 0.414954
write, fdatasync 0.335280
write, fsync, 0.327195

(Also, I tried to manually specify the open_sync method in postgresql.conf,
but after that the Postgres database crashed completely. :-)
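
For readers who cannot build test_fsync, the last comparison above can be
roughly approximated in Python. This is a sketch under assumptions, not a
port of test_fsync.c: the loop count and the helper names are made up
here, and the absolute numbers are not comparable with the results above.

```python
import os
import tempfile
import time

BLOCK = b"x" * 8192  # one 8k block, matching the write size quoted above


def _run(path, flags, sync, loops):
    """Time `loops` rounds of two 8k writes, each round followed by sync(fd) if given."""
    fd = os.open(path, os.O_RDWR | flags)
    try:
        start = time.perf_counter()
        for _ in range(loops):
            os.lseek(fd, 0, os.SEEK_SET)
            os.write(fd, BLOCK)
            os.write(fd, BLOCK)
            if sync is not None:
                sync(fd)
        return time.perf_counter() - start
    finally:
        os.close(fd)


def compare_sync_methods(directory, loops=100):
    """Return {method name: elapsed seconds} for the methods this platform offers."""
    fd, path = tempfile.mkstemp(dir=directory)
    os.close(fd)
    try:
        results = {"write, fsync": _run(path, 0, os.fsync, loops)}
        if hasattr(os, "fdatasync"):  # not available on every platform
            results["write, fdatasync"] = _run(path, 0, os.fdatasync, loops)
        if hasattr(os, "O_SYNC"):
            results["open o_sync, write"] = _run(path, os.O_SYNC, None, loops)
        return results
    finally:
        os.unlink(path)


if __name__ == "__main__":
    for name, seconds in compare_sync_methods(tempfile.gettempdir()).items():
        print(f"{name:<20} {seconds:.6f}")
```

As with the Perl script, the directory passed in decides which device is
actually being measured.
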

On 8/22/07, Dmitry Koterov <dmitry(at)koterov(dot)ru > wrote:
>
> All settings seem to be fine. Mode is writeback.
>
> We temporarily (for tests only, on a test machine!!!) put pg_xlog on a RAM
> drive (to completely exclude xlog fsync from the statistics), but the
> slowdown and the 5-10 second fsync during the checkpoint are still
> there.
>
> Here are some statistical data from the controller. Other report data is
> attached to the mail.
>
> ACCELERATOR STATUS:
> Logical Drive Disable Map: 0x00000000
> Read Cache Size: 24 MBytes
> Posted Write Size: 72 MBytes
> Disable Flag: 0x00
> Status: 0x00000001
> Disable Code: 0x0000
> Total Memory Size: 128 MBytes
> Battery Count: 1
> Battery Status: 0x0001
> Parity Read Errors: 0000
> Parity Write Errors: 0000
> Error Log: N/A
> Failed Batteries: 0x0000
> Board Present: Yes
> Accelerator Failure Map: 0x00000000
> Max Error Log Entries: 12
> NVRAM Load Status: 0x00
> Memory Size Shift Factor: 0x0a
> Non Battery Backed Memory: 0 MBytes
> Memory State: 0x00


From: "Phoenix Kiula" <phoenix(dot)kiula(at)gmail(dot)com>
To: dmitry(at)koterov(dot)ru
Cc: "Postgres General" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Postgres, fsync and RAID controller with 100M of internal cache & dedicated battery
Date: 2007-08-22 17:39:54
Message-ID: e373d31e0708221039l2f6f1030j6bbef5d9c7e3e3f8@mail.gmail.com

Hi,

On 23/08/07, Dmitry Koterov <dmitry(at)koterov(dot)ru> wrote:
> And here are results of built-in Postgres test script:
>

Can you tell me how I can execute this script on my system? Where is
this script?

Thanks!


From: "Dmitry Koterov" <dmitry(at)koterov(dot)ru>
To: "Phoenix Kiula" <phoenix(dot)kiula(at)gmail(dot)com>
Cc: "Postgres General" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Postgres, fsync and RAID controller with 100M of internal cache & dedicated battery
Date: 2007-08-22 17:50:56
Message-ID: d7df81620708221050m5a2035dbg80a52934e411e70f@mail.gmail.com

This script is here:
postgresql-8.2.3\src\tools\fsync\test_fsync.c



From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: Dmitry Koterov <dmitry(at)koterov(dot)ru>
Cc: Postgres General <pgsql-general(at)postgresql(dot)org>
Subject: Re: Postgres, fsync and RAID controller with 100M of internal cache & dedicated battery
Date: 2007-08-22 20:18:26
Message-ID: Pine.GSO.4.64.0708221613130.26829@westnet.com

On Wed, 22 Aug 2007, Dmitry Koterov wrote:

> I have written a small perl script to check how slow is fsync for Smart
> Array E200i controller. Theoretically, because of write cache, fsync MUST
> cost nothing, but in practice it is not true

That theory is fundamentally flawed; you don't know what else is in the
operating system write cache in front of what you're trying to fsync, and
you also don't know exactly what's in the controller's cache when you
start. For all you know, the controller might be filled with cached reads
and refuse to kick all of them out. This is a complicated area where
tests are much more useful than trying to predict the behavior.

You haven't mentioned any details yet about the operating system you're
running on; Solaris? Guessing from the device name. There have been some
comments passing by lately about the write caching behavior not being
turned on by default in that operating system.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: "Dmitry Koterov" <dmitry(at)koterov(dot)ru>
To: "Greg Smith" <gsmith(at)gregsmith(dot)com>
Cc: "Postgres General" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Postgres, fsync and RAID controller with 100M of internal cache & dedicated battery
Date: 2007-08-22 22:43:24
Message-ID: d7df81620708221543r5e85803dob4403a5cd0cb3eaa@mail.gmail.com

> > I have written a small perl script to check how slow is fsync for Smart
> > Array E200i controller. Theoretically, because of write cache, fsync MUST
> > cost nothing, but in practice it is not true
>
> That theory is fundamentally flawed; you don't know what else is in the
> operating system write cache in front of what you're trying to fsync, and
> you also don't know exactly what's in the controller's cache when you
> start. For all you know, the controller might be filled with cached reads
> and refuse to kick all of them out. This is a complicated area where
> tests are much more useful than trying to predict the behavior.

Nobody else is writing and nobody is reading. The machine is used only for
tests and is otherwise idle. I monitored dstat: there was no disk activity
for 5 minutes before the test, so I assume the controller cache had already
been flushed before I ran it.

> You haven't mentioned any details yet about the operating system you're
> running on; Solaris? Guessing from the device name. There have been some
> comments passing by lately about the write caching behavior not being
> turned on by default in that operating system.

Linux CentOS x86_64. A lot of memory, 8 processors.
Filesystem is ext2 (to reduce the journalling side-effects).
OS write caching was tested turned on, turned off, and set to flush once per
second; none of these settings made any difference.

The question is: MUST my test script report a zero fsync time or not, given
that the controller has a large built-in write cache? If yes, something is
wrong with the controller or the drivers (how can I diagnose that?). If no,
why not?

There are a lot of discussions on this mailing list about fsync and
battery-backed controllers; people say that a controller with built-in cache
memory reduces the cost of fsync to zero. I just want to achieve this.


From: "Dmitry Koterov" <dmitry(at)koterov(dot)ru>
To: "Greg Smith" <gsmith(at)gregsmith(dot)com>
Cc: "Postgres General" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Postgres, fsync and RAID controller with 100M of internal cache & dedicated battery
Date: 2007-08-22 22:45:23
Message-ID: d7df81620708221545j3b6d2e39wc3b26c5da53002e5@mail.gmail.com

Also, the controller is configured to use 75% of its memory for write
caching and 25% - for read caching. So reads cannot flood writes.



From: Ron Johnson <ron(dot)l(dot)johnson(at)cox(dot)net>
To: Postgres General <pgsql-general(at)postgresql(dot)org>
Subject: Re: Postgres, fsync and RAID controller with 100M of internal cache & dedicated battery
Date: 2007-08-22 23:14:11
Message-ID: 46CCC343.7000308@cox.net

On 08/22/07 17:45, Dmitry Koterov wrote:
> Also, the controller is configured to use 75% of its memory for write
> caching and 25% - for read caching. So reads cannot flood writes.

That seems to be a very extreme ratio. Most databases do *many*
times more reads than writes.

--
Ron Johnson, Jr.
Jefferson LA USA

Give a man a fish, and he eats for a day.
Hit him with a fish, and he goes away for good!



From: "Scott Marlowe" <scott(dot)marlowe(at)gmail(dot)com>
To: dmitry(at)koterov(dot)ru
Cc: "Greg Smith" <gsmith(at)gregsmith(dot)com>, "Postgres General" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Postgres, fsync and RAID controller with 100M of internal cache & dedicated battery
Date: 2007-08-22 23:49:04
Message-ID: dcc563d10708221649m469c72d8sb809e89132d66bab@mail.gmail.com

On 8/22/07, Dmitry Koterov <dmitry(at)koterov(dot)ru> wrote:
> Also, the controller is configured to use 75% of its memory for write
> caching and 25% - for read caching. So reads cannot flood writes.

128 Meg is a pretty small cache for a modern RAID controller. I
wonder if this one is just a dog performer.

Have you looked at things like the Areca or Escalade with 1g or more
cache on them?


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To:
Cc: Postgres General <pgsql-general(at)postgresql(dot)org>
Subject: Re: Postgres, fsync and RAID controller with 100M of internal cache & dedicated battery
Date: 2007-08-23 03:16:10
Message-ID: Pine.GSO.4.64.0708222313270.13185@westnet.com

On Wed, 22 Aug 2007, Ron Johnson wrote:

> That seems to be a very extreme ratio. Most databases do *many*
> times more reads than writes.

Yes, but the OS has a lot more memory to cache the reads for you, so you
should be relying more heavily on it in cases like this where the card has
a relatively small amount of memory. The main benefit of having a
caching controller is fsync acceleration; the reads should pass right
through the controller's cache and then stay in system RAM afterwards if
they're needed again.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To:
Cc: Postgres General <pgsql-general(at)postgresql(dot)org>
Subject: Re: Postgres, fsync and RAID controller with 100M of internal cache & dedicated battery
Date: 2007-08-23 03:56:55
Message-ID: Pine.GSO.4.64.0708222328350.14539@westnet.com

On Wed, 22 Aug 2007, Dmitry Koterov wrote:

> We are trying to use HP CISS controller (Smart Array E200i)

There have been multiple reports of general performance problems with the
cciss Linux driver for other HP cards. The E200i isn't from the same
series, but I wouldn't expect that their drivers have gotten much better.
Wander through the thread at
http://svr5.postgresql.org/pgsql-performance/2006-07/msg00257.php to see
one example I recall from last year; there are more in the archives if you
search around a bit.

> I have written a small perl script to check how slow is fsync for Smart
> Array E200i controller. Theoretically, because of write cache, fsync MUST
> cost nothing, but in practice it is not true:
>>>> fsync took 0.247033 s

For comparison's sake, your script run against my system with an Areca
ARC-1210 card with 256MB of cache 20 times gives me the following minimum
and maximum times (full details on my server config are at
http://www.westnet.com/~gsmith/content/postgresql/serverinfo.htm ):

>>> fsync took 0.039676 s
>>> fsync took 0.041137 s

And here's what the last set of test_fsync results look like on my system:

Compare file sync methods with 2 8k writes:
open o_sync, write 0.099819
write, fdatasync 0.100054
write, fsync, 0.094009

So basically your card is running 3 (test_fsync) to 6 (your script) times
slower than my Areca unit on these low-level tests. I don't know that
it's possible to drive the fsync times completely to zero, but there's
certainly a whole lot of improvement from where you are to what I'd expect
from even a cheap caching controller like I'm using. I've got maybe $900
worth of hardware total in this box and it's way faster than yours in this
area.
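
The 3x and 6x figures can be checked from the numbers quoted in this
thread. A small Python sketch of that arithmetic; every value is copied
from the messages above, and the variable names are illustrative only.

```python
# test_fsync, "2 8k writes" section, in seconds (E200i vs. Areca ARC-1210).
e200i = {
    "open o_sync, write": 0.414954,
    "write, fdatasync": 0.335280,
    "write, fsync": 0.327195,
}
areca = {
    "open o_sync, write": 0.099819,
    "write, fdatasync": 0.100054,
    "write, fsync": 0.094009,
}

# How many times slower the E200i is for each sync method.
test_fsync_ratios = {name: e200i[name] / areca[name] for name in e200i}

# Perl script timings in seconds: one E200i run vs. the Areca min and max.
script_e200i = 0.247033
script_areca = (0.039676, 0.041137)
script_ratios = [script_e200i / t for t in script_areca]

for name, ratio in test_fsync_ratios.items():
    print(f"{name:<20} {ratio:.2f}x slower")
print("perl script: " + ", ".join(f"{r:.2f}x" for r in script_ratios))
```

The test_fsync ratios land between roughly 3 and 4, and the script ratios
at about 6, which matches the rounded figures given above.
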

> (Also, I tried to manually specify open_sync method in postgresql.conf,
> but after that Postgres database had completely crashed. :-)

This is itself a sign there's something really strange going on. There's
something wrong with your system, your card, or the OS/driver you're using
if open_sync doesn't work under Linux; in fact, it should be faster in
practice even if it looks a little slower on test_fsync.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Lincoln Yeoh <lyeoh(at)pop(dot)jaring(dot)my>
To: dmitry(at)koterov(dot)ru, "Postgres General" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Postgres, fsync and RAID controller with 100M of internal cache & dedicated battery
Date: 2007-08-23 17:29:03
Message-ID: 200708231732.l7NHWhtl063773@smtp2.jaring.my

At 11:28 PM 8/22/2007, Dmitry Koterov wrote:
>Hello.
>
>We are trying to use HP CISS controller (Smart Array E200i) with
>internal cache memory (100M for write caching, built-in power
>battery) together with Postgres. Typically under a heavy load
>Postgres runs checkpoint fsync very slow:
>
>checkpoint buffers dirty=16.8 MB (3.3%) write=24.3 ms sync=6243.3 ms
>
>(If we turn off fsync, the speed increases greatly, fsync=0.) And
>unfortunately it affects all the database productivity during the checkpoint.
>Here is the timing (in milliseconds) of a test transaction called
>multiple times concurrently (6 threads) with fsync turned ON:

Your controller is probably not doing the write caching thingy, or the
write caching is still slow (I've seen RAID controllers that are slower
than software RAID).

Have you actually configured your controller to do the write caching?
I wouldn't be surprised if it's in a conservative setting, which means
"write-through" rather than "write-back", even if there's a battery.

BTW, what happens if someone replaced a faulty battery backed
controller card on a "live" system with one from a "don't care test
system" (identical hardware tho) that was powered down abruptly
because people didn't care? Would the new card proceed to trash the
"live" system?

Probably not that important, but what are your mount options for the
partition? Is the partition mounted noatime (or similar)?

Regards,
Link.


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: Postgres General <pgsql-general(at)postgresql(dot)org>
Subject: Re: Postgres, fsync and RAID controller with 100M of internal cache & dedicated battery
Date: 2007-08-23 17:52:57
Message-ID: Pine.GSO.4.64.0708231346270.28718@westnet.com

On Fri, 24 Aug 2007, Lincoln Yeoh wrote:

> BTW, what happens if someone replaced a faulty battery backed controller card
> on a "live" system with one from a "don't care test system" (identical
> hardware tho) that was powered down abruptly because people didn't care?
> Would the new card proceed to trash the "live" system?

All the caching controllers I've examined this behavior on give each disk
a unique ID, so if you connect new disks to them they wouldn't trash
anything because those writes will only go out to the original drives.
What happens to the pending writes for the drives that aren't there
anymore is kind of undefined though; presumably they'll just be thrown
away, I don't know if there are any cards that try to hang on to them in
case the original disks are connected later.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD