Re: SCSI vs. IDE performance test

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Rick Gigger" <rick(at)alpinenetworking(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: SCSI vs. IDE performance test
Date: 2003-10-28 00:05:44
Message-ID: 18118.1067299544@sss.pgh.pa.us
Lists: pgsql-general

"Rick Gigger" <rick(at)alpinenetworking(dot)com> writes:
> ahhh. "lies about write order" is the phrase that I was looking for. That
> seemed to make sense but I didn't know if I could go directly from "lying
> about fsync" to that. Obviously I don't understand exactly what fsync is
> doing.

What we actually care about is write order: WAL entries have to hit the
platter before the corresponding data-file changes do. Unfortunately we
have no portable means of expressing that exact constraint to the
kernel. We use fsync() (or related constructs) instead: issue the WAL
writes, fsync the WAL file, then issue the data-file writes. This
constrains the write ordering more than is really needed, but it's the
best we can do in a portable Unix application.
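
In code, the idiom boils down to something like this (a bare-bones
sketch, not the actual PostgreSQL source; the file names and record
contents here are made up):

/*
 * Sketch of the write-WAL, fsync, then write-data idiom described
 * above.  Not PostgreSQL's real code; "pg_wal_sketch" and
 * "datafile_sketch" are placeholder file names.
 */
#include <fcntl.h>
#include <unistd.h>

int
main(void)
{
    const char walrec[]  = "WAL record describing the change";
    const char newpage[] = "updated data-file page";
    int        wal, dat;

    wal = open("pg_wal_sketch", O_WRONLY | O_APPEND | O_CREAT, 0600);
    dat = open("datafile_sketch", O_WRONLY | O_CREAT, 0600);
    if (wal < 0 || dat < 0)
        return 1;

    /* 1. issue the WAL write */
    write(wal, walrec, sizeof(walrec) - 1);

    /* 2. fsync the WAL file: don't go further until the kernel
     *    says the WAL record is on stable storage */
    if (fsync(wal) != 0)
        return 1;

    /* 3. only now issue the data-file write; it can reach the
     *    platter whenever the kernel and drive get around to it */
    write(dat, newpage, sizeof(newpage) - 1);

    close(wal);
    close(dat);
    return 0;
}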

The problem is that the kernel thinks fsync is done when the disk drive
reports the writes are complete. When we say a drive lies about this,
we mean it accepts a sector of data into its on-board RAM and then
immediately claims write-complete, when in reality the data hasn't hit
the platter yet and will be lost if power dies before the drive gets
around to writing it.

So we can have a scenario where we think WAL is down to disk and go
ahead with issuing data-file writes. These will also be shoved over to
the drive and stored in its on-board RAM. Now the drive has multiple
sectors pending write in its buffers. If it chooses to write these in
some order other than the order they were given to it, it could write
the data file updates to disk first. If power drops *now*, we lose,
because the data files are inconsistent and there's no WAL entry to tell
us to fix it.
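
If it helps to see that spelled out, here is a toy model of the
failure (pure simulation: it treats the drive's write cache as an
in-memory list, and has nothing to do with real firmware):

/*
 * Toy model of the failure scenario above.  The "drive cache" is
 * just an array; the flush order and power cut are hard-wired to
 * show the bad case.
 */
#include <stdio.h>

struct pending_write
{
    const char *what;        /* which file the sector belongs to */
    int         on_platter;  /* has it really been written yet? */
};

int
main(void)
{
    /*
     * Host issues the WAL sector first, then (after the bogus
     * "fsync") the data-file sector.  The drive acks both at once.
     */
    struct pending_write cache[2] = {
        {"WAL record", 0},
        {"data-file page", 0},
    };
    int i;

    /* Drive elects to flush the later sector first (its choice). */
    cache[1].on_platter = 1;

    /* Power drops here, before the WAL record is flushed. */
    for (i = 0; i < 2; i++)
        printf("%s: %s\n", cache[i].what,
               cache[i].on_platter ? "on disk" : "lost");

    /*
     * Result: data-file page on disk, WAL record lost -- exactly the
     * inconsistent state described above.
     */
    return 0;
}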

Got it? It's really the combination of "lie about write completion" and
"write pending sectors out of order" that can mess things up.

The reason IDE drives have to do this for reasonable performance is that
the IDE interface is single-threaded: you can only have one read or
write in process at a time, from the point of view of the
kernel-to-drive interface. But in order to schedule reads and writes in
a way that makes sense physically (minimizes seeks), the drive has to
have multiple read and write requests pending that it can pick and
choose from. The only way to do that in the IDE world is to let a
write "complete" in interface terms before it's really done ... that
is, to lie.

The reason SCSI drives do *not* do this is that the SCSI interface is
logically multi-threaded: you can have multiple reads or writes pending
at once. When you want to write on a SCSI drive, you send over a
command that says "write this data at this sector". Sometime later the
drive sends back a status report "yessir boss, I done did that write".
Similarly, a read consists of a command "read this sector", followed
sometime later by a response that delivers the requested data. But you
can send other commands to read or write other sectors meanwhile, and
the drive is free to reorder them to suit its convenience. So in the
SCSI world, there is no need for the drive to lie in order to do its own
read/write scheduling. The kernel knows the truth about whether a given
sector has hit disk, and so it won't conclude that the WAL file has been
completely fsync'd until it really is all down to the platter.

This is also why SCSI disks shine on the read side when you have lots of
processes doing reads: in an IDE drive, there is no way for the drive to
satisfy read requests in any order but the one they're issued in. If the
kernel guesses wrong about the best ordering for a set of read requests,
then everybody waits for the seeks needed to get the earlier processes'
data. A SCSI drive can fetch the "nearest" data first, and then that
requester is freed to make progress in the CPU while the other guys wait
for their longer seeks. There's no win here with a single active user
process (since it probably wants specific data in a specific order), but
it's a huge win if lots of processes are making unrelated read requests.
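
To put rough numbers on the seek win, here is a toy comparison (the
cylinder positions are invented, and real drives are far more
complicated): service four pending reads in issue order versus
nearest-cylinder-first.

/*
 * Toy illustration of the scheduling win.  Four read requests at
 * made-up cylinder positions; compare total head movement when they
 * are serviced in issue order (one outstanding request at a time, as
 * on IDE) versus nearest-cylinder-first (what a drive with several
 * pending commands can do).
 */
#include <stdio.h>
#include <stdlib.h>

static long
seek_distance(const int *order, int n, const int *cyl)
{
    long total = 0;
    int  head = 0;              /* assume the head starts at cylinder 0 */
    int  i;

    for (i = 0; i < n; i++)
    {
        total += labs((long) cyl[order[i]] - head);
        head = cyl[order[i]];
    }
    return total;
}

int
main(void)
{
    /* cylinder of each pending read, in the order they were issued */
    int cyl[] = {900, 50, 870, 100};
    int issue_order[]   = {0, 1, 2, 3};
    int nearest_first[] = {1, 3, 2, 0};     /* 50, 100, 870, 900 */

    printf("issue order:   %ld cylinders of seeking\n",
           seek_distance(issue_order, 4, cyl));
    printf("nearest first: %ld cylinders of seeking\n",
           seek_distance(nearest_first, 4, cyl));
    return 0;
}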

Clear now?

(In a previous lifetime I wrote SCSI disk driver code ...)

regards, tom lane
