Re: Index Scans become Seq Scans after VACUUM ANALYSE

From: Curt Sampson <cjs(at)cynic(dot)net>
To: "J(dot) R(dot) Nield" <jrnield(at)usol(dot)com>
Cc: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Michael Loftis <mloftis(at)wgops(dot)com>, mlw <markw(at)mohawksoft(dot)com>, PostgreSQL Hacker <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Index Scans become Seq Scans after VACUUM ANALYSE
Date: 2002-06-24 03:40:51
Message-ID: Pine.NEB.4.43.0206241150500.7326-100000@angelic.cynic.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 23 Jun 2002, J. R. Nield wrote:

> If is impossible to do what you want. You can not protect against
> partial writes without writing pages twice and calling fdatasync
> between them while going through a generic filesystem.

I agree with this.

> The best disk array will not protect you if the operating system does
> not align block writes to the structure of the underlying device.

This I don't quite understand. Assuming you're using a SCSI drive
(and this mostly applies to ATAPI/IDE, too), you can do naught but
align block writes to the structure of the underlying device. When you
initiate a SCSI WRITE command, you start by telling the device at which
block to start writing and how many blocks you intend to write. Then you
start passing the data.

(See http://www.danbbs.dk/~dino/SCSI/SCSI2-09.html#9.2.21 for parameter
details for the SCSI WRITE(10) command. You may find the SCSI 2
specification, at http://www.danbbs.dk/~dino/SCSI/ to be a useful
reference here.)

> Even with raw devices, you need special support or knowledge of the
> operating system and/or the disk device to ensure that each write
> request will be atomic to the underlying hardware.

Well, so here I guess you're talking about two things:

1. When you request, say, an 8K block write, will the OS really
write it to disk in a single 8K or multiple of 8K SCSI write
command?

2. Does the SCSI device you're writing to consider these writes to
be transactional. That is, if the write is interrupted before being
completed, does the SCSI device guarantee that the partially-sent
data is not written, and the old data is maintained? And of course,
does it guarantee that, when it acknowledges a write, that write is
now in stable storage and will never go away?

Both of these are not hard to guarantee, actually. For a BSD-based OS,
for example, just make sure that your filesystem block size is the
same as or a multiple of the database block size. BSD will never write
anything other than a block or a sequence of blocks to a disk in a
single SCSI transaction (unless you've got a really odd SCSI driver).
And for your disk, buy a Baydel or Clarion disk array, or something
similar.

Given that it's not hard to set up a system that meets these criteria,
and this is in fact commonly done for database servers, it would seem a
good idea for postgres to have the option to take advantage of the time
and money spent and adjust its performance upward appropriately.

> All other systems rely on the fact that you can recover a damaged file
> using the log archive.

Not exactly. For MS SQL Server, at any rate, if it detects a page tear
you cannot restore based on the log file alone. You need a full or
partial backup that includes that entire torn block.

> This means downtime in the rare case, but no data loss. Until
> PostgreSQL can do this, then it will not be acceptable for real
> critical production use.

It seems to me that it is doing this right now. In fact, it's more
reliable than some commerial systems (such as SQL Server) because it can
recover from a torn block with just the logfile.

> But at the end of the day, unless you have complete understanding of
> the I/O system from write(2) through to the disk system, the only sure
> ways to protect against partial writes are by "careful writes" (in
> the WAL log or elsewhere, writing pages twice), or by requiring (and
> allowing) users to do log-replay recovery when a file is corrupted by
> a partial write.

I don't understand how, without a copy of the old data that was in the
torn block, you can restore that block from just log file entries. Can
you explain this to me? Take, as an example, a block with ten tuples,
only one of which has been changed "recently." (I.e., only that change
is in the log files.)

> If we log pages to WAL, they are useless when archived (after a
> checkpoint). So either we have a separate "log" for them (the
> ping-pong file), or we should at least remove them when archived,
> which makes log archiving more complex but is perfectly doable.

Right. That seems to me a better option, since we've now got only one
write point on the disk rather than two.

> Finally, I would love to hear why we are using the operating system
> buffer manager at all. The OS is acting as a secondary buffer manager
> for us. Why is that? What flaw in our I/O system does this reveal?

It's acting as a "second-level" buffer manager, yes, but to say it's
"secondary" may be a bit misleading. On most of the systems I've set
up, the OS buffer cache is doing the vast majority of the work, and the
postgres buffering is fairly minimal.

There are some good (and some perhaps not-so-good) reasons to do it this
way. I'll list them more or less in the order of best to worst:

1. The OS knows where the blocks physically reside on disk, and
postgres does not. Therefore it's in the interest of postgresql to
dispatch write responsibility back to the OS as quickly as possible
so that the OS can prioritize requests appropriately. Most operating
systems use an "elevator" algorithm to minimize disk head movement;
but if the OS does not have a block that it could write while the
head is "on the way" to another request, it can't write it in that
head pass.

2. Postgres does not know about any "bank-switching" tricks for
mapping more physical memory than it has address space. Thus, on
32-bit machines, postgres might be limited to mapping 2 or 3 GB of
memory, even though the machine has, say, 6 GB of physical RAM. The
OS can use all of the available memory for caching; postgres cannot.

3. A lot of work has been put into the seek algorithms, read-ahead
algorithms, block allocation algorithms, etc. in the OS. Why
duplicate all that work again in postgres?

When you say things like the following:

> We should only be writing blocks when they need to be on disk. We
> should not be expecting the OS to write them "sometime later" and
> avoid blocking (as long) for the write. If we need that, then our
> buffer management is wrong and we need to fix it.

you appear to be making the arugment that we should take the route of
other database systems, and use raw devices and our own management of
disk block allocation. If so, you might want first to look back through
the archives at the discussion I and several others had about this a
month or two ago. After looking in detail at what NetBSD, at least, does
in terms of its disk I/O algorithms and buffering, I've pretty much come
around, at least for the moment, to the attitude that we should stick
with using the OS. I wouldn't mind seeing postgres be able to manage all
of this stuff, but it's a *lot* of work for not all that much benefit
that I can see.

> The ORACLE people were not kidding when they said that they could not
> certify Linux for production use until it supported O_DSYNC. Can you
> explain why that was the case?

I'm suspecting it's because Linux at the time had no raw devices, so
O_DSYNC was the only other possible method of making sure that disk
writes actually got to disk.

You certainly don't want to use O_DSYNC if you can use another method,
because O_DSYNC still goes through the the operating system's buffer
cache, wasting memory and double-caching things. If you're doing your
own management, you need either to use a raw device or open files with
the flag that indicates that the buffer cache should not be used at all
for reads from and writes to that file.

> However, this discussion and a search of the pgsql-hackers archives
> reveals this problem to be the KEY area of PostgreSQL's failing, and
> general misunderstanding, when compared to its commercial competitors.

No, I think it's just that you're under a few minor misapprehensions
here about what postgres and the OS are actually doing. As I said, I
went through this whole exact argument a month or two ago, on this very
list, and I came around to the idea that what postgres is doing now
works quite well, at least on NetBSD. (Most other OSes have disk I/O
algorithms that are pretty much as good or better.) There might be a
very slight advantage to doing all one's own I/O management, but it's
a huge amount of work, and I think that much effort could be much more
usefully applied to other areas.

Just as a side note, I've been a NetBSD developer since about '96,
and have been delving into the details of OS design since well before
that time, so I'm coming to this with what I hope is reasonably good
knowledge of how disks work and how operating systems use them. (Not
that this should stop you from pointing out holes in my arguments. :-))

cjs
--
Curt Sampson <cjs(at)cynic(dot)net> +81 90 7737 2974 http://www.netbsd.org
Don't you know, in this new Dark Age, we're all light. --XTC

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Christopher Kings-Lynne 2002-06-24 03:44:56 Re: Suggestions for implementing IS DISTINCT FROM?
Previous Message Bruce Momjian 2002-06-24 02:46:44 Re: Index Scans become Seq Scans after VACUUM ANALYSE