Re: Index Scans become Seq Scans after VACUUM ANALYSE

From: Curt Sampson <cjs(at)cynic(dot)net>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: Michael Loftis <mloftis(at)wgops(dot)com>, mlw <markw(at)mohawksoft(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Index Scans become Seq Scans after VACUUM ANALYSE
Date: 2002-06-21 02:18:14
Message-ID: Pine.NEB.4.43.0206211106390.437-100000@angelic.cynic.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, 20 Jun 2002, Bruce Momjian wrote:

> > MS SQL Server has an interesting way of dealing with this. They have a
> > "torn" bit in each 512-byte chunk of a page, and this bit is set the
> > same for each chunk. When they are about to write out a page, they first
> > flip all of the torn bits and then do the write. If the write does not
> > complete due to a system crash or whatever, this can be detected later
> > because the torn bits won't match across the entire page.
>
> I was wondering, how does knowing the block is corrupt help MS SQL?

I'm trying to recall, but I can't off hand. I'll have to look it
up in my Inside SQL Server book, which is at home right now,
unfortunately. I'll bring the book into work and let you know the
details later.

> Right now, we write changed pages to WAL, then later write them to disk.

Ah. You write the entire page? MS writes only the changed tuple.
And DB2, in fact, goes one better and writes only the part of the
tuple up to the change, IIRC. Thus, if you put smaller and/or more
frequently changed columns first, you'll have smaller logs.

> I have always been looking for a way to prevent these WAL writes. The
> 512-byte bit seems interesting, but how does it help?

Well, this would at least let you reduce the write to the 512-byte
chunk that changed, rather than writing the entire 8K page.

> And how does the bit help them with partial block writes? Is the bit at
> the end of the block? Is that reliable?

The bit is somewhere within every 512 byte "disk page" within the
8192 byte "filesystem/database page." So an 8KB page is divided up
like this:

| <----------------------- 8 Kb ----------------------> |

| 512b | 512b | 512b | 512b | 512b | 512b | 512b | 512b |

Thus, the tear bits start out like this:

| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

After a successful write of the entire page, you have this:

| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |

If the write is unsuccessful, you end up with something like this:

| 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |

And now you know which parts of your page got written, and which
parts didn't.

cjs
--
Curt Sampson <cjs(at)cynic(dot)net> +81 90 7737 2974 http://www.netbsd.org
Don't you know, in this new Dark Age, we're all light. --XTC

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message David Ford 2002-06-21 02:23:29 Re: cvs read lock
Previous Message Rod Taylor 2002-06-21 02:08:40 Domain coercion