WAL format

From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: WAL format
Date: 2009-12-07 19:28:38
Message-ID: 4B1D5766.6070905@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

While looking at the streaming replication patch, I can't help but
wonder why our WAL format is so complicated.

WAL is divided into WAL segments, each 16 MB by default. Each WAL
segment is divided into pages, 8k by default. At the beginning of each
WAL page, there's a page header, but the header at the first page of
each WAL segment contains a few extra fields.

If a WAL record crosses a page boundary, we write as much of it as fits
onto the first page, and so-called continuation records with the rest of
the data on subsequent pages.

In particular I wonder why we bother with the page headers. A much
simpler format would be:

- get rid of page headers, except for the header at the beginning of
each WAL segment
- get rid of continuation records
- at the end of WAL segment, when there's not enough space to write the
next WAL record, always write an XLOG SWITCH record to fill the rest of
the segment.

The page addr stored in the WAL page header gives some extra protection
for detecting end of valid WAL correctly, but we rely on the prev-links
and CRC within page for that anyway, so I wouldn't mind losing that.

The changes to ReadRecord in the streaming replication patch feel a bit
awkward, because it has to work around the fact that WAL is streamed as
a stream of bytes, but ReadRecord works one page at a time. I'd like to
replace ReadRecord with a simpler ring buffer approach, but handling the
continuation records makes it a bit hard.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Magnus Hagander 2009-12-07 19:38:00 Build sizes vs docs
Previous Message Greg Smith 2009-12-07 19:26:12 Re: YAML