B-tree parent pointer and checkpoints

From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: B-tree parent pointer and checkpoints
Date: 2010-11-02 10:56:33
Message-ID: 4CCFEE61.2090702@enterprisedb.com
Lists: pgsql-hackers

We have the rm_safe_restartpoint mechanism to ensure that we don't use a
checkpoint that splits a multi-level B-tree insertion as a restart
point. But to my surprise, we don't have anything to protect against the
analogous case during normal operation. This is possible:

1. Split child page. Write WAL records for the child pages.
2. Begin and finish a checkpoint.
3. Crash, before writing the WAL record of inserting the child pointer
in the parent B-tree page.
4. Recovery begins at the new checkpoint, never sees the incomplete
split, so it stays incomplete.
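
The sequence above can be modelled with a toy WAL. This is only a sketch: the record kinds, the `WalRecord` type and the `recover` function are hypothetical stand-ins for illustration, not PostgreSQL's actual structures or redo code. It shows that a replay starting at the checkpoint's REDO pointer never encounters the split record, so it has nothing to mark as incomplete.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WalRecord:
    lsn: int
    kind: str      # "split", "parent_insert" or "checkpoint" (illustrative names)
    page: int = 0  # child page that was split (hypothetical field)


def recover(wal, redo_lsn):
    """Replay records at or after redo_lsn and return the set of child
    pages whose split was seen without a matching parent insert."""
    incomplete = set()
    for rec in wal:
        if rec.lsn < redo_lsn:
            continue  # records before the REDO pointer are never replayed
        if rec.kind == "split":
            incomplete.add(rec.page)
        elif rec.kind == "parent_insert":
            incomplete.discard(rec.page)
    return incomplete


# Steps 1-3: the split is logged, a checkpoint begins and finishes,
# and the crash happens before the parent insert is logged.
wal = [
    WalRecord(lsn=1, kind="split", page=7),
    WalRecord(lsn=2, kind="checkpoint"),
]
redo_lsn = 2  # the checkpoint started after the split record

# Step 4: recovery starts at the new checkpoint.
seen_by_recovery = recover(wal, redo_lsn)

# Ground truth, obtained by scanning the whole log from the start.
actually_incomplete = recover(wal, redo_lsn=0)

print("incomplete splits seen by recovery:", seen_by_recovery)
print("incomplete splits in reality:", actually_incomplete)
```

Recovery's view and the real state of the tree disagree: page 7 is still missing its downlink, but nothing in the replayed portion of the log says so.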

In practice that's pretty hard to hit, because a checkpoint takes some
time, while locking the parent page and writing the child pointer is
usually very quick. But it's possible.

It surprises me that we thought of this when we introduced
restartpoints, but this more obvious case during normal operation seems
to have been there forever. Nothing very bad happens if you lose the
parent update, but this would be nice to fix nevertheless.

I bumped into this while thinking about archive recovery - the above can
also happen during archive recovery if the checkpoint is caused by
pg_start_backup().

I think we can fix this by requiring that any multi-WAL-record actions
that are in progress when a checkpoint starts (at the REDO pointer) must
finish before the checkpoint record is written. That will close the
issue with restartpoints, archive recovery etc. as well, so we no longer
need to worry about this anywhere other than while performing an online
checkpoint.

I'm thinking of using the isCommit flag for this, to delay writing the
checkpoint record until all incomplete splits are finished. isCommit
already protects against a similar race condition between writing the
commit record and flushing the clog page. We will obviously need to
rename it, and double-check that it's safe: b-tree splits take longer,
and there's no critical section there like there is in the commit
codepath.
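
A minimal single-threaded sketch of that idea follows. It assumes a per-backend boolean flag; `Backend`, `Checkpointer` and `in_multi_record_action` are hypothetical names chosen for illustration, and none of this is PostgreSQL code. At the REDO pointer the checkpointer snapshots the backends that are in the middle of a multi-record action, and it refuses to write the checkpoint record until each of them has cleared its flag.

```python
class Backend:
    def __init__(self, name):
        self.name = name
        # Hypothetical stand-in for the renamed flag: set before the first
        # WAL record of a multi-record action, cleared after the last one.
        self.in_multi_record_action = False


class Checkpointer:
    def __init__(self, backends):
        self.backends = backends
        self.waiting_on = []

    def begin(self):
        # Taken at the REDO pointer: only actions already in progress matter,
        # because anything that starts later is logged after the REDO pointer
        # and will be seen by recovery.
        self.waiting_on = [b for b in self.backends if b.in_multi_record_action]

    def try_write_checkpoint_record(self):
        # Drop backends that have finished. Simplification: a backend that
        # finished and immediately began a new action is still waited on,
        # which is conservative but harmless in this model.
        self.waiting_on = [b for b in self.waiting_on if b.in_multi_record_action]
        return not self.waiting_on


a, b = Backend("a"), Backend("b")
a.in_multi_record_action = True   # a has split a child; parent insert pending

ckpt = Checkpointer([a, b])
ckpt.begin()
first_try = ckpt.try_write_checkpoint_record()   # a is still mid-split

b.in_multi_record_action = True   # starts after the REDO pointer; not waited on
a.in_multi_record_action = False  # a logs the parent insert and clears the flag
second_try = ckpt.try_write_checkpoint_record()

print(first_try, second_try)
```

In a real system the checkpointer would sleep and retry rather than return a boolean, and the open question from the paragraph above still applies: a split can hold the flag much longer than a commit does, and it is not inside a critical section.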

Comments?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
