Avoiding adjacent checkpoint records

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: Avoiding adjacent checkpoint records
Date: 2012-06-06 19:08:45
Message-ID: 13147.1339009725@sss.pgh.pa.us
Lists: pgsql-hackers

In commit 18fb9d8d21a28caddb72c7ffbdd7b96d52ff9724, Simon modified the
rule for when to skip checkpoints on the grounds that not enough
activity has happened since the last one. However, that commit left the
comment block about it in a nonsensical state:

* If this isn't a shutdown or forced checkpoint, and we have not switched
* to the next WAL file since the start of the last checkpoint, skip the
* checkpoint. The idea here is to avoid inserting duplicate checkpoints
* when the system is idle. That wastes log space, and more importantly it
* exposes us to possible loss of both current and previous checkpoint
* records if the machine crashes just as we're writing the update.
* (Perhaps it'd make even more sense to checkpoint only when the previous
* checkpoint record is in a different xlog page?)

The new code entirely fails to prevent writing adjacent checkpoint
records, because what it checks is the distance from the previous
checkpoint's REDO pointer, not the previous checkpoint record itself.
So the concern raised in the last two sentences of the comment isn't
being addressed at all: if we corrupt the current page of WAL while
trying to write the new checkpoint record, we risk losing the previous
checkpoint record too. Should the system then crash, there is enough
logic to back up to the second previous checkpoint record and roll
forward from there --- but since we've lost the last checkpoint and up
to one page's worth of preceding WAL records, there is no guarantee that
we'll manage to reach a database state that is consistent with data
already flushed out to disk during the last checkpoint.
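
To make the distinction concrete, here is a minimal sketch, not the actual xlog.c code. It assumes a WAL location can be treated as a flat byte offset (the hypothetical `WalPos` type) and uses the default 8 kB page and 16 MB segment sizes; the function names are invented for illustration. It shows that a check against the previous checkpoint's REDO pointer can allow a new checkpoint even when the previous checkpoint record sits on the page currently being written.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical simplification: a WAL location as a flat byte offset. */
typedef uint64_t WalPos;

#define WAL_PAGE_SIZE ((WalPos) 8192)              /* default WAL block size */
#define WAL_SEG_SIZE  ((WalPos) 16 * 1024 * 1024)  /* default segment size */

/*
 * Sketch of the rule as described above: skip the checkpoint only if the
 * insert position is still in the same segment as the previous
 * checkpoint's REDO pointer.
 */
static bool
skip_by_redo_segment(WalPos prev_redo, WalPos insert_pos)
{
    return insert_pos / WAL_SEG_SIZE == prev_redo / WAL_SEG_SIZE;
}

/* True if two WAL locations fall on the same WAL page. */
static bool
same_wal_page(WalPos a, WalPos b)
{
    return a / WAL_PAGE_SIZE == b / WAL_PAGE_SIZE;
}
```

For example, suppose the previous checkpoint's REDO pointer is 100 bytes before the end of a segment, its checkpoint record was written 200 bytes into the next segment, and the insert position is now 300 bytes into that segment. `skip_by_redo_segment` returns false, so the checkpoint proceeds, yet `same_wal_page` reports that the new record would share a page with the previous one.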

I started to make a quick patch to add an additional check on the
location of the previous checkpoint record, so that we'd skip a new
checkpoint unless we'd moved to a new page of WAL. However, if we
really want to take this risk seriously, ISTM that allowing adjacent
checkpoint records is bad all the time, not only for non-forced
checkpoints.
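
A rough sketch of what such an additional check might look like, again treating WAL locations as flat byte offsets (an assumption; `WalPos` and `skip_if_same_page` are hypothetical names, not taken from any patch). It also shows the limitation just noted: a forced checkpoint bypasses the check, so its record can still land next to the previous one.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical simplification: a WAL location as a flat byte offset. */
typedef uint64_t WalPos;

#define WAL_PAGE_SIZE ((WalPos) 8192)   /* default WAL block size */

/*
 * Skip a non-forced checkpoint if its record would be written on the same
 * WAL page that holds the previous checkpoint record.
 */
static bool
skip_if_same_page(WalPos prev_ckpt_rec, WalPos insert_pos, bool forced)
{
    if (forced)
        return false;   /* forced and shutdown checkpoints are never skipped */

    return prev_ckpt_rec / WAL_PAGE_SIZE == insert_pos / WAL_PAGE_SIZE;
}
```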

What I'm now thinking is that a more appropriate way to address that
risk is to force a skip to a new page (not segment) of WAL after we
write a checkpoint record. This won't waste much WAL space, given
the new rule that avoids checkpointing more than once per segment on
average.
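
The skip-to-new-page arithmetic can be sketched as follows. This is an illustration under simplifying assumptions: WAL locations are flat byte offsets, page headers are ignored, and `WalPos`, `skip_to_next_page`, and `padding_after_record` are hypothetical names rather than existing functions.

```c
#include <stdint.h>

/* Hypothetical simplification: a WAL location as a flat byte offset. */
typedef uint64_t WalPos;

#define WAL_PAGE_SIZE ((WalPos) 8192)   /* default WAL block size */

/*
 * Given the end of the checkpoint record, return where the next record
 * should start so that it begins on a fresh page.  If the record already
 * ends exactly on a page boundary, no padding is needed.
 */
static WalPos
skip_to_next_page(WalPos end_of_record)
{
    WalPos rem = end_of_record % WAL_PAGE_SIZE;

    return rem == 0 ? end_of_record : end_of_record + (WAL_PAGE_SIZE - rem);
}

/* Number of bytes left unused on the checkpoint record's page. */
static WalPos
padding_after_record(WalPos end_of_record)
{
    return skip_to_next_page(end_of_record) - end_of_record;
}
```

The padding is always less than one page. With the default sizes of 8 kB pages and 16 MB segments, and at most about one checkpoint per segment, that is under 1/2048 of the WAL volume, or roughly 0.05%.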

On the other hand, you could argue that this concern is entirely
hypothetical, and we're already basically assuming that once a WAL
record has been flushed to disk it's safe there even if we're still
writing more stuff into the same page. If we don't want to assume
that, then any XLogFlush would have to include skip-to-new-page,
and that's not going to be cheap.

Thoughts?

regards, tom lane
