We have got a serious problem with pg_clog/WAL synchronization

Lists: pgsql-hackers
From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Cc: OKADA Satoshi <okada(dot)satoshi(at)lab(dot)ntt(dot)co(dot)jp>
Subject: We have got a serious problem with pg_clog/WAL synchronization
Date: 2004-08-10 16:17:11
Message-ID: 29031.1092154631@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

While investigating Satoshi Okada's bug report here
http://archives.postgresql.org/pgsql-hackers/2004-08/msg00510.php
I realized that it actually represents a crash-safety risk that has
existed since 7.2.

<lecture>
Allow me to refresh your memory about the principles of write-ahead
logging. The one that everyone remembers is "a WAL entry must hit disk
before any of the data changes it describes". But there is a different
constraint that must also be met, which is "a checkpoint must flush all
data changes of preceding WAL entries". In detail, a checkpoint does:

1. Note the current end-of-WAL position (where the next WAL entry will
be made). This is the checkpoint's "REDO" pointer.

2. Flush all dirty buffers to disk, and fsync all changes.

3. Add a checkpoint record to WAL, and flush it to disk.

(Note that when there is concurrent activity, other WAL records may be
added to WAL between steps 1 and 3, so that the checkpoint record's
physical location is later than its REDO pointer. This is okay. The
added records are logically after the checkpoint, even though physically
located before it in the WAL data.)

If we now suffer a crash, log replay will be executed starting at the
REDO point. Since records before the REDO point will not be replayed,
it is critical that the "flush" operations in step 2 have written all
the effects of such records to disk.
</lecture>

Satoshi-san's bug report shows a way to cause the system to sometimes
violate this constraint. In particular, what I saw was a transaction
commit WAL record that was just before the REDO pointer of a checkpoint,
but the pg_clog status update for it had not been flushed to disk by the
checkpoint. The reason this is possible is that RecordTransactionCommit
first writes the commit record, and fsyncs it, and only then goes and
makes the pg_clog status update in shared memory. There is thus a
window for a checkpoint to start, note its REDO point (after the
commit), and flush the current contents of pg_clog buffers out to disk
before the transaction has updated its state in pg_clog.

This has been broken since the original design of pg_clog in 7.2 :-(.
I fear I have to take the blame for it.

(Just to add insult to injury: if you enable commit_delay then the sleep
occurs during the window of vulnerability.)

What I am thinking of doing to fix the problem is to introduce
a new LWLock that RecordTransactionCommit will take a shared lock on
before writing its WAL record, and not release until it has updated
pg_clog. Checkpoint start will acquire the lock exclusively just long
enough to do its step 1 (establish REDO point). This is slightly
annoying since it means one more high-traffic lock to grab during
commit, but I don't see any other good solution. We will certainly have
to back-patch this into 7.4 and I suppose we should think about issuing
new 7.3 and 7.2 releases as well.

regards, tom lane


From: "Min Xu (Hsu)" <xu(at)cs(dot)wisc(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org, OKADA Satoshi <okada(dot)satoshi(at)lab(dot)ntt(dot)co(dot)jp>
Subject: Re: We have got a serious problem with pg_clog/WAL synchronization
Date: 2004-08-11 15:04:25
Message-ID: 411A3579.8030704@cs.wisc.edu
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:

>While investigating Satoshi Okada's bug report here
>
>
>...
>
>What I am thinking of doing to fix the problem is to introduce
>a new LWLock that RecordTransactionCommit will take a shared lock on
>before writing its WAL record, and not release until it has updated
>pg_clog. Checkpoint start will acquire the lock exclusively just long
>enough to do its step 1 (establish REDO point). This is slightly
>annoying since it means one more high-traffic lock to grab during
>commit, but I don't see any other good solution.
>

Please forgive me to give my potentially irrelevant comments. I am not
too familiar with the internals of the postgresql.

It seems to me this is an interesting phenomena of interactions between
frequent events of transaction commits and infrequent events of system
checkpoints. A potential alternative solution to adding a new shared
lock to the frequent commit operation is to let the infrequent
checkpoint operation take more overhead. I suppose acquiring/releasing
an extra lock for each commit would incur extra performance overhead,
even when the lock is not contented. On the other hand, let the
checkpoing operation acquire some existing locks (exclusively) to
effectively disallowing committing transactions to interfere with the
checkpoint process might be a better solution since it incur higher
overhead only when necessary.

Of course, this after all might be "pre-mature" optimization. Just my $0.02.

Thanks,

-Min

--
Rong: Life is all about running after a busy agenda! How frustrating!
Min: That's right! You can either deal with things as quick as possible or to create less items on the agenda to begin with.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: mxu(at)cae(dot)wisc(dot)edu
Cc: pgsql-hackers(at)postgresql(dot)org, OKADA Satoshi <okada(dot)satoshi(at)lab(dot)ntt(dot)co(dot)jp>
Subject: Re: We have got a serious problem with pg_clog/WAL synchronization
Date: 2004-08-11 15:25:58
Message-ID: 5670.1092237958@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"Min Xu (Hsu)" <xu(at)cs(dot)wisc(dot)edu> writes:
> It seems to me this is an interesting phenomena of interactions between
> frequent events of transaction commits and infrequent events of system
> checkpoints. A potential alternative solution to adding a new shared
> lock to the frequent commit operation is to let the infrequent
> checkpoint operation take more overhead. I suppose acquiring/releasing
> an extra lock for each commit would incur extra performance overhead,
> even when the lock is not contented. On the other hand, let the
> checkpoing operation acquire some existing locks (exclusively) to
> effectively disallowing committing transactions to interfere with the
> checkpoint process might be a better solution since it incur higher
> overhead only when necessary.

Unfortunately, there isn't any pre-existing lock that will serve.
A transaction that is between XLogInsert'ing its COMMIT record and
updating the shared pg_clog data area does not hold any lock that
could be used to prevent a checkpoint from starting. (Or it didn't
until yesterday's patch, anyway.)

I looked briefly at reorganizing the existing code so that we'd do the
COMMIT XLogInsert while we're holding lock on the shared pg_clog data,
which would solve the problem without adding any new lock acquisition.
But this seemed extremely messy to do. Also it would be optimizing
transaction commit at the cost of pessimizing other uses of pg_clog,
which might have to wait longer to get at the shared data. Adding the
new lock has the advantage that we can be sure it's not blocking
anything we don't want it to block.

Thanks for thinking about the problem though ...

regards, tom lane


From: "Min Xu (Hsu)" <mxu(at)cae(dot)wisc(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org, OKADA Satoshi <okada(dot)satoshi(at)lab(dot)ntt(dot)co(dot)jp>
Subject: Re: We have got a serious problem with pg_clog/WAL synchronization
Date: 2004-08-11 15:35:02
Message-ID: 411A3CA6.6060800@cae.wisc.edu
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:

>
>Unfortunately, there isn't any pre-existing lock that will serve.
>A transaction that is between XLogInsert'ing its COMMIT record and
>updating the shared pg_clog data area does not hold any lock that
>could be used to prevent a checkpoint from starting. (Or it didn't
>until yesterday's patch, anyway.)
>
>I looked briefly at reorganizing the existing code so that we'd do the
>COMMIT XLogInsert while we're holding lock on the shared pg_clog data,
>which would solve the problem without adding any new lock acquisition.
>But this seemed extremely messy to do. Also it would be optimizing
>transaction commit at the cost of pessimizing other uses of pg_clog,
>which might have to wait longer to get at the shared data. Adding the
>new lock has the advantage that we can be sure it's not blocking
>anything we don't want it to block.
>
>Thanks for thinking about the problem though ...
>
>

You are welcome.