Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date: 2018-03-29 05:35:47
Message-ID: CAMsr+YE5Gs9iPqw2mQ6OHt1aC5Qk5EuBFCyG+vzHun1EqMxyQg@mail.gmail.com
Lists: pgsql-hackers

On 29 March 2018 at 10:30, Michael Paquier <michael(at)paquier(dot)xyz> wrote:

> On Tue, Mar 27, 2018 at 11:53:08PM -0400, Tom Lane wrote:
> > Craig Ringer <craig(at)2ndquadrant(dot)com> writes:
> >> TL;DR: Pg should PANIC on fsync() EIO return.
> >
> > Surely you jest.
>
> Any callers of pg_fsync in the backend code are careful enough to check
> the returned status, sometimes doing retries like in mdsync, so what is
> proposed here would be a regression.

I covered this in my original post.

Yes, we check the return value. But what do we do about it? For fsyncs of
heap files, we ERROR, aborting the checkpoint. We'll retry the checkpoint
later, which will retry the fsync(). **Which will now appear to succeed**
because the kernel, having reported the error to us once, has forgotten
that it lost our writes. So we do check the return value, it now reports
success, and we complete the checkpoint and move on.

But we only retried the fsync, not the writes before the fsync.

So we lost data. Or rather, we failed to detect that the kernel lost it, so
our checkpoint was bad and should never have been completed.
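
The sequence above can be modelled with a small simulation. This is only a
sketch: SimFile, sim_write, sim_fsync and the two checkpoint functions are
hypothetical names invented for illustration, and the toy "kernel" simply
encodes the behaviour described in this thread (a failed write-back drops
the dirty pages, and the error is reported to one fsync() call only). It is
not a model of any particular kernel or filesystem.

```c
#include <errno.h>
#include <stdbool.h>

/* Toy stand-in for the kernel's view of one file. Hypothetical, not a real API. */
typedef struct {
    int  dirty_pages;    /* written into the cache, not yet on stable storage */
    int  durable_pages;  /* pages that really reached stable storage */
    bool device_broken;  /* while true, write-back fails */
    bool error_pending;  /* a write-back error nobody has been told about yet */
} SimFile;

/* write(): only dirties a page in the cache, so it always "succeeds". */
static void sim_write(SimFile *f)
{
    f->dirty_pages++;
}

/*
 * fsync(): assumed behaviour, as described in this thread. On write-back
 * failure the dirty pages are dropped and the error is reported once;
 * after that the kernel has no memory of the lost data.
 */
static int sim_fsync(SimFile *f)
{
    if (f->dirty_pages > 0) {
        if (f->device_broken)
            f->error_pending = true;            /* the data is simply gone */
        else
            f->durable_pages += f->dirty_pages;
        f->dirty_pages = 0;                     /* marked clean either way */
    }
    if (f->error_pending) {
        f->error_pending = false;               /* reported once, then forgotten */
        errno = EIO;
        return -1;
    }
    return 0;
}

/*
 * Checkpoint that retries only the fsync after a failure. Returns the
 * number of pages the checkpoint believes are durable but are not, or -1
 * if the sequence did not unfold as modelled.
 */
static int checkpoint_retry_fsync_only(int npages)
{
    SimFile f = { .device_broken = true };

    for (int i = 0; i < npages; i++)
        sim_write(&f);
    if (sim_fsync(&f) == 0)
        return -1;                /* first attempt is expected to fail */
    f.device_broken = false;      /* the fault was transient */
    if (sim_fsync(&f) != 0)
        return -1;                /* the retry is expected to "succeed" */
    return npages - f.durable_pages;
}

/* Same scenario, but the writes are repeated before the second fsync. */
static int checkpoint_redo_writes(int npages)
{
    SimFile f = { .device_broken = true };

    for (int i = 0; i < npages; i++)
        sim_write(&f);
    if (sim_fsync(&f) == 0)
        return -1;
    f.device_broken = false;
    for (int i = 0; i < npages; i++)
        sim_write(&f);            /* redo the writes, not just the fsync */
    if (sim_fsync(&f) != 0)
        return -1;
    return npages - f.durable_pages;
}
```

Under these assumptions checkpoint_retry_fsync_only(3) returns 3 (three
pages the checkpoint wrongly treats as durable), while
checkpoint_redo_writes(3) returns 0.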

The problem is that we keep retrying checkpoints, and with them the fsync(),
*without* repeating the writes that led up to the checkpoint.

I don't pretend the kernel behaviour is sane, but we'd better deal with it
anyway.
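
As a rough illustration of the "PANIC on fsync() failure" idea from the
TL;DR quoted above, here is a minimal sketch using plain POSIX calls.
fsync_or_die is a hypothetical helper, not PostgreSQL's pg_fsync(), and
abort() merely stands in for ereport(PANIC, ...); the point is only that a
failed fsync is never retried in place, so the only way forward is to
restart and redo the writes.

```c
/* Ask for POSIX declarations (fsync) even under a strict -std= mode. */
#define _POSIX_C_SOURCE 200809L

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: flush fd to stable storage, and treat any failure
 * as fatal instead of reporting it to a caller that might retry.
 * Returns 0 on success; does not return on failure.
 */
static int fsync_or_die(int fd, const char *path)
{
    if (fsync(fd) != 0) {
        /* Assumption: a later fsync on this file cannot be trusted. */
        fprintf(stderr, "PANIC: could not fsync file \"%s\": %s\n",
                path, strerror(errno));
        abort();
    }
    return 0;
}
```

The design choice in this sketch is that no error code is handed back for a
retry loop to consume: the process either knows the data is durable or it
stops.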

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
