Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Justin Pryzby <pryzby(at)telsasoft(dot)com>, Anthony Iliopoulos <ailiop(at)altatus(dot)com>, Christophe Pettus <xof(at)thebuild(dot)com>, Greg Stark <stark(at)mit(dot)edu>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>, Bruce Momjian <bruce(at)momjian(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Catalin Iacob <iacobcatalin(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date: 2018-04-10 01:44:59
Message-ID: CAMsr+YFw=dJX-dP1LVROJsq+h9PA8TOazXWSq4f0FUizcnVmvg@mail.gmail.com
Lists: pgsql-hackers

On 10 April 2018 at 03:59, Andres Freund <andres(at)anarazel(dot)de> wrote:
> On 2018-04-09 14:41:19 -0500, Justin Pryzby wrote:
>> On Mon, Apr 09, 2018 at 09:31:56AM +0800, Craig Ringer wrote:
>> > You could make the argument that it's OK to forget if the entire file
>> > system goes away. But actually, why is that ok?
>>
>> I was going to say that it'd be okay to clear the error flag on umount, since any
>> opened files would prevent unmounting; but then I realized we need to consider
>> the case of close()ing all FDs, then opening them later, in another process.
>
>> On Mon, Apr 09, 2018 at 02:54:16PM +0200, Anthony Iliopoulos wrote:
>> > notification descriptor open, where the kernel would inject events
>> > related to writeback failures of files under watch (potentially
>> > enriched to contain info regarding the exact failed pages and
>> > the file offset they map to).
>>
>> For postgres that'd require backend processes to open() a file such that,
>> following its close(), any writeback errors are "signalled" to the checkpointer
>> process...
>
> I don't think that's as hard as some people argued in this thread. We
> could very well open a pipe in postmaster with the write end open in
> each subprocess, and the read end open only in checkpointer (and
> postmaster, but unused there). Whenever closing a file descriptor that
> was dirtied in the current process, send it over the pipe to the
> checkpointer. The checkpointer then can receive all those file
> descriptors (making sure it's not above the limit, fsync(), close() ing
> to make room if necessary). The biggest complication would presumably
> be to deduplicate the received file descriptors for the same file,
> without losing track of any errors.

Yep. That'd be a cheaper way to do it, though it wouldn't work on
Windows, where we don't yet know how writeback errors behave at all.
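One wrinkle with the pipe idea: on Linux a plain pipe can't carry a file
descriptor between processes; you need a Unix-domain socketpair and
SCM_RIGHTS ancillary data. A minimal sketch of the hand-off (names are
mine for illustration, not proposed postgres code):

```c
/* Sketch only: a backend sends a dirtied fd to the checkpointer over a
 * socketpair(AF_UNIX, ...) created in postmaster and inherited by both.
 * SCM_RIGHTS makes the kernel dup the fd into the receiving process. */
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

static int
send_fd(int sock, int fd)
{
	struct msghdr msg = {0};
	struct iovec iov;
	char dummy = 'F';
	char cmsgbuf[CMSG_SPACE(sizeof(int))];
	struct cmsghdr *cmsg;

	iov.iov_base = &dummy;		/* must send at least one data byte */
	iov.iov_len = 1;
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = cmsgbuf;
	msg.msg_controllen = sizeof(cmsgbuf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;	/* fd travels as ancillary data */
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

static int
recv_fd(int sock)
{
	struct msghdr msg = {0};
	struct iovec iov;
	char dummy;
	char cmsgbuf[CMSG_SPACE(sizeof(int))];
	struct cmsghdr *cmsg;
	int fd = -1;

	iov.iov_base = &dummy;
	iov.iov_len = 1;
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = cmsgbuf;
	msg.msg_controllen = sizeof(cmsgbuf);

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;
	cmsg = CMSG_FIRSTHDR(&msg);
	if (cmsg != NULL && cmsg->cmsg_level == SOL_SOCKET &&
		cmsg->cmsg_type == SCM_RIGHTS)
		memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

The fd the checkpointer receives refers to the same open file
description the backend wrote through, which is exactly the property we
want for fsync()ing it later.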

Prior discussion upthread had the checkpointer open()ing a file at the
same time as a backend, before the backend writes to it. But passing
the fd when the backend is done with it would be better.

We'd sometimes need a way to dup() the fd and pass it back to a
backend that needs to reopen it, or just make sure to keep the oldest
copy of the fd when a backend opens the same file multiple times, but
that's no biggie.
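For the dedup side, the checkpointer could key received fds on file
identity from fstat() - (st_dev, st_ino) - keeping the oldest fd for
each file and closing newer duplicates. Roughly (illustrative only):

```c
/* Sketch only: decide whether two received descriptors refer to the
 * same underlying file, which is the dedup key: if they match an
 * existing entry, keep the older fd and close the newcomer. */
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>

static int
same_file(int fd_a, int fd_b)
{
	struct stat a, b;

	if (fstat(fd_a, &a) != 0 || fstat(fd_b, &b) != 0)
		return -1;				/* caller treats as "can't tell" */
	return a.st_dev == b.st_dev && a.st_ino == b.st_ino;
}
```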

We'd still have to fsync() out early in the checkpointer if we ran out
of space in our FD list, and initscripts would need to change our
ulimit or we'd have to do it ourselves in the checkpointer. But
neither seems insurmountable.
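Doing the ulimit part ourselves in the checkpointer could be as simple
as lifting the soft RLIMIT_NOFILE to the hard limit at startup, which
needs no privilege; a sketch:

```c
/* Sketch only: raise the checkpointer's soft open-file limit to the
 * hard limit, so more received fds can be cached before we must
 * fsync() and close() some early to make room. */
#include <assert.h>
#include <sys/resource.h>

static int
raise_nofile_limit(void)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
		return -1;
	if (rl.rlim_max != RLIM_INFINITY && rl.rlim_cur < rl.rlim_max)
		rl.rlim_cur = rl.rlim_max;	/* soft may rise to hard unprivileged */
	return setrlimit(RLIMIT_NOFILE, &rl);
}
```

Going beyond the hard limit would still need help from initscripts (or
root), so this only shrinks the problem rather than eliminating it.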

FWIW, I agree that this is a corner case, but it's getting to be a
pretty big corner with the spread of overcommitted, deduplicating SANs,
cloud storage, etc. Not all I/O errors indicate permanent hardware
faults, disk failures, etc, as I outlined earlier. I'm very curious to
know what the error semantics of AWS EBS and other cloud network block
stores are. (I asked on the Amazon forums at
https://forums.aws.amazon.com/thread.jspa?threadID=279274&tstart=0 but
have had no response so far.)

I'm also not particularly inclined to trust that all file systems will
always reliably reserve space without having some cases where they'll
fail writeback on space exhaustion.

So we don't need to panic and freak out, but it's worth looking at the
direction the storage world is moving in, and whether this will become
a bigger issue over time.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
