Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Mark Dilger <hornschnorter(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Anthony Iliopoulos <ailiop(at)altatus(dot)com>, Greg Stark <stark(at)mit(dot)edu>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>, Bruce Momjian <bruce(at)momjian(dot)us>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Catalin Iacob <iacobcatalin(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Date: 2018-04-09 20:43:03
Message-ID: e2c86276-cd44-aebe-7da5-8de01d615f5c@2ndquadrant.com
Lists: pgsql-hackers

On 04/09/2018 10:25 PM, Mark Dilger wrote:
>
>> On Apr 9, 2018, at 12:13 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
>>
>> Hi,
>>
>> On 2018-04-09 15:02:11 -0400, Robert Haas wrote:
>>> I think the simplest technological solution to this problem is to
>>> rewrite the entire backend and all supporting processes to use
>>> O_DIRECT everywhere. To maintain adequate performance, we'll have to
>>> write a complete I/O scheduling system inside PostgreSQL. Also, since
>>> we'll now have to make shared_buffers much larger -- since we'll no
>>> longer be benefiting from the OS cache -- we'll need to replace the
>>> use of malloc() with an allocator that pulls from shared_buffers.
>>> Plus, as noted, we'll need to totally rearchitect several of our
>>> critical frontend tools. Let's freeze all other development for the
>>> next year while we work on that, and put out a notice that Linux is no
>>> longer a supported platform for any existing release. Before we do
>>> that, we might want to check whether fsync() actually writes the data
>>> to disk in a usable way even with O_DIRECT. If not, we should just
>>> de-support Linux entirely as a hopelessly broken and unsupportable
>>> platform.
>>
>> Let's lower the pitchforks a bit here. Obviously a grand rewrite is
>> absurd, as is some of the proposed ways this is all supposed to
>> work. But I think the case we're discussing is much closer to a near
>> irresolvable corner case than anything else.
>>
>> We're talking about the storage layer returning an irresolvable
>> error. You're hosed even if we report it properly. Yes, it'd be nice if
>> we could report it reliably. But that doesn't change the fact that what
>> we're doing is ensuring that data is safely fsynced unless storage
>> fails, in which case it's not safely fsynced anyway.
>
> I was reading this thread up until now as meaning that the standby could
> receive corrupt WAL data and become corrupted. That seems a much bigger
> problem than merely having the master become corrupted in some unrecoverable
> way. It is a long standing expectation that serious hardware problems on
> the master can result in the master needing to be replaced. But there has
> not been an expectation that the one or more standby servers would be taken
> down along with the master, leaving all copies of the database unusable.
> If this bug corrupts the standby servers, too, then it is a whole different
> class of problem than the one folks have come to expect.
>
> Your comment reads as if this is a problem isolated to whichever server has
> the problem, and will not get propagated to other servers. Am I reading
> that right?
>
> Can anybody clarify this for non-core-hacker folks following along at home?
>

That's a good question. I don't see any guarantee it'd be isolated to
the master node. Consider this example:

(0) checkpoint happens on the primary

(1) a page gets modified, and a full-page image gets written to WAL

(2) the page is written out to the page cache

(3) writeback of that page fails (and the page gets discarded)

(4) we attempt to modify the page again, but we read the stale version

(5) we modify the stale version, writing the change to WAL

The standby will get the full-page image, and then a WAL record generated
from the stale page version. That doesn't seem like a story with a happy
ending, I guess. But I might easily be missing some protection built into
the WAL ...
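
To make the sequence above concrete, here is a toy simulation of steps
(0)-(5). It is only a sketch under loud assumptions: the Primary class, the
"fpi" and "set" record types, and the slot-to-value page layout are all made
up for illustration and have nothing to do with PostgreSQL's real buffer
manager or WAL format. It also ignores any checks that real WAL replay might
perform, which is exactly the part I'm unsure about.

```python
# Toy model only: hypothetical names, not PostgreSQL's buffer manager or WAL format.
import copy


class Primary:
    def __init__(self):
        self.disk = {0: "a"}     # (0) on-disk page as of the last checkpoint
        self.page_cache = None   # dirty copy held by the kernel, if any
        self.buffer = None       # copy held in shared buffers, if any
        self.wal = []            # WAL records streamed to the standby
        self.fpi_done = False    # full-page image logged since the checkpoint?

    def read_page(self):
        # Read through the page cache, falling back to disk.
        if self.buffer is None:
            src = self.page_cache if self.page_cache is not None else self.disk
            self.buffer = copy.deepcopy(src)
        return self.buffer

    def modify(self, slot, value):
        page = self.read_page()
        page[slot] = value
        if not self.fpi_done:
            # First change after a checkpoint: log the whole page.
            self.wal.append(("fpi", copy.deepcopy(page)))
            self.fpi_done = True
        else:
            # Later changes: log only the delta.
            self.wal.append(("set", slot, value))

    def write_out(self):
        # (2) hand the page to the kernel and drop it from shared buffers
        self.page_cache = self.buffer
        self.buffer = None

    def failed_writeback(self):
        # (3) writeback fails and the kernel drops the page; disk is unchanged
        self.page_cache = None


def replay(wal):
    # What a standby would end up with after applying the WAL stream.
    page = {}
    for rec in wal:
        if rec[0] == "fpi":
            page = copy.deepcopy(rec[1])
        else:
            _, slot, value = rec
            page[slot] = value
    return page


primary = Primary()
primary.modify(1, "b")       # (1) modify page, full-page image goes to WAL
primary.write_out()          # (2)
primary.failed_writeback()   # (3)
primary.modify(0, "a2")      # (4)+(5) stale page is re-read and modified

primary_page = primary.read_page()
standby_page = replay(primary.wal)
print("primary:", primary_page)
print("standby:", standby_page)
```

In this toy model the primary silently loses the change from step (1),
while the standby keeps it (from the full-page image) and then applies a
delta that was computed against a page that never had it. So the two nodes
end up with different page contents, and neither side has been told that
anything went wrong.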

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
