Re: 9.3: more problems with "Could not open file "pg_multixact/members/xxxx"

From: Andres Freund <andres(at)anarazel(dot)de>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 9.3: more problems with "Could not open file "pg_multixact/members/xxxx"
Date: 2014-08-19 20:27:34
Message-ID: 1b63ecf2-e16f-4346-9583-66e8887958c7@email.android.com
Lists: pgsql-hackers

On August 19, 2014 10:24:20 PM CEST, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
>On Tue, Jul 15, 2014 at 3:58 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
>
>> On Fri, Jun 27, 2014 at 11:51 AM, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> wrote:
>>
>>> Jeff Janes wrote:
>>>
>>> > This problem was initially fairly easy to reproduce, but since I
>>> > started adding instrumentation specifically to catch it, it has become
>>> > devilishly hard to reproduce.
>>> >
>>> > I think my next step will be to also log each of the values which goes
>>> > into the complex if (...) expression that decides on the deletion.
>>>
>>> Could you please try to reproduce it after updating to latest? I pushed
>>> fixes that should close these issues. Maybe you want to remove the
>>> instrumentation you added, to make failures more likely.
>>>
>>
>> There are still some problems in 9.4, but I haven't been able to diagnose
>> them and wanted to do more research on it. The announcement of upcoming
>> back-branches for 9.3 spurred me to try it there, and I have problems with
>> 9.3 (12c5bbdcbaa292b2a4b09d298786) as well. The move of truncation to the
>> checkpoint seems to have made the problem easier to reproduce. On an 8
>> core machine, this test fell over after about 20 minutes, which is much
>> faster than it usually reproduces.
>>
>> This is the error I get:
>>
>> 2084 UPDATE 2014-07-15 15:26:20.608 PDT:ERROR: could not access status of transaction 85837221
>> 2084 UPDATE 2014-07-15 15:26:20.608 PDT:DETAIL: Could not open file "pg_multixact/members/14031": No such file or directory.
>> 2084 UPDATE 2014-07-15 15:26:20.608 PDT:CONTEXT: SQL statement "SELECT 1 FROM ONLY "public"."foo_parent" x WHERE "id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x"
>>
>> The testing harness is attached as 3 patches that must be made to the test
>> server, and 2 scripts. The script do.sh sets up the database (using fixed
>> paths, so be careful) and then invokes count.pl in a loop to do the
>> actual work.
>>
>
>Sorry, after a long time when I couldn't do much testing on this, I've now
>been able to get back to it.
>
>It looks like what is happening is that checkPoint.nextMultiOffset wraps
>around from 2^32 to 0, even if 0 is still being used. At that point it
>starts deleting member files that are still needed.
>
>Is there some interlock which is supposed to prevent
>checkPoint.nextMultiOffset from lapping itself? I haven't been able to find
>it. It seems like the interlock applies only to MultiXid, not the Offset.

There is none (and there never has been one either). I've complained about it a couple of times but nobody, me included, had time and energy to fix that :(

Andres

---
Please excuse brevity and formatting - I am writing this on my mobile phone.
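
The wraparound discussed above can be illustrated with a minimal sketch in C. This is a toy model, not PostgreSQL code: toy_offset, toy_advance and toy_would_lap are hypothetical names, and the check in toy_would_lap is only one assumed shape that such an interlock could take. It assumes the offset is an unsigned 32-bit counter and that "oldest" is the oldest offset still in use.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for a 32-bit member offset counter (hypothetical name). */
typedef uint32_t toy_offset;

/* Advance the counter by nmembers; unsigned arithmetic wraps modulo 2^32. */
static toy_offset toy_advance(toy_offset next, uint32_t nmembers)
{
    return next + nmembers;
}

/*
 * Hypothetical interlock: report whether allocating nmembers more slots
 * would make 'next' catch up with or pass 'oldest', the oldest offset
 * that is still in use.
 */
static bool toy_would_lap(toy_offset oldest, toy_offset next, uint32_t nmembers)
{
    uint32_t in_use = next - oldest;           /* modular distance */
    uint32_t available = UINT32_MAX - in_use;  /* keep one slot spare */

    return nmembers > available;
}
```

With oldest = 0 and next = 0xFFFFFFF0, toy_advance(next, 32) wraps around to 0x10, silently reusing offsets that are still live. A check along the lines of toy_would_lap flags that allocation before it happens, while a smaller request of 8 members would still be allowed.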
