Re: could not truncate directory "pg_serial": apparent wraparound

Lists: pgsql-hackers
From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: <pgsql-hackers(at)postgresql(dot)org>
Subject: could not truncate directory "pg_serial": apparent wraparound
Date: 2011-06-08 00:28:36
Message-ID: 4DEE7BE4020000250003E2BD@gw.wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

We had a report of the subject message during testing a while back
and attempted to address the issue. It can result in a LOG level
message and the accumulation of files in the pg_serial SLRU
subdirectory. We haven't seen a recurrence, until I hit it during
testing of the just-posted patch for SSI DDL. I re-read the code
and believe that the attached is the correct fix.

-Kevin

Attachment Content-Type Size
ssi-slru-trunc-to-head-1.patch text/plain 1.1 KB

From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: could not truncate directory "pg_serial": apparent wraparound
Date: 2011-06-08 19:30:31
Message-ID: 4DEFCDD7.1070102@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 08.06.2011 03:28, Kevin Grittner wrote:
> We had a report of the subject message during testing a while back
> and attempted to address the issue. It can result in a LOG level
> message and the accumulation of files in the pg_serial SLRU
> subdirectory. We haven't seen a recurrence, until I hit it during
> testing of the just-posted patch for SSI DDL. I re-read the code
> and believe that the attached is the correct fix.

Doesn't this patch bring back the issue mentioned in the comment: the
slru page might not get used again until we wrap-around?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: could not truncate directory "pg_serial": apparent wraparound
Date: 2011-06-08 19:40:40
Message-ID: 4DEF89E8020000250003E380@gw.wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> On 08.06.2011 03:28, Kevin Grittner wrote:
>> We had a report of the subject message during testing a while
>> back and attempted to address the issue. It can result in a LOG
>< level message and the accumulation of files in the pg_serial SLRU
>> subdirectory. We haven't seen a recurrence, until I hit it
>> during testing of the just-posted patch for SSI DDL. I re-read
>> the code and believe that the attached is the correct fix.
>
> Doesn't this patch bring back the issue mentioned in the comment:
> the slru page might not get used again until we wrap-around?

In the previous attempt I thought it was looking at the allocated
files to assess that it was approaching wraparound. In looking at
the SLRU code last night, it seemed that it was really using the
"last zeroed" page as the comparison point, and wanted the
truncation point to precede that. Given that last night didn't seem
to be my sharpest hour, it would be wise to confirm this. This code
is pretty hard to test, since it only comes into play on overflow of
the predicate locking shared memory structures, so desk-checking is
important.

So, to directly address your question, if we don't overflow again
and go to the SLRU summary data, one file could sit in the pg_serial
directory indefinitely; but we should avoid the LOG message and the
accumulation of large numbers of such files -- if I'm reading the
SLRU code correctly....

-Kevin


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: could not truncate directory "pg_serial": apparent wraparound
Date: 2011-06-09 08:56:41
Message-ID: 4DF08AC9.4060707@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

While testing this, I noticed another serious bug in the OldSerXidSLRU
handling: we never set the dirty-flag on any page. I believe the reason
we haven't bumped into this in testing before is that when a new page is
initialized, it's marked as dirty, so everything goes smoothly when we
modify recently-zeroed pages. But if a page falls out of the cache, and
is later read back in and modified, the modifications are lost.

The comments in SLRU could be more explicit about this. It was
coincidental that I started to wonder where the pages are marked as
dirty, I somehow thought the SLRU functions do that for you.

Fortunately the fix is very simple, we just need to set the page_dirty
flag whenever we modify an slru page. But clearly this slru stuff needs
more testing. It's pretty hard to write good repeatable test cases for
these things, though.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Kevin Grittner <kevin(dot)grittner(at)wicourts(dot)gov>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: could not truncate directory "pg_serial": apparent wraparound
Date: 2011-06-09 14:40:24
Message-ID: 1307630393-sup-8102@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Excerpts from Heikki Linnakangas's message of jue jun 09 04:56:41 -0400 2011:

> Fortunately the fix is very simple, we just need to set the page_dirty
> flag whenever we modify an slru page. But clearly this slru stuff needs
> more testing. It's pretty hard to write good repeatable test cases for
> these things, though.

Maybe reduce the number of SLRU buffers used?

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Alvaro Herrera" <alvherre(at)commandprompt(dot)com>, "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: "pgsql-hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: could not truncate directory "pg_serial": apparent wraparound
Date: 2011-06-09 15:09:23
Message-ID: 4DF09BD3020000250003E459@gw.wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Alvaro Herrera <alvherre(at)commandprompt(dot)com> wrote:
> Excerpts from Heikki Linnakangas's message of jue jun 09 04:56:41
-0400 2011:
>
>> Fortunately the fix is very simple, we just need to set the
>> page_dirty flag whenever we modify an slru page. But clearly this
>> slru stuff needs more testing. It's pretty hard to write good
>> repeatable test cases for these things, though.
>
> Maybe reduce the number of SLRU buffers used?

That's a thought. I'll see about getting a build with
TEST_OLDSERXID defined and just one or two SLRU buffers for this,
and see if anything pops out of that.

.... after I finish reviewing Dan's patch from last night.

Thanks,

-Kevin


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: could not truncate directory "pg_serial": apparent wraparound
Date: 2011-06-09 17:42:05
Message-ID: 4DF105ED.1050800@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 08.06.2011 22:40, Kevin Grittner wrote:
> Heikki Linnakangas<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>> On 08.06.2011 03:28, Kevin Grittner wrote:
>>> We had a report of the subject message during testing a while
>>> back and attempted to address the issue. It can result in a LOG
>> < level message and the accumulation of files in the pg_serial SLRU
>>> subdirectory. We haven't seen a recurrence, until I hit it
>>> during testing of the just-posted patch for SSI DDL. I re-read
>>> the code and believe that the attached is the correct fix.
>>
>> Doesn't this patch bring back the issue mentioned in the comment:
>> the slru page might not get used again until we wrap-around?
>
> In the previous attempt I thought it was looking at the allocated
> files to assess that it was approaching wraparound. In looking at
> the SLRU code last night, it seemed that it was really using the
> "last zeroed" page as the comparison point, and wanted the
> truncation point to precede that.

I've committed your patch for now. It does in fact bring back the
original problem the clever but broken code was trying to address: if a
pg_serial is not needed for a very long time, the last segment that we
leave behind will eventually appear to be new again, and won't be
cleaned up until a lot more XIDs pass.

> So, to directly address your question, if we don't overflow again
> and go to the SLRU summary data, one file could sit in the pg_serial
> directory indefinitely; but we should avoid the LOG message and the
> accumulation of large numbers of such files -- if I'm reading the
> SLRU code correctly....

Yeah, as far as I can see it's harmless except for the small waste of
disk space. We probably want to revisit this later, maybe still for 9.1
or maybe not, but for now I just put in a comment about it.

I also fixed the broken warning logic. Please double-check that too when
you get a chance.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: could not truncate directory "pg_serial": apparent wraparound
Date: 2011-06-09 18:15:57
Message-ID: 4DF0C78D020000250003E46A@gw.wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:

> I've committed your patch for now.

Thanks!

I don't see it yet on the public git repo, nor on the -commiters
list. I'll keep an eye out for it.

> as far as I can see it's harmless except for the small waste of
> disk space. We probably want to revisit this later, maybe still
> for 9.1 or maybe not, but for now I just put in a comment about
> it.

I'm inclined to think it's not worth the trouble to revisit it. At
most it would be one segment. In fact, in line with your desire to
flush the SLRU pages so they're useful for debugging, we could call
this a feature -- it provides a quick way to check the date and time
of the last need for SLRU summarization. I can see that being a
useful piece of information for forensics.

-Kevin


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: could not truncate directory "pg_serial": apparent wraparound
Date: 2011-06-09 18:40:25
Message-ID: 4DF11399.4050200@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 09.06.2011 21:15, Kevin Grittner wrote:
> Heikki Linnakangas<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>
>> I've committed your patch for now.
>
> Thanks!
>
> I don't see it yet on the public git repo, nor on the -commiters
> list. I'll keep an eye out for it.

Oh, rats! Forgot to push.. Will do so now.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: could not truncate directory "pg_serial": apparent wraparound
Date: 2011-06-09 19:32:29
Message-ID: 4DF0D97D020000250003E46F@gw.wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:

> I also fixed the broken warning logic. Please double-check that
> too when you get a chance.

As usual, I like your code better than mine. I don't think my code
was broken in the sense that it would generate different results
than yours, but it was too clever by half, and depended on knowledge
that a TransactionId is 32 bits. Yours is more tolerant of changes
to implementation details, and much easier to read and understand.

While I'm *pretty sure* my code for this worked, I am *positive*
that yours does. :-)

Thanks again.

-Kevin