New WAL code dumps core trivially on replay of bad data

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: New WAL code dumps core trivially on replay of bad data
Date: 2012-08-18 01:46:55
Message-ID: 10185.1345254415@sss.pgh.pa.us

I just had HEAD fail to recover after a backend core dump, because the
startup process dumped core itself during replay:

LOG: all server processes terminated; reinitializing
LOG: database system was interrupted; last known up at 2012-08-17 20:47:37 EDT
LOG: database system was not properly shut down; automatic recovery in progress
LOG: startup process (PID 9418) was terminated by signal 11
LOG: aborting startup due to startup process failure

The startup process's stack trace is

#0 0x26fd1c in RecordIsValid (record=0x4008d7a0, recptr=80658424, emode=15)
at xlog.c:3713
3713 COMP_CRC32(crc, XLogRecGetData(record), len);
(gdb) bt
#0 0x26fd1c in RecordIsValid (record=0x4008d7a0, recptr=80658424, emode=15)
at xlog.c:3713
#1 0x270690 in ReadRecord (RecPtr=0x7b03bad0, emode=15,
fetching_ckpt=0 '\000') at xlog.c:4006
#2 0x2761e8 in StartupXLOG () at xlog.c:6550
#3 0x463d28 in StartupProcessMain () at startup.c:222
#4 0x2ade78 in AuxiliaryProcessMain (argc=2, argv=0x7b03b850)
at bootstrap.c:418
#5 0x4631e0 in StartChildProcess (type=StartupProcess) at postmaster.c:4564
#6 0x461ddc in PostmasterStateMachine () at postmaster.c:3167
#7 0x460e48 in reaper (postgres_signal_arg=1074321312) at postmaster.c:2609

The current WAL address is 80658424 == 0x04cebff8, which is just 8 bytes
short of a page boundary, and what RecordIsValid thinks it is dealing
with is

(gdb) p *record
$1 = {xl_tot_len = 0, xl_xid = 0, xl_len = 1074108016, xl_info = 0 '\000',
xl_rmid = 0 '\000', xl_prev = 412316860416, xl_crc = 64}
(gdb) x/32 record
0x4008d7a0: 0x00000000 0x00000000 0x40059670 0x00008009
0x4008d7b0: 0x00000060 0x00000000 0x00000040 0x10000000
0x4008d7c0: 0x00000000 0x04cebec0 0x620583d8 0x00000000
0x4008d7d0: 0x00000000 0x04cebf98 0x00000001 0x01000000
0x4008d7e0: 0x00000000 0x00001947 0x0000c000 0x00000001
0x4008d7f0: 0x00000000 0x0000029a 0x00000001 0x00000000
0x4008d800: 0x00000000 0x502ee5c0 0x00000000 0x00000000
0x4008d810: 0x00000000 0x00000000 0x00000000 0x00000000

so it merrily tries to compute a checksum on a gigabyte worth of data,
and soon falls off the end of memory.

In reality, inspection of the WAL file suggests that this is the end of
valid data and what should have happened is that replay just stopped.
The xl_len and so forth shown above are just garbage from off the end of
what was actually read from the file (everything beyond offset 0xcebff8
in file 4 is in fact zeroes).

I'm not sure whether this is just a matter of having failed to
sanity-check that xl_tot_len is at least SizeOfXLogRecord, or whether
there is a deeper problem with the new design of continuation records
that makes it impossible to validate records safely.

Certainly it's completely insane for RecordIsValid to be trusting xl_len
so implicitly as it does, and to be blithely CRC'ing backup blocks that
could extend off the known length of the record as well --- that is,
checking only at line 3738 to see if we've fallen off the end of memory
is a tad too late. I realize that that code was like that before, but
apparently the earlier logic protected it to a greater extent than it
does now.
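
To spell out the sort of discipline I mean, here is a simplified sketch
with stand-in types (illustration only, not the actual xlog.c
structures): validate every length against xl_tot_len before touching
the bytes it describes.

#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the real record header, for illustration only. */
typedef struct
{
    uint32_t    xl_tot_len;     /* total record length, header included */
    uint32_t    xl_len;         /* rmgr data length, sans backup blocks */
    /* ... remaining header fields elided ... */
} SketchXLogRecord;

/*
 * With xl_tot_len == 0 and a garbage xl_len (1074108016 above), the
 * current code CRC's xl_len bytes and walks off the end of memory.
 * Checks of this shape reject the header before anything is touched.
 */
static bool
sketch_header_lengths_ok(const SketchXLogRecord *rec)
{
    if (rec->xl_tot_len < sizeof(SketchXLogRecord))
        return false;           /* too short to hold even the header */
    if (rec->xl_len > rec->xl_tot_len - sizeof(SketchXLogRecord))
        return false;           /* rmgr data would overrun the record */
    return true;
}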

regards, tom lane


From: Amit kapila <amit(dot)kapila(at)huawei(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: "pgsql-hackers(at)postgreSQL(dot)org" <pgsql-hackers(at)postgreSQL(dot)org>
Subject: Re: New WAL code dumps core trivially on replay of bad data
Date: 2012-08-18 05:52:01
Message-ID: 6C0B27F7206C9E4CA54AE035729E9C38285258A8@szxeml509-mbx

On Saturday, August 18, 2012 7:16 AM, Tom Lane wrote:

> The startup process's stack trace is

> #0 0x26fd1c in RecordIsValid (record=0x4008d7a0, recptr=80658424, emode=15)
> at xlog.c:3713
> 3713 COMP_CRC32(crc, XLogRecGetData(record), len);
> (gdb) bt
> #0 0x26fd1c in RecordIsValid (record=0x4008d7a0, recptr=80658424, emode=15)
> at xlog.c:3713
> #1 0x270690 in ReadRecord (RecPtr=0x7b03bad0, emode=15,
> fetching_ckpt=0 '\000') at xlog.c:4006

> The current WAL address is 80658424 == 0x04cebff8, which is just 8 bytes
> short of a page boundary, and what RecordIsValid thinks it is dealing
> with is

> so it merrily tries to compute a checksum on a gigabyte worth of data,
> and soon falls off the end of memory.

> In reality, inspection of the WAL file suggests that this is the end of
> valid data and what should have happened is that replay just stopped.
> The xl_len and so forth shown above are just garbage from off the end of
> what was actually read from the file (everything beyond offset 0xcebff8
> in file 4 is in fact zeroes).

> I'm not sure whether this is just a matter of having failed to
> sanity-check that xl_tot_len is at least SizeOfXLogRecord, or whether
> there is a deeper problem with the new design of continuation records
> that makes it impossible to validate records safely.

Earlier, there was a total-length check in ReadRecord, before it called
RecordIsValid():

if (record->xl_tot_len < SizeOfXLogRecord + record->xl_len ||
    record->xl_tot_len > SizeOfXLogRecord + record->xl_len +
    XLR_MAX_BKP_BLOCKS * (sizeof(BkpBlock) + BLCKSZ))

I think that missing total-length check is what caused this problem.
However, in the new code the check will need to look different.

With Regards,
Amit Kapila.


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Amit kapila <amit(dot)kapila(at)huawei(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "pgsql-hackers(at)postgreSQL(dot)org" <pgsql-hackers(at)postgreSQL(dot)org>
Subject: Re: New WAL code dumps core trivially on replay of bad data
Date: 2012-08-20 06:39:35
Message-ID: 5031DBA7.6010309@enterprisedb.com

On 18.08.2012 08:52, Amit kapila wrote:
> On Saturday, August 18, 2012 7:16 AM, Tom Lane wrote:
>
>> so it merrily tries to compute a checksum on a gigabyte worth of data,
>> and soon falls off the end of memory.
>
>> In reality, inspection of the WAL file suggests that this is the end of
>> valid data and what should have happened is that replay just stopped.
>> The xl_len and so forth shown above are just garbage from off the end of
>> what was actually read from the file (everything beyond offset 0xcebff8
>> in file 4 is in fact zeroes).
>
>> I'm not sure whether this is just a matter of having failed to
>> sanity-check that xl_tot_len is at least SizeOfXLogRecord, or whether
>> there is a deeper problem with the new design of continuation records
>> that makes it impossible to validate records safely.
>
> Earlier, there was a total-length check in ReadRecord, before it called
> RecordIsValid():
>
> if (record->xl_tot_len < SizeOfXLogRecord + record->xl_len ||
>     record->xl_tot_len > SizeOfXLogRecord + record->xl_len +
>     XLR_MAX_BKP_BLOCKS * (sizeof(BkpBlock) + BLCKSZ))
>
> I think that missing total-length check is what caused this problem.
> However, in the new code the check will need to look different.

That check still exists, in ValidXLogRecordHeader(). However, we now
allocate the buffer for the whole record before that check, based on
xl_tot_len, if the record header is split across pages. The theory in
allocating the buffer is that a bogus xl_tot_len field will cause the
malloc() to fail, returning NULL, and we treat that the same as a broken
header. However, with memory overcommit, what happens is that the
malloc() succeeds, but the process is killed when it actually tries to
use all that memory.

I think we need to delay the allocation of the record buffer. We need to
read and validate the whole record header first, like we did before,
before we trust xl_tot_len enough to call malloc() with it. I'll take a
shot at doing that.
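
Roughly like this; the helper names here are made up for illustration,
and the real signatures will differ:

/*
 * Sketch of the reordering; read_split_header() and
 * allocate_record_buffer() are hypothetical helpers.
 */
XLogRecord hdr;

/* 1. Assemble just the fixed-size header, following the record onto
 *    the next page if the header is split across a page boundary;
 *    this needs only SizeOfXLogRecord bytes, not xl_tot_len. */
if (!read_split_header(RecPtr, &hdr))
    return NULL;

/* 2. Sanity-check the header before trusting any length in it. */
if (!ValidXLogRecordHeader(RecPtr, &hdr, emode))
    return NULL;

/* 3. Only now is xl_tot_len trustworthy enough to size the buffer. */
buffer = allocate_record_buffer(hdr.xl_tot_len);
if (buffer == NULL)
    return NULL;                /* still treated as a broken header */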

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Amit kapila <amit(dot)kapila(at)huawei(dot)com>, "pgsql-hackers(at)postgreSQL(dot)org" <pgsql-hackers(at)postgreSQL(dot)org>
Subject: Re: New WAL code dumps core trivially on replay of bad data
Date: 2012-08-20 14:04:52
Message-ID: 28954.1345471492@sss.pgh.pa.us

Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
> On 18.08.2012 08:52, Amit kapila wrote:
>> I think that missing total-length check is what caused this problem. However, in the new code the check will need to look different.

> That check still exists, in ValidXLogRecordHeader(). However, we now
> allocate the buffer for the whole record before that check, based on
> xl_tot_len, if the record header is split across pages. The theory in
> allocating the buffer is that a bogus xl_tot_len field will cause the
> malloc() to fail, returning NULL, and we treat that the same as a broken
> header.

Uh, no, you misread it. xl_tot_len is *zero* in this example. The
problem is that RecordIsValid believes xl_len (and backup block size)
even when it exceeds xl_tot_len.

> I think we need to delay the allocation of the record buffer. We need to
> read and validate the whole record header first, like we did before,
> before we trust xl_tot_len enough to call malloc() with it. I'll take a
> shot at doing that.

I don't believe this theory at all. Overcommit applies to writing on
pages that were formerly shared with the parent process --- it should
not have anything to do with malloc'ing new space. But anyway, this
is not what happened in my example.

regards, tom lane


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Amit kapila <amit(dot)kapila(at)huawei(dot)com>
Subject: Re: New WAL code dumps core trivially on replay of bad data
Date: 2012-08-20 14:07:48
Message-ID: 201208201607.48821.andres@2ndquadrant.com

On Monday, August 20, 2012 04:04:52 PM Tom Lane wrote:
> Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
> > On 18.08.2012 08:52, Amit kapila wrote:
> >> I think that missing total-length check is what caused this problem.
> >> However, in the new code the check will need to look different.
> >
> > That check still exists, in ValidXLogRecordHeader(). However, we now
> > allocate the buffer for the whole record before that check, based on
> > xl_tot_len, if the record header is split across pages. The theory in
> > allocating the buffer is that a bogus xl_tot_len field will cause the
> > malloc() to fail, returning NULL, and we treat that the same as a broken
> > header.
>
> Uh, no, you misread it. xl_tot_len is *zero* in this example. The
> problem is that RecordIsValid believes xl_len (and backup block size)
> even when it exceeds xl_tot_len.
>
> > I think we need to delay the allocation of the record buffer. We need to
> > read and validate the whole record header first, like we did before,
> > before we trust xl_tot_len enough to call malloc() with it. I'll take a
> > shot at doing that.
>
> I don't believe this theory at all. Overcommit applies to writing on
> pages that were formerly shared with the parent process --- it should
> not have anything to do with malloc'ing new space. But anyway, this
> is not what happened in my example.
If the allocation is big enough (128 kB by default), it will be mmap'ed
into place. In that case overcommitting applies before the pages have
been brought in.
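
Easy enough to see in isolation, e.g. with something like this
(Linux-specific; the exact outcome depends on vm.overcommit_memory,
available swap and the libc):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int
main(void)
{
    /*
     * A ~4 GB request models a bogus xl_tot_len of 0xffffffff.  glibc
     * serves requests above the mmap threshold (128 kB by default)
     * with a fresh mmap(), so under overcommit the call can succeed
     * even though the memory isn't really there.
     */
    char       *buf = malloc((size_t) 0xffffffffU);

    if (buf == NULL)
    {
        puts("malloc failed up front -- the benign case");
        return 1;
    }
    puts("malloc succeeded; now touching the pages...");

    /*
     * Only here, when the pages are first written to, can the OOM
     * killer zap the process.
     */
    memset(buf, 0, (size_t) 0xffffffffU);
    puts("survived");
    free(buf);
    return 0;
}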

Greetings,

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Amit kapila <amit(dot)kapila(at)huawei(dot)com>, "pgsql-hackers(at)postgreSQL(dot)org" <pgsql-hackers(at)postgreSQL(dot)org>
Subject: Re: New WAL code dumps core trivially on replay of bad data
Date: 2012-08-20 15:09:19
Message-ID: 5032531F.9090007@enterprisedb.com

On 20.08.2012 17:04, Tom Lane wrote:
> Heikki Linnakangas<heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
>> On 18.08.2012 08:52, Amit kapila wrote:
>>> I think that missing total-length check is what caused this problem. However, in the new code the check will need to look different.
>
>> That check still exists, in ValidXLogRecordHeader(). However, we now
>> allocate the buffer for the whole record before that check, based on
>> xl_tot_len, if the record header is split across pages. The theory in
>> allocating the buffer is that a bogus xl_tot_len field will cause the
>> malloc() to fail, returning NULL, and we treat that the same as a broken
>> header.
>
> Uh, no, you misread it. xl_tot_len is *zero* in this example. The
> problem is that RecordIsValid believes xl_len (and backup block size)
> even when it exceeds xl_tot_len.

Ah yes, I see that now. I think all we need then is a check for
xl_tot_len >= SizeOfXLogRecord.
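
I.e. a sketch of the shape of it, not the final patch:

if (record->xl_tot_len < SizeOfXLogRecord)
{
    /* report the invalid record and treat it as end of valid WAL */
    return false;
}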

>> I think we need to delay the allocation of the record buffer. We need to
>> read and validate the whole record header first, like we did before,
>> before we trust xl_tot_len enough to call malloc() with it. I'll take a
>> shot at doing that.
>
> I don't believe this theory at all. Overcommit applies to writing on
> pages that were formerly shared with the parent process --- it should
> not have anything to do with malloc'ing new space. But anyway, this
> is not what happened in my example.

I was thinking that we might read gigabytes worth of bogus WAL into the
memory buffer, if xl_tot_len is bogus and large, e.g. 0xffffffff. But now
that I look closer, the xlog record is validated after reading the first
continuation page, so we should catch a bogus xl_tot_len value at that
point. And there is a cross-check with xl_rem_len on every continuation
page, too.
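
Paraphrasing that continuation-page check from memory (field names per
the page header; this is not a patch):

/*
 * On each continuation page, xlp_rem_len must account for exactly the
 * bytes of the record that are still missing, or replay stops here.
 */
if (!(pageHeader->xlp_info & XLP_FIRST_IS_CONTRECORD))
    goto bad_record;            /* expected a contrecord, none flagged */
if (pageHeader->xlp_rem_len == 0 ||
    total_len != (pageHeader->xlp_rem_len + gotlen))
    goto bad_record;            /* disagrees with xl_tot_len */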

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Amit kapila <amit(dot)kapila(at)huawei(dot)com>, "pgsql-hackers(at)postgreSQL(dot)org" <pgsql-hackers(at)postgreSQL(dot)org>
Subject: Re: New WAL code dumps core trivially on replay of bad data
Date: 2012-08-20 15:25:40
Message-ID: 711.1345476340@sss.pgh.pa.us

Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
> On 20.08.2012 17:04, Tom Lane wrote:
>> Uh, no, you misread it. xl_tot_len is *zero* in this example. The
>> problem is that RecordIsValid believes xl_len (and backup block size)
>> even when it exceeds xl_tot_len.

> Ah yes, I see that now. I think all we need then is a check for
> xl_tot_len >= SizeOfXLogRecord.

That should get us back to a reliability level similar to the old code.

However, I think that we also need to improve RecordIsValid so that at
each step, it checks it hasn't overrun xl_tot_len *before* touching the
corresponding part of the record buffer.

> I was thinking that we might read gigabytes worth of bogus WAL into the
> memory buffer, if xl_tot_len is bogus and large, e.g. 0xffffffff. But now
> that I look closer, the xlog record is validated after reading the first
> continuation page, so we should catch a bogus xl_tot_len value at that
> point. And there is a cross-check with xl_rem_len on every continuation
> page, too.

Yeah. Even if xl_tot_len is bogus, we should realize that within a
couple of pages at most. The core of the problem here is that
RecordIsValid is not being careful to confine its touches to the
guaranteed-to-exist bytes of the record buffer, ie 0 .. xl_tot_len-1.

regards, tom lane


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Amit kapila <amit(dot)kapila(at)huawei(dot)com>, "pgsql-hackers(at)postgreSQL(dot)org" <pgsql-hackers(at)postgreSQL(dot)org>
Subject: Re: New WAL code dumps core trivially on replay of bad data
Date: 2012-08-20 16:15:25
Message-ID: 5032629D.2040202@enterprisedb.com

On 20.08.2012 18:25, Tom Lane wrote:
> Heikki Linnakangas<heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
>> I was thinking that we might read gigabytes worth of bogus WAL into the
>> memory buffer, if xl_tot_len is bogus and large, e.g. 0xffffffff. But now
>> that I look closer, the xlog record is validated after reading the first
>> continuation page, so we should catch a bogus xl_tot_len value at that
>> point. And there is a cross-check with xl_rem_len on every continuation
>> page, too.
>
> Yeah. Even if xl_tot_len is bogus, we should realize that within a
> couple of pages at most. The core of the problem here is that
> RecordIsValid is not being careful to confine its touches to the
> guaranteed-to-exist bytes of the record buffer, ie 0 .. xl_tot_len-1.

Hmm, RecordIsValid() assumes that the whole record has been read into
memory already, where "whole record" is defined by xl_tot_len. The
problem is that xl_len disagrees with xl_tot_len. Validating the XLOG
header would've caught that, but in this case the caller had not called
ValidXLogRecordHeader().

However, a suitably corrupt record might have a valid header, but
*appear* to have larger backup blocks than the header claims. You would
indeed overrun the memory buffer while calculating the CRC, then. So
yeah, we should check that.
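
Concretely, I'm thinking of bounds checks along these lines in
RecordIsValid's backup-block loop (a sketch; the real patch may differ
in detail):

char       *end = (char *) record + record->xl_tot_len;
char       *blk = (char *) XLogRecGetData(record) + record->xl_len;
int         i;

for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++)
{
    BkpBlock    bkpb;
    uint32      blen;

    if (!(record->xl_info & XLR_SET_BKP_BLOCK(i)))
        continue;

    /* the BkpBlock header itself must fit inside the record buffer */
    if (blk + sizeof(BkpBlock) > end)
        return false;

    memcpy(&bkpb, blk, sizeof(BkpBlock));

    /* a corrupt hole would make blen bogus below */
    if (bkpb.hole_offset + bkpb.hole_length > BLCKSZ)
        return false;

    blen = sizeof(BkpBlock) + BLCKSZ - bkpb.hole_length;

    /* and the block data must fit too, before we CRC it */
    if (blk + blen > end)
        return false;

    COMP_CRC32(crc, blk, blen);
    blk += blen;
}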

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com