Re: New WAL code dumps core trivially on replay of bad data

From: Amit kapila <amit(dot)kapila(at)huawei(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: "pgsql-hackers(at)postgreSQL(dot)org" <pgsql-hackers(at)postgreSQL(dot)org>
Subject: Re: New WAL code dumps core trivially on replay of bad data
Date: 2012-08-18 05:52:01
Message-ID: 6C0B27F7206C9E4CA54AE035729E9C38285258A8@szxeml509-mbx
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Tom Lane Sent: Saturday, August 18, 2012 7:16 AM

> The startup process's stack trace is

> #0 0x26fd1c in RecordIsValid (record=0x4008d7a0, recptr=80658424, emode=15)
> at xlog.c:3713
> 3713 COMP_CRC32(crc, XLogRecGetData(record), len);
> (gdb) bt
> #0 0x26fd1c in RecordIsValid (record=0x4008d7a0, recptr=80658424, emode=15)
> at xlog.c:3713
> #1 0x270690 in ReadRecord (RecPtr=0x7b03bad0, emode=15,
> fetching_ckpt=0 '\000') at xlog.c:4006

> The current WAL address is 80658424 == 0x04cebff8, that is just 8 bytes
> short of a page boundary, and what RecordIsValid thinks it is dealing
> with is

> so it merrily tries to compute a checksum on a gigabyte worth of data,
> and soon falls off the end of memory.

> In reality, inspection of the WAL file suggests that this is the end of
> valid data and what should have happened is that replay just stopped.
> The xl_len and so forth shown above are just garbage from off the end of
> what was actually read from the file (everything beyond offset 0xcebff8
> in file 4 is in fact zeroes).

> I'm not sure whether this is just a matter of having failed to
> sanity-check that xl_tot_len is at least SizeOfXLogRecord, or whether
> there is a deeper problem with the new design of continuation records
> that makes it impossible to validate records safely.

Earlier there was a check related to total length in ReadRecord, before it calls RecordIsValid()
if (record->xl_tot_len < SizeOfXLogRecord + record->xl_len ||
record->xl_tot_len > SizeOfXLogRecord + record->xl_len +
XLR_MAX_BKP_BLOCKS * (sizeof(BkpBlock) + BLCKSZ))

I think that missing check of total length has caused this problem. However now this check will be different.

With Regards,
Amit Kapila.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Craig Ringer 2012-08-18 08:34:37 Re: [PATCH] Docs: Make notes on sequences and rollback more obvious
Previous Message Jeff Janes 2012-08-18 02:18:52 "CLUSTER VERBOSE" tab completion