Re: BUG #6425: Bus error in slot_deform_tuple

From: postgres(at)dunquino(dot)com
To: pgsql-bugs(at)postgresql(dot)org
Subject: BUG #6425: Bus error in slot_deform_tuple
Date: 2012-02-01 14:28:30
Message-ID: E1RsbAk-00007G-LC@wrigleys.postgresql.org
Lists: pgsql-bugs pgsql-hackers

The following bug has been logged on the website:

Bug reference: 6425
Logged by: orval
Email address: postgres(at)dunquino(dot)com
PostgreSQL version: 9.0.6
Operating system: Solaris 10 u9
Description:

This is intermittent and hard to reproduce but crashes consistently in the
same place. That place is backend/access/common/heaptuple.c line 1104:

values[attnum] = fetchatt(thisatt, tp + off);

off is always 0, and tp is an unaligned address (not divisible by 4 -- this is
Sparc, BTW). I've seen tup->t_hoff set to 0x62 and 0x82 in different core
files.
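
(For context: SPARC raises SIGBUS on misaligned loads rather than fixing
them up the way x86 does. A minimal standalone sketch of the failure mode --
hypothetical code, not the PostgreSQL source -- is:

#include <stdint.h>

int32_t
read_int32(const char *tp, unsigned off)
{
    /*
     * If (tp + off) is not 4-byte aligned, this dereference raises
     * SIGBUS on SPARC -- the same class of fault as the crash in
     * fetchatt() above.
     */
    return *(const int32_t *) (tp + off);
}

So a t_hoff that is not a multiple of 4 leaves tp misaligned for every
fixed-width attribute fetched from the tuple.)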

This system is using streaming replication, and the problem always occurs
on the secondary. The system is under heavy load, both in terms of queries
and DML on the primary. There are usually quite a lot of deadlocks going
on.

The query in question each time is a join between a table called preferences
and one called preference_fields. The tuple is in preference_fields. I have
not confirmed that this is the cause, but the following statement does appear
in one of the scripts involved:

DELETE FROM preference_fields WHERE preference_field_id NOT IN (SELECT
DISTINCT preference_field_id FROM preferences);

There is also this kind of nasty stuff going on:

ALTER TABLE preferences RENAME TO preferences_old;
ALTER TABLE preferences_1326144465 RENAME TO preferences;

Where preferences_1326144465 is a copy of preferences that is used during a
data import process.

At the moment I have asserts in the places where t_hoff is set, looking for
(address % 4 != 0), but it's been running for a couple of days and the assert
hasn't fired yet. Any advice on where better to put some debugging would be
gratefully received.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: postgres(at)dunquino(dot)com
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #6425: Bus error in slot_deform_tuple
Date: 2012-02-01 16:04:15
Message-ID: 15205.1328112255@sss.pgh.pa.us
Lists: pgsql-bugs pgsql-hackers

postgres(at)dunquino(dot)com writes:
> This is intermittent and hard to reproduce but crashes consistently in the
> same place. That place is backend/access/common/heaptuple.c line 1104:
> ...
> This system is using streaming replication, and the problem always occurs
> on the secondary.

Have you read the thread about bug #6200? I'm suspicious that this is
the same or similar problem, with a slightly different visible symptom
because of pickier hardware. I'm afraid we don't know what's going on
yet there either, but the idea that t_hoff is wrong gives us a new line
of thought.

regards, tom lane


From: Duncan Rance <postgres(at)dunquino(dot)com>
To: pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #6425: Bus error in slot_deform_tuple
Date: 2012-02-01 17:32:57
Message-ID: 610BBB94-24AF-419B-B614-8D522E7103CA@dunquino.com
Lists: pgsql-bugs pgsql-hackers

On 1 Feb 2012, at 16:04, Tom Lane wrote:

> postgres(at)dunquino(dot)com writes:
>> This is intermittent and hard to reproduce but crashes consistently in the
>> same place. That place is backend/access/common/heaptuple.c line 1104:
>> ...
>> This system is using streaming replication, and the problem always occurs
>> on the secondary.
>
> Have you read the thread about bug #6200? I'm suspicious that this is
> the same or similar problem, with a slightly different visible symptom
> because of pickier hardware. I'm afraid we don't know what's going on
> yet there either, but the idea that t_hoff is wrong gives us a new line
> of thought.
>
> regards, tom lane

I didn't find #6200 when looking for mentions of this problem, so thanks for that.

I have read the thread now and I guess it could be the same kind of thing. I have tried creating a cut-down version of what is happening for real, but that didn't trigger the problem.

I do have a bunch of core files, but I'm not (yet!) familiar with pg code so I am unable to usefully analyse them.

One idea I saw mentioned in #6200 is using another hot standby. I didn't think of that before. I may try creating another one (or more) to see if I can reproduce the problem more quickly.

Regards,
orval


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: postgres(at)dunquino(dot)com, pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #6425: Bus error in slot_deform_tuple
Date: 2012-02-01 18:10:07
Message-ID: CA+TgmoYiaJY4yKydS09ymXJmGPAT-BUHq36SBVB-8iG7bcFUxw@mail.gmail.com
Lists: pgsql-bugs pgsql-hackers

On Wed, Feb 1, 2012 at 11:04 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Have you read the thread about bug #6200?  I'm suspicious that this is
> the same or similar problem, with a slightly different visible symptom
> because of pickier hardware.  I'm afraid we don't know what's going on
> yet there either, but the idea that t_hoff is wrong gives us a new line
> of thought.

I went looking for commits that might be relevant to this that are new
in 9.0.6, also present in 9.1.2 (per 6200), and related to t_hoff, and
came up with this one:

Branch: master [039680aff] 2011-11-04 23:22:50 -0400
Branch: REL9_1_STABLE Release: REL9_1_2 [8bfc2b5a8] 2011-11-04 23:23:06 -0400
Branch: REL9_0_STABLE Release: REL9_0_6 [b07b2bdc5] 2011-11-04 23:23:16 -0400
Branch: REL8_4_STABLE Release: REL8_4_10 [23998fe99] 2011-11-04 23:23:24 -0400
Branch: REL8_3_STABLE Release: REL8_3_17 [c34088fde] 2011-11-04 23:23:33 -0400
Branch: REL8_2_STABLE Release: REL8_2_23 [73e8ee9eb] 2011-11-04 23:23:38 -0400

Author: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Date: Fri Nov 4 23:23:16 2011 -0400

Don't assume that a tuple's header size is unchanged during toasting.

Mind you, I have no evidence that this is related; it's just the only
thing that pops out at me.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Duncan Rance <postgres(at)dunquino(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #6425: Bus error in slot_deform_tuple
Date: 2012-02-01 20:43:48
Message-ID: 846B8F25-A7C1-4513-A284-D496C1DFC590@dunquino.com
Lists: pgsql-bugs pgsql-hackers

On 1 Feb 2012, at 18:10, Robert Haas wrote:
> I went looking for commits that might be relevant to this that are new
> in 9.0.6, also present in 9.1.2 (per 6200), and related to t_hoff, and
> came up with this one:
>
> Branch: master [039680aff] 2011-11-04 23:22:50 -0400

I looked at this and it seems specific to doing an ALTER TABLE ADD COLUMN, which we're not doing in this case.

I mentioned in the bug report that I have asserts in places where t_hoff is set. I've been doing it like so:

if (hoff % 4 != 0) {
    elog(ERROR, "wrong hoff: %d", hoff);
    abort();
}

I've been sitting here waiting for the server to abort and only just realised there are some interesting entries in my pgbench logs. I'm using pgbench to hammer the server with queries, and I have a handful of these:

Client 87 aborted in state 8: ERROR: wrong hoff: 134

I have these abort() calls in:

backend/access/common/heaptuple.c
backend/access/heap/heapam.c
backend/access/heap/tuptoaster.c

But I know from the text that it must have been from either slot_deform_tuple(), heap_form_tuple() or heap_form_minimal_tuple() in heaptuple.c.

What I don't get is why this is causing the client to abort, and not the backend.

What can I do to get the server to abort at this point? Use PANIC instead of ERROR in the elog call perhaps?


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Duncan Rance <postgres(at)dunquino(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pg Bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #6425: Bus error in slot_deform_tuple
Date: 2012-02-01 21:05:29
Message-ID: 1328130247-sup-7672@alvh.no-ip.org
Lists: pgsql-bugs pgsql-hackers


Excerpts from Duncan Rance's message of Wed Feb 01 17:43:48 -0300 2012:

> I mentioned in the bug report that I have asserts in places where t_hoff is set. I've been doing it like so:
>
> if (hoff % 4 != 0) {
>     elog(ERROR, "wrong hoff: %d", hoff);
>     abort();
> }
>
> I've been sitting here waiting for the server to abort and only just realised there are some interesting entries in my pgbench logs. I'm using pgbench to hammer the server with queries, and I have a handful of these:

elog(ERROR) longjmps to the error handling code and never returns to
your abort() call. If you want it to abort at that point, use elog(PANIC).
Or you could do elog(WARNING) and then abort(), which is pretty much the
same thing.
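
(Concretely, a version of the check that dumps core at the point of
detection -- the same test, only with the error level changed -- would be:

if (hoff % 4 != 0)
    elog(PANIC, "wrong hoff: %d", hoff);    /* PANIC logs, then abort()s */

Since elog(PANIC) calls abort() itself after logging, the separate abort()
call is no longer needed.)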

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Duncan Rance <postgres(at)dunquino(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #6425: Bus error in slot_deform_tuple
Date: 2012-02-01 21:43:34
Message-ID: 24434.1328132614@sss.pgh.pa.us
Lists: pgsql-bugs pgsql-hackers

Duncan Rance <postgres(at)dunquino(dot)com> writes:
> I mentioned in the bug report that I have asserts in places where t_hoff is set. I've been doing it like so:

> if (hoff % 4 != 0) {
>     elog(ERROR, "wrong hoff: %d", hoff);
>     abort();
> }

> I've been sitting here waiting for the server to abort and only just realised there are some interesting entries in my pgbench logs. I'm using pgbench to hammer the server with queries, and I have a handful of these:

> Client 87 aborted in state 8: ERROR: wrong hoff: 134

Yowza. Is this just the standard pgbench test, or something else?
If you could post complete instructions for duplicating this, we
could probably find the cause fairly quickly.

> What I don't get is why this is causing the client to abort, and not the backend.

As Alvaro said, it's not reaching the abort(). You should use PANIC
instead.

regards, tom lane


From: Duncan Rance <postgres(at)dunquino(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #6425: Bus error in slot_deform_tuple
Date: 2012-02-01 22:37:32
Message-ID: A7CCB062-B7D1-4015-8B72-38BB3755477C@dunquino.com
Lists: pgsql-bugs pgsql-hackers

On 1 Feb 2012, at 21:43, Tom Lane wrote:
>> Client 87 aborted in state 8: ERROR: wrong hoff: 134
>
> Yowza. Is this just the standard pgbench test, or something else?

This is pgbench with a custom script (-f option).

> If you could post complete instructions for duplicating this, we
> could probably find the cause fairly quickly.

I'd love to, really I would! If I did, the instructions would be War & Peace length :)

I've been on this for over a week now, and much of that has been trying to simplify the test case. I have a lot more to go on now though so I may make more progress with that soon. (Although it's 10:30pm so I'm calling it a day!)

>> What I don't get is why this is causing the client to abort, and not the backend.
>
> As Alvaro said, it's not reaching the abort(). You should use PANIC
> instead.

Yes thanks, and to Álvaro too. I changed it to PANIC and I now have many many core files to choose from!

Cheers,
Duncan


From: Duncan Rance <postgres(at)dunquino(dot)com>
To: pgsql-bugs(at)postgresql(dot)org
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: BUG #6425: Bus error in slot_deform_tuple
Date: 2012-02-02 18:02:03
Message-ID: 24D75B69-D387-4476-AC0D-5271FA56AB20@dunquino.com
Lists: pgsql-bugs pgsql-hackers

On 1 Feb 2012, at 22:37, Duncan Rance wrote:

> On 1 Feb 2012, at 21:43, Tom Lane wrote:
>
>> If you could post complete instructions for duplicating this, we
>> could probably find the cause fairly quickly.
>
> I've been on this for over a week now, and much of that has been trying to simplify the test case.

At last I have been able to reproduce this problem in a relatively simple (yet contrived) way.

I've put together a tarball with a few scripts, some to be run on the primary and others to be run on the hot standby. There's a README in there explaining what to do.

I'm going to try attaching it here, although I wouldn't be surprised if one is not allowed to send attachments to the list. Any suggestions of where to put it would be gratefully received.

Cheers,
Duncan

Attachment Content-Type Size
bug_6425.tar.gz application/x-gzip 1.6 KB

From: Duncan Rance <postgres(at)dunquino(dot)com>
To: pgsql-bugs(at)postgresql(dot)org
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: BUG #6425: Bus error in slot_deform_tuple
Date: 2012-02-02 18:28:12
Message-ID: B6B764C0-60A2-45E1-BE28-A9561487A4E4@dunquino.com
Lists: pgsql-bugs pgsql-hackers

On 2 Feb 2012, at 18:02, Duncan Rance wrote:
>
> At last I have been able to reproduce this problem in a relatively simple (yet contrived) way.

Doh! Should have mentioned this already, but in case a Sparc is not available, the latest on the debugging is as follows:

As well as the bus error, I also saw the same symptom as described in BUG #6200. I changed the four places that do elog(ERROR) with "invalid memory alloc request size" to PANIC instead and got a raft of core files.

I have not dug any further as yet, but I'm looking at the following function on the stack:

char *
text_to_cstring(const text *t)

The values t and tunpacked are the same, so pg_detoast_datum_packed() did not modify t. And len comes out as -4.

A couple of bits from dbx:

(dbx) print -fx t->vl_len_[0]
t->vl_len_[0] = 0xffffff84
(dbx) examine tunpacked /2x
0x0000010000ceb9dc: 0x8474 0x776f

Going to have a look further up the stack now.

Cheers,
Dunc


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Duncan Rance <postgres(at)dunquino(dot)com>
Cc: pgsql-bugs(at)postgresql(dot)org, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: BUG #6425: Bus error in slot_deform_tuple
Date: 2012-02-02 21:44:10
Message-ID: 4909.1328219050@sss.pgh.pa.us
Lists: pgsql-bugs pgsql-hackers

Duncan Rance <postgres(at)dunquino(dot)com> writes:
> At last I have been able to reproduce this problem in a relatively simple (yet contrived) way.

> I've put together a tarball with a few scripts, some to be run on the primary and others to be run on the hot standby. There's a README in there explaining what to do.

So far no luck reproducing any issue with this test case. I am running
two copies of import_loop.sh against the master, per your instructions,
and see occasional deadlock errors there as expected. No errors at all
on the standby though.

One question probably worth asking is what non-default GUC settings are
you using on the master and standby?

(BTW, for anyone else trying this: with multiple copies of
import_loop.sh you will get a lot of "duplicate key" failures with the
test as written. I had better luck after changing import.sh from
ts=$(perl -e 'print time')
to
ts=$$
ie use PID not timestamp as the pseudo-unique key. This could be made
more bulletproof yet, but it didn't seem worth more trouble.)

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Duncan Rance <postgres(at)dunquino(dot)com>
Cc: pgsql-bugs(at)postgresql(dot)org, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: BUG #6425: Bus error in slot_deform_tuple
Date: 2012-02-02 21:56:30
Message-ID: 5101.1328219790@sss.pgh.pa.us
Lists: pgsql-bugs pgsql-hackers

I wrote:
> So far no luck reproducing any issue with this test case.

And I swear my finger had barely left the "send" key when:

TRAP: FailedAssertion("!(((lpp)->lp_flags == 1))", File: "heapam.c", Line: 735)
LOG: server process (PID 24740) was terminated by signal 6: Aborted
DETAIL: Failed process was running: SELECT * FROM repro_02_ref;
LOG: terminating any other active server processes

So:

(1) no need to worry about GUC settings. It's just a shade less
probable than I'd supposed from your message.

(2) I suspect you are not running with asserts enabled. You might
have better luck isolating this if they were on.

I have not gotten very far with the coredump, except to observe that
gdb says the Assert ought to have passed:

(gdb) f 3
#3 0x0000000000475248 in heapgettup_pagemode (scan=0x1457b08,
dir=<optimized out>, nkeys=0, key=0x0) at heapam.c:735
735 Assert(ItemIdIsNormal(lpp));
(gdb) p lpp
$1 = (ItemIdData *) 0x7fea59705d90
(gdb) p *lpp
$2 = {lp_off = 7936, lp_flags = 1, lp_len = 34}
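
(For reference, the macro being asserted is essentially:

#define ItemIdIsNormal(itemId) \
    ((itemId)->lp_flags == LP_NORMAL)    /* LP_NORMAL == 1 */

so with lp_flags = 1 in the core file, the asserted condition holds now,
even though it must have evaluated false when the Assert ran.)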

This suggests very strongly that indeed the buffer was changing under
us.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Duncan Rance <postgres(at)dunquino(dot)com>, pgsql-bugs(at)postgresql(dot)org, Robert Haas <robertmhaas(at)gmail(dot)com>, Bridget Frey <bridget(dot)frey(at)redfin(dot)com>
Subject: Re: BUG #6425: Bus error in slot_deform_tuple
Date: 2012-02-03 06:45:13
Message-ID: 12379.1328251513@sss.pgh.pa.us
Lists: pgsql-bugs pgsql-hackers

I wrote:
> I have not gotten very far with the coredump, except to observe that
> gdb says the Assert ought to have passed: ...
> This suggests very strongly that indeed the buffer was changing under
> us.

I probably ought to let the test case run overnight before concluding
anything, but at this point it's run for two-plus hours with no errors
after applying this patch:

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index cce87a3..b128bfd 100644
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
*************** RestoreBkpBlocks(XLogRecPtr lsn, XLogRec
*** 3716,3724 ****
  		}
  		else
  		{
- 			/* must zero-fill the hole */
- 			MemSet((char *) page, 0, BLCKSZ);
  			memcpy((char *) page, blk, bkpb.hole_offset);
  			memcpy((char *) page + (bkpb.hole_offset + bkpb.hole_length),
  				   blk + bkpb.hole_offset,
  				   BLCKSZ - (bkpb.hole_offset + bkpb.hole_length));
--- 3716,3724 ----
  		}
  		else
  		{
  			memcpy((char *) page, blk, bkpb.hole_offset);
+ 			/* must zero-fill the hole */
+ 			MemSet((char *) page + bkpb.hole_offset, 0, bkpb.hole_length);
  			memcpy((char *) page + (bkpb.hole_offset + bkpb.hole_length),
  				   blk + bkpb.hole_offset,
  				   BLCKSZ - (bkpb.hole_offset + bkpb.hole_length));

The existing code makes the page state transiently invalid (all zeroes)
for no particularly good reason, and consumes useless cycles to do so,
so this would be a good change in any case. The reason it is relevant
to our current problem is that even though RestoreBkpBlocks faithfully
takes exclusive lock on the buffer, *that is not enough to guarantee
that no one else is touching that buffer*. Another backend that has
already located a visible tuple on a page is entitled to keep accessing
that tuple with only a buffer pin. So the existing code transiently
wipes the data from underneath the other backend's pin.
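
(A pseudo-C sketch of that interaction -- illustrative, not the literal
code paths:

/* Reader: locates a visible tuple under the content lock ... */
LockBuffer(buf, BUFFER_LOCK_SHARE);
tup = (HeapTupleHeader) PageGetItem(page, lpp);
LockBuffer(buf, BUFFER_LOCK_UNLOCK);
/* ... then keeps dereferencing tup while holding only the buffer pin. */

/* Old RestoreBkpBlocks: exclusive lock held, but not a cleanup lock. */
MemSet((char *) page, 0, BLCKSZ);    /* reader's tup now points at zeroes */
memcpy((char *) page, blk, bkpb.hole_offset);

A cleanup lock would wait for that pin to be released; a plain exclusive
lock does not.)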

It's clear how this explains the symptoms I saw (Assert reporting wrong
value of lp_flags even though the backend must previously have seen the
right value, and the eventual coredump captured the right value too).
It's less clear though whether this explains the symptoms seen by Duncan
and Bridget. They presumably are running without asserts enabled, so
it's unsurprising that they don't see the Assert failure, but what
happens if control gets past that? There seem to be several possible
failure modes:

* Reader picks up zero lp_off/lp_len from the line pointer, and then
tries to interpret the page header as a tuple. The results would be
predictable only until RestoreBkpBlocks puts back nonzero data there,
and then it's a bit of a mess. (In particular, t_hoff would be read out
of the pd_prune_xid field if I counted right, and so would have a rather
unpredictable value.)

* Reader finds correct location of tuple, but sees t_hoff and/or
t_infomask as zeroes (the latter possibly causing it to not check for
nulls, if it doesn't think HEAP_HASNULL is set). Until RestoreBkpBlocks
puts back the data, this would devolve to the next case, but after that
it's a bit unpredictable again.

* Reader finds correct location of data, but sees zeroes there.

I believe that the reported failures involving palloc(-3) in
text_to_cstring can be explained as instances of seeing zeroes where a
text or varchar value is expected. Zeroes would look like a long-format
varlena header with value zero, and the code would subtract 4 to get the
data length, then add 1 for the trailing NUL byte that's needed in
cstring representation, and thus ask for -3 bytes for the cstring
equivalent. Furthermore, all three of the above cases end up doing that
as long as the page stays all-zero for long enough. If some nonzero
data gets restored before we're done examining the page, you get some
other behavior of doubtful predictability. Maybe the other reported
symptoms always fall out of that, or maybe not --- it seems surprising
that we don't have a wider set of visible misbehaviors.
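
(To spell that arithmetic out against text_to_cstring(), which does
essentially this -- paraphrased from varlena.c:

char *
text_to_cstring(const text *t)
{
    text   *tunpacked = pg_detoast_datum_packed((struct varlena *) t);
    int     len = VARSIZE_ANY_EXHDR(tunpacked);    /* 0 - VARHDRSZ = -4 */
    char   *result = (char *) palloc(len + 1);     /* palloc(-3) fails */

    memcpy(result, VARDATA_ANY(tunpacked), len);
    result[len] = '\0';
    return result;
}

An all-zeroes header word makes VARSIZE_ANY_EXHDR() return 0 - 4 = -4, and
palloc(-3) is exactly the "invalid memory alloc request size" failure
reported upthread.)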

Whether or not this single bug explains all the cases reported so far,
it certainly seems possible that there are other mistakes of the same
sort in individual WAL replay routines. I think we'd better go over
all of them with a fine-tooth comb. In general, a WAL replay routine
can no longer be allowed to create transiently invalid page states
that would not have occurred in the "live" version of the page change.

I am even more troubled than I was before about what this says about
the amount of testing Hot Standby has gotten, because AFAICS absolutely
any use of Hot Standby, no matter the particulars, ought to be heavily
exposed to this bug.

regards, tom lane


From: Bridget Frey <bridget(dot)frey(at)redfin(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Duncan Rance <postgres(at)dunquino(dot)com>, pgsql-bugs(at)postgresql(dot)org, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: BUG #6425: Bus error in slot_deform_tuple
Date: 2012-02-03 07:27:38
Message-ID: CAHOc93kziir_R4+u3ybF8kidKSkpXDNJ5KJ9ynUHfEYuiX7Z-Q@mail.gmail.com
Lists: pgsql-bugs pgsql-hackers

I just wanted to say thanks to everyone who has been working so hard on
this issue. I realize it's not certain that this would fix the issues
we're seeing, but we'd be willing to try it out and report back. The only
caveat is we would need to deploy it to production, so if someone could let
us know what the risk factor is here (e.g. the potential to make things
worse), that would help us plan out how and when we would want to try it.

Thanks again, I'm really hopeful that this will fix the issues we're seeing
- and, if not, at least there seems to be good momentum towards getting to
the root of the problem.
-B

On Thu, Feb 2, 2012 at 10:45 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> [ full analysis and RestoreBkpBlocks patch snipped; see Tom's message above ]

--
Bridget Frey | Director, Data & Analytics Engineering | Redfin

bridget(dot)frey(at)redfin(dot)com | tel: 206.576.5894


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Bridget Frey <bridget(dot)frey(at)redfin(dot)com>
Cc: Duncan Rance <postgres(at)dunquino(dot)com>, pgsql-bugs(at)postgresql(dot)org, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: BUG #6425: Bus error in slot_deform_tuple
Date: 2012-02-03 07:47:15
Message-ID: 13251.1328255235@sss.pgh.pa.us
Lists: pgsql-bugs pgsql-hackers

Bridget Frey <bridget(dot)frey(at)redfin(dot)com> writes:
> I just wanted to say thanks to everyone who has been working so hard on
> this issue. I realize it's not certain that this would fix the issues
> we're seeing, but we'd be willing to try it out and report back. The only
> caveat is we would need to deploy it to production, so if someone could let
> us know what the risk factor is here (e.g. the potential to make things
> worse), that would help us plan out how and when we would want to try it.

AFAICS the proposed patch is totally safe; it can't make things worse,
and should save some cycles to boot. Whether it fixes what you're
seeing is a different question of course, but testing would be mighty
helpful for that.

What I would actually ask about as far as production risk goes is
whether you are accustomed to building from patched sources. If you're
not, the risk of dropping a stitch in the build process could be
significant.

regards, tom lane


From: Duncan Rance <postgres(at)dunquino(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-bugs(at)postgresql(dot)org, Robert Haas <robertmhaas(at)gmail(dot)com>, Bridget Frey <bridget(dot)frey(at)redfin(dot)com>
Subject: Re: BUG #6425: Bus error in slot_deform_tuple
Date: 2012-02-03 15:24:50
Message-ID: 7F89C398-A270-45C4-893C-23AB24A81CFC@dunquino.com
Lists: pgsql-bugs pgsql-hackers

On 3 Feb 2012, at 06:45, Tom Lane wrote:
>
> I probably ought to let the test case run overnight before concluding
> anything, but at this point it's run for two-plus hours with no errors
> after applying this patch:
>
> diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c

Thanks, Tom! I've had this running for a few hours now without problems. Previously, on Sparc, the problem would occur in less than a minute.

I did try a build with --enable-cassert and it didn't actually trigger the problem. I think I left it for about an hour. Although a relatively modern machine, this Sparc box I am using is painfully slow. My guess is that the extra time taken to execute the Assert code is hiding the problem.

Now it's time to persuade the customer to use a patched version of pg ;)

Cheers,
Duncan

P.S. I've been looking for an OS project to contribute to, and I think I'll see if I can help with pg. Time to look at the TODO list :)


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Duncan Rance <postgres(at)dunquino(dot)com>
Cc: pgsql-bugs(at)postgresql(dot)org, Robert Haas <robertmhaas(at)gmail(dot)com>, Bridget Frey <bridget(dot)frey(at)redfin(dot)com>
Subject: Re: BUG #6425: Bus error in slot_deform_tuple
Date: 2012-02-03 16:11:37
Message-ID: 20740.1328285497@sss.pgh.pa.us
Lists: pgsql-bugs pgsql-hackers

Duncan Rance <postgres(at)dunquino(dot)com> writes:
> On 3 Feb 2012, at 06:45, Tom Lane wrote:
>> I probably ought to let the test case run overnight before concluding
>> anything, but at this point it's run for two-plus hours with no errors
>> after applying this patch:

> Thanks, Tom! I've had this running for a few hours now without problems. Previously, on Sparc, the problem would occur in less than a minute.

> I did try a build with --enable-cassert and it didn't actually trigger the problem. I think I left it for about an hour. Although a relatively modern machine, this Sparc box I am using is painfully slow. My guess is that the extra time taken to execute the Assert code is hiding the problem.

My machine has been running the test case for twelve hours now with no
errors, whereas with the bug the MTTF seemed to be half an hour or so.
(Hm, I wonder whether turning off asserts would reduce the time to
failure? Probably not worth the trouble to experiment now.) So I think
we've got it, or at least we've found the problems this test case can
expose. I'm still going to go read all the other WAL replay code...

> Now it's time to persuade the customer to use a patched version of pg ;)

FWIW, this bug might persuade us to do a set of releases pretty soon.

regards, tom lane


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #6425: Bus error in slot_deform_tuple
Date: 2012-02-04 16:11:43
Message-ID: CA+U5nMJkaLowf=Vksbh30MBHMQdT2D65fwZfTWF6SQfbT8429A@mail.gmail.com
Lists: pgsql-bugs pgsql-hackers

On Fri, Feb 3, 2012 at 6:45 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> I wrote:
>> I have not gotten very far with the coredump, except to observe that
>> gdb says the Assert ought to have passed: ...
>> This suggests very strongly that indeed the buffer was changing under
>> us.
>
> I probably ought to let the test case run overnight before concluding
> anything, but at this point it's run for two-plus hours with no errors
> after applying this patch:
>
> [ RestoreBkpBlocks patch snipped; see Tom's message upthread ]
>
> The existing code makes the page state transiently invalid (all zeroes)
> for no particularly good reason, and consumes useless cycles to do so,
> so this would be a good change in any case.  The reason it is relevant
> to our current problem is that even though RestoreBkpBlocks faithfully
> takes exclusive lock on the buffer, *that is not enough to guarantee
> that no one else is touching that buffer*.  Another backend that has
> already located a visible tuple on a page is entitled to keep accessing
> that tuple with only a buffer pin.  So the existing code transiently
> wipes the data from underneath the other backend's pin.
>
> It's clear how this explains the symptoms

Yes, that looks like the murder weapon.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [BUGS] BUG #6425: Bus error in slot_deform_tuple
Date: 2012-02-04 18:37:40
Message-ID: CA+U5nMJ8SjFkkwZ5d3dRORSrgykL3Fwd-1DKcZu25RRi4GXkQg@mail.gmail.com
Lists: pgsql-bugs pgsql-hackers

On Fri, Feb 3, 2012 at 6:45 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> The reason it is relevant
> to our current problem is that even though RestoreBkpBlocks faithfully
> takes exclusive lock on the buffer, *that is not enough to guarantee
> that no one else is touching that buffer*.  Another backend that has
> already located a visible tuple on a page is entitled to keep accessing
> that tuple with only a buffer pin.  So the existing code transiently
> wipes the data from underneath the other backend's pin.

While deciding whether to apply the patch, I'm thinking about whether
we should be doing this at all. We already agreed that backup blocks
were removable from the WAL stream.

The cause here is data changing underneath the user. Your patch solves
the most obvious error, but it still allows other problems if applying
the backup block changes data. If the backup block doesn't do anything
at all then we don't need to apply it either.

So ISTM that we should just skip applying backup blocks over the top
of existing buffers once we have reached consistency.

Patch to do that attached, but the basic part of it is just this...

@@ -3700,8 +3701,21 @@ RestoreBkpBlocks(XLogRecPtr lsn, XLogRecord *record, bool cleanup)
 		memcpy(&bkpb, blk, sizeof(BkpBlock));
 		blk += sizeof(BkpBlock);
 
+		hit = false;
 		buffer = XLogReadBufferExtended(bkpb.node, bkpb.fork, bkpb.block,
-										RBM_ZERO);
+										RBM_ZERO, &hit);
+
+		/*
+		 * If we found the block in shared buffers and we are already
+		 * consistent then skip applying the backup block. The block
+		 * was already removable anyway, so we can skip without problems.
+		 * This avoids us needing to take a cleanup lock in all cases when
+		 * we apply backup blocks because of potential effects on user queries,
+		 * which expect data on blocks to remain constant while being read.
+		 */
+		if (reachedConsistency && hit)
+			continue;
+
 		Assert(BufferIsValid(buffer));
 		if (cleanup)
 			LockBufferForCleanup(buffer);

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [BUGS] BUG #6425: Bus error in slot_deform_tuple
Date: 2012-02-04 18:39:19
Message-ID: CA+U5nM+Kx+=K+QnO6XYXrS69W0Rj4HveKKSwZF2APHczfBLyvQ@mail.gmail.com
Lists: pgsql-bugs pgsql-hackers

On Sat, Feb 4, 2012 at 6:37 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:

> Patch to do that attached

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
skip_backup_blocks.v1.patch text/x-diff 9.9 KB

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [BUGS] BUG #6425: Bus error in slot_deform_tuple
Date: 2012-02-04 18:49:12
Message-ID: 12579.1328381352@sss.pgh.pa.us
Lists: pgsql-bugs pgsql-hackers

Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> The cause here is data changing underneath the user. Your patch solves
> the most obvious error, but it still allows other problems if applying
> the backup block changes data. If the backup block doesn't do anything
> at all then we don't need to apply it either.

This is nonsense. What applying the backup block does is to apply the
change that the WAL record would otherwise have applied, except we
decided to make it store a full-page image instead.

regards, tom lane


From: Bridget Frey <bridget(dot)frey(at)redfin(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-bugs(at)postgresql(dot)org, Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: BUG #6425: Bus error in slot_deform_tuple
Date: 2012-02-04 19:52:09
Message-ID: CAHOc93kky4Dd2UFMaHQqOc=mze+1_ZcUFerr_UJ=QMRM59dtxQ@mail.gmail.com
Lists: pgsql-bugs pgsql-hackers

We deployed the patch to one of our production slaves at 3:30 PM yesterday
(so roughly 20 hours ago), and since then we have not seen any alloc
errors. On Feb 2nd, the last full day in which we ran without the patch,
we saw 13 alloc errors. We're going to continue monitoring this slave, but
we're cautiously optimistic that the patch does address the alloc errors
we've been seeing. It will take a few weeks to be able to definitively see
if it fixes the segfault, too.

Thanks again, Tom, for your efforts on this. We do seem to be in much
better shape than we were before the patch!
-B


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [BUGS] BUG #6425: Bus error in slot_deform_tuple
Date: 2012-02-05 20:42:47
Message-ID: CA+U5nMJYaH3Os6FrQPsYYt3BLBTW_4ExXJ+6Ov9RBpRcu6FWrw@mail.gmail.com
Lists: pgsql-bugs pgsql-hackers

On Sat, Feb 4, 2012 at 6:49 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
>> The cause here is data changing underneath the user. Your patch solves
>> the most obvious error, but it still allows other problems if applying
>> the backup block changes data. If the backup block doesn't do anything
>> at all then we don't need to apply it either.
>
> This is nonsense.  What applying the backup block does is to apply the
> change that the WAL record would otherwise have applied, except we
> decided to make it store a full-page image instead.

Yep, you're right, my bad.

Got a head cold, so will lay off a few days from too much thinking.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Duncan Rance <postgres(at)dunquino(dot)com>, pgsql-bugs(at)postgresql(dot)org, Robert Haas <robertmhaas(at)gmail(dot)com>, Bridget Frey <bridget(dot)frey(at)redfin(dot)com>, "Daniel Farina" <daniel(at)heroku(dot)com>
Subject: Re: BUG #6425: Bus error in slot_deform_tuple
Date: 2012-02-06 20:48:46
Message-ID: 27258.1328561326@sss.pgh.pa.us
Lists: pgsql-bugs pgsql-hackers

[ in re bugs 6200 and 6425 ]

I've committed patches for all the issues I could find pursuant to these
bug reports. Please see if you can break REL9_0_STABLE branch tip
(or 9.1 if that's what you're working with).

regards, tom lane


From: Bridget Frey <bridget(dot)frey(at)redfin(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Duncan Rance <postgres(at)dunquino(dot)com>, pgsql-bugs(at)postgresql(dot)org, Robert Haas <robertmhaas(at)gmail(dot)com>, Daniel Farina <daniel(at)heroku(dot)com>
Subject: Re: BUG #6425: Bus error in slot_deform_tuple
Date: 2012-02-08 05:08:09
Message-ID: CAHOc93mFE_LtLpit1xEfCh3LGR7EDss3vh2xWW3Q9mv7YhPPJQ@mail.gmail.com
Lists: pgsql-bugs pgsql-hackers

Just a quick update, we have now deployed the patch to all three of our
production slave databases, and none has experienced an alloc error or
segfault since receiving the patch. So it's looking very good! We would
not be able to deploy the whole 9.1 stable build to our production
environment since that would require a full round of testing on our part.
But basically 9.1.2 + the patch seems to fix this specific issue for us.
Thanks again, and we'll update this thread if we see any additional issues.
-B


From: Duncan Rance <postgres(at)dunquino(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-bugs(at)postgresql(dot)org, Robert Haas <robertmhaas(at)gmail(dot)com>, Bridget Frey <bridget(dot)frey(at)redfin(dot)com>, "Daniel Farina" <daniel(at)heroku(dot)com>
Subject: Re: BUG #6425: Bus error in slot_deform_tuple
Date: 2012-02-08 10:01:55
Message-ID: D73C8FF3-29AF-4A8C-AF88-62818712D787@dunquino.com
Lists: pgsql-bugs pgsql-hackers

On 6 Feb 2012, at 20:48, Tom Lane wrote:

> bug reports. Please see if you can break REL9_0_STABLE branch tip

Just to let you know that I built this yesterday and I'm giving it a good battering in our Solaris 10 Sparc test environment.

D


From: Duncan Rance <postgres(at)dunquino(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-bugs(at)postgresql(dot)org
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Bridget Frey <bridget(dot)frey(at)redfin(dot)com>, Daniel Farina <daniel(at)heroku(dot)com>
Subject: Re: BUG #6425: Bus error in slot_deform_tuple
Date: 2012-02-09 09:29:47
Message-ID: 14D74270-BEBB-46BA-B0D3-A3F3308657D1@dunquino.com
Lists: pgsql-bugs pgsql-hackers

On 8 Feb 2012, at 10:01, Duncan Rance wrote:

> On 6 Feb 2012, at 20:48, Tom Lane wrote:
>
>> bug reports. Please see if you can break REL9_0_STABLE branch tip
>
> Just to let you know that I built this yesterday and I'm giving it a good battering in our Solaris 10 Sparc test environment.

In this environment my bug repro scripts would produce the problem within seconds. It has now been running for 24 hours, so I'm confident the problem is solved.

Our customers are keen to get the official release as soon as possible. They are on 9.0.6, so I guess this'll be 9.0.7? I'm new here so I don't know how long this might take, and I promised I'll find out for them. Any ideas?

Thanks,
Duncan


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Duncan Rance <postgres(at)dunquino(dot)com>
Cc: pgsql-bugs(at)postgresql(dot)org, Robert Haas <robertmhaas(at)gmail(dot)com>, Bridget Frey <bridget(dot)frey(at)redfin(dot)com>, Daniel Farina <daniel(at)heroku(dot)com>
Subject: Re: BUG #6425: Bus error in slot_deform_tuple
Date: 2012-02-09 15:02:07
Message-ID: 10224.1328799727@sss.pgh.pa.us
Lists: pgsql-bugs pgsql-hackers

Duncan Rance <postgres(at)dunquino(dot)com> writes:
> Our customers are keen to get the official release as soon as possible. They are on 9.0.6, so I guess this'll be 9.0.7? I'm new here so I don't know how long this might take, and I promised I'll find out for them. Any ideas?

There's no firm plan at the moment. The earliest it could happen is
around the end of the month, since various key people have other
commitments in the next couple weeks. I'm not promising it *will*
happen then, but that's the way things look right now.

(Since you're new around here, I'll explain that the way this works
is that the pgsql-core and pgsql-packagers lists agree on a release
date in advance. We've had some preliminary discussions, and people
seem to agree that this is a bad enough bug to force a release, but
no date's been set. Once a schedule decision is made, some core
member --- often me --- will announce it on pgsql-hackers, so you
can keep an eye on that list if you want advance notice.)

regards, tom lane


From: Duncan Rance <postgres(at)dunquino(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #6425: Bus error in slot_deform_tuple
Date: 2012-02-09 15:07:09
Message-ID: C5517898-F368-40A5-B12A-BB225C144A26@dunquino.com
Lists: pgsql-bugs pgsql-hackers

On 9 Feb 2012, at 15:02, Tom Lane wrote:

> Duncan Rance <postgres(at)dunquino(dot)com> writes:
>> Our customers are keen to get the official release as soon as possible. They are on 9.0.6, so I guess this'll be 9.0.7? I'm new here so I don't know how long this might take, and I promised I'll find out for them. Any ideas?
>
> There's no firm plan at the moment. The earliest it could happen is
> around the end of the month, since various key people have other
> commitments in the next couple weeks. I'm not promising it *will*
> happen then, but that's the way things look right now.
>
> (Since you're new around here, I'll explain that the way this works
> is that the pgsql-core and pgsql-packagers lists agree on a release
> date in advance. We've had some preliminary discussions, and people
> seem to agree that this is a bad enough bug to force a release, but
> no date's been set. Once a schedule decision is made, some core
> member --- often me --- will announce it on pgsql-hackers, so you
> can keep an eye on that list if you want advance notice.)
>
> regards, tom lane

Good explanation. Thanks Tom!