Re: 9.4 beta1 crash on Debian sid/i386

From: Christoph Berg <cb(at)df7cb(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 9.4 beta1 crash on Debian sid/i386
Date: 2014-05-18 09:08:34
Message-ID: 20140518090834.GA18253@msgid.df7cb.de
Lists: pgsql-hackers

Re: Tom Lane 2014-05-18 <9058(dot)1400385611(at)sss(dot)pgh(dot)pa(dot)us>
> Christoph Berg <cb(at)df7cb(dot)de> writes:
> > Re: Tom Lane 2014-05-14 <1357(dot)1400028161(at)sss(dot)pgh(dot)pa(dot)us>
> >> It would appear that something is wrong with check_stack_depth(),
> >> and/or getrlimit(RLIMIT_STACK) is lying to us about the available stack.
>
> > ulimit -s is 8192 (kB); max_stack_depth is 2MB.
>
> > check_stack_depth looks right, max_stack_depth_bytes there is 2097152
> > and I can see stack_base_ptr - &stack_top_loc grow over repeated
> > invocations of the function (stack_depth itself is optimized out).
> > Still, it never enters "if (stack_depth > max_stack_depth_bytes...)".
>
> Hm. Did you check that stack_base_ptr is non-NULL? If it were somehow
> not getting set, that would disable the error report. But on most
> architectures that would also result in silly values for the pointer
> difference, so I doubt this is the issue.

stack_base_ptr was non-NULL. The stack size started at around 3 or 5 kB
(I don't remember exactly) and grew by a few hundred bytes in
each iteration, so this looked sane.
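
For readers following along, the check being discussed can be sketched
roughly as below. This is a simplified illustration, not the actual
PostgreSQL source: the names stack_base_ptr, max_stack_depth_bytes and
stack_top_loc are taken from this thread, while stack_is_too_deep() and
recurse_until_limit() are hypothetical helpers made up for the example.
The pointer difference is computed through intptr_t, which is an
assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Simplified sketch of a stack depth check (hypothetical helper, not the
 * real check_stack_depth()).  The depth is estimated as the distance
 * between a reference address near the stack base and the address of a
 * local variable in the current frame.
 */
bool stack_is_too_deep(const char *stack_base_ptr, long max_stack_depth_bytes)
{
    char     stack_top_loc;
    intptr_t stack_depth;

    /* A NULL base pointer disables the check entirely. */
    if (stack_base_ptr == NULL)
        return false;

    stack_depth = (intptr_t) stack_base_ptr - (intptr_t) &stack_top_loc;

    /* Take the absolute value so the stack growth direction does not matter. */
    if (stack_depth < 0)
        stack_depth = -stack_depth;

    return stack_depth > (intptr_t) max_stack_depth_bytes;
}

/*
 * Hypothetical demo: recurse until the check trips and return the
 * recursion level reached, or -1 if a safety cap is hit first.
 */
int recurse_until_limit(const char *stack_base_ptr, long max_stack_depth_bytes,
                        int level)
{
    volatile char pad[256];     /* make every frame occupy some stack */
    int           result;

    assert(max_stack_depth_bytes > 0);
    pad[0] = (char) level;

    if (level >= 10000)
        return -1;              /* safety cap for the demo */
    if (stack_is_too_deep(stack_base_ptr, max_stack_depth_bytes))
        return level;

    result = recurse_until_limit(stack_base_ptr, max_stack_depth_bytes,
                                 level + 1);
    pad[1] = (char) result;     /* work after the call prevents tail-call elimination */
    return result;
}
```

A caller would pass the address of one of its own local variables as
stack_base_ptr and a small limit such as 65536; the recursion should then
stop once the measured depth exceeds that limit. In the failing case
described above, the measured depth grows but the comparison never fires.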

> > Interestingly, the Debian buildd managed to run the testsuite for
> > i386, while I could reproduce the problem on the pgapt build machine
> > and on my notebook, so there must be some system difference. Possibly
> > the reason is these two machines are running a 64bit kernel and I'm
> > building in a 32bit chroot, though that hasn't been a problem before.
>
> I'm suspicious that something has changed in your build environment,
> because that stack-checking logic hasn't changed since these commits:

It's something in the combination of build and runtime environment. I
can reproduce the problem with the package that the Debian
i386/experimental buildd compiled, even though that package passed the
regression tests there. Possibly there was a change in libc. I'll try to
ask some kernel/libc people if they have an idea. My current bet is on
the gcc hardening flags we are using.

> The lack of reports from the buildfarm or other users is also evidence
> against there being a widespread issue here.

The only animal running Debian testing/unstable I can see is dugong,
which is ia64, an architecture that was removed from Debian some months ago.
I guess I should look into setting up a new animal for this.

> A different thought: I have heard of environments in which the available
> stack depth is much less than what ulimit would suggest because the ulimit
> space gets split up for multiple per-thread stacks. That should not be
> happening in a Postgres backend, since we don't do threading, but I'm
> running out of ideas to investigate ...

I've done some builds now and there's no clear picture yet of when the
problem occurs. Still trying...
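
One thing that can be compared between environments is what
getrlimit(RLIMIT_STACK) actually reports inside each of them. A minimal
sketch follows; stack_soft_limit_bytes() and report_stack_limit() are
hypothetical helper names, and the -1/-2 return codes are a convention
of this example only.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/resource.h>

/*
 * Return the soft RLIMIT_STACK in bytes, -1 if it is unlimited,
 * or -2 if getrlimit() fails.
 */
long long stack_soft_limit_bytes(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_STACK, &rl) != 0)
        return -2;
    if (rl.rlim_cur == RLIM_INFINITY)
        return -1;
    return (long long) rl.rlim_cur;
}

/* Print the limit in a form that is easy to compare with "ulimit -s". */
void report_stack_limit(FILE *out)
{
    long long limit = stack_soft_limit_bytes();

    assert(out != NULL);
    if (limit == -2)
        fprintf(out, "getrlimit(RLIMIT_STACK) failed\n");
    else if (limit == -1)
        fprintf(out, "stack limit: unlimited\n");
    else
        fprintf(out, "stack limit: %lld kB\n", limit / 1024);
}
```

Calling report_stack_limit(stdout) from a small test program run inside
the 32bit chroot and on the buildd would show whether the two
environments see the same limit.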

Christoph
--
cb(at)df7cb(dot)de | http://www.df7cb.de/
