Re: gcc versus division-by-zero traps

Lists: pgsql-hackers
From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: gcc versus division-by-zero traps
Date: 2009-09-03 14:24:17
Message-ID: 14006.1251987857@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

We have seen several previous reports of regression test failures
due to division by zero causing SIGFPE, even though the code
should never reach the division command:

http://archives.postgresql.org/pgsql-bugs/2006-11/msg00180.php
http://archives.postgresql.org/pgsql-bugs/2007-11/msg00032.php
http://archives.postgresql.org/pgsql-bugs/2008-05/msg00148.php
http://archives.postgresql.org/pgsql-general/2009-05/msg00774.php

It's always been on non-mainstream architectures so it was hard
to investigate. But I have finally been able to reproduce this:
https://bugzilla.redhat.com/show_bug.cgi?id=520916

While s390x is still not quite mainstream, at least I can get
access to one ;-). What turns out to be the case is that
"simple" test cases like
if (y == 0)
single_function_call(...);
z = x / y;
do not show the problem; you need something pretty complex in the
if-command. Like, say, an ereport() construct. So that's why the gcc
boys haven't already had visits from mobs of villagers about this.

I hope that the bug will get fixed in due course, but even if they
respond pretty quickly it will be years before the problem disappears
from every copy of gcc in the field. So I'm thinking that it would
behoove us to install a workaround, now that we've characterized the
problem sufficiently. What I am thinking is that in the three
functions known to exhibit the bug (int24div, int28div, int48div)
we should do something like this:

if (arg2 == 0)
+ {
ereport(ERROR,
(errcode(ERRCODE_DIVISION_BY_ZERO),
errmsg("division by zero")));
+ /* ensure compiler realizes we don't reach the division */
+ PG_RETURN_NULL();
+ }
/* No overflow is possible */
PG_RETURN_INT64((int64) arg1 / arg2);

Thoughts?

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: David Fetter <david(at)fetter(dot)org>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: gcc versus division-by-zero traps
Date: 2009-09-03 17:26:52
Message-ID: 19979.1251998812@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

David Fetter <david(at)fetter(dot)org> writes:
> On Thu, Sep 03, 2009 at 10:24:17AM -0400, Tom Lane wrote:
>> While s390x is still not quite mainstream, at least I can get
>> access to one ;-).

> Do you also have access to z/OS with Unix System Services?

No, Red Hat's machines run RHEL ;-)

>> What I am thinking is that in the three
>> functions known to exhibit the bug (int24div, int28div, int48div)
>> we should do something like this:

> How big would this change be? How would people know to use that
> construct everywhere it's appropriate?

I'm talking about patching exactly those three functions. We don't
have any reports of trouble elsewhere. The long-term fix is in the
compiler anyway, this is just a workaround for currently-known issues.

Part of my motivation for this is to get rid of an existing hack in
the Red Hat RPMs:

# use -O1 on sparc64 and alpha
%ifarch sparc64 alpha
CFLAGS=`echo $CFLAGS| sed -e "s|-O2|-O1|g" `
%endif

I believe that that is only there to prevent exactly this problem.
Anyway I'd like to try removing it after patching as proposed, and
then we'll find out if there are other trouble spots ...

regards, tom lane


From: David Fetter <david(at)fetter(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: gcc versus division-by-zero traps
Date: 2009-09-03 17:50:03
Message-ID: 20090903175003.GU8410@fetter.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Sep 03, 2009 at 01:26:52PM -0400, Tom Lane wrote:
> David Fetter <david(at)fetter(dot)org> writes:
> > On Thu, Sep 03, 2009 at 10:24:17AM -0400, Tom Lane wrote:
> >> While s390x is still not quite mainstream, at least I can get
> >> access to one ;-).
>
> > Do you also have access to z/OS with Unix System Services?
>
> No, Red Hat's machines run RHEL ;-)

I'm given to understand it's possible to run both on the same
hardware.

Cheers,
David
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david(dot)fetter(at)gmail(dot)com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: gcc versus division-by-zero traps
Date: 2009-09-17 22:21:30
Message-ID: 200909172221.n8HMLUX13580@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
> I hope that the bug will get fixed in due course, but even if they
> respond pretty quickly it will be years before the problem disappears
> from every copy of gcc in the field. So I'm thinking that it would
> behoove us to install a workaround, now that we've characterized the
> problem sufficiently. What I am thinking is that in the three
> functions known to exhibit the bug (int24div, int28div, int48div)
> we should do something like this:
>
>
> if (arg2 == 0)
> + {
> ereport(ERROR,
> (errcode(ERRCODE_DIVISION_BY_ZERO),
> errmsg("division by zero")));
> + /* ensure compiler realizes we don't reach the division */
> + PG_RETURN_NULL();
> + }

Could we document this a little better, perhaps refer to a file
somewhere or a central comment on its purpose?

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +