Re: xlc atomics

From: Noah Misch <noah(at)leadboat(dot)com>
To: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Peter Geoghegan <pg(at)heroku(dot)com>, Ants Aasma <ants(at)cybertec(dot)at>
Subject: Re: xlc atomics
Date: 2016-02-15 17:11:29
Message-ID: 20160215171129.GA347322@tornado.leadboat.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, Jul 04, 2015 at 08:07:49PM -0400, Noah Misch wrote:
> On Sun, Jul 05, 2015 at 12:54:43AM +0200, Andres Freund wrote:
> > On 2015-07-04 18:40:41 -0400, Noah Misch wrote:
> > > (1) "IBM XL C/C++ for AIX, V12.1 (5765-J02, 5725-C72)". Getting it working
> > > required the attached patch.
>
> > Will you apply? Having the ability to test change seems to put you in a
> > much better spot then me.
>
> I will.

That commit (0d32d2e) permitted things to compile and usually pass tests, but
I missed the synchronization bug. Since 2015-10-01, the buildfarm has seen
sixteen duplicate-catalog-OID failures. The compiler has always been xlC, and
the branch has been HEAD or 9.5. Examples:

http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hornet&dt=2015-12-15%2004%3A12%3A49
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hornet&dt=2015-12-08%2001%3A20%3A16
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mandrill&dt=2015-12-11%2005%3A25%3A54

These suggested OidGenLock wasn't doing its job. I've seen similar symptoms
around WALInsertLocks with "IBM XL C/C++ for Linux, V13.1.2 (5725-C73,
5765-J08)" for ppc64le. The problem is generic-xlc.h
pg_atomic_compare_exchange_u32_impl() issuing __isync() before
__compare_and_swap(). __isync() shall follow __compare_and_swap(); see our
own s_lock.h, its references, and other projects' usage:

https://github.com/boostorg/smart_ptr/blob/boost-1.60.0/include/boost/smart_ptr/detail/sp_counted_base_vacpp_ppc.hpp#L55
http://redmine.gromacs.org/projects/gromacs/repository/entry/src/external/thread_mpi/include/thread_mpi/atomic/xlc_ppc.h?rev=v5.1.2#L112
https://www-01.ibm.com/support/knowledgecenter/SSGH2K_13.1.2/com.ibm.xlc131.aix.doc/language_ref/asm_example.html

This patch's test case would have failed about half the time under today's
generic-xlc.h. Fast machines run it in about 1s. A test that detects the bug
99% of the time would run far longer, hence this compromise.

Archaeology enthusiasts may enjoy the bug's changing presentation over time.
LWLockAcquire() exposed it after commit ab5194e redesigned lwlock.c in terms
of atomics. Assert builds were unaffected for commits in [ab5194e, bbfd7ed);
atomics.h asserts fattened code enough to mask the bug. However, removing the
AssertPointerAlignment() from pg_atomic_fetch_or_u32() would have sufficed to
surface it in assert builds. All builds demonstrated the bug once bbfd7ed
introduced xlC pg_attribute_noreturn(), which elicits a simpler instruction
sequence around the ExceptionalCondition() call embedded in each Assert.

nm

Attachment Content-Type Size
isync-xlc-v1.patch text/plain 3.9 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Fujii Masao 2016-02-15 17:18:15 Re: Existence check for suitable index in advance when concurrently refreshing.
Previous Message Tom Lane 2016-02-15 16:25:21 Re: [WIP] ALTER ... OWNER TO ... CASCADE