Re: Strange hanging bug in a simple milter

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: Vesa-Matti J Kari <vmkari(at)cc(dot)helsinki(dot)fi>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Strange hanging bug in a simple milter
Date: 2013-09-09 16:10:54
Message-ID: 522DF30E.90605@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 09.09.2013 18:20, Stephen Frost wrote:
> Vesa-Matti, Heikki,
>
> * Heikki Linnakangas (hlinnakangas(at)vmware(dot)com) wrote:
>> On 09.09.2013 15:36, Vesa-Matti J Kari wrote:
>>> If I interpret this correctly, threads #2 and #3 are waiting for the same
>>> lock but they make no progress.
>>
>> A-ha, the deadlock happens while doing SSL stuff. I didn't have SSL
>> enabled in my test server. As soon as I turned it on, it hung.
>>
>> Attached is a small stand-alone test program to reproduce it. You
>> can pass a libpq connection string as argument to it.
>
> Interesting... Which version of libpq were you working against? I see
> that Vesa-Matti had the problem happen w/ 9.1.9, which should have been
> before the changes that I made to add locking around our usage of
> SSL_context, as otherwise we would end up in situations where we'd dump
> core, but he also had it with 9.3rc1, which should have included it. I
> had tested the patch w/ a pretty good amount of concurrent threads fired
> off from a little python script and didn't run into any deadlocks
> there..

> Vesa-Matti, was this working previously, and are you sure you were
> testing with 9.3rc1's libpq? Heikki, which are you testing against and
> perhaps you might try before and after? I'll be able to look into it
> more in a few hours also.

Thanks! I tested with git master.

I added printf()s into the pq_lockingcallback function, and got a trace
where both threads got stuck waiting for lock 10 in the pq_lockarray. It
looks like someone is failing to release it. The backtrace for both
threads look like this:

#0 __lll_lock_wait () at
../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1 0x00007fad49826f3c in _L_lock_974 () from
/lib/x86_64-linux-gnu/libpthread.so.0
#2 0x00007fad49826d8b in __GI___pthread_mutex_lock
(mutex=0x7fad3800a260) at pthread_mutex_lock.c:64
#3 0x00007fad49a53f08 in pq_lockingcallback (line=175,
file=0x7fad48fa8bb3 "x_pubkey.c", n=10, mode=<optimized out>) at
fe-secure.c:872
#4 pq_lockingcallback (mode=<optimized out>, n=10, file=0x7fad48fa8bb3
"x_pubkey.c", line=175) at fe-secure.c:868
#5 0x00007fad48f396ab in X509_PUBKEY_get () from
/usr/lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#6 0x00007fad48f56292 in X509_get_pubkey_parameters () from
/usr/lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#7 0x00007fad48f5649c in X509_verify_cert () from
/usr/lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#8 0x00007fad4924f14a in ?? () from
/usr/lib/x86_64-linux-gnu/libssl.so.1.0.0
#9 0x00007fad4922ce1c in ?? () from
/usr/lib/x86_64-linux-gnu/libssl.so.1.0.0
#10 0x00007fad492310d2 in ?? () from
/usr/lib/x86_64-linux-gnu/libssl.so.1.0.0
#11 0x00007fad49a54c67 in open_client_SSL (conn=0x7fad380397d0) at
fe-secure.c:1463
#12 pqsecure_open_client (conn=conn(at)entry=0x7fad380397d0) at fe-secure.c:306
#13 0x00007fad49a44fb6 in PQconnectPoll (conn=conn(at)entry=0x7fad380397d0)
at fe-connect.c:2123
#14 0x00007fad49a4618e in connectDBComplete
(conn=conn(at)entry=0x7fad380397d0) at fe-connect.c:1521
#15 0x00007fad49a46b47 in PQconnectdb (conninfo=<optimized out>) at
fe-connect.c:516
#16 0x00000000004007b6 in test_connect (threadid=2) at threaded-connect.c:25
#17 0x000000000040086e in run_thread (arg=0x600e04 <two>) at
threaded-connect.c:55
#18 0x00007fad49824e0e in start_thread (arg=0x7fad48203700) at
pthread_create.c:311
#19 0x00007fad4955993d in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Sometimes the lockup happens differently, with one thread hung up in
SSL_init() and another waiting for the first one on ssl_config_mutex,
for example.

A good next step might be to create a standalone program that doesn't
use libpq at all, but just calls X509_verify_cert() concurrently in two
threads. Or open plain SSL connections. If the deadlock can be
reproduced with that, then we could just report the bug to the OpenSSL
and hope that they can figure it out.

- Heikki

Attachment Content-Type Size
pq_lockingcallback-debug-1.patch text/x-diff 728 bytes
lock-trace.gz application/x-gzip 14.7 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2013-09-09 16:21:31 Re: dynamic shared memory
Previous Message Robert Haas 2013-09-09 16:09:05 Re: improve Chinese locale performance