Re: Strange hanging bug in a simple milter

Lists: pgsql-hackers
From: Vesa-Matti J Kari <vmkari(at)cc(dot)helsinki(dot)fi>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Strange hanging bug in a simple milter
Date: 2013-09-09 06:34:38
Message-ID: alpine.LRH.2.00.1309090930410.14908@ruuvi.it.helsinki.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


Hello PostgreSQL gurus,

(I have already posted a very similar message to comp.mail.sendmail
newsgroup on August 22nd, but I haven't received any responses there. I
have also tried pgsql-interfaces(at)postgresql(dot)org but to no avail. Solving
this problem requires some Sendmail/Postfix experience because the MTA
needs to be configured to use the authmilter. Also some basic Milter API
knowledge is needed: https://www.milter.org/developers/api/index)

I have come across a very strange bug and I do not understand why it is
occurring. Fortunately it is reproducible in a pretty deterministic
manner.

The goal: to create a milter for limiting how many nrcpts an authenticated
user can send in a 24 hour time interval. The related data is stored in a
PostgreSQL database.

The problem: my alpha-stage code for authmilter simply hangs when
processing two concurrent connections from Sendmail. For
authmilter-simplified, I have removed the callbacks envfrom and envrcpt,
because they are not needed in order to demonstrate the bug.

Basically all that the authmilter now does is to connect to PostgreSQL in
authmilt_connect() and close the connection in authmilt_close(). Based on
the authmilter debug logging it seems to me that when the hanging occurs,
the authmilter never completes PQsetdbLogin().

Based on the document

http://www.postgresql.org/docs/9.1/interactive/libpq-threading.html

I am sure libpq should be thread safe and on startup
the authmilter verifies that the libpq is indeed thread safe.

If you think you could set up a test enviroment using:

- Sendmail (or maybe Postfix)
- authmilter
- PostgreSQL

here is an authmilter-simplied.tar.gz package for you:

http://www.helsinki.fi/~vmkari/authmilter-simplified.tar.gz

The README file contains a rough instructions outline on how to setup
things in order to reproduce the strange hanging bug.

Please note that when running two test message sender scripts in parallel,
the bug does not occur immediately, but only after between 1 to 5 minutes
of processing. Sometimes it may take even longer.

I have tested authmilter on Ubuntu Linux 12.04 having packages:

libmilter-dev 8.14.4-2ubuntu2
libmilter1.0.1 8.14.4-2ubuntu2
libpq-dev 9.1.9-0ubuntu12.04
libpq5 9.1.9-0ubuntu12.04
postgresql-9.1 9.1.9-0ubuntu12.04
postgresql-client-9.1 9.1.9-0ubuntu12.04
postgresql-client-common 129ubuntu1
postgresql-common 129ubuntu1
postgresql-contrib-9.1 9.1.9-0ubuntu12.04
postgresql-doc-9.1 9.1.9-0ubuntu12.04
postgresql-server-dev-9.1 9.1.9-0ubuntu12.04
postgresql-server-dev-all 129ubuntu1
sendmail 8.14.4-2ubuntu2
sendmail-base 8.14.4-2ubuntu2
sendmail-bin 8.14.4-2ubuntu2
sendmail-cf 8.14.4-2ubuntu2
sendmail-doc 8.14.4-2ubuntu2

I suspect there could be something wrong with libpq and libmilter working
together, but I am not sure.

Many thanks for any help you can provide.

PS. I installed PostgreSQL 9.3rc1 and linked my milter against the libpq
that comes with that version. The hanging bug still occurs.

Regards,
vmk
--
************************************************************************
Tietotekniikkakeskus / Helsingin yliopisto
IT department / University of Helsinki
************************************************************************


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Vesa-Matti J Kari <vmkari(at)cc(dot)helsinki(dot)fi>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Strange hanging bug in a simple milter
Date: 2013-09-09 11:58:43
Message-ID: 522DB7F3.2010203@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 09.09.2013 09:34, Vesa-Matti J Kari wrote:
> Basically all that the authmilter now does is to connect to PostgreSQL in
> authmilt_connect() and close the connection in authmilt_close(). Based on
> the authmilter debug logging it seems to me that when the hanging occurs,
> the authmilter never completes PQsetdbLogin().
>
> Based on the document
>
> http://www.postgresql.org/docs/9.1/interactive/libpq-threading.html
>
> I am sure libpq should be thread safe and on startup
> the authmilter verifies that the libpq is indeed thread safe.
>
>
> If you think you could set up a test enviroment using:
>
> - Sendmail (or maybe Postfix)
> - authmilter
> - PostgreSQL
>
> here is an authmilter-simplied.tar.gz package for you:
>
> http://www.helsinki.fi/~vmkari/authmilter-simplified.tar.gz
>
> The README file contains a rough instructions outline on how to setup
> things in order to reproduce the strange hanging bug.
>
>
> Please note that when running two test message sender scripts in parallel,
> the bug does not occur immediately, but only after between 1 to 5 minutes
> of processing. Sometimes it may take even longer.

I managed to set that up and got it running. But it works fine for me,
does not hang.

I'd suggest poking around with gdb, to see where it hangs. Also, run
"select * from pg_stat_activity" from a psql session to see what's
happening inside the database. log_connections=on and
log_disconnections=on would also be a good idea.

PS. You'll need to escape the strings in the queries, to avoid SQL
injection.

- Heikki


From: Vesa-Matti J Kari <vmkari(at)cc(dot)helsinki(dot)fi>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Strange hanging bug in a simple milter
Date: 2013-09-09 12:36:15
Message-ID: alpine.LRH.2.00.1309091517170.7114@ruuvi.it.helsinki.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


Hello,

On Mon, 9 Sep 2013, Heikki Linnakangas wrote:

> I managed to set that up and got it running.

Many thanks for taking the time.

> But it works fine for me, does not hang.

Okay. Have you tried increasing the iterations for the smtp sender
scripts? And could you please specify what is your test environment like
(i.e. OS and the related library versions)?

> I'd suggest poking around with gdb, to see where it hangs.

I have actually done that, but it only show the main listener thread from
the libmilter library:

(gdb) bt
#0 0x00007fe64bdd0313 in poll () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007fe64c4f7b46 in mi_listener () from /usr/lib/libmilter.so.1.0.1
#2 0x00007fe64c4f8707 in smfi_main () from /usr/lib/libmilter.so.1.0.1
#3 0x0000000000402c8f in main (argc=15, argv=0x7fffa6560e68) at
authmilter.c:699

Hmmm. The man page mentioned no threads, but Google was helpful and
suggested "info threads" so here goes:

(I hope alpine will not wrap these long lines)

(gdb) info threads
Id Target Id Frame
9 Thread 0x7fe64700c700 (LWP 14362) "authmilter" 0x00007fe64c0b69f7 in do_sigwait () from /lib/x86_64-linux-gnu/libpthread.so.0
8 Thread 0x7fe64680b700 (LWP 14363) "authmilter" 0x00007fe64bdd0313 in poll () from /lib/x86_64-linux-gnu/libc.so.6
7 Thread 0x7fe645809700 (LWP 14365) "authmilter" 0x00007fe64c0b589c in __lll_lock_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
6 Thread 0x7fe645008700 (LWP 22404) "authmilter" 0x00007fe64c0b589c in __lll_lock_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
5 Thread 0x7fe64600a700 (LWP 27263) "authmilter" 0x00007fe64c0b589c in __lll_lock_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
4 Thread 0x7fe644807700 (LWP 27264) "authmilter" 0x00007fe64c0b589c in __lll_lock_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
3 Thread 0x7fe62ffff700 (LWP 27283) "authmilter" 0x00007fe64c0b589c in __lll_lock_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
2 Thread 0x7fe62f7fe700 (LWP 27284) "authmilter" 0x00007fe64c0b589c in __lll_lock_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
* 1 Thread 0x7fe64c8fd740 (LWP 14361) "authmilter" 0x00007fe64bdd0313 in poll () from /lib/x86_64-linux-gnu/libc.so.6

It looks like a deadlock situation of some kind...

(gdb) thread 2
[Switching to thread 2 (Thread 0x7fe62f7fe700 (LWP 27284))]
#0 0x00007fe64c0b589c in __lll_lock_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
(gdb) bt
#0 0x00007fe64c0b589c in __lll_lock_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
#1 0x00007fe64c0b1065 in _L_lock_858 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2 0x00007fe64c0b0eba in pthread_mutex_lock () from /lib/x86_64-linux-gnu/libpthread.so.0
#3 0x00007fe64c2df200 in ?? () from /usr/lib/libpq.so.5
#4 0x00007fe64b78a5f5 in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#5 0x00007fe64b77a915 in RSA_new_method () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#6 0x00007fe64b77d64d in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#7 0x00007fe64b7b9bf2 in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#8 0x00007fe64b7bc6d1 in ASN1_item_ex_d2i () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#9 0x00007fe64b7bd0c4 in ASN1_item_d2i () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#10 0x00007fe64b77ea2f in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#11 0x00007fe64b7b461a in X509_PUBKEY_get () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#12 0x00007fe64b7d119a in X509_get_pubkey_parameters () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#13 0x00007fe64b7d1398 in X509_verify_cert () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#14 0x00007fe64bac52f8 in ?? () from /lib/x86_64-linux-gnu/libssl.so.1.0.0
#15 0x00007fe64baa2ef3 in ?? () from /lib/x86_64-linux-gnu/libssl.so.1.0.0
#16 0x00007fe64baa7222 in ?? () from /lib/x86_64-linux-gnu/libssl.so.1.0.0
#17 0x00007fe64c2dffcb in ?? () from /usr/lib/libpq.so.5
#18 0x00007fe64c2d0c5e in PQconnectPoll () from /usr/lib/libpq.so.5
#19 0x00007fe64c2d1e3e in ?? () from /usr/lib/libpq.so.5
#20 0x00007fe64c2d26f8 in PQsetdbLogin () from /usr/lib/libpq.so.5
#21 0x0000000000401ba5 in authmilt_connect (ctx=0xe81b60, hostname=0x7fe6200008c0 "localhost", hostaddr=0x7fe62f7fdce0) at authmilter.c:212
#22 0x00007fe64c4f69dc in ?? () from /usr/lib/libmilter.so.1.0.1
#23 0x00007fe64c4f5f5f in mi_engine () from /usr/lib/libmilter.so.1.0.1
#24 0x00007fe64c4fada6 in ?? () from /usr/lib/libmilter.so.1.0.1
#25 0x00007fe64c0aee9a in start_thread () from
/lib/x86_64-linux-gnu/libpthread.so.0
#26 0x00007fe64bddbccd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#27 0x0000000000000000 in ?? ()

(gdb) thread 3
[Switching to thread 3 (Thread 0x7fe62ffff700 (LWP 27283))]
#0 0x00007fe64c0b589c in __lll_lock_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
(gdb) bt
#0 0x00007fe64c0b589c in __lll_lock_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
#1 0x00007fe64c0b1065 in _L_lock_858 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2 0x00007fe64c0b0eba in pthread_mutex_lock () from /lib/x86_64-linux-gnu/libpthread.so.0
#3 0x00007fe64c2df200 in ?? () from /usr/lib/libpq.so.5
#4 0x00007fe64b78a5f5 in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#5 0x00007fe64b77a915 in RSA_new_method () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#6 0x00007fe64b77d64d in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#7 0x00007fe64b7b9bf2 in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#8 0x00007fe64b7bc6d1 in ASN1_item_ex_d2i () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#9 0x00007fe64b7bd0c4 in ASN1_item_d2i () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#10 0x00007fe64b77ea2f in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#11 0x00007fe64b7b461a in X509_PUBKEY_get () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#12 0x00007fe64b7d119a in X509_get_pubkey_parameters () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#13 0x00007fe64b7d1398 in X509_verify_cert () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#14 0x00007fe64bac52f8 in ?? () from /lib/x86_64-linux-gnu/libssl.so.1.0.0
#15 0x00007fe64baa2ef3 in ?? () from /lib/x86_64-linux-gnu/libssl.so.1.0.0
#16 0x00007fe64baa7222 in ?? () from /lib/x86_64-linux-gnu/libssl.so.1.0.0
#17 0x00007fe64c2dffcb in ?? () from /usr/lib/libpq.so.5
#18 0x00007fe64c2d0c5e in PQconnectPoll () from /usr/lib/libpq.so.5
#19 0x00007fe64c2d1e3e in ?? () from /usr/lib/libpq.so.5
#20 0x00007fe64c2d26f8 in PQsetdbLogin () from /usr/lib/libpq.so.5
#21 0x0000000000401ba5 in authmilt_connect (ctx=0xe818e0, hostname=0x7fe6280008c0 "localhost", hostaddr=0x7fe62fffece0) at authmilter.c:212
#22 0x00007fe64c4f69dc in ?? () from /usr/lib/libmilter.so.1.0.1
#23 0x00007fe64c4f5f5f in mi_engine () from /usr/lib/libmilter.so.1.0.1
#24 0x00007fe64c4fada6 in ?? () from /usr/lib/libmilter.so.1.0.1
#25 0x00007fe64c0aee9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#26 0x00007fe64bddbccd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#27 0x0000000000000000 in ?? ()

If I interpret this correctly, threads #2 and #3 are waiting for the same
lock but they make no progress.

> Also, run "select *
> from pg_stat_activity" from a psql session to see what's happening inside the
> database.

It shows no activity whatsoever except the pg_stat_activity query itself.

> log_connections=on and log_disconnections=on would also be a good
> idea.

Okay. Found these when the hanging occurred:

2013-09-09 15:14:28 EEST LOG: connection received: host=127.0.0.1
port=51519

2013-09-09 15:14:28 EEST LOG: connection received: host=127.0.0.1
port=51520

> PS. You'll need to escape the strings in the queries, to avoid SQL injection.

Yes, thanks for the tip. I believed {auth_authen} would be safe coming
from Sendmail, but you're right that it is better to be safe than sorry.
I never intended to "release" this code yet, but was forced to do so
because I cannot figure out where bug is.

Regards,
vmk
--
************************************************************************
Tietotekniikkakeskus / Helsingin yliopisto
IT department / University of Helsinki
************************************************************************


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Vesa-Matti J Kari <vmkari(at)cc(dot)helsinki(dot)fi>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Strange hanging bug in a simple milter
Date: 2013-09-09 15:02:07
Message-ID: 522DE2EF.3050006@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 09.09.2013 15:36, Vesa-Matti J Kari wrote:
> It looks like a deadlock situation of some kind...
>
> (gdb) thread 2
> [Switching to thread 2 (Thread 0x7fe62f7fe700 (LWP 27284))]
> #0 0x00007fe64c0b589c in __lll_lock_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
> (gdb) bt
> #0 0x00007fe64c0b589c in __lll_lock_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
> #1 0x00007fe64c0b1065 in _L_lock_858 () from /lib/x86_64-linux-gnu/libpthread.so.0
> #2 0x00007fe64c0b0eba in pthread_mutex_lock () from /lib/x86_64-linux-gnu/libpthread.so.0
> #3 0x00007fe64c2df200 in ?? () from /usr/lib/libpq.so.5
> #4 0x00007fe64b78a5f5 in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
> #5 0x00007fe64b77a915 in RSA_new_method () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
> #6 0x00007fe64b77d64d in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
> #7 0x00007fe64b7b9bf2 in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
> #8 0x00007fe64b7bc6d1 in ASN1_item_ex_d2i () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
> #9 0x00007fe64b7bd0c4 in ASN1_item_d2i () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
> #10 0x00007fe64b77ea2f in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
> #11 0x00007fe64b7b461a in X509_PUBKEY_get () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
> #12 0x00007fe64b7d119a in X509_get_pubkey_parameters () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
> #13 0x00007fe64b7d1398 in X509_verify_cert () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
> #14 0x00007fe64bac52f8 in ?? () from /lib/x86_64-linux-gnu/libssl.so.1.0.0
> #15 0x00007fe64baa2ef3 in ?? () from /lib/x86_64-linux-gnu/libssl.so.1.0.0
> #16 0x00007fe64baa7222 in ?? () from /lib/x86_64-linux-gnu/libssl.so.1.0.0
> #17 0x00007fe64c2dffcb in ?? () from /usr/lib/libpq.so.5
> #18 0x00007fe64c2d0c5e in PQconnectPoll () from /usr/lib/libpq.so.5
> #19 0x00007fe64c2d1e3e in ?? () from /usr/lib/libpq.so.5
> #20 0x00007fe64c2d26f8 in PQsetdbLogin () from /usr/lib/libpq.so.5
> #21 0x0000000000401ba5 in authmilt_connect (ctx=0xe81b60, hostname=0x7fe6200008c0 "localhost", hostaddr=0x7fe62f7fdce0) at authmilter.c:212
> #22 0x00007fe64c4f69dc in ?? () from /usr/lib/libmilter.so.1.0.1
> #23 0x00007fe64c4f5f5f in mi_engine () from /usr/lib/libmilter.so.1.0.1
> #24 0x00007fe64c4fada6 in ?? () from /usr/lib/libmilter.so.1.0.1
> #25 0x00007fe64c0aee9a in start_thread () from
> /lib/x86_64-linux-gnu/libpthread.so.0
> #26 0x00007fe64bddbccd in clone () from /lib/x86_64-linux-gnu/libc.so.6
> #27 0x0000000000000000 in ?? ()
>
> (gdb) thread 3
> [Switching to thread 3 (Thread 0x7fe62ffff700 (LWP 27283))]
> #0 0x00007fe64c0b589c in __lll_lock_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
> (gdb) bt
> #0 0x00007fe64c0b589c in __lll_lock_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
> #1 0x00007fe64c0b1065 in _L_lock_858 () from /lib/x86_64-linux-gnu/libpthread.so.0
> #2 0x00007fe64c0b0eba in pthread_mutex_lock () from /lib/x86_64-linux-gnu/libpthread.so.0
> #3 0x00007fe64c2df200 in ?? () from /usr/lib/libpq.so.5
> #4 0x00007fe64b78a5f5 in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
> #5 0x00007fe64b77a915 in RSA_new_method () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
> #6 0x00007fe64b77d64d in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
> #7 0x00007fe64b7b9bf2 in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
> #8 0x00007fe64b7bc6d1 in ASN1_item_ex_d2i () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
> #9 0x00007fe64b7bd0c4 in ASN1_item_d2i () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
> #10 0x00007fe64b77ea2f in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
> #11 0x00007fe64b7b461a in X509_PUBKEY_get () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
> #12 0x00007fe64b7d119a in X509_get_pubkey_parameters () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
> #13 0x00007fe64b7d1398 in X509_verify_cert () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
> #14 0x00007fe64bac52f8 in ?? () from /lib/x86_64-linux-gnu/libssl.so.1.0.0
> #15 0x00007fe64baa2ef3 in ?? () from /lib/x86_64-linux-gnu/libssl.so.1.0.0
> #16 0x00007fe64baa7222 in ?? () from /lib/x86_64-linux-gnu/libssl.so.1.0.0
> #17 0x00007fe64c2dffcb in ?? () from /usr/lib/libpq.so.5
> #18 0x00007fe64c2d0c5e in PQconnectPoll () from /usr/lib/libpq.so.5
> #19 0x00007fe64c2d1e3e in ?? () from /usr/lib/libpq.so.5
> #20 0x00007fe64c2d26f8 in PQsetdbLogin () from /usr/lib/libpq.so.5
> #21 0x0000000000401ba5 in authmilt_connect (ctx=0xe818e0, hostname=0x7fe6280008c0 "localhost", hostaddr=0x7fe62fffece0) at authmilter.c:212
> #22 0x00007fe64c4f69dc in ?? () from /usr/lib/libmilter.so.1.0.1
> #23 0x00007fe64c4f5f5f in mi_engine () from /usr/lib/libmilter.so.1.0.1
> #24 0x00007fe64c4fada6 in ?? () from /usr/lib/libmilter.so.1.0.1
> #25 0x00007fe64c0aee9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
> #26 0x00007fe64bddbccd in clone () from /lib/x86_64-linux-gnu/libc.so.6
> #27 0x0000000000000000 in ?? ()
>
> If I interpret this correctly, threads #2 and #3 are waiting for the same
> lock but they make no progress.

A-ha, the deadlock happens while doing SSL stuff. I didn't have SSL
enabled in my test server. As soon as I turned it on, it hung.

Attached is a small stand-alone test program to reproduce it. You can
pass a libpq connection string as argument to it.

I'll dig into that, but right now it seems like an OpenSSL or libcrypto
bug to me. Or something in the way we use them, although I can't see
anything obviously wrong in the libpq code at a quick glance.

- Heikki

Attachment Content-Type Size
threaded-connect.c text/x-csrc 1.2 KB

From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Vesa-Matti J Kari <vmkari(at)cc(dot)helsinki(dot)fi>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Strange hanging bug in a simple milter
Date: 2013-09-09 15:20:01
Message-ID: 20130909152001.GQ2706@tamriel.snowman.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Vesa-Matti, Heikki,

* Heikki Linnakangas (hlinnakangas(at)vmware(dot)com) wrote:
> On 09.09.2013 15:36, Vesa-Matti J Kari wrote:
> >If I interpret this correctly, threads #2 and #3 are waiting for the same
> >lock but they make no progress.
>
> A-ha, the deadlock happens while doing SSL stuff. I didn't have SSL
> enabled in my test server. As soon as I turned it on, it hung.
>
> Attached is a small stand-alone test program to reproduce it. You
> can pass a libpq connection string as argument to it.

Interesting... Which version of libpq were you working against? I see
that Vesa-Matti had the problem happen w/ 9.1.9, which should have been
before the changes that I made to add locking around our usage of
SSL_context, as otherwise we would end up in situations where we'd dump
core, but he also had it with 9.3rc1, which should have included it. I
had tested the patch w/ a pretty good amount of concurrent threads fired
off from a little python script and didn't run into any deadlocks
there..

Vesa-Matti, was this working previously, and are you sure you were
testing with 9.3rc1's libpq? Heikki, which are you testing against and
perhaps you might try before and after? I'll be able to look into it
more in a few hours also.

Thanks,

Stephen


From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Vesa-Matti J Kari <vmkari(at)cc(dot)helsinki(dot)fi>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Strange hanging bug in a simple milter
Date: 2013-09-09 16:04:41
Message-ID: 20130909160441.GC6068@eldon.alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Heikki Linnakangas wrote:

> I'll dig into that, but right now it seems like an OpenSSL or
> libcrypto bug to me. Or something in the way we use them, although I
> can't see anything obviously wrong in the libpq code at a quick
> glance.

Can you please try with ssl_renegotiation_limit=0?

[ looks ] Uh, actually you don't even send data in those connections in
your test program, do you? Maybe there's a problem with the mutex stuff
committed recently by Stephen.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: Vesa-Matti J Kari <vmkari(at)cc(dot)helsinki(dot)fi>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Strange hanging bug in a simple milter
Date: 2013-09-09 16:10:54
Message-ID: 522DF30E.90605@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 09.09.2013 18:20, Stephen Frost wrote:
> Vesa-Matti, Heikki,
>
> * Heikki Linnakangas (hlinnakangas(at)vmware(dot)com) wrote:
>> On 09.09.2013 15:36, Vesa-Matti J Kari wrote:
>>> If I interpret this correctly, threads #2 and #3 are waiting for the same
>>> lock but they make no progress.
>>
>> A-ha, the deadlock happens while doing SSL stuff. I didn't have SSL
>> enabled in my test server. As soon as I turned it on, it hung.
>>
>> Attached is a small stand-alone test program to reproduce it. You
>> can pass a libpq connection string as argument to it.
>
> Interesting... Which version of libpq were you working against? I see
> that Vesa-Matti had the problem happen w/ 9.1.9, which should have been
> before the changes that I made to add locking around our usage of
> SSL_context, as otherwise we would end up in situations where we'd dump
> core, but he also had it with 9.3rc1, which should have included it. I
> had tested the patch w/ a pretty good amount of concurrent threads fired
> off from a little python script and didn't run into any deadlocks
> there..

> Vesa-Matti, was this working previously, and are you sure you were
> testing with 9.3rc1's libpq? Heikki, which are you testing against and
> perhaps you might try before and after? I'll be able to look into it
> more in a few hours also.

Thanks! I tested with git master.

I added printf()s into the pq_lockingcallback function, and got a trace
where both threads got stuck waiting for lock 10 in the pq_lockarray. It
looks like someone is failing to release it. The backtrace for both
threads look like this:

#0 __lll_lock_wait () at
../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1 0x00007fad49826f3c in _L_lock_974 () from
/lib/x86_64-linux-gnu/libpthread.so.0
#2 0x00007fad49826d8b in __GI___pthread_mutex_lock
(mutex=0x7fad3800a260) at pthread_mutex_lock.c:64
#3 0x00007fad49a53f08 in pq_lockingcallback (line=175,
file=0x7fad48fa8bb3 "x_pubkey.c", n=10, mode=<optimized out>) at
fe-secure.c:872
#4 pq_lockingcallback (mode=<optimized out>, n=10, file=0x7fad48fa8bb3
"x_pubkey.c", line=175) at fe-secure.c:868
#5 0x00007fad48f396ab in X509_PUBKEY_get () from
/usr/lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#6 0x00007fad48f56292 in X509_get_pubkey_parameters () from
/usr/lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#7 0x00007fad48f5649c in X509_verify_cert () from
/usr/lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#8 0x00007fad4924f14a in ?? () from
/usr/lib/x86_64-linux-gnu/libssl.so.1.0.0
#9 0x00007fad4922ce1c in ?? () from
/usr/lib/x86_64-linux-gnu/libssl.so.1.0.0
#10 0x00007fad492310d2 in ?? () from
/usr/lib/x86_64-linux-gnu/libssl.so.1.0.0
#11 0x00007fad49a54c67 in open_client_SSL (conn=0x7fad380397d0) at
fe-secure.c:1463
#12 pqsecure_open_client (conn=conn(at)entry=0x7fad380397d0) at fe-secure.c:306
#13 0x00007fad49a44fb6 in PQconnectPoll (conn=conn(at)entry=0x7fad380397d0)
at fe-connect.c:2123
#14 0x00007fad49a4618e in connectDBComplete
(conn=conn(at)entry=0x7fad380397d0) at fe-connect.c:1521
#15 0x00007fad49a46b47 in PQconnectdb (conninfo=<optimized out>) at
fe-connect.c:516
#16 0x00000000004007b6 in test_connect (threadid=2) at threaded-connect.c:25
#17 0x000000000040086e in run_thread (arg=0x600e04 <two>) at
threaded-connect.c:55
#18 0x00007fad49824e0e in start_thread (arg=0x7fad48203700) at
pthread_create.c:311
#19 0x00007fad4955993d in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Sometimes the lockup happens differently, with one thread hung up in
SSL_init() and another waiting for the first one on ssl_config_mutex,
for example.

A good next step might be to create a standalone program that doesn't
use libpq at all, but just calls X509_verify_cert() concurrently in two
threads. Or open plain SSL connections. If the deadlock can be
reproduced with that, then we could just report the bug to the OpenSSL
and hope that they can figure it out.

- Heikki

Attachment Content-Type Size
pq_lockingcallback-debug-1.patch text/x-diff 728 bytes
lock-trace.gz application/x-gzip 14.7 KB

From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Vesa-Matti J Kari <vmkari(at)cc(dot)helsinki(dot)fi>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Strange hanging bug in a simple milter
Date: 2013-09-09 17:26:10
Message-ID: 20130909172610.GR2706@tamriel.snowman.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Alvaro,

* Alvaro Herrera (alvherre(at)2ndquadrant(dot)com) wrote:
> Heikki Linnakangas wrote:
> > I'll dig into that, but right now it seems like an OpenSSL or
> > libcrypto bug to me. Or something in the way we use them, although I
> > can't see anything obviously wrong in the libpq code at a quick
> > glance.
>
> Can you please try with ssl_renegotiation_limit=0?
>
> [ looks ] Uh, actually you don't even send data in those connections in
> your test program, do you? Maybe there's a problem with the mutex stuff
> committed recently by Stephen.

I was wondering about that also, but it was apparently an issue even
before that change (it was reported against 9.1.9). Also, Heikki's
analysis appears to show cases where two threads end up waiting on the
same entry in the lockarray, which I don't think my changes would have
impacted at all.

In any case, I hope to find time this afternoon/evening to try this
against libpq from before and after, just to be sure and rule out that
patch. Assuming that pans out, I tend to agree w/ Heikki that we should
test this outside of libpq entirely and see if we can reproduce it.
Even if we're able to do that, we may need to consider ways to fix it
ourselves (perhaps be holding heavier locks or something), as we have no
idea how long it'll take an OpenSSL fix to happen..

Thanks,

Stephen


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Vesa-Matti J Kari <vmkari(at)cc(dot)helsinki(dot)fi>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Strange hanging bug in a simple milter
Date: 2013-09-10 15:10:41
Message-ID: 20130910151041.GZ2706@tamriel.snowman.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Heikki,

* Heikki Linnakangas (hlinnakangas(at)vmware(dot)com) wrote:
> Thanks! I tested with git master.

I've run your test program against both git master and 9.2.4 on a couple
of Ubuntu 13.04 boxes and all I see are tons of these:

1: DEBUG: database connection established
1: DEBUG: about to call PQfinish()
1: DEBUG: database connection established
1: DEBUG: about to call PQfinish()
2: DEBUG: database connection established
2: DEBUG: about to call PQfinish()

But no deadlocks; even across multiple runs, they all terminate
normally.

For 9.2.4, I've got:

libpq5 9.2.4-1.pgdg12.4+1
libssl1.0.0:amd64 1.0.1c-4ubuntu8.1

With master, it's:

libpq5 as-of b34f8f409bec4921d260ac84668917f5fff04319
libssl1.0.0:amd64 1.0.1c-4ubuntu8.1

What versions of libssl are you working with..? Is it possible this has
actually been fixed already?

Thanks,

Stephen


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: Vesa-Matti J Kari <vmkari(at)cc(dot)helsinki(dot)fi>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Strange hanging bug in a simple milter
Date: 2013-09-10 17:03:42
Message-ID: 522F50EE.5020400@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 10.09.2013 18:10, Stephen Frost wrote:
> I've run your test program against both git master and 9.2.4 on a couple
> of Ubuntu 13.04 boxes and all I see are tons of these:
>
> 1: DEBUG: database connection established
> 1: DEBUG: about to call PQfinish()
> 1: DEBUG: database connection established
> 1: DEBUG: about to call PQfinish()
> 2: DEBUG: database connection established
> 2: DEBUG: about to call PQfinish()
>
> But no deadlocks; even across multiple runs, they all terminate
> normally.

Hmm. Are you sure you're getting an SSL connection? Run it with
something like this to make sure:

./threaded-connect "dbname=postgres sslmode=require host=localhost"

> For 9.2.4, I've got:
>
> libpq5 9.2.4-1.pgdg12.4+1
> libssl1.0.0:amd64 1.0.1c-4ubuntu8.1
>
> With master, it's:
>
> libpq5 as-of b34f8f409bec4921d260ac84668917f5fff04319
> libssl1.0.0:amd64 1.0.1c-4ubuntu8.1
>
> What versions of libssl are you working with..? Is it possible this has
> actually been fixed already?

$ dpkg -s libssl1.0.0:amd64 | grep Vers
Version: 1.0.1e-3

- Heikki


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Vesa-Matti J Kari <vmkari(at)cc(dot)helsinki(dot)fi>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Strange hanging bug in a simple milter
Date: 2013-09-10 21:39:44
Message-ID: 20130910213944.GA2706@tamriel.snowman.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Heikki,

* Heikki Linnakangas (hlinnakangas(at)vmware(dot)com) wrote:
> Hmm. Are you sure you're getting an SSL connection? Run it with
> something like this to make sure:

sslmode=require doesn't help on Unix domain connections. :)

Was able to get it to lock with both 9.2.4 and master, and with both
versions of the SSL library (1.0.1c-4ubuntu8.1 and 1.0.1e-3).

The lock that I got showed these stacks:

Thread #2:
pthread_mutex_lock
init_ssl_system ; fe-secure.c:917

Thread #3:
pthread_mutex_lock
pq_lockingcallback
...
SSL_new
pqsecure_open_client ; fe-secure.c:275

Thread #2 waiting at 917 makes sense, he's waiting on the lock that the
other thread has on ssl_config_mutex before moving in to set up his own
SSL connection. What's odd is how is thread #3 waiting on a lock in the
lock array. Both threads agree that ssl_open_connections is only 1
(thread #3's ; thread #2 hasn't gotten to incrementing it yet). Looking
at the lock array, only one of the locks is taken out and it's owner is
thread #3, meaning that SSL apparently caused a deadlock by trying to
take a lock which it's already taken.

Changing the lock type to be recursive instead masks the self-locking
issue, but then I got a case where, with the same stack traces as above,
the lock in the array was held by thread #2 instead, where thread #2 is
in init_ssl_system- well before it's even made any calls into SSL since
the previous PQdestroy happened.

I've also caught it where a thread is still holding a lock when it drops
into destroy_ssl_system() by simply trying to unlock all of the locks in
the array. With the recursive lock type, all such attempts should
simply error out (either it's locked by someone else, or it's already
unlocked) and so I checked for *successful* unlocks:

2: DEBUG: database connection established
2: DEBUG: about to call PQfinish()
successfully unlocked mutex!

Having that happen can then cause the deadlock because the other thread
can end up waiting on that lock that we're still holding while in
destroy_ssl_system(), where we're waiting on the ssl_config_mutex lock
that the first thread has. Even leaving that code in there, which
unlocks all the locks during destroy_ssl_system(), it's still deadlocked
on me with the same stack trace as above, with thread #2 holding a lock
in the pq_lockarray which thread #3 is trying to get (while thread #3
holds the ssl_config_mutex lock that thread #2 is waiting on).

Very curious. Out of time right now to look into it, but will probably
be back at it later tonight.

Thanks,

Stephen


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Vesa-Matti J Kari <vmkari(at)cc(dot)helsinki(dot)fi>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Strange hanging bug in a simple milter
Date: 2013-09-13 16:40:11
Message-ID: 20130913164011.GR2706@tamriel.snowman.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Heikki, all,

* Stephen Frost (sfrost(at)snowman(dot)net) wrote:
> Very curious. Out of time right now to look into it, but will probably
> be back at it later tonight.

Alright, I was back at this a bit today and decided to go with a hunch-
and it looks like I might have been right to try.

Leaving the locking callback hooks in place appears to prevent the
deadlocks from happening, at least with this little app.

IOW, in destroy_ssl_system(), simply arrange to not have
CRYPTO_set_locking_callback(NULL); and CRYPTO_set_id_callback(NULL);
called. I've done this with the very simple attached patch. Per the
comment above destroy_ssl_system(), this doesn't seem to be an
acceptable solution because libpq might get unloaded from the
application, but perhaps it points the way towards what the issue is.

My thought had been that there was an issue with regard to signal
handling, but I've been unable to confirm that, nor is it clear why that
would be the case. In any case, I'm curious what others think of these
results and if anyone can reproduce the deadlock with this patch
applied.

Thanks!

Stephen

Attachment Content-Type Size
dont_drop_hooks.patch text/x-diff 1.4 KB

From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Vesa-Matti J Kari <vmkari(at)cc(dot)helsinki(dot)fi>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Strange hanging bug in a simple milter
Date: 2013-09-13 17:00:49
Message-ID: 20130913170049.GB1330627@alap2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2013-09-13 12:40:11 -0400, Stephen Frost wrote:
> Heikki, all,
>
> * Stephen Frost (sfrost(at)snowman(dot)net) wrote:
> > Very curious. Out of time right now to look into it, but will probably
> > be back at it later tonight.
>
> Alright, I was back at this a bit today and decided to go with a hunch-
> and it looks like I might have been right to try.
>
> Leaving the locking callback hooks in place appears to prevent the
> deadlocks from happening, at least with this little app.
>
> IOW, in destroy_ssl_system(), simply arrange to not have
> CRYPTO_set_locking_callback(NULL); and CRYPTO_set_id_callback(NULL);
> called. I've done this with the very simple attached patch. Per the
> comment above destroy_ssl_system(), this doesn't seem to be an
> acceptable solution because libpq might get unloaded from the
> application, but perhaps it points the way towards what the issue is.

It'd be interesting to replace the origin callbacks with one immediately
doing an abort() or similar to see whether they maybe are called after
they shouldn't be and from where.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Stephen Frost <sfrost(at)snowman(dot)net>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Vesa-Matti J Kari <vmkari(at)cc(dot)helsinki(dot)fi>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Strange hanging bug in a simple milter
Date: 2013-09-13 17:15:34
Message-ID: CAOuzzgqoU5p=zFhoQp6fVDVjaSz9x14XWmzSpbAY2sn4=8M61w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andres,

On Friday, September 13, 2013, Andres Freund wrote:
>
> It'd be interesting to replace the origin callbacks with one immediately
> doing an abort() or similar to see whether they maybe are called after
> they shouldn't be and from where.
>

Good thought. Got sucked into a meeting but once I'm out I'll try having
the lock/unlock routines abort if they're called while ssl_open_connections
is zero, which should not be happening, but seems like it is.

Thanks,

Stephen


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Vesa-Matti J Kari <vmkari(at)cc(dot)helsinki(dot)fi>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Strange hanging bug in a simple milter
Date: 2013-09-13 17:25:20
Message-ID: 20130913172520.GF1330627@alap2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2013-09-13 13:15:34 -0400, Stephen Frost wrote:
> Andres,
>
> On Friday, September 13, 2013, Andres Freund wrote:
> >
> > It'd be interesting to replace the origin callbacks with one immediately
> > doing an abort() or similar to see whether they maybe are called after
> > they shouldn't be and from where.
> >
>
> Good thought. Got sucked into a meeting but once I'm out I'll try having
> the lock/unlock routines abort if they're called while ssl_open_connections
> is zero, which should not be happening, but seems like it is.

Hm. close_SSL() first does pqsecure_destroy() which will unset the
callbacks, and the count and then goes on to do X509_free() and
ENGINE_finish(), ENGINE_free() if either is used.

It's not implausible that one of those actually needs locking. I doubt
engines play a role here, but, without having looked at the testcase,
X509_free() might be a possibility.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Vesa-Matti J Kari <vmkari(at)cc(dot)helsinki(dot)fi>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Strange hanging bug in a simple milter
Date: 2013-09-13 17:59:54
Message-ID: 20130913175954.GV2706@tamriel.snowman.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

* Andres Freund (andres(at)2ndquadrant(dot)com) wrote:
> On 2013-09-13 13:15:34 -0400, Stephen Frost wrote:
> > Good thought. Got sucked into a meeting but once I'm out I'll try having
> > the lock/unlock routines abort if they're called while ssl_open_connections
> > is zero, which should not be happening, but seems like it is.
>
> Hm. close_SSL() first does pqsecure_destroy() which will unset the
> callbacks, and the count and then goes on to do X509_free() and
> ENGINE_finish(), ENGINE_free() if either is used.
>
> It's not implausible that one of those actually needs locking. I doubt
> engines play a role here, but, without having looked at the testcase,
> X509_free() might be a possibility.

Unfortunately, while I can still easily get the deadlock to happen when
the hooks are reset, the hooks don't appear to ever get called when
ssl_open_connections is set to zero. You have a good point about the
additional SSL calls after the hooks are unloaded though, I wonder if
holding the ssl_config_mutex lock over all of close_SSL might be more
sensible..

Thanks,

Stephen


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Vesa-Matti J Kari <vmkari(at)cc(dot)helsinki(dot)fi>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Strange hanging bug in a simple milter
Date: 2013-09-13 18:30:39
Message-ID: 20130913183039.GH1330627@alap2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2013-09-13 13:59:54 -0400, Stephen Frost wrote:
> Unfortunately, while I can still easily get the deadlock to happen when
> the hooks are reset, the hooks don't appear to ever get called when
> ssl_open_connections is set to zero. You have a good point about the
> additional SSL calls after the hooks are unloaded though, I wonder if
> holding the ssl_config_mutex lock over all of close_SSL might be more
> sensible..

Hm. You might want to do the check for ssl_open_connections after a
memory barrier or such, it might not actually be up2date (too tired to
think about the actual guarantees of x86's cache coherency guarantees).

Does it burn if you move the pqsecure_destroy(); to the end of the
function, after setting a boolean in the if (conn->ssl) branch?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Vesa-Matti J Kari <vmkari(at)cc(dot)helsinki(dot)fi>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Strange hanging bug in a simple milter
Date: 2013-09-13 18:33:25
Message-ID: 20130913183325.GX2706@tamriel.snowman.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

* Stephen Frost (sfrost(at)snowman(dot)net) wrote:
> * Andres Freund (andres(at)2ndquadrant(dot)com) wrote:
> > Hm. close_SSL() first does pqsecure_destroy() which will unset the
> > callbacks, and the count and then goes on to do X509_free() and
> > ENGINE_finish(), ENGINE_free() if either is used.
> >
> > It's not implausible that one of those actually needs locking. I doubt
> > engines play a role here, but, without having looked at the testcase,
> > X509_free() might be a possibility.
>
> Unfortunately, while I can still easily get the deadlock to happen when
> the hooks are reset, the hooks don't appear to ever get called when
> ssl_open_connections is set to zero. You have a good point about the
> additional SSL calls after the hooks are unloaded though, I wonder if
> holding the ssl_config_mutex lock over all of close_SSL might be more
> sensible..

I went ahead and moved the locks to be around all of close_SSL() and
haven't been able to reproduce the deadlock, so perhaps those calls are
the issue and what's happening is that another thread is dropping or
adding the hooks in a common place while the X509_free, etc, are trying
to figure out if they should be calling the locking functions or not,
but there's a race because there's no higher-level locking happening
around those.

Attached is a patch to move those and which doesn't deadlock for me.

Thoughts?

Thanks,

Stephen

Attachment Content-Type Size
move_locks.diff text/x-diff 1.2 KB

From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Vesa-Matti J Kari <vmkari(at)cc(dot)helsinki(dot)fi>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Strange hanging bug in a simple milter
Date: 2013-09-13 18:35:29
Message-ID: 20130913183529.GI1330627@alap2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2013-09-13 14:33:25 -0400, Stephen Frost wrote:
> * Stephen Frost (sfrost(at)snowman(dot)net) wrote:
> > * Andres Freund (andres(at)2ndquadrant(dot)com) wrote:
> > > Hm. close_SSL() first does pqsecure_destroy() which will unset the
> > > callbacks, and the count and then goes on to do X509_free() and
> > > ENGINE_finish(), ENGINE_free() if either is used.
> > >
> > > It's not implausible that one of those actually needs locking. I doubt
> > > engines play a role here, but, without having looked at the testcase,
> > > X509_free() might be a possibility.
> >
> > Unfortunately, while I can still easily get the deadlock to happen when
> > the hooks are reset, the hooks don't appear to ever get called when
> > ssl_open_connections is set to zero. You have a good point about the
> > additional SSL calls after the hooks are unloaded though, I wonder if
> > holding the ssl_config_mutex lock over all of close_SSL might be more
> > sensible..
>
> I went ahead and moved the locks to be around all of close_SSL() and
> haven't been able to reproduce the deadlock, so perhaps those calls are
> the issue and what's happening is that another thread is dropping or
> adding the hooks in a common place while the X509_free, etc, are trying
> to figure out if they should be calling the locking functions or not,
> but there's a race because there's no higher-level locking happening
> around those.
>
> Attached is a patch to move those and which doesn't deadlock for me.

It seems slightly cleaner to just move the pqsecure_destroy(); to the
end of that function, based on a boolean. But if you think otherwise, I
won't protest...

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Vesa-Matti J Kari <vmkari(at)cc(dot)helsinki(dot)fi>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Strange hanging bug in a simple milter
Date: 2013-09-13 19:03:31
Message-ID: 20130913190331.GY2706@tamriel.snowman.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

* Andres Freund (andres(at)2ndquadrant(dot)com) wrote:
> It seems slightly cleaner to just move the pqsecure_destroy(); to the
> end of that function, based on a boolean. But if you think otherwise, I
> won't protest...

Hmm, agreed; I had originally been concerned that the SIGPIPE madness
needed to be around the pqsecure_destroy() call, but I can't see why
that would be.

I've run through a few times w/ the attached and haven't seen the
deadlock. Will continue testing, of course.

Heikki, any thoughts regarding this change? Any concerns about
back-patching it?

Thanks,

Stephen

Attachment Content-Type Size
move_call.diff text/x-diff 672 bytes

From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Vesa-Matti J Kari <vmkari(at)cc(dot)helsinki(dot)fi>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Strange hanging bug in a simple milter
Date: 2013-09-13 19:18:35
Message-ID: 20130913191835.GA1349870@alap2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2013-09-13 15:03:31 -0400, Stephen Frost wrote:
> * Andres Freund (andres(at)2ndquadrant(dot)com) wrote:
> > It seems slightly cleaner to just move the pqsecure_destroy(); to the
> > end of that function, based on a boolean. But if you think otherwise, I
> > won't protest...
>
> Hmm, agreed; I had originally been concerned that the SIGPIPE madness
> needed to be around the pqsecure_destroy() call, but I can't see why
> that would be.
>
> I've run through a few times w/ the attached and haven't seen the
> deadlock. Will continue testing, of course.

That patch looks wrong to me. Note that the if (conn->ssl) branch resets
conn->ssl to NULL.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Vesa-Matti J Kari <vmkari(at)cc(dot)helsinki(dot)fi>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Strange hanging bug in a simple milter
Date: 2013-09-13 19:23:00
Message-ID: 20130913192300.GA2706@tamriel.snowman.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

* Andres Freund (andres(at)2ndquadrant(dot)com) wrote:
> That patch looks wrong to me. Note that the if (conn->ssl) branch resets
> conn->ssl to NULL.

huh, it figures one would overlook the simplest things. Of course it's
not locking up now- we never remove the hooks (as my original patch was
doing :). That's what I get for trying to go to meetings and code at
the same time.

Thanks,

Stephen


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Vesa-Matti J Kari <vmkari(at)cc(dot)helsinki(dot)fi>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Strange hanging bug in a simple milter
Date: 2013-09-13 19:26:07
Message-ID: 523366CF.6030801@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 13.09.2013 22:03, Stephen Frost wrote:
> * Andres Freund (andres(at)2ndquadrant(dot)com) wrote:
>> It seems slightly cleaner to just move the pqsecure_destroy(); to the
>> end of that function, based on a boolean. But if you think otherwise, I
>> won't protest...
>
> Hmm, agreed; I had originally been concerned that the SIGPIPE madness
> needed to be around the pqsecure_destroy() call, but I can't see why
> that would be.
>
> I've run through a few times w/ the attached and haven't seen the
> deadlock. Will continue testing, of course.
>
> Heikki, any thoughts regarding this change? Any concerns about
> back-patching it?

Umm, with that patch, pqsecure_destroy() is never called. The "if
(conn->ssl)" test that's now at the end of the close_SSL function is
never true, because conn->ssl is set to NULL earlier.

I'm afraid the "move_locks.diff" patch you posted earlier is also
broken; close_SSL() is called in error scenarios from
pqsecure_open_client(), while already holding the mutex. So it will
deadlock with itself if the connection cannot be established.

- Heikki


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Vesa-Matti J Kari <vmkari(at)cc(dot)helsinki(dot)fi>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Strange hanging bug in a simple milter
Date: 2013-09-13 19:32:26
Message-ID: 5233684A.2040505@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 13.09.2013 22:26, Heikki Linnakangas wrote:
> I'm afraid the "move_locks.diff" patch you posted earlier is also
> broken; close_SSL() is called in error scenarios from
> pqsecure_open_client(), while already holding the mutex. So it will
> deadlock with itself if the connection cannot be established.

Actually, I think there's a pre-existing bug there in git master. If the
SSL_set_app_data or SSL_set_fd call in pqsecure_open_client fails for
some reason, it will call close_SSL() with conn->ssl already set, and
the mutex held. close_SSL() will call
pqsecure_destroy()->destroy_SSL()->destory_ssl_system(), which will try
to acquire the mutex again.

- Heikki


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Vesa-Matti J Kari <vmkari(at)cc(dot)helsinki(dot)fi>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Strange hanging bug in a simple milter
Date: 2013-09-13 19:36:08
Message-ID: 20130913193608.GB2706@tamriel.snowman.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

* Heikki Linnakangas (hlinnakangas(at)vmware(dot)com) wrote:
> Umm, with that patch, pqsecure_destroy() is never called. The "if
> (conn->ssl)" test that's now at the end of the close_SSL function is
> never true, because conn->ssl is set to NULL earlier.

Yeah, got ahead of myself, as Andres pointed out.

> I'm afraid the "move_locks.diff" patch you posted earlier is also
> broken; close_SSL() is called in error scenarios from
> pqsecure_open_client(), while already holding the mutex. So it will
> deadlock with itself if the connection cannot be established.

Sorry, I was really just tossing it up to see if it really did avoid
this particular deadlock. I'm running some tests now with the
attached to see if I can get it to deadlock now.

Thanks,

Stephen

Attachment Content-Type Size
move_call_take2.diff text/x-diff 1.2 KB

From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Vesa-Matti J Kari <vmkari(at)cc(dot)helsinki(dot)fi>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Strange hanging bug in a simple milter
Date: 2013-09-13 19:41:09
Message-ID: 20130913194109.GC2706@tamriel.snowman.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

* Heikki Linnakangas (hlinnakangas(at)vmware(dot)com) wrote:
> Actually, I think there's a pre-existing bug there in git master. If
> the SSL_set_app_data or SSL_set_fd call in pqsecure_open_client
> fails for some reason, it will call close_SSL() with conn->ssl
> already set, and the mutex held. close_SSL() will call
> pqsecure_destroy()->destroy_SSL()->destory_ssl_system(), which will
> try to acquire the mutex again.

Yeah, good point, and that one looks like my fault. Moving it after the
unlock should address that tho, no? Looking through the other
close_SSL() calls, I don't see any other situations where it might be
called with the lock held.

Thanks,

Stephen