Re: hung backends stuck in spinlock heavy endless loop

From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Peter Geoghegan <pg(at)heroku(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: hung backends stuck in spinlock heavy endless loop
Date: 2015-01-14 19:49:28
Message-ID: CAHyXU0zbR0n9cC4Qhq=MDd4p_EHjbZPPXsrtqoyLMbPiqcr6uA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Jan 14, 2015 at 9:49 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> On 2015-01-14 09:47:19 -0600, Merlin Moncure wrote:
>> On Wed, Jan 14, 2015 at 9:30 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> > If you gdb in, and type 'fin' a couple times, to wait till the function
>> > finishes, is there actually any progress? I'm wondering whether it's
>> > just many catalog accesses + contention, or some other
>> > problem. Alternatively set a breakpoint on ScanPgRelation() or so and
>> > see how often it's hit.
>>
>> well, i restarted the database, forgetting my looper was running which
>> immediately spun up and it got stuck again with a similar profile
>> (lots of cpu in spinlock):
>>
>> Samples: 3K of event 'cycles', Event count (approx.): 2695723228
>> 31.16% postgres [.] s_lock
>> 22.32% postgres [.] tas
>> 12.13% postgres [.] tas
>> 5.93% postgres [.] spin_delay
>> 5.69% postgres [.] LWLockRelease
>> 3.75% postgres [.] LWLockAcquireCommon
>> 3.61% perf [.] 0x00000000000526c4
>> 2.51% postgres [.] FunctionCall2Coll
>> 1.48% libc-2.17.so [.] 0x000000000016a132
>>
>> > If you gdb in, and type 'fin' a couple times,
>> (gdb) fin
>> Run till exit from #0 0x00007ff4c63f7a97 in semop () from
>> /lib/x86_64-linux-gnu/libc.so.6
>> 0x00000000006de073 in PGSemaphoreLock ()
>> (gdb) fin
>> Run till exit from #0 0x00000000006de073 in PGSemaphoreLock ()
>>
>> It returned once. Second time, it didn't at least so far (minute or so).
>
> Hm, that's autovac though, not the normal user backends that actually do
> stuff, right? If you could additionally check those, it'd be great.

got it to reproduce. had to take a break to get some work done.

(gdb) fin
Run till exit from #0 0x00007f7b07099dc3 in select ()
from /lib/x86_64-linux-gnu/libc.so.6
0x00000000008d3ac4 in pg_usleep ()
(gdb) fin
Run till exit from #0 0x00000000008d3ac4 in pg_usleep ()
0x0000000000750b69 in s_lock ()
(gdb) fin
Run till exit from #0 0x0000000000750b69 in s_lock ()
0x0000000000750844 in LWLockRelease ()
(gdb) fin
Run till exit from #0 0x0000000000750844 in LWLockRelease ()
0x0000000000730000 in LockBuffer ()
(gdb) fin
Run till exit from #0 0x0000000000730000 in LockBuffer ()
0x00000000004b2db4 in _bt_relandgetbuf ()
(gdb) fin
Run till exit from #0 0x00000000004b2db4 in _bt_relandgetbuf ()
0x00000000004b7116 in _bt_moveright ()
(gdb) fin
Run till exit from #0 0x00000000004b7116 in _bt_moveright ()
....

so it looks like nobody ever exits from _bt_moveright. any last
requests before I start bisecting down?

merlin

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tomas Vondra 2015-01-14 20:02:59 Re: PATCH: hashjoin - gracefully increasing NTUP_PER_BUCKET instead of batching
Previous Message Robert Haas 2015-01-14 19:45:38 Re: pg_basebackup fails with long tablespace paths