Re: hung backends stuck in spinlock heavy endless loop

From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Peter Geoghegan <pg(at)heroku(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: hung backends stuck in spinlock heavy endless loop
Date: 2015-01-13 23:54:57
Message-ID: CAHyXU0wyuJ7-2ypLZtkhOWZzC34udFXpm6tr+O51aZcCEbobag@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jan 13, 2015 at 5:42 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> On 2015-01-13 17:39:09 -0600, Merlin Moncure wrote:
>> On Tue, Jan 13, 2015 at 5:21 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> > On 2015-01-13 15:17:15 -0800, Peter Geoghegan wrote:
>> >> I'm inclined to think that this is a livelock, and so the problem
>> >> isn't evident from the structure of the B-Tree, but it can't hurt to
>> >> check.
>> >
>> > My guess is rather that it's contention on the freelist lock via
>> > StrategyGetBuffer's. I've seen profiles like this due to exactly that
>> > before - and it fits to parallel loading quite well.
>>
>> I think I've got it to pop again. s_lock is only showing 35%
>> (increasing very slowly if at all) but performance is mostly halted.
>> Frame pointer is compiled out. perf report attached.
>
>  35.82%  postgres  [.] s_lock
>  23.71%  postgres  [.] tas
>  14.01%  postgres  [.] tas
>   6.82%  postgres  [.] spin_delay
>   5.93%  postgres  [.] LWLockRelease
>   4.36%  postgres  [.] LWLockAcquireCommon
>
> Interesting. This profile looks quite different?

Yep, it's very stable, and the database is approximately frozen.

> What kind of hardware is this on?

Linux on a crap workstation box, except I have a good SSD (Intel 3500).

model name : Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz

top - 17:44:00 up 11 days, 6:25, 15 users, load average: 6.94, 6.97, 5.73
Tasks: 259 total, 8 running, 250 sleeping, 0 stopped, 1 zombie
%Cpu0 : 95.7 us, 0.0 sy, 0.0 ni, 4.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu1 : 93.5 us, 2.2 sy, 0.0 ni, 4.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu2 : 95.7 us, 0.0 sy, 0.0 ni, 4.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu3 : 93.6 us, 2.1 sy, 0.0 ni, 4.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem: 8131028 total, 7977608 used, 153420 free, 21424 buffers
KiB Swap: 8340476 total, 1884900 used, 6455576 free, 5201188 cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
32492 mmoncure 20 0 175m 12m 9040 R 64.0 0.2 9:28.42 postgres
32367 mmoncure 20 0 175m 12m 9128 R 57.3 0.2 9:35.87 postgres
31792 mmoncure 20 0 175m 92m 88m R 52.9 1.2 9:57.50 postgres
32725 mmoncure 20 0 175m 13m 9852 R 52.9 0.2 9:08.50 postgres
31899 mmoncure 20 0 179m 104m 97m R 50.7 1.3 9:49.51 postgres
32368 mmoncure 20 0 175m 12m 9200 R 50.7 0.2 9:33.77 postgres
31282 mmoncure 20 0 184m 152m 140m R 46.3 1.9 10:26.29 postgres
666 mmoncure 20 0 20640 1684 1116 R 2.2 0.0 0:00.01 top

Some more information about what's happening:
This is a ghetto logical replication engine that migrates data from
SQL Server to Postgres, consolidating a sharded database into a single
set of tables (of which there are only two). There is only one index
on each destination table, and it's a composite (int, int) in both cases.
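
For concreteness, one of the two destination tables looks roughly like
this (names and non-key columns are hypothetical; only the shape of the
single composite int,int index matches what I described):

  -- hypothetical sketch of a destination table; the real one has more columns
  create table dest_table_1 (
      source_id int not null,   -- first half of the composite key
      record_id int not null,   -- second half of the composite key
      payload   text
  );
  -- the only index on the table: composite (int, int)
  create index dest_table_1_idx on dest_table_1 (source_id, record_id);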

A single source table is replicated in a single transaction, and it's
staged to a set of two temp tables, one for deletes, one for inserts.
I log each step in the database. Here are the times:

cds2=# select cdsruntableid, finished - started, inserted from
cdsruntable where cdsrunid = 62 order by started;
cdsruntableid │ ?column? │ inserted
───────────────┼─────────────────┼──────────
833 │ 00:00:33.24044 │ 11860
834 │ 00:00:35.44981 │ 19099
835 │ 00:02:01.725186 │ 530923
836 │ 00:01:29.578811 │ 211737
838 │ 00:01:17.393461 │ 64258
837 │ 00:00:56.849106 │ 227844
839 │ 00:02:12.016504 │ 630309
840 │ 00:00:00.00111 │
841 │ 00:01:09.058806 │ 155309
842 │ 00:01:54.723747 │ 704422
843 │ 00:01:14.965304 │
844 │ 00:00:45.410953 │ 59934
845 │ 00:00:34.302632 │ 14848
846 │ 00:00:46.913832 │ 89893
848 │ 00:05:22.155189 │ 2410622
847 │ 00:01:25.199141 │ 268157
849 │ 00:01:16.168352 │ 117511
850 │ 00:00:34.809862 │ 1175
851 │ 00:01:12.565397 │ 67214
852 │ 00:01:03.734776 │ 20129
853 │ 00:00:41.038456 │ 62765
854 │ 00:01:03.478917 │ 14967
855 │ 00:00:33.88803 │ 6901
856 │ 00:00:36.381886 │ 6670
857 │ 00:00:36.626623 │ 8268
858 │ 00:01:14.391584 │ 363448
859 │ 00:01:44.619005 │ 533395
860 │ 00:01:02.042255 │ 212202
861 │ 00:00:00.001065 │
863 │ 00:00:58.265506 │ 215234
862 │ 00:02:45.572635 │ 827941
865 │ 00:01:28.049165 │ 241020
864 │ 00:01:51.465643 │ 531012
866 │ 00:01:20.273391 │ 419357
868 │ 00:01:25.479443 │ 294262
869 │ 00:00:46.400825 │ 46136
870 │ 00:00:42.337286 │ 25934
871 │ 00:00:48.350152 │ 91375
872 │ 00:00:00.001743 │
873 │ 00:01:04.596749 │ 199744
876 │ 00:01:15.210301 │ 276276
877 │ 00:02:36.447723 │ 1693887
878 │ 00:00:55.925104 │ 36938
879 │ 00:00:00.016874 │
882 │ 00:00:50.479248 │ 171901
881 │ │
884 │ │
883 │ │
867 │ │
874 │ │
875 │ │
880 │ │
(52 rows)
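
Each of those rows corresponds to roughly this staging pattern, all in
one transaction (names are hypothetical and the temp tables are actually
populated from the extracted source data, but the shape is this):

  begin;
  create temp table stage_deletes (source_id int, record_id int) on commit drop;
  create temp table stage_inserts (like dest_table_1) on commit drop;
  -- ... both temp tables are filled from the source shard here ...
  delete from dest_table_1 d
   using stage_deletes s
   where d.source_id = s.source_id and d.record_id = s.record_id;
  insert into dest_table_1 select * from stage_inserts;
  commit;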

There is an 8-thread worker that walks that queue in id order and
migrates one source table of data to the destination postgres server.
It's been stuck for quite some time now, and I'm certain that if I
restarted the database and reran the incremental processing routine, it
would work through those last seven tables with no issues (at least
that's what happened last time). It's been bogged down with no progress
for a half hour now, so there's some kind of weird bug here.
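
For anyone who wants to poke at this while it's stuck, the kind of check
I can run against the hung cluster uses only the standard catalogs; it
would show heavyweight lock waits, but not spinlock/LWLock contention
like the profile above (this is a generic sketch, nothing specific to my
schema):

  -- what the non-idle backends are doing right now
  select pid, state, waiting, query_start, left(query, 60) as query
    from pg_stat_activity
   where state <> 'idle'
   order by query_start;

  -- any heavyweight locks that are not granted
  select pid, locktype, mode, relation::regclass
    from pg_locks
   where not granted;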

merlin
