Re: "soft lockup" in kernel

From: Stuart Ford <stuart(dot)ford(at)glide(dot)uk(dot)com>
To: "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: "soft lockup" in kernel
Date: 2013-07-05 12:00:36
Message-ID: CDFC73F3.9367%stuart.ford@glide.uk.com
Lists: pgsql-general

Dear community

Twice today our PG 9.1 server has caused a "soft lockup", with a kernel
message like this:

[1813775.496127] BUG: soft lockup - CPU#3 stuck for 73s! [postgres:18723]

Full dmesg output - http://pastebin.com/YdWSmNUp

The incidents were approximately two hours apart, and the server was
momentarily unavailable before coming back again, with no restore actions
required - it just carried on. It's never done this before; I've checked
the logs. The server is about 3 weeks old and runs Debian 7 under
VMware.
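
A history check like that can be scripted. Here is a minimal sketch
(illustrative, not from the original message), assuming Debian's default
syslog layout under /var/log:

    # scan_lockups.py - illustrative sketch: list kernel soft-lockup
    # reports found in syslog and its plain-text rotations (the paths
    # are an assumption based on a default Debian install).
    import glob
    import re

    PATTERN = re.compile(
        r"BUG: soft lockup - CPU#(\d+) stuck for (\d+)s! \[(.+):(\d+)\]"
    )

    for path in sorted(glob.glob("/var/log/syslog*")):
        if path.endswith(".gz"):
            continue  # skip compressed rotations in this simple sketch
        with open(path, errors="replace") as fh:
            for line in fh:
                match = PATTERN.search(line)
                if match:
                    cpu, secs, comm, pid = match.groups()
                    print(f"{path}: CPU#{cpu} stuck {secs}s "
                          f"in {comm} (pid {pid})")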

Does anybody know what could cause this? If so, is it something to be
concerned about, and what can be done to stop it?

Any help greatly appreciated.

Stuart


From: Dennis Jenkins <dennis(dot)jenkins(dot)75(at)gmail(dot)com>
To: "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: "soft lockup" in kernel
Date: 2013-07-05 13:24:51
Message-ID: CAAEzAp-7hVXgJnZixBtgfOkU2QdXThxyuNuopxenC4CyO6g7ng@mail.gmail.com
Lists: pgsql-general

On Fri, Jul 5, 2013 at 7:00 AM, Stuart Ford <stuart(dot)ford(at)glide(dot)uk(dot)com> wrote:

> Dear community
>
> Twice today our PG 9.1 server has caused a "soft lockup", with a kernel
> message like this:
>
> [1813775.496127] BUG: soft lockup - CPU#3 stuck for 73s! [postgres:18723]
>
> Full dmesg output - http://pastebin.com/YdWSmNUp
>
> The incidents were approximately two hours apart, and the server was
> momentarily unavailable before coming back again, with no restore actions
> required - it just carried on. It's never done this before; I've checked
> the logs. The server is about 3 weeks old and runs Debian 7 under
> VMware.
>
> Does anybody know what could cause this? If so, is it something to be
> concerned about, and what can be done to stop it?
>
>
Before I looked at your pastebin, I was going to ask "What kind of storage
are the VMDKs on? If they are on NFS, iSCSI or FC, could the NAS/SAN be
experiencing a problem?"

But I see in the stack trace that the kernel thread hung in
"vmxnet3_xmit_frame" (sending an Ethernet frame on your virtual NIC).

Describe your VMware network topology.

Do you share the same VLAN for guest traffic with NFS or iSCSI used by
VMware for storing VMDKs?

Are there any errors recorded on the Ethernet switch connected to your
VMware servers?

What path does a packet take to get from a PostgreSQL server process in
your VM to a client?

What version of VMware are you running?

If you are managing it with vCenter, are there any alarms or events in the
VMware logs?


From: Stuart Ford <stuart(dot)ford(at)glide(dot)uk(dot)com>
To: "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: "soft lockup" in kernel
Date: 2013-07-05 13:58:38
Message-ID: CDFC8944.9393%stuart.ford@glide.uk.com
Lists: pgsql-general

On Fri, Jul 5, 2013 at 7:00 AM, Dennis Jenkins wrote:

> Before I looked at your pastebin, I was going to ask "What kind of storage
> are the VMDKs on? If they are on NFS, iSCSI or FC, could the NAS/SAN be
> experiencing a problem?" But I see in the stack trace that the kernel
> thread hung in "vmxnet3_xmit_frame" (sending an Ethernet frame on your
> virtual NIC).

I did not spot that.

> Describe your VMware network topology.
>
> Do you share the same VLAN for guest traffic with NFS or iSCSI used by
> VMware for storing VMDKs?

No. iSCSI traffic between the VMware hosts and the SAN uses completely
separate NICs and switches from the "production" LAN.

> Are there any errors recorded on the Ethernet switch connected to your
> VMware servers?

No, there is nothing in the logs on the switch.

> What path does a packet take to get from a PostgreSQL server process in
> your VM to a client?

Postgres is the backend database to a web application. Only the web
application uses it. It sits on another VM in the same cluster (but not on
the same physical host, so it definitely goes through a physical switch).
Is that what you mean?

> What version of VMware are you running?

ESXi 5.0.0.

> If you are managing it with vCenter, are there any alarms or events in
> the VMware logs?

I've had a look at the task activity in vCenter and found these two events
at almost the same time as the kernel messages. In both cases the start
time (the first time below) is 5-6 seconds after the kernel message, and
I've verified that the clocks on the Postgres VM and the vCenter server, at
least, are in sync (it may not, of course, be the vCenter server's clock
that these logs get their times from).

Remove snapshot
  Target:    GLIBMPDB001_replica
  Status:    Completed
  Initiator: GLIDE\Svc_vcenter
  Queued:    05/07/2013 11:58:41
  Started:   05/07/2013 11:58:41
  Completed: 05/07/2013 11:59:03

Remove snapshot
  Target:    GLIBMPDB001_replica
  Status:    Completed
  Initiator: GLIDE\Svc_vcenter
  Queued:    05/07/2013 10:11:10
  Started:   05/07/2013 10:11:10
  Completed: 05/07/2013 10:11:23

For clarity, here are the kernel messages:

syslog:Jul 5 10:11:05 glibmpdb001 kernel: [1807328.506991] BUG: soft
lockup - CPU#0 stuck for 25s! [postgres:5937]
syslog:Jul 5 11:58:35 glibmpdb001 kernel: [1813775.496127] BUG: soft
lockup - CPU#3 stuck for 73s! [postgres:18723]
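
To make the correlation concrete, here is a minimal sketch (illustrative,
not part of the original message) that measures the gap between each lockup
and each snapshot removal, using the timestamps quoted above:

    # correlate.py - illustrative sketch: how close were the soft lockups
    # to the vCenter snapshot-removal tasks? Timestamps are copied by hand
    # from the syslog lines and the vCenter task list above.
    from datetime import datetime, timedelta

    lockups = [
        datetime(2013, 7, 5, 10, 11, 5),   # CPU#0 stuck for 25s
        datetime(2013, 7, 5, 11, 58, 35),  # CPU#3 stuck for 73s
    ]
    removals = [
        datetime(2013, 7, 5, 10, 11, 10),
        datetime(2013, 7, 5, 11, 58, 41),
    ]

    WINDOW = timedelta(seconds=30)  # arbitrary matching window
    for lockup in lockups:
        for start in removals:
            if abs(start - lockup) <= WINDOW:
                gap = abs(start - lockup).seconds
                print(f"lockup at {lockup} is {gap}s "
                      f"from snapshot removal at {start}")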

The snapshot removal is performed by Veeam, not by a person. There has not
been another occurrence of this event since (at the time of writing this
e-mail).

Regardless, it's not as if we only introduced Veeam this morning - it has
been running for a few weeks, and these lockups have only happened today.

Do you think this is worth further investigation, or could there be another
explanation?

Stuart


From: Dennis Jenkins <dennis(dot)jenkins(dot)75(at)gmail(dot)com>
To: "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: "soft lockup" in kernel
Date: 2013-07-05 16:31:37
Message-ID: CAAEzAp8Ym1PPvLsf4YdMZbX0PXLpFApFCjm6yvs47yRnPq6zJA@mail.gmail.com
Lists: pgsql-general

On Fri, Jul 5, 2013 at 8:58 AM, Stuart Ford <stuart(dot)ford(at)glide(dot)uk(dot)com> wrote:

> On Fri, Jul 5, 2013 at 7:00 AM, Dennis Jenkins wrote:
>
> No. iSCSI traffic between the VMware hosts and the SAN uses completely
> separate NICs and switches from the "production" LAN.
>
> I've had a look at the task activity in vCenter and found these two events
> at almost the same time as the kernel messages. In both cases the start
> time (the first time below) is 5-6 seconds after the kernel message, and
> I've verified that the clocks on the Postgres VM and the vCenter server,
> at least, are in sync (it may not, of course, be the vCenter server's
> clock that these logs get their times from).
>
> Remove snapshot
>   Target:    GLIBMPDB001_replica
>   Status:    Completed
>   Initiator: GLIDE\Svc_vcenter
>   Queued:    05/07/2013 11:58:41
>   Started:   05/07/2013 11:58:41
>   Completed: 05/07/2013 11:59:03
>
> Remove snapshot
>   Target:    GLIBMPDB001_replica
>   Status:    Completed
>   Initiator: GLIDE\Svc_vcenter
>   Queued:    05/07/2013 10:11:10
>   Started:   05/07/2013 10:11:10
>   Completed: 05/07/2013 10:11:23

I would not blame Veeam.

I suspect that when a snapshot is deleted, all iSCSI activity either halts
or slows SIGNIFICANTLY. This depends on your NAS.

I've seen an Oracle 7320 ZFS Storage Appliance, misconfigured to use
RAID-Z2 (raid6) to store terabytes of essentially random-access data, pause
for minutes when deleting a snapshot containing a few dozen gigabytes (the
snapshot-deletion kernel threads get IO priority over "nfsd" file IO). This
caused enough latency to VMware (over NFS) that VMware gave up on the IO
and returned a generic SCSI error to the guests. Linux guests will
semi-panic and remount their file systems read-only. FreeBSD will just
freak out, panic and reboot. The flaw here was using the wrong RAID type
(since replaced with triple-mirrored RAID-10, which is working great).

What NAS are you using?

How busy are its disks when deleting a snapshot?

What is the RAID type under the hood?
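
On the middle question, one illustrative way to watch disk busyness on a
Linux-based NAS head while a snapshot is being deleted - a rough sketch
using the third-party psutil package (iostat reports the same figures):

    # disk_busy.py - illustrative sketch: sample aggregate disk throughput
    # and busy time once per second; run while a snapshot deletion is in
    # progress. Requires psutil; the busy_time field is Linux-only.
    import time
    import psutil

    prev = psutil.disk_io_counters()
    for _ in range(60):  # one minute of samples
        time.sleep(1)
        cur = psutil.disk_io_counters()
        read_mb = (cur.read_bytes - prev.read_bytes) / 1e6
        write_mb = (cur.write_bytes - prev.write_bytes) / 1e6
        busy_ms = getattr(cur, "busy_time", 0) - getattr(prev, "busy_time", 0)
        print(f"read {read_mb:6.1f} MB/s  write {write_mb:6.1f} MB/s"
              f"  busy {busy_ms:4.0f} ms/s")
        prev = cur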


From: Dennis Jenkins <dennis(dot)jenkins(dot)75(at)gmail(dot)com>
To: "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: "soft lockup" in kernel
Date: 2013-07-15 19:40:23
Message-ID: CAAEzAp9uqkzA1ztKsfDa1CND5fGTwLqU+hdeAVAYWfEg+ahSXg@mail.gmail.com
Lists: pgsql-general

Stuart,

I'm simply curious - did you resolve your issue? What NAS
(vendor/model/config) are you using?
