Re: Hanging queries on dual CPU windows

Lists: pgsql-performance
From: "Magnus Hagander" <mha(at)sollentuna(dot)net>
To: "Jan de Visser" <jdevisser(at)digitalfairway(dot)com>, <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Hanging queries on dual CPU windows
Date: 2006-03-10 09:20:15
Message-ID: 6BCB9D8A16AC4241919521715F4D8BCEA35104@algol.sollentuna.se

> > Is it possible to get a stack trace from the stuck process? I dunno
> > if you've got anything gdb-equivalent under Windows, but that's the
> > first thing I'd be interested in ...
>
> Here ya go:
>
> http://www.devisser-siderius.com/stack1.jpg
> http://www.devisser-siderius.com/stack2.jpg
> http://www.devisser-siderius.com/stack3.jpg
>
> There are three threads in the process. I guess thread 1
> (stack1.jpg) is the most interesting.
>
> I also noted that cranking up concurrency in my app
> reproduces the problem in about 4 minutes ;-)

Actually, stack2 looks very interesting. Does it "stay stuck" in pg_queue_signal? That's really not supposed to happen.

Also, can you confirm that stack1 actually *stops* in pgwin32_waitforsinglesocket? Or does it go out and come back? ;-)

(A good signal of this is to check the cswitch delta. If it stays at zero, then it's stuck. If it shows any values, that means it's actually going out and coming back.)
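
As a rough illustration of the cswitch-delta check described above, here is a minimal sketch. It assumes you have noted a thread's context-switch counter a few times in a row (for example by reading it off Process Explorer); the function names and the sample readings are hypothetical and are not part of PostgreSQL or of any Windows tool.

```python
from typing import List, Sequence


def cswitch_deltas(samples: Sequence[int]) -> List[int]:
    """Return the change between consecutive context-switch counter readings."""
    return [later - earlier for earlier, later in zip(samples, samples[1:])]


def looks_stuck(samples: Sequence[int]) -> bool:
    """True if the counter never moved across the sampled interval."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to compute a delta")
    return all(delta == 0 for delta in cswitch_deltas(samples))


# Hypothetical readings for two threads, noted at regular intervals.
blocked_thread = [4812, 4812, 4812, 4812]   # delta stays at zero
waiting_thread = [4812, 4815, 4815, 4821]   # delta shows values now and then

print(looks_stuck(blocked_thread))  # True
print(looks_stuck(waiting_thread))  # False
```

A thread that is merely waiting and waking up again will show a nonzero delta at least occasionally, so sampling over a longer interval makes the "stuck" verdict more trustworthy.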

And finally, is this 8.0 or 8.1? There have been some significant changes in the handling of the signals between the two...

//Magnus


From: Jan de Visser <jdevisser(at)digitalfairway(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Hanging queries on dual CPU windows
Date: 2006-03-10 14:03:14
Message-ID: 200603100903.14517.jdevisser@digitalfairway.com

On Friday 10 March 2006 04:20, Magnus Hagander wrote:
> > > Is it possible to get a stack trace from the stuck process? I dunno
> > > if you've got anything gdb-equivalent under Windows, but that's the
> > > first thing I'd be interested in ...
> >
> > Here ya go:
> >
> > http://www.devisser-siderius.com/stack1.jpg
> > http://www.devisser-siderius.com/stack2.jpg
> > http://www.devisser-siderius.com/stack3.jpg
> >
> > There are three threads in the process. I guess thread 1
> > (stack1.jpg) is the most interesting.
> >
> > I also noted that cranking up concurrency in my app
> > reproduces the problem in about 4 minutes ;-)
>

Just reproduced again.

> Actually, stack2 looks very interesting. Does it "stay stuck" in
> pg_queue_signal? That's really not supposed to happen.

Yes it does.

>
> Also, can you confirm that stack1 actually *stops* in
> pgwin32_waitforsinglesocket? Or does it go out and come back? ;-)
>
> (A good signal of this is to check the cswitch delta. If it stays at zero,
> then it's stuck. If it shows any values, that means it's actually going out
> and coming back)

I only see CSwitch change once I click OK on the thread window. Once I do
that, it goes up to 3 and back to blank again. The 'context switches' counter
does not increase like it does for other processes (e.g. Process Explorer
itself).

Another thing which may or may not be of interest: Nothing is listed in the
'TCP/IP' tab for the stuck process. I would have expected to see at least the
socket of the client connection there??

>
> And finally, is this 8.0 or 8.1? There have been some significant changes
> in the handling of the signals between the two...

This is 8.1.3 on Windows 2003 Server. Also reproduced on 8.1.0 and 8.1.1 (also
on 2K3).

>
> //Magnus

jan

--
--------------------------------------------------------------
Jan de Visser                     jdevisser(at)digitalfairway(dot)com

                Baruk Khazad! Khazad ai-menu!
--------------------------------------------------------------


From: Jan de Visser <jdevisser(at)digitalfairway(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Hanging queries on dual CPU windows
Date: 2006-03-10 14:32:59
Message-ID: 200603100933.00047.jdevisser@digitalfairway.com

On Friday 10 March 2006 09:03, Jan de Visser wrote:
> On Friday 10 March 2006 04:20, Magnus Hagander wrote:
> > > > Is it possible to get a stack trace from the stuck process? I dunno
> > > > if you've got anything gdb-equivalent under Windows, but that's the
> > > > first thing I'd be interested in ...
> > >
> > > Here ya go:
> > >
> > > http://www.devisser-siderius.com/stack1.jpg
> > > http://www.devisser-siderius.com/stack2.jpg
> > > http://www.devisser-siderius.com/stack3.jpg
> > >
> > > There are three threads in the process. I guess thread 1
> > > (stack1.jpg) is the most interesting.
> > >
> > > I also noted that cranking up concurrency in my app
> > > reproduces the problem in about 4 minutes ;-)
>
> Just reproduced again.
>
> > Actually, stack2 looks very interesting. Does it "stay stuck" in
> > pg_queue_signal? That's really not supposed to happen.
>
> Yes it does.

An update on that: There are actually *two* processes in this state, both
hanging in pg_queue_signal. I've looked at the source of that, and the
obvious candidate for hanging is EnterCriticalSection. I also found this:

http://blogs.msdn.com/larryosterman/archive/2005/03/02/383685.aspx

where they say:

"
In addition, for Windows 2003, SP1, the EnterCriticalSection API has a subtle
change that's intended to resolve many of the lock convoy issues. Before
Win2003 SP1, if 10 threads were blocked on EnterCriticalSection and all 10
threads had the same priority, then EnterCriticalSection would service those
threads in a FIFO (first-in, first-out) basis. Starting in Windows 2003
SP1, the EnterCriticalSection will wake up a random thread from the waiting
threads. If all the threads are doing the same thing (like a thread pool)
this won't make much of a difference, but if the different threads are doing
different work (like the critical section protecting a widely accessed
object), this will go a long way towards removing lock convoy semantics.
"

Could it be they broke it when they did that????
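
To make the quoted change concrete, here is a toy model of the two wake-up orders the post describes. It is only a sketch: it does not call any Windows API, the function name `wake_order` and the thread labels are made up for illustration, and it says nothing about how EnterCriticalSection is actually implemented.

```python
import random
from typing import List, Optional


def wake_order(waiters: List[str], policy: str,
               seed: Optional[int] = None) -> List[str]:
    """Return the order in which blocked waiters would be granted the lock.

    policy="fifo":   oldest waiter first (the pre-SP1 behaviour the post describes)
    policy="random": any waiter may be picked (the SP1 behaviour the post describes)
    """
    rng = random.Random(seed)
    pending = list(waiters)  # copy, so the caller's list is left untouched
    order = []
    while pending:
        if policy == "fifo":
            index = 0
        elif policy == "random":
            index = rng.randrange(len(pending))
        else:
            raise ValueError(f"unknown policy: {policy!r}")
        order.append(pending.pop(index))
    return order


# Ten equal-priority waiters, listed in arrival order.
threads = [f"thread-{n}" for n in range(1, 11)]
print(wake_order(threads, "fifo"))
print(wake_order(threads, "random", seed=1))
```

In this model every waiter is eventually woken under either policy; only the order differs. The model therefore illustrates the described change but does not by itself show how it could lead to a permanent hang.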

>
> > Also, can you confirm that stack1 actually *stops* in
> > pgwin32_waitforsinglesocket? Or does it go out and come back? ;-)
> >
> > (A good signal of this is to check the cswitch delta. If it stays at
> > zero, then it's stuck. If it shows any values, that means it's actually
> > going out and coming back)
>
> I only see CSwitch change once I click OK on the thread window. Once I do
> that, it goes up to 3 and back to blank again. The 'context switches'
> counter does not increase like it does for other processes (e.g.
> Process Explorer itself).
>
> Another thing which may or may not be of interest: Nothing is listed in the
> 'TCP/IP' tab for the stuck process. I would have expected to see at least
> the socket of the client connection there??
>
> > And finally, is this 8.0 or 8.1? There have been some significant changes
> > in the handling of the signals between the two...
>
> This is 8.1.3 on Windows 2003 Server. Also reproduced on 8.1.0 and 8.1.1
> (also on 2K3).
>
> > //Magnus
>
> jan

--
--------------------------------------------------------------
Jan de Visser                     jdevisser(at)digitalfairway(dot)com

                Baruk Khazad! Khazad ai-menu!
--------------------------------------------------------------


From: Jan de Visser <jdevisser(at)digitalfairway(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Hanging queries on dual CPU windows
Date: 2006-03-10 14:47:22
Message-ID: 200603100947.22460.jdevisser@digitalfairway.com

On Friday 10 March 2006 09:32, Jan de Visser wrote:
> > > Actually, stack2 looks very interesting. Does it "stay stuck" in
> > > pg_queue_signal? That's really not supposed to happen.
> >
> > Yes it does.
>
> An update on that: There are actually *two* processes in this state, both
> hanging in pg_queue_signal. I've looked at the source of that, and the
> obvious candidate for hanging is EnterCriticalSection. I also found this:
>
> http://blogs.msdn.com/larryosterman/archive/2005/03/02/383685.aspx
>
> where they say:
>
> "
> In addition, for Windows 2003, SP1, the EnterCriticalSection API has a
> subtle change that's intended to resolve many of the lock convoy issues.
> Before Win2003 SP1, if 10 threads were blocked on EnterCriticalSection and
> all 10 threads had the same priority, then EnterCriticalSection would
> service those threads in a FIFO (first-in, first-out) basis.  Starting in
> Windows 2003 SP1, the EnterCriticalSection will wake up a random thread
> from the waiting threads.  If all the threads are doing the same thing
> (like a thread pool) this won't make much of a difference, but if the
> different threads are doing different work (like the critical section
> protecting a widely accessed object), this will go a long way towards
> removing lock convoy semantics. "
>
> Could it be they broke it when they did that????

See also this:

http://bugs.mysql.com/bug.php?id=12071

It appears the mysql people ran into this and concluded it is a Windows bug
they needed to work around.

jan

--
--------------------------------------------------------------
Jan de Visser                     jdevisser(at)digitalfairway(dot)com

                Baruk Khazad! Khazad ai-menu!
--------------------------------------------------------------