Re: Postgres gets stuck

Lists: pgsql-performance
From: "Craig A(dot) James" <cjames(at)modgraph-usa(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Postgres gets stuck
Date: 2006-05-10 00:38:17
Message-ID: 446135F9.1060608@modgraph-usa.com

I'm having a rare but deadly problem. On our web servers, a process occasionally gets stuck, and can't be unstuck. Once it's stuck, all Postgres activities cease. "kill -9" is required to kill it -- signals 2 and 15 don't work, and "/etc/init.d/postgresql stop" fails.

Here's what the process table looks like:

$ ps -ef | grep postgres
postgres 30713 1 0 Apr24 ? 00:02:43 /usr/local/pgsql/bin/postmaster -p 5432 -D /disk3/postgres/data
postgres 25423 30713 0 May08 ? 00:03:34 postgres: writer process
postgres 25424 30713 0 May08 ? 00:00:02 postgres: stats buffer process
postgres 25425 25424 0 May08 ? 00:00:02 postgres: stats collector process
postgres 11918 30713 21 07:37 ? 02:00:27 postgres: production webuser 127.0.0.1(21772) SELECT
postgres 31624 30713 0 16:11 ? 00:00:00 postgres: production webuser [local] idle
postgres 31771 30713 0 16:12 ? 00:00:00 postgres: production webuser 127.0.0.1(12422) idle
postgres 31772 30713 0 16:12 ? 00:00:00 postgres: production webuser 127.0.0.1(12421) idle
postgres 31773 30713 0 16:12 ? 00:00:00 postgres: production webuser 127.0.0.1(12424) idle
postgres 31774 30713 0 16:12 ? 00:00:00 postgres: production webuser 127.0.0.1(12425) idle
postgres 31775 30713 0 16:12 ? 00:00:00 postgres: production webuser 127.0.0.1(12426) idle
postgres 31776 30713 0 16:12 ? 00:00:00 postgres: production webuser 127.0.0.1(12427) idle
postgres 31777 30713 0 16:12 ? 00:00:00 postgres: production webuser 127.0.0.1(12428) idle

The SELECT process is the one that's stuck. top(1) and other indicators show that nothing is going on at all (no CPU usage, normal memory usage); the process seems to be blocked waiting for something. (The "idle" processes are attached to a FastCGI program.)

This has happened on *two different machines*, both doing completely different tasks. The first one is essentially a read-only warehouse that serves lots of queries, and the second one is the server we use to load the warehouse. In both cases, Postgres has been running for a long time, and is issuing SELECT statements that it's issued millions of times before with no problems. No other processes are accessing Postgres, just the web services.

This is a deadly bug, because our web site goes dead when this happens, and it requires an administrator to log in, kill the stuck postgres process, and then restart Postgres. We've installed a failover system so that the web site is diverted to a backup server, but since this has happened twice in one week, we're worried.

Any ideas?

Details:

Postgres 8.0.3
Linux 2.6.12-1.1381_FC3smp i686 i386

Dell 2-CPU Xeon system (hyperthreading is enabled)
4 GB memory
2 120 GB disks (SATA on machine 1, IDE on machine 2)

Thanks,
Craig


From: Chris <dmagick(at)gmail(dot)com>
To: "Craig A(dot) James" <cjames(at)modgraph-usa(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Postgres gets stuck
Date: 2006-05-10 00:51:41
Message-ID: 4461391D.2000800@gmail.com
Lists: pgsql-performance


> This is a deadly bug, because our web site goes dead when this happens,
> and it requires an administrator to log in and kill the stuck postgres
> process then restart Postgres. We've installed a failover system so that
> the web site is diverted to a backup server, but since this has happened
> twice in one week, we're worried.
>
> Any ideas?

Sounds like a deadlock issue.

Do you have query logging turned on?

Also, edit your postgresql.conf file and add (or uncomment):

stats_command_string = true

and restart postgresql.

then you'll be able to:

select * from pg_stat_activity;

to see what queries postgres is running and that might give you some clues.

--
Postgresql & php tutorials
http://www.designmagick.com/


From: "Qingqing Zhou" <zhouqq(at)cs(dot)toronto(dot)edu>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Postgres gets stuck
Date: 2006-05-11 15:26:12
Message-ID: e3vl2r$k7$1@news.hub.org
Lists: pgsql-performance


"Craig A. James" <cjames(at)modgraph-usa(dot)com> wrote
> I'm having a rare but deadly problem. On our web servers, a process
> occasionally gets stuck, and can't be unstuck. Once it's stuck, all
> Postgres activities cease. "kill -9" is required to kill it --
> signals 2 and 15 don't work, and "/etc/init.d/postgresql stop" fails.
>
> Details:
>
> Postgres 8.0.3
>

[Scanning 8.0.4 ~ 8.0.7 ...] I didn't find a related bug fix in the later
releases. Can you attach to the problematic process and "bt" it (so we
can see where it is stuck)?

Regards,
Qingqing


From: "Craig A(dot) James" <cjames(at)modgraph-usa(dot)com>
To: Chris <dmagick(at)gmail(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Postgres gets stuck
Date: 2006-05-11 15:53:34
Message-ID: 44635DFE.1060509@modgraph-usa.com
Lists: pgsql-performance

Chris wrote:
>
>> This is a deadly bug, because our web site goes dead when this
>> happens, ...
>
> Sounds like a deadlock issue.
> ...
> stats_command_string = true
> and restart postgresql.
> then you'll be able to:
> select * from pg_stat_activity;
> to see what queries postgres is running and that might give you some clues.

Thanks, good advice. You're absolutely right, it's stuck on a mutex. After doing what you suggested, I discovered that the query in progress is a user-written function (mine). When I log in as root and use "gdb -p <pid>" to attach to the process, here's what I find. Notice the second function in the stack (frame #1), a mutex lock:

(gdb) bt
#0 0x0087f7a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1 0x0096cbfe in __lll_mutex_lock_wait () from /lib/tls/libc.so.6
#2 0x008ff67b in _L_mutex_lock_3220 () from /lib/tls/libc.so.6
#3 0x4f5fc1b4 in ?? ()
#4 0x00dc5e64 in std::string::_Rep::_S_empty_rep_storage () from /usr/local/pgsql/lib/libchmoogle.so
#5 0x009ffcf0 in ?? () from /usr/lib/libz.so.1
#6 0xbfe71c04 in ?? ()
#7 0xbfe71e50 in ?? ()
#8 0xbfe71b78 in ?? ()
#9 0x009f7019 in zcfree () from /usr/lib/libz.so.1
#10 0x009f7019 in zcfree () from /usr/lib/libz.so.1
#11 0x009f8b7c in inflateEnd () from /usr/lib/libz.so.1
#12 0x00c670a2 in ~basic_unzip_streambuf (this=0xbfe71be0) at zipstreamimpl.h:332
#13 0x00c60b61 in OpenBabel::OBConversion::Read (this=0x1, pOb=0xbfd923b8, pin=0xffffffea) at istream:115
#14 0x00c60fd8 in OpenBabel::OBConversion::ReadString (this=0x8672b50, pOb=0xbfd923b8) at obconversion.cpp:780
#15 0x00c19d69 in chmoogle_ichem_mol_alloc () at stl_construct.h:120
#16 0x00c1a203 in chmoogle_ichem_normalize_parent () at stl_construct.h:120
#17 0x00c1b172 in chmoogle_normalize_parent_sdf () at vector.tcc:243
#18 0x0810ae4d in ExecMakeFunctionResult ()
#19 0x0810de2e in ExecProject ()
#20 0x08115972 in ExecResult ()
#21 0x08109e01 in ExecProcNode ()
#22 0x00000020 in ?? ()
#23 0xbed4b340 in ?? ()
#24 0xbf92d9a0 in ?? ()
#25 0xbed4b0c0 in ?? ()
#26 0x00000000 in ?? ()

It looks to me like my code is trying to read the input parameter (a fairly long string, maybe 2K) from a buffer that was gzip'ed by Postgres for the trip between the client and server. My suspicion is that it's an incompatibility between malloc() libraries. libz (gzip compression) is calling something called zcfree, which then appears to be intercepted by something that's (probably statically) linked into my library. And somewhere along the way, a mutex gets set, and then ... it's stuck forever.

ps(1) shows that this process had been running for about 7 hours, and the job status showed that this function had been called successfully about 1 million times before it hung on this mutex lock.
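
One way a process with no threads of its own can still block in a mutex wait is by trying to lock a mutex that it already holds. Whether that is what happened here is not established; the following is only a minimal C sketch of the mechanism. The helper name relock_same_thread is hypothetical, and it deliberately uses an error-checking mutex so that the second attempt returns an error code instead of hanging.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical helper, not taken from the thread: lock an error-checking
 * mutex twice from the same thread and return the result of the second
 * attempt. */
int relock_same_thread(void)
{
    pthread_mutexattr_t attr;
    pthread_mutex_t lock;
    int rc;

    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    pthread_mutex_init(&lock, &attr);
    pthread_mutexattr_destroy(&attr);

    rc = pthread_mutex_lock(&lock);   /* first acquisition succeeds */
    assert(rc == 0);

    rc = pthread_mutex_lock(&lock);   /* same thread again: EDEADLK */

    pthread_mutex_unlock(&lock);
    pthread_mutex_destroy(&lock);
    return rc;
}
```

With a PTHREAD_MUTEX_NORMAL mutex, the second pthread_mutex_lock() call would never return, because the only thread that could release the lock is the one waiting for it.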

Any ideas?

Thanks,
Craig


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Craig A(dot) James" <cjames(at)modgraph-usa(dot)com>
Cc: Chris <dmagick(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Postgres gets stuck
Date: 2006-05-12 00:03:26
Message-ID: 18707.1147392206@sss.pgh.pa.us
Lists: pgsql-performance

"Craig A. James" <cjames(at)modgraph-usa(dot)com> writes:
> My suspicion is that it's an incompatibility between malloc()
> libraries.

On Linux there's only supposed to be one malloc, ie, glibc's version.
On other platforms I'd be worried about threaded vs non-threaded libc
(because the backend is not threaded), but not Linux.

There may be a more basic threading problem here, though, rooted in the
precise fact that the backend isn't threaded. If you're trying to use
any libraries that assume they can have multiple threads, I wouldn't be
at all surprised to see things go boom. C++ exception handling could be
problematic too.

Or it could be a garden variety glibc bug. How up-to-date is your
platform?

regards, tom lane


From: "Craig A(dot) James" <cjames(at)modgraph-usa(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Chris <dmagick(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Postgres gets stuck
Date: 2006-05-12 02:10:17
Message-ID: 4463EE89.8010103@modgraph-usa.com
Lists: pgsql-performance

Tom Lane wrote:
> >My suspicion is that it's an incompatibility between malloc()
> >libraries.
>
> On Linux there's only supposed to be one malloc, ie, glibc's version.
> On other platforms I'd be worried about threaded vs non-threaded libc
> (because the backend is not threaded), but not Linux.

I guess I misinterpreted the Postgres manual, which says (in 31.9, "C Language Functions"),

"When allocating memory, use the PostgreSQL functions palloc and pfree
instead of the corresponding C library functions malloc and free."

I imagined that perhaps palloc/pfree used mutexes for something. But if I understand you, palloc() and pfree() are just wrappers around malloc() and free(), and don't (for example) make their own separate calls to brk(2), sbrk(2), or their kin. If that's the case, then you answered my question - it's all ordinary malloc/free calls in the end, and that's not the source of the problem.

> There may be a more basic threading problem here, though, rooted in the
> precise fact that the backend isn't threaded. If you're trying to use
> any libraries that assume they can have multiple threads, I wouldn't be
> at all surprised to see things go boom.

No threading anywhere. None of the libraries use threads or mutexes. It's just plain old vanilla C/C++ scientific algorithms.

> C++ exception handling could be problematic too.

No C++ exceptions are thrown anywhere in the code, though I suppose one of the I/O libraries could throw an exception, e.g. when reading from a file. But there's no evidence of this after millions of identical operations succeeded. In addition, the stack trace shows it to be stuck in a memory operation, not an I/O operation.

> Or it could be a garden variety glibc bug. How up-to-date is your
> platform?

I guess this is the next place to look. From the few answers I've gotten, it sounds like this isn't a known Postgres issue, and my stack trace doesn't seem to be familiar to anyone on this forum. Oh well... thanks for your help.

Craig


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Craig A(dot) James" <cjames(at)modgraph-usa(dot)com>
Cc: Chris <dmagick(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Postgres gets stuck
Date: 2006-05-12 02:51:12
Message-ID: 19950.1147402272@sss.pgh.pa.us
Lists: pgsql-performance

"Craig A. James" <cjames(at)modgraph-usa(dot)com> writes:
> I guess I misinterpreted the Postgres manual, which says (in 31.9, "C Language Functions"),

> "When allocating memory, use the PostgreSQL functions palloc and pfree
> instead of the corresponding C library functions malloc and free."

> I imagined that perhaps palloc/pfree used mutexes for something. But if I understand you, palloc() and pfree() are just wrappers around malloc() and free(), and don't (for example) make their own separate calls to brk(2), sbrk(2), or their kin.

Correct. palloc/pfree are all about managing the lifetime of memory
allocations, so that (for example) a function can return a palloc'd data
structure without worrying about whether that creates a long-term memory
leak. But ultimately they just use malloc/free, and there's certainly
not any threading or mutex considerations in there.
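
To illustrate the lifetime idea, here is a minimal sketch in C of an allocator that gets every chunk from malloc() but remembers which "context" it belongs to, so that everything can be released in one call. This is a toy under stated assumptions, not PostgreSQL's implementation: the names ToyContext, toy_alloc and toy_reset are hypothetical, and the real palloc machinery is considerably more elaborate.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy chunk header; the union keeps the caller's memory suitably aligned. */
typedef union ToyChunk {
    union ToyChunk *next;   /* next live chunk in the same context */
    max_align_t     align;  /* forces worst-case alignment (C11) */
} ToyChunk;

/* Toy stand-in for a memory context: a list of live chunks. */
typedef struct ToyContext {
    ToyChunk *chunks;
    size_t    nchunks;
} ToyContext;

/* Allocate from a context: plain malloc() plus bookkeeping, no locking. */
void *toy_alloc(ToyContext *cx, size_t size)
{
    ToyChunk *c;

    assert(cx != NULL);
    c = malloc(sizeof(ToyChunk) + size);
    if (c == NULL)
        return NULL;
    c->next = cx->chunks;
    cx->chunks = c;
    cx->nchunks++;
    return c + 1;           /* usable memory starts after the header */
}

/* Release everything allocated in the context; returns the chunk count. */
size_t toy_reset(ToyContext *cx)
{
    size_t freed = 0;

    assert(cx != NULL);
    while (cx->chunks != NULL) {
        ToyChunk *next = cx->chunks->next;
        free(cx->chunks);
        cx->chunks = next;
        freed++;
    }
    cx->nchunks = 0;
    return freed;
}
```

A function can hand back memory from toy_alloc() without freeing it itself, because whoever owns the context calls toy_reset() once the results are no longer needed.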

> No threading anywhere. None of the libraries use threads or mutexes. It's just plain old vanilla C/C++ scientific algorithms.

Darn, my best theory down the drain.

>> Or it could be a garden variety glibc bug. How up-to-date is your
>> platform?

> I guess this is the next place to look.

Let us know how it goes...

regards, tom lane