Re: [PATCH] Log crashed backend's query (activity string)

Lists: pgsql-hackers
From: Marti Raudsepp <marti(at)juffo(dot)org>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: [PATCH] Log crashed backend's query (activity string)
Date: 2011-09-06 20:52:18
Message-ID: CABRT9RDuuV6a+zJN+DMJjt1ahwk1_50iEN0mPdRV9fjE7gHjuQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi list,

This patch adds the backend's current running query to the "backend
crash" message.

The crashing query is often a good starting point in debugging the
cause, and much more easily accessible than core dumps.

Example output:
LOG: server process (PID 31451) was terminated by signal 9: Killed
DETAIL: Running query: DO LANGUAGE plpythonu 'import os;os.kill(os.getpid(),9)'

The message "Running query" might not be entirely true, as it might be
e.g. a vacuum activity string too. But it sounds better than "Activity
string" or anything else I could come up with.

Also refactored pgstat_get_backend_current_activity() such that it
always returns consistent results. (Previously it returned a pointer to
shared memory that could theoretically change and was vulnerable to
races) The function can also deal with shared memory corruption (if
isCrashed is true), so that corruption doesn't cause a postmaster crash
or hang.

I'm not happy with the indenting depth in the function, but it would
be hard to split anything out.

Regards,
Marti

Attachment Content-Type Size
0001-Log-crashed-backend-s-query-activity-string.patch text/x-patch 8.2 KB

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Marti Raudsepp <marti(at)juffo(dot)org>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PATCH] Log crashed backend's query (activity string)
Date: 2011-09-06 21:03:30
Message-ID: CA+Tgmoa_Z6NGF68pv9S_0=1b7DR0T4dtoRwiXvF8R=F-5Y4USg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Sep 6, 2011 at 4:52 PM, Marti Raudsepp <marti(at)juffo(dot)org> wrote:
> This patch adds the backend's current running query to the "backend
> crash" message.
>
> The crashing query is often a good starting point in debugging the
> cause, and much more easily accessible than core dumps.
>
> Example output:
> LOG:  server process (PID 31451) was terminated by signal 9: Killed
> DETAIL:  Running query: DO LANGUAGE plpythonu 'import os;os.kill(os.getpid(),9)'
>
> The message "Running query" might not be entirely true, as it might be
> e.g. a vacuum activity string too. But it sounds better than "Activity
> string" or anything else I could come up with.
>
> Also refactored pgstat_get_backend_current_activity() such that it
> always returns consistent results. (Previously it returned a pointer to
> shared memory that could theoretically change and was vulnerable to
> races) The function can also deal with shared memory corruption (if
> isCrashed is true), so that corruption doesn't cause a postmaster crash
> or hang.

I haven't looked at the patch, but boy would this save me a lot of
support headaches if we can make it work.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Marti Raudsepp <marti(at)juffo(dot)org>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PATCH] Log crashed backend's query (activity string)
Date: 2011-09-06 21:34:51
Message-ID: 22764.1315344891@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Marti Raudsepp <marti(at)juffo(dot)org> writes:
> This patch adds the backend's current running query to the "backend
> crash" message.

Sorry, this patch is entirely unacceptable. We cannot have the
postmaster's functioning depending on the contents of shared memory
still being valid ... most especially not when we know that somebody
just crashed, and could have corrupted the shared memory in arbitrary
ways. No, I don't think your attempts to validate the data are
adequate, nor do I believe they can be made adequate. And I doubt
that the goal is worth taking risks for.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Marti Raudsepp <marti(at)juffo(dot)org>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PATCH] Log crashed backend's query (activity string)
Date: 2011-09-06 21:58:01
Message-ID: CA+Tgmoa9JOFWVUS0hGPR+48Ka0tm_QkpobRX0wALGukSLtx8uA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Sep 6, 2011 at 5:34 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Marti Raudsepp <marti(at)juffo(dot)org> writes:
>> This patch adds the backend's current running query to the "backend
>> crash" message.
>
> Sorry, this patch is entirely unacceptable.  We cannot have the
> postmaster's functioning depending on the contents of shared memory
> still being valid ... most especially not when we know that somebody
> just crashed, and could have corrupted the shared memory in arbitrary
> ways.  No, I don't think your attempts to validate the data are
> adequate, nor do I believe they can be made adequate.

Why and why not?

> And I doubt
> that the goal is worth taking risks for.

I am unable to count the number of times that I have had a customer
come to me and say "well, the backend crashed". And I go look at
their logs and I have no idea what happened. So then I tell them to
include %p in log_line_prefix and set log_min_duration_statement=0 and
call me if it happens again. This is a huge nuisance and a serious
interference with attempts to do meaningful troubleshooting. When it
doesn't add days or weeks to the time to resolution, it's because it
prevents resolution altogether. We really, really need something like
this.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Marti Raudsepp <marti(at)juffo(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PATCH] Log crashed backend's query (activity string)
Date: 2011-09-06 22:02:44
Message-ID: CABRT9RDkvJgU-_cykw3W=mNZLbSoEC82vAgjLVX9P=bmhp+Zbg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Sep 7, 2011 at 00:34, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> No, I don't think your attempts to validate the data are
> adequate, nor do I believe they can be made adequate.

Can you think of any concrete situations that would fail the current validation?

As far as I can tell, the only pointer I'm following from the shm
region is vbeentry->st_activity, which is being validated that it
still points to the BackendActivityBuffer.

The rest -- BackendStatusArray, BackendActivityBuffer, MaxBackends and
pgstat_track_activity_query_size -- are process-local variables and
initialized at startup time. Given that other backends cannot change
these variables, nor deallocate postmaster's shm regions, we can take
for granted that they still point to the shm region.

Whether the data contained in shm is corrupt or not, we really don't
care. If we don't find the matching PID, we just return a placeholder
string. In the worst case, the activity buffer is corrupt and the user
gets a kilobyte of garbage in their log file.

Ok, well there's a worse case -- if a backend keeps incrementing
vbeentry->st_changecount at a faster rate. However, this seems pretty
unlikely if the backend has already terminated. We could terminate
other backends *first* before logging the query to make sure they
can't touch it.

Regards,
Marti


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Marti Raudsepp <marti(at)juffo(dot)org>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PATCH] Log crashed backend's query (activity string)
Date: 2011-09-06 22:05:46
Message-ID: 2793.1315346746@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Tue, Sep 6, 2011 at 5:34 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> And I doubt
>> that the goal is worth taking risks for.

> I am unable to count the number of times that I have had a customer
> come to me and say "well, the backend crashed". And I go look at
> their logs and I have no idea what happened.

gdb and print debug_query_string?

I don't dispute that this would be nice to have. But I don't think that
it's sane to compromise the postmaster's reliability in order to print
information of doubtful accuracy.

If you want to do something that doesn't violate the system's basic
design goals, think about setting up a SIGSEGV handler that tries to
print debug_query_string via elog before crashing. It might well crash
too, but it won't be risking taking out more of the database with it.

regards, tom lane


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Robert Haas" <robertmhaas(at)gmail(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Marti Raudsepp" <marti(at)juffo(dot)org>, "pgsql-hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PATCH] Log crashed backend's query (activity string)
Date: 2011-09-06 22:09:44
Message-ID: 4E6653D80200002500040DCC@gw.wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Marti Raudsepp <marti(at)juffo(dot)org> writes:
>>> This patch adds the backend's current running query to the
>>> "backend crash" message.
>>
>> Sorry, this patch is entirely unacceptable. We cannot have the
>> postmaster's functioning depending on the contents of shared
>> memory still being valid ... most especially not when we know
>> that somebody just crashed, and could have corrupted the shared
>> memory in arbitrary ways. No, I don't think your attempts to
>> validate the data are adequate, nor do I believe they can be made
>> adequate.
>
> Why and why not?
>
>> And I doubt
>> that the goal is worth taking risks for.
>
> I am unable to count the number of times that I have had a
> customer come to me and say "well, the backend crashed". And I go
> look at their logs and I have no idea what happened. So then I
> tell them to include %p in log_line_prefix and set
> log_min_duration_statement=0 and call me if it happens again.
> This is a huge nuisance and a serious interference with attempts
> to do meaningful troubleshooting. When it doesn't add days or
> weeks to the time to resolution, it's because it prevents
> resolution altogether. We really, really need something like
> this.

I haven't had this experience more than a few times, but a few is
enough to recognize how painful it can be. It seems we're brave
enough to log *some* information at crash time, in spite of the risk
that memory may be corrupted in unpredictable ways. Sure, there is
a slim chance that when you think you're writing to the log you've
actually got a handle to a segment of a heap file, but that chance
is extremely slim -- and if that's where you're at you've probably
already written a 'segfault' message there, anyway. My gut feel is
this would allow diagnosis in a timely fashion often enough to save
more data than it puts at risk, to say nothing of people's time.

I don't know whether the patch on the table is coded as defensively
as it should be given the perilous times the new code would come
into play, but I don't think the idea should be rejected out of
hand.

-Kevin


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Marti Raudsepp <marti(at)juffo(dot)org>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PATCH] Log crashed backend's query (activity string)
Date: 2011-09-06 22:57:07
Message-ID: CA+TgmoZ1xt=0cHvu4PBFze9SHZS5UhqXJPhqn30vUukvUUxEhQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Sep 6, 2011 at 6:05 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> On Tue, Sep 6, 2011 at 5:34 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> And I doubt
>>> that the goal is worth taking risks for.
>
>> I am unable to count the number of times that I have had a customer
>> come to me and say "well, the backend crashed".  And I go look at
>> their logs and I have no idea what happened.
>
> gdb and print debug_query_string?

Surely you're kidding. These are customer systems which I frequently
don't even have access to. They don't always have gdb installed
(sometimes they are Windows systems) and if they do the customer isn't
likely to know how to use it, and even if they do they don't think the
better of us for needing such a tool to troubleshoot a crash. Even if
none of that were an issue, gdb is only going to work if you attach it
before the crash or have a core dump available. Typically you don't
know the crash is going to happen and core dumps aren't enabled
anyway.

> I don't dispute that this would be nice to have.  But I don't think that
> it's sane to compromise the postmaster's reliability in order to print
> information of doubtful accuracy.

In practice, I think very few crashes will clobber it. A lot of
crashes are going to be caused by a null pointer deference in some
random part of the program, an assertion failure, the OOM killer, etc.
It's certainly POSSIBLE that it could get clobbered, but it shouldn't
be very likely; and as Marti says, with proper defensive coding, the
worst case scenario if it does happen should be some log garbage.

> If you want to do something that doesn't violate the system's basic
> design goals, think about setting up a SIGSEGV handler that tries to
> print debug_query_string via elog before crashing.  It might well crash
> too, but it won't be risking taking out more of the database with it.

I don't think that's adequate. You need to trap a lot more than just
SIGSEGV to catch all the crashes - there's also SIGABRT and SIGILL and
a bunch of other ones, including SIGKILL. I think you really, really
need something that executes outside the context of the dying process.

TBH, I'm very unclear what could cause the postmaster to go belly-up
copying a bounded amount of data out of shared memory for logging
purposes only. It's surely possible to make the code safe against any
sequence of bytes that might be found there. The only real danger
seems to be that the memory access itself might trigger a segmentation
fault of some sort - but how is that going to happen? The child can't
unmap the address space in the parent, can it? If it's a real danger,
perhaps we could fork off a dedicated child process just to read the
relevant portion of shared memory and emit a log message - but I'm not
seeing what plausible scenario that would guard against.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Marti Raudsepp <marti(at)juffo(dot)org>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PATCH] Log crashed backend's query (activity string)
Date: 2011-09-07 19:42:45
Message-ID: 1315424436-sup-2007@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Excerpts from Robert Haas's message of mar sep 06 19:57:07 -0300 2011:
> On Tue, Sep 6, 2011 at 6:05 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> > Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> >> On Tue, Sep 6, 2011 at 5:34 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> >>> And I doubt
> >>> that the goal is worth taking risks for.
> >
> >> I am unable to count the number of times that I have had a customer
> >> come to me and say "well, the backend crashed".  And I go look at
> >> their logs and I have no idea what happened.
> >
> > gdb and print debug_query_string?
>
> Surely you're kidding. These are customer systems which I frequently
> don't even have access to. They don't always have gdb installed
> (sometimes they are Windows systems) and if they do the customer isn't
> likely to know how to use it, and even if they do they don't think the
> better of us for needing such a tool to troubleshoot a crash.

I'm with Robert on this ... been there, got that look.

> TBH, I'm very unclear what could cause the postmaster to go belly-up
> copying a bounded amount of data out of shared memory for logging
> purposes only. It's surely possible to make the code safe against any
> sequence of bytes that might be found there.

A mishandled encoding conversion could be problematic, so that needs to
be carefully considered (perhaps just shut off unconditionally).

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Marti Raudsepp <marti(at)juffo(dot)org>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PATCH] Log crashed backend's query (activity string)
Date: 2011-09-07 20:00:44
Message-ID: 836.1315425644@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> Excerpts from Robert Haas's message of mar sep 06 19:57:07 -0300 2011:
>> TBH, I'm very unclear what could cause the postmaster to go belly-up
>> copying a bounded amount of data out of shared memory for logging
>> purposes only. It's surely possible to make the code safe against any
>> sequence of bytes that might be found there.

> A mishandled encoding conversion could be problematic, so that needs to
> be carefully considered (perhaps just shut off unconditionally).

That, and the question of exactly what makes the amount bounded, and
probably six other things that could go wrong. But I'm sure Andrew
won't be pleased with a proposal to inject unknown-encoding data into
the logs.

I remain of the opinion that this needs to be kept out of the postmaster.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Marti Raudsepp <marti(at)juffo(dot)org>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PATCH] Log crashed backend's query (activity string)
Date: 2011-09-07 20:01:19
Message-ID: CA+TgmoZjCpKB_HNW64u_EU7sHASxXeJTp0TeO+tYG3k_BwKF5w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Sep 7, 2011 at 3:42 PM, Alvaro Herrera
<alvherre(at)commandprompt(dot)com> wrote:
>> TBH, I'm very unclear what could cause the postmaster to go belly-up
>> copying a bounded amount of data out of shared memory for logging
>> purposes only.  It's surely possible to make the code safe against any
>> sequence of bytes that might be found there.
>
> A mishandled encoding conversion could be problematic, so that needs to
> be carefully considered (perhaps just shut off unconditionally).

It's not really a problem for the postmaster if something just plain
old fails. Where we get into trouble is if it manages to (a) crash,
(b) take an excessive amount of time to complete, or (c) screw up the
postmaster state in some way we can't recover from. But if any of
those are an issue then, yeah, just shut it off.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Marti Raudsepp <marti(at)juffo(dot)org>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PATCH] Log crashed backend's query (activity string)
Date: 2011-09-07 20:02:59
Message-ID: CA+TgmoZHpu5nixc3i7J9M-_V+0FDcXbF4DMnrU9u7QEu4dZxxw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Sep 7, 2011 at 4:00 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
>> Excerpts from Robert Haas's message of mar sep 06 19:57:07 -0300 2011:
>>> TBH, I'm very unclear what could cause the postmaster to go belly-up
>>> copying a bounded amount of data out of shared memory for logging
>>> purposes only.  It's surely possible to make the code safe against any
>>> sequence of bytes that might be found there.
>
>> A mishandled encoding conversion could be problematic, so that needs to
>> be carefully considered (perhaps just shut off unconditionally).
>
> That, and the question of exactly what makes the amount bounded, and

The fact that it's a fixed-size chunk of shmem. We won't copy more
bytes than the size of the buffer.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Marti Raudsepp <marti(at)juffo(dot)org>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PATCH] Log crashed backend's query (activity string)
Date: 2011-09-07 20:15:23
Message-ID: 1724.1315426523@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Wed, Sep 7, 2011 at 3:42 PM, Alvaro Herrera
> <alvherre(at)commandprompt(dot)com> wrote:
>> A mishandled encoding conversion could be problematic, so that needs to
>> be carefully considered (perhaps just shut off unconditionally).

> It's not really a problem for the postmaster if something just plain
> old fails. Where we get into trouble is if it manages to (a) crash,
> (b) take an excessive amount of time to complete, or (c) screw up the
> postmaster state in some way we can't recover from. But if any of
> those are an issue then, yeah, just shut it off.

Keep in mind that in the postmaster, elog(ERROR) *is* a crash.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Marti Raudsepp <marti(at)juffo(dot)org>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PATCH] Log crashed backend's query (activity string)
Date: 2011-09-07 20:30:13
Message-ID: CA+Tgmobjp4zgy-vRznJ2NEZJBjgfTXEP1PRZnAQrgNfW=mJkaQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Sep 7, 2011 at 4:15 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> On Wed, Sep 7, 2011 at 3:42 PM, Alvaro Herrera
>> <alvherre(at)commandprompt(dot)com> wrote:
>>> A mishandled encoding conversion could be problematic, so that needs to
>>> be carefully considered (perhaps just shut off unconditionally).
>
>> It's not really a problem for the postmaster if something just plain
>> old fails.  Where we get into trouble is if it manages to (a) crash,
>> (b) take an excessive amount of time to complete, or (c) screw up the
>> postmaster state in some way we can't recover from.  But if any of
>> those are an issue then, yeah, just shut it off.
>
> Keep in mind that in the postmaster, elog(ERROR) *is* a crash.

Right... but a function that returns -1 to indicate that something
didn't work should be OK, as long as whatever it does is otherwise
extremely boring.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Marti Raudsepp <marti(at)juffo(dot)org>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PATCH] Log crashed backend's query (activity string)
Date: 2011-09-07 21:09:32
Message-ID: CABRT9RCaMs_VeCB0Ljvpib=0u9VZVijcOYhwzKqFo-_fgifztw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Sep 7, 2011 at 22:42, Alvaro Herrera <alvherre(at)commandprompt(dot)com> wrote:
> A mishandled encoding conversion could be problematic, so that needs to
> be carefully considered (perhaps just shut off unconditionally).

Are you referring to log file encoding or something else? The log file
is already potentially mixed-encoding, as different databases may have
different encodings and backends just dump bytes to it in their
current encoding.

pg_verify_mbstr() could potentially be used with noError=true if we
can figure out the backend's current encoding, but that indeed sounds
like something to avoid.

Regards,
Marti


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Marti Raudsepp <marti(at)juffo(dot)org>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PATCH] Log crashed backend's query (activity string)
Date: 2011-09-07 21:18:38
Message-ID: 6215.1315430318@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Wed, Sep 7, 2011 at 4:15 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Keep in mind that in the postmaster, elog(ERROR) *is* a crash.

> Right... but a function that returns -1 to indicate that something
> didn't work should be OK, as long as whatever it does is otherwise
> extremely boring.

The more functionality you put in the postmaster, the more likely it is
to trip over an elog(ERROR) somewhere.

regards, tom lane


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Marti Raudsepp <marti(at)juffo(dot)org>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PATCH] Log crashed backend's query (activity string)
Date: 2011-09-07 21:24:53
Message-ID: 1315430492-sup-8827@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Excerpts from Marti Raudsepp's message of mié sep 07 18:09:32 -0300 2011:
> On Wed, Sep 7, 2011 at 22:42, Alvaro Herrera <alvherre(at)commandprompt(dot)com> wrote:
> > A mishandled encoding conversion could be problematic, so that needs to
> > be carefully considered (perhaps just shut off unconditionally).
>
> Are you referring to log file encoding or something else? The log file
> is already potentially mixed-encoding, as different databases may have
> different encodings and backends just dump bytes to it in their
> current encoding.

I am referring to the fact that whatever the backend puts in shared
memory is going to be in its encoding setting, which may not necessarily
match the postmaster's. And if it doesn't, the postmaster might try to
convert it using settings not valid for the string, possibly leading to
crashes.

I remember we had bugs whereby an encoding conversion would fail,
leading to elog trying to report this problem, but this attempt would
also incur a conversion step, failing recursively until elog's stack got
full. I'm not saying this is impossible to solve, just something to
keep in mind.

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Marti Raudsepp <marti(at)juffo(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PATCH] Log crashed backend's query (activity string)
Date: 2011-09-08 00:22:30
Message-ID: CA+TgmoZ0BAZe_3_b+h0LC=WTU9VYwJX0qAiL0tNz6Y9H7v74gQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Sep 7, 2011 at 5:24 PM, Alvaro Herrera
<alvherre(at)commandprompt(dot)com> wrote:
> Excerpts from Marti Raudsepp's message of mié sep 07 18:09:32 -0300 2011:
>> On Wed, Sep 7, 2011 at 22:42, Alvaro Herrera <alvherre(at)commandprompt(dot)com> wrote:
>> > A mishandled encoding conversion could be problematic, so that needs to
>> > be carefully considered (perhaps just shut off unconditionally).
>>
>> Are you referring to log file encoding or something else? The log file
>> is already potentially mixed-encoding, as different databases may have
>> different encodings and backends just dump bytes to it in their
>> current encoding.
>
> I am referring to the fact that whatever the backend puts in shared
> memory is going to be in its encoding setting, which may not necessarily
> match the postmaster's.  And if it doesn't, the postmaster might try to
> convert it using settings not valid for the string, possibly leading to
> crashes.
>
> I remember we had bugs whereby an encoding conversion would fail,
> leading to elog trying to report this problem, but this attempt would
> also incur a conversion step, failing recursively until elog's stack got
> full.  I'm not saying this is impossible to solve, just something to
> keep in mind.

Can we do something like: pass through ASCII characters unchanged, and
output anything with the high-bit set as \x<hexdigit><hexdigit>? That
might be garbled in some cases, but the goal here is not perfection.
We're just trying to give the admin (or PostgreSQL-guru-for-hire) a
clue where to start looking for the problem.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Marti Raudsepp <marti(at)juffo(dot)org>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PATCH] Log crashed backend's query (activity string)
Date: 2011-09-09 09:51:23
Message-ID: CABRT9RDcJhKfw8shbO9QdHqAFC6fqcLyXXcmjx4CdFnntEBsLQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Sep 8, 2011 at 03:22, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Wed, Sep 7, 2011 at 5:24 PM, Alvaro Herrera
> <alvherre(at)commandprompt(dot)com> wrote:
>> I remember we had bugs whereby an encoding conversion would fail,
>> leading to elog trying to report this problem, but this attempt would
>> also incur a conversion step, failing recursively until elog's stack got
>> full.  I'm not saying this is impossible to solve, just something to
>> keep in mind.

Looking at elog.c, this only seems to apply to messages sent to the
client from a backend connection. No conversion is done for log
messages.

> Can we do something like: pass through ASCII characters unchanged, and
> output anything with the high-bit set as \x<hexdigit><hexdigit>?  That
> might be garbled in some cases, but the goal here is not perfection.
> We're just trying to give the admin (or PostgreSQL-guru-for-hire) a
> clue where to start looking for the problem.

Or we might just replace them with '?'. This has the advantage of not
expanding query length 4x if it does happen to be corrupted. The vast
majority of queries are ASCII-only anyway.

Regards,
Marti


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Marti Raudsepp <marti(at)juffo(dot)org>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PATCH] Log crashed backend's query (activity string)
Date: 2011-09-24 17:51:28
Message-ID: CAMkU=1wYxk=nqiL14VeiYrAkZyGLaP7n5unch7OBzdnOFnU0zA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Sep 9, 2011 at 2:51 AM, Marti Raudsepp <marti(at)juffo(dot)org> wrote:
> On Thu, Sep 8, 2011 at 03:22, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> On Wed, Sep 7, 2011 at 5:24 PM, Alvaro Herrera
>> <alvherre(at)commandprompt(dot)com> wrote:
>>> I remember we had bugs whereby an encoding conversion would fail,
>>> leading to elog trying to report this problem, but this attempt would
>>> also incur a conversion step, failing recursively until elog's stack got
>>> full.  I'm not saying this is impossible to solve, just something to
>>> keep in mind.
>
> Looking at elog.c, this only seems to apply to messages sent to the
> client from a backend connection. No conversion is done for log
> messages.
>
>> Can we do something like: pass through ASCII characters unchanged, and
>> output anything with the high-bit set as \x<hexdigit><hexdigit>?  That
>> might be garbled in some cases, but the goal here is not perfection.
>> We're just trying to give the admin (or PostgreSQL-guru-for-hire) a
>> clue where to start looking for the problem.
>
> Or we might just replace them with '?'. This has the advantage of not
> expanding query length 4x if it does happen to be corrupted. The vast
> majority of queries are ASCII-only anyway.

Should this patch be reviewed as is, or will the substitution of
non-ASCII be implemented?

It seems like everyone agrees that this feature is wanted, but Tom is
still very much opposed to the general approach to implement it, as
being too dangerous.
Is it the reviewer's job to try to convince him otherwise?

Thanks,

Jeff


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Marti Raudsepp <marti(at)juffo(dot)org>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PATCH] Log crashed backend's query (activity string)
Date: 2011-09-24 19:23:12
Message-ID: CA+TgmoYg-7w15KpvkzgjJGj6Wjuuf9uimZonKH3yBs1bN9Seug@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Sep 24, 2011 at 1:51 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
> It seems like everyone agrees that this feature is wanted, but Tom is
> still very much opposed to the general approach to implement it, as
> being too dangerous.
> Is it the reviewer's job to try to convince him otherwise?

It's not the reviewer's job to convince Tom of anything in particular,
but I think it's helpful for them to state their opinion, whatever it
may be (agreeing with Tom, disagreeing with Tom, or whatever).

IMHO, the most compelling argument against the OP's approach made so
far is the encoding issue. I was hoping someone (maybe the OP?) would
have an opinion on that, an idea how to work around it, or something.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Marti Raudsepp <marti(at)juffo(dot)org>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PATCH] Log crashed backend's query (activity string)
Date: 2011-09-24 19:57:41
Message-ID: CABRT9RBCnU5ceeQd48WrR2aDkdXNnai41DrAyOn2LpxiQSm4UA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Sep 24, 2011 at 22:23, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> It's not the reviewer's job to convince Tom of anything in particular,
> but I think it's helpful for them to state their opinion, whatever it
> may be (agreeing with Tom, disagreeing with Tom, or whatever).

My opinion is that this can be made safe enough and I explained why I
think so here: http://archives.postgresql.org/pgsql-hackers/2011-09/msg00308.php

Launching another process to read 1kB out of shared memory and print
it to log sounds like overkill. But if that's deemed necessary, I'm
willing to code it up too.

However, I now realize that it does make sense to write a separate
simpler function for the crashed backend case with no
vbeentry->st_changecount check loops, no checkUser, etc. That would be
more robust and easier to review.

I'll try to send a new patch implementing this in a few days.

> IMHO, the most compelling argument against the OP's approach made so
> far is the encoding issue.  I was hoping someone (maybe the OP?) would
> have an opinion on that, an idea how to work around it, or something.

I propsed replacing non-ASCII characters with '?' earlier. That would
be simpler to code, but obviously wouldn't preserve non-ASCII
characters in case the crash has anything to do with those. Since
nobody else weighed in on the '\x##' vs '?' choice, I didn't implement
it yet; but I will in my next submission.

Regards,
Marti