Re: RFC: Timing Events

From: Greg Smith <greg(at)2ndQuadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: RFC: Timing Events
Date: 2012-11-01 23:08:39
Message-ID: 509300F7.5000803@2ndQuadrant.com
Lists: pgsql-hackers

Parsing log files for commonly needed performance data is no fun. As a
next step toward eliminating that, I've been working on how to approach
this similarly to how pg_stat_statements can cut down on query log
processing. I thought it would be novel to outline this for design
review before coding any more of it. I keep switching between calling
the sort of things people want to know about "Timing Events" and
"Performance Events"; no strong preference here, so I'll use the former
because it's shorter.

Modeling this on pg_stat_statements includes the hope of packaging it as
an extension with minor core hooks, and the idea that there would be a
fixed size list of timing events available at any time. Consider it a
FIFO queue built using a circular buffer. Dump events into there and
provide a way to view them. If some fall out before they're
analyzed/saved, that doesn't seem a problem for now. If you want 100%
durability, the log is still available. Eventually I'd like a consumer for
these that wrote them to a history table as an option, but that seems a
second priority after getting them into memory in the first place. I'd
like that consumer thing for pg_stat_statements too, but so far we're
getting by without it. It seems like something that might benefit from
the in-core queuing work one day too.

= Initial event list =

There are three particular events I have been thinking about, hoping
that will be enough to make sure the structure is general enough to
handle more later:

-log_lock_waits output. This is my #1 goal. Lock waits are incredibly
common issues in larger systems, and the process for tracking them down
right now is very painful. Probably deadlocks too.

-log_temp_files output. Everyone wants to monitor this, but it won't be
as useful unless query tying (below) works.

-log_checkpoints. The fight to make this data available via
pg_stat_bgwriter has been useful. You really need a trending system to
do anything useful with the data that way though. Being able to write a
simple timing events query that looked for checkpoint spikes would be
great for alerting, which is often done with something other than a
trending package. To give specific examples here, the graphs Munin
makes for pg_stat_bgwriter data work, but they're not much of a help if
what you really want is to trigger a Nagios alert when checkpoints are
slow. A Timing Event entry of a slow checkpoint would be easy to
trigger alerts from.
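
To make that concrete, here is a purely hypothetical sketch of such a
check--nothing named here exists yet; it just assumes a pg_timing_events
view with the columns outlined below, that the type code is decoded to
text, and that the details are queryable as key/value pairs:

-- Hypothetical: a Nagios-style check that fires when any checkpoint in
-- the last 10 minutes took longer than 300 seconds.  All names are
-- illustrative only.
SELECT seq, event_time, details
FROM pg_timing_events
WHERE event_type = 'checkpoint'
  AND event_time > now() - interval '10 minutes'
  AND (details -> 'total')::numeric > 300
ORDER BY event_time DESC;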

= Data to save =

There are a few obvious fields that could be explicitly saved as regular
columns:

-timestamp
-user
-database
-event type code. Presumably an integer that's decoded via a master
list of these, in some header file.
-running process id
-running process details (below)
-referenced process id. The one causing the problem in situations like
a lock wait.
-referenced process details (below)
-serial number. People who will want to poll & store this data might
want a better way to do that than just a timestamp. It would also let
you know when you lost some of it, so you'd know to consider increasing
the buffer size or decreasing the polling interval.
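
Put together, I picture the view an extension like this exposes looking
roughly like the sketch below. None of this is implemented; the names
are only there to make the field list above concrete, and the details
columns are discussed in the next section:

-- Hypothetical shape of the timing events view
CREATE VIEW pg_timing_events AS
SELECT
    seq,           -- serial number, for polling consumers
    event_time,    -- when the event happened
    usename,       -- user
    datname,       -- database
    event_type,    -- type code decoded via a master list
    pid,           -- running process id
    details,       -- running process details
    ref_pid,       -- referenced process id (e.g. the lock holder)
    ref_details    -- referenced process details
FROM pg_timing_events_internal();  -- made-up SRF provided by the extension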

= Storing event details =

When I think about how people are going to want to use this data, it
should be easy to tear apart without returning to the need to write a
complicated parser.

My first idea with that in mind was to consider this like a series of
key/value pairs. If you look at the log_checkpoints output, you can see
it approaches that form already at the end, with "=" signs as the
delimiters. Taking an example from a pre-9.1 server (so no sync times,
but they obviously fit the style already):

checkpoint complete: wrote 128770 buffers (12.3%); 0 transaction log
file(s) added, 979 removed, 129 recycled; write=431.784 s, sync=6048.384
s, total=6484.669 s

You can imagine that instead as:

wrote=128770 wrote_pct=12.3 log_added=0 log_removed=979 log_recycled=129
write=431.784 sync=6048.384 total=6484.669

Even if the timing event data was stored exactly like that--simple text
string, perhaps with some escaping of "=" signs--I think it would be
completely usable to people. log_temp_files output saves like this
easily enough too.

Another way this data might be stored is to have a key/value composite
type, and then make each timing event entry an array of those.
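
A minimal sketch of that idea, using nothing beyond stock SQL (all of
these names are invented):

-- One key/value pair per entry; each timing event carries an array of them
CREATE TYPE timing_event_detail AS (key text, value text);

CREATE TABLE timing_event_history (
    seq        bigint,
    event_time timestamptz,
    event_type integer,
    details    timing_event_detail[]
);

-- Pulling a single value back out, e.g. the checkpoint write time
SELECT seq, (d).value
FROM (SELECT seq, unnest(details) AS d FROM timing_event_history) s
WHERE (d).key = 'write';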

I'm open to other ideas for how to store this. If the events go into
one place, the different data each saves turns into the classic
"flexible schema" problem.

= Query text =

For the often multi-line query text that's dumped by log_lock_waits,
I've thought of two possibilities so far:

-Mash them into a single text line using "\n" between them. Now you're
even back to where, with the right escaping, the entire timing event
detail part could still collapse into a single text string. That would
kill the idea of simple parsing of key/value data stuffed there though.

-Create an array of lines to hold the two statements. Can always make
the view that shows the data unnest it. Seems like this would compel
putting the two query description arrays into their own fields though.

I don't really like either of these. The second should result in a
better UI in the end though, at the expense of bloating the width of the
event table some.
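
For the second option, the convenience view might look something like
this sketch (again, every name here is hypothetical):

-- Assumes the raw event stores the statements from a lock wait in a
-- text[] column named query_lines; the view flattens them back out,
-- one line per row.
CREATE VIEW pg_timing_events_queries AS
SELECT seq, event_time, unnest(query_lines) AS query_line
FROM pg_timing_events;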

= Query Tying =

I'd like to have a way to reference the normalized version of the active
query or queries at the time when that's appropriate too. That seems
like it should be a key that's joinable with the normalized version in
pg_stat_statements. One of the ideas bounced around in the "Hash id in
pg_stat_statements" thread, like assigning a numeric code for them,
would seem fine for that. This will depend on one of those ideas being
committed.
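
If one of them does land, the join itself could be as simple as this
sketch--assuming both the event and pg_stat_statements end up exposing
the same hash key under some column name, here invented as queryid:

-- Hypothetical: tie lock wait events back to the normalized query text
SELECT te.event_time, te.details, s.query
FROM pg_timing_events te
JOIN pg_stat_statements s ON s.queryid = te.queryid
WHERE te.event_type = 'lock_wait';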

I have not yet confirmed how feasible it is to extract a query hash key
at the point where the log messages show up. ProcSleep logs the lock
waiting and it grabs the full query text. Presumably that means it
could grab the hash key instead.

FileClose is what logs temp file data. That seems pretty disconnected
from what the executor is doing at the time.

If it's not possible to tie things to specific queries cleanly in every
case, I'd rather dump that requirement for now than make it a blocker.
That can always be added onto the basic timing event plumbing later.

= Limitations of scope =

I was thinking the act of loading the extension for this would turn the
logging on, and the existing GUCs like log_lock_waits will also apply.
It seems useful and easy to have a load time option for the module that
says to disable all the replaced log file writes if you're using the
timing event mechanism. It would be nice to have a granular way to ask
for some timing events: "I want to see lock waits but not temp files".
That seems like scope creep, though, which I don't think is necessary to
make this useful.

All these initial targets are events that do not occur that often, and
I'm not too concerned about the performance of this implementation.
Lock waits come out once per second per connection, temp files
shouldn't be that common, and checkpoints are infrequent. For now, I
hope that only targeting infrequent events means the overhead of dumping
the timing event into memory will be small. Eventually I'd
like to have a much finer timing events capability, more along the lines
of the DTrace level. Getting overhead low enough to go there seems hard
to attack at first though, while simultaneously dealing well with what
I'm trying to extract without log file parsing.

Alternately, think about the overhead of these initial targets like
this: right now, all of these things are being written to the log file.
If you can afford to do that, you can certainly afford to write a copy
to memory instead, and probably in addition too.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: RFC: Timing Events
Date: 2012-11-01 23:54:34
Message-ID: 50930BBA.2040107@agliodbs.com
Lists: pgsql-hackers

Greg,

First off, let me again praise the great work you and Peter are doing in
this area.
> Modeling this on pg_stat_statements includes the hope of packaging it as
> an extension with minor core hooks, and the idea that there would be a
> fixed size list of timing events available at any time. Consider it a
> FIFO queue built using a circular buffer. Dump events into there and
> provide a way to view them. If some fall out before they're
> analyzed/saved, that doesn't seem a problem for now.

Agreed.

> If you want 100%
> durable, the log is still available. Eventually I'd like a consumer for
> these that wrote them to a history table as an option, but that seems a
> second priority after getting them into memory in the first place.

Yes, and easy for users/tools to implement once the basic data is out there.

> I'd
> like that consumer thing for pg_stat_statements too, but so far we're
> getting by without it. It seems like something that might benefit from
> the in-core queuing work one day too.

Well, we could actually use such a thing in general, and not just for
the timing events. For example, it would be really useful to be able to
see pg_stat_user_tables from 2 days ago to estimate table
growth and activity, or pg_stat_replication from 10 minutes ago to
average replication lag. There was a plug-in tool for this, I think
Itagaki developed it. Anyone remember what/where it is?
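
(Today you can fake this with a cron-driven snapshot table, something
like the sketch below, but that's exactly the sort of thing I'd rather
not have every site reinvent. Table and column names are made up.)

-- Do-it-yourself history of pg_stat_user_tables: create once ...
CREATE TABLE stat_user_tables_history AS
SELECT now() AS snapshot_time, * FROM pg_stat_user_tables WITH NO DATA;

-- ... then run this from cron and diff snapshots to estimate growth
INSERT INTO stat_user_tables_history
SELECT now(), * FROM pg_stat_user_tables;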

So there's actually two things for a consumer: Timing Events and Counters.

> = Initial event list =
>
> There are three particular events I have been thinking about, hoping
> that will be enough to make sure the structure is general enough to
> handle more later:
>
> -log_lock_waits output. This is my #1 goal. Lock waits are incredibly
> common issues in larger systems, and the process for tracking them down
> right now is very painful. Probably deadlocks too.
>
> -log_temp_files output. This everyone wants to monitor, but it won't be
> as useful unless query tying (below) works.
>
> -log_checkpoints. The fight to make this data available via
> pg_stat_bgwriter has been useful. You really need a trending system to
> do anything useful with the data that way though. Being able to write a
> simple timing events query that looked for checkpoint spikes would be
> great for alerting, which is often done with something other than a
> trending package. To give specific examples here, the graphs Munin
> makes for pg_stat_bgwriter data work, but they're not much of a help if
> what you really want is to trigger a Nagios alert when checkpoints are
> slow. A Timing Event entry of a slow checkpoint would be easy to
> trigger alerts from.

I agree that these are the events to start with. Let me add the
additional events I think we want to have, in the order of how useful
the information is for troubleshooting:

4. log_connections / log_disconnections
5. deadlock reporting
6. autovacuum logging

I also think that we should eventually expose a buffer of the last N
error messages, but that's a different sort of data.

>
> = Data to save =
>
> There's a few obvious fields that could be explicitly saved as regular
> columns:
>
> -timestamp
> -user
> -database

+ application_name

> -event type code. Presumably an integer that's decoded via a master
> list of these, in some header file.
> -running process id

+ running session id

> -running process details (below)
> -referenced process id. The one causing the problem in situations like
> a lock wait.

+ referenced session id

> -referenced process details (below)
> -serial number. People who will want to poll & store this data might
> want a better way to do that than just a timestamp. Would let you know
> when you lost some of it, too, so you'd know to consider increasing the
> buffer size or decrease the polling interval.

By the time the user looks at this information, details on the process
from pg_stat_activity will probably no longer be available. This makes
data like process_id not that useful, and means we need to replicate all
relevant data from pg_stat_activity.

> Another way this data might be stored is to have a key/value composite
> type, and then make each timing event entry an array of those.

If only PostgreSQL had some sort of key-value composite storage type
available! I have no idea where we would get such a thing. ;-)
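
To spell the joke out: if the details were stored in (or castable to)
hstore's format, pulling a single field back out is already trivial.
A sketch:

CREATE EXTENSION IF NOT EXISTS hstore;

SELECT 'write => 431.784, sync => 6048.384, total => 6484.669'::hstore
       -> 'total';
-- returns the text '6484.669', castable to numeric for alerting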

> I'm open to other ideas for how to store this. If the events go into
> one place, the different data each saves turns into the classic
> "flexible schema" problem.

I think it's an excellent approach, not least because it lets us
add new data later, and deals with the variety of fields we need for
lock_waits.

> = Query text =
>
> For the often multi-line query text that's dumped by log_lock_waits,
> I've thought of two possibilities so far:
>
> -Mash them into a single text line using "\n" between them. Now you're
> even back to where, with the right escaping, the entire timing event
> detail part could still collapse into a single text string. That would
> kill the idea of simple parsing of key/value data stuffed there though.
>
> -Create an array of lines to hold the two statements. Can always make
> the view that shows the data unnest it. Seems like this would compel
> putting the two query description arrays into their own fields though.

I think you need to have a separate field for the queries. Do we have
anything we want to log which would have more than two queries?

> = Query Tying =
>
> I'd like to have a way to reference the normalized version of the active
> query or queries at the time when that's appropriate too. That seems
> like it should be a key that's joinable with the normalized version in
> pg_stat_statements. One of the ideas bounced around in the "Hash id in
> pg_stat_statements" thread, like assigning a numeric code for them,
> would seem fine for that. This will turn dependent on one of those
> ideas being committed.

So, the problem with joining against pg_stat_statements is that a
special-purpose incident you're looking at (like a lock_wait) might have
been pushed "off the bottom" of pg_stat_statements even though it is
still visible in pg_stat_lock_waits. No?

> = Limitations of scope =
>
> I was thinking the act of loading the extension for this would turn the
> logging on, and the existing GUCs like log_lock_waits will also apply.
> It seems useful and easy to have a load time option for the module that
> says to disable all the replaced log file writes if you're using the
> timing event mechanism. It would be nice to have a granular way to ask
> for some timing events: "I want to see lock waits but not temp files".
> That seems like scope creep I don't think is necessary to make this
> useful.

That depends on whether we think any of these types of events,
individually, will add substantial overhead.

> All these initial targets are events that do not occur that often, and
> I'm not too concerned about the performance of this implementation. Lock
> waits come out once per second per connection. temp files shouldn't be
> that common. And checkpoints are infrequent. For now, I hope that only
> targeting infrequent events means the overhead of dumping the timing
> event into memory will be little overhead.

log_connections/disconnections might not be low-frequency, though. So
we should plan for that.

> Alternately, think about the overhead of these initial targets like
> this: right now, all of these things are being written to the log file.
> If you can afford to do that, you can certainly afford to write a copy
> to memory instead, and probably in addition too.

To say nothing of the overhead of parsing/processing them later ...

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: RFC: Timing Events
Date: 2012-11-02 00:04:45
Message-ID: CAB7nPqTqGUkTLTmT_xr=EyEpN3pUDPhVphG9wMT+sj09sbatAA@mail.gmail.com
Lists: pgsql-hackers

On Fri, Nov 2, 2012 at 8:54 AM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:

> Greg,
>
> First off, let me again praise the great work you and Peter are doing in
> this area.
> > Modeling this on pg_stat_statements includes the hope of packaging it as
> > an extension with minor core hooks, and the idea that there would be a
> > fixed size list of timing events available at any time. Consider it a
> > FIFO queue built using a circular buffer. Dump events into there and
> > provide a way to view them. If some fall out before they're
> > analyzed/saved, that doesn't seem a problem for now.
>
> Agreed.
>
> > If you want 100%
> > durable, the log is still available. Eventually I'd like a consumer for
> > these that wrote them to a history table as an option, but that seems a
> > second priority after getting them into memory in the first place.
>
> Yes, and easy for users/tools to implement once the basic data is out
> there.
>
> > I'd
> > like that consumer thing for pg_stat_statements too, but so far we're
> > getting by without it. It seems like something that might benefit from
> > the in-core queuing work one day too.
>
> Well, we could actually use such a thing in general, and not just for
> the timing events. For example, it would be really useful to be able to
> see, for example, pg_stat_user_tables from 2 days ago to estimate table
> growth and activity, or pg_stat_replication from 10 minutes ago to
> average replication lag. There was a plug-in tool for this, I think
> Itagaki developed it. Anyone remember what/where it is?
>
pg_statsinfo perhaps? It is used for stat info management:
http://pgfoundry.org/projects/pgstatsinfo/
--
Michael Paquier
http://michael.otacoo.com


From: Greg Smith <greg(at)2ndQuadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: RFC: Timing Events
Date: 2012-11-02 00:56:35
Message-ID: 50931A43.7090805@2ndQuadrant.com
Lists: pgsql-hackers

On 11/1/12 11:54 PM, Josh Berkus wrote:
> For example, it would be really useful to be able to
> see, for example, pg_stat_user_tables from 2 days ago to estimate table
> growth and activity, or pg_stat_replication from 10 minutes ago to
> average replication lag.

I don't see all that going into core without a much bigger push than I
think people will buy. What people really want for all these is a
proper trending system, and that means graphs and dashboards and
bling--not a history table. I have almost all of my customers using
Munin or Cacti or Zabbix or something, and none using pg_statsinfo.
Shoot, static graphs are barely good enough anymore--people really want
dynamic ones driven by client-side Javascript. "Why can't I zoom in on
this Munin graph, this is lame" they tell me. I blame Google Maps for
being the first thing that made all the users so demanding in this area.

But the main weakness of these tools isn't display; it's that it has
seemed impractical to get them to collect per-table data, whether for
configuration, speed, or display reasons. I'm trying to find a good web
application toolchain to recommend that does that and dynamic graphs,
too. I would never take up the fight to try and build in that direction
in core though. I think most people in userland aren't even fully
consuming the pg_stat_user_tables data that's already provided.

[I fear this topic will turn into a more appropriate one for
pgsql-advocacy in a hurry if it keeps going]

> So, the problem with joining against pg_stat_statements is that a
> special-purpose incident you're looking at (like a lock_wait) might have
> been pushed "off the bottom" of pg_stat_statements even though it is
> still visible in pg_stat_lock_waits. No?

This whole approach has the assumption that things are going to fall off
sometimes. To expand on that theme for a second, right now I'm more
worried about the "99%" class of problems. Neither pg_stat_statements
nor this idea are very good for tracking the rare rogue problem down.
They're both aimed at making things that happen a lot more statistically
likely to be seen, by giving an easier UI to glare at them frequently.
That's not ideal, but I suspect really fleshing out the whole queue
consumer -> table idea needs to happen to do much better.

Thanks for the quick feedback; there are a lot of ideas in there I need
to chew on.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: RFC: Timing Events
Date: 2012-11-03 01:44:25
Message-ID: 509476F9.7040202@agliodbs.com
Lists: pgsql-hackers


> I don't see all that going into core without a much bigger push than I
> think people will buy. What people really want for all these is a
> proper trending system, and that means graphs and dashboards and
> bling--not a history table.

Well, I'm particularly thinking of autoconfiguration. For example, to
set vacuum_freeze_min_age properly, you have to know the XID "burn rate"
of the server, which is only available via history. I really don't want
to be depending on a graphical monitoring utility to find these things out.
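
(The burn rate itself is trivial to compute once you have two snapshots;
the history is the part that's missing. A sketch, assuming a made-up
xid_history table populated periodically with something like
INSERT INTO xid_history SELECT now(), txid_current();)

SELECT (max(xid) - min(xid))
       / extract(epoch FROM max(snapshot_time) - min(snapshot_time))
       AS xids_per_second
FROM xid_history
WHERE snapshot_time > now() - interval '2 days';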

> This whole approach has the assumption that things are going to fall off
> sometimes. To expand on that theme for a second, right now I'm more
> worried about the "99%" class of problems. Neither pg_stat_statements
> nor this idea are very good for tracking the rare rogue problem down.
> They're both aimed to make things that happen a lot more statistically
> likely to be seen, by giving an easier UI to glare at them frequently.
> That's not ideal, but I suspect really fleshing the whole queue consumer
> -> table idea needs to happen to do much better.

I'm just concerned that for some types of incidents, it would be much
more than 1% *of what you want to look at* that falls off. For example,
consider a server which does 95% reads at a very high rate, but has 2%
of its writes chronically having lock waits. That's something you want
to solve, but it seems fairly probable that these relatively infrequent
queries would have fallen off the bottom of pg_stat_statements. Same
thing with the relative handful of queries which do large on-disk sorts.

The problem I'm worried about is that pg_stat_statements is designed to
keep the most frequent queries, but sometimes the thing you really need
to look at is not in the list of most frequent queries.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Satoshi Nagayasu <snaga(at)uptime(dot)jp>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: RFC: Timing Events
Date: 2012-11-04 09:28:11
Message-ID: 5096352B.4020108@uptime.jp
Lists: pgsql-hackers

(2012/11/03 10:44), Josh Berkus wrote:
>
>> I don't see all that going into core without a much bigger push than I
>> think people will buy. What people really want for all these is a
>> proper trending system, and that means graphs and dashboards and
>> bling--not a history table.
>
> Well, I'm particularly thinking for autoconfiguration. For example, to
> set vacuum_freeze_min_age properly, you have to know the XID "burn rate"
> of the server, which is only available via history. I really don't want
> to be depending on a graphical monitoring utility to find these things out.
>
>> This whole approach has the assumption that things are going to fall off
>> sometimes. To expand on that theme for a second, right now I'm more
>> worried about the "99%" class of problems. Neither pg_stat_statements
>> nor this idea are very good for tracking the rare rogue problem down.
>> They're both aimed to make things that happen a lot more statistically
>> likely to be seen, by giving an easier UI to glare at them frequently.
>> That's not ideal, but I suspect really fleshing the whole queue consumer
>> -> table idea needs to happen to do much better.
>
> I'm just concerned that for some types of incidents, it would be much
> more than 1% *of what you want to look at* which fall off. For example,
> consider a server which does 95% reads at a very high rate, but has 2%
> of its writes chronically having lock waits. That's something you want
> to solve, but it seems fairly probably that these relatively infrequent
> queries would have fallen off the bottom of pg_stat_statements. Same
> thing with the relative handful of queries which do large on-disk sorts.
>
> The problem I'm worried about is that pg_stat_statements is designed to
> keep the most frequent queries, but sometimes the thing you really need
> to look at is not in the list of most frequent queries.

I think auto_explain would help you solve such rare incidents
if it could dump several statistics into the server log, including lock
waits and block read/write statistics per session, for example.

Do we have something to add to auto_explain?

Regards,
--
Satoshi Nagayasu <snaga(at)uptime(dot)jp>
Uptime Technologies, LLC. http://www.uptime.jp


From: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
To: Satoshi Nagayasu <snaga(at)uptime(dot)jp>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: RFC: Timing Events
Date: 2012-11-04 09:35:45
Message-ID: CAFj8pRCPJZa+QWnON4B+bKY+8SrEJb7Mtrr7g+kS9SZtWBHXQw@mail.gmail.com
Lists: pgsql-hackers

Hello

2012/11/4 Satoshi Nagayasu <snaga(at)uptime(dot)jp>:
> (2012/11/03 10:44), Josh Berkus wrote:
>>
>>
>>> I don't see all that going into core without a much bigger push than I
>>> think people will buy. What people really want for all these is a
>>> proper trending system, and that means graphs and dashboards and
>>> bling--not a history table.
>>
>>
>> Well, I'm particularly thinking for autoconfiguration. For example, to
>> set vacuum_freeze_min_age properly, you have to know the XID "burn rate"
>> of the server, which is only available via history. I really don't want
>> to be depending on a graphical monitoring utility to find these things
>> out.
>>
>>> This whole approach has the assumption that things are going to fall off
>>> sometimes. To expand on that theme for a second, right now I'm more
>>> worried about the "99%" class of problems. Neither pg_stat_statements
>>> nor this idea are very good for tracking the rare rogue problem down.
>>> They're both aimed to make things that happen a lot more statistically
>>> likely to be seen, by giving an easier UI to glare at them frequently.
>>> That's not ideal, but I suspect really fleshing the whole queue consumer
>>> -> table idea needs to happen to do much better.
>>
>>
>> I'm just concerned that for some types of incidents, it would be much
>> more than 1% *of what you want to look at* which fall off. For example,
>> consider a server which does 95% reads at a very high rate, but has 2%
>> of its writes chronically having lock waits. That's something you want
>> to solve, but it seems fairly probably that these relatively infrequent
>> queries would have fallen off the bottom of pg_stat_statements. Same
>> thing with the relative handful of queries which do large on-disk sorts.
>>
>> The problem I'm worried about is that pg_stat_statements is designed to
>> keep the most frequent queries, but sometimes the thing you really need
>> to look at is not in the list of most frequent queries.
>
>
> I think auto_explain would help you solve such rare incidents
> if it could dump several statistics into server log, including lock
> waits and block reads/writes statistic per-session, for example.
>
> Do we have something to add to auto_explain?

Now I am working on expanding the slow query record and auto_explain with
some locking times (locks on objects, locks on extending pages, other
locks).

Statement time alone produces too little information in our complex and
"unpredictable" cloud environment with thousands of databases and
hundreds of servers.

Regards

Pavel Stehule

>
> Regards,
> --
> Satoshi Nagayasu <snaga(at)uptime(dot)jp>
> Uptime Technologies, LLC. http://www.uptime.jp


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Satoshi Nagayasu <snaga(at)uptime(dot)jp>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: RFC: Timing Events
Date: 2012-11-05 18:49:19
Message-ID: 50980A2F.40304@agliodbs.com
Lists: pgsql-hackers


> I think auto_explain would help you solve such rare incidents
> if it could dump several statistics into server log, including lock
> waits and block reads/writes statistic per-session, for example.
>
> Do we have something to add to auto_explain?

Well, to be frank, I've never found auto-explain to be useful because of
its restriction to superuser sessions. It's an interesting
proof-of-concept, but completely useless at any production site.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Satoshi Nagayasu <snaga(at)uptime(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: RFC: Timing Events
Date: 2012-11-05 20:44:54
Message-ID: 24346.1352148294@sss.pgh.pa.us
Lists: pgsql-hackers

Josh Berkus <josh(at)agliodbs(dot)com> writes:
>> Do we have something to add to auto_explain?

> Well, to be frank, I've never found auto-explain to be useful because of
> its restriction to superuser sessions. It's an interesting
> proof-of-concept, but completely useless at any production site.

Huh? The typical use-case is to enable it for all sessions by
including it in shared_preload_libraries. That doesn't require any
particular session to be superuser. (If you're superuser you can then
turn it *off* in your session, should you wish.)

regards, tom lane


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Cc: Satoshi Nagayasu <snaga(at)uptime(dot)jp>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: RFC: Timing Events
Date: 2012-11-05 20:57:27
Message-ID: CAMkU=1yqwVYrUov7wde1odMuAOmQAtH9MeUYUqCgB0yukxmKOw@mail.gmail.com
Lists: pgsql-hackers

On Sun, Nov 4, 2012 at 1:35 AM, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com> wrote:
> Hello
>
> 2012/11/4 Satoshi Nagayasu <snaga(at)uptime(dot)jp>:
>>>
>>
>> Do we have something to add to auto_explain?
>
> Now I am working on expanding slow query record and auto_explain with
> some locking times (lock on objects, lock on enhancing pages, other
> locks).

But this would only work if you used 'auto_explain.log_analyze=1',
which has nasty performance implications. Or are you planning on
changing the way log_analyze works to get around this?

>
> Just statement time produces too less information in our complex and
> "unpredictable" cloud environment with thousand databases and hundreds
> servers.

I think it would be easier to implement but still a big step forward
over what we currently have if "explain analyze" and "\timing" would
show the 'rusage' values in addition to the wall-clock time.


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: RFC: Timing Events
Date: 2012-11-06 01:42:21
Message-ID: 50986AFD.7060305@agliodbs.com
Lists: pgsql-hackers


> Huh? The typical use-case is to enable it for all sessions by
> including it in shared_preload_libraries. That doesn't require any
> particular session to be superuser. (If you're superuser you can then
> turn it *off* in your session, should you wish.)

It's not practical to have auto-explain on for all queries on a server
which is processing 10K queries/minute. And non-superusers can't alter
the settings in their session, even the min_duration.

A practical use of auto-explain would involve turning it on for a single
user session, or for a specific database user, neither of which is
possible.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: "Albe Laurenz" <laurenz(dot)albe(at)wien(dot)gv(dot)at>
To: "Josh Berkus *EXTERN*" <josh(at)agliodbs(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: RFC: Timing Events
Date: 2012-11-06 09:20:58
Message-ID: D960CB61B694CF459DCFB4B0128514C208A4E909@exadv11.host.magwien.gv.at
Lists: pgsql-hackers

Josh Berkus wrote:
>> Huh? The typical use-case is to enable it for all sessions by
>> including it in shared_preload_libraries. That doesn't require any
>> particular session to be superuser. (If you're superuser you can then
>> turn it *off* in your session, should you wish.)
>
> It's not practical to have auto-explain on for all queries on a server
> which is processing 10K queries/minute. And non-superusers can't alter
> the settings in their session, even the min_duration.
>
> A practical use of auto-explain would involve turning it on for a single
> user session, or for a specific database user, neither of which are
> possible.

I think that Tom is right.

You can reduce the logging volume on busy servers with
auto_explain.log_min_duration. You can also activate it for a single
database user only by setting log_min_duration to -1 globally and
changing the setting for one user with ALTER ROLE SET, right?
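
Concretely, something along these lines (role and database names are
made up here, and the thresholds are arbitrary):

-- in postgresql.conf:
--   shared_preload_libraries = 'auto_explain'
--   auto_explain.log_min_duration = -1        # off by default

-- then, as a superuser, enable it for just one role or database:
ALTER ROLE app_user SET auto_explain.log_min_duration = '250ms';
ALTER DATABASE app_db SET auto_explain.log_min_duration = '250ms';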

And I am under the impression that parameters that influence
what is logged tend to be superuser only for a good reason
(the only exceptions I see are the debug_print_* parameters).

Yours,
Laurenz Albe


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Albe Laurenz <laurenz(dot)albe(at)wien(dot)gv(dot)at>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: RFC: Timing Events
Date: 2012-11-06 23:25:19
Message-ID: 50999C5F.4050302@agliodbs.com
Lists: pgsql-hackers


> You can reduce the logging volume on busy servers with
> auto_explain.log_min_duration. You can also activate it for a single
> database user only by setting log_min_duration to -1 globally and
> change the setting for one user with ALTER ROLE SET, right?

Not according to the docs, you can't.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Albe Laurenz <laurenz(dot)albe(at)wien(dot)gv(dot)at>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: RFC: Timing Events
Date: 2012-11-06 23:32:36
Message-ID: 22596.1352244756@sss.pgh.pa.us
Lists: pgsql-hackers

Josh Berkus <josh(at)agliodbs(dot)com> writes:
>> You can reduce the logging volume on busy servers with
>> auto_explain.log_min_duration. You can also activate it for a single
>> database user only by setting log_min_duration to -1 globally and
>> change the setting for one user with ALTER ROLE SET, right?

> Not according to the docs, you can't.

You're being obtuse, Josh. These parameters can be set the same way any
other parameters can, including ALTER ROLE SET or ALTER DATABASE SET.
The "superuser only" restriction is that only a superuser can execute
the ALTER, not that the target role has to be superuser.

regards, tom lane


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: RFC: Timing Events
Date: 2012-11-07 19:45:25
Message-ID: 509ABA55.9030000@agliodbs.com
Lists: pgsql-hackers


> You're being obtuse, Josh. These parameters can be set the same way any
> other parameters can, including ALTER ROLE SET or ALTER DATABASE SET.
> The "superuser only" restriction is that only a superuser can execute
> the ALTER, not that the target role has to be superuser.

Oh? Ok, that's helpful.

It still eliminates the main potential use of auto-explain on production
sites, though, which is to turn it on only for specific operations.
I've never been able to make use of auto-explain for any real diagnostic
purpose on a production site, and I don't personally know anyone who has.

I recall bringing this up when auto-explain was first committed as a
fatal limitation in the module. And as far as I'm concerned, it has been.

Fixing this would clearly be complex; we don't currently have a system
which lets the superuser control which other users can change specific
parameters, even though that would be useful in a lot of contexts.
Presumably, making auto-explain settable by regular users would also
require such a system of grants, so it's not something we're fixing for
9.3, I'd imagine.

To get back to the original thread, though: I'm saying that "you can do
that with auto-explain" is zero justification to limit any of Greg's
work, because auto-explain isn't generally useful.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: RFC: Timing Events
Date: 2012-11-07 20:35:03
Message-ID: 23894.1352320503@sss.pgh.pa.us
Lists: pgsql-hackers

Josh Berkus <josh(at)agliodbs(dot)com> writes:
> It still eliminates the main potential use of auto-explain on production
> sites, though, which is to turn it on only for specific operations.
> I've never been able to make use of auto-explain for any real diagnostic
> purpose on a production site, and I don't personally know anyone who has.

This objection seems completely off the mark. If you have a specific
query you want to test, you can just use EXPLAIN. The point of
auto-explain is to get info about queries that are unexpectedly slow,
and for that it seems to me that you'll generally want it active over a
fairly wide scope.

I'm perfectly prepared to grant that you personally may never have had a
use for auto-explain, but it does not follow that it's useless. I'm
also pretty unclear on what use-case you think there is that's
intermediate between plain EXPLAIN and cases for which auto-explain is
reasonably suitable as-is.

> To get back to the original thread, though: I'm saying that "you can do
> that with auto-explain" is zero justification to limit any of Greg's
> work, because auto-explain isn't generally useful.

I don't recall having said anything against Greg's proposal, but in any
case I don't grant your premise.

regards, tom lane


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: RFC: Timing Events
Date: 2012-11-07 21:06:01
Message-ID: CAMkU=1zfnFvx-sriK=hTeDdoen78U-9DHhs1iPzqRMw6ScZzLQ@mail.gmail.com
Lists: pgsql-hackers

On Wed, Nov 7, 2012 at 11:45 AM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>
>> You're being obtuse, Josh. These parameters can be set the same way any
>> other parameters can, including ALTER ROLE SET or ALTER DATABASE SET.
>> The "superuser only" restriction is that only a superuser can execute
>> the ALTER, not that the target role has to be superuser.
>
> Oh? Ok, that's helpful.
>
> It still eliminates the main potential use of auto-explain on production
> sites, though, which is to turn it on only for specific operations.

If you have sufficient control over the application to do that, why
can't you just explain those queries the old fashioned way?

> I've never been able to make use of auto-explain for any real diagnostic
> purpose on a production site, and I don't personally know anyone who has.

I have, although it is a rather low volume production site.

What prevents you from doing so? The performance impact? The volume
of logs generated?

Cheers,

Jeff


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: auto_explain WAS: RFC: Timing Events
Date: 2012-11-08 00:55:02
Message-ID: 509B02E6.2010301@agliodbs.com
Lists: pgsql-hackers


> What prevents you from doing so? The performance impact? The volume
> of logs generated?

Yes and yes.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: auto_explain WAS: RFC: Timing Events
Date: 2012-11-08 01:05:02
Message-ID: CAMkU=1xGZuMbu4vhnmcFFZh7nz=4Z8rkYgq3b__=ecodHXCN=w@mail.gmail.com
Lists: pgsql-hackers

On Wed, Nov 7, 2012 at 4:55 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>
>> What prevents you from doing so? The performance impact? The volume
>> of logs generated?
>
> Yes and yes.

I've never noticed a performance impact, unless you are having it
analyze, or having it log every query. How do you measure the impact?

For the log volume, would it help if there was some "unexpectedness"
threshold? That is, if a statement exceeds the duration threshold, it
gets explained, but then it only gets logged if the actual duration
divided by the cost estimate exceeds some threshold.

Cheers,

Jeff


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: auto_explain WAS: RFC: Timing Events
Date: 2012-11-08 19:16:55
Message-ID: 509C0527.6070105@agliodbs.com
Lists: pgsql-hackers


> For the log volume, would it help if there was some "unexpectedness"
> threshold? That is, if a statement exceeds the duration threshold, it
> gets explained, But then it only gets logged if the actual duration
> divided by the cost estimate exceeds some threshold.

The thing is, pg_stat_plans makes this kind of use of auto_explain
obsolete, with a much more useful interface.

Where auto_explain would be useful is in getting all of the plans,
regardless of execution time, for a specific application session or a
specific stored procedure. However, that requires the ability of the
application session to toggle auto-explain logging settings. This was
part of the functionality which Itagaki demonstrated for auto-explain
when he first proposed it, but was later disabled for security reasons
before commit, IIRC. Writing a SECURITY DEFINER function to get around
the inability to toggle it as a regular user has been a nonstarter when
I've proposed it to clients.

Also, logging only the long-running queries is less useful than people
on this list seem to think. When I'm doing real performance analysis, I
need to see *everything* which was run, not just the slow stuff. Often
the real problem is a query which used to take 1.1ms, now takes 1.8ms,
and gets run 400 times/second. Looking just at the slow queries won't
tell you that.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Greg Smith <greg(at)2ndQuadrant(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: auto_explain WAS: RFC: Timing Events
Date: 2012-11-21 23:15:10
Message-ID: 50AD607E.80806@2ndQuadrant.com
Lists: pgsql-hackers

On 11/8/12 2:16 PM, Josh Berkus wrote:

> Also, logging only the long-running queries is less useful than people
> on this list seem to think. When I'm doing real performance analysis, I
> need to see *everything* which was run, not just the slow stuff. Often
> the real problem is a query which used to take 1.1ms, now takes 1.8ms,
> and gets run 400 times/second. Looking just at the slow queries won't
> tell you that.

No argument here. I've tried to be clear that some of these high-speed
but lossy things I'm hacking on are not going to be useful for everyone.
A solution that found most of these 1.8ms queries, but dropped some
percentage under the highest peak load conditions, would still be very
useful to me.

An anecdote on this topic seems relevant. I have a troublesome
production server that has moved log_min_duration_statement from 100ms
to 500ms to 2000ms as the system grew. Even the original setting wasn't
short enough to catch everything we would like to watch *now*, but
seeing sub-second data is a dream at this point. The increases have
been forced by logging contention becoming unmanageable when every client
has to fight for the log to write to disk. I can see the buggers stack
up as waiting for locks if I try to log shorter statements, stalling
enough that it drags the whole server down under peak load.

If I could turn off logging just during those periods--basically,
throwing them away only when some output rate throttled component hit
its limit--I could still find them in the data collected the rest of the
time. There are some types of problems that also only occur under peak
load that this idea would miss, but you'd still be likely to get *some*
of them, statistically.

There's a really hard trade-off here:

-Sometimes you must save data on every query to capture fine details
-Making every query wait for disk I/O is impractical

The sort of ideas you threw out for making things like auto-explain
logging per-session can work. The main limitation there though is that
it presumes you even know what area the problem is in the first place.
I am not optimistic that covers a whole lot of ground either.

Upthread I talked about a background process that persists shared memory
queues as a future consumer of timing events--one that might consume
slow query data too. That is effectively acting as the ideal component
I described above, one that only loses things when it exceeds the
system's write capacity for saving them. I wouldn't want to try and
retrofit the existing PostgreSQL logging facility for such a thing
though. Log parsing as the way to collect data is filled with headaches
anyway.

I don't see any other good way to resolve this trade-off. To help with
the real-world situation you describe, ideally you dump all the query
data somewhere, fast, and have the client move on. You can't make
queries wait for storage, something else (with a buffer!) needs to take
over that job.

I can't find the URL right now, but at PG.EU someone was showing me a
module that grabbed the new 9.2 logging hook and shipped the result to
another server. Clients burn a little CPU and network time and they
move on, and magically disk I/O goes out of their concerns. How much
overhead persisting the data takes isn't the database server's job at
all then. That's the sort of flexibility people need to have with
logging eventually. Something so efficient that every client can afford
to do it; it is capable of saving all events under ideal conditions; but
under adverse ones, you have to keep going and accept the loss.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


From: Gavin Flower <GavinFlower(at)archidevsys(dot)co(dot)nz>
To: Greg Smith <greg(at)2ndQuadrant(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: auto_explain WAS: RFC: Timing Events
Date: 2012-11-21 23:33:10
Message-ID: 50AD64B6.4030607@archidevsys.co.nz
Lists: pgsql-hackers

On 22/11/12 12:15, Greg Smith wrote:
> On 11/8/12 2:16 PM, Josh Berkus wrote:
>
>> Also, logging only the long-running queries is less useful than people
>> on this list seem to think. When I'm doing real performance analysis, I
>> need to see *everything* which was run, not just the slow stuff. Often
>> the real problem is a query which used to take 1.1ms, now takes 1.8ms,
>> and gets run 400 times/second. Looking just at the slow queries won't
>> tell you that.
>
> No argument here. I've tried to be clear that some of these
> high-speed but lossy things I'm hacking on are not going to be useful
> for everyone. A solution that found most of these 1.8ms queries, but
> dropped some percentage under the highest peak load conditions, would
> still be very useful to me.
>
> An anecdote on this topic seems relevant. I have a troublesome
> production server that has moved log_min_duration_statement from 100ms
> to 500ms to 2000ms as the system grew. Even the original setting
> wasn't short enough to catch everything we would like to watch *now*,
> but seeing sub-second data is a dream at this point. The increases
> have been forced by logging contention becoming unmanageable when every
> client has to fight for the log to write to disk. I can see the
> buggers stack up as waiting for locks if I try to log shorter
> statements, stalling enough that it drags the whole server down under
> peak load.
>
> If I could just turn off logging just during those periods--basically,
> throwing them away only when some output rate throttled component hit
> its limit--I could still find them in the data collected the rest of
> the time. There are some types of problems that also only occur under
> peak load that this idea would miss, but you'd still be likely to get
> *some* of them, statistically.
>
> There's a really hard trade-off here:
>
> -Sometimes you must save data on every query to capture fine details
> -Making every query wait for disk I/O is impractical
>
> The sort of ideas you threw out for making things like auto-explain
> logging per-session can work. The main limitation there though is
> that it presumes you even know what area the problem is in the first
> place. I am not optimistic that covers a whole lot of ground either.
>
> Upthread I talked about a background process that persists shared
> memory queues as a future consumer of timing events--one that might
> consume slow query data too. That is effectively acting as the ideal
> component I described above, one that only loses things when it
> exceeds the system's write capacity for saving them. I wouldn't want
> to try and retrofit the existing PostgreSQL logging facility for such
> a thing though. Log parsing as the way to collect data is filled with
> headaches anyway.
>
> I don't see any other good way to resolve this trade-off. To help
> with the real-world situation you describe, ideally you dump all the
> query data somewhere, fast, and have the client move on. You can't
> make queries wait for storage, something else (with a buffer!) needs
> to take over that job.
>
> I can't find the URL right now, but at PG.EU someone was showing me a
> module that grabbed the new 9.2 logging hook and shipped the result to
> another server. Clients burn a little CPU and network time and they
> move on, and magically disk I/O goes out of their concerns. How much
> overhead persisting the data takes isn't the database server's job at
> all then. That's the sort of flexibility people need to have with
> logging eventually. Something so efficient that every client can
> afford to do it; it is capable of saving all events under ideal
> conditions; but under adverse ones, you have to keep going and accept
> the loss.
>
Would it be useful to be able to specify a fixed sampling rate, say 1 in
100? That way you could get a well-defined statistical sample, yet not
cause excessive I/O.

Cheers,
Gavin


From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: Greg Smith <greg(at)2ndQuadrant(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: auto_explain WAS: RFC: Timing Events
Date: 2012-11-22 13:03:27
Message-ID: 20121122130327.GB4316@alvh.no-ip.org
Lists: pgsql-hackers

Greg Smith escribió:

> If I could just turn off logging just during those
> periods--basically, throwing them away only when some output rate
> throttled component hit its limit--I could still find them in the
> data collected the rest of the time. There are some types of
> problems that also only occur under peak load that this idea would
> miss, but you'd still be likely to get *some* of them,
> statistically.

Uhm, what if you had a simple script that changed postgresql.conf and
SIGHUP'ed postmaster? You could have it turn logging on and off
depending on predefined system load watermarks.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Jim Nasby <jim(at)nasby(dot)net>
To: Gavin Flower <GavinFlower(at)archidevsys(dot)co(dot)nz>
Cc: Greg Smith <greg(at)2ndQuadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: auto_explain WAS: RFC: Timing Events
Date: 2013-02-25 00:27:19
Message-ID: 512AAFE7.3040904@nasby.net
Lists: pgsql-hackers

Sorry for the late reply, but I think I can add some ideas here...

On 11/21/12 5:33 PM, Gavin Flower wrote:
> On 22/11/12 12:15, Greg Smith wrote:
>> On 11/8/12 2:16 PM, Josh Berkus wrote:
>>
>>> Also, logging only the long-running queries is less useful than people
>>> on this list seem to think. When I'm doing real performance analysis, I
>>> need to see *everything* which was run, not just the slow stuff. Often
>>> the real problem is a query which used to take 1.1ms, now takes 1.8ms,
>>> and gets run 400 times/second. Looking just at the slow queries won't
>>> tell you that.
>>
>> No argument here. I've tried to be clear that some of these high-speed but lossy things I'm hacking on are not going to be useful for everyone. A solution that found most of these 1.8ms queries, but dropped some percentage under the highest peak load conditions, would still be very useful to me.
>>
>> An anecdote on this topic seems relevant. I have a troublesome production server that has moved log_min_duration_statement from 100ms to 500ms to 2000ms as the system grew. Even the original setting wasn't short enough to catch everything we would like to watch *now*, but seeing sub-second data is a dream at this point. The increases have been forced by logging contention becoming unmanagable when every client has to fight for the log to write to disk. I can see the buggers stack up as waiting for locks if I try to log shorter statements, stalling enough that it drags the whole server down under peak load.
>>
>> If I could just turn off logging just during those periods--basically, throwing them away only when some output rate throttled component hit its limit--I could still find them in the data collected the rest of the time. There are some types of problems that also only occur under peak load that this idea would miss, but you'd still be likely to get *some* of them, statistically.
>>
>> There's a really hard trade-off here:
>>
>> -Sometimes you must save data on every query to capture fine details
>> -Making every query wait for disk I/O is impractical
>>
>> The sort of ideas you threw out for making things like auto-explain logging per-session can work. The main limitation there though is that it presumes you even know what area the problem is in the first place. I am not optimistic that covers a whole lot of ground either.
>>
>> Upthread I talked about a background process that persists shared memory queues as a future consumer of timing events--one that might consume slow query data too. That is effectively acting as the ideal component I described above, one that only loses things when it exceeds the system's write capacity for saving them. I wouldn't want to try and retrofit the existing PostgreSQL logging facility for such a thing though. Log parsing as the way to collect data is filled with headaches anyway.
>>
>> I don't see any other good way to resolve this trade-off. To help with the real-world situation you describe, ideally you dump all the query data somewhere, fast, and have the client move on. You can't make queries wait for storage; something else (with a buffer!) needs to take over that job.
>>
>> I can't find the URL right now, but at PG.EU someone was showing me a module that grabbed the new 9.2 logging hook and shipped the result to another server. Clients burn a little CPU and network time and then move on, and disk I/O is magically no longer their concern. The overhead of persisting the data then isn't the database server's problem at all. That's the sort of flexibility people need to have with logging eventually. Something so efficient that every client can afford to do it, capable of saving all events under ideal conditions; but under adverse ones, you have to keep going and accept the loss.

I definitely think we need a means to "log" stuff really, really fast and then deal with it later. And the log ideally would not be text either, because text is a PITA to deal with and often means far more data to handle (which makes all our problems worse).

I like that Greg is proposing a method of logging information in a more useful format than text. I suggest that, if possible, the non-text structure be implemented independently of any particular logging methodology. Perhaps as something *equivalent to* a set of data types in PG. Of course I doubt real PG types could be used for the actual logging for performance reasons, but the end interface should be able to use PG capabilities (inheritance is another interesting possibility).

Josh also brings up an interesting use case, one where I really like the methodology of Tom Kyte (the Oracle "Ask Tom" guy). When asked about how to do logging in PL/SQL, he proposed a system where you always associate some kind of "context" with log data so that you can enable logging *only for that context*. In this case (and in plpgsql syntax), you'd do something like:

PERFORM debug( 'my function name', 'blah blah just happened' );

That way you're not stuck trying to mess around with verbosity levels to stem a flood of data; instead you enable data collection only for a specific context.
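
To make that concrete, here's a rough sketch of the mechanism (in Python rather than plpgsql, and with made-up names; the point is only that the gate is the context, not a verbosity level):

# Sketch only: context-gated logging, independent of the transport.
enabled_contexts = set()

def enable_debug(context):
    enabled_contexts.add(context)

def debug(context, message):
    # Emit only if this specific context was switched on; log level
    # never enters into it, so there's no flood to filter afterwards.
    if context in enabled_contexts:
        print('DEBUG [%s]: %s' % (context, message))

enable_debug('order_import')
debug('order_import', 'starting batch')       # emitted
debug('cache_refresh', 'evicting 10k rows')   # silently dropped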

IIRC, his system also had a method for turning on debugging in a *different backend*. That allowed you to debug a specific application thread. If we had that ability, it would eliminate the issue of requiring superuser access to change the auto_explain GUC.

Please note that I'm not suggesting these ideas to the exclusion of others; they can all be valuable. I.e.: retain the concept of "log level", add a user-specified context (or maybe it'd be enough to just use function names), and provide some ability to control debugging on a separate backend. All three features are useful and can stand on their own or in combination.

> Would it be useful to be able to specify a fixed sampling rate, say 1 in 100? That way you could get a well-defined statistical sample, yet not cause excessive I/O.

We actually do that in our application and have discovered that random sampling can end up significantly skewing your data.


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Jim Nasby <jim(at)nasby(dot)net>
Cc: Gavin Flower <GavinFlower(at)archidevsys(dot)co(dot)nz>, Greg Smith <greg(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: auto_explain WAS: RFC: Timing Events
Date: 2013-02-25 20:26:04
Message-ID: CA+TgmoYmfczh4mV-A=1shPyFc7KO9LvNL8NAoUL50yAZUoKadA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Feb 24, 2013 at 7:27 PM, Jim Nasby <jim(at)nasby(dot)net> wrote:
> We actually do that in our application and have discovered that random
> sampling can end up significantly skewing your data.

/me blinks.

How so?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Greg Stark <stark(at)mit(dot)edu>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Jim Nasby <jim(at)nasby(dot)net>, Gavin Flower <GavinFlower(at)archidevsys(dot)co(dot)nz>, Greg Smith <greg(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: auto_explain WAS: RFC: Timing Events
Date: 2013-02-26 03:22:45
Message-ID: CAM-w4HPUS_kAk1HSWFHA+EcTuS7COg8FyH8_PB-ndbE5+1ny3Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Feb 25, 2013 at 8:26 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Sun, Feb 24, 2013 at 7:27 PM, Jim Nasby <jim(at)nasby(dot)net> wrote:
>> We actually do that in our application and have discovered that random
>> sampling can end up significantly skewing your data.
>
> /me blinks.
>
> How so?

Sampling is a pretty big area of statistics. There are dozens of
sampling methods to deal with various problems that occur with
different types of data distributions.

One problem is that if you have some very rare events, random sampling
can produce odd results: unless your sample is very large, those rare
events drop out entirely, while less rare events are represented
proportionally. There are sampling methods that ensure that x% of the
rare events are included even if those rare events are less than x% of
your total data set. One of those might be appropriate for profiling
data when you're looking for rare slow queries amongst many faster
queries.
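
As a rough illustration of how that might apply here (thresholds and
rates invented for the example), you could stratify by duration and
sample each stratum at its own rate, so the rare slow queries survive
even a small overall sample:

import random

# Sketch only: per-stratum sampling so rare slow queries are kept.
SAMPLE_RATES = [
    (1000.0, 1.00),   # >= 1s: keep every one
    ( 100.0, 0.10),   # 100ms .. 1s: keep 10%
    (   0.0, 0.01),   # everything else: keep 1%
]

def keep_sample(duration_ms):
    for threshold, rate in SAMPLE_RATES:
        if duration_ms >= threshold:
            return random.random() < rate
    return False

# When aggregating, weight each kept event by 1/rate for its stratum
# so the totals scale back up to the full population.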

--
greg


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: Jim Nasby <jim(at)nasby(dot)net>, Gavin Flower <GavinFlower(at)archidevsys(dot)co(dot)nz>, Greg Smith <greg(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: auto_explain WAS: RFC: Timing Events
Date: 2013-02-26 17:19:17
Message-ID: CA+TgmoYE8_VGV2GC41ZHxkupmHcOO3X6F+haEQZ0uZFn_4Nfig@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Feb 25, 2013 at 10:22 PM, Greg Stark <stark(at)mit(dot)edu> wrote:
> On Mon, Feb 25, 2013 at 8:26 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> On Sun, Feb 24, 2013 at 7:27 PM, Jim Nasby <jim(at)nasby(dot)net> wrote:
>>> We actually do that in our application and have discovered that random
>>> sampling can end up significantly skewing your data.
>>
>> /me blinks.
>>
>> How so?
>
> Sampling is a pretty big area of statistics. There are dozens of
> sampling methods to deal with various problems that occur with
> different types of data distributions.
>
> One problem is that if you have some very rare events, random sampling
> can produce odd results: unless your sample is very large, those rare
> events drop out entirely, while less rare events are represented
> proportionally. There are sampling methods that ensure that x% of the
> rare events are included even if those rare events are less than x% of
> your total data set. One of those might be appropriate for profiling
> data when you're looking for rare slow queries amongst many faster
> queries.

I'll grant all that, but it still seems to me like x% of all queries
plus all queries running longer than x milliseconds would cover most
of the interesting cases.
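
In other words (numbers made up, just to illustrate the rule), something
as simple as:

import random

SAMPLE_FRACTION   = 0.01    # keep 1% of everything
SLOW_THRESHOLD_MS = 500.0   # plus every query slower than this

def should_log(duration_ms):
    return (duration_ms >= SLOW_THRESHOLD_MS
            or random.random() < SAMPLE_FRACTION)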

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Jim Nasby <jim(at)nasby(dot)net>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Greg Stark <stark(at)mit(dot)edu>, Gavin Flower <GavinFlower(at)archidevsys(dot)co(dot)nz>, Greg Smith <greg(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: auto_explain WAS: RFC: Timing Events
Date: 2013-02-26 22:55:38
Message-ID: 512D3D6A.3060402@nasby.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2/26/13 11:19 AM, Robert Haas wrote:
> On Mon, Feb 25, 2013 at 10:22 PM, Greg Stark <stark(at)mit(dot)edu> wrote:
>> On Mon, Feb 25, 2013 at 8:26 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>> On Sun, Feb 24, 2013 at 7:27 PM, Jim Nasby <jim(at)nasby(dot)net> wrote:
>>>> We actually do that in our application and have discovered that random
>>>> sampling can end up significantly skewing your data.
>>>
>>> /me blinks.
>>>
>>> How so?
>>
>> Sampling is a pretty big area of statistics. There are dozens of
>> sampling methods to deal with various problems that occur with
>> different types of data distributions.
>>
>> One problem is that if you have some very rare events, random sampling
>> can produce odd results: unless your sample is very large, those rare
>> events drop out entirely, while less rare events are represented
>> proportionally. There are sampling methods that ensure that x% of the
>> rare events are included even if those rare events are less than x% of
>> your total data set. One of those might be appropriate for profiling
>> data when you're looking for rare slow queries amongst many faster
>> queries.
>
> I'll grant all that, but it still seems to me like x% of all queries
> plus all queries running longer than x milliseconds would cover most
> of the interesting cases.

In our specific case, we were capturing statistics about webpage hits; when we took "random" samples and multiplied back out, there were some inconsistencies that we couldn't explain. We just turned the sampling off and never really investigated, so it's possible that something in our implementation was flawed.

However, randomness can also work against you in strange ways. You could easily get a glut of samples that are skewed in one direction or another. And the problem can be far worse if your "randomness" is actually influenced by some other aspect of the system.

For this case it might be good enough; I just wanted to add a note of caution.