Re: auto_explain WAS: RFC: Timing Events

From: Jim Nasby <jim(at)nasby(dot)net>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Greg Stark <stark(at)mit(dot)edu>, Gavin Flower <GavinFlower(at)archidevsys(dot)co(dot)nz>, Greg Smith <greg(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: auto_explain WAS: RFC: Timing Events
Date: 2013-02-26 22:55:38
Message-ID: 512D3D6A.3060402@nasby.net
Lists: pgsql-hackers

On 2/26/13 11:19 AM, Robert Haas wrote:
> On Mon, Feb 25, 2013 at 10:22 PM, Greg Stark <stark(at)mit(dot)edu> wrote:
>> On Mon, Feb 25, 2013 at 8:26 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>> On Sun, Feb 24, 2013 at 7:27 PM, Jim Nasby <jim(at)nasby(dot)net> wrote:
>>>> We actually do that in our application and have discovered that random
>>>> sampling can end up significantly skewing your data.
>>>
>>> /me blinks.
>>>
>>> How so?
>>
>> Sampling is a pretty big area of statistics. There are dozens of
>> sampling methods to deal with various problems that occur with
>> different types of data distributions.
>>
>> One problem is if you have some very rare events then random sampling
>> can produce odd results since those rare events will drop out entirely
>> unless your sample is very large whereas less rare events are
>> represented proportionally. There are sampling methods that ensure
>> that x% of the rare events are included even if those rare events are
>> less than x% of your total data set. One of those might be appropriate
>> to use for profiling data when you're looking for rare slow queries
>> amongst many faster queries.
>
> I'll grant all that, but it still seems to me like x% of all queries
> plus all queries running longer than x milliseconds would cover most
> of the interesting cases.

In our specific case, we were capturing statistics about webpage hits. When we took "random" samples and multiplied the counts back out, we got inconsistencies that we couldn't explain. We simply turned the sampling off and never really investigated, so it's possible that something in our implementation was flawed.

However, randomness can also work against you in strange ways. You could easily get a glut of samples skewed in one direction or another, and the problem can be far worse if your "randomness" is actually influenced by some other aspect of the system.

For this case it might be good enough. I just wanted to caution about it.
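
For reference, here is a rough sketch of the rule Robert describes above (a random x% of all queries, plus every query slower than a threshold), together with the "multiply back out" step. It is illustrative only and is not how auto_explain or our application does it; the names sample_queries, estimated_count, rate and slow_ms are made up for this example. Each kept query carries a weight of 1 / (its chance of being kept), so summing the weights estimates the original query count.

```python
import random


def sample_queries(durations_ms, rate, slow_ms, rng=None):
    """Keep every query at or above slow_ms, plus a random `rate` fraction of the rest.

    Returns a list of (duration_ms, weight) pairs, where weight is
    1 / (probability that this query was kept). All names are hypothetical.
    """
    if rng is None:
        rng = random.Random()
    kept = []
    for duration in durations_ms:
        if duration >= slow_ms:
            # Slow queries are always kept, so each one stands only for itself.
            kept.append((duration, 1.0))
        elif rng.random() < rate:
            # A sampled fast query stands in for 1/rate queries like it.
            kept.append((duration, 1.0 / rate))
    return kept


def estimated_count(kept):
    """Scale the sample back up to an estimate of the total number of queries."""
    return sum(weight for _, weight in kept)


if __name__ == "__main__":
    # 1000 fast queries and 2 rare slow ones.
    durations = [5] * 1000 + [2000, 3500]
    kept = sample_queries(durations, rate=0.25, slow_ms=1000, rng=random.Random(42))
    print(len(kept), estimated_count(kept))
```

The rare slow queries always survive, which addresses Greg's point about rare events dropping out. The estimate for the fast queries is still only as good as the random draw, so skewed totals remain possible when the sample is small or the draw is not independent of the workload.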
