Re: ANALYZE sampling is too good

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Claudio Freire <klaussfreire(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Florian Pflug <fgp(at)phlo(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Greg Stark <stark(at)mit(dot)edu>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-12 19:13:55
Message-ID: CAMkU=1zt9Z6qTuyX8BGn1+5P8dy26C-T9pnuZTvoTP21mKmsHw@mail.gmail.com
Lists: pgsql-hackers

On Thu, Dec 12, 2013 at 10:33 AM, Claudio Freire <klaussfreire(at)gmail(dot)com> wrote:

> On Thu, Dec 12, 2013 at 3:29 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> > Jeff Janes <jeff(dot)janes(at)gmail(dot)com> writes:
> >> It would be relatively easy to fix this if we trusted the number of
> >> visible rows in each block to be fairly constant. But without that
> >> assumption, I don't see a way to fix the sample selection process
> >> without reading the entire table.
> >
> > Yeah, varying tuple density is the weak spot in every algorithm we've
> > looked at. The current code is better than what was there before, but as
> > you say, not perfect. You might be entertained to look at the threads
> > referenced by the patch that created the current sampling method:
> >
> > http://www.postgresql.org/message-id/1tkva0h547jhomsasujt2qs7gcgg0gtvrp@email.aon.at
> >
> > particularly
> >
> > http://www.postgresql.org/message-id/flat/ri5u70du80gnnt326k2hhuei5nlnimonbs(at)email(dot)aon(dot)at#ri5u70du80gnnt326k2hhuei5nlnimonbs@email.aon.at
>

Thanks, I will read those.

> >
> >
> > However ... where this thread started was not about trying to reduce
> > the remaining statistical imperfections in our existing sampling method.
> > It was about whether we could reduce the number of pages read for an
> > acceptable cost in increased statistical imperfection.
>

I think it is pretty clear that the n_distinct estimate, at least, and
probably the MCV list as well, would be a catastrophe under some common data
distribution patterns if we sampled all rows in each block without changing
our current computation method. If we come up with a computation that works
for that kind of sampling, it would probably be an improvement under our
current sampling as well. And if we do find such a thing, I wouldn't want it
rejected just because the larger block-sampling change didn't make it in.
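
To make the clustering hazard concrete, here is a toy sketch (Python, purely
hypothetical -- nothing like this is in the tree) of a column whose values
are perfectly correlated with block order, say an insertion-ordered
timestamp. A row sample and a block sample of the same size see wildly
different numbers of distinct values:

import random

ROWS_PER_BLOCK = 100
N_BLOCKS = 10000

# Worst case for block sampling: each block holds a single value, as in
# a sequentially loaded table.
table = [[blk] * ROWS_PER_BLOCK for blk in range(N_BLOCKS)]

# Row sampling: 3000 rows chosen independently across the whole table.
row_sample = [random.choice(random.choice(table)) for _ in range(3000)]

# Block sampling: every row from 30 randomly chosen blocks (3000 rows).
block_sample = [row for blk in random.sample(table, 30) for row in blk]

print(len(set(row_sample)))    # ~2600 distinct values seen
print(len(set(block_sample)))  # exactly 30 distinct values seen

Run a duplicate-based n_distinct computation over the second sample and it
will conclude the column has only about thirty distinct values, when it
actually has ten thousand.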

> Well, why not take a supersample containing all visible tuples from N
> selected blocks, and do bootstrapping over it, with subsamples of M
> independent rows each?
>

Bootstrapping methods generally do not work well when ties are significant
events, i.e. when two values being exactly identical means something
meaningfully different from their being very close but not identical.
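
To make that concrete with another hypothetical Python sketch (not a
concrete proposal): resampling with replacement manufactures exact
duplicates out of thin air, so any tie-counting statistic computed over a
bootstrap subsample is biased before you start:

import random
from collections import Counter

sample = list(range(1000))  # a supersample in which every value is unique

def tie_fraction(rows):
    # Fraction of rows whose value appears more than once -- the kind of
    # tie statistic a duplicate-based n_distinct estimator relies on.
    counts = Counter(rows)
    return sum(1 for r in rows if counts[r] > 1) / len(rows)

resample = [random.choice(sample) for _ in range(len(sample))]

print(tie_fraction(sample))    # 0.0: the real data has no ties at all
print(tie_fraction(resample))  # ~0.63: ties created purely by resampling

An estimator that reads an exact duplicate as evidence of a genuinely
repeated value has no way to tell those manufactured ties from real ones.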

Cheers,

Jeff
