Re: ANALYZE sampling is too good

From: Greg Stark <stark(at)mit(dot)edu>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: ANALYZE sampling is too good
Date: 2013-12-03 23:30:44
Message-ID: CAM-w4HOjRbNPMW=SHjHw_Qfapcuu5Ege1tMdR0ZQU+kqX8Qeug@mail.gmail.com
Lists: pgsql-hackers

At multiple conferences I've heard about people trying all sorts of
gymnastics to avoid ANALYZE, which they expect to take too long and
consume too much I/O. This is an especially big complaint after
upgrades: people who chose pg_upgrade to avoid extended downtime find
that the new database performs poorly until the new statistics are in,
and then complain about ANALYZE taking hours.

I always gave the party line that ANALYZE only takes a small
constant-sized sample so even very large tables should be very quick.
But after hearing the same story again at Heroku I looked into it a
bit further. I was kind of shocked by the numbers.

ANALYZE takes a sample of 300 * statistics_target rows. That sounds
pretty reasonable but with default_statistics_target set to 100 that's
30,000 rows. If I'm reading the code right it takes this sample by
sampling 30,000 blocks and then (if the table is large enough) taking
an average of one row per block. Each block is 8192 bytes so that
means it's reading 240MB of each table. That's a lot more than I
realized.
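
(Concretely, assuming the stock 8192-byte block size: 300 * 100 =
30,000 sample rows, hence up to 30,000 distinct blocks touched, and
30,000 * 8192 bytes is roughly 245MB -- which is where the ~240MB
figure comes from.)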

It means if your table is anywhere up to 240MB you're effectively
doing a full table scan and then throwing out nearly all the data
read.

Worse, my experience with the posix_fadvise benchmarking is that on
spinning media reading one out of every 16 blocks takes about the same
time as reading them all. Presumably this is because the seek time
between tracks dominates and reading one out of every 16 blocks is
still reading every track. So in fact if your table is up to about
3-4G ANALYZE is still effectively going to do a full table scan, at
least as far as I/O time goes.

The current algorithm seems like it was designed with a 100G+ table in
mind, but the consequences for the more common 100M-100G tables weren't
really considered. Consider what this means for partitioned tables: if
someone partitions their terabyte table into 10 partitions, ANALYZE will
suddenly want to use 10x as much I/O, which seems like a perverse
consequence.
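
(As a worked example with the defaults: each partition gets its own
30,000-block sample, so ten 100GB partitions cost roughly 10 * 240MB =
2.4GB of reads, versus a single 240MB pass over the unpartitioned
terabyte table.)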

I'm not sure I have a prescription but my general feeling is that
we're spending an awful lot of resources going after a statistically
valid sample when we can spend a lot less resources and get something
90% as good. Or if we're really going to read that much data, we might
as well use more of the rows we find.

--
greg


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Greg Stark <stark(at)mit(dot)edu>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-05 23:50:58
Message-ID: 52A11162.8090104@agliodbs.com
Lists: pgsql-hackers

On 12/03/2013 03:30 PM, Greg Stark wrote:
> It means if your table is anywhere up to 240MB you're effectively
> doing a full table scan and then throwing out nearly all the data
> read.

There are lots of issues with our random sampling approach for ANALYZE.
This is why, back in our Greenplum days, Simon proposed changing to a
block-based sampling approach, where we would sample random *pages*
instead of random *rows*. That would allow us to do things like sample
5% of the table, but only read 5% of the table, although we might have
to play some with OS-FS operations to make sure of that. In addition to
solving the issue you cite above, it would let us get MUCH more accurate
estimates for very large tables, where currently we sample only about
0.1% of the table.

There are fairly well researched algorithms for block-based sampling
which account for the skew introduced by looking at consecutive rows in
a block. In general, a minimum sample size of 5% is required, and the
error is no worse than our current system. However, the idea was shot
down at the time, partly because I think other hackers didn't get the math.

I believe that both Oracle and MSSQL use block-based sampling, but of
course, I don't know which specific algo they use.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Claudio Freire <klaussfreire(at)gmail(dot)com>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-06 00:23:11
Message-ID: CAGTBQpbXp9wrPnNm7QAGm5BTeorxyF0ce7-6hrBv9cYDC4iakQ@mail.gmail.com
Lists: pgsql-hackers

On Tue, Dec 3, 2013 at 8:30 PM, Greg Stark <stark(at)mit(dot)edu> wrote:
> Worse, my experience with the posix_fadvise benchmarking is that on
> spinning media reading one out of every 16 blocks takes about the same
> time as reading them all. Presumably this is because the seek time
> between tracks dominates and reading one out of every 16 blocks is
> still reading every track. So in fact if your table is up to about
> 3-4G ANALYZE is still effectively going to do a full table scan, at
> least as far as I/O time goes.

Actually, it's rotational latency that's the dominant cost there.


From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Greg Stark <stark(at)mit(dot)edu>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-06 01:52:34
Message-ID: CAM3SWZREK9cRovD2X=3pMqYgq1QfhG6xmfdwD_gN0FEsH9td+w@mail.gmail.com
Lists: pgsql-hackers

On Thu, Dec 5, 2013 at 3:50 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> There are fairly well researched algorithms for block-based sampling
> which account for the skew introduced by looking at consecutive rows in
> a block. In general, a minimum sample size of 5% is required, and the
> error is no worse than our current system. However, the idea was shot
> down at the time, partly because I think other hackers didn't get the math.

I think that this certainly warrants revisiting. The benefits would be
considerable.

Has anyone ever thought about opportunistic ANALYZE piggy-backing on
other full-table scans? That doesn't really help Greg, because his
complaint is mostly that a fresh ANALYZE is too expensive, but it
could be an interesting, albeit risky approach.
Opportunistically/unpredictably acquiring a ShareUpdateExclusiveLock
would be kind of weird, for one thing, but if a full table scan really
is very expensive, would it be so unreasonable to attempt to amortize
that cost?

--
Peter Geoghegan


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Peter Geoghegan <pg(at)heroku(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, Greg Stark <stark(at)mit(dot)edu>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-06 08:49:33
Message-ID: CAA4eK1K1R011==4-xuYe9WYFqWQiT=Hayp-Aa4J=gc0Xy9=2xA@mail.gmail.com
Lists: pgsql-hackers

On Fri, Dec 6, 2013 at 7:22 AM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
> On Thu, Dec 5, 2013 at 3:50 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>> There are fairly well researched algorithms for block-based sampling
>> which account for the skew introduced by looking at consecutive rows in
>> a block. In general, a minimum sample size of 5% is required, and the
>> error is no worse than our current system. However, the idea was shot
>> down at the time, partly because I think other hackers didn't get the math.
>
> I think that this certainly warrants revisiting. The benefits would be
> considerable.
>
> Has anyone ever thought about opportunistic ANALYZE piggy-backing on
> other full-table scans? That doesn't really help Greg, because his
> complaint is mostly that a fresh ANALYZE is too expensive, but it
> could be an interesting, albeit risky approach.

Is only a fresh ANALYZE costly, or are subsequent ones equally costly?

Doing it in some background operation might not be a bad idea, but doing it
during backend query execution (seq scan) might add overhead to query
response time, especially if part or most of the table's data is in RAM:
there the overhead of the actual reads might not be very high, but the
computation for analyze (like the sort) will make it costly.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Peter Geoghegan <pg(at)heroku(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, Greg Stark <stark(at)mit(dot)edu>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-06 09:21:14
Message-ID: 20131206092114.GH7814@awork2.anarazel.de
Lists: pgsql-hackers

On 2013-12-05 17:52:34 -0800, Peter Geoghegan wrote:
> Has anyone ever thought about opportunistic ANALYZE piggy-backing on
> other full-table scans? That doesn't really help Greg, because his
> complaint is mostly that a fresh ANALYZE is too expensive, but it
> could be an interesting, albeit risky approach.

What I've been thinking of is

a) making it piggy back on scans vacuum is doing instead of doing
separate ones all the time (if possible, analyze needs to be more
frequent). Currently with quite some likelihood the cache will be gone
again when revisiting.

b) make analyze incremental. In lots of bigger tables most of the table
is static - and we actually *do* know that, thanks to the vm. So keep a
rawer form of what ends up in the catalogs around somewhere, chunked by
the region of the table the statistic is from. Every time a part of the
table changes, re-sample only that part. Then recompute the aggregate.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Greg Stark <stark(at)mit(dot)edu>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Peter Geoghegan <pg(at)heroku(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-06 16:05:45
Message-ID: CAM-w4HPDaioC9epxviuNkD-8ZnYeBSb7Z=uHQUkTohMfdkgFVQ@mail.gmail.com
Lists: pgsql-hackers

It looks like this is a fairly well understood problem, because in the
real world it's often cheaper to speak to people in a small geographic
area or time interval too. These Wikipedia pages sound interesting and
have some external references:

http://en.wikipedia.org/wiki/Cluster_sampling
http://en.wikipedia.org/wiki/Multistage_sampling

I suspect the hard part will be characterising the nature of the
non-uniformity in the sample generated by taking a whole block. Some
of it may come from how the rows were loaded (e.g. older rows were
loaded by pg_restore but newer rows were inserted retail) or from the
way Postgres works (e.g. hotter rows are on blocks with fewer rows in
them and colder rows are more densely packed).

I've felt for a long time that Postgres would make an excellent test
bed for some aspiring statistics research group.

--
greg


From: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Peter Geoghegan <pg(at)heroku(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-07 07:17:56
Message-ID: alpine.DEB.2.10.1312070807350.6697@sto
Lists: pgsql-hackers


> http://en.wikipedia.org/wiki/Cluster_sampling
> http://en.wikipedia.org/wiki/Multistage_sampling
>
> I suspect the hard part will be characterising the nature of the
> non-uniformity in the sample generated by taking a whole block. Some
> of it may come from how the rows were loaded (e.g. older rows were
> loaded by pg_restore but newer rows were inserted retail) or from the
> way Postgres works (e.g. hotter rows are on blocks with fewer rows in
> them and colder rows are more densely packed).

I would have thought that as VACUUM reclaims space it evens that issue
out in the long run and on average, so that it could simply be ignored?

> I've felt for a long time that Postgres would make an excellent test
> bed for some aspiring statistics research group.

I would say "applied statistics" rather than "research". Nevertheless I
can ask my research statistician colleagues next door about their opinion
on this sampling question.

--
Fabien.


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-07 19:46:17
Message-ID: CA+TgmoZaqyGSuaL2v+YFVsX06DQDQh-pEV0nobGPws-dNwAwBw@mail.gmail.com
Lists: pgsql-hackers

On Tue, Dec 3, 2013 at 6:30 PM, Greg Stark <stark(at)mit(dot)edu> wrote:
> I always gave the party line that ANALYZE only takes a small
> constant-sized sample so even very large tables should be very quick.
> But after hearing the same story again at Heroku I looked into it a
> bit further. I was kind of shocked by the numbers.
>
> ANALYZE takes a sample of 300 * statistics_target rows. That sounds
> pretty reasonable but with default_statistics_target set to 100 that's
> 30,000 rows. If I'm reading the code right it takes this sample by
> sampling 30,000 blocks and then (if the table is large enough) taking
> an average of one row per block. Each block is 8192 bytes so that
> means it's reading 240MB of each table. That's a lot more than I
> realized.

That is a lot. On the other hand, I question the subject line:
sometimes, our ANALYZE sampling is not good enough. Before we raised
the default statistics target from 10 to 100, complaints about bad
plan choices due to insufficiently-precise statistics were legion --
and we still have people periodically proposing to sample a fixed
percentage of the table instead of a fixed amount of the table, even
on large tables, which is going the opposite direction. I think this
is because they're getting really bad n_distinct estimates, and no
fixed-size sample can reliably give a good one.

More generally, I think the basic problem that people are trying to
solve by raising the statistics target is to avoid index scans on
gigantic tables. Obviously, there are a lot of other situations where
inadequate statistics can cause problems, but that's a pretty
easy-to-understand one that we do not always get right. We know that
an equality search looking for some_column = 'some_constant', where
some_constant is an MCV, must be more selective than a search for the
least-frequent MCV. If you store more and more MCVs for a table,
eventually you'll have enough that the least-frequent one is pretty
infrequent, and then things work a lot better.

This is more of a problem for big tables than for small tables. MCV
#100 can't have a frequency of greater than 1/100 = 0.01, but that's a
lot more rows on a big table than small one. On a table with 10
million rows we might estimate something close to 100,000 rows when
the real number is just a handful; when the table has only 10,000
rows, we just can't be off by as many orders of magnitude. Things
don't always work out that badly, but in the worst case they do.

Maybe there's some highly-principled statistical approach which could
be taken here, and if so that's fine, but I suspect not. So what I
think we should do is auto-tune the statistics target based on the
table size. If, say, we think that the generally useful range for the
statistics target is something like 10 to 400, then let's come up with
a formula based on table size that outputs 10 for small tables, 400
for really big tables, and intermediate values for tables in the
middle.
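
Purely to illustrate the shape such a formula could take (the constants
here are arbitrary placeholders, not a concrete proposal):

#include <math.h>

/*
 * Illustrative only: scale the statistics target with log10 of the row
 * count, clamped to the 10..400 range mentioned above.  This gives ~10
 * at 10^4 rows, ~400 at 10^9 rows, and is linear in log10 in between.
 */
static int
auto_statistics_target(double reltuples)
{
    double      t;

    if (reltuples < 1.0)
        reltuples = 1.0;
    t = 10.0 + (400.0 - 10.0) * (log10(reltuples) - 4.0) / (9.0 - 4.0);
    if (t < 10.0)
        t = 10.0;
    if (t > 400.0)
        t = 400.0;
    return (int) t;
}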

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>, Greg Stark <stark(at)mit(dot)edu>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-07 20:25:59
Message-ID: 52A38457.6090204@agliodbs.com
Lists: pgsql-hackers

On 12/07/2013 11:46 AM, Robert Haas wrote:
> Maybe there's some highly-principled statistical approach which could
> be taken here, and if so that's fine, but I suspect not. So what I
> think we should do is auto-tune the statistics target based on the
> table size. If, say, we think that the generally useful range for the
> statistics target is something like 10 to 400, then let's come up with
> a formula based on table size that outputs 10 for small tables, 400
> for really big tables, and intermediate values for tables in the
> middle.

The only approach which makes sense is to base it on a % of the table.
In fact, pretty much every paper which has examined statistics
estimation for database tables has determined that any estimate based on
a less-than-5% sample is going to be wildly inaccurate. Not that 5%
samples are 100% accurate, but at least they fit the 80/20 rule.

This is the reason why implementing block-based sampling is critical;
using our current "take one row out of every page" method, sampling 5%
of the table means scanning the whole thing in most tables. We also
need to decouple the number of MCVs we keep from the sample size.
Certainly our existing sampling algo seems designed to maximize IO for
the sample size.

There are other qualitative improvements we could make, which Nathan Boley
has spoken on. For example, our stats code has no way to recognize a
normal or exponential distribution -- it assumes that all columns are
randomly distributed. If we could recognize common distribution
patterns, then not only could we have better query estimates, those
would require keeping *fewer* stats, since all you need for a normal
distribution are the end points and the variance.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Greg Stark <stark(at)mit(dot)edu>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-07 21:27:39
Message-ID: CAM-w4HOQfAVvUMYCXOfmkZwg=8ye9t7jpqYBdyE7+ny-hZtZqA@mail.gmail.com
Lists: pgsql-hackers

On Sat, Dec 7, 2013 at 8:25 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> The only approach which makes sense is to base it on a % of the table.
> In fact, pretty much every paper which has examined statistics
> estimation for database tables has determined that any estimate based on
> a less-than-5% sample is going to be wildly inaccurate. Not that 5%
> samples are 100% accurate, but at least they fit the 80/20 rule.

This is nonsense. The math behind the deductions the planner makes is
well understood and we know how to estimate the precision based on the
sample size. There are some questions that really need a proportional
sample such as n_distinct (and as Robert points out the MCV list) but
the main motivation of the sample size is the histograms and those are
used to answer questions which very clearly do not need a proportional
sample. The statistics is very clear there. Using a proportional
sample for the histograms would just be silly. It would be
substituting a gut feeling for real statistics.

The problems with using a proportional sample for things like
n_distinct or the MCV list is that you're talking about sorting or
hashing an unboundedly large set of values and storing an unboundedly
large array in the statistics table. I don't think that would be
tenable without dramatically changing the way we process and store that
data to be more scalable. Using a lossy counting algorithm and
something more scalable than toasted arrays would be prerequisites I
think.

And as Robert mentions even if we solved those problems it wouldn't
help n_distinct. You really need to read all the values to deal with
n_distinct.

> This is the reason why implementing block-based sampling is critical;
> using our current "take one row out of every page" method, sampling 5%
> of the table means scanning the whole thing in most tables. We also
> need to decouple the number of MCVs we keep from the sample size.
> Certainly our existing sampling algo seems designed to maximize IO for
> the sample size.

This would be helpful if you could specify what you mean by
"block-based sampling". The reason these previous pleas didn't go
anywhere is not because we can't do math. It's because of the lack of
specifics here. What we do *today* is called "block-based sampling" by
the literature.

What I'm saying is we need to change the "block-based sampling" that
we do to extract more rows per block. We currently extract the
"correct" number of rows to get a strong guarantee of uniformity. If
we could get a weaker guarantee of being "close enough" to uniform
samples for the deductions we want to make and get many more rows per
block then we could read a lot fewer blocks.

Or to put it another way people could increase the statistics target
dramatically and still be reading the same number of blocks as today.
In an ideal world perhaps we could have people reading 1% of the
blocks they read today and get statistics targets 10x better than
today.

--
greg


From: Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>
To: Josh Berkus <josh(at)agliodbs(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Greg Stark <stark(at)mit(dot)edu>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-07 23:27:31
Message-ID: 52A3AEE3.2030706@catalyst.net.nz
Lists: pgsql-hackers

On 08/12/13 09:25, Josh Berkus wrote:
> On 12/07/2013 11:46 AM, Robert Haas wrote:
>> Maybe there's some highly-principled statistical approach which could
>> be taken here, and if so that's fine, but I suspect not. So what I
>> think we should do is auto-tune the statistics target based on the
>> table size. If, say, we think that the generally useful range for the
>> statistics target is something like 10 to 400, then let's come up with
>> a formula based on table size that outputs 10 for small tables, 400
>> for really big tables, and intermediate values for tables in the
>> middle.
>
> The only approach which makes sense is to base it on a % of the table.
> In fact, pretty much every paper which has examined statistics
> estimation for database tables has determined that any estimate based on
> a less-than-5% sample is going to be wildly inaccurate. Not that 5%
> samples are 100% accurate, but at least they fit the 80/20 rule.
>
> This is the reason why implementing block-based sampling is critical;
> using our current "take one row out of every page" method, sampling 5%
> of the table means scanning the whole thing in most tables. We also
> need to decouple the number of MCVs we keep from the sample size.
> Certainly our existing sampling algo seems designed to maximize IO for
> the sample size.
>
> There are other qualitative improvements we could make, which Nathan Boley
> has spoken on. For example, our stats code has no way to recognize a
> normal or exponential distribution -- it assumes that all columns are
> randomly distributed. If we could recognize common distribution
> patterns, then not only could we have better query estimates, those
> would require keeping *fewer* stats, since all you need for a normal
> distribution are the end points and the variance.
>

From src/backend/commands/analyze.c

* As of May 2004 we use a new two-stage method: Stage one selects up
* to targrows random blocks (or all blocks, if there aren't so many).
* Stage two scans these blocks and uses the Vitter algorithm to create
* a random sample of targrows rows (or less, if there are less in the
* sample of blocks). The two stages are executed simultaneously: each
* block is processed as soon as stage one returns its number and while
* the rows are read stage two controls which ones are to be inserted
* into the sample.

I don't think we always read every block (been a while since I looked at
this code, so I'll recheck). From what I understand this stuff is based
on reasonable research (Vitter algorithm). Not to say it
couldn't/shouldn't be looked at again to improve it - but it is not just
dreamed up with no valid research!

Cheers

Mark


From: Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>
To: Josh Berkus <josh(at)agliodbs(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Greg Stark <stark(at)mit(dot)edu>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-08 00:06:42
Message-ID: 52A3B812.4070407@catalyst.net.nz
Lists: pgsql-hackers

On 08/12/13 12:27, Mark Kirkwood wrote:
> On 08/12/13 09:25, Josh Berkus wrote:
>> On 12/07/2013 11:46 AM, Robert Haas wrote:
>>> Maybe there's some highly-principled statistical approach which could
>>> be taken here, and if so that's fine, but I suspect not. So what I
>>> think we should do is auto-tune the statistics target based on the
>>> table size. If, say, we think that the generally useful range for the
>>> statistics target is something like 10 to 400, then let's come up with
>>> a formula based on table size that outputs 10 for small tables, 400
>>> for really big tables, and intermediate values for tables in the
>>> middle.
>>
>> The only approach which makes sense is to base it on a % of the table.
>> In fact, pretty much every paper which has examined statistics
>> estimation for database tables has determined that any estimate based on
>> a less-than-5% sample is going to be wildly inaccurate. Not that 5%
>> samples are 100% accurate, but at least they fit the 80/20 rule.
>>
>> This is the reason why implementing block-based sampling is critical;
>> using our current "take one row out of every page" method, sampling 5%
>> of the table means scanning the whole thing in most tables. We also
>> need to decouple the number of MCVs we keep from the sample size.
>> Certainly our existing sampling algo seems designed to maximize IO for
>> the sample size.
>>
>> There are other qualitative improvements we could make, which Nathan Boley
>> has spoken on. For example, our stats code has no way to recognize a
>> normal or exponential distribution -- it assumes that all columns are
>> randomly distributed. If we could recognize common distribution
>> patterns, then not only could we have better query estimates, those
>> would require keeping *fewer* stats, since all you need for a normal
>> distribution are the end points and the variance.
>>
>
> From src/backend/commands/analyze.c
>
> * As of May 2004 we use a new two-stage method: Stage one selects up
> * to targrows random blocks (or all blocks, if there aren't so many).
> * Stage two scans these blocks and uses the Vitter algorithm to create
> * a random sample of targrows rows (or less, if there are less in the
> * sample of blocks). The two stages are executed simultaneously: each
> * block is processed as soon as stage one returns its number and while
> * the rows are read stage two controls which ones are to be inserted
> * into the sample.
>
> I don't think we always read every block (been a while since I looked at
> this code, so I'll recheck). From what I understand this stuff is based
> on reasonable research (Vitter algorithm). Not to say it
> couldn't/shouldn't be looked at again to improve it - but it is not just
> dreamed up with no valid research!
>

Since I volunteered to recheck :-)

from analyze.c again:

/*--------------------
* The following choice of minrows is based on the paper
* "Random sampling for histogram construction: how much is enough?"
* by Surajit Chaudhuri, Rajeev Motwani and Vivek Narasayya, in
* Proceedings of ACM SIGMOD International Conference on Management
* of Data, 1998, Pages 436-447. Their Corollary 1 to Theorem 5
* says that for table size n, histogram size k, maximum relative
* error in bin size f, and error probability gamma, the minimum
* random sample size is
* r = 4 * k * ln(2*n/gamma) / f^2
* Taking f = 0.5, gamma = 0.01, n = 10^6 rows, we obtain
* r = 305.82 * k
* Note that because of the log function, the dependence on n is
* quite weak; even at n = 10^12, a 300*k sample gives <= 0.66
* bin size error with probability 0.99. So there's no real need to
* scale for n, which is a good thing because we don't necessarily
* know it at this point.
*--------------------
*/
stats->minrows = 300 * attr->attstattarget;

So for typical settings (default_statistics_target = 100), we try to get
30000 rows. This means we will sample about 30000 blocks.
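
Plugging the paper's constants straight into that formula shows where
the 300 multiplier comes from (standalone arithmetic only, not
PostgreSQL code):

#include <math.h>
#include <stdio.h>

int main(void)
{
    double      f = 0.5;        /* maximum relative bin-size error */
    double      gamma = 0.01;   /* error probability */
    double      n = 1e6;        /* assumed table size */
    double      factor = 4.0 * log(2.0 * n / gamma) / (f * f);

    printf("minrows = %.2f * statistics_target\n", factor);    /* 305.82 */
    return 0;
}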

Indeed quickly checking with a scale 100 pgbench db and a simple patch
to make the block sampler say how many blocks it reads (note
pgbench_accounts has 163935 blocks):

bench=# ANALYZE pgbench_branches;
NOTICE: acquire sample will need 30000 blocks
NOTICE: sampled 1 blocks
ANALYZE
Time: 16.935 ms

bench=# ANALYZE pgbench_accounts;
NOTICE: acquire sample will need 30000 blocks
NOTICE: sampled 30000 blocks
ANALYZE
Time: 10059.446 ms
bench=# \q

Regards

Mark


From: Gavin Flower <GavinFlower(at)archidevsys(dot)co(dot)nz>
To: Greg Stark <stark(at)mit(dot)edu>, Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-08 02:31:33
Message-ID: 52A3DA05.7040506@archidevsys.co.nz
Lists: pgsql-hackers

On 08/12/13 10:27, Greg Stark wrote:
> On Sat, Dec 7, 2013 at 8:25 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>> The only approach which makes sense is to base it on a % of the table.
>> In fact, pretty much every paper which has examined statistics
>> estimation for database tables has determined that any estimate based on
>> a less-than-5% sample is going to be wildly inaccurate. Not that 5%
>> samples are 100% accurate, but at least they fit the 80/20 rule.
> This is nonsense. The math behind the deductions the planner makes is
> well understood and we know how to estimate the precision based on the
> sample size. There are some questions that really need a proportional
> sample such as n_distinct (and as Robert points out the MCV list) but
> the main motivation of the sample size is the histograms and those are
> used to answer questions which very clearly do not need a proportional
> sample. The statistics is very clear there. Using a proportional
> sample for the histograms would just be silly. It would be
> substituting a gut feeling for real statistics.
>
> The problems with using a proportional sample for things like
> n_distinct or the MCV list is that you're talking about sorting or
> hashing an unboundedly large set of values and storing an unboundedly
> large array in the statistics table. I don't think that would be
> tenable without dramatically changing the way we process and store that
> data to be more scalable. Using a lossy counting algorithm and
> something more scalable than toasted arrays would be prerequisites I
> think.
>
> And as Robert mentions even if we solved those problems it wouldn't
> help n_distinct. You really need to read all the values to deal with
> n_distinct.
>
>> This is the reason why implementing block-based sampling is critical;
>> using our current "take one row out of every page" method, sampling 5%
>> of the table means scanning the whole thing in most tables. We also
>> need to decouple the number of MCVs we keep from the sample size.
>> Certainly our existing sampling algo seems designed to maximize IO for
>> the sample size.
> This would be helpful if you could specify what you mean by
> "block-based sampling". The reason these previous pleas didn't go
> anywhere is not because we can't do math. It's because of the lack of
> specifics here. What we do *today* is called "block-based sampling" by
> the literature.
>
> What I'm saying is we need to change the "block-based sampling" that
> we do to extract more rows per block. We currently extract the
> "correct" number of rows to get a strong guarantee of uniformity. If
> we could get a weaker guarantee of being "close enough" to uniform
> samples for the deductions we want to make and get many more rows per
> block then we could read a lot fewer blocks.
>
> Or to put it another way people could increase the statistics target
> dramatically and still be reading the same number of blocks as today.
> In an ideal world perhaps we could have people reading 1% of the
> blocks they read today and get statistics targets 10x better than
> today.
>
I suspect that the number of rows to sample should be something of the form:

        { N                      where N <= a * 100
    n = { (a * N) / 100          where a * 100 < N <= 10000
        { (a * SQRT(N)) / 100    where N > 10000

n: the number of rows to sample
N: total number of rows
a: configurable coefficient (1 <= a <= 100)

For very large numbers of rows, like 10^8, taking a constant fraction
will oversample for most purposes, hence the square root.

      a        N        n   sampled%
      1    10000      100         1%
    100    10000    10000       100%
      1     10^8    10000      0.01%
    100     10^8     10^6         1%
      1    10^10     10^5     0.001%
    100    10^10     10^7      0.01%

For very large values of N, you can still get a representative sample
with a very small fraction of rows sampled.

Hmm... Yes, I can see some issues, once I tried it out with test values!
However, it might inspire a more useful and practical approach!

Cheers,
Gavin


From: Greg Stark <stark(at)mit(dot)edu>
To: Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-08 18:14:51
Message-ID: CAM-w4HODF+=Rv8XhUefCJjPDyfFVOMqetWRQUG75S+dMtxD3xw@mail.gmail.com
Lists: pgsql-hackers

On Sun, Dec 8, 2013 at 12:06 AM, Mark Kirkwood
<mark(dot)kirkwood(at)catalyst(dot)net(dot)nz> wrote:
>
> bench=# ANALYZE pgbench_accounts;
> NOTICE: acquire sample will need 30000 blocks
> NOTICE: sampled 30000 blocks
> ANALYZE
> Time: 10059.446 ms
> bench=# \q

I did some experimenting here as well.

I hacked up a version of analyze.c that has a guc for rows_per_block
to sample. Setting it to 1 doesn't modify the behaviour at all whereas
setting it to 4 divides the number of blocks to sample by 4 which
causes it to do less I/O and use more rows from each block.

I then initialized pgbench with scale factor 100 but modified the code
to run the actual pgbench with scale factor 1. In other words I ran a
lot of updates on 1% of the database but left the other 99% untouched
from the initial load.

Then I ran "ANALYZE VERBOSE accounts" with rows_per_block set to 1, 4,
16, and 64. The latter is slightly larger than the average number of
tuples per block so the resulting sample is actually slightly short.

The whole accounts table is 1.2GB and contains 10 million rows. As
expected with rows_per_block set to 1 it reads 240MB of that
containing nearly 2 million rows (and takes nearly 20s -- doing a full
table scan for select count(*) only takes about 5s):

stark=# analyze verbose pgbench_accounts;
INFO: analyzing "public.pgbench_accounts"
INFO: "pgbench_accounts": scanned 30000 of 158756 pages, containing
1889701 live rows and 0 dead rows; 30000 rows in sample, 10000036
estimated total rows
ANALYZE
Time: 19468.987 ms

With rows_per_block=4 it reads only a quarter as many blocks but it's
not much faster:

stark=# analyze verbose pgbench_accounts;
INFO: analyzing "public.pgbench_accounts"
INFO: "pgbench_accounts": scanned 7501 of 158756 pages, containing
472489 live rows and 0 dead rows; 30000 rows in sample, 10000037
estimated total rows
ANALYZE
Time: 17062.331 ms

But with rows_per_block=16 it's much faster, 6.7s

stark=# set statistics_rows_per_block = 16;
SET
Time: 1.583 ms
stark=# analyze verbose pgbench_accounts;
INFO: analyzing "public.pgbench_accounts"
INFO: "pgbench_accounts": scanned 1876 of 158756 pages, containing
118163 live rows and 0 dead rows; 30000 rows in sample, 10000031
estimated total rows
ANALYZE
Time: 6694.331 ms

And with rows_per_block=64 it's under 2s:

stark=# set statistics_rows_per_block = 64;
SET
Time: 0.693 ms
stark=# analyze verbose pgbench_accounts;
INFO: analyzing "public.pgbench_accounts"
INFO: "pgbench_accounts": scanned 469 of 158756 pages, containing
29544 live rows and 0 dead rows; 29544 rows in sample, 10000033
estimated total rows
ANALYZE
Time: 1937.055 ms

The estimates for the total rows are just as accurate in every case.
(They seem to be consistently slightly high though, which is a bit
disconcerting.)

However looking at the actual pg_stats entries the stats are
noticeably less accurate for the "blockier" samples. The "bid" column
actually has 100 distinct values and so with a statistics_target of
100 each value should appear in the MCV list with a frequency of about
.01.

With rows_per_block=1 the MCV frequency list ranges from .0082 to .0123
With rows_per_block=4 the MCV frequency list ranges from .0063 to .0125
With rows_per_block=16 the MCV frequency list ranges from .0058 to .0164
With rows_per_block=64 the MCV frequency list ranges from .0021 to .0213

I'm not really sure if this is due to the blocky sample combined with
the skewed pgbench run or not. It doesn't seem to be consistently
biasing towards or against bid 1 which I believe are the only rows
that would have been touched by pgbench. Still it's suspicious that
they seem to be consistently getting less accurate as the blockiness
increases.

I've attached the results of pg_stats following the analyze with the
various levels of "blockiness".

--
greg

Attachment: pgbench_stats.txt (text/plain, 32.4 KB)

From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Greg Stark <stark(at)mit(dot)edu>, Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-08 19:03:02
Message-ID: 52A4C266.50000@agliodbs.com
Lists: pgsql-hackers

On 12/08/2013 10:14 AM, Greg Stark wrote:
> With rows_per_block=1 the MCV frequency list ranges from .0082 to .0123
> With rows_per_block=4 the MCV frequency list ranges from .0063 to .0125
> With rows_per_block=16 the MCV frequency list ranges from .0058 to .0164
> With rows_per_block=64 the MCV frequency list ranges from .0021 to .0213
>
> I'm not really sure if this is due to the blocky sample combined with
> the skewed pgbench run or not. It doesn't seem to be consistently
> biasing towards or against bid 1 which I believe are the only rows
> that would have been touched by pgbench. Still it's suspicious that
> they seem to be consistently getting less accurate as the blockiness
> increases.

They will certainly do so if you don't apply any statistical adjustments
for selecting more rows from the same pages.

So there's a set of math designed to account for the skew introduced
by reading *all* of the rows in each block. That's what I meant by
"block-based sampling"; you read, say, 400 pages, you compile statistics
on *all* of the rows on those pages, you apply some algorithms to adjust
for groupings of rows based on how grouped they are. And you have a
pretty good estimate of how grouped they are, because you just looked at
complete sets of rows on a bunch of nonadjacent pages.

Obviously, you need to look at more rows than you would with a
pure-random sample. Like I said, the 80%+ accurate point in the papers
seemed to be at a 5% sample. However, since those rows come from the
same pages, the cost of looking at more rows is quite small, compared to
the cost of looking at 64 times as many disk pages.

My ACM subscription has lapsed, though; someone with a current ACM
subscription could search for this; there are several published papers,
with math and pseudocode.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>, Josh Berkus <josh(at)agliodbs(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-08 19:49:43
Message-ID: 52A4CD57.1000308@vmware.com
Lists: pgsql-hackers

On 12/08/2013 08:14 PM, Greg Stark wrote:
> The whole accounts table is 1.2GB and contains 10 million rows. As
> expected with rows_per_block set to 1 it reads 240MB of that
> containing nearly 2 million rows (and takes nearly 20s -- doing a full
> table scan for select count(*) only takes about 5s):

One simple thing we could do, without or in addition to changing the
algorithm, is to issue posix_fadvise() calls for the blocks we're going
to read. It should at least be possible to match the speed of a plain
sequential scan that way.
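
Roughly along these lines (only a sketch: the function and the
block-number array are invented here, and in practice the block list
would come from the stage-one block sampler):

#define _POSIX_C_SOURCE 200112L
#include <fcntl.h>

#define BLCKSZ 8192

/*
 * Hint the kernel that we're about to read the sampled blocks, so it can
 * start fetching them ahead of the block-by-block reads.
 */
static void
prefetch_sample_blocks(int fd, const unsigned int *blocknums, int nblocks)
{
    int         i;

    for (i = 0; i < nblocks; i++)
        (void) posix_fadvise(fd,
                             (off_t) blocknums[i] * BLCKSZ,
                             (off_t) BLCKSZ,
                             POSIX_FADV_WILLNEED);
}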

- Heikki


From: Greg Stark <stark(at)mit(dot)edu>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-08 21:15:09
Message-ID: CAM-w4HNfAaWSZKLoSjnF4SMU4qUoumtMLcYwG=YQQEjdTwnHXA@mail.gmail.com
Lists: pgsql-hackers

On Sun, Dec 8, 2013 at 7:03 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> They will certainly do so if you don't apply any statistical adjustments
> for selecting more rows from the same pages.
>
> So there's a set of math designed to account for the skew introduced
> by reading *all* of the rows in each block.

I just think this is an oversimplification. There's no skew introduced
just by reading all the rows in each block unless there's some kind of
dependency between the block a row is placed on and the data in it. So
I don't believe there can be some single set of math that automatically
removes any skew. The math will depend on what the dependency is.

Just to be clear, you have to think pretty hard about the way Postgres
internals work to see what kinds of skew might be appearing here. Due
to the way Postgres updates work and HOT cleanups work "hot" tuples
will be weighted less than "cold" tuples. That's not going to be
something someone in ACM knew to design into their maths.

I do have access to ACM or other academic articles if you remember any
author names or any keywords but if it's a database journal I would
worry about patent issues. Do you remember if it was over 17 years
old?

> Obviously, you need to look at more rows than you would with a
> pure-random sample. Like I said, the 80%+ accurate point in the papers
> seemed to be at a 5% sample.

I really don't believe the 5% thing. It's not enough for n_distinct
and it's *far* too high a value for linear properties like histograms
or nullfrac etc. From a computer point of view it's too high to be
worth bothering. If we have to read 5% of the table we might as well
do a full scan anyways, it'll be marginally slower but much better
quality results.

--
greg


From: Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>
To: Josh Berkus <josh(at)agliodbs(dot)com>, Greg Stark <stark(at)mit(dot)edu>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-08 22:19:48
Message-ID: 52A4F084.3070200@catalyst.net.nz
Lists: pgsql-hackers

On 09/12/13 08:03, Josh Berkus wrote:
>
> So there's a set of math designed to account for the skew introduced
> by reading *all* of the rows in each block. That's what I meant by
> "block-based sampling"; you read, say, 400 pages, you compile statistics
> on *all* of the rows on those pages, you apply some algorithms to adjust
> for groupings of rows based on how grouped they are. And you have a
> pretty good estimate of how grouped they are, because you just looked at
> complete sets of rows on a bunch of nonadjacent pages.
>
> Obviously, you need to look at more rows than you would with a
> pure-random sample. Like I said, the 80%+ accurate point in the papers
> seemed to be at a 5% sample. However, since those rows come from the
> same pages, the cost of looking at more rows is quite small, compared to
> the cost of looking at 64 times as many disk pages.
>
> My ACM subscription has lapsed, though; someone with a current ACM
> subscription could search for this; there are several published papers,
> with math and pseudocode.
>

This makes more sense! Unfortunately you had confused everyone (well me
at least) by decreeing that we needed block-based sampling - when we
started using that in 2004 - there is more than one way to sample blocks
it seems :-)

What you are actually suggesting is a way to do block-based sampling
that requires reading fewer blocks, and that is a great idea! Of course,
as Greg has just suggested, the details are gonna be the clincher. Can
you supply authors or titles of papers?

Also with reference to Greenplum - their use of postgres is a fairly
special case, and an approach that works well for them is not always
suitable for more general purpose use.

Regards

Mark


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-09 18:03:50
Message-ID: 52A60606.9030409@agliodbs.com
Lists: pgsql-hackers

Greg,

> I really don't believe the 5% thing. It's not enough for n_distinct
> and it's *far* too high a value for linear properties like histograms
> or nullfrac etc.

Actually, it is enough for n_distinct, or more properly, 5% is as good
as you can get for n_distinct unless you're going to jump to scanning
50% or more.

It's also applicable for the other stats; histogram buckets constructed
from a 5% sample are more likely to be accurate than those constructed
from a 0.1% sample. Same with nullfrac. The degree of improved
accuracy, would, of course, require some math to determine.

> From a computer point of view it's too high to be
> worth bothering. If we have to read 5% of the table we might as well
> do a full scan anyways, it'll be marginally slower but much better
> quality results.

Reading 5% of a 200GB table is going to be considerably faster than
reading the whole thing, if that 5% is being scanned in a way that the
FS understands.

Also, we can optimize this significantly by using the VM, as Robert (I
think) suggested.

In the advanced approaches section, there's also the idea of collecting
analyze data from table pages while they're in memory anyway for other
reasons.

You do seem kind of hostile to the idea of full-page-sampling, going
pretty far beyond the "I'd need to see the math". Why?

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Greg Stark <stark(at)mit(dot)edu>, Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-09 18:45:08
Message-ID: CA+TgmoaYGgp2dSd4+8CrNCGVKfgjVbuzXQH9e=LUzX4aoL8x4w@mail.gmail.com
Lists: pgsql-hackers

On Mon, Dec 9, 2013 at 1:03 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>> I really don't believe the 5% thing. It's not enough for n_distinct
>> and it's *far* too high a value for linear properties like histograms
>> or nullfrac etc.
>
> Actually, it is enough for n_distinct, or more properly, 5% is as good
> as you can get for n_distinct unless you're going to jump to scanning
> 50% or more.

I'd like to see a proof of that result.

Not because I'm hostile to changing the algorithm, but because you've
made numerous mathematical claims on this thread that fly in the face
of what Greg, myself, and others understand to be mathematically true
- including this one. If our understanding is wrong, then by all
means let's get that fixed. But you're not going to convince anyone
here that we should rip out the existing algorithm and its
peer-reviewed journal citations by making categorical assertions about
the right way to do things.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Greg Stark <stark(at)mit(dot)edu>, Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-09 18:47:16
Message-ID: 24417.1386614836@sss.pgh.pa.us
Lists: pgsql-hackers

Josh Berkus <josh(at)agliodbs(dot)com> writes:
> Reading 5% of a 200GB table is going to be considerably faster than
> reading the whole thing, if that 5% is being scanned in a way that the
> FS understands.

Really? See the upthread point that reading one sector from each track
has just as much seek overhead as reading the whole thing. I will grant
that if you think that reading a *contiguous* 5% of the table is good
enough, you can make it faster --- but I don't believe offhand that
you can make this better without seriously compromising the randomness
of your sample. Too many tables are loaded roughly in time order, or
in other ways that make contiguous subsets nonrandom.

> You do seem kind of hostile to the idea of full-page-sampling, going
> pretty far beyond the "I'd need to see the math". Why?

I'm detecting a lot of hostility to assertions unsupported by any math.
For good reason.

regards, tom lane


From: Greg Stark <stark(at)mit(dot)edu>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-09 18:54:21
Message-ID: CAM-w4HMFAfAAUp5YDEFkRTNt7UbrxT-9ecAJUNgwv2DyC=weGg@mail.gmail.com
Lists: pgsql-hackers

On Mon, Dec 9, 2013 at 6:03 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>
> It's also applicable for the other stats; histogram buckets constructed
> from a 5% sample are more likely to be accurate than those constructed
> from a 0.1% sample. Same with nullfrac. The degree of improved
> accuracy, would, of course, require some math to determine.

This "some math" is straightforward basic statistics. The 95%
confidence interval for a sample of 300 drawn from a population of
1 million would be +/- 5.66%. A sample of 1000 would have a 95%
confidence interval of +/- 3.1%.
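
For anyone who wants to check the arithmetic, it's the standard margin
of error for an estimated proportion at 95% confidence, worst case
p = 0.5, with the (here negligible) finite population correction:

#include <math.h>
#include <stdio.h>

/* 95% margin of error for an estimated proportion, worst case p = 0.5 */
static double
margin_of_error(double sample, double population)
{
    double      p = 0.5;
    double      se = sqrt(p * (1.0 - p) / sample);
    double      fpc = sqrt((population - sample) / (population - 1.0));

    return 1.96 * se * fpc;
}

int main(void)
{
    printf("n=300:  +/- %.2f%%\n", 100.0 * margin_of_error(300, 1e6));  /* 5.66 */
    printf("n=1000: +/- %.2f%%\n", 100.0 * margin_of_error(1000, 1e6)); /* 3.10 */
    return 0;
}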

The histogram and nullfrac answer the same kind of question as a
political poll, "what fraction of the population falls within this
subset". This is why pollsters don't need to sample 15 million
Americans to have a decent poll result. That's just not how the math
works for these kinds of questions.

n_distinct is an entirely different kettle of fish. It's a different
kind of problem and the error rate there *is* going to be dependent on
the percentage of the total population that you sampled. Moreover, from
the papers I read I'm convinced any sample of less than 50-80% is nearly
useless, so you can't get good results without reading the whole table.

--
greg


From: Greg Stark <stark(at)mit(dot)edu>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-09 18:56:48
Message-ID: CAM-w4HMrZTy8F56hbGpSEyrQ9hwiuuTm1pm6amh=MgK1yKhzrg@mail.gmail.com
Lists: pgsql-hackers

On Mon, Dec 9, 2013 at 6:54 PM, Greg Stark <stark(at)mit(dot)edu> wrote:
>
> This "some math" is straightforward basic statistics. The 95%
> confidence interval for a sample of 300 drawn from a population of
> 1 million would be +/- 5.66%. A sample of 1000 would have a 95%
> confidence interval of +/- 3.1%.

Incidentally I got this using an online sample size calculator. Google
turns up several but this one seems the easiest to use:
http://www.raosoft.com/samplesize.html

--
greg


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Greg Stark <stark(at)mit(dot)edu>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-09 21:18:59
Message-ID: CAMkU=1yVmSVjSgw0qL9imbabLa7_kUYQxu9ZcdO1o357kq=_Zw@mail.gmail.com
Lists: pgsql-hackers

On Sat, Dec 7, 2013 at 11:46 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> On Tue, Dec 3, 2013 at 6:30 PM, Greg Stark <stark(at)mit(dot)edu> wrote:
> > I always gave the party line that ANALYZE only takes a small
> > constant-sized sample so even very large tables should be very quick.
> > But after hearing the same story again at Heroku I looked into it a
> > bit further. I was kind of shocked by the numbers.
> >
> > ANALYZE takes a sample of 300 * statistics_target rows. That sounds
> > pretty reasonable but with default_statistics_target set to 100 that's
> > 30,000 rows. If I'm reading the code right it takes this sample by
> > sampling 30,000 blocks and then (if the table is large enough) taking
> > an average of one row per block. Each block is 8192 bytes so that
> > means it's reading 240MB of each table. That's a lot more than I
> > realized.
>
> That is a lot. On the other hand, I question the subject line:
> sometimes, our ANALYZE sampling is not good enough. Before we raised
> the default statistics target from 10 to 100, complaints about bad
> plan choices due to insufficiently-precise statistics were legion --
> and we still have people periodically proposing to sample a fixed
> percentage of the table instead of a fixed amount of the table, even
> on large tables, which is going the opposite direction. I think this
> is because they're getting really bad n_distinct estimates, and no
> fixed-size sample can reliably give a good one.
>

I don't recall ever tracing a bad plan down to a bad n_distinct. I have
seen several that were due to bad frequency estimates in the MCV list,
because hash join planning is extremely sensitive to that. Do we have some
kind of catalog of generators of problematic data, so that changes can be
tested on known problem sets? Perhaps a wiki page to accumulate them would
be useful. For automated testing I guess the generator and query are the
easy part; the hard part is the cost settings/caching/RAM needed to trigger
the problem, and parsing and interpreting the results.

>
> More generally, I think the basic problem that people are trying to
> solve by raising the statistics target is to avoid index scans on
> gigantic tables. Obviously, there are a lot of other situations where
> inadequate statistics can cause problems, but that's a pretty
> easy-to-understand one that we do not always get right. We know that
> an equality search looking for some_column = 'some_constant', where
> some_constant is an MCV, must be more selective than a search for the
> least-frequent MCV. If you store more and more MCVs for a table,
> eventually you'll have enough that the least-frequent one is pretty
> infrequent, and then things work a lot better.
>

My reading of the code is that if a value is not in the MCV list, then it is
assumed to have the average selectivity (about 1/n_distinct, but deflating
top and bottom for the MCV list). There is also a check that it is less
frequent than the least common MCV, but I don't know why that situation would
ever prevail--that should always be higher than or equal to the average
selectivity.

>
> This is more of a problem for big tables than for small tables. MCV
> #100 can't have a frequency of greater than 1/100 = 0.01, but that's a
> lot more rows on a big table than small one. On a table with 10
> million rows we might estimate something close to 100,000 rows when
> the real number is just a handful; when the table has only 10,000
> rows, we just can't be off by as many orders of magnitude. Things
> don't always work out that badly, but in the worst case they do.
>
> Maybe there's some highly-principled statistical approach which could
> be taken here, and if so that's fine, but I suspect not. So what I
> think we should do is auto-tune the statistics target based on the
> table size. If, say, we think that the generally useful range for the
> statistics target is something like 10 to 400, then let's come up with
> a formula based on table size that outputs 10 for small tables, 400
> for really big tables, and intermediate values for tables in the
> middle.
>

I think that parts of the planner are O(N^2) in the size of the histogram (or
was that the size of the MCV list?). So we would probably need a way to use a
larger sample size to get more accurate n_distinct and MCV frequencies, but
not save the entire histogram that goes with that sample size.

Cheers,

Jeff


From: Jim Nasby <jim(at)nasby(dot)net>
To: Andres Freund <andres(at)2ndquadrant(dot)com>, Peter Geoghegan <pg(at)heroku(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, Greg Stark <stark(at)mit(dot)edu>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-09 21:20:17
Message-ID: 52A63411.7000504@nasby.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 12/6/13 3:21 AM, Andres Freund wrote:
> On 2013-12-05 17:52:34 -0800, Peter Geoghegan wrote:
>> Has anyone ever thought about opportunistic ANALYZE piggy-backing on
>> other full-table scans? That doesn't really help Greg, because his
>> complaint is mostly that a fresh ANALYZE is too expensive, but it
>> could be an interesting, albeit risky approach.
>
> What I've been thinking of is
>
> a) making it piggy back on scans vacuum is doing instead of doing
> separate ones all the time (if possible, analyze needs to be more
> frequent). Currently with quite some likelihood the cache will be gone
> again when revisiting.

FWIW, if synchronize_seqscans is on I'd think it'd be pretty easy to fire up a 2nd backend to do the ANALYZE portion (or perhaps use Robert's fancy new shared memory stuff).
--
Jim C. Nasby, Data Architect jim(at)nasby(dot)net
512.569.9461 (cell) http://jim.nasby.net


From: Jim Nasby <jim(at)nasby(dot)net>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Greg Stark <stark(at)mit(dot)edu>
Cc: Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>, Josh Berkus <josh(at)agliodbs(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-09 21:35:32
Message-ID: 52A637A4.5090801@nasby.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 12/8/13 1:49 PM, Heikki Linnakangas wrote:
> On 12/08/2013 08:14 PM, Greg Stark wrote:
>> The whole accounts table is 1.2GB and contains 10 million rows. As
>> expected with rows_per_block set to 1 it reads 240MB of that
>> containing nearly 2 million rows (and takes nearly 20s -- doing a full
>> table scan for select count(*) only takes about 5s):
>
> One simple thing we could do, without or in addition to changing the algorithm, is to issue posix_fadvise() calls for the blocks we're going to read. It should at least be possible to match the speed of a plain sequential scan that way.

Hrm... maybe it wouldn't be very hard to use async IO here either? I'm thinking it wouldn't be very hard to do the stage 2 work in the callback routine...
--
Jim C. Nasby, Data Architect jim(at)nasby(dot)net
512.569.9461 (cell) http://jim.nasby.net


From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Greg Stark <stark(at)mit(dot)edu>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-09 21:40:16
Message-ID: CAM3SWZQhLXwESXuRajUPssGSn0anvp06fqDpS3Y97cbzeC14gQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Dec 9, 2013 at 1:18 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
> I don't recall ever tracing a bad plan down to a bad n_distinct.

It does happen. I've seen it several times.

--
Peter Geoghegan


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Jim Nasby <jim(at)nasby(dot)net>
Cc: Greg Stark <stark(at)mit(dot)edu>, Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>, Josh Berkus <josh(at)agliodbs(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-09 21:47:38
Message-ID: 52A63A7A.6090606@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 12/09/2013 11:35 PM, Jim Nasby wrote:
> On 12/8/13 1:49 PM, Heikki Linnakangas wrote:
>> On 12/08/2013 08:14 PM, Greg Stark wrote:
>>> The whole accounts table is 1.2GB and contains 10 million rows. As
>>> expected with rows_per_block set to 1 it reads 240MB of that
>>> containing nearly 2 million rows (and takes nearly 20s -- doing a full
>>> table scan for select count(*) only takes about 5s):
>>
>> One simple thing we could do, without or in addition to changing the
>> algorithm, is to issue posix_fadvise() calls for the blocks we're
>> going to read. It should at least be possible to match the speed of a
>> plain sequential scan that way.
>
> Hrm... maybe it wouldn't be very hard to use async IO here either? I'm
> thinking it wouldn't be very hard to do the stage 2 work in the callback
> routine...

Yeah, other than the fact we have no infrastructure to do asynchronous
I/O anywhere in the backend. If we had that, then we could easily use it
here. I doubt it would be much better than posix_fadvising the blocks,
though.

- Heikki


From: Claudio Freire <klaussfreire(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Jim Nasby <jim(at)nasby(dot)net>, Greg Stark <stark(at)mit(dot)edu>, Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>, Josh Berkus <josh(at)agliodbs(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-09 21:56:44
Message-ID: CAGTBQpYOkmQaua6YxOuR3B5owc5LOCyU0tv4bhVVMLQU_6kFGA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Dec 9, 2013 at 6:47 PM, Heikki Linnakangas
<hlinnakangas(at)vmware(dot)com> wrote:
> On 12/09/2013 11:35 PM, Jim Nasby wrote:
>>
>> On 12/8/13 1:49 PM, Heikki Linnakangas wrote:
>>>
>>> On 12/08/2013 08:14 PM, Greg Stark wrote:
>>>>
>>>> The whole accounts table is 1.2GB and contains 10 million rows. As
>>>> expected with rows_per_block set to 1 it reads 240MB of that
>>>> containing nearly 2 million rows (and takes nearly 20s -- doing a full
>>>> table scan for select count(*) only takes about 5s):
>>>
>>>
>>> One simple thing we could do, without or in addition to changing the
>>> algorithm, is to issue posix_fadvise() calls for the blocks we're
>>> going to read. It should at least be possible to match the speed of a
>>> plain sequential scan that way.
>>
>>
>> Hrm... maybe it wouldn't be very hard to use async IO here either? I'm
>> thinking it wouldn't be very hard to do the stage 2 work in the callback
>> routine...
>
>
> Yeah, other than the fact we have no infrastructure to do asynchronous I/O
> anywhere in the backend. If we had that, then we could easily use it here. I
> doubt it would be much better than posix_fadvising the blocks, though.

Without patches to the kernel, it is much better.

posix_fadvise interferes with read-ahead, so posix_fadvise on, say,
bitmap heap scans (or similarly sorted ANALYZE block samples) runs at 1
IO per block, i.e. horrible, whereas AIO can do read coalescence and
read-ahead when the kernel thinks it'll be profitable, significantly
increasing IOPS. I've seen everything from a 2x to 10x difference.


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Greg Stark <stark(at)mit(dot)edu>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-09 22:37:54
Message-ID: CA+TgmoZdyPmwHKgsW3PuZKg87ZqiuF2MChwZYXtV7qMh8CrMMg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Dec 9, 2013 at 4:18 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
> My reading of the code is that if it is not in the MCV, then it is assumed
> to have the average selectivity (about 1/n_distinct, but deflating top and
> bottom for the MCV list). There is also a check that it is less than the
> least common of the MCV, but I don't know why that situation would ever
> prevail--that should always be higher or equal to the average selectivity.

I've never seen an n_distinct value of more than 5 digits, regardless
of reality. Typically I've seen 20-50k, even if the real number is
much higher. But the n_distinct value is only for non-MCVs, so if we
estimate the selectivity of column = 'rarevalue' to be
(1-nullfrac-mcvfrac)/n_distinct, then making mcvfrac bigger reduces
the estimate, and making the MCV list longer naturally makes mcvfrac
bigger. I'm not sure how important the
less-frequent-than-the-least-common-MCV part is, but I'm very sure
that raising the statistics target helps to solve the problem of
overestimating the prevalence of uncommon values in a very big table.
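
To make that concrete, here is a rough sketch of the estimate being
described (illustrative only, NOT the actual selfuncs.c code; the names
are made up, and the division by the non-MCV distinct count follows
Jeff's description above):

/* Frequency left over after NULLs and the MCV list is spread evenly
 * over the distinct values that didn't make it into the MCV list. */
static double
non_mcv_selectivity(double nullfrac, double mcvfrac,
                    double n_distinct, int num_mcvs)
{
    double  remaining_freq = 1.0 - nullfrac - mcvfrac;
    double  remaining_distinct = n_distinct - num_mcvs;

    if (remaining_distinct < 1.0)
        remaining_distinct = 1.0;

    /* bigger mcvfrac or bigger n_distinct => smaller estimate */
    return remaining_freq / remaining_distinct;
}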

> I think that parts of the planner are N^2 in the size of histogram (or was
> that the size of the MCV list?). So we would probably need a way to use a
> larger sample size to get more accurate n_distinct and MCV frequencies, but
> not save the entire histogram that goes with that sample size.

I think the saving the histogram part is important. As you say, the
MCVs are important for a variety of planning purposes, such as hash
joins. More than that, in my experience, people with large tables are
typically very willing to spend more planning time to get a better
plan, because mistakes are expensive and the queries are likely to run
for a while anyway. People with small tables care about planning
time, because it makes no sense to spend an extra 1ms planning a query
unless you improve the plan by enough to save at least 1ms when
executing it, and when the tables are small and access is expected to
be fast anyway that's often not the case.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Claudio Freire <klaussfreire(at)gmail(dot)com>
Cc: Jim Nasby <jim(at)nasby(dot)net>, Greg Stark <stark(at)mit(dot)edu>, Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>, Josh Berkus <josh(at)agliodbs(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-09 23:14:38
Message-ID: 52A64EDE.60105@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 12/09/2013 11:56 PM, Claudio Freire wrote:
> On Mon, Dec 9, 2013 at 6:47 PM, Heikki Linnakangas
> <hlinnakangas(at)vmware(dot)com> wrote:
>> On 12/09/2013 11:35 PM, Jim Nasby wrote:
>>>
>>> On 12/8/13 1:49 PM, Heikki Linnakangas wrote:
>>>>
>>>> On 12/08/2013 08:14 PM, Greg Stark wrote:
>>>>>
>>>>> The whole accounts table is 1.2GB and contains 10 million rows. As
>>>>> expected with rows_per_block set to 1 it reads 240MB of that
>>>>> containing nearly 2 million rows (and takes nearly 20s -- doing a full
>>>>> table scan for select count(*) only takes about 5s):
>>>>
>>>> One simple thing we could do, without or in addition to changing the
>>>> algorithm, is to issue posix_fadvise() calls for the blocks we're
>>>> going to read. It should at least be possible to match the speed of a
>>>> plain sequential scan that way.
>>>
>>> Hrm... maybe it wouldn't be very hard to use async IO here either? I'm
>>> thinking it wouldn't be very hard to do the stage 2 work in the callback
>>> routine...
>>
>> Yeah, other than the fact we have no infrastructure to do asynchronous I/O
>> anywhere in the backend. If we had that, then we could easily use it here. I
>> doubt it would be much better than posix_fadvising the blocks, though.
>
> Without patches to the kernel, it is much better.
>
> posix_fadvise interferes with read-ahead, so posix_fadvise on, say,
> bitmap heap scans (or similarly sorted analyze block samples) run at 1
> IO / block, ie horrible, whereas aio can do read coalescence and
> read-ahead when the kernel thinks it'll be profitable, significantly
> increasing IOPS. I've seen everything from a 2x to 10x difference.

How did you test that, given that we don't actually have an asynchronous
I/O implementation? I don't recall any recent patches floating around
either to do that. When Greg Stark investigated this back in 2007-2008
and implemented posix_fadvise() for bitmap heap scans, posix_fadvise
certainly gave a significant speedup on the test data he used. What kind
of a data distribution gives a slowdown like that?

I took a stab at using posix_fadvise() in ANALYZE. It turned out to be
very easy, patch attached. Your mileage may vary, but I'm seeing a nice
gain from this on my laptop. Taking a 30000 page sample of a table with
717717 pages (ie. slightly larger than RAM), ANALYZE takes about 6
seconds without the patch, and less than a second with the patch, with
effective_io_concurrency=10. If anyone with a good test data set loaded
would like to test this and post some numbers, that would be great.
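
For readers without the attachment handy, the shape of the change is
roughly this -- a sketch of the prefetching idea only, NOT the attached
analyze-prefetch-1.patch. Real ANALYZE goes through the buffer manager;
here plain pread() on a flat file stands in for that, and "blocks" is
the sorted list of sample block numbers:

#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <unistd.h>

#define BLCKSZ 8192

static void
read_sample_blocks(int fd, const long *blocks, int nblocks, int io_concurrency)
{
    char    buf[BLCKSZ];

    /* prime the prefetch window */
    for (int i = 0; i < io_concurrency && i < nblocks; i++)
        (void) posix_fadvise(fd, (off_t) blocks[i] * BLCKSZ,
                             BLCKSZ, POSIX_FADV_WILLNEED);

    for (int i = 0; i < nblocks; i++)
    {
        /* keep io_concurrency advisories outstanding ahead of the read */
        if (i + io_concurrency < nblocks)
            (void) posix_fadvise(fd, (off_t) blocks[i + io_concurrency] * BLCKSZ,
                                 BLCKSZ, POSIX_FADV_WILLNEED);

        /* read the current sample block synchronously */
        (void) pread(fd, buf, BLCKSZ, (off_t) blocks[i] * BLCKSZ);
    }
}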

- Heikki

Attachment Content-Type Size
analyze-prefetch-1.patch text/x-diff 3.7 KB

From: Claudio Freire <klaussfreire(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Jim Nasby <jim(at)nasby(dot)net>, Greg Stark <stark(at)mit(dot)edu>, Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>, Josh Berkus <josh(at)agliodbs(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-09 23:29:50
Message-ID: CAGTBQpa-+wQg6ZL_+6X3dvX=8wn0LaLQ-3WDsVNfFpo_WYwu3g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Dec 9, 2013 at 8:14 PM, Heikki Linnakangas
<hlinnakangas(at)vmware(dot)com> wrote:
> On 12/09/2013 11:56 PM, Claudio Freire wrote:
>> Without patches to the kernel, it is much better.
>>
>> posix_fadvise interferes with read-ahead, so posix_fadvise on, say,
>> bitmap heap scans (or similarly sorted analyze block samples) run at 1
>> IO / block, ie horrible, whereas aio can do read coalescence and
>> read-ahead when the kernel thinks it'll be profitable, significantly
>> increasing IOPS. I've seen everything from a 2x to 10x difference.
>
>
> How did you test that, given that we don't actually have an asynchronous I/O
> implementation? I don't recall any recent patches floating around either to
> do that. When Greg Stark investigated this back in 2007-2008 and implemented
> posix_fadvise() for bitmap heap scans, posix_fadvise certainly gave a
> significant speedup on the test data he used. What kind of a data
> distribution gives a slowdown like that?

That's basically my summarized experience from working on this[0]
patch, and from the feedback given there about competing AIO work.

Sequential I/O was the biggest issue. I had to actively avoid fadvising
on sequential (or sequential-ish) I/O, which put a huge burden on the
fadvise logic.

>
> I took a stab at using posix_fadvise() in ANALYZE. It turned out to be very
> easy, patch attached. Your mileage may vary, but I'm seeing a nice gain from
> this on my laptop. Taking a 30000 page sample of a table with 717717 pages
> (ie. slightly larger than RAM), ANALYZE takes about 6 seconds without the
> patch, and less than a second with the patch, with
> effective_io_concurrency=10. If anyone with a good test data set loaded
> would like to test this and post some numbers, that would be great.

Kernel version?

I raised this issue on LKML, and, while I got no news on this front,
they might have silently fixed it. I'd have to check the sources
again.

[0] http://www.postgresql.org/message-id/COL116-W162AEAA64173E77D4597EEA3670@phx.gbl


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Claudio Freire <klaussfreire(at)gmail(dot)com>
Cc: Jim Nasby <jim(at)nasby(dot)net>, Greg Stark <stark(at)mit(dot)edu>, Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>, Josh Berkus <josh(at)agliodbs(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-09 23:45:33
Message-ID: 2a291e05-737e-4270-80c5-4df55a57c470@email.android.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Claudio Freire <klaussfreire(at)gmail(dot)com> wrote:
>On Mon, Dec 9, 2013 at 8:14 PM, Heikki Linnakangas
><hlinnakangas(at)vmware(dot)com> wrote:
>> I took a stab at using posix_fadvise() in ANALYZE. It turned out to
>be very
>> easy, patch attached. Your mileage may vary, but I'm seeing a nice
>gain from
>> this on my laptop. Taking a 30000 page sample of a table with 717717
>pages
>> (ie. slightly larger than RAM), ANALYZE takes about 6 seconds without
>the
>> patch, and less than a second with the patch, with
>> effective_io_concurrency=10. If anyone with a good test data set
>loaded
>> would like to test this and post some numbers, that would be great.
>
>Kernel version?

3.12, from Debian experimental. With an ssd drive and btrfs filesystem. Admittedly not your average database server setup, so it would be nice to get more reports from others.

>I raised this issue on LKML, and, while I got no news on this front,
>they might have silently fixed it. I'd have to check the sources
>again.

Maybe. Or maybe the heuristic read-ahead isn't significant/helpful when you're prefetching with posix_fadvise anyway.

- Heikki


From: Claudio Freire <klaussfreire(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Jim Nasby <jim(at)nasby(dot)net>, Greg Stark <stark(at)mit(dot)edu>, Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>, Josh Berkus <josh(at)agliodbs(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-09 23:46:38
Message-ID: CAGTBQpbgcTOGwh0SbyV0mpJq3RDq+6YzCBqZ8Wsr6NkNG1CpSw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Dec 9, 2013 at 8:45 PM, Heikki Linnakangas
<hlinnakangas(at)vmware(dot)com> wrote:
> Claudio Freire <klaussfreire(at)gmail(dot)com> wrote:
>>On Mon, Dec 9, 2013 at 8:14 PM, Heikki Linnakangas
>><hlinnakangas(at)vmware(dot)com> wrote:
>>> I took a stab at using posix_fadvise() in ANALYZE. It turned out to
>>be very
>>> easy, patch attached. Your mileage may vary, but I'm seeing a nice
>>gain from
>>> this on my laptop. Taking a 30000 page sample of a table with 717717
>>pages
>>> (ie. slightly larger than RAM), ANALYZE takes about 6 seconds without
>>the
>>> patch, and less than a second with the patch, with
>>> effective_io_concurrency=10. If anyone with a good test data set
>>loaded
>>> would like to test this and post some numbers, that would be great.
>>
>>Kernel version?
>
> 3.12, from Debian experimental. With an ssd drive and btrfs filesystem. Admittedly not your average database server setup, so it would be nice to get more reports from others.

Yeah, read-ahead isn't relevant for SSD.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Claudio Freire <klaussfreire(at)gmail(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Greg Stark <stark(at)mit(dot)edu>, Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>, Josh Berkus <josh(at)agliodbs(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-09 23:52:35
Message-ID: 30686.1386633155@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Heikki Linnakangas <hlinnakangas(at)vmware(dot)com> writes:
> Maybe. Or maybe the heuristic read ahead isn't significant/helpful, when you're prefetching with posix_fadvise anyway.

Yeah. If we're not reading consecutive blocks, readahead is unlikely
to do anything anyhow.

Claudio's comments do suggest that it might be a bad idea to issue a
posix_fadvise when the next block to be examined *is* adjacent to the
current one, though.
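
Something like the following, presumably -- a sketch of that tweak only,
with illustrative names rather than anything from the patch:

#include <fcntl.h>

/* Skip the advisory when the block to prefetch immediately follows the
 * previously requested one, and let plain kernel readahead handle the
 * consecutive run. */
static void
maybe_prefetch_block(int fd, long blockno, long *prev_blockno, long blcksz)
{
    if (blockno != *prev_blockno + 1)
        (void) posix_fadvise(fd, (off_t) blockno * blcksz,
                             blcksz, POSIX_FADV_WILLNEED);
    *prev_blockno = blockno;
}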

regards, tom lane


From: Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Claudio Freire <klaussfreire(at)gmail(dot)com>
Cc: Jim Nasby <jim(at)nasby(dot)net>, Greg Stark <stark(at)mit(dot)edu>, Josh Berkus <josh(at)agliodbs(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-10 00:14:00
Message-ID: 52A65CC8.4070805@catalyst.net.nz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 10/12/13 12:14, Heikki Linnakangas wrote:
>
>
> I took a stab at using posix_fadvise() in ANALYZE. It turned out to be
> very easy, patch attached. Your mileage may vary, but I'm seeing a
> nice gain from this on my laptop. Taking a 30000 page sample of a
> table with 717717 pages (ie. slightly larger than RAM), ANALYZE takes
> about 6 seconds without the patch, and less than a second with the
> patch, with effective_io_concurrency=10. If anyone with a good test
> data set loaded would like to test this and post some numbers, that
> would be great.
>
>

I did a test run:

pgbench scale 2000 (pgbench_accounts approx 25GB).
postgres 9.4

i7 3.5Ghz Cpu
16GB Ram
500 GB Velociraptor 10K

(cold os and pg cache both runs)
Without patch: ANALYZE pgbench_accounts 90s
With patch: ANALYZE pgbench_accounts 91s

So I'm essentially seeing no difference :-(

Regards

Mark


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>
Cc: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Claudio Freire <klaussfreire(at)gmail(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Greg Stark <stark(at)mit(dot)edu>, Josh Berkus <josh(at)agliodbs(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-10 00:17:09
Message-ID: 31233.1386634629@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz> writes:
> I did a test run:

> pgbench scale 2000 (pgbench_accounts approx 25GB).
> postgres 9.4

> i7 3.5Ghz Cpu
> 16GB Ram
> 500 GB Velociraptor 10K

> (cold os and pg cache both runs)
> Without patch: ANALYZE pgbench_accounts 90s
> With patch: ANALYZE pgbench_accounts 91s

> So I'm essentially seeing no difference :-(

What OS and filesystem?

regards, tom lane


From: Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Claudio Freire <klaussfreire(at)gmail(dot)com>
Cc: Jim Nasby <jim(at)nasby(dot)net>, Greg Stark <stark(at)mit(dot)edu>, Josh Berkus <josh(at)agliodbs(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-10 00:20:47
Message-ID: 52A65E5F.7050509@catalyst.net.nz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 10/12/13 13:14, Mark Kirkwood wrote:
> On 10/12/13 12:14, Heikki Linnakangas wrote:
>>
>>
>> I took a stab at using posix_fadvise() in ANALYZE. It turned out to
>> be very easy, patch attached. Your mileage may vary, but I'm seeing a
>> nice gain from this on my laptop. Taking a 30000 page sample of a
>> table with 717717 pages (ie. slightly larger than RAM), ANALYZE takes
>> about 6 seconds without the patch, and less than a second with the
>> patch, with effective_io_concurrency=10. If anyone with a good test
>> data set loaded would like to test this and post some numbers, that
>> would be great.
>>
>>
>
>
>
> I did a test run:
>
> pgbench scale 2000 (pgbench_accounts approx 25GB).
> postgres 9.4
>
> i7 3.5Ghz Cpu
> 16GB Ram
> 500 GB Velociraptor 10K
>
> (cold os and pg cache both runs)
> Without patch: ANALYZE pgbench_accounts 90s
> With patch: ANALYZE pgbench_accounts 91s
>
> So I'm essentially seeing no difference :-(

Arrg - sorry forgot the important bits:

Ubuntu 13.10 (kernel 3.11.0-14)
filesystem is ext4


From: Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>
To:
Cc: Jim Nasby <jim(at)nasby(dot)net>, Greg Stark <stark(at)mit(dot)edu>, Josh Berkus <josh(at)agliodbs(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-10 00:53:19
Message-ID: 52A665FF.1060500@catalyst.net.nz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 10/12/13 13:20, Mark Kirkwood wrote:
> On 10/12/13 13:14, Mark Kirkwood wrote:
>> On 10/12/13 12:14, Heikki Linnakangas wrote:
>>>
>>>
>>> I took a stab at using posix_fadvise() in ANALYZE. It turned out to
>>> be very easy, patch attached. Your mileage may vary, but I'm seeing
>>> a nice gain from this on my laptop. Taking a 30000 page sample of a
>>> table with 717717 pages (ie. slightly larger than RAM), ANALYZE
>>> takes about 6 seconds without the patch, and less than a second with
>>> the patch, with effective_io_concurrency=10. If anyone with a good
>>> test data set loaded would like to test this and post some numbers,
>>> that would be great.
>>>
>>>
>>
>>
>>
>> I did a test run:
>>
>> pgbench scale 2000 (pgbench_accounts approx 25GB).
>> postgres 9.4
>>
>> i7 3.5Ghz Cpu
>> 16GB Ram
>> 500 GB Velociraptor 10K
>>
>> (cold os and pg cache both runs)
>> Without patch: ANALYZE pgbench_accounts 90s
>> With patch: ANALYZE pgbench_accounts 91s
>>
>> So I'm essentially seeing no difference :-(
>
>
> Arrg - sorry forgot the important bits:
>
> Ubuntu 13.10 (kernel 3.11.0-14)
> filesystem is ext4
>
>
>

Doing the same test as above, but on a 80GB Intel 520 (ext4 filesystem
mounted with discard):

(cold os and pg cache both runs)
Without patch: ANALYZE pgbench_accounts 5s
With patch: ANALYZE pgbench_accounts 5s


From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Peter Geoghegan <pg(at)heroku(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Greg Stark <stark(at)mit(dot)edu>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-10 01:00:40
Message-ID: 52A667B8.3000001@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 12/06/2013 09:52 AM, Peter Geoghegan wrote:
> Has anyone ever thought about opportunistic ANALYZE piggy-backing on
> other full-table scans? That doesn't really help Greg, because his
> complaint is mostly that a fresh ANALYZE is too expensive, but it
> could be an interesting, albeit risky approach.

It'd be particularly interesting, IMO, if autovacuum was able to do it
using the synchronized scans machinery, so the analyze still ran in a
separate proc and didn't impact the user's query. Have an autovac worker
or two waiting for new scans to start on tables they need to analyze,
and if one doesn't start within a reasonable time frame they start their
own scan.

We've seen enough issues with hint-bit setting causing unpredictable
query times for user queries that I wouldn't want to add another "might
write during a read-only query" behaviour.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Jim Nasby <jim(at)nasby(dot)net>, Andres Freund <andres(at)2ndquadrant(dot)com>, Peter Geoghegan <pg(at)heroku(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, Greg Stark <stark(at)mit(dot)edu>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-10 01:01:45
Message-ID: 52A667F9.9070604@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 12/10/2013 05:20 AM, Jim Nasby wrote:
>>
>
> FWIW, if synchronize_seqscans is on I'd think it'd be pretty easy to
> fire up a 2nd backend to do the ANALYZE portion (or perhaps use Robert's
> fancy new shared memory stuff).

Apologies for posting the same as a new idea before I saw your post. I
need to finish the thread before posting.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Claudio Freire <klaussfreire(at)gmail(dot)com>
Cc: Jim Nasby <jim(at)nasby(dot)net>, Greg Stark <stark(at)mit(dot)edu>, Josh Berkus <josh(at)agliodbs(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-10 02:04:34
Message-ID: 52A676B2.9040000@catalyst.net.nz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 10/12/13 13:53, Mark Kirkwood wrote:
> On 10/12/13 13:20, Mark Kirkwood wrote:
>> On 10/12/13 13:14, Mark Kirkwood wrote:
>>> On 10/12/13 12:14, Heikki Linnakangas wrote:
>>>>
>>>>
>>>> I took a stab at using posix_fadvise() in ANALYZE. It turned out to
>>>> be very easy, patch attached. Your mileage may vary, but I'm seeing
>>>> a nice gain from this on my laptop. Taking a 30000 page sample of a
>>>> table with 717717 pages (ie. slightly larger than RAM), ANALYZE
>>>> takes about 6 seconds without the patch, and less than a second
>>>> with the patch, with effective_io_concurrency=10. If anyone with a
>>>> good test data set loaded would like to test this and post some
>>>> numbers, that would be great.
>>>>
>>>>
>>>
>>>
>>>
>>> I did a test run:
>>>
>>> pgbench scale 2000 (pgbench_accounts approx 25GB).
>>> postgres 9.4
>>>
>>> i7 3.5Ghz Cpu
>>> 16GB Ram
>>> 500 GB Velociraptor 10K
>>>
>>> (cold os and pg cache both runs)
>>> Without patch: ANALYZE pgbench_accounts 90s
>>> With patch: ANALYZE pgbench_accounts 91s
>>>
>>> So I'm essentially seeing no difference :-(
>>
>>
>> Arrg - sorry forgot the important bits:
>>
>> Ubuntu 13.10 (kernel 3.11.0-14)
>> filesystem is ext4
>>
>>
>>
>
> Doing the same test as above, but on a 80GB Intel 520 (ext4 filesystem
> mounted with discard):
>
> (cold os and pg cache both runs)
> Without patch: ANALYZE pgbench_accounts 5s
> With patch: ANALYZE pgbench_accounts 5s
>
>
>
>
>

Redoing the filesystem on the 520 as btrfs didn't seem to make any
difference either:

(cold os and pg cache both runs)
Without patch: ANALYZE pgbench_accounts 6.4s
With patch: ANALYZE pgbench_accounts 6.4s


From: Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Claudio Freire <klaussfreire(at)gmail(dot)com>
Cc: Jim Nasby <jim(at)nasby(dot)net>, Greg Stark <stark(at)mit(dot)edu>, Josh Berkus <josh(at)agliodbs(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-10 02:11:13
Message-ID: 52A67841.6020401@catalyst.net.nz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 10/12/13 15:04, Mark Kirkwood wrote:
> On 10/12/13 13:53, Mark Kirkwood wrote:
>> On 10/12/13 13:20, Mark Kirkwood wrote:
>>> On 10/12/13 13:14, Mark Kirkwood wrote:
>>>> On 10/12/13 12:14, Heikki Linnakangas wrote:
>>>>>
>>>>>
>>>>> I took a stab at using posix_fadvise() in ANALYZE. It turned out
>>>>> to be very easy, patch attached. Your mileage may vary, but I'm
>>>>> seeing a nice gain from this on my laptop. Taking a 30000 page
>>>>> sample of a table with 717717 pages (ie. slightly larger than
>>>>> RAM), ANALYZE takes about 6 seconds without the patch, and less
>>>>> than a second with the patch, with effective_io_concurrency=10. If
>>>>> anyone with a good test data set loaded would like to test this
>>>>> and post some numbers, that would be great.
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>> I did a test run:
>>>>
>>>> pgbench scale 2000 (pgbench_accounts approx 25GB).
>>>> postgres 9.4
>>>>
>>>> i7 3.5Ghz Cpu
>>>> 16GB Ram
>>>> 500 GB Velociraptor 10K
>>>>
>>>> (cold os and pg cache both runs)
>>>> Without patch: ANALYZE pgbench_accounts 90s
>>>> With patch: ANALYZE pgbench_accounts 91s
>>>>
>>>> So I'm essentially seeing no difference :-(
>>>
>>>
>>> Arrg - sorry forgot the important bits:
>>>
>>> Ubuntu 13.10 (kernel 3.11.0-14)
>>> filesystem is ext4
>>>
>>>
>>>
>>
>> Doing the same test as above, but on a 80GB Intel 520 (ext4
>> filesystem mounted with discard):
>>
>> (cold os and pg cache both runs)
>> Without patch: ANALYZE pgbench_accounts 5s
>> With patch: ANALYZE pgbench_accounts 5s
>>
>>
>>
>>
>>
>
> Redoing the filesystem on the 520 as btrfs didn't seem to make any
> difference either:
>
> (cold os and pg cache both runs)
> Without patch: ANALYZE pgbench_accounts 6.4s
> With patch: ANALYZE pgbench_accounts 6.4s
>
>
>

Ah - I have just realized I was not setting effective_io_concurrency -
so I'll redo the test. - Apologies.

Regards

Mark


From: Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Claudio Freire <klaussfreire(at)gmail(dot)com>
Cc: Jim Nasby <jim(at)nasby(dot)net>, Greg Stark <stark(at)mit(dot)edu>, Josh Berkus <josh(at)agliodbs(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-10 02:17:56
Message-ID: 52A679D4.6060504@catalyst.net.nz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 10/12/13 15:11, Mark Kirkwood wrote:
> On 10/12/13 15:04, Mark Kirkwood wrote:
>> On 10/12/13 13:53, Mark Kirkwood wrote:
>>> On 10/12/13 13:20, Mark Kirkwood wrote:
>>>> On 10/12/13 13:14, Mark Kirkwood wrote:
>>>>> On 10/12/13 12:14, Heikki Linnakangas wrote:
>>>>>>
>>>>>>
>>>>>> I took a stab at using posix_fadvise() in ANALYZE. It turned out
>>>>>> to be very easy, patch attached. Your mileage may vary, but I'm
>>>>>> seeing a nice gain from this on my laptop. Taking a 30000 page
>>>>>> sample of a table with 717717 pages (ie. slightly larger than
>>>>>> RAM), ANALYZE takes about 6 seconds without the patch, and less
>>>>>> than a second with the patch, with effective_io_concurrency=10.
>>>>>> If anyone with a good test data set loaded would like to test
>>>>>> this and post some numbers, that would be great.
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> I did a test run:
>>>>>
>>>>> pgbench scale 2000 (pgbench_accounts approx 25GB).
>>>>> postgres 9.4
>>>>>
>>>>> i7 3.5Ghz Cpu
>>>>> 16GB Ram
>>>>> 500 GB Velociraptor 10K
>>>>>
>>>>> (cold os and pg cache both runs)
>>>>> Without patch: ANALYZE pgbench_accounts 90s
>>>>> With patch: ANALYZE pgbench_accounts 91s
>>>>>
>>>>> So I'm essentially seeing no difference :-(
>>>>
>>>>
>>>> Arrg - sorry forgot the important bits:
>>>>
>>>> Ubuntu 13.10 (kernel 3.11.0-14)
>>>> filesystem is ext4
>>>>
>>>>
>>>>
>>>
>>> Doing the same test as above, but on a 80GB Intel 520 (ext4
>>> filesystem mounted with discard):
>>>
>>> (cold os and pg cache both runs)
>>> Without patch: ANALYZE pgbench_accounts 5s
>>> With patch: ANALYZE pgbench_accounts 5s
>>>
>>>
>>>
>>>
>>>
>>
>> Redoing the filesystem on the 520 as btrfs didn't seem to make any
>> difference either:
>>
>> (cold os and pg cache both runs)
>> Without patch: ANALYZE pgbench_accounts 6.4s
>> With patch: ANALYZE pgbench_accounts 6.4s
>>
>>
>>
>
> Ah - I have just realized I was not setting effective_io_concurrency -
> so I'll redo the test. - Apologies.
>
>

Redoing the test on the velociraptor gives me exactly the same numbers
as before (effective_io_concurrency = 10 instead of 1).

Cheers

Mark


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Greg Stark <stark(at)mit(dot)edu>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-10 02:22:14
Message-ID: 52A67AD6.6050302@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 12/09/2013 02:37 PM, Robert Haas wrote:
> I've never seen an n_distinct value of more than 5 digits, regardless
> of reality. Typically I've seen 20-50k, even if the real number is
> much higher. But the n_distinct value is only for non-MCVs, so if we
> estimate the selectivity of column = 'rarevalue' to be
> (1-nullfrac-mcvfrac)/n_distinct, then making mcvfrac bigger reduces
> the estimate, and making the MCV list longer naturally makes mcvfrac
> bigger. I'm not sure how important the
> less-frequent-than-the-least-common-MCV part is, but I'm very sure
> that raising the statistics target helps to solve the problem of
> overestimating the prevalence of uncommon values in a very big table.

I did an analysis of our ndistinct algorithm several years ago (around
8.1), and to sum up:

1. We take far too small a sample to estimate ndistinct well for
tables larger than 100,000 rows.

2. The estimation algo we have chosen tends to be wrong in the downwards
direction, rather strongly so. That is, if the sample is consistent with
an ndistinct anywhere from 1,000 to 100,000, our algo estimates 1,500 to
3,000 (a sketch of the estimator itself is below).

3. Other algos exist. They tend to be wrong in other directions.

4. Nobody has done an analysis of whether it's worse, on average, to
estimate low vs. high for ndistinct.
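
For reference, if memory serves the estimator in question is the
Haas/Stokes "Duj1" estimator; roughly this (a sketch from memory --
check the comments in analyze.c for the real thing):

/* n = sample rows, N = total rows, d = distinct values seen in the
 * sample, f1 = values seen exactly once in the sample. */
static double
estimate_ndistinct(double n, double N, double d, double f1)
{
    double  denom = n - f1 + f1 * n / N;

    if (denom <= 0)
        return d;               /* degenerate sample; just use what we saw */
    return n * d / denom;
}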

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Claudio Freire <klaussfreire(at)gmail(dot)com>
Cc: Jim Nasby <jim(at)nasby(dot)net>, Greg Stark <stark(at)mit(dot)edu>, Josh Berkus <josh(at)agliodbs(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-10 02:32:03
Message-ID: 52A67D23.5090505@catalyst.net.nz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 10/12/13 15:17, Mark Kirkwood wrote:
> On 10/12/13 15:11, Mark Kirkwood wrote:
>> On 10/12/13 15:04, Mark Kirkwood wrote:
>>> On 10/12/13 13:53, Mark Kirkwood wrote:
>>>> On 10/12/13 13:20, Mark Kirkwood wrote:
>>>>> On 10/12/13 13:14, Mark Kirkwood wrote:
>>>>>> On 10/12/13 12:14, Heikki Linnakangas wrote:
>>>>>>>
>>>>>>>
>>>>>>> I took a stab at using posix_fadvise() in ANALYZE. It turned out
>>>>>>> to be very easy, patch attached. Your mileage may vary, but I'm
>>>>>>> seeing a nice gain from this on my laptop. Taking a 30000 page
>>>>>>> sample of a table with 717717 pages (ie. slightly larger than
>>>>>>> RAM), ANALYZE takes about 6 seconds without the patch, and less
>>>>>>> than a second with the patch, with effective_io_concurrency=10.
>>>>>>> If anyone with a good test data set loaded would like to test
>>>>>>> this and post some numbers, that would be great.
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> I did a test run:
>>>>>>
>>>>>> pgbench scale 2000 (pgbench_accounts approx 25GB).
>>>>>> postgres 9.4
>>>>>>
>>>>>> i7 3.5Ghz Cpu
>>>>>> 16GB Ram
>>>>>> 500 GB Velociraptor 10K
>>>>>>
>>>>>> (cold os and pg cache both runs)
>>>>>> Without patch: ANALYZE pgbench_accounts 90s
>>>>>> With patch: ANALYZE pgbench_accounts 91s
>>>>>>
>>>>>> So I'm essentially seeing no difference :-(
>>>>>
>>>>>
>>>>> Arrg - sorry forgot the important bits:
>>>>>
>>>>> Ubuntu 13.10 (kernel 3.11.0-14)
>>>>> filesystem is ext4
>>>>>
>>>>>
>>>>>
>>>>
>>>> Doing the same test as above, but on a 80GB Intel 520 (ext4
>>>> filesystem mounted with discard):
>>>>
>>>> (cold os and pg cache both runs)
>>>> Without patch: ANALYZE pgbench_accounts 5s
>>>> With patch: ANALYZE pgbench_accounts 5s
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>> Redoing the filesystem on the 520 as btrfs didn't seem to make any
>>> difference either:
>>>
>>> (cold os and pg cache both runs)
>>> Without patch: ANALYZE pgbench_accounts 6.4s
>>> With patch: ANALYZE pgbench_accounts 6.4s
>>>
>>>
>>>
>>
>> Ah - I have just realized I was not setting effective_io_concurrency
>> - so I'll redo the test. - Apologies.
>>
>>
>
> Redoing the test on the velociraptor gives me exactly the same numbers
> as before (effective_io_concurrency = 10 instead of 1).
>
>

Good grief - repeating the test gave:

Without patch: ANALYZE pgbench_accounts 90s
With patch: ANALYZE pgbench_accounts 42s

pretty consistent *now*. No idea what was going on in the 1st run (maybe
I happened to have it running at the same time as a checkpoint)? Anyway
will stop now before creating more confusion.

Regards

Mark


From: Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Claudio Freire <klaussfreire(at)gmail(dot)com>
Cc: Jim Nasby <jim(at)nasby(dot)net>, Greg Stark <stark(at)mit(dot)edu>, Josh Berkus <josh(at)agliodbs(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-10 03:13:17
Message-ID: 52A686CD.6050104@catalyst.net.nz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 10/12/13 15:32, Mark Kirkwood wrote:
> On 10/12/13 15:17, Mark Kirkwood wrote:
>> On 10/12/13 15:11, Mark Kirkwood wrote:
>>> On 10/12/13 15:04, Mark Kirkwood wrote:
>>>> On 10/12/13 13:53, Mark Kirkwood wrote:
>>>>> On 10/12/13 13:20, Mark Kirkwood wrote:
>>>>>> On 10/12/13 13:14, Mark Kirkwood wrote:
>>>>>>> On 10/12/13 12:14, Heikki Linnakangas wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> I took a stab at using posix_fadvise() in ANALYZE. It turned
>>>>>>>> out to be very easy, patch attached. Your mileage may vary, but
>>>>>>>> I'm seeing a nice gain from this on my laptop. Taking a 30000
>>>>>>>> page sample of a table with 717717 pages (ie. slightly larger
>>>>>>>> than RAM), ANALYZE takes about 6 seconds without the patch, and
>>>>>>>> less than a second with the patch, with
>>>>>>>> effective_io_concurrency=10. If anyone with a good test data
>>>>>>>> set loaded would like to test this and post some numbers, that
>>>>>>>> would be great.
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> I did a test run:
>>>>>>>
>>>>>>> pgbench scale 2000 (pgbench_accounts approx 25GB).
>>>>>>> postgres 9.4
>>>>>>>
>>>>>>> i7 3.5Ghz Cpu
>>>>>>> 16GB Ram
>>>>>>> 500 GB Velociraptor 10K
>>>>>>>
>>>>>>> (cold os and pg cache both runs)
>>>>>>> Without patch: ANALYZE pgbench_accounts 90s
>>>>>>> With patch: ANALYZE pgbench_accounts 91s
>>>>>>>
>>>>>>> So I'm essentially seeing no difference :-(
>>>>>>
>>>>>>
>>>>>> Arrg - sorry forgot the important bits:
>>>>>>
>>>>>> Ubuntu 13.10 (kernel 3.11.0-14)
>>>>>> filesystem is ext4
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>> Doing the same test as above, but on a 80GB Intel 520 (ext4
>>>>> filesystem mounted with discard):
>>>>>
>>>>> (cold os and pg cache both runs)
>>>>> Without patch: ANALYZE pgbench_accounts 5s
>>>>> With patch: ANALYZE pgbench_accounts 5s
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>> Redoing the filesystem on the 520 as btrfs didn't seem to make any
>>>> difference either:
>>>>
>>>> (cold os and pg cache both runs)
>>>> Without patch: ANALYZE pgbench_accounts 6.4s
>>>> With patch: ANALYZE pgbench_accounts 6.4s
>>>>
>>>>
>>>>
>>>
>>> Ah - I have just realized I was not setting effective_io_concurrency
>>> - so I'll redo the test. - Apologies.
>>>
>>>
>>
>> Redoing the test on the velociraptor gives me exactly the same
>> numbers as before (effective_io_concurrency = 10 instead of 1).
>>
>>
>
> Good grief - repeating the test gave:
>
> Without patch: ANALYZE pgbench_accounts 90s
> With patch: ANALYZE pgbench_accounts 42s
>
> pretty consistent *now*. No idea what was going on in the 1st run
> (maybe I happened to have it running at the same time as a
> checkpoint)? Anyway will stop now before creating more confusion.
>
>

Just one more...

The Intel 520 with ext4:

Without patch: ANALYZE pgbench_accounts 5s
With patch: ANALYZE pgbench_accounts 1s

And double checking -
With patch, but effective_io_concurrency = 1: ANALYZE pgbench_accounts 5s

These results look more like Heikki's. Which suggests more benefit on
SSD than spinning disks. Some more data points (apart from mine) would
be good to see tho.

Cheers

Mark


From: Claudio Freire <klaussfreire(at)gmail(dot)com>
To: Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>
Cc: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Greg Stark <stark(at)mit(dot)edu>, Josh Berkus <josh(at)agliodbs(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-10 04:18:06
Message-ID: CAGTBQpYRz7z_Lhf3T9BrQKJ12sZA9fJL78D7gDfPTvKCjELUUQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Dec 10, 2013 at 12:13 AM, Mark Kirkwood
<mark(dot)kirkwood(at)catalyst(dot)net(dot)nz> wrote:
> Just one more...
>
> The Intel 520 with ext4:
>
>
> Without patch: ANALYZE pgbench_accounts 5s
> With patch: ANALYZE pgbench_accounts 1s
>
> And double checking -
> With patch, but effective_io_concurrency = 1: ANALYZE pgbench_accounts 5s
>
> These results look more like Heikki's. Which suggests more benefit on SSD
> than spinning disks. Some more data points (apart from mine) would be good
> to see tho.

Assuming ANALYZE is sampling less than 5% of the table, I'd say
fadvising will always be a win.

I'd also suggest a higher e_i_c for rotating media. Rotating media has
longer latencies, and e_i_c has to be computed in terms of latency,
rather than "spindles", when doing prefetch.

For backward index scans, I found the optimum number for a single
spindle to be about 20.


From: Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>
To: Claudio Freire <klaussfreire(at)gmail(dot)com>
Cc: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Greg Stark <stark(at)mit(dot)edu>, Josh Berkus <josh(at)agliodbs(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-10 05:19:14
Message-ID: 52A6A452.6070801@catalyst.net.nz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 10/12/13 17:18, Claudio Freire wrote:
> On Tue, Dec 10, 2013 at 12:13 AM, Mark Kirkwood
> <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz> wrote:
>> Just one more...
>>
>> The Intel 520 with ext4:
>>
>>
>> Without patch: ANALYZE pgbench_accounts 5s
>> With patch: ANALYZE pgbench_accounts 1s
>>
>> And double checking -
>> With patch, but effective_io_concurrency = 1: ANALYZE pgbench_accounts 5s
>>
>> These results look more like Heikki's. Which suggests more benefit on SSD
>> than spinning disks. Some more data points (apart from mine) would be good
>> to see tho.
>
> Assuming ANALYZE is sampling less than 5% of the table, I'd say
> fadvising will always be a win.
>
> I'd also suggest higher e_i_c for rotating media. Rotating media has
> longer latencias, and e_i_c has to be computed in terms of latency,
> rather than "spindles" when doing prefetch.
>
> For backward index scans, I found the optimum number for a single
> spindle to be about 20.

Yeah - was just fooling about with this on the velociraptor, looks like
somewhere in the 20's works well.

                   | pgbench_accounts
eff_io_concurrency | analyze time (s)
-------------------+-----------------
                 8 | 43
                16 | 40
                24 | 36
                32 | 35
                64 | 35

Cheers

Mark


From: Albe Laurenz <laurenz(dot)albe(at)wien(dot)gv(dot)at>
To: "Greg Stark *EXTERN*" <stark(at)mit(dot)edu>, Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-10 08:28:23
Message-ID: A737B7A37273E048B164557ADEF4A58B17C7DCEF@ntex2010i.host.magwien.gv.at
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Greg Stark wrote:
>> It's also applicable for the other stats; histogram buckets constructed
>> from a 5% sample are more likely to be accurate than those constructed
>> from a 0.1% sample. Same with nullfrac. The degree of improved
>> accuracy, would, of course, require some math to determine.
>
> This "some math" is straightforward basic statistics. The 95th
> percentile confidence interval for a sample consisting of 300 samples
> from a population of a 1 million would be 5.66%. A sample consisting
> of 1000 samples would have a 95th percentile confidence interval of
> +/- 3.1%.

Doesn't all that assume a normally distributed random variable?

I don't think it can be applied to database table contents
without further analysis.

Yours,
Laurenz Albe


From: Greg Stark <stark(at)mit(dot)edu>
To: Albe Laurenz <laurenz(dot)albe(at)wien(dot)gv(dot)at>
Cc: Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-10 14:02:35
Message-ID: CAM-w4HPdMtPiS9A3cj1gfO0vjUT4eK3KXyKpv8FLrnv1cN529w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 10 Dec 2013 08:28, "Albe Laurenz" <laurenz(dot)albe(at)wien(dot)gv(dot)at> wrote:
>
>
> Doesn't all that assume a normally distributed random variable?

I don't think so, because of the law of large numbers. If you have a large
population and sample it, the sample behaves like a normal distribution even
if the distribution of the population isn't.


From: Albe Laurenz <laurenz(dot)albe(at)wien(dot)gv(dot)at>
To: "Greg Stark *EXTERN*" <stark(at)mit(dot)edu>
Cc: Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, "Josh Berkus" <josh(at)agliodbs(dot)com>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-10 14:31:31
Message-ID: A737B7A37273E048B164557ADEF4A58B17C7E09B@ntex2010i.host.magwien.gv.at
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Greg Stark wrote:
>> Doesn't all that assume a normally distributed random variable?

> I don't think so because of the law of large numbers. If you have a large population and sample it the
> sample behaves like a normal distribution when if the distribution of the population isn't.

Statistics is the part of mathematics I know least of, but aren't
you saying that in a large enough sample of people there will
always be some with age < 0 (which is what a normal distribution
would imply)?

Yours,
Laurenz Albe


From: Claudio Freire <klaussfreire(at)gmail(dot)com>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: Albe Laurenz <laurenz(dot)albe(at)wien(dot)gv(dot)at>, Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-10 14:32:14
Message-ID: CAGTBQpbnhsc7h4fCHBG63kSYt3-DmyeTZ-QjGf30dN-xacrK4Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Dec 10, 2013 at 11:02 AM, Greg Stark <stark(at)mit(dot)edu> wrote:
>
> On 10 Dec 2013 08:28, "Albe Laurenz" <laurenz(dot)albe(at)wien(dot)gv(dot)at> wrote:
>>
>>
>> Doesn't all that assume a normally distributed random variable?
>
> I don't think so because of the law of large numbers. If you have a large
> population and sample it the sample behaves like a normal distribution when
> if the distribution of the population isn't.

No, the large population says that if you have an AVERAGE of many
samples of a random variable, the random variable that is the AVERAGE
behaves like a normal.

The variable itself doesn't.

And for n_distinct, you need to know the variable itself.


From: Claudio Freire <klaussfreire(at)gmail(dot)com>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: Albe Laurenz <laurenz(dot)albe(at)wien(dot)gv(dot)at>, Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-10 14:32:44
Message-ID: CAGTBQpZQROubRkV=3VQJXt629xj4C9pQ+GnYJOjfFm7ZTs+Bkg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Dec 10, 2013 at 11:32 AM, Claudio Freire <klaussfreire(at)gmail(dot)com> wrote:
> On Tue, Dec 10, 2013 at 11:02 AM, Greg Stark <stark(at)mit(dot)edu> wrote:
>>
>> On 10 Dec 2013 08:28, "Albe Laurenz" <laurenz(dot)albe(at)wien(dot)gv(dot)at> wrote:
>>>
>>>
>>> Doesn't all that assume a normally distributed random variable?
>>
>> I don't think so because of the law of large numbers. If you have a large
>> population and sample it the sample behaves like a normal distribution when
>> if the distribution of the population isn't.
>
>
> No, the large population says that if you have an AVERAGE of many
> samples of a random variable, the random variable that is the AVERAGE
> behaves like a normal.

Sorry, that should be "No, the *law of large numbers* says..."


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Peter Geoghegan <pg(at)heroku(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Greg Stark <stark(at)mit(dot)edu>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-10 19:23:37
Message-ID: CA+U5nMJP0r-qkMaE_JNeEq4akgbZtWaLwkabYY_YoxY8w1u=6Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 6 December 2013 09:21, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> On 2013-12-05 17:52:34 -0800, Peter Geoghegan wrote:
>> Has anyone ever thought about opportunistic ANALYZE piggy-backing on
>> other full-table scans? That doesn't really help Greg, because his
>> complaint is mostly that a fresh ANALYZE is too expensive, but it
>> could be an interesting, albeit risky approach.
>
> What I've been thinking of is
>
> a) making it piggy back on scans vacuum is doing instead of doing
> separate ones all the time (if possible, analyze needs to be more
> frequent). Currently with quite some likelihood the cache will be gone
> again when revisiting.

> b) make analyze incremental. In lots of bigger tables most of the table
> is static - and we actually *do* know that, thanks to the vm. So keep a
> rawer form of what ends in the catalogs around somewhere, chunked by the
> region of the table the statistic is from. Everytime a part of the table
> changes, re-sample only that part. Then recompute the aggregate.

Piggy-backing sounds like a bad thing. If I run a query, I don't want
to be given some extra task thanks! Especially if we might need to run
data type code I'm not authorised to run, or to sample data I may not
be authorised to see. The only way that could work is to kick off an
autovacuum worker to run the ANALYZE as a separate process and then
use synchronous scan behaviour to derive benefit indirectly.

However, these things presume that we need to continue scanning most
of the blocks of the table, which I don't think needs to be the case.
There is a better way.

Back in 2005/6, I advocated a block sampling method, as described by
Chaudri et al (ref?)
That has two advantages

* We don't need to visit all of the blocks, reducing I/O

* It would also give better analysis of clustered data (it's all in
that paper...)

The recent patch on TABLESAMPLE contained a rewrite of the guts of
ANALYZE into a generic sample scan, which would then be used as the
basis for a TABLESAMPLE BERNOULLI. A TABLESAMPLE SYSTEM could use a
block sample, as it does in some other DBMS (DB2, SQLServer).

While I was unimpressed with the TABLESAMPLE patch, all it needs is
some committer-love, so Greg...

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Peter Geoghegan <pg(at)heroku(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Greg Stark <stark(at)mit(dot)edu>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-10 19:28:38
Message-ID: 20131210192838.GD30072@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2013-12-10 19:23:37 +0000, Simon Riggs wrote:
> On 6 December 2013 09:21, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> > On 2013-12-05 17:52:34 -0800, Peter Geoghegan wrote:
> >> Has anyone ever thought about opportunistic ANALYZE piggy-backing on
> >> other full-table scans? That doesn't really help Greg, because his
> >> complaint is mostly that a fresh ANALYZE is too expensive, but it
> >> could be an interesting, albeit risky approach.
> >
> > What I've been thinking of is
> >
> > a) making it piggy back on scans vacuum is doing instead of doing
> > separate ones all the time (if possible, analyze needs to be more
> > frequent). Currently with quite some likelihood the cache will be gone
> > again when revisiting.
>
> > b) make analyze incremental. In lots of bigger tables most of the table
> > is static - and we actually *do* know that, thanks to the vm. So keep a
> > rawer form of what ends in the catalogs around somewhere, chunked by the
> > region of the table the statistic is from. Everytime a part of the table
> > changes, re-sample only that part. Then recompute the aggregate.
>
> Piggy-backing sounds like a bad thing. If I run a query, I don't want
> to be given some extra task thanks! Especially if we might need to run
> data type code I'm not authorised to run, or to sample data I may not
> be authorised to see. The only way that could work is to kick off an
> autovacuum worker to run the ANALYZE as a separate process and then
> use synchronous scan behaviour to derive benefit indirectly.

I was suggesting to piggyback on VACUUM, not user queries. The latter
suggestion was somebody else.
In combination with incremental or chunk-wise building of statistics,
doing more frequent partial vacuums which re-compute the changing part
of the stats would be great. All those blocks have to be read anyway,
and we should be more aggressive about vacuuming anyway.
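
To illustrate what chunk-wise statistics could look like, here is a hypothetical
Python sketch (all names and the choice of per-region summaries are mine, not a
design from this thread): keep a mergeable summary per table region, recompute
only the regions that changed, and rebuild the table-level aggregate from the
per-region pieces. Only counter-style statistics merge exactly like this; a
per-region distinct count would need a mergeable sketch such as HyperLogLog.

    from collections import Counter

    class RegionStats:
        """Summary of one region (chunk) of a table, for a single column."""
        def __init__(self, values):
            self.row_count = len(values)
            self.null_count = sum(1 for v in values if v is None)
            self.value_counts = Counter(v for v in values if v is not None)

    def analyze_incremental(regions, cached, dirty):
        """regions: per-region lists of column values; cached: region id ->
        RegionStats from the previous run; dirty: ids of regions changed since."""
        for rid, rows in enumerate(regions):
            if rid in dirty or rid not in cached:
                cached[rid] = RegionStats(rows)        # re-sample only what changed

        total = sum(s.row_count for s in cached.values())
        nulls = sum(s.null_count for s in cached.values())
        merged = Counter()
        for s in cached.values():
            merged.update(s.value_counts)              # counters merge exactly
        return {"reltuples": total,
                "nullfrac": nulls / total if total else 0.0,
                "mcv": merged.most_common(10),
                "n_distinct": len(merged)}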

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Greg Stark <stark(at)mit(dot)edu>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-10 19:49:12
Message-ID: CAM3SWZQDDA2MrFpfCpxM05Ypr0WueejQbbvVNPcXsMKr9KHWWg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Dec 10, 2013 at 11:23 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> However, these things presume that we need to continue scanning most
> of the blocks of the table, which I don't think needs to be the case.
> There is a better way.

Do they? I think it's one opportunistic way of ameliorating the cost.

> Back in 2005/6, I advocated a block sampling method, as described by
> Chaudri et al (ref?)

I don't think that anyone believes that not doing block sampling is
tenable, fwiw. Clearly some type of block sampling would be preferable
for most or all purposes.

--
Peter Geoghegan


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Peter Geoghegan <pg(at)heroku(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Greg Stark <stark(at)mit(dot)edu>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-10 19:54:43
Message-ID: CA+U5nMKQrTZ=SF93rY=uXYwcXDBtHjXWsP+X1THQnqSQLG57Yg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 10 December 2013 19:49, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
> On Tue, Dec 10, 2013 at 11:23 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>> However, these things presume that we need to continue scanning most
>> of the blocks of the table, which I don't think needs to be the case.
>> There is a better way.
>
> Do they? I think it's one opportunistic way of ameliorating the cost.
>
>> Back in 2005/6, I advocated a block sampling method, as described by
>> Chaudri et al (ref?)
>
> I don't think that anyone believes that not doing block sampling is
> tenable, fwiw. Clearly some type of block sampling would be preferable
> for most or all purposes.

If we have one way of reducing cost of ANALYZE, I'd suggest we don't
need 2 ways - especially if the second way involves the interaction of
otherwise not fully related parts of the code.

Or to put it clearly, let's go with block sampling and then see if that
needs even more work.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-10 19:54:57
Message-ID: 52A77191.6010907@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 12/10/2013 11:49 AM, Peter Geoghegan wrote:
> On Tue, Dec 10, 2013 at 11:23 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> I don't think that anyone believes that not doing block sampling is
> tenable, fwiw. Clearly some type of block sampling would be preferable
> for most or all purposes.

As discussed, we need math though. Does anyone have an ACM subscription
and time to do a search? Someone must. We can buy one with community
funds, but no reason to do so if we don't have to.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Greg Stark <stark(at)mit(dot)edu>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-10 19:59:29
Message-ID: CAM-w4HMLgydObQ-XKBDzFk3zm-rH_X8mXzxA0nX7WKwt03gNZA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Dec 10, 2013 at 7:54 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> As discussed, we need math though. Does anyone have an ACM subscription
> and time to do a search? Someone must. We can buy one with community
> funds, but no reason to do so if we don't have to.

Anyone in a university likely has access through their library.

But I don't really think this is the right way to go about this.
Research papers are going to turn up pretty specialized solutions that
are probably patented. We don't even have the basic understanding we
need. I suspect a basic textbook chapter on multistage sampling will
discuss at least the standard techniques.

Once we have a handle on the standard multistage sampling techniques
that would be safe from patents then we might want to go look at
research papers to find how they've been applied to databases in the
past but we would have to do that fairly carefully.

--
greg


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-10 20:00:49
Message-ID: CA+U5nM+vNd=10yYUuyNe9uA+Nh9mED=bbsjpU68nUUFT=ZTSVA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 10 December 2013 19:54, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> On 12/10/2013 11:49 AM, Peter Geoghegan wrote:
>> On Tue, Dec 10, 2013 at 11:23 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>> I don't think that anyone believes that not doing block sampling is
>> tenable, fwiw. Clearly some type of block sampling would be preferable
>> for most or all purposes.
>
> As discussed, we need math though. Does anyone have an ACM subscription
> and time to do a search? Someone must. We can buy one with community
> funds, but no reason to do so if we don't have to.

We already have that, just use Vitter's algorithm at the block level
rather than the row level.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-10 20:17:43
Message-ID: CAM3SWZRP_NTXKjZTJ956nzY8zgSXEeT-GBSYsHdhbqJbTHW1oQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Dec 10, 2013 at 11:59 AM, Greg Stark <stark(at)mit(dot)edu> wrote:
> But I don't really think this is the right way to go about this.
> Research papers are going to turn up pretty specialized solutions that
> are probably patented. We don't even have the basic understanding we
> need. I suspect a basic textbook chapter on multistage sampling will
> discuss at least the standard techniques.

I agree that looking for information on block level sampling
specifically, and its impact on estimation quality is likely to not
turn up very much, and whatever it does turn up will have patent
issues.

--
Peter Geoghegan


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-10 20:19:14
Message-ID: 52A77742.9070105@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 12/10/2013 10:00 PM, Simon Riggs wrote:
> On 10 December 2013 19:54, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>> On 12/10/2013 11:49 AM, Peter Geoghegan wrote:
>>> On Tue, Dec 10, 2013 at 11:23 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>>> I don't think that anyone believes that not doing block sampling is
>>> tenable, fwiw. Clearly some type of block sampling would be preferable
>>> for most or all purposes.
>>
>> As discussed, we need math though. Does anyone have an ACM subscription
>> and time to do a search? Someone must. We can buy one with community
>> funds, but no reason to do so if we don't have to.
>
> We already have that, just use Vitter's algorithm at the block level
> rather than the row level.

And what do you do with the blocks? How many blocks do you choose?
Details, please.

- Heikki


From: Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-10 21:33:29
Message-ID: 52A788A9.4010609@catalyst.net.nz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 11/12/13 09:19, Heikki Linnakangas wrote:
> On 12/10/2013 10:00 PM, Simon Riggs wrote:
>> On 10 December 2013 19:54, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>>> On 12/10/2013 11:49 AM, Peter Geoghegan wrote:
>>>> On Tue, Dec 10, 2013 at 11:23 AM, Simon Riggs
>>>> <simon(at)2ndquadrant(dot)com> wrote:
>>>> I don't think that anyone believes that not doing block sampling is
>>>> tenable, fwiw. Clearly some type of block sampling would be preferable
>>>> for most or all purposes.
>>>
>>> As discussed, we need math though. Does anyone have an ACM
>>> subscription
>>> and time to do a search? Someone must. We can buy one with community
>>> funds, but no reason to do so if we don't have to.
>>
>> We already have that, just use Vitter's algorithm at the block level
>> rather than the row level.
>
> And what do you do with the blocks? How many blocks do you choose?
> Details, please.
>
>

Yeah - and we seem to be back to Josh's point about needing 'some math'
to cope with the rows within a block not being a purely random selection.

Regards

Mark


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-10 21:45:29
Message-ID: 52A78B79.6060904@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 12/10/2013 01:33 PM, Mark Kirkwood wrote:
> Yeah - and we seem to be back to Josh's point about needing 'some math'
> to cope with the rows within a block not being a purely random selection.

Well, sometimes they are effectively random. But sometimes they are
not. The Chaudri et al paper had a formula for estimating randomness
based on the grouping of rows in each block, assuming that the sampled
blocks were widely spaced (if they aren't there's not much you can do).
This is where you get up to needing a 5% sample; you need to take
enough blocks that you're confident that the blocks you sampled are
representative of the population.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Greg Stark <stark(at)mit(dot)edu>
To: Peter Geoghegan <pg(at)heroku(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-10 21:48:10
Message-ID: CAM-w4HPq9UHE85H9QrC=8STCEBirsU+qT7czjdD5R9DWm_h5gg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Dec 10, 2013 at 7:49 PM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
>> Back in 2005/6, I advocated a block sampling method, as described by
>> Chaudri et al (ref?)
>
> I don't think that anyone believes that not doing block sampling is
> tenable, fwiw. Clearly some type of block sampling would be preferable
> for most or all purposes.

We do block sampling now. But then we select rows from those blocks uniformly.

Incidentally, if you mean Surajit Chaudhuri, he's a Microsoft Research
lead so I would be nervous about anything he's published being
patented by Microsoft.

--
greg


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Greg Stark <stark(at)mit(dot)edu>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-10 23:03:49
Message-ID: CAMkU=1xesmEJLmcgjpBUzR3VrqV0ZPqkcXCNvb=_eHqWK2V_vA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Dec 9, 2013 at 2:37 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> On Mon, Dec 9, 2013 at 4:18 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
> > My reading of the code is that if it is not in the MCV, then it is assumed
> > to have the average selectivity (about 1/n_distinct, but deflating top and
> > bottom for the MCV list). There is also a check that it is less than the
> > least common of the MCV, but I don't know why that situation would ever
> > prevail--that should always be higher or equal to the average selectivity.
>
> I've never seen an n_distinct value of more than 5 digits, regardless
> of reality. Typically I've seen 20-50k, even if the real number is
> much higher.

I don't recall seeing an n_distinct that is literally above 100,000 in the
wild, but I've seen negative ones that multiply with reltuples to give
values more than that. In test cases it is easy enough to get values in
the millions by creating tables using floor(random()*$something).

create table baz as select floor(random()*10000000), md5(random()::text)
from generate_series(1,100000000);
create table baz2 as select * from baz order by floor;
create table baz3 as select * from baz order by md5(floor::text);

baz unclustered, baz2 is clustered with perfect correlation, baz3 is
clustered but without correlation.

After analyzing all of them:

select tablename, n_distinct, correlation from pg_stats
  where tablename like 'baz%' and attname = 'floor';

 tablename | n_distinct  | correlation
-----------+-------------+-------------
 baz       | 8.56006e+06 |  0.00497713
 baz2      |      333774 |           1
 baz3      |      361048 |  -0.0118147

So baz is pretty close, while the other two are way off. But the
n_distinct computation doesn't depend on the order of the rows, as far as I
can tell, but only the composition of the sample. So this can only mean
that our "random" sampling method is desperately non-random, right?

> But the n_distinct value is only for non-MCVs, so if we
> estimate the selectivity of column = 'rarevalue' to be
> (1-nullfrac-mcvfrac)/n_distinct, then making mcvfrac bigger reduces
> the estimate, and making the MCV list longer naturally makes mcvfrac
> bigger.

Ah, I see. By including more things into MCV, we crowd out the rest of the
space implicitly.
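
A made-up arithmetic example of the effect described above (the numbers are
invented, not taken from any real table): with nullfrac = 0 and n_distinct =
50,000, growing the MCV coverage shrinks the estimate for a non-MCV value.

    # Selectivity of column = 'rarevalue' per the formula quoted above.
    nullfrac, n_distinct = 0.0, 50_000
    for mcvfrac in (0.10, 0.50, 0.90):      # fraction of rows covered by the MCV list
        sel = (1 - nullfrac - mcvfrac) / n_distinct
        print(f"mcvfrac={mcvfrac:.2f} -> selectivity {sel:.1e}")
    # prints 1.8e-05, 1.0e-05, 2.0e-06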

> I'm not sure how important the
> less-frequent-than-the-least-common-MCV part is, but I'm very sure
> that raising the statistics target helps to solve the problem of
> overestimating the prevalence of uncommon values in a very big table.
>
> > I think that parts of the planner are N^2 in the size of histogram (or was
> > that the size of the MCV list?). So we would probably need a way to use a
> > larger sample size to get more accurate n_distinct and MCV frequencies, but
> > not save the entire histogram that goes with that sample size.
>
> I think the saving the histogram part is important. As you say, the
> MCVs are important for a variety of planning purposes, such as hash
> joins. More than that, in my experience, people with large tables are
> typically very willing to spend more planning time to get a better
> plan, because mistakes are expensive and the queries are likely to run
> for a while anyway. People with small tables care about planning
> time, because it makes no sense to spend an extra 1ms planning a query
> unless you improve the plan by enough to save at least 1ms when
> executing it, and when the tables are small and access is expected to
> be fast anyway that's often not the case.
>

I would think that the dichotomy is more about the size of the query-plan
than of the tables. I think a lot of people with huge tables end up doing
mostly indexed lookups in unique or highly selective indexes, once all the
planning is done.

Does anyone have generators for examples of cases where increasing the
sample size to get better histograms (as opposed to more accurate n_distinct
or more accurate MCV) was the key to fixing bad plans? I wonder if it is
more bins, or more accurate boundaries, that makes the difference.

Cheers,

Jeff


From: Jim Nasby <jim(at)nasby(dot)net>
To: Peter Geoghegan <pg(at)heroku(dot)com>, Greg Stark <stark(at)mit(dot)edu>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-10 23:26:56
Message-ID: 52A7A340.9070801@nasby.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 12/10/13 2:17 PM, Peter Geoghegan wrote:
> On Tue, Dec 10, 2013 at 11:59 AM, Greg Stark <stark(at)mit(dot)edu> wrote:
>> But I don't really think this is the right way to go about this.
>> Research papers are going to turn up pretty specialized solutions that
>> are probably patented. We don't even have the basic understanding we
>> need. I suspect a basic textbook chapter on multistage sampling will
>> discuss at least the standard techniques.
>
> I agree that looking for information on block level sampling
> specifically, and its impact on estimation quality is likely to not
> turn up very much, and whatever it does turn up will have patent
> issues.

We have an entire analytics dept. at work that specializes in finding patterns in our data. I might be able to get some time from them to at least provide some guidance here, if the community is interested. They could really only serve in a consulting role though.
--
Jim C. Nasby, Data Architect jim(at)nasby(dot)net
512.569.9461 (cell) http://jim.nasby.net


From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Jim Nasby <jim(at)nasby(dot)net>
Cc: Greg Stark <stark(at)mit(dot)edu>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-10 23:43:23
Message-ID: CAM3SWZQhd0AUr=R2gUBzsEG0FEqV2zF-xVk0Rsm=SeC-mrVCog@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Dec 10, 2013 at 3:26 PM, Jim Nasby <jim(at)nasby(dot)net> wrote:
>> I agree that looking for information on block level sampling
>> specifically, and its impact on estimation quality is likely to not
>> turn up very much, and whatever it does turn up will have patent
>> issues.
>
>
> We have an entire analytics dept. at work that specializes in finding
> patterns in our data. I might be able to get some time from them to at least
> provide some guidance here, if the community is interested. They could
> really only serve in a consulting role though.

I think that Greg had this right several years ago: it would probably
be very useful to have the input of someone with a strong background
in statistics. It doesn't seem that important that they already know a
lot about databases, provided they can understand what our constraints
are, and what is important to us. It might just be a matter of having
them point us in the right direction.

--
Peter Geoghegan


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Peter Geoghegan <pg(at)heroku(dot)com>
Cc: Jim Nasby <jim(at)nasby(dot)net>, Greg Stark <stark(at)mit(dot)edu>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-11 00:14:58
Message-ID: CA+U5nMKQ-b=34u37A1yOMOExQ2me+Tif8_-cYHDM3vODOrLDuA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 10 December 2013 23:43, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
> On Tue, Dec 10, 2013 at 3:26 PM, Jim Nasby <jim(at)nasby(dot)net> wrote:
>>> I agree that looking for information on block level sampling
>>> specifically, and its impact on estimation quality is likely to not
>>> turn up very much, and whatever it does turn up will have patent
>>> issues.
>>
>>
>> We have an entire analytics dept. at work that specializes in finding
>> patterns in our data. I might be able to get some time from them to at least
>> provide some guidance here, if the community is interested. They could
>> really only serve in a consulting role though.
>
> I think that Greg had this right several years ago: it would probably
> be very useful to have the input of someone with a strong background
> in statistics. It doesn't seem that important that they already know a
> lot about databases, provided they can understand what our constraints
> are, and what is important to us. It might just be a matter of having
> them point us in the right direction.

err, so what does stats target mean exactly in statistical theory?
Waiting for a statistician, and confirming his credentials before you
believe him above others here, seems like wasted time.

What your statistician will tell you is that YMMV, depending on the data.

So we'll still need a parameter to fine tune things when the default
is off. We can argue about the default later, at various levels of
rigour.

Block sampling, with parameter to specify sample size. +1

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Greg Stark <stark(at)mit(dot)edu>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Peter Geoghegan <pg(at)heroku(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-11 00:28:33
Message-ID: CAM-w4HMq2J91wVb-sGT0H3E+Tn90nFq4byz9TcSnUGrxpTmTPw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Dec 11, 2013 at 12:14 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> Block sampling, with parameter to specify sample size. +1

Simon this is very frustrating. Can you define "block sampling"?

--
greg


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: Peter Geoghegan <pg(at)heroku(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-11 00:40:08
Message-ID: CA+U5nMJ+kp_xhV8LSSOG_ekTBRvD5YMYntSLGANDK9pUerYWyw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 11 December 2013 00:28, Greg Stark <stark(at)mit(dot)edu> wrote:
> On Wed, Dec 11, 2013 at 12:14 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>> Block sampling, with parameter to specify sample size. +1
>
> Simon this is very frustrating. Can you define "block sampling"?

Blocks selected using Vitter's algorithm, using a parameterised
fraction of the total.

When we select a block we should read all rows on that block, to help
identify the extent of clustering within the data.
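
For concreteness, a rough Python sketch of what that could mean (my reading, not
the actual patch; read_block is a hypothetical callback): pick a reservoir of
block numbers from the stream of block ids (this is the basic reservoir method,
Algorithm R; Vitter's Algorithm Z is an optimized variant of the same idea), then
keep every row on each chosen block.

    import random

    def sample_blocks(total_blocks, sample_fraction):
        target = max(1, int(total_blocks * sample_fraction))
        reservoir = []
        for blkno in range(total_blocks):          # block numbers arrive as a stream
            if len(reservoir) < target:
                reservoir.append(blkno)
            else:
                j = random.randint(0, blkno)       # replacement probability shrinks
                if j < target:
                    reservoir[j] = blkno
        return sorted(reservoir)                   # visit blocks in physical order

    def sample_rows(read_block, total_blocks, sample_fraction):
        rows = []
        for blkno in sample_blocks(total_blocks, sample_fraction):
            rows.extend(read_block(blkno))         # take *all* rows on the block
        return rows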

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Greg Stark <stark(at)mit(dot)edu>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Peter Geoghegan <pg(at)heroku(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-11 00:44:59
Message-ID: CAM-w4HOTrwt-v-zOxp747rf+PW3r9EYAyGtKCtDvrFQ1AScSuw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Dec 11, 2013 at 12:40 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> When we select a block we should read all rows on that block, to help
> identify the extent of clustering within the data.

So how do you interpret the results of the sample read that way that
doesn't introduce bias?

--
greg


From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Jim Nasby <jim(at)nasby(dot)net>, Greg Stark <stark(at)mit(dot)edu>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-11 00:48:37
Message-ID: CAM3SWZRB8dH5RUzpi_fPD3yakFEivoHo1v69Yrv8yDoU+S7DxQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Dec 10, 2013 at 4:14 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> err, so what does stats target mean exactly in statistical theory?

Why would I even mention that to a statistician? We want guidance. But
yes, I bet I could give a statistician an explanation of statistics
target that they'd understand without too much trouble.

> Waiting for a statistician, and confirming his credentials before you
> believe him above others here, seems like wasted time.
>
> What your statistician will tell you is it that YMMV, depending on the data.

I'm reasonably confident that they'd give me more than that.

> So we'll still need a parameter to fine tune things when the default
> is off. We can argue about the default later, in various level of
> rigour.
>
> Block sampling, with parameter to specify sample size. +1

Again, it isn't as if the likely efficacy of *some* block sampling
approach is in question. I'm sure analyze.c is currently naive about
many things. Everyone knows that there are big gains to be had.
Balancing those gains against the possible downsides in terms of
impact on the quality of statistics generated is pretty nuanced. I do
know enough to know that a lot of thought goes into mitigating and/or
detecting the downsides of block-based sampling.

--
Peter Geoghegan


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: Peter Geoghegan <pg(at)heroku(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-11 00:58:12
Message-ID: CA+U5nM+3M7PfwrZs3ivt_oCuF-yTRaaA1Wq=u4GDTjKJxr2Kpg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 11 December 2013 00:44, Greg Stark <stark(at)mit(dot)edu> wrote:
> On Wed, Dec 11, 2013 at 12:40 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>> When we select a block we should read all rows on that block, to help
>> identify the extent of clustering within the data.
>
> So how do you interpret the results of the sample read that way that
> doesn't introduce bias?

Yes, it is not a perfect statistical sample. All sampling is subject
to an error that is data dependent.

I'm happy that we have an option to select this or not, and a default
that maintains current behaviour, since otherwise we might expect some
plan instability.

I would like to be able to

* allow ANALYZE to run faster in some cases
* increase/decrease sample size when it matters
* have the default sample size vary according to the size of the
table, i.e. a proportional sample

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: "Sergey E(dot) Koposov" <math(at)sai(dot)msu(dot)ru>
To: Peter Geoghegan <pg(at)heroku(dot)com>
Cc: Jim Nasby <jim(at)nasby(dot)net>, Greg Stark <stark(at)mit(dot)edu>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-11 01:27:04
Message-ID: alpine.LRH.2.00.1312110506150.19468@lnfm1.sai.msu.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

For what it's worth.

I'll quote Chaudhuri et al. first line from the abstract about the block
sampling.
"Block-level sampling is far more efficient than true uniform-random
sampling over a large database, but prone to significant errors if used
to create database statistics."

And after briefly glancing through the paper, my opinion is that it works
because, after building one version of the statistics, they cross-validate, see
how well it does and then collect more if the cross-validation error is
large (for example because the data is clustered). Without this bit, as
far as I can tell, a simple block-based sampler will be bound to make
catastrophic mistakes depending on the distribution.

Also, just another point about targets (e.g. X%) for estimating stuff from
the samples (as was discussed in the thread). Basically, there is a
point in talking about sampling a fixed target (5%) of the data
ONLY if you fix the actual distribution of your data in the table, and
decide what statistic you are trying to find, e.g. average, std. dev., a
90th percentile, ndistinct or a histogram and so forth. There won't be a
general answer, as the percentages will be distribution dependent and
statistic dependent.

Cheers,
Sergey

PS I'm not a statistician, but I use statistics a lot

*******************************************************************
Sergey E. Koposov, PhD, Research Associate
Institute of Astronomy, University of Cambridge
Madingley road, CB3 0HA, Cambridge, UK
Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Peter Geoghegan <pg(at)heroku(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Greg Stark <stark(at)mit(dot)edu>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-11 02:03:52
Message-ID: 16278.1386727432@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Peter Geoghegan <pg(at)heroku(dot)com> writes:
> Again, it isn't as if the likely efficacy of *some* block sampling
> approach is in question. I'm sure analyze.c is currently naive about
> many things.

It's not *that* naive; this is already about a third-generation algorithm.
The last major revision (commit 9d6570b8a4) was to address problems with
misestimating the number of live tuples due to nonuniform tuple density
in a table. IIRC, the previous code could be seriously misled if the
first few pages in the table were significantly non-representative of the
live-tuple density further on. I'm not sure how we can significantly
reduce the number of blocks examined without re-introducing that hazard in
some form. In particular, given that you want to see at least N tuples,
how many blocks will you read if you don't have an a-priori estimate of
tuple density? You have to decide that before you start sampling blocks,
if you want all blocks to have the same probability of being selected
and you want to read them in sequence.

regards, tom lane


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Greg Stark <stark(at)mit(dot)edu>, Peter Geoghegan <pg(at)heroku(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-11 02:11:45
Message-ID: CAMkU=1weFZ-k=z2Utu=kTHe7R5eqR45ujWdNVGC+UHU7n+RZNw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tuesday, December 10, 2013, Simon Riggs wrote:

> On 11 December 2013 00:28, Greg Stark <stark(at)mit(dot)edu> wrote:
> > On Wed, Dec 11, 2013 at 12:14 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> >> Block sampling, with parameter to specify sample size. +1
> >
> > Simon this is very frustrating. Can you define "block sampling"?
>
> Blocks selected using Vitter's algorithm, using a parameterised
> fraction of the total.
>

OK, thanks for defining that.

We only need Vitter's algorithm when we don't know in advance how many
items we are sampling from (such as for tuples--unless we want to rely on
the previous estimate for the current round of analysis). But for blocks,
we do know how many there are, so there are simpler ways to pick them.
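
A minimal sketch of that simpler alternative (illustrative only): when the block
count is known up front, a plain simple random sample of block numbers does the
job without any reservoir bookkeeping.

    import random

    def choose_blocks(total_blocks, n_sample_blocks):
        # Simple random sample without replacement; no streaming algorithm needed.
        return sorted(random.sample(range(total_blocks),
                                    min(n_sample_blocks, total_blocks)))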

>
> When we select a block we should read all rows on that block, to help
> identify the extent of clustering within the data.
>

But we have no mechanism to store such information (or to use it if it were
stored), nor even ways to prevent the resulting skew in the sample from
seriously messing up the estimates which we do have ways of storing and
using.

Cheers,

Jeff


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: "Sergey E(dot) Koposov" <math(at)sai(dot)msu(dot)ru>
Cc: Peter Geoghegan <pg(at)heroku(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Greg Stark <stark(at)mit(dot)edu>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-11 06:34:52
Message-ID: CA+U5nMJS8CLED0FLif+5cj_XGm8Q4e7h6qTeDJu=rRsnojF0wQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 11 December 2013 01:27, Sergey E. Koposov <math(at)sai(dot)msu(dot)ru> wrote:
> For what it's worth.
>
> I'll quote Chaudhuri et al. first line from the abstract about the block
> sampling.
> "Block-level sampling is far more efficient than true uniform-random
> sampling over a large database, but prone to significant errors if used to
> create database statistics."

This glosses over the point that both SQLServer and Oracle use this technique.

> And after briefly glancing through the paper, my opinion is why it works is
> because after making one version of statistics they cross-validate, see how
> well it goes and then collect more if the cross-validation error is large
> (for example because the data is clustered). Without this bit, as far as I
> can a simply block based sampler will be bound to make catastrophic mistakes
> depending on the distribution

I don't think it's true that a block-based sampler will be *bound* to
make "catastrophic mistakes". They can clearly happen, just as they
can with random samples, hence the need for a parameter to control the
sample size.

Realistically, I never heard of an Oracle DBA doing advanced
statistical mathematics before setting the sample size on ANALYZE. You
use the default and bump it up if the sample is insufficient for the
data.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: "Sergey E(dot) Koposov" <math(at)sai(dot)msu(dot)ru>, Jim Nasby <jim(at)nasby(dot)net>, Greg Stark <stark(at)mit(dot)edu>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-11 07:01:27
Message-ID: CAM3SWZRw93mAn1wBo0-5Aca-YD1-YqZskECZus+t5_E2gjvhog@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Dec 10, 2013 at 10:34 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> On 11 December 2013 01:27, Sergey E. Koposov <math(at)sai(dot)msu(dot)ru> wrote:
>> For what it's worth.
>>
>> I'll quote Chaudhuri et al. first line from the abstract about the block
>> sampling.
>> "Block-level sampling is far more efficient than true uniform-random
>> sampling over a large database, but prone to significant errors if used to
>> create database statistics."
>
> This glosses over the point that both SQLServer and Oracle use this technique.

That seems like an unusual omission for Microsoft Research to have made.

I didn't read that paper, because undoubtedly it's all patented. But
before I figured that out, after finding it on Google randomly, I did
read the first couple of paragraphs, which more or less said "what
follows - the entire paper - is an explanation as to why it's okay
that we do block sampling".

--
Peter Geoghegan


From: Hannu Krosing <hannu(at)2ndQuadrant(dot)com>
To: Greg Stark <stark(at)mit(dot)edu>, Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Peter Geoghegan <pg(at)heroku(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-11 07:50:25
Message-ID: 52A81941.1080704@2ndQuadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 12/11/2013 01:44 AM, Greg Stark wrote:
> On Wed, Dec 11, 2013 at 12:40 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>> When we select a block we should read all rows on that block, to help
>> identify the extent of clustering within the data.
> So how do you interpret the results of the sample read that way that
> doesn't introduce bias?
>
Initially/experimentally we could just compare it to our current approach :)

That is, implement *some* block sampling and then check it against what
we currently have. Then figure out the bad differences. Rinse. Repeat.

Cheers

--
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ


From: Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>, "Sergey E(dot) Koposov" <math(at)sai(dot)msu(dot)ru>
Cc: Peter Geoghegan <pg(at)heroku(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Greg Stark <stark(at)mit(dot)edu>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-11 08:09:39
Message-ID: 52A81DC3.2060105@catalyst.net.nz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 11/12/13 19:34, Simon Riggs wrote:
>
> Realistically, I never heard of an Oracle DBA doing advanced
> statistical mathematics before setting the sample size on ANALYZE. You
> use the default and bump it up if the sample is insufficient for the
> data.
>

I'm not sure that Oracle's stats and optimizer design is an example to
be envied - pretty much all Oracle DBAs I've encountered will apply
hints to all queries to get the plan they want...

Regards

Mark


From: Greg Stark <stark(at)mit(dot)edu>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Peter Geoghegan <pg(at)heroku(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-11 11:01:34
Message-ID: CAM-w4HPWSnGFgk+o7sR3Y76sywp3kX9EPj6-jn28qc4p5bJ8tQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Dec 11, 2013 at 12:58 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:

> Yes, it is not a perfect statistical sample. All sampling is subject
> to an error that is data dependent.

Well there's random variation due to the limitations of dealing with a
sample. And then there are systemic biases due to incorrect algorithms.
You wouldn't be happy if the samples discarded every row with NULLs or
every row older than some date etc. These things would not be
corrected by larger samples. That's the kind of "error" we're talking
about here.

But the more I think about things the less convinced I am that there
is a systemic bias introduced by reading the entire block. I had
assumed larger rows would be selected against but that's not really
true, they're just selected against relative to the number of bytes
they occupy which is the correct frequency to sample.

Even blocks that are mostly empty don't really bias things. Picture a
table that consists of 100 blocks with 100 rows each (value A) and
another 100 blocks with only 1 row each (value B). Half of all blocks
contain value B, which looks grossly inflated at the block level;
however each selected block with value A will produce 100 rows while a
B block produces just 1. So if your sample includes 10 blocks of each
kind you'll get 100x10 rows of A and 1x10 rows of B, which is the
correct proportion.

I'm not actually sure there is any systemic bias here. The larger
number of rows per block generate less precise results but from my
thought experiments they seem to still be accurate?
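
A quick simulation of that thought experiment (mine, not Greg's; the block
counts and the 20-block sample size are arbitrary): half of all blocks contain
value B, yet pooling every row from the sampled blocks still gives an average B
fraction close to the table-wide ~1%.

    import random

    # 100 blocks of 100 'A' rows each, plus 100 blocks holding a single 'B' row.
    table = [["A"] * 100 for _ in range(100)] + [["B"] for _ in range(100)]

    trials, frac_b = 5_000, []
    for _ in range(trials):
        sample = [row for blk in random.sample(table, 20) for row in blk]
        frac_b.append(sample.count("B") / len(sample))

    print("true fraction of B      :", 100 / (100 * 100 + 100))   # ~0.0099
    print("average sampled fraction:", sum(frac_b) / trials)      # comes out close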


From: Greg Stark <stark(at)mit(dot)edu>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Peter Geoghegan <pg(at)heroku(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-11 12:08:21
Message-ID: CAM-w4HPpVDC-EDMui60a4_i4WMhqM=ZTTdshZJyse0O-e5Karw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Dec 11, 2013 at 11:01 AM, Greg Stark <stark(at)mit(dot)edu> wrote:
> I'm not actually sure there is any systemic bias here. The larger
> number of rows per block generate less precise results but from my
> thought experiments they seem to still be accurate?

So I've done some empirical tests for a table generated by:
create table sizeskew as (select i,j,repeat('i',i) from
generate_series(1,1000) as i, generate_series(1,1000) as j);

I find that using the whole block doesn't cause any problem with the
avg_width field for the "repeat" column. That does reinforce my belief
that we might not need any particularly black magic here.

It does however cause a systemic error in the histogram bounds. It
seems the median is systematically overestimated by more and more as
the number of rows used per block grows:

1: 524
4: 549
8: 571
12: 596
16: 602
20: 618 (total sample slightly smaller than normal)
30: 703 (substantially smaller sample)

So there is something clearly wonky in the histogram stats that's
affected by the distribution of the sample. The only thing I can think
of is maybe the most common elements are being selected preferentially
from the early part of the sample which is removing a substantial part
of the lower end of the range. But even removing 100 from the
beginning shouldn't be enough to push the median above 550.

--
greg


From: Greg Stark <stark(at)mit(dot)edu>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Peter Geoghegan <pg(at)heroku(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-11 12:23:26
Message-ID: CAM-w4HPufZ6qa9TzJYK4pvKv2oBFWjYtyGp_Z7oZUR5zMs7W6g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Dec 11, 2013 at 12:08 PM, Greg Stark <stark(at)mit(dot)edu> wrote:
> The only thing I can think
> of is maybe the most common elements are being selected preferentially
> from the early part of the sample which is removing a substantial part
> of the lower end of the range. But even removing 100 from the
> beginning shouldn't be enough to push the median above 550.

Just to follow up here. I think what's going is that not only are the
most_common_vals being preferentially taken from the beginning of the
sample but also their frequency is being massively overestimated. All
values have a frequency of about .001 but the head of the MCV has a
frequency as high as .10 in some of my tests.

--
greg


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Peter Geoghegan <pg(at)heroku(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-11 13:02:30
Message-ID: 52A86266.9020700@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 12/11/2013 02:08 PM, Greg Stark wrote:
> On Wed, Dec 11, 2013 at 11:01 AM, Greg Stark <stark(at)mit(dot)edu> wrote:
>> I'm not actually sure there is any systemic bias here. The larger
>> number of rows per block generate less precise results but from my
>> thought experiments they seem to still be accurate?
>
> So I've done some empirical tests for a table generated by:
> create table sizeskew as (select i,j,repeat('i',i) from
> generate_series(1,1000) as i, generate_series(1,1000) as j);
>
> I find that using the whole block doesn't cause any problem with the
> avg_width field for the "repeat" column. That does reinforce my belief
> that we might not need any particularly black magic here.

How large a sample did you use? Remember that the point of doing
block-level sampling instead of the current approach would be to allow
using a significantly smaller sample (in # of blocks), and still achieve
the same sampling error. If the sample is "large enough", it will mask
any systemic bias caused by block-sampling, but the point is to reduce
the number of sampled blocks.

The practical question here is this: What happens to the quality of the
statistics if you only read 1/2 the number of blocks you normally
would, but include all the rows in the blocks you do read in the sample?
How about 1/10?

Or to put it another way: could we achieve more accurate statistics by
including all rows from the sampled blocks, while reading the same number
of blocks? In particular, I wonder if it would help with estimating
ndistinct. It generally helps to have a larger sample for ndistinct
estimation, so it might be beneficial.

- Heikki


From: Florian Pflug <fgp(at)phlo(dot)org>
To: Claudio Freire <klaussfreire(at)gmail(dot)com>
Cc: Greg Stark <stark(at)mit(dot)edu>, Albe Laurenz <laurenz(dot)albe(at)wien(dot)gv(dot)at>, Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-11 14:28:51
Message-ID: 848AB714-0DF8-493C-BA8F-7F49102B4E10@phlo.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Dec10, 2013, at 15:32 , Claudio Freire <klaussfreire(at)gmail(dot)com> wrote:
> On Tue, Dec 10, 2013 at 11:02 AM, Greg Stark <stark(at)mit(dot)edu> wrote:
>>
>> On 10 Dec 2013 08:28, "Albe Laurenz" <laurenz(dot)albe(at)wien(dot)gv(dot)at> wrote:
>>>
>>>
>>> Doesn't all that assume a normally distributed random variable?
>>
>> I don't think so because of the law of large numbers. If you have a large
>> population and sample it the sample behaves like a normal distribution even
>> if the distribution of the population isn't.
>
> No, the large population says that if you have an AVERAGE of many
> samples of a random variable, the random variable that is the AVERAGE
> behaves like a normal.

Actually, that's the central limit theorem, and it doesn't hold for all
random variables, only for those with finite expected value and variance.

The law of large numbers, in contrast, only tells you that the AVERAGE of
n samples of a random variable will converge to the random variables'
expected value as n goes to infinity (there are different versions of the
law which guarantee different kinds of convergence, weak or strong).
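
For reference, the two textbook statements being distinguished here (standard
results, not anything specific to ANALYZE), written out:

    % Weak law of large numbers: the sample mean converges (in probability) to
    % the expectation, provided that expectation exists.
    \bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i \;\xrightarrow{P}\; \mu
        \quad\text{as } n \to \infty

    % Central limit theorem: it is the *rescaled* sample mean that becomes
    % normal, and it additionally requires a finite variance \sigma^2.
    \sqrt{n}\,\frac{\bar{X}_n - \mu}{\sigma} \;\xrightarrow{d}\; \mathcal{N}(0,1)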

best regards,
Florian Pflug


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: Peter Geoghegan <pg(at)heroku(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-11 15:16:50
Message-ID: CA+U5nMLW0yZ3JyuLc5=gcBj4RtV-BdC4zewmwaPu-tFABXaBqA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 11 December 2013 12:08, Greg Stark <stark(at)mit(dot)edu> wrote:

> So there is something clearly wonky in the histogram stats that's
> affected by the distribution of the sample.

...in the case where the avg width changes in a consistent manner
across the table.

Well spotted.

ISTM we can have a specific cross check for bias in the sample of that
nature. We just calculate the avg width per block and then check for
correlation of the avg width against block number. If we find bias we
can calculate how many extra blocks to sample and from where.

There may be other biases also, so we can check for them and respond
accordingly.
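
A sketch of what that cross-check might look like (the function, threshold and
data layout are my guesses, not a worked-out design): compute the average row
width per sampled block and correlate it with the block number; a strong
correlation flags width-related clustering and could trigger sampling extra
blocks.

    import statistics

    def width_bias_check(sampled_blocks, threshold=0.5):
        """sampled_blocks: list of (block_number, [row_width, ...]) pairs."""
        blknos = [bn for bn, widths in sampled_blocks]
        avg_w  = [statistics.mean(widths) for bn, widths in sampled_blocks]
        r = statistics.correlation(blknos, avg_w)   # Pearson r (Python >= 3.10)
        return abs(r) > threshold, r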

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Peter Geoghegan <pg(at)heroku(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-11 16:03:39
Message-ID: 16676.1386777819@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Greg Stark <stark(at)mit(dot)edu> writes:
> So I've done some empirical tests for a table generated by:
> create table sizeskew as (select i,j,repeat('i',i) from
> generate_series(1,1000) as i, generate_series(1,1000) as j);

> I find that using the whole block doesn't cause any problem with the
> avg_width field for the "repeat" column. That does reinforce my belief
> that we might not need any particularly black magic here.

> It does however cause a systemic error in the histogram bounds. It
> seems the median is systematically overestimated by more and more the
> larger the number of rows per block are used:

Hm. You can only take N rows from a block if there actually are at least
N rows in the block. So the sampling rule I suppose you are using is
"select up to N rows from each sampled block" --- and that is going to
favor the contents of blocks containing narrower-than-average rows.

Now in this case, it looks like that ought to favor rows with *smaller* i
values, but you say the median goes up not down. So I'm not sure what's
going on. I thought at first that TOAST compression might be part of the
explanation, but TOAST shouldn't kick in on rows with raw representation
narrower than 2KB.

Did you do a run with no upper limit on the number of rows per block?
Because I'm not sure that tests with a limit in place are a good guide
to what happens without it.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Peter Geoghegan <pg(at)heroku(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-11 17:22:51
Message-ID: 18246.1386782571@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I wrote:
> Hm. You can only take N rows from a block if there actually are at least
> N rows in the block. So the sampling rule I suppose you are using is
> "select up to N rows from each sampled block" --- and that is going to
> favor the contents of blocks containing narrower-than-average rows.

Oh, no, wait: that's backwards. (I plead insufficient caffeine.)
Actually, this sampling rule discriminates *against* blocks with
narrower rows. You previously argued, correctly I think, that
sampling all rows on each page introduces no new bias because row
width cancels out across all sampled pages. However, if you just
include up to N rows from each page, then rows on pages with more
than N rows have a lower probability of being selected, but there's
no such bias against wider rows. This explains why you saw smaller
values of "i" being undersampled.

Had you run the test series all the way up to the max number of
tuples per block, which is probably a couple hundred in this test,
I think you'd have seen the bias go away again. But the takeaway
point is that we have to sample all tuples per page, not just a
limited number of them, if we want to change it like this.
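
A toy simulation of that inclusion-probability argument (my construction, not
Greg's actual test): row width grows with the value, so small values pack many
rows per page; sampling whole pages keeps the median roughly right, while
capping at N rows per page under-represents rows on the dense (small-value)
pages and pushes the estimated median up.

    import random
    import statistics

    PAGE_BYTES = 8192
    pages = []
    for v in range(1, 1001):                             # 1000 rows of each value
        rows_per_page = max(1, PAGE_BYTES // (v + 24))   # narrow values pack densely
        remaining = 1000
        while remaining > 0:
            take = min(rows_per_page, remaining)
            pages.append([v] * take)
            remaining -= take

    random.seed(0)
    sampled = random.sample(pages, 3000)

    whole  = [r for p in sampled for r in p]        # every row on each sampled page
    capped = [r for p in sampled for r in p[:16]]   # "up to N rows per page", N = 16

    print("median, whole pages  :", statistics.median(whole))    # close to 500
    print("median, capped at 16 :", statistics.median(capped))   # noticeably higher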

regards, tom lane


From: Gavin Flower <GavinFlower(at)archidevsys(dot)co(dot)nz>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Stark <stark(at)mit(dot)edu>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Peter Geoghegan <pg(at)heroku(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-11 18:22:59
Message-ID: 52A8AD83.7000608@archidevsys.co.nz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 12/12/13 06:22, Tom Lane wrote:
> I wrote:
>> Hm. You can only take N rows from a block if there actually are at least
>> N rows in the block. So the sampling rule I suppose you are using is
>> "select up to N rows from each sampled block" --- and that is going to
>> favor the contents of blocks containing narrower-than-average rows.
> Oh, no, wait: that's backwards. (I plead insufficient caffeine.)
> Actually, this sampling rule discriminates *against* blocks with
> narrower rows. You previously argued, correctly I think, that
> sampling all rows on each page introduces no new bias because row
> width cancels out across all sampled pages. However, if you just
> include up to N rows from each page, then rows on pages with more
> than N rows have a lower probability of being selected, but there's
> no such bias against wider rows. This explains why you saw smaller
> values of "i" being undersampled.
>
> Had you run the test series all the way up to the max number of
> tuples per block, which is probably a couple hundred in this test,
> I think you'd have seen the bias go away again. But the takeaway
> point is that we have to sample all tuples per page, not just a
> limited number of them, if we want to change it like this.
>
> regards, tom lane
>
>
Surely we want to sample a 'constant fraction' (obviously, in practice
you have to sample an integral number of rows in a page!) of rows per
page? The simplest way, as Tom suggests, is to use all the rows in a page.

However, if you wanted the same number of rows from a greater number of
pages, you could (for example) select a quarter of the rows from each
page. In which case, when this is a fractional number: take the
integral number of rows, plus one extra row with a probability equal to
the fraction (here 0.25).

Either way, if it is determined that you need N rows, then keep
selecting pages at random (but never use the same page more than once)
until you have at least N rows.
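
A minimal sketch of that rule (illustrative Python, not the ANALYZE code;
the function names and the "page is a list of its rows" representation are
assumptions made for the example):

import random

def rows_from_page(page_rows, fraction=0.25):
    # Take the integral part of fraction * rows-on-this-page, plus one
    # extra row with probability equal to the leftover fractional part.
    target = fraction * len(page_rows)
    k = int(target)
    if random.random() < (target - k):
        k += 1
    return random.sample(page_rows, k)

def sample_rows(pages, n_needed, fraction=0.25):
    # Visit pages in random order, never the same page twice, until at
    # least n_needed rows have been collected.
    sample = []
    for page in random.sample(pages, len(pages)):
        sample.extend(rows_from_page(page, fraction))
        if len(sample) >= n_needed:
            break
    return sample

With fraction=1.0 this degenerates into the "use every row on the page"
rule Tom describes above.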

Cheers,
Gavin


From: Gavin Flower <GavinFlower(at)archidevsys(dot)co(dot)nz>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Stark <stark(at)mit(dot)edu>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Peter Geoghegan <pg(at)heroku(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-11 18:26:46
Message-ID: 52A8AE66.3010504@archidevsys.co.nz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 12/12/13 06:22, Tom Lane wrote:
> I wrote:
>> Hm. You can only take N rows from a block if there actually are at least
>> N rows in the block. So the sampling rule I suppose you are using is
>> "select up to N rows from each sampled block" --- and that is going to
>> favor the contents of blocks containing narrower-than-average rows.
> Oh, no, wait: that's backwards. (I plead insufficient caffeine.)
> Actually, this sampling rule discriminates *against* blocks with
> narrower rows. You previously argued, correctly I think, that
> sampling all rows on each page introduces no new bias because row
> width cancels out across all sampled pages. However, if you just
> include up to N rows from each page, then rows on pages with more
> than N rows have a lower probability of being selected, but there's
> no such bias against wider rows. This explains why you saw smaller
> values of "i" being undersampled.
>
> Had you run the test series all the way up to the max number of
> tuples per block, which is probably a couple hundred in this test,
> I think you'd have seen the bias go away again. But the takeaway
> point is that we have to sample all tuples per page, not just a
> limited number of them, if we want to change it like this.
>
> regards, tom lane
>
>
Hmm...

In my previous reply (which hasn't shown up yet!) I made a mistake.

The fraction/probability could be any of 0.25, 0.50, and 0.75.

Cheers,
Gavin


From: Gavin Flower <GavinFlower(at)archidevsys(dot)co(dot)nz>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Stark <stark(at)mit(dot)edu>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Peter Geoghegan <pg(at)heroku(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-11 19:14:32
Message-ID: 52A8B998.5040602@archidevsys.co.nz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 12/12/13 07:22, Gavin Flower wrote:
> On 12/12/13 06:22, Tom Lane wrote:
>> I wrote:
>>> Hm. You can only take N rows from a block if there actually are at
>>> least
>>> N rows in the block. So the sampling rule I suppose you are using is
>>> "select up to N rows from each sampled block" --- and that is going to
>>> favor the contents of blocks containing narrower-than-average rows.
>> Oh, no, wait: that's backwards. (I plead insufficient caffeine.)
>> Actually, this sampling rule discriminates *against* blocks with
>> narrower rows. You previously argued, correctly I think, that
>> sampling all rows on each page introduces no new bias because row
>> width cancels out across all sampled pages. However, if you just
>> include up to N rows from each page, then rows on pages with more
>> than N rows have a lower probability of being selected, but there's
>> no such bias against wider rows. This explains why you saw smaller
>> values of "i" being undersampled.
>>
>> Had you run the test series all the way up to the max number of
>> tuples per block, which is probably a couple hundred in this test,
>> I think you'd have seen the bias go away again. But the takeaway
>> point is that we have to sample all tuples per page, not just a
>> limited number of them, if we want to change it like this.
>>
>> regards, tom lane
>>
>>
> Surely we want to sample a 'constant fraction' (obviously, in practice
> you have to sample an integral number of rows in a page!) of rows per
> page? The simplest way, as Tom suggests, is to use all the rows in a
> page.
>
> However, if you wanted the same number of rows from a greater number
> of pages, you could (for example) select a quarter of the rows from
> each page. In which case, when this is a fractional number: take the
> integral number of rows, plus on extra row with a probability equal to
> the fraction (here 0.25).
>
> Either way, if it is determined that you need N rows, then keep
> selecting pages at random (but never use the same page more than once)
> until you have at least N rows.
>
>
> Cheers,
> Gavin
>
>
>
Yes, the fraction/probability could actually be one of: 0.25, 0.50, 0.75.

But there is a bias introduced by the arithmetic average size of the
rows in a page. This results in block sampling favouring large rows, as
they are in a larger proportion of pages.

For example, assume 1000 rows of 200 bytes and 1000 rows of 20 bytes,
using 400 byte pages. In the pathologically worst case, assuming
maximum packing density and no page has both types: the large rows would
occupy 500 pages and the smaller rows 50 pages. So if one selected 11
pages at random, you get about 10 pages of large rows and about one for
small rows! In practice, it would be much less extreme - for a start,
not all blocks will be fully packed, most blocks would have both types
of rows, and there is usually greater variation in row size - but still
a bias towards sampling larger rows. So somehow, this bias needs to be
counteracted.

Cheers,
Gavin


From: Kevin Grittner <kgrittn(at)ymail(dot)com>
To: Gavin Flower <GavinFlower(at)archidevsys(dot)co(dot)nz>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Stark <stark(at)mit(dot)edu>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Peter Geoghegan <pg(at)heroku(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-11 19:31:44
Message-ID: 1386790304.24146.YahooMailNeo@web162906.mail.bf1.yahoo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Gavin Flower <GavinFlower(at)archidevsys(dot)co(dot)nz> wrote:

> For example, assume 1000 rows of 200 bytes and 1000 rows of 20 bytes,
> using 400 byte pages.  In the pathologically worst case, assuming
> maximum packing density and no page has both types: the large rows would
> occupy  500 pages and the smaller rows 50 pages. So if one selected 11
> pages at random, you get about 10 pages of large rows and about one for
> small rows!

With 10 * 2 = 20 large rows, and 1 * 20 = 20 small rows.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Gavin Flower <GavinFlower(at)archidevsys(dot)co(dot)nz>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Stark <stark(at)mit(dot)edu>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Peter Geoghegan <pg(at)heroku(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-11 19:33:08
Message-ID: 52A8BDF4.4090408@archidevsys.co.nz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 12/12/13 08:14, Gavin Flower wrote:
> On 12/12/13 07:22, Gavin Flower wrote:
>> On 12/12/13 06:22, Tom Lane wrote:
>>> I wrote:
>>>> Hm. You can only take N rows from a block if there actually are at
>>>> least
>>>> N rows in the block. So the sampling rule I suppose you are using is
>>>> "select up to N rows from each sampled block" --- and that is going to
>>>> favor the contents of blocks containing narrower-than-average rows.
>>> Oh, no, wait: that's backwards. (I plead insufficient caffeine.)
>>> Actually, this sampling rule discriminates *against* blocks with
>>> narrower rows. You previously argued, correctly I think, that
>>> sampling all rows on each page introduces no new bias because row
>>> width cancels out across all sampled pages. However, if you just
>>> include up to N rows from each page, then rows on pages with more
>>> than N rows have a lower probability of being selected, but there's
>>> no such bias against wider rows. This explains why you saw smaller
>>> values of "i" being undersampled.
>>>
>>> Had you run the test series all the way up to the max number of
>>> tuples per block, which is probably a couple hundred in this test,
>>> I think you'd have seen the bias go away again. But the takeaway
>>> point is that we have to sample all tuples per page, not just a
>>> limited number of them, if we want to change it like this.
>>>
>>> regards, tom lane
>>>
>>>
>> Surely we want to sample a 'constant fraction' (obviously, in
>> practice you have to sample an integral number of rows in a page!) of
>> rows per page? The simplest way, as Tom suggests, is to use all the
>> rows in a page.
>>
>> However, if you wanted the same number of rows from a greater number
>> of pages, you could (for example) select a quarter of the rows from
>> each page. In which case, when this is a fractional number: take the
>> integral number of rows, plus on extra row with a probability equal
>> to the fraction (here 0.25).
>>
>> Either way, if it is determined that you need N rows, then keep
>> selecting pages at random (but never use the same page more than
>> once) until you have at least N rows.
>>
>>
>> Cheers,
>> Gavin
>>
>>
>>
> Yes the fraction/probability, could actually be one of: 0.25, 0.50, 0.75.
>
> But there is a bias introduced by the arithmetic average size of the
> rows in a page. This results in block sampling favouring large rows,
> as they are in a larger proportion of pages.
>
> For example, assume 1000 rows of 200 bytes and 1000 rows of 20 bytes,
> using 400 byte pages. In the pathologically worst case, assuming
> maximum packing density and no page has both types: the large rows
> would occupy 500 pages and the smaller rows 50 pages. So if one
> selected 11 pages at random, you get about 10 pages of large rows and
> about one for small rows! In practice, it would be much less extreme
> - for a start, not all blocks will be fully packed, most blocks would
> have both types of rows, and there is usually greater variation in row
> size - but still a bias towards sampling larger rows. So somehow,
> this bias needs to be counteracted.
>
>
> Cheers,
> Gavin
>
Actually, I just thought of a possible way to overcome the bias towards
large rows.

1. Calculate (a rough estimate may be sufficient, if not too 'rough')
the size of the smallest row.

2. Select a page at random (never selecting the same page twice)

3. Then select rows at random within the page (never selecting the same
row twice). For each row selected, accept it with probability
equal to (size of smallest row)/(size of selected row). I think you'll
find that will almost completely offset the bias towards larger rows!

4. If you do not have sufficient rows, and you still have pages not yet
selected, goto 2

Note that it will be normal for some pages not to have any rows
selected, especially for large tables!
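
A rough sketch of steps 2-4 (illustrative Python only; the page-as-list
representation and the row_size() helper are stand-ins for the real
on-page tuple data):

import random

def row_size(row):
    # Stand-in for the on-disk tuple length of a row.
    return len(row)

def width_corrected_sample(pages, n_needed, min_row_size):
    sample = []
    for page in random.sample(pages, len(pages)):      # step 2: random page, no repeats
        for row in random.sample(page, len(page)):     # step 3: rows in random order
            # Accept with probability (smallest row size) / (this row's size),
            # so wide rows, which spread over more pages, are not favoured.
            if random.random() < min_row_size / row_size(row):
                sample.append(row)
        if len(sample) >= n_needed:                    # step 4: enough rows yet?
            break
    return sample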

Cheers,
Gavin

P.S.
I really need to stop thinking about this problem, and get on with my
assigned project!!!


From: Gavin Flower <GavinFlower(at)archidevsys(dot)co(dot)nz>
To: Kevin Grittner <kgrittn(at)ymail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Stark <stark(at)mit(dot)edu>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Peter Geoghegan <pg(at)heroku(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-11 19:39:31
Message-ID: 52A8BF73.3050808@archidevsys.co.nz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 12/12/13 08:31, Kevin Grittner wrote:
> Gavin Flower <GavinFlower(at)archidevsys(dot)co(dot)nz> wrote:
>
>> For example, assume 1000 rows of 200 bytes and 1000 rows of 20 bytes,
>> using 400 byte pages. In the pathologically worst case, assuming
>> maximum packing density and no page has both types: the large rows would
>> occupy 500 pages and the smaller rows 50 pages. So if one selected 11
>> pages at random, you get about 10 pages of large rows and about one for
>> small rows!
> With 10 * 2 = 20 large rows, and 1 * 20 = 20 small rows.
>
> --
> Kevin Grittner
> EDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
Sorry, I've simply come up with well argued nonsense!

Kevin, you're dead right.

Cheers,
Gavin


From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Jim Nasby <jim(at)nasby(dot)net>, Greg Stark <stark(at)mit(dot)edu>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-11 19:42:13
Message-ID: CAM3SWZSJwREmjVPuH8coJwLHAWoHDD0b9=hZdSamYRW55aZ+qg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Dec 10, 2013 at 4:48 PM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
> Why would I even mention that to a statistician? We want guidance. But
> yes, I bet I could give a statistician an explanation of statistics
> target that they'd understand without too much trouble.

Actually, I think that if we told a statistician about the statistics
target, his or her response would be: why would you presume to know
ahead of time what statistics target is going to be effective? I
suspect that the basic problem is that it isn't adaptive. I think that
if we could somehow characterize the quality of our sample as we took
it, and then cease sampling when we reached a certain degree of
confidence in its quality, that would be helpful. It might not even
matter that the sample was clustered from various blocks.
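
The thread doesn't propose a concrete stopping rule; purely as an
illustration of the shape such an adaptive scheme could take, a sketch
might grow the sample in batches and stop once the tracked statistic
settles down (here simply the mean of a numeric column; what quality
measure to track for histograms, MCVs or n_distinct is exactly the open
question):

import random
import statistics

def adaptive_sample(rows, batch=1000, tol=0.01, max_batches=100):
    # Keep adding batches (drawn with replacement, for brevity) and stop
    # when the running estimate moves by less than tol (relative) between
    # rounds, or when max_batches is hit.
    sample, prev = [], None
    for _ in range(max_batches):
        sample.extend(random.choices(rows, k=batch))
        est = statistics.fmean(sample)
        if prev is not None and abs(est - prev) <= tol * abs(prev):
            break
        prev = est
    return sample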

--
Peter Geoghegan


From: Gavin Flower <GavinFlower(at)archidevsys(dot)co(dot)nz>
To: Kevin Grittner <kgrittn(at)ymail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Stark <stark(at)mit(dot)edu>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Peter Geoghegan <pg(at)heroku(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-11 20:12:20
Message-ID: 52A8C724.1050009@archidevsys.co.nz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 12/12/13 08:39, Gavin Flower wrote:
> On 12/12/13 08:31, Kevin Grittner wrote:
>> Gavin Flower <GavinFlower(at)archidevsys(dot)co(dot)nz> wrote:
>>
>>> For example, assume 1000 rows of 200 bytes and 1000 rows of 20 bytes,
>>> using 400 byte pages. In the pathologically worst case, assuming
>>> maximum packing density and no page has both types: the large rows
>>> would
>>> occupy 500 pages and the smaller rows 50 pages. So if one selected 11
>>> pages at random, you get about 10 pages of large rows and about one for
>>> small rows!
>> With 10 * 2 = 20 large rows, and 1 * 20 = 20 small rows.
>>
>> --
>> Kevin Grittner
>> EDB: http://www.enterprisedb.com
>> The Enterprise PostgreSQL Company
> Sorry, I've simply come up with well argued nonsense!
>
> Kevin, you're dead right.
>
>
> Cheers,
> Gavin
>
>
I looked at:
http://www.postgresql.org/docs/current/interactive/storage-page-layout.html
this says that each row has an overhead, which suggests there should be
a bias towards small rows.

There must be a lot of things going on, that I'm simply not aware of,
that affect sampling bias...

Cheers,
Gavin


From: Gavin Flower <GavinFlower(at)archidevsys(dot)co(dot)nz>
To: Kevin Grittner <kgrittn(at)ymail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Stark <stark(at)mit(dot)edu>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Peter Geoghegan <pg(at)heroku(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-11 21:14:24
Message-ID: 52A8D5B0.6080704@archidevsys.co.nz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 12/12/13 09:12, Gavin Flower wrote:
> On 12/12/13 08:39, Gavin Flower wrote:
>> On 12/12/13 08:31, Kevin Grittner wrote:
>>> Gavin Flower <GavinFlower(at)archidevsys(dot)co(dot)nz> wrote:
>>>
>>>> For example, assume 1000 rows of 200 bytes and 1000 rows of 20 bytes,
>>>> using 400 byte pages. In the pathologically worst case, assuming
>>>> maximum packing density and no page has both types: the large rows
>>>> would
>>>> occupy 500 pages and the smaller rows 50 pages. So if one selected 11
>>>> pages at random, you get about 10 pages of large rows and about one
>>>> for
>>>> small rows!
>>> With 10 * 2 = 20 large rows, and 1 * 20 = 20 small rows.
>>>
>>> --
>>> Kevin Grittner
>>> EDB: http://www.enterprisedb.com
>>> The Enterprise PostgreSQL Company
>> Sorry, I've simply come up with well argued nonsense!
>>
>> Kevin, you're dead right.
>>
>>
>> Cheers,
>> Gavin
>>
>>
> I looked at:
> http://www.postgresql.org/docs/current/interactive/storage-page-layout.html
>
> this says that each row has an overhead, which suggests there should
> be a bias towards small rows.
>
> There must be a lot of things going on, that I'm simply not aware of,
> that affect sampling bias...
>
>
> Cheers,
> Gavin
>
>
Ignoring overheads per row and other things... There will be a biasing
effect when the distribution of sizes is not symmetric. For example:
when the majority of rows have sizes greater than the arithmetic mean,
then most samples will be biased towards larger rows. Similarly there
could be a bias towards smaller rows when most rows are smaller than the
arithmetic mean. Yes, I did think about this in depth - but it is way
too complicated to attempt to quantify the bias, because it depends on
too many factors (even just limiting it to the distribution of row sizes).

So apart from the volatility of the table, and the pattern of
insertions/updates/deletes, there will be a bias depending on the
distribution of values in the table.

So I despair that a simple, elegant & practical algorithm will ever be
found.

Therefore, I expect the best answer is probably some kind of empirical
adaptive approach - which I think has already been suggested.

Cheers,
Gavin


From: Greg Stark <stark(at)mit(dot)edu>
To: Gavin Flower <GavinFlower(at)archidevsys(dot)co(dot)nz>
Cc: Peter Geoghegan <pg(at)heroku(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Kevin Grittner <kgrittn(at)ymail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Jim Nasby <jim(at)nasby(dot)net>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-11 22:33:40
Message-ID: CAM-w4HN__RLfYhvtyBGDYFnmRZsyQXey4kZE-NUHggnC4NT0BA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I think we're all wet here. I don't see any bias towards larger or smaller
rows. Larger tuples will be on a larger number of pages but there will be
fewer of them on any one page. The average effect should be the same.
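
Redoing the arithmetic on the packed 200-byte/20-byte example from
upthread (500 pages holding 2 large rows each, 50 pages holding 20 small
rows each) bears this out: the expected contribution per randomly sampled
page is the same for both row classes.

pages_large, rows_per_large_page = 500, 2
pages_small, rows_per_small_page = 50, 20
total_pages = pages_large + pages_small

# Expected rows of each kind contributed by one randomly chosen page:
exp_large = pages_large / total_pages * rows_per_large_page   # ~1.82
exp_small = pages_small / total_pages * rows_per_small_page   # ~1.82
print(exp_large, exp_small)

This is the same observation as Kevin's "20 large rows vs 20 small rows"
count above.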

Smaller values might have a higher variance with block based sampling than
larger values. But that actually *is* the kind of thing that Simon's
approach of just compensating with later samples can deal with.

--
greg


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Gavin Flower <GavinFlower(at)archidevsys(dot)co(dot)nz>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Stark <stark(at)mit(dot)edu>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Peter Geoghegan <pg(at)heroku(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-11 22:39:04
Message-ID: 20131211223904.GA13377@svana.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Dec 12, 2013 at 07:22:59AM +1300, Gavin Flower wrote:
> Surely we want to sample a 'constant fraction' (obviously, in
> practice you have to sample an integral number of rows in a page!)
> of rows per page? The simplest way, as Tom suggests, is to use all
> the rows in a page.
>
> However, if you wanted the same number of rows from a greater number
> of pages, you could (for example) select a quarter of the rows from
> each page. In which case, when this is a fractional number: take
> the integral number of rows, plus on extra row with a probability
> equal to the fraction (here 0.25).

In this discussion we've mostly used block = 1 PostgreSQL block of 8k.
But when reading from a disk, once you've read one block you can read
the following ones practically for free.

So I wonder if you could make your sampling always read 16 consecutive
blocks, but then use 25-50% of the tuples. That way you get many more
tuples for the same number of disk seeks.
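
A sketch of that idea (illustrative Python; read_block() is a hypothetical
helper returning the visible tuples of one block, and overlapping runs are
not handled):

import random

def chunked_block_sample(total_blocks, n_runs, run_len=16, keep_fraction=0.25):
    # Pay one seek per run of run_len consecutive blocks; the sequential
    # blocks after the first are nearly free, so thin the tuples afterwards
    # rather than thinning the blocks.
    sample = []
    run_starts = random.sample(range(total_blocks - run_len + 1), n_runs)
    for start in run_starts:
        for blockno in range(start, start + run_len):
            for tup in read_block(blockno):          # hypothetical block reader
                if random.random() < keep_fraction:
                    sample.append(tup)
    return sample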

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he does
> not attach much importance to his own thoughts.
-- Arthur Schopenhauer


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>, Gavin Flower <GavinFlower(at)archidevsys(dot)co(dot)nz>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Stark <stark(at)mit(dot)edu>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Peter Geoghegan <pg(at)heroku(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-11 22:49:18
Message-ID: 52A8EBEE.9070506@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 12/11/2013 02:39 PM, Martijn van Oosterhout wrote:
> In this discussion we've mostly used block = 1 postgresql block of 8k.
> But when reading from a disk once you've read one block you can
> basically read the following ones practically for free.
>
> So I wonder if you could make your sampling read always 16 consecutive
> blocks, but then use 25-50% of the tuples. That way you get many more
> tuples for the same amount of disk I/O seeks..

Yeah, that's what I meant by "tune this for the FS". We'll probably
have to test a lot of different "block sizes" on different FSes before
we arrive at a reasonable size, and even then I'll bet we have to offer
a GUC.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Florian Pflug <fgp(at)phlo(dot)org>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Greg Stark <stark(at)mit(dot)edu>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-12 14:39:36
Message-ID: 34A68E9C-6716-4A01-BFCF-2815F0B738D8@phlo.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Here's an analysis of Jeff Janes' simple example of a table where our
n_distinct estimate is way off.

On Dec11, 2013, at 00:03 , Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
> create table baz as select floor(random()*10000000), md5(random()::text) from generate_series(1,100000000);
> create table baz2 as select * from baz order by floor;
> create table baz3 as select * from baz order by md5(floor::text);
>
> baz unclustered, baz2 is clustered with perfect correlation, baz3 is clustered but without correlation.
>
> After analyzing all of them:
>
> select tablename, n_distinct,correlation from pg_stats where tablename like 'baz%' and attname='floor' ;
> tablename | n_distinct | correlation
> -----------+-------------+-------------
> baz | 8.56006e+06 | 0.00497713
> baz2 | 333774 | 1
> baz3 | 361048 | -0.0118147

I think I understand what's going on here. I worked with a reduced test case
of 1e7 rows containing random values between 0 and 5e5 and instrumented
analyze to print the raw ndistinct and nmultiple values of the sample
population (i.e. the number of distinct values in the sample, and the number
of distinct values which appeared more than once). I've also considered only
baz and baz2, and thus removed the then-unnecessary md5 column. To account for
the reduced table sizes, I adjusted default_statistics_target to 10 instead of
100. The resulting estimates are then

tablename | n_distinct (est) | n_distinct (act)
-----------+------------------+------------------
baz | 391685 | 500000
baz2 | 36001 | 500000

ANALYZE assumes that both tables contain 10000048 rows and samples 3000 of
those.

The sample of baz contains 2989 distinct values, 11 of which appear more than
once. The sample of baz2 contains 2878 distinct values, 117 (!) of which
appear more than once.

The very different results stem from the Duj1 estimator we use. It estimates
n_distinct by computing n*d/(n - f1 + f1*n/N) where n is the number of
samples, N the number of rows, d the number of distinct samples, and f1 the
number of distinct samples occurring exactly once. If all samples are unique
(i.e. n=d=f1) this yields N. But if f1 is less than d, the result drops very
quickly - sampling baz2 produces 117 non-unique values out of 2878 - roughly
4% - and the estimate is already less than a tenth of the one we get for baz.
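
For reference, plugging the sample counts above into that formula
reproduces both estimates (f1 being d minus the number of values seen
more than once); a quick check in Python:

def duj1(n, N, d, f1):
    # Duj1 estimator: n*d / (n - f1 + f1*n/N)
    return n * d / (n - f1 + f1 * n / N)

n, N = 3000, 10000048
print(duj1(n, N, d=2989, f1=2989 - 11))    # baz:  ~391685
print(duj1(n, N, d=2878, f1=2878 - 117))   # baz2: ~36001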

Which leaves us with the question why sampling baz2 produces more duplicate
values than sampling baz does. Now, if we used block sampling, that behaviour
would be unsurprising - baz2 is sorted, so each block very likely contains
each value more than once, since the row count exceeds the number of possible
values by more than an order of magnitude. In fact, with block sampling, we'd
probably see f1 values close to 0 and thus our estimate of n_distinct would be
roughly equal to the number of distinct values in the *sample* population,
i.e. about 300 or so.

So why does Vitter's algorithm fail here? Given that we see inflated numbers
of duplicate values in our sample, yet still far fewer than block-based
sampling would yield, my guess is that it's the initial reservoir filling that
bites us here. After that initial filling step, the reservoir contains a lot of
consecutive rows, i.e. a block-based sample taken from just a few blocks. If
the replacement phase that follows somehow uses a slightly smaller replacement
probability than it should, more of these rows will survive replacement,
resulting in exactly the kind of inflated numbers of duplicate values we're
seeing. I've yet to validate this by looking at the reservoir before and after
the replacement stage, though.

So at least for the purpose of estimating n_distinct, our current sampling
method seems to exhibit the worst rather than the best properties of block-
and row- based sampling. What conclusions to draw of that, I'm not sure yet -
other that if we move to block-based sampling, we'll certainly have to change
the way we estimate n_distinct.

best regards,
Florian Pflug


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: Gavin Flower <GavinFlower(at)archidevsys(dot)co(dot)nz>, Peter Geoghegan <pg(at)heroku(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Kevin Grittner <kgrittn(at)ymail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Jim Nasby <jim(at)nasby(dot)net>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-12 17:07:31
Message-ID: CAMkU=1ykPnfJUh=S_6GDAfgY9eF=odmkOLYUWwuhgBZeVzrvnA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Dec 11, 2013 at 2:33 PM, Greg Stark <stark(at)mit(dot)edu> wrote:

>
> I think we're all wet here. I don't see any bias towards larger or smaller
> rows. Larger tied will be on a larger number of pages but there will be
> fewer of them on any one page. The average effect should be the same.
>
> Smaller values might have a higher variance with block based sampling than
> larger values. But that actually *is* the kind of thing that Simon's
> approach of just compensating with later samples can deal with.
>

I think that looking at all rows in randomly-chosen blocks will not bias
size, or histograms. But it will bias n_distinct and MCV for some data
distributions, unless we find some way to compensate for it.

But even for avg size and histograms, what does block sampling get us? We
get larger sample sizes for the same IO, but those samples are less
independent (assuming data is not randomly scattered over the table), so the
"effective sample size" is less than the true sample size. So we can't
just sample 100 times fewer blocks because there are about 100 rows per
block--doing so would not bias our avg size or histogram boundaries, but it
would certainly make them noisier.

Cheers,

Jeff


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Florian Pflug <fgp(at)phlo(dot)org>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Greg Stark <stark(at)mit(dot)edu>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-12 17:44:39
Message-ID: CAMkU=1wRH_jopyCAyUKbdQY4DWhsx1-1e2s0VVgfrryfXDe2SQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Dec 12, 2013 at 6:39 AM, Florian Pflug <fgp(at)phlo(dot)org> wrote:

> Here's an analysis of Jeff Janes' simple example of a table where our
> n_distinct estimate is way off.
>
> On Dec11, 2013, at 00:03 , Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
> > create table baz as select floor(random()*10000000), md5(random()::text)
> from generate_series(1,100000000);
> > create table baz2 as select * from baz order by floor;
> > create table baz3 as select * from baz order by md5(floor::text);
> >
> > baz unclustered, baz2 is clustered with perfect correlation, baz3 is
> clustered but without correlation.
> >
> > After analyzing all of them:
> >
> > select tablename, n_distinct,correlation from pg_stats where tablename
> like 'baz%' and attname='floor' ;
> > tablename | n_distinct | correlation
> > -----------+-------------+-------------
> > baz | 8.56006e+06 | 0.00497713
> > baz2 | 333774 | 1
> > baz3 | 361048 | -0.0118147
>
> I think I understand what's going on here. I worked with a reduced test
> cases
> of 1e7 rows containing random values between 0 and 5e5 and instrumented
> analyze to print the raw ndistinct and nmultiple values of the sample
> population (i.e. the number of distinct values in the sample, and the
> number
> of distinct values which appeared more than once). I've also considered
> only
> baz and baz2, and thus removed the than unnecessary md5 column. To account
> for
> the reduced table sizes, I adjusted default_statistics_target to 10
> instead of
> 100. The resulting estimates are then
>
> tablename | n_distinct (est) | n_distinct (act)
> -----------+------------------+------------------
> baz | 391685 | 500000
> baz2 | 36001 | 500000
>
> ANALYZE assumes that both tables contain 10000048 rows and samples 3000 of
> those.
>
> The sample of baz contains 2989 distinct values, 11 of which appear more
> than
> once. The sample of baz2 contains 2878 distinct values, 117 (!) of which
> appear more than once.
>
> The very different results stem from the Duj1 estimator we use. It
> estimates
> n_distinct by computing n*d/(n - f1 + f1*n/N) where n is the number of
> samples, N the number of rows, d the number of distinct samples, and f1 the
> number of distinct samples occurring exactly once. If all samples are
> unique
> (i.e. n=d=f1) this yields N. But if f1 is less than d, the results drops
> very
> quickly - sampling baz2 produces 117 non-unique values out of 2878 -
> roughly
> 0.03% - and the estimate already less than a 1/10 of what it would be if f1
> where 0.
>
> Which leaves us with the question why sampling baz2 produces more duplicate
> values than sampling baz does. Now, if we used block sampling, that
> behaviour
> would be unsurprising - baz2 is sorted, so each block very likely contains
> each value more than since, since the row count exceeds the number of
> possible
> values by more than a magnitude. In fact, with block sampling, we'd
> probably
> see f1 values close to 0 and thus our estimate of n_distinct would be
> roughly
> equal to the number of distinct values in the *sample* population, i.e.
> about
> 300 or so.
>
> So why does vitter's algorithm fail here? Given that we see inflated
> numbers
> of duplicate values in our sample, yet still far fewer than block-based
> sampling would yield, my guess is that it's the initial reservoir filling
> that
> bites us here. After that initial filling step, the reservoir contains a
> lot of
> consecutive rows, i.e. a block-based sample taken from just a few blocks.
> If
> the replacement phase that follows somehow uses a slightly smaller
> replacement
> probability than it should, more of these rows will survive replacement,
> resulting in exactly the kind of inflated numbers of duplicate values we're
> seeing. I've yet to validate this by looking at the reservoir before and
> after
> the replacement stage, though.
>

I think the problem is more subtle than that. It is easier to visualize it
if you think of every block having the same number of rows, with that
number being fairly large. If you pick 30,000 rows at random from
1,000,000 blocks, the number of rows chosen from any given block should be
close to following a Poisson distribution with an average of 0.03, which
means about 29113 blocks should have exactly 1 row chosen from them while
441 would have two or more rows chosen from them.

But if you instead select 30,000 rows from 30,000 blocks, which is what we
ask Vitter's algorithm to do, you get about a Poisson distribution with an
average of 1.0. Then about 11036 blocks have exactly one row chosen from
them, and 7927 blocks have two or more rows sampled from them. Another
11,036 blocks get zero rows selected from them due to Vitter, in addition
to the 970,000 that didn't even get submitted to Vitter in the first place.
That is why you see too many duplicates for clustered data, as too many
blocks are sampled multiple times.
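
The Poisson figures are easy to reproduce (a quick sketch, not the actual
Vitter code):

from math import exp

def poisson_block_counts(blocks, rows_wanted):
    lam = rows_wanted / blocks            # mean rows chosen per block
    p0, p1 = exp(-lam), lam * exp(-lam)
    # expected blocks with zero rows, exactly one row, and two or more rows
    return blocks * p0, blocks * p1, blocks * (1 - p0 - p1)

print(poisson_block_counts(1_000_000, 30_000))  # ~29113 with one row, ~441 with 2+
print(poisson_block_counts(30_000, 30_000))     # ~11036 zero, ~11036 one, ~7927 with 2+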

The Poisson argument doesn't apply cleanly when blocks have a variable number
of rows, but the general principle still applies. Too few blocks have
exactly one row sampled from them, and too many blocks have either zero
rows or more than one row.

It would be relatively easy to fix this if we trusted the number of visible
rows in each block to be fairly constant. But without that assumption, I
don't see a way to fix the sample selection process without reading the
entire table.

> So at least for the purpose of estimating n_distinct, our current sampling
> method seems to exhibit the worst rather than the best properties of block-
> and row- based sampling. What conclusions to draw of that, I'm not sure
> yet -
> other that if we move to block-based sampling, we'll certainly have to
> change
> the way we estimate n_distinct.
>

Perhaps we should consider changing that even if we don't change to block
based sampling--but what would the other candidates be? I guess the same
paper that presented Duj1 would be a good place to start, but are there
others? It sounds like this was discussed several years ago, but I didn't
find it in the archives with a casual search.

Cheers,

Jeff


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Florian Pflug <fgp(at)phlo(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Greg Stark <stark(at)mit(dot)edu>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-12 18:29:40
Message-ID: 24713.1386872980@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Jeff Janes <jeff(dot)janes(at)gmail(dot)com> writes:
> It would be relatively easy to fix this if we trusted the number of visible
> rows in each block to be fairly constant. But without that assumption, I
> don't see a way to fix the sample selection process without reading the
> entire table.

Yeah, varying tuple density is the weak spot in every algorithm we've
looked at. The current code is better than what was there before, but as
you say, not perfect. You might be entertained to look at the threads
referenced by the patch that created the current sampling method:
http://www.postgresql.org/message-id/1tkva0h547jhomsasujt2qs7gcgg0gtvrp@email.aon.at

particularly
http://www.postgresql.org/message-id/flat/ri5u70du80gnnt326k2hhuei5nlnimonbs(at)email(dot)aon(dot)at#ri5u70du80gnnt326k2hhuei5nlnimonbs@email.aon.at

However ... where this thread started was not about trying to reduce
the remaining statistical imperfections in our existing sampling method.
It was about whether we could reduce the number of pages read for an
acceptable cost in increased statistical imperfection.

regards, tom lane


From: Claudio Freire <klaussfreire(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Florian Pflug <fgp(at)phlo(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Greg Stark <stark(at)mit(dot)edu>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-12 18:33:49
Message-ID: CAGTBQpbFhAMnJcA8qOj1-0AgjVM6+L2d1nUjZRyYqMv5Sjrjtw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Dec 12, 2013 at 3:29 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Jeff Janes <jeff(dot)janes(at)gmail(dot)com> writes:
>> It would be relatively easy to fix this if we trusted the number of visible
>> rows in each block to be fairly constant. But without that assumption, I
>> don't see a way to fix the sample selection process without reading the
>> entire table.
>
> Yeah, varying tuple density is the weak spot in every algorithm we've
> looked at. The current code is better than what was there before, but as
> you say, not perfect. You might be entertained to look at the threads
> referenced by the patch that created the current sampling method:
> http://www.postgresql.org/message-id/1tkva0h547jhomsasujt2qs7gcgg0gtvrp@email.aon.at
>
> particularly
> http://www.postgresql.org/message-id/flat/ri5u70du80gnnt326k2hhuei5nlnimonbs(at)email(dot)aon(dot)at#ri5u70du80gnnt326k2hhuei5nlnimonbs@email.aon.at
>
>
> However ... where this thread started was not about trying to reduce
> the remaining statistical imperfections in our existing sampling method.
> It was about whether we could reduce the number of pages read for an
> acceptable cost in increased statistical imperfection.

Well, why not take a supersample containing all visible tuples from N
selected blocks, and do bootstrapping over it, with subsamples of M
independent rows each?
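
In outline (an illustrative sketch; the statistic argument is whatever
estimator is being bootstrapped, and resampling is done with replacement,
as is usual for the bootstrap):

import random

def bootstrap_estimates(supersample, m, rounds, statistic):
    # supersample: all visible tuples read from the N sampled blocks.
    # Resample m rows per round and look at the spread of the statistic.
    return [statistic(random.choices(supersample, k=m)) for _ in range(rounds)]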


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-12 18:56:42
Message-ID: 52AA06EA.1030508@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 12/12/2013 10:33 AM, Claudio Freire wrote:
> Well, why not take a supersample containing all visible tuples from N
> selected blocks, and do bootstrapping over it, with subsamples of M
> independent rows each?

Well, we still need to look at each individual block to determine
grouping correlation. Let's take a worst case example: imagine a table
has *just* been created by:

CREATE TABLE newdata AS SELECT * FROM olddata ORDER BY category, item;

If "category" is fairly low cardinality, then grouping will be severe;
we can reasonably expect that if we sample 100 blocks, many of them will
have only one category value present. The answer to this is to make our
block samples fairly widely spaced and compare them.

In this simplified example, if the table had 1000 blocks, we would take
blocks 1,101,201,301,401,etc. Then we would compare the number and
content of values found on each block with the number and content found
on each other block. For example, if we see that block 101 is entirely
the category "cats", and block 701 is entirely the category "shopping"
and block 901 is split 60/40 between the categories "transportation" and
"voting", then we can assume that the level of grouping is very high,
and the number of unknown values we haven't seen is also high.

Whereas if 101 is "cats" and 201 is "cats" and 301 through 501 are
"cats" with 2% other stuff, then we assume that the level of grouping is
moderate and it's just the case that most of the dataset is "cats".
Which means that the number of unknown values we haven't seen is low.

Whereas if 101, 201, 501, and 901 have near-identical distributions of
values, we assume that the level of grouping is very low, and that there
are very few values we haven't seen.
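
The thread doesn't pin down a concrete measure for this; purely as an
illustration, one crude way to turn "compare the number and content of
values found on each block" into a number is the average overlap between
the per-block value sets (low overlap suggesting heavy grouping and many
unseen values; the Jaccard metric here is an assumption, not a proposal
from the thread):

from itertools import combinations

def block_overlap(block_values):
    # block_values: one set of observed values per widely spaced block.
    # Average pairwise Jaccard similarity: near 0 means each block carries
    # its own values (heavy grouping); near 1 means the blocks look alike.
    sims = [len(a & b) / len(a | b) for a, b in combinations(block_values, 2)]
    return sum(sims) / len(sims)

print(block_overlap([{"cats"}, {"shopping"}, {"transportation", "voting"}]))  # 0.0
print(block_overlap([{"cats"}, {"cats"}, {"cats", "other"}]))                 # ~0.67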

As someone else pointed out, full-block (the proposal) vs. random-row
(our current style) doesn't have a very significant effect on estimates
of Histograms and nullfrac, as long as the sampled blocks are widely
spaced. Well, nullfrac is affected in the extreme example of a totally
ordered table where the nulls are all in one block, but I'll point out
that we can (and do) also miss that using our current algo.

Estimated grouping should, however, affect MCVs. In cases where we
estimate that grouping levels are high, the expected % of observed
values should be "discounted" somehow. That is, with total random
distribution you have a 1:1 ratio between observed frequency of a value
and assumed frequency. However, with highly grouped values, you might
have a 2:1 ratio.

Again, more math (backed by statistical analysis) is needed.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Claudio Freire <klaussfreire(at)gmail(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-12 19:13:23
Message-ID: CAGTBQpY+znTQOujv3yV38NzF06OnK_wakudNb+ZRRL8JsG6QBQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Dec 12, 2013 at 3:56 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>
> Estimated grouping should, however, affect MCVs. In cases where we
> estimate that grouping levels are high, the expected % of observed
> values should be "discounted" somehow. That is, with total random
> distribution you have a 1:1 ratio between observed frequency of a value
> and assumed frequency. However, with highly grouped values, you might
> have a 2:1 ratio.

Cross validation can help there. But it's costly.


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Claudio Freire <klaussfreire(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Florian Pflug <fgp(at)phlo(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Greg Stark <stark(at)mit(dot)edu>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-12 19:13:55
Message-ID: CAMkU=1zt9Z6qTuyX8BGn1+5P8dy26C-T9pnuZTvoTP21mKmsHw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Dec 12, 2013 at 10:33 AM, Claudio Freire <klaussfreire(at)gmail(dot)com>wrote:

> On Thu, Dec 12, 2013 at 3:29 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> > Jeff Janes <jeff(dot)janes(at)gmail(dot)com> writes:
> >> It would be relatively easy to fix this if we trusted the number of
> visible
> >> rows in each block to be fairly constant. But without that assumption,
> I
> >> don't see a way to fix the sample selection process without reading the
> >> entire table.
> >
> > Yeah, varying tuple density is the weak spot in every algorithm we've
> > looked at. The current code is better than what was there before, but as
> > you say, not perfect. You might be entertained to look at the threads
> > referenced by the patch that created the current sampling method:
> >
> http://www.postgresql.org/message-id/1tkva0h547jhomsasujt2qs7gcgg0gtvrp@email.aon.at
> >
> > particularly
> >
> http://www.postgresql.org/message-id/flat/ri5u70du80gnnt326k2hhuei5nlnimonbs(at)email(dot)aon(dot)at#ri5u70du80gnnt326k2hhuei5nlnimonbs@email.aon.at
>

Thanks, I will read those.

> >
> >
> > However ... where this thread started was not about trying to reduce
> > the remaining statistical imperfections in our existing sampling method.
> > It was about whether we could reduce the number of pages read for an
> > acceptable cost in increased statistical imperfection.
>

I think it is pretty clear that n_distinct at least, and probably MCV,
would be a catastrophe under some common data distribution patterns if we
sample all rows in each block without changing our current computation
method. If we come up with a computation that works for that type of
sampling, it would probably be an improvement under our current sampling as
well. If we find such a thing, I wouldn't want it to get rejected just
because the larger block-sampling method change did not make it in.

> Well, why not take a supersample containing all visible tuples from N
> selected blocks, and do bootstrapping over it, with subsamples of M
> independent rows each?
>

Bootstrapping methods generally do not work well when ties are significant
events, i.e. when two values being identical is meaningfully different from
them being very close but not identical.

Cheers,

Jeff


From: Claudio Freire <klaussfreire(at)gmail(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Florian Pflug <fgp(at)phlo(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Greg Stark <stark(at)mit(dot)edu>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-12 19:20:50
Message-ID: CAGTBQpaRYAedJS=zJHnitiJj8pCFzoDR-FipuAwEoGEw1Ez=vw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Dec 12, 2013 at 4:13 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
>> Well, why not take a supersample containing all visible tuples from N
>> selected blocks, and do bootstrapping over it, with subsamples of M
>> independent rows each?
>
>
> Bootstrapping methods generally do not work well when ties are significant
> events, i.e. when two values being identical is meaningfully different from
> them being very close but not identical.

Yes, that's why I meant to say (but I see now that I didn't) that it
wouldn't do much for n_distinct, just the histogram.


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-12 20:11:40
Message-ID: CAMkU=1z1eY7YtbqzNQWNu-sejniQWiq7aorqOwuUyj5E4Rvhdw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Dec 3, 2013 at 3:30 PM, Greg Stark <stark(at)mit(dot)edu> wrote:

> At multiple conferences I've heard about people trying all sorts of
> gymnastics to avoid ANALYZE which they expect to take too long and
> consume too much I/O. This is especially a big complain after upgrades
> when their new database performs poorly until the new statistics are
> in and they did pg_upgrade to avoid an extended downtime and complain
> about ANALYZE taking hours.
>

Out of curiosity, are they using the 3 stage script
"analyze_new_cluster.sh"? If so, is the complaint that even the first
rounds are too slow, or that the database remains unusable until the last
round is complete?

Cheers,

Jeff


From: Florian Pflug <fgp(at)phlo(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Greg Stark <stark(at)mit(dot)edu>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-12 22:25:22
Message-ID: D1C450BF-DFBF-4330-8C7B-5F8A69BE49D1@phlo.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Dec12, 2013, at 19:29 , Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> However ... where this thread started was not about trying to reduce
> the remaining statistical imperfections in our existing sampling method.
> It was about whether we could reduce the number of pages read for an
> acceptable cost in increased statistical imperfection.

True, but Jeff's case shows that even the imperfections of the current
sampling method are larger than what the n_distinct estimator expects.

Making it even more biased will thus require us to rethink how we
obtain a n_distinct estimate or something equivalent. I don't mean that
as an argument against changing the sampling method, just as something
to watch out for.

best regards,
Florian Pflug


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Claudio Freire <klaussfreire(at)gmail(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Greg Stark <stark(at)mit(dot)edu>, Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>, Josh Berkus <josh(at)agliodbs(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-16 22:06:08
Message-ID: CAMkU=1w6qOQadwCwVCMkGGpGptfe2R6k5Cc0JzseARt49Z5LPQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Dec 9, 2013 at 3:14 PM, Heikki Linnakangas
<hlinnakangas(at)vmware(dot)com>wrote:

> I took a stab at using posix_fadvise() in ANALYZE. It turned out to be
> very easy, patch attached. Your mileage may vary, but I'm seeing a nice
> gain from this on my laptop. Taking a 30000 page sample of a table with
> 717717 pages (ie. slightly larger than RAM), ANALYZE takes about 6 seconds
> without the patch, and less than a second with the patch, with
> effective_io_concurrency=10. If anyone with a good test data set loaded
> would like to test this and post some numbers, that would be great.

Performance is often chaotic near transition points, so I try to avoid data
sets that are slightly bigger or slightly smaller than RAM (or some other
limit).

Do you know how many io channels your SSD has (or whatever the term of art
is for SSD drives)?

On a RAID with 12 spindles, analyzing pgbench_accounts at scale 1000 (13GB)
with 4 GB of RAM goes from ~106 seconds to ~19 seconds.

However, I'm not sure what problem we want to solve here. I certainly
would not wish to give a background maintenance process permission to
confiscate my entire RAID throughput for its own operation. Perhaps this
could only be active for explicit analyze, and only if vacuum_cost_delay=0?

Perhaps there should be something like "alter background role autovac set
...". Otherwise we are going to end up with an "autovacuum_*" shadow
parameter for many of our parameters, see "autovacuum_work_mem" discussions.

Cheers,

Jeff


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Claudio Freire <klaussfreire(at)gmail(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Greg Stark <stark(at)mit(dot)edu>, Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>, Josh Berkus <josh(at)agliodbs(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-17 16:54:43
Message-ID: 52B081D3.9090907@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 12/17/2013 12:06 AM, Jeff Janes wrote:
> On Mon, Dec 9, 2013 at 3:14 PM, Heikki Linnakangas
> <hlinnakangas(at)vmware(dot)com>wrote:
>
>> I took a stab at using posix_fadvise() in ANALYZE. It turned out to be
>> very easy, patch attached. Your mileage may vary, but I'm seeing a nice
>> gain from this on my laptop. Taking a 30000 page sample of a table with
>> 717717 pages (ie. slightly larger than RAM), ANALYZE takes about 6 seconds
>> without the patch, and less than a second with the patch, with
>> effective_io_concurrency=10. If anyone with a good test data set loaded
>> would like to test this and post some numbers, that would be great.
>
> Performance is often chaotic near transition points, so I try to avoid data
> sets that are slightly bigger or slightly smaller than RAM (or some other
> limit).
>
> Do you know how many io channels your SSD has (or whatever the term of art
> is for SSD drives)?

No idea. It's an Intel 335.

> On a RAID with 12 spindles, analyzing pgbench_accounts at scale 1000 (13GB)
> with 4 GB of RAM goes from ~106 seconds to ~19 seconds.
>
> However, I'm not sure what problem we want to solve here.

The case that Greg Stark mentioned in the email starting this thread is
doing a database-wide ANALYZE after an upgrade. In that use case, you
certainly want to get it done as quickly as possible, using all the
available resources.

> I certainly would not wish to give a background maintenance process
> permission to confiscate my entire RAID throughput for its own
> operation.

Then don't set effective_io_concurrency. If you're worried about that,
you probably wouldn't want any other process to monopolize the RAID
array either.

> Perhaps this could only be active for explicit analyze, and only if
> vacuum_cost_delay=0?

That would be a bit weird, because ANALYZE in general doesn't obey
vacuum_cost_delay. Maybe it should, though...

> Perhaps there should be something like "alter background role autovac set
> ...". Otherwise we are going to end up with an "autovacuum_*" shadow
> parameter for many of our parameters, see "autovacuum_work_mem" discussions.

Yeah, so it seems.

- Heikki


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2014-03-09 01:55:18
Message-ID: 20140309015518.GB32380@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


I assume we never came up with a TODO from this thread:

---------------------------------------------------------------------------

On Tue, Dec 3, 2013 at 11:30:44PM +0000, Greg Stark wrote:
> At multiple conferences I've heard about people trying all sorts of
> gymnastics to avoid ANALYZE which they expect to take too long and
> consume too much I/O. This is especially a big complain after upgrades
> when their new database performs poorly until the new statistics are
> in and they did pg_upgrade to avoid an extended downtime and complain
> about ANALYZE taking hours.
>
> I always gave the party line that ANALYZE only takes a small
> constant-sized sample so even very large tables should be very quick.
> But after hearing the same story again in Heroku I looked into it a
> bit further. I was kind of shocked but the numbers.
>
> ANALYZE takes a sample of 300 * statistics_target rows. That sounds
> pretty reasonable but with default_statistics_target set to 100 that's
> 30,000 rows. If I'm reading the code right It takes this sample by
> sampling 30,000 blocks and then (if the table is large enough) taking
> an average of one row per block. Each block is 8192 bytes so that
> means it's reading 240MB of each table.That's a lot more than I
> realized.
>
> It means if your table is anywhere up to 240MB you're effectively
> doing a full table scan and then throwing out nearly all the data
> read.
>
> Worse, my experience with the posix_fadvise benchmarking is that on
> spinning media reading one out of every 16 blocks takes about the same
> time as reading them all. Presumably this is because the seek time
> between tracks dominates and reading one out of every 16 blocks is
> still reading every track. So in fact if your table is up to about
> 3-4G ANALYZE is still effectively going to do a full table scan, at
> least as far as I/O time goes.
>
> The current algorithm seems like it was designed with a 100G+ table in
> mind but the consequences on the more common 100M-100G tables weren't
> really considered. Consider what this means for partitioned tables. If
> they partition their terabyte table into 10 partitions ANALYZE will
> suddenly want to use 10x as much I/O which seems like a perverse
> consequence.
>
> I'm not sure I have a prescription but my general feeling is that
> we're spending an awful lot of resources going after a statistically
> valid sample when we can spend a lot less resources and get something
> 90% as good. Or if we're really going to read that much data that we
> might as well use more of the rows we find.
>
> --
> greg
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ Everyone has their own god. +