Re: ANALYZE sampling is too good

From: Greg Stark <stark(at)mit(dot)edu>
To: Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-08 18:14:51
Message-ID: CAM-w4HODF+=Rv8XhUefCJjPDyfFVOMqetWRQUG75S+dMtxD3xw@mail.gmail.com
Lists: pgsql-hackers

On Sun, Dec 8, 2013 at 12:06 AM, Mark Kirkwood
<mark(dot)kirkwood(at)catalyst(dot)net(dot)nz> wrote:
>
> bench=# ANALYZE pgbench_accounts;
> NOTICE: acquire sample will need 30000 blocks
> NOTICE: sampled 30000 blocks
> ANALYZE
> Time: 10059.446 ms
> bench=# \q

I did some experimenting here as well.

I hacked up a version of analyze.c that adds a GUC for the number of
rows to sample per block. Setting it to 1 doesn't modify the behaviour
at all, whereas setting it to 4 divides the number of blocks to sample
by 4, which causes it to do less I/O and use more rows from each block.
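
To make the tradeoff concrete, here's the back-of-the-envelope
arithmetic as a small standalone C program -- not the patched
analyze.c, just the same sums, assuming the default 30000-row target
(300 * statistics_target), 8kB blocks, and the roughly 63 live tuples
per page the table below works out to:

#include <stdio.h>

int
main(void)
{
    /* Assumed figures, taken from the numbers elsewhere in this mail */
    const int       targrows = 30000;       /* 300 * statistics_target */
    const int       rel_pages = 158756;     /* pgbench_accounts, scale 100 */
    const double    tuples_per_page = 63.0; /* ~10M rows / 158756 pages */
    const int       settings[] = {1, 4, 16, 64};

    for (int i = 0; i < 4; i++)
    {
        int     rpb = settings[i];
        /* blocks needed to hit the row target, capped at the relation size */
        int     blocks = (targrows + rpb - 1) / rpb;
        /* can't take more rows from a block than it actually holds */
        double  take = rpb < tuples_per_page ? rpb : tuples_per_page;

        if (blocks > rel_pages)
            blocks = rel_pages;

        printf("rows_per_block=%2d: ~%5d blocks read (%3.0f MB, ~%7.0f rows seen), "
               "~%5.0f rows sampled\n",
               rpb, blocks, blocks * 8.0 / 1024.0,
               blocks * tuples_per_page, blocks * take);
    }
    return 0;
}

That lines up closely with the 30000/7501/1876/469 block counts and the
slightly short 29544-row sample in the ANALYZE output below.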

I then initialized pgbench with scale factor 100, but modified the code
so that the actual pgbench run used scale factor 1. In other words, I
ran a lot of updates on 1% of the database but left the other 99%
untouched from the initial load.

Then I ran "ANALYZE VERBOSE pgbench_accounts" with rows_per_block set
to 1, 4, 16, and 64. The latter is slightly larger than the average
number of tuples per block, so the resulting sample is actually
slightly short.

The whole accounts table is 1.2GB and contains 10 million rows. As
expected, with rows_per_block set to 1 it reads 240MB of that,
containing nearly 2 million rows (and takes nearly 20s -- a full table
scan for select count(*) takes only about 5s):

stark=# analyze verbose pgbench_accounts;
INFO: analyzing "public.pgbench_accounts"
INFO: "pgbench_accounts": scanned 30000 of 158756 pages, containing
1889701 live rows and 0 dead rows; 30000 rows in sample, 10000036
estimated total rows
ANALYZE
Time: 19468.987 ms

With rows_per_block=4 it reads only a quarter as many blocks but it's
not much faster:

stark=# analyze verbose pgbench_accounts;
INFO: analyzing "public.pgbench_accounts"
INFO: "pgbench_accounts": scanned 7501 of 158756 pages, containing
472489 live rows and 0 dead rows; 30000 rows in sample, 10000037
estimated total rows
ANALYZE
Time: 17062.331 ms

But with rows_per_block=16 it's much faster, at 6.7s:

stark=# set statistics_rows_per_block = 16;
SET
Time: 1.583 ms
stark=# analyze verbose pgbench_accounts;
INFO: analyzing "public.pgbench_accounts"
INFO: "pgbench_accounts": scanned 1876 of 158756 pages, containing
118163 live rows and 0 dead rows; 30000 rows in sample, 10000031
estimated total rows
ANALYZE
Time: 6694.331 ms

And with rows_per_block=64 it's under 2s:

stark=# set statistics_rows_per_block = 64;
SET
Time: 0.693 ms
stark=# analyze verbose pgbench_accounts;
INFO: analyzing "public.pgbench_accounts"
INFO: "pgbench_accounts": scanned 469 of 158756 pages, containing
29544 live rows and 0 dead rows; 29544 rows in sample, 10000033
estimated total rows
ANALYZE
Time: 1937.055 ms

The estimate of the total rows is just as accurate in every case. (It
seems to be consistently slightly high though, which is a bit
disconcerting.)

However, looking at the actual pg_stats entries, the stats are
noticeably less accurate for the "blockier" samples. The "bid" column
actually has 100 distinct values, and so with a statistics_target of
100 each value should appear in the MCV list with a frequency of about
.01.

With rows_per_block=1 the MCV frequency list ranges from .0082 to .0123
With rows_per_block=4 the MCV frequency list ranges from .0063 to .0125
With rows_per_block=16 the MCV frequency list ranges from .0058 to .0164
With rows_per_block=64 the MCV frequency list ranges from .0021 to .0213
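
As a yardstick: if the 30000 sampled rows were fully independent draws,
a true frequency of .01 would be estimated with a binomial standard
error of about sqrt(.01 * .99 / 30000), i.e. roughly .0006. A trivial
check of that figure (just the textbook formula, nothing from the
patched build):

#include <math.h>
#include <stdio.h>

int
main(void)
{
    const double    p = 0.01;       /* true frequency of each bid */
    const double    n = 30000.0;    /* rows in the sample */
    double          se = sqrt(p * (1.0 - p) / n);

    printf("standard error = %.5f\n", se);
    printf("+/- 2 standard errors: %.4f .. %.4f\n", p - 2 * se, p + 2 * se);
    return 0;
}

The rows_per_block=1 spread is roughly in that ballpark, while the
blockier spreads are several times wider than independent sampling
would explain.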

I'm not really sure whether or not this is due to the blocky sample
combined with the skewed pgbench run. It doesn't seem to be
consistently biasing towards or against bid 1, which I believe covers
the only rows that would have been touched by pgbench. Still, it's
suspicious that the frequencies seem to be getting consistently less
accurate as the blockiness increases.
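
One way to tell those two explanations apart is a quick simulation that
ignores the pgbench updates entirely: lay bid out in contiguous runs of
100000 rows (which is how pgbench -i loads it), assume ~63 rows per
block, sample enough random blocks to collect ~30000 rows taking
rows_per_block rows from each, and see how wide the per-bid frequency
estimates come out. A rough, self-contained sketch -- the layout and
tuples-per-block figure are assumptions, and it's nothing like the real
acquire_sample_rows logic:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NROWS        10000000L
#define ROWS_PER_BID 100000L       /* bid is constant over runs this long */
#define NBIDS        100
#define TUPS_PER_BLK 63L
#define TARGROWS     30000

int
main(void)
{
    long    nblocks = NROWS / TUPS_PER_BLK;
    int     settings[] = {1, 4, 16, 64};

    srand((unsigned) time(NULL));

    for (int s = 0; s < 4; s++)
    {
        int     rpb = settings[s];
        int     take = rpb < TUPS_PER_BLK ? rpb : (int) TUPS_PER_BLK;
        long    sample_blocks = (TARGROWS + rpb - 1) / rpb;
        long    counts[NBIDS] = {0};
        long    total = 0;

        for (long i = 0; i < sample_blocks; i++)
        {
            /* pick a random block; its ~63 rows nearly always share one bid */
            long    blk = (long) ((double) rand() / ((double) RAND_MAX + 1) * nblocks);

            for (int j = 0; j < take; j++)
            {
                long    row = blk * TUPS_PER_BLK + rand() % TUPS_PER_BLK;
                int     bid = (int) (row / ROWS_PER_BID);

                counts[bid]++;
                total++;
            }
        }

        double  minf = 1.0, maxf = 0.0;

        for (int b = 0; b < NBIDS; b++)
        {
            double  f = (double) counts[b] / total;

            if (f < minf) minf = f;
            if (f > maxf) maxf = f;
        }
        printf("rows_per_block=%2d: frequencies range from %.4f to %.4f\n",
               rpb, minf, maxf);
    }
    return 0;
}

If the spread there widens with rows_per_block in the same way as the
pg_stats numbers above, that would point at the block-level clustering
of bid rather than the skewed updates.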

I've attached the results of pg_stats following the analyze with the
various levels of "blockiness".

--
greg

Attachment Content-Type Size
pgbench_stats.txt text/plain 32.4 KB
