Quick Links

Re: Improving N-Distinct estimation by ANALYZE

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	Greg Stark <gsstark(at)mit(dot)edu>
Cc:	"Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>, josh(at)agliodbs(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Improving N-Distinct estimation by ANALYZE
Date:	2006-01-09 10:28:32
Message-ID:	1136802512.21025.386.camel@localhost.localdomain
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Fri, 2006-01-06 at 16:13 -0500, Greg Stark wrote:
> "Jim C. Nasby" <jnasby(at)pervasive(dot)com> writes:
>
> > Before we start debating merits of proposals based on random reads, can
> > someone confirm that the sampling code actually does read randomly? I
> > looked at it yesterday; there is a comment that states that blocks to be
> > scanned are passed to the analyze function in physical order, and AFAICT
> > the function that chooses blocks does so based strictly on applying a
> > probability function to block numbers as it increments a counter. It
> > seems that any reading is actually sequential and not random, which
> > makes all the random_page_cost hand-waving null and void.
>
> Hm. I'm curious just how much that behaves like a sequential scan actually. I
> think I'll do some experiments.
>
> Reading 1% (1267 read, 126733 skipped): 7748264us
> Reading 2% (2609 read, 125391 skipped): 12672025us
> Reading 5% (6502 read, 121498 skipped): 19005678us
> Reading 5% (6246 read, 121754 skipped): 18509770us
> Reading 10% (12975 read, 115025 skipped): 19305446us
> Reading 20% (25716 read, 102284 skipped): 18147151us
> Reading 50% (63656 read, 64344 skipped): 18089229us
> Reading 100% (128000 read, 0 skipped): 18173003us
>
> These numbers don't make much sense to me. It seems like 5% is about as slow
> as reading the whole file which is even worse than I expected. I thought I was
> being a bit pessimistic to think reading 5% would be as slow as reading 20% of
> the table.

Just to put a few rumours to bed:

- the current code does *not* use block sampling, it uses random row
sampling. (There is a part of the code that selects the "next block" but
that should not confuse you into thinking that the whole block is
sampled).

- yes, the random sampling is random - please read the code and comments

- yes, I would expect the results you get. If you sample 5% of rows and
each block has on average at least 20 rows, then we should expect the
majority of blocks to be hit.

Best Regards, Simon Riggs

In response to

Re: Improving N-Distinct estimation by ANALYZE at 2006-01-06 21:13:04 from Greg Stark

Responses

Re: Improving N-Distinct estimation by ANALYZE at 2006-01-09 10:52:16 from Lukas Smith
Re: Improving N-Distinct estimation by ANALYZE at 2006-01-09 14:38:09 from Greg Stark
Re: Improving N-Distinct estimation by ANALYZE at 2006-01-09 16:21:10 from Greg Stark

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Lukas Smith	2006-01-09 10:52:16	Re: Improving N-Distinct estimation by ANALYZE
Previous Message	Neil Conway	2006-01-09 08:16:58	Re: plpgsql: check domain constraints