Re: Improving N-Distinct estimation by ANALYZE

From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>, josh(at)agliodbs(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Improving N-Distinct estimation by ANALYZE
Date: 2006-01-09 10:28:32
Message-ID: 1136802512.21025.386.camel@localhost.localdomain
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, 2006-01-06 at 16:13 -0500, Greg Stark wrote:
> "Jim C. Nasby" <jnasby(at)pervasive(dot)com> writes:
>
> > Before we start debating merits of proposals based on random reads, can
> > someone confirm that the sampling code actually does read randomly? I
> > looked at it yesterday; there is a comment that states that blocks to be
> > scanned are passed to the analyze function in physical order, and AFAICT
> > the function that chooses blocks does so based strictly on applying a
> > probability function to block numbers as it increments a counter. It
> > seems that any reading is actually sequential and not random, which
> > makes all the random_page_cost hand-waving null and void.
>
> Hm. I'm curious just how much that behaves like a sequential scan actually. I
> think I'll do some experiments.
>
> Reading 1% (1267 read, 126733 skipped): 7748264us
> Reading 2% (2609 read, 125391 skipped): 12672025us
> Reading 5% (6502 read, 121498 skipped): 19005678us
> Reading 5% (6246 read, 121754 skipped): 18509770us
> Reading 10% (12975 read, 115025 skipped): 19305446us
> Reading 20% (25716 read, 102284 skipped): 18147151us
> Reading 50% (63656 read, 64344 skipped): 18089229us
> Reading 100% (128000 read, 0 skipped): 18173003us
>
> These numbers don't make much sense to me. It seems like 5% is about as slow
> as reading the whole file which is even worse than I expected. I thought I was
> being a bit pessimistic to think reading 5% would be as slow as reading 20% of
> the table.

Just to put a few rumours to bed:

- the current code does *not* use block sampling, it uses random row
sampling. (There is a part of the code that selects the "next block" but
that should not confuse you into thinking that the whole block is
sampled).

- yes, the random sampling is random - please read the code and comments

- yes, I would expect the results you get. If you sample 5% of rows and
each block has on average at least 20 rows, then we should expect the
majority of blocks to be hit.

Best Regards, Simon Riggs

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Lukas Smith 2006-01-09 10:52:16 Re: Improving N-Distinct estimation by ANALYZE
Previous Message Neil Conway 2006-01-09 08:16:58 Re: plpgsql: check domain constraints