Re: [GENERAL] Large DB

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Manfred Koizar <mkoi-pg(at)aon(dot)at>
Cc: "Mooney, Ryan" <ryan(dot)mooney(at)pnl(dot)gov>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [GENERAL] Large DB
Date: 2004-04-02 19:48:13
Message-ID: 4140.1080935293@sss.pgh.pa.us
Lists: pgsql-general pgsql-hackers

Manfred Koizar <mkoi-pg(at)aon(dot)at> writes:
> What I have in mind is a kind of "Double Vitter" algorithm. Whatever we
> do to get our sample of rows, in the end the sampled rows come from no
> more than sample_size different blocks. So my idea is to first create a
> random sample of sample_size block numbers, and then to sample the rows
> out of this pool of blocks.
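
(For concreteness, here is a rough sketch of that two-stage idea as I read
it.  This is illustration only, not PostgreSQL code: NBLOCKS, SAMPLE_SIZE,
and tuples_in_block() are made-up stand-ins, and both stages use plain
Algorithm-R reservoir sampling rather than Vitter's skip-based variant.)

#include <stdio.h>
#include <stdlib.h>

#define NBLOCKS      100000     /* pretend relation size in pages */
#define SAMPLE_SIZE  3000       /* target number of sampled rows */

/* stand-in for reading a page and counting its live tuples */
static int
tuples_in_block(int blkno)
{
    return 20 + rand() % 80;
}

int
main(void)
{
    int     blocks[SAMPLE_SIZE];
    int     sample_blk[SAMPLE_SIZE];    /* source block of each kept row */
    long    rows_seen = 0;

    /* Stage 1: reservoir-sample SAMPLE_SIZE block numbers uniformly */
    for (int blkno = 0; blkno < NBLOCKS; blkno++)
    {
        if (blkno < SAMPLE_SIZE)
            blocks[blkno] = blkno;
        else
        {
            /* rand() used for brevity; modulo bias ignored */
            int     k = rand() % (blkno + 1);

            if (k < SAMPLE_SIZE)
                blocks[k] = blkno;
        }
    }

    /* Stage 2: reservoir-sample SAMPLE_SIZE rows out of the chosen blocks */
    for (int i = 0; i < SAMPLE_SIZE; i++)
    {
        int     ntup = tuples_in_block(blocks[i]);

        for (int t = 0; t < ntup; t++)
        {
            if (rows_seen < SAMPLE_SIZE)
                sample_blk[rows_seen] = blocks[i];
            else
            {
                long    k = rand() % (rows_seen + 1);

                if (k < SAMPLE_SIZE)
                    sample_blk[k] = blocks[i];
            }
            rows_seen++;
        }
    }

    printf("kept %d of %ld candidate rows drawn from %d blocks\n",
           SAMPLE_SIZE, rows_seen, SAMPLE_SIZE);
    return 0;
}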

That assumption is faulty, though --- consider wholly-empty pages.

A bigger problem is that this makes the sampling quite nonuniform,
because rows that are on relatively low-density pages would be more
likely to become part of the final sample than rows that are on pages
with lots of tuples. Thus for example your sample would tend to favor
rows with wide values of variable-width columns and exclude narrower
values. (I am not certain that the existing algorithm completely avoids
this trap, but at least it tries.)
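
To put a number on that, suppose (just to keep the arithmetic simple) that
the second stage kept one row per chosen block.  A row sitting on a page
with t live tuples would then be picked with probability

    P(row) = (sample_size / nblocks) * (1 / t)

so a row on a 10-tuple page is ten times as likely to land in the sample as
a row on a 100-tuple page.  Pooling all the chosen blocks' rows before the
second draw dilutes that effect considerably, but a row on a sparse page
still competes in a slightly smaller pool, so the density dependence does
not vanish entirely.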

regards, tom lane
