Re: Gsoc2012 idea, tablesample

From: Florian Pflug <fgp(at)phlo(dot)org>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: <josh(at)agliodbs(dot)com>, <andres(at)anarazel(dot)de>, <alvherre(at)commandprompt(dot)com>, <ants(at)cybertec(dot)at>, <heikki(dot)linnakangas(at)enterprisedb(dot)com>, <cbbrowne(at)gmail(dot)com>, <neil(dot)conway(at)gmail(dot)com>, <robertmhaas(at)gmail(dot)com>, <daniel(at)heroku(dot)com>, <huangqiyx(at)hotmail(dot)com>, <pgsql-hackers(at)postgresql(dot)org>, <sfrost(at)snowman(dot)net>
Subject: Re: Gsoc2012 idea, tablesample
Date: 2012-05-11 13:54:35
Message-ID: 87B3326E-03B2-4724-8FF8-2C07AA0FC854@phlo.org
Lists: pgsql-hackers

[ Sorry for the self-reply ]

On May11, 2012, at 13:17 , Florian Pflug wrote:
> Actually, thinking more about this, if the approach sketched above
> turns out to work, then one might be able to get away without remembering
> previously computed TIDs, thus removing a lot of complexity.
>
> Say, for example, you simply start out with a single random TID tid[0].
> The you produce the sequence of "random" TIDs by setting
>
> tid[n+1] = tid[n] + random(1 <= x <= 2*D-1)
>
> where D is the expected distance between one TID from the sample set
> and the next higher one, i.e. D = 2 * #TIDs / N. (You'd simply stop once
> tid[n] > #TIDs.) This will give you approximately uniformly distributed
> TIDs, I think, and will even return them in physical order. The $100 question
> is, by how much does this violate the independence requirement, i.e. how
> far are P(tuple X picked) and P(tuple X picked | tuple Y picked) apart?
> Some search through the literature should be able to answer that.
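
For illustration, here's a quick standalone C sketch of that uniform-increment
scheme. Toy code, not patch material; n_tids and the expected gap d are
made-up placeholders for #TIDs and D:

#include <stdio.h>
#include <stdlib.h>

int
main(void)
{
    long    n_tids = 1000000;   /* stand-in for #TIDs */
    long    d = 1000;           /* stand-in for the expected distance D */
    long    tid;

    srand(42);

    /* pick tid[0]; the quote just says "a single random TID", so one
     * plausible reading is a uniform pick from the first window */
    tid = rand() % (2 * d - 1);

    /* tid[n+1] = tid[n] + random(1 <= x <= 2*D-1), stop once past #TIDs */
    while (tid < n_tids)
    {
        printf("%ld\n", tid);
        tid += 1 + rand() % (2 * d - 1);
    }

    return 0;
}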

Actually, one can easily do better than that. Say you've determined that
to pick each tuple with probability p_tup, you need to pick each TID with
probability p_tid (where p_tup/p_tid is the TID density, i.e. the probability
of a single TID being live). Now, looping over all TIDs and picking each with
probability p_tid is, of course, just another (more error-prone) way of
iterating over all tuples and picking each with probability p_tup. But instead
of looping, you can jump from one picked TID to the next. Observe that, after
having picked tid X, the probability of picking X+i as the next tid is
(1 - p_tid)^(i-1) * p_tid, because to pick X+i next you have to skip (i-1) TIDs
and then pick the i-th one. Thus, the following algorithm returns a sorted list
of TIDs which should be uniformly distributed and independent (or so I think,
at least):

1) Start with a random tid tid[0], chosen such that
   P(tid[0] = i) = (1 - p_tid)^i * p_tid

2) Set tid[n+1] = tid[n] + d, with d > 0 chosen such that
   P(d = i) = (1 - p_tid)^(i-1) * p_tid

3) Abort once max(tid) >= #TIDs.

This all hinges on the ability to produce a sufficiently accurate estimate of
the TID density p_tup/p_tid, of course.
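
To make the skipping logic concrete, here's a quick standalone C sketch of the
algorithm above. Again toy code, not patch material; n_tids and p_tid are
made-up placeholders for #TIDs and the estimated per-TID pick probability:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

/*
 * Draw a gap d >= 1 with P(d = i) = (1 - p_tid)^(i-1) * p_tid, i.e. a
 * geometric distribution, by inverting a uniform variate.
 */
static long
geometric_gap(double p_tid)
{
    double  u = (rand() + 1.0) / (RAND_MAX + 2.0);  /* u in (0,1) */

    return (long) floor(log(u) / log(1.0 - p_tid)) + 1;
}

int
main(void)
{
    long    n_tids = 1000000;   /* stand-in for #TIDs */
    double  p_tid = 0.001;      /* stand-in for the estimated pick probability */
    long    tid;

    srand(42);

    /* step 1: P(tid[0] = i) = (1 - p_tid)^i * p_tid, i.e. a geometric gap
     * counted from position -1 */
    tid = geometric_gap(p_tid) - 1;

    /* steps 2 and 3: keep adding geometric gaps until we run off the end;
     * the picked TIDs come out already sorted */
    while (tid < n_tids)
    {
        printf("%ld\n", tid);
        tid += geometric_gap(p_tid);
    }

    return 0;
}

The geometric draw is what lets you jump directly from one picked TID to the
next instead of testing every TID individually.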

best regards,
Florian Pflug
