Re: Tuple sampling

From: Manfred Koizar <mkoi-pg(at)aon(dot)at>
To: pgsql-patches(at)postgresql(dot)org
Subject: Tuple sampling
Date: 2004-05-22 22:59:49
Message-ID: 1tkva0h547jhomsasujt2qs7gcgg0gtvrp@email.aon.at
Lists: pgsql-patches

This patch implements the new tuple sampling method as discussed on
-hackers and -performance a few weeks ago.

"Large DB" -hackers 2004-04-02
"query slows down with more accurate stats" -perform 2004-04-13
"Tuple sampling" -hackers 2004-04-19

Servus
Manfred

Attachment: unknown_filename (text/plain, 20.1 KB)

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Manfred Koizar <mkoi-pg(at)aon(dot)at>
Cc: pgsql-patches(at)postgresql(dot)org
Subject: Re: Tuple sampling
Date: 2004-05-23 21:32:36
Message-ID: 28169.1085347956@sss.pgh.pa.us
Lists: pgsql-patches

Manfred Koizar <mkoi-pg(at)aon(dot)at> writes:
> This patch implements the new tuple sampling method as discussed on
> -hackers and -performance a few weeks ago.

Applied with minor editorializations. AFAICS get_next_S() needs to be
called with the number of tuples already processed, which means you were
off-by-one --- this surely makes only a trivial difference in the
probabilities, but if we are going to use Vitter's algorithm then we may
as well get it right. Also, I took out the TupleCount typedef and went
back to using doubles for the tuple counts; this is more consistent with
the coding style used elsewhere, and I really doubt that it's any
slower. (The datatype conversions induced inside get_next_S are likely
to outweigh any savings from counting by ints, on most modern hardware.)
Plus the justification for assuming it couldn't overflow seems weak to
me; the current limitation to 300000 requested sample rows is very
arbitrary and could change anytime.
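For reference, the skip-based scheme Tom is describing sits on top of a plain reservoir: each incoming tuple is kept with a probability that depends on how many tuples have already been processed, which is exactly where an off-by-one can creep in. A minimal sketch of the reservoir idea (Vitter's Algorithm R, not the optimized Algorithm Z used in analyze.c; all names here are illustrative):

```python
import random

def reservoir_sample(stream, n, rng=random):
    """Keep a uniform sample of n items from a stream of unknown length.

    After t items have been processed, item t sits in the reservoir with
    probability n/t -- so t must count the current item as processed,
    which is the off-by-one being corrected above.
    """
    reservoir = []
    for t, item in enumerate(stream, start=1):  # t = items processed so far
        if t <= n:
            reservoir.append(item)              # fill the reservoir first
        else:
            j = rng.randrange(t)                # uniform in [0, t)
            if j < n:                           # i.e. with probability n/t
                reservoir[j] = item
    return reservoir
```

Vitter's Algorithm Z produces the same distribution but draws a skip count S (how many tuples to pass over before the next replacement) instead of rolling a die per tuple, which is why a function like get_next_S has to be fed the exact number of tuples already processed.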

I was initially convinced that your implementation of Knuth's algorithm
S was all wet, so now there's a bunch of comments explaining why it's
actually correct...
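Knuth's Algorithm S (selection sampling) is indeed easy to misread: it scans N records sequentially and keeps each one with probability (n - m)/(N - t), where t records have been seen and m already selected. A hedged sketch of the textbook algorithm (not the analyze.c code):

```python
import random

def algorithm_s(records, n, rng=random):
    """Select exactly n of the N records; every n-subset is equally likely.

    With t records seen and m selected, the next record is taken with
    probability (n - m) / (N - t).
    """
    N = len(records)
    sample, m = [], 0
    for t, rec in enumerate(records):       # t records seen before rec
        # take rec with probability (n - m) / (N - t)
        if (N - t) * rng.random() < n - m:
            sample.append(rec)
            m += 1
            if m == n:
                break
    return sample
```

The non-obvious part is that it always terminates with exactly n selections: once the remaining records N - t equal the still-needed n - m, the acceptance probability reaches 1 and every remaining record is taken.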

regards, tom lane


From: Bruno Wolff III <bruno(at)wolff(dot)to>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Manfred Koizar <mkoi-pg(at)aon(dot)at>, pgsql-patches(at)postgresql(dot)org
Subject: Re: Tuple sampling
Date: 2004-05-24 03:57:43
Message-ID: 20040524035743.GA337@wolff.to
Lists: pgsql-patches

On Sun, May 23, 2004 at 17:32:36 -0400,
Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
> slower. (The datatype conversions induced inside get_next_S are likely
> to outweigh any savings from counting by ints, on most modern hardware.)

You aren't even guaranteed that integer operations are faster. On the Cray
X-MP, integer math was often done using floating point operations because
they were faster. This limited integers to 48 bits instead of 64, but
normally 48 bits was enough. I don't remember whether just multiplication
was faster or whether all operations were.


From: Manfred Koizar <mkoi-pg(at)aon(dot)at>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-patches(at)postgresql(dot)org
Subject: Re: Tuple sampling
Date: 2004-05-24 10:29:27
Message-ID: 8li3b0l68pbasfri8q2880pml3hd7kppf2@email.aon.at
Lists: pgsql-patches

On Sun, 23 May 2004 17:32:36 -0400, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>I took out the TupleCount typedef and went
>back to using doubles for the tuple counts; this is more consistent with
>the coding style used elsewhere, and I really doubt that it's any
>slower.

Performance was not the primary motivation. I found it confusing to
have doubles everywhere and not to know why a variable is declared
as double, because
. we need the fractional part (e.g. a probability),
. or it should be able to hold an integral value of more than 32 bits.
So I just invented my own datatype for huge integers. long long would
have been a natural choice, but AFAIK it's not available on all
platforms.
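For what it's worth, the IEEE 754 double used for those counts represents every integer up to 2^53 exactly, so counting tuples in a double only loses precision for tables far beyond anything ANALYZE will scan. A quick check (Python floats are C doubles):

```python
# An IEEE 754 double has a 53-bit significand: every integer with
# absolute value <= 2**53 is represented exactly.
EXACT_LIMIT = 2 ** 53          # 9007199254740992

count = float(EXACT_LIMIT - 1)
count += 1.0                   # still exact: lands on 2**53 precisely
assert count == EXACT_LIMIT

count += 1.0                   # 2**53 + 1 is NOT representable ...
assert count == EXACT_LIMIT    # ... so the increment is silently lost
```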

>I was initially convinced that your implementation of Knuth's algorithm
>S was all wet, so now there's a bunch of comments explaining why it's
>actually correct...

Thanks. I like your explanation. My justification for that change was
a lot more complicated.

Servus
Manfred


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Manfred Koizar <mkoi-pg(at)aon(dot)at>
Cc: pgsql-patches(at)postgresql(dot)org
Subject: Re: Tuple sampling
Date: 2004-05-24 13:43:42
Message-ID: 10795.1085406222@sss.pgh.pa.us
Lists: pgsql-patches

Manfred Koizar <mkoi-pg(at)aon(dot)at> writes:
> Performance was not the primary motivation. I found it confusing to
> have doubles everywhere and not to know whether a variable is declared
> as double, because
> . we need the fractional part (e.g. a probability)
> . or it should be able to hold an integral value of more than 32 bits.
> So I just invented my own datatype for huge integers.

Well, if you want to "typedef double TupleCount" as a documentation aid,
I don't object, but let's not do it just locally in analyze.c. There
are a bunch of other places that do the same thing.

regards, tom lane