Re: Large DB

From: "Mooney, Ryan" <ryan(dot)mooney(at)pnl(dot)gov>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Large DB
Date: 2004-03-31 18:08:26
Message-ID: B295F3D09B0D3840BEA3B46C51FC389C1F5C07@pnlmse28.pnl.gov
Lists: pgsql-general pgsql-hackers


Thanks for the excellent feedback (all)!

Good point on the excess bytes/row; I'm not sure how to explain that. There
have never been any deletes or updates on this table, and all inserts are
just simple inserts (no transactions or anything funky), so there shouldn't
be many aborted transactions (I could see the daemons that do the inserts
dying partway through a few times, but nothing that would explain the
variance).

I haven't run ANALYZE on this table in a while. After about 50-60M rows
it didn't seem to change the query plan at all and since there were
never any deletes/updates it seemed like it wasn't making much/any
difference (should have been no pages to reclaim). That may be an
invalid assumption though.

I'll try the other suggestions over the next couple of days and see how
it goes. Thanks again.

Here is an explain on the query:

=> explain select point, avg(pvalue) as avg from tp3 where host in
('m563', 'm562', 'm561', 'm560', 'm559', 'm558', 'm557', 'm538', 'm755',
'm754', 'm753', 'm752', 'm751', 'm750', 'm749', 'm748') and starttime
between '2004-03-27 07:37:43' and '2004-03-30 07:40:08' group by point;

HashAggregate (cost=96.90..96.90 rows=1 width=25)
-> Index Scan using tp3_host_starttime, tp3_host_starttime,
tp3_host_starttime, tp3_host_starttime, tp3_host_starttime,
tp3_host_starttime, tp3_host_starttime, tp3_host_starttime,
tp3_host_starttime, tp3_host_starttime, tp3_host_starttime,
tp3_host_starttime, tp3_host_starttime, tp3_host_starttime,
tp3_host_starttime, tp3_host_starttime on tp3 (cost=0.00..96.90 rows=1
width=25)
Index Cond: (((host = 'm563'::bpchar) AND (starttime >=
'2004-03-27 07:37:43'::timestamp without time zone) AND (starttime <=
'2004-03-30 07:40:08'::timestamp without time zone)) OR ((host =
'm562'::bpchar) AND (starttime >= '2004-03-27 07:37:43'::timestamp
without time zone) AND (starttime <= '2004-03-30 07:40:08'::timestamp
without time zone)) OR ((host = 'm561'::bpchar) AND (starttime >=
'2004-03-27 07:37:43'::timestamp without time zone) AND (starttime <=
'2004-03-30 07:40:08'::timestamp without time zone)) OR ((host =
'm560'::bpchar) AND (starttime >= '2004-03-27 07:37:43'::timestamp
without time zone) AND (starttime <= '2004-03-30 07:40:08'::timestamp
without time zone)) OR ((host = 'm559'::bpchar) AND (starttime >=
'2004-03-27 07:37:43'::timestamp without time zone) AND (starttime <=
'2004-03-30 07:40:08'::timestamp without time zone)) OR ((host =
'm558'::bpchar) AND (starttime >= '2004-03-27 07:37:43'::timestamp
without time zone) AND (starttime <= '2004-03-30 07:40:08'::timestamp
without time zone)) OR ((host = 'm557'::bpchar) AND (starttime >=
'2004-03-27 07:37:43'::timestamp without time zone) AND (starttime <=
'2004-03-30 07:40:08'::timestamp without time zone)) OR ((host =
'm538'::bpchar) AND (starttime >= '2004-03-27 07:37:43'::timestamp
without time zone) AND (starttime <= '2004-03-30 07:40:08'::timestamp
without time zone)) OR ((host = 'm755'::bpchar) AND (starttime >=
'2004-03-27 07:37:43'::timestamp without time zone) AND (starttime <=
'2004-03-30 07:40:08'::timestamp without time zone)) OR ((host =
'm754'::bpchar) AND (starttime >= '2004-03-27 07:37:43'::timestamp
without time zone) AND (starttime <= '2004-03-30 07:40:08'::timestamp
without time zone)) OR ((host = 'm753'::bpchar) AND (starttime >=
'2004-03-27 07:37:43'::timestamp without time zone) AND (starttime <=
'2004-03-30 07:40:08'::timestamp without time zone)) OR ((host =
'm752'::bpchar) AND (starttime >= '2004-03-27 07:37:43'::timestamp
without time zone) AND (starttime <= '2004-03-30 07:40:08'::timestamp
without time zone)) OR ((host = 'm751'::bpchar) AND (starttime >=
'2004-03-27 07:37:43'::timestamp without time zone) AND (starttime <=
'2004-03-30 07:40:08'::timestamp without time zone)) OR ((host =
'm750'::bpchar) AND (starttime >= '2004-03-27 07:37:43'::timestamp
without time zone) AND (starttime <= '2004-03-30 07:40:08'::timestamp
without time zone)) OR ((host = 'm749'::bpchar) AND (starttime >=
'2004-03-27 07:37:43'::timestamp without time zone) AND (starttime <=
'2004-03-30 07:40:08'::timestamp without time zone)) OR ((host =
'm748'::bpchar) AND (starttime >= '2004-03-27 07:37:43'::timestamp
without time zone) AND (starttime <= '2004-03-30 07:40:08'::timestamp
without time zone)))
Filter: (((host = 'm563'::bpchar) OR (host = 'm562'::bpchar) OR
(host = 'm561'::bpchar) OR (host = 'm560'::bpchar) OR (host =
'm559'::bpchar) OR (host = 'm558'::bpchar) OR (host = 'm557'::bpchar) OR
(host = 'm538'::bpchar) OR (host = 'm755'::bpchar) OR (host =
'm754'::bpchar) OR (host = 'm753'::bpchar) OR (host = 'm752'::bpchar) OR
(host = 'm751'::bpchar) OR (host = 'm750'::bpchar) OR (host =
'm749'::bpchar) OR (host = 'm748'::bpchar)) AND (starttime >=
'2004-03-27 07:37:43'::timestamp without time zone) AND (starttime <=
'2004-03-30 07:40:08'::timestamp without time zone))
(4 rows)

> -----Original Message-----
> From: Manfred Koizar [mailto:mkoi-pg(at)aon(dot)at]
> Sent: Wednesday, March 31, 2004 1:18 AM
> To: Mooney, Ryan
> Cc: pgsql-general(at)postgresql(dot)org
> Subject: Re: [GENERAL] Large DB
>
>
> On Tue, 30 Mar 2004 17:48:14 -0800, "Mooney, Ryan"
> <ryan(dot)mooney(at)pnl(dot)gov>
> wrote:
> >I have a single table that just went over 234GB in size with about
> >290M+ rows.
>
> That would mean ~ 800 bytes/row which, given your schema, is
> hard to believe unless there are lots of dead tuples lying around.
>
> >queries use the indexes fairly well, although I suspect that
> the order
> >of host/starttime is suboptimal (fewer hosts than starttime, and the
> >table is naturally in starttime order). I'm going to try adding an
> >index on just starttime (and later just host) and see if I
> can tune the
> >queries on that more.
>
> Yes, if you are ready to switch OS for a 10% performance
> gain, getting your indices right should be no question.
>
> > I never delete rows from the table, only do
> >inserts (up to around 11,000/minute mostly in one big burst every
> >minute, this is anticipated to go up some over time).
>
> How often do you ANALYSE?
>
> Have there been DELETEs or UPDATEs or aborted transactions in
> the past? Did you VACUUM or VACUUM FULL since then?
>
> > I'm only doing sub 15MB from the disk
> >array (from iostat) and I know it can do in the 40-60MB
> range when we
> >tested the raw speed,
>
> Sounds plausible for nonsequential I/O.
>
> >However the two indexes are also - large (which may be part of the
> >problem, which is why I'm trying just starttime for an
> index; They are
> >currently in the 140-150G range).
>
> This would be extreme index bloat which is only possible
> after massive DELETEs/UPDATEs.
>
> >stats=> explain select count(*) from tp3;
> > -> Seq Scan on tp3 (cost=0.00..6906493.16
> rows=290602016 width=0)
>
> The planner thinks that the table size is 4M pages, 32GB.
> The average tuple size of ~110 bytes (including tuple header)
> suits your schema quite nicely.
>
> > Table "public.tp3"
> > Column | Type | Modifiers
> >-------------+-----------------------------+-----------
> > host | character(4) |
> > point | character varying(64) |
> > type | character(1) |
> > cooked | character(1) |
> > starttime | timestamp without time zone |
> > intervallen | interval |
> > arrivetime | timestamp without time zone |
> > pvalue | numeric |
> >Indexes:
> > "tp3_host_starttime" btree (host, starttime, cooked)
> > "tp3_point_starttime" btree (point, starttime, cooked)
>
> In my experience any reduction in average tuple size results
> directly in a proportional increase of throughput for large
> tables. So here are some random thoughts:
>
> You said there are only a few hosts. So moving the hosts
> into a separate table with an integer primary key would save
> 4 bytes per row.
>
> Datatype "char" (with quotes) needs only 1 byte, char(1)
> needs 5 bytes, both before padding. Changing type and cooked
> from char(1) to "char" would save 12 bytes.
>
> And if you want to push it, you change hostid to smallint and
> rearrange the fields, saving 4 more padding bytes:
> hostid | smallint
> type | "char"
> cooked | "char"
>
> What about point? If there is a known small number of
> different values, move it into its own table.
>
> I'm not sure about the storage needs of numeric, might be at
> least 8 bytes. Consider using bigint. Someone please correct
> me if I'm wrong.
>
> Did you CREATE TABLE tp3 (...) WITHOUT OIDS?
>
> >Sample data mining query:
> >----------------------------
> >select point, avg(pvalue) as avg from tp3 where host in ('node',
> >'node',
> >....) and starttime between 'timestamp' and 'timestamp'
> group by point
>
> Show us EXPLAIN ANALYSE, please.
>
> >shared_buffers = 60800
>
> Looks a bit large to me. But if your tests have shown it to
> be the best value, it should be ok.
>
> >sort_mem = 1286720 # min 64, size in KB
>
> This is more than 1GB, I think this is too high.
>
> >fsync=false # Play fast and loose - whee
>
> How much did this help?
>
> >effective_cache_size = 160000
>
> Try more, say 320000 or even 400000.
>
> Servus
> Manfred
>


From: Manfred Koizar <mkoi-pg(at)aon(dot)at>
To: "Mooney, Ryan" <ryan(dot)mooney(at)pnl(dot)gov>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Large DB
Date: 2004-04-01 10:22:58
Message-ID: 1uon60lr3jjndh4o8i9cagd62tead9b0t6@email.aon.at
Lists: pgsql-general pgsql-hackers

On Wed, 31 Mar 2004 10:08:26 -0800, "Mooney, Ryan" <ryan(dot)mooney(at)pnl(dot)gov>
wrote:
>I haven't run ANALYZE on this table in a while. After about 50-60M rows
>it didn't seem to change the query plan at all and since there were
>never any deletes/updates it seemed like it wasn't making much/any
>difference (should have been no pages to reclaim).

Reclaiming pages is not the job of ANALYSE; VACUUM does that.

> That may be an
>invalid assumption though.

Might be a valid assumption as well -- if you're lucky. But do you want
to depend on luck? E.g., 75% of today's rows could contain timestamps that
are greater than what the planner believes to be the maximum.

No VACUUM, no ANALYSE, no REINDEX. This explains why the planner thinks
there are only 4M pages, which gives 640 bytes/row if there were 50M
rows at that time. OTOH the EXPLAIN shows 290M rows for the seq scan.
Something doesn't fit together here.

Hackers, what could update reltuples, but not relpages?

Or, Ryan, is it possible that you already had 290M rows when you ran
ANALYSE and you have more than 1G rows today?

BTW, ANALYSE is basically a constant time operation.

>Here is an explain on the query:
>
>=> explain select point, avg(pvalue) as avg from tp3 where host in

This tells us one half of the story.

EXPLAIN ANALYSE SELECT ...

would tell us the other half, too.

Servus
Manfred


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Manfred Koizar <mkoi-pg(at)aon(dot)at>
Cc: "Mooney, Ryan" <ryan(dot)mooney(at)pnl(dot)gov>, pgsql-general(at)postgresql(dot)org
Subject: Re: Large DB
Date: 2004-04-01 15:22:19
Message-ID: 29257.1080832939@sss.pgh.pa.us
Lists: pgsql-general pgsql-hackers

Manfred Koizar <mkoi-pg(at)aon(dot)at> writes:
> Hackers, what could update reltuples, but not relpages?

Nothing --- they are always updated together. One possibility is that
the 4M pages and 290M rows numbers really do go together (for about 112
bytes/row) and that the table has since grown, or perhaps merely bloated
due to lack of vacuuming of updated rows.

A different line of thought is that they were updated together, but the
relpages estimate was accurate while reltuples was not. ANALYZE knows
the actual table size in pages (because it asks the OS) but reltuples is
extrapolated from an average of the number of live tuples on the pages
ANALYZE looks at. It is possible for ANALYZE to be fooled badly if, for
instance, there are lots and lots of dead rows near the start of the
table. (Lack of regular vacuuming would certainly improve the odds of
this happening...)

Note that VACUUM is not subject to this error because it has to grovel
over every page anyway. So either "VACUUM" or "VACUUM ANALYZE" will
give you a known-good reltuples, it's only standalone "ANALYZE" that
has a risk of estimation error.

regards, tom lane


From: Manfred Koizar <mkoi-pg(at)aon(dot)at>
To: Manfred Koizar <mkoi-pg(at)aon(dot)at>
Cc: "Mooney, Ryan" <ryan(dot)mooney(at)pnl(dot)gov>, pgsql-general(at)postgresql(dot)org
Subject: Re: Large DB
Date: 2004-04-02 15:17:50
Message-ID: 7sqq60po4q8k0l7ere02sja1kkevb877cu@email.aon.at
Lists: pgsql-general pgsql-hackers

On Thu, 01 Apr 2004 12:22:58 +0200, I wrote:
>BTW, ANALYSE is basically a constant time operation.

On closer inspection, this is not the whole truth. ANALY[SZ]E is a
two-stage process: first it collects a sample of rows, then these rows are
examined to produce various statistics.

The cost of the latter depends on the sample size (which itself depends on
the default or column-specific statistics target) and on the number and
types of columns, but not on the table size, so it *should* take more or
less constant time.

The first step, however, (acquire_sample_rows() in analyze.c) has to
read more rows than finally end up in the sample. It visits less than
O(nblocks) pages but certainly more than O(1).

A vague feeling tries to tell me that the number of page reads is
somehow related to the harmonic numbers 1 + 1/2 + 1/3 + ... + 1/n, which
grow like O(ln(n)).

I have an idea how this could be done with O(1) page reads. If I'm able
to translate it into C, I'll send a patch ...

Servus
Manfred


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Manfred Koizar <mkoi-pg(at)aon(dot)at>
Cc: "Mooney, Ryan" <ryan(dot)mooney(at)pnl(dot)gov>, pgsql-general(at)postgresql(dot)org
Subject: Re: Large DB
Date: 2004-04-02 16:16:21
Message-ID: 28229.1080922581@sss.pgh.pa.us
Lists: pgsql-general pgsql-hackers

Manfred Koizar <mkoi-pg(at)aon(dot)at> writes:
> The first step, however, (acquire_sample_rows() in analyze.c) has to
> read more rows than finally end up in the sample. It visits less than
> O(nblocks) pages but certainly more than O(1).

> A vague feeling tries to tell me that the number of page reads is
> somehow related to the harmonic numbers 1 + 1/2 + 1/3 + ... + 1/n, which
> grow like O(ln(n)).

Good guess. Vitter's paper says the expected time to sample n rows from
a table of size N is O(n * (1 + log(N/n))).

> I have an idea how this could be done with O(1) page reads.

The hard part is getting a genuinely random sample when we don't know N
in advance. We do however know the table size in blocks, so if you're
willing to make assumptions about constant tuple density you could do
something different. (But the tuple density assumption is exactly the
weak spot of what we've got, so I'm unconvinced that would be a big step
forward.)

regards, tom lane


From: Manfred Koizar <mkoi-pg(at)aon(dot)at>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Mooney, Ryan" <ryan(dot)mooney(at)pnl(dot)gov>, pgsql-general(at)postgresql(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Large DB
Date: 2004-04-02 18:05:01
Message-ID: 2r6r60h049h0lg4s9ve3qe1h38ubprpo30@email.aon.at
Lists: pgsql-general pgsql-hackers

[time to move this to -hackers]

On Fri, 02 Apr 2004 11:16:21 -0500, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>Manfred Koizar <mkoi-pg(at)aon(dot)at> writes:
>> The first step, however, (acquire_sample_rows() in analyze.c) has to
>> read more rows than finally end up in the sample. It visits less than
>> O(nblocks) pages but certainly more than O(1).
>
>> A vague feeling tries to tell me that the number of page reads is
>> somehow related to the harmonic numbers 1 + 1/2 + 1/3 + ... + 1/n, which
>> grow like O(ln(n)).
>
>Good guess. Vitter's paper says the expected time to sample n rows from
>a table of size N is O(n * (1 + log(N/n))).

Well, for what I tried to find out my wild guess seems to be wrong.

I don't doubt that Vitter's formula is correct, but it assumes that
access to any tuple has the same cost. This does not apply to our
problem, however. With 100 tuples per page, we access the first
sample_size tuples at a cost of 0.01 sequential page reads per tuple.
Later we use fewer and fewer tuples per page, which results in a higher
per-tuple cost. Near the end of a large relation we can expect to access
only one tuple per page, and more and more pages are skipped, so
prefetching doesn't help any more.

Playing around with some real numbers (for 100 tuples/page and a sample
size of 3000) I got:

 rel size | page
 (blocks) | reads
----------+--------
       30 |    30
      300 |   300    (expectation is something like 299.9995)
      500 |   499
       1K |   990
       3K |  2.6K
      30K |    8K
     100K |   12K
       1M |   19K
      10M |   26K
     100M |   33K

This growth rate is steeper than O(log(nblocks)).
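
For illustration, here is a minimal C sketch of the expectation model that
appears to be behind these figures (an idealized model, not the actual
analyze.c code): 100 live tuples per page, the first sample_size tuples are
always read, and tuple i beyond that is visited with probability
sample_size/i, the usual reservoir-sampling property. A page counts as read
iff at least one of its tuples is visited.

#include <stdio.h>

/* Expected number of pages touched while row-sampling a relation of
 * nblocks pages under the model described above. */
static double
expected_page_reads(long nblocks, int tuples_per_page, int sample_size)
{
    double  total = 0.0;
    long    b;

    for (b = 0; b < nblocks; b++)
    {
        double  p_untouched = 1.0;  /* P(no tuple on this page is visited) */
        int     t;

        for (t = 1; t <= tuples_per_page; t++)
        {
            long    i = b * (long) tuples_per_page + t; /* 1-based position */
            double  p_visit = (i <= sample_size)
                ? 1.0
                : (double) sample_size / (double) i;

            p_untouched *= 1.0 - p_visit;
        }
        total += 1.0 - p_untouched;
    }
    return total;
}

int
main(void)
{
    long    sizes[] = {30, 300, 500, 1000, 3000, 30000, 100000, 1000000};
    int     k;

    for (k = 0; k < 8; k++)
        printf("%8ld blocks -> %10.1f expected page reads\n",
               sizes[k], expected_page_reads(sizes[k], 100, 3000));
    return 0;
}

(The 10M and 100M entries of the table above are the same computation, just
with a longer loop.)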

>> I have an idea how this could be done with O(1) page reads.

What I have in mind is a kind of "Double Vitter" algorithm. Whatever we
do to get our sample of rows, in the end the sampled rows come from no
more than sample_size different blocks. So my idea is to first create a
random sample of sample_size block numbers, and then to sample the rows
out of this pool of blocks.

I have to think harder, though, about what to do with those 400 pages that
are not accessed when the sample size is 3000 ...

>The hard part is getting a genuinely random sample when we don't know N
>in advance. We do however know the table size in blocks, so if you're
>willing to make assumptions about constant tuple density you could do
>something different. (But the tuple density assumption is exactly the
>weak spot of what we've got, so I'm unconvinced that would be a big step
>forward.)

Starting the scan at some random blocks should help against the common
case of unusual distribution of dead tuples near the start of the
relation. And I plan to factor information about dead tuple hits into
an increasingly better estimation of dead/live tuple ratio.

Servus
Manfred


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Manfred Koizar <mkoi-pg(at)aon(dot)at>
Cc: "Mooney, Ryan" <ryan(dot)mooney(at)pnl(dot)gov>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [GENERAL] Large DB
Date: 2004-04-02 19:48:13
Message-ID: 4140.1080935293@sss.pgh.pa.us
Lists: pgsql-general pgsql-hackers

Manfred Koizar <mkoi-pg(at)aon(dot)at> writes:
> What I have in mind is a kind of "Double Vitter" algorithm. Whatever we
> do to get our sample of rows, in the end the sampled rows come from no
> more than sample_size different blocks. So my idea is to first create a
> random sample of sample_size block numbers, and then to sample the rows
> out of this pool of blocks.

That assumption is faulty, though --- consider wholly-empty pages.

A bigger problem is that this makes the sampling quite nonuniform,
because rows that are on relatively low-density pages would be more
likely to become part of the final sample than rows that are on pages
with lots of tuples. Thus for example your sample would tend to favor
rows with wide values of variable-width columns and exclude narrower
values. (I am not certain that the existing algorithm completely avoids
this trap, but at least it tries.)

regards, tom lane


From: Manfred Koizar <mkoi-pg(at)aon(dot)at>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [GENERAL] Large DB
Date: 2004-04-02 21:53:29
Message-ID: 36lr60pvii0tlr97oqvj0klrn5ma88vb65@email.aon.at
Lists: pgsql-general pgsql-hackers

On Fri, 02 Apr 2004 14:48:13 -0500, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>Manfred Koizar <mkoi-pg(at)aon(dot)at> writes:
>> What I have in mind is a kind of "Double Vitter" algorithm. [...]
>> random sample of sample_size block numbers, and then to sample the rows
>> out of this pool of blocks.
>
>That assumption is faulty, though --- consider wholly-empty pages.
>
>A bigger problem is that this makes the sampling quite nonuniform,
>because rows that are on relatively low-density pages would be more
>likely to become part of the final sample than rows that are on pages
>with lots of tuples.

This sounds like you are assuming that I want to take exactly one tuple
out of each block of the block sample. This is not the case. In the
second round I plan to apply the same (or a better) Vitter method as is
done now. The main difference is that blocks will be addressed indirectly
through the array of block numbers obtained in the first round.

> Thus for example your sample would tend to favor
>rows with wide values of variable-width columns and exclude narrower
>values. (I am not certain that the existing algorithm completely avoids
>this trap, but at least it tries.)

I'm reading the 7.4 source code and I fail to see how it does this. If
the relation starts with an atypical distribution of wide/narrow or
dead/alive tuples, a wrong value for tuplesperpage is used for the rest of
the sampling.

Tuples immediately following one or more dead tuples have a better chance
of being selected. That may be called as random as anything else, not
favouring any particular property. OTOH, after long runs of dead tuples,
consecutive tuples are likely to be selected.

Your comment about nonuniformity above exactly describes the current
algorithm: Once the initial sample is fetched and tuplesperpage is
determined, targpos is computed without any further feedback. If
targpos points to a sparsely populated area (with wide tuples or with
many dead tuples) tuples in this area are more likely to get into the
sample than tuples in densely populated areas (with many small active
tuples).

I think that cutting down the number of blocks to be looked at does not
affect these problems.

Servus
Manfred


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Manfred Koizar <mkoi-pg(at)aon(dot)at>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [GENERAL] Large DB
Date: 2004-04-02 23:06:12
Message-ID: 9088.1080947172@sss.pgh.pa.us
Lists: pgsql-general pgsql-hackers

Manfred Koizar <mkoi-pg(at)aon(dot)at> writes:
> On Fri, 02 Apr 2004 14:48:13 -0500, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> A bigger problem is that this makes the sampling quite nonuniform,
>> because rows that are on relatively low-density pages would be more
>> likely to become part of the final sample than rows that are on pages
>> with lots of tuples.

> This sounds like you are assuming that I want to take exactly one tuple
> out of each block of the block sample. This is not the case.

No, I understood that you wanted to resample, but [ ... thinks for
awhile ... ] hmm, now I can't construct a failure case either. I must
have done the math wrong before.

There's still a risk of not being able to collect N rows out of N
blocks, if you are unfortunate enough to select a lot of wholly-empty
pages. But that seems like a low-probability scenario; besides such a
table would be so desperately in need of VACUUM FULL that the possible
low quality of the stats hardly matters.

You should not need to use the Vitter algorithm for the block-level
selection, since you can know the number of blocks in the table in
advance. You can just use the traditional method of choosing each block
or not with probability (k/K), where k = number of sample blocks still
needed, K = number of blocks from here to the end. You'd run the Vitter
algorithm separately to decide whether to keep or discard each live row
you find in the blocks you read.
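
A minimal sketch of that block-selection step (illustration only; the
function name and the drand48() random source are stand-ins, not backend
code):

#include <stdlib.h>

/* Walk block numbers 0..nblocks-1 once and keep each block with
 * probability k/K, where k is the number of sample blocks still needed
 * and K the number of blocks from here to the end.  This yields exactly
 * min(sample_blocks, nblocks) distinct block numbers, each block equally
 * likely to be chosen. */
static int
select_sample_blocks(long nblocks, int sample_blocks, long *chosen)
{
    int     needed = sample_blocks;
    int     nchosen = 0;
    long    b;

    for (b = 0; b < nblocks && needed > 0; b++)
    {
        long    remaining = nblocks - b;    /* "K": blocks from here to the end */

        if (drand48() * remaining < needed) /* keep with probability k/K */
        {
            chosen[nchosen++] = b;
            needed--;
        }
    }
    return nchosen;
}

chosen[] needs room for sample_blocks entries, and the selected block
numbers come out in ascending order, so the chosen blocks can be read in
one forward pass over the table.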

I do like this, since it eliminates the current method's bias towards
estimating the number of live rows from the density found near the start
of the table only. At the end you'd know the total number of live rows
on all the pages you read, and it's reasonable to extrapolate that total
to the full table size.

Question: if the table size is less than N blocks, are you going to read
every block or try to reduce the number of blocks sampled? If you don't
adjust the sample size then I think this would perform worse for
intermediate-size tables than the current method does ... perhaps not so
much at sample size = 3000, but at larger sizes it would hurt. A lot of
people are setting the stats target to 100 which means a sample size of
30000 --- how do the page-access counts look in that case?

regards, tom lane


From: Manfred Koizar <mkoi-pg(at)aon(dot)at>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [GENERAL] Large DB
Date: 2004-04-03 00:40:39
Message-ID: 8ftr60l1ebgcable559ogr2tlb6nuujllq@email.aon.at
Lists: pgsql-general pgsql-hackers

On Fri, 02 Apr 2004 18:06:12 -0500, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>You should not need to use the Vitter algorithm for the block-level
>selection, since you can know the number of blocks in the table in
>advance. You can just use the traditional method of choosing each block
>or not with probability (k/K), where k = number of sample blocks still
>needed, K = number of blocks from here to the end.

Sounds reasonable. I have to play around a bit more to get a feeling for
where the Vitter method gets more efficient.

> You'd run the Vitter
>algorithm separately to decide whether to keep or discard each live row
>you find in the blocks you read.

You mean once a block is sampled we inspect it in any case? This was
not the way I had planned to do it, but I'll keep this idea in mind.

>Question: if the table size is less than N blocks, are you going to read
>every block or try to reduce the number of blocks sampled?

Don't know yet.

>people are setting the stats target to 100 which means a sample size of
>30000 --- how do the page-access counts look in that case?

 rel size | page
 (blocks) | reads
----------+--------
      300 |    300
     3000 |   3000
     5000 |   4999
      10K |   9.9K
      30K |  25.8K
     300K |    85K
       1M |   120K
      10M |   190K
     100M |   260K
       1G |   330K

This is exactly the table I posted before (for sample size 3000) with
every entry multiplied by 10. Well, not quite exactly, but the differences
only show up far behind the decimal point. So for our purposes, for
a given relation size the number of pages accessed is proportional to
the sample size.

Servus
Manfred


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Manfred Koizar <mkoi-pg(at)aon(dot)at>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [GENERAL] Large DB
Date: 2004-04-03 00:57:47
Message-ID: 10036.1080953867@sss.pgh.pa.us
Lists: pgsql-general pgsql-hackers

Manfred Koizar <mkoi-pg(at)aon(dot)at> writes:
>> You'd run the Vitter
>> algorithm separately to decide whether to keep or discard each live row
>> you find in the blocks you read.

> You mean once a block is sampled we inspect it in any case? This was
> not the way I had planned to do it, but I'll keep this idea in mind.

Well, once we've gone to the trouble of reading in a block we
definitely want to count the tuples in it, for the purposes of
extrapolating the total number of tuples in the relation. Given
that, I think the most painless route is simply to use the Vitter
algorithm with the number-of-tuples-scanned as the count variable.
You could dump the logic in acquire_sample_rows that tries to estimate
where to read the N'th tuple from.
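
A minimal sketch of that row stage (again illustration only, with made-up
names; Vitter's algorithm arrives at the same inclusion probabilities while
skipping ahead instead of testing every tuple, but the count variable is
the same, namely live tuples scanned):

#include <stdlib.h>

typedef struct RowSample
{
    long   *rows;           /* reservoir of row identifiers */
    int     sample_size;
    int     nkept;
    long    live_seen;      /* live tuples scanned so far */
} RowSample;

/* Offer the next live tuple found in a sampled block to the reservoir. */
static void
consider_live_row(RowSample *rs, long row_id)
{
    rs->live_seen++;

    if (rs->nkept < rs->sample_size)
    {
        rs->rows[rs->nkept++] = row_id;     /* reservoir not yet full */
    }
    else
    {
        /* keep the t-th live tuple with probability sample_size/t */
        long    slot = (long) (drand48() * rs->live_seen);

        if (slot < rs->sample_size)
            rs->rows[slot] = row_id;        /* replace a random slot */
    }
}

At the end of the scan, live_seen also yields the extrapolation mentioned
earlier: reltuples ~= live_seen * total_blocks / blocks_read.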

If you like I can send you the Vitter paper off-list (I have a PDF of
it). The comments in the code are not really intended to teach someone
what it's good for ...

regards, tom lane


From: Manfred Koizar <mkoi-pg(at)aon(dot)at>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [GENERAL] Large DB
Date: 2004-04-03 01:09:51
Message-ID: ri3s60ltac47uu5kgv3be8nok8v5lo9nrm@email.aon.at
Lists: pgsql-general pgsql-hackers

On Fri, 02 Apr 2004 19:57:47 -0500, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>If you like I can send you the Vitter paper off-list (I have a PDF of
>it). The comments in the code are not really intended to teach someone
>what it's good for ...

Yes, please. [Would have sent this off-list. But I'm blacklisted.]

Servus
Manfred