Re: [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum

From: Florian Helmberger <fh(at)25th-floor(dot)com>
To: pgsql-admin(at)postgresql(dot)org
Subject: pg_class reltuples/relpages not updated by autovacuum/vacuum
Date: 2011-05-17 15:08:35
Message-ID: 4DD28F73.6010108@25th-floor.com
Lists: pgsql-admin pgsql-hackers

Hi.

I'm running a production database with PostgreSQL 9.0.3 (64-bit) on
Debian 5.0.4 and have an issue with a TOAST table and far too frequent
autovacuum runs.

I think I've pinned the problem down to the values pg_class holds for
the affected TOAST table:

relpages | 433596
reltuples | 1868538

These values are significantly too low. Interestingly, the autovacuum
log output reports the correct values:

pages: 0 removed, 34788136 remain
tuples: 932487 removed, 69599038 remain

but these aren't stored in pg_class after each run.

Currently, there are no long-running transactions or dumps running.

I've confirmed that PostgreSQL is using the values stored in pg_class
for its calculations: it starts autovacuum for the table at around
375k dead rows (threshold is 50, scale_factor 0.2, both default).
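
To make the arithmetic concrete, here is a stand-alone sketch of the
threshold calculation autovacuum applies (illustrative variable names,
not the actual autovacuum.c code):

    #include <stdio.h>

    int main(void)
    {
        /* autovacuum triggers when n_dead_tup exceeds
         * vac_base_thresh + vac_scale_factor * reltuples */
        double vac_base_thresh = 50;     /* autovacuum_vacuum_threshold */
        double vac_scale_factor = 0.2;   /* autovacuum_vacuum_scale_factor */
        double reltuples = 1868538;      /* the stale pg_class value above */

        printf("trigger at > %.0f dead tuples\n",
               vac_base_thresh + vac_scale_factor * reltuples);
        return 0;
    }

This prints "trigger at > 373758 dead tuples", matching the ~375k
observed; with the correct reltuples of ~69.6M, the trigger point would
instead be around 13.9M dead tuples.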

Additionally, I've done a manual VACUUM ANALYZE of both the parent
table and the TOAST table, which didn't help either.

Other databases with the same hardware, PostgreSQL and OS versions don't
have this issue.

Currently I've worked around the issue by disabling autovacuum for the
TOAST table and doing manual VACUUM ANALYZE once a week.

Any clue how to get PostgreSQL to store the correct values?

Side note: while trying to debug this I've noticed that the TOAST
chunks on 32-bit systems have the documented size of 2000 bytes, while
on 64-bit systems they have 1996 bytes. Is this normal/on purpose?

Regards,
Florian Helmberger

--

Florian Helmberger --------------------

25th-floor - Operating Custom Solutions
de Pretis & Helmberger KG

Gluckgasse 2/6, 1010 Wien, Austria

Mail: fh(at)25th-floor(dot)com
Web : http://www.25th-floor.com
Tel.: +43 1 / 512 82 89 - 60
Fax : +43 1 / 512 82 89 - 76
Mob.: +43 699 / 109 24 24 5
---------------------------------------


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Florian Helmberger <fh(at)25th-floor(dot)com>
Cc: pgsql-admin(at)postgresql(dot)org
Subject: Re: pg_class reltuples/relpages not updated by autovacuum/vacuum
Date: 2011-05-25 02:47:49
Message-ID: 25017.1306291669@sss.pgh.pa.us
Lists: pgsql-admin pgsql-hackers

Florian Helmberger <fh(at)25th-floor(dot)com> writes:
> I'm running a production database with PostgreSQL 9.0.3 (64-bit) on
> Debian 5.0.4 and have an issue with a TOAST table and far too frequent
> autovacuum runs.

> I think I've pinned the problem down to the values pg_class holds for
> the affected TOAST table:

> relpages | 433596
> reltuples | 1868538

> These values are significantly too low. Interestingly, the autovacuum
> log output reports the correct values:

> pages: 0 removed, 34788136 remain
> tuples: 932487 removed, 69599038 remain

> but these aren't stored in pg_class after each run.

That's exceedingly weird. Do the pg_stat_all_tables columns update
after autovacuums on that table?

regards, tom lane


From: Florian Helmberger <fh(at)25th-floor(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-admin(at)postgresql(dot)org
Subject: Re: pg_class reltuples/relpages not updated by autovacuum/vacuum
Date: 2011-05-25 06:56:01
Message-ID: 4DDCA801.70805@25th-floor.com
Lists: pgsql-admin pgsql-hackers

On 25.05.11 04:47, Tom Lane wrote:

> Florian Helmberger<fh(at)25th-floor(dot)com> writes:
>> I'm running a production database with PostgreSQL 9.0.3 (64-bit) on
>> Debian 5.0.4 and have an issue with a TOAST table and far too frequent
>> autovacuum runs.
>
>> I think I've pinned the problem down to the values pg_class holds for
>> the affected TOAST table:
>
>> relpages | 433596
>> reltuples | 1868538
>
>> These values are significantly too low. Interestingly, the autovacuum
>> log output reports the correct values:
>
>> pages: 0 removed, 34788136 remain
>> tuples: 932487 removed, 69599038 remain
>
>> but these aren't stored in pg_class after each run.
>
> That's exceedingly weird. Do the pg_stat_all_tables columns update
> after autovacuums on that table?

Hi Tom,

Yes they do:

-[ RECORD 1 ]----+------------------------------
relid | 16391
schemaname | pg_toast
relname | pg_toast_16386
seq_scan | 0
seq_tup_read | 0
idx_scan | 298820512
idx_tup_fetch | 1812697121
n_tup_ins | 60907628
n_tup_upd | 0
n_tup_del | 56710637
n_tup_hot_upd | 0
n_live_tup | 4196999
n_dead_tup | 20746580
last_vacuum | 2011-05-21 06:33:49.869459+02
last_autovacuum | 2011-05-15 18:40:49.746234+02
last_analyze | NULL
last_autoanalyze | NULL

That was the last autovacuum run before I disabled it (via a storage
parameter on the main table) and switched to manual vacuums once per week.

I've also rechecked the "sister" database (same hardware, OS/PostgreSQL
version and database schema), which is working as intended.

Regards,
Florian

--

Florian Helmberger --------------------

25th-floor - Operating Custom Solutions
de Pretis & Helmberger KG

Gluckgasse 2/6, 1010 Wien, Austria

Mail: fh(at)25th-floor(dot)com
Web : http://www.25th-floor.com
Tel.: +43 1 / 512 82 89 - 60
Fax : +43 1 / 512 82 89 - 76
Mob.: +43 699 / 109 24 24 5
---------------------------------------


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Florian Helmberger <fh(at)25th-floor(dot)com>
Cc: pgsql-hackers(at)postgreSQL(dot)org, Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Subject: Re: [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum
Date: 2011-05-25 15:47:52
Message-ID: 19738.1306338472@sss.pgh.pa.us
Lists: pgsql-admin pgsql-hackers

Florian Helmberger <fh(at)25th-floor(dot)com> writes:
> On 25.05.11 04:47, Tom Lane wrote:
>> Florian Helmberger<fh(at)25th-floor(dot)com> writes:
>>> I'm running a production database with PostgreSQL 9.0.3 (64-bit) on
>>> Debian 5.0.4 and have an issue with a TOAST table and far too frequent
>>> autovacuum runs.
>>>
>>> I think I've pinned the problem down to the values pg_class holds for
>>> the affected TOAST table:
>>>
>>> relpages | 433596
>>> reltuples | 1868538
>>>
>>> These values are significantly too low. Interestingly, the autovacuum
>>> log output reports the correct values:
>>>
>>> pages: 0 removed, 34788136 remain
>>> tuples: 932487 removed, 69599038 remain
>>>
>>> but these aren't stored in pg_class after each run.

>> That's exceedingly weird. Do the pg_stat_all_tables columns update
>> after autovacuums on that table?

> Yes they do:

I think I see what must be going on here: that toast table must contain
a long run of all-visible-according-to-the-VM pages (which is hardly a
surprising situation). This results in VACUUM choosing not to update
the pg_class entry:

    /*
     * Update statistics in pg_class.  But only if we didn't skip any pages;
     * the tuple count only includes tuples from the pages we've visited, and
     * we haven't frozen tuples in unvisited pages either.  The page count is
     * accurate in any case, but because we use the reltuples / relpages ratio
     * in the planner, it's better to not update relpages either if we can't
     * update reltuples.
     */
    if (vacrelstats->scanned_all)
        vac_update_relstats(onerel,
                            vacrelstats->rel_pages, vacrelstats->rel_tuples,
                            vacrelstats->hasindex,
                            FreezeLimit);

For an ordinary table this wouldn't be fatal because we'll still do an
ANALYZE from time to time, and that will update the entries with new
(approximate) values. But we never run ANALYZE on toast tables.

And this would *still* be okay, because as noted in the comment, the
planner only depends on the ratio being roughly correct, not on either
individual value being current. But autovacuum didn't get the memo;
it thinks it can use reltuples to make decisions.

I can see two basic approaches we might take here:

1. Modify autovacuum to use something from the stats collector, rather
than reltuples, to make its decisions. I'm not too clear on why
reltuples is being used there anyway; is there some good algorithmic or
statistical reason why AV should be looking at a number from the last
vacuum?

2. Revise the vacuum code so that it doesn't skip updating the pg_class
entries. We could have it count the number of pages it skipped, rather
than just keeping a bool, and then scale up the rel_tuples count to be
approximately right by assuming the skipped pages have tuple density
similar to the scanned ones.
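
A minimal sketch of that scaling (illustrative names, not actual vacuum
code) might look like:

    /* extrapolate the counted tuples over the skipped pages, assuming
     * the skipped pages hold tuples at the observed density */
    static double
    scaled_rel_tuples(double counted_tuples,
                      double scanned_pages, double skipped_pages)
    {
        double density = counted_tuples / scanned_pages;

        return counted_tuples + density * skipped_pages;
    }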

Thoughts?

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Florian Helmberger <fh(at)25th-floor(dot)com>
Cc: pgsql-admin(at)postgresql(dot)org
Subject: Re: pg_class reltuples/relpages not updated by autovacuum/vacuum
Date: 2011-05-25 15:49:57
Message-ID: 19785.1306338597@sss.pgh.pa.us
Lists: pgsql-admin pgsql-hackers

Florian Helmberger <fh(at)25th-floor(dot)com> writes:
> I think I've pinned the problem down to the values pg_class holds for
> the affected TOAST table:
> relpages | 433596
> reltuples | 1868538
> These values are significantly too low. Interestingly, the autovacuum
> log output reports the correct values:
> pages: 0 removed, 34788136 remain
> tuples: 932487 removed, 69599038 remain
> but these aren't stored in pg_class after each run.

I've moved discussion of this to pgsql-hackers, since this appears to be
an actual bug.

> Side note: while trying to debug this I've noticed, that the TOAST
> chunks on 32-bit systems have the documented size of 2000 bytes, on
> 64-bit systems they have 1996 bytes. Is this normal/on purpose?

I don't have the exact numbers in my head, but the TOAST chunk size does
depend on a MAXALIGN calculation, so it being different between 32- and
64-bit isn't surprising.
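
For the curious, a back-of-the-envelope reconstruction (not the
verbatim server macros; it assumes default 8K pages and the 9.0-era
TOAST layout of four chunks per page) shows how the alignment
difference yields exactly 2000 vs. 1996 bytes:

    #include <stdio.h>

    static long chunk_size(long align)   /* align = MAXIMUM_ALIGNOF */
    {
        long blcksz = 8192;     /* default page size */
        long page_hdr = 24;     /* SizeOfPageHeaderData */
        long item_id = 4;       /* sizeof(ItemIdData) */
        long per_page = 4;      /* toast chunks per page */
        long tup_hdr = 23;      /* offsetof(HeapTupleHeaderData, t_bits) */
        long hdr_space, tuple_space;

        /* round the page overhead up, the per-tuple space down */
        hdr_space = (page_hdr + per_page * item_id + align - 1) & ~(align - 1);
        tuple_space = ((blcksz - hdr_space) / per_page) & ~(align - 1);

        /* subtract aligned tuple header, chunk_id (Oid), chunk_seq
         * (int32), and the varlena header of the chunk datum */
        return tuple_space
            - ((tup_hdr + align - 1) & ~(align - 1))
            - 4 - 4 - 4;
    }

    int main(void)
    {
        printf("MAXALIGN 4 (32-bit): %ld\n", chunk_size(4));  /* 2000 */
        printf("MAXALIGN 8 (64-bit): %ld\n", chunk_size(8));  /* 1996 */
        return 0;
    }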

regards, tom lane


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Florian Helmberger <fh(at)25th-floor(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum
Date: 2011-05-25 16:00:38
Message-ID: 1306339028-sup-6948@alvh.no-ip.org
Lists: pgsql-admin pgsql-hackers

Excerpts from Tom Lane's message of mié may 25 11:47:52 -0400 2011:

> I think I see what must be going on here: that toast table must contain
> a long run of all-visible-according-to-the-VM pages (which is hardly a
> surprising situation). This results in VACUUM choosing not to update
> the pg_class entry:
>
> /*
>  * Update statistics in pg_class.  But only if we didn't skip any pages;
>  * the tuple count only includes tuples from the pages we've visited, and
>  * we haven't frozen tuples in unvisited pages either.  The page count is
>  * accurate in any case, but because we use the reltuples / relpages ratio
>  * in the planner, it's better to not update relpages either if we can't
>  * update reltuples.
>  */
> if (vacrelstats->scanned_all)
>     vac_update_relstats(onerel,
>                         vacrelstats->rel_pages, vacrelstats->rel_tuples,
>                         vacrelstats->hasindex,
>                         FreezeLimit);
>
> For an ordinary table this wouldn't be fatal because we'll still do an
> ANALYZE from time to time, and that will update the entries with new
> (approximate) values. But we never run ANALYZE on toast tables.

Ouch.

> I can see two basic approaches we might take here:
>
> 1. Modify autovacuum to use something from the stats collector, rather
> than reltuples, to make its decisions. I'm not too clear on why
> reltuples is being used there anyway; is there some good algorithmic or
> statistical reason why AV should be looking at a number from the last
> vacuum?

It uses reltuples simply because it was what the original contrib code
was using. Since pgstat was considerably weaker at the time, reltuples
might have been the only thing available. It's certainly the case that
pgstat has improved a lot since autovacuum got in, and some things have
been revised but not this one.

> 2. Revise the vacuum code so that it doesn't skip updating the pg_class
> entries. We could have it count the number of pages it skipped, rather
> than just keeping a bool, and then scale up the rel_tuples count to be
> approximately right by assuming the skipped pages have tuple density
> similar to the scanned ones.

Hmm, interesting idea. This would be done only for toast tables, or all
tables?

At this point I only wonder why we store tuples & pages rather than just
live tuple density.

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Florian Helmberger <fh(at)25th-floor(dot)com>, pgsql-hackers(at)postgresql(dot)org, Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Subject: Re: [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum
Date: 2011-05-25 16:17:11
Message-ID: BANLkTi=973Arh-nXLO6YFP7_ZTCeNp-X7w@mail.gmail.com
Lists: pgsql-admin pgsql-hackers

On Wed, May 25, 2011 at 11:47 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> 2. Revise the vacuum code so that it doesn't skip updating the pg_class
> entries.  We could have it count the number of pages it skipped, rather
> than just keeping a bool, and then scale up the rel_tuples count to be
> approximately right by assuming the skipped pages have tuple density
> similar to the scanned ones.

This approach doesn't seem like a good idea to me. The skipped
portions of the table are the ones that haven't been modified in a
while, so this approach embeds the assumption that the tuples in the
hot and cold portions of the table are the same size. That might be
true, but it isn't hard to think of cases where it might not be. There
could also very easily be sampling error, because it's easy to imagine
a situation where 99% of the table is getting skipped. Any error that
creeps into the estimate is going to get scaled up along with the
estimate itself.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Florian Helmberger <fh(at)25th-floor(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum
Date: 2011-05-25 16:23:56
Message-ID: 22173.1306340636@sss.pgh.pa.us
Lists: pgsql-admin pgsql-hackers

Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> Excerpts from Tom Lane's message of mié may 25 11:47:52 -0400 2011:
>> I can see two basic approaches we might take here:
>>
>> 1. Modify autovacuum to use something from the stats collector, rather
>> than reltuples, to make its decisions. I'm not too clear on why
>> reltuples is being used there anyway; is there some good algorithmic or
>> statistical reason why AV should be looking at a number from the last
>> vacuum?

> It uses reltuples simply because it was what the original contrib code
> was using. Since pgstat was considerably weaker at the time, reltuples
> might have been the only thing available. It's certainly the case that
> pgstat has improved a lot since autovacuum got in, and some things have
> been revised but not this one.

On reflection I'm hesitant to do this, especially for a backpatched bug
fix, because it would be changing the feedback loop behavior for
autovacuum scheduling. That could have surprising consequences.

>> 2. Revise the vacuum code so that it doesn't skip updating the pg_class
>> entries. We could have it count the number of pages it skipped, rather
>> than just keeping a bool, and then scale up the rel_tuples count to be
>> approximately right by assuming the skipped pages have tuple density
>> similar to the scanned ones.

> Hmm, interesting idea. This would be done only for toast tables, or all
> tables?

I'm thinking just do it for all. The fact that these numbers don't
necessarily update after a vacuum is certainly surprising in and of
itself, and it did not work that way before the VM patch went in.
I'm concerned about other stuff besides AV not dealing well with
obsolete values.

> At this point I only wonder why we store tuples & pages rather than just
> live tuple density.

It's just for backwards compatibility. I've thought about doing that in
the past, but I don't know what client-side code might be looking at
relpages/reltuples. It's not like collapsing them into one field would
save much, anyway.

regards, tom lane


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Alvaro Herrera" <alvherre(at)commandprompt(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Florian Helmberger" <fh(at)25th-floor(dot)com>, "pgsql-hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum
Date: 2011-05-25 16:37:24
Message-ID: 4DDCE9F4020000250003DC48@gw.wicourts.gov
Lists: pgsql-admin pgsql-hackers

Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> I don't know what client-side code might be looking at
> relpages/reltuples.

I know that I find reltuples useful for getting an "accurate enough"
sense of rows in a table (or set of tables) without resorting to
count(*). I'd be OK with any two of pages, tuples, and density; but
dropping to one would make databases harder to manage, IMV.

Personally, I don't have much code that uses those columns;
eliminating an existing column wouldn't involve much pain for me as
long as it could still be derived.

-Kevin


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Florian Helmberger <fh(at)25th-floor(dot)com>, pgsql-hackers(at)postgresql(dot)org, Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Subject: Re: [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum
Date: 2011-05-25 16:41:39
Message-ID: 22497.1306341699@sss.pgh.pa.us
Lists: pgsql-admin pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Wed, May 25, 2011 at 11:47 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> 2. Revise the vacuum code so that it doesn't skip updating the pg_class
>> entries. We could have it count the number of pages it skipped, rather
>> than just keeping a bool, and then scale up the rel_tuples count to be
>> approximately right by assuming the skipped pages have tuple density
>> similar to the scanned ones.

> This approach doesn't seem like a good idea to me. The skipped
> portions of the table are the ones that haven't been modified in a
> while, so this approach embeds the assumption that the tuples in the
> hot and cold portions of the table are the same size. That might be
> true, but it isn't hard to think of cases where it might not be. There
> could also very easily be sampling error, because it's easy to imagine
> a situation where 99% of the table is getting skipped.

Yeah, I had been thinking about the latter point. We could be
conservative and just keep the reported tuple density the same (ie,
update relpages to the new correct value, while setting reltuples so
that the density ratio doesn't change). But that has its own problems
when the table contents *do* change. What I'm currently imagining is
to do a smoothed moving average, where we factor in the new density
estimate with a weight dependent on the percentage of the table we did
scan. That is, the calculation goes something like

old_density = old_reltuples / old_relpages
new_density = counted_tuples / scanned_pages
reliability = scanned_pages / new_relpages
updated_density = old_density + (new_density - old_density) * reliability
new_reltuples = updated_density * new_relpages

We could slow the moving-average convergence even further when
reliability is small; for instance if you were really paranoid you might
want to use the square of reliability in line 4. That might be
overdoing it, though.
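
Rendered as a minimal C sketch (illustrative names; a sketch of the
proposal, not actual vacuum.c code):

    static double
    smoothed_reltuples(double old_reltuples, double old_relpages,
                       double counted_tuples, double scanned_pages,
                       double new_relpages)
    {
        double old_density = old_reltuples / old_relpages;
        double new_density = counted_tuples / scanned_pages;
        double reliability = scanned_pages / new_relpages;

        /* moving average: weight the new observation by coverage */
        double updated_density =
            old_density + (new_density - old_density) * reliability;

        return updated_density * new_relpages;
    }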

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Florian Helmberger <fh(at)25th-floor(dot)com>, pgsql-hackers(at)postgresql(dot)org, Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Subject: Re: [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum
Date: 2011-05-25 16:54:28
Message-ID: BANLkTim7YNLBBDUT9qXZrO2biuwDZ3kjyg@mail.gmail.com
Lists: pgsql-admin pgsql-hackers

On Wed, May 25, 2011 at 12:41 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Yeah, I had been thinking about the latter point.  We could be
> conservative and just keep the reported tuple density the same (ie,
> update relpages to the new correct value, while setting reltuples so
> that the density ratio doesn't change).  But that has its own problems
> when the table contents *do* change.  What I'm currently imagining is
> to do a smoothed moving average, where we factor in the new density
> estimate with a weight dependent on the percentage of the table we did
> scan.  That is, the calculation goes something like
>
> old_density = old_reltuples / old_relpages
> new_density = counted_tuples / scanned_pages
> reliability = scanned_pages / new_relpages
> updated_density = old_density + (new_density - old_density) * reliability
> new_reltuples = updated_density * new_relpages
>
> We could slow the moving-average convergence even further when
> reliability is small; for instance if you were really paranoid you might
> want to use the square of reliability in line 4.  That might be
> overdoing it, though.

I don't know. That's maybe better, but I'd be willing to wager that
in some cases it will just slow down the rate at which we converge to
a completely incorrect value, while in other cases it'll fail to
update the data when it really has changed.

I am wondering, though, why we're not just inserting a special-purpose
hack for TOAST tables. Your email seems to indicate that regular
tables are already handled well enough, and certainly if we only whack
around the TOAST behavior it's much less likely to fry anything.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Kevin Grittner <kevin(dot)grittner(at)wicourts(dot)gov>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Florian Helmberger <fh(at)25th-floor(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum
Date: 2011-05-25 16:55:09
Message-ID: 1306342228-sup-2964@alvh.no-ip.org
Lists: pgsql-admin pgsql-hackers

Excerpts from Kevin Grittner's message of mié may 25 12:37:24 -0400 2011:
> Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
> > I don't know what client-side code might be looking at
> > relpages/reltuples.
>
> I know that I find reltuples useful for getting an "accurate enough"
> sense of rows in a table (or set of tables) without resorting to
> count(*). I'd be OK with any two of pages, tuples, and density; but
> dropping to one would make databases harder to manage, IMV.
>
> Personally, I don't have much code that uses those columns;
> eliminating an existing column wouldn't involve much pain for me as
> long as it could still be derived.

Well, we only actually need to store one number, because you can figure
out a much more precise number-of-pages figure with pg_relation_size()
divided by configured page size.

(We feel free to hack around catalogs in other places, and we regularly
break pgAdmin and the like when we drop columns -- people just live with
it and update their tools. I don't think it's such a big deal in this
particular case.)

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Florian Helmberger <fh(at)25th-floor(dot)com>, pgsql-hackers(at)postgresql(dot)org, Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Subject: Re: [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum
Date: 2011-05-25 17:04:27
Message-ID: 22926.1306343067@sss.pgh.pa.us
Lists: pgsql-admin pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Wed, May 25, 2011 at 12:41 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Yeah, I had been thinking about the latter point. We could be
>> conservative and just keep the reported tuple density the same (ie,
>> update relpages to the new correct value, while setting reltuples so
>> that the density ratio doesn't change). But that has its own problems
>> when the table contents *do* change. What I'm currently imagining is
>> to do a smoothed moving average, where we factor in the new density
>> estimate with a weight dependent on the percentage of the table we did
>> scan. That is, the calculation goes something like
>>
>> old_density = old_reltuples / old_relpages
>> new_density = counted_tuples / scanned_pages
>> reliability = scanned_pages / new_relpages
>> updated_density = old_density + (new_density - old_density) * reliability
>> new_reltuples = updated_density * new_relpages
>>
>> We could slow the moving-average convergence even further when
>> reliability is small; for instance if you were really paranoid you might
>> want to use the square of reliability in line 4. That might be
>> overdoing it, though.

> I don't know. That's maybe better, but I'd be willing to wager that
> in some cases it will just slow down the rate at which we converge to
> a completely incorrect value, while in other cases it'll fail to
> update the data when it really has changed.

[ shrug... ] When you don't have complete information, it's *always*
the case that you will sometimes make a mistake. That's not
justification for paralysis, especially not when the existing code is
demonstrably broken.

What occurs to me after thinking a bit more is that the existing tuple
density is likely to be only an estimate, too (one coming from the last
ANALYZE, which could very well have scanned even less of the table than
VACUUM did). So what I now think is that both VACUUM and ANALYZE ought
to use a calculation of the above form to arrive at a new value for
pg_class.reltuples. In both cases it would be pretty easy to track the
number of pages we looked at while counting tuples, so the same raw
information is available.

> I am wondering, though, why we're not just inserting a special-purpose
> hack for TOAST tables.

Because the problem is not specific to TOAST tables. As things
currently stand, we will accept the word of an ANALYZE as gospel even if
it scanned 1% of the table, and completely ignore the results from a
VACUUM even if it scanned 99% of the table. This is not sane.

regards, tom lane


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Florian Helmberger <fh(at)25th-floor(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum
Date: 2011-05-25 17:13:28
Message-ID: 1306343006-sup-4509@alvh.no-ip.org
Lists: pgsql-admin pgsql-hackers

Excerpts from Robert Haas's message of mié may 25 12:54:28 -0400 2011:

> I don't know. That's maybe better, but I'd be willing to wager that
> in some cases it will just slow down the rate at which we converge to
> a completely incorrect value, while in other cases it'll fail to
> update the data when it really has changed.

For regular tables I don't think there's a real problem, because it'll
get fixed the next time a full scan happens anyway. For toast tables, I
think the set of operations is limited enough that this is easy to
prove correct (or to fix so that it is) -- no HOT updates (in fact no
updates at all), etc.

BTW, one thing we haven't fixed at all is how HOT updates affect
vacuuming behavior ...

> I am wondering, though, why we're not just inserting a special-purpose
> hack for TOAST tables. Your email seems to indicate that regular
> tables are already handled well enough, and certainly if we only whack
> around the TOAST behavior it's much less likely to fry anything.

Well, having two paths, one of which is uncommonly used, means that the
uncommon one will get all the bugs. As with the xl_commit WAL record
comment from Simon, I'd rather stick with having a single path.

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Florian Helmberger <fh(at)25th-floor(dot)com>, pgsql-hackers(at)postgresql(dot)org, Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Subject: Re: [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum
Date: 2011-05-25 17:15:59
Message-ID: BANLkTimVhdO_bKQagRsH0OLp7MxgJZDryg@mail.gmail.com
Lists: pgsql-admin pgsql-hackers

On Wed, May 25, 2011 at 1:04 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> [ shrug... ]  When you don't have complete information, it's *always*
> the case that you will sometimes make a mistake.  That's not
> justification for paralysis, especially not when the existing code is
> demonstrably broken.
>
> What occurs to me after thinking a bit more is that the existing tuple
> density is likely to be only an estimate, too (one coming from the last
> ANALYZE, which could very well have scanned even less of the table than
> VACUUM did).  So what I now think is that both VACUUM and ANALYZE ought
> to use a calculation of the above form to arrive at a new value for
> pg_class.reltuples.  In both cases it would be pretty easy to track the
> number of pages we looked at while counting tuples, so the same raw
> information is available.
>
>> I am wondering, though, why we're not just inserting a special-purpose
>> hack for TOAST tables.
>
> Because the problem is not specific to TOAST tables.  As things
> currently stand, we will accept the word of an ANALYZE as gospel even if
> it scanned 1% of the table, and completely ignore the results from a
> VACUUM even if it scanned 99% of the table.  This is not sane.

I agree that if VACUUM scanned 99% of the table, it's probably fine to
use its numbers. It's also fine to use the numbers from ANALYZE,
because those pages are chosen randomly. What bothers me is the idea
of using a small *non-random* sample, and I'm not sure that
incorporating possibly-bogus results slowly is any better than
incorporating them quickly.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Cédric Villemain <cedric(dot)villemain(dot)debian(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Kevin Grittner <kevin(dot)grittner(at)wicourts(dot)gov>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Florian Helmberger <fh(at)25th-floor(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum
Date: 2011-05-25 17:24:01
Message-ID: BANLkTimyfvQXsLswzBCV5PFi+NN_TeDZ+g@mail.gmail.com
Lists: pgsql-admin pgsql-hackers

2011/5/25 Alvaro Herrera <alvherre(at)commandprompt(dot)com>:
> Excerpts from Kevin Grittner's message of mié may 25 12:37:24 -0400 2011:
>> Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>
>> > I don't know what client-side code might be looking at
>> > relpages/reltuples.
>>
>> I know that I find reltuples useful for getting an "accurate enough"
>> sense of rows in a table (or set of tables) without resorting to
>> count(*).  I'd be OK with any two of pages, tuples, and density; but
>> dropping to one would make databases harder to manage, IMV.
>>
>> Personally, I don't have much code that uses those columns;
>> eliminating an existing column wouldn't involve much pain for me as
>> long as it could still be derived.
>
> Well, we only actually need to store one number, because you can figure
> out a much more precise number-of-pages figure with pg_relation_size()
> divided by configured page size.
>
> (We feel free to hack around catalogs in other places, and we regularly
> break pgAdmin and the like when we drop columns -- people just live with
> it and update their tools.  I don't think it's such a big deal in this
> particular case.)

I may be missing something, but we still need the relation size in
costsize.c even if we have a reldensity (or we need reltuples).
Otherwise, what values should be used to estimate the relation size?
(pg_relation_size() goes down to the kernel/fs to ask for stat.st_size;
is that really what we want?)

--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Florian Helmberger <fh(at)25th-floor(dot)com>, pgsql-hackers(at)postgresql(dot)org, Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Subject: Re: [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum
Date: 2011-05-25 17:31:03
Message-ID: BANLkTimaDj950K-298JW09RrmG0eJ_C=qQ@mail.gmail.com
Lists: pgsql-admin pgsql-hackers

On Wed, May 25, 2011 at 1:15 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> I agree that if VACUUM scanned 99% of the table, it's probably fine to
> use its numbers.  It's also fine to use the numbers from ANALYZE,
> because those pages are chosen randomly.  What bothers me is the idea
> of using a small *non-random* sample, and I'm not sure that
> incorporating possibly-bogus results slowly is any better than
> incorporating them quickly.

In particular, unless I'm misremembering, VACUUM *always* scans the
first few pages of the table, until it sees enough consecutive
all-visible bits that it decides to start skipping. If I'm right
about that, then those pages could easily end up being overweighted
when VACUUM does the counting; especially if ANALYZE or an actual
full-table vacuum aren't allowed to snap the count back to reality.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Cédric Villemain <cedric(dot)villemain(dot)debian(at)gmail(dot)com>
Cc: Kevin Grittner <kevin(dot)grittner(at)wicourts(dot)gov>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Florian Helmberger <fh(at)25th-floor(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum
Date: 2011-05-25 17:37:07
Message-ID: 1306344946-sup-417@alvh.no-ip.org
Lists: pgsql-admin pgsql-hackers

Excerpts from Cédric Villemain's message of mié may 25 13:24:01 -0400 2011:

> > Well, we only actually need to store one number, because you can figure
> > out a much more precise number-of-pages figure with pg_relation_size()
> > divided by configured page size.

> I may miss something but we need relation size in costsize.c even if
> we have a reldensity (or we need a reltuples). Else what values should
> be used to estimate the relation size ? (pg_relation_size() goes down
> to kernel/fs to ask the stat.st_size, is it really what we want?)

Actually yes, the planner does go to kernel to determine the current
relation size, and then multiplies by density as computed from catalog
data to figure out current reasonably accurate number of tuples.
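
A stand-alone illustration of that approach, assuming default 8K pages
and a single-segment relation (the real server uses
RelationGetNumberOfBlocks(); the file path here is hypothetical):

    #include <stdio.h>
    #include <sys/stat.h>

    #define BLCKSZ 8192

    int main(void)
    {
        struct stat st;

        /* hypothetical relfilenode path for the toast table */
        if (stat("base/16384/16391", &st) == 0)
        {
            long cur_pages = st.st_size / BLCKSZ;
            /* density from the (possibly stale) catalog entry */
            double density = 1868538.0 / 433596.0;

            printf("estimated tuples: %.0f\n", cur_pages * density);
        }
        return 0;
    }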

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Cédric Villemain <cedric(dot)villemain(dot)debian(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Kevin Grittner <kevin(dot)grittner(at)wicourts(dot)gov>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Florian Helmberger <fh(at)25th-floor(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum
Date: 2011-05-25 19:14:06
Message-ID: BANLkTinKDYoBo8ZpreagS3E5wkfL5zYNcg@mail.gmail.com
Lists: pgsql-admin pgsql-hackers

2011/5/25 Alvaro Herrera <alvherre(at)commandprompt(dot)com>:
> Excerpts from Cédric Villemain's message of mié may 25 13:24:01 -0400 2011:
>
>> > Well, we only actually need to store one number, because you can figure
>> > out a much more precise number-of-pages figure with pg_relation_size()
>> > divided by configured page size.
>
>> I may miss something but we need relation size in costsize.c even if
>> we have a reldensity (or we need a reltuples). Else what values should
>> be used to estimate the relation size ? (pg_relation_size() goes down
>> to kernel/fs to ask the stat.st_size, is it really what we want?)
>
> Actually yes, the planner does go to kernel to determine the current
> relation size, and then multiplies by density as computed from catalog
> data to figure out current reasonably accurate number of tuples.

Okay! I just read that part. Interesting.
(If I've read the code correctly, we find the last segment and then
seek to the end of that segment to get the size.)

Makes more sense, suddenly :)

--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Florian Helmberger <fh(at)25th-floor(dot)com>, pgsql-hackers(at)postgresql(dot)org, Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Subject: Re: [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum
Date: 2011-05-25 21:54:30
Message-ID: 28141.1306360470@sss.pgh.pa.us
Lists: pgsql-admin pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Wed, May 25, 2011 at 1:04 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Because the problem is not specific to TOAST tables. As things
>> currently stand, we will accept the word of an ANALYZE as gospel even if
>> it scanned 1% of the table, and completely ignore the results from a
>> VACUUM even if it scanned 99% of the table. This is not sane.

> I agree that if VACUUM scanned 99% of the table, it's probably fine to
> use its numbers. It's also fine to use the numbers from ANALYZE,
> because those pages are chosen randomly. What bothers me is the idea
> of using a small *non-random* sample, and I'm not sure that
> incorporating possibly-bogus results slowly is any better than
> incorporating them quickly.

The above is simply fuzzy thinking. The fact that ANALYZE looked at a
random subset of pages is *no guarantee whatsoever* that its results are
highly accurate. They might be more trustworthy than VACUUM's nonrandom
sample of a similar number of pages, but it doesn't hold even a little
bit of water to claim that we should believe ANALYZE completely and
VACUUM not at all even when the latter has looked at a significantly
larger sample of pages.

In any case, your line of thought doesn't help us for fixing the problem
with toast tables, because we aren't going to start doing ANALYZEs on
toast tables.

The bottom line here is that making use of stats we have is a lot better
than not making use of them, even if they aren't entirely trustworthy.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Florian Helmberger <fh(at)25th-floor(dot)com>, pgsql-hackers(at)postgresql(dot)org, Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Subject: Re: [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum
Date: 2011-05-25 23:11:31
Message-ID: BANLkTinN6KrpcFDu-Ha0hCa6eN6hr4FYuA@mail.gmail.com
Lists: pgsql-admin pgsql-hackers

On Wed, May 25, 2011 at 5:54 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> On Wed, May 25, 2011 at 1:04 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> Because the problem is not specific to TOAST tables.  As things
>>> currently stand, we will accept the word of an ANALYZE as gospel even if
>>> it scanned 1% of the table, and completely ignore the results from a
>>> VACUUM even if it scanned 99% of the table.  This is not sane.
>
>> I agree that if VACUUM scanned 99% of the table, it's probably fine to
>> use its numbers.  It's also fine to use the numbers from ANALYZE,
>> because those pages are chosen randomly.  What bothers me is the idea
>> of using a small *non-random* sample, and I'm not sure that
>> incorporating possibly-bogus results slowly is any better than
>> incorporating them quickly.
>
> The above is simply fuzzy thinking.  The fact that ANALYZE looked at a
> random subset of pages is *no guarantee whatsoever* that its results are
> highly accurate.  They might be more trustworthy than VACUUM's nonrandom
> sample of a similar number of pages, but it doesn't hold even a little
> bit of water to claim that we should believe ANALYZE completely and
> VACUUM not at all even when the latter has looked at a significantly
> larger sample of pages.

I think you're arguing against a straw-man.

> In any case, your line of thought doesn't help us for fixing the problem
> with toast tables, because we aren't going to start doing ANALYZEs on
> toast tables.

Can we simply use a constant for tuple density on TOAST tables?

> The bottom line here is that making use of stats we have is a lot better
> than not making use of them, even if they aren't entirely trustworthy.

http://dilbert.com/strips/comic/2008-05-07/

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Florian Helmberger <fh(at)25th-floor(dot)com>, pgsql-hackers(at)postgresql(dot)org, Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Subject: Re: [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum
Date: 2011-05-26 05:05:53
Message-ID: BANLkTi=jcKSU5zR+JG2BvfLNCtezC9sb+g@mail.gmail.com
Lists: pgsql-admin pgsql-hackers

On Wed, May 25, 2011 at 9:41 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Yeah, I had been thinking about the latter point.  We could be
> conservative and just keep the reported tuple density the same (ie,
> update relpages to the new correct value, while setting reltuples so
> that the density ratio doesn't change).  But that has its own problems
> when the table contents *do* change.  What I'm currently imagining is
> to do a smoothed moving average, where we factor in the new density
> estimate with a weight dependent on the percentage of the table we did
> scan.  That is, the calculation goes something like
>
> old_density = old_reltuples / old_relpages
> new_density = counted_tuples / scanned_pages
> reliability = scanned_pages / new_relpages
> updated_density = old_density + (new_density - old_density) * reliability
> new_reltuples = updated_density * new_relpages

This amounts to assuming that the pages observed in the vacuum have
the density observed, and the pages that weren't seen have the density
that was previously in the reltuples/relpages stats. That seems like
a pretty solid approach to me. If the numbers were sane before, it
follows that they should be sane after the update.

--
greg


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Florian Helmberger <fh(at)25th-floor(dot)com>, pgsql-hackers(at)postgresql(dot)org, Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Subject: Re: [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum
Date: 2011-05-26 05:12:13
Message-ID: BANLkTinX4jG+4n1vSqGzvdTr44j4cOtBcg@mail.gmail.com
Lists: pgsql-admin pgsql-hackers

On Wed, May 25, 2011 at 10:05 PM, Greg Stark <gsstark(at)mit(dot)edu> wrote:
>> updated_density = old_density + (new_density - old_density) * reliability
>> new_reltuples = updated_density * new_relpages
>
> This amounts to assuming that the pages observed in the vacuum have
> the density observed, and the pages that weren't seen have the density
> that was previously in the reltuples/relpages stats.

In case it's not clear, Tom's expression for updated_density is
equivalent by simple algebra to:

updated_density = (old_density * (1-reliability)) + (new_density * reliability)

--
greg


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Florian Helmberger <fh(at)25th-floor(dot)com>, pgsql-hackers(at)postgresql(dot)org, Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Subject: Re: [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum
Date: 2011-05-26 15:25:00
Message-ID: 15043.1306423500@sss.pgh.pa.us
Lists: pgsql-admin pgsql-hackers

Greg Stark <gsstark(at)mit(dot)edu> writes:
> On Wed, May 25, 2011 at 9:41 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> ... What I'm currently imagining is
>> to do a smoothed moving average, where we factor in the new density
>> estimate with a weight dependent on the percentage of the table we did
>> scan. That is, the calculation goes something like
>>
>> old_density = old_reltuples / old_relpages
>> new_density = counted_tuples / scanned_pages
>> reliability = scanned_pages / new_relpages
>> updated_density = old_density + (new_density - old_density) * reliability
>> new_reltuples = updated_density * new_relpages

> This amounts to assuming that the pages observed in the vacuum have
> the density observed, and the pages that weren't seen have the density
> that was previously in the reltuples/relpages stats. That seems like
> a pretty solid approach to me. If the numbers were sane before, it
> follows that they should be sane after the update.

Hm, that's an interesting way of looking at it, but I was coming at it
from a signal-processing point of view. What Robert is concerned about
is that if VACUUM is cleaning a non-representative sample of pages, and
repeated VACUUMs examine pretty much the same sample each time, then
over repeated applications of the above formula the estimated density
will eventually converge to what we are seeing in the sample. The speed
of convergence depends on the moving-average multiplier, ie the
"reliability" number above, and what I was after was just to slow down
convergence for smaller samples. So I wouldn't have any problem with
including a fudge factor to make the convergence even slower. But your
analogy makes it seem like this particular formulation is actually
"right" in some sense.

One other point here is that Florian's problem is really only with our
failing to update relpages. I don't think there is any part of the
system that particularly cares about reltuples for a toast table. So
even if the value did converge to some significantly-bad estimate over
time, it's not really an issue AFAICS. We do care about having a sane
reltuples estimate for regular tables, but for those we should have a
mixture of updates from ANALYZE and updates from VACUUM. Also, for both
regular and toast tables we will have an occasional vacuum-for-wraparound
that is guaranteed to scan all pages and hence do a hard reset of
reltuples to the correct value.

I'm still of the opinion that an incremental estimation process like
the above is a lot saner than what we're doing now, snarky Dilbert
references notwithstanding. The only thing that seems worthy of debate
from here is whether we should trust ANALYZE's estimates a bit more than
VACUUM's estimates, on the grounds that the former are more likely to be
from a random subset of pages. We could implement that by applying a
fudge factor when folding a VACUUM estimate into the moving average (ie,
multiply its reliability by something less than one). I don't have any
principled suggestion for just what the fudge factor ought to be, except
that I don't think "zero" is the best value, which AFAICT is what Robert
is arguing. I think Greg's argument shows that "one" is the right value
when dealing with an ANALYZE estimate, if you believe that ANALYZE saw a
random set of pages ... but using that for VACUUM does seem
overoptimistic.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Florian Helmberger <fh(at)25th-floor(dot)com>, pgsql-hackers(at)postgresql(dot)org, Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Subject: Re: [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum
Date: 2011-05-26 15:55:58
Message-ID: BANLkTikrCaoZCVrda6SNCUQFiYvvZX75BA@mail.gmail.com
Lists: pgsql-admin pgsql-hackers

On Thu, May 26, 2011 at 11:25 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> I'm still of the opinion that an incremental estimation process like
> the above is a lot saner than what we're doing now, snarky Dilbert
> references notwithstanding.  The only thing that seems worthy of debate
> from here is whether we should trust ANALYZE's estimates a bit more than
> VACUUM's estimates, on the grounds that the former are more likely to be
> from a random subset of pages.  We could implement that by applying a
> fudge factor when folding a VACUUM estimate into the moving average (ie,
> multiply its reliability by something less than one).  I don't have any
> principled suggestion for just what the fudge factor ought to be, except
> that I don't think "zero" is the best value, which AFAICT is what Robert
> is arguing.  I think Greg's argument shows that "one" is the right value
> when dealing with an ANALYZE estimate, if you believe that ANALYZE saw a
> random set of pages ... but using that for VACUUM does seem
> overoptimistic.

The problem is that it's quite difficult to predict what the relative
frequencies of full-relation vacuums, vacuums-with-skips, and ANALYZE
operations on the table will be. It matters how fast the table is
being inserted into vs. updated/deleted; and it also matters how fast
the table is being updated compared with the system's rate of XID
consumption. So in general it seems hard to say, well, we know this
number might drift off course a little bit, but there will be a
freezing vacuum or analyze or something coming along soon enough to
fix the problem. There might be, but it's difficult to be sure. My
argument isn't so much that using a non-zero value here is guaranteed
to have bad effects, but that we really have no idea what will work
out well in practice, and therefore it seems dangerous to whack the
behavior around ... especially in stable branches.

If we changed this in 9.1, and that's the last time we ever get a
complaint about it, problem solved. But I would feel bad if we
changed this in the back-branches and then found that, while solving
this particular problem, we had created others. It also seems likely
that the replacement problems would be more subtle and more difficult
to diagnose, because they'd depend in a very complicated way on the
workload, and having, say, the latest table contents would not
necessarily enable us to reproduce the problem.

I would feel a lot better about something that is deterministic, like,
I dunno, if VACUUM visits more than 25% of the table, we use its
estimate. And we always use ANALYZE's estimate. Or something.

Another thought: Couldn't relation_needs_vacanalyze() just scale up
reltuples by the ratio of the current number of pages in the relation
to relpages, just as the query planner does?
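
A hedged sketch of that scaling (illustrative names, not actual
autovacuum.c code):

    static double
    scaled_reltuples(double reltuples, double relpages, double cur_pages)
    {
        /* assume the density is unchanged and extrapolate to the
         * relation's current physical size */
        if (relpages <= 0)
            return reltuples;
        return reltuples * (cur_pages / relpages);
    }

With the numbers upthread, 1868538 * (34788136 / 433596) is roughly
150M tuples, which would move the autovacuum trigger point from ~375k
up to ~30M dead tuples.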

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Florian Helmberger <fh(at)25th-floor(dot)com>, pgsql-hackers(at)postgresql(dot)org, Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Subject: Re: [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum
Date: 2011-05-26 16:23:02
Message-ID: 25097.1306426982@sss.pgh.pa.us
Lists: pgsql-admin pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> I would feel a lot better about something that is deterministic, like,
> I dunno, if VACUUM visits more than 25% of the table, we use its
> estimate. And we always use ANALYZE's estimate. Or something.

This argument seems to rather miss the point. The data we are working
from is fundamentally not deterministic, and you can't make it so by
deciding to ignore what data we do have. That leads to a less useful
estimate, not a more useful estimate.

> Another thought: Couldn't relation_needs_vacanalyze() just scale up
> reltuples by the ratio of the current number of pages in the relation
> to relpages, just as the query planner does?

Hmm ... that would fix Florian's immediate issue, and it does seem like
a good change on its own merits. But it does nothing for the problem
that we're failing to put the best available information into pg_class.

Possibly we could compromise on doing just that much in the back
branches, and the larger change for 9.1?

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Florian Helmberger <fh(at)25th-floor(dot)com>, pgsql-hackers(at)postgresql(dot)org, Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Subject: Re: [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum
Date: 2011-05-26 17:08:46
Message-ID: BANLkTikxM6V5DUsXnDEs84z=oRMbyeoOgA@mail.gmail.com
Lists: pgsql-admin pgsql-hackers

On Thu, May 26, 2011 at 12:23 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Another thought: Couldn't relation_needs_vacanalyze() just scale up
>> reltuples by the ratio of the current number of pages in the relation
>> to relpages, just as the query planner does?
>
> Hmm ... that would fix Florian's immediate issue, and it does seem like
> a good change on its own merits.  But it does nothing for the problem
> that we're failing to put the best available information into pg_class.
>
> Possibly we could compromise on doing just that much in the back
> branches, and the larger change for 9.1?

Do you think we need to worry about the extra overhead of determining
the current size of every relation as we sweep through pg_class? It's
not a lot, but OTOH I think we'd be doing it once a minute... not sure
what would happen if there were tons of tables.

Going back to your thought upthread, I think we should really consider
replacing reltuples with reltupledensity at some point. I continue to
be afraid that using a decaying average in this case is going to end
up overweighting the values from some portion of the table that's
getting scanned repeatedly, at the expense of other portions of the
table that are not getting scanned at all. Now, perhaps experience
will prove that's not a problem. But storing relpages and
reltupledensity separately would give us more flexibility, because we
could feel free to bump relpages even when we're not sure what to do
about reltupledensity. That would help Florian's problem quite a lot,
even if we did nothing else.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Robert Haas" <robertmhaas(at)gmail(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Florian Helmberger" <fh(at)25th-floor(dot)com>, "Alvaro Herrera" <alvherre(at)commandprompt(dot)com>, "Greg Stark" <gsstark(at)mit(dot)edu>,<pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum
Date: 2011-05-26 17:28:11
Message-ID: 4DDE475B020000250003DD63@gw.wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> I think we should really consider replacing reltuples with
> reltupledensity at some point. I continue to be afraid that using
> a decaying average in this case is going to end up overweighting
> the values from some portion of the table that's getting scanned
> repeatedly, at the expense of other portions of the table that are
> not getting scanned at all. Now, perhaps experience will prove
> that's not a problem. But storing relpages and reltupledensity
> separately would give us more flexibility, because we could feel
> free to bump relpages even when we're not sure what to do about
> reltupledensity. That would help Florian's problem quite a lot,
> even if we did nothing else.

Given how trivial it would be to adjust reltuples to keep its ratio
to relpages about the same when we don't have a new "hard" number,
but some evidence that we should fudge our previous value, I don't
see where this change buys us much. It seems to mostly obfuscate
the fact that we're changing our assumption about how many tuples we
have. I would rather that we did that explicitly with code comments
about why it's justified than to slip it in the way you suggest.

-Kevin


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Florian Helmberger <fh(at)25th-floor(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum
Date: 2011-05-26 17:50:45
Message-ID: BANLkTikTbZ8gibj+X++vkdqkxxJCU6trWw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin pgsql-hackers

On Thu, May 26, 2011 at 1:28 PM, Kevin Grittner
<Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> I think we should really consider replacing reltuples with
>> reltupledensity at some point.  I continue to be afraid that using
>> a decaying average in this case is going to end up overweighting
>> the values from some portion of the table that's getting scanned
>> repeatedly, at the expense of other portions of the table that are
>> not getting scanned at all.  Now, perhaps experience will prove
>> that's not a problem.  But storing relpages and reltupledensity
>> separately would give us more flexibility, because we could feel
>> free to bump relpages even when we're not sure what to do about
>> reltupledensity.  That would help Florian's problem quite a lot,
>> even if we did nothing else.
>
> Given how trivial it would be to adjust reltuples to keep its ratio
> to relpages about the same when we don't have a new "hard" number,
> but some evidence that we should fudge our previous value, I don't
> see where this change buys us much.  It seems to mostly obfuscate
> the fact that we're changing our assumption about how many tuples we
> have.  I would rather that we did that explicitly with code comments
> about why it's justified than to slip it in the way you suggest.

I'm a bit confused by this - what the current design obfuscates is the
fact that reltuples and relpages are not really independent columns;
you can't update one without updating the other, unless you want
screwy behavior. Replacing reltuples by reltupledensity would fix
that problem - it would be logical and non-damaging to update either
column independently.
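
(A sketch of the contrast, using a hypothetical reltupledensity column;
no such column exists in pg_class today:)

/* Today: the planner must recover density from two coupled columns. */
static double
est_tuples_today(double cur_pages, double relpages, double reltuples)
{
    return (relpages > 0) ? (reltuples / relpages) * cur_pages : reltuples;
}

/* Proposed: the density is stored directly, so relpages (and hence the
 * implied tuple count) can be refreshed without touching the density. */
static double
est_tuples_proposed(double cur_pages, double reltupledensity)
{
    return reltupledensity * cur_pages;
}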

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Robert Haas" <robertmhaas(at)gmail(dot)com>
Cc: "Florian Helmberger" <fh(at)25th-floor(dot)com>, "Alvaro Herrera" <alvherre(at)commandprompt(dot)com>, "Greg Stark" <gsstark(at)mit(dot)edu>, <pgsql-hackers(at)postgresql(dot)org>,"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum
Date: 2011-05-26 18:05:55
Message-ID: 4DDE5033020000250003DD71@gw.wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:

>> Given how trivial it would be to adjust reltuples to keep its
>> ratio to relpages about the same when we don't have a new "hard"
>> number, but some evidence that we should fudge our previous
>> value, I don't see where this change buys us much. It seems to
>> mostly obfuscate the fact that we're changing our assumption
>> about how many tuples we have. I would rather that we did that
>> explicitly with code comments about why it's justified than to
>> slip it in the way you suggest.
>
> I'm a bit confused by this - what the current design obfuscates is
> the fact that reltuples and relpages are not really independent
> columns; you can't update one without updating the other, unless
> you want screwy behavior. Replacing reltuples by reltupledensity
> would fix that problem - it would be logical and non-damaging to
> update either column independently.

They don't always move in tandem. Certainly there can be available
space in those pages from which tuples can be allocated, or which
increases as tuples are vacuumed. Your proposed change would make
neither more nor less information available, because we've got
two numbers which can be observed as raw counts, and a ratio between
them. By storing the ratio and one count you make changes to the
other count implied and less visible. It seems more understandable
and less prone to error (to me, anyway) to keep the two "raw"
numbers and calculate the ratio -- and when you observe a change in
one raw number which you believe should force an adjustment to the
other raw number before its next actual value is observed, to
comment on why that's a good idea, and do the trivial arithmetic at
that time.

As a thought exercise, what happens each way if a table is loaded
with a low fillfactor and then a lot of inserts are done? What
happens if mass deletes are done from a table which has a high
density?
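
(To make the first case concrete, with illustrative numbers: load a
table at fillfactor 10, so relpages = 1000 and reltuples = 10000, a
density of 10 tuples/page. Inserts then fill the reserved space in
place: relpages stays at 1000 while the true count climbs toward
100000. Scaling reltuples by the page ratio, 1000/1000 = 1, still
reports 10000 under either representation; only a fresh density
observation can correct it. A mass delete from a dense table is the
mirror image: relpages is unchanged while the true count collapses.)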

-Kevin


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: Florian Helmberger <fh(at)25th-floor(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum
Date: 2011-05-26 19:48:29
Message-ID: BANLkTikGu30z1rwYeN0NwuO8jMFWHzus+Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin pgsql-hackers

On Thu, May 26, 2011 at 2:05 PM, Kevin Grittner
<Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
>> I'm a bit confused by this - what the current design obfuscates is
>> the fact that reltuples and relpages are not really independent
>> columns; you can't update one without updating the other, unless
>> you want screwy behavior.  Replacing reltuples by reltupledensity
>> would fix that problem - it would be logical and non-damaging to
>> update either column independently.
>
> They don't always move in tandem.  Certainly there can be available
> space in those pages from which tuples can be allocated, or which
> increases as tuples are vacuumed.  Your proposed change would make
> neither more nor less information available, because we've got
> two numbers which can be observed as raw counts, and a ratio between
> them.

So far I agree.

> By storing the ratio and one count you make changes to the
> other count implied and less visible.  It seems more understandable
> and less prone to error (to me, anyway) to keep the two "raw"
> numbers and calculate the ratio -- and when you observe a change in
> one raw number which you believe should force an adjustment to the
> other raw number before its next actual value is observed, to
> comment on why that's a good idea, and do the trivial arithmetic at
> that time.

Except that's not how it works. At least in the case of ANALYZE, we
*aren't* counting all the tuples in the table. We're selecting a
random sample of pages and inferring a tuple density, which we then
extrapolate to the whole table and store. Then when we pull it back
out of the table, we convert it back to a tuple density. The real
value we are computing and using almost everywhere is tuple density;
storing a total number of tuples in the table appears to be just
confusing the issue.

Unless, of course, I am misunderstanding, which is possible.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Florian Helmberger <fh(at)25th-floor(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum
Date: 2011-05-26 20:30:09
Message-ID: 28829.1306441809@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> Except that's not how it works. At least in the case of ANALYZE, we
> *aren't* counting all the tuples in the table. We're selecting a
> random sample of pages and inferring a tuple density, which we then
> extrapolate to the whole table and store. Then when we pull it back
> out of the table, we convert it back to a tuple density. The real
> value we are computing and using almost everywhere is tuple density;
> storing a total number of tuples in the table appears to be just
> confusing the issue.

If we were starting in a green field we might choose to store tuple
density. However, the argument for changing it now is at best mighty
thin; IMO it is not worth the risk of breaking client code.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Florian Helmberger <fh(at)25th-floor(dot)com>, pgsql-hackers(at)postgresql(dot)org, Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Subject: Re: [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum
Date: 2011-05-26 20:37:45
Message-ID: 28973.1306442265@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Thu, May 26, 2011 at 12:23 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> Another thought: Couldn't relation_needs_vacanalyze() just scale up
>>> reltuples by the ratio of the current number of pages in the relation
>>> to relpages, just as the query planner does?

>> Hmm ... that would fix Florian's immediate issue, and it does seem like
>> a good change on its own merits. But it does nothing for the problem
>> that we're failing to put the best available information into pg_class.
>>
>> Possibly we could compromise on doing just that much in the back
>> branches, and the larger change for 9.1?

> Do you think we need to worry about the extra overhead of determining
> the current size of every relation as we sweep through pg_class? It's
> not a lot, but OTOH I think we'd be doing it once a minute... not sure
> what would happen if there were tons of tables.

Ugh ... that is a mighty good point, since the RelationGetNumberOfBlocks
call would have to happen for each table, even the ones we then decide
not to vacuum. We've already seen people complain about the cost of the
AV launcher once they have a lot of databases, and this would probably
increase it quite a bit.

> Going back to your thought upthread, I think we should really consider
> replacing reltuples with reltupledensity at some point. I continue to
> be afraid that using a decaying average in this case is going to end
> up overweighting the values from some portion of the table that's
> getting scanned repeatedly, at the expense of other portions of the
> table that are not getting scanned at all.

Changing the representation of the information would change that issue
not in the slightest. The fundamental point here is that we have new,
possibly partial, information which we ought to somehow merge with the
old, also possibly partial, information. Storing the data a little bit
differently doesn't magically eliminate that issue.

regards, tom lane


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Robert Haas" <robertmhaas(at)gmail(dot)com>
Cc: "Florian Helmberger" <fh(at)25th-floor(dot)com>, "Alvaro Herrera" <alvherre(at)commandprompt(dot)com>, "Greg Stark" <gsstark(at)mit(dot)edu>, <pgsql-hackers(at)postgresql(dot)org>,"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum
Date: 2011-05-26 21:24:24
Message-ID: 4DDE7EB8020000250003DDA8@gw.wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:

>> By storing the ratio and one count you make changes to the
>> other count implied and less visible. It seems more
>> understandable and less prone to error (to me, anyway) to keep
>> the two "raw" numbers and calculate the ratio -- and when you
>> observe a change in one raw number which you believe should force
>> an adjustment to the other raw number before its next actual
>> value is observed, to comment on why that's a good idea, and do
>> the trivial arithmetic at that time.
>
> Except that's not how it works. At least in the case of ANALYZE,
> we *aren't* counting all the tuples in the table. We're selecting
> a random sample of pages and inferring a tuple density, which we
> then extrapolate to the whole table and store. Then when we pull
> it back out of the table, we convert it back to a tuple density.
> The real value we are computing and using almost everywhere is
> tuple density; storing a total number of tuples in the table
> appears to be just confusing the issue.

Well, if tuple density is the number which is most heavily used, it
might shave a few nanoseconds doing the arithmetic in enough places
to justify the change, but I'm skeptical. Basically I'm with Tom on
the fact that this change would store neither more nor less
information (and for that matter would not really change what
information you can easily retrieve); and slightly changing the
manner in which it is stored doesn't solve any of the problems you
assert that it does.

When we prune or vacuum a page, I don't suppose we have enough
information about that page's previous state to calculate a tuple
count delta, do we? That would allow a far more accurate number to
be maintained than anything suggested so far, as long as we tweak
autovacuum to count inserts toward the need to vacuum. (It seems to
me I saw a post giving some reasons why that would have benefits
anyway.)
Except for the full pass during transaction wrap-around protection,
where it could just set a new actual count, autovacuum would be
skipping pages where the bit is set to indicate that all tuples are
visible, right?
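
(A sketch of the bookkeeping that idea would require; every name here
is invented, and the "before" count is exactly what we don't have
today:)

/* Hypothetical per-page accounting for an exact running tuple count. */
typedef struct PageCounts
{
    int live_before;    /* live tuples on the page before prune/vacuum */
    int live_after;     /* live tuples remaining afterwards */
} PageCounts;

/* Fold one page's exact change into a relation-wide running count. */
static long
apply_page_delta(long running_reltuples, PageCounts c)
{
    return running_reltuples + (c.live_after - c.live_before);
}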

-Kevin


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: "Robert Haas" <robertmhaas(at)gmail(dot)com>, "Florian Helmberger" <fh(at)25th-floor(dot)com>, "Alvaro Herrera" <alvherre(at)commandprompt(dot)com>, "Greg Stark" <gsstark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum
Date: 2011-05-26 21:50:00
Message-ID: 9007.1306446600@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin pgsql-hackers

"Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov> writes:
> When we prune or vacuum a page, I don't suppose we have enough
> information about that page's previous state to calculate a tuple
> count delta, do we? That would allow a far more accurate number to
> be maintained than anything suggested so far, as long as we tweak
> autovacuum to count inserts toward the need to vacuum.

Well, that was the other direction that was suggested upthread: stop
relying on reltuples at all, but use the stats collector's counts.
That might be a good solution in the long run, but there are some
issues:

1. It's not clear how using a current count, as opposed to
time-of-last-vacuum count, would affect the behavior of the autovacuum
control logic. At first glance I think it would break it, since the
basic logic there is "how much of the table changed since it was last
vacuumed?". Even if the equations could be modified to still work,
I remember enough feedback control theory from undergrad EE to think that
this is something to be seriously scared of tweaking without extensive
testing. IMO it is far more risky than what Robert is worried about.

2. You still have the problem that we're exposing inaccurate (or at
least less accurate than they could be) counts to the planner and to
onlooker clients. We could change the planner to also depend on the
stats collector instead of reltuples, but at that point you just removed
the option for people to turn off the stats collector. The implications
for plan stability might be unpleasant, too.

So that's not a direction I want to go without a significant amount
of work and testing.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Florian Helmberger <fh(at)25th-floor(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum
Date: 2011-05-27 15:07:05
Message-ID: BANLkTinREie2cX2zqZPSEfUsZsdM+GjymQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin pgsql-hackers

On Thu, May 26, 2011 at 5:50 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov> writes:
>> When we prune or vacuum a page, I don't suppose we have enough
>> information about that page's previous state to calculate a tuple
>> count delta, do we?  That would allow a far more accurate number to
>> be maintained than anything suggested so far, as long as we tweak
>> autovacuum to count inserts toward the need to vacuum.
>
> Well, that was the other direction that was suggested upthread: stop
> relying on reltuples at all, but use the stats collector's counts.
> That might be a good solution in the long run, but there are some
> issues:
>
> 1. It's not clear how using a current count, as opposed to
> time-of-last-vacuum count, would affect the behavior of the autovacuum
> control logic.  At first glance I think it would break it, since the
> basic logic there is "how much of the table changed since it was last
> vacuumed?".  Even if the equations could be modified to still work,
> I remember enough feedback control theory from undergrad EE to think that
> this is something to be seriously scared of tweaking without extensive
> testing.  IMO it is far more risky than what Robert is worried about.

Yeah, I think that would be broken.

> 2. You still have the problem that we're exposing inaccurate (or at
> least less accurate than they could be) counts to the planner and to
> onlooker clients.  We could change the planner to also depend on the
> stats collector instead of reltuples, but at that point you just removed
> the option for people to turn off the stats collector.  The implications
> for plan stability might be unpleasant, too.
>
> So that's not a direction I want to go without a significant amount
> of work and testing.

FWIW, I agree. Your proposed solution is certainly better than trying
to do this; but it still seems a bit shaky to me.

Still, maybe we don't have a better option. If it were me, I'd add an
additional safety valve: use your formula if the percentage of the
relation scanned is above some threshold where there's unlikely to be
too much skew. But if the percentage scanned is too small, then don't
use that formula. Instead, only update relpages/reltuples if the
relation is now larger; set relpages to the new actual value, and
scale up reltuples proportionately.

However, I just work here. It's possible that I'm worrying about a
problem that won't materialize in practice.
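
(A sketch of that safety valve; the threshold and all names are
illustrative, the 25% merely echoing the figure suggested upthread:)

#define TRUST_SCAN_FRACTION 0.25

static double
safety_valve_reltuples(double total_pages, double scanned_pages,
                       double scanned_tuples,
                       double old_relpages, double old_reltuples)
{
    /* Scanned enough of the table: trust the observed density. */
    if (total_pages > 0 &&
        scanned_pages / total_pages >= TRUST_SCAN_FRACTION)
        return (scanned_tuples / scanned_pages) * total_pages;

    /* Tiny scan: only extrapolate the old density if the table grew. */
    if (old_relpages > 0 && total_pages > old_relpages)
        return old_reltuples * (total_pages / old_relpages);

    return old_reltuples;   /* otherwise leave the estimate alone */
}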

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Florian Helmberger <fh(at)25th-floor(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum
Date: 2011-05-27 18:34:10
Message-ID: 17685.1306521250@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> Still, maybe we don't have a better option. If it were me, I'd add an
> additional safety valve: use your formula if the percentage of the
> relation scanned is above some threshold where there's unlikely to be
> too much skew. But if the percentage scanned is too small, then don't
> use that formula. Instead, only update relpages/reltuples if the
> relation is now larger; set relpages to the new actual value, and
> scale up reltuples proportionately.

Ah, progress: now we're down to arguing about the size of the fudge
factor ;-). I'll do something involving derating the reliability
when the number is coming from VACUUM.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Florian Helmberger <fh(at)25th-floor(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum
Date: 2011-05-28 19:01:35
Message-ID: 28116.1306609295@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> Still, maybe we don't have a better option. If it were me, I'd add an
> additional safety valve: use your formula if the percentage of the
> relation scanned is above some threshold where there's unlikely to be
> too much skew. But if the percentage scanned is too small, then don't
> use that formula. Instead, only update relpages/reltuples if the
> relation is now larger; set relpages to the new actual value, and
> scale up reltuples proportionately.

> However, I just work here. It's possible that I'm worrying about a
> problem that won't materialize in practice.

Attached is a proposed patch to fix these issues. Experimentation
convinced me that including a fudge factor for VACUUM's results made
things *less* accurate, not more so. The reason seems to be bound up in
Greg Stark's observation that the unmodified calculation is equivalent
to assuming that the old average tuple density still applies to the
unscanned pages. In a VACUUM, we know that the unscanned pages are
exactly those that have had no changes since (at least) the last vacuum,
which means that indeed the old density ought to be a good estimate.
Now, this reasoning can break down if the table's tuple density is
nonuniform, but what I found in my testing is that if you vacuum after a
significant change in a table (such as deleting a lot of rows), and you
don't apply the full unfudged correction, you get a badly wrong result.
I think that's a more significant issue than the possibility of drift
over time.
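
(In code form, the unfudged correction amounts to something like the
following sketch; this is not the patch itself, and the names are
invented:)

static double
vacuum_reltuples_estimate(double old_relpages, double old_reltuples,
                          double total_pages, double scanned_pages,
                          double scanned_tuples)
{
    double old_density;

    /* No previous data: extrapolate the scanned density to the table. */
    if (old_relpages <= 0)
        return (scanned_pages > 0) ?
            (scanned_tuples / scanned_pages) * total_pages : old_reltuples;

    /*
     * The unscanned pages are exactly those unchanged since the last
     * vacuum, so the old density should still describe them; add the
     * fresh count for the pages actually scanned.
     */
    old_density = old_reltuples / old_relpages;
    return old_density * (total_pages - scanned_pages) + scanned_tuples;
}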

I also found that Greg was right in thinking that it would help if we
tweaked lazy_scan_heap to not always scan the first
SKIP_PAGES_THRESHOLD-1 pages even if they were
all_visible_according_to_vm. That seemed to skew the results if those
pages weren't representative. And, for the case of a useless manual
vacuum on a completely clean table, it would cause the reltuples value
to drift when there was no reason to change it at all.

Lastly, this patch removes a bunch of grotty interconnections between
VACUUM and ANALYZE that were meant to prevent ANALYZE from updating the
stats if VACUUM had done a full-table scan in the same command. With
the new logic it's relatively harmless if ANALYZE does that, and
autovacuum frequently fires the two cases independently anyway, making
all that logic quite useless in the normal case. (This simplification
accounts for the bulk of the diff, actually.)

Comments?

regards, tom lane

Attachment Content-Type Size
vac-analyze-reltuple-changes.patch text/x-patch 34.3 KB

From: Greg Stark <gsstark(at)mit(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Florian Helmberger <fh(at)25th-floor(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum
Date: 2011-05-29 02:42:51
Message-ID: BANLkTi=6RYvEVfKn8ufPEf9meGVEu=fHsQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin pgsql-hackers

On Sat, May 28, 2011 at 12:01 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> I also found that Greg was right in thinking that it would help if we
> tweaked lazy_scan_heap to not always scan the first
> SKIP_PAGES_THRESHOLD-1 pages even if they were
> all_visible_according_to_vm.  That seemed to skew the results if those
> pages weren't representative.  And, for the case of a useless manual
> vacuum on a completely clean table, it would cause the reltuples value
> to drift when there was no reason to change it at all.

You fixed the logic only for the first 32 pages which helps with the
skew. But really the logic is backwards in general. Instead of
counting how many missed opportunities for skipped pages we've seen in
the past we should read the bits for the next 32 pages in advance and
decide what to do before we read those pages.

--
greg


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Florian Helmberger <fh(at)25th-floor(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum
Date: 2011-05-29 05:10:48
Message-ID: 11295.1306645848@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin pgsql-hackers

Greg Stark <gsstark(at)mit(dot)edu> writes:
> On Sat, May 28, 2011 at 12:01 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> I also found that Greg was right in thinking that it would help if we
>> tweaked lazy_scan_heap to not always scan the first
>> SKIP_PAGES_THRESHOLD-1 pages even if they were
>> all_visible_according_to_vm.

> You fixed the logic only for the first 32 pages which helps with the
> skew. But really the logic is backwards in general. Instead of
> counting how many missed opportunities for skipped pages we've seen in
> the past we should read the bits for the next 32 pages in advance and
> decide what to do before we read those pages.

OK, do you like the attached version of that logic? (Other fragments
of the patch as before.)

regards, tom lane

diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 9393fa0727aaad7508e1163623322b4066412257..231447b31223bc5350ce49a136cffafaa53bc5fb 100644
*** a/src/backend/commands/vacuumlazy.c
--- b/src/backend/commands/vacuumlazy.c
*************** lazy_scan_heap(Relation onerel, LVRelSta
*** 311,317 ****
int i;
PGRUsage ru0;
Buffer vmbuffer = InvalidBuffer;
! BlockNumber all_visible_streak;

pg_rusage_init(&ru0);

--- 305,312 ----
int i;
PGRUsage ru0;
Buffer vmbuffer = InvalidBuffer;
! BlockNumber next_not_all_visible_block;
! bool skipping_all_visible_blocks;

pg_rusage_init(&ru0);

*************** lazy_scan_heap(Relation onerel, LVRelSta
*** 329,340 ****

nblocks = RelationGetNumberOfBlocks(onerel);
vacrelstats->rel_pages = nblocks;
vacrelstats->nonempty_pages = 0;
vacrelstats->latestRemovedXid = InvalidTransactionId;

lazy_space_alloc(vacrelstats, nblocks);

! all_visible_streak = 0;
for (blkno = 0; blkno < nblocks; blkno++)
{
Buffer buf;
--- 324,369 ----

nblocks = RelationGetNumberOfBlocks(onerel);
vacrelstats->rel_pages = nblocks;
+ vacrelstats->scanned_pages = 0;
vacrelstats->nonempty_pages = 0;
vacrelstats->latestRemovedXid = InvalidTransactionId;

lazy_space_alloc(vacrelstats, nblocks);

! /*
! * We want to skip pages that don't require vacuuming according to the
! * visibility map, but only when we can skip at least SKIP_PAGES_THRESHOLD
! * consecutive pages. Since we're reading sequentially, the OS should be
! * doing readahead for us, so there's no gain in skipping a page now and
! * then; that's likely to disable readahead and so be counterproductive.
! * Also, skipping even a single page means that we can't update
! * relfrozenxid, so we only want to do it if we can skip a goodly number
! * of pages.
! *
! * Before entering the main loop, establish the invariant that
! * next_not_all_visible_block is the next block number >= blkno that's
! * not all-visible according to the visibility map, or nblocks if there's
! * no such block. Also, we set up the skipping_all_visible_blocks flag,
! * which is needed because we need hysteresis in the decision: once we've
! * started skipping blocks, we may as well skip everything up to the next
! * not-all-visible block.
! *
! * Note: if scan_all is true, we won't actually skip any pages; but we
! * maintain next_not_all_visible_block anyway, so as to set up the
! * all_visible_according_to_vm flag correctly for each page.
! */
! for (next_not_all_visible_block = 0;
! next_not_all_visible_block < nblocks;
! next_not_all_visible_block++)
! {
! if (!visibilitymap_test(onerel, next_not_all_visible_block, &vmbuffer))
! break;
! }
! if (next_not_all_visible_block >= SKIP_PAGES_THRESHOLD)
! skipping_all_visible_blocks = true;
! else
! skipping_all_visible_blocks = false;
!
for (blkno = 0; blkno < nblocks; blkno++)
{
Buffer buf;
*************** lazy_scan_heap(Relation onerel, LVRelSta
*** 347,387 ****
OffsetNumber frozen[MaxOffsetNumber];
int nfrozen;
Size freespace;
! bool all_visible_according_to_vm = false;
bool all_visible;
bool has_dead_tuples;

! /*
! * Skip pages that don't require vacuuming according to the visibility
! * map. But only if we've seen a streak of at least
! * SKIP_PAGES_THRESHOLD pages marked as clean. Since we're reading
! * sequentially, the OS should be doing readahead for us and there's
! * no gain in skipping a page now and then. You need a longer run of
! * consecutive skipped pages before it's worthwhile. Also, skipping
! * even a single page means that we can't update relfrozenxid or
! * reltuples, so we only want to do it if there's a good chance to
! * skip a goodly number of pages.
! */
! if (!scan_all)
{
! all_visible_according_to_vm =
! visibilitymap_test(onerel, blkno, &vmbuffer);
! if (all_visible_according_to_vm)
{
! all_visible_streak++;
! if (all_visible_streak >= SKIP_PAGES_THRESHOLD)
! {
! vacrelstats->scanned_all = false;
! continue;
! }
}
else
! all_visible_streak = 0;
}

vacuum_delay_point();

! scanned_pages++;

/*
* If we are close to overrunning the available space for dead-tuple
--- 376,419 ----
OffsetNumber frozen[MaxOffsetNumber];
int nfrozen;
Size freespace;
! bool all_visible_according_to_vm;
bool all_visible;
bool has_dead_tuples;

! if (blkno == next_not_all_visible_block)
{
! /* Time to advance next_not_all_visible_block */
! for (next_not_all_visible_block++;
! next_not_all_visible_block < nblocks;
! next_not_all_visible_block++)
{
! if (!visibilitymap_test(onerel, next_not_all_visible_block,
! &vmbuffer))
! break;
}
+
+ /*
+ * We know we can't skip the current block. But set up
+ * skipping_all_visible_blocks to do the right thing at the
+ * following blocks.
+ */
+ if (next_not_all_visible_block - blkno > SKIP_PAGES_THRESHOLD)
+ skipping_all_visible_blocks = true;
else
! skipping_all_visible_blocks = false;
! all_visible_according_to_vm = false;
! }
! else
! {
! /* Current block is all-visible */
! if (skipping_all_visible_blocks && !scan_all)
! continue;
! all_visible_according_to_vm = true;
}

vacuum_delay_point();

! vacrelstats->scanned_pages++;

/*
* If we are close to overrunning the available space for dead-tuple


From: Cédric Villemain <cedric(dot)villemain(dot)debian(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Robert Haas <robertmhaas(at)gmail(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Florian Helmberger <fh(at)25th-floor(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum
Date: 2011-05-29 08:40:28
Message-ID: BANLkTimbSiUZC7+ZZJOzMWp=M2jYZZcJSg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin pgsql-hackers

2011/5/29 Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>:
> Greg Stark <gsstark(at)mit(dot)edu> writes:
>> On Sat, May 28, 2011 at 12:01 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> I also found that Greg was right in thinking that it would help if we
>>> tweaked lazy_scan_heap to not always scan the first
>>> SKIP_PAGES_THRESHOLD-1 pages even if they were
>>> all_visible_according_to_vm.
>
>> You fixed the logic only for the first 32 pages which helps with the
>> skew. But really the logic is backwards in general. Instead of
>> counting how many missed opportunities for skipped pages we've seen in
>> the past we should read the bits for the next 32 pages in advance and
>> decide what to do before we read those pages.
>
> OK, do you like the attached version of that logic?  (Other fragments
> of the patch as before.)

The idea was that removing even one page from the VACUUM will prevent
the relfrozenxid update and the reltuples (and relpages) update.
Now, I believe that once we've skipped at least one page thanks to
SKIP_PAGES_THRESHOLD, we should be more aggressive and remove as
many pages as possible from the VACUUM, thanks to the VM.

--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Cédric Villemain <cedric(dot)villemain(dot)debian(at)gmail(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Robert Haas <robertmhaas(at)gmail(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Florian Helmberger <fh(at)25th-floor(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum
Date: 2011-05-29 14:40:27
Message-ID: 19876.1306680027@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin pgsql-hackers

Cédric Villemain <cedric(dot)villemain(dot)debian(at)gmail(dot)com> writes:
> 2011/5/29 Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>:
>> OK, do you like the attached version of that logic? (Other fragments
>> of the patch as before.)

> The idea was that removing even one page from the VACUUM will prevent
> the relfrozenxid update and the reltuples (and relpages) update.
> Now, I believe that once we've skipped at least one page thanks to
> SKIP_PAGES_THRESHOLD, we should be more aggressive and remove as
> many pages as possible from the VACUUM, thanks to the VM.

That would require proof, not just suggestion. Skipping pages will
defeat the OS read-ahead algorithm, and so could easily cost more than
reading them.

regards, tom lane


From: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Cédric Villemain <cedric(dot)villemain(dot)debian(at)gmail(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, Robert Haas <robertmhaas(at)gmail(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Florian Helmberger <fh(at)25th-floor(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum
Date: 2011-05-29 15:48:42
Message-ID: BANLkTi=ONrsOoMD3rOdPjRnKPrirKE+fpg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin pgsql-hackers

On Sun, May 29, 2011 at 8:10 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Cédric Villemain <cedric(dot)villemain(dot)debian(at)gmail(dot)com> writes:
>> 2011/5/29 Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>:
>>> OK, do you like the attached version of that logic?  (Other fragments
>>> of the patch as before.)
>
>> The idea was that removing even one page from the VACUUM will prevent
>> the relfrozenxid update and the reltuples (and relpages) update.
>> Now, I believe that once we've skipped at least one page thanks to
>> SKIP_PAGES_THRESHOLD, we should be more aggressive and remove as
>> many pages as possible from the VACUUM, thanks to the VM.
>
> That would require proof, not just suggestion.  Skipping pages will
> defeat the OS read-ahead algorithm, and so could easily cost more than
> reading them.
>

My worry is that what we have right now is also based on just
assumptions and gut feelings rather than any numbers.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
Cc: Cédric Villemain <cedric(dot)villemain(dot)debian(at)gmail(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, Robert Haas <robertmhaas(at)gmail(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Florian Helmberger <fh(at)25th-floor(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum
Date: 2011-05-29 15:57:01
Message-ID: 21060.1306684621@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin pgsql-hackers

Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com> writes:
> On Sun, May 29, 2011 at 8:10 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> That would require proof, not just suggestion. Skipping pages will
>> defeat the OS read-ahead algorithm, and so could easily cost more than
>> reading them.

> My worry is that what we have right now is also based on just
> assumptions and gut feelings rather than any numbers.

So go collect some numbers.

regards, tom lane


From: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Cédric Villemain <cedric(dot)villemain(dot)debian(at)gmail(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, Robert Haas <robertmhaas(at)gmail(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Florian Helmberger <fh(at)25th-floor(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum
Date: 2011-05-29 16:26:40
Message-ID: BANLkTimTJLzdPDHrVUa=56n_mPtPtO3D3w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin pgsql-hackers

On Sun, May 29, 2011 at 9:27 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com> writes:
>> On Sun, May 29, 2011 at 8:10 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> That would require proof, not just suggestion.  Skipping pages will
>>> defeat the OS read-ahead algorithm, and so could easily cost more than
>>> reading them.
>
>> My worry is that what we have right now is also based on just
>> assumptions and gut feelings rather than any numbers.
>
> So go collect some numbers.
>

I am sorry if I sounded terse above. But my gripe is that sometimes we
are too reluctant to listen to ideas and insist on producing some hard
numbers first, which might take significant effort. But we are not
equally strict when such changes are introduced initially. For
example, in this particular case, the change was introduced after this
discussion:

http://archives.postgresql.org/pgsql-hackers/2008-12/msg01316.php

Heikki suggested 20, and Simon proposed 32 to make it a power of 2.
But why not 16? 20 is closer to 16 than to 32. And Greg said yesterday
that 8 is a better number, based on his testing.

Maybe a performance build farm, as is being discussed, is the
solution: we could throw some simple patches at it and see whether
something helps or not.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
Cc: Cédric Villemain <cedric(dot)villemain(dot)debian(at)gmail(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, Robert Haas <robertmhaas(at)gmail(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Florian Helmberger <fh(at)25th-floor(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum
Date: 2011-05-29 17:05:53
Message-ID: 22525.1306688753@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin pgsql-hackers

Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com> writes:
> I am sorry if I sounded terse above. But my gripe is that sometimes we
> are too reluctant to listen to ideas and insist on producing some hard
> numbers first, which might take significant effort. But we are not
> equally strict when such changes are introduced initially.

The reason for not wanting to change it without some actual evidence
is that there is already evidence: the code has been in the field with
this setting since 8.4, and nobody's vacuum performance has fallen off a
cliff. So while I'd agree that there was little testing done before the
code went in, there is more than zero reason to leave it where it is.
Without some positive evidence showing that another value is better,
I'm disinclined to change it. I also think that you're not helping
by complaining about the code without being willing to do some work
to try to collect such evidence.

> Heikki suggested 20, and Simon proposed 32 to make it a power of 2.
> But why not 16? 20 is closer to 16 than to 32. And Greg said yesterday
> that 8 is a better number, based on his testing.

Greg said he had found that the read speed was the same for reading
every page vs reading every 8th page. That's not the same as concluding
that 8 is the optimal skip distance for vacuum; or at least, he didn't
say that's what he had concluded. vacuum isn't just reading ...

regards, tom lane


From: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Cédric Villemain <cedric(dot)villemain(dot)debian(at)gmail(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, Robert Haas <robertmhaas(at)gmail(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Florian Helmberger <fh(at)25th-floor(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum
Date: 2011-05-29 17:19:00
Message-ID: BANLkTi=9RZPv9JOjRdSX6imBeSBGLLrCPA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin pgsql-hackers

On Sun, May 29, 2011 at 10:35 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com> writes:
>> I am sorry if I sounded terse above. But my gripe is that sometimes we
>> are too reluctant to listen to ideas and insist on producing some hard
>> numbers first, which might take significant effort. But we are not
>> equally strict when such changes are introduced initially.
>
> The reason for not wanting to change it without some actual evidence
> is that there is already evidence: the code has been in the field with
> this setting since 8.4, and nobody's vacuum performance has fallen off a
> cliff.

Well, that's probably because it was definitely a big improvement
over what existed before. But that does not mean we can't make it
better. IOW, there are no complaints because there is no regression.

> So while I'd agree that there was little testing done before the
> code went in, there is more than zero reason to leave it where it is.
> Without some positive evidence showing that another value is better,
> I'm disinclined to change it.  I also think that you're not helping
> by complaining about the code without being willing to do some work
> to try to collect such evidence.
>

I am not complaining about the code. I am suggesting we can be more
receptive to ideas, especially when we know what we have today was not
backed by any evidence either. I will do some tests anyway and post
numbers when I work on the single-pass vacuum patch. I'll try to
experiment with this stuff at that time.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com


From: Cédric Villemain <cedric(dot)villemain(dot)debian(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Robert Haas <robertmhaas(at)gmail(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Florian Helmberger <fh(at)25th-floor(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum
Date: 2011-05-29 23:32:54
Message-ID: BANLkTinL6QuAm_Xf8teRZboG2Mdy3dR_vw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin pgsql-hackers

2011/5/29 Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>:
> Cédric Villemain <cedric(dot)villemain(dot)debian(at)gmail(dot)com> writes:
>> 2011/5/29 Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>:
>>> OK, do you like the attached version of that logic?  (Other fragments
>>> of the patch as before.)
>
>> The idea was that removing even one page from the VACUUM will prevent
>> the relfrozenxid update and the reltuples (and relpages) update.
>> Now, I believe that once we've skipped at least one page thanks to
>> SKIP_PAGES_THRESHOLD, we should be more aggressive and remove as
>> many pages as possible from the VACUUM, thanks to the VM.
>
> That would require proof, not just suggestion.  Skipping pages will
> defeat the OS read-ahead algorithm, and so could easily cost more than
> reading them.

Correct, it needs proof. As an aside: I also learned that reading the
first block of a file makes read-ahead use its larger window from the
beginning (the one that POSIX_FADV_SEQUENTIAL sets too), so removing
those initial reads might be counter-productive as well. But this is
very hard to benchmark, because read-ahead is also influenced by
memory pressure, for example.

In theory, 1. the readahead algorithm is a bit smarter and can cope
with reads-with-holes (if the holes are not larger than the read-ahead
window), and 2. if the holes are that large, then maybe it is not so
good to keep a larger read-ahead window (which keeps thrashing our
buffer cache).
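
(For reference, the kernel hint in question is just this; a minimal
sketch, not code from the patch:)

#include <fcntl.h>

/* Advise the kernel that this file will be read sequentially; on
 * Linux this enlarges the readahead window for the descriptor. */
static void
advise_sequential(int fd)
{
    (void) posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
}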

--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support