Re: GIN fast-insert vs autovacuum scheduling

Lists: pgsql-hackers
From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Teodor Sigaev <teodor(at)sigaev(dot)ru>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: GIN fast-insert vs autovacuum scheduling
Date: 2009-03-23 17:56:22
Message-ID: 29127.1237830982@sss.pgh.pa.us

I'm looking again at the fast-insert patch, and I find myself still
desperately unhappy about the mechanism for scheduling autovacuum
cleanup of pending insertions. I complained about that before, but
I think I only cited a worry about adding overhead to statistics
tracking in order to have the "recently inserted tuples" counts.
It's got worse problems though:

1. The "recently inserted tuples" count is simply the wrong measurement
if the index is partial --- it could be a drastic overestimate.

2. Since the patch has pgstats unconditionally resetting the count to
zero after every vacuum, it's not safe for an index AM to use any other
cleanup policy except "flush all pending insertions on every vacuum".
This doesn't seem particularly optimal to me; isn't the idea to make
sure we insert lots of tuples at once? Seems like if there's not very
much in the pending list it'd be better to do nothing.

3. Given that ginHeapTupleFastInsert forces a cleanup cycle whenever
the pending list gets too big, it's far from clear why we should have
to force autovacuum just because of pending list size at all. I also
note that such cleanups aren't being accounted for in the "recently
inserted tuples" stat, anyhow.
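
As a rough illustration of points 1 and 2, here is a toy C model. It is
not PostgreSQL code: ToyIndex, ToyTableStats, toy_insert and toy_vacuum
are hypothetical names invented for this sketch, and the "reset to zero
on every vacuum" step only models the behaviour described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy model; none of these names exist in PostgreSQL. */
typedef struct ToyIndex {
    size_t pending;            /* entries actually in the pending list */
} ToyIndex;

typedef struct ToyTableStats {
    size_t recently_inserted;  /* stats-side count of inserted heap tuples */
} ToyTableStats;

/*
 * Insert one heap row. Every row bumps the stats counter, but only rows
 * that satisfy the partial-index predicate reach the pending list.
 */
static void
toy_insert(ToyTableStats *stats, ToyIndex *idx, bool matches_predicate)
{
    stats->recently_inserted++;
    if (matches_predicate)
        idx->pending++;
}

/*
 * A "lazy" cleanup policy: flush only when the pending list holds at
 * least min_batch entries. The counter is reset unconditionally, which
 * is the behaviour point 2 objects to. Returns true if a flush happened.
 */
static bool
toy_vacuum(ToyTableStats *stats, ToyIndex *idx, size_t min_batch)
{
    bool flushed = false;

    assert(min_batch > 0);
    if (idx->pending >= min_batch)
    {
        idx->pending = 0;
        flushed = true;
    }
    stats->recently_inserted = 0;   /* reset regardless of what was done */
    return flushed;
}
```

With a predicate that matches one row in ten, 100 inserts leave the
counter at 100 but only 10 entries pending (point 1). A vacuum that
declines to flush those 10 entries still zeroes the counter, so the
stats no longer reflect the work left in the pending list (point 2).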

On top of those issues, there are implementation problems in the
proposed relation_has_pending_indexes() check: it has hard-wired
knowledge about GIN indexes, which means the feature cannot be
extended to add-on index AMs; and it's examining indexes without any
lock whatsoever on either the indexes or their parent table. (And
we really would rather not let autovacuum take a lock here.)

So I'm fairly strongly tempted to just rip out the whole mechanism,
and rely on existing autovacuum rules plus the ginHeapTupleFastInsert-
driven cleanups.

The only case that I can see where this is really any step backwards
is that following a bulk insert operation, autovacuum will only think
it needs to ANALYZE the table, but we would like it to clean out the
pending insertion lists too. But even then, the patch's mechanism
isn't all that desirable because it forces a useless VACUUM pass over
the heap. ISTM what might be a better, more flexible approach is to
allow the amvacuumcleanup hook to be called at the end of ANALYZE too,
letting the index AM make its own decision about whether it needs
to do anything then. A decision at that point could be made on the
actual size of the index's pending list, rather than any stats-driven
guess.
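
A minimal sketch of that idea, assuming a cleanup callback that is also
invoked at the end of ANALYZE. SketchIndex, min_batch and both function
names are hypothetical; this does not show the real amvacuumcleanup
signature, only the shape of a decision based on the actual pending
list size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an index that keeps a pending-insertion list. */
typedef struct SketchIndex {
    size_t pending_entries;  /* actual current size of the pending list */
    size_t min_batch;        /* AM-chosen size below which a flush is not worthwhile */
    size_t flushed_total;    /* entries moved into the main index so far */
} SketchIndex;

/* Move everything from the pending list into the main index structure. */
static size_t
sketch_flush_pending(SketchIndex *idx)
{
    size_t n = idx->pending_entries;

    idx->flushed_total += n;
    idx->pending_entries = 0;
    return n;
}

/*
 * Called at the end of ANALYZE. The index AM looks at its own pending
 * list rather than at any stats counter, and does nothing if the list
 * is too small to be worth a bulk insertion. Returns true if it flushed.
 */
static bool
sketch_cleanup_after_analyze(SketchIndex *idx)
{
    assert(idx != NULL);
    if (idx->pending_entries < idx->min_batch)
        return false;
    sketch_flush_pending(idx);
    return true;
}
```

The policy lives entirely inside the index AM here, so the core system
needs no per-AM knowledge and no extra statistics counter.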

Comments?

regards, tom lane


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Teodor Sigaev <teodor(at)sigaev(dot)ru>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: GIN fast-insert vs autovacuum scheduling
Date: 2009-03-23 19:01:14
Message-ID: 20090323190114.GD16373@alvh.no-ip.org

Tom Lane wrote:

> On top of those issues, there are implementation problems in the
> proposed relation_has_pending_indexes() check: it has hard-wired
> knowledge about GIN indexes, which means the feature cannot be
> extended to add-on index AMs; and it's examining indexes without any
> lock whatsoever on either the indexes or their parent table. (And
> we really would rather not let autovacuum take a lock here.)

I wonder if it's workable to have GIN send pgstats a message with number
of fast-inserted tuples, and have autovacuum check that number as well
as dead/live tuples.

ISTM this shouldn't be considered part of either vacuum or analyze at
all, and have autovacuum invoke it separately from both, with its own
decision equations and such. We could even have a scan over pg_class
just for GIN indexes to implement this.

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Teodor Sigaev <teodor(at)sigaev(dot)ru>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: GIN fast-insert vs autovacuum scheduling
Date: 2009-03-23 19:23:24
Message-ID: 700.1237836204@sss.pgh.pa.us

Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> Tom Lane wrote:
>> On top of those issues, there are implementation problems in the
>> proposed relation_has_pending_indexes() check:

> I wonder if it's workable to have GIN send pgstats a message with number
> of fast-inserted tuples, and have autovacuum check that number as well
> as dead/live tuples.

> ISTM this shouldn't be considered part of either vacuum or analyze at
> all, and have autovacuum invoke it separately from both, with its own
> decision equations and such. We could even have a scan over pg_class
> just for GIN indexes to implement this.

That's going in the wrong direction IMHO, because it's building
GIN-specific infrastructure into the core system. There is no need for
any such infrastructure if we just drive it off a post-ANALYZE callback.

regards, tom lane


From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Teodor Sigaev <teodor(at)sigaev(dot)ru>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: GIN fast-insert vs autovacuum scheduling
Date: 2009-03-23 22:59:45
Message-ID: 1237849185.2523.15.camel@dell.linuxdev.us.dell.com

On Mon, 2009-03-23 at 15:23 -0400, Tom Lane wrote:
> There is no need for any such infrastructure if we just drive it off a
> post-ANALYZE callback.

That sounds reasonable, although it does seem a little strange for
analyze to actually perform cleanup.

Now that we have FSM, the cost of VACUUMing insert-only tables is a lot
less. Does that possibly justify running VACUUM on insert-only tables?
On tables without GIN indexes, that wouldn't be a complete waste,
because it could set hint bits, which needs to be done sometime anyway.

Regards,
Jeff Davis


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Teodor Sigaev <teodor(at)sigaev(dot)ru>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: GIN fast-insert vs autovacuum scheduling
Date: 2009-03-23 23:38:34
Message-ID: 4991.1237851514@sss.pgh.pa.us

Jeff Davis <pgsql(at)j-davis(dot)com> writes:
> On Mon, 2009-03-23 at 15:23 -0400, Tom Lane wrote:
>> There is no need for any such infrastructure if we just drive it off a
>> post-ANALYZE callback.

> That sounds reasonable, although it does seem a little strange for
> analyze to actually perform cleanup.

My thought was to have GIN do cleanup only in an autovacuum-driven
ANALYZE, not in a client-issued ANALYZE. You could argue it either way
I suppose, but I agree that if a user says ANALYZE he's probably not
expecting index cleanup to happen.
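
One way to picture that rule, as a sketch only: the struct and function
below are hypothetical, and treating a plain VACUUM as always allowed to
clean up is an assumption of this example rather than something decided
in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of who is calling the cleanup callback. */
typedef struct SketchCallContext {
    bool analyze_only;   /* true: end of ANALYZE, no VACUUM pass was done */
    bool is_autovacuum;  /* true: the command was launched by autovacuum */
} SketchCallContext;

/*
 * Decide whether pending-list cleanup may run for this call.
 * Assumption: VACUUM always may; ANALYZE may only when autovacuum-driven,
 * so a client-issued ANALYZE never triggers index cleanup.
 */
static bool
sketch_may_cleanup_pending(const SketchCallContext *ctx)
{
    assert(ctx != NULL);
    if (!ctx->analyze_only)
        return true;
    return ctx->is_autovacuum;
}
```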

> Now that we have FSM, the cost of VACUUMing insert-only tables is a lot
> less.

Well, not if you just did a huge pile of inserts, which is the case
that we need to worry about here.

> On tables without GIN indexes, that wouldn't be a complete waste,
> because it could set hint bits, which needs to be done sometime anyway.

True, but we have not chosen to make autovacuum do that, and whether we
should or not seems to me to be orthogonal to when GIN index cleanup
should happen.

regards, tom lane