Re: GIN fast insert

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Teodor Sigaev <teodor(at)sigaev(dot)ru>
Cc: Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: GIN fast insert
Date: 2009-02-19 03:57:31
Message-ID: 603c8f070902181957j3c026d6dh6b794afe43f21ebb@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Feb 17, 2009 at 2:28 PM, Teodor Sigaev <teodor(at)sigaev(dot)ru> wrote:
> Hi there,
>
> we present two variants of GIN fast insert patch, since we're not sure
> what is a the best solution:
>
> v0.28.1
> - remove disable cost in gincostestimate
> - per http://archives.postgresql.org/message-id/499466D2.4010808@sigaev.ru
> gingettuple could force cleanup of pending list if it got a lossy
> tidbitmap.
> If such cleanup occurs the gingettuple will rescan pending list again.
>
> v0.28.2
> - remove disable cost in gincostestimate
> - per
> http://archives.postgresql.org/message-id/12795.1234379754@sss.pgh.pa.us
> AM can now have only one search method: amgettuple or amgetbitmap.
> - GIN now has only amgetbitmap interface

I reviewed v0.28.1. I see that disable_cost is gone from
gincostestimate, but you're still using the pending list to set costs,
and I still think that's bogus. It seems clear that this is going to
change much faster than plans are going to be invalidated, and if
autovacuum is doing its job, the pending list won't get long enough to
matter much anyway, right? I don't think this patch should be
touching gincostestimate at all.

I am thinking that it is may not be necessary to remove the
gingettuple interface (as you did in v0.28.2). Forcing a cleanup of
the pending list seems like a reasonable workaround. We don't expect
this situation to come up frequently, so if the method we use to
handle it is not terribly efficient, oh well. The one thing that
concerns me is - what will happen in a hot standby environment, when
that patch is committed? In that situation, I believe that we can't
call modify any heap or index pages, so...

Some other assorted minor comments on v0.28.1...

1. The description of the "fastupdate" reloption should be reworded
for consistency with other options:

Enables "fast update" feature for this GIN index

2. Why is this implemented as a string reloption rather than a boolean
reloption? It seems like you want to model this on
autovacuum_enabled.

3. Documentation wordsmithing. You have the following paragraph:

As of <productname>PostgreSQL</productname> 8.4, this problem is
alleviated by postponing most of the work until the next
<command>VACUUM</>. Newly inserted index entries are temporarily
stored in an unsorted list of pending entries <command>VACUUM</>
inserts all pending entries into the main <acronym>GIN</acronym> index
data structure, using the same bulk insert techniques used during
initial index creation. This greatly improves <acronym>GIN</acronym>
index update speed, even counting the additional vacuum overhead.

Here is my proposed rewrite:

As of <productname>PostgreSQL</productname> 8.4, <acronym>GIN</> is
capable of postponing much of this work by inserting new tuples into a
temporary, unsorted list of pending entries. When the table is
vacuumed, or in some cases when the pending list becomes too large,
the entries are moved to the main <acronym>GIN</acronym> data
structure using the same bulk insert techniques used during initial
index creation. This greatly improves <acronym>GIN</acronym> index
update speed, even counting the additional vacuum overhead.

4. More wordsmithing. In the following paragraph, you have:

It's recommended to use properly-configured autovacuum with tables
having <acronym>GIN</acronym> indexes, to keep this overhead to
reasonable levels.

I think it is clearer and more succinct to write simply:

Proper use of autovacuum can minimize this problem.

5. In textsearch.sgml, you say that GIN indexes are moderately slower
to update, but about 10x slower without fastupdate. Can we provide a
real number in place of "moderately"? I don't know whether to think
this means 20% or 2x.

6. One of the comment changes in startScanEntry is simply a correction
of a typographical error ("deletion" for "deletition"). You might as
well commit this change separately and remove it from this patch.

7. pg_stat_get_fresh_inserted_tuples. I am not crazy about the fact
that we call this the pending list in some places, fast update in some
places, and now, here, fresh tuples. Let's just call it fast insert
tuples.

8. tbm_check_tuple. The opening curly brace should be uncuddled. The
curly braces around wordnum = bitnum = 0 are superfluous.

9. gincostestimate. There are a lot of superfluous whitespace changes
here, and some debugging code that obviously wasn't fully removed.

10. GinPageOpaqueData. Surely we can come up with a better name than
GIN_LIST. This is yet another name for the same concept. Why not call
this GIN_FAST_INSERT_LIST?

11. ginInsertCleanup. "Inserion" is a typo.

Unfortunately, I don't understand all of this patch well enough to
give it as thorough a going-over as it deserves, so my apologies for
whatever I've missed.

...Robert

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2009-02-19 04:20:03 Re: Proposed Patch to Improve Performance of Multi-BatchHash Join for Skewed Data Sets
Previous Message Adriano Lange 2009-02-19 01:12:53 Re: graph representation of data structures in optimizer