Re: pg_xlog becomes extremely large during CREATE INDEX

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)dcc(dot)uchile(dot)cl>
Cc: "Jeffrey W(dot) Baker" <jwbaker(at)acm(dot)org>, "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: pg_xlog becomes extremely large during CREATE INDEX
Date: 2004-05-15 16:44:27
Message-ID: 7536.1084639467@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Alvaro Herrera <alvherre(at)dcc(dot)uchile(dot)cl> writes:
> On Sat, May 15, 2004 at 12:23:18AM -0400, Tom Lane wrote:
>> ... It tries to write all the levels of the btree in
>> parallel, so it is spitting out level-zero pages at a great rate,
>> level-one pages at a lesser rate, etc. For a large index there could
>> be many btree levels, and pages in the higher levels will be held locked
>> in the shared buffer arena for considerable periods.

> I'm not sure how btree from-scratch-building work, but could it be made
> so that it first build level 0 completely, scanning the heap; then build
> level 1 scanning the level 0, and so on?

I don't think that would be a win; it would require an extra read pass
over nearly all of the index. The idea here is to reduce the amount of
I/O, not increase it.

> I also wonder why index creation would write XLog information;

Yeah, that was implicit in my comment about not using the buffer
manager, and it was why I was concerned about being able to fsync
the index file. Without xlog for backup, you'd better be able to
force the index to disk before you commit.

It turns out that btree index build currently writes each page three
times: once to extend the file, once into WAL, and then finally to write
the completed index page. It is difficult to do any better as long as
you're using the shared buffer manager. (When extending a file, the
buffer manager wants to allocate disk space before it will give you a
buffer to write into. This is the right thing for heap insertions and
extending an existing index, because we don't force buffers to disk
before commit. So if it didn't work that way, you might commit your
transaction before finding out there's no disk space to hold the data.
But we could dispense with that worry for index build if we ensure that
we write and fsync all the pages before commit.) So I was thinking
about keeping the same index build process, but working in private
storage instead of shared buffers, and writing direct to smgr. This
would not only eliminate the locking conflict against checkpoints,
but reduce the amount of I/O involved quite a lot.

regards, tom lane

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Ed L. 2004-05-15 16:59:29 Re: Identifying diskspace leakage
Previous Message Jeffrey W. Baker 2004-05-15 16:43:44 Re: pg_xlog becomes extremely large during CREATE INDEX