Re: Scaling XLog insertion (was Re: Moving more work outside WALInsertLock)

From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Scaling XLog insertion (was Re: Moving more work outside WALInsertLock)
Date: 2012-03-05 16:50:02
Message-ID: 4F54EEBA.5070504@enterprisedb.com
Lists: pgsql-hackers

On 21.02.2012 13:19, Fujii Masao wrote:
> In some places, the spinlock "insertpos_lck" is taken while another
> spinlock "info_lck" is being held. Is this OK? What if unfortunately
> inner spinlock takes long to be taken?

Hmm, that's only done at a checkpoint (and a restartpoint), so I doubt
that's a big issue in practice. We had the same pattern before the
patch, just with WALInsertLock instead of insertpos_lck. Holding a
spinlock longer is much worse than holding an lwlock longer, but
nevertheless I don't think that's a problem.

If it does become a problem, we could provide a second copy of
RedoRecPtr that could be read while holding info_lck, and would be
allowed to lag behind slightly, while requiring insertpos_lck to
read/update the main shared copy of RedoRecPtr. The only place that
reads RedoRecPtr while holding info_lck is GetRedoRecPtr(), which would
be happy with a value that lags behind a few milliseconds. We could
still update that copy right after releasing insertpos_lck, so the delay
between the two would be tiny.
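
In code, the checkpoint side of that would look something like the
following. This is just a sketch, not code from the patch, and
"RedoRecPtrCopy" is a placeholder name for the lagging copy:

/* After computing the new redo pointer in the checkpoint code: */
SpinLockAcquire(&Insert->insertpos_lck);
Insert->RedoRecPtr = newRedoRecPtr;        /* authoritative copy */
SpinLockRelease(&Insert->insertpos_lck);

/* Refresh the lagging copy right afterwards, under info_lck. */
SpinLockAcquire(&xlogctl->info_lck);
xlogctl->RedoRecPtrCopy = newRedoRecPtr;   /* may lag for an instant */
SpinLockRelease(&xlogctl->info_lck);

GetRedoRecPtr() would then read RedoRecPtrCopy while holding only
info_lck.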

> + * An xlog-switch record consumes all the remaining space on the
> + * WAL segment. We have already reserved it for us, but we still need
> + * to make sure it's been allocated and zeroed in the WAL buffers so
> + * that when the caller (or someone else) does XLogWrite(), it can
> + * really write out all the zeros.
>
> Why do we need to write out all the remaining space with zeros? In
> current master, we don't do that. The recovery code ignores the data
> following an XLOG_SWITCH record, so I don't think that's required.

It's to keep the logic simpler. Before the patch, an xlog-switch simply
set up the next insertion page in the WAL buffers to be the first page
of the next segment. With this patch, we rely on a simple
linear mapping from an XLogRecPtr to the WAL buffer that should contain
that page (see XLogRecPtrToBufIdx()). Such a mapping is not possible if
you sometimes skip over pages in the WAL buffers, so we allocate the
buffers for those empty pages, too. Note that this means that an
xlog-switch can blow through all your WAL buffers.
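
To spell out the mapping, it's essentially just modular arithmetic. A
standalone sketch, using a flat 64-bit byte position for simplicity
rather than the patch's actual XLogRecPtrToBufIdx():

#include <stdint.h>

#define XLOG_BLCKSZ 8192    /* WAL page size, 8 kB by default */

/*
 * Page N of the WAL stream always lives in buffer slot N mod nbuffers.
 * Skipping pages would break this one-to-one correspondence, which is
 * why the padding pages of an xlog-switch must be allocated in the
 * buffers too.
 */
static inline int
wal_buffer_index(uint64_t wal_bytepos, int nbuffers)
{
    return (int) ((wal_bytepos / XLOG_BLCKSZ) % nbuffers);
}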

We could probably optimize that so that you don't need to actually
write() and fsync() all the zeros, perhaps by setting a flag on the WAL
buffer to indicate that it only contains padding for an xlog-switch.
However, I don't see any easy way to avoid blowing the cache.
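
If we go that route, I'd imagine something like this (entirely
hypothetical, nothing like it is in the patch yet):

/* Hypothetical flag array, one entry per WAL buffer page, set when the
 * page holds nothing but zero padding from an xlog-switch. */
bool   *PageIsSwitchPadding;

/* XLogWrite() could then treat such a page as already written: advance
 * the write/flush positions past it without issuing a write() or
 * fsync() for it, since recovery ignores everything after the switch
 * record anyway. */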

I'm thinking that xlog-switching happens so seldom, and typically on a
fairly idle system, that we don't need to optimize it much. I guess we
should measure the impact, though...

> + /* XXX: before this patch, TRACE_POSTGRESQL_XLOG_SWITCH was not called
> + * if the xlog switch had no work to do, ie. if we were already at the
> + * beginning of a new XLOG segment. You can check if RecPtr points to
> + * beginning of a segment if you want to keep the distinction.
> + */
> + TRACE_POSTGRESQL_XLOG_SWITCH();
>
> If so, shouldn't RecPtr (or a flag indicating whether the xlog switch
> had any work to do) be given to TRACE_POSTGRESQL_XLOG_SWITCH() as an
> argument?

I think I'll just move that call, so that the current behavior is retained.

> The followings are trivial comments:

Thanks, fixed these!

On 22.02.2012 03:34, Fujii Masao wrote:
> When I ran the long-running performance test, I encountered the following
> panic error.
>
> PANIC: could not find WAL buffer for 0/FF000000
>
> 0/FF000000 is the xlog file boundary, so the patch seems to handle
> the xlog file boundary incorrectly. In the patch, the current insertion
> LSN is advanced by directly incrementing XLogRecPtr.xrecoff. But to
> handle the xlog file boundary correctly, shouldn't we use
> XLByteAdvance() instead?

Thanks, fixed this, too.
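
To spell out the bug: XLogRecPtr is split into xlogid/xrecoff, and simply
incrementing xrecoff misses the carry into xlogid at the logical log file
boundary, which is exactly 0/FF000000 with 16 MB segments. Roughly what
XLByteAdvance() does (a paraphrase, not a copy of the real macro):

#include <stdint.h>

typedef struct XLogRecPtr
{
    uint32_t    xlogid;     /* log file #, the high half */
    uint32_t    xrecoff;    /* byte offset within the log file */
} XLogRecPtr;

#define XLOG_SEG_SIZE   (16 * 1024 * 1024)                  /* 16 MB segments */
#define XLogSegsPerFile (((uint32_t) 0xffffffff) / XLOG_SEG_SIZE)
#define XLogFileSize    (XLogSegsPerFile * XLOG_SEG_SIZE)   /* = 0xFF000000 */

/* Advance a WAL pointer by nbytes, carrying into xlogid when the offset
 * would run past the end of the logical log file. */
static inline void
xlbyte_advance(XLogRecPtr *ptr, uint32_t nbytes)
{
    if (ptr->xrecoff + nbytes >= XLogFileSize)
    {
        ptr->xlogid += 1;
        ptr->xrecoff = ptr->xrecoff + nbytes - XLogFileSize;
    }
    else
        ptr->xrecoff += nbytes;
}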

I made the locking a bit more strict in WaitXLogInsertionsToFinish(), so
that it grabs the insertpos_lck to read nextslot. I previously thought
that was not necessary, assuming that reading/writing an int32 is
atomic, but I'm afraid there might be memory-ordering issues where the
CurrPos of the most recent slot has not become visible to other backends
yet, while the advancing of nextslot has.
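
Concretely, the read now looks roughly like this (a fragment from memory,
not copied verbatim from the patch):

int     nextslot;

/* Taking the spinlock doubles as a memory barrier: any slot we learn
 * about through nextslot has its CurrPos visible to us as well. */
SpinLockAcquire(&Insert->insertpos_lck);
nextslot = Insert->nextslot;
SpinLockRelease(&Insert->insertpos_lck);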

That particular issue would be very hard to hit in practice, so I don't
know if this could explain the recovery failures that Jeff saw. I got
the test script running (thanks for that Jeff!), but unfortunately have
not seen any failures yet (aside from the issue with crossing the xlogid
boundary), with either this or the older version of the patch.

Attached is a new version of the patch.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachment Content-Type Size
xloginsert-scale-10.patch text/x-diff 95.4 KB
