Re: [REVIEW] pg_last_xact_insert_timestamp

From: Greg Smith <greg(at)2ndQuadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [REVIEW] pg_last_xact_insert_timestamp
Date: 2011-12-10 12:29:39
Message-ID: 4EE350B3.2040801@2ndQuadrant.com
Lists: pgsql-hackers

On 10/02/2011 07:10 AM, Robert Haas wrote:
> Your proposals involve sending additional information from the master to
> the slave, but the slave already knows both its WAL position and the
> timestamp of the transaction it has most recently replayed, because
> the startup process on the slave tracks that information and publishes
> it in shared memory. On the master, however, only the WAL position is
> centrally tracked; the transaction timestamps are not.
>

This seems to be the question that was never really answered well enough
to satisfy anyone, so let's rewind to here for a bit. I wasn't
following this closely until now, so I just did my own review from
scratch against the latest patch. I found a few issues, and I don't
think all of them have been aired here yet:

-It adds overhead at every commit, even for people who aren't using it.
Probably not enough to matter, but it's yet another thing routed through
the pgstat system, which is already often maligned as too heavy.

-In order to measure lag this way, you need access to both the master
and the standby. Yuck; dblink or application code doing timestamp math,
either idea makes me shudder. It would be nice to answer "how many
seconds of lag do we have?" directly from the standby. Ideally I would
like both the master and the standby to know those numbers.
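To make the awkwardness concrete, here is a sketch of the timestamp math an application would be stuck doing itself, given one value queried from each server (the function name and the way the values arrive are my illustration, not anything in the patch):

```python
from datetime import datetime, timezone

def cross_server_lag_seconds(master_insert_ts, standby_replay_ts):
    """Lag computed the awkward way: two timestamps fetched over two
    connections (pg_last_xact_insert_timestamp() on the master,
    pg_last_xact_replay_timestamp() on the standby), leaving clock
    skew and NULL handling entirely to the application."""
    if master_insert_ts is None or standby_replay_ts is None:
        return None  # one side hasn't seen a commit since restart
    return (master_insert_ts - standby_replay_ts).total_seconds()

# Values as an application might receive them from two connections:
master = datetime(2011, 12, 10, 12, 0, 30, tzinfo=timezone.utc)
standby = datetime(2011, 12, 10, 12, 0, 25, tzinfo=timezone.utc)
```

Note that both inputs are nullable, so every consumer has to get the None checks right on both sides, which is exactly the problem discussed further down.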

-After the server is restarted, you get a hole in the monitoring data
until the first transaction is committed or aborted. The way the
existing pg_last_xact_replay_timestamp side of this computation goes
NULL for some unpredictable period after restart is going to drive
monitoring systems batty. Building this new facility similarly will
force everyone who writes a lag monitor to special case for that
condition on both sides. Sure, that's less user hostile than the status
quo, but it isn't going to help PostgreSQL's battered reputation in the
area of monitoring either.

-The transaction ID advancing is not a very good proxy for real-world
lag. You can have a very large amount of writing in between commits.
The existing lag monitoring possibilities focus on WAL position instead,
which is better correlated with the sort of activity that causes lag.
Making one measurement of lag WAL-position based (the existing ones) and
another based on transaction advance (this proposal) is bound to raise
some "which of these is the correct lag?" questions when the two
diverge. Large data loading operations come to mind as a common
situation where that would happen.

I'm normally a fan of building the simplest thing that will do something
useful, and this patch succeeds there. But the best data to collect
needs to have a timestamp that keeps moving forward in a way that
correlates reasonably well to the system WAL load. Using the
transaction ID doesn't do that. Simon did some hand waving around
sending a timestamp every checkpoint. That would allow the standby to
compute its own lag, limit overhead to something much lighter than
per-transaction, and better track write volume. There could still be a
bigger than normal discontinuity after server restart, if the server was
down a while, but at least there wouldn't ever be a point where the
value was returned by the master but was NULL.

But as Simon mentioned in passing, it will bloat the WAL streams for
everyone. Here's the as yet uncoded mini-proposal that seems to have
slid by uncommented upon:

"We can send regular special messages from WALSender to WALReceiver that
do not form part of the WAL stream, so we don't bulk
up WAL archives. (i.e. don't use "w" messages)."

Here's my understanding of how this would work. Each time a
commit/abort record appears in the WAL, that updates XLogCtl with the
associated timestamp. If WALReceiver received regular updates
containing the master's clock timestamp and stored them
similarly--either via regular streaming or the heartbeat--it could
compute lag with the same resolution as this patch aims to do, for the
price of two spinlocks: just subtract the two timestamps. No overhead
on the master, and lag can be directly computed and queried from each
standby. If you want to feed that data back to the master so it can
appear in pg_stat_replication in both places, send it periodically via
the same channel sync rep and standby feedback use. I believe that will
be cheap in many cases, since it can piggyback on messages that will
still be quite small relative to minimum packet size on most networks.
(The exception, compressed or encrypted links where packets aren't
discrete in this way, doesn't seem that relevant; if you're running one
of those, I'd think this overhead is the least of your worries.)
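Here's a toy model of the standby-side bookkeeping I have in mind. All the names are my invention, not PostgreSQL's, and a Python lock stands in for the spinlock:

```python
import threading
from datetime import datetime, timezone

class StandbyLagSketch:
    """Toy model of the standby's shared-memory state (names invented).
    WALReceiver stores the master's clock timestamp as messages arrive;
    replay stores the timestamp of each commit/abort record applied."""
    def __init__(self):
        self._lock = threading.Lock()        # stands in for a spinlock
        self.master_clock_ts = None          # from walsender stream/heartbeat
        self.last_replayed_commit_ts = None  # from replayed commit/abort records

    def lag_seconds(self):
        # "The price of two spinlocks": grab each timestamp, then subtract.
        with self._lock:
            master = self.master_clock_ts
        with self._lock:
            replayed = self.last_replayed_commit_ts
        if master is None or replayed is None:
            return None
        return (master - replayed).total_seconds()

s = StandbyLagSketch()
s.master_clock_ts = datetime(2011, 12, 10, 12, 0, 40, tzinfo=timezone.utc)
s.last_replayed_commit_ts = datetime(2011, 12, 10, 12, 0, 37, tzinfo=timezone.utc)
```

The key property is that the subtraction happens entirely on the standby; the master only has to keep sending its clock, which it can piggyback on traffic it's sending anyway.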

Now, there's still one problem here. This doesn't address the "lots of
write volume but no commit" problem any better than the proposed patch
does. Maybe there's some fancy way to inject it along with the received
WAL on the standby, I'm out of brain power to think through that right
now. One way to force this to go away is to add a "time update" WAL
record type here, one that only appears in some of these unusual
situations. My first idea for keeping that overhead small is to put the
injection logic for it at WAL file switch time. If you haven't
committed a transaction during that entire 16MB of writes, start the new
WAL segment with a little time update record. That would ensure you
never do too many writes before the WAL replay clock is updated, while
avoiding this record type altogether during commit-heavy periods.
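The injection decision itself would be trivial; something along these lines, sketched with invented names and obviously nothing like the real WAL-writing code:

```python
WAL_SEGMENT_BYTES = 16 * 1024 * 1024  # 16MB segments, as above

def needs_time_update_record(commits_in_previous_segment, at_segment_switch):
    """Inject a 'time update' record only when starting a new segment
    after a full segment containing no commit/abort record, so
    commit-heavy periods never see this record type at all."""
    return at_segment_switch and commits_in_previous_segment == 0
```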

The WAL sender and receiver pair should be able to work out how to
handle the other corner case, where neither WAL advance nor transactions
occurred. You don't want lag to keep increasing forever there. If the
transaction ID hasn't moved forward, the heartbeat can still update the
time. I think if you do that, all you'd need to special case is that if
master XID=standby replay XID, lag time should be forced to 0 instead of
being its usual value (master timestamp - last standby commit/abort record).
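That special case is easy to state in code; a sketch with illustrative names only:

```python
from datetime import datetime, timezone

def reported_lag_seconds(master_xid, standby_replay_xid,
                         master_clock_ts, last_replayed_commit_ts):
    """A fully caught-up standby reports zero lag, rather than the
    usual value of master clock minus an old commit timestamp that
    would otherwise keep growing between transactions."""
    if master_xid == standby_replay_xid:
        return 0.0  # caught up: force lag to zero
    if master_clock_ts is None or last_replayed_commit_ts is None:
        return None
    return (master_clock_ts - last_replayed_commit_ts).total_seconds()

master_ts = datetime(2011, 12, 10, 12, 0, 50, tzinfo=timezone.utc)
replayed_ts = datetime(2011, 12, 10, 12, 0, 44, tzinfo=timezone.utc)
```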

As for "what do you return if asked for lag before the first data with a
timestamp appears?" problem, there are surely still cases where that
happens in this approach. I'm less concerned about that if there's only
a single function involved though. The part that worries me is the high
probability people are going to do NULL math wrong if they have to
combine two values from different servers, and not catch it during
development. If I had a penny for every NULL handling mistake ever made
in that category, I'd be writing this from my island lair instead of
Baltimore. I'm more comfortable saying "this lag interval might be
NULL"; if that's the only exception people have to worry about, that
doesn't stress me as much.

Last disclaimer: if this argument is persuasive enough to punt
pg_last_xact_insert_timestamp for now, in favor of a new specification
along the lines I've outlined here, I have no intention of letting
myself or Simon end up blocking progress on that. This one is important
enough for 9.2 that if there's not another patch offering the same
feature sitting in the queue by the end of the month, I'll ask someone
else here to sit on this problem a while. Probably Jaime, because he's
suffering with this problem as much as I am. We maintain code in repmgr
to do this job with brute force: it saves a history table to translate
XID->timestamps and works out lag from there. Window function query +
constantly churning monitoring table = high overhead. It would really be
great to EOL that whole messy section as deprecated starting in 9.2.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us
