Re: Streaming Recovery - Automated Monitoring - CODE LINK

From: Karl Denninger <karl(at)denninger(dot)net>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Streaming Recovery - Automated Monitoring - CODE LINK
Date: 2010-10-03 21:24:59
Message-ID: 4CA8F4AB.9040706@denninger.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On 10/3/2010 3:44 PM, Karl Denninger wrote:
> On 10/3/2010 1:34 AM, Guillaume Lelarge wrote:
>> Le 03/10/2010 07:07, Karl Denninger a écrit :
>>> On 10/2/2010 11:40 PM, Rajesh Kumar Mallah wrote:
>>>> I hope u checked point #11
>>>> http://wiki.postgresql.org/wiki/Streaming_Replication#How_to_Use
>>>>
>>>> * *11.* You can calculate the replication lag by comparing the
>>>> current WAL write location on the primary with the last WAL
>>>> location received/replayed by the standby. They can be retrieved
>>>> using /pg_current_xlog_location/ on the primary and the
>>>> /pg_last_xlog_receive_location///pg_last_xlog_replay_location/
>>>> on the standby, respectively.
>>>>
>>>> $ psql -c "SELECT pg_current_xlog_location()" -h192.168.0.10 (primary host)
>>>> pg_current_xlog_location
>>>> --------------------------
>>>> 0/2000000
>>>> (1 row)
>>>>
>>>> $ psql -c "select pg_last_xlog_receive_location()" -h192.168.0.20 (standby host)
>>>> pg_last_xlog_receive_location
>>>> -------------------------------
>>>> 0/2000000
>>>> (1 row)
>>>>
>>>> $ psql -c "select pg_last_xlog_replay_location()" -h192.168.0.20 (standby host)
>>>> pg_last_xlog_replay_location
>>>> ------------------------------
>>>> 0/2000000
>>>> (1 row)
>>>>
>>>> Regds
>>>> Rajesh Kumar Mallah.
>>> Yes, I did.
>>>
>>> Now how do I get an arithmetic difference between the two? There will
>>> (usually) be a small difference between the master and slave on a busy
>>> system - what I want to do is query both and if the difference in their
>>> locations is greater than some defined size, start raising hell (e.g.
>>> sending SMS to people, etc)
>>>
>>> I can SEE the difference, but I don't see a way to COMPUTE a difference,
>>> and there does not appear to be a function that will accept the log file
>>> location as an argument for conversion - the one documented for offsets
>>> (which might otherwise work) does not work on the slave as I noted.
>>>
>>> With Slony there was a set of system tables that would tell me how many
>>> unapplied changes were in the queue. From this I could determine health
>>> - if the number was more than some reasonably-small amount, something
>>> was broken and alarms were to be raised.
>>>
>>> I'm looking for a way to implement the same sort of functionality here.
>>>
>>> ticker=# select pg_last_xlog_replay_location();
>>> pg_last_xlog_replay_location
>>> ------------------------------
>>> 37A/327D1888
>>> (1 row)
>>>
>>> ticker=# select pg_current_xlog_location();
>>> pg_current_xlog_location
>>> --------------------------
>>> 37A/3280DCB8
>>> (1 row)
>>>
>>> How do I get an arithmetic difference between these two
>>> programmatically, and will such always be monoatomically increasing
>>> (that is, will they ever roll over, thereby giving me a potential
>>> NEGATIVE difference?)
>>>
>> pgPool-II does that computation. You should check that in its source
>> code. File pool_worker_child.c, function check_replication_time_lag().
>> It creates a long value from the LSN returned by these functions. Here
>> is the computation:
>>
>> lsn = xlogid * 16 * 1024 * 1024 * 255 + xrecoff;
>>
>> In your example, xlogid is 37A and xrecoff is 327D1888 on the slave and
>> 3280DCB8 on the master. The hexadecimal values were first converted to
>> their unsigned decimal integer representation (same file, function
>> text_to_lsn()).
>>
>> You should really get a look at this file. It's probably a better
>> explanation than what I'm trying to do :)
>>
> Here's a little program to do it - change "MAX_OFFSET" and modify
> (particularly the "notify" function) to suit and place a file called
> "CHECK.cluster" in the following form in the home directory of whatever
> this runs as (it will look in the home directory of the euid of the
> process):
>
> master-name connection-string-to-connect-to-master
> slave1-name slave-string-to-connect
> slave2-name slave-string-to-connect
> ......
>
> It will emit a notification for each slave that is more than
> "MAX_OFFSET" behind, or if any slave is not in recovery mode (bad; it
> thinks it's a master!) or worse, if the MASTER is in recovery mode, or
> if it can't talk to any of the declared nodes.
>
> Note that "CHECK.cluster" contains connection information so make sure
> you run this as someone and with a home directory for that someone
> appropriately permitted to prevent leakage of the connection credentials.
>
> Ain't gonna claim it's elegant, but it was something I was able to
> quickly bang out in a few minutes and it works. Stuck in the CRON if
> you have the cron's email going somewhere that will get noticed fast
> (e.g. a Blackberry email, etc) you can run this on whatever interval you
> need and it'll do the job.
>
> Postgresql's development team can consider this a contribution to the
> codebase if someone wants to clean it up a bit and include it. You
> could obviously pick up the config file as an argument rather than using
> the running euid
> of the process to grab it; I'm using it in an environment where the
> latter works better for me - YMMV.
>
I've cleaned this up a bit more and it is now at
http://www.denninger.net/check-replication.c

Feel free to grab and use it if you think it would be helpful.

-- Karl

Attachment Content-Type Size
karl.vcf text/x-vcard 180 bytes

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Tommy Gildseth 2010-10-04 05:09:21 Re: rotate psql output
Previous Message Karl Denninger 2010-10-03 20:44:39 Re: Streaming Recovery - Automated Monitoring