Hot Standby (v9d)

Lists: pgsql-hackers
From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Hot Standby (v9d)
Date: 2009-01-22 22:35:10
Message-ID: 1232663710.2327.1028.camel@ebony.2ndQuadrant


Latest version of Hot Standby includes new deferred cancellation for
conflict resolution and further refactoring.

http://wiki.postgresql.org/wiki/Hot_Standby#Usage

[Please note that max_standby_delay parameter has been set to default to
zero (0), to allow investigation of usability. i.e. queries will be
cancelled fairly quickly if they conflict. Please set this in
recovery.conf to any value you consider reasonable and report findings.]
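
As a concrete starting point, a test standby's recovery.conf might look like this (the max_standby_delay line is the one to vary; the restore_command path and the units are illustrative assumptions, so check the wiki page above):

```
# recovery.conf on the standby -- illustrative values only
restore_command = 'cp /path/to/archive/%f %p'

# 0 = cancel conflicting queries immediately (the current default);
# raise it to let queries run longer before conflict cancellation
max_standby_delay = 0
```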

GENERAL PLANS

* Bug fix v9 over next few days
* Put corrected version into GIT
* Produce outstanding items as patch-on-patch via GIT
* Work with reviewers to nibble away until fully committed

OUTSTANDING ITEMS (as of Version 9d)

Reported problems (Identified by) (Target release)

* None (as yet!) - please post all bugs to list
All v9 bugs have been fixed. This version has passed initial bash and
function tests and we will continue testing this version tomorrow.

Main work/required additional items (Proposed by) (Target release)

* Complete Prepared Transaction support (Simon)

Performance tuning opportunities (Proposed by) (Target release)

* tuning of RecordKnownAssignedTransactionIds() using hash table
(Heikki)

Usability enhancements

* don't cancel all currently connected users every time we drop a
tablespace that has temp files being used within that tablespace -
should be able to parse the filenames to get pids to cancel

* don't kill all currently connected users every time we drop a user -
should be able to match list of current users against connected roles

* more work possible on btree deletes to reduce impact - should be able
to work out a reasonable latestRemovedXid

Testing of any kind welcome, the heavier the better. Save the database
and logs if it crashes, please. Performance data especially valuable.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

Attachment Content-Type Size
hs.v9d.20090122_4.tar.bz2 application/x-bzip-compressed-tar 73.4 KB

From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot Standby (v9d)
Date: 2009-01-23 14:14:49
Message-ID: 4979D0D9.6010904@enterprisedb.com

Simon Riggs wrote:
> * Put corrected version into GIT
> * Produce outstanding items as patch-on-patch via GIT

I've applied the hot standby patch and recovery infra v9 patch to
branches in my git repository, and pushed those to:

git://git.postgresql.org/git/~hlinnaka/pgsql.git

You can clone that to get started.

I've set those branches up so that the hot standby branch is branched
off from the recovery infra branch. I'd like to keep the two separate,
as the recovery infra patch is useful on its own, and can be reviewed
separately.

As a teaser, I made a couple of minor changes after importing your
patches. For the sake of the archives, I've also included those changes
as patches here.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachment Content-Type Size
0001-Remove-padding-in-XLogCtl-might-be-a-good-idea-but.patch text/x-diff 0 bytes

From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot Standby (v9d)
Date: 2009-01-23 14:43:23
Message-ID: 1232721803.2327.1190.camel@ebony.2ndQuadrant


On Fri, 2009-01-23 at 16:14 +0200, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > * Put corrected version into GIT
> > * Produce outstanding items as patch-on-patch via GIT
>
> I've applied the hot standby patch and recovery infra v9 patch to
> branches in my git repository, and pushed those to:
>
> git://git.postgresql.org/git/~hlinnaka/pgsql.git
>
> You can clone that to get started.
>
> I've set those branches up so that the hot standby branch is branched
> off from the recovery infra branch. I'd like to keep the two separate,
> as the recovery infra patch is useful on its own, and can be reviewed
> separately.
>
> As a teaser, I made a couple of minor changes after importing your
> patches. For the sake of the archives, I've also included those changes
> as patches here.

OK, I'll look at those. I've fixed 6 bugs today (!), so I'd rather go
for the new version coming out in an hour. That's too many to unpick
unfortunately.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot Standby (v9d)
Date: 2009-01-23 16:00:54
Message-ID: 1232726454.2327.1227.camel@ebony.2ndQuadrant


On Fri, 2009-01-23 at 16:14 +0200, Heikki Linnakangas wrote:

> I made a couple of minor changes after importing your
> patches.

I've applied them both to v9g, attached here.

If you wouldn't mind redoing the initial step, I will promise not to do
anything else to the code, except via patch on GIT.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

Attachment Content-Type Size
hs.v9g.20090123_3.tar.bz2 application/x-bzip-compressed-tar 73.9 KB

From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot Standby (v9d)
Date: 2009-01-23 16:11:25
Message-ID: 4979EC2D.7080605@enterprisedb.com

Simon Riggs wrote:
> On Fri, 2009-01-23 at 16:14 +0200, Heikki Linnakangas wrote:
>
>> I made a couple of minor changes after importing your
>> patches.
>
> I've applied them both to v9g, attached here.
>
> If you wouldn't mind redoing the initial step, I will promise not to do
> anything else to the code, except via patch on GIT.

No problem. I don't need to do it from scratch, I'll just apply the
changes from that patch as an incremental commit. Done, you can see it
in my git repo now too.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot Standby (v9d)
Date: 2009-01-23 16:22:21
Message-ID: 4979EEBD.5060108@enterprisedb.com

> @@ -1601,6 +1602,24 @@ BufferProcessRecoveryConflictsIfAny(volatile BufferDesc *bufHdr)
> {
> XLogRecPtr bufLSN = BufferGetLSN(bufHdr);
>
> + /*
> + * If the buffer is recent we may need to cancel ourselves
> + * rather than risk returning a wrong answer. This test is
> + * too conservative, but it is correct.
> + *
> + * We only need to cancel the current subtransaction.
> + * Once we've handled the error then other subtransactions can
> + * continue processing. Note that we do *not* reset the
> + * BufferRecoveryConflictLSN at subcommit/abort, but we do
> + * reset it if we release our last remaining snapshot.
> + * see SnapshotResetXmin()
> + *

Is it really enough to cancel just the current subtransaction? What if
it's a serializable transaction?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot Standby (v9d)
Date: 2009-01-23 18:51:21
Message-ID: 1232736681.2327.1268.camel@ebony.2ndQuadrant


On Fri, 2009-01-23 at 18:22 +0200, Heikki Linnakangas wrote:
> > @@ -1601,6 +1602,24 @@ BufferProcessRecoveryConflictsIfAny(volatile BufferDesc *bufHdr)
> > {
> > XLogRecPtr bufLSN = BufferGetLSN(bufHdr);
> >
> > + /*
> > + * If the buffer is recent we may need to cancel ourselves
> > + * rather than risk returning a wrong answer. This test is
> > + * too conservative, but it is correct.
> > + *
> > + * We only need to cancel the current subtransaction.
> > + * Once we've handled the error then other subtransactions can
> > + * continue processing. Note that we do *not* reset the
> > + * BufferRecoveryConflictLSN at subcommit/abort, but we do
> > + * reset it if we release our last remaining snapshot.
> > + * see SnapshotResetXmin()
> > + *
>
> Is it really enough to cancel just the current subtransaction? What if
> it's a serializable transaction?

I did originally think that when I first looked at the problem. I'm
sorry if I say that a lot.

If you have a serializable transaction with subtransactions that suffers
a serializability error it only cancels the current subtransaction. That
means its snapshot is still valid and can be used again. By analogy, as
long as a transaction does not see any data that is inconsistent with
its snapshot it seems OK for it to continue. So I think it is correct.

(Bizarrely, this might mean that if we did this programmatically in a
loop we might keep the system busy for some time while it continually
re-reads data and fails. But that's another story).

You remind me that we can now do what Kevin has requested and throw a
errcode(ERRCODE_T_R_SERIALIZATION_FAILURE) at this point, which I agree
is the most easily understood way of describing this error.

(I was sorely tempted to make it "snapshot too old", as a joke).

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot Standby (v9d)
Date: 2009-01-23 19:17:11
Message-ID: 497A17B7.1080001@enterprisedb.com

Simon Riggs wrote:
> If you have a serializable transaction with subtransactions that suffers
> a serializability error it only cancels the current subtransaction. That
> means its snapshot is still valid and can be used again. By analogy, as
> long as a transaction does not see any data that is inconsistent with
> its snapshot it seems OK for it to continue. So I think it is correct.

Yeah, you're right. How bizarre.

> (I was sorely tempted to make it "snapshot too old", as a joke).

Yeah, that is actually a very descriptive message. If there wasn't any
precedent for it, I believe we might have come up with exactly that
message ourselves.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Simon Riggs <simon(at)2ndQuadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot Standby (v9d)
Date: 2009-01-23 19:50:00
Message-ID: 20090123195000.GO4047@alvh.no-ip.org

Heikki Linnakangas wrote:
> Simon Riggs wrote:
>> If you have a serializable transaction with subtransactions that suffers
>> a serializability error it only cancels the current subtransaction. That
>> means its snapshot is still valid and can be used again. By analogy, as
>> long as a transaction does not see any data that is inconsistent with
>> its snapshot it seems OK for it to continue. So I think it is correct.
>
> Yeah, you're right. How bizarre.

It was argued this way to me way back when subtransactions were written.
Originally I had written it in such a way as to abort the whole
transaction, on the rationale that if you're blindly restarting the
subtransaction after a serialization error, it would result in a (maybe
infinite) loop. I think the reason it only aborts the subxact is that
that's what all other errors do, so why would this one act differently.

Hmm, now that I think about it, I think it was deadlock errors not
serialization errors ...

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot Standby (v9d)
Date: 2009-01-23 21:22:27
Message-ID: 1232745747.2327.1316.camel@ebony.2ndQuadrant


On Thu, 2009-01-22 at 22:35 +0000, Simon Riggs wrote:

> * Bug fix v9 over next few days

version 9g - please use this for testing now

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

Attachment Content-Type Size
hs.v9g.20090123_3.tar.bz2 application/x-bzip-compressed-tar 73.9 KB

From: Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot Standby (v9d)
Date: 2009-01-24 04:24:00
Message-ID: 497A97E0.2090502@paradise.net.nz

Simon Riggs wrote:
> On Thu, 2009-01-22 at 22:35 +0000, Simon Riggs wrote:
>
>
>> * Bug fix v9 over next few days
>>
>
> version 9g - please use this for testing now
>
>

I'm doing some test runs with this now. I notice an old flatfiles
related bug has reappeared:

master:

=# create database test;

slave:

=# select datname from pg_database where datname like 'test';
datname
---------
test
(1 row)

postgres=# \c test
FATAL: database "test" does not exist
Previous connection kept


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot Standby (v9d)
Date: 2009-01-24 11:20:32
Message-ID: 1232796032.2327.1395.camel@ebony.2ndQuadrant


On Sat, 2009-01-24 at 17:24 +1300, Mark Kirkwood wrote:

> > version 9g - please use this for testing now

> I'm doing some test runs with this now. I notice an old flatfiles
> related bug has reappeared:

I'm seeing an off-by-one error on xmax in some cases. That then causes
the flat file update to not pick up the correct info, even though it
otherwise executed as intended. If you run two create databases and
then test only the first, it appears to have worked as intended.

These bugs are a result of recent refactoring and it will take a few days
to shake some of them out. We've had more than 20 already so we're
beating them back, but we're not done yet.

Thanks for the report, will publish this and other fixes on Monday.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot Standby (v9d)
Date: 2009-01-24 12:15:19
Message-ID: 1232799319.2327.1410.camel@ebony.2ndQuadrant


On Sat, 2009-01-24 at 11:20 +0000, Simon Riggs wrote:
> On Sat, 2009-01-24 at 17:24 +1300, Mark Kirkwood wrote:
>
> > > version 9g - please use this for testing now
>
> > I'm doing some test runs with this now. I notice an old flatfiles
> > related bug has reappeared:
>
> I'm seeing an off-by-one error on xmax in some cases. That then causes
> the flat file update to not pick up the correct info, even though it
> otherwise executed as intended. If you run two create databases and
> then test only the first, it appears to have worked as intended.
>
> These bugs are a result of recent refactoring and it will take a few days
> to shake some of them out. We've had more than 20 already so we're
> beating them back, but we're not done yet.

I was at a loss to explain how this could have slipped through our
tests. It appears that the error was corrected following each checkpoint
as a result of ProcArrayUpdateRunningXacts(). Our tests were performed
after a short delay, which typically would be greater than the
deliberately short setting of checkpoint_timeout/archive_timeout, and so
by the time we looked the error was gone and the problem was masked. We're
setting checkpoint_timeout to 30 mins now to avoid this masking...
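
For testing, that masking can be avoided with deliberately long settings on the test primary, for example (values illustrative):

```
# postgresql.conf on the test primary -- illustrative values
checkpoint_timeout = 30min   # long enough that no checkpoint "repairs" the standby before we inspect it
archive_timeout = 30min
```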

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support


From: Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot Standby (v9d)
Date: 2009-01-24 23:48:07
Message-ID: 497BA8B7.90606@paradise.net.nz

Simon Riggs wrote:
> On Sat, 2009-01-24 at 11:20 +0000, Simon Riggs wrote:
>
>> On Sat, 2009-01-24 at 17:24 +1300, Mark Kirkwood wrote:
>>
>>
>>>> version 9g - please use this for testing now
>>>>
>>> I'm doing some test runs with this now. I notice an old flatfiles
>>> related bug has reappeared:
>>>
>> I'm seeing an off-by-one error on xmax in some cases. That then causes
>> the flat file update to not pick up the correct info, even though it
>> otherwise executed as intended. If you run two create databases and
>> then test only the first, it appears to have worked as intended.
>>
>> These bugs are a result of recent refactoring and it will take a few days
>> to shake some of them out. We've had more than 20 already so we're
>> beating them back, but we're not done yet.
>>
>
> I was at a loss to explain how this could have slipped through our
> tests. It appears that the error was corrected following each checkpoint
> as a result of ProcArrayUpdateRunningXacts(). Our tests were performed
> after a short delay, which typically would be greater than the
> deliberately short setting of checkpoint_timeout/archive_timeout, and so
> by the time we looked the error was gone and the problem was masked. We're
> setting checkpoint_timeout to 30 mins now to avoid this masking...
>
>
That makes sense - I had archive_timeout set at 5 minutes, and I would
have checked before 5 minutes were up.

Cheers

Mark


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot Standby (v9d)
Date: 2009-01-27 22:52:54
Message-ID: 1233096774.2327.2283.camel@ebony.2ndQuadrant


On Sat, 2009-01-24 at 17:24 +1300, Mark Kirkwood wrote:

> I'm doing some test runs with this now. I notice an old flatfiles
> related bug has reappeared:

Bug fix patch against git repo.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

Attachment Content-Type Size
hs.v9h_dif_v9g.patch text/x-patch 12.6 KB

From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot Standby (v9d)
Date: 2009-01-28 18:55:06
Message-ID: 87mydbjm5x.fsf@oxford.xeocode.com


I skimmed through the Hot Standby patch for a preliminary review. I noted the
following things, some minor tweaks, some just questions. None of the things I
noted are big issues unless some of the questions uncover issues.

1) This code is obviously a cut-pasto:

+ else if (strcmp(tok1, "max_standby_delay") == 0)
+ {
+ errno = 0;
+ maxStandbyDelay = (TransactionId) strtoul(tok2, NULL, 0);
+ if (errno == EINVAL || errno == ERANGE)
+ ereport(FATAL,
+ (errmsg("max_standby_delay is not a valid number: \"%s\"",
+ tok2)));

Have you ever tested the behaviour with max_standby_delay = -1 ?

Also, the default max_standby_delay is currently 0 -- ie, kill queries
immediately upon detecting any conflicts at all -- which I don't really think
anyone would be happy with. I still *strongly* feel the default has to be the
non-destructive conservative -1.
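
To make the cut-pasto concrete: strtoul plus a TransactionId cast silently mangles "-1", whereas a signed parse keeps it intact so "-1 = never cancel" can work. A minimal sketch of what a corrected parser might look like (the function name and the -1 convention here are illustrative, not taken from the patch):

```c
#include <errno.h>
#include <stdlib.h>

/*
 * Parse max_standby_delay as a signed long so that -1 ("wait forever,
 * never cancel queries") survives intact.  Returns 0 on success, -1 on
 * a malformed or out-of-range value.  Illustrative sketch only.
 */
static int
parse_max_standby_delay(const char *tok, long *delay)
{
	char	   *end;

	errno = 0;
	*delay = strtol(tok, &end, 10);
	if (errno == ERANGE || end == tok || *end != '\0')
		return -1;				/* not a valid number */
	if (*delay < -1)
		return -1;				/* nothing below -1 has a defined meaning */
	return 0;
}
```

With this shape, "-1" round-trips as a negative value instead of wrapping around to a huge unsigned TransactionId.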

2) This hard coded constant of 500ms seems arbitrary to me. If your use case
is a heavily-used reporting engine you might get this message all the time. I
think this either has to be configurable (warn_standby_delay?) or based on
max_standby_delay or some other parameter somehow.

+ /*
+ * log that we have been waiting for a while now...
+ */
+ if (!logged && standbyWait_ms > 500)

3) These two blocks of code seem unsatisfactory:

! /*
! * Keep track of the block number of the lastBlockVacuumed, so
! * we can scan those blocks as well during WAL replay. This then
! * provides concurrency protection and allows btrees to be used
! * while in recovery.
! */
! if (lastBlockVacuumed > vstate->lastBlockVacuumed)
! vstate->lastBlockVacuumed = lastBlockVacuumed;
!

+ /*
+ * XXXHS we don't actually need to read the block, we
+ * just need to confirm it is unpinned. If we had a special call
+ * into the buffer manager we could optimise this so that
+ * if the block is not in shared_buffers we confirm it as unpinned.
+ */
+ buffer = XLogReadBufferExtended(xlrec->node, MAIN_FORKNUM, blkno, RBM_NORMAL);
+ if (BufferIsValid(buffer))
+ {
+ LockBufferForCleanup(buffer);
+ UnlockReleaseBuffer(buffer);
+ }

Aside from the XXX comment (I thought we actually had such a call now, but if
not shouldn't we just add one instead of carping?) I'm not convinced this
handles all the cases that can arise. Notable, what happens if a previous
vacuum died in the middle of the scan?

I think we have a vacuum id which we already use with btrees for the same
issue. It seems to me this would be more robust if you stamped the xlog
record with that id number and ensured it matched the id of the
lastBlockVacuumed? Then you could start from 0 if the id changes.

4) Why is this necessary?

+ if (IsRecoveryProcessingMode() &&
+ locktag->locktag_type == LOCKTAG_OBJECT &&
+ lockmode > AccessShareLock)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot acquire lockmode %s on database objects while recovery is in progress",
+ lockMethodTable->lockModeNames[lockmode]),
+ errhint("Only AccessShareLock can be acquired on database objects during recovery.")));
+

Obviously we can't lock records (SELECT FOR UPDATE) since that requires write
access. But shouldn't we be able to do manual LOCK TABLE calls?

I hope this isn't the only interlock we have against trying to do write i/o or
DDL against tables...?

5) The code for killing off conflicting transactions looks odd to me but I
haven't really traced through it to see what it's doing. It seems to kill off
just one process? What if there are several holding the lock in question?
Also, it doesn't seem to take into account how long the various transactions
have held the lock?

Is there a risk of, for example, if you have a long report running against a
table and lots of OLTP requests against the same table it seems the conflict
resolution code would fire randomly into the crowd and hit mostly innocent
OLTP transactions until eventually it found the culprit report?

Also, it kills off backends using SIGINT, which I *think* Tom went through and
made safe earlier this release, right?

In any case if the backend doesn't die off promptly we wait forever with no
further warnings or log messages. I would think we should at least print some
sort of message here, even if it's a "can't happen" elog. The doubling thing
is probably unnecessary too in this case.

+ if (!XLogRecPtrIsValid(conflict_LSN))
+ {
+ /* wait awhile for it to die */
+ pg_usleep(wontDieWait * 5000L);
+ wontDieWait *= 2;
+ }
+ }

6) I still don't understand why you need unobserved_xids. We don't need this
in normal running, an xid we don't know for certain is committed is exactly
the same as a transaction we know is currently running or aborted. So why do
you need it during HS?

The comment says:

+ * This is very important because if we leave
+ * those xids out of the snapshot then they will appear to be already complete.
+ * Later, when they have actually completed this could lead to confusion as to
+ * whether those xids are visible or not, blowing a huge hole in MVCC.
+ * We need 'em.

But that doesn't sound rational to me. I'm not sure what "confusion" this
would cause. If they later actually complete then any existing snapshots would
still not see them. And any later snapshots wouldn't be confused by the
earlier conclusions.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's 24x7 Postgres support!


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot Standby (v9d)
Date: 2009-01-28 19:27:27
Message-ID: 1233170847.2327.2510.camel@ebony.2ndQuadrant


On Wed, 2009-01-28 at 18:55 +0000, Gregory Stark wrote:

> I skimmed through the Hot Standby patch for a preliminary review. I noted the
> following things, some minor tweaks, some just questions. None of the things I
> noted are big issues unless some of the questions uncover issues.

Thanks for your attention.

> 1) This code is obviously a cut-pasto:
>
> + else if (strcmp(tok1, "max_standby_delay") == 0)
> + {
> + errno = 0;
> + maxStandbyDelay = (TransactionId) strtoul(tok2, NULL, 0);
> + if (errno == EINVAL || errno == ERANGE)
> + ereport(FATAL,
> + (errmsg("max_standby_delay is not a valid number: \"%s\"",
> + tok2)));
>
> Have you ever tested the behaviour with max_standby_delay = -1 ?

> Also, the default max_standby_delay is currently 0 -- ie, kill queries
> immediately upon detecting any conflicts at all -- which I don't really think
> anyone would be happy with.

Agreed. As explained when I published that patch it is deliberately
severe to allow testing of conflict resolution and feedback on it.

> I still *strongly* feel the default has to be the
> non-destructive conservative -1.

I don't. Primarily, we must support high availability. It is much better
if we get people saying "I get my queries cancelled" and we say RTFM and
change parameter X, than if people say "my failover was 12 hours behind
when I needed it to be 10 seconds behind and I lost a $1 million because
of downtime of Postgres" and we say RTFM and change parameter X.

> 2) This hard coded constant of 500ms seems arbitrary to me. If your use case
> is a heavily-used reporting engine you might get this message all the time. I
> think this either has to be configurable (warn_standby_delay?) or based on
> max_standby_delay or some other parameter somehow.
>
> + /*
> + * log that we have been waiting for a while now...
> + */
> + if (!logged && standbyWait_ms > 500)

Greg, that's a DEBUG5 message.

> 3) These two blocks of code seem unsatisfactory:
>
> ! /*
> ! * Keep track of the block number of the lastBlockVacuumed, so
> ! * we can scan those blocks as well during WAL replay. This then
> ! * provides concurrency protection and allows btrees to be used
> ! * while in recovery.
> ! */
> ! if (lastBlockVacuumed > vstate->lastBlockVacuumed)
> ! vstate->lastBlockVacuumed = lastBlockVacuumed;
> !
>
>
> + /*
> + * XXXHS we don't actually need to read the block, we
> + * just need to confirm it is unpinned. If we had a special call
> + * into the buffer manager we could optimise this so that
> + * if the block is not in shared_buffers we confirm it as unpinned.
> + */
> + buffer = XLogReadBufferExtended(xlrec->node, MAIN_FORKNUM, blkno, RBM_NORMAL);
> + if (BufferIsValid(buffer))
> + {
> + LockBufferForCleanup(buffer);
> + UnlockReleaseBuffer(buffer);
> + }
>
> Aside from the XXX comment (I thought we actually had such a call now, but if
> not shouldn't we just add one instead of carping?)

We should add many things. The equivalent optimisation of VACUUM in
normal running has not been done either. Considering we have both HOT
and visibility-map-enhanced VACUUM, it's not a priority.

> I'm not convinced this
> handles all the cases that can arise. Notable, what happens if a previous
> vacuum died in the middle of the scan?

Nothing, because it won't then have removed any heap rows.

> I think we have a vacuum id which we already use with btrees for the same
> issue. It seems to me this would be more robust if you stamped the xlog record with
> that id number and ensured it matched the id of the lastBlockVacuumed? Then
> you could start from 0 if the id changes.

> 4) Why is this necessary?
>
> + if (IsRecoveryProcessingMode() &&
> + locktag->locktag_type == LOCKTAG_OBJECT &&
> + lockmode > AccessShareLock)
> + ereport(ERROR,
> + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> + errmsg("cannot acquire lockmode %s on database objects while recovery is in progress",
> + lockMethodTable->lockModeNames[lockmode]),
> + errhint("Only AccessShareLock can be acquired on database objects during recovery.")));
> +
>
> Obviously we can't lock records (SELECT FOR UPDATE) since that requires write
> access. But shouldn't we be able to do manual LOCK TABLE calls?

Read the rest of the comments on the locking code section.

> I hope this isn't the only interlock we have against trying to do write i/o or
> DDL against tables...?

No it's not.

> 5) The code for killing off conflicting transactions looks odd to me but I
> haven't really traced through it to see what it's doing. It seems to kill off
> just one process?

No, why do you think that?

> What if there are several holding the lock in question?

Covered.

> Also, it doesn't seem to take into account how long the various transactions
> have held the lock?

True. Why would that matter?

> Is there a risk of, for example, if you have a long report running against a
> table and lots of OLTP requests against the same table

> it seems the conflict resolution code would fire randomly into the
> crowd

OK, that's where I stop bothering to respond.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support


From: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Gregory Stark <stark(at)enterprisedb(dot)com>, Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot Standby (v9d)
Date: 2009-01-28 19:33:47
Message-ID: 1233171227.10539.13.camel@jd-laptop.pragmaticzealot.org

On Wed, 2009-01-28 at 19:27 +0000, Simon Riggs wrote:
> On Wed, 2009-01-28 at 18:55 +0000, Gregory Stark wrote:

> Agreed. As explained when I published that patch it is deliberately
> severe to allow testing of conflict resolution and feedback on it.
>
> > I still *strongly* feel the default has to be the
> > non-destructive conservative -1.
>
> I don't. Primarily, we must support high availability. It is much better
> if we get people saying "I get my queries cancelled" and we say RTFM and
> change parameter X, than if people say "my failover was 12 hours behind
> when I needed it to be 10 seconds behind and I lost a $1 million because
> of downtime of Postgres" and we say RTFM and change parameter X.

If the person was stupid enough to configure it for such a thing they
deserve to lose the money. Not to mention we have already lost them
as a user, because they will blame PostgreSQL regardless of reality, as
evidenced by their inability to RTFM (or to have a vendor that RTFMs) in
the first place.

I got to vote with Greg on this one.

Sincerely,

Joshua D. Drake

--
PostgreSQL - XMPP: jdrake(at)jabber(dot)postgresql(dot)org
Consulting, Development, Support, Training
503-667-4564 - http://www.commandprompt.com/
The PostgreSQL Company, serving since 1997


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: Simon Riggs <simon(at)2ndQuadrant(dot)com>, Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot Standby (v9d)
Date: 2009-01-28 19:41:19
Message-ID: 4980B4DF.30604@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Gregory Stark wrote:
> 6) I still don't understand why you need unobserved_xids. We don't need this
> in normal running, an xid we don't know for certain is committed is exactly
> the same as a transaction we know is currently running or aborted. So why do
> you need it during HS?

In normal operation, any transaction that's in progress has an entry in
ProcArray. GetSnapshot() gathers the xids of all those in-progress
transactions, so that they're seen as not-committed even when they're
later marked as committed in clog.

In HS, we might see the first WAL record of transaction 10 before we see
the first WAL record of transaction 9. Without unobserved_xids, if you
take a snapshot in the standby between those two WAL records, xid 10 is
included in the snapshot, but 9 is not. If xact 9 later commits, it's
marked in clog as committed, and it will suddenly appear as visible to
the snapshot. To avoid that, when we replay the first WAL record of xact
10, we also add 9 to the unobserved xid array, so that it's included in
snapshots.

So, you can think of the unobserved xids array as an extension of
ProcArray. The entries are like light-weight PGPROC entries. In fact I
proposed earlier to simply create dummy PGPROC entries instead.
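As a toy illustration of the gap-tracking described above (a Python sketch under stated assumptions, not the actual C implementation; all names here are hypothetical):

```python
class StandbyXidTracker:
    """Toy model of standby xid tracking: xids whose WAL records we
    have seen vs. 'unobserved' gap xids that may still commit later."""

    def __init__(self):
        self.running = set()      # xids with a WAL record seen, not yet ended
        self.unobserved = set()   # gap xids: no WAL seen yet, may still commit
        self.next_expected = 1    # lowest xid not yet accounted for

    def replay_first_record(self, xid):
        # Seeing xid 10 before any record of xid 9 means 9 falls into
        # the gap: it must be treated as in-progress until proven otherwise.
        for gap in range(self.next_expected, xid):
            self.unobserved.add(gap)
        self.unobserved.discard(xid)   # a gap xid observed late moves over
        self.running.add(xid)
        self.next_expected = max(self.next_expected, xid + 1)

    def replay_end(self, xid):
        # Commit or abort record replayed: xid is no longer in progress.
        self.running.discard(xid)
        self.unobserved.discard(xid)

    def snapshot_xip(self):
        # The 'in progress' set for a standby snapshot includes both
        # groups, so a late-committing gap xid stays invisible to
        # snapshots taken before its commit was replayed.
        return self.running | self.unobserved
```

Replaying the first record of xid 10 puts xids 1 through 9 into the unobserved set, so a snapshot taken at that point treats all of 1 through 10 as in progress.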

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot Standby (v9d)
Date: 2009-01-28 19:49:17
Message-ID: 1233172157.2327.2521.camel@ebony.2ndQuadrant
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


On Wed, 2009-01-28 at 18:55 +0000, Gregory Stark wrote:
> I still don't understand why you need unobserved_xids. We don't need
> this in normal running, an xid we don't know for certain is committed
> is exactly the same as a transaction we know is currently running or
> aborted. So why do you need it during HS?

All running transactions need to be part of the snapshot. This is true
on both standby and primary.

On primary we allocate new xids from a single counter, so there are no
gaps. Newly assigned xids go straight into the procarray and then are
removed later when they commit/abort.

In standby we only know what is in WAL. Xid assignment is not currently
WAL logged, so either we choose to WAL log each new xid and face the
performance consequences (IMHO, unacceptable), or we choose a different
strategy. UnobservedXids is that different strategy. Florian Pflug had a
different strategy, but he did have a strategy.

> The comment says:
>
> + * This is very important because if we leave
> + * those xids out of the snapshot then they will appear to be
> already complete.
> + * Later, when they have actually completed this could lead to
> confusion as to
> + * whether those xids are visible or not, blowing a huge hole in
> MVCC.
> + * We need 'em.
>
> But that doesn't sound rational to me. I'm not sure what "confusion"
> this would cause. If they later actually complete then any existing
> snapshots would still not see them. And any later snapshots wouldn't
> be confused by the earlier conclusions.

If an xact is running when we take a snapshot, yet we do not include it,
then bad things will happen, but not until later. If an xact is not in
the snapshot *and* is less than xmax, then we presume it completed prior
to the snapshot. If that xact subsequently commits *after* we took the
snapshot, its absence from the snapshot will make its data visible,
making it look like it committed *before* the snapshot was taken. So the
"confusion" results in an MVCC violation, and we must handle this case
correctly, or die.
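The visibility rule involved can be sketched in a few lines (a simplified Python model, not PostgreSQL's actual tqual.c code, which also handles xmin, subtransactions, hint bits, and more):

```python
def xid_visible(xid, snapshot_xmax, xip, committed):
    """Simplified MVCC visibility test for the inserting xid of a tuple.

    xip is the set of xids considered in-progress when the snapshot was
    taken; committed(xid) reports the xid's eventual clog status.
    """
    if xid >= snapshot_xmax:
        return False        # started after the snapshot: never visible
    if xid in xip:
        return False        # in progress at snapshot time: not visible
    return committed(xid)   # presumed finished before snapshot: clog decides

# Leaving a running xid out of xip is exactly the "confusion" above:
# once it commits, it looks as if it committed *before* the snapshot.
```

For example, with xid 9 running but omitted from `xip`, `xid_visible(9, 11, {10}, lambda x: x == 9)` wrongly returns True once 9 commits, while including it (`xip = {9, 10}`) correctly returns False.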

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: jd(at)commandprompt(dot)com
Cc: Simon Riggs <simon(at)2ndQuadrant(dot)com>, Gregory Stark <stark(at)enterprisedb(dot)com>, Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot Standby (v9d)
Date: 2009-01-28 19:56:18
Message-ID: 21473.1233172578@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"Joshua D. Drake" <jd(at)commandprompt(dot)com> writes:
> On Wed, 2009-01-28 at 19:27 +0000, Simon Riggs wrote:
>> On Wed, 2009-01-28 at 18:55 +0000, Gregory Stark wrote:
>>> I still *strongly* feel the default has to be the
>>> non-destructive conservative -1.
>>
>> I don't. Primarily, we must support high availability. It is much better
>> if we get people saying "I get my queries cancelled" and we say RTFM and
>> change parameter X, than if people say "my failover was 12 hours behind
>> when I needed it to be 10 seconds behind and I lost a $1 million because
>> of downtime of Postgres" and we say RTFM and change parameter X.

> If the person was stupid enough to configure it for such a thing they
> deserve to lose the money.

Well, those unexpectedly cancelled queries could have represented
critical functionality too. I think this argument calls the entire
approach into question. If there is no safe setting for the parameter
then we need to find a way to not have the parameter.

regards, tom lane


From: Greg Stark <greg(dot)stark(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "jd(at)commandprompt(dot)com" <jd(at)commandprompt(dot)com>, Simon Riggs <simon(at)2ndQuadrant(dot)com>, Gregory Stark <stark(at)enterprisedb(dot)com>, Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot Standby (v9d)
Date: 2009-01-28 20:03:46
Message-ID: 1928ABCB-023B-44F5-84EA-56EE1189CB3A@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

(sorry for top posting -- blame apple)

I don't see anything "dangerous" with either setting. For use cases
where the backup is the primary purpose, killing queries is fine.
For use cases where the machine is a reporting machine, accumulating
large amounts of archived logs is fine.

Engineering is about tradeoffs and these two use cases are
intrinsically in conflict.

The alternative is postponing vacuuming on the master, which is IMHO a
cure even worse than the disease.

--
Greg

On 28 Jan 2009, at 19:56, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> "Joshua D. Drake" <jd(at)commandprompt(dot)com> writes:
>> On Wed, 2009-01-28 at 19:27 +0000, Simon Riggs wrote:
>>> On Wed, 2009-01-28 at 18:55 +0000, Gregory Stark wrote:
>>>> I still *strongly* feel the default has to be the
>>>> non-destructive conservative -1.
>>>
>>> I don't. Primarily, we must support high availability. It is much
>>> better
>>> if we get people saying "I get my queries cancelled" and we say
>>> RTFM and
>>> change parameter X, than if people say "my failover was 12 hours
>>> behind
>>> when I needed it to be 10 seconds behind and I lost a $1 million
>>> because
>>> of downtime of Postgres" and we say RTFM and change parameter X.
>
>> If the person was stupid enough to configure it for such a thing
>> they
>> deserve to lose the money.
>
> Well, those unexpectedly cancelled queries could have represented
> critical functionality too. I think this argument calls the entire
> approach into question. If there is no safe setting for the parameter
> then we need to find a way to not have the parameter.
>
> regards, tom lane


From: Aidan Van Dyk <aidan(at)highrise(dot)ca>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: jd(at)commandprompt(dot)com, Simon Riggs <simon(at)2ndQuadrant(dot)com>, Gregory Stark <stark(at)enterprisedb(dot)com>, Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot Standby (v9d)
Date: 2009-01-28 20:05:05
Message-ID: 20090128200505.GR12094@yugib.highrise.ca
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

* Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> [090128 15:02]:

> Well, those unexpectedly cancelled queries could have represented
> critical functionality too. I think this argument calls the entire
> approach into question. If there is no safe setting for the parameter
> then we need to find a way to not have the parameter.

That's what we currently have without HS, but people aren't completely
satisfied with it either ;-)

a.

--
Aidan Van Dyk Create like a god,
aidan(at)highrise(dot)ca command like a king,
http://www.highrise.ca/ work like a slave.


From: Greg Stark <greg(dot)stark(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "jd(at)commandprompt(dot)com" <jd(at)commandprompt(dot)com>, Simon Riggs <simon(at)2ndQuadrant(dot)com>, Gregory Stark <stark(at)enterprisedb(dot)com>, Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot Standby (v9d)
Date: 2009-01-28 20:07:01
Message-ID: 5C2E8A03-66B2-45CE-A38E-D9CC7A520C5F@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Put another way: your characterization is no more true than claiming
there's no "safe" setting for statement_timeout since a large value
means clog could overflow your disk and your tables could bloat.

(I note we default statement_timeout to off though)

--
Greg

On 28 Jan 2009, at 19:56, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> "Joshua D. Drake" <jd(at)commandprompt(dot)com> writes:
>> On Wed, 2009-01-28 at 19:27 +0000, Simon Riggs wrote:
>>> On Wed, 2009-01-28 at 18:55 +0000, Gregory Stark wrote:
>>>> I still *strongly* feel the default has to be the
>>>> non-destructive conservative -1.
>>>
>>> I don't. Primarily, we must support high availability. It is much
>>> better
>>> if we get people saying "I get my queries cancelled" and we say
>>> RTFM and
>>> change parameter X, than if people say "my failover was 12 hours
>>> behind
>>> when I needed it to be 10 seconds behind and I lost a $1 million
>>> because
>>> of downtime of Postgres" and we say RTFM and change parameter X.
>
>> If the person was stupid enough to configure it for such a thing
>> they
>> deserve to lose the money.
>
> Well, those unexpectedly cancelled queries could have represented
> critical functionality too. I think this argument calls the entire
> approach into question. If there is no safe setting for the parameter
> then we need to find a way to not have the parameter.
>
> regards, tom lane


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Gregory Stark <stark(at)enterprisedb(dot)com>, Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot Standby (v9d)
Date: 2009-01-28 20:13:21
Message-ID: 1233173601.2327.2531.camel@ebony.2ndQuadrant
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


On Wed, 2009-01-28 at 21:41 +0200, Heikki Linnakangas wrote:

> So, you can think of the unobserved xids array as an extension of
> ProcArray. The entries are like light-weight PGPROC entries. In fact I
> proposed earlier to simply create dummy PGPROC entries instead.

Which we don't do because we don't know whether we are dealing with
top-level xids or subtransactions of already observed top-level xids.

Either way we have to rearrange things when we move from unobserved to
observed. A major difference is that what we have now works and what we
might have instead may not, which is being illustrated by recent
testing.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support


From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Gregory Stark <stark(at)enterprisedb(dot)com>, Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot Standby (v9d)
Date: 2009-01-28 20:16:15
Message-ID: 1233173775.5247.18.camel@dell.linuxdev.us.dell.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, 2009-01-28 at 19:27 +0000, Simon Riggs wrote:
> "my failover was 12 hours behind when I needed it to be 10 seconds
> behind and I lost a $1 million because of downtime of Postgres"

The same could be said for warm standby right now. Or Slony-I, for that
matter. I think that we can reasonably expect anyone implementing
asynchronous replication for HA to properly monitor the lag time.

There are many sources of latency in the process, so I don't think
anyone can expect 10 seconds without actually monitoring to verify what
the actual lag time is.

I apologize if my post is based on ignorance; I haven't followed your
patch as closely as others involved in this discussion.

Regards,
Jeff Davis


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: jd(at)commandprompt(dot)com, Simon Riggs <simon(at)2ndQuadrant(dot)com>, Gregory Stark <stark(at)enterprisedb(dot)com>, Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot Standby (v9d)
Date: 2009-01-28 20:19:21
Message-ID: 4980BDC9.3000405@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
> "Joshua D. Drake" <jd(at)commandprompt(dot)com> writes:
>> On Wed, 2009-01-28 at 19:27 +0000, Simon Riggs wrote:
>>> On Wed, 2009-01-28 at 18:55 +0000, Gregory Stark wrote:
>>>> I still *strongly* feel the default has to be the
>>>> non-destructive conservative -1.
>>> I don't. Primarily, we must support high availability. It is much better
>>> if we get people saying "I get my queries cancelled" and we say RTFM and
>>> change parameter X, than if people say "my failover was 12 hours behind
>>> when I needed it to be 10 seconds behind and I lost a $1 million because
>>> of downtime of Postgres" and we say RTFM and change parameter X.
>
>> If the person was stupid enough to configure it for such a thing they
>> deserve to lose the money.
>
> Well, those unexpectedly cancelled queries could have represented
> critical functionality too. I think this argument calls the entire
> approach into question. If there is no safe setting for the parameter
> then we need to find a way to not have the parameter.

We've gone through that already. Different ideas were hashed out around
September. There are four basic feasible approaches to what to do when an
incoming WAL record conflicts with a running read-only query:

1. Kill the query. (max_standby_delay=0)
2. Wait for the query to finish before continuing (max_standby_delay=-1)
3. Have a feedback loop from standby to master, feeding an OldestXmin to
the master, preventing it from removing tuples that are still needed in
the standby.
4. Allow the query to continue, knowing that it will return wrong results.

I don't consider 4 to be an option. Option 3 has its own set of
drawbacks, as a standby can then cause bloat in the master, and in any
case we're not going to have it in this release. And then there's some
middle ground, like wait a while and then kill the query
(max_standby_delay > 0).

I don't see any way around the fact that when a tuple is removed, it's
gone and can't be accessed by queries. Either you don't remove it, or
you kill the query.

I think the max_standby_delay setting is fairly easy to explain. It
shouldn't be too hard for a DBA to set it correctly.
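The three settings can be modelled roughly like this (a Python sketch with hypothetical helper names; the real patch implements this in C inside the startup process):

```python
import time

def resolve_conflict(conflicting_queries, cancel, max_standby_delay):
    """Sketch of standby conflict resolution for one WAL record.

    max_standby_delay = -1 : wait for conflicting queries indefinitely
                        0  : cancel them immediately
                        N>0: wait up to N seconds, then cancel
    conflicting_queries() and cancel() are hypothetical helpers.
    """
    if max_standby_delay < 0:
        while conflicting_queries():
            time.sleep(0.1)    # replay stalls until the queries finish
        return
    deadline = time.monotonic() + max_standby_delay
    while conflicting_queries() and time.monotonic() < deadline:
        time.sleep(0.1)        # middle ground: give queries a grace period
    for q in list(conflicting_queries()):
        cancel(q)              # grace period over: replay must proceed
```

With `max_standby_delay=0` the grace-period loop never runs and all conflicting queries are cancelled at once; with `-1` replay simply blocks.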

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: jd(at)commandprompt(dot)com
Cc: Gregory Stark <stark(at)enterprisedb(dot)com>, Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot Standby (v9d)
Date: 2009-01-28 20:27:48
Message-ID: 1233174468.2327.2547.camel@ebony.2ndQuadrant
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


On Wed, 2009-01-28 at 11:33 -0800, Joshua D. Drake wrote:
> On Wed, 2009-01-28 at 19:27 +0000, Simon Riggs wrote:
> > On Wed, 2009-01-28 at 18:55 +0000, Gregory Stark wrote:
>
> > Agreed. As explained when I published that patch it is deliberately
> > severe to allow testing of conflict resolution and feedback on it.
> >
> > > I still *strongly* feel the default has to be the
> > > non-destructive conservative -1.
> >
> > I don't. Primarily, we must support high availability. It is much better
> > if we get people saying "I get my queries cancelled" and we say RTFM and
> > change parameter X, than if people say "my failover was 12 hours behind
> > when I needed it to be 10 seconds behind and I lost a $1 million because
> > of downtime of Postgres" and we say RTFM and change parameter X.
>
> If the person was stupid enough to configure it for such a thing they
> deserve to lose the money.

It was never intended to be 0, that was just for testing, as I said. But
a smallish integer number of seconds, say 10, 60, 300 or at most 600 is
reasonable.

Crash barriers can be moved; falling off a cliff is permanent. It's easy
to change the parameter if you don't like it, but too late to change it
if we set the default wrong.

My name is on the patch and I take responsibility for such failures. I'm
not going to turn round and say "but Josh said", kinda like Stan Laurel,
if it fails. I've never called any user stupid and never will, not while
they use Postgres, at least... If we get the default wrong then it's down
to us.

Never cancelling queries is definitely a wrong default choice for an HA
server.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: jd(at)commandprompt(dot)com, Gregory Stark <stark(at)enterprisedb(dot)com>, Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot Standby (v9d)
Date: 2009-01-28 20:39:22
Message-ID: 1233175162.2327.2557.camel@ebony.2ndQuadrant
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


On Wed, 2009-01-28 at 14:56 -0500, Tom Lane wrote:

> Well, those unexpectedly cancelled queries could have represented
> critical functionality too. I think this argument calls the entire
> approach into question. If there is no safe setting for the parameter
> then we need to find a way to not have the parameter.

I see the opposite: we don't know what tradeoffs, if any, the user is
willing to put up with, so we need input. It is about resource
prioritisation and is not for us to decide, since these settings reflect
business objectives, not internal twangy things like join_collapse_limit.

The essential choice is "What would you like the max failover time to
be?". Some users want one server at most 5 minutes behind; some want two
servers, one 0 seconds behind and one 12 hours behind.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, jd(at)commandprompt(dot)com, Gregory Stark <stark(at)enterprisedb(dot)com>, Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot Standby (v9d)
Date: 2009-01-28 20:47:25
Message-ID: 4980C45D.3070009@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Simon Riggs wrote:
> The essential choice is "What would you like the max failover time to
> be?". Some users want one server with max 5 mins behind, some want two
> servers, one with 0 seconds behind, one with 12 hours behind

It's not quite that simple. Setting max_standby_delay=5mins means that
you're willing to wait 5 minutes for each query to die. Which means that
in the worst case you have to stop for 5 minutes at every single vacuum
record, and fall behind by much more than 5 minutes.

You could make it behave more like a true maximum-lag bound by tracking
the timestamps in commit records, and/or having some sort of
moving-average logic in the timeout, where you allow more waiting if you
haven't waited for a long time, and kill queries more aggressively if
you've had to wait a lot recently.
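One way to picture that moving-average idea is as a waiting budget that refills while replay keeps up (a hypothetical Python sketch, not anything in the patch):

```python
class ReplayDelayBudget:
    """Hypothetical smoothing of max_standby_delay: the standby earns
    waiting allowance while replay keeps up and spends it on conflicts,
    so repeated conflicts can't each stall replay for the full delay."""

    def __init__(self, max_budget_secs, refill_per_sec):
        self.max_budget = float(max_budget_secs)
        self.refill_per_sec = float(refill_per_sec)
        self.budget = self.max_budget

    def caught_up(self, seconds):
        # Replay ran without conflicts for `seconds`: earn allowance back.
        self.budget = min(self.max_budget,
                          self.budget + seconds * self.refill_per_sec)

    def allowed_wait(self):
        # How long we may block replay on the next conflicting query.
        return self.budget

    def waited(self, seconds):
        # Replay was blocked for `seconds`: spend that much allowance.
        self.budget = max(0.0, self.budget - seconds)
```

After a long conflict-free stretch the standby may wait the full delay again; after a burst of conflicts it cancels queries more aggressively.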

It should also be noted that the control functions allow you to connect
to the database and manually pause/resume the replay. So you can for
example set max_standby_delay=0 during the day, but pause the replay
manually before starting a nightly report.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Simon Riggs <simon(at)2ndQuadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, jd(at)commandprompt(dot)com, Gregory Stark <stark(at)enterprisedb(dot)com>, Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot Standby (v9d)
Date: 2009-01-28 20:55:54
Message-ID: 1233176154.5692.15.camel@dell.linuxdev.us.dell.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, 2009-01-28 at 22:47 +0200, Heikki Linnakangas wrote:
> It's not quite that simple. Setting max_standby_delay=5mins means that
> you're willing to wait 5 minutes for each query to die. Which means that
> in worst case you have to stop for 5 minutes at every single vacuum
> record, and fall behind much more than 5 minutes.

Just trying to follow along: are you talking about the situation where
there is (for example) a continuous stream of "select pg_sleep(600)"
queries on the standby, and a series of rapid VACUUMs on the primary?

This situation might be more likely now that we have partial VACUUMs.

> It should also be noted that the control functions allow you to connect
> to the database and manually pause/resume the replay. So you can for
> example set max_standby_delay=0 during the day, but pause the replay
> manually before starting a nightly report.
>

That's a very cool feature.

Regards,
Jeff Davis


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, jd(at)commandprompt(dot)com, Gregory Stark <stark(at)enterprisedb(dot)com>, Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot Standby (v9d)
Date: 2009-01-28 20:56:57
Message-ID: 1233176217.2327.2564.camel@ebony.2ndQuadrant
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


On Wed, 2009-01-28 at 22:47 +0200, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > The essential choice is "What would you like the max failover time to
> > be?". Some users want one server with max 5 mins behind, some want two
> > servers, one with 0 seconds behind, one with 12 hours behind
>
> It's not quite that simple.

In this case, yes it is.

> Setting max_standby_delay=5mins means that
> you're willing to wait 5 minutes for each query to die. Which means that
> in worst case you have to stop for 5 minutes at every single vacuum
> record, and fall behind much more than 5 minutes.

That's not how this patch works.

> You could make it more like that by tracking the timestamps in commit
> records

Which is what we do.

> It should also be noted that the control functions allow you to connect
> to the database and manually pause/resume the replay. So you can for
> example set max_standby_delay=0 during the day, but pause the replay
> manually before starting a nightly report.

Yes, thank you for bringing balance to the discussions.

Please everybody read this before commenting further.

http://wiki.postgresql.org/wiki/Hot_Standby#Usage

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: jd(at)commandprompt(dot)com
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Gregory Stark <stark(at)enterprisedb(dot)com>, Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot Standby (v9d)
Date: 2009-01-28 20:58:22
Message-ID: 603c8f070901281258k5b9b29cas5d38f2a245d20968@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

>> I don't. Primarily, we must support high availability. It is much better
>> if we get people saying "I get my queries cancelled" and we say RTFM and
>> change parameter X, than if people say "my failover was 12 hours behind
>> when I needed it to be 10 seconds behind and I lost a $1 million because
>> of downtime of Postgres" and we say RTFM and change parameter X.
>
> If the person was stupid enough to configure it for such a thing they
> deserve to lose the money. Not to mention we have already lost them
> as a user because they will blame postgresql regardless of reality as
> evidenced by their inability to RTFM (or have a vendor that RTFMs) in
> the first place.
>
> I got to vote with Greg on this one.

I vote with Simon. The thing is that if you get some queries
cancelled, you'll realize you have a problem. Now you have several
options for how to fix it. Having your failover be 12 hours
behind (or 12 months behind) is something it would be much easier
not to realize.

...Robert


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, jd(at)commandprompt(dot)com, Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot Standby (v9d)
Date: 2009-01-28 21:11:43
Message-ID: 87d4e7jfu8.fsf@oxford.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:

> On Wed, 2009-01-28 at 14:56 -0500, Tom Lane wrote:
>
>> Well, those unexpectedly cancelled queries could have represented
>> critical functionality too. I think this argument calls the entire
>> approach into question. If there is no safe setting for the parameter
>> then we need to find a way to not have the parameter.
>
> I see the opposite: We don't know what tradeoffs, if any, the user is
> willing to put up with, so we need input.

Well, if you see it that way then it seems to me you should be arguing for
making max_standby_delay a mandatory parameter. Without it set, don't
start allowing connections. I hadn't considered that and am not exactly
sure where I would stand on it.

> The essential choice is "What would you like the max failover time to
> be?". Some users want one server with max 5 mins behind, some want two
> servers, one with 0 seconds behind, one with 12 hours behind

Sure. But if they don't configure one then we shouldn't impose one. You're
thinking of precisely one use case and taking positive action to interrupt the
user's requests on the basis of it. But there are plenty of other use cases. I
claim the default has to be to do as the user instructed without intervention.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's PostGIS support!


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: jd(at)commandprompt(dot)com, Simon Riggs <simon(at)2ndquadrant(dot)com>, Gregory Stark <stark(at)enterprisedb(dot)com>, Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot Standby (v9d)
Date: 2009-01-28 21:12:28
Message-ID: 23375.1233177148@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> I vote with Simon. The thing is that if you get some queries
> cancelled, you'll realize you have a problem.

... or if you don't, they couldn't have been all that critical.

> Having your failover be 12 hours
> behind (or 12 months behind) is something that it would be much easier
> to not realize.

Okay, I'm sold, positive max_standby_delay should be the default.

regards, tom lane


From: Hannu Krosing <hannu(at)krosing(dot)net>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, jd(at)commandprompt(dot)com, Simon Riggs <simon(at)2ndQuadrant(dot)com>, Gregory Stark <stark(at)enterprisedb(dot)com>, Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot Standby (v9d)
Date: 2009-02-03 13:26:06
Message-ID: 1233667566.7999.7.camel@huvostro
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, 2009-01-28 at 22:19 +0200, Heikki Linnakangas wrote:
> Tom Lane wrote:
...
> > Well, those unexpectedly cancelled queries could have represented
> > critical functionality too. I think this argument calls the entire
> > approach into question. If there is no safe setting for the parameter
> > then we need to find a way to not have the parameter.
>
> We've gone through that already. Different ideas were hashed out around
> September. There are four basic feasible approaches to what to do when an
> incoming WAL record conflicts with a running read-only query:
>
> 1. Kill the query. (max_standby_delay=0)
> 2. Wait for the query to finish before continuing (max_standby_delay=-1)
> 3. Have a feedback loop from standby to master, feeding an OldestXmin to
> the master, preventing it from removing tuples that are still needed in
> the standby.
> 4. Allow the query to continue, knowing that it will return wrong results.
>
> I don't consider 4 to be an option. Option 3 has its own set of
> drawbacks, as a standby can then cause bloat in the master, and in any
> case we're not going to have it in this release. And then there's some
> middle ground, like wait a while and then kill the query
> (max_standby_delay > 0).
>
> I don't see any way around the fact that when a tuple is removed, it's
> gone and can't be accessed by queries. Either you don't remove it, or
> you kill the query.

Actually we came up with a solution to this - use filesystem level
snapshots (like LVM2+XFS or ZFS), and redirect backends with
long-running queries to use fs snapshot mounted to a different
mountpoint.

I don't think Simon has yet put full support for it in code, but it is
clearly _the_ solution for those who want to eat the cake and have it
too.

>
> I think the max_standby_delay setting is fairly easy to explain. It
> shouldn't be too hard for a DBA to set it correctly.
>
> --
> Heikki Linnakangas
> EnterpriseDB http://www.enterprisedb.com
>


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Hannu Krosing <hannu(at)krosing(dot)net>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, jd(at)commandprompt(dot)com, Simon Riggs <simon(at)2ndQuadrant(dot)com>, Gregory Stark <stark(at)enterprisedb(dot)com>, Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot Standby (v9d)
Date: 2009-02-03 13:40:56
Message-ID: 49884968.4050900@dunslane.net
Lists: pgsql-hackers

Hannu Krosing wrote:
> Actually we came up with a solution to this - use filesystem level
> snapshots (like LVM2+XFS or ZFS), and redirect backends with
> long-running queries to use fs snapshot mounted to a different
> mountpoint.
>
> I don't think Simon has yet put full support for it in code, but it is
> clearly _the_ solution for those who want to eat the cake and have it
> too.
>
>

How does that work if you're using multiple file systems via tablespaces
(e.g. indexes in a different TS)?

cheers

andrew


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: Hannu Krosing <hannu(at)krosing(dot)net>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, jd(at)commandprompt(dot)com, Simon Riggs <simon(at)2ndQuadrant(dot)com>, Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot Standby (v9d)
Date: 2009-02-03 13:50:28
Message-ID: 87zlh3iq8r.fsf@oxford.xeocode.com
Lists: pgsql-hackers

Hannu Krosing <hannu(at)krosing(dot)net> writes:

> Actually we came up with a solution to this - use filesystem level
> snapshots (like LVM2+XFS or ZFS), and redirect backends with
> long-running queries to use fs snapshot mounted to a different
> mountpoint.

Uhm, how do you determine which snapshot to direct the backend to? There could
have been several generations of tuples in that tid since your query started.
Do you take a filesystem snapshot every time there's a vacuum-snapshot conflict
and record which fs snapshot goes with which query?

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL training!


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Hannu Krosing <hannu(at)krosing(dot)net>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, jd(at)commandprompt(dot)com, Simon Riggs <simon(at)2ndquadrant(dot)com>, Gregory Stark <stark(at)enterprisedb(dot)com>, Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot Standby (v9d)
Date: 2009-02-03 14:14:16
Message-ID: 603c8f070902030614s7ecdeb5eyf726c72fb858dc78@mail.gmail.com
Lists: pgsql-hackers

>> I don't see any way around the fact that when a tuple is removed, it's
>> gone and can't be accessed by queries. Either you don't remove it, or
>> you kill the query.
>
> Actually we came up with a solution to this - use filesystem level
> snapshots (like LVM2+XFS or ZFS), and redirect backends with
> long-running queries to use fs snapshot mounted to a different
> mountpoint.
>
> I don't think Simon has yet put full support for it in code, but it is
> clearly _the_ solution for those who want to eat the cake and have it
> too.

I think _the_ solution is to notice when you're about to vacuum a page
that is still visible to a running backend on the standby, and save
that page off to a separate cache of old page versions (perhaps using
the relation fork mechanism). I suspect filesystem-level snapshots
can be made to work, but it's never going to be a portable solution,
and I suspect you're going to need a lot of the same bookkeeping
anyway (like, when you have page X in shared buffers, which version of
page X is it, anyway?).

...Robert


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Hannu Krosing <hannu(at)krosing(dot)net>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, jd(at)commandprompt(dot)com, Gregory Stark <stark(at)enterprisedb(dot)com>, Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot Standby (v9d)
Date: 2009-02-03 14:28:02
Message-ID: 1233671282.4500.146.camel@ebony.2ndQuadrant
Lists: pgsql-hackers


On Tue, 2009-02-03 at 08:40 -0500, Andrew Dunstan wrote:
>
> Hannu Krosing wrote:
> > Actually we came up with a solution to this - use filesystem level
> > snapshots (like LVM2+XFS or ZFS), and redirect backends with
> > long-running queries to use fs snapshot mounted to a different
> > mountpoint.
> >
> > I don't think Simon has yet put full support for it in code, but it is
> > clearly _the_ solution for those who want to eat the cake and have it
> > too.

> How does that work if you're using multiple file systems via tablespaces
> (e.g. indexes in a different TS)?

It's a great idea and easy to do, but I can't do everything in one go.

The main reasons not to are multiple file system difficulties and lack
of a mainstream Linux solution, and more simply lack of time and test
resources.

So not now, maybe later.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support


From: Hannu Krosing <hannu(at)2ndQuadrant(dot)com>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, jd(at)commandprompt(dot)com, Simon Riggs <simon(at)2ndQuadrant(dot)com>, Gregory Stark <stark(at)enterprisedb(dot)com>, Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot Standby (v9d)
Date: 2009-02-03 14:33:29
Message-ID: 1233671609.7999.9.camel@huvostro
Lists: pgsql-hackers

On Tue, 2009-02-03 at 08:40 -0500, Andrew Dunstan wrote:
>
> Hannu Krosing wrote:
> > Actually we came up with a solution to this - use filesystem level
> > snapshots (like LVM2+XFS or ZFS), and redirect backends with
> > long-running queries to use fs snapshot mounted to a different
> > mountpoint.
> >
> > I don't think Simon has yet put full support for it in code, but it is
> > clearly _the_ solution for those who want to eat the cake and have it
> > too.
> >
> >
>
>
> How does that work if you're using multiple file systems via tablespaces
> (e.g. indexes in a different TS)?

Basically the same way we do WAL shipping up to 8.3.

That is, using external scripts. Once we have enough experience, we could
try to move the actual snapshot-mount-switch inside the core.

--
------------------------------------------
Hannu Krosing http://www.2ndQuadrant.com
PostgreSQL Scalability and Availability
Services, Consulting and Training


From: Hannu Krosing <hannu(at)2ndQuadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, jd(at)commandprompt(dot)com, Simon Riggs <simon(at)2ndquadrant(dot)com>, Gregory Stark <stark(at)enterprisedb(dot)com>, Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot Standby (v9d)
Date: 2009-02-03 14:35:35
Message-ID: 1233671735.7999.11.camel@huvostro
Lists: pgsql-hackers

On Tue, 2009-02-03 at 09:14 -0500, Robert Haas wrote:
> >> I don't see any way around the fact that when a tuple is removed, it's
> >> gone and can't be accessed by queries. Either you don't remove it, or
> >> you kill the query.
> >
> > Actually we came up with a solution to this - use filesystem level
> > snapshots (like LVM2+XFS or ZFS), and redirect backends with
> > long-running queries to use fs snapshot mounted to a different
> > mountpoint.
> >
> > I don't think Simon has yet put full support for it in code, but it is
> > clearly _the_ solution for those who want to eat the cake and have it
> > too.
>
> I think _the_ solution is to notice when you're about to vacuum a page
> that is still visible to a running backend on the standby, and save
> that page off to a separate cache of old page versions (perhaps using
> the relation fork mechanism). I suspect filesystem-level snapshots
> can be made to work, but it's never going to be a portable solution,
> and I suspect you're going to need a lot of the same bookkeeping
> anyway (like, when you have page X in shared buffers, which version of
> page X is it, anyway?).

The idea was to switch file pointers inside the backend needing old
versions (and then flush the cache if needed), so the only bookkeeping you
need is which fs snapshots to keep and which can be released.

--
------------------------------------------
Hannu Krosing http://www.2ndQuadrant.com
PostgreSQL Scalability and Availability
Services, Consulting and Training


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Hannu Krosing <hannu(at)krosing(dot)net>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, jd(at)commandprompt(dot)com, Gregory Stark <stark(at)enterprisedb(dot)com>, Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot Standby (v9d)
Date: 2009-02-03 14:40:26
Message-ID: 1233672026.4500.160.camel@ebony.2ndQuadrant
Lists: pgsql-hackers


On Tue, 2009-02-03 at 09:14 -0500, Robert Haas wrote:

> I think _the_ solution is to notice when you're about to vacuum a page
> that is still visible to a running backend on the standby, and save
> that page off to a separate cache of old page versions (perhaps using
> the relation fork mechanism).

I'll let you write that, for the next release...

The problem with all of this is we've been talking about it for 8 months
now and various opinions are held by people. What is being presented is
a broad set of options (summarised from Wiki)

1. Wait then cancel
a) always for databases, tablespaces and locks
b) deferred cancellation for buffer changes

2. We connect to Primary server from Standby server and keep a
transaction open using contrib/dblink functions, then commit as soon as
we are done.

3. We pause recovery by issuing a pg_recovery_pause() function, or start
server in pause mode using recovery_started_paused = on.

Yes, it's a critical area to the success of the feature. But this is
enough for first release and for us to get user feedback.
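
The steps in option (3) above would presumably be driven like this. Note that both names are proposals from this mail, not shipped interfaces, and the resume counterpart is my own hypothetical addition:

```sql
-- Proposed interface, not an existing one: halt WAL apply before starting
-- a long report on the standby, then let recovery continue afterwards.
SELECT pg_recovery_pause();

-- ... run the long-running read-only query ...

SELECT pg_recovery_resume();  -- hypothetical counterpart to the pause call
```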

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support


From: Hannu Krosing <hannu(at)2ndQuadrant(dot)com>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, jd(at)commandprompt(dot)com, Simon Riggs <simon(at)2ndQuadrant(dot)com>, Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot Standby (v9d)
Date: 2009-02-03 14:42:42
Message-ID: 1233672162.7999.18.camel@huvostro
Lists: pgsql-hackers

On Tue, 2009-02-03 at 13:50 +0000, Gregory Stark wrote:
> Hannu Krosing <hannu(at)krosing(dot)net> writes:
>
> > Actually we came up with a solution to this - use filesystem level
> > snapshots (like LVM2+XFS or ZFS), and redirect backends with
> > long-running queries to use fs snapshot mounted to a different
> > mountpoint.
>
> Uhm, how do you determine which snapshot to direct the backend to? There could
> have been several generations of tuples in that tid since your query started.
> Do you take a filesystem snapshot every time there's a vacuum-snapshot conflict
> and record which fs snapshot goes with which query?

The most sensible thing to do seems to be to wait for some configurable
period (say a few seconds or a few minutes), delaying WAL apply, then take
the snapshot, mount it, redirect all running transactions to use the
_current_ filesystem snapshot, and then resume WAL application.

As each of the transactions running on saved fs snapshots completes, it is
switched back to the main/live fs view.

When the last such transaction ends, the snapshot is unmounted and
released.
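
In shell terms, the snapshot-mount-switch cycle described above might look like this dry-run sketch. The volume group, LV names, snapshot size, and mountpoint are invented for illustration; `RUN="echo"` just prints the commands instead of executing them (clear it to run for real, which needs root and LVM2):

```shell
VG=vg0
LV=pgdata
SNAP=pgsnap
MNT=/mnt/pgsnap
RUN="echo"   # dry run: print commands instead of executing them

snapshot_create() {
    # After pausing WAL apply (outside this sketch), take a copy-on-write
    # snapshot of the live data volume and mount it read-only for the
    # backends running long queries.
    $RUN lvcreate --size 2G --snapshot --name "$SNAP" "/dev/$VG/$LV"
    $RUN mount -o ro "/dev/$VG/$SNAP" "$MNT"
}

snapshot_release() {
    # Called once the last backend using the old view has finished.
    $RUN umount "$MNT"
    $RUN lvremove -f "/dev/$VG/$SNAP"
}
```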

--
------------------------------------------
Hannu Krosing http://www.2ndQuadrant.com
PostgreSQL Scalability and Availability
Services, Consulting and Training


From: Andres Freund <andres(at)anarazel(dot)de>
To: Hannu Krosing <hannu(at)krosing(dot)net>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, jd(at)commandprompt(dot)com, Simon Riggs <simon(at)2ndQuadrant(dot)com>, Gregory Stark <stark(at)enterprisedb(dot)com>, Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot Standby (v9d)
Date: 2009-02-03 14:55:45
Message-ID: 49885AF1.6070209@anarazel.de
Lists: pgsql-hackers

Hi,

On 02/03/2009 02:26 PM, Hannu Krosing wrote:
>> I don't see any way around the fact that when a tuple is removed, it's
>> gone and can't be accessed by queries. Either you don't remove it, or
>> you kill the query.
> Actually we came up with a solution to this - use filesystem level
> snapshots (like LVM2+XFS or ZFS), and redirect backends with
> long-running queries to use fs snapshot mounted to a different
> mountpoint.
Isn't that really, really expensive?

A single write on the master logical volume yields a write of PE size
for _every_ single snapshot (the first time the block is touched).
Considering that there could be quite many such snapshots, I don't think
that this is really feasible - I/O quite possibly might be saturated.

The default PE size is 4MB - but on most bigger systems it is set to a
bigger size, so it's just getting worse for bigger systems.

Sure, one might say that this is an LVM deficiency - but I do not know
of any snapshot-able block layer that does it differently.

Andres


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Hannu Krosing <hannu(at)krosing(dot)net>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, jd(at)commandprompt(dot)com, Gregory Stark <stark(at)enterprisedb(dot)com>, Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot Standby (v9d)
Date: 2009-02-03 15:19:13
Message-ID: 603c8f070902030719y47f295feh68a0557444708320@mail.gmail.com
Lists: pgsql-hackers

On Tue, Feb 3, 2009 at 9:40 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> On Tue, 2009-02-03 at 09:14 -0500, Robert Haas wrote:
>> I think _the_ solution is to notice when you're about to vacuum a page
>> that is still visible to a running backend on the standby, and save
>> that page off to a separate cache of old page versions (perhaps using
>> the relation fork mechanism).
>
> I'll let you write that, for the next release...

LOL. How many sponsorship dollars are available for that project?

> The problem with all of this is we've been talking about it for 8 months
> now and various opinions are held by people. What is being presented is
> a broad set of options (summarised from Wiki)

I think everyone understands that these are things we might want to do
down the line, not things we need to have now. For this release, I
was under the impression that we'd pretty much settled on implementing
(1) and maybe (3) but not (2) from the below list.

> 1. Wait then cancel
> a) always for databases, tablespaces and locks
> b) deferred cancellation for buffer changes
>
> 2. We connect to Primary server from Standby server and keep a
> transaction open using contrib/dblink functions, then commit as soon as
> we are done.
>
> 3. We pause recovery by issuing a pg_recovery_pause() function, or start
> server in pause mode using recovery_started_paused = on.
>
> Yes, it's a critical area to the success of the feature. But this is
> enough for first release and for us to get user feedback.

I completely agree.

...Robert


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Hannu Krosing <hannu(at)krosing(dot)net>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, jd(at)commandprompt(dot)com, Gregory Stark <stark(at)enterprisedb(dot)com>, Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot Standby (v9d)
Date: 2009-02-03 15:29:14
Message-ID: 1233674954.4500.168.camel@ebony.2ndQuadrant
Lists: pgsql-hackers


On Tue, 2009-02-03 at 15:55 +0100, Andres Freund wrote:
> Hi,
>
> On 02/03/2009 02:26 PM, Hannu Krosing wrote:
> >> I don't see any way around the fact that when a tuple is removed, it's
> >> gone and can't be accessed by queries. Either you don't remove it, or
> >> you kill the query.
> > Actually we came up with a solution to this - use filesystem level
> > snapshots (like LVM2+XFS or ZFS), and redirect backends with
> > long-running queries to use fs snapshot mounted to a different
> > mountpoint.
> Isn't that really, really expensive?
>
> A single write on the master logical volume yields a write of PE size
> for _every_ single snapshot (the first time the block is touched).
> Considering that there could be quite many such snapshots, I don't think
> that this is really feasible - I/O quite possibly might be saturated.

If we did that we would provide an option to select the MVCC snapshot
that went with the filesystem snapshot. There need not be one per user.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support


From: Hannu Krosing <hannu(at)krosing(dot)net>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, jd(at)commandprompt(dot)com, Gregory Stark <stark(at)enterprisedb(dot)com>, Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot Standby (v9d)
Date: 2009-02-03 15:38:17
Message-ID: 1233675497.7999.20.camel@huvostro
Lists: pgsql-hackers

On Tue, 2009-02-03 at 10:19 -0500, Robert Haas wrote:
> On Tue, Feb 3, 2009 at 9:40 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> > On Tue, 2009-02-03 at 09:14 -0500, Robert Haas wrote:
> >> I think _the_ solution is to notice when you're about to vacuum a page
> >> that is still visible to a running backend on the standby, and save
> >> that page off to a separate cache of old page versions (perhaps using
> >> the relation fork mechanism).
> >
> > I'll let you write that, for the next release...
>
> LOL. How many sponsorship dollars are available for that project?
>
> > The problem with all of this is we've been talking about it for 8 months
> > now and various opinions are held by people. What is being presented is
> > a broad set of options (summarised from Wiki)
>
> I think everyone understands that these are things we might want to do
> down the line, not things we need to have now. For this release, I
> was under the impression that we'd pretty much settled on implementing
> (1) and maybe (3) but not (2) from the below list.

(2) is something that you can always do manually if you need it, so there
is no reason to support it in HS code explicitly.

Once you keep a trx open on the master, (1) and (3) should not happen
anymore until you close that trx.
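
Done by hand, option (2) amounts to holding a transaction open on the primary from the standby host, e.g. with contrib/dblink. The connection name and connection string below are placeholders:

```sql
-- On the standby: open a named connection to the primary and start a
-- transaction there (connection details are placeholders).
SELECT dblink_connect('primary', 'host=primary.example.com dbname=postgres');
SELECT dblink_exec('primary', 'BEGIN ISOLATION LEVEL SERIALIZABLE');
-- Take a snapshot inside that transaction so its xmin is pinned and VACUUM
-- on the primary cannot remove tuple versions our standby query may need.
SELECT * FROM dblink('primary', 'SELECT 1') AS t(ok int);

-- ... run the long-running read-only query locally on the standby ...

-- Release the primary-side transaction as soon as we are done.
SELECT dblink_exec('primary', 'COMMIT');
SELECT dblink_disconnect('primary');
```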

> > 1. Wait then cancel
> > a) always for databases, tablespaces and locks
> > b) deferred cancellation for buffer changes
> >
> > 2. We connect to Primary server from Standby server and keep a
> > transaction open using contrib/dblink functions, then commit as soon as
> > we are done.
> >
> > 3. We pause recovery by issuing a pg_recovery_pause() function, or start
> > server in pause mode using recovery_started_paused = on.
> >
> > Yes, it's a critical area to the success of the feature. But this is
> > enough for first release and for us to get user feedback.
>
> I completely agree.
>
> ...Robert
>


From: Hannu Krosing <hannu(at)krosing(dot)net>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, jd(at)commandprompt(dot)com, Gregory Stark <stark(at)enterprisedb(dot)com>, Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot Standby (v9d)
Date: 2009-02-03 16:09:27
Message-ID: 1233677367.7999.33.camel@huvostro
Lists: pgsql-hackers

On Tue, 2009-02-03 at 14:28 +0000, Simon Riggs wrote:
> On Tue, 2009-02-03 at 08:40 -0500, Andrew Dunstan wrote:
> >
> > Hannu Krosing wrote:
> > > Actually we came up with a solution to this - use filesystem level
> > > snapshots (like LVM2+XFS or ZFS), and redirect backends with
> > > long-running queries to use fs snapshot mounted to a different
> > > mountpoint.
> > >
> > > I don't think Simon has yet put full support for it in code, but it is
> > > clearly _the_ solution for those who want to eat the cake and have it
> > > too.
>
> > How does that work if you're using multiple file systems via tablespaces
> > (e.g. indexes in a different TS)?
>
> It's a great idea and easy to do, but I can't do everything in one go.
>
> The main reasons not to are multiple file system difficulties and lack
> of a mainstream Linux solution, and more simply lack of time and test
> resources.

A more general, but also a lot harder, solution would be going back to the
roots and implementing what the original Postgres 4.2 and earlier versions
were meant to do - namely, VACUUM was not meant to just discard older
versions, but rather to move them to WORM storage (write once, read many
was all the rage back then :) .

If we did that in a way that each relation, at least on the warm standby,
has its own "archive" fork, possibly in a separate tablespace for cheaper
storage, then we could basically apply WALs as fast as we want and just
move the old versions to the "archive". It will be slower(ish), especially
for HOT updates, but may be a good solution for lots of use cases.

And the decision to do the archiving on the master and WAL-copy to the
slave, or just do it on the slave only, could probably be left to the user.

Reintroducing keeping old tuples "forever" would also allow us to bring
back the time travel feature, that is

SELECT .... AS OF 'yesterday afternoon'::timestamp;

which was thrown out at the time we got WAL-logging.

To be really useful we should also have some way to know trx timestamps,
but that can be easily done using the ticker feature from Slony or
SkyTools/pgQ, which could be run as a separate server thread similar to
what we do with the background writer, autovacuum etc. now.

>
> So not now, maybe later.
>


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Hannu Krosing <hannu(at)krosing(dot)net>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, jd(at)commandprompt(dot)com, Gregory Stark <stark(at)enterprisedb(dot)com>, Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot Standby (v9d)
Date: 2009-02-03 17:18:45
Message-ID: 1233681525.4500.185.camel@ebony.2ndQuadrant
Lists: pgsql-hackers


On Tue, 2009-02-03 at 18:09 +0200, Hannu Krosing wrote:
> On Tue, 2009-02-03 at 14:28 +0000, Simon Riggs wrote:
> > On Tue, 2009-02-03 at 08:40 -0500, Andrew Dunstan wrote:
> > >
> > > Hannu Krosing wrote:
> > > > Actually we came up with a solution to this - use filesystem level
> > > > snapshots (like LVM2+XFS or ZFS), and redirect backends with
> > > > long-running queries to use fs snapshot mounted to a different
> > > > mountpoint.
> > > >
> > > > I don't think Simon has yet put full support for it in code, but it is
> > > > clearly _the_ solution for those who want to eat the cake and have it
> > > > too.
> >
> > > How does that work if you're using multiple file systems via tablespaces
> > > (e.g. indexes in a different TS)?
> >
> > It's a great idea and easy to do, but I can't do everything in one go.
> >
> > The main reasons not to are multiple file system difficulties and lack
> > of a mainstream Linux solution, and more simply lack of time and test
> > resources.
>
> A more general, but also a lot harder, solution would be going back to the
> roots and implementing what the original Postgres 4.2 and earlier versions
> were meant to do - namely, VACUUM was not meant to just discard older
> versions, but rather to move them to WORM storage (write once, read many
> was all the rage back then :) .
>
> If we did that in a way that each relation, at least on the warm standby,
> has its own "archive" fork, possibly in a separate tablespace for cheaper
> storage, then we could basically apply WALs as fast as we want and just
> move the old versions to the "archive". It will be slower(ish), especially
> for HOT updates, but may be a good solution for lots of use cases.
>
> And the decision to do the archiving on the master and WAL-copy to the
> slave, or just do it on the slave only, could probably be left to the user.
>
> Reintroducing keeping old tuples "forever" would also allow us to bring
> back the time travel feature, that is
>
> SELECT .... AS OF 'yesterday afternoon'::timestamp;
>
> which was thrown out at the time we got WAL-logging.
>
> To be really useful we should also have some way to know trx timestamps,
> but that can be easily done using the ticker feature from Slony or
> SkyTools/pgQ, which could be run as a separate server thread similar to
> what we do with the background writer, autovacuum etc. now.

That might be the way to do the "Named Restore Points" that is
frequently requested.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support