Re: Catalog/Metadata consistency during changeset extraction from wal

From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Catalog/Metadata consistency during changeset extraction from wal
Date: 2012-06-21 11:41:25
Message-ID: 201206211341.25322.andres@2ndquadrant.com
Lists: pgsql-hackers

Hi Robert, Hi all,

Robert and I talked quite a bit about different methods of providing enough
information to extract tuples from wal. I don't think either of us is yet
really convinced of any individual method, so I want to recap our discussion
in one email so others can chime in without reading the already huge thread.

I hope I am not misrepresenting Robert's opinion here, but I am sure he will
correct me if I do ;)

To satisfy the different needs people have for changeset extraction we
currently think that for each individual tuple extracted from wal we need to
provide the following information:

a) one or more pieces of tuple data (HeapTupleData structs)
* INSERT: full new tuple
* UPDATE: full new tuple, old pkey (optionally the full old tuple)
* DELETE: old pkey (optionally the full old tuple)
b) the action performed (INSERT|UPDATE|DELETE)
c) the table on which the action was performed
d) access to the table structure (names, column types, ...) and procs (*_out
functions for the individual columns)

The problem with getting that data is that at the point where we're decoding
the wal the catalog may have evolved significantly from the state it was in
when the tuple was put into the wal.
We can extract a) and b) without any problems (let's not talk about that here)
but we don't necessarily know how to make sense of the data because a
HeapTuple cannot be properly interpreted without the knowledge of c) and d).

I am of the opinion that c) is basically equivalent to solving d) because the
wal only contains the (pg_database.oid, pg_tablespace.oid, pg_class.relfilenode)
triple of the table and not the 'pg_class.oid'. The relfilenode
is changed by operations that rewrite the table like ALTER TABLE ADD COLUMN
... DEFAULT ...; TRUNCATE; CLUSTER and some others.

A single transaction can contain tuples for different relfilenodes and with
different columns:

CREATE TABLE foo(id serial primary key, data text);
BEGIN;
INSERT INTO foo ...;
TRUNCATE foo;
INSERT INTO foo ...; -- same structure, different relfilenode

ALTER TABLE foo ADD COLUMN bar text;
INSERT INTO foo ...; -- same relfilenode, different table structure

ALTER TABLE foo ADD COLUMN zaphod text DEFAULT '';
INSERT INTO foo ...; -- different relfilenode, different table structure
COMMIT;

There are several additional complex scenarios.

In http://archives.postgresql.org/message-id/201206192023(dot)20589(dot)andres(at)2ndquadrant(dot)com
I listed which options I see for reaching that goal.

A refined version of that list:

1.)
Decode on a different, possibly catalog-only, pg instance kept in sync using
the command trigger infrastructure (but not necessarily user-level defined
command triggers)

If the command/event trigger infrastructure logs into a system-catalog table,
keeping the catalog in the correct state is relatively easy. When
replaying/converting a reassembled transaction, every time an INSERT into that
system table happens, the contained DDL gets performed.
The locking on the generating side takes care of the concurrency aspects.

Advantages:
* minimal overhead (space, performance)
* allows additional tables/indexes/triggers if you take care with oid
allocation
* easy transactionally correct catalog behaviour
* the decoding instance can be used to store all data in a highly efficient
manner (no decoding, no full detoasting, ...)
* the decoding instance is fully writable without problems if you don't
generate conflicts (separate tables, non-overlapping writes, whatever)
* implementable in a pretty unintrusive way

Disadvantages:
* the table structure of replicated tables needs to be exactly the same
* the type definition + support procs need to be similar enough to read the
data
* error checking of the above isn't easy but probably possible
* full version/architecture compatibility required
* a proxy instance is required even if you want to replicate into some other
system/architecture/version

2.)
Keep the decoding site up2date by replicating the catalog via normal HS
recovery mechanisms.

Advantages:
* most of the technology is already there
* minimal overhead (space, performance)
* no danger of out of sync catalogs
* no support for command triggers required that can keep a catalog in sync,
including oids

Disadvantages:
* driving the catalog recovery that way requires somewhat intricate code
as it needs to be done in lockstep with decoding the wal-stream
* requires an additional feature to guarantee HS always has enough information
to be queryable after a crash/shutdown
* some complex logic/low-level fudging required to keep the transactional
behaviour sensible when querying the catalog
* full version/architecture compatibility required
* the decoding site will only ever be readable

3)
Multi-Versioned catalog

Below are two possible implementation strategies for that concept

Advantages:
* Decoding is done on the master in an asynchronous fashion
* low overhead during normal DML execution, not much additional code in that
path
* can be very efficient if architecture/version are the same
* version/architecture compatibility can be done transparently by falling back
to textual versions on mismatch

Disadvantages:
* decoding probably has to happen on the master which might not be what people
want performancewise

3a)
Change the system catalogs to be versioned

Advantages:
* catalog access is easy
* might be interesting for other users

Disadvantages:
* catalog versioning is complex to implement
* space overhead for all users, even without using logical replication
* I can't see -hackers signing off

3b)
Ensure that enough information in the catalog remains by fudging the xmin
horizon. Then reassemble an appropriate snapshot to read the catalog as the
tuple in question has seen it.

Advantages:
* should be implementable with low impact to general code

Disadvantages:
* requires some complex code for assembling snapshots
* it might be hard to guarantee that we always have enough information to
reassemble a snapshot (subxid overflows ...)
* impacts vacuum if replication to some site is slow

4.)
Log enough information in the walstream to make decoding possible using only
the walstream.

Advantages:
* Decoding can optionally be done on the master
* No catalog syncing/access required
* it's possible to make this architecture independent

Disadvantages:
* high to very high implementation overhead depending on efficiency aims
* high space overhead in the wal because at least all the catalog information
needs to be logged in a transactional manner repeatedly
* misuses wal far more than other methods
* significant new complexity in somewhat critical code paths (heapam.c)
* insanely high space overhead if the decoding should be possible in an
architecture-independent way

5.)
The actually good idea. Yours?

-----

I think 3a) is not likely to fly and I think 4) is too complex although Robert
isn't convinced of the latter argument.

In my opinion either 3b) or 1) is our best option because they seem to
support most of the use cases without huge costs in complexity and
runtime/space for users not using changeset extraction. 3b) seems to be
preferable because imo it's the most flexible choice and doesn't require a
second instance.
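
To make the xmin-horizon idea in 3b) a bit more concrete, here is a toy,
standalone C sketch (invented names, not backend code) of the core mechanism:
vacuum's removal cutoff gets clamped to the oldest xmin any changeset consumer
still needs, which is also where the "impacts vacuum if replication to some
site is slow" disadvantage comes from:

#include <stdint.h>
#include <stdio.h>

typedef uint32_t TransactionId;   /* stand-in for the backend typedef */

/* one entry per changeset consumer ("slot"); 0 means unused */
static TransactionId slot_xmin[8];

/* oldest xmin still needed by any consumer, or 0 if there is none */
static TransactionId
oldest_decoding_xmin(void)
{
    TransactionId result = 0;

    for (int i = 0; i < 8; i++)
    {
        if (slot_xmin[i] != 0 && (result == 0 || slot_xmin[i] < result))
            result = slot_xmin[i];
    }
    return result;
}

/*
 * The horizon vacuum would use: the usual "oldest running transaction"
 * cutoff, clamped back by the decoding consumers.  (Real xids compare
 * modulo 2^32; plain < is good enough for this toy.)
 */
static TransactionId
effective_vacuum_cutoff(TransactionId oldest_running)
{
    TransactionId decoding = oldest_decoding_xmin();

    if (decoding != 0 && decoding < oldest_running)
        return decoding;
    return oldest_running;
}

int
main(void)
{
    slot_xmin[0] = 1000;    /* a consumer still decoding wal from xid 1000 */
    printf("cutoff = %u\n", (unsigned) effective_vacuum_cutoff(5000));
    return 0;
}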

Any arguments against/for those?

I am going to talk about implementing 3b) in a separate email in a bit.

Greetings,

Andres

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Florian Pflug <fgp(at)phlo(dot)org>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: Catalog/Metadata consistency during changeset extraction from wal
Date: 2012-06-21 13:56:54
Message-ID: 99FDAE17-E669-4E9B-A744-63A9C206F1EE@phlo.org
Lists: pgsql-hackers

On Jun21, 2012, at 13:41 , Andres Freund wrote:
> 3b)
> Ensure that enough information in the catalog remains by fudging the xmin
> horizon. Then reassemble an appropriate snapshot to read the catalog as the
> tuple in question has seen it.

The ComboCID machinery makes that quite a bit harder, I fear. If a tuple is
updated multiple times by the same transaction, you cannot decide whether a
tuple was visible in a certain snapshot unless you have access to the updating
backend's ComboCID hash.
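
To illustrate with a toy, standalone C sketch (invented names, not the actual
backend data structures): the tuple header has room for only one CommandId,
so when a tuple is created and later deleted by the same transaction, the two
command ids get folded into a combo id whose (cmin, cmax) mapping exists only
in the updating backend's local hash:

#include <stdint.h>
#include <stdio.h>

typedef uint32_t CommandId;

/*
 * Backend-local mapping: combo id -> (cmin, cmax).  Only the backend
 * that performed the updates has this hash, which is exactly the
 * problem for decoding after the fact.
 */
typedef struct { CommandId cmin; CommandId cmax; } ComboPair;

static ComboPair combo_map[16];
static int       n_combos;

static CommandId
make_combo(CommandId cmin, CommandId cmax)
{
    combo_map[n_combos].cmin = cmin;
    combo_map[n_combos].cmax = cmax;
    return (CommandId) n_combos++;
}

/*
 * Is a tuple created at cmin and deleted at cmax (within the same
 * transaction) visible to command "curcid" of that transaction?  The
 * tuple header stores only the combo id, so this cannot be answered
 * without the mapping above.
 */
static int
visible_at(CommandId combo, CommandId curcid)
{
    ComboPair p = combo_map[combo];

    return p.cmin < curcid && p.cmax >= curcid;
}

int
main(void)
{
    /* tuple inserted by command 2, updated away again by command 5 */
    CommandId c = make_combo(2, 5);

    printf("visible to command 3: %d\n", visible_at(c, 3));   /* 1 */
    printf("visible to command 7: %d\n", visible_at(c, 7));   /* 0 */
    return 0;
}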

best regards,
Florian Pflug


From: Florian Pflug <fgp(at)phlo(dot)org>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: Catalog/Metadata consistency during changeset extraction from wal
Date: 2012-06-21 14:05:54
Message-ID: 66B754A4-1328-489E-9BD2-F768D910EF99@phlo.org
Lists: pgsql-hackers

On Jun21, 2012, at 13:41 , Andres Freund wrote:
> 5.)
> The actually good idea. Yours?

What about a mixture of (3b) and (4), which writes the data not to the WAL
but to a separate logical replication log. More specifically:

There's a per-backend queue of change notifications.

Whenever a non-catalog tuple is modified, we queue a TUPLE_MODIFIED
record containing (xid, databaseoid, tableoid, old xmin, old ctid, new ctid)

Whenever a table (or something that a table depends on) is modified we
wait until all references to that table's oid have vanished from the queue,
then queue a DDL record containing (xid, databaseoid, tableoid, text). Other
backends cannot concurrently add further TUPLE_MODIFIED records since we already
hold an exclusive lock on the table at that point.

A background process continually processes these queues. If the front of the
queue is a TUPLE_MODIFIED record, it fetches the old and the new tuple
based on their ctids and writes the old tuple's PK and the full new tuple
to the logical replication log. Since table modifications always wait for
all previously queued TUPLE_MODIFIED records referencing that table to be
processed *before* altering the catalog, tuples can always be interpreted
according to the current (SnapshotNow) catalog contents.

Upon transaction COMMIT and ROLLBACK, we queue COMMIT and ROLLBACK records,
which are also written to the log by the background process. The background
process may decide to wait until a backend commits before processing that
backend's log. In that case, rolled back transactions don't leave a trace in
the logical replication log. Should a backend, however, issue a DDL statement,
the background process *must* process that backend's queue immediately, since
otherwise there's a deadlock.

The background process also maintains a value in shared memory which
contains the oldest value in any of the queue's xid or "old xmin" fields.
VACUUM and the like must not remove tuples whose xmin is >= that value.
Hint bits *may* be set for the newest tuples though, provided that the background
process ignores hint bits when fetching the old and new tuples.
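
A rough sketch of what such queue records could look like (toy C, invented
names and layout, just to make the proposal concrete):

#include <stdint.h>

typedef uint32_t Oid;
typedef uint32_t TransactionId;

/* stand-in for the backend's ItemPointerData: (block, offset) */
typedef struct { uint32_t block; uint16_t offset; } TuplePointer;

typedef enum
{
    QR_TUPLE_MODIFIED,    /* DML on a non-catalog table                  */
    QR_DDL,               /* catalog change, queued under the table lock */
    QR_COMMIT,
    QR_ROLLBACK
} QueueRecordType;

typedef struct
{
    QueueRecordType type;
    TransactionId   xid;
    Oid             database_oid;
    Oid             table_oid;

    union
    {
        struct                       /* QR_TUPLE_MODIFIED */
        {
            TransactionId old_xmin;  /* vacuum must not remove past this  */
            TuplePointer  old_ctid;  /* old tuple (for the PK) lives here */
            TuplePointer  new_ctid;  /* full new tuple lives here         */
        } tuple;

        struct                       /* QR_DDL */
        {
            const char *ddl_text;    /* statement text to write to the log */
        } ddl;
    } u;
} QueueRecord;

int
main(void)
{
    QueueRecord r = { 0 };

    r.type = QR_TUPLE_MODIFIED;
    r.xid = 1234;
    r.u.tuple.old_xmin = 1200;
    /* the background process would fetch old/new tuples via the ctids and
     * emit the old PK plus the full new tuple to the logical log */
    return 0;
}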

best regards,
Florian Pflug


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: Catalog/Metadata consistency during changeset extraction from wal
Date: 2012-06-21 14:39:21
Message-ID: CA+U5nML-i6MtpCwo8O=1icZVE-wU+JLZ1+qgQKc3p-ugNhoiqg@mail.gmail.com
Lists: pgsql-hackers

On 21 June 2012 12:41, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:

> 3)
> Multi-Versioned catalog
>
> Below are two possible implementation strategies for that concept
>
> Advantages:
> * Decoding is done on the master in an asynchronous fashion
> * low overhead during normal DML execution, not much additional code in that
> path
> * can be very efficient if architecture/version are the same
> * version/architecture compatibility can be done transparently by falling back
> to textual versions on mismatch
>
> Disadvantages:
> * decoding probably has to happen on the master which might not be what people
> want performancewise
>
> 3a)
> Change the system catalogs to be versioned
>
> Advantages.
> * catalog access is easy
> * might be interesting for other users
>
> Disadvantages:
> * catalog versioning is complex to implement
> * space overhead for all users, even without using logical replication
> * I can't see -hackers signing off

Hmm, there's all sorts of stuff mixed up there in your description.

ISTM we should maintain a lookup table on target system that has the
minimal required information on it.

There is no need to version the whole catalog. (Complete overkill - I
would oppose it ;-)

If we keep the lookup table on the target as a normal table, we can
insert new rows into it as changes occur. If we need to perform
recovery then the earlier version rows will still be there and we just
use those. Versioning is easy to implement: just use the LSN as an additional
key in the table, then look up based on key and LSN. If a transaction
that makes DDL changes aborts, then the changes will be automatically
backed out.
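
As a toy illustration of "lookup based on key and LSN" (standalone C with
invented names; in reality this would simply be a normal table keyed on
(oid, lsn)): keep every version and pick the newest row whose lsn is not
newer than the LSN of the WAL record currently being decoded:

#include <stdint.h>
#include <stdio.h>

typedef uint32_t Oid;
typedef uint64_t LSN;

/* one row per (table oid, lsn at which this definition took effect) */
typedef struct
{
    Oid  reloid;
    LSN  lsn;
    char definition[64];    /* whatever metadata the apply side needs */
} LookupRow;

/* newest row for "reloid" that was already in effect at LSN "at" */
static const LookupRow *
lookup(const LookupRow *rows, int nrows, Oid reloid, LSN at)
{
    const LookupRow *best = NULL;

    for (int i = 0; i < nrows; i++)
    {
        if (rows[i].reloid != reloid || rows[i].lsn > at)
            continue;
        if (best == NULL || rows[i].lsn > best->lsn)
            best = &rows[i];
    }
    return best;
}

int
main(void)
{
    LookupRow rows[] = {
        { 16384, 100, "foo(id int, data text)" },
        { 16384, 500, "foo(id int, data text, bar text)" },
    };

    /* decoding a WAL record written at LSN 300 sees the old definition */
    printf("%s\n", lookup(rows, 2, 16384, 300)->definition);
    return 0;
}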

Only keep the lookup table if using logical replication, so zero
overhead otherwise. We just need to set up the initial state carefully,
so it matches what's in the database, but that sounds OK.

So I don't see any of the disadvantages you have there. It's just darn
simple, and hence will probably work. It's also a very similar
solution to the other lookups required in memory by the apply process.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: Catalog/Metadata consistency during changeset extraction from wal
Date: 2012-06-21 14:53:35
Message-ID: 201206211653.35633.andres@2ndquadrant.com
Lists: pgsql-hackers

Hi,

On Thursday, June 21, 2012 04:39:21 PM Simon Riggs wrote:
> On 21 June 2012 12:41, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> > 3)
> > Multi-Versioned catalog
> >
> > Below are two possible implementation strategies for that concept
> >
> > Advantages:
> > * Decoding is done on the master in an asynchronous fashion
> > * low overhead during normal DML execution, not much additional code in
> > that path
> > * can be very efficient if architecture/version are the same
> > * version/architecture compatibility can be done transparently by falling
> > back to textual versions on mismatch
> >
> > Disadvantages:
> > * decoding probably has to happen on the master which might not be what
> > people want performancewise
> >
> > 3a)
> > Change the system catalogs to be versioned
> >
> > Advantages.
> > * catalog access is easy
> > * might be interesting for other users
> >
> > Disadvantages:
> > * catalog versioning is complex to implement
> > * space overhead for all users, even without using logical replication
> > * I can't see -hackers signing off
>
> Hmm, there's all sorts of stuff mixed up there in your description.
Sure, I tried to compress a complex topic discussed in a long thread ;)

> ISTM we should maintain a lookup table on target system that has the
> minimal required information on it.
You need just about the whole catalog because the *_out procs might need to
look up types, operators and such again.
Unless you want to rewrite those functions you need to provide a normal
execution environment.

I don't see how your idea works because of that? Am I missing something?

Yes, that would be easier if we didn't want to support conversion to text and
similar, but I don't see that flying. And even if it were acceptable you
would need to have enough information to construct a btree ScanKey, which means
you already need a lot of the catalogs.

> There is no need to version the whole catalog. (Complete overkill - I
> would oppose it ;-)
Hey, that originally was your idea :P. But I definitely agree, it's not a good
idea.

Greetings,

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: Catalog/Metadata consistency during changeset extraction from wal
Date: 2012-06-21 15:05:04
Message-ID: CA+U5nM+hh6x2h9j+UJJkHYUQPo_dcg8aTQywzLAkL1p9LEppxQ@mail.gmail.com
Lists: pgsql-hackers

On 21 June 2012 15:53, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:

>> ISTM we should maintain a lookup table on target system that has the
>> minimal required information on it.

> You need just about the whole catalog because the *_out procs might need to
> lookup types, operators and such again.
> Unless you want to rewrite those functions you need to provide a normal
> execution environment.

OK, so it's more tables than I first thought, but it's not all rows and
columns of all catalog tables.

> I don't see how your idea works because of that? Am I missing something?

Why does the number/size of the tables required make that not work?

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: Catalog/Metadata consistency during changeset extraction from wal
Date: 2012-06-21 15:13:48
Message-ID: 201206211713.48508.andres@2ndquadrant.com
Lists: pgsql-hackers

On Thursday, June 21, 2012 05:05:04 PM Simon Riggs wrote:
> On 21 June 2012 15:53, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> >> ISTM we should maintain a lookup table on target system that has the
> >> minimal required information on it.
> >
> > You need just about the whole catalog because the *_out procs might need
> > to lookup types, operators and such again.
> > Unless you want to rewrite those functions you need to provide a normal
> > execution environment.
>
> OK, so its more tables than I first thought, but its not all rows and
> columns of all catalog tables.
Sure, there are a few you probably can leave out (pg_database, pg_auth*,
pg_*acl, pg_(sh)?depend, pg_tablespace, ...) but it's not many.

> > I don't see how your idea works because of that? Am I missing something?
> Why does the number/size of the tables required make that not work?
The number of tables itself isn't a fundamental problem although it would make
stuff harder.
The problem is that the out functions expect a normal operating environment
and might e.g. do catalog lookups themselves. I don't see how we can do
anything here without providing a (nearly) full catalog.

Greetings,

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: Catalog/Metadata consistency during changeset extraction from wal
Date: 2012-06-21 15:25:41
Message-ID: CA+U5nMLAY+0DwTSwNwXgtf=cg47M-pW8Qhqhksyk33KhmJ=KLg@mail.gmail.com
Lists: pgsql-hackers

On 21 June 2012 16:13, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> On Thursday, June 21, 2012 05:05:04 PM Simon Riggs wrote:
>> On 21 June 2012 15:53, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> >> ISTM we should maintain a lookup table on target system that has the
>> >> minimal required information on it.
>> >
>> > You need just about the whole catalog because the *_out procs might need
>> > to lookup types, operators and such again.
>> > Unless you want to rewrite those functions you need to provide a normal
>> > execution environment.
>>
>> OK, so its more tables than I first thought, but its not all rows and
>> columns of all catalog tables.
> Sure, there are a few you probably can leave out (pg_database, pg_auth*,
> pg_*acl, pg_(sh)?depend, pg_database, pg_tablespace, ...) but its not many.

That's a start. Leaving out the shared catalogs makes me smile already.

>> > I don't see how your idea works because of that? Am I missing something?
>> Why does the number/size of the tables required make that not work?
> The number of tables itself isn't a fundamental problem although it would make
> stuff harder.
> The problem is that the out functions expect a normal operating environment
> and might e.g. do catalog lookups themselves. I don't see how we can do
> anything here without providing a (nearly) full catalog.

I accept that there could be pathological functions in there. We're
not trying to make it work with any conceivable datatype/operator, so
forcing logical replication to follow sensible rules makes sense. Are
there any out functions that anybody uses that do that?

It's too much change to actually version the main catalog. Keeping a
separate copy of a versioned catalog for use by replication sounds
much more likely to fly.

In any case, I think we'll have to go back through the list and do
more work on evaluation. When the options look like that, it's typical
to have ruled out the final winner early on, but that doesn't mean it
isn't in there somewhere.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: Catalog/Metadata consistency during changeset extraction from wal
Date: 2012-06-21 15:33:05
Message-ID: 201206211733.05652.andres@2ndquadrant.com
Lists: pgsql-hackers

On Thursday, June 21, 2012 05:25:41 PM Simon Riggs wrote:
> On 21 June 2012 16:13, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> > On Thursday, June 21, 2012 05:05:04 PM Simon Riggs wrote:
> >> On 21 June 2012 15:53, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> >> >> ISTM we should maintain a lookup table on target system that has the
> >> >> minimal required information on it.
> >> >
> >> > You need just about the whole catalog because the *_out procs might
> >> > need to lookup types, operators and such again.
> >> > Unless you want to rewrite those functions you need to provide a
> >> > normal execution environment.
> >>
> >> OK, so its more tables than I first thought, but its not all rows and
> >> columns of all catalog tables.
> >
> > Sure, there are a few you probably can leave out (pg_database, pg_auth*,
> > pg_*acl, pg_(sh)?depend, pg_database, pg_tablespace, ...) but its not
> > many.
>
> That's a start. Leaving out the shared catalogs makes me smile already.
>
> >> > I don't see how your idea works because of that? Am I missing
> >> > something?
> >>
> >> Why does the number/size of the tables required make that not work?
> >
> > The number of tables itself isn't a fundamental problem although it would
> > make stuff harder.
> > The problem is that the out functions expect a normal operating
> > environment and might e.g. do catalog lookups themselves. I don't see
> > how we can do anything here without providing a (nearly) full catalog.
>
> I accept that there could be pathological functions in there. We're
> not trying to make it work with any conceivable datatype/operator, so
> forcing logical replication to follow sensible rules makes sense. Are
> there any out functions that anybody uses that do that?
Loads. enum_out, record_out, array_out are examples I can think of without
even looking. I am pretty sure there are more. But imo this list already shows
it's prohibitive.

> It's too much change to actually version the main catalog. Keeping a
> separate copy of a versioned catalog for use by replication sounds
> much more likely to fly.
I don't yet see how that should work, given that oids and everything are quite
possibly hardcoded in those functions. You could start switching out the
catalogs on a lower level, but I think at that point it's getting too ugly.

> In any case, I think we'll have to go back through the list and do
> more work on evaluation. When the options look like that, its typical
> to have ruled out the final winner early on, but that doesn't mean it
> isn't in there somewhere.
I hope we have but I am not convinced that there is an elegant solution...

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Florian Pflug <fgp(at)phlo(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: Catalog/Metadata consistency during changeset extraction from wal
Date: 2012-06-21 15:40:08
Message-ID: 201206211740.09128.andres@2ndquadrant.com
Lists: pgsql-hackers

On Thursday, June 21, 2012 03:56:54 PM Florian Pflug wrote:
> On Jun21, 2012, at 13:41 , Andres Freund wrote:
> > 3b)
> > Ensure that enough information in the catalog remains by fudging the xmin
> > horizon. Then reassemble an appropriate snapshot to read the catalog as
> > the tuple in question has seen it.
>
> The ComboCID machinery makes that quite a bit harder, I fear. If a tuple is
> updated multiple times by the same transaction, you cannot decide whether a
> tuple was visible in a certain snapshot unless you have access to the
> updating backend's ComboCID hash.
That's a very good point. Not sure how I forgot that.

I think it might be possible to reconstruct a sensible combocid mapping from
the walstream. Let me think about it for a while...

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Florian Pflug <fgp(at)phlo(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: Catalog/Metadata consistency during changeset extraction from wal
Date: 2012-06-21 15:56:11
Message-ID: 201206211756.11996.andres@2ndquadrant.com
Lists: pgsql-hackers

On Thursday, June 21, 2012 04:05:54 PM Florian Pflug wrote:
> On Jun21, 2012, at 13:41 , Andres Freund wrote:
> > 5.)
> > The actually good idea. Yours?
>
> What about a mixure of (3b) and (4), which writes the data not to the WAL
> but to a separate logical replication log. More specifically:
>
> There's a per-backend queue of change notifications.
>
> Whenever a non-catalog tuple is modified, we queue a TUPLE_MODIFIED
> record containing (xid, databaseoid, tableoid, old xmin, old ctid, new
> ctid)
>
> Whenever a table (or something that a table depends on) is modified we
> wait until all references to that table's oid have vanished from the queue,
> then queue a DDL record containing (xid, databaseoid, tableoid, text).
> Other backend cannot concurrently add further TUPLE_MODIFIED records since
> we alreay hold an exclusive lock on the table at that point.
>
> A background process continually processes these queues. If the front of
> the queue is a TUPLE_MODIFIED record, it fetches the old and the new tuple
> based on their ctids and writes the old tuple's PK and the full new tuple
> to the logical replication log. Since table modifications always wait for
> all previously queued TUPLE_MODIFIED records referencing that table to be
> processes *before* altering the catalog, tuples can always be interpreted
> according to the current (SnapshotNow) catalog contents.
>
> Upon transaction COMMIT and ROLLBACK, we queue COMMIT and ROLLBACK records,
> which are also written to the log by the background process. The background
> process may decide to wait until a backend commits before processing that
> backend's log. In that case, rolled back transaction don't leave a trace in
> the logical replication log. Should a backend, however, issue a DDL
> statement, the background process *must* process that backend's queue
> immediately, since otherwise there's a dead lock.
>
> The background process also maintains a value in shared memory which
> contains the oldest value in any of the queue's xid or "old xmin" fields.
> VACUUM and the like must not remove tuples whose xmin is >= that value.
> Hit bits *may* be set for newest tuples though, provided that the
> background process ignores hint bits when fetching the old and new tuples.
I think that's too complicated to fly. Getting that to recover cleanly in case
of a crash would mean you'd need another wal.

I think if it comes to that, going for 1) is more realistic...

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Amit Kapila <amit(dot)kapila(at)huawei(dot)com>
To: "'Andres Freund'" <andres(at)2ndquadrant(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Cc: "'Robert Haas'" <robertmhaas(at)gmail(dot)com>
Subject: Re: Catalog/Metadata consistency during changeset extraction from wal
Date: 2012-06-22 06:46:19
Message-ID: 002f01cd5042$bb4d2070$31e76150$@kapila@huawei.com
Lists: pgsql-hackers

Andres Freund Sent: Thursday, June 21, 2012 5:11 PM
> 4.)
> Log enough information in the walstream to make decoding possible using only
> the walstream.

What I understood is that the 'enough information' here is catalog data. Is
that right, or is it something else?

> Advantages:
> * Decoding can optionally be done on the master
> * No catalog syncing/access required
> * its possible to make this architecture independent

> Disadvantage:
> * high to very high implementation overhead depending on efficiency aims
> * high space overhead in the wal because at least all the catalog information

In the multi-version approach also there will be a space overhead to
maintain multiple versions, irrespective of which approach you use.

> needs to be logged in a transactional manner repeatedly
Why does it need to be logged repeatedly? Once we log the catalog information
in WAL, we can disallow/block other DDL during that time, and after that the
changes to catalog information can be retrieved from the WAL alone.

> * misuses wal far more than other methods
What is the misuse in this? I believe it could later be used for log-mining
purposes as well.

> * significant new complexity in somewhat cricital code paths (heapam.c)
> * insanely high space overhead if the decoding should be possible in an
> architecture-independent way

Option 4 seems better compared to the others w.r.t. the top-level approach
to solving the problem.
Some other databases also use a similar approach for use cases similar to
what you have described.

With Regards,
Amit Kapila.


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: Catalog/Metadata consistency during changeset extraction from wal
Date: 2012-06-22 06:48:41
Message-ID: CA+U5nM+0FKYk=tOB+gwNx6YU4tqqxQZzb3QCHLvs0kgv=RVNzw@mail.gmail.com
Lists: pgsql-hackers

On 21 June 2012 12:41, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:

> 2.)
> Keep the decoding site up2date by replicating the catalog via normal HS
> recovery
> mechanisms.
>
> Advantages:
> * most of the technology is already there
> * minimal overhead (space, performance)
> * no danger of out of sync catalogs
> * no support for command triggers required that can keep a catalog in sync,
> including oids
>
> Disadvantages:
> * driving the catalog recovery that way requires some somewhat intricate code
> as it needs to be done in lockstep with decoding the wal-stream
> * requires an additional feature to guarantee HS always has enough information
> to be queryable after a crash/shutdown
> * some complex logic/low-level fudging required to keep the transactional
> behaviour sensible when querying the catalog
> * full version/architecture compatibility required
> * the decoding site will always ever be only readable

My initial reaction was "this won't work", but that requires
qualification since this is a complex topic: You can use this
approach as long as you realise that the catalog it gives can never be
rewound.

So the generic approach to "construct me a catalog as of this LSN"
would need to start with a base backup of the catalog and then roll
forward to the appropriate LSN. Which means a generic user of this
approach would need to be able to construct an initial catalog using a
PITR.

Constructing a decoding site requires you to
a) take a partial base backup of the catalog
b) apply WAL records to bring that forwards to the correct LSN, which
would require some alteration of the recovery code to skip the files
missing in a)

So taking the approach of a decoding site means we have to modify
recovery code, and even when we do that we still end up with a
difficult-to-deploy option in the real world. Difficult to deploy
because we need a whole new instance of Postgres, plus we need all of
the WAL files, which could easily be impractical.

The overall approach is good, but the architecture is wrong. What we
need is a "catalog base backup" and a means of rolling forward to the
appropriate LSN. Rolling forward using WAL is too bulky, so we need a
separate log of DDL changes to the catalog. So what we need is a
"catalog base backup" plus a "ddl-log".

And we need to be able to reconstruct the correct catalog on the
target server directly.

To translate the WAL we maintain a secondary set of catalog tables,
which only exist for logical replication. These are normal,
non-versioned tables, but held in a new schema pg_logical or similar.
One reason why this must be a secondary catalog is to allow the
translation to take place on the target server, and to allow
translation of WAL from a prior version of the catalog - so we can
allow online upgrades across different catalog versions (and possibly
major versions).

The secondary catalog is used in place of the normal catalog during
InitCatalogCache() in the apply process. All the normal caches exist,
they just point to secondary relations rather than the normal ones.

When we initialise replication we take a copy of the appropriate
tables, columns and rows in a catalog-base-backup, using something
like pg_dump. Overall, this is much smaller than the normal catalog since
it avoids temp tables, and anything not related to WAL translation.

On each non-temp change to the database we record changes as SQL in
the ddl-log, together with the LSN of the change.

When the number of changes in the ddl-log hits a limit we take a new
catalog-base-backup. This process is similar to a checkpoint, but much
less frequent; let's call it a ddl-checkpoint.

When we start to read WAL logs to translate them, we start by
truncating/re-bootstrapping and reloading the secondary catalog from
the base backup. We then apply all changes from the ddl-log (which is
just a sequence of SQL statements) up until the LSN at the start of
WAL. The secondary catalog is then an exact copy of the catalog as of
that LSN.
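
In rough pseudo-real terms (toy C, invented names), the replay step is just:

#include <stdint.h>
#include <stdio.h>

typedef uint64_t LSN;

/* one ddl-log entry: the LSN of the catalog change plus its SQL text */
typedef struct
{
    LSN         lsn;
    const char *sql;
} DdlLogEntry;

/* stand-in for "run this statement against the secondary catalog" */
static void
apply_to_secondary_catalog(const char *sql)
{
    printf("applying: %s\n", sql);
}

/*
 * Bring the secondary catalog, freshly reloaded from the ddl base
 * backup, forward to "start_of_wal": apply every logged DDL statement
 * older than that point, in order.  Later entries get applied in
 * lockstep with reading the WAL itself.
 */
static void
replay_ddl_log(const DdlLogEntry *log, int nentries, LSN start_of_wal)
{
    for (int i = 0; i < nentries; i++)
    {
        if (log[i].lsn >= start_of_wal)
            break;
        apply_to_secondary_catalog(log[i].sql);
    }
}

int
main(void)
{
    DdlLogEntry log[] = {
        { 100, "ALTER TABLE foo ADD COLUMN bar text" },
        { 250, "CREATE INDEX foo_bar_idx ON foo (bar)" },
        { 900, "ALTER TABLE foo DROP COLUMN bar" },
    };

    replay_ddl_log(log, 3, 300);    /* applies only the first two entries */
    return 0;
}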

As we read through WAL we apply further changes to the secondary catalog
so it stays in lockstep with the WAL we are currently reading.

Having the ddl-base-backup and ddl-log allows reconstruction of the
catalog without needing to put the whole catalog into WAL at each checkpoint.
We can truncate old WAL segments and yet still recreate the DDL needed
to translate current WAL data. As a result, ddl-checkpoints are much
less frequent, perhaps weekly or monthly, rather than every few
minutes.

The whole process is similar in algorithm to recovery, but is just
normal userspace tables and SQL.

Constructing the correct catalog seems to be the heart of this
problem, so it is likely to take a while and look complex. Getting the
HS initial state was around 50% of the effort in making it all work,
so I guess it's similar here.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: Catalog/Metadata consistency during changeset extraction from wal
Date: 2012-06-22 09:42:10
Message-ID: 201206221142.10603.andres@2ndquadrant.com
Lists: pgsql-hackers

Hi,

On Friday, June 22, 2012 08:48:41 AM Simon Riggs wrote:
> On 21 June 2012 12:41, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> > 2.)
> > Keep the decoding site up2date by replicating the catalog via normal HS
> > recovery
> > mechanisms.
> >
> > Advantages:
> > * most of the technology is already there
> > * minimal overhead (space, performance)
> > * no danger of out of sync catalogs
> > * no support for command triggers required that can keep a catalog in
> > sync, including oids
> >
> > Disadvantages:
> > * driving the catalog recovery that way requires some somewhat intricate
> > code as it needs to be done in lockstep with decoding the wal-stream
> > * requires an additional feature to guarantee HS always has enough
> > information to be queryable after a crash/shutdown
> > * some complex logic/low-level fudging required to keep the transactional
> > behaviour sensible when querying the catalog
> > * full version/architecture compatibility required
> > * the decoding site will always ever be only readable
>
> My initial reaction was "this wont work", but that requires
> qualification since this is a complex topic: You can use this
> approach as long as you realise that the catalog it gives can never be
> rewound.
Well, only as far as the min recovery point has been advanced. That's advanced
less frequently than we apply xlog records.

> So the generic approach to "construct me a catalog as of this LSN"
> would need to start with a base backup of the catalog and then roll
> forward to the appropriate LSN. Which means a generic user of this
> approach would need to be able to construct an initial catalog using a
> PITR.

> Constructing a decoding site requires you to
> a) take a partial base backup of the catalog
> b) apply WAL records to bring that forwards to the correct LSN, which
> would require some alteration of the recovery code to skip the files
> missing in a)
>
> So taking the approach of a decoding site means we have to modify
> recovery code, and even when we do that we still end up with a
> difficult to deploy option in the real world. Difficult to deploy
> becaus we need a whole new instance of Postgres, plus we need all of
> the WAL files, which could easily be impractical.
>
> The overall approach is good, but the architecture is wrong. What we
> need is a "catalog base backup" and a means of rolling forward to the
> appropriate LSN. Rolling forward using WAL is too bulky, so we need a
> separate log of DDL changes to the catalog. So what we need is a
> "catalog base backup" plus a "ddl-log".
The idea was to store the applycache to disk every time UpdateMinRecoveryPoint
is called. That way you wouldn't have to scroll back, even if the database
crashes/is stopped hard.
But I agree, I don't like the architecture that much either.

> To translate the WAL we maintain a secondary set of catalog tables,
> which only exist for logical replication. These are normal,
> non-versioned tables, but held in a new schema pg_logical or similar.
> One reason why this must be a secondary catalog is to allow the
> translation to take place on the target server, and to allow
> translation of WAL from a prior version of the catalog - so we can
> allow online upgrades across different catalog versions (and possibly
> major versions).
> The secondary catalog is used in place of the normal catalog during
> InitCatalogCache() in the apply process. All the normal caches exist,
> they just point to secondary relations rather than the normal ones.
> When we initialise replication we take a copy of the appropriate
> tables, columns and rows in a catalog-base-backup, using something
> like pg_dump. Overall, this is much smaller than normal catalog since
> it avoids temp tables, and anything not related to WAL translation.
>
> On each non-temp change to the database we record changes as SQL in
> the ddl-log, together with the LSN of the change.
>
> When number of changes in ddl-log hits a limit we take a new
> catalog-base-backup. This process is similar to a checkpoint, but much
> less frequent, lets call it a ddl-checkpoint.
>
> When we start to read WAL logs to translate them, we start by
> truncating/re-bootstrapping and reloading the secondary catalog from
> the base backup. We then apply all changes from the ddl-log (which is
> just a sequence of SQL statements) up until the LSN at the start of
> WAL. The secondary catalog is then an exact copy of the catalog as of
> that LSN.
>
> As we read through WAL we apply further changes to secondary catalog
> so it maintains in lock step with the WAL we currently read.
I can't see how that's going to fly, because the *_out functions use the
syscache and also plain access to catalog tables. We would have to completely
map oids to the alternative catalog.
For one, I think that mapping would involve far too many places (shared
catalogs/relmapper, smgr, fd, syscache, ...). For another, you need to access
those tables in a completely normal fashion from non-recovery backends which
means that we cannot just have duplicated oids hidden away somewhere.

> Constructing the correct catalog seems to be the heart of this
> problem, so it is likely to take a while and look complex. Getting the
> HS initial state was around 50% of the effort in making it all work,
> so I guess its similar here.
Yes.

Greetings,

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Florian Pflug <fgp(at)phlo(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: Catalog/Metadata consistency during changeset extraction from wal
Date: 2012-06-22 13:22:03
Message-ID: 201206221522.04019.andres@2ndquadrant.com
Lists: pgsql-hackers

On Thursday, June 21, 2012 05:40:08 PM Andres Freund wrote:
> On Thursday, June 21, 2012 03:56:54 PM Florian Pflug wrote:
> > On Jun21, 2012, at 13:41 , Andres Freund wrote:
> > > 3b)
> > > Ensure that enough information in the catalog remains by fudging the
> > > xmin horizon. Then reassemble an appropriate snapshot to read the
> > > catalog as the tuple in question has seen it.
> >
> > The ComboCID machinery makes that quite a bit harder, I fear. If a tuple
> > is updated multiple times by the same transaction, you cannot decide
> > whether a tuple was visible in a certain snapshot unless you have access
> > to the updating backend's ComboCID hash.
>
> Thats a very good point. Not sure how I forgot that.
>
> It think it might be possible to reconstruct a sensible combocid mapping
> from the walstream. Let me think about it for a while...
I have a very, very preliminary thing which seems to work somewhat. I just log
(cmin, cmax) additionally for every modified catalog tuple into the wal and so
far that seems to be enough.
Do you happen to have suggestions for other problematic things to look into
before I put more time into it?
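
Roughly speaking (toy C, invented record name and layout; the real format is
still in flux), the extra information logged per modified catalog tuple is
just:

#include <stdint.h>
#include <stdio.h>

typedef uint32_t TransactionId;
typedef uint32_t CommandId;

/* stand-in for ItemPointerData: which catalog tuple this is about */
typedef struct { uint32_t block; uint16_t offset; } TuplePointer;

/*
 * Extra WAL record emitted for every catalog tuple a transaction
 * modifies.  With (cmin, cmax) available from the WAL stream, the
 * decoding side can redo the combocid visibility decision without
 * access to the originating backend's combocid hash.
 */
typedef struct
{
    TransactionId xid;     /* modifying transaction           */
    TuplePointer  target;  /* the catalog tuple in question   */
    CommandId     cmin;    /* command that created the tuple  */
    CommandId     cmax;    /* command that deleted it, if any */
} xl_catalog_tuple_cids;

int
main(void)
{
    printf("per-tuple overhead: %zu bytes\n", sizeof(xl_catalog_tuple_cids));
    return 0;
}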

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Florian Pflug <fgp(at)phlo(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: Catalog/Metadata consistency during changeset extraction from wal
Date: 2012-06-22 19:30:59
Message-ID: 201206222130.59842.andres@2ndquadrant.com
Lists: pgsql-hackers

On Friday, June 22, 2012 03:22:03 PM Andres Freund wrote:
> On Thursday, June 21, 2012 05:40:08 PM Andres Freund wrote:
> > On Thursday, June 21, 2012 03:56:54 PM Florian Pflug wrote:
> > > On Jun21, 2012, at 13:41 , Andres Freund wrote:
> > > > 3b)
> > > > Ensure that enough information in the catalog remains by fudging the
> > > > xmin horizon. Then reassemble an appropriate snapshot to read the
> > > > catalog as the tuple in question has seen it.
> > >
> > > The ComboCID machinery makes that quite a bit harder, I fear. If a
> > > tuple is updated multiple times by the same transaction, you cannot
> > > decide whether a tuple was visible in a certain snapshot unless you
> > > have access to the updating backend's ComboCID hash.
> >
> > Thats a very good point. Not sure how I forgot that.
> >
> > It think it might be possible to reconstruct a sensible combocid mapping
> > from the walstream. Let me think about it for a while...
>
> I have a very, very preliminary thing which seems to work somewhat. I just
> log (cmin, cmax) additionally for every modified catalog tuple into the
> wal and so far that seems to be enough.
> Do you happen to have suggestions for other problematic things to look into
> before I put more time into it?
I'm continuing to play around with this. The tricky bit so far is
subtransaction handling in transactions which modify the catalog (+ possibly
tables which are marked as being required for decoding, like a pg_enum
equivalent).

Would somebody fundamentally object to one of the following things:
1.
replace

#define IsMVCCSnapshot(snapshot) \
((snapshot)->satisfies == HeapTupleSatisfiesMVCC)

with something like

#define IsMVCCSnapshot(snapshot) \
((snapshot)->satisfies == HeapTupleSatisfiesMVCC || \
 (snapshot)->satisfies == HeapTupleSatisfiesMVCCDuringDecode)

The define is only used sparingly and none of the code paths look so hot that
this could make a difference.

2.
Set SnapshotNowData.satisfies to HeapTupleSatisfiesNowDuringRecovery while
reading the catalog for decoding.

It's possible to go on without both, but the faking up of data gets quite a bit
more complex.

The problem that makes replacing SnapshotNow.satisfies useful is that there is
no convenient way to represent subtransactions of the current transaction
which have already committed according to the TransactionLog but aren't yet
visible at the current lsn because they only started afterwards. It's
relatively easy to fake this in an mvcc snapshot but way harder for
SnapshotNow because you cannot mark transactions as in-progress.
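
What "fake this in an mvcc snapshot" means, roughly (toy, standalone C with
invented names): list those already-committed-but-not-yet-started-at-this-lsn
subxids in the snapshot's in-progress array, so the normal visibility rules
treat them as invisible:

#include <stdint.h>
#include <stdio.h>

typedef uint32_t TransactionId;

/* a stripped-down mvcc-style snapshot: xids in xip[] count as in progress */
typedef struct
{
    TransactionId xmin;
    TransactionId xmax;
    TransactionId xip[8];
    int           xcnt;
} ToySnapshot;

/*
 * Simplified visibility rule, ignoring clog lookups and xid wraparound:
 * xids at or beyond xmax, or listed in xip[], count as in progress and
 * therefore invisible.
 */
static int
xid_visible(const ToySnapshot *snap, TransactionId xid)
{
    if (xid >= snap->xmax)
        return 0;
    for (int i = 0; i < snap->xcnt; i++)
        if (snap->xip[i] == xid)
            return 0;
    return 1;
}

int
main(void)
{
    ToySnapshot snap = { .xmin = 90, .xmax = 200, .xcnt = 0 };

    /* subxact 150 has committed by now, but at the lsn we are decoding it
     * had not yet started: list it as in-progress to hide it */
    snap.xip[snap.xcnt++] = 150;

    printf("xid 120 visible: %d\n", xid_visible(&snap, 120));   /* 1 */
    printf("xid 150 visible: %d\n", xid_visible(&snap, 150));   /* 0 */
    return 0;
}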

Thanks,

Andres

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Florian Pflug <fgp(at)phlo(dot)org>, Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: Catalog/Metadata consistency during changeset extraction from wal
Date: 2012-06-24 21:11:07
Message-ID: 201206242311.07948.andres@2ndquadrant.com
Lists: pgsql-hackers

On Thursday, June 21, 2012 01:41:25 PM Andres Freund wrote:
> Below are two possible implementation strategies for that concept
>
> Advantages:
> * Decoding is done on the master in an asynchronous fashion
> * low overhead during normal DML execution, not much additional code in
> that path
> * can be very efficient if architecture/version are the same
> * version/architecture compatibility can be done transparently by falling
> back to textual versions on mismatch
>
> Disadvantages:
> * decoding probably has to happen on the master which might not be what
> people want performancewise

> 3b)
> Ensure that enough information in the catalog remains by fudging the xmin
> horizon. Then reassemble an appropriate snapshot to read the catalog as
> the tuple in question has seen it.
>
> Advantages:
> * should be implementable with low impact to general code
>
> Disadvantages:
> * requires some complex code for assembling snapshots
> * it might be hard to guarantee that we always have enough information to
> reassemble a snapshot (subxid overflows ...)
> * impacts vacuum if replication to some site is slow
There are some interesting problems related to locking and snapshots here. Not
sure if they are resolvable:

We need to restrict SnapshotNow to represent the view it had back when the
wal record we're currently decoding was created. Otherwise we would possibly
get wrong column types and similar. As we're working in the past, locking
doesn't protect us against much here. I have that (mostly and inefficiently).

One interesting problem is table rewrites (truncate, cluster, some ALTER
TABLE's) and dropping tables. Because we nudge SnapshotNow to the past view it
had back when the wal record was created, we get the old relfilenode. Which
might have been dropped as part of the transaction cleanup...
With most types that's not a problem. Even things like records and arrays
aren't problematic. More interesting cases include VACUUM FULL $systable (e.g.
pg_enum) and vacuum full'ing a table which is used in the *_out function of a
type (like a user level pg_enum implementation).

The only theoretical way I see around that problem would be to postpone all
relation unlinks until everything that could possibly read them has finished.
Doesn't seem too alluring, although it would be needed if we ever move more
things off SnapshotNow.

Input/Ideas/Opinions?

Greetings,

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Robert Haas <robertmhaas(at)gmail(dot)com>, Florian Pflug <fgp(at)phlo(dot)org>
Subject: Re: Catalog/Metadata consistency during changeset extraction from wal
Date: 2012-06-24 21:37:26
Message-ID: CA+U5nMLVfMgLe3bB9gXah9tdGmTWyFAvfDHxR+s6Nk2byo+Fig@mail.gmail.com
Lists: pgsql-hackers

On 24 June 2012 22:11, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:

> One interesting problem are table rewrites (truncate, cluster, some ALTER
> TABLE's) and dropping tables. Because we nudge SnapshotNow to the past view it
> had back when the wal record was created we get the old relfilenode. Which
> might have been dropped in part of the transaction cleanup...
> With most types thats not a problem. Even things like records and arrays
> aren't problematic. More interesting cases include VACUUM FULL $systable (e.g.
> pg_enum) and vacuum full'ing a table which is used in the *_out function of a
> type (like a user level pg_enum implementation).

That's only a problem if you are generating changes to the relfilenode
rather than the relid.

ISTM that this step differs depending upon whether we are generating
portable SQL, or whether we are generating changes for immediate
apply. If it is the latter, then it should never actually happen
because if a table rewrite occurred and then committed we would never
need to re-read earlier WAL.

So treating this as a generic problem leads to some weird results that
we don't need to worry about because they can't actually happen.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Florian Pflug <fgp(at)phlo(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: Catalog/Metadata consistency during changeset extraction from wal
Date: 2012-06-24 21:41:42
Message-ID: CA+U5nMJ=jWcTyZK2GPoNv5g5tKg9sRejVnR9B5=YqxHZ_bgpHw@mail.gmail.com
Lists: pgsql-hackers

On 22 June 2012 20:30, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:

> The problem making replacement of SnapshotNow.satisfies useful is that there is
> no convenient way to represent subtransactions of the current transaction
> which already have committed according to the TransactionLog but aren't yet
> visible at the current lsn because they only started afterwards. Its
> relatively easy to fake this in an mvcc snapshot but way harder for
> SnapshotNow because you cannot mark transactions as in-progress.

I'm starting to like these ideas now.

We can solve many things by emitting a new WAL record type in any
subtransaction that issues catalog changes. That wasn't possible in
Hot Standby for performance reasons, but since we only care about
catalog changes those things are much rarer and wouldn't be a problem.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Florian Pflug <fgp(at)phlo(dot)org>
Subject: Re: Catalog/Metadata consistency during changeset extraction from wal
Date: 2012-06-24 21:50:03
Message-ID: 201206242350.03856.andres@2ndquadrant.com
Lists: pgsql-hackers

On Sunday, June 24, 2012 11:37:26 PM Simon Riggs wrote:
> On 24 June 2012 22:11, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> > One interesting problem are table rewrites (truncate, cluster, some ALTER
> > TABLE's) and dropping tables. Because we nudge SnapshotNow to the past
> > view it had back when the wal record was created we get the old
> > relfilenode. Which might have been dropped in part of the transaction
> > cleanup...
> > With most types thats not a problem. Even things like records and arrays
> > aren't problematic. More interesting cases include VACUUM FULL $systable
> > (e.g. pg_enum) and vacuum full'ing a table which is used in the *_out
> > function of a type (like a user level pg_enum implementation).

> That's only a problem if you are generating changes to the relfilenode
> rather than the relid.
Hm. I can't follow so far. Could you paraphrase?

> ISTM that this step differs depending upon whether we are generating
> portable SQL, or whether we are generating changes for immediate
> apply.
I fear only generating changes for immediate, low-level apply isn't going to
fly, given the various interests people have voiced.

> If it is the latter, then it should never actually happen because if a table
> rewrite occurred and then committed we would never need to re-read earlier
> WAL.
> So treating this as a generic problem leads to some weird results that
> we don't need to worry about cos they can't actually happen.
Well, even if it were true that we don't need to worry about the state before
a full-table rewrite - I don't think it is - we still need to be able to cope
with CLUSTER or VACUUM FULL...

Greetings,

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Robert Haas <robertmhaas(at)gmail(dot)com>, Florian Pflug <fgp(at)phlo(dot)org>
Subject: Re: Catalog/Metadata consistency during changeset extraction from wal
Date: 2012-06-24 22:01:13
Message-ID: CA+U5nMJx9NsiY+cXqQOO5L-eZejM0Vo-4w+1uBoLbx0g8G+Uew@mail.gmail.com
Lists: pgsql-hackers

On 24 June 2012 22:50, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> On Sunday, June 24, 2012 11:37:26 PM Simon Riggs wrote:
>> On 24 June 2012 22:11, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> > One interesting problem are table rewrites (truncate, cluster, some ALTER
>> > TABLE's) and dropping tables. Because we nudge SnapshotNow to the past
>> > view it had back when the wal record was created we get the old
>> > relfilenode. Which might have been dropped in part of the transaction
>> > cleanup...
>> > With most types thats not a problem. Even things like records and arrays
>> > aren't problematic. More interesting cases include VACUUM FULL $systable
>> > (e.g. pg_enum) and vacuum full'ing a table which is used in the *_out
>> > function of a type (like a user level pg_enum implementation).
>
>> That's only a problem if you are generating changes to the relfilenode
>> rather than the relid.
> Hm. I can't follow so far. Could you paraphrase?

Basically, whether we generate SQL or not.

>> ISTM that this step differs depending upon whether we are generating
>> portable SQL, or whether we are generating changes for immediate
>> apply.
> I fear only generating changes for immediate, low-level apply is going to fly
> given the various interests people have voiced.

I'm OK with that, just checking what the objectives are.

>> If it is the latter, then it should never actually happen because if a table
>> rewrite occurred and then committed we would never need to re-read earlier
>> WAL.
>> So treating this as a generic problem leads to some weird results that
>> we don't need to worry about cos they can't actually happen.
> Well, even if it were true that we don't need to worry about the state before
> a full-table rewritte - I don't think it is - we still need to be able to cope
> with CLUSTER or VACUUM FULL...

If you have a WAL record for a new relfilenode, then you don't need to
read the catalog at all.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Florian Pflug <fgp(at)phlo(dot)org>, Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: Catalog/Metadata consistency during changeset extraction from wal
Date: 2012-06-25 01:08:51
Message-ID: CA+TgmoYTNLG5tmtyEcZfMnzrMYKBS=iF+YKE2+K_fEmKd=+ekg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Jun 24, 2012 at 5:11 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> There are some interesting problems related to locking and snapshots here. Not
> sure if they are resolvable:
>
> We need to restrict SnapshotNow to represent to the view it had back when the
> wal record were currently decoding had. Otherwise we would possibly get wrong
> column types and similar. As were working in the past locking doesn't protect
> us against much here. I have that (mostly and inefficiently).
>
> One interesting problem are table rewrites (truncate, cluster, some ALTER
> TABLE's) and dropping tables. Because we nudge SnapshotNow to the past view it
> had back when the wal record was created we get the old relfilenode. Which
> might have been dropped in part of the transaction cleanup...
> With most types thats not a problem. Even things like records and arrays
> aren't problematic. More interesting cases include VACUUM FULL $systable (e.g.
> pg_enum) and vacuum full'ing a table which is used in the *_out function of a
> type (like a user level pg_enum implementation).
>
> The only theoretical way I see against that problem would be to postpone all
> relation unlinks untill everything that could possibly read them has finished.
> Doesn't seem to alluring although it would be needed if we ever move more
> things of SnapshotNow.
>
> Input/Ideas/Opinions?

Yeah, this is slightly nasty. I'm not sure whether or not there's a
way to make it work.

I had another idea. Suppose decoding happens directly on the primary,
because I'm still hoping there's a way to swing that. Suppose further
that we handle DDL by insisting that (1) any backend which wants to
add columns or change the types of existing columns must first wait
for logical replication to catch up and (2) if a backend which has
added columns or changed the types of existing columns then writes to
the modified table, decoding of those writes will be postponed until
transaction commit. I think that's enough to guarantee that the
decoding process can just use the catalogs as they stand, with plain
old SnapshotNow.
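
As a concrete illustration of the two rules (a hedged sketch; the table name
t is made up), the kind of transaction they target would look like this:

BEGIN;
ALTER TABLE t ADD COLUMN flags int4;  -- rule (1): wait until decoding has caught up
UPDATE t SET flags = 0;               -- rule (2): decoding of this write is
COMMIT;                               --   postponed until the commit is seen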

The downside of this approach is that it makes certain kinds of DDL
suck worse if logical replication is in use and behind. But I don't
necessarily see that as prohibitive because (1) logical replication
being behind is likely to suck for a lot of other reasons too and (2)
adding or retyping columns isn't a terribly frequent operation and
people already expect a hit when they do it. Also, I suspect that we
could find ways to loosen those restrictions at least in common cases
in some future version; meanwhile, less work now.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Amit Kapila <amit(dot)kapila(at)huawei(dot)com>
To: "'Robert Haas'" <robertmhaas(at)gmail(dot)com>, "'Andres Freund'" <andres(at)2ndquadrant(dot)com>
Cc: <pgsql-hackers(at)postgresql(dot)org>, "'Florian Pflug'" <fgp(at)phlo(dot)org>, "'Simon Riggs'" <simon(at)2ndquadrant(dot)com>
Subject: Re: Catalog/Metadata consistency during changeset extraction from wal
Date: 2012-06-25 05:16:34
Message-ID: 003501cd5291$b1397240$13ac56c0$@kapila@huawei.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

-----Original Message-----
From: pgsql-hackers-owner(at)postgresql(dot)org
[mailto:pgsql-hackers-owner(at)postgresql(dot)org] On Behalf Of Robert Haas
Sent: Monday, June 25, 2012 6:39 AM
To: Andres Freund
Cc: pgsql-hackers(at)postgresql(dot)org; Florian Pflug; Simon Riggs
Subject: Re: [HACKERS] Catalog/Metadata consistency during changeset
extraction from wal

On Sun, Jun 24, 2012 at 5:11 PM, Andres Freund <andres(at)2ndquadrant(dot)com>
wrote:
> There are some interesting problems related to locking and snapshots here. Not
> sure if they are resolvable:
>
> We need to restrict SnapshotNow to represent to the view it had back when the
> wal record were currently decoding had. Otherwise we would possibly get wrong
> column types and similar. As were working in the past locking doesn't protect
> us against much here. I have that (mostly and inefficiently).
>
> One interesting problem are table rewrites (truncate, cluster, some ALTER
> TABLE's) and dropping tables. Because we nudge SnapshotNow to the past view it
> had back when the wal record was created we get the old relfilenode. Which
> might have been dropped in part of the transaction cleanup...
> With most types thats not a problem. Even things like records and arrays
> aren't problematic. More interesting cases include VACUUM FULL $systable (e.g.
> pg_enum) and vacuum full'ing a table which is used in the *_out function of a
> type (like a user level pg_enum implementation).
>
> The only theoretical way I see against that problem would be to postpone all
> relation unlinks untill everything that could possibly read them has finished.
> Doesn't seem to alluring although it would be needed if we ever move more
> things of SnapshotNow.
>
> Input/Ideas/Opinions?

> Yeah, this is slightly nasty. I'm not sure whether or not there's a
> way to make it work.

> I had another idea. Suppose decoding happens directly on the primary,
> because I'm still hoping there's a way to swing that. Suppose further
> that we handle DDL by insisting that (1) any backend which wants to
> add columns or change the types of existing columns must first wait
> for logical replication to catch up and (2) if a backend which has
> added columns or changed the types of existing columns then writes to
> the modified table, decoding of those writes will be postponed until
> transaction commit. I think that's enough to guarantee that the
> decoding process can just use the catalogs as they stand, with plain
> old SnapshotNow.

What will be the way to make the DDLs wait for logical replication to catch
up?
One option would be to compare against the WAL location up to which logical
records have already been generated.
If we do this while holding table-level locks (which the DDL requires anyway),
then other DML will also get blocked.
And if we do it without holding locks and then decide to go ahead, other WAL
can be generated in the meantime and can create a problem.

Or is there another way to make the DDLs wait?
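
A minimal sketch of the "compare WAL locations" idea (purely hypothetical; no
such progress view exists yet): the DDL backend would note the current insert
position and block until the decoder reports having consumed at least that
much WAL.

-- in the backend about to run the DDL: the position decoding must reach
SELECT pg_current_xlog_location();
-- then poll some (hypothetical) report of decoder progress, e.g.:
-- SELECT decoded_up_to FROM logical_decoding_progress WHERE slot = 'myslot';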

With Regards,
Amit Kapila.


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Florian Pflug <fgp(at)phlo(dot)org>, Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: Catalog/Metadata consistency during changeset extraction from wal
Date: 2012-06-25 13:43:39
Message-ID: 201206251543.40142.andres@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Monday, June 25, 2012 03:08:51 AM Robert Haas wrote:
> On Sun, Jun 24, 2012 at 5:11 PM, Andres Freund <andres(at)2ndquadrant(dot)com>
wrote:
> > There are some interesting problems related to locking and snapshots
> > here. Not sure if they are resolvable:
> >
> > We need to restrict SnapshotNow to represent to the view it had back when
> > the wal record were currently decoding had. Otherwise we would possibly
> > get wrong column types and similar. As were working in the past locking
> > doesn't protect us against much here. I have that (mostly and
> > inefficiently).
> >
> > One interesting problem are table rewrites (truncate, cluster, some ALTER
> > TABLE's) and dropping tables. Because we nudge SnapshotNow to the past
> > view it had back when the wal record was created we get the old
> > relfilenode. Which might have been dropped in part of the transaction
> > cleanup...
> > With most types thats not a problem. Even things like records and arrays
> > aren't problematic. More interesting cases include VACUUM FULL $systable
> > (e.g. pg_enum) and vacuum full'ing a table which is used in the *_out
> > function of a type (like a user level pg_enum implementation).
> >
> > The only theoretical way I see against that problem would be to postpone
> > all relation unlinks untill everything that could possibly read them has
> > finished. Doesn't seem to alluring although it would be needed if we
> > ever move more things of SnapshotNow.
> >
> > Input/Ideas/Opinions?
>
> Yeah, this is slightly nasty. I'm not sure whether or not there's a
> way to make it work.
Postponing all non-rollback unlinks to the next "logical checkpoint" is the
only thing I can think of...

> I had another idea. Suppose decoding happens directly on the primary,
> because I'm still hoping there's a way to swing that. Suppose further
> that we handle DDL by insisting that (1) any backend which wants to
> add columns or change the types of existing columns must first wait
> for logical replication to catch up and (2) if a backend which has
> added columns or changed the types of existing columns then writes to
> the modified table, decoding of those writes will be postponed until
> transaction commit. I think that's enough to guarantee that the
> decoding process can just use the catalogs as they stand, with plain
> old SnapshotNow.
I don't think it's that easy. If you e.g. have multiple ALTERs in the same
transaction interspersed with inserted rows, they will all have different
TupleDescs.
I don't see how that's resolvable without either replicating DDL to the target
system or changing what SnapshotNow does...
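
For instance (a hedged sketch on a made-up table t), each of the inserts below
would need a different TupleDesc even though they all commit together:

BEGIN;
INSERT INTO t (id) VALUES (1);                   -- TupleDesc: (id)
ALTER TABLE t ADD COLUMN note text;
INSERT INTO t (id, note) VALUES (2, 'two');      -- TupleDesc: (id, note text)
ALTER TABLE t ALTER COLUMN note TYPE varchar(20);
INSERT INTO t (id, note) VALUES (3, 'three');    -- TupleDesc: (id, note varchar)
COMMIT;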

> The downside of this approach is that it makes certain kinds of DDL
> suck worse if logical replication is in use and behind. But I don't
> necessarily see that as prohibitive because (1) logical replication
> being behind is likely to suck for a lot of other reasons too and (2)
> adding or retyping columns isn't a terribly frequent operation and
> people already expect a hit when they do it. Also, I suspect that we
> could find ways to loosen those restrictions at least in common cases
> in some future version; meanwhile, less work now.
Agreed.

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Florian Pflug <fgp(at)phlo(dot)org>, Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: Catalog/Metadata consistency during changeset extraction from wal
Date: 2012-06-25 15:34:13
Message-ID: CA+TgmobL-GY7pXou-9huHiyEU3so1YYXBocXMcUR1OeoY3rMhQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Jun 25, 2012 at 9:43 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> > The only theoretical way I see against that problem would be to postpone
>> > all relation unlinks untill everything that could possibly read them has
>> > finished. Doesn't seem to alluring although it would be needed if we
>> > ever move more things of SnapshotNow.
>> >
>> > Input/Ideas/Opinions?
>>
>> Yeah, this is slightly nasty.  I'm not sure whether or not there's a
>> way to make it work.
> Postponing all non-rollback unlinks to the next "logical checkpoint" is the
> only thing I can think of...

There are a number of cool things we could do if we postponed unlinks.
Like, why can't we allow concurrent read-only queries while a CLUSTER
operation is in progress? Well, two reasons. The first is that we
currently can't do ANY DDL with less than a full table lock because of
SnapshotNow-related race conditions. The second is that people might
still need to look at the old heap after the CLUSTER transaction
commits. Some kind of delayed unlink facility where we
garbage-collect relation backing files when their refcount falls to
zero would solve the second problem - not that that's any help by
itself without a solution to the first one, but hey.

>> I had another idea.  Suppose decoding happens directly on the primary,
>> because I'm still hoping there's a way to swing that.  Suppose further
>> that we handle DDL by insisting that (1) any backend which wants to
>> add columns or change the types of existing columns must first wait
>> for logical replication to catch up and (2) if a backend which has
>> added columns or changed the types of existing columns then writes to
>> the modified table, decoding of those writes will be postponed until
>> transaction commit.  I think that's enough to guarantee that the
>> decoding process can just use the catalogs as they stand, with plain
>> old SnapshotNow.
> I don't think its that easy. If you e.g. have multiple ALTER's in the same
> transaction interspersed with inserted rows they will all have different
> TupleDesc's.

If new columns were added, then tuples created with those older
tuple-descriptors can still be interpreted with the latest
tuple-descriptor.

Columns that are dropped or retyped are a little trickier, but
honestly... how much do we care about those cases? How practical is
it to suppose we're going to be able to handle them sanely anyway?
Suppose that the user defines a type which works just like int4 except
that the output function writes out each number in pig latin (and the
input function parses pig latin). The user defines the types as
binary coercible to each other and then does ALTER TABLE on a large
table with an int4 column, transforming it into an int4piglatin
column. Due to Noah Misch's fine work, we will conclude that no table
rewrite is needed. But if logical replication is in use, then in
theory we should scan the whole table and generate an LCR for each row
saying "the row with primary key X was updated, and column Y, which
used to contain 42, now contains ourty-two-fay". Otherwise, if we're
doing heterogeneous replication into a system that just stores that
column as text, it'll end up with the wrong contents. On the other
hand, if we're trying to ship data to another PostgreSQL instance
where the column hasn't yet been updated, then all of those LCRs are
just going to error out when we try to apply them.
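
Roughly how such a no-rewrite type change could be set up (a hedged sketch;
int4piglatin and its in/out functions are assumed to already exist with the
same internal representation as int4, and the table/column names are made up):

CREATE CAST (int4 AS int4piglatin) WITHOUT FUNCTION AS IMPLICIT;
CREATE CAST (int4piglatin AS int4) WITHOUT FUNCTION;
-- binary coercible, so ALTER TABLE does not need a table rewrite:
ALTER TABLE big_table ALTER COLUMN y TYPE int4piglatin;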

A more realistic scenario where you have the same problem is with
something like ALTER TABLE .. ADD COLUMN .. DEFAULT. If you add a
column with a default in a single step (as opposed to first adding the
column and then setting its default), we rewrite the table and set
every row to the default value. Should that generate LCRs showing
every row being updated to add that new value, or should we generate
no LCRs and assume that the DBA will independently do the same
operation on the remote side? Either answer could be correct,
depending on how the LCRs are being used. If you're just rewriting
with a constant default, then perhaps the sensible thing is to
generate no LCRs, since it will be more efficient to mimic the
operation on the remote side than to replay the changes row-by-row.
But what if the default isn't a constant, like maybe it's
nextval('new_synthetic_pkey_seq') or even something like now()? In
those cases, it seems quite likely that if you don't generate LCRs,
manual user intervention will be required to get things back on track.
On the other hand, if you do generate LCRs, the remote side will
become horribly bloated on replay, unless the LCRs also instruct the
far side that they should be applied via a full-table rewrite.
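
For reference, the two variants described above (a sketch on a made-up table):

-- single step: rewrites the table, stamping the default into every row
ALTER TABLE t ADD COLUMN created timestamptz DEFAULT now();

-- two steps: no rewrite, existing rows simply read the new column as NULL
ALTER TABLE t ADD COLUMN created timestamptz;
ALTER TABLE t ALTER COLUMN created SET DEFAULT now();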

Can we just agree to punt all this complexity for version 1 (and maybe
versions 2, 3, and 4)? I'm not sure what Slony does in situations
like this but I bet for a lot of replication systems, the answer is
"do a full resync". In other words, we either forbid the operation
outright when the table is enabled for logical replication, or else we
emit an LCR that says, in effect, "transaction 12345 monkeyed with the
table, please resync". It strikes me that it's really the job of some
higher-level control logic to decide what the "correct" behavior is in
these cases; the decoding process doesn't really have enough
information about what the user is trying to do to make a sensible
decision anyway. It would be nice to be able to support some simple
cases like "adding a column that has no default" or "dropping a
column" without punting, but going much further than that seems like
it will require embedding policy decisions that should really be
happening at a higher level.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Andres Freund" <andres(at)2ndquadrant(dot)com>, "Robert Haas" <robertmhaas(at)gmail(dot)com>
Cc: "Simon Riggs" <simon(at)2ndquadrant(dot)com>, "Florian Pflug" <fgp(at)phlo(dot)org>,<pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Catalog/Metadata consistency during changeset extraction from wal
Date: 2012-06-25 16:42:41
Message-ID: 4FE84EB10200002500048AA2@gw.wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> I bet for a lot of replication systems, the answer is "do a full
> resync". In other words, we either forbid the operation outright
> when the table is enabled for logical replication, or else we emit
> an LCR that says, in effect, "transaction 12345 monkeyed with the
> table, please resync". It strikes me that it's really the job of
> some higher-level control logic to decide what the "correct"
> behavior is in these cases; the decoding process doesn't really
> have enough information about what the user is trying to do to
> make a sensible decision anyway.

This is clearly going to depend on the topology. You would
definitely want to try to replicate the DDL for the case on which
Simon is focused (which seems to me to be essentially physical
replication of catalogs with logical replication of data changes
from any machine to all others). What you do about transactions in
flight is the hard part. You could try to suppress concurrent DML
of the same objects or have some complex matrix of rules for trying
to resolve the transactions in flight. I don't see how the latter
could ever be 100% accurate.

In our shop it is much easier. We always have one database which is
the only valid source for any tuple, although rows from many such
databases can be in one table, and one row might replicate to many
databases. Thus, we don't want automatic replication of DDL.

- When a column is going to be added to the source machines, we
first add it to the targets, with either a default or as
NULL-capable (see the sketch after this list).

- When a column is going to be deleted from the source machines, we
make sure it is NULL-capable or has a default on the replicas.
We drop it from all replicas after it is gone from all sources.

- If a column is changing name or is changing to a fundamentally
different type we need to give the new column a new name, have
triggers to convert old to new (and vice versa) on the replicas,
and drop the old after all sources are updated.

- If a column is changing in a minor way, like its precision, we
make sure the replicas can accept either format until all sources
have been converted. We update the replicas to match the sources
after all sources are converted.
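
A minimal sketch of the add-column case, with made-up object names:

-- on every replica, ahead of the source deployment:
ALTER TABLE case_event ADD COLUMN severity int;  -- NULL-capable is enough
-- later, on each source database, as part of the release:
ALTER TABLE case_event ADD COLUMN severity int NOT NULL DEFAULT 0;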

We most particularly *don't* want DDL to replicate automatically,
because the schema changes are deployed along with related software
changes, and we like to pilot any changes for at least a few days.
Depending on the release, the rollout may take a couple months, or
we may slam it out everywhere a few days after the first pilot
deployment.

So you could certainly punt all of this for any release as far as
Wisconsin Courts are concerned. We need to know table and column
names, before and after images, and some application-supplied
metadata.

I don't know that what we're looking for is any easier (although I
doubt that it's any harder), but I'm starting to wonder how much
mechanism they can really share. The 2Q code is geared toward page
format OIDs and data values for automated DDL distribution and
faster replication, while we're looking for something which works
between releases, architectures, and OSes. We keep coming back to
the idea of one mechanism because both WAL and a logical transaction
stream would have "after" tuples, although they need them in
different formats.

I think the need for truly logical replication is obvious, since so
many different people have developed trigger-based versions of that.
And it sure seems like 2Q has clients who are willing to pay for the
other.

Perhaps the first question is: Is there enough in common between
logical replication (and all the topologies that might be created
with that) and the proposal on the table (which seems to be based
around one particular topology with a vague notion of bolting
logical replication on to it after the fact) to try to resolve the
differences in one feature? Or should the "identical schema with
multiple identical copies" case be allowed to move forward more or
less in isolation, with logical replication having its own design if
and when someone wants to take it on? Two non-compromised features
might be cleaner -- I'm starting to feel like we're trying to design
a toaster which can also water your garden.

-Kevin


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Florian Pflug <fgp(at)phlo(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Catalog/Metadata consistency during changeset extraction from wal
Date: 2012-06-25 17:09:38
Message-ID: CA+TgmoZtox_qytZQwNq9LXSmnx0bcirF6tt0gVJgbL8wOF4Ftg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Jun 25, 2012 at 12:42 PM, Kevin Grittner
<Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
> Perhaps the first question is: Is there enough in common between
> logical replication (and all the topologies that might be created
> with that) and the proposal on the table (which seems to be based
> around one particular topology with a vague notion of bolting
> logical replication on to it after the fact) to try to resolve the
> differences in one feature?  Or should the "identical schema with
> multiple identical copies" case be allowed to move forward more or
> less in isolation, with logical replication having its own design if
> and when someone wants to take it on?  Two non-compromised features
> might be cleaner -- I'm starting to feel like we're trying to design
> a toaster which can also water your garden.

I think there are a number of shared pieces. Being able to read WAL
and do something with it is a general need that both solutions share;
I think actually that might be the piece that we should try to get
committed first. I suspect that there are a number of applications
for just that and nothing more - for example, it might allow a contrib
module that reads WAL as it's generated and prints out a debug trace,
which I can imagine being useful.

Also, I think that even for MMR there will be a need for control
logic, resynchronization, and similar mechanisms. I mean, suppose you
have four servers in an MMR configuration. Now, you want to deploy a
schema change that adds a new column and which, as it so happens,
requires a table rewrite to add the default. It is very possible that
you do NOT want that to automatically replicate around the cluster.
Instead, you likely want to redirect load to the remaining three
servers, do the change on the fourth, put it back into the ring and
take out a different one, do the change on that one, and so on.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Florian Pflug <fgp(at)phlo(dot)org>, Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: Catalog/Metadata consistency during changeset extraction from wal
Date: 2012-06-25 17:50:08
Message-ID: 201206251950.08949.andres@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Monday, June 25, 2012 05:34:13 PM Robert Haas wrote:
> On Mon, Jun 25, 2012 at 9:43 AM, Andres Freund <andres(at)2ndquadrant(dot)com>
wrote:
> >> > The only theoretical way I see against that problem would be to
> >> > postpone all relation unlinks untill everything that could possibly
> >> > read them has finished. Doesn't seem to alluring although it would be
> >> > needed if we ever move more things of SnapshotNow.
> >> >
> >> > Input/Ideas/Opinions?
> >>
> >> Yeah, this is slightly nasty. I'm not sure whether or not there's a
> >> way to make it work.
> >
> > Postponing all non-rollback unlinks to the next "logical checkpoint" is
> > the only thing I can think of...
> There are a number of cool things we could do if we postponed unlinks.
> Like, why can't we allow concurrent read-only queries while a CLUSTER
> operation is in progress? Well, two reasons. The first is that we
> currently can't do ANY DDL with less than a full table lock because of
> SnapshotNow-related race conditions. The second is that people might
> still need to look at the old heap after the CLUSTER transaction
> commits. Some kind of delayed unlink facility where we
> garbage-collect relation backing files when their refcount falls to
> zero would solve the second problem - not that that's any help by
> itself without a solution to the first one, but hey.
It's an argument that related infrastructure would be interesting to more than
just this patch, and that's not bad.
If the garbage collection is done in a very simplistic manner it doesn't sound
too hard... The biggest problems are probably crash recovery of that knowledge,
and how to teach the facility that logical replication still needs that data...

> >> I had another idea. Suppose decoding happens directly on the primary,
> >> because I'm still hoping there's a way to swing that. Suppose further
> >> that we handle DDL by insisting that (1) any backend which wants to
> >> add columns or change the types of existing columns must first wait
> >> for logical replication to catch up and (2) if a backend which has
> >> added columns or changed the types of existing columns then writes to
> >> the modified table, decoding of those writes will be postponed until
> >> transaction commit. I think that's enough to guarantee that the
> >> decoding process can just use the catalogs as they stand, with plain
> >> old SnapshotNow.
> >
> > I don't think its that easy. If you e.g. have multiple ALTER's in the
> > same transaction interspersed with inserted rows they will all have
> > different TupleDesc's.
>
> If new columns were added, then tuples created with those older
> tuple-descriptors can still be interpreted with the latest
> tuple-descriptor.
But you need to figure that out. If you have just the before-after images of
the tupledescs you don't know what happened in there... That would mean either
doing special things on catalog changes or reassembling the meaning from the
changed pg_* rows. Neither seems enticing.

> Columns that are dropped or retyped are a little trickier, but
> honestly... how much do we care about those cases? How practical is
> it to suppose we're going to be able to handle them sanely anyway?
> Suppose that the user defines a type which works just like int4 except
> that the output functions writes out each number in pig latin (and the
> input function parses pig latin). The user defines the types as
> binary coercible to each other and then does ALTER TABLE on a large
> table with an int4 column, transforming it into an int4piglatin
> column. Due to Noah Misch's fine work, we will conclude that no table
> rewrite is needed. But if logical replication is in use, then in
> theory we should scan the whole table and generate an LCR for each row
> saying "the row with primary key X was updated, and column Y, which
> used to contain 42, now contains ourty-two-fay". Otherwise, if we're
> doing heterogenous replication into a system that just stores that
> column as text, it'll end up with the wrong contents. On the other
> hand, if we're trying to ship data to another PostgreSQL instance
> where the column hasn't yet been updated, then all of those LCRs are
> just going to error out when we try to apply them.

> A more realistic scenario where you have the same problem is with
> something like ALTER TABLE .. ADD COLUMN .. DEFAULT. If you add a
> column with a default in a single step (as opposed to first adding the
> column and then setting its default), we rewrite the table and set
> every row to the default value. Should that generate LCRs showing
> every row being updated to add that new value, or should we generate
> no LCRs and assume that the DBA will independently do the same
> operation on the remote side? Either answer could be correct,
> depending on how the LCRs are being used. If you're just rewriting
> with a constant default, then perhaps the sensible thing is to
> generate no LCRs, since it will be more efficient to mimic the
> operation on the remote side than to replay the changes row-by-row.
> But what if the default isn't a constant, like maybe it's
> nextval('new_synthetic_pkey_seq') or even something like now(). In
> those cases, it seems quite likely that if you don't generate LCRs,
> manual user intervention will be required to get things back on track.
> On the other hand, if you do generate LCRs, the remote side will
> become horribly bloated on replay, unless the LCRs also instruct the
> far side that they should be applied via a full-table rewrite.
I think that heavily depends on your use-case. In the settings that I/we have
in mind, the (partial) schemas are tightly coupled, so command triggers can
just apply the equivalent changes.
But there will definitely be cases where user errors will cause problems (now()
and nextval(), as you pointed out) or where we just have to punt.

In other scenarios, like Kevin's, the application code will need to make that
decision.

> Can we just agree to punt all this complexity for version 1 (and maybe
> versions 2, 3, and 4)? I'm not sure what Slony does in situations
> like this but I bet for a lot of replication systems, the answer is
> "do a full resync". In other words, we either forbid the operation
> outright when the table is enabled for logical replication, or else we
> emit an LCR that says, in effect, "transaction 12345 monkeyed with the
> table, please resync". It strikes me that it's really the job of some
> higher-level control logic to decide what the "correct" behavior is in
> these cases; the decoding process doesn't really have enough
> information about what the user is trying to do to make a sensible
> decision anyway. It would be nice to be able to support some simple
> cases like "adding a column that has no default" or "dropping a
> column" without punting, but going much further than that seems like
> it will require embedding policy decisions that should really be
> happening at a higher level.
I am totally fine with saying that we do not support everything from the
start. But we need to choose an architecture where it's possible to add that
support gradually, and I don't think that's possible without looking inside
transactions.

Greetings,

Andres

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Florian Pflug <fgp(at)phlo(dot)org>, Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: Catalog/Metadata consistency during changeset extraction from wal
Date: 2012-06-25 18:13:54
Message-ID: CA+TgmoYJevRaDb1mmWf0KpOkhJUsADGc6WR45_TPMymnK=9D7Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Jun 25, 2012 at 1:50 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> Its an argument why related infrastructure would be interesting to more than
> that patch and thats not bad.
> If the garbage collecting is done in a very simplistic manner it doesn't sound
> too hard... The biggest problem is probably crash-recovery of that knowledge
> and how to hook knowledge into it that logical rep needs that data...

I suppose the main reason we haven't done it already is that it
increases the period of time during which we're using 2X the disk
space.

>> > I don't think its that easy. If you e.g. have multiple ALTER's in the
>> > same transaction interspersed with inserted rows they will all have
>> > different TupleDesc's.
>>
>> If new columns were added, then tuples created with those older
>> tuple-descriptors can still be interpreted with the latest
>> tuple-descriptor.
> But you need to figure that out. If you have just the before-after images of
> the tupledescs you don't know what happened in there... That would mean either
> doing special things on catalog changes or reassembling the meaning from the
> changed pg_* rows. Neither seems enticing.

I think there is absolutely nothing wrong with doing extra things in
ALTER TABLE when logical replication is enabled. We've got code
that's conditional on Hot Standby being enabled in many places in the
system; why should logical replication be any different? If we set
the bar for logical replication at "the system can't do anything
differently when logical replication is enabled" then I cheerfully
submit that we are doomed. You've already made WAL format changes to
support logging the pre-image of the tuple, which is a hundred times
more likely to cause a performance problem than any monkeying around
we might want to do in ALTER TABLE.

>> Can we just agree to punt all this complexity for version 1 (and maybe
>> versions 2, 3, and 4)?  I'm not sure what Slony does in situations
>> like this but I bet for a lot of replication systems, the answer is
>> "do a full resync".  In other words, we either forbid the operation
>> outright when the table is enabled for logical replication, or else we
>> emit an LCR that says, in effect, "transaction 12345 monkeyed with the
>> table, please resync".  It strikes me that it's really the job of some
>> higher-level control logic to decide what the "correct" behavior is in
>> these cases; the decoding process doesn't really have enough
>> information about what the user is trying to do to make a sensible
>> decision anyway.  It would be nice to be able to support some simple
>> cases like "adding a column that has no default" or "dropping a
>> column" without punting, but going much further than that seems like
>> it will require embedding policy decisions that should really be
>> happening at a higher level.
> I am totally fine with saying that we do not support everything from the
> start. But we need to choose an architecture where its possible to add that
> support gradually and I don't think without looking inside transaction makes
> that possible.

I am deeply skeptical that we need to look inside of transactions that
do full-table rewrites. But even if we do, I don't see that what I'm
proposing precludes it. For example, I think we could have ALTER
TABLE emit WAL records specifically for logical replication that allow
us to disentangle which tuple descriptor to use at which point in the
transaction. I don't see that that would even be very difficult to
set up.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Florian Pflug <fgp(at)phlo(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Catalog/Metadata consistency during changeset extraction from wal
Date: 2012-06-25 18:19:47
Message-ID: 201206252019.48154.andres@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

(munching the mail from Robert and Kevin together)

On Monday, June 25, 2012 06:42:41 PM Kevin Grittner wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> > I bet for a lot of replication systems, the answer is "do a full
> > resync". In other words, we either forbid the operation outright
> > when the table is enabled for logical replication, or else we emit
> > an LCR that says, in effect, "transaction 12345 monkeyed with the
> > table, please resync". It strikes me that it's really the job of
> > some higher-level control logic to decide what the "correct"
> > behavior is in these cases; the decoding process doesn't really
> > have enough information about what the user is trying to do to
> > make a sensible decision anyway.
>
> This is clearly going to depend on the topology. You would
> definitely want to try to replicate the DDL for the case on which
> Simon is focused (which seems to me to be essentially physical
> replication of catalogs with logical replication of data changes
> from any machine to all others). What you do about transactions in
> flight is the hard part. You could try to suppress concurrent DML
> of the same objects or have some complex matrix of rules for trying
> to resolve the transactions in flight. I don't see how the latter
> could ever be 100% accurate.
Yes. That's why I dislike that proposal. I don't think it's going to be
understandable and robust enough.

If we really look inside transactions (options 3b and 1) that shouldn't be a
problem though. So I think it really has to be one of those.

> In our shop it is much easier. We always have one database which is
> the only valid source for any tuple, although rows from many such
> databases can be in one table, and one row might replicate to many
> databases. Thus, we don't want automatic replication of DDL.
>
> - When a column is going to be added to the source machines, we
> first add it to the targets, with either a default or as
> NULL-capable.
>
> - When a column is going to be deleted from the source machines, we
> make sure it is NULL-capable or has a default on the replicas.
> We drop it from all replicas after it is gone from all sources.
>
> - If a column is changing name or is changing to a fundamentally
> different type we need to give the new column a new name, have
> triggers to convert old to new (and vice versa) on the replicas,
> and drop the old after all sources are updated.
>
> - If a column is changing in a minor way, like its precision, we
> make sure the replicas can accept either format until all sources
> have been converted. We update the replicas to match the sources
> after all sources are converted.
>
> We most particularly *don't* want DDL to replicate automatically,
> because the schema changes are deployed along with related software
> changes, and we like to pilot any changes for at least a few days.
> Depending on the release, the rollout may take a couple months, or
> we may slam in out everywhere a few days after the first pilot
> deployment.
That's sensible for your use-case - but I do not think it's the appropriate
behaviour for anything which is somewhat out-of-the-box...

> So you could certainly punt all of this for any release as far as
> Wisconsin Courts are concerned. We need to know table and column
> names, before and after images, and some application-supplied
> metadata.
I am not sure we're going to get all that into 9.3. More on that below.

On Monday, June 25, 2012 07:09:38 PM Robert Haas wrote:
> On Mon, Jun 25, 2012 at 12:42 PM, Kevin Grittner wrote:
> > I don't know that what we're looking for is any easier (although I
> > doubt that it's any harder), but I'm starting to wonder how much
> > mechanism they can really share. The 2Q code is geared toward page
> > format OIDs and data values for automated DDL distribution and
> > faster replication, while we're looking for something which works
> > between releases, architectures, and OSes. We keep coming back to
> > the idea of one mechanism because both WAL and a logical transaction
> > stream would have "after" tuples, although they need them in
> > different formats.
> >
> > I think the need for truly logical replication is obvious, since so
> > many different people have developed trigger-based versions of that.
> > And it sure seems like 2Q has clients who are willing to pay for the
> > other.
> >
> > Perhaps the first question is: Is there enough in common between
> > logical replication (and all the topologies that might be created
> > with that) and the proposal on the table (which seems to be based
> > around one particular topology with a vague notion of bolting
> > logical replication on to it after the fact) to try to resolve the
> > differences in one feature? Or should the "identical schema with
> > multiple identical copies" case be allowed to move forward more or
> > less in isolation, with logical replication having its own design if
> > and when someone wants to take it on? Two non-compromised features
> > might be cleaner -- I'm starting to feel like we're trying to design
> > a toaster which can also water your garden.
I think there are some pieces which can be shared without too many problems
(general wal reading, enough information in wal for decoding, new wal level,
transaction reassembly, ...). Other pieces are less clear (wal decoding,
transport format, ddl handling ...) and others clearly won't be shared (low
level apply, conflict resolution hooks, ...).
Whether some of that will be shareable between different interests also
depends on how many people are willing to chime in and participate. While I am
happy to put in some extra time to make stuff fully generic, there
unfortunately (I really mean that!) are limits on how much effort & time I can
pour into it. I find that to be an important point.

Another factor obviously is how hard it is to make something generic ;). The
whole problem is not a simple one, otherwise we would already have existing
solutions...

> I think there are a number of shared pieces. Being able to read WAL
> and do something with it is a general need that both solutions share;
> I think actually that might be the piece that we should try to get
> committed first. I suspect that there are a number of applications
> for just that and nothing more - for example, it might allow a contrib
> module that reads WAL as it's generated and prints out a debug trace,
> which I can imagine being useful.
I plan to revise my patch for that now that Heikki's most invasive changes
have been committed. Reworking the innards of XLogInsert shouldn't change the
wal formats anymore...

> Also, I think that even for MMR there will be a need for control
> logic, resynchronization, and similar mechanisms. I mean, suppose you
> have four servers in an MMR configuration. Now, you want to deploy a
> schema change that adds a new column and which, as it so happens,
> requires a table rewrite to add the default. It is very possible that
> you do NOT want that to automatically replicate around the cluster.
> Instead, you likely want to redirect load to the remaining three
> servers, do the change on the fourth, put it back into the ring and
> take out a different one, do the change on that one, and so on.
That's all nice, but I think it's pretty clear we're not getting anything that
sophisticated in the near future ;)

Greetings,

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Andres Freund" <andres(at)2ndquadrant(dot)com>, "Robert Haas" <robertmhaas(at)gmail(dot)com>
Cc: "Simon Riggs" <simon(at)2ndquadrant(dot)com>, "Florian Pflug" <fgp(at)phlo(dot)org>,<pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Catalog/Metadata consistency during changeset extraction from wal
Date: 2012-06-25 18:50:54
Message-ID: 4FE86CBE0200002500048AB2@gw.wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andres Freund <andres(at)2ndquadrant(dot)com> wrote:

>> We most particularly *don't* want DDL to replicate automatically,
>> because the schema changes are deployed along with related
>> software changes, and we like to pilot any changes for at least a
>> few days. Depending on the release, the rollout may take a
>> couple months, or we may slam in out everywhere a few days after
>> the first pilot deployment.
> Thats a sensible for your use-case - but I do not think its thats
> the appropriate behaviour for anything which is somewhat
> out-of-the box...

Right. We currently consider the issues involved in a change and
script it by hand. I think we want to continue to do that. The
point was that, given the variety of timings and techniques for
distributing schema changes, maybe it was only worth doing that
automatically for the most constrained and controlled cases.

>> So you could certainly punt all of this for any release as far as
>> Wisconsin Courts are concerned. We need to know table and column
>> names, before and after images, and some application-supplied
>> metadata.
> I am not sure were going to get all that into 9.3.

Sure, that was more related to why I was questioning how much these
use cases even *could* integrate -- whether it even paid to
*consider* these use cases at this point. Responses from Robert and
you have pretty much convinced me that I was being overly
pessimistic on that.

One fine point regarding before and after images -- if a value
doesn't change in an UPDATE, there's no reason to include it in both
the BEFORE and AFTER tuple images, as long as we have the null
column bitmaps -- or some other way of distinguishing unchanged from
NULL. (This could be especially important when the unchanged column
was a 50 MB bytea.) It doesn't matter to me which way this is
optimized -- in our trigger-based system we chose to keep the full
BEFORE tuple and just show AFTER values which were distinct from the
corresponding BEFORE values, but it would be trivial to adapt the
code to the other way.

Sorry for that bout of pessimism.

-Kevin


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Florian Pflug <fgp(at)phlo(dot)org>, Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: Catalog/Metadata consistency during changeset extraction from wal
Date: 2012-06-25 19:17:46
Message-ID: 201206252117.46836.andres@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

On Monday, June 25, 2012 08:13:54 PM Robert Haas wrote:
> On Mon, Jun 25, 2012 at 1:50 PM, Andres Freund <andres(at)2ndquadrant(dot)com>
wrote:
> > Its an argument why related infrastructure would be interesting to more
> > than that patch and thats not bad.
> > If the garbage collecting is done in a very simplistic manner it doesn't
> > sound too hard... The biggest problem is probably crash-recovery of that
> > knowledge and how to hook knowledge into it that logical rep needs that
> > data...
> I suppose the main reason we haven't done it already is that it
> increases the period of time during which we're using 2X the disk
> space.
I find that an acceptable price if it's optional. Making it such doesn't seem
to be a problem for me.

> >> > I don't think its that easy. If you e.g. have multiple ALTER's in the
> >> > same transaction interspersed with inserted rows they will all have
> >> > different TupleDesc's.
> >>
> >> If new columns were added, then tuples created with those older
> >> tuple-descriptors can still be interpreted with the latest
> >> tuple-descriptor.
> >
> > But you need to figure that out. If you have just the before-after images
> > of the tupledescs you don't know what happened in there... That would
> > mean either doing special things on catalog changes or reassembling the
> > meaning from the changed pg_* rows. Neither seems enticing.
>
> I think there is absolutely nothing wrong with doing extra things in
> ALTER TABLE when logical replication is enabled. We've got code
> that's conditional on Hot Standby being enabled in many places in the
> system; why should logical replication be any different? If we set
> the bar for logical replication at "the system can't do anything
> differently when logical replication is enabled" then I cheerfully
> submit that we are doomed. You've already made WAL format changes to
> support logging the pre-image of the tuple, which is a hundred times
> more likely to cause a performance problem than any monkeying around
> we might want to do in ALTER TABLE.
>
> I am deeply skeptical that we need to look inside of transactions that
> do full-table rewrites. But even if we do, I don't see that what I'm
> proposing precludes it. For example, I think we could have ALTER
> TABLE emit WAL records specifically for logical replication that allow
> us to disentangle which tuple descriptor to use at which point in the
> transaction. I don't see that that would even be very difficult to
> set up.
Sorry, I was imprecise above: I have no problem doing some changes during
ALTER TABLE if logical rep is enabled. I am worried though that to make that
robust you would need loads of places that emit additional information:
* ALTER TABLE
* ALTER FUNCTION
* ALTER OPERATOR
* ALTER/CREATE CAST
* TRUNCATE
* CLUSTER
* ...

I have the feeling that, were we to do that, the full amount of required
information would be rather high and would end up being essentially the catalog.
Just having an intermediate tupledesc doesn't help that much if you have e.g.
record_out doing type lookups of its own.
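
For instance (a small made-up example): a composite-typed column forces
record_out to consult the catalogs for the row type itself, so a saved
TupleDesc of the table alone would not be enough:

CREATE TYPE complex AS (r float8, i float8);
CREATE TABLE t (id int, c complex);
-- printing t.c goes through record_out, which looks up the current
-- column definitions of type "complex" from the catalogs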

There also is the issue you have talked about before, that a user-type might
depend on values in other tables. Unless we're ready to break at least
transactional behaviour for those (for now...) I don't see how decoding outside
of the transaction is ever going to be valid? I wouldn't have a big problem
declaring that as broken for now...

Greetings,

Andres

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Kevin Grittner <kevin(dot)grittner(at)wicourts(dot)gov>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Florian Pflug <fgp(at)phlo(dot)org>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Catalog/Metadata consistency during changeset extraction from wal
Date: 2012-06-25 20:04:10
Message-ID: 1340654533-sup-5535@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


Excerpts from Kevin Grittner's message of lun jun 25 14:50:54 -0400 2012:

> One fine point regarding before and after images -- if a value
> doesn't change in an UPDATE, there's no reason to include it in both
> the BEFORE and AFTER tuple images, as long as we have the null
> column bitmaps -- or some other way of distinguishing unchanged from
> NULL. (This could be especially important when the unchanged column
> was a 50 MB bytea.)

Yeah, probably the best is to have the whole thing in BEFORE, and just
send AFTER values for those columns that changed, and include the
'replace' bool array (probably packed as a bitmap), so that the update
can be trivially constructed at the other end just like in
heap_modify_tuple.
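
For example (an illustration with made-up names), an UPDATE that leaves a
large bytea column untouched:

UPDATE documents SET status = 'archived' WHERE id = 42;
-- BEFORE image: the complete old row, including the untouched 50 MB bytea
-- AFTER image: only the changed column (status), plus a per-column
--   'replace' bitmap telling the apply side which columns to overwrite,
--   much like the arguments to heap_modify_tuple()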

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: "Robert Haas" <robertmhaas(at)gmail(dot)com>, "Simon Riggs" <simon(at)2ndquadrant(dot)com>, "Florian Pflug" <fgp(at)phlo(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Catalog/Metadata consistency during changeset extraction from wal
Date: 2012-06-26 11:50:33
Message-ID: 201206261350.33579.andres@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Monday, June 25, 2012 08:50:54 PM Kevin Grittner wrote:
> Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> >> We most particularly *don't* want DDL to replicate automatically,
> >> because the schema changes are deployed along with related
> >> software changes, and we like to pilot any changes for at least a
> >> few days. Depending on the release, the rollout may take a
> >> couple months, or we may slam in out everywhere a few days after
> >> the first pilot deployment.
> >
> > Thats a sensible for your use-case - but I do not think its thats
> > the appropriate behaviour for anything which is somewhat
> > out-of-the box...

> Right. We currently consider the issues involved in a change and
> script it by hand. I think we want to continue to do that. The
> point was that, given the variety of timings and techniques for
> distributing schema changes, maybe is was only worth doing that
> automatically for the most constrained and controlled cases.
Agreed.

> >> So you could certainly punt all of this for any release as far as
> >> Wisconsin Courts are concerned. We need to know table and column
> >> names, before and after images, and some application-supplied
> >> metadata.
> >
> > I am not sure were going to get all that into 9.3.
>
> Sure, that was more related to why I was questioning how much these
> use cases even *could* integrate -- whether it even paid to
> *consider* these use cases at this point. Responses from Robert and
> you have pretty much convinced me that I was being overly
> pessimistic on that.
I think it's an important question to ask, otherwise we might just end up with
infrastructure unusable for all our goals... Or usable but unfinished
infrastructure because it's too complex to build in a sensible amount of time.

> One fine point regarding before and after images -- if a value
> doesn't change in an UPDATE, there's no reason to include it in both
> the BEFORE and AFTER tuple images, as long as we have the null
> column bitmaps -- or some other way of distinguishing unchanged from
> NULL. (This could be especially important when the unchanged column
> was a 50 MB bytea.) It doesn't matter to me which way this is
> optimized -- in our trigger-based system we chose to keep the full
> BEFORE tuple and just show AFTER values which were distinct from the
> corresponding BEFORE values, but it would be trivial to adapt the
> code to the other way.
I don't think doing that is worth the trouble in the first incarnation. There
is enough hidden detail in there to make it non-trivial, so I wouldn't worry
about it until the rest of the infrastructure exists. That part of the code
will definitely be version-specific, so I see no problem improving on it
incrementally.

> Sorry for that bout of pessimism.
Oh, no reason for that. I have some doubts about that myself, so...

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Florian Pflug <fgp(at)phlo(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Catalog/Metadata consistency during changeset extraction from wal
Date: 2012-06-26 12:35:47
Message-ID: CA+U5nML6-7feA10O3C+A-Nre2PZuf7m++RJUN5mf9PHPDi6FRQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 25 June 2012 17:42, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:

> This is clearly going to depend on the topology.  You would
> definitely want to try to replicate the DDL for the case on which
> Simon is focused (which seems to me to be essentially physical
> replication of catalogs with logical replication of data changes
> from any machine to all others).

Just to remove any doubt: I'm not trying to support a single use case.

The overall proposals include a variety of design patterns. Each of
those covers many reasons for doing it, but they all end up with the
same architecture.

1) Single master replication, with options not possible with physical
2) Multimaster
3) Many to One: data aggregation
4) Online upgrade

I don't think it will be possible to support all of those in one
release. Each has different challenges.

3 and 4 will not be worked on until 9.4, unless someone else is
willing to work on them. That isn't meant to be harsh, just an
explanation of practical reality that I hope people can accept without
needing to argue it.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Florian Pflug <fgp(at)phlo(dot)org>, Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: Catalog/Metadata consistency during changeset extraction from wal
Date: 2012-06-26 13:48:37
Message-ID: CA+TgmoZqEQf7MoR_S39m1QkAARN+cC7TfryW55C91v4WoZm_1A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Jun 25, 2012 at 3:17 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> I suppose the main reason we haven't done it already is that it
>> increases the period of time during which we're using 2X the disk
>> space.
> I find that an acceptable price if it's optional. Making it optional doesn't
> seem to be a problem for me.

+1.

>> I think there is absolutely nothing wrong with doing extra things in
>> ALTER TABLE when logical replication is enabled.  We've got code
>> that's conditional on Hot Standby being enabled in many places in the
>> system; why should logical replication be any different?  If we set
>> the bar for logical replication at "the system can't do anything
>> differently when logical replication is enabled" then I cheerfully
>> submit that we are doomed.  You've already made WAL format changes to
>> support logging the pre-image of the tuple, which is a hundred times
>> more likely to cause a performance problem than any monkeying around
>> we might want to do in ALTER TABLE.
>>
>> I am deeply skeptical that we need to look inside of transactions that
>> do full-table rewrites.  But even if we do, I don't see that what I'm
>> proposing precludes it.  For example, I think we could have ALTER
>> TABLE emit WAL records specifically for logical replication that allow
>> us to disentangle which tuple descriptor to use at which point in the
>> transaction.  I don't see that that would even be very difficult to
>> set up.
> Sorry, I was imprecise above: I have no problem doing some changes during
> ALTER TABLE if logical rep is enabled. I am worried though that to make that
> robust you would need loads of places that emit additional information:
> * ALTER TABLE
> * ALTER FUNCTION
> * ALTER OPERATOR
> * ALTER/CREATE CAST
> * TRUNCATE
> * CLUSTER
> * ...
>
> I have the feeling that, were we to do that, the amount of required
> information would be rather high and would end up being essentially the
> catalog. Just having an intermediate tupledesc doesn't help that much if you
> have e.g. record_out doing type lookups of its own.
>
> There is also the issue you have talked about before, that a user type might
> depend on values in other tables. Unless we're ready to break at least
> transactional behaviour for those (for now...), I don't see how decoding
> outside of the transaction is ever going to be valid. I wouldn't have a big
> problem declaring that as broken for now...

I've been thinking about this a lot. My thinking's still evolving
somewhat, but consider the following case. A user defines a type T
with an input function I and an output function O. They create a
table which uses type T and insert a bunch of data, which is duly
parsed using I; then, they replace I with a new input function I' and
O with a new output function O'. Now, clearly, if we process the
inserts using the catalogs that were in effect at the time the inserts
were done, we could theoretically get different output than if we use
the catalogs that were in effect after the I/O functions were
replaced. But is the latter output wrong, or merely different? My
first thought when we started talking about this was "it's wrong", but
the more I think about it, the less convinced I am...

...because it can't possibly be right to suppose that it's impossible
to decode heap tuples using any catalog contents other than the ones
that were in effect at the time the tuples got inserted. If that were
true, then we wouldn't be able to read a table after adding or
dropping a column, which of course we can. It seems to me that it
might be sufficient to guarantee that we'll decode using the same
*types* that were in effect at the time the inserts happened. If the
user yanks the rug out from under us by changing the type definition,
maybe we simply define that as a situation in which they get to keep
both pieces. After all, if you replace the type definition in a way
that makes sensible decoding of the table impossible, you've pretty
much shot yourself in the foot whether logical replication enters the
picture or not.
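
To make that dependency concrete, here is a minimal sketch (not code
anyone has proposed) of what per-tuple decode-to-text fundamentally has to
do; the helper itself is hypothetical, but heap_deform_tuple(),
getTypeOutputInfo() and OidOutputFunctionCall() are the real backend entry
points. Whatever output function the catalogs name for a column's type at
decoding time is what runs, which is exactly why replacing O with O'
changes the result:

#include "postgres.h"
#include "access/htup.h"
#include "fmgr.h"
#include "utils/lsyscache.h"

/*
 * Illustration only: convert one heap tuple to per-column text using the
 * output functions currently recorded in the catalogs for each column type.
 */
static void
decode_tuple_to_text(HeapTuple tuple, TupleDesc tupdesc)
{
    Datum      *values = (Datum *) palloc(tupdesc->natts * sizeof(Datum));
    bool       *isnull = (bool *) palloc(tupdesc->natts * sizeof(bool));
    int         i;

    heap_deform_tuple(tuple, tupdesc, values, isnull);

    for (i = 0; i < tupdesc->natts; i++)
    {
        Oid         typoutput;
        bool        typisvarlena;
        char       *outstr;

        if (isnull[i])
            continue;

        /* catalog lookup: which output function does this type have *now*? */
        getTypeOutputInfo(tupdesc->attrs[i]->atttypid,
                          &typoutput, &typisvarlena);

        /* runs I/O code that may itself consult the catalogs (cf. record_out) */
        outstr = OidOutputFunctionCall(typoutput, values[i]);
        elog(DEBUG1, "column %d: %s", i + 1, outstr);
    }
}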

In the enum case, for example, we go to great pains to make sure that
the table contents are always decodable not only under the current
version of SnapshotNow, but also under any successor version. We do that by
prohibiting ALTER TYPE .. ADD VALUE from running inside a transaction
block - because if we inserted a row into pg_enum and then inserted
dependent rows into some user table, a rollback could leave us with
rows that we can't decode. For the same reason, we don't allow ALTER
TYPE .. DROP VALUE. I think that we can infer a general principle
from this: while I/O functions may refer to catalog contents, they may
not do so in a way that could be invalidated by subsequent commits or
rollbacks. If they did, they'd be breaking the ability of subsequent
SELECT statements to read the table.

An interesting case that is arguably an exception to this rule is the
regwhatever types, which will cheerfully output their value as an OID
if it can't be decoded to text, but will bail on input if the textual
input can't be mapped to an OID via the appropriate system catalog.
The results don't even obey transaction semantics:

rhaas=# create table t (a int);
CREATE TABLE
rhaas=# begin;
BEGIN
rhaas=# select 't'::regclass;
regclass
----------
t
(1 row)

-- at this point, drop table t in another session
rhaas=# select 't'::regclass;
regclass
----------
t
(1 row)

But:

rhaas=# create table t (a int);
CREATE TABLE
rhaas=# begin;
BEGIN
-- at this point, drop table t in another session
rhaas=# select 't'::regclass;
ERROR: relation "t" does not exist at character 8
STATEMENT: select 't'::regclass;
ERROR: relation "t" does not exist
LINE 1: select 't'::regclass;
^

So in this case, the I/O functions are dependent not only on the
contents of the system catalogs and what's visible under SnapshotNow,
but also on the contents of the backend-local system caches and the
exact timing of calls to AcceptInvalidationMessages() with respect to
concurrent activities in other sessions. There is no way you're gonna
be able to figure that out from the WAL stream, which means that
there's no possible way to ensure that decode-to-text produces the
same value that the user actually typed (never mind that he could have
specified the input value as either an integer or a table name). And
even if you could, the replication transaction could fail during
reencoding on the remote node. To replicate this at all, you'd
probably have to decide on replicating the underlying OID and forget
about the text representation... but it wouldn't bother me very much
to say "oh, that's a kooky internal type, you can't use it with
logical replication". Unless there's some way of making sure that all
the remote OID assignments are the same as the local ones, it's not
going to be that meaningful anyway. Even if you had such a thing, the
fact that the behavior can depend on leftover syscache contents means
that such a type might break under parallel apply.

So to get back to my main point: how about decreeing that logical
rep's responsibilities extend only to using the correct set of type
OIDs to decode the data, placing the burden of not whacking around the
type definitions or any ancillary tables on the user? That seems to
reduce the complexity of the problem scenarios quite a lot, and also
seems generally sane.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: David Fetter <david(at)fetter(dot)org>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Catalog/Metadata consistency during changeset extraction from wal
Date: 2012-06-26 18:11:54
Message-ID: 20120626181154.GB19560@fetter.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Jun 25, 2012 at 01:50:54PM -0500, Kevin Grittner wrote:
> Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>
> > I am not sure were going to get all that into 9.3.
>
> Sure, that was more related to why I was questioning how much these
> use cases even *could* integrate -- whether it even paid to
> *consider* these use cases at this point. Responses from Robert and
> you have pretty much convinced me that I was being overly
> pessimistic on that.
>
> One fine point regarding before and after images -- if a value
> doesn't change in an UPDATE, there's no reason to include it in both
> the BEFORE and AFTER tuple images, as long as we have the null
> column bitmaps -- or some other way of distinguishing unchanged from
> NULL. (This could be especially important when the unchanged column
> was a 50 MB bytea.)

How about two bitmaps: one telling which columns are actually there,
the other with NULLs?
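
For what it's worth, a hypothetical layout for that two-bitmap scheme
could look roughly like this (illustration only; the struct and field
names are made up, nothing in any patch uses them):

#include "postgres.h"

/*
 * Hypothetical per-tuple image: a "present" bitmap saying which columns
 * are included in this image at all, an "isnull" bitmap saying which of
 * the included columns are NULL, then the serialized non-NULL values.
 */
typedef struct LogicalTupleImage
{
    uint16      natts;      /* number of columns in the table */
    /* bits8 present[(natts + 7) / 8] follows */
    /* bits8 isnull[(natts + 7) / 8] follows */
    /* serialized values of present, non-NULL columns follow */
} LogicalTupleImage;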

Cheers,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david(dot)fetter(at)gmail(dot)com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "David Fetter" <david(at)fetter(dot)org>,<pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Catalog/Metadata consistency during changeset extraction from wal
Date: 2012-06-26 22:05:27
Message-ID: 4FE9EBD70200002500048B40@gw.wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

David Fetter <david(at)fetter(dot)org> wrote:
> On Mon, Jun 25, 2012 at 01:50:54PM -0500, Kevin Grittner wrote:

>> One fine point regarding before and after images -- if a value
>> doesn't change in an UPDATE, there's no reason to include it in
>> both the BEFORE and AFTER tuple images, as long as we have the
>> null column bitmaps -- or some other way of distinguishing
>> unchanged from NULL. (This could be especially important when
>> the unchanged column was a 50 MB bytea.)
>
> How about two bitmaps: one telling which columns are actually
> there, the other with NULLs?

There are quite a few ways that could be done, but I suspect
Álvaro's idea is best:

http://archives.postgresql.org/message-id/1340654533-sup-5535@alvh.no-ip.org

In any event, it sounds like Andres wants to keep it as simple as
possible for the moment, and just include both tuples in their
entirety. Hopefully that is something which can be revisited before
the last CF.

-Kevin


From: David Fetter <david(at)fetter(dot)org>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Catalog/Metadata consistency during changeset extraction from wal
Date: 2012-06-27 02:11:12
Message-ID: 20120627021112.GA28708@fetter.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Jun 26, 2012 at 05:05:27PM -0500, Kevin Grittner wrote:
> David Fetter <david(at)fetter(dot)org> wrote:
> > On Mon, Jun 25, 2012 at 01:50:54PM -0500, Kevin Grittner wrote:
>
> >> One fine point regarding before and after images -- if a value
> >> doesn't change in an UPDATE, there's no reason to include it in
> >> both the BEFORE and AFTER tuple images, as long as we have the
> >> null column bitmaps -- or some other way of distinguishing
> >> unchanged from NULL. (This could be especially important when
> >> the unchanged column was a 50 MB bytea.)
> >
> > How about two bitmaps: one telling which columns are actually
> > there, the other with NULLs?
>
> There are quite a few ways that could be done, but I suspect
> Álvaro's idea is best:
>
> http://archives.postgresql.org/message-id/1340654533-sup-5535@alvh.no-ip.org

Looks great (or at least way better than mine) to me :)

> In any event, it sounds like Andres wants to keep it as simple as
> possible for the moment, and just include both tuples in their
> entirety. Hopefully that is something which can be revisited before
> the last CF.

I hope so, too...

Cheers,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david(dot)fetter(at)gmail(dot)com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate