Logical decoding & exported base snapshot

From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Logical decoding & exported base snapshot
Date: 2012-12-11 23:52:16
Message-ID: 20121211235216.GA4337@awork2.anarazel.de
Lists: pgsql-hackers

Hi,

When initiating a new logical replication "slot" I want to provide a
'SET TRANSACTION SNAPSHOT'able snapshot which can be used to set up a new
replica. I have some questions around this where I could use some input.

First, some basics around how this currently works:

Connect as a replication client and check the other side:

psql "port=5440 host=/tmp dbname=postgres replication=1"
postgres=# IDENTIFY_SYSTEM;
      systemid       | timeline |  xlogpos  |  dbname
---------------------+----------+-----------+----------
 5820768138794874841 |        1 | 0/190AF98 | postgres
(1 row)

Now, initiate a replication slot:

postgres=# INIT_LOGICAL_REPLICATION 'test';
WARNING: Initiating logical rep
WARNING: reached consistent point, stopping!
 replication_id | consistent_point | snapshot_name | plugin
----------------+------------------+---------------+--------
 id-0           | 0/190AFD0        | 0xDEADBEEF    | test
(1 row)

If you actually wanted to receive changes you would now (and later)
use START_LOGICAL_REPLICATION 'id-0' 0/190AFD0; to stream the changes
from that point onwards.

INIT_LOGICAL_REPLICATION starts to decode WAL from the last checkpoint
onwards and reads until it finds a point where it has enough information
(including a suitable xmin horizon) to decode the contents. Then it
tells you the name of the newly created slot, the WAL location and the
name of a snapshot it exported (obviously fake here).

I have the code to export a real snapshot, but exporting the snapshot is
actually the easy part. The idea with the exported snapshot obviously is
that you can start a pg_dump using it to replicate the base data for a
new replica.
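
For illustration, roughly like this - using the obviously fake snapshot
name from above, and assuming either a pg_dump that can adopt an exported
snapshot (as in the scenario further down) or a plain SQL session:

pg_dump --snapshot 0xDEADBEEF ...

-- or, by hand, in any SQL connection while the snapshot is still exported:
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SET TRANSACTION SNAPSHOT '0xDEADBEEF';
-- ... COPY the tables' contents out under exactly this snapshot ...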

Problem 1:

One problem I see is that while exporting a snapshot solves the
visibility issues of the table's contents it does not protect against
schema changes. I am not sure whether that's a problem.

If somebody runs a CLUSTER or something like that, the table's contents
will be preserved including MVCC semantics. That's fine.
The more problematic cases I see are TRUNCATE, DROP and ALTER
TABLE. Imagine the following:

S1: INIT_LOGICAL_REPLICATION
S1: get snapshot: 00000333-1
S2: ALTER TABLE foo ALTER COLUMN blub text USING (blub::text);
S3: pg_dump --snapshot 00000333-1
S1: START_LOGICAL_REPLICATION

In that case the pg_dump would dump foo using the schema *after* the
ALTER TABLE but showing only rows visible to our snapshot. After
START_LOGICAL_REPLICATION all changes after the xlog position from
INIT_LOGICAL_REPLICATION will be returned though - including all the
tuples from the ALTER TABLE and potentially - if some form of schema
capturing were in place - the ALTER TABLE itself. The copied schema
would already have the new format, though.

Does anybody see that as a problem or is it just a case of PEBKAC? One
argument for the latter is that that's already a problematic case for
normal pg_dump runs. It's just that the window is a bit larger here.

Problem 2:

Control Flow.

To be able to do a "SET TRANSACTION SNAPSHOT" the source transaction
needs to be alive. That's currently solved by exporting the snapshot in
the walsender connection that did the INIT_LOGICAL_REPLICATION. The
question is: how long should we preserve that snapshot?

There is no requirement - and there *cannot* be one in the general case,
the slot needs to be usable after a restart - that
START_LOGICAL_REPLICATION has to be run in the same replication
connection as INIT.

Possible solutions:
1) INIT_LOGICAL_REPLICATION waits for an answer from the client that
confirms that logical replication initialization is finished. Before
that the walsender connection cannot be used for anything else.

2) we remove the snapshot as soon as any other command is received; this
way the replication connection stays usable, e.g. to issue a
START_LOGICAL_REPLICATION in parallel to the initial data dump. In that
case the snapshot would have to be imported *before* the next command
is received, as SET TRANSACTION SNAPSHOT requires the source transaction
to still be open. (See the sketch below.)
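
A rough sketch of how a client would drive option 2, using the fake names
from above:

-- walsender connection (replication=1):
INIT_LOGICAL_REPLICATION 'test';   -- returns slot 'id-0' and a snapshot name
-- ordinary SQL connection, *before* the walsender receives its next command:
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SET TRANSACTION SNAPSHOT '0xDEADBEEF';
-- ... dump the base data under this snapshot ...
-- walsender connection; this (or any other command) releases the snapshot:
START_LOGICAL_REPLICATION 'id-0' 0/190AFD0;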

Opinions?

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Joachim Wieland <joe(at)mcknight(dot)de>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Logical decoding & exported base snapshot
Date: 2012-12-12 02:05:51
Message-ID: CACw0+109batQZdQ1C_AKfdG4xhUMY+LJNWb_zD2pgMScCq7ghw@mail.gmail.com
Lists: pgsql-hackers

On Tue, Dec 11, 2012 at 6:52 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> One problem I see is that while exporting a snapshot solves the
> visibility issues of the table's contents it does not protect against
> schema changes. I am not sure whether thats a problem.
>
> If somebody runs a CLUSTER or something like that, the table's contents
> will be preserved including MVCC semantics. That's fine.
> The more problematic cases I see are TRUNCATE, DROP and ALTER
> TABLE.

This is why the pg_dump master process executes a

lock table <table> in access share mode

for every table, so your commands would all block.

In fact it's even more complicated because the pg_dump worker
processes also need to lock the table. They try to get a similar lock
in "NOWAIT" mode right before dumping the table. If they don't get the
lock that means that somebody else is waiting for an exclusive lock
(this is the case you describe) and the backup will fail.
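
Roughly, each worker does something like this right before dumping a table
(table name made up):

LOCK TABLE some_table IN ACCESS SHARE MODE NOWAIT;

If a conflicting ACCESS EXCLUSIVE request is already queued behind the
master's lock, the NOWAIT makes this fail right away and pg_dump aborts
instead of silently deadlocking against its own master process.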

> Problem 2:
>
> To be able to do a "SET TRANSACTION SNAPSHOT" the source transaction
> needs to be alive. That's currently solved by exporting the snapshot in
> the walsender connection that did the INIT_LOGICAL_REPLICATION. The
> question is how long should we preserve that snapshot?

You lost me on this one after the first sentence... But in general the
snapshot isn't so much a magical thing: as long as the exporting
transaction is open, it guarantees that there is a transaction out
there that is holding off vacuum from removing data, and it also
guarantees that the snapshot, as is, has existed at some time in the
past.

Once it is applied to another transaction you now have two
transactions that will hold off vacuum because they share the same
xmin,xmax values. You could also just end the exporting transaction at
that point.
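
For illustration, the plain-SQL version of that dance (snapshot name made
up):

-- session 1
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT pg_export_snapshot();   -- returns e.g. '00000333-1'
-- session 2, while session 1's transaction is still open
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SET TRANSACTION SNAPSHOT '00000333-1';
-- both sessions now pin the same xmin; session 1 could commit at this point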

One thought at the time was to somehow integrate exported snapshots
with the prepared transactions feature: then you could export a
snapshot, prepare the exporting transaction and have that snapshot
frozen forever; it would even survive a server restart.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Joachim Wieland <joe(at)mcknight(dot)de>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Logical decoding & exported base snapshot
Date: 2012-12-12 03:20:18
Message-ID: 26517.1355282418@sss.pgh.pa.us
Lists: pgsql-hackers

Joachim Wieland <joe(at)mcknight(dot)de> writes:
> On Tue, Dec 11, 2012 at 6:52 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> One problem I see is that while exporting a snapshot solves the
>> visibility issues of the table's contents it does not protect against
>> schema changes. I am not sure whether thats a problem.
>>
>> If somebody runs a CLUSTER or something like that, the table's contents
>> will be preserved including MVCC semantics. That's fine.
>> The more problematic cases I see are TRUNCATE, DROP and ALTER
>> TABLE.

> This is why the pg_dump master process executes a
> lock table <table> in access share mode
> for every table, so your commands would all block.

A lock doesn't protect against schema changes made before the lock was
taken. The reason that the described scenario is problematic is that
pg_dump is going to be expected to work against a snapshot made before
it gets a chance to take those table locks. Thus, there's a window
where DDL is dangerous, and will invalidate the dump --- perhaps without
any warning.

Now, we have this problem today, in that pg_dump has to read pg_class
before it can take table locks so some window exists already. What's
bothering me about what Andres describes is that the window for trouble
seems to be getting much bigger.

regards, tom lane


From: Steve Singer <steve(at)ssinger(dot)info>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Logical decoding & exported base snapshot
Date: 2012-12-12 03:39:14
Message-ID: BLU0-SMTP8417A16727A6855996B47DC4F0@phx.gbl
Lists: pgsql-hackers

On 12-12-11 06:52 PM, Andres Freund wrote:
> Hi,

>
> Problem 1:
>
> One problem I see is that while exporting a snapshot solves the
> visibility issues of the table's contents it does not protect against
> schema changes. I am not sure whether thats a problem.
>
> If somebody runs a CLUSTER or something like that, the table's contents
> will be preserved including MVCC semantics. That's fine.
> The more problematic cases I see are TRUNCATE, DROP and ALTER
> TABLE. Imagine the following:
>
> S1: INIT_LOGICAL_REPLICATION
> S1: get snapshot: 00000333-1
> S2: ALTER TABLE foo ALTER COLUMN blub text USING (blub::text);
> S3: pg_dump --snapshot 00000333-1
> S1: START_LOGICAL_REPLICATION
>
> In that case the pg_dump would dump foo using the schema *after* the
> ALTER TABLE but showing only rows visible to our snapshot. After
> START_LOGICAL_REPLICATION all changes after the xlog position from
> INIT_LOGICAL_REPLICATION will be returned though - including all the
> tuples from the ALTER TABLE and potentially - if some form of schema
> capturing was in place - the ALTER TABLE itself. The copied schema would
> have the new format already though.
>
> Does anybody see that as aproblem or is it just a case of PEBKAC? One
> argument for the latter is that thats already a problematic case for
> normal pg_dump's. Its just that the window is a bit larger here.

Is there any way to detect this situation as part of the pg_dump? If I
could detect this, abort my pg_dump and then go get a new snapshot, then
I don't see this as a big deal. I can live with telling users, "don't
do DDL-like things while subscribing a new node; if you do, the
subscription will restart". I am less keen on telling users "don't do
DDL-like things while subscribing a new node or the results will be
unpredictable".

I'm trying to convince myself that I will be able to take a pg_dump from
an exported snapshot, plus the changes made between the snapshot and
some later point in time, and turn the results into a consistent
database. What if something like this comes along:

INIT REPLICATION
insert into foo (id,bar) values (1,2);
alter table foo drop column bar;
pg_dump --snapshot

The schema I get as part of the pg_dump won't have bar because it has
been dropped, even though it will have the rows that existed with bar.
I then go to process the INSERT statement. It will have a WAL record
with column data for bar, and the logical replication replay will look up
the catalog rows from before the ALTER TABLE, so it will generate a
logical INSERT record with bar. That will fail on the replica.

I'm worried that it will be difficult to programmatically stitch together
the inconsistent snapshot from the pg_dump plus the logical records
generated between the snapshot and the dump (along with any record of
the DDL, if it exists).

> Problem 2:
>
> Control Flow.
>
> To be able to do a "SET TRANSACTION SNAPSHOT" the source transaction
> needs to be alive. That's currently solved by exporting the snapshot in
> the walsender connection that did the INIT_LOGICAL_REPLICATION. The
> question is how long should we preserve that snapshot?
>
> There is no requirement - and there *cannot* be one in the general case,
> the slot needs to be usable after a restart - that
> START_LOGICAL_REPLICATION has to be run in the same replication
> connection as INIT.
>
> Possible solutions:
> 1) INIT_LOGICAL_REPLICATION waits for an answer from the client that
> confirms that logical replication initialization is finished. Before
> that the walsender connection cannot be used for anything else.
>
> 2) we remove the snapshot as soon as any other commend is received, this
> way the replication connection stays usable, e.g. to issue a
> START_LOGICAL_REPLICATION in parallel to the initial data dump. In that
> case the snapshot would have to be imported *before* the next command
> was received as SET TRANSACTION SNAPSHOT requires the source transaction
> to be still open.
>
> Opinions?

Option 2 sounds more flexible. Is it more difficult to implement?
> Andres
> --
> Andres Freund http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services
>
>


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Joachim Wieland <joe(at)mcknight(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Logical decoding & exported base snapshot
Date: 2012-12-12 11:02:28
Message-ID: 20121212110228.GA8027@awork2.anarazel.de
Lists: pgsql-hackers

On 2012-12-11 21:05:51 -0500, Joachim Wieland wrote:
> On Tue, Dec 11, 2012 at 6:52 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> > One problem I see is that while exporting a snapshot solves the
> > visibility issues of the table's contents it does not protect against
> > schema changes. I am not sure whether thats a problem.
> >
> > If somebody runs a CLUSTER or something like that, the table's contents
> > will be preserved including MVCC semantics. That's fine.
> > The more problematic cases I see are TRUNCATE, DROP and ALTER
> > TABLE.
>
> This is why the pg_dump master process executes a
>
> lock table <table> in access share mode
>
> for every table, so your commands would all block.
>
> In fact it's even more complicated because the pg_dump worker
> processes also need to lock the table. They try to get a similar lock
> in "NOWAIT" mode right before dumping the table. If they don't get the
> lock that means that somebody else is waiting for an exclusive lock
> (this is the case you describe) and the backup will fail.

[Tom explains why this is problematic way better than I do, see his
email]

A trivial example - you can play nastier games than that though:

S1: CREATE TABLE test(id int);
S1: INSERT INTO test VALUES(1), (2);
S1: BEGIN;
S1: ALTER TABLE test RENAME COLUMN id TO id2;
S2: pg_dump
S2: waits
S1: COMMIT;
pg_dump: [archiver (db)] query failed: ERROR: column "id" of relation "test" does not exist
pg_dump: [archiver (db)] query was: COPY public.test (id) TO stdout;

> > Problem 2:
> >
> > To be able to do a "SET TRANSACTION SNAPSHOT" the source transaction
> > needs to be alive. That's currently solved by exporting the snapshot in
> > the walsender connection that did the INIT_LOGICAL_REPLICATION. The
> > question is how long should we preserve that snapshot?
>
> You lost me on this one after the first sentence... But in general the
> snapshot isn't so much a magical thing: As long the exporting
> transaction is open, it guarantees that there is a transaction out
> there that is holding off vacuum from removing data and it's also
> guaranteeing that the snapshot as is has existed at some time in the
> past.
>
> Once it is applied to another transaction you now have two
> transactions that will hold off vacuum because they share the same
> xmin,xmax values. You could also just end the exporting transaction at
> that point.

I should probably have given a bit more context here... All this is in
the context of decoding WAL back into something useful for the purpose
of replication. This is done in the already existing walsender process,
using the commands I explained in the original post.
What we do there is walk the WAL from some point until we could assemble
a snapshot for exactly that LSN. Several preconditions need to be met
there (like a low enough xmin horizon, so the old catalog entries are
still around). Using that snapshot we can now decode the entries stored
in the WAL.
If you set up a new replica though, getting all changes from one point
where you're consistent isn't all that interesting if you cannot also
get a backup from exactly that point in time, because otherwise it's very
hard to start with something sensible on the standby without locking
everything down. So the idea is that once we have found that consistent
snapshot we also export it. Only that we are in a walsender connection,
which doesn't expose transactional semantics, so the snapshot cannot be
"deallocated" by COMMITing.

So the question is basically about how to design the user interface of
that deallocation.

Makes slightly more sense now?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Joachim Wieland <joe(at)mcknight(dot)de>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Logical decoding & exported base snapshot
Date: 2012-12-12 11:13:44
Message-ID: 20121212111344.GB8027@awork2.anarazel.de
Lists: pgsql-hackers

On 2012-12-11 22:20:18 -0500, Tom Lane wrote:
> Joachim Wieland <joe(at)mcknight(dot)de> writes:
> > On Tue, Dec 11, 2012 at 6:52 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> >> One problem I see is that while exporting a snapshot solves the
> >> visibility issues of the table's contents it does not protect against
> >> schema changes. I am not sure whether thats a problem.
> >>
> >> If somebody runs a CLUSTER or something like that, the table's contents
> >> will be preserved including MVCC semantics. That's fine.
> >> The more problematic cases I see are TRUNCATE, DROP and ALTER
> >> TABLE.
>
> > This is why the pg_dump master process executes a
> > lock table <table> in access share mode
> > for every table, so your commands would all block.
>
> A lock doesn't protect against schema changes made before the lock was
> taken. The reason that the described scenario is problematic is that
> pg_dump is going to be expected to work against a snapshot made before
> it gets a chance to take those table locks. Thus, there's a window
> where DDL is dangerous, and will invalidate the dump --- perhaps without
> any warning.

Exactly.

> Now, we have this problem today, in that pg_dump has to read pg_class
> before it can take table locks so some window exists already.

Yea. And if somebody else already has conflicting locks it's pretty damn
easy to hit, as shown in my answer to Joachim... I am pretty sure there
are lots of nasty silent variants.

> What's
> bothering me about what Andres describes is that the window for trouble
> seems to be getting much bigger.

Me too. If it were only clearly visible errors I wouldn't be bothered
too much, but ...

This morning I wondered whether we couldn't protect against that by
acquiring share locks on the catalog rows pg_dump reads; that would
result in "could not serialize access due to concurrent update" type
errors, which would be easy enough to discern/translate.
While pretty damn ugly, that should take care of most of those issues,
shouldn't it?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Steve Singer <steve(at)ssinger(dot)info>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Logical decoding & exported base snapshot
Date: 2012-12-12 11:20:11
Message-ID: 20121212112011.GC8027@awork2.anarazel.de
Lists: pgsql-hackers

On 2012-12-11 22:39:14 -0500, Steve Singer wrote:
> On 12-12-11 06:52 PM, Andres Freund wrote:
> >Hi,
>
> >
> >Problem 1:
> >
> >One problem I see is that while exporting a snapshot solves the
> >visibility issues of the table's contents it does not protect against
> >schema changes. I am not sure whether thats a problem.
> >
> >If somebody runs a CLUSTER or something like that, the table's contents
> >will be preserved including MVCC semantics. That's fine.
> >The more problematic cases I see are TRUNCATE, DROP and ALTER
> >TABLE. Imagine the following:
> >
> >S1: INIT_LOGICAL_REPLICATION
> >S1: get snapshot: 00000333-1
> >S2: ALTER TABLE foo ALTER COLUMN blub text USING (blub::text);
> >S3: pg_dump --snapshot 00000333-1
> >S1: START_LOGICAL_REPLICATION
> >
> >In that case the pg_dump would dump foo using the schema *after* the
> >ALTER TABLE but showing only rows visible to our snapshot. After
> >START_LOGICAL_REPLICATION all changes after the xlog position from
> >INIT_LOGICAL_REPLICATION will be returned though - including all the
> >tuples from the ALTER TABLE and potentially - if some form of schema
> >capturing was in place - the ALTER TABLE itself. The copied schema would
> >have the new format already though.
> >
> >Does anybody see that as aproblem or is it just a case of PEBKAC? One
> >argument for the latter is that thats already a problematic case for
> >normal pg_dump's. Its just that the window is a bit larger here.
>
> Is there anyway to detect this situation as part of the pg_dump? If I could
> detect this, abort my pg_dump then go and get a new snapshot then I don't
> see this as a big deal. I can live with telling users, "don't do DDL like
> things while subscribing a new node, if you do the subscription will
> restart". I am less keen on telling users "don't do DDL like things while
> subscribing a new node or the results will be unpredictable"

I am trying to think of an unintrusive way to detect this....

> I'm trying to convince myself if I will be able to take a pg_dump from an
> exported snapshot plus the changes made after in between the snapshot id to
> some later time and turn the results into a consistent database. What if
> something like this comes along
>
> ...
>
> I'm worried that it will be difficult to pragmatically stitch together the
> inconsistent snapshot from the pg_dump plus the logical records generated in
> between the snapshot and the dump (along with any record of the DDL if it
> exists).

I think trying to solve that in the replication solution is a bad
idea. There are too many possible scenarios, and some of them are very
subtle and hard to detect.

So I think it's either:
1) find something that tests for this and abort if so
2) acquire locks earlier, preventing DDL altogether until pg_dump starts
3) don't care. The problem exists today and not many people have
complained.

> >Problem 2:
> >
> >Control Flow.
> >
> >To be able to do a "SET TRANSACTION SNAPSHOT" the source transaction
> >needs to be alive. That's currently solved by exporting the snapshot in
> >the walsender connection that did the INIT_LOGICAL_REPLICATION. The
> >question is how long should we preserve that snapshot?
> >
> >There is no requirement - and there *cannot* be one in the general case,
> >the slot needs to be usable after a restart - that
> >START_LOGICAL_REPLICATION has to be run in the same replication
> >connection as INIT.
> >
> >Possible solutions:
> >1) INIT_LOGICAL_REPLICATION waits for an answer from the client that
> >confirms that logical replication initialization is finished. Before
> >that the walsender connection cannot be used for anything else.
> >
> >2) we remove the snapshot as soon as any other commend is received, this
> >way the replication connection stays usable, e.g. to issue a
> >START_LOGICAL_REPLICATION in parallel to the initial data dump. In that
> >case the snapshot would have to be imported *before* the next command
> >was received as SET TRANSACTION SNAPSHOT requires the source transaction
> >to be still open.

> Option 2 sounds more flexible. Is it more difficult to implement?

No, I don't think so. It's a bit more intrusive in that it requires
knowledge about logical replication in more parts of walsender, but it
should be ok.

Note, btw, that my description of 1) was easy to misunderstand. The
"that" in "Before that the walsender connection cannot be used for
anything else." is the answer from the client, not the usage of the
exported snapshot. Once the snapshot has been imported into other
session(s) the source doesn't need to be alive anymore.
Does that explanation change anything?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Joachim Wieland <joe(at)mcknight(dot)de>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Logical decoding & exported base snapshot
Date: 2012-12-12 23:45:16
Message-ID: 20121212234516.GC7099@awork2.anarazel.de
Lists: pgsql-hackers

On 2012-12-12 12:13:44 +0100, Andres Freund wrote:
> On 2012-12-11 22:20:18 -0500, Tom Lane wrote:
> > Joachim Wieland <joe(at)mcknight(dot)de> writes:
> > > On Tue, Dec 11, 2012 at 6:52 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> > >> One problem I see is that while exporting a snapshot solves the
> > >> visibility issues of the table's contents it does not protect against
> > >> schema changes. I am not sure whether thats a problem.
> > >>
> > >> If somebody runs a CLUSTER or something like that, the table's contents
> > >> will be preserved including MVCC semantics. That's fine.
> > >> The more problematic cases I see are TRUNCATE, DROP and ALTER
> > >> TABLE.
> >
> > > This is why the pg_dump master process executes a
> > > lock table <table> in access share mode
> > > for every table, so your commands would all block.
> >
> > A lock doesn't protect against schema changes made before the lock was
> > taken. The reason that the described scenario is problematic is that
> > pg_dump is going to be expected to work against a snapshot made before
> > it gets a chance to take those table locks. Thus, there's a window
> > where DDL is dangerous, and will invalidate the dump --- perhaps without
> > any warning.

> > Now, we have this problem today, in that pg_dump has to read pg_class
> > before it can take table locks so some window exists already.
>
> > What's
> > bothering me about what Andres describes is that the window for trouble
> > seems to be getting much bigger.
>
> This morning I wondered whether we couldn't protect against that by
> acquiring share locks on the catalog rows pg_dump reads, that would
> result in "could not serialize access due to concurrent update" type of
> errors which would be easy enough discernible/translateable.
> While pretty damn ugly that should take care of most of those issues,
> shouldn't it?

After a quick look it doesn't look too hard to add this; does anybody
have an opinion on whether it's something worthwhile? And possibly a
suggestion for an option name? I can't come up with anything better than
--recheck-locks or something, but that's awful.

I don't think there's much point in locking anything but pg_class,
pg_attribute and pg_type; nearly everything else is read using an MVCC
snapshot anyway or isn't all that critical. Other candidates?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Joachim Wieland <joe(at)mcknight(dot)de>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Logical decoding & exported base snapshot
Date: 2012-12-12 23:52:33
Message-ID: 17860.1355356353@sss.pgh.pa.us
Lists: pgsql-hackers

Andres Freund <andres(at)2ndquadrant(dot)com> writes:
> On 2012-12-12 12:13:44 +0100, Andres Freund wrote:
>> This morning I wondered whether we couldn't protect against that by
>> acquiring share locks on the catalog rows pg_dump reads, that would
>> result in "could not serialize access due to concurrent update" type of
>> errors which would be easy enough discernible/translateable.
>> While pretty damn ugly that should take care of most of those issues,
>> shouldn't it?

How would it fix anything? The problem is with DDL that's committed and
gone before pg_dump ever gets to the table's pg_class row. Once it
does, and takes AccessShareLock on the relation, it's safe. Adding a
SELECT FOR SHARE step just adds more time before we can get that lock.

Also, locking the pg_class row doesn't provide protection against DDL
that doesn't modify the relation's pg_class row, of which there is
plenty.

regards, tom lane


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Joachim Wieland <joe(at)mcknight(dot)de>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Logical decoding & exported base snapshot
Date: 2012-12-12 23:58:05
Message-ID: 20121212235805.GD7099@awork2.anarazel.de
Lists: pgsql-hackers

On 2012-12-12 18:52:33 -0500, Tom Lane wrote:
> Andres Freund <andres(at)2ndquadrant(dot)com> writes:
> > On 2012-12-12 12:13:44 +0100, Andres Freund wrote:
> >> This morning I wondered whether we couldn't protect against that by
> >> acquiring share locks on the catalog rows pg_dump reads, that would
> >> result in "could not serialize access due to concurrent update" type of
> >> errors which would be easy enough discernible/translateable.
> >> While pretty damn ugly that should take care of most of those issues,
> >> shouldn't it?
>
> How would it fix anything? The problem is with DDL that's committed and
> gone before pg_dump ever gets to the table's pg_class row. Once it
> does, and takes AccessShareLock on the relation, it's safe. Adding a
> SELECT FOR SHARE step just adds more time before we can get that lock.

Getting a FOR SHARE lock ought to error out with a serialization failure
if the row was updated since our snapshot started, as pg_dump uses
repeatable read/serializable. Now that obviously doesn't fix the
situation, but making it detectable in a safe way seems to be good
enough for me.
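
To illustrate the mechanism on an ordinary table (made-up names; not
claiming the catalogs will let us do exactly this as-is):

-- S1
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT count(*) FROM sometable;   -- the snapshot gets taken here
-- S2
UPDATE sometable SET val = val + 1 WHERE id = 1;   -- autocommits
-- S1
SELECT * FROM sometable WHERE id = 1 FOR SHARE;
ERROR:  could not serialize access due to concurrent update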

> Also, locking the pg_class row doesn't provide protection against DDL
> that doesn't modify the relation's pg_class row, of which there is
> plenty.

Well, that's why I thought of pg_class, pg_attribute and pg_type. Maybe
that list needs to get extended a bit, but I think just those three
should detect most dangerous situations.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Steve Singer <steve(at)ssinger(dot)info>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Logical decoding & exported base snapshot
Date: 2012-12-13 16:02:06
Message-ID: BLU0-SMTP482BC92ABDECCD2AF9D7DBDC4E0@phx.gbl
Lists: pgsql-hackers

On 12-12-12 06:20 AM, Andres Freund wrote:
>> Possible solutions:
>> 1) INIT_LOGICAL_REPLICATION waits for an answer from the client that
>> confirms that logical replication initialization is finished. Before
>> that the walsender connection cannot be used for anything else.
>>
>> 2) we remove the snapshot as soon as any other commend is received, this
>> way the replication connection stays usable, e.g. to issue a
>> START_LOGICAL_REPLICATION in parallel to the initial data dump. In that
>> case the snapshot would have to be imported *before* the next command
>> was received as SET TRANSACTION SNAPSHOT requires the source transaction
>> to be still open.
>> Option 2 sounds more flexible. Is it more difficult to implement?
> No, I don't think so. It's a bit more intrusive in that it requires
> knowledge about logical replication in more parts of walsender, but it
> should be ok.
>
> Note btw, that my description of 1) was easy to misunderstand. The
> "that" in "Before that the walsender connection cannot be used for
> anything else." is the answer from the client, not the usage of the
> exported snapshot. Once the snapshot has been iimported into other
> session(s) the source doesn't need to be alive anymore.
> Does that explanation change anything?

I think I understood you were saying the walsender connection can't be
used for anything else (i.e. streaming WAL) until the exported snapshot
has been imported. I think your clarification is still consistent with
this?

With option 2 I can still get the option 1 behaviour by not sending the
next command to the walsender until I am done importing the snapshot.
However if I want to start processing WAL before the snapshot has been
imported I can do that with option 2.

I am not sure I would need to do that, I'm just saying having the option
is more flexible.

> Greetings,
>
> Andres Freund
>
> --
> Andres Freund http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services
>
>


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Steve Singer <steve(at)ssinger(dot)info>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Logical decoding & exported base snapshot
Date: 2012-12-13 20:40:43
Message-ID: 20121213204042.GC28882@awork2.anarazel.de
Lists: pgsql-hackers

On 2012-12-13 11:02:06 -0500, Steve Singer wrote:
> On 12-12-12 06:20 AM, Andres Freund wrote:
> >>Possible solutions:
> >>1) INIT_LOGICAL_REPLICATION waits for an answer from the client that
> >>confirms that logical replication initialization is finished. Before
> >>that the walsender connection cannot be used for anything else.
> >>
> >>2) we remove the snapshot as soon as any other commend is received, this
> >>way the replication connection stays usable, e.g. to issue a
> >>START_LOGICAL_REPLICATION in parallel to the initial data dump. In that
> >>case the snapshot would have to be imported *before* the next command
> >>was received as SET TRANSACTION SNAPSHOT requires the source transaction
> >>to be still open.
> >>Option 2 sounds more flexible. Is it more difficult to implement?
> >No, I don't think so. It's a bit more intrusive in that it requires
> >knowledge about logical replication in more parts of walsender, but it
> >should be ok.
> >
> >Note btw, that my description of 1) was easy to misunderstand. The
> >"that" in "Before that the walsender connection cannot be used for
> >anything else." is the answer from the client, not the usage of the
> >exported snapshot. Once the snapshot has been iimported into other
> >session(s) the source doesn't need to be alive anymore.
> >Does that explanation change anything?
>
> I think I understood you were saying the walsender connection can't be used
> for anything else (ie streaming WAL) until the exported snapshot has been
> imported. I think your clarification is still consistent with this?

Yes, that's correct.

> WIth option 2 I can still get the option 1 behaviour by not sending the next
> command to the walsender until I am done importing the snapshot. However if
> I want to start processing WAL before the snapshot has been imported I can
> do that with option 2.
>
> I am not sure I would need to do that, I'm just saying having the option is
> more flexible.

True.

Still not sure what's better, but since you're slightly leaning towards 2)
I am going to implement that.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Joachim Wieland <joe(at)mcknight(dot)de>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Logical decoding & exported base snapshot
Date: 2012-12-13 22:11:31
Message-ID: CA+Tgmoa4HmtfphB3ic0g-RNpX6p-2v4Ji5YixXvrBKJ8EhLG2g@mail.gmail.com
Lists: pgsql-hackers

On Tue, Dec 11, 2012 at 10:20 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> This is why the pg_dump master process executes a
>> lock table <table> in access share mode
>> for every table, so your commands would all block.
>
> A lock doesn't protect against schema changes made before the lock was
> taken. The reason that the described scenario is problematic is that
> pg_dump is going to be expected to work against a snapshot made before
> it gets a chance to take those table locks. Thus, there's a window
> where DDL is dangerous, and will invalidate the dump --- perhaps without
> any warning.
>
> Now, we have this problem today, in that pg_dump has to read pg_class
> before it can take table locks so some window exists already. What's
> bothering me about what Andres describes is that the window for trouble
> seems to be getting much bigger.

It would be sort of nice to have some kind of lock that would freeze
out all DDL (except for temporary objects, perhaps?). Then pg_dump
could take that lock before taking a snapshot. As a nifty side
benefit, it would perhaps provide a way around the problem that
databases with many relations are undumpable due to lock table
exhaustion.
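
For reference, the usual workaround for that today is raising
max_locks_per_transaction - the shared lock table has room for roughly
max_locks_per_transaction * (max_connections + max_prepared_transactions)
object locks - e.g. in postgresql.conf:

max_locks_per_transaction = 256    # default is 64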

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Joachim Wieland <joe(at)mcknight(dot)de>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Logical decoding & exported base snapshot
Date: 2012-12-13 22:25:15
Message-ID: 20121213222515.GD28882@awork2.anarazel.de
Lists: pgsql-hackers

On 2012-12-13 17:11:31 -0500, Robert Haas wrote:
> On Tue, Dec 11, 2012 at 10:20 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> >> This is why the pg_dump master process executes a
> >> lock table <table> in access share mode
> >> for every table, so your commands would all block.
> >
> > A lock doesn't protect against schema changes made before the lock was
> > taken. The reason that the described scenario is problematic is that
> > pg_dump is going to be expected to work against a snapshot made before
> > it gets a chance to take those table locks. Thus, there's a window
> > where DDL is dangerous, and will invalidate the dump --- perhaps without
> > any warning.
> >
> > Now, we have this problem today, in that pg_dump has to read pg_class
> > before it can take table locks so some window exists already. What's
> > bothering me about what Andres describes is that the window for trouble
> > seems to be getting much bigger.
>
> It would be sort of nice to have some kind of lock that would freeze
> out all DDL (except for temporary objects, perhaps?). Then pg_dump
> could take that lock before taking a snapshot. As a nifty side
> benefit, it would perhaps provide a way around the problem that
> databases with many relations are undumpable due to lock table
> exhaustion.

That would solve the consistency problem, yes. Would we need a special
kind of permission for this? I would say superuser/database owner only?

It should actually even work for standbys, in contrast to my FOR SHARE
hack idea, as we could "just" make it conflict with the exclusive locks
logged into the WAL.

I don't think it will help against the too-many-locks problem itself
though, unless we play some additional tricks. The locks will be
acquired later when the tables are dumped anyway. Now, given the
snapshot import/export feature we actually could dump them table by
table in a single transaction, but that's a bit more complicated.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Joachim Wieland <joe(at)mcknight(dot)de>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Logical decoding & exported base snapshot
Date: 2012-12-13 22:49:58
Message-ID: CA+TgmoZi_12sDW7vLcaFzHcASaQfYJvvJCbL45ASX=_1BX-Dkw@mail.gmail.com
Lists: pgsql-hackers

On Thu, Dec 13, 2012 at 5:25 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> That would solve the consistency problem, yes. Would we need a special
> kind of permission for this? I would say superuser/database owner only?

Yeah, I doubt we would need a whole new permission for it, but it
would certainly have to be disallowed for ordinary users.

> It should actually even work for standbys in contrast to my FOR SHARE
> hack idea as we could "just" make it conflict with the exclusive locks
> logged into the WAL.
>
> I don't think that it will against the too-many-locks problem itself
> though, unless we play some additional tricks. The locks will be
> acquired later when the tables are dumped anyway. Now, given the
> snapshot import/export feature we actually could dump them table by
> table in a single transaction, but thats a bit more complicated.

Well, I was thinking that if a transaction already had the
no-DDL-on-nontemp-objects lock then perhaps we could optimize away
AccessShareLocks on individual relations. Of course that would rely
on the global lock being an adequate substitute for all cases where an
AccessShareLock would otherwise be needed. Not sure how feasible that
is. But it would be really cool.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Joachim Wieland <joe(at)mcknight(dot)de>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Logical decoding & exported base snapshot
Date: 2012-12-13 22:55:18
Message-ID: 13812.1355439318@sss.pgh.pa.us
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Thu, Dec 13, 2012 at 5:25 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> That would solve the consistency problem, yes. Would we need a special
>> kind of permission for this? I would say superuser/database owner only?

> Yeah, I doubt we would need a whole new permission for it, but it
> would certainly have to be disallowed for ordinary users.

Giving up the ability to run pg_dump as a non-superuser would be
pretty annoying, so this would have to be an optional feature if we
restrict it that way.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Joachim Wieland <joe(at)mcknight(dot)de>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Logical decoding & exported base snapshot
Date: 2012-12-13 23:16:08
Message-ID: CA+TgmoZ4SbxX0fWSaTZCQ3SH_L1DGxOazirGOge+7+z0x1gz8w@mail.gmail.com
Lists: pgsql-hackers

On Thu, Dec 13, 2012 at 5:55 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> On Thu, Dec 13, 2012 at 5:25 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>>> That would solve the consistency problem, yes. Would we need a special
>>> kind of permission for this? I would say superuser/database owner only?
>
>> Yeah, I doubt we would need a whole new permission for it, but it
>> would certainly have to be disallowed for ordinary users.
>
> Giving up the ability to run pg_dump as a non-superuser would be
> pretty annoying, so this would have to be an optional feature if we
> restrict it that way.

Yeah, for sure.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Steve Singer <steve(at)ssinger(dot)info>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Logical decoding & exported base snapshot
Date: 2012-12-14 11:48:26
Message-ID: 20121214114826.GB20197@awork2.anarazel.de
Lists: pgsql-hackers

On 2012-12-13 21:40:43 +0100, Andres Freund wrote:
> On 2012-12-13 11:02:06 -0500, Steve Singer wrote:
> > On 12-12-12 06:20 AM, Andres Freund wrote:
> > >>Possible solutions:
> > >>1) INIT_LOGICAL_REPLICATION waits for an answer from the client that
> > >>confirms that logical replication initialization is finished. Before
> > >>that the walsender connection cannot be used for anything else.
> > >>
> > >>2) we remove the snapshot as soon as any other commend is received, this
> > >>way the replication connection stays usable, e.g. to issue a
> > >>START_LOGICAL_REPLICATION in parallel to the initial data dump. In that
> > >>case the snapshot would have to be imported *before* the next command
> > >>was received as SET TRANSACTION SNAPSHOT requires the source transaction
> > >>to be still open.
> > >>Option 2 sounds more flexible. Is it more difficult to implement?
> > >No, I don't think so. It's a bit more intrusive in that it requires
> > >knowledge about logical replication in more parts of walsender, but it
> > >should be ok.
> > >
> > >Note btw, that my description of 1) was easy to misunderstand. The
> > >"that" in "Before that the walsender connection cannot be used for
> > >anything else." is the answer from the client, not the usage of the
> > >exported snapshot. Once the snapshot has been iimported into other
> > >session(s) the source doesn't need to be alive anymore.
> > >Does that explanation change anything?
> >
> > I think I understood you were saying the walsender connection can't be used
> > for anything else (ie streaming WAL) until the exported snapshot has been
> > imported. I think your clarification is still consistent with this?
>
> Yes, thats correct.
>
> > WIth option 2 I can still get the option 1 behaviour by not sending the next
> > command to the walsender until I am done importing the snapshot. However if
> > I want to start processing WAL before the snapshot has been imported I can
> > do that with option 2.
> >
> > I am not sure I would need to do that, I'm just saying having the option is
> > more flexible.
>
> True.
>
> Still not sure whats better, but since youre slightly leaning towards 2)
> I am going to implement that.

Pushed and lightly tested.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services