Re: Changeset Extraction v7.6.1

Lists:	pgsql-hackers

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Changeset Extraction v7.0 (was logical changeset generation)
Date:	2014-01-15 00:22:23
Message-ID:	20140115002223.GA17204@awork2.anarazel.de

Hello Everyone,

I am pleased to announce the next version of the changeset
extraction feature.

There are more changes than I can remember, but here's what comes to my
tired mind:
* Initial userlevel docs (including an example session!).
* generalization of the "replication slot" system to also work for
streaming rep, although the user interface for that is mostly missing.
* don't remove WAL still required by a replication slot, be it a slot for
changeset extraction or streaming rep.
* New output plugin interface with one _PG_output_plugin_init dlsym()ed
function filling out the callbacks.
* Renaming of the init and _cleanup output plugins to startup and shutdown.
* Simplification of the prepare_write/write interface for output plugins
(no need to specify LSNs anymore).
* Renaming of the changeset extraction operations to
create_replication_slot/drop_replication_slot/start_replication
... logical.
* moving the SQL changeset functions from a contrib module into core
* Addition of peeking functions for changeset extraction.
* revised error messages
* revised comments
* ...

I've followed Robert's wishes with generalizing the replication slot
interface to not only work for changeset generation, but also streaming
rep - not sure whether that was the right choice, it's been more work
than I expected blocking things a bit but we're there now....
There's no clientside support included except as in the pg_receivexlog
hack attached as the last patch, but I also have tested it via streaming
rep and it mostly works (minus a hot_standby_feedback bug, will report
tomorrow).

What I think is missing:
* The user docs need more work, even though we're in a much better state
than before.
* Replication slots are stored in binary files. I think it might make
sense to store them as text files instead, for easier
extensibility. Especially since we want to use them for streaming rep,
I am pretty sure new attributes will soon come. I don't think it's
critical enough performance-wise to store them in binary.
* Contrary to what Robert and I'd discussed I've named the SQL functions
outputting changes decoding_slot_(get|peek)_[binary_]changes instead of
decoding_stream_* - I can change that, but the SQL functions don't
actually support streaming, so I thought that might be
confusing. Opinions?
* Robert complained earlier about the way historical catalog snapshots
are forced in tqual.c - I don't really know yet what the better way
would be here.
* Some functionality probably needs to move between the patches - it's a
bit hard to see where the best boundaries are.

The sources are in my git tree at:
http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=summary
branch xlog-decoding-rebasing-remapping.

The last two patches are *not* intended to be actually applied, but are
useful for testing.

If you want to test, you'll need a clean initdb, set wal_level=logical
and max_replication_slots>0. There's an example SQL session showing how
things can be used....

Testing, Review, Questions welcome!

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment	Content-Type	Size
0001-wal_decoding-Log-xl_running_xact-s-at-a-higher-frequ.patch.gz	application/x-patch-gzip	3.1 KB
0002-wal_decoding-Introduce-the-replication-slot-interfac.patch.gz	application/x-patch-gzip	11.2 KB
0003-wal_decoding-Introduce-changeset-extraction.patch.gz	application/x-patch-gzip	73.7 KB
0004-wal_decoding-Only-peg-the-xmin-horizon-for-catalog-t.patch.gz	application/x-patch-gzip	5.3 KB
0005-wal_decoding-Allow-walsenders-to-connect-to-a-specif.patch.gz	application/x-patch-gzip	4.0 KB
0006-wal_decoding-logical-changeset-extraction-walsender-.patch.gz	application/x-patch-gzip	11.9 KB
0007-wal_decoding-pg_recvlogical-Introduce-pg_receivexlog.patch.gz	application/x-patch-gzip	9.1 KB
0008-wal_decoding-test_decoding-Add-a-simple-decoding-mod.patch.gz	application/x-patch-gzip	24.6 KB
0009-wal_decoding-design-document-v2.4-and-snapshot-build.patch.gz	application/x-patch-gzip	12.9 KB
0010-wal_decoding-Initial-cut-at-sgml-docs-section.patch.gz	application/x-patch-gzip	8.7 KB
0011-wal_decoding-Temporarily-add-logical-decoding-regres.patch.gz	application/x-patch-gzip	1.4 KB
0012-slot-hack-up-pg_receivexlog-support.patch.gz	application/x-patch-gzip	1.6 KB

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	"pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.0 (was logical changeset generation)
Date:	2014-01-15 17:43:04
Message-ID:	CA+TgmoYyPnujZmWhVAn=vUyTDo-D4OSYYQNVFFRd7R0xbyhz2g@mail.gmail.com

This 0001 patch, to log running transactions more frequently, has been
pending for a long time now, and I haven't heard any objections, so
I've gone ahead and committed that part.

...Robert

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	"pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.0 (was logical changeset generation)
Date:	2014-01-15 18:28:25
Message-ID:	CA+TgmoaWO2xU0LGQ_DFd0RaQM5ZRa8wB6u+-0UAdSrau1i0V=w@mail.gmail.com

Review of patch 0002:

- I think you should just regard ReplicationSlotCtlLock as protecting
the "name" and "active" flags of every slot. ReplicationSlotCreate()
would then not need to mess with the spinlocks at all, and
ReplicationSlotAcquire and ReplicationSlotDrop get a bit simpler too I
think. Functions like ReplicationSlotsComputeRequiredXmin() can
acquire this lock in shared mode and only lock the slots that are
actually active, instead of all of them.

- If you address /* FIXME: apply sanity checking to slot name */, then
I think that also addresses /* XXX: do we want to use truncate
identifier instead? */. In other words, let's just error out if the
name is too long. I'm not sure what other sanity checking is needed
here; maybe just restrict it to 7-bit characters to avoid problems
with encodings for individual databases varying.

- ReplicationSlotAcquire probably needs to ignore slots that are not active.

- ReplicationSlotAcquire should be tweaked so that the code that holds
the spinlock is more self-contained. If you adopt the above-proposed
recasting of ReplicationSlotCtlLock, then the part that holds the
spinlock can probably look like this: SpinLockAcquire(&slot->mutex);
was_active = slot->active; slot->active = true;
SpinLockRelease(&slot->mutex), which looks quite a bit safer.

- If there's a coding rule that slot->database can't be changed while
the slot is active, then the check to make sure that the user isn't
trying to bind to a slot with a mis-matching database could be done
before the code described in the previous point, avoiding the need to
go back and release the resource.

- I think the critical section in ReplicationSlotDrop is bogus. If
DeleteSlot() fails, we scarcely need to PANIC. The slot just isn't
gone.

- cancel_before_shmem_exit is only guaranteed to remove the
most-recently-added callback.

- Why does ReplicationSlotsComputeRequiredXmin() need to hold
ProcArrayLock at all?

- ReplicationSlotsComputeRequiredXmin scarcely needs to hold the
spinlock while it does all of those gyrations. It can just acquire
the spinlock, copy the three fields needed into local variables, and
release the spinlock. The rest can be worked out afterwards.
Similarly in ReplicationSlotsComputeRequiredXmin.

- A comment in KillSlot wonders whether locking is required. I say
yes. It's safe to take lwlocks and spinlocks during shmem exit, and
failing to do so seems like a recipe for subtle corner-case bugs.

- pg_get_replication_slots() wonders what permissions we require. I
don't know that any special permissions are needed here; the data
we're exposing doesn't appear to be sensitive. Unless I'm missing
something?

- PG_STAT_GET_LOGICAL_DECODING_SLOTS_COLS has a leftover "logical" in its name.

- There seems to be no interface to acquire or release slots from
either SQL or the replication protocol, nor any way for a client of
this code to update its slot details. The value of
catalog_xmin/data_xmin vs. effective_catalog_xmin/effective_data_xmin
is left to the imagination.
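
A minimal sketch of the self-contained spinlock section suggested above
for ReplicationSlotAcquire, written as standalone C. It assumes a C11
atomic_flag as a stand-in for the per-slot spinlock; DemoSlot and the
demo_* functions are hypothetical names, not code from the patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a replication slot; not the patch's struct. */
typedef struct DemoSlot
{
    atomic_flag mutex;   /* plays the role of the per-slot spinlock */
    bool        active;  /* true while some backend is using the slot */
} DemoSlot;

static void
demo_spin_lock(DemoSlot *slot)
{
    /* Busy-wait until the flag was previously clear. */
    while (atomic_flag_test_and_set_explicit(&slot->mutex,
                                             memory_order_acquire))
        ;
}

static void
demo_spin_unlock(DemoSlot *slot)
{
    atomic_flag_clear_explicit(&slot->mutex, memory_order_release);
}

/*
 * Try to mark the slot active. Returns true if the caller now owns the
 * slot, false if some other user had already marked it active.
 */
static bool
demo_slot_try_acquire(DemoSlot *slot)
{
    bool was_active;

    assert(slot != NULL);

    /* The locked region only reads and sets one flag; nothing can fail. */
    demo_spin_lock(slot);
    was_active = slot->active;
    slot->active = true;
    demo_spin_unlock(slot);

    return !was_active;
}
```

The caller decides what to do with the result (for example, raising an
error when the slot was already active) only after the lock is released,
so no error path runs while the lock is held.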

...Robert

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	"pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.0 (was logical changeset generation)
Date:	2014-01-15 20:39:01
Message-ID:	20140115203901.GF8653@awork2.anarazel.de

Hi,

On 2014-01-15 13:28:25 -0500, Robert Haas wrote:
> - I think you should just regard ReplicationSlotCtlLock as protecting
> the "name" and "active" flags of every slot. ReplicationSlotCreate()
> would then not need to mess with the spinlocks at all, and
> ReplicationSlotAcquire and ReplicationSlotDrop get a bit simpler too I
> think. Functions like ReplicationSlotsComputeRequiredXmin() can
> acquire this lock in shared mode and only lock the slots that are
> actually active, instead of all of them.

I first thought you meant that we should get rid of the spinlock, but
after rereading I think you just mean that ->name, ->active, ->in_use
are only allowed to change while holding the lwlock exclusively so we
don't need to spinlock in those cases? If so, yes, that works for me.

> - If you address /* FIXME: apply sanity checking to slot name */, then
> I think that also addresses /* XXX: do we want to use truncate
> identifier instead? */. In other words, let's just error out if the
> name is too long. I'm not sure what other sanity checking is needed
> here; maybe just restrict it to 7-bit characters to avoid problems
> with encodings for individual databases varying.

Yea, erroring out seems like a good idea. But I think we need to
restrict slot names a bit more than that, given they are used as
filenames... We could instead name the files using the slot's offset,
but I'd prefer to not go that route.

> - ReplicationSlotAcquire probably needs to ignore slots that are not active.

Not sure what you mean? If the slot isn't in_use we'll skip it in the loop.

> - If there's a coding rule that slot->database can't be changed while
> the slot is active, then the check to make sure that the user isn't
> trying to bind to a slot with a mis-matching database could be done
> before the code described in the previous point, avoiding the need to
> go back and release the resource.

I don't think slot->database should be allowed to change at all...

> - I think the critical section in ReplicationSlotDrop is bogus. If
> DeleteSlot() fails, we scarcely need to PANIC. The slot just isn't
> gone.

Well, if delete slot fails, we don't really know at which point it
failed which means that the on-disk state might not correspond to the
in-memory state. I don't see a point in adding code trying to handle
that case correctly...

> - cancel_before_shmem_exit is only guaranteed to remove the
> most-recently-added callback.

Yea :(. I think that's safe for the current usages but seems mighty
fragile... Not sure what to do about it. Just register KillSlot once and
keep it registered?

> - Why does ReplicationSlotsComputeRequiredXmin() need to hold
> ProcArrayLock at all?

There's reasoning, but I just noticed that its basis might be flawed
anyway :(.
When starting changeset extraction in a new slot we need to guarantee
that we only start decoding records we know the catalog tuples haven't
been removed for.
So, when creating the slot I've so far done a GetOldestXmin() and used
that to check against xl_running_xact->oldestRunningXid. But
GetOldestXmin() can go backwards...

I'll think a bit and try to simplify this.

> - ReplicationSlotsComputeRequiredXmin scarcely needs to hold the
> spinlock while it does all of those gyrations. It can just acquire
> the spinlock, copy the three fields needed into local variables, and
> release the spinlock. The rest can be worked out afterwards.
> Similarly in ReplicationSlotsComputeRequiredXmin.

Yea, will change.
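
Roughly, the "copy under the spinlock, compute afterwards" shape could
look like the following standalone C sketch. The struct, its three
fields, and the demo_* names are assumptions for illustration only, a
C11 atomic_flag stands in for the spinlock, and the plain "<" comparison
ignores transaction ID wraparound.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_INVALID_XID 0u   /* assumed "no xmin" marker */

/* Hypothetical slot layout; field names are illustrative. */
typedef struct XminSlot
{
    atomic_flag mutex;                   /* stand-in for the spinlock */
    bool        in_use;
    uint32_t    effective_data_xmin;
    uint32_t    effective_catalog_xmin;
} XminSlot;

/* Smaller of two xids, treating DEMO_INVALID_XID as "unset". */
static uint32_t
demo_min_xid(uint32_t a, uint32_t b)
{
    if (a == DEMO_INVALID_XID)
        return b;
    if (b == DEMO_INVALID_XID)
        return a;
    return a < b ? a : b;
}

static uint32_t
demo_compute_required_xmin(XminSlot *slots, int nslots)
{
    uint32_t result = DEMO_INVALID_XID;

    assert(slots != NULL || nslots == 0);

    for (int i = 0; i < nslots; i++)
    {
        XminSlot *slot = &slots[i];
        bool      in_use;
        uint32_t  data_xmin;
        uint32_t  catalog_xmin;

        /* Hold the lock only long enough to copy the three fields. */
        while (atomic_flag_test_and_set_explicit(&slot->mutex,
                                                 memory_order_acquire))
            ;
        in_use = slot->in_use;
        data_xmin = slot->effective_data_xmin;
        catalog_xmin = slot->effective_catalog_xmin;
        atomic_flag_clear_explicit(&slot->mutex, memory_order_release);

        /* Everything below works on the local copies. */
        if (!in_use)
            continue;
        result = demo_min_xid(result, data_xmin);
        result = demo_min_xid(result, catalog_xmin);
    }

    return result;
}
```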

> - A comment in KillSlot wonders whether locking is required. I say
> yes. It's safe to take lwlocks and spinlocks during shmem exit, and
> failing to do so seems like a recipe for subtle corner-case bugs.

I agree that it's safe to use spinlocks, but lwlocks? What if we are
erroring out while holding an lwlock? Code that runs inside a
TransactionCommand is protected against that, but if not ProcKill()
which invokes LWLockReleaseAll() runs pretty late in the teardown
process...

> - pg_get_replication_slots() wonders what permissions we require. I
> don't know that any special permissions are needed here; the data
> we're exposing doesn't appear to be sensitive. Unless I'm missing
> something?

I don't see a problem either, but it seems others have -
pg_stat_replication only displays minor amounts of information if one
doesn't have the replication privilege... Not sure what the reasoning
there is, and whether it applies here as well.

> - There seems to be no interface to acquire or release slots from
> either SQL or the replication protocol, nor any way for a client of
> this code to update its slot details.

I don't think either ever needs to do that - START_REPLICATION SLOT slot
...; and decoding_slot_*changes will acquire/release for them while
active. What would the usecase be to allow them to be acquired from SQL?

The slot details get updated by the respective replication code. For
streaming rep, that should happen via reply and feedback messages. For
changeset extraction it happens when LogicalConfirmReceivedLocation() is
called; the walsender interface does that using reply messages, the SQL
interface calls it when finished (unless you use the _peek_ functions).

> The value of
> catalog_xmin/data_xmin vs. effective_catalog_xmin/effective_data_xmin
> is left to the imagination.

There's a comment about them in a following patch. Basically the reason
is that for changeset extraction we cannot adjust the in-memory value
before we know the changed slot status is safely synced to
disk. Otherwise a client could restart streaming at a LSN where the
corresponding catalog details are gone since it's not prevented by the
slot anymore.

That could be done by just holding some lock forbidding the global xmin
value from being recomputed while writing to disk, but that seems awfully
heavy-handed. So the protocol is:
1) update ->catalog_xmin to the new xmin,
2) sync slot to disk
3) set ->effective_catalog_xmin = ->catalog_xmin
4) ReplicationSlotsComputeRequiredXmin()

Since ComputeRequiredXmin() only looks at effective_catalog_xmin that
guarantees that the global xmin horizon doesn't increase before the
slot has been synced to disk. If we crash after 2,
StartupReplicationSlots() simply sets effective_catalog_xmin =
catalog_xmin, we know it's safely on disk now since we've just read it
from there.
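
The four steps and the crash-recovery rule above can be sketched in
standalone C as follows. This is only an illustration under stated
assumptions: the on-disk state is modelled by a struct field, the sync
step is a hypothetical function that can be told to fail, and all
demo_* names are invented here rather than taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical slot model; on_disk_catalog_xmin stands in for the file. */
typedef struct DurableSlot
{
    uint32_t catalog_xmin;            /* value being persisted */
    uint32_t effective_catalog_xmin;  /* value the horizon is computed from */
    uint32_t on_disk_catalog_xmin;    /* simulated durable copy */
} DurableSlot;

static uint32_t demo_global_xmin;     /* stand-in for the global horizon */

/* Pretend to write and fsync the slot; returns false on failure. */
static bool
demo_sync_slot(DurableSlot *slot, bool simulate_failure)
{
    if (simulate_failure)
        return false;
    slot->on_disk_catalog_xmin = slot->catalog_xmin;
    return true;
}

/* The horizon only ever looks at the effective value. */
static void
demo_compute_required_xmin(const DurableSlot *slot)
{
    demo_global_xmin = slot->effective_catalog_xmin;
}

static bool
demo_advance_catalog_xmin(DurableSlot *slot, uint32_t new_xmin,
                          bool simulate_failure)
{
    assert(slot != NULL);

    slot->catalog_xmin = new_xmin;                      /* step 1 */
    if (!demo_sync_slot(slot, simulate_failure))        /* step 2 */
        return false;           /* horizon is left untouched */
    slot->effective_catalog_xmin = slot->catalog_xmin;  /* step 3 */
    demo_compute_required_xmin(slot);                   /* step 4 */
    return true;
}

/* After a crash, whatever was read from disk is known to be durable. */
static void
demo_startup_slot(DurableSlot *slot)
{
    assert(slot != NULL);
    slot->catalog_xmin = slot->on_disk_catalog_xmin;
    slot->effective_catalog_xmin = slot->catalog_xmin;
}
```

If the simulated sync fails, effective_catalog_xmin and the horizon keep
their old values, which is the property the ordering is meant to give.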

Thanks for committing 0001!

Regards,

Andres

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	"pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.0 (was logical changeset generation)
Date:	2014-01-16 14:34:51
Message-ID:	CA+TgmoZt2DhWKWwgbqdUpKJ-tH9V=fD8k9uHZYKVokrUR=Me_A@mail.gmail.com

On Wed, Jan 15, 2014 at 3:39 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> On 2014-01-15 13:28:25 -0500, Robert Haas wrote:
>> - I think you should just regard ReplicationSlotCtlLock as protecting
>> the "name" and "active" flags of every slot. ReplicationSlotCreate()
>> would then not need to mess with the spinlocks at all, and
>> ReplicationSlotAcquire and ReplicationSlotDrop get a bit simpler too I
>> think. Functions like ReplicationSlotsComputeRequiredXmin() can
>> acquire this lock in shared mode and only lock the slots that are
>> actually active, instead of all of them.
>
> I first thought you meant that we should get rid of the spinlock, but
> after rereading I think you just mean that ->name, ->active, ->in_use
> are only allowed to change while holding the lwlock exclusively so we
> don't need to spinlock in those cases? If so, yes, that works for me.

Yeah, that's about what I had in mind.

>> - If you address /* FIXME: apply sanity checking to slot name */, then
>> I think that also addresses /* XXX: do we want to use truncate
>> identifier instead? */. In other words, let's just error out if the
>> name is too long. I'm not sure what other sanity checking is needed
>> here; maybe just restrict it to 7-bit characters to avoid problems
>> with encodings for individual databases varying.
>
> Yea, erroring out seems like a good idea. But I think we need to
> restrict slot names a bit more than that, given they are used as
> filenames... We could instead name the files using the slot's offset,
> but I'd prefer to not go that route.

OK. Well, add some code, then. :-)

>> - ReplicationSlotAcquire probably needs to ignore slots that are not active.
> Not sure what you mean? If the slot isn't in_use we'll skip it in the loop.

active != in_use.

I suppose your point is that the slot can't be in_use if it's not also
active. Maybe it would be better to get rid of active/in_use and have
three states: REPLSLOT_CONNECTED, REPLSLOT_NOT_CONNECTED,
REPLSLOT_FREE. Or something like that.

>> - If there's a coding rule that slot->database can't be changed while
>> the slot is active, then the check to make sure that the user isn't
>> trying to bind to a slot with a mis-matching database could be done
>> before the code described in the previous point, avoiding the need to
>> go back and release the resource.
>
> I don't think slot->database should be allowed to change at all...

Well, it can if the slot is dropped and a new one created.

>> - I think the critical section in ReplicationSlotDrop is bogus. If
>> DeleteSlot() fails, we scarcely need to PANIC. The slot just isn't
>> gone.
>
> Well, if delete slot fails, we don't really know at which point it
> failed which means that the on-disk state might not correspond to the
> in-memory state. I don't see a point in adding code trying to handle
> that case correctly...

Deleting the slot should be an atomic operation. There's some
critical point before which the slot will be picked up by recovery and
after which it won't. You either did that operation, or not, and can
adjust the in-memory state accordingly.

>> - cancel_before_shmem_exit is only guaranteed to remove the
>> most-recently-added callback.
>
> Yea :(. I think that's safe for the current usages but seems mighty
> fragile... Not sure what to do about it. Just register KillSlot once and
> keep it registered?

Yep. Use a module-private flag to decide whether it needs to do anything.
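
A rough standalone sketch of "register once, gate on a module-private
flag". It uses the C library's atexit() purely as a stand-in for the
backend's exit-callback registration, and every demo_* name is
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

static bool demo_callback_registered = false;
static bool demo_slot_held = false;  /* module-private: do we hold a slot? */
static int  demo_cleanup_runs = 0;   /* counts real cleanups, for the demo */

/* Exit callback: stays registered, does nothing unless a slot is held. */
static void
demo_kill_slot(void)
{
    if (!demo_slot_held)
        return;
    demo_slot_held = false;
    demo_cleanup_runs++;
}

static void
demo_acquire(void)
{
    assert(!demo_slot_held);

    /* Register the callback the first time only and never cancel it. */
    if (!demo_callback_registered && atexit(demo_kill_slot) == 0)
        demo_callback_registered = true;

    demo_slot_held = true;
}

static void
demo_release(void)
{
    /* A normal release just clears the flag; the callback stays put. */
    demo_slot_held = false;
}
```

Because the callback is never cancelled, the ordering limitation of
cancel_before_shmem_exit does not come into play in this arrangement.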

>> - A comment in KillSlot wonders whether locking is required. I say
>> yes. It's safe to take lwlocks and spinlocks during shmem exit, and
>> failing to do so seems like a recipe for subtle corner-case bugs.
>
> I agree that it's safe to use spinlocks, but lwlocks? What if we are
> erroring out while holding an lwlock? Code that runs inside a
> TransactionCommand is protected against that, but if not ProcKill()
> which invokes LWLockReleaseAll() runs pretty late in the teardown
> process...

Hmm. I guess it'd be fine to decide that a connected slot can be
marked not-connected without the lock. I think you'd want a rule that
a slot can't be freed except when it's not-connected; otherwise, you
might end up marking the slot not-connected after someone else had
already recycled it for an unrelated purpose (drop slot, create new
slot).
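
For illustration, the three-state idea and the "only free a
not-connected slot" rule could be sketched like this in standalone C.
The enum values are the ones floated above; the transition functions
and the DemoSlotState name are assumptions, and all locking is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum DemoSlotState
{
    REPLSLOT_FREE,           /* array entry is unused */
    REPLSLOT_NOT_CONNECTED,  /* slot exists, nobody is using it */
    REPLSLOT_CONNECTED       /* slot exists and a backend is using it */
} DemoSlotState;

/* Only an existing, idle slot can be connected to. */
static bool
demo_slot_connect(DemoSlotState *state)
{
    assert(state != NULL);
    if (*state != REPLSLOT_NOT_CONNECTED)
        return false;
    *state = REPLSLOT_CONNECTED;
    return true;
}

/* Exit-time cleanup: only ever moves CONNECTED to NOT_CONNECTED. */
static void
demo_slot_disconnect(DemoSlotState *state)
{
    assert(state != NULL);
    if (*state == REPLSLOT_CONNECTED)
        *state = REPLSLOT_NOT_CONNECTED;
}

/* A slot may be freed only while it is not connected. */
static bool
demo_slot_free(DemoSlotState *state)
{
    assert(state != NULL);
    if (*state != REPLSLOT_NOT_CONNECTED)
        return false;
    *state = REPLSLOT_FREE;
    return true;
}
```

Since a connected slot cannot be freed, it cannot be recycled by someone
else before its owner has marked it not-connected.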

>> - There seems to be no interface to acquire or release slots from
>> either SQL or the replication protocol, nor any way for a client of
>> this code to update its slot details.
>
> I don't think either ever needs to do that - START_REPLICATION SLOT slot
> ...; and decoding_slot_*changes will acquire/release for them while
> active. What would the usecase be to allow them to be acquired from SQL?

My point isn't so much about SQL as that with just this patch I don't
see any way for anyone to ever acquire a slot for anything, ever. So
I think there's a piece missing, or three.

> The slot details get updated by the respective replication code. For
> streaming rep, that should happen via reply and feedback messages. For
> changeset extraction it happens when LogicalConfirmReceivedLocation() is
> called; the walsender interface does that using reply messages, the SQL
> interface calls it when finished (unless you use the _peek_ functions).

Right, but where is this code? I don't see this updating the reply
and feedback message processing code to touch slots. Did I miss that?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	"pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.0 (was logical changeset generation)
Date:	2014-01-16 14:54:32
Message-ID:	20140116145432.GD4498@alap3.lan

Hi,

On 2014-01-16 09:34:51 -0500, Robert Haas wrote:
> >> - ReplicationSlotAcquire probably needs to ignore slots that are not active.
> > Not sure what you mean? If the slot isn't in_use we'll skip it in the loop.
>
> active != in_use.
>
> I suppose your point is that the slot can't be in_use if it's not also
> active.

Yes. There's asserts to that end...

> Maybe it would be better to get rid of active/in_use and have
> three states: REPLSLOT_CONNECTED, REPLSLOT_NOT_CONNECTED,
> REPLSLOT_FREE. Or something like that.

Hm. Color me unenthusiastic. If you feel strongly I can change it, but
otherwise not.

> >> - If there's a coding rule that slot->database can't be changed while
> >> the slot is active, then the check to make sure that the user isn't
> >> trying to bind to a slot with a mis-matching database could be done
> >> before the code described in the previous point, avoiding the need to
> >> go back and release the resource.
> >
> > I don't think slot->database should be allowed to change at all...
>
> Well, it can if the slot is dropped and a new one created.

Well. That obviously requires the lwlock to be acquired...

> >> - I think the critical section in ReplicationSlotDrop is bogus. If
> >> DeleteSlot() fails, we scarcely need to PANIC. The slot just isn't
> >> gone.
> >
> > Well, if delete slot fails, we don't really know at which point it
> > failed which means that the on-disk state might not correspond to the
> > in-memory state. I don't see a point in adding code trying to handle
> > that case correctly...
>
> Deleting the slot should be an atomic operation. There's some
> critical point before which the slot will be picked up by recovery and
> after which it won't. You either did that operation, or not, and can
> adjust the in-memory state accordingly.

I am not sure I understand that point. We can either update the
in-memory bit before performing the on-disk operations or
afterwards. Either way, there's a way to be inconsistent if the disk
operation fails somewhere in between (it might fail but still have
deleted the file/directory!). The normal way to handle that in other
places is PANICing when we don't know so we recover from the on-disk
state.
I really don't see the problem here? Code doesn't get more robust by
doing s/PANIC/ERROR/, rather the contrary. It takes extra smarts to only
ERROR, often that's not warranted.

> >> - A comment in KillSlot wonders whether locking is required. I say
> >> yes. It's safe to take lwlocks and spinlocks during shmem exit, and
> >> failing to do so seems like a recipe for subtle corner-case bugs.
> >
> > I agree that it's safe to use spinlocks, but lwlocks? What if we are
> > erroring out while holding an lwlock? Code that runs inside a
> > TransactionCommand is protected against that, but if not ProcKill()
> > which invokes LWLockReleaseAll() runs pretty late in the teardown
> > process...
>
> Hmm. I guess it'd be fine to decide that a connected slot can be
> marked not-connected without the lock.

I now acquire the spinlock since that has to work, or we have much worse
problems... That guarantees that other backends see the value as well.

> I think you'd want a rule that
> a slot can't be freed except when it's not-connected; otherwise, you
> might end up marking the slot not-connected after someone else had
> already recycled it for an unrelated purpose (drop slot, create new
> slot).

Yea, that rule is there. Otherwise we'd get in great trouble.

> >> - There seems to be no interface to acquire or release slots from
> >> either SQL or the replication protocol, nor any way for a client of
> >> this code to update its slot details.
> >
> > I don't think either ever needs to do that - START_REPLICATION SLOT slot
> > ...; and decoding_slot_*changes will acquire/release for them while
> > active. What would the usecase be to allow them to be acquired from SQL?
>
> My point isn't so much about SQL as that with just this patch I don't
> see any way for anyone to ever acquire a slot for anything, ever. So
> I think there's a piece missing, or three.

The slot is acquired by code using the slot. So when START_REPLICATION
SLOT ... (in contrast to a START_REPLICATION without SLOT) is sent,
walsender.c does an ReplicationSlotAcquire(cmd->slotname) in
StartReplication() and releases it after it has finished.

> > The slot details get updated by the respective replication code. For
> > streaming rep, that should happen via reply and feedback
> > messages. For changeset extraction it happens when
> > LogicalConfirmReceivedLocation() is called; the walsender interface
> > does that using reply messages, the SQL interface calls it when
> > finished (unless you use the _peek_ functions).
>
> Right, but where is this code? I don't see this updating the reply
> and feedback message processing code to touch slots. Did I miss that?

It's in "wal_decoding: logical changeset extraction walsender interface"
currently :(. Splitting the streaming replication part of that patch off
isn't easy...

Greetings,

Andres Freund

From:	Craig Ringer <craig(at)2ndquadrant(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	"pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.0 (was logical changeset generation)
Date:	2014-01-17 03:15:40
Message-ID:	52D8A05C.5080504@2ndquadrant.com

On 01/16/2014 02:28 AM, Robert Haas wrote:
> - If you address /* FIXME: apply sanity checking to slot name */, then
> I think that also addresses /* XXX: do we want to use truncate
> identifier instead? */. In other words, let's just error out if the
> name is too long. I'm not sure what other sanity checking is needed
> here; maybe just restrict it to 7-bit characters to avoid problems
> with encodings for individual databases varying.

It's a common misunderstanding that restricting to 7-bit solves encoding
issues.

Thanks to the joy that is SHIFT_JIS, we must also disallow the backslash
and tilde characters.

Anybody who actually uses SHIFT_JIS as an operational encoding, rather
than as an input/output encoding, is into pain and suffering. Personally
I'd be quite happy to see it supported as client_encoding, but forbidden
as a server-side encoding. That's not the case right now - so since we
support it, we'd better guard against its quirks.

slotnames can't be regular identifiers, because they might contain chars
not valid in another DB's encoding. So let's just restrict them to
[a-zA-Z0-9_ -] and be done with it.
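
A minimal standalone sketch of such a check, combining the character
class above with the earlier "error out if the name is too long" idea.
The 63-byte limit and the demo_* names are assumptions made for this
example, not values from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define DEMO_SLOT_NAME_MAX 63   /* assumed maximum length, in bytes */

/* Accept exactly [a-zA-Z0-9_ -]; explicit ranges avoid locale effects. */
static bool
demo_slot_name_char_ok(char c)
{
    return (c >= 'a' && c <= 'z') ||
           (c >= 'A' && c <= 'Z') ||
           (c >= '0' && c <= '9') ||
           c == '_' || c == ' ' || c == '-';
}

/* Returns true if the name is non-empty, short enough, and fully allowed. */
static bool
demo_slot_name_is_valid(const char *name)
{
    size_t len;

    assert(name != NULL);

    len = strlen(name);
    if (len == 0 || len > DEMO_SLOT_NAME_MAX)
        return false;       /* reject rather than truncate */

    for (size_t i = 0; i < len; i++)
    {
        if (!demo_slot_name_char_ok(name[i]))
            return false;
    }
    return true;
}
```

With this class, path separators, backslash and tilde are all rejected,
which also covers the filename concern raised earlier in the thread.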

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Craig Ringer <craig(at)2ndquadrant(dot)com>
Cc:	Andres Freund <andres(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.0 (was logical changeset generation)
Date:	2014-01-18 13:31:55
Message-ID:	CA+Tgmob=C2wfYKH9C1ke9-t6PrQrENFv8AtoncHCrg4f+wKtdg@mail.gmail.com

On Thu, Jan 16, 2014 at 10:15 PM, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:
> Anybody who actually uses SHIFT_JIS as an operational encoding, rather
> than as an input/output encoding, is into pain and suffering. Personally
> I'd be quite happy to see it supported as client_encoding, but forbidden
> as a server-side encoding. That's not the case right now - so since we
> support it, we'd better guard against its quirks.

I think that *is* the case right now. pg_wchar.h sayeth:

/* followings are for client encoding only */
PG_SJIS, /* Shift JIS (Windows-932) */
PG_BIG5, /* Big5 (Windows-950) */
PG_GBK, /* GBK (Windows-936) */

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	"pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.0 (was logical changeset generation)
Date:	2014-01-18 13:35:47
Message-ID:	CA+TgmoasqtaJmgTRBqyRTpQdHDCqjt979W+zY66Ypk5RFEyeoQ@mail.gmail.com

On Thu, Jan 16, 2014 at 9:54 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> Maybe it would be better to get rid of active/in_use and have
>> three states: REPLSLOT_CONNECTED, REPLSLOT_NOT_CONNECTED,
>> REPLSLOT_FREE. Or something like that.
>
> Hm. Color me unenthusiastic. If you feel strongly I can change it, but
> otherwise not.

I found the active/in_use distinction confusing; I thought one
three-state flag rather than two Booleans might be clearer. But I
might be able to just suck it up.

>> >> - If there's a coding rule that slot->database can't be changed while
>> >> the slot is active, then the check to make sure that the user isn't
>> >> trying to bind to a slot with a mis-matching database could be done
>> >> before the code described in the previous point, avoiding the need to
>> >> go back and release the resource.
>> >
>> > I don't think slot->database should be allowed to change at all...
>>
>> Well, it can if the slot is dropped and a new one created.
>
> Well. That obviously requires the lwlock to be acquired...

Right, so the point of this comment originally was you had some logic
that could be moved sooner to avoid having to undo so much on a
failure.

>> >> - I think the critical section in ReplicationSlotDrop is bogus. If
>> >> DeleteSlot() fails, we scarcely need to PANIC. The slot just isn't
>> >> gone.
>> >
>> > Well, if delete slot fails, we don't really know at which point it
>> > failed which means that the on-disk state might not correspond to the
>> > in-memory state. I don't see a point in adding code trying to handle
>> > that case correctly...
>>
>> Deleting the slot should be an atomic operation. There's some
>> critical point before which the slot will be picked up by recovery and
>> after which it won't. You either did that operation, or not, and can
>> adjust the in-memory state accordingly.
>
> I am not sure I understand that point. We can either update the
> in-memory bit before performing the on-disk operations or
> afterwards. Either way, there's a way to be inconsistent if the disk
> operation fails somewhere in between (it might fail but still have
> deleted the file/directory!). The normal way to handle that in other
> places is PANICing when we don't know so we recover from the on-disk
> state.
> I really don't see the problem here? Code doesn't get more robust by
> doing s/PANIC/ERROR/, rather the contrary. It takes extra smarts to only
> ERROR, often that's not warranted.

People get cranky when the database PANICs because of a filesystem
failure. We should avoid that, especially when it's trivial to do so.
The update to shared memory should be done second and should be set
up to be no-fail.

>> > The slot details get updated by the respective replication code. For
>> > streaming rep, that should happen via reply and feedback
>> > messages. For changeset extraction it happens when
>> > LogicalConfirmReceivedLocation() is called; the walsender interface
>> > does that using reply messages, the SQL interface calls it when
>> > finished (unless you use the _peek_ functions).
>>
>> Right, but where is this code? I don't see this updating the reply
>> and feedback message processing code to touch slots. Did I miss that?
>
> It's in "wal_decoding: logical changeset extraction walsender interface"
> currently :(. Splitting the streaming replication part of that patch off
> isn't easy...

Ack. I was hoping to work through these patches one at a time, but
that's not going to work if they are interdependent to that degree.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Craig Ringer <craig(at)2ndquadrant(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Andres Freund <andres(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.0 (was logical changeset generation)
Date:	2014-01-18 15:12:51
Message-ID:	52DA99F3.905@2ndquadrant.com
Lists:	pgsql-hackers

On 01/18/2014 09:31 PM, Robert Haas wrote:
> On Thu, Jan 16, 2014 at 10:15 PM, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:
>> Anybody who actually uses SHIFT_JIS as an operational encoding, rather
>> than as an input/output encoding, is into pain and suffering. Personally
>> I'd be quite happy to see it supported as client_encoding, but forbidden
>> as a server-side encoding. That's not the case right now - so since we
>> support it, we'd better guard against its quirks.
>
> I think that *is* the case right now. pg_wchar.h sayeth:
>
> /* followings are for client encoding only */
> PG_SJIS, /* Shift JIS
> (Winindows-932) */
> PG_BIG5, /* Big5 (Windows-950) */
> PG_GBK, /* GBK (Windows-936) */

Perfect - that makes ASCII-only just fine, IMO.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Craig Ringer <craig(at)2ndquadrant(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.0 (was logical changeset generation)
Date:	2014-01-18 17:04:07
Message-ID:	19833.1390064647@sss.pgh.pa.us
Lists:	pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Thu, Jan 16, 2014 at 10:15 PM, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:
>> Anybody who actually uses SHIFT_JIS as an operational encoding, rather
>> than as an input/output encoding, is into pain and suffering. Personally
>> I'd be quite happy to see it supported as client_encoding, but forbidden
>> as a server-side encoding. That's not the case right now - so since we
>> support it, we'd better guard against its quirks.

> I think that *is* the case right now.

SHIFT_JIS is not and never will be allowed as a server encoding,
precisely because it has multi-byte characters of which some bytes could
be taken for ASCII. The same is true of our other client-only encodings.
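
To illustrate the point, here is a minimal sketch (not PostgreSQL code; the function names are hypothetical). In Shift JIS the katakana character SO is encoded as the two bytes 0x83 0x5C, and 0x5C is the byte ASCII assigns to the backslash. The lead-byte ranges below follow the commonly used Windows-932 layout and are an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Encoding-unaware scan: counts every byte equal to 0x5C ('\\' in ASCII). */
size_t count_backslash_bytes(const unsigned char *s, size_t len)
{
    size_t n = 0;

    assert(s != NULL);
    for (size_t i = 0; i < len; i++)
        if (s[i] == 0x5C)
            n++;
    return n;
}

/* Lead bytes of two-byte Shift JIS characters (assumed ranges). */
static int sjis_is_lead(unsigned char c)
{
    return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

/* Encoding-aware scan: skips the trail byte of each two-byte character. */
size_t count_backslash_chars_sjis(const unsigned char *s, size_t len)
{
    size_t n = 0;

    assert(s != NULL);
    for (size_t i = 0; i < len; i++)
    {
        if (sjis_is_lead(s[i]) && i + 1 < len)
            i++;                /* trail byte belongs to this character */
        else if (s[i] == 0x5C)
            n++;
    }
    return n;
}
```

Any server-side code that scans for ASCII bytes without decoding would behave like the first function, which is why such encodings are only safe on the client side.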

regards, tom lane

From:	Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Craig Ringer <craig(at)2ndquadrant(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.0 (was logical changeset generation)
Date:	2014-01-19 14:31:06
Message-ID:	52DBE1AA.8060500@kaltenbrunner.cc
Lists:	pgsql-hackers

On 01/18/2014 02:31 PM, Robert Haas wrote:
> On Thu, Jan 16, 2014 at 10:15 PM, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:
>> Anybody who actually uses SHIFT_JIS as an operational encoding, rather
>> than as an input/output encoding, is into pain and suffering. Personally
>> I'd be quite happy to see it supported as client_encoding, but forbidden
>> as a server-side encoding. That's not the case right now - so since we
>> support it, we'd better guard against its quirks.
>
> I think that *is* the case right now. pg_wchar.h sayeth:
>
> /* followings are for client encoding only */
> PG_SJIS, /* Shift JIS
> (Winindows-932) */

while you have that file open: s/Winindows-932/Windows-932 maybe?

Stefan

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	"pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.0 (was logical changeset generation)
Date:	2014-01-22 14:48:46
Message-ID:	20140122144846.GH21170@alap3.anarazel.de
Lists:	pgsql-hackers

On 2014-01-18 08:35:47 -0500, Robert Haas wrote:
> > I am not sure I understand that point. We can either update the
> > in-memory bit before performing the on-disk operations or
> > afterwards. Either way, there's a way to be inconsistent if the disk
> > operation fails somewhere in between (it might fail but still have
> > deleted the file/directory!). The normal way to handle that in other
> > places is PANICing when we don't know so we recover from the on-disk
> > state.
> > I really don't see the problem here? Code doesn't get more robust by
> > doing s/PANIC/ERROR/, rather the contrary. It takes extra smarts to only
> > ERROR, often that's not warranted.
>
> People get cranky when the database PANICs because of a filesystem
> failure. We should avoid that, especially when it's trivial to do so.
> The update to shared memory should be done second and should be set
> up to be no-fail.

I don't see how that would help. If we fail during unlink/rmdir, we
don't really know at which point we failed. Just keeping the slot in
memory won't help us in any way - we'll continue to reserve resources
while the slot is half-gone.
I don't think trying to handle errors we don't understand and don't
routinely expect actually improves robustness. It just leads to
harder-to-diagnose errors. It's not like the cases here are likely to be
caused by anything but severe admin failure like removing the write
permissions of the postgres directory while the server is running. Or do
you see more valid causes?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	"pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.0 (was logical changeset generation)
Date:	2014-01-22 15:14:27
Message-ID:	CA+TgmoZ9yeTiwxc10Y+4+hZ2oAgeEUtPiSSgPTZ5ynODe74+eg@mail.gmail.com
Lists:	pgsql-hackers

On Wed, Jan 22, 2014 at 9:48 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> On 2014-01-18 08:35:47 -0500, Robert Haas wrote:
>> > I am not sure I understand that point. We can either update the
>> > in-memory bit before performing the on-disk operations or
>> > afterwards. Either way, there's a way to be inconsistent if the disk
>> > operation fails somewhere in between (it might fail but still have
>> > deleted the file/directory!). The normal way to handle that in other
>> > places is PANICing when we don't know so we recover from the on-disk
>> > state.
>> > I really don't see the problem here? Code doesn't get more robust by
>> > doing s/PANIC/ERROR/, rather the contrary. It takes extra smarts to only
>> > ERROR, often that's not warranted.
>>
>> People get cranky when the database PANICs because of a filesystem
>> failure. We should avoid that, especially when it's trivial to do so.
>> The update to shared memory should be done second and should be set
>> up to be no-fail.
>
> I don't see how that would help. If we fail during unlink/rmdir, we
> don't really know at which point we failed.

This doesn't make sense to me. unlink/rmdir are atomic operations.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Changeset Extraction v7.1
Date:	2014-01-22 15:34:58
Message-ID:	20140122153458.GI21170@alap3.anarazel.de
Lists:	pgsql-hackers

Hi,

Attached is v7.1 of the patchset with (among others) the following
changes:
* rebase to master
* split the slot support for streaming replication into a separate
patch, early in the series
* slot names are now limited to /[a-z0-9_]{1,NAMEDATALEN-1}/
* computation of the initial xmin for changeset extraction is now done
with an extra routine getting rid of races around GetOldestXmin()
going forwards and then backwards and getting rid of an additional
parameter to GetOldestXmin().
* slot locking is rejiggered according to Robert's suggestions
* comment improvements
* sgml documentation improvements
* ...
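
As an illustration of the slot-name rule above, a minimal sketch of a validator (not the code from the patch; the function name is hypothetical, and NAMEDATALEN is assumed to be PostgreSQL's default of 64):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NAMEDATALEN 64          /* assumed: PostgreSQL's default value */

/* Returns true if the whole name matches /[a-z0-9_]{1,NAMEDATALEN-1}/. */
bool slot_name_is_valid(const char *name)
{
    size_t len = 0;

    assert(name != NULL);
    for (const char *cp = name; *cp != '\0'; cp++)
    {
        bool ok = (*cp >= 'a' && *cp <= 'z') ||
                  (*cp >= '0' && *cp <= '9') ||
                  *cp == '_';

        /* reject a bad character or a name that no longer fits */
        if (!ok || ++len > NAMEDATALEN - 1)
            return false;
    }
    return len > 0;             /* the empty name is not allowed */
}
```

Restricting names to this character set presumably keeps them safe to use as directory names on any filesystem.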

I think the sgml documentation is in reasonable shape now; I'd
appreciate somebody else having a look. I think a bit more effort is
required in protocol.sgml.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment	Content-Type	Size
0001-wal_decoding-Introduce-the-replication-slot-interfac.patch.gz	application/x-patch-gzip	14.1 KB
0002-wal_decoding-physical-streaming-replication-walsende.patch.gz	application/x-patch-gzip	5.3 KB
0003-wal_decoding-Introduce-changeset-extraction.patch.gz	application/x-patch-gzip	72.3 KB
0004-wal_decoding-Only-peg-the-xmin-horizon-for-catalog-t.patch.gz	application/x-patch-gzip	4.5 KB
0005-wal_decoding-Allow-walsenders-to-connect-to-a-specif.patch.gz	application/x-patch-gzip	4.0 KB
0006-wal_decoding-logical-changeset-extraction-walsender-.patch.gz	application/x-patch-gzip	8.2 KB
0007-wal_decoding-pg_recvlogical-Introduce-pg_receivexlog.patch.gz	application/x-patch-gzip	9.1 KB
0008-wal_decoding-test_decoding-Add-a-simple-decoding-mod.patch.gz	application/x-patch-gzip	25.8 KB
0009-wal_decoding-design-document-v2.4-and-snapshot-build.patch.gz	application/x-patch-gzip	12.9 KB
0010-wal_decoding-Documentation-for-replication-slots-and.patch.gz	application/x-patch-gzip	13.1 KB
0011-wal_decoding-Temporarily-add-logical-decoding-regres.patch.gz	application/x-patch-gzip	1.4 KB
0012-slot-hack-up-pg_receivexlog-support.patch.gz	application/x-patch-gzip	1.9 KB

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	"pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.0 (was logical changeset generation)
Date:	2014-01-22 15:48:58
Message-ID:	20140122154858.GK21170@alap3.anarazel.de
Lists:	pgsql-hackers

On 2014-01-22 10:14:27 -0500, Robert Haas wrote:
> On Wed, Jan 22, 2014 at 9:48 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> > On 2014-01-18 08:35:47 -0500, Robert Haas wrote:
> >> > I am not sure I understand that point. We can either update the
> >> > in-memory bit before performing the on-disk operations or
> >> > afterwards. Either way, there's a way to be inconsistent if the disk
> >> > operation fails somewhere in between (it might fail but still have
> >> > deleted the file/directory!). The normal way to handle that in other
> >> > places is PANICing when we don't know so we recover from the on-disk
> >> > state.
> >> > I really don't see the problem here? Code doesn't get more robust by
> >> > doing s/PANIC/ERROR/, rather the contrary. It takes extra smarts to only
> >> > ERROR, often that's not warranted.
> >>
> >> People get cranky when the database PANICs because of a filesystem
> >> failure. We should avoid that, especially when it's trivial to do so.
> >> The update to shared memory should be done second and should be set
> >> up to be no-fail.
> >
> > I don't see how that would help. If we fail during unlink/rmdir, we
> > don't really know at which point we failed.
>
> This doesn't make sense to me. unlink/rmdir are atomic operations.

Yes, individual operations should be, but you cannot be sure whether a
rename()/unlink() will survive a crash until the directory is
fsync()ed. So, what is one going to do if the unlink succeeded, but the
fsync didn't?

Deletion currently works like:
    if (rename(path, tmppath) != 0)
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not rename \"%s\" to \"%s\": %m",
                        path, tmppath)));

    /* make sure no partial state is visible after a crash */
    fsync_fname(tmppath, false);
    fsync_fname("pg_replslot", true);

    if (!rmtree(tmppath, true))
    {
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not remove directory \"%s\": %m",
                        tmppath)));
    }

If we fail between the rename() and the fsync_fname() we don't really
know which state we are in. We'd also have to add code to handle
incomplete slot directories, which currently only exists for startup, to
other places.
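
The window described above can be made explicit with a small standalone sketch. This is not the patch's code: the function and enum names are hypothetical, and it assumes a POSIX system where a directory can be opened read-only and passed to fsync() (as on Linux). It distinguishes "rename never happened" from "rename happened but is not yet known to be durable".

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

typedef enum
{
    RENAME_FAILED,          /* nothing changed on disk */
    RENAMED_NOT_DURABLE,    /* renamed, but an fsync failed */
    RENAMED_DURABLE         /* renamed and both fsyncs succeeded */
} RenameOutcome;

/* fsync a file or directory by path; 0 on success, -1 on failure. */
static int fsync_path(const char *path)
{
    int fd = open(path, O_RDONLY);
    int rc;

    if (fd < 0)
        return -1;
    rc = fsync(fd);
    close(fd);
    return rc;
}

/* Rename, then flush the renamed entry and its parent directory. */
RenameOutcome rename_then_fsync(const char *oldpath, const char *newpath,
                                const char *parentdir)
{
    assert(oldpath != NULL && newpath != NULL && parentdir != NULL);

    if (rename(oldpath, newpath) != 0)
        return RENAME_FAILED;

    if (fsync_path(newpath) != 0 || fsync_path(parentdir) != 0)
        return RENAMED_NOT_DURABLE;

    return RENAMED_DURABLE;
}
```

The middle outcome is the contested case in this thread: the in-memory state has to say something about a rename that is visible now but might not survive a crash.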

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	"pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.0 (was logical changeset generation)
Date:	2014-01-22 18:00:44
Message-ID:	CA+Tgmob94d1A5OiwEmsE1aug8g8w4WU2WEbViGPY4F8YPR3g2g@mail.gmail.com
Lists:	pgsql-hackers

On Wed, Jan 22, 2014 at 10:48 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> Yes, individual operations should be, but you cannot be sure whether a
> rename()/unlink() will survive a crash until the directory is
> fsync()ed. So, what is one going to do if the unlink succeeded, but the
> fsync didn't?

Well, apparently, one is going to PANIC and reinitialize the system.
I presume that upon reinitialization we'll decide that the slot is
gone, and thus won't recreate it in shared memory. Of course, if the
entire system suffers a hard power failure after that and before the
directory is successfully fsync'd, then the slot could reappear on the
next startup. Which is also exactly what would happen if we removed
the slot from shared memory after doing the unlink, and then the
system suffered a hard power failure before the directory contents
made it to disk. Except that we also panicked.

In the case of shared buffers, the way we handle fsync failures is by
not allowing the system to checkpoint until all of the fsyncs succeed.
If there's an OS-level reset before that happens, WAL replay will
perform the same buffer modifications over again and the next
checkpoint will again try to flush them to disk and will not complete
unless it does. That forms a closed system where we never advance the
redo pointer over the covering WAL record until the changes it covers
are on the disk. But I don't think this code has any similar
interlock; if it does, I missed it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	"pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.0 (was logical changeset generation)
Date:	2014-01-23 12:05:03
Message-ID:	20140123120503.GB7182@awork2.anarazel.de
Lists:	pgsql-hackers

Hi,

On 2014-01-22 13:00:44 -0500, Robert Haas wrote:
> Well, apparently, one is going to PANIC and reinitialize the system.
> I presume that upon reinitialization we'll decide that the slot is
> gone, and thus won't recreate it in shared memory.

Yea, and if it's half-gone we'll continue deletion. And since yesterday
evening we'll even fsync things during startup to handle scenarios
similar to 20140122162115(dot)GL21170(at)alap3(dot)anarazel(dot)de .

> Of course, if the entire system suffers a hard power failure after that and before the
> directory is successfully fsync'd, then the slot could reappear on the
> next startup. Which is also exactly what would happen if we removed
> the slot from shared memory after doing the unlink, and then the
> system suffered a hard power failure before the directory contents
> made it to disk. Except that we also panicked.

Yes, but that could only happen as long as no relevant data has been
lost since we hold relevant locks during this.

> In the case of shared buffers, the way we handle fsync failures is by
> not allowing the system to checkpoint until all of the fsyncs succeed.

I don't think shared buffers fsyncs are the apt comparison. It's more
something like UpdateControlFile(). Which PANICs.

I really don't get why you fight PANICs in general that much. There are
some nasty PANICs in postgres which can happen in legitimate situations,
which should be made to fail more gracefully, but this surely isn't one
of them. We're doing rename(), unlink() and rmdir(). That's it.
We should concentrate on the ones that legitimately can happen, not the
ones created by an admin running a chmod -R 000 . ; rm -rf $PGDATA or
mount -o remount,ro /. We don't increase reliability one bit by adding
codepaths that will never get tested.

> If there's an OS-level reset before that happens, WAL replay will
> perform the same buffer modifications over again and the next
> checkpoint will again try to flush them to disk and will not complete
> unless it does. That forms a closed system where we never advance the
> redo pointer over the covering WAL record until the changes it covers
> are on the disk. But I don't think this code has any similar
> interlock; if it does, I missed it.

No, it doesn't (until the first rename() at least), but the number of
failure scenarios is far smaller.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	"pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.0 (was logical changeset generation)
Date:	2014-01-23 16:50:57
Message-ID:	CA+TgmoZ1DTGKJ6FthQ7vSAiniih2LZ_aL0FM8kCzQNc8d2Gfmg@mail.gmail.com
Lists:	pgsql-hackers

On Thu, Jan 23, 2014 at 7:05 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> I don't think shared buffers fsyncs are the apt comparison. It's more
> something like UpdateControlFile(). Which PANICs.
>
> I really don't get why you fight PANICs in general that much. There are
> some nasty PANICs in postgres which can happen in legitimate situations,
> which should be made to fail more gracefully, but this surely isn't one
> of them. We're doing rename(), unlink() and rmdir(). That's it.
> We should concentrate on the ones that legitimately can happen, not the
> ones created by an admin running a chmod -R 000 . ; rm -rf $PGDATA or
> mount -o remount,ro /. We don't increase reliability one bit by adding
> codepaths that will never get tested.

Sorry, I don't buy it. Lots of people I know have stories that go
like this "$HORRIBLE happened, and PostgreSQL kept on running, and it
didn't even lose my data!", where $HORRIBLE may be variously that the
disk filled up, that disk writes started failing with I/O errors, that
somebody changed the permissions on the data directory inadvertently,
that the entire data directory got removed, and so on. I've been
through some of those scenarios myself, and the care and effort that's
been put into failure modes has saved my bacon more than a few times,
too. We *do* increase reliability by worrying about what will happen
even in code paths that very rarely get exercised. It's certainly
true that our bug count there is higher than for the parts of
our code that get exercised more regularly, but it's also lower than
it would be if we didn't make the effort, and the dividend that we get
from that effort is that we have a well-deserved reputation for
reliability.

I think it's completely unacceptable for the failure of routine
filesystem operations to result in a PANIC. I grant you that we have
some existing cases where that can happen (like UpdateControlFile),
but that doesn't mean we should add more. Right this very minute
there is massive bellyaching on a nearby thread caused by the fact
that a full disk condition while writing WAL can PANIC the server,
while on this thread at the very same time you're arguing that adding
more ways for a full disk to cause PANICs won't inconvenience anyone.
The other thread is right, and your argument here is wrong. We have
been able to - and have taken the time to - fix comparable problems in
other cases, and we should do the same thing here.

As for why I fight PANICs so much in general, there are two reasons.
First, I believe that to be project policy. I welcome correction if I
have misinterpreted our stance in that area. Second, I have
encountered a few situations where customers had production servers
that repeatedly PANICked due to some bug or other. If I've ever
encountered angrier customers, I can't remember when. A PANIC is no
big deal when it happens on your development box, but when it happens
on a machine with 100 users connected to it, it's a big deal,
especially if a single underlying cause makes it happen over and over
again.

I think we should be devoting our time to figuring out how to improve
this, not whether to improve it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	"pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.0 (was logical changeset generation)
Date:	2014-01-23 17:21:40
Message-ID:	20140123172140.GH7182@awork2.anarazel.de
Lists:	pgsql-hackers

On 2014-01-23 11:50:57 -0500, Robert Haas wrote:
> On Thu, Jan 23, 2014 at 7:05 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> > I don't think shared buffers fsyncs are the apt comparison. It's more
> > something like UpdateControlFile(). Which PANICs.
> >
> > I really don't get why you fight PANICs in general that much. There are
> > some nasty PANICs in postgres which can happen in legitimate situations,
> > which should be made to fail more gracefully, but this surely isn't one
> > of them. We're doing rename(), unlink() and rmdir(). That's it.
> > We should concentrate on the ones that legitimately can happen, not the
> > ones created by an admin running a chmod -R 000 . ; rm -rf $PGDATA or
> > mount -o remount,ro /. We don't increase reliability one bit by adding
> > codepaths that will never get tested.
>
> Sorry, I don't buy it. Lots of people I know have stories that go
> like this "$HORRIBLE happened, and PostgreSQL kept on running, and it
> didn't even lose my data!", where $HORRIBLE may be variously that the
> disk filled up, that disk writes started failing with I/O errors, that
> somebody changed the permissions on the data directory inadvertently,
> that the entire data directory got removed, and so on.

Especially the "not losing data" part, imo, is because postgres is
conservative about continuing in situations it doesn't know anything
about. Most prominently the cluster-wide restart after a segfault.

> I've been
> through some of those scenarios myself, and the care and effort that's
> been put into failure modes has saved my bacon more than a few times,
> too. We *do* increase reliability by worrying about what will happen
> even in code paths that very rarely get exercised.

A part of thinking about them *is* restricting what happens in those
cases by keeping the possible states to worry about to a minimum.

Just slapping on an ERROR instead of PANIC can make things much
worse. Not releasing space until a restart, without a chance to do
anything about it because we failed to properly release the in-memory
slot will just make the problem bigger because now the cleanup might
take a week (VACUUM FULLing the entire cluster?).

> I think it's completely unacceptable for the failure of routine
> filesystem operations to result in a PANIC. I grant you that we have
> some existing cases where that can happen (like UpdateControlFile),
> but that doesn't mean we should add more. Right this very minute
> there is massive bellyaching on a nearby thread caused by the fact
> that a full disk condition while writing WAL can PANIC the server,
> while on this thread at the very same time you're arguing that adding
> more ways for a full disk to cause PANICs won't inconvenience anyone.

A full disk won't cause any of the problems for the case we're
discussing, will it? We're just doing rename(), unlink(), rmdir() here,
all should succeed while the FS is full (afair rename() does on all
common FSs because inodes are kept separately).

> The other thread is right, and your argument here is wrong. We have
> been able to - and have taken the time to - fix comparable problems in
> other cases, and we should do the same thing here.

I don't think the WAL case is comparable at all. ENOSPC is something
expected that can happen during normal operation, doesn't involve an
ill-intentioned operator, and is reasonably easy to test. unlink() or
fsync() randomly failing is not.
In fact, isn't the upshot of that thread that we need a
significant amount of extra complexity to handle the case? We shouldn't
spend that effort for cases that don't deserve it because they are too
unlikely in practice.

And yes, there aren't too many other places PANICing - because most can
rely on WAL handling those tricky cases for them...

> Second, I have
> encountered a few situations where customers had production servers
> that repeatedly PANICked due to some bug or other. If I've ever
> encountered angrier customers, I can't remember when. A PANIC is no
> big deal when it happens on your development box, but when it happens
> on a machine with 100 users connected to it, it's a big deal,
> especially if a single underlying cause makes it happen over and over
> again.

Sure. But blindly continuing and then, possibly quite a bit later,
losing data or causing an outage that takes a long while to recover
from isn't any better.

> I think we should be devoting our time to figuring out how to improve
> this, not whether to improve it.

If you'd argue that creating a new slot should fail gracefully, ok, I can
relatively easily be convinced of that. But trying to handle failures in
the midst of deletion in cases that won't happen in reality is just
inviting trouble imo.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	"pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.1
Date:	2014-01-23 21:04:10
Message-ID:	CA+TgmobF6_FzU411MCrBYrUJd6Pf3OFPtBT+CA61TtNt9t+1RA@mail.gmail.com
Lists:	pgsql-hackers

Patch 0001:

+ errmsg("could not find free replication slot"),

Suggest: all replication slots are in use

+ elog(ERROR, "cannot aquire a slot while another slot has been acquired");

Suggest: this backend has already acquired a replication slot

Or demote it to Assert(). I'm not really sure why this needs to be
checked in non-assert builds. I also wonder if we should use the
terminology "attach" instead of "acquire"; that pairs more naturally
with "release". Then the message, if we want more than an assert,
might be "this backend is already attached to a replication slot".

+    if (slot == NULL)
+    {
+        LWLockRelease(ReplicationSlotCtlLock);
+        ereport(ERROR,
+                (errcode(ERRCODE_UNDEFINED_OBJECT),
+                 errmsg("could not find replication slot \"%s\"", name)));
+    }

The error will release the LWLock anyway; I'd get rid of the manual
LWLockRelease, and the braces. Similarly in ReplicationSlotDrop.

+    /* acquire spinlock so we can test and set ->active safely */
+    SpinLockAcquire(&slot->mutex);
+
+    if (slot->active)
+    {
+        SpinLockRelease(&slot->mutex);
+        LWLockRelease(ReplicationSlotCtlLock);
+        ereport(ERROR,
+                (errcode(ERRCODE_OBJECT_IN_USE),
+                 errmsg("slot \"%s\" already active", name)));
+    }
+
+    /* we definitely have the slot, no errors possible anymore */
+    slot->active = true;
+    MyReplicationSlot = slot;
+    SpinLockRelease(&slot->mutex);

This doesn't need the LWLockRelease either. It does need the
SpinLockRelease, but as I think I noted previously, the right way to
write this is probably:

    SpinLockAcquire(&slot->mutex);
    was_active = slot->active;
    slot->active = true;
    SpinLockRelease(&slot->mutex);
    if (was_active)
        ereport();
    MyReplicationSlot = slot;
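
The pattern suggested here can be tried outside the server with a minimal standalone sketch. It is not PostgreSQL code: a C11 atomic_flag stands in for the slot's spinlock, a boolean return value stands in for ereport(ERROR), and every name below is hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct DemoSlot
{
    atomic_flag mutex;      /* stand-in for the slot's spinlock */
    bool        active;
} DemoSlot;

DemoSlot *my_slot = NULL;   /* stand-in for MyReplicationSlot */

static void spin_acquire(atomic_flag *lock)
{
    while (atomic_flag_test_and_set_explicit(lock, memory_order_acquire))
        ;                   /* busy-wait until the flag is clear */
}

static void spin_release(atomic_flag *lock)
{
    atomic_flag_clear_explicit(lock, memory_order_release);
}

/*
 * Returns true if this caller attached to the slot, false if it was
 * already active (where the real code would raise an error).
 */
bool demo_slot_acquire(DemoSlot *slot)
{
    bool was_active;

    assert(slot != NULL);

    /* the critical section is straight-line code with one exit */
    spin_acquire(&slot->mutex);
    was_active = slot->active;
    slot->active = true;
    spin_release(&slot->mutex);

    if (was_active)
        return false;       /* report the failure only after unlocking */

    my_slot = slot;
    return true;
}
```

Setting active to true unconditionally is harmless when it was already true, and it means no error path ever runs while the lock is held.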

ReplicationSlotsComputeRequiredXmin still acquires ProcArrayLock, and
the comment "Provide interlock against concurrent recomputations"
doesn't seem adequate to me. I guess the idea here is that we regard
ProcArrayLock as protecting ReplicationSlotCtl->catalog_xmin and
ReplicationSlotCtl->data_xmin, but if that's the idea then we only
need to hold the lock during the time when we actually update those
values, not the loop where we compute them. Also, if that's the
design, maybe they should be part of PROC_HDR *ProcGlobal rather than
here. It seems weird to have some of the values protected by
ProcArrayLock live in a completely different data structure managed
almost entirely by some other part of the system.

It's pretty evident that what's currently patch #4 (only peg the xmin
horizon for catalog tables during logical decoding) needs to become
patch #1, because it doesn't make any sense to apply this before we do
that. I'm still not 100% confident in that approach, but maybe I'd
better try to look at it RSN and get confident, because too much of
the rest of what you've got here hangs on that to proceed without it.
Or to put all that another way, if for any reason we decide that the
separate catalog xmin stuff is not viable, the rest of this is going
to need a lot of rework, so we'd better sort that now rather than
later.

With respect to the synchronize-slots-to-disk stuff we're arguing
about on the other thread, I think the basic design problem here is
that you assume that you can change stuff in memory and then change
stuff on disk, without either set of changes being atomic. What I
think you need to do is make atomic actions on disk correspond to
atomic state changes in memory. IOW, suppose that creating a slot
consists of two steps: mkdir() + fsync(). Then I think what you need
to do is - do the mkdir(). If it errors out, fine. If it succeeds,
then mark the slot half-created. This is just an assignment so it can
be done immediately after you learn that mkdir() worked with no risk of
an intervening failure. Then, try to fsync(). If it errors out, the
slot will get left in the half-created state. If it works, then
immediately mark the slot as fully created. Now, when the next guy
comes along and looks at the slot, he can tell what he needs to do.
Specifically, if the slot is half-created, and he wants to do anything
other than remove it, he's got to fsync() it first, and if that errors
out, so be it. The next access to the slot will merely find it still
half-created and simply try the fsync() yet again.
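
As a rough illustration of that state machine, here is a minimal sketch assuming a slot is just a directory and that the in-memory state is a plain enum. The Demo* names and the three states are hypothetical and not taken from the patch. A real implementation would presumably also need to fsync the parent directory so that the new directory entry itself is durable.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

typedef enum DemoSlotState
{
    DEMO_SLOT_ABSENT,           /* nothing on disk yet */
    DEMO_SLOT_HALF_CREATED,     /* mkdir() worked, fsync() still pending */
    DEMO_SLOT_CREATED           /* mkdir() and fsync() both worked */
} DemoSlotState;

typedef struct DemoSlot
{
    char          path[256];
    DemoSlotState state;
} DemoSlot;

void demo_slot_init(DemoSlot *slot, const char *path)
{
    int n = snprintf(slot->path, sizeof(slot->path), "%s", path);

    assert(n > 0 && (size_t) n < sizeof(slot->path));
    slot->state = DEMO_SLOT_ABSENT;
}

/* Open a file or directory read-only and fsync it; -1 on any failure. */
static int demo_fsync_path(const char *path)
{
    int fd = open(path, O_RDONLY);
    int rc;

    if (fd < 0)
        return -1;
    rc = fsync(fd);
    close(fd);
    return rc;
}

/*
 * What every access other than removal has to call first: a half-created
 * slot gets its fsync() retried, and stays half-created if that fails.
 */
int demo_slot_ensure_durable(DemoSlot *slot)
{
    if (slot->state == DEMO_SLOT_ABSENT)
        return -1;
    if (slot->state == DEMO_SLOT_HALF_CREATED)
    {
        if (demo_fsync_path(slot->path) != 0)
            return -1;          /* still half-created; next caller retries */
        slot->state = DEMO_SLOT_CREATED;
    }
    return 0;
}

int demo_slot_create(DemoSlot *slot)
{
    if (slot->state == DEMO_SLOT_ABSENT)
    {
        if (mkdir(slot->path, 0700) != 0)
            return -1;          /* nothing changed, plain error */
        /* Just an assignment, so no failure can intervene here. */
        slot->state = DEMO_SLOT_HALF_CREATED;
    }
    return demo_slot_ensure_durable(slot);
}
```

The point of the sketch is that each in-memory transition follows exactly one on-disk action, so a failure at any step leaves a state that the next caller can recognise and resume from.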

Alternatively, since nearly everything we're trying to do here is a
two-step operation - do something and then fsync - maybe we have a
more generic fsync-pending flag, and each slot operation checks that
and retries the fsync() if it's set. But it might be situation
dependent which thing we need to fsync, since there are multiple files
involved.

Broadly, what you're trying to accomplish here is to have something
that is crash-safe but without relying on WAL, so that it can work on
standbys. If making things crash-safe without WAL were easy to do, we
probably wouldn't have WAL at all, so it stands to reason that there
are going to be some difficulties here. Making it work reliably is
going to require either inventing some special-purpose type of
write-ahead logging specific to this particular need, or some analogue
of shadow paging, or making sure that every intermediate step is
well-defined and recoverable. Right now, you're on that last path,
and it's by no means obvious to me that that's the wrong place to be,
but I think there's some work left to be done to get it there.

Calling a slot "old" or "new" looks liable to cause problems. Maybe
change those names to contain a character not allowed in a slot name,
if we're going to keep doing it that way.

I wonder if it wouldn't be better to get rid of the subdirectories for
the individual slots, and just have a file pg_replslot/$SLOTNAME, or
not. I know there are later patches that need subdirectories for
their own private data, but they could just create
pg_replslot/$SLOTNAME.dir and put whatever in it they like, without
really implicating this code that much. The advantage of that is that
there would be fewer intermediate states. The slot exists if the file
exists, and not if it doesn't. You still need half-alive and
half-dead until the fsync finishes, but you don't need to worry about
tracking both the state of the directory and the state of the file.
On startup we fsync the containing directory and all of the slot files
we find inside it and refuse to start up if that fails, but once
running filesystem failures only prevent changes; they don't kill the
system.

Patch 0004:

I'm not very confident that PROC_IN_LOGICAL_DECODING is the right way
to go here. It seems to me that excluding the xmins of backends with
slots from globalxmin consideration so that we can fold the same xmin
in by some other mechanism is kind of strange. How about letting the
xmins of such backends affect the computation as normal, and then
having one extra xmin that gets folded in that represents the minima
of the xmin of unconnected slots? When a backend with a slot
disconnects, an on_shmem_exit hook must move that value backwards if
it follows MyPgXact->xmin.
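
A small sketch of that alternative, under the assumption that xids are compared with the usual wraparound-aware rule and that 0 means "no xmin". The Demo* names are hypothetical, and this is only the proposal as described here, not what the patch does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t DemoXid;
#define DEMO_INVALID_XID ((DemoXid) 0)

/* Wraparound-aware "a is older than b" for normal xids. */
static bool demo_xid_precedes(DemoXid a, DemoXid b)
{
    return (int32_t) (a - b) < 0;
}

/* Older of two xids, treating DEMO_INVALID_XID as "absent". */
static DemoXid demo_xid_min(DemoXid a, DemoXid b)
{
    if (a == DEMO_INVALID_XID)
        return b;
    if (b == DEMO_INVALID_XID)
        return a;
    return demo_xid_precedes(a, b) ? a : b;
}

typedef struct DemoSlot
{
    bool    connected;          /* is a backend currently attached? */
    DemoXid xmin;
} DemoSlot;

/* The one extra value: minimum xmin over slots with no backend attached. */
DemoXid demo_unconnected_slot_xmin(const DemoSlot *slots, int nslots)
{
    DemoXid result = DEMO_INVALID_XID;

    assert(nslots >= 0);
    for (int i = 0; i < nslots; i++)
        if (!slots[i].connected)
            result = demo_xid_min(result, slots[i].xmin);
    return result;
}

/* Backends count as normal; the extra value is folded in at the end. */
DemoXid demo_global_xmin(const DemoXid *backend_xmins, int nbackends,
                         DemoXid unconnected_xmin)
{
    DemoXid result = unconnected_xmin;

    assert(nbackends >= 0);
    for (int i = 0; i < nbackends; i++)
        result = demo_xid_min(result, backend_xmins[i]);
    return result;
}

/*
 * What the exit hook would do: move the extra value backwards if it
 * follows the xmin of the backend that is going away.
 */
DemoXid demo_on_disconnect(DemoXid unconnected_xmin, DemoXid my_xmin)
{
    return demo_xid_min(unconnected_xmin, my_xmin);
}
```

Treating an unset value as "take the departing backend's xmin" in demo_on_disconnect is my own reading of the sentence above, not something spelled out in it.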

That's all for now...

...Robert

From:	Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Andres Freund <andres(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.1
Date:	2014-01-23 21:27:16
Message-ID:	20140123212716.GV10723@eldon.alvh.no-ip.org
Lists:	pgsql-hackers

> I wonder if it wouldn't be better to get rid of the subdirectories for
> the individual slots, and just have a file pg_replslot/$SLOTNAME, or
> not. I know there are later patches that need subdirectories for
> their own private data, but they could just create
> pg_replslot/$SLOTNAME.dir and put whatever in it they like, without
> really implicating this code that much. The advantage of that is that
> there would be fewer intermediate states. The slot exists if the file
> exists, and not if it doesn't. You still need half-alive and
> half-dead until the fsync finishes, but you don't need to worry about
> tracking both the state of the directory and the state of the file.

Why do we need directories at all? I know there might be subsidiary
files to store stuff in separate files, but maybe we can just name files
using the slot name (or a transformation thereof) as a prefix. It
shouldn't be difficult to remove the right files whenever there's a
need, and not having to worry about a directory that might need a
separate fsync might make things easier.

On the other hand, there might still be a need to fsync the parent
directory, so maybe there is not that much gain.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	"pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.1
Date:	2014-01-23 23:32:09
Message-ID:	20140123233209.GJ7182@awork2.anarazel.de
Lists:	pgsql-hackers

Hi,

On 2014-01-23 16:04:10 -0500, Robert Haas wrote:
> Patch 0001:
>
> + errmsg("could not find free replication slot"),
>
> Suggest: all replication slots are in use

That sounds better indeed.

> + elog(ERROR, "cannot aquire a slot while another slot
> has been acquired");
>
> Suggest: this backend has already acquired a replication slot
>
> Or demote it to Assert(). I'm not really sure why this needs to be
> checked in non-assert builds.

Hm. Fine with me, not sure why I went with an elog(). Maybe because I
thought output plugin authors could have the idea of using another slot
while inside one?

> I also wonder if we should use the
> terminology "attach" instead of "acquire"; that pairs more naturally
> with "release". Then the message, if we want more than an assert,
> might be "this backend is already attached to a replication slot".

I went with Acquire/Release because our locking code does so, and it
seemed sensible to be consistent. I don't have strong feelings about it.

> + if (slot == NULL)
> + {
> + LWLockRelease(ReplicationSlotCtlLock);
> + ereport(ERROR,
> + (errcode(ERRCODE_UNDEFINED_OBJECT),
> + errmsg("could not find replication
> slot \"%s\"", name)));
> + }
>
> The error will release the LWLock anyway; I'd get rid of the manual
> LWLockRelease, and the braces. Similarly in ReplicationSlotDrop.

Unfortunately not. Inside the walsender there's currently no
LWLockReleaseAll() for ERRORs since commands aren't run inside a
transaction command...

But maybe I should have fixed this by adding the release to
WalSndErrorCleanup() instead? That'd still leave the problematic case
that currently we try to delete a replication slot inside a CATCH when
we fail while initializing the rest of logical replication... But I
guess adding it would be a good idea independent of that.

We could also do a StartTransactionCommand() but I'd rather not, that
currently prevents code in that vicinity from doing anything it
shouldn't via various Assert()s in existing code.

> + /* acquire spinlock so we can test and set ->active safely */
> + SpinLockAcquire(&slot->mutex);
> +
> + if (slot->active)
> + {
> + SpinLockRelease(&slot->mutex);
> + LWLockRelease(ReplicationSlotCtlLock);
> + ereport(ERROR,
> + (errcode(ERRCODE_OBJECT_IN_USE),
> + errmsg("slot \"%s\" already active", name)));
> + }
> +
> + /* we definitely have the slot, no errors possible anymore */
> + slot->active = true;
> + MyReplicationSlot = slot;
> + SpinLockRelease(&slot->mutex);
>
> This doesn't need the LWLockRelease either. It does need the
> SpinLockRelease, but as I think I noted previously, the right way to
> write this is probably: SpinLockAcquire(&slot->mutex); was_active =
> slot->active; slot->active = true; SpinLockRelease(&slot->mutex); if
> (was_active) ereport(). MyReplicationSlot = slot.

That's not really simpler tho? But if you prefer I can go that way.

> ReplicationSlotsComputeRequiredXmin still acquires ProcArrayLock, and
> the comment "Provide interlock against concurrent recomputations"
> doesn't seem adequate to me. I guess the idea here is that we regard
> ProcArrayLock as protecting ReplicationSlotCtl->catalog_xmin and
> ReplicationSlotCtl->data_xmin, but if that's the idea then we only
> need to hold the lock during the time when we actually update those
> values, not the loop where we compute them.

There's a comment someplace else to that end, but yes, that's
essentially the idea. I decided to take it during the whole
recomputation because we also take ProcArrayLock when creating a new
decoding slot and initially setting ->catalog_xmin. That's not strictly required
but seemed simpler that way, and the code shouldn't be very hot.
The code that initially computes the starting value for catalog_xmin
when creating a new decoding slot has to take ProcArrayLock to be safe,
that's why I thought it'd be convenient to always use it for those
values.

In all other cases where we modify *_xmin we're only increasing it which
doesn't need a lock (HS feedback never has taken one, and
GetSnapshotData() modifies ->xmin while holding a shared lock), the only
potential danger is a slight delay in increasing the overall value.

> Also, if that's the
> design, maybe they should be part of PROC_HDR *ProcGlobal rather than
> here. It seems weird to have some of the values protected by
> ProcArrayLock live in a completely different data structure managed
> almost entirely by some other part of the system.

Don't we already have cases of that? I seem to remember so. If you
prefer having them there, I am certainly fine with doing that. This way
they aren't allocated if slots are disabled but it's just two
TransactionIds.

> It's pretty evident that what's currently patch #4 (only peg the xmin
> horizon for catalog tables during logical decoding) needs to become
> patch #1, because it doesn't make any sense to apply this before we do
> that.

Well, the slot code and the slot support for streaming rep are
independent from and don't use it. So they easily can come before it.

I previously had argued for committing that patch together with the main
changeset extraction commit but you, understandably so!, wanted to have
it separately for review.

> [ discussion about crash safety of slots and their use of PANIC ]
> Broadly, what you're trying to accomplish here is to have something
> that is crash-safe but without relying on WAL, so that it can work on
> standbys. If making things crash-safe without WAL were easy to do, we
> probably wouldn't have WAL at all, so it stands to reason that there
> are going to be some difficulties here.

My big problem here is that you're asking this code to have *higher*
guarantees than WAL ever had and currently has, not equivalent
guarantees. Even though the likelihood of hitting problems is at least an
order of magnitude or two smaller as we are dealing with minimal amounts of data.
All the situations that seem halfway workable in the nearby thread about
PANIC in XLogInsert() you reference are rough ideas that reduce the
likelihood of PANICs, not remove them.

I am fine with reworking things so that the first operation of several
doesn't PANIC because we still can clearly ERROR out in that case. That
should press the likelihood of problems into the utterly irrelevant
area. E.g. ERROR for the rename(oldpath, newpath) and then start a
critical section for the fsync et al.
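
Roughly, the split I have in mind looks like the sketch below. It is a standalone illustration, not the patch code: the function names are made up, a -1 return stands in for ERROR, and abort() stands in for a PANIC inside a critical section.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Open a file or directory read-only and fsync it; -1 on any failure. */
static int demo_fsync_path(const char *path)
{
    int fd = open(path, O_RDONLY);
    int rc;

    if (fd < 0)
        return -1;
    rc = fsync(fd);
    close(fd);
    return rc;
}

/*
 * The first step may fail cleanly: if rename() fails nothing has changed,
 * so we just report it (-1, the ERROR analogue).  Once the rename has
 * happened, a failing fsync() leaves the on-disk state unclear, so the
 * sketch gives up entirely (abort(), the PANIC analogue).
 */
int demo_rename_then_sync(const char *oldpath, const char *newpath,
                          const char *parentdir)
{
    assert(oldpath != NULL && newpath != NULL && parentdir != NULL);

    if (rename(oldpath, newpath) != 0)
        return -1;              /* recoverable: nothing was modified */

    /* "critical section" starts here */
    if (demo_fsync_path(newpath) != 0 || demo_fsync_path(parentdir) != 0)
    {
        perror("fsync after rename");
        abort();
    }
    return 0;
}
```

That keeps the unrecoverable window down to the fsync calls, which is the part that already gets the same treatment for WAL.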

> Calling a slot "old" or "new" looks liable to cause problems. Maybe
> change those names to contain a character not allowed in a slot name,
> if we're going to keep doing it that way.

Hm. Fair point. slotname.old, slotname.new sounds better.

I wondered about making them plain files as well but given the need for
a directory independent from this I don't really see the advantage,
we'll need to handle them anyway during cleanup.

> Patch 0004:
>
> I'm not very confident that PROC_IN_LOGICAL_DECODING is the right way
> to go here. It seems to me that excluding the xmins of backends with
> slots from globalxmin consideration so that we can fold the same xmin
> in by some other mechanism is kind of strange.

It's essentially copying what PROC_IN_VACUUM already does, that's where
I got the idea from.

> How about letting the xmins of such backends affect the computation as normal, and then
> having one extra xmin that gets folded in that represents the minima
> of the xmin of unconnected slots?

That's how I had it in the beginning but it turned out that has
noticeable performance/space impact. Surprising isn't it? The reason is
that we'll intermittently use normal snapshots to look at the catalog
during decoding and they will install an xmin in the current proc. So, while
that snapshot is active GetSnapshotData() will return an older xmin
preventing HOT pruning from being as efficient.

I think we *really* need to make heap_page_prune() more efficient CPU
wise someday not too far away.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	"pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.1
Date:	2014-01-24 16:28:18
Message-ID:	20140124162818.GU7182@awork2.anarazel.de
Lists:	pgsql-hackers

On 2014-01-24 00:32:09 +0100, Andres Freund wrote:
> I am fine with reworking things so that the first operation of several
> doesn't PANIC because we still can clearly ERROR out in that case. That
> should press the likelihood of problems into the utterly irrelevant
> area. E.g. ERROR for the rename(oldpath, newpath) and then start a
> critical section for the fsync et al.

So, I've changed stuff around to PANIC only as soon as we're in a state
that's unclear.
To test stuff I've added another .so to the test_decoding contrib that
exposes mkdir/rmdir/chmod/unlink to sql to test those cases in the
contrib's sql/slot.sql. Not sure if we want to keep that, but it's
certainly helpful for now.
The required changes certainly didn't make things look nicer...

I've also changed the temporary name used when creating/dropping slots
to $slotname.tmp.

That doesn't remove PANICs but makes them even less likely. Ok?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment	Content-Type	Size
0001-wal_decoding-Introduce-the-replication-slot-interfac.patch.gz	application/x-patch-gzip	15.0 KB
0002-wal_decoding-physical-streaming-replication-walsende.patch.gz	application/x-patch-gzip	5.2 KB
0003-wal_decoding-Introduce-changeset-extraction.patch.gz	application/x-patch-gzip	72.4 KB
0004-wal_decoding-Only-peg-the-xmin-horizon-for-catalog-t.patch.gz	application/x-patch-gzip	4.5 KB
0005-wal_decoding-Allow-walsenders-to-connect-to-a-specif.patch.gz	application/x-patch-gzip	4.0 KB
0006-wal_decoding-logical-changeset-extraction-walsender-.patch.gz	application/x-patch-gzip	8.1 KB
0007-wal_decoding-pg_recvlogical-Introduce-pg_receivexlog.patch.gz	application/x-patch-gzip	9.1 KB
0008-wal_decoding-test_decoding-Add-a-simple-decoding-mod.patch.gz	application/x-patch-gzip	27.4 KB
0009-wal_decoding-design-document-v2.4-and-snapshot-build.patch.gz	application/x-patch-gzip	12.9 KB
0010-wal_decoding-Documentation-for-replication-slots-and.patch.gz	application/x-patch-gzip	13.1 KB
0011-wal_decoding-Temporarily-add-logical-decoding-regres.patch.gz	application/x-patch-gzip	1.4 KB
0012-slot-hack-up-pg_receivexlog-support.patch.gz	application/x-patch-gzip	1.9 KB

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	"pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.1
Date:	2014-01-24 17:10:50
Message-ID:	CA+Tgmoa+sZXGt32O3Dp5u+8f6iXYzk7teJp2rNcFeCVxd5uJHA@mail.gmail.com
Lists:	pgsql-hackers

On Thu, Jan 23, 2014 at 6:32 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> I also wonder if we should use the
>> terminology "attach" instead of "acquire"; that pairs more naturally
>> with "release". Then the message, if we want more than an assert,
>> might be "this backend is already attached to a replication slot".
>
> I went with Acquire/Release because our locking code does so, and it
> seemed sensible to be consistent. I don't have strong feelings about it.

Yeah, but I think a slot is not really the same thing as a lock.
Acquire/release might be OK. In some of my recent code I used
attach/detach, which feels a little more natural to me for something
like this, so I lean that direction.

> Unfortunately not. Inside the walsender there's currently no
> LWLockReleaseAll() for ERRORs since commands aren't run inside a
> transaction command...
>
> But maybe I should have fixed this by adding the release to
> WalSndErrorCleanup() instead? That'd still leave the problematic case
> that currently we try to delete a replication slot inside a CATCH when
> we fail while initializing the rest of logical replication... But I
> guess adding it would be a good idea independent of that.

+1. I think that if we can't rely on error handling to clean up the
same things everywhere, it's gonna be a mess. People won't be able to
keep track of which error cleanup is engaged in which code paths, and
screw-ups will result when old code paths are called from new call
sites.
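
To illustrate what relying on error handling means here, a toy sketch using setjmp/longjmp as a stand-in for the ERROR machinery. Every name in it is hypothetical; none of this is the walsender's actual code. The command raises an error while holding a lock and does no manual release, and the one central cleanup path releases whatever is still held.

```c
#include <assert.h>
#include <setjmp.h>
#include <stdbool.h>

#define DEMO_MAX_HELD 8

typedef struct DemoLock { bool held; } DemoLock;

/* Registry of locks currently held, so one cleanup routine can find them. */
static DemoLock *demo_held[DEMO_MAX_HELD];
static int       demo_num_held = 0;
static jmp_buf   demo_error_context;

void demo_lock_acquire(DemoLock *lock)
{
    assert(!lock->held && demo_num_held < DEMO_MAX_HELD);
    lock->held = true;
    demo_held[demo_num_held++] = lock;
}

/* Central cleanup: drop every lock that is still registered as held. */
void demo_release_all(void)
{
    while (demo_num_held > 0)
        demo_held[--demo_num_held]->held = false;
}

int demo_held_count(void)
{
    return demo_num_held;
}

/* Stand-in for raising an ERROR: jump back to the command loop. */
void demo_raise_error(void)
{
    longjmp(demo_error_context, 1);
}

/* The command loop owns the cleanup, so every code path gets the same one. */
int demo_run_command(void (*command) (DemoLock *), DemoLock *lock)
{
    if (setjmp(demo_error_context) != 0)
    {
        demo_release_all();
        return -1;
    }
    command(lock);
    return 0;
}

/* A command that errors out while holding the lock, with no manual release. */
void demo_failing_command(DemoLock *lock)
{
    demo_lock_acquire(lock);
    demo_raise_error();
}
```

With the cleanup in one place, a call site cannot forget a release, and old code called from a new call site behaves the same way.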

> We could also do a StartTransactionCommand() but I'd rather not, that
> currently prevents code in that vicinity from doing anything it
> shouldn't via various Assert()s in existing code.

Like what? I mean, I'm OK with having a partial error-handling
environment if that's all we need, but I think it's a bad plan to the
extent that the code here needs to be aware of error-handling
differences versus expectations elsewhere in our code.

>> This doesn't need the LWLockRelease either. It does need the
>> SpinLockRelease, but as I think I noted previously, the right way to
>> write this is probably: SpinLockAcquire(&slot->mutex); was_active =
>> slot->active; slot->active = true; SpinLockRelease(&slot->mutex); if
>> (was_active) ereport(). MyReplicationSlot = slot.
>
> That's not really simpler tho? But if you prefer I can go that way.

It avoids a branch while holding the lock, and it puts the
SpinLockAcquire/Release pair much closer together, so it's easier to
visually verify that the lock is released in all cases.
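
Spelled out as a standalone sketch, with a toy spinlock on C11 atomics standing in for the real one and a false return standing in for the ereport(ERROR). All Demo* names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy spinlock on C11 atomics; a stand-in, not PostgreSQL's slock_t. */
typedef struct DemoSpinLock { atomic_flag flag; } DemoSpinLock;

static void demo_spin_acquire(DemoSpinLock *lock)
{
    while (atomic_flag_test_and_set_explicit(&lock->flag, memory_order_acquire))
        ;                       /* busy-wait */
}

static void demo_spin_release(DemoSpinLock *lock)
{
    atomic_flag_clear_explicit(&lock->flag, memory_order_release);
}

typedef struct DemoSlot
{
    DemoSpinLock mutex;
    bool         active;
} DemoSlot;

/* The slot this "backend" has claimed, if any. */
static DemoSlot *demo_my_slot = NULL;

void demo_slot_init(DemoSlot *slot)
{
    atomic_flag_clear(&slot->mutex.flag);
    slot->active = false;
}

DemoSlot *demo_current_slot(void)
{
    return demo_my_slot;
}

/*
 * Returns true if the slot was claimed, false if it was already active
 * (the point where the real code would raise the error).
 */
bool demo_slot_acquire(DemoSlot *slot)
{
    bool was_active;

    assert(slot != NULL);

    /* No branch inside the locked region: read, set, release. */
    demo_spin_acquire(&slot->mutex);
    was_active = slot->active;
    slot->active = true;
    demo_spin_release(&slot->mutex);

    /* The error path runs with the lock already released. */
    if (was_active)
        return false;

    demo_my_slot = slot;
    return true;
}
```

Setting active to true when it already was true is harmless, which is what lets the test and the set happen unconditionally under the lock.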

>> ReplicationSlotsComputeRequiredXmin still acquires ProcArrayLock, and
>> the comment "Provide interlock against concurrent recomputations"
>> doesn't seem adequate to me. I guess the idea here is that we regard
>> ProcArrayLock as protecting ReplicationSlotCtl->catalog_xmin and
>> ReplicationSlotCtl->data_xmin, but if that's the idea then we only
>> need to hold the lock during the time when we actually update those
>> values, not the loop where we compute them.
>
> There's a comment someplace else to that end, but yes, that's
> essentially the idea. I decided to take it during the whole
> recomputation because we also take ProcArrayLock when creating a new
> decoding slot and initially setting ->catalog_xmin. That's not strictly required
> but seemed simpler that way, and the code shouldn't be very hot.
> The code that initially computes the starting value for catalog_xmin
> when creating a new decoding slot has to take ProcArrayLock to be safe,
> that's why I thought it'd be convenient to always use it for those
> values.

I don't really see why it's simpler that way. It's clearer what the
point of the lock is if you only hold it for the operations that need
to be protected by that lock.

> In all other cases where we modify *_xmin we're only increasing it which
> doesn't need a lock (HS feedback never has taken one, and
> GetSnapshotData() modifies ->xmin while holding a shared lock), the only
> potential danger is a slight delay in increasing the overall value.

Right. We might want to comment such places.

>> Also, if that's the
>> design, maybe they should be part of PROC_HDR *ProcGlobal rather than
>> here. It seems weird to have some of the values protected by
>> ProcArrayLock live in a completely different data structure managed
>> almost entirely by some other part of the system.
>
> Don't we already have cases of that? I seem to remember so. If you
> prefer having them there, I am certainly fine with doing that. This way
> they aren't allocated if slots are disabled but it's just two
> TransactionIds.

Let's go for it, unless we think of a reason not to.

>> It's pretty evident that what's currently patch #4 (only peg the xmin
>> horizon for catalog tables during logical decoding) needs to become
>> patch #1, because it doesn't make any sense to apply this before we do
>> that.
>
> Well, the slot code and the slot support for streaming rep are
> independent from and don't use it. So they easily can come before it.

But this code is riddled with places where you track a catalog xmin
and a data xmin separately. The only point of doing it that way is to
support a division that hasn't been made yet.

>> [ discussion about crash safety of slots and their use of PANIC ]
>> Broadly, what you're trying to accomplish here is to have something
>> that is crash-safe but without relying on WAL, so that it can work on
>> standbys. If making things crash-safe without WAL were easy to do, we
>> probably wouldn't have WAL at all, so it stands to reason that there
>> are going to be some difficulties here.
>
> My big problem here is that you're asking this code to have *higher*
> guarantees than WAL ever had and currently has, not equivalent
> guarantees. Even though the likelihood of hitting problems is at least an
> order of magnitude or two smaller as we are dealing with minimal amounts of data.
> All the situations that seem halfway workable in the nearby thread about
> PANIC in XLogInsert() you reference are rough ideas that reduce the
> likelihood of PANICs, not remove them.
>
> I am fine with reworking things so that the first operation of several
> doesn't PANIC because we still can clearly ERROR out in that case. That
> should press the likelihood of problems into the utterly irrelevant
> area. E.g. ERROR for the rename(oldpath, newpath) and then start a
> critical section for the fsync et al.

I have zero confidence that it's OK to treat fsync() as an operation
that won't fail. Linux documents EIO as a plausible error return, for
example. (And really, how could it not?)

>> Calling a slot "old" or "new" looks liable to cause problems. Maybe
>> change those names to contain a character not allowed in a slot name,
>> if we're going to keep doing it that way.
> I wondered about making them plain files as well but given the need for
> a directory independent from this I don't really see the advantage,
> we'll need to handle them anyway during cleanup.

Yeah, sure, but if it makes for fewer in-between states, it might be worth it.

>> How about letting the xmins of such backends affect the computation as normal, and then
>> having one extra xmin that gets folded in that represents the minima
>> of the xmin of unconnected slots?
>
> That's how I had it in the beginning but it turned out that has
> noticeable performance/space impact. Surprising isn't it? The reason is
> that we'll intermittently use normal snapshots to look at the catalog
> during decoding and they will install an xmin in the current proc. So, while
> that snapshot is active GetSnapshotData() will return an older xmin
> preventing HOT pruning from being as efficient.

Hrm, so you still need the flag, to indicate whether the xmin should
be included when we're computing a globalxmin for pruning of a
non-catalog table. But that doesn't necessarily mean that the value
has to live in the slot rather than the PGXACT, does it? It might be
for the best the way you have it, but it does look kind of weird. Not
sure yet.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	"pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.1
Date:	2014-01-24 17:49:16
Message-ID:	20140124174916.GV7182@awork2.anarazel.de
Lists:	pgsql-hackers

On 2014-01-24 12:10:50 -0500, Robert Haas wrote:
> > Unfortunately not. Inside the walsender there's currently no
> > LWLockReleaseAll() for ERRORs since commands aren't run inside a
> > transaction command...
> >
> > But maybe I should have fixed this by adding the release to
> > WalSndErrorCleanup() instead? That'd still leave the problematic case
> > that currently we try to delete a replication slot inside a CATCH when
> > we fail while initializing the rest of logical replication... But I
> > guess adding it would be a good idea independent of that.
>
> +1. I think that if we can't rely on error handling to clean up the
> same things everywhere, it's gonna be a mess. People won't be able to
> keep track of which error cleanup is engaged in which code paths, and
> screw-ups will result when old code paths are called from new call
> sites.

Ok, I'll additionally add it there.

> > We could also do a StartTransactionCommand() but I'd rather not, that
> > currently prevents code in that vicinity from doing anything it
> > shouldn't via various Assert()s in existing code.
>
> Like what? I mean, I'm OK with having a partial error-handling
> environment if that's all we need, but I think it's a bad plan to the
> extent that the code here needs to be aware of error-handling
> differences versus expectations elsewhere in our code.

Catalog lookups, building a snapshot, xid assignment, whatnot. All that
shouldn't happen in the locations creating/dropping a slot.
I think we should at some point separate parts of the error handling out
of xact.c. Currently it's repeated slightly differently over lots of
places (check e.g. the callsites for LWLockReleaseAll), that's not
robust. But that's a project for another day.

Note that the actual decoding *does* happen inside a TransactionCommand,
as it'd be failure prone to copy all the cleanup logic. And we need to
have most of the normal cleanup code.

> I don't really see why it's simpler that way. It's clearer what the
> point of the lock is if you only hold it for the operations that need
> to be protected by that lock.

> > In all other cases where we modify *_xmin we're only increasing it which
> > doesn't need a lock (HS feedback never has taken one, and
> > GetSnapshotData() modifies ->xmin while holding a shared lock), the only
> > potential danger is a slight delay in increasing the overall value.

> Right. We might want to comment such places.

> > Don't we already have cases of that? I seem to remember so. If you
> > prefer having them there, I am certainly fine with doing that. This way
> > they aren't allocated if slots are disabled but it's just two
> > TransactionIds.
>
> Let's go for it, unless we think of a reason not to.

ok on those counts.

> >> It's pretty evident that what's currently patch #4 (only peg the xmin
> >> horizon for catalog tables during logical decoding) needs to become
> >> patch #1, because it doesn't make any sense to apply this before we do
> >> that.
> >
> > Well, the slot code and the slot support for streaming rep are
> > independent from and don't use it. So they easily can come before it.
>
> But this code is riddled with places where you track a catalog xmin
> and a data xmin separately. The only point of doing it that way is to
> support a division that hasn't been made yet.

If you think it will make stuff more manageable I can try separating all
lines dealing with catalog_xmin into another patch as long as data_xmin
doesn't have to be renamed.
That said, I don't really think it's a big problem that the division
hasn't been made, essentially the meaning is different, even if we don't
take advantage of it yet. data_xmin is there for streaming replication
where we need to prevent all removal, catalog_xmin is there for
changeset extraction.

> I have zero confidence that it's OK to treat fsync() as an operation
> that won't fail. Linux documents EIO as a plausible error return, for
> example. (And really, how could it not?)

But quite fundamentally, having the most basic persistency building
block fail is something you can't really handle safely. Note that
issue_xlog_fsync() has always (and I wager, will always) treated that as
a PANIC.
I don't recall many complaints about that for WAL. All of the ones I
found in a quick search were like "oh, the fs invalidated my fd because
of corruption". And few.

> >> Calling a slot "old" or "new" looks liable to cause problems. Maybe
> >> change those names to contain a character not allowed in a slot name,
> >> if we're going to keep doing it that way.
> > I wondered about making them plain files as well but given the need for
> > a directory independent from this I don't really see the advantage,
> > we'll need to handle them anyway during cleanup.
>
> Yeah, sure, but if it makes for fewer in-between states, it might be worth it.

I don't think it'd make anything simpler with the new version of the
code. Agreed?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	"pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.1
Date:	2014-01-25 01:38:11
Message-ID:	CA+TgmobObSwQS0qwvLr-JQN8QobDWov4b2H7rWk43k1vRMpg2g@mail.gmail.com
Lists:	pgsql-hackers

On Fri, Jan 24, 2014 at 12:49 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> But this code is riddled with places where you track a catalog xmin
>> and a data xmin separately. The only point of doing it that way is to
>> support a division that hasn't been made yet.
>
> If you think it will make stuff more manageable I can try separating all
> lines dealing with catalog_xmin into another patch as long as data_xmin
> doesn't have to be renamed.
> That said, I don't really think it's a big problem that the division
> hasn't been made, essentially the meaning is different, even if we don't
> take advantage of it yet. data_xmin is there for streaming replication
> where we need to prevent all removal, catalog_xmin is there for
> changeset extraction.

I spent some more time studying the 0001 and 0002 patches this
afternoon, with a side dish of 0004. I'm leaning toward thinking we
should go ahead and make that division. I'm also wondering about
whether we've got the right naming here. AFAICT, it's not the case
that we're going to use the "catalog xmin" for catalogs and the "data
xmin" for non-catalogs. Rather, the "catalog xmin" is going to always
be included in globalxmin calculations, so IOW it should just be
called "xmin". The "data xmin" is going to be included only for
non-catalog tables. I guess "data" is a reasonable antonym for
catalog, but I'm slightly tempted to propose
RecentGlobalNonCatalogXmin and similar. Maybe that's too ugly to
live, but I can see someone failing to guess what the exact
distinction is between "xmin" and "data xmin", and I bet they'd be a
lot less likely to misguess if we wrote "non catalog".

It's interesting (though not immediately relevant) to speculate about
how we might extend this to fine-grained xmin tracking more generally.
The design sketch that comes to mind (and I think parts of this have
been proposed before) is to have a means by which backends can promise
not to lock any more tables except under a new snapshot. At the read
committed isolation level, or in any single-statement transaction,
backends can so promise whenever (a) all tables mentioned in the query
have been locked and (b) all functions to be invoked during the query
via the fmgr interface promise (e.g. via function labeling) that they
won't directly or indirectly do such a thing. If they break their
promise, we detect it and ereport(ERROR). Backends that have made
such a guarantee can be ignored for global-xmin calculations that
don't involve the tables they have locked. One idea is to keep a hash
table keyed by <dboid, reloid> with some limited number of entries in
shared memory; it caches the table-specific xmin, a usage counter, and
a flag indicating whether the cached xmin might be stale. In order to
promise not to lock any new tables, backends must make or update
entries for all the tables they already have locked in this hash
table; if there aren't enough entries, they're not allowed to promise.
Thus, backends wishing to prune can use the cached xmin value if it's
present (optionally updating it if it's stale) and the minimum of the
xmins of the backends that haven't made a promise if it isn't. This
is a bit hairy though; access to the shared hash table had better be
*really* fast, and we'd better not need to recompute the cached value
too often.
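
To make the shape of that idea concrete, here is a rough sketch of such a cache. Everything in it is hypothetical: the names (TableXminCache, cache_promise, prune_horizon), the tiny fixed-size array searched linearly in place of a real shared-memory hash table, the caller-supplied table xmin, and the choice to keep using a stale entry. Locking and xid wraparound are ignored entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t Oid;
typedef uint32_t TransactionId;

#define CACHE_SLOTS 4           /* "some limited number of entries" */

typedef struct
{
    bool          in_use;
    Oid           dboid;        /* key: database */
    Oid           reloid;       /* key: relation */
    TransactionId xmin;         /* cached table-specific xmin */
    int           usage;        /* usage counter */
    bool          stale;        /* cached xmin might be out of date */
} TableXminEntry;

typedef struct { TableXminEntry e[CACHE_SLOTS]; } TableXminCache;

/* Linear search standing in for a hash lookup on <dboid, reloid>. */
static TableXminEntry *
cache_find(TableXminCache *c, Oid db, Oid rel)
{
    for (size_t i = 0; i < CACHE_SLOTS; i++)
        if (c->e[i].in_use && c->e[i].dboid == db && c->e[i].reloid == rel)
            return &c->e[i];
    return NULL;
}

/*
 * Make or update entries for every table the backend has locked.
 * All-or-nothing: if there are not enough free entries, nothing is
 * changed and the backend is not allowed to promise.
 */
bool
cache_promise(TableXminCache *c, Oid db, const Oid *rels, size_t n,
              TransactionId table_xmin)
{
    size_t needed = 0, avail = 0, i, j;

    assert(c != NULL && (rels != NULL || n == 0));
    for (i = 0; i < n; i++)
        if (cache_find(c, db, rels[i]) == NULL)
            needed++;
    for (j = 0; j < CACHE_SLOTS; j++)
        if (!c->e[j].in_use)
            avail++;
    if (needed > avail)
        return false;

    for (i = 0; i < n; i++)
    {
        TableXminEntry *e = cache_find(c, db, rels[i]);

        if (e == NULL)
        {
            for (j = 0; c->e[j].in_use; j++)
                ;               /* a free entry is guaranteed by the check above */
            e = &c->e[j];
            e->in_use = true;
            e->dboid = db;
            e->reloid = rels[i];
            e->xmin = table_xmin;
            e->usage = 0;
            e->stale = false;
        }
        else if (table_xmin < e->xmin)
            e->xmin = table_xmin;   /* keep the older, more conservative value */
        e->usage++;
    }
    return true;
}

/*
 * Horizon for pruning one table: the cached value if there is one,
 * otherwise the minimum xmin of the backends that have not promised,
 * which the caller passes in as fallback_xmin.
 */
TransactionId
prune_horizon(TableXminCache *c, Oid db, Oid rel, TransactionId fallback_xmin)
{
    TableXminEntry *e = cache_find(c, db, rel);

    return e != NULL ? e->xmin : fallback_xmin;
}
```

The all-or-nothing check in cache_promise() is the part that corresponds to "if there aren't enough entries, they're not allowed to promise"; how the table-specific xmin is computed and refreshed is left open here, as it is above.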

Anyway, whether we end up pursuing something like that or not, I think
I'm persuaded that this particular optimization won't really be a
problem for hypothetical future work in this area; and also that it's
a good idea to do this much now specifically for logical decoding.

>> I have zero confidence that it's OK to treat fsync() as an operation
>> that won't fail. Linux documents EIO as a plausible error return, for
>> example. (And really, how could it not?)
>
> But quite fundamentally having the most basic persistency building
> block fail is something you can't really handle safely. Note that
> issue_xlog_fsync() has always (and I wager, will always) treated that as
> a PANIC.
> I don't recall many complaints about that for WAL. All of the ones I
> found in a quick search were like "oh, the fs invalidated my fd because
> of corruption". And few.

Well, you have a point. And certainly this version looks much better
than the previous version in terms of the likelihood of PANIC
occurring in practice. But I wonder if we couldn't cut it down even
further without too much effort. Suppose we define a slot to exist
if, and only if, the state file exists. A directory without a state
file is subject to arbitrary removal. Then we can proceed as follows:

- mkdir() the directory.
- open() state.tmp
- write() state.tmp
- close() state.tmp
- fsync() parent directory, directory and state.tmp
- rename() state.tmp to state
- fsync() state

If any step except the last one fails, no problem. The next startup
can nuke the leftovers; any future attempt to create a slot with the
name can ignore an EEXIST failure from mkdir() and proceed just as
above. Only a failure of the very last fsync is a PANIC. In some
ways I think this'd be simpler than what you've got now, because we'd
eliminate the dance with renaming the directory as well as the state
file; only the state file matters.
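
The creation sequence above can be sketched as plain POSIX calls. This is only an illustration under assumptions, not code from the patch: slot_create(), fsync_path() and the SlotResult codes are made-up names, a real backend would ereport(ERROR) or PANIC rather than return a code, and the check for an existing state file is added here to honor "a slot exists if, and only if, the state file exists".

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical result codes standing in for ereport(ERROR) and PANIC. */
typedef enum { SLOT_OK, SLOT_ERROR, SLOT_PANIC } SlotResult;

/* fsync a file or directory by path; returns 0 on success, -1 on failure. */
static int
fsync_path(const char *path)
{
    int fd = open(path, O_RDONLY);
    int rc;

    if (fd < 0)
        return -1;
    rc = fsync(fd);
    close(fd);
    return rc;
}

/* Create slot "name" under "parent"; the slot exists once "state" exists. */
SlotResult
slot_create(const char *parent, const char *name, const void *state, size_t len)
{
    char dir[512], tmp[600], fin[600];
    int  fd;

    assert(parent != NULL && name != NULL && state != NULL);
    snprintf(dir, sizeof(dir), "%s/%s", parent, name);
    snprintf(tmp, sizeof(tmp), "%s/state.tmp", dir);
    snprintf(fin, sizeof(fin), "%s/state", dir);

    /* A directory left behind by a failed attempt is harmless: ignore EEXIST. */
    if (mkdir(dir, 0700) != 0 && errno != EEXIST)
        return SLOT_ERROR;
    /* A state file, however, means the slot already exists. */
    if (access(fin, F_OK) == 0)
        return SLOT_ERROR;

    fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return SLOT_ERROR;
    if (write(fd, state, len) != (ssize_t) len)
    {
        close(fd);
        return SLOT_ERROR;
    }
    if (close(fd) != 0)
        return SLOT_ERROR;

    /* Make the parent directory, the slot directory and state.tmp durable. */
    if (fsync_path(parent) != 0 || fsync_path(dir) != 0 || fsync_path(tmp) != 0)
        return SLOT_ERROR;

    if (rename(tmp, fin) != 0)
        return SLOT_ERROR;

    /* Only a failure of this very last fsync is treated as fatal. */
    if (fsync_path(fin) != 0)
        return SLOT_PANIC;
    return SLOT_OK;
}
```

Every early return leaves at most a directory without a state file, which is exactly the kind of leftover the next startup is allowed to remove.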

To drop a slot, just unlink the state file and fsync() the directory.
If the unlink fails, it's just an error. If the fsync() fails, it's a
PANIC. Once the state file is gone, removing everything else is only
an ERROR, and you don't even need to fsync() it again.
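
A sketch of that drop sequence, again with made-up names (slot_drop(), DropResult) and return codes in place of ereport(). It assumes the slot directory holds nothing but the state file, so a plain rmdir() is enough; a directory with more content would need a recursive removal.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical result codes standing in for ereport(ERROR) and PANIC. */
typedef enum { DROP_OK, DROP_ERROR, DROP_PANIC } DropResult;

DropResult
slot_drop(const char *parent, const char *name)
{
    char dir[512], fin[600];
    int  dirfd, rc;

    assert(parent != NULL && name != NULL);
    snprintf(dir, sizeof(dir), "%s/%s", parent, name);
    snprintf(fin, sizeof(fin), "%s/state", dir);

    /* Removing the state file is what makes the slot cease to exist. */
    if (unlink(fin) != 0)
        return DROP_ERROR;

    /* The removal has to be durable; this is the only fatal step. */
    dirfd = open(dir, O_RDONLY);
    if (dirfd < 0)
        return DROP_PANIC;
    rc = fsync(dirfd);
    close(dirfd);
    if (rc != 0)
        return DROP_PANIC;

    /* What is left is garbage; failing to remove it is a plain error. */
    if (rmdir(dir) != 0)
        return DROP_ERROR;
    return DROP_OK;
}
```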

To update a slot, open, write, close, and fsync state.tmp, then rename
it to state and fsync() again. None of these steps need PANIC; hold
off on updating the values in memory until they're all done. If any
step fails, the attempt to update the slot fails, but either memory
and disk are still consistent, or the disk has an xmin newer than
memory, but still legal. On restart, when restoring slots, fsync()
each state file, dying horribly if we can't, and remove any
directories that don't contain one.
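
The update path might look roughly like the following. The Slot and SlotState types and slot_update() are invented for the example, and the state file is reduced to a single xmin. One deviation from the order given above: the temp file is fsync()ed through its open descriptor before close() rather than reopened afterwards.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical slot contents; a real state file holds more than an xmin. */
typedef struct { uint32_t xmin; } SlotState;

typedef struct
{
    char      dir[256];         /* slot directory */
    SlotState mem;              /* in-memory copy of the state */
} Slot;

/* Returns 0 on success, -1 on failure; slot->mem is untouched on failure. */
int
slot_update(Slot *slot, uint32_t new_xmin)
{
    char      tmp[300], fin[300];
    SlotState next;
    int       fd;

    assert(slot != NULL);
    next = slot->mem;
    next.xmin = new_xmin;
    snprintf(tmp, sizeof(tmp), "%s/state.tmp", slot->dir);
    snprintf(fin, sizeof(fin), "%s/state", slot->dir);

    /* Write and flush the new contents under a temporary name. */
    fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    if (write(fd, &next, sizeof(next)) != (ssize_t) sizeof(next) ||
        fsync(fd) != 0)
    {
        close(fd);
        return -1;
    }
    if (close(fd) != 0)
        return -1;

    /* Atomically replace the old state file, then flush again. */
    if (rename(tmp, fin) != 0)
        return -1;
    fd = open(fin, O_RDONLY);
    if (fd < 0)
        return -1;
    if (fsync(fd) != 0)
    {
        close(fd);
        return -1;
    }
    close(fd);

    /* Only now is the in-memory copy allowed to change. */
    slot->mem = next;
    return 0;
}
```

Because slot->mem is assigned last, a failure at any step leaves memory at the old value while disk holds either the old or the new one, which is the "consistent, or newer but still legal" outcome described above.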

As compared with what you have here, this eliminates the risk of PANIC
entirely for slot updates, which is good because those will be quite
frequent. For creating or dropping a slot, it doesn't quite eliminate
the risk entirely but only one fsync() call per create or drop is at
risk. We still risk startup time failures, but that's unavoidable
anyway if the data we need can't be read; the chances of blowing up a
running system are very low.

Looking over patch 0002, I see that there's code to allow a walsender
to create or drop a physical replication slot. Also, if we've
acquired a replication slot, there's code to update it, and code to
make sure we disconnect from it, but there's no code to acquire it. I
think maybe the hunk in StartReplication() is supposed to be calling
ReplicationSlotAcquire() instead of ReplicationSlotRelease(), which
(ahem) makes one wonder how thoroughly this code has been tested.
There's also no provision for walsender (or pg_receivexlog?) to send
the new SLOT option to walreceiver, which seems somewhat necessary.
I'm tempted to suggest also adding something to src/bin/scripts to
create and drop slots, though I suppose we could just recommend psql
-c 'CREATE_REPLICATION_SLOT SLOT zippy PHYSICAL' 'replication=true'.

(BTW, isn't that kind of a strange syntax, with the word SLOT
appearing twice? I think we could drop the second one.)

It also occurred to me that we need documentation for all of this; I
see that's in patch 0010, but I haven't reviewed it in detail yet.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	"pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.1
Date:	2014-01-25 22:25:26
Message-ID:	20140125222526.GZ7182@awork2.anarazel.de
Lists:	pgsql-hackers

Hi Robert, all,

On 2014-01-24 20:38:11 -0500, Robert Haas wrote:
> On Fri, Jan 24, 2014 at 12:49 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> >> But this code is riddled with places where you track a catalog xmin
> >> and a data xmin separately. The only point of doing it that way is to
> >> support a division that hasn't been made yet.
> >
> > If you think it will make stuff more manageable I can try separating all
> > lines dealing with catalog_xmin into another patch as long as data_xmin
> > doesn't have to be renamed.
> > That said, I don't really think it's a big problem that the division
> > hasn't been made, essentially the meaning is different, even if we don't
> > take advantage of it yet. data_xmin is there for streaming replication
> > where we need to prevent all removal, catalog_xmin is there for
> > changeset extraction.
>
> I spent some more time studying the 0001 and 0002 patches this
> afternoon, with a side dish of 0004. I'm leaning toward thinking we
> should go ahead and make that division.

Ok.

> I'm also wondering about
> whether we've got the right naming here. AFAICT, it's not the case
> that we're going to use the "catalog xmin" for catalogs and the "data
> xmin" for non-catalogs. Rather, the "catalog xmin" is going to always
> be included in globalxmin calculations, so IOW it should just be
> called "xmin".

Well, not really. That's true for GetSnapshotData(), but not for
GetOldestXmin() where we calculate an xmin *not* including the catalog
xmin. And the data_xmin is always used, regardless of
catalog/non_catalog, we peg the xmin further for catalog tables, based
on that value.
The reason for doing things this way is that it makes all current usages
of RecentGlobalXmin safe, since that is the more conservative
value. Only in inspected locations can we use RecentGlobalDataXmin, which
*does* include data_xmin but *not* catalog_xmin.
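
As a toy model of the two horizons (not the procarray.c code): compute_horizons(), the Horizons struct and xid_min() are names made up for this sketch, the slot values are passed in as plain arguments, and xids are compared as ordinary integers, ignoring wraparound.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TransactionId;
#define InvalidTransactionId ((TransactionId) 0)

typedef struct
{
    TransactionId global_xmin;       /* plays the role of RecentGlobalXmin */
    TransactionId global_data_xmin;  /* plays the role of RecentGlobalDataXmin */
} Horizons;

/* Minimum of two xids, treating InvalidTransactionId as "not set". */
static TransactionId
xid_min(TransactionId a, TransactionId b)
{
    if (a == InvalidTransactionId)
        return b;
    if (b == InvalidTransactionId)
        return a;
    return a < b ? a : b;       /* simplification: no wraparound handling */
}

Horizons
compute_horizons(const TransactionId *backend_xmins, size_t n,
                 TransactionId slot_data_xmin,
                 TransactionId slot_catalog_xmin)
{
    Horizons      h;
    TransactionId x = InvalidTransactionId;

    assert(backend_xmins != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        x = xid_min(x, backend_xmins[i]);

    /* data_xmin is always included, for catalog and non-catalog tables. */
    x = xid_min(x, slot_data_xmin);
    h.global_data_xmin = x;

    /* The conservative horizon is additionally held back by catalog_xmin. */
    h.global_xmin = xid_min(x, slot_catalog_xmin);
    return h;
}
```

With backend xmins {120, 105, 130}, a data_xmin of 110 and a catalog_xmin of 90, the data horizon is 105 and the conservative one is 90, so global_xmin can never be ahead of global_data_xmin.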

> It's interesting (though not immediately relevant) to speculate about
> how we might extend this to fine-grained xmin tracking more generally.
> [musings for another time]

Yea, I have wondered about that as well. I think the easiest thing would
be to compute RecentGlobalDataXmin in a database specific manner
since by definition it will *not* include shared tables. We do that
already for GetOldestXmin() but that's not used for heap pruning. I'd
quickly tested that some months back and it gave significant speedups
for two pgbenches in two databases.

> >> I have zero confidence that it's OK to treat fsync() as an operation
> >> that won't fail. Linux documents EIO as a plausible error return, for
> >> example. (And really, how could it not?)
> >
> > But quite fundamentally having the most basic persistency building
> > block fail is something you can't really handle safely. Note that
> > issue_xlog_fsync() has always (and I wager, will always) treated that as
> > a PANIC.
> > I don't recall many complaints about that for WAL. All of the ones I
> > found in a quick search were like "oh, the fs invalidated my fd because
> > of corruption". And few.
>
> Well, you have a point. And certainly this version looks much better
> than the previous version in terms of the likelihood of PANIC
> occurring in practice. But I wonder if we couldn't cut it down even
> further without too much effort. Suppose we define a slot to exist
> if, and only if, the state file exists. A directory without a state
> file is subject to arbitrary removal. Then we can proceed as follows:
>
> - mkdir() the directory.
> - open() state.tmp
> - write() state.tmp
> - close() state.tmp
> - fsync() parent directory, directory and state.tmp
> - rename() state.tmp to state
> - fsync() state
>
> If any step except the last one fails, no problem. The next startup
> can nuke the leftovers; any future attempt to create a slot with the
> name can ignore an EEXIST failure from mkdir() and proceed just as
> above. Only a failure of the very last fsync is a PANIC. In some
> ways I think this'd be simpler than what you've got now, because we'd
> eliminate the dance with renaming the directory as well as the state
> file; only the state file matters.

Hm. I think this is pretty much exactly what happens in the current patch,
right? There's an additional fsync() of the parent directory at the end,
but that's it.

> To drop a slot, just unlink the state file and fsync() the directory.
> If the unlink fails, it's just an error. If the fsync() fails, it's a
> PANIC. Once the state file is gone, removing everything else is only
> an ERROR, and you don't even need to fsync() it again.

Well, the patch as is renames the directory first and fsyncs that. Only
a failure in fsyncing is punishable by PANIC; if rmtree() on the temp
directory fails, it generates WARNINGs, that's it.

> To update a slot, open, write, close, and fsync state.tmp, then rename
> it to state and fsync() again. None of these steps need PANIC; hold
> off on updating the values in memory until they're all done. If any
> step fails, the attempt to update the slot fails, but either memory
> and disk are still consistent, or the disk has an xmin newer than
> memory, but still legal. On restart, when restoring slots, fsync()
> each state file, dying horribly if we can't, and remove any
> directories that don't contain one.

That's again pretty similar to what happens, only that we panic if the
fsync()ing fails. And I think that's correct.

I still think worrying over this to this degree is a waste of
effort. There are much hotter places that could be inspected in that
detail than this.

> Looking over patch 0002, I see that there's code to allow a walsender
> to create or drop a physical replication slot. Also, if we've
> acquired a replication slot, there's code to update it, and code to
> make sure we disconnect from it, but there's no code to acquire it. I
> think maybe the hunk in StartReplication() is supposed to be calling
> ReplicationSlotAcquire() instead of ReplicationSlotRelease(),

Uh. You had me worried here for a minute or two, a hunk or two earlier
than the ReplicationSlotRelease() you mention. What probably confused
you is that StartReplication only returns once all streaming is
finished. Not my idea...

static void
StartReplication(StartReplicationCmd *cmd)
{
    ...
    if (cmd->slotname)
        ReplicationSlotAcquire(cmd->slotname);
    ...
    ...
    /* this is where we'll actually loop busily */
    WalSndLoop(XLogSendPhysical);
    ...
    if (cmd->slotname)
        ReplicationSlotRelease();
    ...
}

> which
> (ahem) makes one wonder how thoroughly this code has been tested.

It's actually tested as of a week ago or so. Both with pg_receivexlog
and a hacked up walreceiver. That's how I noticed
a472ae1e4e2bf5fb71ac655d38d1e35df4c1c966 ;). Because it did *not* work
properly in the beginning... But it didn't end up being my code. Hah!

> There's also no provision for walsender (or pg_receivexlog?) to send
> the new SLOT option to walreceiver, which seems somewhat necessary.

There's a hacked up pg_receivexlog in the last commit in the series. I
haven't included the hack for walreceiver as it was too embarrassing for
the public eye.

I really, really don't want to focus on polishing up the receiver side
for this before the basics of changeset extraction are done. I've very,
very reluctantly agreed to generalize the slot concept for streaming rep
now, but I said all along that I won't do the client work till the
changeset extraction stuff is done. There is a good deal of UI design
work to be done, and I don't think I have the capacity to tackle that
right now.

> I'm tempted to suggest also adding something to src/bin/scripts to
> create and drop slots, though I suppose we could just recommend psql
> -c 'CREATE_REPLICATION_SLOT SLOT zippy PHYSICAL' 'replication=true'.

There's an SQL function for doing so. In, err, the wrong patch:
postgres=# \df create_physical_replication_slot
                                                          List of functions
   Schema   |               Name               | Result data type |                   Argument data types                    |  Type
------------+----------------------------------+------------------+----------------------------------------------------------+--------
 pg_catalog | create_physical_replication_slot | record           | slotname name, OUT slotname text, OUT xlog_position text | normal
(1 row)

Will move it. Not sure if we need the slotname as an OUT value as well;
it's helpful as part of the additional return types for creating a
decoding slot, but here...

Not sure if there's still a reason for a separate commandline utility?

> (BTW, isn't that kind of a strange syntax, with the word SLOT
> appearing twice? I think we could drop the second one.)

Well, that's what I'd suggested on the mailing list, so I didn't change
it. It will definitely be a separate SLOT slot_name for
START_REPLICATION, that's pretty much the only reason for keeping it
separate for CREATE/DROP. Don't care which way we go in the end.

> It also occurred to me that we need documentation for all of this; I
> see that's in patch 0010, but I haven't reviewed it in detail yet.

The streaming rep part is scantily documented since that's pending the
clientside work, but the changeset extraction part should be documented
to some degree... Craig worked on my initial docs and seemed to be able
to make enough sense of it, so I hope it's not in a totally bad state.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	"pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.1
Date:	2014-01-27 12:49:40
Message-ID:	CA+TgmoaaD=wi6Rwoc1sN8JbzKHkpqe6RwxhKjtwP63+NNHWPWQ@mail.gmail.com
Lists:	pgsql-hackers

On Sat, Jan 25, 2014 at 5:25 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> Looking over patch 0002, I see that there's code to allow a walsender
>> to create or drop a physical replication slot. Also, if we've
>> acquired a replication slot, there's code to update it, and code to
>> make sure we disconnect from it, but there's no code to acquire it. I
>> think maybe the hunk in StartReplication() is supposed to be calling
>> ReplicationSlotAcquire() instead of ReplicationSlotRelease(),
>
> Uh. You had me worried here for a minute or two, a hunk or two earlier
> than the ReplicationSlotRelease() you mention. What probably confused
> you is that StartReplication only returns once all streaming is
> finished. Not my idea...

No, what confuses me is that there's no call to
ReplicationSlotAcquire() in patch 0001 or patch 0002.... the function
is added but not called.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	"pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.1
Date:	2014-01-27 12:59:33
Message-ID:	CA+TgmoZROB9DFQM5=7MCyORUDNC1rt=Ob0viViE_OBgXVt=WNA@mail.gmail.com
Lists:	pgsql-hackers

On Sat, Jan 25, 2014 at 5:25 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> I'm also wondering about
>> whether we've got the right naming here. AFAICT, it's not the case
>> that we're going to use the "catalog xmin" for catalogs and the "data
>> xmin" for non-catalogs. Rather, the "catalog xmin" is going to always
>> be included in globalxmin calculations, so IOW it should just be
>> called "xmin".
>
> Well, not really. That's true for GetSnapshotData(), but not for
> GetOldestXmin() where we calculate an xmin *not* including the catalog
> xmin. And the data_xmin is always used, regardless of
> catalog/non_catalog, we peg the xmin further for catalog tables, based
> on that value.
> The reason for doing things this way is that it makes all current usages
> of RecentGlobalXmin safe, since that is the more conservative
> value. Only in inspected locations can we use RecentGlobalDataXmin, which
> *does* include data_xmin but *not* catalog_xmin.

Well, OK, so I guess I'm turned around. But I guess my point is - if
one of data_xmin and catalog_xmin is really just xmin, then I think it
would be more clear to call that one "xmin".

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Changeset Extraction v7.3
Date:	2014-01-27 16:20:06
Message-ID:	20140127162006.GA25670@awork2.anarazel.de
Lists:	pgsql-hackers

Hi,

Here's the next version of the patchset. The following changes have been
made:
* move xmin pegging and more logic responsibility to procarray.c
* split all support for changeset extraction from the initial slot patch
* always register a before_shmem_exit handler when
max_replication_slots is registered, not just while a slot is acquired
* move some patch hunks to earlier patches, especially the
ReplicationSlotAcquire() call for physical rep that accidentally
slipped and the SQL accessible slot manipulation functions
* minor stuff

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment	Content-Type	Size
0001-wal_decoding-Introduce-the-replication-slot-interfac.patch.gz	application/x-patch-gzip	14.5 KB
0002-wal_decoding-physical-streaming-replication-walsende.patch.gz	application/x-patch-gzip	5.5 KB
0003-wal_decoding-Introduce-changeset-extraction.patch.gz	application/x-patch-gzip	75.5 KB
0004-wal_decoding-Only-peg-the-xmin-horizon-for-catalog-t.patch.gz	application/x-patch-gzip	5.2 KB
0005-wal_decoding-Allow-walsenders-to-connect-to-a-specif.patch.gz	application/x-patch-gzip	4.0 KB
0006-wal_decoding-logical-changeset-extraction-walsender-.patch.gz	application/x-patch-gzip	8.0 KB
0007-wal_decoding-pg_recvlogical-Introduce-pg_receivexlog.patch.gz	application/x-patch-gzip	9.1 KB
0008-wal_decoding-test_decoding-Add-a-simple-decoding-mod.patch.gz	application/x-patch-gzip	27.4 KB
0009-wal_decoding-design-document-v2.4-and-snapshot-build.patch.gz	application/x-patch-gzip	12.9 KB
0010-wal_decoding-Documentation-for-replication-slots-and.patch.gz	application/x-patch-gzip	13.1 KB
0011-wal_decoding-Temporarily-add-logical-decoding-regres.patch.gz	application/x-patch-gzip	1.4 KB
0012-slot-hack-up-pg_receivexlog-support.patch.gz	application/x-patch-gzip	1.9 KB

From:	Thom Brown <thom(at)linux(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.3
Date:	2014-01-28 16:49:17
Message-ID:	CAA-aLv4p+CN8U+ukME3s_7emwMUL7p+qv8tVVLsVSnmsLE+oBw@mail.gmail.com
Lists:	pgsql-hackers

On 27 January 2014 16:20, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> Hi,
>
> Here's the next version of the patchset. The following changes have been
> made:
> * move xmin pegging and more logic responsibility to procarray.c
> * split all support for changeset extraction from the initial slot patch
> * always register a before_shmem_exit handler when
> max_replication_slots is registered, not just while a slot is acquired
> * move some patch hunks to earlier patches, especially the
> ReplicationSlotAcquire() call for physical rep that accidentally
> slipped and the SQL accessible slot manipulation functions
> * minor stuff

0001 doesn't apply cleanly due to commit
ea9df812d8502fff74e7bc37d61bdc7d66d77a7f.

The rest are fine.

--
Thom

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Thom Brown <thom(at)linux(dot)com>
Cc:	Andres Freund <andres(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.3
Date:	2014-01-28 16:53:41
Message-ID:	CA+TgmoZqA++6fpEJy9SCFnOEUSJMF1StyNFBoe3Uq+UJ5U1HHg@mail.gmail.com
Lists:	pgsql-hackers

On Tue, Jan 28, 2014 at 11:49 AM, Thom Brown <thom(at)linux(dot)com> wrote:
> On 27 January 2014 16:20, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> Hi,
>>
>> Here's the next version of the patchset. The following changes have been
>> made:
>> * move xmin pegging and more logic responsibility to procarray.c
>> * split all support for changeset extraction from the initial slot patch
>> * always register a before_shmem_exit handler when
>> max_replication_slots is registered, not just while a slot is acquired
>> * move some patch hunks to earlier patches, especially the
>> ReplicationSlotAcquire() call for physical rep that accidentally
>> slipped and the SQL accessible slot manipulation functions
>> * minor stuff
>
> 0001 doesn't apply cleanly due to commit
> ea9df812d8502fff74e7bc37d61bdc7d66d77a7f.
>
> The rest are fine.

I've rebased it here and am hacking on it still.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Thom Brown <thom(at)linux(dot)com>
Cc:	Andres Freund <andres(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.3
Date:	2014-01-28 21:37:32
Message-ID:	CA+TgmoaRKcBkpr3YNJq4PvKCHYntZqZ-NF=HrB6f6JgcuAouSw@mail.gmail.com
Lists:	pgsql-hackers

On Tue, Jan 28, 2014 at 11:53 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> I've rebased it here and am hacking on it still.

Andres and I are going back and forth between our respective git repos
hacking on this, and I think we're getting there, but I have a
terminological question which I'd like to submit to a wider audience:

The point of Andres's patch set is to introduce a new technology
called logical decoding; that is, the ability to get a replication
stream that is based on changes to tuples rather than changes to
blocks. It could also be called logical replication. In these
patches, our existing replication is referred to as "physical"
replication, which sounds kind of funny to me. Anyone have another
suggestion?

There are a lot of ways to slice the space of possible replication
solutions. We currently talk about "streaming replication" (as
opposed to "archiving") and "synchronous replication" (as opposed to
asynchronous), but this is a new distinction. At least in theory,
whether replication is "physical" or logical is independent of whether
it's based on streaming or archiving and also of whether it's
synchronous or asynchronous. So we can't for example talk about
"logical replication" in opposition to "streaming replication"; that's
comparing apples and oranges. We need a pair of new terms, and I
can't immediately think of anything better than physical/logical, but
it still sounds somewhat awkward to me so ... anyone else have an
idea?

Thanks,

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Thom Brown <thom(at)linux(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Andres Freund <andres(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.3
Date:	2014-01-28 21:48:09
Message-ID:	CAA-aLv4_GxGeHWfsQTxypcHR_HYA5USCyihUbMQ5+0HBagkakw@mail.gmail.com
Lists:	pgsql-hackers

On 28 January 2014 21:37, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Tue, Jan 28, 2014 at 11:53 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> I've rebased it here and am hacking on it still.
>
> Andres and I are going back and forth between our respective git repos
> hacking on this, and I think we're getting there, but I have a
> terminological question which I'd like to submit to a wider audience:
>
> The point of Andres's patch set is to introduce a new technology
> called logical decoding; that is, the ability to get a replication
> stream that is based on changes to tuples rather than changes to
> blocks. It could also be called logical replication. In these
> patches, our existing replication is referred to as "physical"
> replication, which sounds kind of funny to me. Anyone have another
> suggestion?

Logical and Binary replication?

--
Thom

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Thom Brown <thom(at)linux(dot)com>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.3
Date:	2014-01-28 21:56:20
Message-ID:	20140128215620.GH18333@awork2.anarazel.de
Lists:	pgsql-hackers

On 2014-01-28 21:48:09 +0000, Thom Brown wrote:
> On 28 January 2014 21:37, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> > On Tue, Jan 28, 2014 at 11:53 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> >> I've rebased it here and am hacking on it still.
> >
> > Andres and I are going back and forth between our respective git repos
> > hacking on this, and I think we're getting there, but I have a
> > terminological question which I'd like to submit to a wider audience:
> >
> > The point of Andres's patch set is to introduce a new technology
> > called logical decoding; that is, the ability to get a replication
> > stream that is based on changes to tuples rather than changes to
> > blocks. It could also be called logical replication. In these
> > patches, our existing replication is referred to as "physical"
> > replication, which sounds kind of funny to me. Anyone have another
> > suggestion?
>
> Logical and Binary replication?

Unfortunately, changeset extraction's output can be binary data...

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Thom Brown <thom(at)linux(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.3
Date:	2014-01-28 22:27:20
Message-ID:	CAA-aLv7tP7Hnh-SE=HmDAytjD5cB5Z6JQiar+Tj+mHpKNq4hqQ@mail.gmail.com
Lists:	pgsql-hackers

On 28 January 2014 21:56, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> On 2014-01-28 21:48:09 +0000, Thom Brown wrote:
>> On 28 January 2014 21:37, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> > On Tue, Jan 28, 2014 at 11:53 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> >> I've rebased it here and am hacking on it still.
>> >
>> > Andres and I are going back and forth between our respective git repos
>> > hacking on this, and I think we're getting there, but I have a
>> > terminological question which I'd like to submit to a wider audience:
>> >
>> > The point of Andres's patch set is to introduce a new technology
>> > called logical decoding; that is, the ability to get a replication
>> > stream that is based on changes to tuples rather than changes to
>> > blocks. It could also be called logical replication. In these
>> > patches, our existing replication is referred to as "physical"
>> > replication, which sounds kind of funny to me. Anyone have another
>> > suggestion?
>>
>> Logical and Binary replication?
>
> Unfortunately, changeset extraction's output can be binary data...

"system"?
"cluster"?
"full"?
"complete"?

--
Thom

From:	Rod Taylor <rod(dot)taylor(at)gmail(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	Thom Brown <thom(at)linux(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.3
Date:	2014-01-28 22:31:25
Message-ID:	CAKddOFCgJ4jDEspnpK58V0VqkHghtydWJ9buvO_S-fcTdxfFHw@mail.gmail.com
Lists:	pgsql-hackers

On Tue, Jan 28, 2014 at 4:56 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:

> On 2014-01-28 21:48:09 +0000, Thom Brown wrote:
> > On 28 January 2014 21:37, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> > > On Tue, Jan 28, 2014 at 11:53 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> > >> I've rebased it here and am hacking on it still.
> > >
> > > Andres and I are going back and forth between our respective git repos
> > > hacking on this, and I think we're getting there, but I have a
> > > terminological question which I'd like to submit to a wider audience:
> > >
> > > The point of Andres's patch set is to introduce a new technology
> > > called logical decoding; that is, the ability to get a replication
> > > stream that is based on changes to tuples rather than changes to
> > > blocks. It could also be called logical replication. In these
> > > patches, our existing replication is referred to as "physical"
> > > replication, which sounds kind of funny to me. Anyone have another
> > > suggestion?
> >
> > Logical and Binary replication?
>
> Unfortunately changeset extraction's output can be binary data...
>

Perhaps Logical and Block?

The existing replication mechanism is similar to block-based disk backups.
It's the whole thing (not parts) and doesn't have any concept of
database/directory.

From:	David Fetter <david(at)fetter(dot)org>
To:	Rod Taylor <rod(dot)taylor(at)gmail(dot)com>
Cc:	Andres Freund <andres(at)2ndquadrant(dot)com>, Thom Brown <thom(at)linux(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.3
Date:	2014-01-28 22:38:33
Message-ID:	20140128223833.GA24518@fetter.org
Lists:	pgsql-hackers

On Tue, Jan 28, 2014 at 05:31:25PM -0500, Rod Taylor wrote:
> On Tue, Jan 28, 2014 at 4:56 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>
> > On 2014-01-28 21:48:09 +0000, Thom Brown wrote:
> > > On 28 January 2014 21:37, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> > > > On Tue, Jan 28, 2014 at 11:53 AM, Robert Haas <robertmhaas(at)gmail(dot)com>
> > wrote:
> > > >> I've rebased it here and am hacking on it still.
> > > >
> > > > Andres and I are going back and forth between our respective git repos
> > > > hacking on this, and I think we're getting there, but I have a
> > > > terminological question which I'd like to submit to a wider audience:
> > > >
> > > > The point of Andres's patch set is to introduce a new technology
> > > > called logical decoding; that is, the ability to get a replication
> > > > stream that is based on changes to tuples rather than changes to
> > > > blocks. It could also be called logical replication. In these
> > > > patches, our existing replication is referred to as "physical"
> > > > replication, which sounds kind of funny to me. Anyone have another
> > > > suggestion?
> > >
> > > Logical and Binary replication?
> >
> > Unfortunately changeset extraction's output can be binary data...
> >
>
> Perhaps Logical and Block?
>
> The existing replication mechanism is similar to block-based disk backups.
> It's the whole thing (not parts) and doesn't have any concept of
> database/directory.

+1 for this terminology. It's descriptive.

Cheers,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david(dot)fetter(at)gmail(dot)com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

From:	Andreas Karlsson <andreas(at)proxel(dot)se>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	Thom Brown <thom(at)linux(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.3
Date:	2014-01-29 00:43:52
Message-ID:	52E84EC8.6090503@proxel.se
Lists:	pgsql-hackers

On 01/28/2014 10:56 PM, Andres Freund wrote:
> On 2014-01-28 21:48:09 +0000, Thom Brown wrote:
>> On 28 January 2014 21:37, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>> On Tue, Jan 28, 2014 at 11:53 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>> The point of Andres's patch set is to introduce a new technology
>>> called logical decoding; that is, the ability to get a replication
>>> stream that is based on changes to tuples rather than changes to
>>> blocks. It could also be called logical replication. In these
>>> patches, our existing replication is referred to as "physical"
>>> replication, which sounds kind of funny to me. Anyone have another
>>> suggestion?
>>
>> Logical and Binary replication?
>
> Unfortunately changeset extraction's output can be binary data...

I think "physical" and "logical" are fine and they seem to be well known
terminology. Oracle uses those words and I have also seen many places
use "physical backup" and "logical backup", for example on Barman's
homepage.

--
Andreas Karlsson

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Andreas Karlsson <andreas(at)proxel(dot)se>
Cc:	Andres Freund <andres(at)2ndquadrant(dot)com>, Thom Brown <thom(at)linux(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.3
Date:	2014-01-29 04:09:07
Message-ID:	CA+Tgmob3zbW_jWg1QNCMviL6vN=oNjKP0qFk3nzs6z3SUNv3GA@mail.gmail.com
Lists:	pgsql-hackers

On Tue, Jan 28, 2014 at 7:43 PM, Andreas Karlsson <andreas(at)proxel(dot)se> wrote:
> I think "physical" and "logical" are fine and they seem to be well known
> terminology. Oracle uses those words and I have also seen many places use
> "physical backup" and "logical backup", for example on Barman's homepage.

There's certainly something to be said for this.

Another idea I had this evening was "logical replication" and
"WAL-based replication", but that's a bit confusing too since logical
rep. is going to use WAL as an underlying technology.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Albe Laurenz <laurenz(dot)albe(at)wien(dot)gv(dot)at>
To:	"Andreas Karlsson EXTERN" <andreas(at)proxel(dot)se>, Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	Thom Brown <thom(at)linux(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.3
Date:	2014-01-29 09:25:30
Message-ID:	A737B7A37273E048B164557ADEF4A58B17C9F44C@ntex2010i.host.magwien.gv.at
Lists:	pgsql-hackers

Andreas Karlsson wrote:
> On 01/28/2014 10:56 PM, Andres Freund wrote:
>> On 2014-01-28 21:48:09 +0000, Thom Brown wrote:
>>> On 28 January 2014 21:37, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>>> On Tue, Jan 28, 2014 at 11:53 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>>> The point of Andres's patch set is to introduce a new technology
>>>> called logical decoding; that is, the ability to get a replication
>>>> stream that is based on changes to tuples rather than changes to
>>>> blocks. It could also be called logical replication. In these
>>>> patches, our existing replication is referred to as "physical"
>>>> replication, which sounds kind of funny to me. Anyone have another
>>>> suggestion?
>>>
>>> Logical and Binary replication?
>>
>> Unfortunately changeset extraction's output can be binary data...
>
> I think "physical" and "logical" are fine and they seem to be well known
> terminology. Oracle uses those words and I have also seen many places
> use "physical backup" and "logical backup", for example on Barman's
> homepage.

I think it also fits well with the well-known terms "physical [database]
design" and "logical design". Not that it is the same thing, but I
believe that every database person, when faced with the distinction
"physical" versus "logical", will conclude that the former refers to
data placement or block structure, while the latter refers to the
semantics of the data being stored.

Yours,
Laurenz Albe

From:	Christian Convey <christian(dot)convey(at)gmail(dot)com>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Changeset Extraction v7.3
Date:	2014-01-29 23:08:25
Message-ID:	CAPfS4Zx-_3hse5wHWrh8RCrtwxxn+BaP421Wp1R7HiKWBr6YaA@mail.gmail.com
Lists:	pgsql-hackers

It seems to me that the terms "physical", "logical", and "binary" are
always relative to the perspective of the component being worked on.

"Physical" often means "one level of abstraction below mine, and upon which
my work builds". "Logical" means "my work's level of abstraction". And
"Binary" means "data which I'm not going to pretend I know or care how to
interpret."

So I'd suggest "block" and "tuple", perhaps.

On Wed, Jan 29, 2014 at 4:25 AM, Albe Laurenz <laurenz(dot)albe(at)wien(dot)gv(dot)at> wrote:

> Andreas Karlsson wrote:
> > On 01/28/2014 10:56 PM, Andres Freund wrote:
> >> On 2014-01-28 21:48:09 +0000, Thom Brown wrote:
> >>> On 28 January 2014 21:37, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> >>>> On Tue, Jan 28, 2014 at 11:53 AM, Robert Haas <robertmhaas(at)gmail(dot)com>
> wrote:
> >>>> The point of Andres's patch set is to introduce a new technology
> >>>> called logical decoding; that is, the ability to get a replication
> >>>> stream that is based on changes to tuples rather than changes to
> >>>> blocks. It could also be called logical replication. In these
> >>>> patches, our existing replication is referred to as "physical"
> >>>> replication, which sounds kind of funny to me. Anyone have another
> >>>> suggestion?
> >>>
> >>> Logical and Binary replication?
> >>
> >> Unfortunately changeset extraction's output can be binary data...
> >
> > I think "physical" and "logical" are fine and they seem to be well known
> > terminology. Oracle uses those words and I have also seen many places
> > use "physical backup" and "logical backup", for example on Barman's
> > homepage.
>
> +1
>
> I think it also fits well with the well-known terms "physical [database]
> design" and "logical design". Not that it is the same thing, but I
> believe that every database person, when faced with the distinction
> "physical" versus "logical", will conclude that the former refers to
> data placement or block structure, while the latter refers to the
> semantics of the data being stored.
>
> Yours,
> Laurenz Albe

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Thom Brown <thom(at)linux(dot)com>
Cc:	Andres Freund <andres(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.3
Date:	2014-01-30 19:17:59
Message-ID:	CA+Tgmoar6BLb+7BQUYEmkmdFSE1f8khCZCDP-aCojOrESiNLBg@mail.gmail.com
Lists:	pgsql-hackers

On Tue, Jan 28, 2014 at 11:53 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> I've rebased it here and am hacking on it still.

OK. The attached patch is a combination of Andres' first two patches
with lots more changes from both Andres and myself. At this point,
I'm pretty happy with this, and propose to commit the attached
version, absent objection. All by itself, it provides a useful new
option for users, and it sets the stage for the subsequent logical
decoding patches as well. For those not following along closely,
here's the short version: if you choose to create a replication slot,
you can make the master retain the exact amount of WAL that the
standby still needs, rather than guessing what value to set for
wal_keep_segments; also, you can avoid hot standby conflicts even when
the connection between master and slave is interrupted (but the master
will bloat if the interruption is long, so watch out). For logical
decoding, this functionality is essential rather than nice-to-have.
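
The retention rule described above can be sketched as a toy model. This is only an illustration under stated assumptions, not code from the patch: `Slot`, `restart_lsn`, and `removable_segments` are hypothetical names chosen for the example, and WAL segments are assumed to be the default 16 MB.

```python
from dataclasses import dataclass

SEGMENT_SIZE = 16 * 1024 * 1024  # assumption: default 16 MB WAL segments


@dataclass
class Slot:
    """Hypothetical stand-in for a replication slot."""
    name: str
    restart_lsn: int  # oldest WAL byte position this consumer still needs


def removable_segments(existing_segments, slots, checkpoint_lsn):
    """Return the segment numbers that neither a slot nor the checkpoint needs."""
    # With no slots, only the checkpoint position limits removal.
    horizon = min([checkpoint_lsn] + [slot.restart_lsn for slot in slots])
    keep_from = horizon // SEGMENT_SIZE  # segment that contains the horizon
    return [seg for seg in existing_segments if seg < keep_from]
```

In this model a standby that has fallen behind simply moves the horizon back, so nothing has to be guessed in advance the way it is with wal_keep_segments. It also shows the bloat risk mentioned above: a slot whose `restart_lsn` never advances keeps every later segment on disk.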

Here's a summary of what we've changed since the last version Andres posted:

- Fairly extensive revisions to slot error handling, eliminating the
PG_TRY/PG_CATCH blocks that were present before (which I didn't
believe would work as designed) and cleaning up some corner cases to
eliminate unnecessary failures.
- Rigid enforcement of existing PG practices around spinlocks,
volatile pointers, and memory barriers.
- Modification of the on-disk format for slots so that we don't
serialize junk that properly only lives in memory.
- Removal of various references to decoding and logical replication
that properly belong in subsequent patches.
- Support for using slots via a new recovery.conf parameter,
primary_slotname, and a new pg_receivexlog option, --slot. I felt
this was important because, without this, you couldn't test that it
actually works without applying the remainder of Andres's patch set,
and even then you could mostly only test logical replication. With
it, this is an independently useful and testable feature.
- Exclude pg_replslot from base backups. This might need more thought
and documentation; people who use the filesystem method to perform
backups might need to be advised to remove this directory in some
cases also, or people who use pg_basebackup might want to keep it in
some cases (not sure).
- Lots of renaming to make the names more clear and consistent.
- Lots of bug fixes, minor tinkering, comment changes, and cleanups.
- Documentation.

For those wishing to see the blow-by-blow:

http://git.postgresql.org/gitweb/?p=users/rhaas/postgres.git;a=shortlog;h=refs/heads/slot2

In the future (i.e. post-9.4), I think we'll likely want to extend
this in a bunch of interesting ways. I strongly suspect people are
going to want to have an option for slots that pin the LSN but not
xmin, and I also think they're going to want slots that hold LSN or
xmin for a certain amount of time after a disconnect but then give up,
or maybe a certain number of segments/transaction IDs and then give
up. Notwithstanding those important improvements, I think this is a
very credible v1 of this functionality.

Thanks,

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment	Content-Type	Size
slot.patch	text/x-patch	105.0 KB

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Changeset Extraction v7.5
Date:	2014-02-07 19:35:35
Message-ID:	20140207193535.GD2792@awork2.anarazel.de
Lists:	pgsql-hackers

Hi,

attached you can find the next version of the patchset.

Changes:
* rebased on top of the committed slot patch (Thanks Robert!), which required
a fair amount of work
* adjusted naming of the SQL interface functions, to be consistent with ^
* several patches of the patchseries were merged
* Large amount of comment copy-editing
* Some code restructuring

There's one major thing I am not yet really happy with, which is how
decoding snapshots are integrated. I've gone back and forth over it
today, but I think I need a decent night of sleep to bring it to a
conclusion...

The patches are currently:

0001: wal_decoding: Introduce logical changeset extraction.
The meat of the functionality, including the SQL interface.

0002: wal_decoding: logical changeset extraction walsender interface
Walsender integration of changeset extraction, including support for
synchronous replication.

0003: wal_decoding: pg_recvlogical: Introduce pg_receivexlog equivalent for logical changes
Simple tool for receiving the changes over the walsender interface.

0004: wal_decoding: Documentation for replication slots and changeset extraction
...

0005: wal_decoding: Temporarily add logical decoding regression tests to everything
This is a patch I don't think should be finally applied, but which is
very helpful during debugging. It simply adds tests to the beginning/end of
the normal regression tests, decoding it in its entirety.

As always it's also pushed to
http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=summary
branch xlog-decoding-rebasing-remapping

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment	Content-Type	Size
0001-wal_decoding-Introduce-logical-changeset-extraction.patch	text/x-patch	743.3 KB
0002-wal_decoding-logical-changeset-extraction-walsender-.patch	text/x-patch	40.6 KB
0003-wal_decoding-pg_recvlogical-Introduce-pg_receivexlog.patch	text/x-patch	34.7 KB
0004-wal_decoding-Documentation-for-replication-slots-and.patch	text/x-patch	48.3 KB
0005-wal_decoding-Temporarily-add-logical-decoding-regres.patch	text/x-patch	5.5 KB

From:	Thom Brown <thom(at)linux(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.5
Date:	2014-02-07 20:58:14
Message-ID:	CAA-aLv7LtDbV6UYvSYd92PAyaYyML9QqHpDRjBoT+GyMy6xLTQ@mail.gmail.com
Lists:	pgsql-hackers

On 7 February 2014 19:35, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> 0004: wal_decoding: Documentation for replication slots and changeset extraction

The usage of pg_create_decoding_replication_slot does show the "(1 row)" line.

The output of "SELECT * FROM pg_replication_slots;" is out-of-date.

There appears to be a column named "slot_name" and "slottype". Could
one of these have or not have the underscore for consistency?

The example also shows output from pg_decoding_slot_get_changes after
inserting 2 rows, but when I run the same example, there are no rows
returned:

# BEGIN;
BEGIN

*# INSERT INTO data(data) VALUES('1');
INSERT 0 1

*# INSERT INTO data(data) VALUES('1');
INSERT 0 1

*# COMMIT;
COMMIT

# SELECT * FROM pg_decoding_slot_get_changes('regression_slot', 'now',
'include-xids', '0');
location | xid | data
----------+-----+------
(0 rows)

I inserted a single row outside of a transaction, and got the expected
output. Then I ran the above again, and got an output, but an
unexpected one:

SELECT * FROM pg_decoding_slot_get_changes('regression_slot', 'now',
'include-xids', '0');
location | xid | data
-----------+-----+-----------------------------------------------
0/16C8B90 | 769 | BEGIN
0/16C8D50 | 769 | table "data": INSERT: id[int4]:3 data[text]:1
0/16C8D50 | 769 | COMMIT
(3 rows)

And running the transaction with inserts again, there's no output from
that same function command. I always get an output from isolated
INSERT statements. I should point out that in my .psqlrc file I have
"\set ON_ERROR_ROLLBACK". If I use psql -X, this symptom no longer
occurs, so I think the automatic savepoints are interfering, and the
effect appears to be inconsistent.
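
To make the suspicion about automatic savepoints concrete, here is a rough Python model of what the server sees when psql runs with ON_ERROR_ROLLBACK enabled. It is a simplified sketch, not psql's actual logic: it assumes every statement succeeds, that the input contains no explicit savepoint commands, and that only BEGIN, COMMIT and ROLLBACK change the transaction state. The function name `with_on_error_rollback` is made up for the example; the savepoint name is the one psql uses.

```python
SAVEPOINT = "pg_psql_temporary_savepoint"  # savepoint name used by psql


def with_on_error_rollback(statements):
    """Approximate the statement stream sent with ON_ERROR_ROLLBACK on.

    Simplified model: all statements succeed and none is itself a
    SAVEPOINT / RELEASE / ROLLBACK TO command.
    """
    sent = []
    in_transaction = False
    for stmt in statements:
        keyword = stmt.strip().rstrip(";").split()[0].upper()
        if in_transaction:
            # Inside a transaction block each statement is guarded
            # by a savepoint, i.e. it runs in a subtransaction.
            sent.append(f"SAVEPOINT {SAVEPOINT}")
        sent.append(stmt)
        if keyword == "BEGIN":
            in_transaction = True
        elif keyword in ("COMMIT", "ROLLBACK"):
            in_transaction = False
        elif in_transaction:
            # The statement succeeded, so the savepoint is released.
            sent.append(f"RELEASE {SAVEPOINT}")
    return sent
```

Feeding it the BEGIN / INSERT / INSERT / COMMIT session from above shows each INSERT wrapped in its own subtransaction, which a plain `psql -X` session never creates. That matches the observation that isolated INSERT statements, which run outside a transaction block and get no savepoint, always produce output.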

--
Thom

From:	Thom Brown <thom(at)linux(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.5
Date:	2014-02-07 21:03:22
Message-ID:	CAA-aLv4hJyx6jh=5J2PuepDAqz=Th2ZbyrKXGddDC5vB3EdBwg@mail.gmail.com
Lists:	pgsql-hackers

On 7 February 2014 20:58, Thom Brown <thom(at)linux(dot)com> wrote:
> On 7 February 2014 19:35, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> 0004: wal_decoding: Documentation for replication slots and changeset extraction
>
> The usage of pg_create_decoding_replication_slot does show the "(1 row)" line.

I mean "doesn't show" of course. :)

--
Thom

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Thom Brown <thom(at)linux(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.5
Date:	2014-02-07 21:04:41
Message-ID:	26d5d8ff-9f4c-4349-9914-329d84496989@email.android.com
Lists:	pgsql-hackers

On February 7, 2014 9:58:14 PM CET, Thom Brown <thom(at)linux(dot)com> wrote:
>On 7 February 2014 19:35, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> 0004: wal_decoding: Documentation for replication slots and changeset
>extraction
>
>The usage of pg_create_decoding_replication_slot does show the "(1
>row)" line.
>
>The output of "SELECT * FROM pg_replication_slots;" is out-of-date.
>
>There appears to be a column named "slot_name" and "slottype". Could
>one of these have or not have the underscore for consistency?
>
>The example also shows output from pg_decoding_slot_get_changes after
>inserting 2 rows, but when I run the same example, there are no rows
>returned:
>
># BEGIN;
>BEGIN
>
>*# INSERT INTO data(data) VALUES('1');
>INSERT 0 1
>
>*# INSERT INTO data(data) VALUES('1');
>INSERT 0 1
>
>*# COMMIT;
>COMMIT
>
># SELECT * FROM pg_decoding_slot_get_changes('regression_slot', 'now',
>'include-xids', '0');
> location | xid | data
>----------+-----+------
>(0 rows)
>
>
>I inserted a single row outside of a transaction, and got the expected
>output. Then I ran the above again, and got an output, but an
>unexpected one:
>
>SELECT * FROM pg_decoding_slot_get_changes('regression_slot', 'now',
>'include-xids', '0');
> location | xid | data
>-----------+-----+-----------------------------------------------
> 0/16C8B90 | 769 | BEGIN
> 0/16C8D50 | 769 | table "data": INSERT: id[int4]:3 data[text]:1
> 0/16C8D50 | 769 | COMMIT
>(3 rows)
>
>And running the transaction with inserts again, there's no output from
>that same function command. I always get an output from isolated
>INSERT statements. I should point out that in my .psqlrc file I have
>"\set ON_ERROR_ROLLBACK". If I use psql -X, this symptom no longer
>occurs, so I think the automatic savepoints are interfering, and the
>effect appears to be inconsistent.

More complete answer later, but any chance you're using synchronous_commit = off?

Thanks for looking,

Andres

--
Please excuse brevity and formatting - I am writing this on my mobile phone.

Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Thom Brown <thom(at)linux(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.5
Date:	2014-02-07 21:09:05
Message-ID:	CAA-aLv4eGHkFRo_O4HsqgJeprQU4M1vye_j4d49OtmEH4xG0kg@mail.gmail.com
Lists:	pgsql-hackers

On 7 February 2014 21:04, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> On February 7, 2014 9:58:14 PM CET, Thom Brown <thom(at)linux(dot)com> wrote:
>>On 7 February 2014 19:35, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>>> 0004: wal_decoding: Documentation for replication slots and changeset
>>extraction
>>
>>The usage of pg_create_decoding_replication_slot does show the "(1
>>row)" line.
>>
>>The output of "SELECT * FROM pg_replication_slots;" is out-of-date.
>>
>>There appears to be a column named "slot_name" and "slottype". Could
>>one of these have or not have the underscore for consistency?
>>
>>The example also shows output from pg_decoding_slot_get_changes after
>>inserting 2 rows, but when I run the same example, there are no rows
>>returned:
>>
>># BEGIN;
>>BEGIN
>>
>>*# INSERT INTO data(data) VALUES('1');
>>INSERT 0 1
>>
>>*# INSERT INTO data(data) VALUES('1');
>>INSERT 0 1
>>
>>*# COMMIT;
>>COMMIT
>>
>># SELECT * FROM pg_decoding_slot_get_changes('regression_slot', 'now',
>>'include-xids', '0');
>> location | xid | data
>>----------+-----+------
>>(0 rows)
>>
>>
>>I inserted a single row outside of a transaction, and got the expected
>>output. Then I ran the above again, and got an output, but an
>>unexpected one:
>>
>>SELECT * FROM pg_decoding_slot_get_changes('regression_slot', 'now',
>>'include-xids', '0');
>> location | xid | data
>>-----------+-----+-----------------------------------------------
>> 0/16C8B90 | 769 | BEGIN
>> 0/16C8D50 | 769 | table "data": INSERT: id[int4]:3 data[text]:1
>> 0/16C8D50 | 769 | COMMIT
>>(3 rows)
>>
>>And running the transaction with inserts again, there's no output from
>>that same function command. I always get an output from isolated
>>INSERT statements. I should point out that in my .psqlrc file I have
>>"\set ON_ERROR_ROLLBACK". If I use psql -X, this symptom no longer
>>occurs, so I think the automatic savepoints are interfering, and the
>>effect appears to be inconsistent.
>
> More complete answer later, but any chance you're using synchronous_commit = off?

No:

# show synchronous_commit ;
synchronous_commit
--------------------
on
(1 row)

My custom config is:

wal_level = 'logical'
max_replication_slots = '1'
shared_buffers = 3900MB
temp_buffers = 16MB
work_mem = 16MB
maintenance_work_mem = 256MB
checkpoint_segments = 32
random_page_cost = 1.1
effective_cache_size = 12GB
logging_collector = on
log_line_prefix = '%t [%p]: [%l-1] user=%u,db=%d,client=%h '

--
Thom

From:	"Erik Rijkers" <er(at)xs4all(dot)nl>
To:	"Thom Brown" <thom(at)linux(dot)com>
Cc:	"Andres Freund" <andres(at)2ndquadrant(dot)com>, "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.5
Date:	2014-02-07 21:28:02
Message-ID:	af3e2b70ee274187e6650dd4cd77c7a1.squirrel@webmail.xs4all.nl
Lists:	pgsql-hackers

On Fri, February 7, 2014 22:09, Thom Brown wrote:

>>The example also shows output from pg_decoding_slot_get_changes after
>>inserting 2 rows, but when I run the same example, there are no rows

FWIW, works for me:

testdb=# SELECT * FROM pg_decoding_slot_get_changes('regression_slot', 'now', 'include-xids', '0');
location | xid | data
----------+-----+------
(0 rows)

testdb=# BEGIN; INSERT INTO data(data) VALUES('1'); INSERT INTO data(data) VALUES('1'); COMMIT;
testdb=# SELECT * FROM pg_decoding_slot_get_changes('regression_slot', 'now', 'include-xids', '0');
location | xid | data
-----------+------+------------------------------------------------
0/2B81ED0 | 1973 | BEGIN
0/2B823A8 | 1973 | table "data": INSERT: id[int4]:14 data[text]:1
0/2B823A8 | 1973 | table "data": INSERT: id[int4]:15 data[text]:1
0/2B823A8 | 1973 | COMMIT
(4 rows)

testdb=# SELECT * FROM pg_decoding_slot_get_changes('regression_slot', 'now', 'include-xids', '0');
location | xid | data
----------+-----+------
(0 rows)

( output of "SELECT * FROM pg_replication_slots;" is, indeed, out-of-date.)
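
For anyone comparing the location values in output like the above: an LSN is printed as two hexadecimal numbers separated by a slash, the high and low 32 bits of a 64-bit WAL position. A small Python helper, with `parse_lsn` and `format_lsn` as names invented for this example, makes them easy to compare or subtract:

```python
def parse_lsn(text):
    """Convert an LSN such as '0/2B81ED0' to a 64-bit integer position."""
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def format_lsn(value):
    """Inverse of parse_lsn: print the high and low 32 bits in hex."""
    return f"{value >> 32:X}/{value & 0xFFFFFFFF:X}"
```

For example, `parse_lsn("0/2B823A8") - parse_lsn("0/2B81ED0")` gives 1240, the distance in bytes between the BEGIN and COMMIT locations reported in the four-row result.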

From:	Thom Brown <thom(at)linux(dot)com>
To:	Erik Rijkers <er(at)xs4all(dot)nl>
Cc:	Andres Freund <andres(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.5
Date:	2014-02-07 21:29:35
Message-ID:	CAA-aLv6MV4Ypc946-5K1KB1bWp2L9EBPP_EWRE7F6tTG7dMjWw@mail.gmail.com
Lists:	pgsql-hackers

On 7 February 2014 21:28, Erik Rijkers <er(at)xs4all(dot)nl> wrote:
> On Fri, February 7, 2014 22:09, Thom Brown wrote:
>
>>>The example also shows output from pg_decoding_slot_get_changes after
>>>inserting 2 rows, but when I run the same example, there are no rows
>
> FWIW, works for me:

Can you confirm you're running it with ON_ERROR_ROLLBACK set?

--
Thom

From:	"Erik Rijkers" <er(at)xs4all(dot)nl>
To:	"Thom Brown" <thom(at)linux(dot)com>
Cc:	"Andres Freund" <andres(at)2ndquadrant(dot)com>, "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.5
Date:	2014-02-07 21:32:45
Message-ID:	b075be98f07ab3ff114e44a885573794.squirrel@webmail.xs4all.nl
Lists:	pgsql-hackers

On Fri, February 7, 2014 22:29, Thom Brown wrote:
> On 7 February 2014 21:28, Erik Rijkers <er(at)xs4all(dot)nl> wrote:
>> On Fri, February 7, 2014 22:09, Thom Brown wrote:
>>
>>>>The example also shows output from pg_decoding_slot_get_changes after
>>>>inserting 2 rows, but when I run the same example, there are no rows
>>
>> FWIW, works for me:
>
> Can you confirm you're running it with ON_ERROR_ROLLBACK set?
>

Ah, no, I missed that. You're right: with that, behaviour is the same here as you described.

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Thom Brown <thom(at)linux(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.5
Date:	2014-02-07 23:43:11
Message-ID:	20140207234311.GF2792@awork2.anarazel.de
Lists:	pgsql-hackers

On 2014-02-07 20:58:14 +0000, Thom Brown wrote:
> On 7 February 2014 19:35, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> > 0004: wal_decoding: Documentation for replication slots and changeset extraction
>
> The usage of pg_create_decoding_replication_slot does show the "(1 row)" line.
>
> The output of "SELECT * FROM pg_replication_slots;" is out-of-date.

Thanks, refreshed.

> There appears to be a column named "slot_name" and "slottype". Could
> one of these have or not have the underscore for consistency?

That's luckily already fixed...

> The example also shows output from pg_decoding_slot_get_changes after
> inserting 2 rows, but when I run the same example, there are no rows
> returned:

> And running the transaction with inserts again, there's no output from
> that same function command. I always get an output from isolated
> INSERT statements. I should point out that in my .psqlrc file I have
> "\set ON_ERROR_ROLLBACK". If I use psql -X, this symptom no longer
> occurs, so I think the automatic savepoints are interfering, and the
> effect appears to be inconsistent.

Thanks, that's a bug indeed. I have experimentally fixed the bug, not
sure whether I like the fix yet, or not.

I've already fixed two issues caused by the rebase onto
858ec11858a914d4c380971985709b6d6b7dd6fc.

Is pushing to git sufficient for you, or shall I rebase and resend the
series?

Thanks!

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Thom Brown <thom(at)linux(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.5
Date:	2014-02-08 00:16:07
Message-ID:	CAA-aLv6Wg5PdJeyiU-ngTvr3V1_QCb8z5JeKyfuVuW1P1WCnqQ@mail.gmail.com
Lists:	pgsql-hackers

On 7 February 2014 23:43, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> On 2014-02-07 20:58:14 +0000, Thom Brown wrote:
>> On 7 February 2014 19:35, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> > 0004: wal_decoding: Documentation for replication slots and changeset extraction
>>
>> The usage of pg_create_decoding_replication_slot does show the "(1 row)" line.
>>
>> The output of "SELECT * FROM pg_replication_slots;" is out-of-date.
>
> Thanks, refreshed.
>
>> There appears to be a column named "slot_name" and "slottype". Could
>> one of these have or not have the underscore for consistency?
>
> That's luckily already fixed...
>
>> The example also shows output from pg_decoding_slot_get_changes after
>> inserting 2 rows, but when I run the same example, there are no rows
>> returned:
>
>> And running the transaction with inserts again, there's no output from
>> that same function command. I always get an output from isolated
>> INSERT statements. I should point out that in my .psqlrc file I have
>> "\set ON_ERROR_ROLLBACK". If I use psql -X, this symptom no longer
>> occurs, so I think the automatic savepoints are interfering, and the
>> effect appears to be inconsistent.
>
> Thanks, that's a bug indeed. I have experimentally fixed the bug, not
> sure whether I like the fix yet, or not.
>
> I've already fixed two issues caused by the rebase onto
> 858ec11858a914d4c380971985709b6d6b7dd6fc.
>
> Is pushing to git sufficient for you, or shall I rebase and resend the
> series?

Sure, push it to git, I'll add your remote repo and checkout that branch.

--
Thom

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Thom Brown <thom(at)linux(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.5
Date:	2014-02-08 17:52:56
Message-ID:	20140208175256.GA10692@awork2.anarazel.de
Lists:	pgsql-hackers

Hi,

Only got to this now, was a bit too tired and needed to catch up on some
real-world stuff...

On 2014-02-08 00:16:07 +0000, Thom Brown wrote:
> On 7 February 2014 23:43, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> > Thanks, that's a bug indeed. I have experimentally fixed the bug, not
> > sure whether I like the fix yet, or not.
> >
> > I've already fixed two issues caused by the rebase onto
> > 858ec11858a914d4c380971985709b6d6b7dd6fc.
> >
> > Is pushing to git sufficient for you, or shall I rebase and resend the
> > series?
>
> Sure, push it to git, I'll add your remote repo and checkout that branch.

Ok, I roughly went with my initial plan to fix this and I've added (and
fixed) a regression test for this.

Pushed this and some other improvements to
http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=summary
branch xlog-decoding-rebasing-remapping

Thanks!

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Thom Brown <thom(at)linux(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.5
Date:	2014-02-08 19:35:32
Message-ID:	CAA-aLv7Y9dKgSzzCrUjZ3YqcidNwOACWMWtV6TjpWLouX8LRuQ@mail.gmail.com
Lists:	pgsql-hackers

On 8 February 2014 17:52, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> Hi,
>
> Only got to this now, was a bit too tired and needed to catch up on some
> real-world stuff...
>
> On 2014-02-08 00:16:07 +0000, Thom Brown wrote:
>> On 7 February 2014 23:43, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> > Thanks, that's a bug indeed. I have experimentally fixed the bug, not
>> > sure whether I like the fix yet, or not.
>> >
>> > I've already fixed two issues caused by the rebase onto
>> > 858ec11858a914d4c380971985709b6d6b7dd6fc.
>> >
>> > Is pushing to git sufficient for you, or shall I rebase and resend the
>> > series?
>>
>> Sure, push it to git, I'll add your remote repo and checkout that branch.
>
> Ok, I roughly went with my initial plan to fix this and I've added (and
> fixed) a regression for this.
>
> Pushed this and some other improvements to
> http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=summary
> branch xlog-decoding-rebasing-remapping

This appears to be working now. Thanks.

I'll continue to play around with the feature.

--
Thom

From:	Thom Brown <thom(at)linux(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.5
Date:	2014-02-08 21:07:03
Message-ID:	CAA-aLv5JBHCJfqg71bA5unFf5xZQdE6z0godg+k1KHLktxO1Hg@mail.gmail.com
Lists:	pgsql-hackers

On 8 February 2014 19:35, Thom Brown <thom(at)linux(dot)com> wrote:
> On 8 February 2014 17:52, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> Hi,
>>
>> Only got to this now, was a bit too tired and needed to catch up on some
>> real-world stuff...
>>
>> On 2014-02-08 00:16:07 +0000, Thom Brown wrote:
>>> On 7 February 2014 23:43, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>>> > Thanks, that's a bug indeed. I have experimentally fixed the bug, not
>>> > sure whether I like the fix yet, or not.
>>> >
>>> > I've already fixed two issues caused by the rebase onto
>>> > 858ec11858a914d4c380971985709b6d6b7dd6fc.
>>> >
>>> > Is pushing to git sufficient for you, or shall I rebase and resend the
>>> > series?
>>>
>>> Sure, push it to git, I'll add your remote repo and checkout that branch.
>>
>> Ok, I roughly went with my initial plan to fix this and I've added (and
>> fixed) a regression for this.
>>
>> Pushed this and some other improvements to
>> http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=summary
>> branch xlog-decoding-rebasing-remapping
>
> This appears to be working now. Thanks.
>
> I'll continue to play around with the feature.

Next issue. Firstly, an out-of-date example:

doc/src/sgml/changesetextraction.sgml

pg_recvlogical --slot test --init -d testdb

There's no option --init. I think this is supposed to be --create.

But also:

$ pg_recvlogical --slot test --create -d testdb
pg_recvlogical: could not send replication command "CREATE_REPLICATION_SLOT "test" LOGICAL "test_decoding"": extraneous data in "T" message

But this seems to have created it anyway:

If I drop it and run the same command, the same message is emitted.

--
Thom

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Thom Brown <thom(at)linux(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.5
Date:	2014-02-08 21:25:03
Message-ID:	20140208212503.GD10692@awork2.anarazel.de
Lists:	pgsql-hackers

On 2014-02-08 21:07:03 +0000, Thom Brown wrote:
> > I'll continue to play around with the feature.
>
> Next issue. Firstly, an out-of-date example:
>
> doc/src/sgml/changesetextraction.sgml
>
> pg_recvlogical --slot test --init -d testdb
>
> There's no option --init. I think this is supposed to be --create.

Fixed. It used to be --init, but that has changed. Thanks.

> $ pg_recvlogical --slot test --create -d testdb
> pg_recvlogical: could not send replication command
> "CREATE_REPLICATION_SLOT "test" LOGICAL "test_decoding"": extraneous
> data in "T" message

Gah. Another merge issue. Fixed. We really need to have infrastructure
for testing binaries...

Thanks,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Thom Brown <thom(at)linux(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.5
Date:	2014-02-08 22:26:59
Message-ID:	CAA-aLv434DwA699D+E-9NDDyFE+V7i0hQ08O-924+kcDnpPp1w@mail.gmail.com
Lists:	pgsql-hackers

On 8 February 2014 21:25, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> On 2014-02-08 21:07:03 +0000, Thom Brown wrote:
>> > I'll continue to play around with the feature.
>>
>> Next issue. Firstly, an out-of-date example:
>>
>> doc/src/sgml/changesetextraction.sgml
>>
>> pg_recvlogical --slot test --init -d testdb
>>
>> There's no option --init. I think this is supposed to be --create.
>
> Fixed. It used to be --init, but that has changed. Thanks.
>
>> $ pg_recvlogical --slot test --create -d testdb
>> pg_recvlogical: could not send replication command
>> "CREATE_REPLICATION_SLOT "test" LOGICAL "test_decoding"": extraneous
>> data in "T" message
>
> Gah. Another merge issue. Fixed. We really need to have infrastructure
> for testing binaries...

Thanks, no issue with that now.

Got a question about ranges and arrays usage with timestamps... why
are quotes added to these?

timestamptz (no quotes with input or output):
table "a": INSERT: moo[timestamptz]:2014-02-08 22:09:33+00

tstzrange (no quotes with input, but quotes with output):
table "b": INSERT: moo[tstzrange]:["2014-02-08 13:45:22+00","2014-02-08 14:45:42+00")

timestamptz[] (no quotes with input, but quotes with output):
table "c": INSERT: moo[_timestamptz]:{"2010-01-01 13:45:22+00","2010-01-03 14:45:42+00"}

tstzrange[] (one set of quotes with input, two sets of quotes with
output, one set of which are escaped):
table "d": INSERT: moo[_tstzrange]:{"(\"2014-02-08 13:45:22+00\",\"2014-02-08 13:45:42+00\"]","[\"2014-02-07 10:12:19+00\",\"2014-02-07 13:51:16+00\"]"}

--
Thom

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Thom Brown <thom(at)linux(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.5
Date:	2014-02-08 22:47:21
Message-ID:	20140208224721.GF10692@awork2.anarazel.de
Lists:	pgsql-hackers

Hi Thom,

On 2014-02-08 22:26:59 +0000, Thom Brown wrote:
> Got a question about ranges and arrays usage with timestamps... why
> are quotes added to these?
>
> timestamptz (no quotes with input or output):
> table "a": INSERT: moo[timestamptz]:2014-02-08 22:09:33+00
>
> tstzrange (no quotes with input, but quotes with output):
> table "b": INSERT: moo[tstzrange]:["2014-02-08
> 13:45:22+00","2014-02-08 14:45:42+00")
>
> timestamptz[] (no quotes with input, but quotes with output):
> table "c": INSERT: moo[_timestamptz]:{"2010-01-01
> 13:45:22+00","2010-01-03 14:45:42+00"}
>
> tstzrange[] (one set of quotes with input, two sets of quotes with
> output, one set of which are escaped):
> table "d": INSERT: moo[_tstzrange]:{"(\"2014-02-08
> 13:45:22+00\",\"2014-02-08 13:45:42+00\"]","[\"2014-02-07
> 10:12:19+00\",\"2014-02-07 13:51:16+00\"]"}

The test_decoding output plugin just uses the default text output
functions for all types, other plugins could do differently. I.e. in all
these cases a SELECT, COPY, pg_dump will also include those quotes.
E.g.
postgres=# SELECT ARRAY[tstzrange(NOW(), NOW() + interval '1 day')];
will format its output similarly:
{"[\"2014-02-08 23:44:19.82007+01\",\"2014-02-09 23:44:19.82007+01\")"}
(1 row)

Does that answer make sense?
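
To make the escaping rule concrete, here is a minimal sketch in C of what happens when a value that already contains double quotes (such as a range's text form) becomes an array element. quote_element() is a hypothetical helper written for this illustration, not a PostgreSQL function. It always quotes its input, whereas the real array output code quotes an element only when its content requires it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Illustrative only: quote_element() is a hypothetical helper, not part of
 * PostgreSQL. It wraps src in double quotes and backslash-escapes embedded
 * double quotes and backslashes, which is the effect visible in the
 * tstzrange[] output above.
 *
 * Returns the number of characters written (excluding the terminating NUL),
 * or -1 if dst is too small.
 */
int
quote_element(const char *src, char *dst, size_t dstlen)
{
    size_t n = 0;

    /* smallest possible result is two quotes plus the NUL terminator */
    if (dstlen < 3)
        return -1;

    dst[n++] = '"';
    for (; *src != '\0'; src++)
    {
        /* room for an escape, the character, the closing quote and NUL */
        if (n + 4 > dstlen)
            return -1;
        if (*src == '"' || *src == '\\')
            dst[n++] = '\\';
        dst[n++] = *src;
    }
    dst[n++] = '"';
    dst[n] = '\0';
    return (int) n;
}
```

Feeding it the text form of a range, which carries its own quotes around the bounds, yields the doubly quoted form with one escaped set, as in the SELECT output above.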

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Thom Brown <thom(at)linux(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.5
Date:	2014-02-08 22:58:35
Message-ID:	CAA-aLv7RHkEr8+mjD=biiTwCU9Hg-24R764cj8ZwAb2t0VYnCA@mail.gmail.com
Lists:	pgsql-hackers

On 8 February 2014 22:47, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> Hi Thom,
>
> On 2014-02-08 22:26:59 +0000, Thom Brown wrote:
>> Got a question about ranges and arrays usage with timestamps... why
>> are quotes added to these?
>>
>> timestamptz (no quotes with input or output):
>> table "a": INSERT: moo[timestamptz]:2014-02-08 22:09:33+00
>>
>> tstzrange (no quotes with input, but quotes with output):
>> table "b": INSERT: moo[tstzrange]:["2014-02-08
>> 13:45:22+00","2014-02-08 14:45:42+00")
>>
>> timestamptz[] (no quotes with input, but quotes with output):
>> table "c": INSERT: moo[_timestamptz]:{"2010-01-01
>> 13:45:22+00","2010-01-03 14:45:42+00"}
>>
>> tstzrange[] (one set of quotes with input, two sets of quotes with
>> output, one set of which are escaped):
>> table "d": INSERT: moo[_tstzrange]:{"(\"2014-02-08
>> 13:45:22+00\",\"2014-02-08 13:45:42+00\"]","[\"2014-02-07
>> 10:12:19+00\",\"2014-02-07 13:51:16+00\"]"}
>
> The test_decoding output plugin just uses the default text output
> functions for all types, other plugins could do differently. I.e. in all
> these cases a SELECT, COPY, pg_dump will also include those quotes.
> E.g.
> postgres=# SELECT ARRAY[tstzrange(NOW(), NOW() + interval '1 day')];
> will format its output similarly:
> {"[\"2014-02-08 23:44:19.82007+01\",\"2014-02-09 23:44:19.82007+01\")"}
> (1 row)
>
> Does that answer make sense?

Ah, okay. Thanks.

Another question: in order for logical decoding/replication to be
useful, presumably one would need a primary key on every table? It's
just I haven't seen this mentioned on the changeset extraction page in
the docs.

--
Thom

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Thom Brown <thom(at)linux(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.5
Date:	2014-02-08 23:08:26
Message-ID:	20140208230826.GH10692@awork2.anarazel.de
Lists:	pgsql-hackers

On 2014-02-08 22:58:35 +0000, Thom Brown wrote:
> Another question: in order for logical decoding/replication to be
> useful, presumably one would need a primary key on every table? It's
> just I haven't seen this mentioned on the changeset extraction page in
> the docs.

Hm, that's a good point. 07cacba983ef79be4a84fcd0e0ca3b5fcb85dd65 added
configurability for that, but there at least should be a link to
http://www.postgresql.org/docs/devel/static/sql-altertable.html with
some additional words.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Thom Brown <thom(at)linux(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.5
Date:	2014-02-09 00:49:31
Message-ID:	CAA-aLv4RL_CudqpW9op8Quu5D6JjnuTTYC2nW74Z4d6=Z2SAiQ@mail.gmail.com
Lists:	pgsql-hackers

On 8 February 2014 23:08, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> On 2014-02-08 22:58:35 +0000, Thom Brown wrote:
>> Another question: in order for logical decoding/replication to be
>> useful, presumably one would need a primary key on every table? It's
>> just I haven't seen this mentioned on the changeset extraction page in
>> the docs.
>
> Hm, that's a good point. 07cacba983ef79be4a84fcd0e0ca3b5fcb85dd65 added
> configurability for that, but there at least should be a link to
> http://www.postgresql.org/docs/devel/static/sql-altertable.html with
> some additional words.

# CREATE TABLE test (id serial primary key, val int);
CREATE TABLE

# INSERT INTO test (val) SELECT generate_series(1,3);
INSERT 0 3

# ALTER TABLE test ADD COLUMN a decimal DEFAULT 2.22;
ALTER TABLE

# ALTER TABLE test ADD COLUMN b json DEFAULT '{"a":[1,2,3],"b":[4,5,6]}';
ALTER TABLE

The output generated by those last 2 statements is:

BEGIN 891
table "pg_temp_16552": INSERT: id[int4]:1 val[int4]:1 a[numeric]:2.22
table "pg_temp_16552": INSERT: id[int4]:2 val[int4]:2 a[numeric]:2.22
table "pg_temp_16552": INSERT: id[int4]:3 val[int4]:3 a[numeric]:2.22
COMMIT 891
BEGIN 892
table "pg_temp_16552": INSERT: id[int4]:1 val[int4]:1 a[numeric]:2.22 b[json]:{"a":[1,2,3],"b":[4,5,6]}
table "pg_temp_16552": INSERT: id[int4]:2 val[int4]:2 a[numeric]:2.22 b[json]:{"a":[1,2,3],"b":[4,5,6]}
table "pg_temp_16552": INSERT: id[int4]:3 val[int4]:3 a[numeric]:2.22 b[json]:{"a":[1,2,3],"b":[4,5,6]}
COMMIT 892

This is showing inserts into the temp table as part of the operation.
Is that sufficient?

--
Thom

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Thom Brown <thom(at)linux(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.5
Date:	2014-02-09 01:06:12
Message-ID:	20140209010611.GA16141@awork2.anarazel.de
Lists:	pgsql-hackers

On 2014-02-09 00:49:31 +0000, Thom Brown wrote:
> # ALTER TABLE test ADD COLUMN a decimal DEFAULT 2.22;
> ALTER TABLE
>
> # ALTER TABLE test ADD COLUMN b json DEFAULT '{"a":[1,2,3],"b":[4,5,6]}';
> ALTER TABLE
>
> The output generated by those last 2 statements is:
>
> BEGIN 891
> table "pg_temp_16552": INSERT: id[int4]:1 val[int4]:1 a[numeric]:2.22
> table "pg_temp_16552": INSERT: id[int4]:2 val[int4]:2 a[numeric]:2.22
> table "pg_temp_16552": INSERT: id[int4]:3 val[int4]:3 a[numeric]:2.22
> COMMIT 891
> BEGIN 892
> table "pg_temp_16552": INSERT: id[int4]:1 val[int4]:1 a[numeric]:2.22
> b[json]:{"a":[1,2,3],"b":[4,5,6]}
> table "pg_temp_16552": INSERT: id[int4]:2 val[int4]:2 a[numeric]:2.22
> b[json]:{"a":[1,2,3],"b":[4,5,6]}
> table "pg_temp_16552": INSERT: id[int4]:3 val[int4]:3 a[numeric]:2.22
> b[json]:{"a":[1,2,3],"b":[4,5,6]}
> COMMIT 892
>
> This is showing inserts into the temp table as part of the operation.
> Is that sufficient?

I think it's a good thing for now. We don't have support for DDL
replication so it's not yet that interesting, but having the new values
allows us to safely handle things like DEFAULTs that produce
nondeterministic data.
What do you think?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Thom Brown <thom(at)linux(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.5
Date:	2014-02-09 01:13:17
Message-ID:	CAA-aLv76MnE=Z34DvJe7fr=-se9+GQ6DxEPE1C5W5bqc4=35HA@mail.gmail.com
Lists:	pgsql-hackers

On 9 February 2014 01:06, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> On 2014-02-09 00:49:31 +0000, Thom Brown wrote:
>> # ALTER TABLE test ADD COLUMN a decimal DEFAULT 2.22;
>> ALTER TABLE
>>
>> # ALTER TABLE test ADD COLUMN b json DEFAULT '{"a":[1,2,3],"b":[4,5,6]}';
>> ALTER TABLE
>>
>> The output generated by those last 2 statements is:
>>
>> BEGIN 891
>> table "pg_temp_16552": INSERT: id[int4]:1 val[int4]:1 a[numeric]:2.22
>> table "pg_temp_16552": INSERT: id[int4]:2 val[int4]:2 a[numeric]:2.22
>> table "pg_temp_16552": INSERT: id[int4]:3 val[int4]:3 a[numeric]:2.22
>> COMMIT 891
>> BEGIN 892
>> table "pg_temp_16552": INSERT: id[int4]:1 val[int4]:1 a[numeric]:2.22
>> b[json]:{"a":[1,2,3],"b":[4,5,6]}
>> table "pg_temp_16552": INSERT: id[int4]:2 val[int4]:2 a[numeric]:2.22
>> b[json]:{"a":[1,2,3],"b":[4,5,6]}
>> table "pg_temp_16552": INSERT: id[int4]:3 val[int4]:3 a[numeric]:2.22
>> b[json]:{"a":[1,2,3],"b":[4,5,6]}
>> COMMIT 892
>>
>> This is showing inserts into the temp table as part of the operation.
>> Is that sufficient?
>
> I think it's a good thing for now. We don't have support for DDL
> replication so it's not yet that interesting, but having the new values
> allows to safely handle things like DEFAULTs that produce
> nondeterministic data.
> What do you think?

Okay, I'm just checking. If it's expected behaviour to you, it's good
enough for me.

--
Thom

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	"pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.5
Date:	2014-02-11 16:22:24
Message-ID:	CA+TgmoYTc0gHm_ca=wDWqc8TLY1VQ9vx6jxPSSE6=OifAtW=cQ@mail.gmail.com
Lists:	pgsql-hackers

On Fri, Feb 7, 2014 at 2:35 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> attached you can find the next version of the patchset.

As usual, I'm going to be reviewing patch 1. The definition of "patch
1" has changed quite a few times over the past year, but that's
usually the one I'm reviewing.

+ * contents of records in here xexcept turning them into a more usable

Typo.

+ /*
+ * XXX: There doesn't seem to be a usecase for decoding
+ * HEAP_NEWPAGE's. Its only used in various indexam's and CLUSTER,
+ * neither of which should be relevant for the logical
+ * changestream.
+ */

There's a level of uncertainty here that doesn't seem consistent with
calling this a finished patch. It's also not a complete list of
places where log_newpage() is called, but frankly I don't think that
should be the aim of this comment. The only relevant question is
whether we ever use XLOG_HEAP_NEWPAGE to log heap changes that are
relevant to logical replication. I think we don't.

+ /* FIXME: skip if wrong db? */

It's time to fish or cut bait.

+ /*
+ * XXX: As a future feature, we could replay the transaction and
+ * prepare it as well, allowing for 2PC via logical decoding.
+ */

Let's try to avoid using XXX (or FIXME) for things that really mean TODO.

I think this comment deserves to be expanded a bit, too. Maybe
something like: "Right now, logical decoding ignores PREPARE
TRANSACTION and simply decodes the subsequent COMMIT TRANSACTION or
ROLLBACK TRANSACTION just as it would a regular COMMIT or ROLLBACK.
In the future, we might want to change this. Decoding PREPARE might
enable future code to prepare each locally prepared transaction on the
remote side before doing a COMMIT TRANSACTION locally, allowing for
logical synchronous replication."

+ /*
+ * If the record wasn't part of a transaction, it will not have
+ * caused invalidations and thus isn't important when building
+ * snapshots. If it was part of a transaction, that transaction
+ * just performed DDL because those are the only codepaths using
+ * inplace updates.
+ */

Under what circumstances do we issue in-place updates not associated
with a transaction? And under what circumstances do we issue in-place
updates that ARE associated with a transaction?

+ * XXX: At some point we might want to execute the transaction's

The XXX again seems needless; the comment is fine as it stands.

+ /*
+ * Abort all transactions that we keep track of that are older
+ * than ->oldestRunningXid. This is the most convenient spot
I think writing ->membername is poor commenting style. Just leave out
the arrow, or write "the WAL record's oldestRunningXid."

+/*
+ * Get the data from the various forms of commit records and pass it
+ * on to snapbuild.c and reorderbuffer.c
+ */

This is a lousy comment. I suggest something like: "Currently, each
transaction is decoded only once it commits, so the arrival of a commit
record means that we can now decode the changes made by this toplevel
transaction and all of its committed subtransactions, unless we have
to skip it because the replication system isn't fully initialized yet.
Whether decoding the transaction or not, we must take note of any
invalidations it issues, as those will affect the snapshot used for
decoding of *other* transactions."

+/*
+ * Get the data from the various forms of abort records and pass it on to
+ * snapbuild.c and reorderbuffer.c
+ */

Suggest: "When a transaction abort is detected, we throw away any data
we've stashed away for possible future decoding of that transaction.
Knowledge of the abort may also help us establish our initial snapshot
when logical decoding is first initiated."
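
The two suggested comments describe a buffer-until-commit scheme. A toy sketch in C of that idea follows: changes are stashed per xid, emitted in order at commit, and discarded at abort. All names here (ToyBuffer, toy_add_change, and so on) are hypothetical and have nothing to do with the actual reorderbuffer.c code; subtransactions, invalidations and snapshots are deliberately left out.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_XACTS   8   /* arbitrary toy limits, not from the patch */
#define TOY_MAX_CHANGES 16

typedef struct ToyTxn
{
    unsigned xid;           /* 0 marks an unused slot */
    int      nchanges;
    int      changes[TOY_MAX_CHANGES];
} ToyTxn;

typedef struct ToyBuffer
{
    ToyTxn txns[TOY_MAX_XACTS];
    int    emitted[TOY_MAX_XACTS * TOY_MAX_CHANGES];
    int    nemitted;
} ToyBuffer;

/* Find the slot for xid; optionally claim a free slot if there is none. */
static ToyTxn *
toy_lookup(ToyBuffer *rb, unsigned xid, int create)
{
    ToyTxn *free_slot = NULL;

    assert(xid != 0);
    for (int i = 0; i < TOY_MAX_XACTS; i++)
    {
        if (rb->txns[i].xid == xid)
            return &rb->txns[i];
        if (rb->txns[i].xid == 0 && free_slot == NULL)
            free_slot = &rb->txns[i];
    }
    if (!create || free_slot == NULL)
        return NULL;
    free_slot->xid = xid;
    free_slot->nchanges = 0;
    return free_slot;
}

/* Stash a change for later; returns 0 on success, -1 if a toy limit is hit. */
int
toy_add_change(ToyBuffer *rb, unsigned xid, int change)
{
    ToyTxn *txn = toy_lookup(rb, xid, 1);

    if (txn == NULL || txn->nchanges == TOY_MAX_CHANGES)
        return -1;
    txn->changes[txn->nchanges++] = change;
    return 0;
}

/* Commit record seen: emit the stashed changes in order, then forget them. */
void
toy_commit(ToyBuffer *rb, unsigned xid)
{
    ToyTxn *txn = toy_lookup(rb, xid, 0);
    int     cap = (int) (sizeof rb->emitted / sizeof rb->emitted[0]);

    if (txn == NULL)
        return;
    for (int i = 0; i < txn->nchanges && rb->nemitted < cap; i++)
        rb->emitted[rb->nemitted++] = txn->changes[i];
    txn->xid = 0;
}

/* Abort record seen: throw the stashed changes away without emitting. */
void
toy_abort(ToyBuffer *rb, unsigned xid)
{
    ToyTxn *txn = toy_lookup(rb, xid, 0);

    if (txn != NULL)
        txn->xid = 0;
}
```

Nothing reaches the output before a commit record is seen, which is why interleaved transactions come out grouped by transaction, in commit order.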

+/*
+ * Set the xmin required for decoding snapshots for the specific decoding
+ * slot.
+ */
+void
+IncreaseLogicalXminForSlot(XLogRecPtr lsn, TransactionId xmin)

I'm thinking this and everything that follows, up through
LogicalDecodingCountDBSlots, probably should be moved to slot.c.

+ /* XXX: Add the current LSN? */

+1.

+ /* shorter lines... */
+ slot = MyReplicationSlot;

If you're going to do this, which seems like it's probably a good
idea, do it at the top of the function and use it all the way through
instead of doing it in the middle.

+ if (MyReplicationSlot == NULL)
+ elog(ERROR, "need a current slot");
+
+ if (is_init && start_lsn != InvalidXLogRecPtr)
+ elog(ERROR, "Cannot INIT_LOGICAL_REPLICATION at a specified LSN");
+
+ if (is_init && plugin == NULL)
+ elog(ERROR, "Cannot INIT_LOGICAL_REPLICATION without a specified plugin");

One of these error messages is not like the others.

+ context = AllocSetContextCreate(CurrentMemoryContext,
+ "Changeset Extraction Context",
+ ALLOCSET_DEFAULT_MINSIZE,
+ ALLOCSET_DEFAULT_INITSIZE,
+ ALLOCSET_DEFAULT_MAXSIZE);

I have my doubts about whether it's wise to make this the child of
CurrentMemoryContext. Certainly, if we do that, then expectations
around what that context is need to be documented. Short-lived
contexts are presumably unsuitable.

+ * Lets start with enough information if we can, so log a standby
+ * snapshot and start decoding at exactl that position.

Let's. Exactly.

+ * the xlog records didn't result in anyting relevant for

anything.

+ elog(LOG, "cannot stream from %X/%X, minimum is %X/%X, forwarding",

This isn't very clear, and I don't think it follows style guidelines either.

+ /* register output plugin name with slot */
+ strncpy(NameStr(MyReplicationSlot->data.plugin), plugin,
+ NAMEDATALEN);
+ NameStr(MyReplicationSlot->data.plugin)[NAMEDATALEN - 1] = '\0';

Hmm. Shouldn't this be delegated to something in slot.c? Why don't
we need the lock? Why don't we need to fsync() the change?
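
For reference, the strncpy-then-terminate pattern in the excerpt can be wrapped in one small function. The sketch below is only an illustration: copy_name() is a hypothetical helper, NAMEDATALEN is assumed to be 64 (the PostgreSQL default), and it says nothing about the locking and fsync questions raised above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NAMEDATALEN 64      /* assumed: PostgreSQL's default value */

/*
 * Hypothetical helper, not a PostgreSQL function: copy src into a
 * fixed-size name buffer, truncating if necessary, and always leave the
 * result NUL-terminated. The unused tail is zero-filled, as strncpy()
 * would do for a short source.
 *
 * Returns the number of characters stored, excluding the terminator.
 */
size_t
copy_name(char dst[NAMEDATALEN], const char *src)
{
    size_t len = strlen(src);

    if (len >= NAMEDATALEN)
        len = NAMEDATALEN - 1;      /* truncate, keeping room for NUL */
    memcpy(dst, src, len);
    memset(dst + len, 0, NAMEDATALEN - len);
    return len;
}
```

Keeping truncation and termination in one place means no caller can forget the explicit terminator write that the excerpt does by hand.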

+ /*
+ * Acquire the current global xmin value and directly set the logical
+ * xmin before releasing the lock if necessary. We do this so wal
+ * decoding is guaranteed to have all catalog rows produced by xacts
+ * with an xid > walsnd->xmin available.
+ *
+ * We can't let ReplicationSlotsComputeRequiredXmin() lock the
+ * procarray as that acquires ProcArrayLock separately which would
+ * open a short window for the global xmin to advance above our xmin.
+ */
+ LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+ slot->effective_catalog_xmin = GetOldestSafeDecodingTransactionId();
+ slot->data.catalog_xmin = slot->effective_catalog_xmin;
+
+ ReplicationSlotsComputeRequiredXmin(true);
+
+ LWLockRelease(ProcArrayLock);

I don't understand this.

+ (errmsg("changeset extraction started, extracting changes after %X/%X, reading from %X/%X",

Needs some style work. What does "extracting changes after %X/%X"
really mean? Also, how about including the slot name in there?

+ * Performoutput plugin write into tuplestore.

Space.

+ /*
+ * XXX: maybe we ought to assert ctx->out is in database encoding when
+ * we're writing textual output.
+ */

Good idea.

+ /*
+ * FIXME: we're going to have to do something more intelligent about
+ * timelines on standby's. Use readTimeLineHistory() and
+ * tliOfPointInHistory() to get the proper LSN?
+ */

So what's the plan for that?

+ /*
+ * XXX: It'd be way nicer to be able to use the walsender waiting logic
+ * here, but that's not available in all environments.
+ */

I don't understand this.

pg_create_decoding_replication_slot() should go in slotfuncs.c.

+/* number of changes kept in memory, per transaction */
+const Size max_memtries = 4096;
+
+/* Size of the slab caches used for frequently allocated objects */
+const Size max_cached_changes = 4096 * 2;
+const Size max_cached_tuplebufs = 4096 * 2; /* ~8MB */
+const Size max_cached_transactions = 512;

Hmm. Is max_memtries the number of "tries" you keep in "mem"? Or is
"tries" an inadvertent abbreviation for "entries" or something? Also,
there's no real discussion here of the logic behind these values, or
the performance or memory impact of changing them.

+ * Free an ReorderBufferTXN. Deallocation might be delayed for efficiency
+ * purposes.

Wow, so we have a bespoke dllist for caching these objects, and that's
actually material to performance? What is the allocation/deallocation
rate here?
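
For readers unfamiliar with the pattern being questioned, a bounded free-list cache looks roughly like the sketch below: returned objects are kept on a list up to a cap and handed out again on the next request, so malloc()/free() are skipped for the common case. This is a generic illustration with hypothetical names (ToyObj, toy_get, toy_return) and a tiny cap, not the patch's dllist-based code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_MAX_CACHED 4    /* tiny cap for illustration only */

typedef struct ToyObj
{
    struct ToyObj *next;    /* link used only while the object is cached */
    int            payload;
} ToyObj;

typedef struct ToyCache
{
    ToyObj *head;           /* singly linked list of cached objects */
    size_t  ncached;
} ToyCache;

/* Hand out an object, reusing a cached one when available. */
ToyObj *
toy_get(ToyCache *cache)
{
    ToyObj *obj;

    if (cache->head != NULL)
    {
        obj = cache->head;
        cache->head = obj->next;
        cache->ncached--;
    }
    else
    {
        obj = malloc(sizeof(ToyObj));
        if (obj == NULL)
            return NULL;
    }
    obj->next = NULL;
    obj->payload = 0;
    return obj;
}

/* Give an object back; it is really freed only once the cache is full. */
void
toy_return(ToyCache *cache, ToyObj *obj)
{
    if (cache->ncached < TOY_MAX_CACHED)
    {
        obj->next = cache->head;
        cache->head = obj;
        cache->ncached++;
    }
    else
        free(obj);
}
```

Whether such a cache pays off depends on the allocation rate and the allocator underneath, which is exactly the number being asked for here.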

+/*
+ * FIXME: better comment and/or name
+ */

If not now, then when?

+ ReorderBufferChange *next_change =
+ dlist_container(ReorderBufferChange, node, next);

Formatting.

+ * ->subxip contains all txids that belong to our transaction which we

snap->subxip

+ * XXX: ->nsubxcnt can be out of date when subtransactions abort, count
+ * manually.

Why is this an XXX?

+ if (GetTopTransactionIdIfAny() != InvalidTransactionId)
+ elog(ERROR, "cannot replay using sub, already allocated xid %u",
+ GetTopTransactionIdIfAny());

Message style. "cannot replay using top", too. Also, what makes this
elog() material? If decoding can be performed from a user session
this seems likely to be (blech!) user-facing.

+ /* XXX: we could skip snapshots in non toplevel txns */

TODO. Or fix it.

+ /*
+ * don't do a ReorderBufferCleanupTXN here, with the vague idea of
+ * allowing to retry decoding.
+ */

It's a bit late in the cycle for such vagueness.

+ * Rejigger change->newtuple to point to in-memory toast tuples instead to
+ * on-disk toast tuples that may not longer exist (think DROP TABLE or VACUUM).
+ *
+ * We cannot replace unchanged toast tuples though, so those will still point
+ * to on-disk toast data.

Why is that OK?

+ * location indicated by 'lsn'. Returns true if successfull, false otherwise.

Extra "l".

My eyes are glazing over, so I'm stopping here for now.

Generally, I think there's a lot of good stuff in here, but I think
you need to make a serious pass through here and try to get rid of all
the things marked XXX and FIXME, either by changing them to say TODO
(or nothing), or by deciding that they're OK and removing the comment,
or by fixing them. It's fine to have some XXX comments on points
where you're hoping for reviewer feedback, but the time to have notes
in there for yourself is long past. Also, you need to either go
through and make a studious attempt to fix all the typographical
errors and message style issues, or you need to find someone else who
can help you with that. Preferably not me, because I'd rather focus
on things of somewhat greater significance.

Generally, I find decode.c to be relatively straightforward, and it's
even got a nice long header comment explaining what it does.
logical.c, on the other hand, has a one-line header comment. As I
mentioned above, I think a lot of the stuff in this file properly
belongs elsewhere, so maybe that's not so bad. But it could probably
use at least a little more work.

To the extent that it's possible, the documentation should be
incorporated into the patches that introduce the corresponding
facilities rather than being a separate patch all of its own.

The meat of the patch seems to be reorderbuffer.c. That file needs
more and better explanations of what is being done and why it is being
done. For example:

+/*
+ * Abort a transaction that possibly has previous changes. Needs to be done
+ * independently for toplevel and subtransactions.
+ */
+void
+ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)

It's already clear from the name of the function and how it's used
elsewhere that it gets called when a transaction aborts. Mentioning
that the function gets invoked for both toplevel and subtransactions
is, surely, worthwhile. But the function has no comments explaining
what it's doing in response to that abort, or why it's doing those
things, or why it's doing those things in that order rather than some
other order. Here is an example of a better comment:

+/*
+ * Setup the base snapshot of a transaction. That is the snapshot that is used
+ * to decode all changes until either this transaction modifies the catalog or
+ * another catalog modifying transaction commits.
+ */

Now, the grammar there might need a tad of work, but there's a lot of
useful information in that comment. It would be even better if there
were a comment explaining, either here or in the caller, how we decide
*when* to call this function.

Both reorderbuffer.c and snapbuild.c contain a significant number of
functions that are quite short. I can't decide whether this is
excellent attention to abstraction boundaries or a sign that the
abstraction boundary is too permeable in the first place.

This email is a bit down in the trenches; I will try to write another
with some higher-level considerations. But I think there is plenty of
stuff here for you to get started fixing.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	"pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.5
Date:	2014-02-11 17:57:34
Message-ID:	20140211175734.GI15246@awork2.anarazel.de
Lists:	pgsql-hackers

Hi!

On 2014-02-11 11:22:24 -0500, Robert Haas wrote:
> + * contents of records in here xexcept turning them into a more usable
>
> Typo.
>
> + /*
> + * XXX: There doesn't seem to be a usecase for decoding
> + * HEAP_NEWPAGE's. Its only used in various indexam's and CLUSTER,
> + * neither of which should be relevant for the logical
> + * changestream.
> + */
>
> There's a level of uncertainty here that doesn't seem consistent with
> calling this a finished patch. It's also not a complete list of
> places where log_newpage() is called, but frankly I don't think that
> should be the aim of this comment. The only relevant question is
> whether we ever use XLOG_HEAP_NEWPAGE to log heap changes that are
> relevant to logical replication. I think we don't.

You're right, we currently don't. I guess we should add a comment to
log_newpage()/buffer to make sure that's not violated in future code;
that seems like the only place that's somewhat likely to be read?

> Decoding PREPARE might
> enable future code to prepare each locally prepared transaction on the
> remote side before doing a COMMIT TRANSACTION locally, allowing for
> logical synchronous replication."

We do support logical synchronous replication over the walsender
interface; what it would allow us to do is some form of 2PC...

I'll adapt your version.

> + /*
> + * If the record wasn't part of a transaction, it will not have
> + * caused invalidations and thus isn't important when building
> + * snapshots. If it was part of a transaction, that transaction
> + * just performed DDL because those are the only codepaths using
> + * inplace updates.
> + */
>
> Under what circumstances do we issue in-place updates not associated
> with a transaction? And under what circumstances do we issue in-place
> updates that ARE associated with a transaction?

VACUUM updates relfrozenxid outside a transaction; CREATE/DROP INDEX and
ANALYZE, for example, do in-place updates inside one.

> +/*
> + * Get the data from the various forms of commit records and pass it
> + * on to snapbuild.c and reorderbuffer.c
> + */
>
> This is a lousy comment. I suggest something like: "Currently, each
> transaction is decoded only once it commits, so the arrival of a commit
> record means that we can now decode the changes made by this toplevel
> transaction and all of its committed subtransactions, unless we have
> to skip it because the replication system isn't fully initialized yet.
> Whether decoding the transaction or not, we must take note of any
> invalidations it issues, as those will affect the snapshot used for
> decoding of *other* transactions."
>
> +/*
> + * Get the data from the various forms of abort records and pass it on to
> + * snapbuild.c and reorderbuffer.c
> + */
>
> Suggest: "When a transaction abort is detected, we throw away any data
> we've stashed away for possible future decoding of that transaction.
> Knowledge of the abort may also help us establish our initial snapshot
> when logical decoding is first initiated."

Hm, those should go into reorderbuffer.c instead, the interesting part
of DecodeCommit/DecodeAbort is just to centralize the handling of the
various forms of commit/abort records. decode.c really shouldn't need to
be changed much when we start to optionally support streaming out changes
immediately.

> + /* register output plugin name with slot */
> + strncpy(NameStr(MyReplicationSlot->data.plugin), plugin,
> + NAMEDATALEN);
> + NameStr(MyReplicationSlot->data.plugin)[NAMEDATALEN - 1] = '\0';
>
> Hmm. Shouldn't this be delegated to something in slot.c? Why don't
> we need the lock?

I wasn't sure, we can place it there, but it doesn't really need to know
about these details either. Lockingwise I don't see it needing more,
nobody but the slot that has it acquired is interested in it.
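
A minimal sketch of the copy idiom in the quoted hunk, written as
standalone C rather than PostgreSQL code. SketchName, sketch_set_name and
SKETCH_NAMEDATALEN are made-up names for illustration; only the
strncpy-then-terminate pattern comes from the patch, and locking is left
out on purpose:

```c
#include <assert.h>
#include <string.h>

/* Stand-in for PostgreSQL's NAMEDATALEN; 64 is the usual default. */
#define SKETCH_NAMEDATALEN 64

/* Hypothetical fixed-size name buffer, loosely modeled on NameData. */
typedef struct SketchName
{
    char data[SKETCH_NAMEDATALEN];
} SketchName;

/*
 * Copy src into dst, truncating if needed. strncpy() does not terminate
 * the destination when src fills the buffer, so the last byte is set
 * explicitly afterwards.
 */
void
sketch_set_name(SketchName *dst, const char *src)
{
    assert(dst != NULL && src != NULL);
    strncpy(dst->data, src, SKETCH_NAMEDATALEN);
    dst->data[SKETCH_NAMEDATALEN - 1] = '\0';
}
```

A concurrent reader could observe the buffer between the two statements,
which is the window the locking question is about.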

> Why don't we need to fsync() the change?

I've since pushed a patch that does the fsyncing. Not doing so was part
of a rebase screwup.

> + /*
> + * FIXME: we're going to have to do something more intelligent about
> + * timelines on standby's. Use readTimeLineHistory() and
> + * tliOfPointInHistory() to get the proper LSN?
> + */
>
> So what's the plan for that?

Prohibit decoding on the standby for now. Not sure how to deal with the
relevant code, leave it there, #ifdef it out, remove it?

> + /*
> + * XXX: It'd be way nicer to be able to use the walsender waiting logic
> + * here, but that's not available in all environments.
> + */
>
> I don't understand this.

The walsender gets notified when WAL is flushed, which allows the
walsender-specific read_page callback to wait on the walsender
latch. For normal backends that's not available, so we have to use
check/sleep/repeat logic.

> pg_create_decoding_replication_slot() should go in slotfuncs.c.

I wasn't sure about where to place it.

> + * Free an ReorderBufferTXN. Deallocation might be delayed for efficiency
> + * purposes.
>
> Wow, so we have a bespoke dllist for caching these objects, and that's
> actually material to performance? What is the allocation/deallocation
> rate here?

It's absolutely massively relevant for performance, I was really
surprised. reorderbuffer.c was the original reason for developing
ilist.h...
Before doing this, memory allocation was by far the top entry in the
profile; afterwards it has left the top ten.

I've tried my memory manager rewrite on it, but it didn't fix things
sufficiently. I don't think there's anything as good as a per-type slab
allocation (which this essentially boils down to, even if it has a
higher memory overhead) for frequently allocated types.
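
To illustrate what such a per-type cache boils down to, here is a
standalone sketch using a singly linked free list. It is not the
reorderbuffer.c code (which uses ilist.h); SketchTxn, SketchTxnCache and
the cap of 4 cached objects are assumptions made up for the example:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical object type; the real ReorderBufferTXN is much larger. */
typedef struct SketchTxn
{
    struct SketchTxn *next_free;    /* link used only while cached */
    unsigned int      xid;
} SketchTxn;

#define SKETCH_MAX_CACHED 4         /* arbitrary cap chosen for the sketch */

typedef struct SketchTxnCache
{
    SketchTxn *free_list;
    int        ncached;
} SketchTxnCache;

/* Hand out a cached object if one is available, else allocate one. */
SketchTxn *
sketch_get_txn(SketchTxnCache *cache)
{
    SketchTxn *txn;

    if (cache->ncached > 0)
    {
        txn = cache->free_list;
        cache->free_list = txn->next_free;
        cache->ncached--;
    }
    else
    {
        txn = malloc(sizeof(SketchTxn));
        assert(txn != NULL);
    }
    txn->next_free = NULL;
    txn->xid = 0;
    return txn;
}

/* Return an object: keep it for reuse unless the cache is already full. */
void
sketch_return_txn(SketchTxnCache *cache, SketchTxn *txn)
{
    if (cache->ncached < SKETCH_MAX_CACHED)
    {
        txn->next_free = cache->free_list;
        cache->free_list = txn;
        cache->ncached++;
    }
    else
        free(txn);
}
```

The cap bounds the extra memory held by the cache, which is the overhead
trade-off mentioned above.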

> + * XXX: ->nsubxcnt can be out of date when subtransactions abort, count
> + * manually.
>
> Why is this an XXX?

For me XXX really is "watch out", that's the only reason.

> + /*
> + * don't do a ReorderBufferCleanupTXN here, with the vague idea of
> + * allowing to retry decoding.
> + */
>
> It's a bit late in the cycle for such vagueness.

Well, I sure think people (including me) will continue to work on
this. There's not much downside to not doing a cleanup here, it will be
cleaned up later.

> + * Rejigger change->newtuple to point to in-memory toast tuples instead to
> + * on-disk toast tuples that may not longer exist (think DROP TABLE or VACUUM).
> + *
> + * We cannot replace unchanged toast tuples though, so those will still point
> + * to on-disk toast data.
>
> Why is that OK?

Because we currently only allow accessing changed toast tuples. We'd
discussed that a long time back and it seemed people agreed that that's
fair enough initially. In fact, more people were interested in that
behaviour than in doing it differently.
Alternatively we can vacuum toast tables less aggressively or WAL log
more.

> This email is a bit down in the trenches; I will try to write another
> with some higher-level considerations. But I think there is plenty of
> stuff here for you to get started fixing.

Yes, working on it. I'll push the smaller increments to git, ok?

Working on all the issues now, thanks!

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	"pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.5
Date:	2014-02-12 15:56:09
Message-ID:	20140212155609.GA3391@alap3.anarazel.de
Lists:	pgsql-hackers

On 2014-02-11 11:22:24 -0500, Robert Haas wrote:
> + context = AllocSetContextCreate(CurrentMemoryContext,
> + "Changeset Extraction Context",
> + ALLOCSET_DEFAULT_MINSIZE,
> + ALLOCSET_DEFAULT_INITSIZE,
> + ALLOCSET_DEFAULT_MAXSIZE);
>
> I have my doubts about whether it's wise to make this the child of
> CurrentMemoryContext. Certainly, if we do that, then expectations
> around what that context is need to be documented. Short-lived
> contexts are presumably unsuitable.

Well, it depends on the type of usage. In the walsender, yes, it needs
to be a long-lived context. Not so much in the SQL case, inside the SRF
we spill all the data into a tuplestore after which we are done. I don't
see which context would be more suitable as a default parent; it used to
be TopMemoryContext but that requires pointless cleanup routines to
handle errors...

So I think documenting the requirement is the best way?

I'm working on the other comments, pushing individual changes to
git. Will send a new version onlist once I'm through.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	"pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.6
Date:	2014-02-13 16:12:38
Message-ID:	20140213161238.GA26092@awork2.anarazel.de
Lists:	pgsql-hackers

Hi,

On 2014-02-11 11:22:24 -0500, Robert Haas wrote:
> [loads of comments]

I tried to address all the points you mentioned.

Except:
* I ended up only moving some functions from logical.c to slot.c, not sure
if it's the right split now.
* I couldn't merge the docs entirely into the commits, that'd be a
horrible mess right now, since the new sgml file refers to all the
tools and output methods. I think that'd make changing things too hard.

News:
* loads of comment, error reporting, documentation improvements
* output plugins now have to specify whether they output data textually
or in binary so the text SQL interface functions can refuse if it's binary.
* changeset extraction while in recovery is disallowed for now
* some more tests were added
Bugs fixed:
* pg_recvlogical could accidentally close stdout if -f - was specified and
the connection died.
* the WAL reservation wasn't working when initially creating a slot,
only a bit later.
* typo in pg_replication_slots led to catalog_xmin not being displayed.
* 8de3e410faa06ab20ec1aa6d0abb0a2c040261ba required some minor changes

Thanks to Christian Kruse for helping me!

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment	Content-Type	Size
0001-wal_decoding-Introduce-logical-changeset-extraction.patch.gz	application/x-patch-gzip	116.0 KB
0002-wal_decoding-logical-changeset-extraction-walsender-.patch.gz	application/x-patch-gzip	11.9 KB
0003-wal_decoding-pg_recvlogical-Introduce-pg_receivexlog.patch.gz	application/x-patch-gzip	11.4 KB
0004-wal_decoding-Documentation-for-replication-slots-and.patch.gz	application/x-patch-gzip	10.5 KB
0005-wal_decoding-Temporarily-add-logical-decoding-regres.patch.gz	application/x-patch-gzip	1.4 KB

From:	Peter Eisentraut <peter_e(at)gmx(dot)net>
To:	Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
Cc:	"pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.0 (was logical changeset generation)
Date:	2014-02-14 02:53:03
Message-ID:	1392346383.25241.0.camel@vanquo.pezone.net
Lists:	pgsql-hackers

On Sun, 2014-01-19 at 15:31 +0100, Stefan Kaltenbrunner wrote:
> > /* followings are for client encoding only */
> > PG_SJIS, /* Shift JIS
> > (Winindows-932) */
>
> while you have that file open: s/Winindows-932/Windows-932 maybe?

done

From:	"Erik Rijkers" <er(at)xs4all(dot)nl>
To:	"Andres Freund" <andres(at)2ndquadrant(dot)com>
Cc:	"Robert Haas" <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.6
Date:	2014-02-14 08:23:45
Message-ID:	33c6ac2dd6f0e0eee0ef1befc8e8c5fa.squirrel@webmail.xs4all.nl
Lists:	pgsql-hackers

On Thu, February 13, 2014 17:12, Andres Freund wrote:
> Hi,
>
> On 2014-02-11 11:22:24 -0500, Robert Haas wrote:
>> [loads of comments]
>
> I tried to address all the points you mentioned.
>

>0001-wal_decoding-Introduce-logical-changeset-extraction.patch.gz 159 k
>0002-wal_decoding-logical-changeset-extraction-walsender-.patch.gz 16 k
>0003-wal_decoding-pg_recvlogical-Introduce-pg_receivexlog.patch.gz 15 k
>0004-wal_decoding-Documentation-for-replication-slots-and.patch.gz 14 k
>0005-wal_decoding-Temporarily-add-logical-decoding-regres.patch.gz 1.8 k

These don't apply...

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Erik Rijkers <er(at)xs4all(dot)nl>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.6
Date:	2014-02-14 09:13:37
Message-ID:	20140214091337.GG4910@awork2.anarazel.de
Lists:	pgsql-hackers

Hi,

On 2014-02-14 09:23:45 +0100, Erik Rijkers wrote:
> >0001-wal_decoding-Introduce-logical-changeset-extraction.patch.gz 159 k
> >0002-wal_decoding-logical-changeset-extraction-walsender-.patch.gz 16 k
> >0003-wal_decoding-pg_recvlogical-Introduce-pg_receivexlog.patch.gz 15 k
> >0004-wal_decoding-Documentation-for-replication-slots-and.patch.gz 14 k
> >0005-wal_decoding-Temporarily-add-logical-decoding-regres.patch.gz 1.8 k
>
> These don't apply...

Works here, could you give a bit more details about the problem you have
applying them? Note that they are compressed, so need to be gunzipped first...

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	"Erik Rijkers" <er(at)xs4all(dot)nl>
To:	"Andres Freund" <andres(at)2ndquadrant(dot)com>
Cc:	"Robert Haas" <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.6
Date:	2014-02-14 09:42:46
Message-ID:	a0d7f88ec1704d880ecc9483a9674def.squirrel@webmail.xs4all.nl
Lists:	pgsql-hackers

On Fri, February 14, 2014 10:13, Andres Freund wrote:
> Hi,
>
> On 2014-02-14 09:23:45 +0100, Erik Rijkers wrote:
>> >0001-wal_decoding-Introduce-logical-changeset-extraction.patch.gz 159 k
>> >0002-wal_decoding-logical-changeset-extraction-walsender-.patch.gz 16 k
>> >0003-wal_decoding-pg_recvlogical-Introduce-pg_receivexlog.patch.gz 15 k
>> >0004-wal_decoding-Documentation-for-replication-slots-and.patch.gz 14 k
>> >0005-wal_decoding-Temporarily-add-logical-decoding-regres.patch.gz 1.8 k
>>
>> These don't apply...
>
> Works here, could you give a bit more details about the problem you have
> applying them? Note that they are compressed, so need to be gunzipped first...
>

yeah, unzipping -- I thought of that :) (no offense taken :))

I just fed it to my standard patch-applying-compiling shell script, which, as always, does something like:

patch --dry-run -b -l -F 25 -p 1 \
  < /home/aardvark/download/pgpatches/0094/lcsg/20140213/0001-wal_decoding-Introduce-logical-changeset-extraction.patch \
  > patch.1.dry-run.1.out

(it loops to try out several -p values, none acceptable)

etc

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Erik Rijkers <er(at)xs4all(dot)nl>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.6.1
Date:	2014-02-14 09:55:46
Message-ID:	20140214095546.GH4910@awork2.anarazel.de
Lists:	pgsql-hackers

On 2014-02-14 10:42:46 +0100, Erik Rijkers wrote:
> On Fri, February 14, 2014 10:13, Andres Freund wrote:
> > Hi,
> >
> > On 2014-02-14 09:23:45 +0100, Erik Rijkers wrote:
> >> >0001-wal_decoding-Introduce-logical-changeset-extraction.patch.gz 159 k
> >> >0002-wal_decoding-logical-changeset-extraction-walsender-.patch.gz 16 k
> >> >0003-wal_decoding-pg_recvlogical-Introduce-pg_receivexlog.patch.gz 15 k
> >> >0004-wal_decoding-Documentation-for-replication-slots-and.patch.gz 14 k
> >> >0005-wal_decoding-Temporarily-add-logical-decoding-regres.patch.gz 1.8 k
> >>
> >> These don't apply...
> >
> > Works here, could you give a bit more details about the problem you have
> > applying them? Note that they are compressed, so need to be gunzipped first...
> >
>
> yeah, unzipping -- I thought of that :) (no offense taken :))

Good ;). I thought it was better to ask...

I hadn't yet fetched new changes since tonight, that's why it worked for
me. It's breaking because of 801c2dc72cb3c68a7c430bb244675b7a68fd541a.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment	Content-Type	Size
0001-wal_decoding-Introduce-logical-changeset-extraction.patch.gz	application/x-patch-gzip	115.9 KB
0002-wal_decoding-logical-changeset-extraction-walsender-.patch.gz	application/x-patch-gzip	11.9 KB
0003-wal_decoding-pg_recvlogical-Introduce-pg_receivexlog.patch.gz	application/x-patch-gzip	11.4 KB
0004-wal_decoding-Documentation-for-replication-slots-and.patch.gz	application/x-patch-gzip	10.5 KB
0005-wal_decoding-Temporarily-add-logical-decoding-regres.patch.gz	application/x-patch-gzip	1.4 KB

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	Erik Rijkers <er(at)xs4all(dot)nl>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.6.1
Date:	2014-02-15 22:29:04
Message-ID:	CA+TgmoYwUTK=qerv4OOsiAyw0d6G=CZ7_0+X6T3ZX36Xxx2u7w@mail.gmail.com
Lists:	pgsql-hackers

On Fri, Feb 14, 2014 at 4:55 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> [ new patches ]

0001 already needs minor

+ * copied stuff from tuptoaster.c. Perhaps there should be toast_internal.h?

Yes, please. If you can submit a separate patch creating this file
and relocating this stuff there, I will commit it.

+ /*
+ * XXX: It's impolite to ignore our argument and keep decoding until the
+ * current position.
+ */

Eh, what?

+ * We misuse the original meaning of SnapshotData's xip and subxip fields
+ * to make the more fitting for our needs.
[...]
+ * XXX: Do we want extra fields instead of misusing existing ones instead?

If we're going to do this, then it surely needs to be documented in
snapshot.h. On the second question, you're not the first hacker to
want to abuse the meanings of the existing fields; SnapshotDirty
already does it. It's tempting to think we need a more principled
approach to this, like what we've done with Node i.e. typedef enum ...
SnapshotType; and then a separate struct definition for each kind, all
beginning with SnapshotType type.
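
A standalone sketch of what that might look like. All names here
(SketchSnapshotType, SketchSnapshotHeader and so on) are hypothetical,
and the tag is wrapped in a small header struct as a variation on the
Node layout, so that casting back from the header pointer is
well-defined C:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tag values; the names are illustrative only. */
typedef enum SketchSnapshotType
{
    SKETCH_SNAPSHOT_MVCC,
    SKETCH_SNAPSHOT_DIRTY,
    SKETCH_SNAPSHOT_DECODING
} SketchSnapshotType;

/* Common header: every snapshot kind starts with the tag. */
typedef struct SketchSnapshotHeader
{
    SketchSnapshotType type;
} SketchSnapshotHeader;

typedef struct SketchMVCCSnapshot
{
    SketchSnapshotHeader hdr;       /* must be first */
    unsigned int         xmin;
    unsigned int         xmax;
} SketchMVCCSnapshot;

typedef struct SketchDecodingSnapshot
{
    SketchSnapshotHeader hdr;       /* must be first */
    unsigned int         ncommitted; /* dedicated field, not a reused one */
} SketchDecodingSnapshot;

/* Dispatch on the tag; return NULL rather than reinterpret a wrong kind. */
SketchDecodingSnapshot *
sketch_as_decoding(SketchSnapshotHeader *snap)
{
    assert(snap != NULL);
    if (snap->type != SKETCH_SNAPSHOT_DECODING)
        return NULL;
    return (SketchDecodingSnapshot *) snap;
}
```

With a tag per kind, each struct can carry fields named for what they
hold instead of overloading xip and subxip.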

+ /*
+ * XXX: Timeline handling/naming. Do we need to include the timeline in
+ * snapshot's name? Outside of very obscure, user triggered, cases every
+ * LSN should correspond to exactly one timeline?
+ */

I can't really comment intelligently on this, so you need to figure it
out. My off-the-cuff thought is that erring on the side of including
it couldn't hurt anything.

+ * XXX: use hex format for the LSN as well?

Yes, please.
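
For reference, the usual PostgreSQL convention for displaying an LSN is
two hexadecimal halves separated by a slash. A standalone sketch of that
formatting, where sketch_format_lsn is a made-up helper name and not
something in the patch:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Format a 64-bit WAL position as two hexadecimal halves, "hi/lo".
 * Returns the number of characters snprintf() reports.
 */
int
sketch_format_lsn(char *buf, size_t buflen, uint64_t lsn)
{
    assert(buf != NULL && buflen > 0);
    return snprintf(buf, buflen, "%X/%X",
                    (unsigned int) (lsn >> 32),
                    (unsigned int) (lsn & 0xFFFFFFFFu));
}
```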

+ /* prepared abort contain a normal commit abort... */

contains.

+ /*
+ * Abort all transactions that we keep track of that are older
+ * than the record's oldestRunningXid. This is the most
+ * convenient spot for doing so since, in contrast to shutdown
+ * or end-of-recovery checkpoints, we have sufficient
+ * knowledge to deal with prepared transactions here.
+ */

I have no real quibble with this, but I think the comment could be
clarified slightly to state *what* knowledge we have here that we
wouldn't have there.

+ /* only crash recovery/replication needs to care */

I believe I know what you're getting at here, but the next guy might
not. I suggest: "Although these records only exist to serve the needs
of logical decoding, all the work happens as part of crash or archive
recovery, so we don't need to do anything here."

+ * Treat HOT update as normal updates, there is no useful

s/, t/. T/

+ * There are cases in which inplace updates are used without xids
+ * assigned (like VACUUM), there are others (like CREATE INDEX
+ * CONCURRENTLY) where an xid is present. If an xid was assigned

In-place updates can be used either by XID-bearing transactions (e.g.
in CREATE INDEX CONCURRENTLY) or by XID-less transactions (e.g.
VACUUM). In the former case, ...

+ * redundant because the commit will do that as well, but one
+ * we'll support decoding in-progress relations, this will be

s/one/once/
s/we'll/we/

+ /* we don't care about row level locks for now */
+ case XLOG_HEAP_LOCK:
+ break;

The position of the comment isn't consistent with the comments for the
other WAL record type in this section; that is, it's above rather than
below the case.

+ * transaction's contents as the various caches need to always be

I think you should use "since" or "because" rather than "as" here, and
maybe put a comma before it.

+ * the transaction's invalidations. This currently won't be needed if
+ * we're just skipping over the transaction, since that currently only
+ * happens when starting decoding, after we invalidated all caches, but
+ * transactions in other databases might have touched shared relations.

I'm having trouble understanding what this means, especially the part
after the "but".

+ * Read a HeapTuple as WAL logged by heap_insert, heap_update and
+ * heap_delete, but not by heap_multi_insert into a tuplebuf.

"but not by heap_multi_insert" needs punctuation both before and
after. You can just add a comma after, or change it into a
parenthetical phrase.

As the above comments probably make clear, I'm pretty much happy with decode.c.

+ /* TODO: We got to change that someday soon.. */

Two periods. Maybe "We need to change this some day soon." - and then
follow that with a paragraph explaining roughly what would need
to be done.

+ /* shorter lines... */
+ slot = MyReplicationSlot;
+
+ /* first some sanity checks that are unlikely to be violated */
+ if (MyReplicationSlot == NULL)
+ elog(ERROR, "cannot perform logical decoding without a acquired slot");

Can test slot.

+ /* make sure the passed slot is suitable, these are user facing errors */

Make sure the passed slot is suitable. These are user-facing errors.

+ if (IsTransactionState() &&
+ GetTopTransactionIdIfAny() != InvalidTransactionId)
+ ereport(ERROR,
+ (errcode(ERRCODE_ACTIVE_SQL_TRANSACTION),
+ errmsg("cannot perform changeset extraction in transaction that has performed writes")));

This is sort of an awkward restriction, as it makes it hard to compose
this feature with others. What underlies the restriction, can we get
rid of it, and if not can we include a comment here explaining why it
exists?

+ * the xlog records didn't result in anyting relevant for

anything.

+ /* register output plugin name with slot */
+ strncpy(NameStr(slot->data.plugin), plugin,
+ NAMEDATALEN);
+ NameStr(slot->data.plugin)[NAMEDATALEN - 1] = '\0';

If it's safe to do this without a lock, I don't know why.

More broadly, I wonder why this is_init stuff is in here at all.
Maybe the stuff that happens in the is_init case should be done in the
caller, or another helper function.

+ /* prevent WAL removal as fast as possible */
+ ReplicationSlotsComputeRequiredLSN();

If there's a race here, can't we rejigger the order of operations to
eliminate it? Or is that just too hard and not worth it?

+begin_txn_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
+ state.callback_name = "pg_decode_begin_txn";
+ ctx->callbacks.begin_cb(ctx, txn);

I feel that there's a certain lack of naming consistency between these
things. Can we improve that? (and similarly for parallel cases)

+pg_create_decoding_replication_slot(PG_FUNCTION_ARGS)

I thought we were going to have physical and logical slots, not
physical and decoding slots.

+ /* make sure we don't end up with an unreleased slot */
+ PG_TRY();
+ {
...
+ PG_CATCH();
+ {
+ ReplicationSlotRelease();
+ ReplicationSlotDrop(NameStr(*name));
+ PG_RE_THROW();
+ }
+ PG_END_TRY();

I don't think this is a very good idea. The problem with doing things
during error recovery that can themselves fail is that you'll lose the
original error, which is not cool, and maybe even blow out the error
stack. Many people have confused PG_TRY()/PG_CATCH() with an
exception-handling system, but it's not. One way to fix this is to
put some of the initialization logic in ReplicationSlotCreate() just
prior to calling CreateSlotOnDisk(). If the work that needs to be
done is too complex or protracted to be done there, then I think that
it should be pulled out of the act of creating the replication slot
and made to happen as part of first use, or as a separate operation
like PrepareForLogicalDecoding.

+ * When the client has confirmed flushes >= candidate_xmin_after we can

candidate_xmin_after is not otherwise referenced; incomplete identifier rename?

Nothing in patch 1 sets PROC_IN_LOGICAL_DECODING. Is that right?

+ProcArraySetReplicationSlotXmin(TransactionId xmin, TransactionId catalog_xmin,
+ bool already_locked)

Maybe Assert(!already_locked || LWLockHeldByMe(ProcArrayLock))

+void
+ProcArrayGetReplicationSlotXmin(TransactionId *xmin,
+ TransactionId *catalog_xmin)
+{
+ LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);

If we need the lock in exclusive mode here, there should be a comment
explaining why. And regardless, there should probably be some sort of
header comment.

+ /*
+ * Acquire spinlock so other backends are guaranteed to see this in
+ * time - we cannot generally acquire the lwlock here since we might
+ * be still holding it in an error path.
+ */
+ SpinLockAcquire(&MyReplicationSlot->mutex);
+ MyReplicationSlot->active = false;
+ SpinLockRelease(&MyReplicationSlot->mutex);
+
+ /* might not have been set when we've been a plain slot */
+ LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+ MyPgXact->vacuumFlags &= ~PROC_IN_LOGICAL_DECODING;
+ LWLockRelease(ProcArrayLock);

OK, a couple of things. First, the comment for the first chunk says
we can't acquire the LWLock here, and then the second part acquires an
LWLock. Second, if we're about to dump our PGPROC altogether, why do
we need to update vacuumFlags first? Third, this isn't actually
called anywhere.

+ * At some point in the future it probaly makes sense to have a more elaborate
+ * resource management here, but it's not entirely clear how that would look
+ * like.

s/how/what/

ReorderBufferGetTXN() should get a comment about the performance
impact of this. There's a tiny bit there in ReorderBufferReturnTXN()
but it should be better called out. Should these call the valgrind
macros to make the memory inaccessible while it's being held in cache?

My eyes are starting to glaze over, so more later.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Erik Rijkers <er(at)xs4all(dot)nl>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.6.1
Date:	2014-02-15 23:59:46
Message-ID:	20140215235946.GC7821@awork2.anarazel.de
Lists:	pgsql-hackers

On 2014-02-15 17:29:04 -0500, Robert Haas wrote:
> On Fri, Feb 14, 2014 at 4:55 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> > [ new patches ]
>
> 0001 already needs minor

Hm?

If there are conflicts, I'll push/send a rebased version tomorrow or Monday.

> + * the transaction's invalidations. This currently won't be needed if
> + * we're just skipping over the transaction, since that currently only
> + * happens when starting decoding, after we invalidated all caches, but
> + * transactions in other databases might have touched shared relations.
>
> I'm having trouble understanding what this means, especially the part
> after the "but".

Let me rephrase, maybe that will help:

This currently won't be needed if we're just skipping over the
transaction because currently we only do so during startup, to get to
the first transaction the client needs. As we have reset the catalog
caches before starting to read WAL and we haven't yet touched any
catalogs, there can't be anything to invalidate.

But if we're "forgetting" this commit because it happened in
another database, the invalidations might be important, because they
could be for shared catalogs and we might have loaded data into the
relevant syscaches.

Better?

> + if (IsTransactionState() &&
> + GetTopTransactionIdIfAny() != InvalidTransactionId)
> + ereport(ERROR,
> + (errcode(ERRCODE_ACTIVE_SQL_TRANSACTION),
> + errmsg("cannot perform changeset extraction in transaction that has performed writes")));
>
> This is sort of an awkward restriction, as it makes it hard to compose
> this feature with others. What underlies the restriction, can we get
> rid of it, and if not can we include a comment here explaining why it
> exists?

Well, you can write the result into a table if you're halfway
careful... :)

I think it should be fairly easy to relax the restriction for creating a
slot, but not for getting data from it. Do you think that would be
sufficient?

> + /* register output plugin name with slot */
> + strncpy(NameStr(slot->data.plugin), plugin,
> + NAMEDATALEN);
> + NameStr(slot->data.plugin)[NAMEDATALEN - 1] = '\0';
>
> If it's safe to do this without a lock, I don't know why.

Well, the worst that could happen is that somebody else doing a SELECT *
FROM pg_replication_slots gets an incomplete plugin name... But we
certainly can hold the spinlock during it if you think that's better.

> More broadly, I wonder why this is_init stuff is in here at all.
> Maybe the stuff that happens in the is_init case should be done in the
> caller, or another helper function.

The reason I moved it in there is that after the walsender patch there
are two callers and the stuff is sufficiently complex that having it
twice led to bugs.
The reason it's currently the same function is that so much of
the code would have to be shared that I found it ugly to split. I'll
have a look whether I can figure something out.

> + /* prevent WAL removal as fast as possible */
> + ReplicationSlotsComputeRequiredLSN();
>
> If there's a race here, can't we rejigger the order of operations to
> eliminate it? Or is that just too hard and not worth it?

Yes, there's a small race which at the very least should be properly
documented.

Hm. Yes, I think we can plug the hole. If the race condition occurs we'd
take slightly longer to startup, which isn't bad. Will fix.

> +begin_txn_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
> + state.callback_name = "pg_decode_begin_txn";
> + ctx->callbacks.begin_cb(ctx, txn);
>
> I feel that there's a certain lack of naming consistency between these
> things. Can we improve that? (and similarly for parallel cases)
>
> +pg_create_decoding_replication_slot(PG_FUNCTION_ARGS)
>
> I thought we were going to have physical and logical slots, not
> physical and decoding slots.

Ok.

> + /* make sure we don't end up with an unreleased slot */
> + PG_TRY();
> + {
> ...
> + PG_CATCH();
> + {
> + ReplicationSlotRelease();
> + ReplicationSlotDrop(NameStr(*name));
> + PG_RE_THROW();
> + }
> + PG_END_TRY();
>
> I don't think this is a very good idea. The problem with doing things
> during error recovery that can themselves fail is that you'll lose the
> original error, which is not cool, and maybe even blow out the error
> stack. Many people have confused PG_TRY()/PG_CATCH() with an
> exception-handling system, but it's not. One way to fix this is to
> put some of the initialization logic in ReplicationSlotCreate() just
> prior to calling CreateSlotOnDisk(). If the work that needs to be
> done is too complex or protracted to be done there, then I think that
> it should be pulled out of the act of creating the replication slot
> and made to happen as part of first use, or as a separate operation
> like PrepareForLogicalDecoding.

I think what should be done here is adding a drop_on_release flag. As
soon as everything important is done, it gets unset.

With some small changes that'd be highly useful for pg_basebackup as
well, because it could simply create a slot to prevent removal of
important WAL. Hm, maybe name it release_on_error, and call it from the
error locations.
You previously (IM?) argued that that'd be problematic because of locks,
but in all the error handling situations we'll already have released
lwlocks. And we rely on being able to take lwlocks there,
c.f. TerminateBufferIO et al.
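
To make the drop_on_release idea above concrete, here is a minimal,
standalone sketch. It is not code from the patch: DemoSlot and the
demo_slot_* functions are hypothetical stand-ins. The only point is that
cleanup moves into the ordinary release path, so no PG_CATCH block has
to do work that could itself fail.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical, heavily simplified stand-in for a replication slot. */
typedef struct DemoSlot
{
    char name[64];
    bool in_use;          /* the slot exists */
    bool acquired;        /* held by the current backend */
    bool drop_on_release; /* setup not finished: drop the slot on release */
} DemoSlot;

/* Create and acquire a slot; it starts out marked drop_on_release. */
static void
demo_slot_create(DemoSlot *slot, const char *name)
{
    memset(slot, 0, sizeof(*slot));
    strncpy(slot->name, name, sizeof(slot->name) - 1);
    slot->in_use = true;
    slot->acquired = true;
    slot->drop_on_release = true;
}

/* Called once everything important is done: the slot now survives release. */
static void
demo_slot_persist(DemoSlot *slot)
{
    assert(slot->acquired);
    slot->drop_on_release = false;
}

/*
 * Release path, shared by normal exit and error cleanup. A slot whose
 * setup never completed is dropped here instead of in a PG_CATCH block.
 */
static void
demo_slot_release(DemoSlot *slot)
{
    if (!slot->acquired)
        return;
    slot->acquired = false;
    if (slot->drop_on_release)
        slot->in_use = false;
}
```

In this sketch, an error between demo_slot_create() and
demo_slot_persist() followed by demo_slot_release() leaves no slot
behind, while the same release call after demo_slot_persist() keeps it.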

> + * When the client has confirmed flushes >= candidate_xmin_after we can
>
> candidate_xmin_after is not otherwise referenced; incomplete identifier rename?
>
> Nothing in patch 1 sets PROC_IN_LOGICAL_DECODING. Is that right?

CreateDecodingContext() does.

> ReorderBufferGetTXN() should get a comment about the performance
> impact of this. There's a tiny bit there in ReorderBufferReturnTXN()
> but it should be better called out. Should these call the valgrind
> macros to make the memory inaccessible while it's being held in cache?

Hm, I think it does call the valgrind stuff?
VALGRIND_MAKE_MEM_UNDEFINED(txn, sizeof(ReorderBufferTXN));
VALGRIND_MAKE_MEM_DEFINED(&txn->node, sizeof(txn->node));

> My eyes are starting to glaze over, so more later.

Thanks! Will address asap.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Erik Rijkers <er(at)xs4all(dot)nl>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.6.1
Date:	2014-02-16 17:12:42
Message-ID:	20140216171242.GC16983@awork2.anarazel.de
Lists:	pgsql-hackers

On 2014-02-15 17:29:04 -0500, Robert Haas wrote:
> On Fri, Feb 14, 2014 at 4:55 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> > [ new patches ]
>
> 0001 already needs minor
>
> + * copied stuff from tuptoaster.c. Perhaps there should be toast_internal.h?
>
> Yes, please. If you can submit a separate patch creating this file
> and relocating this stuff there, I will commit it.

I started to work on that, but I am not sure we actually need it
anymore. tuptoaster.h isn't included in that many places, so perhaps we
should just add it there?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	Erik Rijkers <er(at)xs4all(dot)nl>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.6.1
Date:	2014-02-18 01:31:34
Message-ID:	CA+TgmoanoFcShpbxH1T3O9m-xkr_HEzjkmrH2PtUhoJSbw3GEQ@mail.gmail.com
Lists:	pgsql-hackers

On Fri, Feb 14, 2014 at 4:55 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> [ patches ]

Having now had a little bit of opportunity to reflect on the State Of
This Patch, I'd like to step back from the minutiae upon which I've
been commenting in my previous emails and articulate three high-level
concerns about this patch. In so doing, I would like to specifically
request that other folks on this mailing list comment on the extent to
which they do or do not believe these concerns to be valid. I believe
I've mentioned all of these concerns at least to some degree
previously, but they've been mixed in with other things, so I want to
take this opportunity to call them out more clearly.

1. How safe is it to try to do decoding inside of a regular backend?
What we're doing here is entering a special mode where we forbid the
use of regular snapshots in favor of requiring the use of "decoding
snapshots", and forbid access to non-catalog relations. We then run
through the decoding process; and then exit back into regular mode.
On entering and on exiting this special mode, we
InvalidateSystemCaches(). I don't see a big problem with having
special backends (e.g. walsender) use this special mode, but I'm less
convinced that it's wise to try to set things up so that we can switch
back and forth between decoding mode and regular mode in a single
backend. I worry that won't end up working out very cleanly, and I
think the prohibition against using this special mode in an
XID-bearing transaction is merely a small downpayment on future pain
in this area. That having been said, I can't pretend at this point
either to understand the genesis of this particular restriction or
what other problems are likely to crop up in trying to allow this
mode-switching. So it's possible that I'm overblowing it, but it's
makin' me nervous.

2. I think the snapshot-export code is fundamentally misdesigned. As
I said before, the idea that we're going to export one single snapshot
at one particular point in time strikes me as extremely short-sighted.
For example, consider one-to-many replication where clients may join
or depart the replication group at any time. Whenever somebody joins,
we just want a <snapshot, LSN> pair such that they can apply all
changes after the LSN except for XIDs that would have been visible to
the snapshot. And in fact, we don't even need any special machinery
for that; the client can just make a connection and *take a snapshot*
once decoding is initialized enough. This code is going to great
pains to be able to export a snapshot at the precise point when all
transactions that were running in the first xl_running_xacts record
seen after the start of decoding have ended, but there's nothing
magical about that point, except that it's the first point at which a
freshly-taken snapshot is guaranteed to be good enough to establish an
initial state for any table in the database.

But do you really want to keep that snapshot around long enough to
copy the entire database? I bet you don't: if the database is big,
holding back xmin for long enough to copy the whole thing isn't likely
to be fun. You might well want to copy one table at a time, with
progressively newer snapshots, and apply to each table only those
transactions that weren't part of the initial snapshot for that table.
Many other patterns are possible. What you've got baked in here
right now is suitable only for the simplest imaginable case, and yet
we're paying a substantial price in implementation complexity for it.
Frankly, this code is *ugly*: SnapBuildExportSnapshot() needs to
start a transaction just so that it can push out a snapshot. I
think that's a pretty awful abuse of the transaction machinery, and
the whole point of it, AFAICS, is to eliminate flexibility that we'd
have with simpler approaches.

3. As this feature is proposed, the only plugin we'll ship with 9.4 is
a test_decoding plugin which, as its own documentation says, "doesn't
do anything especially useful." What exactly do we gain by forcing
users who want to make use of these new capabilities to write C code?
You previously stated that it wasn't possible (or there wasn't time)
to write something generic, but how hard is it, really? Sure, people
who are hard-core should have the option to write C code, and I'm
happy that they do. But that shouldn't, IMHO anyway, be a requirement
to use that feature, and I'm having trouble understanding why we're
making it one. The test_decoding plugin doesn't seem tremendously
much simpler than something that someone could actually use, so why
not make that the goal?

Thanks,

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Andres Freund <andres(at)2ndquadrant(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.6.1
Date:	2014-02-18 02:10:26
Message-ID:	11887.1392689426@sss.pgh.pa.us
Lists:	pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> Having now had a little bit of opportunity to reflect on the State Of
> This Patch, I'd like to step back from the minutiae upon which I've
> been commenting in my previous emails and articulate three high-level
> concerns about this patch. In so doing, I would like to specifically
> request that other folks on this mailing list comment on the extent to
> which they do or do not believe these concerns to be valid.
> ...

> 1. How safe is it to try to do decoding inside of a regular backend?
> What we're doing here is entering a special mode where we forbid the
> use of regular snapshots in favor of requiring the use of "decoding
> snapshots", and forbid access to non-catalog relations. We then run
> through the decoding process; and then exit back into regular mode.
> On entering and on exiting this special mode, we
> InvalidateSystemCaches().

How often is such a mode switch expected to happen? I would expect
frequent use of InvalidateSystemCaches() to be pretty much disastrous
for performance, even absent any of the possible bugs you're worried
about. It would likely be better to design things so that a decoder
backend does only that.

> 2. I think the snapshot-export code is fundamentally misdesigned.

Your concerns here sound reasonable, but I can't say I've got any
special insight into it.

> 3. As this feature is proposed, the only plugin we'll ship with 9.4 is
> a test_decoding plugin which, as its own documentation says, "doesn't
> do anything especially useful." What exactly do we gain by forcing
> users who want to make use of these new capabilities to write C code?

TBH, if that's all we're going to ship, I'm going to vote against
committing this patch to 9.4 at all. Let it wait till 9.5 when we
might be able to build something useful on it. To point out just
one obvious problem, how much confidence can we have in the APIs
being right if there are no usable clients? Even if they're right,
what benefit do we get from freezing them one release before anything
useful is going to happen?

The most recent precedent I can think of is the FDW APIs, which I'd
be the first to admit are still in flux. But we didn't ship anything
there without non-toy contrib modules to exercise it. If we had,
we'd certainly have regretted it, because in the creation of those
contrib modules we found flaws in the initial design.

regards, tom lane

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Andres Freund <andres(at)2ndquadrant(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.6.1
Date:	2014-02-18 02:35:23
Message-ID:	CA+TgmoZGLKJ1GsBAGQw5jEX9_GhwNS0R4p2LN9CZ9=B=CEQEuw@mail.gmail.com
Lists:	pgsql-hackers

On Mon, Feb 17, 2014 at 9:10 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> 3. As this feature is proposed, the only plugin we'll ship with 9.4 is
>> a test_decoding plugin which, as its own documentation says, "doesn't
>> do anything especially useful." What exactly do we gain by forcing
>> users who want to make use of these new capabilities to write C code?
>
> TBH, if that's all we're going to ship, I'm going to vote against
> committing this patch to 9.4 at all. Let it wait till 9.5 when we
> might be able to build something useful on it. To point out just
> one obvious problem, how much confidence can we have in the APIs
> being right if there are no usable clients? Even if they're right,
> what benefit do we get from freezing them one release before anything
> useful is going to happen?

I actually have a lot of confidence that the APIs are almost entirely
right, except maybe for the snapshot-related stuff and possibly one or
two other minor details. And I have every confidence that 2ndQuadrant
is going to put out decoding modules that do useful stuff. I also
assume Slony is going to ship one at some point. EnterpriseDB's xDB
replication server will need one, so someone at EDB will have to go
write that. And if Bucardo or Londiste want to use this
infrastructure, they'll need their own, too. What I don't understand
is why it's cool to make each of those replication solutions bring its
own to the table. I mean if they want to, so that they can generate
exactly the format they want with no extra overhead, sure, cool. What
I don't understand is why we're not taking the test_decoding module,
polishing it up a little to produce some nice, easily
machine-parseable output, calling it basic_decoding, and shipping
that. Then people who want something else can build it, but people
who are happy with something basic will already have it.

What I actually suspect is going to happen if we ship this as-is is
that people are going to start building logical replication solutions
on top of the test_decoding module even though it explicitly says that
it's just test code. This is *really* cool technology and people are
*hungry* for it. But writing C is hard, so if there's not a polished
plugin available, I bet people are going to try to use the
not-polished one. I think we should try to get out ahead of that.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Peter Geoghegan <pg(at)heroku(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.6.1
Date:	2014-02-18 02:49:59
Message-ID:	CAM3SWZRreGfaN0h7nqE81CYHogipvcjpfaxpfTsHGogpPpdj0g@mail.gmail.com
Lists:	pgsql-hackers

On Mon, Feb 17, 2014 at 6:35 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> What I actually suspect is going to happen if we ship this as-is is
> that people are going to start building logical replication solutions
> on top of the test_decoding module even though it explicitly says that
> it's just test code. This is *really* cool technology and people are
> *hungry* for it. But writing C is hard, so if there's not a polished
> plugin available, I bet people are going to try to use the
> not-polished one. I think we should try to get out ahead of that.

Tom made a comparison with FDWs, so I'll make another. The Multicorn
module made FDW authorship much more accessible by wrapping it in a
Python interface, I believe with some success. I don't want to stand
in the way of building a fully-featured test_decoding module, but I
think that those that would misuse test_decoding as it currently
stands can be redirected to a third-party wrapper. As you say, it's
pretty cool stuff, so it seems likely that someone will build one for
us.

--
Peter Geoghegan

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Erik Rijkers <er(at)xs4all(dot)nl>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.6.1
Date:	2014-02-18 09:07:58
Message-ID:	20140218090758.GJ7161@awork2.anarazel.de
Lists:	pgsql-hackers

Hi Robert,

On 2014-02-17 20:31:34 -0500, Robert Haas wrote:
> 1. How safe is it to try to do decoding inside of a regular backend?
> What we're doing here is entering a special mode where we forbid the
> use of regular snapshots in favor of requiring the use of "decoding
> snapshots", and forbid access to non-catalog relations. We then run
> through the decoding process; and then exit back into regular mode.
> On entering and on exiting this special mode, we
> InvalidateSystemCaches(). I don't see a big problem with having
> special backends (e.g. walsender) use this special mode, but I'm less
> convinced that it's wise to try to set things up so that we can switch
> back and forth between decoding mode and regular mode in a single
> backend.

The main reason the SQL interface exists is that it's awfully hard to
use isolationtester, pg_regress et al when the output isn't also visible
via SQL. We tried hacking things in other ways, but that's what it came
down to. If you recall, previously the SQL changes interface was only in
a test_logical_decoding extension, because I wasn't sure it's all that
interesting for real usecases.
It's sure nice for testing things though.

> I worry that won't end up working out very cleanly, and I
> think the prohibition against using this special mode in an
> XID-bearing transaction is merely a small downpayment on future pain
> in this area.

That restriction is in principle only needed when creating the slot, not
when getting changes. The only problem is that some piece of code
doesn't know about it.

The reasons it exists are twofold: One is that when looking for an
initial snapshot, we wait for concurrent transactions to end. If we'd
wait for the transaction itself we'd be in trouble, it could never
happen. The second reason is that the code does a XactLockTableWait() to
"visualize" that it's waiting, so isolationtester knows it should
background the command. It's not good to wait on itself.
But neither is actually needed when not creating the slot, the code just
needs to be told about that.

> That having been said, I can't pretend at this point
> either to understand the genesis of this particular restriction or
> what other problems are likely to crop up in trying to allow this
> mode-switching. So it's possible that I'm overblowing it, but it's
> makin' me nervous.

I am not terribly concerned, but I can understand where you are coming
from. I think for replication solutions this isn't going to be needed
but it's much handier for testing and such.

> 2. I think the snapshot-export code is fundamentally misdesigned. As
> I said before, the idea that we're going to export one single snapshot
> at one particular point in time strikes me as extremely short-sighted.

I don't think so. It's precisely what you need to implement a simple
replication solution. Yes, there are usecases that could benefit from
more possibilities, but that's always the case.

> For example, consider one-to-many replication where clients may join
> or depart the replication group at any time. Whenever somebody joins,
> we just want a <snapshot, LSN> pair such that they can apply all
> changes after the LSN except for XIDs that would have been visible to
> the snapshot.

And? They need to create individual replication slots, which each will
get a snapshot.

> And in fact, we don't even need any special machinery
> for that; the client can just make a connection and *take a snapshot*
> once decoding is initialized enough.

No, they can't. Two reasons: For one the commit order between snapshots
and WAL isn't necessarily the same. For another, clients now need logic
to detect whether a transaction's contents have already been applied or
has not been applied yet, that's nontrivial.
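
For readers following along, the client-side check being argued about
("apply all changes after the LSN except for XIDs that would have been
visible to the snapshot") can be sketched roughly as below. This is an
illustration only: DemoSnapshot and the demo_* functions are made-up
names, and the sketch deliberately ignores XID wraparound,
subtransactions, and the commit-order mismatch mentioned above, which
are exactly the parts that make the real thing nontrivial.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t DemoXid;

/*
 * Hypothetical, simplified snapshot: every xid below xmin had finished
 * when the snapshot was taken, every xid at or above xmax had not
 * started, and xip lists the xids in between that were still running.
 */
typedef struct DemoSnapshot
{
    DemoXid        xmin;
    DemoXid        xmax;
    const DemoXid *xip;
    size_t         xcnt;
} DemoSnapshot;

/* True if xid had already finished from the snapshot's point of view. */
static bool
demo_xid_finished_in_snapshot(const DemoSnapshot *snap, DemoXid xid)
{
    assert(snap != NULL);

    if (xid < snap->xmin)
        return true;
    if (xid >= snap->xmax)
        return false;
    for (size_t i = 0; i < snap->xcnt; i++)
    {
        if (snap->xip[i] == xid)
            return false;   /* still running when the snapshot was taken */
    }
    return true;
}

/*
 * A streamed transaction is applied only if the initial copy (made with
 * the snapshot) cannot already contain its effects.
 */
static bool
demo_should_apply(const DemoSnapshot *snap, DemoXid xid)
{
    return !demo_xid_finished_in_snapshot(snap, xid);
}
```

With an exported snapshot that matches the start of streaming, the
client never needs this filter, which is the convenience being defended
in this reply.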

> This code is going to great
> pains to be able to export a snapshot at the precise point when all
> transactions that were running in the first xl_running_xacts record
> seen after the start of decoding have ended, but there's nothing
> magical about that point, except that it's the first point at which a
> freshly-taken snapshot is guaranteed to be good enough to establish an
> initial state for any table in the database.

I still maintain that there's something magic about that moment. It's
when all *future* (from the POV of the snapshot) changes will be
streamed, and all *past* changes are included in the exported snapshot.

> But do you really want to keep that snapshot around long enough to
> copy the entire database? I bet you don't: if the database is big,
> holding back xmin for long enough to copy the whole thing isn't likely
> to be fun.

Well, that's how pg_dump works, it's not this patch's problem to fix
that.

> You might well want to copy one table at a time, with
> progressively newer snapshots, and apply to each table only those
> transactions that weren't part of the initial snapshot for that table.
> Many other patterns are possible. What you've got baked in here
> right now is suitable only for the simplest imaginable case, and yet
> we're paying a substantial price in implementation complexity for it.

Which implementation complexity are you talking about? The relevant code
is maybe 50-60 lines?

> Frankly, this code is *ugly*: SnapBuildExportSnapshot() needs to
> start a transaction just so that it can push out a snapshot. I
> think that's a pretty awful abuse of the transaction machinery, and
> the whole point of it, AFAICS, is to eliminate flexibility that we'd
> have with simpler approaches.

It's not my idea that the snapshot importing requires that
restriction. We could possibly lift it and replace it by another check,
but I don't really see the problem.

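> 3. As this feature is proposed, the only plugin we'll ship with 9.4 is
> a test_decoding plugin which, as its own documentation says, "doesn't
> do anything especially useful." What exactly do we gain by forcing
> users who want to make use of these new capabilities to write C code?

It gains us an output plugin in which we can easily demonstrate
features so they can be tested in the regression tests. Which I find to
be rather important.
Just like e.g. the test_shm_mq stuff doesn't do anything really useful.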

> You previously stated that it wasn't possible (or there wasn't time)
> to write something generic, but how hard is it, really? Sure, people
> who are hard-core should have the option to write C code, and I'm
> happy that they do. But that shouldn't, IMHO anyway, be a requirement
> to use that feature, and I'm having trouble understanding why we're
> making it one.

I think the community will step up and provide further plugins. In
fact, there's already been a json plugin on the mailing list.

> The test_decoding plugin doesn't seem tremendously
> much simpler than something that someone could actually use, so why
> not make that the goal?

For one, it being a designated toy plugin allows us to easily change it,
to showcase/test new features. For another, I still don't agree that
it's easy to agree on an output format. I think we should include some
that have matured in 9.5.

Thanks,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.6.1
Date:	2014-02-18 09:17:46
Message-ID:	20140218091746.GK7161@awork2.anarazel.de
Lists:	pgsql-hackers

On 2014-02-17 21:10:26 -0500, Tom Lane wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> > 1. How safe is it to try to do decoding inside of a regular backend?
> > What we're doing here is entering a special mode where we forbid the
> > use of regular snapshots in favor of requiring the use of "decoding
> > snapshots", and forbid access to non-catalog relations. We then run
> > through the decoding process; and then exit back into regular mode.
> > On entering and on exiting this special mode, we
> > InvalidateSystemCaches().
>
> How often is such a mode switch expected to happen? I would expect
> frequent use of InvalidateSystemCaches() to be pretty much disastrous
> for performance, even absent any of the possible bugs you're worried
> about. It would likely be better to design things so that a decoder
> backend does only that.

Very infrequently. When it's starting to decode, and when it's
ending. When used via walsender, that should only happen at connection
start/end which surely shouldn't be frequent.
It's more frequent when using the SQL interface, but since that's not a
streaming interface, on a busy server there would still be a couple of
megabytes of transactions to decode for one reset.

> > 3. As this feature is proposed, the only plugin we'll ship with 9.4 is
> > a test_decoding plugin which, as its own documentation says, "doesn't
> > do anything especially useful." What exactly do we gain by forcing
> > users who want to make use of these new capabilities to write C code?
>
> TBH, if that's all we're going to ship, I'm going to vote against
> committing this patch to 9.4 at all. Let it wait till 9.5 when we
> might be able to build something useful on it.

There *are* useful things around already. We didn't include postgres_fdw
in the same release as the fdw code either? I don't see why this should
be held to a different standard.

> To point out just
> one obvious problem, how much confidence can we have in the APIs
> being right if there are no usable clients?

Because there *are* clients. They just don't sound likely to either be
suitable for core code (too specialized) or have already been submitted
(the json plugin).

There's a whole replication suite built on top of this, to a good degree
just to test it. So I am fairly confident that the most important parts
are covered. There sure are additional features I want, but that's not
surprising.

> The most recent precedent I can think of is the FDW APIs, which I'd
> be the first to admit are still in flux. But we didn't ship anything
> there without non-toy contrib modules to exercise it. If we had,
> we'd certainly have regretted it, because in the creation of those
> contrib modules we found flaws in the initial design.

Which non-toy fdw was there? file_fdw was in 9.1, but that's a toy. And
*8.4* had CREATE FOREIGN DATA WRAPPER, without it doing anything...

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Peter Geoghegan <pg(at)heroku(dot)com>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Erik Rijkers <er(at)xs4all(dot)nl>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.6.1
Date:	2014-02-18 09:20:12
Message-ID:	20140218092012.GL7161@awork2.anarazel.de
Lists:	pgsql-hackers

On 2014-02-17 18:49:59 -0800, Peter Geoghegan wrote:
> On Mon, Feb 17, 2014 at 6:35 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> > What I actually suspect is going to happen if we ship this as-is is
> > that people are going to start building logical replication solutions
> > on top of the test_decoding module even though it explicitly says that
> > it's just test code. This is *really* cool technology and people are
> > *hungry* for it. But writing C is hard, so if there's not a polished
> > plugin available, I bet people are going to try to use the
> > not-polished one. I think we should try to get out ahead of that.
>
> Tom made a comparison with FDWs, so I'll make another. The Multicorn
> module made FDW authorship much more accessible by wrapping it in a
> Python interface, I believe with some success. I don't want to stand
> in the way of building a fully-featured test_decoding module, but I
> think that those that would misuse test_decoding as it currently
> stands can be redirected to a third-party wrapper. As you say, it's
> pretty cool stuff, so it seems likely that someone will build one for
> us.

Absolutely. I *sure* hope somebody is going to build such an
abstraction. I am not entirely sure what it'd look like, but ...

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Erik Rijkers <er(at)xs4all(dot)nl>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.6.1
Date:	2014-02-18 09:33:13
Message-ID:	20140218093313.GM7161@awork2.anarazel.de
Lists:	pgsql-hackers

On 2014-02-17 21:35:23 -0500, Robert Haas wrote:
> What
> I don't understand is why we're not taking the test_decoding module,
> polishing it up a little to produce some nice, easily
> machine-parseable output, calling it basic_decoding, and shipping
> that. Then people who want something else can build it, but people
> who are happy with something basic will already have it.

Because every project is going to need their own plugin
*anyway*. Londiste and Slony are surely going to ignore changes to
relations they don't need, querying their own metadata to decide. They
will want compatibility with the earlier formats as far as possible.
Sometime not too far away they will want to optionally support binary
output because it's so much faster.
There's just not much chance that either of these will be able to agree
on a format short term.

So, possibly we could agree to something that consumers *outside* of
replication could use.

> What I actually suspect is going to happen if we ship this as-is is
> that people are going to start building logical replication solutions
> on top of the test_decoding module even though it explicitly says that
> it's just test code. This is *really* cool technology and people are
> *hungry* for it. But writing C is hard, so if there's not a polished
> plugin available, I bet people are going to try to use the
> not-polished one. I think we should try to get out ahead of that.

I really hope there will be nicer ones by the time 9.4 is
released. Euler did send in a json plugin
http://archives.postgresql.org/message-id/52A5BFAE.1040209%2540timbira.com.br
, but there hasn't been much feedback yet. It's hard to start discussing
something that needs a couple of patches to pg before you can develop
your own patch...

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	Erik Rijkers <er(at)xs4all(dot)nl>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.6.1
Date:	2014-02-19 18:01:02
Message-ID:	CA+TgmobWhqGM=yy1MdGgBp+PD-ovxaU=Fpevfs=MJa3A1+wd4g@mail.gmail.com
Lists:	pgsql-hackers

On Sat, Feb 15, 2014 at 6:59 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> On 2014-02-15 17:29:04 -0500, Robert Haas wrote:
>> On Fri, Feb 14, 2014 at 4:55 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> > [ new patches ]
>>
>> 0001 already needs minor
>
> Hm?
>
> If there are conflicts, I'll push/send a rebased tomorrow or monday.

As you guessed, the missing word was "rebasing". It's a trivial
conflict though, so please don't feel the need to repost just for
that.

>> + * the transaction's invalidations. This currently won't be needed if
>> + * we're just skipping over the transaction, since that currently only
>> + * happens when starting decoding, after we invalidated all caches, but
>> + * transactions in other databases might have touched shared relations.
>>
>> I'm having trouble understanding what this means, especially the part
>> after the "but".
>
> Let me rephrase, maybe that will help:
>
> This currently won't be needed if we're just skipping over the
> transaction because currently we only do so during startup, to get to
> the first transaction the client needs. As we have reset the catalog
> caches before starting to read WAL and we haven't yet touched any
> catalogs there can't be anything to invalidate.
>
> But if we're "forgetting" this commit because it happened in
> another database, the invalidations might be important, because they
> could be for shared catalogs and we might have loaded data into the
> relevant syscaches.
>
> Better?

Much! Please include that text, or something like it.

>> + if (IsTransactionState() &&
>> + GetTopTransactionIdIfAny() != InvalidTransactionId)
>> + ereport(ERROR,
>> + (errcode(ERRCODE_ACTIVE_SQL_TRANSACTION),
>> + errmsg("cannot perform changeset
>> extraction in transaction that has performed writes")));
>>
>> This is sort of an awkward restriction, as it makes it hard to compose
>> this feature with others. What underlies the restriction, can we get
>> rid of it, and if not can we include a comment here explaining why it
>> exists?
>
> Well, you can write the result into a table if you're halfway
> careful... :)
>
> I think it should be fairly easy to relax the restriction to creating a
> slot, but not getting data from it. Do you think that would be
> sufficient?

That would be a big improvement, for sure, and might be entirely sufficient.

>> + /* register output plugin name with slot */
>> + strncpy(NameStr(slot->data.plugin), plugin,
>> + NAMEDATALEN);
>> + NameStr(slot->data.plugin)[NAMEDATALEN - 1] = '\0';
>>
>> If it's safe to do this without a lock, I don't know why.
>
> Well, the worst that could happen is that somebody else doing a SELECT *
> FROM pg_replication_slot gets an incomplete plugin name... But we
> certainly can hold the spinlock during it if you think that's better.

Isn't the worst thing that can happen that they copy garbage out of
the buffer, because the new name is longer than the old and only
partially written?
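
To illustrate the concern, here is a standalone sketch (not code from the
patch). DemoNameSlot, DEMO_NAMELEN and the atomic_flag lock are hypothetical
stand-ins for the real slot struct, NAMEDATALEN and the slot's spinlock. The
point is only that writer and reader take the same lock, so a reader never
copies a half-written name:

```c
#include <assert.h>
#include <stdatomic.h>
#include <string.h>

#define DEMO_NAMELEN 64		/* hypothetical stand-in for NAMEDATALEN */

/* Hypothetical, simplified slot; not the struct from the patch. */
typedef struct DemoNameSlot
{
	atomic_flag lock;		/* stand-in for the slot's spinlock */
	char		plugin[DEMO_NAMELEN];
} DemoNameSlot;

void
demo_name_slot_init(DemoNameSlot *slot)
{
	assert(slot != NULL);
	atomic_flag_clear(&slot->lock);
	memset(slot->plugin, 0, DEMO_NAMELEN);
}

static void
demo_lock(DemoNameSlot *slot)
{
	while (atomic_flag_test_and_set_explicit(&slot->lock,
											 memory_order_acquire))
		;					/* spin until the flag was previously clear */
}

static void
demo_unlock(DemoNameSlot *slot)
{
	atomic_flag_clear_explicit(&slot->lock, memory_order_release);
}

/* Writer: register the plugin name while holding the lock. */
void
demo_set_plugin(DemoNameSlot *slot, const char *plugin)
{
	assert(slot != NULL && plugin != NULL);
	demo_lock(slot);
	strncpy(slot->plugin, plugin, DEMO_NAMELEN);
	slot->plugin[DEMO_NAMELEN - 1] = '\0';
	demo_unlock(slot);
}

/* Reader: copy the whole buffer out under the same lock. */
void
demo_get_plugin(DemoNameSlot *slot, char *dst)
{
	assert(slot != NULL && dst != NULL);
	demo_lock(slot);
	memcpy(dst, slot->plugin, DEMO_NAMELEN);
	demo_unlock(slot);
}
```

The copy is short and bounded, which is the usual argument for doing it
under a spinlock rather than a heavier lock.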

>> More broadly, I wonder why this is_init stuff is in here at all.
>> Maybe the stuff that happens in the is_init case should be done in the
>> caller, or another helper function.
>
> The reason I moved it in there is that after the walsender patch there
> are two callers and the stuff is sufficiently complex that having it
> twice led to bugs.
> The reason it's currently the same function is that sufficiently much of
> the code would have to be shared that I found it ugly to split. I'll
> have a look whether I can figure something out.

I was thinking that the is_init portion could perhaps be done first,
in a *previous* function call, and adjusted in such a way that the
non-is-init path can be used for both cases here.

>> I don't think this is a very good idea. The problem with doing things
>> during error recovery that can themselves fail is that you'll lose the
>> original error, which is not cool, and maybe even blow out the error
>> stack. Many people confuse PG_TRY()/PG_CATCH() with an
>> exception-handling system, but it's not. One way to fix this is to
>> put some of the initialization logic in ReplicationSlotCreate() just
>> prior to calling CreateSlotOnDisk(). If the work that needs to be
>> done is too complex or protracted to be done there, then I think that
>> it should be pulled out of the act of creating the replication slot
>> and made to happen as part of first use, or as a separate operation
>> like PrepareForLogicalDecoding.
>
> I think what should be done here is adding a drop_on_release flag. As
> soon as everything important is done, it gets unset.

That might be more elegant, but I don't think it really fixes
anything, because backing stuff out from on disk can fail. AIUI, your
whole concern here is that you don't want the slot creation to fail
halfway through and leave behind the slot, but what you've got here
doesn't prevent that; it just makes it less likely. The more I think
about it, the more I think you're trying to pack stuff into slot
creation that really ought to be happening on first use.

>> ReorderBufferGetTXN() should get a comment about the performance
>> impact of this. There's a tiny bit there in ReorderBufferReturnTXN()
>> but it should be better called out. Should these call the valgrind
>> macros to make the memory inaccessible while it's being held in cache?
>
> Hm, I think it does call the valgrind stuff?
> VALGRIND_MAKE_MEM_UNDEFINED(txn, sizeof(ReorderBufferTXN));
> VALGRIND_MAKE_MEM_DEFINED(&txn->node, sizeof(txn->node));

That's there in ReorderBufferReturnTXN, but don't you need something
in ReorderBufferGetTXN? Maybe not.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	Erik Rijkers <er(at)xs4all(dot)nl>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.6.1
Date:	2014-02-19 18:01:56
Message-ID:	CA+TgmoYXXvuch8yJkwOSakgs+Oiov7oadAomy1NUCZAXQyg7Lg@mail.gmail.com
Lists:	pgsql-hackers

On Sun, Feb 16, 2014 at 12:12 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> On 2014-02-15 17:29:04 -0500, Robert Haas wrote:
>> On Fri, Feb 14, 2014 at 4:55 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> > [ new patches ]
>>
>> 0001 already needs minor
>>
>> + * copied stuff from tuptoaster.c. Perhaps there should be toast_internal.h?
>>
>> Yes, please. If you can submit a separate patch creating this file
>> and relocating this stuff there, I will commit it.
>
> I started to work on that, but I am not sure we actually need it
> anymore. tuptoaster.h isn't included in that many places, so perhaps we
> should just add it there?

That seems fine to me.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	Erik Rijkers <er(at)xs4all(dot)nl>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.6.1
Date:	2014-02-19 18:07:11
Message-ID:	CA+TgmoayQ_fKzLOR3zArtsa1JSmG2sPNbF7S9o+WEmLd00rEiA@mail.gmail.com
Lists:	pgsql-hackers

On Tue, Feb 18, 2014 at 4:07 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> 2. I think the snapshot-export code is fundamentally misdesigned. As
>> I said before, the idea that we're going to export one single snapshot
>> at one particular point in time strikes me as extremely short-sighted.
>
> I don't think so. It's precisely what you need to implement a simple
> replication solution. Yes, there are usecases that could benefit from
> more possibilities, but that's always the case.
>
>> For example, consider one-to-many replication where clients may join
>> or depart the replication group at any time. Whenever somebody joins,
>> we just want a <snapshot, LSN> pair such that they can apply all
>> changes after the LSN except for XIDs that would have been visible to
>> the snapshot.
>
> And? They need to create individual replication slots, which each will
> get a snapshot.

So we have to wait for startup N times, and transmit the change stream
N times, instead of once? Blech.

>> And in fact, we don't even need any special machinery
>> for that; the client can just make a connection and *take a snapshot*
>> once decoding is initialized enough.
>
> No, they can't. Two reasons: For one the commit order between snapshots
> and WAL isn't necessarily the same.

So what?

> For another, clients now need logic
> to detect whether a transaction's contents has already been applied or
> has not been applied yet, that's nontrivial.

My point is, I think we should be trying to *make* that trivial,
rather than doing this.

>> 3. As this feature is proposed, the only plugin we'll ship with 9.4 is
>> a test_decoding plugin which, as its own documentation says, "doesn't
>> do anything especially useful." What exactly do we gain by forcing
>> users who want to make use of these new capabilities to write C code?
>
> It gains us an output plugin in which we can easily demonstrate
> features so they can be tested in the regression tests. Which I find to
> be rather important.
> Just like e.g. the test_shm_mq stuff doesn't do anything really useful.

It definitely doesn't, but this patch is a lot closer to being done
than parallel query is, so I'm not sure it's a fair comparison.

>> The test_decoding plugin doesn't seem tremendously
>> much simpler than something that someone could actually use, so why
>> not make that the goal?
>
> For one, it being a designated toy plugin allows us to easily change it,
> to showcase/test new features. For another, I still don't agree that
> it's easy to agree to an output format. I think we should include some
> that matured into 9.5.

I regret that more effort has not been made in that area.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Erik Rijkers <er(at)xs4all(dot)nl>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.6.1
Date:	2014-02-19 18:31:06
Message-ID:	CA+TgmoZEd4wbNPn-P6BrXXhs7_fPVfUWM_Nryo_XBM2RruKU_Q@mail.gmail.com
Lists:	pgsql-hackers

On Tue, Feb 18, 2014 at 4:33 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> On 2014-02-17 21:35:23 -0500, Robert Haas wrote:
>> What
>> I don't understand is why we're not taking the test_decoding module,
>> polishing it up a little to produce some nice, easily
>> machine-parseable output, calling it basic_decoding, and shipping
>> that. Then people who want something else can build it, but people
>> who are happy with something basic will already have it.
>
> Because every project is going to need their own plugin
> *anyway*. Londiste, slony sure are going to ignore changes to relations
> they don't need. Querying their own metadata. They will want
> compatibility to the earlier formats as far as possible. Sometime not
> too far away they will want to optionally support binary output because
> it's so much faster.
> There's just not much chance that either of these will be able to agree
> on a format short term.

Ah, so part of what you're expecting the output plugin to do is
filtering. I can certainly see where there might be considerable
variation between solutions in that area - but I think that's separate
from the question of formatting per se. Although I think we should
have an in-core output plugin with filtering capabilities eventually,
I'm happy to define that as out of scope for 9.4. But isn't there a
way that we can ship something that will do for people who want to
just see the database's entire change stream float by?

TBH, as compared to what you've got now, I think this mostly boils
down to a question of quoting and escaping. I'm not really concerned
with whether we ship something that's perfectly efficient, or that has
filtering capabilities, or that has a lot of fancy bells and whistles.
What I *am* concerned about is that if the user updates a text field
that contains characters like " or ' or : or [ or ] or , that somebody
might be using as delimiters in the output format, that a program can
still parse that output format and reliably determine what the actual
change was. I don't care all that much whether we use JSON or CSV or
something custom, but the data that gets spit out should not have
SQL-injection-like vulnerabilities.
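
As a rough illustration of the kind of quoting meant here (a sketch only;
demo_quote_literal is a made-up helper, and doubling single quotes is just
one possible convention, not necessarily what test_decoding should use):

```c
#include <assert.h>
#include <stddef.h>

/*
 * Write src to dst as a single-quoted literal, doubling any embedded
 * single quote. Returns the number of bytes written (excluding the
 * terminating NUL), or -1 if dst is too small.
 */
int
demo_quote_literal(const char *src, char *dst, size_t dstlen)
{
	size_t		pos = 0;

	assert(src != NULL && dst != NULL);

	if (dstlen < 3)
		return -1;			/* need room for the two quotes and the NUL */

	dst[pos++] = '\'';
	for (; *src != '\0'; src++)
	{
		size_t		need = (*src == '\'') ? 2 : 1;

		/* keep room for the closing quote and the terminating NUL */
		if (pos + need + 2 > dstlen)
			return -1;
		if (*src == '\'')
			dst[pos++] = '\'';
		dst[pos++] = *src;
	}
	dst[pos++] = '\'';
	dst[pos] = '\0';
	return (int) pos;
}
```

With a rule like this, a parser can treat everything up to the first
undoubled quote as the value, so delimiters such as , or [ inside the data
no longer matter.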

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Euler Taveira <euler(at)timbira(dot)com(dot)br>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Erik Rijkers <er(at)xs4all(dot)nl>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.6.1
Date:	2014-02-19 18:53:53
Message-ID:	5304FDC1.5010708@timbira.com.br
Lists:	pgsql-hackers

On 18-02-2014 06:33, Andres Freund wrote:
> I really hope there will be nicer ones by the time 9.4 is
> released. Euler did send in a json plugin
> http://archives.postgresql.org/message-id/52A5BFAE.1040209%2540timbira.com.br
> , but there hasn't been too much feedback yet. It's hard to start discussing
> something that needs a couple of patches to pg before you can develop
> your own patch...
>
BTW, I've updated that code to reflect the recent changes in the API and
published it at [1]. This version is based on Andres' branch
xlog-decoding-rebasing-remapping. I'll continue to polish this code.

Regards,

[1] https://github.com/eulerto/wal2json

--
Euler Taveira Timbira - http://www.timbira.com.br/
PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Erik Rijkers <er(at)xs4all(dot)nl>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.6.1
Date:	2014-02-21 11:07:22
Message-ID:	20140221110722.GV28858@alap3.anarazel.de
Lists:	pgsql-hackers

Hi,

On 2014-02-19 13:01:02 -0500, Robert Haas wrote:
> > I think it should be fairly easy to relax the restriction to creating a
> > slot, but not getting data from it. Do you think that would be
> > sufficient?
>
> That would be a big improvement, for sure, and might be entirely sufficient.

Turned out to be a 5 line change + tests or something... Pushed.

> >> I don't think this is a very good idea. The problem with doing things
> >> during error recovery that can themselves fail is that you'll lose the
> >> original error, which is not cool, and maybe even blow out the error
> >> stack. Many people confuse PG_TRY()/PG_CATCH() with an
> >> exception-handling system, but it's not. One way to fix this is to
> >> put some of the initialization logic in ReplicationSlotCreate() just
> >> prior to calling CreateSlotOnDisk(). If the work that needs to be
> >> done is too complex or protracted to be done there, then I think that
> >> it should be pulled out of the act of creating the replication slot
> >> and made to happen as part of first use, or as a separate operation
> >> like PrepareForLogicalDecoding.
> >
> > I think what should be done here is adding a drop_on_release flag. As
> > soon as everything important is done, it gets unset.
>
> That might be more elegant, but I don't think it really fixes
> anything, because backing stuff out from on disk can fail.

If the slot is marked as "drop_on_release" during creation, and we fail
during removal, it will just be dropped on the next startup. That seems
ok to me?
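
Roughly, the idea looks like this. It is a standalone sketch with made-up
names (DemoEphemeralSlot and the demo_* functions), not the actual slot
code, and it leaves out everything about the on-disk state:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified slot; not the struct from the patch. */
typedef struct DemoEphemeralSlot
{
	bool		in_use;				/* slot exists */
	bool		drop_on_release;	/* creation not finished yet */
} DemoEphemeralSlot;

/* Creation starts with the flag set. */
void
demo_slot_create(DemoEphemeralSlot *slot)
{
	assert(slot != NULL);
	slot->in_use = true;
	slot->drop_on_release = true;
}

/* Called once everything important has been done. */
void
demo_slot_mark_ready(DemoEphemeralSlot *slot)
{
	assert(slot != NULL);
	slot->drop_on_release = false;
}

/* Release, e.g. on error: an unfinished slot is dropped. */
void
demo_slot_release(DemoEphemeralSlot *slot)
{
	assert(slot != NULL);
	if (slot->drop_on_release)
		slot->in_use = false;
}

/* At startup: drop leftovers whose removal failed earlier. */
void
demo_startup_cleanup(DemoEphemeralSlot *slots, int nslots)
{
	int			i;

	assert(slots != NULL || nslots == 0);
	for (i = 0; i < nslots; i++)
	{
		if (slots[i].in_use && slots[i].drop_on_release)
			slots[i].in_use = false;
	}
}
```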

I still think it's not really important to put much effort in the "disk
stuff fails" case, it's entirely hypothetical. If that fails you have
*so* much huger problems, a leftover slot is the least of your problems.

> AIUI, your
> whole concern here is that you don't want the slot creation to fail
> halfway through and leave behind the slot, but what you've got here
> doesn't prevent that; it just makes it less likely. The more I think
> about it, the more I think you're trying to pack stuff into slot
> creation that really ought to be happening on first use.

Well, having a leftover slot that never succeeded being created is going
to be confusing lots of people, especially as it will not rollback or
something. That's why I think it's important to make it unlikely.
The typical reasons for failing are stuff like an output plugin that
doesn't exist or being interrupted while initializing.

I can sympathize with the "too much during init" argument, but I don't
see how moving stuff to the first call would get rid of the problems. If
we fail later it's going to be just as confusing.

> >> ReorderBufferGetTXN() should get a comment about the performance
> >> impact of this. There's a tiny bit there in ReorderBufferReturnTXN()
> >> but it should be better called out. Should these call the valgrind
> >> macros to make the memory inaccessible while it's being held in cache?
> >
> > Hm, I think it does call the valgrind stuff?
> > VALGRIND_MAKE_MEM_UNDEFINED(txn, sizeof(ReorderBufferTXN));
> > VALGRIND_MAKE_MEM_DEFINED(&txn->node, sizeof(txn->node));
>
> That's there in ReorderBufferReturnTXN, but don't you need something
> in ReorderBufferGetTXN? Maybe not.

Don't think so, it marks the memory as undefined, which allows writes,
but will warn on reads. We could additionally mark the memory as
inaccessible disallowing writes, but I don't really see that catching much.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Erik Rijkers <er(at)xs4all(dot)nl>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.6.1
Date:	2014-02-21 11:09:37
Message-ID:	20140221110937.GW28858@alap3.anarazel.de
Lists:	pgsql-hackers

On 2014-02-19 13:31:06 -0500, Robert Haas wrote:
> TBH, as compared to what you've got now, I think this mostly boils
> down to a question of quoting and escaping. I'm not really concerned
> with whether we ship something that's perfectly efficient, or that has
> filtering capabilities, or that has a lot of fancy bells and whistles.
> What I *am* concerned about is that if the user updates a text field
> that contains characters like " or ' or : or [ or ] or , that somebody
> might be using as delimiters in the output format, that a program can
> still parse that output format and reliably determine what the actual
> change was. I don't care all that much whether we use JSON or CSV or
> something custom, but the data that gets spit out should not have
> SQL-injection-like vulnerabilities.

If it's just that, I am *perfectly* happy to change it. What I do not
want is arguments like "I don't want the type information, that's
pointless" because it's actually really important for regression
testing.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Erik Rijkers <er(at)xs4all(dot)nl>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.6.1
Date:	2014-02-21 11:40:51
Message-ID:	20140221114051.GZ28858@alap3.anarazel.de
Lists:	pgsql-hackers

On 2014-02-19 13:07:11 -0500, Robert Haas wrote:
> On Tue, Feb 18, 2014 at 4:07 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> >> 2. I think the snapshot-export code is fundamentally misdesigned. As
> >> I said before, the idea that we're going to export one single snapshot
> >> at one particular point in time strikes me as extremely short-sighted.
> >
> > I don't think so. It's precisely what you need to implement a simple
> > replication solution. Yes, there are usecases that could benefit from
> > more possibilities, but that's always the case.
> >
> >> For example, consider one-to-many replication where clients may join
> >> or depart the replication group at any time. Whenever somebody joins,
> >> we just want a <snapshot, LSN> pair such that they can apply all
> >> changes after the LSN except for XIDs that would have been visible to
> >> the snapshot.
> >
> > And? They need to create individual replication slots, which each
> > will get a snapshot.
>
> So we have to wait for startup N times, and transmit the change stream
> N times, instead of once? Blech.

I can't get too excited about this. If we later want to add a command to
clone an existing slot, sure, that's perfectly fine with me. That will
then stream at exactly the same position. Easy, less than 20 LOC + docs
probably.

We have much more waiting e.g. in the CONCURRENTLY commands and it's not
causing that many problems.

Note that it'd be a *significant* overhead to continuously be able to
export snapshots that are useful for looking at normal relations. Both
for computing snapshots and for not being able to remove those rows.

> >> And in fact, we don't even need any special machinery
> >> for that; the client can just make a connection and *take a snapshot*
> >> once decoding is initialized enough.
> >
> > No, they can't. Two reasons: For one the commit order between snapshots
> > and WAL isn't necessarily the same.

> So what?

So you can't just use a plain snapshot and dump using it, without
getting into inconsistencies.

> > For another, clients now need logic
> > to detect whether a transaction's contents has already been applied or
> > has not been applied yet, that's nontrivial.

> My point is, I think we should be trying to *make* that trivial,
> rather than doing this.

I think *this* *is* making it trivial.

Maybe I've missed it, but I haven't seen any alternative that comes even
*close* to being as easy to implement in a replication
solution. Currently you can use it like:

CREATE_REPLICATION_SLOT <name> LOGICAL
copy data using the exported snapshot
START_REPLICATION SLOT <name> LOGICAL
stream changes.

Where you can do the START_REPLICATION as soon as some other session has
imported the snapshot. Really not much to worry about additionally.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	Erik Rijkers <er(at)xs4all(dot)nl>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.6.1
Date:	2014-02-21 13:16:59
Message-ID:	CA+TgmoYD0cSSbd2cmC9AbSy9dvaY=ejPWpqOAV-wi=OsAv4j8w@mail.gmail.com
Lists:	pgsql-hackers

On Fri, Feb 21, 2014 at 6:07 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> I can sympathize with the "too much during init" argument, but I don't
> see how moving stuff to the first call would get rid of the problems. If
> we fail later it's going to be just as confusing.

No, it isn't. If you fail during init the user will expect the slot to
be gone. That's the reason for all of this complexity. If you fail
on first use, the user will expect the slot to still be there.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Erik Rijkers <er(at)xs4all(dot)nl>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.6.1
Date:	2014-02-21 13:27:33
Message-ID:	20140221132733.GB28858@alap3.anarazel.de
Lists:	pgsql-hackers

On 2014-02-21 08:16:59 -0500, Robert Haas wrote:
> On Fri, Feb 21, 2014 at 6:07 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> > I can sympathize with the "too much during init" argument, but I don't
> > see how moving stuff to the first call would get rid of the problems. If
> > we fail later it's going to be just as confusing.
>
> No, it isn't. If you fail during init the user will expect the slot to
> be gone. That's the reason for all of this complexity. If you fail
> on first use, the user will expect the slot to still be there.

The primary case for failing is a plugin that either doesn't exist or
fails to initialize, or a user aborting the init. It seems odd that a
created slot fails because of a bad plugin or needs to wait till it
finds a suitable snapshot record. We could add an intermediary call like
pg_startup_logical_slot() but that doesn't seem to have much going for
it?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	Erik Rijkers <er(at)xs4all(dot)nl>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.6.1
Date:	2014-02-21 13:51:03
Message-ID:	CA+TgmoZ_gUGkx92NbNs5ui5QzbWrfL=fsTv7uTb0D+5eweVdqQ@mail.gmail.com
Lists:	pgsql-hackers

On Fri, Feb 21, 2014 at 8:27 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> On 2014-02-21 08:16:59 -0500, Robert Haas wrote:
>> On Fri, Feb 21, 2014 at 6:07 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> > I can sympathize with the "too much during init" argument, but I don't
>> > see how moving stuff to the first call would get rid of the problems. If
>> > we fail later it's going to be just as confusing.
>>
>> No, it isn't. If you fail during init the user will expect the slot to
>> be gone. That's the reason for all of this complexity. If you fail
>> on first use, the user will expect the slot to still be there.
>
> The primary case for failing is a plugin that either doesn't exist or
> fails to initialize, or a user aborting the init. It seems odd that a
> created slot fails because of a bad plugin or needs to wait till it
> finds a suitable snapshot record. We could add an intermediary call like
> pg_startup_logical_slot() but that doesn't seem to have much going for
> it?

Well, we can surely detect a plugin that fails to initialize before
creating the slot on disk, right?

I'm not sure what "fails to initialize" entails.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Erik Rijkers <er(at)xs4all(dot)nl>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.6.1
Date:	2014-02-21 13:56:39
Message-ID:	20140221135639.GD28858@alap3.anarazel.de
Lists:	pgsql-hackers

On 2014-02-21 08:51:03 -0500, Robert Haas wrote:
> On Fri, Feb 21, 2014 at 8:27 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> > On 2014-02-21 08:16:59 -0500, Robert Haas wrote:
> >> On Fri, Feb 21, 2014 at 6:07 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> >> > I can sympathize with the "too much during init" argument, but I don't
> >> > see how moving stuff to the first call would get rid of the problems. If
> >> > we fail later it's going to be just as confusing.
> >>
> >> No, it isn't. If you fail during init the user will expect the slot to
> >> be gone. That's the reason for all of this complexity. If you fail
> >> on first use, the user will expect the slot to still be there.
> >
> > The primary case for failing is a plugin that either doesn't exist or
> > fails to initialize, or a user aborting the init. It seems odd that a
> > created slot fails because of a bad plugin or needs to wait till it
> > finds a suitable snapshot record. We could add an intermediary call like
> > pg_startup_logical_slot() but that doesn't seem to have much going for
> > it?
>
> Well, we can surely detect a plugin that fails to initialize before
> creating the slot on disk, right?

We could detect whether the plugin .so can be loaded and provides the
required callbacks, but we can't initialize it.

> I'm not sure what "fails to initialize" entails.

elog(ERROR, "hey, the tables I require are missing");

or similar.
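
The "provides the required callbacks" part could be checked along these
lines. This is a sketch under assumptions: the struct, the init signature
and which callbacks count as required are all made up for illustration, and
the dlopen()/dlsym() step that would locate the init function is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback table filled in by a plugin's init function. */
typedef struct DemoPluginCallbacks
{
	void		(*startup_cb) (void);	/* assumed optional */
	void		(*begin_cb) (void);		/* assumed required */
	void		(*change_cb) (void);	/* assumed required */
	void		(*commit_cb) (void);	/* assumed required */
	void		(*shutdown_cb) (void);	/* assumed optional */
} DemoPluginCallbacks;

typedef void (*DemoPluginInit) (DemoPluginCallbacks *cb);

static void
demo_noop(void)
{
}

/* Example plugin that fills in everything that is required. */
void
demo_good_init(DemoPluginCallbacks *cb)
{
	cb->begin_cb = demo_noop;
	cb->change_cb = demo_noop;
	cb->commit_cb = demo_noop;
}

/* Example plugin that forgets the change callback. */
void
demo_incomplete_init(DemoPluginCallbacks *cb)
{
	cb->begin_cb = demo_noop;
	cb->commit_cb = demo_noop;
}

/*
 * Checks only that the callbacks are present. Whether the plugin's own
 * startup will later fail (missing tables etc.) cannot be known here.
 */
bool
demo_plugin_provides_callbacks(DemoPluginInit init)
{
	DemoPluginCallbacks cb = {0};

	if (init == NULL)
		return false;		/* e.g. symbol not found in the library */
	init(&cb);
	return cb.begin_cb != NULL &&
		cb.change_cb != NULL &&
		cb.commit_cb != NULL;
}
```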

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Jim Nasby <jim(at)nasby(dot)net>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	Erik Rijkers <er(at)xs4all(dot)nl>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.6.1
Date:	2014-02-21 21:14:15
Message-ID:	5307C1A7.7000809@nasby.net
Lists:	pgsql-hackers

On 2/17/14, 7:31 PM, Robert Haas wrote:
> But do you really want to keep that snapshot around long enough to
> copy the entire database? I bet you don't: if the database is big,
> holding back xmin for long enough to copy the whole thing isn't likely
> to be fun.

I can confirm that this would be epic fail, at least for londiste. It takes about 3 weeks for a new copy of a ~2TB database. There's no way that'd work with one snapshot. (Granted, copy performance in londiste is rather lackluster, but still...)
--
Jim C. Nasby, Data Architect jim(at)nasby(dot)net
512.569.9461 (cell) http://jim.nasby.net

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Erik Rijkers <er(at)xs4all(dot)nl>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.6.1
Date:	2014-02-24 14:48:26
Message-ID:	20140224144826.GC6718@awork2.anarazel.de
Lists:	pgsql-hackers

On 2014-02-15 17:29:04 -0500, Robert Haas wrote:
> On Fri, Feb 14, 2014 at 4:55 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:

> + /*
> + * XXX: It's impolite to ignore our argument and keep decoding until the
> + * current position.
> + */
>
> Eh, what?

So, the background here is that I was thinking of allowing a limit on the
number of returned rows to be specified. For the SQL interface that sounds
like a good idea. I am just not so sure anymore that allowing an LSN to be
specified as a limit is sufficient. Maybe simply allow limiting the number
of changes and check every time a transaction has been replayed?

It's all trivial codewise, I am just wondering about the interface most
users would want.
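
To make the last variant concrete, a toy sketch (demo_decode_with_limit is
a made-up name, and transactions are reduced to a count of changes each):
the limit is only looked at between transactions, so a transaction is never
returned partially and the result can overshoot the limit.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Replay transactions in order, checking the limit only at transaction
 * boundaries. max_changes <= 0 means "no limit". Returns the number of
 * changes returned and stores the number of transactions replayed.
 */
int
demo_decode_with_limit(const int *changes_per_txn, int ntxns,
					   int max_changes, int *txns_replayed)
{
	int			returned = 0;
	int			i;

	assert(changes_per_txn != NULL && txns_replayed != NULL);

	for (i = 0; i < ntxns; i++)
	{
		/* checked before starting the next transaction, never inside one */
		if (max_changes > 0 && returned >= max_changes)
			break;
		returned += changes_per_txn[i];
	}
	*txns_replayed = i;
	return returned;
}
```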

> + * We misuse the original meaning of SnapshotData's xip and
> subxip fields
> + * to make the more fitting for our needs.
> [...]
> + * XXX: Do we want extra fields instead of misusing existing
> ones instead?
>
> If we're going to do this, then it surely needs to be documented in
> snapshot.h. On the second question, you're not the first hacker to
> want to abuse the meanings of the existing fields; SnapshotDirty
> already does it. It's tempting to think we need a more principled
> approach to this, like what we've done with Node i.e. typedef enum ...
> SnapshotType; and then a separate struct definition for each kind, all
> beginning with SnapshotType type.

Hm, essentially that's what the ->satisfies pointer already is, right?

There's already documentation of the extra fields in snapbuild, but I
understand you'd rather have them moved?
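
Just so we're talking about the same thing, the Node-style layout would be
something like the following. The names are purely illustrative, none of
these types exist, and the fields are placeholders:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tag, analogous to NodeTag. */
typedef enum DemoSnapshotType
{
	DEMO_SNAPSHOT_MVCC,
	DEMO_SNAPSHOT_DIRTY,
	DEMO_SNAPSHOT_DECODING
} DemoSnapshotType;

/* Each kind gets its own struct, all beginning with the type tag. */
typedef struct DemoMVCCSnapshot
{
	DemoSnapshotType type;
	unsigned int xmin;
	unsigned int xmax;
} DemoMVCCSnapshot;

typedef struct DemoDecodingSnapshot
{
	DemoSnapshotType type;
	unsigned int ncommitted;	/* placeholder for decoding-specific state */
} DemoDecodingSnapshot;

/*
 * Read the tag from any of the structs above. A pointer to a struct can
 * be converted to a pointer to its first member, which is the tag.
 */
DemoSnapshotType
demo_snapshot_type(const void *snap)
{
	assert(snap != NULL);
	return *(const DemoSnapshotType *) snap;
}
```

Code would then switch on the tag, where today it compares the ->satisfies
function pointer.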

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Changeset Extraction v7.7
Date:	2014-02-24 15:11:31
Message-ID:	20140224151131.GD6718@awork2.anarazel.de
Lists:	pgsql-hackers

Hi,

Changes in this version include:
* changed slot error handling logic by introducing "ephemeral" slots which
get dropped on errors. This is the biggest change.
* added quoting in the test_decoding output plugin
* closing of a tight race condition during slot creation where WAL could
have been removed
* comment and other adjustments, many of them noticed by Robert

As always the result is pushed to the xlog-decoding-rebasing-remapping
branch on http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=summary

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment	Content-Type	Size
0001-wal_decoding-Introduce-logical-changeset-extraction.patch.gz	application/x-patch-gzip	121.9 KB
0002-wal_decoding-logical-changeset-extraction-walsender-.patch.gz	application/x-patch-gzip	12.8 KB
0003-wal_decoding-pg_recvlogical-Introduce-pg_receivexlog.patch.gz	application/x-patch-gzip	11.4 KB
0004-wal_decoding-Documentation-for-replication-slots-and.patch.gz	application/x-patch-gzip	10.6 KB
0005-wal_decoding-Temporarily-add-logical-decoding-regres.patch.gz	application/x-patch-gzip	1.4 KB

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	Erik Rijkers <er(at)xs4all(dot)nl>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.6.1
Date:	2014-02-24 17:50:03
Message-ID:	CA+TgmobJsRY=WTjeQc85sO9j=Dw__tfzNoX98YHnjv_tyacS5w@mail.gmail.com
Lists:	pgsql-hackers

On Mon, Feb 24, 2014 at 9:48 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> On 2014-02-15 17:29:04 -0500, Robert Haas wrote:
>> On Fri, Feb 14, 2014 at 4:55 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>
>> + /*
>> + * XXX: It's impolite to ignore our argument and keep decoding until the
>> + * current position.
>> + */
>>
>> Eh, what?
>
> So, the background here is that I was thinking of allowing to specify a
> limit for the number of returned rows. For the sql interface that sounds
> like a good idea. I am just not so sure anymore that allowing to specify
> a LSN as a limit is sufficient. Maybe simply allow to limit the number
> of changes and check everytime a transaction has been replayed?

The last idea there seems pretty sound, but ...

> It's all trivial codewise, I am just wondering about the interface most
> users would want.

...I can't swear it meets this criterion.

>> + * We misuse the original meaning of SnapshotData's xip and
>> subxip fields
>> + * to make the more fitting for our needs.
>> [...]
>> + * XXX: Do we want extra fields instead of misusing existing
>> ones instead?
>>
>> If we're going to do this, then it surely needs to be documented in
>> snapshot.h. On the second question, you're not the first hacker to
>> want to abuse the meanings of the existing fields; SnapshotDirty
>> already does it. It's tempting to think we need a more principled
>> approach to this, like what we've done with Node i.e. typedef enum ...
>> SnapshotType; and then a separate struct definition for each kind, all
>> beginning with SnapshotType type.
>
> Hm, essentially that's what the ->satisfies pointer already is, right?

Sorta, yeah. But with nodes, you can change the whole struct
definition for each type.
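
To illustrate the Node-style pattern mentioned here, a minimal sketch. All names are hypothetical and nothing below is taken from the patch or from snapshot.h; it only assumes that each variant struct begins with the type tag, so each kind can carry its own fields instead of reusing xip/subxip.

```c
#include <stdint.h>

/* Hypothetical names throughout; none of these exist in PostgreSQL. */
typedef enum SketchSnapshotType
{
    SKETCH_SNAPSHOT_MVCC,
    SKETCH_SNAPSHOT_DIRTY
} SketchSnapshotType;

/* Common header: every variant starts with the tag. */
typedef struct SketchSnapshot
{
    SketchSnapshotType type;
} SketchSnapshot;

/* Variant with its own field layout. */
typedef struct SketchMvccSnapshot
{
    SketchSnapshotType type;    /* must be first */
    uint32_t    xmin;
    uint32_t    xmax;
} SketchMvccSnapshot;

/* Another variant: a dedicated field instead of a reused one. */
typedef struct SketchDirtySnapshot
{
    SketchSnapshotType type;    /* must be first */
    uint32_t    in_progress_xid;
} SketchDirtySnapshot;

/* Dispatch on the tag, then cast to the matching variant. */
uint32_t
sketch_snapshot_xid_of_interest(const SketchSnapshot *snap)
{
    switch (snap->type)
    {
        case SKETCH_SNAPSHOT_MVCC:
            return ((const SketchMvccSnapshot *) snap)->xmin;
        case SKETCH_SNAPSHOT_DIRTY:
            return ((const SketchDirtySnapshot *) snap)->in_progress_xid;
    }
    return 0;                   /* unknown tag */
}
```

The difference from a bare ->satisfies pointer is that the tag selects the struct layout, not only the visibility routine.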

> There's already documentation of the extra fields in snapbuild, but I
> understand you'd rather have them moved?

Yeah, I think it needs to be documented where SnapshotData is defined.
There may be reason to mention it again, or in more detail,
elsewhere. But there should be some mention of it there.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	"pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.7
Date:	2014-02-24 22:06:53
Message-ID:	CA+TgmoZkqSdQ=pN3NHr07t_U8WV99ieDULAOcjfHYE7nK=i=5w@mail.gmail.com
Lists:	pgsql-hackers

On Mon, Feb 24, 2014 at 10:11 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> Changes in this version include:
> * changed slot error handling log by introducing "ephermal" slots which
> get dropped on errors. This is the biggest change.
> * added quoting in the test_decoding output plugin
> * closing of a tight race condition during slot creation where WAL could
> have been removed
> * comment and other adjustments, many of them noticed by robert

I did another read-through of this this afternoon, focusing on the
stuff you changed and parts I hadn't looked at carefully yet.
Comments below.

Documentation needs to be updated for pg_stat_replication view.

I still think pg_create_logical_replication_slot should be in slotfuncs.c.

/* Size of an indirect datum that contains an indirect TOAST pointer */
#define INDIRECT_POINTER_SIZE (VARHDRSZ_EXTERNAL + sizeof(struct varatt_indirect))

+/* Size of an indirect datum that contains a standard TOAST pointer */
+#define INDIRECT_POINTER_SIZE (VARHDRSZ_EXTERNAL + sizeof(struct varatt_indirect))

Isn't the new hunk a duplicate of the existing definition, except for
a one-word change to the comment?

I don't think the completely-unsecured directory operations in
test_decoding_regsupport.c are acceptable. Tom fought tooth and nail
to make sure that similar capabilities in adminpack carried meaningful
security restrictions.

/*
+ * Check whether there are, possibly unconnected, logical slots that refer
+ * to the to-be-dropped database. The database lock we are holding
+ * prevents the creation of new slots using the database.
+ */
+ if (ReplicationSlotsCountDBSlots(db_id, &nslots, &nslots_active))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_IN_USE),
+ errmsg("database \"%s\" is used in a logical decoding slot",
+ dbname),
+ errdetail("There are %d slot(s), %d of them active",
+ nslots, nslots_active)));

What are you going to do when we get around to supporting this on a
standby? Whatever the answer is, maybe add a TODO comment.

+ * loop for now..
+ * more than twice..

Extra periods.

+ * The replicatio slot mechanism is used to prevent removal of required

Typo.

+
+ /*
+ * GetRunningTransactionData() acquired ProcArrayLock, we must release
+ * it. We can do that before inserting the WAL record because
+ * ProcArrayApplyRecoveryInfo can recheck the commit status using the
+ * clog. If we're doing logical replication we can't do that though, so
+ * hold the lock for a moment longer.
+ */
+ if (wal_level < WAL_LEVEL_LOGICAL)
+ LWLockRelease(ProcArrayLock);
+
recptr = LogCurrentRunningXacts(running);

+ /* Release lock if we kept it longer ... */
+ if (wal_level >= WAL_LEVEL_LOGICAL)
+ LWLockRelease(ProcArrayLock);
+

This seems unfortunate. The comment should clearly explain why it's necessary.

+ /*
+ * Startup logical state, needs to be setup now so we have proper data
+ * during restore.
+ */
+ StartupReorderBuffer();

Should add blank line before this.

+ CheckPointSnapBuild();
+ CheckpointLogicalRewriteHeap();

Shouldn't the capitalization be consistent?

- heap_page_prune_opt(scan->rs_rd, buffer, RecentGlobalXmin);
+ if (IsSystemRelation(scan->rs_rd)
+ || RelationIsAccessibleInLogicalDecoding(scan->rs_rd))
+ heap_page_prune_opt(scan->rs_rd, buffer, RecentGlobalXmin);
+ else
+ heap_page_prune_opt(scan->rs_rd, buffer, RecentGlobalDataXmin);

Instead of changing the callers of heap_page_prune_opt() in this way,
I think it might be better to change heap_page_prune_opt() to take
only the first two of its current three parameters; everybody's just
passing RecentGlobalXmin right now anyway. Then, we could change the
first check in heap_page_prune_opt() to check first whether
PageIsPrunable(page, RecentGlobalDataXmin). If not, give up. If so,
then check that (!IsSystemRelation(scan->rs_rd) &&
!RelationIsAccessibleInLogicalDecoding(scan->rs_rd)) ||
PageIsPrunable(page, RecentGlobalXmin). The advantage of this is
that we avoid code duplication, and we avoid checking a couple of
conditions if pd_prune_xid is very recent.
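
The two-step check described above could look roughly like the sketch below. It is not the patch's code: the types and function names are hypothetical stand-ins, xids are compared as plain integers (no wraparound handling), and it assumes the catalog horizon is never newer than the data horizon.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t SketchXid;     /* hypothetical; 0 means "no prune hint" */

typedef struct SketchPage
{
    SketchXid   prune_xid;      /* stands in for the page's prune hint */
} SketchPage;

typedef struct SketchRel
{
    bool        is_system;              /* stands in for IsSystemRelation() */
    bool        accessible_in_decoding; /* RelationIsAccessibleInLogicalDecoding() */
} SketchRel;

/* Stand-in for PageIsPrunable(): hint is set and older than the horizon. */
bool
sketch_page_is_prunable(const SketchPage *page, SketchXid horizon)
{
    return page->prune_xid != 0 && page->prune_xid < horizon;
}

/*
 * The horizon choice lives in one place instead of in every caller.
 * catalog_xmin plays the role of RecentGlobalXmin, data_xmin the role of
 * RecentGlobalDataXmin.
 */
bool
sketch_should_prune(const SketchRel *rel, const SketchPage *page,
                    SketchXid catalog_xmin, SketchXid data_xmin)
{
    /* Cheap first test: a very recent hint fails even the newer horizon. */
    if (!sketch_page_is_prunable(page, data_xmin))
        return false;

    /* Ordinary tables are done; catalog-like ones need the older horizon. */
    return (!rel->is_system && !rel->accessible_in_decoding)
        || sketch_page_is_prunable(page, catalog_xmin);
}
```

With a recent prune hint the relation flags are never inspected, which is the saving mentioned above.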

- if (nrels > 0 || nmsgs > 0 || RelcacheInitFileInval || forceSyncCommit)
+ if (nrels > 0 || nmsgs > 0 || RelcacheInitFileInval || forceSyncCommit ||
+ XLogLogicalInfoActive())

Mmph. Is this really necessary? If so, why? The comments could elucidate.

+ bool fail_softly = slot->data.persistency == RS_EPHEMERAL;

This should be contingent on whether we're being called in the error
pathway, not the slot type. I think you should pass a bool.

There are a bunch of places where you're testing IsSystemRelation() ||
RelationIsAccessibleInLogicalDecoding(). Maybe we need a macro
encapsulating that test, with a name chosen to explain the point of it.
It seems to be indicating, roughly, whether the relation should
participate in RecentGlobalXmin or RecentGlobalDataXmin. But is there
any point at all of separating those when !XLogLogicalInfoActive()?
The test expands to:

IsSystemRelation() || (XLogLogicalInfoActive() &&
RelationNeedsWAL(relation) && (IsCatalogRelation(relation) ||
RelationIsUsedAsCatalogTable(relation)))

So basically this is all tables created in pg_catalog during initdb
plus all TOAST tables in the system. If wal_level=logical, then we
also include tables marked with the reloption user_catalog_table=true,
unless they're unlogged. This all seems a bit complex. Why not this:

IsSystemRelation() || RelationIsUsedAsCatalogTable(relation)

And why not this?

IsCatalogRelation() || RelationIsUsedAsCatalogTable(relation)

i.e. is it really necessary to include all TOAST tables, or does it
suffice to include TOAST tables of system catalogs? I bet you're
going to tell me that we don't know which TOAST tables pertain to
user-catalog tables, and thus must include them all. Ugh.
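
For concreteness, a sketch of what such an encapsulating macro might look like, following the expanded test quoted above. The macro name, the struct, and the global flag are all hypothetical stand-ins for the real predicates (IsSystemRelation(), IsCatalogRelation(), RelationNeedsWAL(), RelationIsUsedAsCatalogTable(), XLogLogicalInfoActive()).

```c
#include <stdbool.h>

/* Hypothetical stand-in for the relation properties the test looks at. */
typedef struct SketchRelation
{
    bool        is_system;          /* IsSystemRelation() */
    bool        is_catalog;         /* IsCatalogRelation() */
    bool        needs_wal;          /* RelationNeedsWAL() */
    bool        user_catalog_table; /* RelationIsUsedAsCatalogTable() */
} SketchRelation;

/* Stand-in for XLogLogicalInfoActive(), i.e. wal_level=logical. */
bool        sketch_logical_info_active = false;

/*
 * True if the relation has to be kept under the catalog horizon
 * (RecentGlobalXmin) rather than the data horizon (RecentGlobalDataXmin).
 */
#define SketchRelationUsesCatalogHorizon(rel) \
    ((rel)->is_system || \
     (sketch_logical_info_active && (rel)->needs_wal && \
      ((rel)->is_catalog || (rel)->user_catalog_table)))
```

Callers would then test one name instead of repeating the two-part condition.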

+ /*
+ * It's important *not* to track decoding tasks here because
+ * snapbuild.c uses ->oldestRunningXid to manage its xmin. If it
+ * were to be included here the initial value could never
+ * increase.
+ */

This is not clear, and it uses the " ->member" syntax which I find
confusing and inelegant.

lazy_vacuum_rel() takes the relation as an argument, so why does it
need the caller to compute IsSystemRelation(onerel) ||
RelationIsAccessibleInLogicalDecoding(onerel)?

Header comment for ReplicationSlotDropAcquired() is bogus.

ReplicationSlotDropAcquired() can easily avoid using a "goto" with a
short else block. I'd suggest if rename() == 0 then fsync() else
ereport().
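
The shape I have in mind, sketched with plain POSIX calls instead of the backend's own helpers. The function name and its arguments are hypothetical, fprintf() stands in for ereport(), and syncing the parent directory is an assumption about what needs to be made durable.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Rename oldpath to newpath and make the rename durable by syncing the
 * parent directory. Returns 0 on success, -1 after reporting the failure.
 */
int
sketch_rename_then_fsync(const char *oldpath, const char *newpath,
                         const char *parentdir)
{
    if (rename(oldpath, newpath) == 0)
    {
        int         fd = open(parentdir, O_RDONLY);

        if (fd >= 0)
        {
            (void) fsync(fd);   /* flush the directory entry change */
            close(fd);
        }
        return 0;
    }
    else
    {
        /* The error path is a short else block; no goto is needed. */
        fprintf(stderr, "could not rename \"%s\" to \"%s\": %s\n",
                oldpath, newpath, strerror(errno));
        return -1;
    }
}
```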

+ * GetOldestSafeDecodingTransactionId -- lowest xid not affected by vacuum

It seems to me that this is the lowest XID known not to have been
pruned, whether by vacuum or otherwise.

+ /* ----
+ * This is a bit tricky: We need to determine a safe xmin horizon to start
+ * decoding from, to avoid starting from a running xacts record referring
+ * to xids whose rows have been vacuumed or pruned
+ * already. GetOldestSafeDecodingTransactionId() returns such a value, but
+ * without further interlock it's return value might immediately be out of
+ * date.
+ *
+ * So we have to acquire the ProcArrayLock to prevent computation of new
+ * xmin horizons by other backends, get the safe decoding xid, and inform
+ * the slot machinery about the new limit. Once that's done the
+ * ProcArrayLock can be be released as the slot machinery now is
+ * protecting against vacuum.
+ * ----
+ */

I can't claim to be very excited about this. I'm assuming you've
spent a lot of time thinking about ways to avoid this and utterly
failed to come up with any reasonable alternative, but let me take a
shot. Suppose we take ProcArrayLock in exclusive mode and compute the
oldest running XID, install it as our xmin, and then release
ProcArrayLock. At that point, nobody else can compute an oldest-xmin
value that precedes that value, so we can take our time installing
that value as the slot's xmin, without needing to hold a lock
meanwhile.
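
A rough, single-threaded sketch of that two-step ordering. Everything here is hypothetical: a bool stands in for ProcArrayLock, small arrays stand in for the proc array, and xids are compared without wraparound handling. It also does not deal with a backend that already has an xmin set.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_MAX_BACKENDS 4

typedef uint32_t SketchXid;     /* hypothetical; 0 means "none" */

typedef struct SketchProcArray
{
    bool        lock_held;      /* stand-in for ProcArrayLock */
    SketchXid   running_xid[SKETCH_MAX_BACKENDS];
    SketchXid   xmin[SKETCH_MAX_BACKENDS];
    SketchXid   next_xid;       /* upper bound if nothing is running */
} SketchProcArray;

typedef struct SketchSlot
{
    SketchXid   xmin;
} SketchSlot;

/*
 * Step 1: with the lock held, compute the oldest running xid and advertise
 * it as our own xmin, then release the lock.
 */
SketchXid
sketch_reserve_xmin(SketchProcArray *arr, int me)
{
    SketchXid   oldest;
    int         i;

    arr->lock_held = true;
    oldest = arr->next_xid;
    for (i = 0; i < SKETCH_MAX_BACKENDS; i++)
    {
        if (arr->running_xid[i] != 0 && arr->running_xid[i] < oldest)
            oldest = arr->running_xid[i];
    }
    arr->xmin[me] = oldest;
    arr->lock_held = false;
    return oldest;
}

/*
 * Step 2: done without the lock. The advertised xmin from step 1 is what
 * keeps other backends from computing a newer horizon in the meantime.
 */
void
sketch_install_slot_xmin(SketchSlot *slot, SketchXid reserved)
{
    slot->xmin = reserved;
}
```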

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	"pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.7
Date:	2014-02-24 23:16:05
Message-ID:	20140224231605.GO6718@awork2.anarazel.de
Lists:	pgsql-hackers

Hi,

On 2014-02-24 17:06:53 -0500, Robert Haas wrote:
> I still think pg_create_logical_replication_slot should be in slotfuncs.c.

Ok, I don't feel too strongly, so I can change it. I wanted to keep
logical/ stuff out of slotfuncs.c, but there's not really a strong
reason for that.

> I don't think the completely-unsecured directory operations in
> test_decoding_regsupport.c are acceptable. Tom fought tooth and nail
> to make sure that similar capabilities in adminpack carried meaningful
> security restrictions.

I actually thought they'd be too ugly to live and we'd remove them
pre-commit.

There's no security problem though afaics, since they aren't actually
created, and you need to be superuser to create C functions.

> /*
> + * Check whether there are, possibly unconnected, logical
> slots that refer
> + * to the to-be-dropped database. The database lock we are holding
> + * prevents the creation of new slots using the database.
> + */
> + if (ReplicationSlotsCountDBSlots(db_id, &nslots, &nslots_active))
> + ereport(ERROR,
> + (errcode(ERRCODE_OBJECT_IN_USE),
> + errmsg("database \"%s\" is used in a
> logical decoding slot",
> + dbname),
> + errdetail("There are %d slot(s), %d
> of them active",
> + nslots, nslots_active)));
>
> What are you going to do when we get around to supporting this on a
> standby? Whatever the answer is, maybe add a TODO comment.

I think it should actually mostly work out: anybody actively connected
to a slot will be kicked off (normal HS mechanisms)... But the slot would
currently live on, which obviously isn't nice.

Will add TODO.

> + /*
> + * GetRunningTransactionData() acquired ProcArrayLock, we must release
> + * it. We can do that before inserting the WAL record because
> + * ProcArrayApplyRecoveryInfo can recheck the commit status using the
> + * clog. If we're doing logical replication we can't do that though, so
> + * hold the lock for a moment longer.
> + */
> + if (wal_level < WAL_LEVEL_LOGICAL)
> + LWLockRelease(ProcArrayLock);
> +
> recptr = LogCurrentRunningXacts(running);
>
> + /* Release lock if we kept it longer ... */
> + if (wal_level >= WAL_LEVEL_LOGICAL)
> + LWLockRelease(ProcArrayLock);
> +

> This seems unfortunate. The comment should clearly explain why it's necessary.

There's another (existing) comment on top of the function giving a bit
more context, but I'll expand.

I'd actually prefer to remove that special case altogether, I don't
have much trust in those codepaths for HS... But that's not an argument
I want to fight out right now.

> - heap_page_prune_opt(scan->rs_rd, buffer, RecentGlobalXmin);
> + if (IsSystemRelation(scan->rs_rd)
> + || RelationIsAccessibleInLogicalDecoding(scan->rs_rd))
> + heap_page_prune_opt(scan->rs_rd, buffer, RecentGlobalXmin);
> + else
> + heap_page_prune_opt(scan->rs_rd, buffer, RecentGlobalDataXmin);
>
> Instead of changing the callers of heap_page_prune_opt() in this way,
> I think it might be better to change heap_page_prune_opt() to take
> only the first two of its current three parameters; everybody's just
> passing RecentGlobalXmin right now anyway.

Sounds like a plan.

> - if (nrels > 0 || nmsgs > 0 || RelcacheInitFileInval ||
> forceSyncCommit)
> + if (nrels > 0 || nmsgs > 0 || RelcacheInitFileInval ||
> forceSyncCommit ||
> + XLogLogicalInfoActive())
>
> Mmph. Is this really necessary? If so, why? The comments could elucidate.

We could get rid of it by (optionally) adding information about the
database oid to compact commit, but that'd increase the size of the
record.

> + bool fail_softly = slot->data.persistency == RS_EPHEMERAL;
>
> This should be contingent on whether we're being called in the error
> pathway, not the slot type. I think you should pass a bool.

Why? I had it that way at first, but for persistent slots this won't be
called in error pathways as we won't drop there.

> There are a bunch of places where you're testing IsSystemRelation() ||
> RelationIsAccessibleInLogicalDecoding(). Maybe we need a macro
> encapsulating that test, with a name chose to explain the point of it.

Sounds like an idea.

> It seems to be indicating, roughly, whether the relation should
> participate in RecentGlobalXmin or RecentGlobalDataXmin. But is there
> any point at all of separating those when !XLogLogicalInfoActive()?
> The test expands to:
>
> IsSystemRelation() || (XLogLogicalInfoActive() &&
> RelationNeedsWAL(relation) && (IsCatalogRelation(relation) ||
> RelationIsUsedAsCatalogTable(relation)))
>
> So basically this is all tables created in pg_catalog during initdb
> plus all TOAST tables in the system. If wal_level=logical, then we
> also include tables marked with the reloption user_catalog_table=true,
> unless they're unlogged. This all seems a bit complex. Why not this:
>
> IsSystemRelation() || || RelationIsUsedAsCatalogTable(relation)

Because that'd possibly retain too much when !XLogLogicalInfoActive(),
there's no need to look for RelationIsUsedAsCatalogTable() for those. We
could decide not to care?

> And why not this?
>
> IsCatalogRelation() || || RelationIsUsedAsCatalogTable(relation)
>
> i.e. is it really necessary to include all TOAST tables, or does it
> suffice to include TOAST tables of system catalogs?

The latter would suffice.

> I bet you're
> going to tell me that we don't know which TOAST tables pertain to
> user-catalog tables, and thus must include them all. Ugh.

Not sure offhand, but if that's an issue, it needs to be fixed when
setting the option. I dimly remember thinking about it, and convincing
myself it's not an issue.

> + * GetOldestSafeDecodingTransactionId -- lowest xid not affected by vacuum
>
> It seems to me that this is the lowest XID known not to have been
> pruned, whether by vacuum or otherwise.

Hm, yes, mentioning pruning makes sense.

> + /* ----
> + * This is a bit tricky: We need to determine a safe xmin
> horizon to start
> + * decoding from, to avoid starting from a running xacts
> record referring
> + * to xids whose rows have been vacuumed or pruned
> + * already. GetOldestSafeDecodingTransactionId() returns such
> a value, but
> + * without further interlock it's return value might
> immediately be out of
> + * date.
> + *
> + * So we have to acquire the ProcArrayLock to prevent computation of new
> + * xmin horizons by other backends, get the safe decoding xid,
> and inform
> + * the slot machinery about the new limit. Once that's done the
> + * ProcArrayLock can be be released as the slot machinery now is
> + * protecting against vacuum.
> + * ----
> + */
>
> I can't claim to be very excited about this.

Because of the already_locked parameters, or any wider concerns?

> I'm assuming you've
> spent a lot of time thinking about ways to avoid this and utterly
> failed to come up with any reasonable alternative, but let me take a
> shot. Suppose we take ProcArrayLock in exclusive mode and compute the
> oldest running XID, install it as our xmin, and then release
> ProcArrayLock. At that point, nobody else can compute an oldest-xmin
> value that precedes that value, so we can take our time installing
> that value as the slot's xmin, without needing to hold a lock
> meanwhile.

I actually had it that way for a while, but what if the backend already
has a xmin set? Then we need to reason about whether the xmin is newer,
restore it afterwards and such. That doesn't seem nice.

Since the time holding the lock isn't long (we're just iterating over
the slots), I am not too worried?

Thanks for the review! Will address ASAP.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	"pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.7
Date:	2014-02-25 18:47:49
Message-ID:	CA+TgmoYXjGA6CKz8wnp0fDY2hKKCp6f3a4=UkkoCw+2TmOhZDg@mail.gmail.com
Lists:	pgsql-hackers

On Mon, Feb 24, 2014 at 6:16 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> I actually thought they'd be too ugly to live and we'd remove them
> pre-commit.

Might be getting to be about that time, then.

>> - if (nrels > 0 || nmsgs > 0 || RelcacheInitFileInval ||
>> forceSyncCommit)
>> + if (nrels > 0 || nmsgs > 0 || RelcacheInitFileInval ||
>> forceSyncCommit ||
>> + XLogLogicalInfoActive())
>>
>> Mmph. Is this really necessary? If so, why? The comments could elucidate.
>
> We could get rid of it by (optionally) adding information about the
> database oid to compact commit, but that'd increase the size of the
> record.

So why do we need the database OID?

>> + bool fail_softly = slot->data.persistency == RS_EPHEMERAL;
>>
>> This should be contingent on whether we're being called in the error
>> pathway, not the slot type. I think you should pass a bool.
>
> Why? I had it that way at first, but for persistent slots this won't be
> called in error pathways as we won't drop there.

I was thinking more the reverse - that a non-persistent slot might be
dropped in a non-error pathway.

>> It seems to be indicating, roughly, whether the relation should
>> participate in RecentGlobalXmin or RecentGlobalDataXmin. But is there
>> any point at all of separating those when !XLogLogicalInfoActive()?
>> The test expands to:
>>
>> IsSystemRelation() || (XLogLogicalInfoActive() &&
>> RelationNeedsWAL(relation) && (IsCatalogRelation(relation) ||
>> RelationIsUsedAsCatalogTable(relation)))
>>
>> So basically this is all tables created in pg_catalog during initdb
>> plus all TOAST tables in the system. If wal_level=logical, then we
>> also include tables marked with the reloption user_catalog_table=true,
>> unless they're unlogged. This all seems a bit complex. Why not this:
>>
>> IsSystemRelation() || || RelationIsUsedAsCatalogTable(relation)
>
> Because that'd possibly retain too much when !XLogLogicalInfoActive(),
> there's no need to look for RelationIsUsedAsCatalogTable() for those. We
> could decide not to care?

But when !XLogLogicalInfoActive() I think we could just make this
always false, right? I mean, if PROC_IN_LOGICAL_DECODING is never
going to be set, the values are always going to be the same anyway.
I think.

>> + /* ----
>> + * This is a bit tricky: We need to determine a safe xmin
>> horizon to start
>> + * decoding from, to avoid starting from a running xacts
>> record referring
>> + * to xids whose rows have been vacuumed or pruned
>> + * already. GetOldestSafeDecodingTransactionId() returns such
>> a value, but
>> + * without further interlock it's return value might
>> immediately be out of
>> + * date.
>> + *
>> + * So we have to acquire the ProcArrayLock to prevent computation of new
>> + * xmin horizons by other backends, get the safe decoding xid,
>> and inform
>> + * the slot machinery about the new limit. Once that's done the
>> + * ProcArrayLock can be be released as the slot machinery now is
>> + * protecting against vacuum.
>> + * ----
>> + */
>>
>> I can't claim to be very excited about this.
>
> Because of the already_locked parameters, or any wider concerns?

Passing down already_locked through several layers is kind of ugly,
but also, holding ProcArrayLock more is sad. That is not a
lightly-contended lock.

>> I'm assuming you've
>> spent a lot of time thinking about ways to avoid this and utterly
>> failed to come up with any reasonable alternative, but let me take a
>> shot. Suppose we take ProcArrayLock in exclusive mode and compute the
>> oldest running XID, install it as our xmin, and then release
>> ProcArrayLock. At that point, nobody else can compute an oldest-xmin
>> value that precedes that value, so we can take our time installing
>> that value as the slot's xmin, without needing to hold a lock
>> meanwhile.
>
> I actually had it that way for a while, but what if the backend already
> has a xmin set? Then we need to reason about whether the xmin is newer,
> restore it afterwards and such. That doesn't seem nice.

It's not too far removed from the problem snapmgr.c is already
designed to solve, though, is it?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	"pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.7
Date:	2014-02-25 19:39:06
Message-ID:	20140225193906.GV6718@awork2.anarazel.de
Lists:	pgsql-hackers

Hi,

On 2014-02-25 13:47:49 -0500, Robert Haas wrote:
> On Mon, Feb 24, 2014 at 6:16 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> > I actually thought they'd be too ugly to live and we'd remove them
> > pre-commit.
>
> Might be getting to be about that time, then.

I want to leave them in until the slot semantics aren't going to change
anymore, they are pretty useful for testing that. But I'll separate them out
into a separate commit again.

> >> - if (nrels > 0 || nmsgs > 0 || RelcacheInitFileInval ||
> >> forceSyncCommit)
> >> + if (nrels > 0 || nmsgs > 0 || RelcacheInitFileInval ||
> >> forceSyncCommit ||
> >> + XLogLogicalInfoActive())
> >>
> >> Mmph. Is this really necessary? If so, why? The comments could elucidate.
> >
> > We could get rid of it by (optionally) adding information about the
> > database oid to compact commit, but that'd increase the size of the
> > record.
>
> So why do we need the database OID?

To ignore commits from other databases. Since we don't decode changes
from other databases, it's really confusing (and pointless overhead) to
see transactions from there.

> >> + bool fail_softly = slot->data.persistency == RS_EPHEMERAL;
> >>
> >> This should be contingent on whether we're being called in the error
> >> pathway, not the slot type. I think you should pass a bool.
> >
> > Why? I had it that way at first, but for persistent slots this won't be
> > called in error pathways as we won't drop there.
>
> I was thinking more the reverse - that a non-persistent slot might be
> dropped in a non-error pathway.

Well, currently EPHEMERAL slots are documented to be dropped at release
since that's what changeset extraction (and possibly basebackup and
receivexlog) need afaics. You'd prefer DROP_ON_ERROR semantics?

> >> It seems to be indicating, roughly, whether the relation should
> >> participate in RecentGlobalXmin or RecentGlobalDataXmin. But is there
> >> any point at all of separating those when !XLogLogicalInfoActive()?
> >> The test expands to:
> >>
> >> IsSystemRelation() || (XLogLogicalInfoActive() &&
> >> RelationNeedsWAL(relation) && (IsCatalogRelation(relation) ||
> >> RelationIsUsedAsCatalogTable(relation)))
> >>
> >> So basically this is all tables created in pg_catalog during initdb
> >> plus all TOAST tables in the system. If wal_level=logical, then we
> >> also include tables marked with the reloption user_catalog_table=true,
> >> unless they're unlogged. This all seems a bit complex. Why not this:
> >>
> >> IsSystemRelation() || || RelationIsUsedAsCatalogTable(relation)
> >
> > Because that'd possibly retain too much when !XLogLogicalInfoActive(),
> > there's no need to look for RelationIsUsedAsCatalogTable() for those. We
> > could decide not to care?
>
> But when !XLogLogicalInfoActive() I think we could just make this
> always false, right? I mean, if PROC_IN_LOGICAL_DECODING is never
> going to be set, the values are always going to be the same anyway.
> I think.

It seems confusing and bug-prone to use the wrong horizon variable just
because right now they'd be the same if wal_level < logical.

> >> I can't claim to be very excited about this.
> >
> > Because of the already_locked parameters, or any wider concerns?
>
> Passing down already_locked through several layers is kind of ugly,
> but also, holding ProcArrayLock more is sad. That is not a
> lightly-contended lock.

Absolutely true, but this is very far from an operation that will be
frequent enough to matter. Creating slots so frequently that a lock on
the procarray, held while iterating the slot array, matters will be
painful long before the contention on that lock is the problem.

> >> I'm assuming you've
> >> spent a lot of time thinking about ways to avoid this and utterly
> >> failed to come up with any reasonable alternative, but let me take a
> >> shot. Suppose we take ProcArrayLock in exclusive mode and compute the
> >> oldest running XID, install it as our xmin, and then release
> >> ProcArrayLock. At that point, nobody else can compute an oldest-xmin
> >> value that precedes that value, so we can take our time installing
> >> that value as the slot's xmin, without needing to hold a lock
> >> meanwhile.
> >
> > I actually had it that way for a while, but what if the backend already
> > has a xmin set? Then we need to reason about whether the xmin is newer,
> > restore it afterwards and such. That doesn't seem nice.
>
> It's not too far removed from the problem snapmgr.c is already
> designed to solve, though, is it?

Hm, I don't immediately see how it would fit in there. PgXact->xmin is
set by procarray.c, all snapmgr does is reset it. And there's no logic
about resetting it back to higher values and such.

I'll ponder getting rid of this, but my hopes are not too high.

Thanks,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	"pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.7
Date:	2014-02-26 17:29:19
Message-ID:	20140226172919.GD14104@awork2.anarazel.de
Lists:	pgsql-hackers

On 2014-02-24 17:06:53 -0500, Robert Haas wrote:
> - heap_page_prune_opt(scan->rs_rd, buffer, RecentGlobalXmin);
> + if (IsSystemRelation(scan->rs_rd)
> + || RelationIsAccessibleInLogicalDecoding(scan->rs_rd))
> + heap_page_prune_opt(scan->rs_rd, buffer, RecentGlobalXmin);
> + else
> + heap_page_prune_opt(scan->rs_rd, buffer, RecentGlobalDataXmin);
>
> Instead of changing the callers of heap_page_prune_opt() in this way,
> I think it might be better to change heap_page_prune_opt() to take
> only the first two of its current three parameters; everybody's just
> passing RecentGlobalXmin right now anyway.

I've changed stuff this way, and it indeed looks better.

I am wondering about the related situation of GetOldestXmin()
callers. There's a fair bit of duplicated logic in the callers, before
but especially after this patchset. What about adding a 'Relation rel'
parameter instead of `allDbs' and `systable'? That keeps the logic
centralized and there's been a fair amount of talk about vacuum
optimizations that could also use it.
It's a bit sad that that requires including rel.h from procarray.h...

What do you think? Isolated patch attached.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment	Content-Type	Size
0001-fixup-wal_decoding-Introduce-logical-changeset-extra.patch	text/x-patch	13.0 KB

From:	Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.7
Date:	2014-02-26 18:30:55
Message-ID:	20140226183055.GF4759@eldon.alvh.no-ip.org
Lists:	pgsql-hackers

Andres Freund escribió:

> I am wondering about the related situation of GetOldestXmin()
> callers. There's a fair bit of duplicated logic in the callers, before
> but especially after this patchset. What about adding 'Relation rel'
> parameter instead of `allDbs' and `systable'? That keeps the logic
> centralized and there's been a fair amount of talk about vacuum
> optimizations that could also use it.
> It's a bit sad that that requires including rel.h from procarray.h...

relcache.h, not rel.h.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.7
Date:	2014-02-26 18:32:30
Message-ID:	20140226183230.GA6718@awork2.anarazel.de
Lists:	pgsql-hackers

On 2014-02-26 15:30:55 -0300, Alvaro Herrera wrote:
> Andres Freund wrote:
>
> > I am wondering about the related situation of GetOldestXmin()
> > callers. There's a fair bit of duplicated logic in the callers, before
> > but especially after this patchset. What about adding 'Relation rel'
> > parameter instead of `allDbs' and `systable'? That keeps the logic
> > centralized and there's been a fair amount of talk about vacuum
> > optimizations that could also use it.
> > It's a bit sad that that requires including rel.h from procarray.h...
>
> relcache.h, not rel.h.

RelationData is declared in rel.h, not relcache.h, no?

Alternatively we could just forward declare it in the header...
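
For illustration, a minimal sketch of the forward-declaration approach
in plain C. The function name oldest_xmin_for(), the struct field and
the returned values are hypothetical stand-ins, not the contents of
procarray.h or rel.h; the point is only that a header which passes
Relation around as a pointer needs the typedef, not the full struct
definition.

```c
#include <stdbool.h>
#include <stddef.h>

/* --- what a header such as procarray.h would need (sketch) --- */
struct RelationData;                     /* forward declaration only */
typedef struct RelationData *Relation;   /* opaque pointer type */

/* Hypothetical prototype: compiles without the struct definition. */
unsigned int oldest_xmin_for(Relation rel);

/* --- what the implementing .c file would see (sketch) --- */
struct RelationData
{
    bool is_shared;                      /* hypothetical field */
};

unsigned int
oldest_xmin_for(Relation rel)
{
    /* NULL or shared relation: consider all databases (made-up values) */
    if (rel == NULL || rel->is_shared)
        return 100;
    return 200;
}
```

Only the file that dereferences the pointer has to see the full struct,
so the header itself stays free of the heavier include.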

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.7
Date:	2014-02-26 20:10:56
Message-ID:	20140226201056.GH4759@eldon.alvh.no-ip.org
Lists:	pgsql-hackers

Andres Freund wrote:
> On 2014-02-26 15:30:55 -0300, Alvaro Herrera wrote:
> > Andres Freund wrote:
> >
> > > I am wondering about the related situation of GetOldestXmin()
> > > callers. There's a fair bit of duplicated logic in the callers, before
> > > but especially after this patchset. What about adding 'Relation rel'
> > > parameter instead of `allDbs' and `systable'? That keeps the logic
> > > centralized and there's been a fair amount of talk about vacuum
> > > optimizations that could also use it.
> > > It's a bit sad that that requires including rel.h from procarray.h...
> >
> > relcache.h, not rel.h.
>
> RelationData is declared in rel.h, not relcache.h, no?

Sure, but with your patch AFAICT procarray.h header only needs Relation,
which is declared in relcache.h.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	"pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.7
Date:	2014-02-27 02:33:13
Message-ID:	CA+TgmobOZg1HTw0P3V_2Kt_vp1zrfs_hNZVjxQZheUVutehWzw@mail.gmail.com
Lists:	pgsql-hackers

On Wed, Feb 26, 2014 at 12:29 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> On 2014-02-24 17:06:53 -0500, Robert Haas wrote:
>> - heap_page_prune_opt(scan->rs_rd, buffer, RecentGlobalXmin);
>> + if (IsSystemRelation(scan->rs_rd)
>> + || RelationIsAccessibleInLogicalDecoding(scan->rs_rd))
>> + heap_page_prune_opt(scan->rs_rd, buffer, RecentGlobalXmin);
>> + else
>> + heap_page_prune_opt(scan->rs_rd, buffer, RecentGlobalDataXmin);
>>
>> Instead of changing the callers of heap_page_prune_opt() in this way,
>> I think it might be better to change heap_page_prune_opt() to take
>> only the first two of its current three parameters; everybody's just
>> passing RecentGlobalXmin right now anyway.
>
> I've changed stuff this way, and it indeed looks better.
>
> I am wondering about the related situation of GetOldestXmin()
> callers. There's a fair bit of duplicated logic in the callers, before
> but especially after this patchset. What about adding 'Relation rel'
> parameter instead of `allDbs' and `systable'? That keeps the logic
> centralized and there's been a fair amount of talk about vacuum
> optimizations that could also use it.
> It's a bit sad that that requires including rel.h from procarray.h...
>
> What do you think? Isolated patch attached.

Seems reasonable to me.

+ * considered, but for non-shared non-shared relations that's not required,

Duplicate word.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Erik Rijkers <er(at)xs4all(dot)nl>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.6.1
Date:	2014-02-27 16:06:00
Message-ID:	20140227160600.GI28858@alap3.anarazel.de
Lists:	pgsql-hackers

On 2014-02-24 12:50:03 -0500, Robert Haas wrote:
> On Mon, Feb 24, 2014 at 9:48 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> > On 2014-02-15 17:29:04 -0500, Robert Haas wrote:
> >> On Fri, Feb 14, 2014 at 4:55 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> >
> >> + /*
> >> + * XXX: It's impolite to ignore our argument and keep decoding until the
> >> + * current position.
> >> + */
> >>
> >> Eh, what?
> >
> > So, the background here is that I was thinking of allowing to specify a
> > limit for the number of returned rows. For the sql interface that sounds
> > like a good idea. I am just not so sure anymore that allowing to specify
> > a LSN as a limit is sufficient. Maybe simply allow to limit the number
> > of changes and check every time a transaction has been replayed?
>
> The last idea there seems pretty sound, but ...
>
> > It's all trivial codewise, I am just wondering about the interface most
> > users would want.
>
> ...I can't swear it meets this criterion.

So, it's now:
CREATE OR REPLACE FUNCTION pg_logical_slot_get_changes(
IN slotname name, IN upto_lsn pg_lsn, IN upto_nchanges int, VARIADIC options text[] DEFAULT '{}',
OUT location pg_lsn, OUT xid xid, OUT data text)
RETURNS SETOF RECORD
LANGUAGE INTERNAL
VOLATILE ROWS 1000 COST 1000
AS 'pg_logical_slot_get_changes';

If non-null, upto_lsn limits decoding based on the LSN; upto_nchanges
works the same way for the number of changes.
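
A rough sketch of how two such limits could interact, assuming (as
floated earlier in the thread) that they are only checked once a whole
transaction has been replayed. DecodedTxn and rows_emitted() are
hypothetical names used for illustration, and 0 stands in for SQL NULL;
this is not the code in the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one decoded transaction. */
typedef struct DecodedTxn
{
    uint64_t commit_lsn;    /* position of the commit record */
    int      nchanges;      /* rows this transaction produces */
} DecodedTxn;

/*
 * Return how many rows would be emitted.  A limit of 0 means "no limit"
 * in this sketch (the SQL function would use NULL for that).
 */
int
rows_emitted(const DecodedTxn *txns, size_t ntxns,
             uint64_t upto_lsn, int upto_nchanges)
{
    int emitted = 0;

    for (size_t i = 0; i < ntxns; i++)
    {
        /* stop before a transaction that commits past upto_lsn */
        if (upto_lsn != 0 && txns[i].commit_lsn > upto_lsn)
            break;

        /* a transaction is never split in the middle */
        emitted += txns[i].nchanges;

        /* the row limit is only checked at transaction boundaries */
        if (upto_nchanges != 0 && emitted >= upto_nchanges)
            break;
    }
    return emitted;
}
```

Under these assumptions the caller can get back more rows than
upto_nchanges, because the transaction that crosses the limit is still
returned in full.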

Makes sense?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	Erik Rijkers <er(at)xs4all(dot)nl>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.6.1
Date:	2014-02-27 16:28:22
Message-ID:	CA+TgmoYeYRyNt+3khCqw-6hb8NzqSvpUDt6FzaZJ3zQ+AMFStQ@mail.gmail.com
Lists:	pgsql-hackers

On Thu, Feb 27, 2014 at 11:06 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> On 2014-02-24 12:50:03 -0500, Robert Haas wrote:
>> On Mon, Feb 24, 2014 at 9:48 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> > On 2014-02-15 17:29:04 -0500, Robert Haas wrote:
>> >> On Fri, Feb 14, 2014 at 4:55 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> >
>> >> + /*
>> >> + * XXX: It's impolite to ignore our argument and keep decoding until the
>> >> + * current position.
>> >> + */
>> >>
>> >> Eh, what?
>> >
>> > So, the background here is that I was thinking of allowing to specify a
>> > limit for the number of returned rows. For the sql interface that sounds
>> > like a good idea. I am just not so sure anymore that allowing to specify
>> > a LSN as a limit is sufficient. Maybe simply allow to limit the number
>> > of changes and check every time a transaction has been replayed?
>>
>> The last idea there seems pretty sound, but ...
>>
>> > It's all trivial codewise, I am just wondering about the interface most
>> > users would want.
>>
>> ...I can't swear it meets this criterion.
>
> So, it's now:
> CREATE OR REPLACE FUNCTION pg_logical_slot_get_changes(
> IN slotname name, IN upto_lsn pg_lsn, IN upto_nchanges int, VARIADIC options text[] DEFAULT '{}',
> OUT location pg_lsn, OUT xid xid, OUT data text)
> RETURNS SETOF RECORD
> LANGUAGE INTERNAL
> VOLATILE ROWS 1000 COST 1000
> AS 'pg_logical_slot_get_changes';
>
> If non-null, upto_lsn limits decoding based on the LSN; upto_nchanges
> works the same way for the number of changes.
>
> Makes sense?

Time will tell, but it seems plausible to me.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Changeset Extraction v7.8
Date:	2014-02-27 16:56:08
Message-ID:	20140227165608.GJ28858@alap3.anarazel.de
Lists:	pgsql-hackers

Hi,

Attached you can find version 7.8 of this patchset. Changes since 7.7
include:
* Signature changes of the SQL changeset SRFs to support limits based on
LSN and/or number of returned rows (pg_logical_slot_get_changes() et
al) and to make parameter passing optional (by adding a DEFAULT '{}'
to the variadic argument)
* heap_page_prune_opt() now decides itself which horizon to use,
removing a good amount of duplicated logic
* GetOldestXmin() now has a Relation parameter that can be NULL instead
of the former allDbs (existing in master) and systable (just this
branch) parameters, also removing code duplication.
* pg_create_logical_replication_slot() is now defined in slotfuncs.c
* a fair number of cosmetic and comment changes
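
As an illustration of the heap_page_prune_opt() item above, here is a
minimal sketch of a callee picking the horizon itself, following the
condition from the diff quoted earlier in the thread. MockRelation, its
fields, prune_horizon_for() and the two constant values are hypothetical
stand-ins; the real function works on a Relation and the real
RecentGlobalXmin / RecentGlobalDataXmin variables.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Made-up values standing in for the two horizons named in the thread. */
static const TransactionId recent_global_xmin = 500;       /* catalog-safe */
static const TransactionId recent_global_data_xmin = 800;  /* plain user data */

/* Hypothetical stand-in for the relation properties that matter here. */
typedef struct MockRelation
{
    bool is_system;                /* cf. IsSystemRelation() */
    bool accessible_in_decoding;   /* cf. RelationIsAccessibleInLogicalDecoding() */
} MockRelation;

/* The callee decides; callers no longer pass a horizon. */
TransactionId
prune_horizon_for(const MockRelation *rel)
{
    if (rel->is_system || rel->accessible_in_decoding)
        return recent_global_xmin;
    return recent_global_data_xmin;
}
```

With the decision in one place, a caller cannot accidentally pass the
wrong horizon for a catalog table.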

The open issues that I know of are:
* do we modify struct SnapshotData to be polymorphic based on some tag
or move comments there?
* How/whether to change the exclusive lock on the ProcArrayLock in
CreateInitDecodingContext()

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Changeset Extraction v7.8
Date:	2014-02-27 16:58:52
Message-ID:	20140227165852.GA18320@alap3.anarazel.de
Lists:	pgsql-hackers

On 2014-02-27 17:56:08 +0100, Andres Freund wrote:
> Hi,
>
> Attached you can find version 7.8 of this patchset. Changes since 7.7
> include:

Hrmpf, prematurely hit send.

> * Signature changes of the SQL changeset SRFs to support limits based on
> LSN and/or number of returned rows (pg_logical_slot_get_changes() et
> al) and to make parameter passing optional (by adding a DEFAULT '{}'
> to the variadic argument)
> * heap_page_prune_opt() now decides itself which horizon to use,
> removing a good amount of duplicated logic
> * GetOldestXmin() now has a Relation parameter that can be NULL instead
> of the former allDbs (existing in master) and systable (just this
> branch) parameters, also removing code duplication.
> * pg_create_logical_replication_slot() is now defined in slotfuncs.c
> * a fair number of cosmetic and comment changes
* the probably-not-committable slot changes are split off into an extra patch

> The open issues that I know of are:
> * do we modify struct SnapshotData to be polymorphic based on some tag
> or move comments there?
> * How/whether to change the exclusive lock on the ProcArrayLock in

As usual, the branch xlog-decoding-rebasing-remapping on
http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git
contains the latest and greatest.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment	Content-Type	Size
0001-wal_decoding-Introduce-logical-changeset-extraction.patch.gz	application/x-patch-gzip	123.9 KB
0002-wal_decoding-logical-changeset-extraction-walsender-.patch.gz	application/x-patch-gzip	12.8 KB
0003-wal_decoding-pg_recvlogical-Introduce-pg_receivexlog.patch.gz	application/x-patch-gzip	11.4 KB
0004-wal_decoding-Documentation-for-replication-slots-and.patch.gz	application/x-patch-gzip	10.9 KB
0005-wal_decoding-Temporarily-add-some-uncommittable-slot.patch.gz	application/x-patch-gzip	2.4 KB
0006-wal_decoding-Temporarily-add-logical-decoding-regres.patch.gz	application/x-patch-gzip	1.4 KB

From:	Thom Brown <thom(at)linux(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.8
Date:	2014-02-27 16:59:36
Message-ID:	CAA-aLv4L7F3831J7sXwoB8R6N_03qPXrJ6fEoR_MrkV+JL8H0w@mail.gmail.com
Lists:	pgsql-hackers

On 27 February 2014 16:56, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:

> Hi,
>
> Attached you can find version 7.8 of this patchset. Changes since 7.7
> include:

Try again? :)

--
Thom

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Changeset Extraction v7.9
Date:	2014-03-03 16:26:52
Message-ID:	20140303162652.GB16654@awork2.anarazel.de
Lists:	pgsql-hackers

Hi,

On 2014-02-27 17:56:08 +0100, Andres Freund wrote:
> * do we modify struct SnapshotData to be polymorphic based on some tag
> or move comments there?

I tried that, and it got far too invasive. So I've updated the relevant
comment in snapshot.h, inl

> * How/whether to change the exclusive lock on the ProcArrayLock in
> CreateInitDecodingContext()

I looked at this, and I believe the current code is the best
solution. It's pretty far away from any hot codepath and it's a short
operation. I liked the idea about using snapmgr.c for this in principle,
but it doesn't have enough smarts by far...

So, attached is the newest version:
* Management of historic/timetravel snapshot is now done by snapmgr.c,
not tqual.c anymore. No ->satisfies pointers are redirected anymore
* removal of the "suspend" logic for historic snapshots; instead, the one
place that needed it now explicitly uses a snapshot
* removal of some pointless CREATE EXTENSIONs from the regression tests
* splitoff of the slot tests that aren't committable into a separate
commit.
* minor doc adjustments

I am not aware of any further things that need to be fixed now (in
contrast to features for later releases of which there are aplenty).

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment	Content-Type	Size
0001-wal_decoding-Introduce-logical-changeset-extraction.patch.gz	application/x-patch-gzip	125.1 KB
0002-wal_decoding-logical-changeset-extraction-walsender-.patch.gz	application/x-patch-gzip	12.8 KB
0003-wal_decoding-pg_recvlogical-Introduce-pg_receivexlog.patch.gz	application/x-patch-gzip	11.4 KB
0004-wal_decoding-Documentation-for-replication-slots-and.patch.gz	application/x-patch-gzip	10.8 KB
0005-wal_decoding-Temporarily-add-some-uncommittable-slot.patch.gz	application/x-patch-gzip	2.4 KB
0006-wal_decoding-Temporarily-add-logical-decoding-regres.patch.gz	application/x-patch-gzip	1.4 KB

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	"pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.9
Date:	2014-03-03 21:48:15
Message-ID:	CA+TgmoaEVKeudWKAK3ZjPY3JyXEt8_OVHO+c5JJnnn0ZHL+pVg@mail.gmail.com
Lists:	pgsql-hackers

On Mon, Mar 3, 2014 at 11:26 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> On 2014-02-27 17:56:08 +0100, Andres Freund wrote:
>> * do we modify struct SnapshotData to be polymorphic based on some tag
>> or move comments there?
>
> I tried that, and it got far too invasive. So I've updated the relevant
> comment in snapshot.h, inl
>
>> * How/whether to change the exclusive lock on the ProcArrayLock in
>> CreateInitDecodingContext()
>
> I looked at this, and I believe the current code is the best
> solution. It's pretty far away from any hot codepath and it's a short
> operation. I liked the idea about using snapmgr.c for this in principle,
> but it doesn't have enough smarts by far...
>
> So, attached is the newest version:
> * Management of historic/timetravel snapshot is now done by snapmgr.c,
> not tqual.c anymore. No ->satisfies pointers are redirected anymore
> * removal of the "suspend" logic for historic snapshots; instead, the one
> place that needed it now explicitly uses a snapshot
> * removal of some pointless CREATE EXTENSIONs from the regression tests
> * splitoff of the slot tests that aren't committable into a separate
> commit.
> * minor doc adjustments
>
> I am not aware of any further things that need to be fixed now (in
> contrast to features for later releases of which there are aplenty).

OK, I've committed the 0001 patch, which is the core of this feature,
with a bit of minor additional hacking.

I'm sure there are some problems here yet and some things that people
will want fixed, as is inevitable for any patch of this size. But I
don't have any confidence that further postponing commit is going to
be the best way to find those issues, so in it goes.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc:	"pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.9
Date:	2014-03-03 22:43:25
Message-ID:	20140303224325.GJ17253@awork2.anarazel.de
Lists:	pgsql-hackers

Hi Robert, Everyone!

On 2014-03-03 16:48:15 -0500, Robert Haas wrote:
> OK, I've committed the 0001 patch, which is the core of this feature,
> with a bit of minor additional hacking.

Many, many, thanks!

> I'm sure there are some problems here yet and some things that people
> will want fixed, as is inevitable for any patch of this size. But I
> don't have any confidence that further postponing commit is going to
> be the best way to find those issues, so in it goes.

Unsurprisingly I do agree with this. It's a big feature, and there's
imperfection. But I think it's a good start.

A very first such imperfection is that the buildfarm doesn't actually
exercise make check in contribs, just make installcheck... Which this
patch doesn't use because the tests require wal_level=logical and
max_replication_slots >= 2. Andrew said on IRC that maybe it's a good
idea to add a make-contrib-check stage to the buildfarm.

A patch fixing a couple of absolutely trivial things is attached.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment	Content-Type	Size
trivial-fixups.patch	text/x-patch	2.2 KB

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>
Cc:	"pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.9.1
Date:	2014-03-04 23:26:02
Message-ID:	20140304232602.GG27273@alap3.anarazel.de
Lists:	pgsql-hackers

On 2014-03-03 16:48:15 -0500, Robert Haas wrote:
> OK, I've committed the 0001 patch, which is the core of this feature,
> with a bit of minor additional hacking.

Attached are the rebased patches that are remaining.

Changes:
* minor conflict due to 7558cc95d31edb
* removal of the last XXX in the walsender patch by setting the
timestamps in the 'd' messages correctly.
* Some documentation wordsmithing by Craig

The walsender patch currently contains the changes about feedback we
argued about elsewhere; I guess I either need to back them out, or we
need to argue out that minor bit.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment	Content-Type	Size
0001-Minor-regression-test-fixup-for-7e8db2dc4.patch	text/x-patch	1.0 KB
0002-logical-decoding-walsender-interface.patch	text/x-patch	45.9 KB
0003-Introduce-pg_receivexlog-equivalent-for-logical-deco.patch	text/x-patch	45.8 KB
0004-Documentation-for-logical-decoding.patch	text/x-patch	42.1 KB

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.9.1
Date:	2014-03-05 18:49:23
Message-ID:	CA+TgmoaBc+UtEBNdnY0mCP5sjJmvPjfWjYUDk-q0CfoMa0pPQA@mail.gmail.com
Lists:	pgsql-hackers

On Tue, Mar 4, 2014 at 6:26 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> On 2014-03-03 16:48:15 -0500, Robert Haas wrote:
>> OK, I've committed the 0001 patch, which is the core of this feature,
>> with a bit of minor additional hacking.
>
> Attached are the rebased patches that are remaining.
>
> Changes:
> * minor conflict due to 7558cc95d31edb
> * removal of the last XXX in the walsender patch by setting the
> timestamps in the 'd' messages correctly.
> * Some documentation wordsmithing by Craig
>
> The walsender patch currently contains the changes about feedback we
> argued about elsewhere; I guess I either need to back them out, or we
> need to argue out that minor bit.

OK, reading through the walsender patch (0002 in this series):

PLEASE stop using a comma to join two independent thoughts. Don't do
it in the comments, and definitely don't do it in error messages. I'm
referring to things like this: "invalid value for option
\"replication\", legal values are false, 0, true, 1 or database". I
know that you're not a native English speaker, and if you were
submitting a smaller amount of code I would just fix it for you,
but you do this A LOT and I've probably fixed a hundred instances of
it already and I can't cope with fixing another hundred. In code
comments, a semicolon is often an adequate substitute, but even
with that change this won't do for an error message. For that, you
should copy the style of something done elsewhere. For example, in
this instance, perhaps look to this precedent:

rhaas=# set synchronous_commit = barfle;
ERROR: invalid value for parameter "synchronous_commit": "barfle"
HINT: Available values: local, remote_write, on, off.
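
A small self-contained sketch of that message shape, assuming the same
primary-message-plus-hint split as in the synchronous_commit example.
format_invalid_value() is a hypothetical helper built on snprintf(); in
the backend the two strings would instead be handed to errmsg() and
errhint() inside ereport().

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Build the primary message and the hint as two separate strings, so no
 * comma is needed to glue two independent thoughts together.
 * Returns 0 on success, -1 if either buffer was too small.
 */
int
format_invalid_value(char *msg, size_t msglen,
                     char *hint, size_t hintlen,
                     const char *param, const char *value,
                     const char *allowed)
{
    int n = snprintf(msg, msglen,
                     "invalid value for parameter \"%s\": \"%s\"",
                     param, value);
    int m = snprintf(hint, hintlen, "Available values: %s.", allowed);

    if (n < 0 || (size_t) n >= msglen || m < 0 || (size_t) m >= hintlen)
        return -1;
    return 0;
}
```

The list of legal values moves out of the primary message and into the
hint, which matches the precedent shown above.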

This patch still treats "allow a walsender to connect to a database"
as a separate feature from "allow logical replication". I'm not
convinced that's a good idea. What you're proposing to do is allow
replication=database in addition to replication=true and
replication=false. But how about instead allowing
replication=physical and replication=logical? "physical" can just be
a synonym for "true" and the database name can be ignored as it is
today. "logical" can pay attention to the database name. I'm not
totally wedded to that exact design, but basically, I'm not
comfortable with allowing a physical WAL sender to connect to a
database in advance of a concrete need. We might want to leave some
room to go there later if we think it's a likely direction, but
allowing people to do it in advance of any functional advantage just
seems like a recipe for bugs. Practically nobody will run that way so
breakage won't be timely detected. (And no, I don't know exactly what
will break.)
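
To make the proposed option values concrete, a minimal sketch of how
such a parameter could be mapped onto modes. ReplicationMode and
parse_replication_option() are hypothetical names for illustration, and
the mapping reflects only the alternative suggested in this message, not
necessarily what the patch ends up doing.

```c
#include <string.h>

typedef enum ReplicationMode
{
    REPL_NONE,       /* ordinary backend connection */
    REPL_PHYSICAL,   /* cluster-wide; database name ignored */
    REPL_LOGICAL,    /* per-database; database name is used */
    REPL_INVALID     /* caller should report an error */
} ReplicationMode;

ReplicationMode
parse_replication_option(const char *value)
{
    if (strcmp(value, "false") == 0 || strcmp(value, "0") == 0)
        return REPL_NONE;

    /* "physical" is just a synonym for the existing "true" */
    if (strcmp(value, "true") == 0 || strcmp(value, "1") == 0 ||
        strcmp(value, "physical") == 0)
        return REPL_PHYSICAL;

    if (strcmp(value, "logical") == 0)
        return REPL_LOGICAL;

    return REPL_INVALID;
}
```

In this scheme a physical walsender never attaches to a database, so
that combination is ruled out by construction rather than by convention.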

+ if (am_cascading_walsender && !RecoveryInProgress())
+ {
+ ereport(LOG,
+ (errmsg("terminating walsender process to force cascaded standby to update timeline and reconnect")));
+ walsender_ready_to_stop = true;
+ }

Does this apply to logical replication? Seems like it could at least
have a comment.

+ /*
+ * XXX: For feedback purposes it would be nicer to set sentPtr to
+ * cmd->startpoint, but we use it to know where to read xlog in the main
+ * loop...
+ */

I'm not sure I understand this.

WalSndWriteData() looks kind of cut-and-pasty.

WalSndWaitForWal() is yet another slightly-modified copy of the same
darn loop. Surely we need a better way of doing this. It's
absolutely inevitable that some future hacker will not patch every
copy of this loop in some situation where that is required.

There might be more; that's all I see at the moment.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Josh Berkus <josh(at)agliodbs(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.9.1
Date:	2014-03-05 18:57:53
Message-ID:	531773B1.6020301@agliodbs.com
Lists:	pgsql-hackers

On 03/05/2014 10:49 AM, Robert Haas wrote:
> This patch still treats "allow a walsender to connect to a database"
> as a separate feature from "allow logical replication". I'm not
> convinced that's a good idea. What you're proposing to do is allow
> replication=database in addition to replication=true and
> replication=false. But how about instead allowing
> replication=physical and replication=logical? "physical" can just be
> a synonym for "true" and the database name can be ignored as it is
> today. "logical" can pay attention to the database name. I'm not
> totally wedded to that exact design, but basically, I'm not
> comfortable with allowing a physical WAL sender to connect to a
> database in advance of a concrete need. We might want to leave some
> room to go there later if we think it's a likely direction, but
> allowing people to do it in advance of any functional advantage just
> seems like a recipe for bugs. Practically nobody will run that way so
> breakage won't be timely detected. (And no, I don't know exactly what
> will break.)

Personally, I'd prefer to just have the permission here governed by the
existing replication permission; why make things complicated for users?
But maybe Andres has some other requirement he's trying to fulfill?

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Josh Berkus <josh(at)agliodbs(dot)com>
Cc:	Andres Freund <andres(at)2ndquadrant(dot)com>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.9.1
Date:	2014-03-05 19:04:11
Message-ID:	CA+TgmoYjEJ8=gDMkF3MqEdW0cOyBFA_7qn-b88e=pr549bcPDQ@mail.gmail.com
Lists:	pgsql-hackers

On Wed, Mar 5, 2014 at 1:57 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> On 03/05/2014 10:49 AM, Robert Haas wrote:
>> This patch still treats "allow a walsender to connect to a database"
>> as a separate feature from "allow logical replication". I'm not
>> convinced that's a good idea. What you're proposing to do is allow
>> replication=database in addition to replication=true and
>> replication=false. But how about instead allowing
>> replication=physical and replication=logical? "physical" can just be
>> a synonym for "true" and the database name can be ignored as it is
>> today. "logical" can pay attention to the database name. I'm not
>> totally wedded to that exact design, but basically, I'm not
>> comfortable with allowing a physical WAL sender to connect to a
>> database in advance of a concrete need. We might want to leave some
>> room to go there later if we think it's a likely direction, but
>> allowing people to do it in advance of any functional advantage just
>> seems like a recipe for bugs. Practically nobody will run that way so
>> breakage won't be timely detected. (And no, I don't know exactly what
>> will break.)
>
> Personally, I'd prefer to just have the permission here governed by the
> existing replication permission; why make things complicated for users?
> But maybe Andres has some other requirement he's trying to fulfill?

This isn't about permissions; it's about the fact that physical
replication is cluster-wide, but logical replication is per-database.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.9.1
Date:	2014-03-05 20:04:17
Message-ID:	20140305200417.GJ27273@alap3.anarazel.de
Lists:	pgsql-hackers

Hi,

On 2014-03-05 13:49:23 -0500, Robert Haas wrote:
> PLEASE stop using a comma to join two independent thoughts.

Ok. I'll try.

Is this a personal preference, or a general rule? There seem to be a
fair number of comments in pg doing so?

> This patch still treats "allow a walsender to connect to a database"
> as a separate feature from "allow logical replication". I'm not
> convinced that's a good idea. What you're proposing to do is allow
> replication=database in addition to replication=true and
> replication=false. But how about instead allowing
> replication=physical and replication=logical? "physical" can just be
> a synonym for "true" and the database name can be ignored as it is
> today. "logical" can pay attention to the database name. I'm not
> totally wedded to that exact design, but basically, I'm not
> comfortable with allowing a physical WAL sender to connect to a
> database in advance of a concrete need. We might want to leave some
> room to go there later if we think it's a likely direction, but
> allowing people to do it in advance of any functional advantage just
> seems like a recipe for bugs. Practically nobody will run that way so
> breakage won't be timely detected. (And no, I don't know exactly what
> will break.)

I am only mildly against doing so, so you certainly can nudge me in that
direction.
Would you want to refuse using existing commands in logical mode? It's not
unrealistic to first want to perform a basebackup and then establish a
logical slot to replay from there on. It's probably not too bad to force
separate connections there, but it seems like a somewhat pointless
exercise to me?

> + if (am_cascading_walsender && !RecoveryInProgress())
> + {
> + ereport(LOG,
> + (errmsg("terminating walsender process to force cascaded standby to update timeline and reconnect")));
> + walsender_ready_to_stop = true;
> + }
>
> Does this apply to logical replication? Seems like it could at least
> have a comment.

I think it does make sense to force a disconnect in this case to
simplify code, but you're right, both a comment and some TLC for the
message are in order.

> WalSndWriteData() looks kind of cut-and-pasty.

You mean from the WalSndLoop? Yea. I tried to reduce it by introducing
WalSndCheckTimeOut() but I think at the very least
WalSndComputeTimeOut() is in order.

I very much dislike having the three different event loops, but it's
pretty much forced by the design of the xlogreader. "My" xlogreader
version didn't block when it needed to wait for WAL but just returned
"need input/output", but with the eventually committed version you're
pretty much forced to block inside the read_page callback.

I don't really have an idea how we could sensibly unify them atm.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.9.1
Date:	2014-03-05 22:05:24
Message-ID:	CA+TgmoZqm=h_NUdZ=TNDWF5JObhMwzDK=h4YdQNKVoFwo4TZ3g@mail.gmail.com
Lists:	pgsql-hackers

On Wed, Mar 5, 2014 at 3:04 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> Hi,
>
> On 2014-03-05 13:49:23 -0500, Robert Haas wrote:
>> PLEASE stop using a comma to join two independent thoughts.
>
> Ok. I'll try.
>
> Is this a personal preference, or a general rule? There seems to be a
> fair amount of comments in pg doing so?

http://en.wikipedia.org/wiki/Comma_splice

>> This patch still treats "allow a walsender to connect to a database"
>> as a separate feature from "allow logical replication". I'm not
>> convinced that's a good idea. What you're proposing to do is allow
>> replication=database in addition to replication=true and
>> replication=false. But how about instead allowing
>> replication=physical and replication=logical? "physical" can just be
>> a synonym for "true" and the database name can be ignored as it is
>> today. "logical" can pay attention to the database name. I'm not
>> totally wedded to that exact design, but basically, I'm not
>> comfortable with allowing a physical WAL sender to connect to a
>> database in advance of a concrete need. We might want to leave some
>> room to go there later if we think it's a likely direction, but
>> allowing people to do it in advance of any functional advantage just
>> seems like a recipe for bugs. Practically nobody will run that way so
>> breakage won't be timely detected. (And no, I don't know exactly what
>> will break.)
>
> I am only mildly against doing so, so you certainly can nudge me in that
> direction.
> Would you want to refuse using existing commands in logical mode? It's not
> unrealistic to first want to perform a basebackup and then establish a
> logical slot to replay from there on. It's probably not too bad to force
> separate connections there, but it seems like a somewhat pointless
> exercise to me?

Hmm, that's an interesting point. I didn't consider the case of a
base backup followed by replication, on the same connection. That
might be sufficient justification for doing it the way you have it.

>> WalSndWriteData() looks kind of cut-and-pasty.
>
> You mean from the WalSndLoop? Yea. I tried to reduce it by introducing
> WalSndCheckTimeOut() but I think at the very least
> WalSndComputeTimeOut() is in order.
>
> I very much dislike having the three different event loops, but it's
> pretty much forced by the design of the xlogreader. "My" xlogreader
> version didn't block when it needed to wait for WAL but just returned
> "need input/output", but with the eventually committed version you're
> pretty much forced to block inside the read_page callback.
>
> I don't really have an idea how we could sensibly unify them atm.

WalSndLoop(void (*gutsfn)())?
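
A minimal sketch of what a callback-driven loop along these lines might
look like. Everything here is hypothetical and for illustration only:
generic_wait_loop(), wait_done_fn and counter_state are invented names, not
walsender.c code, and the "common work" is reduced to a comment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback: returns true once the caller's wait condition holds. */
typedef bool (*wait_done_fn) (void *arg);

/*
 * Hypothetical shared loop skeleton.  Calls "done" once per pass until it
 * reports true or max_iterations passes have been made, and returns the
 * number of passes used.
 */
static int
generic_wait_loop(wait_done_fn done, void *arg, int max_iterations)
{
    int passes = 0;

    assert(done != NULL);

    while (passes < max_iterations)
    {
        passes++;
        /* shared work would go here: process replies, check timeouts, sleep */
        if (done(arg))
            break;
    }
    return passes;
}

/* Example condition: stop once a counter has reached its target. */
typedef struct
{
    int current;
    int target;
} counter_state;

static bool
counter_reached(void *arg)
{
    counter_state *st = arg;

    st->current++;
    return st->current >= st->target;
}
```

With a counter_state of {0, 3}, generic_wait_loop(counter_reached, &st, 10)
stops after three passes. The sketch only shows the shape of the interface;
it says nothing about whether the three real loops share enough work to fit
it.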

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.9.1
Date:	2014-03-05 22:20:57
Message-ID:	20140305222057.GC6010@alap3.anarazel.de
Lists:	pgsql-hackers

On 2014-03-05 17:05:24 -0500, Robert Haas wrote:
> > I very much dislike having the three different event loops, but it's
> > pretty much forced by the design of the xlogreader. "My" xlogreader
> > version didn't block when it needed to wait for WAL but just returned
> > "need input/output", but with the eventually committed version you're
> > pretty much forced to block inside the read_page callback.
> >
> > I don't really have an idea how we could sensibly unify them atm.
>
> WalSndLoop(void (*gutsfn)())?

The problem is that they are actually different. In the WalSndLoop we're
also maintaining the walsender's state, in WalSndWriteData() we're just
waiting for writes to be flushed, in WalSndWaitForWal we're primarily
waiting for the flush pointer to pass some LSN. And the timing of the
individual checks isn't trivial (just added some more comments about
it).

I'll simplify it by pulling out more common code, maybe it'll become
apparent how it should look.

Greetings,

Andres Freund

PS: So far I had considered my language rather poetic, that's why I used
the splicing comma so liberally... Thanks for the link.

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.9.1
Date:	2014-03-07 12:44:37
Message-ID:	20140307124437.GA22909@awork2.anarazel.de
Lists:	pgsql-hackers

Hi,

On 2014-03-05 23:20:57 +0100, Andres Freund wrote:
> On 2014-03-05 17:05:24 -0500, Robert Haas wrote:
> > > I very much dislike having the three different event loops, but it's
> > > pretty much forced by the design of the xlogreader. "My" xlogreader
> > > > version didn't block when it needed to wait for WAL but just returned
> > > "need input/output", but with the eventually committed version you're
> > > pretty much forced to block inside the read_page callback.
> > >
> > > > I don't really have an idea how we could sensibly unify them atm.
> >
> > WalSndLoop(void (*gutsfn)())?
>
> The problem is that they are actually different. In the WalSndLoop we're
> also maintaining the walsender's state, in WalSndWriteData() we're just
> waiting for writes to be flushed, in WalSndWaitForWal we're primarily
> waiting for the flush pointer to pass some LSN. And the timing of the
> individual checks isn't trivial (just added some more comments about
> it).
>
> I'll simplify it by pulling out more common code, maybe it'll become
> apparent how it should look.

I've attached a new version of the walsender patch. It's been rebased
on top of Heikki's latest commit to walsender.c. I've changed a fair bit
of stuff:
* The sleeptime is now computed to sleep until we either need to send a
keepalive or kill ourselves, as Heikki suggested.
* Sleep time computation, sending pings, and checking timeouts are now
done in separate functions.
* Comment and code style improvements.
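
The sleep time computation described in the first bullet can be sketched
roughly as follows. This is an illustration under stated assumptions, not
the code from the patch: compute_sleep_ms() and its parameters are
hypothetical names, and the choice to send the keepalive halfway through
the timeout is an assumption made for the example.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: milliseconds to sleep until the next deadline.
 *
 * Assumptions for this sketch: a keepalive is due once half the timeout
 * has passed since the last reply; if a keepalive was already sent, the
 * next deadline is the full timeout, at which point the connection would
 * be terminated.  Returns -1 when timeouts are disabled (timeout_ms == 0),
 * leaving the choice of sleep time to the caller.
 */
static int64_t
compute_sleep_ms(int64_t now_ms, int64_t last_reply_ms,
                 int64_t timeout_ms, int keepalive_sent)
{
    int64_t wakeup_ms;

    assert(timeout_ms >= 0);

    if (timeout_ms == 0)
        return -1;

    wakeup_ms = last_reply_ms + (keepalive_sent ? timeout_ms : timeout_ms / 2);

    /* Deadline already passed: do not sleep at all. */
    if (wakeup_ms <= now_ms)
        return 0;

    return wakeup_ms - now_ms;
}
```

For example, with a 60 second timeout and a reply received just now, the
sketch sleeps for 30 seconds before a keepalive would be due.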

Although the three loops are shorter and simpler now, I have not managed
to unify them. They seem to be too different to fold into one. I tried a
common function with a 'wait_for' bitmask argument, but that turned out
to be fairly illegible. The checks in WalSndWaitForWal() and WalSndLoop()
just seem to be too different.

I'd be grateful if you (or somebody else!) could have a quick look at
the body of the loops in WalSndWriteData(), WalSndWaitForWal() and
WalSndLoop(). Maybe I am just staring at it the wrong way.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment	Content-Type	Size
0001-Add-walsender-interface-for-the-logical-decoding-fun.patch	text/x-patch	50.5 KB

From:	Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.9.1
Date:	2014-03-07 13:17:21
Message-ID:	20140307131721.GC4759@eldon.alvh.no-ip.org
Lists:	pgsql-hackers

Andres Freund escribió:

> fprintf(stderr,
> - _("%s: could not identify system: got %d rows and %d fields, expected %d rows and %d fields\n"),
> - progname, PQntuples(res), PQnfields(res), 1, 3);
> + _("%s: could not identify system: got %d rows and %d fields, expected 1 row and 3 or more fields\n"),
> + progname, PQntuples(res), PQnfields(res));

Please don't change this. The reason these messages use %d and an extra
printf argument is to avoid giving translators extra work when the
number of rows or fields is changed. In these cases I suggest this:

> - _("%s: could not identify system: got %d rows and %d fields, expected %d rows and %d fields\n"),
> - progname, PQntuples(res), PQnfields(res), 1, 3);
> + _("%s: could not identify system: got %d rows and %d fields, expected %d rows and %d or more fields\n"),
> + progname, PQntuples(res), PQnfields(res), 1, 3);

(Yes, I know the "expected 1 rows" output looks a bit silly. Since this
is an unexpected error message anyway, I don't think that's worth
fixing.)
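
To illustrate the point about placeholders: the expected counts are passed
as arguments, so the format string (the part a translator sees) stays the
same if the numbers change later. A minimal standalone sketch;
format_identify_error() and the two macros are hypothetical names, and the
gettext _() wrapper used in the real code is left out so that the example
compiles on its own.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical constants; changing them does not alter the format string. */
#define EXPECTED_ROWS        1
#define MIN_EXPECTED_FIELDS  3

/*
 * Format the error message into buf.  Returns the snprintf() result, i.e.
 * the length the full message has (or would have had, if truncated).
 */
static int
format_identify_error(char *buf, size_t buflen, const char *progname,
                      int got_rows, int got_fields)
{
    assert(buf != NULL && buflen > 0);
    assert(progname != NULL && strlen(progname) > 0);

    return snprintf(buf, buflen,
                    "%s: could not identify system: got %d rows and %d fields, "
                    "expected %d rows and %d or more fields\n",
                    progname, got_rows, got_fields,
                    EXPECTED_ROWS, MIN_EXPECTED_FIELDS);
}
```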

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.9.1
Date:	2014-03-07 13:32:01
Message-ID:	20140307133201.GB22909@awork2.anarazel.de
Lists:	pgsql-hackers

On 2014-03-07 10:17:21 -0300, Alvaro Herrera wrote:
> Andres Freund escribió:
>
> > fprintf(stderr,
> > - _("%s: could not identify system: got %d rows and %d fields, expected %d rows and %d fields\n"),
> > - progname, PQntuples(res), PQnfields(res), 1, 3);
> > + _("%s: could not identify system: got %d rows and %d fields, expected 1 row and 3 or more fields\n"),
> > + progname, PQntuples(res), PQnfields(res));
>
> Please don't change this. The reason these messages use %d and an extra
> printf argument is to avoid giving translators extra work when the
> number of rows or fields is changed. In these cases I suggest this:
>
> > - _("%s: could not identify system: got %d rows and %d fields, expected %d rows and %d fields\n"),
> > - progname, PQntuples(res), PQnfields(res), 1, 3);
> > + _("%s: could not identify system: got %d rows and %d fields, expected %d rows and %d or more fields\n"),
> > + progname, PQntuples(res), PQnfields(res), 1, 3);
>
> (Yes, I know the "expected 1 rows" output looks a bit silly. Since this
> is an unexpected error message anyway, I don't think that's worth
> fixing.)

I changed it to not use placeholders because I thought "or more" was
specific enough to be unlikely to be used in other places, but I don't
have a problem with continuing to use them.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.9.1
Date:	2014-03-10 18:54:06
Message-ID:	CA+TgmoaMk78RqMP5=Oh++NWhYac+JK_9ohOoq0XWtQTeonodpg@mail.gmail.com
Lists:	pgsql-hackers

On Fri, Mar 7, 2014 at 7:44 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> I've attached a new version of the walsender patch. It's been rebased
> on top of Heikki's latest commit to walsender.c. I've changed a fair bit
> of stuff:
> * The sleeptime is now computed to sleep until we either need to send a
> keepalive or kill ourselves, as Heikki suggested.
> * Sleep time computation, sending pings, and checking timeouts are now
> done in separate functions.
> * Comment and code style improvements.
>
> Although the three loops are shorter and simpler now, I have not managed
> to unify them. They seem to be too different to fold into one. I tried a
> common function with a 'wait_for' bitmask argument, but that turned out
> to be fairly illegible. The checks in WalSndWaitForWal() and WalSndLoop()
> just seem to be too different.
>
> I'd be grateful if you (or somebody else!) could have a quick look at
> the body of the loops in WalSndWriteData(), WalSndWaitForWal() and
> WalSndLoop(). Maybe I am just staring at it the wrong way.

I've committed this patch now with a few further tweaks, leaving this
issue unaddressed. It may well be something that needs improvement,
but I don't think it's a big enough issue to justify holding back a
commit.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Andres Freund <andres(at)2ndquadrant(dot)com>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.9.1
Date:	2014-03-10 19:33:33
Message-ID:	20140310193332.GL4759@eldon.alvh.no-ip.org
Lists:	pgsql-hackers

Robert Haas escribió:

> I've committed this patch now with a few further tweaks, leaving this
> issue unaddressed. It may well be something that needs improvement,
> but I don't think it's a big enough issue to justify holding back a
> commit.

Hmm, is the buildfarm exercising any of this?

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc:	Andres Freund <andres(at)2ndquadrant(dot)com>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.9.1
Date:	2014-03-10 19:35:06
Message-ID:	CA+TgmoYQ9kkHaBh1tZ-jkR7h7NGXNcFUb++GnWEmMh1dpPzZbA@mail.gmail.com
Lists:	pgsql-hackers

On Mon, Mar 10, 2014 at 3:33 PM, Alvaro Herrera
<alvherre(at)2ndquadrant(dot)com> wrote:
> Robert Haas escribió:
>> I've committed this patch now with a few further tweaks, leaving this
>> issue unaddressed. It may well be something that needs improvement,
>> but I don't think it's a big enough issue to justify holding back a
>> commit.
>
> Hmm, is the buildfarm exercising any of this?

I think it isn't, apart from whether it builds. Apparently the
buildfarm only runs installcheck on contrib, not check. And the
test_decoding plugin only runs under installcheck, not check. Also,
it's not going to test walsender/walreceiver at all, but that's harder
to fix.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.9.1
Date:	2014-03-10 19:36:48
Message-ID:	20140310193648.GA17059@awork2.anarazel.de
Lists:	pgsql-hackers

Hi,

On 2014-03-10 16:33:33 -0300, Alvaro Herrera wrote:
> Robert Haas escribió:
> > I've committed this patch now with a few further tweaks, leaving this
> > issue unaddressed. It may well be something that needs improvement,
> > but I don't think it's a big enough issue to justify holding back a
> > commit.
>
> Hmm, is the buildfarm exercising any of this?

Not sufficiently yet, no. The logical decoding facilities themselves are
actually covered by tests in contrib/test_decoding, but due to the
issues mentioned in 20140303224325(dot)GJ17253(at)awork2(dot)anarazel(dot)de they
aren't run.
The walsender interface isn't tested at all, be it new or old
functionality. I have some hopes for Peter's client test patches
there...

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Josh Berkus <josh(at)agliodbs(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.9.1
Date:	2014-03-10 19:38:42
Message-ID:	531E14C2.6030803@agliodbs.com
Lists:	pgsql-hackers

On 03/10/2014 11:54 AM, Robert Haas wrote:
> I've committed this patch now with a few further tweaks, leaving this
> issue unaddressed. It may well be something that needs improvement,
> but I don't think it's a big enough issue to justify holding back a
> commit.

Wait, does this mean Changesets is committed? Or only part of it?

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Josh Berkus <josh(at)agliodbs(dot)com>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.9.1
Date:	2014-03-10 19:39:35
Message-ID:	20140310193935.GB17059@awork2.anarazel.de
Lists:	pgsql-hackers

On 2014-03-10 12:38:42 -0700, Josh Berkus wrote:
> On 03/10/2014 11:54 AM, Robert Haas wrote:
> > I've committed this patch now with a few further tweaks, leaving this
> > issue unaddressed. It may well be something that needs improvement,
> > but I don't think it's a big enough issue to justify holding back a
> > commit.
>
> Wait, does this mean Changesets is committed? Or only part of it?

The docs and pg_recvlogical aren't yet, everything else is. Working on
rebasing/copy-editing the former two right now.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Josh Berkus <josh(at)agliodbs(dot)com>
Cc:	Andres Freund <andres(at)2ndquadrant(dot)com>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.9.1
Date:	2014-03-10 19:46:52
Message-ID:	CA+TgmoZiS8NfjXoiE2F6ZS0nbA+NJg2ahx4f+2n=VHxVo3tFtA@mail.gmail.com
Lists:	pgsql-hackers

On Mon, Mar 10, 2014 at 3:38 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> On 03/10/2014 11:54 AM, Robert Haas wrote:
>> I've committed this patch now with a few further tweaks, leaving this
>> issue unaddressed. It may well be something that needs improvement,
>> but I don't think it's a big enough issue to justify holding back a
>> commit.
>
> Wait, does this mean Changesets is committed? Or only part of it?

The core of the feature was b89e151054a05f0f6d356ca52e3b725dd0505e53,
but that only allowed it through the SQL interface. The new commit,
8722017bbcbc95e311bbaa6d21cd028e296e5e35, makes it available via
walsender interface. There isn't a client for that interface yet, but
if you're wondering whether it's time to break out the champagne, I'm
thinking probably.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Josh Berkus <josh(at)agliodbs(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Andres Freund <andres(at)2ndquadrant(dot)com>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.9.1
Date:	2014-03-10 20:55:53
Message-ID:	531E26D9.9070304@agliodbs.com
Lists:	pgsql-hackers

On 03/10/2014 12:46 PM, Robert Haas wrote:
> On Mon, Mar 10, 2014 at 3:38 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>> On 03/10/2014 11:54 AM, Robert Haas wrote:
>>> I've committed this patch now with a few further tweaks, leaving this
>>> issue unaddressed. It may well be something that needs improvement,
>>> but I don't think it's a big enough issue to justify holding back a
>>> commit.
>>
>> Wait, does this mean Changesets is committed? Or only part of it?
>
> The core of the feature was b89e151054a05f0f6d356ca52e3b725dd0505e53,
> but that only allowed it through the SQL interface. The new commit,
> 8722017bbcbc95e311bbaa6d21cd028e296e5e35, makes it available via
> walsender interface. There isn't a client for that interface yet, but
> if you're wondering whether it's time to break out the champagne, I'm
> thinking probably.

Yeah, that's my thoughts. Although I might wait for recvlogical. Will
put documentation wordsmithing on my todo list once Andres commits.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Josh Berkus <josh(at)agliodbs(dot)com>
Cc:	Andres Freund <andres(at)2ndquadrant(dot)com>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.9.1
Date:	2014-03-10 21:08:56
Message-ID:	CA+TgmoZRWt7aa8M3ZyV7bMcTQSY_oSg-bUDMO1mvQNXpGEMpCA@mail.gmail.com
Lists:	pgsql-hackers

On Mon, Mar 10, 2014 at 4:55 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> On 03/10/2014 12:46 PM, Robert Haas wrote:
>> On Mon, Mar 10, 2014 at 3:38 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>>> On 03/10/2014 11:54 AM, Robert Haas wrote:
>>>> I've committed this patch now with a few further tweaks, leaving this
>>>> issue unaddressed. It may well be something that needs improvement,
>>>> but I don't think it's a big enough issue to justify holding back a
>>>> commit.
>>>
>>> Wait, does this mean Changesets is committed? Or only part of it?
>>
>> The core of the feature was b89e151054a05f0f6d356ca52e3b725dd0505e53,
>> but that only allowed it through the SQL interface. The new commit,
>> 8722017bbcbc95e311bbaa6d21cd028e296e5e35, makes it available via
>> walsender interface. There isn't a client for that interface yet, but
>> if you're wondering whether it's time to break out the champagne, I'm
>> thinking probably.
>
> Yeah, that's my thoughts. Although I might wait for recvlogical. Will
> put documentation wordsmithing on my todo list once Andres commits.

Is this your way of announcing that Andres is getting a commit bit, or
did you just mis-speak?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Andres Freund <andres(at)2ndquadrant(dot)com>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.9.1
Date:	2014-03-10 21:16:03
Message-ID:	20140310211603.GN4759@eldon.alvh.no-ip.org
Lists:	pgsql-hackers

Robert Haas escribió:
> On Mon, Mar 10, 2014 at 3:33 PM, Alvaro Herrera
> <alvherre(at)2ndquadrant(dot)com> wrote:
> > Robert Haas escribió:
> >> I've committed this patch now with a few further tweaks, leaving this
> >> issue unaddressed. It may well be something that needs improvement,
> >> but I don't think it's a big enough issue to justify holding back a
> >> commit.
> >
> > Hmm, is the buildfarm exercising any of this?
>
> I think it isn't, apart from whether it builds. Apparently the
> buildfarm only runs installcheck on contrib, not check. And the
> test_decoding plugin only runs under installcheck, not check. Also,
> it's not going to test walsender/walreceiver at all, but that's harder
> to fix.

So the buildfarm exercises pg_upgrade, to some extent, by way of a
custom module,
https://github.com/PGBuildFarm/client-code/blob/master/PGBuild/Modules/TestUpgrade.pm
As far as I can tell, test_decoding wants to do the same thing (i.e. get
make check to run). Is the best option to write a new TestLogical.pm
module for the buildfarm, or should we somehow think about how to
generalize the pg_upgrade trick so that animal caretakers can enable
runs of test_decoding by simply upgrading to a newer version of the
buildfarm script?

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Josh Berkus <josh(at)agliodbs(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Andres Freund <andres(at)2ndquadrant(dot)com>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.9.1
Date:	2014-03-10 21:52:50
Message-ID:	531E3432.5000208@agliodbs.com
Lists:	pgsql-hackers

On 03/10/2014 02:08 PM, Robert Haas wrote:
> On Mon, Mar 10, 2014 at 4:55 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>> Yeah, that's my thoughts. Although I might wait for recvlogical. Will
>> put documentation wordsmithing on my todo list once Andres commits.
>
> Is this your way of announcing that Andres is getting a commit bit, or
> did you just mis-speak?

Hah. No, I have no such knowledge. I was using "commit" as in the git
sense, as in "commits to his fork".

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Josh Berkus <josh(at)agliodbs(dot)com>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.9.1
Date:	2014-03-12 18:06:22
Message-ID:	20140312180622.GA10179@awork2.anarazel.de
Lists:	pgsql-hackers

On 2014-03-10 13:55:53 -0700, Josh Berkus wrote:
> On 03/10/2014 12:46 PM, Robert Haas wrote:
> > On Mon, Mar 10, 2014 at 3:38 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> >> On 03/10/2014 11:54 AM, Robert Haas wrote:
> >>> I've committed this patch now with a few further tweaks, leaving this
> >>> issue unaddressed. It may well be something that needs improvement,
> >>> but I don't think it's a big enough issue to justify holding back a
> >>> commit.
> >>
> >> Wait, does this mean Changesets is committed? Or only part of it?
> >
> > The core of the feature was b89e151054a05f0f6d356ca52e3b725dd0505e53,
> > but that only allowed it through the SQL interface. The new commit,
> > 8722017bbcbc95e311bbaa6d21cd028e296e5e35, makes it available via
> > walsender interface. There isn't a client for that interface yet, but
> > if you're wondering whether it's time to break out the champagne, I'm
> > thinking probably.
>
> Yeah, that's my thoughts. Although I might wait for recvlogical. Will
> put documentation wordsmithing on my todo list once Andres commits.

Heh, as Robert observed, no can do...

Attached are the collected remaining patches. The docs might need
further additions, but it seems better to add them now.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment	Content-Type	Size
0001-Fix-typo-in-Assert-statement-causing-SetTransactionS.patch	text/x-patch	1.5 KB
0002-Add-pg_recvlogical-a-commandline-tool-to-receive-dat.patch	text/x-patch	35.4 KB
0003-Documentation-for-logical-decoding.patch	text/x-patch	53.3 KB
0004-Adapt-test_decoding-s-documentation-to-last-minute-l.patch	text/x-patch	957 bytes
0005-Minor-test-decoding-comment-improvements.patch	text/x-patch	2.4 KB

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	Josh Berkus <josh(at)agliodbs(dot)com>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.9.1
Date:	2014-03-17 10:55:28
Message-ID:	CA+TgmoYKshsx_APXj3ht9AB8xn-_FJBVfdr=3Mwg1Mf9tsnGUQ@mail.gmail.com
Lists:	pgsql-hackers

On Wed, Mar 12, 2014 at 2:06 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> Attached are the collected remaining patches. The docs might need
> further additions, but it seems better to add them now.

A few questions about pg_recvlogical:

- There doesn't seem to be any provision for this tool to ever switch
from one output file to the next. That seems like a practical need.
One idea would be to have it respond to SIGHUP by reopening the
originally-named output file. Another would be to switch, after so
many bytes, to filename.1, then filename.2, etc.

- It confirms the write and flush positions, but doesn't appear to
actually flush anywhere.

- While I quite agree with your desire for stringinfo in src/common,
couldn't you use the roughly-equivalent PQExpBuffer facilities in
libpq instead?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Josh Berkus <josh(at)agliodbs(dot)com>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.9.1
Date:	2014-03-17 11:27:06
Message-ID:	20140317112706.GA16438@awork2.anarazel.de
Lists:	pgsql-hackers

On 2014-03-17 06:55:28 -0400, Robert Haas wrote:
> On Wed, Mar 12, 2014 at 2:06 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> > Attached are the collected remaining patches. The docs might need
> > further additions, but it seems better to add them now.
>
> A few questions about pg_recvlogical:
>
> - There doesn't seem to be any provision for this tool to ever switch
> from one output file to the next. That seems like a practical need.
> One idea would be to have it respond to SIGHUP by reopening the
> originally-named output file. Another would be to switch, after so
> many bytes, to filename.1, then filename.2, etc.

Hm. So far I haven't had the need, but you're right, it would be
useful. I don't like the .<n> notion, but SIGHUP would be fine with
me. I'll add that.
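
A rough sketch of the SIGHUP approach, assuming a POSIX system. This is not
the pg_recvlogical code: reopen_requested, install_sighup_handler() and
reopen_if_requested() are hypothetical names. The handler only sets a flag;
the file is reopened from the main loop, between writes.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L     /* for sigaction() under strict ISO C modes */
#endif

#include <assert.h>
#include <signal.h>
#include <stdio.h>

/* Set by the signal handler, consumed by the main loop. */
static volatile sig_atomic_t reopen_requested = 0;

static void
sighup_handler(int signo)
{
    (void) signo;
    reopen_requested = 1;   /* only async-signal-safe work in the handler */
}

/* Install the SIGHUP handler; returns 0 on success, -1 on failure. */
static int
install_sighup_handler(void)
{
    struct sigaction sa;

    sa.sa_handler = sighup_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    return sigaction(SIGHUP, &sa, NULL);
}

/*
 * Called from the main loop between writes.  If a SIGHUP arrived, close the
 * current output file (if any) and reopen the originally named path in
 * append mode; otherwise return the current handle unchanged.  Returns NULL
 * if reopening fails.
 */
static FILE *
reopen_if_requested(FILE *out, const char *path)
{
    assert(path != NULL);

    if (!reopen_requested)
        return out;

    reopen_requested = 0;
    if (out != NULL)
        fclose(out);
    return fopen(path, "a");
}
```

With this shape, an external log rotation tool can rename the output file
and then send SIGHUP; the next call to reopen_if_requested() creates a fresh
file under the original name.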

> - It confirms the write and flush positions, but doesn't appear to
> actually flush anywhere.

Yea. The reason it reports the flush position is that it allows testing
sync rep. I don't think other use cases will appreciate frequent
fsyncs... Maybe make it optional?

> - While I quite agree with your desire for stringinfo in src/common,
> couldn't you use the roughly-equivalent PQExpBuffer facilities in
> libpq instead?

Yes.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	Josh Berkus <josh(at)agliodbs(dot)com>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.9.1
Date:	2014-03-17 12:00:22
Message-ID:	CA+TgmobbGx2sVG6_Mm8z4q-6moY0Mk-hCdVY7rtT6O7K7Zz5hQ@mail.gmail.com
Lists:	pgsql-hackers

On Mon, Mar 17, 2014 at 7:27 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> - There doesn't seem to be any provision for this tool to ever switch
>> from one output file to the next. That seems like a practical need.
>> One idea would be to have it respond to SIGHUP by reopening the
>> originally-named output file. Another would be to switch, after so
>> many bytes, to filename.1, then filename.2, etc.
>
> Hm. So far I haven't had the need, but you're right, it would be
> useful. I don't like the .<n> notion, but SIGHUP would be fine with
> me. I'll add that.

Cool.

>> - It confirms the write and flush positions, but doesn't appear to
>> actually flush anywhere.
>
> Yea. The reason it reports the flush position is that it allows to test
> sync rep. I don't think other usecases will appreciate frequent
> fsyncs... Maybe make it optional?

Well, as I'm sure you recognize, if you're actually trying to build a
replication solution with this tool, you can't let the database throw
away the state required to suck changes out of the database unless
you've got those changes safely stored away somewhere else. Now, of
course, if you don't acknowledge to the database that the stuff is on
disk, you're going to get data file bloat and excess WAL retention,
unlucky you. But acknowledging that you've got the changes when
they're not actually on disk doesn't actually provide the guarantees
you went to so much trouble to build in to the mechanism. So the
no-flush version really can ONLY ever be useful for testing, AFAICS,
or if you really don't care that much whether it can survive a server
crash.

Perhaps there could be a switch for an fsync interval, or something
like that. The default could be, say, to fsync every 10 seconds. And
if you want to change it, then go ahead; 0 disables. Writing to
standard output would be documented as unreliable. Other ideas
welcome.
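
The interval rule proposed here could be sketched roughly as follows. This is only an illustration in Python (pg_recvlogical itself is written in C); the name `fsync_due` and the constant are made up for the example and are not taken from any patch.

```python
DEFAULT_FSYNC_INTERVAL = 10.0  # seconds; the default suggested above


def fsync_due(now, last_fsync, interval=DEFAULT_FSYNC_INTERVAL):
    """Return True when a periodic fsync should be issued.

    now and last_fsync are timestamps in seconds on the same clock.
    An interval of 0 (or less) disables periodic fsync entirely.
    """
    if interval <= 0:
        return False
    return now - last_fsync >= interval
```

With the default, a caller that last synced at t=0 would sync again once t reaches 10; passing `interval=0` never syncs.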

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Josh Berkus <josh(at)agliodbs(dot)com>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.9.1
Date:	2014-03-17 12:29:24
Message-ID:	20140317122924.GB16438@awork2.anarazel.de
Lists:	pgsql-hackers

On 2014-03-17 08:00:22 -0400, Robert Haas wrote:
> > Yea. The reason it reports the flush position is that it allows to test
> > sync rep. I don't think other usecases will appreciate frequent
> > fsyncs... Maybe make it optional?
>
> Well, as I'm sure you recognize, if you're actually trying to build a
> replication solution with this tool, you can't let the database throw
> away the state required to suck changes out of the database unless
> you've got those changes safely stored away somewhere else.

Hm. I don't think a replication tool will use pg_recvlogical, do you
really foresee that? The main use cases for it that I can see are
testing/development of output plugins and logging/auditing.

That's not to say a safe writing method isn't interesting, though.

> Perhaps there could be a switch for an fsync interval, or something
> like that. The default could be, say, to fsync every 10 seconds. And
> if you want to change it, then go ahead; 0 disables. Writing to
> standard output would be documented as unreliable. Other ideas
> welcome.

Hm. That'll be a bit nasty. fsync() is async-signal-safe, but it's still
forbidden to be called from a signal handler on Windows IIRC. I guess we can
couple it with the standby_message_timeout stuff.

Unless you have a better idea?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Josh Berkus <josh(at)agliodbs(dot)com>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.9.1
Date:	2014-03-17 12:58:15
Message-ID:	20140317125815.GC16438@awork2.anarazel.de
Lists:	pgsql-hackers

On 2014-03-17 08:00:22 -0400, Robert Haas wrote:
> On Mon, Mar 17, 2014 at 7:27 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> >> - There doesn't seem to be any provision for this tool to ever switch
> >> from one output file to the next. That seems like a practical need.
> >> One idea would be to have it respond to SIGHUP by reopening the
> >> originally-named output file. Another would be to switch, after so
> >> many bytes, to filename.1, then filename.2, etc.
> >
> > Hm. So far I haven't had the need, but you're right, it would be
> > useful. I don't like the .<n> notion, but SIGHUP would be fine with
> > me. I'll add that.
>
> Cool.

So, I've implemented this, but it won't work on Windows afaics. There's
no SIGHUP on Windows, and the signal emulation code used in the backend
is backend only...
I'll be happy enough to declare this a known limitation for
now. Arguments to the contrary, best complemented with a solution?
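
For readers unfamiliar with the pattern, a rough sketch of reopen-on-SIGHUP with the platform limitation made explicit might look like this. It is a Python illustration, not the code from the patch; `reopen_state`, `install_reopen_handler` and `maybe_reopen` are hypothetical names. The handler only sets a flag, and the main loop does the actual close and reopen.

```python
import os
import signal

# Shared flag set by the signal handler and consumed by the main loop.
reopen_state = {"requested": False}


def _on_sighup(signum, frame):
    # Keep the handler trivial: just record that a reopen was requested.
    reopen_state["requested"] = True


def install_reopen_handler():
    """Install the handler where SIGHUP exists; return False elsewhere."""
    if not hasattr(signal, "SIGHUP"):
        return False  # e.g. Windows: no SIGHUP, so no rotation by signal
    signal.signal(signal.SIGHUP, _on_sighup)
    return True


def maybe_reopen(outfile, path):
    """Call once per main-loop iteration; returns the file to keep using."""
    if not reopen_state["requested"]:
        return outfile
    reopen_state["requested"] = False
    outfile.flush()
    os.fsync(outfile.fileno())
    outfile.close()
    # Reopening by name creates a fresh file if the old one was renamed.
    return open(path, "ab")
```

An external rotation tool would rename the output file and then send SIGHUP; the next loop iteration starts writing to a new file under the original name.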

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	Josh Berkus <josh(at)agliodbs(dot)com>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.9.1
Date:	2014-03-17 13:13:38
Message-ID:	CA+TgmoZuBJmqU9dpYa-4RB8KhT5C5ObaJD1sJkTq5nR6cPLuqw@mail.gmail.com
Lists:	pgsql-hackers

On Mon, Mar 17, 2014 at 8:29 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> Perhaps there could be a switch for an fsync interval, or something
>> like that. The default could be, say, to fsync every 10 seconds. And
>> if you want to change it, then go ahead; 0 disables. Writing to
>> standard output would be documented as unreliable. Other ideas
>> welcome.
>
> Hm. That'll be a bit nasty. fsync() is async signal safe, but it's still
> forbidden to be called from a signal on windows IIRC. I guess we can
> couple it with the standby_message_timeout stuff.

Eh... I don't see any need to involve signals. I'd just check after
each write() whether enough time has passed, or something like that.
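
A minimal sketch of that approach, in Python for illustration only: `IntervalSyncWriter` and its `clock` parameter are assumptions made for the example, not names from pg_recvlogical. The time check runs after every write, so no signal or timer is involved.

```python
import os
import time


class IntervalSyncWriter:
    """Write to a file and fsync it when enough time has passed."""

    def __init__(self, fileobj, interval=10.0, clock=time.monotonic):
        self.fileobj = fileobj
        self.interval = interval      # seconds; 0 disables fsync
        self.clock = clock            # injectable so it can be tested
        self.last_fsync = clock()
        self.fsync_count = 0

    def write(self, data):
        self.fileobj.write(data)
        # Check after each write whether the interval has elapsed.
        elapsed = self.clock() - self.last_fsync
        if self.interval > 0 and elapsed >= self.interval:
            self.sync()

    def sync(self):
        self.fileobj.flush()
        os.fsync(self.fileobj.fileno())
        self.last_fsync = self.clock()
        self.fsync_count += 1
```

One consequence of this design is that the check only runs when data arrives: if the stream goes quiet right after a write, that data stays unsynced until the next write.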

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	Josh Berkus <josh(at)agliodbs(dot)com>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.9.1
Date:	2014-03-17 13:14:51
Message-ID:	CA+TgmoZfEAS7ZUdNwG4Tvxju=Uha0dwPwMjQZOwGw5Z_26V1FA@mail.gmail.com
Lists:	pgsql-hackers

On Mon, Mar 17, 2014 at 8:58 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> On 2014-03-17 08:00:22 -0400, Robert Haas wrote:
>> On Mon, Mar 17, 2014 at 7:27 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> >> - There doesn't seem to be any provision for this tool to ever switch
>> >> from one output file to the next. That seems like a practical need.
>> >> One idea would be to have it respond to SIGHUP by reopening the
>> >> originally-named output file. Another would be to switch, after so
>> >> many bytes, to filename.1, then filename.2, etc.
>> >
>> > Hm. So far I haven't had the need, but you're right, it would be
>> > useful. I don't like the .<n> notion, but SIGHUP would be fine with
>> > me. I'll add that.
>>
>> Cool.
>
> So, I've implemented this, but it won't work on windows afaics. There's
> no SIGHUP on windows, and the signal emulation code used in the backend
> is backend only...
> I'll be happy enough to declare this a known limitation for
> now. Arguments to the contrary, best complemented with a solution?

Blarg. I don't really like that, but I admit I don't have a better
idea, unless it's to go back to the suffix idea, with something like
--file-size-limit=XXX to trigger the switch.
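
To make the suffix idea concrete, here is a rough Python sketch of size-triggered switching to filename.1, filename.2, and so on. The `--file-size-limit` option is only a suggestion in this thread, and `SizeRotatingWriter` is a hypothetical name used for illustration.

```python
class SizeRotatingWriter:
    """Switch to base.1, base.2, ... once a byte limit would be exceeded."""

    def __init__(self, base_path, size_limit):
        self.base_path = base_path
        self.size_limit = size_limit  # bytes allowed per file
        self.index = 0                # 0 means the unsuffixed base file
        self.written = 0              # bytes written to the current file
        self.fileobj = open(base_path, "ab")

    def write(self, data):
        # Rotate before writing if this chunk would exceed the limit,
        # but never rotate an empty file (an oversized chunk stays whole).
        if self.written > 0 and self.written + len(data) > self.size_limit:
            self._rotate()
        self.fileobj.write(data)
        self.written += len(data)

    def _rotate(self):
        self.fileobj.close()
        self.index += 1
        self.fileobj = open(f"{self.base_path}.{self.index}", "ab")
        self.written = 0

    def close(self):
        self.fileobj.close()
```

Unlike the SIGHUP approach this needs no signal support, so it would also work on Windows, at the cost of the tool choosing the file names itself.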

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Josh Berkus <josh(at)agliodbs(dot)com>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.9.1
Date:	2014-03-17 13:16:38
Message-ID:	20140317131637.GD16438@awork2.anarazel.de
Lists:	pgsql-hackers

On 2014-03-17 09:13:38 -0400, Robert Haas wrote:
> On Mon, Mar 17, 2014 at 8:29 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> >> Perhaps there could be a switch for an fsync interval, or something
> >> like that. The default could be, say, to fsync every 10 seconds. And
> >> if you want to change it, then go ahead; 0 disables. Writing to
> >> standard output would be documented as unreliable. Other ideas
> >> welcome.
> >
> > Hm. That'll be a bit nasty. fsync() is async signal safe, but it's still
> > forbidden to be called from a signal on windows IIRC. I guess we can
> > couple it with the standby_message_timeout stuff.
>
> Eh... I don't see any need to involve signals. I'd just check after
> each write() whether enough time has passed, or something like that.

What if no new writes are needed? Because e.g. there's either no write
activity on the primary or all writes are in another database or
somesuch?
I think checking in sendFeedback() is the best bet...

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Josh Berkus <josh(at)agliodbs(dot)com>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.9.1
Date:	2014-03-17 13:20:03
Message-ID:	20140317132003.GE16438@awork2.anarazel.de
Lists:	pgsql-hackers

On 2014-03-17 09:14:51 -0400, Robert Haas wrote:
> On Mon, Mar 17, 2014 at 8:58 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> > On 2014-03-17 08:00:22 -0400, Robert Haas wrote:
> >> On Mon, Mar 17, 2014 at 7:27 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> >> >> - There doesn't seem to be any provision for this tool to ever switch
> >> >> from one output file to the next. That seems like a practical need.
> >> >> One idea would be to have it respond to SIGHUP by reopening the
> >> >> originally-named output file. Another would be to switch, after so
> >> >> many bytes, to filename.1, then filename.2, etc.
> >> >
> >> > Hm. So far I haven't had the need, but you're right, it would be
> >> > useful. I don't like the .<n> notion, but SIGHUP would be fine with
> >> > me. I'll add that.
> >>
> >> Cool.
> >
> > So, I've implemented this, but it won't work on windows afaics. There's
> > no SIGHUP on windows, and the signal emulation code used in the backend
> > is backend only...
> > I'll be happy enough to declare this a known limitation for
> > now. Arguments to the contrary, best complemented with a solution?
>
> Blarg. I don't really like that, but I admit I don't have a better
> idea, unless it's to go back to the suffix idea, with something like
> --file-size-limit=XXX to trigger the switch.

I think the SIGHUP support can be useful independently of
--file-size-limit, so adding it seems like a good idea anyway. I think
anything more advanced than the SIGHUP stuff is for another day.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Josh Berkus <josh(at)agliodbs(dot)com>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.9.1
Date:	2014-03-18 14:23:44
Message-ID:	20140318142344.GG13855@alap3.anarazel.de
Lists:	pgsql-hackers

Hi,

On 2014-03-17 14:16:38 +0100, Andres Freund wrote:
> On 2014-03-17 09:13:38 -0400, Robert Haas wrote:
> > On Mon, Mar 17, 2014 at 8:29 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> > >> Perhaps there could be a switch for an fsync interval, or something
> > >> like that. The default could be, say, to fsync every 10 seconds. And
> > >> if you want to change it, then go ahead; 0 disables. Writing to
> > >> standard output would be documented as unreliable. Other ideas
> > >> welcome.
> > >
> > > Hm. That'll be a bit nasty. fsync() is async signal safe, but it's still
> > > forbidden to be called from a signal on windows IIRC. I guess we can
> > > couple it with the standby_message_timeout stuff.
> >
> > Eh... I don't see any need to involve signals. I'd just check after
> > each write() whether enough time has passed, or something like that.
>
> What if no new writes are needed? Because e.g. there's either no write
> activity on the primary or all writes are in another database or
> somesuch?
> I think checking in sendFeedback() is the best bet...

So, I've tried to implement the things you asked for. Changes:
* reopen the output file on SIGHUP to allow for rotating files
* use PQExpBuffer instead of handrolling stuff
* fsync the output file if either --fsync-interval seconds have passed, or
the server asks for feedback.
* report back a correct flush position.
* updated documentation
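
The fsync and flush-position behaviour listed above can be pictured with a small model. This is a Python sketch under stated assumptions, not the patch's C code: `Receiver`, `on_data` and `feedback` are hypothetical names, and positions are plain integers standing in for LSNs. The point it illustrates is that the flush position only advances after an fsync, and that the check lives in the feedback path so it also runs when no new data arrives.

```python
import os
import time


class Receiver:
    """Track write and flush positions for a received change stream."""

    def __init__(self, fileobj, fsync_interval=10.0, clock=time.monotonic):
        self.fileobj = fileobj
        self.fsync_interval = fsync_interval  # seconds; 0 disables the timer
        self.clock = clock
        self.last_fsync = clock()
        self.write_pos = 0   # last position written to the file
        self.flush_pos = 0   # last position known to be fsynced

    def on_data(self, end_pos, payload):
        self.fileobj.write(payload)
        self.write_pos = end_pos

    def _fsync(self):
        self.fileobj.flush()
        os.fsync(self.fileobj.fileno())
        self.flush_pos = self.write_pos
        self.last_fsync = self.clock()

    def feedback(self, reply_requested=False):
        """Return (write_pos, flush_pos) to report to the server."""
        interval_due = (
            self.fsync_interval > 0
            and self.clock() - self.last_fsync >= self.fsync_interval
        )
        unsynced = self.flush_pos < self.write_pos
        if unsynced and (reply_requested or interval_due):
            self._fsync()
        return self.write_pos, self.flush_pos
```

Until an fsync happens, the reported flush position stays behind the write position, so the server is never told that data is durable when it is not.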

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment	Content-Type	Size
0001-Add-pg_recvlogical-a-commandline-tool-to-receive-dat.patch	text/x-patch	37.6 KB
0002-Documentation-for-logical-decoding.patch	text/x-patch	54.7 KB

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Jim Nasby <jim(at)nasby(dot)net>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.6.1
Date:	2014-05-31 14:11:58
Message-ID:	20140531141158.GD4286@awork2.anarazel.de
Lists:	pgsql-hackers

On 2014-02-21 15:14:15 -0600, Jim Nasby wrote:
> On 2/17/14, 7:31 PM, Robert Haas wrote:
> >But do you really want to keep that snapshot around long enough to
> >copy the entire database? I bet you don't: if the database is big,
> >holding back xmin for long enough to copy the whole thing isn't likely
> >to be fun.
>
> I can confirm that this would be epic fail, at least for londiste. It takes about 3 weeks for a new copy of a ~2TB database. There's no way that'd work with one snapshot. (Granted, copy performance in londiste is rather lackluster, but still...)

I'd marked this email as todo:
If you have such a huge database you can, with logical decoding at
least, take a base backup using pg_basebackup or pg_start/stop_backup()
and roll forward from that... That'll hopefully make such huge copies
much faster.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Jim Nasby <jim(at)nasby(dot)net>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.6.1
Date:	2014-06-01 05:50:58
Message-ID:	538ABF42.9040106@nasby.net
Lists:	pgsql-hackers

On 5/31/14, 9:11 AM, Andres Freund wrote:
> On 2014-02-21 15:14:15 -0600, Jim Nasby wrote:
>> On 2/17/14, 7:31 PM, Robert Haas wrote:
>>> But do you really want to keep that snapshot around long enough to
>>> copy the entire database? I bet you don't: if the database is big,
>>> holding back xmin for long enough to copy the whole thing isn't likely
>>> to be fun.
>>
>> I can confirm that this would be epic fail, at least for londiste. It takes about 3 weeks for a new copy of a ~2TB database. There's no way that'd work with one snapshot. (Granted, copy performance in londiste is rather lackluster, but still...)
>
> I'd marked this email as todo:
> If you have such a huge database you can, with logical decoding at
> least, use a basebackup using pg_basebackup or pg_start/stop_backup()
> and roll forwards from that... That'll hopefull make such huge copies
> much faster.

Just keep in mind that one of the use cases for logical replication is upgrades.
--
Jim C. Nasby, Data Architect jim(at)nasby(dot)net
512.569.9461 (cell) http://jim.nasby.net

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Jim Nasby <jim(at)nasby(dot)net>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.6.1
Date:	2014-06-01 05:57:32
Message-ID:	20140601055732.GF4286@awork2.anarazel.de
Lists:	pgsql-hackers

On 2014-06-01 00:50:58 -0500, Jim Nasby wrote:
> On 5/31/14, 9:11 AM, Andres Freund wrote:
> >On 2014-02-21 15:14:15 -0600, Jim Nasby wrote:
> >>On 2/17/14, 7:31 PM, Robert Haas wrote:
> >>>But do you really want to keep that snapshot around long enough to
> >>>copy the entire database? I bet you don't: if the database is big,
> >>>holding back xmin for long enough to copy the whole thing isn't likely
> >>>to be fun.
> >>
> >>I can confirm that this would be epic fail, at least for londiste. It takes about 3 weeks for a new copy of a ~2TB database. There's no way that'd work with one snapshot. (Granted, copy performance in londiste is rather lackluster, but still...)
> >
> >I'd marked this email as todo:
> >If you have such a huge database you can, with logical decoding at
> >least, use a basebackup using pg_basebackup or pg_start/stop_backup()
> >and roll forwards from that... That'll hopefull make such huge copies
> >much faster.

> Just keep in mind that one of the use cases for logical replication is upgrades.

Should still be fine. Make a physical copy; pg_upgrade; catch up via
logical rep.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Euler Taveira <euler(at)timbira(dot)com(dot)br>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>, Jim Nasby <jim(at)nasby(dot)net>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.6.1
Date:	2014-06-01 15:49:11
Message-ID:	538B4B77.7070903@timbira.com.br
Lists:	pgsql-hackers

On 01-06-2014 02:57, Andres Freund wrote:
> On 2014-06-01 00:50:58 -0500, Jim Nasby wrote:
>> On 5/31/14, 9:11 AM, Andres Freund wrote:
>>> On 2014-02-21 15:14:15 -0600, Jim Nasby wrote:
>>>> On 2/17/14, 7:31 PM, Robert Haas wrote:
>>>>> But do you really want to keep that snapshot around long enough to
>>>>> copy the entire database? I bet you don't: if the database is big,
>>>>> holding back xmin for long enough to copy the whole thing isn't likely
>>>>> to be fun.
>>>>
>>>> I can confirm that this would be epic fail, at least for londiste. It takes about 3 weeks for a new copy of a ~2TB database. There's no way that'd work with one snapshot. (Granted, copy performance in londiste is rather lackluster, but still...)
>>>
>>> I'd marked this email as todo:
>>> If you have such a huge database you can, with logical decoding at
>>> least, use a basebackup using pg_basebackup or pg_start/stop_backup()
>>> and roll forwards from that... That'll hopefull make such huge copies
>>> much faster.
>
>> Just keep in mind that one of the use cases for logical replication is upgrades.
>
> Should still be fine. Make a physical copy; pg_upgrade; catchup via
> logical rep.
>
Keep in mind that this is not an option if you want to copy *part* of the
database(s) (unless you have space available and want to do the cleanup
after the upgrade). In the near future, a (new) tool could (a) copy the schema,
(b) accumulate modifications while copying data, (c) copy the whole table
and (d) apply the modifications for selected table(s)/schema(s). Such a tool
could even be an alternative to pg_upgrade.
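
As a toy model of steps (b) through (d), assuming a table can be represented as a Python dict and the accumulated modifications as an ordered list, the idea might be sketched like this. `copy_with_catchup` and the change tuple format are invented for the illustration; no such tool exists in this thread.

```python
def copy_with_catchup(snapshot_rows, changes_during_copy):
    """Copy one table, then replay changes accumulated during the copy.

    snapshot_rows: {key: row} as seen by the snapshot used for the copy.
    changes_during_copy: ordered list of ("upsert", key, row) or
    ("delete", key, None) tuples collected while the copy was running.
    """
    target = dict(snapshot_rows)              # (c) copy the whole table
    for op, key, row in changes_during_copy:  # (d) apply modifications
        if op == "upsert":
            target[key] = row
        elif op == "delete":
            target.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {op!r}")
    return target
```

The order of the accumulated changes matters: replaying them in commit order is what brings the copy to the same state as the source.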

--
Euler Taveira Timbira - http://www.timbira.com.br/
PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento

From:	Jim Nasby <jim(at)nasby(dot)net>
To:	Euler Taveira <euler(at)timbira(dot)com(dot)br>, Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Changeset Extraction v7.6.1
Date:	2014-06-01 17:35:21
Message-ID:	538B6459.7030001@nasby.net
Lists:	pgsql-hackers

On 6/1/14, 10:49 AM, Euler Taveira wrote:
> On 01-06-2014 02:57, Andres Freund wrote:
>> On 2014-06-01 00:50:58 -0500, Jim Nasby wrote:
>>> On 5/31/14, 9:11 AM, Andres Freund wrote:
>>>> On 2014-02-21 15:14:15 -0600, Jim Nasby wrote:
>>>>> On 2/17/14, 7:31 PM, Robert Haas wrote:
>>>>>> But do you really want to keep that snapshot around long enough to
>>>>>> copy the entire database? I bet you don't: if the database is big,
>>>>>> holding back xmin for long enough to copy the whole thing isn't likely
>>>>>> to be fun.
>>>>>
>>>>> I can confirm that this would be epic fail, at least for londiste. It takes about 3 weeks for a new copy of a ~2TB database. There's no way that'd work with one snapshot. (Granted, copy performance in londiste is rather lackluster, but still...)
>>>>
>>>> I'd marked this email as todo:
>>>> If you have such a huge database you can, with logical decoding at
>>>> least, use a basebackup using pg_basebackup or pg_start/stop_backup()
>>>> and roll forwards from that... That'll hopefull make such huge copies
>>>> much faster.
>>
>>> Just keep in mind that one of the use cases for logical replication is upgrades.
>>
>> Should still be fine. Make a physical copy; pg_upgrade; catchup via
>> logical rep.
>>
> Have in mind that it is not an option if you want to copy *part* of the
> database(s) (unless you have space available and want to do the cleanup
> after upgrade). In a near future, a (new) tool could do (a) copy schema,
> (b) accumulate modifications while copying data, (c) copy whole table
> and (d) apply modifications for selected table(s)/schema(s). Such a tool
> could even be an alternative to pg_upgrade.

There are also things that pg_upgrade doesn't handle, so it's not always an option.
--
Jim C. Nasby, Data Architect jim(at)nasby(dot)net
512.569.9461 (cell) http://jim.nasby.net