Re: logical changeset generation v6.2

From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: logical changeset generation v6.2
Date: 2013-10-18 18:26:16
Message-ID: 20131018182616.GC16188@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2013-10-14 09:36:03 -0400, Robert Haas wrote:
> > I thought and implemented that in the beginning. Unfortunately it's not
> > enough :(. That's probably the issue that took me longest to understand
> > in this patchseries...
> >
> > Combocids can only fix the case where a transaction actually has create
> > a combocid:
> >
> > 1) TX1: INSERT id = 1 at 0/1: (xmin = 1, xmax=Invalid, cmin = 55, cmax = Invalid)
> > 2) TX2: DELETE id = 1 at 0/1: (xmin = 1, xmax=2, cmin = Invalid, cmax = 1)
> >
> > So, if we're decoding data that needs to lookup those rows in TX1 or TX2
> > we both times need access to cmin and cmax, but neither transaction will
> > have created a multixact. That can only be an issue in transaction with
> > catalog modifications.
>
> Oh, yuck. So that means you have to write an extra WAL record for
> EVERY heap insert, update, or delete to a catalog table? OUCH.

So. As it turns out that solution isn't sufficient in the face of VACUUM
FULL and mixed DML/DDL transaction that have not yet been decoded.

To reiterate, as published it works like:
For every modification of catalog tuple (insert, multi_insert, update,
delete) that has influence over visibility issue a record that contains:
* filenode
* ctid
* (cmin, cmax)

When doing a visibility check on a catalog row during decoding of mixed
DML/DDL transaction lookup (cmin, cmax) for that row since we don't
store both for the tuple.

That mostly works great.

The problematic scenario is decoding a transaction that has done mixed
DML/DDL *after* a VACUUM FULL/CLUSTER has been performed. The VACUUM
FULL obviously changes the filenode and the ctid of a tuple, so we
cannot successfully do a lookup based on what we logged before.

I know of the following solutions:
1) Don't allow VACUUM FULL on catalog tables if wal_level = logical.
2) Make VACUUM FULL prevent DDL and then wait till all changestreams
have decoded up to the current point.
3) don't delete the old relfilenode for VACUUM/CLUSTERs of system tables
if there are life decoding slots around, instead delegate that
responsibility to the slot management.
4) Store both (cmin, cmax) for catalog tuples.

I bascially think only 1) and 4) are realistic. And 1) sucks.

I've developed a prototype for 4) and except currently being incredibly
ugly, it seems to be the most promising approach by far. My trick to
store both cmin and cmax is to store cmax in t_hoff managed space when
wal_level = logical.
That even works when changing wal_level from < logical to logical
because only ever need to store both cmin and cmax for transactions that
have decodeable content - which they cannot yet have before wal_level =
logical.

This requires some not so nice things:
* A way to declare we're storing both. I've currently chosen
HEAP_MOVED_OFF | HEAP_MOVED_IN. That sucks.
* A way for heap_form_tuple to know it should add the necessary space to
t_hoff. I've added TupleDesc->tdhaswidecid for it.
* Fiddling with existing checks for HEAP_MOVED{,OFF,IN} to check for
both set at the same time.
* Changing the WAL logging to (optionally?) transport the current
CommandId instead of always resetting it InvalidCommandId.

The benefits are:
* Working VACUUM FULL
* Much simpler tqual.c logic, everything is stored in the row itself. No
hash or something like that built.
* No more need to log (relfilenode, cmin, cmax) separately from heap
changes itself anymore.

In the end, the costs are that individual catalog rows are 4 bytes
bigger iff wal_level = logical. That seems acceptable.

Some questions remain:
* Better idea for a flag than HEAP_MOVED_OFF | HEAP_MOVED_IN
* Should we just unconditionally log the current CommandId or make it
conditional. We have plenty of flag space to signal whether it's
present, but it's just 4 bytes.

Comments?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2013-10-18 18:50:58 Re: logical changeset generation v6.4
Previous Message Peter Eisentraut 2013-10-18 18:22:49 Re: libpgport vs libpgcommon