Quick Links

Re: WAL format and API changes (9.5)

Lists:	pgsql-hackers

From:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To:	PostgreSQL-development <pgsql-hackers(at)postgreSQL(dot)org>
Subject:	WAL format and API changes (9.5)
Date:	2014-04-03 14:14:23
Message-ID:	533D6CBF.6080203@vmware.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

I'd like to do some changes to the WAL format in 9.5. I want to annotate
each WAL record with the blocks that they modify. Every WAL record
already includes that information, but it's done in an ad hoc way,
differently in every rmgr. The RelFileNode and block number are
currently part of the WAL payload, and it's the REDO routine's
responsibility to extract it. I want to include that information in a
common format for every WAL record type.

That makes life a lot easier for tools that are interested in knowing
which blocks a WAL record modifies. One such tool is pg_rewind; it
currently has to understand every WAL record the backend writes. There's
also a tool out there called pg_readahead, which does prefetching of
blocks accessed by WAL records, to speed up PITR. I don't think that
tool has been actively maintained, but at least part of the reason for
that is probably that it's a pain to maintain when it has to understand
the details of every WAL record type.

It'd also be nice for contrib/pg_xlogdump and backend code itself. The
boilerplate code in all WAL redo routines, and writing WAL records,
could be simplified.

So, here's my proposal:

Insertion
---------

The big change in creating WAL records is that the buffers involved in
the WAL-logged operation are explicitly registered, by calling a new
XLogRegisterBuffer function. Currently, buffers that need full-page
images are registered by including them in the XLogRecData chain, but
with the new system, you call the XLogRegisterBuffer() function instead.
And you call that function for every buffer involved, even if no
full-page image needs to be taken, e.g because the page is going to be
recreated from scratch at replay.

It is no longer necessary to include the RelFileNode and BlockNumber of
the modified pages in the WAL payload. That information is automatically
included in the WAL record, when XLogRegisterBuffer is called.

Currently, the backup blocks are implicitly numbered, in the order the
buffers appear in XLogRecData entries. With the new API, the blocks are
numbered explicitly. This is more convenient when a WAL record sometimes
modifies a buffer and sometimes not. For example, a B-tree split needs
to modify four pages: the original page, the new page, the right sibling
(unless it's the rightmost page) and if it's an internal page, the page
at the lower level whose split the insertion completes. So there are two
pages that are sometimes missing from the record. With the new API, you
can nevertheless always register e.g. original page as buffer 0, new
page as 1, right sibling as 2, even if some of them are actually
missing. SP-GiST contains even more complicated examples of that.

The new XLogRegisterBuffer would look like this:

void XLogRegisterBuffer(int blockref_id, Buffer buffer, bool buffer_std)

blockref_id: An arbitrary ID given to this block reference. It is used
in the redo routine to open/restore the same block.
buffer: the buffer involved
buffer_std: is the page in "standard" page layout?

That's for the normal cases. We'll need a couple of variants for also
registering buffers that don't need full-page images, and perhaps also a
function for registering a page that *always* needs a full-page image,
regardless of the LSN. A few existing WAL record types just WAL-log the
whole page, so those ad-hoc full-page images could be replaced with this.

With these changes, a typical WAL insertion would look like this:

/* register the buffer with the WAL record, with ID 0 */
XLogRegisterBuffer(0, buf, true);

rdata[0].data = (char *) &xlrec;
rdata[0].len = sizeof(BlahRecord);
rdata[0].buffer_id = -1; /* -1 means the data is always included */
rdata[0].next = &(rdata[1]);

rdata[1].data = (char *) mydata;
rdata[1].len = mydatalen;
rdata[1].buffer_id = 0; /* 0 here refers to the buffer registered above */
rdata[1].next = NULL

...
recptr = XLogInsert(RM_BLAH_ID, xlinfo, rdata);

PageSetLSN(buf, recptr);

(While we're at it, perhaps we should let XLogInsert set the LSN of all
the registered buffers, to reduce the amount of boilerplate code).

(Instead of using a new XLogRegisterBuffer() function to register the
buffers, perhaps they should be passed to XLogInsert as a separate list
or array. I'm not wedded on the details...)

Redo
----

There are four different states a block referenced by a typical WAL
record can be in:

1. The old page does not exist at all (because the relation was
truncated later)
2. The old page exists, but has an LSN higher than current WAL record,
so it doesn't need replaying.
3. The LSN is < current WAL record, so it needs to be replayed.
4. The WAL record contains a full-page image, which needs to be restored.

With the current API, that leads to a long boilerplate:

/* If we have a full-page image, restore it and we're done */
if (HasBackupBlock(record, 0))
{
(void) RestoreBackupBlock(lsn, record, 0, false, false);
return;
}
buffer = XLogReadBuffer(xlrec->node, xlrec->block, false);
/* If the page was truncated away, we're done */
if (!BufferIsValid(buffer))
return;

page = (Page) BufferGetPage(buffer);

/* Has this record already been replayed? */
if (lsn <= PageGetLSN(page))
{
UnlockReleaseBuffer(buffer);
return;
}

/* Modify the page */
...

PageSetLSN(page, lsn);
MarkBufferDirty(buffer);
UnlockReleaseBuffer(buffer);

Let's simplify that, and have one new function, XLogOpenBuffer, which
returns a return code that indicates which of the four cases we're
dealing with. A typical redo function looks like this:

if (XLogOpenBuffer(0, &buffer) == BLK_REPLAY)
{
/* Modify the page */
...

PageSetLSN(page, lsn);
MarkBufferDirty(buffer);
}
if (BufferIsValid(buffer))
UnlockReleaseBuffer(buffer);

The '0' in the XLogOpenBuffer call is the ID of the block reference
specified in the XLogRegisterBuffer call, when the WAL record was created.

WAL format
----------

The registered block references need to be included in the WAL record.
We already do that for backup blocks, so a naive implementation would be
to just include a BkpBlock struct for all the block references, even
those that don't need a full-page image. That would be rather bulky,
though, so that needs some optimization. Shouldn't be difficult to omit
duplicated/unnecessary information, and add a flags field indicating
which fields are present. Overall, I don't expect there to be any big
difference in the amount of WAL generated by a typical application.

- Heikki

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-04-03 15:37:08
Message-ID:	27502.1396539428@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Heikki Linnakangas <hlinnakangas(at)vmware(dot)com> writes:
> The big change in creating WAL records is that the buffers involved in
> the WAL-logged operation are explicitly registered, by calling a new
> XLogRegisterBuffer function.

Seems reasonable, especially if we can make the buffer numbering business
less error-prone.

> void XLogRegisterBuffer(int blockref_id, Buffer buffer, bool buffer_std)

> blockref_id: An arbitrary ID given to this block reference. It is used
> in the redo routine to open/restore the same block.
> buffer: the buffer involved
> buffer_std: is the page in "standard" page layout?

> That's for the normal cases. We'll need a couple of variants for also
> registering buffers that don't need full-page images, and perhaps also a
> function for registering a page that *always* needs a full-page image,
> regardless of the LSN. A few existing WAL record types just WAL-log the
> whole page, so those ad-hoc full-page images could be replaced with this.

Why not just one function with an additional flags argument?

Also, IIRC there are places that WAL-log full pages that aren't in a
shared buffer at all (btree build does this I think). How will that fit
into this model?

> (While we're at it, perhaps we should let XLogInsert set the LSN of all
> the registered buffers, to reduce the amount of boilerplate code).

Yeah, maybe so. I'm not sure why that was separate to begin with.

> Let's simplify that, and have one new function, XLogOpenBuffer, which
> returns a return code that indicates which of the four cases we're
> dealing with. A typical redo function looks like this:

> if (XLogOpenBuffer(0, &buffer) == BLK_REPLAY)
> {
> /* Modify the page */
> ...

> PageSetLSN(page, lsn);
> MarkBufferDirty(buffer);
> }
> if (BufferIsValid(buffer))
> UnlockReleaseBuffer(buffer);

> The '0' in the XLogOpenBuffer call is the ID of the block reference
> specified in the XLogRegisterBuffer call, when the WAL record was created.

+1, but one important step here is finding the data to be replayed.
That is, a large part of the complexity of replay routines has to do
with figuring out which parts of the WAL record were elided due to
full-page-images, and locating the remaining parts. What can we do
to make that simpler?

Ideally, if XLogOpenBuffer (bad name BTW) returns BLK_REPLAY, it would
also calculate and hand back the address/size of the logged data that
had been pointed to by the associated XLogRecData chain item. The
trouble here is that there might've been multiple XLogRecData items
pointing to the same buffer. Perhaps the magic ID number you give to
XLogOpenBuffer should be thought of as identifying an XLogRecData chain
item, not so much a buffer? It's fairly easy to see what to do when
there's just one chain item per buffer, but I'm not sure what to do
if there's more than one.

regards, tom lane

From:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-04-03 15:58:23
Message-ID:	533D851F.3070608@vmware.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 04/03/2014 06:37 PM, Tom Lane wrote:
> Also, IIRC there are places that WAL-log full pages that aren't in a
> shared buffer at all (btree build does this I think). How will that fit
> into this model?

Hmm. We could provide a function for registering a block with given
content, without a Buffer. Something like:

XLogRegisterPage(int id, RelFileNode, BlockNumber, Page)

>> Let's simplify that, and have one new function, XLogOpenBuffer, which
>> returns a return code that indicates which of the four cases we're
>> dealing with. A typical redo function looks like this:
>
>> if (XLogOpenBuffer(0, &buffer) == BLK_REPLAY)
>> {
>> /* Modify the page */
>> ...
>
>> PageSetLSN(page, lsn);
>> MarkBufferDirty(buffer);
>> }
>> if (BufferIsValid(buffer))
>> UnlockReleaseBuffer(buffer);
>
>> The '0' in the XLogOpenBuffer call is the ID of the block reference
>> specified in the XLogRegisterBuffer call, when the WAL record was created.
>
> +1, but one important step here is finding the data to be replayed.
> That is, a large part of the complexity of replay routines has to do
> with figuring out which parts of the WAL record were elided due to
> full-page-images, and locating the remaining parts. What can we do
> to make that simpler?

We can certainly add more structure to the WAL records, but any extra
information you add will make the records larger. It might be worth it,
and would be lost in the noise for more complex records like page
splits, but we should keep frequently-used records like heap insertions
as lean as possible.

> Ideally, if XLogOpenBuffer (bad name BTW) returns BLK_REPLAY, it would
> also calculate and hand back the address/size of the logged data that
> had been pointed to by the associated XLogRecData chain item. The
> trouble here is that there might've been multiple XLogRecData items
> pointing to the same buffer. Perhaps the magic ID number you give to
> XLogOpenBuffer should be thought of as identifying an XLogRecData chain
> item, not so much a buffer? It's fairly easy to see what to do when
> there's just one chain item per buffer, but I'm not sure what to do
> if there's more than one.

Hmm. You could register a separate XLogRecData chain for each buffer.
Along the lines of:

rdata[0].data = data for buffer
rdata[0].len = ...
rdata[0].next = &rdata[1];
rdata[1].data = more data for same buffer
rdata[1].len = ...
rdata[2].next = NULL;

XLogRegisterBuffer(0, buffer, &data[0]);

At replay:

if (XLogOpenBuffer(0, &buffer, &xldata, &len) == BLK_REPLAY)
{
/* xldata points to the data registered for this buffer */
}

Plus one more chain for the data not associated with a buffer.

- Heikki

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-04-03 16:11:10
Message-ID:	28487.1396541470@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Heikki Linnakangas <hlinnakangas(at)vmware(dot)com> writes:
> On 04/03/2014 06:37 PM, Tom Lane wrote:
>> +1, but one important step here is finding the data to be replayed.
>> That is, a large part of the complexity of replay routines has to do
>> with figuring out which parts of the WAL record were elided due to
>> full-page-images, and locating the remaining parts. What can we do
>> to make that simpler?

> We can certainly add more structure to the WAL records, but any extra
> information you add will make the records larger. It might be worth it,
> and would be lost in the noise for more complex records like page
> splits, but we should keep frequently-used records like heap insertions
> as lean as possible.

Sure, but in simple WAL records there would just be a single data chunk
that would be present in the normal case and not present in the FPI case.
Seems like we ought to be able to handle that degenerate case with no
significant wastage (probably just a flag bit someplace).

More generally, I'm pretty sure that your proposal is already going to
involve some small growth of WAL records compared to today, but I think
that's probably all right; the benefits seem significant.

> Hmm. You could register a separate XLogRecData chain for each buffer.
> Plus one more chain for the data not associated with a buffer.

That would probably work.

regards, tom lane

From:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-04-03 16:39:47
Message-ID:	533D8ED3.40302@vmware.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 04/03/2014 07:11 PM, Tom Lane wrote:
>
> More generally, I'm pretty sure that your proposal is already going to
> involve some small growth of WAL records compared to today,

Quite possible.

> but I think
> that's probably all right; the benefits seem significant.

Yep.

OTOH, once we store the relfilenode+block in a common format, we can
then try to optimize that format more heavily. Just as an example, omit
the tablespace oid in the RelFileNode, when it's the default tablespace
(with a flag bit indicating we did that). Or use a variable-length
endoding for the block number, on the assumption that smaller numbers
are more common. Probably not be worth the extra complexity, but we can
easily experiment with that kind of stuff once we have the
infrastructure in place.

- Heikki

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-04-03 22:28:34
Message-ID:	CA+Tgmobfy2XbMBuz6+jN+HAoaDBHJKi7T+LpG6JFPYojzOihZA@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, Apr 3, 2014 at 10:14 AM, Heikki Linnakangas
<hlinnakangas(at)vmware(dot)com> wrote:
> (Instead of using a new XLogRegisterBuffer() function to register the
> buffers, perhaps they should be passed to XLogInsert as a separate list or
> array. I'm not wedded on the details...)

That would have the advantage of avoiding the function-call overhead,
which seems appealing.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-04-03 23:08:27
Message-ID:	17176.1396566507@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Thu, Apr 3, 2014 at 10:14 AM, Heikki Linnakangas
> <hlinnakangas(at)vmware(dot)com> wrote:
>> (Instead of using a new XLogRegisterBuffer() function to register the
>> buffers, perhaps they should be passed to XLogInsert as a separate list or
>> array. I'm not wedded on the details...)

> That would have the advantage of avoiding the function-call overhead,
> which seems appealing.

You're kidding, right? One function call per buffer touched by an update
is going to be lost in the noise.

A somewhat more relevant concern is where are we going to keep the state
data involved in all this. Since this code is, by definition, going to be
called in critical sections, any solution involving internal pallocs is
right out. The existing arrangement keeps all its data in local variables
of the function doing the update, which is probably a nice property to
preserve if we can manage it. That might force us to do it as above.

I'd still like something that *looks* more like a list of function calls,
though, even if they're macros under the hood. The existing approach to
setting up the XLogRecData chains is awfully prone to bugs of omission.
There are a few places that went so far as to set up custom macros to help
with that. I'm not a fan of doing the same thing in randomly different
ways in different places; but if we did it that way uniformly, it might be
better/safer.

regards, tom lane

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-04-03 23:23:47
Message-ID:	20140403232347.GH17307@awork2.anarazel.de
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 2014-04-03 19:08:27 -0400, Tom Lane wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> > On Thu, Apr 3, 2014 at 10:14 AM, Heikki Linnakangas
> > <hlinnakangas(at)vmware(dot)com> wrote:
> >> (Instead of using a new XLogRegisterBuffer() function to register the
> >> buffers, perhaps they should be passed to XLogInsert as a separate list or
> >> array. I'm not wedded on the details...)
>
> > That would have the advantage of avoiding the function-call overhead,
> > which seems appealing.
>
> You're kidding, right? One function call per buffer touched by an update
> is going to be lost in the noise.
>
> A somewhat more relevant concern is where are we going to keep the state
> data involved in all this. Since this code is, by definition, going to be
> called in critical sections, any solution involving internal pallocs is
> right out.

We actually already allocate memory in XLogInsert() :(, although only in
the first XLogInsert() a backend does... I didn't remember it, but I
complained about it before:
http://archives.postgresql.org/message-id/4FEB223A.3060707%40enterprisedb.com

I didn't realize the implications for it back then and thus didn't press
hard for a change, but I think it should be fixed.

> The existing arrangement keeps all its data in local variables
> of the function doing the update, which is probably a nice property to
> preserve if we can manage it. That might force us to do it as above.

We could simply allocate enough scratch space for the state at process
startup. I don't think there'll ever be a need for XLogInsert() to be
reentrant, so that should be fine.

> I'd still like something that *looks* more like a list of function calls,
> though, even if they're macros under the hood. The existing approach to
> setting up the XLogRecData chains is awfully prone to bugs of
> omission.

Agreed. There's some pretty ugly code around this.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-04-03 23:33:12
Message-ID:	17742.1396567992@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Andres Freund <andres(at)2ndquadrant(dot)com> writes:
> On 2014-04-03 19:08:27 -0400, Tom Lane wrote:
>> A somewhat more relevant concern is where are we going to keep the state
>> data involved in all this. Since this code is, by definition, going to be
>> called in critical sections, any solution involving internal pallocs is
>> right out.

> We actually already allocate memory in XLogInsert() :(, although only in
> the first XLogInsert() a backend does...

Ouch. I wonder if we should put an Assert(not-in-critical-section)
into MemoryContextAlloc.

regards, tom lane

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-04-03 23:40:28
Message-ID:	20140403234028.GI17307@awork2.anarazel.de
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 2014-04-03 19:33:12 -0400, Tom Lane wrote:
> Andres Freund <andres(at)2ndquadrant(dot)com> writes:
> > On 2014-04-03 19:08:27 -0400, Tom Lane wrote:
> >> A somewhat more relevant concern is where are we going to keep the state
> >> data involved in all this. Since this code is, by definition, going to be
> >> called in critical sections, any solution involving internal pallocs is
> >> right out.
>
> > We actually already allocate memory in XLogInsert() :(, although only in
> > the first XLogInsert() a backend does...
>
> Ouch. I wonder if we should put an Assert(not-in-critical-section)
> into MemoryContextAlloc.

XLogInsert() is using malloc() directly, so that wouldn't detect this
case...

It's not a bad idea tho. I wonder how far the regression tests
get...

Not even through initdb. GetVirtualXIDsDelayingChkpt() is to blame.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-04-04 07:48:32
Message-ID:	533E63D0.1000908@vmware.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 04/04/2014 02:40 AM, Andres Freund wrote:
> On 2014-04-03 19:33:12 -0400, Tom Lane wrote:
>> Andres Freund <andres(at)2ndquadrant(dot)com> writes:
>>> On 2014-04-03 19:08:27 -0400, Tom Lane wrote:
>>>> A somewhat more relevant concern is where are we going to keep the state
>>>> data involved in all this. Since this code is, by definition, going to be
>>>> called in critical sections, any solution involving internal pallocs is
>>>> right out.
>>
>>> We actually already allocate memory in XLogInsert() :(, although only in
>>> the first XLogInsert() a backend does...
>>
>> Ouch. I wonder if we should put an Assert(not-in-critical-section)
>> into MemoryContextAlloc.

Good idea!

> XLogInsert() is using malloc() directly, so that wouldn't detect this
> case...
>
> It's not a bad idea tho. I wonder how far the regression tests
> get...
>
> Not even through initdb. GetVirtualXIDsDelayingChkpt() is to blame.

Hmm, the checkpointer process seems to ignore the rule, and just hopes
for the best. The allocation in GetVirtualXIDsDelayingChkpt() was
probably just an oversight, and that call could be moved outside the
critical section, but we also have this in AbsortFsyncRequests():

> /*
> * We have to PANIC if we fail to absorb all the pending requests (eg,
> * because our hashtable runs out of memory). This is because the system
> * cannot run safely if we are unable to fsync what we have been told to
> * fsync. Fortunately, the hashtable is so small that the problem is
> * quite unlikely to arise in practice.
> */
> START_CRIT_SECTION();

Perhaps we could fix that by just calling fsync() directly if the
allocation fails, like we do if ForwardFsyncRequest() fails.

But if we give the checkpointer process a free pass, running the
regression tests with an assertion in AllocSetAlloc catches five genuine
bugs:

1. _bt_newroot
2. XLogFileInit
3. spgPageIndexMultiDelete
4. PageRepairFragmentation
5. PageIndexMultiDelete

I'll commit the attached patch to fix those shortly.

- Heikki

Attachment	Content-Type	Size
avoid-allocations-in-crit-section.patch	text/x-diff	9.5 KB

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-04-04 08:41:08
Message-ID:	20140404084108.GJ17307@awork2.anarazel.de
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi,

On 2014-04-04 10:48:32 +0300, Heikki Linnakangas wrote:
> But if we give the checkpointer process a free pass, running the regression
> tests with an assertion in AllocSetAlloc catches five genuine bugs:
>
> 1. _bt_newroot
> 2. XLogFileInit
> 3. spgPageIndexMultiDelete
> 4. PageRepairFragmentation
> 5. PageIndexMultiDelete

Some of those, like PageRepairFragmentation, are somewhat bad...

> @@ -484,10 +483,11 @@ PageRepairFragmentation(Page page)
> ((PageHeader) page)->pd_upper = pd_special;
> }
> else
> - { /* nstorage != 0 */
> + {
> /* Need to compact the page the hard way */
> - itemidbase = (itemIdSort) palloc(sizeof(itemIdSortData) * nstorage);
> - itemidptr = itemidbase;
> + itemIdSortData itemidbase[MaxHeapTuplesPerPage];
> + itemIdSort itemidptr = itemidbase;
> +

That's a fair bit of stack, and it can be called somewhat deep on the
stack via heap_page_prune_opt(). I wonder if we ought to add a
check_stack_depth() somewhere.

Thanks,

Andres Freund

From:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-04-04 09:50:25
Message-ID:	533E8061.7080508@vmware.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 04/04/2014 11:41 AM, Andres Freund wrote:
> On 2014-04-04 10:48:32 +0300, Heikki Linnakangas wrote:
>> @@ -484,10 +483,11 @@ PageRepairFragmentation(Page page)
>> ((PageHeader) page)->pd_upper = pd_special;
>> }
>> else
>> - { /* nstorage != 0 */
>> + {
>> /* Need to compact the page the hard way */
>> - itemidbase = (itemIdSort) palloc(sizeof(itemIdSortData) * nstorage);
>> - itemidptr = itemidbase;
>> + itemIdSortData itemidbase[MaxHeapTuplesPerPage];
>> + itemIdSort itemidptr = itemidbase;
>> +
>
> That's a fair bit of stack, and it can be called somewhat deep on the
> stack via heap_page_prune_opt(). I wonder if we ought to add a
> check_stack_depth() somewhere.

Hmm, on my 64-bit laptop, that array is 24*291=6984 bytes with 8k block
size. That's fairly large, but not unheard of; there are a few places
where we allocate a BLCKSZ-sized buffer from stack.

We could easily reduce the stack consumption here by changing itemIdSort
to use 16-bit ints; all the lengths and offsets that
PageRepairFragmentation deals with fit in 16-bits.

But overall I wouldn't worry about it. check_stack_depth() leaves a fair
amount of headroom: STACK_DEPTH_SLOP is 512kB. As long as we don't
recurse, that's plenty.

- Heikki

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-04-04 10:05:06
Message-ID:	20140404100506.GA1339@alap3.anarazel.de
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 2014-04-04 12:50:25 +0300, Heikki Linnakangas wrote:
> On 04/04/2014 11:41 AM, Andres Freund wrote:
> >On 2014-04-04 10:48:32 +0300, Heikki Linnakangas wrote:
> >>@@ -484,10 +483,11 @@ PageRepairFragmentation(Page page)
> >> ((PageHeader) page)->pd_upper = pd_special;
> >> }
> >> else
> >>- { /* nstorage != 0 */
> >>+ {
> >> /* Need to compact the page the hard way */
> >>- itemidbase = (itemIdSort) palloc(sizeof(itemIdSortData) * nstorage);
> >>- itemidptr = itemidbase;
> >>+ itemIdSortData itemidbase[MaxHeapTuplesPerPage];
> >>+ itemIdSort itemidptr = itemidbase;
> >>+
> >
> >That's a fair bit of stack, and it can be called somewhat deep on the
> >stack via heap_page_prune_opt(). I wonder if we ought to add a
> >check_stack_depth() somewhere.
>
> Hmm, on my 64-bit laptop, that array is 24*291=6984 bytes with 8k block
> size. That's fairly large, but not unheard of; there are a few places where
> we allocate a BLCKSZ-sized buffer from stack.

Yea, I am not complaing about using so much stack. Seems sensible here.

> But overall I wouldn't worry about it. check_stack_depth() leaves a fair
> amount of headroom: STACK_DEPTH_SLOP is 512kB. As long as we don't recurse,
> that's plenty.

Well, there's no checks at nearby afair. That's why I was
wondering... But I don't have a big problem with not checking, I just
wanted to bring it up.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Allocations in critical section (was Re: WAL format and API changes (9.5))
Date:	2014-04-04 11:31:48
Message-ID:	533E9824.9020604@vmware.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Ok, I fixed the issues that the assertion fixed. I also committed a
patch to add the assertion itself; let's see if the buildfarm finds more
cases that violate the rule.

It ignores the checkpointer, because it's known to violate the rule, and
allocations in ErrorContext, which is used during error recovery, e.g if
you indeed PANIC while in a critical section for some other reason.

I didn't backpatch this. Although you shouldn't be running with
assertions enabled in production, it nevertheless seems too risky. There
might be some obscure cases where there is no real risk, e.g because the
current memory context always has enough free space because of a
previous pfree, and it doesn't seem worth tracking down and fixing such
issues in backbranches. You have to be pretty unlucky to run out of
memory in a critical section to begin with.

- Heikki

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	Andres Freund <andres(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Allocations in critical section (was Re: WAL format and API changes (9.5))
Date:	2014-04-04 13:56:38
Message-ID:	5381.1396619798@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Heikki Linnakangas <hlinnakangas(at)vmware(dot)com> writes:
> Ok, I fixed the issues that the assertion fixed. I also committed a
> patch to add the assertion itself; let's see if the buildfarm finds more
> cases that violate the rule.

> It ignores the checkpointer, because it's known to violate the rule,

... uh, isn't that a bug to be fixed?

> and
> allocations in ErrorContext, which is used during error recovery, e.g if
> you indeed PANIC while in a critical section for some other reason.

Yeah, I realized we'd have to do something about elog's own allocations.
Not sure if a blanket exemption for ErrorContext is the best way. I'd
been thinking of having a way to turn off the complaint once processing
of an elog(PANIC) has started.

> I didn't backpatch this.

Agreed.

BTW, I'm pretty sure you added some redundant assertions in mcxt.c.
eg, palloc does not need its own copy.

regards, tom lane

From:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Andres Freund <andres(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Allocations in critical section (was Re: WAL format and API changes (9.5))
Date:	2014-04-04 14:19:33
Message-ID:	533EBF75.10009@vmware.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 04/04/2014 04:56 PM, Tom Lane wrote:
> Heikki Linnakangas <hlinnakangas(at)vmware(dot)com> writes:
>> Ok, I fixed the issues that the assertion fixed. I also committed a
>> patch to add the assertion itself; let's see if the buildfarm finds more
>> cases that violate the rule.
>
>> It ignores the checkpointer, because it's known to violate the rule,
>
> ... uh, isn't that a bug to be fixed?

Yes. One step a time ;-).

>> and
>> allocations in ErrorContext, which is used during error recovery, e.g if
>> you indeed PANIC while in a critical section for some other reason.
>
> Yeah, I realized we'd have to do something about elog's own allocations.
> Not sure if a blanket exemption for ErrorContext is the best way. I'd
> been thinking of having a way to turn off the complaint once processing
> of an elog(PANIC) has started.

Hmm. PANIC processing should avoid allocations too, except in
ErrorContext, because otherwise you might get an out-of-memory during
PANIC processing.

ErrorContext also covers elog(DEBUG2, ...). I presume we'll want to
ignore that too. Although I also tried without the exemption for
ErrorContext at first, and didn't get any failures from the regression
tests, so I guess we never do that in a critical section. I was a bit
surprised by that.

> BTW, I'm pretty sure you added some redundant assertions in mcxt.c.
> eg, palloc does not need its own copy.

palloc is copy-pasted from MemoryContextAlloc - it does need its own copy.

- Heikki

From:	Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-04-08 14:35:22
Message-ID:	CAA4eK1Juw0ySkMaQOM55QH1K-DoRBo+OvCcPCv1V7grYETKULw@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, Apr 3, 2014 at 7:44 PM, Heikki Linnakangas
<hlinnakangas(at)vmware(dot)com> wrote:
> I'd like to do some changes to the WAL format in 9.5. I want to annotate
> each WAL record with the blocks that they modify. Every WAL record already
> includes that information, but it's done in an ad hoc way, differently in
> every rmgr. The RelFileNode and block number are currently part of the WAL
> payload, and it's the REDO routine's responsibility to extract it. I want to
> include that information in a common format for every WAL record type.
>
> That makes life a lot easier for tools that are interested in knowing which
> blocks a WAL record modifies. One such tool is pg_rewind; it currently has
> to understand every WAL record the backend writes. There's also a tool out
> there called pg_readahead, which does prefetching of blocks accessed by WAL
> records, to speed up PITR. I don't think that tool has been actively
> maintained, but at least part of the reason for that is probably that it's a
> pain to maintain when it has to understand the details of every WAL record
> type.
>
> It'd also be nice for contrib/pg_xlogdump and backend code itself. The
> boilerplate code in all WAL redo routines, and writing WAL records, could be
> simplified.

I think it will also be useful, if we want to implement table/tablespace
PITR.

>
> That's for the normal cases. We'll need a couple of variants for also
> registering buffers that don't need full-page images, and perhaps also a
> function for registering a page that *always* needs a full-page image,
> regardless of the LSN. A few existing WAL record types just WAL-log the
> whole page, so those ad-hoc full-page images could be replaced with this.
>
> With these changes, a typical WAL insertion would look like this:
>
> /* register the buffer with the WAL record, with ID 0 */
> XLogRegisterBuffer(0, buf, true);
>
> rdata[0].data = (char *) &xlrec;
> rdata[0].len = sizeof(BlahRecord);
> rdata[0].buffer_id = -1; /* -1 means the data is always included */
> rdata[0].next = &(rdata[1]);
>
> rdata[1].data = (char *) mydata;
> rdata[1].len = mydatalen;
> rdata[1].buffer_id = 0; /* 0 here refers to the buffer registered
> above */
> rdata[1].next = NULL
>
> ...
> recptr = XLogInsert(RM_BLAH_ID, xlinfo, rdata);
>
> PageSetLSN(buf, recptr);

If we do register buffer's (that require or don't require FPI) before
calling XLogInsert(), then will there be any impact to handle case
where we come to know that we need to backup the buffer after
taking WALInsertLock.. or will that part of code remains same as it is
today.

> Redo
> ----
>
> There are four different states a block referenced by a typical WAL record
> can be in:
>
> 1. The old page does not exist at all (because the relation was truncated
> later)
> 2. The old page exists, but has an LSN higher than current WAL record, so it
> doesn't need replaying.
> 3. The LSN is < current WAL record, so it needs to be replayed.
> 4. The WAL record contains a full-page image, which needs to be restored.

I think there might be a need to have separate handling for some special
cases like Init Page which is used in few ops (Insert/Update/multi-insert).
Is there any criteria to decide if it needs to be a separate state or a special
handling for operations?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

From:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-05-17 11:54:07
Message-ID:	53774DDF.2090600@vmware.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 04/03/2014 06:37 PM, Tom Lane wrote:
>> >Let's simplify that, and have one new function, XLogOpenBuffer, which
>> >returns a return code that indicates which of the four cases we're
>> >dealing with. A typical redo function looks like this:
>> > if (XLogOpenBuffer(0, &buffer) == BLK_REPLAY)
>> > {
>> > /* Modify the page */
>> > ...
>> > PageSetLSN(page, lsn);
>> > MarkBufferDirty(buffer);
>> > }
>> > if (BufferIsValid(buffer))
>> > UnlockReleaseBuffer(buffer);
>> >The '0' in the XLogOpenBuffer call is the ID of the block reference
>> >specified in the XLogRegisterBuffer call, when the WAL record was created.
> +1, but one important step here is finding the data to be replayed.
> That is, a large part of the complexity of replay routines has to do
> with figuring out which parts of the WAL record were elided due to
> full-page-images, and locating the remaining parts. What can we do
> to make that simpler?
>
> Ideally, if XLogOpenBuffer (bad name BTW) returns BLK_REPLAY, it would
> also calculate and hand back the address/size of the logged data that
> had been pointed to by the associated XLogRecData chain item. The
> trouble here is that there might've been multiple XLogRecData items
> pointing to the same buffer. Perhaps the magic ID number you give to
> XLogOpenBuffer should be thought of as identifying an XLogRecData chain
> item, not so much a buffer? It's fairly easy to see what to do when
> there's just one chain item per buffer, but I'm not sure what to do
> if there's more than one.

Ok, I wrote the first version of a patch for this. The bulk of the patch
is fairly mechanical changes to all the routines that generate WAL
records and their redo routines, to use the new facilities. I'm sure
we'll have to go through a few rounds of discussions and review on the
APIs and names, but I wanted to go and change all the rest of the code
now anyway, to make sure the APIs cover the needs of all the WAL records.

The interesting part is the new APIs for constructing a WAL record, and
replaying it. I haven't polished the implementation of the APIs, but I'm
fairly happy with the way the WAL generation and redo routines now look
like.

Constructing a WAL record is now more convenient; no more constructing
of XLogRecData structs. There's a function, XLogRegisterData(int id,
char *data, int len), that takes the pointer and length directly as
arguments. You can call it multiple times with the same 'id', and the
data will be appended. (although I kept an API function that still
allows you to pass a chain of XLogRecDatas too, for the odd case where
that's more convenient)

Internally, XLogRegisterData() still constructs a chain of XLogRecDatas,
using a small number of static XLogRecData structs. That's a tad messy,
and I hope there is a better way to do that, but I wanted to focus on
the API for now.

Here's some more details on the API (also included in the README in the
patch):

Constructing a WAL record
=========================

A WAL record consists of multiple chunks of data. Each chunk is
identified by an ID number. A WAL record also contains information about
the blocks that it applies to, and each block reference is also
identified by an ID number. If the same ID number is used for data and a
block reference, the data is left out of the WAL record if a full-page
image of the block is taken.

The API for constructing a WAL record consists of four functions:
XLogBeginInsert, XLogRegisterBuffer, XLogRegisterData, and XLogInsert.
First, call XLogBeginInsert(). Then register all the buffers modified,
and data needed to replay the changes, using XLogRegister* functions.
Finally, insert the constructed record to the WAL with XLogInsert(). For
example:

XLogBeginInsert();

/* register buffers modified as part of this action */
XLogRegisterBuffer(0, lbuffer, REGBUF_STANDARD);
XLogRegisterBuffer(1, rbuffer, REGBUF_STANDARD);

/* register data that is always included in the WAL record */
XLogRegisterData(-1, &xlrec, SizeOfFictionalAction);

/*
* register data associated with lbuffer. This will not be
* included in the record if a full-page image is taken.
*/
XLogRegisterData(0, footuple->data, footuple->len);

/* data associated with rbuffer */
XLogRegisterData(0, data2, len2);

/*
* Ok, all the data and buffers to include in the WAL record
* have been registered. Insert the record.
*/
recptr = XLogInsert(RM_FOO_ID, XLOG_FOOBAR_DOSTUFF);

Details of the API functions:

void XLogRegisterBuffer(int id, Buffer buf, int flags);
-----

XLogRegisterBuffer adds information about a data block to the WAL
record. 'id' is an arbitrary number used to identify this page reference
in the redo routine. The information needed to re-find the page at redo
- the relfilenode, fork, and block number - are included in the WAL record.

XLogInsert will automatically include a full copy of the page contents,
if this is the first modification of the buffer since the last
checkpoint. It is important to register every buffer modified by the
action with XLogRegisterBuffer, to avoid torn-page hazards.

The flags control when and how the buffer contents are included in the
WAL record. REGBUF_STANDARD means that the page follows the standard
page layout, and causes the area between pd_lower and pd_upper to be
left out from the image, reducing the WAL volume. Normally, a full-page
image is taken only if the page has not been modified since the last
checkpoint, and only if full_page_writes=on or an online backup is in
progress. The REGBUF_FORCE_IMAGE flag can be used to force a full-page
image to always be included. If the REGBUF_WILL_INIT flag is given, a
full-page image is never taken. The redo routine must completely
re-generate the page contents, using the data included in the WAL record.

void XLogRegisterData(int id, char *data, int len);
-----

XLogRegisterData is used to include arbitrary data in the WAL record. If
XLogRegisterData() is called multiple times with the same 'id', the data
are appended, and will be made available to the redo routine as one
contiguous chunk.

If the same 'id' is used in an XLogRegisterBuffer and XLogRegisterData()
call, the data is not included in the WAL record if a full-page image of
the page is taken. Data registered with id -1 is not related with a
buffer, and is always included.

Writing a REDO routine
======================

A REDO routine uses the data and page references included in the WAL
record to reconstruct the new state of the page. To access the data and
pages included in the WAL record, the following functions are available:

char *XLogRecGetData(XLogRecord *record)
-----

Returns the "main" chunk of data included in the WAL record. That is,
the data included in the record with XLogRegisterData(-1, ...).

XLogReplayResult XLogReplayBuffer(int id, Buffer *buf)
-----

Reads a block associated with the WAL record currently being replayed,
with the given 'id'. The shared buffer is returned in *buf, or
InvalidBuffer if the page could not be found. The block is read into a
shared buffer and locked in exclusive mode. Returns one of the following
result codes:

typedef enum
{
BLK_NEEDS_REDO, /* block needs to be replayed */
BLK_DONE, /* block was already replayed */
BLK_RESTORED, /* block was restored from a full-page image */
BLK_NOTFOUND /* block was not found (and hence does not need
* to be replayed) */
} XLogReplayResult;

The REDO routine must redo the actions to the page if XLogReplayBuffer
returns BLK_NEEDS_REDO. In other cases, no further action is no
required, although the result code can be used to distinguish the reason.

After modifying the page (if it was necessary), the REDO routine must
unlock and release the buffer. Note that the buffer must be unlocked and
released even if no action was required.

XLogReplayResult XLogReplayBufferExtended(int id, ReadBufferMode mode,
bool get_cleanup_lock, Buffer *buf)
-----

Like XLogReplayBuffer(), but with a few extra options.

mode' can be passed to e.g force the page to be zeroed, instead of
reading it from disk. This RBM_ZERO mode should be used to re-initialize
pages registered in the REGBUF_WILL_INIT mode in XLogRegisterBuffer().

if 'get_cleanup_lock' is TRUE, a stronger "cleanup" lock on the page is
acquired, instead of a reguler exclusive-lock.

char *XLogGetPayload(XLogRecord *record, int id, int *len)
-----

Returns a chunk of data included in the WAL record (with
XLogRegisterData(id, ...)). The length of the data is returned in *len.
This is typically used after XLogReplayBuffer() returned BLK_NEEDS_REDO,
with the same 'id', to get the information required to redo the actions
on the page. If no data with the given id is included, perhaps because a
full-page image of the associated buffer was taken instead, an error is
thrown.

void XLogBlockRefGetTag(XLogRecord *record, int id, RelFileNode *rnode,
ForkNumber *forknum, BlockNumber *blknum);
-----

Returns the relfilenode, fork number, and block number of the page
registered with the given ID.

Finally, here's the usual pattern for how a REDO routine works (taken
from btree_xlog_insert):

if (XLogReplayBuffer(0, &buffer) == BLK_NEEDS_REDO)
{
int datalen;
char *datapos = XLogGetPayload(record, 0, &datalen);

page = (Page) BufferGetPage(buffer);

if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
false, false) == InvalidOffsetNumber)
elog(PANIC, "btree_insert_redo: failed to add item");

PageSetLSN(page, lsn);
MarkBufferDirty(buffer);
}
if (BufferIsValid(buffer))
UnlockReleaseBuffer(buffer);

- Heikki

Attachment	Content-Type	Size
wal-format-and-api-changes-1.patch.gz	application/gzip	82.8 KB

From:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-06-14 08:04:43
Message-ID:	539C021B.9070001@vmware.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 05/17/2014 02:54 PM, Heikki Linnakangas wrote:
> On 04/03/2014 06:37 PM, Tom Lane wrote:
>>>> Let's simplify that, and have one new function, XLogOpenBuffer, which
>>>> returns a return code that indicates which of the four cases we're
>>>> dealing with. A typical redo function looks like this:
>>>> if (XLogOpenBuffer(0, &buffer) == BLK_REPLAY)
>>>> {
>>>> /* Modify the page */
>>>> ...
>>>> PageSetLSN(page, lsn);
>>>> MarkBufferDirty(buffer);
>>>> }
>>>> if (BufferIsValid(buffer))
>>>> UnlockReleaseBuffer(buffer);
>>>> The '0' in the XLogOpenBuffer call is the ID of the block reference
>>>> specified in the XLogRegisterBuffer call, when the WAL record was created.
>> +1, but one important step here is finding the data to be replayed.
>> That is, a large part of the complexity of replay routines has to do
>> with figuring out which parts of the WAL record were elided due to
>> full-page-images, and locating the remaining parts. What can we do
>> to make that simpler?
>>
>> Ideally, if XLogOpenBuffer (bad name BTW) returns BLK_REPLAY, it would
>> also calculate and hand back the address/size of the logged data that
>> had been pointed to by the associated XLogRecData chain item. The
>> trouble here is that there might've been multiple XLogRecData items
>> pointing to the same buffer. Perhaps the magic ID number you give to
>> XLogOpenBuffer should be thought of as identifying an XLogRecData chain
>> item, not so much a buffer? It's fairly easy to see what to do when
>> there's just one chain item per buffer, but I'm not sure what to do
>> if there's more than one.
>
> Ok, I wrote the first version of a patch for this.

Here's a new version, rebased against master. No other changes.

- Heikki

Attachment	Content-Type	Size
wal-format-and-api-changes-2.patch.gz	application/gzip	80.1 KB

From:	Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-06-14 18:43:50
Message-ID:	20140614184349.GU18688@eldon.alvh.no-ip.org
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Heikki Linnakangas wrote:

> Here's a new version, rebased against master. No other changes.

This is missing xlogrecord.h ...

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To:	Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-06-14 21:11:17
Message-ID:	539CBA75.3050802@vmware.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 06/14/2014 09:43 PM, Alvaro Herrera wrote:
> Heikki Linnakangas wrote:
>
>> Here's a new version, rebased against master. No other changes.
>
> This is missing xlogrecord.h ...

Oops, here you are.

- Heikki

Attachment	Content-Type	Size
wal-format-and-api-changes-3.patch.gz	application/gzip	80.6 KB

From:	Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-06-27 06:12:05
Message-ID:	CAB7nPqQO57-nEXgT2zkvmsYwXov9kzVzEBio05=SSH_ghe8+=Q@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sun, Jun 15, 2014 at 6:11 AM, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com
> wrote:

> On 06/14/2014 09:43 PM, Alvaro Herrera wrote:
>
>> Heikki Linnakangas wrote:
>>
>> Here's a new version, rebased against master. No other changes.
>>>
>>
>> This is missing xlogrecord.h ...
>>
>
> Oops, here you are.
>
Looking at this patch, it does not compile when assertions are enabled
because of this assertion in heapam.c:
Assert(recdata == data + datalen);
It seems like a thinko as removing it makes the code compile.

Here is a review of this patch:
- Compilation without warnings, regression tests pass
- That's not a small patch, but the changes mainly done are xlog record
insertion in accordance to the new API format.
$ git diff master --stat | tail -n1
52 files changed, 3601 insertions(+), 4371 deletions(-)

Some comments about the code:
1) In src/backend/access/transam/README:
's/no further action is no required/no further action is required/g'
2) Description of XLogBeginInsert, XLogHasBlockRef, XLogGetBlockRef and
XLogGetBlockRefIds are missing in src/backend/access/transam/README.
3) There are a couple of code blocks marked as FIXME and BROKEN. There are
as well some NOT_USED blocks.
4) Current patch crashes when running make check in contrib/test_decoding.
Looking closer, wal_level = logical is broken:
=# show wal_level;
wal_level
-----------
logical
(1 row)
=# create database foo;
The connection to the server was lost. Attempting reset: Failed.

Looking even closer, log_heap_new_cid is broken:
(lldb) bt
* thread #1: tid = 0x0000, 0x00007fff8a3c2866
libsystem_kernel.dylib`__pthread_kill + 10, stop reason = signal SIGSTOP
* frame #0: 0x00007fff8a3c2866 libsystem_kernel.dylib`__pthread_kill + 10
frame #1: 0x00007fff8d37235c libsystem_pthread.dylib`pthread_kill + 92
frame #2: 0x00007fff9369db1a libsystem_c.dylib`abort + 125
frame #3: 0x0000000104428819
postgres`ExceptionalCondition(conditionName=0x00000001044a711c,
errorType=0x000000010448b19f, fileName=0x00000001044a2cfd, lineNumber=6879)
+ 137 at assert.c:54
frame #4: 0x0000000103f08dbe
postgres`log_heap_new_cid(relation=0x00007fae4482afb8,
tup=0x00007fae438062e8) + 126 at heapam.c:6879
frame #5: 0x0000000103f080cf
postgres`heap_insert(relation=0x00007fae4482afb8, tup=0x00007fae438062e8,
cid=0, options=0, bistate=0x0000000000000000) + 383 at heapam.c:2095
frame #6: 0x0000000103f09b6a
postgres`simple_heap_insert(relation=0x00007fae4482afb8,
tup=0x00007fae438062e8) + 74 at heapam.c:2533
frame #7: 0x000000010406aacf postgres`createdb(stmt=0x00007fae44843690)
+ 6335 at dbcommands.c:511
frame #8: 0x000000010429def8
postgres`standard_ProcessUtility(parsetree=0x00007fae44843690,
queryString=0x00007fae44842c38, context=PROCESS_UTILITY_TOPLEVEL,
params=0x0000000000000000, dest=0x00007fae44843a18,
completionTag=0x00007fff5bd4ee30) + 1720 at utility.c:56

5) Yes,there are some WAL records that have only data related to buffers,
XLOG_GIN_VACUUM_DATA_LEAF_PAGE being one, with nothing registered with
buffer_id = -d, but using a dummy entry seems counter-productive:
+ rdata = rdata_head;
+ if (rdata == NULL)
+ {
+ /*
+ * the rest of this function assumes that there is at least
some
+ * rdata entries, so fake an empty rdata entry
+ */
+ dummy_rdata.data = NULL;
+ dummy_rdata.len = 0;
+ dummy_rdata.next = NULL;
+ rdata = &dummy_rdata;
+ }
Could this be removed and replaced by a couple of checks on rdata being
NULL in some cases? XLogInsert should be updated in consequence.
6) XLogRegisterData creates a XLogRecData entry and appends it in one of
the static XLogRecData entries if buffer_id >= 0 or to
rdata_head/rdata_tail if buffer_id = -1 (for permanent data related to the
WAL record). XLogRegisterXLogRecData does something similar, but with an
entry of XLogRecData already built. What do you think about removing
XLogRegisterXLogRecData and keeping only XLogRegisterData. This makes the
interface simpler, and XLogRecData could be kept private in xlog.c. This
would need more work especially on gin side where I am seeing for example
constructLeafRecompressWALData directly building a XLogRecData entry.
By doing so, we could as remove the while loop at the bottom of
XLogRegisterXLogRecData as we would insert in the tail only the latest
record created, aka replacing that:
while ((*tail)->next)
*tail = (*tail)->next;
By simply that:
*tail = (*tail)->next;
7) New structures at the top of xlog.c should have more comments about
their role, particularly rdata_head and rdata_tail that store main WAL data
(id of -1)
8) What do you think about adding in the README a description about how to
retrieve the block list modified by a WAL record, for a flow similar to
that:
record = XLogReadRecord(...);
nblocks = XLogGetBlockRefIds(record, blockids);
for (i = 0; i < nblocks; i++)
{
bkpb = XLogGetBlockRef(record, id, NULL);
// stuff
}
pg_xlogdump is a good example of what can be done as well.
9) In ginxlog.c, shouldn't this block of code use XLogGetPayload?
+ char *payload = XLogRecGetData(record) +
sizeof(ginxlogInsert);
+#ifdef NOT_USED
leftChildBlkno = BlockIdGetBlockNumber((BlockId) payload);
+#endif
10) Shouldn't the call to XLogBeginInsert come first here?
@@ -1646,7 +1647,16 @@ spgAddNodeAction(Relation index, SpGistState *state,
{
XLogRecPtr recptr;

- recptr = XLogInsert(RM_SPGIST_ID,
XLOG_SPGIST_ADD_NODE, rdata);
+ XLogRegisterBuffer(0, saveCurrent.buffer,
REGBUF_STANDARD); /* orig page */
+ XLogRegisterBuffer(1, current->buffer,
REGBUF_STANDARD); /* new page */
+ if (xlrec.parentBlk == 2)
+ XLogRegisterBuffer(2, parent->buffer,
REGBUF_STANDARD);
+
+ XLogBeginInsert();
+ XLogRegisterData(-1, (char *) &xlrec,
sizeof(xlrec));
+ XLogRegisterData(-1, (char *) newInnerTuple,
newInnerTuple->size);
+
+ recptr = XLogInsert(RM_SPGIST_ID,
XLOG_SPGIST_ADD_NODE);
11) Any reason for raising a PANIC here?
@@ -1126,7 +1126,7 @@ LWLockRelease(LWLock *l)
break;
}
if (i < 0)
- elog(ERROR, "lock %s %d is not held", T_NAME(l), T_ID(l));
+ elog(PANIC, "lock %s %d is not held", T_NAME(l), T_ID(l));
num_held_lwlocks--;
for (; i < num_held_lwlocks; i++)
14) SPgist redo is crashing (during make installcheck run on a master, redo
done on a standby at recovery).
(lldb) bt
* thread #1: tid = 0x0000, 0x00007fff8a3c2866
libsystem_kernel.dylib`__pthread_kill + 10, stop reason = signal SIGSTOP
* frame #0: 0x00007fff8a3c2866 libsystem_kernel.dylib`__pthread_kill + 10
frame #1: 0x00007fff8d37235c libsystem_pthread.dylib`pthread_kill + 92
frame #2: 0x00007fff9369db1a libsystem_c.dylib`abort + 125
frame #3: 0x000000010ba8d819
postgres`ExceptionalCondition(conditionName=0x000000010bb21ea9,
errorType=0x000000010baf019f, fileName=0x000000010bb212dd, lineNumber=65) +
137 at assert.c:54
frame #4: 0x000000010b5c3a50
postgres`addOrReplaceTuple(page=0x000000010bfee980,
tuple=0x00007ffd41822902, size=770059, offset=18) + 688 at spgxlog.c:65
frame #5: 0x000000010b5c031f postgres`spgRedoMoveLeafs(lsn=71708608,
record=0x00007ffd41822800) + 719 at spgxlog.c:263
frame #6: 0x000000010b5bf492 postgres`spg_redo(lsn=71708608,
record=0x00007ffd41822800) + 402 at spgxlog.c:988
frame #7: 0x000000010b5e774a postgres`StartupXLOG + 8474 at xlog.c:6780
frame #8: 0x000000010b86a5fe postgres`StartupProcessMain + 430 at
startup.c:224
frame #9: 0x000000010b600409 postgres`AuxiliaryProcessMain(argc=2,
argv=0x00007fff546ea190) + 1897 at bootstrap.c:416
frame #10: 0x000000010b864998
postgres`StartChildProcess(type=StartupProcess) + 328 at postmaster.c:5090
frame #11: 0x000000010b862aa9 postgres`PostmasterMain(argc=3,
argv=0x00007ffd40c04470) + 5401 at postmaster.c:1212
frame #12: 0x000000010b7a9335 postgres`main(argc=3,
argv=0x00007ffd40c04470) + 773 at main.c:227

Also, I wanted to run the WAL replay facility to compare before and after
buffer images, but code crashed before... I am still planning to do so once
this code gets more stable though.
Regards,
--
Michael

From:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To:	Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc:	Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-06-30 20:51:45
Message-ID:	53B1CDE1.1040901@vmware.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Thanks for the review!

On 06/27/2014 09:12 AM, Michael Paquier wrote:
> Looking at this patch, it does not compile when assertions are enabled
> because of this assertion in heapam.c:
> Assert(recdata == data + datalen);
> It seems like a thinko as removing it makes the code compile.

Fixed.

> Some comments about the code:
> 1) In src/backend/access/transam/README:
> 's/no further action is no required/no further action is required/g'

Fixed.

> 2) Description of XLogBeginInsert, XLogHasBlockRef, XLogGetBlockRef and
> XLogGetBlockRefIds are missing in src/backend/access/transam/README.

Added descriptions of XLogBeginInsert and XLogHasBlockRef. Ḯ'm not sure
what to do with the latter two. They're not really intended for use by
redo routines, but for things like pg_xlogdump that want to extract
information about any record type.

> 3) There are a couple of code blocks marked as FIXME and BROKEN. There are
> as well some NOT_USED blocks.

Yeah. The FIXMEs and BROKEN blocks are all in rm_desc routines. They
mostly stand for code that used to print block numbers or relfilenodes,
which doesn't make much sense to do in an ad hoc way in rmgr-specific
desc routines anymore. Will need to go through them all an decide what's
the important information to print for each record type.

The few NOT_USED blocks are around code in redo routines that extract
some information from the WAL record, that isn't needed anymore. We
could remove the fields from the WAL records altogether, but might be
useful to keep for debugging purposes.

> 4) Current patch crashes when running make check in contrib/test_decoding.
> Looking closer, wal_level = logical is broken:
> =# show wal_level;
> wal_level
> -----------
> logical
> (1 row)
> =# create database foo;
> The connection to the server was lost. Attempting reset: Failed.

Fixed.

> Looking even closer, log_heap_new_cid is broken:

Fixed.

> 6) XLogRegisterData creates a XLogRecData entry and appends it in one of
> the static XLogRecData entries if buffer_id >= 0 or to
> rdata_head/rdata_tail if buffer_id = -1 (for permanent data related to the
> WAL record). XLogRegisterXLogRecData does something similar, but with an
> entry of XLogRecData already built. What do you think about removing
> XLogRegisterXLogRecData and keeping only XLogRegisterData. This makes the
> interface simpler, and XLogRecData could be kept private in xlog.c. This
> would need more work especially on gin side where I am seeing for example
> constructLeafRecompressWALData directly building a XLogRecData entry.

Hmm. I considered that, but punted because I didn't want to rewrite all
the places that currently use XLogRecDatas in less straightforward ways.
I kept XLogRegisterXLogRecData as an escape hatch for those.

But I bit the bullet now, and got rid of all the
XLogRegisterXLogRecData() calls. XLogRecData struct is now completely
private to xlog.c.

Another big change in the attached patch version is that
XLogRegisterData() now *copies* the data to a working area, instead of
merely adding a XLogRecData entry to the chain. This simplifies things
in xlog.c, now that the XLogRecData concept is not visible to the rest
of the system. This is a subtle semantic change for the callers: you can
no longer register a struct first, and fill the fields afterwards.

That adds a lot of memcpys to the critical path, which could be bad for
performance. In practice, I think it's OK, or even a win, because a
typical WAL record payload is very small. But I haven't measured that.
If it becomes an issue, we could try to optimize, and link larger data
blocks to the rdata chain, but memcpy small small chunks of data. There
are a few WAL record types that don't fit in a small statically
allocated buffer, so I provided a new function, XLogEnsureSpace(), that
can be called before XLogBegin() to allocate a large-enough working area.

These changes required changing a couple of places where WAL records are
created:

* ginPlaceToPage. This function has a strange split of responsibility
between ginPlaceToPage and dataPlaceToPage/entryPlaceToPage.
ginPlaceToPage calls dataPlaceToPage or entryPlaceToPage, depending on
what kind of a tree it is, and dataPlaceToPage or entryPlaceToPage
contributes some data to the WAL record. Then ginPlaceToPage adds some
more, and finally calls XLogInsert(). I simplified the SPLIT case so
that we now just create full page images of both page halves. We were
already logging all the data on both page halves, but we were doing it
in a "smarter" way, knowing what kind of pages they were. For example,
for an entry tree page, we included an IndexTuple struct for every tuple
on the page, but didn't include the item pointers. A straight full page
image takes more space, but not much, so I think in any real application
it's going to be lost in noise. (again, haven't measured it though)

> 8) What do you think about adding in the README a description about how to
> retrieve the block list modified by a WAL record, for a flow similar to
> that:
> record = XLogReadRecord(...);
> nblocks = XLogGetBlockRefIds(record, blockids);
> for (i = 0; i < nblocks; i++)
> {
> bkpb = XLogGetBlockRef(record, id, NULL);
> // stuff
> }
> pg_xlogdump is a good example of what can be done as well.

Yeah, that's probably in order. I don't actually like the way that works
currently very much. Passing the char[256] blockids array to the
function feels weird. Will have to think a bit harder about that..

> 9) In ginxlog.c, shouldn't this block of code use XLogGetPayload?
> + char *payload = XLogRecGetData(record) +
> sizeof(ginxlogInsert);
> +#ifdef NOT_USED
> leftChildBlkno = BlockIdGetBlockNumber((BlockId) payload);
> +#endif

No. XLogRecGetData() is used to get data from the "main chunk", while
XLogGetPayload() is used to get data associated with a buffer. This does
the former.

> 10) Shouldn't the call to XLogBeginInsert come first here?
> @@ -1646,7 +1647,16 @@ spgAddNodeAction(Relation index, SpGistState *state,
> {
> XLogRecPtr recptr;
>
> - recptr = XLogInsert(RM_SPGIST_ID,
> XLOG_SPGIST_ADD_NODE, rdata);
> + XLogRegisterBuffer(0, saveCurrent.buffer,
> REGBUF_STANDARD); /* orig page */
> + XLogRegisterBuffer(1, current->buffer,
> REGBUF_STANDARD); /* new page */
> + if (xlrec.parentBlk == 2)
> + XLogRegisterBuffer(2, parent->buffer,
> REGBUF_STANDARD);
> +
> + XLogBeginInsert();
> + XLogRegisterData(-1, (char *) &xlrec,
> sizeof(xlrec));
> + XLogRegisterData(-1, (char *) newInnerTuple,
> newInnerTuple->size);
> +
> + recptr = XLogInsert(RM_SPGIST_ID,
> XLOG_SPGIST_ADD_NODE);

Yeah. Apparently we have no regression test coverage of this codepath,
or it would've triggered an assertion. Fixed now, just by looking at the
code, but there's still no test coverage of this so it's likely still
broken in other ways. Before this is committed, I think we'll need to go
through the coverage report and make sure all the XLogInsert() codepaths
(and their redo routines) are covered.

> 11) Any reason for raising a PANIC here?
> @@ -1126,7 +1126,7 @@ LWLockRelease(LWLock *l)
> break;
> }
> if (i < 0)
> - elog(ERROR, "lock %s %d is not held", T_NAME(l), T_ID(l));
> + elog(PANIC, "lock %s %d is not held", T_NAME(l), T_ID(l));
> num_held_lwlocks--;
> for (; i < num_held_lwlocks; i++)

No, leftover from debugging. Fixed.

> 14) SPgist redo is crashing (during make installcheck run on a master, redo
> done on a standby at recovery).
> (lldb) bt
> * thread #1: tid = 0x0000, 0x00007fff8a3c2866
> libsystem_kernel.dylib`__pthread_kill + 10, stop reason = signal SIGSTOP
> * frame #0: 0x00007fff8a3c2866 libsystem_kernel.dylib`__pthread_kill + 10
> frame #1: 0x00007fff8d37235c libsystem_pthread.dylib`pthread_kill + 92
> frame #2: 0x00007fff9369db1a libsystem_c.dylib`abort + 125
> frame #3: 0x000000010ba8d819
> postgres`ExceptionalCondition(conditionName=0x000000010bb21ea9,
> errorType=0x000000010baf019f, fileName=0x000000010bb212dd, lineNumber=65) +
> 137 at assert.c:54
> frame #4: 0x000000010b5c3a50
> postgres`addOrReplaceTuple(page=0x000000010bfee980,
> tuple=0x00007ffd41822902, size=770059, offset=18) + 688 at spgxlog.c:65
> frame #5: 0x000000010b5c031f postgres`spgRedoMoveLeafs(lsn=71708608,
> record=0x00007ffd41822800) + 719 at spgxlog.c:263
> frame #6: 0x000000010b5bf492 postgres`spg_redo(lsn=71708608,
> record=0x00007ffd41822800) + 402 at spgxlog.c:988
> frame #7: 0x000000010b5e774a postgres`StartupXLOG + 8474 at xlog.c:6780
> frame #8: 0x000000010b86a5fe postgres`StartupProcessMain + 430 at
> startup.c:224
> frame #9: 0x000000010b600409 postgres`AuxiliaryProcessMain(argc=2,
> argv=0x00007fff546ea190) + 1897 at bootstrap.c:416
> frame #10: 0x000000010b864998
> postgres`StartChildProcess(type=StartupProcess) + 328 at postmaster.c:5090
> frame #11: 0x000000010b862aa9 postgres`PostmasterMain(argc=3,
> argv=0x00007ffd40c04470) + 5401 at postmaster.c:1212
> frame #12: 0x000000010b7a9335 postgres`main(argc=3,
> argv=0x00007ffd40c04470) + 773 at main.c:227

Fixed.

> Also, I wanted to run the WAL replay facility to compare before and after
> buffer images, but code crashed before... I am still planning to do so once
> this code gets more stable though.

I did that earlier, and did it now again. I got a single difference from
a sp-gist SPLIT_TUPLE operation. I didn't dig deeper right now, will
have to investigate it later.

- Heikki

Attachment	Content-Type	Size
wal-format-and-api-changes-4.patch.gz	application/gzip	89.6 KB

From:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-06-30 22:18:39
Message-ID:	CAHGQGwG-uvL8krea3f_ExRSYkPK6t=cn6YzoYpj_vDCfx7H_ng@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, Jul 1, 2014 at 5:51 AM, Heikki Linnakangas
<hlinnakangas(at)vmware(dot)com> wrote:
> Thanks for the review!
>
>
> On 06/27/2014 09:12 AM, Michael Paquier wrote:
>>
>> Looking at this patch, it does not compile when assertions are enabled
>> because of this assertion in heapam.c:
>> Assert(recdata == data + datalen);
>> It seems like a thinko as removing it makes the code compile.
>
>
> Fixed.
>
>
>> Some comments about the code:
>> 1) In src/backend/access/transam/README:
>> 's/no further action is no required/no further action is required/g'
>
>
> Fixed.
>
>
>> 2) Description of XLogBeginInsert, XLogHasBlockRef, XLogGetBlockRef and
>> XLogGetBlockRefIds are missing in src/backend/access/transam/README.
>
>
> Added descriptions of XLogBeginInsert and XLogHasBlockRef. Ḯ'm not sure
> what to do with the latter two. They're not really intended for use by redo
> routines, but for things like pg_xlogdump that want to extract information
> about any record type.
>
>
>> 3) There are a couple of code blocks marked as FIXME and BROKEN. There are
>> as well some NOT_USED blocks.
>
>
> Yeah. The FIXMEs and BROKEN blocks are all in rm_desc routines. They mostly
> stand for code that used to print block numbers or relfilenodes, which
> doesn't make much sense to do in an ad hoc way in rmgr-specific desc
> routines anymore. Will need to go through them all an decide what's the
> important information to print for each record type.
>
> The few NOT_USED blocks are around code in redo routines that extract some
> information from the WAL record, that isn't needed anymore. We could remove
> the fields from the WAL records altogether, but might be useful to keep for
> debugging purposes.
>
>
>> 4) Current patch crashes when running make check in
>> contrib/test_decoding.
>> Looking closer, wal_level = logical is broken:
>> =# show wal_level;
>> wal_level
>> -----------
>> logical
>> (1 row)
>> =# create database foo;
>> The connection to the server was lost. Attempting reset: Failed.
>
>
> Fixed.
>
>
>> Looking even closer, log_heap_new_cid is broken:
>
>
> Fixed.
>
>
>> 6) XLogRegisterData creates a XLogRecData entry and appends it in one of
>> the static XLogRecData entries if buffer_id >= 0 or to
>> rdata_head/rdata_tail if buffer_id = -1 (for permanent data related to the
>> WAL record). XLogRegisterXLogRecData does something similar, but with an
>> entry of XLogRecData already built. What do you think about removing
>> XLogRegisterXLogRecData and keeping only XLogRegisterData. This makes the
>> interface simpler, and XLogRecData could be kept private in xlog.c. This
>> would need more work especially on gin side where I am seeing for example
>> constructLeafRecompressWALData directly building a XLogRecData entry.
>
>
> Hmm. I considered that, but punted because I didn't want to rewrite all the
> places that currently use XLogRecDatas in less straightforward ways. I kept
> XLogRegisterXLogRecData as an escape hatch for those.
>
> But I bit the bullet now, and got rid of all the XLogRegisterXLogRecData()
> calls. XLogRecData struct is now completely private to xlog.c.
>
> Another big change in the attached patch version is that XLogRegisterData()
> now *copies* the data to a working area, instead of merely adding a
> XLogRecData entry to the chain. This simplifies things in xlog.c, now that
> the XLogRecData concept is not visible to the rest of the system. This is a
> subtle semantic change for the callers: you can no longer register a struct
> first, and fill the fields afterwards.
>
> That adds a lot of memcpys to the critical path, which could be bad for
> performance. In practice, I think it's OK, or even a win, because a typical
> WAL record payload is very small. But I haven't measured that. If it becomes
> an issue, we could try to optimize, and link larger data blocks to the rdata
> chain, but memcpy small small chunks of data. There are a few WAL record
> types that don't fit in a small statically allocated buffer, so I provided a
> new function, XLogEnsureSpace(), that can be called before XLogBegin() to
> allocate a large-enough working area.
>
> These changes required changing a couple of places where WAL records are
> created:
>
> * ginPlaceToPage. This function has a strange split of responsibility
> between ginPlaceToPage and dataPlaceToPage/entryPlaceToPage. ginPlaceToPage
> calls dataPlaceToPage or entryPlaceToPage, depending on what kind of a tree
> it is, and dataPlaceToPage or entryPlaceToPage contributes some data to the
> WAL record. Then ginPlaceToPage adds some more, and finally calls
> XLogInsert(). I simplified the SPLIT case so that we now just create full
> page images of both page halves. We were already logging all the data on
> both page halves, but we were doing it in a "smarter" way, knowing what kind
> of pages they were. For example, for an entry tree page, we included an
> IndexTuple struct for every tuple on the page, but didn't include the item
> pointers. A straight full page image takes more space, but not much, so I
> think in any real application it's going to be lost in noise. (again,
> haven't measured it though)
>
>
>> 8) What do you think about adding in the README a description about how to
>> retrieve the block list modified by a WAL record, for a flow similar to
>> that:
>> record = XLogReadRecord(...);
>> nblocks = XLogGetBlockRefIds(record, blockids);
>> for (i = 0; i < nblocks; i++)
>> {
>> bkpb = XLogGetBlockRef(record, id, NULL);
>> // stuff
>> }
>> pg_xlogdump is a good example of what can be done as well.
>
>
> Yeah, that's probably in order. I don't actually like the way that works
> currently very much. Passing the char[256] blockids array to the function
> feels weird. Will have to think a bit harder about that..
>
>
>> 9) In ginxlog.c, shouldn't this block of code use XLogGetPayload?
>> + char *payload = XLogRecGetData(record) +
>> sizeof(ginxlogInsert);
>> +#ifdef NOT_USED
>> leftChildBlkno = BlockIdGetBlockNumber((BlockId)
>> payload);
>> +#endif
>
>
> No. XLogRecGetData() is used to get data from the "main chunk", while
> XLogGetPayload() is used to get data associated with a buffer. This does the
> former.
>
>
>> 10) Shouldn't the call to XLogBeginInsert come first here?
>> @@ -1646,7 +1647,16 @@ spgAddNodeAction(Relation index, SpGistState
>> *state,
>> {
>> XLogRecPtr recptr;
>>
>> - recptr = XLogInsert(RM_SPGIST_ID,
>> XLOG_SPGIST_ADD_NODE, rdata);
>> + XLogRegisterBuffer(0, saveCurrent.buffer,
>> REGBUF_STANDARD); /* orig page */
>> + XLogRegisterBuffer(1, current->buffer,
>> REGBUF_STANDARD); /* new page */
>> + if (xlrec.parentBlk == 2)
>> + XLogRegisterBuffer(2, parent->buffer,
>> REGBUF_STANDARD);
>> +
>> + XLogBeginInsert();
>> + XLogRegisterData(-1, (char *) &xlrec,
>> sizeof(xlrec));
>> + XLogRegisterData(-1, (char *) newInnerTuple,
>> newInnerTuple->size);
>> +
>> + recptr = XLogInsert(RM_SPGIST_ID,
>> XLOG_SPGIST_ADD_NODE);
>
>
> Yeah. Apparently we have no regression test coverage of this codepath, or it
> would've triggered an assertion. Fixed now, just by looking at the code, but
> there's still no test coverage of this so it's likely still broken in other
> ways. Before this is committed, I think we'll need to go through the
> coverage report and make sure all the XLogInsert() codepaths (and their redo
> routines) are covered.
>
>
>> 11) Any reason for raising a PANIC here?
>> @@ -1126,7 +1126,7 @@ LWLockRelease(LWLock *l)
>> break;
>> }
>> if (i < 0)
>> - elog(ERROR, "lock %s %d is not held", T_NAME(l), T_ID(l));
>> + elog(PANIC, "lock %s %d is not held", T_NAME(l), T_ID(l));
>> num_held_lwlocks--;
>> for (; i < num_held_lwlocks; i++)
>
>
> No, leftover from debugging. Fixed.
>
>
>> 14) SPgist redo is crashing (during make installcheck run on a master,
>> redo
>> done on a standby at recovery).
>> (lldb) bt
>> * thread #1: tid = 0x0000, 0x00007fff8a3c2866
>> libsystem_kernel.dylib`__pthread_kill + 10, stop reason = signal SIGSTOP
>> * frame #0: 0x00007fff8a3c2866 libsystem_kernel.dylib`__pthread_kill +
>> 10
>> frame #1: 0x00007fff8d37235c libsystem_pthread.dylib`pthread_kill +
>> 92
>> frame #2: 0x00007fff9369db1a libsystem_c.dylib`abort + 125
>> frame #3: 0x000000010ba8d819
>> postgres`ExceptionalCondition(conditionName=0x000000010bb21ea9,
>> errorType=0x000000010baf019f, fileName=0x000000010bb212dd, lineNumber=65)
>> +
>> 137 at assert.c:54
>> frame #4: 0x000000010b5c3a50
>> postgres`addOrReplaceTuple(page=0x000000010bfee980,
>> tuple=0x00007ffd41822902, size=770059, offset=18) + 688 at spgxlog.c:65
>> frame #5: 0x000000010b5c031f postgres`spgRedoMoveLeafs(lsn=71708608,
>> record=0x00007ffd41822800) + 719 at spgxlog.c:263
>> frame #6: 0x000000010b5bf492 postgres`spg_redo(lsn=71708608,
>> record=0x00007ffd41822800) + 402 at spgxlog.c:988
>> frame #7: 0x000000010b5e774a postgres`StartupXLOG + 8474 at
>> xlog.c:6780
>> frame #8: 0x000000010b86a5fe postgres`StartupProcessMain + 430 at
>> startup.c:224
>> frame #9: 0x000000010b600409 postgres`AuxiliaryProcessMain(argc=2,
>> argv=0x00007fff546ea190) + 1897 at bootstrap.c:416
>> frame #10: 0x000000010b864998
>> postgres`StartChildProcess(type=StartupProcess) + 328 at postmaster.c:5090
>> frame #11: 0x000000010b862aa9 postgres`PostmasterMain(argc=3,
>> argv=0x00007ffd40c04470) + 5401 at postmaster.c:1212
>> frame #12: 0x000000010b7a9335 postgres`main(argc=3,
>> argv=0x00007ffd40c04470) + 773 at main.c:227
>
>
> Fixed.
>
>
>> Also, I wanted to run the WAL replay facility to compare before and after
>> buffer images, but code crashed before... I am still planning to do so
>> once
>> this code gets more stable though.
>
>
> I did that earlier, and did it now again. I got a single difference from a
> sp-gist SPLIT_TUPLE operation. I didn't dig deeper right now, will have to
> investigate it later.

With the patch, when I ran the simple test which generates lots of
WAL, I got PANIC error.

=# create extension pgcrypto ;
CREATE EXTENSION
=# create extension pg_trgm ;
CREATE EXTENSION
=# create table hoge (col1 text);
CREATE TABLE
=# create index hogeidx on hoge using gin (col1 gin_trgm_ops);
CREATE INDEX
=# insert into hoge select gen_random_uuid() from generate_series(1,10000000);
PANIC: too many registered buffers
PANIC: too many registered buffers
The connection to the server was lost. Attempting reset: Failed.

Server log is
2014-07-01 07:12:37 JST PANIC: too many registered buffers
2014-07-01 07:12:37 JST STATEMENT: insert into hoge select
gen_random_uuid() from generate_series(1,10000000);
2014-07-01 07:12:40 JST LOG: server process (PID 64020) was
terminated by signal 6: Abort trap
2014-07-01 07:12:40 JST DETAIL: Failed process was running: insert
into hoge select gen_random_uuid() from generate_series(1,10000000);

Regards,

--
Fujii Masao

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-07-01 14:46:41
Message-ID:	CA+TgmoZHK3bWYFJ37oPxtXKbYe+TbiMC5uyczQ5+Q_XLEH3zVg@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, Jun 30, 2014 at 4:51 PM, Heikki Linnakangas
<hlinnakangas(at)vmware(dot)com> wrote:
> Another big change in the attached patch version is that XLogRegisterData()
> now *copies* the data to a working area, instead of merely adding a
> XLogRecData entry to the chain. This simplifies things in xlog.c, now that
> the XLogRecData concept is not visible to the rest of the system. This is a
> subtle semantic change for the callers: you can no longer register a struct
> first, and fill the fields afterwards.
>
> That adds a lot of memcpys to the critical path, which could be bad for
> performance. In practice, I think it's OK, or even a win, because a typical
> WAL record payload is very small. But I haven't measured that. If it becomes
> an issue, we could try to optimize, and link larger data blocks to the rdata
> chain, but memcpy small small chunks of data. There are a few WAL record
> types that don't fit in a small statically allocated buffer, so I provided a
> new function, XLogEnsureSpace(), that can be called before XLogBegin() to
> allocate a large-enough working area.

On the other hand, the patch I just committed
(9f03ca915196dfc871804a1f8aad26207f601fd6) demonstrates that even
copying relatively small amounts of data can be expensive if you do it
frequently, and certainly WAL insertions are frequent. Of course,
some of the overhead there is from palloc() and friends rather than
from the copying itself, but the copying itself isn't free, either.
And some WAL records can be really big.

Of course, it could still work out to a win if the simplifications in
xlog.c save enough. I don't know whether that's the case or not.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-07-02 06:01:02
Message-ID:	CAB7nPqToR=cHmxMf9qPsW8_d8MjDXiz1yaoFf5X_Z7+Ppvm_Dw@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, Jul 1, 2014 at 7:18 AM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:

> On Tue, Jul 1, 2014 at 5:51 AM, Heikki Linnakangas
> <hlinnakangas(at)vmware(dot)com> wrote:
> > Thanks for the review!
> >
> >
> > On 06/27/2014 09:12 AM, Michael Paquier wrote:
> >>
> >> Looking at this patch, it does not compile when assertions are enabled
> >> because of this assertion in heapam.c:
> >> Assert(recdata == data + datalen);
> >> It seems like a thinko as removing it makes the code compile.
> >
> >
> > Fixed.
> >
> >
> >> Some comments about the code:
> >> 1) In src/backend/access/transam/README:
> >> 's/no further action is no required/no further action is required/g'
> >
> >
> > Fixed.
> >
> >
> >> 2) Description of XLogBeginInsert, XLogHasBlockRef, XLogGetBlockRef and
> >> XLogGetBlockRefIds are missing in src/backend/access/transam/README.
> >
> >
> > Added descriptions of XLogBeginInsert and XLogHasBlockRef. Ḯ'm not sure
> > what to do with the latter two. They're not really intended for use by
> redo
> > routines, but for things like pg_xlogdump that want to extract
> information
> > about any record type.
> >
> >
> >> 3) There are a couple of code blocks marked as FIXME and BROKEN. There
> are
> >> as well some NOT_USED blocks.
> >
> >
> > Yeah. The FIXMEs and BROKEN blocks are all in rm_desc routines. They
> mostly
> > stand for code that used to print block numbers or relfilenodes, which
> > doesn't make much sense to do in an ad hoc way in rmgr-specific desc
> > routines anymore. Will need to go through them all an decide what's the
> > important information to print for each record type.
> >
> > The few NOT_USED blocks are around code in redo routines that extract
> some
> > information from the WAL record, that isn't needed anymore. We could
> remove
> > the fields from the WAL records altogether, but might be useful to keep
> for
> > debugging purposes.
> >
> >
> >> 4) Current patch crashes when running make check in
> >> contrib/test_decoding.
> >> Looking closer, wal_level = logical is broken:
> >> =# show wal_level;
> >> wal_level
> >> -----------
> >> logical
> >> (1 row)
> >> =# create database foo;
> >> The connection to the server was lost. Attempting reset: Failed.
> >
> >
> > Fixed.
> >
> >
> >> Looking even closer, log_heap_new_cid is broken:
> >
> >
> > Fixed.
> >
> >
> >> 6) XLogRegisterData creates a XLogRecData entry and appends it in one of
> >> the static XLogRecData entries if buffer_id >= 0 or to
> >> rdata_head/rdata_tail if buffer_id = -1 (for permanent data related to
> the
> >> WAL record). XLogRegisterXLogRecData does something similar, but with an
> >> entry of XLogRecData already built. What do you think about removing
> >> XLogRegisterXLogRecData and keeping only XLogRegisterData. This makes
> the
> >> interface simpler, and XLogRecData could be kept private in xlog.c. This
> >> would need more work especially on gin side where I am seeing for
> example
> >> constructLeafRecompressWALData directly building a XLogRecData entry.
> >
> >
> > Hmm. I considered that, but punted because I didn't want to rewrite all
> the
> > places that currently use XLogRecDatas in less straightforward ways. I
> kept
> > XLogRegisterXLogRecData as an escape hatch for those.
> >
> > But I bit the bullet now, and got rid of all the
> XLogRegisterXLogRecData()
> > calls. XLogRecData struct is now completely private to xlog.c.
> >
> > Another big change in the attached patch version is that
> XLogRegisterData()
> > now *copies* the data to a working area, instead of merely adding a
> > XLogRecData entry to the chain. This simplifies things in xlog.c, now
> that
> > the XLogRecData concept is not visible to the rest of the system. This
> is a
> > subtle semantic change for the callers: you can no longer register a
> struct
> > first, and fill the fields afterwards.
> >
> > That adds a lot of memcpys to the critical path, which could be bad for
> > performance. In practice, I think it's OK, or even a win, because a
> typical
> > WAL record payload is very small. But I haven't measured that. If it
> becomes
> > an issue, we could try to optimize, and link larger data blocks to the
> rdata
> > chain, but memcpy small small chunks of data. There are a few WAL record
> > types that don't fit in a small statically allocated buffer, so I
> provided a
> > new function, XLogEnsureSpace(), that can be called before XLogBegin() to
> > allocate a large-enough working area.
> >
> > These changes required changing a couple of places where WAL records are
> > created:
> >
> > * ginPlaceToPage. This function has a strange split of responsibility
> > between ginPlaceToPage and dataPlaceToPage/entryPlaceToPage.
> ginPlaceToPage
> > calls dataPlaceToPage or entryPlaceToPage, depending on what kind of a
> tree
> > it is, and dataPlaceToPage or entryPlaceToPage contributes some data to
> the
> > WAL record. Then ginPlaceToPage adds some more, and finally calls
> > XLogInsert(). I simplified the SPLIT case so that we now just create full
> > page images of both page halves. We were already logging all the data on
> > both page halves, but we were doing it in a "smarter" way, knowing what
> kind
> > of pages they were. For example, for an entry tree page, we included an
> > IndexTuple struct for every tuple on the page, but didn't include the
> item
> > pointers. A straight full page image takes more space, but not much, so I
> > think in any real application it's going to be lost in noise. (again,
> > haven't measured it though)
> >
> >
> >> 8) What do you think about adding in the README a description about how
> to
> >> retrieve the block list modified by a WAL record, for a flow similar to
> >> that:
> >> record = XLogReadRecord(...);
> >> nblocks = XLogGetBlockRefIds(record, blockids);
> >> for (i = 0; i < nblocks; i++)
> >> {
> >> bkpb = XLogGetBlockRef(record, id, NULL);
> >> // stuff
> >> }
> >> pg_xlogdump is a good example of what can be done as well.
> >
> >
> > Yeah, that's probably in order. I don't actually like the way that works
> > currently very much. Passing the char[256] blockids array to the function
> > feels weird. Will have to think a bit harder about that..
> >
> >
> >> 9) In ginxlog.c, shouldn't this block of code use XLogGetPayload?
> >> + char *payload = XLogRecGetData(record) +
> >> sizeof(ginxlogInsert);
> >> +#ifdef NOT_USED
> >> leftChildBlkno = BlockIdGetBlockNumber((BlockId)
> >> payload);
> >> +#endif
> >
> >
> > No. XLogRecGetData() is used to get data from the "main chunk", while
> > XLogGetPayload() is used to get data associated with a buffer. This does
> the
> > former.
> >
> >
> >> 10) Shouldn't the call to XLogBeginInsert come first here?
> >> @@ -1646,7 +1647,16 @@ spgAddNodeAction(Relation index, SpGistState
> >> *state,
> >> {
> >> XLogRecPtr recptr;
> >>
> >> - recptr = XLogInsert(RM_SPGIST_ID,
> >> XLOG_SPGIST_ADD_NODE, rdata);
> >> + XLogRegisterBuffer(0, saveCurrent.buffer,
> >> REGBUF_STANDARD); /* orig page */
> >> + XLogRegisterBuffer(1, current->buffer,
> >> REGBUF_STANDARD); /* new page */
> >> + if (xlrec.parentBlk == 2)
> >> + XLogRegisterBuffer(2, parent->buffer,
> >> REGBUF_STANDARD);
> >> +
> >> + XLogBeginInsert();
> >> + XLogRegisterData(-1, (char *) &xlrec,
> >> sizeof(xlrec));
> >> + XLogRegisterData(-1, (char *) newInnerTuple,
> >> newInnerTuple->size);
> >> +
> >> + recptr = XLogInsert(RM_SPGIST_ID,
> >> XLOG_SPGIST_ADD_NODE);
> >
> >
> > Yeah. Apparently we have no regression test coverage of this codepath,
> or it
> > would've triggered an assertion. Fixed now, just by looking at the code,
> but
> > there's still no test coverage of this so it's likely still broken in
> other
> > ways. Before this is committed, I think we'll need to go through the
> > coverage report and make sure all the XLogInsert() codepaths (and their
> redo
> > routines) are covered.
> >
> >
> >> 11) Any reason for raising a PANIC here?
> >> @@ -1126,7 +1126,7 @@ LWLockRelease(LWLock *l)
> >> break;
> >> }
> >> if (i < 0)
> >> - elog(ERROR, "lock %s %d is not held", T_NAME(l),
> T_ID(l));
> >> + elog(PANIC, "lock %s %d is not held", T_NAME(l),
> T_ID(l));
> >> num_held_lwlocks--;
> >> for (; i < num_held_lwlocks; i++)
> >
> >
> > No, leftover from debugging. Fixed.
> >
> >
> >> 14) SPgist redo is crashing (during make installcheck run on a master,
> >> redo
> >> done on a standby at recovery).
> >> (lldb) bt
> >> * thread #1: tid = 0x0000, 0x00007fff8a3c2866
> >> libsystem_kernel.dylib`__pthread_kill + 10, stop reason = signal SIGSTOP
> >> * frame #0: 0x00007fff8a3c2866 libsystem_kernel.dylib`__pthread_kill
> +
> >> 10
> >> frame #1: 0x00007fff8d37235c libsystem_pthread.dylib`pthread_kill +
> >> 92
> >> frame #2: 0x00007fff9369db1a libsystem_c.dylib`abort + 125
> >> frame #3: 0x000000010ba8d819
> >> postgres`ExceptionalCondition(conditionName=0x000000010bb21ea9,
> >> errorType=0x000000010baf019f, fileName=0x000000010bb212dd,
> lineNumber=65)
> >> +
> >> 137 at assert.c:54
> >> frame #4: 0x000000010b5c3a50
> >> postgres`addOrReplaceTuple(page=0x000000010bfee980,
> >> tuple=0x00007ffd41822902, size=770059, offset=18) + 688 at spgxlog.c:65
> >> frame #5: 0x000000010b5c031f
> postgres`spgRedoMoveLeafs(lsn=71708608,
> >> record=0x00007ffd41822800) + 719 at spgxlog.c:263
> >> frame #6: 0x000000010b5bf492 postgres`spg_redo(lsn=71708608,
> >> record=0x00007ffd41822800) + 402 at spgxlog.c:988
> >> frame #7: 0x000000010b5e774a postgres`StartupXLOG + 8474 at
> >> xlog.c:6780
> >> frame #8: 0x000000010b86a5fe postgres`StartupProcessMain + 430 at
> >> startup.c:224
> >> frame #9: 0x000000010b600409 postgres`AuxiliaryProcessMain(argc=2,
> >> argv=0x00007fff546ea190) + 1897 at bootstrap.c:416
> >> frame #10: 0x000000010b864998
> >> postgres`StartChildProcess(type=StartupProcess) + 328 at
> postmaster.c:5090
> >> frame #11: 0x000000010b862aa9 postgres`PostmasterMain(argc=3,
> >> argv=0x00007ffd40c04470) + 5401 at postmaster.c:1212
> >> frame #12: 0x000000010b7a9335 postgres`main(argc=3,
> >> argv=0x00007ffd40c04470) + 773 at main.c:227
> >
> >
> > Fixed.
> >
> >
> >> Also, I wanted to run the WAL replay facility to compare before and
> after
> >> buffer images, but code crashed before... I am still planning to do so
> >> once
> >> this code gets more stable though.
> >
> >
> > I did that earlier, and did it now again. I got a single difference from
> a
> > sp-gist SPLIT_TUPLE operation. I didn't dig deeper right now, will have
> to
> > investigate it later.
>
> With the patch, when I ran the simple test which generates lots of
> WAL, I got PANIC error.
>
> =# create extension pgcrypto ;
> CREATE EXTENSION
> =# create extension pg_trgm ;
> CREATE EXTENSION
> =# create table hoge (col1 text);
> CREATE TABLE
> =# create index hogeidx on hoge using gin (col1 gin_trgm_ops);
> CREATE INDEX
> =# insert into hoge select gen_random_uuid() from
> generate_series(1,10000000);
> PANIC: too many registered buffers
> PANIC: too many registered buffers
> The connection to the server was lost. Attempting reset: Failed.
>
Yep, this is easily reproducible. I imagine that a call to
XLogEnsureRecordSpace before START_CRIT_SECTION() is missing in the gin
code path.
--
Michael

From:	Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-07-02 07:09:46
Message-ID:	CAB7nPqR2RVPJL1WmF-PXa3GLhD58Pr4DKK9K=0aazyE3dhKuLw@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, Jul 1, 2014 at 5:51 AM, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
wrote:

> On 06/27/2014 09:12 AM, Michael Paquier wrote:
>
>> 2) Description of XLogBeginInsert, XLogHasBlockRef, XLogGetBlockRef and
>> XLogGetBlockRefIds are missing in src/backend/access/transam/README.
>>
>
> Added descriptions of XLogBeginInsert and XLogHasBlockRef. Ḯ'm not sure
> what to do with the latter two. They're not really intended for use by redo
> routines, but for things like pg_xlogdump that want to extract information
> about any record type.

That's definitely not part of the redo section or WAL construction section.
It could be a new section in the README describing how to get the list of
blocks touched by a WAL record. I'd rather have those APIs documented
somewhere than nowhere, and the README would be well-suited for that (as
well as in-code comments) as those APIs are aimed at developers.

> 3) There are a couple of code blocks marked as FIXME and BROKEN. There are
>> as well some NOT_USED blocks.
>>
>
> Yeah. The FIXMEs and BROKEN blocks are all in rm_desc routines. They
> mostly stand for code that used to print block numbers or relfilenodes,
> which doesn't make much sense to do in an ad hoc way in rmgr-specific desc
> routines anymore. Will need to go through them all an decide what's the
> important information to print for each record type.
>
> The few NOT_USED blocks are around code in redo routines that extract some
> information from the WAL record, that isn't needed anymore. We could remove
> the fields from the WAL records altogether, but might be useful to keep for
> debugging purposes.

They could be removed and be kept as a part of the git history... That's
not really a vital point btw.

>
> 6) XLogRegisterData creates a XLogRecData entry and appends it in one of
>> the static XLogRecData entries if buffer_id >= 0 or to
>> rdata_head/rdata_tail if buffer_id = -1 (for permanent data related to the
>> WAL record). XLogRegisterXLogRecData does something similar, but with an
>> entry of XLogRecData already built. What do you think about removing
>> XLogRegisterXLogRecData and keeping only XLogRegisterData. This makes the
>> interface simpler, and XLogRecData could be kept private in xlog.c. This
>> would need more work especially on gin side where I am seeing for example
>> constructLeafRecompressWALData directly building a XLogRecData entry.
>>
>
> Another big change in the attached patch version is that
> XLogRegisterData() now *copies* the data to a working area, instead of
> merely adding a XLogRecData entry to the chain. This simplifies things in
> xlog.c, now that the XLogRecData concept is not visible to the rest of the
> system. This is a subtle semantic change for the callers: you can no longer
> register a struct first, and fill the fields afterwards.
>
> That adds a lot of memcpys to the critical path, which could be bad for
> performance. In practice, I think it's OK, or even a win, because a typical
> WAL record payload is very small. But I haven't measured that. If it
> becomes an issue, we could try to optimize, and link larger data blocks to
> the rdata chain, but memcpy small small chunks of data. There are a few WAL
> record types that don't fit in a small statically allocated buffer, so I
> provided a new function, XLogEnsureSpace(), that can be called before
> XLogBegin() to allocate a large-enough working area.
>

I just ran the following test on my laptop with your latest patch:
- Patched version:
=# CREATE TABLE foo AS SELECT generate_series(1,10000000) as a;
SELECT 10000000
Time: 32913.419 ms
=# CREATE TABLE foo2 AS SELECT generate_series(1,10000000) as a;
SELECT 10000000
Time: 32261.045 ms
- master (d97e98e):
=# CREATE TABLE foo AS SELECT generate_series(1,10000000) as a;
SELECT 10000000
Time: 29651.394 ms
=# CREATE TABLE foo2 AS SELECT generate_series(1,10000000) as a;
SELECT 10000000
Time: 29734.887 ms
10% difference... That was just a quick test though.

These changes required changing a couple of places where WAL records are
> created:
>
> * ginPlaceToPage. This function has a strange split of responsibility
> between ginPlaceToPage and dataPlaceToPage/entryPlaceToPage.
> ginPlaceToPage calls dataPlaceToPage or entryPlaceToPage, depending on what
> kind of a tree it is, and dataPlaceToPage or entryPlaceToPage contributes
> some data to the WAL record. Then ginPlaceToPage adds some more, and
> finally calls XLogInsert(). I simplified the SPLIT case so that we now just
> create full page images of both page halves. We were already logging all
> the data on both page halves, but we were doing it in a "smarter" way,
> knowing what kind of pages they were. For example, for an entry tree page,
> we included an IndexTuple struct for every tuple on the page, but didn't
> include the item pointers. A straight full page image takes more space, but
> not much, so I think in any real application it's going to be lost in
> noise. (again, haven't measured it though)

Thanks, the code looks cleaner this way. Having looked a bit at the way
ginPlaceToPage works, it seems a bit sensible to refactor
dataPlaceToPageLeaf and dataPlaceToPageInternal to make the WAL operations
smarter than they are now. I am not saying that this is impossible, just
not that straight-forward, especially considering that this code is like
that for a long time.
I actually tried to run some tests on the WAL difference it generates, but
got discouraged by the email Fujii-san sent reporting the crash in gin code
path.

> 10) Shouldn't the call to XLogBeginInsert come first here?
>> @@ -1646,7 +1647,16 @@ spgAddNodeAction(Relation index, SpGistState
>> *state,
>> {
>> XLogRecPtr recptr;
>>
>> - recptr = XLogInsert(RM_SPGIST_ID,
>> XLOG_SPGIST_ADD_NODE, rdata);
>> + XLogRegisterBuffer(0, saveCurrent.buffer,
>> REGBUF_STANDARD); /* orig page */
>> + XLogRegisterBuffer(1, current->buffer,
>> REGBUF_STANDARD); /* new page */
>> + if (xlrec.parentBlk == 2)
>> + XLogRegisterBuffer(2, parent->buffer,
>> REGBUF_STANDARD);
>> +
>> + XLogBeginInsert();
>> + XLogRegisterData(-1, (char *) &xlrec,
>> sizeof(xlrec));
>> + XLogRegisterData(-1, (char *) newInnerTuple,
>> newInnerTuple->size);
>> +
>> + recptr = XLogInsert(RM_SPGIST_ID,
>> XLOG_SPGIST_ADD_NODE);
>>
>
> Yeah. Apparently we have no regression test coverage of this codepath, or
> it would've triggered an assertion. Fixed now, just by looking at the code,
> but there's still no test coverage of this so it's likely still broken in
> other ways. Before this is committed, I think we'll need to go through the
> coverage report and make sure all the XLogInsert() codepaths (and their
> redo routines) are covered.

Hm. This is not really related to this patch as it means that even current
code is not tested properly. I wouldn't mind taking a closer look at the
test suite and come up new tests improving coverage of WAL replay and
record construction...

> Also, I wanted to run the WAL replay facility to compare before and after
>> buffer images, but code crashed before... I am still planning to do so
>> once
>> this code gets more stable though.
>>
>
> I did that earlier, and did it now again. I got a single difference from a
> sp-gist SPLIT_TUPLE operation. I didn't dig deeper right now, will have to
> investigate it later.
>
OK, thanks.

I haven't looked at the patch in more details:
1) Noticed a small type => s/reprsent/represent/g
2) This time make installcheck passed for both master and standby, even on
contrib stuff...
3) I noticed a bug in gin redo code path when trying to use the WAL replay
facility though so I couldn't make a proper run with it.
--
Michael

From:	Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-07-02 07:23:50
Message-ID:	CAB7nPqQzPgY_=Q2XQf8kKW7evwx_FAmAfFa9re7uQmaGPfjG5g@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, Jul 2, 2014 at 4:09 PM, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
wrote:
>
> 3) I noticed a bug in gin redo code path when trying to use the WAL replay
> facility.
>

Looking at the backtrace, it seems that a page for gin is corrupted. The
buffer capture patch does some sanity checks on the page format when
performing masking and it is failing in one of them on a standby when
kicking ginRedoUpdateMetapage:
if (pd_lower > pd_upper || pd_special < pd_upper ||
pd_lower < SizeOfPageHeaderData || pd_special > BLCKSZ)
{
elog(ERROR, "invalid page at %X/%08X\n",
((PageHeader) page)->pd_lsn.xlogid,
((PageHeader) page)->pd_lsn.xrecoff);
}

frame #4: 0x000000010437dec5 postgres`CheckForBufferLeaks + 165 at
bufmgr.c:1778
frame #5: 0x000000010437df1e postgres`AtProcExit_Buffers(code=1, arg=0)
+ 30 at bufmgr.c:1750
frame #6: 0x000000010438fded postgres`shmem_exit(code=1) + 301 at
ipc.c:263
frame #7: 0x000000010438fc1c postgres`proc_exit_prepare(code=1) + 124
at ipc.c:187
frame #8: 0x000000010438fb63 postgres`proc_exit(code=1) + 19 at
ipc.c:102
frame #9: 0x0000000104555b3c postgres`errfinish(dummy=0) + 1180 at
elog.c:555
frame #10: 0x00000001045590de postgres`elog_finish(elevel=20,
fmt=0x0000000104633d4f) + 830 at elog.c:1362
frame #11: 0x000000010437c1af
postgres`mask_unused_space(page=0x00007fff5bc20a70) + 159 at bufcapt.c:78
frame #12: 0x000000010437b53d
postgres`mask_heap_page(page=0x00007fff5bc20a70) + 29 at bufcapt.c:95
frame #13: 0x000000010437b1cd
postgres`buffer_capture_write(newcontent=0x0000000104ab3980, blkno=0) + 205
at bufcapt.c:329
frame #14: 0x000000010437bc7d postgres`buffer_capture_forget(buffer=82)
+ 349 at bufcapt.c:433
frame #15: 0x00000001043801c9 postgres`LockBuffer(buffer=82, mode=0) +
233 at bufmgr.c:2773
frame #16: 0x00000001043800c8 postgres`UnlockReleaseBuffer(buffer=82) +
24 at bufmgr.c:2554
frame #17: 0x0000000103fefb03
postgres`ginRedoUpdateMetapage(lsn=335350144, record=0x00007fb1740382f0) +
1843 at ginxlog.c:580
frame #18: 0x0000000103fede96 postgres`gin_redo(lsn=335350144,
record=0x00007fb1740382f0) + 534 at ginxlog.c:724
frame #19: 0x00000001040ad692 postgres`StartupXLOG + 8482 at xlog.c:6810
frame #20: 0x0000000104330e0e postgres`StartupProcessMain + 430 at
startup.c:224
frame #21: 0x00000001040c64d9 postgres`AuxiliaryProcessMain(argc=2,
argv=0x00007fff5bc231b0) + 1897 at bootstrap.c:416
frame #22: 0x000000010432b1a8
postgres`StartChildProcess(type=StartupProcess) + 328 at postmaster.c:5090
frame #23: 0x00000001043292b9 postgres`PostmasterMain(argc=3,
argv=0x00007fb173c044e0) + 5401 at postmaster.c:1212
frame #24: 0x000000010426f995 postgres`main(argc=3,
argv=0x00007fb173c044e0) + 773 at main.c:219
Note that I have never seen that with vanilla, only with this patch.
Regards,
--
Michael

From:	Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-07-19 00:50:14
Message-ID:	CAB7nPqRHk_9adObZq5Mg4v0amqXsL3is-LFy5-uVGDFmH=ZWPg@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, Jul 2, 2014 at 4:23 PM, Michael Paquier
<michael(dot)paquier(at)gmail(dot)com> wrote:
>
>
>
> On Wed, Jul 2, 2014 at 4:09 PM, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
> wrote:
>>
>> 3) I noticed a bug in gin redo code path when trying to use the WAL replay
>> facility.
This patch has been on status "Waiting on author" for a little bit of
time, marking it now as "Returned with feedback" as there are still
issues with gin record construction and redo code paths.

Feel free to disagree if you don't think this is correct.
--
Michael

From:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Cc:	Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-08-01 16:40:23
Message-ID:	53DBC2F7.8050604@vmware.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Here's a new version of this patch, please review.

I've cleaned up a lot of stuff, fixed all the bugs reported so far, and
a bunch of others I found myself while testing.

I'm not going to explain again what the patch does; the README and
comments should now be complete enough to explain how it works. If not,
please speak up.

In the previous version of this patch, I made the XLogRegisterData
function to copy the WAL data to a temporary buffer, instead of
constructing the XLogRecData chain. I decided to revert back to the old
way after all; I still think that copying would probably be OK from a
performance point of view, but that'd need more testing. We can still
switch to doing it that way later; the XLogRecData struct is no longer
exposed to the functions that generate the WAL records, so it would be a
very isolated change.

I moved the new functions for constructing the WAL record into a new
file, xlogconstruct.c. XLogInsert() now just calls a function called
XLogRecordAssemble(), which returns the full XLogRecData chain that
includes all the data and backup blocks, ready to be written to the WAL
buffer. All the code to construct the XLogRecData chain is now in
xlogrecord.c; this makes XLogInsert() simpler and more readable.

One change resulting from that worth mentioning is that XLogInsert() now
always re-constructs the XLogRecordData chain, if after acquiring the
WALInsertLock it sees that RedoRecPtr changed (i.e. a checkpoint just
started). Before, it would recheck the LSNs on the individual buffers to
see if it's necessary. This is simpler, and I don't think it makes any
difference to performance in practice.

I ran this through my WAL page comparison tool to verify that all the
WAL record types are replayed correctly (although I did some small
cleanup after that, so it's not impossible that I broke it again; will
re-test before committing).

- Heikki

Attachment	Content-Type	Size
wal-format-and-api-changes-5.patch.gz	application/gzip	96.2 KB

From:	Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-08-05 12:50:21
Message-ID:	CAB7nPqSLK66MLLmWxPLWUiWoN=HHLnU1YM9aVNw8BP-_PUzNrA@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sat, Aug 2, 2014 at 1:40 AM, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
wrote:
> I ran this through my WAL page comparison tool to verify that all the WAL
> record types are replayed correctly (although I did some small cleanup
after
> that, so it's not impossible that I broke it again; will re-test before
> committing).
I have run as well some tests on it with a master a a standby to check WAL
construction and replay:
- regression tests as well as tests from contrib, isolation
- WAL page comparison tool
- Some tests with pgbench
- When testing pg_xlogdump I found an assertion failure for record
XLOG_GIN_INSERT:
Assertion failed: (((bool) (((const void*)(&insertData->newitem.key) !=
((void*)0)) && ((&insertData->newitem.key)->ip_posid != 0)))), function
gin_desc, file gindesc.c, line 127.

Then, I had a look at the code, and here are some comments:
- This comment has been introduced with 96ef3b8 that has added page
checksums. Perhaps the authors can tell what was the original purpose of
this comment (Jeff, Simon). For now, it may be better to remove this FIXME
and to let this comment as is.
+ * FIXME: I didn't understand below comment. Is it still
relevant?
+ *
* This also means there is no corresponding API call for
this, so an
* smgr implementation has no need to implement anything.
Which means
* nothing is needed in md.c etc
- Would it be a win to add an assertion check for (CritSectionCount == 0)
in XLogEnsureRecordSpace?
- There is no mention of REGBUF_NO_IMAGE in transam/README.
- This patch adds a lot of blocks like that in the redo code:
+ if (BufferIsValid(buffer))
+ UnlockReleaseBuffer(buffer);
What do you think about adding a new macro like UnlockBufferIfValid(buffer)?
- Don't we need a call to XLogBeginInsert() in XLogSaveBufferForHint():
+ int flags = REGBUF_FORCE_IMAGE;
+ if (buffer_std)
+ flags |= REGBUF_STANDARD;
+ XLogRegisterBuffer(0, buffer, flags);
[...]
- recptr = XLogInsert(RM_XLOG_ID, XLOG_FPI, rdata);
+ recptr = XLogInsert(RM_XLOG_ID, XLOG_FPI);
- There is a typo in the header of xlogrecord.h, the "to" is not needed:
+ * Definitions for to the WAL record
format.

- There is still a TODO item in ValidXLogRecordHeader to check if block
data header is valid or not. Just mentioning.
- Just came across that: s/reference refes to/reference refers to
- XLogRecGetBlockRefIds needing an already-allocated array for *out is not
user-friendly. Cannot we just do all the work inside this function?
- All the functions in xlogconstruct.c could be defined in a separate
header xlogconstruct.h. What they do is rather independent from xlog.h. The
same applies to all the structures and functions of xlogconstruct.c in
xlog_internal.h: XLogRecordAssemble, XLogRecordAssemble,
InitXLogRecordConstruct. (worth noticing as well that the only reason
XLogRecData is defined externally of xlogconstruct.c is that it is being
used in XLogInsert()).

The patch is more solid and better structured than the previous versions.
It is in good shape.
Regards,
--
Michael

From:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To:	Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-08-12 09:44:22
Message-ID:	53E9E1F6.6040204@vmware.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 08/05/2014 03:50 PM, Michael Paquier wrote:
> - When testing pg_xlogdump I found an assertion failure for record
> XLOG_GIN_INSERT:
> Assertion failed: (((bool) (((const void*)(&insertData->newitem.key) !=
> ((void*)0)) && ((&insertData->newitem.key)->ip_posid != 0)))), function
> gin_desc, file gindesc.c, line 127.

I could not reproduce this. Do you have a test case?

> - Would it be a win to add an assertion check for (CritSectionCount == 0)
> in XLogEnsureRecordSpace?

Maybe; added.

> - There is no mention of REGBUF_NO_IMAGE in transam/README.

Fixed

> - This patch adds a lot of blocks like that in the redo code:
> + if (BufferIsValid(buffer))
> + UnlockReleaseBuffer(buffer);
> What do you think about adding a new macro like UnlockBufferIfValid(buffer)?

I don't think such a macro would be an improvement in readability, TBH.

> - Don't we need a call to XLogBeginInsert() in XLogSaveBufferForHint():
> + int flags = REGBUF_FORCE_IMAGE;
> + if (buffer_std)
> + flags |= REGBUF_STANDARD;
> + XLogRegisterBuffer(0, buffer, flags);
> [...]
> - recptr = XLogInsert(RM_XLOG_ID, XLOG_FPI, rdata);
> + recptr = XLogInsert(RM_XLOG_ID, XLOG_FPI);

Indeed, fixed. With that, "initdb -k" works, I had not tested the patch
with page checksums before.

> - There is still a TODO item in ValidXLogRecordHeader to check if block
> data header is valid or not. Just mentioning.

Removed.

Previously, we ValidXLogRecordHeader checked that the xl_tot_len of a
record doesn't exceed the size of header + xl_len + (space needed for
max number of bkp blocks). But the new record format has no limit on the
amount of data registered with a buffer, so such a test is not possible
anymore. I added the TODO item there to remind me that I need to check
if we need to do something else there instead, but I think it's fine as
it is. We still sanity-check the block data in ValidXLogRecord; the
ValidXLogRecordHeader() check happens before we have read the whole
record header in memory.

> - XLogRecGetBlockRefIds needing an already-allocated array for *out is not
> user-friendly. Cannot we just do all the work inside this function?

I agree it's not a nice API. Returning a palloc'd array would be nicer,
but this needs to work outside the backend too (e.g. pg_xlogdump). It
could return a malloc'd array in frontend code, but it's not as nice. On
the whole, do you think that be better than the current approach?

> - All the functions in xlogconstruct.c could be defined in a separate
> header xlogconstruct.h. What they do is rather independent from xlog.h. The
> same applies to all the structures and functions of xlogconstruct.c in
> xlog_internal.h: XLogRecordAssemble, XLogRecordAssemble,
> InitXLogRecordConstruct. (worth noticing as well that the only reason
> XLogRecData is defined externally of xlogconstruct.c is that it is being
> used in XLogInsert()).

Hmm. I left the defines for xlogconstruct.c functions that are used to
build a WAL record in xlog.h, so that it's not necessary to include both
xlog.h and xlogconstruct.h in all files that write a WAL record
(XLogInsert() is defined in xlog.h).

But perhaps it would be better to move the prototype for XLogInsert to
xlogconstruct.h too? That might be a better division; everything needed
to insert a WAL record would be in xlogconstruct.h, and xlog.h would
left for more internal functions related to WAL. And perhaps rename the
files to xloginsert.c and xloginsert.h.

Here's a new patch with those little things fixed.

- Heikki

Attachment	Content-Type	Size
wal-format-and-api-changes-6.patch.gz	application/gzip	96.8 KB

From:	Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-08-12 14:52:45
Message-ID:	20140812145245.GD5728@eldon.alvh.no-ip.org
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Heikki Linnakangas wrote:
> On 08/05/2014 03:50 PM, Michael Paquier wrote:

> >- All the functions in xlogconstruct.c could be defined in a separate
> >header xlogconstruct.h. What they do is rather independent from xlog.h. The
> >same applies to all the structures and functions of xlogconstruct.c in
> >xlog_internal.h: XLogRecordAssemble, XLogRecordAssemble,
> >InitXLogRecordConstruct. (worth noticing as well that the only reason
> >XLogRecData is defined externally of xlogconstruct.c is that it is being
> >used in XLogInsert()).
>
> Hmm. I left the defines for xlogconstruct.c functions that are used
> to build a WAL record in xlog.h, so that it's not necessary to
> include both xlog.h and xlogconstruct.h in all files that write a
> WAL record (XLogInsert() is defined in xlog.h).
>
> But perhaps it would be better to move the prototype for XLogInsert
> to xlogconstruct.h too? That might be a better division; everything
> needed to insert a WAL record would be in xlogconstruct.h, and
> xlog.h would left for more internal functions related to WAL. And
> perhaps rename the files to xloginsert.c and xloginsert.h.

Yes, IMO ideally modules that only construct WAL records to insert, but
are not directly involved with other XLog stuff, should only be using a
lean header file, not the whole xlog.h. I imagine xlogconstruct.h would
be such a file. (The patch you just posted doesn't have such a file --
AFAICS that stuff is all in xlog.h still).

No opinion on xlogconstruct vs. xloginsert as file names. Both seem
like good enough names to me. Unless others have stronger opinions, I
would left that decision to you.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-08-12 14:55:21
Message-ID:	20140812145521.GE5728@eldon.alvh.no-ip.org
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Heikki Linnakangas wrote:
> On 08/05/2014 03:50 PM, Michael Paquier wrote:

> >- XLogRecGetBlockRefIds needing an already-allocated array for *out is not
> >user-friendly. Cannot we just do all the work inside this function?
>
> I agree it's not a nice API. Returning a palloc'd array would be
> nicer, but this needs to work outside the backend too (e.g.
> pg_xlogdump). It could return a malloc'd array in frontend code, but
> it's not as nice. On the whole, do you think that be better than the
> current approach?

I don't see the issue. Right now, in frontend code you can call
palloc() and pfree(), and they behave like malloc() and free() (see
fe_memutils.c). Your module doesn't need to do anything special.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-08-12 22:04:22
Message-ID:	20140812220422.GF14949@awork2.anarazel.de
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 2014-08-12 12:44:22 +0300, Heikki Linnakangas wrote:
> Here's a new patch with those little things fixed.

I've so far ignored this thread... I'm now taking a look. Unfortunately
I have to admit I'm not unequivocally happy.

* XLogRegisterData/XLogRegisterRecData imo don't really form a
consistent API namewise. What does 'Rec' mean here?

* I'm distinctly not a fan of passing around both the LSN and the struct
XLogRecord to functions. We should bit the bullet and use something
encapsulating both.

* XLogReplayBuffer imo isn't a very good name - the buffer isn't
replayed. If anything it's sometimes replaced by the backup block, but
the real replay still happens outside of that function.

* Why do we need XLogBeginInsert(). Right now this leads to uglyness
like duplicated if (RelationNeedsWAL()) wal checks and
XLogCancelInsert() of inserts that haven't been started. And if we
have Begin, why do we also need Cancel?

* "XXX: do we need to do this for every page?" - yes, and it's every
tuple, not every page... And It needs to be done *after* the tuple has
been put on the page, not before. Otherwise the location will be wrong.

* The patch mixes the API changes around WAL records with changes of how
individual actions are logged. That makes it rather hard to review -
and it's a 500kb patch already.

I realize it's hard to avoid because the new API changes which
information about blocks is available, but it does make it really hard
to see whether the old/new code is doing something
equivalent. It's hard to find more critical code than this :/

Have you done any performance evaluation?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-08-12 23:36:59
Message-ID:	53EAA51B.2020105@vmware.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 08/13/2014 01:04 AM, Andres Freund wrote:
> On 2014-08-12 12:44:22 +0300, Heikki Linnakangas wrote:
>> Here's a new patch with those little things fixed.
>
> I've so far ignored this thread... I'm now taking a look. Unfortunately
> I have to admit I'm not unequivocally happy.
>
> * XLogRegisterData/XLogRegisterRecData imo don't really form a
> consistent API namewise. What does 'Rec' mean here?

The latter function is actually called XLogRegisterBufData. The README
was wrong, will fix.

> * I'm distinctly not a fan of passing around both the LSN and the struct
> XLogRecord to functions. We should bit the bullet and use something
> encapsulating both.

You mean, the redo functions? Seems fine to me, and always it's been
like that...

> * XLogReplayBuffer imo isn't a very good name - the buffer isn't
> replayed. If anything it's sometimes replaced by the backup block, but
> the real replay still happens outside of that function.

I agree, but haven't come up with a better name.

> * Why do we need XLogBeginInsert(). Right now this leads to uglyness
> like duplicated if (RelationNeedsWAL()) wal checks and
> XLogCancelInsert() of inserts that haven't been started.

I like clearly marking the beginning of the insert, with
XLogBeginInsert(). We certainly could design it so that it's not needed,
and you could just start registering stuff without calling
XLogBeginInsert() first. But I believe it's more robust this way. For
example, you will get an error at the next XLogBeginInsert(), if a
previous operation had already registered some data without calling
XLogInsert().

> And if we have Begin, why do we also need Cancel?

We could just automatically "cancel" any previous insertion when
XLogBeginInsert() is called, but again it seems like bad form to leave
references to buffers and data just lying there, if an operation bails
out after registering some stuff and doesn't finish the insertion.

> * "XXX: do we need to do this for every page?" - yes, and it's every
> tuple, not every page... And It needs to be done *after* the tuple has
> been put on the page, not before. Otherwise the location will be wrong.

That comment is in heap_multi_insert(). All the inserted tuples have the
same command id, seems redundant to log multiple NEW_CID records for them.

> * The patch mixes the API changes around WAL records with changes of how
> individual actions are logged. That makes it rather hard to review -
> and it's a 500kb patch already.
>
> I realize it's hard to avoid because the new API changes which
> information about blocks is available, but it does make it really hard
> to see whether the old/new code is doing something
> equivalent. It's hard to find more critical code than this :/

Yeah, I hear you. I considered doing this piecemeal, just adding the new
functions first so that you could still use the old XLogRecData API,
until all the functions have been converted. But in the end, I figured
it's not worth it, as sooner or later we'd want to convert all the
functions anyway.

> Have you done any performance evaluation?

No, not yet.

- Heikki

From:	Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-08-13 07:49:50
Message-ID:	CAB7nPqRjgOt1wgX3zFtap8QfrsBeGDyBGixzo_x8yXsWgXPRxw@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, Aug 12, 2014 at 6:44 PM, Heikki Linnakangas
<hlinnakangas(at)vmware(dot)com> wrote:
> On 08/05/2014 03:50 PM, Michael Paquier wrote:
>>
>> - When testing pg_xlogdump I found an assertion failure for record
>> XLOG_GIN_INSERT:
>> Assertion failed: (((bool) (((const void*)(&insertData->newitem.key) !=
>> ((void*)0)) && ((&insertData->newitem.key)->ip_posid != 0)))), function
>> gin_desc, file gindesc.c, line 127.
>
>
> I could not reproduce this. Do you have a test case?
Indeed. I am not seeing anything now for this record. Perhaps I was
using an incorrect version of pg_xlogdump.

>> - Don't we need a call to XLogBeginInsert() in XLogSaveBufferForHint():
>> + int flags = REGBUF_FORCE_IMAGE;
>> + if (buffer_std)
>> + flags |= REGBUF_STANDARD;
>> + XLogRegisterBuffer(0, buffer, flags);
>> [...]
>> - recptr = XLogInsert(RM_XLOG_ID, XLOG_FPI, rdata);
>> + recptr = XLogInsert(RM_XLOG_ID, XLOG_FPI);
>
>
> Indeed, fixed. With that, "initdb -k" works, I had not tested the patch with
> page checksums before.
Thanks for the fix. Yes now this works.

>> - XLogRecGetBlockRefIds needing an already-allocated array for *out is not
>> user-friendly. Cannot we just do all the work inside this function?
>
>
> I agree it's not a nice API. Returning a palloc'd array would be nicer, but
> this needs to work outside the backend too (e.g. pg_xlogdump). It could
> return a malloc'd array in frontend code, but it's not as nice. On the
> whole, do you think that be better than the current approach?

A palloc'd array returned is nicer for the user. And I would rather
rewrite the API like that, having it return the array instead:
int *XLogRecGetBlockRefIds(XLogRecord *record, int *num_array)
Alvaro has also outlined that in the FRONTEND context pfree and palloc
would work as pg_free and pg_malloc (didn't recall it before he
mentioned it btw).

>> - All the functions in xlogconstruct.c could be defined in a separate
>> header xlogconstruct.h. What they do is rather independent from xlog.h.
>> The
>> same applies to all the structures and functions of xlogconstruct.c in
>> xlog_internal.h: XLogRecordAssemble, XLogRecordAssemble,
>> InitXLogRecordConstruct. (worth noticing as well that the only reason
>> XLogRecData is defined externally of xlogconstruct.c is that it is being
>> used in XLogInsert()).
>
>
> Hmm. I left the defines for xlogconstruct.c functions that are used to build
> a WAL record in xlog.h, so that it's not necessary to include both xlog.h
> and xlogconstruct.h in all files that write a WAL record (XLogInsert() is
> defined in xlog.h).
>
> But perhaps it would be better to move the prototype for XLogInsert to
> xlogconstruct.h too? That might be a better division; everything needed to
> insert a WAL record would be in xlogconstruct.h, and xlog.h would left for
> more internal functions related to WAL. And perhaps rename the files to
> xloginsert.c and xloginsert.h.

Yes indeed. As XLogBeginInsert() is already part of xlogconstruct.c,
it is not weird. This approach has the merit to make XLogRecData
completely bundled with the insertion and construction of the WAL
records. Then for the name xloginsert.c is fine for me too.

Among the things changed since v5, v6 is introducing a safety limit to
the maximum number of backup blocks within a single WAL record:
#define XLR_MAX_BKP_BLOCKS 256
XLogRecordBlockData is updated to use uint8 instead of char to
identify the block id, and XLogEnsureRecordSpace fails if more than
XLR_MAX_BKP_BLOCKS are wanted by the caller. I am fine with this
change, just mentioning it.
Regards,
--
Michael

From:	Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-08-13 08:07:15
Message-ID:	CAB7nPqQZnTMmr5zVh1ukTT0GHTijxjLTssb2Gi7nom_PWqBGyg@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, Aug 13, 2014 at 4:49 PM, Michael Paquier
<michael(dot)paquier(at)gmail(dot)com> wrote:
> Yes indeed. As XLogBeginInsert() is already part of xlogconstruct.c,
> it is not weird. This approach has the merit to make XLogRecData
> completely bundled with the insertion and construction of the WAL
> records. Then for the name xloginsert.c is fine for me too.
At the same time, renaming XLogInsert to XLogCompleteInsert or
XLogFinishInsert would be nice to make the difference with
XLogBeginInsert. This could include XLogCancel renamed tos
XLogCancelInsert.

Appending the prefix XLogInsert* to those functions would make things
more consistent as well.

But feel free to discard those ideas if you do not like that.
Regards,
--
Michael

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-08-13 11:07:46
Message-ID:	20140813110746.GB28982@awork2.anarazel.de
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 2014-08-13 02:36:59 +0300, Heikki Linnakangas wrote:
> On 08/13/2014 01:04 AM, Andres Freund wrote:
> >* I'm distinctly not a fan of passing around both the LSN and the struct
> > XLogRecord to functions. We should bit the bullet and use something
> > encapsulating both.
>
> You mean, the redo functions? Seems fine to me, and always it's been like
> that...

Meh. By that argument we could just not do this patch :)

> >* XLogReplayBuffer imo isn't a very good name - the buffer isn't
> > replayed. If anything it's sometimes replaced by the backup block, but
> > the real replay still happens outside of that function.
>
> I agree, but haven't come up with a better name.

XLogRestoreBuffer(), XLogRecoverBuffer(), XLogReadBufferForReplay(), ...?

> >* Why do we need XLogBeginInsert(). Right now this leads to uglyness
> > like duplicated if (RelationNeedsWAL()) wal checks and
> > XLogCancelInsert() of inserts that haven't been started.
>
> I like clearly marking the beginning of the insert, with XLogBeginInsert().
> We certainly could design it so that it's not needed, and you could just
> start registering stuff without calling XLogBeginInsert() first. But I
> believe it's more robust this way. For example, you will get an error at the
> next XLogBeginInsert(), if a previous operation had already registered some
> data without calling XLogInsert().

I'm coming around to that view - especially as accidentally nesting xlog
insertions is a real concern with the new API.

> >And if we have Begin, why do we also need Cancel?
>
> We could just automatically "cancel" any previous insertion when
> XLogBeginInsert() is called, but again it seems like bad form to leave
> references to buffers and data just lying there, if an operation bails out
> after registering some stuff and doesn't finish the insertion.
>
> >* "XXX: do we need to do this for every page?" - yes, and it's every
> > tuple, not every page... And It needs to be done *after* the tuple has
> > been put on the page, not before. Otherwise the location will be wrong.
>
> That comment is in heap_multi_insert(). All the inserted tuples have the
> same command id, seems redundant to log multiple NEW_CID records for them.

Yea, I know it's in heap_multi_insert(). They tuples have the same cid,
but they don't have the same tid. Or well, wouldn't if
log_heap_new_cid() were called after the PutHeapTuple(). Note that this
won't trigger for any normal user defined relations.

> >* The patch mixes the API changes around WAL records with changes of how
> > individual actions are logged. That makes it rather hard to review -
> > and it's a 500kb patch already.
> >
> > I realize it's hard to avoid because the new API changes which
> > information about blocks is available, but it does make it really hard
> > to see whether the old/new code is doing something
> > equivalent. It's hard to find more critical code than this :/
>
> Yeah, I hear you. I considered doing this piecemeal, just adding the new
> functions first so that you could still use the old XLogRecData API, until
> all the functions have been converted. But in the end, I figured it's not
> worth it, as sooner or later we'd want to convert all the functions anyway.

I think it might be worthwile anyway. I'd be very surprised if there
aren't several significant bugs in the conversion. Your full page
checking tool surely helps to reduce the number, but it's not
foolproof. I can understand not wanting to do it though, it's a
significant amount of work.

Would you ask somebody else to do it in two steps?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-08-13 13:15:46
Message-ID:	53EB6502.60403@vmware.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 08/13/2014 02:07 PM, Andres Freund wrote:
> On 2014-08-13 02:36:59 +0300, Heikki Linnakangas wrote:
>> On 08/13/2014 01:04 AM, Andres Freund wrote:
>>> * The patch mixes the API changes around WAL records with changes of how
>>> individual actions are logged. That makes it rather hard to review -
>>> and it's a 500kb patch already.
>>>
>>> I realize it's hard to avoid because the new API changes which
>>> information about blocks is available, but it does make it really hard
>>> to see whether the old/new code is doing something
>>> equivalent. It's hard to find more critical code than this :/
>>
>> Yeah, I hear you. I considered doing this piecemeal, just adding the new
>> functions first so that you could still use the old XLogRecData API, until
>> all the functions have been converted. But in the end, I figured it's not
>> worth it, as sooner or later we'd want to convert all the functions anyway.
>
> I think it might be worthwile anyway. I'd be very surprised if there
> aren't several significant bugs in the conversion. Your full page
> checking tool surely helps to reduce the number, but it's not
> foolproof. I can understand not wanting to do it though, it's a
> significant amount of work.
>
> Would you ask somebody else to do it in two steps?

Hmm, thinking about this some more, there is one sensible way to split
this patch: We can add the XLogReplayBuffer() function and rewrite all
the redo routines to use it, without changing any WAL record formats or
anything in the way the WAL records are constructed. In the patch,
XLogReplayBuffer() takes one input arument, the block reference ID, and
it fetches the RelFileNode and BlockNumber of the block based on that.
Without the WAL format changes, the information isn't there in the
record, but we can require the callers to pass the RelFileNode and
BlockNumber. The final patch will remove those arguments from every
caller, but that's a very mechanical change.

As in the attached patch. I only modified the heapam redo routines to
use the new XLogReplayBuffer() idiom; the idea is to do that for every
redo routine.

After applying such a patch, the main WAL format changing patch becomes
much smaller, and makes it easier to see from the redo routines where
significant changes to the WAL record formats have been made. This also
allows us to split the bikeshedding; we can discuss the name of
XLogReplayBuffer() first :-).

- Heikki

Attachment	Content-Type	Size
xlogreplaybuffer-1.patch	text/x-diff	50.1 KB

From:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-08-13 17:05:58
Message-ID:	53EB9AF6.6050703@vmware.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 08/13/2014 04:15 PM, Heikki Linnakangas wrote:
> Hmm, thinking about this some more, there is one sensible way to split
> this patch: We can add the XLogReplayBuffer() function and rewrite all
> the redo routines to use it, without changing any WAL record formats or
> anything in the way the WAL records are constructed. In the patch,
> XLogReplayBuffer() takes one input arument, the block reference ID, and
> it fetches the RelFileNode and BlockNumber of the block based on that.
> Without the WAL format changes, the information isn't there in the
> record, but we can require the callers to pass the RelFileNode and
> BlockNumber. The final patch will remove those arguments from every
> caller, but that's a very mechanical change.
>
> As in the attached patch. I only modified the heapam redo routines to
> use the new XLogReplayBuffer() idiom; the idea is to do that for every
> redo routine.
>
> After applying such a patch, the main WAL format changing patch becomes
> much smaller, and makes it easier to see from the redo routines where
> significant changes to the WAL record formats have been made. This also
> allows us to split the bikeshedding; we can discuss the name of
> XLogReplayBuffer() first :-).

Here's a full version of this refactoring patch, all the rmgr's have now
been updated to use XLogReplayBuffer(). I think this is a worthwhile
change on its own, even if we drop the ball on the rest of the WAL
format patch, because it makes the redo-routines more readable. I
propose to commit this as soon as someone has reviewed it, and we agree
on a for what's currently called XLogReplayBuffer(). I have tested this
with my page-image comparison tool.

- Heikki

Attachment	Content-Type	Size
xlogreplaybuffer-2.patch	text/x-diff	120.2 KB

From:	Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	Andres Freund <andres(at)2ndquadrant(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-08-13 17:51:24
Message-ID:	20140813175124.GM5728@eldon.alvh.no-ip.org
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Heikki Linnakangas wrote:

> Here's a full version of this refactoring patch, all the rmgr's have
> now been updated to use XLogReplayBuffer(). I think this is a
> worthwhile change on its own, even if we drop the ball on the rest
> of the WAL format patch, because it makes the redo-routines more
> readable. I propose to commit this as soon as someone has reviewed
> it, and we agree on a for what's currently called
> XLogReplayBuffer(). I have tested this with my page-image comparison
> tool.

XLogReadBufferForReplay ?
ReadBufferForXLogReplay ?

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	Andres Freund <andres(at)2ndquadrant(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-08-13 18:04:18
Message-ID:	20140813180417.GN5728@eldon.alvh.no-ip.org
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Heikki Linnakangas wrote:

What's with XLogReplayLSN and XLogReplayRecord? IMO the xlog code has
more than enough global variables already, it'd be good to avoid two
more if possible. Is there no other good way to get this info down to
whatever it is that needs them?

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To:	Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-08-14 07:29:33
Message-ID:	CAB7nPqQPvGzfp+On3CbavyqUk+wkjWedGwU8NDBTEooHH42zCQ@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, Aug 14, 2014 at 3:04 AM, Alvaro Herrera
<alvherre(at)2ndquadrant(dot)com> wrote:
> Heikki Linnakangas wrote:
> What's with XLogReplayLSN and XLogReplayRecord? IMO the xlog code has
> more than enough global variables already, it'd be good to avoid two
> more if possible. Is there no other good way to get this info down to
> whatever it is that needs them?
Yep, they do not look necessary. Looking at the patch, you could get
rid of XLogReplayLSN and XLogReplayRecord by adding two extra argument
to XLogReplayBuffer: one for the LSN of the current record, and a
second for the record pointer. The code just saves those two variables
in the redo loop of StartupXLOG to only reuse them in
XLogReplayBufferExtended, and I saw no code paths in the redo routines
where XLogReplayBuffer is called at places without the LSN position
and the record pointer.

However I think that Heikki introduced those two variables to make the
move to the next patch easier.
Regards,
--
Michael

From:	Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	Andres Freund <andres(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-08-14 08:22:12
Message-ID:	CAB7nPqQ3hPneYWZx9Zvs0ASMFCucRAVZ5r57=jr-QFjkvwATKQ@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, Aug 14, 2014 at 2:05 AM, Heikki Linnakangas
<hlinnakangas(at)vmware(dot)com> wrote:
> Here's a full version of this refactoring patch, all the rmgr's have now
> been updated to use XLogReplayBuffer(). I think this is a worthwhile change
> on its own, even if we drop the ball on the rest of the WAL format patch,
> because it makes the redo-routines more readable. I propose to commit this
> as soon as someone has reviewed it, and we agree on a for what's currently
> called XLogReplayBuffer(). I have tested this with my page-image comparison
> tool.

I have as well passed this patch through the page comparison tool and
found no problems, with both regression and isolation tests. I also
had a look at the redo routines that are changed and actually found
nothing. Now, what this patch does is not much complicated, it adds a
couple of status flags returned by XLogReplayBuffer. Then, roughly,
when BLK_NEEDS_REDO is returned the previous restore actions are done.
This has the merit to put the LSN check on current page to determine
if a page needs to be replayed inside XLogReplayBuffer.

A couple of minor comments though:
1) Why changing that from ERROR to PANIC?
/* Caller specified a bogus block_index */
- elog(ERROR, "failed to restore block_index %d", block_index);
+ elog(PANIC, "failed to restore block_index %d", block_index);
2) There are some whitespaces, like here:
+ {
+ destPage = NULL; /* don't do any page updates */
}

Regards,
--
Michael

From:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To:	Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc:	Andres Freund <andres(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-08-14 09:41:35
Message-ID:	53EC844F.4010807@vmware.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 08/14/2014 10:29 AM, Michael Paquier wrote:
> On Thu, Aug 14, 2014 at 3:04 AM, Alvaro Herrera
> <alvherre(at)2ndquadrant(dot)com> wrote:
>> Heikki Linnakangas wrote:
>> What's with XLogReplayLSN and XLogReplayRecord? IMO the xlog code has
>> more than enough global variables already, it'd be good to avoid two
>> more if possible. Is there no other good way to get this info down to
>> whatever it is that needs them?
> Yep, they do not look necessary. Looking at the patch, you could get
> rid of XLogReplayLSN and XLogReplayRecord by adding two extra argument
> to XLogReplayBuffer: one for the LSN of the current record, and a
> second for the record pointer. The code just saves those two variables
> in the redo loop of StartupXLOG to only reuse them in
> XLogReplayBufferExtended, and I saw no code paths in the redo routines
> where XLogReplayBuffer is called at places without the LSN position
> and the record pointer.
>
> However I think that Heikki introduced those two variables to make the
> move to the next patch easier.

The next patch doesn't necessary require them either, you could always
pass the LSN and record as an argument. I wanted to avoid that, because
every redo function would just pass the current record being replayed,
so it seems nicer to pass that information "out-of-band". I guess if we
do that, though, we should remove those arguments from rm_redo interface
altogether, and always rely on the global variables to get the "current"
record or its LSN. I'm not wedded on this, I could be persuaded to go
either way...

- Heikki

From:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To:	Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc:	Andres Freund <andres(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-08-14 09:42:52
Message-ID:	53EC849C.2090205@vmware.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 08/14/2014 11:22 AM, Michael Paquier wrote:
> 1) Why changing that from ERROR to PANIC?
> /* Caller specified a bogus block_index */
> - elog(ERROR, "failed to restore block_index %d", block_index);
> + elog(PANIC, "failed to restore block_index %d", block_index);

No reason, just a leftover from debugging. Please ignore.

- Heikki

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-08-14 09:44:46
Message-ID:	20140814094446.GA5096@alap3.anarazel.de
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 2014-08-14 12:41:35 +0300, Heikki Linnakangas wrote:
> On 08/14/2014 10:29 AM, Michael Paquier wrote:
> >On Thu, Aug 14, 2014 at 3:04 AM, Alvaro Herrera
> ><alvherre(at)2ndquadrant(dot)com> wrote:
> >>Heikki Linnakangas wrote:
> >>What's with XLogReplayLSN and XLogReplayRecord? IMO the xlog code has
> >>more than enough global variables already, it'd be good to avoid two
> >>more if possible. Is there no other good way to get this info down to
> >>whatever it is that needs them?
> >Yep, they do not look necessary. Looking at the patch, you could get
> >rid of XLogReplayLSN and XLogReplayRecord by adding two extra argument
> >to XLogReplayBuffer: one for the LSN of the current record, and a
> >second for the record pointer. The code just saves those two variables
> >in the redo loop of StartupXLOG to only reuse them in
> >XLogReplayBufferExtended, and I saw no code paths in the redo routines
> >where XLogReplayBuffer is called at places without the LSN position
> >and the record pointer.
> >
> >However I think that Heikki introduced those two variables to make the
> >move to the next patch easier.
>
> The next patch doesn't necessary require them either, you could always pass
> the LSN and record as an argument. I wanted to avoid that, because every
> redo function would just pass the current record being replayed, so it seems
> nicer to pass that information "out-of-band". I guess if we do that, though,
> we should remove those arguments from rm_redo interface altogether, and
> always rely on the global variables to get the "current" record or its LSN.
> I'm not wedded on this, I could be persuaded to go either way...

I personally find the out of band transport really ugly.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-08-18 13:55:11
Message-ID:	53F205BF.2060303@vmware.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 08/14/2014 12:44 PM, Andres Freund wrote:
> On 2014-08-14 12:41:35 +0300, Heikki Linnakangas wrote:
>> On 08/14/2014 10:29 AM, Michael Paquier wrote:
>>> On Thu, Aug 14, 2014 at 3:04 AM, Alvaro Herrera
>>> <alvherre(at)2ndquadrant(dot)com> wrote:
>>>> Heikki Linnakangas wrote:
>>>> What's with XLogReplayLSN and XLogReplayRecord? IMO the xlog code has
>>>> more than enough global variables already, it'd be good to avoid two
>>>> more if possible. Is there no other good way to get this info down to
>>>> whatever it is that needs them?
>>> Yep, they do not look necessary. Looking at the patch, you could get
>>> rid of XLogReplayLSN and XLogReplayRecord by adding two extra argument
>>> to XLogReplayBuffer: one for the LSN of the current record, and a
>>> second for the record pointer. The code just saves those two variables
>>> in the redo loop of StartupXLOG to only reuse them in
>>> XLogReplayBufferExtended, and I saw no code paths in the redo routines
>>> where XLogReplayBuffer is called at places without the LSN position
>>> and the record pointer.
>>>
>>> However I think that Heikki introduced those two variables to make the
>>> move to the next patch easier.
>>
>> The next patch doesn't necessary require them either, you could always pass
>> the LSN and record as an argument. I wanted to avoid that, because every
>> redo function would just pass the current record being replayed, so it seems
>> nicer to pass that information "out-of-band". I guess if we do that, though,
>> we should remove those arguments from rm_redo interface altogether, and
>> always rely on the global variables to get the "current" record or its LSN.
>> I'm not wedded on this, I could be persuaded to go either way...
>
> I personally find the out of band transport really ugly.

All rightey.

Here's the next version of this work. It now comes as two patches. The
first one refactors adds the XLogReplayBuffer() function and refactors
all the redo functions to use it. It doesn't change the WAL record
format in any way. The second patch applies over the first one, and
changes the WAL format, and all the WAL-creating functions to use the
new API for constructing WAL records. The second patch removes the
relfilenode and block number arguments from XLogReplayBuffer, because
they're no longer needed when that information is included in the record
format.

Todo:

* Performance testing. Do the new WAL construction functions add
overhead to WAL insertion?
* Compare WAL record sizes before and after. I've tried to keep it as
compact as possible, but I know that some records have gotten bigger.
Need to do a more thorough analysis.
* Rename XLogReplayBuffer. I don't particularly like any of the
suggestions so far, XLogReplayBuffer included. Discussion continues..
* Anything else?

- Heikki

Attachment	Content-Type	Size
0001-Refactor-redo-routines-to-use-XLogReplayBuffer.patch.gz	application/gzip	20.7 KB
0002-Change-the-way-WAL-records-are-constructed.patch.gz	application/gzip	91.5 KB

From:	Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	Andres Freund <andres(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-08-19 08:07:30
Message-ID:	CAB7nPqQreciciXVLpQ8Sk_HZM4_6twEtb79TCXCbK00CF423Eg@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, Aug 18, 2014 at 10:55 PM, Heikki Linnakangas <
hlinnakangas(at)vmware(dot)com> wrote:
> All rightey.
>
> Here's the next version of this work. It now comes as two patches. The
first
> one refactors adds the XLogReplayBuffer() function and refactors all the
> redo functions to use it. It doesn't change the WAL record format in any
> way. The second patch applies over the first one, and changes the WAL
> format, and all the WAL-creating functions to use the new API for
> constructing WAL records. The second patch removes the relfilenode and
block
> number arguments from XLogReplayBuffer, because they're no longer needed
> when that information is included in the record format.
>
> Todo:
>
> * Performance testing. Do the new WAL construction functions add overhead
to
> WAL insertion?
> * Compare WAL record sizes before and after. I've tried to keep it as
> compact as possible, but I know that some records have gotten bigger. Need
> to do a more thorough analysis.
> * Rename XLogReplayBuffer. I don't particularly like any of the
suggestions
> so far, XLogReplayBuffer included. Discussion continues..
> * Anything else?

Patch 1 looks good. The main difference with the v1 submitted a couple of
days back is that the global variable approach is replaced with additional
arguments for the LSN position and record pointer in XLogReplayBuffer. I
have as well run a couple of tests with the page comparison tool, done some
tests based on installcheck-world with a slave replaying WAL behind, and
found no issues with it.
Perhaps we could consider pushing it to facilitate the next work? Even if
the second patch is dropped it is still a win IMO to have backup block
replay managed within a single function (being it named XLogReplayBuffer in
latest patch), and having it return a status flag.

Regarding patch 2:
- The main differences with the latest version are the modifications for
XLogReplayBuffer having new arguments (LSN position and record pointer).
XLogRecGetBlockRefIds has been changed to return a palloc'd array of block
IDs. xloginsert.h, containing all the functions for xlog record
construction is introduced as well.
- Tiny thing, be aware of tab padding. Here is heapam.c:
page = BufferGetPage(buffer);
PageSetAllVisible(page);
MarkBufferDirty(buffer);
- XLogRecGetBlockRefIds is not described in
src/backend/access/transam/README. Btw, pg_xlogdump drops a core dump when
using it:
--Output:
Assertion failed: (n == *num_refs), function XLogRecGetBlockRefIds, file
xlogreader.c, line 1157.
rmgr: Heap len (rec/tot): 14/ 12912, tx: 3, lsn:
0/01000148, prev 0/01000120, Abort trap: 6 (core dumped)
-- Backtrace:
frame #4: 0x0000000103870363
pg_xlogdump`XLogRecGetBlockRefIds(record=0x00007ff38a003200,
num_refs=0x00007fff5c394464) + 435 at xlogreader.c:1157
frame #5: 0x000000010386d610
pg_xlogdump`XLogDumpDisplayRecord(config=0x00007fff5c3945c8,
ReadRecPtr=16777544, record=0x00007ff38a003200) + 272 at pg_xlogdump.c:357
frame #6: 0x000000010386cad8 pg_xlogdump`main(argc=2,
argv=0x00007fff5c394658) + 3160 at pg_xlogdump.c:749
In order to reproduce that, simply run regression tests, followed by
pg_xlogdump on one of the WAL files generated.
- This patch includes some diffs from pg_receivexlog.c taken from 52bffe3.
- I have run installcheck-world and compared the size of the WAL generated
on HEAD and the patch (any hints to improve such analysis are of course
welcome)
name | start | stop | diff
----------------+-----------+------------+-----------
HEAD (8605bc7) | 0/16C6808 | 0/11A2C670 | 271998568
Patch 1+2 | 0/16D45D8 | 0/1267A4B0 | 284843736
(2 rows)
So that's a diff of more or less 13MB for this test set.

Looking forward for some performance numbers as well as more precise
comparison of WAL record length.
--
Michael

From:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To:	Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc:	Andres Freund <andres(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-08-19 08:48:58
Message-ID:	53F30F7A.1090606@vmware.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 08/19/2014 11:07 AM, Michael Paquier wrote:
> Patch 1 looks good. The main difference with the v1 submitted a couple of
> days back is that the global variable approach is replaced with additional
> arguments for the LSN position and record pointer in XLogReplayBuffer. I
> have as well run a couple of tests with the page comparison tool, done some
> tests based on installcheck-world with a slave replaying WAL behind, and
> found no issues with it.

Thanks!

> Perhaps we could consider pushing it to facilitate the next work? Even if
> the second patch is dropped it is still a win IMO to have backup block
> replay managed within a single function (being it named XLogReplayBuffer in
> latest patch), and having it return a status flag.

Yeah, that was my plan.

Regarding the name, the following have been suggested so far:

XLogReplayBuffer
XLogRestoreBuffer
XLogRecoverBuffer
XLogReadBufferForReplay
ReadBufferForXLogReplay

One more idea:

XLogRedoBuffer (would be like three first options above, but would match
that we call the functions that call this "redo" functions)

I think XLogReadBufferForReplay is the most descriptive. Andres and
Alvaro both suggested it - independently I believe - so that seems to
come out naturally. But maybe make it XLogReadBufferForRedo, since we
call the redo functions redo functions and not replay functions.

Yet another option is to just call it XLogReadBuffer, and rename the
existing XLogReadBuffer to something else. With the 2nd patch, there are
only a few callers of XLogReadBuffer left. But is it too deceitful if
XLogReadBuffer doesn't merely read the page, but also sometimes replaces
it with a full-page image? Maybe it's OK..

Barring objections or better ideas, I'm leaning towards
XLogReadBufferForRedo.

- Heikki

From:	Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-08-19 14:33:29
Message-ID:	20140819143329.GC6817@eldon.alvh.no-ip.org
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Heikki Linnakangas wrote:

> Barring objections or better ideas, I'm leaning towards
> XLogReadBufferForRedo.

WFM

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-08-19 14:38:14
Message-ID:	20140819143814.GB30693@awork2.anarazel.de
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 2014-08-19 10:33:29 -0400, Alvaro Herrera wrote:
> Heikki Linnakangas wrote:
>
> > Barring objections or better ideas, I'm leaning towards
> > XLogReadBufferForRedo.
>
> WFM

for me too. Although we could imo strip the 'XLog' in the beginning if
we want to make it shorter. The ForRedo is saying that pretty much.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc:	Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-09-02 12:23:16
Message-ID:	5405B6B4.6040300@vmware.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 08/19/2014 05:38 PM, Andres Freund wrote:
> On 2014-08-19 10:33:29 -0400, Alvaro Herrera wrote:
>> Heikki Linnakangas wrote:
>>
>>> Barring objections or better ideas, I'm leaning towards
>>> XLogReadBufferForRedo.
>>
>> WFM
>
> for me too. Although we could imo strip the 'XLog' in the beginning if
> we want to make it shorter. The ForRedo is saying that pretty much.

I committed the redo-routine refactoring patch. I kept the XLog prefix
in the XLogReadBufferForRedo name; it's redundant, but all the other
similar functions in xlogutils.c use the XLog prefix so it would seem
inconsistent to not have it here.

I'll post a new version of the main patch shortly...

- Heikki

From:	Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	Andres Freund <andres(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-09-04 00:39:37
Message-ID:	CAB7nPqTYGCpsA8Hk039SL=27FNGWmTd_b89NKb+P-YMaZ8oRZA@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, Sep 2, 2014 at 9:23 PM, Heikki Linnakangas
<hlinnakangas(at)vmware(dot)com> wrote:
> I committed the redo-routine refactoring patch. I kept the XLog prefix in
> the XLogReadBufferForRedo name; it's redundant, but all the other similar
> functions in xlogutils.c use the XLog prefix so it would seem inconsistent
> to not have it here.
Thanks! Even that will be helpful for a potential patch doing
consistency comparisons of FPW with current pages having WAL of a
record applied.

> I'll post a new version of the main patch shortly...
Looking forward to seeing it.

Regards,
--
Michael

From:	Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	Andres Freund <andres(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-09-04 00:39:41
Message-ID:	CAB7nPqSf=K_1B5epPjNXvBVzsWxbw4uJPrL5-b_8xxVNXwdD_Q@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

> I'll post a new version of the main patch shortly...
Looking forward to seeing it.

Regards,
--
Michael

From:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To:	Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc:	Andres Freund <andres(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-09-15 12:41:22
Message-ID:	5416DE72.7030005@vmware.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 09/04/2014 03:39 AM, Michael Paquier wrote:
> On Tue, Sep 2, 2014 at 9:23 PM, Heikki Linnakangas
> <hlinnakangas(at)vmware(dot)com> wrote:
>> I committed the redo-routine refactoring patch. I kept the XLog prefix in
>> the XLogReadBufferForRedo name; it's redundant, but all the other similar
>> functions in xlogutils.c use the XLog prefix so it would seem inconsistent
>> to not have it here.
> Thanks! Even that will be helpful for a potential patch doing
> consistency comparisons of FPW with current pages having WAL of a
> record applied.
>
>> I'll post a new version of the main patch shortly...
> Looking forward to seeing it.

Here we go. I've split this again into two patches. The first patch is
just refactoring the current code. It moves XLogInsert into a new file,
xloginsert.c, and the definition of XLogRecord to new xlogrecord.h
header file. As a result, there is a a lot of churn in the #includes in
C files that generate WAL records, or contain redo routines. The number
of files that pull in xlog.h - directly or indirectly through other
headers - is greatly reduced.

The second patch contains the interesting changes.

I wrote a little benchmark kit to performance test this. I'm trying to
find out two things:

1) How much CPU overhead do the new XLogBeginInsert and XLogRegister*
functions add, compared to the current approach with XLogRecDatas.

2) How much extra WAL is generated with the patch. This affects the CPU
time spent in the tests, but it's also interesting to measure directly,
because WAL size affects many things like WAL archiving, streaming
replication etc.

Attached is the test kit I'm using. To run the battery of tests, use
"psql -f run.sql". To answer the question of WAL volume, it runs a bunch
of tests that exercise heap insert, update and delete, as well as b-tree
and GIN insertions. To answer the second test, it runs a heap insertion
test, with a tiny record size that's chosen so that it generates exactly
the same amount of WAL after alignment with and without the patch. The
test is repeated many times, and the median of runtimes is printed out.

Here are the results, comparing unpatched and patched versions. First,
the WAL sizes:

> postgres=# \i compare.sql
> description | wal_per_op (orig) | wal_per_op (patched) | %
> --------------------------------+-------------------+----------------------+--------
> heap insert 26 | 64 | 64 | 100.00
> heap insert 27 | 64 | 72 | 112.50
> heap insert 28 | 64 | 72 | 112.50
> heap insert 29 | 64 | 72 | 112.50
> heap insert 30 | 72 | 72 | 100.00
> heap insert 31 | 72 | 72 | 100.00
> heap insert 32 | 72 | 72 | 100.00
> heap insert 33 | 72 | 72 | 100.00
> heap insert 34 | 72 | 72 | 100.00
> heap insert 35 | 72 | 80 | 111.11
> heap update 26 | 80 | 80 | 100.00
> heap update 27 | 80 | 88 | 110.00
> heap update 28 | 107 | 88 | 82.24
> heap update 29 | 88 | 88 | 100.00
> heap update 30 | 88 | 108 | 122.73
> heap update 31 | 88 | 88 | 100.00
> heap update 32 | 105 | 88 | 83.81
> heap update 33 | 88 | 88 | 100.00
> heap update 34 | 88 | 102 | 115.91
> heap update 35 | 88 | 96 | 109.09
> hot update 26 | 112 | 80 | 71.43
> hot update 27 | 80 | 88 | 110.00
> hot update 28 | 80 | 94 | 117.50
> hot update 29 | 88 | 88 | 100.00
> hot update 30 | 105 | 88 | 83.81
> hot update 31 | 88 | 105 | 119.32
> hot update 32 | 88 | 88 | 100.00
> hot update 33 | 88 | 88 | 100.00
> hot update 34 | 124 | 88 | 70.97
> hot update 35 | 88 | 111 | 126.14
> heap + btree insert 26 | 149 | 157 | 105.37
> heap + btree insert 27 | 161 | 161 | 100.00
> heap + btree insert 28 | 177 | 178 | 100.56
> heap + btree insert 29 | 177 | 185 | 104.52
> heap + btree insert 30 | 178 | 185 | 103.93
> heap + btree insert 31 | 185 | 188 | 101.62
> heap + btree insert 32 | 202 | 202 | 100.00
> heap + btree insert 33 | 205 | 211 | 102.93
> heap + btree insert 34 | 202 | 210 | 103.96
> heap + btree insert 35 | 211 | 210 | 99.53
> heap + gin insert (fastupdate) | 12479 | 13182 | 105.63
> heap + gin insert | 232547 | 236677 | 101.78
> (42 rows)

A heap insertion records are 2 bytes larger with the patch. Due to
alignment, that makes for a 0 or 8 byte difference in the record sizes.
Other WAL records have a similar store; a few extra bytes but no big
regressions. There are a few outliers above where it appears that the
patched version takes less space. Not sure why that would be, probably
just a glitch in the test, autovacuum kicked in or something.

Now, for the CPU overhead:

description | dur_us (orig) | dur_us (patched) | %
----------------+---------------+------------------+--------
heap insert 30 | 0.7752835 | 0.831883 | 107.30
(1 row)

So, the patched version runs 7.3 % slower. That's disappointing :-(.

This are the result I got on my laptop today. Previously, the typical
result I've gotten has been about 5%, so that's a bit high.
Nevertheless, even a 5% slowdown is probably not acceptable.

While I've trying to nail down where that difference comes from, I've
seen a lot of strange phenomenon. At one point, the patched version was
10% slower, but I was able to bring the difference down to 5% if I added
a certain function in xloginsert.c, but never called it. It was very
repeatable at the time, I tried adding and removing it many times and
always got the same result, but I don't see it with the current HEAD and
patch version anymore. So I think 5% is pretty close to the margin of
error that arises from different compiler optimizations,
data/instruction cache effects etc.

Looking at the 'perf' profile, The new function calls only amount to
about 2% of overhead, so I'm not sure where the slowdown is coming from.
Here are explanations I've considered, but I haven't been able to prove
any of them:

* Function call overhead of the new functions. I've tried inlining them,
but found no big difference.

* The relation and block information are included as a separate
XLogRecData entry, so the chain that needs to be memcpy'd and CRCd is
one entry longer. I've tried hacking away the extra entry, but haven't
seen much difference.

* Even though the record size is the same after alignment, it's 2 bytes
longer without alignment, which happens to be about 5% of the total
record size. I've tried modifying the record to be 2 bytes smaller for
test purposes, but found no difference.

I'm out of ideas at the moment. Anyone else?

- Heikki

Attachment	Content-Type	Size
0001-Move-the-backup-block-logic-from-XLogInsert-to-xlogi.patch	text/x-diff	84.7 KB
0002-Change-the-way-WAL-records-are-constructed.patch	text/x-diff	399.2 KB
walperftest.tar.gz	application/gzip	2.5 KB

From:	Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-09-16 02:03:12
Message-ID:	20140916020311.GP4701@eldon.alvh.no-ip.org
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Heikki Linnakangas wrote:

> Here we go. I've split this again into two patches. The first patch
> is just refactoring the current code. It moves XLogInsert into a new
> file, xloginsert.c, and the definition of XLogRecord to new
> xlogrecord.h header file. As a result, there is a a lot of churn in
> the #includes in C files that generate WAL records, or contain redo
> routines. The number of files that pull in xlog.h - directly or
> indirectly through other headers - is greatly reduced.

I think you should push the first patch for now and continue to
investigate the issues in the second patch. Some minor suggestions I
have for the first patch are that since xlog.h is no longer used by
other headers (but only by .c files), you should just #include
xloginsert.h instead of doing the forward declaration of struct
XLogRecData; and in xlog_internal.h, instead of

- * XLogRecord is defined in xlog.h, but we avoid #including that to keep
+ * XLogRecord is defined in xlogrecord.h, but we avoid #including that to keep

I would just suggest to #include xlogrecord.h, since it should just
work to include that file from frontend programs. It didn't work for
xlog.h because that one #includes fmgr.h IIRC and that one causes Datum
to appear, which makes compilation blow up. Shouldn't be the case here.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To:	Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-09-16 03:00:48
Message-ID:	CAB7nPqSsSJ=-TNoH5R4vNWokiWsAcv9+VQLpXkeVcZ+MJe1vWA@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, Sep 15, 2014 at 7:03 PM, Alvaro Herrera
<alvherre(at)2ndquadrant(dot)com> wrote:
> Heikki Linnakangas wrote:
>
>> Here we go. I've split this again into two patches. The first patch
>> is just refactoring the current code. It moves XLogInsert into a new
>> file, xloginsert.c, and the definition of XLogRecord to new
>> xlogrecord.h header file. As a result, there is a a lot of churn in
>> the #includes in C files that generate WAL records, or contain redo
>> routines. The number of files that pull in xlog.h - directly or
>> indirectly through other headers - is greatly reduced.
>
> I think you should push the first patch for now and continue to
> investigate the issues in the second patch. Some minor suggestions I
> have for the first patch are that since xlog.h is no longer used by
> other headers (but only by .c files), you should just #include
> xloginsert.h instead of doing the forward declaration of struct
> XLogRecData; and in xlog_internal.h, instead of
>
> - * XLogRecord is defined in xlog.h, but we avoid #including that to keep
> + * XLogRecord is defined in xlogrecord.h, but we avoid #including that to keep
>
> I would just suggest to #include xlogrecord.h, since it should just
> work to include that file from frontend programs. It didn't work for
> xlog.h because that one #includes fmgr.h IIRC and that one causes Datum
> to appear, which makes compilation blow up. Shouldn't be the case here.

Alvaro got faster than me... I was just looking at the first patch and
+1 on those comments. It is worth noting that the first patch, as it
does only a large refactoring, does not impact performance or size of
WAL records. Btw, a declaration of access/xlog.h in fd.c is forgotten.
Also, if somebody is interested in running the test suite, there are
comments on how to do it at the top of compare.sql.
--
Michael

From:	Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To:	Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-09-16 04:16:52
Message-ID:	CAB7nPqSYDcvfxXaery6pjFo0iL4g4etCbBDZmNTXuGTNyQfZqg@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, Sep 15, 2014 at 8:00 PM, Michael Paquier
<michael(dot)paquier(at)gmail(dot)com> wrote:
> Alvaro got faster than me... I was just looking at the first patch and
> +1 on those comments. It is worth noting that the first patch, as it
> does only a large refactoring, does not impact performance or size of
> WAL records.

Some comments after a second and closer read of the first patch:
1) Wouldn't it be better to call GetFullPageWriteInfo directly in
XLogRecordAssemble, removing RedoRecPtr and doPageWrites from its
arguments?
2) XLogCheckBufferNeedsBackup is not used. It can be removed.
3) If I am following correctly, there are two code paths where
XLogInsertRecData can return InvalidXLogRecPtr, and then there is this
process in XLogInsert
+ {
+ /*
+ * Undo the changes we made to the rdata chain.
+ *
+ * XXX: This doesn't undo *all* the changes;
the XLogRecData
+ * entries for buffers that we had already
decided to back up have
+ * had their data-pointers cleared. That's OK,
as long as we
+ * decide to back them up on the next
iteration as well. Hence,
+ * don't allow "doPageWrites" value to go from
true to false after
+ * we've modified the rdata chain.
+ */
+ bool newDoPageWrites;
+
+ GetFullPageWriteInfo(&RedoRecPtr, &newDoPageWrites);
+ if (!doPageWrites && newDoPageWrites)
+ doPageWrites = true;
+ rdt_lastnormal->next = NULL;
+ }
Wouldn't it be better to keep that in XLogInsertRecData? This way
GetFullPageWriteInfo could be removed (I actually see little point in
keeping it as part of this patch, but does this change make its sense
in patch 2?).
4) This patch is in DOS encoding (?!)

That's all I have for now...
Regards,
--
Michael

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-09-16 10:21:38
Message-ID:	20140916102138.GI23806@awork2.anarazel.de
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 2014-09-15 15:41:22 +0300, Heikki Linnakangas wrote:
> Here we go. I've split this again into two patches. The first patch is just
> refactoring the current code. It moves XLogInsert into a new file,
> xloginsert.c, and the definition of XLogRecord to new xlogrecord.h header
> file. As a result, there is a a lot of churn in the #includes in C files
> that generate WAL records, or contain redo routines. The number of files
> that pull in xlog.h - directly or indirectly through other headers - is
> greatly reduced.
>
> The second patch contains the interesting changes.
>
> I wrote a little benchmark kit to performance test this. I'm trying to find
> out two things:
>
> 1) How much CPU overhead do the new XLogBeginInsert and XLogRegister*
> functions add, compared to the current approach with XLogRecDatas.
>
> 2) How much extra WAL is generated with the patch. This affects the CPU time
> spent in the tests, but it's also interesting to measure directly, because
> WAL size affects many things like WAL archiving, streaming replication etc.
>
> Attached is the test kit I'm using. To run the battery of tests, use "psql
> -f run.sql". To answer the question of WAL volume, it runs a bunch of tests
> that exercise heap insert, update and delete, as well as b-tree and GIN
> insertions. To answer the second test, it runs a heap insertion test, with a
> tiny record size that's chosen so that it generates exactly the same amount
> of WAL after alignment with and without the patch. The test is repeated many
> times, and the median of runtimes is printed out.
>
> Here are the results, comparing unpatched and patched versions. First, the
> WAL sizes:
>
> A heap insertion records are 2 bytes larger with the patch. Due to
> alignment, that makes for a 0 or 8 byte difference in the record sizes.
> Other WAL records have a similar store; a few extra bytes but no big
> regressions. There are a few outliers above where it appears that the
> patched version takes less space. Not sure why that would be, probably just
> a glitch in the test, autovacuum kicked in or something.

I've to admit, that's already not a painless amount of overhead.

> Now, for the CPU overhead:
>
> description | dur_us (orig) | dur_us (patched) | %
> ----------------+---------------+------------------+--------
> heap insert 30 | 0.7752835 | 0.831883 | 107.30
> (1 row)
>
> So, the patched version runs 7.3 % slower. That's disappointing :-(.
>
> This are the result I got on my laptop today. Previously, the typical result
> I've gotten has been about 5%, so that's a bit high. Nevertheless, even a 5%
> slowdown is probably not acceptable.

Yes, I definitely think it's not.

> While I've trying to nail down where that difference comes from, I've seen a
> lot of strange phenomenon. At one point, the patched version was 10% slower,
> but I was able to bring the difference down to 5% if I added a certain
> function in xloginsert.c, but never called it. It was very repeatable at the
> time, I tried adding and removing it many times and always got the same
> result, but I don't see it with the current HEAD and patch version anymore.
> So I think 5% is pretty close to the margin of error that arises from
> different compiler optimizations, data/instruction cache effects etc.
>
> Looking at the 'perf' profile, The new function calls only amount to about
> 2% of overhead, so I'm not sure where the slowdown is coming from. Here are
> explanations I've considered, but I haven't been able to prove any of them:

I'd suggest doing:
a) perf stat -vvv of both workloads. That will often tell you stuff already
b) Look at other events. Particularly stalled-cycles-frontend,
stalled-cycles-backend, cache-misses

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To:	Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-09-16 12:11:06
Message-ID:	CAB7nPqT2dewPyfueL52ic1uf4G083_Rew=pCsaFr=ETR=cEZdg@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, Sep 15, 2014 at 9:16 PM, Michael Paquier
<michael(dot)paquier(at)gmail(dot)com> wrote:
> 2) XLogCheckBufferNeedsBackup is not used. It can be removed.

Please ignore this comment, XLogCheckBufferNeedsBackup is used in heapam.c.
--
Michael

From:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-10-03 12:51:37
Message-ID:	542E9BD9.3000600@vmware.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 09/16/2014 01:21 PM, Andres Freund wrote:
>> >While I've trying to nail down where that difference comes from, I've seen a
>> >lot of strange phenomenon. At one point, the patched version was 10% slower,
>> >but I was able to bring the difference down to 5% if I added a certain
>> >function in xloginsert.c, but never called it. It was very repeatable at the
>> >time, I tried adding and removing it many times and always got the same
>> >result, but I don't see it with the current HEAD and patch version anymore.
>> >So I think 5% is pretty close to the margin of error that arises from
>> >different compiler optimizations, data/instruction cache effects etc.
>> >
>> >Looking at the 'perf' profile, The new function calls only amount to about
>> >2% of overhead, so I'm not sure where the slowdown is coming from. Here are
>> >explanations I've considered, but I haven't been able to prove any of them:
> I'd suggest doing:
> a) perf stat -vvv of both workloads. That will often tell you stuff already
> b) Look at other events. Particularly stalled-cycles-frontend,
> stalled-cycles-backend, cache-misses

That didn't make me any wiser, unfortunately.

After a lot of experimentation, I figured out that the slowdown is
already apparent with the *first* patch, the one that just refactors
existing XLogInsert, without any changes to the WAL format or to the
callers. I fiddled with that for a long time, trying to tease apart the
change that makes the difference, and was finally able to narrow it down.

Attached are two patches. These are both just refactoring, with no
changes to the WAL format or APIs. The first is the same as the
refactoring patch I posted earlier, with only minor changes to
#includes, comments and such (per Alvaro's and Michael's suggestions -
thanks!). With the first patch, the test case I've been using to
performance test this becomes somewhere between 5% - 10% slower.
Applying the second patch on top of that restores the performance back
to what you get without these patches.

Strange. The second patch moves the CRC calculation from a separate loop
over the XLogRecDatas to the earlier loop, that also iterates through
the XLogRecDatas. The strange thing about this is that when I tried to
make the same change to current git master, without applying the first
patch, it didn't make any difference. The CRC calculation used to
integrated to the earlier loop in 9.1 and before, but in 9.2 it was
moved to a separate loop for simplicity, because it didn't make any
difference to performance.

So I now have a refactoring patch ready that I'd like to commit (the
attached two patches together), but to be honest, I have no idea why the
second patch is so essential to performance.

If someone else wants to try this out, the performance difference can be
seen with the test suite I posted earlier, or with an even simpler
pgbench test script:

-- create the test table once, truncate between pgbench runs:
create table test (id int4);

-- pgbench script:
insert into test select i from generate_series(1,100) i;

- Heikki

Attachment	Content-Type	Size
0001-Move-the-backup-block-logic-from-XLogInsert-to-a-new.patch	text/x-diff	89.5 KB
0002-Move-the-CRC-calculations-into-the-other-loop-throug.patch	text/x-diff	5.0 KB

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-10-03 13:10:49
Message-ID:	20141003131049.GR7158@awork2.anarazel.de
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 2014-10-03 15:51:37 +0300, Heikki Linnakangas wrote:
> After a lot of experimentation, I figured out that the slowdown is already
> apparent with the *first* patch, the one that just refactors existing
> XLogInsert, without any changes to the WAL format or to the callers. I
> fiddled with that for a long time, trying to tease apart the change that
> makes the difference, and was finally able to narrow it down.
>
> Attached are two patches. These are both just refactoring, with no changes
> to the WAL format or APIs. The first is the same as the refactoring patch I
> posted earlier, with only minor changes to #includes, comments and such (per
> Alvaro's and Michael's suggestions - thanks!). With the first patch, the
> test case I've been using to performance test this becomes somewhere between
> 5% - 10% slower. Applying the second patch on top of that restores the
> performance back to what you get without these patches.
>
> Strange. The second patch moves the CRC calculation from a separate loop
> over the XLogRecDatas to the earlier loop, that also iterates through the
> XLogRecDatas. The strange thing about this is that when I tried to make the
> same change to current git master, without applying the first patch, it
> didn't make any difference. The CRC calculation used to integrated to the
> earlier loop in 9.1 and before, but in 9.2 it was moved to a separate loop
> for simplicity, because it didn't make any difference to performance.
>
> So I now have a refactoring patch ready that I'd like to commit (the
> attached two patches together), but to be honest, I have no idea why the
> second patch is so essential to performance.

Maybe a stupid question, but did you verify it's not caused by splitting
xloginsert over two translation units and/or not inlining the separate
function?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-10-03 13:28:17
Message-ID:	542EA471.9060502@vmware.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 10/03/2014 04:10 PM, Andres Freund wrote:
> On 2014-10-03 15:51:37 +0300, Heikki Linnakangas wrote:
>> After a lot of experimentation, I figured out that the slowdown is already
>> apparent with the *first* patch, the one that just refactors existing
>> XLogInsert, without any changes to the WAL format or to the callers. I
>> fiddled with that for a long time, trying to tease apart the change that
>> makes the difference, and was finally able to narrow it down.
>>
>> Attached are two patches. These are both just refactoring, with no changes
>> to the WAL format or APIs. The first is the same as the refactoring patch I
>> posted earlier, with only minor changes to #includes, comments and such (per
>> Alvaro's and Michael's suggestions - thanks!). With the first patch, the
>> test case I've been using to performance test this becomes somewhere between
>> 5% - 10% slower. Applying the second patch on top of that restores the
>> performance back to what you get without these patches.
>>
>> Strange. The second patch moves the CRC calculation from a separate loop
>> over the XLogRecDatas to the earlier loop, that also iterates through the
>> XLogRecDatas. The strange thing about this is that when I tried to make the
>> same change to current git master, without applying the first patch, it
>> didn't make any difference. The CRC calculation used to integrated to the
>> earlier loop in 9.1 and before, but in 9.2 it was moved to a separate loop
>> for simplicity, because it didn't make any difference to performance.
>>
>> So I now have a refactoring patch ready that I'd like to commit (the
>> attached two patches together), but to be honest, I have no idea why the
>> second patch is so essential to performance.
>
> Maybe a stupid question, but did you verify it's not caused by splitting
> xloginsert over two translation units and/or not inlining the separate
> function?

Hmph, now that I try this again, I don't see any difference with or
without the "second" patch - they both perform just as well as unpatched
master. So I'm totally baffled again, but I guess the patch is good to
go. I need a more stable test environment..

Yeah, I did try merging xlog.c and xloginsert.c into the same .c file at
some point, and marking the XLogInsertRecData function as static,
allowing the compiler to inline it, but I didn't see any difference. But
you have to take that with a grain of salt - I'm apparently not capable
of constructing a properly repeatable test case for this.

- Heikki

From:	Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	Andres Freund <andres(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-10-06 01:42:50
Message-ID:	CAB7nPqQT0Gbjo1z=YeKtD86OxR1NH9X4j09jk=+o=rEHNKmWvg@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, Oct 3, 2014 at 9:51 PM, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
wrote:
> So I now have a refactoring patch ready that I'd like to commit (the
attached two patches together), but to be honest, I have no idea why the
second patch is so essential to performance.
Thanks. I did some more tests with master, master+patch1, master+patch1+CRC
refactoring, but I am not able to see any performance difference with
pgbench (--no-vacuum, -t) and the test suite you provided, just some noise
that barely changed performance. Note that fd.c uses
SYNC_METHOD_FSYNC_WRITETHROUGH, so it is necessary to include xlog.h in it.
--
Michael

From:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To:	Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc:	Andres Freund <andres(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-10-06 11:19:39
Message-ID:	54327ACB.1070003@vmware.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 10/06/2014 04:42 AM, Michael Paquier wrote:
> On Fri, Oct 3, 2014 at 9:51 PM, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
> wrote:
>> So I now have a refactoring patch ready that I'd like to commit (the
> attached two patches together), but to be honest, I have no idea why the
> second patch is so essential to performance.
> Thanks. I did some more tests with master, master+patch1, master+patch1+CRC
> refactoring, but I am not able to see any performance difference with
> pgbench (--no-vacuum, -t) and the test suite you provided, just some noise
> that barely changed performance.

Thanks for the confirmation. I'm really going crazy with benchmarking
this. Sometimes I see a big difference, the next day it's gone.

> Note that fd.c uses
> SYNC_METHOD_FSYNC_WRITETHROUGH, so it is necessary to include xlog.h in it.

Ah, thanks! You mentioned that before, but I didn't see it because my
laptop doesn't HAVE_FSYNC_WRITETHROUGH. Fixed now.

Attached is a new version of the first, refactoring-only patch. Notable
changes since last version:

* Fixed XLogSaveBufferForHint. It didn't initialize BkpBlock struct,
rendering it completely broken.

* Moved XLogSaveBufferForHint, log_newpage and log_newpage_buffer to
xloginsert.c. XLogSaveBufferForHint needs to share the XLogFillBkpBlock
function with the rest of the code in xloginsert.c, so this allows
keeping it static. And it makes sense anyway - those are high-level
functions, comparable to XLogInsert, not low-level stuff related to
managing WAL itself.

Barring objections, I'll commit this, and then continue benchmarking the
second patch with the WAL format and API changes.

- Heikki

Attachment	Content-Type	Size
0001-Move-the-backup-block-logic-from-XLogInsert-to-a-new.patch	text/x-diff	102.0 KB

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-10-06 11:29:26
Message-ID:	20141006112926.GC26194@awork2.anarazel.de
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 2014-10-06 14:19:39 +0300, Heikki Linnakangas wrote:
> On 10/06/2014 04:42 AM, Michael Paquier wrote:
> >On Fri, Oct 3, 2014 at 9:51 PM, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
> >wrote:
> >>So I now have a refactoring patch ready that I'd like to commit (the
> >attached two patches together), but to be honest, I have no idea why the
> >second patch is so essential to performance.
> >Thanks. I did some more tests with master, master+patch1, master+patch1+CRC
> >refactoring, but I am not able to see any performance difference with
> >pgbench (--no-vacuum, -t) and the test suite you provided, just some noise
> >that barely changed performance.
>
> Thanks for the confirmation. I'm really going crazy with benchmarking this.
> Sometimes I see a big difference, the next day it's gone.

A usual suspect for this is turbo mode and power control. It often
already helps to disable the latter to get much more reproducible
benchmark results.

> Barring objections, I'll commit this, and then continue benchmarking the
> second patch with the WAL format and API changes.

I'd like to have a look at it beforehand. I've not yet really looked,
but on a quick readthrough XLogInsertRecData() staying in xlog.c doesn't
make me happy...

Greetings,

Andres Freund

From:	Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	Andres Freund <andres(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-10-06 12:35:53
Message-ID:	CAB7nPqSJ0OLCjKUMNn5kMLpE8uNH4KO4t41otP=fv6sMkR7Y+w@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, Oct 6, 2014 at 8:19 PM, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
wrote:

> On 10/06/2014 04:42 AM, Michael Paquier wrote:
>
>> On Fri, Oct 3, 2014 at 9:51 PM, Heikki Linnakangas <
>> hlinnakangas(at)vmware(dot)com>
>> wrote:
>>
>>> So I now have a refactoring patch ready that I'd like to commit (the
>>>
>> attached two patches together), but to be honest, I have no idea why the
>> second patch is so essential to performance.
>> Thanks. I did some more tests with master, master+patch1,
>> master+patch1+CRC
>> refactoring, but I am not able to see any performance difference with
>> pgbench (--no-vacuum, -t) and the test suite you provided, just some noise
>> that barely changed performance.
>>
>
> Thanks for the confirmation. I'm really going crazy with benchmarking
> this. Sometimes I see a big difference, the next day it's gone.
>
The benchmark paradigms.

> * Fixed XLogSaveBufferForHint. It didn't initialize BkpBlock struct,
> rendering it completely broken.
>
Note for other reviewers: that's represented by this addition in
XLogSaveBufferForHint:
+ /* Make a BkpBlock struct representing the buffer */
+ XLogFillBkpBlock(buffer, buffer_std, &bkpb)

Regards,
--
Michael

From:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-10-29 08:24:20
Message-ID:	5450A434.3000505@vmware.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 10/06/2014 02:29 PM, Andres Freund wrote:
> On 2014-10-06 14:19:39 +0300, Heikki Linnakangas wrote:
>> Barring objections, I'll commit this, and then continue benchmarking the
>> second patch with the WAL format and API changes.
>
> I'd like to have a look at it beforehand.

Ping? Here's an rebased patch. I'd like to proceed with this.

> I've not yet really looked,
> but on a quick readthrough XLogInsertRecData() staying in xlog.c doesn't
> make me happy...

Ok.. Can you elaborate?

- Heikki

Attachment	Content-Type	Size
0001-Move-the-backup-block-logic-from-XLogInsert-to-a-new-2.patch	text/x-diff	97.8 KB

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-10-30 16:02:07
Message-ID:	20141030160207.GA12193@alap3.anarazel.de
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 2014-10-29 10:24:20 +0200, Heikki Linnakangas wrote:
> On 10/06/2014 02:29 PM, Andres Freund wrote:
> >On 2014-10-06 14:19:39 +0300, Heikki Linnakangas wrote:
> >>Barring objections, I'll commit this, and then continue benchmarking the
> >>second patch with the WAL format and API changes.
> >
> >I'd like to have a look at it beforehand.
>
> Ping? Here's an rebased patch. I'd like to proceed with this.

Doing so.

> >I've not yet really looked,
> >but on a quick readthrough XLogInsertRecData() staying in xlog.c doesn't
> >make me happy...
>
> Ok.. Can you elaborate?

To me the split between xloginsert.c doing some of the record assembly,
and xlog.c doing the lower level part of the assembly is just wrong.

And it leads to things like the already complicated logic to retry after
detecting missing fpws is now split across two files seems to confirm
that. What happens right now is that XLogInsert() (with some helper
routines) assembles the record. Then hands that off to
XLogInsertRecData(). Which sometimes returns InvalidXLogRecPtr, and
returns back to XLogInsert(), which reverses *some* of its work and then
retries. Brr.

Similary the 'page_writes_omitted' logic doesn't make me particularly
happy. Previously we retried when there actually was a page affected by
the different RedoRecPtr. Now we do it as soon as our copy of RedoRecPtr
is out of date? Won't that quite often spuriously trigger retries? Am I
missing something?
Arguably this doesn't happen often enough to matter, but it's still
something that we should explicitly discuss.

The implementation of the split seems to change the meaning of
TRACE_POSTGRESQL_XLOG_INSERT - it's now counting retries. I don't
particularly care about those tracepoints, but I see no simplification
due to changing its meaning.

Aside: In C11 function local statics become a bit more expensive - they
essentially require atomics and/or spinlocks on the first few calls.

Next thing: The patch doesn't actually compile. Misses an #include
storage/proc.h for MyPgXact.

I'm not a big fan of the naming for the new split. We have
* XLogInsert() which is the external interface
* XLogRecordAssemble() which builds the rdata chain that will be
inserted
* XLogInsertRecData() which actually inserts the data into the xl buffers.

To me that's pretty inconsistent.

I'm also somewhat confused about the new split of the headers. I'm not
adverse to splitting them, but it's getting a bit hard to understand:
* xlog.h - generic mishmash
* xlogdefs.h - Three typedefs and some #defines
* xlog_fn.h - SQL functions
* xlog_internal.h - Pretty random collection of stuff
* xlogutils.h - "Utilities for replaying WAL records"
* xlogreader.h - "generic XLog reading facility"
* xloginsert.h - "Functions for generating WAL records"
* xlogrecord.h - "Definitions for the WAL record format."

Isn't that a bit too much? Pretty much the only files that have a
recognizable separation to me are xlog_fn.h and xlogreader.h.

You've removed the
- *
- * NB: this routine feels free to scribble on the XLogRecData structs,
- * though not on the data they reference. This is OK since the
XLogRecData
- * structs are always just temporaries in the calling code.
comment, but we still do, no?

While not this patches blame, I see currently we finish the critical
section in XLogInsert/XLogInsertRecData pretty early:
END_CRIT_SECTION();

/*
* Update shared LogwrtRqst.Write, if we crossed page boundary.
*/
if (StartPos / XLOG_BLCKSZ != EndPos / XLOG_BLCKSZ)
{
SpinLockAcquire(&XLogCtl->info_lck);
/* advance global request to include new block(s) */
if (XLogCtl->LogwrtRqst.Write < EndPos)
XLogCtl->LogwrtRqst.Write = EndPos;
/* update local result copy while I have the chance */
LogwrtResult = XLogCtl->LogwrtResult;
SpinLockRelease(&XLogCtl->info_lck);
}
I don't think this is particularly critical, but it still looks wrong to me.

Hm. I think that's what I see for now.

Greetings,

Andres Freund

From:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-10-30 17:52:47
Message-ID:	54527AEF.20506@vmware.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 10/30/2014 06:02 PM, Andres Freund wrote:
> On 2014-10-29 10:24:20 +0200, Heikki Linnakangas wrote:
>> On 10/06/2014 02:29 PM, Andres Freund wrote:
>>> I've not yet really looked,
>>> but on a quick readthrough XLogInsertRecData() staying in xlog.c doesn't
>>> make me happy...
>>
>> Ok.. Can you elaborate?
>
> To me the split between xloginsert.c doing some of the record assembly,
> and xlog.c doing the lower level part of the assembly is just wrong.

Really? To me, that feels like a natural split. xloginsert.c is
responsible for constructing the final WAL record, with the backup
blocks and all. It doesn't access any shared memory (related to WAL; it
does look at Buffers, to determine what to back up). The function in
xlog.c inserts the assembled record to the WAL, as is. It handles the
locking and WAL buffer management related to that.

What would you suggest? I don't think putting everything in one
XLogInsert function, like we have today, is better. Note that the second
patch makes xloginsert.c a lot more complicated.

Modifying the chain that was already constructed is not pretty, but it's
not this patch's fault. I don't think splitting the functions makes that
any worse. (BTW, the follow-up patch to change the WAL format fixes
that; the per-buffer data is kept separately from the main data chain).

> Similary the 'page_writes_omitted' logic doesn't make me particularly
> happy. Previously we retried when there actually was a page affected by
> the different RedoRecPtr. Now we do it as soon as our copy of RedoRecPtr
> is out of date? Won't that quite often spuriously trigger retries? Am I
> missing something?
> Arguably this doesn't happen often enough to matter, but it's still
> something that we should explicitly discuss.

The situation where it leads to a spurious retry is when the constructed
WAL record skipped the FPW of a buffer, and RedoRecPtr was updated, but
the buffer stilll doesn't need to be FPW according to the updated
RedoRecPtr. That only happens if the same buffer was updated (by another
backend) after the new checkpoint began. I believe that's extremely rare.

We could make that more fine-grained, though. Instead of passing a
boolean 'fpw_sensitive' flag to XLogInsertRecData, pass the lowest LSN
among the pages whose FPW was skipped. If that's less than the new
RedoRecPtr, then retry is needed. That would eliminate the spurious retries.

> The implementation of the split seems to change the meaning of
> TRACE_POSTGRESQL_XLOG_INSERT - it's now counting retries. I don't
> particularly care about those tracepoints, but I see no simplification
> due to changing its meaning.

I wasn't sure what to do about that. I don't use tracepoints, and I
don't know what others use them for.

> I'm not a big fan of the naming for the new split. We have
> * XLogInsert() which is the external interface
> * XLogRecordAssemble() which builds the rdata chain that will be
> inserted
> * XLogInsertRecData() which actually inserts the data into the xl buffers.
>
> To me that's pretty inconsistent.

Got a better suggestion? I wanted to keep the name XLogInsert()
unchanged, to avoid code churn, and because it's still a good name for
that. XLogRecordAssemble is pretty descriptive for what it does,
although "XLogRecord" implies that it construct an XLogRecord struct. It
does fill in that too, but the main purpose is to build the XLogRecData
chain. Perhaps XLogAssembleRecord() would be better.

I'm not very happy with XLogInsertRecData myself. XLogInsertRecord?

> I'm also somewhat confused about the new split of the headers. I'm not
> adverse to splitting them, but it's getting a bit hard to understand:
> * xlog.h - generic mishmash
> * xlogdefs.h - Three typedefs and some #defines
> * xlog_fn.h - SQL functions
> * xlog_internal.h - Pretty random collection of stuff
> * xlogutils.h - "Utilities for replaying WAL records"
> * xlogreader.h - "generic XLog reading facility"
> * xloginsert.h - "Functions for generating WAL records"
> * xlogrecord.h - "Definitions for the WAL record format."
>
> Isn't that a bit too much? Pretty much the only files that have a
> recognizable separation to me are xlog_fn.h and xlogreader.h.

This is a matter of taste, I guess. I find xlog.h, xlogdefs.h and
xlog_internal.h fairly random, while the rest are have a clear function.
xlogutils.h is needed by redo functions, and xloginsert.h is needed for
generating WAL.

Note that the second patch adds more stuff to xloginsert.h and xlogrecord.h.

> You've removed the
> - *
> - * NB: this routine feels free to scribble on the XLogRecData structs,
> - * though not on the data they reference. This is OK since the
> XLogRecData
> - * structs are always just temporaries in the calling code.
> comment, but we still do, no?

True. XLogInsertRecData doesn't scribble on its input, but XLogInsert
still does. (this is moot after the second patch though)

> While not this patches blame, I see currently we finish the critical
> section in XLogInsert/XLogInsertRecData pretty early:
> END_CRIT_SECTION();
>
> /*
> * Update shared LogwrtRqst.Write, if we crossed page boundary.
> */
> if (StartPos / XLOG_BLCKSZ != EndPos / XLOG_BLCKSZ)
> {
> SpinLockAcquire(&XLogCtl->info_lck);
> /* advance global request to include new block(s) */
> if (XLogCtl->LogwrtRqst.Write < EndPos)
> XLogCtl->LogwrtRqst.Write = EndPos;
> /* update local result copy while I have the chance */
> LogwrtResult = XLogCtl->LogwrtResult;
> SpinLockRelease(&XLogCtl->info_lck);
> }
> I don't think this is particularly critical, but it still looks wrong to me.

Updating LogwrtRqst.Write is not critical. If it's not done for some
reason, everything still works. Keeping the critical section as small as
possible is a good idea, no?

> Hm. I think that's what I see for now.

Thanks!

- Heikki

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-10-30 19:19:09
Message-ID:	20141030191909.GB12193@alap3.anarazel.de
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi,

On 2014-09-15 15:41:22 +0300, Heikki Linnakangas wrote:
> The second patch contains the interesting changes.

Heikki's pushed the newest version of this to the git tree.

Some things I noticed while reading the patch:
* potential mismerge:
+++ b/src/bin/pg_basebackup/pg_receivexlog.c
@@ -408,7 +408,7 @@ main(int argc, char **argv)
}
}

- while ((c = getopt_long(argc, argv, "D:d:h:p:U:s:S:nF:wWv",
+ while ((c = getopt_long(argc, argv, "D:d:h:p:U:s:nF:wWv",
long_options, &option_index)) != -1)
{
switch (c)

* Can't you just have REGBUF_WILL_INIT include REGBUF_NO_IMAGE? Then you
don't need to test both.

* Is it really a good idea to separate XLogRegisterBufData() from
XLogRegisterBuffer() the way you've done it ? If we ever actually get
a record with a large numbers of blocks touched this issentially is
O(touched_buffers*data_entries).

* At least in Assert mode XLogRegisterBuffer() should ensure we're not
registering the same buffer twice. It's a pretty easy to make mistake
to do it twice for some kind of actions (think UPDATE), and the
consequences will be far from obvious afaics.

* There's lots of functions like XLogRecHasBlockRef() that dig through
the wal record. A common pattern is something like:
if (XLogRecHasBlockRef(record, 1))
XLogRecGetBlockTag(record, 1, NULL, NULL, &oldblk);
else
oldblk = newblk;

I think doing that repeatedly is quite a bad idea. We should parse the
record once and then use it in a sensible format. Not do it in pieces,
over and over again. It's not like we ignore backup blocks - so doing
this lazily doesn't make sense to me.
Especially as ValidXLogRecord() *already* has parsed the whole damn
thing once.

* Maybe XLogRegisterData() and XLogRegisterBufData() should accept void*
instead of char*? Right now there's lots of pointless casts in callers
needed that don't add any safety. The one additional line that's then
needed in these functions seems well worth it.

* There's a couple record types (e.g. XLOG_SMGR_TRUNCATE) that only
refer to the relation, but not to the block number. These still log
their rnode manually. Shouldn't we somehow deal with those in a
similar way explicit block references are dealt with?

* You've added an assert in RestoreBackupBlockContents() that ensures
the page isn't new. That wasn't there before, instead there was a
special case for it. I can't immediately think how that previously
could happen, but I do wonder why it was there.

* The following comment in +RestoreBackupBlock probably is two lines to
early
+ /* Found it, apply the update */
+ if (!(bkpb->fork_flags & BKPBLOCK_HAS_IMAGE))
+ return InvalidBuffer;
This new InvalidBuffer case probably could use a comment in either
XLogReadBufferForRedoExtended or here.

* Most of the commentary above RestoreBackupBlock() probably doesn't
belong there anymore.

* Maybe XLogRecordBlockData.fork_flags should be renamed? Maybe just
into flags_fork? Right now it makes at least me think that it's fork
specific flags, which really isn't the actual meaning.

* I wonder if it wouldn't be worthwile, for the benefit of the FPI
compression patch, to keep the bkp block data after *all* the
"headers". That'd make it easier to just compress the data.

* +InitXLogRecordConstruct() seems like a remainder of the
xlogconstruct.c naming. I'd just go for InitXLogInsert()

* Hm. At least WriteMZeroPageXlogRec (and probably the same for all the
other slru stuff) doesn't include a reference to the page. Isn't that
bad? Shouldn't we make XLogRegisterBlock() usable for that case?
Otherwise I fail to see how pg_rewind like tools can sanely deal with this?

* Why did you duplicate the identical cases in btree_desc()? I guess due
to rebasing ontop the xlogdump stats series?

* I haven't actually looked at the xlogdump output so I might be
completely wrong, but won't the output of e.g. updates be harder to
read now because it's not clear which of the references old/new
buffers are? Not that it matters that much.

* s/recdataend/recdata_end/? The former is somewhat hard to parse for
me.

* I think heap_xlog_update is buggy for wal_level=logical because it
computes the length of the tuple using
tuplen = recdataend - recdata;
But the old primary key/old tuple value might be stored there as
well. Afaics that code has to continue to use xl_heap_header_len.

* It looks to me like insert wal logging could just use REGBUF_KEEP_DATA
to get rid of:
+ /*
+ * The new tuple is normally stored as buffer 0's data. But if
+ * XLOG_HEAP_CONTAINS_NEW_TUPLE flag is set, it's part of the main
+ * data, after the xl_heap_insert struct.
+ */
+ if (xlrec->flags & XLOG_HEAP_CONTAINS_NEW_TUPLE)
+ {
+ data = XLogRecGetData(record) + SizeOfHeapInsert;
+ datalen = record->xl_len - SizeOfHeapInsert;
+ }
+ else
+ data = XLogRecGetBlockData(record, 0, &datalen);
or have I misunderstood how that works?

* spurious reindent
@@ -7306,9 +7120,9 @@ heap_xlog_visible(XLogRecPtr lsn, XLogRecord *record)
* XLOG record's LSN, we mustn't mark the page all-visible, because
* the subsequent update won't be replayed to clear the flag.
*/
- page = BufferGetPage(buffer);
- PageSetAllVisible(page);
- MarkBufferDirty(buffer);
+ page = BufferGetPage(buffer);
+ PageSetAllVisible(page);
+ MarkBufferDirty(buffer);
}

* Somewhat long line:
XLogRegisterBuffer(1, heap_buffer, XLogHintBitIsNeeded() ? REGBUF_STANDARD : (REGBUF_STANDARD | REGBUF_NO_IMAGE));

Man. This page is friggin huge. Now I'm tired ;)

This patch needs at the very least:
* Careful benchmarking of both primary and standbys
* Very careful review of the many changed XLogInsert/replay sites. I'd
be very surprised if there weren't bugs hidden somewhere. I've lost
the energy to read them all now.
* A good amount of consideration about the increase in WAL volume. I
think there's some cases where that's not inconsiderable.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	Andres Freund <andres(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-10-31 05:47:49
Message-ID:	CAB7nPqT7SaYuWSTuYeJgqYZnSLXCQ=Kv18J+Q-zx0BAN-KrRjQ@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, Oct 31, 2014 at 2:52 AM, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com
> wrote:

> On 10/30/2014 06:02 PM, Andres Freund wrote:
>
>> On 2014-10-29 10:24:20 +0200, Heikki Linnakangas wrote:
>>
>>> On 10/06/2014 02:29 PM, Andres Freund wrote:
>>>
>>>> I've not yet really looked,
>>>> but on a quick readthrough XLogInsertRecData() staying in xlog.c doesn't
>>>> make me happy...
>>>>
>>>
>>> Ok.. Can you elaborate?
>>>
>>
>> To me the split between xloginsert.c doing some of the record assembly,
>> and xlog.c doing the lower level part of the assembly is just wrong.
>>
>
> Really? To me, that feels like a natural split. xloginsert.c is
> responsible for constructing the final WAL record, with the backup blocks
> and all. It doesn't access any shared memory (related to WAL; it does look
> at Buffers, to determine what to back up). The function in xlog.c inserts
> the assembled record to the WAL, as is. It handles the locking and WAL
> buffer management related to that.
>
FWIW, I tend to the same opinion here.

What would you suggest? I don't think putting everything in one XLogInsert
> function, like we have today, is better. Note that the second patch makes
> xloginsert.c a lot more complicated.
>
I recall some time ago seeing complaints that xlog.c is too complicated and
should be refactored. Any effort in this area is a good thing IMO, and this
split made sense when I went through the code.

> I'm not a big fan of the naming for the new split. We have
>> * XLogInsert() which is the external interface
>> * XLogRecordAssemble() which builds the rdata chain that will be
>> inserted
>> * XLogInsertRecData() which actually inserts the data into the xl buffers.
>>
>> To me that's pretty inconsistent.
>>
>
> Got a better suggestion? I wanted to keep the name XLogInsert() unchanged,
> to avoid code churn, and because it's still a good name for that.
> XLogRecordAssemble is pretty descriptive for what it does, although
> "XLogRecord" implies that it construct an XLogRecord struct. It does fill
> in that too, but the main purpose is to build the XLogRecData chain.
> Perhaps XLogAssembleRecord() would be better.
>
> I'm not very happy with XLogInsertRecData myself. XLogInsertRecord?
>
+1 for XLogInsertRecord.
--
Michael

From:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-11-04 16:33:34
Message-ID:	5458FFDE.7040108@vmware.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 10/30/2014 06:02 PM, Andres Freund wrote:
> On 2014-10-29 10:24:20 +0200, Heikki Linnakangas wrote:
>> On 10/06/2014 02:29 PM, Andres Freund wrote:
>>> On 2014-10-06 14:19:39 +0300, Heikki Linnakangas wrote:
>>>> Barring objections, I'll commit this, and then continue benchmarking the
>>>> second patch with the WAL format and API changes.
>>>
>>> I'd like to have a look at it beforehand.
>>
>> Ping? Here's an rebased patch. I'd like to proceed with this.
>
> Doing so.

Here's a new version of this refactoring patch. It fixes all the
concrete issues you pointed out.

>>> I've not yet really looked,
>>> but on a quick readthrough XLogInsertRecData() staying in xlog.c doesn't
>>> make me happy...
>>
>> Ok.. Can you elaborate?
>
> To me the split between xloginsert.c doing some of the record assembly,
> and xlog.c doing the lower level part of the assembly is just wrong.

I moved the checks for bootstrap mode and xl_len == 0, from the current
XLogInsert to the new XLogInsert() function. I don't imagine that to be
enough address your feelings about the split between xlog.c and
xloginsert.c, but makes it a tiny bit clearer IMHO. I don't know what
else to do about it, as it feels very sensible to me as it is now. So
unless you object more loudly, I'm going to just go ahead with this and
commit, probably tomorrow.

> And it leads to things like the already complicated logic to retry after
> detecting missing fpws is now split across two files seems to confirm
> that. What happens right now is that XLogInsert() (with some helper
> routines) assembles the record. Then hands that off to
> XLogInsertRecData(). Which sometimes returns InvalidXLogRecPtr, and
> returns back to XLogInsert(), which reverses *some* of its work and then
> retries. Brr.
>
> Similary the 'page_writes_omitted' logic doesn't make me particularly
> happy. Previously we retried when there actually was a page affected by
> the different RedoRecPtr. Now we do it as soon as our copy of RedoRecPtr
> is out of date? Won't that quite often spuriously trigger retries? Am I
> missing something?
> Arguably this doesn't happen often enough to matter, but it's still
> something that we should explicitly discuss.

I replaced the boolean with an XLogRecPtr of the smallest LSN among the
pages that were modified but not backed up. That brings the behavior
back to the way it is currently; no more spurious retries. Thanks for
making me to think harder about that, I think this way is also clearer
than the boolean.

Fixed.

> Next thing: The patch doesn't actually compile. Misses an #include
> storage/proc.h for MyPgXact.

Fixed.

Renamed XLogInsertRecData() to XLogInsertRecord(). Not sure if it makes
it more consistent, but it sounds better anyway.

Fixed.

- Heikki

Attachment	Content-Type	Size
0001-Move-the-backup-block-logic-from-XLogInsert-to-a-new-3.patch	text/x-diff	100.5 KB

From:	Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	Andres Freund <andres(at)2ndquadrant(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-11-05 07:06:17
Message-ID:	CAA4eK1L7PyZeeTouUu6TF360GmB8ZFwKWEztoKxwrZ-gtbWQ2w@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, Nov 4, 2014 at 10:03 PM, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
wrote:
>
> On 10/30/2014 06:02 PM, Andres Freund wrote:
>>
>> On 2014-10-29 10:24:20 +0200, Heikki Linnakangas wrote:
>>>
>>> On 10/06/2014 02:29 PM, Andres Freund wrote:
>>>>
>>>> On 2014-10-06 14:19:39 +0300, Heikki Linnakangas wrote:
>>>>>
>>>>> Barring objections, I'll commit this, and then continue benchmarking
the
>>>>> second patch with the WAL format and API changes.
>>>>
>>>>
>>>> I'd like to have a look at it beforehand.
>>>
>>>
>>> Ping? Here's an rebased patch. I'd like to proceed with this.
>>
>>
>> Doing so.
>
>
> Here's a new version of this refactoring patch. It fixes all the concrete
issues you pointed out.
>
>>>> I've not yet really looked,
>>>> but on a quick readthrough XLogInsertRecData() staying in xlog.c
doesn't
>>>> make me happy...
>>>
>>>
>>> Ok.. Can you elaborate?
>>
>>
>> To me the split between xloginsert.c doing some of the record assembly,
>> and xlog.c doing the lower level part of the assembly is just wrong.
>
>
> I moved the checks for bootstrap mode and xl_len == 0, from the current
XLogInsert to the new XLogInsert() function. I don't imagine that to be
enough address your feelings about the split between xlog.c and
xloginsert.c, but makes it a tiny bit clearer IMHO. I don't know what else
to do about it, as it feels very sensible to me as it is now. So unless you
object more loudly, I'm going to just go ahead with this and commit,
probably tomorrow.
>

Few observations while reading the latest patch:

1.
+XLogRecPtr
+XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
{
..
+ /* info's high bits are reserved for use by me */
+ if (info & XLR_INFO_MASK)
+ elog(PANIC, "invalid xlog info mask %02X", info);
..
}

Earlier before this check, we use to check XLogInsertAllowed()
which is moved to XLogInsertRecord(), isn't it better to keep
the check in beginning of the function XLogInsert()?

2.
XLogRecPtr
XLogInsertRecord(XLogRecData *rdata, XLogRecPtr fpw_lsn)
{
..
if (fpw_lsn != InvalidXLogRecPtr && fpw_lsn <= RedoRecPtr && doPageWrites)
{
/*
* Oops, some buffer now needs to be backed up that the caller
* didn't back up. Start over.
*/
WALInsertLockRelease();
END_CRIT_SECTION();
return InvalidXLogRecPtr;
}
..
}

IIUC, there can be 4 states for doPageWrites w.r.t when record is getting
assembled (XLogRecordAssemble) and just before actual insert (
XLogInsertRecord)

I think the behaviour for the state when doPageWrites is true during
XLogRecordAssemble and false during XLogInsertRecord (just before
actual insert) is different as compare to HEAD.

In the patch, it will not retry if doPageWrites is false when we are
about to insert even though fpw_lsn <= RedoRecPtr whereas in HEAD it
will detect this during assembly of record and retry, isn't this a
problem?

3. There are couple of places where *XLogInsert* is used in wal.sgml
and it seems to me some of those needs change w.r.t this patch.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

From:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To:	Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc:	Andres Freund <andres(at)2ndquadrant(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-11-05 09:26:47
Message-ID:	5459ED57.1020204@vmware.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 11/05/2014 09:06 AM, Amit Kapila wrote:
> 1.
> +XLogRecPtr
> +XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
> {
> ..
> + /* info's high bits are reserved for use by me */
> + if (info & XLR_INFO_MASK)
> + elog(PANIC, "invalid xlog info mask %02X", info);
> ..
> }
>
> Earlier before this check, we use to check XLogInsertAllowed()
> which is moved to XLogInsertRecord(), isn't it better to keep
> the check in beginning of the function XLogInsert()?

Doesn't matter much, but I think it's cleanest the way I had it in the
latest patch. XLogInsert() is responsible for building a valid WAL
record to insert, and XLogInsertRecord() handles all the locking, buffer
management etc. to actually insert it. The high-bit check quoted above
is more related to building a valid record, so it belongs in
XLogInsert(), while the XLogInsertAllowed() check is more related to the
actual insertion of the record, so it belongs in XLogInsertRecord().

> 2.
> XLogRecPtr
> XLogInsertRecord(XLogRecData *rdata, XLogRecPtr fpw_lsn)
> {
> ..
> if (fpw_lsn != InvalidXLogRecPtr && fpw_lsn <= RedoRecPtr && doPageWrites)
> {
> /*
> * Oops, some buffer now needs to be backed up that the caller
> * didn't back up. Start over.
> */
> WALInsertLockRelease();
> END_CRIT_SECTION();
> return InvalidXLogRecPtr;
> }
> ..
> }
>
> IIUC, there can be 4 states for doPageWrites w.r.t when record is getting
> assembled (XLogRecordAssemble) and just before actual insert (
> XLogInsertRecord)
>
> I think the behaviour for the state when doPageWrites is true during
> XLogRecordAssemble and false during XLogInsertRecord (just before
> actual insert) is different as compare to HEAD.
>
> In the patch, it will not retry if doPageWrites is false when we are
> about to insert even though fpw_lsn <= RedoRecPtr whereas in HEAD it
> will detect this during assembly of record and retry, isn't this a
> problem?

So the scenario is that:

* XLogRecordAssemble decides that a page doesn't need to be backed up
* both RedoRecPtr and doPageWrites change while building the record.
doPageWrites goes from true to false.

Without the patch, we would retry, because first check RedoRecPtr has
changed. With the patch, we notice that even though RedoRecPtr has
changed, doPageWrites is now off, so no FPWs are required regardless of
RedoRecPtr, and not retry. Yeah, you're right, the behavior changes in
that case. However, the new behavior is correct; the retry is
unnecessary in that scenario.

> 3. There are couple of places where *XLogInsert* is used in wal.sgml
> and it seems to me some of those needs change w.r.t this patch.

Hmm. It's a bit strange to mention XLogInsert and XLogFlush functions by
name in that text. It's otherwise admin-level stuff, on how to set
WAL-related settings. Up to 9.2 that manual page actually called them
"LogInsert" and "LogFlush". But I guess you're right; if we are to
mention the function by name, it should say XLogInsertRecord.

- Heikki

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-11-05 09:30:19
Message-ID:	20141105093019.GT28295@alap3.anarazel.de
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 2014-11-04 18:33:34 +0200, Heikki Linnakangas wrote:
> On 10/30/2014 06:02 PM, Andres Freund wrote:
> >On 2014-10-29 10:24:20 +0200, Heikki Linnakangas wrote:
> >>On 10/06/2014 02:29 PM, Andres Freund wrote:
> >>>I've not yet really looked,
> >>>but on a quick readthrough XLogInsertRecData() staying in xlog.c doesn't
> >>>make me happy...
> >>
> >>Ok.. Can you elaborate?
> >
> >To me the split between xloginsert.c doing some of the record assembly,
> >and xlog.c doing the lower level part of the assembly is just wrong.
>
> I moved the checks for bootstrap mode and xl_len == 0, from the current
> XLogInsert to the new XLogInsert() function. I don't imagine that to be
> enough address your feelings about the split between xlog.c and
> xloginsert.c, but makes it a tiny bit clearer IMHO. I don't know what else
> to do about it, as it feels very sensible to me as it is now. So unless you
> object more loudly, I'm going to just go ahead with this and commit,
> probably tomorrow.

I'm still not particularly happy with the split. But I think this patch
is enough of a win anyhow. So go ahead.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-11-05 21:08:31
Message-ID:	545A91CF.7030000@vmware.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 10/30/2014 09:19 PM, Andres Freund wrote:
> Some things I noticed while reading the patch:

A lot of good comments, but let me pick up just two that are related:

> * There's a couple record types (e.g. XLOG_SMGR_TRUNCATE) that only
> refer to the relation, but not to the block number. These still log
> their rnode manually. Shouldn't we somehow deal with those in a
> similar way explicit block references are dealt with?
>
> * Hm. At least WriteMZeroPageXlogRec (and probably the same for all the
> other slru stuff) doesn't include a reference to the page. Isn't that
> bad? Shouldn't we make XLogRegisterBlock() usable for that case?
> Otherwise I fail to see how pg_rewind like tools can sanely deal with this?

Yeah, there are still operations that modify relation pages, but don't
store the information about the modified pages in the standard format.
That includes XLOG_SMGR_TRUNCATE that you spotted, and XLOG_SMGR_CREATE,
and also XLOG_DBASE_CREATE/DROP. And then there are updates to
non-relation files, like all the slru stuff, relcache init files, etc.
And updates to the FSM and VM bypass the full-page write mechanism too.

To play it safe, pg_rewind copies all non-relation files as is. That
includes all SLRUs, FSM and VM files, and everything else whose filename
doesn't match the (main fork of) a relation file. Of course, that's a
fair amount of copying to do, so we might want to optimize that in the
future, but I want to nail the relation files first. They are usually an
order of magnitude larger than the other files, after all.

Unfortunately pg_rewind still needs to recognize and parse the special
WAL records like XLOG_SMGR_CREATE/TRUNCATE, that modify relation files
outside the normal block registration system. I've been thinking that we
should add another flag to the WAL record format to mark such records.
pg_rewind will still need to understand the record format of such
records, but the flag will help to catch bugs of omission. If pg_rewind
or another such tool sees a record that's flagged as "special", but
doesn't recognize the record type, it can throw an error.

- Heikki

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-11-05 22:38:45
Message-ID:	20141105223845.GA28295@alap3.anarazel.de
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 2014-11-05 23:08:31 +0200, Heikki Linnakangas wrote:
> On 10/30/2014 09:19 PM, Andres Freund wrote:
> >Some things I noticed while reading the patch:
>
> A lot of good comments, but let me pick up just two that are related:
>
> >* There's a couple record types (e.g. XLOG_SMGR_TRUNCATE) that only
> > refer to the relation, but not to the block number. These still log
> > their rnode manually. Shouldn't we somehow deal with those in a
> > similar way explicit block references are dealt with?
> >
> >* Hm. At least WriteMZeroPageXlogRec (and probably the same for all the
> > other slru stuff) doesn't include a reference to the page. Isn't that
> > bad? Shouldn't we make XLogRegisterBlock() usable for that case?
> > Otherwise I fail to see how pg_rewind like tools can sanely deal with this?
>
> Yeah, there are still operations that modify relation pages, but don't store
> the information about the modified pages in the standard format. That
> includes XLOG_SMGR_TRUNCATE that you spotted, and XLOG_SMGR_CREATE, and also
> XLOG_DBASE_CREATE/DROP. And then there are updates to non-relation files,
> like all the slru stuff, relcache init files, etc. And updates to the FSM
> and VM bypass the full-page write mechanism too.

That's a awful number of special cases. I see little reason not to
invent something that can reference at least the relation in those
cases. Then we can at least only resync the relations if there were
changes. E.g. for vm's that not necessarily all that likely in an insert
mostly case.
I'm not that worried about things like create/drop database, but fsm,
vm, and the various slru's are really somewhat essential.

> To play it safe, pg_rewind copies all non-relation files as is. That
> includes all SLRUs, FSM and VM files, and everything else whose filename
> doesn't match the (main fork of) a relation file. Of course, that's a fair
> amount of copying to do, so we might want to optimize that in the future,
> but I want to nail the relation files first. They are usually an order of
> magnitude larger than the other files, after all.

That's fair enough.

> Unfortunately pg_rewind still needs to recognize and parse the special WAL
> records like XLOG_SMGR_CREATE/TRUNCATE, that modify relation files outside
> the normal block registration system. I've been thinking that we should add
> another flag to the WAL record format to mark such records.

So everytime pg_rewind comes across a record with that flag set which it
doesn't have special case code it'd balk? Sounds sensible.

Personally I think we should at least have a generic format to refer to
entire relations without a specific block number. And one to refer to
SLRUs. I don't think we necessarily need to implement them now, but we
should make sure there's bit space left to denote them.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	Andres Freund <andres(at)2ndquadrant(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-11-06 05:57:52
Message-ID:	CAA4eK1Kihr7mfe6HYFxc6G3e2so3+Xvd3gB+Xnwi6iaf7K5OCA@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, Nov 5, 2014 at 2:56 PM, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
wrote:
> On 11/05/2014 09:06 AM, Amit Kapila wrote:
>>
>> 2.
>> XLogRecPtr
>> XLogInsertRecord(XLogRecData *rdata, XLogRecPtr fpw_lsn)
>
> So the scenario is that:
>
> * XLogRecordAssemble decides that a page doesn't need to be backed up
> * both RedoRecPtr and doPageWrites change while building the record.
doPageWrites goes from true to false.
>
> Without the patch, we would retry, because first check RedoRecPtr has
changed. With the patch, we notice that even though RedoRecPtr has changed,
doPageWrites is now off, so no FPWs are required regardless of RedoRecPtr,
and not retry. Yeah, you're right, the behavior changes in that case.
However, the new behavior is correct; the retry is unnecessary in that
scenario.
>

How does it interact with backup, basically in stop backup we first
change forcePageWrite to false and then get the stop wal location
by inserting XLOG_BACKUP_END, so it seems to me that it is quite
possible that the record which backend is inserting using XLogInsert()
will be considered in backup. Now shouldn't this record contain FPW
if forcePageWrite was true when XLogInsert() started to avoid any torn
page taken in backup?

>> 3. There are couple of places where *XLogInsert* is used in wal.sgml
>> and it seems to me some of those needs change w.r.t this patch.
>
>
> Hmm. It's a bit strange to mention XLogInsert and XLogFlush functions by
name in that text. It's otherwise admin-level stuff, on how to set
WAL-related settings. Up to 9.2 that manual page actually called them
"LogInsert" and "LogFlush". But I guess you're right; if we are to mention
the function by name, it should say XLogInsertRecord.
>

Agreed.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

From:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To:	Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc:	Andres Freund <andres(at)2ndquadrant(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-11-06 11:23:28
Message-ID:	545B5A30.7070703@vmware.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 11/06/2014 07:57 AM, Amit Kapila wrote:
> On Wed, Nov 5, 2014 at 2:56 PM, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
> wrote:
>> On 11/05/2014 09:06 AM, Amit Kapila wrote:
>>>
>>> 2.
>>> XLogRecPtr
>>> XLogInsertRecord(XLogRecData *rdata, XLogRecPtr fpw_lsn)
>>
>> So the scenario is that:
>>
>> * XLogRecordAssemble decides that a page doesn't need to be backed up
>> * both RedoRecPtr and doPageWrites change while building the record.
>> doPageWrites goes from true to false.
>>
>> Without the patch, we would retry, because first check RedoRecPtr has
>> changed. With the patch, we notice that even though RedoRecPtr has changed,
>> doPageWrites is now off, so no FPWs are required regardless of RedoRecPtr,
>> and not retry. Yeah, you're right, the behavior changes in that case.
>> However, the new behavior is correct; the retry is unnecessary in that
>> scenario.
>
> How does it interact with backup, basically in stop backup we first
> change forcePageWrite to false and then get the stop wal location
> by inserting XLOG_BACKUP_END, so it seems to me that it is quite
> possible that the record which backend is inserting using XLogInsert()
> will be considered in backup. Now shouldn't this record contain FPW
> if forcePageWrite was true when XLogInsert() started to avoid any torn
> page taken in backup?

It doesn't matter what doPageWrites and RedoRecPtr were when XLogInsert
started. It's the state when the insertion actually happens that matters.

It's a bit weird that pg_stop_backup() first turns off forcePagesWrites,
and only then writes the WAL record. But it's OK because when
pg_stop_backup() is called, the backup process has already finished
copying all the blocks. Any pages that are written to disk between
turning forcePageWrites off and writing the WAL record cannot be torn.

- Heikki

From:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-11-06 12:12:14
Message-ID:	545B659E.6010509@vmware.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 11/05/2014 11:30 AM, Andres Freund wrote:
> On 2014-11-04 18:33:34 +0200, Heikki Linnakangas wrote:
>> On 10/30/2014 06:02 PM, Andres Freund wrote:
>>> On 2014-10-29 10:24:20 +0200, Heikki Linnakangas wrote:
>>>> On 10/06/2014 02:29 PM, Andres Freund wrote:
>>>>> I've not yet really looked,
>>>>> but on a quick readthrough XLogInsertRecData() staying in xlog.c doesn't
>>>>> make me happy...
>>>>
>>>> Ok.. Can you elaborate?
>>>
>>> To me the split between xloginsert.c doing some of the record assembly,
>>> and xlog.c doing the lower level part of the assembly is just wrong.
>>
>> I moved the checks for bootstrap mode and xl_len == 0, from the current
>> XLogInsert to the new XLogInsert() function. I don't imagine that to be
>> enough address your feelings about the split between xlog.c and
>> xloginsert.c, but makes it a tiny bit clearer IMHO. I don't know what else
>> to do about it, as it feels very sensible to me as it is now. So unless you
>> object more loudly, I'm going to just go ahead with this and commit,
>> probably tomorrow.
>
> I'm still not particularly happy with the split. But I think this patch
> is enough of a win anyhow. So go ahead.

Committed. Now, to the main patch...

- Heikki

From:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-11-06 15:32:33
Message-ID:	545B9491.6040409@vmware.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Replying to some of your comments below. The rest were trivial issues
that I'll just fix.

On 10/30/2014 09:19 PM, Andres Freund wrote:
> * Is it really a good idea to separate XLogRegisterBufData() from
> XLogRegisterBuffer() the way you've done it ? If we ever actually get
> a record with a large numbers of blocks touched this issentially is
> O(touched_buffers*data_entries).

Are you worried that the linear search in XLogRegisterBufData(), to find
the right registered_buffer struct, might be expensive? If that ever
becomes a problem, a simple fix would be to start the linear search from
the end, and make sure that when you touch a large number of blocks, you
do all the XLogRegisterBufData() calls right after the corresponding
XLogRegisterBuffer() call.

I've also though about having XLogRegisterBuffer() return the pointer to
the struct, and passing it as argument to XLogRegisterBufData. So the
pattern in WAL generating code would be like:

registered_buffer *buf0;

buf0 = XLogRegisterBuffer(0, REGBUF_STANDARD);
XLogRegisterBufData(buf0, data, length);

registered_buffer would be opaque to the callers. That would have
potential to turn XLogRegisterBufData into a macro or inline function,
to eliminate the function call overhead. I played with that a little
bit, but the difference in performance was so small that it didn't seem
worth it. But passing the registered_buffer pointer, like above, might
not be a bad thing anyway.

> * There's lots of functions like XLogRecHasBlockRef() that dig through
> the wal record. A common pattern is something like:
> if (XLogRecHasBlockRef(record, 1))
> XLogRecGetBlockTag(record, 1, NULL, NULL, &oldblk);
> else
> oldblk = newblk;
>
> I think doing that repeatedly is quite a bad idea. We should parse the
> record once and then use it in a sensible format. Not do it in pieces,
> over and over again. It's not like we ignore backup blocks - so doing
> this lazily doesn't make sense to me.
> Especially as ValidXLogRecord() *already* has parsed the whole damn
> thing once.

Hmm. Adding some kind of a parsed XLogRecord representation would need a
fair amount of new infrastructure. Vast majority of WAL records contain
one, maybe two, block references, so it's not that expensive to find the
right one, even if you do it several times.

I ran a quick performance test on WAL replay performance yesterday. I
ran pgbench for 1000000 transactions with WAL archiving enabled, and
measured the time it took to replay the generated WAL. I did that with
and without the patch, and I didn't see any big difference in replay
times. I also ran "perf" on the startup process, and the profiles looked
identical. I'll do more comprehensive testing later, with different
index types, but I'm convinced that there is no performance issue in
replay that we'd need to worry about.

If it mattered, a simple optimization to the above pattern would be to
have XLogRecGetBlockTag return true/false, indicating if the block
reference existed at all. So you'd do:

if (!XLogRecGetBlockTag(record, 1, NULL, NULL, &oldblk))
oldblk != newblk;

On the other hand, decomposing the WAL record into parts, and passing
the decomposed representation to the redo routines would allow us to
pack the WAL record format more tightly, as accessing the different
parts of the on-disk format wouldn't then need to be particularly fast.
For example, I've been thinking that it would be nice to get rid of the
alignment padding in XLogRecord, and between the per-buffer data
portions. We could copy the data to aligned addresses as part of the
decomposition or parsing of the WAL record, so that the redo routines
could still assume aligned access.

> * I wonder if it wouldn't be worthwile, for the benefit of the FPI
> compression patch, to keep the bkp block data after *all* the
> "headers". That'd make it easier to just compress the data.

Maybe. If we do that, I'd also be inclined to move all the bkp block
headers to the beginning of the WAL record, just after the XLogInsert
struct. Somehow it feels weird to have a bunch of header structs
sandwiched between the rmgr-data and per-buffer data. Also, 4-byte
alignment is enough for the XLogRecordBlockData struct, so that would be
an easy way to make use of the space currently wasted for alignment
padding in XLogRecord.

> * I think heap_xlog_update is buggy for wal_level=logical because it
> computes the length of the tuple using
> tuplen = recdataend - recdata;
> But the old primary key/old tuple value might be stored there as
> well. Afaics that code has to continue to use xl_heap_header_len.

No, the old primary key or tuple is stored in the main data portion.
That tuplen computation is done on backup block 0's data.

> * It looks to me like insert wal logging could just use REGBUF_KEEP_DATA
> to get rid of:
> + /*
> + * The new tuple is normally stored as buffer 0's data. But if
> + * XLOG_HEAP_CONTAINS_NEW_TUPLE flag is set, it's part of the main
> + * data, after the xl_heap_insert struct.
> + */
> + if (xlrec->flags & XLOG_HEAP_CONTAINS_NEW_TUPLE)
> + {
> + data = XLogRecGetData(record) + SizeOfHeapInsert;
> + datalen = record->xl_len - SizeOfHeapInsert;
> + }
> + else
> + data = XLogRecGetBlockData(record, 0, &datalen);
> or have I misunderstood how that works?

Ah, you're right. Actually, the code that writes the WAL record *does*
use REGBUF_KEEP_DATA. That was a bug in the redo routine, it should
always look into buffer 0's data.

- Heikki

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-11-09 21:47:55
Message-ID:	20141109214755.GB28007@alap3.anarazel.de
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 2014-11-06 17:32:33 +0200, Heikki Linnakangas wrote:
> Replying to some of your comments below. The rest were trivial issues that
> I'll just fix.
>
> On 10/30/2014 09:19 PM, Andres Freund wrote:
> >* Is it really a good idea to separate XLogRegisterBufData() from
> > XLogRegisterBuffer() the way you've done it ? If we ever actually get
> > a record with a large numbers of blocks touched this issentially is
> > O(touched_buffers*data_entries).
>
> Are you worried that the linear search in XLogRegisterBufData(), to find the
> right registered_buffer struct, might be expensive? If that ever becomes a
> problem, a simple fix would be to start the linear search from the end, and
> make sure that when you touch a large number of blocks, you do all the
> XLogRegisterBufData() calls right after the corresponding
> XLogRegisterBuffer() call.

Yes, that was what I was (mildly) worried about. Since you specified a
high limit of buffers I'm sure someone will come up with a use case for
it ;)

> I've also though about having XLogRegisterBuffer() return the pointer to the
> struct, and passing it as argument to XLogRegisterBufData. So the pattern in
> WAL generating code would be like:
>
> registered_buffer *buf0;
>
> buf0 = XLogRegisterBuffer(0, REGBUF_STANDARD);
> XLogRegisterBufData(buf0, data, length);
>
> registered_buffer would be opaque to the callers. That would have potential
> to turn XLogRegisterBufData into a macro or inline function, to eliminate
> the function call overhead. I played with that a little bit, but the
> difference in performance was so small that it didn't seem worth it. But
> passing the registered_buffer pointer, like above, might not be a bad thing
> anyway.

Yes, that was roughly what I was thinking of as well. It's not all that
pretty, but it generally does seem like a good idea to me anyay.

> >* There's lots of functions like XLogRecHasBlockRef() that dig through
> > the wal record. A common pattern is something like:
> >if (XLogRecHasBlockRef(record, 1))
> > XLogRecGetBlockTag(record, 1, NULL, NULL, &oldblk);
> >else
> > oldblk = newblk;
> >
> > I think doing that repeatedly is quite a bad idea. We should parse the
> > record once and then use it in a sensible format. Not do it in pieces,
> > over and over again. It's not like we ignore backup blocks - so doing
> > this lazily doesn't make sense to me.
> > Especially as ValidXLogRecord() *already* has parsed the whole damn
> > thing once.
>
> Hmm. Adding some kind of a parsed XLogRecord representation would need a
> fair amount of new infrastructure.

True.

> Vast majority of WAL records contain one,
> maybe two, block references, so it's not that expensive to find the right
> one, even if you do it several times.

I'm not convinced. It's not an infrequent thing these days to hear
people being bottlenecked by replay. And grovelling repeatedly through
larger records isn't *that* cheap.

> I ran a quick performance test on WAL replay performance yesterday. I ran
> pgbench for 1000000 transactions with WAL archiving enabled, and measured
> the time it took to replay the generated WAL. I did that with and without
> the patch, and I didn't see any big difference in replay times. I also ran
> "perf" on the startup process, and the profiles looked identical. I'll do
> more comprehensive testing later, with different index types, but I'm
> convinced that there is no performance issue in replay that we'd need to
> worry about.

Interesting. What checkpoint_segments/timeout and what scale did you
use? Since that heavily influences the average size of the record that's
quite relevant...

> If it mattered, a simple optimization to the above pattern would be to have
> XLogRecGetBlockTag return true/false, indicating if the block reference
> existed at all. So you'd do:
>
> if (!XLogRecGetBlockTag(record, 1, NULL, NULL, &oldblk))
> oldblk != newblk;

That sounds like a good idea anyway?

> On the other hand, decomposing the WAL record into parts, and passing the
> decomposed representation to the redo routines would allow us to pack the
> WAL record format more tightly, as accessing the different parts of the
> on-disk format wouldn't then need to be particularly fast. For example, I've
> been thinking that it would be nice to get rid of the alignment padding in
> XLogRecord, and between the per-buffer data portions. We could copy the data
> to aligned addresses as part of the decomposition or parsing of the WAL
> record, so that the redo routines could still assume aligned access.

Right. I think it'd generally give us a bit more flexibility.

> >* I wonder if it wouldn't be worthwile, for the benefit of the FPI
> > compression patch, to keep the bkp block data after *all* the
> > "headers". That'd make it easier to just compress the data.
>
> Maybe. If we do that, I'd also be inclined to move all the bkp block headers
> to the beginning of the WAL record, just after the XLogInsert struct.

> Somehow it feels weird to have a bunch of header structs sandwiched between
> the rmgr-data and per-buffer data. Also, 4-byte alignment is enough for the
> XLogRecordBlockData struct, so that would be an easy way to make use of the
> space currently wasted for alignment padding in XLogRecord.

Hm, ok. Don't really care ;)

> >* I think heap_xlog_update is buggy for wal_level=logical because it
> > computes the length of the tuple using
> > tuplen = recdataend - recdata;
> > But the old primary key/old tuple value might be stored there as
> > well. Afaics that code has to continue to use xl_heap_header_len.
>
> No, the old primary key or tuple is stored in the main data portion. That
> tuplen computation is done on backup block 0's data.

Ah. Still living in the "old world" apparently ;). Will look at it again.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-11-10 19:29:13
Message-ID:	54611209.705@vmware.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Attached is a new version. It's rebased on current git master, including
BRIN. I've also fixed the laundry list of small things you reported, as
well as a bunch of bugs I uncovered during my own testing.

Alvaro: you still have the BRIN WAL-logging code fresh in your memory,
so could you take a look at that part of this patch to check that I
didn't break it? And also, how do you like the new way of writing that,
over what you had to in HEAD ?

More below.

On 11/09/2014 11:47 PM, Andres Freund wrote:
> On 2014-11-06 17:32:33 +0200, Heikki Linnakangas wrote:
>> Replying to some of your comments below. The rest were trivial issues that
>> I'll just fix.
>>
>> On 10/30/2014 09:19 PM, Andres Freund wrote:
>>> * Is it really a good idea to separate XLogRegisterBufData() from
>>> XLogRegisterBuffer() the way you've done it ? If we ever actually get
>>> a record with a large numbers of blocks touched this issentially is
>>> O(touched_buffers*data_entries).
>>
>> Are you worried that the linear search in XLogRegisterBufData(), to find the
>> right registered_buffer struct, might be expensive? If that ever becomes a
>> problem, a simple fix would be to start the linear search from the end, and
>> make sure that when you touch a large number of blocks, you do all the
>> XLogRegisterBufData() calls right after the corresponding
>> XLogRegisterBuffer() call.
>
> Yes, that was what I was (mildly) worried about. Since you specified a
> high limit of buffers I'm sure someone will come up with a use case for
> it ;)
>
>> I've also though about having XLogRegisterBuffer() return the pointer to the
>> struct, and passing it as argument to XLogRegisterBufData. So the pattern in
>> WAL generating code would be like:
>>
>> registered_buffer *buf0;
>>
>> buf0 = XLogRegisterBuffer(0, REGBUF_STANDARD);
>> XLogRegisterBufData(buf0, data, length);
>>
>> registered_buffer would be opaque to the callers. That would have potential
>> to turn XLogRegisterBufData into a macro or inline function, to eliminate
>> the function call overhead. I played with that a little bit, but the
>> difference in performance was so small that it didn't seem worth it. But
>> passing the registered_buffer pointer, like above, might not be a bad thing
>> anyway.
>
> Yes, that was roughly what I was thinking of as well. It's not all that
> pretty, but it generally does seem like a good idea to me anyay.

I ended up doing something different: The block_id is now used as the
index into the registered_buffers array. So looking up the right struct
is now just &registered_buffers[block_id]. That obviously gets rid of
the linear search. It doesn't seem to make any difference in the quick
performance testing I've been doing.

>>> * There's lots of functions like XLogRecHasBlockRef() that dig through
>>> the wal record. A common pattern is something like:
>>> if (XLogRecHasBlockRef(record, 1))
>>> XLogRecGetBlockTag(record, 1, NULL, NULL, &oldblk);
>>> else
>>> oldblk = newblk;
>>>
>>> I think doing that repeatedly is quite a bad idea. We should parse the
>>> record once and then use it in a sensible format. Not do it in pieces,
>>> over and over again. It's not like we ignore backup blocks - so doing
>>> this lazily doesn't make sense to me.
>>> Especially as ValidXLogRecord() *already* has parsed the whole damn
>>> thing once.
>>
>> Hmm. Adding some kind of a parsed XLogRecord representation would need a
>> fair amount of new infrastructure.
>
> True.
>
>> Vast majority of WAL records contain one,
>> maybe two, block references, so it's not that expensive to find the right
>> one, even if you do it several times.
>
> I'm not convinced. It's not an infrequent thing these days to hear
> people being bottlenecked by replay. And grovelling repeatedly through
> larger records isn't *that* cheap.

I haven't done anything about this. We might want to do that, but I'd
like to get this committed now, with all the changes to the AM's, and
then spend some time trying to further optimize the WAL format. I might
add the "deparsed WAL record" infrastructure as part of that
optimization work.

>> I ran a quick performance test on WAL replay performance yesterday. I ran
>> pgbench for 1000000 transactions with WAL archiving enabled, and measured
>> the time it took to replay the generated WAL. I did that with and without
>> the patch, and I didn't see any big difference in replay times. I also ran
>> "perf" on the startup process, and the profiles looked identical. I'll do
>> more comprehensive testing later, with different index types, but I'm
>> convinced that there is no performance issue in replay that we'd need to
>> worry about.
>
> Interesting. What checkpoint_segments/timeout and what scale did you
> use? Since that heavily influences the average size of the record that's
> quite relevant...

Scale factor 5, checkpoint_segments=10, checkpoint_timeout=30s.
pg_xlogdump --stats says:

> Type N (%) Record size (%) FPI size (%) Combined size (%)
> ---- - --- ----------- --- -------- --- ------------- ---
> XLOG 1434 ( 0.01) 48912 ( 0.00) 10149668 ( 0.42) 10198580 ( 0.29)
> Transaction 4000767 ( 16.24) 176048028 ( 16.75) 0 ( 0.00) 176048028 ( 5.04)
> Storage 232 ( 0.00) 11136 ( 0.00) 0 ( 0.00) 11136 ( 0.00)
> CLOG 123 ( 0.00) 4428 ( 0.00) 0 ( 0.00) 4428 ( 0.00)
> Database 2 ( 0.00) 96 ( 0.00) 0 ( 0.00) 96 ( 0.00)
> Tablespace 0 ( 0.00) 0 ( 0.00) 0 ( 0.00) 0 ( 0.00)
> MultiXact 0 ( 0.00) 0 ( 0.00) 0 ( 0.00) 0 ( 0.00)
> RelMap 14 ( 0.00) 7784 ( 0.00) 0 ( 0.00) 7784 ( 0.00)
> Standby 94 ( 0.00) 6016 ( 0.00) 0 ( 0.00) 6016 ( 0.00)
> Heap2 3563327 ( 14.47) 142523126 ( 13.56) 1332675761 ( 54.52) 1475198887 ( 42.20)
> Heap 16880705 ( 68.54) 726291821 ( 69.09) 972350846 ( 39.78) 1698642667 ( 48.59)
> Btree 182603 ( 0.74) 6342386 ( 0.60) 129238796 ( 5.29) 135581182 ( 3.88)
> Hash 0 ( 0.00) 0 ( 0.00) 0 ( 0.00) 0 ( 0.00)
> Gin 0 ( 0.00) 0 ( 0.00) 0 ( 0.00) 0 ( 0.00)
> Gist 0 ( 0.00) 0 ( 0.00) 0 ( 0.00) 0 ( 0.00)
> Sequence 0 ( 0.00) 0 ( 0.00) 0 ( 0.00) 0 ( 0.00)
> SPGist 0 ( 0.00) 0 ( 0.00) 0 ( 0.00) 0 ( 0.00)
> BRIN 0 ( 0.00) 0 ( 0.00) 0 ( 0.00) 0 ( 0.00)
> -------- -------- -------- --------
> Total 24629301 1051283733 [30.07%] 2444415071 [69.93%] 3495698804 [100%]

So full-page writes indeed form most of the volume.

I run this again with full_page_writes=off. With that, the patched
version generated about 3% more WAL than master. Replay seems to be a
few percentage points slower, too, which can be explained by the
increased WAL volume.

Attached is a shell script I used to create the WAL.

My next steps for this are:

1. Do more correctness testing of WAL replay routines. I've spent some
time looking at the coverage report generated by "make coverage-html",
after running "make installcheck" on a master-server setup, and crafting
test cases to hit the redo routines that are not otherwise covered by
our regression tests. That has already uncovered a couple of bugs in the
patch that I've now fixed. (That's also how I found the bug I just fixed
in 1961b1c1)

2. Reduce the WAL volume. Any increase in WAL volume is painful on its
own, and also reduces performance in general. This seems to add 1-5%,
depending on the workload. There's some padding bytes the WAL format
that I could use. I've refrained from using them up for now, because I
consider that cheating: we could have kept the current WAL format and
used the padding bytes to make that even smaller/faster. But if we
accept that this patch is worth it, even if it leads to some WAL bloat,
we'll want to buy that back if we can so that users won't see a regression.

If I can find some ways to reduce the WAL volume, I might still commit
this patch more or less as it is first, and commit the optimizations
separately later. Depends on how what those optimizations look like.

- Heikki

Attachment	Content-Type	Size
wal-format-and-api-changes-7.patch.gz	application/gzip	92.0 KB
replay-perf-test.sh	application/x-shellscript	1.1 KB

From:	Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	Andres Freund <andres(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-11-11 07:39:14
Message-ID:	CAB7nPqQqhqu6eBJjf7Nd=5t0_iJbATr75d=T2QJH2QAsikB4Ug@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, Nov 11, 2014 at 4:29 AM, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com
> wrote:

> Attached is a new version. It's rebased on current git master, including
> BRIN. I've also fixed the laundry list of small things you reported, as
> well as a bunch of bugs I uncovered during my own testing.

This patch needs a small rebase, it has been broken by a590f266 that fixed
WAL replay for brin indexes:
patching file src/backend/access/brin/brin_xlog.c
Hunk #2 FAILED at 42.
Hunk #3 FAILED at 91.
This will facilitate testing as well.
Regards
--
Michael

From:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To:	Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc:	Andres Freund <andres(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-11-11 08:45:56
Message-ID:	5461CCC4.2080209@vmware.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 11/11/2014 09:39 AM, Michael Paquier wrote:
> On Tue, Nov 11, 2014 at 4:29 AM, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com
>> wrote:
>
>> Attached is a new version. It's rebased on current git master, including
>> BRIN. I've also fixed the laundry list of small things you reported, as
>> well as a bunch of bugs I uncovered during my own testing.
>
> This patch needs a small rebase, it has been broken by a590f266 that fixed
> WAL replay for brin indexes:
> patching file src/backend/access/brin/brin_xlog.c
> Hunk #2 FAILED at 42.
> Hunk #3 FAILED at 91.
> This will facilitate testing as well.

Here's a rebased patch. No other changes.

- Heikki

Attachment	Content-Type	Size
wal-format-and-api-changes-8.patch.gz	application/gzip	92.1 KB

From:	Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-11-11 14:42:25
Message-ID:	CAA4eK1+b5140WcUQiYEBGVTc+aPQqB=ptCi4-ZPehEyjaH4P4g@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, Nov 11, 2014 at 2:15 PM, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
wrote:z
>
> On 11/11/2014 09:39 AM, Michael Paquier wrote:
>>
>> On Tue, Nov 11, 2014 at 4:29 AM, Heikki Linnakangas <
hlinnakangas(at)vmware(dot)com
>>>
>>> wrote:
>>
>>
>>> Attached is a new version. It's rebased on current git master, including
>>> BRIN. I've also fixed the laundry list of small things you reported, as
>>> well as a bunch of bugs I uncovered during my own testing.
>>
>>
>> This patch needs a small rebase, it has been broken by a590f266 that
fixed
>> WAL replay for brin indexes:
>> patching file src/backend/access/brin/brin_xlog.c
>> Hunk #2 FAILED at 42.
>> Hunk #3 FAILED at 91.
>> This will facilitate testing as well.
>
>
> Here's a rebased patch. No other changes.
>

I have done some performance testing of this patch using attached
script and data is as below:

Performance Data
----------------------------
IBM POWER-8 24 cores, 192 hardware threads
RAM = 492GB
checkpoint_segments=300
checkpoint_timeout =30min
full_page_writes = off

HEAD (commit - ae667f7) -

testname | wal_generated | duration
-----------------------------------------+---------------+------------------
two short fields, no change | 398510104 | 11.993057012558
two short fields, no change | 396984168 | 10.3508098125458
two short fields, no change | 396983624 | 10.4196150302887
two short fields, one changed | 437106016 | 10.650365114212
two short fields, one changed | 437102456 | 10.756795167923
two short fields, one changed | 437103000 | 10.6824090480804
two short fields, both changed | 438131520 | 11.1027519702911
two short fields, both changed | 437112248 | 11.0681071281433
two short fields, both changed | 437107168 | 11.0370399951935
one short and one long field, no change | 76044176 | 2.90529608726501
one short and one long field, no change | 76047432 | 2.86844515800476
one short and one long field, no change | 76042664 | 2.84641098976135
ten tiny fields, all changed | 478210904 | 13.1221458911896
ten tiny fields, all changed | 477244448 | 12.8017349243164
ten tiny fields, all changed | 477217832 | 12.9907081127167
hundred tiny fields, all changed | 180353400 | 6.03354096412659
hundred tiny fields, all changed | 180349536 | 5.99893379211426
hundred tiny fields, all changed | 180350064 | 6.00634717941284
hundred tiny fields, half changed | 180348480 | 6.03162407875061
hundred tiny fields, half changed | 180348440 | 6.08004903793335
hundred tiny fields, half changed | 180349056 | 6.03126788139343
hundred tiny fields, half nulled | 101369936 | 5.43424606323242
hundred tiny fields, half nulled | 101660160 | 5.06207084655762
hundred tiny fields, half nulled | 100119184 | 5.51814889907837
9 short and 1 long, short changed | 108142168 | 3.26260495185852
9 short and 1 long, short changed | 108138144 | 3.20250797271729
9 short and 1 long, short changed | 109761240 | 3.21728992462158
(27 rows)

Patch-

testname | wal_generated | duration
-----------------------------------------+---------------+------------------
two short fields, no change | 398692848 | 11.2538840770721
two short fields, no change | 396988648 | 11.4972231388092
two short fields, no change | 396988416 | 11.0026741027832
two short fields, one changed | 437104840 | 11.1911199092865
two short fields, one changed | 437103832 | 11.1138079166412
two short fields, one changed | 437101872 | 11.3141660690308
two short fields, both changed | 437133648 | 11.5836050510406
two short fields, both changed | 437106640 | 11.5675358772278
two short fields, both changed | 437104944 | 11.6147150993347
one short and one long field, no change | 77226464 | 3.01720094680786
one short and one long field, no change | 77645248 | 2.9660210609436
one short and one long field, no change | 76045256 | 3.03326010704041
ten tiny fields, all changed | 477396616 | 13.3963651657104
ten tiny fields, all changed | 477219840 | 13.3838939666748
ten tiny fields, all changed | 477223960 | 13.5736708641052
hundred tiny fields, all changed | 180742136 | 5.87386584281921
hundred tiny fields, all changed | 180398320 | 6.12072706222534
hundred tiny fields, all changed | 180354080 | 6.16246104240417
hundred tiny fields, half changed | 180351456 | 6.17317008972168
hundred tiny fields, half changed | 180348096 | 6.13790798187256
hundred tiny fields, half changed | 183369688 | 6.19142913818359
hundred tiny fields, half nulled | 100111888 | 5.55706000328064
hundred tiny fields, half nulled | 100111952 | 5.55145716667175
hundred tiny fields, half nulled | 100119424 | 5.53378510475159
9 short and 1 long, short changed | 116821904 | 3.42408394813538
9 short and 1 long, short changed | 116160920 | 3.37269878387451
9 short and 1 long, short changed | 116160552 | 3.40163683891296
(27 rows)

It seems to me that there is a regression of (4 ~ 8%) for small records,
refer two short fields tests.

2.
Compilation errors on windows

1>E:\WorkSpace\PostgreSQL\master\postgresql\src\include\access/nbtree.h(296):
error C2016: C requires that a struct or union has at least one member
1>E:\WorkSpace\PostgreSQL\master\postgresql\src\include\access/nbtree.h(303):
error C2016: C requires that a struct or union has at least one member
1> nbtree.c

They are due to following structure declarations.
typedef struct xl_btree_split_left
{
/*
* In the _L variants, next is the new item.
* (In the _R variants, the new item is one of the right page's tuples.)
*
* If level > 0, an IndexTuple representing the HIKEY of the left page
* follows. We don't need this on leaf pages, because it's the same as
* the leftmost key in
the new right page.
*/
} xl_btree_split_left;

typedef struct xl_btree_split_right
{
/*
* Last are the right page's tuples in the form used by _bt_restore_page.
*/
} xl_btree_split_right;

3.
Patch is failed to apply against Head:
patching file src/backend/access/gin/ginfast.c
Hunk #1 succeeded at 108 (offset 2 lines).
Hunk #2 FAILED at 215.
Hunk #3 succeeded at 276 (offset 3 lines).
Hunk #4 succeeded at 311 (offset 3 lines).
Hunk #5 succeeded at 324 (offset 3 lines).
Hunk #6 succeeded at 339 (offset 3 lines).
Hunk #7 succeeded at 348 (offset 3 lines).
Hunk #8 succeeded at 379 (offset 3 lines).
Hunk #9 succeeded at 396 (offset 3 lines).
Hunk #10 succeeded at 523 (offset 5 lines).
Hunk #11 succeeded at 550 (offset 5 lines).
Hunk #12 succeeded at 587 (offset 5 lines).
1 out of 12 hunks FAILED -- saving rejects to file
src/backend/access/gin/ginfast.c.rej

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment	Content-Type	Size
wal-update-testsuite.sh	application/x-sh	12.8 KB

From:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To:	Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc:	Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-11-13 13:33:44
Message-ID:	5464B338.8070805@vmware.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 11/11/2014 04:42 PM, Amit Kapila wrote:
> I have done some performance testing of this patch using attached
> script and data is as below:
>
> ...
>
> It seems to me that there is a regression of (4 ~ 8%) for small records,
> refer two short fields tests.

Thanks for the testing!

Here's a new version, with big changes again to the record format. Have
a look at xlogrecord.h for the details, but in a nutshell:

1. The overall format is now: XLogRecord, per-block headers, header for
main data portion, per-block data, main data.

2. I removed xl_len field from XLogRecord and rearranged the fields, to
shrink the XLogRecord struct from 32 to 24 bytes. (instead, there's a
new 2- or 5-byte header for the "main data", after the block headers).

3. No alignment padding. (the data chunks are copied to aligned buffers
at replay, so redo functions can still assume aligned access)

In quick testing, this new WAL format is somewhat more compact than the
9.4 format. That also seems to have more than bought back the
performance regression I saw earlier. Here are results from my laptop,
using the wal-update-testsuite.sh script:

master:

testname | wal_generated | duration

-----------------------------------------+---------------+------------------
two short fields, no change | 396982984 | 7.73713994026184
two short fields, no change | 398531152 | 7.72360110282898
two short fields, no change | 397228552 | 7.90237998962402
two short fields, one changed | 437108464 | 8.03014206886292
two short fields, one changed | 438368456 | 8.17672896385193
two short fields, one changed | 437105232 | 7.89896702766418
two short fields, both changed | 437100544 | 7.98763203620911
two short fields, both changed | 437107032 | 8.0971851348877
two short fields, both changed | 437105368 | 8.1279079914093
one short and one long field, no change | 76552752 | 2.47367906570435
one short and one long field, no change | 76043608 | 2.54243588447571
one short and one long field, no change | 76042576 | 2.6014678478241
ten tiny fields, all changed | 477221488 | 9.41646003723145
ten tiny fields, all changed | 477224080 | 9.37260103225708
ten tiny fields, all changed | 477220944 | 9.41951704025269
hundred tiny fields, all changed | 180889992 | 4.72576093673706
hundred tiny fields, all changed | 180348224 | 4.50496411323547
hundred tiny fields, all changed | 181347504 | 4.78004717826843
hundred tiny fields, half changed | 180379760 | 4.53589606285095
hundred tiny fields, half changed | 181773832 | 4.85075807571411
hundred tiny fields, half changed | 180348160 | 4.65349197387695
hundred tiny fields, half nulled | 100114832 | 3.70726609230042
hundred tiny fields, half nulled | 100116840 | 3.88224697113037
hundred tiny fields, half nulled | 100118848 | 4.00612688064575
9 short and 1 long, short changed | 108140640 | 2.63146805763245
9 short and 1 long, short changed | 108508784 | 2.76349496841431
9 short and 1 long, short changed | 108137144 | 2.79056811332703
(27 rows)

wal-format-and-api-changes-9.patch:

testname | wal_generated | duration

-----------------------------------------+---------------+------------------
two short fields, no change | 356865216 | 6.81889986991882
two short fields, no change | 356871304 | 7.0333080291748
two short fields, no change | 356869520 | 6.62423706054688
two short fields, one changed | 356867824 | 7.09969711303711
two short fields, one changed | 356866480 | 7.07576990127563
two short fields, one changed | 357987080 | 7.25394797325134
two short fields, both changed | 396996096 | 7.13484597206116
two short fields, both changed | 396990184 | 7.08063006401062
two short fields, both changed | 396987192 | 7.04641604423523
one short and one long field, no change | 70858376 | 2.2726149559021
one short and one long field, no change | 68024232 | 2.21982789039612
one short and one long field, no change | 69258192 | 2.4696249961853
ten tiny fields, all changed | 396987896 | 8.25723004341125
ten tiny fields, all changed | 396983768 | 8.24221706390381
ten tiny fields, all changed | 397012600 | 8.60816693305969
hundred tiny fields, all changed | 172327416 | 4.57576704025269
hundred tiny fields, all changed | 174669320 | 4.52080512046814
hundred tiny fields, all changed | 172696944 | 4.65672993659973
hundred tiny fields, half changed | 172323720 | 4.57278800010681
hundred tiny fields, half changed | 172330232 | 4.63164114952087
hundred tiny fields, half changed | 172326864 | 4.74219608306885
hundred tiny fields, half nulled | 85597408 | 3.78670310974121
hundred tiny fields, half nulled | 84742808 | 3.82968688011169
hundred tiny fields, half nulled | 84066936 | 3.86192607879639
9 short and 1 long, short changed | 100113080 | 2.54274320602417
9 short and 1 long, short changed | 100119440 | 2.4966151714325
9 short and 1 long, short changed | 100115960 | 2.63230085372925
(27 rows)

Aside from the WAL record format changes, this patch adds the "decoded
WAL record" infrastructure that we talked about with Andres. XLogReader
now has a new function, DecodeXLogRecord, which parses the block headers
etc. from the WAL record, and copies the data chunks to aligned buffers.
The redo routines are passed a pointer to the XLogReaderState, instead
of the plain XLogRecord, and the redo routines can use macros and
functions defined xlogreader.h to access the already-decoded WAL record.
The new WAL record format is difficult to parse in a piece-meal fashion,
so it really needs this separate decoding pass to be efficient.

Thoughts on this new WAL record format? I've attached the xlogrecord.h
file here separately for easy reading, if you want to take a quick look
at just that without applying the whole patch.

- Heikki

Attachment	Content-Type	Size
wal-format-and-api-changes-9.patch.gz	application/gzip	107.8 KB
xlogrecord.h	text/x-chdr	6.1 KB

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Abhijit Menon-Sen <ams(at)2ndQuadrant(dot)com>
Cc:	Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-11-13 15:04:45
Message-ID:	20141113150445.GB29112@alap3.anarazel.de
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 2014-11-13 15:33:44 +0200, Heikki Linnakangas wrote:
> Here's a new version, with big changes again to the record format. Have a
> look at xlogrecord.h for the details, but in a nutshell:
>
> 1. The overall format is now: XLogRecord, per-block headers, header for main
> data portion, per-block data, main data.
>
> 2. I removed xl_len field from XLogRecord and rearranged the fields, to
> shrink the XLogRecord struct from 32 to 24 bytes. (instead, there's a new 2-
> or 5-byte header for the "main data", after the block headers).
>
> 3. No alignment padding. (the data chunks are copied to aligned buffers at
> replay, so redo functions can still assume aligned access)

Whoa. That's a large amount of changes... Not bad at all.

> In quick testing, this new WAL format is somewhat more compact than the 9.4
> format.

Well, that's mixing different kinds of changes together to some
degree. Most of this could have been done independently as well. I'm not
generally objecting to doing that, but the space/performance comparison
isn't entirely fair. One could (I'm not!) argue that we should just do a
refactoring like you suggest above, without the additional block
information.

> That also seems to have more than bought back the performance
> regression I saw earlier.

I think we really need to do the performance test with a different CRC
implementation. So far all the tests in this thread seem to be
correlating pretty linearly to the size of the record.

> Here are results from my laptop, using the wal-update-testsuite.sh script:

Would be nice if that'd also compute the differences in wal_generated
and duration between two runs... :)

> Aside from the WAL record format changes, this patch adds the "decoded WAL
> record" infrastructure that we talked about with Andres.

Ah, cool.

> XLogReader now has
> a new function, DecodeXLogRecord, which parses the block headers etc. from
> the WAL record, and copies the data chunks to aligned buffers. The redo
> routines are passed a pointer to the XLogReaderState, instead of the plain
> XLogRecord, and the redo routines can use macros and functions defined
> xlogreader.h to access the already-decoded WAL record. The new WAL record
> format is difficult to parse in a piece-meal fashion, so it really needs
> this separate decoding pass to be efficient.

Hm. I'm not sure if there's really a point in exposing
DecodeXLogRecord() outside of xlogreader.c. In which case would a
xlogreader user not want to decode the record? I.e. couldn't we just
always do that and not bother exposing the function?

Sure, you could skip decoding if the record is of a "uninteresting"
rmgr, but since all validation but CRC now happens only during decoding
I'm not sure that's a good idea. You e.g. have a block

+ if (!DecodeXLogRecord(xlogreader_state, &errormsg))
+ {
+ fprintf(stderr, "error in WAL record at %X/%X: %s\n",
+ (uint32) (xlogreader_state->ReadRecPtr >> 32),
+ (uint32) xlogreader_state->ReadRecPtr,
+ errormsg);
+ /*
+ * Parsing the record failed, but it passed CRC check, so in
+ * theory we can continue reading.
+ */
+ continue;
+ }
+

I think that's quite the bad idea. The CRC is only a part of the error
detection, not its entirety.

> Thoughts on this new WAL record format? I've attached the xlogrecord.h file
> here separately for easy reading, if you want to take a quick look at just
> that without applying the whole patch.

After a quick skim I like it in general. There's some details I'd rather
change, but I think it's quite the step forward.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-11-14 07:23:46
Message-ID:	CAA4eK1+tAc5B66XzPbsvZE_9N4YkU61G=Qo5WN6NGQh7FwihXA@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, Nov 13, 2014 at 7:03 PM, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
wrote:
>
> On 11/11/2014 04:42 PM, Amit Kapila wrote:
>>
>> I have done some performance testing of this patch using attached
>> script and data is as below:
>>
>> ...
>>
>> It seems to me that there is a regression of (4 ~ 8%) for small records,
>> refer two short fields tests.
>
>
> Thanks for the testing!
>
>
> Thoughts on this new WAL record format? I've attached the xlogrecord.h
file here separately for easy reading, if you want to take a quick look at
just that without applying the whole patch.
>

Apart from changes in XLogRecord to remove xl_len and padding, the new
format for block headers seems to be quite succinct and then by making
data length as variable for actual record the overall WAL record size
becomes smaller. However the increase in unaligned size (as mentined by
you up-thread as 2 bytes) seems to be still remain same.
Is the difference in size is due to additional block reference id this patch
requires or something else?

In new record format the Record header always start at aligned
boundary, however the actual record data doesn't necessarily be
at aligned boundary, can this make difference as currently we copy
both of these separately?

Overall, I think this format is a net improvement.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

From:	Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-11-14 08:31:08
Message-ID:	CAB7nPqR-TUdK+-Uc5-Aodjpdxw3=-hjeV7y-1nKj+qo+O1WQ+A@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, Nov 13, 2014 at 10:33 PM, Heikki Linnakangas
<hlinnakangas(at)vmware(dot)com> wrote:
> In quick testing, this new WAL format is somewhat more compact than the 9.4
> format. That also seems to have more than bought back the performance
> regression I saw earlier. Here are results from my laptop, using the
> wal-update-testsuite.sh script:
> master:
> [results]
> (27 rows)
> wal-format-and-api-changes-9.patch:
> [results]
> (27 rows)
So based on your series of tests, you are saving 6% up to 10%.
That's... Really cool!

> Aside from the WAL record format changes, this patch adds the "decoded WAL
> record" infrastructure that we talked about with Andres. XLogReader now has
> a new function, DecodeXLogRecord, which parses the block headers etc. from
> the WAL record, and copies the data chunks to aligned buffers. The redo
> routines are passed a pointer to the XLogReaderState, instead of the plain
> XLogRecord, and the redo routines can use macros and functions defined
> xlogreader.h to access the already-decoded WAL record. The new WAL record
> format is difficult to parse in a piece-meal fashion, so it really needs
> this separate decoding pass to be efficient.
>
> Thoughts on this new WAL record format? I've attached the xlogrecord.h file
> here separately for easy reading, if you want to take a quick look at just
> that without applying the whole patch.
The new format is neat, and that's a lot of code.. Grouping together
the blocks of data is a good thing for the FPW compression patch as
well.

Note that this patch conflicts with the recent commits 81c4508 and
34402ae. installcheck-world is passing, and standbys are able to
replay records, at least without crashes.

Here are some more comments:
1) with assertions enabled this does not compile because of a small
typo in xlogreader.h here:
+#define XLogRecHasAnyBlockRefs(decoder) ((decoder)->max_block id >= 0)
2) In xlogreader.c, XLogRecGetBlockData should return char * but a
boolean is returned here:
+XLogRecGetBlockData(XLogReaderState *record, uint8 block_id, Size *len)
+{
+ DecodedBkpBlock *bkpb;
+
+ if (!record->blocks[block_id].in_use)
+ return false;
As no blocks are in use in this case this should be NULL.
3) pg_xlogdump does not seem to work:
$ pg_xlogdump 00000001000000000000000D
pg_xlogdump: FATAL: could not find a valid record after 0/D000000
4) A couple of NOT_USED blocks could be removed, no?
+#ifdef NOT_USED
BlockNumber leftChildBlkno = InvalidBlockNumber;
+#endif
5) Here why not using the 2nd block instead of the 3rd (@_bt_getroot)?
+ XLogBeginInsert();
+ XLogRegisterBuffer(0, rootbuf, REGBUF_WILL_INIT);
+ XLogRegisterBuffer(2, metabuf, REGBUF_WILL_INIT);
6) There are FIXME blocks:
+ // FIXME
+ //if ((bkpb->fork_flags & BKPBLOCK_WILL_INIT) != 0 &&
mode != RBM_ZERO)
+ // elog(PANIC, "block with WILL_INIT flag in WAL
record must be zeroed by redo routine");]
And that:
+ /* FIXME: translation? Although this shouldn't happen.. */
+ ereport(ERROR,
+ (errmsg("error decoding WAL record"),
+ errdetail("%s", errormsg)));
Regards,
--
Michael

From:	Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-11-15 05:34:13
Message-ID:	CAB7nPqSB=M-=_GmZR_J5hv1X0m1VwibnrSJWC9zJS5+gSLD57A@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, Nov 14, 2014 at 5:31 PM, Michael Paquier
<michael(dot)paquier(at)gmail(dot)com> wrote:
> 3) pg_xlogdump does not seem to work:
> $ pg_xlogdump 00000001000000000000000D
> pg_xlogdump: FATAL: could not find a valid record after 0/D000000
This one is a bad manipulation from my side. Please forget this comment.
--
Michael

From:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To:	Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc:	Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-11-17 11:11:05
Message-ID:	5469D7C9.2050402@vmware.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 11/14/2014 10:31 AM, Michael Paquier wrote:
> 5) Here why not using the 2nd block instead of the 3rd (@_bt_getroot)?
> + XLogBeginInsert();
> + XLogRegisterBuffer(0, rootbuf, REGBUF_WILL_INIT);
> + XLogRegisterBuffer(2, metabuf, REGBUF_WILL_INIT);

See the comment of the xl_btree_newroot struct. It explains the record
format of the BTREE_NEWROOT record type:

> * Backup Blk 0: new root page (2 tuples as payload, if splitting old root)
> * Backup Blk 1: left child (if splitting an old root)
> * Backup Blk 2: metapage

When _bt_getroot creates a new root, there is no old root, but the same
record type is used in _bt_newroot, which uses block ID 1 to refer to
the old root page.

(Thanks for the comments, again! I'll post a new version soon)

- Heikki

From:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>, Abhijit Menon-Sen <ams(at)2ndQuadrant(dot)com>
Cc:	Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-11-17 13:22:22
Message-ID:	5469F68E.3080300@vmware.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 11/13/2014 05:04 PM, Andres Freund wrote:
> On 2014-11-13 15:33:44 +0200, Heikki Linnakangas wrote:
>> In quick testing, this new WAL format is somewhat more compact than the 9.4
>> format.
>
> Well, that's mixing different kinds of changes together to some
> degree. Most of this could have been done independently as well. I'm not
> generally objecting to doing that, but the space/performance comparison
> isn't entirely fair. One could (I'm not!) argue that we should just do a
> refactoring like you suggest above, without the additional block
> information.

Yep. It would be interesting to just do the minimal changes to the
current record format, to get a similar reduction in WAL volume, and
then compare the performance and WAL volume of that vs. the big patch.
But I don't think it's really worth it; we do want the bigger changes,
anyway, even if it costs a little.

>> That also seems to have more than bought back the performance
>> regression I saw earlier.
>
> I think we really need to do the performance test with a different CRC
> implementation. So far all the tests in this thread seem to be
> correlating pretty linearly to the size of the record.

Will do.

>> XLogReader now has
>> a new function, DecodeXLogRecord, which parses the block headers etc. from
>> the WAL record, and copies the data chunks to aligned buffers. The redo
>> routines are passed a pointer to the XLogReaderState, instead of the plain
>> XLogRecord, and the redo routines can use macros and functions defined
>> xlogreader.h to access the already-decoded WAL record. The new WAL record
>> format is difficult to parse in a piece-meal fashion, so it really needs
>> this separate decoding pass to be efficient.
>
> Hm. I'm not sure if there's really a point in exposing
> DecodeXLogRecord() outside of xlogreader.c. In which case would a
> xlogreader user not want to decode the record? I.e. couldn't we just
> always do that and not bother exposing the function?
>
> Sure, you could skip decoding if the record is of a "uninteresting"
> rmgr, but since all validation but CRC now happens only during decoding
> I'm not sure that's a good idea.

Well, extracting all the backup block data etc. is not free, so it would
be nice to be able to filter first. I'm thinking of logical decoding in
particular; it could avoid parsing index AM records. Then again, the
parsing isn't very expensive anyway.

I'll change the patch to decode as part of the XLogReadRecord call for
now; it does things a little bit simpler. We can change it back later if
it seems worthwhile.

I nevertheless kept the DecodeXLogRecord function. It can be used to
decode an XLogRecord that's not read from disk, using the xlogreader,
but already in memory. That was needed to support WAL_DEBUG at
insertion; it needs to rm_desc() a record that's not previously read
from disk with an xlogreader.

>> Thoughts on this new WAL record format? I've attached the xlogrecord.h file
>> here separately for easy reading, if you want to take a quick look at just
>> that without applying the whole patch.
>
> After a quick skim I like it in general. There's some details I'd rather
> change, but I think it's quite the step forward.

Glad to hear that.

Here is an updated version of the patch. It's rebased over current
master, and fixes a bunch of issues you and Michael pointed out, and a
ton of other cleanup.

I'll do more comprehensive performance testing with this, and post results.

BTW, I'm planning to replace the malloc() calls in xlogreader.c with
palloc(). It currently uses malloc because xlogreader.c needs to be
usable in frontend code, but nowadays we map palloc automatically to
malloc when building frontend code. A consequence of that is that you
get a straight exit() on out-of-memory, but I think that's acceptable.

- Heikki

Attachment	Content-Type	Size
wal-format-and-api-changes-10.patch.gz	application/gzip	108.3 KB

From:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>, Abhijit Menon-Sen <ams(at)2ndQuadrant(dot)com>
Cc:	Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-11-17 13:38:13
Message-ID:	5469FA45.7090202@vmware.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 11/17/2014 03:22 PM, Heikki Linnakangas wrote:
> Here is an updated version of the patch. It's rebased over current
> master, and fixes a bunch of issues you and Michael pointed out, and a
> ton of other cleanup.

That patch had two extra, unrelated patches applied to it. One to use
the Intel CRC instruction for the CRC calculation, and another to speed
up PageRepairFragmentation, by replacing the pq_qsort() call with a
custom sort algorithm. Sorry about that. (the latter is something that
showed up very high in a perf profile of replaying the WAL generated by
pgbench - I'll start a new thread about that)

Here is the same patch without those extra things mixed in.

- Heikki

Attachment	Content-Type	Size
wal-format-and-api-changes-11.patch.gz	application/gzip	106.6 KB

From:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>, Abhijit Menon-Sen <ams(at)2ndQuadrant(dot)com>
Cc:	Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-11-17 19:31:41
Message-ID:	546A4D1D.4010207@vmware.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 11/13/2014 05:04 PM, Andres Freund wrote:
> On 2014-11-13 15:33:44 +0200, Heikki Linnakangas wrote:
>> That also seems to have more than bought back the performance
>> regression I saw earlier.
>
> I think we really need to do the performance test with a different CRC
> implementation. So far all the tests in this thread seem to be
> correlating pretty linearly to the size of the record.

Ok, I ran some tests comparing this patch, plus a quick & dirty patch to
use the Intel CRC instruction for CRC calculation. The quick & dirty is
attached. It's based on code you sent me over IM some weeks ago, and
needs -msse4.2 to compile. The point of the CRC patch is to just "hide"
the effect of simply reducing the WAL volume, to concentrate on the
possible overhead on constructing and replaying the new format.

WAL insertion performance
=========================

To measure the performance of generating WAL, I ran the
wal-update-testsuite.sh that Amit also ran earlier. The cluster was
initialized with:

shared_buffers=512MB
checkpoint_segments=30
fsync=off
autovacuum=off
full_page_writes=off

master + intel-crc.patch
------------------------

testname | wal_generated | duration

-----------------------------------------+---------------+------------------
two short fields, no change | 388523496 | 5.98275089263916
two short fields, no change | 384400536 | 5.81247496604919
two short fields, no change | 394318384 | 5.83137106895447
two short fields, one changed | 424519456 | 5.94383692741394
two short fields, one changed | 427103960 | 6.0239098072052
two short fields, one changed | 431826936 | 5.99587488174438
two short fields, both changed | 424524232 | 6.24820590019226
two short fields, both changed | 424520616 | 6.27542090415955
two short fields, both changed | 429219024 | 6.24310803413391
one short and one long field, no change | 78215672 | 1.69482898712158
one short and one long field, no change | 79275744 | 1.70809984207153
one short and one long field, no change | 76444464 | 1.6900041103363
ten tiny fields, all changed | 468263328 | 7.33222103118896
ten tiny fields, all changed | 475685120 | 7.44711399078369
ten tiny fields, all changed | 466329808 | 7.44984602928162
hundred tiny fields, all changed | 181802520 | 3.48464608192444
hundred tiny fields, all changed | 169941080 | 3.57416915893555
hundred tiny fields, all changed | 170178472 | 3.51270413398743
hundred tiny fields, half changed | 184539560 | 3.50752401351929
hundred tiny fields, half changed | 184539488 | 3.51076292991638
hundred tiny fields, half changed | 182090680 | 3.50708913803101
hundred tiny fields, half nulled | 104020856 | 3.43046092987061
hundred tiny fields, half nulled | 100607288 | 3.2797839641571
hundred tiny fields, half nulled | 97194600 | 3.26747107505798
9 short and 1 long, short changed | 99243976 | 1.94056487083435
9 short and 1 long, short changed | 95552912 | 1.94364786148071
9 short and 1 long, short changed | 100634736 | 1.94284915924072
(27 rows)

wal-format-and-api-changes-12.patch + crc-intel.patch
-----------------------------------------------------

testname | wal_generated | duration

-----------------------------------------+---------------+------------------
two short fields, no change | 344717720 | 6.01146483421326
two short fields, no change | 351999048 | 5.94603586196899
two short fields, no change | 359280088 | 5.92386198043823
two short fields, one changed | 349783736 | 5.98722791671753
two short fields, one changed | 357064632 | 5.95086097717285
two short fields, one changed | 347568640 | 5.92789196968079
two short fields, both changed | 388404496 | 5.9715678691864
two short fields, both changed | 389122792 | 5.99439883232117
two short fields, both changed | 389840936 | 5.98658204078674
one short and one long field, no change | 56844552 | 1.69335913658142
one short and one long field, no change | 67099160 | 1.66509008407593
one short and one long field, no change | 65557168 | 1.67773389816284
ten tiny fields, all changed | 387205416 | 7.0555100440979
ten tiny fields, all changed | 396668040 | 7.05817604064941
ten tiny fields, all changed | 389353592 | 6.98230600357056
hundred tiny fields, all changed | 163675416 | 3.4382529258728
hundred tiny fields, all changed | 173227296 | 3.4594509601593
hundred tiny fields, all changed | 172800520 | 3.44062900543213
hundred tiny fields, half changed | 174070216 | 3.51688480377197
hundred tiny fields, half changed | 169965336 | 3.47681999206543
hundred tiny fields, half changed | 175331128 | 3.45153713226318
hundred tiny fields, half nulled | 84784272 | 3.26745700836182
hundred tiny fields, half nulled | 83732928 | 3.22160506248474
hundred tiny fields, half nulled | 80339600 | 3.22326803207397
9 short and 1 long, short changed | 98087384 | 2.00620484352112
9 short and 1 long, short changed | 101273968 | 1.99480700492859
9 short and 1 long, short changed | 93123424 | 1.99506211280823
(27 rows)

Summary: No meaningful difference in runtime.

WAL replay performance
======================

To test WAL replay performance, I ran pgbench with WAL archiving
enabled, and timed the replay of the generated WAL. I used the attached
script, replay-perf-test.sh for that. full_page_writes were disabled,
because replaying full page images is quite different from replaying
other records. (Performance of full-page images is interesting too, but
it's not expected that these WAL format changes make much difference to
that).

(This is the same test setup I used in 54611209(dot)705(at)vmware(dot)com(dot))

I repeated the replay three times, and grepped for the "redo starts" and
"redo done" messages in the log. The results:

master + crc-intel.patch
------------------------

2014-11-17 20:40:10.175 EET LOG: redo starts at 0/2000090
2014-11-17 20:41:04.327 EET LOG: redo done at 0/7A0000C8

2014-11-17 20:42:46.319 EET LOG: redo starts at 0/2000090
2014-11-17 20:43:37.973 EET LOG: redo done at 0/7A0000C8

2014-11-17 20:44:14.889 EET LOG: redo starts at 0/2000090
2014-11-17 20:45:07.161 EET LOG: redo done at 0/7A0000C8

wal-format-and-api-changes-12.patch + crc-intel.patch
-----------------------------------------------------

2014-11-17 20:20:54.198 EET LOG: redo starts at 0/2000090
2014-11-17 20:21:49.073 EET LOG: redo done at 0/6A0000C8

2014-11-17 20:23:09.438 EET LOG: redo starts at 0/2000090
2014-11-17 20:24:04.744 EET LOG: redo done at 0/6A0000C8

2014-11-17 20:25:24.076 EET LOG: redo starts at 0/2000090
2014-11-17 20:26:18.830 EET LOG: redo done at 0/6A0000C8

In summary, there is no significant difference in replay performance.
The amount of WAL generated is much smaller with the patch.

For completeness, attached is the version of the patch that I used for
these tests. It's the same as patch version 11, but I fixed one trivial
bug in a b-tree redo function.

This concludes my performance testing, until someone wants to see some
other scenario being tested. I'm happy with the results.

- Heikki

Attachment	Content-Type	Size
replay-perf-test.sh	application/x-shellscript	1.1 KB
crc-intel.patch	text/x-diff	1.9 KB
wal-format-and-api-changes-12.patch.gz	application/gzip	106.6 KB

From:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>, Abhijit Menon-Sen <ams(at)2ndQuadrant(dot)com>
Cc:	Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-11-20 17:06:38
Message-ID:	546E1F9E.5000202@vmware.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

As you may have noticed, I committed this (after some more cleanup). Of
course, feel free to still review it, and please point out any issues
you may find.

- Heikki

From:	Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	Andres Freund <andres(at)2ndquadrant(dot)com>, Abhijit Menon-Sen <ams(at)2ndquadrant(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-11-21 03:58:59
Message-ID:	CAA4eK1Jr8dw0_oVBmzGtc7DGpJyj=zrKnhbOpbOyHDvzT+DpWQ@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, Nov 20, 2014 at 10:36 PM, Heikki Linnakangas <
hlinnakangas(at)vmware(dot)com> wrote:
>
> As you may have noticed, I committed this (after some more cleanup). Of
course, feel free to still review it, and please point out any issues you
may find.
>

Few minor observations:
1. Readme

void XLogResetInsertion(void)

Clear any currently registered data and buffers from the WAL record
construction workspace. This is only needed if you have already
called XLogBeginInsert(), but decide to not insert the record after all.

I think above sentence could be slightly rephrased as the this function is
also getting called at end of XLogInsert().

2.
shiftList()
{
..
XLogEnsureRecordSpace(data.ndeleted + 1, 0);
..
}

Shouldn't above function call to XLogEnsureRecordSpace() be done
under if (RelationNeedsWAL(rel))?

3.
XLogInsert(RmgrId rmid, uint8 info)
{
XLogRecPtr EndPos;

/* XLogBeginInsert() must have been called.
*/
if (!begininsert_called)
elog(ERROR, "XLogBeginInsert was not called");

As we are in critical section at this moment, so is it okay to have
elog(ERROR,). I think this can only happen due to some coding
mistake, but still not sure if elog(ERROR) is okay.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

From:	Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	Andres Freund <andres(at)2ndquadrant(dot)com>, Abhijit Menon-Sen <ams(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-11-21 07:19:20
Message-ID:	CAB7nPqRCTU8zN4FjYQEk2qVHMCw-vDK2i7XrVGZkVGmNs63FtQ@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, Nov 21, 2014 at 2:06 AM, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com
> wrote:

> As you may have noticed, I committed this (after some more cleanup). Of
> course, feel free to still review it, and please point out any issues you
> may find.
>

This comment on top of XLogRecordAssemble is not adapted as
page_writes_omitted does not exist. Perhaps this is a remnant of some older
version of the patch?
* If there are any registered buffers, and a full-page image was not taken
* of all them, *page_writes_omitted is set to true. This signals that the
* assembled record is only good for insertion on the assumption that the
* RedoRecPtr and doPageWrites values were up-to-date.
*/
--
Michael

From:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To:	Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc:	Andres Freund <andres(at)2ndquadrant(dot)com>, Abhijit Menon-Sen <ams(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-11-21 10:16:38
Message-ID:	546F1106.60802@vmware.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 11/21/2014 09:19 AM, Michael Paquier wrote:
> On Fri, Nov 21, 2014 at 2:06 AM, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com
>> wrote:
>
>> As you may have noticed, I committed this (after some more cleanup). Of
>> course, feel free to still review it, and please point out any issues you
>> may find.
>>
>
> This comment on top of XLogRecordAssemble is not adapted as
> page_writes_omitted does not exist. Perhaps this is a remnant of some older
> version of the patch?
> * If there are any registered buffers, and a full-page image was not taken
> * of all them, *page_writes_omitted is set to true. This signals that the
> * assembled record is only good for insertion on the assumption that the
> * RedoRecPtr and doPageWrites values were up-to-date.
> */

Yep, fixed thanks.
- Heikki

From:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To:	Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc:	Andres Freund <andres(at)2ndquadrant(dot)com>, Abhijit Menon-Sen <ams(at)2ndquadrant(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, "Alvaro Herrera" <alvherre(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: WAL format and API changes (9.5)
Date:	2014-11-21 13:17:51
Message-ID:	546F3B7F.7090007@vmware.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 11/21/2014 05:58 AM, Amit Kapila wrote:
> On Thu, Nov 20, 2014 at 10:36 PM, Heikki Linnakangas <
> hlinnakangas(at)vmware(dot)com> wrote:
>>
>> As you may have noticed, I committed this (after some more cleanup). Of
> course, feel free to still review it, and please point out any issues you
> may find.
>>
>
> Few minor observations:
> 1. Readme
>
> void XLogResetInsertion(void)
>
> Clear any currently registered data and buffers from the WAL record
> construction workspace. This is only needed if you have already
> called XLogBeginInsert(), but decide to not insert the record after all.
>
> I think above sentence could be slightly rephrased as the this function is
> also getting called at end of XLogInsert().

Well, yeah, but that's internal to the xloginsert machinery, not
relevant for someone learning about how to use it.

> 2.
> shiftList()
> {
> ..
> XLogEnsureRecordSpace(data.ndeleted + 1, 0);
> ..
> }
>
> Shouldn't above function call to XLogEnsureRecordSpace() be done
> under if (RelationNeedsWAL(rel))?

True, fixed. (It doesn't hurt to call it, but it's clearly unnecessary)

> 3.
> XLogInsert(RmgrId rmid, uint8 info)
> {
> XLogRecPtr EndPos;
>
> /* XLogBeginInsert() must have been called.
> */
> if (!begininsert_called)
> elog(ERROR, "XLogBeginInsert was not called");
>
> As we are in critical section at this moment, so is it okay to have
> elog(ERROR,). I think this can only happen due to some coding
> mistake, but still not sure if elog(ERROR) is okay.

In a critical section, the ERROR will be promoted automatically to a
PANIC. There isn't much else we can do if that happens; it is a coding
mistake as you say.

BTW, not all XLogInsert() calls are inside a critical section. All the
ones that modify data pages are, but there are others.

- Heikki