Re: Reducing size of WAL record headers

Lists: pgsql-hackers
From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Reducing size of WAL record headers
Date: 2013-01-09 20:36:05
Message-ID: CA+U5nMJKvGhBF0Zwvg0-fuLisXf+Okue7_9fxAShwmq2UBM0KA@mail.gmail.com

Overall, the WAL record is MAXALIGN'd, so with 8 byte alignment we
waste 4 bytes per record. Or put another way, if we could reduce
record header by 4 bytes, we would actually reduce it by 8 bytes per
record. So looking for ways to do that seems like a good idea.
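
The alignment arithmetic can be checked with a small sketch. It assumes an 8 byte maximum alignment and an unpadded header of 28 bytes (the figure implied by "waste 4 bytes" and a padded size of 32). The names sketch_maxalign and SKETCH_MAXIMUM_ALIGNOF are hypothetical stand-ins, not PostgreSQL's own definitions.

```c
#include <stddef.h>

/* Assumption: 8 byte maximum alignment, as in the message above. */
#define SKETCH_MAXIMUM_ALIGNOF 8

/* Round len up to the next multiple of the maximum alignment. */
static inline size_t
sketch_maxalign(size_t len)
{
    return (len + (SKETCH_MAXIMUM_ALIGNOF - 1))
        & ~((size_t) (SKETCH_MAXIMUM_ALIGNOF - 1));
}
```

Under these assumptions a 28 byte header pads to 32, while a 24 byte header stays at 24, so trimming 4 bytes of fields saves 8 bytes once padding is counted.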

The WAL record header starts with xl_tot_len, a 4 byte field. There is
also another field, xl_len. The difference is that xl_tot_len includes
the header, xl_len and any backup blocks. Since the header is fixed,
the only time xl_tot_len != SizeOfXLogRecord + xl_len is when we have
backup blocks.

We can re-arrange the record layout so that we remove xl_tot_len and
add another (maxaligned) 4 byte field (--> 8 bytes) after the record
header, xl_bkpblock_len that only exists if we have backup blocks.
This will then save 8 bytes from every record that doesn't have backup
blocks, and be the same as now with backup blocks.

The only problem is that we currently allow WAL records to be written
so that the header wraps across pages. This allows us to save space in
WAL when we have between 5 and 32 bytes spare at the end of a page. To
reduce the header size by 8 bytes we would need to ensure that the
whole header, which would now be 24 or 32 bytes, is all on one page.
My math tells me that would waste on average 12 bytes per page because
of the end-of-page wastage, but would gain 8 bytes per record when we
don't have backup blocks. My thinking is that the end of page loss
would be much reduced on average when we had backup blocks, so we
could ignore that case.

Assuming typically 100 records per page when we have no backup blocks,
this is a considerable upside. We would make gains on any page with 3
or more WAL records on it, so low downside even in worst cases. That
seems like a great break-even point for optimisation.
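
As a rough sketch of that trade-off, using only the average figures quoted above (12 bytes lost per page, 8 bytes gained per record without backup blocks). The function name and constants are hypothetical, and real pages will deviate from the average.

```c
/* Assumed averages taken from the message above. */
#define SKETCH_AVG_PAGE_LOSS   12   /* bytes wasted at end of page */
#define SKETCH_GAIN_PER_RECORD  8   /* bytes saved per record */

/* Net bytes saved on one WAL page holding `records` records. */
static inline long
sketch_net_saving_per_page(long records)
{
    return records * SKETCH_GAIN_PER_RECORD - SKETCH_AVG_PAGE_LOSS;
}
```

For the 100-records-per-page case this gives 100 * 8 - 12 = 788 bytes per page.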

Since we've changed the WAL format already this release, another
change seems OK. More to the point, we can remove backup blocks in the
common case without changing WAL format, so this might be the last
time we have the chance to make this change.

Forcing the XLogRecord header to be all on one page makes the format
more robust and simplifies the code that copes with header wrapping.

The format changes would mean that it's still possible to work out the
length of the WAL record precisely:
  = SizeOfXLogRecord + (HasBkpBlocks ? SizeOf(uint32) : 0) + xl_len
and so it would then be protected by the WAL record CRC.
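
A minimal sketch of that expression, transcribed literally. It assumes the reduced header is 24 bytes; SKETCH_SIZE_OF_XLOG_RECORD and sketch_record_length are hypothetical names. The backup block payload itself, whose length the proposed xl_bkpblock_len field would carry, is not added here, and the maxaligned 8 byte form of that field mentioned earlier is not modelled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: header size once xl_tot_len has been removed. */
#define SKETCH_SIZE_OF_XLOG_RECORD 24

/* Length per the expression above: header, optional uint32 field, rmgr data. */
static inline uint32_t
sketch_record_length(uint32_t xl_len, bool has_bkp_blocks)
{
    uint32_t extra = has_bkp_blocks ? (uint32_t) sizeof(uint32_t) : 0;

    return SKETCH_SIZE_OF_XLOG_RECORD + extra + xl_len;
}
```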

Thoughts?

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Reducing size of WAL record headers
Date: 2013-01-09 20:54:33
Message-ID: 50EDD909.9040401@vmware.com
Lists: pgsql-hackers

On 09.01.2013 22:36, Simon Riggs wrote:
> Overall, the WAL record is MAXALIGN'd, so with 8 byte alignment we
> waste 4 bytes per record. Or put another way, if we could reduce
> record header by 4 bytes, we would actually reduce it by 8 bytes per
> record. So looking for ways to do that seems like a good idea.

Agreed.

> The WAL record header starts with xl_tot_len, a 4 byte field. There is
> also another field, xl_len. The difference is that xl_tot_len includes
> the header, xl_len and any backup blocks. Since the header is fixed,
> the only time xl_tot_len != SizeOfXLogRecord + xl_len is when we have
> backup blocks.
>
> We can re-arrange the record layout so that we remove xl_tot_len and
> add another (maxaligned) 4 byte field (--> 8 bytes) after the record
> header, xl_bkpblock_len that only exists if we have backup blocks.
> This will then save 8 bytes from every record that doesn't have backup
> blocks, and be the same as now with backup blocks.

Here's a better idea:

Let's keep xl_tot_len as it is, but move xl_len at the very end of the
WAL record, after all the backup blocks. If there are no backup blocks,
xl_len is omitted. Seems more robust to keep xl_tot_len, so that you
require less math to figure out where one record ends and where the next
one begins.
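
A rough sketch of how a reader might recover xl_len under this layout. It assumes the header shrinks to 24 bytes once xl_len leaves it, and that the trailing xl_len has already been read from the end of the record. All names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: header size once xl_len is moved out of it. */
#define SKETCH_HEADER_SIZE 24

/*
 * Without backup blocks the record is just header + rmgr data, so xl_len
 * follows from xl_tot_len. With backup blocks, use the trailing copy.
 */
static inline uint32_t
sketch_derive_xl_len(uint32_t xl_tot_len, bool has_bkp_blocks,
                     uint32_t trailing_xl_len)
{
    if (!has_bkp_blocks)
        return xl_tot_len - SKETCH_HEADER_SIZE;
    return trailing_xl_len;
}
```

Finding the next record still needs only xl_tot_len, which is the robustness point made above.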

> Forcing the XLogRecord header to be all on one page makes the format
> more robust and simplifies the code that copes with header wrapping.

-1 on that. That would essentially revert the changes I made earlier.
The purpose of allowing the header to be wrapped was that you could
easily calculate ahead of time exactly how much space a WAL record
takes. My motivation for that was the XLogInsert scaling patch. Now, I
admit I haven't had a chance to work further on that patch, so we're not
gaining much from the format change at the moment. Nevertheless, I don't
want us to get back to the situation that you sometimes need to add
padding to the end of a WAL page.

My suggestion above to keep xl_tot_len and remove xl_len from XLogRecord
doesn't have a problem with crossing page boundaries.

- Heikki


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Simon Riggs <simon(at)2ndQuadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Reducing size of WAL record headers
Date: 2013-01-09 20:59:00
Message-ID: 20130109205900.GA8545@momjian.us
Lists: pgsql-hackers

On Wed, Jan 9, 2013 at 10:54:33PM +0200, Heikki Linnakangas wrote:
> On 09.01.2013 22:36, Simon Riggs wrote:
> >Overall, the WAL record is MAXALIGN'd, so with 8 byte alignment we
> >waste 4 bytes per record. Or put another way, if we could reduce
> >record header by 4 bytes, we would actually reduce it by 8 bytes per
> >record. So looking for ways to do that seems like a good idea.
>
> Agreed.
>
> >The WAL record header starts with xl_tot_len, a 4 byte field. There is
> >also another field, xl_len. The difference is that xl_tot_len includes
> >the header, xl_len and any backup blocks. Since the header is fixed,
> >the only time xl_tot_len != SizeOfXLogRecord + xl_len is when we have
> >backup blocks.
> >
> >We can re-arrange the record layout so that we remove xl_tot_len and
> >add another (maxaligned) 4 byte field (--> 8 bytes) after the record
> >header, xl_bkpblock_len that only exists if we have backup blocks.
> >This will then save 8 bytes from every record that doesn't have backup
> >blocks, and be the same as now with backup blocks.
>
> Here's a better idea:
>
> Let's keep xl_tot_len as it is, but move xl_len at the very end of
> the WAL record, after all the backup blocks. If there are no backup
> blocks, xl_len is omitted. Seems more robust to keep xl_tot_len, so
> that you require less math to figure out where one record ends and
> where the next one begins.

OK, crazy idea, but can we just record xl_len as a difference against
xl_tot_len, and shorten the xl_len field?

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Simon Riggs <simon(at)2ndQuadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Reducing size of WAL record headers
Date: 2013-01-09 21:02:20
Message-ID: 50EDDADC.4000702@vmware.com
Lists: pgsql-hackers

On 09.01.2013 22:59, Bruce Momjian wrote:
> On Wed, Jan 9, 2013 at 10:54:33PM +0200, Heikki Linnakangas wrote:
>> On 09.01.2013 22:36, Simon Riggs wrote:
>>> The WAL record header starts with xl_tot_len, a 4 byte field. There is
>>> also another field, xl_len. The difference is that xl_tot_len includes
>>> the header, xl_len and any backup blocks. Since the header is fixed,
>>> the only time xl_tot_len != SizeOfXLogRecord + xl_len is when we have
>>> backup blocks.
>>>
>>> We can re-arrange the record layout so that we remove xl_tot_len and
>>> add another (maxaligned) 4 byte field (--> 8 bytes) after the record
>>> header, xl_bkpblock_len that only exists if we have backup blocks.
>>> This will then save 8 bytes from every record that doesn't have backup
>>> blocks, and be the same as now with backup blocks.
>>
>> Here's a better idea:
>>
>> Let's keep xl_tot_len as it is, but move xl_len at the very end of
>> the WAL record, after all the backup blocks. If there are no backup
>> blocks, xl_len is omitted. Seems more robust to keep xl_tot_len, so
>> that you require less math to figure out where one record ends and
>> where the next one begins.
>
> OK, crazy idea, but can we just record xl_len as a difference against
> xl_tot_len, and shorten the xl_len field?

Hmm, so it would essentially be the length of all the backup blocks.
Perhaps rename it to xl_bkpblk_len.

However, that would cap the total size of backup blocks to 64k. Which
would not be enough with 32k BLCKSZ.
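
The overflow can be sketched as follows, assuming the shortened field is an unsigned 16-bit integer and that backup blocks are stored at full size (no space removed, per-block headers ignored). The function name is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if nblocks full-size blocks fit in a 16-bit length field. */
static inline bool
sketch_fits_in_uint16(uint32_t block_size, uint32_t nblocks)
{
    uint64_t total = (uint64_t) block_size * nblocks;

    return total <= UINT16_MAX;
}
```

Two full 32k blocks already total 65536 bytes, one more than a 16-bit field can hold.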

- Heikki


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Reducing size of WAL record headers
Date: 2013-01-09 21:15:16
Message-ID: CA+U5nMKF96EONecEj20M5buYk2-O0sc8H2wXcLbJytj=o3B=7w@mail.gmail.com
Lists: pgsql-hackers

On 9 January 2013 21:02, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com> wrote:

>> OK, crazy idea, but can we just record xl_len as a difference against
>> xl_tot_len, and shorten the xl_len field?
>
>
> Hmm, so it would essentially be the length of all the backup blocks. perhaps
> rename it to xl_bkpblk_len.
>
> However, that would cap the total size of backup blocks to 64k. Which would
> not be enough with 32k BLCKSZ.

Since that requires a recompile anyway, why not make XLogRecord
smaller only for 16k BLCKSZ or less?

Problem if we do that is that xl_len is used extensively in _redo
routines, so it's a much more invasive patch.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Reducing size of WAL record headers
Date: 2013-01-09 21:17:25
Message-ID: CA+U5nMJZxnS7zqtY2-ReRL62wHdUDXdQKWr-3feB50nP9An5KQ@mail.gmail.com
Lists: pgsql-hackers

On 9 January 2013 20:54, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com> wrote:

> Here's a better idea:
>
> Let's keep xl_tot_len as it is, but move xl_len at the very end of the WAL
> record, after all the backup blocks. If there are no backup blocks, xl_len
> is omitted. Seems more robust to keep xl_tot_len, so that you require less
> math to figure out where one record ends and where the next one begins.

OK, I avoided tampering with xl_len cos it's so widely used. Will look.

>> Forcing the XLogRecord header to be all on one page makes the format
>> more robust and simplifies the code that copes with header wrapping.

> -1 on that. That would essentially revert the changes I made earlier.

OK, idea dropped.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Reducing size of WAL record headers
Date: 2013-01-09 21:43:27
Message-ID: 20130109214327.GB8545@momjian.us
Lists: pgsql-hackers

On Wed, Jan 9, 2013 at 09:15:16PM +0000, Simon Riggs wrote:
> On 9 January 2013 21:02, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com> wrote:
>
> >> OK, crazy idea, but can we just record xl_len as a difference against
> >> xl_tot_len, and shorten the xl_len field?
> >
> >
> > Hmm, so it would essentially be the length of all the backup blocks. perhaps
> > rename it to xl_bkpblk_len.
> >
> > However, that would cap the total size of backup blocks to 64k. Which would
> > not be enough with 32k BLCKSZ.
>
> Since that requires a recompile anyway, why not make XLogRecord
> smaller only for 16k BLCKSZ or less?
>
> Problem if we do that is that xl_len is used extensively in _redo
> routines, so its a much more invasive patch.

I would just make it int16 on <=16k block size, and int32 on >16k
blocks.
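
One way this could look, as a sketch only: a compile-time switch on BLCKSZ. The type name sketch_bkpblk_len_t is hypothetical, unsigned types are assumed to match the 64k cap discussed earlier, and the 8192 default is there only so the fragment compiles on its own.

```c
#include <stdint.h>

/* Assumption: BLCKSZ is normally supplied by the build configuration. */
#ifndef BLCKSZ
#define BLCKSZ 8192
#endif

/* Narrow length field for small block sizes, wide one otherwise. */
#if BLCKSZ <= 16384
typedef uint16_t sketch_bkpblk_len_t;
#else
typedef uint32_t sketch_bkpblk_len_t;
#endif
```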

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Reducing size of WAL record headers
Date: 2013-01-09 22:06:49
Message-ID: 3413.1357769209@sss.pgh.pa.us
Lists: pgsql-hackers

Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> Overall, the WAL record is MAXALIGN'd, so with 8 byte alignment we
> waste 4 bytes per record. Or put another way, if we could reduce
> record header by 4 bytes, we would actually reduce it by 8 bytes per
> record. So looking for ways to do that seems like a good idea.

I think this is extremely premature, in view of the ongoing discussions
about shoehorning logical replication and other kinds of data into the
WAL stream. It seems quite likely that we'll end up eating some of
that padding space to support those features. So whacking a lot of code
around in service of squeezing the existing padding out could very
easily end up being wasted work, in fact counterproductive if it
degrades either code readability or robustness.

Let's wait till we see where the logical rep stuff ends up before we
worry about saving 4 bytes per WAL record.

regards, tom lane


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Simon Riggs <simon(at)2ndQuadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Reducing size of WAL record headers
Date: 2013-01-10 19:54:24
Message-ID: 20130110195424.GA4318@momjian.us
Lists: pgsql-hackers

On Wed, Jan 9, 2013 at 05:06:49PM -0500, Tom Lane wrote:
> Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> > Overall, the WAL record is MAXALIGN'd, so with 8 byte alignment we
> > waste 4 bytes per record. Or put another way, if we could reduce
> > record header by 4 bytes, we would actually reduce it by 8 bytes per
> > record. So looking for ways to do that seems like a good idea.
>
> I think this is extremely premature, in view of the ongoing discussions
> about shoehorning logical replication and other kinds of data into the
> WAL stream. It seems quite likely that we'll end up eating some of
> that padding space to support those features. So whacking a lot of code
> around in service of squeezing the existing padding out could very
> easily end up being wasted work, in fact counterproductive if it
> degrades either code readability or robustness.
>
> Let's wait till we see where the logical rep stuff ends up before we
> worry about saving 4 bytes per WAL record.

Well, we have wal_level to control the amount of WAL traffic. It is
hard to imagine we are going to want to ship logical WAL information by
default, so most people will not be using logical WAL and would see a
benefit from an optimized WAL stream?

What percentage is 8-bytes in a typical WAL record?

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Simon Riggs <simon(at)2ndQuadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Reducing size of WAL record headers
Date: 2013-01-10 20:13:20
Message-ID: 2846.1357848800@sss.pgh.pa.us
Lists: pgsql-hackers

Bruce Momjian <bruce(at)momjian(dot)us> writes:
> On Wed, Jan 9, 2013 at 05:06:49PM -0500, Tom Lane wrote:
>> Let's wait till we see where the logical rep stuff ends up before we
>> worry about saving 4 bytes per WAL record.

> Well, we have wal_level to control the amount of WAL traffic.

That's entirely irrelevant. The point here is that we'll need more bits
to identify what any particular record is, unless we make a decision
that we'll have physically separate streams for logical replication
info, which doesn't sound terribly attractive; and in any case no such
decision has been made yet, AFAIK.

regards, tom lane


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Reducing size of WAL record headers
Date: 2013-01-11 00:14:53
Message-ID: CA+U5nMK4ofQ7F1tcg1jtZvvxFvpnh3xjiuRO8vqprFU=sU8pBA@mail.gmail.com
Lists: pgsql-hackers

On 10 January 2013 20:13, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Bruce Momjian <bruce(at)momjian(dot)us> writes:
>> On Wed, Jan 9, 2013 at 05:06:49PM -0500, Tom Lane wrote:
>>> Let's wait till we see where the logical rep stuff ends up before we
>>> worry about saving 4 bytes per WAL record.
>
>> Well, we have wal_level to control the amount of WAL traffic.
>
> That's entirely irrelevant. The point here is that we'll need more bits
> to identify what any particular record is, unless we make a decision
> that we'll have physically separate streams for logical replication
> info, which doesn't sound terribly attractive; and in any case no such
> decision has been made yet, AFAIK.

You were right to say that this is less important than logical
replication. I don't need any more reason than that to stop talking
about it.

I have a patch for this, but as yet no way to submit it while at the
same time saying "put this at the back of the queue".

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Jim Nasby <jim(at)nasby(dot)net>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Bruce Momjian <bruce(at)momjian(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Reducing size of WAL record headers
Date: 2013-08-24 03:18:15
Message-ID: 521825F7.4000304@nasby.net
Lists: pgsql-hackers

On 1/10/13 6:14 PM, Simon Riggs wrote:
> On 10 January 2013 20:13, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Bruce Momjian <bruce(at)momjian(dot)us> writes:
>>> On Wed, Jan 9, 2013 at 05:06:49PM -0500, Tom Lane wrote:
>>>> Let's wait till we see where the logical rep stuff ends up before we
>>>> worry about saving 4 bytes per WAL record.
>>
>>> Well, we have wal_level to control the amount of WAL traffic.
>>
>> That's entirely irrelevant. The point here is that we'll need more bits
>> to identify what any particular record is, unless we make a decision
>> that we'll have physically separate streams for logical replication
>> info, which doesn't sound terribly attractive; and in any case no such
>> decision has been made yet, AFAIK.
>
> You were right to say that this is less important than logical
> replication. I don't need any more reason than that to stop talking
> about it.
>
> I have a patch for this, but as yet no way to submit it while at the
> same time saying "put this at the back of the queue".

Anything ever come of this?
--
Jim C. Nasby, Data Architect jim(at)nasby(dot)net
512.569.9461 (cell) http://jim.nasby.net