Re: WAL format changes

Lists: pgsql-hackers
From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgreSQL(dot)org>
Subject: WAL format changes
Date: 2012-06-14 21:01:42
Message-ID: 4FDA5136.6080206@enterprisedb.com

As I threatened earlier
(http://archives.postgresql.org/message-id/4FD0B1AB.3090405@enterprisedb.com),
here are three patches that change the WAL format. The goal is to change
the format so that when you're inserting a WAL record of a given size,
you know exactly how much space it requires in the WAL.

1. Use a 64-bit segment number, instead of the log/seg combination. And
don't waste the last segment on each logical 4 GB log file. The concept
of a "logical log file" is now completely gone. XLogRecPtr is unchanged,
but it should now be understood as a plain 64-bit value, just split into
two 32-bit integers for historical reasons. On disk, this means that
there will be log files ending in FF; those were skipped before.

2. Always include the xl_rem_len field, used for continuation records,
in the xlog page header. A continuation log record only contained that
one field; it is now included directly in the page header, so the concept
of a continuation record doesn't exist anymore. Because of alignment,
this wastes 4 bytes on every page that contains continued data from a
previous record, and 8 bytes on pages that don't. That's not very much,
and the next step will buy that back:

3. Allow WAL record header to be split across pages. Per Tom's
suggestion, move xl_tot_len to be the first field in XLogRecord, so that
even if the header is split, xl_tot_len is always on the first page.
xl_crc is moved to be the last field, and xl_prev is the second to last.
This has the advantage that you can calculate the CRC for all the other
fields before acquiring WALInsertLock. For xl_prev, you need to know
where exactly the record is inserted, so it's handy that it's the last
field before CRC. This patch doesn't try to take advantage of that,
however, and I'm not sure if that makes any difference once I finish the
patch to make XLogInsert scale better, which is the ultimate goal of all
this.
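
To make the field ordering in item 3 concrete, here is a minimal sketch, not the patch itself. The struct name, the helper names, and the plain bitwise CRC-32 are all assumptions for illustration (the real record CRC also covers the payload, which is ignored here). It shows that the CRC over every field preceding xl_prev can be computed early and then finished once xl_prev is known.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical header laid out in the order described above. */
typedef struct SketchXLogRecord
{
    uint32_t xl_tot_len;  /* first, so it always sits on the first page */
    uint32_t xl_xid;
    uint32_t xl_len;
    uint8_t  xl_info;
    uint8_t  xl_rmid;
    uint64_t xl_prev;     /* second to last: needs the insert position */
    uint32_t xl_crc;      /* last */
} SketchXLogRecord;

/* Illustrative bitwise CRC-32 update (reflected polynomial 0xEDB88320). */
static inline uint32_t
sketch_crc_update(uint32_t crc, const void *data, size_t len)
{
    const unsigned char *p = data;

    while (len-- > 0)
    {
        crc ^= *p++;
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return crc;
}

/* Phase 1: everything before xl_prev; possible before taking the lock. */
static inline uint32_t
sketch_crc_before_lock(const SketchXLogRecord *rec)
{
    return sketch_crc_update(0xFFFFFFFFu, rec,
                             offsetof(SketchXLogRecord, xl_prev));
}

/* Phase 2: fold in xl_prev once the insert position is known. */
static inline uint32_t
sketch_crc_finish(uint32_t partial, const SketchXLogRecord *rec)
{
    partial = sketch_crc_update(partial, &rec->xl_prev, sizeof(rec->xl_prev));
    return partial ^ 0xFFFFFFFFu;
}

/* Reference: a single pass over all bytes that precede xl_crc. */
static inline uint32_t
sketch_crc_one_shot(const SketchXLogRecord *rec)
{
    /* xl_crc must directly follow xl_prev for the split to line up. */
    assert(offsetof(SketchXLogRecord, xl_crc) ==
           offsetof(SketchXLogRecord, xl_prev) + sizeof(rec->xl_prev));
    return sketch_crc_update(0xFFFFFFFFu, rec,
                             offsetof(SketchXLogRecord, xl_crc)) ^ 0xFFFFFFFFu;
}
```

The header should be zeroed with memset before filling it in, so that padding bytes contribute a deterministic value to the checksum.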

Those are the three patches I'd like to get committed in this
commitfest. To see where all this is leading to, I've included a rough
WIP version of the XLogInsert scaling patch. This version is quite
different from the one I posted in spring, it takes advantage of the WAL
format changes, and I'm also experimenting with a different method of
tracking how far each WAL insertion has progressed. But more on that later.

(Note to self: remember to bump XLOG_PAGE_MAGIC)

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachment                                        Content-Type  Size
1-use-uint64-got-segno.patch                      text/x-diff   82.2 KB
2-move-continuation-record-to-page-header.patch   text/x-diff   5.1 KB
3-allow-wal-record-header-to-be-split.patch       text/x-diff   22.1 KB
4-WIP-xloginsert-scale.patch                      text/x-diff   86.5 KB

From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Subject: Re: WAL format changes
Date: 2012-06-14 21:52:04
Message-ID: 201206142352.04528.andres@2ndquadrant.com

On Thursday, June 14, 2012 11:01:42 PM Heikki Linnakangas wrote:
> As I threatened earlier
> (http://archives.postgresql.org/message-id/4FD0B1AB.3090405@enterprisedb.com),
> here are three patches that change the WAL format. The goal is to
> change the format so that when you're inserting a WAL record of a given
> size, you know exactly how much space it requires in the WAL.
I fear the patches need rebasing after the pgindent run... Even before that
(60801944fa105252b48ea5688d47dfc05c695042) it only applies with offsets?

Greetings,

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Subject: Re: WAL format changes
Date: 2012-06-14 21:58:12
Message-ID: 201206142358.12431.andres@2ndquadrant.com

On Thursday, June 14, 2012 11:01:42 PM Heikki Linnakangas wrote:
> As I threatened earlier
> (http://archives.postgresql.org/message-id/4FD0B1AB.3090405@enterprisedb.com),
> here are three patches that change the WAL format. The goal is to
> change the format so that when you're inserting a WAL record of a given
> size, you know exactly how much space it requires in the WAL.
>
> 1. Use a 64-bit segment number, instead of the log/seg combination. And
> don't waste the last segment on each logical 4 GB log file. The concept
> of a "logical log file" is now completely gone. XLogRecPtr is unchanged,
> but it should now be understood as a plain 64-bit value, just split into
> two 32-bit integers for historical reasons. On disk, this means that
> there will be log files ending in FF, those were skipped before.
What's the reason for keeping that awkward split now? There aren't that many
users of xlogid/xrecoff and many of those would be better served by using
helper macros.
API compatibility isn't a great argument either, as code manually playing
around with those needs to be checked anyway. I think there might be some code
around that does XLogRecPtr addition manually and such.

> 2. Always include the xl_rem_len field, used for continuation records,
> in the xlog page header. A continuation log record only contained that
> one field, it's now included straight in the page header, so the concept
> of a continuation record doesn't exist anymore. Because of alignment,
> this wastes 4 bytes on every page that contains continued data from a
> previous record, and 8 bytes on pages that don't. That's not very much,
> and the next step will buy that back:
>
> 3. Allow WAL record header to be split across pages. Per Tom's
> suggestion, move xl_tot_len to be the first field in XLogRecord, so that
> even if the header is split, xl_tot_len is always on the first page.
> xl_crc is moved to be the last field, and xl_prev is the second to last.
> This has the advantage that you can calculate the CRC for all the other
> fields before acquiring WALInsertLock. For xl_prev, you need to know
> where exactly the record is inserted, so it's handy that it's the last
> field before CRC. This patch doesn't try to take advantage of that,
> however, and I'm not sure if that makes any difference once I finish the
> patch to make XLogInsert scale better, which is the ultimate goal of all
> this.
>
> Those are the three patches I'd like to get committed in this
> commitfest. To see where all this is leading to, I've included a rough
> WIP version of the XLogInsert scaling patch. This version is quite
> different from the one I posted in spring, it takes advantage of the WAL
> format changes, and I'm also experimenting with a different method of
> tracking how far each WAL insertion has progressed. But more on that later.
>
> (Note to self: remember to bump XLOG_PAGE_MAGIC)
Will review.

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Subject: Re: WAL format changes
Date: 2012-06-18 18:00:11
Message-ID: CA+TgmobG9VWWvLXeK0XOxoWXtza8JTFgSu2eoAtVFucxH6gLBw@mail.gmail.com

On Thu, Jun 14, 2012 at 5:58 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> 1. Use a 64-bit segment number, instead of the log/seg combination. And
>> don't waste the last segment on each logical 4 GB log file. The concept
>> of a "logical log file" is now completely gone. XLogRecPtr is unchanged,
>> but it should now be understood as a plain 64-bit value, just split into
>> two 32-bit integers for historical reasons. On disk, this means that
>> there will be log files ending in FF, those were skipped before.
> Whats the reason for keeping that awkward split now? There aren't that many
> users of xlogid/xcrecoff and many of those would be better served by using
> helper macros.

I wondered that, too. There may be a good reason for keeping it split
up that way, but we at least oughta think about it a bit.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WAL format changes
Date: 2012-06-18 18:08:14
Message-ID: 4FDF6E8E.4080900@enterprisedb.com

On 18.06.2012 21:00, Robert Haas wrote:
> On Thu, Jun 14, 2012 at 5:58 PM, Andres Freund<andres(at)2ndquadrant(dot)com> wrote:
>>> 1. Use a 64-bit segment number, instead of the log/seg combination. And
>>> don't waste the last segment on each logical 4 GB log file. The concept
>>> of a "logical log file" is now completely gone. XLogRecPtr is unchanged,
>>> but it should now be understood as a plain 64-bit value, just split into
>>> two 32-bit integers for historical reasons. On disk, this means that
>>> there will be log files ending in FF, those were skipped before.
>> Whats the reason for keeping that awkward split now? There aren't that many
>> users of xlogid/xcrecoff and many of those would be better served by using
>> helper macros.
>
> I wondered that, too. There may be a good reason for keeping it split
> up that way, but we at least oughta think about it a bit.

The page header contains an XLogRecPtr (LSN), so if we change it we'll
have to deal with pg_upgrade. I guess we could still keep XLogRecPtr
around as the on-disk representation, and convert between the 64-bit
integer and XLogRecPtr in PageGetLSN/PageSetLSN. I can try that out -
many xlog calculations would admittedly be simpler if it was an uint64.
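
A minimal sketch of that idea, assuming hypothetical type and function names rather than the real bufpage.h definitions: the page keeps its two 32-bit halves on disk, and only the accessors translate to and from a plain 64-bit value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical on-disk form: the old two-field layout, left unchanged. */
typedef struct SketchPageLSN
{
    uint32_t xlogid;   /* high 32 bits */
    uint32_t xrecoff;  /* low 32 bits */
} SketchPageLSN;

/* Hypothetical page header; only the LSN matters for this sketch. */
typedef struct SketchPageHeader
{
    SketchPageLSN pd_lsn;
} SketchPageHeader;

/* Read the on-disk halves and hand back a single 64-bit LSN. */
static inline uint64_t
sketch_page_get_lsn(const SketchPageHeader *page)
{
    return ((uint64_t) page->pd_lsn.xlogid << 32) | page->pd_lsn.xrecoff;
}

/* Split a 64-bit LSN back into the on-disk halves. */
static inline void
sketch_page_set_lsn(SketchPageHeader *page, uint64_t lsn)
{
    assert(page != NULL);
    page->pd_lsn.xlogid = (uint32_t) (lsn >> 32);
    page->pd_lsn.xrecoff = (uint32_t) lsn;
}
```

Because the struct layout is untouched, pages written by an older version would still be readable, which is the pg_upgrade concern raised above.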

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: WAL format changes
Date: 2012-06-18 18:13:12
Message-ID: 201206182013.12868.andres@2ndquadrant.com

On Monday, June 18, 2012 08:08:14 PM Heikki Linnakangas wrote:
> On 18.06.2012 21:00, Robert Haas wrote:
> > On Thu, Jun 14, 2012 at 5:58 PM, Andres Freund<andres(at)2ndquadrant(dot)com>
> > wrote:
> >>> 1. Use a 64-bit segment number, instead of the log/seg combination. And
> >>> don't waste the last segment on each logical 4 GB log file. The concept
> >>> of a "logical log file" is now completely gone. XLogRecPtr is
> >>> unchanged, but it should now be understood as a plain 64-bit value,
> >>> just split into two 32-bit integers for historical reasons. On disk,
> >>> this means that there will be log files ending in FF, those were
> >>> skipped before.
> >>
> >> Whats the reason for keeping that awkward split now? There aren't that
> >> many users of xlogid/xcrecoff and many of those would be better served
> >> by using helper macros.
> >
> > I wondered that, too. There may be a good reason for keeping it split
> > up that way, but we at least oughta think about it a bit.
>
> The page header contains an XLogRecPtr (LSN), so if we change it we'll
> have to deal with pg_upgrade. I guess we could still keep XLogRecPtr
> around as the on-disk representation, and convert between the 64-bit
> integer and XLogRecPtr in PageGetLSN/PageSetLSN. I can try that out -
> many xlog calculations would admittedly be simpler if it was an uint64.
I am out of my depth here, not having read any of the relevant code, but
couldn't we simply replace the LSN from disk with InvalidXLogRecPtr without
marking the page dirty?

There is the valid argument that you would lose some information when pages
with hint bits are written out again, but on the other hand you would also
gain the information that it was a hint-bit write...

Greetings,

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WAL format changes
Date: 2012-06-18 18:13:21
Message-ID: CA+TgmobCJznkeEMHd-vLm4FFVH+t+sgjbenA4gqc568DRsHh9g@mail.gmail.com

On Mon, Jun 18, 2012 at 2:08 PM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> On 18.06.2012 21:00, Robert Haas wrote:
>> On Thu, Jun 14, 2012 at 5:58 PM, Andres Freund<andres(at)2ndquadrant(dot)com>
>>  wrote:
>>>>
>>>> 1. Use a 64-bit segment number, instead of the log/seg combination. And
>>>> don't waste the last segment on each logical 4 GB log file. The concept
>>>> of a "logical log file" is now completely gone. XLogRecPtr is unchanged,
>>>> but it should now be understood as a plain 64-bit value, just split into
>>>> two 32-bit integers for historical reasons. On disk, this means that
>>>> there will be log files ending in FF, those were skipped before.
>>>
>>> Whats the reason for keeping that awkward split now? There aren't that
>>> many
>>> users of xlogid/xcrecoff and many of those would be better served by
>>> using
>>> helper macros.
>>
>> I wondered that, too.  There may be a good reason for keeping it split
>> up that way, but we at least oughta think about it a bit.
>
> The page header contains an XLogRecPtr (LSN), so if we change it we'll have
> to deal with pg_upgrade. I guess we could still keep XLogRecPtr around as
> the on-disk representation, and convert between the 64-bit integer and
> XLogRecPtr in PageGetLSN/PageSetLSN. I can try that out - many xlog
> calculations would admittedly be simpler if it was an uint64.

Ugh. Well, that's a good reason for thinking twice before changing
it, if not abandoning the idea altogether.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: WAL format changes
Date: 2012-06-18 18:32:54
Message-ID: 4FDF7456.9010709@enterprisedb.com

On 18.06.2012 21:13, Andres Freund wrote:
> On Monday, June 18, 2012 08:08:14 PM Heikki Linnakangas wrote:
>> The page header contains an XLogRecPtr (LSN), so if we change it we'll
>> have to deal with pg_upgrade. I guess we could still keep XLogRecPtr
>> around as the on-disk representation, and convert between the 64-bit
>> integer and XLogRecPtr in PageGetLSN/PageSetLSN. I can try that out -
>> many xlog calculations would admittedly be simpler if it was an uint64.
> I am out of my depth here, not having read any of the relevant code, but
> couldn't we simply replace the lsn from disk with InvalidXLogRecPtr without
> marking the page dirty?
>
> There is the valid argument that you would loose some information when pages
> with hint bits are written out again, but on the other hand you would also
> gain the information that it was a hint-bit write...

Sorry, I don't understand that. Where would you "replace the LSN from
disk with InvalidXLogRecPtr" ? (and no, it probably won't work ;-) )

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: WAL format changes
Date: 2012-06-18 18:45:53
Message-ID: 201206182045.53962.andres@2ndquadrant.com

On Monday, June 18, 2012 08:32:54 PM Heikki Linnakangas wrote:
> On 18.06.2012 21:13, Andres Freund wrote:
> > On Monday, June 18, 2012 08:08:14 PM Heikki Linnakangas wrote:
> >> The page header contains an XLogRecPtr (LSN), so if we change it we'll
> >> have to deal with pg_upgrade. I guess we could still keep XLogRecPtr
> >> around as the on-disk representation, and convert between the 64-bit
> >> integer and XLogRecPtr in PageGetLSN/PageSetLSN. I can try that out -
> >> many xlog calculations would admittedly be simpler if it was an uint64.
> >
> > I am out of my depth here, not having read any of the relevant code, but
> > couldn't we simply replace the lsn from disk with InvalidXLogRecPtr
> > without marking the page dirty?
> >
> > There is the valid argument that you would loose some information when
> > pages with hint bits are written out again, but on the other hand you
> > would also gain the information that it was a hint-bit write...
>
> Sorry, I don't understand that. Where would you "replace the LSN from
> disk with InvalidXLogRecPtr" ? (and no, it probably won't work ;-) )
In ReadBuffer_common or such, after reading a page from disk and verifying
that the page has a valid header, it seems to be possible to replace pd_lsn *in
memory* by InvalidXLogRecPtr without marking the page dirty.
If the page isn't modified, the LSN on disk won't be changed, so you don't lose
debugging information in that case. It will be zero if it has been written by
a hint-bit write and that's arguably a win.

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: WAL format changes
Date: 2012-06-18 19:19:48
Message-ID: 4FDF7F54.5060406@enterprisedb.com

On 18.06.2012 21:45, Andres Freund wrote:
> On Monday, June 18, 2012 08:32:54 PM Heikki Linnakangas wrote:
>> On 18.06.2012 21:13, Andres Freund wrote:
>>> On Monday, June 18, 2012 08:08:14 PM Heikki Linnakangas wrote:
>>>> The page header contains an XLogRecPtr (LSN), so if we change it we'll
>>>> have to deal with pg_upgrade. I guess we could still keep XLogRecPtr
>>>> around as the on-disk representation, and convert between the 64-bit
>>>> integer and XLogRecPtr in PageGetLSN/PageSetLSN. I can try that out -
>>>> many xlog calculations would admittedly be simpler if it was an uint64.
>>>
>>> I am out of my depth here, not having read any of the relevant code, but
>>> couldn't we simply replace the lsn from disk with InvalidXLogRecPtr
>>> without marking the page dirty?
>>>
>>> There is the valid argument that you would loose some information when
>>> pages with hint bits are written out again, but on the other hand you
>>> would also gain the information that it was a hint-bit write...
>>
>> Sorry, I don't understand that. Where would you "replace the LSN from
>> disk with InvalidXLogRecPtr" ? (and no, it probably won't work ;-) )
> In ReadBuffer_common or such, after reading a page from disk and verifying
> that the page has a valid header it seems to be possible to replace pd_lsn *in
> memory* by InvalidXLogRecPtr without marking the page dirty.
> If the page isn't modified the lsn on disk won't be changed so you don't loose
> debugging information in that case. If will be zero if it has been written by
> a hint-bit write and thats arguable a win.

We use the LSN to decide whether a full-page image needs to be xlogged if
the page is modified. If you reset LSN every time you read in a page,
you'll be doing full page writes every time a page is read from disk and
modified, whether or not it's the first time the page is modified after
the last checkpoint.
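
The objection can be sketched roughly as follows. The function name is hypothetical and the comparison reflects my reading of the check, so treat it as an assumption: a page needs a full-page image when its LSN is not newer than the redo pointer of the last checkpoint.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: true if the first modification of this page since the
 * last checkpoint still lies ahead, so a full-page image must be logged.
 */
static inline bool
sketch_needs_full_page_image(uint64_t page_lsn, uint64_t redo_ptr)
{
    assert(redo_ptr != 0);       /* a real checkpoint has a nonzero redo pointer */
    return page_lsn <= redo_ptr;
}
```

With a redo pointer of 1000, a page whose LSN is 1500 was already modified after the checkpoint and needs no new image. If its in-memory LSN were reset to 0 on read, the same check would ask for a full-page image again on the next modification.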

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: WAL format changes
Date: 2012-06-18 20:57:57
Message-ID: 201206182257.58019.andres@2ndquadrant.com

On Monday, June 18, 2012 09:19:48 PM Heikki Linnakangas wrote:
> On 18.06.2012 21:45, Andres Freund wrote:
> > On Monday, June 18, 2012 08:32:54 PM Heikki Linnakangas wrote:
> >> On 18.06.2012 21:13, Andres Freund wrote:
> >>> On Monday, June 18, 2012 08:08:14 PM Heikki Linnakangas wrote:
> >>>> The page header contains an XLogRecPtr (LSN), so if we change it we'll
> >>>> have to deal with pg_upgrade. I guess we could still keep XLogRecPtr
> >>>> around as the on-disk representation, and convert between the 64-bit
> >>>> integer and XLogRecPtr in PageGetLSN/PageSetLSN. I can try that out -
> >>>> many xlog calculations would admittedly be simpler if it was an
> >>>> uint64.
> >>>
> >>> I am out of my depth here, not having read any of the relevant code,
> >>> but couldn't we simply replace the lsn from disk with
> >>> InvalidXLogRecPtr without marking the page dirty?
> >>>
> >>> There is the valid argument that you would loose some information when
> >>> pages with hint bits are written out again, but on the other hand you
> >>> would also gain the information that it was a hint-bit write...
> >>
> >> Sorry, I don't understand that. Where would you "replace the LSN from
> >> disk with InvalidXLogRecPtr" ? (and no, it probably won't work ;-) )
> >
> > In ReadBuffer_common or such, after reading a page from disk and
> > verifying that the page has a valid header it seems to be possible to
> > replace pd_lsn *in memory* by InvalidXLogRecPtr without marking the page
> > dirty.
> > If the page isn't modified the lsn on disk won't be changed so you don't
> > loose debugging information in that case. If will be zero if it has been
> > written by a hint-bit write and thats arguable a win.
>
> We use the LSN to decide whether a full-page image need to be xlogged if
> the page is modified. If you reset LSN every time you read in a page,
> you'll be doing full page writes every time a page is read from disk and
> modified, whether or not it's the first time the page is modified after
> the last checkpoint.
Yeah, I somehow disregarded that hint-bit writes would cause a problem with full
page writes in that case. Normal writes wouldn't be a problem...
This should be fixable but the result would be too ugly. So it's either
converting the on-disk representation or keeping the duplicated
representation.

pd_lsn seems to be well enough abstracted; doing the conversion in
PageSet/GetLSN seems to be easy enough.

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WAL format changes
Date: 2012-06-19 08:14:08
Message-ID: 4FE034D0.5070900@enterprisedb.com

On 18.06.2012 21:08, Heikki Linnakangas wrote:
> On 18.06.2012 21:00, Robert Haas wrote:
>> On Thu, Jun 14, 2012 at 5:58 PM, Andres Freund<andres(at)2ndquadrant(dot)com>
>> wrote:
>>>> 1. Use a 64-bit segment number, instead of the log/seg combination. And
>>>> don't waste the last segment on each logical 4 GB log file. The concept
>>>> of a "logical log file" is now completely gone. XLogRecPtr is
>>>> unchanged,
>>>> but it should now be understood as a plain 64-bit value, just split
>>>> into
>>>> two 32-bit integers for historical reasons. On disk, this means that
>>>> there will be log files ending in FF, those were skipped before.
>>> Whats the reason for keeping that awkward split now? There aren't
>>> that many
>>> users of xlogid/xcrecoff and many of those would be better served by
>>> using
>>> helper macros.
>>
>> I wondered that, too. There may be a good reason for keeping it split
>> up that way, but we at least oughta think about it a bit.
>
> The page header contains an XLogRecPtr (LSN), so if we change it we'll
> have to deal with pg_upgrade. I guess we could still keep XLogRecPtr
> around as the on-disk representation, and convert between the 64-bit
> integer and XLogRecPtr in PageGetLSN/PageSetLSN. I can try that out -
> many xlog calculations would admittedly be simpler if it was an uint64.

Well, that was easier than I thought. Attached is a patch to make
XLogRecPtr a uint64, on top of my other WAL format patches. I think we
should go ahead with this.

The LSNs on pages are still stored in the old format, to avoid changing
the on-disk format and breaking pg_upgrade. The XLogRecPtrs stored in the
control file and WAL are changed, however, so an initdb (or at least
pg_resetxlog) is required.

Should we keep the old representation in the replication protocol
messages? That would make it simpler to write a client that works with
different server versions (like pg_receivexlog). Or, while we're at it,
perhaps we should mandate network-byte order for all the integer and
XLogRecPtr fields in the replication protocol. That would make it easier
to write a client that works across different architectures, in >= 9.3.
The contents of the WAL would of course be architecture-dependent, but
it would be nice if pg_receivexlog and similar tools could nevertheless
be architecture-independent.
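
As an illustration of what mandating network byte order for a 64-bit field could look like, here is a small sketch. The helper names are hypothetical and this is not the actual protocol code; it only shows that shifting, rather than copying raw memory, gives the same bytes on any architecture.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Write a 64-bit value into buf in network (big-endian) byte order. */
static inline void
sketch_put_uint64_be(unsigned char *buf, uint64_t value)
{
    assert(buf != NULL);
    for (int i = 0; i < 8; i++)
        buf[i] = (unsigned char) (value >> (56 - 8 * i));
}

/* Read a 64-bit value that was written in network byte order. */
static inline uint64_t
sketch_get_uint64_be(const unsigned char *buf)
{
    uint64_t value = 0;

    for (int i = 0; i < 8; i++)
        value = (value << 8) | buf[i];
    return value;
}
```

A client built this way would decode the same position regardless of the endianness of the server or of its own host.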

I kept the %X/%X representation in error messages etc. I'm quite used to
that output, so reluctant to change it, although it's a bit silly now
that it represents just a 64-bit value. Using UINT64_FORMAT would also
make the messages harder to translate.
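
For reference, printing a 64-bit position in the familiar %X/%X form only needs the two halves split out. A minimal sketch, with a hypothetical helper name:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Format a 64-bit LSN as "high/low" in hex, matching the %X/%X convention. */
static inline int
sketch_format_lsn(char *buf, size_t buflen, uint64_t lsn)
{
    assert(buf != NULL && buflen > 0);
    return snprintf(buf, buflen, "%X/%X",
                    (unsigned int) (lsn >> 32),
                    (unsigned int) (lsn & 0xFFFFFFFFu));
}
```

For example, the value 0x0000000112345678 comes out as 1/12345678, and the format string stays free of platform-specific 64-bit conversion specifiers.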

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachment                  Content-Type  Size
xlogrecptr-uint64-1.patch   text/x-diff   72.3 KB

From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: WAL format changes
Date: 2012-06-19 15:46:40
Message-ID: 201206191746.41027.andres@2ndquadrant.com

On Tuesday, June 19, 2012 10:14:08 AM Heikki Linnakangas wrote:
> On 18.06.2012 21:08, Heikki Linnakangas wrote:
> > On 18.06.2012 21:00, Robert Haas wrote:
> >> On Thu, Jun 14, 2012 at 5:58 PM, Andres Freund<andres(at)2ndquadrant(dot)com>
> >> wrote:
> >>>> 1. Use a 64-bit segment number, instead of the log/seg combination.
> >>>> And don't waste the last segment on each logical 4 GB log file. The
> >>>> concept of a "logical log file" is now completely gone. XLogRecPtr is
> >>>> unchanged,
> >>>> but it should now be understood as a plain 64-bit value, just split
> >>>> into
> >>>> two 32-bit integers for historical reasons. On disk, this means that
> >>>> there will be log files ending in FF, those were skipped before.
> >>>
> >>> Whats the reason for keeping that awkward split now? There aren't
> >>> that many
> >>> users of xlogid/xcrecoff and many of those would be better served by
> >>> using
> >>> helper macros.
> >>
> >> I wondered that, too. There may be a good reason for keeping it split
> >> up that way, but we at least oughta think about it a bit.
> >
> > The page header contains an XLogRecPtr (LSN), so if we change it we'll
> > have to deal with pg_upgrade. I guess we could still keep XLogRecPtr
> > around as the on-disk representation, and convert between the 64-bit
> > integer and XLogRecPtr in PageGetLSN/PageSetLSN. I can try that out -
> > many xlog calculations would admittedly be simpler if it was an uint64.
>
> Well, that was easier than I thought. Attached is a patch to make
> XLogRecPtr a uint64, on top of my other WAL format patches. I think we
> should go ahead with this.
Cool. You plan to merge XLogSegNo with XLogRecPtr in that case? I am not sure
if having two representations which just have a constant factor in between
really makes sense.
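
To spell out the "constant factor": with a 64-bit position, the segment number is just the position divided by the segment size. The sketch below assumes the default 16 MB segment size and uses hypothetical macro and function names; it also shows how a file name ending in FF now arises.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Assumed default segment size of 16 MB; names are illustrative only. */
#define SKETCH_SEG_SIZE        (UINT64_C(16) * 1024 * 1024)
/* Segments per 4 GB of WAL: 256, so none is skipped any more. */
#define SKETCH_SEGS_PER_XLOGID (UINT64_C(0x100000000) / SKETCH_SEG_SIZE)

/* A segment number is the 64-bit position divided by the segment size. */
static inline uint64_t
sketch_lsn_to_segno(uint64_t lsn)
{
    return lsn / SKETCH_SEG_SIZE;
}

/* Build the 24-hex-digit file name from a timeline and a segment number. */
static inline int
sketch_wal_file_name(char *buf, size_t buflen, uint32_t tli, uint64_t segno)
{
    assert(buf != NULL && buflen >= 25);
    return snprintf(buf, buflen, "%08X%08X%08X",
                    (unsigned int) tli,
                    (unsigned int) (segno / SKETCH_SEGS_PER_XLOGID),
                    (unsigned int) (segno % SKETCH_SEGS_PER_XLOGID));
}
```

For position 0x00000001FF000000 on timeline 1, the segment number is 511 and the file name is 0000000100000001000000FF, a name that the old log/seg scheme would have skipped.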

> The LSNs on pages are still stored in the old format, to avoid changing
> the on-disk format and breaking pg_upgrade. The XLogRecPtrs stored the
> control file and WAL are changed, however, so an initdb (or at least
> pg_resetxlog) is required.
Sounds sensible.

> Should we keep the old representation in the replication protocol
> messages? That would make it simpler to write a client that works with
> different server versions (like pg_receivexlog). Or, while we're at it,
> perhaps we should mandate network-byte order for all the integer and
> XLogRecPtr fields in the replication protocol.
The replication protocol uses pq_sendint for integers which should take care
of converting to big endian already. I don't think anything but the WAL itself
is otherwise transported in a binary fashion? So I don't think there is any
such architecture dependency in the protocol currently?

I don't really see a point in keeping around a backward-compatible
representation just for the sake of running such tools on multiple versions. I
might not be pragmatic enough, but: Why would you want to do that *at the
moment*? Many of the other tools are already version specific, so...
When the protocol starts to be used by more tools, maybe, but IMO we're not
there yet.

But then it's not hard to convert to the old representation for that.

> I kept the %X/%X representation in error messages etc. I'm quite used to
> that output, so reluctant to change it, although it's a bit silly now
> that it represents just 64-bit value. Using UINT64_FORMAT would also
> make the messages harder to translate.
No opinion on that. It's easier for me to see whether two values are exactly
the same or very similar with the 64-bit representation, but it's harder to
gauge bigger differences. So ...

Greetings,

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WAL format changes
Date: 2012-06-19 15:57:12
Message-ID: CA+TgmoZu=UMKLJDQ5FNRqPnomiGOSdEvzaj_Gg9CJzcPnvuxgQ@mail.gmail.com

On Tue, Jun 19, 2012 at 4:14 AM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> Well, that was easier than I thought. Attached is a patch to make XLogRecPtr
> a uint64, on top of my other WAL format patches. I think we should go ahead
> with this.

+1.

> The LSNs on pages are still stored in the old format, to avoid changing the
> on-disk format and breaking pg_upgrade. The XLogRecPtrs stored the control
> file and WAL are changed, however, so an initdb (or at least pg_resetxlog)
> is required.

Seems fine.

> Should we keep the old representation in the replication protocol messages?
> That would make it simpler to write a client that works with different
> server versions (like pg_receivexlog). Or, while we're at it, perhaps we
> should mandate network-byte order for all the integer and XLogRecPtr fields
> in the replication protocol. That would make it easier to write a client
> that works across different architectures, in >= 9.3. The contents of the
> WAL would of course be architecture-dependent, but it would be nice if
> pg_receivexlog and similar tools could nevertheless be
> architecture-independent.

I share Andres' question about how we're doing this already. I think
if we're going to break this, I'd rather do it in 9.3 than 5 years
from now. At this point it's just a minor annoyance, but it'll
probably get worse as people write more tools that understand WAL.

> I kept the %X/%X representation in error messages etc. I'm quite used to
> that output, so reluctant to change it, although it's a bit silly now that
> it represents just 64-bit value. Using UINT64_FORMAT would also make the
> messages harder to translate.

I could go either way on this one, but I have no problem with the way
you did it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: WAL format changes
Date: 2012-06-19 22:24:54
Message-ID: 4FE0FC36.9080409@enterprisedb.com
Lists: pgsql-hackers

On 19.06.2012 18:46, Andres Freund wrote:
> On Tuesday, June 19, 2012 10:14:08 AM Heikki Linnakangas wrote:
>> Well, that was easier than I thought. Attached is a patch to make
>> XLogRecPtr a uint64, on top of my other WAL format patches. I think we
>> should go ahead with this.
> Cool. You plan to merge XLogSegNo with XLogRecPtr in that case? I am not sure
> if having two representations which just have a constant factor inbetween
> really makes sense.

I wasn't planning to; it didn't even occur to me that we might be able
to get rid of XLogSegNo, to be honest. There are places that deal with whole
segments, rather than with specific byte positions in the WAL, so I
think XLogSegNo makes more sense in that context. Take
XLogArchiveNotifySeg(), for example. It notifies the archiver that a
given segment is ready for archiving, so we pass an XLogSegNo to
identify that segment as an argument. I suppose we could pass an
XLogRecPtr that points to the beginning of the segment instead, but it
doesn't really feel like an improvement to me.
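
To make the "constant factor" between the two representations concrete, here is a minimal sketch. It assumes XLogRecPtr is a plain 64-bit byte position and that segments have the default 16 MB size; the names seg_size, recptr_to_segno and segno_to_recptr are hypothetical and are not the macros in the tree.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;    /* byte position in the WAL stream */
typedef uint64_t XLogSegNo;     /* index of a whole WAL segment */

/* Assumption: default 16 MB segments; a build may use another size. */
static const uint64_t seg_size = UINT64_C(16) * 1024 * 1024;

/* Segment that contains the given byte position. */
static XLogSegNo
recptr_to_segno(XLogRecPtr ptr)
{
    return ptr / seg_size;
}

/* Byte position at which the given segment starts. */
static XLogRecPtr
segno_to_recptr(XLogSegNo segno)
{
    /* Precondition: the product must fit in 64 bits. */
    assert(segno <= UINT64_MAX / seg_size);
    return segno * seg_size;
}
```

The conversion loses the offset within the segment, which is one way to see why a function that works on whole segments may read more naturally with a segment number than with a byte position.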

>> Should we keep the old representation in the replication protocol
>> messages? That would make it simpler to write a client that works with
>> different server versions (like pg_receivexlog). Or, while we're at it,
>> perhaps we should mandate network-byte order for all the integer and
>> XLogRecPtr fields in the replication protocol.
> The replication protocol uses pq_sendint for integers which should take care
> of converting to big endian already. I don't think anything but the wal itself
> is otherwise transported in a binary fashion? So I don't think there is any
> such architecture dependency in the protocol currently?

We only use pq_sendint() for the few values exchanged in the handshake
before we start replicating, but once we begin, we just send structs
around. For example, in ProcessStandbyReplyMessage():

> static void
> ProcessStandbyReplyMessage(void)
> {
> StandbyReplyMessage reply;
>
> pq_copymsgbytes(&reply_message, (char *) &reply, sizeof(StandbyReplyMessage));
> ...

After that, we just use the fields in the reply struct like in any other struct.
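
For comparison, this is roughly what an architecture-independent encoding of a 64-bit field could look like. It is only a sketch of the network-byte-order idea raised earlier in the thread, not the replication protocol as implemented; put_uint64_be and get_uint64_be are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Write a 64-bit value into buf in network (big-endian) byte order. */
static void
put_uint64_be(unsigned char *buf, uint64_t value)
{
    assert(buf != NULL);
    for (size_t i = 0; i < 8; i++)
        buf[i] = (unsigned char) (value >> (56 - 8 * i));
}

/* Read a 64-bit value that was stored in network byte order. */
static uint64_t
get_uint64_be(const unsigned char *buf)
{
    uint64_t value = 0;

    assert(buf != NULL);
    for (size_t i = 0; i < 8; i++)
        value = (value << 8) | buf[i];
    return value;
}
```

Because each byte is placed explicitly, the bytes on the wire are the same whatever the endianness or struct padding of the sending machine, which copying a whole struct does not guarantee.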

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: WAL format changes
Date: 2012-06-19 22:30:38
Message-ID: 201206200030.38683.andres@2ndquadrant.com
Lists: pgsql-hackers

Hi,

On Wednesday, June 20, 2012 12:24:54 AM Heikki Linnakangas wrote:
> On 19.06.2012 18:46, Andres Freund wrote:
> > On Tuesday, June 19, 2012 10:14:08 AM Heikki Linnakangas wrote:
> >> Well, that was easier than I thought. Attached is a patch to make
> >> XLogRecPtr a uint64, on top of my other WAL format patches. I think we
> >> should go ahead with this.
> >
> > Cool. You plan to merge XLogSegNo with XLogRecPtr in that case? I am not
> > sure if having two representations which just have a constant factor
> > inbetween really makes sense.

> I wasn't planning to, it didn't even occur to me that we might be able
> to get rid of XLogSegNo to be honest. There's places that deal whole
> segments, rather than with specific byte positions in the WAL, so I
> think XLogSegNo makes more sense in that context. Take
> XLogArchiveNotifySeg(), for example. It notifies the archiver that a
> given segment is ready for archiving, so we pass an XLogSegNo to
> identify that segment as an argument. I suppose we could pass an
> XLogRecPtr that points to the beginning of the segment instead, but it
> doesn't really feel like an improvement to me.
I am not sure it's a win either; I was just wondering because they are now so
similar.

> >> Should we keep the old representation in the replication protocol
> >> messages? That would make it simpler to write a client that works with
> >> different server versions (like pg_receivexlog). Or, while we're at it,
> >> perhaps we should mandate network-byte order for all the integer and
> >> XLogRecPtr fields in the replication protocol.
> >
> > The replication protocol uses pq_sendint for integers which should take
> > care of converting to big endian already. I don't think anything but the
> > wal itself is otherwise transported in a binary fashion? So I don't
> > think there is any such architecture dependency in the protocol
> > currently?
> We only use pg_sendint() for the few values exchanged in the handshake
> before we start replicating, but once we begin, we just send structs
> around. For example, in ProcessStandbyReplyMessage():
>
> > static void
> > ProcessStandbyReplyMessage(void)
> > {
> > StandbyReplyMessage reply;
> >
> > pq_copymsgbytes(&reply_message, (char *) &reply, sizeof(StandbyReplyMessage));
> > ...
>
> After that, we just the fields in the reply struct like in any other
> struct.
Yes, forgot that, true. I guess the best fix would be to actually send normal
messages instead of CopyData ones? Much more to type though...

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Magnus Hagander <magnus(at)hagander(dot)net>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WAL format changes
Date: 2012-06-20 11:19:30
Message-ID: CABUevExOgqc+XqsOQs_B-TT8w4iC1iPGofz2EPejUKkGJpPF4w@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jun 19, 2012 at 5:57 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Tue, Jun 19, 2012 at 4:14 AM, Heikki Linnakangas
> <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>> Well, that was easier than I thought. Attached is a patch to make XLogRecPtr
>> a uint64, on top of my other WAL format patches. I think we should go ahead
>> with this.
>
> +1.
>
>> The LSNs on pages are still stored in the old format, to avoid changing the
>> on-disk format and breaking pg_upgrade. The XLogRecPtrs stored the control
>> file and WAL are changed, however, so an initdb (or at least pg_resetxlog)
>> is required.
>
> Seems fine.
>
>> Should we keep the old representation in the replication protocol messages?
>> That would make it simpler to write a client that works with different
>> server versions (like pg_receivexlog). Or, while we're at it, perhaps we
>> should mandate network-byte order for all the integer and XLogRecPtr fields
>> in the replication protocol. That would make it easier to write a client
>> that works across different architectures, in >= 9.3. The contents of the
>> WAL would of course be architecture-dependent, but it would be nice if
>> pg_receivexlog and similar tools could nevertheless be
>> architecture-independent.
>
> I share Andres' question about how we're doing this already.  I think
> if we're going to break this, I'd rather do it in 9.3 than 5 years
> from now.  At this point it's just a minor annoyance, but it'll
> probably get worse as people write more tools that understand WAL.

If we are looking at breaking it, and we are especially concerned
about something like pg_receivexlog... Is it something we could/should
change in the protocol *now* for 9.2, to make it non-broken in any
released version? As in, can we extract just the protocol change and
backpatch that to 9.2beta?

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Magnus Hagander <magnus(at)hagander(dot)net>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WAL format changes
Date: 2012-06-20 17:43:17
Message-ID: CAHGQGwFa0mDNrVqwdq1ObVtb+aA6Tq=0m6QTynr6qtO2GSDtSQ@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jun 20, 2012 at 8:19 PM, Magnus Hagander <magnus(at)hagander(dot)net> wrote:
> On Tue, Jun 19, 2012 at 5:57 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> On Tue, Jun 19, 2012 at 4:14 AM, Heikki Linnakangas
>> <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>>> Well, that was easier than I thought. Attached is a patch to make XLogRecPtr
>>> a uint64, on top of my other WAL format patches. I think we should go ahead
>>> with this.
>>
>> +1.
>>
>>> The LSNs on pages are still stored in the old format, to avoid changing the
>>> on-disk format and breaking pg_upgrade. The XLogRecPtrs stored the control
>>> file and WAL are changed, however, so an initdb (or at least pg_resetxlog)
>>> is required.
>>
>> Seems fine.
>>
>>> Should we keep the old representation in the replication protocol messages?
>>> That would make it simpler to write a client that works with different
>>> server versions (like pg_receivexlog). Or, while we're at it, perhaps we
>>> should mandate network-byte order for all the integer and XLogRecPtr fields
>>> in the replication protocol. That would make it easier to write a client
>>> that works across different architectures, in >= 9.3. The contents of the
>>> WAL would of course be architecture-dependent, but it would be nice if
>>> pg_receivexlog and similar tools could nevertheless be
>>> architecture-independent.
>>
>> I share Andres' question about how we're doing this already.  I think
>> if we're going to break this, I'd rather do it in 9.3 than 5 years
>> from now.  At this point it's just a minor annoyance, but it'll
>> probably get worse as people write more tools that understand WAL.
>
> If we are looking at breaking it, and we are especially concerned
> about something like pg_receivexlog... Is it something we could/should
> change in the protocl *now* for 9.2, to make it non-broken in any
> released version? As in, can we extract just the protocol change and
> backpatch that to 9.2beta?

pg_receivexlog in 9.2 cannot correctly handle the WAL location "FF"
(which was skipped in 9.2 and earlier). For example, pg_receivexlog calls
XLByteAdvance(), which always skips "FF". So even if we change the protocol,
ISTM pg_receivexlog in 9.2 cannot work well with a 9.3 server, which
might send "FF". No?
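
The difference in segment sequencing can be sketched as follows. This is an illustration only, assuming 16 MB segments (256 segment slots per 4 GB of WAL); OldSegPos, old_next_segment and new_next_segment are hypothetical names and are not taken from the XLByteAdvance() code.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 16 MB segments, so 256 segment slots per 4 GB of WAL. */
#define SEGS_PER_XLOGID 256u

/* Old two-part position: logical log file number plus segment within it. */
typedef struct OldSegPos
{
    uint32_t log;
    uint32_t seg;
} OldSegPos;

/* Old numbering: the last slot (0xFF) of each log is never used. */
static OldSegPos
old_next_segment(OldSegPos pos)
{
    assert(pos.seg < SEGS_PER_XLOGID - 1);
    if (pos.seg >= SEGS_PER_XLOGID - 2)
    {
        pos.log++;              /* ...FE is followed by the next log's ...00 */
        pos.seg = 0;
    }
    else
        pos.seg++;
    return pos;
}

/* New numbering: one 64-bit counter, nothing is skipped. */
static uint64_t
new_next_segment(uint64_t segno)
{
    return segno + 1;
}
```

A client that advances with the old rule moves from 8D0/FE straight to 8D1/00, so it would never ask for, or correctly place, the "FF" segment that a 9.3 server produces.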

Regards,

--
Fujii Masao


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Magnus Hagander <magnus(at)hagander(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WAL format changes
Date: 2012-06-20 19:42:20
Message-ID: 4FE2279C.2070506@enterprisedb.com
Lists: pgsql-hackers

On 20.06.2012 20:43, Fujii Masao wrote:
> On Wed, Jun 20, 2012 at 8:19 PM, Magnus Hagander<magnus(at)hagander(dot)net> wrote:
>> On Tue, Jun 19, 2012 at 5:57 PM, Robert Haas<robertmhaas(at)gmail(dot)com> wrote:
>>> On Tue, Jun 19, 2012 at 4:14 AM, Heikki Linnakangas
>>> <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>>>> Well, that was easier than I thought. Attached is a patch to make XLogRecPtr
>>>> a uint64, on top of my other WAL format patches. I think we should go ahead
>>>> with this.
>>>
>>> +1.
>>>
>>>> The LSNs on pages are still stored in the old format, to avoid changing the
>>>> on-disk format and breaking pg_upgrade. The XLogRecPtrs stored the control
>>>> file and WAL are changed, however, so an initdb (or at least pg_resetxlog)
>>>> is required.
>>>
>>> Seems fine.
>>>
>>>> Should we keep the old representation in the replication protocol messages?
>>>> That would make it simpler to write a client that works with different
>>>> server versions (like pg_receivexlog). Or, while we're at it, perhaps we
>>>> should mandate network-byte order for all the integer and XLogRecPtr fields
>>>> in the replication protocol. That would make it easier to write a client
>>>> that works across different architectures, in>= 9.3. The contents of the
>>>> WAL would of course be architecture-dependent, but it would be nice if
>>>> pg_receivexlog and similar tools could nevertheless be
>>>> architecture-independent.
>>>
>>> I share Andres' question about how we're doing this already. I think
>>> if we're going to break this, I'd rather do it in 9.3 than 5 years
>>> from now. At this point it's just a minor annoyance, but it'll
>>> probably get worse as people write more tools that understand WAL.
>>
>> If we are looking at breaking it, and we are especially concerned
>> about something like pg_receivexlog... Is it something we could/should
>> change in the protocl *now* for 9.2, to make it non-broken in any
>> released version? As in, can we extract just the protocol change and
>> backpatch that to 9.2beta?
>
> pg_receivexlog in 9.2 cannot handle correctly the WAL location "FF"
> (which was skipped in 9.2 or before). For example, pg_receivexlog calls
> XLByteAdvance() which always skips "FF". So even if we change the protocol,
> ISTM pg_receivexlog in 9.2 cannot work well with the server in 9.3 which
> might send "FF". No?

Yeah, you can't use pg_receivexlog from 9.2 against a 9.3 server. We
can't really promise compatibility when using an older client against a
newer server, but we can try to be backwards-compatible in the other
direction. I'm thinking of using a 9.3 pg_receivexlog against a 9.2 server.

But I guess Robert is right and we shouldn't worry about
backwards-compatibility at this point. Instead, let's try to get the
protocol right, so that we can more easily provide
backwards-compatibility in the future. Like, using a 9.4 pg_receivexlog
against a 9.3 server.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WAL format changes
Date: 2012-06-24 16:24:31
Message-ID: 4FE73F3F.5040105@enterprisedb.com
Lists: pgsql-hackers

Ok, committed all the WAL format changes now.

On 19.06.2012 18:57, Robert Haas wrote:
>> Should we keep the old representation in the replication protocol messages?
>> That would make it simpler to write a client that works with different
>> server versions (like pg_receivexlog). Or, while we're at it, perhaps we
>> should mandate network-byte order for all the integer and XLogRecPtr fields
>> in the replication protocol. That would make it easier to write a client
>> that works across different architectures, in>= 9.3. The contents of the
>> WAL would of course be architecture-dependent, but it would be nice if
>> pg_receivexlog and similar tools could nevertheless be
>> architecture-independent.
>
> I share Andres' question about how we're doing this already. I think
> if we're going to break this, I'd rather do it in 9.3 than 5 years
> from now. At this point it's just a minor annoyance, but it'll
> probably get worse as people write more tools that understand WAL.

I didn't touch the replication protocol yet, but I think we should do it
some time during 9.3.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WAL format changes
Date: 2012-06-24 18:34:29
Message-ID: CA+U5nMJXwcvei__royX0q8NN+wv40-5huhb7Jn_A92H6R0ubmw@mail.gmail.com
Lists: pgsql-hackers

On 24 June 2012 17:24, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:

> Ok, committed all the WAL format changes now.

Nice!

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WAL format changes
Date: 2012-06-25 17:41:49
Message-ID: CAHGQGwFRd8ZuR_+6g9BOb4pOPNuEMufOtz=RpY-uX_uEFq2NPw@mail.gmail.com
Lists: pgsql-hackers

On Mon, Jun 25, 2012 at 1:24 AM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> Ok, committed all the WAL format changes now.

This breaks pg_resetxlog -l completely. When I ran "pg_resetxlog -l 0x01,0x01,0x01"
on HEAD, I got the following error message, though the same command
completed successfully in 9.1.

pg_resetxlog: invalid argument for option -l
Try "pg_resetxlog --help" for more information.

I think the attached patch needs to be applied.

Regards,

--
Fujii Masao

Attachment Content-Type Size
resetxlog_bugfix_v1.patch application/octet-stream 2.1 KB

From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WAL format changes
Date: 2012-06-25 17:53:39
Message-ID: CAHGQGwHpdpPrmbq49yOp=zZTsiB+_cnwojRh0Xe9BE1D8fLegg@mail.gmail.com
Lists: pgsql-hackers

On Mon, Jun 25, 2012 at 1:24 AM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> Ok, committed all the WAL format changes now.

I found a typo.

In walsender.c
- reply.write.xlogid, reply.write.xrecoff,
- reply.flush.xlogid, reply.flush.xrecoff,
- reply.apply.xlogid, reply.apply.xrecoff);
+ (uint32) (reply.write << 32), (uint32) reply.write,
+ (uint32) (reply.flush << 32), (uint32) reply.flush,
+ (uint32) (reply.apply << 32), (uint32) reply.apply);

"<<" should be ">>". The attached patch fixes this typo.

Regards,

--
Fujii Masao


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WAL format changes
Date: 2012-06-25 17:57:42
Message-ID: CAHGQGwGZo-hv2Fxm3LpCsh=mh1T_21mj7MKe35VQyBSZ6PsyVw@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jun 26, 2012 at 2:53 AM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> On Mon, Jun 25, 2012 at 1:24 AM, Heikki Linnakangas
> <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>> Ok, committed all the WAL format changes now.
>
> I found the typo.
>
> In walsender.c
> -                reply.write.xlogid, reply.write.xrecoff,
> -                reply.flush.xlogid, reply.flush.xrecoff,
> -                reply.apply.xlogid, reply.apply.xrecoff);
> +                (uint32) (reply.write << 32), (uint32) reply.write,
> +                (uint32) (reply.flush << 32), (uint32) reply.flush,
> +                (uint32) (reply.apply << 32), (uint32) reply.apply);
>
> "<<" should be ">>". The attached patch fixes this typo.

Oh, I forgot to attach the patch. Here it is.
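
For reference, a small sketch of why the shift direction in the typo quoted above matters when a 64-bit WAL position is printed in the %X/%X style. The helper names (lsn_high, lsn_low, lsn_high_buggy, format_lsn) are hypothetical and are not the walsender.c code.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Upper 32 bits: shift right, then truncate. */
static uint32_t
lsn_high(uint64_t lsn)
{
    return (uint32_t) (lsn >> 32);
}

/* Lower 32 bits: plain truncation. */
static uint32_t
lsn_low(uint64_t lsn)
{
    return (uint32_t) lsn;
}

/*
 * The typo: shifting left by 32 clears the low 32 bits of the 64-bit
 * value, so the truncated result is always zero.
 */
static uint32_t
lsn_high_buggy(uint64_t lsn)
{
    return (uint32_t) (lsn << 32);
}

/* Format a 64-bit position as two hex halves; returns snprintf's result. */
static int
format_lsn(char *buf, size_t buflen, uint64_t lsn)
{
    assert(buf != NULL && buflen > 0);
    return snprintf(buf, buflen, "%" PRIX32 "/%" PRIX32,
                    lsn_high(lsn), lsn_low(lsn));
}
```

With the left shift, every position would be reported with a high half of 0, however far the WAL has advanced.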

Regards,

--
Fujii Masao

Attachment Content-Type Size
walsender_typo_v1.patch application/octet-stream 964 bytes

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WAL format changes
Date: 2012-06-25 18:01:32
Message-ID: CA+TgmoZz=7xRhg389Qj=SwdVCvpK4VSpjJm_yg7WKQuUpk0wjw@mail.gmail.com
Lists: pgsql-hackers

On Mon, Jun 25, 2012 at 1:57 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>> "<<" should be ">>". The attached patch fixes this typo.
>
> Oh, I forgot to attach the patch.. Here is the patch.

I committed both of the patches you posted to this thread.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WAL format changes
Date: 2012-06-26 00:09:34
Message-ID: 4FE8FDBE.8060901@enterprisedb.com
Lists: pgsql-hackers

On 25.06.2012 21:01, Robert Haas wrote:
> On Mon, Jun 25, 2012 at 1:57 PM, Fujii Masao<masao(dot)fujii(at)gmail(dot)com> wrote:
>>> "<<" should be">>". The attached patch fixes this typo.
>>
>> Oh, I forgot to attach the patch.. Here is the patch.
>
> I committed both of the patches you posted to this thread.

Thanks Robert. I was thinking that "pg_resetxlog -l" would accept a WAL
file name, instead of comma-separated tli, xlogid, segno arguments. The
latter is a bit meaningless now that we don't use the xlogid+segno
combination anywhere else. Alvaro pointed out that pg_upgrade was broken
by the change in pg_resetxlog -n output - I changed that too to print
the "First log segment after reset" information as a WAL file name,
instead of logid+segno. Another option would be to print the 64-bit
segment number, but I think that's worse, because the 64-bit segment
number is harder to associate with a physical WAL file.

So I think we should change pg_resetxlog -l option to take a WAL file
name as argument, and fix pg_upgrade accordingly.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WAL format changes
Date: 2012-06-26 00:42:43
Message-ID: 1340671279-sup-3382@alvh.no-ip.org
Lists: pgsql-hackers


Excerpts from Heikki Linnakangas's message of lun jun 25 20:09:34 -0400 2012:
> On 25.06.2012 21:01, Robert Haas wrote:
> > On Mon, Jun 25, 2012 at 1:57 PM, Fujii Masao<masao(dot)fujii(at)gmail(dot)com> wrote:
> >>> "<<" should be">>". The attached patch fixes this typo.
> >>
> >> Oh, I forgot to attach the patch.. Here is the patch.
> >
> > I committed both of the patches you posted to this thread.
>
> Thanks Robert. I was thinking that "pg_resetxlog -l" would accept a WAL
> file name, instead of comma-separated tli, xlogid, segno arguments. The
> latter is a bit meaningless now that we don't use the xlogid+segno
> combination anywhere else. Alvaro pointed out that pg_upgrade was broken
> by the change in pg_resetxlog -n output - I changed that too to print
> the "First log segment after reset" information as a WAL file name,
> instead of logid+segno. Another option would be to print the 64-bit
> segment number, but I think that's worse, because the 64-bit segment
> number is harder to associate with a physical WAL file.
>
> So I think we should change pg_resetxlog -l option to take a WAL file
> name as argument, and fix pg_upgrade accordingly.

The only thing pg_upgrade does with the tli/logid/segno combo, AFAICT,
is pass it back to pg_resetxlog -l, so this plan seems reasonable.

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WAL format changes
Date: 2012-06-26 01:51:45
Message-ID: 8405.1340675505@sss.pgh.pa.us
Lists: pgsql-hackers

Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
> So I think we should change pg_resetxlog -l option to take a WAL file
> name as argument, and fix pg_upgrade accordingly.

Seems reasonable I guess. It's really specifying a starting WAL
location, but only to file granularity, so treating the argument as a
file name is sort of a type cheat but seems convenient.

If we do it that way, we'd better validate that the argument is a legal
WAL file name, so as to catch any cases where somebody tries to do it
old-style.
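
A validation along these lines might look like the sketch below. It assumes 16 MB segments and the usual 24-hex-digit file name (timeline, then two 8-digit groups). The function names are hypothetical, and the strictness (uppercase digits only, last group below 256) is a choice made for this sketch, not a description of the committed code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Assumption: 16 MB segments, i.e. 256 segments per 4 GB of WAL. */
#define SEGS_PER_XLOGID UINT64_C(256)

/* Value of one uppercase hex digit, or -1 if c is not one. */
static int
hex_digit_value(char c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    if (c >= 'A' && c <= 'F')
        return c - 'A' + 10;
    return -1;
}

/* Parse exactly 8 hex digits starting at s. */
static bool
parse_hex8(const char *s, uint32_t *out)
{
    uint32_t value = 0;

    for (int i = 0; i < 8; i++)
    {
        int d = hex_digit_value(s[i]);

        if (d < 0)
            return false;
        value = (value << 4) | (uint32_t) d;
    }
    *out = value;
    return true;
}

/* Accept only a well-formed WAL file name; reject old-style input. */
static bool
parse_wal_file_name(const char *name, uint32_t *tli, uint64_t *segno)
{
    uint32_t log;
    uint32_t seg;

    assert(name != NULL && tli != NULL && segno != NULL);
    if (strlen(name) != 24)
        return false;
    if (!parse_hex8(name, tli) ||
        !parse_hex8(name + 8, &log) ||
        !parse_hex8(name + 16, &seg))
        return false;
    if (seg >= SEGS_PER_XLOGID)
        return false;           /* not a possible name for this segment size */
    *segno = (uint64_t) log * SEGS_PER_XLOGID + seg;
    return true;
}
```

An old-style argument such as 0x01,0x01,0x01 fails the length check, so it would be rejected rather than silently misread.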

BTW, does pg_resetxlog's logic for setting the default -l value (from
scanning pg_xlog to find the largest existing file name) still work?

regards, tom lane


From: Amit Kapila <amit(dot)kapila(at)huawei(dot)com>
To: "'Tom Lane'" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "'Heikki Linnakangas'" <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: "'Robert Haas'" <robertmhaas(at)gmail(dot)com>, "'Fujii Masao'" <masao(dot)fujii(at)gmail(dot)com>, "'Andres Freund'" <andres(at)2ndquadrant(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WAL format changes
Date: 2012-06-26 03:14:50
Message-ID: 001501cd5349$d9e82480$8db86d80$@kapila@huawei.com
Lists: pgsql-hackers

From: pgsql-hackers-owner(at)postgresql(dot)org
[mailto:pgsql-hackers-owner(at)postgresql(dot)org] On Behalf Of Tom Lane
Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
>> So I think we should change pg_resetxlog -l option to take a WAL file
>> name as argument, and fix pg_upgrade accordingly.

> Seems reasonable I guess. It's really specifying a starting WAL
> location, but only to file granularity, so treating the argument as a
> file name is sort of a type cheat but seems convenient.

> If we do it that way, we'd better validate that the argument is a legal
> WAL file name, so as to catch any cases where somebody tries to do it
> old-style.

> BTW, does pg_resetxlog's logic for setting the default -l value (from
> scanning pg_xlog to find the largest existing file name) still work?

It finds the segment number of the largest existing file name in pg_xlog and
then compares it with the value the user provided for the -l option. If the
input is greater, it uses the input as the value to set in the control file.

With Regards,
Amit Kapila.


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgreSQL(dot)org>
Subject: Re: WAL format changes
Date: 2012-07-06 21:06:58
Message-ID: 1341608818.7092.12.camel@vanquo.pezone.net
Lists: pgsql-hackers

On fre, 2012-06-15 at 00:01 +0300, Heikki Linnakangas wrote:
> 1. Use a 64-bit segment number, instead of the log/seg combination. And
> don't waste the last segment on each logical 4 GB log file. The concept
> of a "logical log file" is now completely gone. XLogRecPtr is unchanged,
> but it should now be understood as a plain 64-bit value, just split into
> two 32-bit integers for historical reasons. On disk, this means that
> there will be log files ending in FF, those were skipped before.

A thought on this. There were some concerns that this would silently
break tools that pretend to have detailed knowledge of WAL file
numbering and this previous behavior of skipping the FF files. We could
address this by "fixing" the overall file naming from something like

00000001000008D0000000FD
00000001000008D0000000FE
00000001000008D0000000FF
00000001000008D100000000

to

00000001000008D0FD000000
00000001000008D0FE000000
00000001000008D0FF000000
00000001000008D100000000

which represents the new true WAL stream numbering as opposed to the old
two-part numbering.

Thus, any tool that thinks it knows how the WAL files are sequenced will
break very obviously, but any tool that just looks for 24 hexadecimal
digits will be fine.
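
The two naming schemes can be compared with a short sketch. It assumes 16 MB segments, and it reads the proposal as "timeline followed by the 64-bit byte position of the segment start", which reproduces the example names above. current_wal_name and proposed_wal_name are hypothetical helpers.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Assumption: 16 MB segments. */
#define SEG_SIZE        UINT64_C(0x1000000)
#define SEGS_PER_XLOGID (UINT64_C(0x100000000) / SEG_SIZE)
#define WAL_NAME_LEN    24

/* Existing style: timeline, then segment number split as log / seg. */
static void
current_wal_name(char *buf, size_t buflen, uint32_t tli, uint64_t segno)
{
    assert(buf != NULL && buflen > WAL_NAME_LEN);
    snprintf(buf, buflen, "%08" PRIX32 "%08" PRIX32 "%08" PRIX32,
             tli,
             (uint32_t) (segno / SEGS_PER_XLOGID),
             (uint32_t) (segno % SEGS_PER_XLOGID));
    assert(strlen(buf) == WAL_NAME_LEN);
}

/* Proposed style: timeline, then the byte position where the segment starts. */
static void
proposed_wal_name(char *buf, size_t buflen, uint32_t tli, uint64_t segno)
{
    assert(buf != NULL && buflen > WAL_NAME_LEN);
    snprintf(buf, buflen, "%08" PRIX32 "%016" PRIX64, tli, segno * SEG_SIZE);
    assert(strlen(buf) == WAL_NAME_LEN);
}
```

Both schemes produce 24 hexadecimal digits, and in this sketch they coincide for the first segment of each 4 GB range, such as 00000001000008D100000000 in the example.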

I wonder if any tools in the former category would also break if one
changes XLOG_SEG_SIZE.


From: Greg Stark <stark(at)mit(dot)edu>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WAL format changes
Date: 2012-07-06 21:24:12
Message-ID: CAM-w4HMe_Ty-6bbqE_3A55q8M5OnjdPrqQJXBW51tM6nOi9L9A@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jun 14, 2012 at 10:01 PM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> This has the advantage that you can calculate the CRC for all the other
> fields before acquiring WALInsertLock. For xl_prev, you need to know where
> exactly the record is inserted, so it's handy that it's the last field
> before CRC.

It may be late to mention this but fwiw you don't need to reorder the
fields to do this. CRC has the property that you can easily adjust it
for any changes to the data covered by it. Regardless of where the
xl_prev link is you can calculate the CRC as if xl_prev is 0 and then
once you get the lock "add in" the correct xl_prev. This is an
argument in favour of using CRC over other checksums for which that
would be hard or impossible.
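
The property can be demonstrated with a small sketch. It uses the common bitwise CRC-32 and a made-up 32-byte record with an 8-byte "prev" field at offset 8; the layout, the constants and the helper names are assumptions for illustration and do not reflect the real XLogRecord or the exact CRC variant used for WAL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Plain bitwise CRC-32 (reflected polynomial 0xEDB88320). */
static uint32_t
crc32_bitwise(const unsigned char *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;

    assert(data != NULL || len == 0);
    for (size_t i = 0; i < len; i++)
    {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++)
        {
            if (crc & 1u)
                crc = (crc >> 1) ^ 0xEDB88320u;
            else
                crc >>= 1;
        }
    }
    return crc ^ 0xFFFFFFFFu;
}

/* Hypothetical record layout, for illustration only. */
#define REC_LEN  32
#define PREV_OFF 8
#define PREV_LEN 8

/*
 * Correction to XOR into a CRC that was computed with the prev field
 * zeroed. For equal-length messages a and b,
 * crc(a ^ b) == crc(a) ^ crc(b) ^ crc(all zeros), so the correction is
 * crc(record holding only prev) ^ crc(all zeros).
 */
static uint32_t
crc_delta_for_prev(const unsigned char *prev)
{
    unsigned char zeros[REC_LEN] = {0};
    unsigned char only_prev[REC_LEN] = {0};

    assert(prev != NULL);
    for (size_t i = 0; i < PREV_LEN; i++)
        only_prev[PREV_OFF + i] = prev[i];
    return crc32_bitwise(only_prev, REC_LEN) ^ crc32_bitwise(zeros, REC_LEN);
}
```

In this sketch the correction itself costs two more CRC passes over a record-sized buffer; a real implementation would presumably use a cheaper way to advance the CRC over the zero bytes, which is outside the scope of this illustration.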

--
greg