Re: WAL replay bugs

From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WAL replay bugs
Date: 2014-06-13 07:14:55
Message-ID: CAB7nPqTm9Xx5rHY6uSjfreBvLe4cKmDn2Ngh8wXCmaPw+HLdBg@mail.gmail.com
Lists: pgsql-hackers

On Mon, Jun 2, 2014 at 9:55 PM, Michael Paquier
<michael(dot)paquier(at)gmail(dot)com> wrote:
> On Wed, Apr 23, 2014 at 9:43 PM, Heikki Linnakangas
> <hlinnakangas(at)vmware(dot)com> wrote:
> Perhaps there are parts of what is proposed here that could be made
> more generalized, like the masking functions. So do not hesitate if
> you have any opinion on the matter.
OK, attached is the result of this hacking:

Buffer capture facility: check WAL replay consistency

This tool is aimed at developers and buildfarm machines. It checks
page-level consistency when WAL files are replayed across several
nodes of a cluster (generally a master and a standby node).

This facility is made of two parts:
- A server part, where all the changes happening at page level are
captured and inserted in a file called buffer_captures located at the
root of PGDATA. Each buffer entry is masked to make the comparison
across nodes consistent (flags like hint bits, for example), and each
buffer is then captured as a single line of the output file in the
following format:
LSN: %08X/%08X page: PAGE_IN_HEXA
The hexadecimal format makes it easier to detect differences between
pages, and the line format is chosen to facilitate comparison between
buffer entries.
- A client part, located in contrib/buffer_capture_cmp, that can be
used to compare buffer captures between nodes.
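
As an illustration of the line format described above, here is a minimal
sketch in C. It is not taken from the patch: the function names
format_capture_line() and first_difference() are hypothetical, and the use
of upper-case hex digits for the page bytes is an assumption (the message
only specifies %08X for the two LSN halves).

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: write one capture line of the form
 *   LSN: %08X/%08X page: <page bytes in hex>
 * into out. Returns the number of characters written (excluding the
 * terminating NUL), or -1 if out is too small.
 */
int
format_capture_line(uint64_t lsn, const unsigned char *page, size_t page_len,
                    char *out, size_t out_len)
{
    static const char hexdigits[] = "0123456789ABCDEF"; /* case is assumed */
    int     n;
    size_t  i;

    n = snprintf(out, out_len, "LSN: %08X/%08X page: ",
                 (unsigned int) (lsn >> 32),
                 (unsigned int) (lsn & 0xFFFFFFFF));
    if (n < 0 || (size_t) n + 2 * page_len + 1 > out_len)
        return -1;

    /* Two hex digits per page byte, so equal pages give equal lines. */
    for (i = 0; i < page_len; i++)
    {
        out[n++] = hexdigits[page[i] >> 4];
        out[n++] = hexdigits[page[i] & 0x0F];
    }
    out[n] = '\0';
    return n;
}

/*
 * Hypothetical helper: return the offset of the first character at which
 * two capture lines differ, or -1 if they are identical.
 */
long
first_difference(const char *a, const char *b)
{
    size_t  i = 0;

    while (a[i] != '\0' && a[i] == b[i])
        i++;
    return (a[i] == b[i]) ? -1 : (long) i;
}
```

Because every page byte maps to exactly two characters at a fixed position,
the offset returned by first_difference() points directly at the byte that
differs between the two nodes.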

The footprint on core code is minimal and is controlled by a symbol
called BUFFER_CAPTURE that needs to be set at build time to enable the
buffer capture at server level. If this symbol is not enabled, both
server and client parts are idle and generate nothing.
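
A build-time switch of this kind can be sketched as follows. This is only
an illustration of the pattern, not the patch's code: capture_line() is a
hypothetical name, and only the BUFFER_CAPTURE symbol comes from the
description above.

```c
#include <stdio.h>

/*
 * Hypothetical function, for illustration only. When BUFFER_CAPTURE is
 * defined at build time, append one line to the capture file; otherwise
 * do nothing and return 0.
 */
int
capture_line(FILE *capture_file, const char *line)
{
#ifdef BUFFER_CAPTURE
    return fprintf(capture_file, "%s\n", line);
#else
    (void) capture_file;    /* unused when capture is compiled out */
    (void) line;
    return 0;
#endif
}
```

Building with -DBUFFER_CAPTURE enables the write path; without it the
function has no effect, which keeps the cost of the facility out of normal
builds.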

Note that this facility can generate a lot of output (11G when running
the regression tests, and twice that when capturing on both master and
standby).

contrib/buffer_capture_cmp contains a regression test facility that
eases testing with buffer captures. The user just needs to run "make
check" in this folder. A default set of tests is saved in
test-default.sh, but the user is free to set up custom tests by
creating a file called test-custom.sh; if this file is present, the
test facility runs it instead of the defaults.

The patch will be added to the first commit fest as well. Note that
the footprint on core code is limited, so even though there are more
than 1k lines of code, review is simpler than it looks.

A couple of things to note though:
1) In order to detect if a page is used for a sequence, SEQ_MAGIC
needs to be exposed in sequence.h. This is included in the patch
attached, but perhaps it should be done as a separate patch.
2) Regression test facility uses some useful parts taken from
pg_upgrade. I think that we should gather those parts in a common
place (contrib/common?). This can facilitate the integration of other
modules using regression based on bash scripts.
3) While hacking this facility, I noticed that some ItemId entries in
btree pages could be inconsistent between master and standby. Those
items are masked in the current patch, but it looks like a bug in
Postgres itself.

Documentation is added in the code itself; I didn't feel any need to
expose this facility to ordinary users in doc/src/sgml...
Regards,
--
Michael

Attachment Content-Type Size
0001-Buffer-capture-facility-check-WAL-replay-consistency.patch text/plain 41.7 KB
