Re: Review: Patch to compute Max LSN of Data Pages

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Hari Babu <haribabu(dot)kommi(at)huawei(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Review: Patch to compute Max LSN of Data Pages
Date: 2013-10-18 15:31:14
Message-ID: CA+TgmoZqbYHWYbi18FUk-+dGHig=icV+pj-NNav3miy4dajgrQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Oct 9, 2013 at 2:28 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Sat, Sep 14, 2013 at 3:03 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>>>On Monday, July 08, 2013 5:16 PM Andres Freund wrote:
>>>>On 2013-07-08 17:10:43 +0530, Amit Kapila wrote:
>>>>> On Monday, July 08, 2013 4:26 PM Andres Freund wrote:
>>>>> > On 2013-07-08 16:17:54 +0530, Hari Babu wrote:
>>>>> > > + This utility can also be used to decide whether backup is
>>>>> > required or not when the data page
>>>>> > > + in old-master precedes the last applied LSN in old-standby
>>>>> > (i.e., new-master) at the
>>>>> > > + moment of the failover.
>>>>> > > + </para>
>>>>> > > + </refsect1>
>>>>> >
>>>>> > I don't think this is safe in any interesting set of cases. Am I
>>>>> > missing
>>>>> > something?
>>>>>
>>>>> No, you are not missing anything. It can be only used to find max LSN in
>>>>> database which can avoid further corruption
>>
>>>>Why is the patch submitted documenting it as a use-case then? I find it
>>>>rather scary if the *patch authors* document a known unsafe use case as
>>>>one of the known use-cases.
>>
>>>I got the problem which can occur with the specified use case. Removed the
>>>wrong use case specified above.
>>>Thanks for the review, please find the updated patch attached in the mail.
>>
>> Patch is not getting compiled on Windows, I had made following changes:
>> a. updated the patch for resolving Windows build
>> b. few documentation changes in (pg_resetxlog.sgml) for spelling
>> mistake and minor line change
>> c. corrected year for Copyright in file pg_computemaxlsn.c
>
> I am OK with this patch in its current form, modulo some grammar
> issues in the documentation which I can fix before committing.
> However, I'm unclear whether there was sufficient consensus to proceed
> with this. Can others weigh in? If there is too much residual
> unhappiness with this, then we should just mark this as Rejected and
> stop wasting time on it; it can be pushed to PGXN or similar even if
> we don't put it in core.

I didn't hear any other votes. Anyone else have an opinion about
this? If I can't get a +1 from anyone who wasn't involved in writing
the patch, I'm inclined to think we don't have sufficient consensus to
commit this.

On further review of the patch, I also found a number of other issues
that I think need to be fixed before we could consider committing it:

- Per a previous request of mine, the patch has three different modes:
it can be run on an individual file, on a directory potentially
containing multiple relation files, or on an entire data directory.
This is not explained in the documentation.

- The patch skips printing an error if attempting to open a file
returns ENOENT. I don't see why that case shouldn't print an error.
Yeah, it could be legit if you're executing this against a running
server, but why are you doing that? And even if you are (e.g. for
corruption detection), printing a error message and proceeding makes
more sense than proceeding without printing anything, at least IMHO.

- Some of the static functions in this file preceed main and others
follow it. And they use different naming conventions. I suggest
putting all of them after main() and using names like check_data_dir,
check_data_file_name (instead of validateRelfilenodename; note that
the comment for that function is also incorrect),
find_max_lsn_in_file, find_max_lsn_in_directory,
find_max_lsn_in_pgdata.

- Since the patch goes to some pains to exit with 0 if no errors were
encountered and 1 if any were, that probably ought to be documented.
But I think instead of passing a "result" argument back up the call
stack it would be better to just keep a global variable called
"errors_encountered". When we encounter an error, bump that value.
Then you can just do exit(errors_encountered > 0 ? 1 : 0) and not
bother passing all of this stuff around via the return value. The
current behavior is inconsistent in another way, too: if you encounter
an error while scanning through a particular directory, you finish the
whole directory. But if multiple command-line arguments were passed,
you don't proceed to any subsequent command-line arguments. I think
you should continue always, which the above-mentioned change will take
care of basically for free.

- The description of the -q option is not very clear. A reader could
be forgiven for thinking that the option suppresses all output, which
would be quite useless, or at least left in doubt about what output
will still be provided.

A broader complaint I have with this patch is that it almost but
not-quite solves a problem I've had a few times in the past: namely,
searching through the data directory for data blocks which have LSNs
in the future. This has come up a few times for me, and this tool
would make it easier, because I'd be able to run it and look through
the output to see which relations have high max-LSN values. However,
it wouldn't be quite enough, because it'd only tell me about the block
with the highest LSN in each file, whereas what I'd really want to
find is every block with an LSN greater than some threshold value.
Maybe I'm pushing the envelope too much by trying to fit that into the
framework of this patch, but I can't help thinking we're not going to
want both pg_computemaxlsn and pg_findlsnsaftersomethreshold that are
95% the same code, so maybe we ought to rename the utility to
something slightly more generic than "pg_computemaxlsn".

In some ways this is almost a plausible shell for a generic
block-corruption checker. If we wanted to validate, for example, that
every page in a relation (or a whole data directory) had the expected
page version number, we'd want almost exactly this same code
structure, just with more checks added in.

I'm going to mark this "Returned with Feedback" in the CF app for now.
I still think that there's promise in this idea but, on further
reflection, committing what we have here now doesn't feel like the
right decision.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2013-10-18 15:41:33 Re: psql tab completion for updatable foreign tables
Previous Message Tom Lane 2013-10-18 15:17:45 Re: ERROR : 'tuple concurrently updated'