Re: Review: Patch to compute Max LSN of Data Pages

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, Amit kapila <amit(dot)kapila(at)huawei(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Review: Patch to compute Max LSN of Data Pages
Date: 2013-07-02 19:55:34
Message-ID: CA+Tgmoag8myqtu9x6M_AcMSE8sNECuKhWeDuLdNOnOf8-0Kbgg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Jun 25, 2013 at 1:42 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> I think the usecase for this utility isn't big enough to be included in
> postgres since it really can only help in a very limited
> circumstances. And I think it's too likely to be misused for stuff it's
> not useable for (e.g. remastering).
>
> The only scenario I see is that somebody deleted/corrupted
> pg_controldata. In that scenario the tool is supposed to be used to find
> the biggest lsn used so far so the user then can use pg_resetxlog to set
> that as the wal starting point.
> But that can be way much easier solved by just setting the LSN to
> something very, very high. The database cannot be used for anything
> reliable afterwards anyway.

I guess this is true, but I think I'm mildly in favor of including
this anyway. I think I would have used it once or twice, if it had
been there - maybe not even to feed into pg_resetxlog, but just to
check for future LSNs. We don't have anything like a suite of
diagnostic tools in bin or contrib today, for use by database
professionals in fixing things that fall strictly in the realm of
"don't try this at home", and I think we should have such a thing.
Unfortunately this covers about 0.1% of the space I'd like to see
covered, which might be a reason to reject it or at least insist that
it be enhanced first.

At any rate, I don't think this is anywhere near committable as it
stands today. Some random review comments:

remove_parent_refernces() is spelled wrong.

Why does this patch need all of this fancy path-handling logic -
specifically, remove_parent_refernces() and make_absolute_path()?
Surely its needs are not that different from pg_ctl or pg_resetxlog or
pg_controldata. If they all need it and it's duplicated, we should
fix that. Otherwise, why the special logic here?

I don't think we need getLinkPath() either. The server has no trouble
finding its files by just using a pathname that follows the symlink.
Why do we need anything different here? The whole point of symlinks
is that you can traverse them transparently, without knowing where
they point.

Duplicating PageHeaderIsValid doesn't seem acceptable. Moreover,
since the point of this is to be able to use it on a damaged cluster,
why is that logic even desirable? I think we'd be better off assuming
the pages to be valid.

The calling convention for this utility seems quite baroque. There's
no real need for all of this -p/-P stuff. I think the syntax should
just be:

pg_computemaxlsn file_or_directory...

For each argument, we determine whether it's a file or a directory.
If it's a file, we assume it's a PostgreSQL data file and scan it. If
it's a directory, we check whether it looks like a data directory. If
it does, we recurse through the whole tree structure and find the data
files, and process them. If it doesn't look like a data directory, we
scan each plain file in that directory whose name looks like a
PostgreSQL data file name. With this approach, there's no need to
limit ourselves to a single input argument and no need to specify what
kind of argument it is; the tool just figures it out.

I think it would be a good idea to have a mode that prints out the max
LSN found in *each data file* scanned, and then prints out the overall
max found at the end. In fact, I think that should perhaps be the
default mode, with -q, --quiet to disable it. When printing out the
per-file data, I think it would be useful to track and print the block
number where the highest LSN in that file was found. I have
definitely had cases where I suspected, but was not certain of,
corruption. One could use a tool like this to hunt for problems, and
then use something like pg_filedump to dump the offending blocks.
That would be a lot easier than running pg_filedump on the whole file
and grepping through the output.

Similarly, I see no reason for the restriction imposed by
check_path_belongs_to_pgdata(). I've had people mail me one data
file; why shouldn't I be able to run this tool on it? It's a
read-only utility.

- if (0 != FindMaxLSNinDir(pathbuf, maxlsn, false)) and similar is not
PostgreSQL style.

+ printf(_("%s compute the maximum LSN in PostgreSQL data
pages.\n\n"), progname);

s/compute/computes/

+ /*
+ * Don't allow pg_computemaxlsn to be run as root, to avoid overwriting
+ * the ownership of files in the data directory. We need only check for
+ * root -- any other user won't have sufficient permissions to modify
+ * files in the data directory.
+ */
+ #ifndef WIN32
+ if (geteuid() == 0)
+ {
+ fprintf(stderr, _("%s: cannot be executed by \"root\".\n"),
+ progname);
+ fprintf(stderr, _("You must run %s as the PostgreSQL
superuser.\n"),
+ progname);
+ exit(1);
+ }
+ #endif

This utility only reads files; it does not modify them. So this seems
unnecessary. I assume it's blindly copied from somewhere else.

+ fprintf(stderr, _("%s: \"%s\" not a valid data directory.\n"),

Extra space.

+ /*
+ * Check for a postmaster lock file --- if there is one, refuse to
+ * proceed, on grounds we might be interfering with a live installation.
+ */
+ snprintf(path, MAXPGPATH, "%s/postmaster.pid", DataDir);

Again, this might be appropriate for pg_resetxlog, but I see no reason
for the restriction here. The output might be inaccurate in that
case, but if you're using this tool you're required to know what
you're doing.

+ For safety reasons, you must specify the data directory on the
command line.
+ <command>pg_computemaxlsn</command> does not use the environment variable
+ <envar>PGDATA</>.

Same thing here. I think using PGDATA would be quite appropriate for
this utility.

+ <para>
+ This command must not be used when the server is
+ running. <command>pg_computemaxlsn</command> will refuse to start up if
+ it finds a server lock file in the data directory. If the
+ server crashed then a lock file might have been left
+ behind; in that case you can remove the lock file to allow
+ <command>pg_computemaxlsn</command> to run. But before you do
+ so, make doubly certain that there is no server process still alive.
+ </para>

More of the same paranoia.

Overall my feeling is that this can be simplified quite a lot.
There's a lot of stuff in here that's really not needed.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bruce Momjian 2013-07-02 19:56:10 Re: [HACKERS] Add contrib module functions to docs' function index
Previous Message Hannu Krosing 2013-07-02 19:00:48 Re: [9.4 CF 1] The Commitfest Slacker List