Re: Cleaning up unreferenced table files

Lists: pgsql-hackerspgsql-patches
From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: PostgreSQL-patches <pgsql-patches(at)postgresql(dot)org>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Subject: Re: Cleaning up unreferenced table files
Date: 2005-05-02 18:30:03
Message-ID: 200505021830.j42IU3l21583@candle.pha.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches


Applied.

---------------------------------------------------------------------------

pgman wrote:
> Tom Lane wrote:
> > Heikki Linnakangas <hlinnaka(at)iki(dot)fi> writes:
> > > On Tue, 26 Apr 2005, Tom Lane wrote:
> > >> Heikki Linnakangas <hlinnaka(at)iki(dot)fi> writes:
> > >>> I feel that crashes that leaves behind stale files are rare.
> > >>
> > >> Indeed, and getting more so all the time ...
> >
> > > How so? Have changes been made in those parts of the code?
> >
> > No, just that the overall reliability of Postgres keeps improving.
> >
> > At the time that TODO entry was created, I don't think we even had the
> > ability to roll back table creates/drops properly, so there were
> > scenarios in which unreferenced files could be left behind without even
> > assuming any software error. And the prevalence of backend crashes was
> > way higher than it is now, too. So I think a good argument could be
> > made that the TODO item isn't nearly as important as it was at the time.
>
> Agreed, but I don't think we are ever going to have the file system obey
> the same semantics as the database, so there will always be corner cases
> where we have the possibility for unreferenced files.
>
> > > If nobody ever runs into this issue in production, and this whole exercise
> > > turns out to be completely unnecessary, at least we'll know. That alone
> > > makes me feel better.
> >
> > We will know no such thing, unless the patch is made to announce the
> > problem so intrusively that people are certain to see it *and report it
> > to us*. Which I don't think will be acceptable.
>
> You are right that checking only after a server crash isn't going to
> help us very much, because the odds someone is going to look at the logs
> just at that moment are slim, and if they stop and start the server
> again, the message will not appear so they will think it is fixed.
>
> Though the problem is caused by a server crash, it remains even after a
> clean startup. My new version of the patch checks not just after a
> server crash, but every time the server starts up. Here are some
> timings:
>
> 3dbs without patch 0m0.230s 300 files
> 3dbs with patch 0m0.230s 300 files
> 100dbs with patch 0m0.270s 10000 files
>
> (The timing test started the server and waited for the message "database
> system is ready".)
>
> You can see that for three empty databases (only system tables/indexes),
> the patch has no impact. For 100 databases, or 10,000 files, there is a
> 17% increase in startup time, which seems acceptable.
>
> The message printed in the logs is:
>
> LOG: The table or index file "/u/pgsql/data/base/1/100001" is stale and can be safely removed
>
> I made a few changes to the patch. I changed 'clean' to 'check' because
> I don't think we are ever going to remove the files on startup (If we
> did we could do the cleanup just on crash recovery startup.) I changed
> the message from NOTICE to LOG, I made it run on all startups and not
> just crash recovery, and it was missing FreeDir() calls.
>
> --
> Bruce Momjian | http://candle.pha.pa.us
> pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
> + If your life is a hard drive, | 13 Roberts Road
> + Christ can be your backup. | Newtown Square, Pennsylvania 19073

> Index: doc/src/sgml/maintenance.sgml
> ===================================================================
> RCS file: /cvsroot/pgsql/doc/src/sgml/maintenance.sgml,v
> retrieving revision 1.41
> diff -c -c -r1.41 maintenance.sgml
> *** doc/src/sgml/maintenance.sgml 20 Feb 2005 02:21:26 -0000 1.41
> --- doc/src/sgml/maintenance.sgml 1 May 2005 04:30:48 -0000
> ***************
> *** 474,479 ****
> --- 474,496 ----
> </para>
> </sect1>
>
> + <sect1 id="check-files-after-crash">
> + <title>Check files after crash</title>
> +
> + <indexterm zone="check-files-after-crash">
> + <primary>stale file</primary>
> + </indexterm>
> +
> + <para>
> + <productname>PostgreSQL</productname> recovers automatically after crash
> + using the write-ahead log (see <xref linkend="wal">) and no manual
> + operations are normally needed. However, if there was a transaction running
> + when the crash occured that created or dropped a relation, the
> + transaction might have left a stale file in the data directory. If this
> + happens, you will get a notice in the log file stating which files can be
> + deleted.
> + </para>
> + </sect1>
>
> <sect1 id="logfile-maintenance">
> <title>Log File Maintenance</title>
> Index: src/backend/access/transam/xlog.c
> ===================================================================
> RCS file: /cvsroot/pgsql/src/backend/access/transam/xlog.c,v
> retrieving revision 1.189
> diff -c -c -r1.189 xlog.c
> *** src/backend/access/transam/xlog.c 28 Apr 2005 21:47:10 -0000 1.189
> --- src/backend/access/transam/xlog.c 1 May 2005 04:30:52 -0000
> ***************
> *** 43,48 ****
> --- 43,49 ----
> #include "utils/builtins.h"
> #include "utils/guc.h"
> #include "utils/relcache.h"
> + #include "utils/flatfiles.h"
>
>
> /*
> ***************
> *** 4525,4530 ****
> --- 4526,4533 ----
>
> CreateCheckPoint(true, true);
>
> + CheckStaleRelFiles();
> +
> /*
> * Close down recovery environment
> */
> ***************
> *** 4536,4541 ****
> --- 4539,4550 ----
> */
> remove_backup_label();
> }
> + else
> + {
> + XLogInitRelationCache();
> + CheckStaleRelFiles();
> + XLogCloseRelationCache();
> + }
>
> /*
> * Preallocate additional log files, if wanted.
> Index: src/backend/catalog/catalog.c
> ===================================================================
> RCS file: /cvsroot/pgsql/src/backend/catalog/catalog.c,v
> retrieving revision 1.59
> diff -c -c -r1.59 catalog.c
> *** src/backend/catalog/catalog.c 14 Apr 2005 20:03:23 -0000 1.59
> --- src/backend/catalog/catalog.c 1 May 2005 04:30:52 -0000
> ***************
> *** 106,111 ****
> --- 106,144 ----
> return path;
> }
>
> + /*
> + * GetTablespacePath - construct path to a tablespace symbolic link
> + *
> + * Result is a palloc'd string.
> + *
> + * XXX this must agree with relpath and GetDatabasePath!
> + */
> + char *
> + GetTablespacePath(Oid spcNode)
> + {
> + int pathlen;
> + char *path;
> +
> + Assert(spcNode != GLOBALTABLESPACE_OID);
> +
> + if (spcNode == DEFAULTTABLESPACE_OID)
> + {
> + /* The default tablespace is {datadir}/base */
> + pathlen = strlen(DataDir) + 5 + 1;
> + path = (char *) palloc(pathlen);
> + snprintf(path, pathlen, "%s/base",
> + DataDir);
> + }
> + else
> + {
> + /* All other tablespaces have symlinks in pg_tblspc */
> + pathlen = strlen(DataDir) + 11 + OIDCHARS + 1;
> + path = (char *) palloc(pathlen);
> + snprintf(path, pathlen, "%s/pg_tblspc/%u",
> + DataDir, spcNode);
> + }
> + return path;
> + }
>
> /*
> * IsSystemRelation
> Index: src/backend/commands/tablespace.c
> ===================================================================
> RCS file: /cvsroot/pgsql/src/backend/commands/tablespace.c,v
> retrieving revision 1.17
> diff -c -c -r1.17 tablespace.c
> *** src/backend/commands/tablespace.c 14 Apr 2005 20:03:24 -0000 1.17
> --- src/backend/commands/tablespace.c 1 May 2005 04:30:53 -0000
> ***************
> *** 341,348 ****
> /*
> * All seems well, create the symlink
> */
> ! linkloc = (char *) palloc(strlen(DataDir) + 11 + 10 + 1);
> ! sprintf(linkloc, "%s/pg_tblspc/%u", DataDir, tablespaceoid);
>
> if (symlink(location, linkloc) < 0)
> ereport(ERROR,
> --- 341,347 ----
> /*
> * All seems well, create the symlink
> */
> ! linkloc = GetTablespacePath(tablespaceoid);
>
> if (symlink(location, linkloc) < 0)
> ereport(ERROR,
> ***************
> *** 495,502 ****
> char *subfile;
> struct stat st;
>
> ! location = (char *) palloc(strlen(DataDir) + 11 + 10 + 1);
> ! sprintf(location, "%s/pg_tblspc/%u", DataDir, tablespaceoid);
>
> /*
> * Check if the tablespace still contains any files. We try to rmdir
> --- 494,500 ----
> char *subfile;
> struct stat st;
>
> ! location = GetTablespacePath(tablespaceoid);
>
> /*
> * Check if the tablespace still contains any files. We try to rmdir
> ***************
> *** 1036,1043 ****
> set_short_version(location);
>
> /* Create the symlink if not already present */
> ! linkloc = (char *) palloc(strlen(DataDir) + 11 + 10 + 1);
> ! sprintf(linkloc, "%s/pg_tblspc/%u", DataDir, xlrec->ts_id);
>
> if (symlink(location, linkloc) < 0)
> {
> --- 1034,1040 ----
> set_short_version(location);
>
> /* Create the symlink if not already present */
> ! linkloc = GetTablespacePath(xlrec->ts_id);
>
> if (symlink(location, linkloc) < 0)
> {
> Index: src/backend/utils/adt/misc.c
> ===================================================================
> RCS file: /cvsroot/pgsql/src/backend/utils/adt/misc.c,v
> retrieving revision 1.40
> diff -c -c -r1.40 misc.c
> *** src/backend/utils/adt/misc.c 31 Dec 2004 22:01:22 -0000 1.40
> --- src/backend/utils/adt/misc.c 1 May 2005 04:30:53 -0000
> ***************
> *** 26,31 ****
> --- 26,32 ----
> #include "funcapi.h"
> #include "catalog/pg_type.h"
> #include "catalog/pg_tablespace.h"
> + #include "catalog/catalog.h"
>
> #define atooid(x) ((Oid) strtoul((x), NULL, 10))
>
> ***************
> *** 144,154 ****
>
> fctx = palloc(sizeof(ts_db_fctx));
>
> - /*
> - * size = path length + tablespace dirname length + 2 dir sep
> - * chars + oid + terminator
> - */
> - fctx->location = (char *) palloc(strlen(DataDir) + 11 + 10 + 1);
> if (tablespaceOid == GLOBALTABLESPACE_OID)
> {
> fctx->dirdesc = NULL;
> --- 145,150 ----
> ***************
> *** 157,168 ****
> }
> else
> {
> ! if (tablespaceOid == DEFAULTTABLESPACE_OID)
> ! sprintf(fctx->location, "%s/base", DataDir);
> ! else
> ! sprintf(fctx->location, "%s/pg_tblspc/%u", DataDir,
> ! tablespaceOid);
> !
> fctx->dirdesc = AllocateDir(fctx->location);
>
> if (!fctx->dirdesc)
> --- 153,159 ----
> }
> else
> {
> ! fctx->location = GetTablespacePath(tablespaceOid);
> fctx->dirdesc = AllocateDir(fctx->location);
>
> if (!fctx->dirdesc)
> Index: src/backend/utils/init/Makefile
> ===================================================================
> RCS file: /cvsroot/pgsql/src/backend/utils/init/Makefile,v
> retrieving revision 1.18
> diff -c -c -r1.18 Makefile
> *** src/backend/utils/init/Makefile 20 Feb 2005 02:22:00 -0000 1.18
> --- src/backend/utils/init/Makefile 1 May 2005 04:30:53 -0000
> ***************
> *** 12,18 ****
> top_builddir = ../../../..
> include $(top_builddir)/src/Makefile.global
>
> ! OBJS = flatfiles.o globals.o miscinit.o postinit.o
>
> all: SUBSYS.o
>
> --- 12,18 ----
> top_builddir = ../../../..
> include $(top_builddir)/src/Makefile.global
>
> ! OBJS = flatfiles.o globals.o miscinit.o postinit.o checkfiles.o
>
> all: SUBSYS.o
>
> Index: src/include/catalog/catalog.h
> ===================================================================
> RCS file: /cvsroot/pgsql/src/include/catalog/catalog.h,v
> retrieving revision 1.30
> diff -c -c -r1.30 catalog.h
> *** src/include/catalog/catalog.h 31 Dec 2004 22:03:24 -0000 1.30
> --- src/include/catalog/catalog.h 1 May 2005 04:30:54 -0000
> ***************
> *** 19,24 ****
> --- 19,25 ----
>
> extern char *relpath(RelFileNode rnode);
> extern char *GetDatabasePath(Oid dbNode, Oid spcNode);
> + extern char *GetTablespacePath(Oid spcNode);
>
> extern bool IsSystemRelation(Relation relation);
> extern bool IsToastRelation(Relation relation);
> Index: src/include/utils/flatfiles.h
> ===================================================================
> RCS file: /cvsroot/pgsql/src/include/utils/flatfiles.h,v
> retrieving revision 1.1
> diff -c -c -r1.1 flatfiles.h
> *** src/include/utils/flatfiles.h 20 Feb 2005 02:22:07 -0000 1.1
> --- src/include/utils/flatfiles.h 1 May 2005 04:30:54 -0000
> ***************
> *** 30,33 ****
> --- 30,36 ----
>
> extern Datum flatfile_update_trigger(PG_FUNCTION_ARGS);
>
> + /* from checkfiles.c */
> + extern void CheckStaleRelFiles(void);
> +
> #endif /* FLATFILES_H */

> /*-------------------------------------------------------------------------
> *
> * checkfiles.c
> * support to clean up stale relation files on crash recovery
> *
> * If a backend crashes while in a transaction that has created or
> * deleted a relfilenode, a stale file can be left over in the data
> * directory. This file contains routines to clean up those stale
> * files on recovery.
> *
> * $PostgreSQL$
> *
> *-------------------------------------------------------------------------
> */
> #include "postgres.h"
>
> #include "storage/fd.h"
>
> #include "utils/flatfiles.h"
> #include "miscadmin.h"
> #include "catalog/pg_tablespace.h"
> #include "catalog/catalog.h"
> #include "access/skey.h"
> #include "utils/fmgroids.h"
> #include "access/relscan.h"
> #include "access/heapam.h"
> #include "utils/resowner.h"
>
> static void CheckStaleRelFilesFrom(Oid tablespaceoid, Oid dboid);
> static void CheckStaleRelFilesFromTablespace(Oid tablespaceoid);
>
> /* Like AllocateDir, but ereports on failure */
> static DIR *
> AllocateDirChecked(char *path)
> {
> DIR *dirdesc = AllocateDir(path);
> if (dirdesc == NULL)
> ereport(ERROR,
> (errcode_for_file_access(),
> errmsg("could not open directory \"%s\": %m",
> path)));
> return dirdesc;
> }
>
> /*
> * Scan through all tablespaces for relations left over
> * by aborted transactions.
> *
> * For example, if a transaction issues
> * BEGIN; CREATE TABLE foobar ();
> * and then the backend crashes, the file is left in the
> * tablespace until CheckStaleRelFiles deletes it.
> */
> void
> CheckStaleRelFiles(void)
> {
> DIR *dirdesc;
> struct dirent *de;
> char *path;
> int pathlen;
>
> pathlen = strlen(DataDir) + 11 + 1;
> path = (char *) palloc(pathlen);
> snprintf(path, pathlen, "%s/pg_tblspc/", DataDir);
> dirdesc = AllocateDirChecked(path);
> while ((de = readdir(dirdesc)) != NULL)
> {
> char *invalid;
> Oid tablespaceoid;
>
> /* Check that the directory name looks like valid tablespace link. */
> tablespaceoid = (Oid) strtol(de->d_name, &invalid, 10);
> if(invalid[0] == '\0')
> CheckStaleRelFilesFromTablespace(tablespaceoid);
> }
> FreeDir(dirdesc);
> pfree(path);
>
> CheckStaleRelFilesFromTablespace(DEFAULTTABLESPACE_OID);
> }
>
> /* Scan a specific tablespace for stale relations */
> static void
> CheckStaleRelFilesFromTablespace(Oid tablespaceoid)
> {
> DIR *dirdesc;
> struct dirent *de;
> char *path;
>
> path = GetTablespacePath(tablespaceoid);
>
> dirdesc = AllocateDirChecked(path);
> while ((de = readdir(dirdesc)) != NULL)
> {
> char *invalid;
> Oid dboid;
>
> dboid = (Oid) strtol(de->d_name, &invalid, 10);
> if(invalid[0] == '\0')
> CheckStaleRelFilesFrom(tablespaceoid, dboid);
> }
> FreeDir(dirdesc);
> pfree(path);
> }
>
> /* Scan a specific database in a specific tablespace for stale relations.
> *
> * First, pg_class for the database is opened, and the relfilenodes of all
> * relations mentioned there are stored in a hash table.
> *
> * Then the directory is scanned. Every file in the directory that's not
> * found in pg_class (the hash table) is logged.
> */
> static void
> CheckStaleRelFilesFrom(Oid tablespaceoid, Oid dboid)
> {
> DIR *dirdesc;
> struct dirent *de;
> HASHCTL hashctl;
> HTAB *relfilenodeHash;
> MemoryContext mcxt;
> RelFileNode rnode;
> char *path;
>
> /* We create a private memory context so that we can easily deallocate
> * the hash table and its contents
> */
> mcxt = AllocSetContextCreate(TopMemoryContext, "CheckStaleRelFiles",
> ALLOCSET_DEFAULT_MINSIZE,
> ALLOCSET_DEFAULT_INITSIZE,
> ALLOCSET_DEFAULT_MAXSIZE);
>
> hashctl.hash = tag_hash;
> /* The entry contents is not used for anything, we just check
> * if an oid is in the hash table or not. */
> hashctl.keysize = sizeof(Oid);
> hashctl.entrysize = 1;
> hashctl.hcxt = mcxt;
> relfilenodeHash = hash_create("relfilenodeHash", 100, &hashctl,
> HASH_FUNCTION
> | HASH_ELEM | HASH_CONTEXT);
>
> /* Read all relfilenodes from pg_class into the hash table */
> {
> ResourceOwner owner, oldowner;
> Relation rel;
> HeapScanDesc scan;
> HeapTuple tuple;
>
> /* Need a resowner to keep the heapam and buffer code happy */
> owner = ResourceOwnerCreate(NULL, "CheckStaleRelFiles");
> oldowner = CurrentResourceOwner;
> CurrentResourceOwner = owner;
>
> rnode.spcNode = tablespaceoid;
> rnode.dbNode = dboid;
> rnode.relNode = RelationRelationId;
> rel = XLogOpenRelation(true, 0 , rnode);
>
> scan = heap_beginscan(rel, SnapshotNow, 0, NULL);
> while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
> {
> Form_pg_class classform = (Form_pg_class) GETSTRUCT(tuple);
>
> hash_search(relfilenodeHash, &classform->relfilenode,
> HASH_ENTER, NULL);
> }
> heap_endscan(scan);
>
> XLogCloseRelation(rnode);
> CurrentResourceOwner = oldowner;
> ResourceOwnerDelete(owner);
> }
>
> /* Scan the directory */
> path = GetDatabasePath(dboid, tablespaceoid);
>
> dirdesc = AllocateDirChecked(path);
> while ((de = readdir(dirdesc)) != NULL)
> {
> char *invalid;
> Oid relfilenode;
>
> relfilenode = strtol(de->d_name, &invalid, 10);
> if(invalid[0] == '\0')
> {
> /* Filename was a valid number, check if pg_class knows
> about it */
> if(hash_search(relfilenodeHash, &relfilenode,
> HASH_FIND, NULL) == NULL)
> {
> char *filepath;
> rnode.spcNode = tablespaceoid;
> rnode.dbNode = dboid;
> rnode.relNode = relfilenode;
>
> filepath = relpath(rnode);
>
> ereport(LOG,
> (errcode_for_file_access(),
> errmsg("The table or index file \"%s\" is stale and can be safely removed",
> filepath)));
> pfree(filepath);
> }
> }
> }
> FreeDir(dirdesc);
> pfree(path);
> hash_destroy(relfilenodeHash);
> MemoryContextDelete(mcxt);
> }

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, PostgreSQL-patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Cleaning up unreferenced table files
Date: 2005-05-06 00:55:24
Message-ID: 8291.1115340924@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> Applied.

Now that I've had a chance to look at it, this patch is thoroughly
broken. Problems observed in a quick review:

1. It doesn't work at all for non-default tablespaces: it will
claim that every file in such a tablespace is stale. The fact
that it does that rather than failing entirely is accidental.
It tries to read the database's pg_class in the target tablespace
whether it's there or not. Because the system is still in recovery
mode, the low-level routines allow the access to the nonexistent
pg_class table to pass --- in fact they think they should create
the file, so after it runs there's a bogus empty "1259" file in each
such tablespace (which of course it complains about, too). The code
then proceeds to think that pg_class is empty so of course everything
draws a warning.

2. It's not robust against stale subdirectories of a tablespace
(ie, subdirs corresponding to a nonexistent database) --- again,
it'll try to read a nonexistent pg_class. Then it'll produce a
bunch of off-target complaint messages.

3. It's assuming that relfilenode is unique database-wide, when no
such assumption is safe. We only have a guarantee that it's unique
tablespace-wide.

4. It fails to examine table segment files (such as "nnn.1"). These
should be complained of when the "nnn" doesn't match any hash entry.

5. It will load every relfilenode value in pg_class into the hashtable
whether it's meaningful or not. There should be a check on relkind.

6. I don't think relying on strtol to decide if a filename is entirely
numeric is very safe. Note all the extra defenses in pg_atoi against
various platform-specific misbehaviors of strtol. Personally I'd use a
strspn test instead.

7. There are no checks for readdir failure (compare any other readdir
loop in the backend).

See also Simon Riggs' complaints that the circumstances under which it's
done are pretty randomly selected. (One particular thing that I think
is a bad idea is to do this in a standalone backend. Any sort of
corruption in any db's pg_class would render it impossible to start up.)

To fix the first three problems, and also avoid the performance problem
of multiply rescanning a database's pg_class for each of its
tablespaces, I would suggest that the hashtable entries be widened to
RelFileNode structs (ie, db oid, tablespace oid, relfilenode oid). Then
there should be one iteration over pg_database to learn the OIDs and
default tablespaces of each database; with that you can read pg_class
from its correct location for each database and load all the entries
into the hashtable. Then you iterate through the tablespaces looking
for stuff not present in the hashtable. You might also want to build a
list or hashtable of known database OIDs, so that you can recognize a
stale subdirectory immediately and issue a direct complaint about it
without even recursing into it.

regards, tom lane


From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, PostgreSQL-patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Cleaning up unreferenced table files
Date: 2005-05-06 21:37:58
Message-ID: Pine.OSF.4.61.0505070011300.97241@kosh.hut.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

On Thu, 5 May 2005, Tom Lane wrote:

> Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
>> Applied.
>
> Now that I've had a chance to look at it, this patch is thoroughly
> broken. Problems observed in a quick review:
>
> 1. It doesn't work at all for non-default tablespaces: it will
> claim that every file in such a tablespace is stale. The fact
> that it does that rather than failing entirely is accidental.
> It tries to read the database's pg_class in the target tablespace
> whether it's there or not. Because the system is still in recovery
> mode, the low-level routines allow the access to the nonexistent
> pg_class table to pass --- in fact they think they should create
> the file, so after it runs there's a bogus empty "1259" file in each
> such tablespace (which of course it complains about, too). The code
> then proceeds to think that pg_class is empty so of course everything
> draws a warning.
>
> 2. It's not robust against stale subdirectories of a tablespace
> (ie, subdirs corresponding to a nonexistent database) --- again,
> it'll try to read a nonexistent pg_class. Then it'll produce a
> bunch of off-target complaint messages.
>
> 3. It's assuming that relfilenode is unique database-wide, when no
> such assumption is safe. We only have a guarantee that it's unique
> tablespace-wide.
>
> 4. It fails to examine table segment files (such as "nnn.1"). These
> should be complained of when the "nnn" doesn't match any hash entry.
>
> 5. It will load every relfilenode value in pg_class into the hashtable
> whether it's meaningful or not. There should be a check on relkind.
>
> 6. I don't think relying on strtol to decide if a filename is entirely
> numeric is very safe. Note all the extra defenses in pg_atoi against
> various platform-specific misbehaviors of strtol. Personally I'd use a
> strspn test instead.
>

I'll fix 1-6 according to your suggestions, and send another patch.

It shows how little experience I have with multiple database
and tablespace management.

> 7. There are no checks for readdir failure (compare any other readdir
> loop in the backend).

I couldn't figure out what you meant. The readdir code is the same as
anywhere else. Also, man page (Linux) says that readdir returns NULL on
error, and that is checked.

> See also Simon Riggs' complaints that the circumstances under which it's
> done are pretty randomly selected. (One particular thing that I think
> is a bad idea is to do this in a standalone backend. Any sort of
> corruption in any db's pg_class would render it impossible to start up.)

I'd agree with Simons complaints if we actually deleted the files. But
since we only report them, it's a good idea to report them on every
startup, otherwise the DBA might think that the stale files are not there
anymore since the system isn't complaining about them anymore.

The original patch only ran the check on crash recovery, but Bruce changed
it to run on startup as well, for the above reason.

I agree, though, that it's a bad idea to do it in standalone mode. I'll
add a check for that. Also it probably shouldn't stop the startup even if
some pg_class is corrupt. Other databases could be fine.

> To fix the first three problems, and also avoid the performance problem
> of multiply rescanning a database's pg_class for each of its
> tablespaces, I would suggest that the hashtable entries be widened to
> RelFileNode structs (ie, db oid, tablespace oid, relfilenode oid). Then
> there should be one iteration over pg_database to learn the OIDs and
> default tablespaces of each database; with that you can read pg_class
> from its correct location for each database and load all the entries
> into the hashtable. Then you iterate through the tablespaces looking
> for stuff not present in the hashtable. You might also want to build a
> list or hashtable of known database OIDs, so that you can recognize a
> stale subdirectory immediately and issue a direct complaint about it
> without even recursing into it.
>
> regards, tom lane
>

- Heikki


From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, PostgreSQL-patches <pgsql-patches(at)postgresql(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCHES] Cleaning up unreferenced table files
Date: 2005-05-07 18:51:17
Message-ID: Pine.OSF.4.61.0505072027140.513162@kosh.hut.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Maybe we should take a different approach to the problem:

1. Create new file with an extension to mark that it's not
yet committed (eg. 1234.notcommitted)
2. ...
3. Take CheckpointStartLock
4. Write commit record to WAL, with list of created files.
5. rename created file (1234.notcommitted -> 1234).
6. Release CheckpointStartLock

This would guarantee that after successful WAL replay, all files in the
data directory with .notcommitted extension can be safely deleted. No need
to read pg_database or pg_class.

We would take a performance hit because of the additional rename and fsync
step. Also, we must somehow make sure that the new file or the directory
it's in is fsynced on checkpoint to make sure that the rename is flushed
to disk.

A variant of the scheme would be to create two files on step 1. One would
be the actual relfile (1234) and the other would an empty marker file
(1234.notcommitted). That way the smgr code wouldn't have to care it the
file is new or not when opening it.

- Heikki

On Thu, 5 May 2005, Tom Lane wrote:

> Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
>> Applied.
>
> Now that I've had a chance to look at it, this patch is thoroughly
> broken. Problems observed in a quick review:
>
> 1. It doesn't work at all for non-default tablespaces: it will
> claim that every file in such a tablespace is stale. The fact
> that it does that rather than failing entirely is accidental.
> It tries to read the database's pg_class in the target tablespace
> whether it's there or not. Because the system is still in recovery
> mode, the low-level routines allow the access to the nonexistent
> pg_class table to pass --- in fact they think they should create
> the file, so after it runs there's a bogus empty "1259" file in each
> such tablespace (which of course it complains about, too). The code
> then proceeds to think that pg_class is empty so of course everything
> draws a warning.
>
> 2. It's not robust against stale subdirectories of a tablespace
> (ie, subdirs corresponding to a nonexistent database) --- again,
> it'll try to read a nonexistent pg_class. Then it'll produce a
> bunch of off-target complaint messages.
>
> 3. It's assuming that relfilenode is unique database-wide, when no
> such assumption is safe. We only have a guarantee that it's unique
> tablespace-wide.
>
> 4. It fails to examine table segment files (such as "nnn.1"). These
> should be complained of when the "nnn" doesn't match any hash entry.
>
> 5. It will load every relfilenode value in pg_class into the hashtable
> whether it's meaningful or not. There should be a check on relkind.
>
> 6. I don't think relying on strtol to decide if a filename is entirely
> numeric is very safe. Note all the extra defenses in pg_atoi against
> various platform-specific misbehaviors of strtol. Personally I'd use a
> strspn test instead.
>
> 7. There are no checks for readdir failure (compare any other readdir
> loop in the backend).
>
> See also Simon Riggs' complaints that the circumstances under which it's
> done are pretty randomly selected. (One particular thing that I think
> is a bad idea is to do this in a standalone backend. Any sort of
> corruption in any db's pg_class would render it impossible to start up.)
>
> To fix the first three problems, and also avoid the performance problem
> of multiply rescanning a database's pg_class for each of its
> tablespaces, I would suggest that the hashtable entries be widened to
> RelFileNode structs (ie, db oid, tablespace oid, relfilenode oid). Then
> there should be one iteration over pg_database to learn the OIDs and
> default tablespaces of each database; with that you can read pg_class
> from its correct location for each database and load all the entries
> into the hashtable. Then you iterate through the tablespaces looking
> for stuff not present in the hashtable. You might also want to build a
> list or hashtable of known database OIDs, so that you can recognize a
> stale subdirectory immediately and issue a direct complaint about it
> without even recursing into it.
>
> regards, tom lane
>

- Heikki


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, PostgreSQL-patches <pgsql-patches(at)postgresql(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCHES] Cleaning up unreferenced table files
Date: 2005-05-07 20:37:39
Message-ID: 29232.1115498259@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Heikki Linnakangas <hlinnaka(at)iki(dot)fi> writes:
> Maybe we should take a different approach to the problem:
> 1. Create new file with an extension to mark that it's not
> yet committed (eg. 1234.notcommitted)

This is pushing the problem into the wrong place, viz the lowest-level
file access routines, which will now all have to know about
.notcommitted status. It also creates race conditions --- think about
backend A trying to commit file 1234 at about the same time that
backend B is trying to flush some dirty buffers belonging to that file.
But most importantly, it doesn't handle the file-deletion case.

regards, tom lane


From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCHES] Cleaning up unreferenced table files
Date: 2005-05-07 21:08:43
Message-ID: Pine.OSF.4.61.0505072341360.217863@kosh.hut.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

On Sat, 7 May 2005, Tom Lane wrote:

> Heikki Linnakangas <hlinnaka(at)iki(dot)fi> writes:
>> Maybe we should take a different approach to the problem:
>> 1. Create new file with an extension to mark that it's not
>> yet committed (eg. 1234.notcommitted)
>
> This is pushing the problem into the wrong place, viz the lowest-level
> file access routines, which will now all have to know about
> .notcommitted status. It also creates race conditions --- think about
> backend A trying to commit file 1234 at about the same time that
> backend B is trying to flush some dirty buffers belonging to that file.

True. With the rename variant, it might indeed get messy.

Consider the variant with extra marker files. In that case, backend B
doesn't have to know about the .notcommitted status to flush the buffers.

> But most importantly, it doesn't handle the file-deletion case.

File-deletions are easy to handle. Just write the list of pending
deletions to WAL on commit.

To recap, we have 2 slightly different scenarios:

a) Delete a file, write commit record, crash
b) Create a file, crash

Just WAL logging the deletions on commit would take care of A. The
.notcommitted mechanism would take care of B.

- Heikki


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCHES] Cleaning up unreferenced table files
Date: 2005-05-08 16:35:22
Message-ID: 12069.1115570122@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Heikki Linnakangas <hlinnaka(at)iki(dot)fi> writes:
> Consider the variant with extra marker files. In that case, backend B
> doesn't have to know about the .notcommitted status to flush the buffers.

[ shrug ] It's still broken, and the reason is that there's no
equivalent of fsync for directory operations. Consider

A creates 1234 and 1234.notcommitted

A commits

B performs a checkpoint

crash

all before A manages to delete 1234.notcommitted, or at least before
that deletion has made its way to disk. Upon restart, only WAL
events after the checkpoint will be replayed, so 1234.notcommitted
doesn't go away, and then you've got a problem.

To fix this there would need to be a way (1) for B to be aware of the
pending file deletion and (2) for B to delay committing the checkpoint
until the directory update is surely down on disk. Your proposal
doesn't provide for (1), and even if we fixed that, I know of no
portable kernel API for (2). fsync isn't applicable.

While your original patch is buggy, it's at least fixable and has
localized, limited impact. I don't think these schemes are safe
at all --- they put a great deal more weight on the semantics of
the filesystem than I care to do.

regards, tom lane


From: Greg Stark <gsstark(at)mit(dot)edu>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCHES] Cleaning up unreferenced table files
Date: 2005-05-09 03:41:38
Message-ID: 877ji9t6a5.fsf@stark.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

> Heikki Linnakangas <hlinnaka(at)iki(dot)fi> writes:
> > Consider the variant with extra marker files. In that case, backend B
> > doesn't have to know about the .notcommitted status to flush the buffers.
>
> [ shrug ] It's still broken, and the reason is that there's no
> equivalent of fsync for directory operations. Consider

Traditionally that's because directory operations were always synchronous, and
hence didn't need to be fsynced. I think this is still true, other systems
like qmail's maildir still depend on this, for example.

--
greg


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCHES] Cleaning up unreferenced table files
Date: 2005-05-09 03:54:06
Message-ID: 16523.1115610846@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Greg Stark <gsstark(at)mit(dot)edu> writes:
> Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
>> [ shrug ] It's still broken, and the reason is that there's no
>> equivalent of fsync for directory operations. Consider

> Traditionally that's because directory operations were always
> synchronous, and hence didn't need to be fsynced.

That might be true with respect to the process requesting the directory
operation ... but I think you missed the point entirely.

regards, tom lane


From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCHES] Cleaning up unreferenced table files
Date: 2005-05-10 20:29:22
Message-ID: Pine.OSF.4.61.0505102211560.368341@kosh.hut.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

On Sun, 8 May 2005, Tom Lane wrote:

> While your original patch is buggy, it's at least fixable and has
> localized, limited impact. I don't think these schemes are safe
> at all --- they put a great deal more weight on the semantics of
> the filesystem than I care to do.

I'm going to try this some more, because I feel that a scheme like this
that doesn't rely on scanning pg_class and the file system would in fact
be safer.

The key is to A) obey the "WAL first" rule, and A) remember information
about file creations over a checkpoint. The problem with the my previous
suggestion was that it didn't reliably accomplish either :).

Right now we break the WAL rule because the file creation is recorded
after the file is created. And the record is not flushed.

The trivial way to fix that is to write and flush the xlog record before
actually creating the file. (for a more optimized way to do it, see end of
message). Then we could trust that there aren't any files in the data
directory that don't have a corresponding record in WAL.

But that's not enough. If a checkpoint occurs after the file is
created, but before the transaction ends, WAL replay doesn't see the file
creation record. That's why we need a mechanism to carry the information
over the checkpoint.

We could do that by extending the ForwardFsyncRequest function or by
creating something similar to that. When a backend writes the file
creation WAL record, it also sends a message to the bgwriter that says
"I'm xid 1234, and I have just created file foobar/1234" (while holding
CheckpointStartLock). Bgwriter keeps a list of xid/file pairs like it
keeps a list of pending fsync operations. On checkpoint, the checkpointer
scans the list and removes entries for transactions that have already
ended, and attaches the remaining list to the checkpoint record.

WAL replay would start with the xid/file list in the checkpoint record,
and update it during the replay whenever a file creation or a transaction
commit/rollback record is seen. On a rollback record, files created by
that transaction are deleted. At the end of WAL replay, the files that are
left in the list belong to transactions that implicitly aborted, and can
be deleted.

If we don't want to extend the checkpoint record, a separate WAL record
works too.

Now, the more optimized way to do A:

Delay the actual file creation until it's first written to. The write
needs to be WAL logged anyway, so we would just piggyback on that.

Implemented this way, I don't think there would be a significant
performance hit from the scheme. We would create more ForwardFsyncRequest
traffic, but not much compared to the block fsync requests we have right
now.

BTW: If we allowed mdopen to create the file if it doesn't exist already,
would we need the current file creation xlog record for anything? (I'm
not suggesting to do that, just trying to get more insight)

- Heikki


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCHES] Cleaning up unreferenced table files
Date: 2005-05-10 20:55:45
Message-ID: 200505102055.j4AKtj522003@candle.pha.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Heikki Linnakangas wrote:
> On Sun, 8 May 2005, Tom Lane wrote:
>
> > While your original patch is buggy, it's at least fixable and has
> > localized, limited impact. I don't think these schemes are safe
> > at all --- they put a great deal more weight on the semantics of
> > the filesystem than I care to do.
>
> I'm going to try this some more, because I feel that a scheme like this
> that doesn't rely on scanning pg_class and the file system would in fact
> be safer.

The current code is nice and localized and doesn't add any burden on our
existing code, which is already complicated enough. I think we either
fix checkfiles.c, or we remove it and decide it isn't worth checking for
unrefrenced files.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCHES] Cleaning up unreferenced table files
Date: 2005-05-10 21:03:30
Message-ID: 26932.1115759010@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Heikki Linnakangas <hlinnaka(at)iki(dot)fi> writes:
> On Sun, 8 May 2005, Tom Lane wrote:
>> While your original patch is buggy, it's at least fixable and has
>> localized, limited impact. I don't think these schemes are safe
>> at all --- they put a great deal more weight on the semantics of
>> the filesystem than I care to do.

> I'm going to try this some more, because I feel that a scheme like this
> that doesn't rely on scanning pg_class and the file system would in fact
> be safer.

I think this proposal is getting more and more invasive and expensive,
and it's all to solve a problem that we do not even know is worth
spending any time on. I *really* think this is the wrong direction
to take. Aside from the required effort and risk of breaking things,
the original patch incurred cost only during crash recovery; this is
pushing costs into the normal code paths.

> Delay the actual file creation until it's first written to. The write
> needs to be WAL logged anyway, so we would just piggyback on that.

This is a bad idea since by then it's (potentially) too late to roll
back the creating transaction if the creation fails. Consider for
instance a tablespace directory that's mispermissioned read-only, or
some such. I'd rather have the CREATE TABLE fail than a later INSERT.
(Admittedly, we can't completely guarantee that an INSERT won't hit
some kind of filesystem-level problem, but it's still something to
try to avoid.)

Also, the "first write" actually comes from mdextend, which is not a
WAL-logged operation AFAIR. Some rethinking of that would be necessary
before this would have any chance of working.

regards, tom lane


From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCHES] Cleaning up unreferenced table files
Date: 2005-05-10 21:18:30
Message-ID: Pine.OSF.4.61.0505110009250.368341@kosh.hut.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

On Tue, 10 May 2005, Bruce Momjian wrote:

> The current code is nice and localized and doesn't add any burden on our
> existing code, which is already complicated enough. I think we either
> fix checkfiles.c, or we remove it and decide it isn't worth checking for
> unrefrenced files.

Let's pull the patch for now.

I still think we should check for unreferenced files, but it needs some
more thought and it's by no means urgent.

If I have time later on, I might write a patch to implement the WAL-based
scheme to see how much change to existing code it really would need.

- Heikki


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCHES] Cleaning up unreferenced table files
Date: 2005-05-10 22:28:19
Message-ID: 200505102228.j4AMSJ907460@candle.pha.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Heikki Linnakangas wrote:
> On Tue, 10 May 2005, Bruce Momjian wrote:
>
> > The current code is nice and localized and doesn't add any burden on our
> > existing code, which is already complicated enough. I think we either
> > fix checkfiles.c, or we remove it and decide it isn't worth checking for
> > unrefrenced files.
>
> Let's pull the patch for now.

Removed, and TODO item restored.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCHES] Cleaning up unreferenced table files
Date: 2005-05-11 04:02:34
Message-ID: 29801.1115784154@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Heikki Linnakangas <hlinnaka(at)iki(dot)fi> writes:
> On Tue, 10 May 2005, Bruce Momjian wrote:
>> The current code is nice and localized and doesn't add any burden on our
>> existing code, which is already complicated enough. I think we either
>> fix checkfiles.c, or we remove it and decide it isn't worth checking for
>> unrefrenced files.

> Let's pull the patch for now.

FWIW, I was OK with the idea of adding something similar to the given
patch to find out whether we had a problem or not. With sufficient
evidence that lost files are a big problem, I'd be in favor of a
mechanism of the kind proposed in Heikki's latest messages. The
disconnect for me at the moment is that there's no evidence to justify
that amount of effort/risk. A startup-time patch would have provided
that evidence, or else have proven that it's not worth spending more
time on.

regards, tom lane


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCHES] Cleaning up unreferenced table files
Date: 2005-05-11 13:19:57
Message-ID: 200505111319.j4BDJvH04879@candle.pha.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Tom Lane wrote:
> Heikki Linnakangas <hlinnaka(at)iki(dot)fi> writes:
> > On Tue, 10 May 2005, Bruce Momjian wrote:
> >> The current code is nice and localized and doesn't add any burden on our
> >> existing code, which is already complicated enough. I think we either
> >> fix checkfiles.c, or we remove it and decide it isn't worth checking for
> >> unrefrenced files.
>
> > Let's pull the patch for now.
>
> FWIW, I was OK with the idea of adding something similar to the given
> patch to find out whether we had a problem or not. With sufficient
> evidence that lost files are a big problem, I'd be in favor of a
> mechanism of the kind proposed in Heikki's latest messages. The
> disconnect for me at the moment is that there's no evidence to justify
> that amount of effort/risk. A startup-time patch would have provided
> that evidence, or else have proven that it's not worth spending more
> time on.

Agreed. Imagine a backend creates a table file, then the operating
system crashes. I assume WAL wasn't fsync'ed, so there is no way that
WAL can discover that unreferenced file.

While I think WAL can correct some cases, I don't think it can correct
them all, so it seems it is necessary to check the file system against
pg_class to catch all the cases. The transaction and file system
semantics are just different and need to be checked against each other.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, PostgreSQL-development <pgsql-hackers(at)postgreSQL(dot)org>
Subject: Re: [PATCHES] Cleaning up unreferenced table files
Date: 2005-06-27 10:12:52
Message-ID: 200506271012.j5RACrT15690@candle.pha.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches


Heikki, do you have any interest in completing your file checking patch
for inclusion in 8.1 by adding tablespace information and other fixes as
requested by Tom below? The current patch version is at:

ftp://candle.pha.pa.us/pub/postgresql/mypatches

called checkfiles*.

Anyone else want to complete it?

---------------------------------------------------------------------------

Tom Lane wrote:
> Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> > Applied.
>
> Now that I've had a chance to look at it, this patch is thoroughly
> broken. Problems observed in a quick review:
>
> 1. It doesn't work at all for non-default tablespaces: it will
> claim that every file in such a tablespace is stale. The fact
> that it does that rather than failing entirely is accidental.
> It tries to read the database's pg_class in the target tablespace
> whether it's there or not. Because the system is still in recovery
> mode, the low-level routines allow the access to the nonexistent
> pg_class table to pass --- in fact they think they should create
> the file, so after it runs there's a bogus empty "1259" file in each
> such tablespace (which of course it complains about, too). The code
> then proceeds to think that pg_class is empty so of course everything
> draws a warning.
>
> 2. It's not robust against stale subdirectories of a tablespace
> (ie, subdirs corresponding to a nonexistent database) --- again,
> it'll try to read a nonexistent pg_class. Then it'll produce a
> bunch of off-target complaint messages.
>
> 3. It's assuming that relfilenode is unique database-wide, when no
> such assumption is safe. We only have a guarantee that it's unique
> tablespace-wide.
>
> 4. It fails to examine table segment files (such as "nnn.1"). These
> should be complained of when the "nnn" doesn't match any hash entry.
>
> 5. It will load every relfilenode value in pg_class into the hashtable
> whether it's meaningful or not. There should be a check on relkind.
>
> 6. I don't think relying on strtol to decide if a filename is entirely
> numeric is very safe. Note all the extra defenses in pg_atoi against
> various platform-specific misbehaviors of strtol. Personally I'd use a
> strspn test instead.
>
> 7. There are no checks for readdir failure (compare any other readdir
> loop in the backend).
>
> See also Simon Riggs' complaints that the circumstances under which it's
> done are pretty randomly selected. (One particular thing that I think
> is a bad idea is to do this in a standalone backend. Any sort of
> corruption in any db's pg_class would render it impossible to start up.)
>
> To fix the first three problems, and also avoid the performance problem
> of multiply rescanning a database's pg_class for each of its
> tablespaces, I would suggest that the hashtable entries be widened to
> RelFileNode structs (ie, db oid, tablespace oid, relfilenode oid). Then
> there should be one iteration over pg_database to learn the OIDs and
> default tablespaces of each database; with that you can read pg_class
> from its correct location for each database and load all the entries
> into the hashtable. Then you iterate through the tablespaces looking
> for stuff not present in the hashtable. You might also want to build a
> list or hashtable of known database OIDs, so that you can recognize a
> stale subdirectory immediately and issue a direct complaint about it
> without even recursing into it.
>
> regards, tom lane
>

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PATCHES] Cleaning up unreferenced table files
Date: 2005-06-27 17:37:01
Message-ID: Pine.OSF.4.61.0506272030530.449706@kosh.hut.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

No, not for now. Maybe for 8.2. And maybe as a contrib tool at first after
all.

- Heikki

On Mon, 27 Jun 2005, Bruce Momjian wrote:

>
> Heikki, do you have any interest in completing your file checking patch
> for inclusion in 8.1 by adding tablespace information and other fixes as
> requested by Tom below? The current patch version is at:
>
> ftp://candle.pha.pa.us/pub/postgresql/mypatches
>
> called checkfiles*.
>
> Anyone else want to complete it?
>
> ---------------------------------------------------------------------------
>
> Tom Lane wrote:
>> Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
>>> Applied.
>>
>> Now that I've had a chance to look at it, this patch is thoroughly
>> broken. Problems observed in a quick review:
>>
>> 1. It doesn't work at all for non-default tablespaces: it will
>> claim that every file in such a tablespace is stale. The fact
>> that it does that rather than failing entirely is accidental.
>> It tries to read the database's pg_class in the target tablespace
>> whether it's there or not. Because the system is still in recovery
>> mode, the low-level routines allow the access to the nonexistent
>> pg_class table to pass --- in fact they think they should create
>> the file, so after it runs there's a bogus empty "1259" file in each
>> such tablespace (which of course it complains about, too). The code
>> then proceeds to think that pg_class is empty so of course everything
>> draws a warning.
>>
>> 2. It's not robust against stale subdirectories of a tablespace
>> (ie, subdirs corresponding to a nonexistent database) --- again,
>> it'll try to read a nonexistent pg_class. Then it'll produce a
>> bunch of off-target complaint messages.
>>
>> 3. It's assuming that relfilenode is unique database-wide, when no
>> such assumption is safe. We only have a guarantee that it's unique
>> tablespace-wide.
>>
>> 4. It fails to examine table segment files (such as "nnn.1"). These
>> should be complained of when the "nnn" doesn't match any hash entry.
>>
>> 5. It will load every relfilenode value in pg_class into the hashtable
>> whether it's meaningful or not. There should be a check on relkind.
>>
>> 6. I don't think relying on strtol to decide if a filename is entirely
>> numeric is very safe. Note all the extra defenses in pg_atoi against
>> various platform-specific misbehaviors of strtol. Personally I'd use a
>> strspn test instead.
>>
>> 7. There are no checks for readdir failure (compare any other readdir
>> loop in the backend).
>>
>> See also Simon Riggs' complaints that the circumstances under which it's
>> done are pretty randomly selected. (One particular thing that I think
>> is a bad idea is to do this in a standalone backend. Any sort of
>> corruption in any db's pg_class would render it impossible to start up.)
>>
>> To fix the first three problems, and also avoid the performance problem
>> of multiply rescanning a database's pg_class for each of its
>> tablespaces, I would suggest that the hashtable entries be widened to
>> RelFileNode structs (ie, db oid, tablespace oid, relfilenode oid). Then
>> there should be one iteration over pg_database to learn the OIDs and
>> default tablespaces of each database; with that you can read pg_class
>> from its correct location for each database and load all the entries
>> into the hashtable. Then you iterate through the tablespaces looking
>> for stuff not present in the hashtable. You might also want to build a
>> list or hashtable of known database OIDs, so that you can recognize a
>> stale subdirectory immediately and issue a direct complaint about it
>> without even recursing into it.
>>
>> regards, tom lane
>>
>
> --
> Bruce Momjian | http://candle.pha.pa.us
> pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
> + If your life is a hard drive, | 13 Roberts Road
> + Christ can be your backup. | Newtown Square, Pennsylvania 19073
>
> ---------------------------(end of broadcast)---------------------------
> TIP 5: Have you checked our extensive FAQ?
>
> http://www.postgresql.org/docs/faq
>

- Heikki