Re: 8.1.3 and unused files

Lists: pgsql-hackers
From: Rod Taylor <pg(at)rbt(dot)ca>
To: PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: 8.1.3 and unused files
Date: 2006-05-05 17:53:06
Message-ID: 1146851586.8325.30.camel@home
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Am I correct in the thought that the various files listed below are not
used by the database and can be safely removed? There were no other
active db connections when I issued this command.

I think truncate (Slony) left them behind.

ssdb=# select file
from pg_ls_dir('base/'|| (select oid from pg_database where datname =
'ssdb')) as tab(file)
where file !~ '\\..*$'
and file not in (select relfilenode from pg_class)
and file not in ('PG_VERSION', 'pgsql_tmp');

file
---------
1434986
1434984
1434985
(3 rows)

[root(at)DB1 16384]# ls -la 143498[456]*
-rw------- 1 rbt sysadmin 1073741824 May 1 20:56 1434984
-rw------- 1 rbt sysadmin 1073741824 May 1 21:11 1434984.1
-rw------- 1 rbt sysadmin 1073741824 May 1 23:21 1434984.10
-rw------- 1 rbt sysadmin 1073741824 May 1 23:36 1434984.11
-rw------- 1 rbt sysadmin 1073741824 May 1 23:50 1434984.12
-rw------- 1 rbt sysadmin 1073741824 May 2 00:06 1434984.13
-rw------- 1 rbt sysadmin 1073741824 May 2 00:23 1434984.14
-rw------- 1 rbt sysadmin 1073741824 May 2 00:39 1434984.15
-rw------- 1 rbt sysadmin 1073741824 May 2 00:57 1434984.16
-rw------- 1 rbt sysadmin 1073741824 May 2 01:14 1434984.17
-rw------- 1 rbt sysadmin 1073741824 May 2 01:31 1434984.18
-rw------- 1 rbt sysadmin 1073741824 May 2 01:50 1434984.19
-rw------- 1 rbt sysadmin 1073741824 May 1 21:25 1434984.2
-rw------- 1 rbt sysadmin 1073741824 May 2 02:07 1434984.20
-rw------- 1 rbt sysadmin 1073741824 May 2 02:23 1434984.21
-rw------- 1 rbt sysadmin 1073741824 May 2 02:41 1434984.22
-rw------- 1 rbt sysadmin 1073741824 May 2 02:55 1434984.23
-rw------- 1 rbt sysadmin 1073741824 May 2 03:09 1434984.24
-rw------- 1 rbt sysadmin 1073741824 May 2 03:24 1434984.25
-rw------- 1 rbt sysadmin 1073741824 May 2 03:37 1434984.26
-rw------- 1 rbt sysadmin 1073741824 May 2 03:53 1434984.27
-rw------- 1 rbt sysadmin 1073741824 May 2 04:09 1434984.28
-rw------- 1 rbt sysadmin 1073741824 May 2 04:24 1434984.29
-rw------- 1 rbt sysadmin 1073741824 May 1 21:40 1434984.3
-rw------- 1 rbt sysadmin 1073741824 May 2 04:40 1434984.30
-rw------- 1 rbt sysadmin 1073741824 May 2 04:56 1434984.31
-rw------- 1 rbt sysadmin 990912512 May 2 05:09 1434984.32
-rw------- 1 rbt sysadmin 1073741824 May 1 21:54 1434984.4
-rw------- 1 rbt sysadmin 1073741824 May 1 22:08 1434984.5
-rw------- 1 rbt sysadmin 1073741824 May 1 22:23 1434984.6
-rw------- 1 rbt sysadmin 1073741824 May 1 22:36 1434984.7
-rw------- 1 rbt sysadmin 1073741824 May 1 22:52 1434984.8
-rw------- 1 rbt sysadmin 1073741824 May 1 23:07 1434984.9
-rw------- 1 rbt sysadmin 8192 May 1 20:40 1434985
-rw------- 1 rbt sysadmin 1073741824 May 2 11:27 1434986
-rw------- 1 rbt sysadmin 1073741824 May 2 11:39 1434986.1
-rw------- 1 rbt sysadmin 121733120 May 2 16:53 1434986.10
-rw------- 1 rbt sysadmin 1073741824 May 2 11:56 1434986.2
-rw------- 1 rbt sysadmin 1073741824 May 2 12:15 1434986.3
-rw------- 1 rbt sysadmin 1073741824 May 2 12:43 1434986.4
-rw------- 1 rbt sysadmin 1073741824 May 2 13:15 1434986.5
-rw------- 1 rbt sysadmin 1073741824 May 2 13:53 1434986.6
-rw------- 1 rbt sysadmin 1073741824 May 2 14:35 1434986.7
-rw------- 1 rbt sysadmin 1073741824 May 2 15:38 1434986.8
-rw------- 1 rbt sysadmin 1073741824 May 2 16:53 1434986.9
--


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Rod Taylor <pg(at)rbt(dot)ca>
Cc: PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 8.1.3 and unused files
Date: 2006-05-05 18:09:29
Message-ID: 26731.1146852569@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Rod Taylor <pg(at)rbt(dot)ca> writes:
> Am I correct in the thought that the various files listed below are not
> used by the database and can be safely removed? There were no other
> active db connections when I issued this command.

> I think truncate (Slony) left them behind.

I don't particularly like the casual assumption that truncate is broken.
If I were you I'd be looking harder for a plausible explanation about
where these files came from, especially seeing how large they are.
Have you tried dumping the file contents to see if the data looks
recognizable at all?

regards, tom lane


From: Rod Taylor <pg(at)rbt(dot)ca>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 8.1.3 and unused files
Date: 2006-05-05 18:26:04
Message-ID: 1146853564.8325.50.camel@home
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, 2006-05-05 at 14:09 -0400, Tom Lane wrote:
> Rod Taylor <pg(at)rbt(dot)ca> writes:
> > Am I correct in the thought that the various files listed below are not
> > used by the database and can be safely removed? There were no other
> > active db connections when I issued this command.
>
> > I think truncate (Slony) left them behind.
>
> I don't particularly like the casual assumption that truncate is broken.

> If I were you I'd be looking harder for a plausible explanation about
> where these files came from, especially seeing how large they are.
> Have you tried dumping the file contents to see if the data looks
> recognizable at all?

Hardware is perfectly functional and has been for about 18 months in
production with 8.0.x.

It is a completely new 8.1 database and Slony is the only entity that
has been working in it. There are not very many possibilities.

I'm fairly confident I know exactly which table they are/were a part of.
1434984 is the table data, 1434986 is the primary key of the table (only
index), and 1434985 is probably the toast structure.

The structure have different relfilenode values and valid data at this
time.

At some point it must have failed in copying the data across, aborted,
and restarted.

So it would have been something like this:

BEGIN;
TRUNCATE;
"decouple indexes -- ask Jan";
COPY;
"recouple indexes";
REINDEX <crash, abort, something else to cause a Slony to restart>;

<reconnect>
BEGIN;
TRUNCATE;
"decouple indexes -- ask Jan";
COPY; "recouple indexes";
REINDEX;
COMMIT;

--


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Rod Taylor <pg(at)rbt(dot)ca>
Cc: PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 8.1.3 and unused files
Date: 2006-05-05 18:31:50
Message-ID: 26896.1146853910@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Rod Taylor <pg(at)rbt(dot)ca> writes:
> At some point it must have failed in copying the data across, aborted,
> and restarted.

Unless you had an actual backend crash, that's not an adequate
explanation. Transaction abort does clean up created files.

regards, tom lane


From: Rod Taylor <pg(at)rbt(dot)ca>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 8.1.3 and unused files
Date: 2006-05-05 18:52:01
Message-ID: 1146855121.8325.72.camel@home
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, 2006-05-05 at 14:31 -0400, Tom Lane wrote:
> Rod Taylor <pg(at)rbt(dot)ca> writes:
> > At some point it must have failed in copying the data across, aborted,
> > and restarted.
>
> Unless you had an actual backend crash, that's not an adequate
> explanation. Transaction abort does clean up created files.

The only reason I noticed is because pg_database_size didn't match
sum(pg_total_relation_size()) and was investigating what I thought was a
bug in one of those functions.

I'm afraid we don't have all of the monitoring, logging, and change
control bits hooked up to the non-production DBs, so that is pretty much
all I have other than conjecture.

The only thing I can come up with is that perhaps someone forcefully
gave it a kick. SIGTERM is a necessary action once in a while to unwedge
a stuck db connection (killing the client script doesn't always get it
immediately).

Slony holds open a transaction on the master while reindexing the slave,
so perhaps someone thought the slave needed help. Making a copy of the
master takes several weeks. They may have killed slony, found the
statements still working away, SIGTERM'd them both, then restarted
slony. It wouldn't be an unusual pattern of events, particularly since
they've not been taught about pg_cancel_backend() yet (no 8.1 training).

How about this?

BEGIN;
TRUNCATE;
COPY;
REINDEX <SIGTERM during REINDEX>;

pg_class references old files. New files in their aborted state are left
behind?

--


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Rod Taylor <pg(at)rbt(dot)ca>
Cc: PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 8.1.3 and unused files
Date: 2006-05-05 19:10:18
Message-ID: 27237.1146856218@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Rod Taylor <pg(at)rbt(dot)ca> writes:
> On Fri, 2006-05-05 at 14:31 -0400, Tom Lane wrote:
>> Unless you had an actual backend crash, that's not an adequate
>> explanation. Transaction abort does clean up created files.

> The only thing I can come up with is that perhaps someone forcefully
> gave it a kick. SIGTERM is a necessary action once in a while to unwedge
> a stuck db connection (killing the client script doesn't always get it
> immediately).

SIGTERM wouldn't cause that either. I hope your people are not in the
habit of using "kill -9"?

regards, tom lane


From: Rod Taylor <pg(at)rbt(dot)ca>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 8.1.3 and unused files
Date: 2006-05-05 19:58:26
Message-ID: 1146859106.8325.112.camel@home
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, 2006-05-05 at 15:10 -0400, Tom Lane wrote:
> Rod Taylor <pg(at)rbt(dot)ca> writes:
> > On Fri, 2006-05-05 at 14:31 -0400, Tom Lane wrote:
> >> Unless you had an actual backend crash, that's not an adequate
> >> explanation. Transaction abort does clean up created files.
>
> > The only thing I can come up with is that perhaps someone forcefully
> > gave it a kick. SIGTERM is a necessary action once in a while to unwedge
> > a stuck db connection (killing the client script doesn't always get it
> > immediately).
>
> SIGTERM wouldn't cause that either. I hope your people are not in the
> habit of using "kill -9"?

Command line records, etc. are not available, but I did track down a a
snippet of logs from the backups (daily log rotation). Sorry, I didn't
realize there were backups initially -- it's unusual. Appears it did
crash or get killed in some way "exited with exit code 1".

It's a temp DB to try a different database encoding (prep for 8.1
upgrade) with production data.

Is there something you would like me to try doing in an attempt to
reproduce? Preferably with a smaller structure.

The truncate would have happened as part of the prepareTableForCopy()
call.

slony%ssdb 10171 4621947 2006-05-02 05:09:40 EDTLOG: 00000: duration: 30526368.316 ms statement: select "_ssrep".prepareTableForCopy(1010); copy <SNIP> from stdin;
slony%ssdb 10171 4621947 2006-05-02 05:09:40 EDTLOCATION: exec_simple_query, postgres.c:1103
slony%ssdb 10171 4621947 2006-05-02 05:09:40 EDTLOCATION: exec_simple_query, postgres.c:1103
slony%ssdb 10181 0 2006-05-02 15:32:06 EDTLOG: 08P01: unexpected EOF on client connection
slony%ssdb 10181 0 2006-05-02 15:32:06 EDTLOCATION: SocketBackend, postgres.c:295
slony%ssdb 10154 0 2006-05-02 15:32:06 EDTLOG: 08P01: unexpected EOF on client connection
slony%ssdb 10154 0 2006-05-02 15:32:06 EDTLOCATION: SocketBackend, postgres.c:295
slony%ssdb 10173 0 2006-05-02 16:30:53 EDTLOG: 08P01: unexpected EOF on client connection
slony%ssdb 10173 0 2006-05-02 16:30:53 EDTLOCATION: SocketBackend, postgres.c:295
slony%ssdb 10755 0 2006-05-02 16:30:53 EDTLOG: 08P01: unexpected EOF on client connection
slony%ssdb 10755 0 2006-05-02 16:30:53 EDTLOCATION: SocketBackend, postgres.c:295
slony%ssdb 300 0 2006-05-02 16:55:18 EDTLOG: 08P01: unexpected EOF on client connection
slony%ssdb 300 0 2006-05-02 16:55:18 EDTLOCATION: SocketBackend, postgres.c:295
slony%ssdb 301 0 2006-05-02 16:55:18 EDTLOG: 08P01: unexpected EOF on client connection
slony%ssdb 301 0 2006-05-02 16:55:18 EDTLOCATION: SocketBackend, postgres.c:295
% 1960 2006-05-02 17:03:19 EDTLOG: 00000: server process (PID 10171) exited with exit code 1
% 1960 2006-05-02 17:03:19 EDTLOCATION: LogChildExit, postmaster.c:2416
% 1960 2006-05-02 17:03:19 EDTLOG: 00000: terminating any other active server processes
% 1960 2006-05-02 17:03:19 EDTLOCATION: HandleChildCrash, postmaster.c:2306
% 1960 2006-05-02 17:03:19 EDTLOCATION: HandleChildCrash, postmaster.c:2306
% 1960 2006-05-02 17:03:19 EDTLOG: 00000: all server processes terminated; reinitializing
% 1960 2006-05-02 17:03:19 EDTLOCATION: reaper, postmaster.c:2206
< snip connection attempts >
% 5826 2006-05-02 17:03:22 EDTLOG: 00000: database system was interrupted at 2006-05-02 16:06:20 EDT
% 5826 2006-05-02 17:03:22 EDTLOCATION: StartupXLOG, xlog.c:4374
% 5826 2006-05-02 17:03:22 EDTLOG: 00000: checkpoint record is at 59/E0B56920
% 5826 2006-05-02 17:03:22 EDTLOCATION: StartupXLOG, xlog.c:4442
% 5826 2006-05-02 17:03:22 EDTLOG: 00000: redo record is at 59/E0B56920; undo record is at 0/0; shutdown FALSE
% 5826 2006-05-02 17:03:22 EDTLOCATION: StartupXLOG, xlog.c:4469
% 5826 2006-05-02 17:03:22 EDTLOG: 00000: next transaction ID: 4863932; next OID: 1441853
% 5826 2006-05-02 17:03:22 EDTLOCATION: StartupXLOG, xlog.c:4472
% 5826 2006-05-02 17:03:22 EDTLOG: 00000: next MultiXactId: 1; next MultiXactOffset: 0
% 5826 2006-05-02 17:03:22 EDTLOCATION: StartupXLOG, xlog.c:4475
% 5826 2006-05-02 17:03:22 EDTLOG: 00000: database system was not properly shut down; automatic recovery in progress
% 5826 2006-05-02 17:03:22 EDTLOCATION: StartupXLOG, xlog.c:4532
% 5826 2006-05-02 17:03:22 EDTLOG: 00000: redo starts at 59/E0B56970
% 5826 2006-05-02 17:03:22 EDTLOCATION: StartupXLOG, xlog.c:4569
% 5826 2006-05-02 17:03:22 EDTLOG: 00000: record with zero length at 59/E0E429B8
% 5826 2006-05-02 17:03:22 EDTLOCATION: ReadRecord, xlog.c:2764
% 5826 2006-05-02 17:03:22 EDTLOG: 00000: redo done at 59/E0E42988
% 5826 2006-05-02 17:03:22 EDTLOCATION: StartupXLOG, xlog.c:4627
% 5826 2006-05-02 17:03:23 EDTLOG: 00000: database system is ready
% 5826 2006-05-02 17:03:23 EDTLOCATION: StartupXLOG, xlog.c:4821
% 5826 2006-05-02 17:03:23 EDTLOG: 00000: transaction ID wrap limit is 1073749769, limited by database "ssdb"
% 5826 2006-05-02 17:03:23 EDTLOCATION: SetTransactionIdLimit, varsup.c:234
--


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Rod Taylor <pg(at)rbt(dot)ca>
Cc: PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 8.1.3 and unused files
Date: 2006-05-05 20:11:44
Message-ID: 27779.1146859904@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Rod Taylor <pg(at)rbt(dot)ca> writes:
> % 1960 2006-05-02 17:03:19 EDTLOG: 00000: server process (PID 10171) exited with exit code 1

Hm. I wonder if there are any uses of "exit(1)" in the Slony triggers.

regards, tom lane


From: Darcy Buskermolen <darcy(at)wavefire(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Rod Taylor <pg(at)rbt(dot)ca>
Subject: Re: 8.1.3 and unused files
Date: 2006-05-05 21:13:35
Message-ID: 200605051413.36697.darcy@wavefire.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Friday 05 May 2006 13:11, Tom Lane wrote:
> Rod Taylor <pg(at)rbt(dot)ca> writes:
> > % 1960 2006-05-02 17:03:19 EDTLOG: 00000: server process (PID 10171)
> > exited with exit code 1
>
> Hm. I wonder if there are any uses of "exit(1)" in the Slony triggers.

No, there are no calls to exit() in the database loaded funcs.

>
> regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 5: don't forget to increase your free space map settings

--
Darcy Buskermolen
Wavefire Technologies Corp.

http://www.wavefire.com
ph: 250.717.0200
fx: 250.763.1759


From: Rod Taylor <pg(at)rbt(dot)ca>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 8.1.3 and unused files
Date: 2006-05-05 22:22:41
Message-ID: 1146867761.8325.145.camel@home
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, 2006-05-05 at 16:11 -0400, Tom Lane wrote:
> Rod Taylor <pg(at)rbt(dot)ca> writes:
> > % 1960 2006-05-02 17:03:19 EDTLOG: 00000: server process (PID 10171) exited with exit code 1
>
> Hm. I wonder if there are any uses of "exit(1)" in the Slony triggers.

It doesn't appear so. It does have this though:

Datum
_Slony_I_killBackend(PG_FUNCTION_ARGS)
{
int32 pid;
int32 signo;
text *signame;

if (!superuser())
elog(ERROR, "Slony-I: insufficient privilege for killBackend");

pid = PG_GETARG_INT32(0);
signame = PG_GETARG_TEXT_P(1);

if (VARSIZE(signame) == VARHDRSZ + 4 &&
memcmp(VARDATA(signame), "NULL", 0) == 0)
{
signo = 0;
}
else if (VARSIZE(signame) == VARHDRSZ + 4 &&
memcmp(VARDATA(signame), "TERM", 0) == 0)
{
signo = SIGTERM;
}
else
{
elog(ERROR, "Slony-I: unsupported signal");
}

if (kill(pid, signo) < 0)
PG_RETURN_INT32(-1);

PG_RETURN_INT32(0);
}

--


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Rod Taylor <pg(at)rbt(dot)ca>
Cc: PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 8.1.3 and unused files
Date: 2006-05-05 22:53:11
Message-ID: 28826.1146869591@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Rod Taylor <pg(at)rbt(dot)ca> writes:
> On Fri, 2006-05-05 at 16:11 -0400, Tom Lane wrote:
>> Hm. I wonder if there are any uses of "exit(1)" in the Slony triggers.

> It doesn't appear so. It does have this though:

Well, a SIGTERM would have resulted in a bleat in the postmaster log.
The striking thing about your log is that the backend went down without
saying a word; which would be understandable if it had crashed (eg SEGV
or kill -9) but then the postmaster would have seen some other exit
status. I'm fairly certain there are no paths in the standard backend
code that will exit(1) without any attempt to print a message. That's
why I'm wondering about add-ons.

regards, tom lane


From: Rod Taylor <pg(at)rbt(dot)ca>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 8.1.3 and unused files
Date: 2006-05-05 23:20:42
Message-ID: 1146871242.8325.160.camel@home
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, 2006-05-05 at 18:53 -0400, Tom Lane wrote:
> Rod Taylor <pg(at)rbt(dot)ca> writes:
> > On Fri, 2006-05-05 at 16:11 -0400, Tom Lane wrote:
> >> Hm. I wonder if there are any uses of "exit(1)" in the Slony triggers.
>
> > It doesn't appear so. It does have this though:
>
> Well, a SIGTERM would have resulted in a bleat in the postmaster log.
> The striking thing about your log is that the backend went down without
> saying a word; which would be understandable if it had crashed (eg SEGV
> or kill -9) but then the postmaster would have seen some other exit
> status. I'm fairly certain there are no paths in the standard backend
> code that will exit(1) without any attempt to print a message. That's
> why I'm wondering about add-ons.

Add-ons are slim. Slony. We don't have any C based functions and only a
few plpgsql functions in that DB.

I did trim out a ton of autovacuum log entries (it likes to log once a
minute) but I don't see anything interesting in that area nor autovac
the pid that exited.

My knowledge of signal handling is pretty basic. Any chance that
multiple SIGTERMs could have caused it to avoid the logging?

--