Re: production server down

From: Joe Conway <mail(at)joeconway(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Hackers (PostgreSQL)" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: production server down
Date: 2004-12-15 05:50:02
Message-ID: 41BFD08A.5000501@joeconway.com
Lists: pgsql-hackers

Tom Lane wrote:
>>...
>>pg_control last modified: Tue Dec 14 15:39:26 2004
>>...
>>Time of latest checkpoint: Tue Nov 2 17:05:32 2004
>
> [ blink... ] That seems like an unreasonable gap between checkpoints,
> especially for a production server. Can you see an explanation?

Hmmm, this is even more scary. We have two database clusters on this
server, one on /replica/pgdata, and one on /production/pgdata (ignore
the names -- /replica is actually the "production" instance at the moment).

# pg_controldata /replica/pgdata
pg_control version number: 72
Catalog version number: 200310211
Database cluster state: shutting down
pg_control last modified: Tue Dec 14 15:39:26 2004
Current log file ID: 0
Next log file segment: 1
Latest checkpoint location: 0/9B0B8C
Prior checkpoint location: 0/9AA1B4
Latest checkpoint's REDO location: 0/9B0B8C
Latest checkpoint's UNDO location: 0/0
Latest checkpoint's StartUpID: 12
Latest checkpoint's NextXID: 536
Latest checkpoint's NextOID: 17142
Time of latest checkpoint: Tue Nov 2 17:05:32 2004
Database block size: 8192
Blocks per segment of large relation: 131072
Maximum length of identifiers: 64
Maximum number of function arguments: 32
Date/time type storage: 64-bit integers
Maximum length of locale name: 128
LC_COLLATE: C
LC_CTYPE: C

# pg_controldata /production/pgdata
pg_control version number: 72
Catalog version number: 200310211
Database cluster state: shutting down
pg_control last modified: Tue Nov 2 21:57:49 2004
Current log file ID: 0
Next log file segment: 1
Latest checkpoint location: 0/9B0B8C
Prior checkpoint location: 0/9AA1B4
Latest checkpoint's REDO location: 0/9B0B8C
Latest checkpoint's UNDO location: 0/0
Latest checkpoint's StartUpID: 12
Latest checkpoint's NextXID: 536
Latest checkpoint's NextOID: 17142
Time of latest checkpoint: Tue Nov 2 17:05:32 2004
Database block size: 8192
Blocks per segment of large relation: 131072
Maximum length of identifiers: 64
Maximum number of function arguments: 32
Date/time type storage: 64-bit integers
Maximum length of locale name: 128
LC_COLLATE: C
LC_CTYPE: C

I have no idea how this happened, but those two outputs are identical except
for the "last modified" date. The space used is about what I'd expect:

# du -h --max-depth=1 /replica
403G /replica/pgdata

# du -h --max-depth=1 /production
201G /production/pgdata

The "/production/pgdata" cluster has not been in use since Nov 2. But
we've been loading data aggressively into "/replica/pgdata".

Any theories on how we screwed up?

Joe
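
One way to confirm that the two pg_controldata outputs differ only in the "last modified" field is to parse each into label/value pairs and compare them. The following is a minimal sketch, assuming the output has been captured as text with one "label: value" pair per line. The helper names parse_controldata and diff_controldata are hypothetical and not part of PostgreSQL, and the sample text is abbreviated from the outputs quoted in the message.

```python
def parse_controldata(text):
    """Parse pg_controldata-style 'label: value' lines into a dict."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only: values such as timestamps
        # contain further colons that must stay in the value.
        label, sep, value = line.partition(":")
        if not sep:
            continue  # skip blank lines or anything without a label
        fields[label.strip()] = value.strip()
    return fields


def diff_controldata(a, b):
    """Return {label: (value_a, value_b)} for every field that differs."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


# Abbreviated samples copied from the two outputs in the message.
replica = parse_controldata("""\
Database cluster state: shutting down
pg_control last modified: Tue Dec 14 15:39:26 2004
Latest checkpoint location: 0/9B0B8C
Time of latest checkpoint: Tue Nov 2 17:05:32 2004
""")
production = parse_controldata("""\
Database cluster state: shutting down
pg_control last modified: Tue Nov 2 21:57:49 2004
Latest checkpoint location: 0/9B0B8C
Time of latest checkpoint: Tue Nov 2 17:05:32 2004
""")

for label, (left, right) in diff_controldata(replica, production).items():
    print(f"{label}: {left!r} != {right!r}")
```

With these samples, the only field reported is "pg_control last modified", which matches the observation in the message that the checkpoint data of both clusters is otherwise the same.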
