From: Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Disaster!
Date: 2004-01-23 20:28:34
Message-ID: 401183F2.3070509@familyhealth.com.au
Lists: pgsql-hackers

We ran out of disk space on our main server, and now that I've freed up
space, we cannot start postgres!

Jan 23 12:18:50 canaveral postgres[563]: [2-1] LOG: checkpoint record
is at 2/96500B94
Jan 23 12:18:50 canaveral postgres[563]: [3-1] LOG: redo record is at
2/964BD23C; undo record is at 0/0; shutdown FALSE
Jan 23 12:18:50 canaveral postgres[563]: [4-1] LOG: next transaction
ID: 14216463; next OID: 4732327
Jan 23 12:18:50 canaveral postgres[563]: [5-1] LOG: database system was
not properly shut down; automatic recovery in progress
Jan 23 12:18:50 canaveral postgres[563]: [6-1] LOG: redo starts at
2/964BD23C
Jan 23 12:18:51 canaveral postgres[563]: [7-1] PANIC: could not access
status of transaction 14286850
Jan 23 12:18:51 canaveral postgres[563]: [7-2] DETAIL: could not read
from file "/usr/local/pgsql/data/pg_clog/000D" at offset 163840:
Undefined error: 0
Jan 23 12:18:51 canaveral postgres[567]: [1-1] FATAL: the database
system is starting up
Jan 23 12:18:52 canaveral postgres[558]: [1-1] LOG: startup process
(PID 563) was terminated by signal 6
Jan 23 12:18:52 canaveral postgres[558]: [2-1] LOG: aborting startup
due to startup process failure

What can I do?

Chris


From: Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Disaster!
Date: 2004-01-23 20:34:51
Message-ID: 4011856B.1090003@familyhealth.com.au
Lists: pgsql-hackers

pg_clog information:

# cd pg_clog
# ls -al
total 3602
drwx------ 2 pgsql pgsql 512 Jan 23 03:49 .
drwx------ 6 pgsql pgsql 512 Jan 23 12:30 ..
-rw------- 1 pgsql pgsql 262144 Jan 18 19:43 0000
-rw------- 1 pgsql pgsql 262144 Jan 18 19:43 0001
-rw------- 1 pgsql pgsql 262144 Jan 18 19:43 0002
-rw------- 1 pgsql pgsql 262144 Jan 18 19:43 0003
-rw------- 1 pgsql pgsql 262144 Jan 19 08:35 0004
-rw------- 1 pgsql pgsql 262144 Jan 21 10:35 0005
-rw------- 1 pgsql pgsql 262144 Jan 20 10:07 0006
-rw------- 1 pgsql pgsql 262144 Jan 21 19:30 0007
-rw------- 1 pgsql pgsql 262144 Jan 21 17:24 0008
-rw------- 1 pgsql pgsql 262144 Jan 22 06:50 0009
-rw------- 1 pgsql pgsql 262144 Jan 22 13:01 000A
-rw------- 1 pgsql pgsql 262144 Jan 23 06:45 000B
-rw------- 1 pgsql pgsql 262144 Jan 23 09:37 000C
-rw------- 1 pgsql pgsql 163840 Jan 23 12:30 000D

I don't have debug symbols, so the backtrace isn't that helpful:

(gdb) bt
#0 0x2840a848 in kill () from /usr/lib/libc.so.4
#1 0x2844ee90 in abort () from /usr/lib/libc.so.4
#2 0x81b33ba in ?? ()
#3 0x8092a0c in ?? ()
#4 0x8092450 in ?? ()
#5 0x808a9d7 in ?? ()
#6 0x808adf4 in ?? ()
#7 0x808ae8c in ?? ()
#8 0x808bffa in ?? ()
#9 0x809082f in ?? ()
#10 0x8095204 in ?? ()
#11 0x8135c6b in ?? ()
#12 0x813309c in ?? ()
#13 0x8108dc3 in ?? ()
#14 0x806d49b in ?? ()

I have a backup of the data dir for future analysis, but this is kind of
an urgent fix right now! (Especially since it's 4am my time :( )

Thanks,

Chris

Christopher Kings-Lynne wrote:

> We ran out of disk space on our main server, and now that I've freed up
> space, we cannot start postgres!
>
> Jan 23 12:18:50 canaveral postgres[563]: [2-1] LOG: checkpoint record
> is at 2/96500B94
> Jan 23 12:18:50 canaveral postgres[563]: [3-1] LOG: redo record is at
> 2/964BD23C; undo record is at 0/0; shutdown FALSE
> Jan 23 12:18:50 canaveral postgres[563]: [4-1] LOG: next transaction
> ID: 14216463; next OID: 4732327
> Jan 23 12:18:50 canaveral postgres[563]: [5-1] LOG: database system was
> not properly shut down; automatic recovery in progress
> Jan 23 12:18:50 canaveral postgres[563]: [6-1] LOG: redo starts at
> 2/964BD23C
> Jan 23 12:18:51 canaveral postgres[563]: [7-1] PANIC: could not access
> status of transaction 14286850
> Jan 23 12:18:51 canaveral postgres[563]: [7-2] DETAIL: could not read
> from file "/usr/local/pgsql/data/pg_clog/000D" at offset 163840:
> Undefined error: 0
> Jan 23 12:18:51 canaveral postgres[567]: [1-1] FATAL: the database
> system is starting up
> Jan 23 12:18:52 canaveral postgres[558]: [1-1] LOG: startup process
> (PID 563) was terminated by signal 6
> Jan 23 12:18:52 canaveral postgres[558]: [2-1] LOG: aborting startup
> due to startup process failure
>
> What can I do?
>
> Chris
>
> ---------------------------(end of broadcast)---------------------------
> TIP 9: the planner will ignore your desire to choose an index scan if your
> joining column's datatypes do not match


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Disaster!
Date: 2004-01-23 20:42:07
Message-ID: 3756.1074890527@sss.pgh.pa.us
Lists: pgsql-hackers

Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au> writes:
> We ran out of disk space on our main server, and now that I've freed up
> space, we cannot start postgres!

> Jan 23 12:18:51 canaveral postgres[563]: [7-1] PANIC: could not access
> status of transaction 14286850
> Jan 23 12:18:51 canaveral postgres[563]: [7-2] DETAIL: could not read
> from file "/usr/local/pgsql/data/pg_clog/000D" at offset 163840:
> Undefined error: 0

I'd suggest extending that file with 8K of zeroes (might need more than
that, but probably not).

regards, tom lane


From: Martín Marqués <martin(at)bugs(dot)unl(dot)edu(dot)ar>
To: Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Disaster!
Date: 2004-01-23 20:44:45
Message-ID: 1074890685.401187bd551c1@bugs.unl.edu.ar
Lists: pgsql-hackers

Message quoted by Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>:

> > I'd suggest extending that file with 8K of zeroes (might need more than
> > that, but probably not).
>
> How do I do that? Sorry - I'm not sure of the quickest way, and I'm
> reading man pages as we speak!

# dd if=/dev/zero of=somefile bs=8k count=1
# cat file1 somefile > newfile
# mv newfile file1

file1 is "/usr/local/pgsql/data/pg_clog/000D"

--
select 'mmarques' || '@' || 'unl.edu.ar' AS email;
-------------------------------------------------------
Martín Marqués | Programmer, DBA
Centro de Telemática | Administrator
Universidad Nacional del Litoral
-------------------------------------------------------


From: Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Disaster!
Date: 2004-01-23 20:46:08
Message-ID: 40118810.9080202@familyhealth.com.au
Lists: pgsql-hackers

> I'd suggest extending that file with 8K of zeroes (might need more than
> that, but probably not).

How do I do that? Sorry - I'm not sure of the quickest way, and I'm
reading man pages as we speak!

Thanks Tom,

Chris


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Disaster!
Date: 2004-01-23 20:50:08
Message-ID: 3915.1074891008@sss.pgh.pa.us
Lists: pgsql-hackers

Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au> writes:
>> I'd suggest extending that file with 8K of zeroes (might need more than
>> that, but probably not).

> How do I do that? Sorry - I'm not sure of the quickest way, and I'm
> reading man pages as we speak!

Something like "dd if=/dev/zero bs=8k count=1 >>clogfile", but check the
dd man page (and make sure you have a /dev/zero).

regards, tom lane


From: Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Disaster!
Date: 2004-01-23 20:51:11
Message-ID: 4011893F.9030007@familyhealth.com.au
Lists: pgsql-hackers


> I'd suggest extending that file with 8K of zeroes (might need more than
> that, but probably not).

OK, I've done

dd if=/dev/zero of=zeros count=16

Then cat zeros >> 000D

Now I can start it up! Thanks!

What should I do now?

Chris


From: Martín Marqués <martin(at)bugs(dot)unl(dot)edu(dot)ar>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Disaster!
Date: 2004-01-23 20:58:33
Message-ID: 1074891513.40118af99977f@bugs.unl.edu.ar
Lists: pgsql-hackers

Message quoted by Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>:

> Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au> writes:
> > Now I can start it up! Thanks!
>
> > What should I do now?
>
> Go home and get some sleep ;-). If the WAL replay succeeded, you're up
> and running, nothing else to do.

Tom, could you give some insight into what occurred here, why those 8K of
zeroes fixed it, and what a "WAL replay" is?

I am very curious about it.

--
select 'mmarques' || '@' || 'unl.edu.ar' AS email;
-------------------------------------------------------
Martín Marqués | Programmer, DBA
Centro de Telemática | Administrator
Universidad Nacional del Litoral
-------------------------------------------------------


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Disaster!
Date: 2004-01-23 21:00:55
Message-ID: 4031.1074891655@sss.pgh.pa.us
Lists: pgsql-hackers

Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au> writes:
> Now I can start it up! Thanks!

> What should I do now?

Go home and get some sleep ;-). If the WAL replay succeeded, you're up
and running, nothing else to do.

regards, tom lane


From: Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Disaster!
Date: 2004-01-23 21:06:28
Message-ID: 40118CD4.900@familyhealth.com.au
Lists: pgsql-hackers

>>What should I do now?
>
>
> Go home and get some sleep ;-). If the WAL replay succeeded, you're up
> and running, nothing else to do.

Cool, thanks heaps Tom.

Are you interested in real backtraces, any of the old data directory,
etc. to debug the problem?

Obviously it ran out of disk space, but surely postgres should be able
to start up somehow?

Chris


From: Rod Taylor <pg(at)rbt(dot)ca>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Disaster!
Date: 2004-01-23 21:12:32
Message-ID: 1074892351.17127.44.camel@jester
Lists: pgsql-hackers

On Fri, 2004-01-23 at 16:00, Tom Lane wrote:
> Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au> writes:
> > Now I can start it up! Thanks!
>
> > What should I do now?
>
> Go home and get some sleep ;-). If the WAL replay succeeded, you're up
> and running, nothing else to do.

Granted, running out of disk space is a bad idea, but can (or has?)
something be put into place to prevent manual intervention from being
required when restarting the database?


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Disaster!
Date: 2004-01-23 21:13:09
Message-ID: 4168.1074892389@sss.pgh.pa.us
Lists: pgsql-hackers

Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au> writes:
> Are you interested in real backtraces, any of the old data directory,
> etc. to debug the problem?

If you could recompile with debug support and get a backtrace from the
panic, it would be helpful. I suspect what we need to do is make the
clog code more willing to interpret a zero-length read as 8K of zeroes
instead of an error, at least during recovery. But I kinda thought
there was such an escape hatch already, so I want to see exactly how
it got to the point of the failure.

Also, which PG version are you running exactly?

> Obviously it ran out of disk space, but surely postgres should be able
> to start up somehow?

See the response I'm about to write to Martín.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Martín Marqués <martin(at)bugs(dot)unl(dot)edu(dot)ar>
Cc: Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Disaster!
Date: 2004-01-23 21:21:04
Message-ID: 4221.1074892864@sss.pgh.pa.us
Lists: pgsql-hackers

Martín Marqués <martin(at)bugs(dot)unl(dot)edu(dot)ar> writes:
> Tom, could you give some insight into what occurred here, why those
> 8K of zeroes fixed it, and what a "WAL replay" is?

I think what happened is that there was insufficient space to write out
a new page of the clog (transaction commit) file. This would result in
a database panic, which is fine --- you're not gonna get much done
anyway if you are down to zero free disk space. However, after Chris
freed up space, the system needed to replay the WAL from the last
checkpoint to ensure consistency. The WAL entries evidently included
references to transactions whose commit bits were in the unwritten page.
Now there would also be WAL entries recording those commits, so once the
replay was complete everything would be cool. But the clog access code
evidently got confused by being asked to read a page that didn't exist
in the file. I'm not sure yet how that sequence of events occurred,
which is why I asked Chris for a stack trace.

Adding a page of zeroes fixed it by eliminating the read error
condition. It was okay to do so because zeroes is the correct initial
state for a clog page (all transactions in it "still in progress").
After WAL replay, any completed transactions would be updated in the page.
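
To make those numbers concrete, here is a small back-of-the-envelope check, a
sketch only, using assumed constants that mirror the usual clog layout (2
status bits per transaction, 8K pages, 32 pages per segment file). It shows
why transaction 14286850 lands exactly in segment 000D at offset 163840, and
why one extra 8K page of zeroes was enough:

#include <stdio.h>

/* Assumed constants mirroring clog.c/slru.c: 2 status bits per
 * transaction, 8192-byte pages, 32 pages per segment file. */
#define CLOG_BITS_PER_XACT      2
#define CLOG_XACTS_PER_BYTE     (8 / CLOG_BITS_PER_XACT)        /* 4 */
#define BLCKSZ                  8192
#define CLOG_XACTS_PER_PAGE     (BLCKSZ * CLOG_XACTS_PER_BYTE)  /* 32768 */
#define SLRU_PAGES_PER_SEGMENT  32

int
main(void)
{
    unsigned int xid = 14286850;                    /* from the PANIC message */
    unsigned int page = xid / CLOG_XACTS_PER_PAGE;  /* 436 */
    unsigned int seg = page / SLRU_PAGES_PER_SEGMENT;               /* 13 -> "000D" */
    unsigned int off = (page % SLRU_PAGES_PER_SEGMENT) * BLCKSZ;    /* 163840 */

    printf("xid %u -> segment %04X, offset %u\n", xid, seg, off);
    /* 000D was only 163840 bytes long, so the page holding this
     * transaction's status had never been written; one extra 8K page
     * of zeroes makes the read succeed with "still in progress" bits. */
    return 0;
}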

regards, tom lane


From: Alvaro Herrera <alvherre(at)dcc(dot)uchile(dot)cl>
To: Martín Marqués <martin(at)bugs(dot)unl(dot)edu(dot)ar>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Disaster!
Date: 2004-01-23 21:22:49
Message-ID: 20040123212249.GA31024@dcc.uchile.cl
Lists: pgsql-hackers

On Fri, Jan 23, 2004 at 05:58:33PM -0300, Martín Marqués wrote:

> Tom, could you give some insight into what occurred here, why those 8K of
> zeroes fixed it, and what a "WAL replay" is?

If I may ...

- the disk filled up
- Postgres registered something in WAL that required some commit status
(WAL log space is preallocated on disk, so this didn't fail)
- the clog code tried to store information about the commit bits, but
noticed that it needed to extend the clog file.
- the extension failed because the disk was full
- the server went down and a WAL replay was in order, but ...
- the WAL replay could not be done because the code tried to read a
commit status in pg_clog that wasn't there

(some time later)

- Chris freed up some space and extended the clog
- WAL replay completed, reading an "uncommitted" status from the clog.

Here, "clog" is the "commit log", an area which indicates for each
transaction whether it committed or aborted. A WAL replay is the
operation of bringing the data files (tables, indexes, etc) up to date
by reading the "Write-ahead log" (WAL).
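
As a rough sketch of that layout (hypothetical helper names, not the actual
clog.c code), each transaction gets a 2-bit slot in the commit log, so a page
of zeroes reads as "still in progress" for every transaction it covers; that
is what made padding the file with zeroes safe:

#include <stdint.h>

/* Hypothetical sketch of a clog page: a flat array of 2-bit entries,
 * so a page of zeroes means "in progress" for every transaction. */
#define XID_INPROGRESS  0x00
#define XID_COMMITTED   0x01
#define XID_ABORTED     0x02

#define XACTS_PER_BYTE  4       /* 8 bits / 2 bits per transaction */

static int
clog_get_status(const uint8_t *page, uint32_t xid_in_page)
{
    uint32_t byteno = xid_in_page / XACTS_PER_BYTE;
    uint32_t shift = (xid_in_page % XACTS_PER_BYTE) * 2;

    return (page[byteno] >> shift) & 0x03;
}

/* WAL replay of a commit record ends up doing something like this. */
static void
clog_set_status(uint8_t *page, uint32_t xid_in_page, int status)
{
    uint32_t byteno = xid_in_page / XACTS_PER_BYTE;
    uint32_t shift = (xid_in_page % XACTS_PER_BYTE) * 2;

    page[byteno] = (uint8_t) ((page[byteno] & ~(0x03 << shift)) | (status << shift));
}

int
main(void)
{
    uint8_t page[8192] = {0};   /* freshly zeroed page */

    clog_set_status(page, 42, XID_COMMITTED);
    return clog_get_status(page, 42) == XID_COMMITTED ? 0 : 1;
}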

Tom's answer will undoubtedly be better ...

--
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
"¿Que diferencia tiene para los muertos, los huérfanos, y aquellos que han
perdido su hogar, si la loca destrucción ha sido realizada bajo el nombre
del totalitarismo o del santo nombre de la libertad y la democracia?" (Gandhi)


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Rod Taylor <pg(at)rbt(dot)ca>
Cc: Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Disaster!
Date: 2004-01-23 21:23:17
Message-ID: 4253.1074892997@sss.pgh.pa.us
Lists: pgsql-hackers

Rod Taylor <pg(at)rbt(dot)ca> writes:
> Granted, running out of disk space is a bad idea, but can (or has?)
> something be put into place to prevent manual intervention from being
> required when restarting the database?

See subsequent discussion. I do want to modify the code to avoid this
problem in future, but we do not need to make Chris work on it at 5 AM
his time. It can wait till tomorrow.

regards, tom lane


From: Alvaro Herrera <alvherre(at)dcc(dot)uchile(dot)cl>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Martín Marqués <martin(at)bugs(dot)unl(dot)edu(dot)ar>, Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Disaster!
Date: 2004-01-23 21:26:39
Message-ID: 20040123212639.GB31024@dcc.uchile.cl
Lists: pgsql-hackers

On Fri, Jan 23, 2004 at 04:21:04PM -0500, Tom Lane wrote:

> But the clog access code evidently got confused by being asked to read
> a page that didn't exist in the file. I'm not sure yet how that
> sequence of events occurred, which is why I asked Chris for a stack
> trace.

There was a very similar episode some time ago, and Elein summed it up
nicely in General Bits:

http://www.varlena.com/GeneralBits/45.php

--
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
"El sentido de las cosas no viene de las cosas, sino de
las inteligencias que las aplican a sus problemas diarios
en busca del progreso." (Ernesto Hernández-Novich)


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Disaster!
Date: 2004-01-23 22:44:35
Message-ID: 200401232244.i0NMiZL14330@candle.pha.pa.us
Lists: pgsql-hackers

Tom Lane wrote:
> Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au> writes:
> > Are you interested in real backtraces, any of the old data directory,
> > etc. to debug the problem?
>
> If you could recompile with debug support and get a backtrace from the
> panic, it would be helpful. I suspect what we need to do is make the
> clog code more willing to interpret a zero-length read as 8K of zeroes
> instead of an error, at least during recovery. But I kinda thought
> there was such an escape hatch already, so I want to see exactly how
> it got to the point of the failure.
>
> Also, which PG version are you running exactly?

I asked him, 7.4.1.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)dcc(dot)uchile(dot)cl>
Cc: Martín Marqués <martin(at)bugs(dot)unl(dot)edu(dot)ar>, Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Disaster!
Date: 2004-01-24 01:21:24
Message-ID: 8707.1074907284@sss.pgh.pa.us
Lists: pgsql-hackers

Alvaro Herrera <alvherre(at)dcc(dot)uchile(dot)cl> writes:
> Tom's answer will undoubtedly be better ...

Nope, I think you got all the relevant points.

The only thing I'd add after having had more time to think about it is
that this seems very much like the problem we noticed recently with
recovery-from-WAL being broken by the new code in bufmgr.c that tries to
validate the header fields of any page it reads in. We had to add an
escape hatch to disable that check while InRecovery, and I expect what
we will end up with here is a few lines added to slru.c to make it treat
read-past-EOF as a non-error condition when InRecovery. Now the clog
code has always had all that paranoid error checking, but because it
deals in such tiny volumes of data (only 2 bits per transaction), it's
unlikely to suffer an out-of-disk-space condition. That's why we hadn't
seen this failure mode before.

regards, tom lane


From: Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alvaro Herrera <alvherre(at)dcc(dot)uchile(dot)cl>, Martín Marqués <martin(at)bugs(dot)unl(dot)edu(dot)ar>, Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Disaster!
Date: 2004-01-24 06:31:36
Message-ID: Pine.LNX.4.58.0401241726570.23715@linuxworld.com.au
Lists: pgsql-hackers

On Fri, 23 Jan 2004, Tom Lane wrote:

> Alvaro Herrera <alvherre(at)dcc(dot)uchile(dot)cl> writes:
> > Tom's answer will undoubtedly be better ...
>
> Nope, I think you got all the relevant points.
>
> The only thing I'd add after having had more time to think about it is
> that this seems very much like the problem we noticed recently with
> recovery-from-WAL being broken by the new code in bufmgr.c that tries to
> validate the header fields of any page it reads in. We had to add an
> escape hatch to disable that check while InRecovery, and I expect what
> we will end up with here is a few lines added to slru.c to make it treat
> read-past-EOF as a non-error condition when InRecovery. Now the clog
> code has always had all that paranoid error checking, but because it
> deals in such tiny volumes of data (only 2 bits per transaction), it's
> unlikely to suffer an out-of-disk-space condition. That's why we hadn't
> seen this failure mode before.

It seems that by adding the following to SlruPhysicalReadPage() we can
recover in a reasonable way here. Instead of:

if (lseek(fd, (off_t) offset, SEEK_SET) < 0)
{
    slru_errcause = SLRU_SEEK_FAILED;
    slru_errno = errno;
    return false;
}

We have:

if (lseek(fd, (off_t) offset, SEEK_SET) < 0)
{
    if (!InRecovery)
    {
        slru_errcause = SLRU_SEEK_FAILED;
        slru_errno = errno;
        return false;
    }
    ereport(LOG,
            (errmsg("Short read from file \"%s\", reading as zeroes",
                    path)));
    MemSet(shared->page_buffer[slotno], 0, BLCKSZ);
    return true;
}

Which is exactly how we recover from a missing pg_clog file.

>
> regards, tom lane

Gavin


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>
Cc: Alvaro Herrera <alvherre(at)dcc(dot)uchile(dot)cl>, Martín Marqués <martin(at)bugs(dot)unl(dot)edu(dot)ar>, Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Disaster!
Date: 2004-01-24 17:40:19
Message-ID: 3756.1074966019@sss.pgh.pa.us
Lists: pgsql-hackers

Gavin Sherry <swm(at)linuxworld(dot)com(dot)au> writes:
> It seems that by adding the following to SlruPhysicalReadPage() we can
> recover in a reasonable way here. Instead of:
> [ add non-error check to lseek() ]

But it's not the lseek() that's gonna fail. What we'll actually see,
and did see in Chris' report, is a read() that returns zero bytes, or
possibly an incomplete page. So actually this change is needed in the
next step, not the lseek.
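
A sketch of that variant, patterned on Gavin's fragment above (illustrative
only; the identifiers are assumed from slru.c, and this is not the committed
change):

int     nread;

nread = read(fd, shared->page_buffer[slotno], BLCKSZ);
if (nread != BLCKSZ)
{
    if (!InRecovery || nread < 0)
    {
        slru_errcause = SLRU_READ_FAILED;
        slru_errno = errno;
        return false;
    }
    ereport(LOG,
            (errmsg("short read from file \"%s\", zero-filling rest of page",
                    path)));
    /* zero whatever the read did not supply */
    MemSet(shared->page_buffer[slotno] + nread, 0, BLCKSZ - nread);
}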

BUT: after looking at the code more, I'm confused again about exactly
how Chris' failure happened. The backtrace he sent this morning shows
that the panic occurs while replaying a transaction-commit WAL entry ---
it's trying to set the commit status bit for that transaction number,
and finding that the clog page containing that bit doesn't exist. But
there should have been a previous WAL entry recording the ZeroCLOGPage()
action for that clog page. The only way that wouldn't have got replayed
too is if there was a checkpoint in between ... but a checkpoint should
not have been able to complete without flushing the clog buffer to disk.
If there wasn't disk space enough to hold the clog page, the checkpoint
attempt should have failed. So it may be that allowing a short read in
slru.c would be patching the symptom of a bug that is really elsewhere.
We need to understand the sequence of events in more detail.

Chris, you said you'd saved a copy of the data directory at the time of
the failure, right? Could you send me the pg_control file and the
active segments of pg_xlog? (It should be sufficient to send the ones
with file mod times within five minutes of the crash.)

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>
Cc: Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>, Alvaro Herrera <alvherre(at)dcc(dot)uchile(dot)cl>, Martín Marqués <martin(at)bugs(dot)unl(dot)edu(dot)ar>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Disaster!
Date: 2004-01-24 17:52:43
Message-ID: 3813.1074966763@sss.pgh.pa.us
Lists: pgsql-hackers

I said:
> If there wasn't disk space enough to hold the clog page, the checkpoint
> attempt should have failed. So it may be that allowing a short read in
> slru.c would be patching the symptom of a bug that is really elsewhere.

After more staring at the code, I have a theory. SlruPhysicalWritePage
and SlruPhysicalReadPage are coded on the assumption that close() can
never return any interesting failure. However, it now occurs to me that
there are some filesystem implementations wherein ENOSPC could be
returned at close() rather than the preceding write(). (For instance,
the HPUX man page for close() states that this never happens on local
filesystems but can happen on NFS.) So it'd be possible for
SlruPhysicalWritePage to think it had successfully written a page when
it hadn't. This would allow a checkpoint to complete :-(
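
A minimal sketch of the kind of check being discussed, with an illustrative
file name and no claim to match the real SlruPhysicalWritePage: both write()
and close() must succeed before the page is considered safely written.

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define BLCKSZ 8192

/*
 * Write one 8K page and refuse to report success unless write() AND
 * close() both succeed; on some filesystems (NFS in particular) ENOSPC
 * can be deferred until close().  Sketch only, not the real slru.c code.
 */
static int
write_page_checked(const char *path, off_t offset, const char *page)
{
    int     fd = open(path, O_WRONLY | O_CREAT, 0600);

    if (fd < 0)
        return -1;
    if (lseek(fd, offset, SEEK_SET) < 0 ||
        write(fd, page, BLCKSZ) != BLCKSZ)
    {
        int     save_errno = errno;

        close(fd);
        errno = save_errno;
        return -1;
    }
    if (close(fd) < 0)          /* the check this thread is arguing for */
        return -1;
    return 0;
}

int
main(void)
{
    char    page[BLCKSZ];

    memset(page, 0, sizeof(page));
    if (write_page_checked("clog-segment.test", 0, page) < 0)
        perror("write_page_checked");
    return 0;
}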

Chris, what's your platform exactly, and what kind of filesystem are
you storing pg_clog on?

regards, tom lane


From: Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>, Alvaro Herrera <alvherre(at)dcc(dot)uchile(dot)cl>, Martín Marqués <martin(at)bugs(dot)unl(dot)edu(dot)ar>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Disaster!
Date: 2004-01-25 00:21:45
Message-ID: 40130C19.2070903@familyhealth.com.au
Lists: pgsql-hackers

> After more staring at the code, I have a theory. SlruPhysicalWritePage
> and SlruPhysicalReadPage are coded on the assumption that close() can
> never return any interesting failure. However, it now occurs to me that
> there are some filesystem implementations wherein ENOSPC could be
> returned at close() rather than the preceding write(). (For instance,
> the HPUX man page for close() states that this never happens on local
> filesystems but can happen on NFS.) So it'd be possible for
> SlruPhysicalWritePage to think it had successfully written a page when
> it hadn't. This would allow a checkpoint to complete :-(

FreeBSD 4.7/4.9 and the UFS filesystem

RETURN VALUES
The close() function returns the value 0 if successful; otherwise the
value -1 is returned and the global variable errno is set to indicate
the error.

ERRORS
Close() will fail if:

[EBADF] D is not an active descriptor.

[EINTR] An interrupt was received.

Chris


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>
Cc: Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>, Alvaro Herrera <alvherre(at)dcc(dot)uchile(dot)cl>, Martín Marqués <martin(at)bugs(dot)unl(dot)edu(dot)ar>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Disaster!
Date: 2004-01-25 02:42:13
Message-ID: 16406.1074998533@sss.pgh.pa.us
Lists: pgsql-hackers

Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au> writes:
> FreeBSD 4.7/4.9 and the UFS filesystem

Hm, okay, I'm pretty sure that that combination wouldn't report ENOSPC
at close(). We need to fix the code to check close's return value,
probably, but it seems we still lack a clear explanation of what
happened to your database.

That request to look at your WAL files is still open ...

regards, tom lane


From: Greg Stark <gsstark(at)mit(dot)edu>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Disaster!
Date: 2004-01-25 05:46:55
Message-ID: 87fze4wpf4.fsf@stark.xeocode.com
Lists: pgsql-hackers

Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

> Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au> writes:
> > FreeBSD 4.7/4.9 and the UFS filesystem
>
> Hm, okay, I'm pretty sure that that combination wouldn't report ENOSPC
> at close(). We need to fix the code to check close's return value,
> probably, but it seems we still lack a clear explanation of what
> happened to your database.

The traditional Unix filesystems certainly don't return errors at close. Even
NFS doesn't traditionally do so. I think NFSv3 can if the server disappears
after the client obtains a lease on a piece of the file, but I'm not sure if
ENOSPC is a possible failure mode.

I do know that AFS returns quota failures on close. This was unusual enough
that when AFS was deployed at school unix tools failed left and right over
precisely this issue. Though it mostly just meant they returned the wrong exit
status.

--
greg


From: Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>, Alvaro Herrera <alvherre(at)dcc(dot)uchile(dot)cl>, Martín Marqués <martin(at)bugs(dot)unl(dot)edu(dot)ar>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Disaster!
Date: 2004-01-25 06:47:35
Message-ID: 20040125144710.E35192-100000@houston.familyhealth.com.au
Lists: pgsql-hackers

> That request to look at your WAL files is still open ...

I've sent you it privately - let me know how it goes.

Chris


From: Randolf Richardson <rr(at)8x(dot)ca>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Disaster!
Date: 2004-01-25 07:36:59
Message-ID: Xns947AEE1192C04rr8xca@200.46.204.72
Lists: pgsql-hackers

"gsstark(at)mit(dot)edu (Greg Stark)" stated in
comp.databases.postgresql.hackers:

> Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
>
>> Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au> writes:
>> > FreeBSD 4.7/4.9 and the UFS filesystem
>>
>> Hm, okay, I'm pretty sure that that combination wouldn't report ENOSPC
>> at close(). We need to fix the code to check close's return value,
>> probably, but it seems we still lack a clear explanation of what
>> happened to your database.
>
> The traditional Unix filesystems certainly don't return errors at close.
> Even NFS doesn't traditionally do so. I think NFSv3 can if the server
> disappears after the client obtains a lease on a piece of the file, but
> I'm not sure if ENOSPC is a possible failure mode.
[sNip]

Why shouldn't the close() function return an error? If an invalid
file handle was passed to it, it most certainly should indicate this since
it's always possible for a separate thread to close it first (or other
reasons as well).

--
Randolf Richardson - rr(at)8x(dot)ca
Vancouver, British Columbia, Canada

"We are anti-spammers. You will confirm
subscriptions. Resistance is futile."

Please do not eMail me directly when responding
to my postings in the newsgroups.


From: Manfred Spraul <manfred(at)colorfullife(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Disaster!
Date: 2004-01-25 10:30:19
Message-ID: 40139ABB.8070605@colorfullife.com
Lists: pgsql-hackers

Greg Stark wrote:

>I do know that AFS returns quota failures on close. This was unusual enough
>that when AFS was deployed at school unix tools failed left and right over
>precisely this issue. Though it mostly just meant they returned the wrong exit
>status.
>
That means
open();
write();
sync();

could succeed, but the data is not stored on disk, correct?

--
Manfred


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Manfred Spraul <manfred(at)colorfullife(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Disaster!
Date: 2004-01-25 14:14:27
Message-ID: 871xpow1x8.fsf@stark.xeocode.com
Lists: pgsql-hackers


Manfred Spraul <manfred(at)colorfullife(dot)com> writes:

> That means
> open();
> write();
> sync();
>
> could succeed, but the data is not stored on disk, correct?

That would be true on any filesystem. Unless you throw an fsync() call in.

With sync replaced by fsync, any filesystem ought to guarantee the data
has reached disk by the time fsync returns. I think this is even true of NFS
or AFS, though I wouldn't depend on it for my own data.

--
greg


From: Manfred Spraul <manfred(at)colorfullife(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Disaster!
Date: 2004-01-25 14:32:33
Message-ID: 4013D381.20900@colorfullife.com
Lists: pgsql-hackers

Greg Stark wrote:

>Manfred Spraul <manfred(at)colorfullife(dot)com> writes:
>
>>That means
>> open();
>> write();
>> sync();
>>
>>could succeed, but the data is not stored on disk, correct?
>
>That would be true on any filesystem. Unless you throw an fsync() call in.

The checkpoint code uses sync() right now. Actually sync();sleep(2);sync().
Win32 has no sync() call, therefore it will use fsyncs. Perhaps
platforms with deferred errors on close must use fsync, too. Hopefully
parallel fsyncs - sequential fsyncs could be slow due to more seeking.

--
Manfred


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Manfred Spraul <manfred(at)colorfullife(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Disaster!
Date: 2004-01-25 23:26:05
Message-ID: 87hdyjvcdu.fsf@stark.xeocode.com
Lists: pgsql-hackers


Manfred Spraul <manfred(at)colorfullife(dot)com> writes:

> The checkpoint code uses sync() right now. Actually sync();sleep(2);sync().
> Win32 has no sync() call, therefore it will use fsyncs. Perhaps platforms with
> deferred errors on close must use fsync, too. Hopefully parallel fsyncs -
> sequential fsyncs could be slow due to more seeking.

That code is known to be totally bogus in theory. However, in practice it
seems to be the best of the possible bad choices.

Even on filesystems where errors won't be deferred past the write(), the data
is still not guaranteed to be on disk. Even after the sync() call. There's no
guarantee of any particular sleep time being enough.

This was brought up a few months ago. The only safe implementation would be to
fsync every file descriptor that had received writes. The problem is keeping
track of which file descriptors those are. Also people were uncertain whether
a backend opening a file and calling fsync would guarantee that writes written
to the same file by other processes through other file descriptors would be
flushed. I'm fairly convinced they would be on all sane vfs implementations
but others were less convinced.
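
A toy sketch of that bookkeeping (hypothetical names, nothing like the real
checkpoint code): wrap write() so every descriptor that gets dirtied is
remembered, then fsync them all at checkpoint time instead of trusting
sync() plus a sleep.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Toy sketch: remember every fd we have written through, and fsync
 * them all at "checkpoint" time instead of relying on sync()+sleep().
 * Hypothetical names; not the real bufmgr/checkpoint code. */
#define MAX_TRACKED 1024

static int dirty_fds[MAX_TRACKED];
static int n_dirty = 0;

static void
remember_dirty(int fd)
{
    for (int i = 0; i < n_dirty; i++)
        if (dirty_fds[i] == fd)
            return;             /* already tracked */
    if (n_dirty < MAX_TRACKED)
        dirty_fds[n_dirty++] = fd;
}

static ssize_t
tracked_write(int fd, const void *buf, size_t len)
{
    ssize_t rc = write(fd, buf, len);

    if (rc > 0)
        remember_dirty(fd);
    return rc;
}

static int
checkpoint_fsync_all(void)
{
    int failures = 0;

    for (int i = 0; i < n_dirty; i++)
        if (fsync(dirty_fds[i]) < 0)
        {
            perror("fsync");
            failures++;
        }
    n_dirty = 0;
    return failures;
}

int
main(void)
{
    int fd = open("scratch.dat", O_WRONLY | O_CREAT | O_TRUNC, 0600);

    if (fd >= 0)
    {
        tracked_write(fd, "hello\n", 6);
        checkpoint_fsync_all();
        close(fd);
    }
    return 0;
}

Whether an fsync through one descriptor also flushes writes made through
other descriptors to the same file was the open question above; the sketch
sidesteps it by fsyncing the same descriptors that were written through.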

--
greg


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>, Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>, Alvaro Herrera <alvherre(at)dcc(dot)uchile(dot)cl>, Martín Marqués <martin(at)bugs(dot)unl(dot)edu(dot)ar>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Disaster!
Date: 2004-01-26 18:04:12
Message-ID: 200401261804.i0QI4CY10530@candle.pha.pa.us
Lists: pgsql-hackers

Tom Lane wrote:
> I said:
> > If there wasn't disk space enough to hold the clog page, the checkpoint
> > attempt should have failed. So it may be that allowing a short read in
> > slru.c would be patching the symptom of a bug that is really elsewhere.
>
> After more staring at the code, I have a theory. SlruPhysicalWritePage
> and SlruPhysicalReadPage are coded on the assumption that close() can
> never return any interesting failure. However, it now occurs to me that
> there are some filesystem implementations wherein ENOSPC could be
> returned at close() rather than the preceding write(). (For instance,
> the HPUX man page for close() states that this never happens on local
> filesystems but can happen on NFS.) So it'd be possible for
> SlruPhysicalWritePage to think it had successfully written a page when
> it hadn't. This would allow a checkpoint to complete :-(
>
> Chris, what's your platform exactly, and what kind of filesystem are
> you storing pg_clog on?

We already have a TODO on fclose():

* Add checks for fclose() failure

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Christoph Haller <ch(at)rodos(dot)fzk(dot)de>
To: pgman(at)candle(dot)pha(dot)pa(dot)us (Bruce Momjian)
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Disaster!
Date: 2004-01-29 16:13:08
Message-ID: 200401291513.QAA11098@rodos
Lists: pgsql-hackers

>
> Tom Lane wrote:
> > I said:
> > > If there wasn't disk space enough to hold the clog page, the checkpoint
> > > attempt should have failed. So it may be that allowing a short read in
> > > slru.c would be patching the symptom of a bug that is really elsewhere.
> >
> > After more staring at the code, I have a theory. SlruPhysicalWritePage
> > and SlruPhysicalReadPage are coded on the assumption that close() can
> > never return any interesting failure. However, it now occurs to me that
> > there are some filesystem implementations wherein ENOSPC could be
> > returned at close() rather than the preceding write(). (For instance,
> > the HPUX man page for close() states that this never happens on local
> > filesystems but can happen on NFS.) So it'd be possible for
> > SlruPhysicalWritePage to think it had successfully written a page when
> > it hadn't. This would allow a checkpoint to complete :-(
> >
> > Chris, what's your platform exactly, and what kind of filesystem are
> > you storing pg_clog on?
>
> We already have a TODO on fclose():
>
> * Add checks for fclose() failure
>
Tom was referring to close(), not fclose().
I once had an awful time searching for a memory leak caused
by a typo using close instead of fclose.
So adding checks for both is probably a good idea.

Regards, Christoph


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Christoph Haller <ch(at)rodos(dot)fzk(dot)de>
Cc: pgman(at)candle(dot)pha(dot)pa(dot)us (Bruce Momjian), pgsql-hackers(at)postgresql(dot)org
Subject: Re: Disaster!
Date: 2004-01-29 16:19:48
Message-ID: 23399.1075393188@sss.pgh.pa.us
Lists: pgsql-hackers

Christoph Haller <ch(at)rodos(dot)fzk(dot)de> writes:
> Tom was referring to close(), not fclose().
> I once had an awful time searching for a memory leak caused
> by a typo using close instead of fclose.
> So adding checks for both is probably a good idea.

Already done.

regards, tom lane


From: Jeroen Ruigrok/asmodai <asmodai(at)wxs(dot)nl>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Disaster!
Date: 2004-01-30 05:59:32
Message-ID: 20040130055932.GZ39523@nexus.ninth-circle.org
Lists: pgsql-hackers

-On [20040125 03:52], Tom Lane (tgl(at)sss(dot)pgh(dot)pa(dot)us) wrote:
>Hm, okay, I'm pretty sure that that combination wouldn't report ENOSPC
>at close().

From Tru64's write(2):

[ENOSPC]
    [XSH4.2] No free space is left on the file system containing the file.
    [Tru64 UNIX] An attempt was made to write past the "early warning" EOT
    while this indicator was enabled.
    [Tru64 UNIX] An attempt was made to write at or beyond the end of a
    partition.

From close(2):

[Tru64 UNIX] A close() function on an NFS file system waits for all
outstanding I/O to complete. If any operation completes with an error,
the error will be returned by close(). The possible errors depend on the
NFS server implementation, but the most likely errors are:

[snip...]

[ENOSPC] Attempted to write on a full file system.

>We need to fix the code to check close's return value, probably, but it
>seems we still lack a clear explanation of what happened to your
>database.

You always need to check the return codes of calls like that; what if
you received EBADF or EINTR for whatever reason?

--
Jeroen Ruigrok van der Werven <asmodai(at)wxs.nl> / asmodai / kita no mono
PGP fingerprint: 2D92 980E 45FE 2C28 9DB7 9D88 97E6 839B 2EAC 625B
http://www.tendra.org/ | http://diary.in-nomine.org/
From the pine tree, learn of the pine tree. And from the bamboo, of the
bamboo...


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Randolf Richardson <rr(at)8x(dot)ca>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Disaster!
Date: 2004-01-31 04:54:45
Message-ID: 26500.1075524885@sss.pgh.pa.us
Lists: pgsql-hackers

Randolf Richardson <rr(at)8x(dot)ca> writes:
> "gsstark(at)mit(dot)edu (Greg Stark)" stated in
> comp.databases.postgresql.hackers:
>> The traditional Unix filesystems certainly don't return errors at close.

> Why shouldn't the close() function return an error? If an invalid
> file handle was passed to it, it most certainly should indicate this

Of course. We're discussing the situation where no errors were reported
in prior syscalls --- in particular, open() succeeded.

regards, tom lane