pg_clog woes with 7.3.2 - Episode 2

Lists: pgsql-hackers
From: "Dave Page" <dpage(at)vale-housing(dot)co(dot)uk>
To: <pgsql-hackers(at)postgresql(dot)org>
Subject: pg_clog woes with 7.3.2 - Episode 2
Date: 2003-04-16 14:20:19
Message-ID: 03AF4E498C591348A42FC93DEA9661B83AF043@mail.vale-housing.co.uk

Hi all,

A week or two back I reported a problem with 7.3.2 failing to open pg_clog
files ([HACKERS] pg_clog woes with 7.3.2).

Tom Lane was kind enough to log in and do some debugging on a couple of
occasions and found corrupt pages in a database on the system. After the
first session, he suggested running memtest86 which showed no errors
after multiple passes and badblocks which showed no errors after
multiple non-destructive read-write tests.

One initdb and reload later (it's a new system, the old is still running
OK), and the error comes back again, only this time Tom finds the
corruption is in a couple of pages. Memtest86 again shows no errors;
badblocks eventually did, but only when I used a destructive read-write
test.

So, the disk goes back to Seagate, and is replaced with another
identical one, and a similar problem reoccurs (logs below) :-(. I
haven't run badblocks yet as it takes a fair while, but wanted to find
out if anyone thought this could be an OS issue or something else.
Previously I've been using the 2.4.19 Linux kernel, however this machine
is 2.4.20 (Slackware Linux 9). The SCSI adaptor is an Adaptec 29160, and
the disks are 34Gb Seagate Cheetah X15's.

Any thoughts or suggestions would be appreciated.

Regards, Dave.

LOG: connection received: host=[local]
LOG: connection authorized: user=postgres database=mnogo_int
LOG: query: begin; select getdatabaseencoding(); commit
LOG: query: vacuum;
PANIC: read of clog file 0, offset 253952 failed: Success
LOG: statement: vacuum;
LOG: server process (pid 2006) was terminated by signal 6
LOG: terminating any other active server processes
LOG: all server processes terminated; reinitializing shared memory and
semaphores
LOG: database system was interrupted at 2003-04-16 15:06:34 BST
LOG: checkpoint record is at 0/2FB37D94
LOG: redo record is at 0/2FB37D94; undo record is at 0/0; shutdown TRUE
LOG: next transaction id: 3186; next oid: 9724496
LOG: database system was not properly shut down; automatic recovery in
progress
LOG: redo starts at 0/2FB37DD4
LOG: ReadRecord: record with zero length at 0/2FC39CAC
LOG: redo done at 0/2FC39C88
LOG: database system is ready
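
The offset in the PANIC line can be turned into a transaction-ID range with a little arithmetic. The sketch below assumes the default 8192-byte block size and two commit-status bits per transaction in pg_clog; the function and constant names are made up for illustration and are not taken from the PostgreSQL source.

```python
# Assumptions (not stated in the thread): default block size and
# two status bits per transaction in the clog.
BLCKSZ = 8192
CLOG_BITS_PER_XACT = 2
XACTS_PER_BYTE = 8 // CLOG_BITS_PER_XACT      # 4 transactions per byte
XACTS_PER_PAGE = BLCKSZ * XACTS_PER_BYTE      # 32768 transactions per page


def xid_range_for_offset(offset):
    """Map a byte offset in the first clog file to (page, first_xid, last_xid)."""
    page, remainder = divmod(offset, BLCKSZ)
    if remainder:
        raise ValueError("offset is not page-aligned")
    first_xid = page * XACTS_PER_PAGE
    return page, first_xid, first_xid + XACTS_PER_PAGE - 1


if __name__ == "__main__":
    page, first_xid, last_xid = xid_range_for_offset(253952)  # offset from the PANIC line
    next_xid = 3186                                           # from the recovery log
    print(f"page {page} covers xids {first_xid}..{last_xid}; next xid is {next_xid}")
```

Under those assumptions the offset lands on page 31, which covers transaction IDs from roughly one million upward, while the log reports a next transaction id of only 3186. That would be consistent with a damaged transaction ID in a data page sending the lookup past the end of the clog file, and a short read that leaves errno at zero would account for the "failed: Success" wording.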


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Dave Page" <dpage(at)vale-housing(dot)co(dot)uk>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: pg_clog woes with 7.3.2 - Episode 2
Date: 2003-04-16 15:22:22
Message-ID: 18323.1050506542@sss.pgh.pa.us

"Dave Page" <dpage(at)vale-housing(dot)co(dot)uk> writes:
> One initdb and reload later (it's a new system, the old is still running
> OK), and the error comes back again, only this time Tom finds the
> corruption is in a couple of pages. Memtest86 again shows no errors;
> badblocks eventually did, but only when I used a destructive read-write
> test.

> So, the disk goes back to Seagate, and is replaced with another
> identical one, and a similar problem reoccurs (logs below) :-(. I
> haven't run badblocks yet as it takes a fair while, but wanted to find
> out if anyone thought this could be an OS issue or something else.
> Previously I've been using the 2.4.19 Linux kernel, however this machine
> is 2.4.20 (Slackware Linux 9). The SCSI adaptor is an Adaptec 29160, and
> the disks are 34Gb Seagate Cheetah X15's.

How annoying. My bet would be on the SCSI adaptor being the problem.
Or you could have a cabling issue --- SCSI is not as bad as IDE, but
it's still finicky, esp w.r.t. termination.

regards, tom lane


From: Kevin Brown <kevin(at)sysexperts(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: pg_clog woes with 7.3.2 - Episode 2
Date: 2003-04-17 02:35:25
Message-ID: 20030417023525.GE1833@filer

Dave Page wrote:
> So, the disk goes back to Seagate, and is replaced with another
> identical one, and a similar problem reoccurs (logs below) :-(. I
> haven't run badblocks yet as it takes a fair while, but wanted to find
> out if anyone thought this could be an OS issue or something else.
> Previously I've been using the 2.4.19 Linux kernel, however this machine
> is 2.4.20 (Slackware Linux 9). The SCSI adaptor is an Adaptec 29160, and
> the disks are 34Gb Seagate Cheetah X15's.
>
> Any thoughts or suggestions would be appreciated.

I'd definitely run badblocks against the new drive -- multiple times.
Either it should yield the same bad block list each time (in which
case you've got a set of unrecoverable bad blocks -- this usually means
there are no spare blocks left), or you should see the number of bad
blocks drop to zero (as the SCSI bad block remapping takes effect),
unless something really funky is going on.
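
The comparison across runs described above could be automated roughly as follows. This is only a sketch: it assumes each run's bad block list has been saved to a text file with one block number per line (badblocks can write such a list with its -o option), and the function names and result labels are hypothetical.

```python
def parse_badblocks_output(text):
    """Parse a saved bad block list (one block number per line) into a set."""
    return {int(token) for token in text.split()}


def classify_runs(runs):
    """Classify successive badblocks results, oldest run first.

    Labels are illustrative:
      'clean'        - no run reported any bad block
      'stable'       - every run reported the same non-empty list
      'remapped'     - counts never grew and the last run found nothing
      'inconsistent' - anything else
    """
    sets = [set(run) for run in runs]
    if not sets:
        raise ValueError("need at least one run")
    if all(not s for s in sets):
        return "clean"
    if all(s == sets[0] for s in sets):
        return "stable"
    counts = [len(s) for s in sets]
    shrinking = all(a >= b for a, b in zip(counts, counts[1:]))
    if shrinking and not sets[-1]:
        return "remapped"
    return "inconsistent"


if __name__ == "__main__":
    # Made-up block numbers, purely to show the call shape.
    first = parse_badblocks_output("1024\n2048\n")
    second = parse_badblocks_output("1024\n")
    third = parse_badblocks_output("")
    print(classify_runs([first, second, third]))
```

A "stable" result corresponds to the unrecoverable case, "remapped" to the drive's remapping taking effect, and "inconsistent" to the situation where something stranger is going on.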

I'd also start looking carefully through the system logs for SCSI
errors. You should see some if you're getting bad block problems (in
particular, you should see bad block remapping attempts that couldn't
read the data from the original bad block -- this, or running out of
spare blocks, is the only reason you should see errors at all on an
otherwise functional setup).

If badblocks shows errors but you don't see any SCSI errors in the
system logs, then it's time to start suspecting the disk controller or
perhaps even the PCI bus controller, because it means something really
weird is happening on the backend that is entirely invisible. Cabling
or termination could be an issue, but I'd expect to see parity errors,
timed out commands, etc. if that's the problem.

--
Kevin Brown kevin(at)sysexperts(dot)com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Kevin Brown <kevin(at)sysexperts(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: pg_clog woes with 7.3.2 - Episode 2
Date: 2003-04-17 03:39:31
Message-ID: 3008.1050550771@sss.pgh.pa.us

Kevin Brown <kevin(at)sysexperts(dot)com> writes:
> If badblocks shows errors but you don't see any SCSI errors in the
> system logs, then it's time to start suspecting the disk controller or
> perhaps even the PCI bus controller, because it means something really
> weird is happening on the backend that is entirely invisible. Cabling
> or termination could be an issue, but I'd expect to see parity errors,
> timed out commands, etc. if that's the problem.

Dave neglected to mention that the two or three bad blocks we'd traced
down all showed a consistent pattern of errors: there was a 64-byte
region of wrong data, aligned on a 64-byte offset from the start of the
disk block, and the contents were copies of correct data from positions
exactly 64 bytes before or after the bad area.

Considering that, I would bet a good deal that the problem is some kind
of transfer timing error in some chunk of hardware that copies the data
64 bytes at a time. I withdraw my previous thought that it might be
cabling --- there are no 64-byte-wide SCSI cables. It could easily be
internal to the SCSI adaptor though. If his motherboard is high-end
enough that the DMA path from adaptor to memory is 64 bytes wide, then
DMA timing errors would be a possibility too.
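
A pattern like this can be searched for mechanically. The sketch below is not the procedure that was actually used on Dave's machine; it is a hypothetical scan that flags any 64-byte-aligned chunk of a block that is byte-for-byte identical to the chunk immediately after it. Chunks made of a single repeated byte are skipped, since zero-filled regions of a page would otherwise match trivially. Without a known-good copy the scan cannot say which chunk of a matching pair is the wrong one.

```python
def find_adjacent_duplicate_chunks(block, chunk=64):
    """Return offsets of aligned chunks identical to the next chunk.

    block: bytes of one disk block (8192 bytes is assumed to be typical)
    chunk: transfer unit under suspicion, 64 bytes in this thread
    """
    hits = []
    for offset in range(0, len(block) - 2 * chunk + 1, chunk):
        current = block[offset:offset + chunk]
        following = block[offset + chunk:offset + 2 * chunk]
        # Ignore constant runs (e.g. all zeros) to limit false positives.
        if current == following and len(set(current)) > 1:
            hits.append(offset)
    return hits


if __name__ == "__main__":
    # Synthetic 8192-byte block whose neighbouring 64-byte chunks all differ.
    good = bytes(range(256)) * 32
    bad = bytearray(good)
    # Simulate the fault: bytes 128..191 overwritten with the data
    # from exactly 64 bytes earlier.
    bad[128:192] = good[64:128]
    print(find_adjacent_duplicate_chunks(bytes(bad)))
```

Real table pages can contain legitimately repeated data, so any hit is only a candidate for closer inspection.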

regards, tom lane


From: John Ireland <member46957(at)dbforums(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: pg_clog woes with 7.3.2 - Episode 2
Date: 2003-11-04 15:06:09
Message-ID: 3557599.1067958369@dbforums.com


Just caught up on this thread - I currently have the same problem on a
Duron 900 with a Gigabyte 7VAX (or similar) motherboard. We've had a
number of problems with this system before, but to date they were all
fixed by increasing the available shared memory (echo $BIGNUM >
/proc/sys/kernel/shmall).

We're on a bog standard IDE drive. I've not yet run many tests, but
will advise of results shortly.

--
Posted via http://dbforums.com