Re: could not read block 77 of relation 1663/16385/388818775

From: Alexandra Nitzschke <an(at)clickware(dot)de>
To: pgsql-bugs(at)postgresql(dot)org
Cc: Craig Ringer <craig(at)postnewspapers(dot)com(dot)au>
Subject: Re: could not read block 77 of relation 1663/16385/388818775
Date: 2008-11-21 14:49:39
Message-ID: 4926CA83.5000304@clickware.de
Lists: pgsql-bugs

Hi,

here is some information about the server:

- no other database system runs on the server
- SUSE 10.3, standard installation
- JVM 1.5.0_16
- as interface to the database we use JDBC version 8.1-407
- as procedural language we only use PL/pgSQL
- we have nothing special in the database, such as custom datatypes and so on

We have had a look at the /var/log files: no system crash, kernel panic, or anything similar has happened.

Running a RAID verify reported no error.

We set up a new database on 2008/11/17 after updating to 8.3.5.
Since then we have not killed the postmaster or any backend process manually using kill -9.
Just normal stop/start/restart, no manual deletion of postmaster.pid or anything like that.

Let me describe the circumstances:

To handle any failure of the system, we built a "pair" of servers.
One server is the primary server and one is the standby server.
In case of a system failure we can switch the servers and use the "old" standby as the "new" primary.

For database replication we use the "Warm Standby Using Point-In-Time Recovery" method.

Last Friday I updated postgres to 8.3.5 on the primary server, set up a new database, and loaded a dump from Wednesday
using the pg_restore utility.

This Monday I updated postgres to 8.3.5 on the standby server.
After that I initialized the database
( a one-time copy of the database from the primary system: removing data/* on the standby, putting the database on the primary into
backup mode, and then copying the database files ),
started recovery mode on the standby, and started WAL replication from the primary.
On Tuesday we did a switch test: we started up the database on the standby and used it as the primary server.
Everything worked fine until yesterday morning.
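
The standby initialization described above can be sketched roughly as follows. This is only a minimal illustration in Python that lists the steps in order and executes nothing. The function name, the paths, the host name, the archive directory, the use of rsync for the copy, and the use of pg_standby as restore_command are all assumptions for illustration and are not details given in this message. pg_stop_backup() is included because backup mode has to be ended once the copy is finished.

```python
import re


def standby_init_plan(primary_data, standby_host, standby_data, archive_dir,
                      label="standby_init"):
    """Return ordered (host, step) pairs for seeding a warm standby.

    Nothing is executed; the steps are meant to be reviewed and run by hand.
    All arguments are hypothetical example values (paths without spaces).
    """
    # Keep the label simple so it can be embedded in the SQL string as-is.
    if not re.fullmatch(r"[A-Za-z0-9_]+", label):
        raise ValueError("label must contain only letters, digits, underscore")
    copy_cmd = ("rsync -a --exclude=postmaster.pid "
                f"{primary_data}/ {standby_host}:{standby_data}/")
    return [
        # 1. Empty the standby's data directory ("removing data/*").
        ("standby", f"rm -rf {standby_data}/*"),
        # 2. Put the primary into backup mode.
        ("primary", f"psql -c \"SELECT pg_start_backup('{label}')\""),
        # 3. Copy the database files (rsync is an assumed copy tool).
        ("primary", copy_cmd),
        # 4. Leave backup mode again.
        ("primary", 'psql -c "SELECT pg_stop_backup()"'),
        # 5. Recovery settings on the standby (pg_standby is an assumption).
        ("standby", "recovery.conf: restore_command = "
                    f"'pg_standby {archive_dir} %f %p %r'"),
        # 6. Start the standby so that it enters recovery mode.
        ("standby", f"pg_ctl -D {standby_data} start"),
    ]


if __name__ == "__main__":
    for host, step in standby_init_plan("/var/lib/pgsql/data", "standby1",
                                        "/var/lib/pgsql/data",
                                        "/mnt/wal_archive"):
        print(f"[{host}] {step}")
```

The order is the important part: the file copy happens only between pg_start_backup() and pg_stop_backup(), and the standby is started only after the copy is complete.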

Our workload:
Overnight we retrieve a lot of data and insert it into the database ( ~300000 - ~700000 records into different tables,
but mostly into three main tables ).
In the morning we do some processing, but mostly selects.
During the day people work on the webapp on the system, but do mostly selects.
Every evening "vacuum analyze" runs.

Regards,

A. Nitzschke

Craig Ringer wrote:
> Alexandra Nitzschke wrote:
>> Hi,
>>
>> we have had similar postgres problems in the past.
>> Please have a look at Bug 3484.
>>
>> We didn't resolve the problems mentioned in bug 3484. The other postgres
>> developers also thought that there were hardware
>> problems.
>> So our customer bought a new server with a different hardware configuration
>> ( ... and NEW hardware drives ... ).
>> The error occurred today on the new machine, which has been running under heavy
>> load for just two days.
>
> Yes, that does seem somewhat unlikely, especially if in both cases
> you've only seen issues with PostgreSQL. However, I'm a bit confused
> about the fact that you're seeing apparent corruption all over the place
> - your earlier report mentions damaged blocks across a number of
> relations, and this one is a bad index. You'd expect this sort of thing
> to come up a lot on the list, so it must be assumed that there's
> something a bit unusual or different about your configuration that's
> either triggering a hard-to-hit bug in PostgreSQL, or that's damaging
> PostgreSQL's data somehow.
>
> Is there any chance you have EVER hard-killed the postmaster manually
> (eg with "kill -9" or "kill -KILL")? If you do that and don't also kill
> the backends, it's my understanding that BAD things may happen
> especially if you then attempt to relaunch the postmaster.
>
> Do you use _any_ 3rd party C extensions? Contrib modules? It doesn't
> have to be in the same database, another database on the same machine
> could be bad too.
>
> Do you have any unusual workload? What is your workload like?
>
> What procedural languages, if any, do you use? Pl/PgSQL? Pl/Perl?
> Pl/Java? Pl/Python? etc. Again, in any database, not just your problem
> one. If you use any other than Pl/PgSQL please also note the version of
> the language interpreter/tools and in the case of Java the JVM vendor &
> install method.
>
> Does your site possibly have dodgy power? Are the servers on a UPS?
>
> Have the servers had any crashes, kernel panics, unexpected reboots, or
> hard poweroffs?
>
> (Not that it should matter, but): Have you hard killed any backends
> (kill -9 / SIGKILL)?
>
> If you run a RAID verify using tw_cli or through the 3dm web interface,
> does it report any block mismatches in the array?
>
> --
> Craig Ringer
>
>
>
