Severe problems with PostgreSQL 7.4.7 installation, please help!

From: "Peter Bauer" <peter(dot)m(dot)bauer(at)gmail(dot)com>
To: pgsql-general(at)postgresql(dot)org, peter(dot)m(dot)bauer(at)gmail(dot)com
Subject: Severe problems with PostgreSQL 7.4.7 installation, please help!
Date: 2006-09-02 18:59:18
Message-ID: 764c9e910609021159i29d8af44nd57029e2e62f0fdc@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hi all,

like always, time is short and we have a very critical situation here,
so please help to save the day once again.

Installation facts:
- 4 High Availability Clusters consisting of 2 Dual Xeon Server
machines (Dell PowerEdge 2850) using Heartbeat1. Each server has only
one harddisk, no RAID configuration is used.
- Each Cluster has 4 drbd Devices, one for the PostgreSQL data
- Two of these clusters are using the same PostgreSQL installation to
share the data, the database can be moved from one cluster to the
other in case of failure
- OS: Debian Sarge with postgresql-7.4.7-6sarge1
- Two cronjobs are configured to perform a "vacuumdb --all --full"
every 10 minutes and a "vacuumdb --all --full --analyze" every 23
minutes
- There are 3 applications using the PostgreSQL installation, each
with their own database.
- The main application is based on Tomcat 4.1-30 partly a
Web-Application, partly a terminal login protocol based on http and
XML, so the database access is of course done using JDBC
- A cronjob is configured to perform a pg_dump of the main database
every 4 hours

Usage:
- There are 3 tables which are heavily used by inserting and deleting
some hundred entries per second, the others are rarely used by the
Web-Application. One of these 3 table is the heaviest used one, also
with locking mechanisms.
- There are some pgsql functions which use locking mechanisms (select
for update)
- When a user logs in into the Web-Application, about 20 tables are
created which contain the data used for the Web-Application. The last
10 versions of these 20 tables -> about 200 tables are used as old
versions which can be used to restore previous configuration data
(pretty bad solution). These tables were never used after their
creation, just the oldest one is deleted if there are >10 versions
- Unfortunately we are not using Database connection pooling
- There are some threads which have a permanent database connection
and perform some actions either on a regular basis or on demand.

Problems:
1. The first problem we faced was that the pg_dump which is performed
every 4 hours began to fail about 1-2 times a week because of the
following error:

pg_dump: attempt to lock table "tableblabla" failed: ERROR: catalog
is missing 5 attribute(s) for relid 44798701

The table for which the errors occur is always one of these 200 unused
tables containing previous configuration data. The recovery procedure
always consists of deleting all entries with relname tableblablabla
from the pg_class, dumping and restoring the database.

2. Happens about once a month: Starting with select and delete queries
regarding the heaviest used database table, the database starts to not
respond to different queries which leads to a unusable system. If the
database is in such a state, it is not even possible to connect to the
database using psql on the commandline. The recovery procedure was to
just restart the PostgreSQL service.

3. There is a table which contains some configuration values which are
never changed. A thread of the application has a permanent database
connection reads a value from this table every 10 seconds and then
performs some actions. Sometimes, the value cannot be read from this
table, which means that the same SELECT statement executed every 10
seconds delivers no result, although the data is never modified. The
table is never subject to locking or anything like that. After some
queries failed, suddenly the correct value is returned again.

4. Once, the following error prevents creation of some tables (postgresql.log):

2006-08-30 17:50:39 [11189] ERROR: 23505: duplicate key violates
unique constraint "pg_class_oid_index"

followed by

2006-08-30 18:00:11 [18296] ERROR: XX000: catalog is missing 10
attribute(s) for relid 17586461

for every run of the vacuum cronjob. This errors occurred together
with the freeze described in 3.

Are there any known bugs of the used version of PostgreSQL, Tomcat,
JDBC which could cause such problems? All these problems occur
independently on both 2 cluster systems, so i think there is no
hardware error.
If you have any suggestions or ideas for debugging or what the reason
for these problems could be, please reply as soon as possible.

br,
Peter

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Ron Johnson 2006-09-02 21:30:40 Re: Thought provoking piece on NetBSD
Previous Message Chris Travers 2006-09-02 18:28:09 Re: [pgsql-advocacy] Thought provoking piece on NetBSD