From: Waldo Nell <pwnell(at)telkomsa(dot)net>
To: pgsql-general(at)postgresql(dot)org
Subject: Database corruption
Date: 2006-09-01 15:02:24
Message-ID: D15255F3-FB3D-4AA4-8145-65B8DD83BCA0@telkomsa.net
Lists: pgsql-general
We have recently upgraded from PostgreSQL 7.4.5 to 8.1.4. Our DB is
about 45GB in size and has about 100 tables, lots of stored
procedures, triggers etc.
The DB lives on a SAN and is accessed via Fibre Channel cards in an
IBM BladeCenter. The system is running Red Hat Enterprise Linux 3, I
think. We connect to it using the latest JDBC driver from Java 1.5.
Here is the problem. Last night one of the statements in my code
threw this exception:
net.za.pwnconsulting.dblayer.query.DBQueryLayerException: Could not
execute SQL query [select c.mc_ca_id as c_mc_ca_id, d.mc_ca_id as
d_mc_ca_id, b.ca_id, b.ca_ctr_id from contract a, contract_agreement
b left outer join monthend_contract c on c.mc_ca_id = ? and
c.mc_last_proc_date = ? left outer join monthend2_contract d on
d.mc_ca_id = ? and d.mc_last_proc_date = ? where a.ctr_id =
b.ca_ctr_id and b.ca_agreement_type = ? and b.ca_start_date > ? and
b.ca_start_date <= ? and a.ctr_id = ? and ca_level_company = ? and
ca_level_data = ? and ca_level_type = 'V' and ca_id <> ? limit 1] in
pool [Oasis] because: ERROR: out of memory
Detail: Failed on request of size 20.
org.postgresql.util.PSQLException: ERROR: out of memory
Detail: Failed on request of size 20.
at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:1525)
at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1309)
at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:188)
at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:452)
at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:340)
at org.postgresql.jdbc2.AbstractJdbc2Statement.executeQuery(AbstractJdbc2Statement.java:239)
at net.za.pwnconsulting.dblayer.query.StandardQueryLayer.queryDatabaseInternal(StandardQueryLayer.java:289)
at net.za.pwnconsulting.dblayer.query.StandardQueryLayer.queryDatabaseInternalExceptionWrapper(StandardQueryLayer.java:158)
at net.za.pwnconsulting.dblayer.query.StandardQueryLayer.queryDatabase(StandardQueryLayer.java:68)
at net.za.pwnconsulting.dblayer.query.StandardQueryLayer.queryDatabase(StandardQueryLayer.java:58)
at net.za.pwnconsulting.dblayer.AbstractDBLayer.queryDatabase(AbstractDBLayer.java:57)
at za.co.massmart.oasis.daemons.ContractExtensionTimerTask.safeExecuteInternal(ContractExtensionTimerTask.java:183)
at za.co.massmart.oasis.daemons.ExtendedHACronTimerTask.safeExecute(ExtendedHACronTimerTask.java:94)
at net.za.pwnconsulting.javaconfig.utils.time.HACronTimerTask.execute(HACronTimerTask.java:156)
at net.za.pwnconsulting.javaconfig.utils.time.CronTimerTask.run(CronTimerTask.java:103)
at java.lang.Thread.run(Thread.java:595)
So I checked and found that another application, connecting to a
different DB in the same instance, showed a very strange error. For an
"update tablename set field1 = X where primary_field = Y" statement I
got a duplicate key violation on the serial column "primary_field" -
which should be impossible, since I am not changing its value; I am
updating a different column. VACUUM FULL ANALYZE did not fix it. I
dropped and recreated the index, and the error went away. This was
just after I had restarted the postgres instance (I stopped it using
"pg_ctl stop -m immediate" since there were connections from other
systems that I could not close).
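To make the scenario concrete, the failing statement and the recovery step looked roughly like this (table, column and index names here are placeholders, not our real schema):

```sql
-- The key column appears only in the WHERE clause, so a duplicate key
-- violation on it should be impossible for this statement:
UPDATE tablename SET field1 = 'X' WHERE primary_field = 42;

-- Rebuilding the unique index cleared the error, which points at index
-- corruption rather than bad row data. Equivalent to the drop/recreate
-- I did by hand:
REINDEX TABLE tablename;
```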
I restarted the application that encountered the first error above
and ran the task again - it worked. So I then started a VACUUM FULL
ANALYZE on the main production DB. It produced several warnings,
about page sizes not being large enough I think, but it completed
successfully.
This morning we found that some data generated this morning at 01:00
was present, but most of the tables had lost all data captured since
25 August 2006. That amounts to about 500 contracts, a whole billing
cycle etc. We upgraded PostgreSQL on the 20th.
I checked the file system and it yielded no errors (ext3). I checked
the server and there were no obvious hardware issues.
How can part of the DB just be lost like that? I have just verified
by restoring a backup made yesterday night at 20:00 - before the
errors I showed above happened - that that DB also has the data
missing. The bill run happened yesterday afternoon, so somewhere
after that something broke. Any ideas on how to troubleshoot this?
To summarise - the DB lost about 6 days of transactions for no known
reason.
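For the record, this is the kind of check I used to confirm the loss, run against both the live DB and the restored backup (the date column name is a placeholder for whatever timestamp your tables carry):

```sql
-- Rows captured since 25 August 2006 should exist but are gone;
-- this came back empty in both the live DB and last night's backup:
SELECT count(*)
FROM contract
WHERE ctr_capture_date >= DATE '2006-08-25';
```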