Re: BUG #3484: Missing pg_clog file / corrupt index

From: Decibel! <decibel(at)decibel(dot)org>
To: Marc Schablewski <ms(at)clickware(dot)de>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #3484: Missing pg_clog file / corrupt index
Date: 2007-07-31 01:20:25
Message-ID: AF7FB1C8-1457-41D4-9419-D9068470FCB9@decibel.org
Lists: pgsql-bugs

On Jul 25, 2007, at 4:02 PM, Marc Schablewski wrote:
> The following bug has been logged online:
>
> Bug reference: 3484
> Logged by: Marc Schablewski
> Email address: ms(at)clickware(dot)de
> PostgreSQL version: 8.1.8
> Operating system: SuSE Linux 10.0 / Kernel 2.6.13-15.8-smp (x86-64)
> Description: Missing pg_clog file / corrupt index
> Details:
>
> Our application receives and processes payment information that comes
> in plain text files and stores the processed data into different
> tables in our database. There are about 5 tables involved, three of
> them with 35 million records so far. We get approximately 150000
> payments a day. Each payment is handled in a single transaction,
> because in case of an error, we want to store as many payments as
> possible. We have about 500000 INSERT and a few UPDATE statements each
> day. The whole application runs on two servers (see specs below) which
> are nearly identical. One is the production system, the other is for
> testing.
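
(Not from the original report, just an illustration: a minimal sketch of
the loading pattern described above, with one transaction per payment so
that a failing payment only rolls back itself. The table and column names
here are hypothetical.)

    BEGIN;
    -- store one processed payment; an error here aborts only this payment
    INSERT INTO payment (file_id, line_no, amount, received_at)
        VALUES (42, 17, 19.99, now());
    -- bookkeeping on the input file being processed
    UPDATE import_file SET payments_loaded = payments_loaded + 1
        WHERE id = 42;
    COMMIT;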
>
> A few months ago we had some trouble with the test system. Postgres
> complained about a missing pg_clog file during the nightly routine
> VACUUM/ANALYZE. Some days later, the same error occurred on the
> production system, even on the same table! The corrupted table is one
> of the bigger ones involved in the file processing. After searching
> the web we found a hint that this problem could be related to a bug in
> 8.1.3, so we upgraded to 8.1.8 and restored the databases on both
> servers. This was around May, but now we have discovered the same
> problem on our production system again. Actually, the error occurred
> four weeks ago, but it was not discovered until now – if you do error
> logging, you should look at it from time to time ;)
>
> When trying to back up or vacuum the database, we receive one of the
> following error messages:
>
> **BACKUP**
> pg_dump: ERROR: could not access status of transaction 2134240
> DETAIL: could not open file "pg_clog/0002": Datei oder Verzeichnis nicht gefunden
> pg_dump: SQL command to dump the contents of table "dateiblock" failed: PQendcopy() failed.
> pg_dump: Error message from server: ERROR: could not access status of transaction 2134240
> DETAIL: could not open file "pg_clog/0002": Datei oder Verzeichnis nicht gefunden
> pg_dump: The command was: COPY public.dateiblock (id, eda_id, dbt_id, lfdnr, binaer_offset, laenge) TO stdout;
>
> **VACUUM**
> INFO: vacuuming "public.dateiblock"
> ERROR: could not access status of transaction 2134240
> DETAIL: could not open file "pg_clog/0002": Datei oder Verzeichnis nicht gefunden
>
> (For those not familiar with the German language: ‘Datei oder
> Verzeichnis nicht gefunden’ means ‘file or directory not found’).
>
> Current pg_clog files range from 005A to 008F. The oldest one is dated
> April 30th.
>
> We narrowed it down to a few records in that table. Some records
> contain unreasonable values, others produce the same message about the
> missing pg_clog file when selected, and some are simply missing. But
> they must have existed, because there are still records in a second
> table referencing them. One strange thing about this is that the
> referencing records are about two and a half months old and shouldn’t
> have been touched since then.
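
(A hypothetical diagnostic sketch, not taken from the report: one way to
narrow the damage down is to read the table in primary-key ranges and
bisect; the range whose SELECT fails with "could not access status of
transaction ..." contains the damaged tuples. The column names come from
the COPY statement above; the id range is made up.)

    SELECT id, eda_id, dbt_id, lfdnr
    FROM dateiblock
    WHERE id BETWEEN 2000000 AND 2100000;
    -- repeat with halved ranges until the affected rows are isolated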
>
> We don’t think this is a hardware issue, because we had it on two
> different servers within a short period of time.
>
> Luckily, the loss of data is minimal. Only about 30 records are
> affected. Otherwise this would have been fatal because, as said
> before, our backup was not working either.
>
>
> In addition, we had another problem today. One of the indexes on a
> second table became corrupted, causing the database to roll back all
> transactions and restart its server processes. A REINDEX fixed it, but
> it leaves me with a bad feeling about what will break next.
>
> 2007-07-25 08:07:00 CEST PANIC: right sibling's left-link doesn't match
> 2007-07-25 08:07:00 CEST STATEMENT: insert into transaktion (kaz_id, dbl_id, sta_id, kss_id, summe, zeitpunkt, tracenr, terminalid, status, kartennr, wae_id, kassenschnitt, freigabe, flo_id, importdatum) VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13, $14, $15)
> 2007-07-25 08:07:00 CEST LOG: server process (PID 5699) was terminated by signal 6
> 2007-07-25 08:07:00 CEST LOG: terminating any other active server processes
> 2007-07-25 08:07:00 CEST WARNING: terminating connection because of crash of another server process
> 2007-07-25 08:07:00 CEST DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
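
(Purely illustrative: the REINDEX mentioned above would have looked
roughly like this. The report does not name the corrupted index, so the
table-level form is shown for the table from the failing INSERT.)

    -- rebuild all indexes on the table whose INSERT triggered the panic
    REINDEX TABLE transaktion;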
>
>
> Kind regards,
>
> Marc Schablewski
>
>
>
> System:
>
> OS: SUSE LINUX 10.0 (x86-64), 2.6.13-15.8-smp x86_64
>
> System: 2x Intel(R) Xeon(TM) CPU 2.80GHz Dual Core, 4GB RAM (HP
> ProLiant server)
> 3Ware 9500S-4LP, 2x RAID1 (one for OS & one for database)
>
> Postgres 8.1.8
>
> Postgres was compiled manually but with default parameters. The
> configuration has been tuned to improve performance.
>
> shared_buffers = 30000
> maintenance_work_mem = 131072
> max_fsm_pages = 500000
> default_statistics_target = 200
> redirect_stderr = on
> log_directory = '/var/log/pg_log'
>
> log_min_duration_statement = 10
> log_duration = on
> log_line_prefix = '%t '
> stats_start_collector = on
> stats_command_string = on
> stats_block_level = on
> stats_row_level = on
> stats_reset_on_server_start = off
>
> VACUUM ANALYZE runs once a day; autovacuum is off.
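
(To make the maintenance setup concrete, assuming nothing beyond what the
report states: with autovacuum off, the entire routine maintenance is a
single database-wide statement issued once a day.)

    -- nightly maintenance as described; autovacuum is disabled
    VACUUM ANALYZE;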

Actually, this does sound like a hardware problem to me. You only
have 5 tables that get hit heavily, so if corruption strikes one of
them at random, you've got roughly a 1-in-5 (20%) chance that it would
hit the same table on two different machines.

So far you haven't said anything that sounds unusual about how you're
using the database, and the hardware certainly seems pretty
commonplace, so I'm rather doubtful that it's software. But if you
still have copies of the bad database, someone might be able to help you.
--
Decibel!, aka Jim Nasby decibel(at)decibel(dot)org
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
