Re: inconsistent state after crash recovery

Lists: pgsql-hackers
From: Satoshi Nagayasu <snaga(at)uptime(dot)jp>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: inconsistent state after crash recovery
Date: 2013-07-26 04:33:13
Message-ID: 51F1FC09.7090208@uptime.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

I received a question about inconsistent state after crash recovery.

When a table file is broken (or just lost), PostgreSQL can not recover
a whole table, and does not show any notice while recoverying.
I think it means "inconsistent" state.

(1) create a table, and fill records.
(2) process a checkpoint.
(3) fill more records.
(4) force a crash, and delete the table file.
(5) run recovery on restarting.
(6) only records after the checkpoint can be recoverd.

For example, the attached log shows that PostgreSQL can recover
only 1058 records in the table which contains 2000 records
before the crash, and does not tell anything in the server log.

----------------------------------------------------------
insert into t1 values (trim(to_char(generate_series(1,1000),
'00000000000000000000')) );
INSERT 0 1000
select count(*) from t1;
count
-------
1000
(1 row)

checkpoint;
CHECKPOINT
insert into t1 values (trim(to_char(generate_series(1001,2000),
'00000000000000000000')) );
INSERT 0 1000
select count(*) from t1;
count
-------
2000
(1 row)

(delete the table file)
(kill postgres)
(restart postgres with recovery)

select count(*) from t1;
count
-------
1058
(1 row)
----------------------------------------------------------

Is this expected or acceptable?

I think, at least, PostgreSQL should say something about this
situation in the server log, because DBA can not recognize
this situation if no server log exists.

To reproduce it, please check the attached test script.

Any comments?

Regards,
--
Satoshi Nagayasu <snaga(at)uptime(dot)jp>
Uptime Technologies, LLC. http://www.uptime.jp

Attachment Content-Type Size
log.txt text/plain 5.2 KB
test_recovery.sh text/plain 1.5 KB

From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Satoshi Nagayasu <snaga(at)uptime(dot)jp>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: inconsistent state after crash recovery
Date: 2013-07-26 06:47:52
Message-ID: 20130726064752.GG15081@alap2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

On 2013-07-26 13:33:13 +0900, Satoshi Nagayasu wrote:
> I received a question about inconsistent state after crash recovery.
>
> When a table file is broken (or just lost), PostgreSQL can not recover
> a whole table, and does not show any notice while recoverying.
> I think it means "inconsistent" state.
>
> (1) create a table, and fill records.
> (2) process a checkpoint.
> (3) fill more records.
> (4) force a crash, and delete the table file.
> (5) run recovery on restarting.
> (6) only records after the checkpoint can be recoverd.
>
> For example, the attached log shows that PostgreSQL can recover
> only 1058 records in the table which contains 2000 records
> before the crash, and does not tell anything in the server log.
>
> Is this expected or acceptable?

I'd say it's both. WAL replay doesn't have the knowledge to detect that
in all too many cases. Nearly always a page's contents will be restored
by a full page image the first time it gets changed so they will
individually look completely normal.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Satoshi Nagayasu <snaga(at)uptime(dot)jp>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: inconsistent state after crash recovery
Date: 2013-07-26 12:27:05
Message-ID: 13112.1374841625@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andres Freund <andres(at)2ndquadrant(dot)com> writes:
> On 2013-07-26 13:33:13 +0900, Satoshi Nagayasu wrote:
>> Is this expected or acceptable?

> I'd say it's both.

Postgres is built on the assumption that the underlying filesystem is
reliable, ie, once you've successfully fsync'd some data that data won't
disappear. If the filesystem fails to honor that contract, it's a
filesystem bug not a Postgres bug. Nor is it reasonable to expect
Postgres to be able to detect every such violation. As an example,
would you expect crash recovery to notice the disappearance of a file
that was touched nowhere in the replayed actions?

regards, tom lane


From: KONDO Mitsumasa <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp>
To: Satoshi Nagayasu <snaga(at)uptime(dot)jp>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: inconsistent state after crash recovery
Date: 2013-07-29 07:43:23
Message-ID: 51F61D1B.7010509@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi Satoshi,

I was wondering about this problem. Please tell us about your system enviroment
which is postgresql version ,OS, raid card, and file system.

Best regards,
--
Mitsumasa KONDO
NTT Open Source Software Center


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Satoshi Nagayasu <snaga(at)uptime(dot)jp>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: inconsistent state after crash recovery
Date: 2013-08-02 12:08:14
Message-ID: CA+TgmoYapNH2nDE15yMkRt61X4by+5bfGRgetyt-gDbQW-uXmQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Jul 26, 2013 at 8:27 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Andres Freund <andres(at)2ndquadrant(dot)com> writes:
>> On 2013-07-26 13:33:13 +0900, Satoshi Nagayasu wrote:
>>> Is this expected or acceptable?
>
>> I'd say it's both.
>
> Postgres is built on the assumption that the underlying filesystem is
> reliable, ie, once you've successfully fsync'd some data that data won't
> disappear. If the filesystem fails to honor that contract, it's a
> filesystem bug not a Postgres bug. Nor is it reasonable to expect
> Postgres to be able to detect every such violation. As an example,
> would you expect crash recovery to notice the disappearance of a file
> that was touched nowhere in the replayed actions?

Eh, maybe not. But should we try harder to detect the unexpected
disappearance of one that is?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Satoshi Nagayasu <snaga(at)uptime(dot)jp>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: inconsistent state after crash recovery
Date: 2013-08-02 12:17:47
Message-ID: 17381.1375445867@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Fri, Jul 26, 2013 at 8:27 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> would you expect crash recovery to notice the disappearance of a file
>> that was touched nowhere in the replayed actions?

> Eh, maybe not. But should we try harder to detect the unexpected
> disappearance of one that is?

We do, don't we? The replay stuff should complain unless it sees a drop
or truncate covering any unaccounted-for pages.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Satoshi Nagayasu <snaga(at)uptime(dot)jp>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: inconsistent state after crash recovery
Date: 2013-08-02 12:19:10
Message-ID: CA+TgmobN0+f4efOFS9KybDOr8pr6ed47ig4fdycwwtG3H3pOZw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Aug 2, 2013 at 8:17 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> On Fri, Jul 26, 2013 at 8:27 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> would you expect crash recovery to notice the disappearance of a file
>>> that was touched nowhere in the replayed actions?
>
>> Eh, maybe not. But should we try harder to detect the unexpected
>> disappearance of one that is?
>
> We do, don't we? The replay stuff should complain unless it sees a drop
> or truncate covering any unaccounted-for pages.

Hmm. Yeah. But the OP seems to think it doesn't work.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tomasz Nowak <tomasz(dot)nowaktn(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, Satoshi Nagayasu <snaga(at)uptime(dot)jp>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: inconsistent state after crash recovery
Date: 2013-08-02 12:46:05
Message-ID: CAPp0YYtS2wL_swxD3zB7+QH4+7Fg+AHC=J3m=G+B0iS3Jy5UFQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,
I'm very new here on this mailing list, but I've been using PostgreSQL
for a while, and it scares me a little, that it's a real pain to try to
recover data from corrupted table.

Situations like file being lost following server crash (after fsck) or
page corruption happens quite often.

Having corruption at the row level for example, system could mark the
page as corrupted in the system catalog.
Giving that page knows last change to this page, can we use archived
WAL-s to recover the page?
We can have a table inside pg_catalog like pg_corrupted_pages with
information of page corruption detected by backend.

Similar to tables, in a case of lost file, once system notice that, we
should have a column in pg_class called relneedrecovery to record that.
Will it be possible to recover this file from last hot backup and apply
redo to it based on WAL records?

Similar functionality is provider by Oracle (media and block recovery).
I assume we will need some extra DDL commands to handle file/block
recovery.
Also, It would be nice to have a command to quickly cross check files
between pg_class and filesystem - just simple open/close system call for
each relation - it is always faster that to run vaccum to check that.
Tomasz

On 2 August 2013 13:19, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> On Fri, Aug 2, 2013 at 8:17 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> > Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> >> On Fri, Jul 26, 2013 at 8:27 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> >>> would you expect crash recovery to notice the disappearance of a file
> >>> that was touched nowhere in the replayed actions?
> >
> >> Eh, maybe not. But should we try harder to detect the unexpected
> >> disappearance of one that is?
> >
> > We do, don't we? The replay stuff should complain unless it sees a drop
> > or truncate covering any unaccounted-for pages.
>
> Hmm. Yeah. But the OP seems to think it doesn't work.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>

On 2 August 2013 13:19, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> On Fri, Aug 2, 2013 at 8:17 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> > Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> >> On Fri, Jul 26, 2013 at 8:27 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> >>> would you expect crash recovery to notice the disappearance of a file
> >>> that was touched nowhere in the replayed actions?
> >
> >> Eh, maybe not. But should we try harder to detect the unexpected
> >> disappearance of one that is?
> >
> > We do, don't we? The replay stuff should complain unless it sees a drop
> > or truncate covering any unaccounted-for pages.
>
> Hmm. Yeah. But the OP seems to think it doesn't work.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>


From: Satoshi Nagayasu <snaga(at)uptime(dot)jp>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: inconsistent state after crash recovery
Date: 2013-08-03 10:24:31
Message-ID: 51FCDA5F.40602@uptime.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

(2013/08/02 21:19), Robert Haas wrote:
> On Fri, Aug 2, 2013 at 8:17 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>>> On Fri, Jul 26, 2013 at 8:27 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>>> would you expect crash recovery to notice the disappearance of a file
>>>> that was touched nowhere in the replayed actions?
>>
>>> Eh, maybe not. But should we try harder to detect the unexpected
>>> disappearance of one that is?
>>
>> We do, don't we? The replay stuff should complain unless it sees a drop
>> or truncate covering any unaccounted-for pages.
>
> Hmm. Yeah. But the OP seems to think it doesn't work.

Yes. I'm afraid that.

My attached script shows that crash recovery re-creates the lost
table file implicitly, and fills some of those blocks (maybe
lower ones) with zero without any notice. We can easily observe
it by using pg_filedump.

Thus, the table file can lose records, but DBA cannot recognize
it because no message is left in the server log.

I agree that this is not a PostgreSQL bug.

However, DBA still needs to detect this table corruption,
brought by several components which PostgreSQL relys on, to
consider restoring from database backup.

If PostgreSQL can detect and tell something about that, it
would be really helpful for DBA to make some critical decision.
I think PostgreSQL will be able to do that.

Regards,
--
Satoshi Nagayasu <snaga(at)uptime(dot)jp>
Uptime Technologies, LLC. http://www.uptime.jp

Attachment Content-Type Size
test_recovery.sh text/plain 3.2 KB