Re: streaming replication, "frozen snapshot backup on it" and missing relfile (postgres 9.2.3 on xfs + LVM)

From: Benedikt Grundmann <bgrundmann(at)janestreet(dot)com>
To: PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject: streaming replication, "frozen snapshot backup on it" and missing relfile (postgres 9.2.3 on xfs + LVM)
Date: 2013-05-14 11:57:10
Message-ID: CADbMkNOL1Hbi4cggF3ZhO68nsr3yRvkunNYz2yBmt8QOupPomw@mail.gmail.com
Lists: pgsql-hackers

Today we have seen this on our testing database instance:

ERROR: could not open file "base/16416/291498116.3" (target block 431006):
No such file or directory

That database gets created by rsyncing the LVM snapshot of the standby,
which is a read-only backup of proddb
using streaming replication.

We do not put the standby database into backup mode before rsyncing the LVM
snapshot, trusting Postgres's ability to recover
from a crash. Now we are not sure anymore... Thoughts? It's worth noting
that we have been using this method of taking a backup
for a long, long time, but only recently (after 9.2 came out) did we switch
to running a streaming replica and taking the
"frozen snapshot" backup from it.

Let me know if there is more information you need, or if this is a known
problem.

Thanks,

Bene


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Benedikt Grundmann <bgrundmann(at)janestreet(dot)com>
Cc: PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: streaming replication, "frozen snapshot backup on it" and missing relfile (postgres 9.2.3 on xfs + LVM)
Date: 2013-05-14 13:27:24
Message-ID: 51923BBC.6070007@vmware.com
Lists: pgsql-hackers

On 14.05.2013 14:57, Benedikt Grundmann wrote:
> Today we have seen this on our testing database instance:
>
> ERROR: could not open file "base/16416/291498116.3" (target block 431006):
> No such file or directory
>
> That database gets created by rsyncing the LVM snapshot of the standby,
> which is a read-only backup of proddb
> using streaming replication.
>
> We do not put the standby database into backup mode before rsyncing the LVM
> snapshot, trusting Postgres's ability to recover
> from a crash. Now we are not sure anymore... Thoughts? It's worth noting
> that we have been using this method of taking a backup
> for a long, long time, but only recently (after 9.2 came out) did we switch
> to running a streaming replica and taking the
> "frozen snapshot" backup from it.
>
> Let me know if there is more information you need, or if this is a known
> problem.

That certainly should work. I'd suggest that you narrow down the problem
a bit more. Is the file present in the original system? Anything else
that might be related in the logs?

- Heikki


From: Benedikt Grundmann <bgrundmann(at)janestreet(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>, David Powers <dpowers(at)janestreet(dot)com>
Subject: Re: streaming replication, "frozen snapshot backup on it" and missing relfile (postgres 9.2.3 on xfs + LVM)
Date: 2013-05-14 13:48:55
Message-ID: CADbMkNPP0xA2S+b625my3YznBcToZkCaVktTaaP4H5ZX_WLdSw@mail.gmail.com
Lists: pgsql-hackers

It's on the production database and the streaming replica. But not on the
snapshot.

production
-rw------- 1 postgres postgres 312778752 May 13 21:28
/database/postgres/base/16416/291498116.3

streaming replica
-rw------- 1 postgres postgres 312778752 May 13 23:50
/database/postgres/base/16416/291498116.3

Is there a way to find out what the file contains?

We just got some more information. All of the following was done / seen in
the logs of the snapshot database.

After we saw this we ran a VACUUM FULL on the table we suspected to be
backed by this file. This happened:

WARNING: concurrent insert in progress within table "js_equity_daily_diff"

Over and over again. So we killed the VACUUM FULL by hitting Ctrl-C again
and again. After that the logs contained:

fgrep ERROR postgresql-2013-05-14.csv | head
2013-05-14 00:23:16.028 EDT,"postgres_prod","proddb_testing",15698,"
172.24.65.103:54536",5191bc31.3d52,3,"SELECT",2013-05-14 00:23:13
EDT,2/3330,0,ERROR,58P01,"could not open file ""base/16416/291498116.3""
(target block 431006): No such file or directory",,,,,"SQL statement
""select max(equity_date) from js_equity_daily_diff""

The above was the original error that made us investigate.

2013-05-14 09:21:47.121 EDT,"postgres_prod","proddb_testing",21002,"
172.27.41.24:36815",51923a67.520a,38756,"VACUUM",2013-05-14 09:21:43
EDT,2/21611,645995272,ERROR,57014,"canceling statement due to user
request",,,,,,"vacuum full js_equity_daily_diff",,,"psql"
2013-05-14 09:22:04.700 EDT,"postgres_prod","proddb_testing",21008,"
172.27.41.24:36831",51923a75.5210,115909,"VACUUM",2013-05-14 09:21:57
EDT,2/21618,645995273,ERROR,57014,"canceling statement due to user
request",,,,,,"vacuum full js_equity_daily_diff",,,"psql"

That was us cancelling the VACUUM FULL.

2013-05-14 09:22:13.947 EDT,,,30911,,51919d78.78bf,1,,2013-05-13 22:12:08
EDT,,0,ERROR,XX000,"xlog flush request 1D08/9B57FCD0 is not satisfied ---
flushed only to 1CEE/31266090",,,,,"writing block 0 of relation
base/16416/291498116",,,,""
2013-05-14 09:22:14.964 EDT,,,30911,,51919d78.78bf,2,,2013-05-13 22:12:08
EDT,,0,ERROR,XX000,"xlog flush request 1D08/9B57FCD0 is not satisfied ---
flushed only to 1CEE/31266090",,,,,"writing block 0 of relation
base/16416/291498116",,,,""

And after that, these started appearing in the logs (and they are repeated
every second now):

[root(at)nyc-dbc-001 pg_log]# fgrep ERROR postgresql-2013-05-14.csv | tail -n 2
2013-05-14 09:47:43.301 EDT,,,30911,,51919d78.78bf,3010,,2013-05-13
22:12:08 EDT,,0,ERROR,XX000,"xlog flush request 1D08/9B57FCD0 is not
satisfied --- flushed only to 1CEE/3C869588",,,,,"writing block 0 of
relation base/16416/291498116",,,,""
2013-05-14 09:47:44.317 EDT,,,30911,,51919d78.78bf,3012,,2013-05-13
22:12:08 EDT,,0,ERROR,XX000,"xlog flush request 1D08/9B57FCD0 is not
satisfied --- flushed only to 1CEE/3C869588",,,,,"writing block 0 of
relation base/16416/291498116",,,,""

There are no earlier ERRORs in the logs.

2013-05-14 09:38:03.115 EDT,,,30911,,51919d78.78bf,1868,,2013-05-13
22:12:08 EDT,,0,ERROR,XX000,"xlog flush request 1D08/9B57FCD0 is not
satisfied --- flushed only to 1CEE/3C869588",,,,,"writing block 0 of
relation base/16416/291498116",,,,""
2013-05-14 09:38:03.115 EDT,,,30911,,51919d78.78bf,1869,,2013-05-13
22:12:08 EDT,,0,WARNING,58030,"could not write block 0 of
base/16416/291498116","Multiple failures --- write error might be
permanent.",,,,,,,,""

The disk is not full nor are there any messages in the kernel logs.

Cheers,

Bene

On Tue, May 14, 2013 at 9:27 AM, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com
> wrote:

> On 14.05.2013 14:57, Benedikt Grundmann wrote:
>
>> Today we have seen this on our testing database instance:
>>
>> ERROR: could not open file "base/16416/291498116.3" (target block
>> 431006):
>> No such file or directory
>>
>> That database gets created by rsyncing the LVM snapshot of the standby,
>> which is a read-only backup of proddb
>> using streaming replication.
>>
>> We do not put the standby database into backup mode before rsyncing the
>> LVM snapshot, trusting Postgres's ability to recover
>> from a crash. Now we are not sure anymore... Thoughts? It's worth noting
>> that we have been using this method of taking a backup
>> for a long, long time, but only recently (after 9.2 came out) did we
>> switch to running a streaming replica and taking the
>> "frozen snapshot" backup from it.
>>
>> Let me know if there is more information you need, or if this is a known
>> problem.
>>
>
> That certainly should work. I'd suggest that you narrow down the problem a
> bit more. Is the file present in the original system? Anything else that
> might be related in the logs?
>
> - Heikki
>


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Benedikt Grundmann <bgrundmann(at)janestreet(dot)com>
Cc: PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>, David Powers <dpowers(at)janestreet(dot)com>
Subject: Re: streaming replication, "frozen snapshot backup on it" and missing relfile (postgres 9.2.3 on xfs + LVM)
Date: 2013-05-14 15:34:24
Message-ID: 51925980.5030208@vmware.com
Lists: pgsql-hackers

On 14.05.2013 16:48, Benedikt Grundmann wrote:
> It's on the production database and the streaming replica. But not on the
> snapshot.

So, the LVM snapshot didn't work correctly?

- Heikki


From: Benedikt Grundmann <bgrundmann(at)janestreet(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>, David Powers <dpowers(at)janestreet(dot)com>
Subject: Re: streaming replication, "frozen snapshot backup on it" and missing relfile (postgres 9.2.3 on xfs + LVM)
Date: 2013-05-14 16:09:19
Message-ID: CADbMkNOYQEK4hLBixptF_oG9DU=wuQs1iAaiodCpnhXwfSQOow@mail.gmail.com
Lists: pgsql-hackers

That's one possible explanation. It's worth noting that we hadn't seen
this before moving to streaming replication first, and we have been using
that method for a long time.

On Tue, May 14, 2013 at 11:34 AM, Heikki Linnakangas <
hlinnakangas(at)vmware(dot)com> wrote:

> On 14.05.2013 16:48, Benedikt Grundmann wrote:
>
>> It's on the production database and the streaming replica. But not on the
>> snapshot.
>>
>
> So, the LVM snapshot didn't work correctly?
>
> - Heikki
>


From: Benedikt Grundmann <bgrundmann(at)janestreet(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>, David Powers <dpowers(at)janestreet(dot)com>
Subject: Re: streaming replication, "frozen snapshot backup on it" and missing relfile (postgres 9.2.3 on xfs + LVM)
Date: 2013-05-14 20:47:26
Message-ID: CADbMkNMN6pcrxxaiyWeZ5XFvJFX9iAn1gqwi65L_UKa5Z-0rQg@mail.gmail.com
Lists: pgsql-hackers

I think my previous message wasn't clear enough. I do *NOT* think that the
LVM snapshot is the culprit.

However, I cannot discount it as one of the possibilities. But I have no
evidence in either /var/log/messages or in dmesg that the LVM snapshot went
into a bad state, AND we have been using this method for a long time.

The only thing that is *new* is that we took the snapshot from the
streaming replica. So again my best guess as of now is that if the
database crashes while it is in streaming standby, an invalid disk state
can result during the following startup (in rare and as of now unclear
circumstances).

You seem to be quite convinced that it must be LVM; can you elaborate why?

Thanks,

Bene

On Tue, May 14, 2013 at 12:09 PM, Benedikt Grundmann <
bgrundmann(at)janestreet(dot)com> wrote:

> That's one possible explanation. It's worth noting that we hadn't seen
> this before moving to streaming replication first, and we have been using
> that method for a long time.
>
>
> On Tue, May 14, 2013 at 11:34 AM, Heikki Linnakangas <
> hlinnakangas(at)vmware(dot)com> wrote:
>
>> On 14.05.2013 16:48, Benedikt Grundmann wrote:
>>
>>> It's on the production database and the streaming replica. But not on
>>> the
>>> snapshot.
>>>
>>
>> So, the LVM snapshot didn't work correctly?
>>
>> - Heikki
>>
>
>


From: Amit Kapila <amit(dot)kapila(at)huawei(dot)com>
To: "'Benedikt Grundmann'" <bgrundmann(at)janestreet(dot)com>, "'Heikki Linnakangas'" <hlinnakangas(at)vmware(dot)com>
Cc: "'PostgreSQL-Dev'" <pgsql-hackers(at)postgresql(dot)org>, "'David Powers'" <dpowers(at)janestreet(dot)com>
Subject: Re: streaming replication, "frozen snapshot backup on it" and missing relfile (postgres 9.2.3 on xfs + LVM)
Date: 2013-05-15 05:56:40
Message-ID: 004f01ce5130$f9077a60$eb166f20$@kapila@huawei.com
Lists: pgsql-hackers

On Tuesday, May 14, 2013 7:19 PM Benedikt Grundmann wrote:
> It's on the production database and the streaming replica. But not on
> the snapshot.
>
> production
> -rw------- 1 postgres postgres 312778752 May 13 21:28
> /database/postgres/base/16416/291498116.3
>
> streaming replica
> -rw------- 1 postgres postgres 312778752 May 13 23:50
> /database/postgres/base/16416/291498116.3
>
> Is there a way to find out what the file contains?

You can try the pageinspect module in contrib.
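
For example, a minimal sketch (it assumes pageinspect can be installed in
the affected database; the table and block number are taken from your error):

psql -d proddb_testing -c "CREATE EXTENSION pageinspect;"
psql -d proddb_testing \
  -c "SELECT * FROM page_header(get_raw_page('js_equity_daily_diff', 0));"

The lsn column in the page_header() output is the page LSN that drives the
"xlog flush request" in the later error messages.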

> We just got some more information. All of the following was done / seen
> in the logs of the snapshot database.
>
> After we saw this we ran a VACUUM FULL on the table we suspected to be
> backed by this file. This happened:
>
> WARNING: concurrent insert in progress within table "js_equity_daily_diff"
>
> 2013-05-14 09:22:13.947 EDT,,,30911,,51919d78.78bf,1,,2013-05-13 22:12:08
> EDT,,0,ERROR,XX000,"xlog flush request 1D08/9B57FCD0 is not satisfied ---
> flushed only to 1CEE/31266090",,,,,"writing block 0 of relation
> base/16416/291498116",,,,""
> 2013-05-14 09:22:14.964 EDT,,,30911,,51919d78.78bf,2,,2013-05-13 22:12:08
> EDT,,0,ERROR,XX000,"xlog flush request 1D08/9B57FCD0 is not satisfied ---
> flushed only to 1CEE/31266090",,,,,"writing block 0 of relation
> base/16416/291498116",,,,""
>
> And after that, these started appearing in the logs (and they are
> repeated every second now):
>
> [root(at)nyc-dbc-001 pg_log]# fgrep ERROR postgresql-2013-05-14.csv | tail -n 2
> 2013-05-14 09:47:43.301 EDT,,,30911,,51919d78.78bf,3010,,2013-05-13
> 22:12:08 EDT,,0,ERROR,XX000,"xlog flush request 1D08/9B57FCD0 is not
> satisfied --- flushed only to 1CEE/3C869588",,,,,"writing block 0 of
> relation base/16416/291498116",,,,""
> 2013-05-14 09:47:44.317 EDT,,,30911,,51919d78.78bf,3012,,2013-05-13
> 22:12:08 EDT,,0,ERROR,XX000,"xlog flush request 1D08/9B57FCD0 is not
> satisfied --- flushed only to 1CEE/3C869588",,,,,"writing block 0 of
> relation base/16416/291498116",,,,""
>
> There are no earlier ERRORs in the logs.
>
> 2013-05-14 09:38:03.115 EDT,,,30911,,51919d78.78bf,1868,,2013-05-13
> 22:12:08 EDT,,0,ERROR,XX000,"xlog flush request 1D08/9B57FCD0 is not
> satisfied --- flushed only to 1CEE/3C869588",,,,,"writing block 0 of
> relation base/16416/291498116",,,,""
> 2013-05-14 09:38:03.115 EDT,,,30911,,51919d78.78bf,1869,,2013-05-13
> 22:12:08 EDT,,0,WARNING,58030,"could not write block 0 of
> base/16416/291498116","Multiple failures --- write error might be
> permanent.",,,,,,,,""
>
> The disk is not full nor are there any messages in the kernel logs.

The reason for this is that the system is not able to flush XLOG up to the
requested point; most likely, the requested flush point is past the end of
XLOG. This has been seen to occur when a disk page has a corrupted LSN. (I
am quoting this from the comment in the code where the above error message
occurs.)

So if XLOG is not flushed, the checkpointer will not flush the data of file
291498116 either.
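
A quick way to see this (a sketch; run it on the instance emitting the
errors) is to compare the server's actual WAL position against the flush
request in the message:

psql -d proddb_testing -c "SELECT pg_current_xlog_location();"

If the requested position (1D08/9B57FCD0 above) is far past that location,
the page LSN is bogus.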

It seems to me that the database where these errors are observed is
corrupted.

With Regards,
Amit Kapila.


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Benedikt Grundmann <bgrundmann(at)janestreet(dot)com>
Cc: PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>, David Powers <dpowers(at)janestreet(dot)com>
Subject: Re: streaming replication, "frozen snapshot backup on it" and missing relfile (postgres 9.2.3 on xfs + LVM)
Date: 2013-05-15 06:24:03
Message-ID: 51932A03.3070908@vmware.com
Lists: pgsql-hackers

On 14.05.2013 23:47, Benedikt Grundmann wrote:
> The only thing that is *new* is that we took the snapshot from the
> streaming replica. So again my best guess as of now is that if the
> database crashes while it is in streaming standby, an invalid disk state
> can result during the following startup (in rare and as of now unclear
> circumstances).

A bug is certainly possible. There isn't much detail here to debug with,
I'm afraid. Can you share the full logs on all three systems? I'm
particularly interested in what the restored snapshot logged when it was
first started up.

> You seem to be quite convinced that it must be LVM; can you elaborate why?

Well, you said that there was a file in the original filesystem, but not
in the snapshot. If you didn't do anything in between, then surely the
snapshot is broken, if it skipped a file. Or was the file created in the
original filesystem after the snapshot was taken? You probably left out
some crucial details on how exactly the snapshot and rsync are
performed. Can you share the scripts you're using?

Can you reproduce this problem with a new snapshot? Do you still have
the failed snapshot unchanged?

- Heikki


From: David Powers <dpowers(at)janestreet(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Benedikt Grundmann <bgrundmann(at)janestreet(dot)com>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: streaming replication, "frozen snapshot backup on it" and missing relfile (postgres 9.2.3 on xfs + LVM)
Date: 2013-05-15 12:42:07
Message-ID: CAJpcCMhxK56fyjj708Q2x-8F8Q2nacJ5gs9ALMFW13K9sqjeoQ@mail.gmail.com
Lists: pgsql-hackers

First, thanks for the replies. This sort of thing is frustrating and hard
to diagnose at a distance, and any help is appreciated.

Here is some more background:

We have 3 9.2.4 databases using the following setup:

- A primary box
- A standby box running as a hot streaming replica from the primary
- A testing box restored nightly from a static backup

As noted, the static backup is taken off of the standby by taking an LVM
snapshot of the database filesystem and rsyncing. I don't think it's a
likely problem, but the rsync leverages the previous backup (using
--link-dest) to make the rsync faster and the resulting backup smaller; see
the sketch below. Each database is ~1.5T, so this is necessary to keep
static backup times reasonable.
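
The incremental rsync has roughly this shape (a sketch with hypothetical
paths; the real invocation appears in the backup log later in this thread):

nice -n 19 rsync -a -q --delete \
  --link-dest=../proddb.2013-05-14.20-00-00 \
  /mnt/snapshot/proddb/ backuphost:/backups/proddb/proddb.in-progress/

Unchanged files become hardlinks into the previous night's backup instead
of fresh copies.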

We've been using the same system for quite some time, but previously (~1
month ago) had been taking the backup off of the primary (still using the
LVM snapshot). The replication is a recent addition, and a very helpful
one. LVM snapshots aren't lightweight in the face of writes, and in some
circumstances a long-running rsync would spike the IO load on the
production box.

Results of some additional tests:

After the user noticed that the test restore showed the original problem,
we ran `vacuum analyze` on all three testing databases, thinking that it
had a good chance of quickly touching most of the underlying files. That
gave us errors on two of the testing restores similar to:

ERROR: invalid page header in block 5427 of relation base/16417/199732075

In the meantime I modified the static backup procedure to shut the standby
down completely before taking the LVM snapshot, and am trying a restore
using that snapshot now. I'll test that using the same vacuum analyze
test, and if that passes, a full vacuum.
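
The revised sequence is roughly this (a sketch; the device and mount-point
names are hypothetical, mirroring the commands in our backup logs):

pg_ctl -D /database/postgres -m fast stop
lvcreate --size 500G --snapshot --name pg-backup-snap /dev/dbc01/d1
pg_ctl -D /database/postgres start    # standby is down only for the lvcreate
mkdir -p /mnt/pg-backup-snap
mount -o nouuid,ro /dev/dbc01/pg-backup-snap /mnt/pg-backup-snap
rsync -a --delete /mnt/pg-backup-snap/proddb/ backuphost:/backups/proddb/
umount /mnt/pg-backup-snap
lvremove -f /dev/dbc01/pg-backup-snap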

I'm also running the vacuum analyze on the production machines to
double-check that the base databases don't have a subtle corruption that
simply hasn't been noticed. They run with normal autovacuum settings, so I
suspect that they are fine and this won't show anything, because we should
have seen this from the autovacuum daemon before.

I'm happy to share the scripts we use for the backup/restore process if the
information above isn't enough, as well as the logs - though the postgres
logs don't seem to contain much of interest (the database system doesn't
really get involved).

I also have the rsyncs of the failed snapshots available and could restore
them for testing purposes. It's also easy to look in them (they are just
saved as normal directories on a big SAN) if I know what to look for.

-David

On Wed, May 15, 2013 at 2:24 AM, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com
> wrote:

> On 14.05.2013 23:47, Benedikt Grundmann wrote:
>
>> The only thing that is *new* is that we took the snapshot from the
>> streaming replica. So again my best guess as of now is that if the
>> database crashes while it is in streaming standby, an invalid disk state
>> can result during the following startup (in rare and as of now unclear
>> circumstances).
>>
>
> A bug is certainly possible. There isn't much detail here to debug with,
> I'm afraid. Can you share the full logs on all three systems? I'm
> particularly interested in what the restored snapshot logged when it was
> first started up.
>
>
> You seem to be quite convinced that it must be LVM; can you elaborate why?
>>
>
> Well, you said that there was a file in the original filesystem, but not
> in the snapshot. If you didn't do anything in between, then surely the
> snapshot is broken, if it skipped a file. Or was the file created in the
> original filesystem after the snapshot was taken? You probably left out
> some crucial details on how exactly the snapshot and rsync are performed.
> Can you share the scripts you're using?
>
> Can you reproduce this problem with a new snapshot? Do you still have the
> failed snapshot unchanged?
>
> - Heikki
>


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: David Powers <dpowers(at)janestreet(dot)com>
Cc: Benedikt Grundmann <bgrundmann(at)janestreet(dot)com>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: streaming replication, "frozen snapshot backup on it" and missing relfile (postgres 9.2.3 on xfs + LVM)
Date: 2013-05-15 18:50:23
Message-ID: 5193D8EF.3060000@vmware.com
Lists: pgsql-hackers

On 15.05.2013 15:42, David Powers wrote:
> First, thanks for the replies. This sort of thing is frustrating and hard
> to diagnose at a distance, and any help is appreciated.
>
> Here is some more background:
>
> We have 3 9.2.4 databases using the following setup:

The subject says 9.2.3. Are you sure you're running 9.2.4 on all the
servers? There was a fix to a bug related to starting a standby server
from a filesystem snapshot. I don't think it was quite the case you
have, but pretty close.

> - A primary box
> - A standby box running as a hot streaming replica from the primary
> - A testing box restored nightly from a static backup
>
> As noted, the static backup is taken off of the standby by taking an LVM
> snapshot of the database filesystem and rsyncing. I don't think it's a
> likely problem but the rsync leverages the previous backup (using
> --link-dest) to make the rsync faster and the resulting backup smaller.

I guess that should work, as long as you make a full copy when you
restore the backup, and never start a postgres server directly against a
backup that contains the hardlinks.
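
Concretely (a sketch, with hypothetical paths):

# safe: a plain copy (no -H, no --link-dest) materializes independent files
rsync -a /backups/proddb/proddb.2013-05-14.20-00-00/ /database/restore/
pg_ctl -D /database/restore start
# unsafe: starting postgres directly on the hardlinked backup directory
# would write through the shared inodes into the other backups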

> We've been using the same system for quite some time, but previously (~ 1
> month ago) had been taking the backup off of the primary (still using the
> LVM snapshot). The replication is a recent addition, and a very helpful
> one. LVM snapshots aren't lightweight in the face of writes and in some
> circumstances a long running rsync would spike the IO load on the
> production box.
>
> Results of some additional tests:
>
> After the user noticed that the test restore showed the original problem we
> ran `vacuum analyze` on all three testing databases thinking that it had a
> good chance of quickly touching most of the underlying files. That gave us
> errors on two of the testing restores similar to:
>
> ERROR: invalid page header in block 5427 of relation base/16417/199732075

Huh, that's surprising. You have to be quite unlucky to end up with a
corrupt page header, even if there's something wrong with the underlying
storage or backup routines. Could you grab a copy of that block?
Something like
Something like

dd skip=5427 bs=8192 count=1 if=base/16417/199732075 of=corrupt-block

I'm especially curious what the first 20 bytes or so look like. I'm
guessing that it's all zeros, while some later part of the page contains
data. I.e., the page was torn at some point in the process.
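
A hex dump of the block extracted above would show that, e.g.:

od -A d -t x1 -N 32 corrupt-block

On 9.2 the first 8 bytes of a page are the page LSN, so a page that starts
with zeros but has data further in has likely been torn.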

> I'm also running the vacuum analyze on the production machines to double
> check that the base databases don't have a subtle corruption that simply
> hasn't been noticed. They run with normal autovacuum settings, so I
> suspect that they are fine/this won't show anything because we should have
> seen this from the autovacuum daemon before.

Vacuum analyze doesn't always scan the whole table with default
settings, so that's no proof that there are no missing pages. "set
vacuum_freeze_table_age=0; vacuum <table>" will force a full scan of the
table.

> I'm happy to share the scripts we use for the backup/restore process if the
> information above isn't enough, as well as the logs - though the postgres
> logs don't seem to contain much of interest (the database system doesn't
> really get involved).

Yeah, please share the scripts. What you've described seems correct to
me, but maybe there's something I'm missing.

> I also have the rsyncs of the failed snapshots available and could restore
> them for testing purposes. It's also easy to look in them (they are just
> saved as normal directories on a big SAN) if I know what to look for.

Great. Not sure what to look at right now, but good that you have kept
the evidence.

- Heikki


From: Benedikt Grundmann <bgrundmann(at)janestreet(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: David Powers <dpowers(at)janestreet(dot)com>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: streaming replication, "frozen snapshot backup on it" and missing relfile (postgres 9.2.3 on xfs + LVM)
Date: 2013-05-15 19:50:49
Message-ID: CADbMkNPhwfkCjX0UzVBtqU83dLfm928O0V0KfrQDoegEBQiTgQ@mail.gmail.com
Lists: pgsql-hackers

On Wed, May 15, 2013 at 2:50 PM, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com
> wrote:

> On 15.05.2013 15:42, David Powers wrote:
>
>> First, thanks for the replies. This sort of thing is frustrating and hard
>> to diagnose at a distance, and any help is appreciated.
>>
>> Here is some more background:
>>
>> We have 3 9.2.4 databases using the following setup:
>>
>
> The subject says 9.2.3. Are you sure you're running 9.2.4 on all the
> servers? There was a fix to a bug related to starting a standby server from
> a filesystem snapshot. I don't think it was quite the case you have, but
> pretty close.
>
>
So this is delightfully embarrassing: I just went back to double-check, and

- primary box is 9.2.3
- standby is 9.2.4
- testing is 9.2.4

I guess that alone could possibly explain it?

Thanks,

Bene


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Benedikt Grundmann <bgrundmann(at)janestreet(dot)com>
Cc: David Powers <dpowers(at)janestreet(dot)com>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: streaming replication, "frozen snapshot backup on it" and missing relfile (postgres 9.2.3 on xfs + LVM)
Date: 2013-05-15 19:53:39
Message-ID: 5193E7C3.2050400@vmware.com
Lists: pgsql-hackers

On 15.05.2013 22:50, Benedikt Grundmann wrote:
> On Wed, May 15, 2013 at 2:50 PM, Heikki Linnakangas<hlinnakangas(at)vmware(dot)com
>> The subject says 9.2.3. Are you sure you're running 9.2.4 on all the
>> servers? There was a fix to a bug related to starting a standby server from
>> a filesystem snapshot. I don't think it was quite the case you have, but
>> pretty close.
>
> So this is delightfully embarrassing: I just went back to double-check, and
>
> - primary box is 9.2.3
> - standby is 9.2.4
> - testing is 9.2.4
>
> I guess that alone could possibly explain it?

Hmm, no, it should still work. There haven't been any changes in the WAL
format. I do recommend upgrading the primary, of course, but I don't
really see how that would explain what you're seeing.

- Heikki


From: David Powers <dpowers(at)janestreet(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Benedikt Grundmann <bgrundmann(at)janestreet(dot)com>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: streaming replication, "frozen snapshot backup on it" and missing relfile (postgres 9.2.3 on xfs + LVM)
Date: 2013-05-16 12:10:22
Message-ID: CAJpcCMi6dYXFheP6pwNUXffDzOhknW_ndK=aD+9dc2SvuhJuQA@mail.gmail.com
Lists: pgsql-hackers

I'll try to get the primary upgraded over the weekend when we can afford a
restart.

In the meantime I have a single test showing that a shutdown, snapshot,
restart produces a backup that passes the vacuum analyze test. I'm going
to run a full vacuum today.

-David

On Wed, May 15, 2013 at 3:53 PM, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com
> wrote:

> On 15.05.2013 22:50, Benedikt Grundmann wrote:
>
>> On Wed, May 15, 2013 at 2:50 PM, Heikki Linnakangas<hlinnakangas(at)**
>> vmware.com <hlinnakangas(at)vmware(dot)com>
>>
>>> The subject says 9.2.3. Are you sure you're running 9.2.4 on all the
>>>
>>> servers? There was a fix to a bug related to starting a standby server
>>> from
>>> a filesystem snapshot. I don't think it was quite the case you have, but
>>> pretty close.
>>>
>>
>> So this is delightfully embarrassing: I just went back to double-check, and
>>
>> - primary box is 9.2.3
>> - standby is 9.2.4
>> - testing is 9.2.4
>>
>> I guess that alone could possibly explain it?
>>
>
> Hmm, no, it should still work. There haven't been any changes in the WAL
> format. I do recommend upgrading the primary, of course, but I don't really
> see how that would explain what you're seeing.
>
> - Heikki
>


From: Benedikt Grundmann <bgrundmann(at)janestreet(dot)com>
To: David Powers <dpowers(at)janestreet(dot)com>
Cc: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: streaming replication, "frozen snapshot backup on it" and missing relfile (postgres 9.2.3 on xfs + LVM)
Date: 2013-05-21 15:59:26
Message-ID: CADbMkNMurWJMUmXAKqtFi1p40=G0ncVvOKXjnixhX5Bjb4-8BQ@mail.gmail.com
Lists: pgsql-hackers

We are seeing these errors on a regular basis on the testing box now. We
have even changed the backup script to shut down the hot standby, take the
LVM snapshot, restart the hot standby, and rsync the LVM snapshot. It
still happens.

We have never seen this before we introduced the hot standby. So we will
now revert to taking the backups from LVM snapshots on the production
database. If you have ideas of what else we should try / what information
we can give you to debug this, let us know and we will try to do so.

Until then we will sadly operate on the assumption that the combination of
hot standby and "frozen snapshot" backup of it is not production ready.

Thanks,

Bene

On Thu, May 16, 2013 at 8:10 AM, David Powers <dpowers(at)janestreet(dot)com>wrote:

> I'll try to get the primary upgraded over the weekend when we can afford a
> restart.
>
> In the meantime I have a single test showing that a shutdown, snapshot,
> restart produces a backup that passes the vacuum analyze test. I'm going
> to run a full vacuum today.
>
> -David
>
>
> On Wed, May 15, 2013 at 3:53 PM, Heikki Linnakangas <
> hlinnakangas(at)vmware(dot)com> wrote:
>
>> On 15.05.2013 22:50, Benedikt Grundmann wrote:
>>
>>> On Wed, May 15, 2013 at 2:50 PM, Heikki Linnakangas<hlinnakangas(at)**
>>> vmware.com <hlinnakangas(at)vmware(dot)com>
>>>
>>>> The subject says 9.2.3. Are you sure you're running 9.2.4 on all the
>>>>
>>>> servers? There was a fix to a bug related to starting a standby server
>>>> from
>>>> a filesystem snapshot. I don't think it was quite the case you have, but
>>>> pretty close.
>>>>
>>>
>>> So this is delightfully embarrassing: I just went back to double-check, and
>>>
>>> - primary box is 9.2.3
>>> - standby is 9.2.4
>>> - testing is 9.2.4
>>>
>>> I guess that alone could possibly explain it?
>>>
>>
>> Hmm, no, it should still work. There haven't been any changes in the WAL
>> format. I do recommend upgrading the primary, of course, but I don't really
>> see how that would explain what you're seeing.
>>
>> - Heikki
>>
>
>


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Benedikt Grundmann <bgrundmann(at)janestreet(dot)com>
Cc: David Powers <dpowers(at)janestreet(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: streaming replication, "frozen snapshot backup on it" and missing relfile (postgres 9.2.3 on xfs + LVM)
Date: 2013-05-23 15:26:35
Message-ID: CA+TgmobWPC3_NOmnpoFhxFvnpgHLPWXi8OG5xD9tNusGGB1mzQ@mail.gmail.com
Lists: pgsql-hackers

On Tue, May 21, 2013 at 11:59 AM, Benedikt Grundmann
<bgrundmann(at)janestreet(dot)com> wrote:
> We are seeing these errors on a regular basis on the testing box now. We
> have even changed the backup script to
> shut down the hot standby, take the LVM snapshot, restart the hot standby,
> and rsync the LVM snapshot. It still happens.
>
> We have never seen this before we introduced the hot standby. So we will
> now revert to taking the backups from LVM snapshots on the production
> database. If you have ideas of what else we should try / what information
> we can give you to debug this, let us know and we will try to do so.
>
> Until then we will sadly operate on the assumption that the combination of
> hot standby and "frozen snapshot" backup of it is not production ready.

I'm pretty suspicious that your backup procedure is messed up in some
way. The fact that you got invalid page headers is really difficult
to attribute to a PostgreSQL bug. A number of the other messages that
you have posted also tend to indicate either corruption, or that WAL
replay has stopped early. It would be interesting to see the logs
from when the clone was first started up, juxtaposed against the later
WAL flush error messages.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: David Powers <dpowers(at)janestreet(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Benedikt Grundmann <bgrundmann(at)janestreet(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: streaming replication, "frozen snapshot backup on it" and missing relfile (postgres 9.2.3 on xfs + LVM)
Date: 2013-05-23 16:30:41
Message-ID: CAJpcCMhzNgWLrYPS1pMHGxrsSKOwBPvoHE-ozWSj=jpbfTCdfg@mail.gmail.com
Lists: pgsql-hackers

Thanks for the response.

I have some evidence against an issue in the backup procedure (though I'm
not ruling it out). We moved back to taking the backup off of the primary
and all errors for all three clusters went away. All of the hardware is
the same, and the OS and postgres versions are largely the same (9.2.3 vs.
9.2.4 in some cases, various patch levels of CentOS 6.3 for the OS). The
backup code is exactly the same, just pointed at a different set of boxes.

Currently I'm just running for a couple of days to ensure that we have
viable static backups. After that I'll redo one of the restores from a
suspected backup and will post the logs.

-David

On Thu, May 23, 2013 at 11:26 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> On Tue, May 21, 2013 at 11:59 AM, Benedikt Grundmann
> <bgrundmann(at)janestreet(dot)com> wrote:
> > We are seeing these errors on a regular basis on the testing box now. We
> > have even changed the backup script to
> > shut down the hot standby, take the LVM snapshot, restart the hot standby,
> > and rsync the LVM snapshot. It still happens.
> >
> > We have never seen this before we introduced the hot standby. So we will
> > now revert to taking the backups from LVM snapshots on the production
> > database. If you have ideas of what else we should try / what information
> > we can give you to debug this, let us know and we will try to do so.
> >
> > Until then we will sadly operate on the assumption that the combination
> > of hot standby and "frozen snapshot" backup of it is not production ready.
>
> I'm pretty suspicious that your backup procedure is messed up in some
> way. The fact that you got invalid page headers is really difficult
> to attribute to a PostgreSQL bug. A number of the other messages that
> you have posted also tend to indicate either corruption, or that WAL
> replay has stopped early. It would be interesting to see the logs
> from when the clone was first started up, juxtaposed against the later
> WAL flush error messages.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>


From: Benedikt Grundmann <bgrundmann(at)janestreet(dot)com>
To: David Powers <dpowers(at)janestreet(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: streaming replication, "frozen snapshot backup on it" and missing relfile (postgres 9.2.3 on xfs + LVM)
Date: 2013-05-28 14:53:59
Message-ID: CADbMkNPpn6HB9DXu=0zvq34-uC1cPu6AO=dc6A0OSsnG1dZXXQ@mail.gmail.com
Lists: pgsql-hackers

Today we have seen

2013-05-28 04:11:12.300 EDT,,,30600,,51a41946.7788,1,,2013-05-27 22:41:10
EDT,,0,ERROR,XX000,"xlog flush request 1E95/AFB2DB10 is not satisfied ---
flushed only to 1E7E/21CB79A0",,,,,"writing block 9 of relation
base/16416/293974676",,,,""
2013-05-28 04:11:13.316 EDT,,,30600,,51a41946.7788,2,,2013-05-27 22:41:10
EDT,,0,ERROR,XX000,"xlog flush request 1E95/AFB2DB10 is not satisfied ---
flushed only to 1E7E/21CB79A0",,,,,"writing block 9 of relation
base/16416/293974676",,,,""

while taking the *backup of the primary*. We have been running for a few
days like that, and today is the first day where we see these problems
again. So it's not entirely deterministic / we don't know yet what we have
to do to reproduce it.

So this makes Robert's theory more likely. However, we have also been
using this method (LVM + rsync with hardlinks from the primary) for years
without these problems. So the big question is what changed?

One hypothesis is that it is related to the primary running with wal_level
= hot_standby instead of minimal or archive (which we used before we
switched to 9.2).
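
For concreteness, the setting in question (a hypothetical postgresql.conf
excerpt; the exact values on our boxes may differ):

# before the 9.2 migration:
#wal_level = archive            # or minimal
# since the move to a streaming hot standby:
wal_level = hot_standby
max_wal_senders = 3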

Here are the entries in the log related to the startup of the corrupt
testing cluster:

2013-05-27 22:41:10.029 EDT,,,30598,,51a41946.7786,1,,2013-05-27 22:41:10
EDT,,0,LOG,00000,"database system was interrupted; last known up at
2013-05-26 21:01:09 EDT",,,,,,,,,""
2013-05-27 22:41:10.029 EDT,,,30599,"",51a41946.7787,1,"",2013-05-27
22:41:10 EDT,,0,LOG,00000,"connection received: host=172.27.65.204
port=55279",,,,,,,,,""
2013-05-27 22:41:10.030 EDT,,,30598,,51a41946.7786,2,,2013-05-27 22:41:10
EDT,,0,LOG,00000,"database system was not properly shut down; automatic
recovery in progress",,,,,,,,,""
2013-05-27 22:41:10.030 EDT,"as-elephant","postgres",30599,"
172.27.65.204:55279",51a41946.7787,2,"",2013-05-27 22:41:10
EDT,,0,FATAL,57P03,"the database system is starting up",,,,,,,,,""
2013-05-27 22:41:10.031 EDT,,,30598,,51a41946.7786,3,,2013-05-27 22:41:10
EDT,,0,LOG,00000,"redo starts at 1E7E/2152B178",,,,,,,,,""
2013-05-27 22:41:10.094 EDT,,,30598,,51a41946.7786,4,,2013-05-27 22:41:10
EDT,,0,LOG,00000,"record with zero length at 1E7E/215AC6B8",,,,,,,,,""
2013-05-27 22:41:10.094 EDT,,,30598,,51a41946.7786,5,,2013-05-27 22:41:10
EDT,,0,LOG,00000,"redo done at 1E7E/215AC688",,,,,,,,,""
2013-05-27 22:41:10.094 EDT,,,30598,,51a41946.7786,6,,2013-05-27 22:41:10
EDT,,0,LOG,00000,"last completed transaction was at log time 2013-05-26
21:09:08.06351-04",,,,,,,,,""
2013-05-27 22:41:10.134 EDT,,,30595,,51a41945.7783,1,,2013-05-27 22:41:09
EDT,,0,LOG,00000,"database system is ready to accept connections",,,,,,,,,""
2013-05-27 22:41:10.134 EDT,,,30603,,51a41946.778b,1,,2013-05-27 22:41:10
EDT,,0,LOG,00000,"autovacuum launcher started",,,,,,,,,""
2013-05-27 22:41:15.037 EDT,,,30608,"",51a4194b.7790,1,"",2013-05-27
22:41:15 EDT,,0,LOG,00000,"connection received: host=172.27.65.204
port=55283",,,,,,,,,""

This means we currently do NOT have a way to make backups that we trust.
We are very open to suggestions of alternative methods we should consider
using. The database is of non-trivial size by now:

proddb=> select pg_size_pretty(pg_database_size('proddb'));
pg_size_pretty
----------------
1294 GB
(1 row)

The backup script itself is by now a rather long OCaml program, so I doubt
the value of posting it to this list. But here is the log of what it did,
which should be pretty self-explanatory:

proddb backup: starting
proddb backup: /bin/bash -c "/usr/bin/ssh -l root tot-dbc-001 ls -d
/net/nyc-isilon1/ifs/data/backup-dbc//proddb/proddb'.*-*-*.*-*-*'"
(enqueued)
proddb backup: /bin/bash -c "/usr/bin/ssh -l root tot-dbc-001 ls -d
/net/nyc-isilon1/ifs/data/backup-dbc//proddb/proddb'.*-*-*.*-*-*'" (running
as pid: [23422])
proddb backup: /bin/bash -c "/usr/bin/ssh -l root tot-dbc-001 ls -d
/net/nyc-isilon1/ifs/data/backup-dbc//proddb/proddb'.*-*-*.*-*-*'" ([23422]
exited normally)
proddb backup: /bin/mkdir -p
/net/nyc-isilon1/ifs/data/backup-dbc//proddb/proddb.in-progress (enqueued)
proddb backup: /bin/mkdir -p
/net/nyc-isilon1/ifs/data/backup-dbc//proddb/proddb.in-progress (running as
pid: [23433])
proddb backup: /bin/mkdir -p
/net/nyc-isilon1/ifs/data/backup-dbc//proddb/proddb.in-progress ([23433]
exited normally)
proddb backup: /bin/bash -c "/usr/bin/ssh -l root tot-dbc-001
/mnt/global/base/bin/db tools backup backup -v -src '"((dbname proddb)
(hostname tot-dbc-001) (superuser postgres_prod) (basedir /database)
(version 9.2))"' -dst '"((username postgres) (hostname 127.0.0.1)
(backup_dir
/net/nyc-isilon1/ifs/data/backup-dbc//proddb/proddb.in-progress)
(last_backup_dir (..//proddb.2013-05-26.20-00-00)
proddb backup: /bin/bash -c "/usr/bin/ssh -l root tot-dbc-001
/mnt/global/base/bin/db tools backup backup -v -src '"((dbname proddb)
(hostname tot-dbc-001) (superuser postgres_prod) (basedir /database)
(version 9.2))"' -dst '"((username postgres) (hostname 127.0.0.1)
(backup_dir
/net/nyc-isilon1/ifs/data/backup-dbc//proddb/proddb.in-progress)
(last_backup_dir (..//proddb.2013-05-26.20-00-00)
proddb backup [23437]: 2013-05-27 20:00:02.232864-04:00 Debug backing up
tot-dbc-001:proddb to ((username postgres) (hostname 127.0.0.1)
proddb backup [23437]: (backup_dir
/net/nyc-isilon1/ifs/data/backup-dbc//proddb/proddb.in-progress)
proddb backup [23437]: (last_backup_dir (..//proddb.2013-05-26.20-00-00)))
proddb backup [23437]: 2013-05-27 20:00:02.232869-04:00 Debug running
/usr/pgsql-9.2/bin/psql -U postgres_prod -h tot-dbc-001 -d postgres -c
checkpoint
proddb backup [23437]: 2013-05-27 20:00:02.459035-04:00 Debug Running
preliminary rsync outside of snapshot to minimize its lifetime.
proddb backup [23437]: 2013-05-27 20:00:02.459049-04:00 Debug running rsync
from /database/ to postgres(at)127(dot)0(dot)0(dot)1:
/net/nyc-isilon1/ifs/data/backup-dbc//proddb/proddb.in-progress//
proddb backup [23437]: 2013-05-27 20:00:02.459053-04:00 Debug running nice
-n 19 rsync --link-dest ..//proddb.2013-05-26.20-00-00 -a -q --delete
/database/ postgres(at)127(dot)0(dot)0(dot)1:
/net/nyc-isilon1/ifs/data/backup-dbc//proddb/proddb.in-progress//
proddb backup [23437]: 2013-05-27 21:43:05.742036-04:00 Debug Taking
snapshot of database
proddb backup [23437]: 2013-05-27 21:43:05.742056-04:00 Debug finding LVM
device for /database
proddb backup [23437]: 2013-05-27 21:43:05.796354-04:00 Debug real path of
/database is /mnt/dbc01-d1/proddb
proddb backup [23437]: 2013-05-27 21:43:05.797222-04:00 Debug /database is
located on LVM device /dev/dbc01/d1 at relative path /proddb
proddb backup [23437]: 2013-05-27 21:43:05.797237-04:00 Debug running
lvcreate --size 500G --snapshot --name proddb-snapshot /dev/dbc01/d1
proddb backup [23437]: 2013-05-27 21:43:06.487274-04:00 Debug running mount
-o nouuid,ro /dev/dbc01/proddb-snapshot /tmp/snapfchjHv
proddb backup [23437]: 2013-05-27 21:43:06.682700-04:00 Debug running rsync
from /tmp/snapfchjHv/proddb/ to postgres(at)127(dot)0(dot)0(dot)1:
/net/nyc-isilon1/ifs/data/backup-dbc//proddb/proddb.in-progress//
proddb backup [23437]: 2013-05-27 21:43:06.682710-04:00 Debug running nice
-n 19 rsync --inplace --link-dest ..//proddb.2013-05-26.20-00-00 -a -q
--delete /tmp/snapfchjHv/proddb/ postgres(at)127(dot)0(dot)0(dot)1:
/net/nyc-isilon1/ifs/data/backup-dbc//proddb/proddb.in-progress//
proddb backup [23437]: 2013-05-27 22:38:45.817805-04:00 Debug Deleting
snapshot of database
proddb backup [23437]: 2013-05-27 22:38:45.817824-04:00 Debug running
umount /tmp/snapfchjHv
proddb backup [23437]: 2013-05-27 22:38:47.096200-04:00 Debug removing
mount point /tmp/snapfchjHv
proddb backup [23437]: 2013-05-27 22:38:47.096732-04:00 Debug running
lvremove -f /dev/dbc01/proddb-snapshot
proddb backup [23437]: 2013-05-27 22:38:47.532788-04:00 Debug running
lvdisplay /dev/dbc01/proddb-snapshot
proddb backup: /bin/bash -c "/usr/bin/ssh -l root tot-dbc-001
/mnt/global/base/bin/db tools backup backup -v -src '"((dbname proddb)
(hostname tot-dbc-001) (superuser postgres_prod) (basedir /database)
(version 9.2))"' -dst '"((username postgres) (hostname 127.0.0.1)
(backup_dir
/net/nyc-isilon1/ifs/data/backup-dbc//proddb/proddb.in-progress)
(last_backup_dir (..//proddb.2013-05-26.20-00-00)
proddb backup: /usr/bin/ssh postgres(at)tot-dbc-001 cd
/net/nyc-isilon1/ifs/data/backup-dbc//proddb/proddb.in-progress (enqueued)
proddb backup: /usr/bin/ssh postgres(at)tot-dbc-001 cd
/net/nyc-isilon1/ifs/data/backup-dbc//proddb/proddb.in-progress (running as
pid: [14318])
proddb backup: /usr/bin/ssh postgres(at)tot-dbc-001 cd
/net/nyc-isilon1/ifs/data/backup-dbc//proddb/proddb.in-progress ([14318]
exited normally)
proddb backup: /usr/bin/ssh postgres(at)tot-dbc-001 mv
/net/nyc-isilon1/ifs/data/backup-dbc//proddb/proddb.in-progress
/net/nyc-isilon1/ifs/data/backup-dbc//proddb/proddb.2013-05-27.20-00-00
(enqueued)
proddb backup: /usr/bin/ssh postgres(at)tot-dbc-001 mv
/net/nyc-isilon1/ifs/data/backup-dbc//proddb/proddb.in-progress
/net/nyc-isilon1/ifs/data/backup-dbc//proddb/proddb.2013-05-27.20-00-00
(running as pid: [14319])
proddb backup: /usr/bin/ssh postgres(at)tot-dbc-001 mv
/net/nyc-isilon1/ifs/data/backup-dbc//proddb/proddb.in-progress
/net/nyc-isilon1/ifs/data/backup-dbc//proddb/proddb.2013-05-27.20-00-00
([14319] exited normally)
proddb backup: finished; status = ok

Thanks in advance,

Bene

On Thu, May 23, 2013 at 5:30 PM, David Powers <dpowers(at)janestreet(dot)com>wrote:

> Thanks for the response.
>
> I have some evidence against an issue in the backup procedure (though I'm
> not ruling it out). We moved back to taking the backup off of the primary
> and all errors for all three clusters went away. All of the hardware is
> the same, and the OS and postgres versions are largely the same (9.2.3 vs.
> 9.2.4 in some cases, various patch levels of CentOS 6.3 for the OS). The
> backup code is exactly the same, just pointed at a different set of boxes.
>
> Currently I'm just running for a couple of days to ensure that we have
> viable static backups. After that I'll redo one of the restores from a
> suspected backup and will post the logs.
>
> -David
>
>
> On Thu, May 23, 2013 at 11:26 AM, Robert Haas <robertmhaas(at)gmail(dot)com>wrote:
>
>> On Tue, May 21, 2013 at 11:59 AM, Benedikt Grundmann
>> <bgrundmann(at)janestreet(dot)com> wrote:
>> > We are seeing these errors on a regular basis on the testing box now.
>> > We have even changed the backup script to
>> > shut down the hot standby, take the LVM snapshot, restart the hot
>> > standby, and rsync the LVM snapshot. It still happens.
>> >
>> > We have never seen this before we introduced the hot standby. So we
>> > will now revert to taking the backups from LVM snapshots on the
>> > production database. If you have ideas of what else we should try /
>> > what information we can give you to debug this, let us know and we
>> > will try to do so.
>> >
>> > Until then we will sadly operate on the assumption that the combination
>> > of hot standby and "frozen snapshot" backup of it is not production
>> > ready.
>>
>> I'm pretty suspicious that your backup procedure is messed up in some
>> way. The fact that you got invalid page headers is really difficult
>> to attribute to a PostgreSQL bug. A number of the other messages that
>> you have posted also tend to indicate either corruption, or that WAL
>> replay has stopped early. It would be interesting to see the logs
>> from when the clone was first started up, juxtaposed against the later
>> WAL flush error messages.
>>
>> --
>> Robert Haas
>> EnterpriseDB: http://www.enterprisedb.com
>> The Enterprise PostgreSQL Company
>>
>
>


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Benedikt Grundmann <bgrundmann(at)janestreet(dot)com>
Cc: David Powers <dpowers(at)janestreet(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: streaming replication, "frozen snapshot backup on it" and missing relfile (postgres 9.2.3 on xfs + LVM)
Date: 2013-05-28 23:27:49
Message-ID: CA+TgmoYC8yY7WDWJ5cankEEOMAT=v7aZPXLT9Z-HCRxPATebhQ@mail.gmail.com
Lists: pgsql-hackers

On Tue, May 28, 2013 at 10:53 AM, Benedikt Grundmann
<bgrundmann(at)janestreet(dot)com> wrote:
> Today we have seen
>
> 2013-05-28 04:11:12.300 EDT,,,30600,,51a41946.7788,1,,2013-05-27 22:41:10
> EDT,,0,ERROR,XX000,"xlog flush request 1E95/AFB2DB10 is not satisfied ---
> flushed only to 1E7E/21CB79A0",,,,,"writing block 9 of relation
> base/16416/293974676",,,,""
> 2013-05-28 04:11:13.316 EDT,,,30600,,51a41946.7788,2,,2013-05-27 22:41:10
> EDT,,0,ERROR,XX000,"xlog flush request 1E95/AFB2DB10 is not satisfied ---
> flushed only to 1E7E/21CB79A0",,,,,"writing block 9 of relation
> base/16416/293974676",,,,""
>
> while taking the backup of the primary. We have been running for a few days
> like that, and today is the first day where we see these problems again. So
> it's not entirely deterministic / we don't know yet what we have to do to
> reproduce it.
>
> So this makes Robert's theory more likely. However, we have also been using
> this method (LVM + rsync with hardlinks from the primary) for years without
> these problems. So the big question is what changed?

Well... I don't know. But my guess is there's something wrong with
the way you're using hardlinks. Remember, a hardlink means two
logical pointers to the same file on disk. So if either file gets
modified after the fact, then the other pointer is going to see the
changes. The xlog flush request not satisfied stuff could happen if,
for example, the backup is pointing to a file, and the primary is
pointing to the same file, and the primary modifies the file after the
backup is taken (thus modifying the backup after-the-fact).
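
For what it's worth, that failure mode is easy to demonstrate in isolation
(a hypothetical sketch, not your script; it mimics a --link-dest rsync
followed by an --inplace rsync into the same destination):

mkdir -p src prev dest
echo "old contents" > src/f
cp -p src/f prev/f                        # "yesterday's backup"
rsync -a --link-dest=../prev src/ dest/   # unchanged file => hardlink to prev/f
echo "new contents" > src/f               # source changes after the hardlink
rsync -a --inplace --link-dest=../prev src/ dest/
cat prev/f    # prints "new contents": yesterday's backup was rewritten
              # through the shared hardlink

Without --inplace, rsync writes to a temporary file and renames it over
dest/f, which breaks the hardlink and leaves prev/f intact.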

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: David Powers <dpowers(at)janestreet(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Benedikt Grundmann <bgrundmann(at)janestreet(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: streaming replication, "frozen snapshot backup on it" and missing relfile (postgres 9.2.3 on xfs + LVM)
Date: 2013-05-29 20:59:03
Message-ID: CAJpcCMjAZ7r0Tbs2f9gwtjN573GON4WcE4eeu1UqDmKYyDKpPQ@mail.gmail.com
Lists: pgsql-hackers

It's another possibility, but I think it's still somewhat remote given how
long we've been using this method with this code. It's sadly hard to test
because taking the full backup without the hard linking is fairly expensive
(the databases comprise multiple terabytes).

As a possibly unsatisfying solution, I've spent the last day reworking the
backups to use the low-level API and the pg_basebackup method, to take
snapshots and the streaming replica out of the picture entirely.
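
For the pg_basebackup side, something like this should suffice on 9.2 (a
sketch; the replication user and target path are hypothetical, and it
assumes a matching pg_hba.conf entry):

pg_basebackup -h tot-dbc-001 -U replication_user \
    -D /backups/proddb.$(date +%F) -X stream -P

-X stream pulls the required WAL over a second connection, so the result is
a self-contained, consistent copy.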

-David

On Tue, May 28, 2013 at 7:27 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> On Tue, May 28, 2013 at 10:53 AM, Benedikt Grundmann
> <bgrundmann(at)janestreet(dot)com> wrote:
> > Today we have seen
> >
> > 2013-05-28 04:11:12.300 EDT,,,30600,,51a41946.7788,1,,2013-05-27 22:41:10
> > EDT,,0,ERROR,XX000,"xlog flush request 1E95/AFB2DB10 is not satisfied ---
> > flushed only to 1E7E/21CB79A0",,,,,"writing block 9 of relation
> > base/16416/293974676",,,,""
> > 2013-05-28 04:11:13.316 EDT,,,30600,,51a41946.7788,2,,2013-05-27 22:41:10
> > EDT,,0,ERROR,XX000,"xlog flush request 1E95/AFB2DB10 is not satisfied ---
> > flushed only to 1E7E/21CB79A0",,,,,"writing block 9 of relation
> > base/16416/293974676",,,,""
> >
> > while taking the backup of the primary. We have been running for a few
> > days like that, and today is the first day where we see these problems
> > again. So it's not entirely deterministic / we don't know yet what we
> > have to do to reproduce it.
> >
> > So this makes Robert's theory more likely. However, we have also been
> > using this method (LVM + rsync with hardlinks from the primary) for
> > years without these problems. So the big question is what changed?
>
> Well... I don't know. But my guess is there's something wrong with
> the way you're using hardlinks. Remember, a hardlink means two
> logical pointers to the same file on disk. So if either file gets
> modified after the fact, then the other pointer is going to see the
> changes. The xlog flush request not satisfied stuff could happen if,
> for example, the backup is pointing to a file, and the primary is
> pointing to the same file, and the primary modifies the file after the
> backup is taken (thus modifying the backup after-the-fact).
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>