Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1

Lists: pgsql-hackers
From: Christophe Pettus <xof(at)thebuild(dot)com>
To: PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
Date: 2013-11-18 18:58:26
Message-ID: 0E76EE0A-1740-41B5-88FF-54AA98794532@thebuild.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Three times in the last two weeks, we have experience data corruption secondary servers using streaming replication on client systems. The versions involved are 9.0.14, 9.2.5, and 9.3.1. Each incident was separate; two cases they were for the same client (9.0.14 and 9.3.1), one for a different client (9.2.5).

The details of each incident are similar, but not identical.

The details of each incident are:

INCDIDENT #1: 9.0.14 -- A new secondary (S1) was initialized using rsync off of an existing, correct primary (P1) for the base backup, and using WAL-E for WAL segment shipping. Both the primary and secondary were running 9.0.14. S1 properly connected to the primary once the it was caught up on WAL segments, and S1 was then promoted as a primary using the trigger file.

No errors in the log files on either system.

After promotion, it was discovered that there was significant data loss on S1. Rows that were present on P1 were missing on S1, and some rows were duplicated (including duplicates that violated primary key and other unique constraints). The indexes were corrupt, in that they seemed to think that the duplicates were not duplicated, and that the missing rows were still present.

Because the client's schema included a "last_updated" field, we were able to determine that all of the rows that were either missing or duplicated had been updated on P1 shortly (3-5 minutes) before S1 was promoted. It's possible, but not confirmed, that there were active queries (including updates) running on P1 at the moment of S1's promotion.

As a note, P1 was created from another system (let's call it P0) using just WAL shipping (no streaming replication), and no data corruption was observed.

P1 and S1 were both AWS instances running Ubuntu 12.04, using EBS (with xfs as the file system) as the data volume.

P1 and S1 have been destroyed at this point.

INCIDENT #2: 9.3.1 -- In order to repair the database, a pg_dump was taken of S1y, after having dropped the primary and unique constraints, and restored into a new 9.3.1 server, P2. Duplicate rows were purged, and missing rows were added again. The database, a new primary, was then put back into production, and ran without incident.

A new secondary, S2 was created off of the primary. This secondary was created using pg_basebackup using --xlog-method=stream, although the WAL-E archiving was still present.

S2 attached to P2 without incident and no errors in the logs, but nearly-identical corruption was discovered (although this time without the duplicated rows, just missing rows). At this point, it's not clear if there was some clustering in the "last_updated" timestamp for the rows that are missing from S2. No duplicated rows were observed.

P2 and S2 are both AWS instances running Ubuntu 12.04, using EBS (with xfs as the file system) as the data volume.

No errors in the log files on either system.

P2 and S2 are still operational.

INCIDENT #3: 9.2.5 -- A client was migrating a large database from a 9.2.2 system (P3) to a new 9.2.5 system (S3) using streaming replication. As I personally didn't do the steps on this one, I don't have quite as much information, but the basics are close to incident #2: When S3 was promoted using the trigger file, no errors were observed and the database came up normally, but rows were missing from S3 that were present on P3.

P1 is running Centos 6.3 with ext4 as the file system.

P2 is running Centos 6.4 with ext3 as the file system.

Log shipping in this case was done via rsync.

P3 and S3 are still operational.

No errors in the log files on either system.

--

Obviously, we're very concerned that a bug was introduced in the latest minor release. We're happy to gather data as required to assist in diagnosing this.
--
-- Christophe Pettus
xof(at)thebuild(dot)com


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
Date: 2013-11-18 19:17:00
Message-ID: 528A67AC.10104@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 11/18/2013 10:58 AM, Christophe Pettus wrote:
> Three times in the last two weeks, we have experience data corruption secondary servers using streaming replication on client systems. The versions involved are 9.0.14, 9.2.5, and 9.3.1. Each incident was separate; two cases they were for the same client (9.0.14 and 9.3.1), one for a different client (9.2.5).

To emphasize a salient point: we have not previously seen data
corruption with these exact symptoms prior to the recent patch release.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Christophe Pettus <xof(at)thebuild(dot)com>
To: PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
Date: 2013-11-18 19:19:02
Message-ID: 7F5D0546-D6C1-4F09-9440-FE81E38EE172@thebuild.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


On Nov 18, 2013, at 10:58 AM, Christophe Pettus <xof(at)thebuild(dot)com> wrote:
> As a note, P1 was created from another system (let's call it P0) using just WAL shipping (no streaming replication), and no data corruption was observed.

As another data point, P0 was running 9.0.13, rather than 9.0.14.
--
-- Christophe Pettus
xof(at)thebuild(dot)com


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Christophe Pettus <xof(at)thebuild(dot)com>
Cc: PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
Date: 2013-11-18 19:28:57
Message-ID: 20131118192857.GB26763@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

On 2013-11-18 10:58:26 -0800, Christophe Pettus wrote:
> INCDIDENT #1: 9.0.14 -- A new secondary (S1) was initialized using
> rsync off of an existing, correct primary (P1) for the base backup,
> and using WAL-E for WAL segment shipping. Both the primary and
> secondary were running 9.0.14. S1 properly connected to the primary
> once the it was caught up on WAL segments, and S1 was then promoted as
> a primary using the trigger file.

Could you detail how exactly the base backup was created? Including the
*exact* logic for copying?

> No errors in the log files on either system.

Do you have the log entries for the startup after the base backup?

> Because the client's schema included a "last_updated" field, we were
> able to determine that all of the rows that were either missing or
> duplicated had been updated on P1 shortly (3-5 minutes) before S1 was
> promoted. It's possible, but not confirmed, that there were active
> queries (including updates) running on P1 at the moment of S1's
> promotion.

Any chance you have log_checkpoints enabled? If so, could you check
whether there was a checkpoint around the time of the base backup?

This server is gone, right? If not, could you do:
SELECT ctid, xmin, xmax, * FROM whatever WHERE duplicate_row?

> INCIDENT #2: 9.3.1 -- In order to repair the database, a pg_dump was taken of S1y, after having dropped the primary and unique constraints, and restored into a new 9.3.1 server, P2. Duplicate rows were purged, and missing rows were added again. The database, a new primary, was then put back into production, and ran without incident.
>
> A new secondary, S2 was created off of the primary. This secondary was created using pg_basebackup using --xlog-method=stream, although the WAL-E archiving was still present.
>
> S2 attached to P2 without incident and no errors in the logs, but nearly-identical corruption was discovered (although this time without the duplicated rows, just missing rows). At this point, it's not clear if there was some clustering in the "last_updated" timestamp for the rows that are missing from S2. No duplicated rows were observed.
>
> P2 and S2 are both AWS instances running Ubuntu 12.04, using EBS (with xfs as the file system) as the data volume.
>
> No errors in the log files on either system.

Could you list the *exact* steps you did to startup the cluster?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Christophe Pettus <xof(at)thebuild(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
Date: 2013-11-18 19:38:43
Message-ID: B4C7F01E-5789-493B-A009-D72BF2AED668@thebuild.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


On Nov 18, 2013, at 11:28 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> Could you detail how exactly the base backup was created? Including the
> *exact* logic for copying?

0. Before any of this began, P1 was archiving WAL segments to AWS-S3.
1. pg_start_backup('', true) on P1.
2. Using rsync -av on P1, the entire $PGDATA directory was pushed from P1 to S2.
3. Once the rsync was complete, pg_stop_backup() on P1.
4. Create appropriate recovery.conf on S1.
5. Bring up PostgreSQL on S1.
6. PostgreSQL recovers normally (pulling WAL segments from WAL-E), and eventually connects to P1.

> Do you have the log entries for the startup after the base backup?

Sadly, not anymore.

> This server is gone, right?

Correct.

> Could you list the *exact* steps you did to startup the cluster?

0. Before any of this began, P2 was archiving WAL segments to AWS-S3.
1. Initial (empty) data directory deleted on S2.
2. New data directory created with:

/usr/lib/postgresql/9.3/bin/pg_basebackup --verbose --progress --xlog-method=stream --host=<ip> --user=repluser --pgdata=/data/9.3/main

3. Once the pg_basebackup completed, create appropriate recovery.conf on S1.
4. Bring up PostgreSQL on S2.
5. PostgreSQL recovers normally (pulling a small number of WAL segments from WAL-E), and eventually connects to P2.

--
-- Christophe Pettus
xof(at)thebuild(dot)com


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Christophe Pettus <xof(at)thebuild(dot)com>
Cc: PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
Date: 2013-11-18 19:47:40
Message-ID: 20131118194740.GC26763@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2013-11-18 11:38:43 -0800, Christophe Pettus wrote:
>
> On Nov 18, 2013, at 11:28 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> > Could you detail how exactly the base backup was created? Including the
> > *exact* logic for copying?
>
> 0. Before any of this began, P1 was archiving WAL segments to AWS-S3.
> 1. pg_start_backup('', true) on P1.
> 2. Using rsync -av on P1, the entire $PGDATA directory was pushed from P1 to S2.

Without deleting any data, including pg_xlog/, backup.label, anything?

Did you have hot_standby enabled on all of those machines? Even on the
9.0.13 cluster?

> > Could you list the *exact* steps you did to startup the cluster?
>
> 0. Before any of this began, P2 was archiving WAL segments to AWS-S3.
> 1. Initial (empty) data directory deleted on S2.
> 2. New data directory created with:
>
> /usr/lib/postgresql/9.3/bin/pg_basebackup --verbose --progress --xlog-method=stream --host=<ip> --user=repluser --pgdata=/data/9.3/main
>
> 3. Once the pg_basebackup completed, create appropriate recovery.conf on S1.

That was just recovery command and primary conninfo?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Christophe Pettus <xof(at)thebuild(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
Date: 2013-11-18 19:55:10
Message-ID: DC5896BC-8EC8-4129-B970-8CBD3B392277@thebuild.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


On Nov 18, 2013, at 11:47 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:

> Without deleting any data, including pg_xlog/, backup.label, anything?

Correct.

> Did you have hot_standby enabled on all of those machines? Even on the
> 9.0.13 cluster?

Yes.

> That was just recovery command and primary conninfo?

Correct.

--
-- Christophe Pettus
xof(at)thebuild(dot)com


From: Christophe Pettus <xof(at)thebuild(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
Date: 2013-11-18 19:58:23
Message-ID: 79C8DBD3-2BA8-4C30-8DCE-5A6321C4985F@thebuild.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


On Nov 18, 2013, at 11:47 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:

> Did you have hot_standby enabled on all of those machines? Even on the
> 9.0.13 cluster?

Actually, it's a bit more complex than this:

1. We don't know about P0, the 9.0.13 machine. I assume it was, but it was managed elsewhere.
2. P1 never had hot_standby = 'on', as it was never intended to be a hot stand_by.
3. S1 did have hot_standby = 'on.

--
-- Christophe Pettus
xof(at)thebuild(dot)com


From: Christophe Pettus <xof(at)thebuild(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
Date: 2013-11-18 20:00:55
Message-ID: 37992C8D-E7E0-42FC-AAC2-78EE151B5ABF@thebuild.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


On Nov 18, 2013, at 11:47 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> Without deleting any data, including pg_xlog/, backup.label, anything?

One more correction: After rsync finished and the pg_base_backup() was issued, the contents of pg_xlog/ on S1 were deleted.

--
-- Christophe Pettus
xof(at)thebuild(dot)com


From: Christophe Pettus <xof(at)thebuild(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
Date: 2013-11-18 20:04:34
Message-ID: A19DD803-FC95-4EBE-B5EA-1ED390E71DEC@thebuild.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


On Nov 18, 2013, at 12:00 PM, Christophe Pettus <xof(at)thebuild(dot)com> wrote:

> One more correction: After rsync finished and the pg_base_backup() was issued, the contents of pg_xlog/ on S1 were deleted.

pg_stop_backup(), sorry.

--
-- Christophe Pettus
xof(at)thebuild(dot)com


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Christophe Pettus <xof(at)thebuild(dot)com>
Cc: PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
Date: 2013-11-18 20:57:03
Message-ID: 20131118205703.GA24195@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2013-11-18 10:58:26 -0800, Christophe Pettus wrote:
> After promotion, it was discovered that there was significant data
> loss on S1. Rows that were present on P1 were missing on S1, and some
> rows were duplicated (including duplicates that violated primary key
> and other unique constraints). The indexes were corrupt, in that they
> seemed to think that the duplicates were not duplicated, and that the
> missing rows were still present.

Were there any kind of patterns in the lost data? What kind of workload
are they running? I have an idea what the issue might be...

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Christophe Pettus <xof(at)thebuild(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
Date: 2013-11-18 21:21:29
Message-ID: BA4C09CC-7787-4F42-B91C-876864FB6431@thebuild.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


On Nov 18, 2013, at 12:57 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:

> Were there any kind of patterns in the lost data? What kind of workload
> are they running? I have an idea what the issue might be...

On the P1 > S1 case, the data corrupted was data modified in the last few minutes before the switchover. I don't want to over-analyze, but it was within the checkpoint_timeout value for that sever.

On the P2 > S2 case, it's less obvious what the pattern is, since there was no cutover.

Insufficient information on the P3 > S3 case.

Each of them is a reasonably high-volume OLTP-style workload. The P1/P2 client has a very high level of writes; the P3 more read-heavy, but still a fair number of writes.

--
-- Christophe Pettus
xof(at)thebuild(dot)com


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Christophe Pettus <xof(at)thebuild(dot)com>
Cc: PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
Date: 2013-11-18 22:15:59
Message-ID: 20131118221559.GE26763@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

Afaics it's likely a combination/interaction of bugs and fixes between:
* the initial HS code
* 5a031a5556ff83b8a9646892715d7fef415b83c3
* f44eedc3f0f347a856eea8590730769125964597

But that'd mean nobody noticed it during 9.3's beta...

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Christophe Pettus <xof(at)thebuild(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
Date: 2013-11-18 22:25:58
Message-ID: CFDBBB3B-A5B9-4BA9-B7C4-BA46868DA70B@thebuild.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Great! If there's any further data I can supply to help, let me know.

On Nov 18, 2013, at 2:15 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:

> Hi,
>
> Afaics it's likely a combination/interaction of bugs and fixes between:
> * the initial HS code
> * 5a031a5556ff83b8a9646892715d7fef415b83c3
> * f44eedc3f0f347a856eea8590730769125964597
>
> But that'd mean nobody noticed it during 9.3's beta...
>
> Greetings,
>
> Andres Freund
>
> --
> Andres Freund http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers

--
-- Christophe Pettus
xof(at)thebuild(dot)com


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Christophe Pettus <xof(at)thebuild(dot)com>
Cc: PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
Date: 2013-11-18 22:26:56
Message-ID: 20131118222656.GF26763@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2013-11-18 14:25:58 -0800, Christophe Pettus wrote:
> Great! If there's any further data I can supply to help, let me know.

Trying to reproduce the issue with and without hot_standby=on would be
very helpful, but I guess that's time consuming?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Christophe Pettus <xof(at)thebuild(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
Date: 2013-11-18 22:34:13
Message-ID: DFEEED62-480B-4339-9312-97298974684A@thebuild.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


On Nov 18, 2013, at 2:26 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:

> Trying to reproduce the issue with and without hot_standby=on would be
> very helpful, but I guess that's time consuming?

I've been working on it, but I haven't gotten it to fail yet. I'll keep at it.

--
-- Christophe Pettus
xof(at)thebuild(dot)com


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Christophe Pettus <xof(at)thebuild(dot)com>
Cc: PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
Date: 2013-11-19 14:20:01
Message-ID: 20131119142001.GA10498@alap2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

On 2013-11-18 23:15:59 +0100, Andres Freund wrote:
> Afaics it's likely a combination/interaction of bugs and fixes between:
> * the initial HS code
> * 5a031a5556ff83b8a9646892715d7fef415b83c3
> * f44eedc3f0f347a856eea8590730769125964597

Yes, the combination of those is guilty.

Man, this is (to a good part my) bad.

> But that'd mean nobody noticed it during 9.3's beta...

It's fairly hard to reproduce artificially since a) there have to be
enough transactions starting and committing from the start of the
checkpoint the standby is starting from to the point it does
LogStandbySnapshot() to cross a 32768 boundary b) hint bits often save
the game by not accessing clog at all anymore and thus not noticing the
corruption.
I've reproduced the issue by having an INSERT ONLY table that's never
read from. It's helpful to disable autovacuum.

Imo something the attached patch should be done. The description I came
up with is:

Fix Hot-Standby initialization of clog and subtrans.

These bugs can cause data loss on standbys started with hot_standby=on
at the moment they start to accept read only queries by marking
committed transactions as uncommited. The likelihood of such
corruptions is small unless the primary has a high transaction rate.

5a031a5556ff83b8a9646892715d7fef415b83c3 fixed bugs in HS's startup
logic by maintaining less state until at least
STANDBY_SNAPSHOT_PENDING state was reached, missing the fact that both
clog and subtrans are written to before that. This only failed to fail
in common cases because the usage of ExtendCLOG in procarray.c was
superflous since clog extensions are actually WAL logged.

f44eedc3f0f347a856eea8590730769125964597/I then tried to fix the
missing extensions of pg_subtrans due to the former commit's changes -
which are not WAL logged - by performing the extensions when switching
to a state > STANDBY_INITIALIZED and not performing xid assignments
before that - again missing the fact that ExtendCLOG is unneccessary -
but screwed up twice: Once because latestObservedXid wasn't updated
anymore in that state due to the earlier commit and once by having an
off-by-one error in the loop performing extensions.
This means that whenever a CLOG_XACTS_PER_PAGE (32768 with default
settings) boundary was crossed between the start of the checkpoint
recovery started from and the first xl_running_xact record old
transactions commit bits in pg_clog could be overwritten if they
started and committed in that window.

Fix this mess by not performing ExtendCLOG() in HS at all anymore
since it's unneeded and evidently dangerous and by performing subtrans
extensions even before reaching STANDBY_SNAPSHOT_PENDING.

Imo this warrants and expedited point release :(

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Christophe Pettus <xof(at)thebuild(dot)com>
Cc: PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
Date: 2013-11-19 14:22:18
Message-ID: 20131119142218.GB10498@alap2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2013-11-19 15:20:01 +0100, Andres Freund wrote:
> Imo something the attached patch should be done. The description I came
> up with is:
>
> Fix Hot-Standby initialization of clog and subtrans.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
0001-Fix-Hot-Standby-initialization-of-clog-and-subtrans.patch text/x-patch 7.2 KB

From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Christophe Pettus <xof(at)thebuild(dot)com>, PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
Date: 2013-11-19 14:33:34
Message-ID: 528B76BE.1080102@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


On 11/19/2013 09:20 AM, Andres Freund wrote:
> Imo this warrants and expedited point release :(

I presume anyone who is vulnerable to it would need to recreate their
secondary servers to get rid of potential problems?

cheers

andrew


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Christophe Pettus <xof(at)thebuild(dot)com>, PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
Date: 2013-11-19 14:59:07
Message-ID: 20131119145907.GC10498@alap2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2013-11-19 09:33:34 -0500, Andrew Dunstan wrote:
>
> On 11/19/2013 09:20 AM, Andres Freund wrote:
> >Imo this warrants and expedited point release :(
>
>
> I presume anyone who is vulnerable to it would need to recreate their
> secondary servers to get rid of potential problems?

Yes. There's less expensive ways to do it, but those seem to complicated
to suggest.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Christophe Pettus <xof(at)thebuild(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
Date: 2013-11-19 15:46:33
Message-ID: A681F3EE-6C9A-49EF-93B6-C0F630027461@thebuild.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


On Nov 19, 2013, at 6:59 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:

> Yes. There's less expensive ways to do it, but those seem to complicated
> to suggest.

If this is something that could be built into to a tool, acknowledging the complexity, I'd be happy to see about building it.

--
-- Christophe Pettus
xof(at)thebuild(dot)com


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
Date: 2013-11-19 17:51:28
Message-ID: 528BA520.8080601@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 11/19/2013 06:20 AM, Andres Freund wrote:
> Hi,
>
> On 2013-11-18 23:15:59 +0100, Andres Freund wrote:
>> Afaics it's likely a combination/interaction of bugs and fixes between:
>> * the initial HS code
>> * 5a031a5556ff83b8a9646892715d7fef415b83c3
>> * f44eedc3f0f347a856eea8590730769125964597
>
> Yes, the combination of those is guilty.
>
> Man, this is (to a good part my) bad.
>
>> But that'd mean nobody noticed it during 9.3's beta...

Ah, so this affected 9.3.0 as well?

Maybe it's worth it now to devise some automated replication testing?

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
Date: 2013-11-19 17:58:52
Message-ID: 20131119175852.GK19293@alap2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2013-11-19 09:51:28 -0800, Josh Berkus wrote:
> On 11/19/2013 06:20 AM, Andres Freund wrote:
> > Hi,
> >
> > On 2013-11-18 23:15:59 +0100, Andres Freund wrote:
> >> Afaics it's likely a combination/interaction of bugs and fixes between:
> >> * the initial HS code
> >> * 5a031a5556ff83b8a9646892715d7fef415b83c3
> >> * f44eedc3f0f347a856eea8590730769125964597
> >
> > Yes, the combination of those is guilty.
> >
> > Man, this is (to a good part my) bad.
> >
> >> But that'd mean nobody noticed it during 9.3's beta...
>
> Ah, so this affected 9.3.0 as well?

Yes.

> Maybe it's worth it now to devise some automated replication testing?

It'd be a good idea, but I am not sure where to get resources for it
from.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
Date: 2013-11-19 18:16:26
Message-ID: 528BAAFA.7050600@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 11/19/2013 09:58 AM, Andres Freund wrote:
> On 2013-11-19 09:51:28 -0800, Josh Berkus wrote:
>> Maybe it's worth it now to devise some automated replication testing?
>
> It'd be a good idea, but I am not sure where to get resources for it
> from.

Well, servers isn't hard ... various cloud companies will donate them.

What would be hard is writing the tests. I know how they should work,
and the tools to use, but there's about 1,000 to 2,000 lines of code
between here and there.

(This is why we should have encouraged the folks from SFDC and Huawei
who wanted to do testing, instead of blowing them off.)

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
Date: 2013-11-19 18:25:39
Message-ID: 20131119182539.GB7240@alap2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2013-11-19 10:16:26 -0800, Josh Berkus wrote:
> On 11/19/2013 09:58 AM, Andres Freund wrote:
> > On 2013-11-19 09:51:28 -0800, Josh Berkus wrote:
> >> Maybe it's worth it now to devise some automated replication testing?
> >
> > It'd be a good idea, but I am not sure where to get resources for it
> > from.
>
> Well, servers isn't hard ... various cloud companies will donate them.

Yea, I was talking about human/time/financial ones.

> What would be hard is writing the tests. I know how they should work,
> and the tools to use, but there's about 1,000 to 2,000 lines of code
> between here and there.

I am not sure how much code it's going to take (I'd hope somewhat less),
but it certainly will take some time to agree how it should be built and
then building and integrating it.

> (This is why we should have encouraged the folks from SFDC and Huawei
> who wanted to do testing, instead of blowing them off.)

Haven't noticed/followed anything around that.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Christophe Pettus <xof(at)thebuild(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
Date: 2013-11-19 18:27:30
Message-ID: 0916D5F3-71C1-46DE-8F67-255C8C7BF8EA@thebuild.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


On Nov 19, 2013, at 10:25 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> I am not sure how much code it's going to take (I'd hope somewhat less),
> but it certainly will take some time to agree how it should be built and
> then building and integrating it.

Given that the situation we're in right now is that we have an unknown number of silently corrupt secondaries out there which will only be discovered when someone promotes them to being a primary (possibly because the current primary died without a backup), I'd say that this is something pretty urgent.

--
-- Christophe Pettus
xof(at)thebuild(dot)com


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Christophe Pettus <xof(at)thebuild(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
Date: 2013-11-19 18:29:31
Message-ID: 20131119182931.GC7240@alap2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2013-11-19 10:27:30 -0800, Christophe Pettus wrote:
>
> On Nov 19, 2013, at 10:25 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> > I am not sure how much code it's going to take (I'd hope somewhat less),
> > but it certainly will take some time to agree how it should be built and
> > then building and integrating it.
>
> Given that the situation we're in right now is that we have an unknown
> number of silently corrupt secondaries out there which will only be
> discovered when someone promotes them to being a primary (possibly
> because the current primary died without a backup), I'd say that this
> is something pretty urgent.

It's pretty unlikely that any automated testing would have cought this,
the required conditions are too unlikely for that.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Christophe Pettus <xof(at)thebuild(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
Date: 2013-11-19 18:32:10
Message-ID: A19BC379-2AA5-4BC6-8482-C20571D963CD@thebuild.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


On Nov 19, 2013, at 10:29 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:

> It's pretty unlikely that any automated testing would have cought this,
> the required conditions are too unlikely for that.

I would expect that "promote secondary while primary is under heavy load" is clear-cut test case. What concerns me more is that we don't seem to have a framework to put in a regression test on the bug you just found (and thank you for finding it so quickly!).

--
-- Christophe Pettus
xof(at)thebuild(dot)com


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
Date: 2013-11-19 18:34:09
Message-ID: 528BAF21.7020509@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 11/19/2013 10:29 AM, Andres Freund wrote:
> It's pretty unlikely that any automated testing would have cought this,
> the required conditions are too unlikely for that.

Given our almost total lack of automated testing for replication, how
would you (or anyone else) possibly know that? We certainly do know
that we won't catch these kinds of bugs *without* testing.

Heck, even MySQL AB had automated testing for replication. They largely
ignored the results, but they had it.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Christophe Pettus <xof(at)thebuild(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
Date: 2013-11-19 18:40:04
Message-ID: 20131119184004.GD7240@alap2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2013-11-19 10:32:10 -0800, Christophe Pettus wrote:
>
> On Nov 19, 2013, at 10:29 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>
> > It's pretty unlikely that any automated testing would have cought this,
> > the required conditions are too unlikely for that.
>
> I would expect that "promote secondary while primary is under heavy
> load" is clear-cut test case.

That's not sufficient though. It's e.g. very hard to reproduce the issue
using the standard pgbench workload (not enough xids generated, too many
hint bits).

Note that the bug isn't caused by promotion, the problem occurs during
the initial startup of a Hot-Standby standby. If the bug wasn't hit
there, it won't be a problem at promotion.

> What concerns me more is that we don't seem to have a framework to put
> in a regression test on the bug you just found (and thank you for
> finding it so quickly!).

Agreed. But regarding it as a bad situation isn't fixing it
unfortunately.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
Date: 2013-11-19 18:42:00
Message-ID: 20131119184200.GE7240@alap2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2013-11-19 10:34:09 -0800, Josh Berkus wrote:
> On 11/19/2013 10:29 AM, Andres Freund wrote:
> > It's pretty unlikely that any automated testing would have cought this,
> > the required conditions are too unlikely for that.
>
> Given our almost total lack of automated testing for replication, how
> would you (or anyone else) possibly know that?

I've tried reproducing it, and it's not easy.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>, Christophe Pettus <xof(at)thebuild(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
Date: 2013-11-19 18:43:14
Message-ID: 528BB142.4060606@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 11/19/2013 10:40 AM, Andres Freund wrote:
> On 2013-11-19 10:32:10 -0800, Christophe Pettus wrote:
>> What concerns me more is that we don't seem to have a framework to put
>> in a regression test on the bug you just found (and thank you for
>> finding it so quickly!).
>
> Agreed. But regarding it as a bad situation isn't fixing it
> unfortunately.

Well, I happen to have some pieces of such a framework: the parts which
can automate spinning up arbitrarily complex groups of replicas and
doing failover between them. What we'd still need is:

a) a slightly better workload than pgbench
b) a way to compare and test databases for data corruption of several kinds

Can someone else kick in to help with this?

I think this last issue shows that it's critical as a community to have
such a testing framework in place, otherwise we really need to halt all
work on replication until we have such a thing. I can't see how you
expect to complete streaming logical replication without a replication
testing framework.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Christophe Pettus <xof(at)thebuild(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
Date: 2013-11-19 18:51:27
Message-ID: 20131119185127.GF7240@alap2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2013-11-19 10:43:14 -0800, Josh Berkus wrote:
> I think this last issue shows that it's critical as a community to have
> such a testing framework in place, otherwise we really need to halt all
> work on replication until we have such a thing.

> I can't see how you expect to complete streaming logical replication
> without a replication testing framework.

That's actually easier to test since you can relatively easily integrate
it into pg_regress and isolationtester - thus the patchset actually
contains tests.
You seem to imply that I/we should do that work? That's a bit onesided,
isn't it?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
Date: 2013-11-19 19:08:00
Message-ID: 528BB710.80809@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 11/19/2013 10:51 AM, Andres Freund wrote:
> That's actually easier to test since you can relatively easily integrate
> it into pg_regress and isolationtester - thus the patchset actually
> contains tests.
> You seem to imply that I/we should do that work? That's a bit onesided,
> isn't it?

Nope, just that you should be worried about it, and maybe shouting for help.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Christophe Pettus <xof(at)thebuild(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
Date: 2013-11-19 19:08:26
Message-ID: A4A8D2A1-B569-4701-9036-33F98EAAC55A@thebuild.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


On Nov 19, 2013, at 10:51 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:

> You seem to imply that I/we should do that work?

No, just that it be done. Of course, the more support from the professional PG community that is given to it, the better.

--
-- Christophe Pettus
xof(at)thebuild(dot)com


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>, Christophe Pettus <xof(at)thebuild(dot)com>
Cc: PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
Date: 2013-11-19 23:41:58
Message-ID: 528BF746.8000804@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 11/19/2013 06:20 AM, Andres Freund wrote:
> Imo this warrants and expedited point release :(

Yes, I agree. I'd go further and say it's worth releasing source as
soon as we have it on this one.

What amount of testing were you able to give your patch? We're trying
to arrange some testing on our side.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Christophe Pettus <xof(at)thebuild(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
Date: 2013-11-19 23:53:36
Message-ID: FE7BDDE8-B6E4-440A-9CBC-E88AA7F2CB83@thebuild.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi, Andres,

From my understanding, the problem only occurs over streaming replication; if the secondary was never a hot standby, and only used the archived WAL segments, that would be safe. Is that correct?
--
-- Christophe Pettus
xof(at)thebuild(dot)com


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Christophe Pettus <xof(at)thebuild(dot)com>
Cc: PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
Date: 2013-11-20 00:05:35
Message-ID: 20131120000535.GA22154@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

On 2013-11-19 15:53:36 -0800, Christophe Pettus wrote:
> From my understanding, the problem only occurs over streaming
> replication; if the secondary was never a hot standby, and only used
> the archived WAL segments, that would be safe. Is that correct?

Not entirely.

It's related to a standby running with hot_standby=on. Both archive
based and streaming replication can be used with hot_standby=on or off.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Christophe Pettus <xof(at)thebuild(dot)com>, PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
Date: 2013-11-20 00:09:08
Message-ID: 20131120000908.GB22154@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2013-11-19 15:41:58 -0800, Josh Berkus wrote:
> On 11/19/2013 06:20 AM, Andres Freund wrote:
> > Imo this warrants and expedited point release :(
>
> Yes, I agree. I'd go further and say it's worth releasing source as
> soon as we have it on this one.
>
> What amount of testing were you able to give your patch? We're trying
> to arrange some testing on our side.

So far I've only tested whether it fixes the corruption I could
reproduce with some effort and that pgbench tables are consistent on the
primary and the standby.
This certainly warrants another look from somebody that knows that
code. Which afair is only Simon and Heikki? I'll also have another look
after sleeping on it.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Jeff Frost <jeff(at)pgexperts(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Christophe Pettus <xof(at)thebuild(dot)com>, PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
Date: 2013-11-20 00:10:45
Message-ID: 7EC18CC7-42B0-43C6-BD79-0CAEDF0E3DFD@pgexperts.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


On Nov 19, 2013, at 4:05 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:

> Hi,
>
> On 2013-11-19 15:53:36 -0800, Christophe Pettus wrote:
>> From my understanding, the problem only occurs over streaming
>> replication; if the secondary was never a hot standby, and only used
>> the archived WAL segments, that would be safe. Is that correct?
>
> Not entirely.
>
> It's related to a standby running with hot_standby=on. Both archive
> based and streaming replication can be used with hot_standby=on or off.
>

So, does that mean that restoring from PITR based backups from tools like barman and wal-e could exhibit the same issue if hot_standby=on was in the postgresql.conf?


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Christophe Pettus <xof(at)thebuild(dot)com>, PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
Date: 2013-11-20 10:48:50
Message-ID: 528C9392.8000004@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 19.11.2013 16:22, Andres Freund wrote:
> On 2013-11-19 15:20:01 +0100, Andres Freund wrote:
>> Imo something the attached patch should be done. The description I came
>> up with is:
>>
>> Fix Hot-Standby initialization of clog and subtrans.

Looks ok for a back-patchable fix.

It's a bit bizarre that the ExtendSUBTRANS loop in
ProcArrayApplyRecoveryInfo looks different from the one in
RecordKnownAssignedTransactionIds, but both look correct to me.

In master, it'd be nice to do some further cleanup. Some gripes:

In ProcArrayApplyXidAssignment, I wonder if it would be best to just
remove the (standbyState == STANDBY_INITIALIZED) check altogether. The
KnownAssignedXidsRemoveTree() that follows is harmless if there is
nothing in the known-assigned-xids array (xact_redo_commit does it in
STANDBY_INITIALIZED state too). The other thing that's done after that
check is updating lastOverflowedXid, and AFAICS it would be harmless to
update that, too. It will be overwritten by the
ProcArrayApplyRecoveryInfo() call that comes later.

Clog, subtrans and multixact are all handled differently. Extensions of
clog and multixact are WAL-logged, but extensions of subtrans are not.
They all have a Startup function, but it has a slightly different
purpose. StartupCLOG only sets latest_page_number, but StartupSUBTRANS
and StartupMultiXact zero out the current page. For CLOG, the TrimCLOG()
function does that. Why is clog handled differently from multixact?

StartupCLOG() and StartupMultiXact set latest_page_number, but
StartupSUBTRANS does not. Is that a problem for subtrans? StartupCLOG()
and StartupMultiXact() are called at different stages in hot standby -
StartupCLOG() is called at the beginning of recovery, but
StartupMultiXact() is only called at end of recovery. Why the
discrepancy, does latest_page_number need to be set during hot standby
or not?

I think we should bite the bullet, and WAL-log the extension of
subtrans, too. Then make the startup and extension procedure for all the
SLRUs the same.

- Heikki


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Christophe Pettus <xof(at)thebuild(dot)com>, PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
Date: 2013-11-20 12:00:35
Message-ID: 20131120120035.GA25406@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2013-11-20 12:48:50 +0200, Heikki Linnakangas wrote:
> On 19.11.2013 16:22, Andres Freund wrote:
> >On 2013-11-19 15:20:01 +0100, Andres Freund wrote:
> >>Imo something the attached patch should be done. The description I came
> >>up with is:
> >>
> >> Fix Hot-Standby initialization of clog and subtrans.
>
> Looks ok for a back-patchable fix.
>
> It's a bit bizarre that the ExtendSUBTRANS loop in
> ProcArrayApplyRecoveryInfo looks different from the one in
> RecordKnownAssignedTransactionIds, but both look correct to me.

That's because of the different upper bounds (nextxid) vs xid]), but
yea, I wondered about that as well.

> In master, it'd be nice to do some further cleanup. Some gripes:
>
> In ProcArrayApplyXidAssignment, I wonder if it would be best to just remove
> the (standbyState == STANDBY_INITIALIZED) check altogether. The
> KnownAssignedXidsRemoveTree() that follows is harmless if there is nothing
> in the known-assigned-xids array (xact_redo_commit does it in
> STANDBY_INITIALIZED state too). The other thing that's done after that check
> is updating lastOverflowedXid, and AFAICS it would be harmless to update
> that, too. It will be overwritten by the ProcArrayApplyRecoveryInfo() call
> that comes later.

I was thinking about removing it entirely in the patch, but chose not to
do so. I don't really care which way we go.

> Clog, subtrans and multixact are all handled differently. Extensions of clog
> and multixact are WAL-logged, but extensions of subtrans are not. They all
> have a Startup function, but it has a slightly different purpose.
> StartupCLOG only sets latest_page_number, but StartupSUBTRANS and
> StartupMultiXact zero out the current page. For CLOG, the TrimCLOG()
> function does that. Why is clog handled differently from multixact?

I'd guess clog and multixact are handled differently because multixact
supposedly is never queried during recovery. But I don't that's actually
still true, thinking of 9.3's changes around fkey locks and
HeapTupleGetUpdateXid().
So it's probably time to split StartupMultiXact similar to clog's routines.

> StartupCLOG() and StartupMultiXact set latest_page_number, but
> StartupSUBTRANS does not. Is that a problem for subtrans?

I don't think it is, the difference is that StartupSUBTRANS() zeroes out
the current subtrans page which will set latest_page_number, the other's
access the pages normally, which doesn't set it.

> StartupCLOG() and
> StartupMultiXact() are called at different stages in hot standby -
> StartupCLOG() is called at the beginning of recovery, but StartupMultiXact()
> is only called at end of recovery. Why the discrepancy, does
> latest_page_number need to be set during hot standby or not?

latest_page_number primarily is an optimization, isn't it? Except for a
safeguard check in SimpleLruTruncate() it should only influence victim
buffer initialization. But: slru.c explicitly doesn't initialize
->latest_page_number, which means it's zeroed from a memset slightly
above. Which seems we might cause problems when performing truncations
during recovery, before the first page is zeroed (which'd set
latest_page_number again).
...
Hm. Do we actually *ever* truncate the multixact slru during recovery?
clog.c's truncations are WAL logged, TruncateSUBTRANS() is performed by
restartpoints, but there's no callers to TruncateMultiXact but
vac_truncate_clog and it's not logged? That doesn't seem right.

> I think we should bite the bullet, and WAL-log the extension of subtrans,
> too. Then make the startup and extension procedure for all the SLRUs the
> same.

Hm. I don't really see a big advantage in that? I am all for trying to
bring more symetry to the startup routines, but I don't think forcing
WAL logging for something scrapped every restart is necessary for that.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Kevin Grittner <kgrittn(at)ymail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Christophe Pettus <xof(at)thebuild(dot)com>, PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
Date: 2013-11-20 13:30:39
Message-ID: 1384954239.82640.YahooMailNeo@web162906.mail.bf1.yahoo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> On 2013-11-19 09:33:34 -0500, Andrew Dunstan wrote:
>>
>> On 11/19/2013 09:20 AM, Andres Freund wrote:
>> >Imo this warrants and expedited point release :(

+1

>> I presume anyone who is vulnerable to it would need to recreate
>> their secondary servers to get rid of potential problems?
>
> Yes. There's less expensive ways to do it, but those seem to
> complicated to suggest.

Wouldn't a database VACUUM FREEZE fix it, with WAL-logged writing
of everything that doesn't yet have hint bits set?

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Kevin Grittner <kgrittn(at)ymail(dot)com>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, Christophe Pettus <xof(at)thebuild(dot)com>, PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
Date: 2013-11-20 13:36:18
Message-ID: 20131120133618.GD25406@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2013-11-20 05:30:39 -0800, Kevin Grittner wrote:
> > Yes. There's less expensive ways to do it, but those seem to
> > complicated to suggest.
>
> Wouldn't a database VACUUM FREEZE fix it, with WAL-logged writing
> of everything that doesn't yet have hint bits set?

Besides also being pretty expensive it still wouldn't correct the clog -
and we don't always rely on hint bits.

I was thinking about just copying over the clog from the primary, but
it's not trivial if the standby isn't cought up, since the primary's
clog could have been truncated ahead of what the standby needs.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Kevin Grittner <kgrittn(at)ymail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, Christophe Pettus <xof(at)thebuild(dot)com>, PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
Date: 2013-11-20 13:59:58
Message-ID: 1384955998.30844.YahooMailNeo@web162905.mail.bf1.yahoo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> On 2013-11-20 05:30:39 -0800, Kevin Grittner wrote:

>> Wouldn't a database VACUUM FREEZE fix it, with WAL-logged
>> writing of everything that doesn't yet have hint bits set?
>
> Besides also being pretty expensive it still wouldn't correct the
> clog - and we don't always rely on hint bits.

I'm talking about after a fix is deployed, fixing up the possible
corruption.  Can you explain where VACUUM FREEZE would not suffice?
I don't know of anywhere that we have hint bits set for a tuple and
we go fetch the clog bits in spite of that.  I don't understand
where that would make sense; especially since I thought that a
database FREEZE followed by a checkpoint releases old clog space
anyway.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Kevin Grittner <kgrittn(at)ymail(dot)com>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, Christophe Pettus <xof(at)thebuild(dot)com>, PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
Date: 2013-11-20 14:07:26
Message-ID: 20131120140726.GE25406@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2013-11-20 05:59:58 -0800, Kevin Grittner wrote:
> Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> > On 2013-11-20 05:30:39 -0800, Kevin Grittner wrote:
>
> >> Wouldn't a database VACUUM FREEZE fix it, with WAL-logged
> >> writing of everything that doesn't yet have hint bits set?
> >
> > Besides also being pretty expensive it still wouldn't correct the
> > clog - and we don't always rely on hint bits.
>
> I'm talking about after a fix is deployed, fixing up the possible
> corruption.  Can you explain where VACUUM FREEZE would not suffice?
> I don't know of anywhere that we have hint bits set for a tuple and
> we go fetch the clog bits in spite of that.

There's several places. Grep for TransactionIdDidCommit() and ignore the
bits in tqual.c. Many of the remaining ones do not look at hint bits.

> I don't understand
> where that would make sense; especially since I thought that a
> database FREEZE followed by a checkpoint releases old clog space
> anyway.

It only releases them up to the (cluster wide) xmin horizon. So if there
are older snapshots or prepared xacts around...

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Kevin Grittner <kgrittn(at)ymail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, Christophe Pettus <xof(at)thebuild(dot)com>, PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
Date: 2013-11-20 14:21:13
Message-ID: 1384957273.12016.YahooMailNeo@web162901.mail.bf1.yahoo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> On 2013-11-20 05:59:58 -0800, Kevin Grittner wrote:

>> I don't understand where that would make sense; especially since
>> I thought that a database FREEZE followed by a checkpoint
>> releases old clog space anyway.
>
> It only releases them up to the (cluster wide) xmin horizon. So
> if there are older snapshots or prepared xacts around...

So as long as there are no open transactions or prepared
transactions on the master which started before the release with
the fix is applied, VACUUM FREEZE would be guaranteed to work?
Since I don't see how a non-prepared transaction would be running
from before a minor release upgrade, that just means we have to
make sure there are no prepared transactions from before the
upgrade?

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Kevin Grittner <kgrittn(at)ymail(dot)com>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, Christophe Pettus <xof(at)thebuild(dot)com>, PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
Date: 2013-11-20 14:53:20
Message-ID: 20131120145320.GF25406@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2013-11-20 06:21:13 -0800, Kevin Grittner wrote:
> Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> > On 2013-11-20 05:59:58 -0800, Kevin Grittner wrote:
>
> >> I don't understand where that would make sense; especially since
> >> I thought that a database FREEZE followed by a checkpoint
> >> releases old clog space anyway.
> >
> > It only releases them up to the (cluster wide) xmin horizon. So
> > if there are older snapshots or prepared xacts around...
>
> So as long as there are no open transactions or prepared
> transactions on the master which started before the release with
> the fix is applied, VACUUM FREEZE would be guaranteed to work?
> Since I don't see how a non-prepared transaction would be running
> from before a minor release upgrade, that just means we have to
> make sure there are no prepared transactions from before the
> upgrade?

That's not a bad point. So the way to fix it would be:

1) Restart the standby to the new minor release, wait for catchup
2) Restart the primary (fast or smart) to the new minor release
3) Acquire enough new xids to make sure we cross a clog page (?)
4) Jot down a new xid: SELECT txid_current()::bigint % (1::bigint<<33-1)
5) vacuumdb -z -a
6) Ensure that there are no prepared xacts older than 3) around
SELECT *
FROM pg_prepared_xacts
ORDER BY age(transaction) DESC LIMIT 1;
7) Ensure the xmin horizon is above the one from: 3:
SELECT datname, datfrozenxid
FROM pg_database
WHERE datname != 'template0'
ORDER BY age(datfrozenxid) DESC LIMIT 1;
8) Get the current lsn: SELECT pg_current_xlog_location();
9) verify on each standby that SELECT pg_last_xlog_receive_location() is
past 7)
10) be happy

I am not sure how we can easily compute that 6) and 7) are past 3) in
the presence of xid wraparounds.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Kevin Grittner <kgrittn(at)ymail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, Christophe Pettus <xof(at)thebuild(dot)com>, PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
Date: 2013-11-20 15:06:04
Message-ID: 1384959964.49778.YahooMailNeo@web162903.mail.bf1.yahoo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> On 2013-11-20 06:21:13 -0800, Kevin Grittner wrote:

>> So as long as there are no open transactions or prepared
>> transactions on the master which started before the release with
>> the fix is applied, VACUUM FREEZE would be guaranteed to work?
>> Since I don't see how a non-prepared transaction would be running
>> from before a minor release upgrade, that just means we have to
>> make sure there are no prepared transactions from before the
>> upgrade?
>
> That's not a bad point. So the way to fix it would be:
>
> 1) Restart the standby to the new minor release, wait for catchup
> 2) Restart the primary (fast or smart) to the new minor release
> 3) Acquire enough new xids to make sure we cross a clog page (?)
> 4) Jot down a new xid:  SELECT txid_current()::bigint % (1::bigint<<33-1)
> 5) vacuumdb -z -a
> 6) Ensure that there are no prepared xacts older than 3) around
> SELECT *
> FROM pg_prepared_xacts
> ORDER BY age(transaction) DESC LIMIT 1;
> 7) Ensure the xmin horizon is above the one from: 3:
> SELECT datname, datfrozenxid
> FROM pg_database
> WHERE datname != 'template0'
> ORDER BY age(datfrozenxid) DESC LIMIT 1;
> 8) Get the current lsn: SELECT pg_current_xlog_location();
> 9) verify on each standby that SELECT pg_last_xlog_receive_location() is
>   past 7)
> 10) be happy
>
> I am not sure how we can easily compute that 6) and 7) are past 3) in
> the presence of xid wraparounds.

I may well be missing something here, but wouldn't it be sufficient to?:
1) Restart the standby to the new minor release, wait for catchup
2) Restart the primary (fast or smart) to the new minor release
3) Run VACUUM FREEZE (optionally with ANALYZE) in each database on primary
4) Run CHECKPOINT command on primary, or just wait for one to run
5) Wait for standby to process to the checkpoint
6) Be happy

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Kevin Grittner <kgrittn(at)ymail(dot)com>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, Christophe Pettus <xof(at)thebuild(dot)com>, PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
Date: 2013-11-20 15:15:46
Message-ID: 20131120151546.GA18801@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2013-11-20 07:06:04 -0800, Kevin Grittner wrote:
> > That's not a bad point. So the way to fix it would be:
> >
> > 1) Restart the standby to the new minor release, wait for catchup
> > 2) Restart the primary (fast or smart) to the new minor release
> > 3) Acquire enough new xids to make sure we cross a clog page (?)
> > 4) Jot down a new xid:  SELECT txid_current()::bigint % (1::bigint<<33-1)
> > 5) vacuumdb -z -a
> > 6) Ensure that there are no prepared xacts older than 3) around
> > SELECT *
> > FROM pg_prepared_xacts
> > ORDER BY age(transaction) DESC LIMIT 1;
> > 7) Ensure the xmin horizon is above the one from: 3:
> > SELECT datname, datfrozenxid
> > FROM pg_database
> > WHERE datname != 'template0'
> > ORDER BY age(datfrozenxid) DESC LIMIT 1;
> > 8) Get the current lsn: SELECT pg_current_xlog_location();
> > 9) verify on each standby that SELECT pg_last_xlog_receive_location() is
> >   past 7)
> > 10) be happy
> >
> > I am not sure how we can easily compute that 6) and 7) are past 3) in
> > the presence of xid wraparounds.
>
>
> I may well be missing something here, but wouldn't it be sufficient to?:
> 1) Restart the standby to the new minor release, wait for catchup
> 2) Restart the primary (fast or smart) to the new minor release
> 3) Run VACUUM FREEZE (optionally with ANALYZE) in each database on primary
> 4) Run CHECKPOINT command on primary, or just wait for one to run
> 5) Wait for standby to process to the checkpoint
> 6) Be happy

Well, in some cases it might. But what if there's a prepared xact
around? Or a transaction started directly after 2) preventing
FreezeLimit to go up? Or vacuum_defer_cleanup_age is set? Or there's
another bug like 4c697d8f4845823a8af67788b219ffa4516ad14c?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Kevin Grittner <kgrittn(at)ymail(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Christophe Pettus <xof(at)thebuild(dot)com>, PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
Date: 2013-11-20 16:25:56
Message-ID: 528CE294.8080300@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 20.11.2013 17:06, Kevin Grittner wrote:
> Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> On 2013-11-20 06:21:13 -0800, Kevin Grittner wrote:
>
>
>>> So as long as there are no open transactions or prepared
>>> transactions on the master which started before the release with
>>> the fix is applied, VACUUM FREEZE would be guaranteed to work?
>>> Since I don't see how a non-prepared transaction would be running
>>> from before a minor release upgrade, that just means we have to
>>> make sure there are no prepared transactions from before the
>>> upgrade?
>>
>> That's not a bad point. So the way to fix it would be:
>>
>> 1) Restart the standby to the new minor release, wait for catchup
>> 2) Restart the primary (fast or smart) to the new minor release
>> 3) Acquire enough new xids to make sure we cross a clog page (?)
>> 4) Jot down a new xid: SELECT txid_current()::bigint % (1::bigint<<33-1)
>> 5) vacuumdb -z -a
>> 6) Ensure that there are no prepared xacts older than 3) around
>> SELECT *
>> FROM pg_prepared_xacts
>> ORDER BY age(transaction) DESC LIMIT 1;
>> 7) Ensure the xmin horizon is above the one from: 3:
>> SELECT datname, datfrozenxid
>> FROM pg_database
>> WHERE datname != 'template0'
>> ORDER BY age(datfrozenxid) DESC LIMIT 1;
>> 8) Get the current lsn: SELECT pg_current_xlog_location();
>> 9) verify on each standby that SELECT pg_last_xlog_receive_location() is
>> past 7)
>> 10) be happy
>>
>> I am not sure how we can easily compute that 6) and 7) are past 3) in
>> the presence of xid wraparounds.
>
>
> I may well be missing something here, but wouldn't it be sufficient to?:
> 1) Restart the standby to the new minor release, wait for catchup
> 2) Restart the primary (fast or smart) to the new minor release
> 3) Run VACUUM FREEZE (optionally with ANALYZE) in each database on primary
> 4) Run CHECKPOINT command on primary, or just wait for one to run
> 5) Wait for standby to process to the checkpoint
> 6) Be happy

Isn't it possible that the standby has already incorrectly set
HEAP_XMIN_INVALID hint bit on a page? The full page images generated by
VACUUM FREEZE will correct the damage, but if not, e.g. because
full_page_writes=off, strange things will happen.

Personally, I wouldn't trust anything less than a new base backup.

- Heikki


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Kevin Grittner <kgrittn(at)ymail(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Christophe Pettus <xof(at)thebuild(dot)com>, PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
Date: 2013-11-20 16:46:22
Message-ID: 20131120164622.GE18801@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2013-11-20 18:25:56 +0200, Heikki Linnakangas wrote:
> Isn't it possible that the standby has already incorrectly set
> HEAP_XMIN_INVALID hint bit on a page? The full page images generated by
> VACUUM FREEZE will correct the damage, but if not, e.g. because
> full_page_writes=off, strange things will happen.

The xlog_heap_freeze records should repair that afaics.

> Personally, I wouldn't trust anything less than a new base backup.

But I still tend to agree.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Kevin Grittner <kgrittn(at)ymail(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Christophe Pettus <xof(at)thebuild(dot)com>, PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
Date: 2013-11-20 18:30:07
Message-ID: 528CFFAF.3000908@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andrews, Kevin:

Presumably a replica created while all traffic was halted on the master
would be clean, correct? This bug can only be triggered if there's
heavy write load on the master, right?

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgreSQL(dot)org>
Subject: Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
Date: 2013-11-20 18:48:41
Message-ID: 528D0409.8080803@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 11/20/2013 10:30 AM, Josh Berkus wrote:
> Andrews, Kevin:

Andres, that is.

>
> Presumably a replica created while all traffic was halted on the master
> would be clean, correct? This bug can only be triggered if there's
> heavy write load on the master, right?
>

Also, just to verify:

If someone is doing PITR based on a snapshot taken with pg_basebackup,
that will only trip this corruption bug if the user has hot_standby=on
in their config *while restoring*? Or is it critical if they have
hot_standby=on while backing up?

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgreSQL(dot)org>
Subject: Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
Date: 2013-11-20 23:41:41
Message-ID: 20131120234141.GI18801@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2013-11-20 10:48:41 -0800, Josh Berkus wrote:
> > Presumably a replica created while all traffic was halted on the master
> > would be clean, correct? This bug can only be triggered if there's
> > heavy write load on the master, right?

Kinda. It's unfortunately necessary to understand how HS works to some
degree:
Everytime a server is (re-)started with a recovery.conf present and
hot_standby=on (be it streaming, archive based replication or PITR) the
Hot Standby code is used.
(Crash|Replication)-Recovery starts by reading the last checkpoint (from
pg_control or, if present, backup.label) and then replays WAL from the
'redo' point included in the checkpoint. The bug then occurs when it
first (or, in some case second time) replays a 'xl_running_xacts'
record. That's used to reconstruct information needed to allow queries.

Everytime the server in HS mode allows connections ("consistent recovery state
reached at ..." and "database system is ready to accept read only
connections" in the log), the bug can be triggered. If there weren't too
many transactions at that point, the problem won't occur until the
standby is restarted.

> If someone is doing PITR based on a snapshot taken with pg_basebackup,
> that will only trip this corruption bug if the user has hot_standby=on
> in their config *while restoring*? Or is it critical if they have
> hot_standby=on while backing up?

hot_standby=on only has an effect while starting up with a recovery.conf
present. So, if you have an old base backup around and all WAL files,
you can start from that.

Does that answer your questsions?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgreSQL(dot)org>
Subject: Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
Date: 2013-11-20 23:52:22
Message-ID: 528D4B36.1010804@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andres,

> Everytime the server in HS mode allows connections ("consistent recovery state
> reached at ..." and "database system is ready to accept read only
> connections" in the log), the bug can be triggered. If there weren't too
> many transactions at that point, the problem won't occur until the
> standby is restarted.

Oh, so this doesn't just happen when the base backup is first taken;
*any* time the standby is restarted, it can happen. (!!!)

>> If someone is doing PITR based on a snapshot taken with pg_basebackup,
>> that will only trip this corruption bug if the user has hot_standby=on
>> in their config *while restoring*? Or is it critical if they have
>> hot_standby=on while backing up?
>
> hot_standby=on only has an effect while starting up with a recovery.conf
> present. So, if you have an old base backup around and all WAL files,
> you can start from that.
>
> Does that answer your questsions?

Yeah, thanks.

If you have any ideas for how we'd write code to scan for this kind of
corruption, please post them.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgreSQL(dot)org>
Subject: Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
Date: 2013-11-20 23:57:28
Message-ID: 20131120235728.GJ18801@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2013-11-20 15:52:22 -0800, Josh Berkus wrote:
> Andres,
>
> > Everytime the server in HS mode allows connections ("consistent recovery state
> > reached at ..." and "database system is ready to accept read only
> > connections" in the log), the bug can be triggered. If there weren't too
> > many transactions at that point, the problem won't occur until the
> > standby is restarted.
>
> Oh, so this doesn't just happen when the base backup is first taken;
> *any* time the standby is restarted, it can happen. (!!!)

Yes.

> If you have any ideas for how we'd write code to scan for this kind of
> corruption, please post them.

I don't really have one. Current corruption would be somewhat easy to
detect (walk through the clog, check if all commit bits match), but that
doesn't detect wether already truncated clog was corrupted.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Christophe Pettus <xof(at)thebuild(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgreSQL(dot)org>
Subject: Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
Date: 2013-11-21 00:36:46
Message-ID: 72AFBD6A-D4CC-4972-9972-31F5BA7DAD48@thebuild.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


On Nov 20, 2013, at 3:57 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:

> On 2013-11-20 15:52:22 -0800, Josh Berkus wrote:
>> Oh, so this doesn't just happen when the base backup is first taken;
>> *any* time the standby is restarted, it can happen. (!!!)
>
> Yes.

So, to be completely clear, any secondary running the affected versions which is started with hot_standby=on could potentially be corrupted even if it never connects to a primary?

--
-- Christophe Pettus
xof(at)thebuild(dot)com


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Christophe Pettus <xof(at)thebuild(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgreSQL(dot)org>
Subject: Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
Date: 2013-11-21 00:38:51
Message-ID: 20131121003851.GK18801@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2013-11-20 16:36:46 -0800, Christophe Pettus wrote:
>
> On Nov 20, 2013, at 3:57 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>
> > On 2013-11-20 15:52:22 -0800, Josh Berkus wrote:
> >> Oh, so this doesn't just happen when the base backup is first taken;
> >> *any* time the standby is restarted, it can happen. (!!!)
> >
> > Yes.
>
> So, to be completely clear, any secondary running the affected
> versions which is started with hot_standby=on could potentially be
> corrupted even if it never connects to a primary?

Yes.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Kevin Grittner <kgrittn(at)ymail(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Christophe Pettus <xof(at)thebuild(dot)com>, PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
Date: 2013-11-21 11:34:42
Message-ID: 20131121113442.GA31748@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2013-11-20 17:46:22 +0100, Andres Freund wrote:
> On 2013-11-20 18:25:56 +0200, Heikki Linnakangas wrote:
> > Isn't it possible that the standby has already incorrectly set
> > HEAP_XMIN_INVALID hint bit on a page? The full page images generated by
> > VACUUM FREEZE will correct the damage, but if not, e.g. because
> > full_page_writes=off, strange things will happen.
>
> The xlog_heap_freeze records should repair that afaics.

Nope, it wouldn't, it just uses heap_freeze_tuple() again which *only*
sets HEAP_XMIN_COMMITTED but does *not* clear HEAP_XMIN_INVALID. And if
we were to change that now, it wouldn't help already frozen tuples that
wouldn't get frozen again.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: J Smith <dark(dot)panda+lists(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
Date: 2013-11-21 15:25:20
Message-ID: CADFUPgfaQH-CWet1eFwPWgBVAj1BLf7+26fJPQGZBG+Pf+xGgQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Nov 19, 2013 at 9:22 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> On 2013-11-19 15:20:01 +0100, Andres Freund wrote:
>> Imo something the attached patch should be done. The description I came
>> up with is:
>>
>> Fix Hot-Standby initialization of clog and subtrans.
>

G'day Andres.

This wouldn't happen to be related to the issue I was having with
missing pg_subtrans files, would it, as discussed here
http://www.postgresql.org/message-id/CADFUPgc5bmtv-yg9znxV-vcfkb+JPRqs7m2OesQXaM_4Z1JpdQ@mail.gmail.com
? Based on my reading of the commit message and what I could glean
from the patch itself, they seem to be related.

If there's anything I can do to help test, just say the word and I can
fit in some time. Cheers.


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: J Smith <dark(dot)panda+lists(at)gmail(dot)com>
Cc: PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
Date: 2013-11-21 15:27:53
Message-ID: 20131121152753.GF31748@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2013-11-21 10:25:20 -0500, J Smith wrote:
> On Tue, Nov 19, 2013 at 9:22 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> > On 2013-11-19 15:20:01 +0100, Andres Freund wrote:
> >> Imo something the attached patch should be done. The description I came
> >> up with is:
> >>
> >> Fix Hot-Standby initialization of clog and subtrans.
> >
>
> G'day Andres.
>
> This wouldn't happen to be related to the issue I was having with
> missing pg_subtrans files, would it, as discussed here
> http://www.postgresql.org/message-id/CADFUPgc5bmtv-yg9znxV-vcfkb+JPRqs7m2OesQXaM_4Z1JpdQ@mail.gmail.com
> ? Based on my reading of the commit message and what I could glean
> from the patch itself, they seem to be related.

I don't think so - for one, pg_subtrans isn't really the problems with
that bug, for another, it won't cause missing files. Also, you're not
using replication, right?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Tatsuo Ishii <ishii(at)postgresql(dot)org>
To: josh(at)agliodbs(dot)com
Cc: andres(at)2ndquadrant(dot)com, xof(at)thebuild(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
Date: 2013-11-21 16:32:30
Message-ID: 20131122.013230.983057682533416530.t-ishii@sraoss.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> Well, I happen to have some pieces of such a framework: the parts which
> can automate spinning up arbitrarily complex groups of replicas and
> doing failover between them. What we'd still need is:
>
> a) a slightly better workload than pgbench
> b) a way to compare and test databases for data corruption of several kinds
>
> Can someone else kick in to help with this?

Recently pgpool-II introduces its own regression test framework
written in shell script which can create a streaming replication
primary and arbitrary number of standbys in single server. The test
scenarios are written in a shell script, thus is reasonably
flexible. I have read through all the thread but I am not sure how to
reproduce the problem reliably. If I would know it, probably I can
integrate into the regression test.

> I think this last issue shows that it's critical as a community to have
> such a testing framework in place, otherwise we really need to halt all
> work on replication until we have such a thing.

Agreed. I think this is very critical for us.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp


From: J Smith <dark(dot)panda+lists(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
Date: 2013-11-21 16:54:58
Message-ID: CADFUPgdnn6QgAdtCnE77J6e261afXxr6uqp5v4VUafkcQ3JTFw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Nov 21, 2013 at 10:27 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>
> I don't think so - for one, pg_subtrans isn't really the problems with
> that bug, for another, it won't cause missing files. Also, you're not
> using replication, right?
>

Actually, this server is a master being replicated to a standby. We
haven't experienced the problem on the standby, but we are indeed
replicating the server experiencing the issue.


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: J Smith <dark(dot)panda+lists(at)gmail(dot)com>
Cc: PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
Date: 2013-11-21 17:23:53
Message-ID: 20131121172353.GK7240@alap2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2013-11-21 11:54:58 -0500, J Smith wrote:
> On Thu, Nov 21, 2013 at 10:27 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> >
> > I don't think so - for one, pg_subtrans isn't really the problems with
> > that bug, for another, it won't cause missing files. Also, you're not
> > using replication, right?
> >
>
> Actually, this server is a master being replicated to a standby. We
> haven't experienced the problem on the standby, but we are indeed
> replicating the server experiencing the issue.

It's still not this issue in that case, but I might have an idea... Do
you have hot_standby_feedback enabled?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: J Smith <dark(dot)panda+lists(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
Date: 2013-11-21 18:01:13
Message-ID: CADFUPgfaA8fb+DcspcOYFiYEJafse2fxG_WB_ywN2SFWRS0SXA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Nov 21, 2013 at 12:23 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>
> It's still not this issue in that case, but I might have an idea... Do
> you have hot_standby_feedback enabled?
>

Nope, hot_standby_feedback is set to its default setting which is off.

At any rate, I'm going to try and capture a backtrace from the error
I'm experiencing as suggested in the other thread by Robert Haas and
we'll see if that provides any insight.

Cheers


From: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To: J Smith <dark(dot)panda+lists(at)gmail(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
Date: 2013-11-21 20:36:28
Message-ID: 528E6ECC.3070305@commandprompt.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


Hello,

This is turning into a rather large thread and I have a simple question:

Is a work-around to this problem as simple as disabling streaming
replication and enabling log shipping instead?

Joshua D. Drake

--
Command Prompt, Inc. - http://www.commandprompt.com/ 509-416-6579
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc
For my dreams of your image that blossoms
a rose in the deeps of my heart. - W.B. Yeats


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
Date: 2013-11-21 20:39:54
Message-ID: 528E6F9A.5080001@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 11/21/2013 12:36 PM, Joshua D. Drake wrote:
>
> Hello,
>
> This is turning into a rather large thread and I have a simple question:
>
> Is a work-around to this problem as simple as disabling streaming
> replication and enabling log shipping instead?

Yes, and re-cloning the replica, in case it already has corruption.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
Date: 2013-11-21 20:46:33
Message-ID: 20131121204633.GA27838@alap2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2013-11-21 12:39:54 -0800, Josh Berkus wrote:
> On 11/21/2013 12:36 PM, Joshua D. Drake wrote:
> >
> > Hello,
> >
> > This is turning into a rather large thread and I have a simple question:
> >
> > Is a work-around to this problem as simple as disabling streaming
> > replication and enabling log shipping instead?

No. Check
http://archives.postgresql.org/message-id/20131120234141.GI18801%40awork2.anarazel.de

The problem is starting with hot_standby=on on a system with
recovery.conf present. It is independent of whether you use streaming
replication, archive based recovery, or just shutdown the server and
manually copy xlog segments there.
As long as hot_standby=on, and recovery.conf is present you can hit the
bug.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
Date: 2013-11-21 20:51:17
Message-ID: 528E7245.1010206@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 11/21/2013 12:46 PM, Andres Freund wrote:
> No. Check
> http://archives.postgresql.org/message-id/20131120234141.GI18801%40awork2.anarazel.de
>
> The problem is starting with hot_standby=on on a system with
> recovery.conf present. It is independent of whether you use streaming
> replication, archive based recovery, or just shutdown the server and
> manually copy xlog segments there.
> As long as hot_standby=on, and recovery.conf is present you can hit the
> bug.

Oh, aha. There have to be some transactions which are awaiting
checkpoint, though, correct? As in, if there's no activity on the
master, you can't trigger the bug?

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
Date: 2013-11-21 20:53:08
Message-ID: 20131121205308.GB27838@alap2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2013-11-21 12:51:17 -0800, Josh Berkus wrote:
> On 11/21/2013 12:46 PM, Andres Freund wrote:
> > No. Check
> > http://archives.postgresql.org/message-id/20131120234141.GI18801%40awork2.anarazel.de
> >
> > The problem is starting with hot_standby=on on a system with
> > recovery.conf present. It is independent of whether you use streaming
> > replication, archive based recovery, or just shutdown the server and
> > manually copy xlog segments there.
> > As long as hot_standby=on, and recovery.conf is present you can hit the
> > bug.
>
> Oh, aha. There have to be some transactions which are awaiting
> checkpoint, though, correct? As in, if there's no activity on the
> master, you can't trigger the bug?

Correct. Also, if you *start* at such a checkpoint you're not vulnerable
until the standby is restarted.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Christophe Pettus <xof(at)thebuild(dot)com>, PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
Date: 2013-11-21 20:55:01
Message-ID: 20131121205501.GC27838@alap2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

On 2013-11-20 12:48:50 +0200, Heikki Linnakangas wrote:
> On 19.11.2013 16:22, Andres Freund wrote:
> >On 2013-11-19 15:20:01 +0100, Andres Freund wrote:
> >>Imo something the attached patch should be done. The description I came
g> >>up with is:
> >>
> >> Fix Hot-Standby initialization of clog and subtrans.
>
> Looks ok for a back-patchable fix.

Do you plan to commit this? Or who is going to?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
Date: 2013-11-21 21:02:29
Message-ID: 528E74E5.1060906@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 21.11.2013 22:53, Andres Freund wrote:
> On 2013-11-21 12:51:17 -0800, Josh Berkus wrote:
>> On 11/21/2013 12:46 PM, Andres Freund wrote:
>>> The problem is starting with hot_standby=on on a system with
>>> recovery.conf present. It is independent of whether you use streaming
>>> replication, archive based recovery, or just shutdown the server and
>>> manually copy xlog segments there.
>>> As long as hot_standby=on, and recovery.conf is present you can hit the
>>> bug.
>>
>> Oh, aha. There have to be some transactions which are awaiting
>> checkpoint, though, correct? As in, if there's no activity on the
>> master, you can't trigger the bug?
>
> Correct. Also, if you *start* at such a checkpoint you're not vulnerable
> until the standby is restarted.

Keep in mind that autovacuum counts as "activity" in this case. If
you're unlucky, that is. It's next to impossible to determine
after-the-fact if there has been activity in the master that might've
caused problems.

If you have ever set hot_standby=on in your standby server, running one
of the affected versions, you're at risk. The standby might be corrupt,
and should be rebuild from a base backup. The higher the transaction
rate in the master, the higher the risk.

I wouldn't try to narrow it down any further than that, it gets too
complicated.

- Heikki


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Simon Riggs <simon(at)2ndQuadrant(dot)com>, Christophe Pettus <xof(at)thebuild(dot)com>, PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
Date: 2013-11-21 21:13:01
Message-ID: 528E775D.2020405@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 21.11.2013 22:55, Andres Freund wrote:
> Hi,
>
> On 2013-11-20 12:48:50 +0200, Heikki Linnakangas wrote:
>> On 19.11.2013 16:22, Andres Freund wrote:
>>> On 2013-11-19 15:20:01 +0100, Andres Freund wrote:
>>>> Imo something the attached patch should be done. The description I came
> g> >>up with is:
>>>>
>>>> Fix Hot-Standby initialization of clog and subtrans.
>>
>> Looks ok for a back-patchable fix.
>
> Do you plan to commit this? Or who is going to?

I can commit it tomorrow, unless someone beats me to it.

- Heikki


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
Date: 2013-11-21 21:15:38
Message-ID: 20131121211538.GE27838@alap2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2013-11-21 23:02:29 +0200, Heikki Linnakangas wrote:
> On 21.11.2013 22:53, Andres Freund wrote:
> >On 2013-11-21 12:51:17 -0800, Josh Berkus wrote:
> >>On 11/21/2013 12:46 PM, Andres Freund wrote:
> >>>The problem is starting with hot_standby=on on a system with
> >>>recovery.conf present. It is independent of whether you use streaming
> >>>replication, archive based recovery, or just shutdown the server and
> >>>manually copy xlog segments there.
> >>>As long as hot_standby=on, and recovery.conf is present you can hit the
> >>>bug.
> >>
> >>Oh, aha. There have to be some transactions which are awaiting
> >>checkpoint, though, correct? As in, if there's no activity on the
> >>master, you can't trigger the bug?
> >
> >Correct. Also, if you *start* at such a checkpoint you're not vulnerable
> >until the standby is restarted.
>
> Keep in mind that autovacuum counts as "activity" in this case. If you're
> unlucky, that is. It's next to impossible to determine after-the-fact if
> there has been activity in the master that might've caused problems.

Well, in that case you're going to "just" loose the
pg_database/pg_class/stats updates from analyze/vacuum. That's annoying,
but not too bad.

> I wouldn't try to narrow it down any further than that, it gets too
> complicated.

Yes.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Christophe Pettus <xof(at)thebuild(dot)com>, PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
Date: 2013-11-22 12:22:56
Message-ID: 528F4CA0.30408@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 19.11.2013 16:20, Andres Freund wrote:
> On 2013-11-18 23:15:59 +0100, Andres Freund wrote:
>> Afaics it's likely a combination/interaction of bugs and fixes between:
>> * the initial HS code
>> * 5a031a5556ff83b8a9646892715d7fef415b83c3
>> * f44eedc3f0f347a856eea8590730769125964597
>
> Yes, the combination of those is guilty.
>
> Man, this is (to a good part my) bad.
>
>> But that'd mean nobody noticed it during 9.3's beta...
>
> It's fairly hard to reproduce artificially since a) there have to be
> enough transactions starting and committing from the start of the
> checkpoint the standby is starting from to the point it does
> LogStandbySnapshot() to cross a 32768 boundary b) hint bits often save
> the game by not accessing clog at all anymore and thus not noticing the
> corruption.
> I've reproduced the issue by having an INSERT ONLY table that's never
> read from. It's helpful to disable autovacuum.

For the archive, here's what I used to reproduce this. It creates master
and a standby, and also uses an INSERT only table. To make it trigger
more easily, it helps to insert sleeps in CreateCheckpoint(), around the
LogStandbySnapshot() call.

- Heikki

Attachment Content-Type Size
test-hot-standby-bug.sh application/x-sh 1.5 KB

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Simon Riggs <simon(at)2ndQuadrant(dot)com>, Christophe Pettus <xof(at)thebuild(dot)com>, PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
Date: 2013-11-22 13:01:10
Message-ID: 528F5596.2060008@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 21.11.2013 22:55, Andres Freund wrote:
> On 2013-11-20 12:48:50 +0200, Heikki Linnakangas wrote:
>> On 19.11.2013 16:22, Andres Freund wrote:
>>> On 2013-11-19 15:20:01 +0100, Andres Freund wrote:
>>>> Imo something the attached patch should be done. The description I came
> g> >>up with is:
>>>>
>>>> Fix Hot-Standby initialization of clog and subtrans.
>>
>> Looks ok for a back-patchable fix.
>
> Do you plan to commit this? Or who is going to?

Ok, committed.

- Heikki


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Simon Riggs <simon(at)2ndQuadrant(dot)com>, Christophe Pettus <xof(at)thebuild(dot)com>, PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
Date: 2013-11-22 13:38:08
Message-ID: 20131122133808.GA17400@alap2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2013-11-22 15:01:10 +0200, Heikki Linnakangas wrote:
> On 21.11.2013 22:55, Andres Freund wrote:
> >On 2013-11-20 12:48:50 +0200, Heikki Linnakangas wrote:
> >>Looks ok for a back-patchable fix.
> >
> >Do you plan to commit this? Or who is going to?
>
> Ok, committed.

Thanks!

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services