Re: 9a57858f1103b89a5674f0d50c5fe1f756411df6

Lists: pgsql-hackers
From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: 9a57858f1103b89a5674f0d50c5fe1f756411df6
Date: 2014-03-13 00:09:23
Message-ID: CA+Tgmob8vfzYrLToqYr7uJ2moW3Gnv8rZpPtznxVXRPfTHQpCA@mail.gmail.com

On the pgsql-packagers list, there has been some (OT for that list)
discussion of whether commit 9a57858f1103b89a5674f0d50c5fe1f756411df6
is sufficiently serious to justify yet another immediate minor release
of 9.3.x. The relevant questions seem to be:

1. Is it really bad?

2. Does it affect a lot of people or only a few?

3. Are there more, equally bad bugs that are unfixed, or perhaps even
unreported, yet?

Obviously, we don't want to leave serious bugs unpatched. On the
other hand, as Tom pointed out in that discussion, releases are a lot
of work, and we can't do them for every commit.

Discuss.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 9a57858f1103b89a5674f0d50c5fe1f756411df6
Date: 2014-03-13 01:15:23
Message-ID: 26918.1394673323@sss.pgh.pa.us

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> Discuss.

This thread badly needs a more informative Subject line.

But, yeah: do people think the referenced commit fixes a bug bad enough
to deserve a quick update release? If so, why? Multiple reports of
problems in the field would be a good reason, but I've not seen such.

regards, tom lane


From: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Bug: Fix Wal replay of locking an updated tuple (WAS: Re: 9a57858f1103b89a5674f0d50c5fe1f756411df6)
Date: 2014-03-13 01:22:54
Message-ID: 5321086E.2010801@commandprompt.com


On 03/12/2014 06:15 PM, Tom Lane wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> Discuss.
>
> This thread badly needs a more informative Subject line.
>

No kidding. Or at least a link for goodness sake. Although the
pgsql-packers list wasn't all that helpful either.

What I know is that we have a known, in-the-wild version of PostgreSQL
that eats data. That is bad. It is unfortunate that we just released
9.3.3 but we can't knowingly allow people to get their data eaten. We
look bad.

It appears that this is the specific bug:

http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=9a57858f1103b89a5674f0d50c5fe1f756411df6

JD

--
Command Prompt, Inc. - http://www.commandprompt.com/ 509-416-6579
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc
For my dreams of your image that blossoms
a rose in the deeps of my heart. - W.B. Yeats


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 9a57858f1103b89a5674f0d50c5fe1f756411df6
Date: 2014-03-13 01:37:33
Message-ID: 20140313013733.GI12995@tamriel.snowman.net

* Tom Lane (tgl(at)sss(dot)pgh(dot)pa(dot)us) wrote:
> This thread badly needs a more informative Subject line.

Agreed.

> But, yeah: do people think the referenced commit fixes a bug bad enough
> to deserve a quick update release? If so, why? Multiple reports of
> problems in the field would be a good reason, but I've not seen such.

Uh, isn't what brought this to light two independent complaints from
Peter and Greg Stark of seeing corruption in the field due to this?

Peter's initial email also indicated it was two different systems which
had gotten bit by this and Greg explicitly stated that he was working on
an independent database from what Peter was reporting on, so that's at
least 2 (one each), or 3 (if you count databases, as Peter had 2).
Sure, they're all from Heroku, but I find it highly unlikely no one else
has run into this issue. More likely, they simply haven't realized it's
happened to them (which is another reason this is a particularly nasty
bug..).

I understand that another release makes work for everyone, and that
stinks, and it's also no fun in the press to have *another* release that
is fixing corruption issues, but sitting on a fix which is actively
causing corruption in the field isn't any good either.

So, my +1 is for a "quick update release"- and if there's a way I can
help offload some of the work (or at least learn the steps to help with
offloading in the future), I'm happy to do so- just let me know.

Thanks,

Stephen


From: David Johnston <polobo(at)yahoo(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bug: Fix Wal replay of locking an updated tuple (WAS: Re: 9a57858f1103b89a5674f0d50c5fe1f756411df6)
Date: 2014-03-13 01:37:35
Message-ID: 1394674655343-5795827.post@n5.nabble.com

Joshua D. Drake wrote
> On 03/12/2014 06:15 PM, Tom Lane wrote:
>> Robert Haas <robertmhaas@> writes:
>>> Discuss.
>>
>> This thread badly needs a more informative Subject line.
>>
>
> No kidding. Or at least a link for goodness sake. Although the
> pgsql-packers list wasn't all that helpful either.

A link would be nice, though if -packers is a security list then that may not
be a good thing, since -hackers is public...

A quick search of Nabble and the "Mailing Lists" section of the homepage does
not indicate that pgsql-packers exists - at least not in any publicly (even if
read-only) accessible way.

David J.



From: Greg Stark <stark(at)mit(dot)edu>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: 9a57858f1103b89a5674f0d50c5fe1f756411df6
Date: 2014-03-13 03:51:42
Message-ID: CAM-w4HO2CAQ1k34cx3vw3_gJ8eQxUA44kgSh=pCQTpCsj5VnPA@mail.gmail.com

On 13 Mar 2014 01:36, "Stephen Frost" <sfrost(at)snowman(dot)net> wrote:
>
> * Tom Lane (tgl(at)sss(dot)pgh(dot)pa(dot)us) wrote:
> > This thread badly needs a more informative Subject line.
>
> Agreed.
>
> > But, yeah: do people think the referenced commit fixes a bug bad enough
> > to deserve a quick update release? If so, why? Multiple reports of
> > problems in the field would be a good reason, but I've not seen such.
>
> Uh, isn't what brought this to light two independent complaints from
> Peter and Greg Stark of seeing corruption in the field due to this?
>
> Peter's initial email also indicated it was two different systems which
> had gotten bit by this and Greg explicitly stated that he was working on
> an independent database from what Peter was reporting on, so that's at
> least 2 (one each), or 3 (if you count databases, as Peter had 2).
> Sure, they're all from Heroku, but I find it highly unlikely no one else
> has run into this issue. More likely, they simply haven't realized it's
> happened to them (which is another reason this is a particularly nasty
> bug..).

We have the two databases where we're sure this was the problem. On the one
I worked on, the customer complained that it happened repeatedly.

The key I demonstrated here wasn't even the one the customer was
complaining about. It seems their usage pattern made it extremely easy to
trigger, and that usage pattern arose naturally from using a Rails module
called counter_cache, which maintains a cache of the count of a child table
in the parent table.

We also have a few other customers complaining about duplicate keys. It's
hard to be sure but these may have been standbys where the problem occurred
ages ago and they only now activated their standby and ran into the problem.

That's what worries me most about this bug. You'll only detect it if you're
routinely querying your standby. If you have a standby for HA purposes it
might be corrupt for a long time without you realising it. We may be
fielding corruption complaints for a long time without being able to
conclusively prove whether it's due to this bug or not.


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 9a57858f1103b89a5674f0d50c5fe1f756411df6
Date: 2014-03-13 11:00:58
Message-ID: 20140313110058.GB8268@awork2.anarazel.de

On 2014-03-12 20:09:23 -0400, Robert Haas wrote:
> On the pgsql-packagers list, there has been some (OT for that list)
> discussion of whether commit 9a57858f1103b89a5674f0d50c5fe1f756411df6
> is sufficiently serious to justify yet another immediate minor release
> of 9.3.x. The relevant questions seem to be:
>
> 1. Is it really bad?

It breaks the ctid of concurrently updated/locked tuples during WAL
replay. Which can lead to all sorts of nastiness like indexes not
finding any rows. Since that kind of locking/updating is pretty common
with foreign keys, it's not an unlikely scenario.
Unfortunately FPIs won't save the day in all that many scenarios because
there'll normally be a XLOG_HEAP2_LOCK_UPDATED before the XLOG_HEAP_LOCK
record which is replayed badly.
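
To make the failure mode above concrete, here is a toy Python model of an
update chain on one heap page. It is only a sketch under stated assumptions,
not PostgreSQL's actual code: the names HeapTuple, hot_update, replay_lock,
index_fetch and seq_scan are hypothetical, and the assumption that the buggy
replay resets the old tuple's ctid to point at itself (and clears its
"updated" flag) is an illustration of how a broken ctid can hide a row from
index lookups while a sequential scan still sees it.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class HeapTuple:
    value: str
    ctid: int            # offset of the next version; own offset if there is none
    hot_updated: bool    # simplified stand-in for the "has a newer version" flag
    dead: bool = False   # True once superseded by a committed update


Page = Dict[int, HeapTuple]


def hot_update(page: Page, old_off: int, new_off: int, new_value: str) -> None:
    """Create a new tuple version and link the old version to it."""
    page[new_off] = HeapTuple(new_value, ctid=new_off, hot_updated=False)
    old = page[old_off]
    old.ctid = new_off
    old.hot_updated = True
    old.dead = True


def replay_lock(page: Page, off: int, buggy: bool) -> None:
    """Replay a tuple-lock record against a tuple that was already updated."""
    tup = page[off]
    if buggy:
        # ASSUMPTION for illustration: the forward link is clobbered on replay.
        tup.ctid = off
        tup.hot_updated = False
    # Non-buggy replay leaves the chain alone (lock state is not modelled here).


def index_fetch(page: Page, root_off: int) -> Optional[str]:
    """Start at the offset the index points to and follow the chain to a live version."""
    off = root_off
    while True:
        tup = page[off]
        if not tup.dead:
            return tup.value
        if not tup.hot_updated or tup.ctid == off:
            return None  # chain ends on a dead tuple: the index finds nothing
        off = tup.ctid


def seq_scan(page: Page) -> List[str]:
    """Visit every tuple on the page directly, ignoring chain links."""
    return [t.value for _, t in sorted(page.items()) if not t.dead]


def run(buggy: bool) -> Tuple[Optional[str], List[str]]:
    page: Page = {1: HeapTuple("v1", ctid=1, hot_updated=False)}
    hot_update(page, 1, 2, "v2")       # row updated; the index still points at offset 1
    replay_lock(page, 1, buggy)        # lock record replayed for the old version
    return index_fetch(page, 1), seq_scan(page)
```

With buggy=False both access paths return the current version; with
buggy=True the index path returns nothing while the sequential scan still
finds the row, which matches the symptom described in this thread.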

Now, one could argue that it only affects replicas or servers that
crashed at some point, but I think that's not much comfort.

> 2. Does it affect a lot of people or only a few?

It's been reported twice (Peter Geoghegan, Greg Stark) by Heroku and one
person on IRC could reproduce it repeatedly. The latter was what made me
look into it again and find the bug. Greg has confirmed that it fixes
the bug when replaying the WAL again.

> 3. Are there more, equally bad bugs that are unfixed, or perhaps even
> unreported, yet?

Uh. I have no idea. I don't know of any reports that can't be attributed
to any of these, but as you're also including unreported bugs in that
question...

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Jozef Mlich <jmlich(at)redhat(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 9a57858f1103b89a5674f0d50c5fe1f756411df6
Date: 2014-03-13 12:06:00
Message-ID: 1394712360.2351.2.camel@mlich-lenovo.usersys.redhat.com

On Thu, 2014-03-13 at 12:00 +0100, Andres Freund wrote:
> On 2014-03-12 20:09:23 -0400, Robert Haas wrote:
> > On the pgsql-packagers list, there has been some (OT for that list)
> > discussion of whether commit 9a57858f1103b89a5674f0d50c5fe1f756411df6
> > is sufficiently serious to justify yet another immediate minor release
> > of 9.3.x. The relevant questions seem to be:
> >
> > 1. Is it really bad?
>
> It breaks the ctid of concurrently updated/locked tuples during WAL
> replay. Which can lead to all sorts of nastiness like indexes not
> finding any rows. Since that kind of locking/updating is pretty common
> with foreign keys, it's not an unlikely scenario.
> Unfortunately FPIs won't save the day in all that many scenarios because
> there'll normally be a XLOG_HEAP2_LOCK_UPDATED before the XLOG_HEAP_LOCK
> record which is replayed badly.
>
> Now, one could argue that it only affects replicas or servers that
> crashed at some point, but I think that's not much comfort.
>
> > 2. Does it affect a lot of people or only a few?
>
> It's been reported twice (Peter Geoghegan, Greg Stark) by Heroku and one
> person on IRC could reproduce it repeatedly. The latter was what made me
> look into it again and find the bug. Greg has confirmed that it fixes
> the bug when replaying the WAL again.
>
> > 3. Are there more, equally bad bugs that are unfixed, or perhaps even
> > unreported, yet?
>
> Uh. I have no idea. I don't know of any reports that can't be attributed
> to any of these, but as you're also including unreported bugs in that
> question...
>

Does this also affect other branches? 9.2?

regards,
--
Jozef Mlich <jmlich(at)redhat(dot)com>
Associate Software Engineer - EMEA ENG Developer Experience
Mobile: +420 604 217 719
http://cz.redhat.com/
Red Hat, Inc.


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Jozef Mlich <jmlich(at)redhat(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 9a57858f1103b89a5674f0d50c5fe1f756411df6
Date: 2014-03-13 12:12:14
Message-ID: 20140313121214.GF8268@awork2.anarazel.de

On 2014-03-13 13:06:00 +0100, Jozef Mlich wrote:
> Does this also affect other branches? 9.2?

Nope, it's 9.3 only.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>, Jozef Mlich <jmlich(at)redhat(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 9a57858f1103b89a5674f0d50c5fe1f756411df6
Date: 2014-03-13 17:10:00
Message-ID: 5321E668.8090808@agliodbs.com

All,

First, I'll note that one of the reasons we haven't had a bunch of
reports from the field about this is that a lot of our users have yet to
apply 9.3.3, so if they have corruption issues they probably attribute
them to the issues which are fixed in 9.3.3. I know that's the case
with our customer base.

As much as I hate extra releases, it might be better to push this one
out; if we can get it out in the next 2 weeks, folks can skip the
downtime for 9.3.3 and go straight to 9.3.4.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Greg Stark <stark(at)mit(dot)edu>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Jozef Mlich <jmlich(at)redhat(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 9a57858f1103b89a5674f0d50c5fe1f756411df6
Date: 2014-03-13 17:23:29
Message-ID: CAM-w4HN+0jnvN6cNitNFBuz9Au3E4KfVrPg4Uq+NwW+5LqoeeA@mail.gmail.com

On Thu, Mar 13, 2014 at 5:10 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> First, I'll note that one of the reasons we haven't had a bunch of
> reports from the field about this is that a lot of our users have yet to
> apply 9.3.3, so if they have corruption issues they probably attribute
> them to the issues which are fixed in 9.3.3. I know that's the case
> with our customer base.

I was speculating that the reason we saw a sudden bunch after 9.3.3
might be that there might be a number of people who wait N releases
before upgrading and the number of people for whom the value of N is 3
might be significant.

Or it could be a coincidence. Users will only notice if they fail over
to their standby or run queries on their standby.

--
greg


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Jozef Mlich <jmlich(at)redhat(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Subject: Re: 9a57858f1103b89a5674f0d50c5fe1f756411df6
Date: 2014-03-14 19:51:15
Message-ID: 53235DB3.7060804@agliodbs.com

Alvaro, All:

Can someone help me with what we should tell users about this issue?

1. What users are especially likely to encounter it? All replication
users, or do they have to do something else?

2. What error messages will affected users get? A link to the reports
of this issue on pgsql lists would tell me this, but I'm not sure
exactly which error reports are associated.

3. If users have already encountered corruption due to the fixed issue,
what do they need to do after updating? re-basebackup?

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Greg Stark <stark(at)mit(dot)edu>, Andres Freund <andres(at)2ndquadrant(dot)com>, Peter Geoghegan <pg(at)heroku(dot)com>, Jozef Mlich <jmlich(at)redhat(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 9a57858f1103b89a5674f0d50c5fe1f756411df6
Date: 2014-03-14 21:19:00
Message-ID: 20140314211900.GH6899@eldon.alvh.no-ip.org

Josh Berkus wrote:
> Alvaro, All:
>
> Can someone help me with what we should tell users about this issue?
>
> 1. What users are especially likely to encounter it? All replication
> users, or do they have to do something else?

Replication users are more likely to get it on replicas, of course,
because that's running the recovery code continuously; however, anyone
that suffers a crash of a standalone system might also be affected.
(And it'd be worse, even, because that corrupts your main source of
data, not just a replicated copy of it.) Obviously, if you have a
corrupted replica and fail over to it, you're similarly screwed.

Basically you might be affected if you have tables that are referenced
by foreign keys and to which you also apply UPDATEs that are
HOT-enabled.

> 2. What error messages will affected users get? A link to the reports
> of this issue on pgsql lists would tell me this, but I'm not sure
> exactly which error reports are associated.

Not sure about error messages. Essentially some rows would be visible
to seqscans but not to index scans.
These are the threads:
http://www.postgresql.org/message-id/CAM3SWZTMQiCi5PV5OWHb+bYkUcnCk=O67w0cSswPvV7XfUcU5g@mail.gmail.com
http://www.postgresql.org/message-id/CAM-w4HPTOeMT4KP0OJK+mGgzgcTOtLRTvFZyvD0O4aH-7dxo3Q@mail.gmail.com
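
Since the symptom is a row that a sequential scan can see but an index scan
cannot, one possible way to probe a suspect key is to fetch it twice under
different planner settings and compare the results. The Python sketch below
only builds the SQL text; it is not a procedure prescribed in this thread.
The helper name scan_comparison_sql and the table and column names are
hypothetical. The enable_* settings are standard PostgreSQL planner options,
but note that they only discourage a plan type rather than forbid it, so the
plan actually used should be confirmed with EXPLAIN.

```python
from typing import Dict, List


def scan_comparison_sql(table: str, column: str, key: int) -> Dict[str, List[str]]:
    """Build two scripts that fetch the same key via different scan types."""
    # Hypothetical helper: identifiers are only sanity-checked, not quoted.
    if not (table.isidentifier() and column.isidentifier()):
        raise ValueError("table and column must be plain identifiers")
    query = f"SELECT ctid, * FROM {table} WHERE {column} = {int(key)};"
    return {
        # Steer the planner toward reading the heap directly.
        "seqscan": [
            "BEGIN;",
            "SET LOCAL enable_indexscan = off;",
            "SET LOCAL enable_indexonlyscan = off;",
            "SET LOCAL enable_bitmapscan = off;",
            query,
            "ROLLBACK;",
        ],
        # Steer the planner toward going through an index on the column.
        "indexscan": [
            "BEGIN;",
            "SET LOCAL enable_seqscan = off;",
            query,
            "ROLLBACK;",
        ],
    }


if __name__ == "__main__":
    # "parents" and "id" are placeholder names for illustration only.
    for name, script in scan_comparison_sql("parents", "id", 42).items():
        print(f"-- {name}")
        print("\n".join(script))
```

If the two scripts return different rows for the same key, that table is a
candidate for the kind of corruption discussed here; identical results for
one key do not prove the table is clean.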

> 3. If users have already encountered corruption due to the fixed issue,
> what do they need to do after updating? re-basebackup?

Replicas can be fixed by recloning, yeah. I haven't stopped to think
how to fix the masters. Greg, Peter, any clues there?

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services