BUG #6200: standby bad memory allocations on SELECT

Lists: pgsql-bugs
From: "Daniel Farina" <daniel(at)heroku(dot)com>
To: pgsql-bugs(at)postgresql(dot)org
Subject: BUG #6200: standby bad memory allocations on SELECT
Date: 2011-09-08 22:33:37
Message-ID: 201109082233.p88MXbGE026996@wwwmaster.postgresql.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs


The following bug has been logged online:

Bug reference: 6200
Logged by: Daniel Farina
Email address: daniel(at)heroku(dot)com
PostgreSQL version: 9.0.4
Operating system: Ubuntu 10.04
Description: standby bad memory allocations on SELECT
Details:

A huge thanks to Conrad Irwin of Rapportive for furnishing virtually all the
details of this bug report.

The following error occurs up to a couple of times a day on a busy
PostgreSQL database:

ERROR: invalid memory alloc request size 18446744073709551613

The occurrence rate is somewhere in the one per tens-of-millions of
queries.

The number is always the same (2**64 - 3), and there's no obvious
pattern in the distribution of errors (they don't even appear to be
correlated with system load). The error has not been recorded on the
primary database, even though the same workload is submitted.

These errors do not reproduce, seeming to evaporate almost immediately on
the standby, so durable/long lasting index corruption is not likely. This
problem has persisted among multiple generations of hot standbys on
different hardware and sourced from different base backups.

At least once, a hot standby was promoted to a primary and the errors seem
to discontinue, but then reappear on a newly-provisioned standby.

The VERSION() string is:
PostgreSQL 9.0.4 on x86_64-pc-linux-gnu, compiled by GCC
gcc-4.4.real (Ubuntu 4.4.3-4ubuntu5) 4.4.3, 64-bit

The problem is confined to a particular access patterns and schema objects,
enumerated below:

The points2 table looks like:

Table "public.points2"
Column | Type |
Modifiers
------------------+-----------------------------+---------------------------
---------------------------
id | integer | not null default nextval('points2_id_seq'::regclass)
identifier | text | not null
scope_id | integer | not null
class_number | smallint | not null
authorization_id | integer | not null
sum_json | text | not null
amended_at | timestamp without time zone | not null
Indexes:
"points2_pkey" PRIMARY KEY, btree (id)
"points2_special" btree (identifier_hash(identifier), scope_id,
class_number, authorization_id)

CREATE FUNCTION identifier_hash(text) RETURNS bigint IMMUTABLE
LANGUAGE SQL AS $$
SELECT ('x' || md5($1))::bit(64)::bigint;
$$;

This has only been seen on queries of the form:

SELECT * FROM "points2" WHERE
(identifier_hash(identifier) = identifier_hash('1104131405')
AND identifier = '1104131405'
AND scope_id = 0
AND authorization_id = 0
AND class_number = 25)

Though this table is accessed similarly frequently by queries of the form:

SELECT points2.* FROM points2
JOIN (VALUES (8585261297509044776, 0, 47,
'ae9064e6f24127c6a1f483cd71e14e64'))
AS query(hashed_identifier, scope_id, class_number, identifier)
ON identifier_hash(points2.identifier) = query.hashed_identifier
AND points2.scope_id = query.scope_id
AND points2.class_number = query.class_number
AND points2.identifier = query.identifier;

these do not trigger the problem.

The table is always updated to or inserted into one row at a time
(using the "id" primary key for updates), though we sometimes update
multiple rows in a single transaction, synchronous_commit is turned off for
connections that touch the points2 table on the primary.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Daniel Farina" <daniel(at)heroku(dot)com>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #6200: standby bad memory allocations on SELECT
Date: 2011-09-09 15:02:44
Message-ID: 11731.1315580564@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

"Daniel Farina" <daniel(at)heroku(dot)com> writes:
> A huge thanks to Conrad Irwin of Rapportive for furnishing virtually all the
> details of this bug report.

This isn't really enough information to reproduce the problem ...

> The occurrence rate is somewhere in the one per tens-of-millions of
> queries.

... and that statement is going to discourage anyone from even trying,
since with such a low occurrence rate it's going to be impossible to be
sure whether the setup to reproduce the problem is correct. So if you'd
like this to be fixed, you're either going to need to show us exactly
how to reproduce it, or investigate it yourself.

The way that I'd personally proceed to investigate it would probably be
to change the "invalid memory alloc request size" size errors (in
src/backend/utils/mmgr/mcxt.c; there are about four occurrences) from
ERROR to PANIC so that they'll provoke a core dump, and then use gdb
to get a stack trace, which would provide at least a little more
information about what happened. However, if you are only able to
reproduce it in a production server, you might not like that approach.
Perhaps you can set up an extra standby that's only there for testing,
so you don't mind if it crashes?

regards, tom lane


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Daniel Farina <daniel(at)heroku(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #6200: standby bad memory allocations on SELECT
Date: 2011-09-09 15:13:19
Message-ID: 4E6A2D0F.1050601@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

On 09.09.2011 18:02, Tom Lane wrote:
> The way that I'd personally proceed to investigate it would probably be
> to change the "invalid memory alloc request size" size errors (in
> src/backend/utils/mmgr/mcxt.c; there are about four occurrences) from
> ERROR to PANIC so that they'll provoke a core dump, and then use gdb
> to get a stack trace, which would provide at least a little more
> information about what happened. However, if you are only able to
> reproduce it in a production server, you might not like that approach.
> Perhaps you can set up an extra standby that's only there for testing,
> so you don't mind if it crashes?

If that's not possible or doesn't reproduce the issue, there's also
functions in glibc to produce a backtrace without aborting the program:
https://www.gnu.org/s/libc/manual/html_node/Backtraces.html.

I think you could also fork() + abort() to generate a core dump, not
just a backtrace.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Daniel Farina <daniel(at)heroku(dot)com>
Cc: pgsql-bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #6200: standby bad memory allocations on SELECT
Date: 2011-09-09 15:14:38
Message-ID: CA+U5nML0o-LSKuTY=Rxs0uzdDM=ProMppDE584zc8szS=9Px+A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

On Thu, Sep 8, 2011 at 11:33 PM, Daniel Farina <daniel(at)heroku(dot)com> wrote:

>  ERROR: invalid memory alloc request size 18446744073709551613

> At least once, a hot standby was promoted to a primary and the errors seem
> to discontinue, but then reappear on a newly-provisioned standby.

So the query that fails is a btree index on a hot standby. I don't
fully accept it as an HS bug, but lets assume that it is and analyse
what could cause it.

The MO is certain user queries, only observed in HS. So certain
queries might be related to the way we use indexes or not.

There is a single and small difference between how a btree index
operates in HS and "normal" operation, which relates to whether we
kill tuples in the index. That's simple code and there's no obvious
bugs there, nor anything that specifically allocates memory even. So
the only bug that springs to mind is something related to how we
navigate hot chains with/without killed tuples. i.e. the bug is not
actually HS related, but is only observed under conditions typical in
HS.

HS touches almost nothing else in user space, apart from snapshots. So
there could be a bug there also, maybe in CopySnapshot().

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services