8.3.0: vacuum full analyze: "invalid memory alloc request size"

Lists: pgsql-bugs
From: Tomas Szepe <szepe(at)pinerecords(dot)com>
To: pgsql-bugs(at)postgresql(dot)org
Subject: 8.3.0: vacuum full analyze: "invalid memory alloc request size"
Date: 2008-02-09 11:52:21
Message-ID: 20080209115221.GA27757@louise.pinerecords.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

Hello,

Trying to run "vacuum full analyze" on a ~200MB UTF8 database fails with:

fs=> vacuum full analyze;
WARNING: skipping "pg_authid" --- only table or database owner can vacuum it
WARNING: skipping "pg_database" --- only table or database owner can vacuum it
WARNING: skipping "pg_shdepend" --- only table or database owner can vacuum it
WARNING: skipping "pg_shdescription" --- only table or database owner can vacuum it
WARNING: skipping "pg_auth_members" --- only table or database owner can vacuum it
ERROR: invalid memory alloc request size 4294965504

But not always, sometimes it goes through w/o a problem:

fs=> vacuum full analyze;
WARNING: skipping "pg_authid" --- only table or database owner can vacuum it
WARNING: skipping "pg_database" --- only table or database owner can vacuum it
WARNING: skipping "pg_shdepend" --- only table or database owner can vacuum it
WARNING: skipping "pg_shdescription" --- only table or database owner can vacuum it
WARNING: skipping "pg_auth_members" --- only table or database owner can vacuum it
WARNING: skipping "pg_tablespace" --- only table or database owner can vacuum it
WARNING: skipping "pg_pltemplate" --- only table or database owner can vacuum it
VACUUM

This is on x86 Linux. 8.2.6 does not exhibit the problem, or at least I haven't
run into it. What info do I need to submit so that this can be tracked down?

Best regards,
--
Tomas Szepe <szepe(at)pinerecords(dot)com>


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Tomas Szepe <szepe(at)pinerecords(dot)com>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: 8.3.0: vacuum full analyze: "invalid memory alloc request size"
Date: 2008-02-09 17:19:27
Message-ID: 24857.1202577567@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

Tomas Szepe <szepe(at)pinerecords(dot)com> writes:
> Trying to run "vacuum full analyze" on a ~200MB UTF8 database fails with:

> fs=> vacuum full analyze;
> WARNING: skipping "pg_authid" --- only table or database owner can vacuum it
> WARNING: skipping "pg_database" --- only table or database owner can vacuum it
> WARNING: skipping "pg_shdepend" --- only table or database owner can vacuum it
> WARNING: skipping "pg_shdescription" --- only table or database owner can vacuum it
> WARNING: skipping "pg_auth_members" --- only table or database owner can vacuum it
> ERROR: invalid memory alloc request size 4294965504

Hmm, please see if you can get a stack trace from that (set the
breakpoint at errfinish()). You might want to use vacuum verbose
first so that you can figure out which individual table is causing it.

regards, tom lane


From: Tomas Szepe <szepe(at)pinerecords(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: 8.3.0: vacuum full analyze: "invalid memory alloc request size"
Date: 2008-02-10 01:44:17
Message-ID: 20080210014417.GB27757@louise.pinerecords.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

> > Trying to run "vacuum full analyze" on a ~200MB UTF8 database fails with:
> >
> > fs=> vacuum full analyze;
> > WARNING: skipping "pg_authid" --- only table or database owner can vacuum it
> > WARNING: skipping "pg_database" --- only table or database owner can vacuum it
> > WARNING: skipping "pg_shdepend" --- only table or database owner can vacuum it
> > WARNING: skipping "pg_shdescription" --- only table or database owner can vacuum it
> > WARNING: skipping "pg_auth_members" --- only table or database owner can vacuum it
> > ERROR: invalid memory alloc request size 4294965504
>
> Hmm, please see if you can get a stack trace from that (set the
> breakpoint at errfinish()). You might want to use vacuum verbose
> first so that you can figure out which individual table is causing it.

Ok, I recompiled CVS head with --enable-debug and with --enable-cassert
and hit the following assert on "vacuum full verbose analyze":

[snip]
INFO: vacuuming "pg_catalog.pg_class"
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

Assert (from the log):
TRAP: FailedAssertion("!((BlockNumber) fraged_pages->num_pages >= empty_end_pages)", File: "vacuum.c", Line: 1737)

Backtrace:
#0 0xb7b54797 in raise () from /lib/libc.so.6
#1 0xb7b560b8 in abort () from /lib/libc.so.6
#2 0x08292713 in ExceptionalCondition (
conditionName=0x8335d58 "!((BlockNumber) fraged_pages->num_pages >= empty_end_pages)", errorType=0x82bdca2 "FailedAssertion", fileName=0x82bfbfa "vacuum.c",
lineNumber=1737) at assert.c:57
#3 0x0815b9d4 in full_vacuum_rel (onerel=0x841ad4c, vacstmt=0x844a4d0)
at vacuum.c:1737
#4 0x0815ec35 in vacuum_rel (relid=<value optimized out>, vacstmt=0x844a4d0,
expected_relkind=114 'r') at vacuum.c:1117
#5 0x0815ee19 in vacuum (vacstmt=0x6, relids=0x0, bstrategy=0x0,
isTopLevel=<value optimized out>) at vacuum.c:423
#6 0x081fcec5 in PortalRunUtility (portal=0x8474acc, utilityStmt=0x844a4d0,
isTopLevel=0 '\0', dest=0x844a52c, completionTag=0xbfd41dba "") at pquery.c:1173
#7 0x081fdc97 in PortalRunMulti (portal=0x8474acc,
isTopLevel=<value optimized out>, dest=0x844a52c, altdest=0x844a52c,
completionTag=0xbfd41dba "") at pquery.c:1268
#8 0x081fe3f3 in PortalRun (portal=0x8474acc, count=2147483647,
isTopLevel=6 '\006', dest=0x844a52c, altdest=0x844a52c,
completionTag=0xbfd41dba "") at pquery.c:813
#9 0x081f9bce in exec_simple_query (
query_string=0x8449c8c "vacuum full verbose analyze;") at postgres.c:963
#10 0x081fa951 in PostgresMain (argc=4, argv=0x83d63e4, username=0x83d63bc "kala")
at postgres.c:3530
#11 0x081cc83a in ServerLoop () at postmaster.c:3207
#12 0x081cd32d in PostmasterMain (argc=4, argv=0x83d31c0) at postmaster.c:1029
#13 0x0818717e in main (argc=4, argv=0x83d31c0) at main.c:188

Recompiled CVS head again, this time with --enable-debug but without
--enable-cassert. On "vacuum full verbose analyze," I hit the following:

[snip]
INFO: vacuuming "pg_catalog.pg_attribute"
INFO: "pg_attribute": found 23891 removable, 3520 nonremovable row versions in 1301 pages
DETAIL: 0 dead row versions cannot be removed yet.
Nonremovable row versions range from 128 to 128 bytes long.
There were 34 unused item pointers.
Total free space (including removable row versions) is 9679844 bytes.
1243 pages are or will become empty, including 1243 at the end of the table.
4294967291 pages containing 0 free bytes are potential move destinations.
CPU 0.03s/0.01u sec elapsed 1.72 sec.
INFO: index "pg_attribute_relid_attnam_index" now contains 3520 row versions in 1242 pages
DETAIL: 108686 index row versions were removed.
1182 index pages have been deleted, 1182 are currently reusable.
CPU 0.02s/0.05u sec elapsed 0.31 sec.
INFO: index "pg_attribute_relid_attnum_index" now contains 3520 row versions in 309 pages
DETAIL: 108686 index row versions were removed.
296 index pages have been deleted, 296 are currently reusable.
CPU 0.02s/0.04u sec elapsed 0.12 sec.
INFO: "pg_attribute": truncated 1301 to 58 pages
ERROR: invalid memory alloc request size 4294967256

Backtrace:
#0 errfinish (dummy=0) at elog.c:307
#1 0x082629a4 in elog_finish (elevel=20,
fmt=0x832dd0c "invalid memory alloc request size %lu") at elog.c:951
#2 0x082778fc in MemoryContextAlloc (context=0x8388e58, size=4294967256)
at mcxt.c:510
#3 0x08141801 in full_vacuum_rel (onerel=0x83bcaf0, vacstmt=0x83f84c0)
at vacuum.c:3463
#4 0x08142229 in vacuum_rel (relid=<value optimized out>, vacstmt=0x83f84c0,
expected_relkind=114 'r') at vacuum.c:1117
#5 0x081423ed in vacuum (vacstmt=0x7, relids=0x0, bstrategy=0x0,
isTopLevel=<value optimized out>) at vacuum.c:423
#6 0x081cf70d in PortalRunUtility (portal=0x8422ac8, utilityStmt=0x83f84c0,
isTopLevel=88 'X', dest=0x83f8510, completionTag=0xbfac7b2a "") at pquery.c:1173
#7 0x081d0321 in PortalRunMulti (portal=0x8422ac8,
isTopLevel=<value optimized out>, dest=0x83f8510, altdest=0x83f8510,
completionTag=0xbfac7b2a "") at pquery.c:1268
#8 0x081d0a47 in PortalRun (portal=0x8422ac8, count=2147483647, isTopLevel=7 '\a',
dest=0x83f8510, altdest=0x83f8510, completionTag=0xbfac7b2a "") at pquery.c:813
#9 0x081cc4ba in exec_simple_query (
query_string=0x83f7c88 "vacuum full verbose analyze;") at postgres.c:963
#10 0x081cd23d in PostgresMain (argc=4, argv=0x83843b0, username=0x8384390 "kala")
at postgres.c:3530
#11 0x081a6785 in ServerLoop () at postmaster.c:3207
#12 0x081a722d in PostmasterMain (argc=4, argv=0x83811c0) at postmaster.c:1029
#13 0x08166b02 in main (argc=4, argv=0x83811c0) at main.c:188

Hope this helps. Please let me know if you need additional info.

Best regards,
--
Tomas Szepe <szepe(at)pinerecords(dot)com>


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Tomas Szepe <szepe(at)pinerecords(dot)com>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: 8.3.0: vacuum full analyze: "invalid memory alloc request size"
Date: 2008-02-10 02:37:13
Message-ID: 6630.1202611033@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

Tomas Szepe <szepe(at)pinerecords(dot)com> writes:
>> Hmm, please see if you can get a stack trace from that (set the
>> breakpoint at errfinish()). You might want to use vacuum verbose
>> first so that you can figure out which individual table is causing it.

> Ok, I recompiled CVS head with --enable-debug and with --enable-cassert
> and hit the following assert on "vacuum full verbose analyze":
> [etc]

It seems fairly clear that both of these symptoms mean that
empty_end_pages got to be larger than fraged_pages->num_pages.
In the first case the Assert catches that directly, but with
asserts disabled the code just allows num_pages to go negative
and then the space calculation in vac_update_fsm goes nuts.

So the question is, how could that happen? There are only three places
where empty_end_pages is incremented, and the first two definitely add
the page to fraged_pages as well. What I'm thinking is you must have
had a few pages where notup was true but do_frag didn't get set, and
it's not quite clear how that could be. It seems most likely that the
page contained only LP_DEAD tuples but didn't have free space large
enough to get it put into the fraged_pages list. But the only place
that would mark tuples LP_DEAD is pruneheap.c, and it should have
done a PageRepairFragmentation() after doing so.

Do you perhaps have a ridiculously low fillfactor attached to
the system catalogs?

The fix should probably be to force pages to be put in fraged_pages
if notup is true, but first I want to understand exactly how it got
into this state --- there may be something else going on here.

regards, tom lane


From: Tomas Szepe <szepe(at)pinerecords(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: 8.3.0: vacuum full analyze: "invalid memory alloc request size"
Date: 2008-02-10 09:53:17
Message-ID: 20080210095317.GC27757@louise.pinerecords.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

> >> Hmm, please see if you can get a stack trace from that (set the
> >> breakpoint at errfinish()). You might want to use vacuum verbose
> >> first so that you can figure out which individual table is causing it.
>
> > Ok, I recompiled CVS head with --enable-debug and with --enable-cassert
> > and hit the following assert on "vacuum full verbose analyze":
> > [etc]
>
> It seems fairly clear that both of these symptoms mean that
> empty_end_pages got to be larger than fraged_pages->num_pages.
> In the first case the Assert catches that directly, but with
> asserts disabled the code just allows num_pages to go negative
> and then the space calculation in vac_update_fsm goes nuts.
>
> So the question is, how could that happen? There are only three places
> where empty_end_pages is incremented, and the first two definitely add
> the page to fraged_pages as well. What I'm thinking is you must have
> had a few pages where notup was true but do_frag didn't get set, and
> it's not quite clear how that could be. It seems most likely that the
> page contained only LP_DEAD tuples but didn't have free space large
> enough to get it put into the fraged_pages list. But the only place
> that would mark tuples LP_DEAD is pruneheap.c, and it should have
> done a PageRepairFragmentation() after doing so.
>
> Do you perhaps have a ridiculously low fillfactor attached to
> the system catalogs?

I don't know - how do I tell? My config is quite ordinary as far
as I can see (postgresql.conf.gz attached), the db was only created
4 days ago and populated from an 8.2.6 dump, there were no power
failures, freezes or even postmaster shutdowns, in fact no signs
of any problem whatsoever. There are about 300.000 queries a day.

> The fix should probably be to force pages to be put in fraged_pages
> if notup is true, but first I want to understand exactly how it got
> into this state --- there may be something else going on here.

Well, I've never hacked on postgresql, so probably all I can do
is tar/rzip the whole data dir and put it up for download somewhere
(there is no sensitive data).

Thanks for your help,
--
Tomas Szepe <szepe(at)pinerecords(dot)com>

Attachment Content-Type Size
postgresql.conf.gz application/x-gunzip 5.3 KB

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Tomas Szepe <szepe(at)pinerecords(dot)com>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: 8.3.0: vacuum full analyze: "invalid memory alloc request size"
Date: 2008-02-10 18:34:48
Message-ID: 17704.1202668488@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

Tomas Szepe <szepe(at)pinerecords(dot)com> writes:
>> Do you perhaps have a ridiculously low fillfactor attached to
>> the system catalogs?

> I don't know - how do I tell?

If you did, you'd know it --- that's not something that would happen
without your trying to do it.

After poking around in the code a bit, I found a way to reproduce the
assert failure:

regression=# create table foo as select null::int as x from generate_series(1,2000) x;
SELECT
regression=# delete from foo;
DELETE 2000
regression=# select count(*) from foo;
count
-------
0
(1 row)

regression=# vacuum full foo;
server closed the connection unexpectedly

What is happening here is that after the SELECT, we have pages
containing MaxHeapTuplesPerPage DEAD line pointers (and nothing else).
PageGetHeapFreeSpace is designed to return zero in this situation,
so PageGetFreeSpaceWithFillFactor does too, which means we end up with
do_frag = false and notup = true, which is not a combination that the
empty_end_pages logic is prepared for.

While it would clearly be a good idea to defend against that
combination, I think the real problem here is that we are blindly
applying PageGetHeapFreeSpace in a context where it's inappropriate.
Those DEAD line pointers are going to get recycled and so there is
no need to prevent VACUUM FULL from trying to put more tuples on
the page. OTOH, if we don't do it this way we'd probably need to
put explicit checks for too-many-line-pointers somewhere else in
VACUUM FULL.

Another issue is that while I can create a page with
MaxHeapTuplesPerPage dead tuples easily enough in testing, it's not
clear how you managed to get into that state in the system catalogs.
Neither pg_class nor pg_attribute can have minimal-length tuples.
Are you doing anything that would involve lots of updates in these
catalogs --- maybe repeatedly renaming a column, or something like that?

regards, tom lane


From: Tomas Szepe <szepe(at)pinerecords(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: 8.3.0: vacuum full analyze: "invalid memory alloc request size"
Date: 2008-02-10 18:56:45
Message-ID: 20080210185645.GA1560@louise.pinerecords.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

> After poking around in the code a bit, I found a way to reproduce the
> assert failure:
[snip]

Great!

> Another issue is that while I can create a page with
> MaxHeapTuplesPerPage dead tuples easily enough in testing, it's not
> clear how you managed to get into that state in the system catalogs.
> Neither pg_class nor pg_attribute can have minimal-length tuples.
> Are you doing anything that would involve lots of updates in these
> catalogs --- maybe repeatedly renaming a column, or something like that?

Hmm, I typically use a pair of
"ALTER TABLE table DISABLE TRIGGER USER;"/
"ALTER TABLE table ENABLE TRIGGER USER;"
per almost every relation when loading a dump, other than that there's
only the initial db creation code (lots of plpgsql triggers and a couple
immutable functions) and then using temp tables for complicated queries.

Thanks again for looking into this,
--
Tomas Szepe <szepe(at)pinerecords(dot)com>


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Tomas Szepe <szepe(at)pinerecords(dot)com>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: 8.3.0: vacuum full analyze: "invalid memory alloc request size"
Date: 2008-02-10 19:14:31
Message-ID: 7436.1202670871@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

Tomas Szepe <szepe(at)pinerecords(dot)com> writes:
>> Are you doing anything that would involve lots of updates in these
>> catalogs --- maybe repeatedly renaming a column, or something like that?

> Hmm, I typically use a pair of
> "ALTER TABLE table DISABLE TRIGGER USER;"/
> "ALTER TABLE table ENABLE TRIGGER USER;"
> per almost every relation when loading a dump, other than that there's
> only the initial db creation code (lots of plpgsql triggers and a couple
> immutable functions) and then using temp tables for complicated queries.

Hm, those ALTERs wouldn't affect pg_attribute. Maybe just using a lot
of temp tables is enough?

Are you running with autovacuum on, or not?

regards, tom lane


From: Tomas Szepe <szepe(at)pinerecords(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: 8.3.0: vacuum full analyze: "invalid memory alloc request size"
Date: 2008-02-10 19:31:48
Message-ID: 20080210193148.GB1560@louise.pinerecords.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

> Tomas Szepe <szepe(at)pinerecords(dot)com> writes:
> >> Are you doing anything that would involve lots of updates in these
> >> catalogs --- maybe repeatedly renaming a column, or something like that?
>
> > Hmm, I typically use a pair of
> > "ALTER TABLE table DISABLE TRIGGER USER;"/
> > "ALTER TABLE table ENABLE TRIGGER USER;"
> > per almost every relation when loading a dump, other than that there's
> > only the initial db creation code (lots of plpgsql triggers and a couple
> > immutable functions) and then using temp tables for complicated queries.
>
> Hm, those ALTERs wouldn't affect pg_attribute. Maybe just using a lot
> of temp tables is enough?
>
> Are you running with autovacuum on, or not?

At the moment autovacuum is off, but it _might_ have been on in the first
40 hours or so... Sorry, I can't say exactly.

--
Tomas Szepe <szepe(at)pinerecords(dot)com>


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Tomas Szepe <szepe(at)pinerecords(dot)com>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: 8.3.0: vacuum full analyze: "invalid memory alloc request size"
Date: 2008-02-10 20:18:35
Message-ID: 8193.1202674715@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

Tomas Szepe <szepe(at)pinerecords(dot)com> writes:
>> Are you running with autovacuum on, or not?

> At the moment autovacuum is off, but it _might_ have been on in the first
> 40 hours or so... Sorry, I can't say exactly.

Okay ... so with autovac off, I can reproduce a problem this way:

1. Put 1000 repetitions of this into a script file:

create temp table tt(x int, y int);
drop table tt;

2. Execute the script file, with no other sessions active.

3. vacuum full verbose pg_attribute;

You need a script file, not say a loop in a plpgsql function, because
the commands have to be in separate transactions. On a machine with
MAXALIGN=4 you might need a different number of columns in the temp
table --- I only tried it with MAXALIGN=8 hardware.

I believe what's happening here is that as we create and delete temp
tables, we prune the current insertion target page in pg_attribute
repeatedly, turning removed rows into LP_DEAD line pointers; but since
no actual vacuum gets done, the LP_DEAD line pointers can't go away and
gradually accumulate. Eventually there are MaxHeapTuplesPerPage line
pointers on the page, so no more can be inserted, and the insertion
activity shifts to a new page of pg_attribute. Then when we eventually
do VACUUM FULL, PageGetHeapFreeSpace returns zero "because it's full"
(this would happen whether or not notup was true). If notup does happen
to be true, we hit the incorrectly handled empty_end_pages case; and
in any case we are leaving money on the table because the page will
falsely be thought to not have any usable space. (So just fixing the
notup case isn't really enough here...)

With autovac on, this would be a lot harder to hit because autovac would
most likely clean the page before you'd gotten around to doing VACUUM
FULL. That might explain how the problem escaped notice during testing
--- AFAICS this behavior isn't specific to the system catalogs but would
occur in any table that received repetitive insertions/deletions. It's
a bit unnerving that it wasn't seen, though ... makes one wonder about
how thoroughly the HOT code has really been exercised in combination
with VACUUM FULL.

This needs a little thought about the cleanest way to fix. I'll put
up a proposed patch later.

regards, tom lane