MemoryContextAllocHuge(): selectively bypassing MaxAllocSize

From: Noah Misch <noah(at)leadboat(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: MemoryContextAllocHuge(): selectively bypassing MaxAllocSize
Date: 2013-05-13 14:26:53
Message-ID: 20130513142653.GB171500@tornado.leadboat.com
Lists: pgsql-hackers

A memory chunk allocated through the existing palloc.h interfaces is limited
to MaxAllocSize (~1 GiB). This is best for most callers; SET_VARSIZE() need
not check its own 1 GiB limit, and algorithms that grow a buffer by doubling
need not check for overflow. However, a handful of callers are quite happy to
navigate those hazards in exchange for the ability to allocate a larger chunk.
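
For context, the kind of caller that relies on this looks roughly like the
following (a hypothetical sketch, not code from the tree):

    Size        used = 0;
    Size        allocated = 64;
    char       *buf = palloc(allocated);

    /*
     * Grow-by-doubling under the 1 GiB ceiling.  The old size is at most
     * MaxAllocSize, so doubling it cannot overflow a Size, and repalloc()
     * itself rejects any request larger than MaxAllocSize.
     */
    if (used == allocated)
    {
        Size        newsize = allocated * 2;

        buf = repalloc(buf, newsize);
        allocated = newsize;
    }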

This patch introduces MemoryContextAllocHuge() and repalloc_huge() that check
a higher MaxAllocHugeSize limit of SIZE_MAX/2. Chunks don't bother recording
whether they were allocated as huge; one can start with palloc() and then
repalloc_huge() to grow the value. To demonstrate, I put this to use in
tuplesort.c; the patch also updates tuplestore.c to keep them similar. Here's
the trace_sort from building the pgbench_accounts primary key at scale factor
7500, maintenance_work_mem = '56GB'; memtuples itself consumed 17.2 GiB:

LOG: internal sort ended, 48603324 KB used: CPU 75.65s/305.46u sec elapsed 391.21 sec

Compare:

LOG: external sort ended, 1832846 disk blocks used: CPU 77.45s/988.11u sec elapsed 1146.05 sec

This was made easier by tuplesort growth algorithm improvements in commit
8ae35e91807508872cabd3b0e8db35fc78e194ac. The problem has come up before
(TODO item "Allow sorts to use more available memory"), and Tom floated the
idea[1] behind the approach I've used. The next limit faced by sorts is
INT_MAX concurrent tuples in memory, which limits helpful work_mem to about
150 GiB when sorting int4.

I have not added variants like palloc_huge() and palloc0_huge(), and I have
not added to the frontend palloc.h interface. There's no particular barrier
to doing any of that. I don't expect more than a dozen or so callers, so most
of the variations might go unused.
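
For reference, a caller of the new interface looks roughly like this (a
hypothetical sketch; the names and sizes are made up for illustration):

    /* start out with the ordinary allocator ... */
    Size        ntuples = 1024;
    SortTuple  *memtuples = palloc(ntuples * sizeof(SortTuple));

    /* ... and later grow well past MaxAllocSize with the huge variant;
     * the resulting chunk needs no special treatment afterwards */
    ntuples = 400000000;        /* roughly 9 GiB of SortTuples on 64-bit */
    memtuples = repalloc_huge(memtuples, ntuples * sizeof(SortTuple));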

The comment at MaxAllocSize said that aset.c expects doubling the size of an
arbitrary allocation to never overflow, but I couldn't find the code in
question. AllocSetAlloc() does double sizes of blocks used to aggregate small
allocations, so maxBlockSize had better stay under SIZE_MAX/2. Nonetheless,
that expectation does apply to dozens of repalloc() users outside aset.c, and
I preserved it for repalloc_huge(). 64-bit builds will never notice, and I
won't cry for the resulting 2 GiB limit on 32-bit.
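
Restated as arithmetic (just to spell out the invariant; the variable is
hypothetical): a huge chunk is at most MaxAllocHugeSize = SIZE_MAX/2 bytes,
so computing twice its size stays within size_t:

    Size        sz = huge_chunk_size;   /* <= SIZE_MAX / 2 by construction */
    Size        doubled = sz * 2;       /* <= SIZE_MAX - 1, cannot wrap */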

Thanks,
nm

[1] http://www.postgresql.org/message-id/19908.1297696263@sss.pgh.pa.us

--
Noah Misch
EnterpriseDB http://www.enterprisedb.com

Attachment: alloc-huge-v1.patch (text/plain, 28.0 KB)

From: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: MemoryContextAllocHuge(): selectively bypassing MaxAllocSize
Date: 2013-05-13 15:31:03
Message-ID: CAFj8pRAk5GgsSer+ZNkRz9PU1xiDD2_nnOMPuESJAUpJyjkp-Q@mail.gmail.com
Lists: pgsql-hackers

+1

Pavel
On 13.5.2013 16:29, "Noah Misch" <noah(at)leadboat(dot)com> wrote:

> A memory chunk allocated through the existing palloc.h interfaces is
> limited
> to MaxAllocSize (~1 GiB). This is best for most callers; SET_VARSIZE()
> need
> not check its own 1 GiB limit, and algorithms that grow a buffer by
> doubling
> need not check for overflow. However, a handful of callers are quite
> happy to
> navigate those hazards in exchange for the ability to allocate a larger
> chunk.
>
> This patch introduces MemoryContextAllocHuge() and repalloc_huge() that
> check
> a higher MaxAllocHugeSize limit of SIZE_MAX/2. Chunks don't bother
> recording
> whether they were allocated as huge; one can start with palloc() and then
> repalloc_huge() to grow the value. To demonstrate, I put this to use in
> tuplesort.c; the patch also updates tuplestore.c to keep them similar.
> Here's
> the trace_sort from building the pgbench_accounts primary key at scale
> factor
> 7500, maintenance_work_mem = '56GB'; memtuples itself consumed 17.2 GiB:
>
> LOG: internal sort ended, 48603324 KB used: CPU 75.65s/305.46u sec
> elapsed 391.21 sec
>
> Compare:
>
> LOG: external sort ended, 1832846 disk blocks used: CPU 77.45s/988.11u
> sec elapsed 1146.05 sec
>
> This was made easier by tuplesort growth algorithm improvements in commit
> 8ae35e91807508872cabd3b0e8db35fc78e194ac. The problem has come up before
> (TODO item "Allow sorts to use more available memory"), and Tom floated the
> idea[1] behind the approach I've used. The next limit faced by sorts is
> INT_MAX concurrent tuples in memory, which limits helpful work_mem to about
> 150 GiB when sorting int4.
>
> I have not added variants like palloc_huge() and palloc0_huge(), and I have
> not added to the frontend palloc.h interface. There's no particular
> barrier
> to doing any of that. I don't expect more than a dozen or so callers, so
> most
> of the variations might go unused.
>
> The comment at MaxAllocSize said that aset.c expects doubling the size of
> an
> arbitrary allocation to never overflow, but I couldn't find the code in
> question. AllocSetAlloc() does double sizes of blocks used to aggregate
> small
> allocations, so maxBlockSize had better stay under SIZE_MAX/2.
> Nonetheless,
> that expectation does apply to dozens of repalloc() users outside aset.c,
> and
> I preserved it for repalloc_huge(). 64-bit builds will never notice, and I
> won't cry for the resulting 2 GiB limit on 32-bit.
>
> Thanks,
> nm
>
> [1] http://www.postgresql.org/message-id/19908.1297696263@sss.pgh.pa.us
>
> --
> Noah Misch
> EnterpriseDB http://www.enterprisedb.com
>


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: MemoryContextAllocHuge(): selectively bypassing MaxAllocSize
Date: 2013-06-22 07:46:49
Message-ID: 20130622074649.GE7093@tamriel.snowman.net
Lists: pgsql-hackers

Noah,

* Noah Misch (noah(at)leadboat(dot)com) wrote:
> This patch introduces MemoryContextAllocHuge() and repalloc_huge() that check
> a higher MaxAllocHugeSize limit of SIZE_MAX/2.

Nice! I've complained about this limit a few different times and just
never got around to addressing it.

> This was made easier by tuplesort growth algorithm improvements in commit
> 8ae35e91807508872cabd3b0e8db35fc78e194ac. The problem has come up before
> (TODO item "Allow sorts to use more available memory"), and Tom floated the
> idea[1] behind the approach I've used. The next limit faced by sorts is
> INT_MAX concurrent tuples in memory, which limits helpful work_mem to about
> 150 GiB when sorting int4.

That's frustratingly small. :(

[...]
> --- 1024,1041 ----
> * new array elements even if no other memory were currently used.
> *
> * We do the arithmetic in float8, because otherwise the product of
> ! * memtupsize and allowedMem could overflow. Any inaccuracy in the
> ! * result should be insignificant; but even if we computed a
> ! * completely insane result, the checks below will prevent anything
> ! * really bad from happening.
> */
> double grow_ratio;
>
> grow_ratio = (double) state->allowedMem / (double) memNowUsed;
> ! if (memtupsize * grow_ratio < INT_MAX)
> ! newmemtupsize = (int) (memtupsize * grow_ratio);
> ! else
> ! newmemtupsize = INT_MAX;
>
> /* We won't make any further enlargement attempts */
> state->growmemtuples = false;

I'm not a huge fan of moving directly to INT_MAX. Are we confident that
everything can handle that cleanly..? I feel like it might be a bit
safer to shy a bit short of INT_MAX (say, by 1K). Perhaps that's overly
paranoid, but there's an awful lot of callers and some loop which +2's
and then overflows would suck, eg:

    /* sketch: a loop that +2's a signed int index can push it past INT_MAX */
    int     x;

    for (x = INT_MAX - 1; x < INT_MAX; x += 2)
        myarray[x] = 5;     /* the following x += 2 overflows int */

Also, could this be used to support hashing larger sets..? If we change
NTUP_PER_BUCKET to one, we could end up wanting to create a hash table
larger than INT_MAX since, with 8-byte pointers, that'd only be around
134M tuples.
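
For concreteness, that figure is MaxAllocSize divided by an 8-byte pointer:

    0x3FFFFFFF bytes / 8 bytes per pointer ~= 134 million entries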

Haven't had a chance to review the rest, but +1 on the overall idea. :)

Thanks!

Stephen


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: MemoryContextAllocHuge(): selectively bypassing MaxAllocSize
Date: 2013-06-22 10:36:45
Message-ID: CA+U5nMK7F2MzbJ2jyNhGX=VxNcxwkHYKLZ0WdiU4Eqpp4=BXhg@mail.gmail.com
Lists: pgsql-hackers

On 13 May 2013 15:26, Noah Misch <noah(at)leadboat(dot)com> wrote:
> A memory chunk allocated through the existing palloc.h interfaces is limited
> to MaxAllocSize (~1 GiB). This is best for most callers; SET_VARSIZE() need
> not check its own 1 GiB limit, and algorithms that grow a buffer by doubling
> need not check for overflow. However, a handful of callers are quite happy to
> navigate those hazards in exchange for the ability to allocate a larger chunk.
>
> This patch introduces MemoryContextAllocHuge() and repalloc_huge() that check
> a higher MaxAllocHugeSize limit of SIZE_MAX/2. Chunks don't bother recording
> whether they were allocated as huge; one can start with palloc() and then
> repalloc_huge() to grow the value.

I like the design and think it's workable.

I'm concerned that people will accidentally use MaxAllocSize. Can we
put in a runtime warning if someone tests AllocSizeIsValid() with a
larger value?

> To demonstrate, I put this to use in
> tuplesort.c; the patch also updates tuplestore.c to keep them similar. Here's
> the trace_sort from building the pgbench_accounts primary key at scale factor
> 7500, maintenance_work_mem = '56GB'; memtuples itself consumed 17.2 GiB:
>
> LOG: internal sort ended, 48603324 KB used: CPU 75.65s/305.46u sec elapsed 391.21 sec
>
> Compare:
>
> LOG: external sort ended, 1832846 disk blocks used: CPU 77.45s/988.11u sec elapsed 1146.05 sec

Cool.

I'd like to put in an explicit test for this somewhere. Obviously not
part of normal regression, but somewhere, at least, so we have
automated testing that we all agree on. (yes, I know we don't have
that for replication/recovery yet, but that's why I don't want to
repeat that mistake).

> This was made easier by tuplesort growth algorithm improvements in commit
> 8ae35e91807508872cabd3b0e8db35fc78e194ac. The problem has come up before
> (TODO item "Allow sorts to use more available memory"), and Tom floated the
> idea[1] behind the approach I've used. The next limit faced by sorts is
> INT_MAX concurrent tuples in memory, which limits helpful work_mem to about
> 150 GiB when sorting int4.
>
> I have not added variants like palloc_huge() and palloc0_huge(), and I have
> not added to the frontend palloc.h interface. There's no particular barrier
> to doing any of that. I don't expect more than a dozen or so callers, so most
> of the variations might go unused.
>
> The comment at MaxAllocSize said that aset.c expects doubling the size of an
> arbitrary allocation to never overflow, but I couldn't find the code in
> question. AllocSetAlloc() does double sizes of blocks used to aggregate small
> allocations, so maxBlockSize had better stay under SIZE_MAX/2. Nonetheless,
> that expectation does apply to dozens of repalloc() users outside aset.c, and
> I preserved it for repalloc_huge(). 64-bit builds will never notice, and I
> won't cry for the resulting 2 GiB limit on 32-bit.

Agreed. Can we document this for the relevant parameters?

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: Noah Misch <noah(at)leadboat(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: MemoryContextAllocHuge(): selectively bypassing MaxAllocSize
Date: 2013-06-22 10:39:15
Message-ID: CA+U5nMKkRMin1pV8VMpS6_n7hcOWSG0kZS3oFL9JOa8DV6vJyQ@mail.gmail.com
Lists: pgsql-hackers

On 22 June 2013 08:46, Stephen Frost <sfrost(at)snowman(dot)net> wrote:

>>The next limit faced by sorts is
>> INT_MAX concurrent tuples in memory, which limits helpful work_mem to about
>> 150 GiB when sorting int4.
>
> That's frustratingly small. :(
>

But that has nothing to do with this patch, right? And is easily fixed, yes?

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Noah Misch <noah(at)leadboat(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: MemoryContextAllocHuge(): selectively bypassing MaxAllocSize
Date: 2013-06-22 13:12:31
Message-ID: 20130622131231.GF7093@tamriel.snowman.net
Lists: pgsql-hackers

* Simon Riggs (simon(at)2ndQuadrant(dot)com) wrote:
> On 22 June 2013 08:46, Stephen Frost <sfrost(at)snowman(dot)net> wrote:
> >>The next limit faced by sorts is
> >> INT_MAX concurrent tuples in memory, which limits helpful work_mem to about
> >> 150 GiB when sorting int4.
> >
> > That's frustratingly small. :(
>
> But that has nothing to do with this patch, right? And is easily fixed, yes?

I don't know about 'easily fixed' (consider supporting a HashJoin of >2B
records) but I do agree that dealing with places in the code where we are
using an int4 to keep track of the number of objects in memory is outside
the scope of this patch.

Hopefully we are properly range-checking and limiting ourselves to only
what a given node can support, rather than solely depending on MaxAllocSize
to keep us from overflowing some int4 that we're using as an array index or
as a count of how many objects we currently have in memory. Either way,
we'll want to consider carefully what happens with such large sets as we
add support for these huge allocations into nodes (along with the recent
change to allow 1TB work_mem, which may encourage users with systems large
enough to actually try setting it that high... :)

Thanks,

Stephen


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: MemoryContextAllocHuge(): selectively bypassing MaxAllocSize
Date: 2013-06-22 19:03:39
Message-ID: CA+Tgmobs4hWd51877WY4kfs+R4+GPSh8icTdW5j6YO+Ez0p6Hw@mail.gmail.com
Lists: pgsql-hackers

On Sat, Jun 22, 2013 at 3:46 AM, Stephen Frost <sfrost(at)snowman(dot)net> wrote:
> I'm not a huge fan of moving directly to INT_MAX. Are we confident that
> everything can handle that cleanly..? I feel like it might be a bit
> safer to shy a bit short of INT_MAX (say, by 1K).

Maybe it would be better to stick with INT_MAX and fix any bugs we
find. If there are magic numbers short of INT_MAX that cause
problems, it would likely be better to find out about those problems
and adjust the relevant code, rather than trying to dodge them. We'll
have to confront all of those problems eventually as we come to
support larger and larger sorts; I don't see much value in putting it
off.

Especially since we're early in the release cycle.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Noah Misch <noah(at)leadboat(dot)com>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: MemoryContextAllocHuge(): selectively bypassing MaxAllocSize
Date: 2013-06-24 16:33:18
Message-ID: 20130624163318.GA835122@tornado.leadboat.com
Lists: pgsql-hackers

On Sat, Jun 22, 2013 at 03:46:49AM -0400, Stephen Frost wrote:
> * Noah Misch (noah(at)leadboat(dot)com) wrote:
> > The next limit faced by sorts is
> > INT_MAX concurrent tuples in memory, which limits helpful work_mem to about
> > 150 GiB when sorting int4.
>
> That's frustratingly small. :(

I could appreciate a desire to remove that limit. The way to do that is to
audit all uses of "int" variables in tuplesort.c and tuplestore.c, changing
them to Size where they can be used as indexes into the memtuples array.
Nonetheless, this new limit is about 50x the current limit; you need an
(unpartitioned) table of 2B+ rows to encounter it. I'm happy with that.

> > ! if (memtupsize * grow_ratio < INT_MAX)
> > ! newmemtupsize = (int) (memtupsize * grow_ratio);
> > ! else
> > ! newmemtupsize = INT_MAX;
> >
> > /* We won't make any further enlargement attempts */
> > state->growmemtuples = false;
>
> I'm not a huge fan of moving directly to INT_MAX. Are we confident that
> everything can handle that cleanly..? I feel like it might be a bit
> safer to shy a bit short of INT_MAX (say, by 1K). Perhaps that's overly
> paranoid, but there's an awful lot of callers and some loop which +2's
> and then overflows would suck, eg:

Where are you seeing "an awful lot of callers"? The code that needs to be
correct with respect to the INT_MAX limit is all in tuplesort.c/tuplestore.c.
Consequently, I chose to verify that code rather than add a safety factor. (I
did add an unrelated safety factor to repalloc_huge() itself.)

> Also, could this be used to support hashing larger sets..? If we change
> NTUP_PER_BUCKET to one, we could end up wanting to create a hash table
> larger than INT_MAX since, with 8-byte pointers, that'd only be around
> 134M tuples.

The INT_MAX limit is an internal limit of tuplesort/tuplestore; other
consumers of the huge allocation APIs are only subject to that limit if they
find reasons to enforce it on themselves. (Incidentally, the internal limit
in question is INT_MAX tuples, not INT_MAX bytes.)

--
Noah Misch
EnterpriseDB http://www.enterprisedb.com


From: Noah Misch <noah(at)leadboat(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: MemoryContextAllocHuge(): selectively bypassing MaxAllocSize
Date: 2013-06-24 16:36:20
Message-ID: 20130624163620.GB835122@tornado.leadboat.com
Lists: pgsql-hackers

On Sat, Jun 22, 2013 at 11:36:45AM +0100, Simon Riggs wrote:
> On 13 May 2013 15:26, Noah Misch <noah(at)leadboat(dot)com> wrote:

> I'm concerned that people will accidentally use MaxAllocSize. Can we
> put in a runtime warning if someone tests AllocSizeIsValid() with a
> larger value?

I don't see how we could. To preempt a repalloc() failure, you test with
AllocSizeIsValid(); testing a larger value is not a programming error.
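
A perfectly legitimate caller can look like this (a hypothetical sketch;
spill_to_disk() is made up):

    Size        newsize = oldsize * 2;

    if (AllocSizeIsValid(newsize))
        buf = repalloc(buf, newsize);
    else
        spill_to_disk(buf, oldsize);    /* fall back instead of failing */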

> > To demonstrate, I put this to use in
> > tuplesort.c; the patch also updates tuplestore.c to keep them similar. Here's
> > the trace_sort from building the pgbench_accounts primary key at scale factor
> > 7500, maintenance_work_mem = '56GB'; memtuples itself consumed 17.2 GiB:
> >
> > LOG: internal sort ended, 48603324 KB used: CPU 75.65s/305.46u sec elapsed 391.21 sec
> >
> > Compare:
> >
> > LOG: external sort ended, 1832846 disk blocks used: CPU 77.45s/988.11u sec elapsed 1146.05 sec
>
> Cool.
>
> I'd like to put in an explicit test for this somewhere. Obviously not
> part of normal regression, but somewhere, at least, so we have
> automated testing that we all agree on. (yes, I know we don't have
> that for replication/recovery yet, but thats why I don't want to
> repeat that mistake).

Probably the easiest way to test from nothing is to run "pgbench -i -s 7500"
under a high work_mem. I agree that an automated test suite dedicated to
coverage of scale-dependent matters would be valuable, though I'm disinclined
to start one in conjunction with this particular patch.

> > The comment at MaxAllocSize said that aset.c expects doubling the size of an
> > arbitrary allocation to never overflow, but I couldn't find the code in
> > question. AllocSetAlloc() does double sizes of blocks used to aggregate small
> > allocations, so maxBlockSize had better stay under SIZE_MAX/2. Nonetheless,
> > that expectation does apply to dozens of repalloc() users outside aset.c, and
> > I preserved it for repalloc_huge(). 64-bit builds will never notice, and I
> > won't cry for the resulting 2 GiB limit on 32-bit.
>
> Agreed. Can we document this for the relevant parameters?

I attempted to cover most of that in the comment above MaxAllocHugeSize, but I
did not mention the maxBlockSize constraint. I'll add an
Assert(AllocHugeSizeIsValid(maxBlockSize)) and a comment to
AllocSetContextCreate(). Did I miss documenting anything else notable?

--
Noah Misch
EnterpriseDB http://www.enterprisedb.com


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Stephen Frost <sfrost(at)snowman(dot)net>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: MemoryContextAllocHuge(): selectively bypassing MaxAllocSize
Date: 2013-06-24 17:11:01
Message-ID: CAOuzzgqy=PDWrMDPuvgiNGUu1=nCCi9uauS0Kh_w9_nrpR0Y5g@mail.gmail.com
Lists: pgsql-hackers

On Monday, June 24, 2013, Noah Misch wrote:

> On Sat, Jun 22, 2013 at 03:46:49AM -0400, Stephen Frost wrote:
> > * Noah Misch (noah(at)leadboat(dot)com) wrote:
> > > The next limit faced by sorts is
> > > INT_MAX concurrent tuples in memory, which limits helpful work_mem to
> about
> > > 150 GiB when sorting int4.
> >
> > That's frustratingly small. :(
>
> I could appreciate a desire to remove that limit. The way to do that is to
> audit all uses of "int" variables in tuplesort.c and tuplestore.c, changing
> them to Size where they can be used as indexes into the memtuples array.

Right, that's about what I figured would need to be done.

> Nonetheless, this new limit is about 50x the current limit; you need an
> (unpartitioned) table of 2B+ rows to encounter it. I'm happy with that.

Definitely better but I could see cases with that many tuples in the
not-too-distant future, esp. when used with MinMax indexes...

> > > ! if (memtupsize * grow_ratio < INT_MAX)
> > > ! newmemtupsize = (int) (memtupsize * grow_ratio);
> > > ! else
> > > ! newmemtupsize = INT_MAX;
> > >
> > > /* We won't make any further enlargement attempts */
> > > state->growmemtuples = false;
> >
> > I'm not a huge fan of moving directly to INT_MAX. Are we confident that
> > everything can handle that cleanly..? I feel like it might be a bit
> > safer to shy a bit short of INT_MAX (say, by 1K). Perhaps that's overly
> > paranoid, but there's an awful lot of callers and some loop which +2's
> > and then overflows would suck, eg:
>
> Where are you seeing "an awful lot of callers"? The code that needs to be
> correct with respect to the INT_MAX limit is all in
> tuplesort.c/tuplestore.c.
> Consequently, I chose to verify that code rather than add a safety factor.
> (I
> did add an unrelated safety factor to repalloc_huge() itself.)

Ok, I was thinking this code was used beyond tuplesort (I was thinking it
was actually associated with palloc). Apologies for the confusion. :)

> > Also, could this be used to support hashing larger sets..? If we change
> > NTUP_PER_BUCKET to one, we could end up wanting to create a hash table
> > larger than INT_MAX since, with 8-byte pointers, that'd only be around
> > 134M tuples.
>
> The INT_MAX limit is an internal limit of tuplesort/tuplestore; other
> consumers of the huge allocation APIs are only subject to that limit if
> they
> find reasons to enforce it on themselves. (Incidentally, the internal
> limit
> in question is INT_MAX tuples, not INT_MAX bytes.)

There are other places where we use integers as indexes into arrays of
tuples (hashing is at least one other area), and those are then also subject
to INT_MAX, which was really what I was getting at. We might move the
hashing code to use the _huge functions and would then need to adjust that
code to use Size for the index into the hash table's array of pointers.

Thanks,

Stephen


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: MemoryContextAllocHuge(): selectively bypassing MaxAllocSize
Date: 2013-06-26 22:48:23
Message-ID: CAMkU=1zVD82voXw1vBG1kWcz5c2G=SupGohPKM0ThwmpRK1Ddw@mail.gmail.com
Lists: pgsql-hackers

On Mon, May 13, 2013 at 7:26 AM, Noah Misch <noah(at)leadboat(dot)com> wrote:

> A memory chunk allocated through the existing palloc.h interfaces is
> limited
> to MaxAllocSize (~1 GiB). This is best for most callers; SET_VARSIZE()
> need
> not check its own 1 GiB limit, and algorithms that grow a buffer by
> doubling
> need not check for overflow. However, a handful of callers are quite
> happy to
> navigate those hazards in exchange for the ability to allocate a larger
> chunk.
>
> This patch introduces MemoryContextAllocHuge() and repalloc_huge() that
> check
> a higher MaxAllocHugeSize limit of SIZE_MAX/2. Chunks don't bother
> recording
> whether they were allocated as huge; one can start with palloc() and then
> repalloc_huge() to grow the value.

Since it doesn't record the size, I assume the non-use as a varlena is
enforced only by coder discipline and not by the system?

! * represented in a varlena header.  Callers that never use the allocation as
! * a varlena can access the higher limit with MemoryContextAllocHuge().  Both
! * limits permit code to assume that it may compute (in size_t math) twice an
! * allocation's size without overflow.

What is likely to happen if I accidentally let a pointer to huge memory
escape to someone who then passes it to a varlena constructor without my
knowing it? (I tried sabotaging the code to make this happen, but I could
not figure out how to.) Is there a place we can put an Assert to catch
this mistake under enable-cassert builds?

I have not yet done a detailed code review, but this applies and builds
cleanly, passes make check with and without enable-cassert, does what it says
(and gives performance improvements when it does kick in), and we want this.
No doc changes should be needed; we probably don't want to run an automatic
regression test of the size needed to usefully test this, and as far as I know
there is no infrastructure for "big memory only" tests.

The only danger I can think of is that it could sometimes make some sorts
slower, as using more memory than is necessary can sometimes slow down an
"external" sort (because the heap is then too big for the fastest CPU
cache). If you use more tapes, but not enough more to reduce the number of
passes needed, then you can get a slowdown.

I can't imagine that it would make things worse on average, though, as the
benefit of doing more sorts as quicksorts rather than merge sorts, or doing
mergesort with fewer passes, would outweigh sometimes doing a
slower mergesort. If someone has a pathological use pattern for which the
averages don't work out favorably for them, they could probably play with
work_mem to correct the problem. Whereas without the patch, people who
want more memory have no options.

People have mentioned additional things that could be done in this area,
but I don't think that applying this patch will make those things harder,
or back us into a corner. Taking an incremental approach seems suitable.

Cheers,

Jeff


From: Noah Misch <noah(at)leadboat(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: MemoryContextAllocHuge(): selectively bypassing MaxAllocSize
Date: 2013-06-27 19:12:30
Message-ID: 20130627191230.GA912836@tornado.leadboat.com
Lists: pgsql-hackers

On Wed, Jun 26, 2013 at 03:48:23PM -0700, Jeff Janes wrote:
> On Mon, May 13, 2013 at 7:26 AM, Noah Misch <noah(at)leadboat(dot)com> wrote:
> > This patch introduces MemoryContextAllocHuge() and repalloc_huge() that
> > check
> > a higher MaxAllocHugeSize limit of SIZE_MAX/2. Chunks don't bother
> > recording
> > whether they were allocated as huge; one can start with palloc() and then
> > repalloc_huge() to grow the value.
>
>
> Since it doesn't record the size, I assume the non-use as a varlena is
> enforced only by coder discipline and not by the system?

We will rely on coder discipline, yes. The allocator actually does record a
size. I was referring to the fact that it can't distinguish the result of
repalloc(p, 7) from the result of repalloc_huge(p, 7).

> What is likely to happen if I accidentally let a pointer to huge memory
> escape to someone who then passes it to varlena constructor without me
> knowing it? (I tried sabotaging the code to make this happen, but I could
> not figure out how to). Is there a place we can put an Assert to catch
> this mistake under enable-cassert builds?

Passing a too-large value gives a modulo effect. We could inject an
AssertMacro() into SET_VARSIZE(). But it's a hot path, and I don't think this
mistake is too likely.
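
For reference, that check would look roughly like this (a sketch against the
stock SET_VARSIZE_4B-based definition; not something the patch adds):

    /* hypothetical: trip an assertion when a huge chunk reaches SET_VARSIZE() */
    #define SET_VARSIZE(PTR, len) \
        (AssertMacro((Size) (len) <= MaxAllocSize), SET_VARSIZE_4B(PTR, len))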

> The only danger I can think of is that it could sometimes make some sorts
> slower, as using more memory than is necessary can sometimes slow down an
> "external" sort (because the heap is then too big for the fastest CPU
> cache). If you use more tapes, but not enough more to reduce the number of
> passes needed, then you can get a slowdown.

Interesting point, though I don't fully understand it. The fastest CPU cache
will be a tiny L1 data cache; surely that's not the relevant parameter here?

> I can't imagine that it would make things worse on average, though, as the
> benefit of doing more sorts as quicksorts rather than merge sorts, or doing
> mergesort with fewer number of passes, would outweigh sometimes doing a
> slower mergesort. If someone has a pathological use pattern for which the
> averages don't work out favorably for them, they could probably play with
> work_mem to correct the problem. Whereas without the patch, people who
> want more memory have no options.

Agreed.

> People have mentioned additional things that could be done in this area,
> but I don't think that applying this patch will make those things harder,
> or back us into a corner. Taking an incremental approach seems suitable.

Committed with some cosmetic tweaks discussed upthread.

Thanks,
nm

--
Noah Misch
EnterpriseDB http://www.enterprisedb.com


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: MemoryContextAllocHuge(): selectively bypassing MaxAllocSize
Date: 2013-06-27 23:37:42
Message-ID: CAMkU=1y8ZBMMapk5i1BgsMHQZsaxDCO=UEKWnu6J=XEjQ-gpAw@mail.gmail.com
Lists: pgsql-hackers

On Sat, Jun 22, 2013 at 12:46 AM, Stephen Frost <sfrost(at)snowman(dot)net> wrote:

> Noah,
>
> * Noah Misch (noah(at)leadboat(dot)com) wrote:
> > This patch introduces MemoryContextAllocHuge() and repalloc_huge() that
> check
> > a higher MaxAllocHugeSize limit of SIZE_MAX/2.
>
> Nice! I've complained about this limit a few different times and just
> never got around to addressing it.
>
> > This was made easier by tuplesort growth algorithm improvements in commit
> > 8ae35e91807508872cabd3b0e8db35fc78e194ac. The problem has come up before
> > (TODO item "Allow sorts to use more available memory"), and Tom floated
> the
> > idea[1] behind the approach I've used. The next limit faced by sorts is
> > INT_MAX concurrent tuples in memory, which limits helpful work_mem to
> about
> > 150 GiB when sorting int4.
>
> That's frustratingly small. :(
>

I've added a ToDo item to remove that limit from sorts as well.

I was going to add another item to make nodeHash.c use the new huge
allocator, but after looking at it just now it was not clear to me that it
even has such a limitation. nbatch is limited by MaxAllocSize, but
nbuckets doesn't seem to be.

Cheers,

Jeff


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: MemoryContextAllocHuge(): selectively bypassing MaxAllocSize
Date: 2013-07-06 16:54:24
Message-ID: 20130706165424.GD3286@tamriel.snowman.net
Lists: pgsql-hackers

Jeff,

* Jeff Janes (jeff(dot)janes(at)gmail(dot)com) wrote:
> I was going to add another item to make nodeHash.c use the new huge
> allocator, but after looking at it just now it was not clear to me that it
> even has such a limitation. nbatch is limited by MaxAllocSize, but
> nbuckets doesn't seem to be.

nodeHash.c:ExecHashTableCreate() allocates ->buckets using:

palloc(nbuckets * sizeof(HashJoinTuple))

(where HashJoinTuple is actually just a pointer), and reallocates same
in ExecHashTableReset(). That limits the current implementation to only
about 134M buckets, no?

Now, what I was really suggesting wasn't so much changing those specific
calls; my point was really that there's a ton of stuff in the HashJoin
code that uses 32bit integers for things which, these days, might be too
small (nbuckets being one example, imv). There's a lot of code there
though and you'd have to really consider which things make sense to have
as int64's.

Thanks,

Stephen