Poor memory context performance in large hash joins

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Poor memory context performance in large hash joins
Date: 2017-02-23 22:13:19
Message-ID: CAMkU=1x1hvue1XYrZoWk_omG0Ja5nBvTdvgrOeVkkeqs71CV8g@mail.gmail.com
Lists: pgsql-hackers

When doing a hash join with large work_mem, you can have a large number of
chunks. Then if ExecHashIncreaseNumBatches gets called, those chunks are
walked through, moving the tuples to new chunks (or to disk, if they no
longer match the batch's bitmask), and freeing the old chunks.
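
For a sense of the access pattern, here is a minimal standalone sketch
(simplified types and invented names; not the actual nodeHash.c code, which
works with real tuples and batch files):

#include <stdlib.h>

/* Simplified stand-ins for HashMemoryChunk and the tuples it holds. */
typedef struct Chunk
{
    struct Chunk *next;                 /* singly linked list of chunks */
    int           ntuples;
    unsigned int  hashes[1024];         /* pretend tuples are just hash values */
} Chunk;

/*
 * Walk every old chunk, keep the tuples whose hash still maps to the
 * current batch and count the rest as spilled.  Every kept tuple forces an
 * allocation into the new chunk list and every emptied old chunk is freed,
 * so the memory context sees a long run of interleaved allocs and frees.
 */
static Chunk *
repartition(Chunk *oldchunks, unsigned int batchmask,
            unsigned int curbatch, long *nspilled)
{
    Chunk *newchunks = NULL;

    while (oldchunks != NULL)
    {
        Chunk *nextold = oldchunks->next;

        for (int i = 0; i < oldchunks->ntuples; i++)
        {
            unsigned int hash = oldchunks->hashes[i];

            if ((hash & batchmask) == curbatch)
            {
                /* still ours: copy into the new chunk list */
                if (newchunks == NULL || newchunks->ntuples == 1024)
                {
                    Chunk *c = calloc(1, sizeof(Chunk));

                    c->next = newchunks;
                    newchunks = c;
                }
                newchunks->hashes[newchunks->ntuples++] = hash;
            }
            else
                (*nspilled)++;          /* would be written to a batch file */
        }

        free(oldchunks);                /* the free that triggers the scan below */
        oldchunks = nextold;
    }
    return newchunks;
}

int
main(void)
{
    Chunk *old = calloc(1, sizeof(Chunk));
    long   spilled = 0;

    for (int i = 0; i < 1024; i++)
        old->hashes[old->ntuples++] = (unsigned int) i;

    Chunk *kept = repartition(old, 0x3, 0x1, &spilled);

    return (kept != NULL && spilled == 768) ? 0 : 1;
}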

The number of new chunks can be almost as large as the number of old
chunks, especially if there is a very popular value. The problem is that
every time an old chunk is freed, the code in aset.c around line 968 has to
walk over all the newly allocated chunks in the linked list before it can
find the old one being freed. This is an N^2 operation, and I think it has
horrible CPU cache hit rates as well.
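
Roughly, the free path looks like this with a singly linked block list
(simplified structures, not the real aset.c code); the rescan from the head
on every free is where the N^2 comes from:

#include <assert.h>
#include <stdlib.h>

typedef struct Block
{
    struct Block *next;         /* aset.c-style: forward pointer only */
} Block;

typedef struct Context
{
    Block *blocks;              /* head of the singly linked block list */
} Context;

/*
 * To unlink 'target' we must rescan from the head until we find the block
 * whose ->next is 'target'.  Freeing N blocks that live far from the head
 * therefore costs O(N^2) pointer chases overall.
 */
static void
free_block(Context *cxt, Block *target)
{
    if (cxt->blocks == target)
        cxt->blocks = target->next;
    else
    {
        Block *prev = cxt->blocks;

        while (prev->next != target)
        {
            prev = prev->next;
            assert(prev != NULL);       /* target must be in the list */
        }
        prev->next = target->next;
    }
    free(target);
}

int
main(void)
{
    Context cxt = {NULL};
    Block  *blocks[1000];

    /* build a 1000-element list, then free in insertion order (worst case) */
    for (int i = 0; i < 1000; i++)
    {
        blocks[i] = malloc(sizeof(Block));
        blocks[i]->next = cxt.blocks;
        cxt.blocks = blocks[i];
    }
    for (int i = 0; i < 1000; i++)
        free_block(&cxt, blocks[i]);    /* each free rescans the list */
    return 0;
}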

Is there a good solution to this? Could the new chunks be put in a
different memory context, and then destroy the old context and install the
new one at the end of ExecHashIncreaseNumBatches? I couldn't find a destroy
method for memory contexts, it looks like you just reset the parent
instead. But I don't think that would work here.

Thanks,

Jeff


From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Poor memory context performance in large hash joins
Date: 2017-02-23 22:15:18
Message-ID: CAH2-Wz=uF=Qe0n0atK17vbdQ0LMkF6QQU9she9aaZA_BLs+_mw@mail.gmail.com
Lists: pgsql-hackers

On Thu, Feb 23, 2017 at 2:13 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
> Is there a good solution to this? Could the new chunks be put in a
> different memory context, and then destroy the old context and install the
> new one at the end of ExecHashIncreaseNumBatches? I couldn't find a destroy
> method for memory contexts, it looks like you just reset the parent instead.
> But I don't think that would work here.

Are you aware of the fact that tuplesort.c got a second memory context
for 9.6, entirely on performance grounds?

--
Peter Geoghegan


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Poor memory context performance in large hash joins
Date: 2017-02-23 22:28:26
Message-ID: 10401.1487888906@sss.pgh.pa.us
Lists: pgsql-hackers

Jeff Janes <jeff(dot)janes(at)gmail(dot)com> writes:
> > The number of new chunks can be almost as large as the number of old
> chunks, especially if there is a very popular value. The problem is that
> every time an old chunk is freed, the code in aset.c around line 968 has to
> walk over all the newly allocated chunks in the linked list before it can
> find the old one being freed. This is an N^2 operation, and I think it has
> horrible CPU cache hit rates as well.

Maybe it's time to convert that to a doubly-linked list. Although if the
hash code is producing a whole lot of requests that are only a bit bigger
than the separate-block threshold, I'd say It's Doing It Wrong. It should
learn to aggregate them into larger requests.

regards, tom lane


From: Andres Freund <andres(at)anarazel(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Tomas Vondra <tv(at)fuzzy(dot)cz>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Poor memory context performance in large hash joins
Date: 2017-02-24 06:47:44
Message-ID: 20170224064744.veq4et4imuzhdrbm@alap3.anarazel.de
Lists: pgsql-hackers

On 2017-02-23 17:28:26 -0500, Tom Lane wrote:
> Jeff Janes <jeff(dot)janes(at)gmail(dot)com> writes:
> > > The number of new chunks can be almost as large as the number of old
> > chunks, especially if there is a very popular value. The problem is that
> > every time an old chunk is freed, the code in aset.c around line 968 has to
> > walk over all the newly allocated chunks in the linked list before it can
> > find the old one being freed. This is an N^2 operation, and I think it has
> > horrible CPU cache hit rates as well.
>
> Maybe it's time to convert that to a doubly-linked list.

Yes, I do think so. Given that we only have that for full blocks, not
for small chunks, the cost seems negligible.

That would also, partially, address the performance issue
http://archives.postgresql.org/message-id/d15dff83-0b37-28ed-0809-95a5cc7292ad%402ndquadrant.com
addresses, in a more realistically backpatchable manner.

Jeff, do you have a handy demonstrator?

> Although if the
> hash code is producing a whole lot of requests that are only a bit bigger
> than the separate-block threshold, I'd say It's Doing It Wrong. It should
> learn to aggregate them into larger requests.

That's probably right, but we can't really address that in the
back-branches. And to me this sounds like something we should address
in the branches, not just in master. Even if we'd also fix the
hash-aggregation logic, I think such an O(n^2) behaviour in the
allocator is a bad idea in general, and we should fix it anyway.

Regards,

Andres


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Tomas Vondra <tv(at)fuzzy(dot)cz>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Poor memory context performance in large hash joins
Date: 2017-02-24 06:59:01
Message-ID: 8808.1487919541@sss.pgh.pa.us
Lists: pgsql-hackers

Andres Freund <andres(at)anarazel(dot)de> writes:
> On 2017-02-23 17:28:26 -0500, Tom Lane wrote:
>> Maybe it's time to convert that to a doubly-linked list.

> Yes, I do think so. Given that we only have that for full blocks, not
> for small chunks, the cost seems negligible.
> That would also, partially, address the performance issue
> http://archives.postgresql.org/message-id/d15dff83-0b37-28ed-0809-95a5cc7292ad%402ndquadrant.com
> addresses, in a more realistically backpatchable manner.

Yeah, I was wondering if we could get away with back-patching such a
change. In principle, nothing outside aset.c should know what's in the
header of an AllocBlock, but ...

> Jeff, do you have a handy demonstrator?

A solid test case would definitely help to convince people that it was
worth taking some risk here.

regards, tom lane


From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Tomas Vondra <tv(at)fuzzy(dot)cz>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Poor memory context performance in large hash joins
Date: 2017-02-24 07:00:27
Message-ID: CAEepm=24pdL0eVZmjZtBCLeHugHBisADht2TjaEdCB_aFv16ZQ@mail.gmail.com
Lists: pgsql-hackers

On Fri, Feb 24, 2017 at 12:17 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> Jeff, do you have a handy demonstrator?

If you want to hit ExecHashIncreaseNumBatches a bunch of times, see
the "ugly" example in the attached.

--
Thomas Munro
http://www.enterprisedb.com

Attachment Content-Type Size
hj-test-queries.sql application/octet-stream 2.9 KB

From: Andres Freund <andres(at)anarazel(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Tomas Vondra <tv(at)fuzzy(dot)cz>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Poor memory context performance in large hash joins
Date: 2017-02-24 07:25:16
Message-ID: 20170224072516.pifpksx4725rfv4t@alap3.anarazel.de
Lists: pgsql-hackers

On 2017-02-24 01:59:01 -0500, Tom Lane wrote:
> Andres Freund <andres(at)anarazel(dot)de> writes:
> > On 2017-02-23 17:28:26 -0500, Tom Lane wrote:
> >> Maybe it's time to convert that to a doubly-linked list.
>
> > Yes, I do think so. Given that we only have that for full blocks, not
> > for small chunks, the cost seems negligible.
> > That would also, partially, address the performance issue
> > http://archives.postgresql.org/message-id/d15dff83-0b37-28ed-0809-95a5cc7292ad%402ndquadrant.com
> > addresses, in a more realistically backpatchable manner.
>
> Yeah, I was wondering if we could get away with back-patching such a
> change. In principle, nothing outside aset.c should know what's in the
> header of an AllocBlock, but ...

You'd need to go through a fair amount of intentional pain to be
affected by a change to AllocBlockData's structure. We could add the
->prev pointer to the end of AllocBlockData's definition to make it less
likely that one would be affected in that unlikely case - but I'm a bit
doubtful it's worth the trouble.
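
As a sketch of that idea (field names and order here are assumptions, not
the real AllocBlockData), appending the member leaves every pre-existing
offset alone:

#include <stddef.h>
#include <stdio.h>

/* Sketch only: field names and order are assumptions, not the real struct. */
typedef struct AllocBlockDataSketch
{
    void   *aset;               /* pre-existing fields keep their offsets... */
    struct AllocBlockDataSketch *next;
    char   *freeptr;
    char   *endptr;
    struct AllocBlockDataSketch *prev;  /* ...only the new field is appended */
} AllocBlockDataSketch;

int
main(void)
{
    /* Appending 'prev' changes sizeof() but no existing field's offset. */
    printf("offsetof(next)=%zu offsetof(prev)=%zu sizeof=%zu\n",
           offsetof(AllocBlockDataSketch, next),
           offsetof(AllocBlockDataSketch, prev),
           sizeof(AllocBlockDataSketch));
    return 0;
}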

- Andres


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Poor memory context performance in large hash joins
Date: 2017-02-24 18:19:02
Message-ID: CAMkU=1zbiQYRtR_qOEkfqGLhqite-JN-J92v7SyEf7P9pg94YA@mail.gmail.com
Lists: pgsql-hackers

On Thu, Feb 23, 2017 at 2:28 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Jeff Janes <jeff(dot)janes(at)gmail(dot)com> writes:
> > The number of new chunks can be almost as large as the number of old
> > chunks, especially if there is a very popular value. The problem is that
> > every time an old chunk is freed, the code in aset.c around line 968 has to
> > walk over all the newly allocated chunks in the linked list before it can
> > find the old one being freed. This is an N^2 operation, and I think it has
> > horrible CPU cache hit rates as well.
>
> Maybe it's time to convert that to a doubly-linked list.

I don't think that would help. You would need some heuristic to guess
whether the chunk you are looking for is near the front, or near the end.
And in this case, the desired chunk starts out at the front, and then keeps
moving down the list with each iteration as new things are added to the
front, until near the end of the ExecHashIncreaseNumBatches it is freeing
them from near the end. But in between, it is freeing them from the
middle, so I don't think a doubly-linked list would change it from N^2,
just lower the constant, even if you always knew which end to start at. Or
am I misunderstanding what the implications for a doubly-linked-list are?

What would really help here is if it remembered the next pointer of the
just-freed chunk, and started the scan from that location the next time,
cycling around to the head pointer if it doesn't find anything. But I
don't think that that is a very general solution.

Or, if you could pass a flag when creating the context telling it whether
memory will be freed mostly-LIFO or mostly-FIFO, and have it use a stack or
a queue accordingly.
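
A standalone sketch of the scan-resumption idea above, with invented names
and the singly linked list kept as-is; when blocks are freed in roughly
list order (as the repartition loop does), each search then finds the
predecessor in a step or two:

#include <assert.h>
#include <stdlib.h>

typedef struct Block
{
    struct Block *next;
} Block;

typedef struct Context
{
    Block *blocks;              /* head of the singly linked block list */
    Block *scan_hint;           /* where the previous unlink left off */
} Context;

/*
 * Like the O(N) version, but start searching where the last free ended
 * instead of at the head, wrapping around once if necessary.
 */
static void
free_block(Context *cxt, Block *target)
{
    if (cxt->blocks == target)
    {
        cxt->blocks = target->next;
        if (cxt->scan_hint == target)
            cxt->scan_hint = NULL;
    }
    else
    {
        Block *start = (cxt->scan_hint != NULL && cxt->scan_hint != target)
            ? cxt->scan_hint : cxt->blocks;
        Block *prev = start;

        while (prev->next != target)
        {
            prev = (prev->next != NULL) ? prev->next : cxt->blocks;
            assert(prev != start);      /* full lap: target not in list */
        }
        prev->next = target->next;
        cxt->scan_hint = prev;          /* resume here next time */
    }
    free(target);
}

int
main(void)
{
    Context cxt = {NULL, NULL};
    Block  *old[1000];

    for (int i = 0; i < 1000; i++)
    {
        old[i] = malloc(sizeof(Block));
        old[i]->next = cxt.blocks;
        cxt.blocks = old[i];
    }

    /*
     * Mimic the repartition pattern: prepend a fresh block, then free the
     * next old block in list order.  With the hint, each search finds the
     * predecessor immediately; scanning from the head would walk over an
     * ever-growing prefix of new blocks.
     */
    for (int i = 999; i >= 0; i--)
    {
        Block *fresh = malloc(sizeof(Block));

        fresh->next = cxt.blocks;
        cxt.blocks = fresh;
        free_block(&cxt, old[i]);
    }
    return 0;
}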

> Although if the
> hash code is producing a whole lot of requests that are only a bit bigger
> than the separate-block threshold, I'd say It's Doing It Wrong. It should
> learn to aggregate them into larger requests.
>

Right now it is using compiled-in 32KB chunks. Should it use something
like max(32kb,work_mem/128) instead?

Cheers,

Jeff


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Tomas Vondra <tv(at)fuzzy(dot)cz>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Poor memory context performance in large hash joins
Date: 2017-02-24 18:47:14
Message-ID: CAMkU=1wCoM1bO3G2mNnEMWgGhcHqH_M5O3VuwUQZD8zRYDdK0g@mail.gmail.com
Lists: pgsql-hackers

On Thu, Feb 23, 2017 at 10:47 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:

> On 2017-02-23 17:28:26 -0500, Tom Lane wrote:
> > Jeff Janes <jeff(dot)janes(at)gmail(dot)com> writes:
> > > The number of new chunks can be almost as large as the number of old
> > > chunks, especially if there is a very popular value. The problem is that
> > > every time an old chunk is freed, the code in aset.c around line 968 has to
> > > walk over all the newly allocated chunks in the linked list before it can
> > > find the old one being freed. This is an N^2 operation, and I think it has
> > > horrible CPU cache hit rates as well.
> >
> > Maybe it's time to convert that to a doubly-linked list.
>
> Yes, I do think so. Given that we only have that for full blocks, not
> for small chunks, the cost seems negligible.
>
> That would also, partially, address the performance issue
> http://archives.postgresql.org/message-id/d15dff83-0b37-28ed
> -0809-95a5cc7292ad%402ndquadrant.com
> addresses, in a more realistically backpatchable manner.
>
> Jeff, do you have a handy demonstrator?
>

Not exactly "handy", as it takes about 45 minutes to set up and uses >50GB
of disk and 16GB of RAM, but here is a demonstration.

create table foobar as
  select CASE when random()<(420.0/540.0) then 1
         else floor(random()*880000)::int END as titleid
  from generate_series(1,540000000);

create table foobar2 as select distinct titleid from foobar;

analyze;

set work_mem TO "13GB";

select count(*) from foobar2 where not exists
  (select 1 from foobar t where t.titleid=foobar2.titleid);

This will run effectively forever, unless it gets killed or dies from OOM
first. If I have other things consuming some of my 16GB of RAM, then it
hits OOM (which is not a complaint: it is what one should expect, given
that I told it work_mem was 13GB). If I have no other demands on RAM, I
can't tell whether it would eventually hit OOM, because of how long it
would take to get that far.

This is inspired by the thread
https://www.postgresql.org/message-id/flat/CACw4T0p4Lzd6VpwptxgPgoTMh2dEKTQBGu7NTaJ1%2BA0PRx1BGg%40mail(dot)gmail(dot)com#CACw4T0p4Lzd6VpwptxgPgoTMh2dEKTQBGu7NTaJ1+A0PRx1BGg(at)mail(dot)gmail(dot)com

I ran into this while evaluating Tom's responding patch, but you don't need
to apply that patch to run this example and see the effect. I don't have
an example of a case that demonstrates the problem in the absence of a
degenerate hash bucket. I think there should be non-degenerate cases that
trigger it, but I haven't been able to produce one yet.

> > Although if the
> > hash code is producing a whole lot of requests that are only a bit bigger
> > than the separate-block threshold, I'd say It's Doing It Wrong. It should
> > learn to aggregate them into larger requests.
>
> That's probably right, but we can't really address that in the
> back-branches. And to me this sounds like something we should address
> in the branches, not just in master. Even if we'd also fix the
> hash-aggregation logic, I think such an O(n^2) behaviour in the
> allocator is a bad idea in general, and we should fix it anyway.
>

I don't know how important a back-patch would be. This is a toy case for
me. Presumably it isn't for David Hinkle, except that he doesn't want a
hash join in the first place, while I want one that finishes in a
reasonable time. It might depend on what happens to Tom's OOM patch.

It would be great if the allocator was made bullet-proof, but I don't think
adding reverse links (or anything else back-patchable) is going to be
enough to do that.

Cheers,

Jeff


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Poor memory context performance in large hash joins
Date: 2017-02-24 18:59:45
Message-ID: 8652.1487962785@sss.pgh.pa.us
Lists: pgsql-hackers

Jeff Janes <jeff(dot)janes(at)gmail(dot)com> writes:
> On Thu, Feb 23, 2017 at 2:28 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Maybe it's time to convert that to a doubly-linked list.

> I don't think that would help. You would need some heuristic to guess
> whether the chunk you are looking for is near the front, or near the end.

Uh, what? In a doubly-linked list, you can remove an element in O(1)
time, you don't need any searching. It basically becomes
item->prev->next = item->next;
item->next->prev = item->prev;
modulo possible special cases for the head and tail elements.
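
Spelled out with the head and tail cases, that unlink looks roughly like
this (generic stand-in types, not the actual patch):

#include <assert.h>
#include <stdlib.h>

typedef struct Block
{
    struct Block *prev;
    struct Block *next;
} Block;

typedef struct Context
{
    Block *blocks;              /* head of a doubly linked, NULL-terminated list */
} Context;

/* Unlink in O(1): no scanning, just repair the neighbours (and the head). */
static void
unlink_block(Context *cxt, Block *item)
{
    if (item->prev != NULL)
        item->prev->next = item->next;
    else
        cxt->blocks = item->next;       /* item was the head */

    if (item->next != NULL)
        item->next->prev = item->prev;  /* otherwise item was the tail */

    free(item);
}

static Block *
push_block(Context *cxt)
{
    Block *b = calloc(1, sizeof(Block));

    b->next = cxt->blocks;
    if (cxt->blocks != NULL)
        cxt->blocks->prev = b;
    cxt->blocks = b;
    return b;
}

int
main(void)
{
    Context cxt = {NULL};
    Block  *a = push_block(&cxt);
    Block  *b = push_block(&cxt);
    Block  *c = push_block(&cxt);

    unlink_block(&cxt, b);              /* middle */
    unlink_block(&cxt, c);              /* head */
    unlink_block(&cxt, a);              /* tail (now also head) */
    assert(cxt.blocks == NULL);
    return 0;
}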

>> Although if the
>> hash code is producing a whole lot of requests that are only a bit bigger
>> than the separate-block threshold, I'd say It's Doing It Wrong. It should
>> learn to aggregate them into larger requests.

> Right now it is using compiled-in 32KB chunks. Should it use something
> like max(32kb,work_mem/128) instead?

I'd say it should double the size of the request each time. That's what
we do in most places.
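
One possible shape for that, sketched with assumed names and an assumed cap
rather than the existing nodeHash.c policy:

#include <stdio.h>

#define HASH_CHUNK_SIZE (32 * 1024)             /* the current fixed size */

/*
 * Illustrative only: double the chunk size for each new chunk, capped at a
 * fraction of work_mem (given in kilobytes, like the GUC) so no single
 * chunk swallows too much of the budget.
 */
static size_t
next_chunk_size(size_t prev_size, size_t work_mem_kb)
{
    size_t cap = (work_mem_kb * 1024) / 16;     /* assumed cap: work_mem/16 */
    size_t next = (prev_size == 0) ? HASH_CHUNK_SIZE : prev_size * 2;

    if (cap < HASH_CHUNK_SIZE)
        cap = HASH_CHUNK_SIZE;
    return (next > cap) ? cap : next;
}

int
main(void)
{
    size_t sz = 0;

    for (int i = 0; i < 12; i++)
    {
        sz = next_chunk_size(sz, 1024 * 1024);  /* work_mem = 1GB */
        printf("chunk %d: %zu bytes\n", i, sz);
    }
    return 0;
}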

regards, tom lane


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Poor memory context performance in large hash joins
Date: 2017-02-24 19:57:17
Message-ID: CAMkU=1xdSg0g8JuwUq=zro=PG508_0qKaKFLQbEcGciT4hAYLw@mail.gmail.com
Lists: pgsql-hackers

On Fri, Feb 24, 2017 at 10:59 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Jeff Janes <jeff(dot)janes(at)gmail(dot)com> writes:
> > On Thu, Feb 23, 2017 at 2:28 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> >> Maybe it's time to convert that to a doubly-linked list.
>
> > I don't think that would help. You would need some heuristic to guess
> > whether the chunk you are looking for is near the front, or near the end.
>
> Uh, what? In a doubly-linked list, you can remove an element in O(1)
> time, you don't need any searching. It basically becomes
> item->prev->next = item->next;
> item->next->prev = item->prev;
> modulo possible special cases for the head and tail elements.
>

Currently it is walking the chain to identify which block holds the chunk
in the first place, not just to get the pointer to the previous block. But
I see that that could be fixed by pointer arithmetic once there is a reason
to fix it; there isn't one as long as you need to walk the chain to get the
prev pointer anyway. Like this?

targetblock = (AllocBlock) (((char*)chunk) - ALLOC_BLOCKHDRSZ);

> >> Although if the
> >> hash code is producing a whole lot of requests that are only a bit bigger
> >> than the separate-block threshold, I'd say It's Doing It Wrong. It should
> >> learn to aggregate them into larger requests.
>
> > Right now it is using compiled-in 32KB chunks. Should it use something
> > like max(32kb,work_mem/128) instead?
>
> I'd say it should double the size of the request each time. That's what
> we do in most places.
>

I thought the design goal there was that the space in old chunks could get
re-used into the new chunks in a reasonably fine-grained way. If the last
chunk contains just over half of all the data, that couldn't happen.

Cheers,

Jeff


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Poor memory context performance in large hash joins
Date: 2017-02-24 23:04:18
Message-ID: 32354.1487977458@sss.pgh.pa.us
Lists: pgsql-hackers

Jeff Janes <jeff(dot)janes(at)gmail(dot)com> writes:
> On Fri, Feb 24, 2017 at 10:59 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Uh, what? In a doubly-linked list, you can remove an element in O(1)
>> time, you don't need any searching.

> Currently it is walking the chain to identify which block holds the chunk
> in the first place, not just to get the pointer to the previous block. But
> I see that that could be fixed by pointer arithmetic once there is a reason
> to fix it. Which there isn't a reason to as long as you need to walk the
> chain to get the prev pointer anyway. Like this:?
> targetblock = (AllocBlock) (((char*)chunk) - ALLOC_BLOCKHDRSZ);

Right. We're just turning around the existing address calculation. It'd
be a good idea to provide some cross-check that we found a sane-looking
block header, but that's not that hard --- we know what ought to be in
aset, freeptr, and endptr for a single-chunk block's header, and that
seems like enough of a crosscheck to me.
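
The shape of that cross-check, in simplified standalone form (the structures
and sizes here are illustrative, not the real aset.c layout, which also has
alignment padding to account for):

#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct BlockHdr
{
    void   *aset;               /* owning context */
    struct BlockHdr *next;
    char   *freeptr;            /* allocation frontier within the block */
    char   *endptr;             /* one past the end of the block */
} BlockHdr;

typedef struct ChunkHdr
{
    size_t  size;               /* size of the chunk's data area */
} ChunkHdr;

#define BLOCKHDRSZ  sizeof(BlockHdr)
#define CHUNKHDRSZ  sizeof(ChunkHdr)

/*
 * For an oversized, single-chunk block the block header sits immediately
 * before the chunk header, so pointer arithmetic gets us back to it; the
 * header fields then give us something to sanity-check against.
 */
static BlockHdr *
block_of_chunk(void *owning_context, ChunkHdr *chunk)
{
    BlockHdr *block = (BlockHdr *) ((char *) chunk - BLOCKHDRSZ);

    /* a single-chunk block is always exactly full */
    assert(block->aset == owning_context);
    assert(block->freeptr == block->endptr);
    assert(block->endptr == (char *) chunk + CHUNKHDRSZ + chunk->size);

    return block;
}

int
main(void)
{
    int       dummy_context;            /* stands in for the AllocSet */
    size_t    datasize = 100;
    char     *raw = malloc(BLOCKHDRSZ + CHUNKHDRSZ + datasize);
    BlockHdr *block = (BlockHdr *) raw;
    ChunkHdr *chunk = (ChunkHdr *) (raw + BLOCKHDRSZ);

    block->aset = &dummy_context;
    block->next = NULL;
    block->freeptr = block->endptr = raw + BLOCKHDRSZ + CHUNKHDRSZ + datasize;
    chunk->size = datasize;

    assert(block_of_chunk(&dummy_context, chunk) == block);
    free(raw);
    return 0;
}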

Concretely, something like the attached. This passes regression tests
but I've not pushed on it any harder than that.

One argument against this is that it adds some nonzero amount of overhead
to block management; but considering that we are calling malloc or realloc
or free alongside any one of these manipulations, that overhead should be
pretty negligible. I'm a bit more worried about increasing the alloc
block header size by 8 bytes, but that would only really matter with a
very small allocChunkLimit.

regards, tom lane

Attachment Content-Type Size
doubly-linked-block-list.patch text/x-diff 8.3 KB

From: Andres Freund <andres(at)anarazel(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Poor memory context performance in large hash joins
Date: 2017-02-24 23:12:37
Message-ID: 20170224231237.rzhlpdgnkby266sc@alap3.anarazel.de
Lists: pgsql-hackers

On 2017-02-24 18:04:18 -0500, Tom Lane wrote:
> Concretely, something like the attached. This passes regression tests
> but I've not pushed on it any harder than that.

Heh, I'd just gotten something that didn't immediately crash anymore ;)

Running your patch against Jeff's test-case, verified before that I
could easily reproduce the O(N^2) cost.

- Andres


From: Andres Freund <andres(at)anarazel(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Poor memory context performance in large hash joins
Date: 2017-02-24 23:18:04
Message-ID: 20170224231804.j5d5jkgph3b3aboj@alap3.anarazel.de
Lists: pgsql-hackers

On 2017-02-24 15:12:37 -0800, Andres Freund wrote:
> On 2017-02-24 18:04:18 -0500, Tom Lane wrote:
> > Concretely, something like the attached. This passes regression tests
> > but I've not pushed on it any harder than that.
>
> Heh, I'd just gotten something that didn't immediately crash anymore ;)
>
> Running your patch against Jeff's test-case, verified before that I
> could easily reproduce the O(N^2) cost.

Oh, that didn't take as long as I was afraid (optimized/non-assert build):

postgres[26268][1]=# SET work_mem = '13GB';
SET
Time: 2.591 ms
postgres[26268][1]=# select count(*) from foobar2 where not exists (select 1 from foobar t where t.titleid=foobar2.titleid);
┌───────┐
│ count │
├───────┤
│ 0 │
└───────┘
(1 row)

Time: 268043.710 ms (04:28.044)

Profile:
13.49% postgres [.] ExecHashTableInsert
11.16% postgres [.] BufFileRead
9.20% postgres [.] ExecHashIncreaseNumBatches
9.19% libc-2.24.so [.] __memmove_avx_unaligned_erms
6.74% postgres [.] dense_alloc.isra.0
5.53% postgres [.] ExecStoreMinimalTuple
5.14% postgres [.] ExecHashJoinGetSavedTuple.isra.0
4.68% postgres [.] AllocSetAlloc
4.12% postgres [.] AllocSetFree
2.31% postgres [.] palloc
2.06% [kernel] [k] free_pcppages_bulk

I'd previously run the test for 30min, with something like 99% of the
profile being in AllocSetFree().

- Andres


From: Andres Freund <andres(at)anarazel(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Poor memory context performance in large hash joins
Date: 2017-02-27 11:55:03
Message-ID: 20170227115503.riiwt2hppkp7gebf@alap3.anarazel.de
Lists: pgsql-hackers

On 2017-02-24 15:18:04 -0800, Andres Freund wrote:
> On 2017-02-24 15:12:37 -0800, Andres Freund wrote:
> > On 2017-02-24 18:04:18 -0500, Tom Lane wrote:
> > > Concretely, something like the attached. This passes regression tests
> > > but I've not pushed on it any harder than that.
> >
> > Heh, I'd just gotten something that didn't immediately crash anymore ;)
> >
> > Running your patch against Jeff's test-case, verified before that I
> > could easily reproduce the O(N^2) cost.
>
> Oh, that didn't take as long as I was afraid (optimized/non-assert build):
>
> postgres[26268][1]=# SET work_mem = '13GB';
> SET
> Time: 2.591 ms
> postgres[26268][1]=# select count(*) from foobar2 where not exists (select 1 from foobar t where t.titleid=foobar2.titleid);
> Time: 268043.710 ms (04:28.044)

As another datapoint, I measured this patch against the problem from
https://www.postgresql.org/message-id/20170227111732.vrx5v72ighehwpkf@alap3.anarazel.de
(see top post in thread), and it indeed fixes the runtime issue -
there's still considerably higher memory usage and some runtime
overhead, but the quadratic behaviour is gone.

I think we should go forward with something like this patch in all
branches, and only use Tomas' patch in master, because they're
considerably larger.

Regards,

Andres


From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Poor memory context performance in large hash joins
Date: 2017-02-27 18:20:56
Message-ID: 4c86b41a-e6dc-8722-71c3-28ebb461b4eb@2ndquadrant.com
Lists: pgsql-hackers

On 02/27/2017 12:55 PM, Andres Freund wrote:
> On 2017-02-24 15:18:04 -0800, Andres Freund wrote:
>> On 2017-02-24 15:12:37 -0800, Andres Freund wrote:
>>> On 2017-02-24 18:04:18 -0500, Tom Lane wrote:
>>>> Concretely, something like the attached. This passes regression tests
>>>> but I've not pushed on it any harder than that.
>>>
>>> Heh, I'd just gotten something that didn't immediately crash anymore ;)
>>>
>>> Running your patch against Jeff's test-case, verified before that I
>>> could easily reproduce the O(N^2) cost.
>>
>> Oh, that didn't take as long as I was afraid (optimized/non-assert build):
>>
>> postgres[26268][1]=# SET work_mem = '13GB';
>> SET
>> Time: 2.591 ms
>> postgres[26268][1]=# select count(*) from foobar2 where not exists (select 1 from foobar t where t.titleid=foobar2.titleid);
>> Time: 268043.710 ms (04:28.044)
>
> As another datapoint, I measured this patch against the problem from
> https://www.postgresql.org/message-id/20170227111732.vrx5v72ighehwpkf@alap3.anarazel.de
> (see top post in thread), and it indeed fixes the runtime issue -
> there's still considerably higher memory usage and some runtime
> overhead, but the quadratic behaviour is gone.
>
> I think we should go forward with something like this patch in all
> branches, and only use Tomas' patch in master, because they're
> considerably larger.
>

So you've tried to switch hashjoin to the slab allocators? Or what have
you compared?

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Andres Freund <andres(at)anarazel(dot)de>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Poor memory context performance in large hash joins
Date: 2017-02-27 18:26:39
Message-ID: 20170227182639.ibdg5dm35z563sjw@alap3.anarazel.de
Lists: pgsql-hackers

On 2017-02-27 19:20:56 +0100, Tomas Vondra wrote:
> On 02/27/2017 12:55 PM, Andres Freund wrote:
> > On 2017-02-24 15:18:04 -0800, Andres Freund wrote:
> > > On 2017-02-24 15:12:37 -0800, Andres Freund wrote:
> > > > On 2017-02-24 18:04:18 -0500, Tom Lane wrote:
> > > > > Concretely, something like the attached. This passes regression tests
> > > > > but I've not pushed on it any harder than that.
> > > >
> > > > Heh, I'd just gotten something that didn't immediately crash anymore ;)
> > > >
> > > > Running your patch against Jeff's test-case, verified before that I
> > > > could easily reproduce the O(N^2) cost.
> > >
> > > Oh, that didn't take as long as I was afraid (optimized/non-assert build):
> > >
> > > postgres[26268][1]=# SET work_mem = '13GB';
> > > SET
> > > Time: 2.591 ms
> > > postgres[26268][1]=# select count(*) from foobar2 where not exists (select 1 from foobar t where t.titleid=foobar2.titleid);
> > > Time: 268043.710 ms (04:28.044)
> >
> > As another datapoint, I measured this patch against the problem from
> > https://www.postgresql.org/message-id/20170227111732.vrx5v72ighehwpkf@alap3.anarazel.de
> > (see top post in thread), and it indeed fixes the runtime issue -
> > there's still considerably higher memory usage and some runtime
> > overhead, but the quadratic behaviour is gone.
> >
> > I think we should go forward with something like this patch in all
> > branches, and only use Tomas' patch in master, because they're
> > considerably larger.
> >
>
> So you've tried to switch hashjoin to the slab allocators? Or what have you
> compared?

No, sorry for not being more explicit about this. I meant that Tom's
patch addresses the performance issue in reorderbuffer.c to a good
degree (i.e. gets rid of the quadratic cost, even though constants are
higher than w/ your patches). As the patch here is a lot smaller, it
seems like a better choice for the back-branches than backporting
slab.c/generation.c.

Andres


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Poor memory context performance in large hash joins
Date: 2017-03-07 18:06:39
Message-ID: 28659.1488909999@sss.pgh.pa.us
Lists: pgsql-hackers

Andres Freund <andres(at)anarazel(dot)de> writes:
>>> On 2017-02-24 18:04:18 -0500, Tom Lane wrote:
>>>> Concretely, something like the attached. This passes regression tests
>>>> but I've not pushed on it any harder than that.

> I think we should go forward with something like this patch in all
> branches, and only use Tomas' patch in master, because they're
> considerably larger.

While I'm willing to back-patch the proposed patch to some extent,
I'm a bit hesitant to shove it into all branches, because I don't
think that usage patterns that create a problem occurred till recently.
You won't get a whole lot of standalone-block allocations unless you
have code that allocates many chunks bigger than 8K without applying a
double-the-request-each-time type of strategy. And even then, it doesn't
hurt unless you try to resize those chunks, or delete them in a non-LIFO
order.

The hashjoin issue is certainly new, and the reorderbuffer issue can't
go back further than 9.4. So my inclination is to patch back to 9.4
and call it good.

regards, tom lane


From: Andres Freund <andres(at)anarazel(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Poor memory context performance in large hash joins
Date: 2017-03-07 18:37:14
Message-ID: 20170307183714.sea7l76or4zxgqe3@alap3.anarazel.de
Lists: pgsql-hackers

On 2017-03-07 13:06:39 -0500, Tom Lane wrote:
> Andres Freund <andres(at)anarazel(dot)de> writes:
> >>> On 2017-02-24 18:04:18 -0500, Tom Lane wrote:
> >>>> Concretely, something like the attached. This passes regression tests
> >>>> but I've not pushed on it any harder than that.
>
> > I think we should go forward with something like this patch in all
> > branches, and only use Tomas' patch in master, because they're
> > considerably larger.
>
> While I'm willing to back-patch the proposed patch to some extent,
> I'm a bit hesitant to shove it into all branches, because I don't
> think that usage patterns that create a problem occurred till recently.
> You won't get a whole lot of standalone-block allocations unless you
> have code that allocates many chunks bigger than 8K without applying a
> double-the-request-each-time type of strategy. And even then, it doesn't
> hurt unless you try to resize those chunks, or delete them in a non-LIFO
> order.
>
> The hashjoin issue is certainly new, and the reorderbuffer issue can't
> go back further than 9.4. So my inclination is to patch back to 9.4
> and call it good.

That works for me. If we find further cases later, we can easily enough
backpatch it then.

Regards,

Andres


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Poor memory context performance in large hash joins
Date: 2017-03-08 17:22:51
Message-ID: 31888.1488993771@sss.pgh.pa.us
Lists: pgsql-hackers

Andres Freund <andres(at)anarazel(dot)de> writes:
> On 2017-03-07 13:06:39 -0500, Tom Lane wrote:
>> The hashjoin issue is certainly new, and the reorderbuffer issue can't
>> go back further than 9.4. So my inclination is to patch back to 9.4
>> and call it good.

> That works for me. If we find further cases later, we can easily enough
> backpatch it then.

Done like that.

regards, tom lane