PATCH: hashjoin - gracefully increasing NTUP_PER_BUCKET instead of batching

From: Tomas Vondra <tv(at)fuzzy(dot)cz>
To: pgsql-hackers(at)postgresql(dot)org
Subject: PATCH: hashjoin - gracefully increasing NTUP_PER_BUCKET instead of batching
Date: 2014-12-07 03:08:28
Message-ID: 5483C4AC.5060107@fuzzy.cz
Lists: pgsql-hackers

Hi,

back when we were discussing the hashjoin patches (now committed),
Robert proposed that maybe it'd be a good idea to sometimes increase the
number of tuples per bucket instead of batching.

That is, while initially sizing the hash table - if the hash table with
enough buckets to satisfy NTUP_PER_BUCKET load factor does not fit into
work_mem, try a bit higher load factor before starting to batch.

The attached patch is an initial attempt to implement this - it's a bit
rough around the edges, but hopefully enough to judge the benefit of this.

The default load factor is 1. The patch tries to degrade this to 2, 4 or
8 in an attempt to fit the hash table into work_mem. If it doesn't work, it
starts batching with the default load factor. If batching is
required while the hashjoin is running, the load factor is switched back
to the default one (if we're batching, there's no point in keeping the
slower hash table).
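
For illustration, here is a rough sketch in plain C of the sizing rule
described above. This is not the actual patch code - the function name and
the memory accounting (tuple space plus one pointer per bucket, ignoring
per-tuple overhead) are simplifications made up for this example:

#include <math.h>
#include <stddef.h>

/*
 * Illustrative only: try load factors 1, 2, 4 and 8 and return the first
 * one for which the estimated hash table fits into work_mem; fall back to
 * the default load factor (and hence batching) if none of them fit.
 */
static int
choose_load_factor(double ntuples, size_t tuple_width, size_t work_mem_bytes)
{
    static const int factors[] = {1, 2, 4, 8};
    int         i;

    for (i = 0; i < 4; i++)
    {
        double      nbuckets = ceil(ntuples / factors[i]);
        size_t      needed = (size_t) (ntuples * tuple_width)
                           + (size_t) nbuckets * sizeof(void *);

        if (needed <= work_mem_bytes)
            return factors[i];      /* fits into work_mem in a single batch */
    }

    return 1;   /* nothing fits: batch, keeping the default NTUP_PER_BUCKET */
}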

The patch also modifies the EXPLAIN output to show the load factor.

The testing I've done so far is rather disappointing, though.

create table a as select i from generate_series(1,1000000) s(i);
create table b as select i from generate_series(1,1000000) s(i);

analyze a;
analyze b;

select a.i, b.i from a join b on (a.i = b.i);

work_mem   batches   tuples per bucket   duration
--------------------------------------------------
   64 MB         1                   1     585 ms
   46 MB         1                   2     639 ms
   43 MB         1                   4     794 ms
   40 MB         1                   8    1075 ms
   39 MB         2                   1     623 ms

So, even increasing the load factor to 2 is slower than batching.

Of course, with other examples the results may be different. For example
with a much larger outer table:

create table a as select mod(i,1000000) i
from generate_series(1,10000000) s(i);
analyze a;

work_mem   batches   tuples per bucket   duration
--------------------------------------------------
   64 MB         1                   1    3904 ms
   46 MB         1                   2    4692 ms
   43 MB         1                   4    6499 ms
   40 MB         1                   8    9264 ms
   39 MB         2                   1    4054 ms

Same results. Of course, for "huge" outer tables it will make a
difference. For example on a ~8GB outer table (on a machine with 8GB of
RAM), the batching causes enough I/O to make it slower than ntup=2 (50s
vs. 80s, for example).

I'm not sure what the best way forward is - clearly, for cases that fit
into RAM (temp files staying in the page cache), batching is faster. For tables
large enough to cause a lot of I/O, it may make a difference - but I'm
not sure how to identify these cases.

So ISTM implementing this requires a reliable way to identify which case
we're dealing with - whether the outer table is large enough (prefer
increasing the load factor) or not (prefer batching). Until then keeping the
current simple/predictable approach is probably better.

Of course, this also depends on additional variables (e.g. is the system
memory-stressed?).

regards
Tomas

Attachment                   Content-Type   Size
hashjoin-graceful-v1.patch   text/x-diff    10.7 KB

From: Kevin Grittner <kgrittn(at)ymail(dot)com>
To: Tomas Vondra <tv(at)fuzzy(dot)cz>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PATCH: hashjoin - gracefully increasing NTUP_PER_BUCKET instead of batching
Date: 2014-12-10 14:02:21
Message-ID: 646916973.4801168.1418220141970.JavaMail.yahoo@jws10090.mail.ne1.yahoo.com
Lists: pgsql-hackers

Tomas Vondra <tv(at)fuzzy(dot)cz> wrote:

> back when we were discussing the hashjoin patches (now committed),
> Robert proposed that maybe it'd be a good idea to sometimes increase the
> number of tuples per bucket instead of batching.
>
> That is, while initially sizing the hash table - if the hash table with
> enough buckets to satisfy NTUP_PER_BUCKET load factor does not fit into
> work_mem, try a bit higher load factor before starting to batch.
>
> The attached patch is an initial attempt to implement this - it's a bit
> rough around the edges, but hopefully enough to judge the benefit of this.
>
> The default load factor is 1. The patch tries to degrade this to 2, 4 or
> 8 in an attempt to fit the hash table into work_mem. If it doesn't work, it
> starts batching with the default load factor. If batching is
> required while the hashjoin is running, the load factor is switched back
> to the default one (if we're batching, there's no point in keeping the
> slower hash table).
>
> The patch also modifies the EXPLAIN output to show the load factor.
>
> The testing I've done so far is rather disappointing, though.
>
> create table a as select i from generate_series(1,1000000) s(i);
> create table b as select i from generate_series(1,1000000) s(i);
>
> analyze a;
> analyze b;
>
> select a.i, b.i from a join b on (a.i = b.i);
>
> work_mem   batches   tuples per bucket   duration
> --------------------------------------------------
>    64 MB         1                   1     585 ms
>    46 MB         1                   2     639 ms
>    43 MB         1                   4     794 ms
>    40 MB         1                   8    1075 ms
>    39 MB         2                   1     623 ms
>
> So, even increasing the load factor to 2 is slower than batching.

Right, I saw the same thing.

I tried pretty hard to create a case where increasing the load
factor from 1 to 2 was faster than going to a second batch, and was
unable to do so. I hypothesized that the second batch was cached
by the OS and flipping the data in and out of the OS cache was
faster than chasing through the links. I expect that if you have a
large enough hash table to barely exceed your machine's ability to
cache, you *might* see a benefit in the hash table access by
increasing the load factor. I think it would be incredibly hard to
accurately identify those cases, and every time a hash table was
used there would be a cost to trying to figure it out. I just
can't see this being a win.

> I'm not sure what the best way forward is - clearly, for cases that fit
> into RAM (temp files staying in the page cache), batching is faster. For tables
> large enough to cause a lot of I/O, it may make a difference - but I'm
> not sure how to identify these cases.
>
> So ISTM implementing this requires a reliable way to identify which case
> we're dealing with - whether the outer table is large enough (prefer
> increasing the load factor) or not (prefer batching). Until then keeping the
> current simple/predictable approach is probably better.
>
> Of course, this also depends on additional variables (e.g. is the system
> memory-stressed?).

All that is on-target, but my take-away is that increasing load
factor to avoid a second batch was an interesting idea that turns
out to not really be a good one. If someone can craft a
reproducible test case that demonstrates a win for increasing the
load factor that doesn't have significant cost for cases where it
isn't a win, I might change my opinion; but count me as a skeptic.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tomas Vondra <tv(at)fuzzy(dot)cz>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PATCH: hashjoin - gracefully increasing NTUP_PER_BUCKET instead of batching
Date: 2014-12-11 02:51:10
Message-ID: CA+Tgmoa-==rtgzkNOES91o1Lurr1g5_vfM8cbfJNATjQSdcKeQ@mail.gmail.com
Lists: pgsql-hackers

On Sat, Dec 6, 2014 at 10:08 PM, Tomas Vondra <tv(at)fuzzy(dot)cz> wrote:
> select a.i, b.i from a join b on (a.i = b.i);

I think the concern is that the inner side might be something more
elaborate than a plain table scan, like an aggregate or join. I might
be all wet, but my impression is that you can make rescanning
arbitrarily expensive if you work at it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Kevin Grittner <kgrittn(at)ymail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>, Tomas Vondra <tv(at)fuzzy(dot)cz>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PATCH: hashjoin - gracefully increasing NTUP_PER_BUCKET instead of batching
Date: 2014-12-11 17:29:05
Message-ID: 374360640.5196069.1418318945338.JavaMail.yahoo@jws10094.mail.ne1.yahoo.com
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Sat, Dec 6, 2014 at 10:08 PM, Tomas Vondra <tv(at)fuzzy(dot)cz> wrote:
>> select a.i, b.i from a join b on (a.i = b.i);
>
> I think the concern is that the inner side might be something more
> elaborate than a plain table scan, like an aggregate or join. I might
> be all wet, but my impression is that you can make rescanning
> arbitrarily expensive if you work at it.

I'm not sure I'm following. Let's use a function to select from b:

create or replace function fb()
returns setof b
language plpgsql
rows 1
as $$
begin
return query select i from b;
end;
$$;

explain (analyze, buffers, verbose)
select a.i, b.i from a join fb() b on (a.i = b.i);

I used the low row estimate to cause the planner to put this on the inner side.

16 batches
Execution time: 1638.582 ms

Now let's make it slow.

create or replace function fb()
returns setof b
language plpgsql
rows 1
as $$
begin
perform pg_sleep(2.0);
return query select i from b;
end;
$$;
explain (analyze, buffers, verbose)
select a.i, b.i from a join fb() b on (a.i = b.i);

16 batches
Execution time: 3633.859 ms

Under what conditions do you see the inner side get loaded into the
hash table multiple times?

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Kevin Grittner <kgrittn(at)ymail(dot)com>
Cc: Tomas Vondra <tv(at)fuzzy(dot)cz>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PATCH: hashjoin - gracefully increasing NTUP_PER_BUCKET instead of batching
Date: 2014-12-11 19:00:12
Message-ID: CA+TgmoYJBfdMkNM++d+3SBEaT9B_MSh9K-cU0HoBN=t2xb7MpQ@mail.gmail.com
Lists: pgsql-hackers

On Thu, Dec 11, 2014 at 12:29 PM, Kevin Grittner <kgrittn(at)ymail(dot)com> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> On Sat, Dec 6, 2014 at 10:08 PM, Tomas Vondra <tv(at)fuzzy(dot)cz> wrote:
>>> select a.i, b.i from a join b on (a.i = b.i);
>>
>> I think the concern is that the inner side might be something more
>> elaborate than a plain table scan, like an aggregate or join. I might
>> be all wet, but my impression is that you can make rescanning
>> arbitrarily expensive if you work at it.
>
> I'm not sure I'm following. Let's use a function to select from b:
>
> create or replace function fb()
> returns setof b
> language plpgsql
> rows 1
> as $$
> begin
> return query select i from b;
> end;
> $$;
>
> explain (analyze, buffers, verbose)
> select a.i, b.i from a join fb() b on (a.i = b.i);
>
> I used the low row estimate to cause the planner to put this on the inner side.
>
> 16 batches
> Execution time: 1638.582 ms
>
> Now let's make it slow.
>
> create or replace function fb()
> returns setof b
> language plpgsql
> rows 1
> as $$
> begin
> perform pg_sleep(2.0);
> return query select i from b;
> end;
> $$;
> explain (analyze, buffers, verbose)
> select a.i, b.i from a join fb() b on (a.i = b.i);
>
> 16 batches
> Execution time: 3633.859 ms
>
> Under what conditions do you see the inner side get loaded into the
> hash table multiple times?

Huh, interesting. I guess I was thinking that the inner side got
rescanned for each new batch, but I guess that's not what happens.

Maybe there's no real problem here, and we just win.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tomas Vondra <tv(at)fuzzy(dot)cz>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: PATCH: hashjoin - gracefully increasing NTUP_PER_BUCKET instead of batching
Date: 2014-12-11 19:51:31
Message-ID: 5489F5C3.3090800@fuzzy.cz
Lists: pgsql-hackers

On 11.12.2014 20:00, Robert Haas wrote:
> On Thu, Dec 11, 2014 at 12:29 PM, Kevin Grittner <kgrittn(at)ymail(dot)com> wrote:
>>
>> Under what conditions do you see the inner side get loaded into the
>> hash table multiple times?
>
> Huh, interesting. I guess I was thinking that the inner side got
> rescanned for each new batch, but I guess that's not what happens.

No, it's not rescanned. It's scanned only once (for batch #0), and
tuples belonging to the other batches are stored in files. If the number
of batches needs to be increased (e.g. because of an incorrect estimate of
the inner table size), the tuples are moved later.

>
> Maybe there's no real problem here, and we just win.

I'm a bit confused by this discussion, because the inner relation has
nothing to do with this patch. It gets scanned exactly once, no matter
what the load factor is. If batching is necessary, only the already
written batch files are read (without re-executing the inner part). However
in that case this patch makes no difference, because it explicitly reverts
to load factor = NTUP_PER_BUCKET (which is 1).

The only point of this patch was to prevent batching because of the
outer table. Usually, the outer table is much larger than the inner one
(e.g. in a star schema, outer = fact table, inner = dimension). Batching
the outer table means you have to write >= 50% of it into a temporary file.

The idea was that if we could increase the load a bit (e.g. using 2
tuples per bucket instead of 1), we will still use a single batch in
some cases (when we miss the work_mem threshold by just a bit). The
lookups will be slower, but we'll save the I/O.
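
To make the mechanics clearer, here is a simplified sketch of how a hash
value is split into a bucket number and a batch number (modeled on
ExecHashGetBucketAndBatch() in nodeHash.c, assuming nbuckets and nbatch are
powers of two; the real function differs in details). Only tuples falling
into batch 0 are kept in the in-memory hash table during the single scan of
the inner side; the rest are written to per-batch temp files and read back
later:

#include <stdint.h>

static void
get_bucket_and_batch(uint32_t hashvalue,
                     int nbuckets, int log2_nbuckets, int nbatch,
                     int *bucketno, int *batchno)
{
    /* low bits of the hash pick the bucket within the hash table */
    *bucketno = hashvalue & (nbuckets - 1);

    /* the next bits pick the batch; only batch 0 is built in memory */
    if (nbatch > 1)
        *batchno = (hashvalue >> log2_nbuckets) & (nbatch - 1);
    else
        *batchno = 0;
}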

regards
Tomas


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tomas Vondra <tv(at)fuzzy(dot)cz>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PATCH: hashjoin - gracefully increasing NTUP_PER_BUCKET instead of batching
Date: 2014-12-11 21:16:47
Message-ID: CA+TgmoZVLCxa4OaRTh3Qa2BZq75=zA+3-awM-pUi6kzBWswt7A@mail.gmail.com
Lists: pgsql-hackers

On Thu, Dec 11, 2014 at 2:51 PM, Tomas Vondra <tv(at)fuzzy(dot)cz> wrote:
> No, it's not rescanned. It's scanned only once (for batch #0), and
> tuples belonging to the other batches are stored in files. If the number
> of batches needs to be increased (e.g. because of an incorrect estimate of
> the inner table size), the tuples are moved later.

Yeah, I think I sort of knew that, but I got confused. Thanks for clarifying.

> The idea was that if we could increase the load a bit (e.g. using 2
> tuples per bucket instead of 1), we will still use a single batch in
> some cases (when we miss the work_mem threshold by just a bit). The
> lookups will be slower, but we'll save the I/O.

Yeah. That seems like a valid theory, but your test results so far
seem to indicate that it's not working out like that - which I find
quite surprising, but, I mean, it is what it is, right?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tomas Vondra <tv(at)fuzzy(dot)cz>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: PATCH: hashjoin - gracefully increasing NTUP_PER_BUCKET instead of batching
Date: 2014-12-11 22:46:17
Message-ID: 548A1EB9.5020506@fuzzy.cz
Lists: pgsql-hackers

On 11.12.2014 22:16, Robert Haas wrote:
> On Thu, Dec 11, 2014 at 2:51 PM, Tomas Vondra <tv(at)fuzzy(dot)cz> wrote:
> No, it's not rescanned. It's scanned only once (for batch #0), and
> tuples belonging to the other batches are stored in files. If the number
> of batches needs to be increased (e.g. because of an incorrect estimate of
> the inner table size), the tuples are moved later.
>
> Yeah, I think I sort of knew that, but I got confused. Thanks for clarifying.
>
>> The idea was that if we could increase the load a bit (e.g. using 2
>> tuples per bucket instead of 1), we will still use a single batch in
>> some cases (when we miss the work_mem threshold by just a bit). The
>> lookups will be slower, but we'll save the I/O.
>
> Yeah. That seems like a valid theory, but your test results so far
> seem to indicate that it's not working out like that - which I find
> quite surprising, but, I mean, it is what it is, right?

Not exactly. My tests show that as long as the outer table batches fit
into the page cache, increasing the load factor results in worse performance
than batching.

When the outer table is "sufficiently small", the batching is faster.

Regarding the "sufficiently small" - considering today's hardware, we're
probably talking about gigabytes. On machines with significant memory
pressure (forcing the temporary files to disk), it might be much lower,
of course. Of course, it also depends on kernel settings (e.g.
dirty_bytes/dirty_background_bytes).

If we could identify those cases (at least the "temp files > RAM") then
maybe we could do this. Otherwise we're going to penalize all the other
queries ...

Maybe the best solution for now is an "increase the work_mem a bit"
recommendation.

regards
Tomas


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tomas Vondra <tv(at)fuzzy(dot)cz>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PATCH: hashjoin - gracefully increasing NTUP_PER_BUCKET instead of batching
Date: 2014-12-12 13:19:59
Message-ID: CA+TgmobEapmVBgH6LOafBEzZJkAk4pgj-OXwnT58xkoccetabg@mail.gmail.com
Lists: pgsql-hackers

On Thu, Dec 11, 2014 at 5:46 PM, Tomas Vondra <tv(at)fuzzy(dot)cz> wrote:
>>> The idea was that if we could increase the load a bit (e.g. using 2
>>> tuples per bucket instead of 1), we will still use a single batch in
>>> some cases (when we miss the work_mem threshold by just a bit). The
>>> lookups will be slower, but we'll save the I/O.
>>
>> Yeah. That seems like a valid theory, but your test results so far
>> seem to indicate that it's not working out like that - which I find
>> quite surprising, but, I mean, it is what it is, right?
>
> Not exactly. My tests show that as long as the outer table batches fit
> into the page cache, increasing the load factor results in worse performance
> than batching.
>
> When the outer table is "sufficiently small", the batching is faster.
>
> Regarding the "sufficiently small" - considering today's hardware, we're
> probably talking about gigabytes. On machines with significant memory
> pressure (forcing the temporary files to disk), it might be much lower,
> of course. Of course, it also depends on kernel settings (e.g.
> dirty_bytes/dirty_background_bytes).

Well, this is sort of one of the problems with work_mem. When we
switch to a tape sort, or a tape-based materialize, we're probably far
from out of memory. But trying to set work_mem to the amount of
memory we have can easily result in a memory overrun if a load spike
causes lots of people to do it all at the same time. So we have to
set work_mem conservatively, but then the costing doesn't really come
out right. We could add some more costing parameters to try to model
this, but it's not obvious how to get it right.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tomas Vondra <tv(at)fuzzy(dot)cz>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: PATCH: hashjoin - gracefully increasing NTUP_PER_BUCKET instead of batching
Date: 2014-12-12 16:50:56
Message-ID: 548B1CF0.9010808@fuzzy.cz
Lists: pgsql-hackers

On 12.12.2014 14:19, Robert Haas wrote:
> On Thu, Dec 11, 2014 at 5:46 PM, Tomas Vondra <tv(at)fuzzy(dot)cz> wrote:
>
>> Regarding the "sufficiently small" - considering today's hardware, we're
>> probably talking about gigabytes. On machines with significant memory
>> pressure (forcing the temporary files to disk), it might be much lower,
>> of course. Of course, it also depends on kernel settings (e.g.
>> dirty_bytes/dirty_background_bytes).
>
> Well, this is sort of one of the problems with work_mem. When we
> switch to a tape sort, or a tape-based materialize, we're probably far
> from out of memory. But trying to set work_mem to the amount of
> memory we have can easily result in a memory overrun if a load spike
> causes lots of people to do it all at the same time. So we have to
> set work_mem conservatively, but then the costing doesn't really come
> out right. We could add some more costing parameters to try to model
> this, but it's not obvious how to get it right.

Ummm, I don't think that's what I proposed. What I had in mind was a
flag "the batches are likely to stay in page cache". Because when it is
likely, batching is probably faster (compared to increased load factor).

Tomas


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tomas Vondra <tv(at)fuzzy(dot)cz>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PATCH: hashjoin - gracefully increasing NTUP_PER_BUCKET instead of batching
Date: 2014-12-12 21:13:09
Message-ID: CA+Tgmobh+BcdzGWSUK9hmLm53gwjNmnCKA+1mF2FGUhSMdEtAA@mail.gmail.com
Lists: pgsql-hackers

On Fri, Dec 12, 2014 at 11:50 AM, Tomas Vondra <tv(at)fuzzy(dot)cz> wrote:
> On 12.12.2014 14:19, Robert Haas wrote:
>> On Thu, Dec 11, 2014 at 5:46 PM, Tomas Vondra <tv(at)fuzzy(dot)cz> wrote:
>>
>>> Regarding the "sufficiently small" - considering today's hardware, we're
>>> probably talking about gigabytes. On machines with significant memory
>>> pressure (forcing the temporary files to disk), it might be much lower,
>>> of course. Of course, it also depends on kernel settings (e.g.
>>> dirty_bytes/dirty_background_bytes).
>>
>> Well, this is sort of one of the problems with work_mem. When we
>> switch to a tape sort, or a tape-based materialize, we're probably far
>> from out of memory. But trying to set work_mem to the amount of
>> memory we have can easily result in a memory overrun if a load spike
>> causes lots of people to do it all at the same time. So we have to
>> set work_mem conservatively, but then the costing doesn't really come
>> out right. We could add some more costing parameters to try to model
>> this, but it's not obvious how to get it right.
>
> Ummm, I don't think that's what I proposed. What I had in mind was a
> flag "the batches are likely to stay in page cache". Because when it is
> likely, batching is probably faster (compared to increased load factor).

How will you know whether to set the flag?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tomas Vondra <tv(at)fuzzy(dot)cz>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: PATCH: hashjoin - gracefully increasing NTUP_PER_BUCKET instead of batching
Date: 2014-12-12 21:54:08
Message-ID: 548B6400.2040904@fuzzy.cz
Lists: pgsql-hackers

On 12.12.2014 22:13, Robert Haas wrote:
> On Fri, Dec 12, 2014 at 11:50 AM, Tomas Vondra <tv(at)fuzzy(dot)cz> wrote:
>> On 12.12.2014 14:19, Robert Haas wrote:
>>> On Thu, Dec 11, 2014 at 5:46 PM, Tomas Vondra <tv(at)fuzzy(dot)cz> wrote:
>>>
>>>> Regarding the "sufficiently small" - considering today's hardware, we're
>>>> probably talking about gigabytes. On machines with significant memory
>>>> pressure (forcing the temporary files to disk), it might be much lower,
>>>> of course. Of course, it also depends on kernel settings (e.g.
>>>> dirty_bytes/dirty_background_bytes).
>>>
>>> Well, this is sort of one of the problems with work_mem. When we
>>> switch to a tape sort, or a tape-based materialize, we're probably far
>>> from out of memory. But trying to set work_mem to the amount of
>>> memory we have can easily result in a memory overrun if a load spike
>>> causes lots of people to do it all at the same time. So we have to
>>> set work_mem conservatively, but then the costing doesn't really come
>>> out right. We could add some more costing parameters to try to model
>>> this, but it's not obvious how to get it right.
>>
>> Ummm, I don't think that's what I proposed. What I had in mind was a
>> flag "the batches are likely to stay in page cache". Because when it is
>> likely, batching is probably faster (compared to increased load factor).
>
> How will you know whether to set the flag?

I don't know. I just wanted to make it clear that I'm not suggesting
messing with work_mem (increasing it or whatever). Or maybe I got your
comments about memory overrun etc. wrong - now that I read it again,
maybe it's meant just as an example of how difficult a problem it is?

regards
Tomas


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tomas Vondra <tv(at)fuzzy(dot)cz>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PATCH: hashjoin - gracefully increasing NTUP_PER_BUCKET instead of batching
Date: 2014-12-12 22:04:57
Message-ID: CA+TgmoZT8LBsj6Stytpa5ePgWiTXS0DKEQYx2qzZBUN4dbN6kQ@mail.gmail.com
Lists: pgsql-hackers

On Fri, Dec 12, 2014 at 4:54 PM, Tomas Vondra <tv(at)fuzzy(dot)cz> wrote:
>>>> Well, this is sort of one of the problems with work_mem. When we
>>>> switch to a tape sort, or a tape-based materialize, we're probably far
>>>> from out of memory. But trying to set work_mem to the amount of
>>>> memory we have can easily result in a memory overrun if a load spike
>>>> causes lots of people to do it all at the same time. So we have to
>>>> set work_mem conservatively, but then the costing doesn't really come
>>>> out right. We could add some more costing parameters to try to model
>>>> this, but it's not obvious how to get it right.
>>>
>>> Ummm, I don't think that's what I proposed. What I had in mind was a
>>> flag "the batches are likely to stay in page cache". Because when it is
>>> likely, batching is probably faster (compared to increased load factor).
>>
>> How will you know whether to set the flag?
>
> I don't know. I just wanted to make it clear that I'm not suggesting
> messing with work_mem (increasing it or whatever). Or maybe I got your
> comments about memory overrun etc. wrong - now that I read it again,
> maybe it's meant just as an example of how difficult a problem it is?

More or less, yeah.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tomas Vondra <tv(at)fuzzy(dot)cz>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PATCH: hashjoin - gracefully increasing NTUP_PER_BUCKET instead of batching
Date: 2014-12-12 22:30:52
Message-ID: CAM3SWZQOJsZ54CVYWHY90-GdrvYU2ChPHJK9Oz2Hv7tvdm4+vQ@mail.gmail.com
Lists: pgsql-hackers

On Fri, Dec 12, 2014 at 5:19 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> Well, this is sort of one of the problems with work_mem. When we
> switch to a tape sort, or a tape-based materialize, we're probably far
> from out of memory. But trying to set work_mem to the amount of
> memory we have can easily result in a memory overrun if a load spike
> causes lots of people to do it all at the same time. So we have to
> set work_mem conservatively, but then the costing doesn't really come
> out right. We could add some more costing parameters to try to model
> this, but it's not obvious how to get it right.

I've heard of using "set work_mem = *" with advisory locks plenty of
times. There might be a better way to set it dynamically than a full
admission control implementation.

--
Peter Geoghegan


From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: PATCH: hashjoin - gracefully increasing NTUP_PER_BUCKET instead of batching
Date: 2015-01-14 20:02:59
Message-ID: 54B6CB73.8020906@2ndquadrant.com
Lists: pgsql-hackers

On 11.12.2014 23:46, Tomas Vondra wrote:
> On 11.12.2014 22:16, Robert Haas wrote:
>> On Thu, Dec 11, 2014 at 2:51 PM, Tomas Vondra <tv(at)fuzzy(dot)cz> wrote:
>>
>>> The idea was that if we could increase the load a bit (e.g. using 2
>>> tuples per bucket instead of 1), we will still use a single batch in
>>> some cases (when we miss the work_mem threshold by just a bit). The
>>> lookups will be slower, but we'll save the I/O.
>>
>> Yeah. That seems like a valid theory, but your test results so far
>> seem to indicate that it's not working out like that - which I find
>> quite surprising, but, I mean, it is what it is, right?
>
> Not exactly. My tests show that as long as the outer table batches fit
> into the page cache, increasing the load factor results in worse performance
> than batching.
>
> When the outer table is "sufficiently small", the batching is faster.
>
> Regarding the "sufficiently small" - considering today's hardware, we're
> probably talking about gigabytes. On machines with significant memory
> pressure (forcing the temporary files to disk), it might be much lower,
> of course. Of course, it also depends on kernel settings (e.g.
> dirty_bytes/dirty_background_bytes).
>
> If we could identify those cases (at least the "temp files > RAM") then
> maybe we could do this. Otherwise we're going to penalize all the other
> queries ...
>
> Maybe the best solution for now is an "increase the work_mem a bit"
> recommendation.

I think it's time to mark this patch as rejected (or maybe returned with
feedback). The patch was meant as an attempt to implement Robert's idea
from the hashjoin patch, but apparently we have no clear idea how to do
it without hurting performance for many existing users.

Maybe we can try again later, but there's no point in keeping this in the
current CF.

Any objections?

regards
Tomas


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PATCH: hashjoin - gracefully increasing NTUP_PER_BUCKET instead of batching
Date: 2015-02-13 07:55:28
Message-ID: CAB7nPqR9BFCVHZfxPSni2z9UJcd+CCZFRM-mESxeh-yhEHtHWA@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jan 15, 2015 at 5:02 AM, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
wrote:

> Maybe we can try again later, but there's no point in keeping this in the
> current CF.
>
> Any objections?
>

None, marked as rejected.
--
Michael