Re: Hash Anti Join performance degradation

From: panam <panam(at)gmx(dot)net>
To: pgsql-performance(at)postgresql(dot)org
Subject: Hash Anti Join performance degradation
Date: 2011-05-24 04:14:24
Message-ID: 1306210464806-4420974.post@n5.nabble.com
Lists: pgsql-hackers pgsql-performance

Hi,

In my DB, there is a large table containing messages and one that contains
message boxes.
Messages are assigned to boxes via a child parent link m->b.
In order to obtain the last message for a specific box, I use the following
SQL:

SELECT m1.id FROM message m1 LEFT JOIN message m2 ON (m1.box_id = m2.box_id
AND m1.id < m2.id) WHERE m2.id IS NULL AND m1.box_id = id;

This worked quite well for a long time. But now, suddenly the performance of
this query drastically degraded as new messages were added.
If these new messages are removed again, everything is back to normal. If
other messages are removed instead, the problem remains, so it does not seem
to be a memory issue. I'm afraid I'm having difficulty understanding what is
going wrong here.

This is the query plan when everything is fine:

"Seq Scan on public.box this_  (cost=0.00..10467236.32 rows=128 width=696) (actual time=0.169..7683.978 rows=128 loops=1)"
"  Output: this_.id, this_.login, (SubPlan 1)"
"  Buffers: shared hit=188413 read=94635 written=135, temp read=22530 written=22374"
"  SubPlan 1"
"    ->  Hash Anti Join  (cost=41323.25..81775.25 rows=20427 width=8) (actual time=59.571..59.877 rows=1 loops=128)"
"          Output: m1.id"
"          Hash Cond: (m1.box_id = m2.box_id)"
"          Join Filter: (m1.id < m2.id)"
"          Buffers: shared hit=188412 read=94633 written=135, temp read=22530 written=22374"
"          ->  Bitmap Heap Scan on public.message m1  (cost=577.97..40212.28 rows=30640 width=16) (actual time=3.152..9.514 rows=17982 loops=128)"
"                Output: m1.id, m1.box_id"
"                Recheck Cond: (m1.box_id = $0)"
"                Buffers: shared hit=131993 read=9550 written=23"
"                ->  Bitmap Index Scan on "message_box_Idx"  (cost=0.00..570.31 rows=30640 width=0) (actual time=2.840..2.840 rows=18193 loops=128)"
"                      Index Cond: (m1.box_id = $0)"
"                      Buffers: shared hit=314 read=6433 written=23"
"          ->  Hash  (cost=40212.28..40212.28 rows=30640 width=16) (actual time=26.840..26.840 rows=20014 loops=115)"
"                Output: m2.box_id, m2.id"
"                Buckets: 4096  Batches: 4 (originally 2)  Memory Usage: 5444kB"
"                Buffers: shared hit=56419 read=85083 written=112, temp written=7767"
"                ->  Bitmap Heap Scan on public.message m2  (cost=577.97..40212.28 rows=30640 width=16) (actual time=2.419..20.007 rows=20014 loops=115)"
"                      Output: m2.box_id, m2.id"
"                      Recheck Cond: (m2.box_id = $0)"
"                      Buffers: shared hit=56419 read=85083 written=112"
"                      ->  Bitmap Index Scan on "message_box_Idx"  (cost=0.00..570.31 rows=30640 width=0) (actual time=2.166..2.166 rows=20249 loops=115)"
"                            Index Cond: (m2.box_id = $0)"
"                            Buffers: shared hit=6708"
"Total runtime: 7685.202 ms"

This is the plan when the query gets sluggish:

"Seq Scan on public.box this_  (cost=0.00..10467236.32 rows=128 width=696) (actual time=0.262..179333.086 rows=128 loops=1)"
"  Output: this_.id, this_.login, (SubPlan 1)"
"  Buffers: shared hit=189065 read=93983 written=10, temp read=22668 written=22512"
"  SubPlan 1"
"    ->  Hash Anti Join  (cost=41323.25..81775.25 rows=20427 width=8) (actual time=1264.700..1400.886 rows=1 loops=128)"
"          Output: m1.id"
"          Hash Cond: (m1.box_id = m2.box_id)"
"          Join Filter: (m1.id < m2.id)"
"          Buffers: shared hit=189064 read=93981 written=10, temp read=22668 written=22512"
"          ->  Bitmap Heap Scan on public.message m1  (cost=577.97..40212.28 rows=30640 width=16) (actual time=3.109..9.850 rows=18060 loops=128)"
"                Output: m1.id, m1.box_id"
"                Recheck Cond: (m1.box_id = $0)"
"                Buffers: shared hit=132095 read=9448"
"                ->  Bitmap Index Scan on "message_box_Idx"  (cost=0.00..570.31 rows=30640 width=0) (actual time=2.867..2.867 rows=18193 loops=128)"
"                      Index Cond: (m1.box_id = $0)"
"                      Buffers: shared hit=312 read=6435"
"          ->  Hash  (cost=40212.28..40212.28 rows=30640 width=16) (actual time=27.533..27.533 rows=20102 loops=115)"
"                Output: m2.box_id, m2.id"
"                Buckets: 4096  Batches: 4 (originally 2)  Memory Usage: 5522kB"
"                Buffers: shared hit=56969 read=84533 written=10, temp written=7811"
"                ->  Bitmap Heap Scan on public.message m2  (cost=577.97..40212.28 rows=30640 width=16) (actual time=2.406..20.492 rows=20102 loops=115)"
"                      Output: m2.box_id, m2.id"
"                      Recheck Cond: (m2.box_id = $0)"
"                      Buffers: shared hit=56969 read=84533 written=10"
"                      ->  Bitmap Index Scan on "message_box_Idx"  (cost=0.00..570.31 rows=30640 width=0) (actual time=2.170..2.170 rows=20249 loops=115)"
"                            Index Cond: (m2.box_id = $0)"
"                            Buffers: shared hit=6708"
"Total runtime: 179334.310 ms"

So from my limited experience, the only significant difference I see is that
the Hash Anti Join takes a lot more time in plan 2, but I do not understand
why.
Ideas somebody?

Thanks
panam

--
View this message in context: http://postgresql.1045698.n5.nabble.com/Hash-Anti-Join-performance-degradation-tp4420974p4420974.html
Sent from the PostgreSQL - performance mailing list archive at Nabble.com.


From: Craig Ringer <craig(at)postnewspapers(dot)com(dot)au>
To: panam <panam(at)gmx(dot)net>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Hash Anti Join performance degradation
Date: 2011-05-24 05:53:00
Message-ID: 4DDB47BC.8030504@postnewspapers.com.au

On 24/05/11 12:14, panam wrote:
> Hi,
>
> In my DB, there is a large table containing messages and one that contains
> message boxes.
> Messages are assigned to boxes via a child parent link m->b.
> In order to obtain the last message for a specific box, I use the following
> SQL:
>
> SELECT m1.id FROM message m1 LEFT JOIN message m2 ON (m1.box_id = m2.box_id
> AND m1.id < m2.id) WHERE m2.id IS NULL AND m1.box_id = id;
>
> This worked quite well for a long time. But now, suddenly the performance of
> this query drastically degraded as new messages were added.
> If these new messages are removed again, everything is back to normal. If
> other messages are removed instead, the problem remains, so it does not seem
> to be a memory issue. I'm afraid I'm having difficulty understanding what
> is going wrong here.

The usual cause is that the statistics for estimated row counts cross a
threshold that makes the query planner think that a different kind of
plan will be faster.

If the query planner is using bad information about the performance of
the storage, then it will be making bad decisions about which approach
is faster. So the usual thing to do is to adjust seq_page_cost and
random_page_cost to more closely reflect the real performance of your
hardware, and to make sure that effective_cache_size matches the real
amount of memory your computer has free for disk cache use.
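As a rough illustration only, such an adjustment might look like the
following postgresql.conf fragment. The values shown are hypothetical; the
right numbers depend on your actual storage and RAM and should be measured,
not copied:

```
# postgresql.conf (illustrative values only)
seq_page_cost = 1.0           # baseline cost of a sequential page read
random_page_cost = 2.0        # lower than the 4.0 default if storage is
                              # fast or the working set is mostly cached
effective_cache_size = 4GB    # roughly the RAM the OS can use as disk cache
```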

Newer versions of PostgreSQL generally include query planner and
statistics improvements as well.

BTW, it can be really helpful to paste your query plans into
http://explain.depesz.com/ , which will provide an easier to read visual
analysis of the plan. This will only work with query plans that haven't
been butchered by mail client word wrapping, so I can't do it for you,
but if you paste them there and post the links that'd be really handy.

Also have a look at http://wiki.postgresql.org/wiki/SlowQueryQuestions

I found the plans you posted a bit hard to read. Not your fault; it's
stupid mail clients. Maybe depesz.com needs to be taught to de-munge
the damage done to plans by common mail clients.

--
Craig Ringer


From: Cédric Villemain <cedric(dot)villemain(dot)debian(at)gmail(dot)com>
To: panam <panam(at)gmx(dot)net>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Hash Anti Join performance degradation
Date: 2011-05-24 11:16:44
Message-ID: BANLkTi=EB-LzgBaFy7BEzKzuaDNmrVX9uw@mail.gmail.com

2011/5/24 panam <panam(at)gmx(dot)net>:
> Hi,
>
> In my DB, there is a large table containing messages and one that contains
> message boxes.
> Messages are assigned to boxes via a child parent link m->b.
> In order to obtain the last message for a specific box, I use the following
> SQL:
>
> SELECT m1.id FROM message m1 LEFT JOIN message m2 ON (m1.box_id = m2.box_id
> AND m1.id < m2.id) WHERE m2.id IS NULL AND m1.box_id = id;
>
> This worked quite well for a long time. But now, suddenly the performance of
> this query drastically degraded as new messages were added.
> If these new messages are removed again, everything is back to normal. If
> other messages are removed instead, the problem remains, so it does not seem
> to be a memory issue. I'm afraid I'm having difficulty understanding what
> is going wrong here.

We need more information here; the case is in fact interesting.
What's the PostgreSQL version, and do you have logs of vacuum and
checkpoint activity? (No VACUUM FULL/CLUSTER or similar running?)

Obviously, Craig's suggestion to read
http://wiki.postgresql.org/wiki/SlowQueryQuestions is relevant, as it
helps to gather all the common information required to analyze the issue.

>
> [EXPLAIN output snipped]
>
> So from my limited experience, the only significant difference I see is that
> the Hash Anti Join takes a lot more time in plan 2, but I do not understand
> why.
> Ideas somebody?
>
> Thanks
> panam

--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support


From: panam <panam(at)gmx(dot)net>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Hash Anti Join performance degradation
Date: 2011-05-24 14:34:57
Message-ID: 1306247697267-4422247.post@n5.nabble.com

Hi Craig and Cédric,

Thanks for the very informative introduction to the netiquette here and
thanks for sharing your time.
I wasn't aware of http://explain.depesz.com/, very useful.
So, here are the query plans:
http://explain.depesz.com/s/6AU (1st from previous post, good)
http://explain.depesz.com/s/YPS (2nd from previous post, bad)

> The usual cause is that the statistics for estimated row counts cross a
> threshold that makes the query planner think that a different kind of
> plan will be faster.

Hm, as far as I understand the plans, they are equivalent, aren't they?

> If the query planner is using bad information about the performance of
> the storage, then it will be making bad decisions about which approach
> is faster. So the usual thing to do is to adjust seq_page_cost and
> random_page_cost to more closely reflect the real performance of your
> hardware, and to make sure that effective_cache_size matches the real
> amount of memory your computer has free for disk cache use.

Will this make any difference even when the plans are equivalent as assumed
above?

The table creation SQL is as follows:
http://pastebin.com/qFDUP7Aa (Message table); ~2328680 rows, growing
constantly (~10000 new rows each day)
http://pastebin.com/vEmh4hb8 (Box table); ~128 rows, growing very slowly
(1 row every two days, each row updated about twice a day)

The DB contains the same data, except that for the "good" query, the last
10976 rows (0.4%) of message are removed by doing a

DELETE FROM message where timestamp > TO_DATE ('05/23/2011','mm/dd/yyyy');

This speeds up the query by a factor of ~27 (207033.081 ms (bad) vs.
7683.978 ms (good)).

Each query was run before and after a vacuum analyze, one time to create
appropriate statistics, and the second time to do the actual measurement.
All tests were made on the dev machine, an 8 GB Core i7 system running
Windows 7.

I experienced the issue first in the "production" environment, 64-bit
Ubuntu running PostgreSQL 9.0.1 on x86_64-unknown-linux-gnu, compiled by
GCC gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-46), 64-bit,
and later, for analysis, in the dev environment, 64-bit Windows 7 running
PostgreSQL 9.0.4, compiled by Visual C++ build 1500, 64-bit.
For testing, I've increased the buffers that I judge important for the issue
to the following values:
effective_cache_size: 4GB
shared_buffers: 1GB
work_mem: 1GB
temp_buffers: 32MB
After that, the configuration was reloaded and the PostgreSQL service was
restarted using pgAdmin.
Interestingly, there was no performance gain compared to the default
settings; the "bad" query even took about 30 seconds (15%) longer.
It also seems that all data fit into memory, so there is not much disk I/O
involved.

@Cédric
> did you have log of vacuum and checkpoint activity ?
> (no vacuum full/cluster or such thing running ?)
There is no clustering involved here; it's a pretty basic setup.
How can I obtain the information you require here? I could send you the
output of the vacuum analyze command from pgAdmin, but is there a way to
make it output the information in English (rather than German)?

Thanks for your interest in this issue.

Regards,
panam



From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: panam <panam(at)gmx(dot)net>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Hash Anti Join performance degradation
Date: 2011-05-24 20:38:33
Message-ID: 15099.1306269513@sss.pgh.pa.us

panam <panam(at)gmx(dot)net> writes:
> In my DB, there is a large table containing messages and one that contains
> message boxes.
> Messages are assigned to boxes via a child parent link m->b.
> In order to obtain the last message for a specific box, I use the following
> SQL:

> SELECT m1.id FROM message m1 LEFT JOIN message m2 ON (m1.box_id = m2.box_id
> AND m1.id < m2.id) WHERE m2.id IS NULL AND m1.box_id = id;

BTW, this query doesn't actually match the EXPLAIN outputs...

> So from my limited experience, the only significant difference I see is that
> the Hash Anti Join takes a lot more time in plan 2, but I do not understand
> why.

Whatever's going on is below the level that EXPLAIN can show. I can
think of a couple of possibilities:

1. The "extra" rows in the slower case all manage to come out to the
same hash value, or some very small number of distinct hash values, such
that we spend a lot of time searching a single hash chain. But it's
hard to credit that adding 0.4% more rows could result in near 100x
slowdown, no matter how bad their distribution.
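For what it's worth, possibility 1 can be sketched with a toy model
(illustrative only, not PostgreSQL's actual implementation): a hash
anti-join keyed on box_id that applies the m1.id < m2.id filter while
walking a bucket chain. With a single box_id every inner row lands in one
bucket, and the number of chain entries touched depends heavily on the
order in which the chain happens to store the ids:

```python
# Toy hash anti-join: keep outer rows with no inner partner satisfying
# the join filter (m1.id < m2.id). Hash key is box_id. Purely
# illustrative; not PostgreSQL's implementation.
def hash_anti_join(outer, inner):
    buckets = {}
    for row in inner:                        # build phase
        buckets.setdefault(row[0], []).append(row)
    result, probes = [], 0
    for box_id, m1_id in outer:              # probe phase
        matched = False
        for _, m2_id in buckets.get(box_id, []):
            probes += 1                      # one chain entry examined
            if m1_id < m2_id:                # filter matches: anti-join
                matched = True               # can reject this outer row
                break
        if not matched:
            result.append((box_id, m1_id))
    return result, probes

rows = [(1, i) for i in range(1000)]         # one box, ascending ids
kept_a, probes_asc = hash_anti_join(rows, rows)
kept_d, probes_desc = hash_anti_join(rows, list(reversed(rows)))
# Same answer either way: only the newest message survives...
assert kept_a == kept_d == [(1, 999)]
# ...but an ascending chain needs ~n^2/2 comparisons, while a descending
# chain needs only ~2n, because most probes hit a larger id immediately.
assert probes_asc > 100 * probes_desc
```

In this toy, how far each probe walks the chain before the filter first
matches dominates the cost, which is one way a small change in the data's
physical order could have an outsized effect on runtime without changing
the plan at all.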

2. There's some inefficiency in the use of the temp files, though again
it's far from clear why your two cases would be noticeably different
there. Possibly enabling log_temp_files would tell you something useful.

One other thing I'm not following is how come it's using hash temp files
at all, when you claim in your later message that you've got work_mem
set to 1GB. It should certainly not take more than a couple meg to hold
20K rows at 16 payload bytes per row. You might want to check whether
that setting actually took effect.

regards, tom lane


From: Craig Ringer <craig(at)postnewspapers(dot)com(dot)au>
To: panam <panam(at)gmx(dot)net>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Hash Anti Join performance degradation
Date: 2011-05-25 01:53:13
Message-ID: 4DDC6109.7060708@postnewspapers.com.au

On 24/05/11 22:34, panam wrote:

>> The usual cause is that the statistics for estimated row counts cross a
>> threshold that makes the query planner think that a different kind of
>> plan will be faster.
>
> Hm, as far as I understand the plans, they are equivalent, aren't they?

Yes, they are, and the estimates are too. This isn't the usual case
where the planner trips over a threshold and switches to a totally
different plan type that it thinks is faster, but isn't.

The estimates are actually IDENTICAL for the hash anti-join node of
interest, and so are the actual loop count and row count. Temp file
activity is much the same across both plans too.

You can reproduce this behaviour consistently? It's _seriously_ weird,
and the sort of thing that, when I encounter it myself, makes me ask "what
else is going on that I'm missing?".

What happens if you DELETE more rows? Or fewer? What's the threshold?

What happens if you DELETE rows from the start not the end, or a random
selection?

Does the problem persist if you DELETE the rows then CLUSTER the table
before running the query?

Does the problem persist if you DELETE the rows then REINDEX?

>> If the query planner is using bad information about the performance of
>> the storage, then it will be making bad decisions about which approach
>> is faster. [snip]
>
> Will this make any difference even when the plans are equivalent as assumed
> above?

Nope. It doesn't seem to be a problem with plan selection.

> This speeds up the query by a factor of ~27. (207033.081 (bad) vs. 7683.978
> (good)).

That's a serious WTF.

> @Cédric
>> did you have log of vacuum and checkpoint activity ?
>> (no vacuum full/cluster or such thing running ?)
> There is no clustering involved here; it's a pretty basic setup.

He means 'CLUSTER', the SQL command that tells PostgreSQL to re-organize
a table.

The answer from the rest of your post would appear to be 'no, it's being
run in an otherwise-idle stand-alone test environment'. Right?

> How can I obtain the information you require here? I could send you the
> output of the analyse vacuum command from pgAdmin, but is there a way to
> make it output the information in English (rather than German)?

It's easy enough to read familiar output like that in German, if need be.

--
Craig Ringer


From: panam <panam(at)gmx(dot)net>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Hash Anti Join performance degradation
Date: 2011-05-25 16:42:29
Message-ID: 1306341749094-4425890.post@n5.nabble.com

Hi all,

@Tom,
> BTW, this query doesn't actually match the EXPLAIN outputs...
You're right, it is actually just the "heavy" subquery of a larger query
which can be found here:
http://pastebin.com/fuGrt0tB

> One other thing I'm not following is how come it's using hash temp files
> at all, when you claim in your later message that you've got work_mem
> set to 1GB. It should certainly not take more than a couple meg to hold
> 20K rows at 16 payload bytes per row. You might want to check whether
> that setting actually took effect.
As I said, I drastically increased the buffer sizes (at least I intended
to) to see if it changed anything. At first I thought it didn't. But
yesterday (I think it was after a reboot), the "bad" queries suddenly were
much faster (~20 secs, still at least 3 times slower than the "good"
queries though). Today, they were very slow again (I replayed the dumps in
between). So I am not sure whether Postgres actually picks up the altered
configuration (even after a reboot). Is there a way to determine the
values actually used?

@Craig
> You can reproduce this behaviour consistently? It's _seriously_ weird,
> and the sort of thing that when I encounter myself I tend to ask "what
> else is going on that I'm missing?".
Yes, I can reproduce it consistently using a dumpfile.

> What happens if you DELETE more rows? Or fewer? What's the threshold?
> What happens if you DELETE rows from the start not the end, or a random
> selection?
From some experiments I made earlier, I conclude that the rows added last
are the problem. Deleting the first 10% did not seem to speed up the bad
query. However, I haven't checked that systematically.

> Does the problem persist if you DELETE the rows then CLUSTER the table
> before running the query?
Wow, I wasn't aware of CLUSTER. Applying it (clustering on the id PK) to
the table causing the "bad" query worked wonders. It now needs just 4.3
secs, compared to 600 secs before (now with one day of data added since
the previous post), and 4.0 secs for the "good" query (also clustered),
which is faster than the unclustered "good" query (about 6-8 secs).

> Does the problem persist if you DELETE the rows then REINDEX?
No, not noticeably.

> The answer from the rest of your post would appear to be 'no, it's being
> run in an otherwise-idle stand-alone test environment'. Right?
Correct.

It seems my issue is solved (at least for now). My impression is that it was
just somehow "bad luck" that the rows, as originally inserted and as replayed
from the dumps, were kind of messed up with regard to their ids, especially -
it seems - the newly added ones. This is somehow consistent with the
peculiarities of the query, which contains a pairwise id comparison that
should greatly benefit from an ordered set of ids.

This also made me wonder how the internal plan is carried out. Is the engine
able to leverage the fact that a part/range of the rows is totally or
partially ordered on disk, e.g. using some kind of binary search or even
"nearest neighbor" search in that section (i.e. a special "micro-plan" or
algorithm)? Or is the speed-up "just" because related data is usually
"nearby" and most of the standard algorithms work best with clustered data?
If the first is not the case, would that be a potential point for
improvement? Maybe it would even be more efficient if there were some sort of
constraint that guaranteed "ordered row" sections on disk, i.e. preventing
the addition of a row that has an index value in between two row values of an
already ordered/clustered section. In the simplest case, it would start with
the "first" row and end with the "last" row (at the time of doing the
equivalent of CLUSTER). So there would be a small list saying that rows with
id x through id y are guaranteed to be ordered on disk (by id, for example),
now and for all time.
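To make the intuition concrete: the "last message per box" result that the
query computes is a max-per-group problem, and over data physically grouped
by box (as after CLUSTER) it could in principle be answered in one streaming
pass with constant per-group state. This is only a conceptual sketch of why
ordered sections are attractive; as far as I know, PostgreSQL's hash join
does not exploit physical order this way, so any clustering benefit comes
through other channels such as locality:

```python
# One streaming pass over rows that are grouped by box_id (e.g. after
# CLUSTER): constant state per group, each row touched exactly once.
# Conceptual sketch only; not how PostgreSQL executes the posted query.
def last_message_per_box(grouped_rows):
    out = {}
    cur_box = cur_max = None
    for box_id, msg_id in grouped_rows:
        if box_id != cur_box:                # group boundary reached
            if cur_box is not None:
                out[cur_box] = cur_max       # emit the finished group
            cur_box, cur_max = box_id, msg_id
        elif msg_id > cur_max:
            cur_max = msg_id
    if cur_box is not None:                  # flush the final group
        out[cur_box] = cur_max
    return out

grouped = [(1, 10), (1, 42), (1, 3), (2, 5), (2, 7)]
assert last_message_per_box(grouped) == {1: 42, 2: 7}
```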

So, would you like to further investigate my previous issue (I think it is
still strange that performance suddenly dropped that dramatically)?

Many thanks and regards,
panam



From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "panam" <panam(at)gmx(dot)net>,<pgsql-performance(at)postgresql(dot)org>
Subject: Re: Hash Anti Join performance degradation
Date: 2011-05-25 17:40:56
Message-ID: 4DDCF8D8020000250003DC5E@gw.wicourts.gov

panam <panam(at)gmx(dot)net> wrote:

> Is there a way to determine the values actually used?

The pg_settings view. Try the query shown here:

http://wiki.postgresql.org/wiki/Server_Configuration

-Kevin


From: Craig Ringer <craig(at)postnewspapers(dot)com(dot)au>
To: panam <panam(at)gmx(dot)net>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Hash Anti Join performance degradation
Date: 2011-05-26 02:40:22
Message-ID: 4DDDBD96.4010809@postnewspapers.com.au

On 05/26/2011 12:42 AM, panam wrote:

> So, would you like to further investigate my previous issue (I think it is
> still strange that performance suddenly dropped that dramatically)?

It's a bit beyond me, but I suspect that it'd be best if you could hang
onto the dump file in case someone has the time and enthusiasm to
investigate it. I take it you can't distribute the dump file, even
privately?

--
Craig Ringer


From: panam <panam(at)gmx(dot)net>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Hash Anti Join performance degradation
Date: 2011-05-26 12:33:37
Message-ID: 1306413217406-4428435.post@n5.nabble.com

Hi there,

Kevin Grittner wrote:
>
>> Is there a way to determine the values actually used?
> The pg_settings view. Try the query shown here:
> http://wiki.postgresql.org/wiki/Server_Configuration
>
Thanks Kevin, very useful. Here is the output:

"version";"PostgreSQL 9.0.4, compiled by Visual C++ build 1500, 64-bit"
"bytea_output";"escape"
"client_encoding";"UNICODE"
"effective_cache_size";"4GB"
"lc_collate";"German_Germany.1252"
"lc_ctype";"German_Germany.1252"
"listen_addresses";"*"
"log_destination";"stderr"
"log_line_prefix";"%t "
"logging_collector";"on"
"max_connections";"100"
"max_stack_depth";"2MB"
"port";"5432"
"server_encoding";"UTF8"
"shared_buffers";"1GB"
"temp_buffers";"4096"
"TimeZone";"CET"
"work_mem";"1GB"

Craig Ringer wrote:
>
> On 05/26/2011 12:42 AM, panam wrote:
> It's a bit beyond me, but I suspect that it'd be best if you could hang
> onto the dump file in case someone has the time and enthusiasm to
> investigate it. I take it you can't distribute the dump file, even
> privately?
>
Fortunately, I managed to reduce it to the absolute minimum (i.e. only
meaningless ids), and the issue is still observable.
You can download it from here:
http://www.zumodrive.com/file/460997770?key=cIdeODVlNz

Some things to try:
* tune your postgresql.conf settings if you want
* reindex, vacuum analyze if you want

"Pathological" query:

select
        b.id,
        (SELECT
                m1.id
        FROM
                message m1
        LEFT JOIN
                message m2
                        ON (
                                m1.box_id = m2.box_id
                                AND m1.id < m2.id
                        )
        WHERE
                m2.id IS NULL
                AND m1.box_id = b.id)
from
        box b

=> takes almost "forever" (~600 seconds on my system)

Try

delete from message where id > 2550000;

=> deletes 78404 rows
Do the "pathological" query again
=> speed is back (~4 seconds on my system)

Replay the dump
Try

delete from message where id < 1000000;

=> deletes 835844 rows (about 10 times more than before). Maybe you can
delete many more; I haven't tested this systematically.
Do the "pathological" query again
=> takes almost "forever" (didn't wait...)

Replay the dump
Cluster:

cluster message_pkey on message;

Do the "pathological" query again
=> speed is back (~3 seconds on my system)

Any third party confirmation?

Thanks
panam



From: Cédric Villemain <cedric(dot)villemain(dot)debian(at)gmail(dot)com>
To: panam <panam(at)gmx(dot)net>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Hash Anti Join performance degradation
Date: 2011-05-26 14:21:21
Message-ID: BANLkTin6XpuCUCk97R6MNcuwd8bVvJ3ZcQ@mail.gmail.com

2011/5/26 panam <panam(at)gmx(dot)net>:
> Hi there,
>
>
> Kevin Grittner wrote:
>>
>>> Is there a way to determine the values actually used?
>> The pg_settings view.  Try the query shown here:
>> http://wiki.postgresql.org/wiki/Server_Configuration
>>
> Thanks Kevin, very useful. Here is the output:
>
> "version";"PostgreSQL 9.0.4, compiled by Visual C++ build 1500, 64-bit"
> "bytea_output";"escape"
> "client_encoding";"UNICODE"
> "effective_cache_size";"4GB"
> "lc_collate";"German_Germany.1252"
> "lc_ctype";"German_Germany.1252"
> "listen_addresses";"*"
> "log_destination";"stderr"
> "log_line_prefix";"%t "
> "logging_collector";"on"
> "max_connections";"100"
> "max_stack_depth";"2MB"
> "port";"5432"
> "server_encoding";"UTF8"
> "shared_buffers";"1GB"
> "temp_buffers";"4096"
> "TimeZone";"CET"
> "work_mem";"1GB"
>
>
> Craig Ringer wrote:
>>
>> On 05/26/2011 12:42 AM, panam wrote:
>> It's a bit beyond me, but I suspect that it'd be best if you could hang
>> onto the dump file in case someone has the time and enthusiasm to
>> investigate it. I take it you can't distribute the dump file, even
>> privately?
>>
> Fortunately, I managed to reduce it to the absolute minimum (i.e. only
> meaningless ids), and the issue is still observable.
> You can download it from here:
> http://www.zumodrive.com/file/460997770?key=cIdeODVlNz
>
> Some things to try:
> * tune your PostgreSQL settings if you want
> * REINDEX, VACUUM ANALYZE if you want
>
> "Pathological" query:
>
> select
>        b.id,
>        (SELECT
>                m1.id
>        FROM
>                message m1
>        LEFT JOIN
>                message m2
>                        ON (
>                                m1.box_id = m2.box_id
>                                AND m1.id < m2.id
>                        )
>        WHERE
>                m2.id IS NULL
>                AND m1.box_id = b.id)
> from
>        box b
>
> => takes almost "forever" (~600 seconds on my system)
>
> Try
>
> delete from message where id > 2550000;
>
> => deletes 78404 rows
> Do the "pathological" query again
> => speed is back (~4 seconds on my system)
>
> Replay the dump
> Try
>
> delete from message where id < 1000000;
>
> => deletes 835844 rows (about 10 times as many as before). Maybe you can
> delete many more; I haven't tested this systematically.
> Do the "pathological" query again
> => takes almost "forever" (didn't wait...)
>
> Replay the dump
> Cluster:
>
> cluster message_pkey on message;
>
> Do the "pathological" query again
> => speed is back (~3 seconds on my system)
>
> Any third party confirmation?

Without explaining further why the anti join performs badly without
CLUSTER, I wonder why you don't use this query:

SELECT b.id,
max(m.id)
FROM box b, message m
WHERE m.box_id = b.id
GROUP BY b.id;

It looks similar and should be fastest.

>
> Thanks
> panam
>

--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: Cédric Villemain <cedric(dot)villemain(dot)debian(at)gmail(dot)com>, "panam" <panam(at)gmx(dot)net>
Cc: <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Hash Anti Join performance degradation
Date: 2011-05-26 14:48:46
Message-ID: 4DDE21FE020000250003DD49@gw.wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

Cédric Villemain<cedric(dot)villemain(dot)debian(at)gmail(dot)com> wrote:
> 2011/5/26 panam <panam(at)gmx(dot)net>:

>> "max_connections";"100"

>> "work_mem";"1GB"

Each connection can allocate work_mem, potentially several times.
On a machine without hundreds of GB of RAM, that pair of settings
could cause severe swapping.

>> "Pathological" query:
>>
>> select
>> b.id,
>> (SELECT
>> m1.id
>> FROM
>> message m1
>> LEFT JOIN
>> message m2
>> ON (
>> m1.box_id = m2.box_id
>> AND m1.id < m2.id
>> )
>> WHERE
>> m2.id IS NULL
>> AND m1.box_id = b.id)
>> from
>> box b

> without explaining further why the antijoin has bad performance
> without cluster, I wonder why you don't use this query :
>
> SELECT b.id,
> max(m.id)
> FROM box b, message m
> WHERE m.box_id = b.id
> GROUP BY b.id;
>
> looks similar and fastest.

I think you would need a left join to actually get identical
results:

SELECT b.id, max(m.id)
FROM box b
LEFT JOIN message m ON m.box_id = b.id
GROUP BY b.id;

But yeah, I would expect this approach to be much faster. Rather
easier to understand and harder to get wrong, too.
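
The difference the outer join makes can be sketched with toy data in Python
(the ids and box contents here are invented, not taken from the thread):

```python
# A box with no messages disappears from the inner-join aggregate but
# survives the left join with a NULL (None) max id, matching what the
# original anti-join subquery returns for an empty box.
boxes = [1, 2, 3]
messages = {1: [10, 11], 2: [12]}        # box 3 has no messages

inner_join = {b: max(messages[b]) for b in boxes if b in messages}
left_join = {b: (max(messages[b]) if b in messages else None) for b in boxes}

print(inner_join)   # box 3 missing
print(left_join)    # box 3 present with None
```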

-Kevin


From: panam <panam(at)gmx(dot)net>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Hash Anti Join performance degradation
Date: 2011-05-26 16:08:07
Message-ID: 1306426087061-4429125.post@n5.nabble.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

Hi all,

Cédric Villemain-3 wrote:
>
> without explaining further why the antijoin has bad performance
> without cluster, I wonder why you don't use this query :
>
> SELECT b.id,
> max(m.id)
> FROM box b, message m
> WHERE m.box_id = b.id
> GROUP BY b.id;
>
> looks similar and fastest.
>
I actually used a similar strategy as a mitigation in the meantime
(while having the problem with the "left join" query discussed here).
It was
SELECT MAX(e.id) FROM event_message e WHERE e.box_id = id
and it performed worse than the "left join" query in the general
case (i.e. before my problems began).
At the end of this post is an explanation of why I think I cannot use the
solution you suggested above.

Kevin Grittner wrote:
>
> Each connection can allocate work_mem, potentially several times.
> On a machines without hundreds of GB of RAM, that pair of settings
> could cause severe swapping.
>
Indeed, thanks for the warning. These settings are not for production; they
are meant to rule out performance degradation caused by small cache sizes.

Kevin Grittner wrote:
>
> I think you would need a left join to actually get identical
> results:
>
> SELECT b.id, max(m.id)
> FROM box b
> LEFT JOIN message m ON m.box_id = b.id
> GROUP BY b.id;
>
> But yeah, I would expect this approach to be much faster. Rather
> easier to understand and harder to get wrong, too.
>
>
Correct, it is much faster, even with unclustered ids.
However, I think I cannot use it because of the way that query is generated
(by Hibernate).
The (simplified) base query is just

SELECT b.id FROM box b

the subquery

(SELECT m1.id FROM message m1
LEFT JOIN message m2
ON (m1.box_id = m2.box_id AND m1.id < m2.id )
WHERE m2.id IS NULL AND m1.box_id = b.id) as lastMessageId

is due to a Hibernate formula (containing more or less plain SQL) that
determines the last message id for that box. It must return just one row,
not multiple. So I am constrained to a subquery in all optimization
attempts (I cannot combine the queries as you did), at least I do not see
how. If you have an idea for a more performant subquery, though, let me
know, as it can easily be replaced.

Thanks for your help and suggestions
panam

--
View this message in context: http://postgresql.1045698.n5.nabble.com/Hash-Anti-Join-performance-degradation-tp4420974p4429125.html
Sent from the PostgreSQL - performance mailing list archive at Nabble.com.


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "panam" <panam(at)gmx(dot)net>,<pgsql-performance(at)postgresql(dot)org>
Subject: Re: Hash Anti Join performance degradation
Date: 2011-05-26 17:08:18
Message-ID: 4DDE42B2020000250003DD5E@gw.wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

panam <panam(at)gmx(dot)net> wrote:

> I cannot use it because of the way that query is generated
> (by hibernate).
>
> The (simplified) base query is just
>
> SELECT b.id FROM box b
>
> the subquery
>
> (SELECT m1.id FROM message m1
> LEFT JOIN message m2
> ON (m1.box_id = m2.box_id AND m1.id < m2.id )
> WHERE m2.id IS NULL AND m1.box_id = b.id) as lastMessageId
>
> is due to a hibernate formula (containing more or less plain SQL)
> to determine the last message id for that box. It ought to return
> just one row, not multiple. So I am constrained to the subquery in
> all optimization attempts (I cannot combine them as you did), at
> least I do not see how. If you have an idea for a more performant
> subquery though, let me know, as this can easily be replaced.

Maybe:

(SELECT max(m1.id) FROM message m1 WHERE m1.box_id = b.id)

-Kevin


From: panam <panam(at)gmx(dot)net>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Hash Anti Join performance degradation
Date: 2011-05-26 18:04:35
Message-ID: 1306433075215-4429475.post@n5.nabble.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

Sorry,

SELECT MAX(e.id) FROM event_message e WHERE e.box_id = id

as posted previously should actually read

(SELECT max(m1.id) FROM message m1 WHERE m1.box_id = b.id)

so I tried this already.

Regards,
panam

--
View this message in context: http://postgresql.1045698.n5.nabble.com/Hash-Anti-Join-performance-degradation-tp4420974p4429475.html
Sent from the PostgreSQL - performance mailing list archive at Nabble.com.


From: Cédric Villemain <cedric(dot)villemain(dot)debian(at)gmail(dot)com>
To: panam <panam(at)gmx(dot)net>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Hash Anti Join performance degradation
Date: 2011-05-26 18:13:22
Message-ID: BANLkTimqewBM=G47Qdp+6D7nsBQG8FN7kA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

2011/5/26 panam <panam(at)gmx(dot)net>:
> Hi all,
>
>
> Cédric Villemain-3 wrote:
>>
>> without explaining further why the antijoin has bad performance
>> without cluster, I wonder why you don't use this query :
>>
>> SELECT  b.id,
>>                   max(m.id)
>> FROM box b, message m
>> WHERE m.box_id = b.id
>> GROUP BY b.id;
>>
>> looks similar and fastest.
>>
> I actually did use a similar strategy in the meantime (during my problem
> with the "left join" query we are talking about here all the time) for
> mitigation.
> It was
> SELECT MAX(e.id) FROM event_message e WHERE e.box_id = id
> and it performed worse in comparison to the "left join" query in the general
> case (i.e. before my problems began).
> At the end of this post is an explanation why I think I cannot use the
> solution you suggested above.
>
>
> Kevin Grittner wrote:
>>
>>  Each connection can allocate work_mem, potentially several times.
>> On a machine without hundreds of GB of RAM, that pair of settings
>> could cause severe swapping.
>>
> Indeed, thanks for the warning. These settings are not for production but to
> exclude a performance degradation because of small cache sizes.
>
>
> Kevin Grittner wrote:
>>
>> I think you would need a left join to actually get identical
>> results:
>>
>> SELECT  b.id, max(m.id)
>>   FROM box b
>>   LEFT JOIN message m ON m.box_id = b.id
>>   GROUP BY b.id;
>>
>> But yeah, I would expect this approach to be much faster.  Rather
>> easier to understand and harder to get wrong, too.
>>
>>
> Correct, it is much faster, even with unclustered ids.
> However, I think I cannot use it because of the way that query is generated
> (by hibernate).
> The (simplified) base query is just
>
> SELECT b.id FROM box b
>
> the subquery
>
> (SELECT  m1.id FROM message m1
>   LEFT JOIN message m2
>      ON (m1.box_id = m2.box_id  AND m1.id < m2.id )
>   WHERE m2.id IS NULL AND m1.box_id = b.id) as lastMessageId
>
> is due to a hibernate formula (containing more or less plain SQL) to
> determine the last message id for that box. It ought to return just one row,
> not multiple. So I am constrained to the subquery in all optimization
> attempts (I cannot combine them as you did), at least I do not see how. If
> you have an idea for a more performant subquery though, let me know, as this
> can easily be replaced.

In production, if you have a decent IO system, you can lower
random_page_cost and the query may be faster using an index (by default,
with the use case you provided, it chooses a seqscan). It can be a bit
tricky if you have to lower random_page_cost so much that it destroys
other query plans while improving the current one; if that happens,
post again :) (sometimes you also need to change other cost parameters,
but that has to be handled with care).

I am not a Hibernate expert, but I'd be surprised if you could not drive
Hibernate to do what you want.

>
> Thanks for your help and suggestions
> panam
>

--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support


From: Craig Ringer <craig(at)postnewspapers(dot)com(dot)au>
To: Cédric Villemain <cedric(dot)villemain(dot)debian(at)gmail(dot)com>
Cc: panam <panam(at)gmx(dot)net>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Hash Anti Join performance degradation
Date: 2011-05-27 00:00:52
Message-ID: 4DDEE9B4.7090603@postnewspapers.com.au
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

On 05/27/2011 02:13 AM, Cédric Villemain wrote:

> I am not a Hibernate expert, but I'd be surprised if you could not drive
> Hibernate to do what you want.

If nothing else, you can do a native query in hand-written SQL through
Hibernate. ORMs are useful tools for some jobs, but it's good to be able
to bypass them when needed too.

--
Craig Ringer


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: panam <panam(at)gmx(dot)net>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Hash Anti Join performance degradation
Date: 2011-05-31 21:58:08
Message-ID: BANLkTinwJ69W4i6JXVGqodbzZdTZ+SSzyA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

On Thu, May 26, 2011 at 8:33 AM, panam <panam(at)gmx(dot)net> wrote:
> Any third party confirmation?

Yeah, it definitely looks like there is some kind of bug here. Or if
not a bug, then a very surprising feature. EXPLAIN ANALYZE outputs
from your proposed test attached. Here's a unified diff of the two
outputs:

QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------
- Seq Scan on box b (cost=0.00..3669095.76 rows=128 width=8) (actual
time=0.147..431517.693 rows=128 loops=1)
+ Seq Scan on box b (cost=0.00..3669095.76 rows=128 width=8) (actual
time=0.047..6938.165 rows=128 loops=1)
SubPlan 1
- -> Hash Anti Join (cost=14742.77..28664.79 rows=19239 width=8)
(actual time=2960.176..3370.425 rows=1 loops=128)
+ -> Hash Anti Join (cost=14742.77..28664.79 rows=19239 width=8)
(actual time=48.385..53.361 rows=1 loops=128)
Hash Cond: (m1.box_id = m2.box_id)
Join Filter: (m1.id < m2.id)
- -> Bitmap Heap Scan on message m1 (cost=544.16..13696.88
rows=28858 width=16) (actual time=2.320..6.204 rows=18487 loops=128)
+ -> Bitmap Heap Scan on message m1 (cost=544.16..13696.88
rows=28858 width=16) (actual time=1.928..5.502 rows=17875 loops=128)
Recheck Cond: (box_id = b.id)
- -> Bitmap Index Scan on "message_box_Idx"
(cost=0.00..536.94 rows=28858 width=0) (actual time=2.251..2.251
rows=18487 loops=128)
+ -> Bitmap Index Scan on "message_box_Idx"
(cost=0.00..536.94 rows=28858 width=0) (actual time=1.797..1.797
rows=18487 loops=128)
Index Cond: (box_id = b.id)
- -> Hash (cost=13696.88..13696.88 rows=28858 width=16)
(actual time=12.632..12.632 rows=19720 loops=120)
- Buckets: 4096 Batches: 4 (originally 2) Memory Usage: 1787kB
- -> Bitmap Heap Scan on message m2
(cost=544.16..13696.88 rows=28858 width=16) (actual time=1.668..6.619
rows=19720 loops=120)
+ -> Hash (cost=13696.88..13696.88 rows=28858 width=16)
(actual time=11.603..11.603 rows=20248 loops=113)
+ Buckets: 4096 Batches: 4 (originally 2) Memory Usage: 1423kB
+ -> Bitmap Heap Scan on message m2
(cost=544.16..13696.88 rows=28858 width=16) (actual time=1.838..6.886
rows=20248 loops=113)
Recheck Cond: (box_id = b.id)
- -> Bitmap Index Scan on "message_box_Idx"
(cost=0.00..536.94 rows=28858 width=0) (actual time=1.602..1.602
rows=19720 loops=120)
+ -> Bitmap Index Scan on "message_box_Idx"
(cost=0.00..536.94 rows=28858 width=0) (actual time=1.743..1.743
rows=20903 loops=113)
Index Cond: (box_id = b.id)
- Total runtime: 431520.186 ms
+ Total runtime: 6940.369 ms

That's pretty odd.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
ea1.txt text/plain 1.6 KB
ea2.txt text/plain 1.6 KB

From: Cédric Villemain <cedric(dot)villemain(dot)debian(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: panam <panam(at)gmx(dot)net>, pgsql-performance(at)postgresql(dot)org, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PERFORM] Hash Anti Join performance degradation
Date: 2011-06-01 00:43:05
Message-ID: BANLkTi==zbH2kepG3OmR=Nt0ArxMvWgwdA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

2011/5/31 Robert Haas <robertmhaas(at)gmail(dot)com>:
> On Thu, May 26, 2011 at 8:33 AM, panam <panam(at)gmx(dot)net> wrote:
>> Any third party confirmation?
>
> Yeah, it definitely looks like there is some kind of bug here.  Or if
> not a bug, then a very surprising feature.  EXPLAIN ANALYZE outputs
> from your proposed test attached.  Here's a unified diff of the two
> outputs:
>
>
>  QUERY PLAN
>  ----------------------------------------------------------------------------------------------------------------------------------------------------------
> - Seq Scan on box b  (cost=0.00..3669095.76 rows=128 width=8) (actual
> time=0.147..431517.693 rows=128 loops=1)
> + Seq Scan on box b  (cost=0.00..3669095.76 rows=128 width=8) (actual
> time=0.047..6938.165 rows=128 loops=1)
>    SubPlan 1
> -     ->  Hash Anti Join  (cost=14742.77..28664.79 rows=19239 width=8)
> (actual time=2960.176..3370.425 rows=1 loops=128)
> +     ->  Hash Anti Join  (cost=14742.77..28664.79 rows=19239 width=8)
> (actual time=48.385..53.361 rows=1 loops=128)
>            Hash Cond: (m1.box_id = m2.box_id)
>            Join Filter: (m1.id < m2.id)
> -           ->  Bitmap Heap Scan on message m1  (cost=544.16..13696.88
> rows=28858 width=16) (actual time=2.320..6.204 rows=18487 loops=128)
> +           ->  Bitmap Heap Scan on message m1  (cost=544.16..13696.88
> rows=28858 width=16) (actual time=1.928..5.502 rows=17875 loops=128)
>                  Recheck Cond: (box_id = b.id)
> -                 ->  Bitmap Index Scan on "message_box_Idx"
> (cost=0.00..536.94 rows=28858 width=0) (actual time=2.251..2.251
> rows=18487 loops=128)
> +                 ->  Bitmap Index Scan on "message_box_Idx"
> (cost=0.00..536.94 rows=28858 width=0) (actual time=1.797..1.797
> rows=18487 loops=128)
>                        Index Cond: (box_id = b.id)
> -           ->  Hash  (cost=13696.88..13696.88 rows=28858 width=16)
> (actual time=12.632..12.632 rows=19720 loops=120)
> -                 Buckets: 4096  Batches: 4 (originally 2)  Memory Usage: 1787kB
> -                 ->  Bitmap Heap Scan on message m2
> (cost=544.16..13696.88 rows=28858 width=16) (actual time=1.668..6.619
> rows=19720 loops=120)
> +           ->  Hash  (cost=13696.88..13696.88 rows=28858 width=16)
> (actual time=11.603..11.603 rows=20248 loops=113)
> +                 Buckets: 4096  Batches: 4 (originally 2)  Memory Usage: 1423kB
> +                 ->  Bitmap Heap Scan on message m2
> (cost=544.16..13696.88 rows=28858 width=16) (actual time=1.838..6.886
> rows=20248 loops=113)
>                        Recheck Cond: (box_id = b.id)
> -                       ->  Bitmap Index Scan on "message_box_Idx"
> (cost=0.00..536.94 rows=28858 width=0) (actual time=1.602..1.602
> rows=19720 loops=120)
> +                       ->  Bitmap Index Scan on "message_box_Idx"
> (cost=0.00..536.94 rows=28858 width=0) (actual time=1.743..1.743
> rows=20903 loops=113)
>                              Index Cond: (box_id = b.id)
> - Total runtime: 431520.186 ms
> + Total runtime: 6940.369 ms
>
> That's pretty odd.

Yes, and while here I noticed that the query took a long time to be killed.
I added a CHECK_FOR_INTERRUPTS() in the for(;;) loop in nodeHashjoin.c.
It fixes the delay when trying to kill the query, but I don't know what
performance impact this can have in this place of the code.

--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support

Attachment Content-Type Size
fix_hardtokill_hash.patch text/x-patch 506 bytes

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Cédric Villemain <cedric(dot)villemain(dot)debian(at)gmail(dot)com>
Cc: panam <panam(at)gmx(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PERFORM] Hash Anti Join performance degradation
Date: 2011-06-01 00:55:38
Message-ID: BANLkTimRjbzjqA13MLtjLf1XCini_UAAQw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

On Tue, May 31, 2011 at 8:43 PM, Cédric Villemain
<cedric(dot)villemain(dot)debian(at)gmail(dot)com> wrote:
> Yes, while here I noticed that the query was long to be killed.
> I added a CHECK_FOR_INTERRUPTS() in the for(;;) loop in nodeHashjoin.c.
> It fixes the delay when trying to kill but I don't know about
> performance impact this can have in this place of the code.

Well, seems easy enough to find out: just test the query with and
without your patch (and without casserts). If there's no measurable
difference on this query, there probably won't be one anywhere.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Cédric Villemain <cedric(dot)villemain(dot)debian(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: panam <panam(at)gmx(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PERFORM] Hash Anti Join performance degradation
Date: 2011-06-01 01:11:16
Message-ID: BANLkTikDKUF53_MhP-T9y9NM7a8z40djXA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

2011/6/1 Robert Haas <robertmhaas(at)gmail(dot)com>:
> On Tue, May 31, 2011 at 8:43 PM, Cédric Villemain
> <cedric(dot)villemain(dot)debian(at)gmail(dot)com> wrote:
>> Yes, while here I noticed that the query was long to be killed.
>> I added a CHECK_FOR_INTERRUPTS() in the for(;;) loop in nodeHashjoin.c.
>> It fixes the delay when trying to kill but I don't know about
>> performance impact this can have in this place of the code.
>
> Well, seems easy enough to find out: just test the query with and
> without your patch (and without casserts).  If there's no measurable
> difference on this query, there probably won't be one anywhere.

Oh damned, I am currently on an eeepc, I'll need 2 days to bench that :-D
I'll see tomorrow.

--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Cédric Villemain <cedric(dot)villemain(dot)debian(at)gmail(dot)com>
Cc: panam <panam(at)gmx(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PERFORM] Hash Anti Join performance degradation
Date: 2011-06-01 02:58:16
Message-ID: BANLkTinbb35QT_57EozGOnszBWuarBgW1g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

On Tue, May 31, 2011 at 9:11 PM, Cédric Villemain
<cedric(dot)villemain(dot)debian(at)gmail(dot)com> wrote:
> Oh damned, I am currently on an eeepc, I'll need 2 days to bench that :-D
> I'll see tomorrow.

LOL.

With respect to the root of the issue (why does the anti-join take so
long?), my first thought was that perhaps the OP was very unlucky and
had a lot of values that hashed to the same bucket. But that doesn't
appear to be the case. There are 120 different box_id values in the
message table, and running hashint8(box_id) % 16384 (which I think is
the right calculation: 4096 buckets * 4 batches) yields 120 different
values.
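
The shape of that collision check can be sketched in Python; the built-in
`hash()` stands in for PostgreSQL's `hashint8()`, and the box_id values
are invented, so this only illustrates the form of the test, not the
actual data:

```python
# Map each distinct key to a bucket and count distinct buckets.
# 4096 buckets * 4 batches = 16384 effective slots, as in the plan.
NBUCKETS = 16384

box_ids = list(range(69000, 69120))          # 120 invented distinct box_ids
buckets = {hash(b) % NBUCKETS for b in box_ids}

# 120 keys landing in 120 distinct buckets means no two box_ids share a
# hash chain, so bucket collisions cannot explain the slowdown.
print(len(box_ids), len(buckets))
```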

I think that part of the problem here is that the planner has no
particularly efficient way of executing a non-equijoin. Each probe
finds the appropriate hash bucket and must then probe the entire
chain, all of which pass the hash condition and some of which fail the
join qual. So if there are n_1 instances of value v_1, n_2 instances
of value v_2, etc. then the total effort is proportional to n_1^2 +
n_2^2 + ...
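
That cost estimate can be written down as a toy model (the group sizes
below are invented purely for illustration):

```python
# Probe-cost model for the non-equijoin: each probe for a key walks the
# whole chain of rows sharing that key's bucket, so the total work is
# the sum of the squared group sizes.
from collections import Counter

def probe_cost(keys):
    counts = Counter(keys)
    return sum(n * n for n in counts.values())

print(probe_cost([1, 1, 2, 2, 3, 3]))   # three groups of 2: 4 + 4 + 4 = 12
print(probe_cost([1] * 5 + [2]))        # one dominant group: 25 + 1 = 26
```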

But that doesn't really explain the problem because removing the last
few rows only changes that sum by a small amount.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Cédric Villemain <cedric(dot)villemain(dot)debian(at)gmail(dot)com>, panam <panam(at)gmx(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PERFORM] Hash Anti Join performance degradation
Date: 2011-06-01 03:47:42
Message-ID: 7913.1306900062@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> With respect to the root of the issue (why does the anti-join take so
> long?), my first thought was that perhaps the OP was very unlucky and
> had a lot of values that hashed to the same bucket. But that doesn't
> appear to be the case.

Well, yes it is. Notice what the subquery is doing: for each row in
"box", it's pulling all matching "box_id"s from message and running a
self-join across those rows. The hash join condition is a complete
no-op. And some of the box_ids have hundreds of thousands of rows.

I'd just write it off as being a particularly stupid way to find the
max(), except I'm not sure why deleting just a few thousand rows
improves things so much. It looks like it ought to be an O(N^2)
situation, so the improvement should be noticeable but not amazing.

And if you force it to not use a hashjoin, suddenly things are better.
Nestloop should also be O(N^2) in this situation, but seemingly it
avoids whatever weird corner case we are hitting here.

As Cedric says, the lack of any CHECK_FOR_INTERRUPTS in this loop is
also problematic. I'm not sure that right there is an ideal place
to put it, but we need one somewhere in the loop.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Cédric Villemain <cedric(dot)villemain(dot)debian(at)gmail(dot)com>, panam <panam(at)gmx(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PERFORM] Hash Anti Join performance degradation
Date: 2011-06-01 11:40:27
Message-ID: BANLkTim-DqDC2AbVJ_1t-XAS4NYq2tQYZg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

On Tue, May 31, 2011 at 11:47 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> With respect to the root of the issue (why does the anti-join take so
>> long?), my first thought was that perhaps the OP was very unlucky and
>> had a lot of values that hashed to the same bucket.  But that doesn't
>> appear to be the case.
>
> Well, yes it is.  Notice what the subquery is doing: for each row in
> "box", it's pulling all matching "box_id"s from message and running a
> self-join across those rows.  The hash join condition is a complete
> no-op.  And some of the box_ids have hundreds of thousands of rows.
>
> I'd just write it off as being a particularly stupid way to find the
> max(), except I'm not sure why deleting just a few thousand rows
> improves things so much.  It looks like it ought to be an O(N^2)
> situation, so the improvement should be noticeable but not amazing.

Yeah, this is what I was getting at, though perhaps I didn't say it
well. If the last 78K rows were particularly pathological in some
way, that might explain something, but as far as one can see they are
not a whole heck of a lot different from the rest of the data.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: panam <panam(at)gmx(dot)net>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PERFORM] Hash Anti Join performance degradation
Date: 2011-06-01 12:40:55
Message-ID: 1306932055085-4445123.post@n5.nabble.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance


Tom Lane-2 wrote:
>
> It looks like it ought to be an O(N^2)
> situation, so the improvement should be noticeable but not amazing.
>

Hm, the performance was reasonable again after doing a CLUSTER...
So I believe this is more a technical than an algorithmic/complexity
issue. Maybe it is the way the hash table is built, and that order makes
a difference in that case? In short: why is clustered data not affected?

Regards,
panam

--
View this message in context: http://postgresql.1045698.n5.nabble.com/Re-PERFORM-Hash-Anti-Join-performance-degradation-tp4443803p4445123.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Cédric Villemain <cedric(dot)villemain(dot)debian(at)gmail(dot)com>, panam <panam(at)gmx(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PERFORM] Hash Anti Join performance degradation
Date: 2011-06-01 20:25:56
Message-ID: 24247.1306959956@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Tue, May 31, 2011 at 11:47 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> I'd just write it off as being a particularly stupid way to find the
>> max(), except I'm not sure why deleting just a few thousand rows
>> improves things so much. It looks like it ought to be an O(N^2)
>> situation, so the improvement should be noticeable but not amazing.

> Yeah, this is what I was getting at, though perhaps I didn't say it
> well. If the last 78K rows were particularly pathological in some
> way, that might explain something, but as far as one can see they are
> not a whole heck of a lot different from the rest of the data.

Well, I wasted a lot more time on this than I should have, but the
answer is: it's a pathological data ordering.

The given data file loads the message rows approximately in "id" order,
and in fact once you lop off the ones with id > 2550000, it's
sufficiently in order that the highest id is also physically last in
the table, at least for all of the box_ids that have large numbers of
entries. Now, the tested query plan loads the hash table like this:

   ->  Hash  (cost=13685.86..13685.86 rows=28511 width=16) (actual time=176.286..176.286 rows=211210 loops=1)
         Buckets: 4096  Batches: 1  Memory Usage: 9901kB
         ->  Bitmap Heap Scan on message m2  (cost=537.47..13685.86 rows=28511 width=16) (actual time=23.204..124.624 rows=211210 loops=1)
               Recheck Cond: (box_id = $1)
               ->  Bitmap Index Scan on "message_box_Idx"  (cost=0.00..530.34 rows=28511 width=0) (actual time=21.974..21.974 rows=211210 loops=1)
                     Index Cond: (box_id = $1)

Because of the way that a bitmap heap scan works, the rows are
guaranteed to be loaded into the hash table in physical order, which
means (in the fast case) that the row with the largest "id" value gets
loaded last. And because ExecHashTableInsert pushes each row onto the
front of its hash chain, that row ends up at the front of the hash
chain. Net result: for all the outer rows that aren't the one with
maximum id, we get a joinqual match at the very first entry in the hash
chain. Since it's an antijoin, we then reject that outer row and go
on to the next. The join thus ends up costing us only O(N) time.

However, with the additional rows in place, there are a significant
number of outer rows that don't get a match at the first hash chain
entry, and we're spending more like O(N^2) time. I instrumented things
for the specific case of box_id = 69440, which is the most common
box_id, and got these results:

  2389 got match of join quals at probe 208077
     1 got match of join quals at probe 1
   175 got match of join quals at probe 208077
   273 got match of join quals at probe 1
    21 got match of join quals at probe 208077
     1 got match of join quals at probe 1
    24 got match of join quals at probe 208077
     6 got match of join quals at probe 1
   157 got match of join quals at probe 208077
     1 got match of join quals at probe 1
    67 got match of join quals at probe 208077
    18 got match of join quals at probe 1
     1 generate null-extended tuple after 211211 probes
208075 got match of join quals at probe 1
     1 got match of join quals at probe 208077

(This is a "uniq -c" summary of a lot of printfs, so the first column
is the number of consecutive occurrences of the same printout.) Even
though a large majority of the outer rows still manage to find a match
at the first probe, there remain about 2800 that don't match there,
because they've got pretty big ids, and so they traipse through the hash
chain until they find the genuinely largest id, which is unfortunately
way down there ---- the 208077'th chain entry in fact. That results in
about half a billion more ExecScanHashBucket and ExecQual calls than
occur in the easy case (and that's just for this one box_id).

So it's not that the full data set is pathologically bad, it's that the
reduced set is pathologically good. O(N^2) performance is what you
should expect for this query, and that's what you're actually getting
with the full data set.

Also, I noted earlier that performance seemed a good deal better with a
NestLoop plan. The reason for that is that NestLoop doesn't have the
reversal of inner row ordering that's caused by prepending entries to
the hash chain, so the very largest row id isn't 208077 entries into the
list for it, but only 211211-208077 = 3134 entries in. But it still
manages to eliminate most outer rows at the first probe, because there's
a fairly large value at that end of the dataset too.
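The positional difference is easy to see in a toy model (hypothetical toy data, not the real executor): a nestloop rescans the inner rows in load order, while hash-chain insertion reverses that order, so a maximum loaded early sits near the front for the nestloop but near the back of the hash chain.

```python
def hash_chain_order(load_order):
    """Hash chain: each insert prepends, so the scan order is reversed."""
    return list(reversed(load_order))

def nestloop_order(load_order):
    """Nestloop: inner rows are rescanned in their original load order."""
    return list(load_order)

# A mostly ascending load with the largest id a few rows in from the start,
# loosely mimicking the thread's numbers (nestloop reaches the max at entry
# 3134, the hash chain only at entry 208077 of 211211).
load = [1, 2, 100, 3, 4, 5, 6, 7, 8, 9]

nest_pos = nestloop_order(load).index(max(load)) + 1    # near the front
hash_pos = hash_chain_order(load).index(max(load)) + 1  # near the back
```

The two positions always mirror each other: hash_pos equals len(load) - nest_pos + 1, which is exactly the 211211 - 208077 = 3134 relationship above.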

I don't see anything much that we could or should do about this. It
just depends on the order in which things appear in the hash chain,
and even if we fooled around with that ordering, we'd only be moving
the improbably-lucky behavior from one case to some other case.

We do need to look into putting a CHECK_FOR_INTERRUPTS call in here
somewhere, though. I'm inclined to think that right before the
ExecScanHashBucket is the best place. The reason that nest and merge
joins don't show a comparable non-responsiveness to cancels is that they
always call a child plan node at the equivalent place, and ExecProcNode
has got a CHECK_FOR_INTERRUPTS. So we ought to check for interrupts
at the point of "fetching a tuple from the inner child plan", and
ExecScanHashBucket is the equivalent thing in this logic. Cedric's
suggestion of putting it before the switch would get the job done, but
it would result in wasting cycles during unimportant transitions from
one state machine state to another.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Cédric Villemain <cedric(dot)villemain(dot)debian(at)gmail(dot)com>, panam <panam(at)gmx(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PERFORM] Hash Anti Join performance degradation
Date: 2011-06-01 20:31:34
Message-ID: BANLkTim=Tp55eM1Ch6VMtkHT5Tr=Xvnq5g@mail.gmail.com
Lists: pgsql-hackers pgsql-performance

On Wed, Jun 1, 2011 at 4:25 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Because of the way that a bitmap heap scan works, the rows are
> guaranteed to be loaded into the hash table in physical order, which
> means (in the fast case) that the row with the largest "id" value gets
> loaded last.  And because ExecHashTableInsert pushes each row onto the
> front of its hash chain, that row ends up at the front of the hash
> chain.  Net result: for all the outer rows that aren't the one with
> maximum id, we get a joinqual match at the very first entry in the hash
> chain.  Since it's an antijoin, we then reject that outer row and go
> on to the next.  The join thus ends up costing us only O(N) time.

Ah! Makes sense. If I'm reading your explanation right, this means
that we could have hit a similar pathological case on a nestloop as
well, just with a data ordering that is the reverse of what we have
here?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Cédric Villemain <cedric(dot)villemain(dot)debian(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, panam <panam(at)gmx(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PERFORM] Hash Anti Join performance degradation
Date: 2011-06-01 20:34:38
Message-ID: BANLkTi=Bn=ZiZrEWZmFNiTMCbjB=53j3Ew@mail.gmail.com
Lists: pgsql-hackers pgsql-performance

2011/6/1 Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>:
> We do need to look into putting a CHECK_FOR_INTERRUPTS call in here
> somewhere, though.  I'm inclined to think that right before the
> ExecScanHashBucket is the best place.  The reason that nest and merge
> joins don't show a comparable non-responsiveness to cancels is that they
> always call a child plan node at the equivalent place, and ExecProcNode
> has got a CHECK_FOR_INTERRUPTS.  So we ought to check for interrupts
> at the point of "fetching a tuple from the inner child plan", and
> ExecScanHashBucket is the equivalent thing in this logic.  Cedric's
> suggestion of putting it before the switch would get the job done, but
> it would result in wasting cycles during unimportant transitions from
> one state machine state to another.

Exactly; thanks to your last email I read more of the code, reached the same
conclusion, and put the check in a more appropriate place: before
ExecScanHashBucket.

I was about to send it, so it is attached.

>
>                        regards, tom lane
>

--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support

Attachment Content-Type Size
fix_hardtokill_hashv2.patch text/x-patch 1.0 KB

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Cédric Villemain <cedric(dot)villemain(dot)debian(at)gmail(dot)com>, panam <panam(at)gmx(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PERFORM] Hash Anti Join performance degradation
Date: 2011-06-01 20:35:16
Message-ID: 24461.1306960516@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-performance

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Wed, Jun 1, 2011 at 4:25 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Because of the way that a bitmap heap scan works, the rows are
>> guaranteed to be loaded into the hash table in physical order, which
>> means (in the fast case) that the row with the largest "id" value gets
>> loaded last. And because ExecHashTableInsert pushes each row onto the
>> front of its hash chain, that row ends up at the front of the hash
>> chain. Net result: for all the outer rows that aren't the one with
>> maximum id, we get a joinqual match at the very first entry in the hash
>> chain. Since it's an antijoin, we then reject that outer row and go
>> on to the next. The join thus ends up costing us only O(N) time.

> Ah! Makes sense. If I'm reading your explanation right, this means
> that we could have hit a similar pathological case on a nestloop as
> well, just with a data ordering that is the reverse of what we have
> here?

Yeah. It's just chance that this particular data set, with this
particular ordering, happens to work well with a nestloop version
of the query. On average I'd expect nestloop to suck even more,
because of more per-inner-tuple overhead.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Cédric Villemain <cedric(dot)villemain(dot)debian(at)gmail(dot)com>, panam <panam(at)gmx(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PERFORM] Hash Anti Join performance degradation
Date: 2011-06-01 20:43:29
Message-ID: BANLkTi=6pGhGOUddirQmEqavJ3_T-y4xFA@mail.gmail.com
Lists: pgsql-hackers pgsql-performance

On Wed, Jun 1, 2011 at 4:35 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> On Wed, Jun 1, 2011 at 4:25 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> Because of the way that a bitmap heap scan works, the rows are
>>> guaranteed to be loaded into the hash table in physical order, which
>>> means (in the fast case) that the row with the largest "id" value gets
>>> loaded last.  And because ExecHashTableInsert pushes each row onto the
>>> front of its hash chain, that row ends up at the front of the hash
>>> chain.  Net result: for all the outer rows that aren't the one with
>>> maximum id, we get a joinqual match at the very first entry in the hash
>>> chain.  Since it's an antijoin, we then reject that outer row and go
>>> on to the next.  The join thus ends up costing us only O(N) time.
>
>> Ah!  Makes sense.  If I'm reading your explanation right, this means
>> that we could have hit a similar pathological case on a nestloop as
>> well, just with a data ordering that is the reverse of what we have
>> here?
>
> Yeah.  It's just chance that this particular data set, with this
> particular ordering, happens to work well with a nestloop version
> of the query.  On average I'd expect nestloop to suck even more,
> because of more per-inner-tuple overhead.

I guess the real issue here is that m1.id < m2.id has to be evaluated
as a filter condition rather than a join qual. That tends to perform
poorly in general, which is why rewriting this using min() or max() or
ORDER BY .. LIMIT 1 was elsewhere recommended. I've occasionally had
cause to join on something other than equality in cases not
susceptible to such rewriting, so it would be neat to improve this
case, but it's not likely to make it to the top of my list any time
soon.
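The rewrite referred to here replaces the self-anti-join with a plain aggregate over the box, e.g. SELECT max(id) FROM message WHERE box_id = $1, or the equivalent ORDER BY id DESC LIMIT 1 form recommended earlier in the thread. A toy comparison counter (a hypothetical model, not planner behavior) shows the gap: with m1.id < m2.id evaluated as a filter, candidate pairs can pile up quadratically, while a single-pass max needs N - 1 comparisons.

```python
def last_id_by_anti_join(ids):
    """m1 LEFT JOIN m2 ON (m1.id < m2.id) ... WHERE m2.id IS NULL:
    keep each id for which no larger id exists, counting comparisons."""
    comparisons = 0
    survivors = []
    for a in ids:
        matched = False
        for b in ids:
            comparisons += 1
            if a < b:            # join qual matches -> a is rejected
                matched = True
                break
        if not matched:
            survivors.append(a)  # null-extended: a is the last id
    return survivors, comparisons

def last_id_by_max(ids):
    """SELECT max(id): a single pass over the rows, N - 1 comparisons."""
    comparisons = 0
    best = ids[0]
    for x in ids[1:]:
        comparisons += 1
        if x > best:
            best = x
    return best, comparisons

ids = list(range(100))           # ascending ids, a bad case for the join
survivors, anti_cmp = last_id_by_anti_join(ids)
best, agg_cmp = last_id_by_max(ids)
```

For 100 ids the anti-join model performs on the order of N^2/2 comparisons against 99 for the aggregate; that asymptotic gap is what the min()/max() rewrite exploits.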

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Cédric Villemain <cedric(dot)villemain(dot)debian(at)gmail(dot)com>, panam <panam(at)gmx(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PERFORM] Hash Anti Join performance degradation
Date: 2011-06-01 20:47:36
Message-ID: 24707.1306961256@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-performance

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> I guess the real issue here is that m1.id < m2.id has to be evaluated
> as a filter condition rather than a join qual.

Well, if you can invent an optimized join technique that works for
inequalities, go for it ... but I think you should get at least a
PhD thesis out of that.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Cédric Villemain <cedric(dot)villemain(dot)debian(at)gmail(dot)com>, panam <panam(at)gmx(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PERFORM] Hash Anti Join performance degradation
Date: 2011-06-01 20:58:36
Message-ID: BANLkTimXuMGGv_vPQrQjR17K3QomP8NnQw@mail.gmail.com
Lists: pgsql-hackers pgsql-performance

On Wed, Jun 1, 2011 at 4:47 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> I guess the real issue here is that m1.id < m2.id has to be evaluated
>> as a filter condition rather than a join qual.
>
> Well, if you can invent an optimized join technique that works for
> inequalities, go for it ... but I think you should get at least a
> PhD thesis out of that.

Sounds good, except that so far NOT getting a PhD seems like a much
better financial prospect. :-)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Cédric Villemain <cedric(dot)villemain(dot)debian(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, panam <panam(at)gmx(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PERFORM] Hash Anti Join performance degradation
Date: 2011-06-01 21:02:51
Message-ID: 25112.1306962171@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-performance

Cédric Villemain <cedric(dot)villemain(dot)debian(at)gmail(dot)com> writes:
> Exactly; thanks to your last email I read more of the code, reached the same
> conclusion, and put the check in a more appropriate place: before
> ExecScanHashBucket.

> I was about to send it, so it is attached.

Applied with cosmetic adjustments.

regards, tom lane


From: "Ross J(dot) Reedstrom" <reedstrm(at)rice(dot)edu>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PERFORM] Hash Anti Join performance degradation
Date: 2011-06-01 21:53:15
Message-ID: 20110601215315.GC14215@rice.edu
Lists: pgsql-hackers pgsql-performance

On Wed, Jun 01, 2011 at 04:58:36PM -0400, Robert Haas wrote:
> On Wed, Jun 1, 2011 at 4:47 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> > Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> >> I guess the real issue here is that m1.id < m2.id has to be evaluated
> >> as a filter condition rather than a join qual.
> >
> > Well, if you can invent an optimized join technique that works for
> > inequalities, go for it ... but I think you should get at least a
> > PhD thesis out of that.
>
> Sounds good, except that so far NOT getting a PhD seems like a much
> better financial prospect. :-)

Yeah, the last time I heard of some Uni being so impressed by independent
work that they just sort of handed out a Ph.D., it involved a Swiss
patent clerk ...

Ross
--
Ross Reedstrom, Ph.D. reedstrm(at)rice(dot)edu
Systems Engineer & Admin, Research Scientist phone: 713-348-6166
Connexions http://cnx.org fax: 713-348-3665
Rice University MS-375, Houston, TX 77005
GPG Key fingerprint = F023 82C8 9B0E 2CC6 0D8E F888 D3AE 810E 88F0 BEDE


From: panam <panam(at)gmx(dot)net>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PERFORM] Hash Anti Join performance degradation
Date: 2011-06-01 22:49:31
Message-ID: 1306968571518-4446629.post@n5.nabble.com
Lists: pgsql-hackers pgsql-performance

I'd like to thank you all for getting this analyzed, especially Tom!
Your rigor is pretty impressive; it seems that without it, maintaining a
DBMS would be impossible.
In the end, I know a lot more about postgres internals now, and that this
idiosyncrasy (from a user perspective) could happen again. I guess this is
the first time I have actually encountered an unexpected worst-case scenario
like this...
It seems it is up to me now to be a bit more creative with query optimization.
And in the end, it will probably turn out to require an architectural change...
As the only thing to achieve is in fact to obtain the last id (currently
still with the constraint that it has to happen in an isolated subquery), I
wonder whether this requirement (obtaining the last id) is worth a special
technique/instrumentation/strategy (lacking a good word here), given the
fact that this data has a full logical ordering (in this case even total)
and the use case is, I guess, quite common.

Some ideas from an earlier post:

panam wrote:
>
> ...
> This also made me wonder how the internal plan is carried out. Is the
> engine able to leverage the fact that a part/range of the rows ["/index
> entries"] is totally or partially ordered on disk, e.g. using some kind of
> binary search or even "nearest neighbor"-search in that section (i.e. a
> special "micro-plan" or algorithm)? Or is the speed-up "just" because
> related data is usually "nearby" and most of the standard algorithms work
> best with clustered data?
> If the first is not the case, would that be a potential point for
> improvement? Maybe it would even be more efficient if there were some
> sort of constraint that guarantees "ordered row" sections on the disk,
> i.e. preventing the addition of a row that had an index value in between
> two row values of an already ordered/clustered section. In the simplest
> case, it would start with the "first" row and end with the "last" row (at
> the time of doing the equivalent of "cluster"). So there would be a small
> list saying that rows with id x through id y are guaranteed to be ordered
> on disk (by id, for example), now and for all times.
>

Maybe I am completely off the mark, but what's your conclusion? Too much
effort for small scenarios? Nothing that should be handled at the DB level?
An attempt to battle the laws of thermodynamics with a small technical dodge?

Thanks again
panam
