Re: join to view over custom aggregate seems like it should be faster

Lists: pgsql-performance
From: "Merlin Moncure" <mmoncure(at)gmail(dot)com>
To: "PostgreSQL Performance" <pgsql-performance(at)postgresql(dot)org>
Subject: join to view over custom aggregate seems like it should be faster
Date: 2007-04-09 19:05:17
Message-ID: b42b73150704091205o668a084y1693520c974f3c6b@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

I have an odd performance issue on 8.2 that I'd thought I'd document
here. I have a workaround, but I'm if there is something that I'm not
seeing.

ok, for starters:
I have a large table that is basically organized like this:
create table big
(
key1 int,
key2 int,
ts timestamp
[other fields]
);

and a view most_recent_big which lists for each combination of key1
and key2, the '[other fields]' that are behind the highest (most
recent) timestamp. The original view implementation involved a self
join which is the classic sql approach to pulling values from a
denormalized table (the real solution of course is to normalize the
data but I can't do that for various reasons). This wasn't very fast,
so I wrote a custom aggregate to optimize the view (there are usuallly
very small #s of records for key1, key2 pair:

create view latest_big_view as
select key1, key2, max_other_fields[other fields]
from big
group by key1, key2;

This worked very well, but sometimes the index on key1, key2 does not
get utilized when joining against latest_big_view. Let's say I have a
number of key1, key2 pairs in another table:

for example:
select * from foo, latest_big_view using (key1, key2);
breaks down.

here is a example of the 'breakdown' plan on real tables. selecting a
single record from the view is very fast...1ms or less. The join
can't 'see through' the view to filter the index.

dev20400=# explain analyze select * from foo join latest_download
using (host_id, software_binary_id);

QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
Hash Join (cost=15.35..4616.65 rows=25 width=90) (actual
time=229.623..10601.317 rows=494 loops=1)
Hash Cond: ((latest_download.host_id = foo.host_id) AND
(latest_download.software_binary_id = foo.software_binary_id))
-> GroupAggregate (cost=0.00..4499.01 rows=4535 width=94) (actual
time=0.346..10370.383 rows=37247 loops=1)
-> Index Scan using software_download_idx on
software_download (cost=0.00..2526.53 rows=45342 width=94) (actual
time=0.028..344.591
SubPlan
-> Result (cost=0.00..0.01 rows=1 width=0) (actual
time=0.006..0.011 rows=1 loops=37247)
-> Result (cost=0.00..0.01 rows=1 width=0) (actual
time=0.006..0.011 rows=1 loops=37247)
-> Result (cost=0.00..0.01 rows=1 width=0) (actual
time=0.005..0.011 rows=1 loops=37247)
-> Result (cost=0.00..0.01 rows=1 width=0) (actual
time=0.006..0.011 rows=1 loops=37247)
-> Result (cost=0.00..0.01 rows=1 width=0) (actual
time=0.006..0.011 rows=1 loops=37247)
-> Result (cost=0.00..0.01 rows=1 width=0) (actual
time=0.006..0.011 rows=1 loops=37247)
-> Result (cost=0.00..0.01 rows=1 width=0) (actual
time=0.005..0.011 rows=1 loops=37247)
-> Result (cost=0.00..0.01 rows=1 width=0) (actual
time=0.005..0.011 rows=1 loops=37247)
-> Result (cost=0.00..0.01 rows=1 width=0) (actual
time=0.006..0.011 rows=1 loops=37247)
-> Result (cost=0.00..0.01 rows=1 width=0) (actual
time=0.005..0.010 rows=1 loops=37247)
-> Hash (cost=7.94..7.94 rows=494 width=8) (actual
time=5.568..5.568 rows=494 loops=1)
-> Seq Scan on foo (cost=0.00..7.94 rows=494 width=8)
(actual time=0.018..2.686 rows=494 loops=1)
Total runtime: 10604.260 ms
(18 rows)

Here is the same query but on the root table, instead of the view:
dev20400=# explain analyze select * from foo join software_download
using (host_id, software_binary_id);
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=0.00..1521.60 rows=19 width=94) (actual
time=0.084..24.992 rows=607 loops=1)
-> Seq Scan on foo (cost=0.00..7.94 rows=494 width=8) (actual
time=0.044..2.753 rows=494 loops=1)
-> Index Scan using software_download_idx on software_download
(cost=0.00..3.05 rows=1 width=94) (actual time=0.011..0.019 rows=1
loops=49
Index Cond: ((foo.host_id = software_download.host_id) AND
(foo.software_binary_id = software_download.software_binary_id))
Total runtime: 28.385 ms
(5 rows)

I can use a trick with a function to make the view give out reasonalbe results:

create function foo(int, int) returns latest_download as
$$ select * from latest_download where software_binary_id = $1 and
host_id = $2; $$ language sql;

dev20400=# explain analyze select (v).* from (select
foo(software_binary_id, host_id) as v from foo) q;
QUERY PLAN
---------------------------------------------------------------------------------------------------------
Subquery Scan q (cost=0.00..14.12 rows=494 width=32) (actual
time=1.436..139.644 rows=494 loops=1)
-> Seq Scan on foo (cost=0.00..9.18 rows=494 width=8) (actual
time=1.414..131.144 rows=494 loops=1)
Total runtime: 142.887 ms
(3 rows)

Time: 144.306 ms

merlin


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Merlin Moncure" <mmoncure(at)gmail(dot)com>
Cc: "PostgreSQL Performance" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: join to view over custom aggregate seems like it should be faster
Date: 2007-04-09 21:07:46
Message-ID: 10651.1176152866@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

"Merlin Moncure" <mmoncure(at)gmail(dot)com> writes:
> I have an odd performance issue on 8.2 that I'd thought I'd document
> here. I have a workaround, but I'm if there is something that I'm not
> seeing.

It's hard to comment on this without seeing the full details of the view
and tables. I'm wondering where the SubPlans are coming from, for instance.

regards, tom lane


From: "Merlin Moncure" <mmoncure(at)gmail(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "PostgreSQL Performance" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: join to view over custom aggregate seems like it should be faster
Date: 2007-04-09 21:34:39
Message-ID: b42b73150704091434i596751f6p8914fdc96b16b290@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

On 4/9/07, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> "Merlin Moncure" <mmoncure(at)gmail(dot)com> writes:
> > I have an odd performance issue on 8.2 that I'd thought I'd document
> > here. I have a workaround, but I'm if there is something that I'm not
> > seeing.
>
> It's hard to comment on this without seeing the full details of the view
> and tables. I'm wondering where the SubPlans are coming from, for instance.

ok, this is really odd. I was in the process of busting all that out
for you when I noticed this:

here is the source sql for the view
create or replace view latest_download as
select software_binary_id, host_id,
((
select latest_software_download(
(bds_status_id,
mtime,
dl_window_open,
dl_window_close,
download_start,
download_stop,
info,
userupgradeable,
overrideflag,
percent_complete)::software_download_data)
)::software_download_data).*
from software_download group by host_id, software_binary_id;

here is what psql \d shows:

SELECT software_download.software_binary_id,
software_download.host_id, ((SELECT
latest_software_download(ROW(software_download.bds_status_id,
software_download.mtime, software_download.dl_window_open,
software_download.dl_window_close, software_download.download_start,
software_download.download_stop, software_download.info,
software_download.userupgradeable, software_download.overrideflag,
software_download.percent_complete)::software_download_data) AS
latest_software_download)).bds_status_id AS bds_status_id, ((SELECT l
[snip]

this is repeated several more times...I replace the view just to be safe.

for posterity:
create or replace function max_software_download(l
software_download_data, r software_download_data) returns
software_download_data as
$$
begin
if l.mtime > r.mtime then
return l;
end if;

return r;
end;
$$ language plpgsql;

CREATE TYPE software_download_data as
(
bds_status_id integer,
mtime timestamp with time zone,
dl_window_open time without time zone,
dl_window_close time without time zone,
download_start timestamp with time zone,
download_stop timestamp with time zone,
info text,
userupgradeable boolean,
overrideflag boolean,
percent_complete integer
);

CREATE AGGREGATE latest_software_download
(
BASETYPE=software_download_data,
SFUNC=max_software_download,
STYPE=software_download_data
);

merlin


From: "Merlin Moncure" <mmoncure(at)gmail(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "PostgreSQL Performance" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: join to view over custom aggregate seems like it should be faster
Date: 2007-04-09 21:36:47
Message-ID: b42b73150704091436g7fd1f903q1209218fc5eee88d@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

On 4/9/07, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> "Merlin Moncure" <mmoncure(at)gmail(dot)com> writes:
> > I have an odd performance issue on 8.2 that I'd thought I'd document
> > here. I have a workaround, but I'm if there is something that I'm not
> > seeing.
>
> It's hard to comment on this without seeing the full details of the view
> and tables. I'm wondering where the SubPlans are coming from, for instance.

ah, it looks like the aggregate is being re-expanded for each field
returned by the aggregate. I notice this for non-trivial record
returning functions also. standard m.o. is to push into a subquery
and expand afterwords.

merlin


From: "Merlin Moncure" <mmoncure(at)gmail(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "PostgreSQL Performance" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: join to view over custom aggregate seems like it should be faster
Date: 2007-04-10 12:59:23
Message-ID: b42b73150704100559n3ccb9d89p1767a0fa94324375@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

On 4/9/07, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
> On 4/9/07, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> > "Merlin Moncure" <mmoncure(at)gmail(dot)com> writes:
> > > I have an odd performance issue on 8.2 that I'd thought I'd document
> > > here. I have a workaround, but I'm if there is something that I'm not
> > > seeing.
> >
> > It's hard to comment on this without seeing the full details of the view
> > and tables. I'm wondering where the SubPlans are coming from, for instance.
>
> ah, it looks like the aggregate is being re-expanded for each field
> returned by the aggregate. I notice this for non-trivial record
> returning functions also. standard m.o. is to push into a subquery
> and expand afterwords.

[sorry for the deluge of info]

I cleaned up the view from:
create or replace view latest_download as
select software_binary_id, host_id,
((
select latest_software_download(
(bds_status_id,
mtime,
dl_window_open,
dl_window_close,
download_start,
download_stop,
info,
userupgradeable,
overrideflag,
percent_complete)::software_download_data)
)::software_download_data).*
from software_download group by host_id, software_binary_id;

to this:
create or replace view latest_download as
select software_binary_id, host_id, (v).* from
(
select
software_binary_id, host_id,
latest_software_download(
(bds_status_id,
mtime,
dl_window_open,
dl_window_close,
download_start,
download_stop,
info,
userupgradeable,
overrideflag,
percent_complete)::software_download_data) as v
from software_download group by host_id, software_binary_id
) q;

this cleaned up the odd subplans but is still slow:
dev20400=# explain analyze select * from foo join latest_download
using (host_id, software_binary_id);

QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
Hash Join (cost=1308.84..1467.81 rows=25 width=40) (actual
time=1472.668..1914.799 rows=494 loops=1)
Hash Cond: ((q.host_id = foo.host_id) AND (q.software_binary_id =
foo.software_binary_id))
-> HashAggregate (cost=1293.48..1350.17 rows=4535 width=94)
(actual time=1467.002..1700.388 rows=37247 loops=1)
-> Seq Scan on software_download (cost=0.00..953.42
rows=45342 width=94) (actual time=0.014..274.747 rows=45342 loops=1)
-> Hash (cost=7.94..7.94 rows=494 width=8) (actual
time=5.028..5.028 rows=494 loops=1)
-> Seq Scan on foo (cost=0.00..7.94 rows=494 width=8)
(actual time=0.022..2.507 rows=494 loops=1)
Total runtime: 1918.721 ms

compare it to this:
dev20400=# explain analyze select * from foo f where exists (select *
from latest_download where host_id = f.host_id and software_binary_id
= f.software_binary_id);

QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
Seq Scan on foo f (cost=0.00..3122.01 rows=247 width=8) (actual
time=0.152..45.941 rows=494 loops=1)
Filter: (subplan)
SubPlan
-> Subquery Scan q (cost=0.00..6.30 rows=1 width=40) (actual
time=0.081..0.081 rows=1 loops=494)
-> GroupAggregate (cost=0.00..6.29 rows=1 width=94)
(actual time=0.065..0.065 rows=1 loops=494)
-> Index Scan using software_download_idx on
software_download (cost=0.00..6.27 rows=1 width=94) (actual
time=0.013..0.021 r
Index Cond: ((host_id = $0) AND
(software_binary_id = $1))
Total runtime: 48.323 ms
(8 rows)

Time: 49.851 ms

I since I need both sides, I can't figure out a way to force the index
to be used during the join except to use a function to look up the
view based on the key, which works:
dev20400=# explain analyze select latest_download(host_id,
software_binary_id) from foo;
QUERY PLAN
--------------------------------------------------------------------------------------------------
Seq Scan on foo (cost=0.00..9.18 rows=494 width=8) (actual
time=0.566..51.605 rows=494 loops=1)
Total runtime: 54.290 ms
(2 rows)

dev20400=# explain analyze select * from latest_download where host_id
= 1 and software_binary_id = 12345;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
Subquery Scan q (cost=0.00..6.30 rows=1 width=40) (actual
time=0.046..0.046 rows=0 loops=1)
-> GroupAggregate (cost=0.00..6.29 rows=1 width=94) (actual
time=0.035..0.035 rows=0 loops=1)
-> Index Scan using software_download_idx on
software_download (cost=0.00..6.27 rows=1 width=94) (actual
time=0.024..0.024 rows=0 lo
Index Cond: ((host_id = 1) AND (software_binary_id = 12345))
Total runtime: 0.134 ms

For some reason, I can't get the index to be used on the table sitting
under a view during a join, even though it should be, or at least it
seems....

merlin


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Merlin Moncure" <mmoncure(at)gmail(dot)com>
Cc: "PostgreSQL Performance" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: join to view over custom aggregate seems like it should be faster
Date: 2007-04-10 17:03:06
Message-ID: 27416.1176224586@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

"Merlin Moncure" <mmoncure(at)gmail(dot)com> writes:
> For some reason, I can't get the index to be used on the table sitting
> under a view during a join, even though it should be, or at least it
> seems....

Nope, that's not going to work, because the aggregate keeps the subquery
from being flattened into the upper query, which is what would have to
happen for a nestloop-with-inner-indexscan join to be considered.
AFAICS you've got to structure it so that the aggregation happens above
the join.

regards, tom lane


From: "Merlin Moncure" <mmoncure(at)gmail(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "PostgreSQL Performance" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: join to view over custom aggregate seems like it should be faster
Date: 2007-04-10 17:53:02
Message-ID: b42b73150704101053x297e5cb3kfe81b8979dad3002@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

On 4/10/07, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> "Merlin Moncure" <mmoncure(at)gmail(dot)com> writes:
> > For some reason, I can't get the index to be used on the table sitting
> > under a view during a join, even though it should be, or at least it
> > seems....
>
> Nope, that's not going to work, because the aggregate keeps the subquery
> from being flattened into the upper query, which is what would have to
> happen for a nestloop-with-inner-indexscan join to be considered.
> AFAICS you've got to structure it so that the aggregation happens above
> the join.

right, i see that it's actually the 'group by' that does it:

select a, b from foo join (select a, b from bar group by a,b) q using (a,b);

is enough to keep it from using the index on a,b from bar. thats too bad...

merlin


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Merlin Moncure" <mmoncure(at)gmail(dot)com>
Cc: "PostgreSQL Performance" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: join to view over custom aggregate seems like it should be faster
Date: 2007-04-10 18:21:24
Message-ID: 28253.1176229284@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

"Merlin Moncure" <mmoncure(at)gmail(dot)com> writes:
> right, i see that it's actually the 'group by' that does it:
> select a, b from foo join (select a, b from bar group by a,b) q using (a,b);
> is enough to keep it from using the index on a,b from bar. thats too bad...

Some day it'd be nice to be able to reorder grouping/aggregation steps
relative to joins, the way we can now reorder outer joins. Don't hold
your breath though ... I think it'll take some pretty major surgery on
the planner.

regards, tom lane